Election Data Cleaning Project
Built a robust R pipeline in RStudio to clean, standardize, and merge commission-level PKW election records into a consistent county-level analytics dataset covering both rounds of Poland's 2020 presidential election.
Overview
This project consolidated raw precinct-level (voting commission) CSV exports from PKW (Państwowa Komisja Wyborcza, Poland's National Electoral Commission) into a harmonized county-level dataset, with consistent schemas, clean geographic keys, and reproducible R transformations for downstream analysis and mapping.
The workflow covered schema alignment, text normalization, deduplication, outlier handling, and multi-table joins, followed by turnout and candidate-share calculations for round one and round two, statistical testing, and geospatial visualization using sf and ggplot2.
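The aggregation and rate calculations described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the column names (eligible, ballots_cast, valid_votes, votes_cand_x), the sample values, and the definitions of turnout (ballots cast divided by eligible voters) and candidate share (candidate votes divided by valid votes) are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical commission-level records; names and values are illustrative only.
commissions <- tibble(
  round        = c(1, 1, 1, 1, 2, 2),
  county       = c("powiat a", "powiat a", "powiat b", "powiat b",
                   "powiat a", "powiat b"),
  eligible     = c(1000, 500, 800, 800, 1500, 1600),
  ballots_cast = c(600, 300, 400, 480, 1050, 960),
  valid_votes  = c(590, 295, 395, 470, 1040, 950),
  votes_cand_x = c(300, 100, 200, 150, 520, 380)
)

# Sum the raw counts per round and county first, then derive the rates
# from the summed counts (assumed definitions, see the note above).
county_summary <- commissions %>%
  group_by(round, county) %>%
  summarise(
    across(c(eligible, ballots_cast, valid_votes, votes_cand_x), sum),
    .groups = "drop"
  ) %>%
  mutate(
    turnout      = ballots_cast / eligible,
    share_cand_x = votes_cand_x / valid_votes
  )

print(county_summary)
```

Computing the rates after summing the counts, rather than averaging commission-level rates, weights each commission by its size, which is usually what a county-level figure is meant to express.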
Technology Stack
- R & RStudio: Primary development environment
- dplyr, tidyr, readr: Data manipulation and import
- ggplot2, sf, scales: Visualization and geospatial data
- knitr: Reproducible research and documentation
Key Features
- Robust data pipeline: end-to-end from raw CSV to analysis-ready datasets
- Schema harmonization and geographic key standardization
- Multi-round analysis, statistical modeling, and geospatial visualization
- Reproducible workflow with version-controlled scripts
Challenges & Solutions
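- Inconsistent Schemas: Wrote mapping functions that align column names and structures across the two rounds
- Geographic Key Variations: Built a normalization step that lowercases county names and resolves spelling variants to a single canonical name
- Non-territorial Records: Filtered out "Zagranica" (votes cast abroad) and "Statki" (votes cast on ships), which do not belong to any county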
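The three fixes above might look roughly like this in dplyr. It is a sketch under stated assumptions, not the project's code: the raw column names (Powiat, Uprawnieni, powiat_nazwa, liczba_uprawnionych), the helper harmonize(), and the sample rows are hypothetical, and canonical name resolution is reduced here to trimming whitespace and lowercasing.

```r
library(dplyr)

# Hypothetical raw exports: each round uses different column names.
round1_raw <- tibble(
  Powiat     = c(" Gdynia ", "zagranica", "Sopot"),
  Uprawnieni = c(1000, 200, 800)
)
round2_raw <- tibble(
  powiat_nazwa        = c("GDYNIA", "Statki", "sopot"),
  liczba_uprawnionych = c(1010, 5, 790)
)

# One lookup per round, written as canonical_name = "raw_name".
schema_round1 <- c(county = "Powiat", eligible = "Uprawnieni")
schema_round2 <- c(county = "powiat_nazwa", eligible = "liczba_uprawnionych")

# Hypothetical helper: rename to the canonical schema, normalize the
# geographic key, and drop records that belong to no county.
harmonize <- function(df, schema, round_no) {
  df %>%
    rename(all_of(schema)) %>%
    mutate(
      county = tolower(trimws(county)),
      round  = round_no
    ) %>%
    filter(!county %in% c("zagranica", "statki"))
}

all_rounds <- bind_rows(
  harmonize(round1_raw, schema_round1, 1),
  harmonize(round2_raw, schema_round2, 2)
)

print(all_rounds)
```

Keeping the per-round differences in small lookup vectors means a new export format only needs a new lookup, while the normalization and filtering logic stays in one place.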
Results & Outcomes
- Harmonized county-level dataset ready for analysis across both rounds
- Reproducible pipeline with clear documentation
- Statistical insights and geospatial visualizations revealing regional patterns
Visualizations

Related Projects
- Bachelor Thesis: the research that used this cleaned election data to analyze the impact of third-force candidates on voter turnout