Election Data Cleaning Project

Built an R pipeline (developed in RStudio) to clean, standardize, and merge granular PKW election records into a consistent county-level analytics dataset covering both rounds of Poland's 2020 presidential election.

2024

Overview

This project consolidated raw precinct/commission CSV exports from PKW (Państwowa Komisja Wyborcza, Poland's National Electoral Commission) into a harmonized county-level dataset, with consistent schemas, clean geographic keys, and reproducible R transformations for downstream analysis and mapping.

The workflow covered schema alignment, text normalization, deduplication, outlier handling, and multi-table joins, followed by turnout and candidate-share calculations for round one and round two, statistical testing, and geospatial visualization using sf and ggplot2.
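The aggregation and rate calculations described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the column names (`county`, `eligible_voters`, `ballots_cast`, `valid_votes`, `candidate_x`) and all values are hypothetical, and turnout is assumed here to mean ballots cast divided by eligible voters.

```r
library(dplyr)

# Hypothetical commission-level records; names and numbers are illustrative only.
commissions <- tibble(
  county          = c("a", "a", "b"),
  eligible_voters = c(1000, 500, 2000),
  ballots_cast    = c(600, 300, 1000),
  valid_votes     = c(590, 295, 980),
  candidate_x     = c(295, 118, 392)
)

# Sum raw counts per county first, then derive rates from the totals.
# Averaging commission-level rates instead would weight small commissions too heavily.
county_summary <- commissions %>%
  group_by(county) %>%
  summarise(
    eligible_voters = sum(eligible_voters),
    ballots_cast    = sum(ballots_cast),
    valid_votes     = sum(valid_votes),
    candidate_x     = sum(candidate_x),
    .groups = "drop"
  ) %>%
  mutate(
    turnout = ballots_cast / eligible_voters,  # assumed definition of turnout
    share_x = candidate_x / valid_votes        # candidate share of valid votes
  )

print(county_summary)
```

The same summary would be computed separately for round one and round two, or once with the round added as a grouping variable.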

Technology Stack

R & RStudio: Primary development environment
dplyr, tidyr, readr: Data manipulation and import
ggplot2, sf, scales: Visualization and geospatial data
knitr: Reproducible research and documentation

Key Features

Robust data pipeline: end-to-end from raw CSV to analysis-ready datasets
Schema harmonization and geographic key standardization
Multi-round analysis, statistical modeling, and geospatial visualization
Reproducible workflow with version-controlled scripts
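
One way schema harmonization across rounds might look is a per-round lookup vector that renames raw columns to canonical names before the tables are stacked. The raw column names, the lookup vectors, and the `harmonize_round()` helper below are all hypothetical and only illustrate the idea.

```r
library(dplyr)

# Hypothetical raw exports whose column names differ between rounds.
round1_raw <- tibble(Powiat = c("x", "y"), Uprawnieni = c(100, 200))
round2_raw <- tibble(powiat_nazwa = c("x", "y"), liczba_uprawnionych = c(101, 199))

# Lookup vectors in the form canonical_name = "raw_name" (assumed names).
map_round1 <- c(county = "Powiat", eligible_voters = "Uprawnieni")
map_round2 <- c(county = "powiat_nazwa", eligible_voters = "liczba_uprawnionych")

# Rename to the canonical schema and tag each row with its round.
# all_of() raises an error if an expected raw column is missing,
# which surfaces schema drift early instead of silently dropping data.
harmonize_round <- function(df, map, round_no) {
  df %>%
    rename(all_of(map)) %>%
    mutate(round = round_no)
}

both_rounds <- bind_rows(
  harmonize_round(round1_raw, map_round1, 1L),
  harmonize_round(round2_raw, map_round2, 2L)
)

print(both_rounds)
```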

Challenges & Solutions

Inconsistent Schemas: Mapping functions to align column structures across rounds
Geographic Key Variations: Normalization pipeline with lowercasing and canonical name resolution
Non-territorial Records: Filtered out "Zagranica" (abroad) and "Statki" (ships), which do not correspond to any county
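
The key normalization and non-territorial filtering above could be sketched as below. The alias table, the `normalize_key()` and `canonical_name()` helpers, and the sample rows are assumptions made for illustration; the real PKW labels and the project's canonical names may differ.

```r
library(dplyr)

# Lowercase, trim, and collapse repeated whitespace in a geographic label.
normalize_key <- function(x) {
  x <- tolower(trimws(x))
  gsub("\\s+", " ", x)
}

# Hypothetical alias table: normalized variant -> canonical name.
aliases <- c("m. st. warszawa" = "warszawa")

# Normalize, then swap any known variant for its canonical form.
canonical_name <- function(x, aliases) {
  key <- normalize_key(x)
  hit <- key %in% names(aliases)
  key[hit] <- unname(aliases[key[hit]])
  key
}

# Illustrative records, including the two non-territorial categories.
records <- tibble(
  county = c("  Warszawa", "M. st.  Warszawa", "Zagranica", "statki", "Gdynia"),
  votes  = c(10, 20, 5, 1, 7)
)

non_territorial <- c("zagranica", "statki")

cleaned <- records %>%
  mutate(county_key = canonical_name(county, aliases)) %>%
  filter(!county_key %in% non_territorial)

print(cleaned)
```

Filtering on the normalized key rather than the raw label means case or spacing variants of "Zagranica" and "Statki" are caught as well.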

Results & Outcomes

Harmonized county-level dataset ready for analysis across both rounds
Reproducible pipeline with clear documentation
Statistical insights and geospatial visualizations revealing regional patterns

Visualizations

Figure: Poland election results map

Related Projects

Bachelor Thesis — the research that used this cleaned election data to analyze third-force candidates' impact on voter turnout