mwang-cmn
Outlier Detection in Election Data Using Geospatial Analysis - AKWA IBOM

Introduction

The aim of this project is to uncover potential election irregularities so that the electoral commission can ensure the transparency of election results. In this project, I identify outlier polling units whose voting results deviate significantly from those of neighbouring units.

Data Understanding

The dataset used in this analysis represents polling units in the state of Akwa Ibom only. The data used can be found here. I conducted this analysis in Python as follows:

from google.colab import drive, files
drive.mount('/content/drive')

# Import libraries
import pandas as pd
from geopy.geocoders import OpenCage

# Folder on the mounted Drive that holds the CSV (must be defined before read_csv)
path = '/content/drive/MyDrive/Colab Notebooks/Nigeria_Elections/'
data = pd.read_csv(path + "AKWA_IBOM_crosschecked.csv")


Here is a summary of the columns in the dataset:

  1. State: The name of the Nigerian state where the election took place (e.g., “AKWA IBOM”).
  2. LGA (Local Government Area): The specific local government area within the state (e.g., “ABAK”).
  3. Ward: The electoral ward within the local government area (e.g., “ABAK URBAN 1”).
  4. PU-Code (Polling Unit Code): A unique identifier for the polling unit (e.g., “03-05-11-009”).
  5. PU-Name (Polling Unit Name): The name or location of the polling unit (e.g., “VILLAGE SQUARE, IKOT AKWA EBOM” or “PRY SCH, IKOT OKU UBARA”).
  6. Accredited Voters: The number of voters accredited to participate in the election at that polling unit.
  7. Registered Voters: The total number of registered voters in that polling unit.
  8. Results Found: Indicates whether results were found for this polling unit (usually TRUE or FALSE).
  9. Transcription Count: The count of how many times the results were transcribed (may be -1 if not applicable).
  10. Result Sheet Stamped: Indicates whether the result sheet was stamped (TRUE or FALSE).
  11. Result Sheet Corrected: Indicates whether any corrections were made to the result sheet (TRUE or FALSE).
  12. Result Sheet Invalid: Indicates whether the result sheet was deemed invalid (TRUE or FALSE).
  13. Result Sheet Unclear: Indicates whether the result sheet was unclear (TRUE or FALSE).
  14. Result Sheet Unsigned: Indicates whether the result sheet was unsigned (TRUE or FALSE).
  15. APC: The number of votes received by the All Progressives Congress (APC) party.
  16. LP: The number of votes received by the Labour Party (LP).
  17. PDP: The number of votes received by the People’s Democratic Party (PDP).
  18. NNPP: The number of votes received by the New Nigeria People’s Party (NNPP).

I then created the Address column by concatenating the polling unit name, ward, local government area, and state. This column is the input for geocoding:

data['Address'] = data['PU-Name'] + ',' + data['Ward'] + ',' + data['LGA'] + ',' + data['State']

To obtain the Latitude and Longitude columns, I used geocoding.
I generated an API key for the OpenCage Geocoding API and defined a function, geocode_address, that geocodes the new Address column to produce the Latitude and Longitude columns:

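# Create the geocoder; replace the placeholder with your own OpenCage API key
geolocator = OpenCage(api_key='YOUR_OPENCAGE_API_KEY')

def geocode_address(Address):
  # Return (latitude, longitude), or (None, None) if the lookup fails
  try:
    location = geolocator.geocode(Address)
    return location.latitude, location.longitude
  except Exception:
    # Covers request errors and addresses with no match (location is None)
    return None, None

data[['Latitude', 'Longitude']] = data['Address'].apply(lambda x: pd.Series(geocode_address(x)))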

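Many polling units can share the same address string, and every geocoding call costs an API request. The sketch below is one possible way to geocode each distinct address only once. The helper name geocode_unique, the stand-in fake_geocode, and the sample coordinates are all assumptions made for illustration; in the notebook, geocode_address would be passed in instead.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[Optional[float], Optional[float]]


def geocode_unique(addresses: Iterable[str],
                   geocode_fn: Callable[[str], Coords]) -> Dict[str, Coords]:
    """Call geocode_fn once per distinct address and return address -> coords."""
    cache: Dict[str, Coords] = {}
    for address in addresses:
        if address not in cache:
            cache[address] = geocode_fn(address)
    return cache


# Hypothetical stand-in for geocode_address so the sketch runs offline.
# The coordinates are made up and do not come from the dataset.
calls = []

def fake_geocode(address: str) -> Coords:
    calls.append(address)
    return (5.0, 7.8) if "ABAK" in address else (None, None)


sample = [
    "PRY SCH,ABAK URBAN 1,ABAK,AKWA IBOM",
    "PRY SCH,ABAK URBAN 1,ABAK,AKWA IBOM",  # duplicate, served from the cache
    "UNKNOWN PLACE",
]
coords = geocode_unique(sample, fake_geocode)
print(len(calls), coords)
```

The resulting dictionary could then be mapped back onto the Address column, for example with data['Address'].map(coords).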

A quick look at our dataset:

Image description

The function works, and the Latitude and Longitude columns are now populated.
These two columns still contain some null values, so I imputed them using scikit-learn's SimpleImputer, which replaces each missing value with the column mean. Note that every imputed polling unit therefore ends up at the same mean coordinate.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
data[['Latitude', 'Longitude']] = imputer.fit_transform(data[['Latitude', 'Longitude']])
data.to_csv('AKWA_IBOM_geocode.csv', index = False)
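Because mean imputation gives every missing polling unit the same coordinate, it can help to record which rows were imputed before filling them. The sketch below is one possible refinement and not part of the original analysis. It uses pandas fillna with the column means as a stand-in that should behave like SimpleImputer(strategy='mean'). The function name impute_with_flag, the column name coords_imputed, and the toy values are assumptions.

```python
import pandas as pd


def impute_with_flag(df, cols=("Latitude", "Longitude")):
    """Fill missing coordinates with the column mean and flag the filled rows."""
    out = df.copy()
    cols = list(cols)
    # True for any row where at least one coordinate is missing
    out["coords_imputed"] = out[cols].isna().any(axis=1)
    # Replace each missing value with the mean of its own column
    out[cols] = out[cols].fillna(out[cols].mean())
    return out


# Toy data (made up): the second row has no geocoding result
toy = pd.DataFrame({
    "Latitude": [5.0, None, 5.2],
    "Longitude": [7.8, None, 8.0],
})
flagged = impute_with_flag(toy)
print(flagged)
```

Rows where coords_imputed is True could then be excluded from the neighbour search, or reviewed separately, since their location is only a placeholder.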

Identifying Neighbours

I defined a radius of 1 km: two polling units are considered neighbours if they are no more than 1 km apart.

# Calculate distances and find neighbours
from geopy.distance import geodesic

def neighbouring_pu(data, radius=1.0):
  # Map each row index to the indices of all other rows within `radius` km
  neighbours = {}
  for i, row in data.iterrows():
    neighbours[i] = []
    for j, row2 in data.iterrows():
      if i != j:
        distance = geodesic((row['Latitude'], row['Longitude']), (row2['Latitude'], row2['Longitude'])).km
        if distance <= radius:
          neighbours[i].append(j)
  return neighbours

neighbours = neighbouring_pu(data, radius=1.0)
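The nested loop above computes every pair of units twice. As a rough illustration of the same 1 km rule, the sketch below visits each pair once and records the relationship in both directions. It swaps geopy's geodesic for the haversine great-circle formula so that it runs with the standard library only; the two give slightly different distances, which may matter for pairs very close to the radius. The function names and sample coordinates are assumptions, not taken from the dataset.

```python
import math
from typing import Dict, List, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def find_neighbours(points: Sequence[Point], radius_km: float = 1.0) -> Dict[int, List[int]]:
    """Map each position to the positions of all points within radius_km."""
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        # Start at i + 1 so each pair is measured only once
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= radius_km:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


# Made-up coordinates: the first two are about 0.56 km apart, the third is far away
points = [(5.0000, 7.8000), (5.0050, 7.8000), (5.1000, 7.9000)]
result = find_neighbours(points)
print(result)
```

The keys here are list positions, which coincide with the DataFrame index only when the DataFrame uses the default 0 to n-1 index.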

Outlier Calculation - Score
I defined a function, get_outlier_scores, that calculates an outlier score for each party in each polling unit. It compares the votes a unit recorded for each party (APC, LP, PDP, NNPP) with the average votes recorded by its neighbouring units, which are looked up in the neighbours dictionary.
For each row and each party, the function takes the absolute difference between the row's votes and the mean votes of its neighbours and stores it as the outlier score. A row with no neighbours is compared against 0. Finally, the function returns a new DataFrame that combines the original data with the calculated outlier scores, which makes it possible to identify rows whose voting patterns differ significantly from those of their neighbours.

def get_outlier_scores(data, neighbours):
  outlier_scores = []
  parties = ['APC', 'LP', 'PDP', 'NNPP']
  for i, row in data.iterrows():
    scores = {}
    for party in parties:
      votes = row[party]
      # Mean votes of the neighbouring units; 0 if the unit has no neighbours
      neighbour_votes = data.loc[neighbours[i], party].mean() if neighbours[i] else 0
      scores[party + '_outlier_score'] = abs(votes - neighbour_votes)
    outlier_scores.append(scores)
  # Build the scores DataFrame once, after the loop, aligned to the original index
  outlier_scores_data = pd.DataFrame(outlier_scores, index=data.index)
  return pd.concat([data, outlier_scores_data], axis=1)

outlier_scores_df = get_outlier_scores(data, neighbours)
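To make the arithmetic concrete, here is a small worked example of the score for a single party. The vote counts are made up and the helper name outlier_score is an assumption. It mirrors the rule used above, including the fallback in which a unit without neighbours is compared against 0.

```python
from statistics import mean


def outlier_score(own_votes, neighbour_votes):
    """Absolute gap between a unit's votes and the mean of its neighbours' votes."""
    if not neighbour_votes:
        # No neighbours: the comparison value is 0, so the score is the vote count itself
        return abs(own_votes)
    return abs(own_votes - mean(neighbour_votes))


# A unit with 120 votes whose three neighbours recorded 40, 50 and 60 votes
with_neighbours = outlier_score(120, [40, 50, 60])   # |120 - 50| = 70

# An isolated unit with 50 votes
isolated = outlier_score(50, [])                      # |50 - 0| = 50

print(with_neighbours, isolated)
```

The second case shows a side effect of the fallback: an isolated unit's score is simply its vote count, so a high score there reflects a lack of neighbours and says little about the votes themselves.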

Sorting and Reporting
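I sorted the data by outlier score and compiled a detailed report giving the PU-Code, number of votes, and outlier score for each party. The tables below show the same five polling units for every party, listed in descending order of APC outlier score, so the LP, PDP, and NNPP tables are not ranked by their own scores.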

All Progressives Congress (APC)

PU-Code        APC   APC_outlier_score
03-05-11-009   324   228.52
03-29-05-013   194   167.334
03-30-07-001   180   153.325
03-05-09-014   194   152.149
03-28-05-003   180   138.132

Labour Party (LP)

PU-Code        LP    LP_outlier_score
03-05-11-009   59    45.451
03-29-05-013   42    6.65894
03-30-07-001   29    6.34942
03-05-09-014   3     26.5831
03-28-05-003   91    61.5261

People’s Democratic Party (PDP)

PU-Code        PDP   PDP_outlier_score
03-05-11-009   7     27.3627
03-29-05-013   181   145.232
03-30-07-001   17    18.8739
03-05-09-014   36    24.2221
03-28-05-003   12    48.2519

New Nigeria People’s Party (NNPP)

PU-Code        NNPP  NNPP_outlier_score
03-05-11-009   0     0.27451
03-29-05-013   6     4.14865
03-30-07-001   0     1.85521
03-05-09-014   0     2.36104
03-28-05-003   0     2.36104
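The tables above list the same five PU-Codes for every party. If a separate ranking per party is wanted, one option is pandas' nlargest, applied to each party's score column in turn. The sketch below assumes the column names produced by get_outlier_scores; the helper name top_outliers and the toy values are made up for illustration.

```python
import pandas as pd


def top_outliers(scores_df, party, n=5):
    """Return the n rows with the highest outlier score for one party."""
    score_col = f"{party}_outlier_score"
    return scores_df.nlargest(n, score_col)[["PU-Code", party, score_col]]


# Toy data (made up), shaped like outlier_scores_df
toy = pd.DataFrame({
    "PU-Code": ["A", "B", "C"],
    "APC": [10, 200, 30],
    "APC_outlier_score": [5.0, 150.0, 12.5],
    "LP": [80, 5, 40],
    "LP_outlier_score": [60.0, 2.0, 20.0],
})

apc_top = top_outliers(toy, "APC", n=2)
lp_top = top_outliers(toy, "LP", n=2)
print(apc_top)
print(lp_top)
```

With the real data, calling top_outliers(outlier_scores_df, party) inside a loop over the four parties would give each party its own top five.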

Visualize the neighbours

I generated scatterplots to visualize the geographical distribution of polling units by outlier score for each of the four parties (APC, LP, PDP, NNPP).
Each point represents a polling unit, positioned by its latitude and longitude and coloured by its outlier score.
Each plot shows how the outlier scores are distributed geographically, which makes clusters of high scores and isolated anomalies easier to spot.

import matplotlib.pyplot as plt
import seaborn as sns

parties = ['APC', 'LP', 'PDP', 'NNPP']
for party in parties:
  plt.figure(figsize=(10, 6))
  sns.scatterplot(data=outlier_scores_df, x='Latitude', y='Longitude', hue=party + '_outlier_score', palette='viridis')
  plt.title(f'Polling Units by {party} Outlier Score')
  plt.xlabel('Latitude')
  plt.ylabel('Longitude')
  plt.legend(title=party + ' Outlier Score')
  plt.savefig(f'polling_units_{party}_outlier_score.png')
  plt.show()

[Figures: scatterplots of polling units coloured by outlier score, one each for APC, LP, PDP, and NNPP]

Deliverables

  1. Find the full Notebook here
  2. Full Report - Top five outliers for each party.
  3. File with Latitude and Longitude - CSV
  4. File with sorted polling units by outlier scores - CSV