ML Model Scored 86%. The Dataset It Learned From Was Biased. GitHub Copilot Helped Me See It.

#devchallenge #githubchallenge #machinelearning #githubcopilot

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

In 2018 I trained a charity donor classifier for a Udacity machine learning nanodegree project. The task was to predict whether someone earns more than $50k per year as a proxy for donation likelihood for a fictional charity called CharityML. Gradient Boosting won the model comparison at 86.78% accuracy and an F-score of 0.7469. I submitted it, got my grade, and filed the notebook away.

Coming back in 2026, I did not just fix the code. I audited what the model actually learned. The answer was uncomfortable.

Demo

Live demo: https://sanskriti1991.github.io/machineLearningProjects/finding_donors/

GitHub repo: https://github.com/sanskriti1991/machineLearningProjects/tree/master/finding_donors

The demo lets you input census features and see how the model predicts donation likelihood. It shows a fairness warning for demographic groups with known prediction disparities and charts the prediction rates, false positive rates, and false negative rates across all demographic groups.

The Comeback Story

Before: a notebook that could not run on any modern setup

Opening the notebook in VS Code on Python 3.13 with current sklearn revealed three immediate problems.

The sklearn imports had not kept up with eight years of library changes:

# 2018: no longer works (both modules were removed in scikit-learn 0.20)
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
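
For comparison, here is a minimal sketch of the modern equivalents. Both names have lived in sklearn.model_selection since scikit-learn 0.18. The toy data is made up purely to show the import working and is not the census data from the notebook.

```python
# Current home of both helpers (scikit-learn 0.18 and later).
from sklearn.model_selection import GridSearchCV, train_test_split

# Made-up toy data, only to confirm the import works.
features = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Hold out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```

GridSearchCV is imported only to show where it now lives. Its usage did not change with the move.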

The print statements used Python 2 syntax throughout. The visualization helper file, visuals.py, also had an integer division bug: in Python 3, j/3 returns a float, and a float cannot be used as an array index. The fix was a single character, changing j/3 to j//3. The notebook had silently needed it for eight years.
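
The division bug is easy to reproduce in isolation. This is a sketch, not the actual code from visuals.py: the 2x3 grid and both function names are hypothetical stand-ins for a grid of subplot axes indexed by a loop counter j.

```python
# A 2x3 grid standing in for a subplot axes array (labels are illustrative).
grid = [["ax00", "ax01", "ax02"],
        ["ax10", "ax11", "ax12"]]

def cell_py2_style(j):
    # In Python 3, j / 3 is true division, so the row index is a float.
    return grid[j / 3][j % 3]

def cell_fixed(j):
    # j // 3 is floor division and yields an int for an int j.
    return grid[j // 3][j % 3]

try:
    cell_py2_style(4)
    float_index_failed = False
except TypeError:
    # Lists reject float indices; NumPy arrays raise IndexError instead.
    float_index_failed = True

print(float_index_failed)
print(cell_fixed(4))
```

In Python 2, j / 3 on two integers already floored the result, which is why the original code worked in 2018.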

After: a running notebook with a fairness audit

Once the code ran, I started looking at the dataset more carefully. The UCI Adult Income dataset, extracted from the 1994 US Census, had appeared in hundreds of published research papers by 2021, spanning AI fairness, privacy preservation, and model debugging. UC Berkeley researchers published "Retiring Adult" at NeurIPS 2021, calling for the dataset to be retired.

Their finding: the $50k income threshold used as the positive class label was the 76th income percentile overall in 1994, but the 88th percentile for Black Americans and the 89th percentile for women.
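
To see what those percentiles mean for the label itself, here is a rough back-of-the-envelope sketch. It assumes that a threshold at the p-th percentile of a group leaves about (100 - p)% of that group above it. The variable names are mine, and the numbers are only the three percentiles quoted above.

```python
# Percentile of the $50k threshold within each population, as quoted above.
threshold_percentile = {"overall": 76, "Black Americans": 88, "women": 89}

# Approximate share of each population earning above the threshold, in percent.
positive_share = {group: 100 - pct for group, pct in threshold_percentile.items()}

for group, share in positive_share.items():
    ratio = share / positive_share["overall"]
    print(f"{group}: about {share}% above $50k ({ratio:.2f}x the overall share)")
```

Under that assumption, the positive label is roughly half as common for Black Americans and for women as it is overall, before any model is trained.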

The model did not learn who donates. It learned who 1994 America paid well.

The fairness audit made that concrete:

Share of each group predicted as likely donors:

Asian-Pac-Islander males: 32%
White males: 26%
Black females: 4%
American Indian females: nearly 0%

86.78% overall accuracy. Completely silent on all of the above.
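
A single accuracy number can hide group-level gaps like these. The sketch below shows one way to compute a prediction rate, false positive rate, and false negative rate per group in plain Python. It is not the audit code from the notebook: the function name, the group labels "A" and "B", and the toy labels are all made up for illustration.

```python
from collections import defaultdict

def rates_by_group(groups, y_true, y_pred):
    """Return prediction rate, FPR and FNR for each group label."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "neg": 0, "fp": 0, "fn": 0}
    )
    for group, truth, pred in zip(groups, y_true, y_pred):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["fn"] += int(pred == 0)  # actual positive, predicted negative
        else:
            c["neg"] += 1
            c["fp"] += int(pred == 1)  # actual negative, predicted positive
    return {
        group: {
            "prediction_rate": c["pred_pos"] / c["n"],
            # None when a group has no actual negatives or positives.
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }

# Made-up toy data: two groups with identical true labels.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
audit = rates_by_group(groups, y_true, y_pred)
print(overall_accuracy)
for group, rates in audit.items():
    print(group, rates)
```

In this toy case the overall accuracy says nothing about the fact that group B is never predicted positive, which only shows up once the rates are split by group.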

My Experience with GitHub Copilot

Working with Copilot on this project was not a smooth straight line. It was honestly more like a collaboration that required patience on both sides.

The rate limit reality

I am on the free Copilot tier. Partway through the session, after several back-and-forth prompts fixing the deprecated imports and print statements, Copilot hit its rate limit and went quiet. I had to wait for it to reset before continuing. That could have been the moment I gave up. It wasn't. I kept the notebook open, documented what had been fixed so far, and came back when the quota reset.

Learning to prompt better

My first prompts were too broad. Asking Copilot to fix the entire notebook at once produced suggestions it could not apply directly to notebook cells in the browser environment. I had to adjust, breaking the task into smaller pieces and being more specific about what I needed. That back and forth was frustrating at first, but it forced me to understand the changes rather than accept them blindly.

Where Copilot genuinely delivered

Once I found the right prompt style, the three moments that mattered most were clear.

First, identifying the deprecated sklearn imports and explaining exactly why each module had moved. Old line, new line, reason. Clear and immediately useful.

Second, catching the integer division bug in visuals.py where j/3 silently breaks in Python 3. I would have spent a long time hunting that one down without Copilot pointing at the exact line.

Third, generating the full fairness audit from a single inline comment. That was the most impressive moment. One descriptive comment and Copilot produced working code that reconstructed demographic groups from one-hot encoded columns, calculated prediction rates and error rates by group, and saved the charts. It then summarized the findings in plain English:

"The model appears to have learned patterns reflecting 1994 wage inequality rather than actual donation likelihood. This suggests that systemic biases in income distribution at the time are influencing the model's predictions."

That is the sentence I should have written in 2018. Now I have.
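
For readers who have not done the reconstruction step before, here is a minimal sketch of turning one-hot columns back into group labels. It is not the code Copilot generated. The helper name and the column names (race_White, sex_Female, and so on) are hypothetical, and the real notebook's columns may be named differently.

```python
def group_from_one_hot(row, prefix):
    """Return the category whose one-hot column is set in this row."""
    active = [
        column[len(prefix):].strip()
        for column, value in row.items()
        if column.startswith(prefix) and value == 1
    ]
    # Exactly one column per prefix should be set; anything else is a data problem.
    if len(active) != 1:
        raise ValueError(f"expected one active {prefix!r} column, found {len(active)}")
    return active[0]

# Two made-up encoded rows with hypothetical column names.
rows = [
    {"race_White": 1, "race_Black": 0, "sex_Male": 1, "sex_Female": 0},
    {"race_White": 0, "race_Black": 1, "sex_Male": 0, "sex_Female": 1},
]

group_labels = [
    f"{group_from_one_hot(row, 'race_')} {group_from_one_hot(row, 'sex_')}"
    for row in rows
]
print(group_labels)
```

Once every row has a label like this, prediction rates and error rates can be grouped by it.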

Why researchers called for this dataset to be retired (UC Berkeley, NeurIPS 2021): https://arxiv.org/abs/2108.04884