How I Built and Published My First Python Library as a Semester 4 Student
Every data project I start looks the same. Load the data, spend 30 minutes hunting for outliers, write the same NaN handling code I wrote last week, watch my notebook eat RAM. Then repeat it all for the next project.
I got tired of it. So I built a library.
This is the story of how I went from a frustrated CS student to publishing pandasclean on PyPI — and what I learned along the way.
The Idea
It started simple. I just wanted a function that could detect outliers and let me choose what to do with them. But once I had that, I thought — why not add NaN handling? And memory reduction? And a single function that runs everything?
Three weeks later I had a published library.
What pandasclean Does
pip install pandasclean
It has four core functions:
1. find_outliers() — IQR based outlier detection
from pandasclean import find_outliers
# Just show me the bounds
df, bounds = find_outliers(df, strategy='report')
# Drop outlier rows
df_clean, bounds = find_outliers(df, strategy='drop')
# Cap values instead of dropping (Winsorization)
df_clean, bounds = find_outliers(df, strategy='cap')
The IQR method computes bounds as:
lower = Q1 - (multiplier × IQR)
upper = Q3 + (multiplier × IQR)
Use multiplier=1.5 for mild outliers and multiplier=3.0 for extreme ones only.
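Those bounds are easy to reproduce in plain pandas. Here's a minimal sketch of the IQR method itself (illustrative only, not pandasclean's actual source — `iqr_bounds` is a hypothetical helper):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, multiplier: float = 1.5):
    """Return (lower, upper) IQR outlier bounds for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    return q1 - multiplier * iqr, q3 + multiplier * iqr

s = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(s)
print(lower, upper)  # anything outside these bounds is flagged as an outlier
```

With the default `multiplier=1.5`, the value 100 falls well above the upper bound, so it would be reported, dropped, or capped depending on the strategy.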
2. handle_nan() — Missing value handling
from pandasclean import handle_nan
# Fill with mean, median, or custom values
df_clean, report = handle_nan(df, strategy='mean')
df_clean, report = handle_nan(df, strategy='custom', fill_value={'age': 0, 'name': 'unknown'})
# Or drop rows/columns entirely
df_clean, report = handle_nan(df, strategy='drop', axis='rows')
3. reduce_memory() — Dtype downcasting
This one surprised me the most when I saw the results.
from pandasclean import reduce_memory
before = df.memory_usage(deep=True).sum() / (1024*1024)
df_optimized, report = reduce_memory(df)
after = df_optimized.memory_usage(deep=True).sum() / (1024*1024)
print(f"Before: {before:.2f} MB")
print(f"After: {after:.2f} MB")
print(f"Reduction: {(1 - after / before) * 100:.1f}%")
Before: 1527.07 MB
After: 371.93 MB
Reduction: 75.6%
On a 15 million row dataset. That number still makes me happy.
What's happening under the hood:
- int64 → smallest safe integer type (int8, int16, or int32)
- float64 → float32
- Low-cardinality string columns → category dtype
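A rough sketch of that downcasting logic in plain pandas (simplified, and not the library's actual code — the `downcast` helper and the 50% cardinality threshold are assumptions for illustration):

```python
import pandas as pd

def downcast(df: pd.DataFrame, cat_threshold: float = 0.5) -> pd.DataFrame:
    """Shrink dtypes: ints to the smallest safe type, float64 to float32,
    low-cardinality object columns to category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # pd.to_numeric picks the smallest integer type that fits the values
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            out[col] = s.astype("float32")
        elif s.dtype == object and s.nunique() / max(len(s), 1) < cat_threshold:
            out[col] = s.astype("category")
    return out

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    "c": ["x", "x", "x", "x", "x", "y"],
})
small = downcast(df)
print(small.dtypes)  # a: int8, b: float32, c: category
```

The category conversion is the big win on string-heavy data: each unique value is stored once, and the column itself becomes small integer codes.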
4. auto_clean() — One function to rule them all
from pandasclean import auto_clean
df_clean, report = auto_clean(df)
Runs NaN handling, memory reduction and outlier detection in the correct order with sensible defaults.
The Interesting Technical Bits
Building this taught me things I never would have learned from a tutorial.
The IQR = 0 edge case — what happens when a column has constant values? Q1 == Q3, so IQR = 0, and the bounds collapse. I had to add a guard for this.
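To see why the guard matters: with a constant column the bounds collapse to Q1 == Q3, so every non-identical value would be flagged. A minimal sketch of one possible guard (the library may handle it differently; `safe_iqr_bounds` is a hypothetical helper):

```python
import pandas as pd

def safe_iqr_bounds(s: pd.Series, multiplier: float = 1.5):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Constant column: bounds would collapse to a single point.
        # Skip outlier detection for this column instead.
        return None
    return q1 - multiplier * iqr, q3 + multiplier * iqr

print(safe_iqr_bounds(pd.Series([5, 5, 5, 5])))       # None — guarded
print(safe_iqr_bounds(pd.Series([1, 2, 3, 4, 100])))  # normal bounds
```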
pandas StringDtype compatibility — newer versions of pandas use pd.StringDtype() instead of plain object for string columns. My is_object_dtype() check was returning False for string columns and silently skipping them. Fixed by also checking is_string_dtype().
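The two checks really do disagree on the newer dtype — a quick demonstration (plain pandas, not library code):

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

obj_col = pd.Series(["a", "b"])                  # classic object dtype
str_col = pd.Series(["a", "b"], dtype="string")  # pd.StringDtype()

print(is_object_dtype(str_col))  # False — the bug: column silently skipped
print(is_string_dtype(str_col))  # True
print(is_string_dtype(obj_col))  # True — covers the old case too
```

So a combined check like `is_object_dtype(s) or is_string_dtype(s)` catches string columns under both dtype systems.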
Nullable integers — pandas has two integer systems. int64 (numpy) can't hold NaN. Int64 (pandas nullable, capital I) can. Trying to downcast Int64 with NaN to numpy int8 raises cannot convert NA to integer. The fix was to detect NaN integers and downcast to nullable Int8/Int16/Int32 instead.
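A minimal demonstration of the two integer systems (plain pandas, not library code — the exact exception type can vary across pandas versions):

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # pandas nullable integer, holds <NA>

try:
    s.astype("int8")                        # numpy int8 has no representation for NA
except (ValueError, TypeError) as err:
    print("numpy downcast failed:", err)

small = s.astype("Int8")                    # nullable Int8 keeps the <NA>
print(small.dtype)
print(small.isna().sum())
```

The capital-I nullable types cost slightly more memory than their numpy counterparts (an extra validity mask), but they're the only way to shrink an integer column without first deciding what to do with its missing values.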
Publishing to PyPI
This was simpler than I expected. You need three things:
1. pyproject.toml:
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "pandasclean"
version = "0.1.2"
description = "A lightweight library for cleaning and optimizing pandas DataFrames"
dependencies = ["pandas>=1.3.0", "numpy>=1.21.0"]
2. Build:
pip install build twine
python -m build
3. Upload:
twine upload dist/*
That's it. Your library is live.
Results
- 96 downloads on day one
- 75.6% memory reduction on a 15M row dataset
- Works across Python 3.8 to 3.12
What I Learned
More than I expected. Here's the short version:
- Write tests before you think you need them — they saved me multiple times
- Edge cases hide everywhere — IQR = 0, nullable integers, pandas version differences
- Shipping something imperfect is better than perfecting something unshipped
- A published library on your resume hits differently than a GitHub repo
What's Next
- Z-score outlier detection (v0.2.0)
- Column name standardisation
- Duplicate detection
- sklearn pipeline integration
Links
If you try it and something breaks — please open an issue. Feedback from real users is worth more than anything else at this stage.