If you’ve ever built a machine learning project, you already know the truth:
80% of ML work is data cleaning.
And 80% of that cleaning is… repetitive.
Handling missing values, encoding categoricals, scaling features, fixing data types — every new dataset, same boilerplate, different notebook.
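For context, here is a minimal sketch of what that boilerplate tends to look like with pandas and scikit-learn. The dataset and its column names (`age`, `city`, `income`, `target`) are made up for illustration, and this is not AutoCleanML's internal logic, just the kind of manual code that gets rewritten in every notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data with the usual problems: gaps, text categories, numbers stored as strings.
raw = pd.DataFrame({
    "age": ["25", "32", None, "41", "29", "37", "45", "23"],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai", "Delhi", "Pune", "Mumbai"],
    "income": [30000.0, 52000.0, 41000.0, None, 38000.0, 61000.0, 58000.0, 27000.0],
    "target": [0, 1, 0, 1, 0, 1, 1, 0],
})

# Fix data types: "age" arrived as text.
df = raw.copy()
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Split first so every statistic below is learned from the training rows only.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

numeric_cols = X_train.select_dtypes(include="number").columns
categorical_cols = X_train.columns.difference(numeric_cols)

# Handle missing values: median for numbers, most frequent value for categories.
medians = X_train[numeric_cols].median()
X_train[numeric_cols] = X_train[numeric_cols].fillna(medians)
X_test[numeric_cols] = X_test[numeric_cols].fillna(medians)
for col in categorical_cols:
    most_frequent = X_train[col].mode().iloc[0]
    X_train[col] = X_train[col].fillna(most_frequent)
    X_test[col] = X_test[col].fillna(most_frequent)

# Encode categoricals, then align the test columns to the training columns.
X_train = pd.get_dummies(X_train, columns=list(categorical_cols), dtype=float)
X_test = pd.get_dummies(X_test, columns=list(categorical_cols), dtype=float)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)

# Scale numeric features with statistics from the training set.
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

None of these steps is hard on its own, but each one has to be repeated, and kept consistent between the train and test sets, for every new dataset.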
After repeating this cycle one too many times, I decided to automate it.
That’s how AutoCleanML was born.
The Problem I Faced
As a student working on multiple ML projects and datasets, I kept running into the same problems:
- Writing the same preprocessing code again and again
- Inconsistent cleaning logic across projects
- Hard-to-maintain notebooks
- Beginners getting stuck before even training a model
I wanted something that:
- Works out of the box
- Follows best practices
- Is modular, reusable, and simple
✨ Introducing AutoCleanML
AutoCleanML is a Python library that helps you automatically clean and preprocess datasets for machine learning with minimal code.
It’s built for:
- Students
- ML beginners
- Data science interns
- Anyone tired of rewriting preprocessing logic
Using AutoCleanML
With AutoCleanML, you can go from a raw dataset to train-test splits in just a few lines.
```python
import pandas as pd
from autocleanml import AutoCleanML

# Load dataset
df = pd.read_csv("data.csv")

# Initialize cleaner
cleaner = AutoCleanML(target_column="target")

# Clean data and split automatically
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Check preprocessing summary
print(report)
```
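Once you have the four splits, the modeling step is ordinary scikit-learn. The sketch below assumes the splits are fully numeric and array-like, which is what a cleaned dataset should look like but is not spelled out in this post. To keep it runnable on its own, it builds stand-in splits from synthetic data instead of calling AutoCleanML, and the choice of `LogisticRegression` is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the cleaner's output (an assumption, so this sketch runs by itself).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training split and evaluate on the held-out split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```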
If data cleaning feels repetitive or slows down your ML projects, give AutoCleanML a try and see how much time it saves you.
🔗 GitHub: https://github.com/likith-n/AutoCleanML
📦 PyPI: https://pypi.org/project/AutoCleanML/
I’d genuinely love feedback from the community — whether it’s ideas, issues, or improvements.
If you find it useful, consider ⭐ starring the repo or sharing the post so others can benefit too.
Open source grows through people, not just code ❤️
Happy cleaning & happy modeling!