
Every data project starts with excitement.
Then comes:
- missing values
- duplicate rows
- inconsistent column names
- encoding issues
- leakage checks
- skew analysis
- outlier handling
- repetitive preprocessing pipelines
After rebuilding the same workflow across notebooks and projects, I decided to create something reusable.
So I built dfxpy — an open-source Python package focused on accelerating DataFrame workflows for machine learning, analytics, and research.
What dfxpy does
Automated Cleaning
- smart type inference
- missing value imputation
- duplicate removal
- snake_case normalization
- currency/percentage/date detection
- categorical encoding
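To give a feel for the kind of steps an automated cleaner handles, here is a minimal pandas sketch of column normalization, duplicate removal, and median imputation. The helper name `normalize_columns` is mine for illustration, not dfxpy's actual API:

```python
import re
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert column names like 'Total Sales (USD)' to 'total_sales_usd'."""
    out = df.copy()
    out.columns = [
        re.sub(r"[^0-9a-z]+", "_", c.lower()).strip("_") for c in out.columns
    ]
    return out

df = pd.DataFrame({"Total Sales (USD)": [100, 100, None], "Region ": ["A", "A", "B"]})
df = normalize_columns(df).drop_duplicates()                      # drops the repeated row
df["total_sales_usd"] = df["total_sales_usd"].fillna(df["total_sales_usd"].median())
```

dfxpy bundles steps like these (plus type inference and encoding) behind a single call, so you don't rewrite them per notebook.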
ML Preparation
- feature/target splitting
- optional scaling
- target encoding
- date feature extraction
- class balancing
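For intuition, a feature/target split with optional scaling can be sketched in plain pandas like this. The function `split_and_scale` is a hypothetical stand-in, not dfxpy's internal implementation:

```python
import pandas as pd

def split_and_scale(df: pd.DataFrame, target: str, scale: bool = False):
    """Separate features from the target column; optionally z-score numeric features."""
    y = df[target]
    X = df.drop(columns=[target])
    if scale:
        num = X.select_dtypes("number").columns
        X[num] = (X[num] - X[num].mean()) / X[num].std(ddof=0)
    return X, y

df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "sales": [10, 20, 30]})
X, y = split_and_scale(df, target="sales", scale=True)
```

The `prepare()` call shown later wraps this pattern along with encoding, date extraction, and balancing.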
Diagnostics & Research
- leakage detection
- skewness + multicollinearity audits
- statistical profiling
- dataset lineage hashing
- publication-ready LaTeX exports
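One simple form of leakage detection is flagging features that correlate almost perfectly with the target. A naive pandas sketch (the helper `flag_leakage` is illustrative, not dfxpy's API; real checks are more involved):

```python
import pandas as pd

def flag_leakage(df: pd.DataFrame, target: str, threshold: float = 0.99):
    """Flag numeric features whose absolute correlation with the target
    exceeds the threshold -- a common symptom of target leakage."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

df = pd.DataFrame({
    "leaky": [1, 2, 3, 4],   # exact copy of the target: correlation 1.0
    "noise": [5, 1, 4, 2],
    "target": [1, 2, 3, 4],
})
suspects = flag_leakage(df, "target")  # ['leaky']
```

Catching a column like this before training saves you from a model that looks perfect in validation and useless in production.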
Workflow Utilities
- reusable transformation pipelines
- dataframe comparison tools
- schema validation
- standalone HTML EDA reports
- built-in CLI support
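Schema validation is one of those utilities that pays for itself quickly. A minimal sketch of the idea, assuming a simple `{column: dtype}` contract (the helper `validate_schema` is mine, not dfxpy's signature):

```python
import pandas as pd

def validate_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of problems: missing columns or dtype mismatches."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

df = pd.DataFrame({"sales": [1.5, 2.0], "region": ["A", "B"]})
problems = validate_schema(df, {"sales": "float64", "date": "datetime64[ns]"})
# ['missing column: date']
```

Running a check like this at pipeline boundaries turns silent schema drift into an explicit, loggable failure.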
Example
```python
from dfxpy import auto, prepare

df = auto(df)
X, y = prepare(
    df,
    target="sales",
    scale=True,
)
```

CLI:

```shell
dfxpy analyze dataset.csv
```
One design goal I cared about
I specifically didn’t want this to feel like a thin wrapper around Pandas.
The focus became:
- workflow automation
- preprocessing acceleration
- diagnostics
- reproducibility
- research-friendly tooling
rather than simply renaming Pandas functions.
Open Source
The project includes:
- automated GitHub workflows
- PyPI publishing
- modular architecture
- active development roadmap
I’d genuinely appreciate feedback from the Python/data community — especially around:
- API design
- architecture
- performance
- production-readiness
GitHub: https://github.com/sayantancodex/dfxpy
PyPI: https://pypi.org/project/dfxpy/