
Every data project starts with excitement.
Then comes:
- missing values
- duplicate rows
- inconsistent column names
- encoding issues
- leakage checks
- skew analysis
- outlier handling
- repetitive preprocessing pipelines
After rebuilding the same workflow across notebooks and projects, I decided to create something reusable.
So I built dfxpy — an open-source Python package focused on accelerating DataFrame workflows for machine learning, analytics, and research.
What dfxpy does
Automated Cleaning
- smart type inference
- missing value imputation
- duplicate removal
- snake_case normalization
- currency/percentage/date detection
- categorical encoding
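To give a feel for the kind of steps an automated cleaner handles, here is a minimal pandas sketch of column normalization, duplicate removal, and median imputation. The helper name `normalize_columns` is mine for illustration, not dfxpy's actual API:

```python
import re
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert column names like 'Total Sales (USD)' to 'total_sales_usd'."""
    out = df.copy()
    out.columns = [
        re.sub(r"[^0-9a-z]+", "_", c.lower()).strip("_") for c in out.columns
    ]
    return out

df = pd.DataFrame({"Total Sales (USD)": [100, 100, None], "Region ": ["A", "A", "B"]})
df = normalize_columns(df).drop_duplicates()                      # drops the repeated row
df["total_sales_usd"] = df["total_sales_usd"].fillna(df["total_sales_usd"].median())
```

dfxpy bundles steps like these (plus type inference and encoding) behind a single call, so you don't rewrite them per notebook.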
ML Preparation
- feature/target splitting
- optional scaling
- target encoding
- date feature extraction
- class balancing
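For intuition, a feature/target split with optional scaling can be sketched in plain pandas like this. The function `split_and_scale` is a hypothetical stand-in, not dfxpy's internal implementation:

```python
import pandas as pd

def split_and_scale(df: pd.DataFrame, target: str, scale: bool = False):
    """Separate features from the target column; optionally z-score numeric features."""
    y = df[target]
    X = df.drop(columns=[target])
    if scale:
        num = X.select_dtypes("number").columns
        X[num] = (X[num] - X[num].mean()) / X[num].std(ddof=0)
    return X, y

df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "sales": [10, 20, 30]})
X, y = split_and_scale(df, target="sales", scale=True)
```

The `prepare()` call shown later wraps this pattern along with encoding, date extraction, and balancing.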
Diagnostics & Research
- leakage detection
- skewness + multicollinearity audits
- statistical profiling
- dataset lineage hashing
- publication-ready LaTeX exports
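One simple form of leakage detection is flagging features that correlate almost perfectly with the target. A naive pandas sketch (the helper `flag_leakage` is illustrative, not dfxpy's API; real checks are more involved):

```python
import pandas as pd

def flag_leakage(df: pd.DataFrame, target: str, threshold: float = 0.99):
    """Flag numeric features whose absolute correlation with the target
    exceeds the threshold -- a common symptom of target leakage."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

df = pd.DataFrame({
    "leaky": [1, 2, 3, 4],   # exact copy of the target: correlation 1.0
    "noise": [5, 1, 4, 2],
    "target": [1, 2, 3, 4],
})
suspects = flag_leakage(df, "target")  # ['leaky']
```

Catching a column like this before training saves you from a model that looks perfect in validation and useless in production.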
Workflow Utilities
- reusable transformation pipelines
- dataframe comparison tools
- schema validation
- standalone HTML EDA reports
- built-in CLI support
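Schema validation is one of those utilities that pays for itself quickly. A minimal sketch of the idea, assuming a simple `{column: dtype}` contract (the helper `validate_schema` is mine, not dfxpy's signature):

```python
import pandas as pd

def validate_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of problems: missing columns or dtype mismatches."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

df = pd.DataFrame({"sales": [1.5, 2.0], "region": ["A", "B"]})
problems = validate_schema(df, {"sales": "float64", "date": "datetime64[ns]"})
# ['missing column: date']
```

Running a check like this at pipeline boundaries turns silent schema drift into an explicit, loggable failure.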
Example
```python
from dfxpy import auto, prepare

df = auto(df)
X, y = prepare(
    df,
    target="sales",
    scale=True,
)
```

CLI:

```shell
dfxpy analyze dataset.csv
```
One design goal I cared about
I specifically didn’t want this to feel like a thin wrapper around Pandas.
The focus became:
- workflow automation
- preprocessing acceleration
- diagnostics
- reproducibility
- research-friendly tooling
rather than simply renaming Pandas functions.
Open Source
The project includes:
- automated GitHub workflows
- PyPI publishing
- modular architecture
- active development roadmap
I’d genuinely appreciate feedback from the Python/data community — especially around:
- API design
- architecture
- performance
- production-readiness
GitHub: https://github.com/sayantancodex/dfxpy
PyPI: https://pypi.org/project/dfxpy/