DEV Community

Sayantan Patra


I built 'dfxpy' to reduce repetitive Pandas + ML preprocessing workflows


Every data project starts with excitement.

Then comes:

  • missing values
  • duplicate rows
  • inconsistent column names
  • encoding
  • leakage checks
  • skew analysis
  • outlier handling
  • repetitive preprocessing pipelines

After rebuilding the same workflow across notebooks and projects, I decided to create something reusable.

So I built dfxpy — an open-source Python package focused on accelerating DataFrame workflows for machine learning, analytics, and research.


What dfxpy does

Automated Cleaning

  • smart type inference
  • missing value imputation
  • duplicate removal
  • snake_case normalization
  • currency/percentage/date detection
  • categorical encoding
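
To make a few of these steps concrete, here is a minimal plain-pandas sketch of column-name normalization, duplicate removal, and median imputation. It is not dfxpy's implementation: `to_snake_case`, `basic_clean`, and the sample columns are hypothetical names used only for illustration.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a label such as 'Unit Price' or 'orderID' to snake_case."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())   # non-alphanumerics -> underscore
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase boundaries
    return name.strip("_").lower()


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename columns, drop duplicate rows, impute numeric gaps."""
    out = df.rename(columns=to_snake_case)
    out = out.drop_duplicates().reset_index(drop=True)
    numeric_cols = out.select_dtypes(include="number").columns
    # Fill missing numeric values with each column's median
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


raw = pd.DataFrame(
    {"Unit Price": [10.0, 10.0, None, 30.0], "orderID": [1, 1, 2, 3]}
)
clean = basic_clean(raw)
print(clean)
```

Writing this by hand in every notebook is exactly the repetition a reusable cleaning step is meant to remove.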

ML Preparation

  • feature/target splitting
  • optional scaling
  • target encoding
  • date feature extraction
  • class balancing
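
For comparison, the manual version of feature/target splitting, date feature extraction, and scaling might look like the sketch below. It uses plain pandas only and assumes a single date column; `split_and_scale` and the column names are hypothetical, not part of dfxpy.

```python
import pandas as pd


def split_and_scale(df: pd.DataFrame, target: str, date_col: str):
    """Hypothetical helper: expand a date column, split X/y, z-score the features."""
    out = df.copy()
    dates = pd.to_datetime(out.pop(date_col))
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day_of_week"] = dates.dt.dayofweek  # Monday = 0

    y = out.pop(target)
    X = out.astype(float)
    std = X.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero for constant columns
    X = (X - X.mean()) / std
    return X, y


df = pd.DataFrame(
    {
        "order_date": ["2024-01-01", "2024-02-15", "2024-03-31"],
        "units": [1, 2, 3],
        "sales": [10.0, 20.0, 30.0],
    }
)
X, y = split_and_scale(df, target="sales", date_col="order_date")
print(X.columns.tolist())
```

In a real project the scaling statistics should be computed on the training split only, otherwise information from the test rows leaks into the features.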

Diagnostics & Research

  • leakage detection
  • skewness + multicollinearity audits
  • statistical profiling
  • dataset lineage hashing
  • publication-ready LaTeX exports
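
As a rough idea of what such audits involve, the sketch below flags skewed columns, lists highly correlated column pairs, and computes a content hash for a DataFrame using plain pandas and `hashlib`. The function names and thresholds are assumptions chosen for illustration; dfxpy's own checks may work differently.

```python
import hashlib

import pandas as pd


def skew_report(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return numeric columns whose absolute sample skewness exceeds the threshold."""
    skew = df.select_dtypes(include="number").skew()
    return skew[skew.abs() > threshold]


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]


def fingerprint(df: pd.DataFrame) -> str:
    """SHA-256 digest over per-row hashes of values and index labels (order-sensitive)."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 100.0],
        "x2": [2.0, 4.0, 6.0, 8.0, 200.0],   # exact multiple of x
        "noise": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)
print(skew_report(df))
print(correlated_pairs(df))
print(fingerprint(df))
```

Storing the fingerprint next to a model or report makes it possible to check later that the same data went in.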

Workflow Utilities

  • reusable transformation pipelines
  • dataframe comparison tools
  • schema validation
  • standalone HTML EDA reports
  • built-in CLI support
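
Schema validation is the easiest of these to picture. A minimal hand-rolled version, assuming the schema is just a mapping from column name to expected dtype string, could look like this. `validate_schema` is a hypothetical helper, not dfxpy's API.

```python
import pandas as pd


def validate_schema(df: pd.DataFrame, expected: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the frame conforms."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems


df = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "price": ["9.99", "12.50"],  # strings, not floats
    }
)
problems = validate_schema(
    df, {"id": "int64", "price": "float64", "sold_at": "datetime64[ns]"}
)
for problem in problems:
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a report show every mismatch at once.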

Example

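import pandas as pd
from dfxpy import auto, prepare

# Load the raw data (the file name is illustrative)
df = pd.read_csv("dataset.csv")

# Run the automated pass over the DataFrame
df = auto(df)

# Split features from the target, with scaling enabled
X, y = prepare(
    df,
    target="sales",
    scale=True
)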

CLI:

dfxpy analyze dataset.csv

One design goal I cared about

I specifically didn’t want this to feel like a thin wrapper around Pandas.

The focus became:

  • workflow automation
  • preprocessing acceleration
  • diagnostics
  • reproducibility
  • research-friendly tooling

rather than simply renaming Pandas functions.


Open Source

The project includes:

  • automated GitHub workflows
  • PyPI publishing
  • modular architecture
  • active development roadmap

I’d genuinely appreciate feedback from the Python/data community — especially around:

  • API design
  • architecture
  • performance
  • production-readiness

GitHub: https://github.com/sayantancodex/dfxpy
PyPI: https://pypi.org/project/dfxpy/
