Shivanshu Pandey

I built a Python preprocessing tool and a reviewer called it "low-trust" — here's what they got right

I shipped my first PyPI package this week — a data preprocessing tool — and within 48 hours a reviewer called it "low-trust until proven otherwise." They weren't wrong.
This post is about what I built, why I built it, what I learned, and what the reviewer was right and wrong about.
What it is
PrePro Auto is an open-source data preprocessing workbench that runs locally. You load a dataset, it analyses every column, and you get decision cards — one per issue found. You approve, override, or skip each one. At the end you have a clean dataset plus a runnable Python pipeline that reproduces every step.

pip install prepro-auto

import prepro_auto

session = prepro_auto.launch_file(r"C:\data\bengaluru_house_prices.csv")
# Click the printed URL → clean visually in the workbench → then:
cleaned = session.current()

The original DataFrame is never touched. session.current() always returns the latest cleaned version.
The part I'm most proud of: the decision-card system
Most automated preprocessing tools work like a black box. PrePro Auto does the opposite.
For every issue found, the engine generates a card that shows what it found, why it matters, the recommended action with a confidence score, the reasoning in plain English, and two or three alternatives you can pick instead — plus Approve / Override / Skip / Drop buttons.
This is the "human-in-the-loop" part. The engine handles statistical analysis; you handle domain knowledge. A column with 81 distinct values gets recommended for frequency encoding — but if you know it's an ordinal category, you override. The model never knew that. You did.
The technical problem I underestimated: sandboxing user expressions
Step 5 of the workbench lets you write arbitrary pandas: df['profit'] = df['revenue'] - df['cost']. That means executing user-supplied Python in a server process.
The naive version is eval(expression). That's also a remote code execution vulnerability.
Getting the sandbox right took longer than any other single feature. The final implementation blocks:

  • All import statements
  • Dunder attribute access (__class__, __globals__, __builtins__)
  • OS, network, and filesystem access
  • Dangerous pandas methods (to_sql, arbitrary to_csv paths, pipe with callables)
  • Subprocesses and lambdas

I tested it against 15 attack strings before I was satisfied. Sandboxing is one of those things that never shows up in a demo but matters a lot if the tool ever runs multi-user. The lesson: a feature that's "execute arbitrary user code" needs to be sandboxed from day one or it needs to not exist.
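A minimal sketch of the static-check half of such a sandbox, using Python's ast module to reject the categories listed above before anything is executed. The function name check_expression and the two blocklists are hypothetical, chosen for illustration; this is not PrePro Auto's actual implementation. A static check like this is only one layer: the accepted code still has to run with a restricted set of builtins and globals.

```python
import ast

# Hypothetical blocklists for illustration; a real sandbox would need a longer list.
BLOCKED_NAMES = {
    "eval", "exec", "open", "compile", "getattr", "setattr", "delattr",
    "globals", "locals", "vars", "os", "sys", "subprocess",
}
# Stricter than "arbitrary to_csv paths": this sketch rejects to_csv outright.
BLOCKED_METHODS = {"to_sql", "to_csv", "to_pickle", "pipe"}


def check_expression(source: str) -> list[str]:
    """Return the reasons the source is rejected; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statement")
        elif isinstance(node, ast.Lambda):
            problems.append("lambda")
        elif isinstance(node, ast.Attribute):
            if node.attr.startswith("__"):
                problems.append(f"dunder attribute: {node.attr}")
            elif node.attr in BLOCKED_METHODS:
                problems.append(f"blocked method: {node.attr}")
        elif isinstance(node, ast.Name):
            if node.id in BLOCKED_NAMES or node.id.startswith("__"):
                problems.append(f"blocked name: {node.id}")
    return problems
```

Working on the parsed tree rather than on the raw string means whitespace tricks and odd formatting do not slip past the check, which is the main reason to prefer this over substring matching.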
What the cleaning actually does — a real example
The Bengaluru House Prices dataset (Kaggle, 13,320 rows × 9 columns) has all the usual problems: missing values, outliers, categorical columns of varying cardinality, free-text fields, and a total_sqft column stored as strings like "1133 - 1384".
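Range-style strings like that have to be turned into numbers before any numeric stage can run. A minimal sketch of one way to do it, assuming a range should collapse to its midpoint. That midpoint rule and the helper name parse_sqft are assumptions for illustration, not necessarily what PrePro Auto does.

```python
import re
from typing import Optional

# Matches "low - high" where both sides are plain non-negative numbers.
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_sqft(raw: str) -> Optional[float]:
    """Convert '1056' or '1133 - 1384' to a float; return None for anything else."""
    match = _RANGE.match(raw)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2  # assumption: represent a range by its midpoint
    try:
        return float(raw)
    except ValueError:
        return None  # unparseable values are left for a human to decide


print(parse_sqft("1133 - 1384"))  # 1258.5
```

Returning None instead of guessing keeps the unparseable rows visible, so they can surface as their own decision rather than being silently coerced.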

Here's what happens stage by stage:

  • Profile detects total_sqft as text, flags the missing rate on society, and finds that the bath and balcony columns are right-skewed
  • Missing values stage detects MCAR mechanism on bath, recommends median imputation
  • Outliers stage uses IQR + Isolation Forest, flags properties with implausible total_sqft as likely data errors
  • Scaling uses Shapiro-Wilk to detect non-normality, recommends Yeo-Johnson for price and total_sqft
  • Encoding routes by cardinality: 4 distinct area_type values → one-hot encoding (88% confidence), 81 availability values → frequency encoding (too many for one-hot)
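
The cardinality routing in the encoding stage can be sketched in a few lines of plain Python. The cutoff of 10 distinct values is a hypothetical number picked for this example (the post only says that 4 values went to one-hot and 81 went to frequency encoding), and the function names are illustrative.

```python
from collections import Counter

ONE_HOT_MAX_CARDINALITY = 10  # hypothetical cutoff, chosen only for this example


def choose_encoding(values: list[str]) -> str:
    """Pick an encoding strategy from the number of distinct values."""
    distinct = len(set(values))
    return "one-hot" if distinct <= ONE_HOT_MAX_CARDINALITY else "frequency"


def frequency_encode(values: list[str]) -> list[float]:
    """Replace each category with the share of rows that carry it."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


print(choose_encoding(["a", "b", "c", "d"] * 3))   # one-hot
print(frequency_encode(["x", "x", "y", "z"]))      # [0.5, 0.5, 0.25, 0.25]
```

Frequency encoding keeps a high-cardinality column as a single numeric feature, whereas one-hot encoding the same column would add one feature per distinct value.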

By the end of the guided run, the dataset's quality score improved meaningfully and several low-signal columns got dropped after correlation analysis flagged them.
The export gives me three files: the cleaned CSV, an audit PDF listing every decision, and pipeline.py — a runnable script that re-applies the exact same cleaning on any new data with no PrePro Auto dependency.
What I learned building this
Stats beats ML for most preprocessing decisions. The right scaler for a Gaussian column is StandardScaler — you don't need a model to learn that. The right imputation for a column whose missingness correlates with another column is MICE — that's a statistical relationship, not a learned one. I used ML (Isolation Forest for multivariate outliers, KNN/MICE for imputation) only where the data itself has to inform the decision. For everything else, statistical tests give better, more interpretable, and faster answers. I went into this expecting to use more ML and came out using less.
The notebook-to-UI bridge is harder than it looks. prepro_auto.launch(df) opens a browser tab against the live in-memory DataFrame while keeping the Jupyter kernel alive. The naive version blocks the kernel because uvicorn wants the event loop. The working version uses nest-asyncio to nest the FastAPI loop inside the kernel's loop, runs uvicorn in a daemon thread, and shares state through a session registry. None of this is exotic, but every piece had to be right or the kernel deadlocks. Two days of fighting asyncio for what ended up as 30 lines of code.
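The thread-plus-registry half of that design can be sketched with the standard library alone. This example swaps in http.server for uvicorn and FastAPI and leaves out nest-asyncio entirely, so that it runs without extra dependencies. SESSIONS, SessionHandler, and launch_background are hypothetical names, not PrePro Auto's code.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical registry: session id -> state the browser UI is allowed to read.
SESSIONS: dict[str, dict] = {}
SESSIONS_LOCK = threading.Lock()


class SessionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        session_id = self.path.strip("/")
        with SESSIONS_LOCK:
            state = SESSIONS.get(session_id)
        found = state is not None
        body = json.dumps(state if found else {"error": "unknown session"}).encode()
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep notebook output quiet


def launch_background(session_id: str, state: dict):
    """Register the state, start the server in a daemon thread, return at once."""
    with SESSIONS_LOCK:
        SESSIONS[session_id] = state
    server = ThreadingHTTPServer(("127.0.0.1", 0), SessionHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/{session_id}"


# The caller (the notebook cell) gets control back immediately and can keep working.
server, url = launch_background("demo", {"rows": 13320, "columns": 9})
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # skip any HTTP proxy
with opener.open(url) as response:
    payload = json.load(response)
server.shutdown()
server.server_close()
print(payload)
```

Because the thread is a daemon, it does not keep the interpreter alive on its own; the registry lock is there because the server thread and the caller's thread both touch the same dictionary.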
The first 60 seconds matter more than any feature. My most common new-user error: typing prepro_auto in a Jupyter cell and getting a module object printed instead of a server starting. It's correct Python behaviour. It's also a terrible first impression. The second most common error was UnicodeDecodeError when users did pd.read_csv("data.csv") on a Latin-1 encoded file. Both are technically the user's fault. Both lose users anyway. I added launch_file() so users never need pd.read_csv at all, and a note in the README explicitly saying "don't type the package name as a command in a notebook." If you build any developer tool, assume your users will hit your worst onboarding path first.
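
One way a file loader can sidestep the UnicodeDecodeError is to try a short list of encodings in order. This is a minimal sketch under that assumption; the post does not say how launch_file() handles encodings, and read_text_with_fallback is a hypothetical helper.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

Latin-1 maps every possible byte to a character, so it never raises and belongs last in the list. The encoding that succeeded can then be passed on, for example as pd.read_csv(path, encoding=encoding).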

What the reviewer was right and wrong about
The reviewer who called my package "low-trust" was right about: limited PyPI page, no track record, no CI badges visible from the install page, no screenshots in the README at the time, no comparison to alternatives. All fair criticism for a beta package days after release.
They were wrong about: assuming the lack of issue history meant no testing. The repo has 202 tests, but a stranger looking at PyPI couldn't see them. They also misread some package-name search results.
The honest lesson: trust signals matter as much as the code. A tool with 50% of the features but a green CI badge, real screenshots, a CHANGELOG, and three blog posts written about it will get more adoption than a tool with 100% of the features and a bare PyPI page. I'm working on fixing this — slowly, because there's only one of me.
Where it stands

  • 202 tests passing across Python 3.10 / 3.11 / 3.12
  • 5 AI providers supported (Groq, OpenAI, Anthropic, Gemini, Mistral) — all optional; everything works offline with a curated dictionary + heuristics fallback
  • MIT license, runs fully locally, zero telemetry
  • Public beta: pip install prepro-auto — repo at github.com/Chilliflex/prepro_auto

What I'd find genuinely useful
If you try it and something breaks: please open an issue with the dataset shape and what stage failed. I read every one.
If you don't try it but have thoughts: the comments here are open, and I'll engage with anything specific. "It looks like X" or "have you considered Y?" — yes please. "Cool project" — appreciated but I can't do much with it.
Either way, thanks for reading.
