I built a Python preprocessing tool and a reviewer called it "low-trust" — here's what they got right

#python #machinelearning #datascience #opensource

I shipped my first PyPI package this week — a data preprocessing tool — and within 48 hours a reviewer called it "low-trust until proven otherwise." They weren't wrong.
This post is about what I built, why I built it, what I learned, and what the reviewer was right and wrong about.
What it is
PrePro Auto is an open-source data preprocessing workbench that runs locally. You load a dataset, it analyses every column, and you get decision cards — one per issue found. You approve, override, or skip each one. At the end you have a clean dataset plus a runnable Python pipeline that reproduces every step.

pip install prepro-auto

import prepro_auto

session = prepro_auto.launch_file(r"C:\data\bengaluru_house_prices.csv")
# Click the printed URL → clean visually in the workbench → then:
cleaned = session.current()

The original DataFrame is never touched. session.current() always returns the latest cleaned version.
The part I'm most proud of: the decision-card system
Most automated preprocessing tools work like a black box. PrePro Auto does the opposite.
For every issue found, the engine generates a card that shows what it found, why it matters, the recommended action with a confidence score, the reasoning in plain English, and two or three alternatives you can pick instead — plus Approve / Override / Skip / Drop buttons.
This is the "human-in-the-loop" part. The engine handles statistical analysis; you handle domain knowledge. A column with 81 distinct values gets recommended for frequency encoding — but if you know it's an ordinal category, you override. The model never knew that. You did.
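To make the card idea concrete, here is a minimal sketch of what such a record could look like. Every class, field, and method name below is hypothetical and chosen for illustration; it is not the actual PrePro Auto data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    OVERRIDDEN = "overridden"
    SKIPPED = "skipped"


@dataclass
class DecisionCard:
    # Hypothetical fields mirroring what a card shows in the UI.
    column: str
    finding: str
    recommended: str
    confidence: float
    reasoning: str
    alternatives: list = field(default_factory=list)
    verdict: Verdict = Verdict.PENDING
    chosen: Optional[str] = None

    def approve(self) -> None:
        """Accept the engine's recommendation as-is."""
        self.verdict = Verdict.APPROVED
        self.chosen = self.recommended

    def override(self, action: str) -> None:
        """Pick one of the listed alternatives instead of the recommendation."""
        if action not in self.alternatives:
            raise ValueError(f"{action!r} is not one of {self.alternatives}")
        self.verdict = Verdict.OVERRIDDEN
        self.chosen = action

    def skip(self) -> None:
        """Leave the column untouched for this issue."""
        self.verdict = Verdict.SKIPPED
        self.chosen = None


card = DecisionCard(
    column="availability",
    finding="81 distinct values",
    recommended="frequency_encoding",
    confidence=0.75,  # illustrative value, not taken from the tool
    reasoning="Too many categories for one-hot encoding.",
    alternatives=["ordinal_encoding", "target_encoding"],
)
card.override("ordinal_encoding")  # the human knows the column is ordinal
```

The point of keeping the recommendation and the final choice as separate fields is that an audit trail can later show both what the engine suggested and what the human decided.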
The technical problem I underestimated: sandboxing user expressions
Step 5 of the workbench lets you write arbitrary pandas: df['profit'] = df['revenue'] - df['cost']. That means executing user-supplied Python in a server process.
The naive version is eval(expression), or exec(expression) for an assignment like the one above. Either is a remote code execution vulnerability.
Getting the sandbox right took longer than any other single feature. The final implementation blocks:

All import statements
Dunder attribute access (__class__, __globals__, __builtins__)
OS, network, and filesystem access
Dangerous pandas methods (to_sql, arbitrary to_csv paths, pipe with callables)
Subprocesses and lambdas

I tested it against 15 attack strings before I was satisfied. Sandboxing is one of those things that never shows up in a demo but matters a lot if the tool ever runs multi-user. The lesson: a feature that's "execute arbitrary user code" needs to be sandboxed from day one or it needs to not exist.
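One common way to build the static half of such a sandbox is to parse the expression with Python's ast module and reject it before anything runs. The sketch below is an assumption about the general technique, not the PrePro Auto implementation: the function name and both deny-lists are hypothetical, and it blocks to_csv outright rather than only arbitrary paths. A static check like this is only one layer; the code that finally executes the expression would still need restricted globals.

```python
import ast

# Hypothetical deny-lists for illustration only.
BLOCKED_NAMES = {
    "eval", "exec", "open", "compile", "getattr", "setattr",
    "globals", "locals", "os", "sys", "subprocess", "socket",
}
BLOCKED_METHODS = {"to_sql", "to_csv", "pipe"}


def validate_expression(source: str) -> list:
    """Return the reasons an expression is rejected; an empty list means it passed."""
    try:
        tree = ast.parse(source, mode="exec")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statements are not allowed")
        elif isinstance(node, ast.Lambda):
            problems.append("lambdas are not allowed")
        elif isinstance(node, ast.Attribute):
            if node.attr.startswith("__"):
                problems.append(f"dunder attribute access: {node.attr}")
            elif node.attr in BLOCKED_METHODS:
                problems.append(f"blocked method: {node.attr}")
        elif isinstance(node, ast.Name):
            if node.id in BLOCKED_NAMES or node.id.startswith("__"):
                problems.append(f"blocked name: {node.id}")
    return problems


print(validate_expression("df['profit'] = df['revenue'] - df['cost']"))
print(validate_expression("__import__('os').system('ls')"))
```

Walking the whole tree matters: an attack string rarely puts the dangerous call at the top level, so checking only the outermost node would miss most of them.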
What the cleaning actually does — a real example
The Bengaluru House Prices dataset (Kaggle, 13,320 rows × 9 columns) has all the usual problems: missing values, outliers, categorical columns of varying cardinality, free-text fields, and a total_sqft column stored as strings like "1133 - 1384".
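For the range strings in total_sqft, one plausible conversion is to take the midpoint of the range. That choice is my assumption for this sketch; the function name parse_sqft is hypothetical and the post does not claim PrePro Auto resolves ranges this way.

```python
import re
from typing import Optional

# Matches strings such as "1133 - 1384" (two numbers separated by a hyphen).
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_sqft(raw) -> Optional[float]:
    """Convert a total_sqft cell to a float, using the midpoint for ranges.

    Returns None when the text is neither a number nor a simple range,
    so the caller can route it to the missing-values stage.
    """
    text = str(raw).strip()
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2
    try:
        return float(text)
    except ValueError:
        return None


print(parse_sqft("1133 - 1384"))  # 1258.5
print(parse_sqft("1056"))         # 1056.0
```

Returning None instead of raising keeps unparseable cells visible as missing data rather than silently dropping rows.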

Here's what happens stage by stage:

Profile detects total_sqft as text, flags the missing-rate on society, finds the bath/balcony columns are right-skewed
Missing values stage detects MCAR mechanism on bath, recommends median imputation
Outliers stage uses IQR + Isolation Forest, flags properties with implausible total_sqft as likely data errors
Scaling uses Shapiro-Wilk to detect non-normality, recommends Yeo-Johnson for price and total_sqft
Encoding routes by cardinality: 4 distinct area_type values → one-hot encoding (88% confidence), 81 availability values → frequency encoding (too many for one-hot)

By the end of the guided run, the dataset's quality score improved meaningfully and several low-signal columns got dropped after correlation analysis flagged them.
The export gives me three files: the cleaned CSV, an audit PDF listing every decision, and pipeline.py — a runnable script that re-applies the exact same cleaning on any new data with no PrePro Auto dependency.
What I learned building this
Stats beats ML for most preprocessing decisions. The right scaler for a Gaussian column is StandardScaler — you don't need a model to learn that. The right imputation for a column whose missingness correlates with another column is MICE — that's a statistical relationship, not a learned one. I used ML (Isolation Forest for multivariate outliers, KNN/MICE for imputation) only where the data itself has to inform the decision. For everything else, statistical tests give better, more interpretable, and faster answers. I went into this expecting to use more ML and came out using less.
The notebook-to-UI bridge is harder than it looks. prepro_auto.launch(df) opens a browser tab against the live in-memory DataFrame while keeping the Jupyter kernel alive. The naive version blocks the kernel because uvicorn wants the event loop. The working version uses nest-asyncio to nest the FastAPI loop inside the kernel's loop, runs uvicorn in a daemon thread, and shares state through a session registry. None of this is exotic, but every piece had to be right or the kernel deadlocks. Two days of fighting asyncio for what ended up as 30 lines of code.
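The core of that pattern, a server running in a daemon thread that shares state with the caller through a registry, can be shown with the standard library alone. This sketch deliberately swaps in http.server for FastAPI and uvicorn and skips nest-asyncio altogether, so it illustrates the threading and registry idea only; launch_sketch and SESSIONS are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Session registry: the calling thread writes, the server thread reads.
SESSIONS = {}


class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        session_id = self.path.strip("/")
        found = session_id in SESSIONS
        payload = repr(SESSIONS.get(session_id)).encode("utf-8")
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        # Keep notebook output clean by silencing per-request logging.
        pass


def launch_sketch(data, session_id="demo"):
    """Register data, start the server in a daemon thread, return (server, url)."""
    SESSIONS[session_id] = data
    server = ThreadingHTTPServer(("127.0.0.1", 0), _Handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/{session_id}"


server, url = launch_sketch({"rows": 3})
print(url)  # the call returns immediately, so the caller is not blocked
```

Because the thread is a daemon, it does not keep the process alive on exit; calling server.shutdown() followed by server.server_close() stops it explicitly.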
The first 60 seconds matter more than any feature. My most common new-user error: typing prepro_auto in a Jupyter cell and getting a module object printed instead of a server starting. It's correct Python behaviour. It's also a terrible first impression. The second most common error was UnicodeDecodeError when users did pd.read_csv("data.csv") on a Latin-1 encoded file. Both are technically the user's fault. Both lose users anyway. I added launch_file() so users never need pd.read_csv at all, and a note in the README explicitly saying "don't type the package name as a command in a notebook." If you build any developer tool, assume your users will hit your worst onboarding path first.
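The UnicodeDecodeError case is usually handled with an encoding fallback. The sketch below shows that general technique under my own assumptions: decode_csv_bytes is a hypothetical helper, and the post does not say that launch_file() works this way internally.

```python
def decode_csv_bytes(raw: bytes, encodings=("utf-8-sig", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    utf-8-sig also strips a leading byte-order mark. latin-1 maps every
    byte to a character, so with the default list it acts as a last resort
    that cannot fail (though it may mis-render other encodings).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


latin1_file = "name\ncaf\u00e9\n".encode("latin-1")
text, used = decode_csv_bytes(latin1_file)
print(used)  # latin-1
```

Reporting which encoding was used, rather than only the decoded text, lets the tool tell the user what it guessed instead of guessing silently.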

What the reviewer was right and wrong about
The reviewer who called my package "low-trust" was right about: limited PyPI page, no track record, no CI badges visible from the install page, no screenshots in the README at the time, no comparison to alternatives. All fair criticism for a beta package days after release.
They were wrong about: assuming the lack of issue history meant no testing. The repo has 202 tests, but a stranger looking at PyPI couldn't see them. They also misread some package-name search results.
The honest lesson: trust signals matter as much as the code. A tool with 50% of the features but a green CI badge, real screenshots, a CHANGELOG, and three blog posts written about it will get more adoption than a tool with 100% of the features and a bare PyPI page. I'm working on fixing this — slowly, because there's only one of me.
Where it stands

202 tests passing across Python 3.10 / 3.11 / 3.12
5 AI providers supported (Groq, OpenAI, Anthropic, Gemini, Mistral) — all optional; everything works offline with a curated dictionary + heuristics fallback
MIT license, runs fully locally, zero telemetry
Public beta: pip install prepro-auto — repo at github.com/Chilliflex/prepro_auto

What I'd find genuinely useful
If you try it and something breaks: please open an issue with the dataset shape and what stage failed. I read every one.
If you don't try it but have thoughts: the comments here are open, and I'll engage with anything specific. "It looks like X" or "have you considered Y?" — yes please. "Cool project" — appreciated but I can't do much with it.
Either way, thanks for reading.

Top comments (1)

Harjot Singh • May 31

"A reviewer called it low-trust" is a great prompt for reflection, and the useful question is whether they were right about the trust gap even if wrong about your specific code. Trust in a tool isn't about whether it works once, it's whether someone can rely on it without re-checking, which comes from tests, clear failure modes, and predictable behavior, not just correct output. Often "low-trust" really means "I can't tell why I should trust this," which is a docs/verification gap more than a code one. I think about earned-trust a lot in Moonshift, output you can verify is output people actually rely on. What was the reviewer's real objection, missing tests or unclear behavior?