<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shivanshu Pandey</title>
    <description>The latest articles on DEV Community by Shivanshu Pandey (@shivanshu_pandey).</description>
    <link>https://dev.to/shivanshu_pandey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960083%2F3d3441b2-c71a-41c7-bc31-c6f1be08b0bd.png</url>
      <title>DEV Community: Shivanshu Pandey</title>
      <link>https://dev.to/shivanshu_pandey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shivanshu_pandey"/>
    <language>en</language>
    <item>
      <title>I built a Python preprocessing tool and a reviewer called it "low-trust" — here's what they got right</title>
      <dc:creator>Shivanshu Pandey</dc:creator>
      <pubDate>Sat, 30 May 2026 16:05:53 +0000</pubDate>
      <link>https://dev.to/shivanshu_pandey/i-built-a-python-preprocessing-tool-and-a-reviewer-called-it-low-trust-heres-what-they-got-d2m</link>
      <guid>https://dev.to/shivanshu_pandey/i-built-a-python-preprocessing-tool-and-a-reviewer-called-it-low-trust-heres-what-they-got-d2m</guid>
      <description>&lt;p&gt;I shipped my first PyPI package this week — a data preprocessing tool — and within 48 hours a reviewer called it "low-trust until proven otherwise." They weren't wrong.&lt;br&gt;
This post is about what I built, why I built it, what I learned, and what the reviewer was right and wrong about.&lt;br&gt;
What it is&lt;br&gt;
PrePro Auto is an open-source data preprocessing workbench that runs locally. You load a dataset, it analyses every column, and you get decision cards — one per issue found. You approve, override, or skip each one. At the end you have a clean dataset plus a runnable Python pipeline that reproduces every step.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install prepro-auto&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prepro_auto&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prepro_auto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C:\data\bengaluru_house_prices.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Click the printed URL → clean visually in the workbench → then:
&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original DataFrame is never touched. &lt;code&gt;session.current()&lt;/code&gt; always returns the latest cleaned version.&lt;br&gt;
The part I'm most proud of: the decision-card system&lt;br&gt;
Most automated preprocessing tools work like a black box. PrePro Auto does the opposite.&lt;br&gt;
For every issue found, the engine generates a card that shows what it found, why it matters, the recommended action with a confidence score, the reasoning in plain English, and two or three alternatives you can pick instead — plus Approve / Override / Skip / Drop buttons.&lt;br&gt;
This is the "human-in-the-loop" part. The engine handles statistical analysis; you handle domain knowledge. A column with 81 distinct values gets recommended for frequency encoding — but if you know it's an ordinal category, you override. The model never knew that. You did.&lt;br&gt;
The technical problem I underestimated: sandboxing user expressions&lt;br&gt;
Step 5 of the workbench lets you write arbitrary pandas: &lt;code&gt;df['profit'] = df['revenue'] - df['cost']&lt;/code&gt;. That means executing user-supplied Python in a server process.&lt;br&gt;
The naive version is &lt;strong&gt;eval(expression)&lt;/strong&gt;. That's also a remote code execution vulnerability.&lt;br&gt;
Getting the sandbox right took longer than any other single feature. The final implementation blocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All &lt;strong&gt;import&lt;/strong&gt; statements&lt;/li&gt;
&lt;li&gt;Dunder attribute access (&lt;strong&gt;__class__&lt;/strong&gt;, &lt;strong&gt;__globals__&lt;/strong&gt;, &lt;strong&gt;__builtins__&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;OS, network, and filesystem access&lt;/li&gt;
&lt;li&gt;Dangerous pandas methods (&lt;strong&gt;to_sql&lt;/strong&gt;, arbitrary &lt;strong&gt;to_csv&lt;/strong&gt; paths, &lt;strong&gt;pipe&lt;/strong&gt; with callables)&lt;/li&gt;
&lt;li&gt;Subprocesses and lambdas&lt;/li&gt;
&lt;/ul&gt;
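&lt;p&gt;To make the idea concrete, here is a minimal sketch of the static half of such a check, using Python's standard &lt;code&gt;ast&lt;/code&gt; module. It is not PrePro Auto's actual implementation: the function name &lt;code&gt;find_violations&lt;/code&gt; and both deny-lists are hypothetical and only illustrate rejecting imports, dunder access, lambdas, and risky method names before anything is executed.&lt;/p&gt;

```python
import ast

# Hypothetical deny-lists for illustration; not PrePro Auto's actual rules.
BLOCKED_METHODS = {"to_sql", "to_csv", "to_pickle", "pipe", "eval", "exec"}
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "getattr", "setattr", "__import__"}


def find_violations(expression):
    """Return a list of reasons to reject the expression (empty if none found)."""
    try:
        tree = ast.parse(expression, mode="exec")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statement")
        elif isinstance(node, ast.Lambda):
            problems.append("lambda expression")
        elif isinstance(node, ast.Attribute):
            # Attribute access covers both dunder escapes and risky methods.
            if node.attr.startswith("__"):
                problems.append(f"dunder attribute: {node.attr}")
            elif node.attr in BLOCKED_METHODS:
                problems.append(f"blocked method: {node.attr}")
        elif isinstance(node, ast.Name) and (
            node.id in BLOCKED_NAMES or node.id.startswith("__")
        ):
            problems.append(f"blocked name: {node.id}")
    return problems
```

&lt;p&gt;With this sketch, &lt;code&gt;find_violations("df['profit'] = df['revenue'] - df['cost']")&lt;/code&gt; returns an empty list, while &lt;code&gt;find_violations("import os")&lt;/code&gt; returns &lt;code&gt;["import statement"]&lt;/code&gt;. A static pass like this is only one layer; the expression would still need to run with a restricted namespace, and a deny-list should be treated as a starting point rather than a guarantee.&lt;/p&gt;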

&lt;p&gt;I tested it against 15 attack strings before I was satisfied. Sandboxing is one of those things that never shows up in a demo but matters a lot if the tool ever runs multi-user. The lesson: a feature that's "execute arbitrary user code" needs to be sandboxed from day one or it needs to not exist.&lt;br&gt;
What the cleaning actually does — a real example&lt;br&gt;
The Bengaluru House Prices dataset (Kaggle, 13,320 rows × 9 columns) has all the usual problems: missing values, outliers, categorical columns of varying cardinality, free-text fields, and a &lt;strong&gt;total_sqft&lt;/strong&gt; column stored as strings like "1133 - 1384".&lt;/p&gt;
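&lt;p&gt;As an illustration of the &lt;strong&gt;total_sqft&lt;/strong&gt; problem, a range string such as "1133 - 1384" could be turned into a number along these lines. The helper name &lt;code&gt;parse_sqft&lt;/code&gt; is hypothetical, and taking the midpoint of a range is an assumption made for this sketch, not necessarily what PrePro Auto recommends.&lt;/p&gt;

```python
def parse_sqft(raw):
    """Convert '1056' or '1133 - 1384' to a float; return None if unparseable."""
    if raw is None:
        return None
    parts = [p.strip() for p in str(raw).split(" - ")]
    try:
        numbers = [float(p) for p in parts]
    except ValueError:
        return None  # anything that is not a number or a range is left for review
    if len(numbers) == 1:
        return numbers[0]
    if len(numbers) == 2:
        # Midpoint of the range: an assumption made for this sketch.
        return (numbers[0] + numbers[1]) / 2
    return None
```

&lt;p&gt;Here &lt;code&gt;parse_sqft("1133 - 1384")&lt;/code&gt; gives &lt;code&gt;1258.5&lt;/code&gt;, and values that fit neither pattern come back as &lt;code&gt;None&lt;/code&gt; so they can be handled as missing rather than silently guessed.&lt;/p&gt;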

&lt;p&gt;Here's what happens stage by stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profile detects &lt;strong&gt;total_sqft&lt;/strong&gt; as text, flags the missing rate on &lt;strong&gt;society&lt;/strong&gt;, and finds that the bath and balcony columns are right-skewed&lt;/li&gt;
&lt;li&gt;Missing values stage detects an MCAR (missing completely at random) mechanism on &lt;strong&gt;bath&lt;/strong&gt; and recommends median imputation&lt;/li&gt;
&lt;li&gt;Outliers stage uses IQR + Isolation Forest, flags properties with implausible &lt;strong&gt;total_sqft&lt;/strong&gt; as likely data errors&lt;/li&gt;
&lt;li&gt;Scaling uses Shapiro-Wilk to detect non-normality, recommends Yeo-Johnson for &lt;strong&gt;price&lt;/strong&gt; and &lt;strong&gt;total_sqft&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Encoding routes by cardinality: 4 distinct &lt;strong&gt;area_type&lt;/strong&gt; values → one-hot encoding (88% confidence), 81 &lt;strong&gt;availability&lt;/strong&gt; values → frequency encoding (too many for one-hot)&lt;/li&gt;
&lt;/ul&gt;
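&lt;p&gt;The cardinality routing in the encoding stage can be sketched in a few lines of plain Python. The cutoff &lt;code&gt;ONE_HOT_MAX_CARDINALITY&lt;/code&gt; and both function names are hypothetical; the post does not state the threshold PrePro Auto actually uses, only that 4 distinct values went to one-hot and 81 went to frequency encoding.&lt;/p&gt;

```python
from collections import Counter

ONE_HOT_MAX_CARDINALITY = 10  # hypothetical cutoff, not PrePro Auto's actual value


def recommend_encoding(values):
    """Pick an encoding strategy from the number of distinct non-missing values."""
    distinct = len({v for v in values if v is not None})
    if ONE_HOT_MAX_CARDINALITY >= distinct:
        return "one-hot"
    return "frequency"


def frequency_encode(values):
    """Replace each category with its relative frequency; missing values stay None."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return [counts[v] / total if v is not None else None for v in values]
```

&lt;p&gt;Under these assumptions a column with 4 distinct values is routed to one-hot encoding and a column with 81 distinct values to frequency encoding, matching the behaviour described above. Frequency encoding keeps a single numeric column no matter how many categories exist, which is why it suits high-cardinality fields.&lt;/p&gt;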

&lt;p&gt;By the end of the guided run, the dataset's quality score improved meaningfully and several low-signal columns got dropped after correlation analysis flagged them.&lt;br&gt;
The export gives me three files: the cleaned CSV, an audit PDF listing every decision, and &lt;strong&gt;pipeline.py&lt;/strong&gt; — a runnable script that re-applies the exact same cleaning on any new data with no PrePro Auto dependency.&lt;br&gt;
What I learned building this&lt;br&gt;
Stats beats ML for most preprocessing decisions. The right scaler for a Gaussian column is StandardScaler — you don't need a model to learn that. The right imputation for a column whose missingness correlates with another column is MICE — that's a statistical relationship, not a learned one. I used ML (Isolation Forest for multivariate outliers, KNN/MICE for imputation) only where the data itself has to inform the decision. For everything else, statistical tests give better, more interpretable, and faster answers. I went into this expecting to use more ML and came out using less.&lt;br&gt;
The notebook-to-UI bridge is harder than it looks. &lt;strong&gt;prepro_auto.launch(df)&lt;/strong&gt; opens a browser tab against the live in-memory DataFrame while keeping the Jupyter kernel alive. The naive version blocks the kernel because uvicorn wants the event loop. The working version uses &lt;strong&gt;nest-asyncio&lt;/strong&gt; to nest the FastAPI loop inside the kernel's loop, runs uvicorn in a daemon thread, and shares state through a session registry. None of this is exotic, but every piece had to be right or the kernel deadlocks. Two days of fighting asyncio for what ended up as 30 lines of code.&lt;br&gt;
The first 60 seconds matter more than any feature. My most common new-user error: typing &lt;strong&gt;prepro_auto&lt;/strong&gt; in a Jupyter cell and getting a module object printed instead of a server starting. It's correct Python behaviour. It's also a terrible first impression. The second most common error was &lt;strong&gt;UnicodeDecodeError&lt;/strong&gt; when users did &lt;strong&gt;pd.read_csv("data.csv")&lt;/strong&gt; on a Latin-1 encoded file. Both are technically the user's fault. Both lose users anyway. I added &lt;strong&gt;launch_file()&lt;/strong&gt; so users never need &lt;strong&gt;pd.read_csv&lt;/strong&gt; at all, and a note in the README explicitly saying "don't type the package name as a command in a notebook." If you build any developer tool, assume your users will hit your worst onboarding path first.&lt;/p&gt;
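&lt;p&gt;The &lt;strong&gt;UnicodeDecodeError&lt;/strong&gt; case is a good example of an onboarding failure a loader can absorb. The sketch below tries UTF-8 first and falls back to Latin-1, using only the standard library. It is an illustration of the idea, not the code behind &lt;strong&gt;launch_file()&lt;/strong&gt;; the function name &lt;code&gt;read_csv_rows&lt;/code&gt; and the encoding order are assumptions.&lt;/p&gt;

```python
import csv
import io


def read_csv_rows(path, encodings=("utf-8", "latin-1")):
    """Read a CSV as a list of rows, trying each encoding until one decodes."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        rows = list(csv.reader(io.StringIO(text, newline="")))
        return encoding, rows
    raise ValueError(f"could not decode {path!r} with any of {encodings}")
```

&lt;p&gt;Returning the encoding that succeeded alongside the rows lets the caller show it to the user, which keeps the fallback visible instead of silent. Latin-1 can decode any byte sequence, so it belongs last in the list as a catch-all.&lt;/p&gt;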

&lt;p&gt;What the reviewer was right and wrong about&lt;br&gt;
The reviewer who called my package "low-trust" was right about: limited PyPI page, no track record, no CI badges visible from the install page, no screenshots in the README at the time, no comparison to alternatives. All fair criticism for a beta package days after release.&lt;br&gt;
They were wrong about: assuming the lack of issue history meant no testing. The repo has 202 tests, but a stranger looking at PyPI couldn't see them. They also misread some package-name search results.&lt;br&gt;
The honest lesson: trust signals matter as much as the code. A tool with 50% of the features but a green CI badge, real screenshots, a CHANGELOG, and three blog posts written about it will get more adoption than a tool with 100% of the features and a bare PyPI page. I'm working on fixing this — slowly, because there's only one of me.&lt;br&gt;
Where it stands&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;202 tests passing across Python 3.10 / 3.11 / 3.12&lt;/li&gt;
&lt;li&gt;5 AI providers supported (Groq, OpenAI, Anthropic, Gemini, Mistral) — all optional; everything works offline with a curated dictionary + heuristics fallback&lt;/li&gt;
&lt;li&gt;MIT license, runs fully locally, zero telemetry&lt;/li&gt;
&lt;li&gt;Public beta: &lt;code&gt;pip install prepro-auto&lt;/code&gt;; repo at &lt;a href="https://github.com/Chilliflex/prepro_auto"&gt;github.com/Chilliflex/prepro_auto&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I'd find genuinely useful&lt;br&gt;
If you try it and something breaks: please open an issue with the dataset shape and what stage failed. I read every one.&lt;br&gt;
If you don't try it but have thoughts: the comments here are open, and I'll engage with anything specific. "It looks like X" or "have you considered Y?" — yes please. "Cool project" — appreciated but I can't do much with it.&lt;br&gt;
Either way, thanks for reading.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
