If you work with CSV data, you’ve probably written this code more times than you’d like:
- dropna()
- fillna()
- drop_duplicates()
- basic outlier filtering
- normalizing columns
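That boilerplate usually looks something like this (a minimal pandas sketch; the column name and fill strategy are placeholders, and in a real project the frame would come from `pd.read_csv`):

```python
import pandas as pd
import numpy as np

# A small inline frame keeps the example self-contained;
# normally this would be: df = pd.read_csv("data.csv")
df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 2.0, 100.0, 3.0]})

# Fill missing values with the column median
df["value"] = df["value"].fillna(df["value"].median())

# Remove exact duplicate rows
df = df.drop_duplicates()

# Basic IQR-based outlier filtering
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Min-max normalization
df["value"] = (df["value"] - df["value"].min()) / (
    df["value"].max() - df["value"].min()
)
```

Nothing in there is hard. It is just the same ten lines, slightly adapted, over and over.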
None of it is particularly difficult.
But it’s repetitive.
The Problem
As a backend engineer working with data pipelines, I kept running into the same pattern.
Before doing anything meaningful with a dataset, I’d spend time writing the same preprocessing logic just to get the data into a usable state.
It wasn’t the hardest part of the job—but it was always there.
And it always slowed things down.
What I Noticed
The issue isn’t complexity.
It’s repetition.
You already know what needs to be done:
- clean missing values
- remove duplicates
- normalize data
- filter outliers
But you still have to write it. Every time.
My Usual Workflow
Most of the time, I’d:
- copy snippets from previous projects
- reuse old notebooks
- write quick pandas scripts
It works—but it’s not efficient.
Especially when you just want to:
👉 quickly inspect a dataset
👉 apply basic transformations
👉 move on to actual analysis or pipeline logic
So I Tried Something Different
Instead of writing the same code over and over, I started experimenting with doing preprocessing directly inside the IDE.
That led me to build a small JetBrains plugin.
The idea is simple:
- Load a CSV file inside the IDE
- Apply common preprocessing steps visually
- Generate ready-to-run pandas code from those actions
What It Handles
Right now, it supports things like:
- Column profiling (types, null counts, stats)
- Handling missing values (drop, fill with mean/median/mode/custom)
- Removing duplicates
- Outlier detection (IQR-based)
- Normalization (Min-Max, Z-score)
- Type casting
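To make those steps concrete, this is roughly the kind of pandas they correspond to (a hand-written sketch with made-up column names, not the plugin's actual generated output):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["25", "30", None, "45"],
    "score": [10.0, 12.0, 11.0, 200.0],
})

# Column profiling: dtypes, null counts, summary stats
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
})
stats = df.describe(include="all")

# Type casting: string column -> numeric (invalid values become NaN)
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Handle missing values: fill with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Z-score normalization of a numeric column
df["score"] = (df["score"] - df["score"].mean()) / df["score"].std()
```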
And the part I find most useful:
👉 it generates clean pandas code based on what you do
So you still end up with code you can use in scripts, pipelines, or notebooks.
Why This Helped Me
For me, this made it much faster to go from:
raw data → cleaned dataset → usable code
without constantly switching context or rewriting boilerplate.
Still Early
This is still an early version, and I’m actively improving it based on feedback.
If you work with data preprocessing, ETL pipelines, or just deal with CSVs often, I’d really appreciate your thoughts.
👉 https://plugins.jetbrains.com/plugin/31226-data-preprocessor/
Even small feedback like:
- what feels clunky
- what’s missing
- what you’d expect
would be really helpful.
Curious About Your Workflow
How do you currently handle preprocessing?
- Do you just write pandas scripts each time?
- Use templates?
- Have your own utilities?
Would be interesting to hear how others approach this.