If you work with CSV data, you’ve probably written this code more times than you’d like:
- dropna()
- fillna()
- drop_duplicates()
- basic outlier filtering
- normalizing columns
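That boilerplate usually looks something like this (a minimal pandas sketch; the column name and fill strategy are placeholders, and in a real project the frame would come from `pd.read_csv`):

```python
import pandas as pd
import numpy as np

# A small inline frame keeps the example self-contained;
# normally this would be: df = pd.read_csv("data.csv")
df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 2.0, 100.0, 3.0]})

# Fill missing values with the column median
df["value"] = df["value"].fillna(df["value"].median())

# Remove exact duplicate rows
df = df.drop_duplicates()

# Basic IQR-based outlier filtering
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Min-max normalization
df["value"] = (df["value"] - df["value"].min()) / (
    df["value"].max() - df["value"].min()
)
```

Nothing in there is hard. It is just the same ten lines, slightly adapted, over and over.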
None of it is particularly difficult.
But it’s repetitive.
The Problem
As a backend engineer working with data pipelines, I kept running into the same pattern.
Before doing anything meaningful with a dataset, I’d spend time writing the same preprocessing logic just to get the data into a usable state.
It wasn’t the hardest part of the job—but it was always there.
And it always slowed things down.
What I Noticed
The issue isn’t complexity.
It’s repetition.
You already know what needs to be done:
- clean missing values
- remove duplicates
- normalize data
- filter outliers
But you still have to write it. Every time.
My Usual Workflow
Most of the time, I’d:
- copy snippets from previous projects
- reuse old notebooks
- write quick pandas scripts
It works—but it’s not efficient.
Especially when you just want to:
👉 quickly inspect a dataset
👉 apply basic transformations
👉 move on to actual analysis or pipeline logic
So I Tried Something Different
Instead of writing the same code over and over, I started experimenting with doing preprocessing directly inside the IDE.
That led me to build a small JetBrains plugin.
The idea is simple:
- Load a CSV file inside the IDE
- Apply common preprocessing steps visually
- Generate ready-to-run pandas code from those actions
What It Handles
Right now, it supports things like:
- Column profiling (types, null counts, stats)
- Handling missing values (drop, fill with mean/median/mode/custom)
- Removing duplicates
- Outlier detection (IQR-based)
- Normalization (Min-Max, Z-score)
- Type casting
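To make those steps concrete, this is roughly the kind of pandas they correspond to (a hand-written sketch with made-up column names, not the plugin's actual generated output):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["25", "30", None, "45"],
    "score": [10.0, 12.0, 11.0, 200.0],
})

# Column profiling: dtypes, null counts, summary stats
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
})
stats = df.describe(include="all")

# Type casting: string column -> numeric (invalid values become NaN)
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Handle missing values: fill with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Z-score normalization of a numeric column
df["score"] = (df["score"] - df["score"].mean()) / df["score"].std()
```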
And the part I find most useful:
👉 it generates clean pandas code based on what you do
So you still end up with code you can use in scripts, pipelines, or notebooks.
Why This Helped Me
For me, this made it much faster to go from:
raw data → cleaned dataset → usable code
without constantly switching context or rewriting boilerplate.
Still Early
This is still an early version, and I’m actively improving it based on feedback.
If you work with data preprocessing, ETL pipelines, or just deal with CSVs often, I’d really appreciate your thoughts.
👉 https://plugins.jetbrains.com/plugin/31226-data-preprocessor/
Even small feedback like:
- what feels clunky
- what’s missing
- what you’d expect
would be really helpful.
Curious About Your Workflow
How do you currently handle preprocessing?
- Do you just write pandas scripts each time?
- Use templates?
- Have your own utilities?
Would be interesting to hear how others approach this.