DEV Community: Godwill Christopher

What's new in Data Preprocessor 1.5.x — R codegen, Robust Scaler, and a deadlock post-mortem

Godwill Christopher — Mon, 25 May 2026 14:42:14 +0000

It's been a few months since I last wrote about Data Preprocessor, the IntelliJ plugin I built to stop re-writing the same pandas preprocessing scripts every project. The 1.5.x series has landed a real R codegen path, a more honest outlier-resistant normalizer, and one genuinely embarrassing deadlock that I want to talk about openly because the lesson is useful.

tl;dr on what the plugin does

You load a CSV, Excel, or JSON file inside your JetBrains IDE. The plugin profiles every column (type, null count, mean/median/std, mode, unique count). You build a pipeline visually — drop nulls, fill with mean, deduplicate, remove IQR outliers, normalize (min-max / z-score / robust), label-encode, one-hot, train/test split, sort, filter, type-cast — and then one click emits a complete, ready-to-run Python (pandas) or R (base + a few small libs) script.
All processing is local. The plugin collects no telemetry. The generated code is normal pandas or normal R — no runtime library, no plugin import, nothing magic. Read it, edit it, commit it alongside your dataset, run it long after you've uninstalled the plugin.
Here's roughly what a 5-step pipeline turns into:

# Generated by Data Preprocessor 1.5.6
# Source: sample-data/employees.csv

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample-data/employees.csv")

# Step 1: drop rows where 'department' is null
df = df.dropna(subset=["department"])

# Step 2: fill 'performance_score' null with median
df["performance_score"] = df["performance_score"].fillna(
    df["performance_score"].median()
)

# Step 3: remove duplicates
df = df.drop_duplicates()

# Step 4: Robust Scaler on 'salary' (median/IQR, IQR=0 guard)
_med = df["salary"].median()
_q1 = df["salary"].quantile(0.25)
_q3 = df["salary"].quantile(0.75)
_iqr = _q3 - _q1
if _iqr != 0:
    df["salary"] = (df["salary"] - _med) / _iqr

# Step 5: train/test split (ratio 0.8)
train, test = train_test_split(df, train_size=0.8, random_state=42)

The R output is structurally the same, with readxl / jsonlite / fastDummies imported only when the pipeline actually uses them.

1.5.0: R code generation, for real

The biggest change since I last posted is that the codegen is no longer Python-only. The full 16-operation pipeline now has an R equivalent. Label-encode is 0-based to match pandas.factorize (R's native factor() is 1-based by default — that was a fun footgun to find and fix in 1.5.5).
This was a deliberate choice rather than a feature request: when you preprocess for an analytics team, half of them are in Python and half are in R, and forcing the cleanup to be language-specific defeats the point of having a reproducible artifact. The visual pipeline is the spec; Python and R are just two render targets.
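To make the 0-based label-encoding point concrete, here is a minimal Java sketch that hands out codes in order of first appearance, starting at 0, which is what pandas.factorize does by default. It is not the plugin's code: LabelEncodeSketch and encode are hypothetical names, and the sketch ignores missing values (pandas.factorize would code those as -1).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: 0-based label encoding in order of first appearance. */
final class LabelEncodeSketch {

    /** Returns one code per value; the first distinct value gets 0, the next gets 1, and so on. */
    static int[] encode(List<String> values) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        int[] out = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            Integer code = codes.get(value);
            if (code == null) {
                code = codes.size(); // next unused code, starting at 0
                codes.put(value, code);
            }
            out[i] = code;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] codes = encode(List.of("sales", "ops", "sales", "hr"));
        System.out.println(Arrays.toString(codes)); // prints [0, 1, 0, 2]
    }
}
```

A 1-based encoder would return [1, 2, 1, 3] for the same input, so the Python and R scripts would disagree on every row unless one side is shifted.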

1.3.0 → 1.5.5: Robust Scaler with honest edge cases

Min-max and z-score break in interesting ways when your column has outliers. A single row at 10⁹ collapses the rest of the column into a narrow band near zero. So 1.3.0 added the Robust Scaler — (x - median) / IQR — which gives you a normalization that doesn't get yanked around by the long tail.
The catch: when IQR = 0 (column is constant, or near-constant), the naïve formula divides by zero. Neither pandas nor R raises on floating-point division by zero; both silently produce NaN (for 0/0) or ±Inf (for a nonzero numerator). The Java preview already guarded against this (returned the column unchanged), but the generated scripts didn't. 1.5.5 added explicit if _iqr != 0: / if (.iqr != 0) guards in both generated outputs to match the preview's behaviour exactly.
Boring fix, but the kind of thing where the absence of an error is worse than a noisy crash. A NaN that propagates through three more steps is much harder to debug than a ZeroDivisionError at the source.
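As an illustration of the guard, here is a small Java sketch of a robust scaler that returns the column unchanged when the IQR is 0. It is not the plugin's preview code: RobustScalerSketch and its methods are hypothetical names, the quantile uses linear interpolation (the pandas default), and the sketch assumes a non-empty column with no missing values.

```java
import java.util.Arrays;

/** Hypothetical sketch of (x - median) / IQR with an IQR == 0 guard. */
final class RobustScalerSketch {

    /** Quantile of an already-sorted array, interpolating linearly between neighbours. */
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns the scaled column, or an unchanged copy when the IQR is zero. */
    static double[] scale(double[] column) {
        double[] sorted = column.clone();
        Arrays.sort(sorted);
        double median = quantile(sorted, 0.5);
        double iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
        if (iqr == 0.0) {
            return column.clone(); // guard: constant or near-constant column
        }
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = (column[i] - median) / iqr;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [-1.0, -0.5, 0.0, 0.5, 1.0]
        System.out.println(Arrays.toString(scale(new double[] {1, 2, 3, 4, 5})));
        // prints [7.0, 7.0, 7.0, 7.0] because the guard leaves the column alone
        System.out.println(Arrays.toString(scale(new double[] {7, 7, 7, 7})));
    }
}
```

Without the guard, the second call would compute 0.0 / 0.0 for every element and return four NaN values with no exception.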

1.5.3: the deadlock post-mortem

This is the one I want to talk about. The IntelliJ Platform 2024.2 changed how FileChooser.chooseFiles interacts with the EDT (event dispatch thread). The Browse button started failing intermittently on newer IDEs, so 1.5.3 wrapped the call in ApplicationManager.getApplication().invokeLater(...).
That was wrong, and not in a "minor regression" way — in a "the entire IDE freezes for every user who installs the plugin" way.
Here's the trap: FileChooser.chooseFiles is already asynchronous on its own. Wrapping it in invokeLater queues a runnable behind the EDT pump, but the runnable itself opens a modal-style dispatcher that blocks the EDT pump waiting for itself to dispatch. Neither side makes progress. Cursor hangs, dock icon stops responding, and the JVM has to be killed from Activity Monitor.
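The shape of that hang can be reproduced outside the IDE. The sketch below is an analogy, not IntelliJ code: a single-thread executor stands in for the EDT, and a task running on it queues follow-up work on the same thread and then blocks waiting for it. All names are hypothetical, and the real FileChooser internals may differ in detail.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical analogy: a task that waits on work queued behind itself never finishes. */
final class SelfDeadlockSketch {

    /** Returns true if the outer task failed to finish within the timeout. */
    static boolean outerTaskStalls(long timeoutMillis) throws Exception {
        ExecutorService dispatcher = Executors.newSingleThreadExecutor(); // stands in for the EDT
        try {
            Future<?> outer = dispatcher.submit(() -> {
                // Queue follow-up work on the same single thread...
                Future<?> inner = dispatcher.submit(() -> { });
                // ...then block that thread until the follow-up has run. It never can.
                inner.get();
                return null;
            });
            try {
                outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException stalled) {
                return true;
            }
        } finally {
            dispatcher.shutdownNow(); // interrupts the stuck worker so the JVM can exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stalled: " + outerTaskStalls(500)); // prints "stalled: true"
    }
}
```

The demo only escapes because the caller sits on a different thread and gives up after a timeout. When the blocked thread is the one that paints the UI, there is nobody left to time out.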
I caught it within about an hour because users on Marketplace were immediate and direct about it (sincere gratitude for that; angry early users are the most valuable kind). I retracted 1.5.3, shipped 1.5.4 as a straight revert, and added a permanent comment to the source so I don't repeat the mistake:

// FileChooser.chooseFiles is already asynchronous and must be called
// directly from the EDT — no wrapper is needed or safe.
1.5.5 then fixed the original Browse problem the right way: switched to the built-in single-file chooser, kept directories visible in the filter so users can navigate normally, and anchored the dialog to the tool window component rather than letting it float free.
Two lessons I'm carrying forward:

When the platform changes async semantics, read the source — don't guess. The 2024.2 release notes mentioned the dispatcher change, but I didn't connect it back to FileChooser because the API surface hadn't moved.
Modal-dialog-on-EDT bugs don't show up in CI. They show up the moment a real user clicks the button. Manual smoke-testing on a sandbox IDE before every publish is now non-negotiable for me.


1.5.6: SDK alignment

Just shipped today. pluginSinceBuild bumped from 233 to 243, matching the 2024.3 SDK I actually compile against. JetBrains' Plugin Verifier reports Compatible against IC-243, IC-251, IC-252, and IU-253 — zero deprecated-API usages against 2024.3 itself, three soft deprecations in 2025.x that I'll address in the next minor.
I also disabled the Gradle IntelliJ Plugin's GitHub self-update check, which had a habit of failing the entire build whenever GitHub's API was rate-limited or my network was offline. That one ate two hours of my Monday before I tracked down the fix:

# gradle.properties
systemProp.org.jetbrains.intellij.buildFeature.selfUpdateCheck=false
If you build any IntelliJ plugin and you've ever stared at Cannot resolve the latest Gradle IntelliJ Plugin version and wondered why a build with no actual problems is failing — that line is the fix.

What's next

The most-requested features right now, in order:

Categorical binning — equal-width and quantile-based bucketization for numeric columns into categorical bins. Pandas has pd.cut and pd.qcut; R has a few options. Codegen for both is straightforward; the UI work is figuring out how to preview the bins without making the tool window huge.
Pipeline import/export as JSON so teams can share pipeline definitions in the repo and re-apply them via CLI in CI. This is the change that turns the plugin from a "speed up the first cleanup" tool into a "version-control your data cleanups" tool.
DuckDB read path for files too large to fit in memory. The current LoaderArchitecture is single-pass row-oriented; DuckDB would let the plugin profile and clean files up to ~10 GB on a laptop without rewriting the engine.

If you've used the plugin and have opinions on which of these to prioritize — or a totally different request — please drop it as an issue or just reply here. The most useful feedback is "I tried to do X and the generated code does Y instead" because those are the highest-leverage fixes.

Try it

Marketplace: https://plugins.jetbrains.com/plugin/31226-data-preprocessor
Source (MIT): https://github.com/codaBlurd/data-preprocessor-plugin

Bug reports, feature requests, and PRs all welcome. Reviews on the Marketplace are how the plugin gets discovered by new users — if it's saved you time, two minutes there is the highest-leverage thing you can do for it.
Thanks for reading. Build something good this week.

How I Used the Observer Pattern to Watch Directories in Java (And Prevent Race Conditions)

Godwill Christopher — Mon, 27 Apr 2026 14:22:46 +0000

If you've ever needed to react to file system changes in Java — a config file updating, an upload folder receiving a new file, a hot-reload mechanism — you've probably reached for Java's WatchService. It's clean. It's built-in. And it hides a subtle concurrency trap that will burn you in production if you're not paying attention.

In this post I'll walk through building a directory watcher using the Observer pattern, and then show exactly where race conditions creep in and how to shut them down with a ReentrantLock.

The Observer Pattern in 10 Seconds

Observer is a behavioural pattern where one object (the subject) maintains a list of dependents (observers) and notifies them automatically when its state changes.

In our case:

Subject — the directory watcher, watching for file events
Observers — any number of handlers that react when a file changes

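import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Implemented by anything that wants to react to a file event.
public interface FileChangeObserver {
    void onFileChanged(Path filePath);
}

public class DirectoryWatcher {
    private final List<FileChangeObserver> observers = new ArrayList<>();

    public void addObserver(FileChangeObserver observer) {
        observers.add(observer);
    }

    // Called once per file event; fans the path out to every registered observer.
    private void notifyObservers(Path path) {
        for (FileChangeObserver observer : observers) {
            observer.onFileChanged(path);
        }
    }
}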

Clean and extensible — add as many handlers as you need without touching the watcher itself.
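
Because FileChangeObserver has a single method, handlers can be registered as lambdas. The sketch below is a self-contained usage example under a few assumptions: the types are nested in a hypothetical ObserverUsageSketch class, and notifyObservers is made package-private purely so the demo can fire an event by hand.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical usage sketch for the observer types described above. */
final class ObserverUsageSketch {

    interface FileChangeObserver {
        void onFileChanged(Path filePath);
    }

    static final class DirectoryWatcher {
        private final List<FileChangeObserver> observers = new ArrayList<>();

        void addObserver(FileChangeObserver observer) {
            observers.add(observer);
        }

        // Package-private here (an assumption for the demo) so an event can be fired manually.
        void notifyObservers(Path path) {
            for (FileChangeObserver observer : observers) {
                observer.onFileChanged(path);
            }
        }
    }

    /** Registers two lambda observers, fires one event, and returns what they recorded. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        DirectoryWatcher watcher = new DirectoryWatcher();
        watcher.addObserver(p -> log.add("reload " + p.getFileName()));
        watcher.addObserver(p -> log.add("audit " + p.getFileName()));
        watcher.notifyObservers(Path.of("config", "app.yml"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [reload app.yml, audit app.yml]
    }
}
```

Observers run in registration order, and neither handler knows the other exists.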

Setting Up WatchService

Java NIO gives us WatchService — a low-level file system event API.

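public void watch(Path directory) throws IOException, InterruptedException {
    // try-with-resources closes the WatchService when the loop exits
    try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
        directory.register(watchService,
            StandardWatchEventKinds.ENTRY_CREATE,
            StandardWatchEventKinds.ENTRY_MODIFY,
            StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watchService.take(); // blocks until an event arrives

            for (WatchEvent<?> event : key.pollEvents()) {
                // OVERFLOW can arrive even though it was not registered; its context may be null
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue;
                }
                Path changed = directory.resolve((Path) event.context());
                notifyObservers(changed);
            }

            // reset() returns false once the key is invalid, e.g. the directory is gone
            if (!key.reset()) {
                break;
            }
        }
    }
}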

This works perfectly — until you run it in a multi-threaded environment.

Where the Race Condition Hides

Say you spin up a thread pool to process file change events faster:

ExecutorService executor = Executors.newFixedThreadPool(4);

for (WatchEvent<?> event : key.pollEvents()) {
    Path changed = directory.resolve((Path) event.context());
    executor.submit(() -> notifyObservers(changed));
}

Now four threads can be notifying observers simultaneously. If two events arrive for the same file at nearly the same time — say a file is written and then immediately modified — two threads can call onFileChanged on the same path concurrently.

Depending on what your observer does (write to a database, process the file, update a cache), you now have a race condition. Two threads reading and transforming the same file simultaneously. Silent data corruption. The worst kind of bug.

Fixing It With ReentrantLock

A ReentrantLock lets only one thread process a given file path at a time while other threads wait their turn.

public class DirectoryWatcher {
    // CopyOnWriteArrayList: pool threads can iterate safely even if an observer is added mid-notification
    private final List<FileChangeObserver> observers = new CopyOnWriteArrayList<>();
    private final Map<Path, ReentrantLock> fileLocks = new ConcurrentHashMap<>();

    private ReentrantLock getLockForPath(Path path) {
        return fileLocks.computeIfAbsent(path, p -> new ReentrantLock());
    }

    private void notifyObservers(Path path) {
        ReentrantLock lock = getLockForPath(path);
        lock.lock();
        try {
            for (FileChangeObserver observer : observers) {
                observer.onFileChanged(path);
            }
        } finally {
            lock.unlock(); // always release in finally; never skip this
        }
    }
}

Key points:

ConcurrentHashMap gives you one lock per file path — threads processing different files don't block each other, only threads processing the same file do
computeIfAbsent is atomic — no two threads will create two locks for the same path
The finally block guarantees the lock is released even if an observer throws an exception
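
To check the first point empirically, the sketch below submits many events for one file to a 4-thread pool and records the highest number of handlers that were ever running at once. It is a hypothetical test harness, not part of the watcher: PerPathLockSketch, the "inbox" directory name, and the 2 ms sleep that simulates observer work are all assumptions. Each task builds a fresh Path instance on purpose; the map lookup still finds the same lock because Path equality is value-based.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical harness: measures how many handlers for one path ever overlap. */
final class PerPathLockSketch {
    private static final Map<Path, ReentrantLock> fileLocks = new ConcurrentHashMap<>();

    /** Runs `events` handlers for one file on 4 threads; returns the peak number running at once. */
    static int peakConcurrency(String fileName, int events) throws InterruptedException {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (int i = 0; i < events; i++) {
            executor.submit(() -> {
                Path path = Path.of("inbox", fileName); // new instance per event, equal by value
                ReentrantLock lock = fileLocks.computeIfAbsent(path, p -> new ReentrantLock());
                lock.lock();
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(2); // simulate observer work
                    inFlight.decrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak handlers for one path: " + peakConcurrency("a.csv", 20)); // prints 1
    }
}
```

One thing the map does not do is shrink: every distinct path seen leaves a lock behind, so a long-running watcher over a busy directory may want some form of eviction.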

Why Not Just Use synchronized?

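A synchronized block can do the same job, but not on the path object itself: directory.resolve(...) returns a new Path instance for every event, so two threads handling the same file would synchronize on different monitors and never exclude each other. You would still need a shared per-path monitor from a map, and at that point ReentrantLock gives you more control. You can use tryLock() to skip processing if the lock is already held (useful if you want to drop duplicate events rather than queue them), and it's more explicit about what you're protecting.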

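// Skip instead of queue: useful for high-frequency file events
ReentrantLock lock = getLockForPath(path);
if (lock.tryLock()) { // returns immediately; false if another thread holds the lock
    try {
        notifyObservers(path); // re-acquires the same lock, which is fine because it is reentrant
    } finally {
        lock.unlock();
    }
} else {
    System.out.println("Skipping duplicate event for: " + path);
}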
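
The skip behaviour is easy to verify in isolation. In this hypothetical sketch (TryLockSkipSketch and its latches are illustration only), one thread holds the lock as if it were mid-processing, and a second thread's tryLock() is refused straight away instead of waiting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: tryLock() from another thread fails while the lock is held. */
final class TryLockSkipSketch {

    /** Returns true if the second caller was refused the lock and would skip the event. */
    static boolean secondEventSkipped() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread first = new Thread(() -> {
            lock.lock();
            try {
                held.countDown(); // signal: the lock is now held
                release.await();  // keep "processing" until told to stop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        first.start();
        held.await();

        boolean acquired = lock.tryLock(); // different thread, so this returns false immediately
        if (acquired) {
            lock.unlock();
        }
        release.countDown();
        first.join();
        return !acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("skipped: " + secondEventSkipped()); // prints "skipped: true"
    }
}
```

Skipping trades completeness for throughput: if the dropped event was the final write to the file, nothing re-triggers processing, so this fits handlers that re-read the whole file better than handlers that depend on seeing every event.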

The Full Picture

File system event
↓
WatchService
↓
Thread pool (4 threads)
↓
notifyObservers(path)
↓
ReentrantLock (per path)
↓
Observers notified safely

The Observer pattern keeps your handlers decoupled and easy to extend. The per-path locking ensures concurrent events on the same file are serialised without bottlenecking events on different files.

This pattern came up directly in production work — building file ingestion pipelines where multiple events on the same file within milliseconds of each other would otherwise cause partial reads and corrupt downstream processing.

I write about backend Java engineering, Spring Boot, and systems design. Follow for more.

What I learned after 80+ installs of my first JetBrains plugin

Godwill Christopher — Mon, 27 Apr 2026 12:31:15 +0000

After publishing my first JetBrains plugin, I expected the main challenge to be getting users.

It wasn’t.

The Real Problem Wasn’t Features

Within the first few days, the plugin got ~80 installs.

But something felt off.

No feedback.
No reviews.
No real engagement.

Turns out, the issue wasn’t missing features.

It was something much simpler:

👉 The tool window wasn’t showing after installation.

Which meant users installed the plugin…
…and couldn’t actually use it.

Fixing One Small Issue Changed Everything

After fixing that (v1.0.3), things immediately improved.

People could finally:

open the tool
load datasets
actually try the features

It was a reminder that:

The first-use experience matters more than feature depth.

What I Built (Quick Context)

The plugin is a small tool for preprocessing data directly inside JetBrains IDEs.

Instead of:

jumping to Excel or Jupyter
writing quick pandas scripts
switching back to the IDE

You can:

load CSV, Excel, or JSON files
inspect and clean data visually
generate pandas code from your steps

Early Feedback

Once people could actually use it, I started getting real feedback.

One suggestion stood out:

👉 “Add Excel-style sorting and filtering to the preview”

It makes sense.

A lot of the workflow is just:

exploring the dataset
spotting issues
understanding structure

Improving that part would make the tool much more useful.

What Changed Next

Based on feedback, I’ve already:

fixed the onboarding issue (tool window visibility)
added Excel (.xlsx) and JSON support (v1.1.0)

And next up:

improving the data preview experience

What I Learned

A few takeaways from this:

1. Installs don’t mean usage

People can install your tool and never actually use it.

2. Small UX issues can block everything

A missing UI element completely stopped users from getting value.

3. Feedback is everything early on

One good user can give you direction for your next version.

Still Early

This is still an early version, and I’m actively improving it.

If you work with data preprocessing, ETL pipelines, or just deal with messy datasets regularly, I’d really appreciate your thoughts.

👉 https://plugins.jetbrains.com/plugin/31226-data-preprocessor/

Even small feedback helps at this stage.

I got tired of rewriting the same pandas preprocessing code — so I built a plugin

Godwill Christopher — Tue, 21 Apr 2026 21:56:26 +0000

If you work with CSV data, you’ve probably written this code more times than you’d like:

dropna()
fillna()
removing duplicates
basic outlier filtering
normalizing columns

None of it is particularly difficult.

But it’s repetitive.

The Problem

As a backend engineer working with data pipelines, I kept running into the same pattern.

Before doing anything meaningful with a dataset, I’d spend time writing the same preprocessing logic just to get the data into a usable state.

It wasn’t the hardest part of the job—but it was always there.

And it always slowed things down.

What I Noticed

The issue isn’t complexity.

It’s repetition.

You already know what needs to be done:

clean missing values
remove duplicates
normalize data
filter outliers

But you still have to write it. Every time.

My Usual Workflow

Most of the time, I’d:

copy snippets from previous projects
reuse old notebooks
write quick pandas scripts

It works—but it’s not efficient.

Especially when you just want to:

👉 quickly inspect a dataset
👉 apply basic transformations
👉 move on to actual analysis or pipeline logic

So I Tried Something Different

Instead of writing the same code over and over, I started experimenting with doing preprocessing directly inside the IDE.

That led me to build a small JetBrains plugin.

The idea is simple:

Load a CSV file inside the IDE
Apply common preprocessing steps visually
Generate ready-to-run pandas code from those actions

What It Handles

Right now, it supports things like:

Column profiling (types, null counts, stats)
Handling missing values (drop, fill with mean/median/mode/custom)
Removing duplicates
Outlier detection (IQR-based)
Normalization (Min-Max, Z-score)
Type casting

And the part I find most useful:

👉 it generates clean pandas code based on what you do

So you still end up with code you can use in scripts, pipelines, or notebooks.

Why This Helped Me

For me, this made it much faster to go from:

raw data → cleaned dataset → usable code

without constantly switching context or rewriting boilerplate.

Still Early

This is still an early version, and I’m actively improving it based on feedback.

If you work with data preprocessing, ETL pipelines, or just deal with CSVs often, I’d really appreciate your thoughts.

👉 https://plugins.jetbrains.com/plugin/31226-data-preprocessor/

Even small feedback like:

what feels clunky
what’s missing
what you’d expect

would be really helpful.

Curious About Your Workflow

How do you currently handle preprocessing?

Do you just write pandas scripts each time?
Use templates?
Have your own utilities?

Would be interesting to hear how others approach this.