Shridhar Pandey


The ‘Missing Middle’ of Data Processing in Java (10M Rows in ~40s)


I’ve always found it strange how quickly developers leave the Java ecosystem when dealing with data processing.

If your data fits comfortably in memory, Java Streams work great. If you're processing massive datasets (50GB+), tools like Spark make sense. But what about everything in between?

What about:

  • a 300MB CSV
  • a nested JSON file
  • a one-off transformation you need to run locally

Not big enough for distributed systems. Too big for naive in-memory approaches.

This is the space I think is underserved: the "missing middle" of data processing.


The Problem I Kept Running Into

Every time I tried handling mid-sized datasets in Java, I hit the same wall:

  • Load everything into memory → OutOfMemoryError
  • Use Streams → elegant, but still memory-bound
  • Switch to Python/Pandas → works, but now I’ve left the JVM ecosystem entirely

That tradeoff didn’t sit right with me.

So I started exploring a different approach:

What if we treated data processing as a streaming pipeline instead of an in-memory transformation?
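To make that concrete, here is a minimal plain-Java sketch of the streaming mindset (the file name and column layout are invented for illustration): `Files.lines` pulls the CSV through lazily, so only one row is resident in memory at a time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class StreamingSum {
    // Sums the "amount" column (index 1) of a CSV row-by-row.
    // Files.lines streams the file lazily, so heap usage stays flat
    // regardless of file size.
    static double sumAmounts(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines
                .skip(1)                                          // header row
                .mapToDouble(l -> Double.parseDouble(l.split(",")[1]))
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tx", ".csv");
        Files.write(tmp, List.of("id,amount", "1,10.5", "2,4.5"));
        System.out.println(sumAmounts(tmp)); // prints 15.0
        Files.delete(tmp);
    }
}
```

The same shape scales from three rows to ten million, because the terminal `sum()` consumes the stream incrementally rather than materializing it.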


The Goal

I set a simple constraint:

Process ~10 million records (~300MB CSV) in under 40 seconds, on a single JVM, without blowing up memory.


The Approach: Streaming + Lazy Evaluation

Instead of loading data into a List<Row>, I built a pipeline that processes data row-by-row.

At a high level:

  • Operations are represented as a DAG (Directed Acyclic Graph)
  • Execution is lazy
  • Data flows through a pipeline, not into memory

This keeps memory usage close to constant (O(1)) for streaming transformations.

(Important caveat: operations like groupBy and merge still require state and are not O(1) memory.)
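As a rough illustration of the lazy-composition idea (a generic sketch, not PureStream's actual internals), each operation can simply wrap the previous stage's stream supplier, so no data is touched until a terminal call runs:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal lazy pipeline: intermediate operations only compose suppliers;
// the underlying data is untouched until a terminal operation executes.
public class LazyPipeline<T> {
    private final Supplier<Stream<T>> source;

    private LazyPipeline(Supplier<Stream<T>> source) {
        this.source = source;
    }

    public static <T> LazyPipeline<T> of(Supplier<Stream<T>> source) {
        return new LazyPipeline<>(source);
    }

    public LazyPipeline<T> filter(Predicate<T> p) {
        return new LazyPipeline<>(() -> source.get().filter(p)); // composes, doesn't run
    }

    public <R> LazyPipeline<R> map(Function<T, R> f) {
        return new LazyPipeline<>(() -> source.get().map(f));    // composes, doesn't run
    }

    public List<T> toList() {                                    // terminal: executes the chain
        try (Stream<T> s = source.get()) {
            return s.collect(Collectors.toList());
        }
    }
}
```

`LazyPipeline.of(() -> Stream.of(1, 2, 3)).filter(x -> x > 1).map(x -> x * 10).toList()` evaluates to `[20, 30]`, and nothing executes until `toList()` is called. The same pattern generalizes to a DAG when a stage can feed multiple downstream consumers.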


What Didn’t Work (and Why)

This part took longer than expected.

Some early approaches completely failed:

  • Naive in-memory loading → instant OOM on large files
  • Eager evaluation → unnecessary intermediate objects, heavy GC pressure
  • Simple grouping logic → memory spikes that killed performance

The biggest realization:

The problem isn’t just data size; it’s when and how you materialize it.


A Small Experiment: PureStream

This exploration led me to build a small library: PureStream.

Not as a Spark replacement. Not as a “framework”.

Just a lightweight way to experiment with streaming-first data pipelines in Java.

What it focuses on:

  • Zero external dependencies (Java 17+)
  • Streaming-first transformations
  • Familiar, fluent API (inspired by Streams)
  • Basic CSV and JSON handling (with JSON flattening via dot notation)

Example

PureStream.fromCsv("transactions.csv", true)
    .filter(row -> row.getDouble("amount") > 1000.0)
    .groupBy("region")
    .agg(builder -> builder.sum("amount").count("id"))
    .orderBy(sort -> sort.descDouble("sum_amount"))
    .toJsonFile("report.json");
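On the JSON-flattening point mentioned above, the dot-notation idea can be sketched like this (a generic illustration of the technique, not PureStream's implementation; parsing is skipped by starting from an already-parsed map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Flattens a nested map into dot-notation keys,
// e.g. {user={name=x}} becomes {user.name=x}.
public class JsonFlatten {
    static Map<String, Object> flatten(Map<String, Object> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", nested, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flatten(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) e.getValue(), out); // recurse into nesting
            } else {
                out.put(key, e.getValue());                            // leaf: emit flat key
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(flatten(Map.of("user", Map.of("name", "x")))); // prints {user.name=x}
    }
}
```

Flat keys are what let nested JSON flow through the same row-oriented operations (`filter`, `groupBy`, etc.) as CSV columns.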

Benchmark Context

Tested on a machine with:

  • Java 17
  • 16GB RAM
  • 256GB SSD storage

Dataset:

  • ~10 million rows (~300MB CSV)

Result:

  • ~40 seconds end-to-end processing

This isn’t meant to be a rigorous benchmark, just a sanity check that the approach is viable.


Where It Breaks

This approach isn’t perfect, and I’m still exploring its limits:

  • groupBy and merge require memory (no magic here)
  • Current joins use a hash-based approach → not scalable for very large datasets
  • Performance depends heavily on disk I/O
  • API is still evolving

One area I’m particularly interested in next:

Implementing an external sort-merge join to handle large joins with limited memory
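For context, the merge phase of a sort-merge join might look like the sketch below. This is a simplified in-memory version with placeholder row shapes (`String[]` of key plus one value); the external variant would first sort chunks of each input to disk, then run this same merge over the sorted runs.

```java
import java.util.ArrayList;
import java.util.List;

// Merge phase of a sort-merge join: both inputs must already be
// sorted by the join key (rows are {key, value} pairs here).
public class SortMergeJoin {
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;          // left key is behind: advance left
            else if (cmp > 0) j++;     // right key is behind: advance right
            else {
                // Keys match: emit the cross product of both matching runs.
                String key = left.get(i)[0];
                int jStart = j;
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (int k = jStart; k < right.size() && right.get(k)[0].equals(key); k++) {
                        out.add(key + ":" + left.get(i)[1] + "," + right.get(k)[1]);
                    }
                    i++;
                }
                while (j < right.size() && right.get(j)[0].equals(key)) j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[]{"a", "Alice"}, new String[]{"b", "Bob"});
        List<String[]> orders = List.of(new String[]{"a", "order-1"}, new String[]{"c", "order-2"});
        System.out.println(join(users, orders)); // prints [a:Alice,order-1]
    }
}
```

The appeal over a hash join is that memory is bounded by the sort buffer plus the current key run, not by the size of the smaller input.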


The Bigger Question

I don’t think tools like Spark are overkill.

I think we’re missing a simpler layer for everyday data tasks, something between:

  • Java Streams
  • and full distributed systems

Maybe this idea already exists and I’ve missed it.
Maybe it’s not as useful as I think.


If You’re Curious

Code is here if you want to explore or break it:


Open Question

How do you currently handle mid-sized datasets in Java?

Do you:

  • stick with Streams and hope it fits in memory
  • switch ecosystems (Python, Spark, etc.)
  • or use something else entirely

I’m curious if this “missing middle” is a real problem, or just something I’ve personally run into.

Top comments (1)

Shridhar Pandey

Hey everyone,
Happy to dive deeper into the architecture.
Curious what people think about the “missing middle” idea.
Does this actually solve a real problem, or am I overthinking it?