Shridhar Pandey


The ‘Missing Middle’ of Data Processing in Java (10M Rows in ~40s)


I’ve always found it strange how quickly developers leave the Java ecosystem when dealing with data processing.

If your data fits comfortably in memory, Java Streams work great. If you're processing massive datasets (50GB+), tools like Spark make sense. But what about everything in between?

What about:

  • a 300MB CSV
  • a nested JSON file
  • a one-off transformation you need to run locally

Not big enough for distributed systems. Too big for naive in-memory approaches.

This is the space I think is underserved: the "missing middle" of data processing.


The Problem I Kept Running Into

Every time I tried handling mid-sized datasets in Java, I hit the same wall:

  • Load everything into memory → OutOfMemoryError
  • Use Streams → elegant, but still memory-bound
  • Switch to Python/Pandas → works, but now I’ve left the JVM ecosystem entirely

That tradeoff didn’t sit right with me.

So I started exploring a different approach:

What if we treated data processing as a streaming pipeline instead of an in-memory transformation?
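To make that concrete, here is a minimal plain-Java sketch of the streaming mindset (the file name and column layout are invented for illustration): `Files.lines` pulls the CSV through lazily, so only one row is resident in memory at a time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class StreamingSum {
    // Sums the "amount" column (index 1) of a CSV row-by-row.
    // Files.lines streams the file lazily, so heap usage stays flat
    // regardless of file size.
    static double sumAmounts(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines
                .skip(1)                                          // header row
                .mapToDouble(l -> Double.parseDouble(l.split(",")[1]))
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tx", ".csv");
        Files.write(tmp, List.of("id,amount", "1,10.5", "2,4.5"));
        System.out.println(sumAmounts(tmp)); // prints 15.0
        Files.delete(tmp);
    }
}
```

The same shape scales from three rows to ten million, because the terminal `sum()` consumes the stream incrementally rather than materializing it.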


The Goal

I set a simple constraint:

Process ~10 million records (~300MB CSV) in under 40 seconds, on a single JVM, without blowing up memory.


The Approach: Streaming + Lazy Evaluation

Instead of loading data into a List<Row>, I built a pipeline that processes data row-by-row.

At a high level:

  • Operations are represented as a DAG (Directed Acyclic Graph)
  • Execution is lazy
  • Data flows through a pipeline, not into memory

This keeps memory usage close to constant (O(1)) for streaming transformations.

(Important caveat: operations like groupBy and merge still require state and are not O(1) memory.)
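As a rough illustration of the lazy-composition idea (a generic sketch, not PureStream's actual internals), each operation can simply wrap the previous stage's stream supplier, so no data is touched until a terminal call runs:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal lazy pipeline: intermediate operations only compose suppliers;
// the underlying data is untouched until a terminal operation executes.
public class LazyPipeline<T> {
    private final Supplier<Stream<T>> source;

    private LazyPipeline(Supplier<Stream<T>> source) {
        this.source = source;
    }

    public static <T> LazyPipeline<T> of(Supplier<Stream<T>> source) {
        return new LazyPipeline<>(source);
    }

    public LazyPipeline<T> filter(Predicate<T> p) {
        return new LazyPipeline<>(() -> source.get().filter(p)); // composes, doesn't run
    }

    public <R> LazyPipeline<R> map(Function<T, R> f) {
        return new LazyPipeline<>(() -> source.get().map(f));    // composes, doesn't run
    }

    public List<T> toList() {                                    // terminal: executes the chain
        try (Stream<T> s = source.get()) {
            return s.collect(Collectors.toList());
        }
    }
}
```

`LazyPipeline.of(() -> Stream.of(1, 2, 3)).filter(x -> x > 1).map(x -> x * 10).toList()` evaluates to `[20, 30]`, and nothing executes until `toList()` is called. The same pattern generalizes to a DAG when a stage can feed multiple downstream consumers.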


What Didn’t Work (and Why)

This part took longer than expected.

Some early approaches completely failed:

  • Naive in-memory loading → instant OOM on large files
  • Eager evaluation → unnecessary intermediate objects, heavy GC pressure
  • Simple grouping logic → memory spikes that killed performance

The biggest realization:

The problem isn’t just data size; it’s when and how you materialize it.


A Small Experiment: PureStream

This exploration led me to build a small library: PureStream.

Not as a Spark replacement. Not as a “framework”.

Just a lightweight way to experiment with streaming-first data pipelines in Java.

What it focuses on:

  • Zero external dependencies (Java 17+)
  • Streaming-first transformations
  • Familiar, fluent API (inspired by Streams)
  • Basic CSV and JSON handling (with JSON flattening via dot notation)

Example

PureStream.fromCsv("transactions.csv", true)
    .filter(row -> row.getDouble("amount") > 1000.0)
    .groupBy("region")
    .agg(builder -> builder.sum("amount").count("id"))
    .orderBy(sort -> sort.descDouble("sum_amount"))
    .toJsonFile("report.json");
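On the JSON-flattening point mentioned above, the dot-notation idea can be sketched like this (a generic illustration of the technique, not PureStream's implementation; parsing is skipped by starting from an already-parsed map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Flattens a nested map into dot-notation keys,
// e.g. {user={name=x}} becomes {user.name=x}.
public class JsonFlatten {
    static Map<String, Object> flatten(Map<String, Object> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", nested, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flatten(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) e.getValue(), out); // recurse into nesting
            } else {
                out.put(key, e.getValue());                            // leaf: emit flat key
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(flatten(Map.of("user", Map.of("name", "x")))); // prints {user.name=x}
    }
}
```

Flat keys are what let nested JSON flow through the same row-oriented operations (`filter`, `groupBy`, etc.) as CSV columns.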

Benchmark Context

Tested on a machine with:

  • Java 17
  • 16GB RAM
  • 256GB SSD storage

Dataset:

  • ~10 million rows (~300MB CSV)

Result:

  • ~40 seconds end-to-end processing

This isn’t meant to be a rigorous benchmark, just a sanity check that the approach is viable.


Where It Breaks

This approach isn’t perfect, and I’m still exploring its limits:

  • groupBy and merge require memory (no magic here)
  • Current joins use a hash-based approach → not scalable for very large datasets
  • Performance depends heavily on disk I/O
  • API is still evolving

One area I’m particularly interested in next:

Implementing an external sort-merge join to handle large joins with limited memory
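For context, the merge phase of a sort-merge join might look like the sketch below. This is a simplified in-memory version with placeholder row shapes (`String[]` of key plus one value); the external variant would first sort chunks of each input to disk, then run this same merge over the sorted runs.

```java
import java.util.ArrayList;
import java.util.List;

// Merge phase of a sort-merge join: both inputs must already be
// sorted by the join key (rows are {key, value} pairs here).
public class SortMergeJoin {
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;          // left key is behind: advance left
            else if (cmp > 0) j++;     // right key is behind: advance right
            else {
                // Keys match: emit the cross product of both matching runs.
                String key = left.get(i)[0];
                int jStart = j;
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (int k = jStart; k < right.size() && right.get(k)[0].equals(key); k++) {
                        out.add(key + ":" + left.get(i)[1] + "," + right.get(k)[1]);
                    }
                    i++;
                }
                while (j < right.size() && right.get(j)[0].equals(key)) j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[]{"a", "Alice"}, new String[]{"b", "Bob"});
        List<String[]> orders = List.of(new String[]{"a", "order-1"}, new String[]{"c", "order-2"});
        System.out.println(join(users, orders)); // prints [a:Alice,order-1]
    }
}
```

The appeal over a hash join is that memory is bounded by the sort buffer plus the current key run, not by the size of the smaller input.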


The Bigger Question

I don’t think tools like Spark are overkill.

I think we’re missing a simpler layer for everyday data tasks, something between:

  • Java Streams
  • and full distributed systems

Maybe this idea already exists and I’ve missed it.
Maybe it’s not as useful as I think.


If You’re Curious

Code is here if you want to explore or break it:


Open Question

How do you currently handle mid-sized datasets in Java?

Do you:

  • stick with Streams and hope it fits in memory
  • switch ecosystems (Python, Spark, etc.)
  • or use something else entirely

I’m curious if this “missing middle” is a real problem, or just something I’ve personally run into.

Top comments (1)

Shridhar Pandey

Hey everyone,
Happy to dive deeper into the architecture.
Curious what people think about the “missing middle” idea.
Does this actually solve a real problem, or am I overthinking it?