Hugo Marques

Nested parallelStream(): More Concurrency, Less Performance

.parallelStream() everywhere? Not always a good idea.

Have you ever thought:

“What if I sprinkle .parallelStream() across all the layers of my code? Everything will run faster, right?”

Yeah. I thought that too. Spoiler: it didn’t. 😅

Recently, while optimizing the processing and transformation of millions of in-memory objects (yay CPU pressure), I ended up nesting a few levels of .parallelStream(). When I checked my metrics, the CPU was melting. My hypothesis? The tasks were fighting over the same thread pool: the common ForkJoinPool.
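
(Quick context: any parallel stream that isn't given its own executor runs on ForkJoinPool.commonPool(), a single JVM-wide pool sized at roughly "number of cores minus one". You can check yours with a tiny snippet like this; it's my own sketch, not part of the benchmark.)

import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {
    public static void main(String[] args) {
        // Every parallelStream() without a custom executor shares this single pool
        System.out.println("CPU cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}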

If you're as curious as I was to understand why, keep reading. In this post, we’ll look at some benchmarks that show what happens when you go overboard with .parallelStream().

I’ll also show you a better approach — and why it works.


A complex problem: too many layers, too much parallelism

Imagine code that processes data in multiple layers:

  • 10 outer groups (e.g. regions)
  • 100 middle groups (e.g. warehouses)
  • 100 inner items (e.g. products)

Each item performs a CPU-intensive calculation and may allocate memory during processing.

My first idea was to parallelize everything:

outer.parallelStream().forEach(o ->
    middle.parallelStream().forEach(m ->
        inner.parallelStream().forEach(this::process)
    )
);

Looks ideal, right? But when I ran proper benchmarks...

🚨 Performance dropped. And variance exploded.


⚙️ How the experiment was run

To measure correctly, I used JMH (Java Microbenchmark Harness) — the go-to tool in the Java community for reliable benchmarks.

I simulated a data hierarchy:

  • outerSize: regions
  • middleSize: warehouses
  • innerSize: products

Each combination generated a task with CPU load and memory allocation.
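
In code, the setup looks roughly like this. It's my reconstruction, not the original benchmark (the generateData name and the item naming are assumptions; ComplexObject is the record shown further down):

import java.util.List;
import java.util.stream.IntStream;

// Hypothetical data setup: regions -> warehouses -> products
List<List<List<ComplexObject>>> generateData(int outerSize, int middleSize, int innerSize) {
    return IntStream.range(0, outerSize)
        .mapToObj(o -> IntStream.range(0, middleSize)
            .mapToObj(m -> IntStream.range(0, innerSize)
                .mapToObj(i -> new ComplexObject("item-" + o + "-" + m + "-" + i, i))
                .toList())
            .toList())
        .toList();
}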


📊 What we compared

I implemented three variations of the same processing logic:

Technique                    Description
nestedParallelStreams()      Parallelizes every layer (overkill)
flattenedParallelStream()    Only the outer layer is parallel
singleParallelStream()       Flattens everything into one list, then parallelizes once
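
To make the comparison concrete, here's roughly what the three variants look like. This is a sketch based on the descriptions above, not the exact benchmark code (process(ComplexObject) is the per-item work shown in the next section):

// 1. Nested: every layer opens its own parallel stream (overkill)
void nestedParallelStreams(List<List<List<ComplexObject>>> regions) {
    regions.parallelStream().forEach(warehouses ->
        warehouses.parallelStream().forEach(products ->
            products.parallelStream().forEach(this::process)));
}

// 2. Flattened: only the outer layer is parallel; the inner loops stay sequential
void flattenedParallelStream(List<List<List<ComplexObject>>> regions) {
    regions.parallelStream().forEach(warehouses ->
        warehouses.forEach(products ->
            products.forEach(this::process)));
}

// 3. Single: flatten everything into one stream, then parallelize exactly once
void singleParallelStream(List<List<List<ComplexObject>>> regions) {
    regions.stream()
           .flatMap(List::stream)   // -> Stream<List<ComplexObject>>
           .flatMap(List::stream)   // -> Stream<ComplexObject>
           .parallel()
           .forEach(this::process);
}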

🧪 How we simulated the workload

Each task performs:

  • Math operations with Math.sqrt() (CPU pressure)
  • String concatenation
  • Creation of intermediate lists (a bit of heap pressure)
record ComplexObject(String name, int value, byte[] payload) {
    ComplexObject(String name, int value) {
        this(name, value, new byte[1024]); // Simulates memory weight
    }
}
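
The per-item work might look something like this: a sketch of the described workload (Math.sqrt, string concatenation, a throwaway list), with iteration counts that are purely my assumption:

import java.util.ArrayList;
import java.util.List;

// Hypothetical per-item task matching the workload described above
String process(ComplexObject obj) {
    double acc = 0;
    for (int i = 1; i <= 1_000; i++) {
        acc += Math.sqrt(obj.value() * (double) i); // CPU pressure
    }
    List<String> intermediate = new ArrayList<>();  // a bit of heap pressure
    for (int i = 0; i < 10; i++) {
        intermediate.add(obj.name() + "-" + i);     // string concatenation
    }
    return obj.name() + ":" + acc + ":" + intermediate.size();
}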

🕳️ Why we used Blackhole

This was new to me. JMH provides the Blackhole object to prevent the JIT compiler from optimizing away benchmark code.

Without Blackhole, the compiler might notice you're not using the result of a function — and just skip executing it. That would ruin the experiment.

blackhole.consume(results); // Ensures results are “used”
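
Put together, a benchmark method ends up looking something like this (a sketch: allItems and the method body are my assumptions, but @Benchmark and Blackhole are the real JMH API):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

@Benchmark
public void singleParallelStream(Blackhole blackhole) {
    // allItems is the pre-flattened List<ComplexObject> built during setup
    List<String> results = allItems.parallelStream()
        .map(this::process)
        .toList();
    blackhole.consume(results); // keeps the JIT from discarding the work
}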

▶️ How to run the benchmark

You can run it using the built-in main() method:

./gradlew run

Or directly with:

./gradlew jmh

The parameters (outerSize, middleSize, etc.) are controlled by @Param and can be overridden from the command line or hardcoded in the class.
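
For reference, the @Param setup looks something like this (the class name and defaults here are illustrative; the defaults mirror the first configuration below):

import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class NestedParallelStreamBenchmark {

    @Param({"100"})
    public int outerSize;   // regions

    @Param({"50"})
    public int middleSize;  // warehouses

    @Param({"5"})
    public int innerSize;   // products
}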


5 hours of benchmarking later...



🔍 What the results showed

✅ 1. Single-layer parallelism is more efficient

Configuration: (100, 50, 5) → 25,000 tasks

Technique                  Avg Time (ms)
singleParallelStream       3,233
flattenedParallelStream    3,546
nestedParallelStreams      3,972

💡 Deep parallelism did not help. It just caused more overhead.


⚠️ 2. At scale, nested becomes chaos

Configuration: (500, 100, 10) → 500,000 tasks

Technique                  Avg Time (ms)   Std Dev (ms)
nestedParallelStreams      69,486          ±106,638 😱
flattenedParallelStream    78,037          ±44,430
singleParallelStream       75,201          ±63,081

💣 nestedParallelStreams seemed “faster” in one run, but the huge standard deviation shows how unstable the system became — probably due to GC thrashing or thread contention.


✅ Takeaway: parallelize with care

What we learned:

  • 🔹 Parallelize once, at the outermost layer
  • 🔹 Avoid nested .parallelStream()
  • 🔹 Benchmarks reveal what feels fast but isn’t
  • 🔹 More .parallelStream() ≠ better performance

✌️ Bonus: my personal lesson

“I thought I was optimizing. I was just confusing the scheduler.”

My hunch turned out to be right. And if you're curious, you can replicate this with smaller experiments — even simpler setups show that nestedParallelStreams adds overhead across the board.


🔗 Full code

You can find the full code here.

Just make sure to add the JMH dependency.

Clone it, run it, tweak the parameters — and see for yourself!

