Hugo Marques

Nested parallelStream(): More Concurrency, Less Performance

.parallelStream() everywhere? Not always a good idea.

Have you ever thought:

“What if I sprinkle .parallelStream() across all the layers of my code? Everything will run faster, right?”

Yeah. I thought that too. Spoiler: it didn’t. 😅

Recently, while optimizing the processing and transformation of millions of in-memory objects (yay CPU pressure), I ended up nesting a few levels of .parallelStream(). When I checked my metrics, the CPU was melting. My hypothesis? The tasks were fighting over the same thread pool: the common ForkJoinPool.
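
(Quick context: any parallel stream that isn't given its own executor runs on ForkJoinPool.commonPool(), a single JVM-wide pool sized at roughly "number of cores minus one". You can check yours with a tiny snippet like this; it's my own sketch, not part of the benchmark.)

import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {
    public static void main(String[] args) {
        // Every parallelStream() without a custom executor shares this single pool
        System.out.println("CPU cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}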

If you're as curious as I was to understand why, keep reading. In this post, we’ll look at some benchmarks that show what happens when you go overboard with .parallelStream().

I’ll also show you a better approach — and why it works.


A complex problem: too many layers, too much parallelism

Imagine code that processes data in multiple layers:

  • 10 outer groups (e.g. regions)
  • 100 middle groups (e.g. warehouses)
  • 100 inner items (e.g. products)

Each item performs a CPU-intensive calculation and may allocate memory during processing.

My first idea was to parallelize everything:

outer.parallelStream().forEach(o ->
    middle.parallelStream().forEach(m ->
        inner.parallelStream().forEach(this::process)
    )
);

Looks ideal, right? But when I ran proper benchmarks...

🚨 Performance dropped. And variance exploded.


⚙️ How the experiment was run

To measure correctly, I used JMH (Java Microbenchmark Harness) — the go-to tool in the Java community for reliable benchmarks.

I simulated a data hierarchy:

  • outerSize: regions
  • middleSize: warehouses
  • innerSize: products

Each combination generated a task with CPU load and memory allocation.
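
In code, the setup looks roughly like this. It's my reconstruction, not the original benchmark (the generateData name and the item naming are assumptions; ComplexObject is the record shown further down):

import java.util.List;
import java.util.stream.IntStream;

// Hypothetical data setup: regions -> warehouses -> products
List<List<List<ComplexObject>>> generateData(int outerSize, int middleSize, int innerSize) {
    return IntStream.range(0, outerSize)
        .mapToObj(o -> IntStream.range(0, middleSize)
            .mapToObj(m -> IntStream.range(0, innerSize)
                .mapToObj(i -> new ComplexObject("item-" + o + "-" + m + "-" + i, i))
                .toList())
            .toList())
        .toList();
}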


📊 What we compared

I implemented three variations of the same processing logic:

Technique                    Description
nestedParallelStreams()      Parallelizes every layer (overkill)
flattenedParallelStream()    Only the outer layer is parallel
singleParallelStream()       Flattens everything into one list, then parallelizes once
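
To make the comparison concrete, here's roughly what the three variants look like. This is a sketch based on the descriptions above, not the exact benchmark code (process(ComplexObject) is the per-item work shown in the next section):

// 1. Nested: every layer opens its own parallel stream (overkill)
void nestedParallelStreams(List<List<List<ComplexObject>>> regions) {
    regions.parallelStream().forEach(warehouses ->
        warehouses.parallelStream().forEach(products ->
            products.parallelStream().forEach(this::process)));
}

// 2. Flattened: only the outer layer is parallel; the inner loops stay sequential
void flattenedParallelStream(List<List<List<ComplexObject>>> regions) {
    regions.parallelStream().forEach(warehouses ->
        warehouses.forEach(products ->
            products.forEach(this::process)));
}

// 3. Single: flatten everything into one stream, then parallelize exactly once
void singleParallelStream(List<List<List<ComplexObject>>> regions) {
    regions.stream()
           .flatMap(List::stream)   // -> Stream<List<ComplexObject>>
           .flatMap(List::stream)   // -> Stream<ComplexObject>
           .parallel()
           .forEach(this::process);
}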

🧪 How we simulated the workload

Each task performs:

  • Math operations with Math.sqrt() (CPU pressure)
  • String concatenation
  • Creation of intermediate lists (a bit of heap pressure)
record ComplexObject(String name, int value, byte[] payload) {
    ComplexObject(String name, int value) {
        this(name, value, new byte[1024]); // Simulates memory weight
    }
}
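
The per-item work might look something like this: a sketch of the described workload (Math.sqrt, string concatenation, a throwaway list), with iteration counts that are purely my assumption:

import java.util.ArrayList;
import java.util.List;

// Hypothetical per-item task matching the workload described above
String process(ComplexObject obj) {
    double acc = 0;
    for (int i = 1; i <= 1_000; i++) {
        acc += Math.sqrt(obj.value() * (double) i); // CPU pressure
    }
    List<String> intermediate = new ArrayList<>();  // a bit of heap pressure
    for (int i = 0; i < 10; i++) {
        intermediate.add(obj.name() + "-" + i);     // string concatenation
    }
    return obj.name() + ":" + acc + ":" + intermediate.size();
}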

🕳️ Why we used Blackhole

This was new to me. JMH provides the Blackhole object to prevent the JIT compiler from optimizing away benchmark code.

Without Blackhole, the compiler might notice you're not using the result of a function — and just skip executing it. That would ruin the experiment.

blackhole.consume(results); // Ensures results are “used”
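
Put together, a benchmark method ends up looking something like this (a sketch: allItems and the method body are my assumptions, but @Benchmark and Blackhole are the real JMH API):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

@Benchmark
public void singleParallelStream(Blackhole blackhole) {
    // allItems is the pre-flattened List<ComplexObject> built during setup
    List<String> results = allItems.parallelStream()
        .map(this::process)
        .toList();
    blackhole.consume(results); // keeps the JIT from discarding the work
}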

▶️ How to run the benchmark

You can run it using the built-in main() method:

./gradlew run

Or directly with:

./gradlew jmh

The parameters (outerSize, middleSize, etc.) are controlled by @Param and can be overridden from the command line or hardcoded in the class.
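
For reference, the @Param setup looks something like this (the class name and defaults here are illustrative; the defaults mirror the first configuration below):

import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class NestedParallelStreamBenchmark {

    @Param({"100"})
    public int outerSize;   // regions

    @Param({"50"})
    public int middleSize;  // warehouses

    @Param({"5"})
    public int innerSize;   // products
}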


5 hours of benchmarking later...



🔍 What the results showed

✅ 1. Single-layer parallelism is more efficient

Configuration: (100, 50, 5) → 25,000 tasks

Technique                  Avg Time (ms)
singleParallelStream       3,233
flattenedParallelStream    3,546
nestedParallelStreams      3,972

💡 Deep parallelism did not help. It just caused more overhead.


⚠️ 2. At scale, nested becomes chaos

Configuration: (500, 100, 10) → 500,000 tasks

Technique                  Avg Time (ms)   Std Dev (ms)
nestedParallelStreams      69,486          ±106,638 😱
flattenedParallelStream    78,037          ±44,430
singleParallelStream       75,201          ±63,081

💣 nestedParallelStreams seemed “faster” in one run, but the huge standard deviation shows how unstable the system became — probably due to GC thrashing or thread contention.


✅ Takeaway: parallelize with care

What we learned:

  • 🔹 Parallelize once, at the outermost layer
  • 🔹 Avoid nested .parallelStream()
  • 🔹 Benchmarks reveal what feels fast but isn’t
  • 🔹 More .parallelStream() ≠ better performance

✌️ Bonus: my personal lesson

“I thought I was optimizing. I was just confusing the scheduler.”

My hunch turned out to be right. And if you're curious, you can replicate this with smaller experiments — even simpler setups show that nestedParallelStreams adds overhead across the board.


🔗 Full code

You can find the full code here.

Just make sure to add the JMH dependency.

Clone it, run it, tweak the parameters — and see for yourself!

