.parallelStream() everywhere? Not always a good idea.
Have you ever thought:
“What if I sprinkle .parallelStream() across all the layers of my code? Everything will run faster, right?”
Yeah. I thought that too. Spoiler: it didn’t. 😅
Recently, while optimizing the processing and transformation of millions of in-memory objects (yay CPU pressure), I ended up nesting a few levels of .parallelStream(). When I checked my metrics, the CPU was melting. My hypothesis? The tasks were all fighting over the same thread pool: the common ForkJoinPool.
If you're as curious as I was to understand why, keep reading. In this post, we’ll look at some benchmarks that show what happens when you go overboard with .parallelStream().
I’ll also show you a better approach — and why it works.
A complex problem: too many layers, too much parallelism
Imagine code that processes data in multiple layers:
- 10 outer groups (e.g. regions)
- 100 middle groups (e.g. warehouses)
- 100 inner items (e.g. products)
Each item performs a CPU-intensive calculation and may allocate memory during processing.
My first idea was to parallelize everything:
outer.parallelStream().forEach(o ->
    middle.parallelStream().forEach(m ->
        inner.parallelStream().forEach(this::process)
    )
);
Looks ideal, right? But when I ran proper benchmarks...
🚨 Performance dropped. And variance exploded.
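One way to see why nesting does not buy extra threads is to record which threads actually do the work. The sketch below is my own illustration, not part of the benchmark: the class name `CommonPoolDemo` and the 8×8 loop sizes are made up. It only relies on the standard behavior that parallel streams run on the common ForkJoinPool unless they are started from inside another pool.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Illustrative only: class name and loop sizes are assumptions, not the benchmark code.
class CommonPoolDemo {

    // Runs a two-level nested parallel stream and records the name of every
    // thread that executed an inner task.
    static Set<String> workerNames() {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, 8).parallel().forEach(o ->
            IntStream.range(0, 8).parallel().forEach(i ->
                names.add(Thread.currentThread().getName())));
        return names;
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: "
            + ForkJoinPool.commonPool().getParallelism());
        // Both levels draw from the same pool: expect the calling thread
        // plus ForkJoinPool.commonPool-worker-N names, nothing else.
        System.out.println("threads used: " + workerNames());
    }
}
```

Both the outer and the inner stream show up under the same worker names, so the inner `.parallel()` calls only add task-splitting overhead on top of a pool that is already busy.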
⚙️ How the experiment was run
To measure correctly, I used JMH (Java Microbenchmark Harness) — the go-to tool in the Java community for reliable benchmarks.
I simulated a data hierarchy:
- outerSize: regions
- middleSize: warehouses
- innerSize: products
Each combination generated a task with CPU load and memory allocation.
📊 What we compared
I implemented three variations of the same processing logic:
| Technique | Description |
|---|---|
| nestedParallelStreams() | Parallelizes every layer (overkill) |
| flattenedParallelStream() | Only the outer layer is parallel |
| singleParallelStream() | Flattens everything into one list, then parallelizes once |
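To make the three shapes concrete, here is a minimal sketch of how they could look. It is not the benchmark code: `ParallelShapes`, `Task`, `outerOnly`, and the toy `process` method are hypothetical names I am using for illustration, and the work is reduced to a single `Math.sqrt()` so all three variants can be checked against each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

// Illustrative only: all names here are assumptions, not the benchmark's API.
class ParallelShapes {

    // Hypothetical task descriptor: one (region, warehouse, product) combination.
    record Task(int outer, int middle, int inner) {}

    // Hypothetical stand-in for the real CPU-bound work.
    static long process(Task t) {
        return (long) Math.sqrt(t.outer() * 1_000_000.0 + t.middle() * 1_000.0 + t.inner());
    }

    // Shape 1: every layer is parallel.
    static long nested(int outer, int middle, int inner) {
        return IntStream.range(0, outer).parallel().mapToLong(o ->
            IntStream.range(0, middle).parallel().mapToLong(m ->
                IntStream.range(0, inner).parallel()
                    .mapToLong(i -> process(new Task(o, m, i)))
                    .sum()
            ).sum()
        ).sum();
    }

    // Shape 2: only the outer layer is parallel; the inner loops are sequential.
    static long outerOnly(int outer, int middle, int inner) {
        return IntStream.range(0, outer).parallel().mapToLong(o -> {
            long sum = 0;
            for (int m = 0; m < middle; m++) {
                for (int i = 0; i < inner; i++) {
                    sum += process(new Task(o, m, i));
                }
            }
            return sum;
        }).sum();
    }

    // Shape 3, step 1: flatten the whole hierarchy into one list of tasks.
    static List<Task> flatten(int outer, int middle, int inner) {
        List<Task> tasks = new ArrayList<>(outer * middle * inner);
        for (int o = 0; o < outer; o++) {
            for (int m = 0; m < middle; m++) {
                for (int i = 0; i < inner; i++) {
                    tasks.add(new Task(o, m, i));
                }
            }
        }
        return tasks;
    }

    // Shape 3, step 2: a single parallel stream over the flat list.
    static long single(List<Task> tasks) {
        return tasks.parallelStream().mapToLong(ParallelShapes::process).sum();
    }

    public static void main(String[] args) {
        System.out.println(nested(10, 20, 5));
        System.out.println(outerOnly(10, 20, 5));
        System.out.println(single(flatten(10, 20, 5)));
    }
}
```

All three compute the same total; they differ only in how many times the work is split. With a flat list, the framework sees the full task count up front and can split it evenly in one pass.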
🧪 How we simulated the workload
Each task performs:
- Math operations with Math.sqrt() (CPU pressure)
- String concatenation
- Creation of intermediate lists (a bit of heap pressure)
record ComplexObject(String name, int value, byte[] payload) {
    ComplexObject(String name, int value) {
        this(name, value, new byte[1024]); // Simulates memory weight
    }
}
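Putting those three ingredients together, a task body could look roughly like the sketch below. The `process` method, its parameters, and the `WorkloadSketch` class are hypothetical; the real benchmark may combine the pieces differently. Only the `ComplexObject` record is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: WorkloadSketch and process(...) are assumptions.
class WorkloadSketch {

    record ComplexObject(String name, int value, byte[] payload) {
        ComplexObject(String name, int value) {
            this(name, value, new byte[1024]); // Simulates memory weight
        }
    }

    // Hypothetical task body: builds an intermediate list of `iterations` objects.
    static List<ComplexObject> process(int seed, int iterations) {
        List<ComplexObject> results = new ArrayList<>(iterations); // intermediate list
        for (int k = 0; k < iterations; k++) {
            double root = Math.sqrt(seed * 1000.0 + k);            // CPU pressure
            String name = "item-" + seed + "-" + k;                // string concatenation
            results.add(new ComplexObject(name, (int) root));      // 1 KiB payload each
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(process(4, 3).size());
    }
}
```

Every call allocates `iterations` payloads of 1 KiB, so with hundreds of thousands of tasks the garbage collector gets plenty to do alongside the math.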
🕳️ Why we used Blackhole
This was new to me. JMH provides the Blackhole object to prevent the JIT compiler from optimizing away benchmark code.
Without Blackhole, the compiler might notice you're not using the result of a function — and just skip executing it. That would ruin the experiment.
blackhole.consume(results); // Ensures results are “used”
▶️ How to run the benchmark
You can run it using the built-in main() method:
./gradlew run
Or directly with:
./gradlew jmh
The parameters (outerSize, middleSize, innerSize) are declared with @Param and can be overridden from the command line or changed directly in the class.
5 hours of benchmarking later...
🔍 What the results showed
✅ 1. Single-layer parallelism is more efficient
Configuration: (100, 50, 5) → 25,000 tasks
| Technique | Avg Time (ms) |
|---|---|
| singleParallelStream | 3,233 |
| flattenedParallelStream | 3,546 |
| nestedParallelStreams | 3,972 |
💡 Deep parallelism did not help. It just caused more overhead.
⚠️ 2. At scale, nested becomes chaos
Configuration: (500, 100, 10) → 500,000 tasks
| Technique | Avg Time (ms) | Std Dev (ms) |
|---|---|---|
| nestedParallelStreams | 69,486 | ±106,638 😱 |
| flattenedParallelStream | 78,037 | ±44,430 |
| singleParallelStream | 75,201 | ±63,081 |
💣 nestedParallelStreams posted the lowest average, but its standard deviation (±106,638 ms) is larger than the average itself (69,486 ms), so that “win” is mostly noise. The spread shows how unstable the system became, probably due to GC thrashing or thread contention.
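A quick way to compare the stability of the three runs is to divide each standard deviation by its average (the coefficient of variation). The snippet below is just that arithmetic on the numbers from the table above; `SpreadCheck` and `relativeSpread` are names I made up for the example.

```java
// Illustrative only: helper names are assumptions; the inputs come from the table above.
class SpreadCheck {

    // Standard deviation as a fraction of the mean (coefficient of variation).
    static double relativeSpread(double meanMs, double stdDevMs) {
        return stdDevMs / meanMs;
    }

    public static void main(String[] args) {
        System.out.println("nested:    " + relativeSpread(69_486, 106_638));
        System.out.println("flattened: " + relativeSpread(78_037, 44_430));
        System.out.println("single:    " + relativeSpread(75_201, 63_081));
    }
}
```

That works out to roughly 1.5 for nested, 0.6 for flattened, and 0.8 for single. Anything above 1 means the noise is bigger than the measurement, and only the nested variant crosses that line.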
✅ Takeaway: parallelize with care
What we learned:
- 🔹 Parallelize once, at the outermost layer
- 🔹 Avoid nested .parallelStream()
- 🔹 Benchmarks reveal what feels fast but isn’t
- 🔹 More .parallelStream() ≠ better performance
✌️ Bonus: my personal lesson
“I thought I was optimizing. I was just confusing the scheduler.”
My hunch turned out to be right. And if you're curious, you can replicate this with smaller experiments — even simpler setups show that nestedParallelStreams adds overhead across the board.
🔗 Full code
You can find the full code here.
Just make sure to add the JMH dependency.
Clone it, run it, tweak the parameters — and see for yourself!
