.parallelStream() everywhere? Not always a good idea.
Have you ever thought: “What if I sprinkle .parallelStream() across all the layers of my code? Everything will run faster, right?”
Yeah. I thought that too. Spoiler: it didn’t. 😅
Recently, while optimizing the processing and transformation of millions of in-memory objects (yay CPU pressure), I ended up nesting a few levels of parallelStream. When I checked my metrics, the CPU was melting. My hypothesis? The tasks were fighting over the same thread pool: the common ForkJoinPool.
If you're as curious as I was to understand why, keep reading. In this post, we’ll look at some benchmarks that show what happens when you go overboard with .parallelStream().
I’ll also show you a better approach, and why it works.
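A quick way to see why that contention happens: unless you route work through a custom pool, every .parallelStream() in the JVM draws its workers from the single common ForkJoinPool, which by default is sized at roughly the number of available cores minus one. This small diagnostic (my own snippet, not part of the benchmark) prints that size:
```java
import java.util.concurrent.ForkJoinPool;

// Diagnostic only: shows how many worker threads all parallel streams share.
public class CommonPoolInfo {
    public static void main(String[] args) {
        System.out.println("Available processors:    " + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
        // The default can be changed with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
    }
}
```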
A complex problem: too many layers, too much parallelism
Imagine code that processes data in multiple layers:
- 10 outer groups (e.g. regions)
- 100 middle groups (e.g. warehouses)
- 100 inner items (e.g. products)
Each item performs a CPU-intensive calculation and may allocate memory during processing.
My first idea was to parallelize everything:
```java
outer.parallelStream().forEach(o ->
    middle.parallelStream().forEach(m ->
        inner.parallelStream().forEach(this::process)
    )
);
```
Looks ideal, right? But when I ran proper benchmarks...
🚨 Performance dropped. And variance exploded.
⚙️ How the experiment was run
To measure correctly, I used JMH (Java Microbenchmark Harness) — the go-to tool in the Java community for reliable benchmarks.
I simulated a data hierarchy:
- outerSize: regions
- middleSize: warehouses
- innerSize: products
Each combination generated a task with CPU load and memory allocation.
📊 What we compared
I implemented three variations of the same processing logic:
| Technique | Description |
|---|---|
| nestedParallelStreams() | Parallelizes every layer (overkill) |
| flattenedParallelStream() | Only the outer layer is parallel |
| singleParallelStream() | Flattens everything into one list, then parallelizes once |
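The nested variant is the snippet shown earlier; for the other two, here is a minimal sketch of what they could look like. The nested-list shape, the ParallelVariants class name, and the exact flattening are my assumptions for illustration; the real implementations live in the repo linked at the end. ComplexObject is the record introduced in the next section.
```java
import java.util.List;

// Sketch only: assumes the hierarchy is modeled as nested lists of ComplexObject.
class ParallelVariants {

    // flattenedParallelStream(): only the outer layer runs in parallel;
    // the middle and inner layers stay sequential.
    void flattenedParallelStream(List<List<List<ComplexObject>>> data) {
        data.parallelStream().forEach(middle ->
                middle.forEach(inner ->
                        inner.forEach(this::process)));
    }

    // singleParallelStream(): flatten everything first, then parallelize
    // exactly once at the outermost level.
    void singleParallelStream(List<List<List<ComplexObject>>> data) {
        data.stream()
            .flatMap(List::stream)
            .flatMap(List::stream)
            .parallel()
            .forEach(this::process);
    }

    void process(ComplexObject item) {
        // the CPU + allocation workload described in the next section
    }
}
```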
🧪 How we simulated the workload
Each task performs:
- Math operations with Math.sqrt() (CPU pressure)
- String concatenation
- Creation of intermediate lists (a bit of heap pressure)
```java
record ComplexObject(String name, int value, byte[] payload) {
    ComplexObject(String name, int value) {
        this(name, value, new byte[1024]); // Simulates memory weight
    }
}
```
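To make those three bullets concrete, a process() method along these lines would produce that kind of load. This is a sketch, assuming the ComplexObject record above is in scope; the iteration counts are arbitrary placeholders, not the values used in the actual benchmark.
```java
import java.util.ArrayList;
import java.util.List;

class WorkloadSketch {
    // Sketch of the workload: sqrt math, string concatenation, an intermediate list.
    String process(ComplexObject obj) {
        double acc = 0;
        for (int i = 1; i <= 1_000; i++) {
            acc += Math.sqrt(obj.value() * (double) i); // CPU pressure
        }
        List<String> parts = new ArrayList<>();         // intermediate list (heap pressure)
        for (int i = 0; i < 10; i++) {
            parts.add(obj.name() + "-" + i);            // string concatenation
        }
        return parts.get(parts.size() - 1) + "@" + acc; // result gets handed to the Blackhole
    }
}
```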
🕳️ Why we used Blackhole
This was new to me. JMH provides the Blackhole object to prevent the JIT compiler from optimizing away benchmark code.
Without Blackhole, the compiler might notice you're not using the result of a function and just skip executing it. That would ruin the experiment.
```java
blackhole.consume(results); // Ensures results are “used”
```
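For context, here is roughly what a benchmark method that hands its result to the Blackhole looks like. The class and workload below are illustrative, not the repo's actual benchmark.
```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class BlackholeExample {

    @Benchmark
    public void parallelStreamBaseline(Blackhole blackhole) {
        List<String> results = IntStream.range(0, 10_000)
                .parallel()
                .mapToObj(i -> "item-" + Math.sqrt(i))
                .collect(Collectors.toList());
        blackhole.consume(results); // the JIT can no longer prove the result is unused
    }
}
```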
▶️ How to run the benchmark
You can run it using the built-in main() method:
```bash
./gradlew run
```
Or directly with:
```bash
./gradlew jmh
```
The parameters (outerSize, middleSize, etc.) are controlled by @Param and can be changed via the command line or hardcoded in the class.
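Putting the pieces together, a benchmark skeleton along these lines shows how @Param and the main() entry point typically fit. The class name and parameter values are illustrative, not the repo's defaults.
```java
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class ParallelStreamBenchmark {

    @Param({"100"})
    int outerSize;

    @Param({"50"})
    int middleSize;

    @Param({"5"})
    int innerSize;

    @Benchmark
    public void nestedParallelStreams() {
        // build the outer/middle/inner hierarchy from the params and process it
    }

    public static void main(String[] args) throws Exception {
        Options options = new OptionsBuilder()
                .include(ParallelStreamBenchmark.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(options).run();
    }
}
```
When the JMH runner is invoked directly, individual parameters can also be overridden on the command line with -p, for example -p outerSize=500.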
5 hours of benchmarking later...
🔍 What the results showed
✅ 1. Single-layer parallelism is more efficient
Configuration: (100, 50, 5) → 25,000 tasks
| Technique | Avg Time (ms) |
|---|---|
| singleParallelStream | 3,233 |
| flattenedParallelStream | 3,546 |
| nestedParallelStreams | 3,972 |
💡 Deep parallelism did not help. It just caused more overhead.
⚠️ 2. At scale, nested becomes chaos
Configuration: (500, 100, 10) → 500,000 tasks
| Technique | Avg Time (ms) | Std Dev (ms) |
|---|---|---|
| nestedParallelStreams | 69,486 | ±106,638 😱 |
| flattenedParallelStream | 78,037 | ±44,430 |
| singleParallelStream | 75,201 | ±63,081 |
💣 nestedParallelStreams seemed “faster” in one run, but the huge standard deviation shows how unstable the system became, probably due to GC thrashing or thread contention.
✅ Takeaway: parallelize with care
What we learned:
- 🔹 Parallelize once, at the outermost layer
- 🔹 Avoid nested .parallelStream()
- 🔹 Benchmarks reveal what feels fast but isn’t
- 🔹 More .parallelStream() ≠ better performance
✌️ Bonus: my personal lesson
“I thought I was optimizing. I was just confusing the scheduler.”
My hunch turned out to be right. And if you're curious, you can replicate this with smaller experiments; even simpler setups show that nestedParallelStreams adds overhead across the board.
🔗 Full code
You can find the full code here.
Just make sure to add the jmh dependency.
Clone it, run it, tweak the parameters, and see for yourself!