To understand when to use or avoid parallelStream, it is important to understand the concepts of stateful and stateless operations in Java streams.
Stateful
Whenever an operation in a stream needs to keep state in order to do its work, it is considered a stateful operation. Examples include:
distinct()
sorted()
limit()
skip()
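Your own code can also make a pipeline stateful, for example a lambda that reads or updates a variable outside the stream. When I'm not sure whether a built-in operation is stateful, I check its Javadoc, which says so explicitly.
Be careful with stateful operations in a parallel stream: they may need to buffer a lot of data or make several passes over it, which can wipe out the performance gain!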
Stateless
Stateless operations, such as filter() and map(), keep no state during pipeline execution: each element is processed independently of the others. Because of this, they perform much better, especially in parallel.
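As a minimal sketch of a stateless pipeline (the class and method names below are made up for illustration), filter() and map() only look at the element they are handed, so the parallel run should give the same result as the sequential one:

```java
import java.util.stream.IntStream;

public class StatelessExample {

    // Sums the squares of the even numbers in [1, 100].
    // filter() and map() are stateless: each element is handled on its own,
    // so the lambdas never touch anything outside the stream.
    static int sumOfEvenSquares(boolean parallel) {
        IntStream stream = IntStream.rangeClosed(1, 100);
        if (parallel) {
            stream = stream.parallel();
        }
        return stream
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(false)); // sequential
        System.out.println(sumOfEvenSquares(true));  // parallel, same value
    }
}
```

Both calls print the same number because nothing in the pipeline depends on which thread processes which element, or in what order.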
Nondeterministic values
Because of stateful behavior, the result of a parallel run cannot be predicted. Let's see an example in code:
for (int i = 0; i < 5; i++) {
    Set<Integer> alreadySeen = new HashSet<>();
    IntStream stream = IntStream.of(3, 4, 1, 2, 1, 2, 3, 4, 4, 5);
    int sum = stream.parallel().map(
            // Here we add a stateful behavioral parameter.
            value -> alreadySeen.add(value) ? value : 0).sum();
    System.out.println(sum);
}
And here is the output:
16
15
15
19
15
This result is nondeterministic: you can get different results from one execution to the next. To make it correct, we remove the call to parallel():
for (int i = 0; i < 5; i++) {
    Set<Integer> alreadySeen = new HashSet<>();
    IntStream stream = IntStream.of(3, 4, 1, 2, 1, 2, 3, 4, 4, 5);
    int sum = stream.map(
            // Here we add a stateful behavioral parameter.
            value -> alreadySeen.add(value) ? value : 0).sum();
    System.out.println(sum);
}
Now we get the correct result, and it will be the same on every execution:
15
15
15
15
15
This is because, when the lambda runs in parallel, several threads read and update alreadySeen at the same time. HashSet is not thread-safe, so the check against previously seen values cannot be trusted.
Workarounds
There are some workarounds that help with this, but these fixes undermine the benefits of parallelism.
One is to use a synchronizedSet:
Set<Integer> alreadySeen = Collections.synchronizedSet(new HashSet<>());
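Dropped into the earlier loop, the synchronized set could look like the sketch below (the class and method names are only for illustration). Every add() call takes the set's lock, so exactly one thread sees true for each distinct value:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.IntStream;

public class SynchronizedSetExample {

    static int sumOfFirstOccurrences() {
        // The wrapper serializes access to the underlying HashSet, so add()
        // returns true exactly once per distinct value, whichever thread wins.
        Set<Integer> alreadySeen = Collections.synchronizedSet(new HashSet<>());
        return IntStream.of(3, 4, 1, 2, 1, 2, 3, 4, 4, 5)
                .parallel()
                // Still a stateful lambda, but the shared state is now thread-safe.
                .map(value -> alreadySeen.add(value) ? value : 0)
                .sum();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(sumOfFirstOccurrences()); // 15 on every run
        }
    }
}
```

The sum is stable because addition does not care which occurrence of a value was counted. The price is that all worker threads queue on the same lock, which is the loss of parallelism mentioned above.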
Another one is to use distinct(), which keeps the pipeline stateful but is safe and deterministic:
int sum = stream.parallel().distinct().sum();
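For comparison with the earlier loop, here is the distinct() version as a small self-contained sketch (class and method names are illustrative only). The shared set disappears entirely, and the stream itself takes care of deduplication:

```java
import java.util.stream.IntStream;

public class DistinctExample {

    static int sumOfDistinct() {
        // distinct() is a built-in stateful operation: the stream library
        // handles the bookkeeping safely, even in parallel.
        return IntStream.of(3, 4, 1, 2, 1, 2, 3, 4, 4, 5)
                .parallel()
                .distinct()
                .sum();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(sumOfDistinct()); // 15 on every run
        }
    }
}
```

No lambda touches outside state here, so there is nothing for the threads to race on.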
Parallel streams can speed up processing, but they can also cause problems in the application, so it's important to know whether a pipeline is stateful or stateless in order to handle it correctly.