Good thoughts and intuition can lead to good results, but only benchmarks lead to great results.
## The Problem
Once upon a time, I had the opportunity to build a custom data structure.
We needed something like a queue to act as a cache — a rolling buffer with a constant size, always filled with the latest data as the oldest expires.
The requirements sounded simple:
- Single writer, multiple concurrent readers
- Readers iterate safely while data is being written
- As little garbage and synchronization overhead as possible
Simple, right?
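In code, that "simple" requirement looks roughly like this (a hypothetical usage sketch; `RollingBuffer` and its two methods are described properly below, and the capacity is made up):

```java
// Hypothetical usage sketch: names mirror the API described later.
RollingBuffer<String> buffer = new RollingBuffer<>(1024); // capacity made up

// Writer thread: publish the newest value, overwriting the oldest.
buffer.put(System.currentTimeMillis(), "tick");

// Reader threads: iterate everything newer than a cutoff, concurrently
// with the writer, without locks or ConcurrentModificationException.
Iterator<String> it = buffer.getIterator(System.currentTimeMillis() - 5_000);
while (it.hasNext()) {
    System.out.println(it.next());
}
```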
## The Obvious Candidates
Naturally, the first candidates were Java’s built-in structures:
| Structure | Pros | Cons |
|---|---|---|
| ArrayList | Simple, contiguous | O(n) to remove oldest element |
| LinkedList | Fast removals | Creates garbage, poor cache locality |
| ArrayDeque | Nearly perfect | Not thread-safe, fails under concurrent modification |
Each had a deal-breaking flaw for our case.
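To make the first flaw concrete: using `ArrayList` as a rolling buffer means evicting from the front, and `remove(0)` shifts every remaining element on each write (a sketch; the capacity is made up):

```java
import java.util.ArrayList;
import java.util.List;

class ArrayListRollingBuffer {
    private static final int CAPACITY = 1024; // made-up size
    private final List<Long> buffer = new ArrayList<>();

    void put(long value) {
        if (buffer.size() == CAPACITY) {
            buffer.remove(0); // O(n): shifts every remaining element left
        }
        buffer.add(value);    // amortized O(1) at the tail
    }
}
```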
## The Custom Design
So, led by good intentions (and confidence), I decided to build my own: the `RollingBuffer`.
Core idea:
- Backed by a plain array → cache locality
- Preallocate and reuse buckets/objects → improves memory locality and avoids garbage
- A single `volatile` field for synchronization
- Minimal API surface:

```java
void put(long timestamp, V value);
Iterator<V> getIterator(long startTimestamp);
```
The iterator would snapshot the volatile write index and iterate safely up to that point.
No resizing, no garbage, no allocations during runtime.
Perfect in theory.
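Here's a minimal sketch of that design. This is my simplification rather than the repo's actual code; it omits gap handling and the grace-zone mechanism described below:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified sketch: preallocated arrays, one volatile write index,
// snapshot-based iteration.
class RollingBuffer<V> {
    private final long[] timestamps;
    private final V[] values;
    private volatile int writeIndex; // the single synchronization point

    @SuppressWarnings("unchecked")
    RollingBuffer(int capacity) {
        this.timestamps = new long[capacity];
        this.values = (V[]) new Object[capacity];
    }

    // Single writer only. Fill the bucket first, then publish by advancing
    // the volatile index: readers that see the new index are guaranteed
    // (happens-before) to see the bucket contents written before it.
    void put(long timestamp, V value) {
        int i = writeIndex;
        timestamps[i] = timestamp;
        values[i] = value;
        writeIndex = (i + 1 == values.length) ? 0 : i + 1;
    }

    // Readers snapshot the index once, then scan at most one full lap,
    // skipping empty buckets and anything older than startTimestamp.
    Iterator<V> getIterator(long startTimestamp) {
        final int snapshot = writeIndex; // single volatile read
        final int capacity = values.length;
        return new Iterator<V>() {
            int index = snapshot; // oldest bucket sits at the write index
            int scanned = 0;
            V next = advance();

            private V advance() {
                while (scanned < capacity) {
                    int i = index;
                    index = (i + 1 == capacity) ? 0 : i + 1;
                    scanned++;
                    if (values[i] != null && timestamps[i] >= startTimestamp) {
                        return values[i];
                    }
                }
                return null;
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public V next() {
                if (next == null) throw new NoSuchElementException();
                V result = next;
                next = advance();
                return result;
            }
        };
    }
}
```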
## The First Hurdles
### Which bucket to write to?
What if there's a gap in timestamps? Example: with 10 buckets indexed by `timestamp % 10`, data for timestamps 20–30 where 21 never arrived could yield something like:
`[20, 11, 22, 23, 24, …]`
The 11 in slot 1 is stale data left over from the previous cycle. I decided to handle this with a check during iteration, assuming branch prediction would make it cheap.
### Synchronization when wrapping around
What if an iterator runs while the writer wraps around the array?
Solution: introduce two parameters:
- `maxBuckets`: total array capacity
- `exposedBuckets`: the portion visible to readers
This created a small grace zone to prevent corruption.
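Readers only ever see the newest `exposedBuckets` entries, leaving the writer `maxBuckets - exposedBuckets` slots of slack to overwrite without racing an in-flight iterator. A sketch of the window arithmetic (my reconstruction, not the repo's exact code):

```java
// Sketch: which bucket is the oldest a new reader may touch?
// maxBuckets = physical capacity, exposedBuckets = visible portion;
// the difference is the grace zone the writer can overwrite freely.
static int oldestVisibleIndex(int writeIndexSnapshot,
                              int exposedBuckets, int maxBuckets) {
    // wrap-around-safe version of "snapshot - exposedBuckets"
    return Math.floorMod(writeIndexSnapshot - exposedBuckets, maxBuckets);
}
```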
## First Benchmarks: Great News?
I wrote the first version and ran JMH benchmarks.
Results looked amazing:
- Iteration over N elements: tiny latency
- Write throughput: tens of millions ops/sec
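For context, the JMH harness looked roughly like this (a reconstruction rather than the repo's actual benchmark classes; it reuses the `RollingBuffer` sketch from above):

```java
import java.util.Iterator;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class RollingBufferBenchmark {

    private RollingBuffer<Long> buffer;
    private long timestamp;

    @Setup
    public void setup() {
        buffer = new RollingBuffer<>(1024);
        for (long t = 0; t < 1024; t++) {
            buffer.put(t, t);
        }
        timestamp = 1024;
    }

    @Benchmark
    public void write() {
        buffer.put(timestamp, timestamp);
        timestamp++;
    }

    // Returning the sum keeps the JIT from dead-code-eliminating the loop.
    @Benchmark
    public long iterate() {
        long sum = 0;
        Iterator<Long> it = buffer.getIterator(0);
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }
}
```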
Until I compared it against the real collections…
## Reality Hits
Against the standard structures, my RollingBuffer was:
- 2.5× slower at iteration than `ArrayDeque`
- ~50% slower than `LinkedList`
Not what I expected.
I suspected the iteration-time gap check was the bottleneck — so I reworked it.
## The New Approach
Instead of trying to write to the “correct” bucket (skipping gaps),
I wrote to the next available bucket every time.
That simplified iteration — no runtime checks, just a linear scan.
But now, given a startTimestamp, I needed to find the right bucket efficiently.
Enter binary search.
Even with wrap-around, modular arithmetic made it work beautifully, as the sketch below shows.
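My reconstruction of that search (the repo's version will differ in details): binary-search over logical positions starting from the oldest visible bucket, and map each probe back to a physical index with a modulo.

```java
// Find the physical index of the first visible bucket whose timestamp
// is >= startTimestamp. 'oldest' is the physical index of the oldest
// visible bucket, 'count' the number of visible buckets; timestamps
// increase in logical order, so a lower-bound binary search applies.
static int findStartBucket(long[] timestamps, int oldest, int count,
                           long startTimestamp) {
    int lo = 0, hi = count; // search logical positions [lo, hi)
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        int physical = (oldest + mid) % timestamps.length; // unwrap the ring
        if (timestamps[physical] < startTimestamp) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Note: if lo == count, nothing is new enough; callers must check.
    return (oldest + lo) % timestamps.length;
}
```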
Result: ~15% faster iteration.
Better, but still not stellar.
## The Real Culprit
Then I found the real performance killer:
```java
nextWriteIndex = (nextWriteIndex + 1) % totalBuckets;
```
and in the iterator:
```java
nextIndex = (nextIndex + 1) % totalBuckets;
```
Looked harmless.
But integer `%` compiles down to a division, which is expensive next to a well-predicted conditional branch.
I replaced it with:
```java
nextWriteIndex = (nextWriteIndex + 1 == totalBuckets) ? 0 : nextWriteIndex + 1;
nextIndex = (nextIndex + 1 == totalBuckets) ? 0 : nextIndex + 1;
```
Result?
3× faster iteration.
Yes: just by removing a single `%` operator.
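For completeness: had I constrained the capacity to a power of two, a bitwise AND would have been a third, branch-free option; it's the same trick LMAX Disruptor uses for its sequence numbers:

```java
// Only valid when totalBuckets is a power of two.
int indexMask = totalBuckets - 1;
nextWriteIndex = (nextWriteIndex + 1) & indexMask; // no division, no branch
```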
## Final Results
Write throughput:
≈ 78 million ops/sec: about half as fast as `ArrayDeque` or `LinkedList`, but thread-safe, and comparable to LMAX Disruptor's single-writer performance (the design ended up similar as well).
Read throughput:
| Comparison | RollingBuffer Speed |
|---|---|
| vs `ArrayList` | 1.3× faster |
| vs `ArrayDeque` | 1.4× faster |
| vs `LinkedList` | 2.3× faster |
Iteration:
Faster than both ArrayList and ArrayDeque,
fully thread-safe, and supports concurrent modification.
Not bad at all. 😎
## The Lesson
Would I have figured this out without measuring?
No chance.
Is this level of optimization worth it for most code?
Absolutely not.
For 99.9% of use cases, built-ins are fine.
## Final Thought
Intuition gets you close.
Benchmarks get you the truth.
Is this level of performance optimization really worth it?
For most people, jobs, and tasks (easily 95%+ of them), the gains would drown in the sea of other inefficiencies.
But when it comes to the hot path of the system — the core where performance truly matters — this is where you’ll find excellence.
This process was invaluable.
I learned more about how the CPU, Java Memory Model, and JVM actually behave than any book or course could have taught me.
GitHub repo: https://github.com/JurenIvan/rolling-buffer
You can reason your way to correctness.
But you can only measure your way to performance.

