Good thoughts and intuition can lead to good results, but only benchmarks lead to great results.
## The Problem
Once upon a time, I had the opportunity to build a custom data structure.
We needed something like a queue to act as a cache — a rolling buffer with a constant size, always filled with the latest data as the oldest expires.
The requirements sounded simple:
- Single writer, multiple concurrent readers
- Readers iterate safely while data is being written
- As little garbage and synchronization overhead as possible
Simple, right?
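In code, that "simple" requirement looks roughly like this (a hypothetical usage sketch; `RollingBuffer` and its two methods are described properly below, and the capacity is made up):

```java
// Hypothetical usage sketch: names mirror the API described later.
RollingBuffer<String> buffer = new RollingBuffer<>(1024); // capacity made up

// Writer thread: publish the newest value, overwriting the oldest.
buffer.put(System.currentTimeMillis(), "tick");

// Reader threads: iterate everything newer than a cutoff, concurrently
// with the writer, without locks or ConcurrentModificationException.
Iterator<String> it = buffer.getIterator(System.currentTimeMillis() - 5_000);
while (it.hasNext()) {
    System.out.println(it.next());
}
```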
## The Obvious Candidates
Naturally, the first candidates were Java’s built-in structures:
| Structure | Pros | Cons |
|---|---|---|
| ArrayList | Simple, contiguous | O(n) to remove oldest element |
| LinkedList | Fast removals | Creates garbage, poor cache locality |
| ArrayDeque | Nearly perfect | Not thread-safe, fails under concurrent modification |
Each had a deal-breaking flaw for our case.
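To make the first flaw concrete: using `ArrayList` as a rolling buffer means evicting from the front, and `remove(0)` shifts every remaining element on each write (a sketch; the capacity is made up):

```java
import java.util.ArrayList;
import java.util.List;

class ArrayListRollingBuffer {
    private static final int CAPACITY = 1024; // made-up size
    private final List<Long> buffer = new ArrayList<>();

    void put(long value) {
        if (buffer.size() == CAPACITY) {
            buffer.remove(0); // O(n): shifts every remaining element left
        }
        buffer.add(value);    // amortized O(1) at the tail
    }
}
```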
## The Custom Design
So, led by good intentions (and confidence), I decided to build my own: the `RollingBuffer`.
Core idea:
- Backed by a plain array → cache locality
- Preallocate and reuse buckets/objects → improves memory locality and avoids garbage
- A single `volatile` field for synchronization
- Minimal API surface:

```java
void put(long timestamp, V value);
Iterator<V> getIterator(long startTimestamp);
```
The iterator would snapshot the volatile write index and iterate safely up to that point.
No resizing, no garbage, no allocations during runtime.
Perfect in theory.
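Here's a minimal sketch of that design. This is my simplification rather than the repo's actual code; it omits gap handling and the grace-zone mechanism described below:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Simplified sketch: preallocated arrays, one volatile write index,
// snapshot-based iteration.
class RollingBuffer<V> {
    private final long[] timestamps;
    private final V[] values;
    private volatile int writeIndex; // the single synchronization point

    @SuppressWarnings("unchecked")
    RollingBuffer(int capacity) {
        this.timestamps = new long[capacity];
        this.values = (V[]) new Object[capacity];
    }

    // Single writer only. Fill the bucket first, then publish by advancing
    // the volatile index: readers that see the new index are guaranteed
    // (happens-before) to see the bucket contents written before it.
    void put(long timestamp, V value) {
        int i = writeIndex;
        timestamps[i] = timestamp;
        values[i] = value;
        writeIndex = (i + 1 == values.length) ? 0 : i + 1;
    }

    // Readers snapshot the index once, then scan at most one full lap,
    // skipping empty buckets and anything older than startTimestamp.
    Iterator<V> getIterator(long startTimestamp) {
        final int snapshot = writeIndex; // single volatile read
        final int capacity = values.length;
        return new Iterator<V>() {
            int index = snapshot; // oldest bucket sits at the write index
            int scanned = 0;
            V next = advance();

            private V advance() {
                while (scanned < capacity) {
                    int i = index;
                    index = (i + 1 == capacity) ? 0 : i + 1;
                    scanned++;
                    if (values[i] != null && timestamps[i] >= startTimestamp) {
                        return values[i];
                    }
                }
                return null;
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public V next() {
                if (next == null) throw new NoSuchElementException();
                V result = next;
                next = advance();
                return result;
            }
        };
    }
}
```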
## The First Hurdles
### Which bucket to write to?
What if there's a gap in timestamps? Example: with 10 buckets indexed by `timestamp % 10`, data for timestamps 20–30 where 21 never arrived could yield something like:
`[20, 11, 22, 23, 24, …]`
The 11 in slot 1 is stale data left over from the previous cycle. I decided to handle this with a check during iteration, assuming branch prediction would make it cheap.
### Synchronization when wrapping around
What if an iterator runs while the writer wraps around the array?
Solution: introduce two parameters:
- `maxBuckets`: total array capacity
- `exposedBuckets`: the portion visible to readers
This created a small grace zone to prevent corruption.
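Readers only ever see the newest `exposedBuckets` entries, leaving the writer `maxBuckets - exposedBuckets` slots of slack to overwrite without racing an in-flight iterator. A sketch of the window arithmetic (my reconstruction, not the repo's exact code):

```java
// Sketch: which bucket is the oldest a new reader may touch?
// maxBuckets = physical capacity, exposedBuckets = visible portion;
// the difference is the grace zone the writer can overwrite freely.
static int oldestVisibleIndex(int writeIndexSnapshot,
                              int exposedBuckets, int maxBuckets) {
    // wrap-around-safe version of "snapshot - exposedBuckets"
    return Math.floorMod(writeIndexSnapshot - exposedBuckets, maxBuckets);
}
```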
## First Benchmarks: Great News?
I wrote the first version and ran JMH benchmarks.
Results looked amazing:
- Iteration over N elements: tiny latency
- Write throughput: tens of millions ops/sec
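For context, the JMH harness looked roughly like this (a reconstruction rather than the repo's actual benchmark classes; it reuses the `RollingBuffer` sketch from above):

```java
import java.util.Iterator;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class RollingBufferBenchmark {

    private RollingBuffer<Long> buffer;
    private long timestamp;

    @Setup
    public void setup() {
        buffer = new RollingBuffer<>(1024);
        for (long t = 0; t < 1024; t++) {
            buffer.put(t, t);
        }
        timestamp = 1024;
    }

    @Benchmark
    public void write() {
        buffer.put(timestamp, timestamp);
        timestamp++;
    }

    // Returning the sum keeps the JIT from dead-code-eliminating the loop.
    @Benchmark
    public long iterate() {
        long sum = 0;
        Iterator<Long> it = buffer.getIterator(0);
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }
}
```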
Until I compared it against the real collections…
## Reality Hits
Against the standard structures, my RollingBuffer was:
- 2.5× slower at iteration than `ArrayDeque`
- ~50% slower than `LinkedList`
Not what I expected.
I suspected the iteration-time gap check was the bottleneck — so I reworked it.
## The New Approach
Instead of trying to write to the “correct” bucket (skipping gaps),
I wrote to the next available bucket every time.
That simplified iteration — no runtime checks, just a linear scan.
But now, given a startTimestamp, I needed to find the right bucket efficiently.
Enter binary search.
Even with wrap-around, modular arithmetic made it work beautifully, as the sketch below shows.
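My reconstruction of that search (the repo's version will differ in details): binary-search over logical positions starting from the oldest visible bucket, and map each probe back to a physical index with a modulo.

```java
// Find the physical index of the first visible bucket whose timestamp
// is >= startTimestamp. 'oldest' is the physical index of the oldest
// visible bucket, 'count' the number of visible buckets; timestamps
// increase in logical order, so a lower-bound binary search applies.
static int findStartBucket(long[] timestamps, int oldest, int count,
                           long startTimestamp) {
    int lo = 0, hi = count; // search logical positions [lo, hi)
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        int physical = (oldest + mid) % timestamps.length; // unwrap the ring
        if (timestamps[physical] < startTimestamp) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Note: if lo == count, nothing is new enough; callers must check.
    return (oldest + lo) % timestamps.length;
}
```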
Result: ~15% faster iteration.
Better, but still not stellar.
## The Real Culprit
Then I found the real performance killer:
```java
nextWriteIndex = (nextWriteIndex + 1) % totalBuckets;
```
and in the iterator:
```java
nextIndex = (nextIndex + 1) % totalBuckets;
```
Looked harmless.
But integer `%` compiles down to a division, which is expensive next to a well-predicted conditional branch.
I replaced it with:
```java
nextWriteIndex = (nextWriteIndex + 1 == totalBuckets) ? 0 : nextWriteIndex + 1;
nextIndex = (nextIndex + 1 == totalBuckets) ? 0 : nextIndex + 1;
```
Result?
3× faster iteration.
Yes: just by removing a single `%` operator.
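For completeness: had I constrained the capacity to a power of two, a bitwise AND would have been a third, branch-free option; it's the same trick LMAX Disruptor uses for its sequence numbers:

```java
// Only valid when totalBuckets is a power of two.
int indexMask = totalBuckets - 1;
nextWriteIndex = (nextWriteIndex + 1) & indexMask; // no division, no branch
```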
## Final Results
Write throughput:
≈ 78 million ops/sec: about half as fast as `ArrayDeque` or `LinkedList`, but thread-safe, and comparable to LMAX Disruptor's single-writer performance (the design ended up similar as well).
Read throughput:
| Comparison | RollingBuffer Speed |
|---|---|
| vs `ArrayList` | 1.3× faster |
| vs `ArrayDeque` | 1.4× faster |
| vs `LinkedList` | 2.3× faster |
Iteration:
Faster than both ArrayList and ArrayDeque,
fully thread-safe, and supports concurrent modification.
Not bad at all. 😎
## The Lesson
Would I have figured this out without measuring?
No chance.
Is this level of optimization worth it for most code?
Absolutely not.
For 99.9% of use cases, built-ins are fine.
## Final Thought
Intuition gets you close.
Benchmarks get you the truth.
Is this level of performance optimization really worth it?
For most people, jobs, and tasks (easily 95%+ of them), the gains would drown in the sea of other inefficiencies.
But when it comes to the hot path of the system — the core where performance truly matters — this is where you’ll find excellence.
This process was invaluable.
I learned more about how the CPU, Java Memory Model, and JVM actually behave than any book or course could have taught me.
GitHub repo: https://github.com/JurenIvan/rolling-buffer
You can reason your way to correctness.
But you can only measure your way to performance.

