DEV Community

Ivan Juren

Good Intentions Aren’t Enough — Measure!

Good thoughts and intuition can lead to good results, but only benchmarks lead to great results.


The Problem

Once upon a time, I had the opportunity to build a custom data structure.

We needed something like a queue to act as a cache — a rolling buffer with a constant size, always filled with the latest data as the oldest expires.

The requirements sounded simple:

  • Single writer, multiple concurrent readers
  • Readers iterate safely while data is being written
  • As little garbage and synchronization overhead as possible

Simple, right?


The Obvious Candidates

Naturally, the first candidates were Java’s built-in structures:

| Structure | Pros | Cons |
|---|---|---|
| ArrayList | Simple, contiguous | O(n) to remove the oldest element |
| LinkedList | Fast removals | Creates garbage, poor cache locality |
| ArrayDeque | Nearly perfect | Not thread-safe; fails under concurrent modification |

Each had a deal-breaking flaw for our case.
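To make ArrayDeque's flaw concrete: its iterator is fail-fast (on a best-effort basis, per its javadoc), so even a single-threaded structural modification mid-iteration makes the iterator throw. A quick sanity check:

```java
import java.util.ArrayDeque;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// ArrayDeque's iterator is fail-fast: structurally modifying the deque while
// iterating makes the next iterator call throw (detection is best-effort).
final class ArrayDequeFailFast {
    static boolean throwsOnModification() {
        ArrayDeque<Integer> dq = new ArrayDeque<>(List.of(1, 2, 3));
        Iterator<Integer> it = dq.iterator();
        it.next();          // start iterating
        dq.pollFirst();     // remove elements behind the iterator's back
        dq.pollFirst();
        try {
            it.next();      // iterator detects the modification
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

With a single writer and concurrent readers, this rules ArrayDeque out without external locking, which defeats the "minimal synchronization" requirement.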


The Custom Design

So, led by good intentions (and confidence), I decided to build my own —

the RollingBuffer.

Core idea:

  • Backed by a plain array → cache locality
  • Preallocate and reuse buckets/objects → improves memory locality and avoids garbage
  • A single volatile field for synchronization
  • Minimal API surface:
```java
void put(long timestamp, V value);
Iterator<V> getIterator(long startTimestamp);
```

The iterator would snapshot the volatile write index and iterate safely up to that point.

No resizing, no garbage, no allocations during runtime.

Perfect in theory.
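Sketched in Java, the idea looks roughly like this. This is a minimal, illustrative sketch, not the actual RollingBuffer code: it copies matches into a list for brevity (the real design iterates in place without allocating) and it ignores wrap-around.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch: a fixed, preallocated array; a single volatile write index;
// readers snapshot that index and never read past it.
final class RollingBufferSketch<V> {
    private final long[] timestamps;
    private final Object[] values;
    private final int totalBuckets;
    private volatile int nextWriteIndex = 0; // the one synchronization point

    RollingBufferSketch(int totalBuckets) {
        this.totalBuckets = totalBuckets;
        this.timestamps = new long[totalBuckets];
        this.values = new Object[totalBuckets];
    }

    // Single writer only.
    void put(long timestamp, V value) {
        int i = nextWriteIndex;
        timestamps[i] = timestamp;
        values[i] = value;
        nextWriteIndex = (i + 1) % totalBuckets; // volatile write publishes the bucket
    }

    // Readers see every bucket written before the snapshot.
    @SuppressWarnings("unchecked")
    Iterator<V> getIterator(long startTimestamp) {
        int snapshot = nextWriteIndex; // volatile read
        List<V> view = new ArrayList<>(); // copies for brevity only
        for (int i = 0; i < snapshot; i++) {
            if (timestamps[i] >= startTimestamp) {
                view.add((V) values[i]);
            }
        }
        return view.iterator();
    }
}
```

The volatile index does double duty: it tells readers how far they may scan, and its happens-before semantics guarantee the bucket contents written before it are visible to them.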


The First Hurdles

  1. Which bucket to write to?

    What if there’s a gap in timestamps?

    Example: with 10 buckets (bucket = timestamp mod 10), data for timestamps 20–30 could yield something like:

    [20, 11, 22, 23, 24, …]

    If timestamp 21 never arrives, bucket 1 still holds the stale value from the previous cycle (timestamp 11). I decided to handle this during iteration — assuming branch prediction would make the staleness check cheap.

  2. Synchronization when wrapping around

    What if an iterator runs while the writer wraps around the array?

    Solution: introduce two parameters:

    • maxBuckets — total array capacity
    • exposedBuckets — visible portion to readers

This created a small grace zone to prevent corruption.
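As a hypothetical illustration of how the two parameters interact (the parameter names match the article; the helper itself is my own, not code from the repo):

```java
// The array holds maxBuckets slots, but readers are only handed the newest
// exposedBuckets of them. The slots between the two limits, sitting directly
// ahead of the writer, form a grace zone the writer may overwrite without
// racing a slow reader.
final class GraceZone {
    // Oldest bucket a reader may touch, with wrap-around handled by hand.
    static int oldestVisibleIndex(int writeIndex, int exposedBuckets, int maxBuckets) {
        int i = writeIndex - exposedBuckets;
        return (i < 0) ? i + maxBuckets : i;
    }
}
```

For example, with maxBuckets = 10 and exposedBuckets = 8, two buckets' worth of writes can land before the writer catches up to anything a reader is allowed to see.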


First Benchmarks — Great News?

I wrote the first version and ran JMH benchmarks.

Results looked amazing:

  • Iteration over N elements: tiny latency
  • Write throughput: tens of millions ops/sec

Until I compared it against the real collections…


Reality Hits

Against the standard structures, my RollingBuffer was:

  • 2.5× slower at iteration than ArrayDeque
  • ~50% slower than LinkedList

Not what I expected.

I suspected the iteration-time gap check was the bottleneck — so I reworked it.


The New Approach

Instead of trying to write to the “correct” bucket (skipping gaps),

I wrote to the next available bucket every time.

That simplified iteration — no runtime checks, just a linear scan.

But now, given a startTimestamp, I needed to find the right bucket efficiently.

Enter binary search.

Even with wrap-around, modular arithmetic made it work beautifully.
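A sketch of that search (my own reconstruction, not the repo's code): binary-search over *logical* positions starting from the oldest bucket, and map each probe to a physical slot with modular arithmetic.

```java
// Binary search over a circular array of monotonically increasing timestamps.
final class CircularSearch {
    /** Logical position of the first timestamp >= start, or size if none. */
    static int firstAtOrAfter(long[] ts, int oldestIndex, int size, long start) {
        int lo = 0, hi = size; // half-open range over logical positions
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int phys = (oldestIndex + mid) % ts.length; // logical -> physical
            if (ts[phys] < start) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Because buckets are filled in arrival order, the logical view is always sorted even when the physical array has wrapped, so ordinary binary search applies unchanged.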

Result: ~15% faster iteration.

Better, but still not stellar.


The Real Culprit

Then I found the real performance killer:

```java
nextWriteIndex = (nextWriteIndex + 1) % totalBuckets;
```

and in the iterator:

```java
nextIndex = (nextIndex + 1) % totalBuckets;
```

Looked harmless.

But % is slow compared to a conditional branch.

I replaced it with:

```java
nextWriteIndex = (nextWriteIndex + 1 == totalBuckets) ? 0 : nextWriteIndex + 1;
nextIndex = (nextIndex + 1 == totalBuckets) ? 0 : nextIndex + 1;
```

Result?

3× faster iteration.

Yes — just by removing a single % operator.
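The two forms are semantically identical for indices in [0, totalBuckets); the ternary simply trades an integer division for a branch that, in a hot loop, the branch predictor handles almost for free. A quick equivalence check (the timing numbers are from the article's JMH runs, not this snippet):

```java
// Both wrap an index around a ring of size n; only the cost differs.
final class WrapIncrement {
    static int modWrap(int i, int n) {
        return (i + 1) % n;                  // integer division on every call
    }
    static int branchWrap(int i, int n) {
        return (i + 1 == n) ? 0 : i + 1;     // predictable branch, no division
    }
}
```

Power-of-two capacities would allow a third option, masking with `(i + 1) & (n - 1)`, which is the trick ring buffers like the LMAX Disruptor use; the branch version works for any capacity.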


Final Results

Write throughput:

78 million ops/sec — about half the speed of ArrayDeque or LinkedList,

but thread-safe, and comparable to the LMAX Disruptor's single-writer performance (the design ended up similar as well).

Read throughput:

| Comparison | RollingBuffer speed |
|---|---|
| vs ArrayList | 1.3× faster |
| vs ArrayDeque | 1.4× faster |
| vs LinkedList | 2.3× faster |

Iteration:

Faster than both ArrayList and ArrayDeque,

fully thread-safe, and supports concurrent modification.

Not bad at all. 😎



The Lesson

Would I have figured this out without measuring?

No chance.

Is this level of optimization worth it for most code?

Absolutely not.

For 99.9% of use cases, built-ins are fine.


Final Thought

Intuition gets you close.

Benchmarks get you the truth.

Is this level of performance optimization really worth it?

For most people, jobs, and tasks (easily 95% of them), it will drown in the sea of other inefficiencies.

But when it comes to the hot path of the system — the core where performance truly matters — this is where you’ll find excellence.

This process was invaluable.

I learned more about how the CPU, Java Memory Model, and JVM actually behave than any book or course could have taught me.

Github repo: https://github.com/JurenIvan/rolling-buffer


You can reason your way to correctness.

But you can only measure your way to performance.

