jvmind

Posted on Jun 29

JDK 26 G1 GC Dual Card Tables – A Benchmark Story

#programming #java #performance

TL;DR: In a write-barrier-heavy microbenchmark, JDK 26's G1 write barrier optimization (Dual Card Tables) cut the time per operation by ~2.4x, but aggregate GC metrics can be misleading if you don't account for the application doing more work.

Background

The Dual Card Tables work landed in JDK 26, promising 5-15% throughput improvements for G1 GC. I wanted to understand how this behaves under a write-barrier-heavy workload, so I ran a controlled benchmark comparing JDK 25 vs JDK 26 G1.

Benchmark Setup

Workload: Write-barrier-heavy allocation test (storing newly allocated objects into a fixed-size Object[] array)
Heap: 2GB, G1 GC
Runtime: ~31 seconds per test
JDKs: 25 vs 26 (both with G1)

Initial Observations (Misleading)

Metric	JDK 25	JDK 26	Change
GC Events	75	168	+124%
Total Pause Time	1.78s	3.26s	+83%
GC Throughput (time outside pauses)	94.30%	89.50%	-4.8 p.p.
Allocation Rate	2,874 MB/s	6,587 MB/s	+129%

On the surface, JDK 26 looked worse: more GC events, more total pause time, lower GC throughput. But this was an artifact of comparing two runs that did very different amounts of work in the same wall-clock time.

The Critical Data Point

The benchmark's raw output told a different story:

JDK	Result (ms/op)	Iterations
25	0.055 ± 0.013	54
26	0.023 ± 0.003	129

JDK 26 completes the same write-barrier-heavy operation in less than half the time (0.055 / 0.023 ≈ 2.4x faster). The iteration counts agree: 129 / 54 ≈ 2.4.
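The ratio is easy to double-check. The sketch below recomputes the speedup from the table and tests whether the two error ranges overlap. It assumes the ± values are symmetric error bounds around the mean (the post does not say whether they are standard deviations or confidence intervals), and the class and method names are made up for illustration.

```java
import java.util.Locale;

// Sanity check of the ms/op table. Hypothetical helper, not part of the benchmark.
final class SpeedupCheck {
    /** How many times faster the candidate is than the baseline (ratio of mean times). */
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    /** True if [meanA - errA, meanA + errA] and [meanB - errB, meanB + errB] share any point. */
    static boolean intervalsOverlap(double meanA, double errA, double meanB, double errB) {
        return (meanA - errA) <= (meanB + errB) && (meanB - errB) <= (meanA + errA);
    }

    public static void main(String[] args) {
        // Values copied from the results table; ± treated as a symmetric bound (assumption).
        double jdk25 = 0.055, jdk25Err = 0.013;
        double jdk26 = 0.023, jdk26Err = 0.003;

        System.out.printf(Locale.ROOT, "speedup: %.2fx%n", speedup(jdk25, jdk26));
        System.out.printf(Locale.ROOT, "iteration ratio: %.2fx%n", 129.0 / 54.0);
        System.out.println("intervals overlap: "
                + intervalsOverlap(jdk25, jdk25Err, jdk26, jdk26Err));
    }
}
```

Both ratios come out at about 2.39x, and the ranges (0.042 to 0.068 vs 0.020 to 0.026) do not overlap, so the difference is unlikely to be noise under that reading of the ± values.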

What Actually Happened

The allocation rate spike (2,874 → 6,587 MB/s) wasn't a regression. It was a consequence of the application running faster:

Allocation Rate = Allocated Bytes / Application Runtime

When the write barrier becomes faster, the application spends less time on barrier operations and more time actually doing work – so it allocates more bytes in the same wall-clock time. More allocations → more garbage → more GC events → more total pause time.
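One way to make the two runs comparable is to normalize GC cost by the amount of work done. The sketch below does that with the numbers from the first table. It assumes the ~31 s runtime applies to both runs and treats 1 GB as 1024 MB; the class and method names are hypothetical.

```java
import java.util.Locale;

// Normalizes pause time by allocated volume and by GC event count.
// Hypothetical helper; inputs are the figures reported in the post.
final class GcNormalization {
    /** Total megabytes allocated = allocation rate (MB/s) * runtime (s). */
    static double allocatedMb(double rateMbPerSec, double runtimeSec) {
        return rateMbPerSec * runtimeSec;
    }

    /** Pause milliseconds per gigabyte allocated (1 GB = 1024 MB here). */
    static double pauseMsPerGb(double totalPauseSec, double allocatedMb) {
        return (totalPauseSec * 1000.0) / (allocatedMb / 1024.0);
    }

    /** Mean pause per GC event, in milliseconds. */
    static double meanPauseMs(double totalPauseSec, int gcEvents) {
        return totalPauseSec * 1000.0 / gcEvents;
    }

    public static void main(String[] args) {
        double runtimeSec = 31.0; // "~31 seconds per test", assumed equal for both runs

        double mb25 = allocatedMb(2874, runtimeSec);
        double mb26 = allocatedMb(6587, runtimeSec);

        System.out.printf(Locale.ROOT, "JDK 25: %.1f ms pause/GB, %.1f ms mean pause%n",
                pauseMsPerGb(1.78, mb25), meanPauseMs(1.78, 75));
        System.out.printf(Locale.ROOT, "JDK 26: %.1f ms pause/GB, %.1f ms mean pause%n",
                pauseMsPerGb(3.26, mb26), meanPauseMs(3.26, 168));
    }
}
```

Under those assumptions JDK 26 spends roughly 16 ms of pause per GB allocated versus roughly 20 ms on JDK 25, and its mean pause is about 19 ms versus about 24 ms. Because the runtime is the same for both runs, the comparison per GB does not depend on the exact value of 31 s.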

The apparent "throughput regression" (a lower share of wall-clock time spent outside GC pauses) was actually a side effect of the application getting more work done.

Corrected Conclusion

Dimension	JDK 26 vs JDK 25
Write barrier performance	✅ ~2.4x faster
Single-pause latency	✅ Better across all percentiles
Effective throughput	✅ ~2.4x more iterations in the same runtime (129 vs 54)
GC events (count)	⚠️ Higher (because of more work)
Total pause time	⚠️ Higher (because of more work)

Key Takeaway

Aggregate GC metrics like "total pause time" or "throughput percentage" are not absolute measures of performance. They must be interpreted relative to how much work the application completed in the same time. JDK 26's G1 optimization is a clear win in this benchmark: it made the application run faster, which created more garbage, which triggered more GC activity.

Benchmark Code

// Simplified version – full code available on request
public class WriteBarrierBench {
    private static final int ARRAY_SIZE = 10000;
    private final Object[] array = new Object[ARRAY_SIZE];
    private volatile long blackhole;

    private void storeReferences() {
        for (int i = 0; i < array.length; i++) {
            array[i] = new Object();  // triggers write barrier
        }
        blackhole += array.length;     // prevents optimization
    }

    // ... measurement harness with warmup, iterations, etc.
}
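For readers who want something runnable, here is a minimal sketch of what a surrounding harness could look like. It is not the author's harness: the class name, round counts, and timing approach are assumptions, and a framework such as JMH would be a more robust choice for real measurements.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for the elided harness. Possible launch (flags assumed from the setup):
//   java -Xmx2g -XX:+UseG1GC -Xlog:gc*:file=gc.log WriteBarrierHarness.java
final class WriteBarrierHarness {
    private static final int ARRAY_SIZE = 10_000;
    private final Object[] array = new Object[ARRAY_SIZE];
    private volatile long blackhole;

    private void storeReferences() {
        for (int i = 0; i < array.length; i++) {
            array[i] = new Object();   // reference store: goes through the GC write barrier
        }
        blackhole += array.length;     // volatile write, as in the original snippet
    }

    /** Runs `ops` calls of storeReferences() and returns the mean time per call in ms. */
    double measureMsPerOp(int ops) {
        long start = System.nanoTime();
        for (int i = 0; i < ops; i++) {
            storeReferences();
        }
        long elapsedNs = System.nanoTime() - start;
        return elapsedNs / 1_000_000.0 / ops;
    }

    /** Warmup rounds are discarded; each measured round contributes one ms/op sample. */
    static double[] run(int warmupRounds, int measuredRounds, int opsPerRound) {
        WriteBarrierHarness bench = new WriteBarrierHarness();
        for (int i = 0; i < warmupRounds; i++) {
            bench.measureMsPerOp(opsPerRound);
        }
        double[] samples = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            samples[i] = bench.measureMsPerOp(opsPerRound);
        }
        return samples;
    }

    public static void main(String[] args) {
        double[] samples = run(5, 10, 1_000);
        double mean = Arrays.stream(samples).average().orElse(Double.NaN);
        System.out.printf(Locale.ROOT, "mean: %.4f ms/op over %d rounds%n", mean, samples.length);
    }
}
```

Running the same file under each JDK with identical flags keeps the comparison to a single variable. The absolute numbers will differ from the ones in this post depending on hardware and round sizes.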

Methodology Note

The benchmark uses a volatile long blackhole to prevent dead code elimination
Warmup iterations are included to allow JIT compilation
A bash harness controls JDK switching and GC logging
The test is controlled (single workload pattern) – results may not generalize to all allocation profiles

Open Questions

How does this scale with different heap sizes?
What does the behavior look like on other GC algorithms (Parallel, ZGC)?
Is there a direct way to measure write barrier overhead independently?

DEV Community