<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kush V</title>
    <description>The latest articles on DEV Community by Kush V (@kusoroadeolu).</description>
    <link>https://dev.to/kusoroadeolu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3795697%2F68bc5722-fd21-4eb5-96ad-18a39f1620bf.png</url>
      <title>DEV Community: Kush V</title>
      <link>https://dev.to/kusoroadeolu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kusoroadeolu"/>
    <language>en</language>
    <item>
      <title>A Practical Exploration of Transactional Map Implementations in Java</title>
      <dc:creator>Kush V</dc:creator>
      <pubDate>Mon, 16 Mar 2026 18:43:58 +0000</pubDate>
      <link>https://dev.to/kusoroadeolu/a-practical-exploration-of-transactional-map-implementations-in-java-3mjb</link>
      <guid>https://dev.to/kusoroadeolu/a-practical-exploration-of-transactional-map-implementations-in-java-3mjb</guid>
      <description>&lt;p&gt;This writeup documents the transactional maps I’ve implemented over the last few weeks. It mainly focuses on design, lifecycle, and implementation details. No benchmarks or perf numbers included.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; The original writeup can be found &lt;a href="https://github.com/kusoroadeolu/articles/blob/main/transactional-maps/implementations.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Optimistic Transactional Map
&lt;/h2&gt;

&lt;p&gt;This is mainly inspired by the approach described in &lt;a href="https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf" rel="noopener noreferrer"&gt;the transactional collections classes paper&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;READ COMMITTED&lt;/strong&gt; globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SERIALIZABLE per-key&lt;/strong&gt; when writes are involved&lt;/li&gt;
&lt;li&gt;No dirty reads, no phantom reads per key&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Core Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, V&amp;gt;&lt;/code&gt; -&amp;gt; The underlying map&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KeyToLockers&amp;lt;K&amp;gt;&lt;/code&gt; -&amp;gt; Maps each key to operation-specific &lt;code&gt;GuardedTxSet&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;Each &lt;code&gt;GuardedTxSet&lt;/code&gt; contains:

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;ReentrantReadWriteLock&lt;/code&gt; (split into read/write locks)&lt;/li&gt;
&lt;li&gt;A set of transactions waiting on that operation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
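&lt;p&gt;As a rough sketch (type and field names here are illustrative, not the actual implementation), each per-key, per-operation locker pairs a read/write lock with the set of transactions registered on it:&lt;/p&gt;

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical shape of a per-key, per-operation GuardedTxSet.
class GuardedTxSet<T> {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final Set<T> waitingTxs = ConcurrentHashMap.newKeySet();

    // Readers lock immediately at schedule time and register themselves.
    void registerReader(T tx) {
        lock.readLock().lock();
        waitingTxs.add(tx);
    }

    // At commit the transaction is removed from the set.
    void deregister(T tx) {
        waitingTxs.remove(tx);
    }
}
```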

&lt;h3&gt;
  
  
  Transaction Lifecycle
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Schedule Phase (when operations are called)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;// Acquires READ lock immediately&lt;/span&gt;
&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Does NOT acquire any lock yet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For reads (GET/CONTAINS/SIZE):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediately acquire the &lt;strong&gt;read lock&lt;/strong&gt; for that operation type on that key&lt;/li&gt;
&lt;li&gt;Add transaction to the &lt;code&gt;GuardedTxSet&lt;/code&gt; for that key + operation&lt;/li&gt;
&lt;li&gt;Multiple readers can proceed concurrently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For writes (PUT/REMOVE):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No locks acquired yet (lazy intent declaration)&lt;/li&gt;
&lt;li&gt;Just record the operation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Validation Phase
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;For reads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Already validated (holding read lock since schedule time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For writes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sort all write keys by &lt;code&gt;identityHashCode()&lt;/code&gt; (a consistent global lock-acquisition order prevents deadlock)&lt;/li&gt;
&lt;li&gt;For each key being written:

&lt;ol&gt;
&lt;li&gt;Acquire the &lt;strong&gt;write lock&lt;/strong&gt; for that key&lt;/li&gt;
&lt;li&gt;For each conflicting read operation type (GET, CONTAINS):

&lt;ul&gt;
&lt;li&gt;Check if this transaction holds the &lt;strong&gt;read lock&lt;/strong&gt; for that operation&lt;/li&gt;
&lt;li&gt;If yes, &lt;strong&gt;release the read lock first&lt;/strong&gt; (prevent upgrade deadlock)&lt;/li&gt;
&lt;li&gt;Acquire the &lt;strong&gt;write lock&lt;/strong&gt; for that operation type&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Check if SIZE lock is needed:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PUT&lt;/code&gt; on new key -&amp;gt; acquire SIZE write lock (size will increase)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PUT&lt;/code&gt; on existing key -&amp;gt; release CONTAINS write lock if held (size unchanged, CONTAINS always true)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REMOVE&lt;/code&gt; on existing key -&amp;gt; acquire SIZE write lock (size will decrease)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REMOVE&lt;/code&gt; on non-existent key -&amp;gt; release CONTAINS write lock if held (size unchanged, CONTAINS always false)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;/li&gt;

&lt;/ul&gt;
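&lt;p&gt;The two deadlock-avoidance tricks above (canonical key ordering, and releasing the read lock before taking the write lock) can be sketched like this; the helper names are mine:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the deadlock-avoidance steps in the validation phase.
class UpgradeSketch {
    // Sorting write keys into one canonical order means every transaction
    // acquires locks in the same sequence, ruling out lock-order deadlock.
    static <K> List<K> canonicalOrder(List<K> writeKeys) {
        List<K> sorted = new ArrayList<>(writeKeys);
        sorted.sort(Comparator.comparingInt(System::identityHashCode));
        return sorted;
    }

    // ReentrantReadWriteLock cannot upgrade read -> write directly, so
    // the read lock must be released before the write lock is requested.
    static void upgrade(ReentrantReadWriteLock rw) {
        if (rw.getReadHoldCount() > 0) {
            rw.readLock().unlock(); // release first to avoid self-deadlock
        }
        rw.writeLock().lock();
    }
}
```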

&lt;h4&gt;
  
  
  3. Commit Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Apply all operations to the underlying map&lt;/li&gt;
&lt;li&gt;Release all held locks (both read and write)&lt;/li&gt;
&lt;li&gt;Remove the transaction from all &lt;code&gt;GuardedTxSet&lt;/code&gt;s
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unique Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Readers block writers, writers block readers, but readers never block readers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The lock upgrade implementation (release read -&amp;gt; acquire write) prevents self-deadlock&lt;/li&gt;
&lt;li&gt;Semantic awareness: the &lt;strong&gt;SIZE&lt;/strong&gt; lock is only acquired when the map's size will actually change. The &lt;strong&gt;CONTAINS&lt;/strong&gt; lock is acquired but only held when the answer could change: for &lt;strong&gt;PUT&lt;/strong&gt;, when the key does not already exist; for &lt;strong&gt;REMOVE&lt;/strong&gt;, when it does&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Pessimistic Transactional Map
&lt;/h2&gt;

&lt;p&gt;This is mainly inspired by the approach described in &lt;a href="https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf" rel="noopener noreferrer"&gt;the transactional collections classes paper&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;READ COMMITTED&lt;/strong&gt; globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SERIALIZABLE per key&lt;/strong&gt; when writes are involved&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Core Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, V&amp;gt;&lt;/code&gt; -&amp;gt; The underlying map&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KeyToLockers&amp;lt;K&amp;gt;&lt;/code&gt; -&amp;gt; Maps each key to operation-specific &lt;code&gt;GuardedTxSet&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;Each &lt;code&gt;GuardedTxSet&lt;/code&gt; contains:

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;ReentrantLock&lt;/code&gt; (write lock)&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;AtomicInteger&lt;/code&gt; reader count&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;Latch&lt;/code&gt; with status (FREE/HELD), parent transaction, and &lt;code&gt;CountDownLatch&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
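&lt;p&gt;A sketch of what the latch piece might look like (illustrative names; the real class also coordinates with the reader count):&lt;/p&gt;

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical shape of the latch described above: a status flag, the
// transaction that owns it, and a CountDownLatch readers can park on.
class TxLatch {
    static final int FREE = 0, HELD = 1;
    final AtomicInteger status = new AtomicInteger(FREE);
    final AtomicReference<Object> parentTx = new AtomicReference<>();
    volatile CountDownLatch gate = new CountDownLatch(0); // open while FREE

    void acquire(Object tx) {
        status.set(HELD);
        parentTx.set(tx);
        gate = new CountDownLatch(1); // readers arriving now must wait
    }

    void release() {
        status.set(FREE);
        parentTx.set(null);
        gate.countDown(); // wake any parked readers
    }
}
```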

&lt;h3&gt;
  
  
  Transaction Lifecycle
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Schedule Phase
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;// Does NOT acquire lock, just registers&lt;/span&gt;
&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Does NOT acquire lock, just records&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For reads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add transaction to the appropriate &lt;code&gt;GuardedTxSet&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No locks acquired (lazy intent declaration)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For writes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just record the operation&lt;/li&gt;
&lt;li&gt;No locks acquired&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Validation Phase
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;For writes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sort all write keys by &lt;code&gt;identityHashCode()&lt;/code&gt; for deadlock prevention&lt;/li&gt;
&lt;li&gt;For each write key:

&lt;ol&gt;
&lt;li&gt;Acquire write locks for GET and CONTAINS operation types on that key&lt;/li&gt;
&lt;li&gt;Check if key exists in underlying map&lt;/li&gt;
&lt;li&gt;Determine if SIZE lock needed (same logic as optimistic):

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PUT&lt;/code&gt; on new key -&amp;gt; acquire SIZE lock&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REMOVE&lt;/code&gt; on existing key -&amp;gt; acquire SIZE lock&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Set latch status to HELD with this transaction as parent&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Drain existing readers&lt;/strong&gt;: spin-wait while &lt;code&gt;readerCount &amp;gt; 0&lt;/code&gt;
&lt;/li&gt;

&lt;/ol&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For reads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the latch for the operation type on that key&lt;/li&gt;
&lt;li&gt;If latch is HELD by another transaction:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Waits&lt;/strong&gt; on the &lt;code&gt;CountDownLatch&lt;/code&gt; (blocks until writer releases)&lt;/li&gt;
&lt;li&gt;When awakened, recheck whether the latch has been reacquired; if not, increment &lt;code&gt;readerCount&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If latch is FREE:

&lt;ul&gt;
&lt;li&gt;Just increment &lt;code&gt;readerCount&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
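&lt;p&gt;A deterministic, state-only walk-through of the reader/writer protocol above (no real parking or spinning; the names are mine):&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the pessimistic reader/writer handshake: readers
// bump a counter when the latch is free; a writer takes the latch and
// must see the counter drain to zero before mutating. Illustrative only.
class DrainSketch {
    final AtomicInteger readerCount = new AtomicInteger();
    volatile boolean held = false;

    // Reader path: only proceed when no writer holds the latch.
    boolean tryRead() {
        if (held) return false;        // would park on the CountDownLatch
        readerCount.incrementAndGet();
        return true;
    }

    void finishRead() { readerCount.decrementAndGet(); }

    // Writer path: take the latch, then drain readers before mutating.
    boolean tryWrite() {
        held = true;
        return readerCount.get() == 0; // real code spin-waits here
    }
}
```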

&lt;h4&gt;
  
  
  3. Commit Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Apply all operations&lt;/li&gt;
&lt;li&gt;For writers: set the latch to FREE and count it down (waking any waiting readers)&lt;/li&gt;
&lt;li&gt;Release all locks&lt;/li&gt;
&lt;li&gt;For readers: decrement &lt;code&gt;readerCount&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Remove from &lt;code&gt;GuardedTxSet&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unique Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Readers use atomic counters rather than locks, which is cheaper than full lock acquisition&lt;/li&gt;
&lt;li&gt;Writers must &lt;strong&gt;drain readers&lt;/strong&gt; before proceeding (spin on &lt;code&gt;readerCount&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The latch mechanism prevents TOCTOU bugs: the status check and the await happen against a single atomic read&lt;/li&gt;
&lt;li&gt;More OS context switches due to readers parking/unparking on &lt;code&gt;CountDownLatch&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Read Uncommitted Transactional Map
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;READ UNCOMMITTED&lt;/strong&gt; -&amp;gt; dirty reads allowed (can see uncommitted writes)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Core Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, V&amp;gt;&lt;/code&gt; -&amp;gt; The underlying map&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LockHolder&amp;lt;K, V&amp;gt;&lt;/code&gt; -&amp;gt; Per-key &lt;code&gt;ReentrantLock&lt;/code&gt;s for writes&lt;/li&gt;
&lt;li&gt;Per-transaction &lt;strong&gt;store buffer&lt;/strong&gt; (&lt;code&gt;Map&amp;lt;K, V&amp;gt;&lt;/code&gt;) -&amp;gt; local cache of writes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transaction Lifecycle
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Schedule Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All operations just record intent&lt;/li&gt;
&lt;li&gt;No locks acquired for reads OR writes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Validation Phase
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;For writes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sort write keys by &lt;code&gt;identityHashCode()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Acquire per-key locks in sorted order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For reads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No validation is needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Commit Phase
&lt;/h4&gt;

&lt;p&gt;Before committing any operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Snapshot current map size&lt;/span&gt;
&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;txMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Pre-populate store buffer with current values for each operation with a key&lt;/span&gt;
&lt;span class="n"&gt;storeBuf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then for each operation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For writes (PUT/REMOVE):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply to underlying map immediately&lt;/li&gt;
&lt;li&gt;Update store buffer with new value&lt;/li&gt;
&lt;li&gt;Track size delta:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PUT&lt;/code&gt; on new key: &lt;code&gt;delta++&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REMOVE&lt;/code&gt; on existing key: &lt;code&gt;delta--&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For reads (GET):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First check &lt;strong&gt;store buffer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If not found, perform &lt;strong&gt;dirty read&lt;/strong&gt; from underlying map&lt;/li&gt;
&lt;li&gt;Return result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For CONTAINS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check store buffer only (already populated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For SIZE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Return &lt;code&gt;snapshotSize + delta&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
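&lt;p&gt;The commit-time bookkeeping above can be sketched as follows (hypothetical names; &lt;code&gt;PUT&lt;/code&gt; only, for brevity):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

// Sketch of the read-uncommitted commit path: a per-transaction store
// buffer plus a size delta tracked against a snapshot. Illustrative only.
class DirtyTx<K, V> {
    final ConcurrentMap<K, V> map;
    final Map<K, V> storeBuf = new HashMap<>();
    final int snapshotSize;
    int delta = 0;

    DirtyTx(ConcurrentMap<K, V> map) {
        this.map = map;
        this.snapshotSize = map.size(); // snapshot before applying ops
    }

    void put(K key, V val) {
        if (map.put(key, val) == null) delta++; // new key grows the map
        storeBuf.put(key, val);                 // read-your-own-writes
    }

    V get(K key) {
        V buffered = storeBuf.get(key);
        return buffered != null ? buffered : map.get(key); // dirty read
    }

    int size() { return snapshotSize + delta; }
}
```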

&lt;h3&gt;
  
  
  Unique Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No reader blocking at all&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dirty reads&lt;/strong&gt;: you may see data that other transactions have not yet committed&lt;/li&gt;
&lt;li&gt;Store buffer provides &lt;strong&gt;read-your-own-writes&lt;/strong&gt; consistency within a transaction&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Copy-on-Write Transactional Map
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;READ COMMITTED&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Repeatable reads aren't guaranteed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Core Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AtomicReference&amp;lt;ConcurrentMap&amp;lt;K, V&amp;gt;&amp;gt;&lt;/code&gt; -&amp;gt; Ref to current map snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transaction Lifecycle
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Schedule Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Just record operations in local list&lt;/li&gt;
&lt;li&gt;No locks, no map access&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Validation Phase
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Retry loop:&lt;/strong&gt;&lt;br&gt;
Validation is a simple copy-and-CAS retry loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;txMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;                      &lt;span class="c1"&gt;// Get current snapshot&lt;/span&gt;
    &lt;span class="n"&gt;underlyingMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;          &lt;span class="c1"&gt;// COPY entire map&lt;/span&gt;
    &lt;span class="c1"&gt;// Apply all operations to local copy&lt;/span&gt;
    &lt;span class="n"&gt;txs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forEach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tryValidate&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hasWrite&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;txMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;compareAndSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentHashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;underlyingMap&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;tryValidate():&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply the operation to the local copy of the underlying map&lt;/li&gt;
&lt;li&gt;Store result in child transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If CAS fails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another transaction committed between snapshot and CAS&lt;/li&gt;
&lt;li&gt;Retry: re-copy map, re-apply operations, re-attempt CAS&lt;/li&gt;
&lt;/ul&gt;
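&lt;p&gt;Distilled to a single &lt;code&gt;put&lt;/code&gt; operation, the whole copy/apply/CAS cycle looks like this (a sketch with my own names, not the actual code):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Minimal copy-on-write commit: copy the snapshot, apply the operation,
// CAS the new map in, and retry if another transaction won the race.
class CowSketch<K, V> {
    final AtomicReference<Map<K, V>> ref = new AtomicReference<>(new HashMap<>());

    void commitPut(K key, V val) {
        Map<K, V> prev, copy;
        do {
            prev = ref.get();            // current snapshot
            copy = new HashMap<>(prev);  // copy the ENTIRE map
            copy.put(key, val);          // apply this tx's operations
        } while (!ref.compareAndSet(prev, copy)); // retry on conflict
    }

    V read(K key) { return ref.get().get(key); } // lock-free read
}
```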

&lt;h4&gt;
  
  
  3. Commit Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;CAS succeeded, so operations already applied to new map&lt;/li&gt;
&lt;li&gt;Just complete futures with stored results&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unique Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entire map copied on every write transaction&lt;/strong&gt; -&amp;gt; doesn't scale with map size&lt;/li&gt;
&lt;li&gt;Under high contention: lots of CAS retries -&amp;gt; lots of copies -&amp;gt; terrible write-heavy performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best read performance&lt;/strong&gt; when map is mostly read (no blocking, no locking)&lt;/li&gt;
&lt;li&gt;Simple implementation but hard on memory and GC under high write contention&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Flat Combined Transactional Maps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SERIALIZABLE&lt;/strong&gt; (all operations are fully serialized through the combiner)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Idea
&lt;/h3&gt;

&lt;p&gt;Instead of threads fighting for locks, they:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enqueue their operation&lt;/li&gt;
&lt;li&gt;Spin waiting for a "combiner" thread to execute it&lt;/li&gt;
&lt;li&gt;If no combiner exists, try to become one&lt;/li&gt;
&lt;li&gt;Combiner executes operations for &lt;strong&gt;all&lt;/strong&gt; waiting threads in one critical section&lt;/li&gt;
&lt;/ol&gt;
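&lt;p&gt;In miniature, the four steps look like this (a coarse sketch that uses a shared queue and &lt;code&gt;tryLock&lt;/code&gt; for combiner election rather than the paper's publication list):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Toy flat combining: publish an operation, then either wait for a
// combiner to run it or become the combiner and drain everyone's
// operations inside one critical section. Structure is illustrative.
class MiniCombiner<K, V> {
    record Op<K, V>(Function<Map<K, V>, Object> fn, CompletableFuture<Object> result) {}

    final Map<K, V> map = new HashMap<>();  // only ever touched by the combiner
    final Queue<Op<K, V>> pending = new ConcurrentLinkedQueue<>();
    final ReentrantLock combinerLock = new ReentrantLock();

    Object apply(Function<Map<K, V>, Object> fn) {
        Op<K, V> op = new Op<>(fn, new CompletableFuture<>());
        pending.add(op);                         // 1. enqueue the operation
        while (!op.result().isDone()) {          // 2. spin until someone runs it
            if (combinerLock.tryLock()) {        // 3. ...or become the combiner
                try {
                    Op<K, V> next;
                    while ((next = pending.poll()) != null) {
                        next.result().complete(next.fn().apply(map)); // 4. combine
                    }
                } finally {
                    combinerLock.unlock();
                }
            }
        }
        return op.result().join();
    }
}
```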

&lt;h3&gt;
  
  
  Implementation Variants
&lt;/h3&gt;

&lt;h4&gt;
  
  
  A. FlatCombinedTxMap (using combiners)
&lt;/h4&gt;

&lt;p&gt;Uses one of four combiner types:&lt;/p&gt;




&lt;h5&gt;
  
  
  UnboundCombiner
&lt;/h5&gt;

&lt;p&gt;This is based on the approach described in &lt;a href="https://people.csail.mit.edu/shanir/publications/Flat%20Combining%20SPAA%2010.pdf" rel="noopener noreferrer"&gt;this paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread-local &lt;code&gt;Node&amp;lt;E, R&amp;gt;&lt;/code&gt; for each thread&lt;/li&gt;
&lt;li&gt;Shared publication list (accessible via an atomic ref &lt;code&gt;Node&lt;/code&gt; head)&lt;/li&gt;
&lt;li&gt;Per-node &lt;code&gt;StatefulAction&lt;/code&gt; with:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Action&amp;lt;E, R&amp;gt; action&lt;/code&gt; -&amp;gt; a volatile field; it doubles as the memory barrier and completion signal&lt;/li&gt;
&lt;li&gt;&lt;code&gt;R result&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AtomicInteger status&lt;/code&gt; (ACTIVE/INACTIVE)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Dummy sentinel node marks end of queue&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enqueue:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isInactive&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setActive&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;prevHead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAndSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Atomic swap&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prevHead&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Combine:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;             &lt;span class="c1"&gt;// Spin until result ready&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tryLock&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;scanCombineApply&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;           &lt;span class="c1"&gt;// Process all nodes&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;enqueueIfInactive&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;          &lt;span class="c1"&gt;// Re-enqueue if removed&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;scanCombineApply:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traverse from head to DUMMY&lt;/li&gt;
&lt;li&gt;For each node with non-null action:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;node.statefulAction.apply(e)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node.action = null&lt;/code&gt; (signals completion)&lt;/li&gt;
&lt;li&gt;Set age to current count&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Every &lt;code&gt;threshold&lt;/code&gt; passes, scan and &lt;strong&gt;dequeue aged nodes&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;(currentCount - node.age) &amp;gt;= threshold&lt;/code&gt;: unlink, set INACTIVE&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Livelock Bug I Fixed:&lt;/strong&gt;&lt;br&gt;
Threads could get stuck when the combiner removed their node before they noticed their action had been applied. Fixed by forcing combiners to always re-enqueue and apply their own action rather than trusting another combiner to do it.&lt;/p&gt;


&lt;h5&gt;
  
  
  NodeCyclingCombiner (Reuses nodes)
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;Key Difference:&lt;/strong&gt; Reuses nodes instead of cleaning them up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread-local &lt;code&gt;Node&amp;lt;E, R&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Each node has &lt;code&gt;AtomicInteger status&lt;/code&gt; (NOT_COMBINER/IS_COMBINER)&lt;/li&gt;
&lt;li&gt;Shared &lt;code&gt;AtomicReference&amp;lt;Node&amp;gt;&lt;/code&gt; tail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Reset current node&lt;/span&gt;
&lt;span class="nc"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;newTail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;NOT_COMBINER&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Swap, my node becomes tail, get previous tail&lt;/span&gt;
&lt;span class="n"&gt;curNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAndSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curNode&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;              &lt;span class="c1"&gt;// Previous tail becomes my node&lt;/span&gt;

&lt;span class="n"&gt;curNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;curNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newTail&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wait to become combiner&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;NOT_COMBINER&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;//Node chain should look something like this given 3 threads(T1, T2, T3 with the inital node T0),  waiting for their result to be applied, assuming natural order&lt;/span&gt;
&lt;span class="c1"&gt;//TO -&amp;gt; T1 -&amp;gt; T2 -&amp;gt; T3 &lt;/span&gt;
&lt;span class="c1"&gt;// T3 node will be marked as the combiner when Thread 1, finishes applying &lt;/span&gt;

&lt;span class="c1"&gt;// Now I'm the combiner, traverse and apply&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;curNode&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;lazySet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;IS_COMBINER&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Make next last node in the queue combiner&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unique Property:&lt;/strong&gt; Nodes cycle between threads, no cleanup needed.&lt;/p&gt;




&lt;h5&gt;
  
  
  AtomicArrayCombiner (A bounded combiner)
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AtomicReferenceArray&amp;lt;Node&amp;gt;&lt;/code&gt; of size &lt;code&gt;capacity&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AtomicLong cellNum&lt;/code&gt; -&amp;gt; monotonically increasing cell assignment&lt;/li&gt;
&lt;li&gt;Per-thread &lt;code&gt;ThreadLocal&amp;lt;Node&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Enqueue:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cellNum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAndIncrement&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// CAS into array slot&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;compareAndSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;yield&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// Slot occupied, retry&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Combine:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;statefulAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isApplied&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tryLock&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;scanCombineApply&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;  &lt;span class="c1"&gt;// Scan entire array&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;scanCombineApply:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;curr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;curr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;statefulAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setOpaque&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Clear and apply every slot to prevent a situation where waiters spin on an unapplied node forever&lt;/span&gt;
        &lt;span class="c1"&gt;//Best when working with a fixed capacity of threads&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unique Property:&lt;/strong&gt; Fixed memory, no pointer chasing, but potential false sharing.&lt;/p&gt;




&lt;h5&gt;
  
  
  SynchronizedCombiner (Baseline)
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;lock&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;unlock&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A plain lock used as a baseline to measure whether flat combining actually helps.&lt;/p&gt;




&lt;h4&gt;
  
  
  B. SegmentedCombinedTxMap (Partitioned)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Key Difference:&lt;/strong&gt; Instead of one combiner for the entire map, &lt;strong&gt;one combiner per key + one for SIZE&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, Combiner&amp;lt;Map&amp;lt;K, V&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt; -&amp;gt; Maps each key to its own combiner&lt;/li&gt;
&lt;li&gt;Separate &lt;code&gt;sizeCombiner&lt;/code&gt; for SIZE operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Group operations by key&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="nl"&gt;keyToFuture:&lt;/span&gt;
    &lt;span class="n"&gt;combiner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getCombiner&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Get or create combiner for this key&lt;/span&gt;
    &lt;span class="n"&gt;combiner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;combine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;complete&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Separately handle SIZE operations&lt;/span&gt;
&lt;span class="n"&gt;sizeCombiner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;combine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* apply size ops */&lt;/span&gt; &lt;span class="o"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unique Properties:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better parallelism -&amp;gt; different keys can be combined concurrently&lt;/li&gt;
&lt;li&gt;More overhead -&amp;gt; one combiner per key&lt;/li&gt;
&lt;li&gt;Writes to the same key are still serialized, but writes across different keys proceed in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Surprisingly, the benchmarks show a single combiner scaling better than a segmented one. My speculation: a single combiner benefits from batching, since spinning waiters accumulate operations and applying them in one pass amortizes the cost of lock contention. Under a segmented combiner, batches rarely form, because each operation in a transaction typically targets a different key.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. MVCC Transactional Map
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot Isolation&lt;/strong&gt; -&amp;gt; each transaction reads from a consistent snapshot taken at begin time&lt;/li&gt;
&lt;li&gt;No dirty reads, no non-repeatable reads&lt;/li&gt;
&lt;li&gt;SIZE reads are dirty (no version chain maintained for size)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Idea
&lt;/h3&gt;

&lt;p&gt;Rather than blocking readers with locks, each key maintains a &lt;strong&gt;version chain&lt;/strong&gt;, an ordered list of all historical values written to that key. Each version has a &lt;code&gt;beginTs&lt;/code&gt; and &lt;code&gt;endTs&lt;/code&gt; defining the epoch range in which it is visible. Readers find the version that overlaps their snapshot epoch without acquiring any locks. Writers append new versions and conflict only with other concurrent writers on the same key.&lt;/p&gt;

&lt;p&gt;This is based on the approach described in &lt;a href="https://www.vldb.org/pvldb/vol10/p781-Wu.pdf" rel="noopener noreferrer"&gt;the VLDB paper&lt;/a&gt;.&lt;/p&gt;
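&lt;p&gt;As a minimal sketch (the &lt;code&gt;Version&lt;/code&gt; record and its fields here are illustrative stand-ins, not the actual tx-map classes), the visibility rule reduces to a half-open interval check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;// Illustrative sketch of MVCC version visibility; not the actual tx-map classes.
public class VersionVisibilitySketch {
    static final long INF = Long.MAX_VALUE;

    // A committed value plus the epoch window [beginTs, endTs) in which it is visible
    record Version&amp;lt;V&amp;gt;(long beginTs, long endTs, V value) {
        boolean visibleAt(long snapshotEpoch) {
            return snapshotEpoch &amp;gt;= beginTs &amp;amp;&amp;amp; snapshotEpoch &amp;lt; endTs;
        }
    }

    public static void main(String[] args) {
        Version&amp;lt;String&amp;gt; oldV = new Version&amp;lt;&amp;gt;(0, 101, "old");
        Version&amp;lt;String&amp;gt; newV = new Version&amp;lt;&amp;gt;(102, INF, "new");
        System.out.println(oldV.visibleAt(100)); // true:  a tBegin of 100 sees "old"
        System.out.println(newV.visibleAt(105)); // true:  a tCommit of 105 sees "new"
        System.out.println(oldV.visibleAt(105)); // false: "old" was closed at epoch 101
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;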

&lt;h3&gt;
  
  
  Core Data Structures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, VersionChain&amp;lt;V&amp;gt;&amp;gt;&lt;/code&gt; — maps each key to its version chain&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentMap&amp;lt;K, KeyStatus&amp;gt;&lt;/code&gt; — per-key write lock (CAS-based, not a real lock)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EpochTracker&lt;/code&gt; — global epoch counter, tracks active transaction begin epochs for GC&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCThread&lt;/code&gt; — background thread that prunes unreachable versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AtomicInteger(size)&lt;/code&gt; — dirty global size counter&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transaction Lifecycle
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Begin Phase
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;tBegin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epochTracker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentEpoch&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Snapshot epoch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transaction records the current global epoch as its &lt;code&gt;tBegin&lt;/code&gt;. This is the epoch from which it reads: any version visible at &lt;code&gt;tBegin&lt;/code&gt; is part of its snapshot. The epoch tracker also registers this &lt;code&gt;tBegin&lt;/code&gt; so the GC knows the oldest epoch still in use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For writes (PUT/REMOVE):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attempt to acquire the &lt;code&gt;KeyStatus&lt;/code&gt; write lock for that key via CAS&lt;/li&gt;
&lt;li&gt;If the key is already locked by another transaction, &lt;strong&gt;we abort immediately&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Check that &lt;code&gt;tBegin&lt;/code&gt; still overlaps the latest version on the chain (late-arriving transaction check):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!(&lt;/span&gt;&lt;span class="n"&gt;tBegin&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;beginTs&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;tBegin&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endTs&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If both checks pass, we record the write operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For reads (GET/CONTAINS):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if the key's &lt;code&gt;KeyStatus&lt;/code&gt; is held by another transaction; if so, abort&lt;/li&gt;
&lt;li&gt;Snapshot the overlapping version at &lt;code&gt;tBegin&lt;/code&gt; immediately:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;versionChain&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;findOverlap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tBegin&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Record the read operation and the version seen&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Validation Phase
&lt;/h4&gt;

&lt;p&gt;At commit time, a &lt;code&gt;tCommit&lt;/code&gt; epoch is assigned via &lt;code&gt;epochTracker.newEpoch()&lt;/code&gt;. Each read operation then sequentially re-checks whether the version it saw at &lt;code&gt;tBegin&lt;/code&gt; is still the overlapping version at &lt;code&gt;tCommit&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Version&lt;/span&gt; &lt;span class="n"&gt;overlapAtCommit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;versionChain&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;findOverlap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tCommit&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;overlapAtCommit&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;abort&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Someone wrote to this key between tBegin and tCommit&lt;/span&gt;
&lt;span class="c1"&gt;//A quick example&lt;/span&gt;
&lt;span class="c1"&gt;//tBegin = 100, tCommit = 105&lt;/span&gt;
&lt;span class="c1"&gt;//key = "A", versions = [(0,101,"old"), (102,INF,"new")] //INF = INFINITY&lt;/span&gt;
&lt;span class="c1"&gt;// tBegin should see version "old", while tCommit should see version "new"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches read-write conflicts: if any key you read was written by a concurrent transaction between your begin and commit, you abort.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Commit Phase
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;For each write operation: append a new version to the version chain with &lt;code&gt;beginTs = tCommit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set the previous latest version's &lt;code&gt;endTs = tCommit&lt;/code&gt; (closing its visibility window)&lt;/li&gt;
&lt;li&gt;Release all &lt;code&gt;KeyStatus&lt;/code&gt; write locks&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;epochTracker.leaveEpoch(tBegin)&lt;/code&gt; so GC knows this epoch might no longer be visible&lt;/li&gt;
&lt;li&gt;If the version chain depth crosses the configured threshold, submit a cleanup request to the GC thread&lt;/li&gt;
&lt;/ul&gt;
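&lt;p&gt;The install step above can be sketched roughly like this (a hedged sketch: &lt;code&gt;installWrite&lt;/code&gt; and the mutable &lt;code&gt;endTs&lt;/code&gt; field are my stand-ins for the real classes, and the &lt;code&gt;KeyStatus&lt;/code&gt; write lock is assumed to already be held):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.ArrayDeque;
import java.util.Deque;

public class CommitSketch {
    static final class Version&amp;lt;V&amp;gt; {
        final long beginTs;
        final V value;
        volatile long endTs;
        Version(long beginTs, long endTs, V value) {
            this.beginTs = beginTs; this.endTs = endTs; this.value = value;
        }
    }

    // Install one written key at commit time; the per-key write lock is assumed held
    static &amp;lt;V&amp;gt; void installWrite(Deque&amp;lt;Version&amp;lt;V&amp;gt;&amp;gt; chain, V newValue, long tCommit) {
        Version&amp;lt;V&amp;gt; latest = chain.peekLast();
        if (latest != null) latest.endTs = tCommit; // close the previous visibility window
        chain.addLast(new Version&amp;lt;&amp;gt;(tCommit, Long.MAX_VALUE, newValue)); // open the new one
    }

    public static void main(String[] args) {
        Deque&amp;lt;Version&amp;lt;String&amp;gt;&amp;gt; chain = new ArrayDeque&amp;lt;&amp;gt;();
        installWrite(chain, "old", 1);
        installWrite(chain, "new", 102);
        System.out.println(chain.peekFirst().endTs); // 102: the old window was closed at tCommit
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;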




&lt;h3&gt;
  
  
  Version Chains
&lt;/h3&gt;

&lt;p&gt;Each key's history is stored in a &lt;code&gt;VersionChain&amp;lt;V&amp;gt;&lt;/code&gt;. Two independent implementations exist.&lt;/p&gt;

&lt;h4&gt;
  
  
  QueueVersionChain
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Backed by a &lt;code&gt;ConcurrentLinkedDeque&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;findOverlap()&lt;/code&gt; does a &lt;strong&gt;descending linear scan&lt;/strong&gt;: it starts from the tail of the deque, where newer versions live, cutting traversal time for recent reads&lt;/li&gt;
&lt;li&gt;Size tracked with an &lt;code&gt;AtomicLong&lt;/code&gt; to avoid O(N) &lt;code&gt;size()&lt;/code&gt; calls on the deque&lt;/li&gt;
&lt;li&gt;Better write performance, lower overhead per version&lt;/li&gt;
&lt;/ul&gt;
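&lt;p&gt;A rough sketch of the descending scan (the &lt;code&gt;Version&lt;/code&gt; record and method shape are assumptions, not the real chain's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedDeque;

public class QueueChainSketch {
    record Version&amp;lt;V&amp;gt;(long beginTs, long endTs, V value) {}

    // Newer versions sit at the tail, so a descending scan usually exits early
    static &amp;lt;V&amp;gt; Version&amp;lt;V&amp;gt; findOverlap(ConcurrentLinkedDeque&amp;lt;Version&amp;lt;V&amp;gt;&amp;gt; chain, long tBegin) {
        Iterator&amp;lt;Version&amp;lt;V&amp;gt;&amp;gt; it = chain.descendingIterator();
        while (it.hasNext()) {
            Version&amp;lt;V&amp;gt; v = it.next();
            if (tBegin &amp;gt;= v.beginTs() &amp;amp;&amp;amp; tBegin &amp;lt; v.endTs()) return v;
        }
        return null; // nothing visible at tBegin
    }

    public static void main(String[] args) {
        ConcurrentLinkedDeque&amp;lt;Version&amp;lt;String&amp;gt;&amp;gt; chain = new ConcurrentLinkedDeque&amp;lt;&amp;gt;();
        chain.addLast(new Version&amp;lt;&amp;gt;(0, 101, "old"));
        chain.addLast(new Version&amp;lt;&amp;gt;(102, Long.MAX_VALUE, "new"));
        System.out.println(findOverlap(chain, 100).value()); // old
        System.out.println(findOverlap(chain, 105).value()); // new
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;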

&lt;h4&gt;
  
  
  NavigableVersionChain
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Backed by a &lt;code&gt;ConcurrentSkipListMap&lt;/code&gt; keyed by &lt;code&gt;beginTs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;findOverlap()&lt;/code&gt; uses &lt;code&gt;floorEntry(tBegin)&lt;/code&gt;, O(log N)&lt;/li&gt;
&lt;li&gt;More expensive writes due to skip list insertion&lt;/li&gt;
&lt;li&gt;Pays off as version chains grow longer under sustained write load&lt;/li&gt;
&lt;/ul&gt;
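&lt;p&gt;The &lt;code&gt;floorEntry&lt;/code&gt; lookup can be sketched like this (again with an illustrative &lt;code&gt;Version&lt;/code&gt; record, not the real implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class NavigableChainSketch {
    record Version&amp;lt;V&amp;gt;(long beginTs, long endTs, V value) {}

    // O(log N): newest version with beginTs &amp;lt;= tBegin, kept only if its window is still open
    static &amp;lt;V&amp;gt; Version&amp;lt;V&amp;gt; findOverlap(ConcurrentSkipListMap&amp;lt;Long, Version&amp;lt;V&amp;gt;&amp;gt; chain, long tBegin) {
        Map.Entry&amp;lt;Long, Version&amp;lt;V&amp;gt;&amp;gt; e = chain.floorEntry(tBegin);
        if (e == null) return null;
        Version&amp;lt;V&amp;gt; v = e.getValue();
        return tBegin &amp;lt; v.endTs() ? v : null;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap&amp;lt;Long, Version&amp;lt;String&amp;gt;&amp;gt; chain = new ConcurrentSkipListMap&amp;lt;&amp;gt;();
        chain.put(0L, new Version&amp;lt;&amp;gt;(0, 101, "old"));
        chain.put(102L, new Version&amp;lt;&amp;gt;(102, Long.MAX_VALUE, "new"));
        System.out.println(findOverlap(chain, 100).value()); // old
        System.out.println(findOverlap(chain, 101)); // null: epoch 101 falls in the gap between windows
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;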

&lt;p&gt;Both implementations maintain a &lt;code&gt;MinVisibleEpoch&lt;/code&gt; cache to short-circuit GC scans when no prunable versions exist, a small optimization that avoids a full traversal when nothing has changed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Epoch Tracking
&lt;/h3&gt;

&lt;p&gt;The epoch tracker serves two purposes: assigning monotonically increasing commit epochs, and tracking the minimum &lt;code&gt;tBegin&lt;/code&gt; of all active transactions so the GC knows which versions are safe to delete.&lt;/p&gt;

&lt;p&gt;Three implementations exist:&lt;/p&gt;

&lt;h4&gt;
  
  
  DefaultEpochTracker
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentHashMap&amp;lt;Long, AtomicLong&amp;gt;&lt;/code&gt; mapping epoch -&amp;gt; active transaction count&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;currentEpoch()&lt;/code&gt; registers the transaction in the map and increments the counter&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;leaveEpoch()&lt;/code&gt; decrements; removes the entry when count hits zero&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minVisibleEpoch()&lt;/code&gt; streams over the key set to find the minimum&lt;/li&gt;
&lt;li&gt;Suitable for virtual threads since keys are shared across threads&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Long2LongEpochTracker
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Synchronized &lt;code&gt;Long2LongOpenHashMap&lt;/code&gt; (from fastutil) mapping thread ID -&amp;gt; current epoch&lt;/li&gt;
&lt;li&gt;No boxing on values, which avoids allocation pressure on the hot path&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;leaveEpoch()&lt;/code&gt; writes a sentinel value (&lt;code&gt;-1&lt;/code&gt;) rather than removing the entry&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minVisibleEpoch()&lt;/code&gt; scans values, skipping sentinels&lt;/li&gt;
&lt;li&gt;This is best paired with pooled platform threads&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  LongToArrayEpochTracker &lt;em&gt;(default)&lt;/em&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ConcurrentHashMap&amp;lt;Long, long[]&amp;gt;&lt;/code&gt; mapping thread ID to a single-element &lt;code&gt;long[]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Same thread-keyed design as &lt;code&gt;Long2LongEpochTracker&lt;/code&gt;, but avoids boxing via the &lt;code&gt;long[]&lt;/code&gt; trick without pulling in fastutil&lt;/li&gt;
&lt;li&gt;This is best paired with pooled platform threads&lt;/li&gt;
&lt;/ul&gt;
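&lt;p&gt;A minimal sketch of the &lt;code&gt;long[]&lt;/code&gt; trick (method names are mine, and the real tracker presumably adds stronger memory ordering than this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.concurrent.ConcurrentHashMap;

public class LongArrayTrackerSketch {
    static final long NOT_ACTIVE = -1L;

    // thread ID -&amp;gt; single-element long[] holding that thread's current epoch.
    // Each array is written only by its owning thread, so there is no boxing and no CAS.
    final ConcurrentHashMap&amp;lt;Long, long[]&amp;gt; epochs = new ConcurrentHashMap&amp;lt;&amp;gt;();

    void enterEpoch(long epoch) {
        epochs.computeIfAbsent(Thread.currentThread().threadId(), id -&amp;gt; new long[]{NOT_ACTIVE})[0] = epoch;
    }

    void leaveEpoch() {
        long[] slot = epochs.get(Thread.currentThread().threadId());
        if (slot != null) slot[0] = NOT_ACTIVE; // sentinel write instead of removal
    }

    long minVisibleEpoch() {
        long min = Long.MAX_VALUE;
        for (long[] slot : epochs.values()) {
            long e = slot[0];
            if (e != NOT_ACTIVE &amp;amp;&amp;amp; e &amp;lt; min) min = e;
        }
        return min;
    }

    public static void main(String[] args) {
        LongArrayTrackerSketch t = new LongArrayTrackerSketch();
        t.enterEpoch(7);
        System.out.println(t.minVisibleEpoch()); // 7
        t.leaveEpoch();
        System.out.println(t.minVisibleEpoch() == Long.MAX_VALUE); // true: no active transactions
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;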




&lt;h3&gt;
  
  
  GC Thread
&lt;/h3&gt;

&lt;p&gt;Version chains grow unboundedly without cleanup. The GC thread handles pruning old versions that no transaction can ever see again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single platform daemon thread continuously drains a bounded queue (&lt;code&gt;LinkedBlockingQueue&lt;/code&gt;) of cleanup requests&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;ScheduledExecutorService&lt;/code&gt; (virtual thread) refreshes a cached &lt;code&gt;minVisibleEpoch&lt;/code&gt; from an &lt;code&gt;EpochTracker&lt;/code&gt; every 100ms&lt;/li&gt;
&lt;li&gt;Writer transactions submit a cleanup request when &lt;code&gt;versionChain.size() % threshold == 0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why cached epoch reads:&lt;/strong&gt;&lt;br&gt;
Reading &lt;code&gt;minVisibleEpoch()&lt;/code&gt; on every write-transaction commit was a hotspot: it scans the epoch tracker under contention. Decoupling this into a scheduled read trades precision for significantly lower write-path overhead, though versions may survive slightly longer than necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pruning logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A version is prunable if:&lt;/span&gt;
&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endTs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The latest version is always preserved regardless of its timestamps, since new transactions may still need it. The &lt;code&gt;MinVisibleEpoch&lt;/code&gt; cache per version chain short-circuits the scan entirely if no version has an &lt;code&gt;endTs&lt;/code&gt; below the current GC epoch.&lt;/p&gt;




&lt;h3&gt;
  
  
  Unique Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Readers never acquire locks&lt;/strong&gt; -&amp;gt; unlike the optimistic and pessimistic implementations where readers hold read locks or increment atomic counters, MVCC readers are completely non-blocking. A read is just a version chain traversal with no shared mutable state touched&lt;/li&gt;
&lt;li&gt;Write-write conflicts are detected at commit time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abort-on-conflict instead of wait-on-conflict&lt;/strong&gt; -&amp;gt; when a write lock is already held, the transaction aborts immediately rather than parking. This keeps latency bounded, but abort rates climb sharply under high write contention, unlike the other transactional map implementations (except copy-on-write), which force writers to wait&lt;/li&gt;
&lt;li&gt;The queue chain favors writes (O(1) append, cheap traversal for short chains) while the navigable chain favors reads (O(log N) lookup) at the cost of more expensive skip list insertions.&lt;/li&gt;
&lt;li&gt;The shared-epoch &lt;code&gt;DefaultEpochTracker&lt;/code&gt; works correctly with any thread model but contends on &lt;code&gt;computeIfAbsent()&lt;/code&gt; calls, which can hurt performance when frequent. The thread-keyed trackers (&lt;code&gt;LongToArray&lt;/code&gt;, &lt;code&gt;Long2Long&lt;/code&gt;) eliminate that contention entirely, since each key is owned by exactly one thread&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Isolation Level&lt;/th&gt;
&lt;th&gt;When Locks Acquired&lt;/th&gt;
&lt;th&gt;Readers Block Writers?&lt;/th&gt;
&lt;th&gt;Writers Block Readers?&lt;/th&gt;
&lt;th&gt;Probably Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimistic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;READ COMMITTED (per-key SERIALIZABLE)&lt;/td&gt;
&lt;td&gt;Reads: eager, Writes: lazy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Balanced workloads with strong consistency needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pessimistic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;READ COMMITTED (per-key SERIALIZABLE)&lt;/td&gt;
&lt;td&gt;Both lazy&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Similar to optimistic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Uncommitted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;READ UNCOMMITTED&lt;/td&gt;
&lt;td&gt;Writes only at validation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Write-heavy and weak consistency is OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Copy-on-Write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;READ COMMITTED&lt;/td&gt;
&lt;td&gt;Never (uses CAS)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Read heavy or small map size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flat Combined&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SERIALIZABLE&lt;/td&gt;
&lt;td&gt;N/A (combiners)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;High contention, batch benefits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Segmented Flat Combined&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SERIALIZABLE&lt;/td&gt;
&lt;td&gt;N/A (per-key combiners)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Outperformed by a single flat combined map&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MVCC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SNAPSHOT&lt;/td&gt;
&lt;td&gt;Reads: eager, Writes: eager&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Best for read heavy scenarios&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Links/References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MVCC Paper:&lt;/strong&gt;  &lt;a href="https://www.vldb.org/pvldb/vol10/p781-Wu.pdf" rel="noopener noreferrer"&gt;https://www.vldb.org/pvldb/vol10/p781-Wu.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat Combining Paper:&lt;/strong&gt; &lt;a href="https://people.csail.mit.edu/shanir/publications/Flat%20Combining%20SPAA%2010.pdf" rel="noopener noreferrer"&gt;https://people.csail.mit.edu/shanir/publications/Flat%20Combining%20SPAA%2010.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactional Collections Paper:&lt;/strong&gt; &lt;a href="https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf" rel="noopener noreferrer"&gt;https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitHub:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Optimistic Tx-Map: &lt;a href="https://github.com/kusoroadeolu/tx-map" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pessimistic/Copy-On-Write/Read Uncommitted Tx-Map: &lt;a href="https://github.com/kusoroadeolu/tx-map/tree/pessimistic-txmap" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map/tree/pessimistic-txmap&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Flat Combining Tx-Map: &lt;a href="https://github.com/kusoroadeolu/tx-map/tree/fc-txmap" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map/tree/fc-txmap&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MVCC Tx-Map: &lt;a href="https://github.com/kusoroadeolu/tx-map/tree/mvcc-txmap" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map/tree/mvcc-txmap&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>java</category>
      <category>concurrency</category>
      <category>database</category>
      <category>datastructure</category>
    </item>
    <item>
      <title>Improving my MVCC Transactional Map</title>
      <dc:creator>Kush V</dc:creator>
      <pubDate>Sat, 14 Mar 2026 00:57:12 +0000</pubDate>
      <link>https://dev.to/kusoroadeolu/improving-my-mvcc-transactional-map-2f7j</link>
      <guid>https://dev.to/kusoroadeolu/improving-my-mvcc-transactional-map-2f7j</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;This post walks through the performance journey of an MVCC transactional map I've been building in Java, from a few thousand ops/s to a few million. It's mostly numbers and profiling observations, so if that's not your thing, fair warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The original writeup can be found &lt;a href="https://github.com/kusoroadeolu/articles/blob/main/transactional-maps/mvcc-optimization.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;JMH version: 1.37&lt;/li&gt;
&lt;li&gt;JVM: Java 25, HotSpot 64-Bit Server VM (25+37-LTS-3491)&lt;/li&gt;
&lt;li&gt;Benchmark mode: Throughput (ops/s)&lt;/li&gt;
&lt;li&gt;Warmup: 10 iterations × 1s each&lt;/li&gt;
&lt;li&gt;Measurement: 5 iterations × 1s each&lt;/li&gt;
&lt;li&gt;Forks: 2&lt;/li&gt;
&lt;li&gt;Thread configuration: 1, 2, 4, 8 (platform threads)&lt;/li&gt;
&lt;li&gt;CPU: Intel(R) Core(TM) i5-10300H @ 2.50GHz, 8 cores&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Journey
&lt;/h2&gt;

&lt;p&gt;Initially my MVCC txMap had good read throughput and decent write throughput, but the error margins on the write numbers were bad, so I decided to investigate. While investigating, I encountered an issue. A quick note before we continue: &lt;code&gt;ActiveTransactions&lt;/code&gt; just keeps tabs on all currently active transactions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;//A map keeping track of all active transactions&lt;/span&gt;
&lt;span class="nc"&gt;ActiveTransactions&lt;/span&gt; &lt;span class="n"&gt;activeTxns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mvccMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;activeTransactions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;//Copied the entire map on active txns, could be thousands &lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;activeTxns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findMinVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
&lt;span class="n"&gt;versionChain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;removeUnreachableVersions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ActiveTransactions method call&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;findMinActiveEpoch&lt;/span&gt;&lt;span class="o"&gt;(){&lt;/span&gt;
    &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;//Copy to set to prevent duplicates and min traversals, caused an OOME!!&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
            &lt;span class="n"&gt;min&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line of code caused an OOME under contention: the active transactions map can grow very large, and copying and iterating it every time we want to prune an old version is bound to cause problems. To fix this, I used a single writer that automatically tracks the minimum visible epoch (tBegin) across all active transactions. The main tradeoff was that under contention the single writer could not keep up, but it was still better than copying the map on every write transaction.&lt;/p&gt;
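&lt;p&gt;A minimal sketch of that single-writer idea (the class and event names below are mine, not the project's): begin/end events go onto a queue, one dedicated thread maintains a multiset of begin epochs, and everyone else reads the current minimum through a volatile field, so nothing ever copies the active-transactions map.&lt;/p&gt;

```java
import java.util.TreeMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hedged sketch, not the project's actual class.
class SingleWriterEpochTracker {
    record Event(long tBegin, boolean begin) {}

    private final LinkedBlockingQueue<Event> events = new LinkedBlockingQueue<>();
    private final TreeMap<Long, Integer> beginEpochs = new TreeMap<>(); // touched only by the writer thread
    private volatile long minVisibleEpoch = Long.MAX_VALUE;

    SingleWriterEpochTracker() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Event e = events.take();
                    if (e.begin()) {
                        beginEpochs.merge(e.tBegin(), 1, Integer::sum);
                    } else {
                        // Drop the epoch once its last transaction ends.
                        beginEpochs.compute(e.tBegin(), (k, c) -> c == null || c <= 1 ? null : c - 1);
                    }
                    minVisibleEpoch = beginEpochs.isEmpty() ? Long.MAX_VALUE : beginEpochs.firstKey();
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    void onBegin(long tBegin) { events.offer(new Event(tBegin, true)); }  // txn begin
    void onEnd(long tBegin)   { events.offer(new Event(tBegin, false)); } // commit/abort
    long minVisibleEpoch()    { return minVisibleEpoch; }                 // lock-free read
}
```

&lt;p&gt;The queue is exactly where the tradeoff above shows up: under heavy contention the writer thread falls behind the event stream, so the published minimum lags, but the write path itself stays cheap.&lt;/p&gt;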

&lt;p&gt;However, write throughput dropped to ~20k ops/s after the single writer fix, so I profiled to find out why. The culprit was &lt;code&gt;SerialTransactionKeeper#remove&lt;/code&gt;. Instead of submitting a remove request to the queue, it was removing a non-existent entry, meaning every transaction commit was doing an O(N) traversal for nothing. And since nothing was actually being removed from the map, the &lt;code&gt;minVisibleEpoch&lt;/code&gt; never advanced, so the GC epoch reclamation was doing meaningless work the entire time. Fixing this brought numbers back to a reasonable range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,248.762&lt;/td&gt;
&lt;td&gt;± 921.068&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,205.265&lt;/td&gt;
&lt;td&gt;± 413.555&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,407.078&lt;/td&gt;
&lt;td&gt;± 528.927&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,566.666&lt;/td&gt;
&lt;td&gt;± 610.823&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,021.587&lt;/td&gt;
&lt;td&gt;± 487.228&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,475.348&lt;/td&gt;
&lt;td&gt;± 388.932&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,693.183&lt;/td&gt;
&lt;td&gt;± 532.851&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;13,806.692&lt;/td&gt;
&lt;td&gt;± 4,179.536&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;454,138.457&lt;/td&gt;
&lt;td&gt;± 92,558.035&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;449,543.650&lt;/td&gt;
&lt;td&gt;± 226,056.899&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;623,248.848&lt;/td&gt;
&lt;td&gt;± 177,717.798&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;464,862.805&lt;/td&gt;
&lt;td&gt;± 234,641.969&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;369,462.166&lt;/td&gt;
&lt;td&gt;± 79,723.306&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;445,274.916&lt;/td&gt;
&lt;td&gt;± 171,877.221&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;728,382.685&lt;/td&gt;
&lt;td&gt;± 174,790.985&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;902,537.739&lt;/td&gt;
&lt;td&gt;± 93,291.275&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After another round of profiling, I realized I was calling &lt;code&gt;findOverlap()&lt;/code&gt; frequently on my read-heavy transactions, and each call is an O(n) traversal that only gets worse as the version queue per key grows. So I switched to a navigable map as my version chain, reducing the traversal per call to O(log n) at the cost of more expensive writes, and the numbers showed significant improvement.&lt;/p&gt;
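&lt;p&gt;For illustration, assuming a version chain keyed by commit epoch (the class below is a sketch with my own names, not the project's code), a navigable map turns the visibility lookup into a single &lt;code&gt;floorEntry&lt;/code&gt; call:&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch: versions keyed by commit epoch in a navigable map,
// so a snapshot read is floorEntry -> O(log n) instead of a linear scan.
class VersionChain<V> {
    private final ConcurrentSkipListMap<Long, V> versions = new ConcurrentSkipListMap<>();

    void commit(long epoch, V value) {
        versions.put(epoch, value); // writes pay the skip-list insertion cost
    }

    // Latest version visible to a transaction that began at tBegin, or null.
    V visibleAt(long tBegin) {
        Map.Entry<Long, V> e = versions.floorEntry(tBegin);
        return e == null ? null : e.getValue();
    }

    // Drop versions no active transaction can still see, keeping the newest
    // version at or below minVisibleEpoch as the visible baseline.
    void removeUnreachableVersions(long minVisibleEpoch) {
        Long keep = versions.floorKey(minVisibleEpoch);
        if (keep != null) {
            versions.headMap(keep, false).clear();
        }
    }
}
```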

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;601,083.402&lt;/td&gt;
&lt;td&gt;± 165,550.604&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;642,442.703&lt;/td&gt;
&lt;td&gt;± 187,617.695&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;688,103.749&lt;/td&gt;
&lt;td&gt;± 89,658.431&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;690,432.423&lt;/td&gt;
&lt;td&gt;± 101,032.453&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;454,926.218&lt;/td&gt;
&lt;td&gt;± 184,083.441&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;594,926.342&lt;/td&gt;
&lt;td&gt;± 92,822.567&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;836,488.562&lt;/td&gt;
&lt;td&gt;± 80,008.830&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;979,821.239&lt;/td&gt;
&lt;td&gt;± 89,332.281&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After looking through my code again, I realized I never actually started the worker thread for my &lt;code&gt;SerialTransactionKeeper&lt;/code&gt; class. After starting the thread and benchmarking again, I was basically back to square one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;37,911.129&lt;/td&gt;
&lt;td&gt;± 13,675.765&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;89,182.952&lt;/td&gt;
&lt;td&gt;± 16,428.530&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,197,666.295&lt;/td&gt;
&lt;td&gt;± 370,356.847&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;868,822.634&lt;/td&gt;
&lt;td&gt;± 334,063.540&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,849.386&lt;/td&gt;
&lt;td&gt;± 1,326.803&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;9,548.036&lt;/td&gt;
&lt;td&gt;± 6,833.268&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;25,872.105&lt;/td&gt;
&lt;td&gt;± 13,197.781&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,132,311.163&lt;/td&gt;
&lt;td&gt;± 285,098.340&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reading through my profile data, I realized that garbage collecting on a writer transaction thread was causing a lot of inconsistency across all benchmarks and thread counts, so I moved it to a separate worker thread. Now a writer txn submits a cleanup request to the GC thread when the version chain depth reaches a certain threshold.&lt;/p&gt;
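&lt;p&gt;A rough sketch of that handoff (the names and the threshold value are my assumptions, not the project's API): the writer path only enqueues a cleanup request once the chain is deep enough, and a dedicated daemon thread drains the queue and does the pruning.&lt;/p&gt;

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of moving version-chain GC off the writer path.
class GcWorker {
    record CleanupRequest(String key, long minVisibleEpoch) {}

    static final int DEPTH_THRESHOLD = 32; // assumed value

    private final LinkedBlockingQueue<CleanupRequest> queue = new LinkedBlockingQueue<>();

    GcWorker(Consumer<CleanupRequest> pruner) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    pruner.accept(queue.take()); // pruning happens off the writer path
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called on the writer path after a commit; cheap unless the chain is deep.
    void maybeSubmit(String key, int chainDepth, long minVisibleEpoch) {
        if (chainDepth >= DEPTH_THRESHOLD) {
            queue.offer(new CleanupRequest(key, minVisibleEpoch));
        }
    }
}
```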

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;864,770.978&lt;/td&gt;
&lt;td&gt;± 123,348.212&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,187,432.216&lt;/td&gt;
&lt;td&gt;± 171,323.762&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;827,029.790&lt;/td&gt;
&lt;td&gt;± 525,292.066&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;661,419.334&lt;/td&gt;
&lt;td&gt;± 215,284.213&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;476,757.401&lt;/td&gt;
&lt;td&gt;± 91,269.885&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;622,144.154&lt;/td&gt;
&lt;td&gt;± 155,951.737&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;767,387.142&lt;/td&gt;
&lt;td&gt;± 171,878.778&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;819,022.141&lt;/td&gt;
&lt;td&gt;± 337,489.007&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As a minor optimization, I decided to cache the &lt;code&gt;minVisibleEpoch&lt;/code&gt; timestamp in my version chain, to prevent redundant iterations through version chains.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,384,520.967&lt;/td&gt;
&lt;td&gt;± 151,058.220&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,559,840.856&lt;/td&gt;
&lt;td&gt;± 361,048.003&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,328,777.375&lt;/td&gt;
&lt;td&gt;± 619,468.965&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;843,850.257&lt;/td&gt;
&lt;td&gt;± 241,581.930&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;527,538.045&lt;/td&gt;
&lt;td&gt;± 160,021.700&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;836,844.450&lt;/td&gt;
&lt;td&gt;± 95,255.683&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;944,841.844&lt;/td&gt;
&lt;td&gt;± 321,298.094&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;764,114.028&lt;/td&gt;
&lt;td&gt;± 472,135.579&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Looking at my profiled data again, I realized submitting requests to the GC thread was still a hotspot, so I spread out how often requests are submitted: cleanup requests now go out only when &lt;code&gt;depth % threshold == 0&lt;/code&gt; rather than every time &lt;code&gt;depth &amp;gt; threshold&lt;/code&gt;, spacing out GC cleanup submissions across writes.&lt;/p&gt;
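&lt;p&gt;The difference between the two policies is easy to see in isolation (the threshold value here is assumed): the old check fires on every write once the chain passes the threshold, while the modulo check fires once per &lt;code&gt;threshold&lt;/code&gt; writes.&lt;/p&gt;

```java
// Contrast of the two submission policies, as standalone predicates.
class GcSubmitPolicy {
    // Old policy: fires on every write once the chain is past the threshold.
    static boolean submitEveryTime(int depth, int threshold) {
        return depth > threshold;
    }

    // New policy: fires once every `threshold` writes.
    static boolean submitSpacedOut(int depth, int threshold) {
        return depth > 0 && depth % threshold == 0;
    }
}
```

&lt;p&gt;With a threshold of 32 and 128 consecutive writes to one key, the first policy would submit 96 cleanup requests while the second submits only 4.&lt;/p&gt;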

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,499,641.094&lt;/td&gt;
&lt;td&gt;± 186,580.267&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,800,303.740&lt;/td&gt;
&lt;td&gt;± 160,148.052&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,420,702.483&lt;/td&gt;
&lt;td&gt;± 566,203.194&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;828,145.076&lt;/td&gt;
&lt;td&gt;± 393,441.472&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;654,820.291&lt;/td&gt;
&lt;td&gt;± 132,849.682&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;895,029.640&lt;/td&gt;
&lt;td&gt;± 125,955.480&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;987,045.255&lt;/td&gt;
&lt;td&gt;± 437,727.470&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;853,466.979&lt;/td&gt;
&lt;td&gt;± 556,489.979&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Write numbers improved slightly and error margins tightened up. Next I tried segmenting my active transactions class just to measure the difference, and the results were a bit surprising.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;601,762.093&lt;/td&gt;
&lt;td&gt;± 137,771.491&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,005,846.389&lt;/td&gt;
&lt;td&gt;± 96,703.066&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,700,100.197&lt;/td&gt;
&lt;td&gt;± 144,301.351&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,321,714.189&lt;/td&gt;
&lt;td&gt;± 346,860.821&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;321,688.060&lt;/td&gt;
&lt;td&gt;± 95,079.876&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;549,510.063&lt;/td&gt;
&lt;td&gt;± 160,634.881&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;826,926.873&lt;/td&gt;
&lt;td&gt;± 149,437.015&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,026,862.457&lt;/td&gt;
&lt;td&gt;± 520,060.221&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My scaling basically inverted: for both operations, throughput was now lowest at one thread and highest at more than 2 threads.&lt;/p&gt;

&lt;p&gt;I then realized that I'd been using virtual threads as my background worker threads, which is a terrible idea and was probably contributing to the high variance, so I switched the background workers to platform threads instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Segmented Transaction Keeper&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;543,983.955&lt;/td&gt;
&lt;td&gt;± 133,134.608&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,020,562.096&lt;/td&gt;
&lt;td&gt;± 131,954.153&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,632,501.353&lt;/td&gt;
&lt;td&gt;± 197,194.523&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,444,733.689&lt;/td&gt;
&lt;td&gt;± 652,337.205&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;335,078.800&lt;/td&gt;
&lt;td&gt;± 128,993.334&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;580,449.127&lt;/td&gt;
&lt;td&gt;± 110,305.596&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;864,819.719&lt;/td&gt;
&lt;td&gt;± 159,350.383&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,120,983.469&lt;/td&gt;
&lt;td&gt;± 783,465.405&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Serial Transaction Keeper&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,251,410.402&lt;/td&gt;
&lt;td&gt;± 170,242.439&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,619,555.645&lt;/td&gt;
&lt;td&gt;± 107,851.864&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,345,362.000&lt;/td&gt;
&lt;td&gt;± 445,030.583&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;784,240.432&lt;/td&gt;
&lt;td&gt;± 407,925.752&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;640,249.753&lt;/td&gt;
&lt;td&gt;± 110,217.595&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;813,339.356&lt;/td&gt;
&lt;td&gt;± 155,722.320&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,011,854.484&lt;/td&gt;
&lt;td&gt;± 452,289.869&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;927,350.663&lt;/td&gt;
&lt;td&gt;± 194,550.049&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Moving on from that minor hiccup, I changed my version-tracking strategy to map-based epoch tracking rather than a single-writer background thread, since the latter had become a bottleneck. The results improved significantly, but oddly, throughput at 2 threads was very bad, and I had no idea why.&lt;/p&gt;
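&lt;p&gt;By map-based epoch tracking I mean something along these lines (a sketch with illustrative names): each transaction registers its begin epoch directly in a concurrent map, and the minimum is found by scanning the values. There is no background writer to fall behind, but the scan itself is O(n).&lt;/p&gt;

```java
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of map-based epoch tracking; identifiers are mine.
class MapEpochTracker {
    private final ConcurrentHashMap<Long, Long> beginEpochs = new ConcurrentHashMap<>(); // txnId -> tBegin

    void register(long txnId, long tBegin) { beginEpochs.put(txnId, tBegin); }
    void deregister(long txnId)            { beginEpochs.remove(txnId); }

    long minVisibleEpoch() {
        long min = Long.MAX_VALUE;
        for (long tBegin : beginEpochs.values()) { // full scan: the hotspot found later
            if (tBegin < min) min = tBegin;
        }
        return min;
    }
}
```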

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,191,449.356&lt;/td&gt;
&lt;td&gt;± 405,088.856&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;195,198.411&lt;/td&gt;
&lt;td&gt;± 37,480.037&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,733,823.527&lt;/td&gt;
&lt;td&gt;± 526,463.941&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,832,819.684&lt;/td&gt;
&lt;td&gt;± 791,310.719&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;898,844.327&lt;/td&gt;
&lt;td&gt;± 184,090.618&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;58,322.618&lt;/td&gt;
&lt;td&gt;± 6,505.121&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;118,490.002&lt;/td&gt;
&lt;td&gt;± 21,500.755&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5,211,745.483&lt;/td&gt;
&lt;td&gt;± 672,226.456&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To find the issue, I looked at my profile data and found &lt;code&gt;minVisibleEpoch()&lt;/code&gt; as a hotspot at 2 threads. The problem is that we scan the entire map for the minimum active epoch; ideally this shouldn't take too long, but as the map grows it can become a bottleneck.&lt;/p&gt;

&lt;p&gt;I then decided to use a navigable map for cheaper reads at the cost of more expensive writes, and the numbers actually improved a lot, with stable variance across all thread counts.&lt;/p&gt;
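&lt;p&gt;Concretely, something like the following sketch (again, the names are mine): keeping active begin epochs in a concurrent ordered multiset makes &lt;code&gt;minVisibleEpoch()&lt;/code&gt; a &lt;code&gt;firstEntry()&lt;/code&gt; lookup instead of a full scan, while registration pays the skip-list insertion cost.&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch: active begin epochs as an ordered multiset,
// so the minimum is O(log n) to read rather than an O(n) scan.
class ActiveEpochs {
    // epoch -> number of active transactions that began at that epoch
    private final ConcurrentSkipListMap<Long, Integer> epochs = new ConcurrentSkipListMap<>();

    void register(long tBegin)   { epochs.merge(tBegin, 1, Integer::sum); }
    void deregister(long tBegin) { epochs.compute(tBegin, (k, c) -> c == null || c <= 1 ? null : c - 1); }

    // Minimum visible epoch across all active transactions; MAX_VALUE if none.
    long minVisibleEpoch() {
        Map.Entry<Long, Integer> first = epochs.firstEntry();
        return first == null ? Long.MAX_VALUE : first.getKey();
    }
}
```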

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,189,669.131&lt;/td&gt;
&lt;td&gt;± 219,454.082&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,209,368.581&lt;/td&gt;
&lt;td&gt;± 118,708.165&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,390,759.022&lt;/td&gt;
&lt;td&gt;± 138,734.722&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,831,714.755&lt;/td&gt;
&lt;td&gt;± 209,719.251&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;863,360.164&lt;/td&gt;
&lt;td&gt;± 174,821.023&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;695,208.420&lt;/td&gt;
&lt;td&gt;± 182,427.969&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;905,598.677&lt;/td&gt;
&lt;td&gt;± 107,143.027&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,438,306.517&lt;/td&gt;
&lt;td&gt;± 376,373.683&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I then moved &lt;code&gt;minVisibleEpoch()&lt;/code&gt; reads off the writer transaction path by caching the value and refreshing it from the GC thread at 100ms intervals. The tradeoff is precision: the GC works off a slightly stale epoch, but the writer path stays clean.&lt;/p&gt;
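&lt;p&gt;The shape of that change, as a sketch (class and method names are illustrative, not the actual implementation):&lt;/p&gt;

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Sketch: a scheduled GC task refreshes a cached min epoch every 100ms;
// writers only read the cache, so the expensive scan is off the hot path.
// The cached value may be up to 100ms stale.
class CachedMinEpoch implements AutoCloseable {
    private final AtomicLong cached = new AtomicLong(Long.MAX_VALUE);
    private final ScheduledExecutorService gc = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "epoch-gc");
        t.setDaemon(true);
        return t;
    });

    CachedMinEpoch(LongSupplier expensiveMinScan) {
        gc.scheduleAtFixedRate(() -> cached.set(expensiveMinScan.getAsLong()),
                0, 100, TimeUnit.MILLISECONDS);
    }

    long minVisibleEpoch() { return cached.get(); } // O(1) on the hot path

    @Override public void close() { gc.shutdownNow(); }
}
```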

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,426,304.887&lt;/td&gt;
&lt;td&gt;± 259,332.240&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,138,292.934&lt;/td&gt;
&lt;td&gt;± 71,983.527&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,297,350.055&lt;/td&gt;
&lt;td&gt;± 173,150.260&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,581,116.656&lt;/td&gt;
&lt;td&gt;± 142,358.242&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;831,511.799&lt;/td&gt;
&lt;td&gt;± 272,854.879&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;651,383.441&lt;/td&gt;
&lt;td&gt;± 125,000.038&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;864,550.813&lt;/td&gt;
&lt;td&gt;± 102,841.781&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,133,963.803&lt;/td&gt;
&lt;td&gt;± 371,369.242&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now that minimum-active-epoch reads were scheduled, cached, and off the hot path, I moved back to a plain concurrent hashmap and compared results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,033,144.625&lt;/td&gt;
&lt;td&gt;± 384,180.964&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,588,018.927&lt;/td&gt;
&lt;td&gt;± 270,740.481&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,729,477.638&lt;/td&gt;
&lt;td&gt;± 222,670.776&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,155,201.736&lt;/td&gt;
&lt;td&gt;± 367,418.564&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;806,951.062&lt;/td&gt;
&lt;td&gt;± 230,957.839&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;736,237.004&lt;/td&gt;
&lt;td&gt;± 198,644.974&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,163,394.531&lt;/td&gt;
&lt;td&gt;± 183,420.198&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,664,437.392&lt;/td&gt;
&lt;td&gt;± 348,898.556&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After looking at my queue version chain, I realized that calling &lt;code&gt;size()&lt;/code&gt; to check version-chain depth on each write txn was killing performance, since we had to scan the whole queue (including its dead nodes) to compute the size, restarting from the head whenever we hit a dead node. So I switched to a long adder that tracks the approximate size, making &lt;code&gt;size()&lt;/code&gt; calls cheap.&lt;/p&gt;
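&lt;p&gt;A minimal sketch of the approximate-size idea (names are illustrative): bump a &lt;code&gt;LongAdder&lt;/code&gt; on append/prune so depth checks never touch the queue's O(N) &lt;code&gt;size()&lt;/code&gt;.&lt;/p&gt;

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.LongAdder;

// Sketch: the LongAdder is updated alongside queue mutations, so depth()
// is cheap. Under concurrent mutation the count can be momentarily stale,
// which is fine for a depth heuristic.
class CountedVersionChain<E> {
    private final ConcurrentLinkedQueue<E> versions = new ConcurrentLinkedQueue<>();
    private final LongAdder approxSize = new LongAdder();

    void append(E version) {
        versions.add(version);
        approxSize.increment();
    }

    boolean prune() {
        E removed = versions.poll();
        if (removed != null) {
            approxSize.decrement();
            return true;
        }
        return false;
    }

    long depth() { return approxSize.sum(); } // cheap, possibly slightly stale
}
```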

&lt;p&gt;&lt;strong&gt;Queue version chain&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,290,892.594&lt;/td&gt;
&lt;td&gt;± 491,972.988&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,604,327.428&lt;/td&gt;
&lt;td&gt;± 208,550.678&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,891,341.883&lt;/td&gt;
&lt;td&gt;± 250,730.085&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,937,383.908&lt;/td&gt;
&lt;td&gt;± 501,990.327&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;932,853.659&lt;/td&gt;
&lt;td&gt;± 165,703.963&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;966,232.277&lt;/td&gt;
&lt;td&gt;± 279,759.248&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,135,401.769&lt;/td&gt;
&lt;td&gt;± 380,147.279&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,620,440.108&lt;/td&gt;
&lt;td&gt;± 446,644.708&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I decided to use my queue version chain as the default version chain in the next benchmark.&lt;/p&gt;

&lt;p&gt;The hotspot across all benchmarks was now the &lt;code&gt;computeIfAbsent()&lt;/code&gt; call whenever a transaction requested its current epoch (tBegin) at creation time. To reduce contention on this path, I built a thread-local epoch tracker: not a literal &lt;code&gt;ThreadLocal&lt;/code&gt;, but a map keyed by thread ID instead of epoch. Each thread maps to the epoch of the transaction it's currently hosting, and resets to a sentinel value once the transaction ends, signaling it's no longer active.&lt;/p&gt;

&lt;p&gt;This was a meaningful shift from the previous DefaultEpochTracker, which tracked all active transactions regardless of thread, making it more suitable for virtual threads but inherently more contended. Since keys here can't actually be contested (each key belongs to exactly one thread), lock wait time dropped significantly and performance improved as a result. The tradeoff is that it works best with a small, fixed pool of platform threads.&lt;/p&gt;
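&lt;p&gt;A sketch of the thread-ID-keyed tracker described above (names and the sentinel choice are illustrative):&lt;/p&gt;

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: each platform thread owns exactly one slot, set on txn begin
// and reset to a sentinel on txn end, so no two threads ever contend on
// the same key. The min scan still walks the map, but it's bounded by
// the (small, fixed) thread count.
class ThreadEpochTracker {
    static final long INACTIVE = Long.MIN_VALUE; // sentinel: no txn hosted on this thread

    private final ConcurrentHashMap<Long, Long> epochByThread = new ConcurrentHashMap<>();

    void begin(long epoch) { epochByThread.put(Thread.currentThread().getId(), epoch); }

    void end() { epochByThread.put(Thread.currentThread().getId(), INACTIVE); }

    long minActiveEpoch() {
        long min = Long.MAX_VALUE;
        for (long e : epochByThread.values()) {
            if (e != INACTIVE && e < min) min = e;
        }
        return min;
    }
}
```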

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,887,534.127&lt;/td&gt;
&lt;td&gt;± 554,342.591&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,366,272.494&lt;/td&gt;
&lt;td&gt;± 324,415.185&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,162,652.894&lt;/td&gt;
&lt;td&gt;± 276,947.370&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,449,227.477&lt;/td&gt;
&lt;td&gt;± 443,072.582&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,271,287.026&lt;/td&gt;
&lt;td&gt;± 498,230.994&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,578,855.855&lt;/td&gt;
&lt;td&gt;± 213,555.495&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,340,979.654&lt;/td&gt;
&lt;td&gt;± 316,184.797&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,519,457.062&lt;/td&gt;
&lt;td&gt;± 700,083.591&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The variance was actually worse in some scenarios, sometimes &amp;gt;30% of the score. Rerunning these benchmarks, I noticed something odd in my memory-allocation profile: a lot of memory was being allocated but barely reclaimed by the GC during the runs, leading to high variance and numbers that tanked in unusual ways. The profile showed most of the allocation happened when starting a transaction, which wasn't too helpful. So I looked at actual memory usage while running the benchmarks and found around 95% of my memory was in use.&lt;/p&gt;

&lt;p&gt;My first suspect was my version chains, since stale versions might not be cleaned up by the GC. I wrote a simple unit test (no idea why I didn't test my version chains earlier) to see if versions were being cleaned up, and they actually were not. The issue lay in a simple check I added earlier to prevent redundant O(N) lookups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;removeUnreachableVersions&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;tBegin&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tBegin&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;epoch&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;//This simple line here, the issue was that minVisibleEpoch was always initialized as Long.MAX_VALUE, even if the epoch could be updated while non-visible version were getting pruned, the GC would never actually get the chance to prune those versions, because beginTs(seen epoch) would always be less than Long.MAX_VALUE&lt;/span&gt;

    &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;reset&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;//Reset the holder to Long.MAX_VALUE, to prevent a situation where we are sitting on an older end ts, from a version pruned a while back&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;latest&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Entry&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;E&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;versionMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;entrySet&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;removeIf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getValue&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;shouldRemove&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endTs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;tBegin&lt;/span&gt;  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;shouldRemove&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endTs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;epoch&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endTs&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shouldRemove&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was changed to&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MAX_VALUE&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;tBegin&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;minVisibleEpoch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;epoch&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;//Actually gave the gc a chance to prune the older versions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
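&lt;p&gt;The failure mode is easy to reproduce in isolation: with the sentinel value, the guard is vacuously true for every possible &lt;code&gt;tBegin&lt;/code&gt;.&lt;/p&gt;

```java
// Minimal reproduction of the bug: any long is <= Long.MAX_VALUE, so the
// buggy guard skips pruning unconditionally until the sentinel is
// excluded explicitly.
class SentinelGuardDemo {
    static boolean buggySkipsPrune(long tBegin, long minVisibleEpoch) {
        return tBegin <= minVisibleEpoch; // always true when minVisibleEpoch == Long.MAX_VALUE
    }

    static boolean fixedSkipsPrune(long tBegin, long minVisibleEpoch) {
        return minVisibleEpoch != Long.MAX_VALUE && tBegin <= minVisibleEpoch;
    }
}
```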



&lt;p&gt;After this, my numbers improved significantly and variance dropped a bit.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Version Chain&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,022,274.294&lt;/td&gt;
&lt;td&gt;± 397,643.609&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,250,631.595&lt;/td&gt;
&lt;td&gt;± 201,546.981&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,583,840.733&lt;/td&gt;
&lt;td&gt;± 444,871.511&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,291,496.019&lt;/td&gt;
&lt;td&gt;± 304,662.994&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,152,546.075&lt;/td&gt;
&lt;td&gt;± 350,295.696&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,933,701.271&lt;/td&gt;
&lt;td&gt;± 210,834.860&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,209,719.728&lt;/td&gt;
&lt;td&gt;± 611,222.951&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,058,722.339&lt;/td&gt;
&lt;td&gt;± 265,986.981&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,467,899.289&lt;/td&gt;
&lt;td&gt;± 343,129.882&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,224,346.814&lt;/td&gt;
&lt;td&gt;± 276,528.783&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,987,672.075&lt;/td&gt;
&lt;td&gt;± 142,924.001&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,474,317.537&lt;/td&gt;
&lt;td&gt;± 332,782.403&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,540,089.406&lt;/td&gt;
&lt;td&gt;± 107,245.569&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,941,961.678&lt;/td&gt;
&lt;td&gt;± 335,538.070&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,310,156.822&lt;/td&gt;
&lt;td&gt;± 319,971.721&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,599,563.143&lt;/td&gt;
&lt;td&gt;± 712,225.130&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I was still a bit skeptical about the variance, even though it was pretty reasonable. I realized a lot of memory was getting allocated in my thread-local epoch tracker under contention due to &lt;code&gt;long&lt;/code&gt; boxing when updating epoch values, so I tried &lt;strong&gt;fastutil&lt;/strong&gt;'s synchronized &lt;code&gt;Long2LongHashMap&lt;/code&gt; to avoid boxing and allocations under high contention. Rerunning the benchmarks, memory allocation on that hot path dropped to basically zero; writes under contention did suffer a bit, but variance for both write-heavy and read-heavy workloads was pretty good.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Version Chain&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,917,121.556&lt;/td&gt;
&lt;td&gt;± 191,182.041&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,592,747.908&lt;/td&gt;
&lt;td&gt;± 192,348.108&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,935,324.852&lt;/td&gt;
&lt;td&gt;± 380,159.702&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,587,901.962&lt;/td&gt;
&lt;td&gt;± 253,365.744&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,410,564.690&lt;/td&gt;
&lt;td&gt;± 394,036.694&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,425,199.868&lt;/td&gt;
&lt;td&gt;± 109,061.792&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,138,869.830&lt;/td&gt;
&lt;td&gt;± 106,605.873&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,006,265.789&lt;/td&gt;
&lt;td&gt;± 154,944.074&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,789,016.664&lt;/td&gt;
&lt;td&gt;± 258,145.402&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,247,223.769&lt;/td&gt;
&lt;td&gt;± 212,243.516&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,308,032.198&lt;/td&gt;
&lt;td&gt;± 234,208.698&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,395,297.149&lt;/td&gt;
&lt;td&gt;± 187,661.268&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,285,734.168&lt;/td&gt;
&lt;td&gt;± 222,636.258&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,738,846.900&lt;/td&gt;
&lt;td&gt;± 503,307.766&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,274,919.824&lt;/td&gt;
&lt;td&gt;± 46,946.253&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,893,349.970&lt;/td&gt;
&lt;td&gt;± 213,244.636&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The throughput was great, though after some research I found a generics trick: use primitive arrays as the generic value type instead of boxed &lt;code&gt;long&lt;/code&gt; values. I decided to try this out with CHM and compare it to the synchronized long2long version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="c1"&gt;//Instead of this, we could do&lt;/span&gt;
&lt;span class="nc"&gt;ConcurrentMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="o"&gt;[]&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="c1"&gt;//No boxing for values&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
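&lt;p&gt;In use, the trick looks roughly like this sketch (names are illustrative): allocate one &lt;code&gt;long[1]&lt;/code&gt; cell per key once, then mutate it in place so epoch updates allocate nothing after the first insert.&lt;/p&gt;

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the primitive-array trick with ConcurrentHashMap. Note that a
// plain write into the array element carries no happens-before guarantee
// on its own; this pattern fits a single-writer-per-key design like the
// thread-ID-keyed tracker, where each slot is written by its owner thread.
class LongCellMap {
    private final ConcurrentMap<Long, long[]> cells = new ConcurrentHashMap<>();

    void set(long key, long value) {
        // first call allocates the cell; later calls reuse it with zero allocation
        cells.computeIfAbsent(key, k -> new long[1])[0] = value;
    }

    long get(long key) {
        long[] cell = cells.get(key);
        return cell == null ? Long.MIN_VALUE : cell[0]; // MIN_VALUE as an illustrative "absent" marker
    }
}
```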



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Version Chain&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,921,045.464&lt;/td&gt;
&lt;td&gt;± 379,095.211&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,490,400.560&lt;/td&gt;
&lt;td&gt;± 235,824.152&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,351,486.812&lt;/td&gt;
&lt;td&gt;± 330,871.332&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,961,814.543&lt;/td&gt;
&lt;td&gt;± 271,271.445&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,130,101.194&lt;/td&gt;
&lt;td&gt;± 365,901.882&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,967,520.781&lt;/td&gt;
&lt;td&gt;± 267,639.783&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;6,100,773.757&lt;/td&gt;
&lt;td&gt;± 555,229.766&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5,384,214.753&lt;/td&gt;
&lt;td&gt;± 396,929.168&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,948,891.898&lt;/td&gt;
&lt;td&gt;± 461,806.810&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,256,910.691&lt;/td&gt;
&lt;td&gt;± 203,279.439&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,523,596.142&lt;/td&gt;
&lt;td&gt;± 196,569.443&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,408,471.703&lt;/td&gt;
&lt;td&gt;± 192,463.826&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,943,223.429&lt;/td&gt;
&lt;td&gt;± 276,821.883&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,636,191.583&lt;/td&gt;
&lt;td&gt;± 311,031.690&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,064,740.729&lt;/td&gt;
&lt;td&gt;± 416,730.295&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,574,077.728&lt;/td&gt;
&lt;td&gt;± 549,287.881&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So far, I'd been testing throughput for my MVCC map under "best case" scenarios, i.e. no retries on abort, so I decided to test with retries enabled. Note that this run is baselined against my map with a &lt;code&gt;Long2ArrayEpochTracker&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Version Chain&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Cnt&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,255,294.022&lt;/td&gt;
&lt;td&gt;± 417,518.780&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,026,195.253&lt;/td&gt;
&lt;td&gt;± 382,190.857&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,727,360.460&lt;/td&gt;
&lt;td&gt;± 469,491.078&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,310,257.104&lt;/td&gt;
&lt;td&gt;± 195,179.750&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,222,921.041&lt;/td&gt;
&lt;td&gt;± 334,896.201&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,352,625.744&lt;/td&gt;
&lt;td&gt;± 481,297.585&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4,512,409.390&lt;/td&gt;
&lt;td&gt;± 385,141.320&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;readHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3,414,450.281&lt;/td&gt;
&lt;td&gt;± 391,240.406&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,887,463.431&lt;/td&gt;
&lt;td&gt;± 507,767.083&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_1thread&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,330,658.185&lt;/td&gt;
&lt;td&gt;± 214,850.891&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,083,535.454&lt;/td&gt;
&lt;td&gt;± 214,987.876&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_2threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,192,807.369&lt;/td&gt;
&lt;td&gt;± 222,511.666&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2,093,721.029&lt;/td&gt;
&lt;td&gt;± 118,319.047&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_4threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,337,540.905&lt;/td&gt;
&lt;td&gt;± 198,100.953&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;queue&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,800,086.692&lt;/td&gt;
&lt;td&gt;± 403,405.040&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeHeavy_8threads&lt;/td&gt;
&lt;td&gt;nav&lt;/td&gt;
&lt;td&gt;thrpt&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1,028,280.423&lt;/td&gt;
&lt;td&gt;± 285,679.830&lt;/td&gt;
&lt;td&gt;ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To fully understand this drop on the write-heavy bench, I compared profile data from this benchmark to the runs without retries. While everything looked similar on the CPU side, memory was a different story: usage spiked from ~16GB to almost ~30GB at every iteration. Because aborts were frequent, every retry had to create a new transaction object, meaning more memory allocated for the transaction object, its read/write operation objects, and its completable values. The result: more pressure on the GC, more GC pauses, lower throughput, and higher variance.&lt;/p&gt;
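&lt;p&gt;One way to relieve that allocation pressure is to reuse a single transaction object across retries, clearing its read/write sets on abort instead of allocating fresh ones. A minimal sketch of the idea (the &lt;code&gt;ReusableTx&lt;/code&gt; and &lt;code&gt;runWithRetries&lt;/code&gt; names are hypothetical, not taken from the actual implementation):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: a transaction that clears its read/write sets on
// retry instead of allocating a fresh object each time. The backing
// arrays survive across retries, so steady-state retries allocate nothing.
final class ReusableTx {
    final List reads = new ArrayList();   // raw types to keep the sketch short
    final List writes = new ArrayList();

    void reset() {        // called on abort instead of "new ReusableTx()"
        reads.clear();
        writes.clear();
    }
}

public class RetryLoop {
    // Runs the body until it commits; a null result signals an abort here.
    static Object runWithRetries(ReusableTx tx, Function body) {
        while (true) {
            tx.reset();                      // reuse, don't reallocate
            Object result = body.apply(tx);
            if (result != null) return result;
        }
    }
}
```

&lt;p&gt;Whether this pays off depends on how much transaction state can safely be recycled across attempts, but it directly targets the allocation churn seen in the profiles.&lt;/p&gt;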

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;QueueVersionChain&lt;/code&gt; paired with &lt;code&gt;Long2ArrayEpochTracker&lt;/code&gt; ended up being the best all-round configuration: reads sit comfortably in the multi-million ops/s range across thread counts, and writes improved significantly from the initial ~2k. A few things are still unresolved, though. The 2-thread throughput anomaly with map-based epoch tracking is the main one, and the retry memory pressure under high abort rates is worth revisiting; I'd likely make transaction objects reusable to reduce GC pressure on retries. &lt;/p&gt;

&lt;p&gt;If you want to check out the MVCC implementation, the benchmark code, or the benchmarks under more realistic workloads (i.e. Zipfian key distributions), you can visit the &lt;strong&gt;GitHub&lt;/strong&gt; repository: &lt;a href="https://github.com/kusoroadeolu/tx-map/tree/mvcc-txmap" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map/tree/mvcc-txmap&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>concurrency</category>
      <category>database</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>What Happens When You Give a Map Transactional Semantics</title>
      <dc:creator>Kush V</dc:creator>
      <pubDate>Fri, 27 Feb 2026 02:09:41 +0000</pubDate>
      <link>https://dev.to/kusoroadeolu/what-happens-when-you-give-a-java-map-transactional-semantics-1ham</link>
      <guid>https://dev.to/kusoroadeolu/what-happens-when-you-give-a-java-map-transactional-semantics-1ham</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The goal of this writeup isn't necessarily to sell you on the idea of in-memory transactional maps, but rather to document my findings about the 5-6 different transactional map strategies I implemented in &lt;a href="https://github.com/kusoroadeolu/tx-map" rel="noopener noreferrer"&gt;this repo&lt;/a&gt;.&lt;br&gt;
Before we continue: the main idea of transactional collections/maps is to let developers achieve performance in large atomic regions similar to what they'd get by interleaving multiple operations under mutual-exclusion primitives, while providing atomicity and different isolation levels based on preference.&lt;br&gt;
With that said, let's get into the implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; The original writeup can be found &lt;a href="https://github.com/kusoroadeolu/articles/tree/main/transactional-maps" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;I initially started building these transactional maps with a single implementation: an optimistic-style transactional map with semantic concurrency control. Most of the ideas came from this &lt;a href="https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf" rel="noopener noreferrer"&gt;paper&lt;/a&gt;. &lt;br&gt;
The main idea was pretty clear: &lt;strong&gt;transactional semantics for a map using optimistic concurrency control&lt;/strong&gt;. However, the paper assumed the reader had an underlying STM (Software Transactional Memory) runtime, hence all the talk about rollbacks, aborts, open-nested and closed-nested transactions, and memory-level conflicts that the STM runtime handled for them.&lt;br&gt;
Many of the paper's ideas were very smart, and a good number of them carried over into the transactional maps I built. There were, however, significant limitations that prevented me from making the near 1:1 copy of the paper's design that I had initially planned.&lt;br&gt;
These limitations, rather than being roadblocks, ended up pushing me to explore different synchronization strategies and ultimately build 5 distinct transactional map implementations, each with its own tradeoffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transactional Map Implementations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimistic Transactional Map:&lt;/strong&gt; At first glance, the name could imply that this map carries out optimistic reads; its transactional semantics, however, are quite different. &lt;br&gt;
This map's commit phase is split in two: validation and commit. During validation, the transaction tries to acquire all semantically meaningful locks for the transaction, e.g. if a put targets a key that already holds a value in the map, neither the contains-key lock nor the size lock needs to be acquired.&lt;br&gt;
The map's semantics require readers to eagerly state their intent by acquiring read locks for their semantics at a scheduled time (before commit). Writes, however, lazily state their intent and are serialized through one lock per key. This map promises &lt;strong&gt;READ COMMITTED&lt;/strong&gt; isolation across the entire map but &lt;strong&gt;SERIALIZABLE&lt;/strong&gt; isolation per key when writes are involved, meaning no dirty reads can ever happen on any key in the map, and non-repeatable or phantom reads are impossible per key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pessimistic Transactional Map:&lt;/strong&gt; This map's semantics require readers to lazily state their intent by incrementing an atomic counter at validation time. Writes lazily state their intent as well and are serialized through one lock per key. This map also promises &lt;strong&gt;READ COMMITTED&lt;/strong&gt; isolation across the entire map but &lt;strong&gt;SERIALIZABLE&lt;/strong&gt; isolation per key.&lt;br&gt;
&lt;strong&gt;NOTE:&lt;/strong&gt; For both the optimistic and pessimistic maps, writers block readers and readers block writers, but readers never block each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read Uncommitted Transactional Map:&lt;/strong&gt; This map's semantics do not require readers to state their intent at all. Writes are still serialized through one lock per key and lazily state their intent. This map promises &lt;strong&gt;READ UNCOMMITTED&lt;/strong&gt; isolation, meaning transactions can read data written by transactions that haven't fully committed yet, i.e. dirty reads. In the &lt;a href="https://people.csail.mit.edu/mcarbin/papers/ppopp07.pdf" rel="noopener noreferrer"&gt;paper&lt;/a&gt;, this is regarded as an &lt;strong&gt;open-nested&lt;/strong&gt; transaction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Copy On Write Transactional Map:&lt;/strong&gt; This map's semantics closely resemble those of &lt;strong&gt;CopyOnWriteArrayList&lt;/strong&gt;. A transaction copies the last-seen map reference onto its thread's stack; if the copy was modified, it tries to replace the shared map reference using CAS, and if that fails, it recopies the map, reapplies its modifications, and retries the CAS until it succeeds. This map promises &lt;strong&gt;READ COMMITTED&lt;/strong&gt; isolation across the entire map. Non-repeatable and phantom reads are possible: if you read a value, another thread commits, and you read again, you might see the new value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flat Combined Transactional Map:&lt;/strong&gt; I wanted to see how a fully serialized approach would compare, which led me to flat combining.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
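&lt;p&gt;The copy-on-write commit loop described above can be sketched in a few lines. This is a minimal, illustrative version (the &lt;code&gt;CowMap&lt;/code&gt; class is hypothetical and uses raw types; the real implementation also carries transactional bookkeeping):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the copy-on-write commit described above: copy the
// last-seen map, apply the write, then CAS the shared reference; on
// failure, recopy from the fresh reference and try again.
public class CowMap {
    private final AtomicReference shared = new AtomicReference(new HashMap());

    public Object get(Object key) {
        return ((Map) shared.get()).get(key);   // reads never block
    }

    public void put(Object key, Object value) {
        while (true) {
            Map snapshot = (Map) shared.get();
            Map copy = new HashMap(snapshot);   // private copy on this thread's stack
            copy.put(key, value);
            if (shared.compareAndSet(snapshot, copy)) return;  // publish, or retry
        }
    }
}
```

&lt;p&gt;Reads are wait-free against whatever snapshot was last published, which is exactly why this design trades write throughput for read throughput.&lt;/p&gt;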

&lt;h3&gt;
  
  
  Flat Combining
&lt;/h3&gt;

&lt;p&gt;Flat combining is a synchronization technique that uses a single &lt;strong&gt;coarse-grained lock&lt;/strong&gt; for all operations rather than fine-grained locking around critical sections. To avoid every thread contending for the lock on the data structure, which would kill performance, each thread enqueues its proposed operation onto a queue as a node, storing the operation in a thread-local variable, then spins or waits until a combiner processes its request; if no combiner is active, any thread can become one. The combiner traverses the queued operations, performs them on behalf of the waiting threads, and then releases the lock. &lt;br&gt;
My explanation is a pretty simplified version of what flat combining actually entails; if you're interested in the full details, you can check out the paper &lt;a href="https://people.csail.mit.edu/shanir/publications/Flat%20Combining%20SPAA%2010.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
There are two possible combiners my transactional map could use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unbound Combiner:&lt;/strong&gt; This implementation is faithful to what the paper describes: each thread stores its intent in a thread-local variable, enqueues its action onto a shared publication queue, and then either becomes the combiner or waits for the combiner to perform the operation on its behalf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semaphore Combiner:&lt;/strong&gt; This implementation cycles nodes across threads. Each thread replaces its thread-local node with the current node at the head of the combiner publication queue, then sets that node as its tail. A combining node then scans the queue and executes all actions on behalf of the waiting threads except the node at the tail of the queue, then informs the thread holding the tail node that it is the next combiner. This combiner does not use a lock. &lt;/li&gt;
&lt;/ul&gt;
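&lt;p&gt;To make the unbound-combiner flow concrete, here's a stripped-down sketch of the publish-then-combine loop. Everything here (class and method names, the use of &lt;code&gt;CompletableFuture&lt;/code&gt; for result handoff) is my own illustration, not the repo's actual code:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Hand-wavy sketch of the unbound-combiner idea: publish an operation,
// then either become the combiner (if the coarse lock is free) or spin
// until some combiner has applied it for you.
public class FlatCombined {
    static final class Op {
        final Function action;                       // takes the map, returns a result
        final CompletableFuture result = new CompletableFuture();
        Op(Function a) { action = a; }
    }

    private final Map map = new HashMap();           // only touched under the combiner lock
    private final ConcurrentLinkedQueue queue = new ConcurrentLinkedQueue();
    private final AtomicBoolean lock = new AtomicBoolean(false);

    public Object submit(Function action) {
        Op op = new Op(action);
        queue.add(op);                               // publish intent
        while (!op.result.isDone()) {
            if (lock.compareAndSet(false, true)) {   // become the combiner
                try {
                    Op next;
                    while ((next = (Op) queue.poll()) != null) {
                        next.result.complete(next.action.apply(map));
                    }
                } finally {
                    lock.set(false);
                }
            } else {
                Thread.onSpinWait();                 // wait for the active combiner
            }
        }
        return op.result.join();
    }
}
```

&lt;p&gt;The single coarse lock serializes every map access, while the publication queue keeps waiting threads off the lock itself; the batching behavior falls out of the combiner draining the whole queue in one pass.&lt;/p&gt;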

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;To evaluate the properties of all my transactional maps, I ran two benchmarks, both primarily measuring throughput: a disjoint-key benchmark, where each thread operates exclusively on its own key (no key contention by design), and a contention benchmark, where all threads compete over a small fixed pool of 4 keys across three workload profiles: read heavy (90% get operations), balanced (50/50), and write heavy (90% put operations). In this writeup I'll focus only on the read-heavy and write-heavy portions of the contention benchmarks.&lt;br&gt;
It's important to point out that combining operations are fully serialized, so these benchmarks may not be measuring what we think they're measuring and could yield misleading results. I won't dwell on the flat combined numbers until later, when I'll share how I plan to improve them and, more importantly, reduce their error margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; These benchmarks were run using JMH (Java Microbenchmark Harness)&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;JMH version: 1.37&lt;/li&gt;
&lt;li&gt;JVM: Java 25, HotSpot 64-Bit Server VM (25+37-LTS-3491)&lt;/li&gt;
&lt;li&gt;Benchmark mode: Throughput (ops/s)&lt;/li&gt;
&lt;li&gt;Warmup: 5 iterations × 1s each&lt;/li&gt;
&lt;li&gt;Measurement: 5 iterations × 1s each&lt;/li&gt;
&lt;li&gt;Forks: 2&lt;/li&gt;
&lt;li&gt;Thread configuration: 1, 2, 4, 8 (Platform threads)&lt;/li&gt;
&lt;li&gt;CPU Specs: Intel(R) Core(TM) i5-10300H CPU @ 2.50GHz, 8 cores&lt;/li&gt;
&lt;/ul&gt;
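&lt;p&gt;For reference, the setup above corresponds roughly to the following JMH annotations. This is a schematic reconstruction, not the repo's benchmark class; the map under test and the key pool are placeholders:&lt;/p&gt;

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

// Roughly how the setup above translates into JMH annotations; the map
// under test and the 4-key contention pool are stand-ins here.
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 5, time = 1)        // 5 iterations x 1s
@Measurement(iterations = 5, time = 1)   // 5 iterations x 1s
@Fork(2)
@State(Scope.Benchmark)
public class ContentionBench {
    ConcurrentHashMap map;   // stand-in for the transactional map under test
    Integer[] keys;          // the small fixed contention pool

    @Setup
    public void setup() {
        map = new ConcurrentHashMap();
        keys = new Integer[] {0, 1, 2, 3};
        for (Integer k : keys) map.put(k, k);
    }

    @Benchmark
    @Threads(8)
    public Object readHeavy_8threads() {
        ThreadLocalRandom r = ThreadLocalRandom.current();
        Integer k = keys[r.nextInt(4)];
        // 90% gets, 10% puts
        return r.nextInt(10) != 0 ? map.get(k) : map.put(k, r.nextInt());
    }
}
```

&lt;p&gt;The other thread counts and workload mixes would be separate &lt;code&gt;@Benchmark&lt;/code&gt; methods (or a parameterized state) in the same shape.&lt;/p&gt;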

&lt;h3&gt;
  
  
  Read Heavy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D419529405%26format%3Dimage" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D419529405%26format%3Dimage" alt="Read Heavy - 1 Thread" width="600" height="371"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1075497262%26format%3Dimage" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1075497262%26format%3Dimage" alt="Read Heavy - 8 Threads" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under contention, the copy-on-write implementation's throughput grows from ~3.7M ops/s at 1 thread to 6.4M at 8 threads (+72%), since readers don't block each other at all. This is a known strength of copy-on-write implementations, which sacrifice write throughput for high read throughput. &lt;/p&gt;

&lt;p&gt;The pessimistic transactional map's throughput drops from ~1.2M ops/s at 1 thread to ~865k at 8 threads. This is mainly due to the strong isolation guarantees this implementation provides, blocking readers while writes are active and vice versa. Context switches from multiple reader threads waking up after a writer has notified them contribute to the drop as well.&lt;/p&gt;

&lt;p&gt;For the optimistic transactional map, the story is similar: throughput drops from ~1.27M ops/s at 1 thread to ~701k ops/s at 8 threads, though reads scale worse than in the pessimistic map. The problems are similar, but here it isn't only reader threads performing OS context switches; writers now also park/unpark due to lock contention.&lt;/p&gt;

&lt;p&gt;The read uncommitted implementation scales more gracefully, with throughput growing from ~2.0M ops/s at 1 thread to ~3.1M at 8 threads, though it offers much weaker isolation guarantees than the previous three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Heavy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1590911224%26format%3Dimage" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1590911224%26format%3Dimage" alt="Write Heavy - 1 Thread" width="600" height="371"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1945883913%26format%3Dimage" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vQCZMJbJvC9LBh-MxqySWYzYu_FaZ5z2hAX2eQyxyzRh_QEvyBJcoQrWem-sMDmwYjZzxN1XQjZcHtD%2Fpubchart%3Foid%3D1945883913%26format%3Dimage" alt="Write Heavy - 8 Threads" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy-on-write's throughput drops ~60%, from 2.4M ops/s at 1 thread to 941k ops/s at 8 threads. Unlike reads, write operations cause multiple CAS retries, and the overhead of recopying the shared map on each retry under high contention becomes a major bottleneck.&lt;/p&gt;

&lt;p&gt;The pessimistic transactional map's throughput drops ~60%, from ~927k ops/s at 1 thread to ~388k ops/s at 8 threads. Since all writes per key are serialized under one lock to uphold its isolation guarantees, the overhead of repeated OS context switches due to lock contention becomes the primary bottleneck, as threads spend more time waiting to acquire locks than actually doing work.&lt;/p&gt;

&lt;p&gt;For the optimistic transactional map, the story is again similar: throughput drops from ~883k ops/s at 1 thread to ~304k ops/s at 8 threads, scaling worse than the pessimistic map for similar reasons; much more time is spent waiting to acquire locks than doing actual work.&lt;/p&gt;

&lt;p&gt;The read uncommitted implementation offers a more interesting picture: throughput only drops ~18%, from ~1.36M ops/s at 1 thread to ~1.11M ops/s at 8 threads, the best write performance among these implementations thanks to its weak isolation guarantees.&lt;/p&gt;

&lt;p&gt;From these numbers we can see that no single implementation wins across all workloads; the right choice depends entirely on your read/write ratio and how much isolation you're willing to trade for throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flat Combining
&lt;/h3&gt;

&lt;p&gt;Looking back at the flat combining numbers, both implementations show terrible variance on the read- and write-heavy workloads, so I want to outline what I plan to do to improve them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Experimenting with multiple idle strategies. Right now, while idle, both implementations spin &lt;strong&gt;n&lt;/strong&gt; times before rechecking their results and potentially trying to become the combiner. This can waste work, since a thread's result could be ready much earlier, or the combiner might have forfeited its status earlier as well. Two other strategies I'd implement and benchmark against are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain Busy Waiting:&lt;/strong&gt; Threads just spin on their results continuously, without spinning &lt;strong&gt;n&lt;/strong&gt; times before checking their results or trying to acquire the lock. Constant CAS &lt;code&gt;tryLock()&lt;/code&gt; failures in the unbound combiner could plausibly make throughput and its variance worse than just using a global lock.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timed Parks:&lt;/strong&gt; Threads park for a fixed amount of time before rechecking their result, though OS context switches might become a bottleneck here.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using atomic arrays instead of a linked publication queue: Right now, both combiners use a linked publication queue to share nodes for the combiner to process. Linked queues are susceptible to pointer chasing, which puts pressure on the GC and JVM. Instead, we could use an &lt;strong&gt;AtomicReferenceArray&lt;/strong&gt; with a fixed capacity to store nodes for the combiner to traverse. However, this could leave nodes susceptible to false sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increasing the threshold of the publication queue in both implementations: One of flat combining's strengths is its ability to batch-apply requests for waiting threads. Raising this threshold, while increasing wait time, could reduce the number of serial handoffs needed per combiner and the number of failed CAS operations to acquire the lock and become the combiner.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
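&lt;p&gt;As a concrete example of the idle-strategy point, the current spin-&lt;strong&gt;n&lt;/strong&gt;-times approach and a timed-park fallback can be combined like this (a hypothetical sketch; the spin budget is an arbitrary placeholder, not a tuned value):&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// A tiny sketch of a hybrid idle strategy: spin cheaply for a fixed
// budget, then fall back to a timed park before re-spinning. The budget
// and park duration are arbitrary placeholders for benchmarking.
public class IdleStrategy {
    static final int SPIN_BUDGET = 64;   // placeholder tuning knob

    // Waits for a result slot to be filled by the combiner.
    static Object awaitResult(AtomicReference slot) {
        int spins = 0;
        while (true) {
            Object r = slot.get();
            if (r != null) return r;
            if (spins != SPIN_BUDGET) {
                spins++;
                Thread.onSpinWait();             // cheap spin first
            } else {
                LockSupport.parkNanos(1_000L);   // timed park after the budget
                spins = 0;                       // then re-spin
            }
        }
    }
}
```

&lt;p&gt;Plain busy waiting is this loop with the park branch removed; pure timed parking removes the spin branch instead.&lt;/p&gt;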

&lt;h3&gt;
  
  
  Further Experimentation
&lt;/h3&gt;

&lt;p&gt;We've compared multiple implementations of these transactional maps against each other in this writeup; here are some ways I plan to push them further:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MVCC Transactional Maps: MVCC (Multi-Version Concurrency Control) is a very common technique in relational databases, providing &lt;strong&gt;READ COMMITTED&lt;/strong&gt; isolation guarantees while ensuring readers and writers never block each other.&lt;/li&gt;
&lt;li&gt;Partitioned Flat Combined Transactional Maps: Rather than wrapping the whole transactional map in a combiner, we instead map each key and the &lt;strong&gt;size&lt;/strong&gt; value to its own combiner; transactions submit their read/write requests per key and wait on a response.&lt;/li&gt;
&lt;/ol&gt;
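&lt;p&gt;To give a feel for the MVCC direction, here's a bare-bones sketch of a per-key version chain: writers prepend epoch-stamped versions, and readers walk the chain for the newest version visible at their snapshot, so neither ever blocks the other. The class and field names here are hypothetical, not from the repo:&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Bare-bones MVCC sketch: each key holds a chain of versions stamped
// with a commit epoch. Writers prepend; readers walk the chain for the
// newest version visible at their snapshot, never blocking writers.
public class VersionChain {
    static final class Version {
        final long epoch; final Object value; final Version next;
        Version(long e, Object v, Version n) { epoch = e; value = v; next = n; }
    }

    final AtomicLong clock = new AtomicLong();
    final AtomicReference head = new AtomicReference();  // newest version first

    void commit(Object value) {              // writer: prepend a new version
        long epoch = clock.incrementAndGet();
        while (true) {
            Version h = (Version) head.get();
            if (head.compareAndSet(h, new Version(epoch, value, h))) return;
        }
    }

    Object readAt(long snapshot) {           // reader: first version visible at snapshot
        for (Version v = (Version) head.get(); v != null; v = v.next) {
            if (snapshot >= v.epoch) return v.value;
        }
        return null;                         // key did not exist at this snapshot
    }
}
```

&lt;p&gt;A real implementation also needs garbage collection of old versions and conflict detection at commit, which is where the epoch trackers from the first article come in.&lt;/p&gt;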

&lt;p&gt;As I implement these, I'll be sure to document the performance/throughput of every experiment I perform here.&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kusoroadeolu/tx-map" rel="noopener noreferrer"&gt;https://github.com/kusoroadeolu/tx-map&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>concurrency</category>
      <category>database</category>
      <category>datastructures</category>
    </item>
  </channel>
</rss>
