DEV Community

Mustafa Bingül

How I Rebuilt a Three-Layer Cache System in Java — Redis, L1, and MongoDB Done Right

I've been working on Nexus, a backend infrastructure project, and recently hit a point where the data synchronization layer needed a serious rethink. What looked like a working cache system turned out to have a broken hierarchy, silent data-loss paths, race conditions, and a latent deadlock.

This post walks through every problem I found in the original code and exactly how I fixed each one. Code comparisons included throughout.


The Architecture: What We're Building

The system manages data across three layers:

┌─────────────────────────────────┐
│        Redis Cache (MASTER)     │  ← single source of truth
└────────────┬──────────┬─────────┘
             │          │
       pull 10s      flush 15s
             │          │
┌────────────▼──┐  ┌────▼──────────────┐
│  L1 Cache     │  │     MongoDB        │
│  (in-memory)  │  │  (persistent DB)   │
└───────────────┘  └────────────────────┘

The rules are simple:

  • Redis is always master. No other layer can override it.
  • L1 is an in-memory mirror of Redis. It reads from Redis, never the other way around.
  • MongoDB is the persistent backup. Redis flushes into it; MongoDB never writes back up the chain.

Three scheduled tasks keep everything in sync:

Task            Interval   Job
L1 Sync         10s        Pull Redis → update L1 if changed
Auto Flush      15s        Push dirty keys from Redis → MongoDB
Reconciliation  3 min      Compare Redis ↔ MongoDB, Redis wins
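The wiring for those three tasks isn't shown in the post, but a minimal sketch with a ScheduledExecutorService could look like this. The class and method names here are placeholders, not Nexus's actual API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: task bodies and names are hypothetical stand-ins.
public class SyncScheduler {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::l1Sync,     10,  10, TimeUnit.SECONDS); // L1 Sync
        scheduler.scheduleAtFixedRate(this::autoFlush,  15,  15, TimeUnit.SECONDS); // Auto Flush
        scheduler.scheduleAtFixedRate(this::reconcile, 180, 180, TimeUnit.SECONDS); // Reconciliation
    }

    public boolean isRunning() { return !scheduler.isShutdown(); }

    public void stop() { scheduler.shutdownNow(); }

    private void l1Sync()    { /* pull Redis -> update L1 if changed */ }
    private void autoFlush() { /* push dirty keys from Redis -> MongoDB */ }
    private void reconcile() { /* compare Redis <-> MongoDB, Redis wins */ }
}
```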

Now let's look at what was actually wrong with the original implementation.


Problem 1: The Hierarchy Was Backwards

This was the most fundamental issue. The reconciliation task was supposed to enforce Redis as master — but it was doing the opposite.

The original code

private void startReconciliationTask() {
    redisManager.processTask(() -> {
        idToDataList.forEach((keyTag, model) -> {
            if (dirtyKeys.contains(keyTag)) return;

            NexusApplication.getApplication().getMongoManager()
                .getValue(model.getAddon(), model.getSpecificDbKey())
                .thenAccept(dbJson -> {
                    if (dbJson == null) return;
                    try {
                        String cleanDbJson = model.getAddon().modelInitComp(dbJson);
                        if (!cleanDbJson.equals(model.getValueJson())) {
                            model.setValueJson(cleanDbJson);          // ← overwriting L1 with Mongo
                            redisManager.setData(keyTag, cleanDbJson); // ← overwriting Redis with Mongo!
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
        });
    });
}

When MongoDB had a different value than Redis, it was writing MongoDB's value into Redis. MongoDB was effectively acting as master. The entire priority chain was inverted.

The fix

private void startReconciliationTask() {
    RedisManager rm = NexusApplication.getApplication().getRedisManager();

    rm.processTask(() -> {
        // ...
        // Get the master value from Redis first
        String redisJson = rm.getData(key).orElseGet(model::getValueJson);

        NexusApplication.getApplication().getMongoManager()
            .getValue(model.getAddon(), model.getSpecificDbKey())
            .thenAccept(dbJson -> {
                // ...
                String cleanDbJson = model.getAddon().modelInitComp(dbJson);

                // Redis wins. If Mongo is different, update Mongo — not Redis.
                if (!cleanDbJson.equals(redisJson)) {
                    NexusApplication.getApplication().getMongoManager()
                        .setValue(model.getAddon(), model.getSpecificDbKey(), redisJson);
                    model.setValueJson(redisJson); // L1 follows Redis
                }
            });
    });
}

Redis is fetched first and treated as the ground truth. When there's a mismatch, MongoDB gets corrected — never Redis.


Problem 2: Silent Data Loss on Flush Failure

The auto flush task was removing the dirty flag before confirming the MongoDB write succeeded.

The original code

private void startAutoSyncTask() {
    List<String> keysToSync = new ArrayList<>(dirtyKeys);
    dirtyKeys.removeAll(keysToSync); // ← removed before writing to Mongo

    redisManager.processTask(() -> {
        for (String key : keysToSync) {
            DataModel model = idToDataList.get(key);
            if (model != null) {
                NexusApplication.getApplication().getMongoManager()
                    .setValue(model.getAddon(), model.getSpecificDbKey(), model.getValueJson());
                // if this fails, the key is already gone from dirtyKeys
                // it will never be retried
            }
        }
    });
}

If the MongoDB write threw an exception or the future completed exceptionally, the dirty flag was already removed. The entry would never be retried. Data was silently lost.

The fix

private void startAutoFlushTask() {
    List<String> keysToFlush = new ArrayList<>(dirtyKeys); // snapshot — don't removeAll yet

    rm.processTask(() -> {
        for (String key : keysToFlush) {
            DataModel model = keyToModel.get(key);
            if (model == null) {
                dirtyKeys.remove(key);
                continue;
            }

            String jsonToWrite = rm.getData(key).orElseGet(model::getValueJson);

            try {
                NexusApplication.getApplication().getMongoManager()
                    .setValue(model.getAddon(), model.getSpecificDbKey(), jsonToWrite)
                    .get(); // block until Mongo confirms the write

                dirtyKeys.remove(key); // only remove AFTER confirmed success

            } catch (Exception e) {
                // leave the key dirty — it will be retried on the next flush
                LOGGER.log(Level.SEVERE, "[AutoFlush] Write failed, will retry: " + key, e);
            }
        }
    });
}

Two changes here: the snapshot of dirty keys is taken without clearing the set, and dirtyKeys.remove(key) runs only after .get() confirms the write succeeded. If anything goes wrong, the key stays dirty and is retried 15 seconds later.


Problem 3: TOCTOU Race Condition in removeModel()

TOCTOU stands for Time-of-Check to Time-of-Use. The original removeModel() checked if a key existed before removing it — but another thread could delete it between those two operations.

The original code

public void removeModel(String key) {
    if (idToDataList.containsKey(key)) {  // Thread A checks: key exists
        // --- Thread B deletes the key here ---
        dirtyKeys.remove(key);
        idToDataList.remove(key);          // Thread A removes: but key is already gone
        NexusApplication.getApplication().getRedisManager().deleteData(key);
    }
}

In a concurrent environment, this is not safe. The containsKey check and remove call are two separate operations with no atomicity guarantee between them.

The fix

public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key); // atomic: check + remove in one step
    if (removed == null) return; // wasn't there — nothing to do

    idToKey.remove(removed.getId());
    dirtyKeys.remove(key);
    NexusApplication.getApplication().getRedisManager().deleteData(key);
}

ConcurrentHashMap.remove() is atomic. Its return value tells you whether anything was actually removed. One operation, no race.
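To see the pattern in isolation, here's a small illustration (the key and value are hypothetical): remove() collapses the check and the removal into one atomic step and hands back the old value, so exactly one caller can win even under concurrency.

```java
import java.util.concurrent.ConcurrentHashMap;

// Demonstrates check-and-remove as a single atomic operation.
public class AtomicRemoveDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> keyToModel = new ConcurrentHashMap<>();
        keyToModel.put("player:42", "{\"coins\":100}");

        String first  = keyToModel.remove("player:42"); // this caller wins: gets the value
        String second = keyToModel.remove("player:42"); // too late: gets null

        System.out.println(first != null);  // true
        System.out.println(second == null); // true
    }
}
```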


Problem 4: Deadlock Risk in Reconciliation

The original reconciliation task was dispatching a new processTask from inside an already-running processTask.

The original code

redisManager.processTask(() -> {
    idToDataList.forEach((keyTag, model) -> {
        // ...
        NexusApplication.getApplication().getMongoManager().getValue(...)
            .thenAccept(dbJson -> {
                // ...
                redisManager.processTask(() ->   // ← new task dispatched from inside a running task
                    NexusApplication.getApplication().getMongoManager()
                        .setValue(...));
            });
    });
});

If processTask uses a single-threaded executor (which is common for Redis clients to ensure command ordering), submitting a new task from inside a running task means the inner task can never start — the outer task is blocking the only available thread. That's a deadlock.
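The starvation is easy to reproduce in isolation. This sketch (not Nexus code) submits an inner task to a single-thread executor from inside a running task and waits on it with a timeout; the flag shows the inner task never got to run while the outer one held the thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal reproduction of the nested-submit deadlock pattern.
public class NestedSubmitDeadlock {
    public static void main(String[] args) throws Exception {
        ExecutorService single = Executors.newSingleThreadExecutor();
        AtomicBoolean innerRan = new AtomicBoolean(false);

        Future<Boolean> outer = single.submit(() -> {
            // inner is queued behind us, but we hold the only thread
            Future<?> inner = single.submit(() -> innerRan.set(true));
            try {
                inner.get(500, TimeUnit.MILLISECONDS); // without the timeout: blocks forever
                return true;                            // unreachable while we occupy the thread
            } catch (TimeoutException e) {
                return innerRan.get(); // false: inner was starved the whole time
            }
        });

        System.out.println("inner ran while outer held the thread: " + outer.get()); // false
        single.shutdownNow();
    }
}
```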

The fix

Everything runs in a single processTask context. The MongoDB writes inside thenAccept are plain async calls — no nested processTask.

rm.processTask(() -> {
    for (String key : keys) {
        // ...
        CompletableFuture<?> future = NexusApplication.getApplication()
            .getMongoManager()
            .getValue(...)
            .thenAccept(dbJson -> {
                // ...
                // direct Mongo write — no nested processTask
                NexusApplication.getApplication().getMongoManager()
                    .setValue(...);
            });

        batch.add(future);
        if (batch.size() >= RECONCILE_BATCH_SIZE) {
            waitForBatch(batch);
            batch.clear();
        }
    }
});

All reconciliation work happens inside one task. No task dispatches another task.


Problem 5: MongoDB Request Storm on Large Datasets

The original reconciliation fired a MongoDB query for every single entry simultaneously — no throttling, no batching.

On 1000+ entries, that means 1000+ concurrent MongoDB reads, followed immediately by potentially 1000+ writes. This can exhaust connection pools, spike latency, and cause cascading failures under load.

The fix: batch processing

private static final int RECONCILE_BATCH_SIZE = 50;

// inside startReconciliationTask:
List<CompletableFuture<?>> batch = new ArrayList<>(RECONCILE_BATCH_SIZE);

for (String key : keys) {
    if (dirtyKeys.contains(key)) continue;
    // ...

    CompletableFuture<?> future = mongoManager.getValue(...).thenAccept(...);
    batch.add(future);

    if (batch.size() >= RECONCILE_BATCH_SIZE) {
        waitForBatch(batch); // wait for all 50 to complete
        batch.clear();       // then start the next 50
    }
}

if (!batch.isEmpty()) waitForBatch(batch);
private void waitForBatch(List<CompletableFuture<?>> batch) {
    try {
        CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).get();
    } catch (Exception e) {
        LOGGER.log(Level.WARNING, "[Reconciliation] Batch wait error", e);
    }
}

At most 50 MongoDB requests are in flight at any time, and the next batch only starts once the current one completes. This is easy to tune: raise RECONCILE_BATCH_SIZE if your MongoDB can handle more concurrency, lower it in constrained environments.


Problem 6: Redis Evict/TTL Was Silently Ignored

Redis can evict keys under memory pressure or when a TTL expires. The original L1 Sync did nothing when it detected a missing key.

The original code

redisManager.getData(key).ifPresent(redisJson -> {
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson);
    }
});
// if getData() returns empty, we just skip it silently
// L1 is now out of sync with an evicted Redis key

When a key was evicted from Redis, L1 kept its stale value indefinitely. The next flush would try to read from Redis, get nothing, fall back to L1's stale data, and write that to MongoDB — potentially losing the newer value that was in Redis before eviction.

The fix

Optional<String> redisOpt = rm.getData(key);

if (redisOpt.isPresent()) {
    String redisJson = redisOpt.get();
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson); // L1 follows Redis
    }
} else {
    // Key was evicted from Redis — restore it from L1 and re-dirty
    LOGGER.warning("[L1Sync] Redis key missing, restoring: " + key);
    rm.setData(key, model.getValueJson()); // restore Redis from L1
    dirtyKeys.add(key);                    // trigger a Mongo re-write too
}

When Redis doesn't have a key, L1's value is pushed back into Redis (restoring the master) and the key is marked dirty so the flush task re-persists it to MongoDB.


Problem 7: addModelFix() Left Redis Empty

This method was supposed to handle externally-provided data, but it only wrote to L1 — leaving Redis without the key.

The original code

public void addModelFix(String key, DataModel model) {
    idToDataList.put(key, model); // writes to L1 only
    dirtyKeys.add(key);
    // Redis has no entry for this key
    // Next L1 Sync will detect the missing Redis key, restore it from L1,
    // and re-dirty it — unnecessary extra cycle
}

The fix

public void addModelFix(String key, DataModel model) {
    writeToL1AndRedis(key, model); // writes both L1 and Redis together
    dirtyKeys.add(key);
}

All write paths now go through the same internal method, which guarantees both layers are always updated together.


Problem 8: O(n) ID Lookup

The getDataModelFromId() method streamed through the entire map on every call.

The original code

public Optional<DataModel> getDataModelFromId(String id) {
    return idToDataList.values().stream()
        .filter(dm -> dm.getId().equals(id))
        .findAny(); // O(n) — scans the whole map
}

On 1000 entries this is 1000 comparisons per lookup. If this method is called frequently (e.g., per player action in a game server), it compounds fast.

The fix: reverse index

A second ConcurrentHashMap keeps an id → key mapping updated alongside the main map.

// new field
private final ConcurrentHashMap<String, String> idToKey;

// updated on every write
private void writeToL1AndRedis(String key, DataModel model) {
    keyToModel.put(key, model);
    idToKey.put(model.getId(), key); // maintain reverse index
    NexusApplication.getApplication().getRedisManager().setData(key, model.getValueJson());
}

// and on remove
public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key);
    if (removed == null) return;
    idToKey.remove(removed.getId()); // keep reverse index clean
    // ...
}

// lookup is now O(1)
public Optional<DataModel> getDataModelFromId(String id) {
    String key = idToKey.get(id);
    if (key == null) return Optional.empty();
    return Optional.ofNullable(keyToModel.get(key));
}

Two map lookups instead of a full scan. The memory cost is minimal — just a second map of strings.


The Final Picture

Here's a summary of every change made:

1. Hierarchy inverted; Mongo was overriding Redis. Impact: wrong data served. Fix: Redis fetched first, Mongo updated to match Redis.
2. Dirty flag removed before the Mongo write was confirmed. Impact: silent data loss. Fix: dirtyKeys.remove() called only after .get() succeeds.
3. TOCTOU race in removeModel(). Impact: potential NPE or double-delete. Fix: single atomic remove() with a return-value check.
4. Nested processTask in reconciliation. Impact: deadlock on a single-thread executor. Fix: all work in one task context, no inner dispatch.
5. All entries queried simultaneously in reconciliation. Impact: MongoDB connection storm. Fix: batch processing (50 at a time) with CompletableFuture.allOf().
6. Redis evict/TTL silently ignored in L1 Sync. Impact: stale L1 data, wrong Mongo writes. Fix: restore Redis from L1, mark dirty.
7. addModelFix() skipped Redis. Impact: missing Redis key, extra sync cycle. Fix: unified writeToL1AndRedis() for all write paths.
8. getDataModelFromId() was O(n). Impact: CPU pressure on 1000+ entries. Fix: id → key reverse index for O(1) lookup.

Takeaway

The original code wasn't obviously broken — it ran, it synced, it mostly worked. The bugs were in the edge cases: what happens when Mongo is unavailable for one cycle, when Redis evicts a key under pressure, when two threads hit removeModel() at the same time.

Distributed cache management is one of those areas where the details really matter. A clear layer hierarchy and explicit failure guarantees aren't optional extras — they're what separate a system that works from one that almost always works.

The full source is available in the v1.1.0 release.


If you spotted something I missed or have a different approach to any of these problems, I'd love to hear it in the comments.
