DEV Community

Mustafa Bingül

How I Rebuilt a Three-Layer Cache System in Java — Redis, L1, and MongoDB Done Right

I've been working on Nexus, a backend infrastructure project, and recently hit a point where the data synchronization layer needed a serious rethink. What looked like a working cache system turned out to have a broken hierarchy, silent data-loss paths, race conditions, and a latent deadlock.

This post walks through every problem I found in the original code and exactly how I fixed each one. Code comparisons included throughout.


The Architecture: What We're Building

The system manages data across three layers:

┌─────────────────────────────────┐
│        Redis Cache (MASTER)     │  ← single source of truth
└────────────┬──────────┬─────────┘
             │          │
       pull 10s      flush 15s
             │          │
┌────────────▼──┐  ┌────▼──────────────┐
│  L1 Cache     │  │     MongoDB        │
│  (in-memory)  │  │  (persistent DB)   │
└───────────────┘  └────────────────────┘

The rules are simple:

  • Redis is always master. No other layer can override it.
  • L1 is an in-memory mirror of Redis. It reads from Redis, never the other way around.
  • MongoDB is the persistent backup. Redis flushes into it; MongoDB never writes back up the chain.

Three scheduled tasks keep everything in sync:

Task            Interval   Job
L1 Sync         10s        Pull Redis → update L1 if changed
Auto Flush      15s        Push dirty keys from Redis → MongoDB
Reconciliation  3 min      Compare Redis ↔ MongoDB, Redis wins
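The wiring for those three tasks isn't shown in the post, but a minimal sketch with a ScheduledExecutorService could look like this. The class and method names here are placeholders, not Nexus's actual API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: task bodies and names are hypothetical stand-ins.
public class SyncScheduler {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::l1Sync,     10,  10, TimeUnit.SECONDS); // L1 Sync
        scheduler.scheduleAtFixedRate(this::autoFlush,  15,  15, TimeUnit.SECONDS); // Auto Flush
        scheduler.scheduleAtFixedRate(this::reconcile, 180, 180, TimeUnit.SECONDS); // Reconciliation
    }

    public boolean isRunning() { return !scheduler.isShutdown(); }

    public void stop() { scheduler.shutdownNow(); }

    private void l1Sync()    { /* pull Redis -> update L1 if changed */ }
    private void autoFlush() { /* push dirty keys from Redis -> MongoDB */ }
    private void reconcile() { /* compare Redis <-> MongoDB, Redis wins */ }
}
```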

Now let's look at what was actually wrong with the original implementation.


Problem 1: The Hierarchy Was Backwards

This was the most fundamental issue. The reconciliation task was supposed to enforce Redis as master — but it was doing the opposite.

The original code

private void startReconciliationTask() {
    redisManager.processTask(() -> {
        idToDataList.forEach((keyTag, model) -> {
            if (dirtyKeys.contains(keyTag)) return;

            NexusApplication.getApplication().getMongoManager()
                .getValue(model.getAddon(), model.getSpecificDbKey())
                .thenAccept(dbJson -> {
                    if (dbJson == null) return;
                    try {
                        String cleanDbJson = model.getAddon().modelInitComp(dbJson);
                        if (!cleanDbJson.equals(model.getValueJson())) {
                            model.setValueJson(cleanDbJson);          // ← overwriting L1 with Mongo
                            redisManager.setData(keyTag, cleanDbJson); // ← overwriting Redis with Mongo!
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
        });
    });
}

When MongoDB had a different value than Redis, it was writing MongoDB's value into Redis. MongoDB was effectively acting as master. The entire priority chain was inverted.

The fix

private void startReconciliationTask() {
    RedisManager rm = NexusApplication.getApplication().getRedisManager();

    rm.processTask(() -> {
        // ...
        // Get the master value from Redis first
        String redisJson = rm.getData(key).orElseGet(model::getValueJson);

        NexusApplication.getApplication().getMongoManager()
            .getValue(model.getAddon(), model.getSpecificDbKey())
            .thenAccept(dbJson -> {
                // ...
                String cleanDbJson = model.getAddon().modelInitComp(dbJson);

                // Redis wins. If Mongo is different, update Mongo — not Redis.
                if (!cleanDbJson.equals(redisJson)) {
                    NexusApplication.getApplication().getMongoManager()
                        .setValue(model.getAddon(), model.getSpecificDbKey(), redisJson);
                    model.setValueJson(redisJson); // L1 follows Redis
                }
            });
    });
}

Redis is fetched first and treated as the ground truth. When there's a mismatch, MongoDB gets corrected — never Redis.


Problem 2: Silent Data Loss on Flush Failure

The auto flush task was removing the dirty flag before confirming the MongoDB write succeeded.

The original code

private void startAutoSyncTask() {
    List<String> keysToSync = new ArrayList<>(dirtyKeys);
    dirtyKeys.removeAll(keysToSync); // ← removed before writing to Mongo

    redisManager.processTask(() -> {
        for (String key : keysToSync) {
            DataModel model = idToDataList.get(key);
            if (model != null) {
                NexusApplication.getApplication().getMongoManager()
                    .setValue(model.getAddon(), model.getSpecificDbKey(), model.getValueJson());
                // if this fails, the key is already gone from dirtyKeys
                // it will never be retried
            }
        }
    });
}

If the MongoDB write threw an exception or the future completed exceptionally, the dirty flag was already removed. The entry would never be retried. Data was silently lost.

The fix

private void startAutoFlushTask() {
    List<String> keysToFlush = new ArrayList<>(dirtyKeys); // snapshot — don't removeAll yet

    rm.processTask(() -> {
        for (String key : keysToFlush) {
            DataModel model = keyToModel.get(key);
            if (model == null) {
                dirtyKeys.remove(key);
                continue;
            }

            String jsonToWrite = rm.getData(key).orElseGet(model::getValueJson);

            try {
                NexusApplication.getApplication().getMongoManager()
                    .setValue(model.getAddon(), model.getSpecificDbKey(), jsonToWrite)
                    .get(); // block until Mongo confirms the write

                dirtyKeys.remove(key); // only remove AFTER confirmed success

            } catch (Exception e) {
                // leave the key dirty — it will be retried on the next flush
                LOGGER.log(Level.SEVERE, "[AutoFlush] Write failed, will retry: " + key, e);
            }
        }
    });
}

Two changes here: the snapshot of dirty keys is taken without clearing the set, and dirtyKeys.remove(key) runs only after .get() confirms the write succeeded. If anything goes wrong, the key stays dirty and is retried 15 seconds later.


Problem 3: TOCTOU Race Condition in removeModel()

TOCTOU stands for Time-of-Check to Time-of-Use. The original removeModel() checked if a key existed before removing it — but another thread could delete it between those two operations.

The original code

public void removeModel(String key) {
    if (idToDataList.containsKey(key)) {  // Thread A checks: key exists
        // --- Thread B deletes the key here ---
        dirtyKeys.remove(key);
        idToDataList.remove(key);          // Thread A removes: but key is already gone
        NexusApplication.getApplication().getRedisManager().deleteData(key);
    }
}

In a concurrent environment, this is not safe. The containsKey check and remove call are two separate operations with no atomicity guarantee between them.

The fix

public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key); // atomic: check + remove in one step
    if (removed == null) return; // wasn't there — nothing to do

    idToKey.remove(removed.getId());
    dirtyKeys.remove(key);
    NexusApplication.getApplication().getRedisManager().deleteData(key);
}

ConcurrentHashMap.remove() is atomic. Its return value tells you whether anything was actually removed. One operation, no race.
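To see the pattern in isolation, here's a small illustration (the key and value are hypothetical): remove() collapses the check and the removal into one atomic step and hands back the old value, so exactly one caller can win even under concurrency.

```java
import java.util.concurrent.ConcurrentHashMap;

// Demonstrates check-and-remove as a single atomic operation.
public class AtomicRemoveDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> keyToModel = new ConcurrentHashMap<>();
        keyToModel.put("player:42", "{\"coins\":100}");

        String first  = keyToModel.remove("player:42"); // this caller wins: gets the value
        String second = keyToModel.remove("player:42"); // too late: gets null

        System.out.println(first != null);  // true
        System.out.println(second == null); // true
    }
}
```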


Problem 4: Deadlock Risk in Reconciliation

The original reconciliation task was dispatching a new processTask from inside an already-running processTask.

The original code

redisManager.processTask(() -> {
    idToDataList.forEach((keyTag, model) -> {
        // ...
        NexusApplication.getApplication().getMongoManager().getValue(...)
            .thenAccept(dbJson -> {
                // ...
                redisManager.processTask(() ->   // ← new task dispatched from inside a running task
                    NexusApplication.getApplication().getMongoManager()
                        .setValue(...));
            });
    });
});

If processTask uses a single-threaded executor (which is common for Redis clients to ensure command ordering), submitting a new task from inside a running task means the inner task can never start — the outer task is blocking the only available thread. That's a deadlock.
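The starvation is easy to reproduce in isolation. This sketch (not Nexus code) submits an inner task to a single-thread executor from inside a running task and waits on it with a timeout; the flag shows the inner task never got to run while the outer one held the thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal reproduction of the nested-submit deadlock pattern.
public class NestedSubmitDeadlock {
    public static void main(String[] args) throws Exception {
        ExecutorService single = Executors.newSingleThreadExecutor();
        AtomicBoolean innerRan = new AtomicBoolean(false);

        Future<Boolean> outer = single.submit(() -> {
            // inner is queued behind us, but we hold the only thread
            Future<?> inner = single.submit(() -> innerRan.set(true));
            try {
                inner.get(500, TimeUnit.MILLISECONDS); // without the timeout: blocks forever
                return true;                            // unreachable while we occupy the thread
            } catch (TimeoutException e) {
                return innerRan.get(); // false: inner was starved the whole time
            }
        });

        System.out.println("inner ran while outer held the thread: " + outer.get()); // false
        single.shutdownNow();
    }
}
```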

The fix

Everything runs in a single processTask context. The MongoDB writes inside thenAccept are plain async calls — no nested processTask.

rm.processTask(() -> {
    for (String key : keys) {
        // ...
        CompletableFuture<?> future = NexusApplication.getApplication()
            .getMongoManager()
            .getValue(...)
            .thenAccept(dbJson -> {
                // ...
                // direct Mongo write — no nested processTask
                NexusApplication.getApplication().getMongoManager()
                    .setValue(...);
            });

        batch.add(future);
        if (batch.size() >= RECONCILE_BATCH_SIZE) {
            waitForBatch(batch);
            batch.clear();
        }
    }
});

All reconciliation work happens inside one task. No task dispatches another task.


Problem 5: MongoDB Request Storm on Large Datasets

The original reconciliation fired a MongoDB query for every single entry simultaneously — no throttling, no batching.

On 1000+ entries, that means 1000+ concurrent MongoDB reads, followed immediately by potentially 1000+ writes. This can exhaust connection pools, spike latency, and cause cascading failures under load.

The fix: batch processing

private static final int RECONCILE_BATCH_SIZE = 50;

// inside startReconciliationTask:
List<CompletableFuture<?>> batch = new ArrayList<>(RECONCILE_BATCH_SIZE);

for (String key : keys) {
    if (dirtyKeys.contains(key)) continue;
    // ...

    CompletableFuture<?> future = mongoManager.getValue(...).thenAccept(...);
    batch.add(future);

    if (batch.size() >= RECONCILE_BATCH_SIZE) {
        waitForBatch(batch); // wait for all 50 to complete
        batch.clear();       // then start the next 50
    }
}

if (!batch.isEmpty()) waitForBatch(batch);
private void waitForBatch(List<CompletableFuture<?>> batch) {
    try {
        CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).get();
    } catch (Exception e) {
        LOGGER.log(Level.WARNING, "[Reconciliation] Batch wait error", e);
    }
}

At most 50 MongoDB requests are in flight at any time, and the next batch only starts once the current one completes. This is easy to tune: raise RECONCILE_BATCH_SIZE if your MongoDB can handle more concurrency, lower it in constrained environments.


Problem 6: Redis Evict/TTL Was Silently Ignored

Redis can evict keys under memory pressure or when a TTL expires. The original L1 Sync did nothing when it detected a missing key.

The original code

redisManager.getData(key).ifPresent(redisJson -> {
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson);
    }
});
// if getData() returns empty, we just skip it silently
// L1 is now out of sync with an evicted Redis key

When a key was evicted from Redis, L1 kept its stale value indefinitely. The next flush would try to read from Redis, get nothing, fall back to L1's stale data, and write that to MongoDB — potentially losing the newer value that was in Redis before eviction.

The fix

Optional<String> redisOpt = rm.getData(key);

if (redisOpt.isPresent()) {
    String redisJson = redisOpt.get();
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson); // L1 follows Redis
    }
} else {
    // Key was evicted from Redis — restore it from L1 and re-dirty
    LOGGER.warning("[L1Sync] Redis key missing, restoring: " + key);
    rm.setData(key, model.getValueJson()); // restore Redis from L1
    dirtyKeys.add(key);                    // trigger a Mongo re-write too
}

When Redis doesn't have a key, L1's value is pushed back into Redis (restoring the master) and the key is marked dirty so the flush task re-persists it to MongoDB.


Problem 7: addModelFix() Left Redis Empty

This method was supposed to handle externally-provided data, but it only wrote to L1 — leaving Redis without the key.

The original code

public void addModelFix(String key, DataModel model) {
    idToDataList.put(key, model); // writes to L1 only
    dirtyKeys.add(key);
    // Redis has no entry for this key
    // Next L1 Sync will detect the missing Redis key, restore it from L1,
    // and re-dirty it — unnecessary extra cycle
}

The fix

public void addModelFix(String key, DataModel model) {
    writeToL1AndRedis(key, model); // writes both L1 and Redis together
    dirtyKeys.add(key);
}

All write paths now go through the same internal method, which guarantees both layers are always updated together.


Problem 8: O(n) ID Lookup

The getDataModelFromId() method streamed through the entire map on every call.

The original code

public Optional<DataModel> getDataModelFromId(String id) {
    return idToDataList.values().stream()
        .filter(dm -> dm.getId().equals(id))
        .findAny(); // O(n) — scans the whole map
}

On 1000 entries this is 1000 comparisons per lookup. If this method is called frequently (e.g., per player action in a game server), it compounds fast.

The fix: reverse index

A second ConcurrentHashMap keeps an id → key mapping updated alongside the main map.

// new field
private final ConcurrentHashMap<String, String> idToKey;

// updated on every write
private void writeToL1AndRedis(String key, DataModel model) {
    keyToModel.put(key, model);
    idToKey.put(model.getId(), key); // maintain reverse index
    NexusApplication.getApplication().getRedisManager().setData(key, model.getValueJson());
}

// and on remove
public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key);
    if (removed == null) return;
    idToKey.remove(removed.getId()); // keep reverse index clean
    // ...
}

// lookup is now O(1)
public Optional<DataModel> getDataModelFromId(String id) {
    String key = idToKey.get(id);
    if (key == null) return Optional.empty();
    return Optional.ofNullable(keyToModel.get(key));
}

Two map lookups instead of a full scan. The memory cost is minimal — just a second map of strings.


The Final Picture

Here's a summary of every change made:

1. Hierarchy inverted; Mongo was overriding Redis. Impact: wrong data served. Fix: Redis fetched first, Mongo updated to match Redis.
2. Dirty flag removed before the Mongo write was confirmed. Impact: silent data loss. Fix: dirtyKeys.remove() called only after .get() succeeds.
3. TOCTOU race in removeModel(). Impact: potential NPE or double-delete. Fix: single atomic remove() with a return-value check.
4. Nested processTask in reconciliation. Impact: deadlock on a single-thread executor. Fix: all work in one task context, no inner dispatch.
5. All entries queried simultaneously in reconciliation. Impact: MongoDB connection storm. Fix: batch processing (50 at a time) with CompletableFuture.allOf().
6. Redis evict/TTL silently ignored in L1 Sync. Impact: stale L1 data, wrong Mongo writes. Fix: restore Redis from L1, mark dirty.
7. addModelFix() skipped Redis. Impact: missing Redis key, extra sync cycle. Fix: unified writeToL1AndRedis() for all write paths.
8. getDataModelFromId() was O(n). Impact: CPU pressure on 1000+ entries. Fix: id → key reverse index for O(1) lookup.

Takeaway

The original code wasn't obviously broken — it ran, it synced, it mostly worked. The bugs were in the edge cases: what happens when Mongo is unavailable for one cycle, when Redis evicts a key under pressure, when two threads hit removeModel() at the same time.

Distributed cache management is one of those areas where the details really matter. A clear layer hierarchy and explicit failure guarantees aren't optional extras — they're what separate a system that works from one that almost always works.

The full source is available in the v1.1.0 release.


If you spotted something I missed or have a different approach to any of these problems, I'd love to hear it in the comments.
