I've been working on Nexus, a backend infrastructure project, and recently hit a point where the data synchronization layer needed a serious rethink. What looked like a working cache system turned out to have a broken hierarchy, silent data loss paths, race conditions, and a latent deadlock waiting to happen.
This post walks through every problem I found in the original code and exactly how I fixed each one. Code comparisons included throughout.
The Architecture: What We're Building
The system manages data across three layers:
```
┌─────────────────────────────────┐
│      Redis Cache (MASTER)       │  ← single source of truth
└────────────┬──────────┬─────────┘
             │          │
         pull 10s   flush 15s
             │          │
┌────────────▼──┐  ┌────▼──────────────┐
│   L1 Cache    │  │     MongoDB       │
│  (in-memory)  │  │  (persistent DB)  │
└───────────────┘  └───────────────────┘
```
The rules are simple:
- Redis is always master. No other layer can override it.
- L1 is an in-memory mirror of Redis. It reads from Redis, never the other way around.
- MongoDB is the persistent backup. Redis writes to it — not the other way around.
Three scheduled tasks keep everything in sync:
| Task | Interval | Job |
|---|---|---|
| L1 Sync | 10s | Pull Redis → update L1 if changed |
| Auto Flush | 15s | Push dirty keys from Redis → MongoDB |
| Reconciliation | 3 min | Compare Redis ↔ MongoDB, Redis wins |
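A minimal sketch of how three fixed-interval loops like these can be wired up with a plain `ScheduledExecutorService`. The class and method names here are illustrative, not Nexus's actual scheduler, and the task bodies are placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SyncScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Generic helper so each interval is easy to tune independently.
    public void schedule(Runnable task, long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(task, period, period, unit);
    }

    // Runnables stand in for the real task bodies.
    public void start(Runnable l1Sync, Runnable autoFlush, Runnable reconcile) {
        schedule(l1Sync,    10, TimeUnit.SECONDS);  // pull Redis -> L1
        schedule(autoFlush, 15, TimeUnit.SECONDS);  // flush dirty keys -> Mongo
        schedule(reconcile,  3, TimeUnit.MINUTES);  // Redis <-> Mongo, Redis wins
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

One design note: a single-threaded scheduler keeps the three loops from overlapping each other, at the cost of a slow reconciliation delaying the next flush tick.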
Now let's look at what was actually wrong with the original implementation.
Problem 1: The Hierarchy Was Backwards
This was the most fundamental issue. The reconciliation task was supposed to enforce Redis as master — but it was doing the opposite.
The original code
```java
private void startReconciliationTask() {
    redisManager.processTask(() -> {
        idToDataList.forEach((keyTag, model) -> {
            if (dirtyKeys.contains(keyTag)) return;
            NexusApplication.getApplication().getMongoManager()
                    .getValue(model.getAddon(), model.getSpecificDbKey())
                    .thenAccept(dbJson -> {
                        if (dbJson == null) return;
                        try {
                            String cleanDbJson = model.getAddon().modelInitComp(dbJson);
                            if (!cleanDbJson.equals(model.getValueJson())) {
                                model.setValueJson(cleanDbJson);           // ← overwriting L1 with Mongo
                                redisManager.setData(keyTag, cleanDbJson); // ← overwriting Redis with Mongo!
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
        });
    });
}
```
When MongoDB had a different value than Redis, it was writing MongoDB's value into Redis. MongoDB was effectively acting as master. The entire priority chain was inverted.
The fix
```java
private void startReconciliationTask() {
    RedisManager rm = NexusApplication.getApplication().getRedisManager();
    rm.processTask(() -> {
        // ...
        // Get the master value from Redis first
        String redisJson = rm.getData(key).orElseGet(model::getValueJson);
        NexusApplication.getApplication().getMongoManager()
                .getValue(model.getAddon(), model.getSpecificDbKey())
                .thenAccept(dbJson -> {
                    // ...
                    String cleanDbJson = model.getAddon().modelInitComp(dbJson);
                    // Redis wins. If Mongo is different, update Mongo — not Redis.
                    if (!cleanDbJson.equals(redisJson)) {
                        NexusApplication.getApplication().getMongoManager()
                                .setValue(model.getAddon(), model.getSpecificDbKey(), redisJson);
                        model.setValueJson(redisJson); // L1 follows Redis
                    }
                });
    });
}
```
Redis is fetched first and treated as the ground truth. When there's a mismatch, MongoDB gets corrected — never Redis.
Problem 2: Silent Data Loss on Flush Failure
The auto flush task was removing the dirty flag before confirming the MongoDB write succeeded.
The original code
```java
private void startAutoSyncTask() {
    List<String> keysToSync = new ArrayList<>(dirtyKeys);
    dirtyKeys.removeAll(keysToSync); // ← removed before writing to Mongo
    redisManager.processTask(() -> {
        for (String key : keysToSync) {
            DataModel model = idToDataList.get(key);
            if (model != null) {
                NexusApplication.getApplication().getMongoManager()
                        .setValue(model.getAddon(), model.getSpecificDbKey(), model.getValueJson());
                // if this fails, the key is already gone from dirtyKeys
                // it will never be retried
            }
        }
    });
}
```
If the MongoDB write threw an exception or the future completed exceptionally, the dirty flag was already removed. The entry would never be retried. Data was silently lost.
The fix
```java
private void startAutoFlushTask() {
    List<String> keysToFlush = new ArrayList<>(dirtyKeys); // snapshot — don't removeAll yet
    rm.processTask(() -> {
        for (String key : keysToFlush) {
            DataModel model = keyToModel.get(key);
            if (model == null) {
                dirtyKeys.remove(key);
                continue;
            }
            String jsonToWrite = rm.getData(key).orElseGet(model::getValueJson);
            try {
                NexusApplication.getApplication().getMongoManager()
                        .setValue(model.getAddon(), model.getSpecificDbKey(), jsonToWrite)
                        .get(); // block until Mongo confirms the write
                dirtyKeys.remove(key); // only remove AFTER confirmed success
            } catch (Exception e) {
                // leave the key dirty — it will be retried on the next flush
                LOGGER.log(Level.SEVERE, "[AutoFlush] Write failed, will retry: " + key, e);
            }
        }
    });
}
```
Two changes here: the dirty flag snapshot is taken but not immediately cleared, and dirtyKeys.remove(key) only runs after .get() confirms the write succeeded. If anything goes wrong, the key stays dirty and gets retried 15 seconds later.
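To make the retry guarantee concrete, here is a self-contained sketch of the flush loop against a stub store that fails its first write. The names (`FlushRetryDemo`, `writeToStore`) are illustrative, not Nexus's API:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FlushRetryDemo {
    final Set<String> dirtyKeys = ConcurrentHashMap.newKeySet();
    private int failuresLeft; // stub: fail the first N writes

    FlushRetryDemo(int failures) { this.failuresLeft = failures; }

    // Stand-in for the MongoDB write.
    void writeToStore(String key) {
        if (failuresLeft-- > 0) throw new RuntimeException("store unavailable");
    }

    // One flush cycle: the dirty flag is cleared only after a confirmed write.
    void flushOnce() {
        for (String key : List.copyOf(dirtyKeys)) { // snapshot, don't clear
            try {
                writeToStore(key);
                dirtyKeys.remove(key); // only after success
            } catch (Exception e) {
                // leave the key dirty: it is retried on the next cycle
            }
        }
    }
}
```

A key that fails to flush simply survives into the next cycle's snapshot, so no external retry bookkeeping is needed.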
Problem 3: TOCTOU Race Condition in removeModel()
TOCTOU stands for Time-of-Check to Time-of-Use. The original removeModel() checked if a key existed before removing it — but another thread could delete it between those two operations.
The original code
```java
public void removeModel(String key) {
    if (idToDataList.containsKey(key)) { // Thread A checks: key exists
        // --- Thread B deletes the key here ---
        dirtyKeys.remove(key);
        idToDataList.remove(key);        // Thread A removes: but key is already gone
        NexusApplication.getApplication().getRedisManager().deleteData(key);
    }
}
```
In a concurrent environment, this is not safe. The containsKey check and remove call are two separate operations with no atomicity guarantee between them.
The fix
```java
public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key); // atomic: check + remove in one step
    if (removed == null) return;                // wasn't there — nothing to do
    idToKey.remove(removed.getId());
    dirtyKeys.remove(key);
    NexusApplication.getApplication().getRedisManager().deleteData(key);
}
```
ConcurrentHashMap.remove() is atomic. Its return value tells you whether anything was actually removed. One operation, no race.
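The return-value contract is easy to demonstrate in isolation. A minimal sketch (class name is illustrative): only the caller that actually removed the entry sees a non-null result, so cleanup logic runs exactly once even if two threads race on the same key.

```java
import java.util.concurrent.ConcurrentHashMap;

public class AtomicRemoveDemo {
    static final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    // Returns true only for the call that actually removed the entry.
    // No containsKey pre-check, so there is no check-then-act window.
    static boolean removeOnce(String key) {
        return map.remove(key) != null;
    }
}
```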
Problem 4: Deadlock Risk in Reconciliation
The original reconciliation task was dispatching a new processTask from inside an already-running processTask.
The original code
```java
redisManager.processTask(() -> {
    idToDataList.forEach((keyTag, model) -> {
        // ...
        NexusApplication.getApplication().getMongoManager().getValue(...)
                .thenAccept(dbJson -> {
                    // ...
                    redisManager.processTask(() -> // ← new task dispatched from inside a running task
                            NexusApplication.getApplication().getMongoManager()
                                    .setValue(...));
                });
    });
});
```
If processTask uses a single-threaded executor (which is common for Redis clients to ensure command ordering), submitting a new task from inside a running task means the inner task can never start — the outer task is blocking the only available thread. That's a deadlock.
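The hazard is easy to reproduce with a plain single-threaded executor. This is a generic sketch of the pattern, not Nexus's `processTask`; a timeout stands in for what would otherwise be an infinite wait:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class NestedTaskDeadlock {
    // The outer task submits an inner task to the SAME single-threaded
    // executor and waits on it. The inner task is queued behind the outer
    // one, so it can never start while the outer task is still blocking.
    static boolean innerTaskCompleted(long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outer = executor.submit(() -> {
                Future<?> inner = executor.submit(() -> { }); // queued behind us
                try {
                    inner.get(timeoutMs, TimeUnit.MILLISECONDS); // blocks the only thread
                    return true;
                } catch (Exception e) {
                    return false; // timed out: the classic self-deadlock
                }
            });
            return outer.get();
        } catch (Exception e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Without the timeout, `inner.get()` would never return and the executor would be wedged permanently.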
The fix
Everything runs in a single processTask context. The MongoDB writes inside thenAccept are plain async calls — no nested processTask.
```java
rm.processTask(() -> {
    for (String key : keys) {
        // ...
        CompletableFuture<?> future = NexusApplication.getApplication()
                .getMongoManager()
                .getValue(...)
                .thenAccept(dbJson -> {
                    // ...
                    // direct Mongo write — no nested processTask
                    NexusApplication.getApplication().getMongoManager()
                            .setValue(...);
                });
        batch.add(future);
        if (batch.size() >= RECONCILE_BATCH_SIZE) {
            waitForBatch(batch);
            batch.clear();
        }
    }
});
```
All reconciliation work happens inside one task. No task dispatches another task.
Problem 5: MongoDB Request Storm on Large Datasets
The original reconciliation fired a MongoDB query for every single entry simultaneously — no throttling, no batching.
On 1000+ entries, that means 1000+ concurrent MongoDB reads, followed immediately by potentially 1000+ writes. This can exhaust connection pools, spike latency, and cause cascading failures under load.
The fix: batch processing
```java
private static final int RECONCILE_BATCH_SIZE = 50;

// inside startReconciliationTask:
List<CompletableFuture<?>> batch = new ArrayList<>(RECONCILE_BATCH_SIZE);
for (String key : keys) {
    if (dirtyKeys.contains(key)) continue;
    // ...
    CompletableFuture<?> future = mongoManager.getValue(...).thenAccept(...);
    batch.add(future);
    if (batch.size() >= RECONCILE_BATCH_SIZE) {
        waitForBatch(batch); // wait for all 50 to complete
        batch.clear();       // then start the next 50
    }
}
if (!batch.isEmpty()) waitForBatch(batch);

private void waitForBatch(List<CompletableFuture<?>> batch) {
    try {
        CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).get();
    } catch (Exception e) {
        LOGGER.log(Level.WARNING, "[Reconciliation] Batch wait error", e);
    }
}
```
At most 50 concurrent MongoDB requests are in flight at any time. The next batch only starts when the current one is done. This is easy to tune: bump RECONCILE_BATCH_SIZE if your MongoDB can handle more concurrency, or lower it in constrained environments.
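The bound is verifiable in isolation. This generic sketch runs one async job per item with the same batch-and-wait shape, using stub work in place of the Mongo calls and an in-flight counter to observe the ceiling (all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchDemo {
    static final int BATCH_SIZE = 50;

    // Submits one async job per item, but never more than BATCH_SIZE at once.
    // Returns the highest number of jobs observed in flight simultaneously.
    static int processInBatches(int items) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        List<CompletableFuture<?>> batch = new ArrayList<>(BATCH_SIZE);
        for (int i = 0; i < items; i++) {
            CompletableFuture<?> f = CompletableFuture.runAsync(() -> {
                int now = inFlight.incrementAndGet();
                maxInFlight.accumulateAndGet(now, Math::max); // record the peak
                inFlight.decrementAndGet();
            });
            batch.add(f);
            if (batch.size() >= BATCH_SIZE) {
                // Wait for the whole batch before submitting any more work.
                CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).join();
                batch.clear();
            }
        }
        if (!batch.isEmpty())
            CompletableFuture.allOf(batch.toArray(new CompletableFuture[0])).join();
        return maxInFlight.get();
    }
}
```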
Problem 6: Redis Evict/TTL Was Silently Ignored
Redis can evict keys under memory pressure or when a TTL expires. The original L1 Sync did nothing when it detected a missing key.
The original code
```java
redisManager.getData(key).ifPresent(redisJson -> {
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson);
    }
});
// if getData() returns empty — we just skip it silently:
// L1 is now out of sync with an evicted Redis key
```
When a key was evicted from Redis, L1 kept its stale value indefinitely. The next flush would try to read from Redis, get nothing, fall back to L1's stale data, and write that to MongoDB — potentially losing the newer value that was in Redis before eviction.
The fix
```java
Optional<String> redisOpt = rm.getData(key);
if (redisOpt.isPresent()) {
    String redisJson = redisOpt.get();
    if (!redisJson.equals(model.getValueJson())) {
        model.setValueJson(redisJson); // L1 follows Redis
    }
} else {
    // Key was evicted from Redis — restore it from L1 and re-dirty
    LOGGER.warning("[L1Sync] Redis key missing, restoring: " + key);
    rm.setData(key, model.getValueJson()); // restore Redis from L1
    dirtyKeys.add(key);                    // trigger a Mongo re-write too
}
```
When Redis doesn't have a key, L1's value is pushed back into Redis (restoring the master) and the key is marked dirty so the flush task re-persists it to MongoDB.
Problem 7: addModelFix() Left Redis Empty
This method was supposed to handle externally-provided data, but it only wrote to L1 — leaving Redis without the key.
The original code
```java
public void addModelFix(String key, DataModel model) {
    idToDataList.put(key, model); // writes to L1 only
    dirtyKeys.add(key);
    // Redis has no entry for this key.
    // Next L1 Sync will detect the missing Redis key, restore it from L1,
    // and re-dirty it — an unnecessary extra cycle.
}
```
The fix
```java
public void addModelFix(String key, DataModel model) {
    writeToL1AndRedis(key, model); // writes both L1 and Redis together
    dirtyKeys.add(key);
}
```
All write paths now go through the same internal method, which guarantees both layers are always updated together.
Problem 8: O(n) ID Lookup
The getDataModelFromId() method streamed through the entire map on every call.
The original code
```java
public Optional<DataModel> getDataModelFromId(String id) {
    return idToDataList.values().stream()
            .filter(dm -> dm.getId().equals(id))
            .findAny(); // O(n) — scans the whole map
}
```
On 1000 entries this is 1000 comparisons per lookup. If this method is called frequently (e.g., per player action in a game server), it compounds fast.
The fix: reverse index
A second ConcurrentHashMap keeps an id → key mapping updated alongside the main map.
```java
// new field
private final ConcurrentHashMap<String, String> idToKey;

// updated on every write
private void writeToL1AndRedis(String key, DataModel model) {
    keyToModel.put(key, model);
    idToKey.put(model.getId(), key); // maintain reverse index
    NexusApplication.getApplication().getRedisManager().setData(key, model.getValueJson());
}

// and on remove
public void removeModel(String key) {
    DataModel removed = keyToModel.remove(key);
    if (removed == null) return;
    idToKey.remove(removed.getId()); // keep reverse index clean
    // ...
}

// lookup is now O(1)
public Optional<DataModel> getDataModelFromId(String id) {
    String key = idToKey.get(id);
    if (key == null) return Optional.empty();
    return Optional.ofNullable(keyToModel.get(key));
}
```
Two map lookups instead of a full scan. The memory cost is minimal — just a second map of strings.
The Final Picture
Here's a summary of every change made:
| # | Problem | Impact | Fix |
|---|---|---|---|
| 1 | Hierarchy inverted — Mongo was overriding Redis | Wrong data served | Redis fetched first; Mongo updated to match Redis |
| 2 | Dirty flag removed before Mongo write confirmed | Silent data loss | dirtyKeys.remove() called only after .get() succeeds |
| 3 | TOCTOU race in removeModel() | Potential NPE / double-delete | Single atomic remove() with return value check |
| 4 | Nested processTask in reconciliation | Deadlock on single-thread executor | All work in one task context, no inner dispatch |
| 5 | All entries queried simultaneously in reconciliation | MongoDB connection storm | Batch processing (50 at a time) with CompletableFuture.allOf() |
| 6 | Redis evict/TTL silently ignored in L1 Sync | Stale L1 data, wrong Mongo writes | Restore Redis from L1, mark dirty |
| 7 | addModelFix() skipped Redis | Redis missing key, extra sync cycle | Unified writeToL1AndRedis() for all write paths |
| 8 | getDataModelFromId() was O(n) | CPU pressure on 1000+ entries | id → key reverse index for O(1) lookup |
Takeaway
The original code wasn't obviously broken — it ran, it synced, it mostly worked. The bugs were in the edge cases: what happens when Mongo is unavailable for one cycle, when Redis evicts a key under pressure, when two threads hit removeModel() at the same time.
Distributed cache management is one of those areas where the details really matter. A clear layer hierarchy and explicit failure guarantees aren't optional extras — they're what separate a system that works from one that almost always works.
The full source is available in the v1.1.0 release.
If you spotted something I missed or have a different approach to any of these problems, I'd love to hear it in the comments.