Why Our Node Backup's Diff Detection Was Completely Broken — Root Cause Analysis


The Symptom

The system runs automated backups across four nodes every 2 hours. By design, backups should only run when files actually change — diff detection is the core mechanism for saving storage and I/O.

But monitoring revealed an anomalous pattern:

07:02  02_node backup complete (9/10 files)
09:02  02_node backup complete (9/10 files)
11:00  02_node backup complete (9/10 files)
13:00  02_node backup complete (9/10 files)
15:00  02_node backup complete (9/10 files)
...

Every single backup cycle, nodes 02, 03, and 04 reported "changes detected," with the exact same file counts every time (9/10, 9/10, and 8/10 respectively). Node 01 (local) behaved normally — it skipped cycles with no changes.

Something was clearly wrong. There's no way that exactly the same number of files changes on each remote node every 2 hours.

Root Cause

Inspecting file_hashes.json — the core data file for diff detection — revealed the issue:

{
  "02_node": {
    "/path/to/config.json": "MISSING",
    "/path/to/agents.json": "MISSING",
    ...
  }
}

Every hash value for every remote node was MISSING.

This meant: on each backup cycle, the system read the remote files, computed their hashes, then compared against file_hashes.json — found MISSING, concluded "file changed," and ran the backup. After backup completed, the hash write-back logic either failed or didn't exist, so the next cycle found MISSING again. Infinite loop.

The broken link in the diff detection chain:

Read remote files → Compute hash → Compare to old hash (MISSING) → Detect change → Run backup
                                                                                   ↓
                                                              [Should write new hash here]
                                                              [But write failed or missing]
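The loop above can be sketched in a few lines. This is a minimal reconstruction, not the actual codebase — function names like file_hash and needs_backup are hypothetical stand-ins for whatever the backup tool calls them:

```python
import hashlib


def file_hash(path):
    """SHA-256 of a file's contents, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_backup(path, old_hashes):
    # Any unknown or MISSING entry compares unequal to a real digest,
    # so a store full of MISSING values triggers a backup every cycle.
    return old_hashes.get(path, "MISSING") != file_hash(path)
```

With a healthy hash store, needs_backup returns False for unchanged files and the cycle is skipped; with every stored value stuck at MISSING, it returns True for everything, forever.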

Impact

| Metric | Expected Behavior | Actual Behavior |
| --- | --- | --- |
| Backup trigger condition | When files change | Every cycle |
| Remote node backup frequency | On-demand (infrequent) | Forced full backup every 2 hours |
| Storage usage | Incremental growth | Linear growth each cycle |
| Diff detection effectiveness | Filters unnecessary backups | Completely non-functional |

Local node 01_PC_dell_server was unaffected because local file reads use a different code path where the hash write-back logic is complete. This was the key diagnostic clue — local works fine, all remotes broken, problem is in the remote file handling branch.

Fix

The core fix: after backup completes, write the newly computed hash values into file_hashes.json, replacing MISSING.

Pseudocode:

# After backup
new_hashes = compute_hashes(remote_files)
update_hash_store(node_id, new_hashes)  # This step was previously not executing
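A concrete sketch of what update_hash_store could look like, assuming the store is the file_hashes.json layout shown earlier (node ID → path → hash). The atomic rename is my addition — it prevents a half-written JSON file from corrupting the store if the process dies mid-write, which would silently recreate this same class of bug:

```python
import json
import os
import tempfile


def update_hash_store(store_path, node_id, new_hashes):
    """Persist freshly computed hashes so the next cycle can skip
    unchanged files. This is the write-back step that was missing."""
    try:
        with open(store_path) as f:
            store = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        store = {}

    # Replace stale entries (including the MISSING sentinel) for this node.
    store.setdefault(node_id, {}).update(new_hashes)

    # Write to a temp file, then atomically swap it into place.
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(os.path.abspath(store_path))
    )
    with os.fdopen(fd, "w") as f:
        json.dump(store, f, indent=2)
    os.replace(tmp_path, store_path)
```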

Verification: After the fix, inspect file_hashes.json at the end of the first backup cycle — if MISSING values have been replaced with real hashes, the fix is working. The next cycle should show "skipped" for nodes with no changes.
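That inspection is easy to automate. A small helper (hypothetical, not part of the backup tool) that counts how many entries are still stuck at the MISSING sentinel:

```python
import json


def count_missing(store):
    """Count hash entries still stuck at the MISSING sentinel,
    across all nodes in a file_hashes.json-shaped dict."""
    return sum(
        1
        for node_hashes in store.values()
        for value in node_hashes.values()
        if value == "MISSING"
    )


# Usage after the first post-fix cycle:
# with open("file_hashes.json") as f:
#     print(count_missing(json.load(f)), "MISSING entries remain")
```

A healthy store should report zero after the first post-fix backup cycle.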

Lesson

Testing diff detection must cover both "backup triggers when files change" AND "backup skips when files don't change."

This bug survived because the second case was never verified. The system appeared to work — backup tasks ran on schedule, files were copied, logs showed success — but the diff mechanism had silently failed. This class of "looks like it's running but logic is broken" bug is only discoverable by comparing expected frequency against actual frequency.
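As a sketch, the missing test coverage boils down to exercising both branches of the comparison. The names here are hypothetical — should_back_up stands in for the real diff-detection entry point:

```python
def should_back_up(current_hash, stored_hash):
    """Back up only when the stored hash disagrees with reality."""
    return current_hash != stored_hash


def test_triggers_when_file_changed():
    assert should_back_up("new_hash", "old_hash")


def test_skips_when_file_unchanged():
    # The case that was never verified in production.
    assert not should_back_up("same_hash", "same_hash")


def test_missing_entry_triggers_once_then_settles():
    # First cycle: no stored hash yet, so back up...
    assert should_back_up("h1", "MISSING")
    # ...but after write-back, the second cycle must skip.
    assert not should_back_up("h1", "h1")
```

The third test is the one that would have caught this bug: it asserts that a MISSING entry triggers a backup exactly once, not forever.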

Correctness in operational tooling matters more than feature richness. A diff detector that always says "changes detected" is worse than having no diff detection at all.


Recorded: 2026-02-25

Author: TechsFree Engineering Team

