Crash-safe JSON at scale: atomic writes + recovery without a DB

How we hardened file-based JSON storage in a Flask app: atomic writes (fsync + os.replace), rolling .bak, and forensic-safe recovery reads.


1. Why we're still using JSON files in 2025

This system runs on Alpine Linux under Gunicorn. It manages 60+ retail stores today (roadmap: 300+), with ~50 SKUs per location. Operators tweak delivery schedules, safety stock logic, and product exclusions 5–10 times per day through a Flask UI.

The "configuration database" is just JSON files:

  • data/store_settings.json - store-level rules (delivery days, safety days, exclusions)
  • data/delivery_schedule.json - weekly routing / delivery cadence

This is not an argument that databases are bad. It's a workload fit:

  • The data is config-like, not an event stream.
  • Writes happen at human cadence ("Save" clicks), not per-request.
  • Reads are "load whole object and keep it in memory".
  • Reliability matters more than throughput: corrupted config can brick startup.

The downside is obvious: naive JSON writes can corrupt the main file on crashes. And json.load() will raise JSONDecodeError on invalid JSON, which turns "operator saved a setting" into "app can't start."

So the storage layer in this repo is built around two boring invariants:

  1. Atomic writes: the main path is either the previous valid file or the new valid file — never a half-file.
  2. Recoverable reads: if main is broken, fallback to .bak, and don't overwrite evidence with defaults.

Everything below is real code from production (src/utils/json_store.py, plus integration in managers). Every snippet has a real call site in the app.


2. The atomic write: making crashes boring

The classic footgun looks like this:

with open("settings.json", "w", encoding="utf-8") as f:
    json.dump(data, f)

If the process crashes after truncation but before finishing json.dump, you get a broken file. Then json.load() fails with JSONDecodeError on next read.
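
To see the failure mode in isolation, here's a minimal repro sketch (throwaway temp directory, made-up settings content): simulate a crash mid-dump by leaving a truncated file behind, then watch the next read fail.

import json
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
settings_path = tmp / "settings.json"

# a previous, valid save
settings_path.write_text(json.dumps({"stores": {"Downtown": {"safety_days": 5.0}}}), encoding="utf-8")

# "crash" partway through a naive rewrite: only a prefix of the new JSON made it to disk
settings_path.write_text('{"stores": {"Downtown"', encoding="utf-8")

try:
    json.loads(settings_path.read_text(encoding="utf-8"))
except json.JSONDecodeError as exc:
    print(f"next startup dies here: {exc}")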

In this repo, all durable writes go through write_json_atomic():

  • write to a temp file next to the target (same directory)
  • flush() then os.fsync() (push Python + OS buffers to disk)
  • atomically swap temp -> main with os.replace()
  • keep a rolling *.bak snapshot before starting the risky part

Python's own docs spell out the "flush, then fsync" sequence if you need buffered data actually on disk. And os.replace() is designed for atomic replacement on the same filesystem (it maps to atomic rename/replace semantics on POSIX).

Here's the production helper from src/utils/json_store.py (trimmed to the core, unchanged in behavior):

import json
import os
import shutil
import tempfile
from pathlib import Path

# WriteOptions and _lock_file are defined earlier in the same module.

def write_json_atomic(path, payload, *, validate=None, options=WriteOptions()):
    path = Path(path)
    if options.create_dirs:
        path.parent.mkdir(parents=True, exist_ok=True)

    if validate is not None:
        validate(payload)

    lock_fd = _lock_file(path) if options.lock else None
    try:
        if options.backup and path.exists():
            bak = path.with_suffix(path.suffix + ".bak")
            shutil.copy2(path, bak)

        with tempfile.NamedTemporaryFile(
            mode="w",
            encoding="utf-8",
            dir=str(path.parent),
            delete=False,
            prefix=path.name + ".",
            suffix=".tmp",
        ) as tf:
            json.dump(payload, tf, ensure_ascii=options.ensure_ascii,
                      indent=options.indent, sort_keys=options.sort_keys)
            tf.flush()
            os.fsync(tf.fileno())

        os.replace(tf.name, str(path))
    finally:
        if lock_fd is not None:
            lock_fd.close()

Why this holds up under ugly conditions:

  • Temp file in the same dir avoids cross-filesystem rename pitfalls. If you write to /tmp and then "move" into your data dir, you can silently end up with copy+delete semantics instead of a true atomic replace (see the sketch after this list).
  • Rolling .bak before writing means a crash during the new write doesn't destroy your last known good state.
  • fsync() makes "power loss mid-save" a controlled problem instead of a probabilistic one.
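
If you're unsure whether two paths share a filesystem, a quick check is to compare device IDs (a sketch, assuming a data/ directory exists). write_json_atomic() sidesteps the question entirely by creating the temp file in the target's own directory.

import os
import tempfile
from pathlib import Path

data_dir = Path("data")                     # where the JSON files live
tmp_dir = Path(tempfile.gettempdir())       # where a temp file would land by default

same_fs = os.stat(data_dir).st_dev == os.stat(tmp_dir).st_dev
print(f"{tmp_dir} and {data_dir} on the same filesystem: {same_fs}")
# If this prints False, an os.replace() from a /tmp temp file into data/ would
# fail with EXDEV, and shutil.move() would silently fall back to copy+delete.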

Trade-offs (honest limitations):

  • This gives single-file atomicity. If you need a transaction across multiple JSON files, you need a different design (DB or a higher-level journal).
  • fsync() is a cost you pay on every save. In our workload (operator-driven saves), that cost is acceptable; in high-frequency writes it may not be. A quick way to measure it is sketched below.
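
Here's a rough way to put a number on that cost (a sketch; the payload size is made up, and if your temp directory is tmpfs the timing will look unrealistically good, so point it at the same disk as data/):

import tempfile
import time
from pathlib import Path

from src.utils.json_store import write_json_atomic

bench_dir = Path(tempfile.mkdtemp())        # ideally on the same disk as data/
payload = {"stores": {f"store_{i}": {"safety_days": 3.0} for i in range(300)}}

start = time.perf_counter()
for _ in range(10):
    write_json_atomic(bench_dir / "bench.json", payload)
avg_ms = (time.perf_counter() - start) * 1000 / 10
print(f"average durable save (backup + temp write + fsync + replace): {avg_ms:.1f} ms")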

3. Recovery reads: forensic-safe defaults

The write path guarantees that main is either the previous valid file or the new one, never half-broken. But reads are where you handle "something already went wrong."

A tempting fallback:

try:
    settings = json.load(open("settings.json"))
except json.JSONDecodeError:
    settings = get_default_settings()
    json.dump(settings, open("settings.json", "w"))  # Destroys evidence

Problem: if the main file was corrupted by a kernel panic or disk error, this auto-write obliterates your chance to debug root cause. You've replaced a forensic artifact (the broken JSON) with a pristine default. Good for uptime, bad for post-mortem.

The approach in this repo follows a forensic-safe principle: if main is corrupted and backup is missing, return the default without writing. Preserve the broken file for later analysis.

Here's the production helper from src/utils/json_store.py:

from typing import Any, Callable, Optional

# read_json and WriteOptions are defined in the same module (src/utils/json_store.py).

def read_json_with_recovery(
    path: str | Path,
    default: Any = None,
    *,
    validate: Optional[Callable[[Any], None]] = None,
    write_default_if_missing: bool = False,
    restore_main_from_backup: bool = True,
) -> Any:
    """
    Recovery read pattern:
    - main -> backup (.bak) on read/parse errors
    - if loaded from .bak: best-effort restore main (backup=False)
    - DEFAULT is written only when both main and backup are missing
    """
    path = Path(path)
    bak = path.with_suffix(path.suffix + ".bak")

    def _validate(payload: Any) -> None:
        if validate is not None:
            validate(payload)

    # 1) main exists
    if path.exists():
        try:
            payload = read_json(path, default=default)
            _validate(payload)
            return payload
        except (json.JSONDecodeError, OSError, ValueError):
            # 2) fallback to .bak
            if bak.exists():
                payload = read_json(bak, default=default)
                _validate(payload)

                if restore_main_from_backup:
                    # Restore main WITHOUT overwriting a good .bak
                    try:
                        write_json_atomic(
                            path,
                            payload,
                            options=WriteOptions(backup=False),
                        )
                    except Exception:
                        pass

                return payload

            # main is corrupted and no backup: forensic-safe default (no write)
            return default

    # 3) main missing, try backup
    if bak.exists():
        payload = read_json(bak, default=default)
        _validate(payload)

        if restore_main_from_backup:
            try:
                write_json_atomic(
                    path,
                    payload,
                    options=WriteOptions(backup=False),
                )
            except Exception:
                pass

        return payload

    # 4) both missing: optionally write DEFAULT
    if write_default_if_missing:
        try:
            write_json_atomic(
                path,
                default,
                options=WriteOptions(backup=False),
            )
        except Exception:
            pass

    return default

Key branches:

Main exists but corrupted, backup exists: Load from .bak, then restore main with backup=False (so the backup step doesn't snapshot the still-corrupted main over the good .bak). Main becomes operational again; .bak stays as "last known good."

Main corrupted, no backup: Return default without writing. The broken main file sits on disk for investigation. The call site can log the failure and alert.

Both missing (first run): Only scenario where we write the default to disk. Parameter write_default_if_missing controls whether you want auto-creation or manual setup.

Why backup=False during restore matters: the backup step in write_json_atomic() snapshots whatever currently sits at the target path, and during a restore that's the corrupted main file. With the default backup=True, you'd copy the broken main over your good .bak before replacing it, losing both your last known good state and the forensic snapshot in one move.

Over ~6 months we've hit the usual human-edit failures (trailing commas, missing braces). The point isn't heroics - it's that the app keeps starting, and the broken file stays on disk.


4. Production integration: SettingsManager

This isn't abstract code living in a utils folder. It's wired into the Flask managers that operators interact with daily.

The settings manager loads on app startup and reloads on demand. Operators change delivery days, safety stock thresholds, and product exclusions through a web UI. Those changes hit SettingsManager.save(), which triggers the atomic write.

From src/managers/settings_manager.py:

import os
from datetime import datetime
from src.utils.json_store import read_json_with_recovery, write_json_atomic

SETTINGS_FILE = 'data/store_settings.json'

class SettingsManager:
    def reload(self) -> bool:
        """Reload settings from disk (with recovery)"""
        settings_path = SETTINGS_FILE
        backup_path = f"{SETTINGS_FILE}.bak"

        existed_main = os.path.exists(settings_path)
        existed_bak = os.path.exists(backup_path)

        def _validate_settings(payload) -> None:
            if not isinstance(payload, dict):
                raise ValueError(f"store_settings must be a dict, got {type(payload)}")

        loaded = read_json_with_recovery(
            settings_path,
            default=self._get_default_settings(),
            validate=_validate_settings,
            write_default_if_missing=not existed_main and not existed_bak,
            restore_main_from_backup=True,
        )

        self._settings = loaded if isinstance(loaded, dict) else self._get_default_settings()
        return True

    def save(self) -> bool:
        """Persist settings to disk (atomic write)"""
        if '_metadata' not in self._settings:
            self._settings['_metadata'] = {}
        self._settings['_metadata']['updated_at'] = datetime.now().isoformat()

        write_json_atomic(SETTINGS_FILE, self._settings)
        return True

What happens on each save:

  1. Metadata updated (timestamp)
  2. write_json_atomic() creates .bak snapshot of current file
  3. New JSON written to temp file with fsync()
  4. Atomic swap via os.replace()
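
For context, here's roughly what the call site looks like from the web side. The route name and payload shape are hypothetical (the manager's update API isn't shown in the excerpt, so this sketch pokes _settings directly), but the save path is the real one:

from flask import Flask, jsonify, request

from src.managers.settings_manager import SettingsManager

app = Flask(__name__)
settings_manager = SettingsManager()

@app.post("/api/stores/<store_name>/settings")   # hypothetical route
def update_store_settings(store_name: str):
    changes = request.get_json(force=True)       # e.g. {"delivery_day": "friday", "safety_days": 5.0}
    store = settings_manager._settings.setdefault("stores", {}).setdefault(store_name, {})
    store.update(changes)
    settings_manager.save()                      # .bak snapshot + temp file + fsync + os.replace
    return jsonify({"status": "saved"})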

The actual settings structure (representative shape, redacted):

{
  "stores": {
    "Downtown Location": {
      "delivery_day": "friday",
      "safety_days": 5.0,
      "excluded_articles": ["102731", "103190"],
      "is_active": true
    },
    "Eastside Store": {
      "delivery_day": "wednesday",
      "delivery_weeks": [1, 3],
      "safety_days": 3.0,
      "is_active": true
    }
  },
  "global_settings": {
    "default_safety_days": 3.0,
    "default_safety_coefficient": 1.3
  },
  "_metadata": {
    "version": "1.0",
    "updated_at": "2025-01-04T15:23:11"
  }
}

Second integration point: delivery schedule.

The delivery schedule file (data/delivery_schedule.json) maps stores to weekly delivery days. Similar pattern, different manager (src/managers/delivery_schedule.py):

SCHEDULE_FILE = 'data/delivery_schedule.json'

def load_delivery_schedule() -> Dict:
    """Load delivery schedule (with recovery from .bak)"""
    schedule_path = SCHEDULE_FILE
    backup_path = f"{SCHEDULE_FILE}.bak"

    def _read_json(path: str) -> Dict:
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        if not isinstance(data, dict):
            raise ValueError(f"Schedule must be a dict, got {type(data)}")
        return data

    # Try main file
    if os.path.exists(schedule_path):
        try:
            return _read_json(schedule_path)
        except Exception:
            # Fallback to .bak
            if os.path.exists(backup_path):
                schedule = _read_json(backup_path)
                # Restore main without touching .bak
                try:
                    write_json_atomic(
                        schedule_path,
                        schedule,
                        options=WriteOptions(backup=False),
                    )
                except Exception:
                    pass
                return schedule

            # Forensic-safe: don't overwrite corrupted main
            return DEFAULT_SCHEDULE

    # Main missing, try backup
    if os.path.exists(backup_path):
        schedule = _read_json(backup_path)
        try:
            write_json_atomic(
                schedule_path,
                schedule,
                options=WriteOptions(backup=False),
            )
        except Exception:
            pass
        return schedule

    # Both missing: create default
    save_delivery_schedule(DEFAULT_SCHEDULE)
    return DEFAULT_SCHEDULE

This is more verbose than calling the generic helper, but it's the same logic with each failure branch spelled out.


5. The incident: when .bak saved the system

What happened (~2 months into production):

An operator was editing store settings through the web UI - adjusting delivery parameters for one of the 60+ locations. Clicked "Save". The server experienced a power outage.

On restart:

json.load() failed on the main settings file with JSONDecodeError. Whether this came from filesystem-level corruption, an interrupted write in an older code path, or some other failure mode, the recovery logic behaved as designed.

Execution path:

  • store_settings.json exists -> json.load() fails
  • .bak exists -> load from .bak succeeds
  • Restore main with backup=False

The system came back up with the last known good configuration. No manual intervention. The operator's in-progress edit was lost (expected), but the previous stable state was intact.

What the .bak pattern prevented:

Without the rolling backup:

  • Corrupted main file
  • No fallback
  • System runs with in-memory defaults
  • Manual recovery to get the last known good configuration back

Why the pattern held:

  1. Rolling .bak created before writing the new version: The backup snapshot happened before temp+fsync+replace. Even if the write path failed, the old .bak stayed intact.

  2. Restore with backup=False: When we healed main from .bak, we didn't create a new backup. Restore is a healing write, not a new version - so we don't rotate backups during restore. The .bak stayed as "last known good."

This wasn't heroics. It's what the pattern is designed to handle: ungraceful shutdowns during operator-driven edits.


6. Where this stops working

Be honest about limitations.

This approach works for our current scale (60+ stores, expanding to 300+). But there are clear triggers where a DB becomes the simpler option.

You'll hit limits when:

1. Writes move from human cadence to automated churn

When writes move from operator-driven (a few per day) to automated high-frequency updates (many per hour), the cost profile changes. Each write:

  • Copies the previous file into .bak (via shutil.copy2)
  • Writes the whole payload to a temp file
  • Calls fsync() (blocks until disk confirms)

At operator-driven cadence, this is cheap. At high-frequency automated writes, you need append-only or delta-based storage. SQLite (often with WAL) is a common next step.
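
For reference, the shape of that next step might look like this (a sketch, not code from this repo; the table and column names are illustrative): SQLite in WAL mode with one row per store, so a save is a small transaction instead of a whole-file rewrite.

import json
import sqlite3

conn = sqlite3.connect("data/config.db")
conn.execute("PRAGMA journal_mode=WAL")      # readers don't block the single writer
conn.execute("PRAGMA synchronous=NORMAL")    # durability handled at WAL checkpoints
conn.execute(
    "CREATE TABLE IF NOT EXISTS store_settings ("
    " store_name TEXT PRIMARY KEY,"
    " settings   TEXT NOT NULL)"             # JSON blob per store
)

def save_store(store_name: str, settings: dict) -> None:
    with conn:                               # one small transaction per save
        conn.execute(
            "INSERT INTO store_settings (store_name, settings) VALUES (?, ?) "
            "ON CONFLICT(store_name) DO UPDATE SET settings = excluded.settings",
            (store_name, json.dumps(settings)),
        )

save_store("Downtown Location", {"delivery_day": "friday", "safety_days": 5.0})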

2. Concurrent writers

Last write wins. The fcntl-based lock is advisory and Unix-only:

try:
    import fcntl                  # Unix-only; missing on Windows
except ImportError:
    fcntl = None

def _lock_file(path: Path):
    if fcntl is None:  # Non-Unix platforms
        return None

    lock_path = path.with_suffix(path.suffix + ".lock")
    lock_path.parent.mkdir(parents=True, exist_ok=True)

    fd = open(lock_path, "a+", encoding="utf-8")
    fcntl.flock(fd.fileno(), fcntl.LOCK_EX)
    return fd

This is advisory locking - it helps when all writers cooperate. It doesn't prevent non-cooperating processes or out-of-band writes. For real multi-writer scenarios, you need database-level locks or conflict resolution.
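
If an out-of-band script does need to edit these files, it has to cooperate explicitly. A sketch of what that might look like, assuming the repo's "<file>.lock" naming convention and that WriteOptions exposes the lock flag used above:

import fcntl
import json
from pathlib import Path

from src.utils.json_store import WriteOptions, write_json_atomic

target = Path("data/store_settings.json")
lock_path = target.with_suffix(target.suffix + ".lock")   # same convention as _lock_file()

with open(lock_path, "a+", encoding="utf-8") as lock_fd:
    fcntl.flock(lock_fd.fileno(), fcntl.LOCK_EX)           # blocks while the app holds the lock
    try:
        data = json.loads(target.read_text(encoding="utf-8"))
        data.setdefault("global_settings", {})["default_safety_days"] = 4.0
        # lock=False: we already hold the advisory lock for the whole read-modify-write
        write_json_atomic(target, data, options=WriteOptions(lock=False))
    finally:
        fcntl.flock(lock_fd.fileno(), fcntl.LOCK_UN)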

3. Large files

Loading and writing the entire file on each change becomes wasteful. Files are currently under ~100KB. If your config grows significantly larger, partial updates matter.

4. Partial updates

Need to update one store's safety_days without loading all stores? With JSON files, you read-modify-write the whole object. With a database, you UPDATE a single row.

5. Complex queries

"Find all stores where delivery_day='friday' AND safety_days > 3" is a linear scan through the JSON. Databases with indexes avoid scanning the entire configuration.

Migration triggers:

We'll reconsider when:

  • Write frequency shifts from operator-driven to automated
  • File sizes grow beyond comfortable read-modify-write scale
  • Query patterns need indexes
  • Concurrent edit conflicts become a real issue

If/when the workload changes (more writes, larger configs, concurrent edits), SQLite is an obvious next step.

Current state (6 months in):

  • Operator-driven writes (a few per day)
  • Files under ~100KB
  • Single power-outage incident (recovered cleanly)
  • No performance issues
  • Pattern remains appropriate for workload

The decision isn't "JSON good, databases bad." It's "match storage to workload."


7. Design decisions that matter

These aren't best practices from a textbook. They're specific choices that make this pattern work in production.

Why forensic-safe defaults

When the main file is corrupted and backup is missing, we return the default without writing. This preserves evidence for post-mortem analysis.

The code path in read_json_with_recovery():

# 1) main exists
if path.exists():
    try:
        payload = read_json(path, default=default)
        _validate(payload)
        return payload
    except (json.JSONDecodeError, OSError, ValueError):
        # 2) fallback to .bak
        if bak.exists():
            payload = read_json(bak, default=default)
            _validate(payload)
            # ... restore main ...
            return payload

        # main corrupted, no backup: forensic-safe (no write)
        return default

The app runs with in-memory defaults. Operators can work. The broken file waits for investigation. Was it a filesystem issue? Disk error? Bug in an older code path? With the file still on disk, you can debug.

Why restore with backup=False

When loading from .bak, we restore main but don't create a new backup. Restore is a healing write, not a new version - so we don't rotate backups during restore.

if restore_main_from_backup:
    write_json_atomic(
        path,
        payload,
        options=WriteOptions(backup=False),
    )

This keeps .bak as the "last known good state" instead of letting the backup step copy the still-corrupted main over it.

Why validation hooks

Simple isinstance() checks catch structurally wrong data early: JSON that parses fine but isn't the shape the app expects. Better to fail fast on load than propagate corrupt data through the system.

def _validate_settings(payload):
    if not isinstance(payload, dict):
        raise ValueError("must be dict")

This isn't a schema system. It's a tripwire. Combined with the parse step, it catches the usual human-edit failures early (invalid JSON, wrong top-level type) before that data leaks deeper into the app.

The write_json_atomic helper can accept a validation hook, and read_json_with_recovery runs validation on loaded data. We use it primarily on read to catch corruption early.
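
If you want a slightly stricter tripwire without pulling in a schema library, a validator like this (a sketch, not the production check) still fits the same validate hook:

VALID_DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}

def validate_store_settings(payload) -> None:
    if not isinstance(payload, dict):
        raise ValueError(f"store_settings must be a dict, got {type(payload)}")
    for name, store in payload.get("stores", {}).items():
        if not isinstance(store, dict):
            raise ValueError(f"store {name!r} must be a dict, got {type(store)}")
        day = store.get("delivery_day")
        if day is not None and day not in VALID_DAYS:
            raise ValueError(f"store {name!r}: unknown delivery_day {day!r}")
        if "safety_days" in store and not isinstance(store["safety_days"], (int, float)):
            raise ValueError(f"store {name!r}: safety_days must be a number")

# plugs into the same hook:
# read_json_with_recovery(SETTINGS_FILE, default={}, validate=validate_store_settings)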

Why same-directory temp files

To keep the atomic replace semantics, we create the temp file next to the target (same directory, same filesystem).

with tempfile.NamedTemporaryFile(
    dir=str(path.parent),  # Same filesystem as target
    delete=False,
    ...
) as tf:

os.replace() can't cross filesystems (it fails with EXDEV), and generic "move" helpers like shutil.move fall back to copy-and-delete, which isn't atomic. Creating the temp file in the target's directory sidesteps both.

Why rolling .bak before write

Creating the backup before starting the write means power loss during the write doesn't lose both old and new data.

if options.backup and path.exists():
    bak = path.with_suffix(path.suffix + ".bak")
    shutil.copy2(path, bak)  # Snapshot before temp+fsync+replace

If we created .bak after the write, it would snapshot the just-written file rather than the previous known-good state, and a failure during the save would leave no fresh fallback at all.


8. Production setup: Alpine + Gunicorn

The deployment stack:

Client -> Nginx -> Gunicorn (4 workers) -> Flask app -> data/*.json files

Alpine Linux:

We run on Alpine Linux. It uses OpenRC for init (not systemd) and apk for packages (not apt/yum). Smaller attack surface, smaller images.

Gunicorn:

We run 4 Gunicorn workers. Command:

gunicorn -w 4 -b 127.0.0.1:8000 'app:app'

Binding to 127.0.0.1 (not 0.0.0.0) because Nginx sits in front and handles external traffic.

File permissions:

chown -R app:app data/
chmod 755 data/
chmod 644 data/*.json data/*.json.bak

The app user needs read/write on the data directory. Backup files get the same permissions as main files (via shutil.copy2).

What matters for file-based storage:

The JSON files live in data/. The atomic write pattern creates temp files in the same directory (same filesystem). Permissions need to allow the app user to create/rename files in that directory.

OpenRC manages the service on Alpine. Nginx handles external requests and proxies to Gunicorn on localhost.

This setup has been running for 6 months with no additional JSON corruption incidents beyond the single power-outage recovery.


9. Code verification: what we actually tested

This isn't a theoretical pattern. The code is running in production managing 60+ stores.

Core helpers (src/utils/json_store.py):

write_json_atomic(...) writes to a temp file next to the target, flush()+fsync(), then atomically swaps with os.replace(), optionally keeping a rolling .bak.

read_json_with_recovery(...) falls back to .bak on parse/read errors. If recovery uses .bak, it restores main with backup=False. Defaults are written only when both main and .bak are missing and write_default_if_missing is enabled.

Production integration:

# src/managers/settings_manager.py
SettingsManager.reload() 
# - Uses read_json_with_recovery()
# - Validates payload is dict
# - Writes default only when both files missing

SettingsManager.save()
# - Updates metadata timestamp
# - Calls write_json_atomic()
# - Creates .bak on every save

# src/managers/delivery_schedule.py
load_delivery_schedule()
# - Same recovery pattern
# - Explicit branching for failure modes

Failure modes we tested locally (real commands):

(A) corrupted main + .bak exists → load .bak, restore main, .bak unchanged

python -c "import json,tempfile,pathlib
from src.managers import settings_manager as sm
d=pathlib.Path(tempfile.mkdtemp())
main=d/'store_settings.json'; bak=d/'store_settings.json.bak'
bak.write_text(json.dumps({'stores':{'X':{'delivery_day':'tuesday'}},'_metadata':{'updated_at':'never'}}),encoding='utf-8')
bak_bytes=bak.read_bytes(); bak_mtime=bak.stat().st_mtime
main.write_text('{broken json',encoding='utf-8')
sm.SETTINGS_FILE=str(main)
mgr=sm.SettingsManager(); mgr.reload()
assert main.read_text(encoding='utf-8')==bak.read_text(encoding='utf-8')
assert bak.read_bytes()==bak_bytes and bak.stat().st_mtime==bak_mtime
print('OK: loaded from .bak, main restored, .bak unchanged')"

(B) main missing + .bak missing → default created and persisted

python -c "import json,tempfile,pathlib
from src.managers import settings_manager as sm
d=pathlib.Path(tempfile.mkdtemp())
main=d/'store_settings.json'
sm.SETTINGS_FILE=str(main)
mgr=sm.SettingsManager(); mgr._settings=None; mgr.reload()
assert main.exists()
json.loads(main.read_text(encoding='utf-8'))
assert not (d/'store_settings.json.bak').exists()
print('OK: default written only when both main and .bak were missing')"

The pattern is straightforward: atomic writes, forensic reads, rolling backups. No hidden infrastructure.


10. Conclusion: boring reliability

What we built:

A file-based storage layer for configuration data that survives power outages, invalid JSON on disk, and operator mistakes. Running in production for 6 months managing 60+ retail stores.

Key pieces:

  • Atomic writes (temp + fsync + os.replace)
  • Forensic-safe recovery (rolling .bak, don't destroy evidence)
  • Production integration (Flask managers, real operators)
  • Real incident (power outage) + synthetic test scenarios

When to use this:

  • Config-heavy apps (settings, schedules, mappings)
  • Infrequent writes (operator-driven, not event streams)
  • Small files (under a few hundred KB)
  • Linux/POSIX environments
  • Read pattern is "load whole object"

When NOT to use:

  • High-frequency writes (automated churn, many per hour)
  • Concurrent writers needing coordination
  • Large datasets requiring partial updates
  • Complex query patterns needing indexes

Our migration triggers:

We'll move to SQLite when:

  • Write frequency shifts to automated high-volume
  • File sizes grow beyond comfortable read-modify-write scale
  • Query patterns need indexes
  • Concurrent edit conflicts become real

The 300+ store expansion may hit these triggers. That's fine. The pattern served its purpose: reliable config storage without premature database complexity.

The takeaway:

Match storage to workload. JSON files with careful atomicity work for config-heavy apps with infrequent writes. Once you need multi-writer coordination, partial updates, or indexed queries, a DB becomes the simpler option.

Don't over-engineer storage you don't need yet. But don't under-engineer atomicity when crashes matter.

This pattern gave us 6 months of boring reliability. Sometimes boring is exactly what you want.


If you've shipped similar file-backed storage, I'm curious what failure modes you tested.
