description: How we hardened file-based JSON storage in a Flask app - atomic writes (fsync+os.replace), rolling .bak, and forensic-safe recovery reads.
1. Why we're still using JSON files in 2025
This system runs on Alpine Linux under Gunicorn. It manages 60+ retail stores today (roadmap: 300+), with ~50 SKUs per location. Operators tweak delivery schedules, safety stock logic, and product exclusions 5–10 times per day through a Flask UI.
The "configuration database" is just JSON files:
- data/store_settings.json - store-level rules (delivery days, safety days, exclusions)
- data/delivery_schedule.json - weekly routing / delivery cadence
This is not an argument that databases are bad. It's a workload fit:
- The data is config-like, not an event stream.
- Writes happen at human cadence ("Save" clicks), not per-request.
- Reads are "load whole object and keep it in memory".
- Reliability matters more than throughput: corrupted config can brick startup.
The downside is obvious: naive JSON writes can corrupt the main file on crashes. And json.load() will throw JSONDecodeError on invalid JSON — which turns "operator saved a setting" into "app can't start."
So the storage layer in this repo is built around two boring invariants:
- Atomic writes: the main path is either the previous valid file or the new valid file — never a half-file.
- Recoverable reads: if main is broken, fall back to .bak, and don't overwrite evidence with defaults.
Everything below is real code from production (src/utils/json_store.py, plus integration in managers). Every snippet has a real call site in the app.
2. The atomic write: making crashes boring
The classic footgun looks like this:
with open("settings.json", "w", encoding="utf-8") as f:
    json.dump(data, f)
If the process crashes after truncation but before finishing json.dump, you get a broken file. Then json.load() fails with JSONDecodeError on next read.
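You can reproduce that failure mode without an actual crash: stop a write mid-payload and try to read the file back. A minimal sketch (the file name is illustrative):

import json

# Simulate a crash after truncation but before json.dump() finished
with open("settings.json", "w", encoding="utf-8") as f:
    f.write('{"stores": {"Downtown')  # partial payload; pretend the process died here

try:
    with open("settings.json", encoding="utf-8") as f:
        json.load(f)
except json.JSONDecodeError as exc:
    print(f"next read fails: {exc}")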
In this repo, all durable writes go through write_json_atomic():
- write to a temp file next to the target (same directory)
- flush() then os.fsync() (push Python + OS buffers to disk)
- atomically swap temp -> main with os.replace()
- keep a rolling *.bak snapshot before starting the risky part
Python's own docs spell out the "flush then fsync" sequence if you want buffers on disk. And os.replace() is designed for atomic replacement on the same filesystem (it maps to atomic rename/replace semantics on POSIX).
Here's the production helper from src/utils/json_store.py (trimmed to the core, unchanged in behavior):
import json
import os
import shutil
import tempfile
from pathlib import Path

# WriteOptions and _lock_file are defined in the same module (src/utils/json_store.py)

def write_json_atomic(path, payload, *, validate=None, options=WriteOptions()):
    path = Path(path)
    if options.create_dirs:
        path.parent.mkdir(parents=True, exist_ok=True)
    if validate is not None:
        validate(payload)
    lock_fd = _lock_file(path) if options.lock else None
    try:
        # Rolling snapshot of the current file before the risky part
        if options.backup and path.exists():
            bak = path.with_suffix(path.suffix + ".bak")
            shutil.copy2(path, bak)
        # Write the new payload to a temp file in the same directory as the target
        with tempfile.NamedTemporaryFile(
            mode="w",
            encoding="utf-8",
            dir=str(path.parent),
            delete=False,
            prefix=path.name + ".",
            suffix=".tmp",
        ) as tf:
            json.dump(payload, tf, ensure_ascii=options.ensure_ascii,
                      indent=options.indent, sort_keys=options.sort_keys)
            tf.flush()
            os.fsync(tf.fileno())
        # Atomic swap: readers see either the old file or the new one, never a half-file
        os.replace(tf.name, str(path))
    finally:
        if lock_fd is not None:
            lock_fd.close()
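The helper leans on a WriteOptions dataclass that isn't shown above. Here's a minimal sketch of its shape, inferred from the fields write_json_atomic() uses - the defaults are assumptions, not the repo's actual definition:

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class WriteOptions:
    # Field names inferred from write_json_atomic(); default values are guesses
    backup: bool = True            # snapshot main into .bak before writing
    lock: bool = True              # take the advisory .lock file (fcntl, Unix-only)
    create_dirs: bool = True       # create the parent directory if needed
    ensure_ascii: bool = False     # passed straight through to json.dump
    indent: Optional[int] = None   # None keeps output compact
    sort_keys: bool = False        # stable key order if you want clean diffs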
Why this holds up under ugly conditions:
- Temp file in the same dir avoids cross-filesystem rename pitfalls. If you write to /tmp and then "move" into your data dir, you can silently end up with copy+delete semantics instead of a true atomic replace.
- Rolling .bak before writing means a crash during the new write doesn't destroy your last known good state.
- fsync() makes "power loss mid-save" a controlled problem instead of a probabilistic one.
Trade-offs (honest limitations):
- This gives single-file atomicity. If you need a transaction across multiple JSON files, you need a different design (DB or a higher-level journal).
- fsync() is a cost you pay on every save. In our workload (operator-driven saves), that cost is acceptable; in high-frequency writes it may not be.
3. Recovery reads: forensic-safe defaults
The write path guarantees main is either the old valid file or the new one, never half-broken. Reads are where you handle "something already went wrong."
A tempting fallback:
try:
    settings = json.load(open("settings.json"))
except json.JSONDecodeError:
    settings = get_default_settings()
    json.dump(settings, open("settings.json", "w"))  # Destroys evidence
Problem: if the main file was corrupted by a kernel panic or disk error, this auto-write obliterates your chance to debug root cause. You've replaced a forensic artifact (the broken JSON) with a pristine default. Good for uptime, bad for post-mortem.
The approach in this repo follows a forensic-safe principle: if main is corrupted and backup is missing, return the default without writing. Preserve the broken file for later analysis.
Here's the production helper from src/utils/json_store.py:
import json
from pathlib import Path
from typing import Any, Callable, Optional

# read_json, write_json_atomic, and WriteOptions live in the same module

def read_json_with_recovery(
    path: str | Path,
    default: Any = None,
    *,
    validate: Optional[Callable[[Any], None]] = None,
    write_default_if_missing: bool = False,
    restore_main_from_backup: bool = True,
) -> Any:
    """
    Recovery read pattern:
    - main -> backup (.bak) on read/parse errors
    - if loaded from .bak: best-effort restore main (backup=False)
    - DEFAULT is written only when both main and backup are missing
    """
    path = Path(path)
    bak = path.with_suffix(path.suffix + ".bak")

    def _validate(payload: Any) -> None:
        if validate is not None:
            validate(payload)

    # 1) main exists
    if path.exists():
        try:
            payload = read_json(path, default=default)
            _validate(payload)
            return payload
        except (json.JSONDecodeError, OSError, ValueError):
            # 2) fallback to .bak
            if bak.exists():
                payload = read_json(bak, default=default)
                _validate(payload)
                if restore_main_from_backup:
                    # Restore main WITHOUT overwriting a good .bak
                    try:
                        write_json_atomic(
                            path,
                            payload,
                            options=WriteOptions(backup=False),
                        )
                    except Exception:
                        pass
                return payload
            # main is corrupted and no backup: forensic-safe default (no write)
            return default

    # 3) main missing, try backup
    if bak.exists():
        payload = read_json(bak, default=default)
        _validate(payload)
        if restore_main_from_backup:
            try:
                write_json_atomic(
                    path,
                    payload,
                    options=WriteOptions(backup=False),
                )
            except Exception:
                pass
        return payload

    # 4) both missing: optionally write DEFAULT
    if write_default_if_missing:
        try:
            write_json_atomic(
                path,
                default,
                options=WriteOptions(backup=False),
            )
        except Exception:
            pass
    return default
Key branches:
- Main exists but corrupted, backup exists: load from .bak, then restore main with backup=False (so we don't overwrite the good .bak with recovered data). Main becomes operational again; .bak stays as "last known good."
- Main corrupted, no backup: return the default without writing. The broken main file sits on disk for investigation. The call site can log the failure and alert.
- Both missing (first run): the only scenario where we write the default to disk. The write_default_if_missing parameter controls whether you want auto-creation or manual setup.
Why backup=False during restore matters: if you do write_json_atomic(main, data) with default backup=True, you'll snapshot the newly-restored main into .bak — replacing your actual last-known-good with the just-recovered data. Then you lose the forensic snapshot.
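Here's a minimal call-site sketch - a hypothetical wrapper, not code from the repo - showing how a caller can log when it ends up on the forensic-safe default path:

import logging
import os
from src.utils.json_store import read_json_with_recovery  # helper shown above

log = logging.getLogger(__name__)

DEFAULT_SETTINGS = {"stores": {}, "global_settings": {}}

def _must_be_dict(payload):
    if not isinstance(payload, dict):
        raise ValueError(f"expected dict, got {type(payload)!r}")

def load_settings(path="data/store_settings.json"):
    loaded = read_json_with_recovery(path, default=DEFAULT_SETTINGS, validate=_must_be_dict)
    if loaded is DEFAULT_SETTINGS and os.path.exists(path):
        # Main exists but was unreadable and no .bak could help: we're running on
        # in-memory defaults, and the broken file stays on disk for forensics.
        log.error("settings unrecoverable, running on defaults; inspect %s", path)
    return loaded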
Over ~6 months we've hit the usual human-edit failures (trailing commas, missing braces). The point isn't heroics - it's that the app keeps starting, and the broken file stays on disk.
4. Production integration: SettingsManager
This isn't abstract code living in a utils folder. It's wired into the Flask managers that operators interact with daily.
The settings manager loads on app startup and reloads on demand. Operators change delivery days, safety stock thresholds, and product exclusions through a web UI. Those changes hit SettingsManager.save(), which triggers the atomic write.
From src/managers/settings_manager.py:
import os
from datetime import datetime

from src.utils.json_store import read_json_with_recovery, write_json_atomic

SETTINGS_FILE = 'data/store_settings.json'

class SettingsManager:
    def reload(self) -> bool:
        """Reload settings from disk (with recovery)"""
        settings_path = SETTINGS_FILE
        backup_path = f"{SETTINGS_FILE}.bak"
        existed_main = os.path.exists(settings_path)
        existed_bak = os.path.exists(backup_path)

        def _validate_settings(payload) -> None:
            if not isinstance(payload, dict):
                raise ValueError(f"store_settings must be a dict, got {type(payload)}")

        loaded = read_json_with_recovery(
            settings_path,
            default=self._get_default_settings(),
            validate=_validate_settings,
            write_default_if_missing=not existed_main and not existed_bak,
            restore_main_from_backup=True,
        )
        self._settings = loaded if isinstance(loaded, dict) else self._get_default_settings()
        return True

    def save(self) -> bool:
        """Persist settings to disk (atomic write)"""
        if '_metadata' not in self._settings:
            self._settings['_metadata'] = {}
        self._settings['_metadata']['updated_at'] = datetime.now().isoformat()
        write_json_atomic(SETTINGS_FILE, self._settings)
        return True
What happens on each save:
- Metadata updated (timestamp)
- write_json_atomic() creates a .bak snapshot of the current file
- New JSON written to a temp file with fsync()
- Atomic swap via os.replace()
The actual settings structure (representative shape, redacted):
{
  "stores": {
    "Downtown Location": {
      "delivery_day": "friday",
      "safety_days": 5.0,
      "excluded_articles": ["102731", "103190"],
      "is_active": true
    },
    "Eastside Store": {
      "delivery_day": "wednesday",
      "delivery_weeks": [1, 3],
      "safety_days": 3.0,
      "is_active": true
    }
  },
  "global_settings": {
    "default_safety_days": 3.0,
    "default_safety_coefficient": 1.3
  },
  "_metadata": {
    "version": "1.0",
    "updated_at": "2025-01-04T15:23:11"
  }
}
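For orientation, here's a hypothetical Flask route wiring an operator's "Save" click to SettingsManager.save() - the route, payload handling, and direct _settings update are illustrative, not the app's actual endpoints:

from flask import Flask, request, jsonify

app = Flask(__name__)
settings_manager = SettingsManager()  # class shown above
settings_manager.reload()             # recovery read at startup

@app.post("/stores/<store_name>/settings")
def update_store_settings(store_name):
    payload = request.get_json(force=True)
    # Direct dict update for brevity; the real manager exposes its own setters
    settings_manager._settings.setdefault("stores", {})[store_name] = payload
    settings_manager.save()  # .bak snapshot -> temp write + fsync -> os.replace
    return jsonify({"status": "saved"}), 200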
Second integration point: delivery schedule.
The delivery schedule file (data/delivery_schedule.json) maps stores to weekly delivery days. Similar pattern, different manager (src/managers/delivery_schedule.py):
import json
import os
from typing import Dict

from src.utils.json_store import write_json_atomic, WriteOptions

# DEFAULT_SCHEDULE and save_delivery_schedule() are defined in the same module

SCHEDULE_FILE = 'data/delivery_schedule.json'

def load_delivery_schedule() -> Dict:
    """Load delivery schedule (with recovery from .bak)"""
    schedule_path = SCHEDULE_FILE
    backup_path = f"{SCHEDULE_FILE}.bak"

    def _read_json(path: str) -> Dict:
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        if not isinstance(data, dict):
            raise ValueError(f"Schedule must be a dict, got {type(data)}")
        return data

    # Try main file
    if os.path.exists(schedule_path):
        try:
            return _read_json(schedule_path)
        except Exception:
            # Fallback to .bak
            if os.path.exists(backup_path):
                schedule = _read_json(backup_path)
                # Restore main without touching .bak
                try:
                    write_json_atomic(
                        schedule_path,
                        schedule,
                        options=WriteOptions(backup=False),
                    )
                except Exception:
                    pass
                return schedule
            # Forensic-safe: don't overwrite corrupted main
            return DEFAULT_SCHEDULE

    # Main missing, try backup
    if os.path.exists(backup_path):
        schedule = _read_json(backup_path)
        try:
            write_json_atomic(
                schedule_path,
                schedule,
                options=WriteOptions(backup=False),
            )
        except Exception:
            pass
        return schedule

    # Both missing: create default
    save_delivery_schedule(DEFAULT_SCHEDULE)
    return DEFAULT_SCHEDULE
This spells out what the generic helper does internally - same logic, with the failure branches written inline.
5. The incident: when .bak saved the system
What happened (~2 months into production):
An operator was editing store settings through the web UI - adjusting delivery parameters for one of the 60+ locations. Clicked "Save". The server experienced a power outage.
On restart:
json.load() failed on the main settings file with JSONDecodeError. Whether this came from filesystem-level corruption, an interrupted write in an older code path, or some other failure mode, the recovery logic behaved as designed.
Execution path:
- store_settings.json exists -> json.load() fails
- .bak exists -> load from .bak succeeds
- restore main with backup=False
The system came back up with the last known good configuration. No manual intervention. The operator's in-progress edit was lost (expected), but the previous stable state was intact.
What the .bak pattern prevented:
Without the rolling backup:
- Corrupted main file
- No fallback
- System runs with in-memory defaults
- Manual recovery to get the last known good configuration back
Why the pattern held:
- Rolling .bak created before writing the new version: the backup snapshot happened before temp+fsync+replace. Even if the write path failed, the old .bak stayed intact.
- Restore with backup=False: when we healed main from .bak, we didn't create a new backup. Restore is a healing write, not a new version, so we don't rotate backups during restore. The .bak stayed as "last known good."
This wasn't heroics. It's what the pattern is designed to handle: ungraceful shutdowns during operator-driven edits.
6. Where this stops working
Be honest about limitations.
This approach works for our current scale (60+ stores, expanding to 300+). But there are clear triggers where a DB becomes the simpler option.
You'll hit limits when:
1. Writes move from human cadence to automated churn
When writes move from operator-driven (a few per day) to automated high-frequency updates (many per hour), the cost profile changes. Each write:
- Copies the previous file into .bak (via shutil.copy2)
- Writes the whole payload to a temp file
- Calls fsync() (blocks until disk confirms)
At operator-driven cadence, this is cheap. At high-frequency automated writes, you need append-only or delta-based storage. SQLite (often with WAL) is a common next step.
2. Concurrent writers
Last write wins. The fcntl-based lock is advisory and Unix-only:
def _lock_file(path: Path):
    if fcntl is None:  # Non-Unix platforms
        return None
    lock_path = path.with_suffix(path.suffix + ".lock")
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = open(lock_path, "a+", encoding="utf-8")
    fcntl.flock(fd.fileno(), fcntl.LOCK_EX)
    return fd
This is advisory locking - it helps when all writers cooperate. It doesn't prevent non-cooperating processes or out-of-band writes. For real multi-writer scenarios, you need database-level locks or conflict resolution.
3. Large files
Loading and writing the entire file on each change becomes wasteful. Files are currently under ~100KB. If your config grows significantly larger, partial updates matter.
4. Partial updates
Need to update one store's safety_days without loading all stores? With JSON files, you read-modify-write the whole object. With a database, you UPDATE a single row.
5. Complex queries
"Find all stores where delivery_day='friday' AND safety_days > 3" is a linear scan through the JSON. Databases with indexes avoid scanning the entire configuration.
Migration triggers:
We'll reconsider when:
- Write frequency shifts from operator-driven to automated
- File sizes grow beyond comfortable read-modify-write scale
- Query patterns need indexes
- Concurrent edit conflicts become a real issue
If/when the workload changes (more writes, larger configs, concurrent edits), SQLite is an obvious next step.
Current state (6 months in):
- Operator-driven writes (a few per day)
- Files under ~100KB
- Single power-outage incident (recovered cleanly)
- No performance issues
- Pattern remains appropriate for workload
The decision isn't "JSON good, databases bad." It's "match storage to workload."
7. Design decisions that matter
These aren't best practices from a textbook. They're specific choices that make this pattern work in production.
Why forensic-safe defaults
When the main file is corrupted and backup is missing, we return the default without writing. This preserves evidence for post-mortem analysis.
The code path in read_json_with_recovery():
# 1) main exists
if path.exists():
    try:
        payload = read_json(path, default=default)
        _validate(payload)
        return payload
    except (json.JSONDecodeError, OSError, ValueError):
        # 2) fallback to .bak
        if bak.exists():
            payload = read_json(bak, default=default)
            _validate(payload)
            # ... restore main ...
            return payload
        # main corrupted, no backup: forensic-safe (no write)
        return default
The app runs with in-memory defaults. Operators can work. The broken file waits for investigation. Was it a filesystem issue? Disk error? Bug in an older code path? With the file still on disk, you can debug.
Why restore with backup=False
When loading from .bak, we restore main but don't create a new backup. Restore is a healing write, not a new version - so we don't rotate backups during restore.
if restore_main_from_backup:
    write_json_atomic(
        path,
        payload,
        options=WriteOptions(backup=False),
    )
This keeps .bak as "last known good state" rather than overwriting it with recovered data.
Why validation hooks
Simple isinstance() checks catch structurally wrong data early - valid JSON with the wrong shape. Better to fail fast on load than propagate bad data through the system.
def _validate_settings(payload):
    if not isinstance(payload, dict):
        raise ValueError("must be dict")
This isn't a schema system. It's a tripwire. It catches the usual human-edit failures early (wrong type, invalid JSON) before that data leaks deeper into the app.
The write_json_atomic helper can accept a validation hook, and read_json_with_recovery runs validation on loaded data. We use it primarily on read to catch corruption early.
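If you want a slightly stricter tripwire without pulling in a schema library, per-store field checks are cheap. This is a hypothetical extension, not what the repo ships:

VALID_DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}

def _validate_settings_strict(payload):
    if not isinstance(payload, dict):
        raise ValueError(f"store_settings must be a dict, got {type(payload)}")
    for name, store in payload.get("stores", {}).items():
        if not isinstance(store, dict):
            raise ValueError(f"store {name!r} must be a dict")
        day = store.get("delivery_day")
        if day is not None and day not in VALID_DAYS:
            raise ValueError(f"store {name!r}: unknown delivery_day {day!r}")
        safety = store.get("safety_days")
        if safety is not None and not isinstance(safety, (int, float)):
            raise ValueError(f"store {name!r}: safety_days must be numeric")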
Why same-directory temp files
To keep the atomic replace semantics, we create the temp file next to the target (same directory, same filesystem).
with tempfile.NamedTemporaryFile(
    dir=str(path.parent),  # Same filesystem as target
    delete=False,
    ...
) as tf:
Cross-filesystem moves can fall back to copy-and-delete operations depending on OS and mount points.
Why rolling .bak before write
Creating the backup before starting the write means power loss during the write doesn't lose both old and new data.
if options.backup and path.exists():
    bak = path.with_suffix(path.suffix + ".bak")
    shutil.copy2(path, bak)  # Snapshot before temp+fsync+replace
If we created .bak after the write, a crash mid-write could leave you with no valid state.
8. Production setup: Alpine + Gunicorn
The deployment stack:
Client -> Nginx -> Gunicorn (4 workers) -> Flask app -> data/*.json files
Alpine Linux:
We run on Alpine Linux. It uses OpenRC for init (not systemd) and apk for packages (not apt/yum). Smaller attack surface, smaller images.
Gunicorn:
We run 4 Gunicorn workers. Command:
gunicorn -w 4 -b 127.0.0.1:8000 'app:app'
Binding to 127.0.0.1 (not 0.0.0.0) because Nginx sits in front and handles external traffic.
File permissions:
chown -R app:app data/
chmod 755 data/
chmod 644 data/*.json data/*.json.bak
The app user needs read/write on the data directory. Backup files get the same permissions as main files (via shutil.copy2).
What matters for file-based storage:
The JSON files live in data/. The atomic write pattern creates temp files in the same directory (same filesystem). Permissions need to allow the app user to create/rename files in that directory.
OpenRC manages the service on Alpine. Nginx handles external requests and proxies to Gunicorn on localhost.
This setup has been running for 6 months with no additional JSON corruption incidents beyond the single power-outage recovery.
9. Code verification: what we actually tested
This isn't a theoretical pattern. The code is running in production managing 60+ stores.
Core helpers (src/utils/json_store.py):
write_json_atomic(...) writes to a temp file next to the target, flush()+fsync(), then atomically swaps with os.replace(), optionally keeping a rolling .bak.
read_json_with_recovery(...) falls back to .bak on parse/read errors. If recovery uses .bak, it restores main with backup=False. Defaults are written only when both main and .bak are missing and write_default_if_missing is enabled.
Production integration:
# src/managers/settings_manager.py
SettingsManager.reload()
# - Uses read_json_with_recovery()
# - Validates payload is dict
# - Writes default only when both files missing
SettingsManager.save()
# - Updates metadata timestamp
# - Calls write_json_atomic()
# - Creates .bak on every save
# src/managers/delivery_schedule.py
load_delivery_schedule()
# - Same recovery pattern
# - Explicit branching for failure modes
Failure modes we tested locally (real commands):
(A) corrupted main + .bak exists → load .bak, restore main, .bak unchanged
python -c "import json,tempfile,pathlib
from src.managers import settings_manager as sm
d=pathlib.Path(tempfile.mkdtemp())
main=d/'store_settings.json'; bak=d/'store_settings.json.bak'
bak.write_text(json.dumps({'stores':{'X':{'delivery_day':'tuesday'}},'_metadata':{'updated_at':'never'}}),encoding='utf-8')
bak_bytes=bak.read_bytes(); bak_mtime=bak.stat().st_mtime
main.write_text('{broken json',encoding='utf-8')
sm.SETTINGS_FILE=str(main)
mgr=sm.SettingsManager(); mgr.reload()
assert main.read_text(encoding='utf-8')==bak.read_text(encoding='utf-8')
assert bak.read_bytes()==bak_bytes and bak.stat().st_mtime==bak_mtime
print('OK: loaded from .bak, main restored, .bak unchanged')"
(B) main missing + .bak missing → default created and persisted
python -c "import json,tempfile,pathlib
from src.managers import settings_manager as sm
d=pathlib.Path(tempfile.mkdtemp())
main=d/'store_settings.json'
sm.SETTINGS_FILE=str(main)
mgr=sm.SettingsManager(); mgr._settings=None; mgr.reload()
assert main.exists()
json.loads(main.read_text(encoding='utf-8'))
assert not (d/'store_settings.json.bak').exists()
print('OK: default written only when both main and .bak were missing')"
The pattern is straightforward: atomic writes, forensic reads, rolling backups. No hidden infrastructure.
10. Conclusion: boring reliability
What we built:
A file-based storage layer for configuration data that survives power outages, invalid JSON on disk, and operator mistakes. Running in production for 6 months managing 60+ retail stores.
Key pieces:
- Atomic writes (temp + fsync + os.replace)
- Forensic-safe recovery (rolling .bak, don't destroy evidence)
- Production integration (Flask managers, real operators)
- Real incident (power outage) + synthetic test scenarios
When to use this:
- Config-heavy apps (settings, schedules, mappings)
- Infrequent writes (operator-driven, not event streams)
- Small files (under a few hundred KB)
- Linux/POSIX environments
- Read pattern is "load whole object"
When NOT to use:
- High-frequency writes (automated churn, many per hour)
- Concurrent writers needing coordination
- Large datasets requiring partial updates
- Complex query patterns needing indexes
Our migration triggers:
We'll move to SQLite when:
- Write frequency shifts to automated high-volume
- File sizes grow beyond comfortable read-modify-write scale
- Query patterns need indexes
- Concurrent edit conflicts become real
The 300+ store expansion may hit these triggers. That's fine. The pattern served its purpose: reliable config storage without premature database complexity.
The takeaway:
Match storage to workload. JSON files with careful atomicity work for config-heavy apps with infrequent writes. Once you need multi-writer coordination, partial updates, or indexed queries, a DB becomes the simpler option.
Don't over-engineer storage you don't need yet. But don't under-engineer atomicity when crashes matter.
This pattern gave us 6 months of boring reliability. Sometimes boring is exactly what you want.
If you've shipped similar file-backed storage, I'm curious what failure modes you tested.