NTCTech

Posted on Mar 17 • Originally published at rack2cloud.com

Database Backup Fidelity: Why Crash-Consistent Is Not a Database Backup

#database #devops #infrastructure #dataprotection

App-consistent database backup is the difference between a recoverable database and a recovery event that fails under pressure. Backup policies are designed by architects. They are discovered by engineers during recovery.

Most enterprise environments have backup schedules running, retention policies configured, and dashboards showing green. What most have never validated is the consistency level those backups are actually capturing.

That question gets answered — usually under pressure — when a DBA attempts to restore a production database and discovers the backup represents a storage snapshot taken mid-transaction.

The Two Models

At the storage layer, every backup is a point-in-time copy. The difference between crash-consistent and app-consistent database backup is what state the database engine is in when that copy is taken.

Crash-consistent captures whatever was on disk at the moment the snapshot fired — no coordination with the database engine, no quiesce. Open transactions are mid-flight. Write-ahead logs may not be flushed. Buffer pools may have data that hasn't reached disk. The result looks complete from a storage perspective and is incomplete from a database perspective.

A crash-consistent backup requires the database engine's recovery mechanisms — WAL replay, transaction log rollback, redo/undo — to work correctly on restore. If they do, the restore succeeds. If they don't — corrupted log sequence, missing log files, engine version mismatch — the restore fails.

App-consistent coordinates with the database engine before the snapshot fires. Buffer pool flushed. Writes quiesced. In-flight transactions completed, rolled back, or recorded so the engine can resolve them deterministically. The snapshot is taken against a database in a known-good state, so the restore does not depend on crash recovery happening to succeed. The database mounts cleanly.

The practical difference: crash-consistent shifts recovery risk from backup time to restore time. App-consistent resolves that risk at backup time.
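
To make the restore-time dependency concrete, here is a deliberately tiny model of log replay. It is a sketch, not how any real engine implements recovery: `LogRecord` and `crash_recover` are illustrative names, and the model assumes uncommitted changes never reach the data pages, so only redo is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    txn: int          # transaction id
    kind: str         # "write" or "commit"
    key: str = ""
    value: int = 0


def crash_recover(pages: dict, wal: Optional[list]) -> dict:
    """Redo committed writes from the captured log onto the captured pages.

    Toy model: `pages` is what had reached disk when the snapshot fired,
    `wal` is the log captured with it (None if the log was not in the snapshot).
    """
    if wal is None:
        raise RuntimeError("log not present in snapshot; recovery cannot run")
    committed = {rec.txn for rec in wal if rec.kind == "commit"}
    recovered = dict(pages)
    for rec in wal:
        # Replay only writes whose transaction committed; open ones are dropped.
        if rec.kind == "write" and rec.txn in committed:
            recovered[rec.key] = rec.value
    return recovered


# Snapshot fired mid-transaction: txn 1 committed, txn 2 was still open.
pages = {"a": 100, "b": 50}
wal = [
    LogRecord(1, "write", "a", 70),
    LogRecord(1, "write", "b", 80),
    LogRecord(1, "commit"),
    LogRecord(2, "write", "a", 0),
]
print(crash_recover(pages, wal))  # {'a': 70, 'b': 80}
```

With the log present, the restore lands on the last committed state. Pass `None` for the log and the same snapshot cannot be recovered at all, which is the risk the crash-consistent model defers to restore time.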

Why Environments Default to Crash-Consistent

Most environments don't choose crash-consistent. They inherit it.

VM snapshot tooling prioritizes speed — hypervisor snapshots capture the entire VM without knowledge of what's running inside. For databases, this produces crash-consistent backups unless VSS (Windows) or pre/post freeze scripts (Linux) are explicitly configured.
Backup vendors optimize for coverage — a single policy covering 500 VMs is a compelling story. Applied uniformly to database and application VMs without differentiation, it produces crash-consistent backups for the databases.
App-consistent database backup requires integration work — agent installation, credential configuration, quiesce validation. Under time pressure, the integration work gets deferred. The deferred work becomes the default.
Operators assume transaction logs will cover the gap — the assumption transfers risk without eliminating it. It fails when the log chain is broken, when logs are on a separate volume not in the snapshot, or when the recovery environment runs a different engine version.

The Green Dashboard Problem

Dashboard Item	What It Shows	What It Cannot Show
Backups	✓ Running	? Consistency level
Schedule	✓ Configured	? Quiescing triggered
Retention	✓ Set	? Transaction logs included
Last Job	✓ Successful	? Agent active and connected
Failures	✓ None (60 days)	? Restore ever tested

The dashboard measures job completion. It does not measure recoverability.

How to Validate Backup Fidelity

Five questions that determine whether a database backup is actually recoverable — not whether the job completed.

01 — Does the backup trigger database quiescing?
Check the actual job settings, not the policy name. Is application-aware processing enabled? Is VSS invoked on Windows? Are pre/post freeze scripts configured on Linux? If none of these are present, the backup is crash-consistent regardless of documentation.
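
One way to run this check across many jobs is to classify each job from its exported settings rather than its name. The sketch below assumes the settings are available as a dictionary; every key name (`application_aware`, `vss_enabled`, `pre_freeze_script`, `post_thaw_script`, `guest_os`) is hypothetical and would need mapping to whatever your backup platform actually exports.

```python
def quiesce_mechanisms(job: dict) -> list:
    """List the quiesce mechanisms a job has configured (hypothetical keys)."""
    found = []
    if job.get("application_aware"):
        found.append("application-aware processing")
    if job.get("guest_os") == "windows" and job.get("vss_enabled"):
        found.append("VSS")
    if (
        job.get("guest_os") == "linux"
        and job.get("pre_freeze_script")
        and job.get("post_thaw_script")
    ):
        # Both halves are required: a freeze without a thaw is not a valid setup.
        found.append("pre/post freeze scripts")
    return found


def classify(job: dict) -> str:
    """No mechanism configured means crash-consistent, whatever the job is called."""
    return "quiesce configured" if quiesce_mechanisms(job) else "crash-consistent"


job = {
    "name": "SQL-AppConsistent-Gold",   # the name promises more than the settings
    "guest_os": "windows",
    "application_aware": False,
    "vss_enabled": False,
}
print(classify(job))  # crash-consistent
```

The result is deliberately "quiesce configured" rather than "app-consistent": configuration only shows intent, and whether quiescing actually ran is a separate question.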

02 — Is a database agent installed and active?
Verify the agent is installed, running, and connecting successfully — not just that the policy assumes it exists. A stale agent or expired credentials produces a silent fallback to crash-consistent without alerting the operator.

03 — Are transaction logs included in the backup?
For point-in-time recovery, logs must be captured continuously, not only during the full backup window. A nightly full with no log backups means the most recent recovery point is the previous night's backup, regardless of what the RPO documentation says.

04 — Is application-aware backup confirmed in the job log — not just configured in the policy?
Application-aware processing can be enabled in the policy and silently fail during execution — falling back to crash-consistent — if the agent is unreachable or the VSS writer returns an error. The job still completes. The job still shows green.
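
A job-log check can be as simple as requiring positive evidence that quiescing ran, instead of trusting the final job status. The sketch below is only a pattern: the marker strings are invented for illustration and do not come from any vendor's log format, so replace them with the messages your platform actually writes.

```python
# Hypothetical log phrases; substitute the real messages from your platform.
FALLBACK_MARKERS = (
    "falling back to crash-consistent",
    "vss writer error",
    "agent unreachable",
)
SUCCESS_MARKER = "application-aware processing completed"


def app_aware_confirmed(job_log: str) -> bool:
    """True only if the log shows quiescing succeeded and no fallback occurred."""
    text = job_log.lower()
    if any(marker in text for marker in FALLBACK_MARKERS):
        return False
    # A green job status alone is not evidence; require the success marker.
    return SUCCESS_MARKER in text


green_but_degraded = """
10:00:01 Starting job
10:00:05 Agent unreachable, falling back to crash-consistent snapshot
10:04:12 Job finished: Success
"""
print(app_aware_confirmed(green_but_degraded))  # False
```

The design choice is to fail closed: a log that says nothing about application-aware processing is treated the same as one that reports a fallback.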

05 — Have restores been tested at the database layer?
VM restore testing — booting a restored VM — does not validate database recoverability. A database restore test requires mounting the database, running integrity checks, and confirming it serves queries from a known-good transaction state.
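
The shape of a database-layer restore test can be sketched with SQLite, used here purely as a stand-in because it ships with Python. A production test would use the engine's own integrity tooling instead; `verify_restored_database` and the probe query are illustrative names, not part of any backup product.

```python
import sqlite3
from pathlib import Path


def verify_restored_database(path: str, probe_sql: str) -> list:
    """Open a restored SQLite file read-only, check integrity, run a probe query."""
    # mode=ro makes a missing file an error instead of silently creating one.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        result = conn.execute("PRAGMA integrity_check").fetchall()
        if result != [("ok",)]:
            raise RuntimeError(f"integrity check failed: {result}")
        # Booting the VM proves nothing; the database has to answer a real query.
        return conn.execute(probe_sql).fetchall()
    finally:
        conn.close()
```

A call such as `verify_restored_database("/restore/orders.db", "SELECT COUNT(*) FROM orders")` (hypothetical path and table) covers all three steps from the paragraph above: mount, integrity check, and a query whose answer can be compared against a known value.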

Engine Behavior: What Crash-Consistent Means Per Database

Engine	Crash-Consistent Behavior	Recovery Dependency	Risk
SQL Server	Data files captured mid-transaction. Crash recovery runs on attach — rolls back uncommitted, replays committed from log.	Transaction log must be intact. If log files are on a separate volume not in snapshot, recovery fails.	Medium (logs included) / High (logs separate)
PostgreSQL	Heap files and WAL in potentially inconsistent state. WAL replay runs on startup.	WAL files must be intact and complete from snapshot point. Missing segments = unrecoverable.	High
MySQL / MariaDB	InnoDB buffer pool not flushed. Dirty pages captured. InnoDB crash recovery runs on startup.	InnoDB redo log must be present. MyISAM tables will be inconsistent and require manual repair.	Medium (InnoDB-only) / High (mixed engine)
Oracle	Datafiles captured without RMAN coordination. Instance recovery runs on startup using redo logs.	Online redo logs and control files must be in the same snapshot as the datafiles. A snapshot taken outside RMAN is not recorded in the RMAN repository.	High
MongoDB	WiredTiger journal not synced. Journal replay runs on startup.	Journal files must be intact. Replica resync may be required if replay fails.	Medium

Recovery Risk by Scenario

Scenario	Crash-Consistent	App-Consistent
Full VM restore, database attach	Crash recovery runs. May succeed or fail depending on log integrity. Unpredictable.	Mounts cleanly. Predictable restore time.
Point-in-time recovery required	Requires unbroken log chain from snapshot to target. Any gap makes PITR impossible.	Clean base + log chain. Reliable if log backups are configured.
Log files on separate volume, not in snapshot	❌ Recovery fails. Database unattachable.	Not applicable — app-consistent includes all required files.
Ransomware recovery	Recovery state uncertain. Integrity validation extends window.	Known-good state. Deterministic recovery.
App-aware processing silently failed at backup	❌ Operator discovers crash-consistent backup during recovery. No warning was issued.	Avoided only when the job is set to fail or warn on agent errors instead of falling back, so the failure is visible rather than silent.
Recovery to different engine version	❌ Crash recovery behavior varies between versions. May fail on target.	Standard restore procedures apply.
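
The "unbroken log chain" condition in the point-in-time row can be checked mechanically. The sketch below is loosely modelled on the first/last log sequence numbers that engines such as SQL Server record for each log backup; plain integers and the names `LogBackup` and `first_chain_break` are simplifications for illustration.

```python
from typing import NamedTuple, Optional


class LogBackup(NamedTuple):
    first_lsn: int   # first log sequence number covered by this backup
    last_lsn: int    # last log sequence number covered by this backup


def first_chain_break(base_last_lsn: int, log_backups: list) -> Optional[int]:
    """Return the position (in LSN order) of the first backup that leaves a gap.

    Returns None when every backup picks up at or before the point the
    previous one ended, i.e. the chain is unbroken from the base backup.
    """
    covered_up_to = base_last_lsn
    ordered = sorted(log_backups, key=lambda b: b.first_lsn)
    for position, backup in enumerate(ordered):
        if backup.first_lsn > covered_up_to:
            return position  # log records between the two are missing
        covered_up_to = max(covered_up_to, backup.last_lsn)
    return None


chain = [LogBackup(100, 150), LogBackup(150, 200), LogBackup(220, 260)]
print(first_chain_break(100, chain))  # 2
```

Here the third backup starts at 220 while coverage ends at 200, so recovery can roll forward no further than LSN 200 and any target time after that point is out of reach.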

What App-Consistent Actually Requires

Windows — VSS
VSS signals all registered writers, including SQL Server's VSS writer (SqlServerWriter), to flush buffers and freeze writes before the snapshot. Verify the writer is registered and stable by running vssadmin list writers and checking that it reports a Stable state with no error.
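
That writer check is easy to automate by parsing the command's text output. The sketch below assumes English-language output in the usual "Writer name / State / Last error" layout and takes the text as an argument so it can be tested without a Windows host; `writer_states` and `sql_writer_healthy` are illustrative names.

```python
def writer_states(vssadmin_output: str) -> dict:
    """Map each writer name to its state and last error from the command output."""
    writers = {}
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        if line.startswith("Writer name:"):
            current = line.split(":", 1)[1].strip().strip("'")
            writers[current] = {"state": None, "last_error": None}
        elif current and line.startswith("State:"):
            # "State: [1] Stable" -> "Stable"
            writers[current]["state"] = line.split("]", 1)[-1].strip()
        elif current and line.startswith("Last error:"):
            writers[current]["last_error"] = line.split(":", 1)[1].strip()
    return writers


def sql_writer_healthy(vssadmin_output: str, writer: str = "SqlServerWriter") -> bool:
    """True only if the writer is registered, Stable, and reports no error."""
    info = writer_states(vssadmin_output).get(writer)
    return bool(info) and info["state"] == "Stable" and info["last_error"] == "No error"


sample = """
Writer name: 'SqlServerWriter'
   Writer Id: {a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}
   State: [1] Stable
   Last error: No error
"""
print(sql_writer_healthy(sample))  # True
```

A writer that is missing from the output is treated as unhealthy, since an unregistered SQL Server writer means the snapshot proceeds without database coordination.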

Linux — Pre/Post Freeze Scripts
Linux has no native VSS equivalent. App-consistent backups on Linux require pre-freeze scripts that quiesce the database before the snapshot and post-thaw scripts that resume it after. These must be configured explicitly in VMware Tools, Nutanix Guest Agent, or the backup agent.
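
Whatever the freeze and thaw steps are for a given engine, the property that matters is that the thaw always runs, even when the snapshot fails. A minimal sketch of that pattern follows; `quiesced` is an illustrative name, and the callables passed in are placeholders for your actual quiesce and resume commands.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(pre_freeze, post_thaw):
    """Run pre_freeze, hand control to the snapshot step, always run post_thaw."""
    pre_freeze()  # if this raises, the snapshot is never attempted
    try:
        yield
    finally:
        # Runs on success and on failure, so writes are never left frozen.
        post_thaw()


events = []
try:
    with quiesced(lambda: events.append("freeze"), lambda: events.append("thaw")):
        events.append("snapshot")
        raise RuntimeError("snapshot failed")
except RuntimeError:
    pass
print(events)  # ['freeze', 'snapshot', 'thaw']
```

The same ordering applies when the hooks are separate script files invoked by a guest agent: a post-thaw script that only runs on the success path leaves the database frozen after a failed snapshot.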

Database Agents
Enterprise backup platforms provide database-specific agents that handle quiescing natively. These require installation inside the VM, sufficient database permissions, and periodic validation that they are connecting correctly.

Transaction Log Backup Frequency
Log backup frequency determines maximum data loss. A 15-minute log backup interval means 15 minutes maximum data loss regardless of full backup frequency. Configure against the RPO requirement, not the backup window.
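
The arithmetic is worth running against real backup timestamps rather than the configured schedule. A minimal sketch, assuming each timestamp marks a usable recovery point (a full or a log backup); the function names are illustrative.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(recovery_points: list) -> timedelta:
    """Largest gap between consecutive recovery points."""
    ordered = sorted(recovery_points)
    if len(ordered) < 2:
        raise ValueError("need at least two recovery points to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(recovery_points: list, rpo: timedelta) -> bool:
    """True if a failure at the worst possible moment still stays within the RPO."""
    return worst_case_data_loss(recovery_points) <= rpo


start = datetime(2024, 1, 1, 1, 0)
nightly_only = [start, start + timedelta(days=1)]
with_logs = [start + timedelta(minutes=15 * i) for i in range(97)]  # 24 hours

print(worst_case_data_loss(nightly_only))  # 1 day, 0:00:00
print(worst_case_data_loss(with_logs))     # 0:15:00
print(meets_rpo(nightly_only, timedelta(minutes=15)))  # False
```

Measuring the largest gap rather than the average is deliberate: one skipped log backup widens the exposure for that window, and the schedule alone will not show it.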

Restore Testing
Run restore tests on a defined schedule, quarterly at minimum for production databases. Restore, attach, run integrity checks, and confirm the database serves queries from a known transaction state.

Architect's Verdict

Backup policies are designed by architects. They are discovered by engineers during recovery.

Crash-consistent backups are not wrong — they are appropriate for stateless workloads and as a fallback when app-consistent integration is not feasible. They are not appropriate as the default strategy for production databases where recovery time, recovery point, and data integrity are defined requirements.

The shift to app-consistent database backup is not a technology problem. Every enterprise backup platform supports it. It is an integration and validation problem — one that requires deliberate configuration, agent deployment, and restore testing to confirm that what the dashboard shows as protected is actually recoverable.

The five questions in the checklist above exist because the dashboard cannot answer them. Ask them before a recovery event, not during one.
