Backups Don't Save You. Restores Do.
We ran a MongoDB restore drill last quarter. It failed — not the restore itself, but the confidence. Nobody in the room was sure the data was actually intact. The service came back up, and we all just stared at each other.
That was the problem. So we fixed it by automating everything.
One Jenkins job now provisions infra, builds the replica set, restores from dumps, validates data integrity, and stores a full audit trail. Here's exactly how it works.
The Goal
Remove every manual, error-prone step from the DR process:
- Identical restore flow across all environments
- Automated replica set setup — no manual
rs.initiate()typos - Real validation that proves data is intact, not just assumed
- Full audit trail for post-mortems and compliance reviews
The Pipeline: 5 Stages
1. Infrastructure with Terraform
Every drill starts with clean infra. Terraform provisions EC2s, networking, and persistent volumes from scratch — same starting point every time. No leftover state. No "works on my machine" surprises.
resource "aws_instance" "mongo_node" {
count = 3
ami = var.mongo_ami
instance_type = "t3.medium"
tags = {
Name = "mongo-dr-node-${count.index}"
Role = "mongodb-replica"
}
}
2. Replica Set Creation (Python)
Instead of manually running rs.initiate() and rs.add() and hoping the timing works, a Python script handles the entire setup — ordering, retries, and confirmation.
from pymongo import MongoClient
import time
def init_replica_set(primary_host, secondary_hosts):
client = MongoClient(f"mongodb://{primary_host}:27017")
config = {
"_id": "rs0",
"members": [{"_id": i, "host": h}
for i, h in enumerate([primary_host] + secondary_hosts)]
}
client.admin.command("replSetInitiate", config)
# Wait for PRIMARY election
for _ in range(30):
status = client.admin.command("replSetGetStatus")
if any(m["stateStr"] == "PRIMARY" for m in status["members"]):
return True
time.sleep(2)
raise Exception("Replica set did not elect a PRIMARY in time")
Automating this removes timing issues and misconfiguration. Every replica set comes up the same way.
3. Backup & Restore
Backups are normalized into compressed archives. The restore unpacks a dump and applies it to the fresh nodes:
# Create dump
mongodump --host $SOURCE_HOST --db $DB_NAME \
--out /backup/dump --gzip
# Restore to DR environment
mongorestore --host $DR_HOST --db $DB_NAME \
/backup/dump/$DB_NAME --gzip --drop
4. Validation & Comparison — The Part Most Teams Skip
This is the step that actually builds confidence. The validation script:
- Checks which collections exist (flags missing collections)
- Compares document counts collection by collection
- Compares indexes between source and restored DB
- Samples
_idvalues for obvious data mismatches
def validate_restore(source_uri, dr_uri, db_name):
src = MongoClient(source_uri)[db_name]
dr = MongoClient(dr_uri)[db_name]
report = {"status": "pass", "collections": {}}
for col in src.list_collection_names():
src_count = src[col].count_documents({})
dr_count = dr[col].count_documents({})
src_idx = sorted(src[col].index_information().keys())
dr_idx = sorted(dr[col].index_information().keys())
match = (src_count == dr_count) and (src_idx == dr_idx)
report["collections"][col] = {
"count_match": match,
"source_count": src_count,
"dr_count": dr_count,
"index_match": src_idx == dr_idx
}
if not match:
report["status"] = "fail"
return report
Exit code 0 = counts and indexes match → Jenkins passes.
Non-zero = mismatch → Jenkins fails the build immediately.
No more guessing. No more staring at each other in the war room.
5. Jenkins Orchestration
Single Jenkins pipeline. Stages run sequentially, each one gated on the previous:
pipeline {
agent any
stages {
stage('Provision Infra') {
steps {
sh 'terraform init && terraform apply -auto-approve'
}
}
stage('Setup Replica Set') {
steps {
sh 'python3 scripts/init_replica_set.py'
}
}
stage('Restore MongoDB') {
steps {
sh 'bash scripts/restore.sh'
}
}
stage('Validate Restore') {
steps {
sh 'python3 scripts/validate_restore.py'
}
}
stage('Archive Logs') {
steps {
archiveArtifacts artifacts: 'reports/*.json, logs/*.log'
}
}
}
}
Every run is logged, every report is archived. When auditors ask if restores work — you show them a report with timestamps, counts, and index diffs. Not a gut feeling.
Lessons Learned
Automate infra, not just the restore. Terraform gives you a clean slate every drill. Manual infra setup introduces variability that hides real problems.
Validation is not optional. A restore that "seems fine" is not the same as a restore that is fine. Document count mismatches and missing indexes are easy to catch automatically and impossible to catch by eyeballing logs.
Logs equal trust. The audit trail is what makes your DR process credible to others — engineers, management, auditors. Without it, you're asking people to take your word for it.
Minimal input reduces errors. We trimmed required inputs to just host + DB name and let scripts infer the rest. Less to type = fewer mistakes under pressure.
Practice makes permanent. Each drill found a small improvement. After ten drills, the process was genuinely fast and boring — which is exactly what you want.
The Outcome
We went from a 3-hour manual war room exercise to a single Jenkins job anyone can trigger. The drills are now predictable, repeatable, and quick.
More importantly — everyone on the team believes the restores work, because the validation script proves it every single time.
Boring DR is good DR.
Running MongoDB in production? When did you last drill a full restore? Drop your setup in the comments — curious how teams handle validation.
Top comments (0)