Mrinal Narang

Posted on Jun 12

MongoDB DR Drill Automation with Terraform, Python & Jenkins — How We Made Restores Boring

#mongodb #devops #terraform #jenkins

Backups Don't Save You. Restores Do.

We ran a MongoDB restore drill last quarter. It failed — not the restore itself, but the confidence. Nobody in the room was sure the data was actually intact. The service came back up, and we all just stared at each other.

That was the problem. So we fixed it by automating everything.

One Jenkins job now provisions infra, builds the replica set, restores from dumps, validates data integrity, and stores a full audit trail. Here's exactly how it works.

The Goal

Remove every manual, error-prone step from the DR process:

Identical restore flow across all environments
Automated replica set setup — no manual rs.initiate() typos
Real validation that proves data is intact, not just assumed
Full audit trail for post-mortems and compliance reviews

The Pipeline: 5 Stages

1. Infrastructure with Terraform

Every drill starts with clean infra. Terraform provisions EC2s, networking, and persistent volumes from scratch — same starting point every time. No leftover state. No "works on my machine" surprises.

resource "aws_instance" "mongo_node" {
  count         = 3
  ami           = var.mongo_ami
  instance_type = "t3.medium"
  tags = {
    Name = "mongo-dr-node-${count.index}"
    Role = "mongodb-replica"
  }
}

2. Replica Set Creation (Python)

Instead of manually running rs.initiate() and rs.add() and hoping the timing works, a Python script handles the entire setup — ordering, retries, and confirmation.

from pymongo import MongoClient
import time

def init_replica_set(primary_host, secondary_hosts):
    client = MongoClient(f"mongodb://{primary_host}:27017")
    config = {
        "_id": "rs0",
        "members": [{"_id": i, "host": h}
                    for i, h in enumerate([primary_host] + secondary_hosts)]
    }
    client.admin.command("replSetInitiate", config)
    # Wait for PRIMARY election
    for _ in range(30):
        status = client.admin.command("replSetGetStatus")
        if any(m["stateStr"] == "PRIMARY" for m in status["members"]):
            return True
        time.sleep(2)
    raise Exception("Replica set did not elect a PRIMARY in time")

Automating this removes timing issues and misconfiguration. Every replica set comes up the same way.

3. Backup & Restore

Backups are normalized into compressed archives. The restore unpacks a dump and applies it to the fresh nodes:

# Create dump
mongodump --host $SOURCE_HOST --db $DB_NAME \
  --out /backup/dump --gzip

# Restore to DR environment
mongorestore --host $DR_HOST --db $DB_NAME \
  /backup/dump/$DB_NAME --gzip --drop

4. Validation & Comparison — The Part Most Teams Skip

This is the step that actually builds confidence. The validation script:

Checks which collections exist (flags missing collections)
Compares document counts collection by collection
Compares indexes between source and restored DB
Samples _id values for obvious data mismatches

def validate_restore(source_uri, dr_uri, db_name):
    src = MongoClient(source_uri)[db_name]
    dr  = MongoClient(dr_uri)[db_name]

    report = {"status": "pass", "collections": {}}

    for col in src.list_collection_names():
        src_count = src[col].count_documents({})
        dr_count  = dr[col].count_documents({})
        src_idx   = sorted(src[col].index_information().keys())
        dr_idx    = sorted(dr[col].index_information().keys())

        match = (src_count == dr_count) and (src_idx == dr_idx)
        report["collections"][col] = {
            "count_match":  match,
            "source_count": src_count,
            "dr_count":     dr_count,
            "index_match":  src_idx == dr_idx
        }
        if not match:
            report["status"] = "fail"

    return report

Exit code 0 = counts and indexes match → Jenkins passes.
Non-zero = mismatch → Jenkins fails the build immediately.

No more guessing. No more staring at each other in the war room.

5. Jenkins Orchestration

Single Jenkins pipeline. Stages run sequentially, each one gated on the previous:

pipeline {
  agent any
  stages {
    stage('Provision Infra') {
      steps {
        sh 'terraform init && terraform apply -auto-approve'
      }
    }
    stage('Setup Replica Set') {
      steps {
        sh 'python3 scripts/init_replica_set.py'
      }
    }
    stage('Restore MongoDB') {
      steps {
        sh 'bash scripts/restore.sh'
      }
    }
    stage('Validate Restore') {
      steps {
        sh 'python3 scripts/validate_restore.py'
      }
    }
    stage('Archive Logs') {
      steps {
        archiveArtifacts artifacts: 'reports/*.json, logs/*.log'
      }
    }
  }
}

Every run is logged, every report is archived. When auditors ask if restores work — you show them a report with timestamps, counts, and index diffs. Not a gut feeling.

Lessons Learned

Automate infra, not just the restore. Terraform gives you a clean slate every drill. Manual infra setup introduces variability that hides real problems.

Validation is not optional. A restore that "seems fine" is not the same as a restore that is fine. Document count mismatches and missing indexes are easy to catch automatically and impossible to catch by eyeballing logs.

Logs equal trust. The audit trail is what makes your DR process credible to others — engineers, management, auditors. Without it, you're asking people to take your word for it.

Minimal input reduces errors. We trimmed required inputs to just host + DB name and let scripts infer the rest. Less to type = fewer mistakes under pressure.

Practice makes permanent. Each drill found a small improvement. After ten drills, the process was genuinely fast and boring — which is exactly what you want.

The Outcome

We went from a 3-hour manual war room exercise to a single Jenkins job anyone can trigger. The drills are now predictable, repeatable, and quick.

More importantly — everyone on the team believes the restores work, because the validation script proves it every single time.

Boring DR is good DR.

Running MongoDB in production? When did you last drill a full restore? Drop your setup in the comments — curious how teams handle validation.

DEV Community