Domain migrations fail at the seams between steps. The DNS records move fine, then someone flips the MX records before SPF propagates to the new zone. Or the registrar transfer fires while the old nameservers are still authoritative. The blast radius could include a DMARC failure on transactional email, a broken Let's Encrypt DNS-01 challenge, or a two-day authentication outage for OAuth redirects tied to a TXT verification record that got dropped mid-migration.
The failure mode is consistent across these cases: DNS, email, and registrar changes executed as a single coupled batch, with no phase gates and no scripted reversion path. Engineers run through a checklist, something fails mid-flight, and the rollback gets improvised under pressure.
This guide structures the migration as three sequential phases, each with a health check that must pass before the next begins. Every rollback artifact gets produced before the change it protects. The code examples run against the name.com API, which covers all three phases through a single OpenAPI-spec'd surface: DNS CRUD via /core/v1/domains/{domainName}/records, email forwarding via /core/v1/domains/{domainName}/email/forwarding, and transfer operations including unlock, auth code retrieval, and domain-transfer-status-change webhooks.
Before touching anything live, map your blast radius per record type so health checks cover the right surface area:
| Layer | Records | What breaks if dropped |
|---|---|---|
| Traffic | A, AAAA, CNAME | HTTP/HTTPS endpoints, CDN origins |
| TLS | CAA, DNS-01 TXT | Certificate issuance and renewal |
| Email delivery | MX, TXT (SPF, DKIM, DMARC) | Inbound routing, deliverability, DMARC enforcement |
| Authentication | TXT (Google, AWS, etc.) | OAuth, service integrations, domain verification |
Run health check coverage against each row before declaring any phase complete.
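The table above can be turned into a mechanical check. A minimal sketch, assuming the zone export is shaped like name.com's `{"records": [...]}` response; the `BLAST_RADIUS` mapping and the function name are illustrative, not part of any API:

```python
# Assumption: zone export shaped like name.com's {"records": [...]} response.
BLAST_RADIUS = {
    "Traffic": {"A", "AAAA", "CNAME"},
    "TLS": {"CAA"},
    "Email delivery": {"MX"},
    # TXT rows (SPF/DKIM/DMARC, verification) need content inspection,
    # not just type presence, so check them separately.
}

def coverage_gaps(zone):
    """Return the layers with no surviving records of any expected type."""
    present = {r["type"] for r in zone.get("records", [])}
    return {layer for layer, types in BLAST_RADIUS.items() if not (types & present)}
```

Run it against both the snapshot and the live zone; any non-empty result means a health check is missing coverage for that layer.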
## Phase 1: Scripted DNS Snapshot, TTL Reduction, and Zone Replication
The snapshot file is the precondition for every rollback decision in this guide. Take it before any other step.
Note: All examples use the name.com sandbox environment (`api.dev.name.com`). Once you're ready for production, change your API URL to `https://api.name.com`.
```python
import requests
import json
from datetime import datetime

API_USER = "your_username"
API_TOKEN = "your_api_token"
DOMAIN = "yourdomain.com"

def snapshot_zone(domain):
    url = f"https://api.dev.name.com/core/v1/domains/{domain}/records"
    resp = requests.get(url, auth=(API_USER, API_TOKEN))
    resp.raise_for_status()
    records = resp.json()
    filename = f"snapshot_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, "w") as f:
        json.dump(records, f, indent=2)
    print(f"Snapshot saved: {filename}")
    return records, filename

records, snapshot_file = snapshot_zone(DOMAIN)
```
Validate the snapshot before proceeding. A missing CAA record or absent DMARC TXT in the export means you're working from an incomplete source of truth.
```python
REQUIRED_TYPES = {"A", "MX", "TXT", "CAA", "CNAME"}

def validate_snapshot(records):
    found_types = {r["type"] for r in records.get("records", [])}
    missing = REQUIRED_TYPES - found_types
    if missing:
        raise ValueError(f"Snapshot missing record types: {missing}")
    print("Snapshot validated.")

validate_snapshot(records)
```
With the snapshot confirmed, reduce TTL values well in advance so resolvers flush their caches before the actual cutover. Lower TTL to 300 seconds at least 24 hours before the migration window. Any resolver that previously cached a longer TTL needs time to expire its entry and pick up the shorter value before you cut over.
Note on minimum TTL limits: 300 seconds is the practical floor enforced by most registrar-managed DNS providers, including name.com. Attempting to set a lower value (such as 60s) may be silently ignored or rejected. Some dedicated DNS providers (e.g. Cloudflare non-Enterprise) support TTLs as low as 60 seconds, but this is not universal. Check your provider's documentation before assuming sub-300s values are accepted.
```python
import time

def reduce_ttl(domain, records, target_ttl, auth):
    for record in records.get("records", []):
        record_id = record["id"]
        url = f"https://api.dev.name.com/core/v1/domains/{domain}/records/{record_id}"
        payload = {
            "host": record.get("host", ""),
            "type": record["type"],
            "answer": record["answer"],
            "ttl": target_ttl,
        }
        # Preserve MX/SRV priority when the record carries one; a PUT
        # without it can drop the value on update.
        if "priority" in record:
            payload["priority"] = record["priority"]
        resp = requests.put(url, json=payload, auth=auth)
        resp.raise_for_status()
    print(f"TTL reduced to {target_ttl}s across {len(records.get('records', []))} records")

# Reduce to 300s (the minimum for most providers, including name.com)
reduce_ttl(DOMAIN, records, 300, (API_USER, API_TOKEN))

print("Waiting 24h for TTL propagation...")
time.sleep(86400)
# You are now ready to cut over; changes will propagate within ~5 minutes
```
Waiting the full 24 hours at 300s is necessary because resolvers that previously cached your records with a longer TTL (e.g. 86400s) won't re-check until that original TTL expires. Only after that window can you be confident all resolvers are working with the 300s value and will pick up your cutover quickly.
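That waiting rule can be computed from the snapshot rather than eyeballed. A sketch, assuming the snapshot was taken before the TTL reduction; `cutover_ready_at` is a hypothetical helper, not part of the name.com API:

```python
from datetime import datetime, timedelta

def cutover_ready_at(snapshot, reduced_at):
    """Earliest safe cutover time: the TTL-reduction timestamp plus the
    longest TTL any record carried before the reduction, since a resolver
    may have cached a record moments before the change."""
    max_ttl = max((r.get("ttl", 0) for r in snapshot.get("records", [])), default=0)
    return reduced_at + timedelta(seconds=max_ttl)
```

With an 86400s record in the snapshot, a reduction at 09:00 Monday means no cutover before 09:00 Tuesday.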
## Phase 2: Email Continuity, Forwarding Rules, and MX Gating
Forwarding rules go up before MX records change. Any mail hitting the old domain during the transition window routes through to the destination address while the new MX records propagate. Set up forwarding using the name.com API's create-email-forwarding endpoint:
```bash
curl -u "username:apitoken" --request POST \
  --url https://api.dev.name.com/core/v1/domains/yourdomain.com/email/forwarding \
  --header 'Content-Type: application/json' \
  --data '{
    "emailBox": "admin",
    "emailTo": "webmaster@example.com"
  }'
```
The /email/forwarding endpoint accepts one rule per request, so loop over your mailbox list when automating this across a team.
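Automating that loop might look like the following sketch. `forwarding_payloads` and `set_forwarding_rules` are hypothetical helper names; the endpoint path matches the curl example above, and the injectable `post` argument exists only to make the loop testable offline:

```python
API_BASE = "https://api.dev.name.com/core/v1"  # sandbox; api.name.com in production

def forwarding_payloads(rules):
    """One POST body per rule, since the endpoint accepts a single rule per request."""
    return [{"emailBox": box, "emailTo": dest} for box, dest in sorted(rules.items())]

def set_forwarding_rules(domain, rules, auth, post=None):
    """POST each rule in turn; `post` is injectable for testing."""
    if post is None:
        import requests
        post = requests.post
    for payload in forwarding_payloads(rules):
        resp = post(f"{API_BASE}/domains/{domain}/email/forwarding",
                    json=payload, auth=auth)
        resp.raise_for_status()
```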
Before touching MX records, get SPF, DKIM, and DMARC TXT records in place on the destination zone and confirmed as propagated. Google and Yahoo mandated DMARC authentication for bulk senders in February 2024, and the enforcement is active. A migration that drops the DMARC TXT record before the MX cutover completes will route outbound mail to spam or trigger hard rejections, showing up as a deliverability incident in your postmaster dashboard hours after the fact.
Update the auth records using the same update-record endpoint from Phase 1, then run a propagation check before allowing the MX flip:
```bash
#!/bin/bash
DOMAIN="yourdomain.com"
MAX_ATTEMPTS=30
INTERVAL=15

for i in $(seq 1 $MAX_ATTEMPTS); do
  RESULT=$(dig TXT _dmarc.$DOMAIN +short | tr -d '"')
  echo "[$(date +%H:%M:%S)] Attempt $i: $RESULT"
  if [[ "$RESULT" == *"v=DMARC1"* ]]; then
    echo "DMARC record confirmed. Proceeding to MX flip."
    exit 0
  fi
  sleep $INTERVAL
done

echo "DMARC propagation timed out after $((MAX_ATTEMPTS * INTERVAL))s. Halting migration."
exit 1
```
An exit code of 1 means you halt everything. The same polling pattern applies to SPF and DKIM: run a `dig TXT` check for each record before advancing. The MX flip goes through the update-record endpoint: patch the existing MX record IDs using `PUT /core/v1/domains/{domainName}/records/{id}`, and leave the forwarding rules active for at least 48 hours post-cutover to catch any in-flight messages.
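The same gate can be expressed once in Python and reused for DMARC, SPF, and DKIM. A sketch: `dig_txt` shells out to `dig` (assumed to be on PATH), the `lookup` parameter is injectable so the polling logic can run without live DNS, and the DKIM selector name in the usage comment is an assumption:

```python
import subprocess
import time

def dig_txt(name):
    """Fetch TXT via dig (assumed on PATH), stripped of surrounding quotes."""
    out = subprocess.run(["dig", "TXT", name, "+short"],
                         capture_output=True, text=True).stdout
    return out.replace('"', "")

def poll_record(name, marker, lookup=dig_txt, max_attempts=30, interval=15):
    """Same gate as the bash script: succeed once the record contains the
    marker, fail the phase after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        if marker in lookup(name):
            return True
        time.sleep(interval)
    return False

# Gate all three auth records before allowing the MX flip:
# for host, marker in [("_dmarc.yourdomain.com", "v=DMARC1"),
#                      ("yourdomain.com", "v=spf1"),
#                      ("selector1._domainkey.yourdomain.com", "v=DKIM1")]:
#     assert poll_record(host, marker), f"{host} not propagated; halting"
```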
## Phase 3: Programmatic Registrar Transfer, Webhooks, and Lock Sequencing
The registrar transfer carries the longest irreversibility window, which makes sequencing critical. The correct order:
- Unlock the domain
- Retrieve the auth code
- Initiate the transfer at the receiving registrar
- Subscribe to the `domain-transfer-status-change` webhook
```python
import requests

def unlock_domain(domain, auth):
    url = f"https://api.dev.name.com/core/v1/domains/{domain}"
    resp = requests.patch(url, json={"locked": False}, auth=auth)
    resp.raise_for_status()
    print(f"Domain {domain} unlocked.")

def get_auth_code(domain, auth):
    url = f"https://api.dev.name.com/core/v1/domains/{domain}/authCode"
    resp = requests.get(url, auth=auth)
    resp.raise_for_status()
    return resp.json().get("authCode")

unlock_domain(DOMAIN, (API_USER, API_TOKEN))
auth_code = get_auth_code(DOMAIN, (API_USER, API_TOKEN))
# Treat the auth code as a secret: anyone holding it can move the domain.
# Avoid echoing it in shared logs or CI output.
print("Auth code retrieved.")
```
Once you've initiated the transfer at the receiving registrar with the auth code, subscribe to the webhook so your system receives async status updates:
```python
def subscribe_transfer_webhook(domain, callback_url, auth):
    url = "https://api.dev.name.com/core/v1/notifications"
    payload = {
        "event": "domain-transfer-status-change",
        "domainName": domain,
        "url": callback_url
    }
    resp = requests.post(url, json=payload, auth=auth)
    resp.raise_for_status()
    print(f"Webhook subscribed: {resp.json()}")

subscribe_transfer_webhook(
    DOMAIN,
    "https://your-server.com/webhooks/transfer",
    (API_USER, API_TOKEN)
)
```
The webhook delivers a payload with a status field. Wire up a Flask endpoint to handle it:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/transfer", methods=["POST"])
def transfer_webhook():
    payload = request.get_json()
    status = payload.get("status")
    domain = payload.get("domainName")
    if status in ("failed", "cancelled"):
        print(f"Transfer {status} for {domain}. Triggering rollback.")
        relock_domain(domain)
    elif status == "completed":
        print(f"Transfer completed for {domain}. Running post-transfer validation.")
        run_post_transfer_checks(domain)  # your own validation hook
    return jsonify({"received": True}), 200
```
If transfer initiation fails or the webhook fires a failure event, call POST /core/v1/domains/{domainName}/actions/lock immediately. An unlocked domain sitting in limbo is an active attack surface.
```python
def relock_domain(domain, auth=(API_USER, API_TOKEN)):
    url = f"https://api.dev.name.com/core/v1/domains/{domain}/actions/lock"
    resp = requests.post(url, auth=auth)
    resp.raise_for_status()
    print(f"Domain {domain} re-locked.")
```
GoDaddy's API surfaces transfer status through polling only, so a rollback trigger may not fire until minutes after a failure event on an exposed domain. Namecheap's transfer API requires polling as well, with no webhook event support. The name.com `domain-transfer-status-change` webhook fires on status change, letting the auto-relock function run within about a second of the failure event. That speed matters when the domain is unlocked.
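For registrars that expose polling only, the fallback might look like this sketch; `fetch_status` stands in for whatever status endpoint the provider offers (signatures vary by registrar), and the worst-case reaction time to a failure is the full poll interval:

```python
import time

def poll_transfer_status(fetch_status, on_failure, interval=60, timeout=604800):
    """Poll until a terminal status. Worst-case reaction to a failure is one
    full interval, versus near-immediate delivery with a webhook."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("failed", "cancelled"):
            on_failure(status)   # e.g. trigger the re-lock
            return status
        if status == "completed":
            return status
        time.sleep(interval)
    return "timeout"
```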
## Have a Clear Rollback Plan: Snapshot Design and Reversion Procedures
The rollback snapshot JSON from Phase 1 needs to contain everything required for a clean manual reversion. The minimum viable structure:
```json
{
  "captured_at": "2024-11-15T09:30:00Z",
  "domain": "yourdomain.com",
  "nameservers": ["ns1.name.com", "ns2.name.com"],
  "records": [
    {
      "id": "12345",
      "host": "",
      "type": "A",
      "answer": "203.0.113.10",
      "ttl": 300
    },
    {
      "id": "12346",
      "host": "",
      "type": "MX",
      "answer": "mail.yourdomain.com",
      "ttl": 300
    }
  ],
  "forwarding_rules": [
    {
      "emailBox": "admin",
      "emailTo": "admin@oldprovider.com"
    }
  ]
}
```
Rollback scope by layer:
- DNS-level: Re-create divergent records from the snapshot using `POST /core/v1/domains/{domainName}/records`. Compare the live zone against the snapshot by diffing `jq`-extracted record sets from both sources.
- Email-level: Delete forwarding rules added during the migration via `DELETE /core/v1/domains/{domainName}/email/forwarding/{emailBox}`, then restore original MX records from the snapshot using the record IDs stored in the JSON.
- Transfer-level: Re-lock the domain via `POST /core/v1/domains/{domainName}/actions/lock`, the same endpoint the auto-relock function calls.
Rollback triggers are human-evaluated conditions that require a deliberate go/no-go decision before execution. Automate the detection, but keep a person in the approval loop. Conditions that warrant investigation and a reversion decision:
- HTTPS endpoints returning 5xx errors for more than 10 minutes post-cutover
- `dig MX yourdomain.com` returning empty or unexpected results 30 minutes after the flip
- DMARC reports showing authentication failure rates spiking above baseline
- DNS propagation incomplete after 90% of the expected window (typically 24h at 300s TTL)
Set a hard decision deadline: 2 hours post-cutover for DNS changes, 24 hours for the MX flip. After TTLs have propagated across the global resolver population, the cost of rolling back climbs substantially. Make the call while reversion is still cheap.
The rollback runbook, executed manually and verified at each step:
- Pull the snapshot records to a file: `jq '.records' snapshot_20241115_093000.json > snapshot_records.json`
- Diff against live: `curl -s -u "user:token" "https://api.dev.name.com/core/v1/domains/yourdomain.com/records" | jq '.records' > live.json && diff snapshot_records.json live.json`
- Re-create any missing or divergent records via `POST /core/v1/domains/{domainName}/records`
- Restore MX records to original values using the record IDs from the snapshot
- Re-lock the domain if a transfer was initiated
- Run the propagation polling script to confirm resolution returns to pre-migration values
- Verify forwarding rules via `GET /core/v1/domains/{domainName}/email/forwarding/{emailBox}`
Execute each step individually, verify its output, and only proceed when the previous step's verification passes. This approach draws directly on the blue-green deployment and automated rollback principles used in zero-downtime database migrations: staged state transitions with explicit validation gates between them.
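Those validation gates can be encoded as a minimal step runner, so a failed verification physically prevents the next step from executing. A sketch; the step names and functions in the usage comment are hypothetical placeholders for the runbook steps above:

```python
def run_gated(steps):
    """Run (name, action, verify) tuples in order; a failed verification
    halts the runbook so later steps never run against a bad state."""
    for name, action, verify in steps:
        action()
        if not verify():
            raise RuntimeError(f"Verification failed at step: {name}")
        print(f"OK: {name}")

# Hypothetical wiring:
# run_gated([
#     ("pull snapshot",   load_snapshot,     snapshot_loaded),
#     ("diff live zone",  fetch_live_zone,   zones_diffed),
#     ("restore records", recreate_records,  records_match_snapshot),
# ])
```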
## Observability: Propagation Polling, Divergence Alerts, and CI/CD Integration
Poll two resolvers in parallel and log every result with a timestamp. Divergence between 8.8.8.8 and 1.1.1.1 during propagation is expected. Both converging to the correct value is your success condition.
```bash
#!/bin/bash
DOMAIN="yourdomain.com"
EXPECTED_IP="203.0.113.10"
MAX_ITERATIONS=48
INTERVAL=300  # 5 minutes
LOGFILE="propagation_$(date +%Y%m%d_%H%M%S).log"

for i in $(seq 1 $MAX_ITERATIONS); do
  TS=$(date +%H:%M:%S)
  GOOGLE=$(dig @8.8.8.8 A $DOMAIN +short)
  CF=$(dig @1.1.1.1 A $DOMAIN +short)
  echo "[$TS] Google: $GOOGLE | Cloudflare: $CF" | tee -a $LOGFILE
  if [[ "$GOOGLE" == "$EXPECTED_IP" && "$CF" == "$EXPECTED_IP" ]]; then
    echo "Propagation confirmed on both resolvers." | tee -a $LOGFILE
    exit 0
  fi
  sleep $INTERVAL
done

echo "Propagation check failed after $((MAX_ITERATIONS * INTERVAL / 3600))h." | tee -a $LOGFILE
exit 1
```
The log file this produces is your audit trail: timestamped evidence of when each record resolved to the new value. The EU NIS2 Directive, which took effect in October 2024, places DNS providers and registrars under incident reporting and change management obligations. An automatically timestamped log of every DNS state transition covers the audit trail requirement without additional tooling.
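If you want the same audit trail outside the bash script, an append-only JSON-lines log of each state transition is enough. A sketch; the field names are illustrative, not mandated by NIS2:

```python
import json
from datetime import datetime, timezone

def log_transition(logfile, domain, record_type, old, new):
    """Append one timestamped DNS state transition as a JSON line."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "domain": domain,
        "type": record_type,
        "old": old,
        "new": new,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One line per change, each independently parseable, and the file doubles as input for post-incident review.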
Structure the full migration as a GitHub Actions workflow where each phase is a separate job with needs: dependencies. Failed jobs halt the pipeline and surface the rollback job.
```yaml
name: Domain Migration
on:
  workflow_dispatch:

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Take zone snapshot
        run: python scripts/snapshot_zone.py
        env:
          NAME_COM_USER: ${{ secrets.NAME_COM_USER }}
          NAME_COM_TOKEN: ${{ secrets.NAME_COM_TOKEN }}
      - name: Validate snapshot
        run: python scripts/validate_snapshot.py
      - uses: actions/upload-artifact@v4
        with:
          name: zone-snapshot
          path: snapshot_*.json

  ttl_ramp:
    needs: snapshot
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: zone-snapshot
      - name: Reduce TTL to 300s
        run: python scripts/reduce_ttl.py --ttl 300
      # A literal 24h sleep exceeds GitHub Actions' 6-hour job limit on
      # hosted runners; in practice, split the ramp into a scheduled
      # follow-up workflow or gate the next job on an environment wait timer.
      - name: Wait 24h
        run: sleep 86400

  email_gate:
    needs: ttl_ramp
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set forwarding rules
        run: bash scripts/set_forwarding.sh
      - name: Check DMARC propagation
        run: bash scripts/check_dmarc.sh

  mx_flip:
    needs: email_gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update MX records
        run: python scripts/flip_mx.py
      - name: Verify MX resolution
        run: bash scripts/verify_mx.sh

  transfer_init:
    needs: mx_flip
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unlock domain and get auth code
        run: python scripts/unlock_and_auth.py
      - name: Subscribe transfer webhook
        run: python scripts/subscribe_webhook.py

  rollback:
    if: failure()
    needs: [snapshot, ttl_ramp, email_gate, mx_flip, transfer_init]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: zone-snapshot
      - name: Execute rollback
        run: python scripts/rollback.py
```
The `needs:` chain enforces the phase-gate requirement without custom orchestration logic. The `if: failure()` condition on the rollback job means it runs only when something upstream fails, and the snapshot artifact from the first job is available to it via `download-artifact`.
Route 53 offers solid DNS scripting via Boto3, but it covers only the DNS layer, requires full AWS ecosystem buy-in, and provides no native email forwarding API. Cloudflare's DNS API is excellent but operates as a DNS host and proxy, covering only Phase 1 of this pipeline. The name.com API covers the DNS zone, email forwarding, and registrar transfer operations through a single authenticated surface, which is why the code examples above are all against it. Vercel, Replit, and Netlify run their domain operations on the same infrastructure.
## Before You Touch Anything: Run the Zone Snapshot Right Now
If you're planning a migration and haven't taken a snapshot yet, do it now. You need a name.com API token first. Generate one in the API settings section of your name.com account, then pull your current zone state to a timestamped file:
```bash
curl -s -u "username:token" \
  "https://api.dev.name.com/core/v1/domains/yourdomain.com/records" \
  | jq '.' > snapshot_$(date +%Y%m%d_%H%M%S).json
```
Confirm the output JSON contains your A, MX, and TXT records before proceeding. That file is the precondition for the TTL ramp, the rollback diff, and the post-transfer validation. A migration without a verified pre-migration snapshot has no clean reversion path. Commit it to your repo and treat it as an immutable artifact for the duration of the operation.
Get started with the name.com API to grab your API key and pull your current DNS records in under five minutes.
Have you run a domain migration that went sideways despite a solid checklist? Drop your experience in the comments, including where the seam broke and what you did to recover.