**TL;DR:** Standard drift detection breaks in air-gapped environments because every major tool assumes cloud API access. The fix is decentralized reconciliation with local state management, not trying to force connected tools into disconnected networks.
## The Assumption That Breaks Everything
Every popular drift detection tool makes the same assumption: your infrastructure can reach the internet.
terraform plan calls AWS APIs. Argo CD pulls from remote Git repos. Spacelift runs scans from a SaaS control plane. These tools work brilliantly in connected environments. The moment you drop them into an air-gapped network, they go silent.
I've spent the better part of a decade building infrastructure for organizations where connectivity isn't just discouraged; it's forbidden. Government agencies, defense contractors, healthcare systems, financial trading floors. These environments are disconnected by design, not by accident. And drift detection in these networks is a fundamentally different problem than what most DevOps engineers encounter.
## Why Air-Gapped Workloads Drift Differently
In a connected environment, drift happens and gets caught relatively fast. Someone clicks through the console, Terraform Cloud flags it on the next scan, you fix it. The feedback loop is tight.
In air-gapped environments, drift accumulates silently.
A sysadmin patches a node manually because the automated pipeline can't reach the package mirror. A developer tweaks a ConfigMap directly because the GitOps controller lost sync with the local Git server. An operator scales a deployment by hand during an incident and forgets to commit the change.
These changes compound. By the time anyone runs a manual audit, the gap between declared state and actual state can be enormous.
The core problem: connected drift detection is continuous and automated. Disconnected drift detection is episodic and manual. That gap is where compliance violations, security incidents, and late-night pages live.
## What Doesn't Work (And Why Teams Keep Trying)
### Terraform Plan Over VPN
The most common first attempt: tunnel terraform plan through a VPN into the air-gapped network.
Problems:
- Latency kills the feedback loop. Provider API calls that take milliseconds on the internet take seconds over a restricted VPN. A plan that runs in 30 seconds now takes 15 minutes.
- Partial connectivity isn't air-gapped. If your "air-gapped" network has a VPN tunnel to SaaS tooling, your security team has questions. Valid ones.
- State file synchronization becomes a bottleneck. Remote state backends (S3, Consul) need connectivity. Local state files create merge conflicts when multiple operators work simultaneously.
### GitOps Controllers Pointed at External Repos
Flux CD and Argo CD are excellent GitOps tools. But pointing them at a GitHub repo from an air-gapped cluster means... you don't have an air-gapped cluster anymore.
Running a local Git server (Gitea, GitLab) inside the perimeter fixes the connectivity problem but creates a new one: keeping the local repo in sync with the source of truth requires a deliberate, auditable transfer process. USB drives, data diodes, or scheduled one-way syncs all introduce delay. That delay is where drift happens.
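One workable transfer mechanism is `git bundle`, which packs a repository's refs into a single file that approved media (or a data diode) can carry one way, and whose integrity can be verified on arrival. A minimal sketch in Python, assuming `git` is on the PATH; the function names are mine, not a standard tool:

```python
import subprocess


def bundle_repo(repo_dir, bundle_path):
    """Connected side: pack every ref into one file for one-way transfer."""
    subprocess.run(
        ["git", "-C", str(repo_dir), "bundle", "create", str(bundle_path), "--all"],
        check=True)


def import_bundle(bundle_path, mirror_dir):
    """Inside the perimeter: check bundle integrity, then update the local mirror."""
    subprocess.run(
        ["git", "-C", str(mirror_dir), "bundle", "verify", str(bundle_path)],
        check=True)
    subprocess.run(
        ["git", "-C", str(mirror_dir), "fetch", str(bundle_path), "+refs/*:refs/*"],
        check=True)
```

The bundle file is auditable (it can be hashed and logged at the boundary), which is exactly what the transfer process needs.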
### Periodic Manual Audits
The fallback everyone hates: someone SSHes in, runs a bunch of comparison scripts, and writes a report.
This catches drift after the fact. In regulated environments, "we check quarterly" doesn't satisfy auditors who want continuous compliance evidence. And manual audits miss things. Every time.
## What Actually Works
After iterating through the failures above across multiple engagements, three patterns consistently work in production air-gapped environments.
### Pattern 1: Decentralized Policy Agents
Instead of a central control plane that reaches into clusters, deploy autonomous policy agents inside each air-gapped cluster.
Each agent:
- Stores the desired state locally (pulled in during the last approved sync window)
- Runs a continuous reconciliation loop comparing desired vs. actual state
- Logs every deviation to a local audit store
- Remediates automatically when configured to do so, or raises alerts for manual review
This is the pattern that Spectro Cloud Palette uses, and it's the right mental model. The cluster enforces its own policy. It doesn't need to phone home.
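The loop such an agent runs is simple at its core. A minimal sketch, assuming desired state is a dict loaded during the last sync window; the names and audit-log path here are illustrative, not from Palette or any other product:

```python
import json
import time


def reconcile(desired, fetch_actual, remediate=None,
              audit_log="/var/log/drift/audit.jsonl"):
    """One pass of the agent loop: diff desired vs. actual state,
    append every deviation to the local audit store, optionally fix it."""
    deviations = []
    for name, want in desired.items():
        have = fetch_actual(name)
        if have != want:
            event = {"resource": name, "want": want,
                     "have": have, "ts": time.time()}
            deviations.append(event)
            with open(audit_log, "a") as f:   # local-only audit trail
                f.write(json.dumps(event) + "\n")
            if remediate is not None:         # auto-remediation, if enabled
                remediate(name, want)
    return deviations
```

In production this runs on a timer or a watch stream; the point is that nothing in the loop requires connectivity.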
```yaml
# Example: OPA Gatekeeper constraint running locally
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels:
      - key: "team"
      - key: "cost-center"
      - key: "environment"
```
Gatekeeper runs entirely inside the cluster. No external connectivity needed. Violations are logged locally and can be exported during sync windows.
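When a sync window opens, those locally recorded violations can be flattened into an exportable report. Gatekeeper's audit controller writes results into each constraint's `status.violations`, so a small script over the output of `kubectl get constraints -o json` is enough; the report shape here is my own:

```python
def summarize_violations(constraints_json):
    """Flatten Gatekeeper audit results for export during a sync window.

    `constraints_json` is the parsed output of
    `kubectl get constraints -o json`; Gatekeeper records audit
    findings under each constraint's status.violations.
    """
    report = []
    for item in constraints_json.get("items", []):
        for v in item.get("status", {}).get("violations", []):
            report.append({
                "constraint": item["metadata"]["name"],
                "kind": v.get("kind"),
                "name": v.get("name"),
                "message": v.get("message"),
            })
    return report
```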
### Pattern 2: Local State Snapshots with Diff-on-Sync
For Terraform managed infrastructure, maintain state snapshots inside the air-gapped environment.
The workflow:
- Declare state in your IaC repo outside the air gap
- Transfer the repo into the environment through your approved media (data diode, approved USB, one-way sync)
- Run `terraform plan` inside the air gap against local provider endpoints
- Snapshot the actual state after each apply
- Diff the snapshot against the expected state on a cron schedule
- Export the diff report during the next sync window
The key insight: the state file and the provider APIs both live inside the perimeter. `terraform plan` works fine when everything it needs to reach is local.
```bash
#!/bin/bash
# drift_check.sh - runs inside the air-gapped environment
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DRIFT_DIR="/var/log/drift-reports"
mkdir -p "${DRIFT_DIR}"

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected.
# Disable -e around the plan so exit code 2 doesn't kill the script.
set +e
terraform plan -detailed-exitcode -out="${DRIFT_DIR}/plan_${TIMESTAMP}.tfplan" 2>&1 | \
  tee "${DRIFT_DIR}/drift_${TIMESTAMP}.log"
EXIT_CODE=${PIPESTATUS[0]}
set -e

if [ "$EXIT_CODE" -eq 2 ]; then
  echo "DRIFT_DETECTED" > "${DRIFT_DIR}/status_${TIMESTAMP}"
  # Alert local monitoring
  curl -s -X POST http://alertmanager.local:9093/api/v1/alerts \
    -d '[{"labels":{"alertname":"InfrastructureDrift","severity":"warning"}}]'
fi
```
### Pattern 3: Immutable Baselines with Checksum Verification
For the most sensitive environments (defense, critical infrastructure), treat infrastructure state like a software artifact.
- Build a golden baseline of every resource's expected configuration
- Generate checksums (SHA-256) for each configuration artifact
- Deploy a lightweight agent that periodically recalculates checksums on live resources
- Any mismatch triggers an immediate alert
This is coarser than Terraform drift detection, but it works without any provider APIs. It's closer to file integrity monitoring (think AIDE or OSSEC) applied to infrastructure configuration.
```python
# baseline_check.py - infrastructure checksum verification
import hashlib
import json
import subprocess


def get_resource_state(resource_type, resource_name):
    """Capture current state of a Kubernetes resource."""
    result = subprocess.run(
        ["kubectl", "get", resource_type, resource_name, "-o", "json"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return None
    state = json.loads(result.stdout)
    # Strip volatile fields that change on every read
    for field in ["resourceVersion", "uid", "creationTimestamp",
                  "managedFields", "generation"]:
        state.get("metadata", {}).pop(field, None)
    return state


def checksum(state):
    """Generate deterministic checksum of resource state."""
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def verify_baseline(baseline_file):
    """Compare live state against stored baseline checksums."""
    with open(baseline_file) as f:
        baseline = json.load(f)
    drift_detected = []
    for resource in baseline["resources"]:
        current = get_resource_state(resource["type"], resource["name"])
        if current is None:
            drift_detected.append({
                "resource": f"{resource['type']}/{resource['name']}",
                "status": "MISSING"
            })
            continue
        current_hash = checksum(current)
        if current_hash != resource["expected_hash"]:
            drift_detected.append({
                "resource": f"{resource['type']}/{resource['name']}",
                "status": "MODIFIED",
                "expected": resource["expected_hash"][:12],
                "actual": current_hash[:12]
            })
    return drift_detected
```
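For completeness: `verify_baseline` consumes a baseline file that something has to produce after each approved apply. A minimal generator using the same checksum scheme might look like this; the `fetch_state` callable and the file shape are assumptions mirroring `get_resource_state` above:

```python
import hashlib
import json


def state_checksum(state):
    """Same deterministic hashing scheme as checksum() above."""
    return hashlib.sha256(
        json.dumps(state, sort_keys=True).encode()).hexdigest()


def build_baseline(resources, fetch_state):
    """Produce the baseline dict that verify_baseline expects.

    `resources` is a list of {"type": ..., "name": ...} entries and
    `fetch_state` returns the (volatile-field-stripped) live state,
    e.g. get_resource_state above.
    """
    return {"resources": [
        {"type": r["type"], "name": r["name"],
         "expected_hash": state_checksum(fetch_state(r["type"], r["name"]))}
        for r in resources
    ]}
```

Run it immediately after an approved apply, write the result to the baseline file, and the cron-driven verification has something trustworthy to diff against.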
## Choosing the Right Pattern
| Pattern | Best For | Latency | Complexity | Audit Trail |
|---|---|---|---|---|
| Decentralized Agents | Kubernetes clusters | Real-time | Medium | Strong |
| Local State Snapshots | Terraform/IaC resources | Minutes (cron) | Low | Medium |
| Checksum Baselines | High-security environments | Minutes (cron) | Low | Strong |
In practice, most air-gapped environments use a combination. Gatekeeper handles Kubernetes policy enforcement in real time. Terraform drift checks run on a cron inside the perimeter. Checksum baselines provide an additional layer for the security team.
## The Compliance Angle
Auditors care about three things:
- Can you prove your infrastructure matches the declared state? Drift reports with timestamps answer this.
- How quickly do you detect deviations? "Within minutes" beats "at the next quarterly audit."
- What happens when drift is detected? Automated remediation or documented manual review processes.
Air-gapped environments often have stricter compliance requirements than connected ones. The irony is that their tooling for meeting those requirements is worse. Building local drift detection infrastructure closes that gap.
## Lessons From the Field
1. Treat sync windows as deployment events. When new policy or desired state enters the air-gapped environment, that transfer should go through the same review process as a production deployment. Because it is one.
2. Log everything locally, export periodically. Build a local ELK or Loki stack inside the perimeter. Drift events, remediation actions, audit logs. Export summaries during sync windows for central visibility.
3. Test your drift detection in staging first. Introduce intentional drift in a staging cluster and verify your agents catch it. I've seen teams deploy Gatekeeper and assume it works, only to discover six months later that their constraints had a typo that prevented enforcement.
4. Don't fight the air gap. The biggest mistake is trying to poke holes in the network boundary to make connected tools work. Every hole is an attack surface. Build for disconnection. It's simpler in the long run.
5. Version your baselines. When the approved state changes (through a sync window), update the baseline checksums and keep the old ones. This gives you a historical record of what the environment should have looked like at any point in time.
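That last lesson is only a few lines of code: archive the current baseline under a timestamp before installing the new one, so the history of approved states survives. A sketch with illustrative paths:

```python
import json
import shutil
import time
from pathlib import Path


def rotate_baseline(baseline_path, new_baseline, archive_dir):
    """Archive the current baseline, then install the new one.

    Keeps a timestamped history so you can reconstruct what the
    environment *should* have looked like at any point in time.
    """
    baseline = Path(baseline_path)
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    if baseline.exists():
        stamp = time.strftime("%Y%m%d_%H%M%S")
        shutil.copy2(baseline, archive / f"baseline_{stamp}.json")
    baseline.write_text(json.dumps(new_baseline, indent=2, sort_keys=True))
```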
