**TL;DR:** Standard drift detection breaks in air-gapped environments because every major tool assumes cloud API access. The fix is decentralized reconciliation with local state management, not trying to force connected tools into disconnected networks.
## The Assumption That Breaks Everything
Every popular drift detection tool makes the same assumption: your infrastructure can reach the internet.
terraform plan calls AWS APIs. Argo CD pulls from remote Git repos. Spacelift runs scans from a SaaS control plane. These tools work brilliantly in connected environments. The moment you drop them into an air-gapped network, they go silent.
I've spent the better part of a decade building infrastructure for organizations where connectivity isn't just discouraged; it's forbidden. Government agencies, defense contractors, healthcare systems, financial trading floors. These environments are disconnected by design, not by accident. And drift detection in these networks is a fundamentally different problem than what most DevOps engineers encounter.
## Why Air-Gapped Workloads Drift Differently
In a connected environment, drift happens and gets caught relatively fast. Someone clicks through the console, Terraform Cloud flags it on the next scan, you fix it. The feedback loop is tight.
In air-gapped environments, drift accumulates silently.
A sysadmin patches a node manually because the automated pipeline can't reach the package mirror. A developer tweaks a ConfigMap directly because the GitOps controller lost sync with the local Git server. An operator scales a deployment by hand during an incident and forgets to commit the change.
These changes compound. By the time anyone runs a manual audit, the gap between declared state and actual state can be enormous.
The core problem: connected drift detection is continuous and automated. Disconnected drift detection is episodic and manual. That gap is where compliance violations, security incidents, and late-night pages live.
## What Doesn't Work (And Why Teams Keep Trying)
### Terraform Plan Over VPN
The most common first attempt: tunnel terraform plan through a VPN into the air-gapped network.
Problems:
- Latency kills the feedback loop. Provider API calls that take milliseconds on the internet take seconds over a restricted VPN. A plan that runs in 30 seconds now takes 15 minutes.
- Partial connectivity isn't air-gapped. If your "air-gapped" network has a VPN tunnel to SaaS tooling, your security team has questions. Valid ones.
- State file synchronization becomes a bottleneck. Remote state backends (S3, Consul) need connectivity. Local state files create merge conflicts when multiple operators work simultaneously.
### GitOps Controllers Pointed at External Repos
Flux CD and Argo CD are excellent GitOps tools. But pointing them at a GitHub repo from an air-gapped cluster means... you don't have an air-gapped cluster anymore.
Running a local Git server (Gitea, GitLab) inside the perimeter fixes the connectivity problem but creates a new one: keeping the local repo in sync with the source of truth requires a deliberate, auditable transfer process. USB drives, data diodes, or scheduled one-way syncs all introduce delay. That delay is where drift happens.
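One workable transfer mechanism is `git bundle`, which packs a repository's refs into a single file that approved media (or a data diode) can carry one way, and whose integrity can be verified on arrival. A minimal sketch in Python, assuming `git` is on the PATH; the function names are mine, not a standard tool:

```python
import subprocess


def bundle_repo(repo_dir, bundle_path):
    """Connected side: pack every ref into one file for one-way transfer."""
    subprocess.run(
        ["git", "-C", str(repo_dir), "bundle", "create", str(bundle_path), "--all"],
        check=True)


def import_bundle(bundle_path, mirror_dir):
    """Inside the perimeter: check bundle integrity, then update the local mirror."""
    subprocess.run(
        ["git", "-C", str(mirror_dir), "bundle", "verify", str(bundle_path)],
        check=True)
    subprocess.run(
        ["git", "-C", str(mirror_dir), "fetch", str(bundle_path), "+refs/*:refs/*"],
        check=True)
```

The bundle file is auditable (it can be hashed and logged at the boundary), which is exactly what the transfer process needs.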
### Periodic Manual Audits
The fallback everyone hates: someone SSHes in, runs a bunch of comparison scripts, and writes a report.
This catches drift after the fact. In regulated environments, "we check quarterly" doesn't satisfy auditors who want continuous compliance evidence. And manual audits miss things. Every time.
## What Actually Works
After iterating through the failures above across multiple engagements, three patterns consistently work in production air-gapped environments.
### Pattern 1: Decentralized Policy Agents
Instead of a central control plane that reaches into clusters, deploy autonomous policy agents inside each air-gapped cluster.
Each agent:
- Stores the desired state locally (pulled in during the last approved sync window)
- Runs a continuous reconciliation loop comparing desired vs. actual state
- Logs every deviation to a local audit store
- Remediates automatically when configured to do so, or raises alerts for manual review
This is the pattern that Spectro Cloud Palette uses, and it's the right mental model. The cluster enforces its own policy. It doesn't need to phone home.
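The loop such an agent runs is simple at its core. A minimal sketch, assuming desired state is a dict loaded during the last sync window; the names and audit-log path here are illustrative, not from Palette or any other product:

```python
import json
import time


def reconcile(desired, fetch_actual, remediate=None,
              audit_log="/var/log/drift/audit.jsonl"):
    """One pass of the agent loop: diff desired vs. actual state,
    append every deviation to the local audit store, optionally fix it."""
    deviations = []
    for name, want in desired.items():
        have = fetch_actual(name)
        if have != want:
            event = {"resource": name, "want": want,
                     "have": have, "ts": time.time()}
            deviations.append(event)
            with open(audit_log, "a") as f:   # local-only audit trail
                f.write(json.dumps(event) + "\n")
            if remediate is not None:         # auto-remediation, if enabled
                remediate(name, want)
    return deviations
```

In production this runs on a timer or a watch stream; the point is that nothing in the loop requires connectivity.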
```yaml
# Example: OPA Gatekeeper constraint running locally
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels:
      - key: "team"
      - key: "cost-center"
      - key: "environment"
```
Gatekeeper runs entirely inside the cluster. No external connectivity needed. Violations are logged locally and can be exported during sync windows.
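When a sync window opens, those locally recorded violations can be flattened into an exportable report. Gatekeeper's audit controller writes results into each constraint's `status.violations`, so a small script over the output of `kubectl get constraints -o json` is enough; the report shape here is my own:

```python
def summarize_violations(constraints_json):
    """Flatten Gatekeeper audit results for export during a sync window.

    `constraints_json` is the parsed output of
    `kubectl get constraints -o json`; Gatekeeper records audit
    findings under each constraint's status.violations.
    """
    report = []
    for item in constraints_json.get("items", []):
        for v in item.get("status", {}).get("violations", []):
            report.append({
                "constraint": item["metadata"]["name"],
                "kind": v.get("kind"),
                "name": v.get("name"),
                "message": v.get("message"),
            })
    return report
```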
### Pattern 2: Local State Snapshots with Diff-on-Sync
For Terraform managed infrastructure, maintain state snapshots inside the air-gapped environment.
The workflow:
- Declare state in your IaC repo outside the air gap
- Transfer the repo into the environment through your approved media (data diode, approved USB, one-way sync)
- Run `terraform plan` inside the air gap against local provider endpoints
- Snapshot the actual state after each apply
- Diff the snapshot against the expected state on a cron schedule
- Export the diff report during the next sync window
The key insight: the state file and the provider APIs both live inside the perimeter. `terraform plan` works fine when everything it needs to reach is local.
```bash
#!/bin/bash
# drift_check.sh - runs inside the air-gapped environment
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DRIFT_DIR="/var/log/drift-reports"
mkdir -p "${DRIFT_DIR}"

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected.
# Disable -e around the plan so exit code 2 doesn't kill the script.
set +e
terraform plan -detailed-exitcode -out="${DRIFT_DIR}/plan_${TIMESTAMP}.tfplan" 2>&1 | \
  tee "${DRIFT_DIR}/drift_${TIMESTAMP}.log"
EXIT_CODE=${PIPESTATUS[0]}
set -e

if [ "$EXIT_CODE" -eq 2 ]; then
  echo "DRIFT_DETECTED" > "${DRIFT_DIR}/status_${TIMESTAMP}"
  # Alert local monitoring
  curl -s -X POST http://alertmanager.local:9093/api/v1/alerts \
    -d '[{"labels":{"alertname":"InfrastructureDrift","severity":"warning"}}]'
fi
```
### Pattern 3: Immutable Baselines with Checksum Verification
For the most sensitive environments (defense, critical infrastructure), treat infrastructure state like a software artifact.
- Build a golden baseline of every resource's expected configuration
- Generate checksums (SHA-256) for each configuration artifact
- Deploy a lightweight agent that periodically recalculates checksums on live resources
- Any mismatch triggers an immediate alert
This is coarser than Terraform drift detection, but it works without any provider APIs. It's closer to file integrity monitoring (think AIDE or OSSEC) applied to infrastructure configuration.
```python
# baseline_check.py - infrastructure checksum verification
import hashlib
import json
import subprocess


def get_resource_state(resource_type, resource_name):
    """Capture current state of a Kubernetes resource."""
    result = subprocess.run(
        ["kubectl", "get", resource_type, resource_name, "-o", "json"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return None
    state = json.loads(result.stdout)
    # Strip volatile fields that change on every read
    for field in ["resourceVersion", "uid", "creationTimestamp",
                  "managedFields", "generation"]:
        state.get("metadata", {}).pop(field, None)
    return state


def checksum(state):
    """Generate deterministic checksum of resource state."""
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def verify_baseline(baseline_file):
    """Compare live state against stored baseline checksums."""
    with open(baseline_file) as f:
        baseline = json.load(f)
    drift_detected = []
    for resource in baseline["resources"]:
        current = get_resource_state(resource["type"], resource["name"])
        if current is None:
            drift_detected.append({
                "resource": f"{resource['type']}/{resource['name']}",
                "status": "MISSING"
            })
            continue
        current_hash = checksum(current)
        if current_hash != resource["expected_hash"]:
            drift_detected.append({
                "resource": f"{resource['type']}/{resource['name']}",
                "status": "MODIFIED",
                "expected": resource["expected_hash"][:12],
                "actual": current_hash[:12]
            })
    return drift_detected
```
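For completeness: `verify_baseline` consumes a baseline file that something has to produce after each approved apply. A minimal generator using the same checksum scheme might look like this; the `fetch_state` callable and the file shape are assumptions mirroring `get_resource_state` above:

```python
import hashlib
import json


def state_checksum(state):
    """Same deterministic hashing scheme as checksum() above."""
    return hashlib.sha256(
        json.dumps(state, sort_keys=True).encode()).hexdigest()


def build_baseline(resources, fetch_state):
    """Produce the baseline dict that verify_baseline expects.

    `resources` is a list of {"type": ..., "name": ...} entries and
    `fetch_state` returns the (volatile-field-stripped) live state,
    e.g. get_resource_state above.
    """
    return {"resources": [
        {"type": r["type"], "name": r["name"],
         "expected_hash": state_checksum(fetch_state(r["type"], r["name"]))}
        for r in resources
    ]}
```

Run it immediately after an approved apply, write the result to the baseline file, and the cron-driven verification has something trustworthy to diff against.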
## Choosing the Right Pattern
| Pattern | Best For | Latency | Complexity | Audit Trail |
|---|---|---|---|---|
| Decentralized Agents | Kubernetes clusters | Real-time | Medium | Strong |
| Local State Snapshots | Terraform/IaC resources | Minutes (cron) | Low | Medium |
| Checksum Baselines | High-security environments | Minutes (cron) | Low | Strong |
In practice, most air-gapped environments use a combination. Gatekeeper handles Kubernetes policy enforcement in real time. Terraform drift checks run on a cron inside the perimeter. Checksum baselines provide an additional layer for the security team.
## The Compliance Angle
Auditors care about three things:
- Can you prove your infrastructure matches the declared state? Drift reports with timestamps answer this.
- How quickly do you detect deviations? "Within minutes" beats "at the next quarterly audit."
- What happens when drift is detected? Automated remediation or documented manual review processes.
Air-gapped environments often have stricter compliance requirements than connected ones. The irony is that their tooling for meeting those requirements is worse. Building local drift detection infrastructure closes that gap.
## Lessons From the Field
1. Treat sync windows as deployment events. When new policy or desired state enters the air-gapped environment, that transfer should go through the same review process as a production deployment. Because it is one.
2. Log everything locally, export periodically. Build a local ELK or Loki stack inside the perimeter. Drift events, remediation actions, audit logs. Export summaries during sync windows for central visibility.
3. Test your drift detection in staging first. Introduce intentional drift in a staging cluster and verify your agents catch it. I've seen teams deploy Gatekeeper and assume it works, only to discover six months later that their constraints had a typo that prevented enforcement.
4. Don't fight the air gap. The biggest mistake is trying to poke holes in the network boundary to make connected tools work. Every hole is an attack surface. Build for disconnection. It's simpler in the long run.
5. Version your baselines. When the approved state changes (through a sync window), update the baseline checksums and keep the old ones. This gives you a historical record of what the environment should have looked like at any point in time.
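That last lesson is only a few lines of code: archive the current baseline under a timestamp before installing the new one, so the history of approved states survives. A sketch with illustrative paths:

```python
import json
import shutil
import time
from pathlib import Path


def rotate_baseline(baseline_path, new_baseline, archive_dir):
    """Archive the current baseline, then install the new one.

    Keeps a timestamped history so you can reconstruct what the
    environment *should* have looked like at any point in time.
    """
    baseline = Path(baseline_path)
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    if baseline.exists():
        stamp = time.strftime("%Y%m%d_%H%M%S")
        shutil.copy2(baseline, archive / f"baseline_{stamp}.json")
    baseline.write_text(json.dumps(new_baseline, indent=2, sort_keys=True))
```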
