My tool detects Terraform drift: it scans live AWS, diffs it against tfstate,
and lists every resource that no longer matches. For a long time I thought that
was the whole job.
Then I was staring at a real drift report — forty-odd changes — during a "wait,
why is that open?" moment, and I realized the report was answering the wrong
question. It told me what changed. It couldn't tell me the two things I
actually needed in that moment:
- Which of these forty is an emergency, and which is noise?
- Who changed it?
A flat list where a security group opened to the entire internet sits in the
same grey row as a renamed tag is a list that makes you do the triage. So I
fixed both. Here's how, and the bits that bit me.
Part 1 — Not all drift is equal
terraform plan is deliberately value-neutral: a diff is a diff. But to a human
on call, 0.0.0.0/0 appearing in an ingress rule is a heart attack, and
Name: web → Name: web-1 is a shrug. The tool already had the field-level
diff; it just treated every field the same.
So I graded each change. No new AWS calls, no model — pure logic over the diff I
already compute:
    def classify_change(asset_type, field, old, new):
        f = (field or '').lower()
        old_s = str(old or '').lower()
        new_s = str(new or '').lower()

        # opened to the world
        if '0.0.0.0/0' in new_s and '0.0.0.0/0' not in old_s:
            return CRITICAL, _('Opened to the entire internet (0.0.0.0/0)')
        if 'public' in f and new_s in _TRUTHY and old_s not in _TRUTHY:
            return CRITICAL, _('Resource was made publicly accessible')

        # protections removed
        if (any(k in f for k in ('encrypt', 'kms', 'sse'))
                and old_s not in _FALSY and new_s in _FALSY):
            return HIGH, _('Encryption was disabled')
        if any(k in f for k in ('policy', 'iam', 'role', 'principal', 'acl')):
            return HIGH, _('Access or permission configuration changed')

        return LOW, _('Configuration value changed')
A resource's severity is just the worst of its field changes — open a port and
rename a tag, you're an incident, not a shrug:
    worst = LOW
    for c in changes:
        sev, reason = classify_change(asset_type, c['field'], c['old'], c['new'])
        if SEVERITY_ORDER[sev] > SEVERITY_ORDER[worst]:
            worst = sev
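If you want to try the roll-up on its own, here is a minimal self-contained sketch of the same worst-of rule. The severity labels, the SEVERITY_ORDER ranks, the `classify` parameter, and `toy_classify` are assumptions for illustration, not the tool's actual definitions. Returning the reasons alongside the severity is also my addition here.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed labels and ranks for illustration; the tool's real constants may differ.
LOW, HIGH, CRITICAL = 'low', 'high', 'critical'
SEVERITY_ORDER = {LOW: 0, HIGH: 1, CRITICAL: 2}

# Same call shape as classify_change: (asset_type, field, old, new) -> (severity, reason)
Classifier = Callable[[str, str, object, object], Tuple[str, str]]


def resource_severity(asset_type: str, changes: Iterable[dict],
                      classify: Classifier) -> Tuple[str, List[str]]:
    """Return the worst severity across a resource's changes, plus the reasons at that level."""
    graded = [classify(asset_type, c['field'], c['old'], c['new']) for c in changes]
    if not graded:
        return LOW, []  # nothing changed, nothing to escalate
    worst = max((sev for sev, _ in graded), key=SEVERITY_ORDER.__getitem__)
    return worst, [reason for sev, reason in graded if sev == worst]


def toy_classify(asset_type, field, old, new):
    """Stand-in classifier with a single rule, only for this example."""
    if '0.0.0.0/0' in str(new) and '0.0.0.0/0' not in str(old):
        return CRITICAL, 'Opened to the entire internet (0.0.0.0/0)'
    return LOW, 'Configuration value changed'


changes = [
    {'field': 'tags.Name', 'old': 'web', 'new': 'web-1'},
    {'field': 'ingress.cidr_blocks', 'old': '10.0.0.0/8', 'new': '0.0.0.0/0'},
]
severity, reasons = resource_severity('aws_security_group', changes, toy_classify)
print(severity, reasons)
```

The renamed tag and the opened port land on the same resource, and the resource is graded by the port.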
Two things I'd flag if you copy this. First, grade on the transition, not the
value — 0.0.0.0/0 in new and not in old only fires when the change opened
it; a group that was always public doesn't scream every scan. Second, these are
heuristics, not a policy engine — string-matching field names will miss
things and occasionally over-flag. That's a deliberate trade: a fast, obvious
"this one first" beats a correct-but-unshipped OPA integration. I'd rather be
roughly right on every resource today.
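To make the first point concrete, here is a small sketch comparing a value-based check with a transition-based one. Both function names and the sample values are hypothetical; only the `in new and not in old` test comes from the classifier above.

```python
WORLD = '0.0.0.0/0'


def fires_on_value(old, new):
    """Naive check: flags whenever the live value contains the open CIDR."""
    return WORLD in str(new or '')


def fires_on_transition(old, new):
    """Flags only when this change introduced the open CIDR."""
    return WORLD in str(new or '') and WORLD not in str(old or '')


# (old, new) pairs as they might appear in a field-level diff
opened = ("['10.0.0.0/8']", "['0.0.0.0/0']")
already_open = ("['0.0.0.0/0']", "['0.0.0.0/0', '10.1.0.0/16']")

for label, (old, new) in (('opened', opened), ('already open', already_open)):
    print(label, fires_on_value(old, new), fires_on_transition(old, new))
```

The second pair is still drift (someone added a CIDR), but the group was open before the change, so only the value-based check raises it as a new exposure.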
Part 2 — "...but who did this?"
Severity tells you which drift to open first. It still doesn't tell you who to go
talk to. And terraform plan structurally cannot tell you — it compares two
files; it has no idea a human touched the console at 3pm.
But my tool isn't stateless. It already assumes a read-only role into each
account to scan it. Which means CloudTrail is right there, one API call away:
    import logging

    logger = logging.getLogger(__name__)

    _READ_PREFIXES = ('Describe', 'List', 'Get', 'Lookup', 'BatchGet')


    def lookup_actor(session, resource_id):
        try:
            client = session.client('cloudtrail')
            resp = client.lookup_events(
                LookupAttributes=[{'AttributeKey': 'ResourceName',
                                   'AttributeValue': resource_id}],
                MaxResults=10,
            )
        except Exception as exc:  # AccessDenied, throttling, no trail...
            logger.warning('CloudTrail lookup failed for %s: %s', resource_id, exc)
            return None  # attribution is a bonus, never load-bearing

        for ev in resp.get('Events', []):
            if ev['EventName'].startswith(_READ_PREFIXES):
                continue  # skip the Describe/List noise; we scan a lot
            return _parse_event(ev)  # who / when / source IP, from the event JSON
        return None
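The snippet above leaves `_parse_event` out. Here is a rough sketch of what such a helper could look like, assuming the usual shape of a LookupEvents item: an `EventTime` datetime, a `Username`, and a `CloudTrailEvent` JSON string carrying `userIdentity` and `sourceIPAddress`. The function name and the returned keys are my own choices, not the tool's.

```python
import json


def parse_event_sketch(ev: dict) -> dict:
    """Pull who / when / source IP out of one LookupEvents result item."""
    try:
        detail = json.loads(ev.get('CloudTrailEvent') or '{}')
    except (TypeError, ValueError):
        detail = {}  # malformed payload: fall back to the top-level fields
    if not isinstance(detail, dict):
        detail = {}

    identity = detail.get('userIdentity') or {}
    when = ev.get('EventTime')
    return {
        # Prefer the full ARN; fall back to the short Username, then a placeholder.
        'actor': identity.get('arn') or ev.get('Username') or 'unknown',
        'event': ev.get('EventName'),
        # boto3 hands back a datetime; the raw record carries an ISO string.
        'when': when.isoformat() if hasattr(when, 'isoformat') else detail.get('eventTime'),
        'source_ip': detail.get('sourceIPAddress'),
    }
```

Everything is read with `.get`, so an event with a missing or unparsable payload still produces a row instead of an exception.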
Three things that bit me here:
- Filter out read events. My first version proudly reported that the last thing to touch the security group was... my own scanner, calling DescribeSecurityGroups. The tool kept catching itself. Skipping the Describe/List/Get prefixes gets you the actual mutating event.
- It must never break the page. LookupEvents can be denied (missing permission), throttled, or simply find nothing. Every one of those returns None and the row says "no record"; attribution failing can't take down the drift report.
- Do it lazily. Calling CloudTrail for forty resources on page load is slow and rate-limit roulette. So it's a per-row "Who changed this?" button, and CloudTrail is only hit for the one resource you actually care about.
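The "never break the page" and "do it lazily" points can be sketched without any web framework. This is a hypothetical wrapper, not the tool's code: `LazyAttribution`, the `lookup` callable, and the small in-memory cache are all assumptions. In the real app the `lookup` argument would be something like `lookup_actor` bound to a session.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

NO_RECORD = {'status': 'no record'}


class LazyAttribution:
    """Looks up one resource only when asked, and never raises to the caller."""

    def __init__(self, lookup: Callable[[str], Optional[dict]]):
        self._lookup = lookup  # e.g. a function that calls CloudTrail
        self._cache: Dict[str, dict] = {}  # only successful answers are kept

    def who_changed(self, resource_id: str) -> dict:
        """Handler for a per-row "Who changed this?" click."""
        if resource_id in self._cache:
            return {'status': 'ok', **self._cache[resource_id]}
        try:
            actor = self._lookup(resource_id)
        except Exception as exc:  # denied, throttled, anything else
            logger.warning('attribution failed for %s: %s', resource_id, exc)
            return dict(NO_RECORD)
        if actor is None:
            return dict(NO_RECORD)  # not cached, so a later click can retry
        self._cache[resource_id] = actor
        return {'status': 'ok', **actor}
```

Nothing is fetched when the object is created, so rendering forty rows costs zero CloudTrail calls. Only clicks do, and a repeated click on the same row is served from memory.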
I also added cloudtrail:LookupEvents to the bundled IAM policy, and I'm honest
in the UI about the limits: LookupEvents only returns the last 90 days of
management events, and only for the region the client is pointed at. Sometimes
the answer is "no record," and that's fine; it's still more than terraform plan
ever offered.
Where it lives: a detachable plugin
I didn't grow this inside the core app. It's its own optional Django app that
plugs in through one seam — a feature flag plus a sidebar entry — and the core
never imports it. Drop it from INSTALLED_APPS and the nav entry disappears and
the routes 404; nothing else notices. Keeping advanced features at arm's length
like this means the core stays a clean, boring ledger, and the interesting stuff
is opt-in. (That's a whole post of its own.)
Takeaways
- Drift detection is the easy 80%. The decision-useful part is how bad and who — and neither comes from diffing two files.
- Grade severity on the transition (old→new), not the current value, or your dashboard cries wolf on every scan.
- Heuristic severity that ships beats a perfect policy engine that doesn't. You can always tighten the rules later.
- If you already hold credentials into an account, CloudTrail attribution is almost free — just remember to filter out your own read calls, fail soft, and fetch it lazily.
This is the drift-risk view of a self-hosted tool that watches how your live AWS
drifts from Terraform — open source (MIT), one docker compose up:
syncvey.com. When drift shows up in your infra, what tells
you who did it today — CloudTrail by hand, a SIEM, or nobody and you just ask
around?