ひとし 田畑

My drift detector graded every change — and stayed blind to the secret that hadn't rotated in 200 days

My drift detector is built on one idea: take two snapshots of your cloud, diff
them, and grade what moved. A security group opened to 0.0.0.0/0? Critical. An
RDS instance that flipped to public? Critical. A tag someone fat-fingered? Low.
Every rule is a function of an old → new transition — a change happened, and
I score how bad the change is.

Then I went to add "your Secrets Manager secret hasn't rotated in too long" and
the whole model fell over. Not because it's hard to detect. Because the thing I
wanted to flag produces no diff.

The shape that breaks a diff

Here's a secret that hasn't rotated in 200 days. I scan it Monday. I scan it
Tuesday. I diff the two snapshots:

(no changes)

Of course there are no changes — nothing rotated. That's the entire problem.
Scan it every minute for a week and every diff comes back empty. The dangerous
state of this secret is precisely the state in which nothing is happening to
it.

My whole engine was wired to answer "what changed between two points in time?"
This risk lives in the opposite question: "what is true about this thing right
now, regardless of whether it just changed?" A password that rotated 89 days
ago and a password that rotated 91 days ago look identical to a diff, but one
of them crossed my 90-day line and the other didn't, and only absolute state
knows that.

I'd quietly assumed every risk was a change risk. It isn't. Some risks are
standing conditions: overdue rotation, public-by-default, encryption never
enabled. The absence of a change is the finding.
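To make the blind spot concrete, here is a minimal sketch. The `diff` and `is_overdue` helpers and the snapshot layout are hypothetical stand-ins rather than the tool's real code, and the dates are chosen so the secret is exactly 200 days past its last rotation.

```python
from datetime import datetime, timezone

def diff(old: dict, new: dict) -> dict:
    """Return {field: (old, new)} for every field whose value changed."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

def is_overdue(snapshot: dict, now: datetime, max_age_days: int = 90) -> bool:
    """Standing-condition check: needs only the current snapshot and the clock."""
    last = datetime.fromisoformat(snapshot["last_rotated_date"])
    return (now - last).days >= max_age_days

# The same secret scanned on two consecutive days; nothing about it moved.
monday = {"rotation_enabled": True,
          "last_rotated_date": "2024-01-01T00:00:00+00:00"}
tuesday = dict(monday)

# 200 days after the last rotation.
now = datetime(2024, 7, 19, tzinfo=timezone.utc)

changes = diff(monday, tuesday)     # empty: the diff path has nothing to grade
overdue = is_overdue(tuesday, now)  # True: the clock path sees the problem
print(changes, overdue)
```

The diff can only ever report fields that moved between snapshots, while the overdue check never looks at a previous snapshot at all.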

Two different questions, two different modules

So I stopped trying to force rotation through the diff path. The change-based
rules stayed exactly as they were — they need a prior snapshot and grade the
transition:

# rules.py — grades a field-level diff (old → new)
if asset.raw_data_prev:
    changes = _compute_raw_diff(asset.raw_data_prev, asset.raw_data)
    if changes:
        findings.extend(assess(asset.asset_type, changes)['findings'])
        has_change = True

Rotation got its own module that grades current state, with no _prev in
sight:

# rotation.py — grades a standing condition (now, no diff)
if (asset.raw_data or {}).get('_resource_type') == 'aws_secretsmanager_secret':
    findings.extend(assess_rotation(asset.raw_data, now, max_age)['findings'])

Both emit the same {'field', 'severity', 'reason'} shape, so the two streams
of findings merge into one row and sort into the same severity-ranked list. The
UI never knows one came from a diff and the other from a stopwatch.
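A rough sketch of that merge, assuming string severities and a hypothetical `SEVERITY_RANK` table (the tool's actual constants and ordering are not shown in this post):

```python
# Hypothetical ranking: lower number sorts first. Assumed, not taken from the tool.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def merge_findings(change_findings, standing_findings):
    """Combine both streams and sort most severe first.

    sorted() is stable, so findings of equal severity keep their input order.
    """
    combined = list(change_findings) + list(standing_findings)
    return sorted(combined, key=lambda f: SEVERITY_RANK[f["severity"]])

# One stream from the diff path, one from the standing-condition path.
change_findings = [
    {"field": "tags.Owner", "severity": "LOW", "reason": "Tag value changed"},
    {"field": "ingress", "severity": "CRITICAL", "reason": "Opened to 0.0.0.0/0"},
]
standing_findings = [
    {"field": "rotation_enabled", "severity": "HIGH",
     "reason": "Automatic rotation is disabled"},
]

ranked = merge_findings(change_findings, standing_findings)
for finding in ranked:
    print(finding["severity"], finding["field"])
```

Because both producers agree on the dict shape, the merge needs no knowledge of where a finding came from.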

What the grader actually looks at

Rotation posture isn't one boolean, it's a little ladder, and each rung is a
different severity:

def assess_rotation(raw_data, now, max_age_days):
    raw = raw_data or {}
    findings = []

    if not _truthy(raw.get('rotation_enabled')):
        findings.append({'field': 'rotation_enabled', 'severity': HIGH,
                         'reason': 'Automatic rotation is disabled'})
        return _wrap(findings)

    last = _parse(raw.get('last_rotated_date'))
    if last is None:
        findings.append({'field': 'last_rotated_date', 'severity': MEDIUM,
                         'reason': 'Rotation is enabled but the secret has never rotated'})
        return _wrap(findings)

    age_days = (now - last).days
    if age_days >= max_age_days * 2:
        findings.append({'severity': CRITICAL, ...})   # over 2× the limit
    elif age_days >= max_age_days:
        findings.append({'severity': HIGH, ...})        # past the limit
    return _wrap(findings)
  • Rotation disabled → HIGH. It's not overdue, it's structurally never coming. Worse than "late."
  • Enabled but never rotated → MEDIUM. Someone flipped the switch and walked away; the Lambda may be misconfigured.
  • At or past the limit → HIGH.
  • At or past twice the limit → CRITICAL. With the 90-day default, 90 days without a rotation is a mistake; 180 days is a dead process nobody's watching.

It's a pure function over a dict — no AWS calls, no models, fully testable with a
frozen now. Same discipline as the change rules: keep the judgement pure, keep
the I/O outside.
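Here is what a frozen-clock test of the ladder could look like. This is a self-contained sketch: the `assess_rotation` below is a compact stand-in for the grader above, and the string severity constants, the `_parse` helper, the `secret_aged` and `severities` test helpers, and the overdue reason text are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed severity constants; the real values are not shown in the post.
MEDIUM, HIGH, CRITICAL = "MEDIUM", "HIGH", "CRITICAL"

def _parse(value):
    """Hypothetical helper: ISO-8601 string -> aware datetime, or None."""
    return datetime.fromisoformat(value) if value else None

def assess_rotation(raw_data, now, max_age_days):
    """Compact stand-in for the grader; pure function, no I/O."""
    raw = raw_data or {}
    if not raw.get("rotation_enabled"):
        return {"findings": [{"field": "rotation_enabled", "severity": HIGH,
                              "reason": "Automatic rotation is disabled"}]}
    last = _parse(raw.get("last_rotated_date"))
    if last is None:
        return {"findings": [{"field": "last_rotated_date", "severity": MEDIUM,
                              "reason": "Rotation is enabled but the secret "
                                        "has never rotated"}]}
    age_days = (now - last).days
    if age_days >= max_age_days * 2:
        severity = CRITICAL
    elif age_days >= max_age_days:
        severity = HIGH
    else:
        return {"findings": []}
    # Illustrative reason text, not the tool's wording.
    return {"findings": [{"field": "last_rotated_date", "severity": severity,
                          "reason": f"Last rotated {age_days} days ago"}]}

NOW = datetime(2024, 7, 19, tzinfo=timezone.utc)  # frozen clock for every case

def secret_aged(days):
    """Posture dict whose last rotation happened `days` before NOW."""
    return {"rotation_enabled": True,
            "last_rotated_date": (NOW - timedelta(days=days)).isoformat()}

def severities(raw):
    return [f["severity"] for f in assess_rotation(raw, NOW, 90)["findings"]]

# One assertion per rung, plus the boundaries on either side of each threshold.
assert severities({"rotation_enabled": False}) == [HIGH]
assert severities({"rotation_enabled": True}) == [MEDIUM]
assert severities(secret_aged(89)) == []
assert severities(secret_aged(90)) == [HIGH]
assert severities(secret_aged(179)) == [HIGH]
assert severities(secret_aged(180)) == [CRITICAL]
```

Passing `now` in as an argument is what makes the boundary cases cheap to pin down; nothing here needs mocking or a patched clock.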

The scan captures posture, never the secret

The one thing I was paranoid about: a tool that reads your secrets to check
your secrets is a worse problem than the one it solves. Mine never touches a
value.

ListSecrets already returns the rotation metadata — RotationEnabled,
LastRotatedDate, NextRotationDate — so there's no GetSecretValue, not even
a DescribeSecret. One list call carries everything the grader needs:

sm.get_paginator('list_secrets')
# → RotationEnabled, LastRotatedDate, NextRotationDate, RotationRules ...
# never GetSecretValue. posture only, value never leaves AWS.

I capture whether it rotates, when it last did, and how often it's supposed to —
and nothing that would be dangerous to store. There's a moto-backed test whose
entire job is to assert the scanned record never contains the secret string.
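A minimal sketch of that capture step, assuming a boto3-style Secrets Manager client is passed in. The `POSTURE_FIELDS` allowlist, the function names, and the fake client are illustrative assumptions; in real use the argument would be `boto3.client("secretsmanager")`. The fake deliberately injects a `SecretString` key, which ListSecrets never returns, purely to show that an allowlist drops anything unexpected.

```python
# Assumed allowlist of ListSecrets entry fields worth storing.
POSTURE_FIELDS = ("ARN", "Name", "RotationEnabled", "LastRotatedDate",
                  "NextRotationDate", "RotationRules")

def extract_posture(entry: dict) -> dict:
    """Keep only allowlisted metadata so unexpected keys are never stored."""
    return {k: entry[k] for k in POSTURE_FIELDS if k in entry}

def scan_secret_posture(sm_client) -> list:
    """Walk every ListSecrets page; GetSecretValue is never called."""
    paginator = sm_client.get_paginator("list_secrets")
    records = []
    for page in paginator.paginate():
        for entry in page.get("SecretList", []):
            records.append(extract_posture(entry))
    return records

# Stand-in client so the sketch runs without AWS credentials.
class FakePaginator:
    def paginate(self):
        yield {"SecretList": [{
            "Name": "db-password",
            "RotationEnabled": True,
            "SecretString": "hunter2",  # injected on purpose; not a real ListSecrets field
        }]}

class FakeClient:
    def get_paginator(self, operation_name):
        assert operation_name == "list_secrets"
        return FakePaginator()

records = scan_secret_posture(FakeClient())
print(records)
```

An allowlist is a stricter guarantee than a denylist here: a field has to be named explicitly before it can ever reach storage.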

The detail I nearly got wrong: "Who changed this?"

Every risk row has a lazy "Who changed this?" button — click it and the tool
calls CloudTrail to name whoever made the change (I wrote about that
in Day 11).

For an overdue rotation, that button is a lie. There is no actor. Nobody did
anything — that's the whole finding. Asking CloudTrail "who caused this secret to
not rotate?" returns nothing, because non-events don't have culprits.

So the button is gated on the same has_change flag that the diff path sets:

findings.extend(assess(asset.asset_type, changes)['findings'])
has_change = True   # only change-based findings get an actor

A row that exists only because of a standing condition renders without the
attribution button. The shape of the risk decides whether "who did it?" is even a
coherent question — and for the absence of an event, it isn't.
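One way the gating could look at the row level. This is a sketch only: `build_risk_row` and the `show_who_changed` key are hypothetical names, not the tool's actual row model.

```python
def build_risk_row(asset_id, change_findings, standing_findings):
    """Assemble one UI row; attribution is offered only when a diff fired."""
    has_change = bool(change_findings)
    return {
        "asset_id": asset_id,
        "findings": list(change_findings) + list(standing_findings),
        # Non-events have no actor, so the button is hidden for them.
        "show_who_changed": has_change,
    }

# A row produced purely by a standing condition.
overdue_row = build_risk_row(
    "secret/db-password",
    change_findings=[],
    standing_findings=[{"field": "rotation_enabled", "severity": "HIGH",
                        "reason": "Automatic rotation is disabled"}],
)

# A row produced by an actual old -> new transition.
opened_row = build_risk_row(
    "sg-example",
    change_findings=[{"field": "ingress", "severity": "CRITICAL",
                      "reason": "Opened to 0.0.0.0/0"}],
    standing_findings=[],
)

print(overdue_row["show_who_changed"], opened_row["show_who_changed"])
```

A row that carries both kinds of finding still gets the button in this sketch, since at least one of its findings has an event behind it.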

Honest limits

  • The severity ladder is a heuristic. A 90-day default and a 2× critical cliff are opinions, not policy — hence the SECRET_ROTATION_MAX_AGE_DAYS setting.
  • It grades what ListSecrets reports. A rotation Lambda that "succeeds" while silently rotating to the same value would still look healthy. Posture, not proof.
  • Secrets Manager only. Rotation-as-a-standing-condition generalizes (IAM access keys, TLS certs, KMS key age) but I've only wired the one so far.

Takeaways

  • Not every risk is a change risk. Diff-based detection is structurally blind to conditions that are dangerous precisely because nothing is changing — an overdue rotation produces no old→new transition to grade.
  • When a new risk doesn't fit your existing pipeline, that's a signal it's a different question, not a harder version of the same one. Give it its own path instead of bending the diff to fit.
  • Grade standing conditions on absolute current state (now vs last_rotated_date), not on a snapshot delta.
  • Check posture without reading the secret — ListSecrets carries the rotation metadata, so you never call GetSecretValue.
  • Attribution only makes sense for events. Gate "who did this?" on whether a discrete change actually happened; non-events have no culprit.

This ships in a self-hosted tool that scans your live AWS, grades what drifted
and what's standing overdue, and never stores a secret value — open source
(MIT), one docker compose up: syncvey.com. What's the
most dangerous thing in your account right now that would never show up in a
diff because it's been quietly not changing for months?
