My infra tool knew what drifted, how bad, and who did it — then waited for someone to open a tab

#django #aws #devops #python

Over the last few weeks I taught my self-hosted drift detector some genuinely
useful tricks: it keeps a history so you can see the trend, it grades
each drift by security impact, and it can tell you who made the change via
CloudTrail. On paper, that's exactly what you want when infrastructure quietly
diverges from Terraform.

There was one problem, and it took me embarrassingly long to name it: all of
it was pull. You had to remember the tool existed, open it, and go look. On a
random Tuesday afternoon, nobody does. A signal nobody pulls is a signal that
doesn't exist.

So I made it push. A Monday-morning briefing that lands in Slack on its own.

What the briefing says

The whole point is synthesis — one message that answers what changed, how bad,
which way it's trending, and who to talk to:

Drift briefing — Production
3 drifted  ·  ▲ 2 since last week
🔴 1 critical  🟠 1 high  🟡 1 medium  ⚪ 0 low
─────────────
CRITICAL  web-sg (SG / prod)
Opened to the entire internet (0.0.0.0/0)
changed by tanaka via AuthorizeSecurityGroupIngress

None of that is new data. It's the drift history, the severity rules, and the
CloudTrail lookup I already had — assembled and delivered instead of sitting in
a tab. The building was easy. Realizing it needed to come to people was the
part I'd missed.
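
For a sense of how little assembly that is, here's a minimal sketch of a formatter. The digest shape (`system`, `total`, `previous_total`, `items`) and the name `format_briefing` are assumptions made for illustration, not the tool's actual `_format_slack`:

```python
SEVERITIES = ("critical", "high", "medium", "low")
ICONS = {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "⚪"}


def format_briefing(digest):
    """Render a digest dict (hypothetical shape) as a plain-text briefing."""
    delta = digest["total"] - digest["previous_total"]
    if delta > 0:
        trend = f"▲ {delta} since last week"
    elif delta < 0:
        trend = f"▼ {-delta} since last week"
    else:
        trend = "no change since last week"

    # Count findings per severity so the summary line always shows all four.
    counts = {sev: 0 for sev in SEVERITIES}
    for item in digest["items"]:
        counts[item["severity"]] += 1

    lines = [
        f"Drift briefing: {digest['system']}",
        f"{digest['total']} drifted  ·  {trend}",
        "  ".join(f"{ICONS[s]} {counts[s]} {s}" for s in SEVERITIES),
    ]
    # Worst findings first, so the top of the message is the part worth reading.
    for item in sorted(digest["items"], key=lambda i: SEVERITIES.index(i["severity"])):
        lines.append(f"{item['severity'].upper()}  {item['resource']}")
        lines.append(item["summary"])
        if item.get("actor") and item.get("event"):
            lines.append(f"changed by {item['actor']} via {item['event']}")
    return "\n".join(lines)
```

The attribution line is optional on purpose: a CloudTrail lookup can come back empty, and a briefing without a name is still better than no briefing.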

Rule one: shut up on a quiet week

The fastest way to get a notification muted is to send it when nothing happened.
So the briefing only fires if there's something to say:

digest = build_digest(system, days=days, attribute=True)
if not digest['has_data']:
    return False        # a clean week shouldn't ping anyone
return _post_to_slack(url, _format_slack(digest))

No drift, no message. The absence of a Monday briefing is the good news.
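
The delivery side is small too. Slack incoming webhooks accept a JSON POST with a `text` field, so a standard-library sketch is enough. The function names here are hypothetical stand-ins, not the tool's real `_post_to_slack`:

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build the POST for a Slack incoming webhook: a JSON body with a "text" field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(webhook_url, text, timeout=10):
    """Send the message; return True on HTTP 200, False on a network or HTTP error."""
    request = build_slack_request(webhook_url, text)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

Returning a boolean instead of raising keeps a Slack outage from turning into a failed scheduler job.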

The interesting part: a plugin that schedules its own job

Here's the architecture constraint I'd set myself earlier in this project: the
risk/attribution/digest stuff lives in an optional, detachable app, and the
core is not allowed to import it. Remove the app from INSTALLED_APPS and the
core should neither know nor care.

That's easy for a web view (route it, guard it with a flag). It's harder for a
background job. The scheduler lives in the core. How does the core register
a cron job that belongs to a plugin it's forbidden to import?

The answer is the same trick Django itself uses for a lot of things: discovery,
not import. The core asks every installed app "do you have any jobs for me?"
without knowing which apps those are:

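# core: collect job specs from any plugin that offers them
import logging

logger = logging.getLogger(__name__)

def plugin_scheduled_jobs():
    jobs = []
    for cfg in _plugin_app_configs():          # apps with syncvey_plugin = True
        getter = getattr(cfg, 'scheduled_jobs', None)
        if not callable(getter):
            continue                           # this plugin has no background jobs
        try:
            jobs.extend(getter() or [])        # a hook may return None or []
        except Exception:                      # a bad plugin must not block startup
            logger.exception('scheduled_jobs() failed for %s', cfg.name)
            continue
    return jobs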
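
To see the discovery logic in isolation, here's a Django-free sketch with stand-in config classes. Everything in it (`collect_jobs` and the four classes) is hypothetical. In a real project the iterable would come from `django.apps.apps.get_app_configs()`, and I'm assuming the flag check is what `_plugin_app_configs()` does:

```python
def collect_jobs(app_configs):
    """Gather job specs from any config that opts in and offers a hook."""
    jobs = []
    for cfg in app_configs:
        if not getattr(cfg, "syncvey_plugin", False):
            continue                      # not a plugin: ignore it
        getter = getattr(cfg, "scheduled_jobs", None)
        if not callable(getter):
            continue                      # plugin without background jobs
        try:
            jobs.extend(getter() or [])
        except Exception:
            continue                      # broken plugin: skip it, keep going
    return jobs


# Stand-in configs, just enough to exercise each branch.
class CoreApp:
    pass


class GoodPlugin:
    syncvey_plugin = True

    def scheduled_jobs(self):
        return [{"id": "drift_digest_weekly"}]


class DisabledPlugin:
    syncvey_plugin = True

    def scheduled_jobs(self):
        return None                       # e.g. the feature flag is off


class BrokenPlugin:
    syncvey_plugin = True

    def scheduled_jobs(self):
        raise RuntimeError("boom")


found = collect_jobs([CoreApp(), GoodPlugin(), DisabledPlugin(), BrokenPlugin()])
print([job["id"] for job in found])       # ['drift_digest_weekly']
```

The caller never mentions any of the four classes by name, which is the whole point: adding or removing a plugin changes the input list, not the collecting code.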

The core scheduler just iterates whatever comes back:

from .plugins import plugin_scheduled_jobs
for spec in plugin_scheduled_jobs():
    _scheduler.add_job(
        spec['func'], trigger=spec['trigger'], id=spec['id'],
        name=spec.get('name', spec['id']), jobstore='default',
        max_instances=1, coalesce=True, replace_existing=True,
    )

And the plugin advertises its job from its AppConfig, importing the trigger
class and its own digest module lazily inside the method, so nothing loads while
the flag is off and the core never names a plugin module:

class DriftRiskConfig(AppConfig):
    syncvey_plugin = True

    def scheduled_jobs(self):
        from django.conf import settings
        if not getattr(settings, 'DRIFT_DIGEST_ENABLED', False):
            return []                          # opt-in; silent by default
        from apscheduler.triggers.cron import CronTrigger
        from .digest import run_digest_job
        return [{
            'id': 'drift_digest_weekly',
            'name': 'Weekly drift briefing',
            'func': run_digest_job,
            'trigger': CronTrigger(day_of_week='mon', hour=9, minute=0),
        }]

Now the core scheduler runs a job it has never heard of, and deleting the plugin
deletes the job with it. No if plugin_installed: branches in the core, no
import, no coupling.

The gotcha that bit me: jobs are stored by reference

I use django-apscheduler, which persists jobs in the database so they survive
a restart. That persistence is the trap. apscheduler doesn't pickle your
function — it stores it by import path (syncvey_drift_risk.digest:run_digest_job)
and re-imports it when the job fires.

My first version made run_digest_job a closure inside scheduled_jobs(),
because it needed a bit of context. apscheduler couldn't serialize it: "cannot
be scheduled, it is not importable." The fix is boring but absolute — the job
has to be a module-level function anything can import by name. Any state it
needs, it looks up itself when it runs (it just iterates the systems that have a
Slack webhook). Once the job is a plain top-level function, the persistent store
is happy.
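
The rule is easy to check for yourself. This sketch is not apscheduler's code, only the same idea under my own hypothetical names: build a `module:qualname` reference, refuse anything that can't be found again by that name, and resolve the reference back the way a restart would:

```python
import importlib
import json


def reference_for(func):
    """Return 'module:qualname' for func, or raise ValueError if it can't be re-imported."""
    qualname = getattr(func, "__qualname__", "")
    if not qualname or "<locals>" in qualname or "<lambda>" in qualname:
        raise ValueError(f"{func!r} is not importable by name")
    return f"{func.__module__}:{qualname}"


def resolve(reference):
    """Turn 'module:qualname' back into the callable, as a restarted process would."""
    module_name, _, qualname = reference.partition(":")
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        obj = getattr(obj, part)
    return obj


def make_job(context):
    def run_digest_job():                 # closure: exists only inside make_job()
        return context
    return run_digest_job


print(reference_for(json.dumps))              # json:dumps
print(resolve("json:dumps") is json.dumps)    # True
try:
    reference_for(make_job({"days": 7}))
except ValueError as exc:
    print("rejected:", exc)
```

The closure's qualified name contains `<locals>`, so no import statement in a fresh process could ever reach it. That's why the context has to be looked up at run time rather than captured.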

The opt-in flag matters for the same reason a clean week stays silent: a fresh
docker compose up shouldn't start firing outbound Slack messages. scheduled_jobs()
returns an empty list until you set DRIFT_DIGEST_ENABLED=true, and even then the
job only posts for systems that have a Slack webhook configured.
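
One way to wire that flag into settings, assuming it arrives as an environment variable (the `env_flag` helper is hypothetical, and the tool may read it differently):

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment; unset means `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# settings.py (assumed wiring): off unless someone turns it on explicitly
DRIFT_DIGEST_ENABLED = env_flag("DRIFT_DIGEST_ENABLED")
```

Parsing against an explicit set of truthy strings avoids the classic trap where `bool("false")` is `True` because any non-empty string is truthy.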

Takeaways

A monitoring feature you have to remember to open isn't done. Pull tells you nothing at 2pm on a Tuesday; push meets people where they already are.
Send nothing on a quiet week. The most trustworthy alert channel is the one that only speaks when it matters.
To let an optional plugin extend the core — even with a background job — use discovery (getattr(app_config, 'hook', None)), not import. The core asks; it never names the plugin.
If your scheduler persists jobs (django-apscheduler and friends), the job must be a top-level importable function: it's stored by import path, not pickled. Closures and lambdas have no such path, so they fail at serialization time.

This is the weekly briefing from a self-hosted tool that tracks how your live AWS
infrastructure drifts from Terraform. It's open source (MIT) and starts with one
docker compose up:
syncvey.com. Where do your infra alerts actually land —
Slack, email, a dashboard you open on purpose, or a channel everyone muted months
ago?