Configuration drift is one of those problems that seems minor—until it isn’t.
A “temporary” security group rule stays open for weeks.
A manual change fixes a production incident but never makes it back to Terraform.
An EC2 instance gets a one-off flag “just for now” and quietly becomes the special case nobody wants to touch.
Over time, these tiny deviations compound into outages, security gaps, and a lot of “who changed what, when?” energy.

This article walks through how I designed and built a lightweight Config Drift Detector for AWS that:
- Takes regular snapshots of your infrastructure.
- Compares them against a moving baseline.
- Surfaces drift events in a Next.js dashboard.
- Sends Slack alerts for high/critical changes.
High-level architecture
At a glance, the architecture looks like this:
- AWS Services (e.g., EC2, Security Groups) are sampled on a schedule.
- A Snapshot Lambda writes raw JSON snapshots to S3 and Supabase/PostgreSQL.
- A Detect Lambda compares the latest snapshot to the previous baseline to detect drift.
- An Alert Lambda writes drift events, updates baselines, and optionally sends Slack alerts.
- A Next.js dashboard polls a lightweight API backed by Supabase/PostgreSQL to show drifts and baselines.
The rest of the article breaks this down from the perspective of an SRE/DevOps engineer who wants fast feedback, clear audit trails, and a UI that doesn’t feel like a side project.
Design goals and constraints
When I scoped this project, I set a few explicit goals:
- Detect meaningful drift, not every single field that changes.
- Keep the architecture boring and observable: managed services over bespoke infra.
- Make the UI operator-friendly: think SRE console, not toy dashboard.
- Be small enough to build solo, but credible enough to show to senior engineers or hiring managers.
From there, the architecture fell naturally into four pieces:
- Snapshot pipeline.
- Drift detection engine.
- Alerting and audit trail.
- Web dashboard.
1. Snapshot pipeline
What gets snapshotted?
To start, I focused on a narrow but high-impact slice of AWS resources:
- EC2 instances: lifecycle, instance type, tags.
- Security groups: inbound/outbound rules and attached resources.
These are common sources of “quick fixes” and “just for debugging” changes that later turn into security and reliability problems.
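For diffing purposes, each of these resources eventually gets normalized into a stable shape (more on the pipeline below). Here's an illustrative TypeScript sketch of what those normalized types might look like; all field names are assumptions, not the project's exact schema:

```typescript
// Illustrative normalized shapes; field names are assumptions, not the
// project's exact schema. The goal is a stable, diff-friendly structure.

interface Ec2InstanceSnapshot {
  instanceId: string;           // stable identifier used for matching
  state: string;                // lifecycle, e.g. "running" | "stopped"
  instanceType: string;         // e.g. "t3.micro"
  tags: Record<string, string>; // flattened Key/Value pairs
}

interface SecurityGroupRule {
  protocol: string;             // "tcp" | "udp" | "-1" (all traffic)
  fromPort: number | null;      // null for protocols without ports
  toPort: number | null;
  cidrBlocks: string[];         // e.g. ["10.0.0.0/16"]
}

interface SecurityGroupSnapshot {
  groupId: string;              // stable identifier used for matching
  inboundRules: SecurityGroupRule[];
  outboundRules: SecurityGroupRule[];
  attachedResourceIds: string[]; // resources using this group
}
```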
How snapshots flow through the system
The snapshot pipeline revolves around a scheduled Lambda:
- **Trigger:** an EventBridge rule runs every 30 minutes (wiring sketched below).
- **Snapshot Lambda** (core loop sketched below):
  - Calls AWS APIs to list EC2 instances and security groups.
  - Normalizes the data into a stable JSON shape.
  - Writes each snapshot to:
    - **S3:** raw, timestamped JSON (e.g., `YYYY-MM-DD/HH-MM-SS.json`).
    - **Supabase/PostgreSQL:** summarized snapshot metadata for faster queries later.
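As one way to wire up the schedule, here's a hedged sketch using AWS CDK in TypeScript (chosen here purely for language consistency; the article doesn't prescribe an IaC tool, and the construct names and asset path are placeholders):

```typescript
// Illustrative CDK wiring for the schedule; names and paths are placeholders.
import * as cdk from "aws-cdk-lib";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

export class DriftDetectorStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const snapshotFn = new lambda.Function(this, "SnapshotFn", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda/snapshot"), // placeholder path
    });

    // EventBridge rule: trigger the Snapshot Lambda every 30 minutes.
    new events.Rule(this, "SnapshotSchedule", {
      schedule: events.Schedule.rate(cdk.Duration.minutes(30)),
      targets: [new targets.LambdaFunction(snapshotFn)],
    });
  }
}
```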
This gives you:
- A cheap, append-only log of the world as it looked at each point in time (S3).
- A queryable state for dashboards and drift detection (Postgres).
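And here's a minimal sketch of the Snapshot Lambda's core loop for the EC2 half, using AWS SDK v3. The bucket name, the normalized shape, and the omission of the Supabase write are all simplifications:

```typescript
// Core loop of the Snapshot Lambda (EC2 half only). Bucket name and the
// normalized shape are placeholders; the Supabase write is omitted.
import { EC2Client, paginateDescribeInstances } from "@aws-sdk/client-ec2";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const ec2 = new EC2Client({});
const s3 = new S3Client({});
const BUCKET = process.env.SNAPSHOT_BUCKET ?? "drift-snapshots"; // placeholder

export const handler = async (): Promise<void> => {
  // 1. Collect: page through every EC2 instance in the account/region.
  const instances: Record<string, unknown>[] = [];
  for await (const page of paginateDescribeInstances({ client: ec2 }, {})) {
    for (const reservation of page.Reservations ?? []) {
      for (const instance of reservation.Instances ?? []) {
        // 2. Normalize into a stable, diff-friendly shape.
        instances.push({
          instanceId: instance.InstanceId,
          state: instance.State?.Name,
          instanceType: instance.InstanceType,
          tags: Object.fromEntries(
            (instance.Tags ?? []).map((t) => [t.Key ?? "", t.Value ?? ""])
          ),
        });
      }
    }
  }

  // 3. Persist: raw, timestamped JSON in S3 (YYYY-MM-DD/HH-MM-SS.json).
  const iso = new Date().toISOString(); // e.g. "2025-01-05T14:30:00.000Z"
  const key = `${iso.slice(0, 10)}/${iso.slice(11, 19).replace(/:/g, "-")}.json`;
  await s3.send(
    new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: JSON.stringify({ takenAt: iso, instances }),
      ContentType: "application/json",
    })
  );
};
```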
2. Drift detection engine
Baselines vs snapshots
The system uses a simple mental model:
- A snapshot is “what the world looks like now”.
- A baseline is “what we expect the world to look like”.
Every time a new snapshot arrives, the Detect Lambda compares it to the current baseline:
- For each resource (instance, security group, etc.):
  - Match it by a stable identifier (e.g., instance ID).
  - Compare the fields that matter for reliability/security.
  - Ignore noisy, fast-changing fields (e.g., some timestamps).
The output is a set of drift events:
- `ADDED`: resource exists in the snapshot but not in the baseline.
- `REMOVED`: resource exists in the baseline but not in the snapshot.
- `MODIFIED`: resource exists in both, but relevant fields differ.
Each drift event carries:
- Resource metadata (ID, type, environment).
- Which fields changed (before vs after).
- A severity classification (more on that below).
Once detection is done, the baseline is updated forward so the system tracks drift incrementally rather than replaying from the beginning every time.
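In code, this comparison reduces to two map lookups plus a field-level diff. A minimal TypeScript sketch, where the event shape and the list of relevant fields are illustrative rather than the project's exact implementation:

```typescript
// Minimal drift diff; the event shape and RELEVANT_FIELDS are illustrative.
type Resource = { id: string } & Record<string, unknown>;

type DriftType = "ADDED" | "REMOVED" | "MODIFIED";

interface DriftEvent {
  type: DriftType;
  resourceId: string;
  changedFields?: { field: string; before: unknown; after: unknown }[];
}

// Fields worth diffing; noisy, fast-changing fields are simply left out.
const RELEVANT_FIELDS = ["state", "instanceType", "tags"];

function detectDrift(baseline: Resource[], snapshot: Resource[]): DriftEvent[] {
  const baseById = new Map(baseline.map((r): [string, Resource] => [r.id, r]));
  const snapById = new Map(snapshot.map((r): [string, Resource] => [r.id, r]));
  const events: DriftEvent[] = [];

  for (const [id, current] of snapById) {
    const previous = baseById.get(id);
    if (!previous) {
      events.push({ type: "ADDED", resourceId: id });
      continue;
    }
    // Deep-compare only the relevant fields (JSON round-trip for brevity).
    const changedFields = RELEVANT_FIELDS.flatMap((field) =>
      JSON.stringify(previous[field]) !== JSON.stringify(current[field])
        ? [{ field, before: previous[field], after: current[field] }]
        : []
    );
    if (changedFields.length > 0) {
      events.push({ type: "MODIFIED", resourceId: id, changedFields });
    }
  }

  for (const id of baseById.keys()) {
    if (!snapById.has(id)) {
      events.push({ type: "REMOVED", resourceId: id });
    }
  }
  return events;
}
```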
3. Alerting and severity
Not all drift is created equal. Changing a tag is not the same as opening SSH to the world.
To make alerts meaningful, drift events are classified by severity:
- `CRITICAL`: Security group changes that materially expand exposure (e.g., `0.0.0.0/0` on sensitive ports).
- `HIGH`: EC2 changes that alter lifecycle or network placement in risky ways.
- `MEDIUM`: Configuration changes that might affect behavior but aren’t obviously dangerous.
- `LOW`: Tag-only changes and other low-risk metadata updates.
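One way to express this policy is a small pure function evaluated per change. A sketch, assuming example rules and a hypothetical sensitive-port list rather than the project's exact policy:

```typescript
// Example severity policy; the rules and port list are illustrative only.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

const SENSITIVE_PORTS = new Set([22, 3389, 5432]); // SSH, RDP, Postgres

interface RuleAfter {
  cidrBlocks?: string[];
  fromPort?: number | null;
}

interface Change {
  resourceType: "ec2" | "security-group";
  field: string;  // which normalized field changed
  after: unknown; // the new value of that field
}

function classifySeverity(change: Change): Severity {
  if (change.resourceType === "security-group") {
    // Does any new rule expose a sensitive port to the whole internet?
    const rules = Array.isArray(change.after)
      ? (change.after as RuleAfter[])
      : [];
    const worldOpen = rules.some(
      (r) =>
        (r.cidrBlocks ?? []).includes("0.0.0.0/0") &&
        r.fromPort != null &&
        SENSITIVE_PORTS.has(r.fromPort)
    );
    return worldOpen ? "CRITICAL" : "MEDIUM";
  }
  if (change.field === "tags") return "LOW"; // tag-only metadata change
  if (change.field === "state" || change.field === "subnetId") {
    return "HIGH"; // lifecycle or network placement change
  }
  return "MEDIUM";
}
```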
The Alert Lambda is responsible for:
- Writing drift events into Supabase/PostgreSQL for later querying.
- Sending Slack notifications for `HIGH` and `CRITICAL` drifts:
  - Channel: e.g., `#infra-alerts`.
  - Message includes: resource, environment, severity, and a short description.
This keeps the Slack noise under control while still providing a tight feedback loop for changes that actually matter.
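The notification itself can be a single incoming-webhook call. A minimal sketch, assuming a `SLACK_WEBHOOK_URL` environment variable on the Alert Lambda and a Node 18+ runtime (for the built-in `fetch`); the message layout is just one option:

```typescript
// Minimal Slack alert via an incoming webhook. SLACK_WEBHOOK_URL is assumed
// to be configured on the Alert Lambda; the message layout is illustrative.
async function sendSlackAlert(event: {
  severity: "HIGH" | "CRITICAL";
  resourceId: string;
  environment: string;
  description: string;
}): Promise<void> {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return; // Slack alerting is optional

  const text = [
    `:rotating_light: *${event.severity}* drift detected`,
    `• Resource: \`${event.resourceId}\``,
    `• Environment: ${event.environment}`,
    `• ${event.description}`,
  ].join("\n");

  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) {
    throw new Error(`Slack webhook failed: ${res.status}`);
  }
}
```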
4. The Next.js dashboard
The dashboard is intentionally simple, but optimized for SRE/DevOps workflows rather than demos.
Key views
The app exposes three main pages:
- **Dashboard:**
  - High-level stats: number of active drifts, baselines, and monitored environments.
  - Recent drifts, sorted by time and severity.
  - Baseline overview (which environments are covered, which baselines are stale).
- **Drifts:**
  - Table of drift events with:
    - Severity chips.
    - Resource and environment.
    - Type of drift (`ADDED`, `REMOVED`, `MODIFIED`).
    - Detected time.
  - Filters for severity, status, and environment.
- **Baselines:**
  - List of baselines with:
    - Name, environment.
    - Status (Active / Stale / Archived).
    - Last updated time.
  - Links into the Drifts view filtered by baseline.
Data flow
The dashboard queries Supabase/PostgreSQL via a light API layer:
- Fetch lists of drifts and baselines.
- Support simple aggregation for dashboard metrics (e.g., count of active drifts).
- Poll frequently enough to make the UI feel “live” without hammering the backend.
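To give a feel for how thin that API layer can be, here's a sketch of a Next.js route handler backed by Supabase; the table and column names are assumptions about the schema:

```typescript
// app/api/drifts/route.ts -- illustrative Next.js route handler.
// Table and column names are assumptions about the schema.
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function GET(request: Request): Promise<NextResponse> {
  const { searchParams } = new URL(request.url);
  const severity = searchParams.get("severity"); // optional filter

  let query = supabase
    .from("drift_events")
    .select("id, resource_id, environment, severity, drift_type, detected_at")
    .order("detected_at", { ascending: false })
    .limit(50);

  if (severity) {
    query = query.eq("severity", severity);
  }

  const { data, error } = await query;
  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }
  return NextResponse.json({ drifts: data });
}
```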
The focus is on operational clarity. It should be easy to answer:
- “What changed recently?”
- “Is this environment drifting more than others?”
- “Which baselines are out of date?”
Why this architecture?
This design deliberately avoids premature complexity:
- **Serverless for cadence-based work:** Lambdas plus an EventBridge scheduler are a natural fit for “run every N minutes and compare snapshots”.
- **S3 + Postgres** gives both durability and queryability:
  - S3 for raw history.
  - Postgres for fast reads and simple aggregations.
- **Next.js dashboard:**
  - Easy to deploy.
  - Easy to iterate on UX.
  - Pairs well with Supabase as a backend.
At the same time, it leaves room to grow:
- Add more resource types beyond EC2 and security groups.
- Introduce per-environment baselines and multi-account support.
- Expand the dashboard with timelines, diff views, and richer filters.
Future improvements
There are several natural extensions to this architecture:
- Better diff views: show structured diffs (field-level before/after) in the UI, not just “modified”.
- Alert policies: configurable rules to decide which drifts should alert where (Slack, email, etc.).
- Multi-cloud support: abstract snapshot/detect logic to handle other providers.
- Drift remediation hooks: for certain classes of drift, trigger runbooks or automated remediation.
The current version focuses on the basics: detect, classify, alert, and visualize. That’s already enough to catch the most painful “someone changed prod” issues and to tell a coherent story in a portfolio or blog post.
Wrapping up
Config Drift Detector started as a way to make configuration changes more visible, but it also became a nice exercise in small, focused architecture:
- One clear data flow from AWS → snapshots → drift detection → alerts → dashboard.
- Minimal moving parts, each doing one job well.
- A UI that reflects how operators actually investigate and respond to drift.
If you’re interested in configuration management, SRE tooling, or just want a portfolio project that goes beyond CRUD, building something like this is a great way to explore the intersection of cloud architecture, observability, and developer experience.
