How I Taught My Incident Alerts to Say "This Broke 3 Minutes After Your Last Deploy"

#cicd #monitoring #sre #devops

You're staring at a P95 latency spike.

The alert says: "Database pool exhausted. P95: 2847ms."
You know what broke. You don't know why.
So you open your git log, check when the spike started, scroll through commits, and try to figure out what changed in the 10 minutes before everything went sideways.
That archaeology takes 20 minutes on a good day. At 2am it takes longer.

The Problem with Context-Free Alerts
Most incident alerts are great at telling you the “what”. None of them tell you the “when” in relation to your codebase.
The question every engineer asks during an incident isn't "what is the P95?" — they already know that. It's "Did we just deploy something?"

The Insight: Incidents Have a Deployment Shadow
The way I see it, the majority of production incidents fall into one of two categories:
• Infrastructure events — upstream dependency failure, Redis outage, traffic spike
• Deployment shadows — something changed in the last deploy that didn't show up in testing

For deployment shadows, the fastest path to resolution is knowing exactly what changed and when, down to the commit level.
If your alert says:
Database pool exhausted (P95: 2847ms)
Recent deployments before incident:
3m ago — a1b2c3d: "Fix checkout query isolation level" (John, +12/-3)
1 recent commit touched database/query files
You've just saved 20 minutes of log archaeology.

How to Build It
The implementation is simpler than it sounds. Three components:
• A commit store — Redis sorted set, scored by timestamp
• A GitHub webhook — receives push events, stores commits
• An incident correlator — maps incident start time to nearby commits

The Commit Store
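import json
from redis import Redis

redis = Redis()  # assumes a reachable Redis on the default host and port

def store_commit(tenant_id, sha, message, author, timestamp, files_changed):
    key = f"orchestrator:commits:{tenant_id}"
    entry = json.dumps({"sha": sha, "message": message, "author": author, "files_changed": files_changed})  # assumed member format
    redis.zadd(key, {entry: timestamp})  # score = commit timestamp (epoch seconds)
    redis.expire(key, 86400 * 7)  # 7 day TTL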
A Redis sorted set gives you O(log N) insertion and O(log N + K) range queries — perfect for "give me commits in the 10 minutes before this timestamp."
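Here is a minimal sketch of that range query, which is also the core of the incident correlator. It assumes each sorted-set member is a JSON string scored by epoch seconds; the function name `commits_before` and the 600-second default are illustrative, not AlertEngine's actual API. The `client` argument is any object exposing redis-py's `zrangebyscore`.

```python
import json

LOOKBACK_SECONDS = 600  # 10-minute window, matching the example in the text


def commits_before(client, tenant_id, incident_ts, lookback=LOOKBACK_SECONDS):
    """Return commits scored in [incident_ts - lookback, incident_ts], newest first."""
    key = f"orchestrator:commits:{tenant_id}"
    # ZRANGEBYSCORE returns (member, score) pairs when withscores=True
    rows = client.zrangebyscore(
        key, incident_ts - lookback, incident_ts, withscores=True
    )
    commits = []
    for member, score in rows:
        commit = json.loads(member)  # assumed: members are JSON (str or bytes)
        commit["age_seconds"] = int(incident_ts - score)
        commits.append(commit)
    # Smallest age first, so the most recent deploy leads the alert
    commits.sort(key=lambda c: c["age_seconds"])
    return commits
```

With a real connection this would be called as `commits_before(Redis(), "acme", incident_start)`, where `incident_start` is the epoch time the alert fired.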

The GitHub Webhook
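from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/commits/webhook")
async def github_webhook(request: Request):
    body = await request.json()
    # A GitHub push event carries its commits in a "commits" array
    for commit in body.get("commits", []):
        store_commit(...)  # arguments elided; map each commit's fields to store_commit's parameters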
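The mapping from a push payload to `store_commit`'s parameters might look like the sketch below. The field names (`id`, `message`, `author.name`, `timestamp`, `added`, `modified`, `removed`) are those of GitHub's push event; the helper `extract_commits` and its output shape are assumptions for illustration, not AlertEngine code.

```python
from datetime import datetime


def extract_commits(payload):
    """Flatten a GitHub push payload into dicts shaped like store_commit's arguments."""
    extracted = []
    for commit in payload.get("commits", []):
        # GitHub sends ISO 8601 timestamps; rewrite a trailing "Z" so
        # datetime.fromisoformat accepts it on Python versions before 3.11
        parsed = datetime.fromisoformat(commit["timestamp"].replace("Z", "+00:00"))
        files = (
            commit.get("added", [])
            + commit.get("modified", [])
            + commit.get("removed", [])
        )
        extracted.append({
            "sha": commit["id"],
            # Keep only the subject line of the commit message
            "message": (commit.get("message") or "").split("\n", 1)[0],
            "author": commit.get("author", {}).get("name", "unknown"),
            "timestamp": parsed.timestamp(),  # epoch seconds, usable as a sorted-set score
            "files_changed": files,
        })
    return extracted
```

Inside the handler, each returned dict could then be passed along as `store_commit(tenant_id, **commit)`, with `tenant_id` read from wherever your deployment carries it.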

Injecting Context into AI Diagnosis
Without commit context, Claude sees raw metrics. With commit context, Claude sees the metrics AND what changed 3 minutes before the incident — shifting the diagnosis from "likely database connection issue" to "checkout query isolation level change likely caused connection pool exhaustion."
That's a different quality of diagnosis entirely.
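One way to assemble that context is sketched below. It renders recent commits as a text block and flags any that touched database-related paths, then appends the block to the metrics summary. The keyword list, the function names, and the prompt wording are all assumptions for illustration; the actual prompt used by AlertEngine is not shown in this post.

```python
DB_HINTS = ("database", "query", "migration")  # assumed path keywords


def touches_database(files):
    """True if any changed path contains one of the assumed DB keywords."""
    return any(hint in path.lower() for path in files for hint in DB_HINTS)


def render_commit_context(commits, incident_ts):
    """Render commits (dicts with sha, message, author, timestamp, files_changed)."""
    if not commits:
        return "No deployments in the lookback window."
    lines = ["Recent deployments before incident:"]
    for c in sorted(commits, key=lambda c: c["timestamp"], reverse=True):
        minutes = int((incident_ts - c["timestamp"]) // 60)
        lines.append(
            f'{minutes}m ago - {c["sha"][:7]}: "{c["message"]}" ({c["author"]})'
        )
    db_count = sum(1 for c in commits if touches_database(c["files_changed"]))
    if db_count:
        noun = "commit" if db_count == 1 else "commits"
        lines.append(f"{db_count} recent {noun} touched database/query files")
    return "\n".join(lines)


def build_diagnosis_prompt(metrics_summary, commits, incident_ts):
    """Combine metrics and commit context into a single prompt string."""
    context = render_commit_context(commits, incident_ts)
    return (
        f"Incident metrics:\n{metrics_summary}\n\n{context}\n\n"
        "Treat the commits as evidence, not proof, and state your confidence."
    )
```

Keeping the rendering separate from the prompt means the same text block can be reused in the alert message itself.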

What the WhatsApp Message Looks Like
⚠️ Action Recommended
Service: Payment API
Issue: Database pool exhausted — P95 2.8s
Likely cause: Checkout query isolation level change
(commit a1b2c3d, 3m ago)
Confidence: 87%
👉 Approve fix: [link]
Nothing will run without your approval.

Three Setup Options
• GitHub webhook (recommended) — POST /commits/webhook with header X-AlertEngine-Tenant-ID
• Manual push from CI — curl from your GitHub Actions workflow
• GitHub API polling — set GITHUB_TOKEN and GITHUB_REPO, AlertEngine fetches automatically
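For the manual push option, a CI step could build the request like this. The endpoint path and the `X-AlertEngine-Tenant-ID` header come from the list above; the payload shape is an assumption that mirrors a GitHub push event's `commits` array (which is what the webhook handler reads), and `build_commit_push` is a hypothetical helper. Check the AlertEngine docs for the exact body it expects.

```python
import json
import urllib.request


def build_commit_push(base_url, tenant_id, sha, message, author, timestamp_iso, files):
    """Build (but do not send) a POST to the commit webhook."""
    # Assumed body: a single-commit list shaped like GitHub's push event
    payload = {
        "commits": [
            {
                "id": sha,
                "message": message,
                "author": {"name": author},
                "timestamp": timestamp_iso,
                "modified": files,
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/commits/webhook",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-AlertEngine-Tenant-ID": tenant_id,
        },
        method="POST",
    )
```

Sending it is one more call, `urllib.request.urlopen(req, timeout=10)`, run as a step after the deploy job succeeds.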

The Broader Pattern
This feature is an instance of a broader pattern: enrich your incident context with everything that changed recently, not just the metrics at the moment of failure.
Future extensions of the same idea:
• Feature flag changes in the 10 minutes before an incident
• Infrastructure changes (Terraform applies, Docker image updates)
• Database migration executions
• Config changes

The alert that says, "Here's what broke, here's what changed right before it broke, here's the fix"—that's the alert worth building for.

─────────────────────────────────────────
This is now live in FastAPI AlertEngine as commit_context.py.
GitHub: github.com/Tandem-Media/fastapi-alertengine
Docs: tandem-media.github.io/fastapi-alertengine/
pip install fastapi-alertengine

Top comments (2)

Harjot Singh • May 31

"Most alerts tell you the what, none tell you the when in relation to your codebase" nails the actual gap. The 2am archaeology you describe (open git log, find when the spike started, scroll commits) is the single most repeated motion in incident response, and it's pure manual correlation a machine should already have done. Stamping the alert with "this broke 3 minutes after deploy abc123" collapses the first and usually most valuable diagnostic step into the notification itself, because deploy-correlation is the highest-prior-probability cause by a mile. The thing I'd guard against as this gets smarter: temporal correlation is a strong hint, not proof; plenty of incidents fire minutes after an unrelated deploy and the real cause is a slow-burn leak or upstream change. So the win is "here's the deploy that most likely did it, with the diff attached," surfacing the suspect without foreclosing the investigation. That surface-the-evidence-don't-assert-the-verdict line is exactly how I think about agent-assisted ops in Moonshift. Are you correlating purely on deploy timing, or also weighting which service the deploy touched against where the alert fired?

Lenard Francis • May 31

"Surface the evidence; don't assert the verdict"—that's exactly the line we're trying to hold.

Right now the correlation is purely temporal — commits within 10 minutes before the incident start timestamp, with a flag if any touched database/query/migration files. It's a strong prior, not a conclusion.

The Claude prompt receives both the metrics and the commit context, and the confidence score reflects whether the evidence is actually coherent. A deploy that touched auth middleware showing up 3 minutes before a latency spike on the payment endpoint gets lower confidence than the same deploy touching the payment query directly.

Your question about service-level weighting is the right next step. We have the alert's origin (which endpoint is spiking) and the commit's changed files — mapping those two together to produce a "blast radius overlap" score is on the roadmap but not yet built.

The slow-burn leak case is the harder problem. Temporal correlation would miss it entirely. That's where the baseline memory helps: if P95 has been climbing for 3 hours rather than spiking in the last 5 minutes, the deploy correlation gets deprioritized in favor of "gradual resource exhaustion" patterns.

What's the Moonshift approach to distinguishing deploy-correlated spikes from slow burns?