Sahil Singh

Posted on • Originally published at glue.tools

Incident Prevention: Using Code Intelligence for Proactive Reliability

Your incident response process is polished. PagerDuty alerts, runbooks, blameless post-mortems, action items tracked in Jira. You're great at responding to incidents.

But you're still having them. Because incident response is reactive. Incident prevention is proactive — and most teams don't have the tools for it.

The Prevention Stack

Pre-Commit: Blast Radius Analysis

Before a developer commits, show them what their change affects. Not just the files in the PR — the transitive dependencies, downstream consumers, and cross-service impacts.
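One way to sketch this: walk a reverse dependency graph (module → its direct downstream consumers) outward from the changed files. This is a minimal illustration with hypothetical module names, not a real extraction pipeline — in practice the graph would come from import analysis or a service mesh.

```python
from collections import deque

def blast_radius(changed, reverse_deps):
    """Return every module transitively affected by the changed set.

    reverse_deps maps a module to the modules that depend on it
    (its direct downstream consumers).
    """
    affected = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for consumer in reverse_deps.get(module, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

# Hypothetical graph: session config feeds auth and websocket
# services; the websocket service feeds the gateway.
reverse_deps = {
    "session_config": ["auth_service", "websocket_service"],
    "websocket_service": ["gateway"],
}

print(sorted(blast_radius({"session_config"}, reverse_deps)))
```

Note that `gateway` shows up even though nothing in it changed — it is two hops downstream, which is exactly the kind of impact a PR diff alone never shows.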

Pre-Merge: Historical Risk Assessment

Before merging, check: "Have similar changes in this area caused incidents before?" Git history knows. Surface past regressions, reverts, and hotfixes in the affected code paths.
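A rough version of this check just scans commit subjects for remediation language in the touched paths. The sketch below works on pre-parsed `(subject, paths)` tuples — the kind of data you could get from `git log --name-only --pretty=%s` — and the marker list is an assumption you would tune for your team's conventions.

```python
RISK_MARKERS = ("revert", "hotfix", "rollback", "regression")

def historical_risk(commits, touched_paths):
    """Return subjects of past remediation commits that overlap
    the paths touched by the current change.

    commits: list of (subject, list_of_paths) tuples.
    """
    touched = set(touched_paths)
    hits = []
    for subject, paths in commits:
        overlaps = touched & set(paths)
        if overlaps and any(m in subject.lower() for m in RISK_MARKERS):
            hits.append(subject)
    return hits

# Hypothetical history for the area this PR touches.
commits = [
    ("Revert \"Tune session TTL\"", ["config/session.yaml"]),
    ("Add session metrics", ["config/session.yaml", "metrics.py"]),
    ("Hotfix: websocket reconnect storm", ["ws/reconnect.py"]),
]

print(historical_risk(commits, ["config/session.yaml"]))
```

Even one hit is a strong signal: this area has bitten someone before, so the merge deserves a closer look.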

Pre-Deploy: Change Coupling Validation

Before deploying, verify: "Are all coupled changes included?" If files A and B always change together (detected through change coupling analysis) but only A is in this deployment, flag the risk.
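Change coupling falls out of commit history: count how often each file pair appears in the same commit, and treat a pair as coupled when its co-change ratio clears a threshold. The 0.7 cutoff and the file names below are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def coupled_pairs(history, threshold=0.7):
    """Find file pairs that usually change together.

    history: list of per-commit file sets. A pair is coupled when it
    co-changes in at least `threshold` of the commits touching either file.
    """
    pair_counts, file_counts = Counter(), Counter()
    for files in history:
        file_counts.update(files)
        pair_counts.update(combinations(sorted(files), 2))
    coupled = {}
    for (a, b), together in pair_counts.items():
        either = file_counts[a] + file_counts[b] - together
        if together / either >= threshold:
            coupled[(a, b)] = together / either
    return coupled

def missing_coupled(deploy_files, coupled):
    """Flag coupled counterparts absent from this deployment."""
    deploy = set(deploy_files)
    flagged = {b for a, b in coupled if a in deploy and b not in deploy}
    flagged |= {a for a, b in coupled if b in deploy and a not in deploy}
    return flagged

# Hypothetical history: A and B always ship together.
history = [{"a.py", "b.py"}, {"a.py", "b.py"}, {"a.py", "b.py"}, {"c.py"}]
coupled = coupled_pairs(history)
print(missing_coupled(["a.py"], coupled))  # b.py is coupled but missing
```

If the deployment includes only `a.py`, the gate flags `b.py` before the release goes out instead of after the pager fires.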

Post-Deploy: Knowledge-Aware Alerting

After deploying, route alerts to the right people. Not whoever happens to be on call — the engineer who actually understands the changed code. Knowledge-based alerting can cut mean-time-to-resolution by 40-60%.
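A simple routing heuristic: score engineers by how recently and how often they touched the files in the deploy, and fall back to on-call when nobody qualifies. The recency weighting (1/rank) and the authorship data here are assumptions for illustration; real systems would pull this from git blame or code-ownership metadata.

```python
from collections import Counter

def route_alert(changed_files, authorship, fallback="on-call"):
    """Pick the engineer most familiar with the changed files.

    authorship: maps file path -> recent commit authors, newest first.
    Authors are weighted by recency (newest commit counts most).
    """
    scores = Counter()
    for path in changed_files:
        for rank, author in enumerate(authorship.get(path, [])):
            scores[author] += 1.0 / (rank + 1)
    if not scores:
        return fallback
    return scores.most_common(1)[0][0]

# Hypothetical recent-author data for the deployed files.
authorship = {
    "ws/reconnect.py": ["alice", "bob"],
    "config/session.yaml": ["alice"],
}

print(route_alert(["ws/reconnect.py", "config/session.yaml"], authorship))
```

Alice wrote the most recent changes to both files, so the 3 AM page goes to the person who can actually diagnose it — not to whoever drew the rotation.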

From Reactive to Proactive

The shift: instead of asking "what happened?" after an incident, ask "what could happen?" before shipping.

Code intelligence makes this possible by turning codebase knowledge into risk scores that surface at each stage of the development pipeline.
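Concretely, "turning knowledge into risk scores" can be as simple as blending the stage signals into a single 0-1 number. The weights and squashing scales below are placeholder assumptions — the point is the shape: bounded, monotone in each signal, and cheap to compute at every gate.

```python
def risk_score(blast_radius_size, past_incidents, missing_coupled_count,
               weights=(0.4, 0.4, 0.2)):
    """Blend pipeline signals into a 0-1 risk score.

    Each raw count is squashed into [0, 1) so one huge value
    cannot dominate the blend.
    """
    def squash(x, scale):
        return x / (x + scale)

    w_blast, w_history, w_coupling = weights
    return (w_blast * squash(blast_radius_size, 10)
            + w_history * squash(past_incidents, 2)
            + w_coupling * squash(missing_coupled_count, 1))

print(round(risk_score(0, 0, 0), 3))   # clean change scores 0
print(round(risk_score(5, 1, 1), 3))   # moderate risk on all signals
```

A score like this can drive thresholds per stage: warn the author pre-commit, require extra review pre-merge, block pre-deploy.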

Real-World Example: The Session TTL Incident

An engineer changed the session TTL from 30 minutes to 60 minutes. A routine config change. Code review approved it in minutes.

What nobody knew: the WebSocket service had a hardcoded 30-minute reconnection timer that assumed sessions would expire at 30 minutes. After the TTL change, WebSocket connections would reconnect, find a still-valid session, and create duplicate connections. Memory usage climbed until the service OOMed at 3 AM.

With pre-commit blast radius analysis, the dependency between session TTL config and the WebSocket timer would have been visible. The fix: a 5-minute code change to read the TTL from config instead of hardcoding it. Without the analysis: a 4-hour production incident, customer-facing downtime, and a blameless postmortem that recommended "better testing" (the actual fix was better dependency awareness).
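The shape of that 5-minute fix, in a hypothetical sketch (names and structure assumed, not taken from any real codebase): derive the reconnect timer from the shared session config instead of baking in a number that silently assumes the TTL.

```python
# Before: a hardcoded timer that silently assumes the session TTL.
RECONNECT_AFTER_MINUTES = 30  # breaks the moment the TTL changes

# After: one source of truth shared with the session service.
def reconnect_after_minutes(session_config):
    """Read the reconnect window from the session TTL config, so the
    websocket service can never drift from the real expiry time."""
    return session_config["session_ttl_minutes"]

config = {"session_ttl_minutes": 60}
print(reconnect_after_minutes(config))  # tracks the TTL automatically
```

Once the dependency is explicit in code, the next TTL change is a non-event instead of a 3 AM OOM.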


Originally published on glue.tools. Glue is the pre-code intelligence platform — paste a ticket, get a battle plan.
