Your last P0 incident probably took 50 minutes to resolve. The fix itself? Likely under 10 minutes. A config rollback. A connection pool bump. A single kubectl command.
So where did the other 40 minutes go?
Not to engineering. To coordination. Jumping between tools, paging the right people, checking what changed, and trying to piece together the context from six different dashboards before anyone even starts debugging.
The data backs this up. An incident.io analysis of real-world P0 incidents found a typical MTTR breakdown of 12 minutes assembling the team and gathering context, 20 minutes troubleshooting, 4 minutes on actual mitigation, and 12 minutes on cleanup: the actual repair takes a small fraction of the window, while assembly, cleanup, and the overhead buried inside troubleshooting consume the rest. Separately, the Catchpoint SRE Report 2025 found that operational toil rose to 30% of engineering time, up from 25%, the first increase in five years. And Splunk's State of Observability 2025 reported that 73% of organisations experienced outages related to ignored or suppressed alerts because their teams were drowning in noise.
The pattern is consistent across the industry and matches what we've seen firsthand: roughly 70% of incident response time goes to coordination, not engineering. Whether it's a PagerDuty report showing customer-impacting incidents increased 43% year-over-year, or incident.io's data showing that team assembly and cleanup alone consume half the resolution window, the bottleneck isn't your engineers. It's everything they have to do before they can start fixing.
Key Takeaways
- ~70% of incident response time is coordination, not engineering. The fix itself is usually a matter of minutes; getting to it consumes the rest of the 50-minute window.
- The first 20 minutes are almost entirely logistics. Detection, assembly, and context gathering before a single engineer has looked at a log line with intent.
- MTTR is a misleading metric. A 50-minute MTTR doesn't tell you whether your team spent 40 minutes coordinating and 10 debugging, or the other way around. Same number, entirely different problems. Track where the time actually goes.
- The highest-ROI improvements target coordination, not debugging. If 70% of your incident time is spent on people jumping between tools and paging each other, buying a faster APM will not help. Automate the assembly, pre-wire the context, and let your engineers go straight to the problem.
- On-call burnout is a coordination problem. Your senior engineers aren't burning out because the fixes are hard. They're burning out because they're the only ones who can navigate across all the tools effectively.
The Real Anatomy of a P0 Incident
So what does that 70% actually look like in practice? Here's the minute-by-minute breakdown of a typical P0 incident. The pattern was remarkably consistent across every team we spoke to.
- Minutes 0–4: Detection. The alert fires. The on-call engineer acknowledges. If they're in a meeting or away from their desk, the delay alone eats four minutes.
- Minutes 4–20: The Assembly Phase. This is where the time disappears. The engineer opens Slack and posts in the incidents channel, then remembers they don't know who owns the checkout service. They have Datadog open in one tab and the deployment dashboard in another, and they're scanning GitHub commits to see if anyone pushed anything in the last hour. Six tools open, and they haven't started debugging yet. They're just trying to figure out what's going on.
- Minutes 20–34: Investigation. The actual diagnostic work begins, but coordination keeps getting in the way. Two engineers independently check whether a recent config change caused the issue: one combs the Elasticsearch logs, the other the Datadog logs, because nobody coordinated. Meanwhile, Slack is buzzing with questions: "Is this related to the deploy we did at 2:30?" "Should we roll back?" "Do we need to update the status page?" The actual engineering insight, "the connection pool size was reduced in the 2:30 config push", comes from about five minutes of focused investigation. But those five minutes are buried inside fourteen minutes of tool-hopping and duplicated effort.
- Minutes 34–40: The Fix. Almost always fast. Roll back the config. Bump the pool size. Push the change. Verify.
- Minutes 40–50: Cleanup. Update the status page. Close PagerDuty. Post a summary. Create the post-mortem ticket. More coordination, zero engineering.
Here's what that looks like when you map every minute:
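The tally below is a minimal sketch using the illustrative numbers from the timeline above, not measured data; the investigation phase is split into roughly nine minutes of tool-hopping and five minutes of focused diagnosis, as described.

```python
# Illustrative only: tally the 50-minute timeline above into coordination
# versus engineering minutes. Adjust the numbers to match your own incidents.
phases = [
    # (phase, coordination minutes, engineering minutes)
    ("Detection",      4, 0),
    ("Assembly",      16, 0),
    ("Investigation",  9, 5),  # tool-hopping and duplication vs. focused diagnosis
    ("Fix",            0, 6),
    ("Cleanup",       10, 0),
]

coordination = sum(c for _, c, _ in phases)
engineering = sum(e for _, _, e in phases)
total = coordination + engineering

for name, c, e in phases:
    print(f"{name:<14} coordination {c:>2} min | engineering {e:>2} min")

print(f"Coordination: {coordination}/{total} min ({coordination / total:.0%})")
print(f"Engineering:  {engineering}/{total} min ({engineering / total:.0%})")
```

That tally lands at 39 coordination minutes out of 50, roughly 78%, within the 65–80% range discussed in the FAQ below, and it lines up with the 39-minute coordination figure used in the next section.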
Why This Should Concern Engineering Leaders
The obvious cost is downtime. According to ITIC's 2024 Hourly Cost of Downtime survey, over 90% of mid-size and large enterprises report that a single hour of downtime costs more than $300,000, with 41% putting it between $1 million and $5 million. Gartner puts the average for Fortune 500 companies at $500,000 to $1 million per hour. But there's a quieter cost.
If your team handles 15 incidents per month with an average of 3 engineers per incident, and each one carries 39 minutes of coordination overhead, that's roughly 29 engineer-hours per month, nearly four full engineering days, spent not on diagnosis, not on the fix, but on opening Slack channels, paging people, and checking who deployed what.
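As a quick back-of-the-envelope check on that figure:

```python
# Back-of-the-envelope: monthly coordination overhead, illustrative numbers.
incidents_per_month = 15
engineers_per_incident = 3
coordination_minutes = 39

hours = incidents_per_month * engineers_per_incident * coordination_minutes / 60
print(f"~{hours:.0f} engineer-hours per month")  # ~29, nearly four engineering days
```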
And that calculation doesn't include context-switching costs. Each interruption costs an engineer another 15–25 minutes to get back into deep work afterward. The real productivity loss is several times higher.
This cost falls disproportionately on your most experienced engineers: the ones who know which signals matter, who owns which service, and where to look first. When those engineers burn out and leave, they take that institutional knowledge with them. The next incident takes longer because the coordination phase expands.
MTTR hides all of these issues. A 50-minute MTTR doesn't tell you whether you spent 40 minutes on coordination and 10 on the fix or 10 on coordination and 40 on a genuinely challenging problem. These require entirely different solutions.
What You Can Do About It
The 70/30 split tells you exactly where to focus.
Pre-wire your incident response. Most coordination in the first 20 minutes comes from answering questions that should already have answers: Who owns this service? Who's on call? What changed recently? Where's the dashboard? A well-maintained service catalogue eliminates the "who do I page?" and "where do I look?" questions that consume the opening minutes.
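As a minimal illustration of what "pre-wired" can mean in practice (the field names and URLs below are made up for the example, not a specific tool's schema):

```python
# An illustrative service-catalogue entry. The format matters less than the
# fact that these answers exist before the incident rather than during it.
SERVICE_CATALOGUE = {
    "checkout": {
        "owner_team": "payments",
        "oncall_schedule": "payments-primary",
        "dashboard": "https://example.com/dashboards/checkout",
        "runbook": "https://example.com/runbooks/checkout",
        "deploy_log": "https://example.com/deploys?service=checkout",
    },
}

def incident_context(service: str) -> dict:
    """Answer 'who do I page?' and 'where do I look?' without a human search."""
    return SERVICE_CATALOGUE.get(service, {})
```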
Eliminate parallel tool-hopping. If your engineers are independently querying three different observability tools during an incident, you have a coordination problem. Assign roles explicitly: one person investigates logs, one checks deploys, one handles communication.
Automate the coordination layer. Creating channels, paging owners, and pulling context can be almost entirely automated. Every minute your engineers spend on logistics during an active incident is a minute they're not diagnosing the problem.
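A sketch of what that can look like, building on the catalogue entry above. The helpers (`create_channel`, `page`, `post`, `recent_deploys`) are stubs standing in for whatever your chat, paging, and deploy tooling expose, not real APIs:

```python
# Illustrative automation of the coordination layer: the steps a human usually
# does by hand, run automatically when the alert fires. The helpers are stubs.
def create_channel(name: str) -> str:
    print(f"[chat] created #{name}")
    return name

def page(schedule: str, channel: str) -> None:
    print(f"[paging] paged {schedule}, pointed at #{channel}")

def post(channel: str, message: str) -> None:
    print(f"[#{channel}] {message}")

def recent_deploys(service: str, minutes: int) -> list[str]:
    return [f"{service}: config push at 14:30"]  # stand-in for real deploy data

def handle_alert(alert: dict) -> None:
    service = alert["service"]
    ctx = incident_context(service)  # catalogue lookup from the sketch above

    channel = create_channel(f"inc-{alert['id']}-{service}")
    page(ctx["oncall_schedule"], channel)

    # Pre-wire the context so the responder starts with answers, not tabs.
    post(channel, f"Owner: {ctx['owner_team']} | Dashboard: {ctx['dashboard']}")
    post(channel, f"Runbook: {ctx['runbook']}")
    post(channel, f"Deploys in the last 60 min: {recent_deploys(service, 60)}")

handle_alert({"id": "1234", "service": "checkout"})
```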
Automate the investigation layer. This is the frontier. The investigation phase remains time-consuming because it requires connecting the dots across tools: mapping an error spike to a recent deploy, linking a latency increase to a config change, grouping 30 cascading alerts into a single root cause. This kind of cross-tool correlation is exactly what AI is adept at.
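To make "connecting the dots" concrete, here's a toy version of one of those correlations. It is entirely illustrative: a fixed time window is a crude stand-in for the weighting a real system applies across many signals.

```python
from datetime import datetime, timedelta

# Toy cross-tool correlation: given an error spike and a list of recent changes
# (deploys, config pushes, flags), surface whatever landed shortly before it.
def suspects(spike_at: datetime, changes: list[dict], window_minutes: int = 45) -> list[dict]:
    window = timedelta(minutes=window_minutes)
    return [c for c in changes if timedelta(0) <= spike_at - c["at"] <= window]

changes = [
    {"at": datetime(2025, 1, 7, 14, 30), "what": "config push: db pool size 50 -> 10"},
    {"at": datetime(2025, 1, 7, 11, 5),  "what": "deploy: checkout v2.14.1"},
]

for c in suspects(spike_at=datetime(2025, 1, 7, 14, 41), changes=changes):
    print(c["at"].strftime("%H:%M"), c["what"])
# -> 14:30 config push: db pool size 50 -> 10
```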
At Steadwing, this kind of cross-tool correlation is the problem we solve. When an alert fires, we pull context from your logs, metrics, traces, and recent code changes, connect the dots across your whole stack, and give you a full root cause analysis with automatable fixes at the code, deployment, and infrastructure level. The RCA is finished by the time the on-call engineer opens their laptop.
We handle the 70%, so your engineers can focus on the 30% that actually requires their expertise.
Frequently Asked Questions
Where does the 70% coordination figure come from?
We timed real incidents across multiple engineering teams and categorized every minute as either coordination (team assembly, tool-switching, communication, cleanup) or engineering (diagnosis, root cause identification, fix). The split consistently landed between 65% and 80% coordination. This aligns with publicly available incident data: incident.io's MTTR breakdown shows coordination and investigation phases consume the majority of resolution time, while the Catchpoint SRE Report 2025 and Splunk State of Observability 2025 confirm that operational toil and alert noise continue to rise across the industry.
What's the business case for fixing this?
A mid-stage SaaS company handling 15 incidents per month with 3 engineers per incident and 39 minutes of coordination overhead per incident loses roughly 29 engineer-hours per month to non-engineering work. At a fully loaded cost of $150/hour, that's about $52,000/year in direct labor before accounting for context-switching costs and the attrition risk of burned-out on-call engineers.
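The same assumptions, as a one-liner you can rerun with your own numbers:

```python
# Annual direct-labor cost of coordination, illustrative assumptions only.
hours_per_month = 15 * 3 * 39 / 60                        # incidents x engineers x minutes
print(f"${hours_per_month * 150 * 12:,.0f} per year")     # -> $52,650 per year
```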
How does Steadwing specifically address this?
When an alert fires, Steadwing pulls context from your logs, metrics, traces, and codebase, connects the dots across your whole stack, and gives the on-call engineer a full root cause analysis with automatable fixes at the code, deployment, and infrastructure level, all in under 5 minutes. The RCA is finished by the time they open their laptop. Your engineers still make the decisions, but they start with a diagnosis and a proposed solution, not a blank screen and six browser tabs.
Steadwing is an autonomous on-call engineer. It connects the dots across your stack and gives you a full RCA with fixes before your team starts the manual scramble. Start free →
