LinkedIn Draft — Workflow (2026-04-07)
A hard-earned rule from incident retrospectives:
Incident RCA without a data-backed timeline is just a story you told yourself
Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.
Memory-based timeline:
T+0    "Deploy happened"
T+?    "Errors started"
T+?    "Someone noticed"
T+?    "We rolled back"

Data-backed timeline:
T+0:00 Deploy (Argo event)
T+0:07 Error rate +0.3% (Prometheus)
T+0:12 P95 latency 340ms→2.1s (trace)
T+0:19 Alert fired (PagerDuty)
T+0:31 Rollback complete (Argo)
Where it breaks:
▸ Log timestamps across services diverge by seconds without NTP — your timeline is wrong before you begin.
▸ Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.
▸ Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.
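The clock-skew failure mode above is checkable mechanically: if two services log the same request ID, a request that appears to be "received" before it was "sent" puts a lower bound on how far their clocks disagree. A minimal sketch, assuming a hypothetical log format with an ISO-8601 timestamp as the first token:

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    # Hypothetical format: ISO-8601 timestamp is the first whitespace-separated token.
    return datetime.fromisoformat(line.split(" ", 1)[0])

def skew_seconds(sent_line: str, recv_line: str) -> float:
    """Timestamp delta between the send and receive log lines for the
    same request. A request cannot arrive before it was sent, so a
    negative delta reveals clock skew of at least that magnitude."""
    return (parse_ts(recv_line) - parse_ts(sent_line)).total_seconds()

# Service A sends the request; service B logs receiving it *earlier*:
a = "2026-04-07T10:00:00.250+00:00 req=abc123 sent to checkout"
b = "2026-04-07T10:00:00.100+00:00 req=abc123 received"
delta = skew_seconds(a, b)  # negative => clocks disagree by at least 150ms
```

If your services share request IDs, a pass like this over one day of logs tells you whether your timelines can be trusted to the second at all.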
The rule I keep coming back to:
→ Build the timeline from data only before the RCA meeting begins. If you can't source an event, mark it 'unverified' — not assumed.
How I sanity-check it:
▸ OpenTelemetry trace IDs as the timeline spine — they cross service boundaries with sub-millisecond precision.
▸ Grafana annotations on every deploy, config change, and scaling event — visible on every dashboard automatically.
Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.
This is where most runbooks stop — what's your next step after this?