LinkedIn Draft — Workflow (2026-04-07)
A hard-earned rule from incident retrospectives:
Incident RCA without a data-backed timeline is just a story you told yourself
Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.
Memory-based timeline:
T+0    "Deploy happened"
T+?    "Errors started"
T+?    "Someone noticed"
T+?    "We rolled back"

Data-backed timeline:
T+0:00 Deploy (Argo event)
T+0:07 Error rate +0.3% (Prometheus)
T+0:12 P95 latency 340ms→2.1s (trace)
T+0:19 Alert fired (PagerDuty)
T+0:31 Rollback complete (Argo)
Where it breaks:
▸ Log timestamps across services diverge by seconds without NTP — your timeline is wrong before you begin.
▸ Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.
▸ Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.
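The clock-skew failure mode above is checkable mechanically: if two services log the same request ID, a request that appears to be "received" before it was "sent" puts a lower bound on how far their clocks disagree. A minimal sketch, assuming a hypothetical log format with an ISO-8601 timestamp as the first token:

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    # Hypothetical format: ISO-8601 timestamp is the first whitespace-separated token.
    return datetime.fromisoformat(line.split(" ", 1)[0])

def skew_seconds(sent_line: str, recv_line: str) -> float:
    """Timestamp delta between the send and receive log lines for the
    same request. A request cannot arrive before it was sent, so a
    negative delta reveals clock skew of at least that magnitude."""
    return (parse_ts(recv_line) - parse_ts(sent_line)).total_seconds()

# Service A sends the request; service B logs receiving it *earlier*:
a = "2026-04-07T10:00:00.250+00:00 req=abc123 sent to checkout"
b = "2026-04-07T10:00:00.100+00:00 req=abc123 received"
delta = skew_seconds(a, b)  # negative => clocks disagree by at least 150ms
```

If your services share request IDs, a pass like this over one day of logs tells you whether your timelines can be trusted to the second at all.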
The rule I keep coming back to:
→ Build the timeline from data only before the RCA meeting begins. If you can't source an event, mark it 'unverified' — not assumed.
How I sanity-check it:
▸ OpenTelemetry trace IDs as the timeline spine — they cross service boundaries with sub-millisecond precision.
▸ Grafana annotations on every deploy, config change, and scaling event — visible on every dashboard automatically.
Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.
This is where most runbooks stop — what's your next step after this?