<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Long Yi</title>
    <description>The latest articles on DEV Community by Long Yi (@long_yi_dfc45a0fea1bf55db).</description>
    <link>https://dev.to/long_yi_dfc45a0fea1bf55db</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3755345%2Fd9778a1f-0063-4530-b0aa-5b933e29ceba.png</url>
      <title>DEV Community: Long Yi</title>
      <link>https://dev.to/long_yi_dfc45a0fea1bf55db</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/long_yi_dfc45a0fea1bf55db"/>
    <language>en</language>
    <item>
      <title>What 3am On-Call Taught Me About Why Incident Tools Break Down</title>
      <dc:creator>Long Yi</dc:creator>
      <pubDate>Thu, 05 Feb 2026 18:42:05 +0000</pubDate>
      <link>https://dev.to/long_yi_dfc45a0fea1bf55db/what-3am-on-call-taught-me-about-why-incident-tools-break-down-2nd7</link>
      <guid>https://dev.to/long_yi_dfc45a0fea1bf55db/what-3am-on-call-taught-me-about-why-incident-tools-break-down-2nd7</guid>
      <description>&lt;p&gt;At 3am, during an incident, nobody is excited about tooling.&lt;/p&gt;

&lt;p&gt;You’re tired, Slack is exploding, alerts are firing, and everyone is asking the same question in different words:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“What changed?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I learned this the hard way after spending a few years on call as an SRE.&lt;/p&gt;

&lt;p&gt;Most incidents I dealt with didn’t go badly because we lacked alerts or dashboards. We had plenty. The breakdown usually happened &lt;em&gt;after&lt;/em&gt; the alert, during investigation, when context was missing and people started jumping between tools trying to reconstruct what the system even looked like.&lt;/p&gt;

&lt;h2&gt;The uncomfortable pattern I kept seeing&lt;/h2&gt;

&lt;p&gt;Across different teams and stacks, the same things kept happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs, metrics, traces, and deploy history all lived in different places
&lt;/li&gt;
&lt;li&gt;Context from past incidents lived in Slack threads or postmortems nobody read
&lt;/li&gt;
&lt;li&gt;New tools required weeks of setup before they were useful
&lt;/li&gt;
&lt;li&gt;During incidents, nobody wanted to open yet another UI
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What surprised me most was how much effort went into &lt;em&gt;wiring tools together&lt;/em&gt;, instead of helping people reason about failures.&lt;/p&gt;

&lt;p&gt;A lot of “AI for SRE” tools I tried assumed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean integrations already existed
&lt;/li&gt;
&lt;li&gt;the system graph was known
&lt;/li&gt;
&lt;li&gt;teams had time to configure everything upfront
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s rarely true in real systems.&lt;/p&gt;

&lt;h2&gt;The problem wasn’t intelligence, it was context&lt;/h2&gt;

&lt;p&gt;At some point it clicked for me:&lt;br&gt;&lt;br&gt;
the bottleneck wasn’t smarter analysis, it was missing context.&lt;/p&gt;

&lt;p&gt;If a tool doesn’t understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what services exist&lt;/li&gt;
&lt;li&gt;how they depend on each other&lt;/li&gt;
&lt;li&gt;what changed recently&lt;/li&gt;
&lt;li&gt;how incidents were handled before
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then adding AI on top just produces confident nonsense.&lt;/p&gt;

&lt;p&gt;So instead of asking &lt;em&gt;“how do we make the model smarter?”&lt;/em&gt;, I started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the tool learn what the system actually looks like?”&lt;/p&gt;
&lt;/blockquote&gt;
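
&lt;p&gt;To make that question concrete, here is a minimal sketch of two of the four kinds of context listed above: how services depend on each other and what changed recently. It is only an illustration, not IncidentFox’s actual data model. The names &lt;code&gt;Change&lt;/code&gt;, &lt;code&gt;SystemContext&lt;/code&gt;, &lt;code&gt;dependencies_of&lt;/code&gt;, and &lt;code&gt;what_changed&lt;/code&gt;, along with the example services, are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Change:
    """One recorded change (deploy, config edit, migration). Hypothetical shape."""
    service: str
    description: str
    at: datetime


@dataclass
class SystemContext:
    """Hypothetical container for what a tool knows about the system."""
    # Service name mapped to the services it calls directly.
    depends_on: dict = field(default_factory=dict)
    changes: list = field(default_factory=list)

    def dependencies_of(self, service: str):
        """Return the service plus everything it depends on, transitively."""
        seen = set()
        stack = [service]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            stack.extend(self.depends_on.get(current, []))
        return seen

    def what_changed(self, service: str, now: datetime, window: timedelta):
        """Changes inside the window that touch the service or its
        dependencies, newest first."""
        relevant = self.dependencies_of(service)
        cutoff = now - window
        recent = [
            c for c in self.changes
            if c.service in relevant and c.at >= cutoff and now >= c.at
        ]
        return sorted(recent, key=lambda c: c.at, reverse=True)


# Made-up example data for illustration only.
now = datetime(2026, 2, 5, 3, 0)
ctx = SystemContext(
    depends_on={"checkout": ["payments", "cart"], "payments": ["ledger-db"]},
    changes=[
        Change("ledger-db", "schema migration", now - timedelta(minutes=40)),
        Change("search", "index rebuild", now - timedelta(minutes=10)),
        Change("cart", "deploy", now - timedelta(hours=5)),
    ],
)
suspects = ctx.what_changed("checkout", now, timedelta(hours=2))
```

&lt;p&gt;The point of the sketch is that “what changed?” only becomes answerable once the dependency map exists. Without it, a change to a database two hops away never gets connected to the service that is paging.&lt;/p&gt;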

&lt;h2&gt;A different approach I wanted to try&lt;/h2&gt;

&lt;p&gt;I started building &lt;strong&gt;IncidentFox&lt;/strong&gt; mostly to scratch my own itch.&lt;/p&gt;

&lt;p&gt;Two design decisions came directly from on-call pain:&lt;/p&gt;

&lt;h3&gt;1. Learn the system at setup, not weeks later&lt;/h3&gt;

&lt;p&gt;Instead of asking users to manually wire everything, the tool analyzes your system during setup — codebase, observability signals, past incidents — and builds the initial understanding automatically.&lt;/p&gt;

&lt;p&gt;Not because setup is annoying (it is), but because incomplete integrations lead to wrong conclusions. Accuracy depends on context, and context shouldn’t take weeks to assemble.&lt;/p&gt;

&lt;h3&gt;2. Stay where incidents already happen&lt;/h3&gt;

&lt;p&gt;Every incident I’ve been part of lived in Slack.&lt;br&gt;&lt;br&gt;
That’s where decisions happened, context was shared, and confusion spread.&lt;/p&gt;

&lt;p&gt;So the tool is Slack-first by design. Not as a notification surface, but as the actual place where investigation happens, with logs, metrics, traces, and historical context pulled directly into the thread.&lt;/p&gt;

&lt;p&gt;The goal wasn’t to replace existing tools, but to stop people from losing context by bouncing between them.&lt;/p&gt;

&lt;h2&gt;One more thing I didn’t expect&lt;/h2&gt;

&lt;p&gt;As incidents happen, teams leave behind a lot of implicit knowledge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why a hypothesis was rejected
&lt;/li&gt;
&lt;li&gt;which signals mattered
&lt;/li&gt;
&lt;li&gt;what ended up being noise
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most tools throw that away.&lt;/p&gt;

&lt;p&gt;We decided to keep it. The system continuously learns from each incident, so future investigations start with more context instead of a blank slate.&lt;/p&gt;

&lt;p&gt;It’s not magic, and it definitely doesn’t “solve incidents automatically”. It just tries to remember what humans already learned under pressure.&lt;/p&gt;
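
&lt;p&gt;As a rough sketch of what “remembering” could look like, the snippet below stores the three kinds of knowledge listed above (rejected hypotheses, useful signals, noise) and looks them up by affected service. This is an assumption about one possible shape, not a description of how IncidentFox stores anything. &lt;code&gt;IncidentNote&lt;/code&gt;, &lt;code&gt;IncidentMemory&lt;/code&gt;, and the sample incidents are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNote:
    """What humans learned during one incident. Hypothetical shape."""
    incident_id: str
    services: set
    useful_signals: list
    noisy_signals: list
    # Hypothesis mapped to the reason it was ruled out.
    rejected_hypotheses: dict


@dataclass
class IncidentMemory:
    """Hypothetical in-memory store of past incident notes."""
    notes: list = field(default_factory=list)

    def remember(self, note: IncidentNote):
        self.notes.append(note)

    def recall(self, services):
        """Past notes that share at least one service with the current
        incident, ordered by how many services overlap."""
        wanted = set(services)
        matches = [n for n in self.notes if wanted.intersection(n.services)]
        return sorted(
            matches,
            key=lambda n: len(wanted.intersection(n.services)),
            reverse=True,
        )


# Made-up example data for illustration only.
memory = IncidentMemory()
memory.remember(IncidentNote(
    "inc-1", {"checkout", "payments"},
    useful_signals=["payments p99 latency"],
    noisy_signals=["cart CPU alert"],
    rejected_hypotheses={"bad deploy": "no deploy in the previous 24 hours"},
))
memory.remember(IncidentNote("inc-2", {"search"}, ["index lag"], [], {}))
memory.remember(IncidentNote("inc-3", {"payments"}, ["ledger-db locks"], [], {}))

hits = memory.recall(["checkout", "payments"])
```

&lt;p&gt;Matching on service overlap is deliberately crude. The part worth keeping is the record itself: a rejected hypothesis with its reason saves the next responder from re-checking the same dead end.&lt;/p&gt;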

&lt;h2&gt;Where this is today&lt;/h2&gt;

&lt;p&gt;IncidentFox is open source (Apache 2.0), self-hostable, and very much still evolving. It won’t replace your monitoring stack, and it has blind spots I haven’t hit yet.&lt;/p&gt;

&lt;p&gt;I’m sharing it because I want feedback from people who’ve actually been on call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What helped you most during investigation?&lt;/li&gt;
&lt;li&gt;What tools disappointed you?&lt;/li&gt;
&lt;li&gt;What assumptions here sound naive?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo is here if you’re curious:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/incidentfox/incidentfox" rel="noopener noreferrer"&gt;https://github.com/incidentfox/incidentfox&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m still learning — and incidents have a way of humbling everyone.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
