<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samson Tanimawo</title>
    <description>The latest articles on DEV Community by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://dev.to/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>DEV Community: Samson Tanimawo</title>
      <link>https://dev.to/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>Incident Severity Levels: SEV-1 to SEV-5 Calibration</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 03 May 2026 14:27:49 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</link>
      <guid>https://dev.to/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</guid>
      <description>&lt;h2&gt;
  
  
  Why Severity Is Broken at Most Companies
&lt;/h2&gt;

&lt;p&gt;Everyone has severity levels. Almost nobody agrees on what they mean.&lt;/p&gt;

&lt;p&gt;Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under-paged incidents (people thought SEV-3 meant "no rush")&lt;/li&gt;
&lt;li&gt;Over-paged incidents (everything is SEV-1)&lt;/li&gt;
&lt;li&gt;Exhausted on-call (false alarms)&lt;/li&gt;
&lt;li&gt;Missed SLOs (incidents not escalated in time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calibration matters. Here's a definition that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Levels
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SEV-1: Critical&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is completely down for all users&lt;/li&gt;
&lt;li&gt;Active data loss&lt;/li&gt;
&lt;li&gt;Security breach in progress&lt;/li&gt;
&lt;li&gt;Core business stopped (can't process payments, can't log in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 5 minutes&lt;br&gt;
Escalation: Immediate, all hands&lt;br&gt;
Post-mortem: Required, public within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-2: High&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is degraded for most users&lt;/li&gt;
&lt;li&gt;Core feature unavailable for a subset&lt;/li&gt;
&lt;li&gt;Significant customer impact but workaround exists&lt;/li&gt;
&lt;li&gt;Performance significantly degraded (&amp;gt;5x normal latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 15 minutes&lt;br&gt;
Escalation: Page primary on-call, notify secondary&lt;br&gt;
Post-mortem: Required, internal within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-3: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-critical feature broken&lt;/li&gt;
&lt;li&gt;Affects a small percentage of users&lt;/li&gt;
&lt;li&gt;Degraded performance within tolerance&lt;/li&gt;
&lt;li&gt;Bug in new feature rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 hour&lt;br&gt;
Escalation: Page during business hours, ticket overnight&lt;br&gt;
Post-mortem: Recommended&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-4: Low&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minor bug with workaround&lt;/li&gt;
&lt;li&gt;Internal tooling broken&lt;/li&gt;
&lt;li&gt;Non-customer-facing issue&lt;/li&gt;
&lt;li&gt;Cosmetic problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 business day&lt;br&gt;
Escalation: Ticket only&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-5: Informational&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not actually broken&lt;/li&gt;
&lt;li&gt;Preemptive warning&lt;/li&gt;
&lt;li&gt;"This might become a problem"&lt;/li&gt;
&lt;li&gt;Observed anomaly without impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: Backlog&lt;br&gt;
Escalation: None&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;
&lt;h2&gt;
  
  
  The Calibration Problem
&lt;/h2&gt;

&lt;p&gt;Levels written on paper are useless. What matters is &lt;strong&gt;consistent application&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Run this exercise: take your last 50 incidents and ask three SRE leads to independently assign severity levels. Compare the results.&lt;/p&gt;

&lt;p&gt;If the leads disagree by at least one level on more than 20% of incidents, your definitions aren't calibrated. Run training.&lt;/p&gt;
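
&lt;p&gt;Here's a minimal sketch of that check in Python, assuming you've exported the independent ratings as lists of levels per incident (the input format is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Flag incidents where the three leads disagree by one level or more.
ratings = {
    "INC-101": [1, 1, 2],   # severity assigned by each of three leads
    "INC-102": [3, 3, 3],
    "INC-103": [2, 4, 3],
}

disagreements = {
    inc: levels for inc, levels in ratings.items()
    if max(levels) - min(levels) &amp;gt;= 1
}

rate = len(disagreements) / len(ratings)
print(f"Disagreement rate: {rate:.0%}")
if rate &amp;gt; 0.20:
    print("Definitions aren't calibrated -- run training.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;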
&lt;h2&gt;
  
  
  The "When In Doubt" Rules
&lt;/h2&gt;

&lt;p&gt;When severity is ambiguous, default to &lt;strong&gt;higher severity&lt;/strong&gt; and downgrade if wrong.&lt;/p&gt;

&lt;p&gt;Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.&lt;/p&gt;

&lt;p&gt;Specific rules (a code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User data loss&lt;/strong&gt; → always SEV-1 or SEV-2, never lower&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security issue&lt;/strong&gt; → always SEV-1 or SEV-2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue impact&lt;/strong&gt; → SEV-2 minimum if measurable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertain scope&lt;/strong&gt; → start at higher severity, downgrade when scope is clear&lt;/li&gt;
&lt;/ul&gt;
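
&lt;p&gt;A sketch of those floors as code, assuming boolean flags collected at triage time (the field names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def severity_floor(data_loss=False, security=False,
                   revenue_impact=False, scope_unclear=False):
    """Highest SEV number allowed: open at this level or more severe."""
    if data_loss or security:
        return 2   # always SEV-1 or SEV-2, never lower
    if revenue_impact:
        return 2   # SEV-2 minimum if measurable
    if scope_unclear:
        return 2   # start high, downgrade once scope is clear
    return 5       # no floor; use the impact matrix below

print(severity_floor(revenue_impact=True))  # 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;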
&lt;h2&gt;
  
  
  Customer Impact Matrix
&lt;/h2&gt;

&lt;p&gt;For fast calibration, use a matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| &amp;lt;1% users | 1-10% users | 10-50% | &amp;gt;50% users
Product Down | SEV-2 | SEV-1 | SEV-1 | SEV-1
Major Degraded | SEV-3 | SEV-2 | SEV-2 | SEV-1
Minor Degraded | SEV-4 | SEV-3 | SEV-2 | SEV-2
Workaround Exists| SEV-4 | SEV-4 | SEV-3 | SEV-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a fast severity assignment without relying on intuition.&lt;/p&gt;
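
&lt;p&gt;The same matrix as a lookup function, a sketch you could drop into triage tooling (all names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rows: impact class; columns: &amp;lt;1%, 1-10%, 10-50%, &amp;gt;50% of users.
MATRIX = {
    "product_down":      [2, 1, 1, 1],
    "major_degraded":    [3, 2, 2, 1],
    "minor_degraded":    [4, 3, 2, 2],
    "workaround_exists": [4, 4, 3, 2],
}

def assign_severity(impact: str, pct_users: float) -&amp;gt; int:
    bounds = [1, 10, 50, 100]  # upper bound of each column (inclusive)
    col = next(i for i, b in enumerate(bounds) if pct_users &amp;lt;= b)
    return MATRIX[impact][col]

print(f"SEV-{assign_severity('major_degraded', 7.5)}")  # SEV-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;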

&lt;h2&gt;
  
  
  Time-Based Escalation
&lt;/h2&gt;

&lt;p&gt;Severity isn't fixed for the incident lifetime. It escalates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sev_2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;auto_escalate_to_sev_1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_not_resolved_in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60_minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_user_impact_grows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;above_10_percent&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_revenue_loss_exceeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$10000/hour&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.&lt;/p&gt;
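
&lt;p&gt;A sketch of that policy as a check an incident bot could run every minute; the thresholds are copied from the YAML above, everything else is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

def should_escalate_sev2(opened_at, user_impact_pct, revenue_loss_per_hour):
    """Escalate SEV-2 to SEV-1 if any trigger fires."""
    age = datetime.now(timezone.utc) - opened_at
    return (
        age &amp;gt; timedelta(minutes=60)
        or user_impact_pct &amp;gt; 10
        or revenue_loss_per_hour &amp;gt; 10_000
    )

opened = datetime.now(timezone.utc) - timedelta(minutes=75)
print(should_escalate_sev2(opened, user_impact_pct=4,
                           revenue_loss_per_hour=0))
# True -- unresolved for more than 60 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;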

&lt;h2&gt;
  
  
  The Downgrade Rule
&lt;/h2&gt;

&lt;p&gt;Downgrading is allowed &lt;strong&gt;but must be justified in writing&lt;/strong&gt; in the incident channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents silent downgrades that understate severity for retro analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Integration
&lt;/h2&gt;

&lt;p&gt;Your SLOs and severity levels should align:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO: 99.95% availability (21.6 min/month budget)

If this month's error budget burned:
&amp;lt;25% → normal operations
25-50% → no SEV-3 burn-down deploys
50-75% → SEV-2 threshold lowered
&amp;gt;75% → any degradation is SEV-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're running low on error budget, everything gets more severe.&lt;/p&gt;
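
&lt;p&gt;As a sketch, the same policy in code (the budget math assumes a 30-day month):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def severity_posture(budget_burned: float) -&amp;gt; str:
    """Map this month's error-budget burn (0.0-1.0) to a posture."""
    if budget_burned &amp;lt; 0.25:
        return "normal operations"
    if budget_burned &amp;lt; 0.50:
        return "no SEV-3 burn-down deploys"
    if budget_burned &amp;lt; 0.75:
        return "SEV-2 threshold lowered"
    return "any degradation is SEV-1"

# 99.95% over 30 days gives a 21.6-minute budget;
# 13 minutes of downtime so far burns ~60% of it:
print(severity_posture(13 / 21.6))  # SEV-2 threshold lowered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;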

&lt;h2&gt;
  
  
  Practical Incident Categories
&lt;/h2&gt;

&lt;p&gt;Beyond numeric severity, label incidents by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;INCIDENT_TYPES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure (AWS, networking)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application (code bug)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment (bad release)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;capacity (scaling failure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data (corruption, loss)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;security (breach, exposure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;external (3rd-party dependency)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Severity tells you how urgent. Type tells you who to page.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monthly Review
&lt;/h2&gt;

&lt;p&gt;Once a month, review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All SEV-1s and SEV-2s&lt;/li&gt;
&lt;li&gt;Any SEV-3 that should have been SEV-2&lt;/li&gt;
&lt;li&gt;Any SEV-2 that should have been SEV-3&lt;/li&gt;
&lt;li&gt;Average time from incident open to correct severity assignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adjust the definitions based on what you learn. Severity is a living standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pet severity&lt;/strong&gt;: every team invents its own. Standardize company-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEV-0&lt;/strong&gt;: don't add levels above SEV-1. Just use "SEV-1, all hands."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity inflation&lt;/strong&gt;: if every incident is SEV-2, nobody takes SEV-2 seriously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity deflation&lt;/strong&gt;: pressure to avoid post-mortems leads to fake SEV-4s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unchanging severity&lt;/strong&gt;: escalation is a tool; use it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;Severity should mean the same thing to every person in the org. Engineers, PMs, execs, customer support.&lt;/p&gt;

&lt;p&gt;When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.&lt;/p&gt;

&lt;p&gt;When you achieve that, incident response gets dramatically better.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>incidents</category>
      <category>sre</category>
      <category>oncall</category>
      <category>process</category>
    </item>
    <item>
      <title>Memory Leak Detection in Long-Running Services</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 02 May 2026 14:27:34 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</link>
      <guid>https://dev.to/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</guid>
      <description>&lt;h2&gt;
  
  
  The Slowest Incident to Diagnose
&lt;/h2&gt;

&lt;p&gt;Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.&lt;/p&gt;

&lt;p&gt;And when you look at the first 30 minutes of metrics, everything looks normal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Flavors of Memory Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. True leaks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects allocated but never freed&lt;/li&gt;
&lt;li&gt;Classic in C/C++, rare in Go/Java with GC&lt;/li&gt;
&lt;li&gt;Grows linearly forever until OOM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Unbounded caches&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache adds entries but never evicts&lt;/li&gt;
&lt;li&gt;Common in Node.js, Python, Go&lt;/li&gt;
&lt;li&gt;Grows until memory pressure triggers other issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Memory fragmentation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heap is large but not usable&lt;/li&gt;
&lt;li&gt;Happens in long-running Java, Go, and .NET services&lt;/li&gt;
&lt;li&gt;Not really a "leak" but behaves like one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three cause the same symptom: memory grows over time. Treatment is different for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection Without Heap Dumps
&lt;/h2&gt;

&lt;p&gt;Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) &amp;gt; 0

# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) &amp;gt; 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Diagnosis complete.&lt;/p&gt;

&lt;p&gt;The question is &lt;strong&gt;where&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Go makes this relatively easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"net/http/pprof"&lt;/span&gt;

&lt;span class="c"&gt;// In main():&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":6060"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get a heap profile&lt;/span&gt;
go tool pprof http://localhost:6060/debug/pprof/heap

&lt;span class="c"&gt;# In the pprof shell:&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; top
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; list suspiciousFunction
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; web &lt;span class="c"&gt;# generates a SVG callgraph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects with high &lt;code&gt;inuse_space&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Objects with growing counts over time&lt;/li&gt;
&lt;li&gt;Unexpected large maps or slices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key trick&lt;/strong&gt;: take two heap profiles 1 hour apart and diff them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go tool pprof &lt;span class="nt"&gt;-base&lt;/span&gt; heap1.pprof heap2.pprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What shows up as "new" allocations in the diff is almost certainly your leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  Java Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Java is harder because the JVM adds layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump the heap&lt;/span&gt;
jmap &lt;span class="nt"&gt;-dump&lt;/span&gt;:format&lt;span class="o"&gt;=&lt;/span&gt;b,file&lt;span class="o"&gt;=&lt;/span&gt;heap.hprof &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# Analyze with Eclipse MAT or JVisualVM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In MAT, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leak Suspects report (automatic)&lt;/li&gt;
&lt;li&gt;Dominator tree (what's holding the most memory)&lt;/li&gt;
&lt;li&gt;GC roots path (what's preventing garbage collection)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Java culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static collections (especially &lt;code&gt;static Map&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;ThreadLocal values without cleanup&lt;/li&gt;
&lt;li&gt;Listeners/callbacks registered but never unregistered&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;finalize()&lt;/code&gt; methods delaying collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Node.js Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable the inspector&lt;/span&gt;
&lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nx"&gt;inspect&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;

&lt;span class="c1"&gt;// Then in Chrome DevTools → Memory → Heap Snapshot&lt;/span&gt;
&lt;span class="c1"&gt;// Take 3 snapshots: baseline, after 10 min, after 20 min&lt;/span&gt;
&lt;span class="c1"&gt;// Compare to find retained objects&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Node culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event emitter listeners that accumulate&lt;/li&gt;
&lt;li&gt;Closures holding references to large objects&lt;/li&gt;
&lt;li&gt;Unbounded caches (remember, Node has no built-in LRU)&lt;/li&gt;
&lt;li&gt;Stream buffers not being drained&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;
&lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#... run the leaky operation...
&lt;/span&gt;
&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;top_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lineno&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_stats&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;code&gt;memory_profiler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory_profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="nd"&gt;@profile&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;suspect_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="c1"&gt;# code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Python culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global lists/dicts growing unbounded&lt;/li&gt;
&lt;li&gt;Reference cycles with &lt;code&gt;__del__&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;C extensions leaking (hardest to find)&lt;/li&gt;
&lt;li&gt;Pandas DataFrames kept around too long&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Cache Leak Special Case
&lt;/h2&gt;

&lt;p&gt;The most common "leak" isn't a leak at all. It's a cache without eviction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD: unbounded
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD: bounded LRU
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always bound your caches. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fragmentation in Go
&lt;/h2&gt;

&lt;p&gt;Go's garbage collector can leave the heap fragmented. You see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime memory is high&lt;/li&gt;
&lt;li&gt;Heap profile shows low allocations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime.GC()&lt;/code&gt; doesn't reduce usage much&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution: tune &lt;code&gt;GOGC&lt;/code&gt; or force memory release:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"runtime/debug"&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetGCPercent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// More aggressive GC&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FreeOSMemory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// Return memory to OS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Long-Running Service Pattern
&lt;/h2&gt;

&lt;p&gt;Services that run for weeks without restart accumulate cruft. Even without leaks.&lt;/p&gt;

&lt;p&gt;We use this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deployment_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;max_uptime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7d&lt;/span&gt;
&lt;span class="na"&gt;restart_schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.&lt;/p&gt;

&lt;p&gt;This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic Checklist
&lt;/h2&gt;

&lt;p&gt;When a service is suspected of leaking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is memory growing linearly or logarithmically? (linear = real leak; a sketch follows this list)&lt;/li&gt;
&lt;li&gt;Is GC frequency/duration increasing? (yes = real pressure)&lt;/li&gt;
&lt;li&gt;Are request rates growing proportionally? (yes = normal growth, not leak)&lt;/li&gt;
&lt;li&gt;Take heap profile, save baseline&lt;/li&gt;
&lt;li&gt;Wait 1 hour, take second profile, diff&lt;/li&gt;
&lt;li&gt;Look for unexpected high-count objects&lt;/li&gt;
&lt;li&gt;Trace back to allocation site&lt;/li&gt;
&lt;li&gt;Fix the leak, deploy, watch metrics for 24h&lt;/li&gt;
&lt;/ol&gt;
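
&lt;p&gt;For step 1, a rough stdlib-only sketch: fit the samples against time and against log-time, and see which correlates better (the sample data here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from statistics import correlation  # Python 3.10+

# (hours since start, RSS in MB) pulled from your metrics store
hours = [1, 2, 4, 8, 16, 24]
rss_mb = [510, 540, 600, 720, 960, 1200]

lin = correlation(hours, rss_mb)
log = correlation([math.log(h) for h in hours], rss_mb)
print(f"linear r={lin:.3f}, log r={log:.3f}")
if lin &amp;gt; log:
    print("Growth looks linear -- treat it as a real leak.")
else:
    print("Growth is flattening -- more likely a cache filling up.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;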

&lt;p&gt;Rinse and repeat. Memory leaks are annoying but systematically fixable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>memory</category>
      <category>sre</category>
      <category>performance</category>
    </item>
    <item>
      <title>CI/CD Reliability: When Your Deploy Pipeline is Your SPOF</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 01 May 2026 14:23:02 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</link>
      <guid>https://dev.to/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</guid>
      <description>&lt;h2&gt;
  
  
  The Invisible SPOF
&lt;/h2&gt;

&lt;p&gt;Every engineering org has a single point of failure that nobody lists on their risk registry: &lt;strong&gt;the deploy pipeline itself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.&lt;/p&gt;

&lt;p&gt;We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Categorizing the Risk
&lt;/h2&gt;

&lt;p&gt;Your pipeline consists of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source_control&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub, GitLab, Bitbucket&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PRs"&lt;/span&gt;

&lt;span class="na"&gt;ci_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub Actions, CircleCI, self-hosted&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;builds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run"&lt;/span&gt;

&lt;span class="na"&gt;artifact_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ECR, Artifactory, S3&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;push"&lt;/span&gt;

&lt;span class="na"&gt;deployment_controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ArgoCD, Flux, Spinnaker&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploys&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;happen"&lt;/span&gt;

&lt;span class="na"&gt;cluster_api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# k8s API, cloud provider API&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;change"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is a failure domain. A serious pipeline needs fallback plans for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manual Escape Hatch
&lt;/h2&gt;

&lt;p&gt;Rule #1: &lt;strong&gt;You must have a documented path to deploy manually.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not for daily use; for emergencies. Every team should know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to build the image locally&lt;/li&gt;
&lt;li&gt;How to push to the registry&lt;/li&gt;
&lt;li&gt;How to update the cluster without the normal pipeline&lt;/li&gt;
&lt;li&gt;Who has permission to do this in production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.&lt;/p&gt;
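
&lt;p&gt;A sketch of what that manual path can look like, assuming Docker, a registry you can push to, and kubectl access; every name here is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run(cmd: str):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def manual_deploy(service: str, tag: str):
    image = f"registry.example.com/{service}:{tag}"
    run(f"docker build -t {image} .")                # 1. build locally
    run(f"docker push {image}")                      # 2. push to registry
    run(f"kubectl set image deployment/{service} "   # 3. bypass the pipeline
        f"{service}={image}")
    run(f"kubectl rollout status deployment/{service}")

# manual_deploy("api", "hotfix-2026-05-01")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;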

&lt;h2&gt;
  
  
  Hardening the Pipeline Itself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pin your dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@main&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;actions/checkout@main&lt;/code&gt; breaks, your deploys break. Pin to versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Mirror your registries&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/yourorg&lt;/span&gt;
&lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ecr.amazonaws.com/yourorg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Monitor the pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You probably monitor your services. Do you monitor your CI?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pipeline_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;build_success_rate (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;99%)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deploy_duration_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;time_to_rollback_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;2 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;runner_queue_depth (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on these the same way you'd alert on a service.&lt;/p&gt;
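
&lt;p&gt;A sketch of computing two of those metrics from CI run records; the record shape is an assumption, so pull the real fields from your CI provider's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;runs = [
    {"status": "success", "deploy_seconds": 210},
    {"status": "success", "deploy_seconds": 180},
    {"status": "failure", "deploy_seconds": 0},
    {"status": "success", "deploy_seconds": 460},
]

success_rate = sum(r["status"] == "success" for r in runs) / len(runs)
durations = sorted(r["deploy_seconds"] for r in runs
                   if r["status"] == "success")
p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]

print(f"build_success_rate: {success_rate:.1%}")  # alert if &amp;lt;99%
print(f"deploy_duration_p99: {p99}s")             # alert if &amp;gt;300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;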

&lt;p&gt;&lt;strong&gt;4. Test disaster modes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you ship if GitHub Actions is down?&lt;br&gt;
Can you ship if the main registry is unreachable?&lt;br&gt;
Can you ship if ArgoCD is down?&lt;/p&gt;

&lt;p&gt;If the answer is "no", you have undocumented SPOFs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Rollback Rule
&lt;/h2&gt;

&lt;p&gt;Every deploy must be reversible in under 2 minutes. No exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;time_to_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15 minutes&lt;/span&gt;
&lt;span class="na"&gt;time_to_rollback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your rollback takes longer than your deploy, your pipeline is backwards.&lt;/p&gt;

&lt;p&gt;How to achieve fast rollbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep the previous image running in parallel during deploys&lt;/li&gt;
&lt;li&gt;Use traffic-shifting deploys (ALB weights, Istio)&lt;/li&gt;
&lt;li&gt;Label every image with the git commit&lt;/li&gt;
&lt;li&gt;Never rely on an untested rollback path&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Deploy Freeze
&lt;/h2&gt;

&lt;p&gt;Some teams never deploy on Fridays. This is cargo culting.&lt;/p&gt;

&lt;p&gt;The real rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't deploy when the on-call person is asleep&lt;/li&gt;
&lt;li&gt;Don't deploy during peak traffic windows&lt;/li&gt;
&lt;li&gt;Don't deploy major changes during holidays&lt;/li&gt;
&lt;li&gt;DO deploy hotfixes anytime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.&lt;/p&gt;

&lt;p&gt;A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Provider Strategy
&lt;/h2&gt;

&lt;p&gt;Big-ticket item: run critical workloads on CI from a different vendor than your code host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code: GitHub
CI: CircleCI (not GitHub Actions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When GitHub Actions is down (it happens twice a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.&lt;/p&gt;

&lt;p&gt;This doubles your CI bill but removes a major SPOF.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Break Glass" Deploy
&lt;/h2&gt;

&lt;p&gt;Every pipeline should have an emergency bypass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal deploy (takes 15 minutes, runs all tests)&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Break-glass deploy (skips tests, full audit log, Slack alert)&lt;/span&gt;
./deploy.sh &lt;span class="nt"&gt;--break-glass&lt;/span&gt; &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"Fixing P1 incident #1234"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The break-glass path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires written justification&lt;/li&gt;
&lt;li&gt;Skips long-running tests&lt;/li&gt;
&lt;li&gt;Notifies the whole team&lt;/li&gt;
&lt;li&gt;Writes to a permanent audit log&lt;/li&gt;
&lt;li&gt;Can only be used while an incident is in progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metric That Matters Most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean Time to Deploy a Hotfix (MTTDHF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From "we need to fix this" to "fix is in production" how long?&lt;/p&gt;

&lt;p&gt;Good: under 30 minutes&lt;br&gt;
Great: under 10 minutes&lt;br&gt;
Unicorn: under 5 minutes&lt;/p&gt;

&lt;p&gt;Track this. Optimize it. It's the most important reliability metric nobody talks about.&lt;/p&gt;
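
&lt;p&gt;A sketch of tracking it, assuming you log two timestamps per hotfix: when the fix was decided on, and when it reached production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%dT%H:%M"
hotfixes = [
    ("2026-04-02T10:05", "2026-04-02T10:27"),  # decided -&amp;gt; in production
    ("2026-04-19T14:40", "2026-04-19T14:49"),
]

def minutes(start: str, end: str) -&amp;gt; float:
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

mttdhf = mean(minutes(s, e) for s, e in hotfixes)
print(f"MTTDHF: {mttdhf:.1f} minutes")  # good &amp;lt;30, great &amp;lt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;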

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Your pipeline is production infrastructure. Treat it with the same respect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor it&lt;/li&gt;
&lt;li&gt;Back it up&lt;/li&gt;
&lt;li&gt;Test failure modes&lt;/li&gt;
&lt;li&gt;Document manual paths&lt;/li&gt;
&lt;li&gt;Never let it become a SPOF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it breaks during an incident, you'll be very glad you did.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>Multi-Region Failover: Lessons from Running It Hot</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:47:08 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</link>
      <guid>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</guid>
      <description>&lt;h2&gt;
  
  
  Why "Hot" Matters
&lt;/h2&gt;

&lt;p&gt;Three multi-region strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold&lt;/strong&gt;: Backup region is off. You start it when primary fails. RTO: hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm&lt;/strong&gt;: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot&lt;/strong&gt;: Both regions serve live traffic simultaneously. RTO: seconds.&lt;/p&gt;

&lt;p&gt;If you need under 15 minutes RTO, you need hot. Everything else is marketing copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Warm Failover
&lt;/h2&gt;

&lt;p&gt;Warm sounds great on paper. In practice, on the day you need it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warm region has never seen real load&lt;/li&gt;
&lt;li&gt;DNS cache propagation takes 5-15 minutes&lt;/li&gt;
&lt;li&gt;Autoscaling lags because it's starting cold&lt;/li&gt;
&lt;li&gt;Your team has never run on the warm region&lt;/li&gt;
&lt;li&gt;Half your connection strings are hardcoded to the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Hot: The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│ DNS / CDN │
└──────┬──────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │
│ 50% TX │ │ 50% TX │
└────┬─────┘ └─────┬────┘
│ │
└──────┬──────┬────────┘
▼ ▼
┌─────────────┐
│ Shared DB │
│ (writer + │
│ replicas) │
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions always serve traffic. Split is usually 50/50 but can shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Database replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where multi-region gets hard. Three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single writer, multi-region readers&lt;/strong&gt;: simplest, but writes pay cross-region latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-master&lt;/strong&gt;: truly hot, but complex; requires conflict resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-sharded&lt;/strong&gt;: users are pinned to a region for writes; simplest if your data model allows it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use region-sharded for user-scoped data and single-writer for global config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Session stickiness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user's session is in Region A, and their next request goes to Region B, things break.&lt;/p&gt;

&lt;p&gt;Solutions (a pinning sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens with no server state&lt;/li&gt;
&lt;li&gt;Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)&lt;/li&gt;
&lt;li&gt;Cookie routing that pins a user to a region&lt;/li&gt;
&lt;/ul&gt;
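
&lt;p&gt;For the cookie-routing option, a sketch of deterministic pinning: hash the user ID so every edge node picks the same region without shared state (the region names are examples, and failover logic must be able to override the pin):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

REGIONS = ["us-east-1", "eu-west-1"]

def pin_region(user_id: str) -&amp;gt; str:
    digest = hashlib.sha256(user_id.encode()).digest()
    return REGIONS[digest[0] % len(REGIONS)]

print(pin_region("user-4217"))  # same region every time for this user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;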

&lt;p&gt;&lt;strong&gt;3. Cache coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Region A's cache doesn't know when Region B updates the database. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short TTLs (1-5 minutes) and accept the inconsistency&lt;/li&gt;
&lt;li&gt;Pub/sub cache invalidation across regions (complex)&lt;/li&gt;
&lt;li&gt;Read-through caches only, never write-through&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Failover Mechanics
&lt;/h2&gt;

&lt;p&gt;When Region A dies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks detect failure&lt;/strong&gt;: Route 53/ALB removes Region A from DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifts to Region B&lt;/strong&gt;: already warm, already running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling kicks in&lt;/strong&gt;: Region B doubles capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User sessions degrade gracefully&lt;/strong&gt;: re-authentication, cache warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring reports the shift&lt;/strong&gt;: the team gets paged, not customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing It Monthly
&lt;/h2&gt;

&lt;p&gt;If you don't test failover monthly, you don't have failover. You have hope.&lt;/p&gt;

&lt;p&gt;We do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Tuesday of every month, 10 AM&lt;/li&gt;
&lt;li&gt;Route 100% of traffic to Region B for 30 minutes&lt;/li&gt;
&lt;li&gt;Watch dashboards, fix anything that degrades&lt;/li&gt;
&lt;li&gt;Route back to 50/50&lt;/li&gt;
&lt;li&gt;Document any issues, fix them&lt;/li&gt;
&lt;li&gt;Repeat next month with the other region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.&lt;/p&gt;

&lt;p&gt;The question is: what's your revenue per hour of downtime?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.&lt;/p&gt;
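
&lt;p&gt;The break-even check is one line of arithmetic; a sketch with made-up inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def hot_is_worth_it(revenue_per_hour: float,
                    downtime_hours_avoided_per_year: float,
                    extra_compute_per_month: float) -&amp;gt; bool:
    """Hot wins when the downtime cost it avoids beats the extra spend."""
    avoided = revenue_per_hour * downtime_hours_avoided_per_year
    return avoided &amp;gt; extra_compute_per_month * 12

print(hot_is_worth_it(100_000, 8, 50_000))  # Company A: True
print(hot_is_worth_it(1_000, 8, 50_000))    # Company B: False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;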

&lt;h2&gt;
  
  
  The Operational Complexity Tax
&lt;/h2&gt;

&lt;p&gt;Running hot costs more than money. It costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More runbooks&lt;/strong&gt; (one per region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More monitoring&lt;/strong&gt; (cross-region latency, replication lag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder debugging&lt;/strong&gt; ("which region was this request in?")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compliance surface&lt;/strong&gt; (data residency in each region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More deployment pipelines&lt;/strong&gt; (usually)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget 20% more engineering time for multi-region from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure in DNS config&lt;/strong&gt;: your DNS provider becomes the new SPOF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing only with healthy traffic&lt;/strong&gt;: test with 2x normal load during drills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting about databases&lt;/strong&gt;: DB failover is the hardest part&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using regions as backup, not active&lt;/strong&gt;: never tested until a crisis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning for split-brain&lt;/strong&gt;: what if both regions think they're primary?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Hot Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Two regions, stateless app tier, 50/50 traffic&lt;/li&gt;
&lt;li&gt;Database: multi-AZ primary, cross-region async replica&lt;/li&gt;
&lt;li&gt;CDN/DNS: health-check-based routing&lt;/li&gt;
&lt;li&gt;Session: JWT-based (stateless)&lt;/li&gt;
&lt;li&gt;Monthly failover drills&lt;/li&gt;
&lt;li&gt;Runbooks tested in last 90 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start there. Layer in complexity as you need it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>multiregion</category>
      <category>failover</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Multi-Region Failover: Lessons from Running It Hot</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:08:47 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-1h8g</link>
      <guid>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-1h8g</guid>
      <description>&lt;h2&gt;
  
  
  Why "Hot" Matters
&lt;/h2&gt;

&lt;p&gt;Three multi-region strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold&lt;/strong&gt;: Backup region is off. You start it when primary fails. RTO: hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm&lt;/strong&gt;: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot&lt;/strong&gt;: Both regions serve live traffic simultaneously. RTO: seconds.&lt;/p&gt;

&lt;p&gt;If you need under 15 minutes RTO, you need hot. Everything else is marketing copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Warm Failover
&lt;/h2&gt;

&lt;p&gt;Warm sounds great on paper. In practice, on the day you need it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warm region has never seen real load&lt;/li&gt;
&lt;li&gt;DNS cache propagation takes 5-15 minutes&lt;/li&gt;
&lt;li&gt;Autoscaling lags because it's starting cold&lt;/li&gt;
&lt;li&gt;Your team has never run on the warm region&lt;/li&gt;
&lt;li&gt;Half your connection strings are hardcoded to the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Hot: The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│ DNS / CDN │
└──────┬──────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │
│ 50% TX │ │ 50% TX │
└────┬─────┘ └─────┬────┘
│ │
└──────┬──────┬────────┘
▼ ▼
┌─────────────┐
│ Shared DB │
│ (writer + │
│ replicas) │
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions always serve traffic. Split is usually 50/50 but can shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Database replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where multi-region gets hard. Three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single writer, multi-region readers&lt;/strong&gt;: simplest, but writes pay cross-region latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-master&lt;/strong&gt;: complex, but truly hot requires conflict resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-sharded&lt;/strong&gt;: users pinned to a region for writes, simplest if your data model allows it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use region-sharded for user-scoped data and single-writer for global config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Session stickiness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user's session is in Region A, and their next request goes to Region B, things break.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens with no server state&lt;/li&gt;
&lt;li&gt;Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)&lt;/li&gt;
&lt;li&gt;Cookie routing that pins a user to a region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Cache coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Region A's cache doesn't know when Region B updates the database. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short TTLs (1-5 minutes) and accept the inconsistency&lt;/li&gt;
&lt;li&gt;Pub/sub cache invalidation across regions (complex)&lt;/li&gt;
&lt;li&gt;Read-through caches only, never write-through&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Failover Mechanics
&lt;/h2&gt;

&lt;p&gt;When Region A dies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks detect failure&lt;/strong&gt; route53/ALB removes Region A from DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifts to Region B&lt;/strong&gt; already warm, already running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling kicks in&lt;/strong&gt; Region B doubles capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User sessions degrade gracefully&lt;/strong&gt; re-authentication, cache warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring reports the shift&lt;/strong&gt; team gets paged, not customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing It Monthly
&lt;/h2&gt;

&lt;p&gt;If you don't test failover monthly, you don't have failover. You have hope.&lt;/p&gt;

&lt;p&gt;We do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Tuesday of every month, 10 AM&lt;/li&gt;
&lt;li&gt;Route100% of traffic to Region B for 30 minutes&lt;/li&gt;
&lt;li&gt;Watch dashboards, fix anything that degrades&lt;/li&gt;
&lt;li&gt;Route back to 50/50&lt;/li&gt;
&lt;li&gt;Document any issues, fix them&lt;/li&gt;
&lt;li&gt;Repeat next month with the other region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.&lt;/p&gt;

&lt;p&gt;The question is: what's your revenue per hour of downtime?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Operational Complexity Tax
&lt;/h2&gt;

&lt;p&gt;Running hot costs more than money. It costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More runbooks&lt;/strong&gt; (one per region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More monitoring&lt;/strong&gt; (cross-region latency, replication lag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder debugging&lt;/strong&gt; ("which region was this request in?")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compliance surface&lt;/strong&gt; (data residency, each region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More deployment pipelines&lt;/strong&gt; (usually)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget 20% more engineering time for multi-region from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure in DNS config&lt;/strong&gt; your DNS provider becomes the new SPOF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing only with healthy traffic&lt;/strong&gt; test with 2x normal load during drills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting about databases&lt;/strong&gt; DB failover is the hardest part&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using regions as backup, not active&lt;/strong&gt; never tested until crisis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning for split-brain&lt;/strong&gt; what if both regions think they're primary?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Hot Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Two regions, stateless app tier, 50/50 traffic&lt;/li&gt;
&lt;li&gt;Database: multi-AZ primary, cross-region async replica&lt;/li&gt;
&lt;li&gt;CDN/DNS: health-check-based routing&lt;/li&gt;
&lt;li&gt;Session: JWT-based (stateless)&lt;/li&gt;
&lt;li&gt;Monthly failover drills&lt;/li&gt;
&lt;li&gt;Runbooks tested in last 90 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start there. Layer in complexity as you need it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>multiregion</category>
      <category>failover</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Disaster Recovery Drills That Actually Work</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:45:56 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</link>
      <guid>https://dev.to/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</guid>
      <description>&lt;h2&gt;
  
  
  Most DR Drills Are Theater
&lt;/h2&gt;

&lt;p&gt;Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.&lt;/p&gt;

&lt;p&gt;Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.&lt;/p&gt;

&lt;p&gt;Real DR drills test whether your team can actually recover, not whether they can talk about recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Levels of DR Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Tabletop&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through a scenario on paper&lt;/li&gt;
&lt;li&gt;Identify missing runbooks&lt;/li&gt;
&lt;li&gt;Find ownership gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: New team members, initial gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Doesn't prove anything actually works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Partial Failure Test&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail one component in staging&lt;/li&gt;
&lt;li&gt;Watch recovery happen with real tools&lt;/li&gt;
&lt;li&gt;Time the full recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Validating specific runbooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Staging ≠ production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Full Production Drill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail a real production component&lt;/li&gt;
&lt;li&gt;Customer-facing (announce a maintenance window if needed)&lt;/li&gt;
&lt;li&gt;Full team responds as if it's real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Proving you can recover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Scary, high-coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real DR Drill Scenario
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Primary database becomes unreachable&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (48 hours before)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedule window with product team&lt;/li&gt;
&lt;li&gt;Pick a time when customer impact is minimal&lt;/li&gt;
&lt;li&gt;Brief the team: "Something will fail tomorrow, respond as normal"&lt;/li&gt;
&lt;li&gt;Pre-position the incident commander&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At T+0, block network access to primary database via iptables&lt;/li&gt;
&lt;li&gt;Start stopwatch&lt;/li&gt;
&lt;li&gt;Watch the team respond&lt;/li&gt;
&lt;li&gt;Do NOT intervene or give hints&lt;/li&gt;
&lt;li&gt;Document every action, every decision, every delay&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Metrics to capture&lt;/strong&gt; (a timing sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to detection (first alert fires)&lt;/li&gt;
&lt;li&gt;Time to engagement (first human acknowledges)&lt;/li&gt;
&lt;li&gt;Time to diagnosis ("we know what's wrong")&lt;/li&gt;
&lt;li&gt;Time to mitigation (customer impact stops)&lt;/li&gt;
&lt;li&gt;Time to recovery (fully restored)&lt;/li&gt;
&lt;/ul&gt;
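
&lt;p&gt;A minimal sketch for capturing those five timestamps during the drill and turning them into the metrics above; the field names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from datetime import datetime

@dataclass
class DrillClock:
    fault_injected: datetime
    first_alert: datetime
    first_ack: datetime
    diagnosed: datetime
    impact_stopped: datetime
    fully_restored: datetime

    def report(self):
        # Everything is measured from the moment the fault was injected.
        def minutes(end):
            return round((end - self.fault_injected).total_seconds() / 60, 1)
        return {
            "time_to_detection_min": minutes(self.first_alert),
            "time_to_engagement_min": minutes(self.first_ack),
            "time_to_diagnosis_min": minutes(self.diagnosed),
            "time_to_mitigation_min": minutes(self.impact_stopped),
            "time_to_recovery_min": minutes(self.fully_restored),
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
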

&lt;h2&gt;
  
  
  Scoring the Drill
&lt;/h2&gt;

&lt;p&gt;Good scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection: under 2 minutes&lt;/li&gt;
&lt;li&gt;Engagement: under 5 minutes&lt;/li&gt;
&lt;li&gt;Diagnosis: under 15 minutes&lt;/li&gt;
&lt;li&gt;Mitigation: under 30 minutes (for DR scenarios)&lt;/li&gt;
&lt;li&gt;Recovery: depends on scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are 5x longer than target, you have a real problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Always Goes Wrong
&lt;/h2&gt;

&lt;p&gt;In every DR drill we run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runbook is out of date.&lt;/strong&gt; The one that worked 6 months ago has wrong commands now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials don't work.&lt;/strong&gt; The service account was rotated and nobody updated the runbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup is untested.&lt;/strong&gt; The restore fails because the backup is corrupted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation paths are stale.&lt;/strong&gt; The "DBA on-call" has left the company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies are missing.&lt;/strong&gt; The recovery playbook assumes Service X is up, but Service X depends on the failed component.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every drill uncovers 3-5 of these. Fix them, then drill again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Engineering vs. DR Drills
&lt;/h2&gt;

&lt;p&gt;These are different. Chaos engineering is continuous (daily or weekly) and usually automated. DR drills are scheduled, human-led, and large-scale.&lt;/p&gt;

&lt;p&gt;Chaos engineering answers: "Can we survive small failures routinely?"&lt;/p&gt;

&lt;p&gt;DR drills answer: "Can we survive catastrophic failures at all?"&lt;/p&gt;

&lt;p&gt;You need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blame-Free Rule
&lt;/h2&gt;

&lt;p&gt;DR drills expose weaknesses. Those weaknesses are process problems, not people problems.&lt;/p&gt;

&lt;p&gt;The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No firing based on drill performance&lt;/li&gt;
&lt;li&gt;No promotions based on "being the hero"&lt;/li&gt;
&lt;li&gt;Focus on process gaps, not individual failures&lt;/li&gt;
&lt;li&gt;The post-drill retrospective is 90% about fixing systems, 10% about training people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the team is afraid of the drill, you'll never learn anything real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequency That Actually Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also: after every major infrastructure change, drill the affected components.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Lesson
&lt;/h2&gt;

&lt;p&gt;The drill is the easy part. The hard part is &lt;strong&gt;making the fixes from the drill a priority&lt;/strong&gt; when there's feature pressure.&lt;/p&gt;

&lt;p&gt;We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you've never done a DR drill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one scenario (database failure, region outage, API gateway down)&lt;/li&gt;
&lt;li&gt;Schedule it for a quiet hour&lt;/li&gt;
&lt;li&gt;Run a tabletop first to find the obvious gaps&lt;/li&gt;
&lt;li&gt;Fix those gaps&lt;/li&gt;
&lt;li&gt;Run a partial failure test in staging&lt;/li&gt;
&lt;li&gt;Measure everything&lt;/li&gt;
&lt;li&gt;Run a retro focused on process&lt;/li&gt;
&lt;li&gt;Schedule the next one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dr</category>
      <category>reliability</category>
      <category>sre</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Feature Flags as a Reliability Tool, Not Just an A/B Platform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:11:47 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</link>
      <guid>https://dev.to/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</guid>
      <description>&lt;h2&gt;
  
  
  Most Teams Use Feature Flags Wrong
&lt;/h2&gt;

&lt;p&gt;They wire up LaunchDarkly or Unleash, use it for two A/B tests, then forget about it.&lt;/p&gt;

&lt;p&gt;Meanwhile, their production is full of &lt;code&gt;if (isNewCheckoutEnabled)&lt;/code&gt; blocks that nobody remembers how to toggle.&lt;/p&gt;

&lt;p&gt;Feature flags are not primarily an experimentation tool. They're a &lt;strong&gt;reliability tool&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Value
&lt;/h2&gt;

&lt;p&gt;Feature flags let you &lt;strong&gt;separate deploy from release&lt;/strong&gt;. You ship code to production dark, then turn it on gradually for real users.&lt;/p&gt;

&lt;p&gt;When things break, you flip the switch back in 10 seconds. No rollback, no redeploy, no PR reverts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Reliability Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Kill Switches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every risky new feature ships behind a kill switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;featureFlags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_payment_flow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;newPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;legacyPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the new flow has a bug, you don't rollback. You flip the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gradual Rollouts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;new_search_algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;rollout_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Start at 1% of users&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.tier&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'internal'"&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# Internal users always see it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy to 1%, watch metrics, go to 5%, watch, 25%, 50%, 100%. Takes 2-4 hours per rollout instead of a single risky deploy.&lt;/p&gt;
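
&lt;p&gt;A sketch of that staged loop, assuming a hypothetical flag client and metrics client; &lt;code&gt;set_rollout_percentage&lt;/code&gt; and &lt;code&gt;error_rate_pct&lt;/code&gt; are illustrative names, not a specific SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

STAGES = [1, 5, 25, 50, 100]   # percent of users
SOAK_MINUTES = 30              # watch metrics between stages

def gradual_rollout(flags, metrics, flag_name, baseline_error_pct):
    for pct in STAGES:
        flags.set_rollout_percentage(flag_name, pct)
        time.sleep(SOAK_MINUTES * 60)
        # Abort and mitigate instantly if errors double against baseline.
        if metrics.error_rate_pct(flag_name) &amp;gt; 2 * baseline_error_pct:
            flags.set_rollout_percentage(flag_name, 0)
            raise RuntimeError(f"{flag_name} rolled back at {pct}% rollout")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The instant set-to-zero on a bad stage is what keeps each step cheap to undo.&lt;/p&gt;
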

&lt;p&gt;&lt;strong&gt;3. Circuit Breakers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;external_recommendations_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;automatic_disable_if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;error_rate_above&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5%&lt;/span&gt;
&lt;span class="na"&gt;for_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a downstream service starts failing, the flag auto-disables that feature. Your product degrades gracefully instead of crashing.&lt;/p&gt;
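
&lt;p&gt;A sketch of the check that could sit behind a config like the one above, run every minute by a scheduler; the flags and metrics clients here are assumptions, not a specific product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR_RATE_THRESHOLD_PCT = 5
WINDOW_MINUTES = 5

def circuit_breaker_tick(flags, metrics):
    """Trips the flag when errors stay high for the whole window."""
    rate = metrics.error_rate_pct("external_recommendations_service",
                                  window_minutes=WINDOW_MINUTES)
    if rate &amp;gt; ERROR_RATE_THRESHOLD_PCT and flags.is_enabled("external_recommendations_service"):
        flags.disable("external_recommendations_service")
        # Leave a breadcrumb so responders know why the feature vanished.
        flags.annotate("external_recommendations_service",
                       f"auto-disabled: {rate:.1f}% errors over {WINDOW_MINUTES}m")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
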

&lt;p&gt;&lt;strong&gt;4. Load Shedding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;expensive_realtime_dashboard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;cpu_utilization_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;70%&lt;/span&gt;
&lt;span class="na"&gt;active_users_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under load, disable non-critical features to preserve the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern: Permanent Flags
&lt;/h2&gt;

&lt;p&gt;After a feature is 100% rolled out, the flag should be deleted within 2 weeks. Every flag left in the codebase is technical debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag hygiene rules:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Every flag has an expiration date (90 days max)
- Every flag has an owner in CODEOWNERS
- CI fails if a flag is older than 180 days
- Monthly flag cleanup is part of standard operations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We track "flag count" as a reliability metric. If it grows unbounded, we're doing it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;A solid feature flag system has three parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Definition store&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source of truth for all flags&lt;/li&gt;
&lt;li&gt;Versioned in Git or a managed service (LaunchDarkly, Unleash, GrowthBook)&lt;/li&gt;
&lt;li&gt;Audit log for every change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Client SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-app flag evaluation&lt;/li&gt;
&lt;li&gt;Falls back to defaults if the service is unreachable&lt;/li&gt;
&lt;li&gt;Caches decisions for 60 seconds&lt;/li&gt;
&lt;li&gt;Emits telemetry for flag usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Admin interface&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change flags without deploying code&lt;/li&gt;
&lt;li&gt;See current state across environments&lt;/li&gt;
&lt;li&gt;Role-based access (not everyone can flip prod flags)&lt;/li&gt;
&lt;li&gt;Approval workflow for high-risk flags&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evaluating at the Right Layer
&lt;/h2&gt;

&lt;p&gt;Flags can live at multiple layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CDN edge use for marketing experiments
Load balancer use for blue/green deploys
App server use for feature experiments
Database use for schema migrations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The closer to the edge, the faster the flip: CDN flags propagate in seconds, while database flags can take minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliability Metric
&lt;/h2&gt;

&lt;p&gt;Track: &lt;strong&gt;mean time to mitigate (MTTM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your team can mitigate an incident in under 30 seconds via a feature flag flip, that's a win. If you have to redeploy to mitigate, your reliability is bottlenecked by deploy time.&lt;/p&gt;

&lt;p&gt;Good teams: MTTM under 60 seconds&lt;br&gt;
Great teams: MTTM under 15 seconds&lt;/p&gt;
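
&lt;p&gt;MTTM itself is a plain average over incident records; a minimal sketch, with &lt;code&gt;detected_at&lt;/code&gt; and &lt;code&gt;mitigated_at&lt;/code&gt; as illustrative field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def mean_time_to_mitigate(incidents):
    """Average seconds from detection to mitigation across incidents."""
    deltas = [(i["mitigated_at"] - i["detected_at"]).total_seconds()
              for i in incidents]
    return sum(deltas) / len(deltas)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
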

&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stale flags skew A/B results:&lt;/strong&gt; clean them up after experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flags without defaults cause prod outages:&lt;/strong&gt; every flag must have a safe fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flag flips mid-request cause weird bugs:&lt;/strong&gt; evaluate at request start, cache for the request lifetime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested flags (flags inside flags) are impossible to reason about:&lt;/strong&gt; avoid them&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A Reliability-First Flag Strategy
&lt;/h2&gt;

&lt;p&gt;Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every new feature ships behind a kill switch&lt;/li&gt;
&lt;li&gt;Gradual rollouts for anything touching the critical path&lt;/li&gt;
&lt;li&gt;Circuit breakers for external dependencies&lt;/li&gt;
&lt;li&gt;Flag cleanup is a monthly ritual&lt;/li&gt;
&lt;li&gt;Track MTTM and optimize it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature flags are the most underrated reliability tool in modern engineering. Treat them that way.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>featureflags</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>eBPF for SREs: Observability Without Agents</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:11:20 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</link>
      <guid>https://dev.to/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</guid>
      <description>&lt;h2&gt;
  
  
  The Agent Problem
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring means shipping an agent with every service. That agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds memory overhead&lt;/li&gt;
&lt;li&gt;Needs to be updated&lt;/li&gt;
&lt;li&gt;Gets out of date&lt;/li&gt;
&lt;li&gt;Breaks with kernel upgrades&lt;/li&gt;
&lt;li&gt;Needs instrumentation code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;eBPF says: &lt;strong&gt;what if the kernel itself could emit observability data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What eBPF Actually Is
&lt;/h2&gt;

&lt;p&gt;eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without recompiling or loading modules. It was originally for packet filtering. Now it powers Cilium, Pixie, Falco, and dozens of other tools.&lt;/p&gt;

&lt;p&gt;From an SRE perspective: &lt;strong&gt;you get deep visibility into syscalls, network traffic, process behavior, and filesystem operations with zero code changes to your applications&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Observe
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;every TCP connection (src, dst, bytes, duration)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DNS queries and response times&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TLS handshake failures&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HTTP request/response cycles&lt;/span&gt;

&lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;function call latencies (uprobes)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory allocations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;lock contention&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GC pauses&lt;/span&gt;

&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;syscall audit trails&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;privilege escalations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;suspicious file access&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container escape attempts&lt;/span&gt;

&lt;span class="na"&gt;performance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU scheduling delays&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;I/O wait time per process&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;disk latency histograms&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;page fault patterns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this &lt;strong&gt;without modifying your application code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Example: Detecting Slow HTTP Requests
&lt;/h2&gt;

&lt;p&gt;Traditional approach: instrument your HTTP framework with OpenTelemetry, deploy a collector, ship traces.&lt;/p&gt;

&lt;p&gt;eBPF approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install bpftrace&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;bpftrace

&lt;span class="c"&gt;# Trace every HTTP response larger than 1MB&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bpftrace &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'
uprobe:/usr/lib/libssl.so:SSL_write {
@http_writes[pid] = count();
@http_bytes[comm] = sum(arg2);
}
'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes. No restarts. Real-time visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pixie&lt;/strong&gt; (now part of New Relic)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-instruments every service in your K8s cluster&lt;/li&gt;
&lt;li&gt;No code changes, no sidecars&lt;/li&gt;
&lt;li&gt;Full HTTP, MySQL, Postgres, DNS tracing&lt;/li&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Cilium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network observability + security policy enforcement&lt;/li&gt;
&lt;li&gt;Replaces kube-proxy&lt;/li&gt;
&lt;li&gt;Hubble UI for service-to-service traffic visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Falco&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime security detection&lt;/li&gt;
&lt;li&gt;"Alert if a process inside a container spawns a shell"&lt;/li&gt;
&lt;li&gt;Rules are written in YAML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Parca&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous profiling via eBPF&lt;/li&gt;
&lt;li&gt;See CPU flame graphs across your entire fleet&lt;/li&gt;
&lt;li&gt;Identify the most expensive code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Tracee&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security-focused eBPF tracing&lt;/li&gt;
&lt;li&gt;Detects privilege escalations, cryptojacking, suspicious syscalls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero app code changes&lt;/li&gt;
&lt;li&gt;Near-zero overhead (kernel-level efficiency)&lt;/li&gt;
&lt;li&gt;Unified view across languages (Go, Python, Java, Rust, all seen the same way)&lt;/li&gt;
&lt;li&gt;No agent lifecycle to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Linux 4.14+ (5.0+ preferred)&lt;/li&gt;
&lt;li&gt;Steep learning curve for custom probes&lt;/li&gt;
&lt;li&gt;Limited visibility into in-process logic (you see syscalls, not business logic)&lt;/li&gt;
&lt;li&gt;eBPF verifier rejects programs for subtle reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network debugging&lt;/strong&gt;: "Why is service A slow to reach service B?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security auditing&lt;/strong&gt;: "What containers are making unexpected syscalls?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance profiling&lt;/strong&gt;: "Where is the cluster CPU time actually going?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident forensics&lt;/strong&gt;: "Reconstruct the syscall timeline during the outage"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Is Wrong
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business logic observability:&lt;/strong&gt; you still need OpenTelemetry for spans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application errors:&lt;/strong&gt; your logs and exception tracking still matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region correlation:&lt;/strong&gt; eBPF is node-local&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use eBPF for infrastructure and network. Use OpenTelemetry for application logic. They complement each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Pixie in a dev cluster (1-line install)&lt;/li&gt;
&lt;li&gt;Open the UI, watch real-time HTTP traffic&lt;/li&gt;
&lt;li&gt;Try a bpftrace one-liner to trace a specific syscall&lt;/li&gt;
&lt;li&gt;Read the Cilium + Hubble docs&lt;/li&gt;
&lt;li&gt;Replace one agent-based tool with its eBPF equivalent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The future of observability is kernel-native. Agent-based tools will still exist, but the gap will keep shrinking.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>observability</category>
      <category>linux</category>
      <category>kernel</category>
    </item>
    <item>
      <title>Observability as Code: Managing Dashboards and Alerts with Terraform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:11:06 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</link>
      <guid>https://dev.to/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Click-Ops Dashboards
&lt;/h2&gt;

&lt;p&gt;Your team has 200 dashboards. You don't know who owns them. Half are broken. The rest show yesterday's reality.&lt;/p&gt;

&lt;p&gt;This is click-ops debt, and it compounds faster than code debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability as Code
&lt;/h2&gt;

&lt;p&gt;Every dashboard, alert, and SLO definition should live in a Git repository alongside your service code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"datadog_dashboard"&lt;/span&gt; &lt;span class="s2"&gt;"api_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"API Gateway - Golden Signals"&lt;/span&gt;
&lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Owner: @platform-team"&lt;/span&gt;
&lt;span class="nx"&gt;layout_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ordered"&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Request Rate (per second)"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sum:api.requests{service:gateway}.as_rate()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"P99 Latency"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"max:api.latency{service:gateway}.as_count()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lives next to &lt;code&gt;main.tf&lt;/code&gt; for your service. When you deploy the service, you deploy the observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits That Compound
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ownership is clear.&lt;/strong&gt; The file has a CODEOWNERS entry. PRs require review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dashboards auto-update.&lt;/strong&gt; Renaming a service? Terraform refactor propagates to all dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Drift detection.&lt;/strong&gt; Someone clicked "save as" in the UI and now that dashboard is out of sync. &lt;code&gt;terraform plan&lt;/code&gt; catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Review before production.&lt;/strong&gt; Alert changes go through PR review. No more "who set this threshold?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Tooling by Platform
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataDog/datadog&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog_monitor, datadog_dashboard, datadog_slo&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana_dashboard, grafana_alert_rule&lt;/span&gt;

&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;approach&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YAML files in Git, deployed by ArgoCD&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert rules, recording rules&lt;/span&gt;

&lt;span class="na"&gt;new_relic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic/newrelic&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic_alert_policy, newrelic_dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick one source of truth. Don't mix.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;We have a module that takes a service name and generates a complete observability stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"service_observability"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/observability"&lt;/span&gt;

&lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payment-processor"&lt;/span&gt;
&lt;span class="nx"&gt;team_slack&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"#payments"&lt;/span&gt;
&lt;span class="nx"&gt;severity_map&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;error_rate_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="nx"&gt;p99_latency_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="nx"&gt;saturation_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;slo_targets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9995&lt;/span&gt;
&lt;span class="nx"&gt;latency_p99&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One module call creates: 3 dashboards, 8 alerts, 2 SLOs, a Slack channel binding, and a PagerDuty escalation policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part
&lt;/h2&gt;

&lt;p&gt;The code is easy. The hard part is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Migrating existing click-ops dashboards:&lt;/strong&gt; budget 2 weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting engineers to edit YAML/HCL instead of the UI:&lt;/strong&gt; budget 3 months of reminders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking UI edits:&lt;/strong&gt; some tools let you set dashboards to read-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewing alert changes:&lt;/strong&gt; PR reviewers need context&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern to Avoid
&lt;/h2&gt;

&lt;p&gt;Don't write Terraform for every custom chart an engineer wants. That leads to 500-line dashboard modules nobody understands.&lt;/p&gt;

&lt;p&gt;Instead, define &lt;strong&gt;standard dashboards&lt;/strong&gt; (golden signals, RED/USE, SLO burn rate) as modules. Let engineers add their own custom dashboards in the UI if they want, but mark them as "explore-only" (not alert-worthy).&lt;/p&gt;

&lt;p&gt;Core observability = code. Experimental exploration = UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Strategy
&lt;/h2&gt;

&lt;p&gt;Week 1: Pick 1 service, convert its dashboards to Terraform&lt;br&gt;
Week 2: Add alerts + SLOs to Terraform&lt;br&gt;
Week 3: Delete the UI versions&lt;br&gt;
Week 4: Create a module from the patterns&lt;br&gt;
Month 2: Roll out to 10 more services&lt;br&gt;
Month 3: Require all new services to use the module&lt;/p&gt;

&lt;p&gt;Six months in, your click-ops debt is gone and your observability is reproducible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>observability</category>
      <category>devops</category>
      <category>iac</category>
    </item>
    <item>
      <title>Service Level Objectives for Complex Microservices</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:11:24 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/service-level-objectives-for-complex-microservices-3j14</link>
      <guid>https://dev.to/samson_tanimawo/service-level-objectives-for-complex-microservices-3j14</guid>
      <description>&lt;h2&gt;
  
  
  Why SLOs Break in Microservices
&lt;/h2&gt;

&lt;p&gt;An SLO that works for a monolith often collapses when you distribute the same logic across 30 services. The math of availability is unforgiving.&lt;/p&gt;

&lt;p&gt;If your service depends on 5 others, each at 99.9%, your realistic ceiling is &lt;code&gt;0.999^5 ≈ 99.5%&lt;/code&gt;. That 0.4-point gap eats your entire error budget before your own code even runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Mistakes Teams Make
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Copying the same SLO to every service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 99.9% target means different things on a payment service and on a batch analytics service. Missing one ruins revenue. Missing the other ruins dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Measuring uptime instead of user experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /health&lt;/code&gt; returning 200 is not an SLO. Users don't call &lt;code&gt;/health&lt;/code&gt;. They check out, log in, view pages. Measure those.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Ignoring fan-out&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user request fans out to 8 downstream calls, and one of them has a 99% SLO, your user-facing reliability is capped at 99% no matter how good your code is.&lt;/p&gt;
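
&lt;p&gt;To make the cap concrete, here is the arithmetic for a request fanning out to seven dependencies at 99.9% and one at 99%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;downstream_slos = [0.999] * 7 + [0.99]   # seven strong deps, one weak one

ceiling = 1.0
for slo in downstream_slos:
    ceiling *= slo

print(f"user-facing ceiling: {ceiling:.4%}")   # ~98.31%, below even the weakest dep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
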

&lt;h2&gt;
  
  
  A Practical SLO Framework for Microservices
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;user_journey_slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;critical_path_checkout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.95%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successful_checkouts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;total_checkouts"&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;21.6 minutes / month&lt;/span&gt;

&lt;span class="na"&gt;user_login&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.9%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;43.2 minutes / month&lt;/span&gt;

&lt;span class="na"&gt;background_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.0%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.2 hours / month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice: we define SLOs on &lt;strong&gt;user journeys&lt;/strong&gt;, not services. This is the biggest mental shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composition Rules
&lt;/h2&gt;

&lt;p&gt;When Service A depends on B and C:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A's SLO must account for B + C availability&lt;/li&gt;
&lt;li&gt;If B is 99.9% and C is 99.5%, A's realistic SLO is ~99.4%&lt;/li&gt;
&lt;li&gt;Build a dependency graph and calculate compound availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use a simple rule: &lt;strong&gt;no service promises an SLO tighter than the weakest service it depends on&lt;/strong&gt;.&lt;/p&gt;
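
&lt;p&gt;A sketch of checking that rule mechanically, with the dependency graph as a plain dict and illustrative numbers; it assumes a tree of dependencies (shared ones would need de-duplication):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud"],
    "inventory": [],
    "fraud": [],
}
OWN_AVAILABILITY = {
    "checkout": 0.9999,
    "payments": 0.999,
    "inventory": 0.999,
    "fraud": 0.995,
}

def compound_availability(service):
    """Own availability times that of every transitive dependency."""
    total = OWN_AVAILABILITY[service]
    for dep in DEPENDS_ON[service]:
        total *= compound_availability(dep)
    return total

print(f"checkout ceiling: {compound_availability('checkout'):.4%}")  # ~99.29%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
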

&lt;h2&gt;
  
  
  The Implementation Pattern
&lt;/h2&gt;

&lt;p&gt;Every service exports three Prometheus metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;slo_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;slo_budget_remaining&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;slo_burn_rate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these three, you can compute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current SLO compliance&lt;/li&gt;
&lt;li&gt;Budget remaining (in minutes)&lt;/li&gt;
&lt;li&gt;Burn rate (how fast you're consuming budget)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alert on burn rate, not on individual requests. A 2% error rate for 30 seconds is a blip. A sustained 2% error rate over 10 minutes is an incident.&lt;/p&gt;
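
&lt;p&gt;A sketch of a burn-rate pager in the common multi-window style; the 14.4x threshold is the widely used page-level value for a 1-hour window, taken here as an assumption rather than a prescription:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def burn_rate(error_ratio, slo_target=0.999):
    """1.0 means the budget burns exactly over the full SLO window."""
    return error_ratio / (1 - slo_target)

def should_page(err_1h, err_5m, slo_target=0.999):
    # Both windows must burn fast: the long one proves it's sustained,
    # the short one proves it's still happening. A 30-second blip fails
    # the 1-hour test and wakes nobody.
    return (burn_rate(err_1h, slo_target) &amp;gt; 14.4 and
            burn_rate(err_5m, slo_target) &amp;gt; 14.4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
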

&lt;h2&gt;
  
  
  Budget Policies That Actually Stick
&lt;/h2&gt;

&lt;p&gt;The trick isn't defining SLOs. It's enforcing them. We use a 4-level policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;budget_exhausted: freeze non-critical deploys, notify product
budget_50_pct: feature freeze, focus on reliability
budget_25_pct: normal operations, monitor carefully
budget_healthy: ship new features, experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the budget is exhausted, &lt;strong&gt;product can't ship new features until reliability is restored&lt;/strong&gt;. This alignment between eng and product is what makes SLOs stick.&lt;/p&gt;
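
&lt;p&gt;A sketch of the policy gate, reading the levels above as percent of error budget burned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deploy_policy(budget_burned_pct):
    if budget_burned_pct &amp;gt;= 100:
        return "freeze non-critical deploys, notify product"
    if budget_burned_pct &amp;gt;= 50:
        return "feature freeze, focus on reliability"
    if budget_burned_pct &amp;gt;= 25:
        return "normal operations, monitor carefully"
    return "ship new features, experiment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
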

&lt;h2&gt;
  
  
  Common Anti-Patterns
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SLOs nobody looks at:&lt;/strong&gt; if you don't page on budget burn rate, they're dead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs that never fail:&lt;/strong&gt; if you never breach budget, your targets are too loose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs that always fail:&lt;/strong&gt; if you're always in the red, your targets are unrealistic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs without product buy-in:&lt;/strong&gt; engineering-only SLOs get ignored during feature pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;SLOs are a negotiation tool between engineering and product. Without them, every outage becomes a fight. With them, you have a shared contract about what "good enough" means.&lt;/p&gt;

&lt;p&gt;Start with one critical journey. Measure it for 30 days. Set a realistic SLO. Enforce budget policies. Then add more journeys.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>slo</category>
      <category>microservices</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Building a Culture of Reliability: Beyond the SRE Handbook</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:08:36 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/building-a-culture-of-reliability-beyond-the-sre-handbook-jf</link>
      <guid>https://dev.to/samson_tanimawo/building-a-culture-of-reliability-beyond-the-sre-handbook-jf</guid>
      <description>&lt;h2&gt;
  
  
  You Can't Hire Your Way to Reliability
&lt;/h2&gt;

&lt;p&gt;I've seen companies hire 5 SREs and expect reliability to magically improve. It doesn't. Reliability is a cultural outcome, not a headcount metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliability Maturity Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1: Reactive
"Things break, we fix them."
No SLOs, no error budgets, post-mortems are optional.

Level 2: Aware
"We know what's breaking and how often."
Basic SLOs defined, post-mortems happen, on-call exists.

Level 3: Proactive
"We prevent most issues before they happen."
Error budgets enforced, chaos engineering started,
automated remediation for common issues.

Level 4: Predictive
"We predict and prevent issues we haven't seen yet."
ML-driven anomaly detection, capacity planning,
reliability is a product feature.

Level 5: Systemic
"Reliability is embedded in everything we do."
Every engineer thinks about reliability, every design
doc includes failure modes, every feature has SLOs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most companies are at Level 1-2. Getting to Level 3 is the biggest jump.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars of Reliability Culture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pillar 1: Ownership
&lt;/h3&gt;

&lt;p&gt;Reliability is not the SRE team's job. It's everyone's job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ownership_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;development_teams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Write SLOs for their services&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;On-call for their services (with SRE backup)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Fix production issues in their domain&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Include failure modes in design docs&lt;/span&gt;

&lt;span class="na"&gt;sre_team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Build reliability infrastructure (monitoring, alerting, CI/CD)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Consult on architecture for reliability&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Run chaos engineering program&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Manage cross-cutting reliability projects&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Train development teams on SRE practices&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pillar 2: Learning
&lt;/h3&gt;

&lt;p&gt;Every incident is a learning opportunity. But only if you structure it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_incident_learning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Blameless post-mortem (within 48 hours)
&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_postmortem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Share with entire engineering org
&lt;/span&gt;&lt;span class="nf"&gt;post_to_engineering_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Add to searchable incident database
&lt;/span&gt;&lt;span class="n"&gt;incident_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Extract patterns
&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;incident_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;create_reliability_project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Systemic issue: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Update runbooks
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_knowledge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;update_runbook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_knowledge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
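
&lt;p&gt;The helper functions above are illustrative, not a real API. As a rough sketch of the pattern-extraction step, a minimal in-memory incident database with a naive find_similar (matching on an assumed category tag) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Postmortem:
    service: str
    category: str      # e.g. "config-change", "capacity", "dependency"
    summary: str = ""

class IncidentDB:
    """Minimal in-memory stand-in for the incident database above."""
    def __init__(self):
        self._items = []

    def insert(self, pm):
        self._items.append(pm)

    def find_similar(self, pm):
        # Naive similarity: other incidents that share a failure category.
        return [p for p in self._items if p is not pm and p.category == pm.category]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;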



&lt;h3&gt;
  
  
  Pillar 3: Investment
&lt;/h3&gt;

&lt;p&gt;Reliability work needs dedicated time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Engineering time allocation:
  Feature development: 60%
  Reliability work: 20%
  Tech debt reduction: 10%
  Learning/experimentation: 10%

The 20% reliability budget includes:
- Alert tuning and noise reduction
- Runbook automation
- Chaos experiments
- SLO reviews and adjustments
- On-call process improvements
- Monitoring and observability improvements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Protect this 20%. When leadership pressures you to ship more features instead, show the correlation between reliability investment and incident reduction in your own history.&lt;/p&gt;
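
&lt;p&gt;That case is easy to make with a few lines of code. A minimal sketch, assuming you track monthly reliability hours and incident counts (the numbers below are invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

# Hypothetical monthly data: reliability hours invested vs. incidents that month.
reliability_hours = [10, 15, 25, 40, 60, 80]
incident_counts = [12, 11, 9, 6, 4, 3]

# Pearson correlation (statistics.correlation requires Python 3.10+).
r = statistics.correlation(reliability_hours, incident_counts)
print(f"Pearson r = {r:.2f}")  # strongly negative: more investment, fewer incidents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;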

&lt;h2&gt;
  
  
  Measuring Culture
&lt;/h2&gt;

&lt;p&gt;You can't manage what you can't measure. Cultural metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;reliability_culture_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# Engineering engagement&lt;/span&gt;
&lt;span class="na"&gt;postmortem_attendance_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 80%&lt;/span&gt;
&lt;span class="na"&gt;action_item_completion_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 90%&lt;/span&gt;
&lt;span class="na"&gt;runbook_update_frequency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 2x/month per service&lt;/span&gt;

&lt;span class="c1"&gt;# Design quality&lt;/span&gt;
&lt;span class="na"&gt;design_docs_with_failure_modes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 95%&lt;/span&gt;
&lt;span class="na"&gt;new_services_with_slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target 100%&lt;/span&gt;
&lt;span class="na"&gt;chaos_experiment_frequency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 1x/quarter per service&lt;/span&gt;

&lt;span class="c1"&gt;# Team health&lt;/span&gt;
&lt;span class="na"&gt;oncall_nps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;developer_survey_reliability_confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 4/5&lt;/span&gt;
&lt;span class="na"&gt;sre_team_attrition_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;lt; 10%/year&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
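
&lt;p&gt;Most of these are computable from data you already have. As a hedged sketch, the action-item completion rate could be derived from a ticket export, assuming each item carries a status field (the data shape here is an assumption, not a real tracker API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical post-mortem action items exported from a ticket tracker.
action_items = [
    {"id": "AI-101", "status": "done"},
    {"id": "AI-102", "status": "done"},
    {"id": "AI-103", "status": "open"},
]

done = sum(1 for item in action_items if item["status"] == "done")
rate = done / len(action_items)
print(f"action_item_completion_rate: {rate:.0%} (target &amp;gt; 90%)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;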



&lt;h2&gt;
  
  
  The Quick Wins
&lt;/h2&gt;

&lt;p&gt;If you're starting from Level 1, here are the highest-impact changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: Define SLOs for your top 3 services. Just availability + latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Make post-mortems mandatory and blameless. Use a template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Set up on-call rotation with clear escalation paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4&lt;/strong&gt;: Create a reliability Slack channel. Share learnings daily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 2&lt;/strong&gt;: Start tracking error budgets (see the sketch after this list). Share with product managers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 3&lt;/strong&gt;: Run your first chaos experiment (kill a pod, see what happens).&lt;/li&gt;
&lt;/ol&gt;
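
&lt;p&gt;For the Month 2 step, the error-budget arithmetic is small enough to start in a spreadsheet or a few lines of Python. A minimal sketch, assuming a 30-day window and a 99.9% availability SLO (the downtime figure is an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes

budget_minutes = window_minutes * (1 - slo)    # ~43.2 minutes of allowed downtime
downtime_minutes = 18                          # observed downtime this window (example)

remaining = budget_minutes - downtime_minutes
print(f"budget: {budget_minutes:.1f} min, used: {downtime_minutes} min, "
      f"remaining: {remaining:.1f} min ({remaining / budget_minutes:.0%})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;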

&lt;p&gt;Three months from chaos to competence. Not perfect, but dramatically better.&lt;/p&gt;

&lt;p&gt;If you want to accelerate your reliability culture with AI-powered tools that embed SRE best practices, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>culture</category>
      <category>reliability</category>
      <category>engineering</category>
    </item>
  </channel>
</rss>
