<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samson Tanimawo</title>
    <description>The latest articles on DEV Community by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://dev.to/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>DEV Community: Samson Tanimawo</title>
      <link>https://dev.to/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>Service Maps: The Architectural Clarity Your Team Is Missing</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 03 Jul 2026 00:30:12 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/service-maps-the-architectural-clarity-your-team-is-missing-4f70</link>
      <guid>https://dev.to/samson_tanimawo/service-maps-the-architectural-clarity-your-team-is-missing-4f70</guid>
      <description>&lt;h2&gt;
  
  
  "What Calls What?"
&lt;/h2&gt;

&lt;p&gt;The question that launches a thousand Slack threads. In every microservices architecture I've worked with, nobody has a complete picture of how services interact.&lt;/p&gt;

&lt;p&gt;Service maps fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Static Diagrams
&lt;/h2&gt;

&lt;p&gt;Every team has architecture diagrams. They're all wrong. They were accurate on the day they were created, which was 18 months ago.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reality of static diagrams:
  Created: January 2023 (accurate)
  Updated: March 2023 (still mostly accurate)
  Last touched: March 2023
  Current accuracy: ~40%
  New services since: 12
  Removed services: 3
  Changed dependencies: 27
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Static diagrams are documentation debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-Generated Service Maps
&lt;/h2&gt;

&lt;p&gt;The solution is maps generated from actual traffic. There are three data sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Network Traffic (Infrastructure Level)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Map each Kubernetes service to the TCP connections its pods hold open.
# get_pod_connections() and resolve_ip_to_service() are helpers you supply.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_network_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k8s_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k8s_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_namespaced_service&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;connections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

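    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Skip services with no pod selector (e.g. ExternalName)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;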
        &lt;span class="c1"&gt;# Get pods for this service
&lt;/span&gt;        &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k8s_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_namespaced_pod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;label_selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Check established connections
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pod_connections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# from /proc/net/tcp
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;conns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;target_svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resolve_ip_to_service&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remote_ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_svc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;connections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;target_svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remote_port&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;connections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Distributed Traces (Application Level)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query trace data for service dependencies&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;parent_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;child_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ERROR'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error_rate&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;parent_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child_service&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. API Gateway Logs (Edge Level)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API Gateway → Which external services call which internal services
Load Balancer → Traffic distribution and routing rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What a Good Service Map Shows
&lt;/h2&gt;

&lt;p&gt;Beyond just "A calls B," a useful service map includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service_map_edge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-api&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-service&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests_per_second&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;
    &lt;span class="na"&gt;error_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2%&lt;/span&gt;
    &lt;span class="na"&gt;p99_latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;120ms&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gRPC&lt;/span&gt;
    &lt;span class="na"&gt;circuit_breaker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enabled&lt;/span&gt;
    &lt;span class="na"&gt;retry_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3x with exponential backoff&lt;/span&gt;
    &lt;span class="na"&gt;owner_team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
    &lt;span class="na"&gt;last_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-03-14T15:30:00Z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Color-code by health:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green: &amp;lt; 1% error rate, latency within SLO&lt;/li&gt;
&lt;li&gt;Yellow: Approaching SLO limits&lt;/li&gt;
&lt;li&gt;Red: SLO breached&lt;/li&gt;
&lt;/ul&gt;
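
&lt;p&gt;A minimal sketch of that color rule in Python. The 1% error threshold comes from the list above; the function name, the 80% "approaching" margin, and the 200 ms latency SLO in the example are assumptions for illustration, not a standard.&lt;/p&gt;

```python
def edge_color(error_rate, p99_ms, slo_p99_ms, max_error_rate=0.01, warn_ratio=0.8):
    """Return 'green', 'yellow' or 'red' for one service-map edge.

    error_rate is a fraction (0.002 means 0.2%). warn_ratio is an assumed
    margin: an edge turns yellow once it uses 80% of either budget.
    """
    if error_rate >= max_error_rate or p99_ms > slo_p99_ms:
        return "red"
    if error_rate >= warn_ratio * max_error_rate or p99_ms >= warn_ratio * slo_p99_ms:
        return "yellow"
    return "green"


# The checkout-api to payment-service edge, with an assumed 200 ms latency SLO
print(edge_color(error_rate=0.002, p99_ms=120, slo_p99_ms=200))  # green
```

&lt;p&gt;Keeping the thresholds as parameters lets each team plug in its own SLO instead of sharing one hard-coded limit.&lt;/p&gt;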

&lt;h2&gt;
  
  
  The Incident Response Advantage
&lt;/h2&gt;

&lt;p&gt;During an incident, a live service map immediately answers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Impact scope&lt;/strong&gt;: Which services depend on the broken one?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius&lt;/strong&gt;: How many users are affected?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause direction&lt;/strong&gt;: Is the problem upstream or downstream?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery order&lt;/strong&gt;: Which services need to come back first?
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before service maps (incident response):
  "Is anyone else affected?" → 15 minutes of Slack investigation

After service maps:
  *click on broken service* → instantly see all dependents
  Answer: 4 dependent services, 3 of them degraded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
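
&lt;p&gt;The "see all dependents" click is a graph walk over the map's edges. A rough sketch, assuming edges are stored as (source, target) pairs meaning "source calls target"; the function name and the sample services are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict, deque


def find_dependents(edges, broken):
    """Return every service that directly or transitively calls `broken`.

    edges is an iterable of (source, target) pairs meaning "source calls target".
    """
    callers = defaultdict(set)
    for source, target in edges:
        callers[target].add(source)

    seen = set()
    queue = deque([broken])
    while queue:
        service = queue.popleft()
        for caller in callers[service]:
            if caller not in seen and caller != broken:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical edges for illustration
edges = [
    ("web", "checkout-api"),
    ("checkout-api", "payment-service"),
    ("cart-service", "payment-service"),
    ("payment-service", "postgres"),
]
print(sorted(find_dependents(edges, "postgres")))
# ['cart-service', 'checkout-api', 'payment-service', 'web']
```

&lt;p&gt;A breadth-first walk with a visited set also copes with circular dependencies, which real service graphs tend to have.&lt;/p&gt;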



&lt;h2&gt;
  
  
  Start Simple
&lt;/h2&gt;

&lt;p&gt;You don't need a fancy tool to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Export your K8s services and network policies&lt;/li&gt;
&lt;li&gt;Add trace-based dependencies&lt;/li&gt;
&lt;li&gt;Render with a simple graph library&lt;/li&gt;
&lt;li&gt;Update automatically via cron job&lt;/li&gt;
&lt;/ol&gt;
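
&lt;p&gt;Steps 2 and 3 can be sketched in a few lines: merge the two edge lists and emit Graphviz DOT text. The function names and sample edges below are made up for illustration, and DOT is just one convenient output format.&lt;/p&gt;

```python
def merge_edges(network_edges, trace_edges):
    """Union two lists of (source, target) pairs, remembering where each came from."""
    merged = {}
    for origin, edges in (("network", network_edges), ("traces", trace_edges)):
        for source, target in edges:
            merged.setdefault((source, target), set()).add(origin)
    return merged


def to_dot(merged):
    """Render the merged edges as Graphviz DOT text."""
    lines = ["digraph services {"]
    for (source, target), origins in sorted(merged.items()):
        label = "+".join(sorted(origins))
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical inputs
network = [("checkout-api", "payment-service")]
traces = [("checkout-api", "payment-service"), ("checkout-api", "cart-service")]
print(to_dot(merge_edges(network, traces)))
```

&lt;p&gt;Save the output to a file and run &lt;code&gt;dot -Tsvg services.dot -o services.svg&lt;/code&gt; to get a picture; a cron entry that reruns the script covers step 4.&lt;/p&gt;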

&lt;p&gt;A basic but accurate map beats a fancy but outdated one every time.&lt;/p&gt;

&lt;p&gt;If you want auto-generated, always-accurate service maps with AI-powered impact analysis, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>observability</category>
      <category>sre</category>
    </item>
    <item>
      <title>AI in Incident Response: Hype vs. Reality in 2024</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 02 Jul 2026 13:21:33 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/ai-in-incident-response-hype-vs-reality-in-2024-2ibg</link>
      <guid>https://dev.to/samson_tanimawo/ai-in-incident-response-hype-vs-reality-in-2024-2ibg</guid>
      <description>&lt;h2&gt;
  
  
  Every Vendor Claims AI Magic
&lt;/h2&gt;

&lt;p&gt;Open any monitoring vendor's website and you'll see: "AI-powered incident detection!" "ML-driven root cause analysis!" "Intelligent alerting!"&lt;/p&gt;

&lt;p&gt;After evaluating a dozen AI ops tools and running three in production, here's what actually works and what's snake oil.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Works: Anomaly Detection
&lt;/h2&gt;

&lt;p&gt;ML-based anomaly detection genuinely helps with metrics that have predictable patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Good candidates for ML anomaly detection:
  - Request rate (daily/weekly seasonality)
  - CPU usage (follows traffic patterns)
  - Database connections (predictable daily cycles)
  - Error counts (should be near-zero baseline)

Bad candidates:
  - Deployment metrics (irregular by nature)
  - Batch job durations (vary by data volume)
  - Cache hit rates (depends on traffic mix)
  - Anything with frequent step changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is training on enough data. You need at least 2-3 weeks of data for daily patterns, and 6-8 weeks for weekly seasonality.&lt;/p&gt;
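
&lt;p&gt;To make the seasonality point concrete, here is a deliberately simple baseline: compare a new value against past values from the same seasonal slot. This is a plain z-score check standing in for whatever model a given vendor uses; the function name, the threshold of 3, and the sample numbers are assumptions.&lt;/p&gt;

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it sits more than `threshold` standard deviations
    away from the mean of `history`.

    history holds past observations for the same seasonal slot, for example
    the request rate at 09:00 on each of the previous weekdays.
    """
    if 2 > len(history):
        raise ValueError("need at least two past observations")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > threshold


# Made-up request rates for one time slot on seven previous days
same_slot_history = [980, 1010, 995, 1005, 1020, 990, 1000]
print(is_anomalous(same_slot_history, 1015))  # False
print(is_anomalous(same_slot_history, 1400))  # True
```

&lt;p&gt;Each slot needs several past observations before the spread means anything, which is why a few weeks of data is the practical minimum.&lt;/p&gt;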

&lt;h2&gt;
  
  
  What Works: Alert Correlation
&lt;/h2&gt;

&lt;p&gt;This is where AI delivers real value. When 15 alerts fire simultaneously, AI can group them and identify the probable root cause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw alerts (what the human sees):
  01:15:03  CRITICAL  api-server p99 latency &amp;gt; 2s
  01:15:07  WARNING   postgres connection pool 90%
  01:15:12  CRITICAL  checkout error rate &amp;gt; 5%
  01:15:15  WARNING   redis response time &amp;gt; 100ms
  01:15:18  CRITICAL  payment-service timeout
  01:15:22  WARNING   cart-service p99 &amp;gt; 1s

AI-correlated (what the human should see):
  01:15:03  INCIDENT  Database connection pool exhaustion
    Impact: checkout, payment, cart services degraded
    Probable cause: postgres connection pool at 90%
    Related alerts: 6 (grouped)
    Suggested action: Check for connection leaks, consider pool size increase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the difference between a 45-minute investigation and a 5-minute fix.&lt;/p&gt;
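
&lt;p&gt;The grouping half of correlation can be approximated without any ML. The sketch below only clusters alerts by time proximity, which is far cruder than what a real correlation engine does (it ignores topology and causality); the function names, the 60-second window, and the last sample alert are assumptions.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window_seconds=60):
    """Group alerts so that each alert within `window_seconds` of the
    previous one joins the same candidate incident.

    alerts is a list of (timestamp, severity, message) tuples.
    """
    window = timedelta(seconds=window_seconds)
    groups = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if groups and window >= alert[0] - groups[-1][-1][0]:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def at(clock):
    """Parse an HH:MM:SS clock time on an arbitrary fixed date."""
    return datetime.strptime("2024-01-01 " + clock, "%Y-%m-%d %H:%M:%S")


alerts = [
    (at("01:15:03"), "CRITICAL", "api-server p99 latency above 2s"),
    (at("01:15:07"), "WARNING", "postgres connection pool 90%"),
    (at("01:15:12"), "CRITICAL", "checkout error rate above 5%"),
    (at("01:15:15"), "WARNING", "redis response time above 100ms"),
    (at("01:15:18"), "CRITICAL", "payment-service timeout"),
    (at("01:15:22"), "WARNING", "cart-service p99 above 1s"),
    (at("02:40:00"), "WARNING", "disk usage 85% on a hypothetical batch host"),
]
for group in group_alerts(alerts):
    print(group[0][0].time(), len(group), "alert(s)")
# 01:15:03 6 alert(s)
# 02:40:00 1 alert(s)
```

&lt;p&gt;Picking the probable cause inside a group is the hard part, and that is where dependency data from a service map earns its keep.&lt;/p&gt;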

&lt;h2&gt;
  
  
  What Doesn't Work (Yet): Autonomous Remediation
&lt;/h2&gt;

&lt;p&gt;Vendors love to demo "AI automatically fixed the issue!" In reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-scaling works great (but that's not really AI)&lt;/li&gt;
&lt;li&gt;Auto-rollback works great (also not really AI)&lt;/li&gt;
&lt;li&gt;Actual autonomous root cause analysis and fix? Not reliable enough for production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tested three autonomous remediation products. Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Correct diagnosis: 72%
Correct remediation: 45%
Made things worse: 8%
Did nothing useful: 47%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 45% success rate isn't good enough for production systems. But it IS good enough for suggesting actions to a human.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI-Assisted Sweet Spot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human only:         AI suggests + Human decides:    AI autonomous:
───────────         ────────────────────────────    ──────────────
Slow, error-prone   Fast, accurate                  Fast, risky
at 3am              at 3am                          at 3am

                    ← THIS IS WHERE WE SHOULD BE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The best approach today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI detects the anomaly&lt;/li&gt;
&lt;li&gt;AI correlates related alerts&lt;/li&gt;
&lt;li&gt;AI suggests probable root cause&lt;/li&gt;
&lt;li&gt;AI recommends remediation steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human approves&lt;/strong&gt; the action&lt;/li&gt;
&lt;li&gt;Automation executes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What to Evaluate
&lt;/h2&gt;

&lt;p&gt;When looking at AI ops tools, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What data do you need?&lt;/strong&gt; (If they need 6 months of data to start, that's a red flag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the false positive rate?&lt;/strong&gt; (Anything &amp;gt; 10% will be ignored by your team)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can I see the reasoning?&lt;/strong&gt; (Black-box AI is useless for incident response)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it integrate with my existing tools?&lt;/strong&gt; (If it requires rip-and-replace, walk away)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when the AI is wrong?&lt;/strong&gt; (Good tools show confidence scores)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  My Prediction
&lt;/h2&gt;

&lt;p&gt;In 2-3 years, AI will handle 80% of incidents autonomously. The remaining 20% — novel failures, complex cascading issues — will still need human judgment. But that's fine. Those are the interesting problems.&lt;/p&gt;

&lt;p&gt;If you want to see how AI-assisted incident response actually works in practice, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>incidents</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring Costs Are Out of Control — Here's How to Fix It</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 02 Jul 2026 00:06:24 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/monitoring-costs-are-out-of-control-heres-how-to-fix-it-20j3</link>
      <guid>https://dev.to/samson_tanimawo/monitoring-costs-are-out-of-control-heres-how-to-fix-it-20j3</guid>
      <description>&lt;h2&gt;
  
  
  The $50K/Month Monitoring Bill
&lt;/h2&gt;

&lt;p&gt;I audited our monitoring stack last quarter. The total cost across all tools: $52,000/month. For a company with 200 engineers. That's $260 per engineer per month just to watch our systems.&lt;/p&gt;

&lt;p&gt;Something had to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Money Goes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Breakdown of $52K/month:
  Custom metrics ingestion:  $18,000 (35%)
  Log storage &amp;amp; search:      $14,000 (27%)
  APM/Tracing:               $9,000  (17%)
  Alerting platform:         $4,000  (8%)
  Synthetic monitoring:      $3,000  (6%)
  Dashboards &amp;amp; visualization:$2,000  (4%)
  Status page:               $2,000  (4%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The top two — metrics and logs — were 62% of the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 1: Metrics Cardinality Audit
&lt;/h2&gt;

&lt;p&gt;High-cardinality metrics are the #1 cost driver. One bad label can 10x your bill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This is fine: ~100 time series
http_requests_total{method="GET", status="200", service="api"}

# This is expensive: up to ~10,000,000 time series (100 x 100K)
http_requests_total{method="GET", status="200", service="api", user_id="..."}
#                                                              ^^^^^^^^^
#                                                              100K unique values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
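
&lt;p&gt;The multiplication behind that jump: the worst-case series count for a metric is the product of the distinct values each label can take. A small sketch with assumed cardinalities (the per-label numbers are invented; real counts are usually lower because not every combination occurs).&lt;/p&gt;

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Assumed cardinalities, chosen only to illustrate the multiplication
base = {"method": 5, "status": 10, "service": 2}
with_user = dict(base, user_id=100_000)

print(max_series(base))       # 100
print(max_series(with_user))  # 10000000
```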



&lt;p&gt;I wrote a script to find high-cardinality offenders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_high_cardinality_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prometheus_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get all metric names
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prometheus_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/v1/label/__name__/values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
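    &lt;span class="c1"&gt;# Fail loudly on HTTP errors instead of hitting a KeyError below
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;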

    &lt;span class="n"&gt;expensive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Count time series per metric
&lt;/span&gt;        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prometheus_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/v1/series&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;match[]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expensive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;series_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expensive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;series_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Found: request_duration_bucket had 2.3M series due to URL path label
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
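
&lt;p&gt;On a large server, pulling every series just to count it can be slow. Prometheus also exposes per-metric series counts at &lt;code&gt;/api/v1/status/tsdb&lt;/code&gt;, so a lighter variant is to read those instead. The sketch below only parses an already-decoded response; the function name is mine and the sample payload is hand-written, so treat the numbers as made up.&lt;/p&gt;

```python
def top_metrics_from_tsdb_status(payload, threshold=10000):
    """Pick metrics above `threshold` series out of a decoded
    /api/v1/status/tsdb response."""
    entries = payload["data"]["seriesCountByMetricName"]
    return [
        {"metric": e["name"], "series_count": e["value"]}
        for e in entries
        if e["value"] > threshold
    ]


# Hand-written sample in the shape the endpoint returns; the numbers are made up
sample = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "request_duration_bucket", "value": 2300000},
            {"name": "http_requests_total", "value": 4200},
        ]
    },
}
print(top_metrics_from_tsdb_status(sample))
```

&lt;p&gt;Feed it the result of &lt;code&gt;requests.get(f"{prometheus_url}/api/v1/status/tsdb").json()&lt;/code&gt;. Note that the endpoint covers the in-memory head block and returns only the top entries by default, so it is a quick triage tool rather than a full audit.&lt;/p&gt;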



&lt;p&gt;We found three metrics responsible for 60% of our series count. Fixed them in a day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy 2: Aggregation at Collection
&lt;/h2&gt;

&lt;p&gt;Instead of sending raw metrics and aggregating at query time, aggregate at collection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# rules.yml - recording rules (loaded via rule_files in prometheus.yml)&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aggregations&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Pre-aggregate p99 latency per service (not per pod)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service:http_request_duration:p99&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))&lt;/span&gt;

      &lt;span class="c1"&gt;# Pre-aggregate error rate per service&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service:http_errors:rate5m&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



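&lt;p&gt;Pre-aggregated series are faster to query, and they are cheaper to store once you stop shipping or retaining the raw per-pod series they replace. A recording rule on its own adds series rather than removing them.&lt;/p&gt;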

&lt;h2&gt;
  
  
  Strategy 3: Right-Size Your Retention
&lt;/h2&gt;

&lt;p&gt;Do you really need 13 months of 15-second resolution metrics? Probably not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;retention_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;raw_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7 days&lt;/span&gt;

  &lt;span class="na"&gt;5_minute_rollups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30 days&lt;/span&gt;

  &lt;span class="na"&gt;1_hour_rollups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;13 months&lt;/span&gt;

  &lt;span class="na"&gt;1_day_rollups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1d&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5 years&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
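&lt;p&gt;To see why the rollups matter, it helps to count samples. A rough sketch, assuming one stored value per series per interval, a 30-day month and a 365-day year (real rollups usually keep several aggregates such as min, max and average, so the actual saving will be smaller):&lt;/p&gt;

```python
# Sketch: samples stored per series under the tiered policy versus keeping
# raw 15-second data for the whole 13 months.
# Assumption: a month is counted as 30 days and a year as 365 days.
SECONDS_PER_DAY = 86_400

# (tier name, resolution in seconds, retention in days)
TIERS = [
    ("raw_metrics", 15, 7),
    ("5_minute_rollups", 5 * 60, 30),
    ("1_hour_rollups", 60 * 60, 13 * 30),
    ("1_day_rollups", SECONDS_PER_DAY, 5 * 365),
]


def samples_per_series(resolution_s: int, retention_days: int) -> int:
    """Number of samples one series holds at a given resolution and retention."""
    return retention_days * SECONDS_PER_DAY // resolution_s


tiered_total = sum(samples_per_series(res, days) for _, res, days in TIERS)
raw_13_months = samples_per_series(15, 13 * 30)

for name, res, days in TIERS:
    print(f"{name}: {samples_per_series(res, days):,} samples")
print(f"tiered total: {tiered_total:,}")
print(f"raw for 13 months: {raw_13_months:,}")
print(f"reduction factor: {raw_13_months / tiered_total:.1f}x")
```

&lt;p&gt;Under those assumptions, all four tiers together hold about 37 times fewer samples per series than 13 months of raw data would.&lt;/p&gt;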



&lt;h2&gt;
  
  
  Strategy 4: Eliminate Unused Dashboards
&lt;/h2&gt;

&lt;p&gt;We had 340 dashboards. I checked access logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;340 total dashboards
 82 viewed in last 30 days (24%)
 31 last viewed 31-90 days ago (9%)
227 never viewed in 6 months (67%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;67% of our dashboards were zombie dashboards. Nobody looked at them, but they drove metric queries.&lt;/p&gt;

&lt;p&gt;We archived everything not viewed in 90 days. Savings: $3,200/month from reduced query load.&lt;/p&gt;
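&lt;p&gt;Finding the zombies is just a date comparison. A minimal sketch, assuming you can export a last-viewed timestamp per dashboard from your access logs (the dashboard names, the &lt;code&gt;find_stale_dashboards&lt;/code&gt; helper and the data shape are all hypothetical):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_stale_dashboards(
    last_viewed: dict[str, Optional[datetime]],
    now: datetime,
    max_age_days: int = 90,
) -> list[str]:
    """Return names of dashboards not viewed within max_age_days.

    A value of None means no view was ever recorded.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [
        name
        for name, viewed_at in last_viewed.items()
        if viewed_at is None or cutoff > viewed_at
    ]
    return sorted(stale)


# Hypothetical access-log summary: dashboard name -> last view time.
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
last_viewed = {
    "checkout-latency": now - timedelta(days=2),
    "legacy-batch-jobs": now - timedelta(days=200),
    "q3-migration": None,
    "api-errors": now - timedelta(days=45),
}
print(find_stale_dashboards(last_viewed, now))
```

&lt;p&gt;Archive rather than delete, so a dashboard someone misses can be restored.&lt;/p&gt;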

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: $52,000/month
  Metrics cardinality fix:       -$11,000
  Log tier/sampling changes:     -$8,000
  Dashboard cleanup:             -$3,200
  Retention right-sizing:        -$5,800
  Duplicate tool consolidation:  -$6,000
After: $18,000/month

Annual savings: $408,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we actually improved our observability because the remaining data was higher quality.&lt;/p&gt;

&lt;p&gt;If you want monitoring that's cost-effective by design, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>costs</category>
      <category>observability</category>
    </item>
    <item>
      <title>Hiring SREs: What I Look For After Interviewing 100+ Candidates</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 01 Jul 2026 13:01:47 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/hiring-sres-what-i-look-for-after-interviewing-100-candidates-35ff</link>
      <guid>https://dev.to/samson_tanimawo/hiring-sres-what-i-look-for-after-interviewing-100-candidates-35ff</guid>
      <description>&lt;h2&gt;
  
  
  The SRE Hiring Problem
&lt;/h2&gt;

&lt;p&gt;SRE roles are notoriously hard to fill. The intersection of software engineering, systems administration, and operational wisdom is narrow. After interviewing over 100 candidates across three companies, here's what I've learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Don't Care About
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Years of experience with a specific tool&lt;/li&gt;
&lt;li&gt;Certifications (CKA, AWS SA, etc.)&lt;/li&gt;
&lt;li&gt;Memorized answers about CAP theorem&lt;/li&gt;
&lt;li&gt;Whether they've used Terraform or Pulumi&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools change. Fundamentals don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Look For
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Systematic Debugging (The #1 Skill)
&lt;/h3&gt;

&lt;p&gt;I give every candidate the same scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Users report the app is slow. What do you do?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bad answer: "I'd check the database."&lt;/p&gt;

&lt;p&gt;Good answer: "First, I'd define 'slow' — which endpoints, which users, since when? Then I'd check the golden signals: is latency up? Error rate up? Traffic unusual? Saturation? I'd compare current metrics to baseline and narrow down from there."&lt;/p&gt;

&lt;p&gt;The best candidates think in systems, not components.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Comfort With Ambiguity
&lt;/h3&gt;

&lt;p&gt;SRE work is fundamentally ambiguous. The page fires, you don't know why, and people are waiting.&lt;/p&gt;

&lt;p&gt;I ask: "Tell me about a time you had to make a decision with incomplete information."&lt;/p&gt;

&lt;p&gt;I want to hear about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How they scoped the unknowns&lt;/li&gt;
&lt;li&gt;What heuristics they used&lt;/li&gt;
&lt;li&gt;How they communicated uncertainty&lt;/li&gt;
&lt;li&gt;Whether they revised their approach as they learned more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Writing Ability
&lt;/h3&gt;

&lt;p&gt;SREs write post-mortems, runbooks, design docs, and incident updates. If you can't write clearly under pressure, you'll struggle.&lt;/p&gt;

&lt;p&gt;I include a writing exercise: "Write a 3-sentence status update for a customer-facing outage."&lt;/p&gt;

&lt;p&gt;Good example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We identified an issue affecting login for approximately 15% of users starting at 2:30 PM UTC. Our team has identified the root cause and deployed a fix. We're monitoring to confirm full resolution, which we expect within the next 10 minutes."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Automation Mindset
&lt;/h3&gt;

&lt;p&gt;I ask: "What's the last thing you automated, and why?"&lt;/p&gt;

&lt;p&gt;The "why" matters more than the "what." I want to hear about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifying repetitive toil&lt;/li&gt;
&lt;li&gt;Calculating the ROI of automation&lt;/li&gt;
&lt;li&gt;Choosing the right level of automation&lt;/li&gt;
&lt;li&gt;Knowing when NOT to automate&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Empathy
&lt;/h3&gt;

&lt;p&gt;Surprising for a technical role? SREs are the bridge between development, operations, and business. Empathy means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding why a developer shipped that sketchy PR (deadline pressure)&lt;/li&gt;
&lt;li&gt;Knowing that a product manager asking "when will it be fixed?" is scared, not annoying&lt;/li&gt;
&lt;li&gt;Recognizing that the junior engineer who caused the outage feels terrible already&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Interview Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1 (45 min): Technical screen
  - Systems design: "Design monitoring for a checkout service"
  - Debugging scenario: "Walk me through investigating latency"

Round 2 (60 min): Incident simulation
  - Live scenario: simulated page with fake dashboards
  - Evaluate: systematic thinking, communication, tool usage

Round 3 (45 min): Culture and collaboration
  - Post-mortem discussion: review a real (anonymized) incident
  - Conflict resolution: "Dev team pushes back on your SLO proposal"

Round 4 (30 min): Writing exercise
  - Write a runbook for a given scenario
  - Write a status update for a given incident
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Red Flags
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Blames people instead of systems&lt;/li&gt;
&lt;li&gt;Can't explain things simply&lt;/li&gt;
&lt;li&gt;Never says "I don't know"&lt;/li&gt;
&lt;li&gt;Only talks about tools, never principles&lt;/li&gt;
&lt;li&gt;No interest in learning from incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building an SRE team and want AI to handle the routine work so your engineers focus on what matters, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>hiring</category>
      <category>career</category>
      <category>devops</category>
    </item>
    <item>
      <title>Log Management at Scale: How We Cut Costs 70% Without Losing Signal</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 01 Jul 2026 00:55:38 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/log-management-at-scale-how-we-cut-costs-70-without-losing-signal-1k1j</link>
      <guid>https://dev.to/samson_tanimawo/log-management-at-scale-how-we-cut-costs-70-without-losing-signal-1k1j</guid>
      <description>&lt;h2&gt;
  
  
  $12,000/Month for Logs Nobody Reads
&lt;/h2&gt;

&lt;p&gt;Our logging bill was $12,000/month. We were ingesting 2TB/day. When I asked the team what percentage of logs they actually looked at during incidents, the answer was embarrassing: about 5%.&lt;/p&gt;

&lt;p&gt;We were paying to store 95% noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Log Audit
&lt;/h2&gt;

&lt;p&gt;First, I categorized all log sources by value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;High value (always need during incidents):
  Application errors (stack traces)
  Authentication events
  Business transactions
  External API calls with responses
  Health check failures

Medium value (sometimes useful):
  Request/response logs (sampled)
  Performance metrics in logs
  Deployment events
  Configuration changes

Low value (almost never needed):
  Debug/trace level logs
  Health check successes
  Static asset requests
  Heartbeat messages
  Verbose framework logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 1: Log Levels as a Service
&lt;/h2&gt;

&lt;p&gt;We made log levels dynamic. In production, the default is WARNING. During an incident, we flip the affected service to DEBUG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Initial log level from environment variable (read once at startup)
&lt;/span&gt;&lt;span class="n"&gt;LOG_LEVEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WARNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Endpoint to change log level without restart
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/admin/log-level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_log_level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
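    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Reject anything that is not a standard level name
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEBUG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WARNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;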
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



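&lt;p&gt;In Kubernetes, the same default can be set through the environment variable. Keep in mind that &lt;code&gt;kubectl set env&lt;/code&gt; edits the pod template, so each of these commands triggers a rolling restart; use the endpoint above when the pods need to stay up:&lt;br&gt;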
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal operation&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set env &lt;/span&gt;deployment/api &lt;span class="nv"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;WARNING

&lt;span class="c"&gt;# During incident&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set env &lt;/span&gt;deployment/api &lt;span class="nv"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DEBUG

&lt;span class="c"&gt;# After incident&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set env &lt;/span&gt;deployment/api &lt;span class="nv"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;WARNING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 2: Tiered Retention
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;retention_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hot_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Fast search, expensive&lt;/span&gt;
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7 days&lt;/span&gt;
    &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OR&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tag:business_event"&lt;/span&gt;

  &lt;span class="na"&gt;warm_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Slower search, cheaper&lt;/span&gt;
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30 days&lt;/span&gt;
    &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;INFO"&lt;/span&gt;

  &lt;span class="na"&gt;cold_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Archive only, cheapest&lt;/span&gt;
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;365 days&lt;/span&gt;
    &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag:audit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OR&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tag:compliance"&lt;/span&gt;

  &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Don't store at all&lt;/span&gt;
    &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DEBUG&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OR&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;source:health_check"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
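&lt;p&gt;The filters overlap, so it's worth being explicit about which tiers a given record ends up in. A sketch of one way to evaluate them, assuming records are plain dicts with &lt;code&gt;level&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt; and &lt;code&gt;source&lt;/code&gt; keys and that the drop rule wins over everything else (the &lt;code&gt;tiers_for&lt;/code&gt; helper and the precedence are my assumptions, not a feature of any particular log platform):&lt;/p&gt;

```python
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def tiers_for(record: dict) -> list[str]:
    """Return the storage tiers whose filter matches the record."""
    level = LEVELS[record["level"]]
    tags = set(record.get("tags", ()))
    source = record.get("source", "")

    # Assumption: the drop rule wins over every storage tier.
    if level == LEVELS["DEBUG"] or source == "health_check":
        return []

    tiers = []
    if level >= LEVELS["WARN"] or "business_event" in tags:
        tiers.append("hot_storage")
    if level >= LEVELS["INFO"]:
        tiers.append("warm_storage")
    if tags.intersection({"audit", "compliance"}):
        tiers.append("cold_storage")
    return tiers


print(tiers_for({"level": "ERROR"}))
print(tiers_for({"level": "INFO", "tags": ["audit"]}))
print(tiers_for({"level": "INFO", "source": "health_check"}))
```

&lt;p&gt;If audit events must survive even at DEBUG level, check the cold-storage filter before the drop rule.&lt;/p&gt;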



&lt;h2&gt;
  
  
  Strategy 3: Structured Logging
&lt;/h2&gt;

&lt;p&gt;Unstructured logs are expensive to parse. Structured logs are cheap to query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad: Unstructured
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; purchased &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; for $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Parsing this requires regex, which costs compute
&lt;/span&gt;
&lt;span class="c1"&gt;# Good: Structured
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase_completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# With a JSON formatter on the handler, this emits:
#   {"message": "purchase_completed", "user_id": "u123", ...}
# Queryable without parsing
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
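&lt;p&gt;One caveat: Python's standard &lt;code&gt;logging&lt;/code&gt; module doesn't emit JSON on its own, and the default formatter ignores the &lt;code&gt;extra&lt;/code&gt; fields. A minimal sketch of a formatter that does include them, using only the standard library (the &lt;code&gt;JsonFormatter&lt;/code&gt; class, the logger name and the sample values are illustrative; in production you might prefer an established JSON logging library):&lt;/p&gt;

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else was passed via extra=.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__)
_STANDARD_ATTRS.update({"message", "asctime"})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including extra= fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-JSON types (dates, decimals) from raising.
        return json.dumps(payload, default=str)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("purchase_completed", extra={
    "user_id": "u123", "product_id": "p456", "amount": 19.99, "currency": "USD",
})
print(stream.getvalue().strip())
```

&lt;p&gt;This sketch skips exception tracebacks and timestamps; add them to &lt;code&gt;payload&lt;/code&gt; if your pipeline needs them.&lt;/p&gt;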



&lt;h2&gt;
  
  
  Strategy 4: Sample Verbose Logs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_log_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Always log errors
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="c1"&gt;# Always log slow requests
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="c1"&gt;# Sample 10% of successful requests
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before:
  Daily ingestion: 2 TB
  Monthly cost: $12,000
  Useful data: ~5%

After:
  Daily ingestion: 400 GB
  Monthly cost: $3,600
  Useful data: ~70%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We cut costs by 70% AND improved signal quality. Searches are faster because there's less noise. Incidents resolve quicker because relevant logs surface immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rule
&lt;/h2&gt;

&lt;p&gt;Before adding a log statement, ask: "Will someone look at this during an incident?" If the answer is no, it's DEBUG level at most.&lt;/p&gt;

&lt;p&gt;If you're spending too much on logs and want smarter log management, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>logging</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 30 Jun 2026 13:27:00 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/canary-deployments-the-pattern-that-cut-our-rollback-rate-by-80-38nc</link>
      <guid>https://dev.to/samson_tanimawo/canary-deployments-the-pattern-that-cut-our-rollback-rate-by-80-38nc</guid>
      <description>&lt;h2&gt;
  
  
  Deploy and Pray
&lt;/h2&gt;

&lt;p&gt;Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.&lt;/p&gt;

&lt;p&gt;After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Canary Deployments Actually Mean
&lt;/h2&gt;

&lt;p&gt;A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic flow:
  Users ──→ Load Balancer ──→ 95% → v1.2.3 (current)
                            └──→  5% → v1.2.4 (canary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.&lt;/p&gt;
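&lt;p&gt;Stripped of tooling, the control loop is small. A sketch with the traffic-shifting and health-check steps passed in as plain callables (&lt;code&gt;run_canary&lt;/code&gt;, &lt;code&gt;set_canary_weight&lt;/code&gt; and &lt;code&gt;is_healthy&lt;/code&gt; are hypothetical names, and the 5/50/100 steps are just an example):&lt;/p&gt;

```python
from typing import Callable, Sequence


def run_canary(
    weights: Sequence[int],
    set_canary_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Step canary traffic through weights; roll back on the first failure.

    Returns True if the canary reached the final weight, False if rolled back.
    """
    for weight in weights:
        set_canary_weight(weight)
        # A real pipeline would wait out the observation window here.
        if not is_healthy():
            set_canary_weight(0)  # send all traffic back to the stable version
            return False
    return True


# Usage with stand-in callables that just record what happened.
applied = []
health_results = iter([True, False])  # healthy at 5%, unhealthy at 50%
ok = run_canary([5, 50, 100], applied.append, lambda: next(health_results))
print(ok, applied)
```

&lt;p&gt;The key property: a failed check at any step sends the canary back to 0% before anything else happens.&lt;/p&gt;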

&lt;h2&gt;
  
  
  Our Canary Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/canary-deploy.yml&lt;/span&gt;
&lt;span class="na"&gt;canary_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy canary (5%)&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;kubectl set image deployment/api-canary api=api:${{ github.sha }}&lt;/span&gt;
        &lt;span class="s"&gt;kubectl scale deployment/api-canary --replicas=1&lt;/span&gt;
        &lt;span class="s"&gt;# Configure traffic split&lt;/span&gt;
        &lt;span class="s"&gt;kubectl apply -f - &amp;lt;&amp;lt;EOF&lt;/span&gt;
        &lt;span class="s"&gt;apiVersion: split.smi-spec.io/v1alpha1&lt;/span&gt;
        &lt;span class="s"&gt;kind: TrafficSplit&lt;/span&gt;
        &lt;span class="s"&gt;metadata:&lt;/span&gt;
          &lt;span class="s"&gt;name: api-canary&lt;/span&gt;
        &lt;span class="s"&gt;spec:&lt;/span&gt;
          &lt;span class="s"&gt;service: api&lt;/span&gt;
          &lt;span class="s"&gt;backends:&lt;/span&gt;
          &lt;span class="s"&gt;- service: api-stable&lt;/span&gt;
            &lt;span class="s"&gt;weight: 95&lt;/span&gt;
          &lt;span class="s"&gt;- service: api-canary&lt;/span&gt;
            &lt;span class="s"&gt;weight: 5&lt;/span&gt;
        &lt;span class="s"&gt;EOF&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Wait and analyze (10 minutes)&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sleep 600&lt;/span&gt;
        &lt;span class="s"&gt;# Check canary health&lt;/span&gt;
        &lt;span class="s"&gt;ERROR_RATE=$(curl -sG http://prometheus/api/v1/query --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' | jq -r '.data.result[0].value[1]')&lt;/span&gt;
        &lt;span class="s"&gt;LATENCY=$(curl -sG http://prometheus/api/v1/query --data-urlencode 'query=histogram_quantile(0.99,rate(http_duration_bucket{version="canary"}[5m]))' | jq -r '.data.result[0].value[1]')&lt;/span&gt;

        &lt;span class="s"&gt;echo "Canary error rate: $ERROR_RATE"&lt;/span&gt;
        &lt;span class="s"&gt;echo "Canary p99 latency: $LATENCY"&lt;/span&gt;

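        &lt;span class="s"&gt;if (( $(echo "$ERROR_RATE &amp;gt; 0.01" | bc -l) )); then&lt;/span&gt;
          &lt;span class="s"&gt;echo "CANARY FAILED: Error rate too high"&lt;/span&gt;
          &lt;span class="s"&gt;exit 1&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;if (( $(echo "$LATENCY &amp;gt; 0.5" | bc -l) )); then&lt;/span&gt;
          &lt;span class="s"&gt;echo "CANARY FAILED: p99 latency too high"&lt;/span&gt;
          &lt;span class="s"&gt;exit 1&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;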

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Promote to 50%&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;kubectl apply -f traffic-split-50.yaml&lt;/span&gt;
        &lt;span class="s"&gt;sleep 600  # Wait another 10 min&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Full rollout&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
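        &lt;span class="s"&gt;kubectl set image deployment/api-stable api=api:${{ github.sha }}&lt;/span&gt;
        &lt;span class="s"&gt;kubectl rollout status deployment/api-stable&lt;/span&gt;
        &lt;span class="s"&gt;kubectl delete trafficsplit api-canary&lt;/span&gt;
        &lt;span class="s"&gt;# Keep the canary Deployment for the next run; the first step scales it back up&lt;/span&gt;
        &lt;span class="s"&gt;kubectl scale deployment/api-canary --replicas=0&lt;/span&gt;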
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
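&lt;p&gt;One detail that bites people in the analysis step: Prometheus returns each sample value as a string, and returns an empty result when no series match (for example, when the canary has received no traffic yet). A sketch of parsing the instant-query response defensively (&lt;code&gt;first_sample_value&lt;/code&gt; and &lt;code&gt;canary_passes&lt;/code&gt; are hypothetical helpers, and the sample body is made up):&lt;/p&gt;

```python
import json
from typing import Optional


def first_sample_value(response_body: str) -> Optional[float]:
    """Extract the first sample from an instant-query response, if any."""
    body = json.loads(response_body)
    if body.get("status") != "success":
        raise ValueError("query failed: " + str(body.get("error")))
    results = body["data"]["result"]
    if not results:
        return None  # no series matched, e.g. the canary has had no traffic
    # Each sample is [unix_timestamp, "value"]; the value is a string.
    return float(results[0]["value"][1])


def canary_passes(error_rate: Optional[float], threshold: float = 0.01) -> bool:
    """Treat missing data as a failure rather than a silent pass."""
    return error_rate is not None and threshold >= error_rate


# Made-up response body in the shape the query API returns.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"version": "canary"}, "value": [1767225600.0, "0.004"]}],
    },
})
rate = first_sample_value(sample)
print(rate, canary_passes(rate))
```

&lt;p&gt;Failing on missing data is a deliberate choice here: a canary with no metrics hasn't proven anything.&lt;/p&gt;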



&lt;h2&gt;
  
  
  The Canary Checklist
&lt;/h2&gt;

&lt;p&gt;What we check during the canary window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CANARY_CHECKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate(http_5xx_total{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}[5m]) / rate(http_requests_total{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}[5m])&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Max 1% errors
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;less_than&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_p99&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;histogram_quantile(0.99, rate(http_duration_bucket{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}[5m]))&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Max 500ms
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;less_than&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rate(http_2xx_total{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}[5m]) / rate(http_requests_total{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}[5m])&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Min 99% success
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;greater_than&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;memory_usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;container_memory_working_set_bytes{version=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Max 512MB
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;less_than&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results After 6 Months
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rollback rate&lt;/td&gt;
&lt;td&gt;15% of deploys&lt;/td&gt;
&lt;td&gt;3% of deploys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean time to detect bad deploy&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer-facing incidents from deploys&lt;/td&gt;
&lt;td&gt;4/month&lt;/td&gt;
&lt;td&gt;0.5/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy frequency&lt;/td&gt;
&lt;td&gt;1x/day (afraid)&lt;/td&gt;
&lt;td&gt;5x/day (confident)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Simple
&lt;/h2&gt;

&lt;p&gt;You don't need Istio or a service mesh for canary deploys. Start with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Two deployment objects (stable + canary)&lt;/li&gt;
&lt;li&gt;A load balancer that supports weighted routing&lt;/li&gt;
&lt;li&gt;A script that checks error rates after deploy&lt;/li&gt;
&lt;li&gt;A human who decides whether to promote or roll back&lt;/li&gt;
&lt;/ol&gt;
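
&lt;p&gt;Step 3 can start as a few lines of Python. The sketch below is one possible shape, not our production script: it assumes checks defined like the metric dictionary earlier in this post (minus the query strings), and the &lt;code&gt;fetch&lt;/code&gt; callable is a hypothetical stand-in for whatever queries your metrics backend. It only recommends an action, so the human in step 4 still decides.&lt;/p&gt;

```python
import operator
from typing import Callable, Dict

# Hypothetical checks, shaped like the metric definitions earlier in this post.
CHECKS = {
    "success_rate": {"threshold": 0.99, "comparison": "greater_than"},
    "memory_usage": {"threshold": 512 * 1024 * 1024, "comparison": "less_than"},
}

# Map the comparison names used in the config to real comparison functions.
COMPARATORS = {"less_than": operator.lt, "greater_than": operator.gt}


def evaluate_canary(checks: Dict[str, dict], fetch: Callable[[str], float]) -> Dict[str, bool]:
    """Return a pass/fail flag per check; `fetch` returns the current value for a check name."""
    results = {}
    for name, check in checks.items():
        compare = COMPARATORS[check["comparison"]]
        results[name] = compare(fetch(name), check["threshold"])
    return results


def recommend(results: Dict[str, bool]) -> str:
    """Suggest an action based on the check results; a human makes the final call."""
    return "promote" if all(results.values()) else "roll back"


if __name__ == "__main__":
    # Stand-in values; a real script would query the metrics backend here.
    sample = {"success_rate": 0.995, "memory_usage": 300 * 1024 * 1024}
    outcome = evaluate_canary(CHECKS, sample.__getitem__)
    print(outcome, "->", recommend(outcome))
```

&lt;p&gt;Keeping the comparison logic separate from the metrics query makes the script easy to test with canned values before pointing it at real traffic.&lt;/p&gt;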

&lt;p&gt;Automate from there.&lt;/p&gt;

&lt;p&gt;If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deployments</category>
      <category>devops</category>
      <category>sre</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Platform Engineering: Building an Internal Developer Platform That Teams Actually Use</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 30 Jun 2026 00:53:49 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/platform-engineering-building-an-internal-developer-platform-that-teams-actually-use-bgn</link>
      <guid>https://dev.to/samson_tanimawo/platform-engineering-building-an-internal-developer-platform-that-teams-actually-use-bgn</guid>
      <description>&lt;h2&gt;
  
  
  The "Build It and They Won't Come" Problem
&lt;/h2&gt;

&lt;p&gt;Our platform team spent 6 months building an internal developer platform. Beautiful service catalog, automated provisioning, self-service databases. Nobody used it.&lt;/p&gt;

&lt;p&gt;Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Platforms Fail
&lt;/h2&gt;

&lt;p&gt;Most internal platforms fail for the same reason: they're built top-down instead of bottom-up.&lt;/p&gt;

&lt;p&gt;Top-down: "We decided every team should use this standardized deployment pipeline."&lt;br&gt;
Bottom-up: "We noticed 8 teams solving the same problem differently, so we built a shared solution."&lt;/p&gt;
&lt;h2&gt;
  
  
  The Paved Road Approach
&lt;/h2&gt;

&lt;p&gt;Instead of mandating tools, offer a paved road. Make the right thing the easy thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paved road (easy):        Off-road (hard but allowed):
─────────────────         ──────────────────────────
Standard CI/CD template   Custom pipeline
Managed Postgres          Self-managed DB
Shared observability      Own monitoring stack
Pre-configured K8s        Custom infrastructure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key: off-road is allowed but unsupported. You break it, you own it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Our IDP Looks Like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# service.yaml — the only file developers need to create&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-api&lt;/span&gt;
  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
  &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python&lt;/span&gt;
  &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fastapi&lt;/span&gt;

  &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis:7&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rabbitmq:3&lt;/span&gt;

  &lt;span class="na"&gt;scaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;

  &lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;multi_az&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this single file, the platform provisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git repo with CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Kubernetes namespace and RBAC&lt;/li&gt;
&lt;li&gt;Database and connection secrets&lt;/li&gt;
&lt;li&gt;Monitoring dashboards (golden signals)&lt;/li&gt;
&lt;li&gt;Alerting rules&lt;/li&gt;
&lt;li&gt;Log aggregation&lt;/li&gt;
&lt;li&gt;Service mesh entry&lt;/li&gt;
&lt;/ul&gt;
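
&lt;p&gt;To make the expansion concrete, here is a minimal sketch of how one spec could fan out into a provisioning plan. It is illustrative only: the function name, the plan wording, and the namespace naming are assumptions, and the spec is shown as an already-parsed dictionary rather than read from YAML.&lt;/p&gt;

```python
from typing import Dict, List

# A subset of the service.yaml above, already parsed into a dict.
SPEC = {
    "metadata": {"name": "checkout-api", "team": "payments", "tier": "critical"},
    "spec": {"dependencies": ["postgres:14", "redis:7", "rabbitmq:3"]},
}


def provisioning_plan(spec: Dict) -> List[str]:
    """Expand one service spec into the list of resources to create."""
    name = spec["metadata"]["name"]
    team = spec["metadata"]["team"]

    plan = [
        f"git repo with CI/CD pipeline: {name}",
        f"kubernetes namespace and RBAC: {team}/{name}",  # naming is hypothetical
    ]

    # Each dependency becomes a managed resource plus a connection secret.
    for dep in spec["spec"].get("dependencies", []):
        kind, _, version = dep.partition(":")
        plan.append(f"managed {kind} (version {version}) with connection secret")

    # Observability and mesh wiring are created for every service.
    plan += [
        f"golden-signal dashboards: {name}",
        f"alerting rules: {name}",
        f"log aggregation: {name}",
        f"service mesh entry: {name}",
    ]
    return plan


if __name__ == "__main__":
    for step in provisioning_plan(SPEC):
        print("-", step)
```

&lt;p&gt;Producing a plan as plain data first, and applying it in a separate step, also gives developers a dry run they can review before anything is created.&lt;/p&gt;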

&lt;h2&gt;
  
  
  The Developer Experience Metrics
&lt;/h2&gt;

&lt;p&gt;We track these to know if the platform is working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time from idea to production deploy:    Before: 2 weeks  After: 4 hours
Time to provision a new environment:    Before: 3 days   After: 12 minutes
Deploy frequency:                       Before: weekly   After: 5x/day
Change failure rate:                    Before: 18%      After: 4%
Developer satisfaction (quarterly NPS): Before: -10      After: +52
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Self-Service Portal
&lt;/h2&gt;

&lt;p&gt;Our portal has exactly four actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Service&lt;/strong&gt; — Generates everything from service.yaml&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View My Services&lt;/strong&gt; — Dashboard of health, deploys, costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request Resource&lt;/strong&gt; — Database, queue, cache (auto-provisioned)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get Help&lt;/strong&gt; — Links to docs + Slack channel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Four buttons. If you need more than four buttons, your platform is too complex.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption Strategy
&lt;/h2&gt;

&lt;p&gt;We didn't mandate adoption. We seduced teams into it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Weeks 1-4&lt;/strong&gt;: Pilot with the friendliest team. Fix everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weeks 5-8&lt;/strong&gt;: Add two more teams. Fix more things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weeks 9-12&lt;/strong&gt;: Share success stories at engineering all-hands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 13+&lt;/strong&gt;: Other teams start asking to join.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By month 6, 80% of teams had migrated voluntarily. The remaining 20% had legitimate edge cases we accommodated.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Not to Build
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don't build a service mesh if you have fewer than 20 services&lt;/li&gt;
&lt;li&gt;Don't build a custom scheduler if standard K8s works&lt;/li&gt;
&lt;li&gt;Don't build a custom secret manager — use Vault or cloud-native&lt;/li&gt;
&lt;li&gt;Don't build a custom CI system — use GitHub Actions/GitLab CI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build the glue, not the tools.&lt;/p&gt;

&lt;p&gt;If you want a platform that includes AI-powered operations from day one, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>devops</category>
      <category>sre</category>
      <category>devex</category>
    </item>
    <item>
      <title>Chaos Engineering for Teams That Aren't Netflix</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 29 Jun 2026 15:44:45 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/chaos-engineering-for-teams-that-arent-netflix-5cin</link>
      <guid>https://dev.to/samson_tanimawo/chaos-engineering-for-teams-that-arent-netflix-5cin</guid>
      <description>&lt;h2&gt;
  
  
  You Don't Need Chaos Monkey
&lt;/h2&gt;

&lt;p&gt;Every chaos engineering talk starts with Netflix and Chaos Monkey. Cool story. You're not Netflix. You probably have 5-50 services, not 500. You don't need a sophisticated chaos platform.&lt;/p&gt;

&lt;p&gt;You need a methodology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With Game Days
&lt;/h2&gt;

&lt;p&gt;Before injecting failures into production, run tabletop exercises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Game Day: Database Failover&lt;/span&gt;
Date: 2024-03-15
Scope: Primary DB goes down
Participants: SRE team + Backend leads

&lt;span class="gu"&gt;### Scenario&lt;/span&gt;
At 10:00 AM, the primary database becomes unreachable.

&lt;span class="gu"&gt;### Questions to answer:&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; How does the application behave?
&lt;span class="p"&gt;2.&lt;/span&gt; Does the replica promote automatically?
&lt;span class="p"&gt;3.&lt;/span&gt; What's the expected failover time?
&lt;span class="p"&gt;4.&lt;/span&gt; What manual steps are needed?
&lt;span class="p"&gt;5.&lt;/span&gt; How do we verify data consistency after failover?

&lt;span class="gu"&gt;### Pre-requisites&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Backup verified within last 24 hours
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Runbook for DB failover reviewed
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Rollback plan documented
&lt;span class="p"&gt;-&lt;/span&gt; [ ] All participants in incident channel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Game days need no tooling, only a few hours of people's time, and they reveal 80% of the gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your First Real Chaos Experiment
&lt;/h2&gt;

&lt;p&gt;Start small. Really small.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Experiment 1: Kill a single pod&lt;/span&gt;
&lt;span class="c"&gt;# Hypothesis: Traffic shifts to remaining pods with zero errors&lt;/span&gt;

&lt;span class="c"&gt;# Before&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api-service
&lt;span class="c"&gt;# NAME                          READY   STATUS&lt;/span&gt;
&lt;span class="c"&gt;# api-service-7d8f9b6c4-abc12   1/1     Running&lt;/span&gt;
&lt;span class="c"&gt;# api-service-7d8f9b6c4-def34   1/1     Running&lt;/span&gt;
&lt;span class="c"&gt;# api-service-7d8f9b6c4-ghi56   1/1     Running&lt;/span&gt;

&lt;span class="c"&gt;# The experiment&lt;/span&gt;
kubectl delete pod api-service-7d8f9b6c4-abc12

&lt;span class="c"&gt;# Observe&lt;/span&gt;
&lt;span class="c"&gt;# - Did error rate spike?&lt;/span&gt;
&lt;span class="c"&gt;# - Did latency increase?&lt;/span&gt;
&lt;span class="c"&gt;# - Did K8s reschedule the pod?&lt;/span&gt;
&lt;span class="c"&gt;# - How long until back to 3 replicas?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If killing one pod causes errors, you have a serious problem that's better to find now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chaos Experiment Template
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;experiment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Network&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;service"&lt;/span&gt;
  &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-03-20"&lt;/span&gt;

  &lt;span class="na"&gt;hypothesis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steady_state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0.1%"&lt;/span&gt;
    &lt;span class="na"&gt;expected_behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;Circuit breaker activates within 5 seconds.&lt;/span&gt;
      &lt;span class="s"&gt;Checkout falls back to cached payment validation.&lt;/span&gt;
      &lt;span class="s"&gt;Users see a 'retry' message, not an error.&lt;/span&gt;

  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tc&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(traffic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;control)"&lt;/span&gt;
    &lt;span class="na"&gt;injection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200ms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;443&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment-service"&lt;/span&gt;
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Single&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;availability&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;zone"&lt;/span&gt;

  &lt;span class="na"&gt;abort_conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceeds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5%"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;drops&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;90%"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inconsistency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected"&lt;/span&gt;

  &lt;span class="na"&gt;rollback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tc&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;qdisc&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;del&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;eth0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;root"&lt;/span&gt;
    &lt;span class="na"&gt;verification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;returns&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;baseline"&lt;/span&gt;

  &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actual_behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[filled&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;after&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;experiment]"&lt;/span&gt;
    &lt;span class="na"&gt;hypothesis_confirmed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="s"&gt;/false&lt;/span&gt;
    &lt;span class="na"&gt;action_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
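
&lt;p&gt;Abort conditions are only useful if something checks them while the experiment runs. As a rough sketch, the three conditions from the template can be written as one function; the &lt;code&gt;Snapshot&lt;/code&gt; fields are hypothetical names for whatever your monitoring exposes.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One reading of the metrics watched during an experiment (hypothetical fields)."""
    error_rate: float             # fraction of failed requests, 0.0 to 1.0
    checkout_success_rate: float  # fraction of successful checkouts, 0.0 to 1.0
    data_inconsistencies: int     # count reported by a consistency check


def tripped_abort_conditions(s: Snapshot) -> List[str]:
    """Return the abort conditions from the template that this snapshot violates."""
    tripped = []
    if s.error_rate > 0.05:
        tripped.append("Error rate exceeds 5%")
    if 0.90 > s.checkout_success_rate:
        tripped.append("Checkout success rate drops below 90%")
    if s.data_inconsistencies > 0:
        tripped.append("Any data inconsistency detected")
    return tripped


if __name__ == "__main__":
    reading = Snapshot(error_rate=0.08, checkout_success_rate=0.93, data_inconsistencies=0)
    reasons = tripped_abort_conditions(reading)
    if reasons:
        # This is the point at which you would run the rollback command.
        print("ABORT:", "; ".join(reasons))
    else:
        print("Continue experiment")
```

&lt;p&gt;Poll this every few seconds for the duration of the experiment and run the rollback command as soon as the list is non-empty.&lt;/p&gt;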



&lt;h2&gt;
  
  
  Progressive Chaos Levels
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1: Kill a pod (Week 1)
Level 2: Kill all pods in one AZ (Week 4)
Level 3: Inject latency to a dependency (Week 8)
Level 4: Simulate full dependency outage (Week 12)
Level 5: Multi-failure scenario (Week 16+)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't skip levels. Each one builds confidence and reveals issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Found
&lt;/h2&gt;

&lt;p&gt;After 6 months of regular chaos experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12 missing circuit breakers discovered&lt;/li&gt;
&lt;li&gt;3 services with no health checks&lt;/li&gt;
&lt;li&gt;5 services with incorrect timeout configurations&lt;/li&gt;
&lt;li&gt;2 services with hard-coded dependency URLs (no DNS)&lt;/li&gt;
&lt;li&gt;1 service that crashed when its cache was unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these would have caused outages eventually. We found them on our terms, during business hours, with everyone ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Case
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prevented outages (estimated): 4 per quarter
Average outage cost: $15,000
Chaos engineering cost: ~20 hours/quarter of eng time (~$4,000 at $200/hour)

Net savings: $60,000 in avoided outages - $4,000 cost = $56,000/quarter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
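
&lt;p&gt;If you want to plug in your own numbers, the arithmetic is short enough to script. This sketch uses the figures above; the $200 hourly rate is an assumption implied by the $4,000 cost for 20 hours, not a number to copy.&lt;/p&gt;

```python
def chaos_business_case(prevented_outages: int, avg_outage_cost: float,
                        eng_hours: float, hourly_rate: float):
    """Return (net savings, return per dollar spent) for one quarter."""
    savings = prevented_outages * avg_outage_cost
    cost = eng_hours * hourly_rate
    net = savings - cost
    return net, net / cost


if __name__ == "__main__":
    net, ratio = chaos_business_case(
        prevented_outages=4,     # estimated, per quarter
        avg_outage_cost=15_000,  # dollars
        eng_hours=20,            # per quarter
        hourly_rate=200,         # assumption: $4,000 / 20 hours
    )
    print(f"Net savings: ${net:,.0f}/quarter, return: {ratio:.1f}x")
```

&lt;p&gt;With these inputs the net is $56,000 per quarter, or 14 times the cost. The estimate of prevented outages carries most of the weight, so it is worth being conservative with it.&lt;/p&gt;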



&lt;p&gt;If you want to run chaos experiments with AI-guided blast radius analysis, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>sre</category>
      <category>testing</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Distributed Tracing: The Missing Piece of Your Observability Stack</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 29 Jun 2026 00:17:36 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/distributed-tracing-the-missing-piece-of-your-observability-stack-34b</link>
      <guid>https://dev.to/samson_tanimawo/distributed-tracing-the-missing-piece-of-your-observability-stack-34b</guid>
      <description>&lt;h2&gt;
  
  
  When Logs and Metrics Aren't Enough
&lt;/h2&gt;

&lt;p&gt;You have great dashboards. Your log aggregation is solid. But when a user reports "the checkout page is slow," you still spend 30 minutes jumping between services trying to find the bottleneck.&lt;/p&gt;

&lt;p&gt;That's the gap distributed tracing fills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tracing Actually Shows You
&lt;/h2&gt;

&lt;p&gt;A trace is a complete picture of a single request as it flows through your system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request → API Gateway → Auth Service → Product Service → DB → Cache → Response
                  5ms          12ms           45ms       120ms  3ms
                                                          ^
                                              This is your bottleneck
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without tracing, you'd see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway: latency looks fine&lt;/li&gt;
&lt;li&gt;Auth Service: latency looks fine&lt;/li&gt;
&lt;li&gt;Product Service: latency is HIGH but why?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With tracing, you see the exact DB query inside Product Service that's taking 120ms.&lt;/p&gt;
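
&lt;p&gt;Once the spans of a request sit in one place, finding the bottleneck becomes a lookup. A toy illustration using the timings from the diagram above; the &lt;code&gt;Span&lt;/code&gt; class here is a simplified stand-in, not the OpenTelemetry data model.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Simplified span: one step of a request and how long it took."""
    name: str
    duration_ms: float


# Timings taken from the diagram above.
TRACE = [
    Span("API Gateway", 5),
    Span("Auth Service", 12),
    Span("Product Service", 45),
    Span("DB", 120),
    Span("Cache", 3),
]


def slowest_span(spans: List[Span]) -> Span:
    """Return the span that took the longest."""
    return max(spans, key=lambda s: s.duration_ms)


def total_ms(spans: List[Span]) -> float:
    """Sum the durations of all spans in the trace."""
    return sum(s.duration_ms for s in spans)


if __name__ == "__main__":
    worst = slowest_span(TRACE)
    share = worst.duration_ms / total_ms(TRACE)
    print(f"Bottleneck: {worst.name} ({worst.duration_ms:.0f}ms, {share:.0%} of the request)")
```

&lt;p&gt;Real traces are trees with parent and child spans, so a tracing backend does this per level, but the idea is the same.&lt;/p&gt;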

&lt;h2&gt;
  
  
  Getting Started with OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is the standard. Here's a minimal setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python example with Flask
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FlaskInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RequestsInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLAlchemyInstrumentor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;

&lt;span class="c1"&gt;# Setup
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://otel-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Auto-instrument everything
&lt;/span&gt;&lt;span class="nc"&gt;FlaskInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;RequestsInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nc"&gt;SQLAlchemyInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Three auto-instrumentations cover 80% of what you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Spans for the Other 20%
&lt;/h2&gt;

&lt;p&gt;Auto-instrumentation gives you HTTP calls and DB queries. Add custom spans for business logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order.total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_inventory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;validate_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charge_payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;charge_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_confirmation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sampling Strategy
&lt;/h2&gt;

&lt;p&gt;You can't trace every request in production. Well, you can, but your bill will be astronomical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;probabilistic_sampler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# Sample 10% of requests&lt;/span&gt;

  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Always keep errors&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]}&lt;/span&gt;
      &lt;span class="c1"&gt;# Always keep slow requests&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slow-requests&lt;/span&gt;  
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
        &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;threshold_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;1000&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="c1"&gt;# Sample 5% of everything else&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



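&lt;p&gt;Tail sampling is the key. Because the decision is made after the whole trace has been collected, it lets you keep 100% of interesting traces and only 5% of boring ones. Two caveats: every span of a trace has to reach the same collector instance, and if &lt;code&gt;probabilistic_sampler&lt;/code&gt; runs ahead of &lt;code&gt;tail_sampling&lt;/code&gt; in the same pipeline, 90% of traces (errors included) are dropped before the tail policies ever see them.&lt;/p&gt;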
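&lt;p&gt;To make the three policies concrete, here is a minimal Python sketch of the same keep-or-drop decision. It is an illustration only, not how the collector implements tail sampling: &lt;code&gt;keep_trace&lt;/code&gt; and its parameters are hypothetical names, and the trace data is made up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random


def keep_trace(has_error, duration_ms, rng, slow_ms=1000, baseline_pct=5):
    """Return True if a finished trace should be kept (illustrative only)."""
    if has_error:
        return True  # mirrors the "errors" policy
    if duration_ms &amp;gt;= slow_ms:
        return True  # mirrors the "slow-requests" policy
    # mirrors the "default" policy: keep a small random share of the rest
    return rng.random() * 100 &amp;lt; baseline_pct


rng = random.Random(42)  # seeded so the sketch is repeatable
# (has_error, duration_ms): 1000 boring traces, 10 errors, 5 slow requests
traces = [(False, 120)] * 1000 + [(True, 80)] * 10 + [(False, 2500)] * 5
kept = [t for t in traces if keep_trace(t[0], t[1], rng)]
errors_kept = sum(1 for t in kept if t[0])
slow_kept = sum(1 for t in kept if t[1] &amp;gt;= 1000)
print(errors_kept, slow_kept, len(kept))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;All 10 error traces and all 5 slow traces survive every run. Only the size of the boring remainder depends on the random draw.&lt;/p&gt;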

&lt;h2&gt;
  
  
  The Three Queries That Matter
&lt;/h2&gt;

&lt;p&gt;Once you have tracing data, these three queries solve 90% of debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. "Show me the slowest traces in the last hour"
   → Finds performance regressions

2. "Show me traces with errors, grouped by service"
   → Finds which service is failing

3. "Show me traces for user X's request at time T"
   → Reproduces specific customer issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Not propagating trace context&lt;/strong&gt; — If service A calls service B but doesn't pass the trace ID, you get broken traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-sampling in production&lt;/strong&gt; — Start at 1-5%, increase as needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not adding business context&lt;/strong&gt; — Adding &lt;code&gt;user.id&lt;/code&gt;, &lt;code&gt;order.id&lt;/code&gt;, etc. to spans makes traces actually useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring async operations&lt;/strong&gt; — Queues break trace propagation unless you explicitly pass context&lt;/li&gt;
&lt;/ol&gt;
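&lt;p&gt;Mistakes 1 and 4 come down to the same thing: the trace context has to travel with the message. The sketch below hand-rolls that idea with a W3C &lt;code&gt;traceparent&lt;/code&gt; value (version, 32-hex trace ID, 16-hex span ID, flags) carried in message headers. It is a toy under stated assumptions: the list stands in for a real broker, and &lt;code&gt;make_traceparent&lt;/code&gt;, &lt;code&gt;publish&lt;/code&gt; and &lt;code&gt;consume&lt;/code&gt; are hypothetical helpers. In a real service you would let your tracing library's propagators do this rather than build the header yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import secrets


def make_traceparent(trace_id=None):
    """Build a traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"          # 01 means "sampled"


def publish(queue, payload, traceparent):
    """Producer: carry the context in message headers, next to the payload."""
    queue.append({"headers": {"traceparent": traceparent}, "payload": payload})


def consume(queue):
    """Consumer: continue the incoming trace instead of starting a new one."""
    message = queue.pop(0)
    header = message["headers"]["traceparent"]
    _version, trace_id, parent_span_id, _flags = header.split("-")
    child = make_traceparent(trace_id)  # same trace ID, fresh span ID
    return trace_id, parent_span_id, child


queue = []  # stand-in for a real message broker
parent = make_traceparent()
publish(queue, {"order_id": 42}, parent)
trace_id, parent_span_id, child = consume(queue)
print(child.split("-")[1] == parent.split("-")[1])  # trace ID matches on both sides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the producer skips the header, the consumer has nothing to continue from and starts a fresh trace. That is the broken trace from mistake 1.&lt;/p&gt;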

&lt;p&gt;If you want AI-powered trace analysis that automatically finds bottlenecks, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>tracing</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Golden Signals: A Practical Implementation Guide</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 28 Jun 2026 13:18:06 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-4n8m</link>
      <guid>https://dev.to/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-4n8m</guid>
      <description>&lt;h2&gt;
  
  
  Four Metrics to Rule Them All
&lt;/h2&gt;

&lt;p&gt;Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.&lt;/p&gt;

&lt;p&gt;Here's a practical guide from someone who's implemented them across 50+ services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signal 1: Latency
&lt;/h2&gt;

&lt;p&gt;Not all latency is equal. You need to track successful requests and error requests separately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad: Average latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_request_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;  &lt;span class="c1"&gt;# Useless
&lt;/span&gt;
&lt;span class="c1"&gt;# Good: Percentile latency, separated by status
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request latency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

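&lt;span class="c1"&gt;# Assumes `app` is a FastAPI/Starlette application; the middleware type ("http") is required
&lt;/span&gt;&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;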
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="n"&gt;status_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;xx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_class&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on p99, not p50. Your happiest users don't need help.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighLatencyP99&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
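&lt;p&gt;A quick way to see why the average is the wrong number to alert on: the sketch below compares mean, p50 and p99 for a made-up batch of 100 request durations, using only the standard library. The numbers are invented for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

# 98 fast requests and 2 very slow ones, in seconds (made-up numbers)
durations = [0.05] * 98 + [4.0] * 2

mean = statistics.fmean(durations)
# quantiles(n=100) returns the 99 cut points p1..p99
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p50, p99 = cuts[49], cuts[98]

print(f"mean={mean:.3f}s p50={p50:.3f}s p99={p99:.3f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The mean and p50 both sit comfortably below the 0.5s alert threshold, while p99 lands on the 4s requests that two users actually experienced.&lt;/p&gt;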



&lt;h2&gt;
  
  
  Signal 2: Traffic
&lt;/h2&gt;

&lt;p&gt;Traffic tells you "is this normal?" It's the context for every other signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Current request rate
rate(http_requests_total[5m])

# Compare to same time last week
rate(http_requests_total[5m]) 
  / 
rate(http_requests_total[5m] offset 7d)

# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: &amp;gt;
    rate(http_requests_total[5m]) 
    &amp;lt; 
    (rate(http_requests_total[5m] offset 1h) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped &amp;gt;50% compared to 1 hour ago"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic drops are often more concerning than traffic spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signal 3: Errors
&lt;/h2&gt;

&lt;p&gt;Track error rate as a percentage, not absolute count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Error rate percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But also track error types separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;error_categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;5xx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Server&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(our&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fault)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;4xx_excluding_404&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Client&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;issue)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeouts"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;circuit_breaker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dependency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failures"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
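&lt;p&gt;The same idea in plain Python, as a rough sketch: the first function shows why a percentage is more informative than a count, the second buckets status codes into the categories above. &lt;code&gt;error_rate_percent&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt; are hypothetical helpers, and timeouts and circuit-breaker trips are left out because they cannot be derived from a status code alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter


def error_rate_percent(status_codes):
    """Server errors (5xx) as a percentage of all requests."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if code // 100 == 5)
    return 100 * server_errors / len(status_codes)


def category(status_code):
    """Bucket a status code; returns None for anything not treated as an error."""
    if status_code // 100 == 5:
        return "5xx"
    if status_code // 100 == 4 and status_code != 404:
        return "4xx_excluding_404"
    return None  # successes and 404s are not counted here


quiet_hour = [200] * 95 + [500] * 5      # 5 errors out of 100 requests
busy_hour = [200] * 99_995 + [500] * 5   # 5 errors out of 100,000 requests
print(error_rate_percent(quiet_hour), error_rate_percent(busy_hour))

mixed = [200, 404, 422, 500, 503, 429]
print(Counter(c for c in map(category, mixed) if c))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both hours have exactly five errors, but one is a 5% error rate and the other is 0.005%. An alert on the absolute count cannot tell them apart.&lt;/p&gt;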



&lt;h2&gt;
  
  
  Signal 4: Saturation
&lt;/h2&gt;

&lt;p&gt;The most underrated signal. Saturation answers: "how close are we to full?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU saturation: cores used / cores allowed by the CFS quota
rate(container_cpu_usage_seconds_total[5m])
  / (container_spec_cpu_quota / container_spec_cpu_period)

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Connection pool saturation
active_connections / max_connections

# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.&lt;/p&gt;
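&lt;p&gt;Those two thresholds apply the same way to every resource, so they are easy to encode once. A minimal sketch, assuming you already have a used/capacity pair for each resource. &lt;code&gt;saturation_level&lt;/code&gt; is a hypothetical helper and the sample numbers are made up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import bisect

THRESHOLDS = [0.80, 0.95]               # warning and critical ratios
LEVELS = ["ok", "warning", "critical"]  # one more level than thresholds


def saturation_level(used, capacity):
    """Map a used/capacity pair to an alert level."""
    if capacity &amp;lt;= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    # bisect_right counts how many thresholds the ratio has reached
    return LEVELS[bisect.bisect_right(THRESHOLDS, ratio)]


resources = {
    "memory_gib": (3, 8),
    "connections": (45, 50),
    "queue_depth": (9_700, 10_000),
}
for name, (used, capacity) in resources.items():
    print(name, saturation_level(used, capacity))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A ratio of exactly 0.80 already counts as a warning here. Whether the boundary is inclusive is a choice to make once and keep consistent across resources.&lt;/p&gt;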

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Every service gets a standard dashboard with four rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row 1: Latency    [p50] [p90] [p99] [error latency]
Row 2: Traffic    [rate] [vs last week] [by endpoint]
Row 3: Errors     [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern
&lt;/h2&gt;

&lt;p&gt;Don't build a golden signals dashboard per service manually. Template it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dashboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Golden Signals: {{ service_name }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One template, 50 dashboards. Update once, apply everywhere.&lt;/p&gt;
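&lt;p&gt;How the template gets stamped out depends on your tooling. As one possible sketch, the Python below renders a copy of the template per service using only the standard library. The service names are hypothetical, the placeholder syntax is &lt;code&gt;str.format&lt;/code&gt; rather than the double braces shown above, and uploading the result to your dashboard tool is left out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import copy
import json

# Same shape as the template above; "{service_name}" is a str.format placeholder
TEMPLATE = {
    "dashboard": {
        "title": "Golden Signals: {service_name}",
        "templating": {
            "list": [
                {"name": "service", "type": "query"},
                {"name": "environment", "type": "custom", "options": ["prod", "staging"]},
            ]
        },
    }
}


def render_dashboard(service_name):
    """Return a per-service copy of the template without mutating it."""
    rendered = copy.deepcopy(TEMPLATE)
    title = rendered["dashboard"]["title"]
    rendered["dashboard"]["title"] = title.format(service_name=service_name)
    return rendered


services = ["checkout", "payments", "inventory"]  # hypothetical service names
dashboards = {name: render_dashboard(name) for name in services}
print(json.dumps(dashboards["checkout"], indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because every dashboard is derived from &lt;code&gt;TEMPLATE&lt;/code&gt;, adding a panel means editing one dictionary and re-running the loop.&lt;/p&gt;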

&lt;p&gt;If you want golden signal monitoring that sets itself up automatically, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Observability: What to Monitor and Why</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 28 Jun 2026 01:17:57 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-32bb</link>
      <guid>https://dev.to/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-32bb</guid>
      <description>&lt;h2&gt;
  
  
  The Kubernetes Monitoring Maze
&lt;/h2&gt;

&lt;p&gt;Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.&lt;/p&gt;

&lt;p&gt;After running K8s in production for four years, here's what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Kubernetes observability has three distinct layers, and you need different strategies for each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 1: Cluster Health
&lt;/h2&gt;

&lt;p&gt;These are your "is the platform working?" metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_cluster_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_ready_status&lt;/span&gt;        &lt;span class="c1"&gt;# Are all nodes healthy?&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_cpu_utilization&lt;/span&gt;     &lt;span class="c1"&gt;# Alert at 85%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_memory_utilization&lt;/span&gt;  &lt;span class="c1"&gt;# Alert at 90%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_disk_pressure&lt;/span&gt;       &lt;span class="c1"&gt;# Boolean alert&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_pid_pressure&lt;/span&gt;        &lt;span class="c1"&gt;# Rarely fires, always critical&lt;/span&gt;

  &lt;span class="na"&gt;control_plane&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apiserver_request_latency_p99&lt;/span&gt;  &lt;span class="c1"&gt;# Alert &amp;gt; 1s&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;etcd_disk_wal_fsync_duration&lt;/span&gt;   &lt;span class="c1"&gt;# Alert &amp;gt; 100ms&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;scheduler_pending_pods&lt;/span&gt;         &lt;span class="c1"&gt;# Alert if &amp;gt; 0 for 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;controller_manager_queue_depth&lt;/span&gt; &lt;span class="c1"&gt;# Alert if growing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Don't alert on individual node CPU. Alert on cluster-level capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alert when cluster is 80% utilized
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) &amp;gt; 0.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
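&lt;p&gt;One detail worth spelling out: &lt;code&gt;node_cpu_seconds_total&lt;/code&gt; is a cumulative counter, so a capacity alert should compare how much it grew over a recent window, which is what PromQL's &lt;code&gt;rate()&lt;/code&gt; gives you per second. The sketch below uses two made-up counter samples taken five minutes apart to show how far the lifetime ratio can drift from current load. &lt;code&gt;cpu_utilization&lt;/code&gt; is a hypothetical helper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cpu_utilization(earlier, later):
    """Non-idle share of CPU time between two cumulative counter samples."""
    deltas = {mode: later[mode] - earlier[mode] for mode in later}
    busy = sum(delta for mode, delta in deltas.items() if mode != "idle")
    return busy / sum(deltas.values())


# Cumulative CPU seconds per mode (made-up): mostly idle for days,
# then heavily loaded during the last five minutes
earlier = {"idle": 900_000.0, "user": 90_000.0, "system": 10_000.0}
later = {"idle": 900_030.0, "user": 90_240.0, "system": 10_030.0}

# Dividing raw counters gives the average since boot
lifetime = (later["user"] + later["system"]) / sum(later.values())
# Dividing the increases gives utilization over the window
recent = cpu_utilization(earlier, later)

print(f"lifetime={lifetime:.2f} recent={recent:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The lifetime ratio stays near 10% while the last five minutes ran at 90%. Only the windowed figure would trip an 80% capacity alert.&lt;/p&gt;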



&lt;h2&gt;
  
  
  Layer 2: Workload Health
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They monitor pods instead of workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_workload_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;available_replicas &amp;lt; desired_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# For &amp;gt; 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment_generation != observed_generation&lt;/span&gt;  &lt;span class="c1"&gt;# Stuck rollout&lt;/span&gt;

  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;restart_count increasing&lt;/span&gt;       &lt;span class="c1"&gt;# CrashLoopBackOff detection&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container_oom_killed&lt;/span&gt;            &lt;span class="c1"&gt;# Memory limits too low&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pod_pending_duration &amp;gt; 2min&lt;/span&gt;     &lt;span class="c1"&gt;# Scheduling issues&lt;/span&gt;

  &lt;span class="na"&gt;hpa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;current_replicas == max_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# Scale ceiling hit&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_utilization_vs_target&lt;/span&gt;         &lt;span class="c1"&gt;# Consistently above target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most valuable alert I ever wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect pods stuck in CrashLoopBackOff&lt;/span&gt;
&lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
&lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;increase(kube_pod_container_status_restarts_total[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;crash-looping"&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;times&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Application Performance
&lt;/h2&gt;

&lt;p&gt;This is what your users actually care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;application_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;red_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Rate, Errors, Duration&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_rate_per_second&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;error_rate_percentage&lt;/span&gt;        &lt;span class="c1"&gt;# Alert &amp;gt; 1%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_duration_p99&lt;/span&gt;         &lt;span class="c1"&gt;# Alert &amp;gt; 500ms&lt;/span&gt;

  &lt;span class="na"&gt;use_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Utilization, Saturation, Errors&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_usage_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_usage_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_receive_bytes_rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
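&lt;p&gt;To make those thresholds concrete, here is a minimal Python sketch of how the RED numbers and the two alert conditions could be computed from one window of request samples. The &lt;code&gt;Request&lt;/code&gt; class, &lt;code&gt;red_summary&lt;/code&gt;, and &lt;code&gt;red_alerts&lt;/code&gt; are hypothetical names for illustration, and the nearest-rank p99 is an assumption. In practice you would more likely derive these from histogram metrics in your monitoring system.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    duration_ms: float
    is_error: bool = False


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration (p99) for one window of requests."""
    total = len(requests)
    if total == 0:
        return {"rate_per_second": 0.0, "error_rate_pct": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if r.is_error)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99: ceil(0.99 * total), done in integer arithmetic.
    rank = (99 * total + 99) // 100
    return {
        "rate_per_second": total / window_seconds,
        "error_rate_pct": 100.0 * errors / total,
        "p99_ms": durations[rank - 1],
    }


def red_alerts(summary, max_error_pct=1.0, max_p99_ms=500.0):
    """Return the names of the thresholds that this window breaches."""
    alerts = []
    if summary["error_rate_pct"] > max_error_pct:
        alerts.append("error_rate")
    if summary["p99_ms"] > max_p99_ms:
        alerts.append("latency_p99")
    return alerts
```

&lt;p&gt;The defaults mirror the 1% and 500ms thresholds in the config above. Treat them as starting points and tune them per service.&lt;/p&gt;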



&lt;h2&gt;
  
  
  The Dashboard That Saves Us
&lt;/h2&gt;

&lt;p&gt;We built a single "K8s Health" dashboard with four panels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster capacity&lt;/strong&gt; — CPU/Memory/Disk utilization per node pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload status&lt;/strong&gt; — Table of all deployments with health status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates&lt;/strong&gt; — All services, sorted by error rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent events&lt;/strong&gt; — K8s events filtered to warnings and errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This one dashboard answers 90% of "is something wrong?" questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring pods instead of services&lt;/strong&gt; — Pods are ephemeral, services are what matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not setting resource requests&lt;/strong&gt; — Without requests, utilization has no baseline to compare against and the scheduler places pods blind&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on resource usage instead of SLOs&lt;/strong&gt; — High CPU isn't a problem if latency is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the control plane&lt;/strong&gt; — An unhealthy API server affects everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want unified Kubernetes observability without the complexity, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>On-Call Wellness: Protecting Your Engineers from Burnout</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 27 Jun 2026 13:21:24 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/on-call-wellness-protecting-your-engineers-from-burnout-46d9</link>
      <guid>https://dev.to/samson_tanimawo/on-call-wellness-protecting-your-engineers-from-burnout-46d9</guid>
      <description>&lt;h2&gt;
  
  
  The On-Call Burnout Epidemic
&lt;/h2&gt;

&lt;p&gt;I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustainable.&lt;/p&gt;

&lt;p&gt;We were spending $500K+ recruiting replacements for a problem that better practices could have fixed for a small fraction of that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Warning Signs
&lt;/h2&gt;

&lt;p&gt;Before someone quits, they show these signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cynicism in post-mortems&lt;/strong&gt; — "This will never get fixed"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert numbness&lt;/strong&gt; — Slow to respond, missed pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacation avoidance&lt;/strong&gt; — "I can't take time off, who would cover?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep rejection&lt;/strong&gt; — "That's not my problem"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meeting silence&lt;/strong&gt; — Previously engaged, now checked out&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you see three or more of these in someone on your team, they're already halfway out the door.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hard Cap on Pages
&lt;/h3&gt;

&lt;p&gt;We set a maximum of 2 pages per 8-hour on-call shift. If someone is paged more than that, the secondary automatically takes over and the overflow is flagged as a process failure for the weekly ops review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on_call_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_pages_per_shift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;shift_duration_hours&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;overflow_action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate_to_secondary"&lt;/span&gt;
  &lt;span class="na"&gt;overflow_review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly_ops_review"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
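&lt;p&gt;As a rough illustration of how that cap could be enforced in routing logic, here is a minimal Python sketch. The &lt;code&gt;Shift&lt;/code&gt; class and &lt;code&gt;route_page&lt;/code&gt; method are hypothetical and not part of any paging tool's API. The sketch assumes the first two pages go to the primary and every page after that goes to the secondary and is logged for review.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Shift:
    """Tracks pages for one on-call shift (hypothetical model)."""
    primary: str
    secondary: str
    max_pages: int = 2  # mirrors max_pages_per_shift in the policy
    pages_to_primary: int = 0
    overflow_log: list = field(default_factory=list)

    def route_page(self, alert_name):
        """Return who should receive this page under the cap."""
        if self.pages_to_primary >= self.max_pages:
            # Cap already reached: escalate and record it for the ops review.
            self.overflow_log.append(alert_name)
            return self.secondary
        self.pages_to_primary += 1
        return self.primary
```

&lt;p&gt;With &lt;code&gt;Shift("alice", "bob")&lt;/code&gt;, the third call to &lt;code&gt;route_page&lt;/code&gt; returns the secondary and leaves an entry in &lt;code&gt;overflow_log&lt;/code&gt;.&lt;/p&gt;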



&lt;h3&gt;
  
  
  2. Follow-the-Sun Rotation
&lt;/h3&gt;

&lt;p&gt;We stopped asking people to be on-call at 3am. With team members across US timezones, we created overlapping business-hours shifts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shift A (Eastern):  6am - 2pm ET
Shift B (Central):  11am - 7pm CT  
Shift C (Pacific):  2pm - 10pm PT
Overnight:          Managed by alert automation + escalation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nobody gets paged between 10pm and 6am local time unless it's a true P1.&lt;/p&gt;
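&lt;p&gt;The quiet-hours rule can be sketched as a small check. This is a minimal Python example that assumes the caller already has the engineer's local time. The names &lt;code&gt;in_quiet_hours&lt;/code&gt; and &lt;code&gt;should_page&lt;/code&gt; and the &lt;code&gt;"P1"&lt;/code&gt; severity string are illustrative.&lt;/p&gt;

```python
from datetime import time

QUIET_START = time(22, 0)  # 10pm local time
QUIET_END = time(6, 0)     # 6am local time


def in_quiet_hours(local_time):
    """True when local_time falls in the overnight window.

    The window wraps past midnight, so it is the union of
    [22:00, 24:00) and [00:00, 06:00).
    """
    return local_time >= QUIET_START or QUIET_END > local_time


def should_page(severity, local_time):
    """Page a human only outside quiet hours, unless the alert is a P1."""
    return severity == "P1" or not in_quiet_hours(local_time)
```

&lt;p&gt;Anything that does not pass &lt;code&gt;should_page&lt;/code&gt; would be left to the overnight automation and escalation path.&lt;/p&gt;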

&lt;h3&gt;
  
  
  3. On-Call Compensation
&lt;/h3&gt;

&lt;p&gt;We implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$500 flat fee per on-call week&lt;/li&gt;
&lt;li&gt;$200 per off-hours page&lt;/li&gt;
&lt;li&gt;Comp day after any overnight incident &amp;gt; 30 minutes&lt;/li&gt;
&lt;li&gt;On-call swaps require zero management approval&lt;/li&gt;
&lt;/ul&gt;
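&lt;p&gt;For budgeting purposes, the payout rules above reduce to simple arithmetic. Here is a minimal Python sketch. The function names are hypothetical, and it assumes one comp day per qualifying overnight incident, which the policy above does not spell out.&lt;/p&gt;

```python
def weekly_on_call_pay(off_hours_pages, flat_fee=500, per_page=200):
    """Dollar payout for one on-call week: flat fee plus per-page bonus."""
    return flat_fee + per_page * off_hours_pages


def comp_days_owed(overnight_incident_minutes, threshold_minutes=30):
    """Count overnight incidents that ran longer than the threshold.

    Assumption: each qualifying incident earns one comp day.
    """
    return sum(1 for m in overnight_incident_minutes if m > threshold_minutes)
```

&lt;p&gt;For example, a week with two off-hours pages pays &lt;code&gt;weekly_on_call_pay(2)&lt;/code&gt;, which is 500 + 2 × 200 = $900.&lt;/p&gt;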

&lt;h3&gt;
  
  
  4. The "Toil Budget"
&lt;/h3&gt;

&lt;p&gt;Each engineer gets a toil budget: maximum 30% of their time on operational work. If toil exceeds 30%, they're pulled from on-call until the team automates the excess.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Weekly toil tracking:
  Alert response:     4 hours
  Manual deployments: 2 hours  
  Config updates:     1 hour
  Ad-hoc debugging:   3 hours
  ─────────────────────────
  Total:              10 hours (25% of 40hr week) ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
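&lt;p&gt;The same check is easy to automate. Below is a minimal Python sketch that reproduces the weekly tally above, assuming a 40-hour week and the 30% budget. &lt;code&gt;toil_report&lt;/code&gt; and the category keys are hypothetical names.&lt;/p&gt;

```python
def toil_report(hours_by_category, week_hours=40, budget_pct=30):
    """Sum toil hours and compare them with the budget for one engineer."""
    total = sum(hours_by_category.values())
    pct = 100 * total / week_hours
    return {
        "total_hours": total,
        "toil_pct": pct,
        # Strictly over the budget means the engineer comes off on-call.
        "over_budget": pct > budget_pct,
    }


week = {
    "alert_response": 4,
    "manual_deployments": 2,
    "config_updates": 1,
    "adhoc_debugging": 3,
}
report = toil_report(week)
```

&lt;p&gt;Here &lt;code&gt;report&lt;/code&gt; comes out to 10 hours, or 25% of the week, so the engineer stays within budget.&lt;/p&gt;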



&lt;h3&gt;
  
  
  5. Quarterly On-Call Reviews
&lt;/h3&gt;

&lt;p&gt;Every quarter, we review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages per person&lt;/li&gt;
&lt;li&gt;Off-hours disruptions&lt;/li&gt;
&lt;li&gt;Toil percentages&lt;/li&gt;
&lt;li&gt;Team sentiment survey&lt;/li&gt;
&lt;li&gt;Attrition risk signals&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After (6 months)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Attrition rate&lt;/td&gt;
&lt;td&gt;40%/year&lt;/td&gt;
&lt;td&gt;8%/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pages per shift&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-hours pages&lt;/td&gt;
&lt;td&gt;12/week&lt;/td&gt;
&lt;td&gt;2/week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team NPS&lt;/td&gt;
&lt;td&gt;-15&lt;/td&gt;
&lt;td&gt;+45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recruitment cost saved&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;~$400K/year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;On-call wellness isn't a perk. It's a business decision. Replacing a senior SRE costs $150-200K in recruiting, onboarding, and lost productivity, so losing three in six months cost us roughly $450-600K. Preventing burnout costs a small fraction of that.&lt;/p&gt;

&lt;p&gt;If you're looking to reduce on-call toil and protect your team from burnout, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>oncall</category>
      <category>burnout</category>
      <category>culture</category>
    </item>
  </channel>
</rss>
