<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samson Tanimawo</title>
    <description>The latest articles on DEV Community by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://dev.to/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>DEV Community: Samson Tanimawo</title>
      <link>https://dev.to/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>The Engineer Who Owns Nothing: A Cautionary Tale</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 12 Jun 2026 20:15:33 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-engineer-who-owns-nothing-a-cautionary-tale-5b5c</link>
      <guid>https://dev.to/samson_tanimawo/the-engineer-who-owns-nothing-a-cautionary-tale-5b5c</guid>
      <description>&lt;p&gt;I'm going to tell you about an engineer I worked with. Call him Mark. Mark was talented, well-liked, and utterly ineffective. Here's what I learned from watching him.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Mark did
&lt;/h2&gt;

&lt;p&gt;Mark's technical skills were real. He wrote good code. He gave thoughtful design review comments. He spoke well in meetings.&lt;/p&gt;

&lt;p&gt;Mark's problem: he didn't own anything.&lt;/p&gt;

&lt;p&gt;He worked on whatever was in front of him. He fixed bugs in code he didn't write. He helped other engineers with their services. He never said 'this is mine.' Everything was 'we should figure out who handles that.'&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this was a problem
&lt;/h2&gt;

&lt;p&gt;When something broke in production, Mark would help debug — for a few minutes. Then he'd disengage because it 'wasn't really his area.' Nobody ever held him accountable because the code wasn't explicitly assigned to him.&lt;/p&gt;

&lt;p&gt;When planning happened, Mark didn't propose projects. He waited for projects to be assigned to him. Assigned projects rarely came because leaders couldn't predict what he'd actually commit to.&lt;/p&gt;

&lt;p&gt;When reliability work needed doing — alerts to tune, dashboards to fix, runbooks to write — Mark agreed it was important and waited for someone else to do it.&lt;/p&gt;

&lt;p&gt;Mark got good performance reviews for the first two years. He was technically capable, pleasant, and unobjectionable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;At year three, the company hit hard times. Leadership started asking: 'What has this person done? What do they own?' Mark had no answer.&lt;/p&gt;

&lt;p&gt;He got laid off. It wasn't a performance issue in the normal sense — he hadn't done anything wrong. He just hadn't owned anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson for SREs
&lt;/h2&gt;

&lt;p&gt;SRE is especially vulnerable to this trap. Everything is shared infrastructure. It's tempting to be the person who 'helps out everywhere' without ever claiming a specific service.&lt;/p&gt;

&lt;p&gt;Don't. Claim something.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Own an SLO&lt;/li&gt;
&lt;li&gt;Own a runbook library&lt;/li&gt;
&lt;li&gt;Own the post-mortem process&lt;/li&gt;
&lt;li&gt;Own the alerting hygiene for a service&lt;/li&gt;
&lt;li&gt;Own the capacity planning model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't have to be big. But it has to be explicitly yours, with consequences if it fails and credit if it succeeds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The meta-lesson
&lt;/h2&gt;

&lt;p&gt;Ownership isn't the same as being busy. You can be in a million meetings and own nothing. You can write 10,000 lines of code and own nothing. Ownership means: when something in your area breaks, you are the first person called, and you are the person who gets credit when it works.&lt;/p&gt;

&lt;p&gt;Pick something this week. Ask your manager: 'can I explicitly own X?' Write it down. Defend it.&lt;/p&gt;

&lt;p&gt;Don't be Mark.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>culture</category>
      <category>ownership</category>
    </item>
    <item>
      <title>Error Budget Policies That Hold Leadership Accountable</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 11 Jun 2026 21:23:20 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/error-budget-policies-that-hold-leadership-accountable-18f4</link>
      <guid>https://dev.to/samson_tanimawo/error-budget-policies-that-hold-leadership-accountable-18f4</guid>
      <description>&lt;p&gt;Error budgets are useless without a policy. 'We're out of error budget' should trigger consequences. If it doesn't, you don't have an error budget — you have a vanity metric.&lt;/p&gt;

&lt;p&gt;Here's a policy that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four states
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Healthy (&amp;lt; 70% of budget used).&lt;/strong&gt; Business as usual. Feature development proceeds at full speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch (70-90% used).&lt;/strong&gt; Feature work continues, but risky changes require explicit sign-off from an SRE. No freeze, just extra attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constrained (90-100% used).&lt;/strong&gt; Feature freezes. Only reliability work and critical bug fixes until we're back below 90%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breached (&amp;gt; 100% used).&lt;/strong&gt; Incident-level response. Leadership informed. Post-mortem for why we blew through. Feature work stays frozen until we recover &lt;em&gt;and&lt;/em&gt; identify systemic causes.&lt;/p&gt;
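
&lt;p&gt;The four states are easy to encode so that dashboards and deploy tooling agree on where you stand. Here's a minimal sketch, not a definitive implementation: the names &lt;code&gt;BudgetState&lt;/code&gt;, &lt;code&gt;Classify&lt;/code&gt;, and &lt;code&gt;FeatureWorkAllowed&lt;/code&gt; are hypothetical, and how the exact boundaries (70, 90, 100) are handled is my assumption, since the ranges above overlap at the edges.&lt;/p&gt;

```go
package main

import "fmt"

// BudgetState is a hypothetical name for the four policy states.
type BudgetState string

const (
	Healthy     BudgetState = "healthy"
	Watch       BudgetState = "watch"
	Constrained BudgetState = "constrained"
	Breached    BudgetState = "breached"
)

// Classify maps the percentage of error budget used to a policy state.
// Assumption: exactly 70 counts as watch, exactly 90 and 100 as constrained.
func Classify(usedPercent float64) BudgetState {
	switch {
	case usedPercent > 100:
		return Breached
	case usedPercent >= 90:
		return Constrained
	case usedPercent >= 70:
		return Watch
	default:
		return Healthy
	}
}

// FeatureWorkAllowed reports whether normal feature work may proceed.
// Constrained and breached both mean a freeze.
func FeatureWorkAllowed(s BudgetState) bool {
	return s == Healthy || s == Watch
}

func main() {
	for _, used := range []float64{30, 75, 95, 120} {
		s := Classify(used)
		fmt.Printf("%.0f%% used: %s, feature work allowed: %t\n", used, s, FeatureWorkAllowed(s))
	}
}
```

&lt;p&gt;Keeping the thresholds in one function means the weekly review and the deploy pipeline can't quietly disagree about which state you're in.&lt;/p&gt;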

&lt;h2&gt;
  
  
  The part most policies miss
&lt;/h2&gt;

&lt;p&gt;The feature freeze in 'constrained' state is the part that actually changes behavior. Everything else is documentation. Without consequences, teams ignore the budget.&lt;/p&gt;

&lt;p&gt;The freeze has to be &lt;em&gt;real&lt;/em&gt;. Leadership can't override it for a 'really important feature' — that's exactly the time the freeze matters. The only exception is a legitimate emergency fix, and those should be rare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Selling this to leadership
&lt;/h2&gt;

&lt;p&gt;Executives hate feature freezes. They see it as slowing the business. Counter-argument: feature freezes during budget exhaustion &lt;em&gt;protect&lt;/em&gt; the business. Shipping features onto broken infrastructure creates more breakage, which burns more budget, which is a doom loop.&lt;/p&gt;

&lt;p&gt;Frame it as: 'the feature freeze is a safety valve. When it triggers, it's because something's wrong and we need to fix it before making it worse.'&lt;/p&gt;

&lt;p&gt;Also: a good policy lets you spend the budget aggressively when you have it. Feature teams should be encouraged to experiment, deploy fast, and take risks when you're at 30% budget used. The freeze is only for when the safety margin is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The review cadence
&lt;/h2&gt;

&lt;p&gt;Weekly error budget review, 15 minutes max. Who attends: SRE lead, engineering manager, maybe a PM. Decisions: are we in healthy/watch/constrained? Any actions for the coming week?&lt;/p&gt;

&lt;p&gt;Monthly broader review with leadership. Trends over time. Investment decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The escalation
&lt;/h2&gt;

&lt;p&gt;If a team enters 'constrained' state three times in a quarter, that's a systemic issue. Escalate to engineering leadership with a proposal: either invest in reliability or accept a lower SLO formally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The endgame
&lt;/h2&gt;

&lt;p&gt;A mature organization uses error budget policy to balance feature velocity against reliability automatically. Nobody is negotiating individual decisions. The framework does the work.&lt;/p&gt;

&lt;p&gt;Getting there takes 6-12 months of discipline. The first few freezes will feel painful. After that, they become routine, and something surprising happens: you stop having them as often. The policy is working.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>slo</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Dependency Injection for Observability</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 10 Jun 2026 20:47:02 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/dependency-injection-for-observability-3508</link>
      <guid>https://dev.to/samson_tanimawo/dependency-injection-for-observability-3508</guid>
      <description>&lt;p&gt;Want your code to be easy to observe? Use dependency injection for observability concerns. Sounds dry. Hear me out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Your code calls &lt;code&gt;log.info(...)&lt;/code&gt; directly. In tests, you can't verify what was logged. In prod, if you want to change the logger, you're grepping the codebase. If you want to add tracing, you're editing every call site.&lt;/p&gt;

&lt;p&gt;Same for metrics. Same for tracing. Same for error reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Pass the observer in. Functions take a &lt;code&gt;logger&lt;/code&gt;, &lt;code&gt;metrics&lt;/code&gt;, or &lt;code&gt;tracer&lt;/code&gt; as an argument (or constructor dependency). The function doesn't know what's behind it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;HandleOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deps&lt;/span&gt; &lt;span class="n"&gt;Deps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"order.received"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"processing order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; nil
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In tests, pass in a mock that captures all calls&lt;/li&gt;
&lt;li&gt;In prod, pass in the real logger&lt;/li&gt;
&lt;li&gt;Changing backends (Datadog → Prometheus) is one wire-up change&lt;/li&gt;
&lt;li&gt;Adding tracing is one new field in &lt;code&gt;Deps&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
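
&lt;p&gt;To make the testing point concrete, here's a rough sketch of what the wiring could look like. The &lt;code&gt;Logger&lt;/code&gt; and &lt;code&gt;Metrics&lt;/code&gt; interfaces, the shape of &lt;code&gt;Deps&lt;/code&gt; and &lt;code&gt;Order&lt;/code&gt;, and the capturing fakes are all assumptions for illustration. Your real interfaces will depend on the logging and metrics libraries you use.&lt;/p&gt;

```go
package main

import "fmt"

// Logger and Metrics are hypothetical interfaces for this sketch.
type Logger interface {
	Info(msg string, kv ...interface{})
}

type Metrics interface {
	Inc(name string)
}

// Deps bundles the observability dependencies a handler needs.
type Deps struct {
	Logger  Logger
	Metrics Metrics
}

type Order struct {
	ID string
}

func HandleOrder(order Order, deps Deps) error {
	deps.Metrics.Inc("order.received")
	deps.Logger.Info("processing order", "id", order.ID)
	return nil
}

// captureLogger records every message instead of writing it anywhere.
type captureLogger struct {
	messages []string
}

func (c *captureLogger) Info(msg string, kv ...interface{}) {
	c.messages = append(c.messages, msg)
}

// captureMetrics counts increments per metric name.
type captureMetrics struct {
	counts map[string]int
}

func (c *captureMetrics) Inc(name string) {
	c.counts[name]++
}

func newCaptureMetrics() *captureMetrics {
	m := new(captureMetrics)
	m.counts = map[string]int{}
	return m
}

func main() {
	logger := new(captureLogger)
	metrics := newCaptureMetrics()
	err := HandleOrder(Order{ID: "o-1"}, Deps{Logger: logger, Metrics: metrics})
	fmt.Println(err, logger.messages, metrics.counts["order.received"])
}
```

&lt;p&gt;A test can assert on &lt;code&gt;logger.messages&lt;/code&gt; and &lt;code&gt;metrics.counts&lt;/code&gt; directly, and production code passes real implementations of the same two interfaces.&lt;/p&gt;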

&lt;h2&gt;
  
  
  Why most codebases don't do this
&lt;/h2&gt;

&lt;p&gt;It feels verbose. 'Why do I have to thread a logger through every function?' Engineers hate boilerplate.&lt;/p&gt;

&lt;p&gt;The alternative is a global singleton. Easy to use, impossible to test cleanly, nightmare to refactor.&lt;/p&gt;

&lt;p&gt;The boilerplate is worth it. Especially for observability, where you &lt;em&gt;will&lt;/em&gt; want to swap implementations later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The subtle win
&lt;/h2&gt;

&lt;p&gt;Dependency-injected observability forces you to think about &lt;em&gt;what&lt;/em&gt; you're observing. When you have to explicitly pass the logger, you notice that a function is calling it 8 times. Is that too much? Is the logging doing real work? Would one structured log at the end of the function be better?&lt;/p&gt;

&lt;p&gt;Functions with injected dependencies tend to have better observability because the developer had to look at it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;You don't need to refactor everything at once. Pick one critical path — the checkout flow, the auth path. Refactor just that path to use dependency injection for observability. See if tests and debugging get easier.&lt;/p&gt;

&lt;p&gt;If yes, expand. If no, you've found that something else is the bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger idea
&lt;/h2&gt;

&lt;p&gt;Observability is code. Treat it with the same architectural discipline you treat the rest of your codebase. Dependency injection is one tool. There are others (context objects, middleware, decorators). Pick one, apply it consistently, and your future-you will be able to observe your code in ways that are impossible with ad-hoc &lt;code&gt;log.info&lt;/code&gt; calls everywhere.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>observability</category>
      <category>patterns</category>
    </item>
    <item>
      <title>Load Balancer Tuning: Lessons from Production</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 09 Jun 2026 20:36:49 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/load-balancer-tuning-lessons-from-production-5cg3</link>
      <guid>https://dev.to/samson_tanimawo/load-balancer-tuning-lessons-from-production-5cg3</guid>
      <description>&lt;p&gt;Load balancers are the silent infrastructure. You don't think about them until they start dropping connections at 2 AM. Here are the settings that have bitten me in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection timeouts
&lt;/h2&gt;

&lt;p&gt;Default connection idle timeout on AWS ALB is 60 seconds. Your app might have requests that legitimately take 90 seconds. Result: the ALB drops the connection mid-request, your user sees a 504, and your logs show nothing because the app was still processing.&lt;/p&gt;

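&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; set the LB idle timeout higher than your longest legitimate request. For most APIs, 120-300 seconds. Check it against your app's own timeouts too: AWS recommends setting the backend's keep-alive timeout longer than the load balancer's idle timeout.&lt;/p&gt;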

&lt;h2&gt;
  
  
  Health check intervals
&lt;/h2&gt;

&lt;p&gt;Default health check interval is usually 30 seconds, with 2 failures before marking unhealthy. That's up to 60 seconds of traffic sent to a dying instance before it's removed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; 10-second intervals, 2 failures. 20 seconds to remove a bad instance is much better. Yes, slightly more load on your service. Worth it.&lt;/p&gt;
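
&lt;p&gt;The arithmetic behind those numbers is just interval times failure threshold. A small sketch, with &lt;code&gt;WorstCaseDetectionSeconds&lt;/code&gt; as a hypothetical helper name. It ignores the per-check timeout, which is a simplification:&lt;/p&gt;

```go
package main

import "fmt"

// WorstCaseDetectionSeconds returns the longest time a failing instance can
// keep receiving traffic: one full interval per required consecutive failure.
// Simplification: the health check's own timeout is not included.
func WorstCaseDetectionSeconds(intervalSeconds, failures int) int {
	return intervalSeconds * failures
}

func main() {
	fmt.Println(WorstCaseDetectionSeconds(30, 2)) // common default: 60
	fmt.Println(WorstCaseDetectionSeconds(10, 2)) // tuned: 20
}
```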

&lt;h2&gt;
  
  
  Unhealthy threshold
&lt;/h2&gt;

&lt;p&gt;Marking an instance unhealthy after 2 consecutive failures is usually right. Marking it healthy again after 2 consecutive successes is usually wrong — it lets half-broken instances come back prematurely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; require 3-5 consecutive successes to mark healthy. Slower recovery, more stable routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slow start / connection draining
&lt;/h2&gt;

&lt;p&gt;When a new instance comes online, it's often cold — empty caches, no warmed connections. If the LB sends it full traffic immediately, it performs badly and might get marked unhealthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; enable slow start (AWS calls it 'slow start mode' on ALB). Ramp traffic over 30-60 seconds. Worth it for any service that needs warming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky sessions
&lt;/h2&gt;

&lt;p&gt;Sticky sessions feel like a solution. They're usually a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; avoid them. If you need them, your app has shared state that should be in a database or cache, not in-process. The one exception: WebSocket connections, where stickiness is unavoidable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-zone load balancing
&lt;/h2&gt;

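&lt;p&gt;Disabled by default on some LBs. With cross-zone disabled, each LB node only routes to instances in its own AZ, so an instance in AZ-A only serves traffic that arrives in AZ-A. If traffic is split evenly across AZs but AZ-A has fewer instances, each instance in AZ-A carries a larger share of the load.&lt;/p&gt;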

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; enable it unless you have a specific reason not to. Costs a small amount of cross-AZ traffic but gives you even load distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The meta-lesson
&lt;/h2&gt;

&lt;p&gt;Load balancer defaults are designed for 'will work for everybody, badly.' Any given workload needs tuning.&lt;/p&gt;

&lt;p&gt;Spend one afternoon reviewing your LB configs. Ask: 'is this default right for my app?' At least half the defaults will be wrong. Tune them. Future you will thank you during the next incident.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>loadbalancer</category>
      <category>performance</category>
    </item>
    <item>
      <title>Capacity Planning for Startups</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 08 Jun 2026 20:14:20 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/capacity-planning-for-startups-4m90</link>
      <guid>https://dev.to/samson_tanimawo/capacity-planning-for-startups-4m90</guid>
      <description>&lt;p&gt;Capacity planning sounds like enterprise spreadsheet work. For a startup, it's 'don't get embarrassed when traffic spikes, don't go broke overprovisioning.' Here's the pragmatic version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What does normal look like right now?&lt;/strong&gt; Peak RPS, p99 latency, CPU/memory utilization at peak. If you don't know these, stop and measure. You cannot plan without a baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What's the next expected spike?&lt;/strong&gt; A launch. A press mention. A marketing campaign. The Black Friday of your industry. Put these on a calendar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. How long does it take to add capacity?&lt;/strong&gt; Minutes (autoscaling)? Hours (VM provisioning)? Weeks (vendor contracts)?&lt;/p&gt;

&lt;p&gt;Your capacity plan has to cover the gap between how fast a spike arrives and how fast you can add capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The startup hack: overprovision early
&lt;/h2&gt;

&lt;p&gt;At startup scale, overprovisioning is cheap. An extra $5k/month of slack is trivial compared to the embarrassment of 'we went down during the launch.'&lt;/p&gt;

&lt;p&gt;Run at 30-40% peak utilization. Yes, that's wasteful. It's also roughly a 2.5-3x buffer for unexpected spikes. Worth it until you're big enough to care about the efficiency.&lt;/p&gt;
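
&lt;p&gt;If you want to sanity-check the buffer for your own numbers, the headroom is one over the peak utilization. A quick sketch, assuming load scales linearly with traffic (it often doesn't near saturation); &lt;code&gt;Headroom&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```go
package main

import "fmt"

// Headroom returns how many times current peak traffic the fleet could absorb
// before hitting 100% utilization, assuming load scales linearly with traffic.
func Headroom(peakUtilization float64) float64 {
	return 1 / peakUtilization
}

func main() {
	for _, u := range []float64{0.30, 0.40, 0.70} {
		fmt.Printf("peak utilization %.0f%% leaves %.1fx headroom\n", u*100, Headroom(u))
	}
}
```

&lt;p&gt;At 30% you can absorb a bit over 3x your current peak, at 40% about 2.5x, and at 70% well under 1.5x.&lt;/p&gt;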

&lt;h2&gt;
  
  
  The autoscaling reality
&lt;/h2&gt;

&lt;p&gt;Autoscaling is great but has limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cold start times mean it can't handle traffic that doubles in 30 seconds&lt;/li&gt;
&lt;li&gt;Provisioning limits mean you can only add X instances per minute&lt;/li&gt;
&lt;li&gt;Downstream dependencies (databases, queues) usually don't autoscale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test your autoscaling &lt;em&gt;before&lt;/em&gt; you need it. The first time I depended on autoscaling in production, it worked. The second time, it didn't, because our database connection pool was capped. That was the real bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 bottlenecks to check
&lt;/h2&gt;

&lt;p&gt;For every scaling test, verify:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stateless service capacity.&lt;/strong&gt; Usually easy — just add more instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database capacity.&lt;/strong&gt; Connection counts, query latency, replication lag. Usually the real bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-party dependencies.&lt;/strong&gt; Rate limits on external APIs, email providers, payment processors. A sudden 10x spike usually hits someone's rate limit.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The launch checklist
&lt;/h2&gt;

&lt;p&gt;Before any planned spike:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overprovision by 3x what you think you need&lt;/li&gt;
&lt;li&gt;Pre-warm caches and connection pools&lt;/li&gt;
&lt;li&gt;Confirm your paging rotation is ready&lt;/li&gt;
&lt;li&gt;Prepare a rollback plan for the feature being launched&lt;/li&gt;
&lt;li&gt;Schedule the launch during your team's awake hours, not off-hours&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The real lesson
&lt;/h2&gt;

&lt;p&gt;For startups, the goal of capacity planning is not efficiency. It's confidence. If you have to spend a little more to avoid panic during growth, spend it. Optimize for efficiency later, when you have a year of traffic history to work from.&lt;/p&gt;

&lt;p&gt;Right now, your job is to stay standing. Do that first.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>capacity</category>
      <category>startup</category>
    </item>
    <item>
      <title>How We Handled Our First Major Outage (And Survived)</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 07 Jun 2026 21:13:56 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/how-we-handled-our-first-major-outage-and-survived-1idm</link>
      <guid>https://dev.to/samson_tanimawo/how-we-handled-our-first-major-outage-and-survived-1idm</guid>
      <description>&lt;p&gt;Three years ago we had our first real outage. Six hours of downtime. Thousands of angry users. Multiple executives on the call. Here's what we did right, what we did wrong, and what we'd do differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we did right
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Communicated immediately.&lt;/strong&gt; The moment we knew we had a problem, we updated the status page and emailed our biggest customers personally. Not when we had answers. When we had a question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Had a single incident commander.&lt;/strong&gt; One person making calls. Not a committee. When the CEO tried to direct technical work, the IC politely rerouted and told her where her help was actually needed (talking to customers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Took care of our people.&lt;/strong&gt; During hour 4, I ordered food. During hour 5, I forced the primary engineer off the call for 20 minutes to walk outside. Long incidents destroy people. You have to feed them and force them to rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Wrote it down as we went.&lt;/strong&gt; We had a shared doc with a live timeline. When the post-mortem came, we had every decision captured.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we did wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tried to fix the root cause during the incident.&lt;/strong&gt; For the first 2 hours, we were digging into &lt;em&gt;why&lt;/em&gt; the database was struggling. We should have been mitigating (rolling back) first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Let too many people 'help.'&lt;/strong&gt; By hour 3, we had 12 engineers in the call. Half of them were useless. The IC should have kicked people out sooner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Gave optimistic estimates.&lt;/strong&gt; 'We'll be back in 30 minutes.' We were not back in 30 minutes. That miscommunication was worse than saying 'unknown.'&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Didn't prepare the executive communication.&lt;/strong&gt; The CEO had to answer customer questions in real time with no script. We should have drafted talking points for her after hour 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we'd do differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Mitigate first, investigate second. Always.&lt;/li&gt;
&lt;li&gt;Cap the number of active engineers at 4 during an incident. Others go on standby.&lt;/li&gt;
&lt;li&gt;Default to 'unknown' for estimates. Only give a number when we're sure.&lt;/li&gt;
&lt;li&gt;Assign someone explicitly to 'executive liaison.' Their job is to keep the C-suite informed without interrupting the technical team.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The aftermath
&lt;/h2&gt;

&lt;p&gt;The post-mortem was brutal and cathartic. We identified 14 action items. We actually did 11 of them over the next quarter.&lt;/p&gt;

&lt;p&gt;The outage was the best thing that happened to our reliability culture. It turned reliability from 'a thing SRE owns' into 'a thing everyone takes seriously.' I wouldn't wish a 6-hour outage on anyone, but I also wouldn't trade the lessons.&lt;/p&gt;

&lt;h2&gt;
  
  
  The final lesson
&lt;/h2&gt;

&lt;p&gt;Your first major outage will happen. Prepare for it by running game days. The game days will feel silly until the real thing happens, at which point every muscle you trained will kick in.&lt;/p&gt;

&lt;p&gt;Incident response is a skill. Skills need practice. Practice now.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>incident</category>
      <category>culture</category>
    </item>
    <item>
      <title>The Economics of Reliability: When to Invest, When to Accept Risk</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 06 Jun 2026 20:16:40 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-economics-of-reliability-when-to-invest-when-to-accept-risk-1i22</link>
      <guid>https://dev.to/samson_tanimawo/the-economics-of-reliability-when-to-invest-when-to-accept-risk-1i22</guid>
      <description>&lt;p&gt;Reliability is not a virtue. It's an investment. Too little and you lose customers. Too much and you can't afford to ship. The question is: where's the right balance?&lt;/p&gt;

&lt;h2&gt;
  
  
  The error budget framing
&lt;/h2&gt;

&lt;p&gt;The SRE book covers this well. Pick an SLO (say 99.9% uptime). That's 43 minutes of budget per month. If you're at 99.95%, you have budget to spend on risky things. If you're at 99.85%, you need to stop shipping risk.&lt;/p&gt;
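
&lt;p&gt;Here's the budget arithmetic as a small sketch, assuming a 30-day month and a time-based availability SLO. &lt;code&gt;BudgetMinutes&lt;/code&gt; and &lt;code&gt;RemainingMinutes&lt;/code&gt; are hypothetical helper names:&lt;/p&gt;

```go
package main

import "fmt"

// BudgetMinutes returns allowed downtime in minutes for an availability SLO
// (in percent) over a window of the given number of days.
func BudgetMinutes(sloPercent float64, days int) float64 {
	totalMinutes := float64(days) * 24 * 60
	return totalMinutes * (100 - sloPercent) / 100
}

// RemainingMinutes returns the budget left given measured availability.
// A negative result means the budget is overspent.
func RemainingMinutes(sloPercent, measuredPercent float64, days int) float64 {
	totalMinutes := float64(days) * 24 * 60
	used := totalMinutes * (100 - measuredPercent) / 100
	return BudgetMinutes(sloPercent, days) - used
}

func main() {
	fmt.Printf("budget: %.1f min\n", BudgetMinutes(99.9, 30))
	fmt.Printf("at 99.95%%: %.1f min left\n", RemainingMinutes(99.9, 99.95, 30))
	fmt.Printf("at 99.85%%: %.1f min left\n", RemainingMinutes(99.9, 99.85, 30))
}
```

&lt;p&gt;For a 99.9% SLO that's 43.2 minutes per 30 days. Running at 99.95% leaves about half of it unspent, and running at 99.85% puts you about 21.6 minutes over.&lt;/p&gt;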

&lt;p&gt;This works. But it doesn't answer 'what SLO should I pick?' Let me give you a framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What do users expect?&lt;/strong&gt; A consumer banking app needs 4 nines or more. A developer tool can get away with 3. A beta product can live with 99%. Ask users (or watch churn numbers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What does an outage cost?&lt;/strong&gt; Dollars of lost revenue + dollars of customer churn + hours of engineering time. For a checkout-heavy product, an hour of downtime might cost $500k. For a B2B internal tool, $5k.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What does the next 9 cost?&lt;/strong&gt; Going from 99% to 99.9% might cost $50k of engineering work. Going from 99.9% to 99.99% often costs $500k or more. Each 9 is 10x harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math
&lt;/h2&gt;

&lt;p&gt;Invest in reliability up to the point where the next $1 invested saves less than $1 of outage cost over the amortization period.&lt;/p&gt;

&lt;p&gt;If moving from 99% to 99.9% costs $50k and would save $200k over a year in reduced outage damage, invest. Easy call.&lt;/p&gt;

&lt;p&gt;If moving from 99.9% to 99.99% costs $500k and saves $100k, don't invest. Accept the risk, and spend the engineering time on something with better ROI.&lt;/p&gt;
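
&lt;p&gt;The two examples reduce to one comparison. A minimal sketch, assuming a one-year amortization period for both cases (the second example above doesn't state its period) and using a hypothetical &lt;code&gt;Investment&lt;/code&gt; type. The dollar figures are the illustrative ones from the text, not real data:&lt;/p&gt;

```go
package main

import "fmt"

// Investment describes one reliability upgrade. All figures are illustrative.
type Investment struct {
	Name          string
	Cost          float64 // engineering cost in dollars
	AnnualSavings float64 // expected reduction in outage damage per year
	Years         float64 // amortization period (assumed to be 1 here)
}

// WorthIt reports whether expected savings over the period exceed the cost.
func WorthIt(inv Investment) bool {
	return inv.AnnualSavings*inv.Years > inv.Cost
}

func main() {
	options := []Investment{
		{"99% to 99.9%", 50000, 200000, 1},
		{"99.9% to 99.99%", 500000, 100000, 1},
	}
	for _, o := range options {
		fmt.Printf("%s: invest=%t\n", o.Name, WorthIt(o))
	}
}
```

&lt;p&gt;The hard part in practice is estimating &lt;code&gt;AnnualSavings&lt;/code&gt;. The comparison itself is trivial once you have a number you trust.&lt;/p&gt;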

&lt;h2&gt;
  
  
  The hidden cost
&lt;/h2&gt;

&lt;p&gt;Over-investing in reliability has a hidden cost: team velocity. Teams that chase 99.999% uptime spend so much on tests, canaries, staging environments, and approval gates that they ship slowly. Competitors with 99.5% reliability but 5x your velocity will win the market.&lt;/p&gt;

&lt;p&gt;Reliability that kills velocity is bad reliability. Measure both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The political reality
&lt;/h2&gt;

&lt;p&gt;The hardest part isn't the math. It's defending 'we're not going to fix this' when a VP demands reliability improvements. You need explicit agreement, in writing, on the target SLOs — so 'we're not fixing this' is 'we agreed on 99.9% and we're at 99.92%, which is within budget.'&lt;/p&gt;

&lt;h2&gt;
  
  
  The pragmatic answer
&lt;/h2&gt;

&lt;p&gt;Most teams I've worked with should pick:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99.9% for production-critical services&lt;/li&gt;
&lt;li&gt;99% for internal tools&lt;/li&gt;
&lt;li&gt;Best-effort for dev environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then measure, spend the error budget, and stop arguing about it. The math is usually clearer than the politics.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>reliability</category>
      <category>strategy</category>
    </item>
    <item>
      <title>Why Your Status Page Should Be Boring</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 05 Jun 2026 20:41:09 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/why-your-status-page-should-be-boring-2ieb</link>
      <guid>https://dev.to/samson_tanimawo/why-your-status-page-should-be-boring-2ieb</guid>
      <description>&lt;p&gt;A good status page is boring. Calm design, minimal copy, clear current state. If your status page feels exciting, something is wrong.&lt;/p&gt;

&lt;p&gt;Here's what I've learned from running status pages for three different products.&lt;/p&gt;

&lt;h2&gt;
  
  
  What users actually want from a status page
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Is it me or is it you?&lt;/strong&gt; The #1 question. Answer it in the first 3 seconds of landing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. If it's you, what exactly is broken?&lt;/strong&gt; Not 'we're experiencing issues.' Specifically: 'API endpoint /v2/checkout is returning 500 errors for ~15% of requests.'&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. How long until it's fixed?&lt;/strong&gt; Even 'unknown' is better than no estimate. 'Investigating' with a last-updated timestamp beats silence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Should I keep retrying?&lt;/strong&gt; If you're broken and expect to stay broken for a while, tell users to back off. Your support queue will thank you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What users don't want
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Corporate-speak ('we're aware of a potential service degradation')&lt;/li&gt;
&lt;li&gt;Vague promises ('working to resolve as quickly as possible')&lt;/li&gt;
&lt;li&gt;Technical jargon they can't parse&lt;/li&gt;
&lt;li&gt;Delayed acknowledgments (updating the page 20 minutes after an outage starts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The update cadence rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Acknowledge within 5 minutes of detection&lt;/li&gt;
&lt;li&gt;Update every 15-30 minutes during active investigation&lt;/li&gt;
&lt;li&gt;Mark as monitoring as soon as mitigation is in place, even if the cause is unknown&lt;/li&gt;
&lt;li&gt;Mark as resolved when you're confident it's fixed — not before&lt;/li&gt;
&lt;li&gt;Always do a final post-incident summary&lt;/li&gt;
&lt;/ul&gt;
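
&lt;p&gt;The first two rules are easy to check mechanically. The sketch below is one way to do it, assuming you record when the incident was detected and when the page was last updated. The function name, the constants, and the choice of 30 minutes (the upper end of the range above) are illustrative assumptions, not part of any status page product.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_DEADLINE = timedelta(minutes=5)      # acknowledge within 5 minutes of detection
UPDATE_INTERVAL = timedelta(minutes=30)  # upper end of the 15-30 minute cadence


def update_overdue(detected_at: datetime, last_update_at: Optional[datetime], now: datetime) -> bool:
    """Return True when the status page should already have been updated."""
    if last_update_at is None:
        # Nothing posted yet, so the acknowledgment deadline applies.
        return now - detected_at > ACK_DEADLINE
    return now - last_update_at > UPDATE_INTERVAL


detected = datetime(2026, 6, 5, 14, 0)
# No acknowledgment 7 minutes after detection: overdue.
print(update_overdue(detected, None, detected + timedelta(minutes=7)))
# Last update was 16 minutes ago: still within the cadence.
print(update_overdue(detected, detected + timedelta(minutes=4), detected + timedelta(minutes=20)))
```

&lt;p&gt;A check like this is best used to nudge a human, for example by reminding the incident channel. It should not write the update itself.&lt;/p&gt;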

&lt;h2&gt;
  
  
  The hard part: being honest
&lt;/h2&gt;

&lt;p&gt;The temptation is to use minimizing language: 'a small number of users' when it's actually 20%, 'minor issue' when it's a real outage.&lt;/p&gt;

&lt;p&gt;Don't. Users trust a status page that's honest with them. The first time you get caught minimizing, you've lost credibility that takes years to earn back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The automation trap
&lt;/h2&gt;

&lt;p&gt;Automated status pages that say 'all systems operational' while your product is clearly broken are worse than no page. Users lose trust in the entire signal.&lt;/p&gt;

&lt;p&gt;If you automate, automate the detection to trigger human review. Don't automate the reassurance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a boring status page looks like
&lt;/h2&gt;

&lt;p&gt;Uptime over 30 days. Current state of each service (green/yellow/red). A list of recent incidents with their post-mortems linked.&lt;/p&gt;

&lt;p&gt;That's it. No marketing copy. No animated elements. Boring.&lt;/p&gt;

&lt;p&gt;Boring is trustworthy. Make your status page boring.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>statuspage</category>
      <category>communication</category>
    </item>
    <item>
      <title>Building Trust with Product Teams as an SRE</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 04 Jun 2026 20:16:58 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/building-trust-with-product-teams-as-an-sre-34hj</link>
      <guid>https://dev.to/samson_tanimawo/building-trust-with-product-teams-as-an-sre-34hj</guid>
      <description>&lt;p&gt;SRE teams that fight with product teams don't get things done. SRE teams that get along with product teams get surprising amounts of reliability work done by product engineers themselves. Here's how to build that trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start by making them faster, not slower
&lt;/h2&gt;

&lt;p&gt;The common SRE pattern: introduce yourself by adding gates. 'You need to do X before you can deploy.' 'Your service needs Y before production.' Product teams immediately see you as friction.&lt;/p&gt;

&lt;p&gt;Better pattern: introduce yourself by removing friction. 'I noticed your CI takes 20 minutes. I can get it to 6. Interested?' Now you're useful. The gates come later, and you have trust to spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speak their language
&lt;/h2&gt;

&lt;p&gt;Product engineers care about: shipping, feature quality, user complaints. They don't care about: error budgets, SLIs, observability stacks.&lt;/p&gt;

&lt;p&gt;Translate. Instead of 'we're over our SLO,' say 'users are seeing errors on checkout — here's what's hitting them and what it's costing.' The facts are the same. The reception is totally different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Share ownership of incidents
&lt;/h2&gt;

&lt;p&gt;When a product team's code breaks prod, resist the urge to fix it yourself. Instead: be in the room, coach them through it, let them own the fix.&lt;/p&gt;

&lt;p&gt;Yes, it's slower. Yes, sometimes they'll ask awkward questions. That's exactly the point. They're learning. After 3 incidents, they'll be better engineers and more grateful to your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Give credit publicly
&lt;/h2&gt;

&lt;p&gt;When a product team does reliability work well, say so. Publicly. In eng all-hands. On the CEO's Slack thread.&lt;/p&gt;

&lt;p&gt;This sounds performative. It's not. It's you saying 'reliability is valued here, and we notice when people invest in it.' Other teams see it and start wanting credit too. You're using recognition as a reliability multiplier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Absorb the hit sometimes
&lt;/h2&gt;

&lt;p&gt;Sometimes a product team will say 'we don't have time for the runbook right now, can you just do it?' Say yes. Write the runbook. Don't make a thing of it.&lt;/p&gt;

&lt;p&gt;Do this too often and you're a service team. Do this never and you're hostile. Do this sometimes, strategically, and you're building long-term trust. Read the room.&lt;/p&gt;

&lt;h2&gt;
  
  
  The end goal
&lt;/h2&gt;

&lt;p&gt;After a year of this, product teams start coming to you &lt;em&gt;before&lt;/em&gt; launches. 'We're about to ship X — any reliability concerns?' That's when you know you've won. They see you as a partner, not a checkpoint.&lt;/p&gt;

&lt;p&gt;SRE culture work is slow. It compounds. Invest in it from day one.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>culture</category>
      <category>collaboration</category>
    </item>
    <item>
      <title>Incident Command: The Skills They Don't Teach You</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 03 Jun 2026 20:24:48 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/incident-command-the-skills-they-dont-teach-you-g68</link>
      <guid>https://dev.to/samson_tanimawo/incident-command-the-skills-they-dont-teach-you-g68</guid>
      <description>&lt;p&gt;Running a production incident is a skill. Most of the skill isn't technical. Here's what nobody told me when I started running incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 1: Calling the cadence
&lt;/h2&gt;

&lt;p&gt;During an incident, time warps. Everyone is heads-down in logs. Nobody remembers when they last updated the status channel.&lt;/p&gt;

&lt;p&gt;The incident commander's job is to force a cadence: 'Update in 5 minutes. What do we know? What do we need?' Without this, the incident drags on because no one is aggregating context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 2: Saying 'I don't know, and here's what we're doing to find out'
&lt;/h2&gt;

&lt;p&gt;Stakeholders want certainty. You can't give it. The temptation is to guess.&lt;/p&gt;

&lt;p&gt;Don't guess. Say 'We don't know the cause yet. We're investigating X and Y. I'll update in 10 minutes.' Trust builds on honesty, not performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 3: Interrupting your engineers
&lt;/h2&gt;

&lt;p&gt;Your engineers are investigating. They don't want to stop and explain. But if you don't interrupt, you can't make decisions.&lt;/p&gt;

&lt;p&gt;Do it anyway. Say 'I know you're busy. 30 seconds — what have you learned?' Most engineers will appreciate the structure, even if they complain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 4: Knowing when to stop investigating and start mitigating
&lt;/h2&gt;

&lt;p&gt;The temptation in an incident is to find the root cause. The right action is usually to mitigate first and investigate second.&lt;/p&gt;

&lt;p&gt;'We don't know why, but rolling back stops it' is a win. Don't feel bad. The post-mortem can figure out why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 5: Managing morale
&lt;/h2&gt;

&lt;p&gt;Long incidents grind people down. Notice when your team is flagging. Bring in a relief shift. Say 'good job, let's take 10.' Acknowledge that it's hard.&lt;/p&gt;

&lt;p&gt;The worst incidents I've been in were the ones where the team ran out of emotional energy before the problem was fixed. That's on the commander.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill 6: Declaring the incident over
&lt;/h2&gt;

&lt;p&gt;Incidents often drag on past actual resolution because nobody wants to declare victory. Declare it. 'Issue resolved at 15:47. We'll keep monitoring for 30 minutes but the incident is closed.' People need permission to exhale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real job
&lt;/h2&gt;

&lt;p&gt;Incident command is emotional labor disguised as technical work. The best commanders I know are calm, honest, and generous with credit. The worst are the ones who try to be the smartest technical person in the room — that's not the job.&lt;/p&gt;

&lt;p&gt;You can't learn this from a book. You learn it by running incidents badly and asking for feedback afterward. The feedback is usually uncomfortable. It's also how you get good.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>incident</category>
      <category>leadership</category>
    </item>
    <item>
      <title>How AI Is Changing SRE Workflows (Without Replacing SREs)</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 02 Jun 2026 20:46:44 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/how-ai-is-changing-sre-workflows-without-replacing-sres-efg</link>
      <guid>https://dev.to/samson_tanimawo/how-ai-is-changing-sre-workflows-without-replacing-sres-efg</guid>
      <description>&lt;p&gt;I get asked this question a lot: 'Is AI going to replace SREs?' Short answer: no. Long answer: AI is changing what SREs spend their time on, and the SREs who adapt will have a huge edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI is actually good at in SRE workflows
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. First-pass triage.&lt;/strong&gt; AI can look at 50 alerts and tell you the 5 most likely to be related to an ongoing incident. Beats manual correlation every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log summarization.&lt;/strong&gt; 100,000 log lines into a 20-line summary highlighting anomalies. The summary isn't always right, but it's a starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Runbook generation.&lt;/strong&gt; Given an alert type and historical incidents, AI can draft a runbook. You edit it; you don't write from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Post-mortem first drafts.&lt;/strong&gt; Pull from chat logs, ticket history, monitoring data. Generate a structured timeline. Human polishes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Routine query generation.&lt;/strong&gt; 'Show me the error rate for service X grouped by endpoint for the last 24 hours.' AI writes the query, you run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI is bad at
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Judgment calls under pressure.&lt;/strong&gt; When multiple things could be wrong, a good SRE uses instincts built from years of experience. AI guesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Novel failures.&lt;/strong&gt; AI is pattern-matching on history. A truly new failure mode looks like noise to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Organizational politics.&lt;/strong&gt; 'Who do I wake up at 3 AM' is not a technical question. AI doesn't help here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Accountability.&lt;/strong&gt; When something breaks, someone needs to own the decisions that got made. AI can't own anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The new SRE workflow
&lt;/h2&gt;

&lt;p&gt;AI does the first 30%. Human does the crucial middle 40% (judgment, decision-making, stakeholder communication). AI does the last 30% (writing up, following up, documenting).&lt;/p&gt;

&lt;p&gt;This lets SREs handle 2-3x more incidents with the same quality. Not by working harder — by delegating the mechanical parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to learn
&lt;/h2&gt;

&lt;p&gt;If you're an SRE and you're not using AI tools in your workflow, you're leaving 30% of your productivity on the table. Not because AI is magic, but because the boring parts of your job don't need a human.&lt;/p&gt;

&lt;p&gt;Start small: feed an alert into ChatGPT or Claude and ask for possible causes. See what you get. Then add it to your actual on-call tooling.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Security Monitoring for SRE Teams</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 01 Jun 2026 20:54:02 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/security-monitoring-for-sre-teams-4j8h</link>
      <guid>https://dev.to/samson_tanimawo/security-monitoring-for-sre-teams-4j8h</guid>
      <description>&lt;p&gt;Security used to be a separate team. Increasingly, SRE teams are being asked to own the monitoring side of it. Here's a practical framework that doesn't turn you into a SOC analyst.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you already have
&lt;/h2&gt;

&lt;p&gt;Your existing observability stack is half of a security monitoring solution. You already collect logs, metrics, and traces. You already have alerts. The missing piece is usually &lt;em&gt;what&lt;/em&gt; to look at.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to add
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Authentication anomalies.&lt;/strong&gt; Alert on impossible-travel logins (a user logging in from Tokyo, then from Paris 10 minutes later), brute-force patterns, and unusual session durations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Privilege escalation patterns.&lt;/strong&gt; New admin role granted. Service account added to a sensitive group. Kubernetes role binding changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Unusual data access.&lt;/strong&gt; A user or service reading 10x the normal volume of records. Downloads from sensitive S3 buckets. Queries against PII tables by accounts that don't normally touch them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Outbound traffic anomalies.&lt;/strong&gt; A process that has never called an external IP suddenly connecting to one. Large egress volumes during off-hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Failed auth spikes.&lt;/strong&gt; Not just the login endpoint. Internal auth, API keys, mTLS — anywhere auth happens.&lt;/p&gt;
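
&lt;p&gt;The Tokyo-then-Paris check in #1 can be sketched as a speed test between consecutive logins. This is a rough illustration, not a production detector: it assumes each login has already been resolved to coordinates (for example by a GeoIP lookup you run elsewhere), and the &lt;code&gt;Login&lt;/code&gt; type, the function names, and the 1000 km/h ceiling are all assumptions made for the example.&lt;/p&gt;

```python
import math
from dataclasses import dataclass
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # assumed ceiling, roughly airliner speed


@dataclass
class Login:
    user: str
    at: datetime
    lat: float
    lon: float


def distance_km(a: Login, b: Login) -> float:
    """Great-circle distance between two logins (haversine formula)."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_impossible_travel(prev: Login, curr: Login) -> bool:
    """Flag two logins whose implied travel speed exceeds MAX_SPEED_KMH."""
    hours = abs((curr.at - prev.at).total_seconds()) / 3600
    # Compare the distance against the farthest a traveller could plausibly get;
    # this also covers two logins that share a timestamp.
    return distance_km(prev, curr) > MAX_SPEED_KMH * hours


tokyo = Login("alice", datetime(2026, 6, 1, 9, 0), 35.68, 139.69)
paris = Login("alice", datetime(2026, 6, 1, 9, 10), 48.86, 2.35)
print(is_impossible_travel(tokyo, paris))  # prints True
```

&lt;p&gt;Expect false positives from VPNs and corporate proxies, so an alert like this should open an investigation rather than lock the account.&lt;/p&gt;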

&lt;h2&gt;
  
  
  What not to do
&lt;/h2&gt;

&lt;p&gt;Don't try to build a SIEM yourself. You'll burn out chasing alerts. Either use a managed SIEM or stick to a small set of high-signal alerts.&lt;/p&gt;

&lt;p&gt;Don't treat security alerts like reliability alerts. Security alerts often need investigation, not an immediate fix. The triage workflow is different.&lt;/p&gt;

&lt;h2&gt;
  
  
  The handoff
&lt;/h2&gt;

&lt;p&gt;If you're on the SRE side owning security monitoring, you need a clear escalation path to someone who does security full-time. Your job is detection; theirs is response.&lt;/p&gt;

&lt;p&gt;The worst pattern: SRE team gets a suspicious alert, doesn't know what to do with it, and shelves it. Weeks later, a real incident traces back to that alert. Define the handoff up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quick win
&lt;/h2&gt;

&lt;p&gt;If you do nothing else, implement #1 and #5. Failed auth spikes and login anomalies catch 70% of opportunistic attacks. The rest is shoring up for the 30%.&lt;/p&gt;
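
&lt;p&gt;For the failed-auth half of that quick win, a simple baseline comparison goes a long way. The sketch below assumes you already have per-minute failure counts for one auth surface (login endpoint, API keys, mTLS, and so on). The multiplier, the floor, and the function name are illustrative assumptions to tune against your own traffic.&lt;/p&gt;

```python
from statistics import median

SPIKE_MULTIPLIER = 5   # assumed: how far above baseline counts as a spike
MIN_FAILURES = 20      # assumed floor so quiet services do not alert on noise


def is_auth_spike(history: list[int], current: int) -> bool:
    """Compare the current per-minute failure count against the trailing median."""
    baseline = median(history) if history else 0
    threshold = max(MIN_FAILURES, SPIKE_MULTIPLIER * baseline)
    return current > threshold


recent_minutes = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]  # trailing median is 4
print(is_auth_spike(recent_minutes, 6))    # prints False
print(is_auth_spike(recent_minutes, 120))  # prints True
```

&lt;p&gt;The median is used here instead of the mean so that one earlier burst does not inflate the baseline and hide the next one.&lt;/p&gt;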

&lt;p&gt;Security monitoring is just reliability monitoring with a different threat model. Treat it the same way: start with high-signal basics, prune noise aggressively, and escalate when you're unsure.&lt;/p&gt;





</description>
      <category>sre</category>
      <category>devops</category>
      <category>security</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
