<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samson Tanimawo</title>
    <description>The latest articles on DEV Community by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://dev.to/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>DEV Community: Samson Tanimawo</title>
      <link>https://dev.to/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>Scaling On-Call When You Only Have 5 Engineers</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 23 May 2026 17:18:22 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/scaling-on-call-when-you-only-have-5-engineers-51hc</link>
      <guid>https://dev.to/samson_tanimawo/scaling-on-call-when-you-only-have-5-engineers-51hc</guid>
      <description>&lt;p&gt;On-call is brutal at small scale. Every engineer takes 1 week in 5. You get woken up once a week. Burnout is weeks away.&lt;/p&gt;

&lt;p&gt;Here's what works at 5 engineers, from someone who's been there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accept the reality
&lt;/h2&gt;

&lt;p&gt;You cannot build a 'rested, follow-the-sun, healthy' on-call rotation with 5 people. Stop trying to mimic Google. Build for small-team reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 things that help
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Aggressively reduce alerts.&lt;/strong&gt; When you have 5 engineers, you cannot afford 50 alerts/day. Cut mercilessly. Target: 1-2 pages per week per on-call. Yes, you might miss things. You'll miss more by being exhausted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Kill pager fatigue with business hours routing.&lt;/strong&gt; Non-urgent alerts go to a ticket, not a page. Only 'user-facing impact right now' alerts wake someone up. Everything else waits for morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Pay for on-call.&lt;/strong&gt; $500-$1000/week for primary. Yes, you can afford it. If you can't, your company is too small for 24/7 on-call; just accept overnight delays.&lt;/p&gt;
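
&lt;p&gt;The routing rule in point 2 can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical &lt;code&gt;Alert&lt;/code&gt; record with a &lt;code&gt;user_facing&lt;/code&gt; flag; the names are made up for the example and do not come from any particular paging tool.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record; the field names are assumptions for illustration.
    name: str
    user_facing: bool  # is there user-facing impact right now?


def route(alert: Alert) -> str:
    """Return 'page' only for live user-facing impact, otherwise 'ticket'."""
    if alert.user_facing:
        return "page"
    return "ticket"


alerts = [
    Alert("checkout 5xx rate high", user_facing=True),
    Alert("disk 70% full on batch host", user_facing=False),
]
for alert in alerts:
    print(alert.name, "->", route(alert))
```

&lt;p&gt;Keeping the decision to a single boolean is deliberate: if deciding whether an alert pages needs more than one question, it probably belongs in the ticket queue.&lt;/p&gt;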

&lt;h2&gt;
  
  
  What doesn't help
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;'Just be better at triage' (not a system fix)&lt;/li&gt;
&lt;li&gt;Bringing in contractors for on-call (they don't know your system)&lt;/li&gt;
&lt;li&gt;Unplanned time off after a rough week (too late, damage done)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The emotional side
&lt;/h2&gt;

&lt;p&gt;The hardest part of small-team on-call isn't the pages. It's the feeling that the company rests on you personally. Fight that narrative.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take real vacations. Block the week. No Slack.&lt;/li&gt;
&lt;li&gt;Rotate the 'primary' role explicitly so nobody becomes the default expert&lt;/li&gt;
&lt;li&gt;Document everything so anyone can handle anything&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The growth path
&lt;/h2&gt;

&lt;p&gt;As you hire, protect the on-call ratio. Don't add 3 engineers and immediately expand the services they're responsible for. Use growth to shrink individual load first. Then expand scope.&lt;/p&gt;

&lt;p&gt;5-engineer on-call is survivable. 7-engineer with the same scope is comfortable. Plan for the second, suffer the first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>oncall</category>
      <category>startup</category>
    </item>
    <item>
      <title>TLS Certificate Management Without Tears</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 22 May 2026 17:59:07 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/tls-certificate-management-without-tears-1i5j</link>
      <guid>https://dev.to/samson_tanimawo/tls-certificate-management-without-tears-1i5j</guid>
      <description>&lt;p&gt;Expired certificates cause more outages than they should. Every time, the post-mortem says 'we'll monitor expiry dates.' Every time, six months later, someone forgets.&lt;/p&gt;

&lt;p&gt;Here's how to actually solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two rules
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: Don't manage certs manually.&lt;/strong&gt; If a human has to remember to renew, the system is broken. Use Let's Encrypt + cert-manager (or your cloud's equivalent) and let the machines handle it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Monitor expiry as an SLI.&lt;/strong&gt; 'Days until cert expires' is a metric. Alert at 14 days and at 7 days. Actually &lt;em&gt;page&lt;/em&gt; at 3 days.&lt;/p&gt;
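
&lt;p&gt;As a rough sketch of Rule 2, the thresholds can be written as one small function. The names &lt;code&gt;days_until_expiry&lt;/code&gt; and &lt;code&gt;expiry_action&lt;/code&gt; are hypothetical, and the expiry timestamp is assumed to come from whatever inventory or scanner you already run.&lt;/p&gt;

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate's notAfter time."""
    return (not_after - now).days


def expiry_action(days_left: int) -> str:
    """Map days remaining to the thresholds: alert at 14 and 7, page at 3."""
    if days_left > 14:
        return "ok"
    if days_left > 7:
        return "alert-14d"
    if days_left > 3:
        return "alert-7d"
    return "page"


# Example values are invented for illustration.
now = datetime(2026, 5, 22, tzinfo=timezone.utc)
not_after = datetime(2026, 5, 27, tzinfo=timezone.utc)
print(expiry_action(days_until_expiry(not_after, now)))
```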

&lt;h2&gt;
  
  
  The gotchas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Certs you didn't know about.&lt;/strong&gt; Internal services with self-signed certs that someone deployed in 2019 and nobody has touched since. Scan your infrastructure. Inventory everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client certs.&lt;/strong&gt; mTLS clients can have expired certs too. These are harder to find because they're often distributed across devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party APIs.&lt;/strong&gt; You don't manage their certs, but you break when they expire without notice. Monitor outbound connections with TLS validation turned on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The renewal that silently fails.&lt;/strong&gt; Automated renewal fails because of a config change. Nobody notices, because nothing visibly changes until the old cert expires. Alert on renewal &lt;em&gt;failures&lt;/em&gt;, not just expiry dates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quarterly audit
&lt;/h2&gt;

&lt;p&gt;Once a quarter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List every domain/service that uses TLS&lt;/li&gt;
&lt;li&gt;Verify the renewal automation is working&lt;/li&gt;
&lt;li&gt;Check monitoring is actually firing (test alert on a staging cert)&lt;/li&gt;
&lt;li&gt;Delete certs that belong to services that no longer exist&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The emotional truth
&lt;/h2&gt;

&lt;p&gt;Nobody wants to work on cert management. That's why it breaks. Make it someone's explicit quarterly responsibility and reward them for boring success. You'll never have another cert outage.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>tls</category>
      <category>security</category>
    </item>
    <item>
      <title>DNS: The SRE's Most Underrated Skill</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 21 May 2026 17:57:49 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/dns-the-sres-most-underrated-skill-1anm</link>
      <guid>https://dev.to/samson_tanimawo/dns-the-sres-most-underrated-skill-1anm</guid>
      <description>&lt;p&gt;I've seen more outages caused by DNS than by code. And it's always the same story: the team shipped, something broke, and three hours into debugging someone said, 'wait, is it DNS?'&lt;/p&gt;

&lt;p&gt;It's always DNS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DNS bites SREs specifically
&lt;/h2&gt;

&lt;p&gt;DNS is invisible until it breaks. It caches at every layer (OS, resolver, app, CDN). TTLs are rarely what you expect. And it's usually owned by 'the networking team' who are actually just one guy who left the company in 2022.&lt;/p&gt;

&lt;h2&gt;
  
  
  The debugging mindset
&lt;/h2&gt;

&lt;p&gt;When something weird happens, especially 'works from my laptop, broken in prod,' check DNS before you check code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dig +short&lt;/code&gt; the hostname from the affected host&lt;/li&gt;
&lt;li&gt;Check the TTL: &lt;code&gt;dig HOSTNAME&lt;/code&gt;. Short TTL (60s)? Probably fine. Long TTL (86400)? You have a problem during rollout.&lt;/li&gt;
&lt;li&gt;Is the resolver returning stale records? Try &lt;code&gt;dig @8.8.8.8 HOSTNAME&lt;/code&gt; to bypass local cache.&lt;/li&gt;
&lt;/ul&gt;
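
&lt;p&gt;The TTL check above is easy to script. A minimal sketch, assuming you already have one answer-section line from &lt;code&gt;dig&lt;/code&gt; (name, TTL, class, type, data); the 300-second cutoff is an assumed threshold for illustration, not a rule.&lt;/p&gt;

```python
def ttl_from_answer(line: str) -> int:
    """Extract the TTL (second field) from one dig answer-section line."""
    fields = line.split()
    return int(fields[1])


def rollout_risk(ttl_seconds: int, max_safe_ttl: int = 300) -> str:
    # max_safe_ttl is an assumed cutoff; pick one that matches your rollout window.
    return "risky" if ttl_seconds > max_safe_ttl else "probably fine"


# Sample line uses a documentation hostname and address, not real data.
sample = "app.example.com.  86400  IN  A  192.0.2.10"
ttl = ttl_from_answer(sample)
print(ttl, rollout_risk(ttl))
```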

&lt;h2&gt;
  
  
  The 3 DNS setups I've seen break
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Split-horizon DNS with cached results.&lt;/strong&gt; Internal resolver returns one IP, external returns another. Your service caches the wrong one. Mysterious connection failures ensue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Short TTL during migration, long TTL in resolver.&lt;/strong&gt; You set the TTL to 60s for a cutover. Your downstream service's resolver already cached the record under its &lt;em&gt;original&lt;/em&gt; TTL, which was 86400, and keeps serving that copy until it expires. Your cutover doesn't propagate for up to a day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. DNS-based health checks with slow propagation.&lt;/strong&gt; You remove a bad host from DNS. Clients keep hitting it because of cache. Outage continues for the length of the TTL.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;Lower your TTL &lt;em&gt;before&lt;/em&gt; you need to. Not during the outage. A long TTL on production records is a loaded gun.&lt;/p&gt;

&lt;p&gt;DNS deserves respect. Learn it. Love it. Debug it first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>dns</category>
      <category>networking</category>
    </item>
    <item>
      <title>The Silent Outage: Monitoring What You Can't See</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 20 May 2026 17:02:15 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-silent-outage-monitoring-what-you-cant-see-fb7</link>
      <guid>https://dev.to/samson_tanimawo/the-silent-outage-monitoring-what-you-cant-see-fb7</guid>
      <description>&lt;p&gt;The worst kind of outage is one nobody notices. Your metrics are green. Your dashboards are fine. Your users are quietly getting a broken experience.&lt;/p&gt;

&lt;p&gt;I've been burned by three silent outages in my career. Here's how I catch them now.&lt;/p&gt;

&lt;h2&gt;
  
  
  How silent outages happen
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend caching the error.&lt;/strong&gt; Your API returned a 500. Your CDN cached it. Now all users get the cached error for 10 minutes, but your API health check passes because the CDN never re-asks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial feature breakage.&lt;/strong&gt; Login works. Checkout works. The search bar silently returns empty results. Your dashboards don't track 'zero-result searches' so you don't see anything wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stale data pipelines.&lt;/strong&gt; The data pipeline stopped running 3 hours ago. Your dashboards are showing frozen numbers but the backend looks fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to monitor
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthetic user journeys from the outside.&lt;/strong&gt; A test user clicks login, search, checkout every 5 minutes. If any step fails, alert.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data freshness, not just data availability.&lt;/strong&gt; Alert on 'last data write &amp;gt; X minutes ago,' not just 'database is up.'&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Business metrics, not just tech metrics.&lt;/strong&gt; 'Checkouts per hour' as an alert. If it drops 50% unexpectedly, something is wrong even if all your infra is green.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error budget burn rate.&lt;/strong&gt; Sudden burn rate spike = something silent is happening even if individual alerts aren't firing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
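
&lt;p&gt;Items 2 and 3 reduce to two small comparisons. The sketch below is illustrative only: the function names and sample numbers are assumptions, and the timestamps are plain epoch seconds from whatever store you query.&lt;/p&gt;

```python
def is_stale(last_write_epoch: float, now_epoch: float, max_age_seconds: float) -> bool:
    """True when the most recent write is older than the allowed age."""
    return now_epoch - last_write_epoch > max_age_seconds


def dropped(current: float, baseline: float, max_drop: float = 0.5) -> bool:
    """True when the metric has fallen by at least max_drop relative to baseline."""
    return baseline > 0 and (baseline - current) >= max_drop * baseline


# Invented sample values: last write 3 hours ago, checkouts at half the usual rate.
now_epoch = 1_000_000.0
print(is_stale(now_epoch - 3 * 3600, now_epoch, max_age_seconds=15 * 60))
print(dropped(current=100, baseline=200))
```

&lt;p&gt;Neither check looks at whether the database or the service is up, which is the point: both can fire while every infrastructure graph is green.&lt;/p&gt;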

&lt;h2&gt;
  
  
  The harder problem
&lt;/h2&gt;

&lt;p&gt;The truly silent outages are the ones where your users go quiet because they've given up on you. No complaints, just churn. You only find out weeks later from a usage graph.&lt;/p&gt;

&lt;p&gt;Business metric monitoring is the only defense against this. Treat conversion rate, daily active users, and session length as SLIs.&lt;/p&gt;

&lt;p&gt;Your real job isn't to keep the servers up. It's to keep users succeeding. Monitor that.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Why Every SRE Should Learn a Little Rust</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 19 May 2026 17:01:58 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/why-every-sre-should-learn-a-little-rust-5669</link>
      <guid>https://dev.to/samson_tanimawo/why-every-sre-should-learn-a-little-rust-5669</guid>
      <description>&lt;p&gt;I'm not saying rewrite your stack in Rust. I'm saying: learn enough to read it.&lt;/p&gt;

&lt;p&gt;Here's why, from someone who dragged their feet for years and finally gave in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rust is showing up everywhere in infra
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Vector, and Rust-based pipelines and agents that feed Go tools like the OpenTelemetry Collector, Tempo, and Loki&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxies:&lt;/strong&gt; Linkerd's data plane, parts of Istio's future, newer eBPF-based tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases:&lt;/strong&gt; TiKV, SurrealDB, and bits of Postgres extensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLIs:&lt;/strong&gt; Half the infra tools you install now are Rust binaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your stack has observability or networking components, you're going to be reading Rust code in an incident sooner or later. Better to be able to follow along.&lt;/p&gt;

&lt;h2&gt;
  
  
  You don't need to be fluent
&lt;/h2&gt;

&lt;p&gt;You need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read Rust well enough to follow a function call&lt;/li&gt;
&lt;li&gt;Understand what ownership and borrowing &lt;em&gt;mean&lt;/em&gt; (you won't debug them, but you need to read the code)&lt;/li&gt;
&lt;li&gt;Compile a small program&lt;/li&gt;
&lt;li&gt;Read a panic stack trace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's enough. That's maybe 2 weekends of work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The side benefit
&lt;/h2&gt;

&lt;p&gt;Rust teaches you to think about state, concurrency, and error handling more carefully. Those skills show up in whatever language you actually write in. I write less broken Python and Go after learning Rust, even though I rarely write Rust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;The Rust book (free online) + one small project (I did a log parser). Skip the async stuff for now. Come back to it when you need it.&lt;/p&gt;

&lt;p&gt;You're not becoming a Rust engineer. You're becoming an SRE who can read the tools your stack depends on. That's a real edge.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>rust</category>
      <category>programming</category>
    </item>
    <item>
      <title>How We Built Our Own Incident Management System</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 18 May 2026 17:09:08 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/how-we-built-our-own-incident-management-system-1i9d</link>
      <guid>https://dev.to/samson_tanimawo/how-we-built-our-own-incident-management-system-1i9d</guid>
      <description>&lt;p&gt;A couple of years ago we built our own incident management system instead of buying one. I'd do it again. Here's why, and the pieces that mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not buy?
&lt;/h2&gt;

&lt;p&gt;We looked at PagerDuty, Incident.io, FireHydrant, and a couple of others. Good tools. Each was $40-80/user/month. For 40 engineers, that's roughly $20-40k/year.&lt;/p&gt;

&lt;p&gt;The real problem: none of them fit our workflow exactly. We'd pay $30k/year and still have to work around the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;A small Slack-first tool. Total: ~3000 lines of Go. Took one engineer 3 weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/incident start [title]&lt;/code&gt; creates a channel, pings on-call, assigns a commander&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/incident update [message]&lt;/code&gt; appends to a timeline that gets used in the retro&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/incident severity [sev-1..sev-5]&lt;/code&gt; routes escalation based on severity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/incident close&lt;/code&gt; triggers post-mortem doc auto-generation from the timeline&lt;/li&gt;
&lt;li&gt;Integrations with our monitoring, Jira, and status page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No 50-feature bloat.&lt;/p&gt;
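
&lt;p&gt;To give a feel for how small the command surface is, here is a sketch of parsing the four subcommands listed above. It is written in Python for illustration and is not the Go code from our tool; &lt;code&gt;parse_incident_command&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from typing import Tuple

SUBCOMMANDS = {"start", "update", "severity", "close"}


def parse_incident_command(text: str) -> Tuple[str, str]:
    """Split '/incident SUBCOMMAND [argument]' into (subcommand, argument)."""
    parts = text.strip().split(maxsplit=2)
    if len(parts) == 0 or parts[0] != "/incident":
        raise ValueError("not an /incident command")
    if len(parts) == 1 or parts[1] not in SUBCOMMANDS:
        raise ValueError("unknown subcommand")
    # 'close' takes no argument, so a missing third part becomes an empty string.
    argument = parts[2] if len(parts) > 2 else ""
    return parts[1], argument


print(parse_incident_command("/incident start Checkout is down"))
print(parse_incident_command("/incident close"))
```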

&lt;h2&gt;
  
  
  What we skipped
&lt;/h2&gt;

&lt;p&gt;Most of the fancy features in commercial tools go unused. We skipped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom roles and permissions&lt;/li&gt;
&lt;li&gt;Auto-generated stakeholder updates (we write better ones by hand)&lt;/li&gt;
&lt;li&gt;Post-mortem templates beyond the one we chose&lt;/li&gt;
&lt;li&gt;Runbook hosting (we use our docs repo)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Would I buy instead today?
&lt;/h2&gt;

&lt;p&gt;If you're under 50 engineers, probably yes: buy. Your engineering time is more valuable than the tool cost.&lt;/p&gt;

&lt;p&gt;If you're bigger and have specific workflow needs, build. A focused in-house tool beats a feature-bloated commercial one every time.&lt;/p&gt;

&lt;p&gt;The worst option is buying a tool and then fighting it. Pick the fit, not the feature list.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>incident</category>
      <category>tools</category>
    </item>
    <item>
      <title>The Role of Platform Engineering in a Startup</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 17 May 2026 17:08:42 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-role-of-platform-engineering-in-a-startup-1me0</link>
      <guid>https://dev.to/samson_tanimawo/the-role-of-platform-engineering-in-a-startup-1me0</guid>
      <description>&lt;p&gt;Platform engineering sounds like a big-company thing. But I think every startup past 20 engineers needs a small platform function. Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem platform engineering solves
&lt;/h2&gt;

&lt;p&gt;At 5 engineers, everybody knows how to deploy. At 20, they don't. People start copy-pasting deployment configs, breaking things, and asking the same questions in Slack every day.&lt;/p&gt;

&lt;p&gt;You need &lt;em&gt;one&lt;/em&gt; person who owns 'how we ship software,' even part-time. That person is a platform engineer whether you call them that or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a startup platform function does
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Owns the deployment path.&lt;/strong&gt; One golden path from git push to production. Documented, maintained, and defended from one-off exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Owns the dev environment.&lt;/strong&gt; Laptop setup, local testing, shared services. New hire productive in days, not weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Owns the shared services.&lt;/strong&gt; Auth, logging, tracing, secrets management. Used by everyone, owned by no one until you assign it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Owns the developer experience.&lt;/strong&gt; CI speed. Local/prod parity. Error messages. The stuff that's no one's job but costs everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it doesn't do
&lt;/h2&gt;

&lt;p&gt;It doesn't build a K8s abstraction layer that rivals AWS. Startups can't afford that. Use off-the-shelf. Customize lightly.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to start
&lt;/h2&gt;

&lt;p&gt;At 15-20 engineers, if you're still asking 'why is my build failing' in Slack 3 times a week, you're ready.&lt;/p&gt;

&lt;p&gt;Before 15, just have engineers fix each other's stuff and take turns being the 'ops person.' It's not efficient, but it's cheaper than a full function.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hire
&lt;/h2&gt;

&lt;p&gt;Hire someone who loves developer experience. Not the best infrastructure engineer on the team, but the one who genuinely cares about making everyone else faster. That's a different skill set.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>platform</category>
      <category>startup</category>
    </item>
    <item>
      <title>Building Dashboards People Actually Use</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 16 May 2026 17:08:15 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/building-dashboards-people-actually-use-3fmm</link>
      <guid>https://dev.to/samson_tanimawo/building-dashboards-people-actually-use-3fmm</guid>
      <description>&lt;p&gt;I've built dozens of dashboards. Most have been ignored. A few have been used constantly. The difference isn't the graphs. It's the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-second test
&lt;/h2&gt;

&lt;p&gt;A useful dashboard answers 'is everything OK?' in 3 seconds. Not 'let me scroll through 40 graphs to find out.'&lt;/p&gt;

&lt;p&gt;Big colored header at the top: green = healthy, yellow = watching, red = broken. That's the 3-second answer. Everything else is drill-down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hierarchy rule
&lt;/h2&gt;

&lt;p&gt;Three layers, no more:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Overview:&lt;/strong&gt; one line per service, status color, key SLI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service detail:&lt;/strong&gt; one dashboard per service, 6-12 graphs max&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep dive:&lt;/strong&gt; triggered from service detail, domain-specific&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Anything beyond 3 layers is 'please get lost in my dashboard tree.'&lt;/p&gt;

&lt;h2&gt;
  
  
  The on-call test
&lt;/h2&gt;

&lt;p&gt;Imagine you're on-call at 3 AM. You get paged for 'service X is slow.' Can you, in 30 seconds, use this dashboard to tell if the problem is the service itself, its database, its upstream dependency, or its downstream consumers?&lt;/p&gt;

&lt;p&gt;If yes, the dashboard works.&lt;br&gt;
If no, redesign.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to cut
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Graphs with no baseline (flat line or spiky forever: how do you know if it's bad?)&lt;/li&gt;
&lt;li&gt;Metrics you've never used in an actual incident&lt;/li&gt;
&lt;li&gt;Vanity metrics (total requests ever)&lt;/li&gt;
&lt;li&gt;Graphs where the y-axis is in units nobody understands&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The hidden metric
&lt;/h2&gt;

&lt;p&gt;The real measure of a dashboard's value: does the on-call engineer open it before or after the paging tool?&lt;/p&gt;

&lt;p&gt;If they open it first, it's their compass.&lt;br&gt;
If they open it only after being paged, it's a reference, not a dashboard.&lt;/p&gt;

&lt;p&gt;Aim for the first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>dashboards</category>
      <category>observability</category>
    </item>
    <item>
      <title>SRE Maturity Models: Where Is Your Team?</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 15 May 2026 17:59:16 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/sre-maturity-models-where-is-your-team-1i27</link>
      <guid>https://dev.to/samson_tanimawo/sre-maturity-models-where-is-your-team-1i27</guid>
      <description>&lt;p&gt;Where is your SRE team on the maturity curve? I've worked with teams at every stage. Here's a rough map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 0: Reactive
&lt;/h2&gt;

&lt;p&gt;The site goes down, someone scrambles to fix it, the cycle repeats. No on-call rotation. No dashboards. Alerts are emails nobody reads.&lt;/p&gt;

&lt;p&gt;Characteristic phrase: 'We'll look at that after launch.'&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Foundation
&lt;/h2&gt;

&lt;p&gt;On-call rotation exists. Alerts route to a paging tool. Basic dashboards for CPU, memory, error rate. Post-mortems happen sometimes.&lt;/p&gt;

&lt;p&gt;Characteristic phrase: 'Did anyone see that spike last night?'&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Measured
&lt;/h2&gt;

&lt;p&gt;SLOs defined for critical services. Error budgets tracked. Alert volume is monitored and pruned. Post-mortems are written and reviewed.&lt;/p&gt;

&lt;p&gt;Characteristic phrase: 'We're at 80% of our error budget for the quarter.'&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Automated
&lt;/h2&gt;

&lt;p&gt;Runbooks exist for top alerts. Toil is measured and reduced. Deployment pipeline has automatic rollback. Chaos engineering is practiced.&lt;/p&gt;

&lt;p&gt;Characteristic phrase: 'The auto-rollback caught it.'&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Predictive
&lt;/h2&gt;

&lt;p&gt;Anomaly detection catches issues before alerts fire. Capacity planning is data-driven. New services have SLOs and dashboards at launch, not after. AI/ML assists incident response.&lt;/p&gt;

&lt;p&gt;Characteristic phrase: 'We caught that before customers noticed.'&lt;/p&gt;

&lt;h2&gt;
  
  
  Where most teams are
&lt;/h2&gt;

&lt;p&gt;Most teams I've worked with are at Stage 1 or Stage 2, trying to get to Stage 3. The jump from 2 to 3 is the hardest: it requires sustained investment with no immediate crisis to justify it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;Don't try to skip stages. Teams that install ML anomaly detection at Stage 0 just have prettier chaos. Get the foundation right first. Then automate. Then predict.&lt;/p&gt;

&lt;p&gt;The highest maturity team I've seen was boring. Almost nothing broke. The engineers had time to work on interesting problems. That's the goal.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>maturity</category>
      <category>strategy</category>
    </item>
    <item>
      <title>The Art of Writing a Good Post-Mortem</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 14 May 2026 17:59:00 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/the-art-of-writing-a-good-post-mortem-hhi</link>
      <guid>https://dev.to/samson_tanimawo/the-art-of-writing-a-good-post-mortem-hhi</guid>
      <description>&lt;p&gt;A good post-mortem is a piece of technical writing. It should be readable by someone who wasn't there, convey the timeline clearly, and suggest concrete changes.&lt;/p&gt;

&lt;p&gt;Most are none of these things. They're a wall of Slack screenshots.&lt;/p&gt;

&lt;p&gt;Here's how to write one that people will read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Summary (3 sentences).&lt;/strong&gt; What happened. Impact. Duration. That's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Timeline.&lt;/strong&gt; Precise times, in UTC. Not 'around 3pm.' 14:02 UTC. Every event gets one line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Impact.&lt;/strong&gt; Quantified. 'X% of checkout traffic failed between 14:02 and 14:17.' Not 'some users affected.'&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Root cause.&lt;/strong&gt; What broke and why. Not who. If the answer is 'human error,' keep going: why did the system allow human error to reach production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Action items.&lt;/strong&gt; Concrete. Owner. Due date. 'Add validation to config pipeline, @alice, 2 weeks.' Not 'be more careful.'&lt;/p&gt;

&lt;h2&gt;
  
  
  The tone
&lt;/h2&gt;

&lt;p&gt;Write it like you're explaining to a curious outsider. No inside jokes. No 'as you all know.' Future readers don't know.&lt;/p&gt;

&lt;p&gt;Keep it honest. If the cause was something embarrassing, write it down. Post-mortems that hide the ugly parts are worthless to the reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distribution
&lt;/h2&gt;

&lt;p&gt;The best post-mortems get read by people outside the team. Share them. Put them in a searchable archive. Reference them in design reviews. ('We tried this in 2024, see post-mortem 42.')&lt;/p&gt;

&lt;p&gt;Institutional memory lives in good post-mortems. Bad ones evaporate the day they're written.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>postmortem</category>
      <category>writing</category>
    </item>
    <item>
      <title>Why We Stopped Using Log Aggregation for Everything</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 13 May 2026 17:58:38 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/why-we-stopped-using-log-aggregation-for-everything-249j</link>
      <guid>https://dev.to/samson_tanimawo/why-we-stopped-using-log-aggregation-for-everything-249j</guid>
      <description>&lt;p&gt;We used to push every log line to our centralized log system. It was a mess. Here's why we stopped and what we do now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Our log volume was growing 20% month-over-month. Most of it was debug-level stuff that nobody searched for. We were paying to store logs nobody read.&lt;/p&gt;

&lt;p&gt;Worse: when we actually needed to find something, the noise made it harder. You can't grep usefully through a billion lines that are mostly heartbeats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule we adopted
&lt;/h2&gt;

&lt;p&gt;'Logs are for events humans or systems will query. Metrics are for counts. Traces are for request flow.'&lt;/p&gt;

&lt;p&gt;Applying this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DEBUG logs: local only, never shipped&lt;/li&gt;
&lt;li&gt;INFO logs: shipped but aggressively sampled (1%)&lt;/li&gt;
&lt;li&gt;WARN logs: shipped in full&lt;/li&gt;
&lt;li&gt;ERROR logs: shipped in full, tagged with a request ID&lt;/li&gt;
&lt;/ul&gt;
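
&lt;p&gt;The level rules above can be sketched as a filter on the handler that ships logs. This is a minimal illustration using Python's standard &lt;code&gt;logging&lt;/code&gt; module, not our production code; the class name &lt;code&gt;LevelPolicyFilter&lt;/code&gt; and its parameters are made up for the example, and request-ID tagging is left out.&lt;/p&gt;

```python
import logging
import random


class LevelPolicyFilter(logging.Filter):
    """Decide which records get shipped to the central log system."""

    def __init__(self, info_sample_rate=0.01, rng=None):
        super().__init__()
        self.info_sample_rate = info_sample_rate
        # An injectable random source keeps the filter testable.
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # WARN and ERROR: shipped in full
        if record.levelno >= logging.INFO:
            # INFO: keep roughly info_sample_rate of the records
            return self.info_sample_rate > self.rng.random()
        return False  # DEBUG: never shipped
```

&lt;p&gt;Attach it only to the shipping handler (for example &lt;code&gt;shipping_handler.addFilter(LevelPolicyFilter())&lt;/code&gt;, where &lt;code&gt;shipping_handler&lt;/code&gt; is whatever handler forwards to your log system) and leave the local handler unfiltered, so DEBUG output stays available on the machine.&lt;/p&gt;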

&lt;p&gt;Counts and rates moved to metrics, not logs. Request flow moved to traces, not logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Log ingest cost down 70%&lt;/li&gt;
&lt;li&gt;Search queries 4x faster (less noise)&lt;/li&gt;
&lt;li&gt;We actually find things when we need to&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The traps
&lt;/h2&gt;

&lt;p&gt;People write INFO logs for debugging, then forget to remove them. A linter that flags high-volume log calls helped us catch this before it got to prod.&lt;/p&gt;
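
&lt;p&gt;A check like that does not have to be elaborate. As a rough sketch, assuming a Python codebase and treating "high-volume" as "an &lt;code&gt;.info()&lt;/code&gt; call inside a loop" (both are assumptions for illustration, and &lt;code&gt;find_info_logs_in_loops&lt;/code&gt; is a hypothetical name), the standard &lt;code&gt;ast&lt;/code&gt; module is enough:&lt;/p&gt;

```python
import ast

LOOP_NODES = (ast.For, ast.While, ast.AsyncFor)


def find_info_logs_in_loops(source):
    """Return line numbers of `.info(...)` calls nested inside a loop."""
    flagged = set()
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, LOOP_NODES):
            continue
        # Walk everything under the loop and look for attribute calls
        # named "info", e.g. log.info(...) or logger.info(...).
        for node in ast.walk(loop):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "info"
            ):
                flagged.add(node.lineno)
    return sorted(flagged)
```

&lt;p&gt;It will produce false positives (any method called &lt;code&gt;info&lt;/code&gt; matches), which is acceptable for a review-time warning but not for a hard CI failure.&lt;/p&gt;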

&lt;p&gt;Sampled logs can be confusing. 'Why did user X's request not show up?' Answer: it was sampled out. Make sure your sampling rules are transparent so engineers don't assume missing logs mean missing requests.&lt;/p&gt;
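
&lt;p&gt;One way to make sampling transparent is to make it deterministic: decide from the request ID rather than a random draw, so anyone can re-run the decision later. The sketch below is an illustration of that idea, not a description of our pipeline; &lt;code&gt;is_sampled_in&lt;/code&gt; is a hypothetical helper and it assumes every log line carries a request ID.&lt;/p&gt;

```python
import hashlib


def is_sampled_in(request_id, sample_rate=0.01):
    """Return True if logs for this request ID are kept.

    The decision depends only on the ID, so all lines of one request
    are kept or dropped together, and the answer can be recomputed.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto the range [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return sample_rate > bucket
```

&lt;p&gt;When someone asks why user X's request is missing, calling &lt;code&gt;is_sampled_in&lt;/code&gt; with that request's ID tells them whether the logs were sampled out or never emitted.&lt;/p&gt;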

&lt;p&gt;Logs are one observability tool. Not the only one. Stop making them do everything.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>logging</category>
      <category>observability</category>
    </item>
    <item>
      <title>Running Postgres at Scale: Lessons Learned</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 13 May 2026 13:58:36 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/running-postgres-at-scale-lessons-learned-30eo</link>
      <guid>https://dev.to/samson_tanimawo/running-postgres-at-scale-lessons-learned-30eo</guid>
      <description>&lt;p&gt;We run Postgres for a product with millions of users. Along the way I've broken it in every possible way. Here are the lessons I wish I'd known on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Autovacuum is not optional
&lt;/h2&gt;

&lt;p&gt;You can ignore autovacuum for a while. You cannot ignore it forever. Dead tuples accumulate. Query plans go bad. Eventually a query that used to take 10ms takes 3 seconds and nobody knows why.&lt;/p&gt;

&lt;p&gt;Tune autovacuum earlier than you think. Setting &lt;code&gt;autovacuum_vacuum_scale_factor = 0.05&lt;/code&gt; per table on big tables, down from the stock 0.2, is a good default.&lt;/p&gt;
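
&lt;p&gt;To see why the scale factor matters on big tables, it helps to work the numbers. Autovacuum vacuums a table once its dead rows exceed &lt;code&gt;autovacuum_vacuum_threshold&lt;/code&gt; plus the scale factor times the table's row count. The sketch below just evaluates that formula; the 100-million-row table is a made-up example, and 50 is the stock value of the base threshold.&lt;/p&gt;

```python
def autovacuum_trigger(row_count, scale_factor, base_threshold=50):
    """Dead rows needed before autovacuum vacuums a table.

    Mirrors the documented rule:
    base threshold + scale factor * number of rows.
    """
    return base_threshold + scale_factor * row_count


rows = 100_000_000  # hypothetical big table
print(autovacuum_trigger(rows, 0.2))   # stock scale factor
print(autovacuum_trigger(rows, 0.05))  # the value suggested above
```

&lt;p&gt;On that table the stock setting waits for about 20 million dead rows before vacuuming; 0.05 brings it down to about 5 million.&lt;/p&gt;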

&lt;h2&gt;
  
  
  Connection pooling is not optional
&lt;/h2&gt;

&lt;p&gt;Postgres connections are expensive. Each one gets its own backend process and holds memory. You will run out.&lt;/p&gt;

&lt;p&gt;Use PgBouncer or equivalent. Set pool size conservatively. Your app might want 500 connections; Postgres can happily serve them through 50 real ones if you pool properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long-running transactions are silent killers
&lt;/h2&gt;

&lt;p&gt;A transaction that's been open for 2 hours prevents vacuum from removing any row versions that became dead after it started. Your table bloats. Your queries slow down. You blame the database.&lt;/p&gt;

&lt;p&gt;Alert on &lt;code&gt;pg_stat_activity.xact_start &amp;lt; now() - interval '10 minutes'&lt;/code&gt;. Hunt and kill long transactions before they bite you.&lt;/p&gt;
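
&lt;p&gt;The alert condition is simple enough to sketch. The function below assumes you have already fetched &lt;code&gt;(pid, xact_start)&lt;/code&gt; pairs from &lt;code&gt;pg_stat_activity&lt;/code&gt; with whatever client you use; &lt;code&gt;long_running&lt;/code&gt; is a hypothetical helper name, and the 10-minute limit is the one from the alert above.&lt;/p&gt;

```python
from datetime import timedelta


def long_running(transactions, now, limit=timedelta(minutes=10)):
    """Return pids whose transaction has been open longer than `limit`.

    `transactions` is an iterable of (pid, xact_start) pairs. Sessions
    with no open transaction have xact_start set to None and are skipped.
    """
    return [
        pid
        for pid, xact_start in transactions
        if xact_start is not None and now - xact_start > limit
    ]
```

&lt;p&gt;Pass timezone-aware datetimes for both &lt;code&gt;now&lt;/code&gt; and &lt;code&gt;xact_start&lt;/code&gt;, and review the returned pids before terminating anything; some long transactions are legitimate batch jobs.&lt;/p&gt;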

&lt;h2&gt;
  
  
  The query planner is not magic
&lt;/h2&gt;

&lt;p&gt;It's a cost estimator. It can be wrong. When you see a query doing a sequential scan that should use an index, the planner chose sequential because its estimate said it was cheaper. Sometimes the estimate is wrong.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;ANALYZE&lt;/code&gt; regularly, increase &lt;code&gt;default_statistics_target&lt;/code&gt; for large tables, and don't be afraid to use &lt;code&gt;SET enable_seqscan = off&lt;/code&gt; in a single session as a debug tool, to see what the planner thinks the index plan costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backups you haven't restored are not backups
&lt;/h2&gt;

&lt;p&gt;Practice the restore. Monthly. On real data volume. The first time I tried to restore our 800GB production backup, it took 11 hours. That's a useful thing to know &lt;em&gt;before&lt;/em&gt; the outage.&lt;/p&gt;
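
&lt;p&gt;Once you have one measured restore, you can turn it into a rough planning number. The sketch below scales the 800GB / 11 hour measurement to other sizes; it assumes restore time grows linearly with size, which is only an approximation, and &lt;code&gt;estimated_restore_hours&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def estimated_restore_hours(size_gb, measured_gb=800, measured_hours=11):
    """Scale a measured restore time to another backup size.

    Assumes restore time grows linearly with size. Treat the result
    as a planning estimate, not a guarantee.
    """
    return size_gb * measured_hours / measured_gb


print(estimated_restore_hours(800))   # the measured case
print(estimated_restore_hours(1200))  # after the database grows by half
```

&lt;p&gt;If the estimate is already longer than the outage you can tolerate, that is the signal to change the backup strategy, and the monthly drill keeps the measurement current.&lt;/p&gt;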

&lt;p&gt;Postgres is incredibly forgiving. But only to people who respect it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>postgres</category>
      <category>database</category>
    </item>
  </channel>
</rss>
