<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samson Tanimawo</title>
    <description>The latest articles on DEV Community by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://dev.to/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>DEV Community: Samson Tanimawo</title>
      <link>https://dev.to/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>Incident Severity Levels: SEV-1 to SEV-5 Calibration</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 03 May 2026 14:27:49 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</link>
      <guid>https://dev.to/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</guid>
      <description>&lt;h2&gt;
  
  
  Why Severity Is Broken at Most Companies
&lt;/h2&gt;

&lt;p&gt;Everyone has severity levels. Almost nobody agrees on what they mean.&lt;/p&gt;

&lt;p&gt;Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under-paged incidents (people thought SEV-3 meant "no rush")&lt;/li&gt;
&lt;li&gt;Over-paged incidents (everything is SEV-1)&lt;/li&gt;
&lt;li&gt;Exhausted on-call (false alarms)&lt;/li&gt;
&lt;li&gt;Missed SLOs (incidents not escalated in time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calibration matters. Here's a definition that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Levels
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SEV-1: Critical&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is completely down for all users&lt;/li&gt;
&lt;li&gt;Active data loss&lt;/li&gt;
&lt;li&gt;Security breach in progress&lt;/li&gt;
&lt;li&gt;Core business stopped (can't process payments, can't log in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 5 minutes&lt;br&gt;
Escalation: Immediate, all hands&lt;br&gt;
Post-mortem: Required, public within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-2: High&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is degraded for most users&lt;/li&gt;
&lt;li&gt;Core feature unavailable for a subset&lt;/li&gt;
&lt;li&gt;Significant customer impact but workaround exists&lt;/li&gt;
&lt;li&gt;Performance significantly degraded (&amp;gt;5x normal latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 15 minutes&lt;br&gt;
Escalation: Page primary on-call, notify secondary&lt;br&gt;
Post-mortem: Required, internal within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-3: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-critical feature broken&lt;/li&gt;
&lt;li&gt;Affects a small percentage of users&lt;/li&gt;
&lt;li&gt;Degraded performance within tolerance&lt;/li&gt;
&lt;li&gt;Bug in new feature rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 hour&lt;br&gt;
Escalation: Page during business hours, ticket overnight&lt;br&gt;
Post-mortem: Recommended&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-4: Low&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minor bug with workaround&lt;/li&gt;
&lt;li&gt;Internal tooling broken&lt;/li&gt;
&lt;li&gt;Non-customer-facing issue&lt;/li&gt;
&lt;li&gt;Cosmetic problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 business day&lt;br&gt;
Escalation: Ticket only&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-5: Informational&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not actually broken&lt;/li&gt;
&lt;li&gt;Preemptive warning&lt;/li&gt;
&lt;li&gt;"This might become a problem"&lt;/li&gt;
&lt;li&gt;Observed anomaly without impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: Backlog&lt;br&gt;
Escalation: None&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;
&lt;h2&gt;
  
  
  The Calibration Problem
&lt;/h2&gt;

&lt;p&gt;Levels written on paper are useless. What matters is &lt;strong&gt;consistent application&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Run this exercise: take your last 50 incidents and ask three SRE leads to independently assign severity levels. Compare the results.&lt;/p&gt;

&lt;p&gt;If the leads disagree by at least one level on more than 20% of incidents, your definitions aren't calibrated. Run training.&lt;/p&gt;
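
&lt;p&gt;Here's a minimal sketch of that check in Python, assuming you've exported the independent ratings as lists of levels per incident (the input format is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Flag incidents where the three leads disagree by one level or more.
ratings = {
    "INC-101": [1, 1, 2],   # severity assigned by each of three leads
    "INC-102": [3, 3, 3],
    "INC-103": [2, 4, 3],
}

disagreements = {
    inc: levels for inc, levels in ratings.items()
    if max(levels) - min(levels) &amp;gt;= 1
}

rate = len(disagreements) / len(ratings)
print(f"Disagreement rate: {rate:.0%}")
if rate &amp;gt; 0.20:
    print("Definitions aren't calibrated -- run training.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;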
&lt;h2&gt;
  
  
  The "When In Doubt" Rules
&lt;/h2&gt;

&lt;p&gt;When severity is ambiguous, default to &lt;strong&gt;higher severity&lt;/strong&gt; and downgrade if wrong.&lt;/p&gt;

&lt;p&gt;Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.&lt;/p&gt;

&lt;p&gt;Specific rules (a code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User data loss&lt;/strong&gt; → always SEV-1 or SEV-2, never lower&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security issue&lt;/strong&gt; → always SEV-1 or SEV-2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue impact&lt;/strong&gt; → SEV-2 minimum if measurable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertain scope&lt;/strong&gt; → start at higher severity, downgrade when scope is clear&lt;/li&gt;
&lt;/ul&gt;
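
&lt;p&gt;A sketch of those floors as code, assuming boolean flags collected at triage time (the field names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def severity_floor(data_loss=False, security=False,
                   revenue_impact=False, scope_unclear=False):
    """Highest SEV number allowed: open at this level or more severe."""
    if data_loss or security:
        return 2   # always SEV-1 or SEV-2, never lower
    if revenue_impact:
        return 2   # SEV-2 minimum if measurable
    if scope_unclear:
        return 2   # start high, downgrade once scope is clear
    return 5       # no floor; use the impact matrix below

print(severity_floor(revenue_impact=True))  # 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;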
&lt;h2&gt;
  
  
  Customer Impact Matrix
&lt;/h2&gt;

&lt;p&gt;For fast calibration, use a matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| &amp;lt;1% users | 1-10% users | 10-50% | &amp;gt;50% users
Product Down | SEV-2 | SEV-1 | SEV-1 | SEV-1
Major Degraded | SEV-3 | SEV-2 | SEV-2 | SEV-1
Minor Degraded | SEV-4 | SEV-3 | SEV-2 | SEV-2
Workaround Exists| SEV-4 | SEV-4 | SEV-3 | SEV-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a fast severity assignment without relying on intuition.&lt;/p&gt;
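
&lt;p&gt;The same matrix as a lookup function, a sketch you could drop into triage tooling (all names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rows: impact class; columns: &amp;lt;1%, 1-10%, 10-50%, &amp;gt;50% of users.
MATRIX = {
    "product_down":      [2, 1, 1, 1],
    "major_degraded":    [3, 2, 2, 1],
    "minor_degraded":    [4, 3, 2, 2],
    "workaround_exists": [4, 4, 3, 2],
}

def assign_severity(impact: str, pct_users: float) -&amp;gt; int:
    bounds = [1, 10, 50, 100]  # upper bound of each column (inclusive)
    col = next(i for i, b in enumerate(bounds) if pct_users &amp;lt;= b)
    return MATRIX[impact][col]

print(f"SEV-{assign_severity('major_degraded', 7.5)}")  # SEV-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;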

&lt;h2&gt;
  
  
  Time-Based Escalation
&lt;/h2&gt;

&lt;p&gt;Severity isn't fixed for the incident lifetime. It escalates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sev_2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;auto_escalate_to_sev_1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_not_resolved_in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60_minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_user_impact_grows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;above_10_percent&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_revenue_loss_exceeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$10000/hour&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.&lt;/p&gt;
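
&lt;p&gt;A sketch of that policy as a check an incident bot could run every minute; the thresholds are copied from the YAML above, everything else is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

def should_escalate_sev2(opened_at, user_impact_pct, revenue_loss_per_hour):
    """Escalate SEV-2 to SEV-1 if any trigger fires."""
    age = datetime.now(timezone.utc) - opened_at
    return (
        age &amp;gt; timedelta(minutes=60)
        or user_impact_pct &amp;gt; 10
        or revenue_loss_per_hour &amp;gt; 10_000
    )

opened = datetime.now(timezone.utc) - timedelta(minutes=75)
print(should_escalate_sev2(opened, user_impact_pct=4,
                           revenue_loss_per_hour=0))
# True -- unresolved for more than 60 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;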

&lt;h2&gt;
  
  
  The Downgrade Rule
&lt;/h2&gt;

&lt;p&gt;Downgrading is allowed &lt;strong&gt;but must be justified in writing&lt;/strong&gt; in the incident channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents silent downgrades that understate severity for retro analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Integration
&lt;/h2&gt;

&lt;p&gt;Your SLOs and severity levels should align:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO: 99.95% availability (21.6 min/month budget)

If this month's error budget burned:
&amp;lt;25% → normal operations
25-50% → no SEV-3 burn-down deploys
50-75% → SEV-2 threshold lowered
&amp;gt;75% → any degradation is SEV-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're running low on error budget, everything gets more severe.&lt;/p&gt;
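
&lt;p&gt;As a sketch, the same policy in code (the budget math assumes a 30-day month):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def severity_posture(budget_burned: float) -&amp;gt; str:
    """Map this month's error-budget burn (0.0-1.0) to a posture."""
    if budget_burned &amp;lt; 0.25:
        return "normal operations"
    if budget_burned &amp;lt; 0.50:
        return "no SEV-3 burn-down deploys"
    if budget_burned &amp;lt; 0.75:
        return "SEV-2 threshold lowered"
    return "any degradation is SEV-1"

# 99.95% over 30 days gives a 21.6-minute budget;
# 13 minutes of downtime so far burns ~60% of it:
print(severity_posture(13 / 21.6))  # SEV-2 threshold lowered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;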

&lt;h2&gt;
  
  
  Practical Incident Categories
&lt;/h2&gt;

&lt;p&gt;Beyond numeric severity, label incidents by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;INCIDENT_TYPES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure (AWS, networking)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application (code bug)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment (bad release)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;capacity (scaling failure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data (corruption, loss)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;security (breach, exposure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;external (3rd-party dependency)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Severity tells you how urgent. Type tells you who to page.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monthly Review
&lt;/h2&gt;

&lt;p&gt;Once a month, review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All SEV-1s and SEV-2s&lt;/li&gt;
&lt;li&gt;Any SEV-3 that should have been SEV-2&lt;/li&gt;
&lt;li&gt;Any SEV-2 that should have been SEV-3&lt;/li&gt;
&lt;li&gt;Average time from incident open to correct severity assignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adjust the definitions based on what you learn. Severity is a living standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pet severity&lt;/strong&gt;: every team invents its own. Standardize company-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEV-0&lt;/strong&gt;: don't add levels above SEV-1. Just use "SEV-1, all hands."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity inflation&lt;/strong&gt;: if every incident is SEV-2, nobody takes SEV-2 seriously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity deflation&lt;/strong&gt;: pressure to avoid post-mortems leads to fake SEV-4s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unchanging severity&lt;/strong&gt;: escalation is a tool; use it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;Severity should mean the same thing to every person in the org. Engineers, PMs, execs, customer support.&lt;/p&gt;

&lt;p&gt;When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.&lt;/p&gt;

&lt;p&gt;When you achieve that, incident response gets dramatically better.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>incidents</category>
      <category>sre</category>
      <category>oncall</category>
      <category>process</category>
    </item>
    <item>
      <title>Memory Leak Detection in Long-Running Services</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 02 May 2026 14:27:34 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</link>
      <guid>https://dev.to/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</guid>
      <description>&lt;h2&gt;
  
  
  The Slowest Incident to Diagnose
&lt;/h2&gt;

&lt;p&gt;Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.&lt;/p&gt;

&lt;p&gt;And when you look at the first 30 minutes of metrics, everything looks normal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Flavors of Memory Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. True leaks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects allocated but never freed&lt;/li&gt;
&lt;li&gt;Classic in C/C++, rare in Go/Java with GC&lt;/li&gt;
&lt;li&gt;Grows linearly forever until OOM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Unbounded caches&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache adds entries but never evicts&lt;/li&gt;
&lt;li&gt;Common in Node.js, Python, Go&lt;/li&gt;
&lt;li&gt;Grows until memory pressure triggers other issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Memory fragmentation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heap is large but not usable&lt;/li&gt;
&lt;li&gt;Happens in long-running Java, Go, and .NET services&lt;/li&gt;
&lt;li&gt;Not really a "leak" but behaves like one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three cause the same symptom: memory grows over time. Treatment is different for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection Without Heap Dumps
&lt;/h2&gt;

&lt;p&gt;Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) &amp;gt; 0

# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) &amp;gt; 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Diagnosis complete.&lt;/p&gt;

&lt;p&gt;The question is &lt;strong&gt;where&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Go makes this relatively easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"net/http/pprof"&lt;/span&gt;

&lt;span class="c"&gt;// In main():&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":6060"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get a heap profile&lt;/span&gt;
go tool pprof http://localhost:6060/debug/pprof/heap

&lt;span class="c"&gt;# In the pprof shell:&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; top
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; list suspiciousFunction
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; web &lt;span class="c"&gt;# generates a SVG callgraph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects with high &lt;code&gt;inuse_space&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Objects with growing counts over time&lt;/li&gt;
&lt;li&gt;Unexpected large maps or slices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key trick&lt;/strong&gt;: take two heap profiles 1 hour apart and diff them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go tool pprof &lt;span class="nt"&gt;-base&lt;/span&gt; heap1.pprof heap2.pprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What shows up as "new" allocations in the diff is almost certainly your leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  Java Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Java is harder because the JVM adds layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump the heap&lt;/span&gt;
jmap &lt;span class="nt"&gt;-dump&lt;/span&gt;:format&lt;span class="o"&gt;=&lt;/span&gt;b,file&lt;span class="o"&gt;=&lt;/span&gt;heap.hprof &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# Analyze with Eclipse MAT or JVisualVM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In MAT, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leak Suspects report (automatic)&lt;/li&gt;
&lt;li&gt;Dominator tree (what's holding the most memory)&lt;/li&gt;
&lt;li&gt;GC roots path (what's preventing garbage collection)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Java culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static collections (especially &lt;code&gt;static Map&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;ThreadLocal values without cleanup&lt;/li&gt;
&lt;li&gt;Listeners/callbacks registered but never unregistered&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;finalize()&lt;/code&gt; methods delaying collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Node.js Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable the inspector&lt;/span&gt;
&lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nx"&gt;inspect&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;

&lt;span class="c1"&gt;// Then in Chrome DevTools → Memory → Heap Snapshot&lt;/span&gt;
&lt;span class="c1"&gt;// Take 3 snapshots: baseline, after 10 min, after 20 min&lt;/span&gt;
&lt;span class="c1"&gt;// Compare to find retained objects&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Node culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event emitter listeners that accumulate&lt;/li&gt;
&lt;li&gt;Closures holding references to large objects&lt;/li&gt;
&lt;li&gt;Unbounded caches (remember, Node has no built-in LRU)&lt;/li&gt;
&lt;li&gt;Stream buffers not being drained&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;
&lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#... run the leaky operation...
&lt;/span&gt;
&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;top_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lineno&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_stats&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;code&gt;memory_profiler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory_profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="nd"&gt;@profile&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;suspect_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="c1"&gt;# code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Python culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global lists/dicts growing unbounded&lt;/li&gt;
&lt;li&gt;Reference cycles with &lt;code&gt;__del__&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;C extensions leaking (hardest to find)&lt;/li&gt;
&lt;li&gt;Pandas DataFrames kept around too long&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Cache Leak Special Case
&lt;/h2&gt;

&lt;p&gt;The most common "leak" isn't a leak at all. It's a cache without eviction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD: unbounded
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD: bounded LRU
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always bound your caches. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fragmentation in Go
&lt;/h2&gt;

&lt;p&gt;Go's garbage collector can leave the heap fragmented. You see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime memory is high&lt;/li&gt;
&lt;li&gt;Heap profile shows low allocations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime.GC()&lt;/code&gt; doesn't reduce usage much&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution: tune &lt;code&gt;GOGC&lt;/code&gt; or force memory release:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"runtime/debug"&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetGCPercent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// More aggressive GC&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FreeOSMemory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// Return memory to OS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Long-Running Service Pattern
&lt;/h2&gt;

&lt;p&gt;Services that run for weeks without restart accumulate cruft. Even without leaks.&lt;/p&gt;

&lt;p&gt;We use this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deployment_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;max_uptime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7d&lt;/span&gt;
&lt;span class="na"&gt;restart_schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.&lt;/p&gt;

&lt;p&gt;This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic Checklist
&lt;/h2&gt;

&lt;p&gt;When a service is suspected of leaking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is memory growing linearly or logarithmically? (linear = real leak; a sketch follows this list)&lt;/li&gt;
&lt;li&gt;Is GC frequency/duration increasing? (yes = real pressure)&lt;/li&gt;
&lt;li&gt;Are request rates growing proportionally? (yes = normal growth, not leak)&lt;/li&gt;
&lt;li&gt;Take heap profile, save baseline&lt;/li&gt;
&lt;li&gt;Wait 1 hour, take second profile, diff&lt;/li&gt;
&lt;li&gt;Look for unexpected high-count objects&lt;/li&gt;
&lt;li&gt;Trace back to allocation site&lt;/li&gt;
&lt;li&gt;Fix the leak, deploy, watch metrics for 24h&lt;/li&gt;
&lt;/ol&gt;
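
&lt;p&gt;For step 1, a rough stdlib-only sketch: fit the samples against time and against log-time, and see which correlates better (the sample data here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from statistics import correlation  # Python 3.10+

# (hours since start, RSS in MB) pulled from your metrics store
hours = [1, 2, 4, 8, 16, 24]
rss_mb = [510, 540, 600, 720, 960, 1200]

lin = correlation(hours, rss_mb)
log = correlation([math.log(h) for h in hours], rss_mb)
print(f"linear r={lin:.3f}, log r={log:.3f}")
if lin &amp;gt; log:
    print("Growth looks linear -- treat it as a real leak.")
else:
    print("Growth is flattening -- more likely a cache filling up.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;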

&lt;p&gt;Rinse and repeat. Memory leaks are annoying but systematically fixable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>memory</category>
      <category>sre</category>
      <category>performance</category>
    </item>
    <item>
      <title>CI/CD Reliability: When Your Deploy Pipeline is Your SPOF</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 01 May 2026 14:23:02 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</link>
      <guid>https://dev.to/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</guid>
      <description>&lt;h2&gt;
  
  
  The Invisible SPOF
&lt;/h2&gt;

&lt;p&gt;Every engineering org has a single point of failure that nobody lists on their risk registry: &lt;strong&gt;the deploy pipeline itself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.&lt;/p&gt;

&lt;p&gt;We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Categorizing the Risk
&lt;/h2&gt;

&lt;p&gt;Your pipeline consists of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source_control&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub, GitLab, Bitbucket&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PRs"&lt;/span&gt;

&lt;span class="na"&gt;ci_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub Actions, CircleCI, self-hosted&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;builds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run"&lt;/span&gt;

&lt;span class="na"&gt;artifact_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ECR, Artifactory, S3&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;push"&lt;/span&gt;

&lt;span class="na"&gt;deployment_controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ArgoCD, Flux, Spinnaker&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploys&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;happen"&lt;/span&gt;

&lt;span class="na"&gt;cluster_api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# k8s API, cloud provider API&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;change"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is a failure domain. A serious pipeline needs fallback plans for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manual Escape Hatch
&lt;/h2&gt;

&lt;p&gt;Rule #1: &lt;strong&gt;You must have a documented path to deploy manually.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not for daily use; for emergencies. Every team should know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to build the image locally&lt;/li&gt;
&lt;li&gt;How to push to the registry&lt;/li&gt;
&lt;li&gt;How to update the cluster without the normal pipeline&lt;/li&gt;
&lt;li&gt;Who has permission to do this in production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.&lt;/p&gt;
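
&lt;p&gt;A sketch of what that manual path can look like, assuming Docker, a registry you can push to, and kubectl access; every name here is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run(cmd: str):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def manual_deploy(service: str, tag: str):
    image = f"registry.example.com/{service}:{tag}"
    run(f"docker build -t {image} .")                # 1. build locally
    run(f"docker push {image}")                      # 2. push to registry
    run(f"kubectl set image deployment/{service} "   # 3. bypass the pipeline
        f"{service}={image}")
    run(f"kubectl rollout status deployment/{service}")

# manual_deploy("api", "hotfix-2026-05-01")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;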

&lt;h2&gt;
  
  
  Hardening the Pipeline Itself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pin your dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@main&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;actions/checkout@main&lt;/code&gt; breaks, your deploys break. Pin to versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Mirror your registries&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/yourorg&lt;/span&gt;
&lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ecr.amazonaws.com/yourorg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Monitor the pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You probably monitor your services. Do you monitor your CI?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pipeline_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;build_success_rate (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;99%)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deploy_duration_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;time_to_rollback_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;2 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;runner_queue_depth (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on these the same way you'd alert on a service.&lt;/p&gt;
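
&lt;p&gt;A sketch of computing two of those metrics from CI run records; the record shape is an assumption, so pull the real fields from your CI provider's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;runs = [
    {"status": "success", "deploy_seconds": 210},
    {"status": "success", "deploy_seconds": 180},
    {"status": "failure", "deploy_seconds": 0},
    {"status": "success", "deploy_seconds": 460},
]

success_rate = sum(r["status"] == "success" for r in runs) / len(runs)
durations = sorted(r["deploy_seconds"] for r in runs
                   if r["status"] == "success")
p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]

print(f"build_success_rate: {success_rate:.1%}")  # alert if &amp;lt;99%
print(f"deploy_duration_p99: {p99}s")             # alert if &amp;gt;300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;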

&lt;p&gt;&lt;strong&gt;4. Test disaster modes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you ship if GitHub Actions is down?&lt;br&gt;
Can you ship if the main registry is unreachable?&lt;br&gt;
Can you ship if ArgoCD is down?&lt;/p&gt;

&lt;p&gt;If the answer is "no", you have undocumented SPOFs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Rollback Rule
&lt;/h2&gt;

&lt;p&gt;Every deploy must be reversible in under 2 minutes. No exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;time_to_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15 minutes&lt;/span&gt;
&lt;span class="na"&gt;time_to_rollback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your rollback takes longer than your deploy, your pipeline is backwards.&lt;/p&gt;

&lt;p&gt;How to achieve fast rollbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep the previous image running in parallel during deploys&lt;/li&gt;
&lt;li&gt;Use traffic-shifting deploys (ALB weights, Istio)&lt;/li&gt;
&lt;li&gt;Label every image with the git commit&lt;/li&gt;
&lt;li&gt;Never rely on an untested rollback path&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Deploy Freeze
&lt;/h2&gt;

&lt;p&gt;Some teams never deploy on Fridays. This is cargo culting.&lt;/p&gt;

&lt;p&gt;The real rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't deploy when the on-call person is asleep&lt;/li&gt;
&lt;li&gt;Don't deploy during peak traffic windows&lt;/li&gt;
&lt;li&gt;Don't deploy major changes during holidays&lt;/li&gt;
&lt;li&gt;DO deploy hotfixes anytime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.&lt;/p&gt;

&lt;p&gt;A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Provider Strategy
&lt;/h2&gt;

&lt;p&gt;Big-ticket item: run critical workloads on CI from a different vendor than your code host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code: GitHub
CI: CircleCI (not GitHub Actions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When GitHub Actions is down (it happens twice a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.&lt;/p&gt;

&lt;p&gt;This doubles your CI bill but removes a major SPOF.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Break Glass" Deploy
&lt;/h2&gt;

&lt;p&gt;Every pipeline should have an emergency bypass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal deploy (takes 15 minutes, runs all tests)&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Break-glass deploy (skips tests, full audit log, Slack alert)&lt;/span&gt;
./deploy.sh &lt;span class="nt"&gt;--break-glass&lt;/span&gt; &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"Fixing P1 incident #1234"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The break-glass path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires written justification&lt;/li&gt;
&lt;li&gt;Skips long-running tests&lt;/li&gt;
&lt;li&gt;Notifies the whole team&lt;/li&gt;
&lt;li&gt;Writes to a permanent audit log&lt;/li&gt;
&lt;li&gt;Can only be used while an incident is in progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metric That Matters Most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean Time to Deploy a Hotfix (MTTDHF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From "we need to fix this" to "fix is in production" how long?&lt;/p&gt;

&lt;p&gt;Good: under 30 minutes&lt;br&gt;
Great: under 10 minutes&lt;br&gt;
Unicorn: under 5 minutes&lt;/p&gt;

&lt;p&gt;Track this. Optimize it. It's the most important reliability metric nobody talks about.&lt;/p&gt;
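
&lt;p&gt;A sketch of tracking it, assuming you log two timestamps per hotfix: when the fix was decided on, and when it reached production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%dT%H:%M"
hotfixes = [
    ("2026-04-02T10:05", "2026-04-02T10:27"),  # decided -&amp;gt; in production
    ("2026-04-19T14:40", "2026-04-19T14:49"),
]

def minutes(start: str, end: str) -&amp;gt; float:
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

mttdhf = mean(minutes(s, e) for s, e in hotfixes)
print(f"MTTDHF: {mttdhf:.1f} minutes")  # good &amp;lt;30, great &amp;lt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;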

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Your pipeline is production infrastructure. Treat it with the same respect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor it&lt;/li&gt;
&lt;li&gt;Back it up&lt;/li&gt;
&lt;li&gt;Test failure modes&lt;/li&gt;
&lt;li&gt;Document manual paths&lt;/li&gt;
&lt;li&gt;Never let it become a SPOF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it breaks during an incident, you'll be very glad you did.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>Multi-Region Failover: Lessons from Running It Hot</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:47:08 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</link>
      <guid>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</guid>
      <description>&lt;h2&gt;
  
  
  Why "Hot" Matters
&lt;/h2&gt;

&lt;p&gt;Three multi-region strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold&lt;/strong&gt;: Backup region is off. You start it when primary fails. RTO: hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm&lt;/strong&gt;: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot&lt;/strong&gt;: Both regions serve live traffic simultaneously. RTO: seconds.&lt;/p&gt;

&lt;p&gt;If you need under 15 minutes RTO, you need hot. Everything else is marketing copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Warm Failover
&lt;/h2&gt;

&lt;p&gt;Warm sounds great on paper. In practice, on the day you need it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warm region has never seen real load&lt;/li&gt;
&lt;li&gt;DNS cache propagation takes 5-15 minutes&lt;/li&gt;
&lt;li&gt;Autoscaling lags because it's starting cold&lt;/li&gt;
&lt;li&gt;Your team has never run on the warm region&lt;/li&gt;
&lt;li&gt;Half your connection strings are hardcoded to the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Hot: The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│ DNS / CDN │
└──────┬──────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │
│ 50% TX │ │ 50% TX │
└────┬─────┘ └─────┬────┘
│ │
└──────┬──────┬────────┘
▼ ▼
┌─────────────┐
│ Shared DB │
│ (writer + │
│ replicas) │
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions always serve traffic. Split is usually 50/50 but can shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Database replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where multi-region gets hard. Three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single writer, multi-region readers&lt;/strong&gt;: simplest, but writes pay cross-region latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-master&lt;/strong&gt;: truly hot, but complex; requires conflict resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-sharded&lt;/strong&gt;: users are pinned to a region for writes; simplest if your data model allows it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use region-sharded for user-scoped data and single-writer for global config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Session stickiness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user's session is in Region A, and their next request goes to Region B, things break.&lt;/p&gt;

&lt;p&gt;Solutions (a pinning sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens with no server state&lt;/li&gt;
&lt;li&gt;Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)&lt;/li&gt;
&lt;li&gt;Cookie routing that pins a user to a region&lt;/li&gt;
&lt;/ul&gt;
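
&lt;p&gt;For the cookie-routing option, a sketch of deterministic pinning: hash the user ID so every edge node picks the same region without shared state (the region names are examples, and failover logic must be able to override the pin):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

REGIONS = ["us-east-1", "eu-west-1"]

def pin_region(user_id: str) -&amp;gt; str:
    digest = hashlib.sha256(user_id.encode()).digest()
    return REGIONS[digest[0] % len(REGIONS)]

print(pin_region("user-4217"))  # same region every time for this user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;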

&lt;p&gt;&lt;strong&gt;3. Cache coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Region A's cache doesn't know when Region B updates the database. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short TTLs (1-5 minutes) and accept the inconsistency&lt;/li&gt;
&lt;li&gt;Pub/sub cache invalidation across regions (complex)&lt;/li&gt;
&lt;li&gt;Read-through caches only, never write-through&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Failover Mechanics
&lt;/h2&gt;

&lt;p&gt;When Region A dies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks detect failure&lt;/strong&gt;: Route 53/ALB removes Region A from DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifts to Region B&lt;/strong&gt;: already warm, already running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling kicks in&lt;/strong&gt;: Region B doubles capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User sessions degrade gracefully&lt;/strong&gt;: re-authentication, cache warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring reports the shift&lt;/strong&gt;: the team gets paged, not customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing It Monthly
&lt;/h2&gt;

&lt;p&gt;If you don't test failover monthly, you don't have failover. You have hope.&lt;/p&gt;

&lt;p&gt;We do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Tuesday of every month, 10 AM&lt;/li&gt;
&lt;li&gt;Route 100% of traffic to Region B for 30 minutes&lt;/li&gt;
&lt;li&gt;Watch dashboards, fix anything that degrades&lt;/li&gt;
&lt;li&gt;Route back to 50/50&lt;/li&gt;
&lt;li&gt;Document any issues, fix them&lt;/li&gt;
&lt;li&gt;Repeat next month with the other region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.&lt;/p&gt;

&lt;p&gt;The question is: what's your revenue per hour of downtime?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.&lt;/p&gt;
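
&lt;p&gt;The break-even check is one line of arithmetic; a sketch with made-up inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def hot_is_worth_it(revenue_per_hour: float,
                    downtime_hours_avoided_per_year: float,
                    extra_compute_per_month: float) -&amp;gt; bool:
    """Hot wins when the downtime cost it avoids beats the extra spend."""
    avoided = revenue_per_hour * downtime_hours_avoided_per_year
    return avoided &amp;gt; extra_compute_per_month * 12

print(hot_is_worth_it(100_000, 8, 50_000))  # Company A: True
print(hot_is_worth_it(1_000, 8, 50_000))    # Company B: False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;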

&lt;h2&gt;
  
  
  The Operational Complexity Tax
&lt;/h2&gt;

&lt;p&gt;Running hot costs more than money. It costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More runbooks&lt;/strong&gt; (one per region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More monitoring&lt;/strong&gt; (cross-region latency, replication lag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder debugging&lt;/strong&gt; ("which region was this request in?")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compliance surface&lt;/strong&gt; (data residency in each region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More deployment pipelines&lt;/strong&gt; (usually)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget 20% more engineering time for multi-region from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure in DNS config&lt;/strong&gt;: your DNS provider becomes the new SPOF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing only with healthy traffic&lt;/strong&gt;: test with 2x normal load during drills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting about databases&lt;/strong&gt;: DB failover is the hardest part&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using regions as backup, not active&lt;/strong&gt;: never tested until a crisis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning for split-brain&lt;/strong&gt;: what if both regions think they're primary?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Hot Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Two regions, stateless app tier, 50/50 traffic&lt;/li&gt;
&lt;li&gt;Database: multi-AZ primary, cross-region async replica&lt;/li&gt;
&lt;li&gt;CDN/DNS: health-check-based routing&lt;/li&gt;
&lt;li&gt;Session: JWT-based (stateless)&lt;/li&gt;
&lt;li&gt;Monthly failover drills&lt;/li&gt;
&lt;li&gt;Runbooks tested in last 90 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start there. Layer in complexity as you need it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>multiregion</category>
      <category>failover</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Multi-Region Failover: Lessons from Running It Hot</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:08:47 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-1h8g</link>
      <guid>https://dev.to/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-1h8g</guid>
      <description>&lt;h2&gt;
  
  
  Why "Hot" Matters
&lt;/h2&gt;

&lt;p&gt;Three multi-region strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold&lt;/strong&gt;: Backup region is off. You start it when primary fails. RTO: hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm&lt;/strong&gt;: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot&lt;/strong&gt;: Both regions serve live traffic simultaneously. RTO: seconds.&lt;/p&gt;

&lt;p&gt;If you need under 15 minutes RTO, you need hot. Everything else is marketing copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Warm Failover
&lt;/h2&gt;

&lt;p&gt;Warm sounds great on paper. In practice, on the day you need it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warm region has never seen real load&lt;/li&gt;
&lt;li&gt;DNS cache propagation takes 5-15 minutes&lt;/li&gt;
&lt;li&gt;Autoscaling lags because it's starting cold&lt;/li&gt;
&lt;li&gt;Your team has never run on the warm region&lt;/li&gt;
&lt;li&gt;Half your connection strings are hardcoded to the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Hot: The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│ DNS / CDN │
└──────┬──────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │
│ 50% TX │ │ 50% TX │
└────┬─────┘ └─────┬────┘
│ │
└──────┬──────┬────────┘
▼ ▼
┌─────────────┐
│ Shared DB │
│ (writer + │
│ replicas) │
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions always serve traffic. Split is usually 50/50 but can shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Database replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where multi-region gets hard. Three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single writer, multi-region readers&lt;/strong&gt;: simplest, but writes pay cross-region latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-master&lt;/strong&gt;: complex, but truly hot requires conflict resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-sharded&lt;/strong&gt;: users pinned to a region for writes, simplest if your data model allows it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use region-sharded for user-scoped data and single-writer for global config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Session stickiness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user's session is in Region A, and their next request goes to Region B, things break.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens with no server state&lt;/li&gt;
&lt;li&gt;Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)&lt;/li&gt;
&lt;li&gt;Cookie routing that pins a user to a region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Cache coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Region A's cache doesn't know when Region B updates the database. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short TTLs (1-5 minutes) and accept the inconsistency&lt;/li&gt;
&lt;li&gt;Pub/sub cache invalidation across regions (complex)&lt;/li&gt;
&lt;li&gt;Read-through caches only, never write-through&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Failover Mechanics
&lt;/h2&gt;

&lt;p&gt;When Region A dies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks detect failure&lt;/strong&gt; route53/ALB removes Region A from DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifts to Region B&lt;/strong&gt; already warm, already running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling kicks in&lt;/strong&gt; Region B doubles capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User sessions degrade gracefully&lt;/strong&gt; re-authentication, cache warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring reports the shift&lt;/strong&gt; team gets paged, not customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing It Monthly
&lt;/h2&gt;

&lt;p&gt;If you don't test failover monthly, you don't have failover. You have hope.&lt;/p&gt;

&lt;p&gt;We do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Tuesday of every month, 10 AM&lt;/li&gt;
&lt;li&gt;Route100% of traffic to Region B for 30 minutes&lt;/li&gt;
&lt;li&gt;Watch dashboards, fix anything that degrades&lt;/li&gt;
&lt;li&gt;Route back to 50/50&lt;/li&gt;
&lt;li&gt;Document any issues, fix them&lt;/li&gt;
&lt;li&gt;Repeat next month with the other region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.&lt;/p&gt;

&lt;p&gt;The question is: what's your revenue per hour of downtime?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Operational Complexity Tax
&lt;/h2&gt;

&lt;p&gt;Running hot costs more than money. It costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More runbooks&lt;/strong&gt; (one per region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More monitoring&lt;/strong&gt; (cross-region latency, replication lag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder debugging&lt;/strong&gt; ("which region was this request in?")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compliance surface&lt;/strong&gt; (data residency, each region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More deployment pipelines&lt;/strong&gt; (usually)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget 20% more engineering time for multi-region from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure in DNS config&lt;/strong&gt; your DNS provider becomes the new SPOF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing only with healthy traffic&lt;/strong&gt; test with 2x normal load during drills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting about databases&lt;/strong&gt; DB failover is the hardest part&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using regions as backup, not active&lt;/strong&gt; never tested until crisis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning for split-brain&lt;/strong&gt; what if both regions think they're primary?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Hot Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Two regions, stateless app tier, 50/50 traffic&lt;/li&gt;
&lt;li&gt;Database: multi-AZ primary, cross-region async replica&lt;/li&gt;
&lt;li&gt;CDN/DNS: health-check-based routing&lt;/li&gt;
&lt;li&gt;Session: JWT-based (stateless)&lt;/li&gt;
&lt;li&gt;Monthly failover drills&lt;/li&gt;
&lt;li&gt;Runbooks tested in last 90 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start there. Layer in complexity as you need it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>multiregion</category>
      <category>failover</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Disaster Recovery Drills That Actually Work</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:45:56 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</link>
      <guid>https://dev.to/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</guid>
      <description>&lt;h2&gt;
  
  
  Most DR Drills Are Theater
&lt;/h2&gt;

&lt;p&gt;Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.&lt;/p&gt;

&lt;p&gt;Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.&lt;/p&gt;

&lt;p&gt;Real DR drills test whether your team can actually recover, not whether they can talk about recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Levels of DR Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Tabletop&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through a scenario on paper&lt;/li&gt;
&lt;li&gt;Identify missing runbooks&lt;/li&gt;
&lt;li&gt;Find ownership gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: New team members, initial gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Doesn't prove anything actually works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Partial Failure Test&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail one component in staging&lt;/li&gt;
&lt;li&gt;Watch recovery happen with real tools&lt;/li&gt;
&lt;li&gt;Time the full recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Validating specific runbooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Staging ≠ production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Full Production Drill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail a real production component&lt;/li&gt;
&lt;li&gt;Customer-facing (announce a maintenance window if needed)&lt;/li&gt;
&lt;li&gt;Full team responds as if it's real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Proving you can recover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Scary, high-coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real DR Drill Scenario
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Primary database becomes unreachable&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (48 hours before)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedule window with product team&lt;/li&gt;
&lt;li&gt;Pick a time when customer impact is minimal&lt;/li&gt;
&lt;li&gt;Brief the team: "Something will fail tomorrow, respond as normal"&lt;/li&gt;
&lt;li&gt;Pre-position the incident commander&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At T+0, block network access to primary database via iptables&lt;/li&gt;
&lt;li&gt;Start stopwatch&lt;/li&gt;
&lt;li&gt;Watch the team respond&lt;/li&gt;
&lt;li&gt;Do NOT intervene or give hints&lt;/li&gt;
&lt;li&gt;Document every action, every decision, every delay&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Metrics to capture&lt;/strong&gt; (a timing sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to detection (first alert fires)&lt;/li&gt;
&lt;li&gt;Time to engagement (first human acknowledges)&lt;/li&gt;
&lt;li&gt;Time to diagnosis ("we know what's wrong")&lt;/li&gt;
&lt;li&gt;Time to mitigation (customer impact stops)&lt;/li&gt;
&lt;li&gt;Time to recovery (fully restored)&lt;/li&gt;
&lt;/ul&gt;
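
&lt;p&gt;A minimal sketch for capturing those five timestamps during the drill and turning them into the metrics above; the field names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from datetime import datetime

@dataclass
class DrillClock:
    fault_injected: datetime
    first_alert: datetime
    first_ack: datetime
    diagnosed: datetime
    impact_stopped: datetime
    fully_restored: datetime

    def report(self):
        # Everything is measured from the moment the fault was injected.
        def minutes(end):
            return round((end - self.fault_injected).total_seconds() / 60, 1)
        return {
            "time_to_detection_min": minutes(self.first_alert),
            "time_to_engagement_min": minutes(self.first_ack),
            "time_to_diagnosis_min": minutes(self.diagnosed),
            "time_to_mitigation_min": minutes(self.impact_stopped),
            "time_to_recovery_min": minutes(self.fully_restored),
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
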

&lt;h2&gt;
  
  
  Scoring the Drill
&lt;/h2&gt;

&lt;p&gt;Good scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection: under 2 minutes&lt;/li&gt;
&lt;li&gt;Engagement: under 5 minutes&lt;/li&gt;
&lt;li&gt;Diagnosis: under 15 minutes&lt;/li&gt;
&lt;li&gt;Mitigation: under 30 minutes (for DR scenarios)&lt;/li&gt;
&lt;li&gt;Recovery: depends on scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are 5x longer than target, you have a real problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Always Goes Wrong
&lt;/h2&gt;

&lt;p&gt;In every DR drill we run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runbook is out of date.&lt;/strong&gt; The one that worked 6 months ago has wrong commands now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials don't work.&lt;/strong&gt; The service account was rotated and nobody updated the runbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup is untested.&lt;/strong&gt; The restore fails because the backup is corrupted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation paths are stale.&lt;/strong&gt; The "DBA on-call" has left the company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies are missing.&lt;/strong&gt; The recovery playbook assumes Service X is up, but Service X depends on the failed component.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every drill uncovers 3-5 of these. Fix them, then drill again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Engineering vs. DR Drills
&lt;/h2&gt;

&lt;p&gt;These are different. Chaos engineering is continuous (daily or weekly) and usually automated. DR drills are scheduled, human-led, and large-scale.&lt;/p&gt;

&lt;p&gt;Chaos engineering answers: "Can we survive small failures routinely?"&lt;/p&gt;

&lt;p&gt;DR drills answer: "Can we survive catastrophic failures at all?"&lt;/p&gt;

&lt;p&gt;You need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blame-Free Rule
&lt;/h2&gt;

&lt;p&gt;DR drills expose weaknesses. Those weaknesses are process problems, not people problems.&lt;/p&gt;

&lt;p&gt;The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No firing based on drill performance&lt;/li&gt;
&lt;li&gt;No promotions based on "being the hero"&lt;/li&gt;
&lt;li&gt;Focus on process gaps, not individual failures&lt;/li&gt;
&lt;li&gt;The post-drill retrospective is 90% about fixing systems, 10% about training people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the team is afraid of the drill, you'll never learn anything real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequency That Actually Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also: after every major infrastructure change, drill the affected components.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Lesson
&lt;/h2&gt;

&lt;p&gt;The drill is the easy part. The hard part is &lt;strong&gt;making the fixes from the drill a priority&lt;/strong&gt; when there's feature pressure.&lt;/p&gt;

&lt;p&gt;We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you've never done a DR drill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one scenario (database failure, region outage, API gateway down)&lt;/li&gt;
&lt;li&gt;Schedule it for a quiet hour&lt;/li&gt;
&lt;li&gt;Run a tabletop first to find the obvious gaps&lt;/li&gt;
&lt;li&gt;Fix those gaps&lt;/li&gt;
&lt;li&gt;Run a partial failure test in staging&lt;/li&gt;
&lt;li&gt;Measure everything&lt;/li&gt;
&lt;li&gt;Run a retro focused on process&lt;/li&gt;
&lt;li&gt;Schedule the next one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dr</category>
      <category>reliability</category>
      <category>sre</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Feature Flags as a Reliability Tool, Not Just an A/B Platform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:11:47 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</link>
      <guid>https://dev.to/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</guid>
      <description>&lt;h2&gt;
  
  
  Most Teams Use Feature Flags Wrong
&lt;/h2&gt;

&lt;p&gt;They wire up LaunchDarkly or Unleash, use it for two A/B tests, then forget about it.&lt;/p&gt;

&lt;p&gt;Meanwhile, their production is full of &lt;code&gt;if (isNewCheckoutEnabled)&lt;/code&gt; blocks that nobody remembers how to toggle.&lt;/p&gt;

&lt;p&gt;Feature flags are not primarily an experimentation tool. They're a &lt;strong&gt;reliability tool&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Value
&lt;/h2&gt;

&lt;p&gt;Feature flags let you &lt;strong&gt;separate deploy from release&lt;/strong&gt;. You ship code to production dark, then turn it on gradually for real users.&lt;/p&gt;

&lt;p&gt;When things break, you flip the switch back in 10 seconds. No rollback, no redeploy, no PR reverts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Reliability Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Kill Switches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every risky new feature ships behind a kill switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;featureFlags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_payment_flow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;newPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;legacyPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the new flow has a bug, you don't rollback. You flip the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gradual Rollouts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;new_search_algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;rollout_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Start at 1% of users&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.tier&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'internal'"&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# Internal users always see it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy to 1%, watch metrics, go to 5%, watch, 25%, 50%, 100%. Takes 2-4 hours per rollout instead of a single risky deploy.&lt;/p&gt;
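
&lt;p&gt;A sketch of that staged loop, assuming a hypothetical flag client and metrics client; &lt;code&gt;set_rollout_percentage&lt;/code&gt; and &lt;code&gt;error_rate_pct&lt;/code&gt; are illustrative names, not a specific SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

STAGES = [1, 5, 25, 50, 100]   # percent of users
SOAK_MINUTES = 30              # watch metrics between stages

def gradual_rollout(flags, metrics, flag_name, baseline_error_pct):
    for pct in STAGES:
        flags.set_rollout_percentage(flag_name, pct)
        time.sleep(SOAK_MINUTES * 60)
        # Abort and mitigate instantly if errors double against baseline.
        if metrics.error_rate_pct(flag_name) &amp;gt; 2 * baseline_error_pct:
            flags.set_rollout_percentage(flag_name, 0)
            raise RuntimeError(f"{flag_name} rolled back at {pct}% rollout")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The instant set-to-zero on a bad stage is what keeps each step cheap to undo.&lt;/p&gt;
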

&lt;p&gt;&lt;strong&gt;3. Circuit Breakers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;external_recommendations_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;automatic_disable_if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;error_rate_above&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5%&lt;/span&gt;
&lt;span class="na"&gt;for_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a downstream service starts failing, the flag auto-disables that feature. Your product degrades gracefully instead of crashing.&lt;/p&gt;
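
&lt;p&gt;A sketch of the check that could sit behind a config like the one above, run every minute by a scheduler; the flags and metrics clients here are assumptions, not a specific product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR_RATE_THRESHOLD_PCT = 5
WINDOW_MINUTES = 5

def circuit_breaker_tick(flags, metrics):
    """Trips the flag when errors stay high for the whole window."""
    rate = metrics.error_rate_pct("external_recommendations_service",
                                  window_minutes=WINDOW_MINUTES)
    if rate &amp;gt; ERROR_RATE_THRESHOLD_PCT and flags.is_enabled("external_recommendations_service"):
        flags.disable("external_recommendations_service")
        # Leave a breadcrumb so responders know why the feature vanished.
        flags.annotate("external_recommendations_service",
                       f"auto-disabled: {rate:.1f}% errors over {WINDOW_MINUTES}m")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
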

&lt;p&gt;&lt;strong&gt;4. Load Shedding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;expensive_realtime_dashboard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;cpu_utilization_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;70%&lt;/span&gt;
&lt;span class="na"&gt;active_users_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under load, disable non-critical features to preserve the critical path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern: Permanent Flags
&lt;/h2&gt;

&lt;p&gt;After a feature is 100% rolled out, the flag should be deleted within 2 weeks. Every flag left in the codebase is technical debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag hygiene rules:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Every flag has an expiration date (90 days max)
- Every flag has an owner in CODEOWNERS
- CI fails if a flag is older than 180 days
- Monthly flag cleanup is part of standard operations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We track "flag count" as a reliability metric. If it grows unbounded, we're doing it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;A solid feature flag system has three parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Definition store&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source of truth for all flags&lt;/li&gt;
&lt;li&gt;Versioned in Git or a managed service (LaunchDarkly, Unleash, GrowthBook)&lt;/li&gt;
&lt;li&gt;Audit log for every change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Client SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-app flag evaluation&lt;/li&gt;
&lt;li&gt;Falls back to defaults if the service is unreachable&lt;/li&gt;
&lt;li&gt;Caches decisions for 60 seconds&lt;/li&gt;
&lt;li&gt;Emits telemetry for flag usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Admin interface&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change flags without deploying code&lt;/li&gt;
&lt;li&gt;See current state across environments&lt;/li&gt;
&lt;li&gt;Role-based access (not everyone can flip prod flags)&lt;/li&gt;
&lt;li&gt;Approval workflow for high-risk flags&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evaluating at the Right Layer
&lt;/h2&gt;

&lt;p&gt;Flags can live at multiple layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CDN edge use for marketing experiments
Load balancer use for blue/green deploys
App server use for feature experiments
Database use for schema migrations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The closer to the edge, the faster the flip: CDN flags propagate in seconds, while database flags can take minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliability Metric
&lt;/h2&gt;

&lt;p&gt;Track: &lt;strong&gt;mean time to mitigate (MTTM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your team can mitigate an incident in under 30 seconds via a feature flag flip, that's a win. If you have to redeploy to mitigate, your reliability is bottlenecked by deploy time.&lt;/p&gt;

&lt;p&gt;Good teams: MTTM under 60 seconds&lt;br&gt;
Great teams: MTTM under 15 seconds&lt;/p&gt;
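
&lt;p&gt;MTTM itself is a plain average over incident records; a minimal sketch, with &lt;code&gt;detected_at&lt;/code&gt; and &lt;code&gt;mitigated_at&lt;/code&gt; as illustrative field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def mean_time_to_mitigate(incidents):
    """Average seconds from detection to mitigation across incidents."""
    deltas = [(i["mitigated_at"] - i["detected_at"]).total_seconds()
              for i in incidents]
    return sum(deltas) / len(deltas)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
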

&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stale flags skew A/B results:&lt;/strong&gt; clean them up after experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flags without defaults cause prod outages:&lt;/strong&gt; every flag must have a safe fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flag flips mid-request cause weird bugs:&lt;/strong&gt; evaluate at request start, cache for the request lifetime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested flags (flags inside flags) are impossible to reason about:&lt;/strong&gt; avoid them&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A Reliability-First Flag Strategy
&lt;/h2&gt;

&lt;p&gt;Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every new feature ships behind a kill switch&lt;/li&gt;
&lt;li&gt;Gradual rollouts for anything touching the critical path&lt;/li&gt;
&lt;li&gt;Circuit breakers for external dependencies&lt;/li&gt;
&lt;li&gt;Flag cleanup is a monthly ritual&lt;/li&gt;
&lt;li&gt;Track MTTM and optimize it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature flags are the most underrated reliability tool in modern engineering. Treat them that way.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>featureflags</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>eBPF for SREs: Observability Without Agents</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:11:20 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</link>
      <guid>https://dev.to/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</guid>
      <description>&lt;h2&gt;
  
  
  The Agent Problem
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring means shipping an agent with every service. That agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds memory overhead&lt;/li&gt;
&lt;li&gt;Needs to be updated&lt;/li&gt;
&lt;li&gt;Gets out of date&lt;/li&gt;
&lt;li&gt;Breaks with kernel upgrades&lt;/li&gt;
&lt;li&gt;Needs instrumentation code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;eBPF says: &lt;strong&gt;what if the kernel itself could emit observability data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What eBPF Actually Is
&lt;/h2&gt;

&lt;p&gt;eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without recompiling or loading modules. It was originally for packet filtering. Now it powers Cilium, Pixie, Falco, and dozens of other tools.&lt;/p&gt;

&lt;p&gt;From an SRE perspective: &lt;strong&gt;you get deep visibility into syscalls, network traffic, process behavior, and filesystem operations with zero code changes to your applications&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Observe
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;every TCP connection (src, dst, bytes, duration)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DNS queries and response times&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TLS handshake failures&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HTTP request/response cycles&lt;/span&gt;

&lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;function call latencies (uprobes)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory allocations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;lock contention&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GC pauses&lt;/span&gt;

&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;syscall audit trails&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;privilege escalations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;suspicious file access&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container escape attempts&lt;/span&gt;

&lt;span class="na"&gt;performance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU scheduling delays&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;I/O wait time per process&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;disk latency histograms&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;page fault patterns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this &lt;strong&gt;without modifying your application code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Example: Detecting Slow HTTP Requests
&lt;/h2&gt;

&lt;p&gt;Traditional approach: instrument your HTTP framework with OpenTelemetry, deploy a collector, ship traces.&lt;/p&gt;

&lt;p&gt;eBPF approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install bpftrace&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;bpftrace

&lt;span class="c"&gt;# Trace every HTTP response larger than 1MB&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bpftrace &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'
uprobe:/usr/lib/libssl.so:SSL_write {
@http_writes[pid] = count();
@http_bytes[comm] = sum(arg2);
}
'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes. No restarts. Real-time visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pixie&lt;/strong&gt; (now part of New Relic)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-instruments every service in your K8s cluster&lt;/li&gt;
&lt;li&gt;No code changes, no sidecars&lt;/li&gt;
&lt;li&gt;Full HTTP, MySQL, Postgres, DNS tracing&lt;/li&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Cilium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network observability + security policy enforcement&lt;/li&gt;
&lt;li&gt;Replaces kube-proxy&lt;/li&gt;
&lt;li&gt;Hubble UI for service-to-service traffic visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Falco&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime security detection&lt;/li&gt;
&lt;li&gt;"Alert if a process inside a container spawns a shell"&lt;/li&gt;
&lt;li&gt;Rules are written in YAML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Parca&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous profiling via eBPF&lt;/li&gt;
&lt;li&gt;See CPU flame graphs across your entire fleet&lt;/li&gt;
&lt;li&gt;Identify the most expensive code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Tracee&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security-focused eBPF tracing&lt;/li&gt;
&lt;li&gt;Detects privilege escalations, cryptojacking, suspicious syscalls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero app code changes&lt;/li&gt;
&lt;li&gt;Near-zero overhead (kernel-level efficiency)&lt;/li&gt;
&lt;li&gt;Unified view across languages (Go, Python, Java, Rust, all seen the same way)&lt;/li&gt;
&lt;li&gt;No agent lifecycle to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Linux 4.14+ (5.0+ preferred)&lt;/li&gt;
&lt;li&gt;Steep learning curve for custom probes&lt;/li&gt;
&lt;li&gt;Limited visibility into in-process logic (you see syscalls, not business logic)&lt;/li&gt;
&lt;li&gt;eBPF verifier rejects programs for subtle reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network debugging&lt;/strong&gt;: "Why is service A slow to reach service B?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security auditing&lt;/strong&gt;: "What containers are making unexpected syscalls?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance profiling&lt;/strong&gt;: "Where is the cluster CPU time actually going?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident forensics&lt;/strong&gt;: "Reconstruct the syscall timeline during the outage"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Is Wrong
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business logic observability:&lt;/strong&gt; you still need OpenTelemetry for spans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application errors:&lt;/strong&gt; your logs and exception tracking still matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region correlation:&lt;/strong&gt; eBPF is node-local&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use eBPF for infrastructure and network. Use OpenTelemetry for application logic. They complement each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Pixie in a dev cluster (1-line install)&lt;/li&gt;
&lt;li&gt;Open the UI, watch real-time HTTP traffic&lt;/li&gt;
&lt;li&gt;Try a bpftrace one-liner to trace a specific syscall&lt;/li&gt;
&lt;li&gt;Read the Cilium + Hubble docs&lt;/li&gt;
&lt;li&gt;Replace one agent-based tool with its eBPF equivalent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The future of observability is kernel-native. Agent-based tools will still exist, but the gap will keep shrinking.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>observability</category>
      <category>linux</category>
      <category>kernel</category>
    </item>
    <item>
      <title>Observability as Code: Managing Dashboards and Alerts with Terraform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:11:06 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</link>
      <guid>https://dev.to/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Click-Ops Dashboards
&lt;/h2&gt;

&lt;p&gt;Your team has 200 dashboards. You don't know who owns them. Half are broken. The rest show yesterday's reality.&lt;/p&gt;

&lt;p&gt;This is click-ops debt, and it compounds faster than code debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability as Code
&lt;/h2&gt;

&lt;p&gt;Every dashboard, alert, and SLO definition should live in a Git repository alongside your service code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"datadog_dashboard"&lt;/span&gt; &lt;span class="s2"&gt;"api_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"API Gateway - Golden Signals"&lt;/span&gt;
&lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Owner: @platform-team"&lt;/span&gt;
&lt;span class="nx"&gt;layout_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ordered"&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Request Rate (per second)"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sum:api.requests{service:gateway}.as_rate()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"P99 Latency"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"max:api.latency{service:gateway}.as_count()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lives next to &lt;code&gt;main.tf&lt;/code&gt; for your service. When you deploy the service, you deploy the observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits That Compound
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ownership is clear.&lt;/strong&gt; The file has a CODEOWNERS entry. PRs require review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dashboards auto-update.&lt;/strong&gt; Renaming a service? Terraform refactor propagates to all dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Drift detection.&lt;/strong&gt; Someone clicked "save as" in the UI and now that dashboard is out of sync. &lt;code&gt;terraform plan&lt;/code&gt; catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Review before production.&lt;/strong&gt; Alert changes go through PR review. No more "who set this threshold?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Tooling by Platform
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataDog/datadog&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog_monitor, datadog_dashboard, datadog_slo&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana_dashboard, grafana_alert_rule&lt;/span&gt;

&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;approach&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YAML files in Git, deployed by ArgoCD&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert rules, recording rules&lt;/span&gt;

&lt;span class="na"&gt;new_relic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic/newrelic&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic_alert_policy, newrelic_dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick one source of truth. Don't mix.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;We have a module that takes a service name and generates a complete observability stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"service_observability"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/observability"&lt;/span&gt;

&lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payment-processor"&lt;/span&gt;
&lt;span class="nx"&gt;team_slack&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"#payments"&lt;/span&gt;
&lt;span class="nx"&gt;severity_map&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;error_rate_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="nx"&gt;p99_latency_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="nx"&gt;saturation_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;slo_targets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9995&lt;/span&gt;
&lt;span class="nx"&gt;latency_p99&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One module call creates: 3 dashboards, 8 alerts, 2 SLOs, a Slack channel binding, and a PagerDuty escalation policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part
&lt;/h2&gt;

&lt;p&gt;The code is easy. The hard part is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Migrating existing click-ops dashboards:&lt;/strong&gt; budget 2 weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting engineers to edit YAML/HCL instead of the UI:&lt;/strong&gt; budget 3 months of reminders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking UI edits:&lt;/strong&gt; some tools let you set dashboards to read-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewing alert changes:&lt;/strong&gt; PR reviewers need context&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern to Avoid
&lt;/h2&gt;

&lt;p&gt;Don't write Terraform for every custom chart an engineer wants. That leads to 500-line dashboard modules nobody understands.&lt;/p&gt;

&lt;p&gt;Instead, define &lt;strong&gt;standard dashboards&lt;/strong&gt; (golden signals, RED/USE, SLO burn rate) as modules. Let engineers add their own custom dashboards in the UI if they want, but mark them as "explore-only" (not alert-worthy).&lt;/p&gt;

&lt;p&gt;Core observability = code. Experimental exploration = UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Strategy
&lt;/h2&gt;

&lt;p&gt;Week 1: Pick 1 service, convert its dashboards to Terraform&lt;br&gt;
Week 2: Add alerts + SLOs to Terraform&lt;br&gt;
Week 3: Delete the UI versions&lt;br&gt;
Week 4: Create a module from the patterns&lt;br&gt;
Month 2: Roll out to 10 more services&lt;br&gt;
Month 3: Require all new services to use the module&lt;/p&gt;

&lt;p&gt;Six months in, your click-ops debt is gone and your observability is reproducible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>observability</category>
      <category>devops</category>
      <category>iac</category>
    </item>
    <item>
      <title>Service Level Objectives for Complex Microservices</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:11:24 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/service-level-objectives-for-complex-microservices-3j14</link>
      <guid>https://dev.to/samson_tanimawo/service-level-objectives-for-complex-microservices-3j14</guid>
      <description>&lt;h2&gt;
  
  
  Why SLOs Break in Microservices
&lt;/h2&gt;

&lt;p&gt;An SLO that works for a monolith often collapses when you distribute the same logic across 30 services. The math of availability is unforgiving.&lt;/p&gt;

&lt;p&gt;If your service depends on 5 others, each at 99.9%, your realistic ceiling is &lt;code&gt;0.999^5 ≈ 99.5%&lt;/code&gt;. That 0.4-point gap eats your entire error budget before your own code even runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Mistakes Teams Make
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Copying the same SLO to every service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 99.9% target means different things on a payment service and on a batch analytics service. Missing one ruins revenue. Missing the other ruins dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Measuring uptime instead of user experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /health&lt;/code&gt; returning 200 is not an SLO. Users don't call &lt;code&gt;/health&lt;/code&gt;. They check out, log in, view pages. Measure those.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Ignoring fan-out&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user request fans out to 8 downstream calls, and one of them has a 99% SLO, your user-facing reliability is capped at 99% no matter how good your code is.&lt;/p&gt;
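
&lt;p&gt;To make the cap concrete, here is the arithmetic for a request fanning out to seven dependencies at 99.9% and one at 99%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;downstream_slos = [0.999] * 7 + [0.99]   # seven strong deps, one weak one

ceiling = 1.0
for slo in downstream_slos:
    ceiling *= slo

print(f"user-facing ceiling: {ceiling:.4%}")   # ~98.31%, below even the weakest dep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
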

&lt;h2&gt;
  
  
  A Practical SLO Framework for Microservices
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;user_journey_slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;critical_path_checkout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.95%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successful_checkouts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;total_checkouts"&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;21.6 minutes / month&lt;/span&gt;

&lt;span class="na"&gt;user_login&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.9%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;43.2 minutes / month&lt;/span&gt;

&lt;span class="na"&gt;background_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99.0%&lt;/span&gt;
&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
&lt;span class="na"&gt;error_budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7.2 hours / month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice: we define SLOs on &lt;strong&gt;user journeys&lt;/strong&gt;, not services. This is the biggest mental shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composition Rules
&lt;/h2&gt;

&lt;p&gt;When Service A depends on B and C:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A's SLO must account for B + C availability&lt;/li&gt;
&lt;li&gt;If B is 99.9% and C is 99.5%, A's realistic SLO is ~99.4%&lt;/li&gt;
&lt;li&gt;Build a dependency graph and calculate compound availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use a simple rule: &lt;strong&gt;no service promises an SLO tighter than the weakest service it depends on&lt;/strong&gt;.&lt;/p&gt;
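
&lt;p&gt;A sketch of checking that rule mechanically, with the dependency graph as a plain dict and illustrative numbers; it assumes a tree of dependencies (shared ones would need de-duplication):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud"],
    "inventory": [],
    "fraud": [],
}
OWN_AVAILABILITY = {
    "checkout": 0.9999,
    "payments": 0.999,
    "inventory": 0.999,
    "fraud": 0.995,
}

def compound_availability(service):
    """Own availability times that of every transitive dependency."""
    total = OWN_AVAILABILITY[service]
    for dep in DEPENDS_ON[service]:
        total *= compound_availability(dep)
    return total

print(f"checkout ceiling: {compound_availability('checkout'):.4%}")  # ~99.29%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
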

&lt;h2&gt;
  
  
  The Implementation Pattern
&lt;/h2&gt;

&lt;p&gt;Every service exports three Prometheus metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;slo_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;slo_budget_remaining&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;slo_burn_rate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;journey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these three, you can compute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current SLO compliance&lt;/li&gt;
&lt;li&gt;Budget remaining (in minutes)&lt;/li&gt;
&lt;li&gt;Burn rate (how fast you're consuming budget)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alert on burn rate, not on individual requests. A 2% error rate for 30 seconds is a blip. A sustained 2% error rate over 10 minutes is an incident.&lt;/p&gt;
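
&lt;p&gt;A sketch of a burn-rate pager in the common multi-window style; the 14.4x threshold is the widely used page-level value for a 1-hour window, taken here as an assumption rather than a prescription:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def burn_rate(error_ratio, slo_target=0.999):
    """1.0 means the budget burns exactly over the full SLO window."""
    return error_ratio / (1 - slo_target)

def should_page(err_1h, err_5m, slo_target=0.999):
    # Both windows must burn fast: the long one proves it's sustained,
    # the short one proves it's still happening. A 30-second blip fails
    # the 1-hour test and wakes nobody.
    return (burn_rate(err_1h, slo_target) &amp;gt; 14.4 and
            burn_rate(err_5m, slo_target) &amp;gt; 14.4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
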

&lt;h2&gt;
  
  
  Budget Policies That Actually Stick
&lt;/h2&gt;

&lt;p&gt;The trick isn't defining SLOs. It's enforcing them. We use a 4-level policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;budget_exhausted: freeze non-critical deploys, notify product
budget_50_pct: feature freeze, focus on reliability
budget_25_pct: normal operations, monitor carefully
budget_healthy: ship new features, experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the budget is exhausted, &lt;strong&gt;product can't ship new features until reliability is restored&lt;/strong&gt;. This alignment between eng and product is what makes SLOs stick.&lt;/p&gt;
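
&lt;p&gt;A sketch of the policy gate, reading the levels above as percent of error budget burned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deploy_policy(budget_burned_pct):
    if budget_burned_pct &amp;gt;= 100:
        return "freeze non-critical deploys, notify product"
    if budget_burned_pct &amp;gt;= 50:
        return "feature freeze, focus on reliability"
    if budget_burned_pct &amp;gt;= 25:
        return "normal operations, monitor carefully"
    return "ship new features, experiment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
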

&lt;h2&gt;
  
  
  Common Anti-Patterns
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SLOs nobody looks at:&lt;/strong&gt; if you don't page on budget burn rate, they're dead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs that never fail:&lt;/strong&gt; if you never breach budget, your targets are too loose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs that always fail:&lt;/strong&gt; if you're always in the red, your targets are unrealistic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLOs without product buy-in:&lt;/strong&gt; engineering-only SLOs get ignored during feature pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;SLOs are a negotiation tool between engineering and product. Without them, every outage becomes a fight. With them, you have a shared contract about what "good enough" means.&lt;/p&gt;

&lt;p&gt;Start with one critical journey. Measure it for 30 days. Set a realistic SLO. Enforce budget policies. Then add more journeys.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>slo</category>
      <category>microservices</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Building a Culture of Reliability: Beyond the SRE Handbook</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:08:36 +0000</pubDate>
      <link>https://dev.to/samson_tanimawo/building-a-culture-of-reliability-beyond-the-sre-handbook-jf</link>
      <guid>https://dev.to/samson_tanimawo/building-a-culture-of-reliability-beyond-the-sre-handbook-jf</guid>
      <description>&lt;h2&gt;
  
  
  You Can't Hire Your Way to Reliability
&lt;/h2&gt;

&lt;p&gt;I've seen companies hire 5 SREs and expect reliability to magically improve. It doesn't. Reliability is a cultural outcome, not a headcount metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliability Maturity Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1: Reactive
"Things break, we fix them."
No SLOs, no error budgets, post-mortems are optional.

Level 2: Aware
"We know what's breaking and how often."
Basic SLOs defined, post-mortems happen, on-call exists.

Level 3: Proactive
"We prevent most issues before they happen."
Error budgets enforced, chaos engineering started,
automated remediation for common issues.

Level 4: Predictive
"We predict and prevent issues we haven't seen yet."
ML-driven anomaly detection, capacity planning,
reliability is a product feature.

Level 5: Systemic
"Reliability is embedded in everything we do."
Every engineer thinks about reliability, every design
doc includes failure modes, every feature has SLOs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most companies are at Level 1-2. Getting to Level 3 is the biggest jump.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars of Reliability Culture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pillar 1: Ownership
&lt;/h3&gt;

&lt;p&gt;Reliability is not the SRE team's job. It's everyone's job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ownership_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;development_teams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Write SLOs for their services&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;On-call for their services (with SRE backup)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Fix production issues in their domain&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Include failure modes in design docs&lt;/span&gt;

&lt;span class="na"&gt;sre_team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Build reliability infrastructure (monitoring, alerting, CI/CD)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Consult on architecture for reliability&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Run chaos engineering program&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Manage cross-cutting reliability projects&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Train development teams on SRE practices&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pillar 2: Learning
&lt;/h3&gt;

&lt;p&gt;Every incident is a learning opportunity. But only if you structure it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_incident_learning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Blameless post-mortem (within 48 hours)
&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_postmortem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Share with entire engineering org
&lt;/span&gt;&lt;span class="nf"&gt;post_to_engineering_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Add to searchable incident database
&lt;/span&gt;&lt;span class="n"&gt;incident_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Extract patterns
&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;incident_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;create_reliability_project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Systemic issue: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Update runbooks
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_knowledge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;update_runbook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;postmortem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_knowledge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
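
&lt;p&gt;The helper functions above are illustrative, not a real API. As a rough sketch of the pattern-extraction step, a minimal in-memory incident database with a naive find_similar (matching on an assumed category tag) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Postmortem:
    service: str
    category: str      # e.g. "config-change", "capacity", "dependency"
    summary: str = ""

class IncidentDB:
    """Minimal in-memory stand-in for the incident database above."""
    def __init__(self):
        self._items = []

    def insert(self, pm):
        self._items.append(pm)

    def find_similar(self, pm):
        # Naive similarity: other incidents that share a failure category.
        return [p for p in self._items if p is not pm and p.category == pm.category]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;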



&lt;h3&gt;
  
  
  Pillar 3: Investment
&lt;/h3&gt;

&lt;p&gt;Reliability work needs dedicated time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Engineering time allocation:
  Feature development: 60%
  Reliability work: 20%
  Tech debt reduction: 10%
  Learning/experimentation: 10%

The 20% reliability budget includes:
- Alert tuning and noise reduction
- Runbook automation
- Chaos experiments
- SLO reviews and adjustments
- On-call process improvements
- Monitoring and observability improvements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Protect this 20%. When leadership pressures you to ship more features instead, show the correlation between reliability investment and incident reduction in your own history.&lt;/p&gt;
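
&lt;p&gt;That case is easy to make with a few lines of code. A minimal sketch, assuming you track monthly reliability hours and incident counts (the numbers below are invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

# Hypothetical monthly data: reliability hours invested vs. incidents that month.
reliability_hours = [10, 15, 25, 40, 60, 80]
incident_counts = [12, 11, 9, 6, 4, 3]

# Pearson correlation (statistics.correlation requires Python 3.10+).
r = statistics.correlation(reliability_hours, incident_counts)
print(f"Pearson r = {r:.2f}")  # strongly negative: more investment, fewer incidents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;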

&lt;h2&gt;
  
  
  Measuring Culture
&lt;/h2&gt;

&lt;p&gt;You can't manage what you can't measure. Cultural metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;reliability_culture_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# Engineering engagement&lt;/span&gt;
&lt;span class="na"&gt;postmortem_attendance_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 80%&lt;/span&gt;
&lt;span class="na"&gt;action_item_completion_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 90%&lt;/span&gt;
&lt;span class="na"&gt;runbook_update_frequency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 2x/month per service&lt;/span&gt;

&lt;span class="c1"&gt;# Design quality&lt;/span&gt;
&lt;span class="na"&gt;design_docs_with_failure_modes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 95%&lt;/span&gt;
&lt;span class="na"&gt;new_services_with_slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target 100%&lt;/span&gt;
&lt;span class="na"&gt;chaos_experiment_frequency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 1x/quarter per service&lt;/span&gt;

&lt;span class="c1"&gt;# Team health&lt;/span&gt;
&lt;span class="na"&gt;oncall_nps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;developer_survey_reliability_confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;gt; 4/5&lt;/span&gt;
&lt;span class="na"&gt;sre_team_attrition_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target &amp;lt; 10%/year&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
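
&lt;p&gt;Most of these are computable from data you already have. As a hedged sketch, the action-item completion rate could be derived from a ticket export, assuming each item carries a status field (the data shape here is an assumption, not a real tracker API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical post-mortem action items exported from a ticket tracker.
action_items = [
    {"id": "AI-101", "status": "done"},
    {"id": "AI-102", "status": "done"},
    {"id": "AI-103", "status": "open"},
]

done = sum(1 for item in action_items if item["status"] == "done")
rate = done / len(action_items)
print(f"action_item_completion_rate: {rate:.0%} (target &amp;gt; 90%)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;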



&lt;h2&gt;
  
  
  The Quick Wins
&lt;/h2&gt;

&lt;p&gt;If you're starting from Level 1, here are the highest-impact changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: Define SLOs for your top 3 services. Just availability + latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Make post-mortems mandatory and blameless. Use a template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Set up on-call rotation with clear escalation paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4&lt;/strong&gt;: Create a reliability Slack channel. Share learnings daily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 2&lt;/strong&gt;: Start tracking error budgets (see the sketch after this list). Share with product managers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 3&lt;/strong&gt;: Run your first chaos experiment (kill a pod, see what happens).&lt;/li&gt;
&lt;/ol&gt;
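
&lt;p&gt;For the Month 2 step, the error-budget arithmetic is small enough to start in a spreadsheet or a few lines of Python. A minimal sketch, assuming a 30-day window and a 99.9% availability SLO (the downtime figure is an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes

budget_minutes = window_minutes * (1 - slo)    # ~43.2 minutes of allowed downtime
downtime_minutes = 18                          # observed downtime this window (example)

remaining = budget_minutes - downtime_minutes
print(f"budget: {budget_minutes:.1f} min, used: {downtime_minutes} min, "
      f"remaining: {remaining:.1f} min ({remaining / budget_minutes:.0%})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;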

&lt;p&gt;Three months from chaos to competence. Not perfect, but dramatically better.&lt;/p&gt;

&lt;p&gt;If you want to accelerate your reliability culture with AI-powered tools that embed SRE best practices, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>culture</category>
      <category>reliability</category>
      <category>engineering</category>
    </item>
  </channel>
</rss>
