DEV Community

Samson Tanimawo

Building the first Agentic SRE Platform. 100 AI agents that detect, investigate, and resolve incidents autonomously.

Location: Houston
Personal website: https://novaaiops.com
Pronouns: He/Him/His

Scaling On-Call When You Only Have 5 Engineers
2 min read
TLS Certificate Management Without Tears
2 min read
DNS: The SRE's Most Underrated Skill
2 min read
The Silent Outage: Monitoring What You Can't See
2 min read
Why Every SRE Should Learn a Little Rust
2 min read
How We Built Our Own Incident Management System
2 min read
The Role of Platform Engineering in a Startup
2 min read
Building Dashboards People Actually Use
2 min read
SRE Maturity Models: Where Is Your Team?
2 min read
The Art of Writing a Good Post-Mortem
1 min read
Why We Stopped Using Log Aggregation for Everything
1 min read
Running Postgres at Scale: Lessons Learned
2 min read
How We Reduced Our Deployment Failure Rate to Under 2%
1 min read
The Hidden Cost of Flaky Tests
1 min read
Observability for Serverless: What's Different
2 min read
From DevOps to SRE: Making the Transition
2 min read
The SRE Interview: Questions I Actually Ask
1 min read
Incident Retrospectives Without Blame
1 min read
Alert Fatigue: The Silent Productivity Killer
1 min read
Why SLIs Matter More Than SLOs
1 min read
The PagerDuty Migration Playbook
1 min read
How We Cut Datadog Bills by 60% Without Losing Observability
1 min read
Building Your First Runbook: A Template That Actually Works
1 min read
AIOps vs Traditional Monitoring: What Actually Changed
1 min read
Eventual Consistency: Debugging the Hardest Class of Bugs
4 min read
The Economics of Self-Hosting vs. Managed Monitoring
4 min read
Building an Incident Response Playbook Library
4 min read
Kubernetes Network Policies: Lessons from Production Incidents
4 min read
Reducing Toil: The Google SRE Book Applied to Startups
4 min read
Incident Severity Levels: SEV-1 to SEV-5 Calibration
4 min read
Memory Leak Detection in Long-Running Services
3 min read
CI/CD Reliability: When Your Deploy Pipeline is Your SPOF
3 min read
Multi-Region Failover: Lessons from Running It Hot
3 min read
Disaster Recovery Drills That Actually Work
3 min read
Feature Flags as a Reliability Tool, Not Just an A/B Platform
3 min read
eBPF for SREs: Observability Without Agents
3 min read
Observability as Code: Managing Dashboards and Alerts with Terraform
2 min read
Service Level Objectives for Complex Microservices
3 min read
Building a Culture of Reliability: Beyond the SRE Handbook
3 min read
Debugging Kubernetes OOMKilled: A Step-by-Step Guide
3 min read
Deployment Frequency: How We Went From Weekly to 20x/Day
3 min read
Cost-Effective Observability: The 80/20 Stack for Startups
3 min read
Incident Communication: The Status Page That Builds Trust
3 min read
Load Testing in Production: How We Do It Safely
3 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules
3 min read
GitOps for Infrastructure: How We Deploy With Zero SSH
2 min read
Prometheus at Scale: Surviving the Cardinality Cliff
2 min read
Database Reliability: The SRE Approach to Keeping Data Safe
3 min read
Container Security for SREs: The Practical Checklist
3 min read
The Incident Commander Role: Running Incidents Without Chaos
2 min read
Terraform at Scale: Lessons from Managing 500+ Resources
2 min read
Why Your Microservices Need Circuit Breakers (And How to Add Them)
2 min read
The On-Call Handoff That Prevents Dropped Incidents
2 min read
SLOs That Product Managers Actually Understand
2 min read
MTTR Optimization: The 7 Levers That Actually Move the Needle
3 min read
Service Maps: The Architectural Clarity Your Team Is Missing
2 min read
AI in Incident Response: Hype vs. Reality in 2024
3 min read
Monitoring Costs Are Out of Control — Here's How to Fix It
2 min read