My First dev.to Post — And a 1-Evening SRE System That Changed Our On-Call

Yonatan Sapoznik — Tue, 07 Apr 2026 09:37:48 +0000

Hey dev.to 👋

This is my first post here.

I wanted to share something I built at work — a system I created in a single evening, completely by myself.

I built it because I saw an opportunity and knew how much this domain matters at my company.

The Context

Most SRE improvements come with more tooling, more dashboards, and more complexity.

I went the opposite direction.

No new system. No big infra changes. Just a different way of working.

During incidents, we kept asking:

Where is this happening most?
Is it tenant-specific?
Is it region-related?
Is this new or recurring?

The data existed — but the process to get answers was slow and inconsistent.
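To make those four questions concrete, here is a minimal Python sketch. It is only an illustration, not the system described in this post: the `ErrorEvent` record, its fields, and the `summarize` helper are all hypothetical names, and it assumes error events have already been pulled from logs into memory.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical record for one logged error (not from the original post)."""
    fingerprint: str  # assumed stable ID for an error signature
    tenant: str
    region: str
    seen_at: datetime


def summarize(events, fingerprint, now, recent=timedelta(hours=1)):
    """Answer the four triage questions for one error signature."""
    matching = [e for e in events if e.fingerprint == fingerprint]
    window_start = now - recent
    current = [e for e in matching if e.seen_at >= window_start]

    by_tenant = Counter(e.tenant for e in current)
    by_region = Counter(e.region for e in current)

    return {
        # Where is this happening most?
        "top_tenant": by_tenant.most_common(1)[0] if by_tenant else None,
        "top_region": by_region.most_common(1)[0] if by_region else None,
        # Is it tenant-specific? Is it region-related?
        "tenant_specific": len(by_tenant) == 1,
        "region_specific": len(by_region) == 1,
        # Is this new or recurring? Recurring means it was seen before the window.
        "recurring": any(e.seen_at < window_start for e in matching),
    }
```

Calling `summarize(events, "db-timeout", now)` returns one small dictionary per error signature, so the same four answers come back in the same shape every time instead of being rebuilt by hand during each incident.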

What I Built

What started as a Markdown file turned into something much bigger:

An AI-powered SRE teammate.

A system that:

understands our architecture
queries logs and metrics in real time
searches past incidents and runbooks
and investigates production issues end-to-end

Like a senior engineer who’s been here since day one — available 24/7.

At a Glance

~4 minutes to triage incidents
End-to-end investigations from a single input
Zero context switching between tools
Live correlation between code, logs, and metrics

👉 Full article here: I Cut MTTR to 4 Minutes — My “SRE” Is a 619-Line Markdown File

Why I’m Sharing This

This wasn’t meant to be a “big solution”.

It was just:

“Let’s make on-call a bit less painful”

But it ended up having a real impact.

So I figured it’s worth sharing — and also getting feedback.

What I’m Thinking About Next

I want to go deeper in the next posts.

A couple of directions I’m considering:

1. Designing Deterministic Skills & Agents

How I built skills and agents that behave predictably, so you can test, extend, and evolve them without breaking things.

test at different layers
extend with confidence
avoid hidden regressions
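As a rough illustration of what "deterministic" can mean here, the sketch below keeps the decision logic in a plain function and passes the data source in as a parameter. This is an assumption about one possible design, not the approach from the full article: `triage_skill`, `fetch_error_rate`, and the 5% threshold are all hypothetical.

```python
from typing import Callable


def triage_skill(
    service: str,
    fetch_error_rate: Callable[[str], float],  # hypothetical metrics lookup
    threshold: float = 0.05,  # assumed cutoff, chosen only for illustration
) -> str:
    """Return the same decision for the same inputs, with no hidden state."""
    rate = fetch_error_rate(service)
    if rate >= threshold:
        return f"escalate:{service}"
    return f"monitor:{service}"


def fake_metrics(service: str) -> float:
    """Stub standing in for a real metrics backend during tests."""
    return {"checkout": 0.12, "search": 0.01}.get(service, 0.0)
```

Because the metrics lookup is injected, the decision layer can be tested with `fake_metrics` alone, and the real backend can be tested separately. A change to the threshold or the branching then shows up as a failing assertion instead of a hidden regression.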

2. New Ideas for Agents

Less about hype — more about:

practical use cases
where agents actually help
and some methods I’ve found effective

Would Love Your Input

If any of this sounds interesting — let me know 🙌

What would you want me to dive into next?
Have you tried something similar?
Do your on-call shifts feel harder than they should be?

Thanks for reading ✌️

DEV Community: Yonatan Sapoznik