
Yonatan Sapoznik

My First dev.to Post, and a 1-Evening SRE System That Changed Our On-Call

Hey dev.to 👋

This is my first post here.

I wanted to share something I built at work: a system I created in a single evening, completely by myself.

I built it because I saw an opportunity and knew this domain really matters at my company.


The Context

Most SRE improvements come with more tooling, more dashboards, and more complexity.

I went the opposite direction.

No new system. No big infra changes. Just a different way of working.

During incidents, we kept asking:

Where is this happening most?
Is it tenant-specific?
Is it region-related?
Is this new or recurring?

The data existed, but the process to get answers was slow and inconsistent.
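To make those four questions concrete, here is a minimal sketch of how they could be answered in one pass over a batch of error events. It is an illustration only, not the system described in this post: the `ErrorEvent` fields, the `summarize` function, and the idea of matching a normalized error "signature" against past incidents are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    # Hypothetical fields; real log schemas will differ.
    tenant: str
    region: str
    signature: str  # e.g. a normalized error message


def summarize(events, known_signatures):
    """Answer the four triage questions for a batch of error events."""
    if not events:
        return None

    total = len(events)
    by_tenant = Counter(e.tenant for e in events)
    by_region = Counter(e.region for e in events)
    by_signature = Counter(e.signature for e in events)

    top_tenant, tenant_hits = by_tenant.most_common(1)[0]
    top_region, region_hits = by_region.most_common(1)[0]
    top_signature, _ = by_signature.most_common(1)[0]

    return {
        # Where is this happening most? Is it tenant-specific?
        "top_tenant": top_tenant,
        "tenant_share": tenant_hits / total,
        # Is it region-related?
        "top_region": top_region,
        "region_share": region_hits / total,
        # Is this new or recurring?
        "top_signature": top_signature,
        "recurring": top_signature in known_signatures,
    }


# Example batch: three errors from one tenant, one from another.
events = [ErrorEvent("acme", "eu-west", "timeout")] * 3 + [
    ErrorEvent("globex", "us-east", "timeout")
]
summary = summarize(events, known_signatures={"timeout"})
print(summary)
```

A high `tenant_share` or `region_share` suggests a localized problem, while `recurring` points you at an existing runbook instead of a fresh investigation.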


What I Built

What started as a Markdown file turned into something much bigger:

An AI-powered SRE teammate.

A system that:

  • understands our architecture
  • queries logs and metrics in real time
  • searches past incidents and runbooks
  • and investigates production issues end-to-end

Like a senior engineer who’s been here since day one, available 24/7.

At a Glance

  • ~4 minutes to triage incidents
  • End-to-end investigations from a single input
  • Zero context switching between tools
  • Live correlation between code, logs, and metrics

👉 Full article here: I Cut MTTR to 4 Minutes - My “SRE” Is a 619-Line Markdown File


Why I’m Sharing This

This wasn’t meant to be a “big solution”.

It was just:

“Let’s make on-call a bit less painful”

But it ended up having a real impact.

So I figured it’s worth sharing, and worth getting feedback on.


What I’m Thinking About Next

I want to go deeper in the next posts.

A couple of directions I’m considering:

1. Designing Deterministic Skills & Agents

How I built skills and agents that behave predictably, so you can test, extend, and evolve them without breaking things.

  • test at different layers
  • extend with confidence
  • avoid hidden regressions

2. New Ideas for Agents

Less about hype, more about:

  • practical use cases
  • where agents actually help
  • and some practical methods I’ve found effective

Would Love Your Input

If any of this sounds interesting, let me know 🙌

  • What would you want me to dive into next?
  • Have you tried something similar?
  • Do your on-call shifts feel harder than they should be?

Thanks for reading ✌️