DEV Community

Jairo Junior

SRE is the BEST Thing Ever

If you don’t know what SRE is, don’t worry… I got you.

I'm Jairo Jr., a Software Engineer at Mercado Livre, based in Brazil, and for the last few months I’ve been studying SRE.

And bro… I’ll be honest:

SRE changed the way I see production.

Before SRE, production for me was like:
✅ “deploy is done”
✅ “feature is working”
✅ “let’s move to the next ticket”

Now, after getting deeper into SRE, I started to see production as:

“Ok… but what if this breaks at 3AM?”

Yeah. That’s the monster. 😅


So… what is SRE?

SRE means Site Reliability Engineering.

It started at Google around 2003, when they realized something very simple:

If your product grows…
your problems grow too.

And the real pain is not “having problems” (every system has them).
The real pain is:

  • problems happening every week
  • breaking the user experience
  • people getting stressed
  • on-call turning into a nightmare
  • and the company losing money while you are trying to debug logs like a detective 🕵️‍♂️

So SRE is basically an approach to make software:

✅ scalable
✅ reliable
✅ measurable
✅ and less “random”


SRE is not just DevOps with a fancy name

A lot of people think:

“Ok SRE is DevOps, right?”

Not exactly.

SRE is more like:

DevOps goals + Engineering mindset

Instead of solving problems manually forever, SRE asks:

“Can we automate this?”
“Can we predict this?”
“Can we detect it before the customer does?”
“Can we recover faster?”

SRE is the discipline that makes you stop being reactive and start being proactive.


The day-to-day example (the real one)

Let’s imagine this classic situation:

You deploy a new feature.
Everything looks ok.

But 10 minutes later:

  • latency goes up 📈
  • some requests start failing
  • and your metrics dashboard looks like a Christmas tree 🎄

And you start hearing that magic sentence:

“For me it’s working…”

But for the users:
❌ it’s not.

This is where SRE saves you.

Because SRE makes you build the system in a way that lets you quickly answer:

  • What is broken?
  • When did it start?
  • Is it affecting everyone or just some customers?
  • Is the failure in my service or in a dependency?
  • What changed?
  • How fast can I roll back?
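
To make two of those questions concrete ("When did it start?" and "Is it affecting everyone or just some customers?"), here is a minimal sketch. It assumes you already have per-request records with a timestamp, a customer, and a success flag. The `Request` type, its fields, and the 5% threshold are all hypothetical names and values picked for illustration, not something from a specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    minute: int    # minutes since the deploy (hypothetical field)
    customer: str  # hypothetical customer identifier
    ok: bool       # True if the request succeeded


def error_rate_by_minute(requests):
    """Return {minute: fraction of failed requests in that minute}."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for r in requests:
        total[r.minute] += 1
        if not r.ok:
            failed[r.minute] += 1
    return {m: failed[m] / total[m] for m in total}


def first_bad_minute(requests, threshold=0.05):
    """'When did it start?': first minute whose error rate exceeds the threshold."""
    rates = error_rate_by_minute(requests)
    bad = [m for m in sorted(rates) if rates[m] > threshold]
    return bad[0] if bad else None


def affected_customers(requests):
    """'Everyone or just some customers?': customers with at least one failure."""
    return sorted({r.customer for r in requests if not r.ok})


# Tiny made-up sample: customer "b" starts failing at minute 2.
requests = [
    Request(0, "a", True), Request(0, "b", True),
    Request(1, "a", True), Request(1, "b", True),
    Request(2, "a", True), Request(2, "b", False),
    Request(3, "a", True), Request(3, "b", False),
]

print(first_bad_minute(requests))    # 2
print(affected_customers(requests))  # ['b']
```

In a real system these numbers would come from your metrics or logs, but the idea is the same: the data has to exist before the incident, not after.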

Without SRE culture, you usually discover problems like this:

  • through Slack messages
  • from a customer complaint
  • or from your manager asking “what is happening?” 😭

SRE teaches you to measure reliability

One thing I really like is that SRE doesn’t live only in theory.

SRE is about numbers.

So instead of saying:

❌ “Our service is very stable”

You say:

✅ “Our SLO is 99.9% availability, and this month we are still within the error budget.”

And this is super powerful, because now reliability becomes something you can talk about with:

  • engineers
  • product
  • managers
  • business people

Everyone understands it.
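
Where does a number like that come from? Here is a minimal sketch, assuming availability is measured as "successful requests divided by total requests" (one common way to define it, but not the only one). The function names and the request counts are made up for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """True if the measured availability is at or above the SLO target."""
    return sli >= slo_target


# Made-up numbers for two different months.
print(meets_slo(availability_sli(999_500, 1_000_000)))  # True  (99.95%)
print(meets_slo(availability_sli(998_000, 1_000_000)))  # False (99.80%)
```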


SLO and Error Budget (the part that hits different)

Ok now the fun part.

If your SLO is 99.9% availability per month, it means you can “afford” around:

43 minutes of downtime per month (0.1% of a 30-day month)

This is your error budget.
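
The math behind that number is simple. Here is a small sketch, assuming a 30-day month and a purely time-based view of availability (the function name is just for illustration):

```python
def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given availability SLO."""
    total_minutes = days * 24 * 60          # 30 days = 43,200 minutes
    return (1 - slo_target) * total_minutes


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(error_budget_minutes(0.9999), 2))   # 4.32
```

Notice how one extra "9" cuts the budget from about 43 minutes to about 4 minutes.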

And now it becomes like a rule:

✅ if you are inside the budget → you can deploy more
❌ if you already burned the budget → you stop pushing risky stuff and fix reliability

This is basically SRE saying:

“Move fast… but not stupid fast.”
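
The budget rule above can be sketched in a few lines. This is only an illustration, assuming a 30-day window and that you track downtime in minutes. The function names are hypothetical, and real error budget policies usually have more nuance than a single yes/no.

```python
def budget_remaining_minutes(slo_target: float, downtime_minutes: float,
                             days: int = 30) -> float:
    """Error budget left in the period: allowed downtime minus downtime already spent."""
    budget = (1 - slo_target) * days * 24 * 60
    return budget - downtime_minutes


def can_ship_risky_change(slo_target: float, downtime_minutes: float,
                          days: int = 30) -> bool:
    """Inside the budget: keep deploying. Budget burned: focus on reliability."""
    return budget_remaining_minutes(slo_target, downtime_minutes, days) > 0


print(round(budget_remaining_minutes(0.999, 10), 1))  # 33.2
print(can_ship_risky_change(0.999, 10))               # True
print(can_ship_risky_change(0.999, 50))               # False
```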


SRE is the reason you stop being a hero

Without SRE, companies create a weird culture where:

  • something breaks
  • one person wakes up
  • fixes everything
  • and becomes “the hero”

Sounds cool… but that is a trap.

Because now the system depends on a human.

And humans:

  • get tired
  • make mistakes
  • get sick
  • take vacations
  • change jobs

SRE pushes you to create systems that don’t need heroes.

It’s not about “who can fix faster”.

It’s about:
✅ why this happened
✅ what we improve
✅ how we avoid it happening again
✅ how we reduce impact next time


On-call is not the problem (bad on-call is)

SRE also made me understand something:

On-call is part of the game.

But bad on-call is what kills teams.

Bad on-call is when:

  • you get paged for useless alerts
  • you don’t have runbooks
  • no clear dashboards
  • no rollback plan
  • no ownership
  • and every incident feels like the first time

Good SRE makes on-call easier because it forces the team to build:

  • clear monitoring
  • meaningful alerts
  • fast recovery
  • and a clear incident process

So instead of “panic mode”, the team enters “process mode”.


The real goal: protect your users

At the end of the day, this is what matters.

Your user doesn’t care about:

  • Kubernetes
  • Kafka
  • retries
  • p95 latency
  • cache invalidation

The user cares about:

✅ the app works
✅ the payment goes through
✅ the screen loads fast
✅ the order is confirmed

SRE helps you deliver that every day, not just on your local machine.


Why I think SRE is the BEST thing ever

Because it changes your mindset.

SRE makes you stop thinking only about:

🧩 “How to build this feature”

And start thinking about:

🔥 “How to keep this feature alive in production for millions of users”

And bro… that’s a different level of engineering.
