DCT Technology Pvt. Ltd.

Posted on Jul 17

Why System Design Should Consider On-Call Engineers Too

#systemdesign #devops #oncall #softwaredevelopment

You’ve probably seen beautifully crafted architecture diagrams — microservices neatly divided, auto-scaling groups, load balancers, and fancy queues.

But here’s a question nobody's asking:

Who’s the one waking up at 3 AM to fix it when it breaks?

The answer? Your on-call engineer — the unsung hero who's usually left out of the design discussions.

And that, right there, is where most system designs fall short.

😴 The Midnight Pager Test (And Why It Matters)

Every feature you ship will eventually break. That’s not pessimism — it’s reality.

But how gracefully your system fails, and how quickly it recovers, is often the difference between a five-minute outage and a PR nightmare.

Here’s why involving on-call engineers early is not just nice — it’s critical:

They’ve seen the weird edge cases your design doc doesn’t cover.
They know which logs are useless when things go down.
They understand how alerts feel when they trigger at 2 AM for the third time that week.

🧠 Real-World Wisdom > Hypothetical Planning

“Everything works in theory — until prod says otherwise.”

On-call engineers bring something you can’t Google or ChatGPT your way into: battle-tested experience.

Just listen to stories like this one:
👉 Postmortem of a Redis Outage at GitHub
You'll notice how tiny assumptions in design led to huge firefights in production.

✅ Design Decisions That On-Call Engineers Would Change

Timeouts and Retries: It’s easy to say "just retry 3 times". But what happens when every service retries at the same moment and the combined traffic takes down your database?

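   // Instead of naive retries, use exponential backoff with jitter
   async function retryWithBackoff(fn, retries = 5, baseDelayMs = 100, maxDelayMs = 5000) {
     let lastError;
     for (let attempt = 0; attempt < retries; attempt++) {
       try {
         return await fn();
       } catch (e) {
         lastError = e;
         if (attempt === retries - 1) break; // no point sleeping after the final attempt
         // Full jitter: wait a random time up to the capped exponential delay,
         // so clients that failed together do not all retry in lockstep.
         const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
         await new Promise(res => setTimeout(res, Math.random() * cap));
       }
     }
     // Keep the original failure attached so the on-call engineer can see the root cause.
     throw new Error("Max retries reached", { cause: lastError });
   }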
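
Retries only cover half of this item; the other half is timeouts. Here is a minimal sketch of a per-attempt timeout, assuming a runtime with a global AbortController (recent Node.js or a browser). The name withTimeout and the URL in the usage comment are made up for illustration, not taken from any library.

```javascript
// Hypothetical helper: rejects if `fn` has not settled within `ms` milliseconds.
// `fn` receives an AbortSignal so it can cancel its own work (e.g. pass it to fetch).
async function withTimeout(fn, ms) {
  const controller = new AbortController();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the callee to stop working
      reject(new Error(`Timed out after ${ms} ms`));
    }, ms);
  });
  try {
    // Whichever settles first wins: the real call or the timer.
    return await Promise.race([fn(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // do not leave a stale timer behind
  }
}

// Possible usage with the retry helper above (placeholder URL):
// retryWithBackoff(() =>
//   withTimeout(signal => fetch("https://api.example.com/orders", { signal }), 2000));
```

Putting the timeout inside each attempt, rather than around the whole retry loop, means one hung connection costs you a bounded wait instead of stalling every caller behind it.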

Monitoring That Actually Helps: Logs without context are like puzzles with missing pieces. Your alerts should be:

Actionable
Prioritized
Mapped to business impact

Great intro on setting up effective observability:
👉 Monitoring Distributed Systems by Cindy Sridharan
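
To make the three alert properties above concrete, here is a rough sketch of a design-review check plus a logger that carries context on every line. The field names (runbookUrl, severity, businessImpact) and the severity levels are assumptions for illustration; your alerting tool will have its own schema.

```javascript
// Hypothetical alert shape; field names are illustrative, not from any specific tool.
const SEVERITIES = ["page", "ticket", "info"];

// Returns a list of problems; an empty list means the alert passes the review.
function reviewAlert(alert) {
  const problems = [];
  if (!alert.runbookUrl) problems.push("not actionable: no runbook link");
  if (!SEVERITIES.includes(alert.severity)) problems.push("not prioritized: unknown severity");
  if (!alert.businessImpact) problems.push("no business impact stated");
  return problems;
}

// Attach the same context (request ID, service name, ...) to every log line
// so entries can be correlated during an incident.
function logWithContext(context, message, fields = {}) {
  const entry = { time: new Date().toISOString(), message, ...context, ...fields };
  console.log(JSON.stringify(entry));
  return entry;
}
```

An alert that fails reviewAlert is a good candidate to fix or delete before it ever pages anyone at 2 AM.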

Failure Modes with Human Impact: Don’t just plan for failover. Ask: What’s the experience for the person fixing it?
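
One small way to improve that experience is a health report that names every broken dependency instead of returning a bare "unhealthy". The sketch below assumes each check is an async function that throws on failure; healthReport and the dependency names in the usage comment are hypothetical.

```javascript
// Runs every dependency check and reports each result instead of stopping at the first failure.
// `checks` maps a dependency name to a function that throws (or rejects) when unhealthy.
async function healthReport(checks) {
  const names = Object.keys(checks);
  // The async wrapper turns a synchronous throw into a rejection, so one bad check
  // cannot prevent the others from being reported.
  const results = await Promise.allSettled(names.map(async name => checks[name]()));
  const dependencies = {};
  results.forEach((result, i) => {
    dependencies[names[i]] = result.status === "fulfilled"
      ? { ok: true }
      : { ok: false, error: String(result.reason?.message ?? result.reason) };
  });
  const healthy = Object.values(dependencies).every(d => d.ok);
  return { status: healthy ? "ok" : "degraded", dependencies };
}

// Possible usage (pingDatabase and pingCache are placeholders for your own checks):
// const report = await healthReport({ database: pingDatabase, cache: pingCache });
```

The person holding the pager sees which dependency failed and why in one response, rather than having to dig through logs to find out.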

👥 Add Them to the Room, Not Just the Rotation

System design isn’t just about how it works when everything's okay. It’s about:

What happens when it fails?
Who gets the call?
Can they fix it fast?
Can they sleep peacefully after?

That’s why every major design review should have an on-call engineer present.

🛠️ How to Start Including On-Call Input in Design

📋 Add an “Operational Review” step in your architecture checklist.
🧑‍💻 Have an “On-Call Feedback Doc” linked to each system.
💬 Host a quarterly “On-Call Pain Points” meeting.
✅ Validate all new features against incident history.

🚨 Because "Works on My Machine" Isn't Enough

Production is where your system faces:

Real users
Real spikes
Real bugs

And it’s where on-call engineers carry the system on their shoulders.

So next time you're drawing that architecture diagram, leave room for the one person who knows what breaks at 3 AM.

If you’ve ever been on call — or know someone who has — share this with your team. It might just save someone a sleepless night. 🌙💡

💬 Have horror stories from being on call for a badly designed system? Drop them in the comments. Let's make sure no one else repeats the same mistakes.

👉 Follow DCT Technology for more real-world insights, design lessons, and engineering best practices that actually work in production.

