
DCT Technology Pvt. Ltd.

Why System Design Should Consider On-Call Engineers Too

You’ve probably seen beautifully crafted architecture diagrams — microservices neatly divided, auto-scaling groups, load balancers, and fancy queues.

But here’s a question nobody's asking:

Who’s the one waking up at 3 AM to fix it when it breaks?

The answer? Your on-call engineer — the unsung hero who's usually left out of the design discussions.

And that, right there, is where most system designs fall short.

😴 The Midnight Pager Test (And Why It Matters)

Every feature you ship will eventually break. That’s not pessimism — it’s reality.

But how gracefully your system fails, and how quickly it recovers, is often the difference between a five-minute outage and a PR nightmare.

Here’s why involving on-call engineers early is not just nice — it’s critical:

  • They’ve seen the weird edge cases your design doc doesn’t cover.
  • They know which logs turn out to be useless during an outage.
  • They understand how alerts feel when they trigger at 2 AM for the third time that week.

🧠 Real-World Wisdom > Hypothetical Planning

“Everything works in theory — until prod says otherwise.”

On-call engineers bring something you can’t Google or ChatGPT your way into: battle-tested experience.

Just read stories like this one:
👉 Postmortem of a Redis Outage at GitHub
You'll notice how tiny assumptions in design led to huge firefights in production.


✅ Design Decisions That On-Call Engineers Would Change

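  1. Timeouts and Retries
     It’s easy to say "just retry 3 times." But what happens when every service retries at once and the extra traffic takes down your database?
   // Instead of naive retries, use exponential backoff:
   // wait 100 ms, 200 ms, 400 ms, ... between attempts.
   async function retryWithBackoff(fn, retries = 5) {
     let delay = 100; // initial wait in milliseconds
     let lastError;
     for (let i = 0; i < retries; i++) {
       try {
         return await fn();
       } catch (e) {
         lastError = e;
         // Don't sleep after the final attempt; there is nothing left to wait for.
         if (i === retries - 1) break;
         await new Promise(res => setTimeout(res, delay));
         delay *= 2;
       }
     }
     // Keep the underlying failure attached so whoever is debugging can see it.
     throw new Error("Max retries reached", { cause: lastError });
   }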
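
One thing plain exponential backoff doesn't fix: clients that fail together also retry together, so the database can still see synchronized waves of traffic. A common remedy is to randomize each delay and cap it. Here's a minimal sketch, assuming the "full jitter" variant. The names `backoffDelay` and `retryWithJitter` and the 100 ms / 5000 ms defaults are illustrative choices, not taken from any particular library.

```javascript
// Full jitter: pick a random delay in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be checked deterministically.
function backoffDelay(attempt, baseMs = 100, capMs = 5000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Default sleep helper; callers can swap it out (for example in tests).
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry `fn` up to `retries` times with a jittered, capped delay between attempts.
async function retryWithJitter(fn, retries = 5, sleep = defaultSleep) {
  let lastError = new Error("retries must be at least 1");
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      // No point sleeping after the last attempt.
      if (attempt < retries - 1) {
        await sleep(backoffDelay(attempt));
      }
    }
  }
  throw lastError;
}
```

Passing `random` and `sleep` as parameters is a design choice for this sketch: it lets you verify the delay math without waiting on real timers.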
  2. Monitoring That Actually Helps
     Logs without context are like puzzles with missing pieces. Your alerts should be:
  • Actionable
  • Prioritized
  • Mapped to business impact

Great intro on setting up effective observability:
👉 Monitoring Distributed Systems by Cindy Sridharan
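
The three alert properties listed earlier (actionable, prioritized, mapped to business impact) can be turned into a quick review check. This is only a sketch: the alert shape (`runbookUrl`, `severity`, `businessImpact`) and the severity names are hypothetical, so map them onto whatever your alerting tool actually uses.

```javascript
// Return the reasons an alert definition would be hard to act on at 2 AM.
// The field names and severity levels are illustrative assumptions.
function reviewAlert(alert) {
  const severities = ["page", "ticket", "info"];
  const problems = [];
  if (!alert.runbookUrl) {
    problems.push("not actionable: no runbook link");
  }
  if (!severities.includes(alert.severity)) {
    problems.push("not prioritized: severity must be one of " + severities.join(", "));
  }
  if (!alert.businessImpact) {
    problems.push("no business impact stated");
  }
  return problems;
}

// Example usage with made-up values.
const vagueAlert = { name: "CPU high", severity: "urgent" };
console.log(reviewAlert(vagueAlert));
```

An empty result means the alert passes all three checks. Anything else is a list of what to fix before the alert is allowed to page someone.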

  3. Failure Modes with Human Impact
     Don’t just plan for failover. Ask: what’s the experience for the person fixing it?
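
One way to act on that question is to make each failure log line answer what the responder will ask first: what broke, who is affected, and what to try. Here's a rough sketch. Every field name in it (`dependency`, `userImpact`, `nextStep`, and so on) is an assumption for illustration, not a standard schema.

```javascript
// Build one structured log line for a failure, aimed at the person debugging it.
// The context fields are illustrative assumptions, not a standard schema.
function describeFailure(error, context) {
  return JSON.stringify({
    level: "error",
    message: error.message,
    service: context.service,
    dependency: context.dependency,
    requestId: context.requestId,
    userImpact: context.userImpact,
    nextStep: context.nextStep,
    timestamp: new Date().toISOString(),
  });
}

// Example usage with made-up values.
const exampleLine = describeFailure(new Error("connection refused"), {
  service: "checkout-api",
  dependency: "orders-db",
  requestId: "req-123",
  userImpact: "customers cannot complete payment",
  nextStep: "check the orders-db failover status",
});
console.log(exampleLine);
```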

👥 Add Them to the Room, Not Just the Rotation

System design isn’t just about how it works when everything's okay. It’s about:

  • What happens when it fails?
  • Who gets the call?
  • Can they fix it fast?
  • Can they sleep peacefully after?

That’s why every major design review should have an on-call engineer present.


🛠️ How to Start Including On-Call Input in Design

  • 📋 Add an “Operational Review” step in your architecture checklist.
  • 🧑‍💻 Have an “On-Call Feedback Doc” linked to each system.
  • 💬 Host a quarterly “On-Call Pain Points” meeting.
  • ✅ Check every new feature against your incident history before it ships.

🚨 Because "Works on My Machine" Isn't Enough

Production is where your system faces:

  • Real users
  • Real spikes
  • Real bugs

And it’s where on-call engineers carry the system on their shoulders.

So next time you're drawing that architecture diagram —
Leave room for the one person who knows what breaks at 3 AM.


If you’ve ever been on call — or know someone who has — share this with your team. It might just save someone a sleepless night. 🌙💡

💬 Have horror stories from being on call for a badly designed system? Drop them in the comments. Let's make sure no one else repeats the same mistakes.


👉 Follow DCT Technology for more real-world insights, design lessons, and engineering best practices that actually work in production.


#systemdesign #devops #oncall #softwareengineering #webdevelopment #productivity #techlead #incidentmanagement #devcommunity #dcttechnology
