You’ve probably seen beautifully crafted architecture diagrams — microservices neatly divided, auto-scaling groups, load balancers, and fancy queues.
But here’s a question nobody's asking:
Who’s the one waking up at 3 AM to fix it when it breaks?
The answer? Your on-call engineer — the unsung hero who's usually left out of the design discussions.
And that, right there, is where most system designs fall short.
😴 The Midnight Pager Test (And Why It Matters)
Every feature you ship will eventually break. That’s not pessimism — it’s reality.
But how gracefully your system fails, and how quickly it recovers, is often the difference between a five-minute outage and a PR nightmare.
Here’s why involving on-call engineers early is not just nice — it’s critical:
- They’ve seen the weird edge cases your design doc doesn’t cover.
- They know which logs are useless when things go down.
- They understand how alerts feel when they trigger at 2 AM for the third time that week.
🧠 Real-World Wisdom > Hypothetical Planning
“Everything works in theory — until prod says otherwise.”
On-call engineers bring something you can’t Google or ChatGPT your way into: battle-tested experience.
Just listen to stories like this one:
👉 Postmortem of a Redis Outage at GitHub
You'll notice how tiny assumptions in design led to huge firefights in production.
✅ Design Decisions That On-Call Engineers Would Change
- Timeouts and Retries: It’s easy to say "just retry 3 times". But what happens when every service retries at the same moment and the combined traffic takes down your database?
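// Instead of naive retries, use exponential backoff
async function retryWithBackoff(fn, retries = 5) {
  let delay = 100; // initial wait in milliseconds, doubled after each failure
  let lastError;
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e; // keep the real failure for whoever has to debug this
      if (i === retries - 1) break; // no point waiting after the final attempt
      await new Promise(res => setTimeout(res, delay));
      delay *= 2;
    }
  }
  // Attach the underlying error instead of swallowing it
  throw new Error("Max retries reached", { cause: lastError });
}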
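Backoff alone leaves two gaps an on-call engineer would likely point out: a call that hangs never fails, so it never gets retried, and clients that failed together still retry together. Below is a minimal sketch of a per-attempt timeout plus randomized ("full jitter") delays. The names `withTimeout` and `jitteredDelay` and the default values are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: "full jitter" picks a random wait between 0 and the
// exponential ceiling, so clients that failed together do not retry in lockstep.
function jitteredDelay(attempt, baseMs = 100, capMs = 5000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}
```

One way to combine them is to wrap each attempt, for example `retryWithBackoff(() => withTimeout(fetchOrder(id), 2000))`, where `fetchOrder` stands in for whatever call you are protecting, and to sleep for `jitteredDelay(i)` instead of a fixed doubling delay.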
- Monitoring That Actually Helps: Logs without context are like puzzles with missing pieces. Your alerts should be:
- Actionable
- Prioritized
- Mapped to business impact
Great intro on setting up effective observability:
👉 Monitoring Distributed Systems by Cindy Sridharan
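To make "actionable, prioritized, mapped to business impact" something a review can check, here is a rough sketch. It assumes a made-up alert shape (`summary`, `severity`, `businessImpact`, `runbookUrl`) and a hypothetical `formatLogLine` helper. None of these names come from a specific monitoring tool, so adapt them to whatever yours expects.

```javascript
// Assumed alert fields (illustrative only):
//   summary        -> what is wrong, in one line
//   severity       -> how urgent it is (prioritized)
//   businessImpact -> who or what is affected (mapped to business impact)
//   runbookUrl     -> what to do about it (actionable)
const REQUIRED_ALERT_FIELDS = ["summary", "severity", "businessImpact", "runbookUrl"];

// Returns the names of required fields that are missing or empty.
function missingAlertFields(alert) {
  return REQUIRED_ALERT_FIELDS.filter((field) => !alert[field]);
}

// Builds one JSON log line that carries context (request ID, service, etc.)
// alongside the message, so a log entry can be traced back to its cause.
function formatLogLine(level, message, context = {}) {
  return JSON.stringify({
    ...context,
    timestamp: new Date().toISOString(),
    level,
    message,
  });
}

// Example: this alert would be flagged for lacking impact and a runbook.
const draftAlert = { summary: "Checkout error rate above 5%", severity: "page" };
console.log(missingAlertFields(draftAlert));
console.log(formatLogLine("error", "payment declined", { service: "checkout", requestId: "req-123" }));
```

A check like `missingAlertFields` could run in CI against alert definitions, so an alert without a runbook never reaches the pager in the first place.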
- Failure Modes with Human Impact: Don’t just plan for failover. Ask: What’s the experience for the person fixing it?
👥 Add Them to the Room, Not Just the Rotation
System design isn’t just about how it works when everything's okay. It’s about:
- What happens when it fails?
- Who gets the call?
- Can they fix it fast?
- Can they sleep peacefully after?
That’s why every major design review should have an on-call engineer present.
🛠️ How to Start Including On-Call Input in Design
- 📋 Add an “Operational Review” step in your architecture checklist.
- 🧑‍💻 Have an “On-Call Feedback Doc” linked to each system.
- 💬 Host a quarterly “On-Call Pain Points” meeting.
- ✅ Validate all new features against incident history.
🚨 Because "Works on My Machine" Isn't Enough
Production is where your system faces:
- Real users
- Real spikes
- Real bugs
And it’s where on-call engineers carry the system on their shoulders.
So next time you're drawing that architecture diagram —
Leave room for the one person who knows what breaks at 3 AM.
If you’ve ever been on call — or know someone who has — share this with your team. It might just save someone a sleepless night. 🌙💡
💬 Have on-call horror stories caused by bad system design? Drop them in the comments. Let's make sure no one else repeats the same mistakes.
👉 Follow DCT Technology for more real-world insights, design lessons, and engineering best practices that actually work in production.
#systemdesign #devops #oncall #softwareengineering #webdevelopment #productivity #techlead #incidentmanagement #devcommunity #dcttechnology