DEV Community

binadit

Posted on • Originally published at binadit.com

12 practices that make on-call sustainable for small teams

How small teams can run on-call without burning out (12 actionable practices)

Running reliable infrastructure with a small team? You're probably familiar with this nightmare: the same three engineers getting paged at 2 AM, spending hours on issues that could be automated, and slowly burning out from unsustainable on-call rotations.

I've seen teams of 5-15 engineers maintain 99.9% uptime without killing themselves. Here's how they do it.

The real problem with small team on-call

Unlike companies with dedicated SRE teams, small engineering teams wear multiple hats. Your backend developer is also your infrastructure engineer, database admin, and on-call responder. Traditional on-call practices designed for large teams don't work here.

12 practices that actually work for small teams

1. Set hard escalation rules

Junior engineers shouldn't debug production database issues at 3 AM. Define exactly when to escalate:

  • Customer-facing services down > 15 minutes
  • Any data corruption detected
  • Security incidents
  • After 30 minutes of unsuccessful troubleshooting

This protects both junior engineers from impossible situations and senior engineers from unnecessary wake-ups.
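One way to keep these rules unambiguous at 3 AM is to encode them directly. This is a hypothetical sketch (the condition names and thresholds mirror the list above, not any real tooling):

```shell
#!/bin/bash
# Sketch: encode the escalation rules so there's no judgment call at 3 AM.
# Condition names are made up for illustration; thresholds come from the list above.
should_escalate() {
  local condition="$1" minutes="${2:-0}"
  case "$condition" in
    customer_facing_down)
      # Escalate if a customer-facing service is down > 15 minutes
      [ "$minutes" -gt 15 ] && echo yes || echo no ;;
    data_corruption|security_incident)
      # Always escalate immediately
      echo yes ;;
    troubleshooting)
      # Escalate after 30 minutes of unsuccessful troubleshooting
      [ "$minutes" -gt 30 ] && echo yes || echo no ;;
    *)
      echo no ;;
  esac
}

should_escalate customer_facing_down 20   # prints "yes"
```

A checklist in the runbook works just as well; the point is that the decision is mechanical, not a midnight judgment call.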

2. Write 3 AM-proof runbooks

Your runbooks should work for a sleep-deprived engineer who didn't write them. Include exact commands, expected outputs, and clear escalation points.

```
# Database connection fix - Max time: 5 minutes
# If this doesn't work, escalate immediately

1. Check connection pool:
   docker exec app-container pg_pool_status

2. Expected output: "pool_size: 20, active: <20"

3. If pool exhausted, restart:
   docker restart app-container

4. Verify in 60 seconds:
   curl -f https://app.com/health
```

3. Kill alert fatigue with smart routing

Too many alerts train engineers to ignore their phones. Route alerts by severity:

  • Critical: Phone call + SMS
  • Warning: Slack ping
  • Info: Email only

One alert storm shouldn't destroy your team's trust in the monitoring system.
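The severity-to-channel mapping above can be sketched as a small dispatcher. This is illustrative only: the `echo` lines stand in for whatever phone, Slack, and email integrations you actually use.

```shell
#!/bin/bash
# Sketch: route an alert to a channel by severity.
# The echo statements are placeholders for real integrations
# (e.g. a PagerDuty trigger, a Slack webhook, an email sender).
route_alert() {
  local severity="$1" message="$2"
  case "$severity" in
    critical) echo "PHONE+SMS: $message" ;;
    warning)  echo "SLACK: $message" ;;
    info)     echo "EMAIL: $message" ;;
    *)        echo "SLACK: unknown severity '$severity': $message" ;;
  esac
}

route_alert critical "API error rate above 5%"   # prints "PHONE+SMS: API error rate above 5%"
```

Most paging tools (PagerDuty, Opsgenie) can do this mapping natively via severity-based routing rules, which is usually the better place for it.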

4. Group related alerts intelligently

When your database crashes, you don't need 30 alerts about every dependent service. Configure your monitoring to suppress downstream alerts when upstream services fail.

Most monitoring tools support this; look for features called "alert dependencies" or "suppression rules."
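In Prometheus Alertmanager, for example, this is an inhibit rule. The alert name and label below are assumptions about your setup, not a drop-in config:

```yaml
# Alertmanager sketch: while the database is down, suppress warning-level
# alerts from dependent services. "DatabaseDown" and the "cluster" label
# are assumptions about your alert naming.
inhibit_rules:
  - source_matchers:
      - alertname = "DatabaseDown"
    target_matchers:
      - severity = "warning"
    equal: ['cluster']
```

The `equal` field ensures suppression only applies to alerts sharing the same cluster label, so an outage in one environment doesn't silence another.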

5. Automate common fixes

If your team manually fixes the same issue twice per month, automate it. A common candidate is clearing disk space when logs fill a partition:

```
#!/bin/bash
# Auto-cleanup script for disk space alerts
USAGE=$(df /var/log | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 85 ]; then
    # Delete rotated logs older than 7 days
    find /var/log -name "*.log" -mtime +7 -delete
    # Reload nginx so it releases handles on deleted log files
    systemctl reload nginx
    echo "Cleaned logs, disk usage now: $(df /var/log | tail -1 | awk '{print $5}')"
fi
```

6. Structure handoffs properly

Schedule handoffs at specific times, not "whenever." The outgoing person should brief their replacement on:

  • Current system health
  • Ongoing issues
  • Scheduled maintenance
  • Anything weird they noticed

7. Use dedicated incident channels

Create separate Slack channels for incidents. Keep urgent technical discussion away from general team chat. Include stakeholders like customer success when incidents affect users.

8. Monitor degradation, not just failures

Track early warning signals:

  • Response times increasing
  • Queue depths growing
  • Error rates climbing

This gives on-call engineers time to act before complete failure.
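A degradation check can be as simple as comparing measured latency against a threshold. The endpoint and 500 ms threshold here are illustrative assumptions, not recommendations from any tool:

```shell
#!/bin/bash
# Sketch: flag response-time degradation before a hard failure.
# The 500 ms threshold is an illustrative assumption; tune it to your baseline.
THRESHOLD_MS=500

latency_status() {
  # Classify a latency (in milliseconds) as "degraded" or "ok"
  local ms="$1"
  if [ "$ms" -gt "$THRESHOLD_MS" ]; then
    echo "degraded"
  else
    echo "ok"
  fi
}

# In a real check you would measure, e.g.:
#   MS=$(curl -o /dev/null -s -w '%{time_total}' https://app.com/health \
#         | awk '{printf "%d", $1 * 1000}')
latency_status 620   # prints "degraded"
latency_status 120   # prints "ok"
```

Run checks like this on a schedule and route the "degraded" state as a warning, not a page, so the on-call engineer can act during working hours.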

9. Time-box investigations

Set investigation limits before switching to restoration mode:

  • Performance issues: 30 minutes max
  • Service outages: 15 minutes max
  • Unknown errors: 45 minutes max

After the time limit, restore from backup, switch to standby, or escalate. Debug later.

10. Build redundant notification paths

Don't rely on just Slack. Use:

  • SMS for critical alerts
  • Phone calls for extended outages
  • Push notifications via PagerDuty/Opsgenie
  • Email as backup

Test these monthly.
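A monthly drill can be automated with a synthetic test event. This sketch builds a payload in the shape of the PagerDuty Events API v2; the routing key is a placeholder and the drill naming is made up:

```shell
#!/bin/bash
# Sketch of a monthly notification-path drill via the PagerDuty Events API v2.
# ROUTING_KEY is a placeholder for your integration key.
ROUTING_KEY="your-integration-key"

build_test_event() {
  # Emit the JSON body for a low-severity test page
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"oncall-drill","severity":"info"}}' \
    "$ROUTING_KEY" "$1"
}

build_test_event "Monthly paging drill - please acknowledge"
# To actually send it:
#   build_test_event "Monthly paging drill" | curl -s -X POST \
#     -H "Content-Type: application/json" -d @- \
#     https://events.pagerduty.com/v2/enqueue
```

Have the on-call engineer acknowledge the drill on each path (SMS, phone, push, email) so you discover a dead channel in a drill, not an outage.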

11. Hold regular on-call retrospectives

After every incident, and at least monthly, review what happened:

  • What tools would have helped?
  • Which runbooks need updates?
  • What monitoring gaps exist?

Focus on systemic improvements, not individual blame.

12. Respect boundaries and compensate fairly

Set clear expectations:

  • Acknowledge alerts within 15 minutes
  • Begin investigation within 30 minutes
  • Compensate with additional pay, time off, or flexibility

Rolling this out

Don't implement everything at once. Start with your biggest pain points:

  • Too many false alerts? Begin with alert routing and grouping
  • Chaotic incident response? Focus on communication and runbooks
  • Engineers burning out? Start with escalation boundaries and compensation

Implement 3-4 practices over 2-3 months. Measure the impact with metrics like mean time to resolution and engineer satisfaction.

The bottom line

Sustainable on-call practices aren't about eliminating incidents; they're about handling them efficiently without destroying your team.

Small teams can maintain reliable systems, but only with practices designed for their constraints. These approaches scale with your team and evolve as your systems grow more complex.

