DEV Community

binadit

Posted on • Originally published at binadit.com

12 practices that make on-call sustainable for small teams

How small teams can run on-call without burning out (12 actionable practices)

Running reliable infrastructure with a small team? You're probably familiar with this nightmare: the same three engineers getting paged at 2 AM, spending hours on issues that could be automated, and slowly burning out from unsustainable on-call rotations.

I've seen teams of 5-15 engineers maintain 99.9% uptime without killing themselves. Here's how they do it.

The real problem with small team on-call

Unlike companies with dedicated SRE teams, small engineering teams wear multiple hats. Your backend developer is also your infrastructure engineer, database admin, and on-call responder. Traditional on-call practices designed for large teams don't work here.

12 practices that actually work for small teams

1. Set hard escalation rules

Junior engineers shouldn't debug production database issues at 3 AM. Define exactly when to escalate:

  • Customer-facing services down > 15 minutes
  • Any data corruption detected
  • Security incidents
  • After 30 minutes of unsuccessful troubleshooting

This protects both junior engineers from impossible situations and senior engineers from unnecessary wake-ups.
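One way to keep these rules unambiguous at 3 AM is to encode them directly. This is a hypothetical sketch (the condition names and thresholds mirror the list above, not any real tooling):

```shell
#!/bin/bash
# Sketch: encode the escalation rules so there's no judgment call at 3 AM.
# Condition names are made up for illustration; thresholds come from the list above.
should_escalate() {
  local condition="$1" minutes="${2:-0}"
  case "$condition" in
    customer_facing_down)
      # Escalate if a customer-facing service is down > 15 minutes
      [ "$minutes" -gt 15 ] && echo yes || echo no ;;
    data_corruption|security_incident)
      # Always escalate immediately
      echo yes ;;
    troubleshooting)
      # Escalate after 30 minutes of unsuccessful troubleshooting
      [ "$minutes" -gt 30 ] && echo yes || echo no ;;
    *)
      echo no ;;
  esac
}

should_escalate customer_facing_down 20   # prints "yes"
```

A checklist in the runbook works just as well; the point is that the decision is mechanical, not a midnight judgment call.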

2. Write 3 AM-proof runbooks

Your runbooks should work for a sleep-deprived engineer who didn't write them. Include exact commands, expected outputs, and clear escalation points.

```
# Database connection fix - Max time: 5 minutes
# If this doesn't work, escalate immediately

1. Check connection pool:
   docker exec app-container pg_pool_status

2. Expected output: "pool_size: 20, active: <20"

3. If pool exhausted, restart:
   docker restart app-container

4. Verify in 60 seconds:
   curl -f https://app.com/health
```

3. Kill alert fatigue with smart routing

Too many alerts train engineers to ignore their phones. Route alerts by severity:

  • Critical: Phone call + SMS
  • Warning: Slack ping
  • Info: Email only

One alert storm shouldn't destroy your team's trust in the monitoring system.
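The severity-to-channel mapping above can be sketched as a small dispatcher. This is illustrative only: the `echo` lines stand in for whatever phone, Slack, and email integrations you actually use.

```shell
#!/bin/bash
# Sketch: route an alert to a channel by severity.
# The echo statements are placeholders for real integrations
# (e.g. a PagerDuty trigger, a Slack webhook, an email sender).
route_alert() {
  local severity="$1" message="$2"
  case "$severity" in
    critical) echo "PHONE+SMS: $message" ;;
    warning)  echo "SLACK: $message" ;;
    info)     echo "EMAIL: $message" ;;
    *)        echo "SLACK: unknown severity '$severity': $message" ;;
  esac
}

route_alert critical "API error rate above 5%"   # prints "PHONE+SMS: API error rate above 5%"
```

Most paging tools (PagerDuty, Opsgenie) can do this mapping natively via severity-based routing rules, which is usually the better place for it.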

4. Group related alerts intelligently

When your database crashes, you don't need 30 alerts about every dependent service. Configure your monitoring to suppress downstream alerts when upstream services fail.

Most monitoring tools support this; look for features called "alert dependencies" or "suppression rules."
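In Prometheus Alertmanager, for example, this is an inhibit rule. The alert name and label below are assumptions about your setup, not a drop-in config:

```yaml
# Alertmanager sketch: while the database is down, suppress warning-level
# alerts from dependent services. "DatabaseDown" and the "cluster" label
# are assumptions about your alert naming.
inhibit_rules:
  - source_matchers:
      - alertname = "DatabaseDown"
    target_matchers:
      - severity = "warning"
    equal: ['cluster']
```

The `equal` field ensures suppression only applies to alerts sharing the same cluster label, so an outage in one environment doesn't silence another.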

5. Automate common fixes

If your team manually fixes the same issue twice per month, automate it. A common candidate is clearing disk space when logs fill a partition:

```
#!/bin/bash
# Auto-cleanup script for disk space alerts
USAGE=$(df /var/log | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 85 ]; then
    # Delete rotated logs older than 7 days
    find /var/log -name "*.log" -mtime +7 -delete
    # Reload nginx so it releases handles on deleted log files
    systemctl reload nginx
    echo "Cleaned logs, disk usage now: $(df /var/log | tail -1 | awk '{print $5}')"
fi
```

6. Structure handoffs properly

Schedule handoffs at specific times, not "whenever." The outgoing person should brief their replacement on:

  • Current system health
  • Ongoing issues
  • Scheduled maintenance
  • Anything weird they noticed

7. Use dedicated incident channels

Create separate Slack channels for incidents. Keep urgent technical discussion away from general team chat. Include stakeholders like customer success when incidents affect users.

8. Monitor degradation, not just failures

Track early warning signals:

  • Response times increasing
  • Queue depths growing
  • Error rates climbing

This gives on-call engineers time to act before complete failure.
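A degradation check can be as simple as comparing measured latency against a threshold. The endpoint and 500 ms threshold here are illustrative assumptions, not recommendations from any tool:

```shell
#!/bin/bash
# Sketch: flag response-time degradation before a hard failure.
# The 500 ms threshold is an illustrative assumption; tune it to your baseline.
THRESHOLD_MS=500

latency_status() {
  # Classify a latency (in milliseconds) as "degraded" or "ok"
  local ms="$1"
  if [ "$ms" -gt "$THRESHOLD_MS" ]; then
    echo "degraded"
  else
    echo "ok"
  fi
}

# In a real check you would measure, e.g.:
#   MS=$(curl -o /dev/null -s -w '%{time_total}' https://app.com/health \
#         | awk '{printf "%d", $1 * 1000}')
latency_status 620   # prints "degraded"
latency_status 120   # prints "ok"
```

Run checks like this on a schedule and route the "degraded" state as a warning, not a page, so the on-call engineer can act during working hours.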

9. Time-box investigations

Set investigation limits before switching to restoration mode:

  • Performance issues: 30 minutes max
  • Service outages: 15 minutes max
  • Unknown errors: 45 minutes max

After the time limit, restore from backup, switch to standby, or escalate. Debug later.

10. Build redundant notification paths

Don't rely on just Slack. Use:

  • SMS for critical alerts
  • Phone calls for extended outages
  • Push notifications via PagerDuty/Opsgenie
  • Email as backup

Test these monthly.
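A monthly drill can be automated with a synthetic test event. This sketch builds a payload in the shape of the PagerDuty Events API v2; the routing key is a placeholder and the drill naming is made up:

```shell
#!/bin/bash
# Sketch of a monthly notification-path drill via the PagerDuty Events API v2.
# ROUTING_KEY is a placeholder for your integration key.
ROUTING_KEY="your-integration-key"

build_test_event() {
  # Emit the JSON body for a low-severity test page
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"oncall-drill","severity":"info"}}' \
    "$ROUTING_KEY" "$1"
}

build_test_event "Monthly paging drill - please acknowledge"
# To actually send it:
#   build_test_event "Monthly paging drill" | curl -s -X POST \
#     -H "Content-Type: application/json" -d @- \
#     https://events.pagerduty.com/v2/enqueue
```

Have the on-call engineer acknowledge the drill on each path (SMS, phone, push, email) so you discover a dead channel in a drill, not an outage.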

11. Hold regular on-call retrospectives

After every incident, and at least monthly, review what happened:

  • What tools would have helped?
  • Which runbooks need updates?
  • What monitoring gaps exist?

Focus on systemic improvements, not individual blame.

12. Respect boundaries and compensate fairly

Set clear expectations:

  • Acknowledge alerts within 15 minutes
  • Begin investigation within 30 minutes
  • Compensate with additional pay, time off, or flexibility

Rolling this out

Don't implement everything at once. Start with your biggest pain points:

  • Too many false alerts? Begin with alert routing and grouping
  • Chaotic incident response? Focus on communication and runbooks
  • Engineers burning out? Start with escalation boundaries and compensation

Implement 3-4 practices over 2-3 months. Measure the impact with metrics like mean time to resolution and engineer satisfaction.

The bottom line

Sustainable on-call practices aren't about eliminating incidents; they're about handling them efficiently without destroying your team.

Small teams can maintain reliable systems, but only with practices designed for their constraints. These approaches scale with your team and evolve as your systems grow more complex.

