Surviving Oncall: Tales from the Incident Trenches
Ah, oncall. Those late-night alerts that jolt you awake, summoning you to battle unexpected bugs and outages. Thrilling? Sometimes. Stressful? Often. Manageable? Absolutely - with the right preparation and mindset.
Let me share some hard-earned tips to help you survive and thrive during your oncall tours of duty. Think of me as your wise, veteran software knight passing down battle-tested strategies.
Before Your Watch
Set up call forwarding - Make sure pages are forwarded to your phone and can ring through even when it's on Do Not Disturb. You don't want to miss that 3am alert!
Review past incidents - Read summaries from previous oncall shifts to be aware of any ongoing or recurring issues.
Limit non-urgent work - Try not to schedule tasks that require heavy focus like launches or complex debugging. Handle those outside oncall weeks when possible. Oncall is for fighting fires, not building new castles.
Prioritize tech debt - Use oncall weeks to pay down eng debt through tests, automation, and documentation. This makes future oncalls smoother.
During Incidents
Triage alerts - Check alert docs, failing timeseries, and charts to understand the scope and root cause. Swift action contains the damage.
Debug efficiently - Leverage tools like group-bys and CodeHub to quickly identify issues.
Escalate appropriately - Involve relevant teams early if you suspect the problem is not isolated.
Communicate proactively - Keep stakeholders updated on SEV status and next steps through SEV tools and chats.
Support team members - Have empathy for those involved in an incident. Avoid finger pointing.
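To make the group-by tip concrete, here is a minimal sketch of the idea in Python. The event records and field names (host, region, status) are made up for illustration; in practice the same grouping would run inside whatever logging or metrics tool your team uses. Counting failures per dimension shows quickly whether a problem is confined to one host or spread across a whole region, which also helps you decide whether to escalate.

```python
from collections import Counter

# Hypothetical error events; real ones would come from your logging or metrics tool.
events = [
    {"host": "web-1", "region": "us-east", "status": 500},
    {"host": "web-2", "region": "us-east", "status": 500},
    {"host": "web-1", "region": "us-east", "status": 503},
    {"host": "web-3", "region": "eu-west", "status": 500},
]


def group_by(events, key):
    """Count events per distinct value of `key`, most frequent first."""
    return Counter(event[key] for event in events).most_common()


if __name__ == "__main__":
    # Slice the same failures along several dimensions and compare the spread.
    for key in ("host", "region", "status"):
        print(key, group_by(events, key))
```

If one value dominates a dimension (here, most failures share a region rather than a single host), that is a hint the issue may not be isolated to one machine.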
After Your Watch
Write summaries - Document oncall incidents, learnings, and follow-ups while your memory is fresh.
Attend debriefs - Reflect on opportunities to improve documentation, alerts, tests, and the overall oncall process.
Hand off open issues - Ensure the incoming engineer is aware of any ongoing investigations that need follow-up.
Rest and recover - Heroes need downtime before the next battle.
Following structured oncall best practices goes a long way in reducing stress, minimizing fire drills, and strengthening team resilience. With the right preparation, diligence during incidents, and thorough follow-up, you can handle oncall with confidence. Now grab your laptop. We've got outages to fight!