These are my notes on Chapter 29, "Dealing with Interrupts", from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post is SRE book notes: Reliable Product Launches at Scale (Hercules Lemke Merscher, Feb 16, 2 min read).
Any complex system is as imperfect as its creators. In managing the operational load created by these systems, remember that its creators are also imperfect machines.
Polarizing time means that when a person comes into work each day, they should know if they’re doing just project work or just interrupts. Polarizing their time in this way means they get to concentrate for longer periods of time on the task at hand. They don’t get stressed out because they’re being roped into tasks that drag them away from the work they’re supposed to be doing.
For any given class of interrupt, if the volume of interrupts is too high for one person, add another person.
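As a rough illustration of this staffing rule, here is a minimal sketch that estimates how many people one class of interrupt needs. The function name, the idea of a fixed per-person capacity, and the numbers in the usage note are all assumptions made for illustration; the book does not prescribe a formula.

```python
import math


def people_needed(interrupts_per_shift: int, capacity_per_person: int) -> int:
    """Estimate how many people should cover one class of interrupt.

    Both inputs are hypothetical figures a team would have to estimate
    for itself; they do not come from the book.
    """
    if interrupts_per_shift < 0:
        raise ValueError("interrupts_per_shift must not be negative")
    if capacity_per_person <= 0:
        raise ValueError("capacity_per_person must be positive")
    # One person always covers the class; more are added only once the
    # volume exceeds what a single person can absorb.
    return max(1, math.ceil(interrupts_per_shift / capacity_per_person))
```

For example, with an assumed capacity of 5 interrupts per person per shift, `people_needed(12, 5)` returns 3, while `people_needed(3, 5)` returns 1.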
A person should never be expected to be on-call and also make progress on projects (or anything else with a high context switching cost).
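One way to picture polarized time is a daily assignment in which each person is in exactly one mode. The sketch below is only an illustration: the `Mode` enum, the `assign_day` function, and the names in the usage note are hypothetical and not taken from the book.

```python
from enum import Enum


class Mode(Enum):
    PROJECT = "project"
    INTERRUPTS = "interrupts"


def assign_day(team: list[str], on_interrupts: set[str]) -> dict[str, Mode]:
    """Map every team member to exactly one mode for the day.

    Because each person gets a single Mode value, nobody can be expected
    to handle interrupts and make project progress on the same day.
    """
    unknown = on_interrupts - set(team)
    if unknown:
        raise ValueError(f"not on the team: {sorted(unknown)}")
    return {
        person: Mode.INTERRUPTS if person in on_interrupts else Mode.PROJECT
        for person in team
    }
```

Calling `assign_day(["ana", "bo", "cy"], {"bo"})` puts "bo" on interrupts and leaves "ana" and "cy" on project work for that day.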
Sometimes when a person isn’t on interrupts, the team receives an interrupt that the person is uniquely qualified to handle. While ideally this scenario should never happen, it sometimes does. You should work to make such occurrences rare.