Back in the golden days of 2016, O’Reilly published a collection of essays by Google’s Site Reliability Engineers explaining to the world what they do and how they do it. It covered a lot of excellent ground - version control, build and deployment automation, monitoring, engineering vs. toil… I could go on, but there’s a whole book for that.
Truth be told, much of the book was over my head. I’d like to say it’s just because I’d spent most of my career on the Microsoft side of things, but the more likely reality is that I wouldn’t be qualified regardless of the OS I managed. I talked with some of the folks from my enterprise’s Linux side, and they shared the opinion that the techniques were great but beyond what we were going to accomplish. In fact, that was true for enough people that O’Reilly published the follow-up Seeking SRE, which addressed how to do SRE when you’re not Google.
I even tried to implement some of the practices on my team, with a certain amount of success. I was on an application operations team in an enterprise. We were a group of developers responsible for tier 3 support, monitoring, deployment, and maintenance of the application. There were separate teams for infrastructure operations - we just handled the app. Seems to lend itself perfectly to SRE, right?
We adjusted our monitoring to better reflect the priorities from the book. We automated our formerly manual delivery process. We started proactively investigating the logs to identify performance issues before they caused noticeable customer impact. Things were actually getting better for a while, but I wouldn’t be writing this if that’s where the story ended.
We weren’t Google.
Maybe if I were a Google-caliber engineer, our team could have gotten further ahead. We could have automated more things and increased reliability for less effort. We could have capitalized on those gains and accelerated even faster, enabling the dev team to move quickly and confidently. But y’know what? I really doubt it.
Reliability fell. We started spending more and more time on toil. Support tickets, midnight escalations, and hotfixes were the name of the game. It didn’t matter how fast we could find and fix issues, because more just kept coming through the pipeline, and we couldn’t hand back the pager.
For all that people talk about lacking Google’s technical capabilities, that wasn’t the blocker. We lacked Google’s culture of accountability. It didn’t matter that the ops team got paged - we could get the service back up with minimal impact, and the dev team could continue adding new features the next day. Feature velocity was more important than any service level objective.
Through all of this, the App Dev team was doing their best. So was Product, so was Management. At every level there were smart, talented people with a drive to do the best they could by the customer. The culture just didn’t prioritize the accountability necessary for Google’s implementation of SRE.
That’s the problem with the SRE model when you’re not Google. It’s got nothing to do with the tech - most teams aren’t as good as Google’s, but most teams aren’t at a scale where they have to be. It’s got everything to do with the culture.
If you can’t hand back the pager, no technical capabilities will ever get you to SRE.