DEV Community

Cover image for SRE book notes: Reliable Product Launches at Scale
Hercules Lemke Merscher
Hercules Lemke Merscher

Posted on • Originally published at bitmaybewise.substack.com

1

SRE book notes: Reliable Product Launches at Scale

These are the notes from Chapter 27: Reliable Product Launches at Scale from the book Site Reliability Engineering, How Google Runs Production Systems.

This is a post of a series. The previous post can be seen here:


Experience has demonstrated that engineers are likely to sidestep processes that they consider too burdensome or as adding insufficient value—especially when a team is already in crunch mode, and the launch process is seen as just another item blocking their launch. For this reason, LCE must optimize the launch experience continuously to strike the right balance between cost and benefit.

LCE = Launch Coordination Engineering


In a large organization, engineers may not be aware of available infrastructure for common tasks (such as rate limiting). Lacking proper guidance, they’re likely to re-implement existing solutions. Converging on a set of common infrastructure libraries avoids this scenario, and provides obvious benefits to the company: it cuts down on duplicate effort, makes knowledge more easily transferable between services, and results in a higher level of engineering and service quality due to the concentrated attention given to infrastructure.

Reimplementation of existing solutions is also common in companies where there is a heterogenous set of tools and programming languages. This is a problem that’s usually solved by sticking to a small subset of technologies.

Sticking to a unique programming language or tool can be an option, as it will lay a common ground for all the teams, but needs to be weighed well. As the company grows, different problems and solutions will need to comply with different requirements.

Pick a small subset of tools intentionally chosen, where its trade-offs are well considered. Hype is not a strategy.


When products are successful far beyond any early estimates, and their usage increases by more than two orders of magnitude, keeping pace with their load necessitates many design changes. Such scalability changes, combined with ongoing feature additions, often make the product more complex, fragile, and difficult to operate. At some point, the original product architecture becomes unmanageable and the product needs to be completely rearchitected. Rearchitecting the product and then migrating all users from the old to the new architecture requires a large investment of time and resources from developers and SREs alike, slowing down the rate of new feature development during that period.


A Collection of Best Practices for Production Services


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.


Photo by Andy Hermawan on Unsplash

Image of Datadog

The Future of AI, LLMs, and Observability on Google Cloud

Datadog sat down with Google’s Director of AI to discuss the current and future states of AI, ML, and LLMs on Google Cloud. Discover 7 key insights for technical leaders, covering everything from upskilling teams to observability best practices

Learn More

Top comments (0)

Image of Docusign

🛠️ Bring your solution into Docusign. Reach over 1.6M customers.

Docusign is now extensible. Overcome challenges with disconnected products and inaccessible data by bringing your solutions into Docusign and publishing to 1.6M customers in the App Center.

Learn more