
Kate Yanchenko

The art of making good decisions in development or SRE at Google

I took inspiration from Google's Site Reliability Engineering book and applied it to my own experience at Amazon and Wargaming.

If you want to work on high-load, scalable projects and provide good availability, say 99.99% to 99.9999%, you need to be able to ask the right questions. Here we will look at which ones, so take a seat.

  • What availability and latency should your services have?
  • How to measure uptime? What alarms should be of concern?
  • When to add a feature and when to work on technical debt.
  • The art of moving toward a better job.

What availability should your services have


Availability

First things first: do you really need better availability?
Should you provide more than 99% availability if 80% of your customers use cell phones whose own availability is close to 99%? In general, there is no point in providing more availability than that: you will pay for it, and the end user will not even notice.

100% is probably never the right availability goal: not because it's impossible, but because it's usually more reliability than the end users of the service want or can even notice.

If the cost of adding another nine exceeds the value your business can extract from it, committing to that higher availability target is not worth it: it makes the system more expensive and more complex, and harder to maintain in the future. Keeping this in mind can ease the pressure from product leaders when people start tinkering with the numbers.

Many of the services you depend on are themselves not 100% reliable, and you need to take that into account as well. For example, Google noticed that Chubby (their distributed lock manager) was so reliable that teams came to expect it to always work. So Google introduced controlled downtime for the service, as long as it stayed within the error budget.
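
To make the cost of each extra nine concrete, here is a minimal sketch (the 30-day window and the helper name `allowed_downtime_seconds` are just illustrative choices) of how little downtime each additional nine leaves you:

```python
# Rough downtime allowed per availability target over a 30-day window.
# Purely illustrative: the targets and the window are arbitrary examples.

SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60

def allowed_downtime_seconds(availability: float,
                             window_seconds: int = SECONDS_PER_30_DAYS) -> float:
    """How many seconds of downtime the target leaves within the window."""
    return (1.0 - availability) * window_seconds

for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target:.3%} availability -> {minutes:8.1f} minutes of downtime per 30 days")
```

In this 30-day window, 99.99% leaves roughly 4.3 minutes of downtime and 99.999% only about 26 seconds; that difference is invisible to a user behind a ~99%-available mobile network.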

Latency

Usually you need to measure p90 to p99.99 to track the latency you actually provide to end users. Measuring only the average is not fair, because it does not show how the latency distribution changes over time.
It is also important to distinguish the latency of successful requests from the latency of failed requests. A slow error response is even worse than a fast error, so think about whether you can fail fast.
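
As a rough illustration of why averages hide the tail, here is a small sketch with invented request data; it prints the overall mean, the p99 of successful requests, and the error latencies separately (the `percentile` helper is a simple nearest-rank implementation, not a library call):

```python
# Invented request log: (latency_ms, success) pairs.
# The point: the mean looks healthy while the tail and the errors do not.
import statistics

def percentile(values, p):
    """Nearest-rank percentile; good enough for an illustration."""
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[index]

requests = [(120, True)] * 90 + [(900, True)] * 9 + [(3000, False)] * 1

ok_latencies = [ms for ms, ok in requests if ok]
err_latencies = [ms for ms, ok in requests if not ok]

print("mean, all requests:", statistics.mean(ms for ms, _ in requests))  # 219 ms: looks fine
print("p99, successes:    ", percentile(ok_latencies, 99))               # 900 ms: the tail shows up
print("error latencies:   ", err_latencies)                              # a 3-second error hurts twice
```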

How and what to measure

The actual uptime is measured by a third party: the monitoring system. Monitoring also lets you spot long-term trends: how big is your database and how quickly is it growing? Does an extra node improve the cache hit rate?
Based on this data you can write post-mortems and identify root causes in a data-driven way.

White-box monitoring
Monitoring based on metrics exposed by the internals of the system: logs, HTTP handlers, a JVM profiler, and so on.
You need metrics at every layer to quickly understand what is wrong. For example, if you measure the calls your service makes to the database and notice a sharp increase in latency, but the database's own metrics show no such picture, the problem is most likely in the network between them.
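
A hedged sketch of that reasoning, with invented numbers and thresholds: compare the database latency as observed from the service with the latency the database itself reports for the same calls, and use the gap to decide where to look first.

```python
# Illustrative only: real systems would take these numbers from their
# white-box metrics (service-side client latency vs. DB-side query latency).

def likely_cause(service_seen_db_ms: float, db_reported_ms: float,
                 gap_threshold_ms: float = 50.0) -> str:
    """Guess where the extra latency lives, based on the gap between the two views."""
    gap = service_seen_db_ms - db_reported_ms
    if gap > gap_threshold_ms:
        # The database thinks it is fast, but the caller waits much longer:
        # the time is probably spent in between (network, connection pool, proxies).
        return "suspect the path between service and database (e.g. network)"
    return "suspect the database itself (or the query)"

print(likely_cause(service_seen_db_ms=480.0, db_reported_ms=20.0))
print(likely_cause(service_seen_db_ms=480.0, db_reported_ms=450.0))
```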

Black-box monitoring
Testing externally visible behaviour as a user would see it. 
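
A minimal black-box probe might look like the sketch below; the URL, the expected text, and the timeout are placeholders rather than a real check.

```python
# Probe the service the way a user would and judge only externally visible behaviour.
import urllib.request

def probe(url: str, expected_snippet: str, timeout_s: float = 5.0) -> bool:
    """True if the endpoint answers in time, with HTTP 200 and the expected content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected_snippet in body
    except Exception:
        return False  # timeouts, DNS failures, TLS errors all count as "down" for the user

print(probe("https://example.com/", "Example Domain"))
```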

Alert
Trigger an alert only when something is broken and a human needs to fix it right now!

Errors
The rate of requests that fail, either explicitly (an HTTP 500) or implicitly (for example, an HTTP 200 response with the wrong content).

Saturation
How "full" your service is. Resources differ from type of service you measure and logic it uses, it can be CPU, disk, memory, I/O constraints.

What metrics should trigger an alarm?

Trigger an alarm only when:

  • it detects an otherwise-undetected condition that is urgent, actionable, and user-visible;
  • you will never ignore that alarm;
  • it indicates that users were actually affected;
  • you can act on the page. Every page that happens now distracts a human from improving the system tomorrow.

A small code sketch of this checklist follows the Gmail example below. Now, some good examples of over-alerting:

BigTable's tale of over alerting in Google:

Many years ago, Bigtable's SLO was dominated by a "large" latency tail caused by problems in the lower layers of the storage stack. Email alerts were triggered every time the SLO was exceeded.

To remedy the situation, the team used a three-pronged approach: while making great efforts to improve Bigtable's performance, they also dialed back the SLO target, using p75 latency, and disabled email alerts, since there were so many that spending time diagnosing them was infeasible.

This strategy gave the team enough breathing room to actually fix the long-term problems in Bigtable and the lower layers of the storage stack, rather than constantly fixing tactical problems.

Gmail: predictable responses from humans

In the very early days of Gmail, the service was built on a distributed process management system called Workqueue, which was "adapted" to long-lived processes and applied to Gmail. Certain bugs in the scheduler proved hard to beat.

At that time, Gmail's monitoring was structured so that alerts fired when individual tasks were "de-scheduled" by Workqueue. Gmail had many, many thousands of tasks, and each task represented only a fraction of a percent of users, so this alerting setup was unmaintainable, even though the team deeply cared about user experience.

The Gmail SRE team built a tool that helped "poke" the scheduler in just the right way to minimize the impact on users. The team had many discussions about whether they should introduce this hack now instead of investing in a long-term solution.

This kind of tension is common within many companies, and it often reflects a mistrust of the team's self-discipline: some want to implement a "hack" to buy time for a proper fix, while others worry that the hack will be forgotten or that the proper fix will be deprioritized indefinitely.

Managers and technical leaders play a key role in implementing true long-term fixes by prioritizing potentially time-consuming long-term work even after the initial "pain" of paging subsides.
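
To make the checklist from the list above concrete, here is a purely illustrative sketch (the fields and the decision rule are invented for this post, not Google's implementation): page a human only when the condition is user-visible, actionable, and urgent; otherwise file a ticket.

```python
from dataclasses import dataclass

@dataclass
class Condition:
    user_visible: bool  # are real users affected right now?
    actionable: bool    # can the on-call engineer actually do something about it?
    urgent: bool        # does it get worse if it waits until working hours?

def should_page(cond: Condition) -> bool:
    """Page only when all three hold; everything else becomes a ticket."""
    return cond.user_visible and cond.actionable and cond.urgent

print(should_page(Condition(user_visible=True, actionable=True, urgent=True)))   # page
print(should_page(Condition(user_visible=False, actionable=True, urgent=True)))  # ticket instead
```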

Your error budget


Is it a good time to add a feature, or is it time to pause features and work on technical debt instead?
At Google there is a system for this: the error budget. Imagine you are developing a new service and the project lead defines the SLOs (Service Level Objectives), which include availability and latency.

Before a release, the team checks whether any error budget is left. If your measured availability is below or too close to the target, you need to work on reliability improvements and release the features in a later quarter; otherwise you can release. Simple logic, right?
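
A minimal sketch of such an error-budget gate, with a hypothetical SLO and request counts; a real setup would track this in the monitoring system rather than in a script.

```python
SLO_AVAILABILITY = 0.999  # hypothetical quarterly objective agreed with the product lead

def can_release(total_requests: int, failed_requests: int) -> bool:
    """Release only while the quarter's error budget is not spent."""
    budget = (1.0 - SLO_AVAILABILITY) * total_requests  # how many failures we may "spend"
    return failed_requests < budget

print(can_release(total_requests=10_000_000, failed_requests=4_000))   # budget left -> ship the feature
print(can_release(total_requests=10_000_000, failed_requests=12_000))  # overspent -> work on reliability first
```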

The art of moving toward a better job

It's a true art: you need to avoid routine operational work and dedicate that time to automation instead. At Google this kind of work is called toil. Toil becomes toxic when it is experienced in large quantities. If you are burdened with too much toil, you should be concerned enough to complain loudly.

Too much toil usually means career stagnation: you won't get promoted for manual work that provides no lasting impact. Worse, it is dangerous: it can lead to burnout. And once people from other teams expect your team to do that manual work, it creates confusion, slows progress, and increases the motivation for your team members to look for a more rewarding job.

Besides black art, there is only automation and mechanization 
(Federico Garcia Lorca 1898–1936, Spanish poet and playwright)

To sum up: favor long-term solutions and keep increasing the team's productivity.
