SLO in latency terms
A common SLA covers response times.
Let's say your SLA is that every request will be resolved within 300ms. You might then set an SLO of 200ms.
Crucially, notice that you will find out you're breaking your SLO before you break the customer's trust and the service level agreement.
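To make that gap concrete, here is a minimal sketch that sorts a single request's latency into three buckets using the 200ms and 300ms figures from the example. The names `SLO_MS`, `SLA_MS` and `classify` are made up for illustration and are not from any particular monitoring tool.

```python
# Thresholds taken from the example above; the names are hypothetical.
SLO_MS = 200  # internal objective
SLA_MS = 300  # customer-facing agreement


def classify(latency_ms: float) -> str:
    """Say which threshold, if any, a single request's latency has crossed."""
    if latency_ms <= SLO_MS:
        return "ok"
    if latency_ms <= SLA_MS:
        # Slower than the objective, but the customer agreement still holds.
        return "slo_breach"
    return "sla_breach"


for latency in (150, 250, 350):
    print(latency, classify(latency))
```

Anything landing in the middle bucket is your early warning: the SLO is broken, but the SLA is not.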
Going back to the concept of reliability being considered a feature: if you have a clear SLO on response times and you start to break it, then it's essentially an indicator that feature velocity should slow and reliability/performance investments should be prioritised. Yay!
Measuring Reliability
Now this part I love! The thing that you are actually going to measure is called the Service Level Indicator (SLI).
So for example - measuring the response time would be done by checking the latency. So latency is your SLI.
SLIs tend to be expressed as the proportion of events that were good, e.g. how many requests were within the 200ms mark vs how many requests there were in total.
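As a rough sketch, that proportion could be computed like this. The function name `latency_sli` and the idea of passing latencies in as a plain list are assumptions for illustration; in practice these numbers would come from your metrics system.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """Return the fraction of requests at or under the threshold.

    Returns None when there are no requests, since the proportion
    is undefined in that case.
    """
    if not latencies_ms:
        return None
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


# Three of these four requests are within 200ms.
print(latency_sli([120, 180, 250, 90]))  # 0.75
```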
Setting the SLO for the SLI
Wow even I'm now hating the acronyms but let us go on.
So we're measuring response times and we know how many requests were marked as good (200ms or less) from all our requests.
But what SLO target might we set?
SLO targets have a few key features.
- Generally a percentage-based value
- Utilise the SLI
- Cover a timeframe (e.g. the last 4 weeks)
For example:
99% of requests will fall within 200ms over the last 4 weeks.
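Putting the three features together, a check for that example SLO might look something like the sketch below. It assumes requests are available as `(timestamp, latency_ms)` pairs, and `slo_met` is a hypothetical helper. Treating a window with no traffic as "met" is also an assumption you may want to change.

```python
from datetime import datetime, timedelta


def slo_met(requests, now, target=0.99, threshold_ms=200.0,
            window=timedelta(weeks=4)):
    """Check whether the share of good requests in the window meets the target.

    requests: iterable of (timestamp, latency_ms) pairs (assumed shape).
    """
    start = now - window
    # Timeframe: only look at requests inside the window.
    in_window = [lat for ts, lat in requests if start <= ts <= now]
    if not in_window:
        return True  # assumption: no traffic means nothing was breached
    # SLI: proportion of good events.
    good = sum(1 for lat in in_window if lat <= threshold_ms)
    # Percentage-based target.
    return good / len(in_window) >= target


now = datetime(2024, 1, 29)
recent = now - timedelta(days=1)
sample = [(recent, 100.0)] * 98 + [(recent, 500.0)] * 2
print(slo_met(sample, now))  # False: only 98% were within 200ms
```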