Thoroughfare by the Brooks

Using Percentile Tail Latency to Impress Your Elitist Backend Interviewer

When your backend job interviewer is grilling you in the hot seat with pointed questions about performance, it is worth realizing that a large share of decent backend work is optimization. Beyond fancy feature implementations, they want to see how reliable those designs are and whether they can perform under stress.
So, the next time you find yourself in a similar scenario, you should be well-armed to tackle such discussions.
And once you get cracking on the interview, you will inevitably find yourself circling around the major question that the interviewer expects a very professional answer to: "What are the metrics of the system's performance with respect to the requests being evaluated?"

Three Words: Percentile. Tail. Latency

In simple terms, percentile tail latency is a metric we use to report the latency of our backend in a standard format. It tells us how much time requests take to generate and deliver responses across the whole distribution of traffic, not just at a single data point (the metric is properly explained below).
Unlike the faulty mechanisms covered next, this is a robust way to deliver a report on the performance of your backend.
Before we dive into it, let us first take a look at the not-so-elegant ways to derive such metrics...

Simple & Useless Ways to Measure Backend Performance

Suppose you answer the interviewer's question with one of these replies:

  1. "The maximum response time in my backend is 1 second.":

Interviewer: "Oh, that's a fancy number! The costliest response gets delivered in just 1 second? That is great! But I still don't know about the other requests - do they also take somewhere around 1 second to execute? If that is the case, the backend doesn't seem too good at its job...

  1. "The minimum response time in my backend is 1 milli-second.":

Interviewer: "Okay, but I fail to see what implication it has with the entire backend. The response might as well be a 200 on simple calculation, for all I know. That seems to have nothing to do with the performance..."

  1. "The average response time in my backend is 500 milli-seconds":

Interviewer: "That is quite a nice number! But what if your backend has most of the requests being served under 10 milli-seconds and gets a bulky request with a response time for a whole of 5 seconds. That one request messes up the math and pushes your average to seconds from milli-seconds. What then...?"

Can We Be More Specific?

Naturally, the interviewer is not pleased. Each answer failed to account for the system as a whole; none of them was a standard reply that takes the entire distribution of requests into consideration.
Hence, instead of disappointing the interviewer even more, let us now start with the measure that solves the entire problem.

What Is Percentile Tail Latency?

When we say that "we have a 95th percentile tail latency of 1 second", we mean that 95% of the requests being made complete within 1 second.
Similarly, when we say that "we have a 99th percentile tail latency of 2 seconds", it means that 99% of the requests being made complete within 2 seconds.
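As a minimal sketch of how these numbers fall out of raw measurements, here is one way to compute them in Python. The sample data is fabricated, and the `percentile` helper uses the simple nearest-rank method:

```python
import math
import random

# Fabricated sample: 10,000 latencies from a skewed distribution,
# roughly the shape real request latencies tend to have
random.seed(42)
latencies_ms = [random.expovariate(1 / 80) for _ in range(10_000)]

def percentile(samples: list[float], p: float) -> float:
    """Smallest value that at least p% of the samples fall at or below (nearest-rank)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

print(f"p95: {percentile(latencies_ms, 95):.0f} ms")
print(f"p99: {percentile(latencies_ms, 99):.0f} ms")
```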

Graph: a right-skewed plot of frequency of requests versus response time.

If we take a careful look at the graph, we see that the right-skewed distribution means most of the requests we encounter are processed in well under half the time the slowest request needs. Hence, using the maximum, minimum, or average response time as a metric really undersells what our systems are actually capable of and most often leads to false assumptions.
The benefit of this metric is that it addresses the performance of the entire system and lets us pick whichever percentile benchmark we want to report.

How to Measure This?

Typically, to generate such a report, we need to fire thousands of requests, distributed evenly among all the endpoints of our backend service, and gauge how quickly they get processed. This can be done by:

  • Using cloud tools - most CSPs can fire mock requests at the API for you.
  • Using a profiler for your backend tech stack to figure out the latencies. Profilers also surface details and faults where performance can be improved (database indices, timeouts, caches, et cetera).
  • Programmatically firing mock requests at the backend to gauge its response time (see the sketch below).
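For the last option, here is a minimal sketch of the idea. The endpoint URL and request count are placeholders, not anything standard; point it at your own service:

```python
import time
import urllib.request

# Hypothetical placeholder endpoint - substitute your own service URL
ENDPOINT = "http://localhost:8080/api/health"
NUM_REQUESTS = 1_000  # thousands, not tens, or the tail percentiles mean little

latencies_ms = []
for _ in range(NUM_REQUESTS):
    start = time.perf_counter()
    with urllib.request.urlopen(ENDPOINT) as response:
        response.read()  # time the full response body, not just the headers
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"p95: {latencies_ms[int(NUM_REQUESTS * 0.95) - 1]:.1f} ms")
print(f"p99: {latencies_ms[int(NUM_REQUESTS * 0.99) - 1]:.1f} ms")
```

A real load test would fire requests concurrently and spread them across all endpoints; this sequential loop is only meant to show where the timing happens.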

Words of Advice...

  • To make such metrics work, we need to send thousands of requests to the API, not tens or fifties. Otherwise, the whole purpose of the metric is defeated.
  • Really serious developers quote percentiles such as "99.99999" just to flaunt that their sample of requests is large enough to make such a fine-grained tail meaningful, thereby giving a more accurate result. Just something to keep in mind...
  • There may be a case where, say, 95% of the requests get serviced within 1 second (which may be good), but 99% of the requests only get serviced within a much higher time (say, 5 seconds). In such a case, we may feel the urge to quote just the 95th percentile tail latency, but that is a double-edged sword: the interviewer may be impressed, or may ask for the 99th percentile value. Optimize your systems in advance to be prepared for things like this.
  • Because of the previous case, it is always advisable to clearly present both the 95th and 99th percentile tail latencies.
  • Sometimes we may have a 95th percentile value of 1 second (which we assume to be good). If the 99th percentile value is still close to 1 second, something is fishy: the smaller requests are taking about as long as the larger requests to get serviced, which is not a good thing. Rule of thumb: there should be a noticeable gap between the 95th and 99th percentile values (a quick check is sketched after this list).
  • After quoting the numbers, be prepared to explain why the latencies are what they are and whether something can be done to improve them.
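To make that rule of thumb concrete, here is a hypothetical sanity check; the 1.5x ratio is an arbitrary assumption for illustration, not an industry standard:

```python
# Hypothetical sanity check for the p95/p99 gap rule of thumb above.
# The 1.5x ratio is an arbitrary assumption, not a standard threshold.
def check_tail_gap(p95_ms: float, p99_ms: float, min_ratio: float = 1.5) -> None:
    if p99_ms < p95_ms * min_ratio:
        print(f"fishy: p99 ({p99_ms} ms) sits right on top of p95 ({p95_ms} ms)")
    else:
        print(f"looks sane: p99/p95 ratio is {p99_ms / p95_ms:.2f}")

check_tail_gap(p95_ms=1000, p99_ms=1050)  # fishy - hardly any gap
check_tail_gap(p95_ms=1000, p99_ms=2500)  # looks sane
```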

In conclusion, it is worth keeping in mind that the examples above are all assumptions. One second might be a good response time for heavier, specialized cases but drastically bad when we are dealing with more generic and lightweight applications. The benchmark should match your use-case, not what others quote over the internet.

Hope this piece helps those in search of solid performance-gauging metrics for their backends.
