Getting At The Good Stuff: How To Sample Traces in Honeycomb

#tracing #sampling #observability #devops

Sampling is a must for applications at scale. It is a technique for reducing the burden on your infrastructure and telemetry systems by keeping data on a statistical sample of requests rather than on 100% of them. Large systems may produce large volumes of similar requests, which can be de-duplicated.

This article is for developers and operators who have added Honeycomb instrumentation to their applications and wish to learn more about sampling as a technique for controlling costs and resource utilization while maximizing the value of their telemetry data. In this article, you’ll learn about the various sampling techniques and decision strategies as well as their benefits, drawbacks, and use cases. Finally, there are some tips on where to look in case you’re running into trouble.

Instrumentation isn't free

Instrumentation always has a cost. Sampling can reduce this cost but will not entirely eliminate it.

There are, of course, costs associated with sending and storing the additional data (in Honeycomb or any platform).
There’s also an infrastructure cost to capture instrumentation data, handle it, make decisions about what to do with it (like sampling) and transmit it somewhere.
- This cost is typically paid with an incremental amount of CPU usage, memory allocation and/or network bandwidth.
- If your application and infrastructure are not CPU-bound or Network IO-bound, the impact on request latency is likely negligible.

These costs vary by programming language. All instrumentation tooling vendors and projects work hard to minimize impact, typically by processing and sending data on separate threads so that applications are not blocked while they serve requests. Latency may be affected more in runtimes that were not designed for native parallelism (for example, Ruby and Python have a GIL, and Node.js is essentially single-threaded).

Although you can't eliminate the overhead, you can move it around within (and out of) your infrastructure. For example, Honeycomb provides samproxy (currently in alpha), which you can run in your infrastructure to buffer and send traces, or our new Refinery feature (currently in beta), which handles this on the collection side.

Sampling traces is trickier than sampling individual events. You want to avoid situations where traces aren’t completely collected (some spans thrown away) because this is frustrating for developers trying to debug their code. Not all event-sampling techniques will work since trace spans can be distributed among multiple threads, processes, or even servers and processed asynchronously. More on this below.

How sampling works

Honeycomb can decide which request traces or events to keep and which ones to drop in a number of different ways:

Random sampling evaluates each event (whether or not it is a trace span) on a simple probabilistic basis: each event has an equal chance of being selected, and there are no selection criteria other than the probability (for instance, 1 in 10). This is not good for tracing because it doesn’t guarantee you’ll get all the spans of a trace!
Deterministic sampling is a probabilistic technique that works by taking a hash of the trace ID and converting the first 4 bytes of the hash to a numeric value, for easy and fast decision-making based on a target sample rate. Consistent sampling is ensured for all spans in the trace because the trace ID is propagated to all child spans and the sampling decision is made the same way.
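To make the hashing step concrete, here is a minimal sketch of a deterministic decision. It assumes SHA-1 as the hash and a big-endian reading of the first 4 bytes; the text above only says "a hash", so treat both as illustrative choices. The function name should_keep is hypothetical, not part of any Honeycomb SDK.

```python
import hashlib
import struct

MAX_UINT32 = 2**32 - 1


def should_keep(trace_id: str, sample_rate: int) -> bool:
    """Return True if this trace falls inside a 1-in-`sample_rate` sample."""
    if sample_rate <= 1:
        return True  # a rate of 1 keeps everything
    # Assumption: SHA-1 is used purely as a fast, well-distributed hash.
    digest = hashlib.sha1(trace_id.encode("utf-8")).digest()
    # Read the first 4 bytes of the hash as an unsigned 32-bit integer.
    (value,) = struct.unpack(">I", digest[:4])
    # Keep the trace when the value lands in the lowest 1/sample_rate of the range.
    return value <= MAX_UINT32 // sample_rate
```

Because the result depends only on the trace ID and the rate, every service that sees a span from the same trace reaches the same answer without any coordination.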
Target rate sampling delivers a given rate of collected telemetry (e.g. 500 collected spans per minute), decreasing the sample probability if there’s an increase in traffic, and increasing the sample probability if traffic drops off.
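One simple way to picture target rate sampling is to recompute the sample rate at the end of each time window from the traffic seen in that window. The sketch below is an assumption about how such an adjustment could look, not a description of Honeycomb's implementation; next_sample_rate and its parameters are hypothetical names.

```python
def next_sample_rate(observed_events: int, target_events: int) -> int:
    """Pick the 1-in-N rate for the next window from the last window's traffic."""
    if target_events <= 0:
        raise ValueError("target_events must be positive")
    # Ceiling division, so that keeping 1 in N stays at or under the target.
    # The rate never drops below 1, which means "keep everything".
    return max(1, -(-observed_events // target_events))
```

With a target of 500 spans per minute, a minute that saw 5,000 spans yields a rate of 10 (keep 1 in 10), while a quiet minute with 300 spans yields a rate of 1.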
Rule-based sampling is a variant of deterministic sampling in which you make your sampling decision based on properties of the request. Think of this as the data contained in the HTTP headers: endpoint (URL), user-agent, or any arbitrary header value that you know will exist. This is cool because it allows you to fine-tune sampling rates based on your needs. For instance, you can keep all write-path data but sample the higher-traffic read-path data.
- Rule-based sampling is compatible with both traces and events. However, rules that depend on trace data that isn’t known at the start of the request (such as status codes) need to consider the entire request, which means they are only compatible with tail-based sampling.
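As an illustration of the write-path versus read-path example, the sketch below maps request properties to a sample rate using an ordered list of rules. Everything in it is hypothetical: the Rule type, the field names method and path, the /healthz endpoint, and the specific rates are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

Request = Mapping[str, str]


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Request], bool]
    sample_rate: int  # keep 1 in sample_rate; 1 keeps everything


# Rules are checked in order and the first match wins.
RULES: Sequence[Rule] = (
    Rule("writes", lambda r: r.get("method") in {"POST", "PUT", "PATCH", "DELETE"}, 1),
    Rule("health checks", lambda r: r.get("path") == "/healthz", 1000),
    Rule("reads", lambda r: r.get("method") == "GET", 20),
)
DEFAULT_RATE = 10


def sample_rate_for(request: Request) -> int:
    """Return the sample rate of the first rule that matches the request."""
    for rule in RULES:
        if rule.matches(request):
            return rule.sample_rate
    return DEFAULT_RATE
```

The rate returned here would then feed a deterministic keep-or-drop decision keyed on the trace ID, so that all spans of a trace are treated alike.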
Dynamic sampling combines rule-based and target rate sampling. It delivers a given sample rate, weighting rare traffic and frequent traffic (for example by status code or endpoint) differently so as to end up with the correct average. Frequent traffic is sampled more heavily, while rarer events are either all kept or sampled less heavily. This is the strategy you want to use if you are concerned about keeping high-resolution data about unusual events while maintaining a representative sample of your application’s behavior.
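The weighting idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Honeycomb's algorithm: it shares a budget of kept events between keys in proportion to the logarithm of their traffic, and it does not redistribute budget left unused by rare keys, so the overall rate only approximates the goal. The function name and parameters are hypothetical.

```python
import math
from typing import Dict, Mapping


def dynamic_sample_rates(counts: Mapping[str, int], goal_rate: int) -> Dict[str, int]:
    """Give each key a 1-in-N rate so frequent keys are sampled harder than rare ones."""
    total = sum(counts.values())
    if total == 0:
        return {}
    goal_kept = total / goal_rate  # events we would like to keep overall
    # Assumption: share the budget by log10(count + 1), so a key with 100x
    # the traffic gets only a few times the budget of a rare key.
    weights = {key: math.log10(n + 1) for key, n in counts.items()}
    weight_sum = sum(weights.values())
    rates: Dict[str, int] = {}
    for key, n in counts.items():
        budget = max(1.0, goal_kept * weights[key] / weight_sum)
        rates[key] = max(1, math.ceil(n / budget))
    return rates
```

Given a window with 10,000 requests that returned status 200 and 10 that returned status 500, and a goal rate of 10, this sketch keeps every 500 while sampling the 200s more heavily than the goal rate.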

How sampling works for traces

Head-based sampling is where the sampling decision is made when the root span is created, typically when the request is first seen by a tracing-aware service. At this point only metadata is known about the request (e.g. information in the HTTP header fields). The sampling decision is then propagated to child spans, typically by adding an HTTP header that denotes whether the request is to be captured. This is most compatible with Deterministic and Rule-based sampling where the rules are the same on all services and the fields are known in advance.
Tail-based sampling is where the sampling decision is made once the entire request has completed and all of its spans have been collected. This is much more expensive than Head-based sampling because all trace spans must be collected, buffered and then processed for sampling decisions. This is best combined with Dynamic sampling because all of the information is available to identify the “interesting” traces by type (including status code) and volume.
- We advise against performing dynamic tail-based sampling in your app (in-process) because a single process cannot buffer the trace spans produced by downstream services, and the overhead may adversely affect performance.
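To show why tail-based sampling needs buffering, here is a minimal sketch of a collector-side buffer. It assumes spans are dictionaries with trace_id, parent_id and status_code fields (hypothetical names), that a missing parent ID marks the root span, and that the root span arrives last. A real collector would also need timeouts and memory limits for traces whose root never arrives.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Span = Dict[str, Any]


class TailSampler:
    """Buffer spans per trace and decide once the root span has arrived."""

    def __init__(self, keep_trace: Callable[[List[Span]], bool]) -> None:
        self._keep_trace = keep_trace
        self._buffers: Dict[str, List[Span]] = defaultdict(list)

    def add_span(self, span: Span) -> List[Span]:
        """Return the spans to send now: the whole trace, or nothing."""
        trace_id = str(span["trace_id"])
        self._buffers[trace_id].append(span)
        if span.get("parent_id") is not None:
            return []  # not the root span, so the trace is still in flight
        spans = self._buffers.pop(trace_id)
        return spans if self._keep_trace(spans) else []


def has_error(spans: List[Span]) -> bool:
    """Example policy: keep any trace that contains a 5xx status code."""
    return any(int(s.get("status_code", 200)) >= 500 for s in spans)
```

A sampler built with TailSampler(has_error) holds every span until the root arrives, then sends the full trace only if one of its spans failed. That holding period is the memory cost that makes this approach more expensive than deciding at the head.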

Troubleshooting your sampling implementation

Seeing increased request latency?
If your existing metrics systems have picked up a significant increase in request latency, it’s important to quantify and isolate the sources of the latency. Is the effect on latency different during periods of high load? If so, this is most likely a sign of infrastructure contention. Increasing the sample rate (that is, keeping 1 in a larger number of traces, and so sending fewer of them) may help in some cases, as may using feature flags to introduce finer granularity and control over sample rates.

If the increase in request latency is visible even when the system is idle, this is very rare and might be an issue with the instrumentation itself or how it was integrated. Reach out to Honeycomb support and we can help.

Missing trace spans?
If you’ve enabled sampling and see missing trace spans, there are several potential avenues of investigation:

If you’ve customized your sampling logic, ensure you’re using a sampling decision that is compatible with your sampling technique. For example, don’t combine Head-based sampling with Random or Dynamic sampling techniques.
If trace spans are coming from different services, ensure that consistent sampling rules are being applied to all services and that there are no network issues preventing any services from sending events.
If you see the whole trace but it claims a missing root span, be sure you're not setting the Parent ID field on the root span - its absence is what indicates the root span. (Note: this only applies to users of the libhoney SDK, the Honeycomb Beelines take care of this for you automatically.)

Want more information?

For more information on sampling, check out the following resources:

Need this kind of flexibility for your instrumentation and logs? Get started with Honeycomb for free.
