DEV Community: Varun

Sampling Strategies in Distributed Tracing - A Comprehensive Guide

Varun — Fri, 28 Jul 2023 17:07:21 +0000

If you are running a distributed system where each request talks to more than a couple of services, databases, and a queuing system, pinpointing the cause of an issue is not a trivial affair. The complexity increases as the number of services increases, as east-west traffic goes up, as teams get split up, and as data tends to eventual consistency. There are a plethora of tools aiming to solve this problem to various degrees. Perhaps the most critical tooling in this workflow is distributed tracing. We have even argued that a well-implemented distributed tracing solution might subsume logging into its fold.

Yet, in reality, tracing end-to-end request flows has been an afterthought in most companies, as postulated in this article. One of the biggest challenges plaguing widespread adoption is the sheer volume of trace data. Capturing, storing, indexing, and querying this massive dataset will not only hurt performance and add significant noise, but also break the bank :)

The solution to this data volume problem is the tried and tested strategy of sampling. In this article, we will delve deeper into various sampling strategies, their advantages, and shortcomings, and see what an ideal sampling strategy for distributed tracing should look like.

An Introduction to Sampling

Sampling is selectively capturing a subset of traces for analysis, rather than capturing and storing every trace. In other words, capture only the ‘interesting’ traces. There may be many events in a distributed architecture that define ‘interesting’. For instance:

Latency events: Traces exceeding a certain latency threshold can be selectively sampled. By focusing on high latency traces, you can identify performance bottlenecks and areas for improvement.
Errors: Selectively sample traces that result in errors or exceptions. By capturing these error traces, you can investigate and address issues that negatively impact the system's reliability and stability.
Priority events: Assigning priorities to different types of requests or services can help determine which traces to sample. For example, you might assign higher priority to critical services or specific user interactions, ensuring that traces related to these high-priority components are captured.
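To make the three kinds of ‘interesting’ above concrete, here is a minimal Python sketch of a predicate that tags a finished trace with the reasons it is worth keeping. The TraceSummary type, the 500 ms threshold, and the list of high-priority services are all assumptions made up for illustration; they are not part of any tracing library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace (field names are illustrative)."""
    duration_ms: float
    has_error: bool
    service: str


# Assumed values, chosen only for this example.
LATENCY_THRESHOLD_MS = 500.0
HIGH_PRIORITY_SERVICES = {"checkout", "payments"}


def interesting_reasons(trace: TraceSummary) -> List[str]:
    """Return the reasons (if any) why this trace is worth keeping."""
    reasons = []
    if trace.duration_ms > LATENCY_THRESHOLD_MS:
        reasons.append("latency")
    if trace.has_error:
        reasons.append("error")
    if trace.service in HIGH_PRIORITY_SERVICES:
        reasons.append("priority")
    return reasons


print(interesting_reasons(TraceSummary(820.0, False, "search")))   # a slow request
print(interesting_reasons(TraceSummary(40.0, False, "search")))    # a routine request
```

A trace with an empty list of reasons is a candidate for dropping. Note that every one of these checks needs information that only exists once the request has finished, which matters for the strategies discussed next.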

Most companies try to capture these events more holistically, which mostly boils down to one of these two strategies.

Head-based sampling: Traces are sampled randomly, typically based on a predefined sampling rate or probability. For instance, a system may randomly sample 1% of all traces. The principle here is that in a large enough dataset, a 1% or an x% sample will, with high probability, include examples of most traces of interest.
Tail-based sampling: Tail sampling operates under the principle that rare but impactful events occur sporadically. These tail events often indicate performance bottlenecks, service degradation, or other issues that need attention. Collecting data on every request or transaction, where success cases are the norm, is therefore impractical and inefficient, so the decision to keep a trace is made only after the trace has completed.

Head-based sampling in OpenTelemetry

Head-based sampling is where the sampling decision is made right at the start of the trace. In other words, irrespective of whether the trace is ‘interesting’ or not, the decision to drop or keep a span gets made through a simple algorithm that is based on the desired percentage of traces to sample.

The OpenTelemetry project comes with four built-in samplers that can be selected through configuration.

AlwaysSample and NeverSample:

These samplers are self-explanatory: they make no per-trace decision and either sample everything or nothing, as the case may be. (These are the names used in the Go SDK; the OpenTelemetry specification calls them AlwaysOn and AlwaysOff.)

TraceIdRatioBased:

The TraceIdRatioBased sampler makes decisions based on the configured ratio and the trace ID of each span. For instance, if the ratio is set to 0.1, the sampler will aim to sample approximately 10% of traces. One way to implement this is to look at the TraceID of the span. TraceIDs are randomly generated 128-bit numbers. The sampler treats the TraceID as a number between 0 and 1 by dividing it by the maximum possible 128-bit number. If this number is less than the ratio set when creating the sampler, the sampler samples the span. Otherwise, it doesn't.

The key when implementing this sampler is to have a deterministic hash of the TraceId when computing the sampling decision. This basically ensures that running the sampler on any child Span will always produce the same decision as the root span.
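The ratio check described above can be sketched in a few lines of Python. This is an illustration of the idea only, under the assumption that trace IDs are uniformly random 128-bit integers; real SDK implementations differ in detail. The names should_sample and TRACE_ID_SPACE are made up for this example.

```python
import random

TRACE_ID_SPACE = 2**128  # number of distinct 128-bit trace IDs


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep the trace when its ID falls in the lowest `ratio` share of the ID space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Same idea as trace_id / TRACE_ID_SPACE < ratio, done on integers
    # so that very large IDs are not affected by float rounding.
    return trace_id < int(ratio * TRACE_ID_SPACE)


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is a pure function of the trace ID, any service that sees the same trace ID reaches the same answer without coordinating with the others.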

Parent Based Sampler:

This sampler makes decisions based on the sampling decision of the parent span. If a span has a parent span and the parent span is sampled, then the child span will also be sampled. This is useful for ensuring that entire traces are sampled (or not sampled) consistently.

The OpenTelemetry default sampler is a composite sampler and is essentially ParentBased with root=AlwaysSample. The root can be changed to a TraceIdRatioBased sampler to sample only a fraction of traces. Child spans are then sampled or not based on the parent span’s sampling decision. Once that decision is made when the first span is created, it gets propagated to all the subsequent child spans as the request flows through the system. This means that the entire trace is sampled as a whole with no span gaps in the middle.
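Here is a rough Python sketch of that composite behavior: the root span consults a ratio-based sampler, and every child simply copies the decision it received from its parent. The function names and the simulate_request helper are hypothetical and exist only to show the propagation; they are not SDK APIs.

```python
from typing import List, Optional

TRACE_ID_SPACE = 2**128


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Root sampler: keep roughly `ratio` of all traces, keyed on the trace ID."""
    return trace_id < int(ratio * TRACE_ID_SPACE)


def parent_based_decision(trace_id: int, parent_sampled: Optional[bool], ratio: float) -> bool:
    """Decide for one span.

    parent_sampled is None for a root span; otherwise it is the sampled flag
    that arrived with the parent's propagated context.
    """
    if parent_sampled is None:
        return ratio_decision(trace_id, ratio)  # only the root consults the ratio
    return parent_sampled                       # children copy the parent's decision


def simulate_request(trace_id: int, hops: int, ratio: float) -> List[bool]:
    """Return the sampling decision of each span along a chain of `hops` services."""
    decisions = []
    parent = None
    for _ in range(hops):
        parent = parent_based_decision(trace_id, parent, ratio)
        decisions.append(parent)
    return decisions


print(simulate_request(trace_id=1, hops=4, ratio=0.1))        # low ID: whole trace kept
print(simulate_request(trace_id=2**127, hops=4, ratio=0.1))   # mid-range ID: whole trace dropped
```

Either every span of a request is kept or none is, which is the "no gaps" property described above.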

Head sampling is easy to set up and maintain

The great thing about creating a Head sampling strategy is that it is:

Simple to execute at scale: As long as the sampler’s configurations are set properly during the instrumentation of the agent, the end-to-end spans of a trace will get sampled with no gaps in between.

Can be made efficient: In high-volume systems, a lower sampling ratio is most probably enough to capture both interesting and uninteresting traces. This is also efficient because the sampling decision is propagated to all child spans with a ParentBased sampler.

Unbiased: Sampling traces is purely random and does not look at any properties of the span or the trace to make a decision.

Performant: The decision to sample is made at the start with a quick algorithm instead of holding data in memory to make the decision later.

The intangible costs of this strategy outweigh its benefits

Most companies when implementing OpenTelemetry end up executing the Head sampling strategy with a TraceIdRatioBased sampler at the root. While there are clear benefits as outlined above, the intangible costs of this strategy outweigh the benefits, especially in large-scale, high-volume clusters.

1. This is a noisy strategy.

There is a probability attached to whether an interesting trace is captured or not. This means that when looking to debug a potential error, there is a likelihood that the trace might not even have been captured. On top of this, most platforms dump traces onto a dashboard, which implies that the developer has to search, query, and filter from the list of traces to identify a potential trace knowing that there is no guarantee that the trace is even captured.
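A quick back-of-the-envelope calculation shows how large this probability gap can be. Assuming each trace is sampled independently with ratio r, the chance that at least one of k occurrences of a problem is captured is 1 - (1 - r)^k. The sketch below evaluates that for a 1% sampling ratio; the function name is made up for this example.

```python
def capture_probability(ratio: float, occurrences: int) -> float:
    """Chance that at least one of `occurrences` traces is sampled,
    assuming each trace is kept independently with probability `ratio`."""
    return 1.0 - (1.0 - ratio) ** occurrences


for k in (1, 10, 100, 500):
    print(f"{k:>3} occurrences at 1% sampling: {capture_probability(0.01, k):.3f}")
```

A failure that happens once has a 1% chance of being in the dataset, and even a failure that happens 100 times is captured only about 63% of the time. Rare errors are exactly the ones a developer is most likely to go looking for.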

2. Which leads to low developer trust.

This randomness leads to developers, more often than not, preferring to start their debugging journey from monitoring and logging platforms rather than from the tracing platform.

3. Which leads to low usage.

When this starts happening, the tracing platform becomes an optional tack-on to the observability pipeline and tooling. At the organization level, the usage ends up becoming sporadic and infrequent as a result.

4. Which eventually leads to poor RoI.

When the average developer who ships and owns a service is using a tracing tool maybe 2-3 times a year, the RoI case for distributed tracing gets quite fuzzy.

Clearly, a better way to beat this ‘probability’ problem and hence drive more RoI is to employ Tail Based Sampling.

Tail-based sampling in OpenTelemetry

Tail sampling is where the decision to sample a trace takes place by considering all the spans within the trace. In other words, tail sampling gives the option to sample a trace based on specific criteria derived from different parts of a trace. The tail sampling processor is not part of the core OpenTelemetry Collector; it ships with the Collector contrib distribution.

To implement Tail Sampling effectively, the tail sampling processor already comes with multiple policies. Some of the more commonly used policies are:

latency: Using this policy, the decision to sample is made based on the duration of the trace. The duration is calculated by comparing the earliest start time and the latest end time, without factoring in the events that occurred during the intervening period. If the duration exceeds the threshold set, then the trace gets sampled.

status_code: Using this policy, the sampling decision is made based on the status of the spans in the trace (Ok, Error, or Unset) rather than on the raw HTTP status code. For HTTP operations, the instrumentation derives the span status from the response code. Under the current OpenTelemetry HTTP semantic conventions, codes in the 100-399 range leave the status unset, codes in the 500-599 range (server errors) set it to Error, and codes in the 400-499 range (client errors) set it to Error on client spans but leave it unset on server spans. If there's no HTTP status code (for example, because the operation didn't involve HTTP), the span's status is left unset unless the instrumentation records an error.

composite: Most companies use a composite policy that is a combination of 2 or more policies with certain percentages of spans per policy.
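The latency and status_code policies above can be approximated with a small Python sketch. It is not the processor's actual code; the Span type and function names are assumptions for illustration, and the status check here keeps a trace if any span carries one of the listed statuses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Span:
    """Minimal stand-in for a span; field names are illustrative."""
    start_ms: int
    end_ms: int
    status: str = "UNSET"  # one of "OK", "ERROR", "UNSET"


def trace_duration_ms(spans: List[Span]) -> int:
    """Latest end time minus earliest start time across the whole trace."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)


def latency_policy(spans: List[Span], threshold_ms: int) -> bool:
    """Keep the trace when its overall duration exceeds the threshold."""
    return trace_duration_ms(spans) > threshold_ms


def status_code_policy(spans: List[Span], status_codes: Iterable[str]) -> bool:
    """Keep the trace when any span has one of the listed statuses."""
    wanted = set(status_codes)
    return any(s.status in wanted for s in spans)


trace = [
    Span(start_ms=0, end_ms=120, status="OK"),
    Span(start_ms=30, end_ms=5400, status="ERROR"),
]
print(trace_duration_ms(trace))
print(latency_policy(trace, threshold_ms=5000))
print(status_code_policy(trace, ["ERROR"]))
```

Both checks need every span of the trace to be present, which is why the decision can only be made once the trace is believed to be complete.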

The following is a placeholder example of a composite tail-based sampling policy. OpenTelemetry sets no default policies, so at least one policy has to be set up for tail sampling to run. For a tail-based policy to run effectively, we need to define three important configurations.

decision_wait: 10s: This sets the wait time to 10 seconds before deciding to sample, allowing traces to be completed. This will basically store the data in memory for a full 10 seconds.
num_traces: 100: This sets the maximum number of traces that can be stored in memory to 100.
expected_new_traces_per_sec: 10: This sets the expected number of new traces per second to 10.

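processors:
  tail_sampling:
    decision_wait: 10s                # time to wait after a trace's first span before deciding
    num_traces: 100                   # maximum number of traces held in memory
    expected_new_traces_per_sec: 10   # expected rate of new traces
    policies:
      - name: composite-policy-1
        type: composite
        composite:
          max_total_spans_per_second: 1000
          composite_sub_policy:
            # Keep traces whose duration exceeds 5 seconds
            - name: test-composite-policy-1
              type: latency
              latency: {threshold_ms: 5000}
            # Keep traces containing a span with status ERROR or UNSET
            - name: test-composite-policy-2
              type: status_code
              status_code: {status_codes: [ERROR, UNSET]}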
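To see why these three settings matter, here is a simplified Python sketch of the buffering a tail sampler has to do. It is not the collector's implementation: the TraceBuffer class, its method names, and the "evict the oldest trace when full" behavior are assumptions made for illustration.

```python
from collections import OrderedDict


class TraceBuffer:
    """Holds spans per trace until decision_wait_s has passed since the first span."""

    def __init__(self, decision_wait_s: float, num_traces: int):
        self.decision_wait_s = decision_wait_s
        self.num_traces = num_traces
        self._traces = OrderedDict()  # trace_id -> (first_seen_s, spans), oldest first

    def add_span(self, trace_id: str, span: dict, now_s: float) -> None:
        if trace_id not in self._traces:
            if len(self._traces) >= self.num_traces:
                self._traces.popitem(last=False)  # buffer full: evict the oldest trace
            self._traces[trace_id] = (now_s, [])
        self._traces[trace_id][1].append(span)

    def pop_ready(self, now_s: float) -> dict:
        """Remove and return the traces that have waited long enough to be decided."""
        due = [tid for tid, (first_seen, _) in self._traces.items()
               if now_s - first_seen >= self.decision_wait_s]
        return {tid: self._traces.pop(tid)[1] for tid in due}

    def __len__(self) -> int:
        return len(self._traces)


def min_buffer_size(decision_wait_s: int, expected_new_traces_per_sec: int) -> int:
    """Traces that arrive during one decision window must all fit in memory."""
    return decision_wait_s * expected_new_traces_per_sec


buffer = TraceBuffer(decision_wait_s=10, num_traces=100)
buffer.add_span("trace-a", {"name": "GET /checkout"}, now_s=0.0)
buffer.add_span("trace-a", {"name": "charge-card"}, now_s=0.4)
buffer.add_span("trace-b", {"name": "GET /cart"}, now_s=6.0)
print(buffer.pop_ready(now_s=9.0))    # nothing has waited 10 seconds yet
print(buffer.pop_ready(now_s=10.0))   # trace-a is due, trace-b is still waiting
print(min_buffer_size(10, 10))
```

With a 10 second wait and 10 new traces per second, roughly 100 traces are in flight at any moment, which is why the placeholder values above fit together. If num_traces is set lower than that product, traces would be pushed out of the buffer before a decision is made.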

Tail sampling makes collection intelligent

Tail sampling has some distinct advantages:

Informed Decisions: Tail sampling makes decisions at the end of a request, or even after the request has been completed. This allows it to make more informed decisions because it has access to more information.
Reduced Noise: By making more informed sampling decisions, tail sampling can reduce noise and make identifying and diagnosing issues easier.
Cost Efficiency: Tail sampling is also more cost-efficient. By storing and analyzing only the most relevant traces, the amount of storage and processing required can be reduced significantly.

Yet, in practice, Tail sampling scales poorly

In practice, though, tail-based sampling scales rather poorly due to inherent design challenges.

Not a ‘set and forget’ design: Microservices are a complex beast that can change shape and form. Systems change, traffic patterns change and features get added. The rules that govern the sampling policies need to be constantly updated too.

Performance intensive and operationally hard design: The fundamental design of waiting for the trace to complete end to end means we have to hold these spans in memory for a set time in a collector. In addition to this, all the spans need to end up in the same collector for tail sampling to work effectively. This can be done either locally or in some central location. If held locally, this is going to eat into the application resources and will never scale for a real production system. When processed centrally in some other location, this means complex engineering to support scale. This might involve setting up a load balancer that also will have to be trace id aware to send all the spans of the same trace to the same collector. This will then help orchestrate the capturing of an ‘interesting’ trace from the flood of spans.

Defining ‘interesting’ traces involves significant collaboration: While errors might be fairly straightforward to capture, capturing other traces based on latency, the number of queries sent to a DB, or a particular customer’s span involves significant collaboration overhead and team-specific knowledge. Remember, these rules may need changes and additions in the future as systems evolve.

Data costs: Lastly, there is inherent variability in costs since ‘interesting’ events may happen in spikes and may be largely unpredictable. This makes it hard to manage costs predictably.
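The trace-ID-aware load balancing mentioned above boils down to one rule: every span with the same trace ID must be routed to the same collector. A minimal Python sketch of that rule follows. The collector addresses are hypothetical, and simple modulo hashing is used here for brevity; it reshuffles most traces whenever the collector pool changes size, which is one reason production setups tend to prefer consistent hashing.

```python
import hashlib
from typing import List


def pick_collector(trace_id: str, collectors: List[str]) -> str:
    """Map a trace ID to one collector so that all of its spans meet in one place."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical collector pool.
collectors = ["collector-0:4317", "collector-1:4317", "collector-2:4317"]

# Spans of one trace, emitted by three different services.
spans = [("1234567890", "frontend"), ("1234567890", "cart"), ("1234567890", "payment")]
targets = {pick_collector(trace_id, collectors) for trace_id, _ in spans}
print(targets)  # a single collector, whichever service sent the span
```

Every hop in front of the sampling tier has to apply the same function, which is part of the operational burden described above.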

An ideal model would incorporate the best elements from both the approaches

Given the pros and cons of both these sampling techniques, an ideal theoretical model would incorporate the best elements from each approach. The end goal would be to create a strategy that is simple to execute, efficient, and effective in a variety of scenarios. Here’s one such ideal theoretical construct:

1. No-ops, automated selection of interesting traces

Operational complexity is one of the biggest bottlenecks that need to be removed to democratize the use of tracing. An ideal solution would automate the instrumentation of the collector/processor/load balancing aspects of Tail Sampling. Once dropped in, the spans get collected, anomalies are identified, and the trace gets stitched and dumped into persistent storage. The rest get discarded.

2. Adaptive

Systems change continuously and the decisions ideally should not be based on fixed rules. For instance, the sampling rate can be dynamically altered based on error trends, traffic patterns, and other telemetry data which brings us to the next point.

3. Integrations with other tools

There is telemetry data emitted to multiple tools in the Observability infrastructure. An ideal platform should be able to read/listen to these signals to create dynamic policies. For instance, a seemingly simple policy that compares current latency against the recent p95 is fairly complicated to execute. To execute this, the tail processor must have access to the p95 latency values of the last day and the current p95 latency to make a decision. This has to be as close to real-time as possible. The way such traces get sampled today instead is by using decisions such as latency greater than 300ms, which is hard coded into the policy config.
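As a rough illustration of what a p95-driven decision could look like, here is a Python sketch that derives the threshold from observed latencies instead of hard coding it. Everything in it is an assumption for the sake of the example: the nearest-rank percentile method, the 1.2x tolerance, and the function names. The hard part in a real system is feeding it fresh data, not the arithmetic.

```python
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency value")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def latency_regressed(current_ms: Sequence[float], baseline_ms: Sequence[float],
                      tolerance: float = 1.2) -> bool:
    """True when the current p95 exceeds the baseline p95 by more than the tolerance."""
    return p95(current_ms) > tolerance * p95(baseline_ms)


def keep_trace(duration_ms: float, baseline_ms: Sequence[float]) -> bool:
    """Keep a trace that is slower than the baseline p95, whatever that currently is."""
    return duration_ms > p95(baseline_ms)


baseline = [float(v) for v in range(1, 101)]   # stands in for the last day's latencies
current = [float(v) for v in range(50, 150)]   # stands in for the last few minutes

print(p95(baseline), p95(current))
print(latency_regressed(current, baseline))
print(keep_trace(300.0, baseline), keep_trace(80.0, baseline))
```

Compared with a fixed "latency greater than 300ms" rule, the threshold here moves with the system, but only if the sampler can read the baseline from wherever those metrics live.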

4. Form-based and not code based

When the tail policy decisions need to be modified today, it involves a complex collaboration process between the developer and the platform/ops teams. This then translates to changes in the policy file. An ideal solution would have a form-based approach where the developer can dynamically adjust the sampling policy to capture traces that meet certain unique conditions without needing the involvement of the ops teams. Of course, this may dramatically alter the trace volumes and hence must come with proper feedback to the developer.

Conclusion

In summary, while tracing is a powerful tool for understanding and debugging complex distributed systems, crafting the right sampling techniques is critical to deriving strong utility. While both head-based and tail-based sampling offer unique advantages, they also come with their own set of challenges. The quest for an ideal solution continues. Platforms such as ZeroK.ai eliminate this problem by using AI to automatically identify traces of interest, and create a no-ops, adaptive, and operationally straightforward experience. There are bound to be newer developments in this space over the next few years.

Can Distributed Tracing replace Logging?

Varun — Wed, 12 Jul 2023 01:10:14 +0000

Monitoring, Logging, and Tracing are often highlighted as the three fundamental pillars of a contemporary Observability framework. Conventional wisdom suggests that all 3 pieces of technology are equally critical and have their own place in the Observability stack. As more of the world shifts to cloud, containerized, and distributed systems, will one of these pillars end up becoming more critical than the other?

We predict that this will most likely be the case. In this article, we compare the roles of two of these Observability pillars, Distributed Tracing vs Logging, and see which best suits the needs of an increasingly cloud-native world.

Let's delve further into the topic, beginning with Logging.

Logging is as old as programming itself

Logs have been used since time immemorial to understand how software works, track program flow, and to root cause issues if there are any. As software and infrastructure evolved, logs too have evolved to capture a variety of datasets.

Today, a log in a backend application can capture anything from request response data, error messages and codes for error reporting, stack traces, metrics such as response times, custom log messages, and all the way up to business-specific events such as user activities and transactions. Different log platforms may also generate their own specific format of logs.

And therein lies a massive problem in a cloud-native distributed world.

Much of logging today is completely unstructured

The fact that most logs are unstructured has both upsides and downsides. For one, it allows for flexibility and freedom to log any information that a developer feels is necessary. Unstructured logs are quick to implement (easy implementation code), efficient from a processing standpoint (no processing or serialization is required to conform to a specific format), and generally easy to work with given so many drop-in tools and frameworks.

In this example, the log file has captured an error scenario:

[2023-05-20 14:30:27] ERROR - Exception occurred while processing request: java.lang.NullPointerException: null pointer in method doSomething() at com.example.MyService.processRequest(MyService.java:123)

Timestamp: [2023-05-20 14:30:27] indicates the date and time when the error occurred.
Log level: ERROR signifies that it's an error log entry.
Message: Exception occurred while processing request: briefly describes the error.
Exception details: java.lang.NullPointerException: null pointer in method doSomething() indicates the specific exception that was thrown, a NullPointerException occurred in the doSomething() method.
Stack trace: at com.example.MyService.processRequest(MyService.java:123) shows the location in the code where the exception occurred

However, there are many downsides to unstructured logging. Unstructured logs have limited readability at large volumes, and their poor searchability makes troubleshooting and debugging microservices time-consuming.

These problems compound as applications become more complex.

Distributed architecture problems

Structured logs to the rescue

The logical mitigation to this is where a log adheres to a predefined format or schema aka structured data. This increases readability, searchability, and makes the log suitable for automated analysis.

For instance, companies tend to standardize and create JSON formats with clearly defined keys and values so you can extract and analyze relevant information more easily.

The resulting data looks something like this - the exact same parameters passed as before but in clearly defined keys and values.

{
    "timestamp": "2023-05-20T14:30:27Z",
    "level": "ERROR",
    "message": "Exception occurred while processing request",
    "exception": {
        "type": "java.lang.NullPointerException",
        "message": "null pointer in method doSomething()",
        "stackTrace": "at com.example.MyService.processRequest(MyService.java:123)"
    }
}
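To show what that conversion involves, here is a small Python sketch that parses the unstructured line from the earlier example into the structured form above. The regular expression is tailored to that one line layout and the function name is made up; it also assumes the original timestamp is in UTC. Any other log layout would need its own pattern, which is exactly the maintenance burden of unstructured logs.

```python
import json
import re

# Pattern for one specific layout: [timestamp] LEVEL - message: exception: detail at location
LINE_PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] (?P<level>[A-Z]+) - (?P<message>.+?): "
    r"(?P<exc_type>[\w.]+): (?P<exc_message>.+?) (?P<stack>at .+)$"
)


def to_structured(line: str) -> dict:
    """Convert one unstructured log line into a dictionary with fixed keys."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not match the expected layout")
    return {
        # Assumption: the source timestamp is UTC, hence the trailing "Z".
        "timestamp": match["timestamp"].replace(" ", "T") + "Z",
        "level": match["level"],
        "message": match["message"],
        "exception": {
            "type": match["exc_type"],
            "message": match["exc_message"],
            "stackTrace": match["stack"],
        },
    }


raw = (
    "[2023-05-20 14:30:27] ERROR - Exception occurred while processing request: "
    "java.lang.NullPointerException: null pointer in method doSomething() "
    "at com.example.MyService.processRequest(MyService.java:123)"
)
structured = to_structured(raw)
print(json.dumps(structured, indent=4))
```

Once the entry is a dictionary, questions such as "all ERROR entries with a NullPointerException" become a simple filter on keys rather than a text search.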

Structured logs greatly add to readability and searchability. When executed well, they bring down debug times massively. Companies like Elastic can also convert unstructured logs to structured logs during indexing by using processes like Dynamic Mapping, where the datatype of each field is dynamically determined based on the content. However, this is not always 100% accurate, given the variety of data and customization possible. Problems begin to surface at multiple levels - at a personnel level, at an organizational level, and at an industry level.

At a personnel level, the seamless adoption of standardized formats across teams is extremely challenging. Different teams across the entire app stack may use their own logging practices and monitoring systems or would have implemented different logging libraries during the initial scale-up phase. Retrofitting new logging mechanisms into legacy applications requires significant effort. Add the cost of training dev and ops teams on this new paradigm, and the cost balloons.

At the org level, structured log files may require changes to the logging infrastructure. Traditional text-based log storage and analysis tools may not be optimized for structured log formats, requiring the adoption of new log management systems and/or monitoring systems capable of efficiently handling structured log files. There's also the complexity of striking a balance between providing enough structure to make logs useful for analysis and allowing flexibility for different types of log entries.

At the industry level, logging has been one of the biggest standardization challenges. The diversity of datatypes, use cases, and vendors makes this incredibly challenging. Yet, new projects are always coming up to promote consistent log formats and practices.

Structured Event triggered log

A project worth mentioning is the Logging for Cloud-Native Applications (LogCNCF) initiative. You can see the landscape of logging projects here. Projects like Fluentd attempt to standardize data collection, unify logging with structured JSON, and make the architecture pluggable with data sources and outputs. However, we believe that adoption will take time and will most definitely not cover older, legacy technologies.

The question, then, is how all this is going to work itself out in an increasingly microservices-oriented world.

Context Propagation - the holy grail

Modern applications talk to each other countless times. If a problem (logic bug or a performance issue) occurs in one service, it can have a cascading effect on other services. Request, response, version numbers, state, feature flag, and many such data points of connected services will be relevant while debugging issues.

However, logging today is inherently limited to gathering info only for that particular microservice that has been instrumented. Hence, understanding interdependencies and pinpointing the exact service or component causing the issue requires complex JOINS across multiple log files of multiple services.

Here’s the challenge - to execute seamless JOINS, you need to propagate context across all peer services. Sounds great in theory, but extremely difficult in practice. Here’s a great thread on Twitter talking about this exact same problem.

There is, however, a great solution for this clear and present challenge.

The most powerful JOIN statement in the distributed world - Distributed Tracing

A distributed trace effectively enables a JOIN operation across the complete distributed transaction.

Here's a real-world example of context propagation for logs with tracing enabled:

A view of how individual services in a checkout system behave

1. Starting point: An e-commerce user initiates a checkout process by adding items to the shopping cart.

2. Frontend service: The frontend service handles the user's request and generates a trace identifier to track the entire checkout process. The trace identifier is included in the log entries. Example log entry:

[2023-05-20 14:30:27|Trace ID: 1234567890] INFO - User initiated checkout process
  User ID: 9876

3. Cart service: The frontend service communicates with the cart service to retrieve the user's shopping cart items. The trace identifier is propagated to the cart service. Example log entry in the cart service:

[2023-05-20 14:30:30|Trace ID: 1234567890] DEBUG - Retrieved cart items for user
  User ID: 9876
  Cart Items: [Item 1, Item 2, Item 3]

4. Payment service: The frontend service communicates with the payment service to process the user's payment. The trace identifier is passed to the payment service. Example log entry in the payment service:

[2023-05-20 14:30:35|Trace ID: 1234567890] INFO - Processing payment for user
  User ID: 9876
  Payment Amount: $100.00

5. Order service: After the payment is successfully processed, the frontend service communicates with the order service to create an order. The trace identifier is propagated to the order service. Example log entry in the order service:

[2023-05-20 14:30:40|Trace ID: 1234567890] INFO - Created order for user
  User ID: 9876
  Order ID: 54321

Throughout the checkout process, the trace identifier is propagated across different services. This allows log entries to be correlated based on the trace identifier, enabling end-to-end tracing of a single user's journey across the entire architecture.
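The JOIN that the trace identifier enables can be sketched in a few lines of Python. The entries below mirror the example log lines above, plus one unrelated entry that is made up to show the filtering; the dictionary layout and the join_by_trace helper are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Log entries as they might arrive from four services, in no particular order.
log_entries = [
    {"service": "order", "timestamp": "2023-05-20 14:30:40", "trace_id": "1234567890",
     "message": "Created order for user"},
    {"service": "frontend", "timestamp": "2023-05-20 14:30:27", "trace_id": "1234567890",
     "message": "User initiated checkout process"},
    {"service": "payment", "timestamp": "2023-05-20 14:30:35", "trace_id": "1234567890",
     "message": "Processing payment for user"},
    {"service": "cart", "timestamp": "2023-05-20 14:30:30", "trace_id": "1234567890",
     "message": "Retrieved cart items for user"},
    # Unrelated request, invented for this example.
    {"service": "frontend", "timestamp": "2023-05-20 14:31:02", "trace_id": "5555555555",
     "message": "User initiated checkout process"},
]


def join_by_trace(entries: List[dict]) -> Dict[str, List[dict]]:
    """Group log entries by trace ID and order each group by time."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["trace_id"]].append(entry)
    return {tid: sorted(group, key=lambda e: e["timestamp"]) for tid, group in grouped.items()}


journey = join_by_trace(log_entries)["1234567890"]
for entry in journey:
    print(entry["timestamp"], entry["service"], "-", entry["message"])
```

The grouping key is the only thing the four services had to agree on, and the tracing instrumentation supplied it.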

Now imagine doing this if tracing was not available. You’d need to do the following to get to some basic related data. TL;DR THIS WON’T SCALE

Define a unique identifier: The frontend service needs to generate a unique identifier, e.g. checkout_id, for each checkout request and include it in the log entries. This approach requires manual handling of the header propagation in each service.
Context passing via message queues: If you're using message queues for communication between services, you have to include checkout_id as a property in the messages. Each service that consumes the message can extract the checkout_id from the message properties and include it in its log entries. This approach ensures the checkout_id travels with the message and is available to downstream services.
Service-level context storage: You can maintain a shared context storage system (e.g., a cache or database) accessible by all services involved in the checkout process. When the frontend service generates checkout_id, it can store it in this shared context storage system along with any other relevant information. Each service can then retrieve checkout_id from the shared storage system and include it in its log entries. This approach requires careful synchronization and management of the shared context storage system.
Custom correlation via parsing logs: Extract and parse the checkout_id from log entries in each service and use custom log analysis techniques to correlate across log files. This approach involves writing code or using log analysis tools to search for log entries with matching checkout_id values and analyzing them together.
Once any or all of this is executed, you can then perform a JOIN-like operation.
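For a sense of what the first option costs, here is a minimal Python sketch of hand-rolled propagation. The header name X-Checkout-Id and all function names are hypothetical. The point is that every service has to carry its own copy of this plumbing and must never forget to call it.

```python
import uuid

CHECKOUT_HEADER = "X-Checkout-Id"  # hypothetical header name


def start_checkout() -> dict:
    """Frontend: mint the identifier once, at the start of the request."""
    return {CHECKOUT_HEADER: uuid.uuid4().hex}


def outgoing_headers(incoming: dict) -> dict:
    """Every service must remember to copy the ID onto each downstream call."""
    return {CHECKOUT_HEADER: incoming[CHECKOUT_HEADER]}


def log_line(service: str, incoming: dict, message: str) -> str:
    """Every service must also remember to include the ID in each log entry."""
    return f"[checkout_id={incoming[CHECKOUT_HEADER]}] {service}: {message}"


frontend_headers = start_checkout()
cart_headers = outgoing_headers(frontend_headers)     # frontend -> cart
payment_headers = outgoing_headers(frontend_headers)  # frontend -> payment

print(log_line("frontend", frontend_headers, "User initiated checkout process"))
print(log_line("cart", cart_headers, "Retrieved cart items for user"))
print(log_line("payment", payment_headers, "Processing payment for user"))
```

If a single service in the chain skips outgoing_headers, everything downstream of it loses the identifier and the JOIN silently breaks.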

Tracing vs Logging: Distributed Tracing will most likely subsume logging in most organizations

Where data progression through logs relies heavily on manual instrumentation, tracing represents ease because context propagation is automatic. As a result, standardization is a given. No wonder one of the most widely adopted standardization projects over the last few years has been OpenTelemetry. Suddenly, information about peer services that is super relevant to debugging any issue (performance issues, exceptions, any non-200 response) for a given service becomes available to the developer.

Logs will still be the fundamental unit for debugging; after all, they provide the final clue to the problem at hand. But they will be subsumed within a single tracing platform. Within this unified platform, users can visualize performance bottlenecks, request/response flows, and track system errors without the need to switch between multiple user interfaces.

Tracing vs logging

Unfortunately, tracing has not yet been widely adopted, and for good reasons

First up, multiple applications need to be instrumented together. This can be time-consuming, especially in large codebases or when working with multiple languages or frameworks. On top of this, tracing generates a significant amount of spans that need to be collected, stored, and analyzed effectively. Since most vendors charge on the spans ingested, this presents a real headache for developers when it comes to debugging - which of these traces should be dropped and which ones retained? As a result, various sampling techniques have come about that help with this very problem (This requires an article of its own). But this is all going to change.

The future of Observability is bright!

The field of Observability in general and tracing in particular is continuously evolving. Hypothetically if these challenges associated with tracing are removed, we believe the world of Observability will forever change. Tracing will subsume logging, visualization of incident propagation will become intuitive, and all related data for debugging will reside in a single platform. At ZeroK, we are inventing the data engineering needed to remove these challenges once and for all. The possibilities, as a result, will have no bounds!