beefed.ai

Posted on • Originally published at beefed.ai

Global Sampling Strategy for Distributed Tracing

The system-level symptoms are familiar: trace ingestion spikes that trigger throttling, backend query latencies rising under index pressure, dashboards that show stable metrics but miss the critical error traces that explain the outage, and divergent sampling behavior across teams because sampling lives in different places (SDKs, sidecars, collectors). Every one of those symptoms points to a lack of centralized sampling policy and observability over sampling decisions.

Contents

  • Why Sampling Is Non-Negotiable for Production Tracing
  • Compare Sampling Strategies: Probabilistic, Rate-Limiting, and Tail-Based
  • How to Implement Sampling in the OpenTelemetry Collector (concrete configs)
  • How Adaptive Sampling and Dynamic Rules Keep Costs Predictable
  • Actionable Checklist: Implement a Global Adaptive Sampling Pipeline

Why Sampling Is Non-Negotiable for Production Tracing

Sampling is not a cost-cutting nicety; it’s an architectural control. Traces impose three distinct costs: application-side overhead (CPU/memory and network), collector-side state and CPU to reassemble traces, and backend costs for ingest, indexing, and long-term retention. When you instrument broadly and run without a plan, you pay all three costs for most traffic that’s routine and uninteresting. OpenTelemetry SDKs provide deterministic head samplers such as TraceIdRatioBasedSampler to control generation at the source, and the collector provides processors to control ingest and retention across tiers.

Two operational truths steer good design:

  • Sampling at the source (head sampling) reduces application overhead and network volume, but it makes later, context-aware decisions impossible because child spans can be dropped at creation.
  • Collector-side sampling (tail sampling) can make richer decisions because it observes whole traces, but it requires stateful processors and memory sizing trade-offs.
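The consistency property behind head sampling is easy to sketch: every participant derives the same number from the trace ID, so a given ratio yields the same keep/drop decision on every host with no coordination. The threshold arithmetic below illustrates the idea; it is not the exact algorithm any particular SDK implements.

```javascript
// Deterministic head-sampling decision: derive a uint64 from the trace ID
// and compare it against the ratio. Any process running this function on
// the same trace ID reaches the same decision -- no coordination needed.
function shouldSample(traceIdHex, ratio) {
  // Use the low 8 bytes (16 hex chars) of the 128-bit trace ID as a uint64.
  const low64 = BigInt('0x' + traceIdHex.slice(-16));
  // Keep the trace when its value falls below ratio * 2^64.
  const threshold = BigInt(Math.floor(ratio * 2 ** 32)) * (2n ** 32n);
  return low64 < threshold;
}

const id = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(id, 1.0)); // ratio 1.0 keeps every trace
console.log(shouldSample(id, 0.0)); // ratio 0.0 drops every trace
```

Because the decision is a pure function of the trace ID, a front end and a back end configured with the same ratio keep the same traces, which is exactly what prevents the incomplete-trace problem.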

When total trace traffic grows beyond a few hundred to a few thousand traces per second for a single cluster, you need a systematic sampling approach (many vendors recommend evaluating sampling when you exceed ~1,000 traces/sec).

Compare Sampling Strategies: Probabilistic, Rate-Limiting, and Tail-Based

Choosing the right sampler is about matching decision time to decision quality and cost.

| Strategy | Decision point | Pros | Cons | Typical OpenTelemetry implementation |
|---|---|---|---|---|
| Probabilistic (head-based) | At span creation, or a stateless hash in the collector | Very low overhead; deterministic; easy to reason about | May drop interesting traces; incomplete traces if front end and back end use different probabilities | SDK `TraceIdRatioBasedSampler` or collector `probabilistic_sampler` |
| Rate-limiting | Head or remote control plane (token/leaky bucket) | Guarantees steady ingest rate; protects backend budget | Can bias results toward recent bursts; needs careful per-service tuning | Jaeger remote/rate-limiting sampler or collector `tail_sampling` rate-limiting policy |
| Tail-based | After the trace completes (collector) | Keeps rare events (errors, slow traces); policy-rich (attributes, latency) | Requires stateful collectors, memory sizing, decision latency | Collector `tail_sampling` processor (policies: `status_code`, `latency`, `probabilistic`, `rate_limiting`, `composite`) |

Key facts you must account for:

  • Head samplers like TraceIdRatioBasedSampler implement deterministic sampling via TraceID hashing so different hosts can make consistent decisions.
  • Collector probabilistic_sampler performs consistent hashing too and exposes hash_seed to coordinate sampling across collector tiers.
  • tail_sampling supports rich policy types (error, latency, string/numeric attributes, byte/span rate limits, composite allocation) and needs decision_wait and memory sizing. Policy and implementation details live in the collector contrib docs.

How to Implement Sampling in the OpenTelemetry Collector (concrete configs)

Practical pipeline patterns converge on two core ideas: generate metrics before sampling, and centralize complex decisions in a stateful pool of collectors. The following YAML is a compact, production-oriented example you can adapt.

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 1024
    spike_limit_mib: 256

  # Stateless probabilistic sampler: a quick first tier. Because it hashes the
  # trace ID, it drops whole traces consistently, so it is safe to run before
  # tail sampling -- never after it, or you randomly discard traces the tail
  # policies chose to keep.
  probabilistic_sampler:
    sampling_percentage: 10.0
    hash_seed: 42

  # Tail sampler: decision_wait / num_traces sizing must match your workload
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 500
    policies:
      - name: retain-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sampling-fallback
        type: probabilistic
        probabilistic: { sampling_percentage: 1.0 }

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
  otlp/metrics-backend:
    endpoint: "metrics-gateway:4317"   # placeholder: your span-metrics backend

service:
  pipelines:
    traces/metrics:
      receivers: [otlp]
      processors: [memory_limiter]     # unsampled stream for span metrics; do not batch before tail sampling/groupbytrace
      exporters: [otlp/metrics-backend]
    traces/sampled:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, tail_sampling, batch]
      exporters: [otlp/tempo]

Implementation notes:

  • The tail_sampling processor’s decision_wait controls how long the collector waits for the rest of a trace before making a decision; a common default is 30s but values should match your system’s maximum trace duration and SLOs for trace availability.
  • Compute num_traces conservatively as expected_new_traces_per_sec * decision_wait * safety_factor so the collector can hold the working set of traces in memory; many distributions provide guidance and metrics to detect eviction.
  • Never put a batch processor upstream of components that need full trace context (for example groupbytrace, tail_sampling) because batching can split spans across pushes and break reassembly.
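The sizing rule above is simple enough to encode. This hypothetical helper just multiplies the inputs and rounds up; keep something like it next to your capacity-planning scripts so the number is recomputed whenever traffic assumptions change.

```javascript
// Size the tail_sampling working set: the collector must hold every trace
// that arrives during the decision window, plus headroom for bursts.
function sizeNumTraces(expectedNewTracesPerSec, decisionWaitSec, safetyFactor = 1.5) {
  return Math.ceil(expectedNewTracesPerSec * decisionWaitSec * safetyFactor);
}

// Matching the config above: 500 new traces/sec with a 10s decision window.
console.log(sizeNumTraces(500, 10)); // 7500
```

Note that the example config sets `num_traces: 50000`, well above the computed minimum of 7500; extra headroom is cheap compared with the cost of evicted decisions.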

Small SDK example for head sampling (Node.js):

// Node.js example: sample ~1% at SDK
import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.01)
});

await sdk.start();

That head sampler reduces network and backend load but intentionally sacrifices the option to reconstitute traces later for tail decisions.

Important: Generate span-derived metrics (span metrics / exemplars) before applying tail-based sampling so metric aggregates remain accurate; sampling at the wrong place will skew latency and error-rate metrics.
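A quick way to see the skew: tail sampling keeps all errors but only a sliver of normal traffic, so an error rate computed from the sampled stream is wildly inflated. The numbers in this toy calculation are assumed for illustration.

```javascript
// Toy illustration of metric skew: error rate before vs. after a tail-sampling
// policy that keeps 100% of errors and 1% of everything else.
const total = 100000;
const errors = 2000;                        // true error rate: 2%
const trueRate = errors / total;

const sampledErrors = errors;               // status_code policy keeps all errors
const sampledOk = (total - errors) * 0.01;  // probabilistic fallback keeps 1%
const sampledRate = sampledErrors / (sampledErrors + sampledOk);

console.log(trueRate);    // 0.02
console.log(sampledRate); // ~0.67 -- the sampled stream makes 2% look like 67%
```

Computing the metric from the unsampled stream (the `traces/metrics` pipeline in the config above) sidesteps this entirely; alternatively, the backend must reweight by sample rate.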

How Adaptive Sampling and Dynamic Rules Keep Costs Predictable

Adaptive sampling is the control-plane pattern that converts throughput and value signals into sampling probabilities that meet a target budget. The pattern has three parts:

  1. Observability of incoming traffic (per-service, per-operation TPS, error rate, latency distribution).
  2. A controller or engine that computes per-key probabilities against a budget/target (for example, target_samples_per_second for each service).
  3. A distribution mechanism that pushes sampling probabilities to the decision point (SDK remote sampler, collector policies, or a dedicated sampler like Jaeger’s remote sampling engine).

Jaeger’s adaptive/remote sampling model recalculates per-service/per-operation probabilities so the collected trace volume matches target_samples_per_second; new services are sampled at an initial_sampling_probability until enough data exists to stabilize the estimate. That engine requires a sampling_store to hold observed traffic and computed probabilities.
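The core arithmetic of that control loop is small: divide the per-key target by the observed throughput and clamp the result. Real engines (Jaeger's adaptive sampler, for instance) smooth this over windows and persist state in the sampling store; the sketch below shows only the central calculation, with parameter names of my own choosing.

```javascript
// Core of an adaptive sampler: choose the probability that turns the
// observed throughput into the target sampled rate, clamped to [min, 1].
function adaptiveProbability(observedTracesPerSec, targetSamplesPerSec,
                             minProbability = 0.001) {
  if (observedTracesPerSec <= 0) return 1; // no data yet: sample everything
  const p = targetSamplesPerSec / observedTracesPerSec;
  return Math.min(1, Math.max(minProbability, p));
}

console.log(adaptiveProbability(500, 50)); // 0.1 -- busy service throttled
console.log(adaptiveProbability(10, 50));  // 1   -- quiet service fully sampled
```

The floor matters: without `minProbability`, a very hot endpoint could be driven to a probability so low that it effectively disappears from traces.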

Practical patterns you’ll use:

  • Keep an always-sample policy for critical flows (auth, billing) and for error traces (status_code == ERROR) via tail_sampling. This preserves fidelity for high-business-value areas.
  • Use a composite policy to allocate a fixed portion of the sampling budget to different classes (errors, slow paths, high-cardinality features) and let a probabilistic fallback fill remaining capacity. tail_sampling supports composite and rate_allocation.
  • Implement a feedback loop where backend ingestion metrics (sampled traces/s, dropped traces/s, tail-sampler evictions, collector memory pressure) feed the adaptive engine. Many distributions export collector self-metrics to help tune num_traces and observe when decisions are evicted.
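A budget-allocating composite policy could look like the sketch below. The field names follow the collector-contrib tail_sampling README as I understand it; the limits and percentages are placeholders to adapt to your own budget.

```yaml
# Sketch: split a spans/sec budget across policy classes; capacity the named
# allocations don't use falls through in policy_order.
tail_sampling:
  decision_wait: 10s
  policies:
    - name: budget
      type: composite
      composite:
        max_total_spans_per_second: 1000
        policy_order: [retain-errors, slow-requests, baseline]
        composite_sub_policy:
          - name: retain-errors
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow-requests
            type: latency
            latency: { threshold_ms: 1000 }
          - name: baseline
            type: probabilistic
            probabilistic: { sampling_percentage: 1 }
        rate_allocation:
          - policy: retain-errors
            percent: 50
          - policy: slow-requests
            percent: 25
```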

Adaptive sampling examples in the wild include Jaeger’s remote/adaptive engine and Honeycomb’s Refinery (a trace-aware tail-sampling proxy). Those systems show the trade-offs between centralized control and the operational complexity of stateful components.

Actionable Checklist: Implement a Global Adaptive Sampling Pipeline

  1. Inventory and baseline.

    • Measure current trace TPS per service and 95th/99th percentile trace duration over a 7–14 day window.
    • Record backend cost per million traces and current retention policy to set a budget.
  2. Decide sampling layers.

    • Use SDK head sampling (TraceIdRatioBasedSampler) for coarse volume control where application-side resource savings matter.
    • Use collector probabilistic sampling (probabilistic_sampler) as a stateless, consistent second tier for large but predictable traffic.
    • Use collector tail sampling for business-critical flows and to retain error/latency traces.
  3. Define initial policy bank (expressed as tail_sampling policies).

    • always_sample for critical services.
    • status_code policy to keep errors.
    • latency policy for slow requests above a threshold_ms.
    • probabilistic fallback for low-priority traffic.
    • Consider rate_limiting or bytes_limiting policies to cap steady-state budget.
  4. Size stateful components.

    • Set decision_wait to slightly above your max observed trace duration (e.g., max duration + 25% headroom).
    • Compute num_traces >= expected_new_traces_per_sec * decision_wait * 1.5. Monitor eviction metrics such as otelcol_processor_groupbytrace_traces_evicted and increase sizing if > 0.
  5. Instrument sampling telemetry (metrics and attributes).

    • Export and alert on:
      • Incoming traces/sec (ingest TPS)
      • Sampled traces/sec (per service)
      • Tail-sampler cached decisions hit/miss and eviction counters
      • Collector memory and CPU utilization
      • Backend ingest error/latency and cost metrics
    • Tag sampled spans with a sampler.* attribute showing the policy or SampleRate so the backend can compensate for weighting when calculating aggregates. Honeycomb-style SampleRate attributes allow correct aggregation of counts.
  6. Rollout and validate.

    • Roll sample-rate changes in a canary group (non-critical namespaces) and compare detection rates for known incidents.
    • Validate that SLO-related signals (error-rate spikes, p99 latency) are still detectable at the new sampling level.
    • Use periodic full-capture windows (for example, a 1–4 hour snapshot at 100% for critical services) to recalibrate baselines and verify adaptive-engine behavior.
  7. Automate policy delivery.

    • Choose a control plane: remote-sampling endpoints for SDKs, a policy datastore used by your collectors, or an adaptive engine (e.g., Jaeger remote sampling). Automate policy rollout and auditing.
  8. Keep cost and fidelity visible.

    • Maintain a dashboard that correlates sample rate, ingested spans, traced incidents resolved, and dollar cost. Treat that dashboard as the system’s SLA for observability spend.
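The weighting advice in step 5 is simple to apply at query time: a span tagged with SampleRate N stands in for N original spans. A sketch of the compensation, assuming Honeycomb-style 1-in-N semantics (the `sampleRate`/`error` field names here are illustrative):

```javascript
// Compensate aggregates for sampling: weight every count by its SampleRate,
// since each kept span represents that many spans in the original traffic.
function estimateTotals(sampledSpans) {
  let total = 0;
  let errors = 0;
  for (const span of sampledSpans) {
    total += span.sampleRate;            // each span counts for N
    if (span.error) errors += span.sampleRate;
  }
  return { total, errorRate: errors / total };
}

// 2 error spans kept at 1-in-1, 5 ok spans kept at 1-in-100:
const spans = [
  { error: true, sampleRate: 1 },
  { error: true, sampleRate: 1 },
  ...Array.from({ length: 5 }, () => ({ error: false, sampleRate: 100 })),
];
console.log(estimateTotals(spans)); // total 502, errorRate ~0.004
```

Without the weighting, the naive error rate over the sampled spans would be 2/7 ≈ 29%; with it, the estimate lands back near the true ~0.4%.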

Practical metric example: For a service generating ~500 traces/sec with 2s typical duration and a target backend of 50 sampled traces/sec, set decision_wait = 3s, compute num_traces >= 500 * 3 * 1.5 ≈ 2250, and set a probabilistic fallback that produces approximately the remaining budget after always_sample/status_code policies have their share. Monitor backend ingress and iterate.

Closing

A global sampling strategy is not a one-time config; it is an operational feedback loop that balances value (errors, high-cardinality flows, SLO-implicated traces) against cost (ingest, storage, query latency). Adopt layered sampling — conservative head-based controls, stateless collector-level probabilistic gates, and stateful tail-based policies for high-value retention — instrument the decision telemetry, and iterate on concrete budgets so the system keeps the traces that solve incidents while keeping the bill predictable.

Sources

Tail Sampling with OpenTelemetry: Why it’s useful, how to do it - OpenTelemetry blog post describing tail sampling concepts, decision_wait semantics, and a sample tail_sampling configuration.

Tracing SDK Sampling (OpenTelemetry Tracing SDK spec and language docs) - Specification and language-specific docs for head samplers such as TraceIdRatioBasedSampler.

Tail sampling processor (OpenTelemetry Collector Contrib) - Processor reference listing supported tail_sampling policy types (status_code, latency, probabilistic, rate_limiting, composite, etc.) and configuration fields.

Getting Started with Advanced Sampling (AWS Distro for OpenTelemetry) - Practical guidance on groupbytrace/tail_sampling pipeline patterns and sizing guidance (num_traces, decision_wait) plus monitoring recommendations.

Sampling (Jaeger documentation) - Explanation of remote sampling, adaptive sampling, and configuration patterns for per-service and per-operation policies.

Tail sampling (Grafana / Alloy documentation) - Best-practice: generate span-derived metrics before sampling to avoid metric skew; also shows pipeline patterns for metrics + sampling.

Sampled Data in Honeycomb - Explanation of SampleRate attributes and how backends can adjust aggregates to compensate for sampling.

Probabilistic sampler processor (Splunk / Collector distributions) - Practical probabilistic_sampler configuration options including sampling_percentage, hash_seed, and failure modes.
