Mohammad Jawad (Kasir) Barati

Posted on Jun 27

OpenTelemetry Collector Configuration Explained

#otel #config #observability

I decided to configure OTel in 9109679196/piper-tts-rest-api, and then decided to share what I learned about OpenTelemetry Collector configuration. JFYI, the official Collector configuration documentation is at https://opentelemetry.io/docs/collector/configuration/.

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 2048

  # Tail sampling — always keep error traces, probabilistically sample the rest.
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 50
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 100

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: basic

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]

Big picture

An OpenTelemetry Collector configuration is usually built from these main sections:

receivers: how telemetry gets into the Collector.
processors: how telemetry is modified, grouped, filtered, sampled, or enriched.
exporters: where telemetry is sent after processing.
extensions: extra Collector functionality that is not directly part of the telemetry pipeline.
service: where you enable extensions and wire receivers, processors, and exporters together into pipelines.

Official docs:

Collector configuration: https://opentelemetry.io/docs/collector/configuration/
Collector overview: https://opentelemetry.io/docs/collector/

`extensions`

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

What it does

extensions define extra behavior for the Collector itself. Extensions are not directly involved in receiving, processing, or exporting traces, metrics, or logs.

In this config, you enable the health_check extension.

`health_check`

health_check:
  endpoint: 0.0.0.0:13133

This exposes an HTTP health endpoint for the Collector.

In Docker Compose, another container can call:

http://otel-collector:13133/

This is useful for checking whether the Collector process is alive and ready enough to respond.

`endpoint: 0.0.0.0:13133`

This means the health check server listens on all network interfaces inside the container on port 13133.

If you used only localhost:13133, another container might not be able to reach it. For Docker Compose, 0.0.0.0:13133 is usually the practical choice.
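To illustrate how a startup script in another container might use this endpoint, here is a minimal Python sketch that polls the health URL until it answers with a 2xx status. The function names (http_probe, wait_until_healthy) and the retry numbers are my own assumptions and not part of the Collector; only the URL comes from the config above.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers refused connections, DNS failures, timeouts and HTTP errors
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 10,
                       delay_s: float = 1.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Inside the Compose network you could call it as wait_until_healthy(lambda: http_probe("http://otel-collector:13133/")). Passing the probe in as a callable keeps the retry loop easy to test without a running Collector.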

Official docs:

Health check extension: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension

`receivers`

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

What it does

A receiver is how telemetry enters the Collector.

Your config enables the otlp receiver. OTLP means OpenTelemetry Protocol. It is the standard protocol used by OpenTelemetry SDKs and instrumentation libraries to send traces, metrics, and logs.

Official docs:

Receivers in Collector configuration: https://opentelemetry.io/docs/collector/configuration/#receivers
OTLP receiver: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver

`otlp`

otlp:

This configures the Collector to accept telemetry in OTLP format.

For your NestJS app, this is the receiver your app will send traces to.

`protocols`

protocols:
  http:
    endpoint: 0.0.0.0:4318
  grpc:
    endpoint: 0.0.0.0:4317

The OTLP receiver can listen using HTTP and/or gRPC.

You enabled both:

OTLP/HTTP on port 4318
OTLP/gRPC on port 4317

`http.endpoint: 0.0.0.0:4318`

This makes the Collector listen for OTLP over HTTP.

Your app would usually send HTTP OTLP traces to something like:

http://otel-collector:4318/v1/traces

`grpc.endpoint: 0.0.0.0:4317`

This makes the Collector listen for OTLP over gRPC.

Your app would usually send gRPC OTLP traces to:

otel-collector:4317

Use this if your SDK/exporter is configured for OTLP/gRPC.
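On the application side, many OpenTelemetry SDKs can be pointed at these two listeners through the standard OTEL_EXPORTER_OTLP_PROTOCOL and OTEL_EXPORTER_OTLP_ENDPOINT environment variables. The small Python helper below is only a sketch of which values go together. The function name otlp_env is hypothetical, the host name otel-collector is the Compose service name used in this article, and whether your SDK honors both variables depends on the SDK and its version.

```python
def otlp_env(protocol: str, host: str = "otel-collector") -> dict[str, str]:
    """Build the standard OTLP exporter environment variables for one protocol."""
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return {
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        # For http/protobuf, SDKs that follow the specification append
        # /v1/traces to this base URL when exporting traces.
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{ports[protocol]}",
    }
```

For example, otlp_env("http/protobuf") yields the base URL http://otel-collector:4318, which matches the /v1/traces address shown above once the SDK adds the signal path.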

`processors`

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 2048

  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 50
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 100

What processors do

Processors sit between receivers and exporters.

They can batch, filter, sample, enrich, transform, or otherwise modify telemetry before it is exported.

In your trace pipeline, processors are applied in this order:

processors: [tail_sampling, batch]

That means:

First, the Collector decides which traces to keep using tail_sampling.
Then, the kept traces are batched using batch before sending them to Jaeger.

This order makes sense. Sampling drops data first, and batching groups the remaining data before export.

Official docs:

Processors in Collector configuration: https://opentelemetry.io/docs/collector/configuration/#processors

`batch` processor

batch:
  timeout: 5s
  send_batch_size: 512
  send_batch_max_size: 2048

What it does

The batch processor groups spans together before exporting them.

This is useful because sending many small requests to your backend is inefficient. Batching can reduce network overhead and improve export performance.

Official docs:

Batch processor: https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor

`timeout: 5s`

The Collector sends a batch after 5s, even if the batch has not reached send_batch_size.

So this puts a maximum wait time on batching.

`send_batch_size: 512`

This is the batch size trigger.

When the Collector has collected around 512 spans, metric data points, or log records, it will send a batch.

In your trace pipeline, this applies to spans.

`send_batch_max_size: 2048`

This is the maximum batch size allowed.

If a batch becomes larger than 2048, the processor will split it into smaller batches.

A key detail: send_batch_max_size must be greater than or equal to send_batch_size.

Your config is valid because:

2048 >= 512
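To see how the three settings interact, here is a toy Python model of size-triggered and time-triggered batching. It is a conceptual sketch and not the Collector's actual implementation: the class name BatchModel, its methods, and the explicit now argument are all assumptions made for illustration.

```python
class BatchModel:
    """Toy model of size- and time-triggered batching (not the Collector's code)."""

    def __init__(self, timeout_s=5.0, send_batch_size=512, send_batch_max_size=2048):
        if send_batch_size <= 0:
            raise ValueError("this model needs send_batch_size > 0")
        if send_batch_max_size and send_batch_max_size < send_batch_size:
            raise ValueError("send_batch_max_size must be >= send_batch_size")
        self.timeout_s = timeout_s
        self.send_batch_size = send_batch_size
        self.send_batch_max_size = send_batch_max_size
        self.pending = []
        self.timer_started_at = None

    def _take(self):
        # Cut at most send_batch_max_size items off the front (0 means no cap).
        limit = self.send_batch_max_size or len(self.pending)
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        return batch

    def add(self, items, now):
        """Queue items; return the batches that the size trigger flushes."""
        if not self.pending:
            self.timer_started_at = now
        self.pending.extend(items)
        flushed = []
        while len(self.pending) >= self.send_batch_size:
            flushed.append(self._take())
        if flushed:
            # Restart the timer for whatever is left after a flush.
            self.timer_started_at = now if self.pending else None
        return flushed

    def tick(self, now):
        """Return a batch if the timeout expired with items still pending."""
        if self.pending and now - self.timer_started_at >= self.timeout_s:
            self.timer_started_at = None
            return [self._take()]
        return []
```

With the values from this config, adding 100 spans flushes nothing until tick is called 5 seconds later, while adding 3000 spans at once produces two batches of 2048 and 952 spans.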

`tail_sampling` processor

tail_sampling:
  decision_wait: 10s
  num_traces: 50000
  expected_new_traces_per_sec: 50
  policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: probabilistic-sample
      type: probabilistic
      probabilistic:
        sampling_percentage: 100

What it does

Tail sampling means the Collector waits until it has seen enough spans from a trace before deciding whether to keep or drop that trace.

This is different from head sampling, where the decision is made at the beginning of the trace.

Tail sampling is useful because you can make decisions based on the final trace result. For example:

keep traces with errors;
keep slow traces;
keep traces matching specific services or attributes;
sample only a percentage of normal successful traces.

Official docs:

Tail sampling processor: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
Tail sampling examples: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/testdata/tail_sampling_config.yaml

Important production note

Tail sampling needs all spans for the same trace to arrive at the same Collector instance. If you run multiple Collector replicas later, you need to be careful with load balancing so spans from the same trace are routed to the same Collector.

For local Docker Compose with one Collector instance, this is fine.
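If you do scale out later, the idea behind trace-aware load balancing can be sketched in a few lines of Python: hash the trace ID and use the result to pick one Collector, so every span of that trace takes the same route. The function name pick_collector and the host names are hypothetical. The contrib repository ships a load-balancing exporter meant for this job, and a real setup would normally use consistent hashing rather than the plain modulo shown here, because modulo reshuffles most traces whenever the list of Collectors changes.

```python
import hashlib


def pick_collector(trace_id_hex: str, collectors: list[str]) -> str:
    """Map a trace ID to one collector so all its spans land on the same instance."""
    if not collectors:
        raise ValueError("need at least one collector")
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]
```

Calling pick_collector twice with the same trace ID and the same list always returns the same host, which is the property tail sampling depends on.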

`decision_wait: 10s`

The Collector waits 10s from the first span of a trace before making a sampling decision.

This gives other spans in the same distributed trace time to arrive.

If your traces are long or your services are slow, you may need a higher value. If you want faster export, you may reduce it.

Trade-off:

Higher value: better chance of complete traces, but more memory usage and more delay.
Lower value: less memory and faster export, but a higher chance that the decision is made before all spans of a trace have arrived.

`num_traces: 50000`

This is the number of traces the processor keeps in memory while waiting to make sampling decisions.

Higher traffic usually needs a higher value.

If this is too low, the Collector may be forced to drop or evict traces before making the right decision.

`expected_new_traces_per_sec: 50`

This tells the processor roughly how many new traces per second to expect.

It helps the processor allocate internal data structures more efficiently.

For local development, 50 is usually fine unless you generate a lot of load.
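A quick back-of-the-envelope check ties decision_wait, num_traces, and expected_new_traces_per_sec together. Assuming a steady arrival rate, the number of traces waiting for a decision at any moment is roughly the rate multiplied by the wait time. The Python helpers below are my own sketch of that estimate, not something the processor exposes.

```python
def estimated_in_flight_traces(new_traces_per_sec: float, decision_wait_s: float) -> float:
    """Traces that are still waiting for a decision at any moment, on average."""
    return new_traces_per_sec * decision_wait_s


def headroom(num_traces: int, new_traces_per_sec: float, decision_wait_s: float) -> float:
    """How many times larger num_traces is than the estimated in-flight count."""
    return num_traces / estimated_in_flight_traces(new_traces_per_sec, decision_wait_s)
```

For this config, 50 traces per second times 10 seconds gives about 500 traces in flight, so num_traces: 50000 leaves roughly 100 times that as headroom. Treat the result as a lower bound for sizing, since bursts and long-lived traces push the real number higher.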

`policies`

Policies define which traces should be sampled/kept. With the policy types used here, a trace is kept if at least one policy decides to sample it.

You configured two policies:

policies:
  - name: keep-errors
    type: status_code
    status_code:
      status_codes: [ERROR]
  - name: probabilistic-sample
    type: probabilistic
    probabilistic:
      sampling_percentage: 100

Tail sampling policy: `keep-errors`

- name: keep-errors
  type: status_code
  status_code:
    status_codes: [ERROR]

What it does

This policy keeps a trace if at least one of its spans has the status code ERROR.

This is useful because error traces are usually the traces you most want to inspect in Jaeger.

`name: keep-errors`

A human-readable name for this policy.

`type: status_code`

This means the policy makes its decision based on span status code.

`status_codes: [ERROR]`

This means traces with error spans should be sampled/kept.

Tail sampling policy: `probabilistic-sample`

- name: probabilistic-sample
  type: probabilistic
  probabilistic:
    sampling_percentage: 100

What it does

This policy samples a percentage of traces.

In your current config:

sampling_percentage: 100

That means keep 100% of traces.

So right now, your Collector keeps:

all error traces because of keep-errors;
all other traces because probabilistic-sample is 100%.

For local development, this is fine because you probably want to see everything.

For production or heavier load testing, you might reduce this, for example:

sampling_percentage: 10

That would keep all error traces, plus approximately 10% of other traces.
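The combined effect of the two policies can be modeled as a logical OR. The Python sketch below is a simplified model under my own assumptions: the Span class, the function names, and the SHA-256 based coin flip are illustrative and do not reproduce the processor's real hashing or data structures.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str  # hex string shared by all spans of one trace
    status: str    # "OK", "ERROR" or "UNSET"


def keep_errors(spans: list[Span]) -> bool:
    """status_code policy: sample if any span in the trace has status ERROR."""
    return any(s.status == "ERROR" for s in spans)


def probabilistic(spans: list[Span], sampling_percentage: float) -> bool:
    """Deterministic per-trace coin flip based on a hash of the trace ID."""
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sampling_percentage * 100


def should_keep(spans: list[Span], sampling_percentage: float) -> bool:
    """A trace is kept if at least one policy votes to sample it."""
    return keep_errors(spans) or probabilistic(spans, sampling_percentage)
```

With sampling_percentage set to 100 every trace passes the second policy, so the first one changes nothing. Once you lower the percentage, keep_errors is what guarantees that error traces still get through.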

`exporters`

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: basic

What exporters do

Exporters send telemetry out of the Collector to another system.

In your config, there are two exporters defined:

otlp/jaeger
debug

But only otlp/jaeger is currently used in the trace pipeline.

Official docs:

Exporters in Collector configuration: https://opentelemetry.io/docs/collector/configuration/#exporters
OTLP exporter: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlpexporter
Debug exporter: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/debugexporter

`otlp/jaeger` exporter

otlp/jaeger:
  endpoint: jaeger:4317
  tls:
    insecure: true

What it does

This exporter sends telemetry from the Collector to Jaeger using OTLP/gRPC.

The name has two parts:

otlp/jaeger

otlp is the exporter type.
jaeger is a custom name you gave to this exporter instance.

This is useful when you want multiple exporters of the same type with different destinations.

`endpoint: jaeger:4317`

This tells the Collector to send data to the service named jaeger on port 4317.

In Docker Compose, jaeger is resolved by Docker's internal DNS to your Jaeger container.

Port 4317 is the usual OTLP/gRPC port.

`tls.insecure: true`

This disables TLS for this exporter connection.

For local Docker Compose, this is normal because the Collector and Jaeger communicate inside a local Docker network.

For production, you should think carefully before using insecure transport, because telemetry then travels unencrypted between the Collector and the backend.

`debug` exporter

debug:
  verbosity: basic

What it does

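The debug exporter prints telemetry to the Collector logs. With verbosity: basic it prints only a short summary per batch; verbosity: detailed prints the full telemetry content.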

This is useful when troubleshooting because you can verify whether the Collector is receiving and processing telemetry.

However, your current pipeline does not use it:

exporters: [otlp/jaeger]

So defining this exporter does nothing unless you add it to a pipeline.

For example:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger, debug]

Then traces would be sent to Jaeger and also printed in the Collector logs.

`service`

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]

What it does

The service section is where you enable configured components.

Defining a receiver, processor, exporter, or extension is not enough on its own. A component is only enabled once it is referenced in the service section.

Official docs:

Service section and pipelines: https://opentelemetry.io/docs/collector/configuration/#service
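Because a component that is defined but never referenced under service stays inactive, a small lint-style check can be handy. The Python sketch below works on the config written as a plain dict (the per-component settings are omitted, and loading the YAML file is left out because that needs a third-party parser). The function name unused_components is hypothetical, and the check ignores connectors.

```python
def unused_components(config: dict) -> dict[str, list[str]]:
    """List components that are defined but never referenced under `service`."""
    service = config.get("service", {})
    used = {
        "extensions": set(service.get("extensions", [])),
        "receivers": set(),
        "processors": set(),
        "exporters": set(),
    }
    for pipeline in service.get("pipelines", {}).values():
        for kind in ("receivers", "processors", "exporters"):
            used[kind].update(pipeline.get(kind, []))

    unused = {}
    for kind, referenced in used.items():
        leftover = sorted(set(config.get(kind, {})) - referenced)
        if leftover:
            unused[kind] = leftover
    return unused


# The config from this article, written as a plain dict (settings omitted).
config = {
    "extensions": {"health_check": {}},
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "tail_sampling": {}},
    "exporters": {"otlp/jaeger": {}, "debug": {}},
    "service": {
        "extensions": ["health_check"],
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["tail_sampling", "batch"],
                "exporters": ["otlp/jaeger"],
            }
        },
    },
}

print(unused_components(config))  # {'exporters': ['debug']}
```

For this article's config the only finding is the debug exporter, which is defined but not listed in the traces pipeline.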

`service.extensions`

extensions: [health_check]

This enables the health_check extension that you configured earlier.

Without this line, the health_check config would exist, but it would not actually be started.

`service.pipelines`

pipelines:
  traces:
    receivers: [otlp]
    processors: [tail_sampling, batch]
    exporters: [otlp/jaeger]

A pipeline defines the path telemetry follows through the Collector.

Your config defines only one pipeline: traces.

That means this Collector configuration is currently only handling traces. It is not handling metrics or logs yet.

Trace pipeline flow

Your trace data flows like this:

NestJS app
  -> OTLP receiver
  -> tail_sampling processor
  -> batch processor
  -> OTLP exporter
  -> Jaeger

Or in config terms:

otlp -> tail_sampling -> batch -> otlp/jaeger

`receivers: [otlp]`

The trace pipeline receives spans from the OTLP receiver.

`processors: [tail_sampling, batch]`

The trace pipeline first applies tail sampling, then batching.

`exporters: [otlp/jaeger]`

The trace pipeline sends the final spans to Jaeger.

Practical notes for your current setup

1. Your current sampling keeps everything

Because you configured:

sampling_percentage: 100

You are keeping all traces.

That is OK for local development.

If you want to keep all error traces but only some successful traces, use something like:

sampling_percentage: 10

2. `debug` exporter is configured but unused

You define:

debug:
  verbosity: basic

But your pipeline only uses:

exporters: [otlp/jaeger]

So the debug exporter is inactive.

Use this during troubleshooting:

exporters: [otlp/jaeger, debug]

3. You only configured traces

Your pipeline is:

pipelines:
  traces:

So this config handles traces only.

If later you want metrics or logs, you need separate pipelines, for example:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

4. The health check port does not need to be published to your host

For Docker Compose internal checks, this is enough:

http://otel-collector:13133/

You only need this in ports:

- "13133:13133"

if you want to call the health endpoint from your host machine or browser.

Official documentation links

OpenTelemetry Collector overview: https://opentelemetry.io/docs/collector/
Collector configuration: https://opentelemetry.io/docs/collector/configuration/
Health check extension: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension
OTLP receiver: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
Batch processor: https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
Tail sampling processor: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
Tail sampling example config: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/testdata/tail_sampling_config.yaml
OTLP exporter: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlpexporter
Debug exporter: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/debugexporter
