OpenTelemetry Collector Configuration Explained

I decided to configure OTel in 9109679196/piper-tts-rest-api, and along the way I decided to share my findings about OpenTelemetry Collector configuration. JFYI, the official Collector configuration documentation is available in the OpenTelemetry docs.

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 2048

  # Tail sampling: always keep error traces, probabilistically sample the rest.
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 50
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 100

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: basic

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]

Big picture

An OpenTelemetry Collector configuration is usually built from these main sections:

  • receivers: how telemetry gets into the Collector.
  • processors: how telemetry is modified, grouped, filtered, sampled, or enriched.
  • exporters: where telemetry is sent after processing.
  • extensions: extra Collector functionality that is not directly part of the telemetry pipeline.
  • service: where you enable extensions and wire receivers, processors, and exporters together into pipelines.

Official docs:

OTel config


extensions

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

What it does

extensions define extra behavior for the Collector itself. Extensions are not directly involved in receiving, processing, or exporting traces, metrics, or logs.

In this config, you enable the health_check extension.

health_check

health_check:
  endpoint: 0.0.0.0:13133

This exposes an HTTP health endpoint for the Collector.

In Docker Compose, another container can call:

http://otel-collector:13133/

This is useful for checking whether the Collector process is alive and ready enough to respond.

endpoint: 0.0.0.0:13133

This means the health check server listens on all network interfaces inside the container on port 13133.

If you bound it to localhost:13133 instead, the server would only accept connections from inside the Collector's own container, so other containers could not reach it. For Docker Compose, 0.0.0.0:13133 is usually the practical choice.


receivers

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

What it does

A receiver is how telemetry enters the Collector.

Your config enables the otlp receiver. OTLP means OpenTelemetry Protocol. It is the standard protocol used by OpenTelemetry SDKs and instrumentation libraries to send traces, metrics, and logs.

otlp

otlp:

This configures the Collector to accept telemetry in OTLP format.

For your NestJS app, this is the receiver your app will send traces to.

protocols

protocols:
  http:
    endpoint: 0.0.0.0:4318
  grpc:
    endpoint: 0.0.0.0:4317

The OTLP receiver can listen using HTTP and/or gRPC.

You enabled both:

  • OTLP/HTTP on port 4318
  • OTLP/gRPC on port 4317

http.endpoint: 0.0.0.0:4318

This makes the Collector listen for OTLP over HTTP.

Your app would usually send HTTP OTLP traces to something like:

http://otel-collector:4318/v1/traces

grpc.endpoint: 0.0.0.0:4317

This makes the Collector listen for OTLP over gRPC.

Your app would usually send gRPC OTLP traces to:

otel-collector:4317

Use this if your SDK/exporter is configured for OTLP/gRPC.
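On the application side, the endpoint is often set through the standard OTLP exporter environment variables rather than in code. The Docker Compose fragment below is a minimal sketch, assuming the app's OpenTelemetry SDK honors those variables; the service name app is hypothetical, and otel-collector is assumed to be the Collector's Compose service name.

```yaml
services:
  app:  # hypothetical name for the NestJS app service
    environment:
      # OTLP over HTTP: with the signal-agnostic endpoint variable,
      # the SDK appends /v1/traces to this base URL itself.
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318

      # OTLP over gRPC alternative (use instead of the two lines above):
      # OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      # OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
```

If your exporter is configured in code instead, the same host and port apply; only the place where you set them changes.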


processors

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 2048

  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 50
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 100

What processors do

Processors sit between receivers and exporters.

They can batch, filter, sample, enrich, transform, or otherwise modify telemetry before it is exported.

In your trace pipeline, processors are applied in this order:

processors: [tail_sampling, batch]

That means:

  1. First, the Collector decides which traces to keep using tail_sampling.
  2. Then, the kept traces are batched using batch before sending them to Jaeger.

This order makes sense. Sampling drops data first, and batching groups the remaining data before export.


batch processor

batch:
  timeout: 5s
  send_batch_size: 512
  send_batch_max_size: 2048

What it does

The batch processor groups spans together before exporting them.

This is useful because sending many small requests to your backend is inefficient. Batching can reduce network overhead and improve export performance.

timeout: 5s

The Collector sends a batch after 5s, even if the batch has not reached send_batch_size.

So this puts a maximum wait time on batching.

send_batch_size: 512

This is the batch size trigger.

When the Collector has collected around 512 spans, metric data points, or log records, it will send a batch.

In your trace pipeline, this applies to spans.

send_batch_max_size: 2048

This is the maximum batch size allowed.

If a batch becomes larger than 2048, the processor will split it into smaller batches.

A key detail: send_batch_max_size must be greater than or equal to send_batch_size.

Your config is valid because:

2048 >= 512
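To see how the three settings interact, here is a sketch of a lower-latency variant that might suit local development. The values are illustrative assumptions, not recommendations from the Collector docs.

```yaml
processors:
  batch:
    # Flush at least once per second, so spans show up in Jaeger sooner.
    timeout: 1s
    # Flush earlier if 256 spans have accumulated before the timeout.
    send_batch_size: 256
    # Never export more than 512 spans in one request; larger batches are split.
    # This must stay >= send_batch_size (a value of 0 means no upper limit).
    send_batch_max_size: 512
```

Smaller batches and shorter timeouts reduce delay at the cost of more export requests, which rarely matters on a single local machine.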

tail_sampling processor

tail_sampling:
  decision_wait: 10s
  num_traces: 50000
  expected_new_traces_per_sec: 50
  policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: probabilistic-sample
      type: probabilistic
      probabilistic:
        sampling_percentage: 100

What it does

Tail sampling means the Collector waits until it has seen enough spans from a trace before deciding whether to keep or drop that trace.

This is different from head sampling, where the decision is made at the beginning of the trace.

Tail sampling is useful because you can make decisions based on the final trace result. For example:

  • keep traces with errors;
  • keep slow traces;
  • keep traces matching specific services or attributes;
  • sample only a percentage of normal successful traces.
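The "slow traces" and "specific services or attributes" cases from the list above could be expressed with the latency and string_attribute policy types. This is a sketch only: the policy names, the 2000 ms threshold, and the checkout service name are hypothetical values chosen for illustration.

```yaml
processors:
  tail_sampling:
    policies:
      # Keep traces whose total duration exceeds the threshold.
      - name: keep-slow          # hypothetical policy name
        type: latency
        latency:
          threshold_ms: 2000     # illustrative threshold
      # Keep traces where an attribute matches one of the listed values.
      - name: keep-checkout      # hypothetical policy name
        type: string_attribute
        string_attribute:
          key: service.name
          values: [checkout]     # hypothetical service name
```

These would sit alongside the keep-errors and probabilistic-sample policies in the same policies list.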

Important production note

Tail sampling needs all spans for the same trace to arrive at the same Collector instance. If you run multiple Collector replicas later, you need to be careful with load balancing so spans from the same trace are routed to the same Collector.

For local Docker Compose with one Collector instance, this is fine.
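If you do scale out later, one common pattern is a first tier of Collectors that routes spans by trace ID to a second tier that runs tail_sampling. The fragment below sketches the first tier using the loadbalancing exporter from the contrib distribution; the DNS name otel-sampling-collectors is a hypothetical assumption, and the whole layout is one possible design rather than the only one.

```yaml
exporters:
  loadbalancing:
    # Send every span of a trace to the same backend Collector.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumes a trusted internal network
    resolver:
      dns:
        # Hypothetical DNS name that resolves to all sampling Collectors.
        hostname: otel-sampling-collectors
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The second tier would then run the tail_sampling configuration shown in this article.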

decision_wait: 10s

The Collector waits 10s from the first span of a trace before making a sampling decision.

This gives other spans in the same distributed trace time to arrive.

If your traces are long or your services are slow, you may need a higher value. If you want faster export, you may reduce it.

Trade-off:

  • Higher value: better chance of complete traces, but more memory usage and more delay.
  • Lower value: less memory and faster export, but higher chance of incomplete sampling decisions.

num_traces: 50000

This is the number of traces the processor keeps in memory while waiting to make sampling decisions.

Higher traffic usually needs a higher value.

If this is too low, the Collector may be forced to drop or evict traces before making the right decision.

expected_new_traces_per_sec: 50

This tells the processor roughly how many new traces per second to expect.

It helps the processor allocate internal data structures more efficiently.

For local development, 50 is usually fine unless you generate a lot of load.

policies

Policies define which traces should be sampled/kept.

You configured two policies:

policies:
  - name: keep-errors
    type: status_code
    status_code:
      status_codes: [ERROR]
  - name: probabilistic-sample
    type: probabilistic
    probabilistic:
      sampling_percentage: 100

Tail sampling policy: keep-errors

- name: keep-errors
  type: status_code
  status_code:
    status_codes: [ERROR]

What it does

This policy keeps any trace that contains at least one span whose status code is ERROR.

This is useful because error traces are usually the traces you most want to inspect in Jaeger.

name: keep-errors

A human-readable name for this policy.

type: status_code

This means the policy makes its decision based on span status code.

status_codes: [ERROR]

This means traces with error spans should be sampled/kept.


Tail sampling policy: probabilistic-sample

- name: probabilistic-sample
  type: probabilistic
  probabilistic:
    sampling_percentage: 100

What it does

This policy samples a percentage of traces.

In your current config:

sampling_percentage: 100

That means this policy keeps 100% of traces.

A trace is kept if any policy decides to sample it, so right now your Collector keeps:

  • all error traces because of keep-errors;
  • all other traces because probabilistic-sample is 100%.

For local development, this is fine because you probably want to see everything.

For production or heavier load testing, you might reduce this, for example:

sampling_percentage: 10

That would keep all error traces, plus approximately 10% of other traces.


exporters

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: basic

What exporters do

Exporters send telemetry out of the Collector to another system.

In your config, there are two exporters defined:

  • otlp/jaeger
  • debug

But only otlp/jaeger is currently used in the trace pipeline.


otlp/jaeger exporter

otlp/jaeger:
  endpoint: jaeger:4317
  tls:
    insecure: true

What it does

This exporter sends telemetry from the Collector to Jaeger using OTLP/gRPC.

The name has two parts:

otlp/jaeger
  • otlp is the exporter type.
  • jaeger is a custom name you gave to this exporter instance.

This is useful when you want multiple exporters of the same type with different destinations.

endpoint: jaeger:4317

This tells the Collector to send data to the service named jaeger on port 4317.

In Docker Compose, jaeger is resolved by Docker's internal DNS to your Jaeger container.

Port 4317 is the usual OTLP/gRPC port.

tls.insecure: true

This disables TLS for this exporter connection.

For local Docker Compose, this is normal because the Collector and Jaeger communicate inside a local Docker network.

For production, think carefully before using insecure transport: traffic that crosses an untrusted network should be protected with TLS.
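For comparison, a TLS-enabled version of the same exporter might look like the sketch below. The hostname and certificate paths are hypothetical; cert_file and key_file are only needed if the backend requires client certificates (mutual TLS).

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger.example.internal:4317     # hypothetical hostname
    tls:
      # CA used to verify the server certificate (hypothetical path).
      ca_file: /etc/otel/certs/ca.pem
      # Client certificate and key, only for mutual TLS (hypothetical paths).
      cert_file: /etc/otel/certs/client.pem
      key_file: /etc/otel/certs/client-key.pem
```

Leaving out insecure: true means the exporter uses TLS for the connection.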


debug exporter

debug:
  verbosity: basic

What it does

The debug exporter prints telemetry to the Collector logs.

This is useful when troubleshooting because you can verify whether the Collector is receiving and processing telemetry.

However, your current pipeline does not use it:

exporters: [otlp/jaeger]

So defining this exporter does nothing unless you add it to a pipeline.

For example:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger, debug]

Then traces would be sent to Jaeger and also printed in the Collector logs.
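With verbosity: basic the logs show only a short summary per export. If you need to inspect individual spans and their attributes while troubleshooting, a more verbose setting can help. A minimal sketch:

```yaml
exporters:
  debug:
    # basic prints a one-line summary per batch;
    # detailed prints every span with its attributes.
    verbosity: detailed
```

This output gets noisy quickly, so it is probably best kept to short debugging sessions.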


service

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]

What it does

The service section is where you enable configured components.

Defining a receiver, processor, exporter, or extension is not enough on its own. A component is only started when it is referenced in the service section.


service.extensions

extensions: [health_check]

This enables the health_check extension that you configured earlier.

Without this line, the health_check config would exist, but it would not actually be started.


service.pipelines

pipelines:
  traces:
    receivers: [otlp]
    processors: [tail_sampling, batch]
    exporters: [otlp/jaeger]

A pipeline defines the path telemetry follows through the Collector.

Your config defines only one pipeline: traces.

That means this Collector configuration is currently only handling traces. It is not handling metrics or logs yet.

Trace pipeline flow

Your trace data flows like this:

NestJS app
  -> OTLP receiver
  -> tail_sampling processor
  -> batch processor
  -> OTLP exporter
  -> Jaeger

Or in config terms:

otlp -> tail_sampling -> batch -> otlp/jaeger

receivers: [otlp]

The trace pipeline receives spans from the OTLP receiver.

processors: [tail_sampling, batch]

The trace pipeline first applies tail sampling, then batching.

exporters: [otlp/jaeger]

The trace pipeline sends the final spans to Jaeger.


Practical notes for your current setup

1. Your current sampling keeps everything

Because you configured:

sampling_percentage: 100

You are keeping all traces.

That is OK for local development.

If you want to keep all error traces but only some successful traces, use something like:

sampling_percentage: 10

2. debug exporter is configured but unused

You define:

debug:
  verbosity: basic

But your pipeline only uses:

exporters: [otlp/jaeger]

So the debug exporter is inactive.

Use this during troubleshooting:

exporters: [otlp/jaeger, debug]

3. You only configured traces

Your pipeline is:

pipelines:
  traces:

So this config handles traces only.

If later you want metrics or logs, you need separate pipelines, for example:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

4. The health check port does not need to be published to your host

For Docker Compose internal checks, this is enough:

http://otel-collector:13133/

You only need this in ports:

- "13133:13133"

if you want to call the health endpoint from your host machine or browser.
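Put together, the Collector's Compose service could look roughly like this. It is a sketch under a few assumptions: the contrib image is used because tail_sampling ships in the contrib distribution, the local file name otel-collector-config.yaml is hypothetical, and the mount target is assumed to be the contrib image's default config path.

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      # Hypothetical local file name, mounted read-only at the
      # assumed default config path of the contrib image.
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
    ports:
      # Only needed to reach the health endpoint from the host.
      - "13133:13133"
      # Only needed if apps outside the Compose network send OTLP.
      - "4317:4317"
      - "4318:4318"
```

Containers on the same Compose network can reach all three ports through the otel-collector service name without any ports entries.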

