OpenTelemetry Collector Explained: What It Does and When You Need One

The first question every team asks when adopting OpenTelemetry is: "Do I need a Collector, or can my SDK export directly to Jaeger / Prometheus / Datadog?" The answer determines whether you run an extra piece of infrastructure or skip it entirely. Most guides explain what the Collector is without answering that question first. This one starts there.

Do you need a Collector?

Three questions decide it:

Do you export to more than one backend? If your traces go to Jaeger and your metrics go to Prometheus and your logs go to Loki, the Collector fans out from a single OTLP stream. Without it, every service needs three exporters configured in its SDK — three sets of credentials, three retry policies, three failure modes.
Do you need to transform telemetry before it lands? Scrubbing PII from span attributes, sampling 10% of low-priority traces, enriching resource labels with Kubernetes metadata — these are processor-layer concerns. The SDK can do basic attribute manipulation, but batch logic, tail-based sampling, and cross-signal correlation live in the Collector.
Do you want to decouple services from backend changes? If you switch from Jaeger to Tempo next quarter, a Collector means you change one exporter config in one place. Without it, you redeploy every instrumented service.
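As a rough illustration of the second question, the fragment below sketches PII scrubbing and simple sampling as Collector processors. The attribute keys user.email and user.id are hypothetical placeholders. Note that probabilistic_sampler keeps about 10% of all traces, not only low-priority ones; sampling by priority would need tail-based policies or extra routing. Both processors ship in the contrib distribution, and they would be listed in the traces pipeline's processors list.

```yaml
processors:
  # Remove or obscure sensitive span attributes before export.
  attributes/scrub:
    actions:
      - key: user.email        # hypothetical attribute name
        action: delete
      - key: user.id           # hypothetical attribute name
        action: hash
  # Keep roughly 10% of traces; the decision is derived from the trace ID.
  probabilistic_sampler:
    sampling_percentage: 10
```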

If you answered "no" to all three — you export to one backend, you don't transform telemetry, and your backend choice is stable — skip the Collector. Configure the OTLP exporter in your SDK to point directly at the backend. You can always add a Collector later without changing your instrumentation code, because the SDK speaks OTLP either way.
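A minimal sketch of the SDK-direct setup, assuming a Docker Compose deployment and a backend that accepts OTLP over gRPC on port 4317. The service name, image, and hostnames are hypothetical; the OTEL_* variables are the standard SDK environment variables, though protocol support varies by language SDK.

```yaml
services:
  api:
    image: example/api:1.0            # hypothetical application image
    environment:
      OTEL_SERVICE_NAME: api
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      # Direct export: point the SDK straight at the backend.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
      # Adding a Collector later only changes this value, for example:
      # OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
```

Because only the endpoint changes, the instrumentation code itself stays untouched when a Collector is introduced.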

What a Collector actually is

The OpenTelemetry Collector is a vendor-agnostic proxy that receives telemetry, processes it, and exports it. The pipeline model has three stages:

Stage	Role	Examples
Receivers	Ingest data from sources	OTLP, Prometheus scrape, Jaeger Thrift, Fluent Forward, host metrics
Processors	Transform, filter, enrich, sample	Batch, memory limiter, attributes, tail sampling, k8s attributes
Exporters	Send data to backends	OTLP (which Jaeger ingests natively), Prometheus remote write, Elasticsearch, Datadog, Loki

You wire them together in a YAML config. One Collector can run multiple pipelines — traces, metrics, and logs each get their own receiver-processor-exporter chain.
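A sketch of what three pipelines in one Collector might look like. The hostnames and ports are assumptions. The Prometheus endpoint assumes its remote-write receiver is enabled, and the Loki endpoint assumes a Loki version with native OTLP ingestion. The prometheusremotewrite exporter requires the contrib distribution.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317           # assumed hostname
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write   # assumed hostname
  otlphttp/loki:
    endpoint: http://loki:3100/otlp           # assumed hostname and path

service:
  pipelines:
    # One OTLP receiver feeds all three signal-specific pipelines.
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
```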

Agent vs gateway deployment

Two deployment models, not mutually exclusive:

Agent mode. Run one Collector instance per node (a DaemonSet in Kubernetes) or alongside each task (a sidecar in ECS). Each instance receives telemetry from local services over localhost or the node's own IP, batches it, and forwards it to the backend. Delivery from the application is low-latency because it never leaves the host, but you run N instances for N nodes.

Gateway mode. Run a small cluster of Collector instances behind a load balancer. All services send OTLP to the gateway endpoint. Centralized processing, easier to manage sampling and routing rules, but introduces a network hop and a single point of failure (mitigate with horizontal scaling and health checks).
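The sampling rules a gateway centralizes could look like the sketch below, which uses the contrib tail_sampling processor. The policy names and thresholds are illustrative assumptions, not recommendations. A trace is kept if any policy matches.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # how long to wait for a trace's spans before deciding
    num_traces: 50000         # traces held in memory while waiting
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Because the processor holds spans in memory until decision_wait elapses, gateway instances that run it need more memory than plain forwarding agents.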

Most production setups use both: agents on every node handle local buffering and basic processing, then forward to a gateway that handles tail-based sampling and multi-backend fan-out.
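One detail matters when the gateway has more than one instance: tail-based sampling only works if every span of a trace reaches the same instance. A common way to arrange that is the contrib loadbalancing exporter on the agents, routing by trace ID. The sketch below assumes a hypothetical headless Kubernetes service name for the gateway, so that DNS returns the individual pod IPs.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID      # all spans of one trace go to the same gateway instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Hypothetical headless service resolving to each gateway pod.
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```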

A working configuration

Here's a Collector config for a common stack: receive OTLP from instrumented services, batch spans to reduce export overhead, and send to a Jaeger backend.

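receivers:
  otlp:
    protocols:
      # Listen on all interfaces so other containers or pods can reach the Collector.
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Send a batch after 5s or once 512 items have accumulated, whichever comes first.
  batch:
    timeout: 5s
    send_batch_size: 512
  # Push back on incoming data before the process runs out of memory.
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  # Jaeger accepts OTLP directly, so the generic otlp exporter is enough.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true   # plaintext; acceptable only on a trusted network

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Order matters: memory_limiter runs first, batch last.
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]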

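The memory_limiter processor is not optional in production. Without it, a burst of spans can OOM the Collector. With check_interval: 1s, the processor polls memory usage every second. Once usage crosses the soft limit (limit_mib minus spike_limit_mib, so 512 - 128 = 384 MiB here), it starts refusing incoming data, and the receiver returns an error so the sender can retry. If usage still reaches limit_mib, it also forces garbage collection. Refusing data is better than crashing. List memory_limiter first in the pipeline's processors, as above, so it can push back before any other processor buffers data.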

Run it with the contrib distribution, which includes all community-maintained receivers and exporters:

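# Mount the config by absolute path; older Docker releases reject relative bind-mount paths.
docker run -d --name otel-collector \
  -v "$(pwd)/otel-config.yaml:/etc/otelcol-contrib/config.yaml" \
  -p 4317:4317 -p 4318:4318 \
  otel/opentelemetry-collector-contrib:latest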

Common pitfalls

Memory bloat from unbounded batching. The batch processor has no memory limit of its own. In a spike, data accumulates in memory until exports succeed or the process dies. Always pair batch with memory_limiter, and set spike_limit_mib to at most 25% of your container's memory limit.
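If you would rather not hard-code MiB values per container size, memory_limiter also accepts percentages of the total memory available to the process. The values below are illustrative assumptions that keep the spike allowance under the 25% guideline.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Percentages are relative to the total memory available to the Collector
    # (on Linux, the container's cgroup limit).
    limit_percentage: 80
    spike_limit_percentage: 20
```

With these numbers the soft limit sits at 60% of available memory (80 minus 20), which is where the processor starts refusing data.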

Exporter back-pressure cascading upstream. If Jaeger is slow to ingest, the Collector's export queue fills up, the batch processor blocks, and incoming OTLP requests start timing out in your application SDKs. Set sending_queue.queue_size on the exporter and accept that data will be dropped under sustained back-pressure — dropping spans is better than adding latency to your application's hot path.
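A sketch of bounding the export queue and retry window on the exporter from the configuration above. The numbers are illustrative assumptions, not tuned recommendations. Once the queue is full or max_elapsed_time is exceeded, data is dropped instead of piling up.

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 1000        # maximum queued batches before new ones are dropped
      num_consumers: 10       # parallel workers draining the queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s  # give up on a batch after 5 minutes of retries
```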

Running the core distribution instead of contrib. The core Collector (otel/opentelemetry-collector) ships with a minimal set of components. Most real deployments need contrib receivers (Prometheus, Fluent Forward) or exporters (Datadog, Elasticsearch). Use otel/opentelemetry-collector-contrib unless you're building a custom distribution.

Skipping resource detection. Without the resourcedetection processor (or k8sattributes in Kubernetes), your spans lack metadata like k8s.namespace.name, k8s.pod.name, and cloud.region. Debugging a trace without knowing which pod produced it is painful.
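A minimal sketch of both processors, assuming a Kubernetes deployment. The k8sattributes processor needs RBAC permission to read pod metadata, which is not shown here. The detector list is an assumption; cloud-specific detectors (for attributes such as cloud.region) depend on where you run.

```yaml
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.deployment.name
  resourcedetection:
    detectors: [env, system]  # add a cloud detector for your provider if needed
    timeout: 2s
    override: false           # keep attributes the SDK already set

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Enrich after memory_limiter and before batching.
      processors: [memory_limiter, k8sattributes, resourcedetection, batch]
      exporters: [otlp/jaeger]
```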

Collector vs SDK-direct export

	SDK-direct	Via Collector
Latency	One fewer network hop	Adds ~1-5ms (localhost agent) or ~5-20ms (gateway)
Reliability	SDK retries on failure; spans may be lost if the app crashes	Collector buffers and retries independently of the app
Flexibility	One exporter per backend, configured per service	Fan-out, sampling, enrichment in one place
Operational cost	Zero extra infra	DaemonSet or gateway to run and monitor

For a single-service, single-backend setup (one API exporting traces to Jaeger), SDK-direct is simpler and has fewer moving parts. For anything beyond that — multiple services, multiple backends, or any processing requirement — the Collector is worth the operational cost.

How monitoring complements the Collector pipeline

The Collector is infrastructure, and infrastructure fails. An OOM, a misconfigured exporter, or a network partition between the Collector and your backend means spans are silently dropped. You won't notice until someone asks "why are there no traces for the last 2 hours?"

External uptime monitoring closes that gap. A monitor that checks your Collector's health endpoint (http://collector:13133/, served by the health_check extension) every 30 seconds catches Collector failures before the gap in your trace data becomes an incident. If your MTTR for "Collector is down" is 2 hours because nobody noticed, a 30-second check interval cuts that to minutes.
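The health endpoint is only available when the health_check extension is enabled, which the configuration shown earlier does not do. A minimal sketch of the additions, assuming the monitor can reach port 13133 (with Docker, that also means publishing the port with -p 13133:13133):

```yaml
extensions:
  health_check:
    # Bind explicitly so the endpoint is reachable from outside the container.
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  # pipelines: keep your existing pipelines here unchanged
```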

Set up a monitor at app.devhelm.io for every piece of your observability stack — the Collector, Jaeger, Prometheus, Grafana. The irony of observability infrastructure is that it's the last thing teams monitor, and the first thing that causes blind spots when it fails.

Originally published on DevHelm.