OpenTelemetry Collector Contrib: 10 Features That Will Transform Your Observability Pipeline
The OpenTelemetry Collector Contrib project continues to evolve at a rapid pace, and the latest release is packed with features that address real-world observability challenges. Whether you're running workloads on GCP, managing Kubernetes clusters, or trying to tame your log volumes, this release has something for you.
Let's dive into the 10 most impactful features and see how they can improve your observability stack.
1. Export Traces to Google Cloud Storage
What's New: You can now export traces directly to Google Cloud Storage (GCS).
This is huge for teams that need long-term trace retention without the cost of keeping everything in a real-time trace backend. Think of it as a "cold storage" tier for your traces.
Why This Matters
Traditional trace backends like Jaeger or Tempo are optimized for real-time querying, but storing months of trace data gets expensive fast. With GCS export, you can:
- Archive traces for compliance and auditing
- Reduce costs by moving older traces to cheaper storage
- Build custom analytics pipelines on historical trace data
Example Configuration
```yaml
exporters:
  googlecloudstorage:
    bucket: "my-traces-bucket"
    prefix: "traces/"
    compression: gzip

service:
  pipelines:
    traces:
      exporters: [googlecloudstorage]
```
Best Practice
Use the GCS exporter alongside your primary trace backend. Send real-time traces to Jaeger/Tempo for immediate debugging, and batch export to GCS for long-term retention.
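A dual-export pipeline along those lines might look like the following sketch. The `otlp/tempo` exporter name and endpoint are illustrative placeholders, not part of the release notes:

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317  # hypothetical Tempo endpoint
  googlecloudstorage:
    bucket: "my-traces-bucket"
    prefix: "traces/"
    compression: gzip

service:
  pipelines:
    traces:
      # Same trace stream fans out to both destinations
      exporters: [otlp/tempo, googlecloudstorage]
```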
2. Limit Maximum Trace Size in Tail Sampling
What's New: The tail sampling processor now supports maximum_trace_size_bytes to limit the memory footprint of individual traces. Traces exceeding this byte limit are immediately dropped—no sampling decision is made for them.
The Problem This Solves
Tail sampling holds traces in memory while waiting for all spans to arrive before making a sampling decision. This is powerful, but it creates a vulnerability: occasionally, a single trace can grow to an enormous size (think: a batch job creating thousands of spans), causing spiky memory consumption that can crash your collector.
The memory limiter processor doesn't fully solve this because it applies backpressure while traces are waiting for decisions, which can degrade sampling accuracy and overall throughput.
How It Works
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    maximum_trace_size_bytes: 5242880  # 5 MB per trace
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
```
When a trace's in-memory size exceeds maximum_trace_size_bytes, it's immediately dropped without waiting for the decision_wait period. No sampling decision is made—the trace is simply discarded to protect collector stability.
Real-World Scenario
Consider a data processing pipeline that creates a span for each record processed. A batch of 100,000 records generates 100,000 spans in a single trace. Each span might be 500 bytes, resulting in a 50MB trace sitting in memory. Without limits, a few concurrent batches could exhaust your collector's memory.
With maximum_trace_size_bytes: 5242880 (5MB), oversized traces are dropped early, protecting your collector while still sampling normal-sized traces correctly.
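The arithmetic behind that scenario is worth making explicit. A quick sketch (the 500-byte per-span footprint is the article's rough assumption, not a measured value):

```python
# Back-of-envelope estimate of tail-sampling memory pressure.
SPAN_SIZE_BYTES = 500                    # rough per-span footprint (assumption)
SPANS_PER_BATCH_TRACE = 100_000          # one span per record in the batch
MAX_TRACE_SIZE_BYTES = 5 * 1024 * 1024   # the 5 MB limit from the config

trace_size = SPAN_SIZE_BYTES * SPANS_PER_BATCH_TRACE
print(f"batch trace size: ~{trace_size / (1024 * 1024):.0f} MiB")

# A trace this large blows past the limit and is dropped early,
# instead of sitting in memory for the full decision_wait window.
print("dropped early:", trace_size > MAX_TRACE_SIZE_BYTES)
```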
Best Practice
Use this alongside the memory limiter processor for defense in depth:
- maximum_trace_size_bytes protects against individual large traces
- The memory limiter protects against overall memory pressure from many traces
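Combined, that defense-in-depth setup might look like this sketch; the memory limiter thresholds are illustrative values, tune them to your deployment:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # illustrative thresholds
    spike_limit_percentage: 20
  tail_sampling:
    decision_wait: 10s
    maximum_trace_size_bytes: 5242880  # 5 MB per trace
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

service:
  pipelines:
    traces:
      # memory_limiter should run first in the pipeline
      processors: [memory_limiter, tail_sampling]
```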
3. Linux Hugepages Memory Monitoring
What's New: Monitor hugepages usage on Linux hosts via the new system.memory.linux.hugepages metrics.
What Are Hugepages?
Standard memory pages on Linux are 4KB. Hugepages are larger (typically 2MB or 1GB), and they're critical for high-performance applications like databases, in-memory caches, and VMs.
Why Monitor Them?
If your application expects hugepages but they're exhausted, performance tanks. Now you can track:
- system.memory.linux.hugepages.usage - Currently used hugepages
- system.memory.linux.hugepages.free - Available hugepages
- system.memory.linux.hugepages.total - Total configured hugepages
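If these metrics follow the usual hostmetrics opt-in pattern, enabling them would look roughly like the sketch below. The exact scraper and metric keys are an assumption; check the memory scraper documentation for your collector version:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      memory:
        metrics:
          # Hypothetical opt-in keys for the new hugepages metrics
          system.memory.linux.hugepages.usage:
            enabled: true
          system.memory.linux.hugepages.free:
            enabled: true
          system.memory.linux.hugepages.total:
            enabled: true
```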
Example Alert
```yaml
# Prometheus alert for low hugepages
- alert: HugepagesExhausted
  expr: system_memory_linux_hugepages_free < 10
  for: 5m
  annotations:
    summary: "Hugepages nearly exhausted on {{ $labels.host }}"
```
Who Needs This?
- Teams running Redis, PostgreSQL, or MongoDB in production
- Anyone using DPDK for high-performance networking
- VM hosts using KVM with hugepages backing
4. Exclude Namespaces from Kubernetes Watching
What's New: The k8sobjects receiver now supports excluding specific Kubernetes namespaces from being watched.
The Problem
The k8sobjects receiver watches Kubernetes objects (events, pods, deployments, etc.) and converts them to logs. In large clusters, watching all namespaces generates massive amounts of data. You often want to exclude system namespaces or namespaces managed by other tools.
Configuration
```yaml
receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        namespaces: []  # Empty means all namespaces
        exclude_namespaces:
          - kube-system
          - kube-public
          - kube-node-lease
      - name: pods
        mode: pull
        interval: 30s
        exclude_namespaces:
          - kube-system
```
Use Cases
- Reduce noise: Exclude kube-system events that flood your logs
- Compliance: Only watch specific namespaces for audit purposes
- Multi-tenancy: Different collectors for different namespace groups
- Cost control: Reduce log volume by excluding high-churn namespaces
Best Practice
Exclude namespaces that:
- Generate high volumes of Kubernetes events you don't need
- Are managed by separate observability pipelines
- Contain system components (kube-system, monitoring infrastructure)
5. Suppress Repeated Permission Denied Errors
What's New: The filelog receiver now logs only one permission denied error per file per process run, with an informational message when the file becomes readable again.
Before This Change
```
ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
# Repeated every second, forever
```
After This Change
```
ERROR file /var/log/secure: permission denied
# ... silence ...
INFO file /var/log/secure: now readable, resuming collection
```
Why This Matters
Log spam from permission errors:
- Fills up your log storage
- Makes it harder to find real issues
- Can trigger false alerts on error counts
This small change significantly improves operational hygiene.
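The underlying pattern, "log once per state transition", is broadly useful. Here is a generic sketch of the idea; this is not the filelog receiver's actual code, just an illustration of the technique:

```python
# Generic "log once per state transition" deduper, similar in spirit
# to what the filelog receiver now does for permission errors.
class PermissionErrorDeduper:
    def __init__(self):
        self._denied = set()  # files currently in the "denied" state
        self.messages = []    # stand-in for a real logger

    def on_read_attempt(self, path, ok):
        if not ok:
            if path not in self._denied:
                # First failure for this file: log it once.
                self._denied.add(path)
                self.messages.append(f"ERROR file {path}: permission denied")
            # Repeated failures for the same file stay silent.
        elif path in self._denied:
            # File became readable again: log the recovery.
            self._denied.remove(path)
            self.messages.append(f"INFO file {path}: now readable, resuming collection")

dedup = PermissionErrorDeduper()
for ok in (False, False, False, True):
    dedup.on_read_attempt("/var/log/secure", ok)
print(dedup.messages)  # one ERROR, then one INFO
```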
6. Trace Flags Policy for Sampling
What's New: A new trace_flags policy for the tail sampling processor lets you make sampling decisions based on trace flags.
What Are Trace Flags?
Trace flags are part of the W3C Trace Context standard. The most common flag is the "sampled" bit, which indicates whether the trace was marked for sampling upstream.
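In the W3C `traceparent` header, the flags are the trailing two hex digits, and bit 0 is the sampled flag. A minimal sketch of reading it (the header value is an example from the Trace Context spec format, not from this release):

```python
# Parse the "sampled" bit from a W3C traceparent header.
# Format: version-traceid-parentid-flags, e.g. "00-<32 hex>-<16 hex>-01"
def is_sampled(traceparent: str) -> bool:
    version, trace_id, parent_id, flags = traceparent.split("-")
    return int(flags, 16) & 0x01 == 0x01

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(is_sampled(hdr))              # True  (flags = 01)
print(is_sampled(hdr[:-2] + "00"))  # False (flags = 00)
```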
Use Case: Honor Upstream Sampling Decisions
```yaml
processors:
  tail_sampling:
    policies:
      - name: honor-upstream-sampling
        type: trace_flags
        trace_flags:
          sampled: true  # Keep traces marked as sampled
```
Use Case: Force Sample Unsampled Traces with Errors
```yaml
processors:
  tail_sampling:
    policies:
      - name: sample-errors-even-if-unsampled
        type: and
        and:
          and_sub_policy:
            - name: not-sampled
              type: trace_flags
              trace_flags:
                sampled: false
            - name: has-error
              type: status_code
              status_code: {status_codes: [ERROR]}
```
7. GCP FaaS Attribute Migration (faas.id → faas.instance)
What's New: The processor.resourcedetection.removeGCPFaaSID feature gate is now stable and always enabled. The faas.id attribute is replaced by faas.instance.
What Changed?
| Before | After |
|---|---|
| `faas.id: "abc123"` | `faas.instance: "abc123"` |
Why This Matters
This aligns with the OpenTelemetry semantic conventions. The faas.instance attribute better represents "the execution environment instance" rather than just an ID.
Migration Steps
- Update any dashboards or alerts that filter on faas.id
- Search your codebase for references to faas.id
- Update to use faas.instance instead
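If you need a transition period, one option is to temporarily copy the new attribute back to the old key with the transform processor. This OTTL statement is a sketch of that compatibility shim, not something shipped in the release:

```yaml
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # Temporary shim: duplicate faas.instance under the legacy key
          # until dashboards and alerts are migrated
          - set(attributes["faas.id"], attributes["faas.instance"]) where attributes["faas.instance"] != nil
```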
8. Improved Workflow Job Trace Structure
What's New: Step spans are now siblings of the queue span, sitting directly under the job span, instead of being nested as children of the queue span.
Before (Nested Structure)
```
Job Span
└── Queue/Job Span
    ├── Step 1 Span
    ├── Step 2 Span
    └── Step 3 Span
```
After (Sibling Structure)
```
Job Span
├── Queue/Job Span
├── Step 1 Span
├── Step 2 Span
└── Step 3 Span
```
Why This Is Better
- Clearer visualization in trace UIs
- Steps are directly associated with the job, not buried under queue processing
- Easier to calculate total step duration vs. queue wait time
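That last point is easy to see with span durations laid out flat. A sketch with made-up timestamps (seconds), since steps no longer overlap with the queue span's subtree:

```python
# With steps as siblings of the queue span, queue wait and step work
# can be summed independently. Timestamps below are illustrative.
spans = {
    "queue":  (0.0, 4.0),    # time spent waiting in the queue
    "step-1": (4.0, 10.0),
    "step-2": (10.0, 15.0),
    "step-3": (15.0, 18.0),
}

queue_wait = spans["queue"][1] - spans["queue"][0]
step_total = sum(end - start
                 for name, (start, end) in spans.items()
                 if name.startswith("step"))
print(f"queue wait: {queue_wait}s, total step time: {step_total}s")
```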
9. Prometheus Receiver: Extra Scrape Metrics Ignored by Default
What's New: The report_extra_scrape_metrics configuration option is now ignored by default (feature gate promoted to beta).
What This Means
Previously, the Prometheus receiver could report additional metrics about the scrape process itself. These extra metrics are now disabled by default to reduce metric cardinality.
If You Need Them
You can re-enable them by setting the feature gate:
```shell
otelcol --feature-gates=-receiver.prometheusreceiver.RemoveReportExtraScrapeMetricsConfig
```
Best Practice
Only enable extra scrape metrics if you're actively debugging Prometheus scrape issues. For most production deployments, the default (disabled) is correct.
10. Removable Prometheus Service Discoveries via Build Tags
What's New: Prometheus service discoveries can now be excluded at build time using Go build tags.
Why This Matters
The OpenTelemetry Collector binary can get large when it includes all Prometheus service discovery mechanisms (Kubernetes, Consul, EC2, Azure, etc.). If you only use Kubernetes SD, you're shipping unnecessary code.
Building a Lighter Collector
```shell
# Include only Kubernetes service discovery
go build -tags "promsd_kubernetes" ./cmd/otelcol-contrib

# Exclude all service discoveries except static
go build -tags "promsd_none" ./cmd/otelcol-contrib
```
Benefits
- Smaller binary size
- Reduced attack surface
- Faster startup times
Wrapping Up
This release of OpenTelemetry Collector Contrib demonstrates the project's commitment to solving real-world observability challenges. From cost-effective trace archival with GCS export to protecting your collectors with trace size limits, these features address pain points that teams face daily.
Key Takeaways
- Use GCS export for cost-effective long-term trace retention
- Set trace size limits to protect against memory exhaustion
- Monitor hugepages if you run high-performance workloads
- Exclude namespaces to reduce k8s processor load
- Migrate from faas.id to faas.instance for GCP workloads
Next Steps
- Review the full changelog for additional changes
- Test these features in your staging environment
- Join the CNCF Slack #otel-collector channel for community support
Stay updated on the latest cloud-native releases by following Relnx. Never miss a feature release again.