Ahmed Zidan for AWS Community Builders

Posted on • Originally published at relnx.io

OpenTelemetry Collector Contrib V0.145.0: 10 Features That Will Transform Your Observability Pipeline

The OpenTelemetry Collector Contrib project continues to evolve at a rapid pace, and the latest release is packed with features that address real-world observability challenges. Whether you're running workloads on GCP, managing Kubernetes clusters, or trying to tame your log volumes, this release has something for you.

Let's dive into the 10 most impactful features and see how they can improve your observability stack.


1. Export Traces to Google Cloud Storage

What's New: You can now export traces directly to Google Cloud Storage (GCS).

This is huge for teams that need long-term trace retention without the cost of keeping everything in a real-time trace backend. Think of it as a "cold storage" tier for your traces.

Why This Matters

Traditional trace backends like Jaeger or Tempo are optimized for real-time querying, but storing months of trace data gets expensive fast. With GCS export, you can:

  • Archive traces for compliance and auditing
  • Reduce costs by moving older traces to cheaper storage
  • Build custom analytics pipelines on historical trace data

Example Configuration

exporters:
  googlecloudstorage:
    bucket: "my-traces-bucket"
    prefix: "traces/"
    compression: gzip

service:
  pipelines:
    traces:
      exporters: [googlecloudstorage]

Best Practice

Use the GCS exporter alongside your primary trace backend. Send real-time traces to Jaeger/Tempo for immediate debugging, and batch export to GCS for long-term retention.
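The dual-destination setup described above might look like the sketch below. The `otlp/tempo` exporter name and the Tempo endpoint are placeholders for your own real-time backend:

```yaml
exporters:
  otlp/tempo:
    endpoint: "tempo.example.com:4317"  # placeholder for your real-time backend
  googlecloudstorage:
    bucket: "my-traces-bucket"
    prefix: "traces/"
    compression: gzip

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Fan out: Tempo for live debugging, GCS for long-term retention
      exporters: [otlp/tempo, googlecloudstorage]
```

Exporters in a pipeline each receive a copy of the data, so no extra routing is needed for this fan-out.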


2. Limit Maximum Trace Size in Tail Sampling

What's New: The tail sampling processor now supports maximum_trace_size_bytes to limit the memory footprint of individual traces. Traces exceeding this byte limit are immediately dropped—no sampling decision is made for them.

The Problem This Solves

Tail sampling holds traces in memory while waiting for all spans to arrive before making a sampling decision. This is powerful, but it creates a vulnerability: occasionally, a single trace can grow to an enormous size (think: a batch job creating thousands of spans), causing spiky memory consumption that can crash your collector.

The memory limiter processor doesn't fully solve this because it applies backpressure while traces are waiting for decisions, which can degrade sampling accuracy and overall throughput.

How It Works

processors:
  tail_sampling:
    decision_wait: 10s
    maximum_trace_size_bytes: 5242880  # 5 MB per trace
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

When a trace's in-memory size exceeds maximum_trace_size_bytes, it's immediately dropped without waiting for the decision_wait period. No sampling decision is made—the trace is simply discarded to protect collector stability.

Real-World Scenario

Consider a data processing pipeline that creates a span for each record processed. A batch of 100,000 records generates 100,000 spans in a single trace. Each span might be 500 bytes, resulting in a 50MB trace sitting in memory. Without limits, a few concurrent batches could exhaust your collector's memory.

With maximum_trace_size_bytes: 5242880 (5MB), oversized traces are dropped early, protecting your collector while still sampling normal-sized traces correctly.

Best Practice

Use this alongside the memory limiter processor for defense in depth:

  • maximum_trace_size_bytes protects against individual large traces
  • Memory limiter protects against overall memory pressure from many traces
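A combined configuration could look like this; the memory limits are illustrative values you should size for your own deployment:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # hard ceiling for collector memory
    spike_limit_mib: 300   # headroom for short bursts
  tail_sampling:
    decision_wait: 10s
    maximum_trace_size_bytes: 5242880  # drop any single trace over 5 MB
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

service:
  pipelines:
    traces:
      # memory_limiter should run first in the chain
      processors: [memory_limiter, tail_sampling]
```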

3. Linux Hugepages Memory Monitoring

What's New: Monitor hugepages usage on Linux hosts via the new system.memory.linux.hugepages metrics.

What Are Hugepages?

Standard memory pages on Linux are 4KB. Hugepages are larger (typically 2MB or 1GB), and they're critical for high-performance applications like databases, in-memory caches, and VMs.

Why Monitor Them?

If your application expects hugepages but they're exhausted, performance tanks. Now you can track:

  • system.memory.linux.hugepages.usage - Currently used hugepages
  • system.memory.linux.hugepages.free - Available hugepages
  • system.memory.linux.hugepages.total - Total configured hugepages
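Assuming these ship as optional metrics on the hostmetrics receiver's memory scraper (check the receiver's documentation for the exact names and defaults before relying on this), enabling them might look like:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      memory:
        metrics:
          # Opt in to the new hugepages metrics (assumed disabled by default)
          system.memory.linux.hugepages.usage:
            enabled: true
          system.memory.linux.hugepages.free:
            enabled: true
          system.memory.linux.hugepages.total:
            enabled: true
```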

Example Alert

# Prometheus alert for low hugepages
- alert: HugepagesExhausted
  expr: system_memory_linux_hugepages_free < 10
  for: 5m
  annotations:
    summary: "Hugepages nearly exhausted on {{ $labels.host }}"

Who Needs This?

  • Teams running Redis, PostgreSQL, or MongoDB in production
  • Anyone using DPDK for high-performance networking
  • VM hosts using KVM with hugepages backing

4. Exclude Namespaces from Kubernetes Watching

What's New: The k8sobjects receiver now supports excluding specific Kubernetes namespaces from being watched.

The Problem

The k8sobjects receiver watches Kubernetes objects (events, pods, deployments, etc.) and converts them to logs. In large clusters, watching all namespaces generates massive amounts of data. You often want to exclude system namespaces or namespaces managed by other tools.

Configuration

receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        namespaces: []  # Empty means all namespaces
        exclude_namespaces:
          - kube-system
          - kube-public
          - kube-node-lease
      - name: pods
        mode: pull
        interval: 30s
        exclude_namespaces:
          - kube-system

Use Cases

  • Reduce noise: Exclude kube-system events that flood your logs
  • Compliance: Only watch specific namespaces for audit purposes
  • Multi-tenancy: Different collectors for different namespace groups
  • Cost control: Reduce log volume by excluding high-churn namespaces

Best Practice

Exclude namespaces that:

  • Generate high volumes of Kubernetes events you don't need
  • Are managed by separate observability pipelines
  • Contain system components (kube-system, monitoring infrastructure)

5. Suppress Repeated Permission Denied Errors

What's New: The filelog receiver now logs only one permission denied error per file per process run, with an informational message when the file becomes readable again.

Before This Change

ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
# Repeated every second, forever

After This Change

ERROR file /var/log/secure: permission denied
# ... silence ...
INFO file /var/log/secure: now readable, resuming collection

Why This Matters

Log spam from permission errors:

  • Fills up your log storage
  • Makes it harder to find real issues
  • Can trigger false alerts on error counts

This small change significantly improves operational hygiene.
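No configuration change is needed to get the new behavior; it applies to any filelog setup whose glob can match files the collector cannot read, for example:

```yaml
receivers:
  filelog:
    include:
      - /var/log/*.log
      # Often root-readable only; a denied read is now logged once per
      # process run instead of on every poll
      - /var/log/secure
    start_at: beginning
```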


6. Trace Flags Policy for Sampling

What's New: A new trace_flags policy for the tail sampling processor lets you make sampling decisions based on trace flags.

What Are Trace Flags?

Trace flags are part of the W3C Trace Context standard. The most common flag is the "sampled" bit, which indicates whether the trace was marked for sampling upstream.

Use Case: Honor Upstream Sampling Decisions

processors:
  tail_sampling:
    policies:
      - name: honor-upstream-sampling
        type: trace_flags
        trace_flags:
          sampled: true  # Keep traces marked as sampled

Use Case: Force Sample Unsampled Traces with Errors

processors:
  tail_sampling:
    policies:
      - name: sample-errors-even-if-unsampled
        type: and
        and:
          - name: not-sampled
            type: trace_flags
            trace_flags:
              sampled: false
          - name: has-error
            type: status_code
            status_code: {status_codes: [ERROR]}

7. GCP FaaS Attribute Migration (faas.id → faas.instance)

What's New: The processor.resourcedetection.removeGCPFaaSID feature gate is now stable and always enabled. The faas.id attribute is replaced by faas.instance.

What Changed?

Before                After
faas.id: "abc123"     faas.instance: "abc123"

Why This Matters

This aligns with the OpenTelemetry semantic conventions. The faas.instance attribute better represents "the execution environment instance" rather than just an ID.

Migration Steps

  1. Update any dashboards or alerts that filter on faas.id
  2. Search your codebase for references to faas.id
  3. Update to use faas.instance instead
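If dashboards and alerts cannot all be updated at once, a temporary shim with the transform processor can copy the new attribute back under the old name during the transition. Treat this as a stopgap sketch, not a recommendation to keep faas.id alive long-term:

```yaml
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # Recreate the legacy attribute until consumers are migrated
          - set(attributes["faas.id"], attributes["faas.instance"]) where attributes["faas.instance"] != nil
```

Remove the shim once nothing downstream queries faas.id anymore.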

8. Improved Workflow Job Trace Structure

What's New: Step spans are now siblings of the queue span (both are direct children of the job span) instead of being nested under the queue span.

Before (Nested Structure)

Job Span
└── Queue/Job Span
    ├── Step 1 Span
    ├── Step 2 Span
    └── Step 3 Span

After (Sibling Structure)

Job Span
├── Queue/Job Span
├── Step 1 Span
├── Step 2 Span
└── Step 3 Span

Why This Is Better

  • Clearer visualization in trace UIs
  • Steps are directly associated with the job, not buried under queue processing
  • Easier to calculate total step duration vs. queue wait time

9. Prometheus Receiver: Extra Scrape Metrics Ignored by Default

What's New: The report_extra_scrape_metrics configuration option is now ignored by default (feature gate promoted to beta).

What This Means

Previously, the Prometheus receiver could report additional metrics about the scrape process itself. These extra metrics are now disabled by default to reduce metric cardinality.

If You Need Them

You can re-enable them by setting the feature gate:

otelcol --feature-gates=-receiver.prometheusreceiver.RemoveReportExtraScrapeMetricsConfig

Best Practice

Only enable extra scrape metrics if you're actively debugging Prometheus scrape issues. For most production deployments, the default (disabled) is correct.


10. Removable Prometheus Service Discoveries via Build Tags

What's New: Prometheus service discoveries can now be excluded at build time using Go build tags.

Why This Matters

The OpenTelemetry Collector binary can get large when it includes all Prometheus service discovery mechanisms (Kubernetes, Consul, EC2, Azure, etc.). If you only use Kubernetes SD, you're shipping unnecessary code.

Building a Lighter Collector

# Include only Kubernetes service discovery
go build -tags "promsd_kubernetes" ./cmd/otelcol-contrib

# Exclude all service discoveries except static
go build -tags "promsd_none" ./cmd/otelcol-contrib

Benefits

  • Smaller binary size
  • Reduced attack surface
  • Faster startup times

Wrapping Up

This release of OpenTelemetry Collector Contrib demonstrates the project's commitment to solving real-world observability challenges. From cost-effective trace archival with GCS export to protecting your collectors with trace size limits, these features address pain points that teams face daily.

Key Takeaways

  1. Use GCS export for cost-effective long-term trace retention
  2. Set trace size limits to protect against memory exhaustion
  3. Monitor hugepages if you run high-performance workloads
  4. Exclude namespaces to reduce k8s processor load
  5. Migrate from faas.id to faas.instance for GCP workloads

Next Steps


Stay updated on the latest cloud-native releases by following Relnx. Never miss a feature release again.
