Adnan Rahić

Originally published at bindplane.com
How to Build Resilient Telemetry Pipelines with the OpenTelemetry Collector: High Availability and Gateway Architecture

Do you remember when humans used to write step-by-step tutorials?

Let’s bring that back. Today you’ll learn how to configure high availability for the OpenTelemetry Collector so you don’t lose telemetry during node failures, rolling upgrades, or traffic spikes. The guide covers both Docker and Kubernetes samples with hands-on demos of configs.

But first, let’s lay some groundwork.

How do you define High Availability (HA) with the OpenTelemetry Collector?

You want to ensure telemetry collection and processing keep working even if individual Collector instances fail. That boils down to three main points:

  • Avoid data loss when exporting to a dead observability backend.

  • Ensure telemetry continuity during rolling updates or infrastructure failures.

  • Enable horizontal scalability for load-balancing traces, logs, and metrics.

To enable high availability, it’s recommended that you use the Agent-Gateway deployment pattern. This means:

  • Agent Collectors run on every host, container, or node.

  • Gateway Collectors are centralized, scalable back-end services receiving telemetry from Agent Collectors.

  • Each layer can be scaled independently and horizontally.

Please note, an Agent Collector and a Gateway Collector are essentially the same binary. They’re completely identical. The ONLY difference is WHERE each one runs. Think of it this way:

  • An Agent Collector runs close to the workload. In Kubernetes, that could be a sidecar or a Deployment per namespace; in Docker, a service alongside your app in the docker-compose.yaml. This usually means the dev team owns this instance of the Collector.

  • A Gateway Collector is a central (standalone) deployment of the Collector, such as a standalone Collector in a dedicated namespace or even a dedicated Kubernetes cluster, typically owned by the platform team. It’s the final step of the telemetry pipeline, letting the platform team enforce policies like filtering logs, sampling traces, and dropping metrics before sending data to an observability backend.

Here’s an awesome explanation on StackOverflow. Yes, it’s still a thing. No, not everything is explained by AI. 😂

To satisfy all of these high availability requirements, I’ll walk you through how to configure:

  • Multiple Collector Instances. Each instance is capable of handling the full workload with redundant storage for temporary data buffering.

  • A Load Balancer. It’ll distribute incoming telemetry data and maintain consistent routing. Load balancers also support automatic failover if a collector becomes unavailable.

  • Shared Storage. Persistent storage for collector state and configuration management.

Now it’s time to get our hands dirty with some hands-on coding.

Configure Agent-Gateway High Availability (HA) with the OpenTelemetry Collector

Let me first explain this concept by using Docker and visualize it with Bindplane. This architecture is transferable and usable for any type of Linux or Windows VM setup as well. More about Kubernetes further below.

There are three options you can use: a load balancer like Nginx or Traefik, the loadbalancing exporter that’s available in the Collector, or, if you’re fully committed to a containerized environment, Kubernetes-native load balancing.

Nginx Load Balancer

The Nginx option is the simpler, out-of-the-box solution.

I’ll set up the architecture with:

  • Three Gateway Collectors in parallel

  • One Nginx load balancer

  • One Agent Collector configured to generate telemetry (app simulation)

This structure is the bare-bones minimum you’ll end up running. Note that you need three separate services for the gateway collectors because each collector needs its own file_storage path to store data in its persistent queue. In Docker, that means making sure each container gets a unique volume. Let me explain how that works.

Copy the content below into a docker-compose.yaml.

version: '3.8'

volumes:
  gw1-storage:      # persistent queue for gateway-1
  gw2-storage:      # persistent queue for gateway-2
  gw3-storage:      # persistent queue for gateway-3
  telgen-storage:   # persistent queue for telemetry generator
  external-gw-storage: # persistent queue for external gateway

services:
  # ────────────── GATEWAYS (3×) ──────────────
  gw1:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: gw1
    hostname: gw1
    command: ["--config=/etc/otel/config/config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - gw1-storage:/etc/otel/storage  # 60 GiB+ queue
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
      OPAMP_SECRET_KEY: "<secret>"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/gw1-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml

  gw2:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: gw2
    hostname: gw2
    command: ["--config=/etc/otel/config/config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - gw2-storage:/etc/otel/storage  # 60 GiB+ queue
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
      OPAMP_SECRET_KEY: "<secret>"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/gw2-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
    
  gw3:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: gw3
    hostname: gw3
    command: ["--config=/etc/otel/config/config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - gw3-storage:/etc/otel/storage  # 60 GiB+ queue
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
      OPAMP_SECRET_KEY: "<secret>"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/gw3-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml

  # ────────────── OTLP LOAD-BALANCER ──────────────
  otlp-lb:
    image: nginx:1.25-alpine
    volumes:
      - ./nginx-otlp.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP/JSON
    depends_on: [gw1, gw2, gw3]

  # ────────────── TELEMETRY GENERATOR ──────────────
  telgen:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: telgen
    hostname: telgen
    command: ["--config=/etc/otel/config/config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - telgen-storage:/etc/otel/storage  # 60 GiB+ queue
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
      OPAMP_SECRET_KEY: "<secret>"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/telgen-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/telgen-config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml

  # ────────────── EXTERNAL GATEWAY ──────────────
  external-gw:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: external-gw
    hostname: external-gw
    command: ["--config=/etc/otel/config/external-gw-config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - external-gw-storage:/etc/otel/storage  # 60 GiB+ queue
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
      OPAMP_SECRET_KEY: "<secret>"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/external-gw-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/external-gw-config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml

Open your Bindplane instance and click the Install Agent button.

Set the platform to Linux, since I’m demoing this with Docker, and hit next.

This screen now shows the environment variables you'll need to replace in the docker-compose.yaml

Go ahead and replace the OPAMP_SECRET_KEY with your own secret key from Bindplane. If you’re using a self-hosted instance of Bindplane, replace the OPAMP_ENDPOINT as well. Use the values after -e and -s, which represent the endpoint and secret.
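For reference, once you’ve swapped in your values, each service’s environment block in the docker-compose.yaml should look roughly like this (the secret below is a made-up placeholder, not a real key):

environment:
  OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # value after -e, or your self-hosted endpoint
  OPAMP_SECRET_KEY: "01HXXXXXXXXXXXXXXXXXXXXXXX"        # value after -s (placeholder)
  OPAMP_LABELS: ephemeral=true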

Create an nginx-otlp.conf file for the load balancer.

worker_processes auto;
events { worker_connections 1024; }

stream {
  upstream otlp_grpc {
    server gw1:4317 max_fails=3 fail_timeout=15s;
    server gw2:4317 max_fails=3 fail_timeout=15s;
    server gw3:4317 max_fails=3 fail_timeout=15s;
  }
  server {
    listen 4317;            # gRPC
    proxy_pass otlp_grpc;
    proxy_connect_timeout 1s;
    proxy_timeout 30s;
  }
}

http {
  upstream otlp_http {
    server gw1:4318 max_fails=3 fail_timeout=15s;
    server gw2:4318 max_fails=3 fail_timeout=15s;
    server gw3:4318 max_fails=3 fail_timeout=15s;
  }
  server {
    listen 4318;            # HTTP/JSON
    location / {
      proxy_pass http://otlp_http;
      proxy_next_upstream error timeout http_502 http_503 http_504;
    }
  }
}

Create a ./config directory in the same directory as your docker-compose.yaml, and create three files.

> config/
    config.yaml
    telgen-config.yaml
    logging.yaml

Paste this basic config into config.yaml and telgen-config.yaml so the BDOT Collector has a base config to start from. I’ll then configure it with Bindplane.

receivers:
  nop:
processors:
  batch:
exporters:
  nop:
service:
  pipelines:
    metrics:
      receivers: [nop]
      processors: [batch]
      exporters: [nop]
  telemetry:
    metrics:
      level: none

And, a base setup for the logging.yaml.

output: stdout
level: info

Start the Docker Compose services.

docker compose up -d
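Before moving on, it’s worth a quick check that all the containers came up cleanly. These are plain Docker Compose commands, nothing Bindplane-specific:

docker compose ps
docker compose logs --tail 20 gw1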

Jump into Bindplane and create three configurations for:

  • telgen

  • otlp-lb-gw

  • external-gw

The telgen configuration has a Telemetry Generator source.

And, an OTLP destination.

The OTLP destination is configured to send telemetry to the otlp-lb hostname, which is the hostname for the Nginx load balancer I’m running in Docker Compose.

Next, the otlp-lb-gw configuration has an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.

The destination is also OTLP, but instead sending to the external-gw hostname.

Finally, the external-gw configuration is again using an identical OTLP source.

And, a Dev Null destination.

This setup enables you to drop in whatever destination you want in the list of destinations for the external-gw configuration. Go wild! 😂

If you open the processor node for the Dev Null destination, you’ll see logs flowing through the load balancer. 

While in the otlp-lb-gw configuration, if you open a processor node, you’ll see evenly distributed load across all three collectors.

That’s how you load balance telemetry across multiple collectors with Nginx.
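If you want to double-check the path through Nginx without relying on the telemetry generator, you can push a single hand-written OTLP/HTTP log record at the load balancer with curl. This is just a smoke test, assuming the compose stack above is running locally:

curl -sS -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{
    "resourceLogs": [{
      "resource": {
        "attributes": [{ "key": "service.name", "value": { "stringValue": "curl-smoke-test" } }]
      },
      "scopeLogs": [{
        "logRecords": [{ "body": { "stringValue": "hello through the otlp-lb" } }]
      }]
    }]
  }'

If that record shows up in the throughput graphs for the gateway and external-gw configurations, the Nginx → gateway → external gateway chain is wired up correctly.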

If you would rather apply these configs via the Bindplane CLI, get the files on GitHub, here.

Load Balancing Exporter

The second option is to use the dedicated loadbalancing exporter in the collector. With this exporter you can specify multiple downstream collectors that will receive the telemetry traffic equally.

One quick note about the load balancing exporter before we go further. You don’t always need it. Its main job is to make sure spans from the same trace stick together and get routed to the same backend collector. That’s super useful for distributed tracing with sampling. But if you’re just shipping logs and metrics, or even traces without fancy sampling rules, you can probably skip it and stick with Nginx.
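If you do end up needing trace-aware routing, the exporter exposes a routing_key field. Here’s a minimal fragment (the full destination config comes a bit further down); traceID is the documented default for traces, so setting it is mostly about being explicit:

exporters:
  loadbalancing:
    routing_key: traceID      # spans sharing a trace ID land on the same gateway
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - gw1:4317
          - gw2:4317
          - gw3:4317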

I’ll set up the architecture just as I did above, but with yet another collector in place of the Nginx load balancer:

  • Three Gateway Collectors in parallel

  • One Gateway Collector using the loadbalancing exporter

  • One Agent Collector configured to generate telemetry (app simulation)

This behaves identically to the Nginx load balancer, but it requires one less step and less configuration overhead. There’s no need to configure and run Nginx or manage Nginx-specific files; instead, you run one more instance of the collector with a trusty collector config.yaml that you’re already familiar with.

The drop-in replacement for the use case above is as follows. In the docker-compose.yaml, replace the otlp-lb Nginx service with another collector named lb.

services:


# ...

  lb:
    image: ghcr.io/observiq/bindplane-agent:1.79.2
    container_name: lb
    hostname: lb
    command: ["--config=/etc/otel/config/lb-config.yaml"]
    volumes:
      - ./config:/etc/otel/config
      - lb-storage:/etc/otel/storage
    ports:
      - "4317:4317"   # OTLP gRPC - external endpoint
      - "4318:4318"   # OTLP HTTP/JSON - external endpoint
    environment:
      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"
      OPAMP_SECRET_KEY: "01JFJGVKWHQ1SPQVDGZEHVA995"
      OPAMP_LABELS: ephemeral=true
      MANAGER_YAML_PATH: /etc/otel/config/lb-manager.yaml
      CONFIG_YAML_PATH: /etc/otel/config/lb-config.yaml
      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
    depends_on: [gw1, gw2, gw3]

# ...

Create a base lb-config.yaml for this collector instance in the ./config directory. Bindplane will update this remotely once you add a destination for the loadbalancing exporter.

receivers:
  nop:
processors:
  batch:
exporters:
  nop:
service:
  pipelines:
    metrics:
      receivers: [nop]
      processors: [batch]
      exporters: [nop]
  telemetry:
    metrics:
      level: none

Go ahead and restart Docker Compose. 

docker compose down
docker compose up -d

This will start the new lb collector. In Bindplane, go ahead and create a new configuration called lb and add an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.

Now, create a custom destination and paste the loadbalancing exporter configuration in the input field.

loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        timeout: 30s
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_elapsed_time: 300s
          max_interval: 30s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 5000
    resolver:
      static:
        hostnames:
          - gw1:4317
          - gw2:4317
          - gw3:4317

Note that the hostnames correlate to the hostnames of the gateway collectors configured in Docker Compose. Save this configuration and roll it out to the new lb collector. If you open the gw configuration in Bindplane and select a processor node, you’ll see telemetry flowing through all three gateway collector instances.

You’ll see an even nicer split of the telemetry throughput across all collectors in the Agents view.

The lb and external-gw are reporting the same throughput with the three gateway collectors load balancing traffic equally.

The loadbalancing exporter is behaving like a drop-in replacement for Nginx. I would call that a win. Less configuration overhead, fewer moving parts, and no need to learn specific Nginx configs. Instead, focus only on the collector.
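One more option worth knowing about: in environments where gateway instances come and go, the static resolver can be swapped for the dns resolver, so the exporter discovers backends from a single DNS name instead of a hard-coded list. A minimal sketch, assuming a DNS name that resolves to all of your gateway replicas:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: gateways.internal.example   # assumed DNS name resolving to gw1/gw2/gw3
        port: "4317"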

To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.

Since you now have a good understanding of how to configure OpenTelemetry Collector infrastructure for high availability, let’s move into the details of resilience specifically.

Building Resilience into Your Collector

When it comes to resilience, features like retry logic, persistent queues, and batching should be handled in the Agent Collectors. These are the instances sitting closest to your workloads; they’re most at risk of losing data if something goes wrong. The Agent’s job is to collect, buffer, and forward telemetry reliably, even when the backend is flaky or slow.

Here’s how you configure the OpenTelemetry Collector for resilience, so you don’t lose telemetry during network issues or telemetry backend outages:

  • Batching groups signals before export, improving efficiency.

  • Retry ensures failed exports are re-attempted. For critical workloads, increase max_elapsed_time to tolerate longer outages—but be aware this will increase the buffer size on disk.

  • Persistent Queue stores retries on disk, protecting against data loss if the Collector crashes. You can configure:

    • Number of consumers – how many parallel retry workers run
    • Queue size – how many batches are stored
    • Persistence – enables disk buffering for reliability

Retry & Persistent Queue

Luckily enough for you, Bindplane handles both retries and the persistent queue out of the box for OTLP exporters.

Take a look at the telgen configuration. This is the collector we’re running in agent-mode simulating a bunch of telemetry traffic.

In the telgen-config.yaml, you'll see the OTLP exporter is configured with both the persistent queue and retries.

exporters:
    otlp/lb:
        compression: gzip
        endpoint: gw:4317
        retry_on_failure:
            enabled: true
            initial_interval: 5s
            max_elapsed_time: 300s
            max_interval: 30s
        sending_queue:
            enabled: true
            num_consumers: 10
            queue_size: 5000
            storage: file_storage/lb
        timeout: 30s
        tls:
            insecure: true

This is because the advanced settings for every OTLP exporter in Bindplane have this default configuration enabled.

The persistent queue directory here is the storage directory that we configured by creating a volume in Docker. 

# docker-compose.yaml 

...

volumes:
  gw1-storage:      # persistent queue for gateway-1
  gw2-storage:      # persistent queue for gateway-2
  gw3-storage:      # persistent queue for gateway-3
  telgen-storage:   # persistent queue for telemetry generator
  lb-storage:    # persistent queue for load-balancing gateway
  external-gw-storage: # persistent queue for external gateway

...

Bindplane then automatically configures a storage extension in the config and enables it like this:

# telgen-config.yaml 

...
extensions:
    file_storage/lb:
        compaction:
            directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
            on_rebound: true
        directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
service:
    extensions:
        - file_storage/lb
...

Note that the OIQ_OTEL_COLLECTOR_HOME environment variable is actually mapped to the /etc/otel directory.

Now your telemetry pipeline becomes resilient and HA-ready with data persistence to survive restarts, persistent queue buffering to handle temporary outages, and failover recovery to prevent data loss.

Batching

Batching is a whole other story: it’s enabled by adding a batch processor on the processor node before connecting it to the destination.

Agent-mode collectors should batch telemetry before sending it to the gateway collector. The OTLP receiver on the gateway side will receive batches and forward them to your telemetry backend of choice.

In the telgen configuration, click a processor node and add a batch processor.

This config will send a batch of telemetry signals every 200ms, regardless of size. Or, if a batch reaches 8192 items first, it’s sent immediately, regardless of the timeout. Applying this processor in Bindplane will generate a config like this:

# telgen-config.yaml 

...

processors:
    batch/lb: null
    batch/lb-0__processor0:
        send_batch_max_size: 0
        send_batch_size: 8192
        timeout: 200ms

...
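If you manage the collector config by hand instead of through Bindplane, the equivalent upstream batch processor (same 200ms / 8192 values) would be wired into a pipeline roughly like this; the otlp receiver and exporter names here are placeholders for whatever your pipeline actually uses:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]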

Kubernetes-native load balancing with HorizontalPodAutoscaler

Finally, after all the breakdowns, explanations, and diagrams, it’s time to show you what it would look like in the wild with a simple Kubernetes sample.

Kubernetes is the architecture preferred by the Bindplane team and the OpenTelemetry community, and it will maximize the benefits you get from Bindplane as well.

I’ll set up the architecture with:

  • One Agent-mode Collector running per node on the K8s cluster configured to generate telemetry (app simulation)

  • A Gateway Collector Deployment

    • Using a HorizontalPodAutoscaler scaling from 2 to 10 pods
    • And a ClusterIP service
    • Configured with persistent storage, sending queue, and retry
  • An external Gateway Collector running on another cluster acting as a mock telemetry backend

Luckily, getting all the K8s YAML manifests for the collectors is point-and-click from the Bindplane UI. However, you need to build the configurations first, before deploying the collectors to your K8s cluster.

For the sake of simplicity, I’ll show how to spin up two K8s clusters with kind and use them in this demo.

kind create cluster --name kind-2
kind create cluster --name kind-1
# make sure you set the context to the kind-1 cluster first
kubectl config use-context kind-kind-1
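A quick sanity check that both clusters exist and that kubectl is pointed at the right one (plain kind and kubectl commands):

kind get clusters
kubectl config current-context   # should print kind-kind-1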

Next, jump into Bindplane and create three configurations for:

  • telgen-kind-1

  • gw-kind-1

  • external-gw-kind-2

The telgen-kind-1 configuration has a Custom source with a telemetrygeneratorreceiver.

telemetrygeneratorreceiver:
        generators:
            - additional_config:
                body: 127.0.0.1 - - [30/Jun/2025:12:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 512
                severity: 9
              type: logs
        payloads_per_second: 1

And, a Bindplane Gateway destination.

Note: This is identical to any OTLP destination.

The Bindplane Gateway destination is configured to send telemetry to the bindplane-gateway-agent.bindplane-agent.svc.cluster.local hostname, which is the hostname for the Bindplane Gateway Collector service in Kubernetes that you’ll start in a second.

The final step for this configuration is to click a processor node and add a batch processor.

Next, the gw-kind-1 configuration has a Bindplane Gateway source that listens on 0.0.0.0 and ports 4317 and 4318.

The destination is OTLP, and sending telemetry to the IP address (172.18.0.2) of the external gateway running on the second K8s cluster.

Note: This might differ for your clusters. If you are using kind, like I am in this demo, the IP will be 172.18.0.2.
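If you’re not sure which IP applies to your setup, look it up instead of assuming 172.18.0.2. Either of these works with kind, where the control-plane container follows the <cluster-name>-control-plane naming convention:

kubectl --context kind-kind-2 get nodes -o wide
docker container inspect kind-2-control-plane \
  --format '{{ .NetworkSettings.Networks.kind.IPAddress }}'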

Finally, the external-gw-kind-2 configuration is again using an OTLP source.

And, a Dev Null destination.

Feel free to use the Bindplane CLI and these resources to apply all the configurations in one go without having to do it manually in the UI.

With the configurations created, you can install collectors easily by getting manifest files from your Bindplane account. Navigate to the install agents UI in Bindplane and select a Kubernetes environment. Use the Node platform and telgen-kind-1 configuration.

Clicking next will show a manifest file for you to apply in the cluster.

Save this file as node-agent-kind-1.yaml. A sample of it is shown below, or you can see the full file on GitHub, here.

---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
  name: bindplane-agent
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
  name: bindplane-agent
  namespace: bindplane-agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: bindplane-agent
  labels:
    app.kubernetes.io/name: bindplane-agent
rules:
- apiGroups:
  - ""
  resources:
  - events
  - namespaces
  - namespaces/status
  - nodes
  - nodes/spec
  - nodes/stats
  - nodes/proxy
  - pods
  - pods/status
  - replicationcontrollers
  - replicationcontrollers/status
  - resourcequotas
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
    - autoscaling
  resources:
    - horizontalpodautoscalers
  verbs:
    - get
    - list
    - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bindplane-agent
  labels:
    app.kubernetes.io/name: bindplane-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: bindplane-agent
subjects:
- kind: ServiceAccount
  name: bindplane-agent
  namespace: bindplane-agent
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
  name: bindplane-node-agent
  namespace: bindplane-agent
spec:
  ports:
  - appProtocol: grpc
    name: otlp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - appProtocol: http
    name: otlp-http
    port: 4318
    protocol: TCP
    targetPort: 4318
  selector:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: node
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: node
  name: bindplane-node-agent-headless
  namespace: bindplane-agent
spec:
  clusterIP: None
  ports:
  - appProtocol: grpc
    name: otlp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - appProtocol: http
    name: otlp-http
    port: 4318
    protocol: TCP
    targetPort: 4318
  selector:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: node
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: bindplane-node-agent-setup
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: node
  namespace: bindplane-agent
data:
  # This script assumes it is running in /etc/otel.
  setup.sh: |
    # Configure storage/ emptyDir volume permissions so the
    # manager configuration can be written to it.
    chown 10005:10005 storage/

    # Copy config and logging configuration files to storage/
    # hostPath volume if they do not already exist.
    if [ ! -f storage/config.yaml ]; then
      echo '
      receivers:
        nop:
      processors:
        batch:
      exporters:
        nop:
      service:
        pipelines:
          metrics:
            receivers: [nop]
            processors: [batch]
            exporters: [nop]
        telemetry:
          metrics:
            level: none
      ' > storage/config.yaml
    fi
    if [ ! -f storage/logging.yaml ]; then
      echo '
      output: stdout
      level: info
      ' > storage/logging.yaml
    fi
    chown 10005:10005 storage/config.yaml storage/logging.yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bindplane-node-agent
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: node
  namespace: bindplane-agent
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: bindplane-agent
      app.kubernetes.io/component: node
  template:
    metadata:
      labels:
        app.kubernetes.io/name: bindplane-agent
        app.kubernetes.io/component: node
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scheme: http
        prometheus.io/job-name: bindplane-node-agent
    spec:
      serviceAccount: bindplane-agent
      initContainers:
        - name: setup
          image: busybox:latest
          securityContext:
            # Required for changing permissions from
            # root to otel user in emptyDir volume.
            runAsUser: 0
          command: ["sh", "/setup/setup.sh"]
          volumeMounts:
            - mountPath: /etc/otel/config
              name: config
            - mountPath: /storage
              name: storage
            - mountPath: "/setup"
              name: setup
      containers:
        - name: opentelemetry-collector
          image: ghcr.io/observiq/bindplane-agent:1.80.1
          imagePullPolicy: IfNotPresent
          securityContext:
            readOnlyRootFilesystem: true
            # Required for reading container logs hostPath.
            runAsUser: 0
          ports:
            - containerPort: 8888
              name: prometheus
          resources:
            requests:
              memory: 200Mi
              cpu: 100m
            limits:
              memory: 200Mi
          env:
            - name: OPAMP_ENDPOINT
              value: wss://app.bindplane.com/v1/opamp
            - name: OPAMP_SECRET_KEY
              value: <secret>
            - name: OPAMP_AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: OPAMP_LABELS
              value: configuration=telgen-kind-1,container-platform=kubernetes-daemonset,install_id=0979c5c2-bd7a-41c1-89b8-2c16441886ab
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # The collector process updates config.yaml
            # and manager.yaml when receiving changes
            # from the OpAMP server.
            #
            # The config.yaml is persisted by saving it to the
            # hostPath volume, allowing the agent to continue
            # running after restart during an OpAMP server outage.
            #
            # The manager configuration must be re-generated on
            # every startup due to how the bindplane-agent handles
            # manager configuration. It prefers a manager config file
            # over environment variables, meaning it cannot be
            # updated using environment variables if it is persisted.
            - name: CONFIG_YAML_PATH
              value: /etc/otel/storage/config.yaml
            - name: MANAGER_YAML_PATH
              value: /etc/otel/config/manager.yaml
            - name: LOGGING_YAML_PATH
              value: /etc/otel/storage/logging.yaml
          volumeMounts:
            - mountPath: /etc/otel/config
              name: config
            - mountPath: /run/log/journal
              name: runlog
              readOnly: true
            - mountPath: /var/log
              name: varlog
              readOnly: true
            - mountPath: /var/lib/docker/containers
              name: dockerlogs
              readOnly: true
            - mountPath: /etc/otel/storage
              name: storage
      volumes:
        - name: config
          emptyDir: {}
        - name: runlog
          hostPath:
            path: /run/log/journal
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockerlogs
          hostPath:
            path: /var/lib/docker/containers
        - name: storage
          hostPath:
            path: /var/lib/observiq/otelcol/container
        - name: setup
          configMap:
            name: bindplane-node-agent-setup

In short, this manifest deploys the BDOT Collector as a DaemonSet on every node, using OpAMP to receive config from Bindplane. It includes:

  • RBAC to read Kubernetes objects (pods, nodes, deployments, etc.)

  • Services to expose OTLP ports (4317 gRPC, 4318 HTTP)

  • An init container that bootstraps a starter config for the collector, which is replaced by the telgen-kind-1 configuration once the agent connects to Bindplane

  • Persistent hostPath storage for retries and disk buffering

  • Prometheus annotations for metrics scraping

Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.

Go ahead and apply this manifest to the first k8s cluster.

kubectl config use-context kind-kind-1
kubectl apply -f node-agent-kind-1.yaml
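The DaemonSet pods should come up within a few seconds, and the agent should appear in Bindplane shortly after. You can watch it with:

kubectl -n bindplane-agent get daemonset,pods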

Now, install another collector in the K8s cluster, but this time choose the Gateway platform and the gw-kind-1 configuration.

You’ll get a manifest file to apply again, but this time a deployment. Save it as gateway-collector-kind-1.yaml. Here’s what it looks like in GitHub.

Here’s the full manifest as a deployment with a horizontal pod autoscaler.

---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
  name: bindplane-agent
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
  name: bindplane-agent
  namespace: bindplane-agent
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
  name: bindplane-gateway-agent
  namespace: bindplane-agent
spec:
  ports:
  - appProtocol: grpc
    name: otlp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - appProtocol: http
    name: otlp-http
    port: 4318
    protocol: TCP
    targetPort: 4318
  - appProtocol: tcp
    name: splunk-tcp
    port: 9997
    protocol: TCP
    targetPort: 9997
  - appProtocol: tcp
    name: splunk-hec
    port: 8088
    protocol: TCP
    targetPort: 8088
  selector:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
  name: bindplane-gateway-agent-headless
  namespace: bindplane-agent
spec:
  clusterIP: None
  ports:
  - appProtocol: grpc
    name: otlp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - appProtocol: http
    name: otlp-http
    port: 4318
    protocol: TCP
    targetPort: 4318
  selector:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bindplane-gateway-agent
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
  namespace: bindplane-agent
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: bindplane-agent
      app.kubernetes.io/component: gateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: bindplane-agent
        app.kubernetes.io/component: gateway
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scheme: http
        prometheus.io/job-name: bindplane-gateway-agent
    spec:
      serviceAccount: bindplane-agent
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: app.kubernetes.io/name
                      operator: In
                      values:  [bindplane-agent]
                    - key: app.kubernetes.io/component
                      operator: In
                      values: [gateway]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000000000
        runAsGroup: 1000000000
        fsGroup: 1000000000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: setup-volumes
          image: ghcr.io/observiq/bindplane-agent:1.80.1
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000000000
            runAsGroup: 1000000000
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop:
                - ALL
          command:
            - 'sh'
            - '-c'
            - |
              echo '
              receivers:
                nop:
              processors:
                batch:
              exporters:
                nop:
              service:
                pipelines:
                  metrics:
                    receivers: [nop]
                    processors: [batch]
                    exporters: [nop]
                telemetry:
                  metrics:
                    level: none
              ' > /etc/otel/storage/config.yaml
              echo '
              output: stdout
              level: info
              ' > /etc/otel/storage/logging.yaml
          resources:
            requests:
              memory: 200Mi
              cpu: 100m
            limits:
              memory: 200Mi
          volumeMounts:
            - mountPath: /etc/otel/storage
              name: bindplane-gateway-agent-storage
      containers:
        - name: opentelemetry-container
          image: ghcr.io/observiq/bindplane-agent:1.80.1
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000000000
            runAsGroup: 1000000000
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop:
                - ALL
          resources:
            requests:
              memory: 500Mi
              cpu: 250m
            limits:
              memory: 500Mi
          ports:
            - containerPort: 8888
              name: prometheus
          env:
            - name: OPAMP_ENDPOINT
              value: wss://app.bindplane.com/v1/opamp
            - name: OPAMP_SECRET_KEY
              value: <secret>
            - name: OPAMP_AGENT_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPAMP_LABELS
              value: configuration=gw-kind-1,container-platform=kubernetes-gateway,install_id=51dbe4d2-83d2-45c0-ab4a-e0c127a59649
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # The collector process updates config.yaml
            # and manager.yaml when receiving changes
            # from the OpAMP server.
            - name: CONFIG_YAML_PATH
              value: /etc/otel/storage/config.yaml
            - name: MANAGER_YAML_PATH
              value: /etc/otel/config/manager.yaml
            - name: LOGGING_YAML_PATH
              value: /etc/otel/storage/logging.yaml
          volumeMounts:
          - mountPath: /etc/otel/storage
            name: bindplane-gateway-agent-storage
          - mountPath: /etc/otel/config
            name: config
      volumes:
        - name: config
          emptyDir: {}
        - name: bindplane-gateway-agent-storage
          emptyDir: {}
      # Allow exporters to drain their queue for up to
      # five minutes.
      terminationGracePeriodSeconds: 500
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bindplane-gateway-agent
  namespace: bindplane-agent
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bindplane-gateway-agent
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Here’s a breakdown of what this manifest does:

  • Creates a dedicated namespace and service account for the Bindplane Gateway Collector (bindplane-agent).

  • Defines two Kubernetes services:

    • A standard ClusterIP service for OTLP (gRPC/HTTP) and Splunk (TCP/HEC) traffic.
    • A headless service for direct pod discovery, useful in peer-to-peer setups.
  • Deploys the Bindplane Agent as a scalable Deployment:

    • Runs the OpenTelemetry Collector image.
    • Bootstraps basic config via an initContainer.
    • Secure runtime with strict securityContext settings.
    • Prometheus annotations enable metrics scraping.
  • Auto-scales the collector horizontally using an HPA:

    • Scales between 2 and 10 replicas based on CPU utilization.
  • Uses OpAMP to receive remote config and updates from Bindplane.

  • Mounts ephemeral storage for config and persistent queue support using emptyDir.

Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.

Apply it in the first k8s cluster.

kubectl apply -f gateway-collector-kind-1.yaml
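Once applied, the Deployment, the HorizontalPodAutoscaler, and (at minimum) two gateway pods should show up alongside the node agent:

kubectl -n bindplane-agent get deploy,hpa,pods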

Now, create an identical Gateway Collector to the one above, but use the external-gw-kind-2 configuration.

You’ll get a manifest file to apply again, but this time apply it in your second cluster. Save it as gateway-collector-kind-2.yaml. Here’s what it looks like in GitHub. I won’t bother showing you the manifest YAML since it will be identical to the one above.

kubectl config use-context kind-kind-2
kubectl apply -f gateway-collector-kind-2.yaml

Finally, to expose this Gateway Collector’s service and enable OTLP traffic from cluster 1 to cluster 2, I’ll use this NodePort service called gateway-nodeport-service.yaml.

apiVersion: v1
kind: Service
metadata:
  name: bindplane-gateway-agent-nodeport
  namespace: bindplane-agent
  labels:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway
spec:
  type: NodePort
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
    nodePort: 30317
    protocol: TCP
  - name: otlp-http
    port: 4318
    targetPort: 4318
    nodePort: 30318
    protocol: TCP
  selector:
    app.kubernetes.io/name: bindplane-agent
    app.kubernetes.io/component: gateway

And, apply it with:

kubectl apply -f gateway-nodeport-service.yaml
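As a final check, confirm the NodePort service is up on cluster 2 and note the node IP and port that cluster 1’s gateway needs to target. With the kind setup above that works out to 172.18.0.2:30317 for OTLP gRPC, but verify it for your environment:

kubectl --context kind-kind-2 -n bindplane-agent get svc bindplane-gateway-agent-nodeport
kubectl --context kind-kind-2 get nodes -o wide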

Your final setup will look like this.

One Agent-mode collector sending telemetry traffic via a horizontally scaled Gateway-mode collector to an external Gateway running in a separate cluster. This can be any other telemetry backend of your choice.

You’ll have 5 collectors running in total.

And, 3 configurations, where two of them will be scaled between 2 and 10 collector pods.

To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.

Final thoughts

At the end of the day, high availability for the OpenTelemetry Collector means one thing: don’t lose telemetry when stuff breaks.

You want things to keep working when a telemetry backend goes down, a node restarts, or you’re pushing out updates. That’s why the Agent-Gateway pattern exists. That’s why we scale horizontally. That’s why we use batching, retries, and persistent queues.

Set it up once, and sleep better knowing your pipeline won’t fall over at the first hiccup. Keep signals flowing. No drops. No drama.

Want to give Bindplane a try? Spin up a free instance of Bindplane Cloud and hit the ground running right away.
