Do you remember when humans used to write step-by-step tutorials?
Let’s bring that back. Today you’ll learn how to configure high availability for the OpenTelemetry Collector so you don’t lose telemetry during node failures, rolling upgrades, or traffic spikes. The guide covers both Docker and Kubernetes, with hands-on config demos.
But first, let’s lay some groundwork.
What does High Availability (HA) mean for the OpenTelemetry Collector?
You want telemetry collection and processing to keep working even if individual Collector instances fail. It comes down to three main points:
Avoid data loss when exporting to a dead observability backend.
Ensure telemetry continuity during rolling updates or infrastructure failures.
Enable horizontal scalability for load-balancing traces, logs, and metrics.
To enable high availability it’s recommended that you use the Agent-Gateway deployment pattern. This means:
Agent Collectors run on every host, container, or node.
Gateway Collectors are centralized, scalable back-end services receiving telemetry from Agent Collectors.
Each layer can be scaled independently and horizontally.
Please note, an Agent Collector and a Gateway Collector are the same binary. They’re completely identical. The ONLY difference is WHERE they run. Think of it this way.
An Agent Collector runs close to the workload. In Kubernetes that could be a sidecar or a deployment per namespace; in Docker, a service alongside your app in the docker-compose.yaml. The dev team typically owns this instance of the Collector.
A Gateway Collector is a central, standalone deployment of the Collector (think a dedicated namespace or even a dedicated Kubernetes cluster), typically owned by the platform team. It’s the final step of the telemetry pipeline, letting the platform team enforce policies like filtering logs, sampling traces, and dropping metrics before sending data to an observability backend.
Here’s an awesome explanation on StackOverflow. Yes, it’s still a thing. No, not everything is explained by AI. 😂
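To make the split concrete, here’s a minimal sketch of what an agent-mode config could look like: it receives OTLP locally, batches, and forwards everything to a gateway. The gateway-collector:4317 endpoint is a placeholder of my own, not something from this guide.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    # placeholder hostname for your Gateway Collector
    endpoint: gateway-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]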
To satisfy these high availability requirements, I’ll walk you through how to configure:
Multiple Collector Instances. Each instance is capable of handling the full workload with redundant storage for temporary data buffering.
A Load Balancer. It’ll distribute incoming telemetry data and maintain consistent routing. Load balancers also support automatic failover if a collector becomes unavailable.
Shared Storage. Persistent storage for collector state and configuration management.
Now it’s time to get our hands dirty with some hands-on coding.
Configure Agent-Gateway High Availability (HA) with the OpenTelemetry Collector
Let me first explain this concept by using Docker and visualize it with Bindplane. This architecture is transferable and usable for any type of Linux or Windows VM setup as well. More about Kubernetes further below.
There are three options you can use: a load balancer like Nginx or Traefik, the loadbalancing exporter that’s available in the Collector, or, if you’re fully committed to a containerized environment, native load balancing in Kubernetes.
Nginx Load Balancer
The Nginx option is the simpler, out-of-the-box solution.
I’ll set up the architecture with:
Three Gateway Collectors in parallel
One Nginx load balancer
One Agent Collector configured to generate telemetry (app simulation)
This structure is the bare-bones minimum you’ll end up using. Note that there are three separate services for the gateway collectors. The reason is that each collector needs its own file_storage path to store data in the persistent queue. In Docker, this means each container gets a unique volume. Let me explain how that works.
Copy the content below into a docker-compose.yaml.
version: '3.8'
volumes:
gw1-storage: # persistent queue for gateway-1
gw2-storage: # persistent queue for gateway-2
gw3-storage: # persistent queue for gateway-3
telgen-storage: # persistent queue for telemetry generator
external-gw-storage: # persistent queue for external gateway
services:
# ────────────── GATEWAYS (3×) ──────────────
gw1:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: gw1
hostname: gw1
command: ["--config=/etc/otel/config/config.yaml"]
volumes:
- ./config:/etc/otel/config
- gw1-storage:/etc/otel/storage # 60 GiB+ queue
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp" # point to your Bindplane server
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/gw1-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
gw2:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: gw2
hostname: gw2
command: ["--config=/etc/otel/config/config.yaml"]
volumes:
- ./config:/etc/otel/config
- gw2-storage:/etc/otel/storage # 60 GiB+ queue
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp" # point to your Bindplane server
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/gw2-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
gw3:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: gw3
hostname: gw3
command: ["--config=/etc/otel/config/config.yaml"]
volumes:
- ./config:/etc/otel/config
- gw3-storage:/etc/otel/storage # 60 GiB+ queue
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp" # point to your Bindplane server
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/gw3-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
# ────────────── OTLP LOAD-BALANCER ──────────────
otlp-lb:
image: nginx:1.25-alpine
volumes:
- ./nginx-otlp.conf:/etc/nginx/nginx.conf:ro
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP/JSON
depends_on: [gw1, gw2, gw3]
# ────────────── TELEMETRY GENERATOR ──────────────
telgen:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: telgen
hostname: telgen
command: ["--config=/etc/otel/config/config.yaml"]
volumes:
- ./config:/etc/otel/config
- telgen-storage:/etc/otel/storage # 60 GiB+ queue
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp" # point to your Bindplane server
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/telgen-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
# ────────────── EXTERNAL GATEWAY ──────────────
external-gw:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: external-gw
hostname: external-gw
command: ["--config=/etc/otel/config/external-gw-config.yaml"]
volumes:
- ./config:/etc/otel/config
- external-gw-storage:/etc/otel/storage # 60 GiB+ queue
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp" # point to your Bindplane server
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/external-gw-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/external-gw-config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
Open your Bindplane instance and click the Install Agent button.
Set the platform to Linux, since I’m demoing this with Docker, and hit next.
This screen now shows the environment variables you'll need to replace in the docker-compose.yaml.
Go ahead and replace the OPAMP_SECRET_KEY with your own secret key from Bindplane. If you’re using a self-hosted instance of Bindplane, replace your OPAMP_ENDPOINT as well. Use the values after -e and -s, which represent the endpoint and secret.
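Optionally, instead of hard-coding the values, you can keep them in a .env file next to the docker-compose.yaml and reference them as ${OPAMP_ENDPOINT} and ${OPAMP_SECRET_KEY} in the compose file. This is just a convenience sketch on my part, not part of the original setup.
# .env
OPAMP_ENDPOINT=wss://app.bindplane.com/v1/opamp
OPAMP_SECRET_KEY=<secret>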
Create an nginx-otlp.conf file for the load balancer.
worker_processes auto;
events { worker_connections 1024; }
stream {
upstream otlp_grpc {
server gw1:4317 max_fails=3 fail_timeout=15s;
server gw2:4317 max_fails=3 fail_timeout=15s;
server gw3:4317 max_fails=3 fail_timeout=15s;
}
server {
listen 4317; # gRPC
proxy_pass otlp_grpc;
proxy_connect_timeout 1s;
proxy_timeout 30s;
}
}
http {
upstream otlp_http {
server gw1:4318 max_fails=3 fail_timeout=15s;
server gw2:4318 max_fails=3 fail_timeout=15s;
server gw3:4318 max_fails=3 fail_timeout=15s;
}
server {
listen 4318; # HTTP/JSON
location / {
proxy_pass http://otlp_http;
proxy_next_upstream error timeout http_502 http_503 http_504;
}
}
}
Create a ./config directory in the same root directory as your docker-compose.yaml, and create three files:
config/
  config.yaml
  telgen-config.yaml
  logging.yaml
Paste this basic config into config.yaml and telgen-config.yaml so the BDOT Collector has a base config to start with. I’ll then configure it with Bindplane.
receivers:
nop:
processors:
batch:
exporters:
nop:
service:
pipelines:
metrics:
receivers: [nop]
processors: [batch]
exporters: [nop]
telemetry:
metrics:
level: none
And, a base setup for the logging.yaml.
output: stdout
level: info
Start the Docker Compose services.
docker compose up -d
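Optionally, verify that all containers are up and that the agents are connecting to Bindplane before moving on.
docker compose ps
docker compose logs -f gw1   # repeat for gw2, gw3, telgen, and external-gw if needed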
Jump into Bindplane and create three configurations for:
telgen
otlp-lb-gw
external-gw
The telgen configuration has a Telemetry Generator source.
And, an OTLP destination.
The OTLP destination is configured to send telemetry to the otlp-lb hostname, which is the hostname for the Nginx load balancer I’m running in Docker Compose.
Next, the otlp-lb-gw configuration has an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.
The destination is also OTLP, but instead sending to the external-gw hostname.
Finally, the external-gw configuration is again using an identical OTLP source.
And, a Dev Null destination.
This setup enables you to drop in whatever destination you want in the list of destinations for the external-gw configuration. Go wild! 😂
If you open the processor node for the Dev Null destination, you’ll see logs flowing through the load balancer.
While in the otlp-lb-gw configuration, if you open a processor node, you’ll see evenly distributed load across all three collectors.
That’s how you load balance telemetry across multiple collectors with Nginx.
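If you want a quick smoke test outside of Bindplane, you can push a minimal OTLP/HTTP log record through the load balancer from your host. The payload below is a bare-bones example I made up; any valid OTLP JSON will do.
curl -i http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{"resourceLogs":[{"scopeLogs":[{"logRecords":[{"body":{"stringValue":"hello from curl"}}]}]}]}'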
If you would rather apply these configs via the Bindplane CLI, get the files on GitHub, here.
Load Balancing Exporter
The second option is to use the dedicated loadbalancing exporter in the collector. With this exporter you can specify multiple downstream collectors that will receive the telemetry traffic equally.
One quick note about the loadbalancing exporter before we start. You don’t always need it. Its main job is to make sure spans from the same trace stick together and get routed to the same backend collector. That’s super useful for distributed tracing with sampling. But if you’re just shipping logs and metrics, or even traces without fancy sampling rules, you can probably skip it and stick with Nginx.
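For context, trace-aware routing is controlled by the exporter’s routing_key setting. A minimal sketch, assuming the same gw1–gw3 hostnames used later in this demo:
loadbalancing:
  # keep all spans of a trace on the same backend; "service" is another option
  routing_key: traceID
  protocol:
    otlp:
      tls:
        insecure: true
  resolver:
    static:
      hostnames: [gw1:4317, gw2:4317, gw3:4317]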
I’ll set up the architecture just as I did above, but with yet another collector instead of the Nginx load balancer:
Three Gateway Collectors in parallel
One Gateway Collector using the loadbalancing exporter
One Agent Collector configured to generate telemetry (app simulation)
This behaves identically to the Nginx load balancer, but it requires one less moving part and less configuration overhead. There’s no need to configure and run Nginx or manage Nginx-specific files; instead, you run one more instance of the collector with a trusty collector config.yaml you’re already familiar with.
The drop-in replacement for the use case above is as follows. In the docker-compose.yaml, replace the otlp-lb Nginx service with another collector named lb.
services:
# ...
lb:
image: ghcr.io/observiq/bindplane-agent:1.79.2
container_name: lb
hostname: lb
command: ["--config=/etc/otel/config/lb-config.yaml"]
volumes:
- ./config:/etc/otel/config
- lb-storage:/etc/otel/storage
ports:
- "4317:4317" # OTLP gRPC - external endpoint
- "4318:4318" # OTLP HTTP/JSON - external endpoint
environment:
OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"
OPAMP_SECRET_KEY: "<secret>"
OPAMP_LABELS: ephemeral=true
MANAGER_YAML_PATH: /etc/otel/config/lb-manager.yaml
CONFIG_YAML_PATH: /etc/otel/config/lb-config.yaml
LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
depends_on: [gw1, gw2, gw3]
# ...
Create a base lb-config.yaml for this collector instance in the ./config directory, and declare the lb-storage volume in the top-level volumes block of the docker-compose.yaml (you’ll see the full volumes list further down). Bindplane will update this config remotely once you add a destination for the loadbalancing exporter.
receivers:
nop:
processors:
batch:
exporters:
nop:
service:
pipelines:
metrics:
receivers: [nop]
processors: [batch]
exporters: [nop]
telemetry:
metrics:
level: none
Go ahead and restart Docker Compose.
docker compose down
docker compose up -d
This will start the new lb collector. In Bindplane, go ahead and create a new configuration called lb and add an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.
Now, create a custom destination and paste the loadbalancing exporter configuration into the input field.
loadbalancing:
protocol:
otlp:
tls:
insecure: true
timeout: 30s
retry_on_failure:
enabled: true
initial_interval: 5s
max_elapsed_time: 300s
max_interval: 30s
sending_queue:
enabled: true
num_consumers: 10
queue_size: 5000
resolver:
static:
hostnames:
- gw1:4317
- gw2:4317
- gw3:4317
Note that the hostnames correlate to the hostnames of the gateway collectors configured in Docker Compose. Save this configuration and roll it out to the new lb collector. Opening the gw configuration in Bindplane and selecting a processor node, you’ll see the telemetry flowing through all 3 gateway collector instances.
You get an even nicer view of the split by checking the telemetry throughput across all collectors in the Agents view.
The lb and external-gw are reporting the same throughput with the three gateway collectors load balancing traffic equally.
The loadbalancing exporter is behaving like a drop-in replacement for Nginx. I would call that a win. Less configuration overhead, fewer moving parts, and no need to learn specific Nginx configs. Instead, focus only on the collector.
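As a side note, if you’d rather not hard-code backend hostnames, the loadbalancing exporter also supports a dns resolver (and a k8s resolver) that discovers backends dynamically. A minimal sketch, where the hostname is a placeholder DNS name I’m assuming resolves to your gateway instances:
loadbalancing:
  protocol:
    otlp:
      tls:
        insecure: true
  resolver:
    dns:
      hostname: gateways.example.internal  # placeholder; must resolve to gw1/gw2/gw3
      port: "4317"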
To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.
Since you now have a good understanding of how to configure OpenTelemetry Collector infrastructure for high availability, let's move into details about resilience specifically.
Building Resilience into Your Collector
When it comes to resilience, features like retry logic, persistent queues, and batching should be handled in the Agent Collectors. These are the instances sitting closest to your workloads; they’re most at risk of losing data if something goes wrong. The Agent’s job is to collect, buffer, and forward telemetry reliably, even when the backend is flaky or slow.
Here’s how to configure the OpenTelemetry Collector for resilience so you don’t lose telemetry during network issues or telemetry backend outages:
Batching groups signals before export, improving efficiency.
Retry ensures failed exports are re-attempted. For critical workloads, increase max_elapsed_time to tolerate longer outages, but be aware this will increase the buffer size on disk.
Persistent Queue stores retries on disk, protecting against data loss if the Collector crashes. You can configure:
- Number of consumers – how many parallel retry workers run
- Queue size – how many batches are stored
- Persistence – enables disk buffering for reliability
Retry & Persistent Queue
Luckily enough for you, Bindplane handles both retries and the persistent queue out of the box for OTLP exporters.
Take a look at the telgen configuration. This is the collector we’re running in agent-mode simulating a bunch of telemetry traffic.
In the telgen-config.yaml, you'll see the OTLP exporter is configured with both the persistent queue and retries.
exporters:
otlp/lb:
compression: gzip
endpoint: gw:4317
retry_on_failure:
enabled: true
initial_interval: 5s
max_elapsed_time: 300s
max_interval: 30s
sending_queue:
enabled: true
num_consumers: 10
queue_size: 5000
storage: file_storage/lb
timeout: 30s
tls:
insecure: true
This is because the advanced settings for every OTLP exporter in Bindplane have this default configuration enabled.
The persistent queue directory here is the storage directory that we configured by creating a volume in Docker.
# docker-compose.yaml
...
volumes:
gw1-storage: # persistent queue for gateway-1
gw2-storage: # persistent queue for gateway-2
gw3-storage: # persistent queue for gateway-3
telgen-storage: # persistent queue for telemetry generator
lb-storage: # persistent queue for load-balancing gateway
external-gw-storage: # persistent queue for external gateway
...
Bindplane then automatically configures a storage extension in the config and enables it like this:
# telgen-config.yaml
...
extensions:
file_storage/lb:
compaction:
directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
on_rebound: true
directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
service:
extensions:
- file_storage/lb
...
Note that the OIQ_OTEL_COLLECTOR_HOME environment variable is mapped to the /etc/otel directory.
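If you’re curious, you can peek at the persistent queue files inside one of the gateway containers:
docker compose exec gw1 ls -lh /etc/otel/storage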
Now your telemetry pipeline becomes resilient and HA-ready with data persistence to survive restarts, persistent queue buffering to handle temporary outages, and failover recovery to prevent data loss.
Batching
Batching is a whole other story, because you need to add a batch processor on the processor node before connecting it to the destination for it to be enabled.
Agent-mode collectors should batch telemetry before sending it to the gateway collector. The OTLP receiver on the gateway side will receive batches and forward them to your telemetry backend of choice.
In the telgen configuration, click a processor node and add a batch processor.
This config sends a batch of telemetry signals every 200ms regardless of size, or as soon as a batch reaches 8192 items, whichever comes first. Applying this processor in Bindplane will generate a config like this:
# telgen-config.yaml
...
processors:
batch/lb: null
batch/lb-0__processor0:
send_batch_max_size: 0
send_batch_size: 8192
timeout: 200ms
...
Kubernetes-native load balancing with HorizontalPodAutoscaler
Finally, after all the breakdowns, explanations, and diagrams, it’s time to show you what it would look like in the wild with a simple Kubernetes sample.
Kubernetes is the architecture preferred by the Bindplane team and the OpenTelemetry community, and it will also maximize the benefits you get from Bindplane.
I’ll set up the architecture with:
One Agent-mode Collector running per node on the K8s cluster, configured to generate telemetry (app simulation)
A Gateway Collector Deployment:
- Using a HorizontalPodAutoscaler scaling from 2 to 10 pods
- And a ClusterIP service
- Configured with persistent storage, sending queue, and retry
An external Gateway Collector running on another cluster acting as a mock telemetry backend
Luckily, getting all the K8s YAML manifests for the collectors is point-and-click in the Bindplane UI. However, you need to build the configurations first, before deploying the collectors to your K8s cluster.
For the sake of simplicity I’ll show how to spin up two K8s clusters with kind, and use them in this demo.
kind create cluster --name kind-2
kind create cluster --name kind-1
# make sure you set the context to the kind-1 cluster first
kubectl config use-context kind-kind-1
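A quick sanity check that both clusters exist and the right context is active:
kind get clusters            # should list kind-1 and kind-2
kubectl config get-contexts  # kind-kind-1 should be the current context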
Next, jump into Bindplane and create three configurations for:
telgen-kind-1
gw-kind-1
external-gw-kind-2
The telgen-kind-1 configuration has a Custom source with a telemetrygeneratorreceiver.
telemetrygeneratorreceiver:
generators:
- additional_config:
body: 127.0.0.1 - - [30/Jun/2025:12:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 512
severity: 9
type: logs
payloads_per_second: 1
And, a Bindplane Gateway destination.
Note: This is identical to any OTLP destination.
The Bindplane Gateway destination is configured to send telemetry to the bindplane-gateway-agent.bindplane-agent.svc.cluster.local hostname, which is the hostname of the Bindplane Gateway Collector service in Kubernetes that you’ll start in a second.
The final step for this configuration is to click a processor node and add a batch processor.
Next, the gw-kind-1 configuration has a Bindplane Gateway source that listens on 0.0.0.0 and ports 4317 and 4318.
The destination is OTLP, sending telemetry to the IP address (172.18.0.2) of the external gateway running on the second K8s cluster.
Note: This might differ for your clusters. If you are using kind, like I am in this demo, the IP will be 172.18.0.2.
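To look up that IP yourself, you can inspect the kind node container of the second cluster (assuming kind’s default <cluster>-control-plane container naming) or ask Kubernetes directly:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' kind-2-control-plane
kubectl --context kind-kind-2 get nodes -o wide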
Finally, the external-gw-kind-2 configuration is again using an OTLP source.
And, a Dev Null destination.
Feel free to use the Bindplane CLI and these resources to apply all the configurations in one go without having to do it manually in the UI.
With the configurations created, you can install collectors easily by getting manifest files from your Bindplane account. Navigate to the install agents UI in Bindplane and select a Kubernetes environment. Use the Node platform and telgen-kind-1 configuration.
Clicking next will show a manifest file for you to apply in the cluster.
Save this file as node-agent-kind-1.yaml. Below is a sample of what it looks like. Or, see the file on GitHub, here.
---
apiVersion: v1
kind: Namespace
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
name: bindplane-agent
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
name: bindplane-agent
namespace: bindplane-agent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: bindplane-agent
labels:
app.kubernetes.io/name: bindplane-agent
rules:
- apiGroups:
- ""
resources:
- events
- namespaces
- namespaces/status
- nodes
- nodes/spec
- nodes/stats
- nodes/proxy
- pods
- pods/status
- replicationcontrollers
- replicationcontrollers/status
- resourcequotas
- services
verbs:
- get
- list
- watch
- apiGroups:
- apps
resources:
- daemonsets
- deployments
- replicasets
- statefulsets
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
verbs:
- get
- list
- watch
- apiGroups:
- batch
resources:
- jobs
- cronjobs
verbs:
- get
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: bindplane-agent
labels:
app.kubernetes.io/name: bindplane-agent
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: bindplane-agent
subjects:
- kind: ServiceAccount
name: bindplane-agent
namespace: bindplane-agent
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
name: bindplane-node-agent
namespace: bindplane-agent
spec:
ports:
- appProtocol: grpc
name: otlp-grpc
port: 4317
protocol: TCP
targetPort: 4317
- appProtocol: http
name: otlp-http
port: 4318
protocol: TCP
targetPort: 4318
selector:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
sessionAffinity: None
type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
name: bindplane-node-agent-headless
namespace: bindplane-agent
spec:
clusterIP: None
ports:
- appProtocol: grpc
name: otlp-grpc
port: 4317
protocol: TCP
targetPort: 4317
- appProtocol: http
name: otlp-http
port: 4318
protocol: TCP
targetPort: 4318
selector:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
sessionAffinity: None
type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
name: bindplane-node-agent-setup
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
namespace: bindplane-agent
data:
# This script assumes it is running in /etc/otel.
setup.sh: |
# Configure storage/ emptyDir volume permissions so the
# manager configuration can be written to it.
chown 10005:10005 storage/
# Copy config and logging configuration files to storage/
# hostPath volume if they do not already exist.
if [ ! -f storage/config.yaml ]; then
echo '
receivers:
nop:
processors:
batch:
exporters:
nop:
service:
pipelines:
metrics:
receivers: [nop]
processors: [batch]
exporters: [nop]
telemetry:
metrics:
level: none
' > storage/config.yaml
fi
if [ ! -f storage/logging.yaml ]; then
echo '
output: stdout
level: info
' > storage/logging.yaml
fi
chown 10005:10005 storage/config.yaml storage/logging.yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: bindplane-node-agent
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
namespace: bindplane-agent
spec:
selector:
matchLabels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
template:
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: node
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8888"
prometheus.io/scheme: http
prometheus.io/job-name: bindplane-node-agent
spec:
serviceAccount: bindplane-agent
initContainers:
- name: setup
image: busybox:latest
securityContext:
# Required for changing permissions from
# root to otel user in emptyDir volume.
runAsUser: 0
command: ["sh", "/setup/setup.sh"]
volumeMounts:
- mountPath: /etc/otel/config
name: config
- mountPath: /storage
name: storage
- mountPath: "/setup"
name: setup
containers:
- name: opentelemetry-collector
image: ghcr.io/observiq/bindplane-agent:1.80.1
imagePullPolicy: IfNotPresent
securityContext:
readOnlyRootFilesystem: true
# Required for reading container logs hostPath.
runAsUser: 0
ports:
- containerPort: 8888
name: prometheus
resources:
requests:
memory: 200Mi
cpu: 100m
limits:
memory: 200Mi
env:
- name: OPAMP_ENDPOINT
value: wss://app.bindplane.com/v1/opamp
- name: OPAMP_SECRET_KEY
value: <secret>
- name: OPAMP_AGENT_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: OPAMP_LABELS
value: configuration=telgen-kind-1,container-platform=kubernetes-daemonset,install_id=0979c5c2-bd7a-41c1-89b8-2c16441886ab
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
# The collector process updates config.yaml
# and manager.yaml when receiving changes
# from the OpAMP server.
#
# The config.yaml is persisted by saving it to the
# hostPath volume, allowing the agent to continue
# running after restart during an OpAMP server outage.
#
# The manager configuration must be re-generated on
# every startup due to how the bindplane-agent handles
# manager configuration. It prefers a manager config file
# over environment variables, meaning it cannot be
# updated using environment variables if it is persisted.
- name: CONFIG_YAML_PATH
value: /etc/otel/storage/config.yaml
- name: MANAGER_YAML_PATH
value: /etc/otel/config/manager.yaml
- name: LOGGING_YAML_PATH
value: /etc/otel/storage/logging.yaml
volumeMounts:
- mountPath: /etc/otel/config
name: config
- mountPath: /run/log/journal
name: runlog
readOnly: true
- mountPath: /var/log
name: varlog
readOnly: true
- mountPath: /var/lib/docker/containers
name: dockerlogs
readOnly: true
- mountPath: /etc/otel/storage
name: storage
volumes:
- name: config
emptyDir: {}
- name: runlog
hostPath:
path: /run/log/journal
- name: varlog
hostPath:
path: /var/log
- name: dockerlogs
hostPath:
path: /var/lib/docker/containers
- name: storage
hostPath:
path: /var/lib/observiq/otelcol/container
- name: setup
configMap:
name: bindplane-node-agent-setup
In short, this manifest deploys the BDOT Collector as a DaemonSet on every node, using OpAMP to receive config from Bindplane. It includes:
RBAC to read Kubernetes objects (pods, nodes, deployments, etc.)
Services to expose OTLP ports (4317 gRPC, 4318 HTTP)
An init container to bootstrap a starter config for the collector, which is replaced by the telgen-kind-1 configuration once it starts
Persistent hostPath storage for retries and disk buffering
Prometheus annotations for metrics scraping
Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.
Go ahead and apply this manifest to the first k8s cluster.
kubectl config use-context kind-kind-1
kubectl apply -f node-agent-kind-1.yaml
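Verify the DaemonSet came up with one pod per node:
kubectl -n bindplane-agent get daemonset,pods -o wide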
Now, install another collector in the K8s cluster, but this time choose the Gateway platform and the gw-kind-1 configuration.
You’ll get a manifest file to apply again, but this time it’s a Deployment. Save it as gateway-collector-kind-1.yaml. Here’s what it looks like on GitHub.
Here’s the full manifest as a deployment with a horizontal pod autoscaler.
---
apiVersion: v1
kind: Namespace
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
name: bindplane-agent
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
name: bindplane-agent
namespace: bindplane-agent
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
name: bindplane-gateway-agent
namespace: bindplane-agent
spec:
ports:
- appProtocol: grpc
name: otlp-grpc
port: 4317
protocol: TCP
targetPort: 4317
- appProtocol: http
name: otlp-http
port: 4318
protocol: TCP
targetPort: 4318
- appProtocol: tcp
name: splunk-tcp
port: 9997
protocol: TCP
targetPort: 9997
- appProtocol: tcp
name: splunk-hec
port: 8088
protocol: TCP
targetPort: 8088
selector:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
sessionAffinity: None
type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
name: bindplane-gateway-agent-headless
namespace: bindplane-agent
spec:
clusterIP: None
ports:
- appProtocol: grpc
name: otlp-grpc
port: 4317
protocol: TCP
targetPort: 4317
- appProtocol: http
name: otlp-http
port: 4318
protocol: TCP
targetPort: 4318
selector:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
sessionAffinity: None
type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: bindplane-gateway-agent
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
namespace: bindplane-agent
spec:
selector:
matchLabels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
template:
metadata:
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8888"
prometheus.io/scheme: http
prometheus.io/job-name: bindplane-gateway-agent
spec:
serviceAccount: bindplane-agent
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
topologyKey: kubernetes.io/hostname
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values: [bindplane-agent]
- key: app.kubernetes.io/component
operator: In
values: [gateway]
securityContext:
runAsNonRoot: true
runAsUser: 1000000000
runAsGroup: 1000000000
fsGroup: 1000000000
seccompProfile:
type: RuntimeDefault
initContainers:
- name: setup-volumes
image: ghcr.io/observiq/bindplane-agent:1.80.1
securityContext:
runAsNonRoot: true
runAsUser: 1000000000
runAsGroup: 1000000000
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop:
- ALL
command:
- 'sh'
- '-c'
- |
echo '
receivers:
nop:
processors:
batch:
exporters:
nop:
service:
pipelines:
metrics:
receivers: [nop]
processors: [batch]
exporters: [nop]
telemetry:
metrics:
level: none
' > /etc/otel/storage/config.yaml
echo '
output: stdout
level: info
' > /etc/otel/storage/logging.yaml
resources:
requests:
memory: 200Mi
cpu: 100m
limits:
memory: 200Mi
volumeMounts:
- mountPath: /etc/otel/storage
name: bindplane-gateway-agent-storage
containers:
- name: opentelemetry-container
image: ghcr.io/observiq/bindplane-agent:1.80.1
imagePullPolicy: IfNotPresent
securityContext:
runAsNonRoot: true
runAsUser: 1000000000
runAsGroup: 1000000000
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop:
- ALL
resources:
requests:
memory: 500Mi
cpu: 250m
limits:
memory: 500Mi
ports:
- containerPort: 8888
name: prometheus
env:
- name: OPAMP_ENDPOINT
value: wss://app.bindplane.com/v1/opamp
- name: OPAMP_SECRET_KEY
value: <secret>
- name: OPAMP_AGENT_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: OPAMP_LABELS
value: configuration=gw-kind-1,container-platform=kubernetes-gateway,install_id=51dbe4d2-83d2-45c0-ab4a-e0c127a59649
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
# The collector process updates config.yaml
# and manager.yaml when receiving changes
# from the OpAMP server.
- name: CONFIG_YAML_PATH
value: /etc/otel/storage/config.yaml
- name: MANAGER_YAML_PATH
value: /etc/otel/config/manager.yaml
- name: LOGGING_YAML_PATH
value: /etc/otel/storage/logging.yaml
volumeMounts:
- mountPath: /etc/otel/storage
name: bindplane-gateway-agent-storage
- mountPath: /etc/otel/config
name: config
volumes:
- name: config
emptyDir: {}
- name: bindplane-gateway-agent-storage
emptyDir: {}
# Allow exporters to drain their queue for up to
# five minutes.
terminationGracePeriodSeconds: 500
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: bindplane-gateway-agent
namespace: bindplane-agent
spec:
maxReplicas: 10
minReplicas: 2
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: bindplane-gateway-agent
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
Here’s a breakdown of what this manifest does:
Creates a dedicated namespace and service account for the Bindplane Gateway Collector (bindplane-agent).
Defines two Kubernetes services:
- A standard ClusterIP service for OTLP (gRPC/HTTP) and Splunk (TCP/HEC) traffic.
- A headless service for direct pod discovery, useful in peer-to-peer setups.
Deploys the Bindplane Agent as a scalable Deployment:
- Runs the OpenTelemetry Collector image.
- Bootstraps a basic config via an initContainer.
- Secures the runtime with strict securityContext settings.
- Adds Prometheus annotations to enable metrics scraping.
Auto-scales the collector horizontally using an HPA:
- Scales between 2 and 10 replicas based on CPU utilization.
Uses OpAMP to receive remote config and updates from Bindplane.
Mounts ephemeral storage for config and persistent queue support using emptyDir.
Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.
Apply it in the first k8s cluster.
kubectl apply -f gateway-collector-kind-1.yaml
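One thing to double-check: the HPA scales on CPU utilization, which requires metrics-server in the cluster, and kind doesn’t ship it by default. If it’s missing, install the upstream release manifest (on kind you may also need to allow insecure kubelet TLS), then confirm the Deployment and HPA are healthy:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl -n bindplane-agent get deployment,hpa,pods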
Now, create another Gateway Collector identical to the one above, but use the external-gw-kind-2 configuration.
You’ll get a manifest file again, but this time apply it in your second cluster. Save it as gateway-collector-kind-2.yaml. Here’s what it looks like on GitHub. I won’t show the manifest YAML since it’s identical to the one above.
kubectl config use-context kind-kind-2
kubectl apply -f gateway-collector-kind-2.yaml
Finally, to expose this Gateway Collector’s service and enable OTLP traffic from cluster 1 to cluster 2, I’ll use this NodePort service, saved as gateway-nodeport-service.yaml.
apiVersion: v1
kind: Service
metadata:
name: bindplane-gateway-agent-nodeport
namespace: bindplane-agent
labels:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
spec:
type: NodePort
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
nodePort: 30317
protocol: TCP
- name: otlp-http
port: 4318
targetPort: 4318
nodePort: 30318
protocol: TCP
selector:
app.kubernetes.io/name: bindplane-agent
app.kubernetes.io/component: gateway
And, apply it with:
kubectl apply -f gateway-nodeport-service.yaml
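Confirm the NodePort service is exposing ports 30317 and 30318 on the kind-2 node:
kubectl --context kind-kind-2 -n bindplane-agent get svc bindplane-gateway-agent-nodeport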
Your final setup will look like this.
One Agent-mode collector sending telemetry traffic via a horizontally scaled Gateway-mode collector to an external Gateway running in a separate cluster. This can be any other telemetry backend of your choice.
You’ll have 5 collectors running in total.
And, 3 configurations, where two of them will be scaled between 2 and 10 collector pods.
To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.
Final thoughts
At the end of the day, high availability for the OpenTelemetry Collector means one thing: don’t lose telemetry when stuff breaks.
You want things to keep working when a telemetry backend goes down, a node restarts, or you’re pushing out updates. That’s why the Agent-Gateway pattern exists. That’s why we scale horizontally. That’s why we use batching, retries, and persistent queues.
Set it up once, and sleep better knowing your pipeline won’t fall over at the first hiccup. Keep signals flowing. No drops. No drama.
Want to give Bindplane a try? Spin up a free instance of Bindplane Cloud and hit the ground running right away.