DEV Community

Kaio Cunha


Why Istio's Metrics Merging Breaks in Multi-Container Pods (And How to Fix It)

If you run multi-container pods under Istio with STRICT mTLS, you're probably missing metrics

And you might not know it. The containers are healthy. The scrape job shows no errors. But half your metrics are just... absent from Prometheus. No alert, no obvious explanation.

I spent a while debugging this before I understood what was going on, so here's the full picture.


The problem

Istio has a built-in metrics-merging feature that lets Prometheus scrape a pod through the Istio proxy without reaching each container directly. It's useful. But it has a hard limitation that the docs mention only in passing:

Istio's metrics-merge only supports one port per pod.

The Superorbital team wrote the definitive explanation of why this is the case. The short version: Istio's proxy forwards the scrape to a single application port. If you have three containers each exposing /metrics on different ports, Istio picks one and ignores the rest.

Someone opened a feature request for multi-port support back in 2022. It was labeled lifecycle/stale and auto-closed. There are several other issues from people hitting variations of this same problem. None of them were resolved.
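To see why only one port survives, it helps to look at what the sidecar injector does to scrape annotations when merging is enabled. Roughly (pod values here are illustrative; the 15020 port and /stats/prometheus path are Istio's merged-metrics endpoint):

```yaml
# Before injection: two containers, but the annotation holds ONE port
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"        # no way to list 8080 AND 9100

# After injection: Istio rewrites the annotations to point at the proxy
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"       # merged endpoint on the sidecar
    prometheus.io/path: "/stats/prometheus"
```

The rewritten annotation is the only thing Prometheus discovers, and it can only ever carry the one application port Istio chose to merge from.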

Here's what it looks like in practice:

# Pod with api container (:8080) and worker container (:9100)

up{pod="my-app-abc123", container="api"} = 1   # scraped through Istio proxy

# worker metrics? absent. no error, just gone.

The worker container is perfectly healthy. Its metrics just never reach Prometheus. No scrape failure gets recorded because Prometheus never even tries. It only knows about the one port Istio advertises.


The workarounds you'll try (and why they don't work)

"Just scrape each container port directly." Works if mTLS is in permissive mode. In STRICT mode, every connection must go through the Istio proxy, which only forwards to one port. Direct port scraping gets rejected at the mTLS layer.

"Use multiple PodMonitor entries pointing at different ports." Same problem. The proxy is the bottleneck, not the scrape configuration.

"Push metrics to a Pushgateway." Technically works, but now you've broken the pull model everything else in your stack depends on, added a component that becomes a single point of failure, and introduced staleness semantics that are genuinely confusing to debug.


What about ambient mode?

Before I get to my solution, I should be upfront: if you're running Istio in ambient mode (GA since Istio 1.24), this problem doesn't apply to you. Ambient replaces the per-pod sidecar with a per-node L4 proxy (ztunnel), so there's no sidecar sitting inside your pod intercepting scrapes. Prometheus can reach your container ports directly, and mTLS is handled transparently at the node level. John Howard from the Istio team wrote about this — the TL;DR is "it just works."

But most production Istio deployments are still running sidecar mode. Migrating to ambient is a significant undertaking, and the Istio project itself says they expect many users to stay on sidecars for years. If that's you, keep reading.


What actually works in sidecar mode: one sidecar, one port

The idea is simple. Add a small sidecar container that scrapes all your other containers over localhost (where mTLS doesn't apply, because it's all inside the same pod) and exposes the merged result on a single port. Istio sees one port, Prometheus scrapes one port, and you get everything.

┌──────────────────────────────────────────────────────┐
│  Pod                                                 │
│                                                      │
│  ┌────────┐  localhost:8080/metrics                  │
│  │  api   ├──────────────────┐                       │
│  └────────┘                  │                       │
│                         ┌────▼──────────┐            │
│  ┌────────┐             │  aggregator   │            │
│  │ worker ├────────────►│  :9090/metrics│◄── Prometheus
│  └────────┘             └───────────────┘            │
│             localhost:9100/metrics                   │
└──────────────────────────────────────────────────────┘

This is what metrics-aggregator does. I built it because I kept hitting this problem and none of the existing tools solved it cleanly.


Configuration

Add it as a sidecar to any pod:

containers:
  - name: metrics-aggregator
    image: ghcr.io/kaiohenricunha/metrics-aggregator:latest
    ports:
      - containerPort: 9090
    env:
      - name: METRICS_ENDPOINTS
        # JSON map (recommended), or comma-separated URLs
        value: '{"api":"http://localhost:8080/metrics","worker":"http://localhost:9100/metrics"}'

Point Prometheus at port 9090:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9090"

That's it. No extra service, no push gateway, no changes to your app containers.
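If you use the Prometheus Operator instead of annotations, a single PodMonitor pointed at the aggregator port does the same job. A sketch, assuming the 9090 containerPort is given the name "metrics" and your pods carry an app label (both illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app          # illustrative selector
  podMetricsEndpoints:
    - port: metrics        # named containerPort for :9090
      path: /metrics
```

One endpoint, one port: exactly the shape Istio's merging can handle.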

Here's what Prometheus sees after:

# Same pod, same containers, all metrics present now

http_requests_total{method="GET", status="200", origin_container="api"}    1027
http_requests_total{method="GET", status="200", origin_container="worker"}  843

go_goroutines{origin_container="api"}    42
go_goroutines{origin_container="worker"} 17

Every metric line gets an origin_container label injected automatically so you can tell which container produced it. # TYPE and # HELP lines are deduplicated so the output is valid Prometheus exposition format.
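The label injection and deduplication can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual implementation (the real tool is its own codebase):

```python
import re

# metric_name followed by everything else (labels and/or value)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(.*)$')

def inject_label(line: str, container: str) -> str:
    """Add an origin_container label to one Prometheus sample line."""
    name, rest = _SAMPLE.match(line).groups()
    if rest.startswith('{'):
        # existing labels: splice ours in as the first label
        return f'{name}{{origin_container="{container}",{rest[1:]}'
    # bare metric: create a label set
    return f'{name}{{origin_container="{container}"}}{rest}'

def merge(outputs: dict[str, str]) -> str:
    """Merge per-container exposition text; dedupe # TYPE / # HELP lines."""
    seen: set[str] = set()
    merged: list[str] = []
    for container, text in outputs.items():
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith('#'):
                if line not in seen:     # emit each TYPE/HELP line once
                    seen.add(line)
                    merged.append(line)
            else:
                merged.append(inject_label(line, container))
    return '\n'.join(merged) + '\n'
```

Because the # TYPE lines are deduplicated across sources, the merged output stays a valid single exposition document even when two containers export the same metric family.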


How it works under the hood

Endpoints are scraped concurrently with best-effort semantics. If one container is down, the others still report. The request only fails if every source fails.
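The best-effort semantics look roughly like this (again a hypothetical Python sketch, with the HTTP fetch injected as a parameter so the failure handling is visible):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scrape_all(endpoints: dict[str, str],
               fetch: Callable[[str], str],
               timeout: float = 5.0) -> tuple[dict[str, str], bool]:
    """Scrape every endpoint concurrently, best-effort.

    Returns (outputs keyed by container name, ok). ok is False only
    when *every* source failed: one dead container never hides the rest.
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = {name: pool.submit(fetch, url)
                   for name, url in endpoints.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result(timeout=timeout)
            except Exception:
                continue  # skip the failed source, keep the others
    return results, bool(results)
```

Returning a non-empty partial result with a 200 status (and failing only when `results` is empty) is what keeps `up` at 1 during a single-container outage while the per-source problem stays visible in the aggregator's own metrics.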

The repo has the full details: self-instrumentation metrics, optional OpenTelemetry tracing, alerting rules, and a Grafana dashboard. I won't rehash all of that here.


Does it actually work under STRICT mTLS?

Yes. The CI suite deploys a 4-container pod (three app containers plus istio-proxy) under PeerAuthentication mode STRICT and asserts that Prometheus sustains up == 1 over 60 seconds. The scrape goes through the proxy; the internal localhost scrapes bypass it entirely.
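If you want to reproduce that kind of check against your own Prometheus, a query along these lines does it (the pod name pattern is illustrative):

```promql
# 1 only if the scrape target stayed up for the whole last minute
min_over_time(up{pod=~"my-app-.*"}[60s]) == 1
```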

I wanted this to be tested in CI, not just "it works on my cluster."


Supply chain security

The image is signed with Cosign, scanned with Trivy on every release, and ships with SBOM and SLSA provenance. Releases use semantic versioning via Conventional Commits. This is infrastructure tooling that goes into your production pods, so I wanted to get this part right.


Getting started

Full manifests (plain Deployment, PodMonitor, Helm, Kustomize) are in the examples/ directory.

Quickest path:

kubectl apply -f https://raw.githubusercontent.com/kaiohenricunha/metrics-aggregator/main/examples/deployment.yaml

The repo is here: kaiohenricunha/metrics-aggregator

If you're on sidecar mode with STRICT mTLS and wondering why half your metrics are missing, give it a try. And if you're planning a migration to ambient mode down the road but need something that works today, this bridges the gap. Open an issue if something doesn't work or if you have a use case I haven't thought of.
