<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaio Cunha</title>
    <description>The latest articles on DEV Community by Kaio Cunha (@kaiohenricunha).</description>
    <link>https://dev.to/kaiohenricunha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828257%2F134a4877-ba9a-4bbc-bf40-35b7ede7f498.jpeg</url>
      <title>DEV Community: Kaio Cunha</title>
      <link>https://dev.to/kaiohenricunha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaiohenricunha"/>
    <language>en</language>
    <item>
      <title>Why Istio's Metrics Merging Breaks in Multi-Container Pods (And How to Fix It)</title>
      <dc:creator>Kaio Cunha</dc:creator>
      <pubDate>Tue, 17 Mar 2026 00:07:31 +0000</pubDate>
      <link>https://dev.to/kaiohenricunha/why-istios-metrics-merging-breaks-in-multi-container-pods-and-how-to-fix-it-3l6f</link>
      <guid>https://dev.to/kaiohenricunha/why-istios-metrics-merging-breaks-in-multi-container-pods-and-how-to-fix-it-3l6f</guid>
      <description>&lt;h2&gt;
  
  
  If you run multi-container pods under Istio with STRICT mTLS, you're probably missing metrics
&lt;/h2&gt;

&lt;p&gt;And you might not know it. The containers are healthy. The scrape job shows no errors. But half your metrics are just... absent from Prometheus. No alert, no obvious explanation.&lt;/p&gt;

&lt;p&gt;I spent a while debugging this before I understood what was going on, so here's the full picture.&lt;/p&gt;




&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Istio has a built-in metrics-merging feature that lets Prometheus scrape a pod through the Istio proxy without reaching each container directly. It's useful. But it has a hard limitation that the docs mention only in passing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Istio's metrics merging only supports one port per pod.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The SuperOrbital team wrote &lt;a href="https://superorbital.io/blog/istio-metrics-merging/" rel="noopener noreferrer"&gt;the definitive explanation&lt;/a&gt; of why this is the case. The short version: Istio's proxy forwards the scrape to a single application port. If you have three containers each exposing &lt;code&gt;/metrics&lt;/code&gt; on different ports, Istio picks one and ignores the rest.&lt;/p&gt;
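
&lt;p&gt;For context, metrics merging is driven by the standard Prometheus annotations on the pod; the sidecar injector rewrites them so that Prometheus scrapes the proxy's merged endpoint instead of the app. The annotation schema leaves room for exactly one port (values here are illustrative):&lt;/p&gt;

```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  prometheus.io/port: "8080"   # one port per pod; a second container's port has nowhere to go
```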

&lt;p&gt;Someone &lt;a href="https://github.com/istio/istio/issues/41276" rel="noopener noreferrer"&gt;opened a feature request&lt;/a&gt; for multi-port support back in 2022. It was labeled &lt;code&gt;lifecycle/stale&lt;/code&gt; and auto-closed. There are &lt;a href="https://github.com/istio/istio/issues/27328" rel="noopener noreferrer"&gt;several&lt;/a&gt; &lt;a href="https://github.com/istio/istio/issues/38348" rel="noopener noreferrer"&gt;other&lt;/a&gt; &lt;a href="https://github.com/istio/istio/issues/53753" rel="noopener noreferrer"&gt;issues&lt;/a&gt; from people hitting variations of this same problem. None of them were resolved.&lt;/p&gt;

&lt;p&gt;Here's what it looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod with api container (:8080) and worker container (:9100)&lt;/span&gt;

&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-app-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="n"&gt;scraped&lt;/span&gt; &lt;span class="n"&gt;through&lt;/span&gt; &lt;span class="n"&gt;Istio&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;

&lt;span class="c"&gt;# worker metrics? absent. no error, just gone.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker container is perfectly healthy. Its metrics just never reach Prometheus. No scrape failure gets recorded because Prometheus never even tries. It only knows about the one port Istio advertises.&lt;/p&gt;
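
&lt;p&gt;You can confirm this from inside the cluster by asking the proxy for its merged view directly; the merged endpoint lives on the proxy's port 15020. Pod and metric names here are illustrative, and this assumes &lt;code&gt;curl&lt;/code&gt; is available in the proxy image:&lt;/p&gt;

```shell
# Fetch the merged metrics exactly as Prometheus would see them.
# Series from the annotated port show up; the worker's never do.
kubectl exec my-app-abc123 -c istio-proxy -- \
  curl -s localhost:15020/stats/prometheus | grep http_requests_total
```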




&lt;h3&gt;
  
  
  The workarounds you'll try (and why they don't work)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Just scrape each container port directly."&lt;/strong&gt; Works if mTLS is in permissive mode. In &lt;code&gt;STRICT&lt;/code&gt; mode, every connection must go through the Istio proxy, which only forwards to one port. Direct port scraping gets rejected at the mTLS layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use multiple &lt;code&gt;PodMonitor&lt;/code&gt; entries pointing at different ports."&lt;/strong&gt; Same problem. The proxy is the bottleneck, not the scrape configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Push metrics to a Pushgateway."&lt;/strong&gt; Technically works, but now you've broken the pull model everything else in your stack depends on, added a component that becomes a single point of failure, and introduced staleness semantics that are genuinely confusing to debug.&lt;/p&gt;




&lt;h3&gt;
  
  
  What about ambient mode?
&lt;/h3&gt;

&lt;p&gt;Before I get to my solution, I should be upfront: if you're running Istio in &lt;strong&gt;ambient mode&lt;/strong&gt; (GA since Istio 1.24), this problem doesn't apply to you. Ambient replaces the per-pod sidecar with a per-node L4 proxy (ztunnel), so there's no sidecar inside your pod intercepting scrapes. Prometheus can reach your container ports directly, and mTLS is handled transparently at the node level. John Howard from the Istio team &lt;a href="https://blog.howardjohn.info/posts/securing-prometheus/" rel="noopener noreferrer"&gt;wrote about this&lt;/a&gt;; the TL;DR is "it just works."&lt;/p&gt;

&lt;p&gt;But most production Istio deployments are still running sidecar mode. Migrating to ambient is a significant undertaking, and the Istio project itself says they expect many users to stay on sidecars for years. If that's you, keep reading.&lt;/p&gt;




&lt;h3&gt;
  
  
  What actually works in sidecar mode: one sidecar, one port
&lt;/h3&gt;

&lt;p&gt;The idea is simple. Add a small sidecar container that scrapes all your other containers over &lt;code&gt;localhost&lt;/code&gt; (where mTLS doesn't apply, because it's all inside the same pod) and exposes the merged result on a single port. Istio sees one port, Prometheus scrapes one port, and you get everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│  Pod                                                 │
│                                                      │
│  ┌────────┐  localhost:8080/metrics                  │
│  │  api   ├──────────────────┐                       │
│  └────────┘                  │                       │
│                         ┌────▼──────────┐            │
│  ┌────────┐             │  aggregator   │            │
│  │ worker ├────────────►│  :9090/metrics│◄── Prometheus
│  └────────┘             └───────────────┘            │
│             localhost:9100/metrics                   │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator" rel="noopener noreferrer"&gt;metrics-aggregator&lt;/a&gt; does. I built it because I kept hitting this problem and none of the existing tools solved it cleanly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;Add it as a sidecar to any pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics-aggregator&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/kaiohenricunha/metrics-aggregator:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;METRICS_ENDPOINTS&lt;/span&gt;
        &lt;span class="c1"&gt;# JSON map (recommended), or comma-separated URLs&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"api":"http://localhost:8080/metrics","worker":"http://localhost:9100/metrics"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point Prometheus at port 9090:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus.io/scrape&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;prometheus.io/port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No extra service, no push gateway, no changes to your app containers.&lt;/p&gt;

&lt;p&gt;Here's what Prometheus sees after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# Same pod, same containers, all metrics present now&lt;/span&gt;

&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="mi"&gt;1027&lt;/span&gt;
&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worker"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="mi"&gt;843&lt;/span&gt;

&lt;span class="n"&gt;go_goroutines&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="n"&gt;go_goroutines&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worker"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every metric line gets an &lt;code&gt;origin_container&lt;/code&gt; label injected automatically so you can tell which container produced it. &lt;code&gt;# TYPE&lt;/code&gt; and &lt;code&gt;# HELP&lt;/code&gt; lines are deduplicated so the output is valid Prometheus exposition format.&lt;/p&gt;
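
&lt;p&gt;The merge itself is conceptually simple. Here's a minimal Python sketch of the idea, not the actual implementation; it assumes exposition lines contain no spaces inside label values, which holds for typical exporters:&lt;/p&gt;

```python
def merge_exposition(sources: dict) -> str:
    """Merge Prometheus text-format payloads from several containers.

    Injects an origin_container label into every sample line and keeps
    only the first copy of each # HELP / # TYPE metadata line, so the
    merged output is still valid exposition format.
    """
    seen_meta = set()
    out = []
    for container, payload in sources.items():
        for line in payload.splitlines():
            if line.startswith(("# HELP", "# TYPE")):
                if line in seen_meta:
                    continue  # deduplicate metadata across containers
                seen_meta.add(line)
                out.append(line)
            elif line and not line.startswith("#"):
                # "name{labels} value" or "name value"
                name_labels, _, rest = line.partition(" ")
                if name_labels.endswith("}"):
                    name_labels = name_labels[:-1] + f',origin_container="{container}"}}'
                else:
                    name_labels += f'{{origin_container="{container}"}}'
                out.append(f"{name_labels} {rest}")
    return "\n".join(out) + "\n"
```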




&lt;h3&gt;
  
  
  How it works under the hood
&lt;/h3&gt;

&lt;p&gt;Endpoints are scraped concurrently with best-effort semantics. If one container is down, the others still report. The request only fails if every source fails.&lt;/p&gt;
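
&lt;p&gt;The best-effort fan-out can be sketched like this (again a simplified model in Python, not the repo's code):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(fetchers: dict) -> dict:
    """Fetch every metrics endpoint concurrently, best-effort.

    `fetchers` maps container name to a zero-arg callable returning the
    exposition text. One failing source is tolerated; the scrape as a
    whole fails only when every source fails.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(len(fetchers), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=5)
            except Exception as exc:  # a down container, a timeout, ...
                errors[name] = exc
    if not results:
        raise RuntimeError(f"all metric sources failed: {errors}")
    return results
```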

&lt;p&gt;The repo has the full details: self-instrumentation metrics, optional OpenTelemetry tracing, alerting rules, and a Grafana dashboard. I won't rehash all of that here.&lt;/p&gt;




&lt;h3&gt;
  
  
  Does it actually work under STRICT mTLS?
&lt;/h3&gt;

&lt;p&gt;Yes. The CI suite deploys a 4-container pod (three app containers plus &lt;code&gt;istio-proxy&lt;/code&gt;) under &lt;code&gt;PeerAuthentication&lt;/code&gt; mode &lt;code&gt;STRICT&lt;/code&gt; and asserts that Prometheus sustains &lt;code&gt;up == 1&lt;/code&gt; over 60 seconds. The scrape goes through the proxy; the internal localhost scrapes bypass it entirely.&lt;/p&gt;
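
&lt;p&gt;The sustained-uptime check amounts to a PromQL assertion along these lines (the pod selector is illustrative):&lt;/p&gt;

```promql
# 1 only if every scrape in the last 60s succeeded through the proxy
min_over_time(up{pod=~"my-app-.*"}[60s]) == 1
```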

&lt;p&gt;I wanted this to be tested in CI, not just "it works on my cluster."&lt;/p&gt;




&lt;h3&gt;
  
  
  Supply chain security
&lt;/h3&gt;

&lt;p&gt;The image is signed with Cosign, scanned with Trivy on every release, and ships with SBOM and SLSA provenance. Releases use semantic versioning via Conventional Commits. This is infrastructure tooling that goes into your production pods, so I wanted to get this part right.&lt;/p&gt;




&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;Full manifests (plain Deployment, PodMonitor, Helm, Kustomize) are in the &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator/tree/main/examples" rel="noopener noreferrer"&gt;&lt;code&gt;examples/&lt;/code&gt;&lt;/a&gt; directory.&lt;/p&gt;

&lt;p&gt;Quickest path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kaiohenricunha/metrics-aggregator/main/examples/deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo is here: &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator" rel="noopener noreferrer"&gt;kaiohenricunha/metrics-aggregator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're on sidecar mode with STRICT mTLS and wondering why half your metrics are missing, give it a try. And if you're planning a migration to ambient mode down the road but need something that works today, this bridges the gap. Open an issue if something doesn't work or if you have a use case I haven't thought of.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update: I wrote a follow-up post exploring the broader question of whether Istio should extend metrics merging or sunset it entirely: &lt;a href="https://medium.com/@kaiohsdc/istios-metrics-merging-was-built-for-a-simpler-world-what-should-replace-it-585b285fbc32" rel="noopener noreferrer"&gt;Istio's metrics merging was built for a simpler world. What should replace it?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>prometheus</category>
      <category>kubernetes</category>
      <category>istio</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
