<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaio Cunha</title>
    <description>The latest articles on DEV Community by Kaio Cunha (@kaiohenricunha).</description>
    <link>https://dev.to/kaiohenricunha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828257%2F134a4877-ba9a-4bbc-bf40-35b7ede7f498.jpeg</url>
      <title>DEV Community: Kaio Cunha</title>
      <link>https://dev.to/kaiohenricunha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaiohenricunha"/>
    <language>en</language>
    <item>
      <title>dotclaude: The Open-Source Governance Layer for AI-Assisted Development</title>
      <dc:creator>Kaio Cunha</dc:creator>
      <pubDate>Sat, 18 Apr 2026 22:58:04 +0000</pubDate>
      <link>https://dev.to/kaiohenricunha/dotclaude-the-open-source-governance-layer-for-ai-assisted-development-3177</link>
      <guid>https://dev.to/kaiohenricunha/dotclaude-the-open-source-governance-layer-for-ai-assisted-development-3177</guid>
      <description>&lt;p&gt;You finish a great Claude Code session. A solid PR-review workflow. A debugging loop that actually finds root causes. A deploy checklist you trust. You close the terminal.&lt;/p&gt;

&lt;p&gt;Next week, starting fresh, you've lost all of it. The assistant has no memory of how &lt;em&gt;you&lt;/em&gt; like to work. You re-explain the worktree convention. You re-explain the test-plan format. You re-explain why &lt;code&gt;--force-push&lt;/code&gt; on &lt;code&gt;main&lt;/code&gt; is never OK.&lt;/p&gt;

&lt;p&gt;Now scale that problem to a team. Five engineers using Claude Code, each with their own tricks, no shared floor of discipline. PRs land with different review depths. Audits have no structure. Some sessions produce hallucinated "fixes" that never touch the real code path. Specs drift from implementation and nobody notices until something breaks in prod.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kaiohenricunha/dotclaude" rel="noopener noreferrer"&gt;dotclaude&lt;/a&gt; is an MIT-licensed project that solves both problems from the same codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two problems, one repo
&lt;/h2&gt;

&lt;p&gt;The project has a &lt;strong&gt;dual-persona monorepo&lt;/strong&gt; layout (&lt;a href="https://github.com/kaiohenricunha/dotclaude/blob/main/docs/adr/0001-monorepo-dual-persona-layout.md" rel="noopener noreferrer"&gt;ADR-0001&lt;/a&gt;). That sounds architectural, but it maps to two very different users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The individual developer&lt;/strong&gt; who wants a portable skills library wired into every Claude Code session on their laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The engineering team&lt;/strong&gt; that wants a governance CLI enforcing spec-backed PRs, skill-manifest integrity, and drift detection in CI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both paths are backed by the same skills, the same slash commands, the same &lt;code&gt;CLAUDE.md&lt;/code&gt; rules. Neither path requires the other. You can use one, both, or swap from one to the other as your needs change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1: skills &amp;amp; commands in every session
&lt;/h2&gt;

&lt;p&gt;For the individual path, the install is three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kaiohenricunha/dotclaude.git ~/projects/dotclaude
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/projects/dotclaude
./bootstrap.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bootstrap.sh&lt;/code&gt; symlinks &lt;code&gt;commands/&lt;/code&gt;, &lt;code&gt;skills/&lt;/code&gt;, and &lt;code&gt;CLAUDE.md&lt;/code&gt; into &lt;code&gt;~/.claude/&lt;/code&gt;. From that point, every Claude Code session in every repo has access to the full library. The highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud &amp;amp; IaC specialists&lt;/strong&gt; — the &lt;code&gt;aws-specialist&lt;/code&gt;, &lt;code&gt;gcp-specialist&lt;/code&gt;, &lt;code&gt;azure-specialist&lt;/code&gt;, &lt;code&gt;kubernetes-specialist&lt;/code&gt;, &lt;code&gt;terraform-specialist&lt;/code&gt;, &lt;code&gt;terragrunt-specialist&lt;/code&gt;, &lt;code&gt;pulumi-specialist&lt;/code&gt;, and &lt;code&gt;crossplane-specialist&lt;/code&gt; skills auto-trigger when you mention the relevant technology. Saying "review the IAM trust policy on the prod account" is enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slash commands for real PR work&lt;/strong&gt; — &lt;code&gt;/pre-pr&lt;/code&gt; runs a simplify + security-review + full-test-suite gate before you open the PR. &lt;code&gt;/review-pr &amp;lt;N&amp;gt;&lt;/code&gt; walks 14 steps: fetch comments, validate each one, apply fixes in an isolated worktree, run the test plan, resolve threads. &lt;code&gt;/review-prs &amp;lt;N1&amp;gt; &amp;lt;N2&amp;gt; ...&lt;/code&gt; dispatches one sub-agent per PR in parallel, up to six concurrent, and aggregates results into a table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging disciplines&lt;/strong&gt; — &lt;code&gt;/ground-first &amp;lt;subject&amp;gt;&lt;/code&gt; forces a read-before-edit pass with &lt;code&gt;file:line&lt;/code&gt; citations before any change is proposed. &lt;code&gt;/fix-with-evidence &amp;lt;issue&amp;gt;&lt;/code&gt; enforces a Reproduce → Fix → Verify → PR loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis docs&lt;/strong&gt; — &lt;code&gt;/create-audit&lt;/code&gt;, &lt;code&gt;/create-inspection&lt;/code&gt;, and &lt;code&gt;/create-assessment&lt;/code&gt; produce evidence-backed markdown documents in &lt;code&gt;docs/audits/&lt;/code&gt;, &lt;code&gt;docs/inspections/&lt;/code&gt;, and &lt;code&gt;docs/assessments/&lt;/code&gt; respectively. Every claim cites a file, a line, or command output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-machine handoff&lt;/strong&gt; — &lt;code&gt;/handoff push claude latest&lt;/code&gt; scrubs secrets and uploads a digest to a private GitHub gist. On another machine: &lt;code&gt;/handoff pull latest&lt;/code&gt;. Your Windows/WSL session continues on Linux without re-explaining context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; file installs a global rule floor alongside the skills: no pushing to &lt;code&gt;main&lt;/code&gt; without explicit instruction, no force-pushing another session's branch, no &lt;code&gt;--no-verify&lt;/code&gt; or &lt;code&gt;--no-gpg-sign&lt;/code&gt;, full test suite before merges that touch protected paths, and a spec-coverage contract enforced at PR time.&lt;/p&gt;
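
&lt;p&gt;As a sketch of what that floor looks like in practice (illustrative wording, not the shipped file), a minimal &lt;code&gt;CLAUDE.md&lt;/code&gt; rule section might read:&lt;/p&gt;

```markdown
# Session rules (illustrative sketch -- see the repo for the real CLAUDE.md)
- Never push to `main` without an explicit instruction.
- Never force-push a branch another session owns.
- Never pass `--no-verify` or `--no-gpg-sign`.
- Run the full test suite before merging anything that touches a protected path.
- PRs touching protected paths must cite a Spec ID or a no-spec rationale.
```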

&lt;p&gt;To stay current: &lt;code&gt;./sync.sh pull&lt;/code&gt; (bootstrap path) or &lt;code&gt;dotclaude sync pull&lt;/code&gt; (npm path) re-bootstraps from the latest &lt;code&gt;main&lt;/code&gt;. No npm required for the bootstrap path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2: the governance CLI
&lt;/h2&gt;

&lt;p&gt;For the team path, there's a zero-runtime-dependency npm package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @dotclaude/dotclaude
dotclaude bootstrap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That installs the same skills library but also gives you a set of validators designed for CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dotclaude-validate-specs&lt;/code&gt; — audits spec contracts, catches dependency cycles.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dotclaude-check-spec-coverage&lt;/code&gt; — the PR-time gate. Any PR that touches a protected path (defined in &lt;code&gt;docs/repo-facts.json&lt;/code&gt;) must carry a &lt;code&gt;Spec ID:&lt;/code&gt; header or a &lt;code&gt;## No-spec rationale&lt;/code&gt; section. No loophole.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dotclaude-check-instruction-drift&lt;/code&gt; — detects stale &lt;code&gt;CLAUDE.md&lt;/code&gt; and README entries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dotclaude-detect-drift&lt;/code&gt; — flags commands that have diverged from &lt;code&gt;origin/main&lt;/code&gt; for 14+ days.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dotclaude-doctor&lt;/code&gt; — self-diagnostic across env, facts, manifest, specs, drift, hooks.&lt;/li&gt;
&lt;/ul&gt;
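
&lt;p&gt;Wiring those validators into CI is a one-job workflow. A hypothetical GitHub Actions sketch (the job and step layout are mine; the bins and the &lt;code&gt;--json&lt;/code&gt; flag come from the project):&lt;/p&gt;

```yaml
# Hypothetical CI wiring -- adapt job names and triggers to your repo.
jobs:
  governance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # drift checks compare against origin/main
      - run: npm install -g @dotclaude/dotclaude
      - run: dotclaude-validate-specs --json
      - run: dotclaude-check-spec-coverage --json
      - run: dotclaude-detect-drift --json
```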

&lt;p&gt;Every bin honors &lt;code&gt;--help&lt;/code&gt;, &lt;code&gt;--version&lt;/code&gt;, &lt;code&gt;--json&lt;/code&gt;, &lt;code&gt;--verbose&lt;/code&gt;, &lt;code&gt;--no-color&lt;/code&gt;. Exit codes follow the &lt;code&gt;{0, 1, 2, 64}&lt;/code&gt; convention (&lt;a href="https://github.com/kaiohenricunha/dotclaude/blob/main/docs/adr/0013-exit-code-convention.md" rel="noopener noreferrer"&gt;ADR-0013&lt;/a&gt;), with 64 matching BSD &lt;code&gt;EX_USAGE&lt;/code&gt;. Every failure surfaces as a structured &lt;code&gt;ValidationError&lt;/code&gt; with a stable &lt;code&gt;.code&lt;/code&gt; (&lt;a href="https://github.com/kaiohenricunha/dotclaude/blob/main/docs/adr/0012-structured-error-contract.md" rel="noopener noreferrer"&gt;ADR-0012&lt;/a&gt;), so your CI scripts branch on classes of failure instead of grepping strings.&lt;/p&gt;

&lt;p&gt;There's also a Node API for teams that want to build their own gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;createHarnessContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;validateSpecs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;ERROR_CODES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;EXIT_CODES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@dotclaude/dotclaude&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createHarnessContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;errors&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validateSpecs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;EXIT_CODES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;VALIDATION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Need a scaffold for a fresh repo?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx dotclaude-init &lt;span class="nt"&gt;--project-name&lt;/span&gt; my-project &lt;span class="nt"&gt;--project-type&lt;/span&gt; node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That writes &lt;code&gt;.claude/settings.json&lt;/code&gt;, the skills manifest, a destructive-git guard hook, three GitHub Actions workflows (&lt;code&gt;validate-skills&lt;/code&gt;, &lt;code&gt;detect-drift&lt;/code&gt;, &lt;code&gt;ai-review&lt;/code&gt;), and a spec stub. A green &lt;code&gt;dotclaude-doctor&lt;/code&gt; from a cold start.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick taste
&lt;/h2&gt;

&lt;p&gt;After bootstrap, pick a real repo and try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Read before you touch anything.&lt;/span&gt;
/ground-first auth token refresh race condition
&lt;span class="gh"&gt;# → grounded analysis with file:line citations, no edits proposed&lt;/span&gt;

&lt;span class="gh"&gt;# Fix a reported bug with a full evidence loop.&lt;/span&gt;
/fix-with-evidence 140
&lt;span class="gh"&gt;# → reproduces, fixes, verifies, opens a PR — all with proof&lt;/span&gt;

&lt;span class="gh"&gt;# Deep AWS IAM review.&lt;/span&gt;
/aws-specialist review IAM policies in the production account
&lt;span class="gh"&gt;# → structured report: least-privilege gaps, trust-policy findings, remediations&lt;/span&gt;

&lt;span class="gh"&gt;# Batch-triage every open Dependabot PR.&lt;/span&gt;
/dependabot-sweep
&lt;span class="gh"&gt;# → parallel sub-agents annotate risk; safe bumps merged automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every command is context-aware. It reads your repo's files, git history, CI state, and PR body. It cites evidence. It never pushes without permission.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother with governance at all
&lt;/h2&gt;

&lt;p&gt;The case for spec-driven development gets stronger the more AI you put into the loop. An assistant that writes code fast enough to outrun human review is a liability unless the &lt;em&gt;rules of the game&lt;/em&gt; are encoded somewhere machine-readable. &lt;code&gt;docs/specs/&lt;/code&gt; becomes the contract. Protected paths become the enforcement surface. A PR gate that says "touched this path → show me the Spec ID" turns AI speed into a feature instead of a foot-gun.&lt;/p&gt;

&lt;p&gt;dotclaude isn't opinionated about &lt;em&gt;which&lt;/em&gt; workflow you adopt. It's opinionated that &lt;em&gt;some&lt;/em&gt; workflow must exist — and that the same tools should serve both the person writing the code and the team shipping it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/kaiohenricunha/dotclaude" rel="noopener noreferrer"&gt;README&lt;/a&gt; — both install paths, full skills catalog.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kaiohenricunha/dotclaude/blob/main/docs/quickstart.md" rel="noopener noreferrer"&gt;docs/quickstart.md&lt;/a&gt; — install to first green validator in under 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kaiohenricunha/dotclaude/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;docs/architecture.md&lt;/a&gt; — layer diagram and PR-time sequence.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kaiohenricunha/dotclaude/tree/main/docs/adr" rel="noopener noreferrer"&gt;docs/adr/&lt;/a&gt; — every hardening decision, with rationale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT licensed. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Istio's Metrics Merging Breaks in Multi-Container Pods (And How to Fix It)</title>
      <dc:creator>Kaio Cunha</dc:creator>
      <pubDate>Tue, 17 Mar 2026 00:07:31 +0000</pubDate>
      <link>https://dev.to/kaiohenricunha/why-istios-metrics-merging-breaks-in-multi-container-pods-and-how-to-fix-it-3l6f</link>
      <guid>https://dev.to/kaiohenricunha/why-istios-metrics-merging-breaks-in-multi-container-pods-and-how-to-fix-it-3l6f</guid>
      <description>&lt;h2&gt;
  
  
  If you run multi-container pods under Istio with STRICT mTLS, you're probably missing metrics
&lt;/h2&gt;

&lt;p&gt;And you might not know it. The containers are healthy. The scrape job shows no errors. But half your metrics are just... absent from Prometheus. No alert, no obvious explanation.&lt;/p&gt;

&lt;p&gt;I spent a while debugging this before I understood what was going on, so here's the full picture.&lt;/p&gt;




&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Istio has a built-in metrics-merging feature that lets Prometheus scrape a pod through the Istio proxy without reaching each container directly. It's useful. But it has a hard limitation that the docs mention only in passing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Istio's metrics-merge only supports one port per pod.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The SuperOrbital team wrote &lt;a href="https://superorbital.io/blog/istio-metrics-merging/" rel="noopener noreferrer"&gt;the definitive explanation&lt;/a&gt; of why this is the case. The short version: Istio's proxy forwards the scrape to a single application port. If you have three containers each exposing &lt;code&gt;/metrics&lt;/code&gt; on different ports, Istio picks one and ignores the rest.&lt;/p&gt;

&lt;p&gt;Someone &lt;a href="https://github.com/istio/istio/issues/41276" rel="noopener noreferrer"&gt;opened a feature request&lt;/a&gt; for multi-port support back in 2022. It was labeled &lt;code&gt;lifecycle/stale&lt;/code&gt; and auto-closed. There are &lt;a href="https://github.com/istio/istio/issues/27328" rel="noopener noreferrer"&gt;several&lt;/a&gt; &lt;a href="https://github.com/istio/istio/issues/38348" rel="noopener noreferrer"&gt;other&lt;/a&gt; &lt;a href="https://github.com/istio/istio/issues/53753" rel="noopener noreferrer"&gt;issues&lt;/a&gt; from people hitting variations of this same problem. None of them were resolved.&lt;/p&gt;

&lt;p&gt;Here's what it looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod with api container (:8080) and worker container (:9100)&lt;/span&gt;

&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-app-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="n"&gt;scraped&lt;/span&gt; &lt;span class="n"&gt;through&lt;/span&gt; &lt;span class="n"&gt;Istio&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;

&lt;span class="c"&gt;# worker metrics? absent. no error, just gone.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker container is perfectly healthy. Its metrics just never reach Prometheus. No scrape failure gets recorded because Prometheus never even tries. It only knows about the one port Istio advertises.&lt;/p&gt;




&lt;h3&gt;
  
  
  The workarounds you'll try (and why they don't work)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Just scrape each container port directly."&lt;/strong&gt; Works if mTLS is in permissive mode. In &lt;code&gt;STRICT&lt;/code&gt; mode, every connection must go through the Istio proxy, which only forwards to one port. Direct port scraping gets rejected at the mTLS layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use multiple &lt;code&gt;PodMonitor&lt;/code&gt; entries pointing at different ports."&lt;/strong&gt; Same problem. The proxy is the bottleneck, not the scrape configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Push metrics to a Pushgateway."&lt;/strong&gt; Technically works, but now you've broken the pull model everything else in your stack depends on, added a component that becomes a single point of failure, and introduced staleness semantics that are genuinely confusing to debug.&lt;/p&gt;




&lt;h3&gt;
  
  
  What about ambient mode?
&lt;/h3&gt;

&lt;p&gt;Before I get to my solution, I should be upfront: if you're running Istio in &lt;strong&gt;ambient mode&lt;/strong&gt; (GA since Istio 1.24), this problem doesn't apply to you. Ambient replaces the per-pod sidecar with a per-node L4 proxy (ztunnel), so there's no sidecar sitting inside your pod intercepting scrapes. Prometheus can reach your container ports directly, and mTLS is handled transparently at the node level. John Howard from the Istio team &lt;a href="https://blog.howardjohn.info/posts/securing-prometheus/" rel="noopener noreferrer"&gt;wrote about this&lt;/a&gt; — the TL;DR is "it just works."&lt;/p&gt;

&lt;p&gt;But most production Istio deployments are still running sidecar mode. Migrating to ambient is a significant undertaking, and the Istio project itself says they expect many users to stay on sidecars for years. If that's you, keep reading.&lt;/p&gt;




&lt;h3&gt;
  
  
  What actually works in sidecar mode: one sidecar, one port
&lt;/h3&gt;

&lt;p&gt;The idea is simple. Add a small sidecar container that scrapes all your other containers over &lt;code&gt;localhost&lt;/code&gt; (where mTLS doesn't apply, because it's all inside the same pod) and exposes the merged result on a single port. Istio sees one port, Prometheus scrapes one port, and you get everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│  Pod                                                 │
│                                                      │
│  ┌────────┐  localhost:8080/metrics                  │
│  │  api   ├──────────────────┐                       │
│  └────────┘                  │                       │
│                         ┌────▼──────────┐            │
│  ┌────────┐             │  aggregator   │            │
│  │ worker ├────────────►│  :9090/metrics│◄── Prometheus
│  └────────┘             └───────────────┘            │
│             localhost:9100/metrics                   │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator" rel="noopener noreferrer"&gt;metrics-aggregator&lt;/a&gt; does. I built it because I kept hitting this problem and none of the existing tools solved it cleanly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;Add it as a sidecar to any pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics-aggregator&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/kaiohenricunha/metrics-aggregator:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;METRICS_ENDPOINTS&lt;/span&gt;
        &lt;span class="c1"&gt;# JSON map (recommended), or comma-separated URLs&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"api":"http://localhost:8080/metrics","worker":"http://localhost:9100/metrics"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point Prometheus at port 9090:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus.io/scrape&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;prometheus.io/port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No extra service, no push gateway, no changes to your app containers.&lt;/p&gt;

&lt;p&gt;Here's what Prometheus sees after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# Same pod, same containers, all metrics present now&lt;/span&gt;

&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="mi"&gt;1027&lt;/span&gt;
&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worker"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="mi"&gt;843&lt;/span&gt;

&lt;span class="n"&gt;go_goroutines&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="n"&gt;go_goroutines&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;origin_container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worker"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every metric line gets an &lt;code&gt;origin_container&lt;/code&gt; label injected automatically so you can tell which container produced it. &lt;code&gt;# TYPE&lt;/code&gt; and &lt;code&gt;# HELP&lt;/code&gt; lines are deduplicated so the output is valid Prometheus exposition format.&lt;/p&gt;
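
&lt;p&gt;The label injection and header dedup are easy to picture. A minimal sketch of that merge step (my reconstruction of the semantics, not the project's code):&lt;/p&gt;

```javascript
// Illustrative sketch only: inject an origin_container label into each
// sample line and emit each "# HELP" / "# TYPE" header at most once.
function mergeExposition(sources) {
  const headers = new Set(); // dedupe "# TYPE" / "# HELP" lines
  const out = [];
  for (const [container, text] of Object.entries(sources)) {
    for (const line of text.trim().split("\n")) {
      if (line.trim() === "") continue;
      if (line.startsWith("#")) {
        if (!headers.has(line)) { headers.add(line); out.push(line); }
        continue;
      }
      // inject origin_container into the existing label set, or create one
      const brace = line.indexOf("{");
      if (brace >= 0) {
        out.push(line.slice(0, brace + 1) + `origin_container="${container}",` + line.slice(brace + 1));
      } else {
        const space = line.indexOf(" ");
        out.push(line.slice(0, space) + `{origin_container="${container}"}` + line.slice(space));
      }
    }
  }
  return out.join("\n");
}
```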




&lt;h3&gt;
  
  
  How it works under the hood
&lt;/h3&gt;

&lt;p&gt;Endpoints are scraped concurrently with best-effort semantics. If one container is down, the others still report. The request only fails if every source fails.&lt;/p&gt;
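
&lt;p&gt;"Best-effort" is the kind of semantics &lt;code&gt;Promise.allSettled&lt;/code&gt; expresses well. A sketch of the behavior described above (assumed shape, not the actual implementation):&lt;/p&gt;

```javascript
// Sketch of best-effort concurrent scraping: fetch every endpoint in
// parallel, keep whatever succeeds, fail only when every source fails.
async function scrapeAll(endpoints, fetchFn) {
  const entries = Object.entries(endpoints);
  const results = await Promise.allSettled(
    entries.map(([name, url]) => fetchFn(url).then((body) => [name, body]))
  );
  const ok = results
    .filter((r) => r.status === "fulfilled")
    .map((r) => r.value);
  if (ok.length === 0) throw new Error("all metric sources failed");
  return Object.fromEntries(ok); // e.g. { api: "...", worker: "..." }
}
```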

&lt;p&gt;The repo has the full details: self-instrumentation metrics, optional OpenTelemetry tracing, alerting rules, and a Grafana dashboard. I won't rehash all of that here.&lt;/p&gt;




&lt;h3&gt;
  
  
  Does it actually work under STRICT mTLS?
&lt;/h3&gt;

&lt;p&gt;Yes. The CI suite deploys a 4-container pod (three app containers plus &lt;code&gt;istio-proxy&lt;/code&gt;) under &lt;code&gt;PeerAuthentication&lt;/code&gt; mode &lt;code&gt;STRICT&lt;/code&gt; and asserts that Prometheus sustains &lt;code&gt;up == 1&lt;/code&gt; over 60 seconds. The scrape goes through the proxy; the internal localhost scrapes bypass it entirely.&lt;/p&gt;
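
&lt;p&gt;For reference, the &lt;code&gt;STRICT&lt;/code&gt; policy in a test like that is the standard Istio resource (the namespace here is a placeholder):&lt;/p&gt;

```yaml
# Namespace-wide STRICT mTLS via Istio's PeerAuthentication API.
apiVersion: security.istio.io/v1    # v1beta1 on older Istio releases
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace           # placeholder namespace
spec:
  mtls:
    mode: STRICT
```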

&lt;p&gt;I wanted this to be tested in CI, not just "it works on my cluster."&lt;/p&gt;




&lt;h3&gt;
  
  
  Supply chain security
&lt;/h3&gt;

&lt;p&gt;The image is signed with Cosign, scanned with Trivy on every release, and ships with SBOM and SLSA provenance. Releases use semantic versioning via Conventional Commits. This is infrastructure tooling that goes into your production pods, so I wanted to get this part right.&lt;/p&gt;




&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;Full manifests (plain Deployment, PodMonitor, Helm, Kustomize) are in the &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator/tree/main/examples" rel="noopener noreferrer"&gt;&lt;code&gt;examples/&lt;/code&gt;&lt;/a&gt; directory.&lt;/p&gt;

&lt;p&gt;Quickest path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kaiohenricunha/metrics-aggregator/main/examples/deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo is here: &lt;a href="https://github.com/kaiohenricunha/metrics-aggregator" rel="noopener noreferrer"&gt;kaiohenricunha/metrics-aggregator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're on sidecar mode with STRICT mTLS and wondering why half your metrics are missing, give it a try. And if you're planning a migration to ambient mode down the road but need something that works today, this bridges the gap. Open an issue if something doesn't work or if you have a use case I haven't thought of.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update: I wrote a follow-up post exploring the broader question of whether Istio should extend metrics merging or sunset it entirely: &lt;a href="https://medium.com/@kaiohsdc/istios-metrics-merging-was-built-for-a-simpler-world-what-should-replace-it-585b285fbc32" rel="noopener noreferrer"&gt;Istio's metrics merging was built for a simpler world. What should replace it?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>prometheus</category>
      <category>kubernetes</category>
      <category>istio</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
