<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrei Taranik</title>
    <description>The latest articles on DEV Community by Andrei Taranik (@cicdteam).</description>
    <link>https://dev.to/cicdteam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3945719%2F4f079259-bcd8-485e-88e1-206e35eb6db1.png</url>
      <title>DEV Community: Andrei Taranik</title>
      <link>https://dev.to/cicdteam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cicdteam"/>
    <language>en</language>
    <item>
      <title>Remetric: find waste in self-hosted Prometheus, Grafana, and Loki</title>
      <dc:creator>Andrei Taranik</dc:creator>
      <pubDate>Wed, 27 May 2026 17:00:00 +0000</pubDate>
      <link>https://dev.to/cicdteam/remetric-find-waste-in-self-hosted-prometheus-grafana-and-loki-31dn</link>
      <guid>https://dev.to/cicdteam/remetric-find-waste-in-self-hosted-prometheus-grafana-and-loki-31dn</guid>
      <description>&lt;p&gt;Self-hosted Prometheus stacks degrade in predictable ways: a label&lt;br&gt;
explosion that quietly doubles TSDB head size, a metric scraped by&lt;br&gt;
every node and queried by none, an alert rule that has not fired in&lt;br&gt;
nine months, a dashboard panel pointing at a metric that was renamed&lt;br&gt;
last quarter. The signals are all there in the existing APIs, but&lt;br&gt;
writing the queries, running them on a schedule, and turning the&lt;br&gt;
answers into actionable fixes is enough friction that nobody does it&lt;br&gt;
until something breaks.&lt;/p&gt;

&lt;p&gt;Remetric is a single static binary that does this. v0.1 has already shipped.&lt;/p&gt;
&lt;h2&gt;The four waste patterns&lt;/h2&gt;

&lt;p&gt;Remetric ships five analyzers covering four categories of waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cardinality explosion.&lt;/strong&gt; A label with high uniqueness produces one&lt;br&gt;
series per value. A &lt;code&gt;trace_id&lt;/code&gt; label on a request counter generates a&lt;br&gt;
new series for every request and never reuses any of them. Within&lt;br&gt;
hours a single metric can carry hundreds of thousands of dead series,&lt;br&gt;
consuming TSDB head memory and slowing every query that touches it.&lt;br&gt;
Remetric ranks labels by uniqueness ratio and series-count&lt;br&gt;
contribution, flags hot labels, and produces a fix snippet that drops&lt;br&gt;
the offending label via &lt;code&gt;metric_relabel_configs&lt;/code&gt;.&lt;/p&gt;
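
&lt;p&gt;As a rough illustration of what such a fix snippet can look like (the exact text remetric emits may differ), the shell sketch below prints a &lt;code&gt;labeldrop&lt;/code&gt; rule for one label. The helper name &lt;code&gt;labeldrop_snippet&lt;/code&gt; is hypothetical and not part of remetric; the YAML keys are standard Prometheus relabeling syntax.&lt;/p&gt;

```shell
# Hypothetical helper, not part of remetric: print a metric_relabel_configs
# fragment that removes one label from every series scraped by the job.
labeldrop_snippet() {
  label="$1"
  printf '%s\n' \
    'metric_relabel_configs:' \
    '  - action: labeldrop' \
    "    regex: ${label}"
}

# Paste the output under the affected job in scrape_configs.
labeldrop_snippet trace_id
```

&lt;p&gt;The Prometheus documentation cautions that series must remain uniquely labeled after a &lt;code&gt;labeldrop&lt;/code&gt;, so for a per-request label the durable fix is usually to stop emitting it in the instrumented code.&lt;/p&gt;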

&lt;p&gt;&lt;strong&gt;Unused metrics.&lt;/strong&gt; Exporters scrape thousands of metric names.&lt;br&gt;
Dashboards, alerting rules, and recording rules reference a subset.&lt;br&gt;
The leftover is dead weight in head series. Remetric walks every&lt;br&gt;
Grafana dashboard and every Prometheus rule expression, collects the&lt;br&gt;
set of referenced metric names, diffs against the ingested set, and&lt;br&gt;
emits one finding per unused metric. The fix is a&lt;br&gt;
&lt;code&gt;metric_relabel_configs&lt;/code&gt; &lt;code&gt;drop&lt;/code&gt; rule per metric.&lt;/p&gt;
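
&lt;p&gt;The diff itself is a plain set difference. A minimal sketch, assuming two hypothetical input files with one metric name per line (how remetric collects those sets internally is not shown here, and the function names &lt;code&gt;unused_metrics&lt;/code&gt; and &lt;code&gt;drop_rules&lt;/code&gt; are illustrative only):&lt;/p&gt;

```shell
# Hypothetical inputs, one metric name per line:
#   $1 = names currently ingested
#   $2 = names referenced by dashboards and rules
unused_metrics() {
  ingested_sorted="$(mktemp)"
  referenced_sorted="$(mktemp)"
  sort -u "$1" > "$ingested_sorted"
  sort -u "$2" > "$referenced_sorted"
  # comm -23 keeps only the lines unique to the first file
  comm -23 "$ingested_sorted" "$referenced_sorted"
  rm -f "$ingested_sorted" "$referenced_sorted"
}

# Read metric names from stdin and print one drop rule per name.
drop_rules() {
  echo 'metric_relabel_configs:'
  while IFS= read -r name; do
    printf '  - source_labels: [__name__]\n    regex: %s\n    action: drop\n' "$name"
  done
}
```

&lt;p&gt;Piping one into the other, as in &lt;code&gt;unused_metrics ingested.txt referenced.txt | drop_rules&lt;/code&gt;, yields a block that can be reviewed before it goes under the relevant scrape job.&lt;/p&gt;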

&lt;p&gt;&lt;strong&gt;Alert hygiene.&lt;/strong&gt; Two failure modes: alerting rules that fire&lt;br&gt;
constantly (noise that everyone scrolls past) and rules that have&lt;br&gt;
never fired in the available retention window (broken? threshold&lt;br&gt;
wrong? failure mode no longer present?). Both need a human decision,&lt;br&gt;
but neither announces itself. Remetric queries rule history and&lt;br&gt;
surfaces rules in each state with lookback window and observed&lt;br&gt;
fire-count as evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken panels.&lt;/strong&gt; Panel queries that reference metrics no longer in&lt;br&gt;
head series or in recording-rule outputs. The panel renders empty;&lt;br&gt;
nobody notices, because empty looks the same as fine. Remetric parses&lt;br&gt;
every PromQL target across every dashboard, diffs against the&lt;br&gt;
existence set (head series union recording-rule outputs), and emits&lt;br&gt;
one finding per &lt;code&gt;(dashboard, missing-metric)&lt;/code&gt; pair listing the&lt;br&gt;
affected panels.&lt;/p&gt;

&lt;p&gt;None of these are exotic. All are detectable from Prometheus and&lt;br&gt;
Grafana HTTP APIs. The PromQL queries for each have been folklore for&lt;br&gt;
years; the value is in running them on a schedule and producing&lt;br&gt;
actionable findings instead of more PromQL to maintain.&lt;/p&gt;
&lt;h2&gt;Why a separate tool&lt;/h2&gt;

&lt;p&gt;The detection logic is scattered across blog posts, gists, and&lt;br&gt;
pinned Slack messages. Per-backend tools (cortex-tool, vmctl,&lt;br&gt;
mimirtool) each cover a slice, usually for the backend their vendor&lt;br&gt;
sells. None of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cross over to Grafana to ask "does anything still use this metric?"&lt;/li&gt;
&lt;li&gt;check whether alert rules ever fire&lt;/li&gt;
&lt;li&gt;detect broken panels (which requires walking dashboards and
querying head series in one pass)&lt;/li&gt;
&lt;li&gt;ship as a single static binary that runs in CI without a runtime
install&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remetric covers all four patterns plus the Grafana side, with a&lt;br&gt;
read-only contract: it never writes to the target Prometheus or&lt;br&gt;
Grafana, and bounded concurrency (5 in-flight requests by default,&lt;br&gt;
configurable) keeps it from overwhelming the target during a scan.&lt;/p&gt;
&lt;h2&gt;Does this work for Grafana Cloud?&lt;/h2&gt;

&lt;p&gt;Yes. Grafana Cloud's Prometheus (hosted Mimir) and Grafana itself&lt;br&gt;
speak the same HTTP APIs as the self-hosted versions, so remetric&lt;br&gt;
runs against them with bearer-token auth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;remetric scan &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prometheus&lt;/span&gt;    https://prometheus-prod-XX.grafana.net/api/prom &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prom-token&lt;/span&gt;    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GRAFANA_CLOUD_PROM_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--grafana&lt;/span&gt;       https://YOUR-ORG.grafana.net &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--grafana-token&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GRAFANA_CLOUD_GRAFANA_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prom-max-in-flight&lt;/span&gt; 2 &lt;span class="nt"&gt;--grafana-max-in-flight&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lowered concurrency keeps remetric under Grafana Cloud's&lt;br&gt;
per-tenant rate limits during a full scan.&lt;/p&gt;

&lt;p&gt;Two of remetric's analyzers overlap with built-in Grafana Cloud&lt;br&gt;
features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality Management&lt;/strong&gt; (UI, available on all tiers) shows top
metrics and top labels by series count. This is the same data remetric's
cardinality analyzer surfaces, but it is bound to the UI: no CI
integration, no paste-ready fix snippets, no programmatic
consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Metrics&lt;/strong&gt; (paid tiers) automatically aggregates unused
dimensions to reduce cardinality. Conceptually overlaps with
remetric's unused-metrics analyzer, but operates as opaque
automation: you don't see which labels were dropped or what
dashboards rely on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What Grafana Cloud's built-ins don't cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert hygiene (never-firing or always-firing rules).&lt;/li&gt;
&lt;li&gt;Broken panels (queries pointing at metrics that no longer exist).&lt;/li&gt;
&lt;li&gt;CI integration via &lt;code&gt;--fail-on=critical&lt;/code&gt; and &lt;code&gt;exit 3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Auditable, human-reviewed fixes you can land in a PR rather than
delegate to an automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Grafana Cloud users, remetric is most useful at those gaps:&lt;br&gt;
alert hygiene, broken panels, and getting label-by-label&lt;br&gt;
explanations of what's driving the bill, instead of just "Adaptive&lt;br&gt;
Metrics handled it."&lt;/p&gt;
&lt;h2&gt;What you get per finding&lt;/h2&gt;

&lt;p&gt;Each finding carries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A class slug: &lt;code&gt;hot-label&lt;/code&gt;, &lt;code&gt;unused-metric&lt;/code&gt;, &lt;code&gt;never-firing-alert&lt;/code&gt;,
&lt;code&gt;always-firing-alert&lt;/code&gt;, &lt;code&gt;label-pattern-overly-granular&lt;/code&gt;,
&lt;code&gt;broken-panel&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A severity (&lt;code&gt;critical&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt; / &lt;code&gt;medium&lt;/code&gt; / &lt;code&gt;low&lt;/code&gt;) derived from
observed series counts, uniqueness ratios, and lookback windows.&lt;/li&gt;
&lt;li&gt;Evidence: sample values, series counts, affected panel titles.&lt;/li&gt;
&lt;li&gt;A paste-ready fix snippet: YAML for &lt;code&gt;prometheus.yml&lt;/code&gt; when the fix
is a scrape-config change, an instruction block when the fix is
editing a Grafana dashboard.&lt;/li&gt;
&lt;li&gt;A documentation URL pointing at &lt;code&gt;remetric.dev/findings/&amp;lt;class&amp;gt;&lt;/code&gt;
with the canonical write-up: what the pattern is, why it matters,
how remetric detects it, known false positives, how to fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix snippet is the load-bearing part. Every finding answers "now&lt;br&gt;
what?" with copy-pasteable text, not "consider reducing cardinality"&lt;br&gt;
advice.&lt;/p&gt;
&lt;h2&gt;Running a scan&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fremetric-dev%2Fremetric%2Fmain%2Fdemo%2Fremetric.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fremetric-dev%2Fremetric%2Fmain%2Fdemo%2Fremetric.gif" alt="remetric scan demo" width="760" height="451"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;remetric scan &lt;span class="nt"&gt;--prometheus&lt;/span&gt; http://localhost:9090 &lt;span class="nt"&gt;--grafana&lt;/span&gt; http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five analyzers run in sequence (each logs when it starts and how&lt;br&gt;
long it took, so a hung target or a slow analyzer is visible). The&lt;br&gt;
severity table at the top gives at-a-glance ranking; per-finding&lt;br&gt;
detail blocks below carry evidence and the fix.&lt;/p&gt;

&lt;p&gt;For CI integration, swap terminal output for JSON and add a fail-on&lt;br&gt;
threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;remetric scan &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prometheus&lt;/span&gt; http://prom.internal:9090 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--grafana&lt;/span&gt;    http://grafana.internal:3000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;     json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--fail-on&lt;/span&gt;    critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;--fail-on&lt;/code&gt; set, remetric exits with code 3 if any finding is at or&lt;br&gt;
above the threshold. By default it exits 0 regardless of findings,&lt;br&gt;
so the tool wires into pipelines without surprise failures.&lt;/p&gt;
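
&lt;p&gt;In a pipeline step, the exit status can be translated into an explicit pass or fail. The wrapper below is a sketch: &lt;code&gt;run_gate&lt;/code&gt; is a hypothetical name, and treating every non-zero status other than 3 as a tool error is an assumption rather than documented behavior.&lt;/p&gt;

```shell
# Run the given command and map its exit status for a CI job.
# Status 3 means findings at or above --fail-on, per the text above.
# Any other non-zero status is assumed here to be a tool or network error.
run_gate() {
  status=0
  "$@" || status=$?
  case "$status" in
    0) echo "no findings at or above the threshold"; return 0 ;;
    3) echo "findings at or above the threshold, failing the job"; return 1 ;;
    *) echo "scan did not complete (exit $status)"; return 2 ;;
  esac
}

# Example invocation, using only flags shown in this article:
#   run_gate remetric scan --prometheus http://prom.internal:9090 \
#     --grafana http://grafana.internal:3000 --output json --fail-on critical
```

&lt;p&gt;Separating "findings" from "scan failed" lets a job mark an unreachable target as an infrastructure problem instead of a hygiene regression.&lt;/p&gt;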

&lt;p&gt;Suppress known-noise patterns with anchored regex flags:&lt;br&gt;
&lt;code&gt;--ignore-metric "node_.*"&lt;/code&gt;, &lt;code&gt;--ignore-label "container_label_.*"&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;--ignore-alert "TestAlert.*"&lt;/code&gt;, &lt;code&gt;--ignore-dashboard "Legacy .*"&lt;/code&gt;. The&lt;br&gt;
flags are repeatable, and each pattern is wrapped as &lt;code&gt;^(&amp;lt;pattern&amp;gt;)$&lt;/code&gt; so&lt;br&gt;
a substring match cannot accidentally suppress unrelated findings.&lt;/p&gt;
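
&lt;p&gt;The effect of the anchoring can be checked in a shell. This is an illustration only: &lt;code&gt;grep -E&lt;/code&gt; stands in for whatever regex engine remetric uses, and &lt;code&gt;matches_anchored&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```shell
# Wrap the pattern in ^( ... )$ so it must match the whole name.
matches_anchored() {
  pattern="$1"
  name="$2"
  printf '%s\n' "$name" | grep -Eq "^(${pattern})\$"
}

for name in node_cpu_seconds_total kube_node_info; do
  if matches_anchored 'node_.*' "$name"; then
    echo "$name: suppressed"
  else
    echo "$name: kept"
  fi
done
# prints:
#   node_cpu_seconds_total: suppressed
#   kube_node_info: kept
```

&lt;p&gt;An unanchored search for the same pattern would also hit &lt;code&gt;kube_node_info&lt;/code&gt;, which is the accidental suppression the wrapping avoids.&lt;/p&gt;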

&lt;p&gt;The &lt;code&gt;report&lt;/code&gt; subcommand produces the same findings as &lt;code&gt;scan&lt;/code&gt; but in&lt;br&gt;
self-contained HTML (&lt;code&gt;--format html&lt;/code&gt;) or Markdown (&lt;code&gt;--format markdown&lt;/code&gt;)&lt;br&gt;
for PR comments and review distribution.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;v0.1 ships five analyzers. The post-v0.1 roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loki support.&lt;/strong&gt; Log cardinality is its own waste category; the
API surface is parallel enough that existing client patterns
transfer cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recording-rule suggestions.&lt;/strong&gt; If three dashboards each compute
the same expensive aggregation on every refresh, that aggregation
is a recording rule waiting to be promoted. The analyzer would
surface the candidates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot diff.&lt;/strong&gt; &lt;code&gt;remetric scan --baseline=last-week.json&lt;/code&gt; to
surface regressions over time, not just absolute state.
Cardinality drift is the obvious target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VictoriaMetrics, Mimir, Thanos extended support.&lt;/strong&gt; Basic VM
support already works (the Prometheus API path is shared); deeper
integration would unlock backend-specific signals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further out: a plugin system for custom analyzers (so teams can&lt;br&gt;
codify their own anti-patterns) and a parallel continuous-monitoring&lt;br&gt;
SaaS layer with alerts on cardinality spikes, multi-cluster views,&lt;br&gt;
and historical trends.&lt;/p&gt;

&lt;h2&gt;Get started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One-liner install (drops to $HOME/.local/bin)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://remetric.dev/install.sh | sh

&lt;span class="c"&gt;# Homebrew (macOS or Linux)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;remetric-dev/tap/remetric

&lt;span class="c"&gt;# Docker (linux/amd64, linux/arm64)&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; ghcr.io/remetric-dev/remetric:latest &lt;span class="se"&gt;\&lt;/span&gt;
  scan &lt;span class="nt"&gt;--prometheus&lt;/span&gt; http://host.docker.internal:9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documentation at &lt;a href="https://remetric.dev" rel="noopener noreferrer"&gt;remetric.dev&lt;/a&gt;. Source and&lt;br&gt;
issues at &lt;a href="https://github.com/remetric-dev/remetric" rel="noopener noreferrer"&gt;github.com/remetric-dev/remetric&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feedback on production scans is the most useful input for the v0.2&lt;br&gt;
roadmap: what the tool caught, what it missed, what it&lt;br&gt;
false-positived. Open a GitHub issue with a redacted output snippet&lt;br&gt;
and the rough shape of the stack.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>prometheus</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
