Remetric: find waste in self-hosted Prometheus, Grafana, and Loki

#sre #devops #prometheus #observability

Self-hosted Prometheus stacks degrade in predictable ways: a label
explosion that quietly doubles TSDB head size, a metric scraped by
every node and queried by none, an alert rule that has not fired in
nine months, a dashboard panel pointing at a metric that was renamed
last quarter. The signals are all there in the existing APIs, but
writing the queries, running them on a schedule, and turning the
answers into actionable fixes is enough friction that nobody does it
until something breaks.

Remetric is a single static binary that does this. v0.1 has already shipped.

The four waste patterns

Remetric ships five analyzers covering four categories of waste.

Cardinality explosion. A label with high uniqueness produces one
series per value. A trace_id label on a request counter generates a
new series for every request and never reuses any of them. Within
hours a single metric can carry hundreds of thousands of dead series,
consuming TSDB head memory and slowing every query that touches it.
Remetric ranks labels by uniqueness ratio and series-count
contribution, flags hot labels, and produces a fix snippet that drops
the offending label via metric_relabel_configs.

Unused metrics. Exporters scrape thousands of metric names.
Dashboards, alerting rules, and recording rules reference a subset.
The leftover is dead weight in head series. Remetric walks every
Grafana dashboard and every Prometheus rule expression, collects the
set of referenced metric names, diffs against the ingested set, and
emits one finding per unused metric. The fix is a
metric_relabel_configs drop rule per metric.

Alert hygiene. Two failure modes: alerting rules that fire
constantly (noise that everyone scrolls past) and rules that have
never fired in the available retention window (broken? threshold
wrong? failure mode no longer present?). Both need a human decision,
but neither announces itself. Remetric queries rule history and
surfaces rules in each state with lookback window and observed
fire-count as evidence.

Broken panels. Panel queries that reference metrics no longer in
head series or in recording-rule outputs. The panel renders empty;
nobody notices, because empty looks the same as fine. Remetric parses
every PromQL target across every dashboard, diffs against the
existence set (head series union recording-rule outputs), and emits
one finding per (dashboard, missing-metric) pair listing the
affected panels.

None of these are exotic. All are detectable from Prometheus and
Grafana HTTP APIs. The PromQL queries for each have been folklore for
years; the value is in running them on a schedule and producing
actionable findings instead of more PromQL to maintain.

Why a separate tool

The detection logic exists scattered across blog posts, gists, and
pinned Slack messages. Per-backend tools (cortex-tool, vmctl,
mimirtool) each cover a slice, usually for the backend their vendor
sells. None of them:

cross over to Grafana to ask "does anything still use this metric?"
check whether alert rules ever fire
detect broken panels (which requires walking dashboards and querying head series in one pass)
ship as a single static binary that runs in CI without a runtime install

Remetric covers all four patterns plus the Grafana side, with a
read-only contract: it never writes to the target Prometheus or
Grafana, and bounded concurrency (5 in-flight requests by default,
configurable) keeps it from overwhelming the target during a scan.

Does this work for Grafana Cloud?

Yes. Grafana Cloud's Prometheus (hosted Mimir) and Grafana itself
speak the same HTTP APIs as the self-hosted versions, so remetric
runs against them with bearer-token auth:

remetric scan \
  --prometheus    https://prometheus-prod-XX.grafana.net/api/prom \
  --prom-token    "$GRAFANA_CLOUD_PROM_TOKEN" \
  --grafana       https://YOUR-ORG.grafana.net \
  --grafana-token "$GRAFANA_CLOUD_GRAFANA_TOKEN" \
  --prom-max-in-flight 2 --grafana-max-in-flight 2

The lowered concurrency keeps remetric under Grafana Cloud's
per-tenant rate limits during a full scan.

Two of remetric's analyzers overlap with built-in Grafana Cloud
features:

Cardinality Management (UI, available on all tiers) shows top metrics and top labels by series count. Same data remetric's cardinality analyzer surfaces, but bound to the UI: no CI integration, no paste-ready fix snippets, no programmatic consumption.
Adaptive Metrics (paid tiers) automatically aggregates unused dimensions to reduce cardinality. Conceptually overlaps with remetric's unused-metrics analyzer, but operates as opaque automation: you don't see which labels were dropped or what dashboards rely on them.

What Grafana Cloud's built-ins don't cover:

Alert hygiene (never-firing or always-firing rules).
Broken panels (queries pointing at metrics that no longer exist).
CI integration via --fail-on=critical and exit 3.
Auditable, human-reviewed fixes you can land in a PR rather than delegate to an automation.

For Grafana Cloud users, remetric is most useful at those gaps:
alert hygiene, broken panels, and getting label-by-label
explanations of what's driving the bill, instead of just "Adaptive
Metrics handled it."

What you get per finding

Each finding carries:

A class slug: hot-label, unused-metric, never-firing-alert, always-firing-alert, label-pattern-overly-granular, broken-panel.
A severity (critical / high / medium / low) derived from observed series counts, uniqueness ratios, and lookback windows.
Evidence: sample values, series counts, affected panel titles.
A paste-ready fix snippet: YAML for prometheus.yml when the fix is a scrape-config change, an instruction block when the fix is editing a Grafana dashboard.
A documentation URL pointing at remetric.dev/findings/<class> with the canonical write-up: what the pattern is, why it matters, how remetric detects it, known false positives, how to fix.

The fix snippet is the load-bearing part. Every finding answers "now
what?" with copy-pasteable text, not "consider reducing cardinality"
advice.

Running a scan

remetric scan --prometheus http://localhost:9090 --grafana http://localhost:3000

Five analyzers run in sequence (each logs when it starts and how
long it took, so a hung target or a slow analyzer is visible). The
severity table at the top gives at-a-glance ranking; per-finding
detail blocks below carry evidence and the fix.

For CI integration, swap terminal output for JSON and add a fail-on
threshold:

remetric scan \
  --prometheus http://prom.internal:9090 \
  --grafana    http://grafana.internal:3000 \
  --output     json \
  --fail-on    critical

Exit code 3 on any finding at or above the threshold; default
behavior is exit 0 regardless of findings, so the tool wires into
pipelines without surprise failures.

Known-noise patterns suppress with anchored regex flags:
--ignore-metric "node_.*", --ignore-label "container_label_.*",
--ignore-alert "TestAlert.*", --ignore-dashboard "Legacy .*". The
flags are repeatable; the patterns wrap as ^(<pattern>)$ so
substring matches don't accidentally suppress unrelated findings.

The report subcommand produces the same findings as scan but in
self-contained HTML (--format html) or Markdown (--format markdown)
for PR comments and review distribution.

What's next

v0.1 ships five analyzers. The post-v0.1 roadmap:

Loki support. Logs cardinality is its own waste category; the API surface is parallel enough that existing client patterns transfer cleanly.
Recording-rule suggestions. If three dashboards each compute the same expensive aggregation on every refresh, that aggregation is a recording rule waiting to be promoted. The analyzer would surface the candidates.
Snapshot diff. remetric scan --baseline=last-week.json to surface regressions over time, not just absolute state. Cardinality drift is the obvious target.
VictoriaMetrics, Mimir, Thanos extended support. Basic VM support already works (the Prometheus API path is shared); deeper integration would unlock backend-specific signals.

Further out: a plugin system for custom analyzers (so teams can
codify their own anti-patterns) and a parallel continuous-monitoring
SaaS layer with alerts on cardinality spikes, multi-cluster views,
and historical trends.

Get started

# One-liner install (drops to $HOME/.local/bin)
curl -sSL https://remetric.dev/install.sh | sh

# Homebrew (macOS or Linux)
brew install remetric-dev/tap/remetric

# Docker (linux/amd64, linux/arm64)
docker run --rm ghcr.io/remetric-dev/remetric:latest \
  scan --prometheus http://host.docker.internal:9090

Documentation at remetric.dev. Source and
issues at github.com/remetric-dev/remetric.

Feedback on production scans is the most useful input for the v0.2
roadmap: what the tool caught, what it missed, what it
false-positived. Open a GitHub issue with a redacted output snippet
and the rough shape of the stack.