astronaut

Posted on Jun 11 • Edited on Jul 8

🧑‍🚀 Claude Code Skills Catalog: Observability, Stale Detection, and OpenTelemetry in Practice

#ai #agents #aiops #claude

You spent an evening writing a custom skill, shipped it to the team — and went blind. Does it fire at all? Is anyone using it? How many tokens does it burn, and is it worth the cost? Multiply that by the whole team and you get a catalog that nobody actually knows anything about. Here is how to make it observable using Claude Code's native telemetry and OpenTelemetry — without patching a single line of source code.

The Problem: A Catalog Nobody Watches

When a team adopts Claude Code seriously, skills start accumulating on their own. Someone adds a code-reviewer, someone else pulls in a db-migration-helper from a neighboring repo, another person installs a plugin with a dozen skills "just in case." The problem is not the quantity. The problem is that for every one of them you cannot answer basic questions:

Has anyone called this skill this month? Or is it dead weight in the catalog, and every request pays context tokens for it anyway?
Which skill burns the most tokens — and is it worth it? Expensive and popular: fine. Expensive and nearly unused: that is money burning that nobody notices.
The custom skill I wrote last week — does it actually fire? Or is the model silently ignoring it while I sit here convinced that "everything works"?
If a skill breaks, will I know? Or will it quietly fail every other run until someone shows up to complain?

Each issue is tolerable in isolation. But skills accumulate faster than anyone's understanding of who uses them and why, and at some point the catalog turns into a black box. This is skill sprawl: the same disease as server sprawl or tool sprawl, familiar to anyone who has maintained a catalog of microservices, libraries, or feature flags.

This is not a hypothetical. There is a real feature request #35319 in the Claude Code tracker where a team describes growth from 67 to 183 skills in a month with zero usage visibility — and asks for some kind of analytics. And mature observability consoles (Datadog for Claude Code, for example) currently stop at user / model / repo / cost breakdowns — no skill-level analytics. That gap matters once the catalog becomes shared infrastructure.

The right question to ask is not "is the team using Claude Code" (billing answers that), but "is each skill we created alive, and does it earn its place in the context?"

That last part is not a metaphor. Here is how it works under the hood: at session startup, Claude Code scans all available skills and inserts each one's name and description into the system prompt, because the model needs to know what it can call. This list goes into every API request, so more skills means a longer system prompt and more input tokens on every request the team makes. A legacy-formatter that nobody has called in six months still costs input tokens on every request, just by existing in the catalog. Claude Code even has dedicated settings for managing this cost: maxSkillDescriptionChars caps the per-skill description length (default: 1536 characters), and skillListingBudgetFraction limits the total fraction of the context window allocated to the listing (default: 1%). When the listing overflows, descriptions for the least-used skills are collapsed to bare names. Run /doctor to see whether truncation is happening in your session. The very existence of those settings confirms this is a real line item, not abstract "clutter."
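To get a rough feel for which skills press against the per-skill cap described above, a small audit over the skill files can help. This is only a sketch: it assumes each SKILL.md carries a single-line `description:` entry in its frontmatter, and the function name `skill_description_lengths` is made up for illustration. It does not reproduce how Claude Code itself measures or truncates the listing.

```shell
# Sketch: report the description length of each skill and flag those above
# the per-skill cap. ASSUMPTION: each SKILL.md has a single-line
# "description: ..." entry in its frontmatter.
DESC_CAP=1536   # default maxSkillDescriptionChars quoted in the article

skill_description_lengths() (
  # $1: skills root directory, e.g. .claude/skills
  for f in "$1"/*/SKILL.md; do
    [ -f "$f" ] || continue                      # glob matched nothing
    name=$(basename "$(dirname "$f")")
    # Take the first "description:" line and strip the key.
    desc=$(sed -n 's/^description:[[:space:]]*//p' "$f" | head -n 1)
    len=${#desc}
    if [ "$len" -gt "$DESC_CAP" ]; then flag="OVER_CAP"; else flag="ok"; fi
    printf '%s\t%s\t%s\n' "$name" "$len" "$flag"
  done
)
```

Running `skill_description_lengths .claude/skills` prints one tab-separated row per skill (name, description length, flag), which is enough to spot descriptions worth trimming.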

A skill goes through the same lifecycle as any service: written, shipped, it either sticks or quietly dies. But a service has a dashboard, an owner, and alerts. A skill has nothing: shipped and blind. Skills are a team's golden paths — tested routes to common tasks. So the catalog deserves to be treated like a service catalog: with a roster, owners, usage metrics, and an honest decommission process.

What Claude Code Gives You Out of the Box

Good news: you don't need to patch anything to get started. Claude Code has native OpenTelemetry support and emits enough signal to manage the catalog.

Signal	Type	What it carries	Where we route it
`claude_code.skill_activated`	event (log)	`skill.name`, `invocation_trigger`, `skill.source`	Loki
`claude_code.cost.usage`	metric	`skill.name`, `model`, USD	Prometheus
`claude_code.token.usage`	metric	`skill.name`, `type` (input/output/cache)	Prometheus
`claude_code.tool_result`	event	`tool_name`, `success`, `duration_ms`	Loki

The key point most people miss: skill activations are events (logs), not metrics, so Prometheus alone is not enough. Metrics will tell you "how many tokens did code-reviewer consume", but not "who called it, when, and from what trigger". For that you need a log pipeline and a log store, which in our case is Loki.

Three Gotchas Worth Knowing Upfront

Any telemetry write-up is easy to frame as "flip the flag and it all works." In practice there are three things I hit, and they are worth naming directly.

1. OTEL_LOG_TOOL_DETAILS=1 is mandatory. Without this flag, your custom skill names collapse into a featureless placeholder custom_skill in every event. Telemetry flows, the dashboard renders, but instead of code-reviewer and pr-describer you see seven rows of custom_skill. You typically discover this after collecting data.

2. Cost attribution is honest only for "first-party" skills. In the cost.usage metric, skill names are propagated as-is only for built-in, user-defined, and official marketplace skills. Names of third-party plugins are replaced with "third-party". This is why the demo uses project-level skills (.claude/skills/, source user-defined) — real names are visible in both events and cost metrics. If you distribute skills to your team through a third-party marketplace, keep this in mind: in the cost breakdown they will merge together.

3. Slash-command invocations and programmatic Skill tool calls are two different paths. When a user types /skill-name in the CLI, the skill content is expanded client-side and injected as a user message — this path may emit different (or no) skill_activated events depending on your Claude Code version. When Claude calls the same skill programmatically via the Skill tool, the tool_result event is emitted normally. Validate which invocation paths your team actually uses before treating this as a complete usage accounting system. The demo in this article uses the programmatic path.
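The first gotcha is cheap to guard against with a pre-flight check that runs before `claude` is launched. The sketch below is an assumption about what such a check could look like: the variable list combines the standard Claude Code telemetry variables with the `OTEL_LOG_TOOL_DETAILS` flag from gotcha 1, and the helper name `missing_telemetry_vars` is hypothetical. The article's own `.env.example` is not shown, so adjust the list to match it.

```shell
# Sketch of a pre-flight check to run before `claude`.
# ASSUMPTION: this list mirrors what the .env file for this setup exports;
# adjust it to your actual environment file.
REQUIRED_TELEMETRY_VARS="CLAUDE_CODE_ENABLE_TELEMETRY OTEL_LOGS_EXPORTER OTEL_METRICS_EXPORTER OTEL_EXPORTER_OTLP_ENDPOINT OTEL_LOG_TOOL_DETAILS"

missing_telemetry_vars() {
  # Prints the names of required variables that are unset or empty.
  for var in $REQUIRED_TELEMETRY_VARS; do
    eval "value=\${$var:-}"
    [ -n "$value" ] || printf '%s\n' "$var"
  done
}

# Usage:
#   missing=$(missing_telemetry_vars)
#   [ -z "$missing" ] && claude || echo "telemetry not fully configured: $missing"
```

An empty result means every listed variable is set in the current shell; it says nothing about whether the collector is actually reachable.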

Architecture: Why This Stack

The stack is built on the official Anthropic guide — claude-code-monitoring-guide: OTel Collector + Prometheus + Grafana. But the official guide has a metrics-only pipeline, and its dashboard panels cover cost / token / users / LOC — no skill panels. We extend it with two things:

Log pipeline + Loki — to capture skill_activated events. The official guide does not touch these because they are logs, not metrics.
Our own "Skill Catalog Management" dashboard — that is our contribution.

Why OpenTelemetry rather than a proprietary agent? Because OTLP is an open, vendor-neutral protocol from OpenTelemetry (a CNCF project), and the same telemetry stream, unchanged, goes to whatever you already have running: Grafana Cloud, Datadog, Honeycomb. Only the endpoint changes (OTEL_EXPORTER_OTLP_ENDPOINT); the skills and the other environment variables stay the same. No new vendor, no lock-in.

The local docker-compose in this article is a showcase and sandbox: a way to reproduce everything from scratch in a couple of minutes and touch it with your own hands.

The Core Idea: Stale Detection via Catalog Join

This is where it gets interesting — and non-obvious.

Telemetry shows only what fired. To find skills that nobody ever called — candidates for deletion — telemetry alone is not enough. You need to join activity against the full catalog of all skills.

Think about it for a second. If a skill has never been called, there is not a single event for it in Loki. It simply does not exist in the data. No query against telemetry will return "these skills are silent" — because silence is not logged.

The solution is a classic outer join: take the list of all skills (the source of truth) and attach an activation count from Loki. Rows where the count is empty → that is the dead weight.
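The join itself is small enough to sketch with plain awk. The version below is an illustration of the technique, not the dashboard's implementation: it assumes activation counts have already been exported as `skill_name,count` lines (the export step from the log store is not shown), and the function name `stale_report` is hypothetical.

```shell
# Sketch of the catalog-vs-activity left outer join with plain awk.
# ASSUMPTION: activations were exported as "skill_name,count" lines.
stale_report() {
  # $1: catalog CSV with a "skill_name" header, one skill per line
  # $2: activations CSV, "skill_name,count" per line, no header
  awk -F, '
    FILENAME == ARGV[1] { count[$1] = $2; next }   # load activation counts
    $1 == "skill_name" || $1 == "" { next }        # skip header and blanks
    {
      n = ($1 in count) ? count[$1] : 0            # no event at all -> 0
      printf "%s,%d,%s\n", $1, n, (n == 0 ? "STALE" : "active")
    }
  ' "$2" "$1"
}
```

Because the catalog is the left side of the join, every catalog entry appears in the output even when the activations file is empty, and skills that fired but are not in the catalog are dropped from this particular report.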

Our source of truth is skills-catalog.json, generated by scanning .claude/skills/*/SKILL.md:

./scripts/build-catalog.sh
# Wrote skills-catalog.json and grafana/catalog.csv:
# skill_name
# changelog-updater
# code-reviewer
# db-migration-helper
# ...

The script produces two forms: JSON for humans and programs, and CSV — embedded directly into the Grafana dashboard (via a TestData datasource) and outer-joined with activations from Loki. This is the technically honest answer to "what are we not using."
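For readers without the repository at hand, the CSV half of such a catalog builder could look roughly like this. It is a sketch under the assumption of one directory per skill, each containing a SKILL.md; the real build-catalog.sh may work differently, and the JSON form is omitted here. The function name `build_catalog_csv` is made up for illustration.

```shell
# Sketch of a catalog builder (CSV form only).
# ASSUMPTION: one directory per skill under the skills root, each with a
# SKILL.md; the directory name is taken as the skill name.
build_catalog_csv() (
  # $1: skills root (e.g. .claude/skills); prints CSV to stdout
  echo "skill_name"
  for f in "$1"/*/SKILL.md; do
    [ -f "$f" ] || continue          # glob matched nothing
    basename "$(dirname "$f")"
  done | sort
)

# Usage:
#   build_catalog_csv .claude/skills > grafana/catalog.csv
```

Directories without a SKILL.md are ignored, so stray folders under the skills root do not end up in the catalog.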

Demo Catalog: 7 Skills with Personality

These skills are fictional. They were written specifically for this observability demo and are not production-quality tools. Their purpose is to generate realistic telemetry patterns — not to be actually useful. Replace them with your team's real skills to instrument a live catalog.

To make the dashboard show something meaningful, you need a realistic mini-catalog. Seven skills, and each one makes real tool calls (git, Read, Glob, Bash) when invoked — generating real telemetry, not mocks.

Skill	Profile	Tools	Planned invocations
`code-reviewer`	medium cost, reliable	Bash(git) + Read	frequent (≈14)
`dep-auditor`	fast, unstable	Bash (≈50% exit 1)	frequent (≈13) — tests observability edge cases
`test-scaffolder`	slow, reliable	Glob + Read×N	notable (≈13)
`pr-describer`	fast, reliable	Bash(git)	notable (≈10)
`changelog-updater`	medium, reliable	Bash(git) + Read	moderate (≈7)
`legacy-formatter`	—	Glob	0 — demonstrates stale
`db-migration-helper`	—	Glob	0 — demonstrates stale

Two skills — legacy-formatter and db-migration-helper — are intentionally never called. These are our "dead" candidates that should surface in red.

dep-auditor deserves a separate note. It is deliberately unstable — the command inside alternates between success and failure:

# Persist a run counter so that consecutive invocations alternate.
COUNT=$(cat /tmp/dep_auditor_count 2>/dev/null || echo 0); COUNT=$((COUNT+1))
echo "$COUNT" > /tmp/dep_auditor_count
# Odd runs fail with exit status 1; even runs succeed.
if [ $((COUNT % 2)) -eq 1 ]; then
  echo "audit backend unreachable (attempt #$COUNT)" >&2 && exit 1
else
  echo "0 vulnerabilities found (attempt #$COUNT)"
fi

Why? To check whether native telemetry sees a "flapping" skill, and if not, why. Spoiler: it doesn't. The answer is in the section on exit codes below.

The "Skill Catalog Management" Dashboard

Now for the visual part. Stack is running, telemetry collected — let's see what we got.

At the top, four stat panels give an instant health snapshot of the catalog:

Catalog Size: 7 — how many skills are in the catalog (from the source of truth).
Active Skills: 6 — how many unique skills fired at least once in the period.
Total Invocations: 66 — total activations in the period.
Auditor Error Rate — a panel for skill error signal. In our demo it shows "No data" — and that is an honest, instructive result, explained below.

Already you can see a discrepancy: the catalog has 7 skills, but only 6 are active, so at first glance one skill is silent. (In the demo, two of our seven are actually silent; the sixth active skill is superpowers:executing-plans, which is not in the catalog and which I used to run the data collection plan itself. A nice illustration: monitoring caught a skill I wasn't even planning to show. The catalog lives its own life, which is exactly why you need to watch it.)

Hero: Leaderboard + Stale Skills

These are the two main panels, and they are most useful side by side.

Skill Usage Leaderboard (left) — ranking by activation count. Shows the team's golden paths: code-reviewer (14) leads, followed by dep-auditor and test-scaffolder (13 each). This is what the team actually bets on.

🔴 Stale Skills (right) — the catalog outer join with activity. Every skill from the catalog is joined to an activation count. And here are the red rows:

db-migration-helper → 0 — STALE
legacy-formatter → 0 — STALE

These two exist in the catalog but nobody has ever called them. Without the join against the catalog you would simply never see them — they are not in the telemetry. This panel answers the core catalog question: which skills are candidates for decommissioning?

Adoption and Cost

Adoption Over Time — activations by skill over time (5-minute buckets, stacked). On this curve you can see how a new skill gets adopted — or doesn't. You shipped a skill on Tuesday, and by Friday the curve for it is still flat? Adoption didn't happen, and that is a reason to talk to the team rather than silently keep the skill in the catalog.
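What the Adoption Over Time panel computes can be reproduced offline to sanity-check the curve. The sketch below assumes activation events have been exported as `<unix_seconds> <skill_name>` lines, which is a made-up export format; the dashboard itself does this bucketing in the query layer. The function name `bucket_activations` is hypothetical.

```shell
# Sketch: count activations per skill in 5-minute buckets.
# ASSUMPTION: stdin lines are "<unix_seconds> <skill_name>".
bucket_activations() {
  awk '{
    bucket = $1 - ($1 % 300)          # floor to the 5-minute boundary
    n[bucket " " $2]++
  }
  END { for (k in n) print k, n[k] }' | sort -k1,1n -k2,2
}

# Usage:
#   bucket_activations < activations.txt
```

Each output line is `bucket_start skill count`; a skill that shipped on Tuesday and has no lines by Friday is the flat curve described above.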

Cost & Tokens per Skill: cost and token breakdown by skill, from the claude_code.cost.usage / token.usage metrics. One important implementation detail: tokens are measured in tens of thousands, costs in cents. These are two fundamentally different scales, and plotting them on the same linear axis is meaningless, because the smaller series just hugs zero. So the two signals are separated into distinct panels (or table rows), each with its own scale. A small but telling thing: a dashboard is not "dump all metrics on one canvas," it is fitting the representation to the nature of the data.

Invocation Trigger (pie) answers the question of who is actually calling the skill: a human via /slash, Claude proactively, or a nested call from another skill. A useful breakdown — it distinguishes "skill that people consciously invoke" from "skill that fires in the background."

The Third Honest Gotcha: Native Telemetry Does Not Know Exit Codes

The "Auditor Error Rate" panel shows "No data" — and we deliberately did not hide that.

dep-auditor is designed to fail every other run: the bash command inside exits with status 1 on odd runs. One would expect success=false to show up in claude_code.tool_result, but it doesn't. Checking real data in Loki: 19 out of 19 Bash results show success=true.

Why? The official Claude Code documentation cleanly separates two levels:

What happened	`success`	`error_type`	Example
Bash didn't launch at all	`false`	`Error:ENOENT`	binary not found
Shell crashed abnormally	`false`	`ShellError`	OOM, kill signal
Command ran and exited with `exit 1`	`true`	(none)	`bash -c "exit 1"`
Command ran and exited with `exit 0`	`true`	(none)	`bash -c "exit 0"`

In other words, success reflects "the tool harness executed the command and got a result" — not "the command did what was intended." This is a design decision: Claude Code deliberately does not interpret the semantics of what it ran. For the platform, exit 1 is a valid program response, not an error.

Practical implication: native telemetry answers "did the skill run?", not "did the skill work correctly?" These are two different questions, and the second one requires the skill itself to report its result. There are two ways to do that: a custom OTLP write, where the skill sends an event with result=success/fail directly to the collector (OTEL_* variables are intentionally not inherited by child processes, so the endpoint must be set explicitly), or a PostToolUse hook that checks the command output.
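The custom OTLP write can be as small as one JSON document posted to the collector. The sketch below builds a log record in the OTLP/HTTP JSON encoding; the event name `skill_result`, the `result` and `exit_code` attributes, and the helper name `skill_result_payload` are all hypothetical, and it assumes the collector exposes the standard OTLP/HTTP port 4318, which the demo stack may or may not do.

```shell
# Sketch: build an OTLP/HTTP JSON log record that carries a semantic result.
# ASSUMPTIONS: event name and attribute keys are made up for illustration;
# skill names must not contain characters that need JSON escaping.
skill_result_payload() {
  # $1: skill name, $2: exit code of the wrapped command, $3: unix seconds
  if [ "$2" -eq 0 ]; then result="success"; else result="fail"; fi
  printf '{"resourceLogs":[{"scopeLogs":[{"logRecords":[{'
  printf '"timeUnixNano":"%s000000000",' "$3"
  printf '"body":{"stringValue":"skill_result"},'
  printf '"attributes":['
  printf '{"key":"skill.name","value":{"stringValue":"%s"}},' "$1"
  printf '{"key":"result","value":{"stringValue":"%s"}},' "$result"
  printf '{"key":"exit_code","value":{"intValue":"%s"}}' "$2"
  printf ']}]}]}]}\n'
}

# Usage inside a skill's command (endpoint set explicitly, not inherited):
#   some-audit-command; code=$?
#   skill_result_payload dep-auditor "$code" "$(date +%s)" |
#     curl -s -X POST -H 'Content-Type: application/json' \
#       --data-binary @- http://localhost:4318/v1/logs
```

With a record like this in the log store, an error-rate panel can key off the `result` attribute instead of the harness-level `success` field.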

This is the exact same logic by which you add a health check to a service: the infrastructure knows it is "running," but only the service itself knows it is "working correctly."

How to Reproduce

The entire stack runs locally in a couple of minutes. Here is the path from zero to a live dashboard.

1. Start the stack:

docker-compose up -d
sleep 12
docker-compose ps                          # 4 services Up
curl -s http://localhost:3001/api/health   # Grafana ok

2. Enable OTLP and launch Claude Code from the same shell (variables must reach the claude process, so source comes first):

source .env.example
claude

3. Use skills — call them as you would in real work. Leave two untouched (for the stale demonstration).

4. Open the dashboard: http://localhost:3001 (admin / admin) → Skill Catalog Management. Panels come alive in ~10–20 seconds.

UIs at hand: Grafana :3001 · Prometheus :9090 · Loki :3100.
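The fixed `sleep 12` in step 1 works on a warm machine but can be too short on a cold one. One option is a small readiness poll; the helper below is a generic sketch (the name `wait_until` is made up), and the usage line reuses the Grafana health URL from step 1.

```shell
# Sketch: poll a readiness command instead of relying on a fixed sleep.
wait_until() (
  # $1: max attempts, $2: delay in seconds, rest: command to retry
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && exit 0   # command succeeded: stack is ready
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1                             # gave up after all attempts
)

# Usage:
#   docker-compose up -d
#   wait_until 30 1 curl -sf http://localhost:3001/api/health
```

The function returns 0 as soon as the command succeeds and 1 after the last failed attempt, so it can gate the next step in a script.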

Pitfalls (So You Don't Have to Step in Them)

Everything you might trip over, in one place:

Symptom	Cause	Fix
Skill names show as `custom_skill`	`OTEL_LOG_TOOL_DETAILS=1` is not set	Close session → `source .env.example` → `claude`
Third-party plugin costs merged into `"third-party"`	Cost attribution only works for first-party skills	Use project-level / user-defined skills
Error rate panel shows "No data" despite failed commands	`success` in `tool_result` reflects harness failure, not command exit code — `bash -c "exit 1"` returns `success=true`	Add a PostToolUse hook or custom OTLP instrumentation inside the skill to report semantic result

Scaling Beyond Local

The local stack is a showcase and sandbox. What changes when you bring this to the team:

The dashboard is already portable. It lives as JSON in Grafana provisioning — commit it to your platform team's repository and it deploys into your corporate Grafana as-is.
Endpoint instead of localhost. OTEL_EXPORTER_OTLP_ENDPOINT switches to your corporate collector. Everything else stays untouched — that is the point of vendor-neutral OTLP.
Distributing skills via marketplace. When you package skills into a plugin and distribute them through a marketplace — remember the cost attribution gotcha: third-party plugins merge into "third-party". If per-skill cost visibility matters, keep them as first-party / user-defined.
Catalog as CI artifact. build-catalog.sh can run in the pipeline and publish the catalog as an artifact — then the source of truth is always fresh, and the dashboard always joins against the current list.
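Once the catalog is a CI artifact, the pipeline can also report what changed between two builds, which gives new skills an explicit moment to acquire an owner. This is a sketch of one possible check, not something the repository is known to contain; the function name `catalog_changes` is hypothetical, and both inputs are assumed to be catalog CSV files with a `skill_name` header.

```shell
# Sketch: list skills added (+) or removed (-) between two catalog builds.
# ASSUMPTION: both files are catalog CSVs with a one-line header.
catalog_changes() (
  # $1: previous catalog CSV, $2: current catalog CSV
  old=$(mktemp); new=$(mktemp)
  tail -n +2 "$1" | sort > "$old"      # drop header, sort for comm
  tail -n +2 "$2" | sort > "$new"
  comm -13 "$old" "$new" | sed 's/^/+/'   # only in the current catalog
  comm -23 "$old" "$new" | sed 's/^/-/'   # only in the previous catalog
  rm -f "$old" "$new"
)

# Usage:
#   catalog_changes previous/catalog.csv grafana/catalog.csv
```

An empty result means the catalog is unchanged; any `+` line is a skill that has just entered the shared context budget.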

Debrief

Claude Code skills sprawl exactly the way any unmonitored catalog sprawls — microservices, libraries, feature flags. The cure is also familiar: treat the catalog like a service. A roster with owners, usage metrics, an adoption curve, and an honest decommission process.

The good news is that Claude Code hands you everything you need for this out of the box — through an open standard, without patches and without proprietary agents. Two flags, a log pipeline for events, and one non-obvious technique: joining activity against the catalog, so you can see not just what is alive, but what is ready for honest decommissioning.

The engineering task is not to guess what the team uses, but to instrument the catalog thoroughly enough that its behavior becomes visible. Then decisions are made from data, not from intuition.

Stack, skills, configs, and dashboard — all in the repository. Starts with a single docker-compose up -d.

Next: "Debugging your Claude Code skills: what native telemetry won't tell you and how to close those gaps." Catalog management answers "what lives in the team." Skill debugging answers "does it work the way it was designed to" — and that is a separate story with different tooling.

DEV Community