DEV Community: astronaut

Your Claude Code skill is running on empty — and you don't know it

astronaut — Thu, 09 Jul 2026 06:59:00 +0000

You ship a custom skill to your team. A week later a colleague says: "this skill doesn't work — it always returns the same thing." You open the dashboard: 47 activations, success=true on every one. All green. But the skill is broken.

Part 1 was about the catalog: what lives in your team's skill set, what's stale, what it costs. This article is about testing an individual skill on real runs. Same repository, same stack, zero patches.

Why you can't test a skill like code

With regular code the path is clear: write a function, cover it with tests, run them in CI. A skill breaks that on all three counts:

The "code" is a prompt. SKILL.md is an instruction to the model, not a program. There's no function to call with a fixture.
The behavior is nondeterministic. The same skill on the same input can take different paths: today the model reads three files, tomorrow one.
The result is an artifact, not a value. Often it's prose: an updated document, a PR description, a code review. There's no assertEquals for prose.

The only honest approach: test in production — observe real runs and check what actually happened.

In part 1, dep-auditor exited with exit 1 on every other run — yet telemetry showed success=true on all 19 results. Documented behavior: success means the harness got a response, not that the skill succeeded. That example was easy to dismiss — who audits dependencies with a skill when npm audit exists? So let's use a skill where the model is genuinely required.

The test subject: `doc-updater`

doc-updater keeps the architecture document in sync with the stack config: read what changed, understand the implications, rewrite the prose in docs/ARCHITECTURE.md. That's exactly the work a deterministic utility can't do.

I invoked it 3 times across 2 sessions. The output is free-form text — success=true tells you nothing about whether it reflects the actual code. A shell command at least has an exit code (which, as we've seen, gets lost anyway). Prose doesn't even have that.

I opened the dashboard from part 1:

doc-updater is at the top: 12 events, green bar all the way. By this view the skill looks healthy. Now the real picture:

Out of 3 invocations, one updated the document (updated) and two returned empty. An empty result is not an error: the code hadn't changed since the last run, so the skill correctly did nothing. The problem is that, by native signals, those two runs looked identical to the updated one. And critically, if the skill had failed midway (say, a Confluence publish dropped over the network), it would also return empty, indistinguishable from a legitimate "nothing to do." To tell these apart, you need a signal from inside the skill.

Three states of a finished skill run

State	Meaning	How to check
1. Ran	Invoked, tool calls completed, `success=true`	Native telemetry, out of the box
2. Did meaningful work	Actually did the job — didn't process empty input	Nothing out of the box. But the skill knows
3. Did it correctly	The output is semantically correct	Not even the skill knows — needs eval or review

All native signals answer state 1. The goal of in-production testing is to reliably check state 2. State 3 is a different discipline entirely — more on that at the end.

Check 1. Native telemetry: free signals and their limits

Two flags from part 1 (CLAUDE_CODE_ENABLE_TELEMETRY=1, OTEL_LOG_TOOL_DETAILS=1) plus OTel → Loki / Prometheus give you:

Question	Signal	Storage
Is the skill being invoked?	`claude_code.skill_activated`	Loki
How often?	count by `skill_name`	Loki
How fast?	`duration_ms` on `tool_result`	Loki
Tokens / cost?	`claude_code.token.usage` / `cost.usage`	Prometheus
What triggered it?	`invocation_trigger`	Loki

Two blind spots on real data:

skill_activated undercounts invocations. I invoked doc-updater 3 times. Loki shows 2 events. Why: skill_activated fires once per session — on the first invocation of a skill within a given claude process. My third invocation was the second call in one session: no event. For catalog inventory that's fine; for counting actual runs — it's not.

success=true says nothing about the work. All tool results: success=true, including both empty runs.

What happened	`success`	Example
Bash never launched	`false`	binary not found (`ENOENT`)
Command exited with `exit 1`	`true`	`bash -c "exit 1"`
Command exited with `exit 0`	`true`	`bash -c "exit 0"`

success is "the harness got a result," not "the skill did its job." For prose skills there's no shell failure at all — an empty run looks perfect.
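The table can be reproduced with a toy model of the harness. This is only a sketch of the semantics described above, not Claude Code's actual implementation, and `harness_success` is a hypothetical name introduced for illustration.

```shell
# Toy model of the `success` flag, NOT Claude Code's real implementation.
# harness_success is a hypothetical helper: it prints "false" only when the
# command cannot be launched at all, and "true" for any exit code otherwise.
harness_success() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo false                    # the tool never started (the ENOENT row)
    return 0
  fi
  "$@" >/dev/null 2>&1 || true    # the exit code is deliberately ignored
  echo true                       # the harness got a result, whatever it was
}

harness_success sh -c 'exit 1'        # prints: true
harness_success sh -c 'exit 0'        # prints: true
harness_success no-such-binary-xyz    # prints: false
```

The middle row is the one that matters: a failing command and a passing command are indistinguishable at this level.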

Check 2. Hooks: intercepting from outside

A PostToolUse hook fires after every tool and receives on stdin a JSON payload of what the tool returned (tool_response).

Before writing any hook logic, I captured a real payload with a one-liner:

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Bash|Edit|Write",
      "hooks": [{ "type": "command", "command": "cat >> /tmp/hook-capture.jsonl || true" }]
    }]
  }
}

The result:

Bash  tool_response:  interrupted, isImage, noOutputExpected, stderr, stdout
Edit  tool_response:  filePath, newString, oldString, originalFile, structuredPatch, userModified

Bash has no exit code field. The hook sees stdout, stderr, and interrupted, but no numeric return code. Whether the command "failed" must be inferred from stdout/stderr content by hand.
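One possible workaround, assuming you control the command the skill runs, is to have that command print its own exit code as a marker line that the hook can later read from stdout. The names `run_marked` and `exit_code_from_output` and the `__exit_code=N` marker format are hypothetical, chosen for this sketch.

```shell
# Hypothetical wrapper: run a command, then print a marker line carrying the
# exit code, so a hook that only sees stdout can recover it.
run_marked() {
  rc=0
  "$@" || rc=$?
  printf '__exit_code=%d\n' "$rc"
  return "$rc"
}

# Hypothetical parser for the hook side: read captured stdout on stdin and
# print the last marker value, or "unknown" when no marker is present.
exit_code_from_output() {
  ec_value=$(sed -n 's/^__exit_code=\([0-9][0-9]*\)$/\1/p' | tail -n 1)
  echo "${ec_value:-unknown}"
}

(run_marked sh -c 'echo scanning; exit 1' || true) | exit_code_from_output   # prints: 1
```

The marker travels inside stdout, so it survives exactly the path the hook can observe. Output without a marker yields "unknown" rather than a guessed value.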

For Edit the payload is richer: structuredPatch and userModified let a hook detect whether a file actually changed — a real state 2 signal for doc-updater.
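A minimal sketch of such a check, assuming the payload arrives on stdin and that `tool_response.structuredPatch` is an array of diff hunks that is empty or missing when nothing changed. That assumption, the helper name `edit_changed_file`, and the sample payload are all illustrative and should be verified against your own capture. It uses jq, which the hook commands in this article already rely on.

```shell
# Hypothetical hook helper. Assumes the PostToolUse payload arrives on stdin
# and that tool_response.structuredPatch is an array of diff hunks (empty or
# missing when nothing changed). Returns 0 when the Edit changed something.
edit_changed_file() {
  jq -e '(.tool_response.structuredPatch // []) | length > 0' >/dev/null 2>&1
}

# Made-up sample payload, trimmed to the fields this sketch reads.
payload='{"tool_name":"Edit","tool_response":{"filePath":"docs/ARCHITECTURE.md","structuredPatch":[{"lines":["-old","+new"]}]}}'
if printf '%s' "$payload" | edit_changed_file; then
  echo "result=updated"
else
  echo "result=empty"
fi
```

If jq is missing or the payload is malformed, the helper returns non-zero and the run is reported as empty, so this signal can under-report but will not claim an update that did not happen.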

The production hook that pushes an event to the collector (full payload in the repo):


{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "jq -r '.tool_name' 2>/dev/null | grep -q . && curl -s -X POST http://localhost:4318/v1/logs -H 'Content-Type: application/json' -d '{...hook_tool_observed...}' >/dev/null; true"
      }]
    }]
  }
}

Two gotchas:

OTEL_* are not inherited by subprocesses. Claude Code intentionally doesn't forward OTLP variables to hooks or bash. The endpoint (http://localhost:4318) must be hardcoded in the command — $OTEL_EXPORTER_OTLP_ENDPOINT won't be there.
Exit code 2 is blocking. For PostToolUse, code 2 returns stderr to Claude as an error. Any other non-zero is a non-blocking warning. End with ; true so telemetry never interrupts the skill.
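The second gotcha can be wrapped once instead of remembering the trailing `; true` in every hook. A minimal sketch; `never_block` is a hypothetical helper name.

```shell
# Hypothetical wrapper for hook commands: run the telemetry command, discard
# its output, and always return 0 so a failure can never surface as a
# blocking (exit 2) or warning (other non-zero) hook result.
never_block() {
  "$@" >/dev/null 2>&1 || true
  return 0
}

never_block sh -c 'echo "collector down" >&2; exit 2'
echo "hook exit: $?"    # prints: hook exit: 0
```

When wrapping curl this way, adding its `--max-time` option is worth considering so that a slow collector cannot stall the hook either.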

On a real run this hook produced 3 hook_tool_observed events — one per Bash call, independent of skill_activated.

Hooks are strictly richer than native telemetry: they see output, duration, and for Edit calls, the actual diff. But the semantic verdict "did the skill do its job" still has to come from one place: the skill itself.

Check 3. Self-report: the skill tells you what it did

doc-updater knows whether it updated the doc or returned `empty`. One step added to the end of `SKILL.md`:

```shell
RESULT="updated" # or "empty" / "error"
FILES_TOUCHED=1
curl -s -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d "{\"resourceLogs\":[{\"resource\":{\"attributes\":[{\"key\":\"service.name\",\"value\":{\"stringValue\":\"claude-code\"}}]},\"scopeLogs\":[{\"logRecords\":[{\"body\":{\"stringValue\":\"skill_result\"},\"attributes\":[{\"key\":\"skill_name\",\"value\":{\"stringValue\":\"doc-updater\"}},{\"key\":\"result\",\"value\":{\"stringValue\":\"$RESULT\"}},{\"key\":\"files_touched\",\"value\":{\"stringValue\":\"$FILES_TOUCHED\"}}]}]}]}]}" \
  > /dev/null || true
```

Three deliberate decisions:

|| true — if the collector is down, the skill still completes. Observability must not break functionality.
Endpoint hardcoded — OTEL_* aren't inherited (see above). Not a duplication; it's the only path.
result and files_touched are semantic attributes — the skill's own verdict on its task, not the harness's.

In Loki these become labels — queryable and usable in dashboard panels. The "Skill Health" panel above is one LogQL query:

```prometheus
sum by (result) (count_over_time({service_name="claude-code"} |= "skill_result" | result =~ ".+" [$__range]))
```

The numbers: native skill_activated → 2 (one per session). skill_result → 3, each with a verdict: updated=1, empty=2. Self-report counts invocations more accurately and distinguishes meaningful work from idle runs.

Where this honestly ends

result=updated means the skill edited a file. It does not mean the documentation is correct.

The skill might have updated the wrong section, invented a component that doesn't exist, or misrepresented a service relationship. By every signal — success=true, result=updated, files_touched=1 — that's a perfect run. But the prose is lying.

That's state 3: "did it correctly." It requires a different class of tools: eval, LLM-as-judge, human review. Checks 2 and 3 reliably get you to state 2. Beyond that is evaluation territory. Anyone claiming telemetry covers correctness is selling peace of mind, not observability.

Acceptance checklist

Before declaring a skill ready and handing it to the team:

#	Question	Signal	Note
1	Being invoked?	`skill_activated`	Counts per session, not per invocation
2	No hangs?	`duration_ms` normal
3	Context not bloated?	`token.usage` not spiked	Prompt inflation, injection
4	Logic reaches the end?	`skill_result` arriving	Final step actually executes
5	Doing work or idling?	breakdown by `result`	The main test: state 2
6	Output correct?	—	Out of scope — eval/review

Items 1–5: two flags, a log pipeline, one curl at the end of the skill. For items 4–5 to work, the skill needs a report step:

Run normal skill steps.
Decide the semantic result (updated / empty / skipped).
Send via curl to the hardcoded endpoint.
|| true — degrade gracefully.
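The four steps can be sketched as a reusable shell template. The function names are hypothetical; the payload mirrors the skill_result record shown earlier, and the builder is kept separate from the sender so the JSON can be inspected without a running collector.

```shell
# Hypothetical template for the report step; function names are illustrative.
otlp_attr() {
  # Print one OTLP string attribute. Values are assumed to contain no quotes
  # or backslashes (true for skill names and result codes like "updated").
  printf '{"key":"%s","value":{"stringValue":"%s"}}' "$1" "$2"
}

skill_result_payload() {
  # $1 = skill name, $2 = result (updated / empty / skipped), $3 = files touched
  printf '{"resourceLogs":[{"resource":{"attributes":[%s]},' \
    "$(otlp_attr service.name claude-code)"
  printf '"scopeLogs":[{"logRecords":[{"body":{"stringValue":"skill_result"},'
  printf '"attributes":[%s,%s,%s]}]}]}]}\n' \
    "$(otlp_attr skill_name "$1")" \
    "$(otlp_attr result "$2")" \
    "$(otlp_attr files_touched "$3")"
}

report_skill_result() {
  # Endpoint hardcoded on purpose (OTEL_* is not inherited); || true keeps
  # the skill alive if the collector is down; --max-time bounds the wait.
  curl -s --max-time 2 -X POST http://localhost:4318/v1/logs \
    -H 'Content-Type: application/json' \
    -d "$(skill_result_payload "$1" "$2" "$3")" >/dev/null 2>&1 || true
}

skill_result_payload doc-updater empty 0    # prints the JSON that would be sent
```

At the end of a skill, the last step would then be a single call such as `report_skill_result doc-updater updated 1`.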

Debrief

success=true is about the harness, not the job. For a prose-generating skill, an empty run is indistinguishable from a useful one — until the skill says otherwise. All three of our runs looked green; only skill_result showed that two produced nothing.

In-production testing doesn't make a skill correct. It makes its behavior visible.

The stack, doc-updater, hook, dashboard, and template are all in the repository. A single docker-compose up -d reproduces everything one-to-one. Part 1: "Claude Code Skills Catalog: Observability, Stale Detection, and OpenTelemetry in Practice".

🧑‍🚀 Claude Code Skills Catalog: Observability, Stale Detection, and OpenTelemetry in Practice

astronaut — Thu, 11 Jun 2026 06:51:45 +0000

You spent an evening writing a custom skill, shipped it to the team — and went blind. Does it fire at all? Is anyone using it? How many tokens does it burn, and is it worth the cost? Multiply that by the whole team and you get a catalog that nobody actually knows anything about. Here is how to make it observable using Claude Code's native telemetry and OpenTelemetry — without patching a single line of source code.

The Problem: A Catalog Nobody Watches

When a team adopts Claude Code seriously, skills start accumulating on their own. Someone adds a code-reviewer, someone else pulls in a db-migration-helper from a neighboring repo, another person installs a plugin with a dozen skills "just in case." The problem is not the quantity. The problem is that for every one of them you cannot answer basic questions:

Has anyone called this skill this month? Or is it dead weight in the catalog, and every request pays context tokens for it anyway?
Which skill burns the most tokens — and is it worth it? Expensive and popular: fine. Expensive and nearly unused: that is money burning that nobody notices.
The custom skill I wrote last week — does it actually fire? Or is the model silently ignoring it while I sit here convinced that "everything works"?
If a skill breaks, will I know? Or will it quietly fail every other run until someone shows up to complain?

Each issue is tolerable in isolation. But skills accumulate faster than anyone's understanding of who uses them and why, and at some point the catalog turns into a black box. This is skill sprawl: the same disease as server sprawl or tool sprawl, familiar to anyone who has maintained a catalog of microservices, libraries, or feature flags.

This is not a hypothetical. There is a real feature request #35319 in the Claude Code tracker where a team describes growth from 67 to 183 skills in a month with zero usage visibility — and asks for some kind of analytics. And mature observability consoles (Datadog for Claude Code, for example) currently stop at user / model / repo / cost breakdowns — no skill-level analytics. That gap matters once the catalog becomes shared infrastructure.

The right question to ask is not "is the team using Claude Code" (billing answers that), but "is each skill we created alive, and does it earn its place in the context?"

That last part is not a metaphor. Here is how it works under the hood: at session startup, Claude Code scans all available skills and inserts each one's name and description into the system prompt, because the model needs to know what it can call. This list goes into every API request. More skills mean a longer system prompt, and a longer system prompt means more input tokens billed on every request the team makes. A legacy-formatter that nobody has called in six months still costs input tokens on every request, just by existing in the catalog. Claude Code even has dedicated settings for managing this cost: maxSkillDescriptionChars caps the per-skill description length (default: 1536 characters), and skillListingBudgetFraction limits the total fraction of the context window allocated to the listing (default: 1%). When the listing overflows, descriptions for the least-used skills are collapsed to bare names. Run /doctor to see whether truncation is happening in your session. The very existence of those settings confirms this is a real line item, not abstract "clutter."

A skill goes through the same lifecycle as any service: written, shipped, it either sticks or quietly dies. But a service has a dashboard, an owner, and alerts. A skill has nothing: shipped and blind. Skills are a team's golden paths — tested routes to common tasks. So the catalog deserves to be treated like a service catalog: with a roster, owners, usage metrics, and an honest decommission process.

What Claude Code Gives You Out of the Box

Good news: you don't need to patch anything to get started. Claude Code has native OpenTelemetry support and emits enough signal to manage the catalog.

Signal	Type	What it carries	Where we route it
`claude_code.skill_activated`	event (log)	`skill.name`, `invocation_trigger`, `skill.source`	Loki
`claude_code.cost.usage`	metric	`skill.name`, `model`, USD	Prometheus
`claude_code.token.usage`	metric	`skill.name`, `type` (input/output/cache)	Prometheus
`claude_code.tool_result`	event	`tool_name`, `success`, `duration_ms`	Loki

The key point most people miss: skill activations are events (logs), not metrics, so Prometheus alone is not enough. Metrics will tell you "how many tokens did code-reviewer consume", but not "who called it, when, and from what trigger". For that you need a log pipeline and a log store (in our case, Loki).

Three Gotchas Worth Knowing Upfront

Any telemetry write-up is easy to frame as "flip the flag and it all works." In practice there are three things I hit, and they are worth naming directly.

1. OTEL_LOG_TOOL_DETAILS=1 is mandatory. Without this flag, your custom skill names collapse into a featureless placeholder custom_skill in every event. Telemetry flows, the dashboard renders, but instead of code-reviewer and pr-describer you see seven rows of custom_skill. You typically discover this after collecting data.

2. Cost attribution is honest only for "first-party" skills. In the cost.usage metric, skill names are propagated as-is only for built-in, user-defined, and official marketplace skills. Names of third-party plugins are replaced with "third-party". This is why the demo uses project-level skills (.claude/skills/, source user-defined) — real names are visible in both events and cost metrics. If you distribute skills to your team through a third-party marketplace, keep this in mind: in the cost breakdown they will merge together.

3. Slash-command invocations and programmatic Skill tool calls are two different paths. When a user types /skill-name in the CLI, the skill content is expanded client-side and injected as a user message — this path may emit different (or no) skill_activated events depending on your Claude Code version. When Claude calls the same skill programmatically via the Skill tool, the tool_result event is emitted normally. Validate which invocation paths your team actually uses before treating this as a complete usage accounting system. The demo in this article uses the programmatic path.

Architecture: Why This Stack

The stack is built on the official Anthropic guide — claude-code-monitoring-guide: OTel Collector + Prometheus + Grafana. But the official guide has a metrics-only pipeline, and its dashboard panels cover cost / token / users / LOC — no skill panels. We extend it with two things:

Log pipeline + Loki — to capture skill_activated events. The official guide does not touch these because they are logs, not metrics.
Our own "Skill Catalog Management" dashboard — that is our contribution.

Why OpenTelemetry rather than a proprietary agent? Because OTLP is an open standard (graduated in CNCF), and the same telemetry stream, unchanged, goes to whatever you already have running: Grafana Cloud, Datadog, Honeycomb. Only the endpoint changes (OTEL_EXPORTER_OTLP_ENDPOINT) — skills and environment variables stay the same. No new vendor, no vendor lock-in.

The local docker-compose in this article is a showcase and sandbox: a way to reproduce everything from scratch in a couple of minutes and touch it with your own hands.

The Core Idea: Stale Detection via Catalog Join

This is where it gets interesting — and non-obvious.

Telemetry shows only what fired. To find skills that nobody ever called — candidates for deletion — telemetry alone is not enough. You need to join activity against the full catalog of all skills.

Think about it for a second. If a skill has never been called, there is not a single event for it in Loki. It simply does not exist in the data. No query against telemetry will return "these skills are silent" — because silence is not logged.

The solution is a classic outer join: take the list of all skills (the source of truth) and attach an activation count from Loki. Rows where the count is empty → that is the dead weight.

Our source of truth is skills-catalog.json, generated by scanning .claude/skills/*/SKILL.md:

./scripts/build-catalog.sh
# Wrote skills-catalog.json and grafana/catalog.csv:
# skill_name
# changelog-updater
# code-reviewer
# db-migration-helper
# ...

The script produces two forms: JSON for humans and programs, and CSV — embedded directly into the Grafana dashboard (via a TestData datasource) and outer-joined with activations from Loki. This is the technically honest answer to "what are we not using."
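Outside Grafana, the same join can be sketched in a few lines of shell and awk. This is an illustration of the technique, not the repository's script: `join_catalog` and both input formats are assumptions made for the example.

```shell
# Hypothetical sketch of the catalog join, outside Grafana.
#   catalog file: one skill name per line (the source of truth)
#   counts file:  "<skill_name> <activations>" per line (what telemetry saw)
join_catalog() {
  awk '
    FILENAME == ARGV[1] { seen[$1] = $2 + 0; next }  # first file: activation counts
    {
      n = ($1 in seen) ? seen[$1] : 0                # left join: missing means 0
      printf "%s %d %s\n", $1, n, (n == 0 ? "STALE" : "active")
    }
  ' "$2" "$1"
}

catalog=$(mktemp)
counts=$(mktemp)
printf '%s\n' code-reviewer legacy-formatter db-migration-helper > "$catalog"
printf '%s\n' 'code-reviewer 14' 'superpowers:executing-plans 1' > "$counts"
join_catalog "$catalog" "$counts"
# prints:
#   code-reviewer 14 active
#   legacy-formatter 0 STALE
#   db-migration-helper 0 STALE
rm -f "$catalog" "$counts"
```

The catalog drives the output, so a skill with zero events still gets a row. A skill that fired but is not in the catalog is dropped by this left join, and would need the reverse query to surface.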

Demo Catalog: 7 Skills with Personality

These skills are fictional. They were written specifically for this observability demo and are not production-quality tools. Their purpose is to generate realistic telemetry patterns — not to be actually useful. Replace them with your team's real skills to instrument a live catalog.

To make the dashboard show something meaningful, you need a realistic mini-catalog. Seven skills, and each one makes real tool calls (git, Read, Glob, Bash) when invoked — generating real telemetry, not mocks.

Skill	Profile	Tools	Planned invocations
`code-reviewer`	medium cost, reliable	Bash(git) + Read	frequent (≈14)
`dep-auditor`	fast, unstable	Bash (≈50% exit 1)	frequent (≈13) — tests observability edge cases
`test-scaffolder`	slow, reliable	Glob + Read×N	notable (≈13)
`pr-describer`	fast, reliable	Bash(git)	notable (≈10)
`changelog-updater`	medium, reliable	Bash(git) + Read	moderate (≈7)
`legacy-formatter`	—	Glob	0 — demonstrates stale
`db-migration-helper`	—	Glob	0 — demonstrates stale

Two skills — legacy-formatter and db-migration-helper — are intentionally never called. These are our "dead" candidates that should surface in red.

dep-auditor deserves a separate note. It is deliberately unstable — the command inside alternates between success and failure:

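# Persist a run counter across invocations; start from 0 on the first run.
COUNT=$(cat /tmp/dep_auditor_count 2>/dev/null || echo 0); COUNT=$((COUNT+1))
echo "$COUNT" > /tmp/dep_auditor_count
# Odd runs fail, even runs succeed: a deliberately flapping check.
if [ $((COUNT % 2)) -eq 1 ]; then
  echo "audit backend unreachable (attempt #$COUNT)" >&2 && exit 1
else
  echo "0 vulnerabilities found (attempt #$COUNT)"
fi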

Why? To check whether native telemetry sees a "flapping" skill — and if not, why. Spoiler: it doesn't. The answer is in the section on the third honest gotcha below.

The "Skill Catalog Management" Dashboard

Now for the visual part. Stack is running, telemetry collected — let's see what we got.

At the top, four stat panels give an instant health snapshot of the catalog:

Catalog Size: 7 — how many skills are in the catalog (from the source of truth).
Active Skills: 6 — how many unique skills fired at least once in the period.
Total Invocations: 66 — total activations in the period.
Auditor Error Rate — a panel for skill error signal. In our demo it shows "No data" — and that is an honest, instructive result, explained below.

Already you can see a discrepancy: the catalog has 7, but only 6 are active, so it looks as if one skill is silent. (In the demo, two of our seven are actually silent; the sixth active skill is superpowers:executing-plans, which I used to run the data collection plan itself. A nice illustration: monitoring caught a skill I wasn't even planning to show. The catalog lives its own life, which is exactly why you need to watch it.)

Hero: Leaderboard + Stale Skills

These are the two main panels, and they are most useful side by side.

Skill Usage Leaderboard (left) — ranking by activation count. Shows the team's golden paths: code-reviewer (14) leads, followed by dep-auditor and test-scaffolder (13 each). This is what the team actually bets on.

🔴 Stale Skills (right) — the catalog outer join with activity. Every skill from the catalog is joined to an activation count. And here are the red rows:

db-migration-helper → 0 — STALE
legacy-formatter → 0 — STALE

These two exist in the catalog but nobody has ever called them. Without the join against the catalog you would simply never see them — they are not in the telemetry. This panel answers the core catalog question: which skills are candidates for decommissioning?

Adoption and Cost

Adoption Over Time — activations by skill over time (5-minute buckets, stacked). On this curve you can see how a new skill gets adopted — or doesn't. You shipped a skill on Tuesday, and by Friday the curve for it is still flat? Adoption didn't happen, and that is a reason to talk to the team rather than silently keep the skill in the catalog.

Cost & Tokens per Skill — cost and token breakdown by skill, from the claude_code.cost.usage / token.usage metrics. One important implementation detail: tokens are measured in tens of thousands, costs in cents. These are two fundamentally different scales, and trying to plot them on the same linear axis is meaningless — the cheaper metric just hugs zero. So the two signals are separated into distinct panels (or table rows), each with its own scale. A small but telling thing: a dashboard is not "dump all metrics on one canvas," it is fitting the representation to the nature of the data.

Invocation Trigger (pie) answers the question of who is actually calling the skill: a human via /slash, Claude proactively, or a nested call from another skill. A useful breakdown — it distinguishes "skill that people consciously invoke" from "skill that fires in the background."

The Third Honest Gotcha: Native Telemetry Does Not Know Exit Codes

The "Auditor Error Rate" panel shows "No data" — and we deliberately did not hide that.

dep-auditor is designed to fail every other run: the bash command inside exits with exit 1 on odd runs. One would expect success=false to show up in claude_code.tool_result — but it doesn't. Checking real data in Loki: 19 out of 19 Bash results show success=true.

Why? The official Claude Code documentation cleanly separates two levels:

What happened	`success`	`error_type`	Example
Bash didn't launch at all	`false`	`Error:ENOENT`	binary not found
Shell crashed abnormally	`false`	`ShellError`	OOM, kill signal
Command ran and exited with `exit 1`	`true`	(none)	`bash -c "exit 1"`
Command ran and exited with `exit 0`	`true`	(none)	`bash -c "exit 0"`

In other words, success reflects "the tool harness executed the command and got a result" — not "the command did what was intended." This is a design decision: Claude Code deliberately does not interpret the semantics of what it ran. For the platform, exit 1 is a valid program response, not an error.

Practical implication: native telemetry answers "did the skill run?" — not "did the skill work correctly?" These are two different questions, and the second one requires the skill itself to report its result. Either through a custom OTLP write (the skill sends an event with result=success/fail directly to the collector — OTEL_* variables are intentionally not inherited by child processes, so the endpoint must be set explicitly), or through a PostToolUse hook that checks the command output.

This is the exact same logic by which you add a health check to a service: the infrastructure knows it is "running," but only the service itself knows it is "working correctly."

How to Reproduce

The entire stack runs locally in a couple of minutes. Here is the path from zero to a live dashboard.

1. Start the stack:

docker-compose up -d
sleep 12
docker-compose ps                          # 4 services Up
curl -s http://localhost:3001/api/health   # Grafana ok

2. Enable OTLP and launch Claude Code from the same shell (variables must reach the claude process, so source comes first):

source .env.example
claude

3. Use skills — call them as you would in real work. Leave two untouched (for the stale demonstration).

4. Open the dashboard: http://localhost:3001 (admin / admin) → Skill Catalog Management. Panels come alive in ~10–20 seconds.

UIs at hand: Grafana :3001 · Prometheus :9090 · Loki :3100.

Pitfalls (So You Don't Have to Step in Them)

Everything you might trip over, in one place:

Symptom	Cause	Fix
Skill names show as `custom_skill`	`OTEL_LOG_TOOL_DETAILS=1` is not set	Close session → `source .env.example` → `claude`
Third-party plugin costs merged into `"third-party"`	Cost attribution only works for first-party skills	Use project-level / user-defined skills
Error rate panel shows "No data" despite failed commands	`success` in `tool_result` reflects harness failure, not command exit code — `bash -c "exit 1"` returns `success=true`	Add a PostToolUse hook or custom OTLP instrumentation inside the skill to report semantic result

Scaling Beyond Local

The local stack is a showcase and sandbox. What changes when you bring this to the team:

The dashboard is already portable. It lives as JSON in Grafana provisioning — commit it to your platform team's repository and it deploys into your corporate Grafana as-is.
Endpoint instead of localhost. OTEL_EXPORTER_OTLP_ENDPOINT switches to your corporate collector. Everything else stays untouched — that is the point of vendor-neutral OTLP.
Distributing skills via marketplace. When you package skills into a plugin and distribute them through a marketplace — remember the cost attribution gotcha: third-party plugins merge into "third-party". If per-skill cost visibility matters, keep them as first-party / user-defined.
Catalog as CI artifact. build-catalog.sh can run in the pipeline and publish the catalog as an artifact — then the source of truth is always fresh, and the dashboard always joins against the current list.

Debrief

Claude Code skills sprawl exactly the way any unmonitored catalog sprawls — microservices, libraries, feature flags. The cure is also familiar: treat the catalog like a service. A roster with owners, usage metrics, an adoption curve, and an honest decommission process.

The good news is that Claude Code hands you everything you need for this out of the box — through an open standard, without patches and without proprietary agents. Two flags, a log pipeline for events, and one non-obvious technique: joining activity against the catalog, so you can see not just what is alive, but what is ready for honest decommissioning.

The engineering task is not to guess what the team uses, but to instrument the catalog thoroughly enough that its behavior becomes visible. Then decisions are made from data, not from intuition.

Stack, skills, configs, and dashboard — all in the repository. Starts with a single docker-compose up -d.

Next: "Debugging your Claude Code skills: what native telemetry won't tell you and how to close those gaps." Catalog management answers "what lives in the team." Skill debugging answers "does it work the way it was designed to" — and that is a separate story with different tooling.

Prompt Management Is Infrastructure: Requirements, Tools, and Patterns

astronaut — Tue, 17 Mar 2026 17:00:56 +0000

Mission Log #6 — Prompt control center: from strings in code to a production-grade system.

If your LLM service keeps prompts in code or in a UI without strict version control, you're accumulating technical debt. Not the usual kind. This debt doesn't show up as stack traces. It shows up as silent quality drift: SLAs green, logs clean, and users increasingly getting irrelevant answers.

In production, a prompt is the behavioral contract of your service. It directly affects tool-calling accuracy, RAG faithfulness, latency distribution, inference cost, and downstream behavior.

This article is not about prompt engineering (how to write a good prompt). It's about prompt management — how to manage prompts as an engineer: version, deploy, roll back, observe, and avoid silent regressions.

You'll find:

What prompt management is and how it differs from prompt engineering.
What production demands from prompt management (and what breaks when you ignore it).
A maturity model: where your team is and what the next step is.
Tools that address these requirements and how they map.
Architectural patterns for embedding prompt management into your system.

What Is Prompt Management (and What Are We Versioning?)

Prompt management is the set of practices and tools for the full lifecycle of prompts: creation, versioning, testing, deployment, monitoring, and rollback.

In production, a "prompt" is not a single text string. It's a composite artifact of several components, each of which affects service behavior:

Component	Example	Why we version it
System prompt	"You are a support agent..."	Defines model behavior
Few-shot examples	3 input→output pairs	Affect format and quality of responses
Tool schemas	OpenAPI specs for function calling	Define which tools the model can call
Output schema	JSON Schema for structured output	Breaks downstream parsers when changed
Inference params	model, temperature, max_tokens, top_p	Affect latency, cost, response style
Prompt template	Template with variables (`{{user_name}}`, `{{context}}`)	Logic for assembling the final prompt
Routing logic	Which prompt for which tenant/use case	Determines who sees which version

Engineers often version only the system prompt text. But if someone changes a tool schema or bumps temperature from 0.3 to 0.9, system behavior changes just as much. In mature production systems, teams version the entire artifact, not just the text.

9 Requirements for Production-Grade Prompt Management

These requirements come from working with production LLM systems. Each is described with a concrete failure mode — what actually breaks when the requirement isn't met.

It helps to split them into three planes:

Versioning: version identity, diff, change history, reproducibility.
Delivery/Rollout: labels, canary, version distribution, rollback.
Control/Governance: eval gating, audit trail, trace linkage.

1. Immutable versions

Every prompt version is immutable and carries a unique prompt_version_id (a content hash or an incremental id).

Without it: you can't tell which exact prompt version was live during an incident. "Someone changed the prompt last week, I think" is guesswork, not debugging.
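
One way to get such an identifier is to hash the whole artifact, not just the text. A minimal sketch, assuming the artifact is a plain dict; the field names and the `make_version_id` helper are illustrative and not taken from any particular tool:

```python
import hashlib
import json


def make_version_id(artifact: dict) -> str:
    """Derive a content-addressed id from the full prompt artifact."""
    # Canonical JSON: sorted keys and fixed separators, so the same
    # content always produces the same bytes and therefore the same hash.
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical artifact covering text, inference params, and tools.
artifact = {
    "system_prompt": "You are a support agent...",
    "params": {"model": "gpt-4o", "temperature": 0.3},
    "tools": ["search_kb", "create_ticket"],
}

v1 = make_version_id(artifact)
# Changing only an inference param still yields a new version id.
v2 = make_version_id({**artifact, "params": {"model": "gpt-4o", "temperature": 0.9}})
```

Because the id is derived from content, two people who publish the same artifact get the same id, and any silent change to params or tools shows up as a new version.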

2. Labels / Aliases

Named labels for routing prompt versions at runtime. Examples:

By environment: production, canary, staging.
By model: gpt-4o, claude-sonnet, llama-3-70b — different prompts tuned for different LLMs.
By tenant/use case: tenant_acme, support_flow, sales_agent.
By experiment: experiment_v3_concise, baseline.

The app requests a prompt by label, not by concrete version. That lets you change the version without changing code.

Without it: changing a prompt version means a full service deploy. Every text change goes through the full CI/CD pipeline.
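
The label indirection can be sketched in a few lines. This is an in-memory stand-in under my own assumptions (the `PromptRegistry` class and the version ids are hypothetical); a real store would sit behind an API and persist both maps:

```python
class PromptRegistry:
    """In-memory stand-in for a prompt store with labels (illustrative only)."""

    def __init__(self) -> None:
        self._versions: dict[str, dict] = {}  # version_id -> artifact
        self._labels: dict[str, str] = {}     # label -> version_id

    def add_version(self, version_id: str, artifact: dict) -> None:
        if version_id in self._versions:
            raise ValueError(f"{version_id} already exists; versions are immutable")
        self._versions[version_id] = artifact

    def set_label(self, label: str, version_id: str) -> None:
        if version_id not in self._versions:
            raise KeyError(version_id)
        self._labels[label] = version_id

    def get(self, label: str) -> tuple[str, dict]:
        version_id = self._labels[label]
        return version_id, self._versions[version_id]


registry = PromptRegistry()
registry.add_version("v2-abc123", {"system_prompt": "You are a support agent..."})
registry.add_version("v3-def456", {"system_prompt": "You are a concise support agent..."})
registry.set_label("production", "v2-abc123")

# Application code only ever knows the label.
before, _ = registry.get("production")
registry.set_label("production", "v3-def456")  # promote: no code change, no deploy
after, _ = registry.get("production")
registry.set_label("production", "v2-abc123")  # rollback is the same operation
```

Promotion and rollback are both a single `set_label` call, which is why labels and fast rollback tend to arrive together.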

3. Evaluation gating

A new prompt version goes through controlled validation before promotion:

domain-specific golden dataset,
automated regression tests,
offline comparison to baseline,
(optional) LLM-based scoring.

Promotion is a deliberate decision, not a blind merge.

Without it: every prompt change is a lottery. You can go a month without noticing that answer quality dropped 15%.
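
A gate can be as simple as a comparison of mean scores on the golden dataset. A rough sketch, assuming per-example scores in the 0..1 range already exist for both versions; `passes_gate` and the 0.02 tolerance are my own illustrative choices, not a standard:

```python
from statistics import mean


def passes_gate(
    baseline_scores: list[float],
    candidate_scores: list[float],
    max_regression: float = 0.02,
) -> bool:
    """Allow promotion only if the candidate's mean score on the golden
    dataset does not fall more than `max_regression` below the baseline."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("both versions must be scored on the same golden dataset")
    return mean(candidate_scores) >= mean(baseline_scores) - max_regression


# Hypothetical per-example scores on a five-item golden dataset.
baseline = [0.9, 0.8, 1.0, 0.7, 0.9]   # mean 0.86
candidate = [0.9, 0.6, 0.9, 0.5, 0.8]  # mean 0.74

promote = passes_gate(baseline, candidate)  # False: promotion is blocked
```

In CI this boolean would decide whether the label is allowed to move; the scoring itself (exact match, rubric, LLM judge) is a separate concern.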

4. Low-latency fetch

Predictable time to fetch the prompt at runtime. In-memory cache on the hot path. The goal is to avoid putting a slow, uncached config dependency on the critical request path.

Without it: prompt management becomes a single point of failure. If the config service responds in 500ms instead of 5ms, your TTFT is already broken.
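
A minimal sketch of the client-side cache, assuming a `fetch` callable that talks to your prompt source (hypothetical; it could be an HTTP call to a config service). For brevity it refreshes synchronously once the TTL expires; a production client would more likely refresh in the background:

```python
import time
from typing import Callable


class CachedPromptClient:
    """Serve prompts from memory; refresh after `ttl` seconds; fall back to
    the last known good value if the source fails (illustrative sketch)."""

    def __init__(self, fetch: Callable[[str], dict], ttl: float = 60.0) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._cache: dict[str, tuple[float, dict]] = {}  # label -> (fetched_at, prompt)

    def get(self, label: str) -> dict:
        now = time.monotonic()
        cached = self._cache.get(label)
        if cached and now - cached[0] < self._ttl:
            return cached[1]      # hot path: no network call
        try:
            prompt = self._fetch(label)
        except Exception:
            if cached:
                return cached[1]  # stale, but the service stays up
            raise                 # nothing cached yet: surface the error
        self._cache[label] = (now, prompt)
        return prompt
```

The important property is the `except` branch: a slow or failing prompt source degrades to a stale prompt instead of a failed request.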

5. Audit trail

Who changed what, when, and why. Commit message + metadata.

Without it: after an incident you run a detective investigation instead of root-cause analysis. "Who changed the support prompt?" shouldn't take more than 10 seconds to answer.

6. Trace linkage

prompt_version_id attached to every trace/span. Correlation with metrics: latency, tool-call success rate, semantic failures.

Without it: you see quality degrade but can't tie it to a specific prompt version. Observability without trace linkage is dashboards for the sake of dashboards.
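
Once the id is on every trace, the correlation itself is a group-by. A small sketch, assuming traces are exported as dicts with a `prompt_version_id` and a boolean `ok` field (both field names are my assumption; `ok` stands in for whatever semantic check you run):

```python
from collections import defaultdict


def failure_rate_by_version(traces: list[dict]) -> dict[str, float]:
    """Group traces by prompt_version_id and compute the share that failed."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for trace in traces:
        version = trace["prompt_version_id"]
        totals[version] += 1
        if not trace["ok"]:
            failures[version] += 1
    return {version: failures[version] / totals[version] for version in totals}


# Hypothetical trace records.
traces = [
    {"prompt_version_id": "v2-abc123", "ok": True},
    {"prompt_version_id": "v2-abc123", "ok": True},
    {"prompt_version_id": "v3-def456", "ok": False},
    {"prompt_version_id": "v3-def456", "ok": True},
]

rates = failure_rate_by_version(traces)
```

Here `rates` maps each version to its failure share, so a regression shows up against one specific version instead of as a general dip.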

7. Rollback without downtime

Reassign a label → fast rollback without redeploy or service restart (within your propagation window).

Without it: recovery time after a bad prompt equals full deploy time (minutes or hours instead of seconds). In agent systems with dozens of prompts, that's critical.

8. Structured schema support

Version not only text but tool schemas, output constraints, and templating.

Without it: you track prompt text, someone quietly changes the output schema, and the downstream parser breaks. Half the artifact is out of control.

9. GitOps-friendly or API-driven workflow

Infra and product teams work in parallel without overwriting each other. Prompts are managed via Git (PR, review) or via API (SDK, UI).

Without it: two people edit the same prompt in the UI → last save wins, wiping the first person's changes. Familiar Google Docs pain, but with production impact.

Maturity Model: Where Are You Now?

Not every system needs Level 4. The point is to know your current level and choose the next step.

Level 0 — Strings in code

Prompts live as literals in code or hardcoded in the UI.

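# Typical Level 0: the prompt is a string literal next to the API call
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
user_input = "How do I reset my password?"  # placeholder user message

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that..."},
        {"role": "user", "content": user_input},
    ],
)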

No explicit versions (only git blame if you're lucky).
Rollback = git revert + full deploy.
Debug: "check the code for what's there" — but production may be running a different build.

Covers: minimal code-level audit trail and version history in Git; almost none of the runtime requirements.

Level 1 — Git-based prompts

Prompts live in separate files (YAML, JSON, Markdown) and are versioned in Git.

# prompts/support_agent/v2.yaml
id: support_agent
version: v2
model: gpt-4o
temperature: 0.3
system_prompt: |
  You are a support agent for {{product_name}}.
  Always check the knowledge base before answering.
  If unsure, escalate to a human.
tools:
  - search_kb
  - create_ticket

Change history and PR review.
Audit trail via git log.
Rollback still via deploy (git revert → CI → deploy).
No runtime labels/aliases.

Covers: immutable history in Git, audit trail (git log), GitOps workflow, structured schema (if the file holds all components). Immutable runtime artifacts only appear when you explicitly build and publish versioned artifacts.

Level 2 — Config store + labels

Prompts live in a key-value store (Redis, Postgres, DynamoDB, internal config service) with label support.

GET /v1/prompts/support_agent?label=production
→ { version_id: "v2-abc123", system_prompt: "...", tools: [...] }

GET /v1/prompts/support_agent?label=canary
→ { version_id: "v3-def456", system_prompt: "...", tools: [...] }

Runtime routing by alias.
Changing the production version without deploy (reassign label).
In-memory cache on the client + background refresh.
No built-in eval gating.

Covers: immutability, labels, low-latency fetch, rollback, audit trail (if you keep it), GitOps/API.

Level 3 — Dedicated prompt management platform

A dedicated platform: UI for version management, diffs between versions, built-in tracing, and observability integrations.

Examples: Langfuse, Braintrust, MLflow Prompt Registry, PromptLayer, LangSmith.

UI for comparing versions, promoting, rolling back.
Observability integration (trace linkage).
A/B testing and canary rollouts.
Non-engineers (product, domain experts) can edit prompts.

Covers: all 9 requirements to varying degrees (platform-dependent).

Level 4 — Full prompt ops

Single pipeline: create → eval → offline comparison → canary rollout → monitoring → auto-rollback.

Prompt management is part of CI/CD and the eval pipeline.
Evaluation gating built into the promotion process.
Automatic alerts when metrics degrade for a given prompt_version.
A prompt doesn't reach production until it passes the golden set and regression tests.

Covers: all 9 requirements plus automated eval.

Tool Overview

Not a feature list — a mapping onto the 9 requirements. The focus is on infrastructure needs, not marketing features.

Langfuse

What it is: LLM observability + prompt management platform, open-source / open-core. After the ClickHouse merger, the project kept an open core.

Strengths:

Versioning with labels (production, staging, custom).
Client-side cache — prompt is fetched once, then served from memory. No extra latency on requests.
Trace linkage: prompt_version_id attached to every trace.
Self-hosted option (Docker) — important for compliance and data-sensitive systems.
Open-source/open-core: most core features are open; some capabilities depend on the commercial plan.

Weaknesses:

UI for non-engineers is less polished than more product-centric platforms.
Eval gating has to be built separately (via integration with eval frameworks).

Requirements: immutability ✓, labels ✓, eval gating ~, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ~, GitOps/API ✓.

MLflow Prompt Registry

What it is: Part of the MLflow GenAI ecosystem. Git-inspired versioning for prompts.

Strengths:

Immutable versions + aliasing (Git-inspired).
Lineage tracking — link prompts to model runs and eval results.
Natural fit for teams already on MLflow/Databricks.
Template support with variables ({{variable}}), conversion to LangChain/LlamaIndex formats.

Weaknesses:

Tightly coupled to the MLflow ecosystem. If you're not on Databricks/MLflow, expect integration overhead.
Not a standalone observability platform.

Requirements: immutability ✓, labels ✓ (aliases), eval gating ✓ (via MLflow evaluate), low-latency ~, audit trail ✓, trace linkage ~ (via MLflow tracking), rollback ✓, schema ✓, GitOps ~ (custom scripts).

Braintrust

What it is: AI observability platform with prompt management, eval, and production monitoring.

Strengths:

Environments: development → staging → production with quality gates.
Bidirectional sync between code (SDK) and UI (playground) — engineers and product work in parallel.
GitHub Actions integration: eval in CI, blocking deployments, PR comments.
Prompt playground for testing on real data.

Weaknesses:

SaaS-first: deployment and data-plane options depend on enterprise setup and contracts.
Platform lock-in and migration cost if you switch vendors.

Requirements: immutability ✓, labels ✓ (environments), eval gating ✓, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ✓, GitOps ✓.

PromptLayer

What it is: Lightweight tool for logging and versioning LLM calls.

Strengths:

Easiest integration (< 30 minutes, a few lines of code).
Prompt registry: prompts stored outside code, deployed via API.
Release labels and dynamic labels for runtime routing.
Basic eval and version comparison.
Low barrier to entry; good for getting started.

Weaknesses:

Less depth on observability and governance than full-stack LLMOps platforms.
Teams with growing complexity will outgrow it quickly.

Requirements: immutability ✓, labels ✓, eval gating ~, low-latency ~, audit trail ✓, trace linkage ~, rollback ✓, schema ~, GitOps ~.

LangSmith

What it is: LangChain platform for tracing, eval, and prompt management.

Strengths:

Deep integration with LangChain/LangGraph.
Hub for sharing and versioning prompts.
Evaluation + dataset management.

Weaknesses:

Tied to the LangChain ecosystem (though a standalone SDK and API are available).
Commercial product: deployment modes and enterprise features depend on plan and contract.

Requirements: immutability ✓, labels ~, eval gating ✓, low-latency ~, audit trail ✓, trace linkage ✓, rollback ~, schema ✓, GitOps ~.

Summary table

Requirement	Langfuse	MLflow	Braintrust	PromptLayer	LangSmith
Immutability	✓	✓	✓	✓	✓
Labels/Aliases	✓	✓	✓	✓	~
Eval Gating	~	✓	✓	~	✓
Low-latency Fetch	✓	~	✓	~	~
Audit Trail	✓	✓	✓	✓	✓
Trace Linkage	✓	~	✓	~	✓
Rollback	✓	✓	✓	✓	~
Structured Schema	~	✓	✓	~	✓
GitOps/API	✓	~	✓	~	~
Open Source	~	✓	✗	✗	✗
Self-hosted	✓	✓	~	✗	✗

✓ = full support, ~ = partial or needs extra setup, ✗ = no.

Table reflects public docs and typical production scenarios at the time of writing. For a real choice, always check current limits for plans, licensing, and deployment mode.

Architectural Patterns

Pattern 1: Git-native

prompts/
  support_agent/
    v1.yaml
    v2.yaml
  code_review/
    v1.yaml
  registry.yaml     ← index: which label points to which version

CI builds prompts into an artifact (JSON bundle, SQLite, Redis snapshot). The service loads the artifact at startup.

Pros	Cons
Familiar workflow (PR, review, CI)	Rollback = new deploy
Full audit trail	Non-engineers can't edit
No runtime dependencies	No runtime labels
No extra cost	Eval gating built from scratch

Best for: teams of 1–5 engineers, early stage, few prompts.

Pattern 2: Config service (internal)

Your own service with REST/gRPC API:

GET /v1/prompts/{name}?label=production
POST /v1/prompts/{name}/versions   ← create version
PUT /v1/prompts/{name}/labels      ← reassign label

Storage: Postgres / DynamoDB. Clients: SDK with in-memory cache + background polling (TTL 30–60 sec).

Pros	Cons
Full control	Build and maintain it yourself
Runtime labels + rollback	Another service in the stack
Low-latency (your cache)	You build the UI
No vendor lock-in	Eval gating is a separate concern

Consistency note: with background polling and TTL 30–60 sec, after reassigning a label different instances can run on different prompt versions for up to a minute. For most LLM use cases eventual consistency is fine. For safety-critical systems you need a push mechanism (webhook/event) or a shorter TTL.

Best for: mid-size and larger teams that care about control and have capacity for infra.

Pattern 3: Managed platform (SaaS)

Langfuse Cloud / Braintrust / LangSmith — prompts managed via the platform's UI and SDK.

Pros	Cons
Fast to start	Runtime dependency on SaaS
UI for non-engineers	Vendor lock-in (as with Humanloop, which was discontinued)
Eval, tracing, A/B out of the box	Cost at scale
No infra to build	Data residency constraints

Critical question: what happens when the SaaS is down? The client SDK must have a fallback (last known good version from cache). Without it, SaaS downtime = your service downtime.

Best for: teams that need a quick start and non-engineer access, and accept the risk.

Pattern 4: Hybrid (Git + platform)

Git is source of truth. CI syncs prompts into the platform (Langfuse, Braintrust). The platform handles runtime delivery and observability.

Developer → Git PR → Review → Merge → CI syncs to Platform → Runtime fetch via SDK

Pros	Cons
Code review + runtime flexibility	Sync complexity
Audit trail in Git	Drift between Git and platform possible
Non-engineers see result in UI	Two sources of truth when things go wrong
Runtime labels + rollback	Extra CI plumbing

Failure modes to plan for:

Drift: CI sync fails, Git moves ahead, platform serves an old version. Engineer thinks the prompt is updated — service is still on the previous one. Mitigation: check prompt_hash on the platform side + alert on mismatch.
Ownership: if non-engineers can edit prompts directly in the platform UI, bypassing Git, Git is no longer the single source of truth. Either block direct edits in the UI or implement reverse sync (platform → Git), which is much more complex.
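
The drift check can be sketched as a hash comparison. This assumes you can list the prompt hashes the platform is currently serving; how you obtain them depends on the platform, and `find_drift` is a hypothetical helper:

```python
import hashlib


def prompt_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drift(git_prompts: dict[str, str], platform_hashes: dict[str, str]) -> list[str]:
    """Return names of prompts whose platform hash differs from Git
    (or that are missing on the platform entirely)."""
    return sorted(
        name
        for name, text in git_prompts.items()
        if platform_hashes.get(name) != prompt_hash(text)
    )


# Hypothetical state: Git has moved ahead for support_agent, CI sync failed.
git_prompts = {
    "support_agent": "You are a support agent. Always cite the knowledge base.",
    "code_review": "You are a strict code reviewer.",
}
platform_hashes = {
    "support_agent": prompt_hash("You are a support agent."),
    "code_review": prompt_hash("You are a strict code reviewer."),
}

drifted = find_drift(git_prompts, platform_hashes)
```

Run on a schedule, a non-empty `drifted` list is the signal to alert on.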

Best for: teams that want Git review plus runtime flexibility. Most mature pattern, and the hardest to operate.

Pattern 5: Feature flags

Prompt versions are managed as feature flags in your existing system.

Pros	Cons
Granular rollout (5% → 50% → 100%)	Flag systems aren't built for long text
Instant rollback (toggle off)	With dozens of prompts, flag sprawl
A/B testing out of the box	No diffs between prompt versions
Familiar if you already use it	Prompts still need to live somewhere

Best for: teams that already have feature-flag infra and need granular rollout. Works well as a complement to other patterns (e.g. Git-native + flags for rollout), not as the only mechanism.
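
If your flag system doesn't do percentage rollouts for you, the core of it is stable hashing. A sketch under my own assumptions (the salt string and function names are illustrative):

```python
import hashlib


def bucket(user_id: str, salt: str = "support_agent_rollout") -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_label(user_id: str, canary_percent: int) -> str:
    """Route `canary_percent` of users to the canary label, the rest to production."""
    return "canary" if bucket(user_id) < canary_percent else "production"
```

Since the bucket depends only on the salt and the user id, a user who saw the canary at 5% keeps seeing it at 50%, and setting the percentage to 0 is the rollback.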

Runtime delivery: 3 questions for any pattern

Whatever pattern you pick, answer these before production:

How does the prompt reach runtime? Polling with TTL, push via webhook/event, or baked in at deploy? This determines how fast changes propagate.
What happens if the prompt source is unavailable? Fallback from local cache (stale-while-revalidate) or hard failure? Without fallback you add a single point of failure on the hot path.
How quickly do all instances see the new version? Eventual consistency (seconds–minutes) or strong? For most LLM use cases eventual is enough, but you must know your consistency window.

Each of these is an engineering concern in its own right, inherited from distributed config propagation. A deeper treatment (caching patterns, failure modes, examples) is a separate post.

How to Choose: Decision Framework

Don't choose by feature list. Choose by four questions:

1. Who edits prompts?

Only engineers → Git-native or config service.
Product/domain experts too → Platform or hybrid.

2. How fast must rollback be?

Seconds → you need runtime labels (Level 2+).
Minutes via CI is acceptable → Git-native is enough.

3. How many prompts and how often do they change?

5 prompts, change once a month → Git-native.
50+ prompts, change weekly → Platform or hybrid.

4. Data residency and compliance?

Data must stay in region / on-premise → self-hosted (Langfuse, MLflow) or your own config service.
No constraints → SaaS is fine.

For enterprise teams, (4) is often the first filter and rules out half the options immediately.

Insight

Prompt management is a new infrastructure layer. It's closest to config management and feature flags, but with a twist: prompt semantics are opaque and the impact of changes is probabilistic.

You don't need to build Level 4 right away. See where you are and pick one next step:

At Level 0? → Move prompts to files and introduce prompt_version_id.
At Level 1? → Add runtime labels and rollback without deploy.
At Level 2? → Add eval gating and trace linkage.
At Level 3? → Automate the promotion pipeline.

If you already run prompt management in production — what approach did you choose and what pitfalls did you hit?

Design Recipe: Observability Pyramid for LLM Infrastructure

astronaut — Thu, 05 Feb 2026 08:44:43 +0000

In classic backend systems, we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems, we deal with "soft failures" — the system runs fast and without log errors, but outputs hallucinations or irrelevant context.

As an engineer with a highload and distributed systems background, I like to view the system as a conveyor with measurable efficiency at each stage. For this, I use the Observability Pyramid, where each layer protects the next.

1. System Layer: Telemetry and SRE Basics

Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.

Key Metrics:

TTFT (Time to First Token): the main metric for UX
TPOT (Time Per Output Token): generation stability
Tokens/Sec & Input/Output Ratio: critical for capacity planning and understanding KV-cache load

Engineering Approach: Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).

For details on profiling the engine and finding bottlenecks — see my article:

LLM Engine Telemetry: How to profile models

2. Retrieval Layer: Data Hygiene (RAG Triad)

Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:

A. Context Precision

How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.

Tools: RAGAS, DeepEval.

B. Context Recall

Does the retrieved set contain the factual answer?

Practice: You need a "golden standard" — a labeled dataset. I use Meta CRAG because it simulates real-world chaos and dynamically changing data.

See my guide on local CRAG evaluation here.

C. Faithfulness

Is the answer derived from the context or hallucinated?

A judge model checks every claim in the response against the provided source.

3. Semantic Layer: LLM-as-a-Judge at Scale

This level checks logic. The main challenge is balancing evaluation quality with cost/speed.

Engineering Best Practices:

CI/CD Gating: Full run on a reference dataset. If Faithfulness drops below 0.8 — block deployment (tune the threshold for your domain).
Production Sampling: In highload systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%). Additionally, implement judge caching (GPTCache, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions: the same prompt+context may come up many times, but you pay for the evaluation only once.
Specialized Judges: Instead of "naked" small models (which often struggle with logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, are reported by their authors to approach GPT-4 in judging quality, and can be hosted locally.
Out-of-band Eval: In production, evaluation always runs asynchronously to avoid increasing main request latency.
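
Sampling and judge caching can be sketched together. This is a simplified illustration, not a replacement for the caching libraries named above: the judge is any callable you supply, and I include the answer in the cache key on the assumption that the verdict depends on it:

```python
import hashlib
from typing import Callable


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


class CachedJudge:
    """Wrap a judge so identical (prompt, context, answer) triples are scored once."""

    def __init__(self, judge: Callable[[str, str, str], float]) -> None:
        self._judge = judge
        self._cache: dict[str, float] = {}
        self.calls = 0  # how many times the underlying judge actually ran

    def score(self, prompt: str, context: str, answer: str) -> float:
        joined = "\x1f".join((prompt, context, answer))
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge(prompt, context, answer)
        return self._cache[key]
```

Hash-based sampling keeps the decision reproducible per request id, which helps when you later want to re-run the judge on exactly the same slice of traffic.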

Diagnostic Map: What to Fix?

Metric	If Dropped, Problem In:	Action Plan
Context Recall	Embeddings / Indexing	Switch embedding model, implement Hybrid Search (Vector + Keyword)
Context Precision	Chunking / Noise	Add Reranker (Cross-Encoder), revise Chunking Strategy
Faithfulness	Temperature / Context	Lower Temperature, strengthen system prompt, check chunk integrity
TTFT (Latency)	Hardware / Load	Check Cache Hit Rate, enable quantization or PagedAttention

Implementation Plan (Checklist)

Instrument (Day 0): Set up export of metrics and traces (vLLM + OpenTelemetry).
Golden Set: Collect 50–100 critical cases. Use Meta CRAG structure as reference (details in my article Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG).
Automate: Integrate DeepEval/RAGAS into GitHub Actions.
Sampling & Feedback: Set up log and user feedback collection (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith.

Conclusion

For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so its behavior becomes predictable — like the trajectory of a rocket on a verified orbit.

Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG

astronaut — Tue, 30 Dec 2025 15:31:54 +0000

Want to skip the theory and launch a local RAG benchmark in Docker right now? Check out the repo.

1. Introduction: Breaking the Infrastructure Barrier

In my previous article, we prepped our "shuttle" for launch by containerizing the Meta CRAG infrastructure. It gave us a standardized environment, but we were still tethered to one expensive "ground control" dependency.

The original benchmark baselines are resource-hungry. They expect:

Paid OpenAI API for final judging.
GPU (CUDA) clusters to run inference via vLLM.

Developing a RAG system under these constraints feels like ordering expensive parts by mail when you already have the tools in your garage. You spend your budget on "shipping" (API tokens) and wait for external servers to reply, even though you have plenty of local horsepower sitting idle.

What if you could launch the rocket from your own spaceport? Right on your laptop, with zero cost per request and total autonomy. We’re swapping external APIs for local inference using Ollama and Ray.

2. Architecture: The OpenAI-Compatible Interface

The biggest headache with academic benchmarks is their rigid stack. Meta CRAG expects either vLLM or OpenAI by default. Rewriting the core evaluation logic is a recipe for bugs and broken metrics.

Instead, we’ll take the engineering shortcut:

We implemented a RAGOpenAICompatibleModel class. It uses the standard openai library but "hijacks" the data flow via the base_url variable. This lets us point the benchmark at a local Ollama instance without changing any of the core logic.

Why this matters: This gives us hot-swappable brains. Want to test Llama 3? Just change the model key. Want to compare it against Qwen or Gemma? A quick export in your terminal and a few lines in the configuration file are all it takes.
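
To show what "OpenAI-compatible" means on the wire, here is a dependency-free sketch of the same idea. It is not the RAGOpenAICompatibleModel class from the repo: it uses urllib instead of the openai library, the environment variable names `CRAG_BASE_URL` and `CRAG_MODEL` are hypothetical, and the defaults assume a local Ollama instance serving its OpenAI-compatible API:

```python
import json
import os
import urllib.request

# Hypothetical variable names; defaults assume Ollama on its standard port.
BASE_URL = os.environ.get("CRAG_BASE_URL", "http://localhost:11434/v1")
MODEL = os.environ.get("CRAG_MODEL", "llama3")


def build_chat_request(
    question: str, base_url: str = BASE_URL, model: str = MODEL
) -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> str:
    """Send the request and return the model's answer (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(question), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swapping models is then a matter of changing `model`, and swapping backends a matter of changing `base_url`; the evaluation code that calls `ask` stays untouched.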

3. Tuning the "Onboard Systems": Ray and HTML Cleanup

In the cloud, you pay for convenience: you can feed raw HTML to an LLM and hope it figures it out. In a local spaceport, resources are finite. Every extra token is dead weight (ballast).

🛠 Parallelism via Ray

Processing hundreds of HTML pages for every question is heavy. We use Ray to distribute the load: while the GPU is busy generating an answer, the idle CPU cores are "scrubbing" data for the next batch in the background.

🧹 The "Space Junk" Filter

Using BeautifulSoup to strip tags is a survival requirement. Local models with 8k context windows quickly "suffocate" under endless <div> and <script> tags.

We clean the HTML.
Split text into sentences.
Cap snippets at 1000 characters.

Result: We fit significantly more useful info into the context, boosting accuracy without needing massive model weights.
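
The three cleanup steps can be sketched with the standard library alone. The repo uses BeautifulSoup; this stand-in uses `html.parser` so it runs without extra dependencies, and keeping whole sentences up to the cap is my assumption about how the cap is applied:

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_snippet(html: str, max_chars: int = 1000) -> str:
    """Strip tags, split into sentences, keep whole sentences up to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    snippet = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        candidate = f"{snippet} {sentence}".strip()
        if len(candidate) > max_chars:
            break
        snippet = candidate
    # If even the first sentence is too long, fall back to a hard cut.
    return snippet or text[:max_chars]


page = "<div><script>var x = 1;</script><p>Ray is fast. It scales well!</p></div>"
snippet = html_to_snippet(page)
```

Each page is independent, so a function like this is also what you would hand to Ray as a remote task.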

4. Field Testing: Real Metrics

We picked three popular models to see how they handle a "combat" RAG scenario.

Model	Accuracy (Correct)	Hallucination	Missing (I don't know)	Final Score
Gemma-2-9B	25%	20%	55%	0.05
Llama-3-8B	15%	30%	55%	-0.15
Qwen-2.5-7B	0%	100%	0%	-1.00
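
The Final Score column follows from the other three. A minimal sketch, assuming the CRAG-style convention where a correct answer counts +1, a missing answer 0, and a hallucination -1, averaged over all questions:

```python
def crag_score(correct: float, hallucination: float, missing: float) -> float:
    """Score = share of correct answers minus share of hallucinations.
    Missing answers ("I don't know") contribute zero."""
    assert abs(correct + hallucination + missing - 1.0) < 1e-9, "shares must sum to 1"
    return round(correct - hallucination, 2)


# Shares taken from the table above.
results = {
    "Gemma-2-9B": crag_score(0.25, 0.20, 0.55),
    "Llama-3-8B": crag_score(0.15, 0.30, 0.55),
    "Qwen-2.5-7B": crag_score(0.00, 1.00, 0.00),
}
```

This makes the incentive explicit: saying "I don't know" costs nothing, while a wrong answer cancels out a right one.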

Post-Mortem: Why did Qwen crash? 💥

Qwen’s results look catastrophic, but this is a huge engineering lesson. It didn't fail because it was "stupid"—it failed because it violated the protocol.

Typical Qwen output:

" <think> Okay, let's see. The user is asking about the producers... I need to check the references..."

The model started "thinking out loud" via the <think> tag, ignoring the instruction to answer succinctly. In CRAG, any text that isn't the direct answer is flagged as a Hallucination.

The Takeaway: Models with forced Chain-of-Thought (CoT) need heavy post-processing (stripping tags) or meticulously precise prompting to keep them from turning a short answer into a philosophical essay.
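
The post-processing half of that fix is a small regex. A sketch, assuming the reasoning is wrapped in a <think> tag as in the output above; the sample answer is made up for illustration:

```python
import re

# Match a <think> block up to its closing tag, or to the end of the text
# if the model ran out of tokens before closing it.
_THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


raw = "<think> Okay, let's see. I need to check the references...</think>\nJane Doe"
answer = strip_reasoning(raw)
```

An unclosed block leaves an empty string, which is better scored as "missing" than passed to the judge as a rambling non-answer.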

5. Try it Yourself: Code on GitHub

Stop reading and start launching. I’ve prepped a repository with:

Docker configs for easy deployment.
Ollama adapters for local inference.
Ray scripts for high-speed HTML cleaning.

🚀 Project Repo: astronaut27/CRAG_with_Docker

6. Conclusion: Autonomy Achieved

We’ve proven that you don’t need a corporate budget to do serious RAG engineering.

Our Results:

Reproducibility: Run the benchmark with a single command.
Cost: Exactly $0 per iteration.
Security: Your data never leaves your "space station."

Local evaluation is about building an honest development process where every change is backed by numbers, not just gut feeling.

7. Next Mission: RAGas vs. CRAG

Our spaceport is fully operational. But how does our local ground truth compare to popular metrics like RAGas? In the next post, we’ll pit "RAGas" against the hard facts of CRAG.

See you in orbit! 👨‍🚀✨

🧑‍🚀 LLM Engine Telemetry: How to Profile Models and See Where Performance is Lost

astronaut — Thu, 27 Nov 2025 14:32:20 +0000

“Any LLM is an engine. It can be massive or compact, but if you don't look at the telemetry, you'll never understand where you're burning energy inefficiently.”
— Astronaut Engineer, Logbook #4

🌌 Introduction: Why LLMs Need Profiling

When engineers discuss LLM performance, three key metrics come up most often:

Tokenization latency
TTFT (Time To First Token)
tokens/sec

But it's easier to think of it this way:

An LLM is an engine, and the profiler is its dashboard. The rest is visible through the readings—and we're about to break them down.

Just like in real machinery:

Startup is always more expensive than the cruising phase,
Different engine components consume energy differently,
The true picture is only visible through telemetry.

👨‍🚀 Caption: "Before launch—rely only on the instruments"

🚀 Mission Plan

We are launching the GPT-2 model in three scenarios:

Short prompt
Medium prompt
Long prompt

Each test goes through three key phases:

Tokenization — Preparing the input.
Prefill - The initial prompt processing that establishes TTFT.
Decode / Steady-State — The cruising phase of generation.

We measure:

Tokenization time,
TTFT,
Generation speed (ms/token, tokens/sec),
Memory usage (peakRSS),
The most expensive low-level operations.

All data is collected via torch.profiler and displayed in TensorBoard.

🧩 LLM Operation Phases: What Happens Under the Hood

1. Tokenization - Input Preparation

The text is converted into tokens using the chosen tokenizer. On short texts, measuring this phase can be highly susceptible to system noise (jitter), which is why tokenization is almost always measured separately.

2. Prefill - Prompt Processing and Model State Establishment

In this phase, the model:

Runs the entire prompt through all layers once,
Computes attention for the entire input sequence,
Populates the KV-Cache for subsequent generation,
Allocates temporary tensors and buffers.

Formally:

TTFT = Prefill time + first Decode step time

TTFT is the time required to complete the prompt processing and generate the first token. Prefill is the heaviest single step, since the entire prompt goes through the model in one pass; per prompt token, though, it is cheaper than decode, because the prompt tokens are processed in parallel.

3. Decode — Generating New Tokens

After prefill, the model transitions to sequential generation. Each new token requires:

1 forward pass → 1 token

Decode characteristics:

Operations are repeated with the same structure,
The KV-Cache prevents re-computing attention for the entire prompt,
Metrics become stable: ms/token, tokens/sec.

📡 Experimental Setup

The mission_profiler.py script:

Performs three launches (short / medium / long prompt)
Executes two generations for each: prefill (→ TTFT) and full generation (→ Steady),
Saves traces to TensorBoard,
Outputs a summary metrics table.

⚠️ We do not perform any warmup, so the first run (short_prompt) may be slower.

🛠️ Launch Telemetry Yourself!

You can replicate this "flight" and study the profiler logs on your own machine.

All the code, settings, and launch instructions are available in the mission repository: GitHub: LLM Profiler Mission - Engine Telemetry

📈 Mission Results

================= MISSION SUMMARY =================
tag            prompt_len   tokenize   TTFT(ms)   steady(ms)   actual_tok   ms/token   tok/s   peakRSS(MB)
--------------------------------------------------------------------------------------------------------
short_prompt           19        6.6      920.9        823.5         32      25.73     38.9       2541.2
medium_prompt          56        1.4       43.2       1047.4         32      32.73     30.6       2866.3
long_prompt           116        1.7       32.5        894.0         32      27.94     35.8       2886.8

How to Read These Numbers:

🔹 Tokenize. We can't reliably compare tokenizers using this data; a dedicated, large-scale benchmark is needed. Short strings are heavily affected by system noise, so tokenization performance is evaluated separately.

🔹 TTFT (Time-To-First-Token). The most interesting observation:

Short prompt → 921 ms
Medium → 43 ms
Long → 32 ms

Why the difference?

The first run (short_prompt) bore the full impact of the cold start: it includes CUDA/MPS warmup, allocations, and JIT compilation of kernels.
TTFT is sensitive to the very first execution.
In subsequent runs (medium, long prompt), after warmup, TTFT stabilizes, and the difference between the medium and long prompt becomes minimal.

Conclusion:

TTFT should be measured either after a dedicated warmup or averaged over several runs.
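
A generic harness for that rule might look like the sketch below. It is not part of mission_profiler.py; `measure_ms` is a hypothetical helper, and the lambda is a stand-in for a generate() call limited to a single new token:

```python
import statistics
import time
from typing import Callable


def measure_ms(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> dict[str, float]:
    """Time `fn` after `warmup` discarded calls; report median and spread in ms."""
    for _ in range(warmup):
        fn()  # cold-start costs (allocations, kernel compilation) land here
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload so the sketch runs anywhere.
stats = measure_ms(lambda: sum(range(10_000)))
```

On a GPU you would also need to synchronize the device before reading the clock, otherwise you time the kernel launch rather than the work.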

🔹 Steady-State (ms/token). The cost per token remains relatively stable:

~26–33 ms/token
~30–39 tok/s

This is the engine's speed in cruising mode. As expected, the per-token latency shows almost no dependence on prompt length.
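
For reference, the ms/token and tok/s columns are plain arithmetic over the steady-state time and the number of generated tokens. Recomputing them from the summary table:

```python
# (tag, steady-state time in ms, generated tokens) from the summary table
runs = [
    ("short_prompt", 823.5, 32),
    ("medium_prompt", 1047.4, 32),
    ("long_prompt", 894.0, 32),
]

derived = {}
for tag, steady_ms, tokens in runs:
    ms_per_token = steady_ms / tokens
    tokens_per_sec = 1000.0 / ms_per_token
    derived[tag] = (round(ms_per_token, 2), round(tokens_per_sec, 1))
```

The results match the table: 25.73, 32.73, and 27.94 ms/token, or 38.9, 30.6, and 35.8 tok/s.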

🔹 Peak RSS. 2541 → 2866 → 2887 MB. Memory usage jumps noticeably between the first and the second run, then barely grows with the longer prompt. Since no warmup is performed, much of that jump is likely one-time allocations rather than the KV-Cache itself. This is consistent with the bulk of the memory going to the model and runtime, while the KV-cache consumes only a small fraction. Its size does, however, grow linearly with sequence length.
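
A back-of-the-envelope estimate shows how small the cache is at these lengths. The article only says "GPT-2", so the configuration here is an assumption: the smallest GPT-2 (12 layers, hidden size 768) with fp32 values:

```python
def kv_cache_bytes(
    tokens: int, n_layers: int = 12, d_model: int = 768, bytes_per_value: int = 4
) -> int:
    """Keys and values for every layer: 2 * layers * d_model values per token."""
    return 2 * n_layers * d_model * bytes_per_value * tokens


# Prompt tokens plus the 32 generated tokens, per the summary table.
estimates = {
    tag: round(kv_cache_bytes(prompt_len + 32) / 2**20, 1)
    for tag, prompt_len in [("short", 19), ("medium", 56), ("long", 116)]
}
```

Under these assumptions the cache is about 3.6, 6.2, and 10.4 MiB for the three runs: linear in sequence length, and tiny next to a peak RSS of roughly 2.5 to 2.9 GB.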

📊 Who is Really Consuming Resources:

In the profiler, all operations fall into two camps:

🔥 1. Main Thrust (Useful Work)

On the GPU, these are:

addmm
mm / matmul
scaled_dot_product_attention

They consume the majority of the CUDA time. These are the matrix computation kernels—the operations that truly propel the LLM engine forward.

⚙️ 2. Control Expenses (Overhead)

Utility operations:

_local_scalar_dense
item
cat
copy_
to
Mask checks (eq, all)

These are responsible for:

Data movement,
Synchronization between the CPU/host and the GPU/device,
Scalar extraction,
Utility logic for generation.

This is not the engine's thrust, but the cost of flight control.

🧭 The Big Picture

Core computational kernels (attention, matmuls, addmm): these determine whether the model is fast or slow.
Overhead operations: these are non-productive costs that can be reduced through optimizations such as minimizing synchronizations, using use_cache=True, and reducing the number of small tensor operations.
On CUDA, matrix kernels dominate (as they should), but on MPS, utility operations often dominate.

The real profile of LLM performance is hidden in the balance between these two groups.

⚙️ Why Profiling LLMs is Essential

The profiler turns:

❌ "The model is running slow" into ✔ "Here is the specific operation that's consuming energy."

It helps reveal:

Where the bottleneck is located,
The cost of prefill,
The cost of each token,
How memory behaves,
The overhead created by HuggingFace generate().

🏁 Conclusion

The LLM is an engine. Sometimes powerful, sometimes compact, but always complex and sensitive to overloads. And until you open the profiler:

You won't see the expensive matrix operations,
You won't see the synchronization overhead,
You won't know the cost of prefill,
You won't see the growth of the KV-Cache.

The profiler is our flight recorder. It shows:

Where the engine is pulling,
Where it's stalling,
And where the energy is going.

No one launches a rocket without a flight recorder.

🧑‍🚀 Choosing the Right Engine to Launch Your LLM (LM Studio, Ollama, and vLLM)

astronaut — Thu, 06 Nov 2025 17:00:00 +0000

A Practical Field Guide for Engineers: LM Studio, Ollama, and vLLM

“When you’re building your first LLM ship, the hardest part isn’t takeoff — it’s choosing the right engine.”
— Engineer-Astronaut, Mission Log №3

In the LLM universe, everything moves at lightspeed.
Sooner or later, every engineer faces the same question:

how do you run a local model — fast, stable, and reliably?

LM Studio — a local capsule with a friendly interface.
Ollama — a maneuverable shuttle for edge missions.
vLLM — an industrial reactor for API workloads and GPU clusters.

But which one is right for your mission?
This article isn’t just another benchmark — it’s a navigation map, built by an engineer who has wrestled with GPU crashes, dependency hell, and Dockerization pains.

🪐 Personal Log.

“When I first tried LM Studio on my laptop, it was beautiful —
until I needed to automate the launch.
The GUI couldn’t be containerized, and the headless mode required extra tinkering.
Then I switched to Ollama, and only with vLLM did I finally understand what a real production-grade workload feels like.”

⚙️ 1. LM Studio — A Piloted Capsule for Local Missions

What it is:

LM Studio is a desktop application with a local OpenAI-compatible API.
It lets you work offline and run models directly on your laptop.

📚 Documentation: lmstudio
💻 Platforms: macOS, Windows, Linux (AppImage).

How to launch:

Download and install from lmstudio.ai.

Caveats:

GUI-only app — limited containerization;
Experimental headless API;
May overload CPU/GPU during long sessions.

“LM Studio is a flight simulator — perfect for training,
but it won’t take you into orbit.”

🚀 2. Ollama — A Maneuverable Shuttle for Edge Missions

What it is:

An open-source CLI/desktop runtime for models like Mistral, Gemma, Phi-3, and Llama 3.
It exposes a REST API on port 11434 and integrates easily into Docker.

📚 Documentation: ollama.ai
💻 Platforms: macOS, Linux, Windows.

How to launch:

brew install ollama
ollama run llama3

Or via Docker:

docker run -d -p 11434:11434 ollama/ollama

When to use:

Local REST APIs and edge inference;
CI/CD and microservices;
Quick launches without complex dependencies.
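Once the server is up, the REST API can be called from any language. Here is a minimal Python sketch using only the standard library. It assumes the default port 11434 from the Docker command above and a model named llama3 that has already been pulled; the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3", url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", timeout=120):
    """Send the prompt to a running Ollama server and return the text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling generate("Why is the sky blue?") returns the completion as a string, provided the server is running and the model is available locally.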

“Ollama is a light shuttle —
it can launch from any planet, but it won’t carry heavy cargo.”

☀️ 3. vLLM — A Reactor for Production-Grade Flights

What it is:

vLLM is a high-performance runtime for LLM inference,
optimized for GPUs, fully OpenAI-API compatible, and designed for scaling.

📚 Documentation: vllm

💻 Platforms: Linux and major cloud providers (AWS, GCP, Azure).

How to launch:

docker run -d \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9

When to use:

Product APIs and AI platforms;
Multi-user environments;
High-speed, CUDA-optimized inference.

Caveats:

Requires NVIDIA GPU (CUDA ≥ 12.x);
Not compatible with macOS (no GPU backend);
Needs DevOps experience — monitoring, logging, version sync.
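Because the server speaks the OpenAI API, a client needs nothing vLLM-specific. The sketch below builds a chat completion request with the standard library only. It assumes the container above is listening on port 8000; the helper names are invented for illustration, and the model argument must match whatever was passed to --model.

```python
import json
import urllib.request


def build_chat_request(model, user_message,
                       base_url="http://localhost:8000/v1", max_tokens=128):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat completions response."""
    return response_body["choices"][0]["message"]["content"]


def chat(model, user_message, timeout=120, **kwargs):
    """Send one user message to the server and return the reply text."""
    request = build_chat_request(model, user_message, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.loads(response.read()))
```

The same client should work against any other OpenAI-compatible endpoint by changing base_url, which keeps the application code independent of the engine underneath.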

“vLLM is a deep-space reactor — built for interstellar journeys.
But if you try to fire it up in your garage, it simply won’t ignite.”

🪐 The Mission Map — Which Engine to Choose

⚠️ Common pitfalls:

LM Studio → limited containerization;
Ollama → not all models available out of the box, though you can import from Hugging Face;
vLLM → CUDA version mismatch causes kernel errors.

🧩 Mission Debrief

Every engine is built for its own orbit.

LM Studio — for solo flights and quick system checks.
Ollama — for agile edge missions.
vLLM — for long-range, interstellar operations.

“Sometimes an engineer’s mission isn’t to build a new engine —
but to understand which existing one fits the current flight plan.”

🛰️ Previous Missions

🚀 Prepared Meta’s CRAG Benchmark for Launch in Docker

🧑‍🚀 Mission Accomplished: How an Engineer-Astronaut Prepared Meta’s CRAG Benchmark for Launch in Docker

astronaut — Thu, 06 Nov 2025 11:04:46 +0000

Every ML system is like a spacecraft — powerful, intricate, and temperamental.
But without telemetry, you have no idea where it’s headed.

🌌 Introduction

CRAG (Comprehensive RAG Benchmark) from Meta AI is the control panel for Retrieval-Augmented Generation systems.
It measures how well model responses stay grounded in facts, remain robust under noise, and maintain contextual relevance.

As is often the case with research projects, CRAG required engineering adaptation to operate reliably in a modern environment:
incompatible library versions, dependency conflicts, unclear paths, and manual launch steps.

🧰 I wanted to bring CRAG to a state where it could be launched with a single command — no dependency chaos, no manual fixes.
The result is a fully reproducible Dockerized environment, available here:

👉 github.com/astronaut27/CRAG_with_Docker

🚀 What I Improved

In the original build, several issues made CRAG difficult to run:

🔧 Conflicting library versions;
⚙️ No unified, reproducible start-up workflow.

Now, everything comes to life with a single command:

docker-compose up --build

After building, two containers start automatically:

🛰️ mock-api — an emulator for web search and Knowledge Graph APIs;
🚀 crag-app — the main container with the benchmark and built-in baseline models.

🧱 Pre-Launch Preparation: Handling the Mission Artifacts

Before firing up the Docker build, make sure all mission artifacts — the large data and model files — are present locally.

Because CRAG includes files over 100 MB, it uses Git Large File Storage (LFS). Until those files are pulled, the repository holds only small pointer stubs in their place, and your container won’t initialize.

So the first command in your console is essentially fueling the ship with data:

git lfs pull
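A quick pre-flight check can confirm that the pull actually happened. An unfetched LFS file is a small text stub whose first line starts with the Git LFS spec URL, so it can be detected by reading a few bytes. This is a sketch: the function names are made up, and the directory you scan depends on your checkout.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is still an LFS pointer stub."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


def unfetched_lfs_files(root):
    """List files under root that are pointers rather than real data."""
    return sorted(
        str(candidate)
        for candidate in Path(root).rglob("*")
        if candidate.is_file() and is_lfs_pointer(candidate)
    )
```

If unfetched_lfs_files(".") returns anything, run git lfs pull again before starting the Docker build.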

🧩 How It Works

📡 ⚙️ CRAG in Autonomous Mode

mock-api: simulates the external data sources (Web Search, KG API) used by the RAG system.
crag-app: the main container, running the benchmark and the model used for response generation (a dummy model at this stage).
local_evaluation.py: coordinates the pipeline, calls the mock API, and handles metric evaluation.
ChatGPT: serves as the LLM judge that scores generated responses against CRAG’s metrics (the model is set via EVALUATION_MODEL_NAME, gpt-4-0125-preview by default).

🧠 What CRAG Measures: The Telemetry Dashboard

CRAG reports quantitative indicators — a flight log of your system after a test mission:

total: Total number of evaluated examples.
n_correct: Count of responses that are fully supported by retrieved context.
n_hallucination: Number of responses containing unsupported or invented facts.
n_miss: Responses missing key information or empty answers.
accuracy / score: Share of correct responses (n_correct / total).
hallucination: Ratio = n_hallucination / total.
missing: Ratio = n_miss / total.

💡 These metrics are the sensors on your RAG ship’s dashboard.
If any of them start flashing red — it’s time to check the model’s engine.
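The three ratios follow directly from the counters. Here is a small sketch that recomputes them from raw counts, which is handy for sanity-checking a results file. The function name and the sample counts are invented for illustration.

```python
def crag_ratios(total, n_correct, n_hallucination, n_miss):
    """Recompute the ratio metrics from the raw counters."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {
        "accuracy": n_correct / total,
        "hallucination": n_hallucination / total,
        "missing": n_miss / total,
    }


# Invented sample counts (not real benchmark results).
ratios = crag_ratios(total=100, n_correct=25, n_hallucination=50, n_miss=25)
print(ratios)
```

A system that answers half of the questions with unsupported facts, as in this made-up sample, is exactly the "flashing red" case: the hallucination ratio is twice the accuracy.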

🧱 Docker Architecture

version: '3.8'

services:
  # Mock API service for RAG data
  mock-api:
    build:
      context: ../mock_api
      dockerfile: ../deployments/Dockerfile.mock-api
    container_name: crag-mock-api
    ports:
      - "8000:8000"
    volumes:
      - ../mock_api/cragkg:/app/cragkg
    environment:
      - PYTHONPATH=/app
    networks:
      - crag-network
    restart: unless-stopped

  # CRAG application container
  crag-app:
    build:
      context: ..
      dockerfile: deployments/Dockerfile.crag-app
    container_name: crag-app
    depends_on:
      - mock-api
    environment:
      # OpenAI for evaluation (optional)
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      # Mock API connection (Docker service)
      - CRAG_MOCK_API_URL=http://mock-api:8000
      # Evaluation model
      - EVALUATION_MODEL_NAME=${EVALUATION_MODEL_NAME:-gpt-4-0125-preview}
    volumes:
      # Mount large data directories (read-only)
      - ../data:/app/data:ro
      - ../results:/app/results
      - ../example_data:/app/example_data:ro
      # Tokenizer (if needed)
      - ../tokenizer:/app/tokenizer:ro
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - crag-network
    stdin_open: true
    tty: true
    command: ["python", "local_evaluation.py"]

networks:
  crag-network:
    driver: bridge
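One detail worth knowing: depends_on in this form only orders container start-up; it does not wait until mock-api is actually accepting requests. If the evaluation starts too early, a small readiness loop on the client side is one way to handle it. The sketch below is an assumption about how you might do that, not something the repository is stated to contain; the helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if anything answers at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, so it is up
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, retries=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the retries run out."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Inside crag-app this could be used as wait_until_ready(lambda: http_probe(os.environ["CRAG_MOCK_API_URL"])) before the first API call. The sleep parameter exists so the loop can be tested without real waiting.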

🪐 Why This Matters

RAG systems are quickly becoming the core engines of modern LLM-based products.
CRAG allows engineers to evaluate their reliability and factual grounding before shipping to production.

This Docker build transforms Meta AI’s research benchmark into a practical engineering environment:

📦 fully isolated and reproducible;
🧠 runnable locally or in CI pipelines;
🚀 easily extendable with your own models (for example, via LM Studio — coming in the next mission).

🔭 The Next Mission

Right now, CRAG runs on its built-in baselines — a test flight before mounting the real engine.
The next step is integrating the LM Studio API and evaluating a live LLM within the same container setup.
That will be Mission II 🚀

🧭 Mission Summary

“Sometimes engineering magic isn’t about building a brand-new ship,
but about preparing an existing one for its next flight.”

CRAG now launches reliably, telemetry is stable, and the mission is a success.

Next up: integrating LM Studio and real models.
For now, the ship holds a steady course. 🪐

🔗 Mission Repository

📦 github.com/astronaut27/CRAG_with_Docker

📜 License
CRAG was developed by Meta AI / Facebook Research and is distributed under the MIT License.
All modifications in CRAG_with_Docker preserve the original copyright notices.
