<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: astronaut</title>
    <description>The latest articles on DEV Community by astronaut (@astronaut27).</description>
    <link>https://dev.to/astronaut27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594224%2F48ba6b28-a495-4630-8112-62f28ff8b5dc.png</url>
      <title>DEV Community: astronaut</title>
      <link>https://dev.to/astronaut27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/astronaut27"/>
    <language>en</language>
    <item>
      <title>🧑‍🚀 Claude Code Skills Catalog: Observability, Stale Detection, and OpenTelemetry in Practice</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 11 Jun 2026 06:51:45 +0000</pubDate>
      <link>https://dev.to/astronaut27/claude-code-skills-catalog-observability-stale-detection-and-opentelemetry-in-practice-3b9i</link>
      <guid>https://dev.to/astronaut27/claude-code-skills-catalog-observability-stale-detection-and-opentelemetry-in-practice-3b9i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;You spent an evening writing a custom skill, shipped it to the team — and went blind. Does it fire at all? Is anyone using it? How many tokens does it burn, and is it worth the cost? Multiply that by the whole team and you get a catalog that nobody actually knows anything about. Here is how to make it observable using Claude Code's native telemetry and OpenTelemetry — without patching a single line of source code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem: A Catalog Nobody Watches
&lt;/h2&gt;

&lt;p&gt;When a team adopts Claude Code seriously, skills start accumulating on their own. Someone adds a &lt;code&gt;code-reviewer&lt;/code&gt;, someone else pulls in a &lt;code&gt;db-migration-helper&lt;/code&gt; from a neighboring repo, another person installs a plugin with a dozen skills "just in case." The problem is not the quantity. The problem is that for every one of them you cannot answer basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Has anyone called this skill this month?&lt;/strong&gt; Or is it dead weight in the catalog, and every request pays context tokens for it anyway?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which skill burns the most tokens — and is it worth it?&lt;/strong&gt; Expensive and popular: fine. Expensive and nearly unused: that is money burning that nobody notices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The custom skill I wrote last week — does it actually fire?&lt;/strong&gt; Or is the model silently ignoring it while I sit here convinced that "everything works"?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If a skill breaks, will I know?&lt;/strong&gt; Or will it quietly fail every other run until someone shows up to complain?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each issue is tolerable in isolation. But skills accumulate faster than understanding of who uses them and why — and at some point the catalog turns into a black box. This is &lt;strong&gt;skill sprawl&lt;/strong&gt;: the same disease as &lt;em&gt;server sprawl&lt;/em&gt; or &lt;em&gt;tool sprawl&lt;/em&gt;, familiar to anyone who has maintained a catalog of microservices, libraries, or feature flags: artifacts multiply faster than insight into who is touching them.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. There is a real &lt;a href="https://github.com/anthropics/claude-code/issues/35319" rel="noopener noreferrer"&gt;feature request #35319&lt;/a&gt; in the Claude Code tracker where a team describes growth &lt;strong&gt;from 67 to 183 skills in a month with zero usage visibility&lt;/strong&gt; — and asks for some kind of analytics. And mature observability consoles (Datadog for Claude Code, for example) currently stop at user / model / repo / cost breakdowns — &lt;strong&gt;no skill-level analytics&lt;/strong&gt;. That gap matters once the catalog becomes shared infrastructure.&lt;/p&gt;

&lt;p&gt;The right question to ask is not "is the team using Claude Code" (billing answers that), but &lt;strong&gt;"is each skill we created alive, and does it earn its place in the context?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That last part is not a metaphor. Here is how it works under the hood: at session startup, Claude Code scans all available skills and inserts each one's name and description into the system prompt — the model needs to know what it can call. This list goes into &lt;strong&gt;every&lt;/strong&gt; API request. More skills means a longer system prompt, means more expensive every token for the team. A &lt;code&gt;legacy-formatter&lt;/code&gt; that nobody has called in six months &lt;strong&gt;still pays&lt;/strong&gt; input tokens on every request — just by existing in the catalog. Claude Code even has dedicated settings for managing this cost: &lt;code&gt;maxSkillDescriptionChars&lt;/code&gt; caps the per-skill description length (default: 1536 characters), &lt;code&gt;skillListingBudgetFraction&lt;/code&gt; limits the total fraction of the context window allocated to the listing (default: 1%). When the listing overflows, descriptions for the least-used skills are collapsed to bare names. Run &lt;code&gt;/doctor&lt;/code&gt; to see whether truncation is happening in your session. The very existence of those settings confirms this is a real line item, not abstract "clutter."&lt;/p&gt;

&lt;p&gt;A skill goes through the same lifecycle as any service: written, shipped, it either sticks or quietly dies. But a service has a dashboard, an owner, and alerts. A skill has nothing: shipped and blind. Skills are a team's &lt;em&gt;golden paths&lt;/em&gt; — tested routes to common tasks. So the catalog deserves to be treated like a &lt;strong&gt;service catalog&lt;/strong&gt;: with a roster, owners, usage metrics, and an honest decommission process.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Code Gives You Out of the Box
&lt;/h2&gt;

&lt;p&gt;Good news: you don't need to patch anything to get started. Claude Code has native OpenTelemetry support and emits enough signal to manage the catalog.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What it carries&lt;/th&gt;
&lt;th&gt;Where we route it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude_code.skill_activated&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;event (log)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;skill.name&lt;/code&gt;, &lt;code&gt;invocation_trigger&lt;/code&gt;, &lt;code&gt;skill.source&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Loki&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude_code.cost.usage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;metric&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;skill.name&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, USD&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude_code.token.usage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;metric&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;skill.name&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt; (input/output/cache)&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude_code.tool_result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;event&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tool_name&lt;/code&gt;, &lt;code&gt;success&lt;/code&gt;, &lt;code&gt;duration_ms&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Loki&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key point most people miss: &lt;strong&gt;skill activations are events (logs), not metrics.&lt;/strong&gt; One Prometheus instance is not enough. Metrics will tell you "how many tokens did &lt;code&gt;code-reviewer&lt;/code&gt; consume", but not "who called it, when, and from what trigger". For that you need a log pipeline and a log store — in our case, Loki.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Gotchas Worth Knowing Upfront
&lt;/h3&gt;

&lt;p&gt;Any telemetry write-up is easy to frame as "flip the flag and it all works." In practice there are three things I hit, and they are worth naming directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;OTEL_LOG_TOOL_DETAILS=1&lt;/code&gt; is mandatory.&lt;/strong&gt; Without this flag, your custom skill names collapse into a featureless placeholder &lt;code&gt;custom_skill&lt;/code&gt; in every event. Telemetry flows, the dashboard renders, but instead of &lt;code&gt;code-reviewer&lt;/code&gt; and &lt;code&gt;pr-describer&lt;/code&gt; you see seven rows of &lt;code&gt;custom_skill&lt;/code&gt;. You typically discover this after collecting data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cost attribution is honest only for "first-party" skills.&lt;/strong&gt; In the &lt;code&gt;cost.usage&lt;/code&gt; metric, skill names are propagated as-is only for built-in, &lt;strong&gt;user-defined&lt;/strong&gt;, and official marketplace skills. Names of &lt;strong&gt;third-party plugins&lt;/strong&gt; are replaced with &lt;code&gt;"third-party"&lt;/code&gt;. This is why the demo uses &lt;strong&gt;project-level skills&lt;/strong&gt; (&lt;code&gt;.claude/skills/&lt;/code&gt;, source &lt;code&gt;user-defined&lt;/code&gt;) — real names are visible in both events and cost metrics. If you distribute skills to your team through a third-party marketplace, keep this in mind: in the cost breakdown they will merge together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Slash-command invocations and programmatic Skill tool calls are two different paths.&lt;/strong&gt; When a user types &lt;code&gt;/skill-name&lt;/code&gt; in the CLI, the skill content is expanded client-side and injected as a user message — this path may emit different (or no) &lt;code&gt;skill_activated&lt;/code&gt; events depending on your Claude Code version. When Claude calls the same skill programmatically via the &lt;code&gt;Skill&lt;/code&gt; tool, the &lt;code&gt;tool_result&lt;/code&gt; event is emitted normally. Validate which invocation paths your team actually uses before treating this as a complete usage accounting system. The demo in this article uses the programmatic path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Why This Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs61l2k4lgrqw00jc1447.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs61l2k4lgrqw00jc1447.png" alt=" " width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stack is built on the &lt;strong&gt;official Anthropic guide&lt;/strong&gt; — &lt;a href="https://github.com/anthropics/claude-code-monitoring-guide" rel="noopener noreferrer"&gt;claude-code-monitoring-guide&lt;/a&gt;: OTel Collector + Prometheus + Grafana. But the official guide has a &lt;strong&gt;metrics-only pipeline&lt;/strong&gt;, and its dashboard panels cover cost / token / users / LOC — &lt;strong&gt;no skill panels&lt;/strong&gt;. We extend it with two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log pipeline + Loki&lt;/strong&gt; — to capture &lt;code&gt;skill_activated&lt;/code&gt; events. The official guide does not touch these because they are logs, not metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our own "Skill Catalog Management" dashboard&lt;/strong&gt; — that is our contribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why OpenTelemetry rather than a proprietary agent? Because OTLP is an open standard (graduated in CNCF), and the same telemetry stream, unchanged, goes to whatever you &lt;strong&gt;already have running&lt;/strong&gt;: Grafana Cloud, Datadog, Honeycomb. Only the endpoint changes (&lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;) — skills and environment variables stay the same. No new vendor, no vendor lock-in.&lt;/p&gt;

&lt;p&gt;The local &lt;code&gt;docker-compose&lt;/code&gt; in this article is a &lt;strong&gt;showcase and sandbox&lt;/strong&gt;: a way to reproduce everything from scratch in a couple of minutes and touch it with your own hands.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Stale Detection via Catalog Join
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting — and non-obvious.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Telemetry shows only what &lt;em&gt;fired&lt;/em&gt;.&lt;/strong&gt; To find skills that &lt;em&gt;nobody ever called&lt;/em&gt; — candidates for deletion — telemetry alone is not enough. You need to join activity against the &lt;strong&gt;full catalog&lt;/strong&gt; of all skills.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Think about it for a second. If a skill has never been called, there is &lt;strong&gt;not a single event&lt;/strong&gt; for it in Loki. It simply does not exist in the data. No query against telemetry will return "these skills are silent" — because silence is not logged.&lt;/p&gt;

&lt;p&gt;The solution is a classic &lt;strong&gt;outer join&lt;/strong&gt;: take the list of all skills (the source of truth) and attach an activation count from Loki. Rows where the count is empty → that is the dead weight.&lt;/p&gt;

&lt;p&gt;Our source of truth is &lt;code&gt;skills-catalog.json&lt;/code&gt;, generated by scanning &lt;code&gt;.claude/skills/*/SKILL.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/build-catalog.sh
&lt;span class="c"&gt;# Wrote skills-catalog.json and grafana/catalog.csv:&lt;/span&gt;
&lt;span class="c"&gt;# skill_name&lt;/span&gt;
&lt;span class="c"&gt;# changelog-updater&lt;/span&gt;
&lt;span class="c"&gt;# code-reviewer&lt;/span&gt;
&lt;span class="c"&gt;# db-migration-helper&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script produces two forms: JSON for humans and programs, and CSV — embedded directly into the Grafana dashboard (via a TestData datasource) and &lt;strong&gt;outer-joined&lt;/strong&gt; with activations from Loki. This is the technically honest answer to "what are we not using."&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo Catalog: 7 Skills with Personality
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;These skills are fictional.&lt;/strong&gt; They were written specifically for this observability demo and are not production-quality tools. Their purpose is to generate realistic telemetry patterns — not to be actually useful. Replace them with your team's real skills to instrument a live catalog.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make the dashboard show something meaningful, you need a realistic mini-catalog. Seven skills, and each one makes &lt;strong&gt;real&lt;/strong&gt; tool calls (git, Read, Glob, Bash) when invoked — generating real telemetry, not mocks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Planned invocations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-reviewer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;medium cost, reliable&lt;/td&gt;
&lt;td&gt;Bash(git) + Read&lt;/td&gt;
&lt;td&gt;frequent (≈14)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dep-auditor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fast, &lt;strong&gt;unstable&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Bash (≈50% exit 1)&lt;/td&gt;
&lt;td&gt;frequent (≈13) — tests observability edge cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test-scaffolder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;slow, reliable&lt;/td&gt;
&lt;td&gt;Glob + Read×N&lt;/td&gt;
&lt;td&gt;notable (≈13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pr-describer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fast, reliable&lt;/td&gt;
&lt;td&gt;Bash(git)&lt;/td&gt;
&lt;td&gt;notable (≈10)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;changelog-updater&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;medium, reliable&lt;/td&gt;
&lt;td&gt;Bash(git) + Read&lt;/td&gt;
&lt;td&gt;moderate (≈7)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;legacy-formatter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Glob&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 — demonstrates stale&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;db-migration-helper&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Glob&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 — demonstrates stale&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two skills — &lt;code&gt;legacy-formatter&lt;/code&gt; and &lt;code&gt;db-migration-helper&lt;/code&gt; — are &lt;strong&gt;intentionally never called&lt;/strong&gt;. These are our "dead" candidates that should surface in red.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dep-auditor&lt;/code&gt; deserves a separate note. It is deliberately unstable — the command inside alternates between success and failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/dep_auditor_count 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;0&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;COUNT+1&lt;span class="k"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$COUNT&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/dep_auditor_count
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="k"&gt;$((&lt;/span&gt;COUNT &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 1 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"audit backend unreachable (attempt #&lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"0 vulnerabilities found (attempt #&lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? To check whether native telemetry sees a "flapping" skill — and if not, why. Spoiler: it doesn't. The answer is in the section on the third honest gotcha below.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Skill Catalog Management" Dashboard
&lt;/h2&gt;

&lt;p&gt;Now for the visual part. Stack is running, telemetry collected — let's see what we got.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2saxljdga57qj3s1f4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2saxljdga57qj3s1f4q.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the top, four stat panels give an instant health snapshot of the catalog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Size: 7&lt;/strong&gt; — how many skills are in the catalog (from the source of truth).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active Skills: 6&lt;/strong&gt; — how many unique skills fired at least once in the period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Invocations: 66&lt;/strong&gt; — total activations in the period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditor Error Rate&lt;/strong&gt; — a panel for skill error signal. In our demo it shows "No data" — and that is an honest, instructive result, explained below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Already you can see a discrepancy: the catalog has &lt;strong&gt;7&lt;/strong&gt;, but active is &lt;strong&gt;6&lt;/strong&gt;. One skill is silent. (In the demo, actually two of our seven are silent, and the sixth active one is &lt;code&gt;superpowers:executing-plans&lt;/code&gt; — which I used &lt;em&gt;to run the data collection plan itself&lt;/em&gt;. A nice illustration: monitoring caught a skill I wasn't even planning to show. The catalog lives its own life — which is exactly why you need to watch it.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Hero: Leaderboard + Stale Skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmetzyeornc9y4c1bk2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmetzyeornc9y4c1bk2w.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the two main panels, and they are most useful side by side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill Usage Leaderboard&lt;/strong&gt; (left) — ranking by activation count. Shows the team's golden paths: &lt;code&gt;code-reviewer&lt;/code&gt; (14) leads, followed by &lt;code&gt;dep-auditor&lt;/code&gt; and &lt;code&gt;test-scaffolder&lt;/code&gt; (13 each). This is what the team actually bets on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔴 Stale Skills&lt;/strong&gt; (right) — the catalog outer join with activity. Every skill from the catalog is joined to an activation count. And here are the red rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;db-migration-helper&lt;/code&gt; → &lt;strong&gt;0 — STALE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;legacy-formatter&lt;/code&gt; → &lt;strong&gt;0 — STALE&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two &lt;strong&gt;exist in the catalog but nobody has ever called them&lt;/strong&gt;. Without the join against the catalog you would simply never see them — they are not in the telemetry. This panel answers the core catalog question: which skills are candidates for decommissioning?&lt;/p&gt;

&lt;h3&gt;
  
  
  Adoption and Cost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq66kgki0ol4o9bmomxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq66kgki0ol4o9bmomxt.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Adoption Over Time&lt;/strong&gt; — activations by skill over time (5-minute buckets, stacked). On this curve you can see how a new skill gets adopted — or doesn't. You shipped a skill on Tuesday, and by Friday the curve for it is still flat? Adoption didn't happen, and that is a reason to talk to the team rather than silently keep the skill in the catalog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost &amp;amp; Tokens per Skill&lt;/strong&gt; — cost and token breakdown by skill, from the &lt;code&gt;claude_code.cost.usage&lt;/code&gt; / &lt;code&gt;token.usage&lt;/code&gt; metrics. One important implementation detail: &lt;strong&gt;tokens are measured in tens of thousands, costs in cents.&lt;/strong&gt; These are two fundamentally different scales, and trying to plot them on the same linear axis is meaningless — the cheaper metric just hugs zero. So the two signals are separated into distinct panels (or table rows), each with its own scale. A small but telling thing: a dashboard is not "dump all metrics on one canvas," it is fitting the representation to the nature of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation Trigger&lt;/strong&gt; (pie) answers the question of who is actually calling the skill: a human via &lt;code&gt;/slash&lt;/code&gt;, Claude proactively, or a nested call from another skill. A useful breakdown — it distinguishes "skill that people consciously invoke" from "skill that fires in the background."&lt;/p&gt;
&lt;h3&gt;
  
  
  The Third Honest Gotcha: Native Telemetry Does Not Know Exit Codes
&lt;/h3&gt;

&lt;p&gt;The "Auditor Error Rate" panel shows "No data" — and we deliberately did not hide that.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dep-auditor&lt;/code&gt; is designed to fail every other run: the bash command inside exits with &lt;code&gt;exit 1&lt;/code&gt; on odd runs. One would expect &lt;code&gt;success=false&lt;/code&gt; to show up in &lt;code&gt;claude_code.tool_result&lt;/code&gt; — but it doesn't. Checking real data in Loki: 19 out of 19 Bash results show &lt;code&gt;success=true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why? The official Claude Code documentation cleanly separates two levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;&lt;code&gt;success&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;error_type&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bash didn't launch at all&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Error:ENOENT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;binary not found&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shell crashed abnormally&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ShellError&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OOM, kill signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Command ran and exited with &lt;code&gt;exit 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;true&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(none)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bash -c "exit 1"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Command ran and exited with &lt;code&gt;exit 0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(none)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bash -c "exit 0"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In other words, &lt;code&gt;success&lt;/code&gt; reflects "the tool harness executed the command and got a result" — not "the command did what was intended." This is a &lt;strong&gt;design decision&lt;/strong&gt;: Claude Code deliberately does not interpret the semantics of what it ran. For the platform, &lt;code&gt;exit 1&lt;/code&gt; is a valid program response, not an error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical implication:&lt;/strong&gt; native telemetry answers "did the skill run?" — not "did the skill work correctly?" These are two different questions, and the second one requires the &lt;strong&gt;skill itself to report its result&lt;/strong&gt;. Either through a custom OTLP write (the skill sends an event with &lt;code&gt;result=success/fail&lt;/code&gt; directly to the collector — &lt;code&gt;OTEL_*&lt;/code&gt; variables are intentionally not inherited by child processes, so the endpoint must be set explicitly), or through a PostToolUse hook that checks the command output.&lt;/p&gt;

&lt;p&gt;This is the exact same logic by which you add a health check to a service: the infrastructure knows it is "running," but only the service itself knows it is "working correctly."&lt;/p&gt;


&lt;h2&gt;
  
  
  How to Reproduce
&lt;/h2&gt;

&lt;p&gt;The entire stack runs locally in a couple of minutes. Here is the path from zero to a live dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Start the stack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;12
docker-compose ps                          &lt;span class="c"&gt;# 4 services Up&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:3001/api/health   &lt;span class="c"&gt;# Grafana ok&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Enable OTLP and launch Claude Code from the same shell&lt;/strong&gt; (variables must reach the &lt;code&gt;claude&lt;/code&gt; process, so &lt;code&gt;source&lt;/code&gt; comes first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .env.example
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use skills&lt;/strong&gt; — call them as you would in real work. Leave two untouched (for the stale demonstration).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Open the dashboard:&lt;/strong&gt; &lt;a href="http://localhost:3001" rel="noopener noreferrer"&gt;http://localhost:3001&lt;/a&gt; (admin / admin) → &lt;strong&gt;Skill Catalog Management&lt;/strong&gt;. Panels come alive in ~10–20 seconds.&lt;/p&gt;

&lt;p&gt;UIs at hand: Grafana :3001 · Prometheus :9090 · Loki :3100.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfalls (So You Don't Have to Step in Them)
&lt;/h2&gt;

&lt;p&gt;Everything you might trip over, in one place:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skill names show as &lt;code&gt;custom_skill&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_LOG_TOOL_DETAILS=1&lt;/code&gt; is not set&lt;/td&gt;
&lt;td&gt;Close session → &lt;code&gt;source .env.example&lt;/code&gt; → &lt;code&gt;claude&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party plugin costs merged into &lt;code&gt;"third-party"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cost attribution only works for first-party skills&lt;/td&gt;
&lt;td&gt;Use project-level / user-defined skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate panel shows "No data" despite failed commands&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;success&lt;/code&gt; in &lt;code&gt;tool_result&lt;/code&gt; reflects harness failure, not command exit code — &lt;code&gt;bash -c "exit 1"&lt;/code&gt; returns &lt;code&gt;success=true&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Add a PostToolUse hook or custom OTLP instrumentation inside the skill to report semantic result&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Scaling Beyond Local
&lt;/h2&gt;

&lt;p&gt;The local stack is a showcase and sandbox. What changes when you bring this to the team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The dashboard is already portable.&lt;/strong&gt; It lives as JSON in Grafana provisioning — commit it to your platform team's repository and it deploys into your corporate Grafana as-is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Endpoint instead of localhost.&lt;/strong&gt; &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; switches to your corporate collector. Everything else stays untouched — that is the point of vendor-neutral OTLP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributing skills via marketplace.&lt;/strong&gt; When you package skills into a plugin and distribute them through a marketplace — remember the cost attribution gotcha: third-party plugins merge into &lt;code&gt;"third-party"&lt;/code&gt;. If per-skill cost visibility matters, keep them as first-party / user-defined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog as CI artifact.&lt;/strong&gt; &lt;code&gt;build-catalog.sh&lt;/code&gt; can run in the pipeline and publish the catalog as an artifact — then the source of truth is always fresh, and the dashboard always joins against the current list.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Debrief
&lt;/h2&gt;

&lt;p&gt;Claude Code skills sprawl exactly the way any unmonitored catalog sprawls — microservices, libraries, feature flags. The cure is also familiar: &lt;strong&gt;treat the catalog like a service.&lt;/strong&gt; A roster with owners, usage metrics, an adoption curve, and an honest decommission process.&lt;/p&gt;

&lt;p&gt;The good news is that Claude Code hands you everything you need for this out of the box — through an open standard, without patches and without proprietary agents. Two flags, a log pipeline for events, and one non-obvious technique: &lt;strong&gt;joining activity against the catalog&lt;/strong&gt;, so you can see not just what is alive, but what is ready for honest decommissioning.&lt;/p&gt;

&lt;p&gt;The engineering task is not to guess what the team uses, but to instrument the catalog thoroughly enough that its behavior becomes visible. Then decisions are made from data, not from intuition.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack, skills, configs, and dashboard — all in the &lt;a href="https://github.com/astronaut27/claude-code-skills-observability" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Starts with a single &lt;code&gt;docker-compose up -d&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next:&lt;/strong&gt; &lt;em&gt;"Debugging your Claude Code skills: what native telemetry won't tell you and how to close those gaps."&lt;/em&gt; Catalog management answers "what lives in the team." Skill debugging answers "does it work the way it was designed to" — and that is a separate story with different tooling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aiops</category>
      <category>claude</category>
    </item>
    <item>
      <title>Prompt Management Is Infrastructure: Requirements, Tools, and Patterns</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:00:56 +0000</pubDate>
      <link>https://dev.to/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</link>
      <guid>https://dev.to/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Mission Log #6 — Prompt control center: from strings in code to a production-grade system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your LLM service keeps prompts in code or in a UI without strict version control, you're accumulating technical debt. Not the usual kind. This debt doesn't show up as stack traces. It shows up as silent quality drift: SLAs green, logs clean, and users increasingly getting irrelevant answers.&lt;/p&gt;

&lt;p&gt;In production, a prompt is the &lt;strong&gt;behavioral contract of your service&lt;/strong&gt;. It directly affects tool-calling accuracy, RAG faithfulness, latency distribution, inference cost, and downstream behavior.&lt;/p&gt;

&lt;p&gt;This article is not about prompt engineering (how to write a good prompt). It's about &lt;strong&gt;prompt management&lt;/strong&gt; — how to manage prompts as an engineer: version, deploy, roll back, observe, and avoid silent regressions.&lt;/p&gt;

&lt;p&gt;You'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What prompt management is and how it differs from prompt engineering.&lt;/li&gt;
&lt;li&gt;What production demands from prompt management (and what breaks when you ignore it).&lt;/li&gt;
&lt;li&gt;A maturity model: where your team is and what the next step is.&lt;/li&gt;
&lt;li&gt;Tools that address these requirements and how they map.&lt;/li&gt;
&lt;li&gt;Architectural patterns for embedding prompt management into your system.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Is Prompt Management (and What Are We Versioning?)
&lt;/h2&gt;

&lt;p&gt;Prompt management is the set of practices and tools for the full lifecycle of prompts: creation, versioning, testing, deployment, monitoring, and rollback.&lt;/p&gt;

&lt;p&gt;In production, a "prompt" is not a single text string. It's a &lt;strong&gt;composite artifact&lt;/strong&gt; of several components, each of which affects service behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Why we version it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;"You are a support agent..."&lt;/td&gt;
&lt;td&gt;Defines model behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;3 input→output pairs&lt;/td&gt;
&lt;td&gt;Affect format and quality of responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool schemas&lt;/td&gt;
&lt;td&gt;OpenAPI specs for function calling&lt;/td&gt;
&lt;td&gt;Define which tools the model can call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output schema&lt;/td&gt;
&lt;td&gt;JSON Schema for structured output&lt;/td&gt;
&lt;td&gt;Breaks downstream parsers when changed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference params&lt;/td&gt;
&lt;td&gt;model, temperature, max_tokens, top_p&lt;/td&gt;
&lt;td&gt;Affect latency, cost, response style&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt template&lt;/td&gt;
&lt;td&gt;Template with variables (&lt;code&gt;{{user_name}}&lt;/code&gt;, &lt;code&gt;{{context}}&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Logic for assembling the final prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing logic&lt;/td&gt;
&lt;td&gt;Which prompt for which tenant/use case&lt;/td&gt;
&lt;td&gt;Determines who sees which version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engineers often version only the system prompt text. But if someone changes a tool schema or bumps temperature from 0.3 to 0.9, system behavior changes just as much. In mature production systems, teams version the &lt;strong&gt;entire artifact&lt;/strong&gt;, not just the text.&lt;/p&gt;




&lt;h2&gt;
  
  
  9 Requirements for Production-Grade Prompt Management
&lt;/h2&gt;

&lt;p&gt;These requirements come from working with production LLM systems. Each is described with a concrete failure mode — what actually breaks when the requirement isn't met.&lt;/p&gt;

&lt;p&gt;It helps to split them into three planes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: version identity, diff, change history, reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery/Rollout&lt;/strong&gt;: labels, canary, version distribution, rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control/Governance&lt;/strong&gt;: eval gating, audit trail, trace linkage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Immutable versions
&lt;/h3&gt;

&lt;p&gt;Every prompt version is immutable. A unique &lt;code&gt;prompt_version_id&lt;/code&gt; (content hash or incremental id).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you can't tell which exact prompt version was live during an incident. "Someone changed the prompt last week, I think" is guesswork, not debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Labels / Aliases
&lt;/h3&gt;

&lt;p&gt;Named labels for routing prompt versions at runtime. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;By environment&lt;/strong&gt;: &lt;code&gt;production&lt;/code&gt;, &lt;code&gt;canary&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By model&lt;/strong&gt;: &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;claude-sonnet&lt;/code&gt;, &lt;code&gt;llama-3-70b&lt;/code&gt; — different prompts tuned for different LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By tenant/use case&lt;/strong&gt;: &lt;code&gt;tenant_acme&lt;/code&gt;, &lt;code&gt;support_flow&lt;/code&gt;, &lt;code&gt;sales_agent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By experiment&lt;/strong&gt;: &lt;code&gt;experiment_v3_concise&lt;/code&gt;, &lt;code&gt;baseline&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app requests a prompt by label, not by concrete version. That lets you change the version without changing code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: changing a prompt version means a full service deploy. Every text change goes through the full CI/CD pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Evaluation gating
&lt;/h3&gt;

&lt;p&gt;A new prompt version goes through controlled validation before promotion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain-specific golden dataset,&lt;/li&gt;
&lt;li&gt;automated regression tests,&lt;/li&gt;
&lt;li&gt;offline comparison to baseline,&lt;/li&gt;
&lt;li&gt;(optional) LLM-based scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Promotion is a deliberate decision, not a blind merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: every prompt change is a lottery. You can go a month without noticing that answer quality dropped 15%.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Low-latency fetch
&lt;/h3&gt;

&lt;p&gt;Predictable time to fetch the prompt at runtime. In-memory cache on the hot path. The goal is to avoid putting a slow, uncached config dependency on the critical request path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: prompt management becomes a single point of failure. If the config service responds in 500ms instead of 5ms, your TTFT is already broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit trail
&lt;/h3&gt;

&lt;p&gt;Who changed what, when, and why. Commit message + metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: after an incident you run a detective investigation instead of root-cause analysis. "Who changed the support prompt?" shouldn't take more than 10 seconds to answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Trace linkage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace/span. Correlation with metrics: latency, tool-call success rate, semantic failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you see quality degrade but can't tie it to a specific prompt version. Observability without trace linkage is dashboards for the sake of dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Rollback without downtime
&lt;/h3&gt;

&lt;p&gt;Reassign a label → fast rollback without redeploy or service restart (within your propagation window).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: recovery time after a bad prompt equals full deploy time (minutes or hours instead of seconds). In agent systems with dozens of prompts, that's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Structured schema support
&lt;/h3&gt;

&lt;p&gt;Version not only text but tool schemas, output constraints, and templating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you track prompt text, someone quietly changes the output schema, and the downstream parser breaks. Half the artifact is out of control.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. GitOps-friendly or API-driven workflow
&lt;/h3&gt;

&lt;p&gt;Infra and product teams work in parallel without overwriting each other. Prompts are managed via Git (PR, review) or via API (SDK, UI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: two people edit the same prompt in the UI → last save wins, wiping the first person's changes. Familiar Google Docs pain, but with production impact.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maturity Model: Where Are You Now?
&lt;/h2&gt;

&lt;p&gt;Not every system needs Level 4. The point is to know your current level and choose the next step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 0 — Strings in code
&lt;/h3&gt;

&lt;p&gt;Prompts live as literals in code or hardcoded in the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Typical Level 0
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;No explicit versions (only git blame if you're lucky).&lt;/li&gt;
&lt;li&gt;Rollback = git revert + full deploy.&lt;/li&gt;
&lt;li&gt;Debug: "check the code for what's there" — but production may be running a different build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: minimal code-level audit trail and version history in Git; almost none of the runtime requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Git-based prompts
&lt;/h3&gt;

&lt;p&gt;Prompts live in separate files (YAML, JSON, Markdown) and are versioned in Git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompts/support_agent/v2.yaml&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;support_agent&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
&lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
&lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;You are a support agent for {{product_name}}.&lt;/span&gt;
  &lt;span class="s"&gt;Always check the knowledge base before answering.&lt;/span&gt;
  &lt;span class="s"&gt;If unsure, escalate to a human.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;search_kb&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;create_ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change history and PR review.&lt;/li&gt;
&lt;li&gt;Audit trail via git log.&lt;/li&gt;
&lt;li&gt;Rollback still via deploy (git revert → CI → deploy).&lt;/li&gt;
&lt;li&gt;No runtime labels/aliases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutable history in Git, audit trail (git log), GitOps workflow, structured schema (if the file holds all components). Immutable runtime artifacts only appear when you explicitly build and publish versioned artifacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2 — Config store + labels
&lt;/h3&gt;

&lt;p&gt;Prompts live in a key-value store (Redis, Postgres, DynamoDB, internal config service) with label support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/support_agent?label=production
→ { version_id: "v2-abc123", system_prompt: "...", tools: [...] }

GET /v1/prompts/support_agent?label=canary
→ { version_id: "v3-def456", system_prompt: "...", tools: [...] }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Runtime routing by alias.&lt;/li&gt;
&lt;li&gt;Changing the production version without deploy (reassign label).&lt;/li&gt;
&lt;li&gt;In-memory cache on the client + background refresh.&lt;/li&gt;
&lt;li&gt;No built-in eval gating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutability, labels, low-latency fetch, rollback, audit trail (if you keep it), GitOps/API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3 — Dedicated prompt management platform
&lt;/h3&gt;

&lt;p&gt;A dedicated platform: UI for version management, diffs between versions, built-in tracing, and observability integrations.&lt;/p&gt;

&lt;p&gt;Examples: Langfuse, Braintrust, MLflow Prompt Registry, PromptLayer, LangSmith.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for comparing versions, promoting, rolling back.&lt;/li&gt;
&lt;li&gt;Observability integration (trace linkage).&lt;/li&gt;
&lt;li&gt;A/B testing and canary rollouts.&lt;/li&gt;
&lt;li&gt;Non-engineers (product, domain experts) can edit prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements to varying degrees (platform-dependent).&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 4 — Full prompt ops
&lt;/h3&gt;

&lt;p&gt;Single pipeline: create → eval → offline comparison → canary rollout → monitoring → auto-rollback.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt management is part of CI/CD and the eval pipeline.&lt;/li&gt;
&lt;li&gt;Evaluation gating built into the promotion process.&lt;/li&gt;
&lt;li&gt;Automatic alerts when metrics degrade for a given prompt_version.&lt;/li&gt;
&lt;li&gt;A prompt doesn't reach production until it passes the golden set and regression tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements plus automated eval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Overview
&lt;/h2&gt;

&lt;p&gt;Not a feature list — a mapping onto the 9 requirements. The focus is on infrastructure needs, not marketing features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Langfuse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LLM observability + prompt management platform, open-source / open-core. After the ClickHouse merger, the project kept an open core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning with labels (&lt;code&gt;production&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, custom).&lt;/li&gt;
&lt;li&gt;Client-side cache — prompt is fetched once, then served from memory. No extra latency on requests.&lt;/li&gt;
&lt;li&gt;Trace linkage: &lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace.&lt;/li&gt;
&lt;li&gt;Self-hosted option (Docker) — important for compliance and data-sensitive systems.&lt;/li&gt;
&lt;li&gt;Open-source/open-core: most core features are open; some capabilities depend on the commercial plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for non-engineers is less polished than more product-centric platforms.&lt;/li&gt;
&lt;li&gt;Eval gating has to be built separately (via integration with eval frameworks).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ~, GitOps/API ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLflow Prompt Registry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Part of the MLflow GenAI ecosystem. Git-inspired versioning for prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immutable versions + aliasing (Git-inspired).&lt;/li&gt;
&lt;li&gt;Lineage tracking — link prompts to model runs and eval results.&lt;/li&gt;
&lt;li&gt;Natural fit for teams already on MLflow/Databricks.&lt;/li&gt;
&lt;li&gt;Template support with variables (&lt;code&gt;{{variable}}&lt;/code&gt;), conversion to LangChain/LlamaIndex formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tightly coupled to the MLflow ecosystem. If you're not on Databricks/MLflow, integration overhead.&lt;/li&gt;
&lt;li&gt;Not a standalone observability platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (aliases), eval gating ✓ (via MLflow evaluate), low-latency ~, audit trail ✓, trace linkage ~ (via MLflow tracking), rollback ✓, schema ✓, GitOps ~ (custom scripts).&lt;/p&gt;

&lt;h3&gt;
  
  
  Braintrust
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: AI observability platform with prompt management, eval, and production monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environments: development → staging → production with quality gates.&lt;/li&gt;
&lt;li&gt;Bidirectional sync between code (SDK) and UI (playground) — engineers and product work in parallel.&lt;/li&gt;
&lt;li&gt;GitHub Actions integration: eval in CI, blocking deployments, PR comments.&lt;/li&gt;
&lt;li&gt;Prompt playground for testing on real data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-first: deployment and data-plane options depend on enterprise setup and contracts.&lt;/li&gt;
&lt;li&gt;Platform lock-in and migration cost if you switch vendors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (environments), eval gating ✓, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ✓, GitOps ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  PromptLayer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Lightweight tool for logging and versioning LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easiest integration (&amp;lt; 30 minutes, a few lines of code).&lt;/li&gt;
&lt;li&gt;Prompt registry: prompts stored outside code, deployed via API.&lt;/li&gt;
&lt;li&gt;Release labels and dynamic labels for runtime routing.&lt;/li&gt;
&lt;li&gt;Basic eval and version comparison.&lt;/li&gt;
&lt;li&gt;Low barrier to entry; good for getting started.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less depth on observability and governance than full-stack LLMOps platforms.&lt;/li&gt;
&lt;li&gt;Teams with growing complexity will outgrow it quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ~, audit trail ✓, trace linkage ~, rollback ✓, schema ~, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangSmith
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LangChain platform for tracing, eval, and prompt management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep integration with LangChain/LangGraph.&lt;/li&gt;
&lt;li&gt;Hub for sharing and versioning prompts.&lt;/li&gt;
&lt;li&gt;Evaluation + dataset management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tied to the LangChain ecosystem (though there are SDK and API).&lt;/li&gt;
&lt;li&gt;Commercial product: deployment modes and enterprise features depend on plan and contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ~, eval gating ✓, low-latency ~, audit trail ✓, trace linkage ✓, rollback ~, schema ✓, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Langfuse&lt;/th&gt;
&lt;th&gt;MLflow&lt;/th&gt;
&lt;th&gt;Braintrust&lt;/th&gt;
&lt;th&gt;PromptLayer&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immutability&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels/Aliases&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval Gating&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency Fetch&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit Trail&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trace Linkage&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured Schema&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps/API&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;✓ = full support, ~ = partial or needs extra setup, ✗ = no.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Table reflects public docs and typical production scenarios at the time of writing. For a real choice, always check current limits for plans, licensing, and deployment mode.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architectural Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Git-native
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;prompts/&lt;/span&gt;
  &lt;span class="s"&gt;support_agent/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
    &lt;span class="s"&gt;v2.yaml&lt;/span&gt;
  &lt;span class="s"&gt;code_review/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
  &lt;span class="s"&gt;registry.yaml     ← index&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;which label points to which version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI builds prompts into an artifact (JSON bundle, SQLite, Redis snapshot). The service loads the artifact at startup.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Familiar workflow (PR, review, CI)&lt;/td&gt;
&lt;td&gt;Rollback = new deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full audit trail&lt;/td&gt;
&lt;td&gt;Non-engineers can't edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No runtime dependencies&lt;/td&gt;
&lt;td&gt;No runtime labels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No extra cost&lt;/td&gt;
&lt;td&gt;Eval gating built from scratch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams of 1–5 engineers, early stage, few prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Config service (internal)
&lt;/h3&gt;

&lt;p&gt;Your own service with REST/gRPC API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/{name}?label=production
POST /v1/prompts/{name}/versions   ← create version
PUT /v1/prompts/{name}/labels      ← reassign label
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storage: Postgres / DynamoDB. Clients: SDK with in-memory cache + background polling (TTL 30–60 sec).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Build and maintain it yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Another service in the stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency (your cache)&lt;/td&gt;
&lt;td&gt;You build the UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No vendor lock-in&lt;/td&gt;
&lt;td&gt;Eval gating is a separate concern&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Consistency note&lt;/strong&gt;: with background polling and TTL 30–60 sec, after reassigning a label different instances can run on &lt;strong&gt;different prompt versions&lt;/strong&gt; for up to a minute. For most LLM use cases eventual consistency is fine. For safety-critical systems you need a push mechanism (webhook/event) or a shorter TTL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: mid-size and larger teams that care about control and have capacity for infra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Managed platform (SaaS)
&lt;/h3&gt;

&lt;p&gt;Langfuse Cloud / Braintrust / LangSmith — prompts managed via the platform's UI and SDK.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast to start&lt;/td&gt;
&lt;td&gt;Runtime dependency on SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI for non-engineers&lt;/td&gt;
&lt;td&gt;Vendor lock-in (as with Humanloop, which was discontinued)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval, tracing, A/B out of the box&lt;/td&gt;
&lt;td&gt;Cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No infra to build&lt;/td&gt;
&lt;td&gt;Data residency constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical question&lt;/strong&gt;: what happens when the SaaS is down? The client SDK must have a fallback (last known good version from cache). Without it, SaaS downtime = your service downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that need a quick start and non-engineer access, and accept the risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4: Hybrid (Git + platform)
&lt;/h3&gt;

&lt;p&gt;Git is source of truth. CI syncs prompts into the platform (Langfuse, Braintrust). The platform handles runtime delivery and observability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer → Git PR → Review → Merge → CI syncs to Platform → Runtime fetch via SDK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review + runtime flexibility&lt;/td&gt;
&lt;td&gt;Sync complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail in Git&lt;/td&gt;
&lt;td&gt;Drift between Git and platform possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-engineers see result in UI&lt;/td&gt;
&lt;td&gt;Two sources of truth when things go wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Extra CI plumbing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Failure modes to plan for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift&lt;/strong&gt;: CI sync fails, Git moves ahead, platform serves an old version. Engineer thinks the prompt is updated — service is still on the previous one. Mitigation: check &lt;code&gt;prompt_hash&lt;/code&gt; on the platform side + alert on mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership&lt;/strong&gt;: if non-engineers can edit prompts directly in the platform UI, bypassing Git, Git is no longer the single source of truth. Either block direct edits in the UI or implement reverse sync (platform → Git), which is much more complex.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that want Git review plus runtime flexibility. Most mature pattern, and the hardest to operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 5: Feature flags
&lt;/h3&gt;

&lt;p&gt;Prompt versions are managed as feature flags in your existing system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granular rollout (5% → 50% → 100%)&lt;/td&gt;
&lt;td&gt;Flag systems aren't built for long text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instant rollback (toggle off)&lt;/td&gt;
&lt;td&gt;With dozens of prompts, flag sprawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B testing out of the box&lt;/td&gt;
&lt;td&gt;No diffs between prompt versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Familiar if you already use it&lt;/td&gt;
&lt;td&gt;Prompts still need to live somewhere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that already have feature-flag infra and need granular rollout. Works well as a &lt;strong&gt;complement&lt;/strong&gt; to other patterns (e.g. Git-native + flags for rollout), not as the only mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime delivery: 3 questions for any pattern
&lt;/h3&gt;

&lt;p&gt;Whatever pattern you pick, answer these before production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How does the prompt reach runtime?&lt;/strong&gt; Polling with TTL, push via webhook/event, or baked in at deploy? This determines how fast changes propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens if the prompt source is unavailable?&lt;/strong&gt; Fallback from local cache (stale-while-revalidate) or hard failure? Without fallback you add a single point of failure on the hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do all instances see the new version?&lt;/strong&gt; Eventual consistency (seconds–minutes) or strong? For most LLM use cases eventual is enough, but you must know your consistency window.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these is a separate engineering concern from distributed config propagation. A deeper treatment — caching patterns, failure modes, examples — is a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose: Decision Framework
&lt;/h2&gt;

&lt;p&gt;Don't choose by feature list. Choose by four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Who edits prompts?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only engineers → Git-native or config service.&lt;/li&gt;
&lt;li&gt;Product/domain experts too → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. How fast must rollback be?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seconds → you need runtime labels (Level 2+).&lt;/li&gt;
&lt;li&gt;Minutes via CI is acceptable → Git-native is enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. How many prompts and how often do they change?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 prompts, change once a month → Git-native.&lt;/li&gt;
&lt;li&gt;50+ prompts, change weekly → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Data residency and compliance?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data must stay in region / on-premise → self-hosted (Langfuse, MLflow) or your own config service.&lt;/li&gt;
&lt;li&gt;No constraints → SaaS is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For enterprise teams, (4) is often the &lt;strong&gt;first filter&lt;/strong&gt; and rules out half the options immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt management is a new infrastructure layer. It's closest to config management and feature flags, but with a twist: prompt semantics are opaque and the impact of changes is probabilistic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need to build Level 4 right away. See where you are and pick &lt;strong&gt;one&lt;/strong&gt; next step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Level 0? → Move prompts to files and introduce &lt;code&gt;prompt_version_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;At Level 1? → Add runtime labels and rollback without deploy.&lt;/li&gt;
&lt;li&gt;At Level 2? → Add eval gating and trace linkage.&lt;/li&gt;
&lt;li&gt;At Level 3? → Automate the promotion pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you already run prompt management in production — what approach did you choose and what pitfalls did you hit?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Design Recipe: Observability Pyramid for LLM Infrastructure</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 05 Feb 2026 08:44:43 +0000</pubDate>
      <link>https://dev.to/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</link>
      <guid>https://dev.to/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</guid>
      <description>&lt;p&gt;In classic backend systems, we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems, we deal with "soft failures" — the system runs fast and without log errors, but outputs hallucinations or irrelevant context.&lt;/p&gt;

&lt;p&gt;As an engineer with a highload and distributed systems background, I like to view the system as a conveyor with measurable efficiency at each stage. For this, I use the &lt;strong&gt;Observability Pyramid&lt;/strong&gt;, where each layer protects the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. System Layer: Telemetry and SRE Basics
&lt;/h2&gt;

&lt;p&gt;Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time to First Token): the main metric for UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT&lt;/strong&gt; (Time Per Output Token): generation stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens/Sec &amp;amp; Input/Output Ratio&lt;/strong&gt;: critical for capacity planning and understanding KV-cache load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Engineering Approach:&lt;/strong&gt; Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).&lt;/p&gt;

&lt;p&gt;For details on profiling the engine and finding bottlenecks — see my article:&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;LLM Engine Telemetry: How to profile models&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Retrieval Layer: Data Hygiene (RAG Triad)
&lt;/h2&gt;

&lt;p&gt;Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Context Precision&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.&lt;br&gt;&lt;br&gt;
Tools: RAGAS, DeepEval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Context Recall&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Does the retrieved set contain the factual answer?&lt;br&gt;&lt;br&gt;
Practice: You need a "golden standard" — a labeled dataset. I use &lt;a href="https://github.com/facebookresearch/CRAG" rel="noopener noreferrer"&gt;Meta CRAG&lt;/a&gt; because it simulates real-world chaos and dynamically changing data.&lt;br&gt;&lt;br&gt;
See my guide on local CRAG evaluation &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C. Faithfulness&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Is the answer derived from the context or hallucinated? &lt;/p&gt;

&lt;p&gt;A judge model checks every claim in the response against the provided source.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Semantic Layer: LLM-as-a-Judge at Scale
&lt;/h2&gt;

&lt;p&gt;This level checks logic. The main challenge is balancing evaluation quality with cost/speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Gating&lt;/strong&gt;: Full run on a reference dataset. If Faithfulness drops below 0.8 — block deployment (tune the threshold for your domain).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Sampling&lt;/strong&gt;: In highload systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%).
Additionally: implement &lt;strong&gt;judge caching&lt;/strong&gt; (GPT cache, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions — the same prompt+context can be evaluated multiple times, but you pay only once.
&lt;a href="https://github.com/zilliztech/GPTCache" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Judges&lt;/strong&gt;: Instead of "naked" small models (which often struggle with logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, comparable in quality to GPT-4, and can be hosted locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-band Eval&lt;/strong&gt;: In production, evaluation always runs asynchronously to avoid increasing main request latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Diagnostic Map: What to Fix?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;If Dropped, Problem In:&lt;/th&gt;
&lt;th&gt;Action Plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Recall&lt;/td&gt;
&lt;td&gt;Embeddings / Indexing&lt;/td&gt;
&lt;td&gt;Switch embedding model, implement Hybrid Search (Vector + Keyword)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;Chunking / Noise&lt;/td&gt;
&lt;td&gt;Add Reranker (Cross-Encoder), revise Chunking Strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;Temperature / Context&lt;/td&gt;
&lt;td&gt;Lower Temperature, strengthen system prompt, check chunk integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT (Latency)&lt;/td&gt;
&lt;td&gt;Hardware / Load&lt;/td&gt;
&lt;td&gt;Check Cache Hit Rate, enable quantization or PagedAttention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Plan (Checklist)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instrument (Day 0)&lt;/strong&gt;: Set up export of metrics and traces (vLLM + OpenTelemetry).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden Set&lt;/strong&gt;: Collect 50–100 critical cases. Use Meta CRAG structure as reference (details in my article &lt;a href="https://dev.to/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k"&gt;Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate&lt;/strong&gt;: Integrate DeepEval/RAGAS into GitHub Actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling &amp;amp; Feedback&lt;/strong&gt;: Set up log and user feedback collection (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so its behavior becomes predictable — like the trajectory of a rocket on a verified orbit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>aiops</category>
      <category>rag</category>
    </item>
    <item>
      <title>Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 30 Dec 2025 15:31:54 +0000</pubDate>
      <link>https://dev.to/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</link>
      <guid>https://dev.to/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Want to skip the theory and launch a local RAG benchmark in Docker right now? Check out the &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction: Breaking the Infrastructure Barrier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://medium.com/@astronaut27/how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-docker-8ea8435f9fa2" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we prepped our "shuttle" for launch by containerizing the Meta CRAG infrastructure. It gave us a standardized environment, but we were still tethered to one expensive "ground control" dependency.&lt;/p&gt;

&lt;p&gt;The original benchmark baselines are &lt;strong&gt;resource-hungry&lt;/strong&gt;. They expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paid OpenAI API for final judging.&lt;/li&gt;
&lt;li&gt;GPU(CUDA) clusters to run inference via vLLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Developing a RAG system under these constraints feels like ordering expensive parts by mail when you already have the tools in your garage.&lt;/em&gt; You spend your budget on "shipping" (API tokens) and wait for external servers to reply, even though you have plenty of local horsepower sitting idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if you could launch the rocket from your own spaceport?&lt;/strong&gt; Right on your laptop, with &lt;strong&gt;zero cost per request&lt;/strong&gt; and total autonomy. We’re swapping external APIs for local inference using Ollama and Ray.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Architecture: The OpenAI-Compatible Interface&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest headache with academic benchmarks is their &lt;strong&gt;rigid stack.&lt;/strong&gt; Meta CRAG expects either vLLM or OpenAI by default. Rewriting the core evaluation logic is a recipe for bugs and broken metrics.&lt;/p&gt;

&lt;p&gt;Instead, we’ll take the engineering shortcut:&lt;/p&gt;

&lt;p&gt;We implemented a RAGOpenAICompatibleModel class. It uses the standard openai library but "hijacks" the data flow via the base_url variable. This lets us point the benchmark at a local Ollama instance without changing of the core logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; This gives us hot-swappable brains. Want to test Llama 3? Just change the key. Want to compare it against Qwen or Gemma? A quick export in your terminal is all it takes and a few lines in the configuration file.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Tuning the "Onboard Systems": Ray and HTML Cleanup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the cloud, you pay for convenience—you can feed raw HTML to LLM and hope it figures it out. In a local spaceport, &lt;strong&gt;resources are finite.&lt;/strong&gt; Every extra token is &lt;em&gt;dead weight&lt;/em&gt; (ballast).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" alt=" " width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠 Parallelism via Ray
&lt;/h3&gt;

&lt;p&gt;Processing hundreds of HTML pages for every question is heavy. We use &lt;strong&gt;Ray&lt;/strong&gt; to distribute the load: while the GPU is busy generating an answer, the idle CPU cores are &lt;strong&gt;"scrubbing" data&lt;/strong&gt; for the next batch in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧹 The "Space Junk" Filter
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;BeautifulSoup&lt;/code&gt; to strip tags is a &lt;strong&gt;survival requirement.&lt;/strong&gt; Local models with 8k context windows quickly "suffocate" under endless &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We clean the HTML.&lt;/li&gt;
&lt;li&gt;Split text into sentences.&lt;/li&gt;
&lt;li&gt;Cap snippets at 1000 characters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Result&lt;/em&gt;: We fit significantly more useful info into the context, boosting accuracy without needing massive model weights.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Field Testing: Real Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We picked three popular models to see how they handle a "combat" RAG scenario.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy (Correct)&lt;/th&gt;
&lt;th&gt;Hallucination&lt;/th&gt;
&lt;th&gt;Missing (I don't know)&lt;/th&gt;
&lt;th&gt;Final Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-2-9B&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3-8B&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;-0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen-2.5-7B&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;-1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Post-Mortem: Why did Qwen crash? 💥
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qwen’s results look catastrophic, but this is a &lt;strong&gt;huge engineering lesson&lt;/strong&gt;. It didn't fail because it was "stupid"—it failed because it violated the protocol.&lt;/p&gt;

&lt;p&gt;Typical Qwen output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;" &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; Okay, let's see. The user is asking about the producers... I need to check the references..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model started "thinking out loud" via the  tag, ignoring the instruction to &lt;em&gt;answer succinctly&lt;/em&gt;. In CRAG, any text that isn't the direct answer is flagged as a &lt;strong&gt;Hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; Models with forced &lt;em&gt;Chain-of-Thought (CoT)&lt;/em&gt; need heavy post-processing (stripping tags) or ювелирный (precise) prompting to keep them from turning a short answer into a philosophical essay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Try it Yourself: Code on GitHub&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Stop reading and start launching. I’ve prepped a repository with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker configs for easy deployment.&lt;/li&gt;
&lt;li&gt;Ollama adapters for local inference.&lt;/li&gt;
&lt;li&gt;Ray scripts for high-speed HTML cleaning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 Project Repo: &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Conclusion: Autonomy Achieved&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve proven that you don’t need a corporate budget to do serious RAG engineering.&lt;/p&gt;

&lt;p&gt;Our Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt;: Run the benchmark with a single command.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Exactly $0 per iteration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Your data never leaves your "space station."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Local evaluation is about building an honest development process where every change is backed by numbers, not just gut feeling.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Next Mission: RAGas vs. CRAG&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our spaceport is fully operational. But how does our local ground truth compare to popular metrics like &lt;strong&gt;RAGas&lt;/strong&gt;? In the next post, we’ll pit "RAGas" against the hard facts of CRAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you in orbit! 👨‍🚀✨&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 LLM Engine Telemetry: How to Profile Models and See Where Performance is Lost</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 27 Nov 2025 14:32:20 +0000</pubDate>
      <link>https://dev.to/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</link>
      <guid>https://dev.to/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“Any LLM is an engine. It can be massive or compact, but if you don't look at the telemetry, you'll never understand where you're burning energy inefficiently.”&lt;br&gt;
— Astronaut Engineer, Logbook #4&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🌌 Introduction: Why LLMs Need Profiling
&lt;/h2&gt;

&lt;p&gt;When engineers discuss LLM performance, three key phases are most often mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenization latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time To First Token)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;tokens/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;But it's easier to think of it this way:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An LLM is an engine, and the profiler is its dashboard. The rest is visible through the readings—and we're about to break them down.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Just like in real machinery:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startup is always more expensive than the cruising phase,&lt;/li&gt;
&lt;li&gt;Different engine components consume energy differently,&lt;/li&gt;
&lt;li&gt;The true picture is only visible through telemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;👨‍🚀 Caption: "Before launch—rely only on the instruments"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Mission Plan
&lt;/h2&gt;

&lt;p&gt;We are launching the &lt;strong&gt;GPT-2&lt;/strong&gt; model in three scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt&lt;/li&gt;
&lt;li&gt;Medium prompt&lt;/li&gt;
&lt;li&gt;Long prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test goes through three key phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt; — Preparing the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt; - The initial prompt processing that establishes TTFT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode / Steady-State&lt;/strong&gt; — The cruising phase of generation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenization time,&lt;/li&gt;
&lt;li&gt;TTFT,&lt;/li&gt;
&lt;li&gt;Generation speed (ms/token, tokens/sec),&lt;/li&gt;
&lt;li&gt;Memory usage (peakRSS),&lt;/li&gt;
&lt;li&gt;The most expensive low-level operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data is collected via &lt;em&gt;torch.profiler&lt;/em&gt; and displayed in TensorBoard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 LLM Operation Phases: What Happens Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tokenization - Input Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The text is converted into tokens using the chosen tokenizer. On short texts, measuring this phase can be highly susceptible to system noise (jitter), which is why tokenization is almost always measured separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prefill - Prompt Processing and Model State Establishment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this phase, the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the entire prompt through all layers once,&lt;/li&gt;
&lt;li&gt;Computes attention for the entire input sequence,&lt;/li&gt;
&lt;li&gt;Populates the KV-Cache for subsequent generation,&lt;/li&gt;
&lt;li&gt;Allocates temporary tensors and buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTFT = Prefill time + first Decode step time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTFT is the time required to complete the prompt processing and generate the first token. On a per-token basis, prefill is by far the most expensive phase, since the entire prompt is processed in one go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decode — Generating New Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After prefill, the model transitions to sequential generation. Each new token requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 forward pass → 1 token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Decode characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations are repeated with the same structure,&lt;/li&gt;
&lt;li&gt;The KV-Cache prevents re-computing attention for the entire prompt,&lt;/li&gt;
&lt;li&gt;Metrics become stable: &lt;code&gt;ms/token&lt;/code&gt;, &lt;code&gt;tokens/sec&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;📡 Experimental Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mission_profiler.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performs three launches (short / medium / long prompt)&lt;/li&gt;
&lt;li&gt;Executes two generations for each:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Prefill → TTFT and full generation → Steady&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saves traces to TensorBoard,&lt;/li&gt;
&lt;li&gt;Outputs a summary metrics table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ We do not perform any warmup, so the first run (short_prompt) may be slower.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🛠️ Launch Telemetry Yourself!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can replicate this "flight" and study the profiler logs on your own machine. &lt;/p&gt;

&lt;p&gt;All the code, settings, and launch instructions are available in the mission repository: &lt;a href="https://github.com/astronaut27/llm-profiler-mission" rel="noopener noreferrer"&gt;GitHub: LLM Profiler Mission - Engine Telemetry&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  📈 Mission Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;================= MISSION SUMMARY =================
tag&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; prompt_len&amp;nbsp; &amp;nbsp;tokenize&amp;nbsp; &amp;nbsp;TTFT(ms)&amp;nbsp; &amp;nbsp;steady(ms)&amp;nbsp; &amp;nbsp;actual_tok&amp;nbsp; &amp;nbsp;ms/token&amp;nbsp; &amp;nbsp;tok/s&amp;nbsp; &amp;nbsp;peakRSS(MB)
--------------------------------------------------------------------------------------------------------
short_prompt&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;19&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 6.6&amp;nbsp; &amp;nbsp; &amp;nbsp; 920.9&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 823.5&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;32&amp;nbsp; &amp;nbsp; &amp;nbsp; 25.73&amp;nbsp; &amp;nbsp; &amp;nbsp;38.9&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;2541.2
medium_prompt&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 56&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1.4&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;43.2&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1047.4&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;32&amp;nbsp; &amp;nbsp; &amp;nbsp; 32.73&amp;nbsp; &amp;nbsp; &amp;nbsp;30.6&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;2866.3
long_prompt&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;116&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1.7&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;32.5&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 894.0&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;32&amp;nbsp; &amp;nbsp; &amp;nbsp; 27.94&amp;nbsp; &amp;nbsp; &amp;nbsp;35.8&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;2886.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to Read These Numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Tokenize&lt;/strong&gt; We can't reliably compare tokenizers using this data—a dedicated, large-scale benchmark is needed. Short strings are heavily affected by system noise, so tokenization performance is evaluated separately.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;TTFT (Time-To-First-Token)&lt;/strong&gt; The most interesting observation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt → 921 ms&lt;/li&gt;
&lt;li&gt;Medium → 43 ms&lt;/li&gt;
&lt;li&gt;Long → 32 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Why the difference?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first run (short_prompt) bore the full impact of the &lt;strong&gt;cold start&lt;/strong&gt;: it includes CUDA/MPS warmup, allocations, and JIT compilation of kernels.&lt;/li&gt;
&lt;li&gt;TTFT is sensitive to the very first execution.&lt;/li&gt;
&lt;li&gt;In subsequent runs (medium, long prompt), after warmup, TTFT stabilizes, and the difference between the medium and long prompt becomes minimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TTFT should be measured either after a dedicated &lt;strong&gt;warmup&lt;/strong&gt; or averaged over several runs.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Steady-State (ms/token)&lt;/strong&gt; The cost per token remains relatively stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~26–33 ms/token&lt;/li&gt;
&lt;li&gt;~30–39 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the engine's speed in &lt;strong&gt;cruising mode&lt;/strong&gt;. As expected, the per-token latency shows almost no dependence on prompt length.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Peak RSS 2541 → 2866 → 2886 MB.&lt;/strong&gt; Memory usage jumps noticeably when going from the short to the medium prompt &lt;em&gt;(due to the growth of the KV-Cache and general allocations)&lt;/em&gt;, but further lengthening shows minimal increase. This confirms that the primary VRAM/RAM allocation is for the model itself, while the KV-cache consumes only a small fraction. Its size does, however, grow linearly with input length.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📊 Who is Really Consuming Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the profiler, all operations fall into two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔥 1. Main Thrust (Useful Work)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the GPU, these are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;addmm&lt;/li&gt;
&lt;li&gt;mm / matmul&lt;/li&gt;
&lt;li&gt;scaled_dot_product_attention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They consume the majority of the CUDA time. These are the matrix computation kernels—the operations that truly &lt;strong&gt;propel&lt;/strong&gt; the LLM engine forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ 2. Control Expenses (Overhead)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Utility operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_local_scalar_dense&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;item&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;copy_&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;to&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Mask checks (eq, all)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data movement,&lt;/li&gt;
&lt;li&gt;Synchronization between the CPU/host and the GPU/device,&lt;/li&gt;
&lt;li&gt;Scalar extraction,&lt;/li&gt;
&lt;li&gt;Utility logic for generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not the engine's thrust, but the cost of &lt;strong&gt;flight control.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🧭 The Big Picture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core computational kernels (attention, matmuls, addmm)&lt;/strong&gt;—these determine whether the model is fast or slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead operations&lt;/strong&gt; —these are non-productive costs that can be reduced through optimizations: minimizing synchronizations, using use_cache=True, and reducing the number of small tensor operations.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;CUDA&lt;/strong&gt;, matrix kernels dominate (as they should), but on MPS, utility operations often dominate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real profile of LLM performance is hidden in the balance between these two groups.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Why Profiling LLMs is Essential&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The profiler turns: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❌ "The model is running slow" into ✔ "Here is the specific operation that's consuming energy."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It helps reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the &lt;strong&gt;bottleneck&lt;/strong&gt; is located,&lt;/li&gt;
&lt;li&gt;The cost of prefill,&lt;/li&gt;
&lt;li&gt;The cost of each token,&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;memory&lt;/strong&gt; behaves,&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;overhead&lt;/strong&gt; created by HuggingFace generate().&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🏁 Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM is an engine. Sometimes powerful, sometimes compact, but always complex and sensitive to overloads. And until you open the profiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You won't see the expensive matrix operations,&lt;/li&gt;
&lt;li&gt;You won't see the &lt;strong&gt;synchronization overhead&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;You won't know the cost of prefill,&lt;/li&gt;
&lt;li&gt;You won't see the growth of the KV-Cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The profiler is our &lt;em&gt;flight recorder&lt;/em&gt;. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the engine is pulling,&lt;/li&gt;
&lt;li&gt;Where it's stalling,&lt;/li&gt;
&lt;li&gt;And where the energy is going.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;No one launches a rocket without a flight recorder.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 Choosing the Right Engine to Launch Your LLM (LM Studio, Ollama, and vLLM)</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 17:00:00 +0000</pubDate>
      <link>https://dev.to/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</link>
      <guid>https://dev.to/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;A Practical Field Guide for Engineers: LM Studio, Ollama, and vLLM&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When you’re building your first LLM ship, the hardest part isn’t takeoff — it’s choosing the right engine.”&lt;br&gt;
— Engineer-Astronaut, Mission Log №3&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In the LLM universe, everything moves at lightspeed.&lt;br&gt;
Sooner or later, every engineer faces the same question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how do you run a local model — fast, stable, and reliably?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio — a local capsule with a friendly interface.&lt;/li&gt;
&lt;li&gt;Ollama — a maneuverable shuttle for edge missions.&lt;/li&gt;
&lt;li&gt;vLLM — an industrial reactor for API workloads and GPU clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But which one is right for &lt;em&gt;your&lt;/em&gt; mission?&lt;br&gt;
This article isn’t just another benchmark — it’s a &lt;strong&gt;navigation map&lt;/strong&gt;, built by an engineer who has wrestled with GPU crashes, dependency hell, and Dockerization pains.&lt;/p&gt;


&lt;h2&gt;
  
  
  🪐 Personal Log.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“When I first tried LM Studio on my laptop, it was beautiful —&lt;br&gt;
until I needed to automate the launch.&lt;br&gt;
The GUI couldn’t be containerized, and the headless mode required extra tinkering.&lt;br&gt;
Then I switched to Ollama, and only with vLLM did I finally understand what a real production-grade workload feels like.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  ⚙️ 1. LM Studio — A Piloted Capsule for Local Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;LM Studio is a desktop application with a local OpenAI-compatible API.&lt;br&gt;
It lets you work offline and run models directly on your laptop.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//lmstudio.ai/docs"&gt;lmstudio&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Windows, Linux (AppImage).&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Download and install from &lt;a href="//lmstudio.ai/download"&gt;lmstudio.ai&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;GUI-only app — limited containerization;&lt;/li&gt;
&lt;li&gt;Experimental headless API;&lt;/li&gt;
&lt;li&gt;May overload CPU/GPU during long sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“LM Studio is a flight simulator — perfect for training,&lt;br&gt;
but it won’t take you into orbit.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🚀 2. Ollama — A Maneuverable Shuttle for Edge Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;An open-source CLI/desktop runtime for models like Mistral, Gemma, Phi-3, and Llama-3.&lt;br&gt;
It runs as a REST API and integrates easily into Docker.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//ollama.ai"&gt;ollama.ai&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Linux, Windows.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama run llama3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 ollama/ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Local REST APIs and edge inference;&lt;/li&gt;
&lt;li&gt;CI/CD and microservices;&lt;/li&gt;
&lt;li&gt;Quick launches without complex dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Ollama is a light shuttle —&lt;br&gt;
it can launch from any planet, but it won’t carry heavy cargo.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  ☀️ 3. vLLM — A Reactor for Production-Grade Flights
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;vLLM is a high-performance runtime for LLM inference,&lt;br&gt;
optimized for GPUs, fully OpenAI-API compatible, and designed for scaling.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//github.com/vllm-project/vllm"&gt;vllm&lt;/a&gt;&lt;br&gt;&lt;br&gt;
💻 Platforms: Linux and major cloud providers (AWS, GCP, Azure).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; meta-llama/Llama-3-8b-instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Product APIs and AI platforms;&lt;/li&gt;
&lt;li&gt;Multi-user environments;&lt;/li&gt;
&lt;li&gt;High-speed, CUDA-optimized inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Requires NVIDIA GPU (CUDA ≥ 12.x);&lt;/li&gt;
&lt;li&gt;Not compatible with macOS (no GPU backend);&lt;/li&gt;
&lt;li&gt;Needs DevOps experience — monitoring, logging, version sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“vLLM is a deep-space reactor — built for interstellar journeys.&lt;br&gt;
But if you try to fire it up in your garage, it simply won’t ignite.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🪐 The Mission Map — Which Engine to Choose
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" alt=" " width="665" height="1460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Common pitfalls:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio → limited containerization;&lt;/li&gt;
&lt;li&gt;Ollama → not all models available out of the box, though you can import from Hugging Face;&lt;/li&gt;
&lt;li&gt;vLLM → CUDA version mismatch causes kernel errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧩 Mission Debrief
&lt;/h3&gt;

&lt;p&gt;Every engine is built for its own orbit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio&lt;/strong&gt; — for solo flights and quick system checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — for agile edge missions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — for long-range, interstellar operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes an engineer’s mission isn’t to build a new engine —&lt;br&gt;
but to understand which existing one fits the current flight plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🛰️ Previous Missions
&lt;/h3&gt;

&lt;p&gt;🚀 &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;Prepared Meta’s CRAG Benchmark for Launch in Docker&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>🧑‍🚀 Mission Accomplished: How an Engineer-Astronaut Prepared Meta’s CRAG Benchmark for Launch in Docker</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 11:04:46 +0000</pubDate>
      <link>https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</link>
      <guid>https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Every ML system is like a spacecraft — powerful, intricate, and temperamental.&lt;br&gt;
But without telemetry, you have no idea where it’s headed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🌌 Introduction
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CRAG (Comprehensive RAG Benchmark)&lt;/strong&gt; from Meta AI is the control panel for Retrieval-Augmented Generation systems.&lt;br&gt;
It measures how well model responses stay grounded in facts, remain robust under noise, and maintain contextual relevance.&lt;/p&gt;

&lt;p&gt;As is often the case with research projects, CRAG required &lt;strong&gt;engineering adaptation&lt;/strong&gt; to operate reliably in a modern environment:&lt;br&gt;
incompatible library versions, dependency conflicts, unclear paths, and manual launch steps.&lt;/p&gt;

&lt;p&gt;🧰 I wanted to bring CRAG to a state where it could be launched with a single command — no dependency chaos, no manual fixes.&lt;br&gt;
The result is a fully reproducible Dockerized environment, available here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="//github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 What I Improved
&lt;/h2&gt;

&lt;p&gt;In the original build, several issues made CRAG difficult to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Conflicting library versions;&lt;/li&gt;
&lt;li&gt;⚙️ No unified, reproducible start-up workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, everything comes to life with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After building, two containers start automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛰️ mock-api — an emulator for web search and Knowledge Graph APIs;&lt;/li&gt;
&lt;li&gt;🚀 crag-app — the main container with the benchmark and built-in baseline models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧱 Pre-Launch Preparation: Handling the Mission Artifacts
&lt;/h2&gt;

&lt;p&gt;Before firing up the Docker build, make sure all mission artifacts — the large data and model files — are present locally.&lt;/p&gt;

&lt;p&gt;Because CRAG includes files over 100 MB, it uses &lt;strong&gt;Git Large File Storage (LFS)&lt;/strong&gt;. Without them, your container won’t initialize.&lt;/p&gt;

&lt;p&gt;So the first command in your console is essentially &lt;strong&gt;fueling the ship&lt;/strong&gt; with data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git lfs pull
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  📡 ⚙️ CRAG in Autonomous Mode
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;mock-API&lt;/em&gt;&lt;/strong&gt; — simulates external data sources (Web Search, KG API) used by the RAG system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;crag-app&lt;/em&gt;&lt;/strong&gt; — the main container running the benchmark and the model used for response generation (a dummy model at this stage).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;local_evaluation.py&lt;/em&gt;&lt;/strong&gt; — coordinates the pipeline, calls the mock API, and handles metric evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;ChatGPT&lt;/em&gt;&lt;/strong&gt; — serves as an LLM-assisted judge that evaluates generated responses by CRAG’s metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 What CRAG Measures: The Telemetry Dashboard
&lt;/h2&gt;

&lt;p&gt;CRAG reports quantitative indicators — a flight log of your system after a test mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;total&lt;/em&gt;&lt;/strong&gt;: Total number of evaluated examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_correct&lt;/em&gt;&lt;/strong&gt;: Count of responses that are fully supported by retrieved context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_hallucination&lt;/em&gt;&lt;/strong&gt;: Number of responses containing unsupported or invented facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_miss&lt;/em&gt;&lt;/strong&gt;: Responses missing key information or empty answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;accuracy/ score&lt;/em&gt;&lt;/strong&gt;: Overall precision (ratio of correct responses).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;hallucination&lt;/em&gt;&lt;/strong&gt;: Ratio = n_hallucination / total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;missing&lt;/em&gt;&lt;/strong&gt;: Ratio = n_miss / total.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 These metrics are &lt;strong&gt;the sensors on your RAG ship’s dashboard&lt;/strong&gt;.&lt;br&gt;
If any of them start flashing red — it’s time to check the model’s engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧱 Docker Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Mock API service for RAG data&lt;/span&gt;
  &lt;span class="na"&gt;mock-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../mock_api&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../deployments/Dockerfile.mock-api&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-mock-api&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../mock_api/cragkg:/app/cragkg&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PYTHONPATH=/app&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="c1"&gt;# CRAG application container&lt;/span&gt;
  &lt;span class="na"&gt;crag-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;..&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployments/Dockerfile.crag-app&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-app&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mock-api&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# OpenAI for evaluation (optional)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="c1"&gt;# Mock API connection (Docker service)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CRAG_MOCK_API_URL=http://mock-api:8000&lt;/span&gt;
      &lt;span class="c1"&gt;# Evaluation model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;EVALUATION_MODEL_NAME=${EVALUATION_MODEL_NAME:-gpt-4-0125-preview}&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Mount large data directories (read-only)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../data:/app/data:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../results:/app/results&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../example_data:/app/example_data:ro&lt;/span&gt;
      &lt;span class="c1"&gt;# Tokenizer (if needed)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../tokenizer:/app/tokenizer:ro&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local_evaluation.py"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;crag-network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🪐 Why This Matters
&lt;/h2&gt;

&lt;p&gt;RAG systems are quickly becoming the &lt;strong&gt;core engines of modern LLM-based products&lt;/strong&gt;.&lt;br&gt;
CRAG allows engineers to evaluate their reliability and factual grounding before shipping to production.&lt;/p&gt;

&lt;p&gt;This Docker build transforms Meta AI’s research benchmark into a &lt;strong&gt;practical engineering environment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 fully isolated and reproducible;&lt;/li&gt;
&lt;li&gt;🧠 runnable locally or in CI pipelines;&lt;/li&gt;
&lt;li&gt;🚀 easily extendable with your own models (for example, via LM Studio — coming in the next mission).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔭 The Next Mission
&lt;/h3&gt;

&lt;p&gt;Right now, CRAG runs on its built-in baselines — a test flight before mounting the real engine.&lt;br&gt;
The next step is integrating the &lt;strong&gt;LM Studio API&lt;/strong&gt; and evaluating a live LLM within the same container setup.&lt;br&gt;
That will be &lt;strong&gt;Mission II&lt;/strong&gt; 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Mission Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes engineering magic isn’t about building a brand-new ship,&lt;br&gt;
but about preparing an existing one for its next flight.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CRAG now launches reliably, telemetry is stable, and the mission is a success.&lt;/p&gt;

&lt;p&gt;Next up: integrating LM Studio and real models.&lt;br&gt;
For now, the ship holds a steady course. 🪐&lt;/p&gt;

&lt;h4&gt;
  
  
  🔗 Mission Repository
&lt;/h4&gt;

&lt;p&gt;📦 &lt;a href="//%F0%9F%93%A6%20github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📜 License&lt;br&gt;
CRAG is distributed under the MIT License, developed by Meta AI / Facebook Research.&lt;br&gt;
All modifications in &lt;a href="//github.com/astronaut27/CRAG_with_Docker"&gt;CRAG_with_Docker&lt;/a&gt; preserve the original copyright notices.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
