Digital Commander Shepard

Posted on • Originally published at digitalshepard.ai

I Ran Claude Code for 5 Hours and It Burned 26M Tokens. Here's How I Debugged It

The $47 Problem

I ran Claude Code for five hours on a large refactor. Opus model, lots of file reading, long context. When it finished, the CLI showed exactly one number: $47. No token breakdown, no trace of what the agent actually did, no explanation why the cost was that high.

So I pulled the session logs and started digging.

The first surprise: 26,364,893 cache read tokens. Not a typo. That was 99.9% of all tokens in the session. On long Claude Opus sessions, every turn re-reads the accumulated context from cache. Cache reads are cheaper ($1.50/M vs $15/M for fresh input), but tens of millions of tokens still add up. The second surprise: three context compactions I didn't know about, each one silently resetting the cache warmup cycle.
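A quick back-of-envelope check shows where the money went (using the per-million rates above; a sketch, not an exact invoice reconstruction):

```shell
# Rough cost check: 26.36M cache-read tokens at $1.50/M,
# vs. what the same volume would cost at the fresh-input rate of $15/M.
awk 'BEGIN {
  cache_reads = 26364893                         # tokens read from cache
  printf "cache-read cost  ~ $%.2f\n", cache_reads * 1.50 / 1e6
  printf "if fresh input   ~ $%.2f\n", cache_reads * 15.00 / 1e6
}'
```

Roughly $39.55 of the $47 was cache reads alone; at fresh-input rates the same session would have cost closer to $395.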

None of that is visible from the CLI. This post walks through the observability stack I built to fix that - for Claude Code, Codex CLI, and Gemini CLI, running locally in six Docker containers.


What You Get

Eight Grafana dashboards covering cost breakdown, token distribution, tool call analysis, per-provider deep dives, and a session timeline waterfall. Sixteen alert rules for sensitive file access, high token burn, session cost thresholds, and infrastructure health. Everything self-hosted, seven-day retention, no external services.

The stack ended up looking like a standard observability pipeline - just pointed at CLI hooks instead of application code.


Step 1 — Start the Stack

git clone https://github.com/shepard-system/shepard-obs-stack.git
cd shepard-obs-stack
docker compose up -d
./scripts/init.sh

docker compose up -d launches six containers: OTel Collector, Prometheus, Loki, Tempo, Alertmanager, and Grafana. init.sh provisions dashboards and alert rules. Grafana is available at localhost:3000 - credentials in .env.example.


Step 2 — Wire the CLIs

Each CLI ships its own hook mechanism. The stack provides shell hooks for all three:

./hooks/install.sh

The installer auto-detects which CLIs are present, backs up existing configs, and injects hooks via non-destructive jq merge. Each hook fires on its event, reads JSON from stdin, builds an OTLP payload, and POSTs it to the collector:

# hooks/lib/metrics.sh (core pattern)
emit_counter() {
  local name="$1" value="$2" labels_json="$3"

  local attrs
  attrs=$(jq -c '[to_entries[] | {key: .key, value: {stringValue: .value}}]' <<< "$labels_json")

  local payload
  payload=$(jq -n -c \
    --arg name "$name" --argjson value "$value" \
    --arg ts "$(date +%s)000000000" --argjson attrs "$attrs" \
    '{resourceMetrics: [{
      resource: {attributes: [{key: "service.name", value: {stringValue: "shepherd-hooks"}}]},
      scopeMetrics: [{metrics: [{
        name: $name,
        sum: {dataPoints: [{asDouble: $value, timeUnixNano: $ts, attributes: $attrs}],
              aggregationTemporality: 1, isMonotonic: true}
      }]}]
    }]}')

  curl -s -o /dev/null -XPOST "http://localhost:4318/v1/metrics" \
    -H "Content-Type: application/json" -d "$payload" & disown
}

bash, curl, jq - tools present on every Unix machine since before most AI startups were incorporated. The & disown is the entire performance strategy: the hook fires and exits, the agent never waits for telemetry.
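You can see what the attrs conversion inside emit_counter produces by feeding the same jq filter a labels object directly (the label names here are just illustrative):

```shell
# Convert a flat labels object into OTLP attribute pairs - same filter as the hook uses.
echo '{"tool":"Bash","status":"ok"}' |
  jq -c '[to_entries[] | {key: .key, value: {stringValue: .value}}]'
```

That prints `[{"key":"tool","value":{"stringValue":"Bash"}},{"key":"status","value":{"stringValue":"ok"}}]` - exactly the attribute shape OTLP expects inside each data point.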

If you want to eliminate the bash/jq/curl chain entirely, there's an optional Rust accelerator, shepard-hooks-rs. It replaces all nine hook scripts with a single compiled binary. The hooks auto-detect it via a three-line resolver: if the binary is present, the binary runs; if it's absent, the bash fallback takes over. Zero configuration, zero breakage.
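The resolver pattern is roughly this (a sketch; the actual path and function name in the repo may differ):

```shell
# Pick the accelerator binary when it exists and is executable;
# otherwise fall back to the bash implementation.
resolve_hook_runner() {
  local bin="${HOOK_DIR:-./hooks}/bin/shepard-hooks-rs"
  if [ -x "$bin" ]; then
    echo "$bin"     # compiled fast path
  else
    echo "bash"     # portable fallback, always present
  fi
}
```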

./scripts/install-accelerator.sh   # downloads to hooks/bin/, no sudo required

Verify the full pipeline after wiring:

./scripts/test-signal.sh

Eleven checks. Pass or fail with a clear message identifying which component broke.


Step 3 — The Dashboards

Cost - token breakdown, not just a total

Cost Dashboard - total cost, tokens, sessions, cache efficiency

The top row is six stat panels: total cost, total tokens, session count, input tokens, output tokens, and cache efficiency as a gauge. The gauge is what matters most for understanding long sessions. It's the ratio of cache reads to total tokens, and it tells you whether the agent is reusing context intelligently or rebuilding it from scratch on every turn.
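If you want to reproduce the gauge outside Grafana, the underlying query is a ratio along these lines (the metric names are my assumption, not necessarily the stack's exact series):

```promql
# Cache efficiency: share of all tokens served from cache over the dashboard range
sum(increase(claude_code_cache_read_tokens_total[$__range]))
/
sum(increase(claude_code_tokens_total[$__range]))
```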

Below that, Cache Economics is a stacked bar chart of cache reads vs. cache writes over time: green for reads, red for writes. In a healthy long session you want a lot of green building up and staying there. If you see heavy red on every turn, the context isn't warming up the way you think it is - and that's the panel that would have explained the $47 session immediately.

For Max subscription users: cost figures reflect API-equivalent rates, not your actual $20/$100/$200 monthly billing. The dollar figure was the least useful metric in the entire session; the cache reads and compaction count were the explanation.

If you're running a mix of Max subscription and AWS Bedrock (which is how I use it), the Cost dashboard breaks down usage by model in real time. us.anthropic.* on Bedrock and claude-opus-4-6 on Max show up as separate series in the Cost Over Time chart. You see exactly which model is burning at what rate, per session, as it's happening. The subscription obscures nothing; the API charges are visible immediately.

Session Timeline — the waterfall that doesn't exist anywhere else

Session Timeline – trace waterfall with tool spans and compaction events

When a session ends, the stop hook parses the raw JSONL session log and converts it into synthetic OpenTelemetry traces, then ships them to Tempo. The Claude parser is 265 lines of jq. It extracts tool calls, MCP operations, sub-agent invocations, thinking blocks, interruptions, and context compaction events - each becomes a span with a start time, duration, and status code.
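The core of the idea fits in a few lines. Here's a toy version, assuming a simplified JSONL shape (the real session schema differs, and the real parser handles many more event types):

```shell
# Toy JSONL -> span conversion: each tool call or compaction becomes one span-like object.
# Field names here are illustrative, not the actual Claude Code session schema.
printf '%s\n' \
  '{"type":"tool_use","name":"Bash","duration_ms":8200,"error":false}' \
  '{"type":"compaction","duration_ms":1500}' \
  '{"type":"tool_use","name":"Read","duration_ms":200,"error":true}' |
jq -c 'select(.type == "tool_use" or .type == "compaction")
       | {span: (.name // "context_compaction"),
          durationMs: .duration_ms,
          status: (if .error then "ERROR" else "OK" end)}'
```

Three input events become three span objects: a Bash span, a context_compaction span, and a Read span flagged ERROR. The real parser additionally computes start and end timestamps so Tempo can lay them out on a waterfall.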

Click a Trace ID in the Session Traces table and you get a full waterfall of the session: every Bash call, every Read, every Write, in order, with latency, with error status.

The compaction spans are the part that changes how you read session limits. When Claude Code compacts the context (collapses the conversation history to free up space) that compaction becomes a visible span with a timestamp and duration. Three compactions in a five-hour session means three points where the cache warmup cycle restarted. If your session hits a limit after two hours and you can't understand why, the timeline shows you the mechanism: compaction at 01:12, cache reads drop to zero, cost per turn increases as the context rebuilds. That's the explanation the CLI never gives you.

Below the waterfall, Top Tools by Invocation Count ranks every tool by frequency, and Tool Duration Distribution shows p95 latency by tool over time. In my sessions: Bash runs ~47 times, Read ~12 times. Bash p95 is around 8 seconds, Read p95 is 200ms. The agent wasn't stuck in a loop - it was waiting on shell execution. Different cause, different fix.

Every other LLM observability tool - LangSmith, Langfuse, Helicone - assumes you're instrumenting your own code that calls the model. Claude Code is the code. The only instrumentation entry point is hooks and JSONL.

Tools — "What is the agent actually doing?"

Tools Dashboard – call counts, top 10 tools, error rates

Tool call frequency, error rate per tool, calls per minute, and a Failing Tools table sorted by error count. In my stack Bash leads on errors - not because Bash is broken, but because Bash is where agents try things that don't work. An agent with zero errors is either perfect or not reporting failures honestly.

Operations and Quality

Operations shows events per minute, source breakdown by CLI, and a live Loki log stream at the bottom. When you're debugging a session in real time, you want a scrolling terminal, not another chart.

Quality tracks cache hit rate as a gauge and session counts over time. In my stack right now: 92.6% cache efficiency across 8 sessions. This is the dashboard nobody checks daily and everyone should.

Provider Deep Dives and Benchmarking

Claude Code gets 27 panels, Codex gets 26 (powered by Loki recording rules since Codex doesn't emit native metrics - the stack bridges that gap by extracting structured fields from logs and remote-writing to Prometheus), Gemini gets 31 including an API latency heatmap.
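For context, the Loki-to-Prometheus bridge for Codex looks roughly like this as a recording rule (the label selector, field name, and rule name are my guesses at the shape, not the stack's actual rules):

```yaml
# Sketch: extract token counts from Codex CLI JSON logs and record them as a metric,
# which the Loki ruler then remote-writes into Prometheus.
groups:
  - name: codex-bridge
    rules:
      - record: codex:total_tokens:sum_rate1m
        expr: |
          sum by (model) (
            sum_over_time(
              {service="codex-cli"} | json | unwrap total_tokens [1m]
            )
          )
```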

If you're evaluating which CLI fits your workflow, the stack also gives you a controlled measurement environment. What I mean:

- Run the same task on all three (same prompt, same codebase, same time window), then compare token distribution, reasoning token ratio (Gemini and Codex expose reasoning tokens separately; Claude folds them into output), cache efficiency, and tool call patterns.
- The Session Timeline shows not just what each agent did, but in what sequence and at what latency. One agent called Bash 40 times; another read more files up front and called Bash 8 times. Same result, different strategy, measurable difference.

That's what "which model is better for my workflow" looks like when it's data instead of vibes.


Step 4 — Alerts

Dashboards are for the hours when you're watching. Alerts cover the rest.

| Tier | Rules | What it catches |
| --- | --- | --- |
| Infrastructure | 6 | OTelCollectorDown, export failures, memory limits |
| Pipeline | 5 | LokiDown, TempoDown, PrometheusTargetDown, recording rule failures, NoTelemetryReceived |
| Business logic | 5 | HighSessionCost (>$10/hr), HighTokenBurn (>50k tok/min), HighToolErrorRate, SensitiveFileAccess, NoTelemetryReceived |
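The business-logic tier maps onto ordinary Prometheus alerting rules. A sketch of what HighTokenBurn might look like (the metric name and label set are assumptions, not the stack's exact rule):

```yaml
# Sketch: fire when any session burns more than 50k tokens/min, sustained for 2 minutes.
groups:
  - name: business-logic
    rules:
      - alert: HighTokenBurn
        expr: sum by (session_id) (rate(claude_code_tokens_total[1m])) * 60 > 50000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Session {{ $labels.session_id }} is burning >50k tokens/min"
```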

SensitiveFileAccess fires when an agent reads .env, credentials, or private keys — caught at PreToolUse, before the read completes. You'll know before the commit does. Inhibit rules suppress noise automatically: when the collector is down, business-logic alerts go quiet.

Alert routing lives in configs/alertmanager/alertmanager.yaml. Wire it to Slack, PagerDuty, or email - whatever you already use.


What It Doesn't Do

New CLIs need hook scripts written manually — there's no auto-discovery. Log and trace retention is seven days by default - retention_period: 168h in configs/loki/loki-config.yaml and block_retention: 168h in configs/tempo/tempo-config.yaml. Both are single-line changes if you want more. Nothing in the stack enforces the limit; it's just a sensible default for local storage. Claude Code doesn't export native traces, so session reconstruction relies entirely on log parsing; if Anthropic ships trace export, the stack will use it.


What's Next

The stack tells you what agents are doing. It doesn't control what they're allowed to do.

You see Bash called forty-seven times. You see three compactions in one session. You see the token burn climbing. The data doesn't intervene - it just watches. The next piece covers the governance layer: one MCP hub, three CLIs, a unified tool registry with RBAC and a full audit trail. Every tool call routed through one gate that can say no when the rules say no.

AI coding agents are already operational systems. But most tooling still treats them like chat interfaces with a receipt attached. That gap is where observability belongs.

Next: The Gate — one MCP hub, three CLIs, unified tooling.


GitHub

If this is useful, a star on GitHub helps other engineers find it.
Comments open if something's broken or missing.
