Every time you press Enter in Claude Code, something interesting happens behind the scenes. Your full conversation — system prompt, message history, tool definitions, everything — gets packaged into an API call and sent to Anthropic's servers.
But you never get to see those calls. Claude Code logs a JSONL transcript of what it did (tool calls, responses, thinking blocks), but not the raw API traffic that made it happen. The system prompt, HTTP headers, request parameters, latency per call, and one entirely hidden API call — all invisible.
So we built a way to see everything.
## The Trick: One Environment Variable
Claude Code officially supports `ANTHROPIC_BASE_URL` — an environment variable that redirects API traffic to a custom endpoint. It's meant for enterprise proxies, but it works perfectly for local interception:
```
Claude Code ──plain HTTP──▶ Sniffer (localhost:7735) ──HTTPS──▶ api.anthropic.com
                                      │
                                      ▼
                        ~/.claude/api-sniffer/*.jsonl
```
Start the sniffer in one terminal, launch Claude Code in another:
```bash
# Terminal 1
claudetui sniffer

# Terminal 2
claudetui sniff
```
`claudetui sniff` auto-detects the sniffer port and launches Claude Code through the proxy. If the sniffer isn't running, it falls back to launching Claude Code directly — so you never get stuck in a `ConnectionRefused` retry loop.
Every API call now flows through the proxy and gets logged. Claude Code works identically — it doesn't know (or care) that traffic is being captured.
No TLS interception. No certificates. No patching binaries. Just a localhost HTTP server that forwards to the real API over HTTPS.
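The forwarding pattern is simple enough to sketch with the standard library alone. This is a minimal illustration, not the sniffer's actual code — the class and function names here are ours:

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.anthropic.com"  # the proxy adds TLS on this leg only

def upstream_url(path: str) -> str:
    """Map a proxied request path onto the real API endpoint."""
    return UPSTREAM + path

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the client's request body (the full /v1/messages payload).
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(upstream_url(self.path), data=body, method="POST")
        # Pass through the headers the API needs.
        for name in ("content-type", "x-api-key", "anthropic-version"):
            if self.headers.get(name):
                req.add_header(name, self.headers[name])
        # Forward over HTTPS, then relay the response back to the client.
        # A real sniffer would log the request and response around this call.
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To run: HTTPServer(("127.0.0.1", 7735), ProxyHandler).serve_forever()
```

The real implementation also has to stream responses chunk by chunk rather than buffering them, which is covered later in the post.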
## What You See
The sniffer prints one line per API call as it happens:
```
ClaudeTUI API Sniffer — listening on http://127.0.0.1:7735
Use: ANTHROPIC_BASE_URL=http://localhost:7735 claude
Log: ~/.claude/api-sniffer/sniffer-20260314-103000.jsonl

#1 POST /v1/messages opus-4-6 45.2k->1.5k $0.120 2312ms 740KB/4.2KB 98%c [Tt]
#2 POST /v1/messages opus-4-6 48.1k->0.8k $0.094 1134ms 741KB/2.1KB 99%c [TU] Edit
#3 POST /v1/messages opus-4-6 50.3k->52 $0.081 1823ms 742KB/0.3KB 100%c [U] Glob,Grep
#4 POST /v1/messages opus-4-6 12.3k->2.1k $0.041 3412ms 42KB/6.8KB 95%c [Tt] compaction
#5 POST /v1/messages sonnet-4-6 14.3k->2.1k $0.008 2341ms 42KB/6.8KB [Tt] +agent.1

Summary: 5 requests | $0.344 | 170k in | 5.6k out | 2.3MB sent | 18KB recv | 1 sub-agent
```
Each line shows the model, input->output tokens, estimated cost, latency, traffic size, cache hit ratio, content block types, tool names, and sub-agent tracking. Compaction events are flagged automatically.
The content blocks tell you what Claude is doing: `T` = thinking, `t` = text, `U` = tool use, `S` = server-side tool (like WebSearch). The cache ratio (`98%c`) shows how much you're saving — a `0%c` (shown in red) means a cache miss, which is 12.5x more expensive.
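Deriving those codes from a response is a small mapping over the content array. A sketch (the letter codes match the output above; the helper name is ours):

```python
# Map API content block types to the one-letter codes in the sniffer output.
BLOCK_CODES = {
    "thinking": "T",         # extended thinking block
    "text": "t",             # plain text response
    "tool_use": "U",         # client-side tool call
    "server_tool_use": "S",  # server-side tool (e.g. WebSearch)
}

def block_summary(content: list) -> str:
    """Collapse a response's content blocks into a compact code like 'TtU'."""
    seen = []
    for block in content:
        code = BLOCK_CODES.get(block.get("type"), "?")
        if code not in seen:  # keep first occurrence only, in order
            seen.append(code)
    return "".join(seen)

blocks = [{"type": "thinking"}, {"type": "text"}, {"type": "tool_use"}]
print(block_summary(blocks))  # TtU
```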
Meanwhile, every request and response is logged as structured JSONL for later analysis.
## What Transcripts Don't Tell You
Claude Code's JSONL transcripts are useful, but they omit a lot. Here's what the sniffer captures that transcripts don't:
| Data | In Transcript? | In Sniffer? |
|---|---|---|
| Token usage (input/output/cache) | Yes | Yes |
| Raw system prompt | No | Yes |
| Full conversation history per request | No | Yes |
| Request parameters (max_tokens, temperature) | No | Yes |
| HTTP headers (anthropic-beta, version) | No | Yes |
| Request/response latency | No | Yes |
| Hidden compaction API call | No | Yes |
| Error response bodies | Partial | Yes |
| Streaming SSE events | No | Yes |
| Tool definitions (full JSON schema) | No | Yes |
The most interesting items on this list are the system prompt and the hidden compaction call.
## The System Prompt
Claude Code's system prompt is sent on every single API call. It contains:
- Claude Code's internal instructions and behavioral guidelines
- Tool definitions (Read, Write, Edit, Bash, Glob, Grep, etc.) with full JSON schemas
- Your CLAUDE.md project instructions
- Memory files, hooks output, and other injected context
- Safety and permission guidelines
With `--full` mode, the sniffer captures the complete system prompt text. In our sessions it consistently measures ~14k tokens — a fixed tax on every API call.
This is useful for understanding exactly what Claude Code "knows" about your project. Your CLAUDE.md, your hooks output, your memory files — it's all there in the system prompt, and now you can read it.
## The Hidden Compaction Call
This is the one we were most curious about.
When Claude Code's context window fills up (~167k of 200k tokens), it triggers compaction. The entire conversation gets compressed into a summary, and the next turn starts fresh with just the system prompt + summary.
But here's the thing: the API call that generates the compaction summary doesn't appear in the transcript. Claude Code makes it, receives the summary, and continues — but the JSONL transcript shows nothing. You see a `compact_boundary` marker, but not the actual summarization call.
The sniffer catches it because it's just another `POST /v1/messages`:

```
#12 POST /v1/messages opus-4-6 12.3k->2.1k $0.041 3412ms compaction
```
The sniffer detects compaction by comparing consecutive requests. When the message count drops by more than 50% or the total content size drops by more than 70% compared to the previous request, it flags it as post-compaction. The dramatic shrinkage — from 167k tokens of conversation down to a ~15k summary — is unmistakable.
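The heuristic fits in a few lines. A sketch with the thresholds from this section (the function name and field names are ours, not the sniffer's internals):

```python
def looks_like_compaction(prev: dict, curr: dict) -> bool:
    """Flag a request as post-compaction when the conversation shrinks sharply.

    prev/curr hold 'message_count' and 'content_bytes' for consecutive requests.
    """
    if prev["message_count"] == 0 or prev["content_bytes"] == 0:
        return False
    msg_drop = 1 - curr["message_count"] / prev["message_count"]
    size_drop = 1 - curr["content_bytes"] / prev["content_bytes"]
    # Thresholds from the text: >50% fewer messages or >70% less content.
    return msg_drop > 0.50 or size_drop > 0.70

before = {"message_count": 120, "content_bytes": 670_000}  # long conversation
after = {"message_count": 2, "content_bytes": 60_000}      # system prompt + summary
print(looks_like_compaction(before, after))  # True
```

Normal turns only ever grow the conversation, so a false positive would require the client to drop half its history for some other reason, which is exactly what compaction is.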
This is the only way to observe the compaction call's actual cost, latency, and output tokens. In our sessions, compaction summary generation takes 2-4 seconds and produces 11-19k tokens of compressed context.
## The Tool Use Loop
When you see a line like this in the sniffer:
```
#17 POST /v1/messages opus-4-6 114.3k->531 $0.217 16047ms [U] tool
```
That's 114k tokens in but only 531 out. Why so few output tokens? Because Claude isn't writing prose — it's calling a tool. The response is just a small JSON block:
```json
{"type": "tool_use", "name": "Read", "input": {"file_path": "/src/app.py"}}
```
Here's the full cycle for a single tool call:
1. Claude Code sends the full conversation to the API (114.3k input tokens — system prompt, message history, tool definitions, everything)
2. The API responds with a `tool_use` block — just the tool name and parameters (531 output tokens)
3. Claude Code executes the tool locally — reads the file, runs the command, whatever the tool does
4. Claude Code sends another request with the tool result appended as a `tool_result` message — now input tokens are higher because the file contents (or command output) are part of the conversation
That's why you see rapid back-to-back requests in the sniffer. A single "read this file and edit it" from the user might generate 5+ API calls:
```
#17 POST /v1/messages opus-4-6 114.3k->531  [U] tool   ← decide to read file
#18 POST /v1/messages opus-4-6 116.1k->204  [U] tool   ← decide to edit file
#19 POST /v1/messages opus-4-6 117.8k->1.2k [Tt]       ← respond to user
```
Each round-trip adds the tool result to the conversation, growing the input tokens. This is why context fills up faster than you'd expect — tool results (file contents, command output, search results) are often much larger than the tool call itself.
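Concretely, the loop is just a growing messages list. A schematic of one round-trip (message shapes follow the Messages API; the values are illustrative):

```python
# One tool-use round-trip: the assistant's tool call and its result both
# become part of the conversation sent on the next request.
messages = [
    {"role": "user", "content": "read src/app.py and fix the bug"},
]

# Request N: the model answers with a tool_use block, not prose.
messages.append({
    "role": "assistant",
    "content": [{"type": "tool_use", "id": "tu_1", "name": "Read",
                 "input": {"file_path": "/src/app.py"}}],
})

# Claude Code runs the tool locally, then request N+1 carries the result.
file_contents = "def main(): ...\n" * 200  # the tool output is often the big part
messages.append({
    "role": "user",
    "content": [{"type": "tool_result", "tool_use_id": "tu_1",
                 "content": file_contents}],
})

print(len(messages))  # 3; the next request re-sends all of it
```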
## SSE Streaming Under the Hood
Claude Code uses Server-Sent Events (SSE) for streaming responses. The API returns `text/event-stream` and sends data in chunks as the model generates tokens.
The sniffer handles this transparently — it forwards each chunk to Claude Code as it arrives (so you don't notice any delay), while capturing the entire stream for logging.
After streaming completes, it reassembles the SSE events to extract structured data: model, usage, stop reason, and content block types (text, thinking, tool_use). This is what makes the one-line terminal output possible — you get clean 45.2k->1.5k $0.120 2312ms instead of raw SSE data.
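SSE is a line-oriented format: each event is a `data: {json}` line followed by a blank line. A minimal sketch of pulling usage out of a captured stream (event shapes follow the Messages API streaming format; the parser itself is a simplified illustration):

```python
import json

def parse_sse(raw: str) -> dict:
    """Extract output token usage and stop reason from a captured SSE stream."""
    info = {"output_tokens": 0, "stop_reason": None}
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        # The final message_delta event carries cumulative output usage.
        if event.get("type") == "message_delta":
            info["output_tokens"] = event["usage"]["output_tokens"]
            info["stop_reason"] = event["delta"].get("stop_reason")
    return info

stream = (
    'data: {"type": "message_start", "message": {"usage": {"input_tokens": 45200}}}\n\n'
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}}\n\n'
    'data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, '
    '"usage": {"output_tokens": 1500}}\n\n'
)
print(parse_sse(stream))  # {'output_tokens': 1500, 'stop_reason': 'end_turn'}
```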
The key technical detail: we use `response.read1(8192)` instead of `response.read(8192)`. The `read1()` method reads whatever data is currently available without waiting for the full buffer to fill — critical for streaming, where you need to forward partial data immediately.
## Sub-Agent Tracking
When Claude Code spawns a sub-agent (via the Agent tool), the sub-agent makes its own API calls — often using a different model. The sniffer tracks these by session ID:
```
#8  POST /v1/messages opus-4-6   89.1k->3.2k $0.182 8234ms 99%c [TU] Agent
#9  POST /v1/messages sonnet-4-6 14.3k->2.1k $0.008 2341ms      [Tt] +agent.1
#10 POST /v1/messages sonnet-4-6 16.5k->1.2k $0.006 1823ms      [TU] Read agent.1
#11 POST /v1/messages sonnet-4-6 22.8k->0.5k $0.009 1243ms      [t]  agent.1
#12 POST /v1/messages opus-4-6   92.3k->1.5k $0.152 4312ms 99%c [Tt]
```
`+agent.1` marks the first request from a new sub-agent. Subsequent requests from the same agent show `agent.1`. The main session has no label.
This reveals things you can't see from the transcript: sub-agents often use Sonnet (cheaper, faster) for research tasks while the main session runs on Opus. You can see exactly how many API calls each sub-agent makes, their cost, and how they overlap with the main session.
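The labeling itself is simple bookkeeping. A sketch, assuming each request carries a distinguishable session identifier (how that identifier is extracted from the request is an assumption here; the real sniffer's field may differ):

```python
# Assign +agent.N / agent.N labels per session ID.
labels = {}  # session_id -> "agent.N"

def label_for(session_id, main_session):
    """Return '' for the main session, '+agent.N' on first sight, 'agent.N' after."""
    if session_id == main_session:
        return ""
    if session_id not in labels:
        labels[session_id] = f"agent.{len(labels) + 1}"
        return "+" + labels[session_id]  # first request from this sub-agent
    return labels[session_id]

print(repr(label_for("main", "main")))  # ''
print(repr(label_for("abc", "main")))   # '+agent.1'
print(repr(label_for("abc", "main")))   # 'agent.1'
```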
## Cache Misses — The Silent Cost Spike
The cache ratio (`98%c`, `100%c`) shows what percentage of input tokens were cache reads. Most of the time it's near 100% — great, you're paying the cheap rate.
But leave your session idle for ~5 minutes and watch what happens:
```
#6 POST /v1/messages opus-4-6 129.4k->15  $0.199  3336ms 100%c [t]
#7 POST /v1/messages opus-4-6 129.5k->428 $2.460 16108ms   0%c [Tt]
#8 POST /v1/messages opus-4-6 130.0k->600 $0.248 18310ms 100%c [Tt]
```
Request #7 cost $2.46 — 12.5x more than usual — because the cache had expired. All 129.5k tokens went through `cache_creation` at $18.75/M instead of `cache_read` at $1.50/M. Same data, same tokens, wildly different price.
The sniffer shows `0%c` in red to make these cache misses impossible to miss.
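The 12.5x figure falls straight out of the price table. A quick check with the numbers above (prices per million tokens):

```python
# Opus per-million-token prices for the two cache paths (from the text above).
CACHE_READ = 1.50    # $/M, cache hit
CACHE_WRITE = 18.75  # $/M, cache miss -> cache_creation

tokens = 129_500     # request #7's input
hit_cost = tokens / 1_000_000 * CACHE_READ
miss_cost = tokens / 1_000_000 * CACHE_WRITE

print(f"hit: ${hit_cost:.2f}  miss: ${miss_cost:.2f}  ratio: {CACHE_WRITE / CACHE_READ}x")
# hit: $0.19  miss: $2.43  ratio: 12.5x
```

The observed $2.46 for request #7 is this $2.43 of cache writes plus the call's regular input and output tokens.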
## Per-Request Cost Tracking
The sniffer calculates cost per API call using the token breakdown from the response:
```json
{
  "usage": {
    "input_tokens": 3,
    "cache_read_input_tokens": 45000,
    "cache_creation_input_tokens": 800,
    "output_tokens": 1500
  }
}
```
With model-specific pricing (Opus: $15/$1.50/$18.75/$75 per 1M tokens for input/cache-read/cache-write/output), each line shows the exact cost of that call. No estimation, no averaging — real cost per request.
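Putting the four rates together, per-request cost is just a dot product of the usage counts with the price table. A sketch using the Opus rates quoted above:

```python
# $ per million tokens: input, cache read, cache write, output (Opus rates above).
OPUS = {"input": 15.0, "cache_read": 1.50, "cache_write": 18.75, "output": 75.0}

def request_cost(usage: dict, prices: dict) -> float:
    """Exact cost of one API call from its usage block."""
    return (
        usage.get("input_tokens", 0) * prices["input"]
        + usage.get("cache_read_input_tokens", 0) * prices["cache_read"]
        + usage.get("cache_creation_input_tokens", 0) * prices["cache_write"]
        + usage.get("output_tokens", 0) * prices["output"]
    ) / 1_000_000

usage = {"input_tokens": 3, "cache_read_input_tokens": 45_000,
         "cache_creation_input_tokens": 800, "output_tokens": 1500}
print(f"${request_cost(usage, OPUS):.4f}")  # $0.1950
```

For the usage block above, the 45k cached tokens cost less than the 1.5k output tokens — output dominates most calls unless the cache misses.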
This revealed something we didn't expect: the variance between calls is huge. A simple response might cost $0.03, while a long code generation can cost $0.50+ — in the same session, same model.
## What We Learned
After running the sniffer on real sessions, a few things stood out:
1. The system prompt is remarkably stable. It barely changes between calls within a session. The ~14k tokens are almost entirely cached after the first call, making them cheap ($1.50/M vs $15/M for Opus). But they still consume context window space.
2. Compaction is expensive in latency, not just tokens. The summary generation call takes 2-4 seconds — during which Claude Code is unresponsive. On a long session with 3 compactions, that's 6-12 seconds of dead time.
3. Cache hit rates are extraordinary. In typical sessions, 95-98% of input tokens are cache reads. The stateless-API design sounds expensive, but caching makes it practical.
4. Error responses are more informative than you'd think. When Claude Code hits a 429 (rate limit) or 529 (overloaded), the response often carries a `retry-after` header and a detailed error message in the body. These are swallowed by Claude Code's retry logic and never shown to you.
5. Beta headers reveal feature flags. The `anthropic-beta` header shows which experimental features are active. Watching it change across Claude Code versions is interesting.
## Security Notes
The sniffer is designed to be safe by default:
- Localhost only — binds to `127.0.0.1`, never `0.0.0.0`
- API keys redacted — `x-api-key` and `authorization` headers are stripped from logs by default (use `--no-redact` to override)
- Restricted permissions — log files are created with `0o600` (owner read/write only)
- Local plaintext — the API key transits in plain text only over the loopback interface, which is standard for local proxy patterns
## Try It
The sniffer is part of ClaudeTUI:
```bash
# Install
brew tap slima4/claude-tui && brew install claude-tui && claudetui setup

# Or
curl -sSL https://raw.githubusercontent.com/slima4/claude-tui/main/install.sh | bash

# Run
claudetui sniffer            # Terminal 1: start proxy
claudetui sniff              # Terminal 2: launch claude through proxy
claudetui sniff --resume abc # or resume a session through proxy
```
Options:
```
--port PORT   Custom port (default: 7735)
--full        Log complete request/response bodies (warning: large files)
--no-redact   Include API keys in logs (use with caution)
--quiet       Suppress terminal output, log only
```
Python 3.8+, stdlib only — no external dependencies.
## What's Next
The sniffer captures data that was previously invisible. Combined with ClaudeTUI's existing context efficiency analysis, this gives a complete picture of what Claude Code is doing under the hood — from high-level token waste tracking down to raw HTTP traffic.
Some things we're exploring:
- Replaying captured sessions for cost modeling ("what would this session cost on Sonnet vs Opus?")
- Diffing system prompts across Claude Code versions to track changes
- Correlating latency with context size — does response time scale linearly with input tokens?
- Analyzing compaction summaries — what gets preserved and what gets lost?
If you're curious about what your Claude Code sessions actually look like at the API level, point the sniffer at a session and watch the data flow.
ClaudeTUI is open source and MIT licensed. Stdlib-only Python, zero external dependencies.
GitHub: github.com/slima4/claude-tui