Every time you press Enter in Claude Code, something interesting happens behind the scenes. Your full conversation — system prompt, message history, tool definitions, everything — gets packaged into an API call and sent to Anthropic's servers.
But you never get to see those calls. Claude Code logs a JSONL transcript of what it did (tool calls, responses, thinking blocks), but not the raw API traffic that made it happen. The system prompt, HTTP headers, request parameters, latency per call, and one entirely hidden API call — all invisible.
So we built a way to see everything.
## The Trick: One Environment Variable
Claude Code officially supports `ANTHROPIC_BASE_URL` — an environment variable that redirects API traffic to a custom endpoint. It's meant for enterprise proxies, but it works perfectly for local interception:
```
Claude Code ──plain HTTP──▶ Sniffer (localhost:7735) ──HTTPS──▶ api.anthropic.com
                                      │
                                      ▼
                        ~/.claude/api-sniffer/*.jsonl
```
Start the sniffer in one terminal, launch Claude Code in another:
```bash
# Terminal 1
claudetui sniffer

# Terminal 2
claudetui sniff
```
`claudetui sniff` auto-detects the sniffer port and launches Claude Code through the proxy. If the sniffer isn't running, it falls back to launching Claude Code directly — so you never get stuck in a `ConnectionRefused` retry loop.
Every API call now flows through the proxy and gets logged. Claude Code works identically — it doesn't know (or care) that traffic is being captured.
No TLS interception. No certificates. No patching binaries. Just a localhost HTTP server that forwards to the real API over HTTPS.
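The forwarding pattern is simple enough to sketch with the standard library alone. This is a minimal illustration, not the sniffer's actual code — the class and function names here are ours:

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.anthropic.com"  # the proxy adds TLS on this leg only

def upstream_url(path: str) -> str:
    """Map a proxied request path onto the real API endpoint."""
    return UPSTREAM + path

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the client's request body (the full /v1/messages payload).
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(upstream_url(self.path), data=body, method="POST")
        # Pass through the headers the API needs.
        for name in ("content-type", "x-api-key", "anthropic-version"):
            if self.headers.get(name):
                req.add_header(name, self.headers[name])
        # Forward over HTTPS, then relay the response back to the client.
        # A real sniffer would log the request and response around this call.
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To run: HTTPServer(("127.0.0.1", 7735), ProxyHandler).serve_forever()
```

The real implementation also has to stream responses chunk by chunk rather than buffering them, which is covered later in the post.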
## What You See
The sniffer prints one line per API call as it happens:
```
ClaudeTUI API Sniffer — listening on http://127.0.0.1:7735
Use: ANTHROPIC_BASE_URL=http://localhost:7735 claude
Log: ~/.claude/api-sniffer/sniffer-20260314-103000.jsonl

#1 POST /v1/messages opus-4-6 45.2k->1.5k $0.120 2312ms 740KB/4.2KB 98%c [Tt]
#2 POST /v1/messages opus-4-6 48.1k->0.8k $0.094 1134ms 741KB/2.1KB 99%c [TU] Edit
#3 POST /v1/messages opus-4-6 50.3k->52 $0.081 1823ms 742KB/0.3KB 100%c [U] Glob,Grep
#4 POST /v1/messages opus-4-6 12.3k->2.1k $0.041 3412ms 42KB/6.8KB 95%c [Tt] compaction
#5 POST /v1/messages sonnet-4-6 14.3k->2.1k $0.008 2341ms 42KB/6.8KB [Tt] +agent.1

Summary: 5 requests | $0.344 | 170k in | 5.6k out | 2.3MB sent | 18KB recv | 1 sub-agent
```
Each line shows the model, input->output tokens, estimated cost, latency, traffic size, cache hit ratio, content block types, tool names, and sub-agent tracking. Compaction events are flagged automatically.
The content blocks tell you what Claude is doing: `T` = thinking, `t` = text, `U` = tool use, `S` = server-side tool (like WebSearch). The cache ratio (`98%c`) shows how much you're saving — a `0%c` (shown in red) means a cache miss, which is 12.5x more expensive.
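Deriving those codes from a response is a small mapping over the content array. A sketch (the letter codes match the output above; the helper name is ours):

```python
# Map API content block types to the one-letter codes in the sniffer output.
BLOCK_CODES = {
    "thinking": "T",         # extended thinking block
    "text": "t",             # plain text response
    "tool_use": "U",         # client-side tool call
    "server_tool_use": "S",  # server-side tool (e.g. WebSearch)
}

def block_summary(content: list) -> str:
    """Collapse a response's content blocks into a compact code like 'TtU'."""
    seen = []
    for block in content:
        code = BLOCK_CODES.get(block.get("type"), "?")
        if code not in seen:  # keep first occurrence only, in order
            seen.append(code)
    return "".join(seen)

blocks = [{"type": "thinking"}, {"type": "text"}, {"type": "tool_use"}]
print(block_summary(blocks))  # TtU
```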
Meanwhile, every request and response is logged as structured JSONL for later analysis.
## What Transcripts Don't Tell You
Claude Code's JSONL transcripts are useful, but they omit a lot. Here's what the sniffer captures that transcripts don't:
| Data | In Transcript? | In Sniffer? |
|---|---|---|
| Token usage (input/output/cache) | Yes | Yes |
| Raw system prompt | No | Yes |
| Full conversation history per request | No | Yes |
| Request parameters (max_tokens, temperature) | No | Yes |
| HTTP headers (anthropic-beta, version) | No | Yes |
| Request/response latency | No | Yes |
| Hidden compaction API call | No | Yes |
| Error response bodies | Partial | Yes |
| Streaming SSE events | No | Yes |
| Tool definitions (full JSON schema) | No | Yes |
The most interesting items on this list are the system prompt and the hidden compaction call.
## The System Prompt
Claude Code's system prompt is sent on every single API call. It contains:
- Claude Code's internal instructions and behavioral guidelines
- Tool definitions (Read, Write, Edit, Bash, Glob, Grep, etc.) with full JSON schemas
- Your CLAUDE.md project instructions
- Memory files, hooks output, and other injected context
- Safety and permission guidelines
With `--full` mode, the sniffer captures the complete system prompt text. In our sessions it consistently measures ~14k tokens — a fixed tax on every API call.
This is useful for understanding exactly what Claude Code "knows" about your project. Your CLAUDE.md, your hooks output, your memory files — it's all there in the system prompt, and now you can read it.
## The Hidden Compaction Call
This is the one we were most curious about.
When Claude Code's context window fills up (~167k of 200k tokens), it triggers compaction. The entire conversation gets compressed into a summary, and the next turn starts fresh with just the system prompt + summary.
But here's the thing: the API call that generates the compaction summary doesn't appear in the transcript. Claude Code makes it, receives the summary, and continues — but the JSONL transcript shows nothing. You see a `compact_boundary` marker, but not the actual summarization call.
The sniffer catches it because it's just another `POST /v1/messages`:

```
#12 POST /v1/messages opus-4-6 12.3k->2.1k $0.041 3412ms compaction
```
The sniffer detects compaction by comparing consecutive requests. When the message count drops by more than 50% or the total content size drops by more than 70% compared to the previous request, it flags it as post-compaction. The dramatic shrinkage — from 167k tokens of conversation down to a ~15k summary — is unmistakable.
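The heuristic fits in a few lines. A sketch with the thresholds from this section (the function name and field names are ours, not the sniffer's internals):

```python
def looks_like_compaction(prev: dict, curr: dict) -> bool:
    """Flag a request as post-compaction when the conversation shrinks sharply.

    prev/curr hold 'message_count' and 'content_bytes' for consecutive requests.
    """
    if prev["message_count"] == 0 or prev["content_bytes"] == 0:
        return False
    msg_drop = 1 - curr["message_count"] / prev["message_count"]
    size_drop = 1 - curr["content_bytes"] / prev["content_bytes"]
    # Thresholds from the text: >50% fewer messages or >70% less content.
    return msg_drop > 0.50 or size_drop > 0.70

before = {"message_count": 120, "content_bytes": 670_000}  # long conversation
after = {"message_count": 2, "content_bytes": 60_000}      # system prompt + summary
print(looks_like_compaction(before, after))  # True
```

Normal turns only ever grow the conversation, so a false positive would require the client to drop half its history for some other reason, which is exactly what compaction is.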
This is the only way to observe the compaction call's actual cost, latency, and output tokens. In our sessions, compaction summary generation takes 2-4 seconds and produces 11-19k tokens of compressed context.
## The Tool Use Loop
When you see a line like this in the sniffer:
```
#17 POST /v1/messages opus-4-6 114.3k->531 $0.217 16047ms [U] tool
```
That's 114k tokens in but only 531 out. Why so few output tokens? Because Claude isn't writing prose — it's calling a tool. The response is just a small JSON block:
```json
{"type": "tool_use", "name": "Read", "input": {"file_path": "/src/app.py"}}
```
Here's the full cycle for a single tool call:
1. Claude Code sends the full conversation to the API (114.3k input tokens — system prompt, message history, tool definitions, everything)
2. The API responds with a `tool_use` block — just the tool name and parameters (531 output tokens)
3. Claude Code executes the tool locally — reads the file, runs the command, whatever the tool does
4. Claude Code sends another request with the tool result appended as a `tool_result` message — now input tokens are higher because the file contents (or command output) are part of the conversation
That's why you see rapid back-to-back requests in the sniffer. A single "read this file and edit it" from the user might generate 5+ API calls:
```
#17 POST /v1/messages opus-4-6 114.3k->531  [U] tool   ← decide to read file
#18 POST /v1/messages opus-4-6 116.1k->204  [U] tool   ← decide to edit file
#19 POST /v1/messages opus-4-6 117.8k->1.2k [Tt]       ← respond to user
```
Each round-trip adds the tool result to the conversation, growing the input tokens. This is why context fills up faster than you'd expect — tool results (file contents, command output, search results) are often much larger than the tool call itself.
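Concretely, the loop is just a growing messages list. A schematic of one round-trip (message shapes follow the Messages API; the values are illustrative):

```python
# One tool-use round-trip: the assistant's tool call and its result both
# become part of the conversation sent on the next request.
messages = [
    {"role": "user", "content": "read src/app.py and fix the bug"},
]

# Request N: the model answers with a tool_use block, not prose.
messages.append({
    "role": "assistant",
    "content": [{"type": "tool_use", "id": "tu_1", "name": "Read",
                 "input": {"file_path": "/src/app.py"}}],
})

# Claude Code runs the tool locally, then request N+1 carries the result.
file_contents = "def main(): ...\n" * 200  # the tool output is often the big part
messages.append({
    "role": "user",
    "content": [{"type": "tool_result", "tool_use_id": "tu_1",
                 "content": file_contents}],
})

print(len(messages))  # 3; the next request re-sends all of it
```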
## SSE Streaming Under the Hood
Claude Code uses Server-Sent Events (SSE) for streaming responses. The API returns `text/event-stream` and sends data in chunks as the model generates tokens.
The sniffer handles this transparently — it forwards each chunk to Claude Code as it arrives (so you don't notice any delay), while capturing the entire stream for logging.
After streaming completes, it reassembles the SSE events to extract structured data: model, usage, stop reason, and content block types (text, thinking, tool_use). This is what makes the one-line terminal output possible — you get clean 45.2k->1.5k $0.120 2312ms instead of raw SSE data.
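SSE is a line-oriented format: each event is a `data: {json}` line followed by a blank line. A minimal sketch of pulling usage out of a captured stream (event shapes follow the Messages API streaming format; the parser itself is a simplified illustration):

```python
import json

def parse_sse(raw: str) -> dict:
    """Extract output token usage and stop reason from a captured SSE stream."""
    info = {"output_tokens": 0, "stop_reason": None}
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        # The final message_delta event carries cumulative output usage.
        if event.get("type") == "message_delta":
            info["output_tokens"] = event["usage"]["output_tokens"]
            info["stop_reason"] = event["delta"].get("stop_reason")
    return info

stream = (
    'data: {"type": "message_start", "message": {"usage": {"input_tokens": 45200}}}\n\n'
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}}\n\n'
    'data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, '
    '"usage": {"output_tokens": 1500}}\n\n'
)
print(parse_sse(stream))  # {'output_tokens': 1500, 'stop_reason': 'end_turn'}
```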
The key technical detail: we use `response.read1(8192)` instead of `response.read(8192)`. The `read1()` method reads whatever data is currently available without waiting for the full buffer to fill — critical for streaming, where you need to forward partial data immediately.
## Sub-Agent Tracking
When Claude Code spawns a sub-agent (via the Agent tool), the sub-agent makes its own API calls — often using a different model. The sniffer tracks these by session ID:
```
#8  POST /v1/messages opus-4-6   89.1k->3.2k $0.182 8234ms 99%c [TU] Agent
#9  POST /v1/messages sonnet-4-6 14.3k->2.1k $0.008 2341ms      [Tt] +agent.1
#10 POST /v1/messages sonnet-4-6 16.5k->1.2k $0.006 1823ms      [TU] Read agent.1
#11 POST /v1/messages sonnet-4-6 22.8k->0.5k $0.009 1243ms      [t]  agent.1
#12 POST /v1/messages opus-4-6   92.3k->1.5k $0.152 4312ms 99%c [Tt]
```
`+agent.1` marks the first request from a new sub-agent. Subsequent requests from the same agent show `agent.1`. The main session has no label.
This reveals things you can't see from the transcript: sub-agents often use Sonnet (cheaper, faster) for research tasks while the main session runs on Opus. You can see exactly how many API calls each sub-agent makes, their cost, and how they overlap with the main session.
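The labeling itself is simple bookkeeping. A sketch, assuming each request carries a distinguishable session identifier (how that identifier is extracted from the request is an assumption here; the real sniffer's field may differ):

```python
# Assign +agent.N / agent.N labels per session ID.
labels = {}  # session_id -> "agent.N"

def label_for(session_id, main_session):
    """Return '' for the main session, '+agent.N' on first sight, 'agent.N' after."""
    if session_id == main_session:
        return ""
    if session_id not in labels:
        labels[session_id] = f"agent.{len(labels) + 1}"
        return "+" + labels[session_id]  # first request from this sub-agent
    return labels[session_id]

print(repr(label_for("main", "main")))  # ''
print(repr(label_for("abc", "main")))   # '+agent.1'
print(repr(label_for("abc", "main")))   # 'agent.1'
```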
## Cache Misses — The Silent Cost Spike
The cache ratio (`98%c`, `100%c`) shows what percentage of input tokens were cache reads. Most of the time it's near 100% — great, you're paying the cheap rate.
But leave your session idle for ~5 minutes and watch what happens:
```
#6 POST /v1/messages opus-4-6 129.4k->15  $0.199  3336ms 100%c [t]
#7 POST /v1/messages opus-4-6 129.5k->428 $2.460 16108ms   0%c [Tt]
#8 POST /v1/messages opus-4-6 130.0k->600 $0.248 18310ms 100%c [Tt]
```
Request #7 cost $2.46 — 12.5x more than usual — because the cache had expired. All 129.5k tokens went through `cache_creation` at $18.75/M instead of `cache_read` at $1.50/M. Same data, same tokens, wildly different price.
The sniffer shows `0%c` in red to make these cache misses impossible to miss.
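The 12.5x figure falls straight out of the price table. A quick check with the numbers above (prices per million tokens):

```python
# Opus per-million-token prices for the two cache paths (from the text above).
CACHE_READ = 1.50    # $/M, cache hit
CACHE_WRITE = 18.75  # $/M, cache miss -> cache_creation

tokens = 129_500     # request #7's input
hit_cost = tokens / 1_000_000 * CACHE_READ
miss_cost = tokens / 1_000_000 * CACHE_WRITE

print(f"hit: ${hit_cost:.2f}  miss: ${miss_cost:.2f}  ratio: {CACHE_WRITE / CACHE_READ}x")
# hit: $0.19  miss: $2.43  ratio: 12.5x
```

The observed $2.46 for request #7 is this $2.43 of cache writes plus the call's regular input and output tokens.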
## Per-Request Cost Tracking
The sniffer calculates cost per API call using the token breakdown from the response:
```json
{
  "usage": {
    "input_tokens": 3,
    "cache_read_input_tokens": 45000,
    "cache_creation_input_tokens": 800,
    "output_tokens": 1500
  }
}
```
With model-specific pricing (Opus: $15/$1.50/$18.75/$75 per 1M tokens for input/cache-read/cache-write/output), each line shows the exact cost of that call. No estimation, no averaging — real cost per request.
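Putting the four rates together, per-request cost is just a dot product of the usage counts with the price table. A sketch using the Opus rates quoted above:

```python
# $ per million tokens: input, cache read, cache write, output (Opus rates above).
OPUS = {"input": 15.0, "cache_read": 1.50, "cache_write": 18.75, "output": 75.0}

def request_cost(usage: dict, prices: dict) -> float:
    """Exact cost of one API call from its usage block."""
    return (
        usage.get("input_tokens", 0) * prices["input"]
        + usage.get("cache_read_input_tokens", 0) * prices["cache_read"]
        + usage.get("cache_creation_input_tokens", 0) * prices["cache_write"]
        + usage.get("output_tokens", 0) * prices["output"]
    ) / 1_000_000

usage = {"input_tokens": 3, "cache_read_input_tokens": 45_000,
         "cache_creation_input_tokens": 800, "output_tokens": 1500}
print(f"${request_cost(usage, OPUS):.4f}")  # $0.1950
```

For the usage block above, the 45k cached tokens cost less than the 1.5k output tokens — output dominates most calls unless the cache misses.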
This revealed something we didn't expect: the variance between calls is huge. A simple response might cost $0.03, while a long code generation can cost $0.50+ — in the same session, same model.
## What We Learned
After running the sniffer on real sessions, a few things stood out:
1. The system prompt is remarkably stable. It barely changes between calls within a session. The ~14k tokens are almost entirely cached after the first call, making them cheap ($1.50/M vs $15/M for Opus). But they still consume context window space.
2. Compaction is expensive in latency, not just tokens. The summary generation call takes 2-4 seconds — during which Claude Code is unresponsive. On a long session with 3 compactions, that's 6-12 seconds of dead time.
3. Cache hit rates are extraordinary. In typical sessions, 95-98% of input tokens are cache reads. The stateless-API design sounds expensive, but caching makes it practical.
4. Error responses are more informative than you'd think. When Claude Code hits a 429 (rate limit) or 529 (overloaded), the response often carries a `retry-after` header and a detailed error message in the body. These are swallowed by Claude Code's retry logic and never shown to you.
5. Beta headers reveal feature flags. The `anthropic-beta` header shows which experimental features are active. Watching it change across Claude Code versions is interesting.
## Security Notes
The sniffer is designed to be safe by default:
- Localhost only — binds to `127.0.0.1`, never `0.0.0.0`
- API keys redacted — `x-api-key` and `authorization` headers are stripped from logs by default (use `--no-redact` to override)
- Restricted permissions — log files are created with `0o600` (owner read/write only)
- Local plaintext — the API key transits in plain text only over the loopback interface, which is standard for local proxy patterns
## Try It
The sniffer is part of ClaudeTUI:
```bash
# Install
brew tap slima4/claude-tui && brew install claude-tui && claudetui setup

# Or
curl -sSL https://raw.githubusercontent.com/slima4/claude-tui/main/install.sh | bash

# Run
claudetui sniffer            # Terminal 1: start proxy
claudetui sniff              # Terminal 2: launch claude through proxy
claudetui sniff --resume abc # or resume a session through proxy
```
Options:
```
--port PORT   Custom port (default: 7735)
--full        Log complete request/response bodies (warning: large files)
--no-redact   Include API keys in logs (use with caution)
--quiet       Suppress terminal output, log only
```
Python 3.8+, stdlib only — no external dependencies.
## What's Next
The sniffer captures data that was previously invisible. Combined with ClaudeTUI's existing context efficiency analysis, this gives a complete picture of what Claude Code is doing under the hood — from high-level token waste tracking down to raw HTTP traffic.
Some things we're exploring:
- Replaying captured sessions for cost modeling ("what would this session cost on Sonnet vs Opus?")
- Diffing system prompts across Claude Code versions to track changes
- Correlating latency with context size — does response time scale linearly with input tokens?
- Analyzing compaction summaries — what gets preserved and what gets lost?
If you're curious about what your Claude Code sessions actually look like at the API level, point the sniffer at a session and watch the data flow.
ClaudeTUI is open source and MIT licensed. Stdlib-only Python, zero external dependencies.
GitHub: github.com/slima4/claude-tui