Mukunda Rao Katta

agenttap: see exactly what your LLM SDK sent to the wire, with API keys scrubbed

I lost an entire afternoon last month to a bug that had a one-word explanation: the SDK swapped system for system_prompt between two minor versions, and my retry path was building the message dict the old way.

The agent looked fine. The traces looked fine. The error message from Anthropic was a generic 400. I added log lines. I added more log lines. I printed the message dict. The dict looked right. The dict was right. The SDK was reshaping it on the way out.

The only thing that would have caught this in five minutes is what the provider actually received on the wire. That is what agenttap does.

The problem

Five years into the SDK era, "what was actually sent to the model?" is still a hard question.

SDK debug logging is verbose, leaks API keys into your terminal scrollback, and reformats payloads in transit. Callbacks scatter across vendor-specific abstractions and never agree on what a "request" is. The two-line solution everyone reaches for, httpx.Client(event_hooks=...), gives you a Request object but does not redact your Authorization header.

You end up reading the SDK source to figure out where the request body gets serialized. Then you patch a method. Then you forget you patched the method. Then a new SDK version changes the method name and your debug glue silently breaks.

The shape of the fix

agenttap installs as an httpx transport. You hand it to the client. Every call goes through it. Credentials get scrubbed on the way in. The exact request body sticks around in memory so you can inspect it after the call.

import httpx
import anthropic
from agenttap import Tap

tap = Tap()
client = anthropic.Anthropic(http_client=httpx.Client(transport=tap.transport()))

client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    messages=[{"role": "user", "content": "Hello"}],
)

print(tap.last.url)                  # https://api.anthropic.com/v1/messages
print(tap.last.pretty_request())     # exact JSON body sent
print(tap.last.response_status)      # 200

OpenAI works the same way:

import httpx
import openai
from agenttap import Tap

tap = Tap()
client = openai.OpenAI(http_client=httpx.Client(transport=tap.transport()))

client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": "Hi"}]
)

print(tap.last.request_body)

When you want to know why two calls produced different results, diff them:

from agenttap import diff

print(diff(tap.all[0], tap.all[1]))
# - "system": "v1: be helpful"
# + "system": "v2: be concise"

That diff call would have saved me my lost afternoon.
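
For readers curious how a wire-level diff like this could be built by hand, here is a minimal sketch using only the standard library. It is not agenttap's implementation: the function name diff_bodies is hypothetical, and it assumes both request bodies are already parsed JSON values.

```python
import difflib
import json


def diff_bodies(old, new):
    """Return a unified diff of two JSON-serializable request bodies.

    Hypothetical helper for illustration; not part of agenttap.
    """
    # Sorting keys and pretty-printing gives a stable, line-oriented form,
    # so only real changes in the payload show up in the diff.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="call[0]", tofile="call[1]", lineterm=""
        )
    )


first = {"max_tokens": 200, "system": "v1: be helpful"}
second = {"max_tokens": 200, "system": "v2: be concise"}
print(diff_bodies(first, second))
```

Sorting the keys before diffing is a deliberate choice here: it hides differences in key order, which rarely matter to the provider, and leaves only changed keys and values.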

What it does NOT do

  • It is not a proxy. It is not a server. There is nothing to deploy.
  • It does not normalize across providers. The whole point is to show what each provider actually received.
  • It does not persist. tap.all lives in memory. Write it to JSON yourself if you want a record.
  • It is not full observability. For traces and dashboards, ship the recorded calls into Phoenix, Langfuse, or OpenTelemetry.

Inside the lib (one design choice worth showing)

agenttap redacts in two places, and both matter.

Headers get scrubbed by name. The list is small and explicit: Authorization, x-api-key, api-key, cookie, anthropic-api-key, openai-organization, x-amz-security-token, x-google-api-key. This is the boring layer. Most credential leaks happen here, and a fixed list closes the door cleanly.

Body string values get scrubbed by regex against known credential shapes: OpenAI and Anthropic sk-…, AWS AKIA…, Google AIza…, Slack xox[baprs]-…. This is the layer that catches the credentials that should not be in your request body but somehow are, because some helper put them there.

from agenttap import Tap, Redactor

# Default: scrub headers + known credential patterns in body
tap = Tap()

# Opt out for local testing
tap = Tap(redactor=Redactor.none())

# Custom placeholder
tap = Tap(redactor=Redactor(placeholder="<hidden>"))

The reason this design matters: the first time you want to copy a tapped request into a Slack thread to ask a teammate for help, you do not have to do mental math about whether your key is in the screenshot. The default already removed it.
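
To make the two layers concrete, here is a minimal sketch of the idea, not agenttap's actual code. The header names are the ones listed above. The regexes, the "[REDACTED]" placeholder, and the function names redact_headers and redact_body are my assumptions for illustration; the library's real patterns may be stricter or looser.

```python
import re

# Header names from the list above; compared case-insensitively.
SENSITIVE_HEADERS = {
    "authorization", "x-api-key", "api-key", "cookie", "anthropic-api-key",
    "openai-organization", "x-amz-security-token", "x-google-api-key",
}

# Assumed credential shapes, written for this sketch only.
CREDENTIAL_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),         # OpenAI / Anthropic style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),              # AWS access key IDs
    re.compile(r"AIza[0-9A-Za-z_-]{35}"),         # Google API keys
    re.compile(r"xox[baprs]-[0-9A-Za-z-]{10,}"),  # Slack tokens
]

PLACEHOLDER = "[REDACTED]"  # assumed default placeholder


def redact_headers(headers):
    """Layer 1: replace the value of any header whose name is on the fixed list."""
    return {
        name: PLACEHOLDER if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def redact_body(value):
    """Layer 2: walk a parsed JSON body and scrub credential-shaped substrings."""
    if isinstance(value, str):
        for pattern in CREDENTIAL_PATTERNS:
            value = pattern.sub(PLACEHOLDER, value)
        return value
    if isinstance(value, dict):
        return {key: redact_body(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_body(item) for item in value]
    return value  # numbers, booleans, and None pass through unchanged
```

The header layer is exact and cheap. The body layer is a best-effort net: it only catches shapes it has a pattern for, which is why both layers are needed.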

When this is useful

  • You are debugging a "looks right, fails with 400" error from an LLM provider and need to see the exact bytes that hit the wire.
  • You are migrating between SDK versions and want to confirm the request shape did not silently change.
  • You are writing a Slack post or bug report and want to paste a real request body without leaking your key.
  • You are comparing two prompt variants and need a diff at the wire level, not at the application level.
  • You are reviewing a teammate's PR that touches request building and want a quick "what does this actually send?" sanity check.

When this is NOT what you want

  • You need long-term observability with dashboards. Use Phoenix, Langfuse, Helicone, or an OTel collector. agenttap is for the debug loop, not the production dashboard.
  • You need a proxy that fronts every call from every process. agenttap is in-process per client.
  • You are on an SDK that does not use httpx under the hood. The transport hook does not apply.

Install

pip install agenttap

Repo: https://github.com/MukundaKatta/agenttap

Sibling libraries

Library      Role
agentsnap    Snapshot tests for agent runs
cachebench   Per-call prompt-cache hit ratio + cost saved
llmfleet     Pool requests from many coroutines into provider Batch APIs
agentguard   Egress allowlist for tool calls
agenttrace   Cost + latency per run

agenttap is the debug-loop tool. The others cover CI and production. They compose: tap a call in development, snap it in CI, trace it in production.

What's next

I want to add a small TapServer adapter that exports captured calls as OpenTelemetry spans, so the same calls you tap in development can flow into the same dashboard you use in production. I also want a "compare against a saved JSON" assertion so you can pin the wire shape in a test.
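
Until that assertion exists, a hand-rolled version is short. The sketch below is not a preview of the planned API. The name assert_matches_snapshot and the update flag are hypothetical, and it assumes the request body is already a parsed JSON value.

```python
import json
from pathlib import Path


def assert_matches_snapshot(request_body, snapshot_path, update=False):
    """Pin the wire shape: fail if the body differs from the saved JSON.

    Hypothetical helper. The first run (or update=True) writes the snapshot.
    """
    path = Path(snapshot_path)
    # Sorted keys give a canonical form, so key order alone never fails the check.
    current = json.dumps(request_body, indent=2, sort_keys=True)
    if update or not path.exists():
        path.write_text(current, encoding="utf-8")
        return
    saved = path.read_text(encoding="utf-8")
    assert current == saved, f"wire shape changed; compare against {path}"
```

In a test, the body passed in would be whatever the tap recorded for the call under test. A renamed key, such as system becoming system_prompt, then fails the test instead of surfacing later as a generic 400.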

If you have ever lost an afternoon to a phantom SDK change, you already know why this exists.
