Mukunda Rao Katta

Posted on May 25

I shipped 50 agent infrastructure libraries. Here is what I learned.

#hermeschallenge #ai #python #agents

I am not going to open with the number. The number is not the interesting part.

The interesting part is the pattern that made each library fast to ship.

The pattern

Every library in this sprint started the same way. I picked one failure mode in an agent loop. One specific way things go wrong. Then I wrote the smallest fix that addressed only that failure mode. Then I wrote tests until I was confident the fix held. Then I stopped.

That constraint, one failure, one fix, forced the scope to stay small. It meant I could start a new library, ship it to GitHub, write the tests, and close it out in a few hours. It also meant each library stayed easy to reason about. No "this also handles X" sprawl.

The pattern is not complicated. The hard part is resisting the urge to generalize.

What the boring problems actually are

If you build agents long enough, you run into a small set of recurring problems. They are not glamorous. They do not show up in demos. But they are the reason production agent systems fall over.

I grouped them into five buckets.

Safety and control

Agents do unexpected things when you are not watching. They call the same tool forty times. They spend more tokens than your budget allows. They make outbound HTTP calls to domains you did not sanction. They accept tool arguments with the wrong types and silently produce bad results.

These are not model problems. They are infrastructure problems. The model is doing exactly what you told it to do. You just did not install the guardrails.

Observability

Most agent frameworks give you logging. Logging is not observability. Observability means you can answer: why did the agent make this decision, where did this output come from, what did the exact JSON payload look like at the wire level, and how much did this run actually cost.

None of those answers come for free.

Reliability

Agents fail. Providers rate-limit. Context windows fill up. Long-running jobs crash halfway through. If your agent cannot recover from a failure without starting over, you are going to waste a lot of tokens and a lot of patience.

Reliability is boring infrastructure. Retry logic, idempotency keys, checkpointing, circuit breakers. The same stuff distributed systems people solved fifteen years ago, except now the thing failing is an LLM call instead of a database write.

Context and prompt management

The context window is the agent's working memory. Managing it badly causes hallucinations, dropped context, and cache misses. Managing it well is not magic. It is careful, mechanical work. Trim the right messages. Keep tool calls paired with their results. Warm the cache before you need it. Pin your prompt versions so you know what changed when behavior drifts.

Tool infrastructure

Tools are where most agent bugs live. The model calls a tool with a string where you expected an integer. It passes "true" instead of True. The tool returns five thousand words of raw HTML and you stuff it directly into the next message. The tool has side effects and you retry it blindly.

Tool infrastructure is the plumbing. It is the part that keeps the rest of the system honest.

Why zero dependencies matters

Every library in this sprint ships with zero production dependencies by default. That is not a performance requirement. It is a composition requirement.

If library A pulls in package X at version 2 and library B pulls in package X at version 3, you cannot use A and B together without resolving that conflict. Multiply that by fifty libraries and you have a dependency graph that nobody wants to debug.

Zero deps means each library is a leaf node. You can drop any subset of them into a project and nothing fights anything else. That is the composability that makes a collection of small libraries more useful than a monolith.

A few libraries have optional extras where the value is obvious. prompt-token-counter lets you bring your own tokenizer. tool-result-cache has an optional Redis backend. But the core path works without installing anything extra.

Why the test count is the signal

Every post in this series lists the test count. I did not put that number there to impress anybody. I put it there because it is the clearest signal that the library actually does what it says.

A library with twenty tests has a narrow scope. You can read the test file and understand the contract in five minutes. A library with two hundred tests probably has scope creep. You can use the test count as a rough proxy for whether the original constraint held.

The target for each library was twenty to fifty tests. Enough to cover the normal paths, the edge cases, and the error conditions. Not so many that the test suite is harder to read than the implementation.

The full library inventory

Category	Library	What it fixes
Safety and control	prompt-shield	Injection detection, 5 composable rules
	agentguard	Egress firewall, domain allowlist for agent tools
	agentvet	Tool arg validation with LLM-friendly retry hints
	tool-call-budgets	Per-tool call caps, stops runaway loops
	llm-stop-conditions	Composable stop conditions (MaxIters, MaxUsd, MaxTokens, MaxSeconds, NoProgress)
	tool-loop-guard	Sliding-window repeated-call detector
	agent-deadline	Cooperative per-task time cap
	llm-circuit-breaker-py	Closed/Open/HalfOpen circuit breaker, sync and async
Observability	agentsnap	Snapshot tests for tool-call traces
	agenttrace	Cost and latency tracking per run
	agent-decision-log	WHY layer, captures options, chosen, rationale, outcome
	agent-citation	WHERE layer, structured citations for agent outputs
	agent-replay-trace	Load and step through JSONL agent traces
	agenttap	Wire-level HTTP capture, redacts credentials
	prompt-replay	Record prompts, replay across providers, diff outputs
Reliability	agent-resume	Checkpoint and resume long-running jobs
	llm-retry-py	Full-jitter backoff, Anthropic/OpenAI/Bedrock/Gemini presets
	llm-fallback-chain	Ordered provider failover, sync and async
	agentidemp-py	Idempotency keys, sha256-hex and UUIDv5 variants
	agent-shadow-mode	Record-not-execute wrapper for staging
Context and prompt	agentfit	Token-aware message truncation, multiple strategies
	agent-message-window	Sliding window that keeps tool calls paired with results
	prompt-token-counter	Approximate token counts, BYO tokenizer
	llm-content-blocks	Anthropic content-block builder, no SDK dependency
	prompt-cache-warmer	Pre-warm Anthropic prompt cache, optional verify step
	prompt-template-version	Semver-pinned prompts, content hash per version
	prompt-eval-rubric	0.0-1.0 scoring rubrics, weighted aggregation
Tool infrastructure	tool-schema-from-fn	Function signature to Anthropic/OpenAI tool schema
	tool-arg-defaults	Fill missing kwargs, caller overrides win
	tool-arg-coerce-py	JSON Schema-driven type coercion, records every conversion
	tool-arg-rename	snake_case, camelCase, PascalCase, kebab-case conversion
	tool-arg-fuzzy	Fuzzy enum match, conservative on ambiguity
	tool-result-cache	LRU plus TTL memoization for tool calls
	tool-call-cache	SHA-256-keyed memoization for LLM calls
	tool-side-effects-tag	READ/WRITE/IDEMPOTENT/DESTRUCTIVE tags
	tool-output-format	Render tool output as LLM-friendly markdown
	tool-output-truncate-py	Four truncation strategies, UTF-8 safe, zero deps
	tool-secret-scrubber	Strip credentials from tool logs
	tool-error-classify	Closed ErrorKind enum, Retry-After parsed
	agent-fn-registry	Registry of fn plus schema plus side-effect tags plus defaults
	agent-event-bus	Sync and async pub/sub for agent events
Data and cost	bedrock-kit	AWS Bedrock wrapper, throttle plus cache-aware cost
	claude-cost	Cache-aware cost calculator for Anthropic API
	bedrock-cost	Cross-vendor Bedrock pricing, inference-profile aware
	token-budget-py	Thread-safe shared token/USD cap
	llm-budget-window	Multi-window sliding budget (per minute, hour, day)
	anthropic-batch-kit	Submit/poll/retrieve with 50% batch-discount cost tally
	llmfleet	Concurrent Anthropic messages.create dispatcher
	llm-batch-coalesce	Single-flight for LLM calls, many callers, one request
	llm-pii-redact	Regex PII redaction, reversible placeholders
	conversation-codec	JSONL conversation persistence with optional encryption
	llm-output-validator	Rule-based validator for LLM strings
	llm-json-repair	Three-pass local repair for malformed LLM JSON
	llm-fallback-router	Ordered provider failover, AllProvidersFailedError
	llm-message-hash-py	Canonical JSON hash with per-provider noise-field drops

Lessons

The scope constraint is the whole game. Every time I let a library get fuzzy about its purpose, the implementation got harder and the tests got longer. The libraries that shipped cleanest were the ones where I could write the problem statement in one sentence before writing any code.

Test count is an input, not an output. I set a target of twenty to fifty tests before I started each library. That target forced me to think about the contract before I thought about the implementation. What are the cases I need to cover? That question is often a better design tool than asking what the code should do.

Composability requires restraint. I had to actively resist adding convenience methods that combined two libraries into one. That kind of merging might be useful in a specific application, but it destroys composability at the infrastructure level. If you want agent-resume to call llm-retry internally, do that in your application. Do not do it in the library.

Zero deps is a forcing function. Deciding upfront that a library ships with zero production dependencies forced me to write the thing myself instead of wrapping an existing package. That sounds inefficient. In practice, a SHA-256 hash function, an LRU eviction policy, and a simple retry loop are all small enough to implement correctly in an afternoon. And now I own the behavior.

The boring problems are load-bearing. Nobody demos their circuit breaker. Nobody shows off their idempotency key generator. But those pieces are what let you run an agent in production without babysitting it. The interesting model behavior sits on top of this boring infrastructure. If the infrastructure is missing or unreliable, the model behavior does not matter.

What comes next

Most of these libraries are Python. Several have Rust siblings that I shipped alongside. The next step is getting the PyPI publishes unblocked (rate limits during the sprint slowed down the last twenty or so), then picking a few of the higher-traffic ones for better documentation.

The pattern that made this sprint work is also the pattern I will keep using. One failure mode, one fix, one library. There are still plenty of failure modes in agent loops that do not have a clean, small fix available yet. Those are interesting problems. The ones in this post were not interesting. They were just work. And doing the work makes the interesting problems easier to find.

All libraries are under MukundaKatta on GitHub.

DEV Community