DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

I shipped 50 agent infrastructure libraries. Here is what I learned.

I am not going to open with the number. The number is not the interesting part.

The interesting part is the pattern that made each library fast to ship.

The pattern

Every library in this sprint started the same way. I picked one failure mode in an agent loop. One specific way things go wrong. Then I wrote the smallest fix that addressed only that failure mode. Then I wrote tests until I was confident the fix held. Then I stopped.

That constraint, one failure, one fix, forced the scope to stay small. It meant I could start a new library, ship it to GitHub, write the tests, and close it out in a few hours. It also meant each library stayed easy to reason about. No "this also handles X" sprawl.

The pattern is not complicated. The hard part is resisting the urge to generalize.

What the boring problems actually are

If you build agents long enough, you run into a small set of recurring problems. They are not glamorous. They do not show up in demos. But they are the reason production agent systems fall over.

I grouped them into five buckets.

Safety and control

Agents do unexpected things when you are not watching. They call the same tool forty times. They spend more tokens than your budget allows. They make outbound HTTP calls to domains you did not sanction. They accept tool arguments with the wrong types and silently produce bad results.

These are not model problems. They are infrastructure problems. The model is doing exactly what you told it to do. You just did not install the guardrails.

Observability

Most agent frameworks give you logging. Logging is not observability. Observability means you can answer: why did the agent make this decision, where did this output come from, what did the exact JSON payload look like at the wire level, and how much did this run actually cost.

None of those answers come for free.

Reliability

Agents fail. Providers rate-limit. Context windows fill up. Long-running jobs crash halfway through. If your agent cannot recover from a failure without starting over, you are going to waste a lot of tokens and a lot of patience.

Reliability is boring infrastructure. Retry logic, idempotency keys, checkpointing, circuit breakers. The same stuff distributed systems people solved fifteen years ago, except now the thing failing is an LLM call instead of a database write.

Context and prompt management

The context window is the agent's working memory. Managing it badly causes hallucinations, dropped context, and cache misses. Managing it well is not magic. It is careful, mechanical work. Trim the right messages. Keep tool calls paired with their results. Warm the cache before you need it. Pin your prompt versions so you know what changed when behavior drifts.

Tool infrastructure

Tools are where most agent bugs live. The model calls a tool with a string where you expected an integer. It passes "true" instead of True. The tool returns five thousand words of raw HTML and you stuff it directly into the next message. The tool has side effects and you retry it blindly.

Tool infrastructure is the plumbing. It is the part that keeps the rest of the system honest.

Why zero dependencies matters

Every library in this sprint ships with zero production dependencies by default. That is not a performance requirement. It is a composition requirement.

If library A pulls in package X at version 2 and library B pulls in package X at version 3, you cannot use A and B together without resolving that conflict. Multiply that by fifty libraries and you have a dependency graph that nobody wants to debug.

Zero deps means each library is a leaf node. You can drop any subset of them into a project and nothing fights anything else. That is the composability that makes a collection of small libraries more useful than a monolith.

A few libraries have optional extras where the value is obvious. prompt-token-counter lets you bring your own tokenizer. tool-result-cache has an optional Redis backend. But the core path works without installing anything extra.

Why the test count is the signal

Every post in this series lists the test count. I did not put that number there to impress anybody. I put it there because it is the clearest signal that the library actually does what it says.

A library with twenty tests has a narrow scope. You can read the test file and understand the contract in five minutes. A library with two hundred tests probably has scope creep. You can use the test count as a rough proxy for whether the original constraint held.

The target for each library was twenty to fifty tests. Enough to cover the normal paths, the edge cases, and the error conditions. Not so many that the test suite is harder to read than the implementation.

The full library inventory

Category Library What it fixes
Safety and control prompt-shield Injection detection, 5 composable rules
agentguard Egress firewall, domain allowlist for agent tools
agentvet Tool arg validation with LLM-friendly retry hints
tool-call-budgets Per-tool call caps, stops runaway loops
llm-stop-conditions Composable stop conditions (MaxIters, MaxUsd, MaxTokens, MaxSeconds, NoProgress)
tool-loop-guard Sliding-window repeated-call detector
agent-deadline Cooperative per-task time cap
llm-circuit-breaker-py Closed/Open/HalfOpen circuit breaker, sync and async
Observability agentsnap Snapshot tests for tool-call traces
agenttrace Cost and latency tracking per run
agent-decision-log WHY layer, captures options, chosen, rationale, outcome
agent-citation WHERE layer, structured citations for agent outputs
agent-replay-trace Load and step through JSONL agent traces
agenttap Wire-level HTTP capture, redacts credentials
prompt-replay Record prompts, replay across providers, diff outputs
Reliability agent-resume Checkpoint and resume long-running jobs
llm-retry-py Full-jitter backoff, Anthropic/OpenAI/Bedrock/Gemini presets
llm-fallback-chain Ordered provider failover, sync and async
agentidemp-py Idempotency keys, sha256-hex and UUIDv5 variants
agent-shadow-mode Record-not-execute wrapper for staging
Context and prompt agentfit Token-aware message truncation, multiple strategies
agent-message-window Sliding window that keeps tool calls paired with results
prompt-token-counter Approximate token counts, BYO tokenizer
llm-content-blocks Anthropic content-block builder, no SDK dependency
prompt-cache-warmer Pre-warm Anthropic prompt cache, optional verify step
prompt-template-version Semver-pinned prompts, content hash per version
prompt-eval-rubric 0.0-1.0 scoring rubrics, weighted aggregation
Tool infrastructure tool-schema-from-fn Function signature to Anthropic/OpenAI tool schema
tool-arg-defaults Fill missing kwargs, caller overrides win
tool-arg-coerce-py JSON Schema-driven type coercion, records every conversion
tool-arg-rename snake_case, camelCase, PascalCase, kebab-case conversion
tool-arg-fuzzy Fuzzy enum match, conservative on ambiguity
tool-result-cache LRU plus TTL memoization for tool calls
tool-call-cache SHA-256-keyed memoization for LLM calls
tool-side-effects-tag READ/WRITE/IDEMPOTENT/DESTRUCTIVE tags
tool-output-format Render tool output as LLM-friendly markdown
tool-output-truncate-py Four truncation strategies, UTF-8 safe, zero deps
tool-secret-scrubber Strip credentials from tool logs
tool-error-classify Closed ErrorKind enum, Retry-After parsed
agent-fn-registry Registry of fn plus schema plus side-effect tags plus defaults
agent-event-bus Sync and async pub/sub for agent events
Data and cost bedrock-kit AWS Bedrock wrapper, throttle plus cache-aware cost
claude-cost Cache-aware cost calculator for Anthropic API
bedrock-cost Cross-vendor Bedrock pricing, inference-profile aware
token-budget-py Thread-safe shared token/USD cap
llm-budget-window Multi-window sliding budget (per minute, hour, day)
anthropic-batch-kit Submit/poll/retrieve with 50% batch-discount cost tally
llmfleet Concurrent Anthropic messages.create dispatcher
llm-batch-coalesce Single-flight for LLM calls, many callers, one request
llm-pii-redact Regex PII redaction, reversible placeholders
conversation-codec JSONL conversation persistence with optional encryption
llm-output-validator Rule-based validator for LLM strings
llm-json-repair Three-pass local repair for malformed LLM JSON
llm-fallback-router Ordered provider failover, AllProvidersFailedError
llm-message-hash-py Canonical JSON hash with per-provider noise-field drops

Lessons

The scope constraint is the whole game. Every time I let a library get fuzzy about its purpose, the implementation got harder and the tests got longer. The libraries that shipped cleanest were the ones where I could write the problem statement in one sentence before writing any code.

Test count is an input, not an output. I set a target of twenty to fifty tests before I started each library. That target forced me to think about the contract before I thought about the implementation. What are the cases I need to cover? That question is often a better design tool than asking what the code should do.

Composability requires restraint. I had to actively resist adding convenience methods that combined two libraries into one. That kind of merging might be useful in a specific application, but it destroys composability at the infrastructure level. If you want agent-resume to call llm-retry internally, do that in your application. Do not do it in the library.

Zero deps is a forcing function. Deciding upfront that a library ships with zero production dependencies forced me to write the thing myself instead of wrapping an existing package. That sounds inefficient. In practice, a SHA-256 hash function, an LRU eviction policy, and a simple retry loop are all small enough to implement correctly in an afternoon. And now I own the behavior.

The boring problems are load-bearing. Nobody demos their circuit breaker. Nobody shows off their idempotency key generator. But those pieces are what let you run an agent in production without babysitting it. The interesting model behavior sits on top of this boring infrastructure. If the infrastructure is missing or unreliable, the model behavior does not matter.

What comes next

Most of these libraries are Python. Several have Rust siblings that I shipped alongside. The next step is getting the PyPI publishes unblocked (rate limits during the sprint slowed down the last twenty or so), then picking a few of the higher-traffic ones for better documentation.

The pattern that made this sprint work is also the pattern I will keep using. One failure mode, one fix, one library. There are still plenty of failure modes in agent loops that do not have a clean, small fix available yet. Those are interesting problems. The ones in this post were not interesting. They were just work. And doing the work makes the interesting problems easier to find.


All libraries are under MukundaKatta on GitHub.

Top comments (0)