I am not going to open with the number. The number is not the interesting part.
The interesting part is the pattern that made each library fast to ship.
The pattern
Every library in this sprint started the same way. I picked one failure mode in an agent loop. One specific way things go wrong. Then I wrote the smallest fix that addressed only that failure mode. Then I wrote tests until I was confident the fix held. Then I stopped.
That constraint, one failure, one fix, forced the scope to stay small. It meant I could start a new library, ship it to GitHub, write the tests, and close it out in a few hours. It also meant each library stayed easy to reason about. No "this also handles X" sprawl.
The pattern is not complicated. The hard part is resisting the urge to generalize.
What the boring problems actually are
If you build agents long enough, you run into a small set of recurring problems. They are not glamorous. They do not show up in demos. But they are the reason production agent systems fall over.
I grouped them into five buckets.
Safety and control
Agents do unexpected things when you are not watching. They call the same tool forty times. They spend more tokens than your budget allows. They make outbound HTTP calls to domains you did not sanction. They accept tool arguments with the wrong types and silently produce bad results.
These are not model problems. They are infrastructure problems. The model is doing exactly what you told it to do. You just did not install the guardrails.
Observability
Most agent frameworks give you logging. Logging is not observability. Observability means you can answer: why did the agent make this decision, where did this output come from, what did the exact JSON payload look like at the wire level, and how much did this run actually cost.
None of those answers come for free.
Reliability
Agents fail. Providers rate-limit. Context windows fill up. Long-running jobs crash halfway through. If your agent cannot recover from a failure without starting over, you are going to waste a lot of tokens and a lot of patience.
Reliability is boring infrastructure. Retry logic, idempotency keys, checkpointing, circuit breakers. The same stuff distributed systems people solved fifteen years ago, except now the thing failing is an LLM call instead of a database write.
Context and prompt management
The context window is the agent's working memory. Managing it badly causes hallucinations, dropped context, and cache misses. Managing it well is not magic. It is careful, mechanical work. Trim the right messages. Keep tool calls paired with their results. Warm the cache before you need it. Pin your prompt versions so you know what changed when behavior drifts.
Tool infrastructure
Tools are where most agent bugs live. The model calls a tool with a string where you expected an integer. It passes "true" instead of True. The tool returns five thousand words of raw HTML and you stuff it directly into the next message. The tool has side effects and you retry it blindly.
Tool infrastructure is the plumbing. It is the part that keeps the rest of the system honest.
Why zero dependencies matters
Every library in this sprint ships with zero production dependencies by default. That is not a performance requirement. It is a composition requirement.
If library A pulls in package X at version 2 and library B pulls in package X at version 3, you cannot use A and B together without resolving that conflict. Multiply that by fifty libraries and you have a dependency graph that nobody wants to debug.
Zero deps means each library is a leaf node. You can drop any subset of them into a project and nothing fights anything else. That is the composability that makes a collection of small libraries more useful than a monolith.
A few libraries have optional extras where the value is obvious. prompt-token-counter lets you bring your own tokenizer. tool-result-cache has an optional Redis backend. But the core path works without installing anything extra.
Why the test count is the signal
Every post in this series lists the test count. I did not put that number there to impress anybody. I put it there because it is the clearest signal that the library actually does what it says.
A library with twenty tests has a narrow scope. You can read the test file and understand the contract in five minutes. A library with two hundred tests probably has scope creep. You can use the test count as a rough proxy for whether the original constraint held.
The target for each library was twenty to fifty tests. Enough to cover the normal paths, the edge cases, and the error conditions. Not so many that the test suite is harder to read than the implementation.
The full library inventory
| Category | Library | What it fixes |
|---|---|---|
| Safety and control | prompt-shield | Injection detection, 5 composable rules |
| agentguard | Egress firewall, domain allowlist for agent tools | |
| agentvet | Tool arg validation with LLM-friendly retry hints | |
| tool-call-budgets | Per-tool call caps, stops runaway loops | |
| llm-stop-conditions | Composable stop conditions (MaxIters, MaxUsd, MaxTokens, MaxSeconds, NoProgress) | |
| tool-loop-guard | Sliding-window repeated-call detector | |
| agent-deadline | Cooperative per-task time cap | |
| llm-circuit-breaker-py | Closed/Open/HalfOpen circuit breaker, sync and async | |
| Observability | agentsnap | Snapshot tests for tool-call traces |
| agenttrace | Cost and latency tracking per run | |
| agent-decision-log | WHY layer, captures options, chosen, rationale, outcome | |
| agent-citation | WHERE layer, structured citations for agent outputs | |
| agent-replay-trace | Load and step through JSONL agent traces | |
| agenttap | Wire-level HTTP capture, redacts credentials | |
| prompt-replay | Record prompts, replay across providers, diff outputs | |
| Reliability | agent-resume | Checkpoint and resume long-running jobs |
| llm-retry-py | Full-jitter backoff, Anthropic/OpenAI/Bedrock/Gemini presets | |
| llm-fallback-chain | Ordered provider failover, sync and async | |
| agentidemp-py | Idempotency keys, sha256-hex and UUIDv5 variants | |
| agent-shadow-mode | Record-not-execute wrapper for staging | |
| Context and prompt | agentfit | Token-aware message truncation, multiple strategies |
| agent-message-window | Sliding window that keeps tool calls paired with results | |
| prompt-token-counter | Approximate token counts, BYO tokenizer | |
| llm-content-blocks | Anthropic content-block builder, no SDK dependency | |
| prompt-cache-warmer | Pre-warm Anthropic prompt cache, optional verify step | |
| prompt-template-version | Semver-pinned prompts, content hash per version | |
| prompt-eval-rubric | 0.0-1.0 scoring rubrics, weighted aggregation | |
| Tool infrastructure | tool-schema-from-fn | Function signature to Anthropic/OpenAI tool schema |
| tool-arg-defaults | Fill missing kwargs, caller overrides win | |
| tool-arg-coerce-py | JSON Schema-driven type coercion, records every conversion | |
| tool-arg-rename | snake_case, camelCase, PascalCase, kebab-case conversion | |
| tool-arg-fuzzy | Fuzzy enum match, conservative on ambiguity | |
| tool-result-cache | LRU plus TTL memoization for tool calls | |
| tool-call-cache | SHA-256-keyed memoization for LLM calls | |
| tool-side-effects-tag | READ/WRITE/IDEMPOTENT/DESTRUCTIVE tags | |
| tool-output-format | Render tool output as LLM-friendly markdown | |
| tool-output-truncate-py | Four truncation strategies, UTF-8 safe, zero deps | |
| tool-secret-scrubber | Strip credentials from tool logs | |
| tool-error-classify | Closed ErrorKind enum, Retry-After parsed | |
| agent-fn-registry | Registry of fn plus schema plus side-effect tags plus defaults | |
| agent-event-bus | Sync and async pub/sub for agent events | |
| Data and cost | bedrock-kit | AWS Bedrock wrapper, throttle plus cache-aware cost |
| claude-cost | Cache-aware cost calculator for Anthropic API | |
| bedrock-cost | Cross-vendor Bedrock pricing, inference-profile aware | |
| token-budget-py | Thread-safe shared token/USD cap | |
| llm-budget-window | Multi-window sliding budget (per minute, hour, day) | |
| anthropic-batch-kit | Submit/poll/retrieve with 50% batch-discount cost tally | |
| llmfleet | Concurrent Anthropic messages.create dispatcher | |
| llm-batch-coalesce | Single-flight for LLM calls, many callers, one request | |
| llm-pii-redact | Regex PII redaction, reversible placeholders | |
| conversation-codec | JSONL conversation persistence with optional encryption | |
| llm-output-validator | Rule-based validator for LLM strings | |
| llm-json-repair | Three-pass local repair for malformed LLM JSON | |
| llm-fallback-router | Ordered provider failover, AllProvidersFailedError | |
| llm-message-hash-py | Canonical JSON hash with per-provider noise-field drops |
Lessons
The scope constraint is the whole game. Every time I let a library get fuzzy about its purpose, the implementation got harder and the tests got longer. The libraries that shipped cleanest were the ones where I could write the problem statement in one sentence before writing any code.
Test count is an input, not an output. I set a target of twenty to fifty tests before I started each library. That target forced me to think about the contract before I thought about the implementation. What are the cases I need to cover? That question is often a better design tool than asking what the code should do.
Composability requires restraint. I had to actively resist adding convenience methods that combined two libraries into one. That kind of merging might be useful in a specific application, but it destroys composability at the infrastructure level. If you want agent-resume to call llm-retry internally, do that in your application. Do not do it in the library.
Zero deps is a forcing function. Deciding upfront that a library ships with zero production dependencies forced me to write the thing myself instead of wrapping an existing package. That sounds inefficient. In practice, a SHA-256 hash function, an LRU eviction policy, and a simple retry loop are all small enough to implement correctly in an afternoon. And now I own the behavior.
The boring problems are load-bearing. Nobody demos their circuit breaker. Nobody shows off their idempotency key generator. But those pieces are what let you run an agent in production without babysitting it. The interesting model behavior sits on top of this boring infrastructure. If the infrastructure is missing or unreliable, the model behavior does not matter.
What comes next
Most of these libraries are Python. Several have Rust siblings that I shipped alongside. The next step is getting the PyPI publishes unblocked (rate limits during the sprint slowed down the last twenty or so), then picking a few of the higher-traffic ones for better documentation.
The pattern that made this sprint work is also the pattern I will keep using. One failure mode, one fix, one library. There are still plenty of failure modes in agent loops that do not have a clean, small fix available yet. Those are interesting problems. The ones in this post were not interesting. They were just work. And doing the work makes the interesting problems easier to find.
All libraries are under MukundaKatta on GitHub.
Top comments (0)