What I Learned Shipping 50 Agent-Stack Libraries in One Sprint

#hermeschallenge #ai #python #agents

About three weeks ago I looked at my notes from the last six agent projects and noticed the same list of problems in every one. Infinite loops. Secret keys in logs. No cost tracking. Provider outages with no fallback. Context window overflows at the worst possible moment. PII leaking to third-party APIs.

Every team had solved some of these. None had solved all of them. And every solution was bespoke: a custom retry decorator written in 2023 that nobody fully understood anymore, a log scrubber script that only ran in one environment, a context trimmer that broke tool_use/tool_result pairing in ways that caused subtle model errors.

I wanted to know: could a set of small, zero-dependency, single-responsibility libraries solve this better than custom code or a monolithic framework? And could I build them fast enough to test the theory during a sprint?

The answer to the first question is yes. The answer to the second is yes, with a lot of friction I did not expect.

The Original Problem

The failure modes I kept seeing fell into five buckets. Safety (the agent did something it should not have done). Observability (we had no idea what happened or what it cost). Reliability (the agent failed when a provider had a bad hour). Context management (the context window overflowed and the request errored). Tool infrastructure (arguments were wrong, calls were duplicated, results were not cached).

Every project eventually addresses all five. But they address them reactively, one incident at a time, with uncoordinated patches that do not compose. The patches accumulate until the codebase is hard to reason about.

The bet was that small libraries addressing specific problems would compose better than patches because each library has an explicit contract, an explicit test suite, and can be swapped out independently.

What Worked

Zero-dependency discipline. This was the single most important decision. Every Python library ships with zero runtime dependencies beyond the standard library. Every Rust crate ships with zero external crates beyond what the algorithm strictly requires.

The reason this matters is practical, not philosophical. Agent projects already have complex dependency graphs. LangChain, LiteLLM, various provider SDKs, async frameworks. Every dependency you add to a utility library risks version conflicts. Zero deps means pip install tool-loop-guard works in any environment, every time, with no conflicts. It also makes the libraries auditable. You can read all the code.

Single responsibility. Each library does one thing and exposes a small surface area. The public API for most of these libraries fits in five to ten lines. That means the cost of learning the library is low. You read the README, look at one example, and you understand the whole thing. You do not need to understand the whole ecosystem to use one piece.

Test-first. Writing tests before the implementation forced each library to have a clean external interface. If it was hard to test, it meant the interface was wrong. Several libs went through two or three interface revisions before the tests were easy to write. The discipline made the final interfaces better.

Keeping test counts visible in the README. This sounds like vanity but it was useful. The badge showing "14 tests passing" made it obvious that the library was not just a stub. It also created a personal commitment: publishing a library with one test would feel embarrassing. The social pressure was useful.

Rust ports for algorithmic libs. The sliding window for loop detection, the LRU cache for tool result memoization, the circuit breaker state machine, the token bucket rate limiter. These are well-understood algorithms that are a better fit for Rust than Python. Writing the Python version first helped clarify the interface. Writing the Rust version second was faster than I expected because the algorithm was already clear.

What Did Not Work

Bulk PyPI publish. On the final day of the sprint I queued twenty-plus PyPI publishes in sequence. The first five went through. Then the 429s started. PyPI rate limits new package creation and does not document the exact threshold. Each retry after a 429 extends the cooldown. By end of day I had a queue of pending packages and had to wait overnight.

The lesson: publish in batches of five, spaced at least an hour apart. The rate limit is specifically on new package creation, not on new versions of existing packages.

crates.io publish queue. Similar problem with Rust crates, though the limits are less severe. The publish queue backed up and some crates took 30-60 minutes to become available after publishing. Not a blocker, just slower than expected.

GitHub Push Protection on realistic test fixtures. I had test fixtures that used realistic-looking API key patterns, with actual prefix formats that providers use. GitHub Push Protection flagged these and blocked the push. The fix is obvious in hindsight: use PLACEHOLDER_ prefixes and clearly synthetic shapes from the start. But reworking test fixtures late in the sprint when you are trying to push is expensive and stressful.

The rule I adopted after this: any test fixture that looks like it could be a real credential uses a PLACEHOLDER_ prefix. No real prefix formats, no real-length random strings that match known patterns.

PyPI "API token not found" after rate limit delay. After waiting out a rate limit period, some publish attempts failed with authentication errors even though the token was correct. The root cause was stale token state in the publish tooling. The fix was to re-authenticate and retry, but diagnosing it cost an hour.

The Surprising Thing

The Rust ports were faster to write than the Python originals for the algorithmic libs. This surprised me.

The Python versions came first and were the hardest to write because they required figuring out the right interface. What is the error type? How is state passed? What is the minimal API that covers the use cases?

By the time I was writing the Rust version, all those decisions were made. I just had to express a proven algorithm in Rust. The type system caught interface mistakes at compile time instead of test time. For the sliding window, LRU cache, and state machine libs, the Rust port took roughly 60% of the time the Python original took.

The non-algorithmic libs, the ones that are mostly string processing or schema manipulation, were faster in Python. No surprise there.

What I Would Do Differently

Publish in batches of five with hour-long gaps. I said this above but it is the most actionable lesson. Do not queue twenty publishes and walk away. Stagger them.

Write the fixture rule in the first commit. Put PLACEHOLDER_ shapes in the first test file for the first lib and make it the template for everything after. Do not decide to be careful about this after you have already written twenty test suites.

Keep a scratch doc of pending decisions. During the sprint I kept finding edge cases that needed a decision: what error type to raise, whether to support async in addition to sync, what the default window size should be. I was making these decisions fast and not always consistently. A scratch doc to park them and revisit would have produced more consistent APIs across the fifty libs.

Test the publish flow on one lib before committing to the full batch. The rate limit discovery would have been much less painful if I had done a trial run of five or six libs before queuing the full batch.

The Stack as It Stands

Fifty-plus libraries are now in the MukundaKatta GitHub org. About half are published on PyPI, about a third on crates.io, and the rest are pending due to rate limit cooldowns.

Layer	Published	Pending
Safety	5	3
Observability	4	2
Reliability	6	4
Context	3	1
Tool Infra	8	5

The published ones are installable and usable today. The pending ones have source on GitHub and can be installed with pip install git+https://github.com/MukundaKatta/<repo>.

What Is Next

Three things.

First, finishing the pending publishes. Rate limit cooldowns are temporary. Most pending packages will be on PyPI within the next week.

Second, a lightweight run-scoped context object that lets multiple libs coordinate without importing each other. Right now llm-cost-cap, agent-deadline, and tool-loop-guard are each independent. A shared context would let them share a budget, a deadline, and a call history.

Third, picking the three or four libs with the most real-world traction and going deeper on those: better docs, more examples, API refinement based on feedback.

If you are building agents and hitting specific production failures, check the MukundaKatta org on GitHub. There is likely already a library for it.