I Shipped 50 Agent-Stack Libraries. Here Is How They Fit Together.

#hermeschallenge #ai #python #agents

Why 50 Small Libraries Instead of One Big Framework

When I started building AI agents for real workloads, I kept running into the same set of failures. Agents would exceed token budgets mid-run. Tool calls would fire on blocked endpoints. The LLM would return malformed JSON and the caller would crash. Duplicate tool calls would pile up. There was no record of why the agent made a particular decision.

I could have wrapped all of this in a framework. One big import, one big config file. But I did not want that. I wanted to be able to grab exactly what I needed for a given project and not carry the rest. So I built 50 small libraries instead. Each one solves one specific failure mode.

This post is about how they fit together. Not all 50 at once. Five layers. A clear composition pattern. And a "smallest useful stack" section for when you do not need all of it.

The Five Layers

The 50 libraries fall into five layers. Each layer has a distinct job.

Safety stops things from going wrong before they happen. agentguard enforces an egress allowlist so the agent can only call approved endpoints. agentvet validates tool call arguments before execution. prompt-shield detects prompt injection attempts in user input. tool-secret-scrubber strips API keys and tokens from tool outputs before they are logged. tool-side-effects-tag marks every tool call as READ, WRITE, IDEMPOTENT, or DESTRUCTIVE so the runtime knows what it is doing before it does it.

Observability tells you what happened and why. agentsnap captures the full call trace so you can replay any run. agenttrace aggregates cost and latency across runs. agenttap intercepts raw prompts and responses at the wire level. agent-decision-log records the WHY behind each agent decision in a structured log. agent-citation records WHERE each claim came from, linking back to the source document or API response. agent-event-bus routes events between components. agent-replay-trace lets you step through a recorded JSONL trace interactively.

Reliability handles the things that go wrong at runtime. llm-retry adds exponential backoff to LLM calls. llm-circuit-breaker stops hammering a provider that is returning errors. llm-fallback-router fails over to a backup provider when the primary is down. agent-deadline enforces a cooperative per-task time limit. llm-stop-conditions gives you composable rules for when the agent loop should exit. token-budget-pool tracks token and USD spend across concurrent calls and stops when the budget is hit.

Context Management keeps the prompt from growing out of control. agentfit checks if a message list fits within a model's context window. agent-message-window maintains a rolling history that respects tool_use / tool_result pairing rules. prompt-token-counter estimates token counts without an API call. prompt-cache-warmer pre-warms Anthropic's prompt cache for long system prompts so you are not paying full price on every turn.

Tool Infrastructure handles everything around tool calls. agentcast enforces structured output when the LLM returns tool arguments. tool-arg-coerce coerces tool arguments to the expected types when the LLM returns a string where you wanted an int. tool-arg-defaults fills in missing tool arguments from schema defaults. tool-arg-fuzzy fuzzy-matches LLM-provided enum arguments to the nearest valid value. tool-schema-from-fn generates tool schemas from Python function signatures. tool-output-truncate trims tool output that is too long before it goes back into the context. tool-output-format renders tool output as LLM-friendly markdown. tool-result-cache caches tool results so identical calls do not fire twice. tool-loop-guard detects when the same tool is being called repeatedly with the same arguments and breaks the loop.

Composition Around a Single LLM Call

Here is what six of these libraries look like composed around a single agent loop:

from agentguard import EgressGuard
from agentvet import ArgValidator
from llm_retry import with_retry
from token_budget_pool import BudgetPool
from agentsnap import Tracer
from tool_side_effects_tag import SideEffectsEnforcer

# Set up the stack once
guard = EgressGuard(allowed_hosts=["api.openai.com", "api.anthropic.com"])
validator = ArgValidator.from_tool_schemas(tools)
budget = BudgetPool(max_usd=0.50, max_tokens=50_000)
tracer = Tracer(session_id="run-001")
enforcer = SideEffectsEnforcer(allow_write=False)

def run_agent_turn(messages, tools):
    with tracer.span("llm_call"):
        with guard:
            with budget.check():
                # Retry the LLM call on transient errors
                response = with_retry(
                    lambda: client.chat.completions.create(
                        model="gpt-5.4",
                        messages=messages,
                        tools=tools
                    ),
                    max_attempts=3
                )

        # Validate tool call arguments before executing
        for tool_call in response.tool_calls or []:
            enforcer.check(tool_call)         # Blocks WRITE calls in read-only mode
            validator.validate(tool_call)      # Checks args against schema
            result = execute_tool(tool_call)
            tracer.record_tool_call(tool_call, result)

    return response

Each library does one thing. EgressGuard blocks unexpected outbound calls. ArgValidator stops the LLM from passing a string where the schema requires an int. BudgetPool stops the loop if spend exceeds the limit. Tracer records everything. SideEffectsEnforcer ensures a read-only agent does not accidentally fire a write tool call. with_retry handles the transient LLM errors.

You do not have to use all of them. Pick the ones relevant to your failure mode.

The Smallest Useful Stack

If you are just starting out, reach for four libraries. These four cover the most common failure modes without adding much complexity.

llm-retry first. Transient LLM errors are universal. Every provider has them. Without retry, your agent loop will fail on perfectly normal runs. This one has no downside.

token-budget-pool second. Without a budget cap, a runaway agent loop will spend real money. Set a low USD cap while you are developing. Raise it as you get confidence.

agentcast third. Structured output failures are the most common cause of crashes in tool-heavy agents. The LLM will return JSON that does not match your schema. agentcast catches that and retries with the error injected as context.

agentsnap fourth. You will need to debug your agent at some point. Without a trace, you are flying blind. agentsnap captures the full call trace with no configuration required.

Those four handle retry, budget, output format, and observability. Everything else in the stack is for specific problems you will discover as you scale.

What NOT to Use on Day One

Some of these libraries are only useful at scale or in specific situations.

prompt-cache-warmer is only worth it if you have a long system prompt that is reused across many turns. For a short system prompt or single-turn usage, it adds latency for no gain.

llm-circuit-breaker is for production systems with sustained load. If you are running an agent that fires one or two calls at a time, you do not need a circuit breaker. It exists for systems where a provider outage could cause a cascade.

tool-arg-fuzzy is for cases where the LLM is matching against a closed enum and frequently gets close but not exact. If your enums are large and the LLM is consistent, you do not need it.

agent-replay-trace is a debugging tool, not a production tool. Build and test with it. Remove it from the hot path in production.

The goal is not to use all 50 libraries. The goal is to use the ones that solve a real problem you are facing.

When This Stack Beats a Framework

Frameworks give you an opinionated loop and a set of built-in behaviors. That is useful when you are moving fast and want something that works out of the box. It is less useful when the built-in behaviors conflict with your requirements, or when you need to audit exactly what the agent is doing.

This stack is composable because the libraries have no shared state. You can use llm-retry without agentsnap. You can use token-budget-pool without agentguard. Each library exposes a narrow interface and does not depend on the others. You add them incrementally as you discover the failure mode they address.

The audit trail is also different. With agent-decision-log and agent-citation in the stack, every decision has a WHY and every claim has a WHERE. In a framework, those are often opaque. Here they are structured objects you can inspect, store, and query.

What Is Next

The stack is not finished. Three areas are actively being developed.

The feedback layer is the biggest gap. Right now, the libraries record what happened. They do not learn from it. A feedback capture step that can route corrections back to improve classification prompts over time would close the loop between observability and improvement.

Cross-library coordination is second. Right now, each library is independent. A thin coordinator that can propagate a cancellation signal, a budget exhaustion event, or a circuit break across the entire stack in one call would simplify the composition code significantly.

Rust ports of the core reliability libraries are in progress. llm-retry, llm-circuit-breaker, token-budget-pool, and agentguard all have Rust counterparts that are either published or close to it. The composition pattern is the same; the runtime cost is lower.

The pattern stays the same as the stack grows. Small libraries, one problem each, composable without shared state. That is the only thing that has to stay constant.

Source: All libraries at github.com/MukundaKatta