Your agent works in local testing. It handles the happy path, returns clean results, and costs a few cents per run. Then you deploy it. Here is what breaks over the first seven days, and what fixes each one.
This is not a tutorial on building an agent from scratch. It is a failure playbook. Each section is one day, one failure mode, and one targeted fix.
The pattern is always the same. You ship the agent. Users start using it. On day one, everything looks fine. By day seven, you have had five separate fires, fixed three of them badly, and still have no visibility into what is actually happening.
The libraries in this post are not a framework. They are individual fixes for individual failure modes. You do not need all of them on day one. You pick the one that matches the problem you are currently facing.
Day 1: A Tool Crashes on Bad Args
Your agent calls a search tool. The model passes {"query": ["deploy", "errors"]}, a list instead of a string. Your tool expects a string. It throws an uncaught TypeError. The agent crashes without giving the user any feedback.
Fix: agentvet
from agentvet import Vet, ArgSpec, ArgError
class SearchArgs(ArgSpec):
    query: str  # strict type enforcement
vet = Vet()
@vet.guard(schema=SearchArgs)
def search_logs(query: str) -> dict:
    return {"results": [f"logs matching '{query}'"]}
# In your tool dispatcher
def dispatch_search(tool_input: dict) -> dict:
    try:
        return search_logs(**tool_input)
    except ArgError as e:
        # Return a structured error the LLM can reason about
        return {
            "error": True,
            "message": f"Bad tool arguments: {e}",
            "hint": "query must be a string, not a list",
        }
agentvet validates incoming args against the declared spec before your function runs. Bad args return a structured error the LLM can read and self-correct from, instead of a stack trace that kills the run.
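If you want to see the idea without the library, here is a minimal stdlib sketch. It is not agentvet's implementation; `validate_args` and `safe_call` are hypothetical names made up for this example, and the spec is a plain dict of argument name to expected type.

```python
from typing import Any, Callable


def validate_args(spec: dict[str, type], args: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the args are valid."""
    problems = []
    for name, expected in spec.items():
        if name not in args:
            problems.append(f"missing required argument '{name}'")
        elif not isinstance(args[name], expected):
            got = type(args[name]).__name__
            problems.append(f"'{name}' must be {expected.__name__}, got {got}")
    for name in args:
        if name not in spec:
            problems.append(f"unexpected argument '{name}'")
    return problems


def safe_call(tool: Callable[..., dict], spec: dict[str, type], args: dict[str, Any]) -> dict:
    """Run the tool only if the args pass validation; otherwise return a structured error."""
    problems = validate_args(spec, args)
    if problems:
        return {"error": True, "message": "; ".join(problems)}
    return tool(**args)


def search_logs(query: str) -> dict:
    return {"results": [f"logs matching '{query}'"]}


bad = safe_call(search_logs, {"query": str}, {"query": ["deploy", "errors"]})
good = safe_call(search_logs, {"query": str}, {"query": "deploy errors"})
```

The point of the design is that `bad` is a normal return value, not an exception. You hand it back to the model as the tool result and let it retry with a string.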
Day 2: Provider Rate Limits During Peak Hours
You have 20 concurrent users at 6pm. The provider returns 429s. Your agent crashes for all 20 users simultaneously because there is no retry logic.
Fix: llm-retry-py + llm-budget-window
import anthropic
from llm_retry import RetryConfig, with_retry
from llm_budget_window import BudgetWindow
# Per-minute token budget to stay under provider limits
window = BudgetWindow(tokens_per_minute=40_000, usd_per_minute=2.00)
retry_cfg = RetryConfig(
    max_attempts=4,
    base_delay=1.0,
    max_delay=30.0,
    jitter=True,
    retryable_status_codes={429, 500, 502, 503},
)
client = anthropic.Anthropic()
def call_llm(messages: list, max_tokens: int = 512) -> anthropic.types.Message:
    # Check budget before calling
    window.reserve(estimated_tokens=max_tokens + 200)
    @with_retry(retry_cfg)
    def _call():
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tokens,
            messages=messages,
        )
    response = _call()
    window.record(actual_tokens=response.usage.input_tokens + response.usage.output_tokens)
    return response
The budget window throttles requests before they hit the provider. The retry wrapper handles the 429s that slip through. Jitter spreads out the retry wave so all 20 users do not hit the API again at the same second.
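To make the jitter point concrete, here is a rough sketch of how a retry delay can be computed. It assumes exponential backoff with "full jitter" (a random delay between zero and the capped backoff); llm-retry-py may use a different formula, and `backoff_delay` is a hypothetical helper, not part of that library.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    # Double the delay on each attempt, but never exceed max_delay
    capped = min(max_delay, base_delay * 2 ** (attempt - 1))
    if not jitter:
        return capped
    source = rng or random
    # Full jitter: pick anywhere between 0 and the capped delay
    return source.uniform(0, capped)


# Without jitter every client waits exactly 1s, 2s, 4s, 8s and they all retry together
fixed_schedule = [backoff_delay(n, jitter=False) for n in range(1, 5)]
```

With jitter off, 20 clients that failed at the same moment retry at the same moment. With jitter on, each one lands somewhere different inside the window.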
Day 3: Context Window Exceeded Mid-Run
The agent is on turn 35 of a long research task. The LLM call fails with a context-length error. The conversation history grew past the limit and nobody noticed.
Fix: agent-message-window + agentfit
import anthropic
from agent_message_window import MessageWindow
from agentfit import FitReport
from prompt_token_counter import count_tokens
client = anthropic.Anthropic()
window = MessageWindow(
    max_tokens=75_000,  # well under the model's hard limit
    token_counter=count_tokens,
    drop_strategy="oldest",  # drop oldest when over budget
)
def run_turn(user_text: str, history: list) -> tuple[str, list]:
    # Rebuild the window from stored history, then add the new user turn
    window.clear()
    for turn in history:
        window.push(turn["role"], turn["content"])
    window.push("user", user_text)
    messages = window.as_messages()
    fit = FitReport(messages=messages, model="claude-sonnet-4-6")
    if fit.tokens > 90_000:
        # Hard block before the API call even happens
        raise RuntimeError(f"Context too large: {fit.tokens} tokens")
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    reply = response.content[0].text
    window.push("assistant", reply)
    return reply, window.as_turns()
The window trims the oldest messages when the budget is exceeded. agentfit gives you a token count before the call so you can catch edge cases where the window still exceeds the model's hard limit.
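The drop-oldest strategy itself is simple enough to sketch in a few lines. This is an illustration, not agent-message-window's code: `rough_token_count` is a crude four-characters-per-token stand-in for a real tokenizer, and `trim_oldest` is a hypothetical name.

```python
def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def trim_oldest(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop messages from the front until the total fits and the list starts with a user turn."""
    kept = list(messages)
    total = sum(rough_token_count(m["content"]) for m in kept)
    # Always keep at least the newest message, even if it alone is over budget
    while len(kept) > 1 and (total > max_tokens or kept[0]["role"] != "user"):
        dropped = kept.pop(0)
        total -= rough_token_count(dropped["content"])
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # about 100 tokens
    {"role": "assistant", "content": "b" * 400},  # about 100 tokens
    {"role": "user", "content": "c" * 400},       # about 100 tokens
]
trimmed = trim_oldest(history, max_tokens=250)
```

Note the second condition in the loop. Dropping only the first message would leave an assistant turn at the front, so the sketch keeps dropping until the history starts with a user turn again. It also never drops the last message, which is exactly the edge case a pre-call token check is there to catch.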
Day 4: Agent Loops on "Search Returns Nothing"
Your agent is tasked with finding recent news about a specific company. The search tool returns empty results. The agent rephrases the query. Still empty. Rephrases again. Still empty. After 15 iterations you notice the cost is climbing. The agent never gave up.
Fix: tool-loop-guard + NoProgress from llm-stop-conditions
from tool_loop_guard import LoopGuard, LoopDetectedError
from llm_stop_conditions import StopEvaluator, MaxIters, MaxUsd, NoProgress
guard = LoopGuard(window_size=8, max_repeats=3)
evaluator = StopEvaluator(conditions=[
    MaxIters(20),
    MaxUsd(1.50),
    NoProgress(window=5, similarity=0.80, max_stale=3),
])
ctx = {"iters": 0, "cost_usd": 0.0, "last_tool_results": []}
def run_tool(name: str, args: dict, result: str) -> str:
    guard.record(name, args)  # raises LoopDetectedError after 3 identical calls
    ctx["last_tool_results"].append(result)
    return result
# In the main loop:
stop = evaluator.check(ctx)
if stop.should_stop:
    return f"Could not complete: {stop.reason}. Please try a more specific query."
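The loop guard fires once the same tool and args have repeated 3 times within the last 8 calls. NoProgress fires when recent tool results are all similar. In the scenario above the agent rephrases the query every time, so the args never repeat exactly and it is NoProgress that catches it. Either one ends the run with a clear message instead of letting the cost keep climbing.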
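Both checks come down to looking at a short window of recent history. Here is a stdlib sketch of each, under simplifying assumptions: `RepeatDetector` and `no_progress` are hypothetical names, the repeat check returns a flag instead of raising, and the no-progress check uses exact equality where the library takes a similarity threshold.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags a tool call once it has appeared max_repeats times in the recent window."""

    def __init__(self, window_size: int = 8, max_repeats: int = 3):
        self.recent = deque(maxlen=window_size)  # old calls fall off automatically
        self.max_repeats = max_repeats

    def record(self, name: str, args: dict) -> bool:
        # Serialize args with sorted keys so {"a": 1, "b": 2} equals {"b": 2, "a": 1}
        key = (name, json.dumps(args, sort_keys=True))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats


def no_progress(results: list[str], max_stale: int = 3) -> bool:
    """True when the last max_stale results are all identical, for example all empty."""
    tail = results[-max_stale:]
    return len(tail) == max_stale and len(set(tail)) == 1


detector = RepeatDetector(window_size=8, max_repeats=3)
flags = [
    detector.record("search", {"q": "acme news"}),
    detector.record("search", {"q": "acme latest"}),
    detector.record("search", {"q": "acme news"}),
    detector.record("search", {"q": "acme news"}),
]
```

The two checks cover different failure shapes. The repeat detector catches an agent that sends the same call over and over. The no-progress check catches an agent that keeps changing the call but keeps getting the same nothing back.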
Day 5: A Secret Ends Up in Tool Logs
You add logging to your tool dispatcher. A week later you notice the logs contain API keys that were passed as tool arguments. Some tools accept credentials. You logged the raw args dict without scrubbing.
Fix: tool-secret-scrubber
import logging
from tool_secret_scrubber import Scrubber
scrubber = Scrubber(
    patterns=["api_key", "token", "secret", "password", "bearer"],
    redact_with="[REDACTED]",
)
log = logging.getLogger("tools")
def dispatch_tool(name: str, args: dict) -> dict:
    safe_args = scrubber.scrub_dict(args)
    log.info("tool_call", extra={"tool": name, "args": safe_args})
    result = execute_tool(name, args)  # pass real args to the tool
    safe_result = scrubber.scrub_dict(result)
    log.info("tool_result", extra={"tool": name, "result": safe_result})
    return result
The scrubber walks the args dict, finds keys that match known secret patterns, and replaces the values with [REDACTED] before logging. The real args still go to the tool. Only the log output is safe.
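A key-based scrubber like this can be sketched as a small recursive function. This is a simplified illustration, not tool-secret-scrubber's implementation; `scrub`, `_looks_secret`, and `SECRET_MARKERS` are hypothetical names, and matching is a case-insensitive substring check on the key.

```python
from typing import Any

SECRET_MARKERS = ("api_key", "token", "secret", "password", "bearer")


def _looks_secret(key: Any) -> bool:
    """A key is treated as secret if its lowercased name contains any marker."""
    return isinstance(key, str) and any(m in key.lower() for m in SECRET_MARKERS)


def scrub(value: Any, redact_with: str = "[REDACTED]") -> Any:
    """Return a redacted copy; the input is never modified."""
    if isinstance(value, dict):
        return {
            k: redact_with if _looks_secret(k) else scrub(v, redact_with)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(item, redact_with) for item in value]
    return value


args = {
    "url": "https://example.com",
    "api_key": "sk-123",
    "opts": {"Auth_Token": "abc", "retries": 2},
}
safe = scrub(args)
```

Returning a copy matters: the tool still needs the real `args`. Also keep in mind that a key-based approach only sees key names. A secret pasted into a free-text value under a harmless key will pass straight through, so treat this as a first line of defense.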
Day 6: Cost Spike from Cold Prompt Cache
A new deployment starts with a cold prompt cache. Your agent uses a large system prompt shared across all users. Every call in the first hour after deployment is a cache miss. Your bill for that hour is 10x higher than normal.
Fix: prompt-cache-warmer + cachebench
from prompt_cache_warmer import CacheWarmer
from cachebench import CacheBenchmark
SYSTEM_PROMPT = open("system_prompt.txt").read()  # 8,000 tokens of shared context
warmer = CacheWarmer(model="claude-sonnet-4-6")
bench = CacheBenchmark()
async def warm_cache_on_startup():
    """Call this once at deploy time, before serving user traffic."""
    await warmer.warm(
        system=SYSTEM_PROMPT,
        seed_messages=[
            {"role": "user", "content": "Ready."},
        ],
    )
    print("Cache warmed.")
# In each real request
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    # Anthropic prompt caching is opt-in: mark the shared prefix with cache_control
    system=[
        {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}},
    ],
    messages=user_messages,
)
bench.record(
    cache_read_tokens=response.usage.cache_read_input_tokens or 0,
    input_tokens=response.usage.input_tokens,
)
print(f"Cache hit ratio: {bench.hit_ratio():.1%}")
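Warm the cache at startup before real traffic hits. cachebench tells you the hit ratio after each call so you can verify the warming worked and detect whether the cache is being invalidated unexpectedly. Cached prefixes also expire after a period of inactivity (five minutes by default on Anthropic), so a single warm-up only helps if traffic follows soon after.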
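The hit ratio is just arithmetic over the usage numbers. Here is a hedged sketch of one way to compute it, assuming usage fields shaped like Anthropic's, where `input_tokens` counts only uncached tokens and cache reads and cache writes are reported separately. cachebench may define its ratio differently; `CacheStats` is a hypothetical class for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    cache_read: int = 0   # input tokens served from the cache
    total_input: int = 0  # all input tokens: uncached + cache reads + cache writes

    def record(
        self,
        input_tokens: int,
        cache_read_tokens: int = 0,
        cache_creation_tokens: int = 0,
    ) -> None:
        self.cache_read += cache_read_tokens
        self.total_input += input_tokens + cache_read_tokens + cache_creation_tokens

    def hit_ratio(self) -> float:
        """Fraction of all input tokens that were read from the cache."""
        return self.cache_read / self.total_input if self.total_input else 0.0


stats = CacheStats()
# First call after deploy: the 8,000-token prompt is written to the cache, nothing is read
stats.record(input_tokens=50, cache_creation_tokens=8_000)
ratio_after_miss = stats.hit_ratio()
# Second call: the same prompt is read back from the cache
stats.record(input_tokens=50, cache_read_tokens=8_000)
ratio_after_hit = stats.hit_ratio()
```

A ratio that stays near zero after warming is the signal to look for. It usually means the prefix differs between calls, even by a single character, or the cache expired between the warm-up and the first real request.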
Day 7: Provider Outage Takes Down All Users
The LLM provider has a 45-minute outage. All user requests fail instantly. No fallback. No queuing. You spend 45 minutes watching error reports come in.
Fix: llm-circuit-breaker-py + llm-fallback-router
from llm_circuit_breaker import CircuitBreaker, CircuitOpenError
from llm_fallback_router import FallbackRouter, ProviderConfig
# Primary: Claude Sonnet. Fallback: OpenAI GPT.
router = FallbackRouter(
    providers=[
        ProviderConfig(name="anthropic", model="claude-sonnet-4-6", priority=1),
        ProviderConfig(name="openai", model="gpt-5.4", priority=2),
    ]
)
circuit = CircuitBreaker(
    failure_threshold=5,  # open after 5 consecutive failures
    recovery_timeout=60,  # try again after 60 seconds
)
def call_with_fallback(messages: list) -> str:
    try:
        with circuit:
            return router.call(
                messages=messages,
                max_tokens=512,
            )
    except CircuitOpenError:
        # Circuit is open, go straight to fallback
        return router.call(
            messages=messages,
            max_tokens=512,
            skip_providers=["anthropic"],
        )
The circuit breaker counts consecutive failures. After the threshold, it opens and routes directly to the fallback without waiting for primary timeouts. During the 45-minute outage, users get slower OpenAI responses instead of errors. When the primary recovers, the circuit closes and primary calls resume.
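The state machine behind a circuit breaker is small. Here is a minimal sketch, assuming the usual closed, open, and probe-after-timeout behavior; it is not llm-circuit-breaker-py's implementation, and `SimpleBreaker` with its injectable `clock` is a hypothetical design chosen to make the timing testable.

```python
import time
from typing import Callable, Optional


class SimpleBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Closed: allow. Open: block until recovery_timeout has passed, then allow a probe."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        # Any success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open after a failed probe) and restart the timer
            self.opened_at = self.clock()


# Simulate an outage with a fake clock instead of real sleeps
now = [0.0]
breaker = SimpleBreaker(failure_threshold=5, recovery_timeout=60, clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
blocked_during_outage = not breaker.allow()
now[0] = 60.0
probe_allowed = breaker.allow()
```

The caller checks `allow()` before each primary call and goes straight to the fallback when it returns false. If the probe after the timeout fails, `record_failure` restarts the timer, so a long outage costs one failed call per minute instead of one per request.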
Siblings Table
| Library | Day it fixes | GitHub |
|---|---|---|
| agentvet | Day 1: bad args | MukundaKatta/agentvet |
| llm-retry-py | Day 2: rate limits | MukundaKatta/llm-retry-py |
| llm-budget-window | Day 2: provider throttling | MukundaKatta/llm-budget-window |
| agent-message-window | Day 3: context overflow | MukundaKatta/agent-message-window |
| agentfit | Day 3: token tracking | MukundaKatta/agentfit |
| tool-loop-guard | Day 4: repeated calls | MukundaKatta/tool-loop-guard |
| llm-stop-conditions | Day 4: stuck agent | MukundaKatta/llm-stop-conditions |
| tool-secret-scrubber | Day 5: secrets in logs | MukundaKatta/tool-secret-scrubber |
| prompt-cache-warmer | Day 6: cache misses | MukundaKatta/prompt-cache-warmer |
| cachebench | Day 6: cache monitoring | MukundaKatta/cachebench |
| llm-circuit-breaker-py | Day 7: provider outage | MukundaKatta/llm-circuit-breaker-py |
| llm-fallback-router | Day 7: provider fallback | MukundaKatta/llm-fallback-router |
What's Next
Day 8 and beyond, the failures get more subtle. Session isolation breaks. Output formatting breaks downstream parsers. Tool call latency grows because three sequential calls could have run in parallel.
These are architectural problems. The libraries in this post buy you time. They prevent the obvious fires. Once those are out, you have headspace to fix the underlying design.
All repos are at MukundaKatta on GitHub. Each one is a standalone package with no required dependencies on the others.