Mukunda Rao Katta

A Week in Production: What Your Agent Will Actually Break On

Your agent works in local testing. It handles the happy path, returns clean results, and costs a few cents per run. Then you deploy it. Here is what breaks over the first seven days, and what fixes each one.

This is not a tutorial on building an agent from scratch. It is a failure playbook. Each section is one day, one failure mode, and one targeted fix.


Hook

The pattern is always the same. You ship the agent. Users start using it. On day one, everything looks fine. By day seven, you have had five separate fires, fixed three of them badly, and still have no visibility into what is actually happening.

The libraries in this post are not a framework. They are individual fixes for individual failure modes. You do not need all of them on day one. You pick the one that matches the problem you are currently facing.


Day 1: A Tool Crashes on Bad Args

Your agent calls a search tool. The model passes {"query": ["deploy", "errors"]}, a list instead of a string. Your tool expects a string. It throws an uncaught TypeError. The agent crashes without giving the user any feedback.

Fix: agentvet

from agentvet import Vet, ArgSpec, ArgError

class SearchArgs(ArgSpec):
    query: str  # strict type enforcement

vet = Vet()

@vet.guard(schema=SearchArgs)
def search_logs(query: str) -> dict:
    return {"results": [f"logs matching '{query}'"]}

# In your tool dispatcher (dispatch_search is an illustrative name)
def dispatch_search(tool_input: dict) -> dict:
    try:
        return search_logs(**tool_input)
    except ArgError as e:
        # Return a structured error the LLM can reason about
        return {
            "error": True,
            "message": f"Bad tool arguments: {e}",
            "hint": "query must be a string, not a list",
        }

agentvet validates incoming args against the declared spec before your function runs. Bad args return a structured error the LLM can read and self-correct from, instead of a stack trace that kills the run.
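
To make the idea concrete without depending on agentvet's internals, here is a minimal standard-library sketch of the same pattern: check the incoming args against the function's type hints, and hand back a structured error instead of raising. The names check_args and safe_call are hypothetical and are not part of agentvet; the sketch only handles plain class annotations such as str or int.

```python
import inspect
from typing import Any, Callable


def check_args(func: Callable[..., Any], tool_input: dict) -> list:
    """Return a list of human-readable problems; an empty list means the args look valid."""
    problems = []
    params = inspect.signature(func).parameters
    for name, param in params.items():
        # *args / **kwargs are out of scope for this sketch.
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if name not in tool_input:
            if param.default is inspect.Parameter.empty:
                problems.append(f"missing required argument '{name}'")
            continue
        expected = param.annotation
        if expected is inspect.Parameter.empty:
            continue  # no type hint, nothing to check
        # Only plain classes (str, int, dict, ...) are checked here.
        if isinstance(expected, type) and not isinstance(tool_input[name], expected):
            got = type(tool_input[name]).__name__
            problems.append(f"'{name}' must be {expected.__name__}, got {got}")
    for name in tool_input:
        if name not in params:
            problems.append(f"unexpected argument '{name}'")
    return problems


def safe_call(func: Callable[..., Any], tool_input: dict) -> dict:
    """Run the tool only if its args pass the check; otherwise return an error payload."""
    problems = check_args(func, tool_input)
    if problems:
        return {"error": True, "message": "; ".join(problems)}
    return {"error": False, "result": func(**tool_input)}


def search_logs(query: str) -> dict:
    return {"results": [f"logs matching '{query}'"]}
```

Calling safe_call(search_logs, {"query": ["deploy", "errors"]}) returns an error payload naming the offending argument, which is the kind of message a model can usually act on in its next turn.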


Day 2: Provider Rate Limits During Peak Hours

You have 20 concurrent users at 6pm. The provider returns 429s. Your agent crashes for all 20 users simultaneously because there is no retry logic.

Fix: llm-retry-py + llm-budget-window

import anthropic
from llm_retry import RetryConfig, with_retry
from llm_budget_window import BudgetWindow

# Per-minute token budget to stay under provider limits
window = BudgetWindow(tokens_per_minute=40_000, usd_per_minute=2.00)

retry_cfg = RetryConfig(
    max_attempts=4,
    base_delay=1.0,
    max_delay=30.0,
    jitter=True,
    retryable_status_codes={429, 500, 502, 503},
)

client = anthropic.Anthropic()

def call_llm(messages: list, max_tokens: int = 512) -> anthropic.types.Message:
    # Check budget before calling
    window.reserve(estimated_tokens=max_tokens + 200)

    @with_retry(retry_cfg)
    def _call():
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tokens,
            messages=messages,
        )

    response = _call()
    window.record(actual_tokens=response.usage.input_tokens + response.usage.output_tokens)
    return response

The budget window throttles requests before they hit the provider. The retry wrapper handles the 429s that slip through. Jitter spreads out the retry wave so all 20 users do not hit the API again at the same second.
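
If you want to see why jitter matters, here is a small sketch of exponential backoff with "full jitter", written against the standard library only. It is not how llm-retry-py is implemented (I am not quoting its internals); RetryableError, backoff_delay, and call_with_backoff are hypothetical names, and the defaults simply mirror the config values above.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical stand-in for a 429 or 5xx response from the provider."""


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a random delay between 0 and the capped exponential."""
    cap = min(max_delay, base_delay * (2 ** attempt))
    return (rng or random).uniform(0.0, cap)


def call_with_backoff(call: Callable[[], T], max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RetryableError, sleeping a jittered delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Because each client draws its delay at random from the whole interval, 20 users who all hit a 429 at the same moment wake up at different times instead of retrying in lockstep.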


Day 3: Context Window Exceeded Mid-Run

The agent is on turn 35 of a long research task. The LLM call fails with a context-length error. The conversation history grew past the limit and nobody noticed.

Fix: agent-message-window + agentfit

import anthropic
from agent_message_window import MessageWindow
from agentfit import FitReport
from prompt_token_counter import count_tokens

client = anthropic.Anthropic()

window = MessageWindow(
    max_tokens=75_000,      # well under the model's hard limit
    token_counter=count_tokens,
    drop_strategy="oldest", # drop oldest when over budget
)

def run_turn(user_text: str, history: list) -> tuple[str, list]:
    window.clear()
    for turn in history:
        window.push(turn["role"], turn["content"])
    window.push("user", user_text)

    messages = window.as_messages()

    fit = FitReport(messages=messages, model="claude-sonnet-4-6")
    if fit.tokens > 90_000:
        # Hard block before the API call even happens
        raise RuntimeError(f"Context too large: {fit.tokens} tokens")

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )

    reply = response.content[0].text
    window.push("assistant", reply)
    return reply, window.as_turns()

The window trims the oldest messages when the budget is exceeded. agentfit gives you a token count before the call so you can catch edge cases where the window still exceeds the model's hard limit.
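
The "drop oldest" strategy is simple enough to sketch by hand. The version below is an illustration, not agent-message-window's code: rough_tokens is a crude 4-characters-per-token guess standing in for a real tokenizer, and trim_oldest is a hypothetical helper name.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token). Not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_oldest(messages: List[Message], max_tokens: int,
                count: Callable[[str], int] = rough_tokens) -> List[Message]:
    """Keep the newest messages that fit in max_tokens, dropping the oldest first."""
    kept: List[Message] = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count(msg["content"])
        # Always keep the newest message, even if it alone is over budget.
        if kept and total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()
    # Some chat APIs expect the conversation to open with a user message,
    # so drop any leading assistant turns left behind by the cut.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Note the trade-off: dropping the oldest turns is cheap and predictable, but it can also throw away the original task description. If that matters, pin the first user message and trim from the second turn onward.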


Day 4: Agent Loops on "Search Returns Nothing"

Your agent is tasked with finding recent news about a specific company. The search tool returns empty results. The agent rephrases the query. Still empty. Rephrases again. Still empty. After 15 iterations you notice the cost is climbing. The agent never gave up.

Fix: tool-loop-guard + NoProgress from llm-stop-conditions

from tool_loop_guard import LoopGuard, LoopDetectedError
from llm_stop_conditions import StopEvaluator, MaxIters, MaxUsd, NoProgress

guard = LoopGuard(window_size=8, max_repeats=3)
evaluator = StopEvaluator(conditions=[
    MaxIters(20),
    MaxUsd(1.50),
    NoProgress(window=5, similarity=0.80, max_stale=3),
])

ctx = {"iters": 0, "cost_usd": 0.0, "last_tool_results": []}

def run_tool(name: str, args: dict, result: str) -> str:
    guard.record(name, args)   # raises LoopDetectedError after 3 identical calls
    ctx["last_tool_results"].append(result)
    return result

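# In the main loop, once per iteration. This snippet does not update
# ctx["iters"] or ctx["cost_usd"]; the loop has to maintain them.
def stop_message() -> str | None:
    stop = evaluator.check(ctx)
    if stop.should_stop:
        return f"Could not complete: {stop.reason}. Please try a more specific query."
    return None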

The loop guard fires once the same tool and args have repeated max_repeats (3) times within the last 8 recorded calls. NoProgress fires when the recent tool results are all near-duplicates of each other. Either signal tells the agent to change strategy or give up cleanly.
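
Here is a rough sketch of the sliding-window idea behind repeat detection, using only the standard library. RepeatDetector is a hypothetical class, not tool-loop-guard's API, and it returns a boolean rather than raising so the caller decides what to do.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags a tool call once it has appeared max_repeats times in the recent window."""

    def __init__(self, window_size: int = 8, max_repeats: int = 3):
        self.recent = deque(maxlen=window_size)  # old calls fall off automatically
        self.max_repeats = max_repeats

    def record(self, name: str, args: dict) -> bool:
        # Serialize with sorted keys so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats
```

Exact matching only catches identical calls. An agent that rephrases the query each time slips past it, which is why the post pairs it with a check on the results rather than the arguments.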


Day 5: A Secret Ends Up in Tool Logs

You add logging to your tool dispatcher. Later, while reading the logs, you notice they contain API keys that were passed as tool arguments. Some tools accept credentials, and you logged the raw args dict without scrubbing.

Fix: tool-secret-scrubber

import logging

from tool_secret_scrubber import Scrubber

scrubber = Scrubber(
    patterns=["api_key", "token", "secret", "password", "bearer"],
    redact_with="[REDACTED]",
)

log = logging.getLogger("tools")

def dispatch_tool(name: str, args: dict) -> dict:
    safe_args = scrubber.scrub_dict(args)
    log.info("tool_call", extra={"tool": name, "args": safe_args})

    result = execute_tool(name, args)  # pass real args to the tool

    safe_result = scrubber.scrub_dict(result)
    log.info("tool_result", extra={"tool": name, "result": safe_result})

    return result

The scrubber walks the args dict, finds keys that match known secret patterns, and replaces their values with [REDACTED] before logging. The real args still go to the tool; only the logged copy is redacted.
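
For a sense of what key-based scrubbing involves, here is a minimal recursive sketch. It is an assumption-laden illustration, not tool-secret-scrubber's implementation: scrub and is_secret_key are hypothetical names, and matching is a case-insensitive substring test on dict keys.

```python
from typing import Any, Tuple

SECRET_MARKERS = ("api_key", "token", "secret", "password", "bearer")


def is_secret_key(key: Any, markers: Tuple[str, ...] = SECRET_MARKERS) -> bool:
    """True when a dict key looks like it names a credential."""
    return isinstance(key, str) and any(m in key.lower() for m in markers)


def scrub(value: Any, markers: Tuple[str, ...] = SECRET_MARKERS,
          redact_with: str = "[REDACTED]") -> Any:
    """Return a redacted copy of value; the original object is left untouched."""
    if isinstance(value, dict):
        return {
            k: redact_with if is_secret_key(k, markers) else scrub(v, markers, redact_with)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [scrub(v, markers, redact_with) for v in value]
    return value
```

Key-based matching has a blind spot: a secret embedded in a value under an innocent key (for example a full Authorization header string stored under "header") passes through. Value-pattern matching would be a separate layer on top of this.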


Day 6: Cost Spike from Cold Prompt Cache

A new deployment clears the model's prompt cache. Your agent uses a large system prompt shared across all users. Every call in the first hour after deployment is a cache miss. Your bill for that hour is 10x higher than normal.

Fix: prompt-cache-warmer + cachebench

from prompt_cache_warmer import CacheWarmer
from cachebench import CacheBenchmark

SYSTEM_PROMPT = open("system_prompt.txt").read()  # 8,000 tokens of shared context

warmer = CacheWarmer(model="claude-sonnet-4-6")
bench = CacheBenchmark()

async def warm_cache_on_startup():
    """Call this once at deploy time, before serving user traffic."""
    await warmer.warm(
        system=SYSTEM_PROMPT,
        seed_messages=[
            {"role": "user", "content": "Ready."},
        ],
    )
    print("Cache warmed.")

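# In each real request
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    # Anthropic prompt caching is opt-in: mark the shared prefix with
    # cache_control so later calls can read it from the cache.
    system=[
        {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}},
    ],
    messages=user_messages,
)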

bench.record(
    cache_read_tokens=response.usage.cache_read_input_tokens or 0,
    input_tokens=response.usage.input_tokens,
)
print(f"Cache hit ratio: {bench.hit_ratio():.1%}")

Warm the cache at startup before real traffic hits. cachebench tells you the hit ratio after each call so you can verify the warming worked and detect if cache is being invalidated unexpectedly.
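
The hit ratio itself is plain arithmetic over the usage counters. The sketch below is my own illustration, not cachebench's code: CacheStats is a hypothetical class, and it assumes the three counters are disjoint (uncached input, tokens read from cache, tokens written to cache), which is how I understand Anthropic's usage object to report them.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of input tokens, split by how they were billed."""
    uncached: int = 0
    cache_read: int = 0
    cache_write: int = 0

    def record(self, input_tokens: int, cache_read_tokens: int = 0,
               cache_creation_tokens: int = 0) -> None:
        self.uncached += input_tokens
        self.cache_read += cache_read_tokens
        self.cache_write += cache_creation_tokens

    def hit_ratio(self) -> float:
        """Share of all input tokens that were served from the cache."""
        total = self.uncached + self.cache_read + self.cache_write
        return self.cache_read / total if total else 0.0
```

A ratio that stays near zero after warming usually means the prefix differs between calls. Even a one-character change in the system prompt produces a new cache entry.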


Day 7: Provider Outage Takes Down All Users

The LLM provider has a 45-minute outage. All user requests fail instantly. No fallback. No queuing. You spend 45 minutes watching error reports come in.

Fix: llm-circuit-breaker-py + llm-fallback-router

from llm_circuit_breaker import CircuitBreaker, CircuitOpenError
from llm_fallback_router import FallbackRouter, ProviderConfig

# Primary: Claude Sonnet. Fallback: OpenAI GPT.
router = FallbackRouter(
    providers=[
        ProviderConfig(name="anthropic", model="claude-sonnet-4-6", priority=1),
        ProviderConfig(name="openai", model="gpt-5.4", priority=2),
    ]
)

circuit = CircuitBreaker(
    failure_threshold=5,     # open after 5 consecutive failures
    recovery_timeout=60,     # try again after 60 seconds
)

def call_with_fallback(messages: list) -> str:
    try:
        with circuit:
            return router.call(
                messages=messages,
                max_tokens=512,
            )
    except CircuitOpenError:
        # Circuit is open, go straight to fallback
        return router.call(
            messages=messages,
            max_tokens=512,
            skip_providers=["anthropic"],
        )

The circuit breaker counts consecutive failures. After the threshold, it opens and routes directly to the fallback without waiting for primary timeouts. During the 45-minute outage, users get slower OpenAI responses instead of errors. Once the recovery timeout has passed, the breaker lets the primary be tried again; when the primary has recovered, the circuit closes and primary calls resume.
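
If the state machine feels abstract, this standard-library sketch shows the three moves: count failures, open the circuit, and allow a trial call after the timeout. SimpleBreaker and call_with_fallback here are hypothetical and are not the llm-circuit-breaker-py API; the clock is injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SimpleBreaker:
    """Opens after N consecutive failures; allows a trial call after the timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: only let a call through once the recovery timeout has elapsed.
        return self.clock() - self.opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_fallback(breaker: SimpleBreaker, primary: Callable[[], T],
                       fallback: Callable[[], T]) -> T:
    """Try the primary when the breaker allows it; otherwise use the fallback."""
    if breaker.allow():
        try:
            result = primary()
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback()
```

The payoff is in the open state: the primary is not called at all, so users are not stuck waiting on a timeout before the fallback runs.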


Siblings Table

| Library | Day it fixes | GitHub |
| --- | --- | --- |
| agentvet | Day 1: bad args | MukundaKatta/agentvet |
| llm-retry-py | Day 2: rate limits | MukundaKatta/llm-retry-py |
| llm-budget-window | Day 2: provider throttling | MukundaKatta/llm-budget-window |
| agent-message-window | Day 3: context overflow | MukundaKatta/agent-message-window |
| agentfit | Day 3: token tracking | MukundaKatta/agentfit |
| tool-loop-guard | Day 4: repeated calls | MukundaKatta/tool-loop-guard |
| llm-stop-conditions | Day 4: stuck agent | MukundaKatta/llm-stop-conditions |
| tool-secret-scrubber | Day 5: secrets in logs | MukundaKatta/tool-secret-scrubber |
| prompt-cache-warmer | Day 6: cache misses | MukundaKatta/prompt-cache-warmer |
| cachebench | Day 6: cache monitoring | MukundaKatta/cachebench |
| llm-circuit-breaker-py | Day 7: provider outage | MukundaKatta/llm-circuit-breaker-py |
| llm-fallback-router | Day 7: provider fallback | MukundaKatta/llm-fallback-router |

What's Next

Day 8 and beyond, the failures get more subtle. Session isolation breaks. Output formatting breaks downstream parsers. Tool call latency grows because three sequential calls could have run in parallel.

These are architectural problems. The libraries in this post buy you time. They prevent the obvious fires. Once those are out, you have headspace to fix the underlying design.

All repos are at MukundaKatta on GitHub. Each one is a standalone package with no required dependencies on the others.
