Most agents ship without any of this. Then the first production incident adds two or three items to the list. By year two, the codebase has a patchwork of half-solutions: a global try/except here, a hardcoded timeout there, a bash script that scrubs the logs before they hit the SIEM. The fixes are real but they are uncoordinated, untested, and invisible to new team members.
The checklist below covers fifteen things that should be in place before a Python LLM agent handles real users or real money. Each item states the risk and the solution in a sentence or two. Where there is a library, I have linked it. Where there is not, a pattern is enough.
None of these are hard to add. The hard part is knowing all fifteen exist before you need them.
The Checklist
1. Arg validation before tool execution
Risk: The model sends malformed or out-of-range arguments to a tool. The tool crashes, produces garbage output, or, in the worst case, executes a destructive action with unexpected parameters.
Solution: Validate every argument against a schema before the tool function runs. agentvet can generate a schema from a function signature and validate against it in one call.
from agentvet import ArgValidator, schema_from_fn
validator = ArgValidator(schema_from_fn(delete_record))
validator.validate({"record_id": tool_args["record_id"]})
# raises ValidationError before delete_record is called
2. Egress allowlist
Risk: A prompt injection attack or a model hallucination causes the agent to make HTTP requests to unintended hosts, exfiltrating data or calling attacker-controlled infrastructure.
Solution: Maintain an explicit allowlist of hosts the agent is permitted to contact. agentguard checks every outbound URL before the request is made and raises an exception for anything not on the list.
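If you want to see the shape of the check without a dependency, here is a minimal stdlib sketch. It is not agentguard's API: the host names, `EgressBlocked`, and `check_egress` are hypothetical names chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical hosts; replace with the services your agent really needs.
ALLOWED_HOSTS = frozenset({"api.example.com", "internal.example.com"})


class EgressBlocked(Exception):
    """Raised when a URL targets a host that is not on the allowlist."""


def check_egress(url, allowed_hosts=ALLOWED_HOSTS):
    """Return the URL unchanged if its host is allowed, otherwise raise."""
    parts = urlsplit(url)
    # .hostname strips userinfo and port, so "allowed.com@evil.test" is
    # judged by its real host, evil.test.
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https") or host not in allowed_hosts:
        raise EgressBlocked(f"blocked outbound request to {host or url!r}")
    return url
```

Call `check_egress(url)` immediately before every outbound request. Exact host matching is deliberate: suffix matching is where lookalike domains such as `api.example.com.evil.test` slip through.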
3. Prompt injection detection
Risk: User-provided content contains instructions that override the system prompt and redirect the agent toward unintended actions.
Solution: Run user input through a pattern-based detector before it enters the prompt. prompt-shield catches common injection patterns without needing a model call.
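To make "pattern-based" concrete, here is a small sketch of the idea. The patterns and the `find_injection` helper are my own illustrative assumptions, not prompt-shield's rule set or API, and a real detector needs a much longer list.

```python
import re

# Illustrative patterns only; real-world lists are longer and tuned over time.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"disregard\s+(the\s+)?(system\s+prompt|previous\s+instructions)", re.I),
    re.compile(r"reveal\s+(your\s+)?(system\s+prompt|instructions)", re.I),
    re.compile(r"you\s+are\s+now\s+", re.I),
]


def find_injection(text):
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]
```

A non-empty result is a signal to reject, quarantine, or flag the input. Pattern matching is cheap and catches the lazy attacks, but it will miss paraphrased ones, so treat it as one layer and not the whole defense.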
4. Secret scrubbing from logs
Risk: API keys, tokens, or credentials appear in tool arguments or model responses and end up written to logs, where they are visible to anyone with log access.
Solution: Pass all log output through a scrubber that replaces known secret patterns with redacted placeholders. tool-secret-scrubber uses regex patterns that match common API key shapes without needing to know specific values.
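The same idea in plain Python, as a sketch: a few regexes for common key shapes plus a `logging.Filter` that applies them. The patterns, placeholder strings, and the `scrub` and `ScrubFilter` names are assumptions for illustration, not the tool-secret-scrubber API.

```python
import logging
import re

# Assumed key shapes: "sk-" style API keys, AWS access key IDs, bearer tokens.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9_-]{20,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]{20,}=*"), "Bearer [REDACTED_TOKEN]"),
]


def scrub(text):
    """Replace anything that looks like a secret with a placeholder."""
    for pattern, placeholder in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


class ScrubFilter(logging.Filter):
    """Logging filter that scrubs the fully formatted message."""

    def filter(self, record):
        record.msg = scrub(record.getMessage())
        record.args = ()  # the message is already formatted
        return True
```

Attach the filter to your handlers (`handler.addFilter(ScrubFilter())`) so every record is scrubbed before it is written, whichever logger emitted it.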
5. PII redaction from LLM inputs
Risk: User messages contain email addresses, phone numbers, social security numbers, or credit card numbers that are sent to a third-party model provider, violating your data handling commitments.
Solution: Redact PII from user input before it leaves your infrastructure. llm-pii-redact handles the common patterns (email, phone, SSN, credit card with Luhn check) and replaces them with typed placeholders.
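Here is roughly what typed-placeholder redaction with a Luhn check looks like. This is a sketch under my own assumptions (the regexes, the placeholder names, and `redact_pii` are illustrative, and phone numbers are left out for brevity), not llm-pii-redact's implementation.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text):
    """Replace emails, SSNs, and Luhn-valid card numbers with typed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)

    def card_sub(match):
        digits = re.sub(r"\D", "", match.group())
        return "[CREDIT_CARD]" if luhn_ok(digits) else match.group()

    return CARD.sub(card_sub, text)
```

The Luhn check is what keeps order numbers and other long digit strings from being redacted as cards. Typed placeholders matter too: the model can still reason about "[EMAIL]" in a sentence, which it cannot do with a blank.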
6. Budget cap per run
Risk: A single agent run makes far more model calls than expected (due to a bug, a loop, or unexpected task complexity) and costs orders of magnitude more than you planned for.
Solution: Set a USD budget at the start of each run and raise an exception when the budget is exhausted. llm-cost-cap does the pre-flight calculation using token estimates and current pricing.
from llm_cost_cap import CostCap
cap = CostCap(max_usd=0.50, model="claude-sonnet-4-6")
cap.check(estimated_input_tokens=2000, estimated_output_tokens=500)
# raises CostLimitExceeded if projected cost exceeds $0.50
7. Time limit per run
Risk: An agent run hangs indefinitely waiting on a slow tool response, a model that is taking too long, or an infinite reasoning loop. Resources are tied up and the user sees no response.
Solution: Set a wall-clock deadline at the start of each run and check it at each iteration. agent-deadline provides a cooperative deadline object that raises DeadlineExceeded when the limit is hit.
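"Cooperative" means the loop has to ask. A sketch of such a deadline object follows; the `Deadline` class and its methods are my assumptions, and only the `DeadlineExceeded` name comes from the description above.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a run outlives its wall-clock budget."""


class Deadline:
    """Cooperative deadline: the agent loop must call check() itself."""

    def __init__(self, seconds, clock=time.monotonic):
        # monotonic() is immune to system clock adjustments.
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self):
        """Seconds left, never negative; handy as a per-call timeout."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self._clock() >= self._expires_at:
            raise DeadlineExceeded("run exceeded its wall-clock budget")
```

Call `deadline.check()` at the top of each iteration, and pass `deadline.remaining()` as the timeout to tool and model calls. A cooperative check cannot interrupt a call that is already blocked, so the per-call timeout is what covers the slow-tool case.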
8. Loop detection
Risk: The model enters a pattern where it calls the same tool with the same arguments repeatedly, making no progress and burning tokens until the budget is exhausted or the run is manually killed.
Solution: Track tool calls in a sliding window and raise an exception when the same call appears more than N times. tool-loop-guard handles this with a configurable max-calls threshold and window size.
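The sliding-window check is short enough to sketch in full. `LoopGuard`, `LoopDetected`, and the parameter names below are hypothetical, not tool-loop-guard's API.

```python
import json
from collections import deque


class LoopDetected(Exception):
    """Raised when the same tool call repeats too often in the window."""


class LoopGuard:
    def __init__(self, max_repeats=3, window=10):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # old calls fall off automatically

    def record(self, tool_name, args):
        # Serialize with sorted keys so argument order does not matter.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self.recent.append(key)
        if self.recent.count(key) > self.max_repeats:
            raise LoopDetected(
                f"{tool_name} called more than {self.max_repeats} times "
                f"with identical arguments in the last {self.recent.maxlen} calls"
            )
```

Call `guard.record(name, args)` right before each tool executes. The window is what separates a loop from legitimate repetition: the same lookup twice in fifty calls is fine, four times in ten is not.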
9. Provider failover
Risk: Your primary model provider returns a 500 error or a rate limit response. The agent fails completely even though alternative providers could serve the request.
Solution: Wrap model calls with an ordered failover chain that tries the next provider on failure. llm-fallback-router takes an ordered list of provider functions and calls them in sequence until one succeeds.
from llm_fallback_router import FallbackRouter
router = FallbackRouter([
    anthropic_client.call,
    openai_client.call,
    local_ollama_client.call,
])
response = router.call(prompt)
10. Retry with jitter
Risk: A transient error (rate limit, timeout, network blip) causes the agent to fail immediately instead of waiting and retrying.
Solution: Wrap model calls with exponential backoff and jitter so retries do not pile onto an already-stressed provider. llm-retry-py provides a decorator and a context manager for this pattern.
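For reference, the pattern itself fits in a dozen lines. This sketch uses the "full jitter" variant (sleep a random amount between zero and the exponential cap); the function name, parameters, and default exception types are assumptions, not llm-retry-py's interface.

```python
import random
import time


def retry_with_jitter(fn, *, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(); on a retryable error, back off exponentially with full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random delay in [0, cap] spreads clients out
            # so they do not all retry at the same instant.
            sleep(random.uniform(0, cap))
```

Use it as `retry_with_jitter(lambda: client.call(prompt))`. Only retry errors you know are transient; retrying a validation error or an authentication failure just wastes the budget.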
11. Circuit breaker
Risk: A provider is degraded. Every call fails after a timeout. The agent keeps trying, burning time and money, instead of failing fast and routing to a fallback.
Solution: Use a circuit breaker that tracks the failure rate and opens the circuit after a threshold, failing fast until a cooldown period passes. llm-circuit-breaker-py implements the standard closed/open/half-open state machine.
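A stripped-down version of that state machine, as a sketch. It counts consecutive failures where a production breaker would track a failure rate, and the class, exception, and parameter names are my assumptions, not llm-circuit-breaker-py's API.

```python
import time


class CircuitOpen(Exception):
    """Raised instead of calling the provider while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown:
            return "half-open"  # cooldown over: allow one trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpen("failing fast; provider is in cooldown")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the counter.
        self._failures = 0
        self._opened_at = None
        return result
```

Catch `CircuitOpen` in the caller and route to the next provider. That is the piece that turns a thirty-second timeout per call into an immediate failover.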
12. Context window management
Risk: A long conversation or a large tool output causes the total token count to exceed the model's context window. The request fails with a context length error, often after the agent has already done significant work.
Solution: Track token counts and trim the message history before each model call to stay within the context limit. agentfit checks the window and prunes older messages while keeping tool_use/tool_result pairs intact. agent-message-window provides the paired-message sliding window primitive directly.
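The trimming logic can be sketched as follows, under loud assumptions: messages are simple dicts with a `role` and a string `content`, tool results use the role `"tool"`, and tokens are estimated at roughly four characters each. None of this is the agentfit or agent-message-window API, and real provider message formats differ.

```python
def estimate_tokens(message):
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(str(message["content"])) // 4)


def trim_history(messages, max_tokens, count=estimate_tokens):
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()

    # Never start the window with a tool result whose originating call was
    # cut off: an orphaned result is rejected by most chat APIs.
    while kept and kept[0]["role"] == "tool":
        kept.pop(0)
    return system + kept
```

Run it before every model call, not only when a request fails. The final loop is the pairing rule in miniature: a tool result without the call that produced it is worse than no result at all.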
13. Tool call recording for debugging
Risk: An agent produces the wrong answer and you cannot figure out which tool calls it made, what arguments it passed, or what it received back. Debugging becomes guesswork.
Solution: Record every tool call and response as a structured trace. agentsnap writes a JSONL trace for each run that you can replay or inspect after the fact.
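A JSONL trace is simple enough to sketch by hand. The wrapper below is an illustration with hypothetical names (`traced`, `load_trace`) and a made-up record layout, not agentsnap's format or API.

```python
import functools
import json
import time


def traced(tool_fn, trace_path):
    """Wrap a tool so every call appends one JSON line to trace_path."""

    @functools.wraps(tool_fn)
    def wrapper(**kwargs):
        entry = {"ts": time.time(), "tool": tool_fn.__name__, "args": kwargs}
        try:
            entry["result"] = tool_fn(**kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            # Written on success and on failure, so the trace is complete.
            with open(trace_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry, default=str) + "\n")

    return wrapper


def load_trace(trace_path):
    """Read the trace back as a list of dicts for inspection or replay."""
    with open(trace_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

One line per call keeps the file appendable and greppable. If you also scrub secrets from logs, run the same scrubber over the trace, since tool arguments are exactly where credentials tend to appear.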
14. Cost tracking per run
Risk: You have no visibility into what individual agent runs cost, making it impossible to spot expensive outliers, model pricing regressions, or prompt changes that doubled your token usage.
Solution: Wrap each run with a tracer that accumulates token counts and converts them to USD. agenttrace produces a per-run cost summary that you can log, alert on, or store.
from agenttrace import RunTracer
with RunTracer(model="claude-sonnet-4-6") as tracer:
    result = run_agent(prompt)
print(f"Run cost: ${tracer.total_usd:.4f}")
print(f"Input tokens: {tracer.input_tokens}")
print(f"Output tokens: {tracer.output_tokens}")
15. Stop conditions
Risk: You have a budget cap and a loop detector but they are separate and uncoordinated. Each has its own exception type. Your agent loop catches the wrong exception and keeps running.
Solution: Define explicit, composable stop conditions that the agent loop checks at each iteration. llm-stop-conditions provides MaxIterations, MaxUsd, and MaxTokens conditions that can be combined and checked in one call.
from llm_stop_conditions import StopConditions, MaxIterations, MaxUsd
stop = StopConditions([
    MaxIterations(max_iters=20),
    MaxUsd(max_usd=1.00, model="claude-sonnet-4-6"),
])
i = 0
total_tokens = 0
while not stop.should_stop(iteration=i, tokens_used=total_tokens):
    response = model.call(messages)
    # process response and add its token usage to total_tokens
    i += 1
How to Use This Checklist
Not every item applies to every agent. A batch-processing agent with no user input does not need prompt injection detection. An agent that only calls read-only APIs does not need an egress allowlist with the same urgency as one that can write to external services.
Go through the fifteen items and mark each as: already handled, not applicable, or needs work. For every "needs work" item there is a library above. Most take under twenty lines of integration code.
The goal is to know which six or eight apply to your agent and have them in place before the first production incident forces the issue.
Installing the Stack
# Safety layer
pip install agentvet agentguard prompt-shield tool-secret-scrubber llm-pii-redact
# Reliability layer
pip install llm-cost-cap agent-deadline tool-loop-guard llm-fallback-router llm-retry-py llm-circuit-breaker-py
# Context layer
pip install agentfit agent-message-window
# Observability layer
pip install agentsnap agenttrace
# Stop conditions
pip install llm-stop-conditions
All fifteen items above are covered by these packages. Each is independent. You can install only what you need.
What Is Next
These fifteen items cover the most common production failure modes. There are more specialized concerns beyond them: distributed budget tracking, replay-based debugging, and eval integration. Check the GitHub org at MukundaKatta if you are hitting something not covered here.