Your test suite is green. Every unit test passes. Integration tests pass. The agent generates correct SQL, summarizes documents accurately, routes requests to the right model.
You deploy to production.
Tuesday morning, the agent decides a table needs a new column. It writes the migration. Executes it. The column already existed under a different name. Data corruption. Three hours of downtime.
Every test passed because tests verify what agents do. Nobody verified what agents are allowed to do.
## The gap between testing and safety
Traditional software testing works because traditional software is deterministic. Given input X, function F always produces output Y. If it doesn't, there's a bug.
AI agents are stochastic. Given the same input, the same agent might take different paths, use different tools, produce different outputs. Testing captures a sample of behaviors. It cannot capture all possible behaviors.
This is the fundamental gap: tests are necessary but insufficient for agent safety.
What's missing is the contract layer — the set of behavioral boundaries that constrain what an agent can do regardless of what it decides to do.
## Runtime contracts for AI agents
AgentAssert introduces 14 contract types across 6 reliability pillars:
1. Behavioral contracts — Define what actions are permitted.
- Tool access contracts: which tools can this agent call?
- Output format contracts: what shape must the response take?
2. Drift contracts — Detect when behavior changes.
- Semantic drift: is the agent's output meaning shifting over time?
- Distribution drift: are tool call patterns changing?
3. Resource contracts — Prevent runaway costs.
- Token budget: maximum tokens per execution
- Time budget: maximum wall-clock time
- Cost ceiling: maximum dollar spend per invocation
4. Safety contracts — Hard boundaries.
- No data exfiltration: output cannot contain PII from input
- No unauthorized writes: agent cannot modify specified resources
- Scope limitation: agent stays within its designated domain
5. Quality contracts — Minimum output standards.
- Accuracy floor: output must meet minimum quality threshold
- Consistency requirement: output must be reproducible within bounds
6. Coordination contracts — Multi-agent boundaries.
- Handoff contracts: what context must be preserved between agents
- Conflict resolution: how to handle contradictory agent outputs
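To make the resource pillar concrete, here is a minimal sketch of how a token-budget contract could halt a runaway agent at runtime. This is an illustration only, not AgentAssert's actual implementation: `TokenBudgetExceeded`, the `record_tokens` callback, and the whitespace-split token estimate are all assumptions made for the example.

```python
# Sketch: a token-budget contract as a decorator.
# Illustrative only -- not AgentAssert's real internals.
import functools

class TokenBudgetExceeded(Exception):
    """Raised when an agent run exceeds its token allowance."""

def token_budget(max_tokens: int):
    """Halt the wrapped agent once cumulative token usage passes max_tokens."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            used = 0
            def record_tokens(text: str):
                nonlocal used
                used += len(text.split())  # crude stand-in for a real tokenizer
                if used > max_tokens:
                    raise TokenBudgetExceeded(f"{used} tokens > budget of {max_tokens}")
            # The agent reports its LLM traffic through the callback.
            return fn(*args, record_tokens=record_tokens, **kwargs)
        return wrapper
    return decorator

@token_budget(max_tokens=5)
def chatty_agent(query: str, record_tokens=None) -> str:
    record_tokens("one two three four five six")  # 6 "tokens" > budget of 5
    return "done"
```

Calling `chatty_agent("hi")` raises `TokenBudgetExceeded` mid-run: the budget is checked on every recorded chunk, so the agent is stopped as soon as the ceiling is crossed rather than after it finishes.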
## How it works
```python
from agentassert import Contract, enforce

@enforce(
    tool_access=["read_db", "write_log"],  # no write_db
    token_budget=4000,
    cost_ceiling=0.05,
    no_pii_leak=True,
)
def my_agent(query: str) -> str:
    # Agent logic here
    ...
```
The contract wraps the agent. If the agent tries to call write_db, the contract intercepts it before execution. If the agent exceeds its token budget, execution halts gracefully. If PII appears in the output, it's redacted before returning.
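The interception step can be sketched as an allow-list proxy sitting between the agent and its tools. Everything here is hypothetical: `ToolGate` and `ToolAccessViolation` are invented names for illustration, not AgentAssert's API.

```python
# Sketch: intercepting tool calls against an allow-list before execution.
# Hypothetical illustration -- not AgentAssert's actual interface.
class ToolAccessViolation(Exception):
    """Raised when an agent tries to call a tool outside its contract."""

class ToolGate:
    """Proxy that only forwards calls to explicitly permitted tools."""
    def __init__(self, tools: dict, allowed: list[str]):
        self._tools = tools
        self._allowed = set(allowed)

    def call(self, name: str, *args, **kwargs):
        if name not in self._allowed:
            # Blocked here -- the underlying tool is never invoked.
            raise ToolAccessViolation(f"tool '{name}' is not permitted")
        return self._tools[name](*args, **kwargs)

tools = {
    "read_db": lambda q: f"rows for {q}",
    "write_db": lambda q: "MUTATION",
    "write_log": lambda msg: None,
}
gate = ToolGate(tools, allowed=["read_db", "write_log"])
```

With this gate, `gate.call("read_db", "users")` succeeds, while `gate.call("write_db", ...)` raises before the write ever runs: the boundary holds no matter what the LLM decides.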
Contracts are enforced at runtime — not at test time. The agent cannot violate them regardless of what the LLM decides to do.
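The PII redaction mentioned above can be approximated with pattern matching. A minimal sketch, assuming regex-detectable PII; real detectors are far more sophisticated, and these two patterns are examples, not AgentAssert's rule set.

```python
# Sketch: redacting common PII patterns from agent output before returning.
# Illustrative only -- production PII detection needs much more than regexes.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-style numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
]

def redact_pii(text: str) -> str:
    """Replace anything matching a PII pattern with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact_pii("Contact alice@example.com, SSN 123-45-6789"))
# -> Contact [REDACTED], SSN [REDACTED]
```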
## Results
AgentAssert has been cited 3 times in 6 weeks. It's the only framework that covers all 6 reliability pillars in a single library.
Zero competitors implement the full contract stack. LangSmith monitors. Guardrails validates output format. Neither enforces behavioral boundaries at runtime.
The difference between "it works" and "it works safely" is a contract.
```shell
pip install agentassert
```
Paper: arXiv:2602.22302
AgentAssert is part of the Qualixar AI Reliability Engineering platform — 7 open-source tools, 7 peer-reviewed papers, zero cloud dependency.
Follow the build: @varunPbhardwaj