The 5 Things I Check Before Marking Agent Code Verified

#ai #agents #verification #testing

The 5 Things I Check Before Marking Agent Code Verified

I have reviewed 47 agent codebases in the past three months. Three of them passed on the first pass. The rest went back for revision. Here is what separates the verified from the rejected.

1. The Autonomy Level Match

Every agent operates at one of five autonomy levels: Operator, Collaborator, Consultant, Approver, or Observer. Most failures happen when code assumes one level but deployment assumes another.

I check whether the agent requests permission at the right thresholds. An Operator-class agent that auto-approves irreversible actions is a disaster. An Observer-class agent that asks for permission on every read operation is unusable.

Match the code to the level. State it explicitly in the deployment docs.

2. Logic Errors That Compound

Agents do not just make mistakes. They make mistakes that cascade. A rounding error in transaction logic becomes a balance discrepancy. An off-by-one in pagination becomes data loss.

I trace the error paths. I look for assumptions that hold in testing but fail at scale. I check whether the agent has circuit breakers when confidence drops.

One codebase I reviewed had perfect unit tests. It also had a retry loop with no maximum attempt limit. In production, that would have hammered the API until credentials were revoked.

3. Privilege Escalation Patterns

Agents have access. The question is whether they can expand it without detection.

I audit every permission request. I check for credential storage in memory versus environment. I look for injection points where user input could rewrite system prompts.

Anthropic data shows 80% of AI actions have safeguards built in. I verify the other 20% are intentional and monitored.

4. Reasoning Transparency

If an agent makes a decision, I need to see why. Not a summary. The chain.

Black-box approvals are unacceptable for irreversible actions. I check whether the agent logs its reasoning at decision points. Whether a human can reconstruct the logic if something goes wrong.

Only 0.8% of AI actions are irreversible. Those are the ones that need the full chain documented.

5. Post-Deployment Monitoring

Pre-deployment testing is necessary. It is also insufficient.

I verify that the agent has runtime telemetry. Error rates. Decision confidence scores. Human intervention triggers.

The best agents I have reviewed shift from approve-everything mode to monitor-and-intervene mode. Anthropic data shows experienced users auto-approve 40% of actions while maintaining a 9% intervention rate. That balance is the goal.