The demo is always convincing. You ask the agent to "find the overdue invoice for Acme and send a reminder," it reasons through the steps, calls a couple of tools, and reports success. Everyone nods. Then you put it in front of real traffic against real systems and it creates duplicate invoices on a retry, emails the wrong contact, or cheerfully reports success on an action that silently failed.
The reasoning was never the hard part. The hard part is the last mile: the layer where an agent stops talking and starts acting on systems of record like your CRM, your ticketing platform, or your ERP. That layer is ordinary, unglamorous distributed-systems engineering, and almost none of it is AI-specific. Here are the patterns that matter most.
Every Tool Is a Contract, Not a Suggestion
The single biggest source of agents "going rogue" is loose tool definitions. If a tool accepts free-form input and trusts the model to behave, the model eventually won't. Validate at the boundary, and put hard limits in code where the model can't talk its way past them.
    from typing import Literal

    from pydantic import BaseModel, Field

    class SendReminder(BaseModel):
        invoice_id: str = Field(pattern=r"^INV-\d{8}$")
        # Literal is enforced at validation time; json_schema_extra would
        # only annotate the schema without rejecting other values.
        channel: Literal["email", "sms"]
        # The model cannot send to an arbitrary address; it picks an
        # on-file contact by role, and code resolves the actual destination.
        recipient_role: Literal["billing", "ap_clerk"]

    def send_reminder(req: SendReminder) -> dict:
        invoice = load_invoice(req.invoice_id)  # 404s are real, handle them
        if invoice.status == "paid":
            return {"status": "skipped", "reason": "already_paid"}
        contact = resolve_contact(invoice.account_id, req.recipient_role)
        ...
Notice what the contract removes: the model never supplies a raw email address, never picks an invoice that doesn't match the ID format, and never overrides the "already paid" check. The agent proposes; deterministic code disposes.
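The "code resolves the actual destination" step could look something like the sketch below. It is an illustration only: the `CONTACTS` dictionary, the `ContactNotFound` exception, and the account ID `acct_acme` are hypothetical stand-ins for whatever CRM lookup you actually have.

```python
# Hypothetical on-file directory: (account_id, role) -> address.
# In a real system this would be a CRM lookup, not a dict.
CONTACTS = {
    ("acct_acme", "billing"): "billing@acme.example",
    ("acct_acme", "ap_clerk"): "ap@acme.example",
}

ALLOWED_ROLES = frozenset({"billing", "ap_clerk"})


class ContactNotFound(LookupError):
    """No on-file contact matches; the tool should surface this as a failure."""


def resolve_contact(account_id: str, role: str) -> str:
    # Re-check the role in code even though the schema already constrains it.
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unsupported recipient role: {role!r}")
    try:
        return CONTACTS[(account_id, role)]
    except KeyError:
        raise ContactNotFound(f"no {role} contact on file for {account_id}") from None
```

The point of the design is that the only addresses reachable from a tool call are ones already on file; a missing contact becomes an explicit error rather than a guess.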
Idempotency, Because Agents Retry
Agents retry. Networks fail mid-call. A user double-clicks. If the same logical action can execute twice and produce two effects, you have an incident waiting to happen, and "send payment" or "create ticket" are exactly the actions where a double-execution hurts.
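Make state-changing actions idempotent with a key derived from the intent, not from a random ID generated per attempt. If the same request can legitimately recur (a second reminder next month, say), fold a time window or conversation ID into the key so deliberate repeats are not suppressed: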
    from hashlib import sha256

    def create_ticket(account_id: str, summary: str, body: str) -> dict:
        # Same logical request => same key => at most one ticket.
        idem_key = sha256(f"{account_id}:{summary}:{body}".encode()).hexdigest()
        existing = tickets.find_by_idempotency_key(idem_key)
        if existing:
            return {"status": "exists", "ticket_id": existing.id}
        # Find-then-create can race under concurrency; back the key with a
        # unique constraint in the store so a second insert fails cleanly.
        return tickets.create(account_id, summary, body, idempotency_key=idem_key)
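If your downstream system supports idempotency keys natively (many payment and ticketing APIs do), pass them through. If it doesn't, enforce it in your own layer before the call, for example by recording each key in a store with a unique constraint so that a concurrent duplicate is rejected rather than executed.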
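One way to enforce idempotency in your own layer is to let a database unique constraint do the deduplication, which avoids the gap between "check" and "create". The sketch below uses an in-memory SQLite table as a stand-in for a real store; the table layout and the names `intent_key` and `create_ticket_once` are assumptions made for illustration.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tickets ("
    " id INTEGER PRIMARY KEY,"
    " idempotency_key TEXT NOT NULL UNIQUE,"
    " account_id TEXT NOT NULL,"
    " summary TEXT NOT NULL)"
)


def intent_key(account_id: str, summary: str) -> str:
    # JSON-encode the fields so ("a:b", "c") and ("a", "b:c") cannot collide.
    payload = json.dumps([account_id, summary], separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()


def create_ticket_once(account_id: str, summary: str) -> dict:
    key = intent_key(account_id, summary)
    with db:  # commits on success, rolls back on exception
        cur = db.execute(
            "INSERT OR IGNORE INTO tickets (idempotency_key, account_id, summary)"
            " VALUES (?, ?, ?)",
            (key, account_id, summary),
        )
        created = cur.rowcount == 1  # 0 means the unique key already existed
        row = db.execute(
            "SELECT id FROM tickets WHERE idempotency_key = ?", (key,)
        ).fetchone()
    return {"status": "created" if created else "exists", "ticket_id": row[0]}
```

Calling `create_ticket_once` twice with the same arguments returns the same `ticket_id` both times and leaves a single row in the table, regardless of which attempt arrives first.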
Permissions Belong to the User, Not the Agent
A subtle and dangerous mistake: running every agent action with the agent's own service-account privileges. Now any user who can chat with the agent can implicitly do anything the agent can do, including reading records they should never see.
The agent should act on behalf of the requesting user, carrying that user's authorization to every tool call. Retrieval is the easy place to get this wrong: filtering results after the fact is fragile, so scope the query itself so unauthorized records are never candidates.
    def search_accounts(query: str, *, acting_user: User) -> list[Account]:
        # The user's scope is part of the query, not a post-filter.
        return crm.search(query, visibility=acting_user.account_scope)
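To make "scope is part of the query" concrete, here is a small sketch against an in-memory SQLite table. The `User` shape, the `accounts` table, and the sample rows are all hypothetical; the only point is that the user's scope lands in the `WHERE` clause, so out-of-scope rows never leave the database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    account_scope: frozenset[str]  # account IDs this user may see (assumed model)


crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
crm.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("acme", "Acme Corp"), ("acme_eu", "Acme Europe"), ("globex", "Globex")],
)


def search_accounts(query: str, *, acting_user: User) -> list[tuple[str, str]]:
    scope = sorted(acting_user.account_scope)
    if not scope:
        return []  # an empty scope means no candidates, never "everything"
    placeholders = ", ".join("?" for _ in scope)
    sql = (
        "SELECT id, name FROM accounts"
        f" WHERE name LIKE ? AND id IN ({placeholders})"
        " ORDER BY id"
    )
    # Scope values are bound as parameters, not interpolated into the SQL.
    return crm.execute(sql, (f"%{query}%", *scope)).fetchall()
```

A user scoped to `acme` who searches for "acme" gets only the `acme` row back, even though `acme_eu` also matches the text. There is no post-filter that could be forgotten or bypassed.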
Plan for Partial Failure and Honest Reporting
A multi-step action will sometimes get halfway and fail. The worst outcome is an agent that reports "Done!" when step three threw an exception. Two rules:
- Never let the model narrate success it didn't verify. Tool results, not the model's optimism, determine what the agent tells the user. If a call failed, the failure propagates.
- Decide your transaction story up front. Either make the sequence atomic where the systems allow it, or design compensating actions (if you created the order but the payment failed, you cancel the order). Silent half-completed workflows are how data integrity quietly erodes.
    def fulfill(order):
        created = create_order(order)  # idempotent
        try:
            charge_payment(order)  # may fail
        except PaymentError:
            cancel_order(created.id)  # compensate, then surface the failure
            raise
        return created
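The first rule, that tool results rather than the model decide what the user is told, can be enforced by building the final status from recorded outcomes. The following is a minimal sketch under that assumption; `StepResult` and `report` are illustrative names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    step: str
    ok: bool
    detail: str = ""


def report(results: list[StepResult]) -> str:
    # The reply is derived from recorded outcomes, never from the model's draft.
    if not results:
        return "Nothing was executed."
    failed = [r for r in results if not r.ok]
    if not failed:
        return "Completed: " + ", ".join(r.step for r in results) + "."
    done = [r.step for r in results if r.ok]
    parts = [f"{r.step} failed ({r.detail or 'no detail'})" for r in failed]
    prefix = f"Completed: {', '.join(done)}. " if done else ""
    return prefix + "Not completed: " + "; ".join(parts) + "."
```

With this shape, a failed `charge_payment` step cannot be narrated as success: the model may rephrase the message, but the facts in it come from the step records.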
High-Stakes Actions Get a Human Checkpoint
Full autonomy is rarely the right design for actions that move money, delete data, or contact customers. The more reliable pattern is a confident draft plus a human approval step. This is frequently the difference between a system the business will actually authorize and one stuck in pilot forever, and it costs you very little: the agent does all the work, a human just clicks approve on the irreversible part.
Make the threshold explicit and enforce it in code:
    def execute(action):
        if action.risk == "irreversible" or action.amount_cents > AUTO_LIMIT:
            return queue_for_human_approval(action)
        return run(action)
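A slightly fuller sketch of the same gate, with a toy in-memory approval queue, might look like this. The `Action` and `ApprovalQueue` types and the `AUTO_LIMIT` value are assumptions for illustration; a real queue would be durable and would actually run the action on approval.

```python
from dataclasses import dataclass, field
from itertools import count

AUTO_LIMIT = 50_000  # cents; hypothetical threshold


@dataclass
class Action:
    name: str
    amount_cents: int = 0
    risk: str = "reversible"  # or "irreversible"


@dataclass
class ApprovalQueue:
    pending: dict[int, Action] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def submit(self, action: Action) -> int:
        ticket = next(self._ids)
        self.pending[ticket] = action
        return ticket

    def approve(self, ticket: int, approver: str) -> dict:
        # pop() raises KeyError for unknown or already-decided tickets,
        # so an approval cannot be replayed.
        action = self.pending.pop(ticket)
        # A real system would run the action here; this sketch only records it.
        return {"status": "executed", "action": action.name, "approved_by": approver}


queue = ApprovalQueue()


def execute(action: Action) -> dict:
    if action.risk == "irreversible" or action.amount_cents > AUTO_LIMIT:
        return {"status": "pending_approval", "ticket": queue.submit(action)}
    return {"status": "executed", "action": action.name, "approved_by": None}
```

Low-risk actions run straight through, while anything irreversible or above the limit comes back as `pending_approval` with a ticket that a named human has to approve.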
You Cannot Debug What You Did Not Trace
When a user says "the agent messed up my account," you need to replay exactly what happened, not reconstruct it from optimism and partial logs. Capture the full chain for every action: the user input, the model's tool selection and arguments, the raw tool results, and the final response. This is the same trace you'll use to build evaluations, so design it once and use it for both.
    trace_id: 8f2c...
    user: "send the overdue reminder for Acme"
    acting_user: u_4471 (scope: account:acme)
    tool_call: send_reminder(invoice_id=INV-00038122, channel=email, recipient_role=billing)
    resolved_recipient: billing@acme.example   # resolved by code, not the model
    tool_result: {status: sent, message_id: m_99...}
    agent_reply: "Reminder sent to Acme's billing contact."
With this, "the agent messed up" becomes a five-minute investigation instead of a guessing game.
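A trace like this does not need heavy tooling to get started. The sketch below is one possible shape, assuming an append-only list of events serialized as JSON; the `Trace` class and its field names are illustrative, not a standard.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    user_input: str
    acting_user: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list[dict] = field(default_factory=list)

    def record(self, kind: str, **data) -> None:
        # Append-only: each event gets a sequence number so order survives storage.
        self.events.append({"seq": len(self.events), "kind": kind, **data})

    def to_json(self) -> str:
        return json.dumps(
            {
                "trace_id": self.trace_id,
                "user_input": self.user_input,
                "acting_user": self.acting_user,
                "events": self.events,
            },
            sort_keys=True,
        )


trace = Trace("send the overdue reminder for Acme", acting_user="u_4471")
trace.record("tool_call", name="send_reminder", args={"invoice_id": "INV-00038122"})
trace.record("tool_result", name="send_reminder", result={"status": "sent"})
trace.record("agent_reply", text="Reminder sent to Acme's billing contact.")
```

Because every event is plain data, the same records can be replayed during an investigation or loaded later as evaluation cases.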
The Takeaway
An AI agent is only as good as the boundary between its reasoning and your systems of record. The intelligence gets the headlines, but reliability lives in the boring layer: typed tool contracts, idempotency, per-user authorization, partial-failure handling, human checkpoints for irreversible actions, and end-to-end tracing. Get that layer right and the agent becomes something the business can actually trust with real work. Skip it, and you have a very impressive demo.
I work on AI engineering at Wizr AI, where building custom AI applications is the day job. Happy to compare integration war stories in the comments.