<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marketing wizr</title>
    <description>The latest articles on DEV Community by Marketing wizr (@marketing_wizr_f14586ace9).</description>
    <link>https://dev.to/marketing_wizr_f14586ace9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4009213%2Fb73b0f67-4481-4518-881e-d182120832cd.png</url>
      <title>DEV Community: Marketing wizr</title>
      <link>https://dev.to/marketing_wizr_f14586ace9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marketing_wizr_f14586ace9"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Works in the Demo and Breaks in Production. The Problem Is the Last Mile.</title>
      <dc:creator>Marketing wizr</dc:creator>
      <pubDate>Tue, 30 Jun 2026 08:45:10 +0000</pubDate>
      <link>https://dev.to/marketing_wizr_f14586ace9/your-ai-agent-works-in-the-demo-and-breaks-in-production-the-problem-is-the-last-mile-1gnc</link>
      <guid>https://dev.to/marketing_wizr_f14586ace9/your-ai-agent-works-in-the-demo-and-breaks-in-production-the-problem-is-the-last-mile-1gnc</guid>
      <description>&lt;p&gt;The demo is always convincing. You ask the agent to "find the overdue invoice for Acme and send a reminder," it reasons through the steps, calls a couple of tools, and reports success. Everyone nods. Then you put it in front of real traffic against real systems and it creates duplicate invoices on a retry, emails the wrong contact, or cheerfully reports success on an action that silently failed.&lt;/p&gt;

&lt;p&gt;The reasoning was never the hard part. The hard part is the &lt;strong&gt;last mile&lt;/strong&gt;: the layer where an agent stops talking and starts acting on systems of record like your CRM, your ticketing platform, or your ERP. That layer is ordinary, unglamorous distributed-systems engineering, and almost none of it is AI-specific. Here are the patterns that matter most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every Tool Is a Contract, Not a Suggestion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The single biggest source of agents "going rogue" is loose tool definitions. If a tool accepts free-form input and trusts the model to behave, the model eventually won't. Validate at the boundary, and put hard limits in code where the model can't talk its way past them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SendReminder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^INV-\d{8}$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_schema_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="c1"&gt;# The model cannot send to an arbitrary address; it picks an
&lt;/span&gt;    &lt;span class="c1"&gt;# on-file contact by role, and code resolves the actual destination.
&lt;/span&gt;    &lt;span class="n"&gt;recipient_role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_schema_extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap_clerk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_reminder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SendReminder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;invoice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# 404s are real, handle them
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;already_paid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;contact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resolve_contact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recipient_role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what the contract removes: the model never supplies a raw email address, never picks an invoice that doesn't match the ID format, and never overrides the "already paid" check. The agent proposes; deterministic code disposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency, Because Agents Retry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents retry. Networks fail mid-call. A user double-clicks. If the same logical action can execute twice and produce two effects, you have an incident waiting to happen, and "send payment" or "create ticket" are exactly the actions where a double-execution hurts.&lt;/p&gt;

&lt;p&gt;Make state-changing actions idempotent with a key derived from the intent, not from a random ID generated per attempt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Same logical request =&amp;gt; same key =&amp;gt; at most one ticket.
&lt;/span&gt;    &lt;span class="n"&gt;idem_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_by_idempotency_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your downstream system supports idempotency keys natively (many payment and ticketing APIs do), pass them through. If it doesn't, enforce it in your own layer before the call.&lt;/p&gt;
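&lt;p&gt;For a downstream system without native idempotency keys, one way to enforce this in your own layer is to check and record the key in a single atomic step around the call. The sketch below is illustrative only: &lt;code&gt;intent_key&lt;/code&gt;, &lt;code&gt;IdempotencyStore&lt;/code&gt;, and &lt;code&gt;run_once&lt;/code&gt; are hypothetical names, and the in-memory dict stands in for durable shared storage, such as a table with a unique constraint on the key.&lt;/p&gt;

```python
import hashlib
import json
import threading
from typing import Any, Callable


def intent_key(action: str, **fields: Any) -> str:
    """Derive a stable key from the intent; JSON encoding avoids delimiter ambiguity."""
    payload = json.dumps({"action": action, "fields": fields}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class IdempotencyStore:
    """Hypothetical in-memory store; a real one needs durable, shared storage."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> tuple[Any, bool]:
        """Run effect at most once per key. Returns (result, was_replayed)."""
        with self._lock:  # the check and the record happen under one lock
            if key in self._results:
                return self._results[key], True
            result = effect()  # if this raises, nothing is recorded
            self._results[key] = result
            return result, False


store = IdempotencyStore()
created: list[str] = []


def create_ticket() -> str:
    """Stand-in for the real side effect."""
    created.append("T-1")
    return "T-1"


key = intent_key("create_ticket", account_id="acme", summary="Printer down")
first = store.run_once(key, create_ticket)
retry = store.run_once(key, create_ticket)
```

&lt;p&gt;Holding one lock across the side effect keeps the sketch short but serializes every action; a production version would more likely claim the key per request in the database.&lt;/p&gt;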

&lt;p&gt;&lt;strong&gt;Permissions Belong to the User, Not the Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A subtle and dangerous mistake: running every agent action with the agent's own service-account privileges. Now any user who can chat with the agent can implicitly do anything the agent can do, including reading records they should never see.&lt;/p&gt;

&lt;p&gt;The agent should act &lt;strong&gt;on behalf of&lt;/strong&gt; the requesting user, carrying that user's authorization into every tool call. Retrieval is the easiest place to get this wrong. Filtering results after the fact is fragile, so scope the query itself, which ensures unauthorized records are never candidates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_accounts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acting_user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Account&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# The user's scope is part of the query, not a post-filter.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;crm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;visibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;acting_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plan for Partial Failure and Honest Reporting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A multi-step action will sometimes get halfway and fail. The worst outcome is an agent that reports "Done!" when step three threw an exception. Two rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never let the model narrate success it didn't verify.&lt;/strong&gt; Tool results, not the model's optimism, determine what the agent tells the user. If a call failed, the failure propagates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide your transaction story up front.&lt;/strong&gt; Either make the sequence atomic where the systems allow it, or design compensating actions (if you created the order but the payment failed, you cancel the order). Silent half-completed workflows are how data integrity quietly erodes.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fulfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# idempotent
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;charge_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# may fail
&lt;/span&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PaymentError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;cancel_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# compensate, then surface the failure
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
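&lt;p&gt;The first rule, that tool results rather than the model decide what the user hears, can be enforced with a small wrapper. This is a sketch under assumptions: &lt;code&gt;ToolResult&lt;/code&gt; and &lt;code&gt;user_reply&lt;/code&gt; are hypothetical names, and a real system would carry richer error detail.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical result envelope returned by every tool."""
    ok: bool
    detail: str


def user_reply(draft_from_model: str, result: ToolResult) -> str:
    """Pass the model's wording through only when the tool verifiably succeeded."""
    if result.ok:
        return draft_from_model
    # On failure, the verified tool result overrides the model's draft
    return f"That action did not complete: {result.detail}"


success = user_reply("Reminder sent.", ToolResult(ok=True, detail="message queued"))
failure = user_reply("Done!", ToolResult(ok=False, detail="mail server timeout"))
```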



&lt;p&gt;&lt;strong&gt;High-Stakes Actions Get a Human Checkpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full autonomy is rarely the right design for actions that move money, delete data, or contact customers. The more reliable pattern is a confident draft plus a human approval step. This is frequently the difference between a system the business will actually authorize and one stuck in pilot forever, and it costs you very little: the agent does all the work, a human just clicks approve on the irreversible part.&lt;/p&gt;

&lt;p&gt;Make the threshold explicit and enforce it in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;irreversible&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_cents&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AUTO_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;queue_for_human_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;You Cannot Debug What You Did Not Trace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user says "the agent messed up my account," you need to replay exactly what happened, not reconstruct it from optimism and partial logs. Capture the full chain for every action: the user input, the model's tool selection and arguments, the raw tool results, and the final response. This is the same trace you'll use to build evaluations, so design it once and use it for both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trace_id: 8f2c...
  user: "send the overdue reminder for Acme"
  acting_user: u_4471 (scope: account:acme)
  tool_call: send_reminder(invoice_id=INV-00038122, channel=email, recipient_role=billing)
  resolved_recipient: billing@acme.example   # resolved by code, not the model
  tool_result: {status: sent, message_id: m_99...}
  agent_reply: "Reminder sent to Acme's billing contact."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, "the agent messed up" becomes a five-minute investigation instead of a guessing game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An AI agent is only as good as the boundary between its reasoning and your systems of record. The intelligence gets the headlines, but reliability lives in the boring layer: typed tool contracts, idempotency, per-user authorization, partial-failure handling, human checkpoints for irreversible actions, and end-to-end tracing. Get that layer right and the agent becomes something the business can actually trust with real work. Skip it, and you have a very impressive demo.&lt;/p&gt;

&lt;p&gt;I work on AI engineering at Wizr AI, where &lt;a href="https://wizr.ai/custom-ai-application-development-services/" rel="noopener noreferrer"&gt;custom AI application development services&lt;/a&gt; are the day job. More on us as a &lt;a href="https://wizr.ai/generative-ai-software-development-company/" rel="noopener noreferrer"&gt;generative AI software development company&lt;/a&gt; if you're curious. Happy to compare integration war stories in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>backend</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Two Places Generative AI Shows Up When You Ship a Custom AI Application</title>
      <dc:creator>Marketing wizr</dc:creator>
      <pubDate>Tue, 30 Jun 2026 08:20:52 +0000</pubDate>
      <link>https://dev.to/marketing_wizr_f14586ace9/the-two-places-generative-ai-shows-up-when-you-ship-a-custom-ai-application-1l5c</link>
      <guid>https://dev.to/marketing_wizr_f14586ace9/the-two-places-generative-ai-shows-up-when-you-ship-a-custom-ai-application-1l5c</guid>
      <description>&lt;p&gt;Most write-ups about "AI development" quietly conflate two very different activities. One is &lt;strong&gt;building software that uses generative AI&lt;/strong&gt; as a core capability: copilots, retrieval systems, autonomous agents. The other is &lt;strong&gt;using generative AI to build software&lt;/strong&gt;: code generation, test synthesis, legacy modernization. They share a buzzword and almost nothing else. The skills, the risks, and the discipline required are different, and teams that treat them as one thing tend to get burned on both.&lt;/p&gt;

&lt;p&gt;If you're shipping a custom AI application, you will run into both at once. This post is a practical map of where each shows up, what tends to break, and how to keep the speed without inheriting the fragility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track 1: Generative AI as the Product&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the AI is the feature, the engineering challenge is not "call a model." It's everything wrapped around the call that decides whether the thing is correct, safe, and maintainable.&lt;/p&gt;

&lt;p&gt;A few realities that separate a working demo from a deployable application:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval quality is the silent killer.&lt;/strong&gt; Most custom AI apps lean on Retrieval-Augmented Generation (RAG). The naive version (embed documents, do a similarity search, stuff results into the prompt) hides every decision that actually matters. Fixed-size chunking severs context. Pure vector search whiffs on exact identifiers like error codes or SKUs, where a hybrid of dense vectors plus keyword search does far better. And if the model can answer without citing which retrieved chunk supports the claim, you have no mechanism to detect hallucination in production.&lt;/p&gt;
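&lt;p&gt;One common way to combine dense and keyword results is reciprocal rank fusion (RRF), which merges ranked lists without requiring their scores to be comparable. The sketch below assumes you already have two ranked lists of chunk IDs; the function name, the sample IDs, and the constant &lt;code&gt;k=60&lt;/code&gt; (a conventional default) are illustrative choices, not something this post prescribes.&lt;/p&gt;

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of IDs; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query that mentions the error code E4012
dense_hits = ["chunk_checkout_overview", "chunk_payment_faq", "chunk_e4012"]
keyword_hits = ["chunk_e4012", "chunk_error_index"]

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

&lt;p&gt;Here the chunk containing the exact identifier ranks last in the dense list but first after fusion, because the keyword list also found it.&lt;/p&gt;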

&lt;p&gt;&lt;strong&gt;Use the smallest control structure that works.&lt;/strong&gt; "Autonomous multi-agent system" is rarely the right starting point. Reliability drops and debugging cost climbs with every layer of autonomy you add. A ticket-classification task needs one well-prompted call with a typed output, not three agents deliberating. Reserve orchestration for problems that genuinely require planning and tool use you can't predetermine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails live in code, not prompts.&lt;/strong&gt; Anything that must always be true (a spending cap, a permission check, a rate limit) belongs in deterministic code that runs no matter what the model decided. A prompt is a request, not an invariant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The model can *propose* a refund. Code decides whether it's allowed.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;issue_refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_cents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;amount_cents&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_AUTO_REFUND_CENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate_to_human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# invariant enforced here
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;user_owns_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;PermissionError&lt;/span&gt;                     &lt;span class="c1"&gt;# never trust the prompt
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process_refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Evaluation is non-negotiable.&lt;/strong&gt; The defining question is "did that change make the system better or worse?" Without a versioned evaluation set that you run on every prompt tweak and model swap, every change is a guess. It doesn't need to be huge; a few dozen well-chosen cases catch a surprising number of regressions. Because outputs are non-deterministic, your metrics should be thresholds over a sample of runs, not pass/fail on a single run.&lt;/p&gt;
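&lt;p&gt;A threshold-over-a-sample check can be very small. In the sketch below, &lt;code&gt;pass_rate&lt;/code&gt;, &lt;code&gt;meets_threshold&lt;/code&gt;, and the stand-in classifier are hypothetical; in practice the callable would wrap your model call, and the 0.9 threshold is an arbitrary example value.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical evaluation cases: (input, expected label)
EVAL_SET = [
    ("I was charged twice", "billing"),
    ("App crashes on login", "bug"),
    ("How do I export data?", "how_to"),
    ("Refund my last invoice", "billing"),
]


def pass_rate(system: Callable[[str], str], runs_per_case: int = 5) -> float:
    """Run every case several times and return the fraction of correct outputs."""
    total = 0
    passed = 0
    for text, expected in EVAL_SET:
        for _ in range(runs_per_case):
            total += 1
            if system(text) == expected:
                passed += 1
    return passed / total


def meets_threshold(system: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Gate on the aggregate rate, not on any single run."""
    return pass_rate(system) >= threshold


def keyword_classifier(text: str) -> str:
    """Deterministic stand-in for a model call, used only to exercise the harness."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "how_to"


rate = pass_rate(keyword_classifier)
```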

&lt;p&gt;&lt;strong&gt;Track 2: Generative AI as the Way You Build&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second track is generative AI accelerating the development lifecycle itself, and it follows one governing principle: &lt;strong&gt;generation is the cheap part; ownership is the expensive part.&lt;/strong&gt; A model can produce fifty plausible lines in seconds. Reviewing, testing, securing, and maintaining those lines for years costs just as much as if a human had written them.&lt;/p&gt;

&lt;p&gt;What changes behavior is treating AI output as a draft and the reviewer as fully accountable for it. "The model wrote it" is not a defense in a postmortem. Watch for the failure modes assistants over-produce: subtly wrong edge cases (empty collections, timezones, integer truncation), hallucinated or outdated APIs, and security anti-patterns like string-concatenated SQL that they reproduce from training data.&lt;/p&gt;

&lt;p&gt;The practical safeguard is a deterministic gate before human review. AI raises the &lt;em&gt;volume&lt;/em&gt; of code flowing into review, so the automated floor under that review has to be solid enough to absorb the increase without humans rubber-stamping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type-checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="nf"&gt;run_type_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linter clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="nf"&gt;run_linter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no vuln patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="nf"&gt;run_sast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no secrets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="nf"&gt;scan_for_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests non-trivial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="nf"&gt;tests_meaningful&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage not reduced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;coverage_delta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ReviewBlocked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI change failed gates: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready_for_human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test generation is the highest-leverage use, with one caveat about the direction of trust. Generating tests for &lt;em&gt;existing, human-written&lt;/em&gt; code is safe and valuable, because the code is the trusted artifact. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected" behavior. Keep a human-authored specification of intended behavior as the anchor.&lt;/p&gt;
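
&lt;p&gt;One way to keep the specification as the anchor is to write the intended behavior as a human-owned table of cases and run any implementation against it, model-written or not. This is a minimal sketch, not a prescribed setup: &lt;code&gt;late_fee&lt;/code&gt;, the grace period, the fee percentage, and the cap are all hypothetical and stand in for whatever your real policy says.&lt;/p&gt;

```python
# Human-authored specification: each row is (days_overdue, balance, expected_fee).
# The expected values come from a hypothetical billing policy, not from the code.
SPEC_CASES = [
    (0, 100.0, 0.0),     # not overdue: no fee
    (10, 100.0, 0.0),    # inside the assumed 14-day grace period
    (15, 100.0, 2.0),    # assumed 2% fee once past the grace period
    (15, 5000.0, 50.0),  # assumed cap of 50.0 per invoice
]


def late_fee(days_overdue, balance):
    """Stand-in for a model-generated implementation under review."""
    if days_overdue > 14:
        return min(balance * 0.02, 50.0)
    return 0.0


def check_against_spec(func, cases):
    """Return the spec rows that the implementation gets wrong."""
    failures = []
    for days, balance, expected in cases:
        actual = func(days, balance)
        if abs(actual - expected) > 1e-9:
            failures.append((days, balance, expected, actual))
    return failures


if __name__ == "__main__":
    # An empty list means the implementation matches every spec row.
    print(check_against_spec(late_fee, SPEC_CASES))
```

&lt;p&gt;The point of the design is ownership: the model may regenerate &lt;code&gt;late_fee&lt;/code&gt; as often as you like, but only a human edits &lt;code&gt;SPEC_CASES&lt;/code&gt;.&lt;/p&gt;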

&lt;p&gt;Legacy modernization is where this track is most seductive and most dangerous. A model will translate an old module into idiomatic modern code while silently dropping a side effect some downstream system depends on. The discipline that works: modernize in small increments, and use characterization tests (tests that capture the legacy code's &lt;em&gt;existing&lt;/em&gt; behavior, quirks included) as the contract the new code must satisfy.&lt;/p&gt;
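
&lt;p&gt;A characterization test can be as small as recording what the legacy code returns today and diffing the rewrite against that record. The sketch below is illustrative only: &lt;code&gt;legacy_format_amount&lt;/code&gt; and &lt;code&gt;modern_format_amount&lt;/code&gt; are hypothetical functions invented for the example, with a truncation quirk that the rewrite "fixes" by accident.&lt;/p&gt;

```python
def legacy_format_amount(cents):
    """Hypothetical legacy routine. Quirk: it truncates instead of rounding."""
    text = "$" + str(abs(cents) // 100)  # 199 cents becomes "$1"
    if cents >= 0:
        return text
    return "(" + text + ")"  # negative amounts are shown in parentheses


def modern_format_amount(cents):
    """Hypothetical rewrite that rounds, silently changing behavior."""
    text = "$" + str(round(abs(cents) / 100))
    if cents >= 0:
        return text
    return "(" + text + ")"


def record_characterization(func, inputs):
    """Run the legacy function and pin whatever it returns today."""
    return {value: func(value) for value in inputs}


def check_characterization(func, golden):
    """Return {input: (pinned, actual)} wherever the new code disagrees."""
    diffs = {}
    for value, expected in golden.items():
        actual = func(value)
        if actual != expected:
            diffs[value] = (expected, actual)
    return diffs


# Pin the legacy behavior first, then hold the rewrite to it.
GOLDEN = record_characterization(legacy_format_amount, [0, 199, 250, -199])
DIFFS = check_characterization(modern_format_amount, GOLDEN)

if __name__ == "__main__":
    print(DIFFS)
```

&lt;p&gt;Here the rewrite disagrees on 199 and -199, which is exactly the kind of quiet change a downstream consumer would trip over. Return values are the easy case; side effects such as writes or emitted events would need to be recorded the same way, for example through a fake that logs each call.&lt;/p&gt;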

&lt;p&gt;&lt;strong&gt;Where the Two Tracks Meet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shipping a custom AI application means running both tracks at the same time, and the same engineering values turn out to govern each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Determinism around non-determinism.&lt;/strong&gt; Whether it's a model deciding to issue a refund or a model writing the refund code, the safety net is a set of deterministic checks that hold even when the model misbehaves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation over vibes.&lt;/strong&gt; Track 1 needs faithfulness evals; Track 2 needs change-failure rate and defect-escape rate. Both replace "it worked when I tried it" with a measurement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human accountability at the boundary.&lt;/strong&gt; A high-stakes agent action gets a human approval checkpoint; a high-stakes code change gets a human reviewer who owns it. Same pattern.&lt;/li&gt;
&lt;/ul&gt;
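
&lt;p&gt;The two Track 2 metrics named in the list above are simple ratios. The sketch below assumes the common definitions (change-failure rate as failed deployments over total deployments, defect-escape rate as defects first found in production over all defects found) and uses made-up counts; your team may draw the lines differently.&lt;/p&gt;

```python
def change_failure_rate(failed_deploys, total_deploys):
    """Share of deployments that needed a rollback, hotfix, or patch."""
    if total_deploys == 0:
        return 0.0
    return failed_deploys / total_deploys


def defect_escape_rate(found_in_production, found_before_release):
    """Share of all known defects that were first seen in production."""
    total = found_in_production + found_before_release
    if total == 0:
        return 0.0
    return found_in_production / total


# Made-up numbers for one release cycle, for illustration only.
cfr = change_failure_rate(failed_deploys=3, total_deploys=40)
der = defect_escape_rate(found_in_production=2, found_before_release=18)

if __name__ == "__main__":
    print(f"change-failure rate: {cfr:.1%}")  # prints 7.5%
    print(f"defect-escape rate: {der:.1%}")   # prints 10.0%
```

&lt;p&gt;Tracking these separately for AI-assisted and human-only changes is one way to see whether the review gates are holding.&lt;/p&gt;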

&lt;p&gt;A rough readiness check before you call a custom AI application "done":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ ] Retrieval returns cited, verifiable sources
[ ] Hard business rules enforced in code, not prompts
[ ] Versioned eval set runs on every model/prompt change
[ ] Full request tracing captured (input → retrieval → prompt → output)
[ ] Idempotency on every state-changing action
[ ] Graceful degradation when the model is unavailable
[ ] PII handling and access control on retrieval
[ ] AI-assisted code passes the same gates as human code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If most of those boxes are empty, you have a prototype, not a product, no matter how good the demo looked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The hype frames generative AI as a single revolution. In practice it's two distinct disciplines that happen to share a name, and a custom AI application sits at their intersection. The teams that win aren't the ones generating the most code or wiring up the most agents. They're the ones whose evaluation, guardrails, and review gates are strong enough that more AI (in the product &lt;em&gt;and&lt;/em&gt; in the process) makes them faster without making them fragile.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I work on AI engineering at Wizr AI, where we build &lt;a href="https://wizr.ai/custom-ai-application-development-services/" rel="noopener noreferrer"&gt;custom AI applications&lt;/a&gt; and use &lt;a href="https://wizr.ai/generative-ai-software-development-company/" rel="noopener noreferrer"&gt;generative AI across the software development lifecycle&lt;/a&gt;. These tradeoffs are part of the daily job. Happy to compare notes in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
