<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kim-Like</title>
    <description>The latest articles on DEV Community by Kim-Like (@kimlike).</description>
    <link>https://dev.to/kimlike</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3964587%2Fbfe3513b-9f71-40e6-8e38-48de414a6356.jpg</url>
      <title>DEV Community: Kim-Like</title>
      <link>https://dev.to/kimlike</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kimlike"/>
    <language>en</language>
    <item>
      <title>The null input that broke my production agent and what fixed it</title>
      <dc:creator>Kim-Like</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:22:00 +0000</pubDate>
      <link>https://dev.to/kimlike/the-null-input-that-broke-my-production-agent-and-what-fixed-it-1e77</link>
      <guid>https://dev.to/kimlike/the-null-input-that-broke-my-production-agent-and-what-fixed-it-1e77</guid>
      <description>&lt;p&gt;The demo ran flawlessly for three weeks. Every test input parsed clean, every output routed correctly, and I thought we had a reliable system.&lt;/p&gt;

&lt;p&gt;Then a supplier sent a confirmation email with an empty subject line.&lt;/p&gt;

&lt;p&gt;The agent, which was supposed to extract order references and route them into a queue, got a null where it expected a string. It didn't crash. That would have been better. Instead it generated a plausible-looking order reference, routed it, and the downstream system processed it as if it were real. Nobody caught it for four hours.&lt;/p&gt;

&lt;p&gt;That is the demo problem: demos use inputs that look like what you expect. Production does not.&lt;/p&gt;

&lt;h2&gt;What the demo hid&lt;/h2&gt;

&lt;p&gt;I built and run the agent operation at aienterprise.dk, so I control every layer of the stack. When this broke, I could see the full trace. The agent's system prompt said "extract the order reference from the subject line." Sensible instruction. Works every time the subject line exists.&lt;/p&gt;

&lt;p&gt;When it doesn't exist, a well-prompted LLM doesn't say "I cannot find an order reference." It fills the gap. It invents something that looks right. The hallucination isn't random noise. It's plausible, structured noise. That's what makes it dangerous. A random failure is easy to catch. A confident, well-formatted wrong answer is not.&lt;/p&gt;

&lt;p&gt;In the demo, I never sent an email with a null subject. I never thought to. The input felt so basic I didn't consider it an edge case. It isn't an edge case in production. It's Tuesday.&lt;/p&gt;

&lt;h2&gt;The unglamorous fix&lt;/h2&gt;

&lt;p&gt;I didn't retrain anything. I didn't adjust the prompt. I added a guard before the model call.&lt;/p&gt;

&lt;p&gt;Before the agent touches input now, a deterministic check runs: is the subject field present and non-empty? If not, the message routes to a hold queue with a flag. A human reviews it. The agent never sees the malformed input.&lt;/p&gt;

&lt;p&gt;That guard is twelve lines of code. It's the least interesting thing I built all year. It's also what makes the agent reliable.&lt;/p&gt;
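
&lt;p&gt;A minimal sketch of a guard like this, in Python. It is not the production code: the Email shape and the queue names "hold" and "agent" are assumptions made up for illustration. The point is only that the check is deterministic and runs before any model call.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    # Hypothetical message shape; the real pipeline's fields are not shown in the post.
    subject: Optional[str]
    body: str


def route(email: Email) -> str:
    """Return the name of the queue this message should go to."""
    subject = email.subject
    # Deterministic check: the subject must be present and non-empty
    # (whitespace-only is treated as empty, which is an assumption here).
    if subject is None or not subject.strip():
        return "hold"   # flagged for human review; the agent never sees it
    return "agent"      # safe to hand to the extraction agent
```

&lt;p&gt;Calling route(Email(None, "...")) returns "hold", so the model is never asked to extract a reference from a subject line that isn't there.&lt;/p&gt;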

&lt;p&gt;The pattern generalizes. Every place an agent assumes structure in its input is a place production will eventually send you unstructured data. The fix isn't a smarter model. The fix is a boundary: a check that runs before the model and routes bad input to a human instead of letting the model guess.&lt;/p&gt;

&lt;p&gt;This is what I mean when I say reliability is the only feature. A demo proves an agent can do the task. Production proves it does the task, again, on the bad input, at 3am, when no one is watching. Those are different claims. Only the second one matters to anyone paying for it.&lt;/p&gt;

&lt;p&gt;The agent now processes roughly 200 routing operations per day without incident. The hold queue gets used about twice a week. When it does, a human looks at whatever weird thing arrived, handles it, and I learn something new about what production actually looks like.&lt;/p&gt;

&lt;h2&gt;A note for 2027&lt;/h2&gt;

&lt;p&gt;If you're building agents for clients in high-risk categories under the EU AI Act, the compliance deadline is December 2, 2027. That covers employment decisions, biometrics, border control, education systems. Not far off.&lt;/p&gt;

&lt;p&gt;A system that routes confidently on bad inputs and produces plausible wrong answers won't survive an audit. The guard I described isn't just good engineering. For systems in scope, it's a compliance minimum. The European Commission published draft Article 6 classification guidelines this month. If you haven't checked whether your system is in scope, now is the time.&lt;/p&gt;

&lt;p&gt;Reliability isn't a feature you add later. The hold queue proves that. The hallucinated order reference proves it too, in the more expensive way.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why my AI agents can write code but can't ship it</title>
      <dc:creator>Kim-Like</dc:creator>
      <pubDate>Wed, 03 Jun 2026 07:42:00 +0000</pubDate>
      <link>https://dev.to/kimlike/why-my-ai-agents-can-write-code-but-cant-ship-it-598c</link>
      <guid>https://dev.to/kimlike/why-my-ai-agents-can-write-code-but-cant-ship-it-598c</guid>
      <description>&lt;p&gt;Last month an agent finished a content update at 2am, wrote the diff, ran the pre-deploy checks, and then stopped. It filed a request and went idle. The deploy didn't happen until morning, when the Librarian process ran its scheduled verification and shipped it.&lt;/p&gt;

&lt;p&gt;That pause was not a bug. I built it.&lt;/p&gt;

&lt;h2&gt;The capability I withheld&lt;/h2&gt;

&lt;p&gt;Every agent in my operation at aienterprise.dk has file write access to its workspace. It can read the database, call external APIs, generate and modify content. What it cannot do is push to production. The pm2 reload command, the deploy script, and the snapshot promoter are all out of scope for every agent except the one process I have designated as deploy authority.&lt;/p&gt;

&lt;p&gt;This is not about distrust. The agents' code is usually fine. The issue is risk asymmetry.&lt;/p&gt;

&lt;p&gt;A wrong file write gets caught in the next review cycle. A wrong production deploy is live the moment it runs. Those are not the same failure mode and they should not have the same access model.&lt;/p&gt;

&lt;p&gt;I closed that gap after an agent shipped a schema migration to the wrong site instance because it matched on name prefix instead of full identifier. Nobody was harmed, rollback took four minutes, but the path was clearly wrong. An agent that builds a thing should not also be the one that decides when the thing ships.&lt;/p&gt;
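
&lt;p&gt;To make the prefix problem concrete, here is a small Python sketch. The site identifiers and instance names are invented for illustration and are not the real ones. Prefix matching can return more than one candidate, while an exact lookup either resolves one identifier or fails loudly.&lt;/p&gt;

```python
from typing import Dict, List

# Hypothetical site identifiers, invented for illustration.
SITES: Dict[str, str] = {
    "shop": "instance-a",
    "shop-staging": "instance-b",
}


def resolve_by_prefix(name: str) -> List[str]:
    # Risky: every identifier that starts with the name matches.
    return [site for site in SITES if site.startswith(name)]


def resolve_exact(name: str) -> str:
    # Safer: only a full identifier resolves; anything else raises.
    if name not in SITES:
        raise KeyError(f"unknown site identifier: {name!r}")
    return SITES[name]
```

&lt;p&gt;With these two entries, resolve_by_prefix("shop") matches both sites, which is the kind of ambiguity that can send a migration to the wrong instance. resolve_exact("shop") resolves exactly one.&lt;/p&gt;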

&lt;h2&gt;What the gate actually looks like&lt;/h2&gt;

&lt;p&gt;The mechanism is simple on purpose. When an agent finishes deployable work, it calls a single script: request-deploy.mjs. The script takes a surface, an intent string, and the artifact ID. That's it. The agent's job is done.&lt;/p&gt;

&lt;p&gt;A separate process, the Librarian, holds the actual deploy token. It runs on a 15-minute heartbeat plus an autorun trigger when a request lands. It checks whether the new snapshot conflicts with anything else in flight, runs pre-deploy verification, ships all six sites in lockstep, bumps the version with the intent string as the changelog bullet, and records the deploy to the runtime log.&lt;/p&gt;

&lt;p&gt;The agent never interacts with the Librarian directly. The separation is not there to create friction. It's there to ensure the thing that builds is never the thing that ships, with no exceptions baked in at the agent layer.&lt;/p&gt;
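
&lt;p&gt;The shape of that separation can be sketched in a few lines of Python. This is an illustration under assumptions, not the real request-deploy.mjs or Librarian: the class names, the in-memory queue, and the verify callback are all hypothetical, and conflict checks, lockstep shipping, and version bumps are left out.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class DeployRequest:
    # The three inputs the post names: surface, intent string, artifact ID.
    surface: str
    intent: str
    artifact_id: str


@dataclass
class RequestQueue:
    # Stand-in for wherever requests land; the real storage is not described.
    pending: List[DeployRequest] = field(default_factory=list)

    def file(self, request: DeployRequest) -> None:
        # In this sketch, filing is the only deploy-related action an agent has.
        self.pending.append(request)


class Librarian:
    """Sole holder of deploy authority in this sketch."""

    def __init__(self, queue: RequestQueue,
                 verify: Callable[[DeployRequest], bool]):
        self._queue = queue
        self._verify = verify
        self.log: List[str] = []
        self.escalated: List[DeployRequest] = []

    def heartbeat(self) -> None:
        # Drain the queue: ship what passes verification, escalate the rest.
        while self._queue.pending:
            request = self._queue.pending.pop(0)
            if self._verify(request):
                self.log.append(
                    f"shipped {request.artifact_id}: {request.intent}")
            else:
                self.escalated.append(request)
```

&lt;p&gt;The agent only ever holds a RequestQueue. Nothing it can reach contains the code path that ships, so there is no urgency flag to set and no exception to argue for.&lt;/p&gt;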

&lt;p&gt;If a deploy is genuinely blocked, the Librarian escalates. I get an alert. I resolve it. But the system does not let an agent work around the gate by claiming urgency.&lt;/p&gt;

&lt;h2&gt;Why this is worth thinking about now&lt;/h2&gt;

&lt;p&gt;The EU AI Act's December 2027 deadline is fixed. Danish operators running agentic systems in employment, critical infrastructure, or migration contexts have a planning horizon they can work backward from. The draft Article 6 guidelines define what high-risk means in practice, and the answer is broader than most builders expect.&lt;/p&gt;

&lt;p&gt;But the reason to build approval gates isn't the regulation. It's that production systems break in ways demos don't. An agent that can both write and ship is a system where the blast radius of a wrong decision is unbounded on one axis. That is the axis you want to control before something goes wrong, not after.&lt;/p&gt;

&lt;p&gt;I withheld deploy capability from my agents because the boundary between built and shipped is where accountability lives. If an agent builds and ships and something breaks, the audit trail is harder to read. If an agent builds and a separate process ships after verification, every step is logged and attributable.&lt;/p&gt;

&lt;p&gt;That's the governance mechanism. It's not a policy document. It's an architecture decision that makes the policy enforceable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>How a Scanned PDF Broke My Invoice Agent in Production</title>
      <dc:creator>Kim-Like</dc:creator>
      <pubDate>Tue, 02 Jun 2026 12:56:07 +0000</pubDate>
      <link>https://dev.to/kimlike/how-a-scanned-pdf-broke-my-invoice-agent-in-production-4kni</link>
      <guid>https://dev.to/kimlike/how-a-scanned-pdf-broke-my-invoice-agent-in-production-4kni</guid>
      <description>&lt;p&gt;Four days into a new supplier's first batch, my invoice extraction agent had filed 31 documents with amounts shifted by a decimal. Nothing raised an error. The downstream system accepted every record. The agent returned a 200 each time.&lt;/p&gt;

&lt;p&gt;The demo had run on five clean PDFs. Clear fonts, properly formatted dates, consistent layout. The extraction agent pulled vendor name, amount, due date, line items. Every field populated, every output valid. I ran it for the stakeholder meeting and it looked exactly like something you would ship.&lt;/p&gt;

&lt;p&gt;Three months in, the agent had processed around 800 invoices without complaint. Then a new supplier switched to scanned documents. Slightly rotated, thin fonts, OCR doing what it could on degraded source material. The model found text that resembled amounts and dates, and returned confident structured output. 1,247.50 read as 12,475.0. A due date resolved to a valid date three years in the future. The confidence was the problem. The model had no mechanism to say it was uncertain. It just answered.&lt;/p&gt;

&lt;p&gt;Nobody caught it for four days.&lt;/p&gt;

&lt;h2&gt;What I built after&lt;/h2&gt;

&lt;p&gt;The problem was not the model. The model did what it was designed to do: find structure in text and return it. The problem was that the pipeline ran straight from input to output with no gate in it.&lt;/p&gt;

&lt;p&gt;The fix was not more prompting or a better model. I added a validation layer between the agent output and the downstream system. It runs synchronously, takes about 80ms, and checks four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every required field is non-null.&lt;/li&gt;
&lt;li&gt;Amounts parse as positive numbers within a configured range for that supplier type.&lt;/li&gt;
&lt;li&gt;Dates fall within a 90-day future window.&lt;/li&gt;
&lt;li&gt;Extracted totals are consistent with line item sums, within a small tolerance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Anything failing a check routes to a review inbox instead of the queue. A human looks at it, corrects it if needed, marks it resolved. The system logs which check triggered and what the input looked like.&lt;/p&gt;
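
&lt;p&gt;The four checks and the routing rule can be sketched like this in Python. The Invoice fields, the check names, the single max_amount limit, and the 0.01 tolerance are assumptions for illustration. The real layer uses a configured range per supplier type, which this sketch reduces to one upper bound, and it treats the 90-day window as starting today.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Invoice:
    # Hypothetical shape of the agent's structured output.
    vendor: Optional[str]
    amount: Optional[float]
    due_date: Optional[date]
    line_items: List[float]


def failed_checks(inv: Invoice, today: date, max_amount: float,
                  tolerance: float = 0.01) -> List[str]:
    """Return the names of the checks that failed; empty means pass."""
    # Check 1: every required field is non-null.
    if inv.vendor is None or inv.amount is None or inv.due_date is None:
        return ["required_fields"]
    failures = []
    # Check 2: amount is positive and no larger than the configured limit.
    if not (inv.amount > 0 and max_amount >= inv.amount):
        failures.append("amount_range")
    # Check 3: due date is today or later, and at most 90 days out.
    latest = today + timedelta(days=90)
    if not (inv.due_date >= today and latest >= inv.due_date):
        failures.append("date_window")
    # Check 4: total matches the line item sum within the tolerance.
    if abs(sum(inv.line_items) - inv.amount) > tolerance:
        failures.append("total_mismatch")
    return failures


def route_invoice(inv: Invoice, today: date, max_amount: float) -> str:
    # Anything failing a check goes to review instead of the queue.
    return "review" if failed_checks(inv, today, max_amount) else "queue"
```

&lt;p&gt;Returning the names of the failed checks, rather than a bare boolean, is what makes it possible to log which check triggered for each document in review.&lt;/p&gt;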

&lt;p&gt;In the first week after deployment, the layer caught 23 documents out of about 1,400. Eleven were bad scans. Seven were valid invoices in a format the model had not seen before. Five were duplicates that had slipped through upstream. All 23 would have gone through clean before the layer existed.&lt;/p&gt;

&lt;p&gt;The review inbox is not impressive. It is an HTML table and a textarea. It took three hours to build. It has caught every significant extraction failure since I shipped it.&lt;/p&gt;

&lt;h2&gt;Reliability is the only feature&lt;/h2&gt;

&lt;p&gt;I run the agent operation at Agent Enterprise (aienterprise.dk) and this pattern shows up in every domain we deploy into. The model capability is mostly not the question. What does not improve automatically is the boundary between what the agent produces and what the downstream system trusts.&lt;/p&gt;

&lt;p&gt;Every deployment has its own version of this guard. For a scheduling agent it is a check that the proposed slot is actually open. For a classification agent it is a threshold below which the label goes to review rather than being applied automatically. The pattern is constant. The agent produces something, and before that something becomes a fact in your system, something deterministic verifies it is plausible.&lt;/p&gt;
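
&lt;p&gt;For the classification case, the guard can be as small as this Python sketch. The 0.85 threshold and the function name are hypothetical. The right cutoff depends on the deployment and on how well the model's confidence scores track reality.&lt;/p&gt;

```python
from typing import Tuple

REVIEW_THRESHOLD = 0.85  # Hypothetical value; tune per deployment.


def apply_or_review(label: str, confidence: float,
                    threshold: float = REVIEW_THRESHOLD) -> Tuple[str, str]:
    # At or above the threshold the label is applied automatically;
    # below it, the item goes to human review with the proposed label.
    if confidence >= threshold:
        return ("apply", label)
    return ("review", label)
```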

&lt;p&gt;The demo proves the agent can. Production proves it does, correctly, on the bad input, on the rotated scan, at 3am when no one is watching. That second proof is the one your users care about. It is also the one that does not come from the model.&lt;/p&gt;

&lt;p&gt;The validation layer is not exciting to ship. It is the right call every time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
