DEV Community: Kim-Like

Why your agent benchmarks are lying to you

Kim-Like — Wed, 08 Jul 2026 00:00:00 +0000

We deployed a coding agent that hit 94% on the industry benchmark. It failed in production on the first real edge case because the benchmark measured single-turn success and our actual work was multi-turn refinement. The model could not update its beliefs correctly when new evidence arrived, something no single-turn eval would catch.

This is not a hypothetical. I have watched agents shine in a demo and disintegrate on the messy input that production actually serves. The gap between what we measure and what ships is real, and it is where reliability lives or dies.

The benchmark misses the point

FutureBench evaluates agents by asking them to predict events that occurred after their training cutoff. This removes the possibility of correct answers coming from memorized training data rather than genuine reasoning. The design matters because it tests whether an agent can reason, not whether it can recall.

BayesBench showed that standard LLM evaluations score only final-turn answers in single-turn format, leaving multi-turn belief updating entirely unexamined. Across seven models, scaling improves latent inference and evidence accumulation, but LLMs still do not match rational Bayesian updating. In production, your agent runs many turns. The benchmark that stops at turn one is not measuring the thing that actually breaks.
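For reference, this is what rational multi-turn updating looks like on a toy problem. It is a minimal sketch in Python, not BayesBench's own harness: the coin hypotheses, the probabilities, and the function name bayes_update are all assumptions chosen for illustration.

```python
from fractions import Fraction


def bayes_update(prior, likelihoods):
    """Return the posterior over hypotheses after one observation.

    prior:       dict of hypothesis -> P(hypothesis)
    likelihoods: dict of hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical setup: is the coin fair, or biased toward heads?
belief = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

# Three turns of evidence, heads each time. The belief after each turn
# becomes the prior for the next one.
history = []
for _ in range(3):
    belief = bayes_update(belief, heads)
    history.append(belief["biased"])

print(history)  # P(biased) after each turn: 3/5, 9/13, 27/35
```

A single-turn eval only ever sees the first of those three numbers. The question a multi-turn eval asks is whether the agent's stated confidence moves along the same trajectory as new evidence arrives.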

KINA identified three systematic flaws in knowledge benchmarks: scaling-driven designs that ignore disciplinary representativeness, flat-payment annotation that permits lazy consensus among annotators, and unaudited ranking instability under bounded test budgets. The top model reached 53.17% on an 899-item benchmark across 261 disciplines. That is not saturation. That is headroom.

The demo lied

I worked with a team that deployed an agentic document processing system. The demo on ten handpicked cases was flawless. In the first week of production, it hit an input format the training data never saw, and the system failed silently. No error was raised. The output looked plausible and was wrong.

The problem was not model capability. The problem was that the demo tested the happy path and production delivers edge cases. Amazon Bedrock AgentCore runs agents in isolated environments with automatic CloudWatch tracing, which helps. But observability tells you what broke. It does not prevent the break.

Mark Zuckerberg stated publicly in July 2026 that AI agent development at Meta is going slower than expected. This is not a technical confession. It is an admission that the distance between demo and delivery is real, and it is where the work actually is.

Reliability is the only feature. A demo proves an agent can. Production proves an agent does, again, when no one is watching, on the bad input, at 3am. Everything else is marketing.

Ship what works on the ugly input, not what shines on the curated demo. Build observability in before you ship. And never trust a benchmark that does not test the failure mode you will actually see in production.

What breaks an AI agent after 50 clean demos

Kim-Like — Tue, 07 Jul 2026 00:00:00 +0000

The demo ran 50 times without a failure. Then we shipped it.

Three days into production, the agent started returning confident nonsense. Not errors. Not crashes. It finished its task, wrote a result, logged success. The result was wrong. Nobody caught it for six hours.

The agent's job was to pull structured data from a third-party endpoint, summarize it, and route a decision. In testing, that endpoint always returned a list with at least one item. On day three, it returned an empty list. Valid JSON, zero items. The agent had never seen this input. It did not stop. It summarized the absence of data as if it meant something, filled in plausible context, and routed confidently.

Confident autocomplete on an empty input. That is the production failure mode demos never surface.

Why the demo hid it

Every test I ran before shipping used good inputs. Clean structure, expected ranges, the happy path. I tested for API failures. I tested for malformed JSON. I did not test for technically valid responses that meant "there is nothing here."

Demos optimize for showing what works. Production finds everything you missed. This gap exists in all software, but AI makes it harder to see. A traditional system throws an exception or returns null. An LLM writes something coherent and wrong. You have to read the output, understand the domain, and notice that "summarized one recent transaction" should have been "no transactions found." That requires a human check or an explicit assertion. I had neither.

The deeper problem: I had a handoff between the data layer and the reasoning layer with no contract between them. The data layer said "here are zero items" and the reasoning layer said "let me make sense of whatever I received." It did. Badly. No one told it that zero items was a special case worth stopping for.

Benchmarks do not show you this. A hundred happy-path runs do not show you this. The empty-list case shows up on day three, while you are doing something else.

The fix was not clever

I added a guard. Before the agent reasons over any data, it checks whether the input meets the minimum conditions for a meaningful answer. If not, it returns a structured "no data" signal and stops. The downstream system handles that signal explicitly.
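A minimal sketch of such a guard, in Python. NoData, guarded_summarize, and fake_summarize are hypothetical names for illustration, not the production code, and the "at least one item" rule stands in for whatever minimum conditions a given data type needs.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class NoData:
    """Structured signal: the input was valid but held nothing to reason over."""
    reason: str


def guarded_summarize(items: Any, summarize: Callable[[list], str], min_items: int = 1):
    """Call summarize(items) only if the input meets the minimum conditions."""
    if not isinstance(items, list):
        return NoData(reason="payload is not a list")
    if len(items) < min_items:
        return NoData(reason=f"expected at least {min_items} item(s), got {len(items)}")
    return summarize(items)


def fake_summarize(items: list) -> str:
    # Stand-in for the model call; a real agent would invoke the LLM here.
    return f"summarized {len(items)} transaction(s)"


empty_result = guarded_summarize([], fake_summarize)
normal_result = guarded_summarize([{"id": 1}], fake_summarize)
```

The point of returning a distinct type rather than an empty string is that the downstream system has to branch on it explicitly. It cannot mistake "no data" for a short summary.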

That is the whole fix. No prompt engineering, no fine-tuning, no new model, no clever architecture. A check that runs before the model touches anything.

The unglamorous part is that defining "minimum conditions" for every data type the agent processes took longer than building the agent. I had to think through each input type and ask: what does "valid but meaningless" look like here? You cannot skip this. If you do not define it, the model will always find something to say. It is very good at that.

I run several agents at Agent Enterprise (aienterprise.dk), and this pattern repeats across all of them. The demo uses inputs I chose. Production uses inputs no one chose. The diff between those two populations is where your reliability lives or dies.

A piece on reliable agentic AI from Martin Fowler's platform pulled 171 points on Hacker News in June 2026. That is not people reading about exciting new agent capabilities. That is people who have hit production and are looking for help. The practitioner community knows where the gap is.

A demo proves an agent can. Production proves it does, on the bad input, at 3am, when nobody is watching. That is the only gap worth closing.

Reliability is the only feature. Everything else is a demo.

The deployment permission I deliberately withheld from my AI agents

Kim-Like — Mon, 06 Jul 2026 00:00:00 +0000

The agent finishes its work at 2am. Snapshot built, preflight checks passed, everything green. Then it files a deploy request and stops.

That is intentional. The agent cannot ship its own build. I withheld that capability before I gave it anything else.

What the gate looks like

At Agent Enterprise (aienterprise.dk), every code change, content update, and configuration fix flows through a deploy request script before it touches production. An agent that finishes a task runs request-deploy.mjs with what it built, what it changed, and why. The Librarian deploy routine holds the root capability token. Nothing ships unless the Librarian processes that request.

The agent cannot use pm2 reload. Cannot run snapshot:deploy. Cannot promote a build. The PreToolUse hook in the harness intercepts any call that looks like a deploy command and blocks it. Hard stop.
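The interception step can be sketched as a deny-list check on the command string. This is an assumption-heavy illustration in Python: the two command names come from the text above, but the regexes, the function names, and the return shape are hypothetical and do not reflect the harness's actual hook interface.

```python
import re

# Hypothetical deny-list. The commands are the ones named above;
# the patterns themselves are assumptions.
BLOCKED_PATTERNS = [
    re.compile(r"\bpm2\s+reload\b"),
    re.compile(r"\bsnapshot:deploy\b"),
]


def is_blocked(command: str) -> bool:
    """True if the command looks like a direct deploy."""
    return any(pattern.search(command) for pattern in BLOCKED_PATTERNS)


def pre_tool_check(command: str) -> dict:
    """Decide whether a shell command may run; called before every tool use."""
    if is_blocked(command):
        return {"allow": False, "reason": "deploys must go through request-deploy.mjs"}
    return {"allow": True, "reason": ""}
```

Pattern matching on command strings is a backstop, not the boundary. A deny-list can be sidestepped by a command it did not anticipate, which is why the deploy token living only with the Librarian matters more than the hook.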

The agent can do plenty: build, test, write, read, generate content, query databases, file requests. The one thing it cannot do is put its own work into production without an intermediary step that creates an auditable record.

The Librarian is also an agent. But it is the only one with the deploy token, and it only processes requests with complete provenance: who built it, what snapshot ID, what the intent was, what changed. It verifies before it ships. It records to the runtime log. It does all nine sites in lockstep so nothing ships in a half-finished state across the fleet.
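The "complete provenance" rule is easy to express as a check that runs before anything else. A minimal sketch in Python, assuming the four provenance items listed above map to four string fields; DeployRequest and missing_provenance are hypothetical names, not the real request format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DeployRequest:
    # Field names are assumptions mirroring the four provenance items.
    builder: str      # who built it
    snapshot_id: str  # what snapshot ID
    intent: str       # what the intent was
    changes: str      # what changed


def missing_provenance(request: DeployRequest) -> list:
    """Return the names of provenance fields that are absent or blank."""
    missing = []
    for field in fields(request):
        value = getattr(request, field.name)
        if not isinstance(value, str) or not value.strip():
            missing.append(field.name)
    return missing


complete = DeployRequest("content-agent", "snap-0142", "fix typo on pricing page", "1 file")
incomplete = DeployRequest("content-agent", "snap-0143", "   ", "3 files")
```

A request with any missing field is rejected before verification starts, so an incomplete record can never become a deploy.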

This sounds complicated. It takes about three minutes to file a request and maybe fifteen minutes for the Librarian to process it. That is the cost.

Why I designed it this way

The first reason is obvious in retrospect. An agent that can ship its own code can ship anything. If something in the reasoning goes sideways, the mistake goes live before anyone sees it. I watched an early prototype in a test environment confidently deploy a half-finished build because it had decided the preflight had passed when it hadn't. Nothing broke. But it clarified the shape of the risk immediately.

The second reason is harder to articulate but more important. If the agent deploys itself, there is no moment at which the action is recorded separately from the action itself. The deploy and the record of the deploy are the same event. When something goes wrong at 3am you need to know what was approved, who approved it, and what artifact was actually shipped. A request-and-process model creates that record by construction, not by hoping someone remembered to log it.

The third reason showed up this month. Anthropic's Mythos models went offline for 14 days after a Trump administration directive. No timeline. No appeal path visible from the outside. If every capability in my stack depended on a single model that could be pulled by executive action, I would not be running a reliable operation. I would be running a system that works until someone else decides it doesn't.

The deploy gate is not a guard against model outages. But it reflects the same instinct. The capability that matters most is the one you have thought hardest about removing. You design the boundary before you need it, not after.

The tradeoff

Fifteen minutes of latency is the cost. Sometimes more. An agent that finishes at 2am does not ship at 2am. It ships when the Librarian runs. That is a deliberate choice.

In exchange, every deploy in production has a request record, an artifact ID, an intent string, and a verification log. When something breaks, I know what shipped, when, and why someone thought it was ready. I can roll back to a specific snapshot and know exactly what that snapshot contained.

I also get something harder to quantify. I sleep differently when I know nothing ships without a record. Not because I distrust the agents, but because the agents distrust themselves by design. They know they are not the authority on whether their work goes to production. That knowledge is baked into the harness, not into the system prompt.

The mistake most people make when building agentic systems is giving every capability upfront and adding restrictions when something breaks. I built the restriction first. Everything else came after.

The date format that broke my production AI agent (and the boring fix)

Kim-Like — Sun, 05 Jul 2026 21:52:00 +0000

The demo was clean. Forty-three test cases, all passing. The agent took a structured input, processed it, wrote the result downstream. I watched it run a dozen times in staging, everything right.

I put it live. Three days later, I found records in the database with dates in 1970.

What the demo hid

The test inputs had date fields formatted consistently. ISO 8601, clean strings, nothing unusual. In production, one upstream system sent dates in a different format. The agent read the field, made a plausible guess about what it meant, and wrote the result. No exception, no warning. Just a wrong date, written confidently, downstream at 3am with no one watching.

The failure mode was not dramatic. The agent did not hallucinate in the way people imagine. It treated an ambiguous input as an unambiguous one, picked the most likely interpretation, and was wrong. Exactly the kind of quiet, confident wrongness that is hard to catch because nothing breaks. The output looks fine. The data type is correct. The value is just from 1970.

I had tested the happy path forty-three times. The bad path happened on day three.

The fix nobody wants to write

I added a validator between the input and the agent, and another one between the agent's output and the write operation.

The first one checks that required fields exist, that date strings match expected formats, that string lengths are within bounds. If anything fails, the item does not reach the agent. It goes to a review queue instead. No agent call, no write, no silent corruption.

The second one runs after the agent produces its output. Before anything is written downstream, the schema gets checked. Required fields present. Types correct. Values within expected ranges. If the output fails, same result: review queue, no write.
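The two validators can be sketched like this in Python. Everything specific here is an assumption for illustration: the field names, the 200-character bound, the YYYY-MM-DD format, and the year-2000 floor used to catch epoch dates. The agent is passed in as a plain callable.

```python
from datetime import date, datetime

REQUIRED_FIELDS = ("id", "date")       # hypothetical field names
MAX_FIELD_LENGTH = 200                 # assumed bound
EARLIEST_PLAUSIBLE = date(2000, 1, 1)  # assumed floor; 1970 falls below it


def validate_input(record: dict) -> list:
    """Check required fields, string lengths, and the date format."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"missing field: {name}")
        elif len(value) > MAX_FIELD_LENGTH:
            problems.append(f"field too long: {name}")
    if not problems:
        try:
            datetime.strptime(record["date"], "%Y-%m-%d")
        except ValueError:
            problems.append(f"unexpected date format: {record['date']!r}")
    return problems


def validate_output(result: dict) -> list:
    """Check the agent's output for type and range before any write."""
    problems = []
    if not isinstance(result.get("id"), str):
        problems.append("output id missing or not a string")
    parsed = result.get("date")
    if not isinstance(parsed, date):
        problems.append("output date is not a date")
    elif parsed < EARLIEST_PLAUSIBLE:
        problems.append(f"output date out of range: {parsed.isoformat()}")
    return problems


def process(record: dict, agent, review_queue: list, written: list) -> None:
    """Validate, call the agent, validate again; failures go to review."""
    problems = validate_input(record)
    if problems:
        review_queue.append((record, problems))  # the agent is never called
        return
    result = agent(record)
    problems = validate_output(result)
    if problems:
        review_queue.append((record, problems))  # nothing is written
        return
    written.append(result)
```

Note that the 1970 case is caught twice over: a malformed date string never reaches the agent, and an epoch-era date in the output never reaches the database.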

The agent itself did not change. The model did not change. The fix was wrapper logic that most production systems already have for traditional software and that almost no one thinks to add for AI outputs.

At Agent Enterprise (aienterprise.dk), where I run the full agent operation, this is now the default. Every agent that writes to a persistent system has both checks. The agents do not get to decide whether their output is well-formed enough to write. The validator decides.

The entire fix took about an hour to write. It is not interesting code.

Reliability is the whole job

The demo proved the agent could do the task. On clean inputs, on the forty-three cases I had prepared, it was excellent.

Production does not send you your forty-three cases. It sends you what the upstream system sends you, on whatever schedule, in whatever format, with whatever fields missing or malformed. The agent that works in a demo and the agent that works at 3am on a bad input are not the same thing. The gap between them is not a model problem. It is an engineering problem.

The fix was a type check on a date field and a decision about what to do when it fails. A branch in a validator that routes failures to a queue instead of letting them write. That is all.

I have since added the same pattern to every agent that writes to anything. Validate the input before it reaches the agent, validate the output before it leaves. Everything the agent touches in between is its domain. Everything outside those bounds is yours.

The demo is about capability. Production is about what happens when capability meets a case you did not design for. Reliability is the only feature that matters in that second environment, and it is almost never what the demo shows you.

The null input that broke my production agent and what fixed it

Kim-Like — Sat, 20 Jun 2026 10:22:00 +0000

The demo ran flawlessly for three weeks. Every test input parsed clean, every output routed correctly, and I thought we had a reliable system.

Then a supplier sent a confirmation email with an empty subject line.

The agent, which was supposed to extract order references and route them into a queue, got a null where it expected a string. It didn't crash. That would have been better. Instead it generated a plausible-looking order reference, routed it, and the downstream system processed it like it was real. Nobody caught it for four hours.

That is the demo problem: demos use inputs that look like what you expect. Production does not.

What the demo hid

I built and run the agent operation at aienterprise.dk, so I control every layer of the stack. When this broke, I could see the full trace. The agent's system prompt said "extract the order reference from the subject line." Sensible instruction. Works every time the subject line exists.

When it doesn't exist, a well-prompted LLM doesn't say "I cannot find an order reference." It fills the gap. It invents something that looks right. The hallucination isn't random noise. It's plausible, structured noise. That's what makes it dangerous. A random failure is easy to catch. A confident, well-formatted wrong answer is not.

In the demo, I never sent an email with a null subject. I never thought to. The input felt so basic I didn't consider it an edge case. It isn't an edge case in production. It's Tuesday.

The unglamorous fix

I didn't retrain anything. I didn't adjust the prompt. I added a guard before the model call.

Before the agent touches input now, a deterministic check runs: is the subject field present and non-empty? If not, the message routes to a hold queue with a flag. A human reviews it. The agent never sees the malformed input.
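A sketch of that check in Python. The message shape, the queue objects, and the flag name are hypothetical; the only thing taken from the text is the rule itself, that a missing or empty subject goes to the hold queue and never to the agent.

```python
def route_message(message: dict, hold_queue: list, agent_queue: list) -> str:
    """Route a message to the agent only if it has a usable subject line."""
    subject = message.get("subject")
    if not isinstance(subject, str) or not subject.strip():
        # Deterministic stop: flag it for a human, skip the model entirely.
        hold_queue.append({"message": message, "flag": "missing_subject"})
        return "hold"
    agent_queue.append(message)
    return "agent"


hold, ready = [], []
route_message({"subject": None, "body": "Order confirmed"}, hold, ready)
route_message({"subject": "Order REF-1042 confirmed", "body": "..."}, hold, ready)
```

Treating whitespace-only subjects as empty is a choice made in this sketch. Whether that is right depends on what the upstream mail system actually sends.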

That guard is twelve lines of code. It's the least interesting thing I built all year. It's also what makes the agent reliable.

The pattern generalizes. Every place an agent assumes structure in its input is a place production will eventually send you unstructured data. The fix isn't a smarter model. The fix is a boundary: a check that runs before the model and routes bad input to a human instead of letting the model guess.

This is what I mean when I say reliability is the only feature. A demo proves an agent can do the task. Production proves it does the task, again, on the bad input, at 3am, when no one is watching. Those are different claims. Only the second one matters to anyone paying for it.

The agent now processes roughly 200 routing operations per day without incident. The hold queue gets used about twice a week. When it does, a human looks at whatever weird thing arrived, handles it, and I learn something new about what production actually looks like.

A note for 2027

If you're building agents for clients in high-risk categories under the EU AI Act, the compliance deadline is December 2, 2027. That covers employment decisions, biometrics, border control, education systems. Not far off.

A system that routes confidently on bad inputs and produces plausible wrong answers won't survive an audit. The guard I described isn't just good engineering. For systems in scope, it's a compliance minimum. The European Commission published draft Article 6 classification guidelines this month. If you haven't checked whether your system is in scope, now is the time.

Reliability isn't a feature you add later. The hold queue proves that. The hallucinated order reference proves it too, in the more expensive way.

Why my AI agents can write code but can't ship it

Kim-Like — Wed, 03 Jun 2026 07:42:00 +0000

Last month an agent finished a content update at 2am, wrote the diff, ran the pre-deploy checks, and then stopped. It filed a request and went idle. The deploy didn't happen until morning, when the Librarian process ran its scheduled verification and shipped it.

That pause was not a bug. I built it.

The capability I withheld

Every agent in my operation at aienterprise.dk has file write access to its workspace. It can read the database, call external APIs, generate and modify content. What it cannot do is push to production. The pm2 reload command, the deploy script, the snapshot promoter, none of them are in scope for any agent except the one process I have designated as deploy authority.

This is not about distrust. The agents' code is usually fine. The issue is risk asymmetry.

A wrong file write gets caught in the next review cycle. A wrong production deploy is live the moment it runs. Those are not the same failure mode and they should not have the same access model.

I closed that gap after an agent shipped a schema migration to the wrong site instance because it matched on name prefix instead of full identifier. Nobody was harmed, rollback took four minutes, but the path was clearly wrong. An agent that builds a thing should not also be the one that decides when the thing ships.
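One way a prefix match picks the wrong instance, sketched in Python. The site names and both function names are invented for illustration; the actual incident may have differed in its details.

```python
SITES = ["shop-staging", "shop", "shop-eu"]  # hypothetical instance names


def resolve_by_prefix(target: str, sites: list):
    """Buggy: returns the first site whose name merely starts with target."""
    for site in sites:
        if site.startswith(target):
            return site
    return None


def resolve_exact(target: str, sites: list) -> str:
    """Safer: require exactly one site whose full identifier equals target."""
    matches = [site for site in sites if site == target]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one site named {target!r}, found {len(matches)}")
    return matches[0]


wrong = resolve_by_prefix("shop", SITES)  # picks "shop-staging"
right = resolve_exact("shop", SITES)      # picks "shop"
```

The prefix version never fails, which is the problem. The exact version raises when the identifier is ambiguous or unknown, and a raised error is something a gate can stop on.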

What the gate actually looks like

The mechanism is simple on purpose. When an agent finishes deployable work, it calls a single script: request-deploy.mjs. The script takes a surface, an intent string, and the artifact ID. That's it. The agent's job is done.

A separate process, the Librarian, holds the actual deploy token. It runs on a 15-minute heartbeat plus an autorun trigger when a request lands. It checks whether the new snapshot conflicts with anything else in flight, runs pre-deploy verification, ships all six sites in lockstep, bumps the version with the intent string as the changelog bullet, and records the deploy to the runtime log.
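The lockstep property can be sketched as "verify everything, then ship everything, or ship nothing". This Python sketch is an assumption about the shape of that logic, not the Librarian's code; ship_in_lockstep, verify, and ship are hypothetical names, and it does not cover rollback if a ship step fails partway through.

```python
def ship_in_lockstep(sites: list, verify, ship) -> dict:
    """Ship to every site only if pre-deploy verification passes on all of them."""
    failures = [site for site in sites if not verify(site)]
    if failures:
        # One failing site blocks the whole fleet, so nothing is half-shipped.
        return {"shipped": [], "blocked_by": failures}
    for site in sites:
        ship(site)
    return {"shipped": list(sites), "blocked_by": []}


deployed = []
blocked = ship_in_lockstep(["a", "b", "c"], lambda s: s != "b", deployed.append)
clean = ship_in_lockstep(["a", "b", "c"], lambda s: True, deployed.append)
```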

The agent never interacts with the Librarian directly. The separation is not there to create friction. It's there to ensure the thing that builds is never the thing that ships, with no exceptions baked in at the agent layer.

If a deploy is genuinely blocked, the Librarian escalates. I get an alert. I resolve it. But the system does not let an agent work around the gate by claiming urgency.

Why this is worth thinking about now

The EU AI Act's December 2027 deadline is fixed. Danish operators running agentic systems in employment, critical infrastructure, or migration contexts have a planning horizon they can work backward from. The draft Article 6 guidelines define what high-risk means in practice, and the answer is broader than most builders expect.

But the reason to build approval gates isn't the regulation. It's that production systems break in ways demos don't. An agent that can both write and ship is a system where the blast radius of a wrong decision is unbounded on one axis. That is the axis you want to control before something goes wrong, not after.

I withheld deploy capability from my agents because the boundary between built and shipped is where accountability lives. If an agent builds and ships and something breaks, the audit trail is harder to read. If an agent builds and a separate process ships after verification, every step is logged and attributable.

That's the governance mechanism. It's not a policy document. It's an architecture decision that makes the policy enforceable.

How a Scanned PDF Broke My Invoice Agent in Production

Kim-Like — Tue, 02 Jun 2026 12:56:07 +0000

Four days into a new supplier's first batch, my invoice extraction agent had filed 31 documents with amounts shifted by a decimal. Nothing raised an error. The downstream system accepted every record. The agent returned a 200 each time.

The demo had run on five clean PDFs. Clear fonts, properly formatted dates, consistent layout. The extraction agent pulled vendor name, amount, due date, line items. Every field populated, every output valid. I ran it for the stakeholder meeting and it looked exactly like something you would ship.

Three months in, the agent had processed around 800 invoices without complaint. Then a new supplier switched to scanned documents. Slightly rotated, thin fonts, OCR doing what it could on degraded source material. The model found text that resembled amounts and dates, and returned confident structured output. 1,247.50 read as 12,475.0. A due date resolved to a valid date three years in the future. The confidence was the problem. The model had no mechanism to say it was uncertain. It just answered.

Nobody caught it for four days.

What I built after

The problem was not the model. The model did what it was designed to do: find structure in text and return it. The problem was that the straight pipeline from input to output had no gate in it.

The fix was not more prompting or a better model. I added a validation layer between the agent output and the downstream system. It runs synchronously, takes about 80ms, and checks four things:

Every required field is non-null.
Amounts parse as positive numbers within a configured range for that supplier type.
Dates fall within a 90-day future window.
Extracted totals are consistent with line item sums, within a small tolerance.

Anything failing a check routes to a review inbox instead of the queue. A human looks at it, corrects it if needed, marks it resolved. The system logs which check triggered and what the input looked like.
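The four checks can be sketched as one function that returns the names of the checks that failed. This is an illustration in Python under stated assumptions: the field names, the per-supplier amount bounds, the 0.01 tolerance, and the reading of "90-day future window" as today through today plus 90 days are all mine, not taken from the production layer.

```python
from datetime import date, timedelta
from decimal import Decimal, InvalidOperation

REQUIRED = ("vendor", "amount", "due_date", "line_items")  # assumed field names


def parse_amount(raw):
    """Parse an amount into a finite Decimal, or return None."""
    try:
        value = Decimal(str(raw))
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def failed_checks(invoice: dict, today: date, min_amount: Decimal,
                  max_amount: Decimal, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return the names of the checks this extracted invoice fails."""
    # 1. Every required field is non-null. Later checks depend on this one.
    if any(invoice.get(name) is None for name in REQUIRED):
        return ["required_fields"]
    failed = []
    # 2. Amount parses as a positive number within the supplier's range.
    amount = parse_amount(invoice["amount"])
    if amount is None or amount <= 0 or not (min_amount <= amount <= max_amount):
        failed.append("amount_range")
    # 3. Due date falls within the next 90 days.
    due = invoice["due_date"]
    if not isinstance(due, date) or not (today <= due <= today + timedelta(days=90)):
        failed.append("date_window")
    # 4. Extracted total agrees with the sum of line items, within tolerance.
    parsed = [parse_amount(item) for item in invoice["line_items"]]
    if amount is None or None in parsed or abs(sum(parsed, Decimal("0")) - amount) > tolerance:
        failed.append("total_consistency")
    return failed


today = date(2026, 6, 2)
good = {"vendor": "Acme", "amount": "1247.50", "due_date": date(2026, 7, 1),
        "line_items": ["1000.00", "247.50"]}
shifted = {"vendor": "Acme", "amount": "12475.0", "due_date": date(2029, 6, 2),
           "line_items": ["1000.00", "247.50"]}
```

With bounds of 1 to 5,000, the clean invoice passes every check, while the decimal-shifted one fails amount_range, date_window, and total_consistency. Returning the check names rather than a boolean gives the review inbox the "which check triggered" detail for free.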

In the first week after deployment, the layer caught 23 documents out of about 1,400. Eleven were bad scans. Seven were valid invoices in a format the model had not seen before. Five were duplicates that had slipped through upstream. All 23 would have gone through clean before the layer existed.

The review inbox is not impressive. It is an HTML table and a textarea. It took three hours to build. It has caught every significant extraction failure since I shipped it.

Reliability is the only feature

I run the agent operation at Agent Enterprise (aienterprise.dk) and this pattern shows up in every domain we deploy into. The model capability is mostly not the question. What does not improve automatically is the boundary between what the agent produces and what the downstream system trusts.

Every deployment has its own version of this guard. For a scheduling agent it is a check that the proposed slot is actually open. For a classification agent it is a threshold below which the label goes to review rather than being applied automatically. The pattern is constant. The agent produces something, and before that something becomes a fact in your system, something deterministic verifies it is plausible.

The demo proves the agent can. Production proves it does, correctly, on the bad input, on the rotated scan, at 3am when no one is watching. That second proof is the one your users care about. It is also the one that does not come from the model.

The validation layer is not exciting to ship. It is the right call every time.