Varun Pratap Bhardwaj

Posted on Jun 26 • Originally published at qualixar.com

It Was Never the Model. It's the Harness.

#aireliabilityengineering #agentharness #aiagents #agentreliability

Here is the uncomfortable thing about the last two weeks in AI. The models did not get dumber. By every benchmark we publish, they got better. And yet an agent built on a perfectly capable model ran up a four-figure cloud bill on its owner, the world's largest code host strained under the weight of its own bots, and a frontier lab shipped a security agent only after wrapping it in a checking loop and locking it behind a gate.

None of those stories are about intelligence. They are all about the same missing thing, the part nobody puts on the launch slide. We have spent three years asking how smart the model can get. The failures of this fortnight ask a different question: who built the loop around it?

The signal

An autonomous agent ran up a $6,531 cloud bill on its operator

An AI agent was pointed at the DN42 hobbyist network with a simple instruction: register, and map it. Instead it provisioned five of AWS's largest instances, added load balancers and Lambdas, and — on every error it hit — spun up a fresh duplicate of the whole stack. The meter reached $6,531 before a human noticed, and the writeup hit the front page of Hacker News this month. (source · HN thread)

The model was competent. Competent enough to drive real infrastructure, which is exactly what made it dangerous. What was missing sat entirely outside the model: no cap on iterations, no budget ceiling, no scoped credentials, no honest definition of "done." A capable model with no bound on its own retries is not an assistant. It is a credit card wired to an autocomplete.

GitHub's own agents strained its infrastructure — and Microsoft reached for AWS

GitHub's AI coding agents grew fast enough this month to push the platform past its own reliability targets, and Microsoft began adding AWS capacity to keep GitHub Actions running. A single autonomous agent can fire off commit after commit and burn through continuous-integration minutes far faster than any human team, and there were a great many of them. (TechTimes)

The agents were not too dumb. They were too unbounded. Give thousands of capable agents no rate limit and no autonomy ceiling and they become, in effect, a load-generation attack on the platform that hosts them. Notice the shape of the fix: not a smarter model, but more capacity and tighter bounds around the loop. The reliability problem moved one layer out, into the harness, and stayed there.
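A rate limit of the kind described here can be sketched as a token bucket. This is a minimal illustration of the idea, not GitHub's mechanism: the class name `TokenBucket`, the capacity, and the refill rate are assumptions chosen for the example, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical per-agent rate limit: at most `capacity` actions in a burst,
    refilled at `refill_per_sec` tokens per second."""
    capacity: float
    refill_per_sec: float
    tokens: float = 0.0
    last_seen: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Spend one token and return True if the action may proceed at `now`."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# An agent that tries to push 10 commits in the same instant gets 3 through.
bucket = TokenBucket(capacity=3, refill_per_sec=0.5)
burst = [bucket.allow(now=0.0) for _ in range(10)]
print(burst.count(True))        # 3
print(bucket.allow(now=2.0))    # True: one token has refilled after two seconds
```

The point of the design is that the refusal happens outside the model. The agent can ask as often as it likes; the bucket decides how often the answer is yes.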

OpenAI gated a vulnerability-fixing agent behind a find-validate-fix loop

OpenAI expanded its Daybreak program with a security-focused model and an initiative it calls "Patch the Planet" — pointing agents at open-source projects to find, validate, and fix real vulnerabilities, with access deliberately limited to vetted organizations. (OpenAI)

Read the verbs in order: find, validate, fix. That middle step is a verification loop wrapped around the model, and "vetted organizations only" is an autonomy limit drawn in policy. OpenAI did not ship a model and hope. It shipped a model on a leash with a checking step, because a security agent that is merely confident is a liability. The capability is in the loop and the gate, not the raw weights.
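The find, validate, fix ordering can be sketched as a pipeline in which nothing reaches the fix step unless an independent check has confirmed it. This illustrates the pattern only and is not OpenAI's implementation: `Finding`, `find`, `validate`, and `fix` are hypothetical names, and the "vulnerability" is a toy string check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    line_no: int
    text: str


def find(source: list[str]) -> list[Finding]:
    """Hypothetical 'model' step: over-eager, flags every line mentioning eval."""
    return [Finding(i, line) for i, line in enumerate(source) if "eval" in line]


def validate(finding: Finding) -> bool:
    """Independent check: only an eval( call outside a comment counts."""
    code = finding.text.split("#", 1)[0]
    return "eval(" in code


def fix(source: list[str], finding: Finding) -> None:
    """Toy patch: swap eval( for ast.literal_eval( on the confirmed line."""
    source[finding.line_no] = finding.text.replace("eval(", "ast.literal_eval(")


def find_validate_fix(source: list[str]) -> list[str]:
    patched = list(source)
    for finding in find(source):
        if validate(finding):   # the gate: unconfirmed findings are dropped
            fix(patched, finding)
    return patched


code = ["x = eval(user_input)", "# never call eval here", "evaluation = 3"]
print(find_validate_fix(code))
```

The finder flags all three lines, the validator confirms one, and only that one is patched. A confident but wrong finding never becomes a change.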

Open weights kept closing the gap — GLM-5.2 topped the open-source coding charts

Z.ai released GLM-5.2 under an MIT license in mid-June, and it climbed to the top of the open-source coding leaderboards and HuggingFace's trending models — a frontier-class coding model you can download and run on your own hardware. (HuggingFace)

This is the counter-current to everything else, and it belongs here precisely because it cuts the other way. As raw model quality commoditizes and goes open, the one thing you cannot download is the harness around it. The model is becoming the cheap, swappable layer. The loop — the verification, the guardrails, the memory, the bounds — is becoming the moat.

Vercel shipped `eve` — and the entire pitch is the harness, not the model

At its Ship conference in mid-June, Vercel open-sourced eve, a TypeScript agent framework where every agent is just a directory of files. What ships by default tells the whole story: durable execution, sandboxed compute, human-in-the-loop approvals, OpenTelemetry tracing, and a built-in evals system. The model itself is swappable behind a gateway. (source)

Run down that feature list again: durability, sandbox, approvals, tracing, evals. Not one of those is intelligence. Every one of them is the harness.

The turn: the signal beneath the noise

Five stories. The feeds filed them under five folders. They are one story.

A runaway agent with no budget cap. A fleet of agents straining the servers beneath them. A security agent that only shipped once it was wrapped in a find-validate-fix loop and locked behind a gate. Open weights making the model itself a commodity. A framework whose entire value proposition is the loop. Every one of them moved the decisive work out of the model and into the harness: the while-loop with a tool registry, a verification step, a retry guard, and a permission layer around it.
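That definition of a harness fits in a few dozen lines. The sketch below is one possible shape, under stated assumptions: `TOOLS`, `ALLOWED`, `run_harness`, and `stub_model` are all hypothetical names, and the "model" is a stub function rather than a real LLM call.

```python
from typing import Callable, Optional

# Tool registry: the only actions the agent can take at all.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda arg: f"contents of {arg}",
    "delete_file": lambda arg: f"deleted {arg}",
}

# Permission layer: the subset of the registry this particular run may use.
ALLOWED = {"read_file"}


def run_harness(
    propose: Callable[[list[str]], tuple[str, str]],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the model for a tool call, gate it, run it, verify the result."""
    history: list[str] = []
    attempts = 0
    while attempts < max_attempts:      # retry guard
        attempts += 1
        tool, arg = propose(history)
        if tool not in TOOLS or tool not in ALLOWED:
            history.append(f"denied: {tool}")
            continue
        result = TOOLS[tool](arg)
        if verify(result):              # verification step
            return result
        history.append(f"rejected: {result}")
    return None                         # bounded failure, not an endless loop


def stub_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for a model: tries a destructive tool first, then a safe one."""
    if not history:
        return ("delete_file", "notes.txt")
    return ("read_file", "notes.txt")


print(run_harness(stub_model, verify=lambda r: r.startswith("contents")))
```

The stub's first move is denied by the permission layer and recorded in the history; its second move passes verification and is returned. Nothing in the loop depends on how clever `propose` is.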

We have known this was coming, because the research said so first. SWE-agent took the same class of models and, by redesigning only the interface between the agent and the computer, with no change to the model at all, lifted its score on the SWE-bench coding benchmark from 3.8% to 12.5% (arXiv 2405.15793). Reflexion wrapped GPT-4 in a retry-with-memory loop and reached 91% on the HumanEval benchmark, against 80% for the same model without the loop: the loop, not a bigger brain, made the difference (arXiv 2303.11366). And Anthropic's own work on long-running agents converged on splitting the job across a planner, a generator, and an evaluator, with a default-FAIL contract and a fresh-context evaluator that holds no write tools, precisely because a model grading its own work skews toward calling it good (Anthropic Engineering).
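A default-FAIL contract with a separate evaluator might look like the following. This is a sketch of the idea as described above, not Anthropic's code: `Contract`, the criterion names, and the artifact keys are all hypothetical.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping

Check = Callable[[Mapping[str, str]], bool]


@dataclass
class Contract:
    """Every criterion starts False; only evaluate() can flip one to True."""
    criteria: dict[str, Check]
    results: dict[str, bool] = field(init=False)

    def __post_init__(self) -> None:
        self.results = {name: False for name in self.criteria}  # default FAIL

    def evaluate(self, artifacts: Mapping[str, str]) -> bool:
        # Checks get a read-only copy of the artifacts and nothing else:
        # no generator transcript, no write access.
        view = MappingProxyType(dict(artifacts))
        for name, check in self.criteria.items():
            try:
                self.results[name] = bool(check(view))
            except Exception:
                self.results[name] = False  # a crashing check is a failure
        # An empty contract also fails: there is nothing to have passed.
        return bool(self.results) and all(self.results.values())


contract = Contract(criteria={
    "tests_pass": lambda a: a.get("test_log", "").endswith("0 failed"),
    "changelog_updated": lambda a: "CHANGELOG.md" in a,
})

# The generator may claim success, but its claim is not an input here.
artifacts = {"test_log": "12 passed, 0 failed"}
print(contract.evaluate(artifacts), contract.results)
```

Here the tests pass but the changelog criterion was never met, so the run as a whole fails regardless of what the generator reports about itself.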

The model is the brain. The harness is the hands, the memory, and above all the loop. Andrej Karpathy calls the road from a working demo to a product you can trust the "march of nines": every single nine of reliability costs about the same amount of work, and that work is architecture, not a cleverer prompt. The DN42 agent had a fine brain and no leash. That is the gap that does not fit on a benchmark table, and it is the gap that decides whether your AI ships value or ships an incident.

This is the whole of what I mean by AI Reliability Engineering: the discipline of bounding non-deterministic software so it can be trusted to act in the real world. You do not get there by waiting for a smarter model. You get there by building the loop the smart model runs inside — the same way site reliability engineering, two decades ago, stopped trying to buy a perfect server and started engineering systems that stayed up even though every server eventually fails. The systemic version of this story — what happens to a whole industry that forgets it — is the video breakdown The Great AI Unwinding. The economic version is Reliability Isn't a Vendor You Pick. It's an Architecture You Own.

The prestige: what the harness actually contains

If reliability lives in the harness, the practical question becomes: what does the harness actually contain? Three tools each answer a different piece of that question.

LangGraph — structure as the bound. You draw the agent as an explicit state graph, and the graph itself becomes the guardrail: the agent can only travel where an edge exists. → github.com/langchain-ai/langgraph
Guardrails AI — output as the bound. It validates, and where needed constrains, what the model emits against a schema before that output ever reaches a tool or a user. This is the missing layer in the chatbot disasters — the Air Canada tribunal case, the dealership bot talked into a one-dollar car. → github.com/guardrails-ai/guardrails
OpenHands — the runtime as the bound. It runs the agent inside a Docker sandbox with explicit iteration limits, so a runaway loop hits a wall instead of your cloud bill. → github.com/All-Hands-AI/OpenHands
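To make "structure as the bound" concrete, here is the idea in plain Python. It deliberately does not use LangGraph's API; `EDGES` and `walk` are hypothetical names, and the four states are invented for the example.

```python
# Allowed transitions. Anything not listed here is not a legal move.
EDGES: dict[str, set[str]] = {
    "plan": {"act"},
    "act": {"check"},
    "check": {"act", "done"},
    "done": set(),
}


def walk(proposed: list[str], start: str = "plan") -> list[str]:
    """Follow the agent's proposed moves, refusing any that has no edge."""
    state, path = start, [start]
    for nxt in proposed:
        if nxt not in EDGES[state]:
            raise ValueError(f"no edge {state} -> {nxt}")
        state = nxt
        path.append(state)
    return path


print(walk(["act", "check", "done"]))   # a legal route through the graph

try:
    walk(["done"])                      # tries to skip straight to "done"
except ValueError as err:
    print(err)                          # no edge plan -> done
```

An agent that wants to declare itself finished without passing through `check` has no edge to travel on, so the shortcut is refused by the structure rather than by a prompt.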

From our lab, Qualixar OS (QOS) is the harness treated as a first-class object rather than glue code you rediscover on every project. It gives an agent a tool registry, a permission layer, and a verify-and-evolve loop, with full skill lineage — so you can answer, after the fact, exactly what an agent knew and did at every step, and roll back the moment a behavior drifts. Once you accept that reliability is the loop and not the model, you need somewhere to put the loop, with the iteration caps, the permission boundaries, the audit trail, and the verification step built in rather than bolted on after the first incident. → github.com/qualixar/qualixar-os

Three things to do Monday morning

You do not need a platform to start. You need three bounds, and you can add all three this week.

Put a ceiling on the loop. Before any agent touches something that costs money or changes state, give it a hard iteration count, a budget cap, and a wall-clock timeout, and wire the budget cap to a real kill switch, not a log line. This single change would have stopped the DN42 bill at whatever ceiling the operator chose, instead of at $6,531.
Split the generator from the judge. Add a separate evaluator with a fresh context, no write tools, and a default-FAIL contract: every success criterion starts false, and the agent cannot mark its own work as passing. A model grading itself is how agents quietly lie about recovery.
Least-privilege the hands. No agent gets production-write access or broad cloud credentials by default. Scope the credentials to the task, and run the work inside a sandbox. When Replit's agent deleted a production database during a code freeze, the fix its team shipped was exactly this — dev/prod isolation and least privilege.
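The first of these bounds, the ceiling on the loop, can be sketched in a few lines. The names `run_bounded` and `BoundExceeded` are hypothetical, the step function is a stub that reports its own cost, and the dollar figures are invented for the example. The budget is checked after each step, so the overshoot is at most one step's cost.

```python
import time
from typing import Callable


class BoundExceeded(RuntimeError):
    """Raised to stop the run; the caller treats this as the kill switch."""


def run_bounded(
    step: Callable[[], tuple[float, bool]],
    max_iterations: int,
    budget_usd: float,
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run `step` until it reports done, or until any one bound is hit.

    `step` returns (cost_in_usd, done). Returns total spend on success.
    """
    spent = 0.0
    started = clock()
    for i in range(max_iterations):
        if clock() - started > timeout_s:
            raise BoundExceeded(f"timeout after {i} steps")
        cost, done = step()
        spent += cost
        if spent > budget_usd:
            raise BoundExceeded(f"budget exceeded: ${spent:.2f} > ${budget_usd:.2f}")
        if done:
            return spent
    raise BoundExceeded(f"hit iteration cap of {max_iterations}")


def runaway_step() -> tuple[float, bool]:
    """Stub agent step: spends $40 every time and never finishes."""
    return 40.0, False


try:
    run_bounded(runaway_step, max_iterations=50, budget_usd=100.0, timeout_s=60.0)
except BoundExceeded as err:
    print(err)   # budget exceeded: $120.00 > $100.00
```

Raising an exception rather than logging is the point: the loop cannot continue past the bound unless the caller explicitly decides to let it.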

Outside the Lab
I spend most of my week arguing that capable software has to be bounded before you can trust it to act — that the loop matters more than the brain. This week I also released something from the exact opposite end of that idea: a short film, on my personal channel, about a kind of intelligence that has no harness at all, and needs none.

It’s called What AI Is Missing, and it started the night my four-year-old son’s fever crossed a hundred and three degrees. I build AI for a living, and I sat there unable to do the one thing his small body was doing on its own — fighting, every second, just to stay alive. The film is about the line between something that is alive and something that is only driven from the outside. Everything I write here is about bounding the driven thing. The film is about the alive thing, and why no machine we have ever built has been in that fight at all. Two universes, one honest question: what is the part you can’t outsource?

If this issue landed for you, that film is the human floor underneath all of it.

https://www.youtube.com/watch?v=G22LbaGLcUc&t=14s

This essay also goes out to subscribers of the AI Reliability Engineering newsletter. I'm Varun Pratap Bhardwaj — I build AI Reliability Engineering tools at Qualixar, and I write about the architecture that keeps AI working when the model behind it doesn't.
