Muggle AI

Posted on Jun 5

I Spent $200 in Two Hours Watching a Coding Agent Guess

#agents #ai #testing #vibecoding

I spent $200 in two hours and the bug was still there at the end of it.

One bug. Two hours. Two hundred dollars.

One bug. Not a feature, not a refactor, one bug. There’s a report page in my product with three columns. Table of contents on the left, the actual content in the middle, a little quick-nav thing on the right. You click an entry in the table of contents and the middle column is supposed to scroll to that section. It didn’t. That’s it. That’s the whole bug. If you’ve ever fought with sticky headers and nested scroll containers and the exact math scrollIntoView does versus what you wanted it to do, you know it’s fiddly, but it’s not some exotic thing.

I watched Opus 4.6 propose a fix (this was before 4.7 came out, and 4.7 was worse). It didn’t look at anything first; it just decided what was probably wrong and wrote the change. The change ran, which cost actual money, and it didn’t work. So it went “okay” and tried the next idea. Wrote it, ran it, didn’t work. Next one.

It never stopped to actually figure out the cause. It just kept producing. And the meter kept going the whole time. I eventually killed the session and started clean, but only after my partner called and yelled at me. The restart has its own special feeling too: paying two hundred bucks to end up exactly where you started.

What gets me is that the boring approach would have worked. Reproduce it, look at what’s actually happening on the page, drop a log in, narrow it down before you touch a line. That’s the whole job. The agent skipped straight to writing confident code and charged me for every wrong guess on the way.
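For a sense of what "look first" can mean with nested scroll containers, here is a minimal sketch. It is an illustration, not a diagnosis of this particular bug. The `AncestorSnapshot` type and the element names are made up for the example, and the values are the kind you would copy out of DevTools or a console log before changing any code. It also simplifies: it treats only `auto` and `scroll` as scrollable.

```typescript
// Hypothetical snapshot of one element in the ancestor chain, copied by hand
// from DevTools or logged from the page. These names are illustrative only.
interface AncestorSnapshot {
  name: string;         // label for the log, e.g. "middle-column"
  overflowY: string;    // computed style value: "visible", "auto", "scroll", ...
  scrollHeight: number; // total content height in px
  clientHeight: number; // visible height in px
}

// Simplification: an element counts as scrollable here only if overflow-y is
// "auto" or "scroll" and its content is taller than its visible box.
// (overflow: hidden can still be scrolled from script; this sketch ignores that.)
function canScroll(el: AncestorSnapshot): boolean {
  const overflowAllows = el.overflowY === "auto" || el.overflowY === "scroll";
  return overflowAllows && el.scrollHeight > el.clientHeight;
}

// Walk from the section's parent outward and return the first ancestor
// that can actually scroll, or null if none can.
function nearestScroller(ancestors: AncestorSnapshot[]): AncestorSnapshot | null {
  for (const el of ancestors) {
    if (canScroll(el)) return el;
  }
  return null;
}

// Made-up numbers for a three-level chain, innermost first.
const chain: AncestorSnapshot[] = [
  { name: "section-wrapper", overflowY: "visible", scrollHeight: 4200, clientHeight: 4200 },
  { name: "middle-column", overflowY: "auto", scrollHeight: 4200, clientHeight: 800 },
  { name: "page-root", overflowY: "auto", scrollHeight: 900, clientHeight: 900 },
];

const scroller = nearestScroller(chain);
console.log("nearest scroller:", scroller ? scroller.name : "none");
```

If the element the click handler scrolls is not the one this reports, that is a cause worth confirming before anyone writes a fix.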

Anyway, the bug isn’t really the point. It’s just where I finally saw the thing clearly.

Pretty is solved. Ready to sell is not.

The easy stuff basically feels solved at this point. You want a slick landing page, some generated images, a thirty-second video, a script that pulls the day’s news? The agent will do all of it, and it’ll keep getting cleaner as the models improve. No argument from me there. That part is good and I use it.

But “make me a nice page” and “make me something I can actually sell” are not the same job and it isn’t close. The second you’re building for a real person who’s going to pay you, or a business that’s going to lean on it, the list explodes. Accounts and logins. A real database. The thing not falling over when traffic shows up. Payments, which you cannot get wrong even once.

The way I think about it: a page with every animation you could want is still the easy bracket. A popup that pulls real data from the backend, saves it, and shows you the right state when you come back to it later, that’s a different animal. The pretty part is done. The part where it has to be true, under load, with real users and real money moving, that’s the part the agents are still shaky on. And yeah, that’s exactly where my $200 went.

Everything is a black box

Here’s what you’re actually signing up for once you get serious with these things. It’s all a black box.

You don’t know what it’ll cost, and that’s not me being dramatic, it’s measurable. There’s research on this (trust me, I let AI bulletproof my point) where people ran the same coding task over and over on the frontier models and the token cost for the identical task swung by more than 10x between runs, and the agents couldn’t predict their own spend going in. So picture it spending a hundred bucks building something and another eighty cleaning up a bug it introduced itself, and you never even find out that’s what happened. You also don’t know how long it’ll take. And the worst one, the one that actually keeps me up: after the money’s gone and the time’s gone, you still don’t know if the thing works. The cost is bad, the clock is bad, but not knowing whether you got anything for either of them is the real problem.

My $200 was all three of those at once. Didn’t know the cost, didn’t know the time, and at the end I had nothing.

It’s not more tests (I know how that sounds)

So what’s the actual missing piece?

It is not more tests. I want to be careful saying that because it really sounds like the answer should be more tests. The agent will write your code, then write tests for that code, run them, and show you a wall of green. And you, the person who actually has to use the thing, are sitting there looking at red. This isn’t just my bad luck, it’s a known failure. When the same agent writes the code and the tests in one sitting, the tests just end up being a mirror of what the agent thought it was building, not what you actually asked for. There’s research on AI-generated test suites hitting full coverage while catching almost no real bugs. It aced its own exam. The exam was wrong. And nobody’s got the energy to read every line and find the spot where the green is lying to them, definitely not at the end of a long day.

And before someone says it, no, it’s not agent harnessing, and it’s not TDD or spec-driven development either. A harness is a workflow. It keeps the agent on the rails, it does not make the work correct. People skip past this part: the agent is the one doing the work. If it doesn’t notice it got something wrong, or it’s just wrong with full confidence, then everything after that is suspect. The code, the fix, even the spec or the plan it wrote for itself. Nobody’s there to catch it but you. And the one day you’re tired and you wave it through, congratulations, your user is now the one who finds it.

Okay, so what about a really tight set of rules, a project config that spells out exactly how the agent should behave? Still no. On anything big the context balloons and the agent quietly stops following your rules. There’s a pile of research now on models degrading and dropping instructions as context gets longer, before the window’s even full. This is not some rare edge case. For me it’s just normal.

The most effective thing I’ve personally found is splitting the work across different models. One plans, one executes, one audits. I’ll be honest, that setup works better than anything else I’ve run. (Funny enough it’s roughly what the big labs land on when they actually care about being right: keep the thing that builds separate from the thing that judges.) But “works” and “efficient” are not the same word. I spent forever slicing the work into pieces and a real amount of money keeping a few subscriptions alive at once just to hold the whole contraption together.

Either the agent grades its own homework, or you do

And here’s what every one of those approaches has in common, which is the thing I can’t get past. Either the agent is grading its own homework, or you are. The first one we’ve established is broken. The second one works fine right up until the moment you’re busy or lazy or just a normal tired human, and then it doesn’t.

What’s actually missing is a check that doesn’t depend on either of those. Something outside the whole loop that can say “does this do the thing it was supposed to,” instead of “did the code the agent wrote agree with the tests the agent also wrote.”
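Here is a rough sketch of the shape I mean, under heavy assumptions. Every name in it (`BehaviorCheck`, `judge`, the fields on the observed state) is hypothetical and not any real tool's API. The only point it makes is structural: the verdict is computed from what was observed on the running product, and the agent's own "green" is recorded but never counted.

```typescript
// Whatever was observed on the running product. Hypothetical shape.
type ProductState = Record<string, unknown>;

interface BehaviorCheck {
  description: string;                        // written by a human, before the agent starts
  holds: (observed: ProductState) => boolean; // looks only at the observed product state
}

interface Verdict {
  accepted: boolean;
  failed: string[];        // descriptions of the checks that did not hold
  agentDisagrees: boolean; // agent reported green, but the product was not accepted
}

function judge(
  agentReportedGreen: boolean,
  observed: ProductState,
  checks: BehaviorCheck[],
): Verdict {
  const failed = checks.filter((c) => !c.holds(observed)).map((c) => c.description);
  // The agent's report is deliberately not an input here.
  // An empty check list is not accepted either: nothing was verified.
  const accepted = checks.length > 0 && failed.length === 0;
  return { accepted, failed, agentDisagrees: agentReportedGreen && !accepted };
}

// Made-up observation: the user clicked "results" but "intro" is still at the top.
const observed: ProductState = { clickedSection: "results", topVisibleSection: "intro" };

const checks: BehaviorCheck[] = [
  {
    description: "clicked section is the one at the top of the middle column",
    holds: (o) => o.clickedSection === o.topVisibleSection,
  },
];

const verdict = judge(true, observed, checks);
console.log(verdict);
```

With these made-up values the agent says green, the one check fails, and `agentDisagrees` comes back true. That disagreement is the signal I care about.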

What I actually want to exist

I’ll be straight about what’s out there, because this is the spot where it’d be really easy to oversell. A few tools have started checking the running product the way a person would, instead of taking the agent’s tests at face value, and that’s the right direction. But none of it is yet a real acceptance layer. By which I mean something independent enough, and trustworthy enough, that it can sit between “the agent says it’s done” and you, and actually answer the question. That’s the thing worth building and it’s what we’re going after, because honestly everything else hangs off it.

Now, the part I keep turning over in my head: what a real, honest “this works” signal would let you do. Once you can actually measure done, somebody could sell a guarantee on top of it. Imagine putting $30 on a task and the completion is just guaranteed. Costs less, the other side pockets the difference. Costs more, they eat it. You’d simply know it gets done for thirty bucks. The only reason nothing like that exists today is that “done” in software has always been mush. Nobody can agree on when a thing is actually finished, so nobody can stand behind it. Crack the measuring and the rest opens up. The measuring is the whole game.
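The arithmetic of that guarantee is simple enough to write down. This is a toy sketch, not a pricing model anyone offers. The `settle` function and its field names are invented for illustration, the $18 run is a made-up number, and the $200 run is my own session from the top of this post.

```typescript
interface Settlement {
  customerPays: number;   // always the quoted price, whatever the run cost
  providerMargin: number; // positive: the provider pockets it; negative: the provider eats it
}

// Fixed-price settlement: the customer's bill never depends on actual cost.
function settle(quotedPrice: number, actualCost: number): Settlement {
  return { customerPays: quotedPrice, providerMargin: quotedPrice - actualCost };
}

const cheapRun = settle(30, 18);   // hypothetical run that came in under the quote
const costlyRun = settle(30, 200); // a run that went the way mine did

console.log(cheapRun);  // { customerPays: 30, providerMargin: 12 }
console.log(costlyRun); // { customerPays: 30, providerMargin: -170 }
```

Nobody would sign up for the second line without a trustworthy way to say when the task is finished, which is the measuring problem again.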

Which is why we started the company. Not because we’ve cracked it, to be clear, we haven’t. What Muggle Test does right now is narrow and concrete: it runs behavioral checks against your running product, the live thing, the way a user would actually hit it, independent of whatever the agent’s own tests are claiming. So when the agent says green and the product is red, something from outside the agent’s loop is the one to tell you. That’s what exists today. The bigger goal is to grow that into an acceptance layer you can actually trust, the kind of “done” signal everything else, guarantees included, could eventually be built on. We’re early.

If you build with these agents you already know the feeling. The one where you’re just watching the number go up. Tell me where it’s gotten you. We’re doing this in the open and it’s early, so if something’s broken or we’re missing something obvious, that’s exactly what I want to hear.

Top comments (1)

Adam Lewis • Jun 6

You ruled out spec-driven, but I think the version you're describing as missing is close to it. You're right that a spec the agent writes is no safer than its code, since it came from the same place. But a check you write yourself, before it starts, and don't let it touch, is outside the loop in the way you want.

It doesn't have to be a full suite. For that scroll bug, one test that asserts the clicked section ends up at the top of the container is something the agent can't make pass by agreeing with its own tests. And the reproduce-first habit you describe at the start is really the same thing - nothing forced it to prove the cause before writing a fix, so it guessed. I'd still want the independent product-level checker you're building, I just wouldn't wait for it before writing down the few behaviours that say a task is done.
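A minimal sketch of that one check, assuming you can read two rectangles after the click. The function names, the sticky-header parameter, and the 2px tolerance are my own choices for illustration, not anything from the post.

```typescript
// Minimal shape compatible with what getBoundingClientRect() returns.
interface RectLike {
  top: number;
}

// Distance in px between the section's top edge and the top of the container's
// visible area, after subtracting a sticky header that covers part of it.
function offsetFromVisibleTop(
  container: RectLike,
  section: RectLike,
  stickyHeaderHeight: number,
): number {
  return section.top - (container.top + stickyHeaderHeight);
}

// True when the section sits at the top of the visible area, within a tolerance.
function sectionIsAtTop(
  container: RectLike,
  section: RectLike,
  stickyHeaderHeight = 0,
  tolerancePx = 2,
): boolean {
  return Math.abs(offsetFromVisibleTop(container, section, stickyHeaderHeight)) <= tolerancePx;
}

// Made-up readings: container starts at y=64 with a 56px sticky header.
const scrolledCorrectly = sectionIsAtTop({ top: 64 }, { top: 120 }, 56);
const neverScrolled = sectionIsAtTop({ top: 64 }, { top: 1840 }, 56);

console.log(scrolledCorrectly, neverScrolled); // true false
```

In a browser test you would click the table-of-contents entry, wait for the scroll to settle, read both rectangles with `getBoundingClientRect()`, and fail the task if `sectionIsAtTop` returns false. The part that matters is that you write it before the agent starts and it doesn't get to edit it.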