<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hidai Bar-Mor</title>
    <description>The latest articles on DEV Community by Hidai Bar-Mor (@hidai25).</description>
    <link>https://dev.to/hidai25</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650124%2Ffcaab7e6-c16c-4f56-8d97-967ae1349bfa.jpeg</url>
      <title>DEV Community: Hidai Bar-Mor</title>
      <link>https://dev.to/hidai25</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hidai25"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Did Not Crash. It Just Started Making Things Up.</title>
      <dc:creator>Hidai Bar-Mor</dc:creator>
      <pubDate>Tue, 10 Mar 2026 15:44:47 +0000</pubDate>
      <link>https://dev.to/hidai25/your-ai-agent-did-not-crash-it-just-started-making-things-up-56m7</link>
      <guid>https://dev.to/hidai25/your-ai-agent-did-not-crash-it-just-started-making-things-up-56m7</guid>
      <description>&lt;p&gt;I think the most dangerous agent bugs are the ones that look completely normal.&lt;/p&gt;

&lt;p&gt;No error. No crash. No red screen. No stack trace.&lt;/p&gt;

&lt;p&gt;The agent replies. The format looks right. The answer sounds confident. Everyone moves on.&lt;/p&gt;

&lt;p&gt;Meanwhile it quietly stopped using its tools three days ago and has been hallucinating ever since.&lt;/p&gt;

&lt;p&gt;That is the bug.&lt;/p&gt;

&lt;p&gt;I have seen this again and again while building agents.&lt;/p&gt;

&lt;p&gt;A model update changes behavior behind the API.&lt;/p&gt;

&lt;p&gt;A framework update messes with tool calling.&lt;/p&gt;

&lt;p&gt;A checkpoint resumes with bad state.&lt;/p&gt;

&lt;p&gt;A subagent silently stops running.&lt;/p&gt;

&lt;p&gt;Everything still looks fine from the outside. That is what makes it nasty.&lt;/p&gt;

&lt;p&gt;The response is clean. The tone is smooth. The answer is plausible.&lt;/p&gt;

&lt;p&gt;It is also wrong.&lt;/p&gt;

&lt;p&gt;And the worst part is your users usually cannot tell. Honestly, sometimes you cannot tell either until something blows up later.&lt;/p&gt;

&lt;p&gt;Most agent testing misses this completely.&lt;/p&gt;

&lt;p&gt;If your test only checks the final answer, it can pass.&lt;/p&gt;

&lt;p&gt;If your eval asks an LLM judge whether the response looks good, it can pass.&lt;/p&gt;

&lt;p&gt;Because the problem is often not the final answer.&lt;/p&gt;

&lt;p&gt;The problem is the path.&lt;/p&gt;

&lt;p&gt;The tool calls.&lt;/p&gt;

&lt;p&gt;The order.&lt;/p&gt;

&lt;p&gt;The arguments.&lt;/p&gt;

&lt;p&gt;The missing lookup step that used to happen every time and now just does not.&lt;/p&gt;

&lt;p&gt;That is where the regression starts.&lt;/p&gt;

&lt;p&gt;The output can still look good long after the behavior is already broken.&lt;/p&gt;

&lt;p&gt;This is why agent regressions feel so slippery. A normal app breaks loudly. An agent breaks politely.&lt;/p&gt;

&lt;p&gt;It smiles. It nods. It lies.&lt;/p&gt;

&lt;p&gt;What has worked much better for me is simple.&lt;/p&gt;

&lt;p&gt;Do not only test the answer. Snapshot the behavior.&lt;/p&gt;

&lt;p&gt;Run the agent when it is working.&lt;/p&gt;

&lt;p&gt;Record which tools it called, in what order, and with what inputs.&lt;/p&gt;

&lt;p&gt;Save that as the baseline.&lt;/p&gt;

&lt;p&gt;Then after every prompt change, model change, framework update, or tool refactor, run the same scenario again and compare the trajectory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ login_flow         PASSED
⚠ refund_request     TOOLS_CHANGED
    before: lookup_order → check_policy → process_refund
    now:    lookup_order → process_refund

✗ billing_dispute    REGRESSION   score 85 → 55
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
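&lt;p&gt;The diff itself is just ordered list comparison. Here is a minimal Python sketch of the idea, with hypothetical tool-call lists; this illustrates the technique, not EvalView's actual internals:&lt;/p&gt;

```python
# Minimal sketch of deterministic tool-call diffing.
# The trajectories below are made-up examples, not a real agent trace.

def diff_trajectory(baseline, current):
    """Compare two ordered lists of tool names and report drift."""
    if baseline == current:
        return "PASSED"
    missing = [t for t in baseline if t not in current]
    added = [t for t in current if t not in baseline]
    if missing or added:
        return f"TOOLS_CHANGED missing={missing} added={added}"
    return "ORDER_CHANGED"

baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "process_refund"]

print(diff_trajectory(baseline, current))
# → TOOLS_CHANGED missing=['check_policy'] added=[]
```

&lt;p&gt;Because the comparison is exact and ordered, it costs nothing to run on every change.&lt;/p&gt;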



&lt;p&gt;Now the bug is obvious.&lt;/p&gt;

&lt;p&gt;The tool disappeared.&lt;/p&gt;

&lt;p&gt;The sequence changed.&lt;/p&gt;

&lt;p&gt;The quality dropped.&lt;/p&gt;

&lt;p&gt;You catch it in review instead of learning about it from an angry user.&lt;/p&gt;

&lt;p&gt;That is the part I wish more people talked about.&lt;/p&gt;

&lt;p&gt;A lot of agent eval discussion is still obsessed with final outputs. Was the answer good? Did the judge like it? Did the score go up?&lt;/p&gt;

&lt;p&gt;That matters.&lt;/p&gt;

&lt;p&gt;But if you are shipping agents, behavior drift matters just as much.&lt;/p&gt;

&lt;p&gt;Sometimes more.&lt;/p&gt;

&lt;p&gt;Because once an agent stops taking the right path, it can still sound smart for a surprisingly long time.&lt;/p&gt;

&lt;p&gt;That is where false confidence comes from.&lt;/p&gt;

&lt;p&gt;And the nice part is you do not need to spend a fortune to catch this stuff.&lt;/p&gt;

&lt;p&gt;Tool call diffing is deterministic.&lt;/p&gt;

&lt;p&gt;You do not need an LLM judge every time.&lt;/p&gt;

&lt;p&gt;You can reserve model based scoring for the cases where output quality actually needs judgment and keep structural regression checks running all the time.&lt;/p&gt;

&lt;p&gt;That is the workflow I wanted, so I built &lt;a href="https://github.com/hidai25/eval-view" rel="noopener noreferrer"&gt;EvalView&lt;/a&gt; around it.&lt;/p&gt;

&lt;p&gt;Snapshot behavior.&lt;/p&gt;

&lt;p&gt;Compare runs.&lt;/p&gt;

&lt;p&gt;Catch regressions before they hit production.&lt;/p&gt;

&lt;p&gt;But even if you never use EvalView, I think this habit is worth adopting right now.&lt;/p&gt;

&lt;p&gt;Start recording tool calls.&lt;/p&gt;

&lt;p&gt;Start diffing trajectories.&lt;/p&gt;

&lt;p&gt;Start treating agent behavior like something you can baseline, compare, and protect.&lt;/p&gt;
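&lt;p&gt;Recording can start as a ten-line decorator. A hypothetical sketch (the tool and variable names are made up), just to show how little plumbing a trajectory log needs:&lt;/p&gt;

```python
import functools

# Sketch of tool-call recording: wrap each tool so its name and
# arguments are appended to a trajectory log you can snapshot and diff.
# The names here (lookup_order, trajectory) are illustrative, not a real API.

trajectory = []

def record(tool):
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        trajectory.append({"tool": tool.__name__, "args": args, "kwargs": kwargs})
        return tool(*args, **kwargs)
    return wrapper

@record
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

lookup_order("12345")
print(trajectory)
# → [{'tool': 'lookup_order', 'args': ('12345',), 'kwargs': {}}]
```

&lt;p&gt;Dump that list to a file after a known-good run and you have a baseline to diff against.&lt;/p&gt;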

&lt;p&gt;Because your AI agent usually will not crash when it breaks.&lt;/p&gt;

&lt;p&gt;It will just get smoother at being wrong.&lt;/p&gt;

&lt;p&gt;If you have seen this happen in production, I would genuinely love to hear your story.&lt;/p&gt;

&lt;p&gt;If this article helped and you want to follow the project, the repo is at &lt;a href="https://github.com/hidai25/eval-view" rel="noopener noreferrer"&gt;github.com/hidai25/eval-view&lt;/a&gt;. Stars and feedback are always appreciated.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>My AI agent cost me $400 overnight, so I built pytest for agents and open-sourced it</title>
      <dc:creator>Hidai Bar-Mor</dc:creator>
      <pubDate>Mon, 08 Dec 2025 09:52:39 +0000</pubDate>
      <link>https://dev.to/hidai25/my-ai-agent-cost-me-400-overnight-so-i-built-pytest-for-agents-and-open-sourced-it-492c</link>
      <guid>https://dev.to/hidai25/my-ai-agent-cost-me-400-overnight-so-i-built-pytest-for-agents-and-open-sourced-it-492c</guid>
      <description>&lt;p&gt;So there I was at 2am staring at my OpenAI dashboard wondering how the hell my bill went from $80 to $400 in a single day.&lt;br&gt;
The answer? One of my agents decided to call the same tool 47 times in a loop. In production. While real users were waiting.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;I've been running custom AI agents in production for about six months now. Here's what I learned the hard way: agents that work perfectly on your local machine will absolutely betray you in production.&lt;br&gt;
Sometimes they hallucinate tools that don't exist. Sometimes they answer questions without calling any tools at all, just making stuff up with complete confidence. Sometimes they get stuck in loops burning through tokens like there's no tomorrow.&lt;br&gt;
The worst part? You don't find out until a user complains. Or until you check your billing dashboard and feel your stomach drop.&lt;br&gt;
I tried writing unit tests but how do you even test something that's nondeterministic by design? Mock the LLM? Cool, now you're testing your mocks, not your agent.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Actually Wanted
&lt;/h2&gt;

&lt;p&gt;I wanted something dead simple. Write down what the agent is supposed to do. Run it. Fail the build if it does something stupid.&lt;br&gt;
That's it. No PhD required.&lt;br&gt;
So I built it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Meet EvalView
&lt;/h2&gt;

&lt;p&gt;The idea is embarrassingly simple. You write a YAML file describing what should happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order lookup&lt;/span&gt;
&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;12345?"&lt;/span&gt;
&lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;get_order_status&lt;/span&gt;
&lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a real test. If the agent answers without calling get_order_status, the test fails. If it suddenly costs more than 10 cents, the test fails. Red error, CI breaks, deploy blocked.&lt;br&gt;
The tool call check alone catches probably 90% of the dumb stuff. Agent confidently answered a question about an order without actually looking up the order? Caught. Agent called some random tool instead of the right one? Caught. Agent decided to call the same tool fifteen times? You get the idea.&lt;/p&gt;
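&lt;p&gt;Under the hood, checks like that boil down to a couple of assertions over the run trace. Here is a rough Python sketch; the trace shape and function names are hypothetical, not EvalView's real data model:&lt;/p&gt;

```python
# Sketch of the two checks the YAML test implies: every expected tool
# was actually called, and total cost stayed under the threshold.
# The trace dict shape below is a made-up example.

def check_run(trace, expected_tools, max_cost):
    """Return a list of failure messages; empty list means the test passes."""
    failures = []
    called = [c["name"] for c in trace["tool_calls"]]
    for tool in expected_tools:
        if tool not in called:
            failures.append(f"missing tool call: {tool}")
    if trace["cost_usd"] > max_cost:
        failures.append(f"cost {trace['cost_usd']:.2f} exceeds {max_cost:.2f}")
    return failures

trace = {"tool_calls": [{"name": "get_order_status"}], "cost_usd": 0.04}
assert check_run(trace, ["get_order_status"], 0.10) == []
```

&lt;p&gt;Either failure turns into a red line in CI, which is the whole point.&lt;/p&gt;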
&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;evalview
evalview quickstart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The quickstart spins up a tiny demo agent and runs some tests against it so you can see how it works. Takes maybe fifteen seconds.&lt;br&gt;
For your own agent you just point it at your test files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;evalview run 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throw it in CI and now you have actual guardrails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed For Me
&lt;/h2&gt;

&lt;p&gt;Before EvalView I was averaging maybe two or three angry user reports per deploy. Something would break in some weird edge case and I'd spend my evening debugging production.&lt;br&gt;
After adding these tests? Ten deploys in a row with zero incidents. I actually deploy on Fridays now. I know, I know, but I do.&lt;br&gt;
The $400 surprise bills stopped too. Turns out catching infinite loops before production is good for your wallet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Boring Technical Stuff
&lt;/h2&gt;

&lt;p&gt;It works with LangGraph, CrewAI, OpenAI, Anthropic, basically anything you can hit with an HTTP request. There's also an LLM as judge feature for checking output quality since exact string matching is useless for AI responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Working On Next
&lt;/h2&gt;

&lt;p&gt;I'm also thinking about adding test generation from production logs so you can turn real failures into regression tests automatically. And maybe a comparison mode to test different agent versions or configurations side by side and see which one performs better.&lt;br&gt;
If you've got ideas or want to contribute I'm very open to that. The codebase is not that big and there's plenty of low hanging fruit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Look At It
&lt;/h2&gt;

&lt;p&gt;Here's the repo: &lt;a href="https://github.com/hidai25/eval-view" rel="noopener noreferrer"&gt;https://github.com/hidai25/eval-view&lt;/a&gt;&lt;br&gt;
If you've ever had an agent embarrass you in production or if you've ever opened a cloud bill and felt physical pain, maybe give it a shot. And if it saves you even one late night debugging session, throw it a star.&lt;br&gt;
I'm genuinely curious what other people are doing for this stuff. Do you have some elaborate eval setup? Let me know in the comments because I'm still figuring this out as I go.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
