
Louis Dupont

How to move beyond Vibe Checking

When developing AI, vibe checking is a must. But only up to a point.

When you start an AI project, everything feels like progress. You tweak a prompt, add context, examples, or even plug in a retrieval system. It looks better. So you keep going.

But eventually, it stops being clear what “better” even means.

You didn't break anything. But you're not moving forward either. The outputs are different, but not obviously more useful. You tweak again. And again. Some changes help. Some don't. Some feel promising until a week later when a user hits a strange edge case you thought was gone.

At some point, you start to wonder:

Are we still improving this? Or are we just getting used to how it behaves?

That's vibe-checking.

You're not iterating. You're running changes through your gut and hoping they stick.

And that's not a critique. That's the way to get started.

But it's a phase you're supposed to grow out of.

Why Vibe-Checking Stops Working

In early prototypes, vibes are enough. You're testing if the core idea even makes sense. You're looking for signal, not stability. So you move fast. You don't overthink. You don't measure. Good.

But once you've seen the potential (once you're no longer validating the idea, but trying to improve it) you need more than just a feeling.

The problem isn't that you trust your gut.

It's that your gut doesn't scale.

You change a prompt. The answers look cleaner. Then someone else on your team flags a regression you didn't notice. The improvements were real, but only on the five examples you had in mind.

The Shift You Actually Need

This is where most teams start thinking about metrics.

But good metrics don't come out of nowhere.

They come from understanding what matters.

That's the real shift: moving from vibes to clarity.

Not by jumping into evals, but by seriously observing what's going wrong.

That means sitting with the outputs. Looking at dozens of real examples. Tagging what failed and why. Not just “bad answer.” Not just “hallucination.” Specific, meaningful categories: wrong reference pulled, misunderstood intent, incomplete summary, broken format.

You don't need 50 dashboards. You don't even need to automate anything.

You need to name the failures you're seeing again and again.
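For concreteness, here's a minimal sketch of what that tagging can look like in practice. The category names and the data layout are illustrative assumptions, not a prescribed taxonomy; a spreadsheet works just as well as code.

```python
# A minimal sketch of manual failure tagging: review each real example
# by hand and attach specific failure labels. Categories are illustrative.

FAILURE_MODES = {
    "wrong_reference",       # retrieval pulled the wrong document
    "misunderstood_intent",  # answer addresses a different question
    "incomplete_summary",    # key points missing from the output
    "broken_format",         # output violates the expected structure
}

# Each record is one real query/output pair you actually looked at.
reviewed_examples = [
    {
        "query": "Summarize the Q3 incident report",
        "output": "...",
        "failures": ["incomplete_summary"],
    },
    {
        "query": "Which clause covers early termination?",
        "output": "...",
        "failures": ["wrong_reference", "broken_format"],
    },
    {
        "query": "Translate this invoice to French",
        "output": "...",
        "failures": [],  # no issue found
    },
]

# Guard against typos so the categories stay consistent across reviews.
for example in reviewed_examples:
    unknown = set(example["failures"]) - FAILURE_MODES
    assert not unknown, f"Unknown failure tags: {unknown}"
```

The point isn't the tooling. It's that every output gets a named, reusable label instead of a one-off gut reaction.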

Clarity is a Practice

Here's the trick no one tells you:

You can't scale what you haven't named.

Vibes are raw data. Clarity is the result of processing them.

If you do it right, i.e. if you go through 50, 100, 200 real examples and tag the failure modes, you'll start to see a pattern. Some failures happen more than you thought. Some are rare, but critical. Some only show up on specific query types.

Suddenly, your fixes aren't abstract anymore. They're targeted. And you can evaluate their impact by measuring how often each failure mode occurs.
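Here's a rough sketch of what that measurement can look like, assuming you've tagged examples as above. The before/after lists and their counts are placeholders, not real data.

```python
from collections import Counter

def failure_frequencies(examples):
    """Count how often each failure mode appears across reviewed examples."""
    counts = Counter()
    for example in examples:
        counts.update(example["failures"])
    return counts

# Two rounds of manual review, tagged as in the sketch above:
# one before a prompt change, one after. Values are placeholders.
reviewed_before = [
    {"failures": ["wrong_reference"]},
    {"failures": ["incomplete_summary", "broken_format"]},
    {"failures": ["wrong_reference"]},
]
reviewed_after = [
    {"failures": []},
    {"failures": ["incomplete_summary"]},
    {"failures": ["wrong_reference"]},
]

before = failure_frequencies(reviewed_before)
after = failure_frequencies(reviewed_after)

# Side-by-side counts show whether the fix actually reduced the
# failure modes it was targeting, and whether anything else got worse.
for mode in sorted(set(before) | set(after)):
    print(f"{mode:25s} {before[mode]:3d} -> {after[mode]:3d}")
```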

You're not guessing anymore. You're engineering.

Stop Tweaking. Start Observing.

You don't need to jump into full-on evals. Not yet.

But you do need to stop assuming that "looks better" is the same as "is better."

If you want to improve your system systematically, it starts by asking one simple question:

“What's actually going wrong? And how often?”

Until you have that answer, everything else is just educated guessing.

📌 Want to go deeper?
👉 I'm sharing my insights from years of building AI.
