DEV Community

wentao long
Vibe coding is a black box. I got tired of guessing and started measuring.

Computer science is a science. How do you do science without observation?

That's the question that started this.

I'd been using AI to build features for months. It worked — sometimes brilliantly, sometimes frustratingly. But I had no idea
what was actually happening. Was AI getting better at my project over time, or was I just getting better at prompting around
its blind spots? Were the rules I added actually helping, or did I just get lucky that session?

I couldn't answer any of these questions. Not because the answers didn't exist — but because I had no data.


The real problem with vibe coding

It's not that AI makes mistakes. It's that you can't see the pattern.

You fix something. Next session, same mistake. You fix it again. At some point you wonder: is this the third time I've
corrected this, or the seventh? Is this a one-off or a systemic gap in my project rules? Without data, you're just guessing.

I started logging this stuff manually in my existing project — not as a tool, just as structured notes. Task started,
deviation recorded, rule added. After a few weeks I had something interesting: a record of exactly where AI kept going wrong,
what I did about it, and whether it helped.

That's when I realized this should be automatic.

[Chart: Deviation root cause breakdown — hallucination vs rule-missing vs context-gap]


What I actually built

AIDA is an MCP server that silently collects structured data as your AI works. One line to set up:

  { "mcpServers": { "aida": { "command": "npx", "args": ["-y", "ai-dev-analytics", "mcp"] } } }

Every task, deviation, bug, self-review, and file change gets recorded to a local JSON file. No cloud, no telemetry, 100% on
your machine.
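I haven't verified AIDA's exact on-disk schema, but conceptually each entry is a small structured record along these lines (field names here are illustrative, not the actual format):

```json
{
  "type": "deviation",
  "task": "add settings page",
  "category": "layout",
  "rootCause": "rule-missing",
  "file": "src/pages/Settings.tsx",
  "timestamp": "2025-01-15T10:32:00Z"
}
```

Because it's plain JSON on your machine, you can also grep it or script over it directly, independent of the dashboard.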

Then aida dashboard renders it:

  • Where is AI deviating? Which categories — layout, components, API patterns?
  • Why is it deviating — hallucination, missing rules, or context gap?
  • After you add a rule, does that category of deviation actually go down?
  • What's the bug rate this sprint vs last sprint?
  • Which files keep getting touched? Where are the real pain points?

These aren't vanity metrics. They're the feedback loop that tells you whether what you're doing is working.

[Chart: Deviation category distribution and deviation + rule trend over time]


The part that matters: rules with evidence

Anyone using AI long enough has a collection of "rules" for their project — things you've told it, conventions you've
documented, patterns you've reinforced. But do they work? Which ones are actually changing AI behavior, and which ones are
just words in a file?

With observation data, you can answer that. Add a rule, watch the deviation rate in that category over the next few runs.
That's not "I think it's better" — that's a data-supported conclusion.
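That before/after check is simple enough to sketch. Assuming a log of deviation records with `category` and `timestamp` fields (hypothetical shape, not AIDA's actual schema), the measurement is just a split on when the rule was added:

```typescript
// Count deviations in one category before vs. after a rule was added.
// The record shape is hypothetical, not AIDA's actual schema.
interface Deviation {
  category: string;   // e.g. "layout", "components", "api-patterns"
  timestamp: number;  // epoch ms (simplified here to small ints)
}

function deviationRateChange(
  log: Deviation[],
  category: string,
  ruleAddedAt: number
): { before: number; after: number } {
  const inCategory = log.filter((d) => d.category === category);
  return {
    before: inCategory.filter((d) => d.timestamp < ruleAddedAt).length,
    after: inCategory.filter((d) => d.timestamp >= ruleAddedAt).length,
  };
}

// Example: three layout deviations before the rule landed, one after.
const log: Deviation[] = [
  { category: "layout", timestamp: 1 },
  { category: "layout", timestamp: 2 },
  { category: "layout", timestamp: 3 },
  { category: "components", timestamp: 4 },
  { category: "layout", timestamp: 6 },
];
const change = deviationRateChange(log, "layout", 5);
// change.before === 3, change.after === 1
```

Raw counts are the simplest version; normalizing by tasks per period would be fairer if your session volume varies, but the principle is the same.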

The rules system in AIDA reflects this. Rules are sedimented from observed deviations, stored in rules.json, and
auto-compiled into .md files AI reads every session. When a rule stops being relevant, you deprecate it. The data tells you
when.

  aida rules build    # compile rules.json → .md views AI reads
  aida rules dedupe   # find overlapping rules (>40% keyword similarity)
  aida rules merge    # resolve branch conflicts by fingerprint union
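The ">40% keyword similarity" threshold suggests something like keyword-set overlap. A rough sketch of that idea (my guess at the approach — a Jaccard index over word sets — not AIDA's actual dedupe logic):

```typescript
// Keyword-overlap similarity between two rule texts (Jaccard index).
// This is a guess at what ">40% keyword similarity" could mean,
// not AIDA's actual implementation.
function keywordSimilarity(a: string, b: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((w) => w.length > 2));
  const wa = words(a);
  const wb = words(b);
  const shared = [...wa].filter((w) => wb.has(w)).length;
  const union = new Set([...wa, ...wb]).size;
  return union === 0 ? 0 : shared / union;
}

const sim = keywordSimilarity(
  "Always use the shared Button component for actions",
  "Use the shared Button component, never raw <button>"
);
// sim above 0.4 would make these two rules candidates for merging
```

Whatever the exact metric, the point is the same: overlapping rules dilute each other, and flagging them automatically beats rereading the whole file.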


What I learned after running this on a real project

The deviation categories that kept showing up for me: component usage, layout conventions, API patterns. Not hallucination —
mostly rule-missing. AI wasn't confused about how to code; it just didn't know my project's specific conventions.

Once I had that data, I knew exactly where to focus. I sedimented rules from the observed deviations, and after that, those specific categories dropped significantly. Not because I believed harder in the rules, but because the next run's data showed fewer deviations in those areas.

That's the loop: observe → identify → add rule → measure → repeat.

No talent required. Just iteration.


Try it

  npx ai-dev-analytics dashboard

Opens a local dashboard with anonymized demo data so you can see what it looks like before connecting your own project.

GitHub: https://github.com/LWTlong/ai-dev-analytics

I'm curious whether others have been tracking this kind of thing manually — and what patterns you've found in where AI actually fails on your projects.
