The Setup
A few weeks ago I built vibe-check — a CLI that scores how much of your codebase was written by AI, file by file, from 0 (human) to 100 (vibe-coded).
It works by detecting patterns that AI models reliably leave behind: over-commenting, generic naming conventions, hallucinated imports, repetitive structure, placeholder code.
I was happy with it. Ran it on a few projects, got plausible scores. Published a blog post. Got some stars on GitHub.
Then I ran it on a large full-stack web application — React frontend, Express middleware, FastAPI ML backend, multiple machine-learning models for data classification and real-time analytics.
It returned: 0/100. MOSTLY HUMAN.
The Numbers
Here's what vibe-check found across the full repo:
Scan path: /path/to/project
Files analyzed: 244 | Skipped: 11 | Errors: 0
╭───────────────────────────── Repository Summary ─────────────────────────────╮
│ Repo Vibe Score 0/100 — MOSTLY HUMAN │
│ Average Score 2 │
│ Highest Score 13 │
│ Lowest Score 0 │
│ High Risk Files (>=60) 0 │
│ Medium Risk Files (40-59) 0 │
╰──────────────────────────────────────────────────────────────────────────────╯
For contrast, here's commit-prophet — a CLI tool built entirely by an AI agent in a single session, zero human-authored lines:
Files analyzed: 9
╭───────────────────────────── Repository Summary ─────────────────────────────╮
│ Repo Vibe Score 2/100 — MOSTLY HUMAN │
│ Average Score 6.4 │
│ Highest Score 14 │
│ High Risk Files (>=60) 0 │
╰──────────────────────────────────────────────────────────────────────────────╯
A tool I know for a fact was 100% AI-generated scored 2/100.
Something is wrong. Let me break down exactly why.
Blind Spot #1: Polyglot Codebases
vibe-check analyzes Python files. That's it.
The project has:
- ~10,000 lines of Python (ML backend)
- ~7,000 lines of TypeScript
- ~14,000 lines of JavaScript
That means roughly 70% of the codebase is completely invisible to vibe-check. The React frontend — 17 pages, 49 UI components, custom hooks, TanStack Query integration — scored 0 on every single file. Not because it's human-written. Because vibe-check never looked at it.
Here's what a typical frontend file looks like — App.tsx, the routing layer:
import { Switch, Route } from "wouter";
import { QueryClientProvider } from "@tanstack/react-query";
import Dashboard from "@/pages/dashboard";
import Analytics from "@/pages/analytics";
import Settings from "@/pages/settings";
// ... 20 more page imports
That import block — 20+ named page components, perfectly organized, consistent naming convention — is a textbook AI signature. A human would have added them one by one over weeks, with inconsistent capitalization and a few typos along the way. The tool gave it 0 because TypeScript isn't Python.
The fix: vibe-check needs language-agnostic detectors — or at minimum, separate analyzers for JS/TS, Go, Rust.
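To make the "20+ imports in one block" pattern concrete, here's a rough sketch of what a JS/TS-aware detector could look like. Everything in it is illustrative, not vibe-check's actual code: the regex, the helper names, and the threshold of 15 are all assumptions, and a real implementation would work at the AST level rather than with regex.

```python
import re
from pathlib import Path

# Illustrative sketch: flag JS/TS files whose leading import block is
# unusually large and uniformly formatted -- the "20 imports added at
# once" pattern. A production detector would parse the AST instead.
IMPORT_RE = re.compile(r'^\s*import\s+.+?\s+from\s+["\'][^"\']+["\'];?\s*$')

def leading_import_run(source: str) -> int:
    """Count consecutive import lines at the top of a JS/TS file."""
    run = 0
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't break the run
        if IMPORT_RE.match(line):
            run += 1
        else:
            break
    return run

def flag_bulk_imports(path: Path, threshold: int = 15) -> bool:
    """True if the file opens with a suspiciously long, clean import block."""
    return leading_import_run(path.read_text(encoding="utf-8")) >= threshold
```

Even this crude version would have scored App.tsx above 0 instead of skipping it entirely.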
Blind Spot #2: Domain-Specific Vocabulary Defeats Generic Detectors
vibe-check's naming detector looks for "generic AI naming" — words like helper, manager, handler, process_data, get_result. The assumption is that AI tends to reach for bland, placeholder-sounding names when it doesn't have strong domain context.
But look at actual variable names in the ML backend:
confidence_weighted_score = weighted_avg(model_outputs, confidence_weights)
normalized_feature_vector = standardize(raw_features, per_channel=True)
inter_class_variance = between_class / within_class
calibrated_threshold = baseline_mean + (2.5 * baseline_std)
rolling_accuracy = ema(correct_predictions, window=50)
These are highly specific. calibrated_threshold, inter_class_variance, confidence_weighted_score — none of these match any "generic" pattern. They look exactly like code a domain expert would write.
Except they weren't. They emerged from an AI that had deeply internalized the technical literature — the right terminology, the right mathematical relationships, the right variable names for the problem space. The vocabulary is domain-correct, so the detector reads it as human.
The insight: AI doesn't always produce generic code. When given sufficient domain context — research papers, technical terminology, specific dataset names — it produces vocabulary that is more specific and precise than what a junior developer would write, not less.
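That insight can be turned into a measurable signal. Here's a minimal sketch of a generic-term ratio; the hardcoded term list mirrors the kind of names vibe-check's naming detector targets, but in practice the generic and domain vocabularies would come from a crawled corpus, and the function name is mine, not vibe-check's:

```python
import re

# Illustrative generic-term list; a real detector would derive this
# from a corpus rather than hardcode it.
GENERIC_TERMS = {"helper", "manager", "handler", "data", "result", "tmp", "process", "get"}

def generic_term_ratio(identifiers: list[str]) -> float:
    """Fraction of identifier tokens that come from the generic list."""
    tokens = [t for name in identifiers
              for t in re.split(r"[_\W]+", name.lower()) if t]
    if not tokens:
        return 0.0
    return sum(t in GENERIC_TERMS for t in tokens) / len(tokens)
```

Per the argument above, a ratio near zero across an entire file should raise suspicion rather than lower it: humans mix generic scratch names in with the domain vocabulary.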
Blind Spot #3: The Patterns That Actually Reveal AI Are Subtle
vibe-check looks for obvious surface signals. But the real AI fingerprints in a mature codebase are structural, not lexical:
Perfect docstring coverage. Every function in the ML backend has a docstring. Every single one. With Args, Returns, and Raises sections. Human engineers docstring the complex stuff and skip the obvious stuff. Uniform docstring coverage is a strong signal — but vibe-check doesn't check coverage uniformity, just whether comments exist.
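Measuring that per-file coverage is a few lines of stdlib ast. This is a sketch of my own, not vibe-check's current code; the uniformity signal would come from comparing this number across files, not from any single file:

```python
import ast

def docstring_coverage(source: str) -> float:
    """Fraction of function/class definitions in a module with a docstring."""
    tree = ast.parse(source)
    defs = [n for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
    if not defs:
        return 0.0
    return sum(ast.get_docstring(n) is not None for n in defs) / len(defs)
```

Run this over every file: a repo where every single file sits at or near 1.0 is exhibiting exactly the uniform coverage described above.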
Suspiciously comprehensive error handling. Every API endpoint handles the exact error cases you'd list in a prompt: ValueError, HTTPException with correct status codes, proper JSON error bodies. Real code usually has at least one except Exception as e: pass somewhere. Zero of them here.
Exact adherence to one style guide. No camelCase leaked into Python. No snake_case leaked into TypeScript. No inconsistency anywhere across 30,000 lines. Human codebases always have some inconsistency — a variable named userID next to user_name, an import sorted slightly wrong. Absolute consistency at scale is itself a signal.
The problem: these patterns require understanding the distribution of human inconsistency, not just checking for known-bad patterns. That's a much harder problem.
What a Better Detector Would Look Like
1. Cross-language support. TypeScript and JavaScript need to be first-class targets. AST-level analysis, not regex. Parse the import graph and check for the "20 imports added in one commit" pattern.
2. Consistency scoring. Measure variance in style, naming, comment density across files. High consistency = higher vibe score. Calculate the coefficient of variation for docstring length, comment ratio, function naming patterns.
3. Vocabulary specificity index. Build a corpus of domain-specific terminology by crawling GitHub repos in specific domains (finance, genomics, logistics). A file with 80% domain-specific terms and 0% generic terms is more suspicious, not less — because humans mix domain and generic terms. Experts still write tmp, data, result.
4. Commit-level analysis. The biggest tell isn't in the code — it's in the git history. Commits that touch 50 files at once, add 2,000 lines of tested, formatted, documented code in a single push, with zero fix-up commits afterward. Human development has a different tempo: small commits, then fix commits, then refactor commits.
5. Test-to-source ratio uniformity. AI-generated code tends to have tests that perfectly mirror the source structure, with identical coverage across all modules. Human code has coverage deserts — the boring utility functions have 95% coverage, the complex business logic has 40%.
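Of these, consistency scoring (item 2) is the most mechanical to prototype. Here's a minimal sketch using comment density as the style metric; the metric choice and function names are illustrative, and a real version would combine several metrics (docstring length, naming patterns) the same way:

```python
from statistics import mean, stdev

def comment_ratio(source: str) -> float:
    """Per-file style metric: fraction of non-blank lines that are comments."""
    lines = [l.strip() for l in source.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(l.startswith("#") for l in lines) / len(lines)

def consistency_cv(per_file_ratios: list[float]) -> float:
    """Coefficient of variation across files.

    Low CV means suspiciously uniform style; human repos drift more.
    """
    if len(per_file_ratios) < 2 or mean(per_file_ratios) == 0:
        return 0.0
    return stdev(per_file_ratios) / mean(per_file_ratios)
```

The scoring inverts the usual intuition: it's the repos with near-zero variance that get flagged.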
The Meta-Irony
I built vibe-check using Claude to help write the detectors and tests.
I then ran vibe-check on a project built with Claude and got 0/100.
The tool is measuring the wrong thing. It's measuring style markers of careless AI usage — the AI that doesn't know the domain, doesn't have context, produces generic boilerplate. It gives a clean bill of health to AI that does understand the domain and writes code that looks expert.
Which is, of course, the harder problem to solve. And the more important one.
What This Means for AI Detection Tools Generally
The broader lesson: AI detection works on distribution shift, and the distribution is shifting.
Early AI-generated code (GPT-3 era) had obvious tells: hallucinated imports, identical function names, # Step 1, # Step 2 comments, TODO: implement this. Those are easy to detect because they're anomalous relative to human code.
Modern AI-generated code, given good context, looks like the best code in the codebase. Perfect naming, comprehensive tests, thorough error handling. The distribution shift is now in the positive direction — the AI code looks better than average human code, not worse.
Detectors trained on "AI code looks sloppy" will miss all of it.
The Honest Score
If I had to estimate the real vibe percentage for this project:
- React TypeScript frontend: ~60-70% (scaffold, routing, component structure mostly AI-generated; business logic more human)
- FastAPI ML backend: ~40-50% (API endpoints and boilerplate AI-generated; core algorithms more human-guided)
- Data processing pipeline: ~30% (core logic is human-researched, AI-implemented)
- Model training scripts: ~20% (training design is human, AI transcribed)
Overall: probably 45-55% AI-assisted. vibe-check said 0.
That's a 45-55 point miss.
Where to Go From Here
I'm planning to rebuild vibe-check's detector suite around these findings:
- Language-agnostic AST analysis (Python + TypeScript + JavaScript)
- Style consistency measurement (variance, not just presence)
- Git history analysis (commit size, tempo, fix-up rate)
- Test coverage distribution analysis
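For the git history piece, the raw data is cheap to get from git log --numstat. A sketch of the parsing, with hypothetical function names and assuming the --pretty=format:@%H output shape:

```python
import subprocess

def parse_numstat(log_text: str) -> list[int]:
    """Lines added per commit, parsed from `git log --numstat --pretty=format:@%H`."""
    sizes: list[int] = []
    for line in log_text.splitlines():
        if line.startswith("@"):        # commit header: start a new bucket
            sizes.append(0)
        elif line.strip() and sizes:
            added, _deleted, _path = line.split("\t", 2)
            if added.isdigit():         # binary files report "-", skip them
                sizes[-1] += int(added)
    return sizes

def commit_sizes(repo: str) -> list[int]:
    """Lines added per commit for a local repo, newest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--pretty=format:@%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_numstat(out)
```

From there, a long tail of 2,000-line commits with no fix-up commits following them is the tempo signal: tooling for turning that distribution into a score is the open design question.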
If this is interesting to you, follow along: github.com/LakshmiSravyaVedantham/vibe-check
And if you've built an AI detection tool that actually handles these cases — I'd genuinely love to know how.
The uncomfortable truth: the better AI gets at writing domain-specific code, the more it looks like expertise rather than generation. Detection tools need to evolve faster than the models they're detecting.