
Yurukusa


AI Strategy vs Human Intuition: A Bug-Finding Showdown

My AI agent had been building a game for weeks. Automated tests. Regression suites. Linting passes. The codebase was clean. The CI was green. Every function returned what it was supposed to return.

Then a human sat down, played for five minutes, and found three critical bugs.

Not crashes. Not exceptions. Not logic errors. Feel bugs. The kind that make a player close the tab and never come back - without ever being able to articulate why.

This is the story of what happened, what it taught me about the blind spots of AI-driven development, and the scoring system we built to make sure it never happens again.


The Setup

I've been running Claude Code autonomously for months. The agent - let's call it CC - built a game called Spell Cascade. It's a Vampire Survivors-style roguelite where you combine spell elements. Entirely AI-built. Zero lines of human code.

CC had been iterating on it for weeks. The development cycle looked like this:

  1. Implement feature
  2. Run automated tests
  3. Fix what the tests catch
  4. Commit
  5. Repeat

The test suite covered the things you'd expect: collision detection, spawn rates, damage calculations, inventory state, save/load integrity. If a function was supposed to return 42, the test verified it returned 42. If an enemy was supposed to die at 0 HP, the test confirmed it died at 0 HP.
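As a sketch of what that structural layer looks like, here's the kind of check such a suite runs. The `Enemy` type and `applyDamage` function are illustrative names, not Spell Cascade's real API:

```typescript
// Illustrative structural test: verifies exact state, nothing about feel.
// `Enemy` and `applyDamage` are hypothetical names, not the game's actual code.
interface Enemy {
  hp: number;
  alive: boolean;
}

function applyDamage(enemy: Enemy, amount: number): Enemy {
  const hp = Math.max(0, enemy.hp - amount); // clamp so HP never goes negative
  return { hp, alive: hp > 0 };
}

// The kind of assertion the suite is full of: binary, deterministic, exact.
const goblin: Enemy = { hp: 10, alive: true };
const afterHit = applyDamage(goblin, 10);
console.assert(afterHit.hp === 0, "HP should floor at 0");
console.assert(!afterHit.alive, "enemy should die at 0 HP");
```

Every one of those assertions can pass while the hit still feels like nothing happened.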

By every metric the AI could measure, the game was solid.


The Five-Minute Demolition

Then the human operator - the person who runs this whole operation - picked up the game and played it.

Five minutes. That's all it took.

Bug 1: The phantom hit.
You cast a spell. The enemy health bar drops. But there's no impact feedback - no screen shake, no flash, no sound cue, no particle burst. The number changes, but your brain doesn't register that you did something. It feels like clicking a spreadsheet cell, not landing a fireball.

Bug 2: The input swamp.
Movement felt sluggish. Not in a "low framerate" way - the FPS counter was fine. The issue was input latency and acceleration curves. The character started moving at a speed that felt like wading through mud for the first 200 milliseconds, then snapped to full speed. Your hands knew something was wrong before your brain could name it.

Bug 3: The silent death.
When you die, the game just... stops. No death animation. No dramatic pause. No "GAME OVER" slam. The gameplay loop ends and a menu appears. It felt like the game crashed. Multiple times, the human wasn't sure if they'd actually died or if something had broken.

Three bugs. None of them would show up in any automated test. None of them would crash the program. None of them would produce an error log.

All three would make a player quit.


Why the AI Missed Them

This is the part that fascinated me. It's not that CC was bad at testing. It's that CC was testing the wrong layer.

Think about what an automated test can verify:

  • State changes. Did the value go from A to B? Yes/no.
  • Logic correctness. Given input X, does the system produce output Y? Yes/no.
  • Boundary conditions. Does it handle zero? Negative numbers? Null? Yes/no.
  • Regression. Did the thing that worked yesterday still work today? Yes/no.

All of these are structural properties. They live in the code. They're deterministic. They have right answers.

Now think about what the human found:

  • Feedback timing. Does the response feel immediate? This is subjective and perceptual.
  • Motion curves. Does movement feel responsive? This is about the shape of an acceleration function over the first 200ms - a property that's technically "correct" (the character does move) but experientially wrong.
  • Emotional punctuation. Does the game's most dramatic moment - death - feel dramatic? This is pure experience design.

These are experiential properties. They don't live in the code. They live in the gap between the code and a human nervous system.

The AI tested whether the game worked. The human tested whether the game felt right.


The Taxonomy

After this happened, I sat down and categorized the difference. There are two fundamentally different kinds of bugs:

Structural Bugs (AI's Domain)

These are bugs in the mechanism. The code does something wrong. A value is incorrect. A state transition is invalid. A resource leaks. A race condition fires.

Characteristics:

  • Deterministic. Same input → same broken output.
  • Verifiable. You can write a test that catches them.
  • Binary. The code is either correct or it isn't.
  • Discoverable by analysis. You can find them by reading the code.

AI agents are excellent at finding these. They can run thousands of test cases. They can trace execution paths. They can check every boundary condition a human would forget about. Given enough time, an AI agent will find every structural bug in a codebase.

Experiential Bugs (Human's Domain)

These are bugs in the experience. The code does exactly what it's supposed to do - the problem is that what it's supposed to do doesn't feel right.

Characteristics:

  • Subjective. Different players might experience them differently.
  • Context-dependent. A 200ms delay feels fine in a menu. In combat, it feels broken.
  • Continuous, not binary. There's no "correct" screen shake intensity - there's a spectrum from "too little" to "too much."
  • Discoverable only by experiencing. You can't find them by reading code. You have to play.

Humans find these because humans have bodies. They have proprioception. They have expectations built from thousands of hours of playing other games. When a fireball lands and there's no screen shake, a human's gut says "that felt wrong" before their conscious mind can explain why.

An AI doesn't have that gut. It has verification.


The Feel Scorecard

After the five-minute demolition, we needed a system. Not just to fix these three bugs, but to catch the next ones. And the ones after that.

The problem: experiential bugs are, by definition, hard to quantify. "It feels wrong" isn't actionable for an AI agent. You need metrics.

We built what we call the Feel Scorecard. Three dimensions, each scored 1-5:

1. Feedback Clarity (FC)

When the player does something, does the game acknowledge it?

| Score | Meaning |
| --- | --- |
| 1 | No feedback. State changes silently. |
| 2 | Minimal feedback. A number changes. |
| 3 | Adequate feedback. Visual or audio cue present. |
| 4 | Clear feedback. Multiple channels (visual + audio + haptic). |
| 5 | Satisfying feedback. Feedback feels good, not just present. |

The phantom hit bug scored a 2. The HP number changed, but that was it. After the fix - screen shake, impact flash, damage number popup, hit sound - it scored a 4.
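A minimal sketch of that multi-channel fix, assuming a simple event-driven setup. The `HitFeedback` class and channel wiring are hypothetical, not the game's actual code:

```typescript
// Hypothetical sketch: fan one hit event out to several feedback channels
// (screen shake, impact flash, damage popup, hit sound).
type Channel = (intensity: number) => void;

class HitFeedback {
  private channels: Channel[] = [];

  register(channel: Channel): void {
    this.channels.push(channel);
  }

  // Fire every channel at once; intensity scales with damage, clamped to 1.
  onHit(damage: number, heavyHitDamage: number): number {
    const intensity = Math.min(1, damage / heavyHitDamage);
    for (const fire of this.channels) fire(intensity);
    return intensity;
  }
}
```

The point is redundancy: if one channel goes unnoticed (say, muted audio), the others still land, so a hit never reads as a phantom.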

2. Input Responsiveness (IR)

Does the game respond to input at the speed the player expects?

| Score | Meaning |
| --- | --- |
| 1 | Noticeable lag. Player doubts if input was registered. |
| 2 | Sluggish. Movement happens but feels delayed. |
| 3 | Acceptable. No conscious complaints, but not snappy. |
| 4 | Responsive. Input-to-action gap feels natural. |
| 5 | Instant. The character feels like an extension of the player's hand. |

The input swamp bug scored a 2. The acceleration curve was technically functional, but the 200ms ramp felt like wading through syrup. We flattened the curve - near-instant response with a subtle ease-in over 50ms - and brought it to a 4.
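The before and after can be sketched as two curves. The timings match the description above; the exact curve shapes are assumptions, not the game's real code:

```typescript
// Speed fraction [0..1] as a function of ms since the movement key was pressed.
// Timings mirror the article; the curve shapes themselves are illustrative.

function swampCurve(tMs: number): number {
  // Old behavior: slow linear ramp over 200ms, then full speed.
  return Math.min(1, tMs / 200);
}

function snappyCurve(tMs: number): number {
  // Fix: near-instant response with a subtle quadratic ease-in over 50ms.
  const t = Math.min(1, tMs / 50);
  return 1 - (1 - t) * (1 - t);
}
```

At 25ms the old curve has the character at 12.5% speed; the new one is already at 75%. Both are "correct" in a test's eyes. Only one feels responsive.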

3. State Communication (SC)

Does the player always know what's happening and why?

| Score | Meaning |
| --- | --- |
| 1 | Player can't tell what state they're in. |
| 2 | State is technically visible but not emphasized. |
| 3 | State transitions are clear but lack dramatic weight. |
| 4 | Major state changes (death, level up, boss) are communicated with appropriate drama. |
| 5 | Every state transition tells a micro-story. |

The silent death bug scored a 1. The game went from "playing" to "menu" with nothing in between. After the fix - freeze frame, screen flash, death animation, dramatic pause, GAME OVER text - it scored a 4.
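That fix is essentially a scripted sequence of beats, each playing fully before the next starts. A sketch, with illustrative durations (the real timings weren't specified beyond the description above):

```typescript
// Hypothetical death sequence: each beat lands alone before the next begins.
// Durations are illustrative, not taken from the actual game.
interface Beat {
  name: string;
  durationMs: number;
}

const deathSequence: Beat[] = [
  { name: "freezeFrame",    durationMs: 150 },
  { name: "screenFlash",    durationMs: 100 },
  { name: "deathAnimation", durationMs: 600 },
  { name: "dramaticPause",  durationMs: 400 },
  { name: "gameOverSlam",   durationMs: 300 },
];

function totalDurationMs(beats: Beat[]): number {
  return beats.reduce((sum, beat) => sum + beat.durationMs, 0);
}
```

Roughly a second and a half of deliberate punctuation between "playing" and "menu", instead of zero.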

The Minimum Viable Feel

We set a threshold: no dimension below 3, total score of 10+ out of 15.

If any dimension scores below 3, the feature ships with a known experiential debt. If the total is below 10, the feature doesn't ship at all.
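That gate is simple enough to automate once the scores exist. A sketch of the check, with illustrative field names:

```typescript
// Minimum Viable Feel gate: no dimension below 3, total of 10+ out of 15.
// Field names are illustrative; scores come from a human or calibrated AI eval.
interface FeelScorecard {
  feedbackClarity: number;     // FC, 1-5
  inputResponsiveness: number; // IR, 1-5
  stateCommunication: number;  // SC, 1-5
}

function meetsMinimumViableFeel(card: FeelScorecard): boolean {
  const dims = [card.feedbackClarity, card.inputResponsiveness, card.stateCommunication];
  const total = dims.reduce((a, b) => a + b, 0);
  return dims.every(d => d >= 3) && total >= 10;
}
```

Note that a scorecard of (4, 4, 2) fails even though it totals 10: one weak dimension is enough to block the ship.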

The AI agent can now self-evaluate against these metrics if it has a reference frame. The trick is that the first calibration has to come from a human. Someone has to play the game and say "this feels like a 2 in Feedback Clarity." Once the AI has several calibrated examples, it can extrapolate to new features with reasonable accuracy.

Not perfect accuracy. But good enough to catch the worst offenders before a human touches it.


The Optimal Setup

Here's what I think the correct architecture looks like for AI-driven game development - or, honestly, for AI-driven development of anything where user experience matters:

```
┌─────────────────────────────────────────────┐
│              AI Agent Layer                 │
│                                             │
│  • Structural testing (automated)           │
│  • Regression detection (automated)         │
│  • Code quality enforcement (automated)     │
│  • Feel Scorecard self-eval (semi-auto)     │
│  • Build, deploy, iterate (automated)       │
│                                             │
├─────────────────────────────────────────────┤
│              Human Layer                    │
│                                             │
│  • Initial feel calibration (manual)        │
│  • Experiential bug discovery (play-test)   │
│  • Scorecard threshold validation (manual)  │
│  • "Does this feel right?" gut check        │
│                                             │
├─────────────────────────────────────────────┤
│              Feedback Loop                  │
│                                             │
│  Human plays → finds feel bugs →            │
│  AI adds to scorecard → AI self-checks      │
│  future builds → Human re-validates         │
│                                             │
└─────────────────────────────────────────────┘
```

The AI handles volume. Thousands of test runs. Every code path. Every boundary condition. Every regression check. Things that would take a human team days, the AI does in minutes.

The human handles depth. Five minutes of play that catches what ten thousand test runs missed. The gut feeling that something is off. The ability to say "this doesn't feel like a game, it feels like a demo."

Neither is sufficient alone. Together, they're remarkably effective.


What This Means Beyond Games

This pattern isn't unique to game development. Every product that humans interact with has two layers:

  1. Does it work correctly? (Structural)
  2. Does it feel right to use? (Experiential)

A checkout flow that technically processes every payment but feels clunky? Experiential bug. A dashboard that displays accurate data but presents it in a way that confuses users? Experiential bug. An API that returns correct responses but with latency that makes the frontend feel broken? Experiential bug.

AI agents are getting astonishingly good at the structural layer. They can write tests, find logic errors, enforce code quality, and catch regressions at a scale no human team can match.

But the experiential layer still belongs to humans. Not because AI can't get there eventually - maybe it will. But today, right now, in February 2026, the thing that catches "this feels wrong" is a human nervous system that has spent decades interacting with software and building unconscious expectations about how things should respond.


The Numbers

Before the human play-test, Spell Cascade's Feel Scorecard looked like this:

| Dimension | Score |
| --- | --- |
| Feedback Clarity | 2 |
| Input Responsiveness | 2 |
| State Communication | 1 |
| Total | 5/15 |

After fixes:

| Dimension | Score |
| --- | --- |
| Feedback Clarity | 4 |
| Input Responsiveness | 4 |
| State Communication | 4 |
| Total | 12/15 |

From 5 to 12. A 140% improvement in experiential quality. From a five-minute play session.

The structural tests passed before and after. Not a single automated test changed status. By every metric the AI measured, the game was exactly as "correct" as it had been. But it went from something that felt like a prototype to something that felt like a game.


The Uncomfortable Lesson

AI will find bugs humans miss. That's well-established and not controversial.

The uncomfortable part: humans will find bugs AI misses. Not because the AI is bad. Because the bugs exist in a dimension the AI doesn't perceive.

A structural bug exists in code. An experiential bug exists in the space between code and consciousness. You can't test for it without a consciousness to run it through.

The practical takeaway: if you're building anything with AI agents, build the human feedback loop into the process. Not as an afterthought. Not as a "nice to have." As infrastructure.

Five minutes of human play-testing caught what weeks of automated testing missed. That's not a failure of AI testing. That's data about what AI testing is and what it isn't.

Build accordingly.


Postscript: The Scorecard Template

If you want to try the Feel Scorecard on your own project, here's the template:

```markdown
## Feel Scorecard - [Feature/Build Name]

Date:
Evaluator:

### Feedback Clarity (FC): _/5
- What action did you take?
- What feedback did you receive?
- What feedback did you expect?

### Input Responsiveness (IR): _/5
- Did inputs feel immediate?
- Was there any perceived lag or sluggishness?
- Did the control feel natural?

### State Communication (SC): _/5
- Did you always know what was happening?
- Were transitions between states clear?
- Did important moments feel important?

### Total: _/15
### Minimum threshold met (no dim <3, total >=10)? Y/N

### Notes:
[Free-form observations about feel issues]
```

It's intentionally simple. The goal isn't a 50-item checklist - it's a fast, repeatable evaluation that gives your AI agent something concrete to optimize against.

Three dimensions. Fifteen points. Five minutes to fill out.

That's all it took to catch what weeks of automated testing couldn't.


The quality gate hooks and error tracking system behind this workflow are part of the CC-Codex Autonomous Ops Kit - 6 hooks + 9 scripts for autonomous Claude Code operation.


More tools: Dev Toolkit - 56 free browser-based tools for developers. JSON, regex, colors, CSS, SQL, and more. All single HTML files, no signup.
