ihsan_kutluk

Posted on Jun 22

AI Writes the Code. But Who Checks It?

#ai #llm #softwaredevelopment #testing

Hunting down three critical bugs in a real optimization project showed us the biggest blind spot in modern software development — and pointed us toward a fix.

June 2026 · stochastic-VRP-decision-focused-learning · github/spec-kit

Picture this: you hand a complex optimization problem to an AI. Within hours, 17 files are pushed. Tables, charts, numbers — everything looks polished and complete.

But what if three critical bugs are hiding inside? And none of them were caught by code review, a linter, or any architectural rule?

This is exactly that story. More importantly: these bugs revealed a problem modern software development still hasn't solved. And we're proposing something to fix it.

What the project actually does

We're solving a vehicle routing problem for a Cash-in-Transit company. Twenty ATMs, a handful of vehicles, each ATM with different cash demand on different days. The classical approach plans routes around average demand, and it fails badly: 42% of ATMs run out of cash, costing serious money.

The project tackles this in two layers: a deterministic baseline (CVRP), then a decision-focused machine learning model (SPO+) that learns to account for demand uncertainty. The result: the stockout rate drops from 42% to 25% — nearly matching a theoretical oracle with perfect information.


📉 Stockout rate (classical)	42%
📉 Stockout rate (SPO+)	25%
💰 Cost reduction	51%

Strong numbers. But the real story starts here.

The quantum experiment: good idea, misleading results

Vehicle routing belongs to the class of combinatorial optimization problems that quantum computing theoretically targets. So we asked a natural question: "What happens if we use Q# and QAOA here?"

We fed the question to an AI. Hours later, a complete-looking implementation arrived:

# QAOA Comparison Table — First Version

Method          <H> Energy   Constraints
PuLP/MILP       -2.5599      ✅ Satisfied
QAOA p=1        -2.5599      ⚠️  Violation
QAOA p=3        -2.5599      ⚠️  Violation

Wait. QAOA is violating constraints — but returning the exact same energy as MILP? And why do p=1 and p=3 (different circuit depths) agree to four decimal places?

This is mathematically inconsistent. A solution that violates constraints should score worse — that's the whole point of penalty terms. Something was wrong. We dug in.

Three bugs, three different faces

Bug 1 — The penalty weight was invisibly small

# phase2c_qaoa_simulator.py, line 56

# Before: penalty weight far too small
LAMBDA_C = 0.5

# Load: 330k TL, capacity limit: 250k TL
# Violation penalty: 0.5 × (1.32 - 1.0)² ≈ 0.051 units
# Route cost range: 1.2 – 3.5 units
# Penalty is only ~1.5% of the maximum route cost, so QAOA effectively cannot see it

# After: 40 × (1.32 - 1.0)² ≈ 4.1 units, above the maximum route cost (3.5)
LAMBDA_C = 40.0

QAOA was violating the capacity constraint but couldn't "feel" it, because a violating solution cost almost the same as a feasible one. The penalty term had dissolved into the cost function.
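The arithmetic in those comments can be checked directly. This is a minimal sketch, assuming the penalty has the quadratic form lambda × (load / capacity − 1)² implied by the comment; `capacity_penalty` and `MAX_ROUTE_COST` are illustrative names, not taken from the project code.

```python
def capacity_penalty(load, capacity, lam):
    """Quadratic penalty for exceeding capacity; zero when within the limit."""
    overload = max(0.0, load / capacity - 1.0)
    return lam * overload ** 2

MAX_ROUTE_COST = 3.5                 # upper end of the quoted route cost range
LOAD, CAPACITY = 330_000, 250_000    # TL, values quoted in the post

for lam in (0.5, 40.0):
    penalty = capacity_penalty(LOAD, CAPACITY, lam)
    visible = penalty > MAX_ROUTE_COST
    print(f"lambda={lam}: penalty={penalty:.3f}, exceeds max route cost: {visible}")
```

One caveat worth keeping in mind under this assumed form: because the penalty is quadratic, a small overload still yields a small penalty even at lambda = 40 (a 10% overload gives 40 × 0.1² = 0.4), so the weight only dominates route cost for violations of roughly this size or larger.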

Bug 2 — MILP contradicted itself but silently called it "Optimal"

The MILP model had two constraints: "visit all ATMs" and "don't exceed vehicle capacity." But total demand (330k TL) already exceeded the capacity limit (250k TL) — meaning no feasible solution could satisfy both simultaneously.

PuLP/CBC returned "Optimal." Silently. No warning. The numbers appeared in the table. Nobody questioned them.
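A guard against this kind of silent contradiction does not need to trust the solver's status string at all. The sketch below is solver-independent and purely illustrative: the function names, the single-vehicle setup, and the individual demand values (chosen only so that they sum to 330) are assumptions, not the project's actual data.

```python
def check_capacity_feasible(demands, capacity, n_vehicles=1):
    """Pre-solve guard: total demand must fit into total fleet capacity.

    This is a necessary condition only; passing it does not prove feasibility.
    """
    total = sum(demands)
    limit = capacity * n_vehicles
    if total > limit:
        raise ValueError(
            f"Infeasible by construction: demand {total} > capacity {limit}"
        )

def verify_routes(routes, demands, capacity):
    """Post-solve guard: return a list of violated constraints (empty = OK)."""
    problems = []
    visited = {stop for route in routes for stop in route}
    missing = set(range(len(demands))) - visited
    if missing:
        problems.append(f"unvisited ATMs: {sorted(missing)}")
    for i, route in enumerate(routes):
        load = sum(demands[stop] for stop in route)
        if load > capacity:
            problems.append(f"route {i} load {load} > capacity {capacity}")
    return problems

# Hypothetical demands in thousands of TL, summing to 330
demands = [60, 70, 80, 50, 70]
print(verify_routes([[0, 1, 2, 3, 4]], demands, capacity=250))
```

Running `verify_routes` on whatever the solver returns, before the numbers reach a table, turns "Optimal" from a label into a claim that has been checked.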

Bug 3 — The wrong metric was being measured

p=1 and p=3 showed identical results because the comparison code measured the argmax-bit solution instead of QAOA's expectation energy. For every value of p, argmax selected [1,1,1,1,1] — yielding the same number every time. The circuits were actually running differently. The measurement just couldn't tell.
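The difference between the two metrics is easy to reproduce on a toy example. In this sketch the energy function and both probability distributions are made up for illustration; the point is only that two distributions can share the same most likely bitstring while having different expectation energies.

```python
def expectation_energy(probs, energy):
    """<H>: probability-weighted average energy over all bitstrings."""
    return sum(p * energy(bits) for bits, p in probs.items())

def argmax_energy(probs, energy):
    """Energy of the single most probable bitstring only."""
    best = max(probs, key=probs.get)
    return energy(best)

def toy_energy(bits):
    # Hypothetical energy: each 1-bit lowers the energy by one unit
    return -float(sum(bits))

# Two hypothetical output distributions with the same most likely bitstring
shallow = {(1, 1, 1): 0.4, (1, 1, 0): 0.3, (0, 0, 0): 0.3}
deep = {(1, 1, 1): 0.7, (1, 1, 0): 0.2, (0, 0, 0): 0.1}

for name, dist in (("shallow", shallow), ("deep", deep)):
    print(name, argmax_energy(dist, toy_energy), expectation_energy(dist, toy_energy))
```

Both distributions report the same argmax energy, while the expectation energy separates them. A comparison table built on the argmax value alone cannot distinguish one circuit from the other.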

Corrected results

Method	`<H>` Energy	Bit Energy	Gap	Constraints
Brute Force	—	−34.856	0%	✅
MILP (fixed)	—	−34.856	0%	✅
QAOA p=1	−32.686	−34.047	2.3%	✅
QAOA p=2	−31.689	−33.622	3.5%	✅
QAOA p=3	−32.815	−34.047	2.3%	✅

The corrected picture is actually more convincing: QAOA works, circuit depth genuinely matters, but even at this small scale it trails optimal by 2–4%. That's consistent with the literature and physically expected behavior.
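The Gap column appears consistent with the relative distance of each bit energy from the brute-force optimum. That formula is an inference from the numbers in the table, not something stated in the project, so treat this as a sanity check rather than the project's definition.

```python
def optimality_gap(energy, optimum):
    """Relative distance from the optimum as a fraction (0.0 = optimal)."""
    return (energy - optimum) / abs(optimum)

OPTIMUM = -34.856  # brute-force / fixed-MILP bit energy from the table
bit_energies = {"QAOA p=1": -34.047, "QAOA p=2": -33.622, "QAOA p=3": -34.047}

for method, energy in bit_energies.items():
    print(f"{method}: gap = {optimality_gap(energy, OPTIMUM):.1%}")
```

Under that assumption the computed gaps round to 2.3%, 3.5%, and 2.3%, matching the table.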

Why didn't anyone catch this?

Look at what all three bugs have in common: the code was structurally correct. No syntax errors, sensible variable names, functions calling functions, charts rendering. No linter would flag any of this. No architecture rule would fire. In code review it would read as "looks good."

The problem was behavioral: penalty weight too small, constraints contradicting each other, wrong metric chosen. You can only catch these by running the code and comparing its actual behavior against what was intended.

Code looking correct isn't enough. It has to be compared against expected behavior.

And right now, as far as we can tell, no automated tool does this in a spec-driven workflow.

The Golden Demo idea

The Spec-Driven Development world already has strong tools: Architecture Guard enforces architectural rules, DocGuard audits documentation quality, Security Review scans for vulnerabilities. All valuable. But all static — they read code, they don't run it.

Golden Demo does something different:

During planning — from the acceptance criteria and examples written in the spec, generate a small, runnable reference implementation — the golden example. A deterministic, executable representation of intended behavior.

After implementation — run both the golden example and the real code against the same test vectors. Compare outputs. If there's a gap, generate a drift report.

At merge time — you know two things: does the code run? And does it do what was intended? These are the two questions most easily skipped right now.
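The three phases above can be sketched in a few lines. Nothing here is an existing spec-kit API: `drift_report`, the golden and shipped functions, and the test vectors are all hypothetical, and the golden example assumes a spec that pins the penalty weight at 40.

```python
import math

def drift_report(golden, actual, test_vectors, rel_tol=1e-6):
    """Run both implementations on the same inputs and collect mismatches."""
    drifts = []
    for args in test_vectors:
        expected, got = golden(*args), actual(*args)
        if not math.isclose(expected, got, rel_tol=rel_tol, abs_tol=1e-12):
            drifts.append({"input": args, "expected": expected, "actual": got})
    return drifts

# Hypothetical golden example generated from the spec during planning
def golden_penalty(load, capacity):
    return 40.0 * max(0.0, load / capacity - 1.0) ** 2

# Hypothetical implementation under test, carrying the original weight
def shipped_penalty(load, capacity):
    return 0.5 * max(0.0, load / capacity - 1.0) ** 2

vectors = [(200_000, 250_000), (250_000, 250_000), (330_000, 250_000)]
report = drift_report(golden_penalty, shipped_penalty, vectors)
for entry in report:
    print(entry)
```

Only the over-capacity vector exposes the drift; the two feasible inputs produce a penalty of zero in both versions. That suggests the test vectors matter as much as the golden example itself, and that they should include inputs that violate each constraint.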

How would Golden Demo have handled the three bugs in this project?

Bug 1: If the spec had said "penalty term must exceed maximum route cost," that becomes a test vector. With lambda=0.5, it fails immediately at merge.
Bug 2: "A solution returned as Optimal must satisfy all constraints" would automatically verify the solver's output — no silent infeasibility.
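Bug 3: "Expectation energy must change when circuit depth changes" would fail the test the moment both p values returned the same number. A stricter rule such as "energy must improve with depth" would be too strong: in the corrected results, p=2 scores worse than p=1.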

None of these require a human to think of checking them on the day. They're encoded once, in the spec, and verified automatically every time.

One more thing: the quantum question

A fair question worth addressing directly: "If a quantum computer tries all solutions at once, why can't it just win?"

The answer lives in the difference between superposition and entanglement.

Superposition means a qubit carries both 0 and 1 simultaneously before measurement, which in theory enables parallel search across the entire solution space. Entanglement is different: it correlates qubits so that measuring one instantly tells you something about the others. That's not independent randomness; it's a strong dependency. In optimization, the real magic is superposition. Entanglement supports it.

The problem is measurement: superposition collapses to a single classical result. QAOA's job is to amplify the probability of the right answer through interference — the way waves reinforce or cancel each other. On today's NISQ hardware, that interference control is too noisy to work reliably.
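Interference can be shown on a single qubit without any quantum library. This is a toy illustration with real-valued amplitudes, not a QAOA circuit: applying a Hadamard gate once creates an equal superposition, and applying it again makes the two paths to |1> cancel while the paths to |0> reinforce.

```python
import math

def hadamard(state):
    """Apply a Hadamard gate to a single-qubit state (amp0, amp1)."""
    a0, a1 = state
    s = 1 / math.sqrt(2)
    return (s * (a0 + a1), s * (a0 - a1))

def probabilities(state):
    """Measurement probabilities are the squared amplitude magnitudes."""
    return tuple(abs(a) ** 2 for a in state)

zero = (1.0, 0.0)
once = hadamard(zero)    # equal superposition of |0> and |1>
twice = hadamard(once)   # |1> amplitudes cancel, |0> amplitudes add up

print(probabilities(once))
print(probabilities(twice))
```

QAOA tries to arrange this kind of cancellation and reinforcement across many qubits so that good solutions end up with high probability, which is why noise in the amplitudes matters so much.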

Our 20-ATM problem needs roughly 1.2 million physical qubits. The best current hardware sits somewhere between 1,000 and 10,000. No practical advantage exists at this scale until fault-tolerant hardware matures — likely 10 to 15 years out.

That doesn't mean "quantum is useless here." The resource estimation analysis in this project shows that as problem size grows (20 → 50 → 100 → 200 ATMs), a threshold emerges where classical MILP struggles and quantum could theoretically contribute. Knowing that the threshold can't be reached today isn't a reason to stop — it's a reason to prepare for the right moment.

What we're doing now

This project became a live case study for why the Golden Demo + Behavioral Drift extension we're proposing for the GitHub Spec Kit ecosystem needs to exist. Not an abstract idea — a demonstrated need, with three real bugs as evidence.

We've posted an RFC in the spec-kit Discussions. The v1 scope is deliberately narrow: pure functions only, explicit input/output relationships, no side effects. We kept it narrow because the real risk with any new validation tool is generating noisy false positives and having everyone disable it on day two.

If this problem feels familiar — approving AI-generated code because it looks right, then discovering a behavioral bug weeks later — we'd love your input on the RFC.

Links