DEV Community: ihsan_kutluk

AI Writes the Code. But Who Checks It?

ihsan_kutluk — Mon, 22 Jun 2026 11:46:47 +0000

Hunting down three critical bugs in a real optimization project showed us the biggest blind spot in modern software development — and pointed us toward a fix.

June 2026 · stochastic-VRP-decision-focused-learning · github/spec-kit

Picture this: you hand a complex optimization problem to an AI. Within hours, 17 files are pushed. Tables, charts, numbers — everything looks polished and complete.

But what if three critical bugs are hiding inside? And none of them were caught by code review, a linter, or any architectural rule?

This is exactly that story. More importantly: these bugs revealed a problem modern software development still hasn't solved. And we're proposing something to fix it.

What the project actually does

We're solving a vehicle routing problem for a Cash-in-Transit company. Twenty ATMs, a handful of vehicles, each ATM with different cash demand on different days. The classical approach uses averages — and it fails badly. 42% of ATMs run out of cash, costing serious money.

The project tackles this in two layers: a deterministic baseline (CVRP), then a decision-focused machine learning model (SPO+) that learns to account for demand uncertainty. The result: the stockout rate drops from 42% to 25% — nearly matching a theoretical oracle with perfect information.


📉 Stockout rate (classical)	42%
📉 Stockout rate (SPO+)	25%
💰 Cost reduction	51%

Strong numbers. But the real story starts here.

The quantum experiment: good idea, misleading results

Vehicle routing belongs to the class of combinatorial optimization problems that quantum computing theoretically targets. So we asked a natural question: "What happens if we use Q# and QAOA here?"

We fed the question to an AI. Hours later, a complete-looking implementation arrived:

# QAOA Comparison Table — First Version

Method          <H> Energy   Constraints
PuLP/MILP       -2.5599      ✅ Satisfied
QAOA p=1        -2.5599      ⚠️  Violation
QAOA p=3        -2.5599      ⚠️  Violation

Wait. QAOA is violating constraints — but returning the exact same energy as MILP? And why do p=1 and p=3 (different circuit depths) agree to four decimal places?

This is mathematically inconsistent. A solution that violates constraints should score worse — that's the whole point of penalty terms. Something was wrong. We dug in.

Three bugs, three different faces

Bug 1 — The penalty weight was invisibly small

# phase2c_qaoa_simulator.py, line 56

LAMBDA_C = 0.5  # ← Before: way too small

# Load: 330k TL, capacity limit: 250k TL
# Violation penalty: 0.5 × (1.32 - 1.0)² ≈ 0.051 units
# Route cost range: 1.2 – 3.5 units
# Penalty is ~1.5% of the maximum route cost (~4% of the minimum), so QAOA effectively cannot see it

LAMBDA_C = 40.0  # Fix: penalty is now 40 × 0.32² ≈ 4.1, above the max route cost (3.5)

QAOA was violating the capacity constraint but couldn't "feel" it, because a solution that broke the constraint cost almost the same as one that respected it. The penalty term had dissolved into the cost function.
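The arithmetic in the code comments above can be checked directly. This is a minimal sketch, assuming the penalty has the quadratic form `LAMBDA_C * (load / capacity - 1)²` that those comments imply. The function name `capacity_penalty` is hypothetical and not taken from the project.

```python
def capacity_penalty(load, capacity, lam):
    """Quadratic penalty on the relative capacity overshoot (0 when within capacity)."""
    overshoot = max(0.0, load / capacity - 1.0)
    return lam * overshoot ** 2

LOAD_TL = 330_000        # total load from the article
CAPACITY_TL = 250_000    # capacity limit from the article
MAX_ROUTE_COST = 3.5     # upper end of the route cost range

for lam in (0.5, 40.0):
    penalty = capacity_penalty(LOAD_TL, CAPACITY_TL, lam)
    share = penalty / MAX_ROUTE_COST
    print(f"lambda={lam}: penalty={penalty:.3f} ({share:.1%} of max route cost)")
```

With a weight of 0.5 the penalty comes out near 0.051, and with 40.0 it is about 4.1, which is the first value in this comparison that clears the maximum route cost.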

Bug 2 — MILP contradicted itself but silently called it "Optimal"

The MILP model had two constraints: "visit all ATMs" and "don't exceed vehicle capacity." But total demand (330k TL) already exceeded the capacity limit (250k TL) — meaning no feasible solution could satisfy both simultaneously.

PuLP/CBC returned "Optimal." Silently. No warning. The numbers appeared in the table. Nobody questioned them.
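One way to guard against a misleading status string is to re-check the returned solution against the constraints yourself. The sketch below is an assumption about how such a check could look, not code from the project: routes are represented as lists of ATM ids, demand is a plain dictionary in thousands of TL, and the five demand values are made up so that they sum to the 330k mentioned above.

```python
def verify_routes(routes, demand, capacity):
    """Return a list of constraint violations; an empty list means the solution is feasible."""
    violations = []
    visited = [atm for route in routes for atm in route]
    missing = set(demand) - set(visited)
    if missing:
        violations.append(f"unvisited ATMs: {sorted(missing)}")
    if len(visited) != len(set(visited)):
        violations.append("an ATM is visited more than once")
    for i, route in enumerate(routes):
        load = sum(demand[atm] for atm in route)
        if load > capacity:
            violations.append(f"route {i} carries {load} > capacity {capacity}")
    return violations

# Illustrative demand in thousands of TL (hypothetical values, total = 330).
DEMAND = {"A": 80, "B": 70, "C": 60, "D": 65, "E": 55}

# A single vehicle visiting everything cannot respect a 250k limit.
print(verify_routes([["A", "B", "C", "D", "E"]], DEMAND, capacity=250))
# Splitting the visits across two vehicles satisfies both constraints.
print(verify_routes([["A", "B", "C"], ["D", "E"]], DEMAND, capacity=250))
```

Running a check like this after every solve turns a silent contradiction into an explicit failure, whatever status the solver reports.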

Bug 3 — The wrong metric was being measured

p=1 and p=3 showed identical results because the comparison code measured the argmax-bit solution instead of QAOA's expectation energy. For every value of p, argmax selected [1,1,1,1,1] — yielding the same number every time. The circuits were actually running differently. The measurement just couldn't tell.
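The difference between the two metrics is easy to see on a toy example. The sketch below uses invented probability distributions and a made-up linear energy function (none of it comes from the project's simulator) to show how two circuits can share the same most likely bitstring while having different expectation energies.

```python
def energy(bits, costs):
    """Toy linear energy: sum of the costs of all bits set to 1."""
    return sum(c for b, c in zip(bits, costs) if b)

def expectation_energy(dist, costs):
    """<H>: probability-weighted average energy over every measured bitstring."""
    return sum(p * energy(bits, costs) for bits, p in dist.items())

def argmax_energy(dist, costs):
    """Energy of the single most probable bitstring, ignoring the rest of the distribution."""
    best = max(dist, key=dist.get)
    return energy(best, costs)

COSTS = (-1.0, -1.0, -0.5)  # hypothetical per-bit costs

# Two hypothetical measurement distributions with the same mode but different spread.
SHALLOW = {(1, 1, 1): 0.4, (1, 1, 0): 0.3, (0, 0, 0): 0.3}
DEEP = {(1, 1, 1): 0.7, (1, 1, 0): 0.2, (0, 0, 0): 0.1}

for name, dist in (("shallow", SHALLOW), ("deep", DEEP)):
    print(name, argmax_energy(dist, COSTS), round(expectation_energy(dist, COSTS), 3))
```

Both distributions give the same argmax energy, so a comparison built on that metric cannot tell them apart, while the expectation value separates them clearly.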

Corrected results

Method	`<H>` Energy	Bit Energy	Gap	Constraints
Brute Force	—	−34.856	0%	✅
MILP (fixed)	—	−34.856	0%	✅
QAOA p=1	−32.686	−34.047	2.3%	✅
QAOA p=2	−31.689	−33.622	3.5%	✅
QAOA p=3	−32.815	−34.047	2.3%	✅

The corrected picture is actually more convincing: QAOA works, circuit depth genuinely matters, but even at this small scale it trails optimal by 2–4%. That's consistent with the literature and physically expected behavior.

Why didn't anyone catch this?

Look at what all three bugs have in common: the code was structurally correct. No syntax errors, sensible variable names, functions calling functions, charts rendering. No linter would flag any of this. No architecture rule would fire. In code review it would read as "looks good."

The problem was behavioral: penalty weight too small, constraints contradicting each other, wrong metric chosen. You can only catch these by running the code and comparing its actual behavior against what was intended.

Code looking correct isn't enough. It has to be compared against expected behavior.

And right now, there's no automated tool that does this in a spec-driven workflow.

The Golden Demo idea

The Spec-Driven Development world already has strong tools: Architecture Guard enforces architectural rules, DocGuard audits documentation quality, Security Review scans for vulnerabilities. All valuable. But all static — they read code, they don't run it.

Golden Demo does something different:

During planning — from the acceptance criteria and examples written in the spec, generate a small, runnable reference implementation — the golden example. A deterministic, executable representation of intended behavior.

After implementation — run both the golden example and the real code against the same test vectors. Compare outputs. If there's a gap, generate a drift report.

At merge time — you know two things: does the code run? And does it do what was intended? These are the two questions most easily skipped right now.
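The three steps above can be sketched in a few lines. Everything here is hypothetical: the RFC does not define an API yet, so `drift_report`, `golden_penalty`, and `shipped_penalty` are illustrative names, and the golden example simply encodes a spec line along the lines of "the penalty must exceed the maximum route cost".

```python
import math

def drift_report(golden, candidate, test_vectors, rel_tol=1e-9):
    """Run both implementations on the same inputs and collect every mismatch."""
    drifts = []
    for args in test_vectors:
        expected, actual = golden(*args), candidate(*args)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-12):
            drifts.append({"input": args, "expected": expected, "actual": actual})
    return drifts

def golden_penalty(load, capacity):
    """Reference behavior generated from the spec (hypothetical)."""
    return 40.0 * max(0.0, load / capacity - 1.0) ** 2

def shipped_penalty(load, capacity):
    """Implementation under test, still carrying the too-small weight."""
    return 0.5 * max(0.0, load / capacity - 1.0) ** 2

# Shared test vectors: one load within capacity, one overloaded.
TEST_VECTORS = [(200_000, 250_000), (330_000, 250_000)]

report = drift_report(golden_penalty, shipped_penalty, TEST_VECTORS)
for row in report:
    print(row)
```

Only the overloaded vector shows up in the report, which is the point of the design: the two implementations agree wherever the bug is invisible and diverge exactly where the intended behavior is violated.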

How would Golden Demo have handled the three bugs in this project?

Bug 1: If the spec had said "penalty term must exceed maximum route cost," that becomes a test vector. With lambda=0.5, it fails immediately at merge.
Bug 2: "A solution returned as Optimal must satisfy all constraints" would automatically verify the solver's output — no silent infeasibility.
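Bug 3: "Expectation energy must change when circuit depth changes" would fail the test the moment both p values returned the same number. (A stricter "must improve with depth" rule would be too strong, since in the corrected results p=2 scores worse than p=1.)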

None of these require a human to think of checking them on the day. They're encoded once, in the spec, and verified automatically every time.

One more thing: the quantum question

A fair question worth addressing directly: "If a quantum computer tries all solutions at once, why can't it just win?"

The answer lives in the difference between superposition and entanglement.

Superposition means a qubit carries both 0 and 1 simultaneously before measurement — theoretically enabling parallel search across the entire solution space. Entanglement is different: it correlates qubits so that measuring one instantly tells you something about others. That's not variability; it's a strong dependency. In optimization, the real magic is superposition. Entanglement supports it.

The problem is measurement: superposition collapses to a single classical result. QAOA's job is to amplify the probability of the right answer through interference — the way waves reinforce or cancel each other. On today's NISQ hardware, that interference control is too noisy to work reliably.

Our 20-ATM problem needs roughly 1.2 million physical qubits. The best current hardware sits somewhere between 1,000 and 10,000. No practical advantage exists at this scale until fault-tolerant hardware matures — likely 10 to 15 years out.

That doesn't mean "quantum is useless here." The resource estimation analysis in this project shows that as problem size grows (20 → 50 → 100 → 200 ATMs), a threshold emerges where classical MILP struggles and quantum could theoretically contribute. Knowing that the threshold can't be reached today isn't a reason to stop — it's a reason to prepare for the right moment.

What we're doing now

This project became a live case study for why the Golden Demo + Behavioral Drift extension we're proposing for the GitHub Spec Kit ecosystem needs to exist. Not an abstract idea — a demonstrated need, with three real bugs as evidence.

We've posted an RFC in the spec-kit Discussions. The v1 scope is deliberately narrow: pure functions only, explicit input/output relationships, no side effects. The real risk with any new validation tool is generating noisy false positives and having everyone disable it on day two.

If this problem feels familiar — approving AI-generated code because it looks right, then discovering a behavioral bug weeks later — we'd love your input on the RFC.

Links

github.com/jasstt/stochastic-VRP-decision-focused-learning
Spec Kit Discussions · https://github.com/github/spec-kit/discussions

If I made a mistake, please mention it below :)

Stack: Python · Q# · PuLP · Microsoft QDK · Azure Quantum Resource Estimator

Why Dense Search Fails in Production RAG — And How Hybrid Search Fixes It

ihsan_kutluk — Sun, 07 Jun 2026 21:10:38 +0000

I built a RAG system following the standard tutorial approach — embed, store, retrieve by cosine similarity. It worked fine until I asked it a technical question and got back two completely unrelated chunks about feature engineering. That's when I started digging.

This article explains exactly why this happens — and how hybrid search with Reciprocal Rank Fusion (RRF) and an LLM reranker solves the problem. All results come from a real pipeline I built and tested.

The Problem — Dense Search Fails on Exact Keywords

Here's a concrete example. I asked my RAG system:

"What are the advantages of the Transformer architecture over traditional RNNs?"

With dense-only search (ChromaDB + all-MiniLM-L6-v2), the top 3 retrieved chunks were:

Rank	Chunk ID	Source	Relevant?
1	`chunk_4`	nlp_temelleri.txt	✅ Yes — Transformer & self-attention
2	`chunk_11`	veri_bilimi.txt	❌ No — MSE, MAE error metrics
3	`chunk_8`	veri_bilimi.txt	❌ No — Feature engineering

The model saw "model evaluation" and "Transformer model performance" as semantically close — because they are, in embedding space. But they're not what I was asking about. Dense search had no way to know that.

What is Hybrid Search?

Hybrid search combines two fundamentally different retrieval strategies:

Dense Retrieval (Semantic Search)

Uses neural embeddings (e.g., all-MiniLM-L6-v2)
Captures semantic meaning: "automobile" matches "car"
Great for paraphrase-style queries
Weak at: exact technical terms, proper nouns, version numbers

Sparse Retrieval (BM25)

A classic probabilistic keyword matching algorithm
Scores documents based on term frequency and inverse document frequency (TF-IDF family)
Great at: exact keyword matching ("Transformer", "RNN", "CUDA")
Weak at: synonyms and semantic variations

Neither is perfect alone. Together, they cover each other's blind spots. A query like "Transformer architecture vs RNN" benefits from BM25 catching the exact term "Transformer" while dense search handles the conceptual framing.
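To make the sparse side concrete, here is a from-scratch sketch of Okapi BM25 scoring. The pipeline itself uses the `rank_bm25` library, so this is only an illustration of the idea, not that library's code. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and an idf variant of the form `ln(1 + (N - n + 0.5) / (n + 0.5))`. The three mini-documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # exact-match only: no credit for synonyms
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

corpus = [
    "the transformer uses self attention instead of recurrence".split(),
    "mse and mae are common regression error metrics".split(),
    "feature engineering improves model performance".split(),
]
scores = bm25_scores("transformer attention".split(), corpus)
print(scores)
```

Documents that never mention a query term score exactly zero here, which is why BM25 is good at pinning down exact technical terms and poor at synonyms.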

Reciprocal Rank Fusion (RRF)

Once you have two ranked lists, one from dense and one from BM25, you need to merge them intelligently. A naive approach (averaging scores) fails because the score scales are completely different: ChromaDB returns cosine distances, where lower is better, while BM25 returns unbounded TF-IDF-style scores, where higher is better.

RRF solves this with a rank-based formula:

RRF_score(doc) = Σ  1 / (k + rank_i(doc))

Where k is a constant (typically 60) and rank_i(doc) is the document's position in the i-th ranked list.

The beauty of RRF is that it only cares about rank position, not raw score magnitudes. A document that ranks #1 in dense and #3 in BM25 will score higher than one that ranks #20 in both (about 0.032 versus 0.025 with k = 60), regardless of the underlying score scales. This makes it robust across completely different retrieval systems.
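The formula translates almost directly into code. The following is a minimal sketch rather than the project's custom implementation: the function name `rrf_fuse` and the document ids are placeholders, and ranks are assumed to start at 1.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into one list sorted by RRF score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two retrievers.
dense_ranking = ["a", "b", "c"]
bm25_ranking = ["c", "a", "d"]

fused = rrf_fuse([dense_ranking, bm25_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that appear in both lists ("a" and "c") rise above those that appear in only one, even though no raw retrieval score was ever compared.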

The Reranker

After RRF produces a merged list of ~20 candidates, sending all of them to the LLM for generation would be noisy and expensive. The reranker cuts this down to the top 5 that actually matter.

Rather than another embedding model, I send all 20 candidates to Gemini in a single prompt:

Given this question: [query]
Rank the following 20 passages by relevance.
Return only: {"ranking": [idx1, idx2, idx3, idx4, idx5]}

This works like a cross-encoder, or more precisely like listwise reranking: the LLM reads the query and all passages together, so it can weigh how each passage interacts with the query, something bi-encoder embedding models cannot do. The trade-off is cost and latency, but since we're calling it once per query (not once per document), it's manageable.

The reranker also includes a retry + fallback mechanism: if the API returns a 503 UNAVAILABLE, it waits 5 seconds and retries up to 3 times. On total failure, it falls back to the top 5 from RRF directly — so the pipeline never crashes.
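The retry and fallback logic can be sketched without tying it to any particular SDK. In the sketch below, `call_llm` is a hypothetical callable that takes the prompt and returns the model's text reply, and `ServiceUnavailable` stands in for whatever exception the real client raises on a 503. Treating "up to 3 times" as three attempts in total, and falling back on a malformed reply, are assumptions added for illustration.

```python
import json
import time

class ServiceUnavailable(Exception):
    """Stand-in for the client's 503 UNAVAILABLE error (hypothetical)."""

def rerank_with_fallback(call_llm, prompt, rrf_ranked, top_n=5,
                         attempts=3, wait_s=5.0, sleep=time.sleep):
    """Return the LLM's top_n passages, or the RRF top_n if the call keeps failing."""
    for attempt in range(attempts):
        try:
            ranking = json.loads(call_llm(prompt))["ranking"]
            if not all(isinstance(i, int) and 0 <= i < len(rrf_ranked) for i in ranking):
                raise ValueError("ranking contains an invalid index")
            return [rrf_ranked[i] for i in ranking[:top_n]]
        except ServiceUnavailable:
            if attempt < attempts - 1:
                sleep(wait_s)  # back off before the next attempt
        except (ValueError, KeyError, TypeError):
            break  # malformed reply: retrying is unlikely to help
    return rrf_ranked[:top_n]  # fallback keeps the pipeline alive

def always_down(prompt):
    """Simulates an API that is unavailable on every call."""
    raise ServiceUnavailable()

candidates = ["p0", "p1", "p2", "p3", "p4", "p5"]
print(rerank_with_fallback(always_down, "question", candidates, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` applies.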

Real Results

Here's what happened when I ran the same query with both approaches:

Query: "What are the advantages of the Transformer architecture over traditional RNNs?"

Rank	Dense Only	Hybrid (Dense + BM25 + RRF)
1	`chunk_4` ✅ nlp_temelleri.txt	`chunk_4` ✅ nlp_temelleri.txt
2	`chunk_11` ❌ veri_bilimi.txt	`chunk_3` ✅ nlp_temelleri.txt
3	`chunk_8` ❌ veri_bilimi.txt	`chunk_11` ❌ veri_bilimi.txt

BM25 caught "Transformer" and "RNN" as exact keywords and boosted chunk_3 — a passage about word embeddings and NLP context — from outside the top 3 into rank #2. The two irrelevant data science chunks dropped out.

Evaluation across 5 questions:

Metric	Score
Overall Accuracy	80% (4/5)
Citation Coverage	14/14 successful citations
Hybrid vs Dense	BM25 removed 2 irrelevant chunks
Resilience	503 errors handled via retry + fallback

Every answer cites its source inline (e.g., [1], [2]) with the actual filename, so users can verify the origin of each claim.

The Stack

Component	Library
Embeddings	`sentence-transformers` (`all-MiniLM-L6-v2`)
Vector DB	`chromadb`
Sparse retrieval	`rank_bm25`
Fusion	Custom RRF implementation
Reranker + Generator	Google Gemini API (`google-genai`)
Environment	`python-dotenv`

Try It Yourself

🔗 github.com/jasstt/rag_project

git clone https://github.com/jasstt/rag_project.git
cd rag_project
pip install -r requirements.txt
# Add your Gemini API key to .env
python src/ingest.py
python main.py
python src/eval.py

I'm not saying dense search is bad. For most casual queries it works fine. But the moment your users start asking technical questions — exact model names, function signatures, version numbers — BM25 starts pulling its weight. Adding it took maybe 20 minutes. Two irrelevant chunks disappeared from the results without touching anything else in the pipeline.

v1.1 Update: Community Feedback in Action

Shortly after publishing the initial version of this pipeline, I received some incredible feedback from the engineering community. I've integrated three major improvements directly into the codebase:

1. Sentence-Aware Chunking
Instead of blindly cutting text at 500 characters, src/ingest.py now uses NLTK/regex to detect sentence boundaries. It never splits a sentence in half, and it specifically preserves table-like structures (e.g., lists with pipes or colons) by keeping those rows together. This drastically improves the semantic quality of the chunks.

2. Skip-Rerank Optimization
LLM rerankers introduce latency. To reduce it, I added a confidence check in src/rerank.py. If the top result from RRF scores significantly higher than the second result (configured via SKIP_RERANK_THRESHOLD), the pipeline assumes high confidence and skips the LLM reranker entirely, removing the reranking latency for easy questions.

3. Local Cross-Encoder Reranker
To remove the hard dependency on Gemini for reranking, I integrated cross-encoder/ms-marco-MiniLM-L-6-v2. You can now set RERANK_MODE = "local" in the config to run a fully offline, local cross-encoder that evaluates interactions between the query and the retrieved chunks without hitting any external APIs.
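The skip-rerank check from the second item can be sketched as a small predicate. The article does not say how "significantly higher" is measured, so the ratio test and the threshold value of 1.2 below are assumptions, and `should_skip_rerank` is a hypothetical name rather than the function in src/rerank.py.

```python
SKIP_RERANK_THRESHOLD = 1.2  # hypothetical: top score must be at least 20% above the runner-up

def should_skip_rerank(rrf_scores, threshold=SKIP_RERANK_THRESHOLD):
    """Decide whether the RRF order is confident enough to skip the reranker.

    rrf_scores: fused scores sorted from best to worst.
    """
    if len(rrf_scores) < 2:
        return True  # zero or one candidate: nothing to reorder
    top, runner_up = rrf_scores[0], rrf_scores[1]
    if runner_up <= 0:
        return True
    return top / runner_up >= threshold

# A clear winner (ranked first by both retrievers) versus a near tie.
print(should_skip_rerank([0.0328, 0.0161]))
print(should_skip_rerank([0.0325, 0.0323]))
```

A ratio is used here instead of an absolute difference because RRF scores are small and tightly clustered, which makes absolute gaps hard to tune. That is a design choice for this sketch only.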

Building in public is a cheat code. A huge thanks to the community for the suggestions.