DEV Community

Nick Bokuchava

How I caught a silent NaN bug in production RAG, by asking the system to debug itself

Last week I built a personal knowledge brain. This week I loaded it with five ML textbooks and asked it to debug itself.

Here is what happened, and why it matters if you run RAG on Postgres.

The setup

I run a Supabase + pgvector RAG called GBrain. Hybrid search: vector + tsvector fused with reciprocal rank fusion, then re-scored with cosine. 2,872 chunks, OpenAI text-embedding-3-small, around 5 cents to ingest. Two AI systems share the same brain over MCP. Claude Code helps me write code interactively, OpenClaw runs background automation on Gemini Flash via Vertex AI.
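For readers who haven't seen it, reciprocal rank fusion is easy to sketch. This is an illustrative version, not GBrain's code; the constant k = 60 is the common default and the function name is mine:

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// per document. k = 60 is the usual default; larger k flattens the
// advantage of being ranked first.
function rrfFuse(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      const rank = i + 1; // 1-based rank within this list
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Vector search ranked [a, b, c]; keyword search ranked [b, a, d].
const fused = rrfFuse([["a", "b", "c"], ["b", "a", "d"]]);
// "a" and "b" appear in both lists, so they outrank "c" and "d",
// which each appear in only one.
```

Because every contribution is 1 / (k + rank), a document ranked well by both retrievers beats one ranked well by only one, which is the whole point of the fusion step.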

The whole thing cost me about $0.10/M tokens to run and roughly an afternoon to wire up. It was working. That was the problem.

The first sign something was off

I dropped in five textbooks (Murphy's Probabilistic Perspective, Bishop's PRML, Chip Huyen's Designing ML Systems, Géron's Hands-On ML, Murphy's Advanced Topics) and started querying. Two things looked wrong.

Every query was returning NaN in the relevance score. Not always a hard failure, just a quiet NaN floating in the rank metadata. The retrieved chunks still came back, ordering still mostly looked sensible, so I almost ignored it.

Then I asked the system "explain the EM algorithm for Gaussian mixtures" and it missed Murphy chapter 11.4.2. The chapter that is literally about EM for Gaussian mixtures. Top hit was something about variational inference instead.

Classic RAG failure mode. Wrong chunk wins on a query that should be a layup.

Asking the system to audit itself

Before opening the source code I tried something different. I asked Gemini Flash, reading GBrain through MCP, to use the five textbooks to audit its own retrieval quality.

It came back with surprisingly sharp output. Murphy §9.7.4 quoted verbatim on MRR/NDCG/MAP. Huyen chapter 8 on monitoring and SLO design. And one honest admission: "cross-encoder reranking is not in the corpus." Which is true, because Huyen's book is from 2022 and cross-encoders went mainstream in 2023.

But the audit also confirmed action item #1: fix the NaN. It guessed a division bug in the RRF step.

It guessed wrong. I went to look.

The actual bug

In the search module:

result.set(row.id, row.embedding as Float32Array);

Classic TypeScript trap. as Float32Array is compile-time only. At runtime, in my setup (Supabase JS client, default config without a custom type parser registered for pgvector), the Postgres client returns the pgvector column as a string, formatted like "[0.1, 0.2, 0.3, ...]". Whether you hit this depends on your driver and config — raw pg, Drizzle with pgvector typing, or a custom type parser will all behave differently. But "returns as string" is the default for a lot of common Supabase setups.

So cosine similarity was running over what TypeScript believed was a Float32Array but was actually a string. Indexing a string yields single characters, and JavaScript coerces them in arithmetic: a digit character becomes a number, while anything else ("[", ".", ",") becomes NaN. The result moved through the pipeline, got blended with the BM25 score, and poisoned the final ranking, but never hard-crashed anywhere.
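The failure reproduces in a few lines (the embedding value here is hypothetical):

```typescript
// At runtime row.embedding is a string, but the cast tells TypeScript otherwise.
const embedding = "[0.1, 0.2, 0.3]" as unknown as Float32Array;

let dot = 0;
for (let i = 0; i < embedding.length; i++) {
  // embedding[i] is actually a single character like "[" or "0".
  // "[" * "[" coerces to NaN, and once NaN enters, it never leaves.
  dot += embedding[i] * embedding[i];
}
console.log(dot); // NaN
// Bonus wrongness: embedding.length was the string length (15), not the
// vector dimension (3), so the loop didn't even iterate what you think.
```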

This is the kind of bug that compiles, passes existing tests on happy-path inputs, and just quietly degrades retrieval quality forever. You only catch it when you have ground truth (the textbooks), a clear expected hit (Murphy §11.4.2), and you actually go look.

The patch and the second-order problem

The cast bug itself was fixable in one line at the data-ingest boundary, parsing the string back into a real Float32Array. That fix landed upstream as gbrain #196.
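I won't reproduce the upstream patch verbatim, but the shape of a boundary fix is roughly this — `toFloat32Array` is my name for it, not the repo's:

```typescript
// Parse pgvector's string form "[0.1,0.2,0.3]" into a real Float32Array.
// If the driver already returns a typed array or number[], pass it through.
function toFloat32Array(value: unknown): Float32Array {
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) return Float32Array.from(value);
  if (typeof value === "string") {
    // pgvector's text representation happens to be valid JSON.
    const parsed: number[] = JSON.parse(value);
    return Float32Array.from(parsed);
  }
  throw new TypeError(`unexpected embedding type: ${typeof value}`);
}

// At the ingest boundary, instead of the blind cast:
// result.set(row.id, toFloat32Array(row.embedding));
```

Doing the conversion once at the boundary means everything downstream can trust the type for real, not just at compile time.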

But it left a question. cosineSimilarity is still a public export. Future embedding models at different dimensions, direct callers from user code, test fixtures, none of them go through the parse boundary. Same NaN-shaped failure could come back from a different direction.

So I wrote a separate, narrow defensive hardening of cosineSimilarity itself. Five lines added, two changed, no API change, no behavior change on valid finite dim-matched vectors. Same scores as before for inputs that were already correct.

export function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, magA = 0, magB = 0;
  const len = Math.min(a.length, b.length); // dim-mismatch safe
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  if (!Number.isFinite(denom) || denom === 0) return 0; // Infinity/NaN safe
  const result = dot / denom;
  return Number.isFinite(result) ? result : 0; // belt and suspenders
}

Three failure modes this prevents.

Dimension mismatch. Caller passes a 768-dim vector to a brain storing 1536-dim, the old loop ran past b's end, multiplying undefined * undefined = NaN, which poisoned magB and the return value. Now the loop runs over the common prefix and returns a finite similarity over the shared dimensions. This is a pragmatic defensive choice, not a semantically exact cosine over mismatched-dimension vectors. The "correct" answer for that case is arguably to throw. The goal here is to not poison every downstream score because one caller passed a wrong-dimension input.

Non-finite denominator. If either vector has values large enough that squared sums overflow to Infinity, then sqrt(Infinity) * sqrt(Infinity) = Infinity. The old denom === 0 guard misses that, and dot / Infinity silently returns 0 or NaN depending on dot. The explicit Number.isFinite(denom) check is clear and fast.

Non-finite final result. Belt-and-suspenders check on dot / denom. Since cosineSimilarity's output feeds directly into the blended score 0.7 * rrf + 0.3 * cosine, a single NaN propagates through every downstream result the same way the original cast bug did. Better to catch it here too.

That landed as PR #295 to garrytan/gbrain, currently open with three unit tests covering each guard.

What I actually learned

The bug was boring. The interesting part was the path to finding it.

Self-improving AI agents need three things. Most setups give them two: knowledge (what the agent knows) and tools (how it acts). They lock down the third, introspection rights: permission for the agent to read its own source code. Teams lock it down because it feels scary. But without it, an agent can point at its own bug and still not fix it. You watch it confabulate around the symptom.

The other thing: a NaN in a score column is one of those bugs where every single layer of the system looks fine in isolation. TypeScript compiles. Tests pass. The query returns. The UI renders. The only signal is that retrieval quality is worse than it should be, and "worse than it should be" is invisible without ground truth. Production RAG without an evaluation corpus is a pipeline you cannot debug.

One thing this patch does not do: it does not log when any of these guards trigger. That is fine for stability — you do not want a NaN propagating into a blended retrieval score in production. But for evaluation, silently mapping NaN → 0 can hide real bugs from your metrics. If you adopt this pattern in your own code, add a counter for each guard branch so you can see when the defensive code is actually firing.
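A minimal way to do that, with a plain counter object you can wire to whatever metrics sink you already run (names mine, not the repo's):

```typescript
// Guard-trigger counters, incremented whenever defensive code fires.
// Export these to your metrics sink (StatsD, Prometheus, even periodic logs).
export const cosineGuardCounts = {
  dimMismatch: 0,
  nonFiniteDenom: 0,
  nonFiniteResult: 0,
};

export function cosineSimilarityCounted(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) cosineGuardCounts.dimMismatch++;
  let dot = 0, magA = 0, magB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  if (!Number.isFinite(denom) || denom === 0) {
    cosineGuardCounts.nonFiniteDenom++;
    return 0;
  }
  const result = dot / denom;
  if (!Number.isFinite(result)) {
    cosineGuardCounts.nonFiniteResult++;
    return 0;
  }
  return result;
}
```

If any counter is nonzero in production, the defensive code is papering over a real upstream bug, and you want to know.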

If you run hybrid search on pgvector, two specific things to check today:

  1. Pull a row directly from your DB client and console.log(typeof row.embedding). If it's "string" and your code casts it to Float32Array, you have this bug.
  2. Run a query whose correct top hit you know by name. If the right chunk does not come back top-3, treat it as a real signal, not a tuning question.

Repo: github.com/garrytan/gbrain
PR with the hardening + tests: PR #295
