Atharva Panegai

Posted on Jun 20

Can AI actually find the root cause of a bug, or does it just sound confident?

#ai #automation #programming #productivity

We built Pinaka to write root cause analyses automatically — read a Jira ticket, search the codebase, trace the failure, post the RCA as a comment. Before putting that in front of real teams, we needed an honest answer to one question: does it actually find the right bug, or does it just sound right?

"Sound right" is the trap with every AI tool that explains why something broke. LLMs are extremely good at producing a plausible-sounding root cause. Plausible and correct are not the same thing, and the gap between them is exactly where a team loses trust in a tool — usually after the first time it sends someone chasing the wrong file for an hour.

So we picked a real bug, in a real production codebase, with a real human-written root cause already on record — and tested two versions of Pinaka against it.

The test setup

We used BullMQ (github.com/taskforcesh/bullmq), a Redis-based job queue library for Node.js with thousands of GitHub stars and heavy production use. It's also part of our own stack, which meant we could judge the RCA quality ourselves, not just trust the AI's confidence score.

The bug: issue #2487 — when a job has stackTraceLimit set and fails multiple times on retry, the stack trace shown never updates. It always shows the first failure, even on the third or fourth retry. A real BullMQ maintainer confirmed it as a genuine bug, and it was fixed in v5.4.6.
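To make the symptom concrete, here is a minimal in-memory sketch of that behavior. It does not use BullMQ or Redis. `recordFailureKeepFirst` and `recordFailureKeepLatest` are hypothetical helpers written for illustration, and the idea that the stale trace comes from keeping the oldest entries rather than the newest is an assumption made for this sketch, not something taken from BullMQ's source.

```typescript
// Hypothetical model of a per-job stack trace list capped by a limit.
// Not BullMQ code: names and logic are assumptions for illustration only.

// Keeps the OLDEST `limit` traces, so the newest failure is dropped once
// the list is full. This reproduces the reported symptom.
function recordFailureKeepFirst(traces: string[], trace: string, limit: number): string[] {
  return [...traces, trace].slice(0, limit);
}

// Keeps the NEWEST `limit` traces, so each retry's failure stays visible.
function recordFailureKeepLatest(traces: string[], trace: string, limit: number): string[] {
  const all = [...traces, trace];
  return all.slice(Math.max(all.length - limit, 0));
}

const retryFailures = ["Error: attempt 1", "Error: attempt 2", "Error: attempt 3"];

let staleTraces: string[] = [];
let freshTraces: string[] = [];
for (const failure of retryFailures) {
  staleTraces = recordFailureKeepFirst(staleTraces, failure, 1);
  freshTraces = recordFailureKeepLatest(freshTraces, failure, 1);
}

console.log(staleTraces); // still holds only the first failure
console.log(freshTraces); // holds only the most recent failure
```

With a limit of 1 and three failed attempts, the first helper ends up holding "Error: attempt 1" and the second "Error: attempt 3", which is the difference a user would see in the reported stack trace.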

We indexed the BullMQ repo into Pinaka, then ran two separate RCA passes:

Run A — code only. A realistically vague Jira ticket, the way an engineer would actually file it — no file names, no hypotheses, just "the stack trace doesn't update after retries." Pinaka had to find the bug using semantic search over the codebase alone.

Run B — code plus runtime context. Same indexed repo, but this time Pinaka also received the kind of execution-time data our SDK is built to capture automatically — real option values, job state, and the actual sequence of events as they occur in a live Redis instance.
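As a rough picture of what "runtime context" means here, the sketch below shows one possible shape for that data. Every type and field name is a hypothetical illustration, not the actual @getpinaka/sdk payload format, and the sample values are made up for the example.

```typescript
// Hypothetical shape for runtime context attached to an RCA request.
// Field names are assumptions; the real SDK payload may look different.
interface RuntimeEvent {
  atMs: number;    // milliseconds since the first observed event
  name: string;    // e.g. "failed", "retried"
  detail?: string;
}

interface RuntimeContext {
  options: Record<string, unknown>;  // option values as actually set at runtime
  jobState: Record<string, unknown>; // job state observed at failure time
  events: RuntimeEvent[];            // ordered sequence of what happened
}

// Render the context as plain text that can sit next to the ticket.
function describeContext(ctx: RuntimeContext): string {
  const eventLines = ctx.events.map((e, i) => {
    const suffix = e.detail ? `: ${e.detail}` : "";
    return `${i + 1}. +${e.atMs}ms ${e.name}${suffix}`;
  });
  return [
    `options: ${JSON.stringify(ctx.options)}`,
    `jobState: ${JSON.stringify(ctx.jobState)}`,
    ...eventLines,
  ].join("\n");
}

// Made-up sample values, for illustration only.
const sampleContext: RuntimeContext = {
  options: { attempts: 3, stackTraceLimit: 0 },
  jobState: { attemptsMade: 2 },
  events: [
    { atMs: 0, name: "failed", detail: "attempt 1" },
    { atMs: 5000, name: "failed", detail: "attempt 2" },
  ],
};

console.log(describeContext(sampleContext));
```

The point of the structure is that the option values and the event order are stated as observed facts rather than inferred from source code.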

We graded both outputs against the real fix the BullMQ team shipped.

What code-only RCA found

Pinaka correctly identified the exact file (job.ts) and exact method (moveToFailed()), and gave a coherent step-by-step trace of how the bug manifested across retries. It even ran through three plausible alternative explanations and ruled each one out with specific evidence from the test suite — which is the kind of reasoning you'd want from a senior engineer doing this by hand.

What it didn't get: the precise line-level cause. It described the bug's behavior correctly, but the fix it proposed, while functionally valid, was more complicated than what BullMQ's maintainers actually shipped. It also had no way of seeing that a closely related, deeper bug existed one layer down: inside the Redis Lua script BullMQ uses for atomic job updates, not in the TypeScript code at all.

Against the human ground truth, we'd score this run roughly 6.5 out of 10 — right neighborhood, not quite the exact fix.

What runtime context found

The second run, with real execution-time data included, didn't just refine the same answer — it surfaced a different, deeper bug entirely: what happens when stackTraceLimit is explicitly set to 0. This turned out to be a real follow-up issue BullMQ's own maintainers filed after the first fix shipped.

The root cause lived in the Lua script's trim condition, which only fired when the limit was greater than zero — so a limit of exactly zero silently skipped the trim step and let stale stack traces leak into Redis. Pinaka named the exact condition (ARGV[4] > 0) and proposed a fix that matched the structure of what the maintainers actually merged.
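The shape of that failure can be sketched outside Redis. The TypeScript below is a simplified stand-in for the trim logic, not the Lua script itself: `trimWithStrictGuard` and `trimWithInclusiveGuard` are hypothetical names, the array stands in for the stored stack traces, and the inclusive guard is only one plausible form of a fix, not necessarily what the maintainers merged.

```typescript
// Simplified stand-in for a "trim to limit" step. Not BullMQ's Lua script.

// Trims only when limit > 0, mirroring the condition named in the RCA.
// A limit of exactly 0 skips the trim, so old entries survive.
function trimWithStrictGuard(stored: string[], limit: number): string[] {
  if (limit > 0) {
    return stored.slice(-limit);
  }
  return stored;
}

// Hypothetical alternative: treat 0 as "keep nothing" instead of "skip".
function trimWithInclusiveGuard(stored: string[], limit: number): string[] {
  if (limit > 0) {
    return stored.slice(-limit);
  }
  if (limit === 0) {
    return [];
  }
  return stored; // negative limit: treated as "no limit" in this sketch
}

const storedTraces = ["trace 1", "trace 2", "trace 3"];

console.log(trimWithStrictGuard(storedTraces, 0).length);    // 3: nothing trimmed
console.log(trimWithInclusiveGuard(storedTraces, 0).length); // 0: everything trimmed
console.log(trimWithInclusiveGuard(storedTraces, 2));        // the last two traces
```

Both versions behave identically for any positive limit, which is why the difference only shows up when the value zero actually reaches the guard.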

Against the same ground truth, this run scored closer to 9.2 out of 10.

Why this matters more than the score

The interesting part isn't really "9.2 beats 6.5." It's what kind of mistake runtime context fixed. Static code reading can tell you what the code says it should do. It cannot tell you what's actually sitting in Redis at the moment of failure, or that a boolean condition silently short-circuits when a value is zero, or that the real bug spans two different layers of the system — application code and the Lua script running atomically inside the database. That's not a knowledge gap an LLM can read its way out of. It's a visibility gap, and the only fix for a visibility gap is actually capturing what happened at runtime.
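A small, generic TypeScript illustration of the zero case, unrelated to BullMQ's actual source: `limitWithOr` and `limitWithNullish` are hypothetical helpers, and the default of 10 is arbitrary. The difference is easy to miss when reading code and obvious once the real value is visible.

```typescript
// `||` falls back on any falsy value, including 0.
function limitWithOr(limit?: number): number {
  return limit || 10;
}

// `??` falls back only on null or undefined, so an explicit 0 is respected.
function limitWithNullish(limit?: number): number {
  return limit ?? 10;
}

console.log(limitWithOr(0));      // 10
console.log(limitWithNullish(0)); // 0
console.log(limitWithOr(5), limitWithNullish(5)); // 5 5
```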

That's the whole bet behind shipping @getpinaka/sdk instead of building a tool that only reads your GitHub repo: a lot of real production bugs don't live in the code you can see. They live in the gap between what the code says and what actually happened.

Where we'd push back on our own result

A few things we don't want to bury in fine print:

Run A's input was more informative than real production logs usually are. We wrote the simulated logs to explicitly state facts an engineer would have to dig for in practice. If anything, this means the 6.5 score is generous to the code-only condition — the real-world gap between code-only and SDK-instrumented RCA is probably larger than what we measured here.

Run A and Run B technically targeted two related but different bugs, not a strict apples-to-apples test on one bug with and without context. They're both part of the same GitHub issue thread, and the second was filed as a direct follow-up to the first — but a tighter version of this test would hold the bug constant and vary only the context.

This is one bug, on one repo, run once per condition. It's a real qualitative case study, not a statistically powered benchmark. We're running the same methodology against issues in TypeORM and Prisma next, and we'll publish what we find — including if it doesn't look as clean as this one.

Run B's runtime context was constructed, not captured live by the SDK. The next version of this test uses a real instrumented service and a real crash, end to end.

We'd rather tell you where this benchmark is thin than have you find it yourself in the GitHub issue thread — which, if you're the kind of engineer who reads root cause analyses for a living, you probably will anyway.

Try it

If you want to see what Pinaka does on your own codebase, join the waitlist at getpinaka.com. We're a small team building this in public, and benchmarks like this one are how we're deciding what to build next, not just what to put in a launch post.

Top comments (1)

Atharva Panegai • Jun 20

Hi, I'm the person who built this (and Pinaka). Wanted to share this benchmark because I think a lot of AI RCA claims right now are unverifiable hype, and I'd rather be transparent about where this one is thin than have someone find the gaps in the comments. Happy to answer questions about the methodology or the SDK.