lakshmipathi

Posted on Mar 25

Why AI tools guess from CI logs (and how to give them real data instead)

#ai #ci #github #gitlab

Your CI failed. The AI read the log. It guessed. Here's why that's not good enough.

The log says "Segmentation fault." Now what?

You push code. CI runs. It fails. You get a log:

Round  7/12: num =  58  *** BUG DETECTED (crash next round) ***
Round  8/12: num =  50  *** CRASHING: previous round 7 had bad value 58 ***
Segmentation fault (core dumped)
Exit code: 139

You paste this into your AI coding tool. It reads the log, sees "Segmentation fault," infers a NULL pointer dereference, and suggests a fix.

But did it find the root cause? Or did it just patch the symptom?

We tested this. Two AIs, same crash, very different fixes.

We took a real CI failure — a C program that crashes with SIGSEGV at round 8 — and gave it to two AI models.

AI #1 (from logs only): Read the CI output. Saw the NULL pointer dereference at line 34. Generated this fix:

-        int *p = NULL;
-        *p = g_prev_num;
+        return -1;

Removes the crash. CI passes. Job done?

No. The program still silently corrupts data. The real bug is a few lines further down: the code forces num = 58 at round 7, which sets up the crash path for round 8. Removing the crash mechanism just hides the problem.

AI #2 (from runtime data): Saw the exact variable values at every step:

Round 7: getrandom() returned 91
         → code overwrote num from 91 to 58
         → is_even=1, in_range=1
         → g_prev_was_bad set to 1

Round 8: entered crash path because g_prev_was_bad=1
         → p = NULL
         → *p = g_prev_num (58) → SIGSEGV

It identified lines 38-42 as the root cause and generated:

-    if (round >= 7 && !g_prev_was_bad) {
-        num = 58;
-        is_even = 1;
-        in_range = 1;
-    }

The actual bug removed. Not the symptom — the cause.
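To make the difference between the two patches concrete, here is a minimal sketch of the round logic as the post describes it. It is a hypothetical reconstruction, not the original program: only num, is_even, in_range, g_prev_was_bad, g_prev_num, and the injection condition come from the diffs above. The function names, the keep_injection switch, the 50 to 60 range, and the rule that sets the flag are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reconstruction of the round logic described in the post.
   Names not listed in the lead-in are assumptions for illustration. */
int g_prev_was_bad = 0;
int g_prev_num = 0;

/* Processes one round and returns the value the round actually used,
   or -1 when the patched crash path is taken.
   keep_injection = true  models AI #1's patch (injection block still there).
   keep_injection = false models AI #2's patch (injection block removed). */
static int process_round(int round, int random_value, bool keep_injection)
{
    int num = random_value;
    int is_even = (num % 2 == 0);
    int in_range = (num >= 50 && num <= 60);   /* assumed range */

    if (g_prev_was_bad) {
        /* AI #1 replaced the NULL write with an early return. */
        return -1;
    }

    if (keep_injection && round >= 7 && !g_prev_was_bad) {
        num = 58;              /* the overwrite removed by AI #2 */
        is_even = 1;
        in_range = 1;
    }

    if (is_even && in_range) { /* assumed rule for flagging a bad round */
        g_prev_was_bad = 1;
        g_prev_num = num;
    }
    return num;
}

/* Runs every round over fixed inputs and counts the rounds whose result
   differs from the input, i.e. rounds that were altered or dropped. */
int count_corrupted_rounds(const int *inputs, size_t n, bool keep_injection)
{
    size_t i;
    int corrupted = 0;

    g_prev_was_bad = 0;
    g_prev_num = 0;
    for (i = 0; i < n; i++) {
        int used = process_round((int)i + 1, inputs[i], keep_injection);
        if (used != inputs[i])
            corrupted++;
    }
    return corrupted;
}
```

Under these assumptions, calling count_corrupted_rounds with keep_injection set to true still reports altered rounds even though nothing crashes, while setting it to false reports none for inputs that never trip the flag on their own.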

The difference? Variable values.

AI #1 saw: "line 34 crashes with NULL pointer."
AI #2 saw: "line 39 overwrites num from 91 to 58, which triggers the crash in the next round."

The second AI didn't guess. It traced the variable evolution step by step, saw where the value changed, and identified the injection point. That's what real runtime data gives you that a log never will.
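One way to picture "tracing the variable evolution" is a recorder that notes every write to a watched variable along with the line that made it. The sketch below is only an illustration of the idea, not how the tool captures its data. The names trace_t, trace_event, trace_write, TRACED_SET, and first_overwrite are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_MAX 64

/* One recorded write to a watched variable. */
typedef struct {
    int line;       /* source line that performed the write */
    int old_value;  /* value before the write */
    int new_value;  /* value after the write */
} trace_event;

typedef struct {
    trace_event events[TRACE_MAX];
    size_t count;
} trace_t;

/* Assigns value to *var and records the write (until the buffer is full). */
void trace_write(trace_t *t, int *var, int value, int line)
{
    assert(t != NULL && var != NULL);
    if (t->count < TRACE_MAX) {
        trace_event *e = &t->events[t->count++];
        e->line = line;
        e->old_value = *var;
        e->new_value = value;
    }
    *var = value;
}

/* Convenience wrapper that fills in the current source line. */
#define TRACED_SET(t, var, value) trace_write((t), &(var), (value), __LINE__)

/* Returns the first write that replaced `expected` with a different value,
   or NULL if no such write was recorded. */
const trace_event *first_overwrite(const trace_t *t, int expected)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        const trace_event *e = &t->events[i];
        if (e->old_value == expected && e->new_value != expected)
            return e;
    }
    return NULL;
}
```

With a trace like this, asking for the first overwrite of 91 points straight at the write that turned it into 58, which is the kind of answer a log line cannot give.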

Logs show symptoms. Runtime data shows causes.

Here's what a typical CI log contains:

The test name
stdout/stderr output
An exit code
Maybe a stack trace
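The exit code in particular says very little. Shells report a process killed by signal N as exit status 128 + N, so 139 means signal 11, which is SIGSEGV on Linux. A minimal sketch of that decoding follows; the helper name signal_from_exit_code is made up for illustration.

```c
#include <signal.h>

/* Shells report "killed by signal N" as exit status 128 + N.
   Returns N, or 0 if the code does not follow that convention. */
int signal_from_exit_code(int exit_code)
{
    if (exit_code > 128 && exit_code <= 255)
        return exit_code - 128;
    return 0;
}
```

For the log above, signal_from_exit_code(139) gives SIGSEGV. That confirms what "Segmentation fault" already said and nothing about why it happened.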

Here's what deep runtime tracking captures:

Every variable's value at the crash point
The call stack with arguments
Thread state and interleaving order
The exact line where a value changed unexpectedly

For simple bugs, logs are enough. For the bugs that actually waste your time — race conditions, intermittent crashes, "works on my machine" failures — you need the runtime data.

What this looks like in practice

We built a tool that re-runs your failed CI test with deep runtime tracking. Here's what the output looks like — a replay showing the exact variable evolution leading to the crash:

The yellow arrow steps through the code. Variables update in real time on the left. When num suddenly changes from 91 to 58, that's the bug, highlighted in orange. When p = NULL and the program writes through it, you get a red flash: SIGSEGV.

This isn't a simulation. These are actual values captured from the running process.

How it works

Your CI fails on your runner (GitHub Actions, GitLab CI — doesn't matter)
We re-run only the failed test with deep runtime tracking
We capture exact variable values, thread state, and stack trace at the crash point
AI analyzes the captured data — not logs — and opens a fix PR
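The first two steps can be pictured as a small harness that notices a test died and runs only that test again with extra tracking switched on. This is a hedged sketch built on POSIX fork and waitpid, not the tool's actual implementation. The names run_isolated, run_with_rerun, the tracking flag, and the two sample tests are hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A test receives a flag saying whether extra tracking is enabled. */
typedef void (*test_fn)(int tracking);

/* Runs the test in a child process so a crash cannot take down the harness.
   Returns 0 on a clean exit, the signal number if the child was killed by
   a signal, or -1 for any other failure. */
int run_isolated(test_fn test, int tracking)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        test(tracking);
        _exit(0);              /* skip atexit handlers and stdio flushing */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}

/* Runs the test once without tracking and re-runs it with tracking only
   if it failed. Stores the number of runs in *runs and returns the result
   of the last run. */
int run_with_rerun(test_fn test, int *runs)
{
    int result = run_isolated(test, 0);

    *runs = 1;
    if (result != 0) {
        result = run_isolated(test, 1);
        *runs = 2;
    }
    return result;
}

/* Sample tests: one passes, one dies with SIGSEGV. */
void passing_test(int tracking) { (void)tracking; }
void crashing_test(int tracking) { (void)tracking; raise(SIGSEGV); }
```

In this sketch a passing test runs exactly once, so the tracked run only ever happens after a failure.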

Zero overhead on your passing builds. We only run on failures.

The hard bugs aren't the ones with good error messages

The bugs that waste engineering hours are:

Thread races — "passes 9 out of 10 runs"
Timing-dependent crashes — works locally, fails in CI
Intermittent NULL derefs — depends on random values or scheduling

These bugs don't leave useful logs. They leave "Segmentation fault" and a prayer. The only way to diagnose them reliably is to see the actual state at the moment of failure.

That's what deep runtime tracking does. That's what we capture. That's the proof.

Try it

We're looking for beta testers — especially teams with flaky tests they can't figure out.

neverbreak.ai — CI broke. AI fixed. With proof.

Currently supports C, C++, Go, Python, Node.js, and Java. Linux CI only.
