wintrover

Posted on Apr 24 • Originally published at wintrover.github.io

"42% Silence": What It Means to Control Failure in AI Code Verification

#axiom #ai #code #verification

We live in an era where AI writes code.

The problem is no longer productivity.

It's trust.

Code gets generated fast.
Tests pass.
But

there's no guarantee that this code is actually correct.

1) What We're Building: Axiom

Axiom is not a mere "automatic proof engine."

It's a system that makes AI-generated code trustworthy.

To achieve this, we combine:

SMT Solver (Z3)
Abstraction (CEGAR)
Summary-based Decomposition (Summary)
Formal Verification (Lean 4)

into a single orchestration.

2) The Incident: "42% Silence"

When running ax prove, the progress would stall precisely at 42%.

It looked like a simple UI glitch.

But three possibilities existed simultaneously:

Actual deadlock
Solver endlessly computing
Some tasks finished but UI not showing them

In other words,

The system was in a state where you couldn't tell if it was dead or thinking.

3) Kimi's Diagnosis: "It's Actually Stuck"

Investigation revealed an actual bug.

Cause 1 — Sequential Result Waiting

let result = ^task  # blocks in spawn order

→ If one hangs, everything stops

Cause 2 — Blocking I/O

readLine()  # can wait forever on Z3 pipe

→ Deadlock possible regardless of solver state

4) First Fix: Make the System Non-Blocking

We fixed two things.

✔ Out-of-order processing

if isReady(task):
  collect(task)

→ Immediately process completed tasks

✔ Non-blocking I/O

readData()  # chunk-based processing

→ Detect solver termination/abnormal state immediately

This removed the "true freeze."

5) But a Bigger Problem Remained

After fixing the bug, the same question lingered.

"What if the solver never finds an answer?"

6) The Technical Debate: "Bypass or Siege?"

Two perspectives collided here.

🏛️ Gemini — Engineering Siege Strategy

"We're not finding answers, we're collecting evidence"

Simplify problems with CEGAR
Break them down with Summary
Finalize with Lean 4

🏛️ Kimi — Limit-Awareness Design

"All these tools fail"

SMT is undecidable
CEGAR can also stall in loops
Lean automation is ultimately a search problem

Conclusion:

There is no perfect solution.

7) Our Conclusion

The key insight from this debate:

❌ "A system that always proves" is impossible
✅ "A system that controls failure" is possible

8) Axiom's Design Principles

Axiom is not a "proof engine" but:

"A system that structures failure"

🏛️ L1/L2 — Stop the failure

timeout
abstraction (CEGAR)

→ Block infinite computation

🏛️ L3 — Break down the failure

decomposition
summary

→ Whole failure → Partial success

🏛️ L4 — Escalate the failure

Lean 4
Human invariant

→ Automation → Collaboration

9) Another Problem Surfaced

The bug was gone, but a new problem appeared.

It still "looked stuck."

This isn't just a UX problem.

Core Issue

In a verification system, "invisible progress" equals trust collapse.

10) Solution: Heartbeat Patch

We changed our approach.

Instead of "exact progress percentage,"
show "evidence of being alive"

Before

[████████░░] 42%

After

[████████░░] 42%

[VERIFIED] expr.nim::add
[VERIFIED] results.nim::isErr
[TIMEOUT] main.nim::getStr

The meaning of this change:

"It's not stuck — it's solving a hard problem."

11) Additional Improvement: Evidence-Based Hints

Z3 says nothing when it times out.

So we worked around it:

Structural complexity analysis (Topology)
Lean unsolved goal extraction

Result:

We can't directly say "why it failed,"
but we can show "where the problem is."

12) Lessons Learned

The incident clarified two things.

✔ Nim threadpool limitations

FlowVar cancellation impossible
Polling-based design required

✔ Z3's black-box nature

timeout = unknown
Root cause tracking impossible

→ Contextual diagnosis required

13) Conclusion

The "42% incident" wasn't just a simple bug.

It was the surfacing of structural limitations that every AI code verification system must face.

We made one choice here.

Abandon perfect proof
Choose controllable failure

One-sentence definition:

In a verification engine, what matters isn't the correct answer,
but how failure manifests and is controlled.

Closing

Axiom no longer freezes at 42%.

And more importantly:

Even when it stops,
it tells you why it stopped and what to do next.

One-line summary:

Axiom isn't an engine for proving code,
it's a system for generating trust.

DEV Community