"Prove your AI-written code — or get the exact input that breaks it"

IshvaTheGuru — Wed, 24 Jun 2026 04:54:39 +0000

tags: python, opensource, ai, devtools

cover_image: https://raw.githubusercontent.com/ishvaproducts-png/ishvacerto/main/assets/social-preview.png

AI coding assistants are fast, and they ship confident bugs. The output looks right, the explanation sounds right, and the failing case turns up in production. The missing piece isn't a smarter generator — it's something that can check the generated code and refuse to bluff when it can't.

ishvacerto is that gate. Give it a function and a way to check it — its own doctests, your tests, or a reference implementation — and it returns exactly one of three answers:

✅ VERIFIED — it passed the captured spec on every input the gate could exercise.
❌ REFUTED — it fails, and here is the exact failing input. Not "looks suspicious." For example: REFUTED [doctest] fn=square counterexample: square(3) (got 6, expected 9).
🤷 ABSTAIN — no checkable spec could be captured, so it says so instead of rubber-stamping.

The whole promise lives in that third answer. Never wrong, sometimes silent. It verifies what it can check and abstains on the rest — which is exactly why it never false-alarms on correct code.

Try it

pip install ishvacerto

from ishvacerto import verify, verify_against_reference

verify(open("f.py").read())                    # uses the code's own doctests
verify(code, tests=[("f(3)", "9")])            # against your tests
verify_against_reference(ai_code, ref, "f")    # where does it diverge from a reference?

From the command line (exits 1 on REFUTED, so it gates CI directly):

ishvacerto my_function.py
ishvacerto --ref reference.py --entry my_func ai_generated.py   # differential
ishvacerto --json my_function.py                                # machine-readable
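Because the CLI exits 1 on REFUTED, a CI step can key off the process status alone. The wrapper below is a minimal sketch of that idea, not part of ishvacerto: `gate_blocks` is a hypothetical helper name, and it relies only on the one exit code the post states. Exit codes for VERIFIED and ABSTAIN are not documented here, so the sketch does not interpret them.

```python
import subprocess
import sys


def gate_blocks(command) -> bool:
    """Return True when `command` exits with status 1 (REFUTED, per the post)."""
    completed = subprocess.run(command)
    return completed.returncode == 1


if __name__ == "__main__":
    # Hypothetical CI entry point: pass the real CLI invocation as arguments,
    # e.g. `python ci_gate.py ishvacerto my_function.py`.
    if len(sys.argv) > 1 and gate_blocks(sys.argv[1:]):
        sys.exit("gate reported REFUTED; failing the build")
```

Most CI systems already fail a step on any non-zero exit, so a wrapper like this is only needed when you want custom handling around the verdict.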

Measured, not asserted

You can reproduce the headline numbers yourself — there's a script in the repo:

python benchmarks/humaneval_gate.py

On the real HumanEval benchmark (164 problems), the gate produces 0 false alarms on the canonical correct solutions, captures a checkable doctest spec on 76/164 (~46%) of problems, and abstains on the rest. It even flags HumanEval's own wrong doctest (problem 47) as a spec/code conflict rather than a false alarm — it caught a benchmark bug instead of blaming the code.

Coverage grows with the spec or reference you give it. Next on the roadmap is a reference proposer that retrieves a verified reference for the same task when code ships with no tests, widening reach while keeping false alarms at zero.

How it decides

VERIFIED only if the captured spec passed on every input it could exercise.
REFUTED only on a clean mismatch — and it tells you the input.
ABSTAIN if it couldn't capture a usable spec. That discipline is what keeps it from false-alarming on good code.
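The three rules above can be illustrated with a toy gate built on the standard library's `doctest` parser. This is a sketch of the decision logic only, not ishvacerto's implementation: `toy_gate` is a hypothetical name, it handles only expression-style examples, it skips examples that expect an exception, and it runs the code in-process with no timeout or isolation.

```python
import doctest
import inspect


def toy_gate(source: str) -> str:
    """Toy three-way verdict based on the doctests found in `source`."""
    namespace = {}
    exec(source, namespace)  # illustration only: in-process and unsandboxed

    parser = doctest.DocTestParser()
    checked = 0
    for name, obj in list(namespace.items()):
        if not inspect.isfunction(obj) or not obj.__doc__:
            continue
        for example in parser.get_examples(obj.__doc__):
            if example.exc_msg is not None:
                continue  # examples that expect an exception are out of scope here
            expr = example.source.strip()
            want = example.want.strip()
            try:
                got = repr(eval(expr, namespace))
            except SyntaxError:
                continue  # statement-style examples are skipped in this sketch
            except Exception as exc:
                got = type(exc).__name__
            checked += 1
            if got != want:
                return (
                    f"REFUTED [doctest] fn={name} "
                    f"counterexample: {expr} (got {got}, expected {want})"
                )
    # No checkable example means no verdict, never a rubber stamp.
    return "VERIFIED" if checked else "ABSTAIN"
```

Fed a `square` function whose doctest expects `square(3)` to be 9 but whose body returns `x * 2`, this sketch returns the same kind of message quoted earlier in the post; a function with no doctest yields ABSTAIN.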

The differential mode is the fun part: it generates inputs, runs the candidate and the reference, and shows the first input where they disagree. Input generation is signature-agnostic — it produces generic argument tuples, lets the reference filter the valid ones, and abstains if it can't exercise at least one.
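A rough sketch of that differential loop, under stated assumptions: the value pool, the `max_arity` limit, and the name `toy_differential` are all invented for illustration, and the real generator is likely richer. The shape is what matters: generic argument tuples, the reference decides which ones are valid, and zero exercised inputs means ABSTAIN.

```python
import copy
import itertools

# Hypothetical pool of generic values; the real tool's generator is not shown in the post.
POOL = [0, 1, -1, 2, 3, 10, "", "ab", [], [1, 2]]


def toy_differential(candidate, reference, max_arity=2):
    """Return a verdict tuple; on REFUTED it carries the first disagreeing input."""
    exercised = 0
    for arity in range(1, max_arity + 1):
        for args in itertools.product(POOL, repeat=arity):
            try:
                expected = reference(*copy.deepcopy(args))
            except Exception:
                continue  # the reference rejects this input, so it is not a valid case
            exercised += 1
            try:
                got = candidate(*copy.deepcopy(args))
            except Exception as exc:
                got = exc  # a crash on a valid input counts as a disagreement
            if got != expected:
                return ("REFUTED", args, got, expected)
    return ("VERIFIED", exercised) if exercised else ("ABSTAIN",)
```

Arguments are deep-copied before each call so that a function mutating a list cannot contaminate the other side of the comparison. With a squaring reference and a doubling candidate, the first disagreement this pool reaches is the input `(1,)`.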

Engineering

Pure Python standard library, zero dependencies, 13/13 tests passing, CI green on Python 3.9 / 3.11 / 3.12, MIT licensed. It runs entirely on your machine: no account, no cloud, no telemetry, and your code never leaves the box. There's also a VS Code extension that shows the counterexample inline.

Honest scope

It verifies what it can check and abstains on the rest: coverage is a function of the spec or reference you give it, never a guess. The subprocess timeout guards against hangs, but it is not a security sandbox, so verify only code whose source you trust (such as your own assistant's output) or run it in a container.
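To make the timeout point concrete, here is a minimal sketch of running code in a child interpreter with a time limit, using only `subprocess` from the standard library. The helper name `run_with_timeout` and the 2-second default are assumptions for illustration; this is not ishvacerto's own runner. The child still has full access to the filesystem and network, which is exactly why a timeout is not a sandbox.

```python
import subprocess
import sys


def run_with_timeout(source: str, timeout_s: float = 2.0):
    """Run `source` in a fresh interpreter; return (status, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return ("TIMEOUT", "")
    return ("OK" if proc.returncode == 0 else "ERROR", proc.stdout)
```

An infinite loop comes back as `("TIMEOUT", "")` instead of hanging the caller, while anything the code does before the deadline, including deleting files, still happens.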

It doesn't compete with your AI coder — it makes its output safe to ship.

⭐ MIT, free, and the measurements are reproducible: https://github.com/ishvaproducts-png/ishvacerto

pip install ishvacerto