Ajay Singh

Why I'm building a tool to auto-execute LLM code suggestions

The Problem
I've wasted too many hours this week on the "AI Debug Loop":
Ask ChatGPT for a fix.
Paste code. It crashes.
Ask Claude for a fix.
Paste code. It crashes differently.
Ask Gemini. It invents a library that doesn't exist.
We treat LLMs like oracles, but for code they're often just confident liars.
The Idea: A "Truth Engine" for Code
I got tired of being the manual tester for these models. So, I’m working on a script that automates the verification process.
Instead of asking one model, the tool:
Queries the Council: Sends your bug/prompt to GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro simultaneously.
The Sandbox: It spins up an isolated Docker container for each proposed fix.
The Execution: It actually runs the code and checks for runtime errors.
The Verdict: It discards the hallucinations and hands you the code that actually ran (rough sketch of this loop below).
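To make that concrete, here is a rough Python sketch of the loop. `ask_model` is a placeholder for whichever provider SDKs get wired in, and the model IDs, helper names, and Docker settings are illustrative assumptions, not the finished tool:

```python
# Rough sketch of the query -> sandbox -> verdict loop.
# ask_model() is a stand-in for the real provider clients.
import concurrent.futures
import subprocess
import tempfile
import time
from pathlib import Path

MODELS = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]  # illustrative IDs


def ask_model(model: str, prompt: str) -> str:
    """Placeholder: call the provider's API and return the proposed code."""
    raise NotImplementedError(f"wire up the {model} client here")


def run_in_sandbox(code: str, timeout: int = 30) -> dict:
    """Run one candidate fix in a throwaway, network-less container."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "candidate.py").write_text(code)
        start = time.monotonic()
        try:
            proc = subprocess.run(
                ["docker", "run", "--rm",
                 "--network=none",            # untrusted code gets no network
                 "--memory=256m",
                 "-v", f"{workdir}:/app:ro",
                 "python:3.12-slim",
                 "python", "/app/candidate.py"],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "seconds": timeout, "stderr": "timed out"}
        return {
            "passed": proc.returncode == 0,
            "seconds": round(time.monotonic() - start, 2),
            "stderr": proc.stderr.strip(),
        }


def council_verdict(prompt: str) -> dict:
    # Step 1: fan the same prompt out to every model at once.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = dict(zip(MODELS, pool.map(lambda m: ask_model(m, prompt), MODELS)))
    # Steps 2-3: execute each proposal in its own sandbox.
    results = {model: run_in_sandbox(code) for model, code in candidates.items()}
    # Step 4: only code that actually ran is eligible; fastest passing fix wins.
    passing = {m: r for m, r in results.items() if r["passed"]}
    winner = min(passing, key=lambda m: passing[m]["seconds"]) if passing else None
    return {"results": results, "recommended": winner}
```

Running candidates with `--network=none` and a read-only mount is a cheap way to keep a hallucinated outbound call or file write from doing any damage while the code is being judged.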
What the output looks like
I'm building the CLI right now, but here is the concept:

Analyzing bug in 'auth_controller.py'...

┌─────────────┬─────────────┬─────────────┐
│ Model       │ Status      │ Result      │
├─────────────┼─────────────┼─────────────┤
│ GPT-4o      │ ✅ PASSED   │ Runtime OK  │
│ Claude 3.5  │ ✅ PASSED   │ Runtime OK  │
│ Gemini 1.5  │ ❌ FAILED   │ Syntax Err  │
└─────────────┴─────────────┴─────────────┘

[Recommended Fix]: Claude 3.5 (Fastest execution time)
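On the CLI side, something like the rich library can draw that table. A minimal sketch, with the rows hard-coded to mirror the mock-up above (in the real tool they would come from the sandbox results):

```python
# Minimal sketch of rendering the verdict table with rich.
from rich.console import Console
from rich.table import Table

table = Table(title="Analyzing bug in 'auth_controller.py'")
for column in ("Model", "Status", "Result"):
    table.add_column(column)

# Hard-coded rows for illustration only.
table.add_row("GPT-4o", "✅ PASSED", "Runtime OK")
table.add_row("Claude 3.5", "✅ PASSED", "Runtime OK")
table.add_row("Gemini 1.5", "❌ FAILED", "Syntax Err")

Console().print(table)
```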
Why do this?
Because I'd rather wait 30 seconds for a verified answer than spend 10 minutes debugging a hallucination.
Want to test it?
I’m currently running this workflow manually to benchmark how often the models disagree.
If you have a bug or a snippet that AI keeps messing up:
Drop it in the comments (or DM me).
I’ll run it through the "Council" and reply with the comparison results.
I’m trying to figure out if this is worth building into a full CLI tool or SaaS. Let me know what you think!

Top comments (1)

rokoss21

What you’re describing isn’t really about “better prompts” or “better models” — it’s about moving the source of truth from model confidence to execution and verification.

Once you treat LLMs as proposal generators and let runtime behavior be the arbiter, a lot of the AI-debug loop pain simply disappears. Execution becomes the contract.

I’ve been thinking in a similar direction while working on deterministic contract layers (schemas, validation, canonical outputs) for LLM systems — your approach complements that nicely by enforcing truth at runtime instead of trusting the model upfront.

“Models propose. Execution decides.” feels like the right mental model going forward.