Arun Raghunath

We built test mode. Then discovered it was broken.

Part of building jhansi.io in public.

Test mode sounded simple. Upload code, pass a command, jhansi runs it + your test suite. Done.

Except it wasn't done. First run: empty output. No errors. Just silence.

Here's what broke — and how it changed how we think about AI-generated code.


The original idea

AI writes code. Scripts, APIs, full backends. But code without proof is liability.

Test mode is the proof. You upload a project to a jhansi sandbox, pass the command that starts your app, and jhansi:

  1. Runs the command
  2. Waits for the server to come up
  3. Executes your test suite against it
  4. Returns results
  5. Kills everything

All inside an isolated container. Nothing escapes. Nothing persists.

This is the verification layer missing from Cursor, Claude Code, Windsurf. They generate. We verify.


The problem we didn't anticipate

v0.4 of test mode accepted a filename.

Upload app.py, call exec with filename: "app.py", and jhansi figures out how to run it.

The problem: real projects aren't single files.

A Flask app is app.py + tests/ + requirements.txt. When we uploaded them separately, they landed flat in the workspace. pytest couldn't find tests/. The installer couldn't find requirements.txt.

We built test mode for the toy world. But AI doesn't generate toys. It generates projects.

AI agents don't write hello_world.py. They write repos.


The fix: projects are zips, not files

Obvious once you see it. Upload the whole project as a zip.

# From inside your project
cd my_project && zip -r ../my_project.zip .

# Upload to sandbox
curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/upload \
  -F "file=@my_project.zip"

jhansi extracts it preserving structure. tests/ lands where pytest expects it. requirements.txt lands where the installer looks.
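
A minimal sketch of why the zip keeps what separate uploads lose, using only Python's standard `zipfile` module. The file names mirror the Flask example above. The helpers `zip_project` and `extract_project` are hypothetical names for illustration, not jhansi's implementation.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_project(project_dir: Path, archive: Path) -> None:
    """Zip a project with paths stored relative to its root (like `zip -r ../x.zip .`)."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(project_dir).as_posix())


def extract_project(archive: Path, workspace: Path) -> list[str]:
    """Extract into the workspace and return the relative paths that landed there."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workspace)
    return sorted(
        p.relative_to(workspace).as_posix()
        for p in workspace.rglob("*")
        if p.is_file()
    )


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        project = root / "my_project"
        (project / "tests").mkdir(parents=True)
        (project / "app.py").write_text("print('hello')\n")
        (project / "requirements.txt").write_text("flask\n")
        (project / "tests" / "test_app.py").write_text("def test_ok():\n    assert True\n")

        # The archive lives next to the project, not inside it.
        archive = root / "my_project.zip"
        zip_project(project, archive)

        workspace = root / "workspace"
        workspace.mkdir()
        return extract_project(archive, workspace)


if __name__ == "__main__":
    print(demo())
```

The point is the relative paths: `tests/test_app.py` comes out of the archive under `tests/`, so a test runner started from the workspace root sees the same layout it would see locally.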

This also killed the filename param. You now pass the actual command:

curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/exec \
  -H "Content-Type: application/json" \
  -d '{"command": "python app.py", "test": true}'

Language-agnostic. Python, Node, Go, Java. Same API. jhansi handles the runtime.
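
For callers that would rather not shell out to curl, here is a sketch of the same exec request built with Python's standard library. The URL, the sandbox id, and the `command` and `test` fields come from the curl example. The helper name `build_exec_request` is made up here, and since the post doesn't show the response format, the sketch only builds the request and leaves sending it commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # host and port taken from the curl examples


def build_exec_request(sandbox_id: str, command: str, test: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST that the curl example issues."""
    body = json.dumps({"command": command, "test": test}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/sandboxes/{sandbox_id}/exec",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_exec_request("sb_abc123", "python app.py")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending needs a running jhansi instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```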


What test mode actually does

When test: true:

  1. Install deps: blocking. Wait for pip install to finish. This was bug #2.
  2. Start your app: detached, in the background.
  3. Wait 2s for the server to bind to its port.
  4. Run tests: pytest, jest, go test, mvn test. Auto-detected.
  5. Return output: stdout, stderr, test summary.
  6. Kill container: no state leaks.

Test runner needs zero config. If pytest finds it locally, we find it in the sandbox.
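
One way the auto-detection in step 4 could work is by looking for marker files in the extracted project. The post doesn't say how jhansi picks the runner, so the marker files, their order, and the `detect_test_command` helper below are all assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Assumed marker file -> test command mapping. Not jhansi's actual rules.
RUNNERS = [
    ("go.mod", ["go", "test", "./..."]),
    ("pom.xml", ["mvn", "test"]),
    ("package.json", ["npx", "jest"]),
    ("requirements.txt", ["pytest"]),
]


def detect_test_command(project_dir: Path) -> Optional[list[str]]:
    """Return the test command for the first marker file found, or None."""
    for marker, command in RUNNERS:
        if (project_dir / marker).is_file():
            return command
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "requirements.txt").write_text("flask\n")
        print(detect_test_command(root))
```

A lookup like this only works if the project structure survived the upload, which is the same reason the zip matters.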

The dependency race condition

The first version ran install + app start in one Docker command.

Container starts → pip install begins → python app.py tries to start → pytest fires 2s later.

But pip install flask was still downloading. Server wasn't up. Tests hit ConnectionRefused.
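
What the tests ran into can be reproduced in a few lines: a port only accepts connections once the server is actually listening on it. This is a standalone illustration with plain sockets, not jhansi code, and `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False


def demo() -> tuple[bool, bool]:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Port 0 asks the OS for any free port.
        server.bind(("127.0.0.1", 0))
        port = server.getsockname()[1]

        # Bound but not listening yet: the "server still starting" window.
        before = port_is_open("127.0.0.1", port)

        # Once the server listens, the same connection attempt succeeds.
        server.listen()
        after = port_is_open("127.0.0.1", port)
    return before, after


if __name__ == "__main__":
    print(demo())
```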

The fix: serialize it.

  1. Install deps. Block until done.
  2. Start app. Detach.
  3. Sleep 2s.
  4. Test.

Obvious in hindsight. You only learn this by shipping and watching it fail.
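
The serialized sequence can be sketched with Python's `subprocess` module. This is an illustration under assumptions: jhansi runs these steps inside a Docker container, while the sketch uses plain local processes, and `run_test_mode` and its parameters are hypothetical names.

```python
import subprocess
import sys
import time


def run_test_mode(install_cmd, app_cmd, test_cmd, startup_wait=2.0):
    """Run install, app start, and tests strictly in that order."""
    # 1. Install deps. run() blocks until the command exits; check=True raises on failure.
    subprocess.run(install_cmd, check=True)

    # 2. Start the app detached. Popen returns immediately.
    app = subprocess.Popen(app_cmd)
    try:
        # 3. Give the server time to bind to its port.
        time.sleep(startup_wait)
        # 4. Run the tests and capture their output.
        return subprocess.run(test_cmd, capture_output=True, text=True)
    finally:
        # Stop the app whether or not the tests passed.
        app.terminate()
        app.wait()


if __name__ == "__main__":
    py = sys.executable
    # Stand-in commands so the sketch runs anywhere. A real call would pass
    # something like ["pip", "install", "-r", "requirements.txt"],
    # ["python", "app.py"], and ["pytest"].
    result = run_test_mode(
        install_cmd=[py, "-c", "print('deps installed')"],
        app_cmd=[py, "-c", "import time; time.sleep(60)"],
        test_cmd=[py, "-c", "print('1 passed')"],
        startup_wait=0.2,
    )
    print(result.returncode, result.stdout.strip())
```

The blocking `subprocess.run` call for the install step is what removes the race: the app cannot start until the installer has exited.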

The honest bit

We shipped test mode in v0.4. It works. All four languages tested end-to-end.

But it took discovering that AI generates projects, not scripts, to get there.

The first design was for the demo. The second design is for the world AI actually creates.

This is why building in public matters. Not to announce features. To document how the problem reveals itself when you touch it.


What's next

v0.5 is serve mode — start a server, get a temporary preview URL, share it with your team, kill it when you're done.

The last verification step before you deploy anywhere real. No more "works on my machine" from an LLM.

Code is open source at github.com/jhansi-io/petri. Apache 2.0. Self-host today.

Building AI tooling at a bank or fintech and this sounds familiar? I want to hear from you.


jhansi.io — the missing runtime layer for AI-generated code.

Top comments (1)

Mateo Ruiz

This is a great example of a pattern I'm seeing across AI tooling: the first version is built around how humans think about the problem, and the second version is built around how AI actually behaves.

"Upload a file and run it" sounds reasonable until you realize AI rarely generates single files anymore. It generates projects with dependencies, tests, configs, and assumptions about directory structure. At that point, preserving the repo becomes more important than executing the code.

The dependency race condition is another good lesson. Verification isn't just running tests; it's recreating the environment correctly. A green test suite only means something if installation, startup, and execution happen in the right order.

This is also why runtime validation is becoming such an important layer in AI-assisted development. At IT Path Solutions, we've found that generating code is usually the easy part. The harder challenge is creating reliable environments where generated code can be tested, validated, and observed before it reaches production. The shift from "AI writes code" to "AI-generated code needs a verification pipeline" feels like one of the biggest lessons the industry is learning right now.