Arun Raghunath

Posted on Jun 8

We built test mode. Then discovered it was broken.

#buildinpublic #ai #devops #opensource

Part of building jhansi.io in public.

Test mode sounded simple. Upload code, pass a command, jhansi runs it + your test suite. Done.

Except it wasn't done. First run: empty output. No errors. Just silence.

Here's what broke — and how it changed how we think about AI-generated code.

The original idea

AI writes code. Scripts, APIs, full backends. But code without proof is a liability.

Test mode is the proof. You upload a project to a jhansi sandbox, pass the command that starts your app, and jhansi:

Runs the command
Waits for the server to come up
Executes your test suite against it
Returns results
Kills everything

All inside an isolated container. Nothing escapes. Nothing persists.

This is the verification layer missing from Cursor, Claude Code, Windsurf. They generate. We verify.

The problem we didn't anticipate

v0.4 of test mode accepted a filename.

Upload app.py, call exec with filename: "app.py", jhansi figures out how to run it.

The problem: real projects aren't single files.

A Flask app is app.py + tests/ + requirements.txt. When we uploaded them separately, they landed flat in the workspace. pytest couldn't find tests/. The installer couldn't find requirements.txt.

We built test mode for the toy world. But AI doesn't generate toys. It generates projects.

AI agents don't write hello_world.py. They write repos.

The fix: projects are zips, not files

Obvious once you see it. Upload the whole project as a zip.

# Zip the project contents; the subshell leaves you in the parent directory
(cd my_project && zip -r ../my_project.zip .)

# Upload to sandbox
curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/upload \
  -F "file=@my_project.zip"

jhansi extracts it preserving structure. tests/ lands where pytest expects it. requirements.txt lands where the installer looks.

This also killed the filename param. You now pass the actual command:

curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/exec \
  -H "Content-Type: application/json" \
  -d '{"command": "python app.py", "test": true}'

Language-agnostic. Python, Node, Go, Java. Same API. jhansi handles the runtime.
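For callers that prefer code over curl, the same exec call can be sketched in Python with only the standard library. The URL and JSON fields come from the curl example above; the helper name `build_exec_request` is hypothetical, and the response format is not documented in this post, so the sketch stops at building the request.

```python
import json
import urllib.request

# Host, port, and path mirror the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_exec_request(sandbox_id: str, command: str, test: bool = True) -> urllib.request.Request:
    """Build the POST request for /sandboxes/<id>/exec (hypothetical helper)."""
    body = json.dumps({"command": command, "test": test}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/sandboxes/{sandbox_id}/exec",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running server:
#   with urllib.request.urlopen(build_exec_request("sb_abc123", "python app.py")) as resp:
#       print(resp.read().decode())
```

Only the command string changes between languages; the request shape stays the same.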

What test mode actually does

When test: true:

Install deps: blocking. Wait for pip install to finish. This was bug #2.
Start your app: detached, in the background
Wait 2s for the server to bind to its port
Run tests: pytest, jest, go test, mvn test. Auto-detected.
Return output: stdout, stderr, test summary
Kill container: no state leaks

The test runner needs zero config. If pytest finds it locally, we find it in the sandbox.
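One plausible way to auto-detect the runner is to look for a well-known marker file in the project root. This is a sketch of that idea only: the marker table and the function name `detect_test_command` are assumptions, and jhansi's real detection rules may differ.

```python
from pathlib import Path

# Hypothetical marker-file table, checked in order. Not jhansi's actual rules.
RUNNERS = [
    ("package.json", ["npx", "jest"]),
    ("go.mod", ["go", "test", "./..."]),
    ("pom.xml", ["mvn", "test"]),
    ("requirements.txt", ["pytest"]),
    ("pyproject.toml", ["pytest"]),
]


def detect_test_command(project_dir: str):
    """Return the test command for the first marker file found, else None."""
    root = Path(project_dir)
    for marker, command in RUNNERS:
        if (root / marker).is_file():
            return command
    return None
```

Because the check runs against the extracted project root, it only works when the upload kept its directory structure.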

The dependency race condition

v1 ran install + app start in one Docker command.

Container starts → pip install begins → python app.py tries to start → pytest fires 2s later.

But pip install flask was still downloading. Server wasn't up. Tests hit ConnectionRefused.

The fix: serialize it.

Install deps. Block until done.
Start app. Detach.
Sleep 2s.
Test.

Obvious in hindsight. You only learn this by shipping and watching it fail.
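The serialized order can be sketched with Python's subprocess module. This is an illustration under assumptions, not jhansi's implementation: the function name `run_test_mode` and its parameters are hypothetical, and it runs plain local processes rather than a container.

```python
import subprocess
import time


def run_test_mode(install_cmd, app_cmd, test_cmd, cwd=".", startup_wait=2.0):
    """Install, start the app, wait, test, then tear down (hypothetical helper).

    Returns (exit_code, stdout, stderr) of the test command.
    """
    # 1. Install deps. subprocess.run blocks until the installer exits,
    #    and check=True aborts the run if installation fails.
    subprocess.run(install_cmd, cwd=cwd, check=True)

    # 2. Start the app in the background; Popen returns immediately.
    app = subprocess.Popen(
        app_cmd, cwd=cwd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # 3. Fixed wait so the server can bind to its port.
        time.sleep(startup_wait)

        # 4. Run the tests and capture their output.
        result = subprocess.run(test_cmd, cwd=cwd, capture_output=True, text=True)
        return result.returncode, result.stdout, result.stderr
    finally:
        # 5. Always stop the app, even if the test step raised.
        app.terminate()
        try:
            app.wait(timeout=5)
        except subprocess.TimeoutExpired:
            app.kill()
            app.wait()
```

A call might look like `run_test_mode(["pip", "install", "-r", "requirements.txt"], ["python", "app.py"], ["pytest"])`. The key property is that step 2 cannot begin until step 1 has returned, which removes the race described above.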

The honest bit

We shipped test mode in v0.4. It works. All four languages tested end-to-end.

But it took discovering that AI generates projects, not scripts, to get there.

The first design was for the demo. The second design is for the world AI actually creates.

This is why building in public matters. Not to announce features. To document how the problem reveals itself when you touch it.

What's next

v0.5 is serve mode — start a server, get a temporary preview URL, share it with your team, kill it when you're done.

The last verification step before you deploy anywhere real. No more "works on my machine" from an LLM.

Code is open source at github.com/jhansi-io/petri. Apache 2.0. Self-host today.

Building AI tooling at a bank or fintech and this sounds familiar? I want to hear from you.

jhansi.io — the missing runtime layer for AI-generated code.

Top comments (4)

Mateo Ruiz • Jun 8

This is a great example of a pattern I'm seeing across AI tooling: the first version is built around how humans think about the problem, and the second version is built around how AI actually behaves.

"Upload a file and run it" sounds reasonable until you realize AI rarely generates single files anymore. It generates projects with dependencies, tests, configs, and assumptions about directory structure. At that point, preserving the repo becomes more important than executing the code.

The dependency race condition is another good lesson. Verification isn't just running tests; it's recreating the environment correctly. A green test suite only means something if installation, startup, and execution happen in the right order. This is also why runtime validation is becoming such an important layer in AI-assisted development. At IT Path Solutions, we've found that generating code is usually the easy part. The harder challenge is creating reliable environments where generated code can be tested, validated, and observed before it reaches production. The shift from "AI writes code" to "AI-generated code needs a verification pipeline" feels like one of the biggest lessons the industry is learning right now.

Arun Raghunath • Jun 10

Exactly this. "AI writes code" was always the easier half of the problem. The harder half is what you've described — reproducible environments, ordered execution, observable outcomes. That's what jhansi.io is being built for. Would love to hear more about what you're doing at IT Path Solutions around this — sounds like you're hitting the same walls.

Mateo Ruiz • Jun 10

Appreciate that. We have been seeing a similar pattern with AI-generated MVPs and internal tooling projects. The code generation itself is rarely the blocker anymore. The recurring issues tend to be around reproducibility, dependency management, environment drift, test reliability, and understanding whether generated changes are actually safe to deploy.

One thing we've learned is that validation needs to happen at multiple layers: not just tests, but environment setup, observability, rollback paths, and production behavior under real usage. That's where many AI-generated projects start diverging from demo-ready to production-ready.

Curious to see how you're approaching observability and debugging inside the jhansi runtime as more agent-generated projects start running through it.

Arun Raghunath • Jun 11

Thanks for the thoughtful response. The layers you're describing (environment setup, rollback paths, production behaviour) are exactly the pre-deployment pipeline. That's a hard problem worth solving.

Jhansi sits one layer earlier: not 'is this safe to deploy' but 'can I run this at all without it becoming a security incident.' The assumption is the code is already untrusted — AI-generated, unreviewed, potentially exfiltrating secrets or making arbitrary network calls.

On observability inside the runtime — that's v0.9. Execution logs, structured output, error capture per run. The goal is: every execution leaves a trace you can reason about, not just a pass/fail. Would be curious what signals matter most in your workflow — whether it's the environment state before execution or the behaviour during.