DEV Community: Arun Raghunath

LLMs Solved Language. That Was the Easy Part.

Arun Raghunath — Fri, 10 Jul 2026 08:57:19 +0000

A few years ago, if you wanted to build an intelligent chatbot, most of your effort went into getting the computer to understand people. Intent classification, entity extraction, stemming, confidence scores, dialogue trees, fallback logic. You annotated thousands of examples, hand-crafted regular expressions, and tuned thresholds until the system stopped misreading "cancel my order" as a request to place one.

The hardest problem was not deciding what the system should do. It was figuring out what the user actually meant.

Today, that problem is largely solved.

Large Language Models have made natural language one of the easiest parts of the stack. We rarely train intent classifiers anymore. We do not spend weeks annotating data or crafting brittle pattern matches. We hand a request to an LLM and, more often than not, it understands.

So the bottleneck moved. The question stopped being:

How do I make the computer understand English?

It became:

Now that it understands, what should it do — and how do I make it do that reliably?

That second question is where most of the difficulty now lives.

The engineering shifted up the stack

The challenges are no longer primarily about language processing. They are about systems engineering:

agent orchestration
tool calling
memory and state
permission models
sandboxed execution
observability
evaluation
reliability
cost control
governance
human approval

Ironically, software engineering has become more important, not less.

Calling an LLM API is easy. Building a system that can reliably interact with dozens of services, recover from failures, execute code safely, respect permissions, remain observable in production and scale to thousands of users is anything but. Most of that difficulty is invisible in a demo. A demo needs one happy path to work once. A product needs every path to fail safely, ten thousand times a day.

It is worth being concrete about why these problems are difficult, because "orchestration is hard" is the kind of statement everyone agrees with and no one learns from. Consider four of them.

Sandboxed execution

The moment an agent can run code — even if it is only doing arithmetic, transforming data or generating a report — you have given an untrusted author the ability to execute arbitrary logic inside your infrastructure. That author is the model, influenced by whoever or whatever supplies its context.

The naive implementation is to run generated code inside the application process, or inside a long-lived container reused across requests. Both approaches are quietly dangerous. Code written by the model may be able to read environment variables, access credentials, make outbound network calls, consume unbounded CPU or memory, inspect files it should not see, or leave state behind that leaks into the next execution.

Doing this properly means treating every execution as hostile. You need real isolation — separate namespaces, a stronger sandbox, a microVM, or technologies such as gVisor or Firecracker rather than a shared runtime. You need no ambient credentials inside the execution environment, controlled network egress so generated code cannot simply phone home, hard limits on CPU, memory, time and filesystem usage, and an ephemeral environment that is destroyed after each run.
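As a rough illustration of what those limits look like in practice, here is a minimal sketch that builds a locked-down `docker run` command. The `SandboxLimits` class, the function name and the specific numbers are assumptions made up for this example, not Jhansi's actual configuration. A plain container like this still shares the host kernel, so treat it as the floor rather than the stronger isolation that gVisor or a microVM would give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxLimits:
    """Hypothetical limits; the numbers are placeholders, not recommendations."""
    memory: str = "256m"
    cpus: str = "0.5"
    pids: int = 64


def build_docker_argv(image: str, command: list[str], limits: SandboxLimits) -> list[str]:
    """Build a `docker run` argv for one throwaway, locked-down execution."""
    return [
        "docker", "run",
        "--rm",                            # remove the container after the run
        "--network", "none",               # no network egress at all
        "--read-only",                     # read-only root filesystem
        "--cap-drop", "ALL",               # drop all Linux capabilities
        "--memory", limits.memory,         # hard memory cap
        "--cpus", limits.cpus,             # CPU cap
        "--pids-limit", str(limits.pids),  # cap on the number of processes
        image,
        *command,                          # no -e flags: nothing is passed in from the host env
    ]
```

A wall-clock limit is deliberately left out of the sketch: a timeout on the Docker CLI process does not by itself stop the container, so the time limit needs its own enforcement.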

While building Jhansi, this was the shift that became impossible to ignore. What initially looked like "let the model execute some Python" quickly became a runtime problem. The interesting questions were no longer about prompting. They were about process isolation, sandbox lifecycle, resource limits, auditability, and what an execution environment should be allowed to see.

The difficult part was never running the code. It was running code you did not write, cannot fully predict and should not trust — safely, on every request.

Permission models and tool calling

When you give an agent tools, the model decides which tool to call and what arguments to supply. That decision is non-deterministic, and — more importantly — it can be influenced by input you do not control.

The naive implementation is to register every tool with application-level credentials and allow the model to choose freely. That creates a classic confused-deputy problem: a prompt injection hidden inside a document, email or web page may persuade the agent to call a destructive tool on the attacker's behalf, using the application's authority.

The model is not malicious. It is obedient. And obedience is the vulnerability.

Getting this right pushes you towards least privilege at the action level rather than the application level. The agent should not hold long-lived credentials. High-blast-radius actions should require explicit authorisation, policy checks, or a human in the loop. Every tool call should be attributable, reviewable and revocable — the system should be able to say who authorised an action, whose permissions were used, what data was exposed, what the model was attempting to do, whether it could have been prevented, and whether it can be reversed.

At that point, you are no longer writing prompt logic. You are designing an authorisation system that happens to have a language model sitting in the middle of it.
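A minimal sketch of action-level least privilege might look like the following. The tool names, scope strings and the `POLICY` table are all hypothetical, invented for illustration. The point is only that the check runs against the user's permissions rather than the application's, and that unknown or high-blast-radius tools never pass straight through.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class ToolCall:
    tool: str
    user_id: str                  # the human on whose behalf the agent acts
    user_scopes: frozenset[str]   # that user's permissions, not the app's


# Hypothetical policy table: tool name -> (required scope, high blast radius?)
POLICY = {
    "read_invoice": ("invoices:read", False),
    "refund_payment": ("payments:write", True),
}


def authorize(call: ToolCall) -> Decision:
    rule = POLICY.get(call.tool)
    if rule is None:
        return Decision.DENY  # unknown tools are denied by default
    scope, high_blast_radius = rule
    if scope not in call.user_scopes:
        return Decision.DENY  # the user could not do this directly either
    # Destructive actions are routed to a human instead of running directly.
    return Decision.NEEDS_APPROVAL if high_blast_radius else Decision.ALLOW
```

Because every decision comes out of one function, it is also the natural place to write the audit record of who authorised what.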

Observability

With deterministic software, when something fails you inspect the logs or read the stack trace. With an agent, "why did it do that?" may span a non-deterministic model decision, retrieved context, accumulated memory, multiple tool calls, external API responses, retries, intermediate outputs and a final action. A stack trace captures almost none of that.

To debug an agent in production, you need traces that record the complete decision-and-tool-call chain: token and cost accounting for each step, plus the prompt, retrieved context, tool arguments, tool outputs and policy decisions that shaped the run. Ideally you also need the ability to replay the workflow against the same context. Without that, you are staring at an outcome with no reliable way to reconstruct the sequence of information that produced it. The context may already be gone, the model may produce a different answer if you run it again, and the downstream service may have changed.

Debugging shifts from read the error and find the faulty line to reconstruct a decision the system made from a chain of state, context and external events. That is a different engineering discipline, and many teams only discover they need it after the first serious production incident.
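To make that concrete, here is a small sketch of a trace that records each step of a run with its inputs, outputs, tokens and cost. The `Step` and `Trace` names and fields are assumptions chosen for this example; a real system would more likely use an existing tracing library, but the shape of the data is the important part.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    kind: str       # e.g. "model", "retrieval", "tool", "policy"
    inputs: dict    # prompt, retrieved context or tool arguments
    outputs: dict   # model reply, tool result or policy decision
    tokens: int = 0
    cost_usd: float = 0.0


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def to_json(self) -> str:
        # Serialised traces can be stored and later inspected or replayed.
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the inputs alongside the outputs is what makes a later replay possible, since the original context may no longer exist anywhere else.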

Evaluation

Traditional software has unit tests. Agentic systems need something broader, because a workflow that succeeds today may fail tomorrow even when your own source code has not changed. The model may have been updated, a prompt may have drifted, a retrieval source may now contain different information, a downstream tool may return a new response shape, or the model may simply choose a different sequence of actions for the same request.

A conventional test suite can verify that the code runs. It cannot always verify that the system still makes good decisions.

Teams therefore need evaluation suites built around representative tasks and expected outcomes — measuring not just task success, but factual accuracy, policy compliance, tool-selection quality, latency, token usage, monetary cost, consistency, safety and escalation behaviour. You also need regression datasets built from real failures. Every production incident should become an evaluation case. Every unexpected tool call should become a scenario. Every prompt injection that almost worked should become a permanent test.

The question changes from does the code compile? to does the system still behave well? That sounds like a small distinction. It is not. It changes what testing means.
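A regression suite of this kind can start very small. The sketch below assumes an agent is just a function from a prompt to a string outcome, which is a simplification, and `EvalCase` and `run_suite` are names invented here. Each case carries its own check, so a past incident can be added as one more entry in the list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the outcome is acceptable


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case and report which ones no longer behave well."""
    failures = []
    for case in cases:
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # a crash counts as a failed case
        if not ok:
            failures.append(case.name)
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"passed": passed, "failed": failures, "pass_rate": pass_rate}
```

Because model output varies between runs, a real suite would probably run each case several times and track the pass rate over time instead of relying on a single pass or fail.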

We have become obsessed with prompts

Prompt engineering has its place. A carefully designed prompt can improve reasoning, constrain output and reduce ambiguity. But prompts are increasingly becoming an implementation detail. Models continue to improve, reasoning becomes stronger, context windows grow, and tool-use capabilities become more standardised. A prompt advantage is also easy to copy.

The competitive advantage is unlikely to come from discovering a clever phrase that every competitor can reproduce tomorrow. It will come from designing better systems around the intelligence.

The companies that win will not necessarily have the smartest model. They will build the best products around it — products that are reliable, secure, observable, affordable, governable, scalable and useful. The model is increasingly a commodity input. What you build around it is not.

We have been here before

In the early days of the web, one of the hard problems was making a page render correctly. Frameworks made that easier. Then the challenge moved to server-side applications, then to APIs, then to distributed systems, then to cloud infrastructure, deployment automation, observability and platform engineering. Each new abstraction removed one category of work and exposed another. The work did not disappear. It moved.

AI feels like the next version of that transition. LLMs removed much of the effort required to interpret natural language, but that did not eliminate engineering — it made a new class of products possible, and those products still have to be secured, operated, tested and scaled. Every generation of tooling creates the impression that engineering is about to become unnecessary. In practice, better abstractions usually raise the level at which engineering happens.

We no longer spend as much time asking whether a machine can understand a sentence. Now we have to decide what authority it should have, how it should act, what happens when it is wrong, and how the rest of the system should contain the consequences. Those are not smaller problems. They are bigger ones.

If anything, LLMs increased the amount of systems work available, because they made previously impossible products feasible — and everything that becomes possible eventually has to be made reliable. The engineers who thrive over the next decade will not simply know how to use AI. They will know how to architect systems around it, integrate it with real platforms, constrain its authority, evaluate its behaviour, operate it safely, recover when it fails, and turn intelligence into something customers can trust.

The question is no longer:

Can AI understand the user?

It is:

Can we build systems that make that understanding useful — reliably, safely and at scale?

Language is no longer the bottleneck. Systems are.

We fixed output corruption. Then built persistence. Then TTL. All in v0.6

Arun Raghunath — Thu, 11 Jun 2026 17:25:58 +0000

Running untrusted AI-generated code safely is the obvious hard problem.

But sometimes the problems that break an agent workflow look like boring infrastructure work.

v0.6 began as plumbing:

Persistent sandbox registry
Automatic cleanup with TTL

Necessary, but not particularly glamorous.

Then the tests started failing.

The output corruption problem

Every execution returned something like this:

WARNING: Running pip as the 'root' user can result in broken permissions...
[notice] A new release of pip is available: 25.0.1 -> 26.1.2
hello world

The actual program output was buried under dependency-installation noise.

For a human reading a terminal, that is annoying.

For an AI agent parsing execution output, it is broken.

The cause was straightforward: dependency installation and code execution were chained into a single Docker call, with stderr redirected into stdout.

Everything ended up in the same stream.

The fix: two Docker calls, not one

We separated the operations.

Call 1: Install dependencies silently.

import subprocess

# Install dependencies; discard both output streams.
subprocess.run(
    dependency_install_command,  # argv list for the install step
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Call 2: Execute the user command and capture its output.

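# Run the user's command; only its own output is captured.
result = subprocess.run(
    execution_command,  # argv list for the user's command
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # the program's own stderr stays with its stdout
)

return result.stdout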

It is a small change, but the principle matters:

When infrastructure is built for AI agents, clean output is part of the API contract.

Agents parse what you return. Installation logs, warnings and runtime output cannot be treated as one undifferentiated stream.
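The same two-step idea can be tried locally without Docker. The sketch below is a stand-in for illustration, not Petri's code: `run_clean` is a hypothetical helper, and `check=True` is an addition assumed here so that a failed install raises instead of being silently discarded along with its output.

```python
import subprocess
import sys


def run_clean(install_cmd: list[str], exec_cmd: list[str]) -> str:
    """Run a noisy setup step silently, then return only the program's output."""
    # Step 1: the install step. Its stdout and stderr are thrown away.
    subprocess.run(
        install_cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,  # assumption: fail loudly if the install itself fails
    )
    # Step 2: the actual command. Only this output reaches the caller.
    result = subprocess.run(
        exec_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    noisy_install = [sys.executable, "-c", "import sys; print('WARNING: noise', file=sys.stderr)"]
    program = [sys.executable, "-c", "print('hello world')"]
    print(run_clean(noisy_install, program), end="")
```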

Persistence: SQLite instead of an in-memory dictionary

The original sandbox registry was a Python dictionary.

Restart the service, and every sandbox record disappeared.

The containers might still exist, but Jhansi no longer knew about them. Any agent workflow expecting to reconnect after a service restart would fail.

We considered:

JSON: simple, but vulnerable to partial writes and corruption during crashes
Redis: native TTL and a good operational model, but another service for self-hosters to run
SQLite: durable, transactional and already included with Python

We chose SQLite.

The schema is intentionally small:

CREATE TABLE IF NOT EXISTS sandboxes (
    id TEXT PRIMARY KEY,
    language TEXT NOT NULL,
    container_id TEXT,
    workspace_path TEXT,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL,
    expires_at TEXT NOT NULL
);

No ORM.

No migration framework.

Just SQLite doing what SQLite is good at.
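For a sense of how little code that takes, here is a minimal sketch of a registry on top of Python's built-in sqlite3 module, using the schema above. The `Registry` class and its method names are assumptions for illustration and are not taken from the Jhansi codebase.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS sandboxes (
    id TEXT PRIMARY KEY,
    language TEXT NOT NULL,
    container_id TEXT,
    workspace_path TEXT,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL,
    expires_at TEXT NOT NULL
);
"""


class Registry:
    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row  # rows can be read by column name
        self.conn.execute(SCHEMA)
        self.conn.commit()

    def add(self, sandbox_id: str, language: str, ttl_seconds: int = 3600) -> None:
        now = datetime.now(timezone.utc)
        expires = now + timedelta(seconds=ttl_seconds)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO sandboxes (id, language, status, created_at, expires_at) "
                "VALUES (?, ?, ?, ?, ?)",
                (sandbox_id, language, "running", now.isoformat(), expires.isoformat()),
            )

    def get(self, sandbox_id: str):
        cur = self.conn.execute("SELECT * FROM sandboxes WHERE id = ?", (sandbox_id,))
        return cur.fetchone()  # None when the sandbox is unknown
```

Pointing `Registry` at a file path rather than `":memory:"` is what lets the records outlive a service restart.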

TTL: last active, not creation time

Each sandbox receives an expires_at value, initially one hour after creation.

The important decision is that every execution resets the clock:

new_expires = (
    datetime.now(timezone.utc)
    + timedelta(seconds=TTL_SECONDS)
)

registry.update_expires_at(
    sandbox_id,
    new_expires,
)

A background task runs every 60 seconds and removes expired sandboxes.

This makes the TTL activity-based rather than age-based.

An agent may perform dozens of small executions during a 20-minute analysis. A creation-time TTL can terminate the sandbox in the middle of an active workflow.

A last-active TTL does not.

Active sandboxes remain available. Only idle ones are cleaned up.
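The difference is easy to see in a small simulation. The sketch below keeps expiry times in a plain dictionary and takes the current time as a parameter so it can be tested; `touch` and `sweep` are hypothetical names, and the real service stores these values in the registry instead.

```python
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 3600  # one hour, matching the initial TTL described above


def touch(expiry: dict[str, datetime], sandbox_id: str, now: datetime) -> None:
    """Reset the clock for a sandbox that has just executed something."""
    expiry[sandbox_id] = now + timedelta(seconds=TTL_SECONDS)


def sweep(expiry: dict[str, datetime], now: datetime) -> list[str]:
    """Remove and return the ids of sandboxes whose expiry has passed."""
    expired = [sid for sid, at in expiry.items() if at <= now]
    for sid in expired:
        del expiry[sid]
    return expired


if __name__ == "__main__":
    t0 = datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)
    expiry: dict[str, datetime] = {}
    touch(expiry, "active", t0)
    touch(expiry, "idle", t0)
    touch(expiry, "active", t0 + timedelta(minutes=50))   # still being used
    print(sweep(expiry, t0 + timedelta(minutes=61)))       # only "idle" is removed
```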

What this unlocks

With persistence and activity-based TTL, Jhansi sandboxes are becoming reliable execution primitives:

Create a sandbox once.

Use it repeatedly.

Survive service restarts.

Trust that active work will not disappear underneath the agent.

That is the foundation longer-running agent workflows need.

Next in v0.7: streaming execution through Server-Sent Events.

No more waiting for the entire command to finish before seeing its output.

Jhansi is an open-source cloud sandbox for running AI-generated code safely.

Self-host it with:

docker compose up

AI agents need execution, not credentials.

Star it if this problem resonates: https://github.com/jhansi-io/petri

pip install jhansi — the SDK is live

Arun Raghunath — Mon, 08 Jun 2026 19:49:30 +0000

Six weeks ago, running code on jhansi.io meant curl + sandbox IDs + manual cleanup.

Today it looks like this:

from jhansi import Sandbox

with Sandbox(language="python") as sb:
    sb.upload_file("main.py")
    result = sb.exec("python main.py")
    print(result["output"])

That's the milestone. The SDK is live.

Why this matters

The API was always there. Petri — the execution engine underneath — has been running code in isolated Docker containers since v0.1. But you had to understand HTTP, manage container lifecycle, and remember to delete sandboxes or you'd leak resources.

The SDK removes all of that. You write Python. jhansi.io handles the rest.

The context manager was non-negotiable

If you create a sandbox and forget to delete it, you leak containers and workspace storage. That's not acceptable — especially when AI agents are creating sandboxes programmatically.

The context manager makes cleanup automatic:

with Sandbox(language="python") as sb:
    # sandbox created here
    sb.upload_file("main.py")
    result = sb.exec("python main.py")
# sandbox deleted here — even if exec raised an exception

No leaked containers. No cleanup code. No surprises.
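The guarantee comes from Python's context manager protocol rather than anything specific to the SDK. The stand-in class below shows the mechanism; `DemoSandbox` is a made-up class for illustration and makes no network calls, so it says nothing about how the jhansi SDK is implemented internally.

```python
class DemoSandbox:
    """Stand-in for illustration; not the jhansi SDK's implementation."""

    def __init__(self, language: str) -> None:
        self.language = language
        self.deleted = False

    def __enter__(self) -> "DemoSandbox":
        # A real client would create the remote sandbox here.
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and also when the body raises.
        self.deleted = True
        return False  # do not swallow the exception


if __name__ == "__main__":
    try:
        with DemoSandbox("python") as sb:
            raise RuntimeError("exec failed")
    except RuntimeError:
        pass
    print(sb.deleted)  # cleanup ran even though the body raised
```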

The Docker-in-Docker problem

Self-hosting Petri via docker compose up uncovered something we hadn't anticipated.

Petri runs inside a Docker container. But Petri's job is to spin up Docker containers to run your code. So Petri needs access to Docker — from inside Docker.

Fix one: mount the Docker socket.

volumes:
  - /var/run/docker.sock:/var/run/docker.sock

Fix two: shared workspace path. Petri creates workspace folders inside its container. When it mounts those into sandbox containers, Docker looks for the path on the host — not inside Petri. The path doesn't exist.

volumes:
  - /var/run/docker.sock:/var/run/docker.sock
  - /tmp/petri-workspaces:/tmp/petri-workspaces
environment:
  - PETRI_WORKSPACE_ROOT=/tmp/petri-workspaces

Same path both sides. Docker finds it. Problem solved.

Getting started

# Start Petri
git clone https://github.com/jhansi-io/petri.git
cd petri
docker compose up

# Install the SDK
pip install jhansi

Full docs at docs.jhansi.io.

What's next

v0.6 — persistent registry so sandboxes survive Petri restarts
v0.7 — streaming exec, real-time output as your code runs
MCP server — Cursor and Claude Code use Petri directly instead of their own cloud.

The MCP server is the one I'm most excited about. More on that soon.

Star the repo if you're following the build. ⭐
github.com/jhansi-io/jhansi

We built test mode. Then discovered it was broken.

Arun Raghunath — Mon, 08 Jun 2026 09:00:11 +0000

Part of building jhansi.io in public.

Test mode sounded simple. Upload code, pass a command, jhansi runs it + your test suite. Done.

Except it wasn't done. First run: empty output. No errors. Just silence.

Here's what broke — and how it changed how we think about AI-generated code.

The original idea

AI writes code. Scripts, APIs, full backends. But code without proof is liability.

Test mode is the proof. You upload a project to a jhansi sandbox, pass the command that starts your app, and jhansi:

Runs the command
Waits for the server to come up
Executes your test suite against it
Returns results
Kills everything

All inside an isolated container. Nothing escapes. Nothing persists.

This is the verification layer missing from Cursor, Claude Code, Windsurf. They generate. We verify.

The problem we didn't anticipate

v0.4 of test mode accepted a filename.

Upload app.py, call exec with filename: "app.py", jhansi figures out how to run it.

The problem: real projects aren't single files.

A Flask app is app.py + tests/ + requirements.txt. When we uploaded them separately, they landed flat in the workspace. pytest couldn't find tests/. The installer couldn't find requirements.txt.

We built test mode for the toy world. But AI doesn't generate toys. It generates projects.

AI agents don't write hello_world.py. They write repos.

The fix: projects are zips, not files

Obvious once you see it. Upload the whole project as a zip.

# From inside your project
cd my_project && zip -r ../my_project.zip .

# Upload to sandbox
curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/upload \
  -F "file=@my_project.zip"

jhansi extracts it preserving structure. tests/ lands where pytest expects it. requirements.txt lands where the installer looks.

This also killed the filename param. You now pass the actual command:

curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/exec \
  -H "Content-Type: application/json" \
  -d '{"command": "python app.py", "test": true}'

Language-agnostic. Python, Node, Go, Java. Same API. jhansi handles the runtime.

What test mode actually does

When test: true:

Install deps — blocking. Wait for pip install to finish. This was bug #2.
Start your app — detached, in the background
Wait 2s for the server to bind to its port
Run tests — pytest, jest, go test, mvn test. Auto-detected.
Return output — stdout, stderr, test summary
Kill container — no state leaks

Test runner needs zero config. If pytest finds it locally, we find it in the sandbox.

The dependency race condition

v1 ran install + app start in one Docker command.

Container starts → pip install begins → python app.py tries to start → pytest fires 2s later.

But pip install flask was still downloading. Server wasn't up. Tests hit ConnectionRefused.

The fix: serialize it.

Install deps. Block until done.
Start app. Detach.
Sleep 2s.
Test.

Obvious in hindsight. You only learn this by shipping and watching it fail.
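The serialized order can be sketched with local processes. This is an illustration of the sequencing only: it uses plain subprocesses where jhansi uses containers, `run_test_mode` is a hypothetical name, and the fixed wait mirrors the 2s sleep described above.

```python
import subprocess
import sys
import time


def run_test_mode(install_cmd, app_cmd, test_cmd, startup_wait=2.0):
    """Install, start the app, wait, test, then always stop the app."""
    # 1. Install dependencies and block until the installer exits.
    subprocess.run(install_cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    # 2. Start the app in the background.
    app = subprocess.Popen(app_cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
    try:
        # 3. Give the server time to bind to its port.
        time.sleep(startup_wait)
        # 4. Run the tests and capture their output.
        tests = subprocess.run(test_cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, text=True)
        return tests.returncode, tests.stdout
    finally:
        # 5. Stop the app whether the tests passed, failed or raised.
        app.terminate()
        app.wait()


if __name__ == "__main__":
    code, out = run_test_mode(
        [sys.executable, "-c", "pass"],                         # stand-in installer
        [sys.executable, "-c", "import time; time.sleep(30)"],  # stand-in server
        [sys.executable, "-c", "print('1 passed')"],            # stand-in test run
        startup_wait=0.1,
    )
    print(code, out, end="")
```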

The honest bit

We shipped test mode in v0.4. It works. All four languages tested end-to-end.

But it took discovering that AI generates projects, not scripts, to get there.

The first design was for the demo. The second design is for the world AI actually creates.

This is why building in public matters. Not to announce features. To document how the problem reveals itself when you touch it.

What's next

v0.5 is serve mode — start a server, get a temporary preview URL, share it with your team, kill it when you're done.

The last verification step before you deploy anywhere real. No more "works on my machine" from an LLM.

Code is open source at github.com/jhansi-io/petri. Apache 2.0. Self-host today.

Building AI tooling at a bank or fintech and this sounds familiar? I want to hear from you.

jhansi.io — the missing runtime layer for AI-generated code.

Closing the execution gap, Part 2: Dependency management

Arun Raghunath — Sat, 06 Jun 2026 20:29:27 +0000

This is Part 2 of Closing the execution gap — a series on building jhansi.io, a cloud sandbox for AI-generated code.

The first question I got after shipping persistent sandboxes was predictable:

"Great — but do I still have to pip install everything myself?"

Yes. You did. That was embarrassing.

If the pitch is "run AI-generated code with zero friction," making users manage deps manually is a contradiction. For regulated teams it's worse: every new package is a supply-chain review. Friction kills adoption.

So v0.3 fixes it.

The problem with dependencies in sandboxes

Every sandbox starts as a clean container. Upload main.py, hit run, and you get:

ModuleNotFoundError: No module named 'requests'

The naive fix is to install at exec time. But downloading from PyPI on every run is slow, expensive, and brittle. pandas + numpy is a 40s cold start. Run that 100 times and your AI agent burns budget before it does anything useful.

The right fix: install once, persist forever.

Install once, persist forever

jhansi.io gives every sandbox a persistent workspace — a folder that survives across runs. In v0.3, dependencies live there too.

First exec: we detect deps, install to /sandbox/deps, run your code.
Second exec: deps are already there. Cold start drops dramatically.

This matters for AI agents. Humans tolerate a 30s install. Agents that try 5 approaches to solve a task can't. Workspace-scoped cache means failed attempt #1 pays the install tax. Attempts #2–5 run instantly.

First exec — install + run

$ curl -X POST .../sandboxes/sb_abc123/exec -d '{"filename": "main.py"}'
{
"output": "Installing requests==2.31.0...\n200\n"
}

Second exec — just run

$ curl -X POST .../sandboxes/sb_abc123/exec -d '{"filename": "main.py"}'
{
"output": "200\n"
}

That's the difference between "AI is too slow" and "AI is faster than a junior dev."

Manifest first, auto-detect as fallback

How do we know what to install? Both approaches, in the right order.

If you provide a manifest, we trust you. You know your deps better than any static analyser. If you don't, we fall back to auto-detection.

Priority for Python:

pyproject.toml → pip install
requirements.txt → pip install -r
Neither → pipreqs scan

pipreqs isn't just import requests → requests. It knows import cv2 means opencv-python, import sklearn means scikit-learn. You don't have to remember.

Using a manifest isn't just faster — it's auditable. Auditors can diff your pinned deps between runs. Auto-detect is for prototyping. More on auditability in Part 5.
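The Python priority order above can be written as a short function. This is a sketch under assumptions: `choose_python_install` is a hypothetical name, the exact pip arguments jhansi.io uses are not shown in this post, and the pipreqs call is the tool's basic documented form of pointing it at a project directory.

```python
from pathlib import Path

DEPS_DIR = "/sandbox/deps"  # the install location mentioned earlier in this post


def choose_python_install(workspace: Path) -> list[str]:
    """Return the argv for the first matching strategy, manifest first."""
    if (workspace / "pyproject.toml").is_file():
        # Install the project described by pyproject.toml, plus its dependencies.
        return ["pip", "install", "--target", DEPS_DIR, str(workspace)]
    requirements = workspace / "requirements.txt"
    if requirements.is_file():
        return ["pip", "install", "--target", DEPS_DIR, "-r", str(requirements)]
    # No manifest: fall back to scanning imports with pipreqs,
    # which writes a requirements.txt for the directory it is given.
    return ["pipreqs", str(workspace)]
```

Returning the command rather than running it keeps the decision easy to log, which fits the auditability argument above.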

Four languages, four strategies

AI doesn't just write Python. jhansi.io handles the four languages LLMs generate most:

Language	Manifest detected	Install command	No manifest fallback
Python	`pyproject.toml`, `requirements.txt`	`pip install --target /sandbox/deps`	`pipreqs` auto-detect
Node	`package.json`	`npm install`	Run as-is
Go	`go.mod`	`go mod download`	`go mod init` + `go mod tidy`
Java	`pom.xml`, `build.gradle`	Maven or Gradle	Direct `javac` compile

Each language keeps its own idioms. We don't impose a universal abstraction. Workspace-scoping means one sandbox's torch==2.1.0 can't poison another's torch==1.13. No dependency hell across AI runs.

The trust boundary

One decision worth documenting: we don't vet what gets installed.

Egress is restricted to official registries — PyPI, npm, Maven Central, proxy.golang.org — and nothing else. No arbitrary domains. What you install from those registries is your responsibility.

The contract is simple:

jhansi.io guarantees isolation. You own your code.

This is the same model as AWS Lambda or Cloud Run. We contain the blast radius. We don't audit your imports.

SBOM per exec — a full list of every package installed, with versions and licenses — is on the roadmap. Today we contain. Tomorrow we curate.

What's next

Two things didn't make v0.3:

Streaming output — dep installs can take 30s. Right now you wait. Soon you'll see output live and know exactly why torch is taking forever.
Missing import detection — if your manifest forgets a package, you get an ImportError today. We should surface the unlisted import in the response. Coming soon.

Next in the series: Isolation — you can pip install safely now. But can you stop that package from exfiltrating your AWS credentials? What "hard-sandboxed" actually means, why Docker isn't enough, and the attacks most sandboxes miss.

jhansi.io is open source (Apache 2.0) at github.com/jhansi-io. Follow the series on Dev.to, LinkedIn, and X.

Closing the execution gap: a series

Arun Raghunath — Sat, 06 Jun 2026 18:51:07 +0000

Every AI coding tool can write Python — Cursor, Claude Code, Windsurf. None of them can run it safely in production.

That gap between "AI wrote the code" and "the code ran safely" is exactly what I'm building jhansi.io to close.

This series documents the journey. One layer of the problem at a time.

The execution gap

When AI generates code, four things still stand between you and prod:

Dependencies — Install the right packages, with versions and licenses you trust
Isolation — Run it hard-sandboxed. No host access, no outbound network, no surprises
Secrets — Let AI use your API keys without ever letting it see or leak them
Audit — Log every execution. Prompt, code, result, timestamp. Compliance-grade.

Most teams stop at step 1. Banks and fintechs can't. FCA, SOC2, and the EU AI Act require audit trails for AI actions. You can't eval() your way through an audit.

jhansi.io is the missing run() for AI-generated code. Open core, cloud sandbox, built to close each part of the gap — layer by layer.

The series

Part 1 — Persistent sandboxes
Why "ephemeral" breaks debugging, state, and compliance. The case for giving every AI a home directory.
→ Read Part 1

Part 2 — Dependency management
Detecting, installing, and locking deps across Python, Node, Go, and Java. With SBOMs and policy built in.
→ Read Part 2

Part 3 — Isolation (coming soon)
What "hard isolation" actually means. Containers, Firecracker, zero trust networking, and the metadata service attacks you haven't thought of yet.

Part 4 — Secrets (coming soon)
Kernel-level proxies. AI can call Stripe without the key ever entering the sandbox.

Part 5 — Audit (coming soon)
Who ran what, when, with which prompt. Hash-chained logs that satisfy auditors, not just engineers.

Building this in public. Follow the series on Dev.to, LinkedIn, and X.

Code is Apache 2.0 at github.com/jhansi-io.

The case for persistent sandboxes in AI code execution

Arun Raghunath — Fri, 05 Jun 2026 09:56:53 +0000

Every AI coding tool generates code. None of them solve what happens next.

Cursor writes your Python. Claude Code refactors your script. Windsurf
ships your feature. But running that code safely, in isolation, with
audit trails, without exposing your secrets, is still an unsolved problem.

That's what Jhansi.io is built for.

The mistake we made in v0.1

Our first execution model was simple. Send code as a string, run it in
an isolated container, return the output.

POST /v1/sandboxes/{id}/exec
body: { "code": "print('hello world')" }

It worked. But it had three fundamental problems.

Problem	Why it breaks
Single file only	No multi-file projects, no shared modules, no dependencies. Not how production code works.
Full payload on every run	Even if nothing changed, you resend everything. Wasted bandwidth, added latency.
No foundation for delta sync	If you're sending everything every time, there's nothing to diff against.

The insight

A sandbox should be a workspace, not a disposable container.

Give every sandbox a dedicated folder on disk. Files live there between
runs. Execution just says "run this file" — no payload, no resend, no waste.

This is the architecture shift in Jhansi.io v0.2.

What changed

Workspace per sandbox. Every sandbox gets a dedicated folder on disk
at creation time. Zero config locally, overridable in production via
PETRI_WORKSPACE_ROOT.

File upload API. Upload once. Upload only when something changes.

POST /v1/sandboxes/{id}/files

Files land in the sandbox workspace and persist between runs.

Exec by filename. No code in the request body. Just a filename.

POST /v1/sandboxes/{id}/exec
body: { "filename": "main.py" }

Jhansi.io mounts the workspace into a fresh isolated container and runs
the named file. The container dies. The workspace survives.

The flow

# Create once
curl -X POST /v1/sandboxes -d '{"language": "python"}'

# Upload when files change
curl -X POST /v1/sandboxes/{id}/files -F "file=@main.py"

# Exec as many times as you need
curl -X POST /v1/sandboxes/{id}/exec -d '{"filename": "main.py"}'

What this unlocks

The persistent workspace is the foundation for everything coming next:

Delta sync — detect file changes, upload only diffs
Auto dependency detection — parse imports, install packages invisibly
Multi-file projects — real codebases, not toy scripts

If you're building AI agents that generate and run code, we want you in
our design partner program. Early access at jhansiio.featurebase.app

Jhansi.io — Build it. Run it. Ship it.

I got tired of running Docker manually. So I built a sandbox for AI-generated code.

Arun Raghunath — Thu, 04 Jun 2026 09:18:31 +0000

I've been on sabbatical for a few months. Writing code. Building projects.

And running Docker manually. Again. And again.

docker run. Check what's up. docker stop. Forget one. Find it next week eating RAM. Repeat.

At some point I asked: why is this still manual? Why can't containers just spin up, run, and die when they're done?

Then I threw AI into the mix.

Now I'm not just running my code. I'm running code a model wrote. Code I haven't audited line by line. Code that might have os.system(f'rm -rf {user_input}') because the model had a bad day.

That's a different problem.

The question nobody wants to answer

Cursor, Claude Code, Windsurf, Copilot. They all generate Python, Node, Go.

None of them answer: where does that code actually run?

Best case: you paste it into your terminal and hope.

Worst case: you're piping untrusted code into eval() with access to your .env file, your AWS creds, and your customer database.

In a startup that's risky.

In fintech that's an FCA fine and a conversation with Legal you do not want to have.

I spent 18 years in banking. I watched teams ban AI coding tools outright because nobody could answer: "Where does the generated code run, and what can it touch?"

So I decided to build the answer.

What ships today: Petri

jhansi.io starts with Petri, the execution engine. It's live right now.

What it does:

Spins up an isolated Docker container per request
Runs Python, Node, or Go code
Returns stdout/stderr
Tears down the container. Zero state left behind.

The API:

POST /v1/sandboxes → Create sandbox, get sb_
POST /v1/sandboxes/{id}/exec → Run code, get output
DELETE /v1/sandboxes/{id} → Destroy it. Gone.

No Docker CLI. No Compose files. No "wait, is sad_fermat still running from Tuesday?"

Petri answers "where does code run". That's it. It does not touch secrets. It does not produce compliance audit logs.

Why existing tools don't cut it

E2B, Modal, Daytona are great tools. I use them. But they're SaaS only.

	E2B / Modal / Daytona	jhansi.io with Petri
Hosting	Public cloud only	Self-hosted or managed SaaS
Data residency	Your code runs on their infra	Runs in your VPC
Execution model	Stateful VMs in many cases	Ephemeral container per run
Who can use it	Startups	Startups, banks, anyone with a regulator

If you're a bank, you cannot send customer PII to a third party to execute. You need to self-host. You need control.

Petri gives you that. But execution is only 30% of the problem.

The roadmap: What I'm building next

Petri solves "where does it run". It doesn't solve "what can it touch" or "prove it to compliance".

That's why I'm building TenantVault and the Audit Layer.

TenantVault: Secrets injection where your AI agent can use a database password to run a query, but it can't read the password, print it, or exfiltrate it.

Audit Layer: Full execution traces. What ran, what files it touched, what network calls it made. Stream it to your SIEM.

I'm building those because 18 years in banking taught me you can't deploy AI codegen without them. "It ran in Docker" isn't enough when the FCA asks questions.

Full roadmap with ETAs: jhansi.io/roadmap

No vaporware. If it's not on the roadmap with a target date, we're not building it yet.

Where things stand

Petri is running. Python, Node, Go support. REST API. Sub-second cold starts.

Next up is the SDK so you can do this:


from jhansi import Sandbox

with Sandbox(language="python") as sb:
    result = sb.exec("print('hello from isolation')")
    print(result.output)

No SDK to share yet. I'm building in public because I want feedback before I lock the API. Especially from teams in fintech, healthtech, or anywhere "oops, it leaked" isn't an option.

## Follow along

I'll post technical deep-dives here and on GitHub as I ship:

1. Python + TypeScript SDKs
2. Self-hosted Docker Compose setup
3. TenantVault and audit streaming

**Jhansi.io — Build it. Run it. Ship it.**  
Because "where does this code run?" shouldn't be a rhetorical question anymore.

---

*Building in public. Star the repo on [GitHub](https://github.com/jhansi-io/jhansi) or check the roadmap at [jhansi.io/roadmap](https://jhansi.io/roadmap). Questions? Drop them below.*