DEV Community: Dipankar Sarkar

Running five coding agents in parallel is easy. Not losing their state when one crashes is the hard part

Dipankar Sarkar — Tue, 21 Jul 2026 15:26:13 +0000

Point one AI coding agent at your repo and it mostly works. Point five at it and the
problem stops being the agents and starts being the coordination. Who is working on
what. What happens when agent three's session dies mid-task. Where the state lives so
that a crash does not lose the plan. How the finished work merges back without
stepping on itself.

Most setups answer those questions with a JSON file in the working tree and a hope.
The file gets half-written when a process is killed. The working tree gets dirty and
pollutes code review. Two agents edit the same state and one wins silently.

Brat is a Rust multi-agent harness built to answer those questions properly. The
core promise: even if agents crash, your coordination state is always recoverable and
auditable.

The core idea: coordination is an append-only event log

Brat does not keep a mutable state file. It is built on grite, an append-only event
log stored in git refs (refs/grite/wal). Every change to coordination state is an
event appended to that log. State is not stored, it is derived by replaying events.

That one decision is where the crash-safety comes from. An append-only log has no
half-written mutable record to corrupt. If a process dies mid-operation, you rebuild
deterministically from the last known-good point. And because the log lives in git
refs, not tracked files, your working tree stays clean. No coordination JSON showing
up in git status, no metadata noise in code review.
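
To make "state is derived, not stored" concrete, here is a minimal sketch of replaying an append-only log. The event kinds and fields are hypothetical and are not grite's real schema, and the log here is an in-memory list rather than git refs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str       # hypothetical event kinds, not grite's actual ones
    task_id: str
    actor: str = ""


def replay(events):
    """Derive task state by folding over the log in sequence order."""
    tasks = {}
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.kind == "task_created":
            tasks[ev.task_id] = {"status": "queued", "owner": None}
        elif ev.kind == "task_claimed":
            tasks[ev.task_id].update(status="running", owner=ev.actor)
        elif ev.kind == "task_finished":
            tasks[ev.task_id]["status"] = "done"
    return tasks


log = [
    Event(1, "task_created", "t1"),
    Event(2, "task_created", "t2"),
    Event(3, "task_claimed", "t1", actor="agent-3"),
]
state = replay(log)

# Replaying the same log always yields the same state, so nothing
# has to be saved separately for a crash to be recoverable.
rebuilt = replay(log)
```

Nothing in `replay` mutates the log. A crashed process only needs the events that were already appended to get back to the same `state`.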

The problems it explicitly targets, from its own list:

Problem	How Brat fixes it
Dirty working trees	Metadata lives in `refs/grite/*`, never in tracked files
Silent failures	All state changes recorded as events, fully observable
Crash recovery	Append-only log enables deterministic rebuild from any point
Merge chaos	The Refinery manages a queue with configurable policy

How it works: a small cast of roles

Brat models the work as a set of roles, and once you see them the mental model clicks:

Mayor is the AI orchestrator. It analyzes the codebase, breaks down the work, and creates convoys and tasks.
Convoy is a group of related tasks. Think sprint, epic, or feature branch.
Task is a single work item assigned to one coding agent.
Witness spawns and monitors the agent sessions.
Refinery manages the merge queue, runs CI checks, and handles integration.
Deacon is the background janitor: it cleans locks, syncs state, and detects orphaned sessions.

The flow reads top to bottom. The Mayor creates a Convoy that contains Tasks. The
Witness spawns agents against queued tasks. The Refinery merges finished work. The
Deacon keeps the whole thing from leaking locks and orphans.

Underneath, the substrate does the unglamorous correctness work. Events are immutable.
Each agent (actor) gets its own isolated data directory, so one agent's writes do not
clobber another's. Engine operations have bounded, configurable timeouts, so a hung
agent does not hang the harness. Resource coordination uses TTL-based lease locks, so
a crashed agent's lock eventually expires instead of deadlocking everyone.
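
A TTL lease is easy to picture in code. This is an in-memory sketch of the idea only: the class name, method names, and resource strings are illustrative assumptions, not Brat's API, and a real harness would persist leases rather than hold them in a dict.

```python
import time


class LeaseLock:
    """In-memory lease table. Illustrative only, not Brat's implementation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # resource -> (holder, expires_at)

    def acquire(self, resource, holder, ttl_seconds):
        now = self._clock()
        current = self._leases.get(resource)
        if current is not None:
            owner, expires_at = current
            # A live lease held by someone else blocks the request.
            if expires_at > now and owner != holder:
                return False
        # Free, expired, or a renewal by the same holder: grant the lease.
        self._leases[resource] = (holder, now + ttl_seconds)
        return True

    def release(self, resource, holder):
        current = self._leases.get(resource)
        if current is not None and current[0] == holder:
            del self._leases[resource]


# A fake clock makes the expiry visible without sleeping.
now = [0.0]
locks = LeaseLock(clock=lambda: now[0])

first = locks.acquire("src/parser.rs", "agent-1", ttl_seconds=30)
blocked = locks.acquire("src/parser.rs", "agent-2", ttl_seconds=30)
now[0] = 31.0  # agent-1 crashed and never renewed or released
recovered = locks.acquire("src/parser.rs", "agent-2", ttl_seconds=30)
```

The crashed holder never calls `release`, and it does not need to. The second agent gets the lease once the TTL has passed.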

And it is engine-agnostic. Brat drives your preferred coding tool through an adapter:

Engine	Command
Claude Code	`claude`
Aider	`aider`
OpenCode	`opencode`
Codex	`codex`
Continue	`cn`
Gemini	`gemini`
GitHub Copilot	`gh copilot`

You pick the engine in .brat/config.toml and Brat handles the orchestration around
it.

A concrete run

The loop is deliberately small. Initialize the substrate and the harness, start the
Mayor, ask it to analyze code, then let the Witness run the tasks it created.

cd your-project
grite init     # initialize the grite substrate
brat init      # initialize the Brat harness

# start the AI orchestrator and give it a job
brat mayor start
brat mayor ask "Analyze src/ and create tasks for any bugs you find"

# see what it created
brat status

# spawn agents for the queued tasks
brat witness run --once
brat status --watch

Because coordination is events in git, brat status is not reading a fragile local
file, it is querying derived state that can be rebuilt from the log at any time.

You can also declare reusable workflows. A parallel convoy is just a set of legs that
fan out and a synthesis step that pulls the results together:

name: code-review
type: convoy
legs:
  - id: correctness
    title: "Review correctness"
  - id: security
    title: "Review security"
  - id: performance
    title: "Review performance"
synthesis:
  title: "Synthesize review findings"

Three agents review three different concerns in parallel, and a fourth synthesizes.
That is the shape multi-agent work actually wants, and it maps cleanly onto convoys
and tasks.
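
The fan-out and synthesis shape itself is small. Here is a sketch using only the Python standard library, with a stub standing in for each agent session. The function names are hypothetical and nothing here calls Brat.

```python
from concurrent.futures import ThreadPoolExecutor

LEGS = ["correctness", "security", "performance"]


def run_leg(leg_id):
    # Placeholder for an agent session; a real leg would drive a coding engine.
    return f"{leg_id}: no findings"


def synthesize(findings):
    # The synthesis step only runs once every leg has reported.
    return "\n".join(sorted(findings))


with ThreadPoolExecutor(max_workers=len(LEGS)) as pool:
    findings = list(pool.map(run_leg, LEGS))

report = synthesize(findings)
```

The `with` block does not exit until all three legs have finished, which is the barrier the synthesis step depends on.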

Where it does not fit

Brat ships its own honest "what it does not solve" list, which is the main reason I
trust it. Repeating it, because the trade-offs are the point.

It does not fix engine reliability. API rate limits, auth failures, and vendor
outages are outside Brat's control. If Claude or GPT is down, Brat cannot conjure a
response. It can recover the coordination state around the failure, not the failure
itself.

It does not resolve real merge conflicts. The Refinery manages the merge queue
and policy. Genuine code conflicts still need human judgment. Brat orders and gates
the merges, it does not understand your code well enough to reconcile two
contradictory diffs.

It does not write your prompts. Brat orchestrates agents. Prompt quality is
still your job. Point it at a vague task and you get vague work, coordinated
cleanly.

It does not replace your CI. Brat integrates with your existing CI, it does not
become it.

It is early and Rust-native. You need the Rust toolchain to build from source,
and you install grite first as a prerequisite. This is infrastructure for people
who want to run several agents seriously, not a one-click consumer app.

The pattern across that list: Brat is honest about being a coordination substrate, not
a magic wand. It makes the state crash-safe and auditable. It does not make the agents
good.

Takeaways

The hard part of multi-agent coding is not parallelism, it is state. An append-only event log in git refs makes that state crash-recoverable and keeps your working tree clean.
Roles (Mayor, Convoy, Task, Witness, Refinery, Deacon) give you a mental model that matches how the work actually decomposes.
Actor isolation, bounded timeouts, and TTL lease locks are the boring correctness details that decide whether a harness survives contact with a crash.
Believe a project more when it publishes what it does not solve. Brat does.

Code, the role docs, and the demo are here:
https://github.com/neul-labs/brat

If you are already running multiple coding agents, I want to know how you handle a
mid-task crash today. Kick the tyres, issues welcome.

AI agents that browse the web need a fleet of isolated browsers, here is a brokerless scheduler for it

Dipankar Sarkar — Sun, 19 Jul 2026 15:26:18 +0000

Give an AI agent one browser and it is easy. Give a hundred agents their own browsers and you have an infrastructure problem.

Browser automation at scale has an awkward shape. Each browser wants real resources, a real filesystem, and real isolation, because you are running untrusted pages, scraping targets that fight back, or agent sessions you do not want sharing cookies. So you cannot just spin up a thousand threads. You end up managing containers, then managing the machines the containers run on, and then you are writing a scheduler.

Machineuse is that scheduler, built for exactly this workload. It is a Python platform that creates, schedules, and manages isolated browser instances across multiple worker nodes, with load balancing and a snapshot-based dormancy trick to reclaim resources from idle instances. This post covers how it places work, how dormancy works, and where the design constrains you.

The project tags itself for chromium, browser-automation, and mcp, so the agent-browser fleet is squarely the workload it has in mind. One thing to be precise about, though: machineuse schedules and isolates the containers. What runs inside, the browser and your automation or agent code, is yours. It is fleet infrastructure, not an agent framework. That separation is the point. You bring the agent, it brings the machines.

The core idea

The unit is an isolated browser instance running inside a systemd-nspawn container with dedicated resources. The platform's job is to decide which node runs each instance and to keep the fleet healthy.

Two design decisions define it.

First, the messaging is pure NNG, with no external broker. There is no Redis or RabbitMQ in the middle. Nodes talk to a control plane over NNG sockets directly. That is one fewer stateful system to run and one fewer thing to fall over.

Second, placement is intelligent rather than round-robin. The scheduler places instances based on node capabilities and current load, so a heavily loaded node does not get handed work it cannot serve well.
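
As a rough illustration of capability- and load-aware placement, here is a sketch that filters out nodes that cannot fit the request and then picks the least loaded of the rest. The `Node` fields and the load formula are assumptions made for the example. The post does not describe machineuse's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    memory_gb: float
    cpu_cores: int
    used_memory_gb: float = 0.0
    used_cpu_cores: int = 0

    def fits(self, memory_gb, cpu_cores):
        return (self.memory_gb - self.used_memory_gb >= memory_gb
                and self.cpu_cores - self.used_cpu_cores >= cpu_cores)

    def load(self):
        # Fraction of the scarcer resource that is already committed.
        return max(self.used_memory_gb / self.memory_gb,
                   self.used_cpu_cores / self.cpu_cores)


def place(nodes, memory_gb, cpu_cores):
    """Return the least-loaded node that can fit the request, or None."""
    candidates = [n for n in nodes if n.fits(memory_gb, cpu_cores)]
    if not candidates:
        return None
    return min(candidates, key=Node.load)


nodes = [
    Node("worker-1", memory_gb=32, cpu_cores=8, used_memory_gb=28, used_cpu_cores=6),
    Node("worker-2", memory_gb=16, cpu_cores=4, used_memory_gb=4, used_cpu_cores=1),
]
chosen = place(nodes, memory_gb=4, cpu_cores=2)
too_big = place(nodes, memory_gb=64, cpu_cores=1)
```

Round-robin would happily alternate between the two workers. Here the bigger but nearly full `worker-1` technically fits the request and still loses to the smaller, mostly idle `worker-2`.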

How the architecture fits together

There are two roles. A control plane coordinates, and worker nodes run containers. You start a control plane bound to an NNG address, then point workers at it.

# Start the control plane
python -m machineuse.nodes.control_plane --bind tcp://*:5000

# Start worker nodes on different machines
python -m machineuse.nodes.agent worker-1 tcp://control-plane:5000
python -m machineuse.nodes.agent worker-2 tcp://control-plane:5000

Storage is split by role, which is a sensible choice. Each node uses SQLite locally. The control plane can use PostgreSQL for shared metadata. DuckDB is used for analytics on the metrics side. So the local hot path stays embedded and simple, while shared state and analytics get the heavier stores only where they are actually needed.

Reliability comes from auto-healing: the platform detects failures and migrates instances off a bad node. Real-time metrics feed time-series analytics so you can see utilization across the cluster rather than guessing.

How snapshot dormancy works

This is the feature worth the price of admission. A browser instance you are not actively using still holds memory and CPU. Multiply that across a fleet and idle instances become the dominant cost.

Machineuse handles this with snapshot dormancy: it pauses an instance and takes a filesystem snapshot, freeing the live resources, then revives it from the snapshot when you need it again. So an instance you might need later does not have to sit resident. It goes dormant, gives back its resources, and comes back when called.

From the CLI that is two commands.

# Put an instance to sleep, reclaiming its resources
machineuse-cli dormant <id>

# Bring it back from its snapshot
machineuse-cli revive <id>

The rest of the CLI is what you expect for lifecycle and fleet management.

# Instance lifecycle
machineuse-cli create --image ubuntu:22.04
machineuse-cli list --node worker-1
machineuse-cli delete <id>

# Cluster view
machineuse-cli nodes list
machineuse-cli cluster status
machineuse-cli metrics --node worker-1

Using it from code

There is a Python client that talks to the cluster and a REST API on port 8000. The library path is the one most automation would use, since it hands back the scheduling result directly.

from machineuse.client import ClusterManager

# Connect to the distributed cluster
client = ClusterManager("tcp://control-plane:5000")

# Create an instance and let the scheduler place it
instance = client.create_instance(
    image="ubuntu:22.04",
    config={"memory_gb": 4, "cpu_cores": 2},
)
print(f"Instance {instance.id} scheduled on {instance.node_id}")

# Check fleet-wide utilization
status = client.get_cluster_status()
print(f"Cluster utilization: {status['utilization']}%")

The REST surface mirrors this. A POST /v2/instances with an image and config creates an instance, GET /v2/instances lists them, and GET /health is the readiness check. Deployment is Docker Compose, with a single-node docker-compose.yml and a docker-compose.distributed.yml for control plane plus workers plus PostgreSQL.
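
For the REST path, here is a sketch of building the create call with the standard library. The host, the JSON field names (`image`, `config`), and the content type are assumptions based on the description above, so check the project's API docs for the real request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host; the post only names the port


def build_create_request(image, config):
    """Build (but do not send) a POST /v2/instances request."""
    body = json.dumps({"image": image, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v2/instances",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_request("ubuntu:22.04", {"memory_gb": 4, "cpu_cores": 2})
# Against a running cluster you would send it with urllib.request.urlopen(req).
```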

Where it does not fit

systemd-nspawn means Linux, systemd, and root. The README is explicit: you need an Ubuntu or Debian system with systemd-nspawn support, Python 3.11+, and root or sudo access for container management. This is not going to run on macOS, and it is not going to run rootless. If your automation infrastructure is not a systemd-based Linux host that you run yourself, this is the wrong tool.

There is a per-node ceiling. MACHINEUSE_MAX_INSTANCES defaults to 50 containers per node. That is a sensible default, and it also tells you the scaling model: you scale out by adding nodes, not by cramming a box. Plan capacity around instances-per-node times node-count, and remember each browser is a real resource consumer.
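
The capacity arithmetic is worth writing down once. This sketch uses the default of 50 quoted above. The function names are made up for the example, and the numbers ignore whether a node actually has the memory and CPU for 50 browsers.

```python
import math

DEFAULT_MAX_INSTANCES = 50  # the MACHINEUSE_MAX_INSTANCES default quoted above


def nodes_needed(total_instances, per_node=DEFAULT_MAX_INSTANCES):
    """Smallest node count whose combined ceiling covers the fleet."""
    return math.ceil(total_instances / per_node)


def fleet_capacity(node_count, per_node=DEFAULT_MAX_INSTANCES):
    """Upper bound on resident instances for a given number of nodes."""
    return node_count * per_node


needed_for_120 = nodes_needed(120)
capacity_of_4 = fleet_capacity(4)
```

Here 120 instances need three nodes, and four nodes cap out at 200 resident instances. Dormant instances do not hold live resources, but the post does not say whether they still count against the per-node limit.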

Dormancy is a trade, not free capacity. Snapshotting to disk and reviving takes time and disk space. It is a win when instances are idle long enough to justify the pause and restore cost. For instances you cycle rapidly, the snapshot overhead may cost more than it saves. It shines for a large pool of intermittently-used sessions, not for high-churn throwaway instances.

Brokerless NNG puts coordination on you. No external broker is a real operational win, but the control plane is then the coordination point you must keep available. It uses PostgreSQL for shared metadata, so your durability and HA story for the fleet is your durability and HA story for that database and that control-plane process.

Takeaways

Machineuse is a focused answer to one real problem: running many isolated browsers across many machines without hand-rolling the scheduler and without standing up a message broker.

The idea worth stealing is snapshot dormancy. Treating an idle heavy instance as something you pause to a filesystem snapshot and revive on demand is a clean way to decouple "instances that exist" from "resources currently spent." It turns idle capacity from a cost into a snapshot on disk.

Repo: https://github.com/dotcommoners/machineuse

If you run browser automation at any real scale, stand up the single-node Compose stack, create a handful of instances, and measure your own dormant-to-revive time. Real revive latency on your storage is the number that decides whether dormancy pays off, and it is exactly the kind of issue worth filing.

A key-value store where the query language is Lua, and you can build RAG inside it

Dipankar Sarkar — Fri, 17 Jul 2026 15:27:11 +0000

Most embedded databases give you two verbs and a shrug. Put a value. Get a value.
Anything cleverer than that, filtering, transforming, combining reads, you do in
your application language, which means pulling data across the boundary, working on
it, and pushing it back.

That round-trip is fine until the logic gets interesting. Increment a counter
atomically. Read a document, embed it, store the vector. Do a similarity search and
feed the top hit into an LLM. Every one of those becomes several client calls with
your app code as the glue.

Liath takes a different bet. It is an embedded key-value store where the query
language is Lua, a real programming language, running next to the data. You send
logic to the data instead of dragging the data to your logic.

The core idea

The pitch is three lines to a working database:

from liath import EmbeddedLiath

db = EmbeddedLiath(data_dir="./data")

db.put("user:1", '{"name": "Alice", "role": "admin"}')
print(db.get("user:1"))

db.close()

That is the boring half. The interesting half is that you can hand Liath a Lua
script and it runs against the store in one call, with the db object and a set of
plugins available inside the script:

db.execute_lua('''
    -- Store data
    db:put("counter", "0")

    -- Read and modify
    local count = tonumber(db:get("counter"))
    db:put("counter", tostring(count + 1))

    return db:get("counter")
''')

The read, the modify, and the write happen server-side in one execution instead of
three Python calls. Lua is a good choice for this. It is tiny, it embeds cleanly,
and it is the same language people already accept inside Redis and Nginx for exactly
this reason.

How it works

Liath is a Python package with a pluggable storage backend and a plugin system that
extends the Lua runtime.

Storage is not fixed. You get LevelDB for development and RocksDB for production, and
a storage_type of auto, rocksdb, or leveldb picks the engine. For a
production deployment you point it explicitly:

db = EmbeddedLiath(
    data_dir="/var/lib/liath",
    storage_type="rocksdb"
)

The capabilities that make Liath more than a KV wrapper come from plugins, exposed
inside Lua under the plugins table. Some are always present, others you install
as extras:

Plugin	Function	Install
`db`	Core CRUD	Included
`file`	File read/write	Included
`cache`	Query result caching	Included
`backup`	Backup/restore	Included
`monitor`	System monitoring	Included
`embed`	Text/image embeddings	`liath[embed]`
`vdb`	Vector similarity search	`liath[vdb]`
`llm`	LLM completions/chat	`liath[llm]`

So pip install liath[embed,vdb,llm] turns a key-value store into something that
can embed text, index vectors, and call a model, all reachable from a Lua script.

Namespaces give you multi-tenant isolation without running separate databases. You
create a namespace, switch to it, and keys written there do not collide with another
namespace:

db.create_namespace("production")
db.set_namespace("production")
db.put("config", '{"debug": false}')

db.set_namespace("development")
db.put("config", '{"debug": true}')  # Separate from production

You can also write your own plugin in Python by subclassing PluginBase and
exposing Lua-callable functions, then load it with a plugins_dir. That is the
escape hatch when the built-ins are not enough.

RAG without leaving the database

The example that shows why co-locating logic and data matters is retrieval-augmented
generation. In most stacks that is an orchestration script gluing an embedding
service, a vector database, and an LLM API. In Liath the README does it in one Lua
script per phase.

Indexing creates a vector index, stores a document, embeds it, and adds the vector:

db.execute_lua('''
    local json = require("cjson")

    plugins.vdb.vdb_create_index("docs", 384)

    local text = "Liath is a programmable database with Lua queries."
    db:put("doc:1", text)

    local emb = json.decode(plugins.embed.embed(text)).embedding
    plugins.vdb.vdb_add("docs", "doc:1", emb)

    return "Indexed"
''')

Querying embeds the question, searches for neighbours, pulls the matching document,
builds a prompt, and calls the model, all in one server-side script:

answer = db.execute_lua('''
    local json = require("cjson")
    local query = "What is Liath?"

    local q_emb = json.decode(plugins.embed.embed(query)).embedding
    local results = json.decode(plugins.vdb.vdb_search("docs", q_emb, 3))

    local context = db:get("doc:" .. results.results[1].id)

    local prompt = "Context: " .. context .. "\\n\\nQuestion: " .. query
    return plugins.llm.llm_complete(prompt)
''')

The embedding backend is FastEmbed, the vector search is USearch, and the LLM plugin
targets OpenAI or Llama. The point is not that any one of those is novel. It is that
the retrieve, augment, generate loop runs where the data already sits.
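
If the retrieve and augment steps are unfamiliar, here is the same shape in plain Python with no Liath involved. The bag-of-words `embed` is a toy stand-in for a real embedding model, the dict stands in for the vector index, and the generate step is left out because it needs a model. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words vector; a real setup would use an embedding model.
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


docs = {
    "doc:1": "Liath is a programmable database with Lua queries.",
    "doc:2": "Bananas are rich in potassium.",
}
index = {key: embed(text) for key, text in docs.items()}


def retrieve(query, k=1):
    """Return the keys of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda key: cosine(q, index[key]), reverse=True)
    return ranked[:k]


def build_prompt(query):
    """Augment the question with the best-matching document."""
    context = docs[retrieve(query)[0]]
    return f"Context: {context}\n\nQuestion: {query}"


prompt = build_prompt("What is Liath?")
```

Every step in this sketch is an ordinary function call in one process. That is the property the Lua scripts above get inside the store, instead of three network hops between services.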

Where it does not fit

Now the honest part.

Liath is single-node and embedded. There is a liath-server with an HTTP API and a
liath-cli, but the architecture is an embedded store, not a distributed cluster.
If you need horizontal scale, replication, or failover, this is the wrong tool.

Running application logic inside the database is a trade you should make with your
eyes open. It is the same tension people have argued about with Redis Lua and
Postgres stored procedures for years. The logic is fast because it is next to the
data, but it also lives in a place that is harder to test, version, and debug than
your normal application code. A Lua string in a Python file is not a first-class
citizen of your test suite unless you make it one.

The AI plugins are powerful and they are also new surface area. Embeddings, vector
search, and LLM calls each pull in their own dependency and their own failure modes.
An LLM call from inside a Lua script still has network latency, rate limits, and
cost. Co-location removes round-trips between your services. It does not remove the
model API on the far side.

And Lua is a real language, which means Lua is real rope. Expressive server-side
scripting is exactly the kind of power that turns into an unmaintainable pile if the
team does not treat those scripts with the same discipline as the rest of the code.

Takeaways

Sending logic to the data instead of data to the logic is a genuinely different shape for an embedded store, and Lua is a proven choice for it.
The plugin model is the real story. Optional embed, vdb, and llm extras let the same KV store host a full RAG loop.
Pluggable RocksDB or LevelDB backends and namespace isolation make it more production-shaped than a toy, within single-node limits.
Treat in-database Lua like stored procedures: powerful, fast, and easy to abuse. Test them, version them, or regret them.

Install instructions, the code, and the plugin reference are here:
https://github.com/incredlabs/liath

If you are running a RAG pipeline as three separate services today, I would like to
know whether collapsing it into one embedded store is a relief or a footgun for your
workload. Try pip install liath[embed,vdb,llm], run the RAG example, and tell me
where it breaks.

A package.lock for the prompts hiding in your codebase

Dipankar Sarkar — Wed, 15 Jul 2026 15:24:31 +0000

Prompts are dependencies. We just refuse to treat them like it. A production LLM
app has prompt strings scattered across a dozen files, concatenated inline, and
edited by whoever touched that feature last. Nobody knows which prompts exist,
which version is live, or whether the one in summarize.py matches the one the
eval ran against. We version our libraries down to the patch. Our prompts get
a content = "Summarize: " + text and a shrug.

blogus is a package.lock for AI prompts. It extracts the prompts already in
your code, versions them like dependencies, locks them with content hashes, and
syncs changes back to your source files. The tagline is exactly that:
"package.lock for AI prompts."

The core idea: discover, do not migrate

Most prompt-management tools ask you to move your prompts into their system
first. That migration is the reason they never get adopted. blogus inverts it.
It scans your existing code and finds the LLM calls where they already live, so
adoption is a scan, not a rewrite.

$ blogus scan

Found 3 LLM API calls:
  src/chat.py:15         OpenAI      unversioned
  src/summarize.py:42    OpenAI      unversioned
  lib/translate.js:28    Anthropic   unversioned

That output is the whole pitch in one screen. It found calls across Python and
JavaScript, identified the provider, and flagged every one as unversioned. You
did not move anything. You just found out what you have.

The workflow

blogus runs as a five-step loop that maps cleanly onto how you already think
about dependencies: extract, version, lock, sync, verify.

Versioning turns a discovered prompt into a .prompt file with front matter,
so the prompt becomes a real artifact with a name, a model config, and typed
variables:

# prompts/summarize.prompt
---
name: summarize
model:
  id: gpt-4o
  temperature: 0.3
variables:
  - name: text
    required: true
---
Summarize in 2-3 sentences: {{text}}

Locking generates prompts.lock, which is the mechanism that makes this a
dependency system rather than a folder of text files:

# prompts.lock
prompts:
  summarize:
    hash: sha256:a1b2c3d4...
    commit: 4903f76

Each prompt gets a sha256 content hash and the commit it was locked at. Now
"which version of the summarize prompt is in production" has an answer you can
diff, not a guess.
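
The hashing side is plain standard-library work. This sketch assumes the hash covers only the prompt body as UTF-8 text. What blogus actually feeds into the hash (front matter, whitespace normalization) is not stated here, so treat `content_hash` as a hypothetical helper.

```python
import hashlib


def content_hash(prompt_text):
    """Hash prompt text the way a lockfile entry might record it."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


original = "Summarize in 2-3 sentences: {{text}}"
edited = "Summarize in 2-3 sentences, plainly: {{text}}"

locked = content_hash(original)
still_locked = content_hash(original) == locked
drifted = content_hash(edited) != locked
```

Any edit to the text, however small, produces a different digest. That is what lets a lockfile answer "is this the same prompt" exactly.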

How the sync works

The step that makes this more than a linter is fix. It rewrites your source
code to load from the versioned prompt instead of carrying an inline string:

# Before
content = "Summarize: " + text

# After
# @blogus:summarize sha256:a1b2c3d4
content = load_prompt("summarize", text=text)

The inline string becomes a load_prompt call annotated with the prompt name
and its hash. That comment is the link between code and lock file. Once it is
there, blogus verify can run in CI and fail the build if the code has drifted
from the locked prompt:

$ blogus verify || exit 1

This is the whole value. The hash in the code comment must match the hash in the
lock file. If someone edits the inline usage or the .prompt file changes
without a re-lock, verify catches it before merge. It is the same contract as a
lockfile check on your dependencies, applied to the prompts that actually drive
your model behavior.
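
Here is a sketch of what a check like that could look like. The annotation format is taken from the example above. The prefix comparison is my assumption, since the comment carries a short hash and the lock file a longer one, and `find_drift` is a hypothetical helper, not blogus code.

```python
import re

ANNOTATION = re.compile(
    r"#\s*@blogus:(?P<name>[\w-]+)\s+sha256:(?P<hash>[0-9a-f]+)"
)


def find_drift(source, lock):
    """Return prompt names whose annotated hash does not match the lock."""
    drifted = []
    for match in ANNOTATION.finditer(source):
        name, short = match["name"], match["hash"]
        locked = lock.get(name, "")
        # Assumed rule: the short hash in the comment must prefix the locked one.
        if not locked.removeprefix("sha256:").startswith(short):
            drifted.append(name)
    return drifted


source = (
    "# @blogus:summarize sha256:a1b2c3d4\n"
    'content = load_prompt("summarize", text=text)\n'
)
lock_ok = {"summarize": "sha256:a1b2c3d4e5f6"}
lock_changed = {"summarize": "sha256:ffff0000aaaa"}

clean = find_drift(source, lock_ok)
stale = find_drift(source, lock_changed)
```

In CI, a non-empty result is the signal to exit non-zero.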

The CLI has the surface you would expect around that loop: scan, init,
prompts to list, exec <name> to run a prompt with variables, analyze to
evaluate effectiveness, test to generate test cases, lock, verify,
check to find unversioned prompts, and fix. There is also a web UI
(uvx --with blogus[web] blogus-web on port 8000) and an interactive TUI demo.
Install is uv add blogus, or uvx blogus scan to try it with zero install.

Where it does not fit

A few honest limits.

Start with what it can even see. The scanner detects OpenAI SDK calls
(openai.chat.completions.create) and Anthropic SDK calls
(anthropic.messages.create), in Python and JavaScript/TypeScript files. That
is the detection surface. If your calls go through LiteLLM, Bedrock, a provider
router, or your own thin wrapper, or if your service is Go or Rust, scan
finds nothing and the whole pitch evaporates. Check that your stack is in that
window before you judge the idea by an empty scan.
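
To see why wrappers defeat a scan like this, it helps to look at how call-site detection can work. The sketch below uses Python's `ast` module to match on the dotted call name. It is an assumption about the general technique, not blogus's scanner, and `TARGETS`, `dotted_name`, and `scan` are hypothetical.

```python
import ast

TARGETS = {
    "chat.completions.create": "OpenAI",
    "messages.create": "Anthropic",
}


def dotted_name(node):
    """Turn an attribute chain like a.b.c into the string 'a.b.c'."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def scan(source):
    """Return (line, provider) for every call whose name ends in a target."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            for suffix, provider in TARGETS.items():
                if name.endswith("." + suffix):
                    hits.append((node.lineno, provider))
    return sorted(hits)


direct = '''
import openai
client = openai.OpenAI()
reply = client.chat.completions.create(model="gpt-4o", messages=[])
'''

wrapped = '''
from mylib import ask_model
reply = ask_model("Summarize: " + text)
'''

direct_hits = scan(direct)
wrapped_hits = scan(wrapped)
```

The direct SDK call is found on line 4. The call through `ask_model` (a made-up wrapper) produces no hits, because nothing at the call site looks like a provider SDK.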

The analyze command does LLM-powered effectiveness scoring and the test
command generates test cases. This is not a hidden detail, it is in the
signature: analyze takes a --judge-model. A model is grading your prompt.
LLM-as-judge is noisy, so treat those scores as a signal to investigate, not a
gate to block on. The trustworthy part of blogus is the deterministic core:
scan, lock, verify. The hash contract is exact. The quality scoring is fuzzy by
nature, and you should hold the two to different standards.

The fix step edits your source files. Rewriting inline strings into
load_prompt calls is a code transformation, and any codemod can get an edge
case wrong. There is a blogus fix --dry-run that previews the rewrite, and
you should use it before the real thing. Run it in a clean working tree, review
the diff, and keep it behind a PR. This is not a tool to run with uncommitted
changes.

It also introduces a load_prompt runtime dependency into your app. Your
prompts now live in .prompt files that have to ship and load at runtime. That
is a fair trade for versioning, but it is a trade. If your prompts are trivial
and never change, the lockfile ceremony is overhead you do not need. The value
scales with how many prompts you have and how often they drift.

And it manages the prompt text and model config. It does not manage what the
model actually returns. A locked, verified prompt can still produce a worse
answer after a model provider updates their model under the same name. blogus
pins your side of the contract, not the provider's.

Takeaways

Prompts are dependencies, and the reason they rot is that nothing treats them like one. A content hash plus a lockfile fixes the "which version is live" question.
Discover-in-place instead of migrate is why this can actually get adopted. The first command is a scan, not a rewrite.
The verify step in CI is the payoff. Code drift from the locked prompt fails the build, same as a dependency lock check.
Trust the deterministic core (scan, lock, verify) more than the LLM-powered analyze and test features. Different reliability, different standards.

If your codebase has more prompts than anyone can list from memory, blogus scan is a five-second way to find out how many. Code and CLI reference are
here: https://github.com/Skelf-Research/blogus

I would genuinely like to know the highest prompt count anyone's scan turns up.
Kick the tyres, issues welcome.

I threw 750 autonomous LLM exploit attempts at a $10k sandbox bounty. Zero escapes.

Dipankar Sarkar — Mon, 13 Jul 2026 15:24:30 +0000

Pydantic put up a $10,000 bounty called Hack Monty: escape the sandbox of their Monty runtime. That is a clean, adversarial, unforgiving target. Either you get a sandbox escape or you do not. No partial credit, no hand-waving.

I did not try to hack it by hand. I built an autonomous LLM loop to hack it, and then rebuilt it through four versions as I learned what did and did not work. This is the honest write-up of what an agent swarm actually found, which is not a $10k escape, and why the result is still worth publishing.

The core idea

An autonomous hacking loop is a simple pattern with hard details. An LLM proposes an exploit, a harness runs it against the target, an evaluator scores the result, and that score feeds back to steer the next attempt. Repeat for hundreds of iterations.

The engine here started on the autoresearch pattern and evolved through four versions: from a single loop, to XBOW-style parallel swarms, and finally to a tokenworm plus MCP architecture. Each version was a bet about what the bottleneck was. Early on it was idea diversity, so I went parallel. Later it was tool discipline, so I moved to MCP with a tightly-scoped boundary.

How V4 works

The final architecture is three layers with a hard boundary in the middle.

At the top, tokenworm is the LLM agent harness. It runs a native Ollama provider, four role-focused SKILL.md files, and its own sandbox (bwrap plus network). It speaks to the outside world only through an MCP client over stdio.

tokenworm (LLM agent harness)
  Provider: ollama_native
  Skills: 4 role-focused SKILL.md
  Sandbox: bwrap + network
        │  MCP Client (stdio)
        ▼
hackmonty_mcp_server.py
  17 boundary tools (run, evaluate, bandit, state)
        │
        ▼
hackmonty.com   Ollama Cloud   GitHub Issues   Filesystem

The middle layer, hackmonty_mcp_server.py, is the interesting design decision. It exposes exactly 17 tools and nothing else. The agent cannot touch the target, run code, score an attempt, or read state except through those 17 tools. That boundary is the whole safety and control model. It is what lets an autonomous loop run 500 iterations without the harness itself becoming the vulnerability.
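
The boundary idea does not depend on MCP specifically. Stripped down, it is an allowlist dispatcher: the agent can name a tool, and anything not registered is refused. This sketch shows only that pattern. It is not the MCP protocol or the project's server, and the class, the toy `evaluate` tool, and its scoring rule are hypothetical.

```python
class ToolBoundary:
    """Dispatch only to explicitly registered tools."""

    def __init__(self):
        self._tools = {}

    def register(self, name):
        def decorator(fn):
            self._tools[name] = fn
            return fn
        return decorator

    def names(self):
        return sorted(self._tools)

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool not exposed: {name}")
        return self._tools[name](**kwargs)


boundary = ToolBoundary()


@boundary.register("evaluate")
def evaluate(output):
    # Toy scorer for the example; the real evaluator is far more involved.
    return 5 if "escaped" in output else 0


score = boundary.call("evaluate", output="blocked by sandbox")

try:
    boundary.call("shell", cmd="id")
    refused = False
except PermissionError:
    refused = True
```

The agent never holds a reference to anything except `boundary.call`. Adding a capability means registering a tool on purpose.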

Two more pieces make the loop learn rather than flail. A UCB1 bandit (bandit.py) selects which attack strategy to try next, treating exploit families as arms and their scores as rewards, so the loop spends its budget on what is paying off. And an evaluator (evaluate.py) scores each attempt 0 to 5 and enriches the context, so the next proposal is grounded in what the last one actually did.
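
For readers who have not met UCB1, here is a minimal textbook version. It is not the contents of bandit.py. The arm names and payoffs are invented, and I am assuming the 0 to 5 evaluator score is divided by 5 so rewards land in the 0 to 1 range the standard bound expects.

```python
import math


class UCB1:
    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        total_plays = sum(self.counts.values())

        def bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(total_plays) / self.counts[arm])
            return mean + bonus

        return max(self.counts, key=bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical exploit families with fixed payoffs, to keep the run deterministic.
payoff = {"type-confusion": 0.6, "ffi-boundary": 0.2, "resource-exhaustion": 0.0}

bandit = UCB1(payoff)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, payoff[arm])  # evaluator score / 5 in a real loop

most_played = max(bandit.counts, key=bandit.counts.get)
```

The exploration bonus shrinks as an arm gets played, so the weak strategies keep getting occasional tries while most of the budget flows to the one that scores.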

Running it is a shell script or the Python SDK:

# Production (tokenworm Zig binary), needs ollama serve running
export USER_SECRET=your-passphrase
./run.sh            # 500 iterations
./run.sh 20         # 20 iterations (test)
./run.sh -i         # interactive REPL

# Python SDK showcase (InlineSkill, MCP, native Ollama Cloud)
uv run python run.py 20

There is also an SSE remote mode for the MCP server: uv run python -B hackmonty_mcp_server.py --sse --port 8765.

The results, honestly

Here is where I refuse to inflate anything, because the whole point of an adversarial bounty is that the target does not lie and neither should I.

750+ exploit attempts across the four orchestrator versions.
0 sandbox escapes found.
1 latent unsafe bug found: a heap_read_boxed provenance mismatch.
43 unsafe blocks audited, 9 GHSA advisories reviewed, 6 CPython divergences documented.

So the headline number is zero. The loop did not win the bounty. It threw 750 attempts at a hardened sandbox and the sandbox held.

But zero escapes is itself a result. A 750-attempt autonomous red-team that finds no escape is real evidence about the target's robustness. Along the way it surfaced a genuine latent unsafe bug (the provenance mismatch, which went into the bounty submission) plus a paper-level, 12-section security assessment. The full write-up is in the repo's REPORT.md and SUBMISSION.md.

Where this approach does NOT fit

The honest limits, and they are the actual lesson.

Autonomous loops are good at breadth, weak at depth. 750 attempts sounds like a lot until you realize a real sandbox escape is often one deep, precise chain of insights, not a wide search. The loop is excellent at exploring the surface and terrible at the kind of sustained, single-thread reasoning a human exploit dev brings to one hard bug. Zero escapes across four architectures is partly a statement about the target and partly a statement about the method.

Evaluation is the ceiling. The loop is only as smart as evaluate.py. A 0 to 5 score plus a bandit steers toward whatever the evaluator rewards, which means a blind spot in scoring is a blind spot in the whole search. Building the evaluator is harder than building the agent, and it is where most of the real engineering went.

It is bespoke to this target. The 17 boundary tools, the skills, the client all encode Hack Monty. This is a case study in building an autonomous red-team, not a turnkey scanner you point at your own code tomorrow. Reuse the pattern, not the config.

Cost and noise are real. 750 LLM-driven attempts is a lot of tokens for one latent bug and a report. Whether that trade is worth it depends entirely on the value of the target.

Takeaways

The loop pattern is simple: propose, run, score, feed back. The hard parts are the evaluator and the tool boundary.
A single MCP server exposing exactly 17 tools is the control model that makes 500-iteration autonomy safe to run.
A UCB1 bandit plus a 0 to 5 evaluator is what turns a flailing loop into one that concentrates its budget.
Result: 0 escapes in 750+ attempts, but 1 latent unsafe bug, a bounty submission, and a full assessment. Zero is still data.
Autonomous agents are breadth-first. Deep single-bug exploitation is still where humans win.

Repo: https://github.com/dipankar/hackmonty

Honest ask: if you build autonomous agent loops, read the evaluator and the bandit selection and tell me where you would have steered the search differently. The escape count is zero, so the interesting critique is about the method, not the score. Issues and discussion welcome.

737x faster LangGraph checkpoints, and the case where Rust lost

Dipankar Sarkar — Sat, 11 Jul 2026 15:24:04 +0000

Run a LangGraph agent long enough and the model call stops being your bottleneck.
The plumbing takes over. Every step, the graph serializes its state to a
checkpoint so you can resume, replay, or recover. LangGraph does that with
Python's deepcopy. For a small dict that is fine. For a 250KB agent state with
nested messages, tool outputs, and accumulated context, deepcopy is brutally
slow, and you pay it on every single step of a long run.

So I built fast-langgraph: a set of Rust accelerators for the hot paths in
LangGraph, packaged as drop-in components that keep full API compatibility.

Lead with the numbers, including the ones that hurt

Here is what the Rust paths actually buy you, measured against the Python
equivalents:

Operation	Speedup	Where
Complex checkpoint (250KB)	737x faster than deepcopy	Large agent state
Complex checkpoint (35KB)	178x faster	Medium state
Sustained state updates	13-46x	Long-running graphs, many steps
LLM response caching	10x at 90% hit rate	Repeated prompts, RAG
End-to-end graph execution	2-3x	Production workloads with checkpointing

And the automatic mode, the one that needs zero code changes, lands around
2.8x for a typical invocation.

Now the honest part. These are not "Rust is faster at everything" numbers. The
checkpoint speedup scales with state size. It is a serialization story. For a
small, flat dict, copying is already fast because Python's built-in dict is
implemented in C. Rust does not win there, and the README says so plainly. The
737x is a large complex-state number, not a headline you get on a toy graph.
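You can see the size dependence with nothing but the standard library. The sketch below times copy.deepcopy against a pickle round trip on a tiny state and a large nested one. The pickle round trip is only a stand-in for "structured serialize" and is not what fast-langgraph does internally. The state shape is invented for the demo.

```python
import copy
import pickle
import time


def make_state(n_messages):
    """Build a nested, agent-like state; the shape is made up for this demo."""
    return {
        "messages": [
            {
                "role": "user",
                "content": f"message {i} " + "x" * 200,
                "meta": {"step": i, "tools": ["search", "calc"]},
            }
            for i in range(n_messages)
        ],
        "step": n_messages,
    }


def best_time(fn, repeat=5):
    """Best-of-N wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


small = make_state(2)
large = make_state(1000)

for label, state in (("small", small), ("large", large)):
    size = len(pickle.dumps(state))
    t_copy = best_time(lambda s=state: copy.deepcopy(s))
    t_pickle = best_time(lambda s=state: pickle.loads(pickle.dumps(s)))
    print(f"{label}: {size} bytes pickled, "
          f"deepcopy {t_copy * 1e3:.3f} ms, pickle round trip {t_pickle * 1e3:.3f} ms")
```

The absolute numbers depend on your machine. What to look at is how the gap between the two columns changes as the state grows.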

The core idea: reimplement the critical paths, keep the API

LangGraph is good. I did not want to fork it or replace it. I wanted to swap out
the three operations that dominate a real workload:

Checkpoint serialization. deepcopy on complex nested state is the single biggest cost in a long run. Rust does a structured serialize instead.
State management at scale. High-frequency updates accumulate overhead. A Rust merge path handles append-heavy state.
Repeated LLM calls. Identical prompts waste real API money. A cache in front of the model kills the redundant calls.

Everything else stays Python. The Rust code is behind a compatible surface.

Two ways to turn it on

The easiest is automatic acceleration. One flag, or one function call, and your
existing graph runs faster with no code changes:

# At the top of your application
import fast_langgraph
fast_langgraph.shim.patch_langgraph()

# Rest of your code unchanged - runs 2-3x faster
from langgraph.graph import StateGraph
# ...

Or set FAST_LANGGRAPH_AUTO_PATCH=1 in the environment, which is the path I
prefer in production because it needs zero code diff.

For the biggest wins you reach for the components directly. The checkpointer is a
drop-in replacement:

from fast_langgraph import RustSQLiteCheckpointer

# 5-6x faster than the default checkpointer
checkpointer = RustSQLiteCheckpointer("state.db")
graph = graph.compile(checkpointer=checkpointer)

The LLM cache is a decorator, and it reports its own hit rate so you can prove the
win instead of guessing:

from fast_langgraph import cached

@cached(max_size=1000)
def call_llm(prompt):
    # llm is whatever model client your application already uses
    return llm.invoke(prompt)

# First call: hits the API (~500ms)
response = call_llm("What is LangGraph?")

# Second identical call: returns from cache (~0.01ms)
response = call_llm("What is LangGraph?")

print(call_llm.cache_stats())
# {'hits': 1, 'misses': 1, 'size': 1}

That cache is where the 10x number comes from, and only at a high hit rate. If
your prompts are all unique, the cache does nothing. It is a RAG-and-retry
optimization, not magic.
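The 10x figure falls out of simple expected-latency arithmetic. This sketch uses the illustrative ~500ms miss and ~0.01ms hit from the comments above. The function names are mine, not part of the library.

```python
def expected_latency_ms(hit_rate, miss_ms=500.0, hit_ms=0.01):
    """Average latency per call for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def cache_speedup(hit_rate, miss_ms=500.0, hit_ms=0.01):
    """Speedup relative to calling the API every time."""
    return miss_ms / expected_latency_ms(hit_rate, miss_ms, hit_ms)


for rate in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {rate:.2f}: {cache_speedup(rate):.1f}x")
```

At a 90% hit rate the misses dominate: one call in ten still pays the full API latency, so the average cannot drop below roughly a tenth of the uncached cost. At a 0% hit rate the speedup is exactly 1x.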

How it works under the hood

The design constraint that mattered most: nobody rewrites their agent to try an
accelerator. So the whole thing is built as compatible shims over Rust
implementations.

The shim patches LangGraph's hot functions at import time. apply_writes, the
channel batch update, gets a Rust version. The executor caching path reuses a
ThreadPoolExecutor across invocations instead of rebuilding it. Neither of those
is a huge single win (1.2x and 2.3x respectively), but they compose into that
~2.8x automatic number (1.2 × 2.3 ≈ 2.76) because they sit on the path every
invocation walks.
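The patch-with-fallback pattern itself is small. Here is a generic sketch of it, assuming a made-up module object and a made-up native extension name. It is not the fast-langgraph shim, and this apply_writes signature is not LangGraph's.

```python
import types

# Stand-in for a library module with one hot function (not LangGraph's API).
library = types.SimpleNamespace()


def apply_writes_python(state, writes):
    """Reference implementation: apply (key, value) writes to a copy of state."""
    merged = dict(state)
    for key, value in writes:
        merged[key] = value
    return merged


library.apply_writes = apply_writes_python


def load_accelerated():
    """Return a faster implementation, or None when it is unavailable."""
    try:
        import hypothetical_native_ext  # assumed name, not a real package
    except ImportError:
        return None
    return hypothetical_native_ext.apply_writes


def patch(module, loader=load_accelerated):
    """Swap in the fast path if it loads; otherwise leave Python in place."""
    fast = loader()
    if fast is None:
        return False
    module.apply_writes_original = module.apply_writes
    module.apply_writes = fast
    return True


patched = patch(library)  # no native extension here, so the fallback path runs
print("patched:", patched)
print(library.apply_writes({"a": 1}, [("b", 2)]))
```

Keeping a reference to the original function matters: it lets you unpatch, and it lets the fast path delegate back when it meets an input it does not handle.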

The manual components are where the large numbers live. RustSQLiteCheckpointer
replaces the serialize-and-persist path. langgraph_state_update does the state
merge in Rust with explicit append keys:

from fast_langgraph import langgraph_state_update

new_state = langgraph_state_update(
    current_state,
    {"messages": [new_message]},
    append_keys=["messages"]
)

That merge path is the 13-46x sustained-update number for long graphs that append
to messages hundreds of times.
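If the append_keys semantics are not obvious, this pure-Python reference shows the behavior I would expect from such a merge: keys listed in append_keys are extended, everything else is overwritten. It is an assumption about the semantics written for illustration, not the Rust implementation.

```python
def state_update(current, updates, append_keys=()):
    """Return a new state: append for append_keys, overwrite for the rest."""
    new_state = dict(current)
    for key, value in updates.items():
        if key in append_keys and key in new_state:
            # Build a new list so the previous state is left untouched.
            new_state[key] = list(new_state[key]) + list(value)
        else:
            new_state[key] = value
    return new_state


before = {"messages": ["hello"], "step": 1}
after = state_update(
    before,
    {"messages": ["world"], "step": 2},
    append_keys=["messages"],
)
print(after)
print(before)  # unchanged
```

Doing this in Python allocates a fresh list on every step, which is the overhead that grows when a graph appends to messages hundreds of times.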

There is also a profiler, so you can find your own bottleneck before you port
anything:

from fast_langgraph.profiler import GraphProfiler

profiler = GraphProfiler()
with profiler.profile_run():
    result = graph.invoke(input_data)
profiler.print_report()

Profile first. The wins are in specific hot paths, not "the code."

Where this does NOT fit

I would rather tell you where it loses than have you find out in production.

Small, simple state. If your agent state is a flat dict of a few keys, Python's C-implemented dict already wins. The Rust checkpoint path pays off when state is large and nested, and the speedup scales with size. On a toy graph you will see close to nothing.
All-unique prompts. The @cached decorator is a 10x win at a 90% hit rate. If every prompt is different, the cache is dead weight. It is for repeated prompts and RAG, not open-ended chat with high entropy inputs.
You are not checkpointing. Half the value is faster persistence. If your graph runs once and exits without saving state, you skip the operation that benefits most, and you are left with the smaller 2-3x end-to-end range.
You need a specific LangGraph internal that is not patched. The shim covers the hot paths, not the entire surface. Automatic mode is conservative for a reason, and it falls back to Python where it is not confident.

If your agents are short and cheap, this is not for you. If you run long,
stateful, checkpoint-heavy graphs in production, the serialization path is where
your time actually goes.

Takeaways

The bottleneck in a long agent run is usually the plumbing, not the model. deepcopy on big state is a real tax you pay every step.
Big speedups here are a data-and-serialization story, not a language story. 737x is a 250KB-state number and it scales down with your state.
Measure the small cases. Rust loses to C-backed Python dicts on flat state, and pretending otherwise gets you caught.
Make it a drop-in with automatic fallback, or it never gets adopted. One import, or one env var, is the whole onboarding.

Works with any LangGraph version, Python 3.9+. Code, the full benchmark
breakdown, and the architecture notes are here:
https://github.com/neul-labs/fast-langgraph

If you run stateful LangGraph agents at volume, I would love to know how big your
checkpoint state actually gets. Run the profiler, tell me your hot path. Kick the
tyres, issues welcome.

Why I built a C-like language for agent-written code

Dipankar Sarkar — Thu, 09 Jul 2026 15:24:38 +0000

Most of the code my tools write now is written by an agent and read by me.

That single shift breaks a lot of assumptions baked into the languages we use. C
trusts the author completely. It has no idea whether a function does I/O, whether
an integer just wrapped, or whether a build script is about to run curl | sh on
your laptop. When a human wrote every line, that trust was earned slowly. When an
LLM emits 200 lines in two seconds, the trust model is wrong.

So I built fastC. It is a small systems language that compiles to readable C11,
and it is designed around one question: what does a language look like when you
assume the author is an agent and the reviewer is a tired human?

The problem with C for generated code

C's failure modes are exactly the ones an agent hits.

Silent integer overflow is the classic one. Ask an open-weight model to "sum an
array" and it will happily hand you code that wraps at scale and returns a wrong
answer with no warning. I measured this. On a "sum 1..100000" task with GLM,
three trials each:

Go silently wrapped 3/3.
Rust wrapped 2/3.
fastC and Zig either refused to compile or computed correctly. Neither shipped a silently-wrong binary.
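To see why this task wraps at all, here is the arithmetic simulated in Python, assuming a 32-bit signed accumulator. The width is my assumption for illustration; the helper names are not from any of the languages above.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_i32(value):
    """Reduce an integer to two's-complement 32-bit, like a silent wrap."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def sum_wrapping_i32(n):
    """Sum 1..n with silent 32-bit wraparound."""
    total = 0
    for i in range(1, n + 1):
        total = wrap_i32(total + i)
    return total


def sum_checked_i32(n):
    """Sum 1..n, raising as soon as the total leaves the 32-bit range."""
    total = 0
    for i in range(1, n + 1):
        total += i
        if not INT32_MIN <= total <= INT32_MAX:
            raise OverflowError(f"i32 overflow while adding {i}")
    return total


true_sum = 100000 * 100001 // 2
print("true sum:   ", true_sum)                  # 5000050000
print("wrapped sum:", sum_wrapping_i32(100000))  # 705082704
try:
    sum_checked_i32(100000)
except OverflowError as err:
    print("checked sum:", err)
```

The true sum is 5,000,050,000, which does not fit in 32 bits. The wrapped result, 705,082,704, is a perfectly plausible-looking positive number, which is exactly why nobody notices.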

The other failure mode is ambient authority. In C, any function can open a socket
or read a file. There is no signal in the type of a function that says "this one
touches the network." When you are reviewing agent output, you want that signal
badly. You want to look at a signature and know the blast radius.

And then there is the build step. build.rs, build.zig, cgo all run
arbitrary code at build time. That is a supply-chain hole that an agent pulling
a dependency will walk straight into. fastC ships a side-by-side demo where
cargo build executes a malicious build.rs, and fastc.toml rejects the same
shape at parse time because it structurally cannot execute anything at build time.

The core idea: capabilities as typed arguments

The center of fastC is that I/O capabilities are typed function arguments, not
ambient powers.

If a function wants to reach the network, it must declare a CapNetConnect
parameter. A function that does not take that parameter structurally cannot open
a socket. The type system makes it impossible, not discouraged. The same holds
for reading files with CapFsRead and spawning processes with CapProcSpawn.
The capability set is finite and named.

The capability tokens can only be minted in main, through caps::init().
Library code never fabricates them, and the compiler has a fabrication check that
enforces exactly that. So when you read a fastC signature, the capability
arguments are an honest, compiler-checked manifest of what the function can do.

That changes code review. A pure transform has no cap arguments, and you can
trust that claim without reading the body. That is the whole wedge for reviewing
code you did not write.
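The shape of the idea can be sketched in Python as a loose analogy. This is not fastC syntax, and Python only checks at runtime what fastC checks at compile time. The init_caps, normalize, and fetch_banner names are hypothetical.

```python
class CapNetConnect:
    """Token that stands for permission to open outbound connections."""

    _minting_allowed = False

    def __init__(self):
        # Runtime stand-in for the compiler's fabrication check.
        if not CapNetConnect._minting_allowed:
            raise PermissionError("capabilities can only be minted by init_caps()")


def init_caps():
    """The single place a capability is created (the analogue of main)."""
    CapNetConnect._minting_allowed = True
    try:
        return CapNetConnect()
    finally:
        CapNetConnect._minting_allowed = False


def normalize(text: str) -> str:
    """Pure transform: no capability parameter, so no I/O by construction."""
    return text.strip().lower()


def fetch_banner(net: CapNetConnect, host: str) -> str:
    """Needs the network, and says so in its signature."""
    if not isinstance(net, CapNetConnect):
        raise TypeError("fetch_banner requires a CapNetConnect")
    return f"would connect to {host}"  # no real socket is opened in this sketch


net = init_caps()
print(normalize("  Hello "))
print(fetch_banner(net, "example.org"))
```

Reading only the two signatures tells you which function can reach the network. That is the review property the capability model is after.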

Hello world, and how it differs from C

Here is the smallest program.

fn main() -> i32 {
    return 0;
}

Compile it and run it through a normal C compiler:

echo 'fn main() -> i32 { return 0; }' > hello.fc
fastc compile hello.fc -o hello.c
cc hello.c -o hello && ./hello && echo OK

fastC emits C11. That is deliberate. It means any C11-capable cross-compiler can
build a fastC program for its target. The stripped hello binary is 53 KB, in the same class
as C and Zig, not the 342 KB of a Rust hello or the 2.4 MB of a Go one. A fastC
binary that does real work fits inside a container cold-start budget and inside
embedded flash. That is a real constraint for a lot of the systems I care about.

The differences from C that matter day to day:

Memory safety without a garbage collector.
Capability-typed I/O, as above.
Mandatory pre- and postconditions on public APIs, @requires and @ensures, discharged through a three-tier pipeline: a syntactic checker that is always on, an opt-in Z3 tier for linear-integer tautologies, and a runtime fallback. Proven obligations cost zero at runtime.
Vendored, content-hashed dependencies with no central registry and no executable build steps.

Contracts, concretely

Contracts are not a bolt-on. They are compile-time obligations on public APIs.

fastc compile --prove program.fc

That runs every @requires and @ensures clause through the three tiers. The
syntactic discharger constant-folds and catches tautological comparisons with no
Z3 dependency at all. The Z3 tier handles linear-integer cases with a 500 ms
per-obligation budget. Anything neither tier can prove falls back to a runtime
fc_trap. The build emits a discharge.json per-obligation report with
proven/runtime/unknown counts and the tier that handled each one. If Z3 is not on
your PATH, the obligation degrades to the runtime tier with a structured reason.
The build does not break.

The point is that "prove what you can, trap the rest, never silently drop a
check" is the right default when you cannot assume the author reasoned carefully.
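The control flow of a three-tier discharge is easy to sketch. This toy version only understands obligations of the form (lhs, op, rhs) and takes the solver as an optional callable. It is an illustration of the pipeline shape under those assumptions, not the fastC compiler's logic or its discharge.json format.

```python
import operator

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def syntactic_tier(lhs, op, rhs):
    """Prove only what needs no solver: constant folds and x <= x style facts."""
    if isinstance(lhs, int) and isinstance(rhs, int):
        return COMPARATORS[op](lhs, rhs)
    if isinstance(lhs, str) and lhs == rhs and op in ("<=", ">=", "=="):
        return True
    return False


def discharge(obligation, solver=None):
    """Try each tier in order and never drop the check."""
    lhs, op, rhs = obligation
    if syntactic_tier(lhs, op, rhs):
        return {"status": "proven", "tier": "syntactic"}
    if solver is None:
        # Degrade with a structured reason instead of failing the build.
        return {"status": "runtime", "tier": "runtime", "reason": "solver unavailable"}
    if solver(obligation):
        return {"status": "proven", "tier": "solver"}
    return {"status": "runtime", "tier": "runtime", "reason": "not proven by solver"}


print(discharge((3, "<", 5)))
print(discharge(("n", "<=", "n")))
print(discharge(("n", ">=", 0)))
print(discharge(("n", ">=", 0), solver=lambda ob: True))
```

Every path returns a record, so an unproven obligation always ends up as a runtime check with a reason attached rather than disappearing.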

The honest numbers

fastC builds a release binary around 30 to 40 percent faster than Rust does.
Runtime matches C on floating-point-heavy work. It is about 26 percent
slower on recursive integer work like fib(40) because of overflow-check cost.
That is the trade. You pay a little on integer-heavy recursion to never ship a
silent wrap.

On agent first-compile success for a sum_array task across four open-weight
models, fastC scored 12/12, matching or beating C, Rust, Zig, and Go. But there
is a sharp caveat in that number, and it leads into the trade-offs.

Where fastC does not fit

This is the part that matters most, so I will be blunt.

The agent success number is only 12/12 when the cheatsheet shipped with the
prompt is faithful. An earlier run against an inaccurate cheatsheet scored 0/9.
The language is new, so models do not know it from training. You have to feed
them an accurate worked example and a "common mistakes" guide. Without that
scaffolding, an agent flounders, because there is no Stack Overflow corpus for
fastC yet. That is a real adoption cost.

fastC also loses on ecosystem. Rust has 150K crates. fastC has an eleven-package
core set in preview. If your project needs a mature library for something
specific, you will be writing FFI bindings or doing without.

The contract and capability model is friction. If your team ships internal
scripts fast and does not review agent output carefully, all of this ceremony is
overhead you will resent. fastC pays off precisely when the review burden is real
and the blast radius matters.

And on raw single-threaded integer throughput, plain C at -O2 is still ahead. If
that 26 percent is your hot path and you trust your authors, use C.

The honest framing, including which rows fastC loses on, is in the repo's
MANIFESTO.md. I would rather you read that than my pitch.

Takeaways

Assume the author is an agent, and the language design changes. You want I/O in the type signature and you want overflow to be loud.
Compiling to readable C11 buys you portability and small binaries for free.
"Prove what you can, trap the rest" beats both "assume it is fine" and "block the build."
New languages have no training corpus. That is a first-class adoption problem, not a footnote.

fastC is v1.0 feature-complete with 337+ passing tests. It is MIT licensed.

Repo: https://github.com/fastc-lang/fastc

The honest ask: clone it, run the supply-chain demo and the cross-lang
benchmarks, and tell me where the capability model gets in your way. The failure
reports are more useful to me than the stars. Issues welcome.

Learning how LLMs actually work by building 18 of them in Zig

Dipankar Sarkar — Tue, 07 Jul 2026 10:27:17 +0000

Most people who ship features on top of LLMs have never seen the inside of one. I
was in that group. I could call an API, tune a prompt, quantize a GGUF file, and
still not tell you what RMSNorm does or why RoPE exists. The model was a black box
with a temperature knob.

The tutorials did not help. They either hand you a 40-line PyTorch script that hides
everything behind nn.Linear, or they drop you into a production C++ engine where
the math is buried under kernel dispatch and memory pools. Neither one lets you see
the whole path from a tensor to a token.

So I built zigllm: an educational implementation of transformer architectures in
Zig, from raw tensor operations all the way up to text generation. It implements 18
model families (LLaMA, Mistral, GPT-2, Falcon, Mamba, BERT and more) and ships 285+
tests that double as executable documentation. Every component is written to teach
why it works, not just to run fast.

The core idea: six layers, each built on the last

The whole project is organized as a stack. You start at the bottom and work up, and
each layer only depends on the ones below it:

 6. Inference         Text generation, sampling, KV caching, streaming
 5. Models            LLaMA, GPT-2, Mistral, Falcon, GGUF loading, tokenization
 4. Transformers      Multi-head attention, feed-forward networks, full blocks
 3. Neural Primitives Activations (SwiGLU, GELU), normalization (RMSNorm), RoPE
 2. Linear Algebra    SIMD matrix ops, K-quantization, IQ-quantization (18+ formats)
 1. Foundation        Tensors, memory management, memory mapping

That ordering is the pedagogy. You cannot understand attention until you understand
matrix multiplication. You cannot understand a full LLaMA block until you understand
attention plus normalization plus RoPE. By the time you reach layer 6 and generate
your first token, nothing above you is magic, because you built every layer under it.

The 18 architectures on top of that stack cover roughly 80% of real-world LLM usage.
The goal is not to run all of them in production. In practice LLaMA is the one path
wired end to end for real inference; the other families are there to read and
compare. You get to see that LLaMA, Mistral, Falcon and
GPT-2 are mostly the same machine with different knobs, and that Mamba and BERT are
genuinely different animals.

Why Zig, and how the engineering holds up

Zig is an odd choice until you sit with it. It gives you manual memory control,
comptime generics, and first-class SIMD, all with no runtime and no garbage
collector. That combination is exactly what inference wants. You get deterministic
memory (which matters a lot when you are learning, because you can see every
allocation) and you get SIMD intrinsics without dropping into inline assembly.

The optimizations in the repo are the real ones you find in production engines,
just written to be readable:

KV caching, so you do not recompute attention over the whole sequence for every new token.
SIMD acceleration (AVX, AVX2, NEON auto-detection), for a 3-5x speedup on matrix ops.
18+ quantization formats, up to 95% memory reduction.
Memory-mapped model loading, so a large GGUF file does not have to be read into RAM up front.

Each of those is a lesson. KV caching teaches you why generation is quadratic
without it. Quantization teaches you the trade between memory and precision in a way
no blog post ever made click for me.
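The KV caching lesson fits in a toy. The sketch below is Python rather than Zig, uses a fake scalar "projection" instead of learned weights, and is not zigllm code. It generates the same attention outputs with and without a cache and counts how many key/value projections each path computes.

```python
import math


def project(vector, scale):
    """Stand-in for a learned K or V projection (a scalar scale, for brevity)."""
    return [scale * v for v in vector]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over the given keys and values."""
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        for key in keys
    ]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def generate(tokens, use_cache):
    """Run attention at every step; return outputs and K/V projection count."""
    projections = 0
    cache_k, cache_v = [], []
    outputs = []
    for t, token in enumerate(tokens):
        if use_cache:
            # Only the new token is projected; earlier ones are reused.
            cache_k.append(project(token, 0.5))
            cache_v.append(project(token, 2.0))
            projections += 1
            keys, values = cache_k, cache_v
        else:
            # Every earlier token is projected again at every step.
            keys = [project(x, 0.5) for x in tokens[: t + 1]]
            values = [project(x, 2.0) for x in tokens[: t + 1]]
            projections += t + 1
        outputs.append(attend(token, keys, values))
    return outputs, projections


tokens = [[float(i), 1.0] for i in range(64)]
out_plain, work_plain = generate(tokens, use_cache=False)
out_cached, work_cached = generate(tokens, use_cache=True)
print("same outputs:", out_plain == out_cached)
print("projections without cache:", work_plain)  # 2080 = 64 * 65 / 2
print("projections with cache:   ", work_cached)  # 64
```

The outputs are identical, but the uncached path does n(n+1)/2 projections for n tokens while the cached path does n. The attention step itself still looks at every earlier token; what the cache removes is recomputing their keys and values.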

What using it looks like

Getting started is one clone and one command, because the tests are the tutorial:

git clone https://github.com/cognisoc/zigllm.git
cd zigllm
zig build test                    # All 285+ tests
zig build test-foundation         # Foundation layer only
zig build test-linear-algebra     # Linear algebra layer only

Running a single layer's tests is how you study that layer. Each test demonstrates a
concept and validates the math, so a failing assertion is a lesson about what the
math actually requires.

The examples are the guided tour, and the inference layer's own docs
(docs/06-inference/) walk the per-token cost down as each optimization comes on:

Without optimizations:  ~200ms/token   (naive)
With KV caching:         ~50ms/token    (4x)
With all optimizations:  ~5ms/token     (40x total)

That is the whole value of building it yourself. The number is not a claim in a
README, it is something you can reproduce and then explain: KV caching alone is the
4x, and the rest comes from SIMD and quantization stacked on top.

The same docs are honest about memory, and this is worth internalizing. For a
7B-class model in Q4 the parameters are ~3.5GB, but the KV cache for a 2k context
adds ~1.5GB per sequence, so the real floor is ~8GB, not 3.5GB. The parameter count
is the number people quote; the working set is the number that OOMs your box. Seeing
both fall out of code you wrote is the point.
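Both numbers are back-of-envelope arithmetic you can redo yourself. The parameter figure follows directly from the bit width. The KV cache figure depends on model shape and cache precision, so the sketch below assumes a LLaMA-7B-like shape (32 layers, 32 heads of dimension 128) and a 16-bit cache. Those are assumptions for illustration, not values taken from the zigllm docs, so the result lands in the same ballpark as the ~1.5GB above rather than on it.

```python
def param_bytes(n_params, bits_per_weight):
    """Memory for the weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Memory for one sequence's KV cache: K and V per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


weights = param_bytes(7e9, bits_per_weight=4)
cache = kv_cache_bytes(
    n_layers=32,        # assumed LLaMA-7B-like shape
    n_kv_heads=32,
    head_dim=128,
    context_len=2048,
    bytes_per_value=2,  # assumed 16-bit cache entries
)

print(f"weights at 4 bits: {weights / 1e9:.2f} GB")
print(f"KV cache, 2k ctx:  {cache / 1e9:.2f} GB per sequence")
print(f"weights + cache:   {(weights + cache) / 1e9:.2f} GB before activations and overhead")
```

Change bytes_per_value or context_len and watch the cache term move while the weight term stays put. That is the part of the working set the parameter count never tells you about.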

There are more examples in examples/: simple_demo.zig for the overview,
gguf_demo.zig for loading a real pre-trained model, and
model_architectures_demo.zig for comparing all 18 families side by side.

Where this does NOT fit

This is the part that matters, so I will be blunt.

zigllm is not a production inference engine, and you should not deploy it as one.
Its own parity analysis rates it around 40% of llama.cpp on production concerns:
quantization coverage is a fraction of what llama.cpp offers, hardware acceleration
is CPU-focused, and broad multi-model serving is not the goal. If you need to serve
a 70B model to real users, use llama.cpp or vLLM. That is what they are for.

It is also not the fastest thing you can write in Zig. The README is explicit that
educational clarity takes priority over micro-optimization. When there was a choice
between a readable loop and a clever one, the readable one won. That is correct for
a teaching project and wrong for a serving one.

And if you do not want to learn the internals, this buys you nothing over just
running Ollama. There is no shortcut here. The value is entirely in reading the code
and running the tests. If you skip that, you skip the whole thing.

Takeaways

The fastest way to stop treating LLMs as magic is to build one from tensors up, in a language that does not hide memory from you.
A progressive layer stack (foundation, linear algebra, primitives, transformers, models, inference) is a better teaching order than any single top-down script.
Tests as executable documentation is underrated. 285+ tests that each prove one concept beat any amount of prose.
Zig turns out to be a genuinely good fit for this: comptime, SIMD, and manual memory, no runtime in the way.

The code, the docs for all six layers, and every example are here:
https://github.com/cognisoc/zigllm

If you have wanted to actually understand attention and KV caching instead of just
using them, clone it and run zig build test-transformers. If a layer's explanation
is unclear, that is a bug in the teaching, and I would genuinely like the issue.
Kick the tyres, PRs that improve educational clarity are the most welcome kind.

Sandboxing untrusted agent code with gVisor costs ~200ms per cold start. Blocking syscalls instead of emulating them costs ~8ms

Dipankar Sarkar — Sun, 05 Jul 2026 16:24:18 +0000

You are running code you did not write. It might be an AI agent executing an
LLM's output, a CI job running npm install across dependencies nobody audited,
or a plugin that insists it needs shell access. A normal container does almost
nothing for you here. It is namespaces and cgroups, and the full kernel attack
surface is still sitting right there. Every runc escape CVE is the reminder.

The usual heavy answer is gVisor. It puts a userspace kernel in front of the
workload and emulates syscalls. It works. It also costs you 5-250x syscall
overhead, roughly 200ms cold starts, and about 50MB per container. For a
high-throughput API or a serverless function, that overhead is the whole budget.

zviz takes a different route. It is an OCI-compatible container runtime
written in Zig that uses selective denial instead of emulation. Most syscalls
reach the host kernel at native speed. A small set of dangerous ones is blocked
by the filter before the syscall's kernel implementation runs. One, socket, is
argument-filtered inline. No userspace
kernel, no emulation, no daemon.

The core idea: allow, deny, broker

gVisor's model is that every syscall goes through its Sentry process, which
emulates it. zviz's model is a filter that makes one of three decisions per
syscall:

gVisor:  App -> Sentry (emulates ~300 syscalls) -> Host kernel (~70 syscalls)
zviz:    App -> BPF filter -> ALLOW (native speed) / DENY (EPERM) / BROKER (mediated)

Allowed syscalls hit the kernel directly, so they run at native speed. Dangerous
ones are denied immediately with EPERM, so exploit code fails on the spot rather
than being safely emulated. A tiny set gets routed to a userspace broker for
inspection. The socket syscall is filtered on its arguments inline.

The philosophical difference matters for compatibility. gVisor emulates a
dangerous syscall safely. zviz refuses it. Both isolate, but the failure modes
are opposite: emulation stays compatible, denial stays strict and fast.
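As a mental model, the three-way decision is a small lookup. The sketch below is Python, not BPF, and the sets are illustrative: ptrace, mount, and unshare are syscalls this article describes as denied, while the brokered entry and the permitted socket families are placeholders I made up. The allow-by-default fallthrough mirrors the description above and may not match how the real filter is written.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"    # goes straight to the host kernel
    DENY = "deny"      # fails immediately with EPERM
    BROKER = "broker"  # handed to a userspace broker for inspection


DENIED = frozenset({"ptrace", "mount", "unshare"})
BROKERED = frozenset({"example_brokered_call"})  # placeholder, not a real syscall
ALLOWED_SOCKET_FAMILIES = frozenset({"AF_UNIX", "AF_INET", "AF_INET6"})  # assumption


def decide(syscall, socket_family=None):
    """Return the verdict for one syscall, filtering socket on its argument."""
    if syscall in DENIED:
        return Verdict.DENY
    if syscall in BROKERED:
        return Verdict.BROKER
    if syscall == "socket":
        # Argument filter: the decision depends on what is being requested.
        if socket_family in ALLOWED_SOCKET_FAMILIES:
            return Verdict.ALLOW
        return Verdict.DENY
    return Verdict.ALLOW


for call, family in (
    ("clock_gettime", None),
    ("ptrace", None),
    ("socket", "AF_INET"),
    ("socket", "AF_PACKET"),
    ("example_brokered_call", None),
):
    print(call, family or "", "->", decide(call, family).value)
```

The fast path is the last line of decide: anything not singled out returns ALLOW without further work, which is where the native-speed claim comes from.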

How it works

The enforcement is five layers, applied in a specific order, and the order is
the interesting part:

Layer	Mechanism	Purpose
1	Namespaces (user, pid, mount, ipc, uts)	Resource isolation
2	Capabilities (all 41 dropped)	Privilege elimination
3	Landlock LSM	Filesystem access control
4	Seccomp-BPF (124 instructions)	Syscall filtering
5	cgroups v2	Resource limits

Capabilities drop before seccomp loads, and Landlock applies before seccomp,
so the security setup syscalls do not get caught by the very filter they are
installing. Get that ordering wrong and the runtime blocks itself while arming.
The default profile drops all 41 Linux capabilities, applies a Landlock ruleset,
mounts /proc, /sys, and /dev privately, and runs the workload as PID 1 of a
fresh user, PID, mount, IPC, and UTS namespace.

The whole seccomp policy is 124 BPF instructions. That is small enough to audit
by hand, which is a real security property. You can read the exact filter that
stands between untrusted code and your kernel.

Running it

You build with Zig 0.15.0+ on Linux 5.13+ (5.13 is where Landlock landed), then
run any OCI bundle. The README's example extracts a Redis image and runs it:

git clone https://github.com/Skelf-Research/zviz.git
cd zviz && zig build -Doptimize=ReleaseSafe

# Build an OCI bundle from any Docker image
mkdir -p ~/zviz-bundle/rootfs
docker create --name extract redis:alpine
docker export extract | tar -C ~/zviz-bundle/rootfs -xf -
docker rm extract

# Run it, verbose logs every blocked syscall
./zig-out/bin/zviz run my-container ~/zviz-bundle --verbose

The --verbose flag logs every blocked syscall, which is exactly what you need
when an agent workload hits a restriction you did not expect. There are built-in
profiles for the common cases: ci-runner (the balanced default), web-server
(network allowed), batch-job (no network, 8G memory), hostile-tenant
(maximum restrictions), and development (allows ptrace, explicitly not for
production).

zviz auto-mounts the pseudo-filesystems so you do not have to declare them:
/proc as procfs with nosuid, nodev, and noexec; /sys as read-only sysfs; and
/dev as a private tmpfs with the standard device nodes bind-mounted in.

The numbers

The README reports these, tested against real escape techniques:

Metric	zviz	gVisor
Escape tests blocked	19/19 (100%)	11/19 (58%)*
Cold start	~8ms	~200ms
Memory per container	~2MB	~50MB
Policy compatibility	98.2%	baseline

Syscall latency is where selective denial pays off. clock_gettime is 20ns on
zviz versus 4,982ns on gVisor, a 249x gap, because zviz lets it hit the kernel
directly while gVisor routes it through Sentry. read is 20.7x faster, getpid
4.1x. The asterisk on the escape numbers is important and the README is honest
about it: gVisor "allows" some syscalls like ptrace and mount but emulates them
safely, which is a different philosophy with an equivalent security outcome for
those operations, not a straight loss.

Where it does not fit

The README has a whole "when to use gVisor instead" section, which is the right
instinct.

If your workload needs ptrace, zviz blocks it. strace, debuggers, and anything
that traces another process will not run. gVisor emulates it safely, so for
debugging-heavy workloads gVisor wins. If you need mount or unshare for
Docker-in-Docker, or you run Bazel or Nix builds that create their own internal
namespaces, zviz denies the syscalls those need. Nested containerization is a
gVisor job.

The 1.8% policy gap is a deliberate choice: zviz defaults network egress to
deny, gVisor allows it. That is stricter, but it means a workload that expects
outbound network fails closed until you pick a profile that opens it. On Ubuntu
24.04+ there is an extra setup step, because the kernel's
apparmor_restrict_unprivileged_userns sysctl blocks the bind mount
pivot_root needs. Without installing the provided AppArmor profile, zviz falls
back to chdir-only filesystem isolation, which is weaker. And it is Linux-only,
kernel 5.13+, cgroups v2 required. There is no macOS or Windows story.

The honest framing from the README: if you need nested containers or process
tracing, use gVisor. Otherwise zviz is faster and stricter.

Takeaways

Selective denial beats emulation on speed because allowed syscalls hit the kernel at native speed. That is the 249x clock_gettime gap.
Denial also fails safer for exploits. Malicious code gets EPERM immediately rather than a safely-emulated success.
Layer ordering is load-bearing. Capabilities and Landlock go before seccomp so the runtime does not block its own setup.
The cost is compatibility. No ptrace, no nested containers, no Bazel/Nix internal sandboxing. That is the trade for ~8ms cold starts and a 124-instruction filter you can actually read.

If you run untrusted or agent-generated code and your workloads do not need
ptrace or nested containers, zviz is worth a benchmark against your current
sandbox. Code, the threat model, and the comparison docs are here:
https://github.com/Skelf-Research/zviz

I am curious which blocked syscall trips up the first real agent workload you
throw at it. Run it with --verbose and open an issue with the trace.

We let an AI agent hit a database 1034 times. Text-to-SQL ran 23 unsafe ops. The policy layer ran zero

Dipankar Sarkar — Thu, 02 Jul 2026 16:23:54 +0000

The moment you give an AI agent database access, you inherit every question a junior
engineer with production credentials raises, except the agent does not get tired and
does not ask permission.

What if it reads a column full of PII. What if it writes a query that scans the whole
table and bills you for it. What if a user tricks it, through prompt injection, into
deleting rows. How do you prove, after the fact, what it actually did.

The common answer is text-to-SQL: let the model write SQL, run it. That is exactly
the wrong layer to enforce safety, because by the time you have a SQL string you have
already lost. We benchmarked it against the Spider dataset, 1034 natural language
queries. Text-to-SQL executed 23 unsafe operations and left no audit trail. SQL
injection was possible.

ormai is a different bet. Enforce at the ORM layer, not the prompt layer. On the
same 1034 queries it executed 0 unsafe operations, SQL injection was not possible,
and every call was logged.

The core idea: agents get typed tools, not a SQL prompt

OrmAI wraps your existing ORM models in a policy-enforced runtime. The agent never
sees or writes raw SQL. Instead it gets a small set of typed tools:

Read-safe: db.query, db.get, db.aggregate, db.describe_schema
Write-safe: db.create, db.update, db.delete, db.bulk_update, each gated by policy

Every request the agent makes is compiled into a parameterized ORM query. The
database never receives an agent-authored SQL string. That single architectural choice
is what closes the SQL injection hole: there is no string to inject into.
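
To make "no string to inject into" concrete, here is a small sketch using only the standard library sqlite3 module. It is not OrmAI code. The table and the hostile input are made up for illustration, and it just contrasts string-built SQL with a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("alice",), ("bob",)])

# Input an attacker might smuggle in through a prompt.
hostile = "nobody' OR '1'='1"

# String-built SQL: the input becomes part of the statement and rewrites the filter.
unsafe_sql = f"SELECT name FROM customers WHERE name = '{hostile}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameterized: the input is bound as a value and only ever compared as a string.
safe_rows = conn.execute(
    "SELECT name FROM customers WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # prints: 2 0
```

The string-built query returns every row because the OR clause is parsed as SQL. The parameterized one returns nothing, because no customer is literally named that.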

How it works: a runtime between the agent and the ORM

OrmAI sits as a layer between the agent and your ORM. A request flows through three
enforcement stages before it ever reaches the database:

Your Agent
   | calls a typed tool
OrmAI Runtime
   [ Policy Enforcer ] [ Audit Logger ] [ Tenant Scope Filter ]
   | parameterized queries only
Your ORM (SQLAlchemy / Prisma / Drizzle / ...)

The policy enforcer decides what is allowed: which models, which fields, which
operations, and how much. Field-level policies hide passwords and mask emails
automatically. Query budgets cap row counts, include depth, and statement time, so a
runaway query cannot run away. Writes are off unless you explicitly enable them, and
you can require a reason or human approval for sensitive models.
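
As a rough picture of what field-level policy plus a row budget does to a result set, here is a minimal sketch. FieldPolicy, mask, and apply_policy are hypothetical names of mine, not OrmAI's classes, and a real runtime would push the row cap into the query instead of trimming afterwards.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class FieldPolicy:
    """Hypothetical policy object for illustration only."""
    deny_patterns: tuple[str, ...] = ("*password*", "*secret*", "*token*")
    mask_fields: tuple[str, ...] = ("email",)
    max_rows: int = 100


def mask(value: str) -> str:
    """Keep the first character and hide the rest."""
    return value[:1] + "***" if value else value


def apply_policy(rows: list[dict], policy: FieldPolicy) -> list[dict]:
    """Drop denied fields, mask sensitive ones, and enforce the row budget."""
    cleaned = []
    for row in rows[: policy.max_rows]:  # row budget
        out = {}
        for field, value in row.items():
            if any(fnmatch(field, pattern) for pattern in policy.deny_patterns):
                continue  # denied fields never reach the agent
            if field in policy.mask_fields:
                value = mask(str(value))
            out[field] = value
        cleaned.append(out)
    return cleaned


rows = [{"id": 1, "email": "ann@example.com", "password_hash": "x9f"}]
print(apply_policy(rows, FieldPolicy()))  # prints: [{'id': 1, 'email': 'a***'}]
```

The point of the glob patterns is that a column added later, say reset_token, is denied without anyone remembering to update the policy.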

The tenant scope filter auto-injects a tenant predicate into every query.
.tenantScope('tenant_id'), or .tenant_scope("tenant_id") in the Python builder below,
means a request in tenant A's context physically cannot return tenant B's rows,
whether or not the agent remembered to filter. Isolation is
built in, not bolted on and not left to the prompt.
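
The mechanism is simple enough to sketch. In this illustration (sqlite3 from the standard library, with a hypothetical query_orders tool that is not OrmAI's implementation) the tenant comes from the request context, the agent only supplies filters, and the tenant predicate is always the first clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (tenant_id, total) VALUES (?, ?)",
    [("tenant_a", 10.0), ("tenant_a", 25.0), ("tenant_b", 99.0)],
)

# Columns the agent may filter on. tenant_id is deliberately not one of them.
ALLOWED_FILTER_COLUMNS = {"id", "total"}


def query_orders(context_tenant: str, agent_filters: dict) -> list[tuple]:
    """Hypothetical tool: the tenant predicate is injected here, not by the agent."""
    clauses = ["tenant_id = ?"]
    params: list = [context_tenant]
    for column, value in agent_filters.items():
        if column not in ALLOWED_FILTER_COLUMNS:
            raise ValueError(f"filter on {column!r} is not allowed")
        clauses.append(f"{column} = ?")  # column name comes from the allowlist only
        params.append(value)
    sql = "SELECT id, tenant_id, total FROM orders WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()


print(query_orders("tenant_a", {}))               # two rows, both tenant_a
print(query_orders("tenant_a", {"total": 99.0}))  # prints: []
```

The second call asks for the one order that belongs to tenant B and gets nothing back, which is the behaviour the paragraph above describes.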

The audit logger records every call with the principal, tenant, trace ID, input,
and output. When someone asks "what did the agent do", you have the answer.
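
For a sense of what such an audit record might contain, here is a minimal sketch of a logging decorator. The names audited, AUDIT_LOG, and get_order are hypothetical, and the in-memory list stands in for whatever durable sink a real deployment would use.

```python
import functools
import json
import uuid
from datetime import datetime, timezone

AUDIT_LOG: list[str] = []  # stand-in sink; a real system would persist this


def audited(tool_name: str):
    """Record principal, tenant, trace ID, input, and output for every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(principal: str, tenant: str, **tool_input):
            record = {
                "trace_id": str(uuid.uuid4()),
                "at": datetime.now(timezone.utc).isoformat(),
                "tool": tool_name,
                "principal": principal,
                "tenant": tenant,
                "input": tool_input,
            }
            try:
                record["output"] = func(principal, tenant, **tool_input)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)  # failed calls are logged too
                raise
            finally:
                AUDIT_LOG.append(json.dumps(record, default=str))
        return wrapper
    return decorator


@audited("db.get")
def get_order(principal: str, tenant: str, *, order_id: int) -> dict:
    return {"id": order_id, "tenant": tenant}  # placeholder for a real lookup


get_order("agent-7", "tenant_a", order_id=42)
print(AUDIT_LOG[-1])
```

Writing the record in a finally block means a call that raises still leaves a trace, which is usually the call you most want to find later.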

You describe all of this with a PolicyBuilder, which reads like the security review
you would want to write anyway:

from ormai.utils import PolicyBuilder, DEFAULT_PROD

policy = (
    PolicyBuilder(DEFAULT_PROD)
    .register_models([Customer, Order])
    .deny_fields("*password*", "*secret*", "*token*")
    .mask_fields(["email", "phone"])
    .tenant_scope("tenant_id")
    .enable_writes(models=["Order"], require_reason=True)
    .build()
)

Presets ship for the common postures: DEFAULT_DEV is permissive, DEFAULT_INTERNAL
is moderate, DEFAULT_PROD is strict. You start from one and tighten.

A concrete wiring

The Python path is deliberately short. Point it at your existing SQLAlchemy engine and
session, hand it a policy, and you get a toolset the agent can use:

from ormai.quickstart import mount_sqlalchemy
from ormai.utils import DEFAULT_DEV

# your existing SQLAlchemy models + session
toolset = mount_sqlalchemy(
    engine=engine,
    session_factory=Session,
    policy=DEFAULT_DEV,
)

# done. the agent now has db.query, db.get, db.aggregate, db.describe_schema

It is not Python-only. There is a TypeScript/Node.js implementation too, with the same
policy model expressed through the same builder pattern, and it plugs into Vercel AI
SDK, LangChain.js, and MCP among others. On the ORM side the coverage is wide: on
Python it supports SQLAlchemy, SQLModel, Django ORM, Tortoise, and Peewee; on
TypeScript it supports Prisma, Drizzle, and TypeORM. Schema introspection is automatic,
so it discovers your models, fields, relations, and primary keys rather than making
you redeclare them.

Where it does not fit

The mandatory honesty section.

It constrains what your agent can do, on purpose. OrmAI is about safe, typed,
policy-bounded access. If your use case genuinely needs arbitrary analytical SQL,
window functions, recursive CTEs, hand-tuned joins, the typed tool surface will feel
like a cage. That is the trade you are buying: expressiveness for safety.

The policy is only as good as you write it. OrmAI enforces your rules. It does
not guess which fields are sensitive. Ship a policy that forgets to deny a secret
column and the runtime will faithfully expose it. The presets help, but the review
is still yours.

It is another layer in the request path. Requests compile through the runtime
before hitting the ORM. For most agent workloads that overhead is irrelevant next to
the model call, but if you are chasing raw database latency on a hot path, measure
it.

It assumes you use one of the supported ORMs. The value comes from wrapping an
existing ORM. If your data access is hand-rolled SQL or an unsupported ORM, there is
nothing for OrmAI to wrap yet.

It reduces risk, it does not eliminate it. Zero unsafe ops on the Spider
benchmark is a strong signal, not a proof for every schema and every prompt. Treat
it as defense in depth, and keep your database-level permissions tight anyway.

Takeaways

Safety belongs at the ORM layer, not the prompt layer. Once you have a SQL string from a model, you have already lost the injection fight.
Typed tools plus parameterized queries is what took SQL injection off the table in the 1034-query benchmark, not a cleverer prompt.
Field masking, tenant scoping, and query budgets are the three controls that turn "the agent can read the database" into "the agent can read exactly what policy allows".
Audit everything. The question "what did the agent do" only has a good answer if you logged the principal, tenant, and trace on every call.

Code, the Spider benchmark demo, and the policy docs are here:
https://github.com/neul-labs/ormai

If you have wired an agent to a production database, I want to know how you are
scoping tenants today. Kick the tyres, issues welcome.

I put a Rust layer under LiteLLM. Here is where it actually helped (and where it did not)

Dipankar Sarkar — Wed, 01 Jul 2026 22:42:18 +0000

LiteLLM is the glue a lot of us reach for when an app has to talk to more than one
model provider. One interface, dozens of backends. It is great. But once you run it
under real load, the hot path stops being the model call and starts being the
plumbing around it: connection pooling, rate limiting, token counting on big
inputs. That plumbing is pure Python, and under concurrency it shows.

So I built fast-litellm: a drop-in Rust acceleration layer that swaps the hot
paths out for PyO3 extensions and falls back to Python everywhere else. This is the
honest write-up, including the cases where Rust lost.

Lead with the numbers, even the bad ones

These compare production-grade Python (thread-safe implementations) against the
Rust versions:

Component	Result
Connection pool	3.2x faster (lock-free DashMap)
Rate limiting	1.6x faster (atomic ops)
Large-text token counting	1.5 to 1.7x faster
High-cardinality rate limits (1000+ keys)	42x less memory
Small-text token counting	0.5x, Python wins (FFI overhead dominates)
Routing with complex Python objects	0.4x, Python wins

Those last two rows are the important part. Crossing the Python-to-Rust boundary is not
free. For a 12-token chat message, the FFI overhead is bigger than the work you
saved, so Rust loses. Anyone who tells you their native extension is faster at
everything is not measuring the small cases.
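
A toy cost model shows why a fixed boundary cost flips the result for small inputs. Every constant below is invented for illustration and is not a fast-litellm measurement. The only assumption is that the native path pays a fixed per-call overhead and then does less work per token.

```python
# Arbitrary time units. All three constants are made-up illustration values.
PY_PER_TOKEN = 1.0       # assumed Python cost per token
NATIVE_PER_TOKEN = 0.6   # assumed native cost per token
CALL_OVERHEAD = 20.0     # assumed fixed cost of crossing the boundary once


def python_cost(tokens: int) -> float:
    return tokens * PY_PER_TOKEN


def native_cost(tokens: int) -> float:
    return CALL_OVERHEAD + tokens * NATIVE_PER_TOKEN


def speedup(tokens: int) -> float:
    """Above 1.0 the native path wins, below 1.0 Python wins."""
    return python_cost(tokens) / native_cost(tokens)


def break_even_tokens() -> float:
    """Input size at which both paths cost the same."""
    return CALL_OVERHEAD / (PY_PER_TOKEN - NATIVE_PER_TOKEN)


for size in (12, 100, 10_000):
    print(size, round(speedup(size), 2))
print("break-even at about", round(break_even_tokens()), "tokens")
```

Under these made-up numbers the 12-token call is slower through the native path and the 10,000-token call is faster. The shape is what matters: below the break-even size, the overhead is the whole story.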

Where it wins, it wins because of data structures, not because "Rust is fast." The
connection pool uses a lock-free DashMap so concurrent workers stop serializing
on a global lock. The high-cardinality rate limiter holds 1000+ unique keys in a
fraction of the Python footprint. 42x less memory is a memory-layout story, not a
language story.

The architecture: accelerate the hot path, never break the app

The design constraint I cared about most: nobody rewrites their app to try this,
and nobody ships a native extension that can take prod down. So the layer has two
halves.

LiteLLM (Python)
  └─ fast_litellm (Python integration layer)
       ├─ monkeypatches the hot paths
       ├─ feature flags + gradual rollout
       ├─ performance monitoring
       └─ automatic fallback to Python
  └─ Rust components (PyO3)
       ├─ connection_pool
       ├─ rate_limiter
       ├─ tokens
       └─ core (routing)

The Python side does the patching, watches the metrics, and owns the safety net.
The Rust side does the work. If an accelerated component throws or looks wrong, the
integration layer falls back to the original Python implementation instead of
propagating the failure. That single decision is what makes a native accelerator
safe to actually turn on in production.
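
The fallback idea fits in a few lines. This is a generic sketch of the pattern, not fast_litellm's actual integration code: with_fallback and both token counters are hypothetical stand-ins, and the "native" one fails on purpose to exercise the safety net.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("accel")


def with_fallback(fast: Callable[..., T], slow: Callable[..., T]) -> Callable[..., T]:
    """Try the accelerated path; on any exception, use the original implementation."""
    def call(*args, **kwargs) -> T:
        try:
            return fast(*args, **kwargs)
        except Exception:
            log.exception("accelerated path failed, using Python implementation")
            return slow(*args, **kwargs)
    return call


def count_tokens_python(text: str) -> int:
    return len(text.split())  # stand-in for the original Python implementation


def count_tokens_native(text: str) -> int:
    raise RuntimeError("simulated failure in the native extension")


count_tokens = with_fallback(count_tokens_native, count_tokens_python)
print(count_tokens("hello from the fallback path"))  # prints: 5
```

The caller never sees the failure, only a log line. Catching a result that "looks wrong" needs a comparison against the Python path, which this sketch leaves out.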

Drop-in, or it does not get used

import fast_litellm  # accelerates LiteLLM automatically
import litellm

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)

One import before litellm. It patches the hot paths on load, and every
accelerated component has that automatic fallback. Installation is the boring part,
which is the point:

uv add fast-litellm   # or: pip install fast-litellm

Because the risky path is turning native code on in prod, rollout is gated. Feature
flags let you send a percentage of traffic through the Rust path first, watch the
monitoring, and widen it only when the numbers hold. If you run the LiteLLM proxy
under gunicorn, a two-line wrapper with --preload applies the acceleration before
the workers fork:

# app.py
import fast_litellm  # apply before litellm loads
from litellm.proxy.proxy_server import app
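
The percentage gate mentioned above is usually done with a stable hash bucket, so the same key always takes the same path while you widen the rollout. A minimal sketch of that technique follows. in_rollout and the feature name are hypothetical, not fast-litellm's flag API.

```python
import hashlib


def in_rollout(key: str, percent: int, feature: str = "rust_rate_limiter") -> bool:
    """Deterministically place a key into one of 100 buckets for a feature."""
    digest = hashlib.sha256(f"{feature}:{key}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Count how many of 10,000 request keys would take the accelerated path at 25%.
keys = [f"request-{i}" for i in range(10_000)]
share = sum(in_rollout(k, 25) for k in keys) / len(keys)
print(f"{share:.1%} of keys routed to the accelerated path")
```

Because the bucket depends only on the feature and the key, raising the percentage only adds keys to the accelerated path and never moves one back, which keeps before-and-after comparisons clean.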

Where it does not fit

Be honest about who should not bother with this.

If your traffic is short prompts and simple routing, the FFI overhead can make you slightly slower, not faster. The table above is not marketing, it is a warning label. Measure your own payload sizes first.
If you are not concurrency-bound, the connection-pool win shrinks. The 3.2x comes from removing lock contention. No contention, no prize.
It is another native dependency in your build. For a single-process, low-QPS script, the operational cost is probably not worth the millisecond.

The sweet spot is the opposite of all that: many workers, many unique rate-limit
keys, long inputs, connection-pool pressure. That is where the data-structure wins
compound.

What I would take from this

Profile before you port. The win was in three specific hot paths, not "the code."
Measure the small inputs too. FFI overhead is real and it will embarrass you.
Make it a drop-in or it dies on the vine. Zero-config plus automatic fallback is what turns "interesting Rust project" into something a team will actually run.

Code, the full benchmark breakdown, and the PyO3 architecture are here:
https://github.com/neul-labs/fast-litellm

If you run LiteLLM at any real volume, I would love to know which path is your
bottleneck, and whether the small-input penalty bites you. Kick the tyres, issues
welcome.