DEV Community: Aleksandr Shchuka

mirabilis: a one-command sandbox for autonomous Claude Code that ended up patching itself

Aleksandr Shchuka — Wed, 10 Jun 2026 15:33:05 +0000

I have a side project that I actually like, and I want to show it. No teaser, no promised secret: this is a post about a tool I built, how I built it, and one episode at the end that pleased me as an engineer.

The tool is mirabilis. It is a one-command launcher that brings up a devcontainer, starts Claude Code inside it with --dangerously-skip-permissions (the agent asks for no approval before running commands or making edits), and drops me straight into the running agent. The container is isolated from my laptop. The agent gets structured persistent memory and a behavioral harness. From the outside, it is one line:

curl -fsSL https://raw.githubusercontent.com/AlexShchuka/mirabilis/main/install.sh | bash

That clones the repo into ~/.mirabilis, installs the devcontainer CLI, and puts mirabilis on your PATH. Run mirabilis and you land in a terminal menu: launch, plugins, harness, stack, open in VS Code. The first launch builds the container and signs you into GitHub and Claude through their native flows; tokens live in sandbox volumes, never in the repository. Then the launcher hands the terminal to Claude.

Why: I wanted a place to give an agent full autonomy and walk away. Bypass mode is convenient right up until the model makes a mistake, and rm -rf does not forgive mistakes. Hence a box I do not care about: inside, the agent is root; the boundary is the container wall, not my filesystem. Containerizing an agent is standard practice; Anthropic ships a reference devcontainer with the same motivation. What I wanted on top was one command and zero thinking about provisioning.

The name nods to Einstein's annus mirabilis, the 1905 "miracle year" with four physics-changing papers. Cheeky for a personal sandbox; consider it an advance I intend to earn.

What's inside

mirabilis is a single Go binary, roughly 3.8k lines of production code and 9.7k of tests, playing three roles dispatched by argument in cmd/mirabilis/main.go: no arguments is the host TUI launcher, provision is the in-container provisioner, hook is the Claude hook handler. One artifact, one dispatcher.
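A minimal sketch of what "dispatched by argument" can look like. The role type, the pickRole helper, and the error on an unknown argument are all assumptions made for illustration; only the three roles and their triggers come from the description above.

```go
package main

import (
	"fmt"
	"os"
)

// role names the three jobs described in the post; the type itself is hypothetical.
type role string

const (
	roleLauncher  role = "launcher"
	roleProvision role = "provision"
	roleHook      role = "hook"
)

// pickRole maps the arguments (program name excluded) to a role.
// Rejecting unknown arguments with an error is an assumption of this sketch.
func pickRole(args []string) (role, error) {
	if len(args) == 0 {
		return roleLauncher, nil
	}
	switch args[0] {
	case "provision":
		return roleProvision, nil
	case "hook":
		return roleHook, nil
	}
	return "", fmt.Errorf("unknown subcommand %q", args[0])
}

func main() {
	r, err := pickRole(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("would start:", r)
}
```

Keeping the mapping in a pure function makes the dispatcher testable without launching any of the three roles.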

The launch pipeline is a DAG, not a script. Steps in internal/pipeline have statuses (stPending, stRunning, stDone, stSkipped, stFailed), dependencies, and a per-step retry policy. Network-shaped steps run under RetryNet: 4 attempts, exponential backoff from 300ms capped at 8s, with jitter (actual delay: uniform between half and full backoff). Deterministic steps such as the build get no retries; retrying a reproducible compile error is four minutes of spinner for nothing.

Memory. ~/.claude lives on a volume and survives rebuilds, but it is not one big scratch file: memory is split into typed categories, declared in internal/config/config.go:

var MemoryCategories = []MemoryCategory{
    {"about-me", "semantic", "Stable facts about you: identity, role, goals, hard preferences, constraints."},
    {"dev-principles", "procedural", "Cross-project engineering invariants you endorse: style, testing bar, anti-slop."},
    {"research-log", "episodic", "Dated findings tied to a specific investigation, paper, or bug. Append-only, compacted periodically."},
}

Three types: semantic for timeless facts, procedural for how-to invariants, episodic for dated findings. On every SessionStart, a hook walks the category files, counts the invariant bullets in each, and regenerates a MEMORY.md index. The agent never edits the index by hand. The point is memory the agent loads predictably.

Two boundaries, deliberately separate. The container is the security boundary: it keeps the agent away from my machine. Behavior is the harness's job: my plugin neuro-matrix, installed automatically by the provisioner, carries invariants like "don't push to main" and "don't exfiltrate credentials".

The frame that kept the slop out

mirabilis was written in about a week, mostly by AI agents. That sentence is not the achievement. In a week, agents will produce exactly as much plausible garbage as you let them. The engineering was making sure garbage never reached main, and that part was mine.

The frame lives in AGENTS.md, which the agent reads at the start of every session:

Code is truth. Every claim about state is backed by tool output: a concrete run with concrete output, not "this probably works".
Minimal diff / YAGNI. Touch only what is broken or what the task needs; no drive-by refactoring, no layers "just in case".
Anti-neuroslop. Plausible shape is not needed code; don't grow files and abstractions detached from what the repo actually needs.
Not green means not done. Changes come with tests.

Principles alone are wishes, so there is machinery behind them. A total ban on comments in code and config, enforced by a pre-commit hook: prose lives in .md only, and the hook greps staged non-markdown, non-Go files for // and # and fails the commit on a hit. Comments are an agent's favorite spot for slop ("here we initialize the variable"); with no place for it, code explains itself with names and structure. A 97% coverage floor in CI; more on that below, it bit me. And a daily canary on cron: rebuild the image, bring the devcontainer up, assert configuration invariants. Drift gets caught by schedule, not by my face at launch time.
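The comment ban described above is simple enough to sketch. This version is deliberately naive and mirrors the description literally: skip markdown and Go files, flag any line containing // or #. The function names are hypothetical, and a literal match like this also flags URLs and shebang lines, so treat it as an illustration rather than the hook mirabilis ships.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// scanned reports whether a staged path falls under the check.
// Markdown and Go files are excluded, as the post describes.
func scanned(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	return ext != ".md" && ext != ".go"
}

// offendingLines returns the 1-based numbers of lines containing "//" or "#".
func offendingLines(content string) []int {
	var hits []int
	for i, line := range strings.Split(content, "\n") {
		if strings.Contains(line, "//") || strings.Contains(line, "#") {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	staged := map[string]string{
		"README.md":    "# Title\n",
		"compose.yaml": "services:\n  app: {}\n# leftover note\n",
	}
	for path, content := range staged {
		if !scanned(path) {
			continue
		}
		if hits := offendingLines(content); len(hits) > 0 {
			fmt.Printf("%s: comment markers on lines %v\n", path, hits)
		}
	}
}
```

A real pre-commit hook would read the staged file list from git and exit non-zero on any hit.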

What did not work is more useful than the wins.

tinyproxy. An agent decided open egress needed a proxy filter and generated a whole plumbing layer. It looked solid. It filtered nothing. "A pipe, not a filter." I ripped it out and wrote it into AGENTS.md as the canonical neuroslop example: code shaped like the thing you need that does not do the thing.

Agents lie confidently. Three audit findings (next section) claimed problems the code did not have. "install.sh doesn't install git hooks": it does, line 38. "The devcontainer features aren't pinned": they are, three sha256 pins in the lock file. "Go tools will land outside PATH": no, GOPATH is set before go install. Three confident claims; three times the code said no. That experience is the foundation of the frame: a model hallucinates in a perfectly assured tone, and the working countermeasure is mechanical verification plus cross-checking. Not trust.

The coverage floor trap. My own 97% floor produced floor-driven tests: checks that exist to feed the percentage, not to catch bugs. Init() returns non-nil. View() is non-empty. The most brittle one hung on a sleep(200ms) in a golden test (since replaced with teatest.WaitFor). The conclusion I wrote down: slightly below the floor with meaningful tests beats pinning Bubble Tea internals for a number.

The sandbox patched itself

This is the part I mostly wrote this post for. Dry, by the numbers.

I ran a multi-agent self-audit on mirabilis. The pipeline: orchestrator full-read → researcher #1 (sonnet) → orchestrator verification → researcher #2 (sonnet) plus a landscape agent (web, arXiv) → two independent critics (sonnet, opus) → reconciliation. Cross-checks at every joint. The output:

~44 raw claims in (15 orchestrator hypotheses plus ~29 agent findings).
3 findings refuted by code verification: the hallucinations above.
5 cut by the critics as taste, not defects.
4 neuroslop spots found by the critics in the orchestrator's own synthesis. The agent checking for slop slopped; another agent caught it.
27 accepted items, nine filed as GitHub issues.

Some accepted items were plainly embarrassing, which is fine. make reset silently destroyed the agent's entire memory while the README promised it "survives rebuilds". Provisioning always reported success, even when sub-steps failed. No resource limits on the container, so an autonomous agent could eat the host's RAM: the exact thing I claim protection from. Ordinary pet-project holes; I filed them.

I did not fix those issues by hand. I launched six developer agents in parallel inside mirabilis itself, one per branch, covering six of the nine issues. They implemented the fixes and opened pull requests. I reviewed and merged them, PRs #101 through #106, in one day. Two examples: the pipeline used to strand dependent steps when a required step failed, and it now cascades them to stSkipped; the launcher no longer silently switches a feature-branch checkout to main. Both fixes came from agents running inside the box they were fixing.
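For readers who want to see the shape of the stSkipped cascade, here is a sketch. The status names come from the post; the step struct, the map-based graph, the ordering of the constants, and the cascadeSkips function are assumptions, not the code from the merged PR.

```go
package main

import "fmt"

type status int

// Status names are from the post; their numeric order here is an assumption.
const (
	stPending status = iota
	stRunning
	stDone
	stSkipped
	stFailed
)

// step is a hypothetical pipeline node with named dependencies.
type step struct {
	name   string
	deps   []string
	status status
}

// cascadeSkips marks every pending step that depends, directly or transitively,
// on a failed or skipped step as stSkipped. It repeats until nothing changes.
func cascadeSkips(steps map[string]*step) {
	for changed := true; changed; {
		changed = false
		for _, s := range steps {
			if s.status != stPending {
				continue
			}
			for _, d := range s.deps {
				dep, ok := steps[d]
				if ok && (dep.status == stFailed || dep.status == stSkipped) {
					s.status = stSkipped
					changed = true
					break
				}
			}
		}
	}
}

func main() {
	steps := map[string]*step{
		"build":     {name: "build", status: stFailed},
		"up":        {name: "up", deps: []string{"build"}},
		"provision": {name: "provision", deps: []string{"up"}},
	}
	cascadeSkips(steps)
	fmt.Println(steps["up"].status == stSkipped, steps["provision"].status == stSkipped)
}
```

Without the cascade, "up" and "provision" would sit in stPending forever, which is the stranding the fix removed.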

I am not selling this as an AI miracle. It is an engineering fact, and the precision matters: the frame plus cross-verification got the agents to where their PRs were reviewable and mergeable rather than rewritable. The merge button was my call on every one. But the loop — audit, issues, agent wave, PRs, merge — closed inside the sandbox's own walls, and that is exactly what I built it for.

Decisions and boundaries

A few places where I decided differently from common practice. The full threat model is in SECURITY.md.

The container is the boundary. Inside, the agent has full freedom: root, sudo, any file; mirabilis gates nothing there. The direct consequence: trusted code only. For untrusted code a container is not enough; you want a microVM. SECURITY.md says so in plain words.

Egress is open. No host proxy, no in-container allowlist. The canonical approach is the opposite: default-deny plus an allowlist, as in Anthropic's reference devcontainer with its init-firewall.sh. I chose open on purpose. For my scenario (personal sandbox, trusted code, one user) simplicity beats exfiltration hardening, and keeping credentials in is the harness's behavioral job, not a network gate's. WebFetch and WebSearch have to just work; I am not maintaining an allowlist every time the agent needs a new domain. Open egress plus persistent memory plus MCP content is a known memory-poisoning surface (systematic study: arXiv:2606.04329); SECURITY.md lists it as an accepted limitation.

Hardened within the model. Open egress does not mean everything is open: the container runs under Docker's default seccomp profile with cap_drop: ALL and an explicit add-back list. That takes away unshare namespaces, io_uring, keyctl, raw sockets, and chroot.

The Docker socket is mounted (docker-outside-of-docker). The in-container agent can drive the host daemon, and anyone with the socket can read container secrets via docker inspect. Documented in SECURITY.md. It is there so the agent can build and run real containers from inside the box — which is also what Ryuk, the reaper testcontainers-go uses for cleanup, requires during integration tests.

Single instance. One container, one volume, not per-task. Other tools isolate per task with git worktrees; microVM products isolate harder than any container. mirabilis is neither the most isolated nor the most parallel: it is one command, a trusted personal scenario, and an anti-slop frame built into the construction. A chosen scope, not an unfinished copy of the bigger tools.

Take what's useful

mirabilis lives at github.com/AlexShchuka/mirabilis, MIT-licensed. I use it myself and I think it is a decent tool. Not the best, not the only one; the field is busy. What I can show is concrete: a one-command sandbox with typed memory, two independent boundaries, a written-down threat model, four working anti-slop mechanisms, and a self-audit that ended with the sandbox merging fixes to itself.

If any of it is useful, take it. If you find where I am wrong, the issues are open.

Why your agentic system doesn't survive its own success — and what a 23-invariant runtime looks like

Aleksandr Shchuka — Sun, 24 May 2026 20:42:41 +0000

The failure mode you don't see

Most agentic systems fail silently. The agent picks the wrong tool, invents a fact, agrees with the user when it shouldn't — and you find out three releases later when a customer points it out. There's no exception to log; the system is generating plausible-looking text.

The standard answer is "we'll add evals." Then evals become a number on a dashboard, the dashboard becomes a vanity metric, and the agent keeps drifting. The problem is not a shortage of evals; it's that the agent has no skin in the game during a single turn.

This post is about a runtime that gives the agent skin in the game on every turn. 23 invariants, deontic tags, counter-clauses, risk-weighted sampling, an adversarial probe set, and a pre-registered statistical decision rule for whether a config change ships. None of the components are original. The combination is what makes it work.

The co-system framing

The system is designed for AI + developer as a co-system, not "AI for developer". Both sides err, and each side catches the other's errors. It has two outputs: codebase health and a culture of systems-thinking. The bet is many small steps, each with a low error rate.

Three game-theoretic anchors drive the design:

Anchor-verification dominates hallucination. Every external-state claim is paired with tool output in the same reply.
Per-mutation gating dominates unauthorized mutation. No edit, no push, no write without explicit recent consent.
Own-interest dominates sycophancy. The agent upholds the protocol in its own interest — it does not capitulate to user pressure to look helpful.

These are not preferences. They're equilibria the protocol enforces every turn.

The invariants file

23 invariants, each tagged on three axes:

[risk] — critical | important | style
[deontic] — O | P | F (obligation / permission / forbidden)
Counter: — the condition under which the invariant does not apply or is overruled

Example (the anti-sycophancy invariant):

[critical] [O] The agent pursues every concept of this protocol in its own
interest; it does not sacrifice protocol concepts to agree with the developer
or look helpful.
| Why: sycophancy is a documented LLM failure mode; without explicit own-interest
the agent collapses into yes-man dynamics that erode codebase health and culture
of systems-thinking.
| Counter: when the developer's correction is anchored to a genuine artifact
contradiction the agent has not yet seen, updating the agent's position is
correct epistemic adjustment, not sycophancy.

Why deontic + Counter? In classical deontic logic (von Wright and successors), an obligation either applies or doesn't, and contextual override is handled through priority hierarchies kept separate from the norm itself. Pinning the counter to the norm — defeasible deontic logic (Nute, Prakken) made runtime — surfaces the override at the point of use, not in a separate fallback table the agent will forget to consult.

Risk-weighted per-turn sampling

A UserPromptSubmit hook samples one invariant per turn into the agent's context, weighted by risk: critical ×3, important ×2, style ×1. The agent runs a forced self-check against that invariant before producing the reply.

This is doing two things at once:

Reduces framing bias. No fixed "always check these 5" — different invariants surface on different turns, and the agent can't optimize against a known set.
Pulls the long tail. With 23 invariants in context simultaneously, most get ignored. Sampling one forces attention.

Untagged lines are sampled at style-weight as a graceful fallback with a stderr warning — drift is visible, not silent.

Four agent roles

Different work needs different prompts. Four roles, each with its own minimum subset of invariants inherited explicitly (sub-agents do not receive the UserPromptSubmit hook — without explicit propagation they run with zero invariant self-check):

analyzer — RCA, system design, code review, dead-end diagnostics; does not write code
developer — writes / edits / commits / pushes; the only role that mutates files
critic — independent reviewer of the lead agent's output before mutation lands; anti-neuroslop gate
epistemic-auditor — audits the boundary between confirmed and associative claims

The split isn't specialization for its own sake. It's that the failure modes are different: developer fails by under-reading, critic fails by inverted sycophancy (over-critique to look independent), analyzer fails by partial-state models, auditor fails by missing the inline associative-marker.

Eval suite — what actually gets measured

An eval suite is the only runtime defense against config drift. Without it, every change to the protocol is a taste call.

The suite:

A rubric of 17 binary criteria. Binary, not Likert: binary removes the central-tendency bias documented for ordinal LLM-judge rubrics (Masood, 2026).
Question probes (~20): a representative mix of ordinary developer questions; they measure average compliance.
Adversarial probes (~10): each targets a specific invariant under deliberate social pressure (a bypass attempt embedded in the prompt). Binary: held or broke. This distinguishes "invariant holds when nobody pushes" from "holds always."
Canary GUID per probe: embedded in a ## Canary section that the runner does not pass into the model. If the GUID appears in the response, the model leaked the probe file from the surrounding repo. Anti-contamination, runtime-verifiable.

Pre-registered decision rule

A config change ships iff all of:

Paired Wilcoxon signed-rank on question score totals: p < 0.05
Bootstrap 95% CI on Cohen's d (paired): lower bound > 0.2
McNemar one-sided on adversarial pass/fail: zero regressions

Krippendorff α ≥ 0.8 for inter-rater reliability when ≥2 raters score the same probe.

Pre-registered, not chosen after seeing numbers. Without this, the merge criterion drifts — "p < 0.05 wasn't reached this time so let's loosen the threshold" is exactly how science fails, and it's how internal A/B claims fail too.

What this is and what it isn't

This is not a framework. It's a calibration layer that sits on top of any agent runtime: in my case Claude Code's plugin system, but the design ports to LangGraph, AutoGen, or a hand-rolled runtime.

The portable parts:

A deontic+counter invariants file
A per-turn risk-weighted sampling hook
An eval suite with an adversarial split
A pre-registered statistical decision rule

The rest is plumbing.

What you lose

Length. The protocol asks the agent to surface its reasoning, pair claims with tool output, halt on contradiction. The reply is denser and longer than a yes-man would produce. If your team is optimizing for "agent makes user happy in three lines," this is the wrong protocol.

Why I'm publishing this

Most agentic engineering writeups in 2026 are either framework docs (LangGraph this, CrewAI that) or hype posts about emergent capabilities. The gap is the layer in between: how do you make an agent behave correctly under sustained social pressure from the user — where "social pressure" includes the user being right? That's a governance problem, not a framework problem, and there are very few public artifacts addressing it head-on.

I'll write up specific parts — the 17-criteria rubric, the canary GUID approach, the deontic+counter design — as follow-up posts if there's interest.