I gave an AI agent a shell. Then I gave myself a browser tab to watch it.

#rust #opensource #devtools #ai

A few months ago I started letting a coding agent run shell commands for me. Not suggest them, run them. It felt like magic for about a week.

Then one afternoon it tried to fix a failing build. The build printed maybe four thousand lines of webpack noise. The agent read all of it, lost the thread completely, decided the real problem was the test directory, and deleted it. I found out an hour later, because between "sure, I'll fix that" and the actual damage, I had no idea what was happening. It was a black box with my filesystem inside it.

I went looking for a safer way to do this and didn't really find one I liked, so I ended up building it. Here are the four things that afternoon taught me, and what I do now instead.

1. You need to see what it is doing, while it does it

The core problem isn't that agents make mistakes. It's that you find out after. A normal subprocess call gives you nothing to watch. The agent runs ten commands and you see a summary it wrote about itself, which is exactly the thing you can't trust when something went wrong.

So I built a window. Set one environment variable and the server starts a read-only web viewer on localhost and hands you a URL. You open it in a normal browser tab and watch the actual commands and their actual output stream in as they run, colored: cyan for the command, green for a clean exit, red for a failure or a blocked command. Not the agent's description of what it did. The ground truth.

The live transcript: the actual commands and output as they run, not the agent's summary of itself.

It groups sessions by where they ran (local, ssh, docker), keeps a history of past sessions you can scroll back through, and has a search box so you can find the one command you care about in a long run. The day I added the live view, I caught the agent about to run git reset --hard over uncommitted work. I'd never have known otherwise.

Search jumps you to the command you care about in a long run; "next err" hops straight to failures.

2. It will drown its own context in command output

That four-thousand-line build log didn't just confuse me, it confused the agent. Every line went into its context window, crowding out the actual task. Agents get dumber the more junk you feed them, and a raw shell feeds them everything.

The fix is boring and it works: shape the output before it reaches the model. Keep the last 200 lines of a noisy build. Grep a huge log for the errors with a couple lines of context. Cap the total characters. The agent sees the signal, not the firehose, and its context stays clean for the thing you actually asked.
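
To make the three shaping steps concrete, here is a minimal Python sketch. It is an illustration of the idea, not execkit's actual code: the function names, the default limits, and the truncation marker are all assumptions made up for this example.

```python
import re


def tail_lines(text: str, n: int = 200) -> str:
    """Keep only the last n lines of the output."""
    if n <= 0:
        return ""
    return "\n".join(text.splitlines()[-n:])


def grep_context(text: str, pattern: str, context: int = 2) -> str:
    """Keep lines matching `pattern`, plus `context` lines on each side."""
    lines = text.splitlines()
    rx = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    return "\n".join(lines[i] for i in sorted(keep))


def cap_chars(text: str, limit: int = 8000) -> str:
    """Cap the payload at `limit` characters, keeping the end of the text.

    The marker line is added on top of the limit so the reader (or the
    model) can tell that something was cut.
    """
    if len(text) <= limit:
        return text
    return "[... truncated ...]\n" + text[-limit:]
```

The steps compose, so a noisy build log might go through something like cap_chars(grep_context(tail_lines(log), "ERROR")) before the model ever sees it. Keeping the tail rather than the head is a deliberate choice here, on the assumption that build failures usually show up near the end.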

3. A shell that forgets where it is, is useless

This one is mundane but it bit me constantly. The agent runs cd packages/api, then runs npm test, and the test runs in the repo root because the second call was a brand new shell. When every command starts over in a fresh shell, you don't have a shell, you have a series of unrelated strangers.

State has to persist. cd, environment variables, the lot, across calls, the way a real terminal works. Once sessions were actually stateful the whole thing stopped feeling like fighting the tool.
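
One simple way to get that behavior is to keep the working directory and environment in a session object and pass them to every call. The Python sketch below assumes that approach; the Session class and its methods are hypothetical names for illustration, and execkit may well do it differently (for example, by keeping one long-lived shell process alive).

```python
import os
import subprocess


class Session:
    """Carries cwd and environment variables from one command to the next."""

    def __init__(self, cwd=None):
        self.cwd = os.path.abspath(cwd or os.getcwd())
        self.env = dict(os.environ)

    def cd(self, path):
        # Resolve relative to the session's cwd, not the parent process's.
        new_cwd = os.path.abspath(os.path.join(self.cwd, path))
        if not os.path.isdir(new_cwd):
            raise FileNotFoundError(new_cwd)
        self.cwd = new_cwd

    def export(self, key, value):
        self.env[key] = value

    def run(self, argv):
        # Every call inherits the state left behind by earlier calls.
        return subprocess.run(
            argv, cwd=self.cwd, env=self.env, capture_output=True, text=True
        )
```

With this, session.cd("packages/api") followed by session.run(["npm", "test"]) runs the tests in packages/api, which is what the agent meant in the first place.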

4. Sooner or later it does something you want to undo

You can be careful and it will still happen. The agent rewrites a config, runs a migration, deletes a file it shouldn't have. On a local repo you have git. On a remote box over SSH, mid-task, you often don't.

So before any command that changes files, snapshot the workspace. If it goes wrong, restore. A snapshot only undoes file changes, not side effects (a dropped database stays dropped), but for "it mangled my source tree" it's the difference between a shrug and a lost evening.
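
The crudest possible version of snapshot and restore is a full copy of the directory tree. The Python sketch below assumes exactly that; the function names are hypothetical, and a real tool would likely use something cheaper than a full copy for large workspaces.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(workspace: str) -> Path:
    """Copy the whole workspace into a fresh temp directory and return the copy."""
    snap_root = Path(tempfile.mkdtemp(prefix="snapshot-"))
    dest = snap_root / "tree"
    shutil.copytree(workspace, dest, symlinks=True)
    return dest


def restore(snap: Path, workspace: str) -> None:
    """Make the workspace match the snapshot again.

    Files created after the snapshot are removed, deleted files come back.
    Side effects outside the directory are not touched.
    """
    ws = Path(workspace)
    for child in ws.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    shutil.copytree(snap, ws, symlinks=True, dirs_exist_ok=True)
```

Note that dirs_exist_ok needs Python 3.8 or later. The usage pattern is: call snapshot() before the risky command, and call restore() only if you don't like what you see afterward.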

The boring parts that turned out to matter most

A few things I didn't expect to care about and now wouldn't go without:

- Secrets get redacted before output is stored or shown. My API keys were ending up in plaintext transcripts and I hadn't noticed.
- stdout and stderr stay separate, with a real exit code and duration, as structured data. The agent stops guessing whether something failed.
- SSH host keys are verified by default, so a swapped key gets rejected instead of silently trusted.
- The viewer is read-only and local by construction: it binds loopback only, every request needs a URL token, and it can read the audit stream but never touch a session, a command, or your files.

The honest part: the command allow/deny fence is advisory. The real safety boundary is running the agent as a least-privilege user in a container or a scoped SSH account. A string filter is a speed bump, not a jail, and pretending otherwise is how people get burned.
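
Two of the items above, redaction and structured results, are easy to picture in code. This Python sketch is only an illustration: the CommandResult fields, the run_structured name, and especially the two redaction patterns are assumptions for this example, not execkit's real rules. Real secret detection needs a much longer pattern list.

```python
import re
import subprocess
import time
from dataclasses import dataclass


def redact(text: str) -> str:
    """Mask values that look like secrets (illustrative patterns only)."""
    # NAME=value pairs where NAME mentions token/secret/password/api_key.
    text = re.sub(
        r"(?i)\b(\w*(?:token|secret|password|api_key)\w*=)\S+",
        r"\1[REDACTED]",
        text,
    )
    # Bare strings shaped like "sk-..." API keys.
    text = re.sub(r"\bsk-[A-Za-z0-9_-]{16,}", "[REDACTED]", text)
    return text


@dataclass
class CommandResult:
    argv: list
    stdout: str
    stderr: str
    exit_code: int
    duration_ms: float


def run_structured(argv) -> CommandResult:
    """Run a command and return separate, redacted streams plus exit status."""
    start = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True)
    duration_ms = (time.monotonic() - start) * 1000
    return CommandResult(
        argv=list(argv),
        stdout=redact(proc.stdout),
        stderr=redact(proc.stderr),
        exit_code=proc.returncode,
        duration_ms=duration_ms,
    )
```

Redacting before the result object is built means nothing downstream (the transcript, the viewer, the model) ever holds the raw value. A nonzero exit_code is the agent's signal that something failed, so it no longer has to infer that from the text.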

The advisory fence blocked rm -rf / here, but its real value is that the attempt is visible and logged. Export or screenshot any session straight from the menu.

Where it ended up

I packaged all of this as execkit: a small Rust library, plus an MCP server so any agent that speaks MCP (Claude Code, Cursor, Gemini CLI, others) can use it. It's Apache-2.0.

pip install execkit-mcp                 # or: cargo install execkit-mcp
claude mcp add execkit -- execkit-mcp   # wire it into Claude Code

Then to watch it work, set EXECKIT_MCP_WATCH_WEB=1 in the server's environment and open the URL it prints. (Want it to pop the browser for you? Also set EXECKIT_MCP_WATCH_OPEN=1.) Or point the viewer at any existing audit log: execkit-mcp watch --serve --open path/to/audit.

I built it because I wanted to stop flying blind. If you're letting an agent touch a real shell, I'd at least want you to be able to watch it.
Repo and docs: https://blinkingbit-oss.github.io/execkit/

If you try it, I'd genuinely like to hear what breaks.