
J. R. Swab


How to Stop Babysitting Your AI Agents

Every time I need an LLM to do something, the ritual is the same. Open a chat window. Type a prompt. Read the response. Decide if it's good enough. Repeat tomorrow. That's not automation; that's a new job I didn't apply for.

The frustrating part isn't the AI. The frustrating part is that I'm the scheduler, the context manager, and the output parser all at once. I'm writing the same prompt variations over and over because nothing persists. I'm watching a spinner because there's no way to fire and forget. The tool is supposed to be doing the work.

So I built a 12MB binary to fix it.


Unix already solved this

You don't open a chat window to run grep. You pipe input in, get output out, and chain it with something else. Small tools, one job each, composable by design.

The Unix philosophy isn't just clever; it's right, and it's been right for fifty years.

AI agents should work the same way. One job. Clean input/output. Plugs into your existing workflows. The problem is that most AI tooling goes the opposite direction. Massive context windows, general-purpose sessions, chat interfaces bolted onto automation primitives. The chat interface made sense when we were exploring what LLMs could do.

It's time to stop treating every task like an open-ended conversation.

This is where Axe comes in. Axe is a CLI that runs single-purpose LLM agents defined in plain text config files. You define an agent, give it a job, and run it from wherever you'd run any other command.


What it actually looks like

Here's a PR reviewer that runs before every commit via a git hook:

```shell
git diff --cached | axe run pr-reviewer
```

Stdin is always accepted. Output goes to stdout. That's it. No setup wizard, no subscription tier, no "connect your workspace." The git hook already exists. Axe just sits in the pipe.
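For instance, installing that command as an actual pre-commit hook is a one-time step. This is a sketch, not from the Axe docs: it assumes you're at the root of a git repository, and reuses the pr-reviewer agent name from above.

```shell
# Install the reviewer as a pre-commit hook (one-time setup).
# Assumes the current directory is the root of a git repository.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Pipe the staged diff through the pr-reviewer agent before each commit.
git diff --cached | axe run pr-reviewer
EOF
chmod +x .git/hooks/pre-commit
```

After this, every `git commit` runs the review automatically; delete the hook file to turn it off.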

Or say you want nightly log analysis. Drop this in a cron job:

```shell
cat /var/log/app/error.log | axe run log-analyzer
```

Debug info goes to stderr so it doesn't pollute your pipeline. Clean findings go to stdout. Wire it into whatever monitoring you already have: email, Slack, or a file.

Axe doesn't care.
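Concretely, the cron entry might look like this. The schedule, paths, and debug-log redirect are illustrative, not prescribed by Axe; cron mails stdout to you by default.

```
# Run nightly at 2am; stderr debug goes to a separate log, findings to stdout.
0 2 * * * cat /var/log/app/error.log | axe run log-analyzer 2>>/var/log/axe-debug.log
```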

The part I find most useful is chaining agents. A parent agent can delegate to sub-agents for focused subtasks. Each sub-agent runs with its own isolated context window and returns only its result; the parent never sees the sub-agent's internal reasoning, intermediate files, or working memory.
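Even without sub-agent delegation, the same isolation falls out of plain pipes: each stage sees only the previous stage's stdout. Both agent names below are illustrative.

```
# Stage 1 reviews the diff; stage 2 sees only the review text, not the
# diff or any of stage 1's internal context.
git diff --cached | axe run pr-reviewer | axe run summarizer
```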


Under the hood (briefly)

Agent config is TOML:

```toml
name = "pr-reviewer"
description = "Reviews git diffs for issues"
model = "anthropic/claude-sonnet-4-20250514"

[params]
temperature = 0.3

[memory]
enabled = true
last_n = 10
```

The agent's instructions live in a SKILL.md file next to the config. Plain markdown. Human-readable, version-controllable, greppable. No database, no embeddings, no proprietary format. If you want to know what an agent does, you open the file. If you want to change what it does, you edit the file. That's the whole interface.

The --dry-run flag shows the full resolved context (system prompt, skill file contents, any piped stdin, model parameters) without calling the LLM. Useful for debugging. Also useful if you want to estimate token cost before committing to a run. I use it more than I expected to.
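A quick example of the flag in use, reusing the reviewer from earlier; the exact shape of what Axe prints is not shown here.

```
# Inspect the fully resolved prompt without spending any tokens.
git diff --cached | axe run pr-reviewer --dry-run
```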

Agents can also remember across runs. Memory is plain markdown: each run appends a timestamped entry to a file sitting next to the config. No database, no schema migration. If something looks wrong, you open the file and edit it.
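As a sketch of what that looks like, a memory file might read something like the following; the entry format here is an assumption for illustration, not Axe's documented layout.

```markdown
## 2026-01-14 02:00 UTC
Recurring timeout errors in the payments service; flagged for follow-up.

## 2026-01-15 02:00 UTC
Timeouts resolved after the connection-pool fix; new auth warnings noted.
```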


What it's not

Not a framework. Not a platform. Not a chat window. Not a SaaS dashboard with an "Agents" tab and a usage graph you'll check twice before forgetting it exists.

Axe aims for minimal dependencies. It's a single binary, no daemon running in the background, no runtime to install. It just runs wherever you drop it. Licensed as Apache 2.0 and free forever, pull requests welcome.

Axe is the executor, not the scheduler. Use cron, git hooks, entr, fswatch, whatever you already have. Axe doesn't want to own your workflow. It wants to run one agent, cleanly, and get out of the way.
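As one concrete pairing, entr can act as the file-watching scheduler; the file list and agent name here are illustrative.

```
# Re-run the reviewer whenever a Go source file changes.
# entr's -s flag runs the command through the shell so the pipe works.
ls src/*.go | entr -s 'git diff | axe run pr-reviewer'
```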

If you want agents to feel like infrastructure instead of products you have to babysit, that's what this is for.


The repo is at github.com/jrswab/axe.

Go try it. Build something weird with it: a commit message writer, a nightly changelog summarizer, whatever you keep doing manually.

If it saves you from being the scheduler, the context manager, and the output parser all at once, that's the whole point.

Top comments (25)

Mykola Kondratiuk

the "new job I didn't apply for" framing is exactly right. I run 10+ agents daily and the biggest shift was accepting that your job becomes context architecture rather than task execution. the Unix pipe analogy is sharp - agents that behave like well-behaved CLI tools are way easier to orchestrate than chat-based ones. fire-and-forget only works when you trust the output contract

Ceyhun Aksan

This is a solid and inspiring solution @jrswab. I've been solving a similar problem from the other side: instead of a standalone runner, I use Claude Code's hook system (PreCompact, PostToolUse, SessionStart) to embed automation directly into the coding session. Auto-formatting on edit, WIP state persistence before context compaction, checkpoint restore on new sessions.

Your approach is tool-agnostic and composable across any workflow. However, from my perspective, hooks go deeper but stay locked to one ecosystem. Curious if you've considered bridging the two, e.g. triggering Axe agents from within editor hooks?

J. R. Swab

I'm actually working on a setup now that will have OpenClaw trigger a specific Axe agent at a specific step of a skill.

As for editor hooks, I have not tried that yet. If you do, let me know how it goes, and please add a GitHub issue if it does not work.

Ceyhun Aksan

I went ahead and tried it. Set up Axe via Docker, created a code-reviewer agent, and piped git diff from a Claude Code PostToolUse hook. It picked up SQL injection and resource leaks on a test diff, then ran it against a real project diff and caught 4 legitimate issues.

For now I'm keeping it as an on-demand script rather than running on every edit (Docker + LLM call per edit adds up). But the stdin/stdout composability made it trivial to integrate. Nice work 🫡

J. R. Swab

Love to hear it! Thank you for trying it out.

Ceyhun Aksan

Looking forward to seeing how OpenClaw + Axe evolves. I'm actually writing a post on my hook-based workflow (auto-format, WIP persistence, checkpoint restore) and will reference Axe as the standalone counterpart. Will share when it's out and link this post.
Best 🖖

J. R. Swab

Thank you! If you have any questions you can DM me on X or post a comment on this article.

Cyber Safety Zone

Really solid perspective here! The shift from babysitting AI agents to designing them with clear goals, feedback loops, and proper context is a game‑changer. Treating agents like autonomous collaborators instead of glorified scripts unlocks far better results and scalability. Great practical breakdown!

J. R. Swab

Thanks for reading. I built axe to give each agent one job, clean I/O, and no open-ended sessions; much like Unix utilities, the agents stay tools rather than collaborators. Though I'm sure you can use axe in many more ways than I do.

Victor Okefie

The line that lands: "The chat interface made sense when we were exploring what LLMs could do." That's the pivot. Exploration is one mode. Execution is another. You built for execution, no babysitting, no context switching, just input in, output out. That's not an agent. That's a tool. And tools don't need to be managed.

Rene Zander

I run a similar setup in production - Claude agents on systemd timers handling daily briefings, follow-ups, and vector index syncing. The key insight I keep coming back to is structured output boundaries: give the agent a strict JSON contract for what it can touch, and let it be fully autonomous within that scope. Guardrails beat babysitting every time.

Joske Vermeulen

The babysitting problem is real. I've been testing MiMo-V2-Pro this week (Xiaomi's new agent model) and the biggest difference between models isn't raw quality, it's how many steps they can chain before losing the plot. Some models need a check-in every 3 steps, others can run 10+ autonomously.

J. R. Swab

That's a great way to frame it! "Steps before losing the plot" is the metric that actually matters in production. Raw benchmark scores don't tell us much if the model falls apart at step 4 of a 10-step task.

I haven't tried MiMo-V2-Pro yet but I'm curious how it handles ambiguous tool outputs mid-chain. That's usually where I see models start to hallucinate their way through rather than stopping to ask. Worth testing that specifically if you haven't.

Thanks for checking out the project!

Farrukh Tariq

This is brilliant—finally an LLM workflow that treats agents like Unix tools: composable, focused, and zero babysitting required.

Jon Gottfried

This is a super cool project! Have you done any direct comparisons to how this works vs more built-in subagents in the model provider coding agent CLIs?

J. R. Swab

Thanks for reading! Nothing that I can show metrics on at the moment, but I do find that piping information between agents, giving each just the information it needs, produces higher-quality output than one big agentic framework like OpenClaw.

Adarsh Kant

This resonates deeply. The Unix philosophy for AI agents is exactly right — small tools, one job each, composable by design.

We took the same approach building AnveVoice (anvevoice.app). It's a voice AI that takes real DOM actions on websites — clicking buttons, filling forms, navigating pages. Not a general-purpose chatbot trying to do everything.

The architecture: 46 MCP tools exposed via JSON-RPC 2.0. Each tool has a single, well-defined scope. The voice agent orchestrates them based on user intent, but each tool is independently testable and replaceable.

Key learnings that match your article:

  • Scoped capabilities > god mode: Each MCP tool can only do one thing. The agent can't accidentally delete your database because there's no tool for that.
  • Clean I/O matters: Voice input → intent parsing → tool selection → DOM action → voice confirmation. Each step is observable.
  • Sub-700ms latency: Only achievable because tools are lightweight and focused, not loading massive contexts.

MIT-0 licensed, 50+ languages. The composable approach made it possible to support WCAG 2.1 AA accessibility without rebuilding the core.

Adarsh Kant

Love the Unix philosophy framing — "one job, composable by design" is exactly right. We took a similar approach with AnveVoice (anvevoice.app): instead of building a monolithic chat interface, we built focused voice AI agents that each do one thing well — take real DOM actions on websites. Click buttons, fill forms, navigate pages, all through voice commands with sub-700ms latency.

The key insight from your post that resonates: agents should feel like infrastructure, not products to babysit. Our embed is literally one script tag. No dashboard to check, no conversations to monitor. The agent handles the interaction, reports results via JSON-RPC 2.0, done.

The TOML config approach is elegant. We use a similar declarative model for defining agent behaviors — keeps the complexity in the config, not the runtime.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.