
How to Stop Babysitting Your AI Agents

J. R. Swab on March 18, 2026

Every time I need an LLM to do something, the ritual is the same. Open a chat window. Type a prompt. Read the response. Decide if it's good enough....
Mykola Kondratiuk

the "new job I didn't apply for" framing is exactly right. I run 10+ agents daily and the biggest shift was accepting that your job becomes context architecture rather than task execution. the Unix pipe analogy is sharp - agents that behave like well-behaved CLI tools are way easier to orchestrate than chat-based ones. fire-and-forget only works when you trust the output contract

Ceyhun Aksan

This is a solid and inspiring solution @jrswab. I've been solving a similar problem from the other side: instead of a standalone runner, I use Claude Code's hook system (PreCompact, PostToolUse, SessionStart) to embed automation directly into the coding session. Auto-formatting on edit, WIP state persistence before context compaction, checkpoint restore on new sessions.

Your approach is tool-agnostic and composable across any workflow. However, from my perspective, hooks go deeper but stay locked to one ecosystem. Curious if you've considered bridging the two, e.g. triggering Axe agents from within editor hooks?

J. R. Swab

I'm actually working on a setup now that has OpenClaw trigger a specific Axe agent when it reaches a specific step of a skill.

As for editor hooks, I haven't tried that yet. If you do, let me know how it goes, and please open a GitHub issue if it doesn't work.

Ceyhun Aksan

I went ahead and tried it. Set up Axe via Docker, created a code-reviewer agent, and piped git diff from a Claude Code PostToolUse hook. It picked up SQL injection and resource leaks on a test diff, then I ran it against a real project diff and it caught 4 legitimate issues.

For now I'm keeping it as an on-demand script rather than running on every edit (Docker + LLM call per edit adds up). But the stdin/stdout composability made it trivial to integrate. Nice work 🫡
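Roughly, the wiring in `.claude/settings.json` looks like this — the command string is a simplified placeholder, so adjust the Docker image name and Axe invocation to however your agent is actually set up:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "git diff | docker run -i --rm axe-code-reviewer"
          }
        ]
      }
    ]
  }
}
```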

J. R. Swab

Love to hear it! Thank you for trying it out.

Ceyhun Aksan

Looking forward to seeing how OpenClaw + Axe evolves. I'm actually writing a post on my hook-based workflow (auto-format, WIP persistence, checkpoint restore) and will reference Axe as the standalone counterpart. Will share when it's out and link this post.
Best 🖖

J. R. Swab

Thank you! If you have any questions you can DM me on X or post a comment on this article.

Cyber Safety Zone

Really solid perspective here! The shift from babysitting AI agents to designing them with clear goals, feedback loops, and proper context is a game‑changer. Treating agents like autonomous collaborators instead of glorified scripts unlocks far better results and scalability. Great practical breakdown!

J. R. Swab

Thanks for reading. I built axe to give each agent one job, clean I/O, and no open-ended sessions. More like UNIX utilities than collaborators. Though I'm sure you can use axe in many more ways than I do.

Victor Okefie

The line that lands: "The chat interface made sense when we were exploring what LLMs could do." That's the pivot. Exploration is one mode. Execution is another. You built for execution, no babysitting, no context switching, just input in, output out. That's not an agent. That's a tool. And tools don't need to be managed.

Rene Zander

I run a similar setup in production - Claude agents on systemd timers handling daily briefings, follow-ups, and vector index syncing. The key insight I keep coming back to is structured output boundaries: give the agent a strict JSON contract for what it can touch, and let it be fully autonomous within that scope. Guardrails beat babysitting every time.
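A minimal sketch of what I mean by a strict contract — field names and actions here are hypothetical, but the shape is the same: parse, validate against a whitelist, and only then act:

```python
import json

# Hypothetical contract: the agent may only emit exactly these fields,
# and "action" must come from a fixed whitelist.
ALLOWED_ACTIONS = {"create_briefing", "send_followup", "sync_index"}
REQUIRED_FIELDS = {"action", "target", "payload"}

def validate_agent_output(raw: str) -> dict:
    """Parse agent output and enforce the contract before acting on it."""
    data = json.loads(raw)  # non-JSON output fails loudly right here
    if set(data) != REQUIRED_FIELDS:
        raise ValueError(f"unexpected fields: {set(data) ^ REQUIRED_FIELDS}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action not permitted: {data['action']}")
    return data

# Within the contract the agent is fully autonomous; outside it, nothing runs.
result = validate_agent_output(
    '{"action": "sync_index", "target": "docs", "payload": {}}'
)
```

Anything the agent emits that falls outside the contract never reaches an executor, which is what makes the autonomy safe.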

Joske Vermeulen

The babysitting problem is real. I've been testing MiMo-V2-Pro this week (Xiaomi's new agent model) and the biggest difference between models isn't raw quality, it's how many steps they can chain before losing the plot. Some models need a check-in every 3 steps, others can run 10+ autonomously.

J. R. Swab

That's a great way to frame it! "Steps before losing the plot" is the metric that actually matters in production. Raw benchmark scores don't tell us much if the model falls apart at step 4 of a 10-step task.

I haven't tried MiMo-V2-Pro yet but I'm curious how it handles ambiguous tool outputs mid-chain. That's usually where I see models start to hallucinate their way through rather than stopping to ask. Worth testing that specifically if you haven't.

Thanks for checking out the project!

Farrukh Tariq

This is brilliant—finally an LLM workflow that treats agents like Unix tools: composable, focused, and zero babysitting required.

Jon Gottfried

This is a super cool project! Have you done any direct comparisons to how this works vs more built-in subagents in the model provider coding agent CLIs?

J. R. Swab • Edited

Thanks for reading! Nothing I can show metrics on at the moment, but I do find that piping information between agents, giving each only the information it needs, produces higher quality output than one big agentic framework like OpenClaw.

Adarsh Kant

This resonates deeply. The Unix philosophy for AI agents is exactly right — small tools, one job each, composable by design.

We took the same approach building AnveVoice (anvevoice.app). It's a voice AI that takes real DOM actions on websites — clicking buttons, filling forms, navigating pages. Not a general-purpose chatbot trying to do everything.

The architecture: 46 MCP tools exposed via JSON-RPC 2.0. Each tool has a single, well-defined scope. The voice agent orchestrates them based on user intent, but each tool is independently testable and replaceable.

Key learnings that match your article:

  • Scoped capabilities > god mode: Each MCP tool can only do one thing. The agent can't accidentally delete your database because there's no tool for that.
  • Clean I/O matters: Voice input → intent parsing → tool selection → DOM action → voice confirmation. Each step is observable.
  • Sub-700ms latency: Only achievable because tools are lightweight and focused, not loading massive contexts.

MIT-0 licensed, 50+ languages. The composable approach made it possible to support WCAG 2.1 AA accessibility without rebuilding the core.
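The "scoped capabilities > god mode" point boils down to a registry pattern. This is a toy sketch, not AnveVoice's actual MCP tooling, and the tool names are purely illustrative:

```python
from typing import Callable

# Each tool does exactly one thing. Capability = presence in this registry.
TOOLS: dict[str, Callable[[str], str]] = {
    "click_button": lambda selector: f"clicked {selector}",
    "fill_form_field": lambda value: f"filled {value}",
}

def dispatch(tool: str, arg: str) -> str:
    """Route an agent's tool request through the registry."""
    if tool not in TOOLS:
        # No tool, no capability: a "drop_database" request can't succeed
        # because nothing in the registry implements it.
        raise PermissionError(f"no such tool: {tool}")
    return TOOLS[tool](arg)
```

The agent can be as creative as it likes about *which* tool to call, but the blast radius is bounded by what the registry exposes.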

Adarsh Kant

Love the Unix philosophy framing — "one job, composable by design" is exactly right. We took a similar approach with AnveVoice (anvevoice.app): instead of building a monolithic chat interface, we built focused voice AI agents that each do one thing well — take real DOM actions on websites. Click buttons, fill forms, navigate pages, all through voice commands with sub-700ms latency.

The key insight from your post that resonates: agents should feel like infrastructure, not products to babysit. Our embed is literally one script tag. No dashboard to check, no conversations to monitor. The agent handles the interaction, reports results via JSON-RPC 2.0, done.

The TOML config approach is elegant. We use a similar declarative model for defining agent behaviors — keeps the complexity in the config, not the runtime.

Ahmad Shokry

the repo link is not working

J. R. Swab

Haha whoops! GitHub.com/jrswab/axe

Ahmad Shokry

Thanx

Apex Stack

This resonates hard. I run about 10 scheduled AI agents on cron-style timers for a programmatic SEO site — things like daily GSC audits, content quality checks, news feed refreshes, and even community engagement tasks. The biggest lesson I've learned is exactly what you describe: each agent needs ONE job with clean I/O boundaries.

The memory-as-markdown approach is smart. I do something similar where each agent appends to a shared activity log that downstream agents can read. The composability is key — my weekly review agent reads outputs from 6 other agents without knowing anything about their internals.

Curious about error handling in chained agents. When a sub-agent fails mid-pipeline, does Axe propagate the error cleanly to the parent, or do you need to handle that in the SKILL.md logic?

Julien Avezou

Nice project! You touch on a real pain point: the increased load that working with AI setups creates.

J. R. Swab

I totally agree! What I love most about axe is that I can reuse an agent in more than one pipeline.