Rene Zander
I Run 10 AI Agents in Production. They're All Bash Scripts.

A week ago I wrote about shipping AI agents the right way. That piece was about the harness: quality gates, token economics, multi-model verification. The stuff that separates demos from production.

It resonated with a lot of people. But I left out the part that actually eats most of my time: keeping the boring stuff running.

So let me walk you through what production AI agents actually look like when the conference talk is over.

The stack

I run about ten agents in production. They handle things like daily briefings, team follow-ups, task classification, weekly status reports, and job screening. None of them are impressive. All of them are useful.

The architecture for every single one is the same:

Trigger > Fetch data > Scoped prompt > Write output > Notify

A scheduled job fires. The agent pulls data from a task system, a calendar, an inbox, whatever it needs. A scoped prompt with a token budget processes that data. The result gets written back or sent as a notification.

That's it. No framework. No orchestration layer. No agent-to-agent communication protocol. Each agent is a standalone script with a timeout and a budget cap.
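The whole pipeline fits in one script. Here's a minimal sketch of that shape; `fetch_tasks`, `call_llm`, and `notify` are stubs standing in for whatever task system, model API, and notification channel you actually use:

```shell
#!/usr/bin/env bash
# Agent skeleton: trigger > fetch data > scoped prompt > write output > notify.
# The three functions below are stubs; swap in your real integrations.
set -euo pipefail

fetch_tasks() { echo "task A; task B"; }          # stub: pull from task system/calendar/inbox
call_llm()   { echo "summary of: $1"; }           # stub: POST the prompt to a model API
notify()     { echo "notified: $1" >&2; }         # stub: push to chat or email

data=$(fetch_tasks)
prompt="Summarize in 5 bullets: ${data}"          # scoped prompt, small context
result=$(call_llm "${prompt}")                    # production: wrap the real curl in `timeout`
printf '%s\n' "${result}" > "briefing-$(date +%F).md"   # write output back
notify "${result}"
```

The cron trigger, timeout, and budget cap wrap around this core; the structure stays the same for every agent.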

The ones that run daily cost about $1 per run. The weekly planner, which needs more context, costs about $5. The job screener uses a smaller model and comes in at $0.25. Monthly total across everything: somewhere around $80.

State is the whole game

The interesting part of running agents isn't the LLM call. It's everything around it.

One of my agents tracks which emails it already processed. It maintains a list of the last 500 message IDs. If that list gets corrupted, every email gets reprocessed and I'm flooded with duplicate notifications. If it gets wiped, same thing.
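A dedup list like that can be a flat file of IDs, capped at 500 lines and swapped atomically so a crash mid-write can't corrupt it. A sketch, with `seen_ids.txt` and the message IDs as illustrative names:

```shell
#!/usr/bin/env bash
# Dedup state: a flat file holding the last 500 processed message IDs.
set -euo pipefail

SEEN_FILE="seen_ids.txt"
touch "${SEEN_FILE}"

already_seen() { grep -qxF "$1" "${SEEN_FILE}"; }

mark_seen() {
  printf '%s\n' "$1" >> "${SEEN_FILE}"
  # keep only the most recent 500 IDs; mv makes the update atomic
  tail -n 500 "${SEEN_FILE}" > "${SEEN_FILE}.tmp" && mv "${SEEN_FILE}.tmp" "${SEEN_FILE}"
}

for id in msg-001 msg-002 msg-001; do
  if already_seen "${id}"; then
    echo "skip ${id}"          # duplicate: no reprocessing, no duplicate notification
  else
    echo "process ${id}"
    mark_seen "${id}"
  fi
done
```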

Another agent maintains a vector index. Every night, a sync job pulls all tasks, checks which ones changed since the last run, re-embeds only those, and updates the index. Smart sync means it skips tasks where only metadata changed. Without that optimization, the job would take 30 minutes instead of 30 seconds.
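One way to implement that change detection is to cache a hash of each task's description and skip the embedding call when the hash is unchanged, which is exactly how metadata-only edits get filtered out. A sketch assuming GNU `sha256sum` is available; `embed_task` is a stub for the real embedding-and-index update:

```shell
#!/usr/bin/env bash
# Smart sync: re-embed only tasks whose description body actually changed.
set -euo pipefail

HASH_CACHE="task_hashes.txt"                 # lines: "<task_id> <sha256>"
touch "${HASH_CACHE}"

embed_task() { echo "re-embedding $1"; }     # stub: call embedding API, update index

sync_task() {
  local id="$1" body="$2" new_hash old_hash
  new_hash=$(printf '%s' "${body}" | sha256sum | cut -d' ' -f1)
  old_hash=$(awk -v id="${id}" '$1==id {print $2}' "${HASH_CACHE}")
  if [ "${new_hash}" = "${old_hash}" ]; then
    echo "skip ${id}"                        # metadata-only change: no embedding call
  else
    embed_task "${id}"
    # replace (or add) the cached hash for this task, atomically
    { awk -v id="${id}" '$1!=id' "${HASH_CACHE}"; echo "${id} ${new_hash}"; } \
      > "${HASH_CACHE}.tmp" && mv "${HASH_CACHE}.tmp" "${HASH_CACHE}"
  fi
}

sync_task t1 "write report"
sync_task t1 "write report"   # same body on the second run: skipped
```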

A third agent prepends follow-up messages to task notes, separated by date markers. It never deletes old entries. The history is the context for the next run. Mess with the format and the agent loses its memory.
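The prepend-with-date-markers pattern looks like this in practice; the file name, marker format, and entries are illustrative:

```shell
#!/usr/bin/env bash
# Append-only memory: prepend today's entry above a date marker, never delete.
set -euo pipefail

NOTES_FILE="task_notes.md"
touch "${NOTES_FILE}"

prepend_entry() {
  local entry="$1" tmp
  tmp=$(mktemp)
  { printf -- '--- %s ---\n%s\n' "$(date +%F)" "${entry}"; cat "${NOTES_FILE}"; } > "${tmp}"
  mv "${tmp}" "${NOTES_FILE}"   # atomic swap so a crash can't truncate the history
}

prepend_entry "pinged Alice about the Q3 report"
prepend_entry "Alice replied: report due Friday"
```

Newest entries end up on top, the markers keep runs separable, and the full history stays intact as context for the next run.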

None of this is AI work. It's state management. The LLM call is the easy part.

The maintenance nobody warns you about

This is what I wish someone had told me before I started: the data your agents pull from and write to needs constant housekeeping.

Task descriptions get stale. Context windows bloat because old entries pile up. Duplicate state creeps in when two agents touch the same data. Embeddings drift as your task descriptions evolve.

I spend more time maintaining the data around my agents than I spend on the agents themselves. Trimming old entries, cleaning up state files, making sure the vector index stays in sync with the source of truth.

It's the same kind of work you do with any data pipeline. But nobody frames it that way because "AI agent" sounds more exciting than "scheduled ETL job with an LLM in the middle."

Token economics shape everything

Your architecture follows from your token budget, not the other way around.

I learned this the hard way. My first agent design was a single mega-prompt that tried to do everything: read tasks, check calendar, scan inbox, generate a plan, write follow-ups. It worked. It also burned through tokens like they were free and hallucinated more than I was comfortable with.

Now every agent has a single job. Small context, scoped prompt, hard token limit. The job screener doesn't need to know about my calendar. The weekly planner doesn't need to see my inbox. Isolation isn't just cheaper. It's more reliable.

The cost breakdown matters because it changes your design decisions:

  • A $0.25 agent can run four times a day. A $5 agent runs once a week.
  • A fast, cheap model handles screening. A slower, expensive model handles planning.
  • If an agent costs more than $2 per run, I look at whether the prompt can be tighter.

Framework vs. no framework

I tried frameworks. They lasted about three months.

The problem wasn't capability. The frameworks could do everything I needed. The problem was that every failure mode became a framework debugging session instead of a straightforward script fix.

A shell script fails: I read the log, find the error, fix the line. Done.

A framework agent fails: I dig through abstraction layers, figure out which middleware swallowed the error, check whether the state manager persisted something weird, and then fix the line.

Same fix at the end. Ten times the debugging surface.

My current stack is scheduled jobs, shell scripts calling an LLM API, and log files. It has fewer moving parts than most people's "hello world" agent demos. And it's been running stable for months.

What I'd build differently

If I started over:

State management first. Before writing a single prompt, I'd design how every agent tracks what it already did, what changed since last run, and where it writes output. This is the foundation everything else sits on.

One agent, one job. No multi-purpose agents. The overhead of running five cheap agents is lower than debugging one expensive agent that does five things.

Scheduled over event-driven. For anything that happens on a predictable cadence, a timer is simpler and more reliable than a webhook chain. Event-driven has its place, but most of my agents don't need it.
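"Timer" here can literally be cron. Illustrative crontab entries, with made-up script paths and times:

```
# Daily briefing at 07:00, weekly planner Monday at 06:00.
# Paths and schedules are examples; logs go to plain files.
0 7 * * *  /opt/agents/daily-briefing.sh  >> /var/log/agents/briefing.log 2>&1
0 6 * * 1  /opt/agents/weekly-planner.sh  >> /var/log/agents/planner.log  2>&1
```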

Budget caps from day one. Every agent gets a hard spending limit. If it hits the cap, it fails loudly instead of running up a bill.
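A cap can be as simple as a spend ledger the agent checks before calling the model. A sketch, assuming a per-agent ledger file and an estimated per-run cost; both names and numbers are illustrative:

```shell
#!/usr/bin/env bash
# Budget cap: refuse to run if this run would push month-to-date spend
# past a hard limit. Fails loudly with a nonzero exit instead of spending.
set -euo pipefail

LEDGER="spend_cents.txt"              # one number: cents spent this month
CAP_CENTS=1000                        # hard monthly cap for this agent ($10)
RUN_ESTIMATE=150                      # expected cost of one run, in cents

spent=$(cat "${LEDGER}" 2>/dev/null || echo 0)

if [ $(( spent + RUN_ESTIMATE )) -gt "${CAP_CENTS}" ]; then
  echo "BUDGET CAP HIT: ${spent}c spent, cap ${CAP_CENTS}c -- aborting" >&2
  exit 1
fi

# ... the LLM call would go here ...

echo $(( spent + RUN_ESTIMATE )) > "${LEDGER}"   # record the spend
echo "run ok, month-to-date: $(cat "${LEDGER}")c"
```

A monthly reset job truncating the ledger completes the loop; the important property is that the check happens before the API call, not after the bill arrives.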


I've been building and running these systems for a while now. If you're working through similar problems, or thinking about moving agents from demo to production, I'm always up for a conversation: cal.eu/reneza
