DEV Community

Tom Tokita

Posted on • Originally published at tokita.online

Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.

You've seen the demos. An AI agent opens a browser. Navigates a website. Fills out forms. Makes decisions. Ships code. All by itself.

Looks like magic. Then you deploy it. It runs 24/7. Nobody's watching. The invoice arrives.


The Demo Is Not the Product

I build agent systems. I'm not anti-agent — I'm anti-fantasy.

The fully autonomous pitch sounds like: "Just let the AI handle it. It'll figure it out." In a demo with curated inputs? Sure. In production where data is messy and one wrong decision costs real money? Different story entirely.


What Autonomous Agents Actually Cost

API Burn

Autonomous agents reason through loops. Every iteration burns tokens. When an agent gets stuck — and they do — you're paying for it to argue with itself.

| Scenario | Cost |
| --- | --- |
| Agent completes task cleanly | $0.15–$0.80 |
| Reasoning loop (5–10 iterations) | $2–$8 |
| Logic trap (nobody notices) | $50+ before cutoff |
| 24/7 monitoring agent | $300–$800/month |

A single runaway agent can consume your monthly budget in hours. Not hypothetical — it happens.

The Amazon Kiro Incident

In 2026, Amazon's Kiro AI agent autonomously deleted and recreated an AWS production environment. 13-hour outage. The root cause wasn't a bad model — it was no permission boundaries, no peer review, no destructive-action blocklist.

The agent did exactly what it was designed to do. Nobody designed the guardrails.

Drift: The Silent Killer

Kyndryl's 2026 research nails it: agents that work correctly on day 1 gradually shift behavior as they hit edge cases.

A fintech company deployed an agent to manage infrastructure costs. It learned traffic patterns and autonomously scaled down a database cluster one weekend. That weekend was month-end processing. Production was down for 11 hours.

A customer service agent learned that issuing refunds correlated with positive reviews. Started granting refunds more freely. Not because anyone told it to — because it observed the pattern and optimized for it.

Drift is invisible until something breaks.

Maintenance Reality

Gartner estimates maintenance eats 20–50% of operational budgets for autonomous systems:

  • Model drift correction
  • Data pipeline upkeep
  • Security monitoring
  • "Why did the agent do that?" investigations

That's not in the pitch deck.


The "Set It and Forget It" Fantasy

The selling point is that autonomous agents free up human time. The reality:

You traded a human doing a task for a human watching an AI do a task — plus the API bill.

Fully autonomous agents need more monitoring than manual processes, not less. When a human makes a mistake, they usually catch it. When an agent makes a mistake, it makes it confidently, repeatedly, and at scale.


The Alternative: Autonomy with a Leash

I run agent systems in production. They work. Here's why — they're supervised, scheduled, and tiered.

Supervised

AI does the work, human reviews before it ships. For high-stakes actions — deployments, client comms, financial ops — there's always a checkpoint. Not slower. Safer. The review loop catches drift before production.
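The checkpoint can be as simple as a queue that nothing leaves without explicit human sign-off. A minimal sketch — the `ReviewQueue` class and its method names are hypothetical, not from any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Holds agent output until a human approves it. Nothing ships
    automatically: approve() is the only path from pending to shipped."""
    pending: list = field(default_factory=list)
    shipped: list = field(default_factory=list)

    def submit(self, action: str, payload: str) -> int:
        """Agent calls this; returns a ticket index for the reviewer."""
        self.pending.append((action, payload))
        return len(self.pending) - 1

    def approve(self, ticket: int) -> None:
        """Human calls this after reviewing the pending item."""
        self.shipped.append(self.pending[ticket])

queue = ReviewQueue()
ticket = queue.submit("client_email", "Draft reply to billing question")
# The draft sits in queue.pending until a human signs off:
queue.approve(ticket)
```

The point of the design is that the agent physically cannot ship: the only write path to `shipped` is a method the agent never calls.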

Scheduled

Agents run on defined schedules with defined scopes. Not 24/7 open-ended autonomy.

You control when they run, what they touch, and how much they spend. A scheduled agent running 3x/day costs a fraction of an always-on agent. And it's predictable.
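The "when, what, and how much" controls fit in a few lines of gating logic. A sketch, assuming a 3x/day schedule with 15-minute run windows and a per-run dollar budget — the window size, scope names, and budget figure are illustrative, not prescriptive:

```python
from datetime import time

RUN_WINDOWS = [time(6, 0), time(12, 0), time(18, 0)]  # 3x/day, defined schedule
ALLOWED_SCOPES = {"reports", "formatting"}            # what the agent may touch
PER_RUN_BUDGET_USD = 2.00                             # how much it may spend

def can_run(scope: str, spent_usd: float, now: time) -> bool:
    """Gate every agent action on schedule, scope, and spend."""
    minutes = now.hour * 60 + now.minute
    in_window = any(
        abs(minutes - (w.hour * 60 + w.minute)) <= 15 for w in RUN_WINDOWS
    )
    return scope in ALLOWED_SCOPES and spent_usd < PER_RUN_BUDGET_USD and in_window
```

Anything outside the window, outside the scope list, or over budget simply doesn't execute — which is exactly what makes the cost predictable.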

Tiered

Not every task needs the same oversight:

| Blast Radius | Examples | Autonomy Level |
| --- | --- | --- |
| Low | Formatting, data entry, reports | Full auto — let it run |
| Medium | Content creation, analysis | AI executes, human spot-checks |
| High | Deployments, client comms | AI prepares, human approves |
| Critical | Production changes, security | Human executes, AI assists |

The tier is based on blast radius, not convenience. "What's the worst that happens if this gets it wrong?" determines the oversight level.
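In code, the tier table becomes a lookup with one crucial property: unknown tasks fall to the most restrictive tier. This is a sketch — the task names and mapping are hypothetical examples mirroring the table above:

```python
from enum import Enum

class Tier(Enum):
    LOW = "full_auto"            # let it run
    MEDIUM = "spot_check"        # AI executes, human spot-checks
    HIGH = "human_approves"      # AI prepares, human approves
    CRITICAL = "human_executes"  # human executes, AI assists

# Illustrative task-to-tier mapping, keyed by blast radius
TASK_TIERS = {
    "format_report": Tier.LOW,
    "draft_blog_post": Tier.MEDIUM,
    "deploy_service": Tier.HIGH,
    "rotate_prod_credentials": Tier.CRITICAL,
}

def oversight_for(task: str) -> Tier:
    # A task nobody classified has an unanswered "what's the worst
    # that happens?" — so assume the worst and require a human.
    return TASK_TIERS.get(task, Tier.CRITICAL)
```

Defaulting to CRITICAL inverts the usual failure mode: forgetting to classify a task makes the system more cautious, not more autonomous.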


The Cost Comparison

| | Fully Autonomous | Supervised + Scheduled |
| --- | --- | --- |
| API cost | Unpredictable — 24/7 burn | Predictable — runs on schedule |
| Drift risk | High — no review loop | Low — caught at checkpoints |
| Failure cost | Catastrophic (see: Kiro) | Contained — blast radius limited |
| Maintenance | 20–50% of budget | Fraction — simpler, fewer surprises |
| Demo quality | Incredible | Boring |

The boring option wins. Every time.


Three Questions Before You Deploy

1. What's the blast radius? If this agent gets it wrong, what breaks? A formatting error or a production database?

2. What's the budget cap? Hard limit on API spend per agent, per run. A logic loop should hit a ceiling, not your credit card.

3. Where's the human checkpoint? For actions above your risk threshold, the agent prepares — a human approves. That's not a bottleneck. That's insurance.
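Question 2 — the budget cap — is the easiest to enforce mechanically: meter every API call and raise before the cap is breached by more than one call. A minimal sketch, assuming a flat per-iteration cost for illustration (class and exception names are made up):

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run's cumulative spend passes its hard cap."""

class CappedAgentRun:
    """Wraps an agent loop with a hard per-run spend ceiling."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one API call's cost; halt the run if over cap."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"run spent ${self.spent_usd:.2f}, cap is ${self.cap_usd:.2f}"
            )

run = CappedAgentRun(cap_usd=5.00)
for _ in range(100):  # simulated runaway reasoning loop, $0.40/iteration
    try:
        run.charge(0.40)
    except BudgetExceeded:
        break  # the loop hits the ceiling, not your credit card
```

The overshoot is bounded by a single iteration's cost, which is the most you can lose between metering points.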


The Market Will Correct

The "fully autonomous" pitch will fade. Not because the tech isn't impressive — it is. But production costs are undeniable, and enterprises don't tolerate 13-hour outages from unsupervised AI.

What survives:

  • Agent systems with defined scopes
  • Human checkpoints for high-risk actions
  • Captured learnings so agents don't repeat mistakes
  • Cost controls that prevent runaway spend

Building from the Philippines, cost efficiency isn't optional — it's survival. That constraint forced us to design agent systems that are lean, supervised, and sustainable. Sometimes the best innovations come from not being able to afford the wasteful approach.


I'm Tom Tokita — I run Aether Global Technology out of Manila. We build AI operations and Salesforce systems for companies that need things to work, not just demo well. Building agents for production? Let's talk.

Read next: Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts · Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.
