Designing Production AI Agents: 5 Lessons from Running 6 in the Wild
I've been running 6 AI agents in production for 3 months. They publish content, analyze data, respond to messages, generate reports, and monitor infrastructure. Here are the hard lessons I learned.
Lesson 1: Agents Need Guardrails, Not Freedom
The biggest mistake is giving agents too much autonomy too early. My content publisher agent once published the same post 47 times because a retry loop had no circuit breaker.
The fix: Every agent action goes through a validation layer. The "gatekeeper" pattern:
def execute_with_guardrails(agent, action, params):
    # 1. Validate inputs
    if not validate_params(params):
        return {"ok": False, "error": "invalid params"}

    # 2. Check rate limits
    if agent.rate_limiter.exceeded():
        return {"ok": False, "error": "rate limited"}

    # 3. Execute
    result = agent.execute(action, params)

    # 4. Validate output
    if not validate_result(result):
        rollback(action, params)
        return {"ok": False, "error": "output validation failed"}

    return result
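The retry-loop incident above is exactly what a circuit breaker prevents. Here is a minimal sketch of one (the class and thresholds are illustrative, not the article's actual implementation): after a few consecutive failures it "opens" and blocks further attempts until a cool-down elapses.

```python
import time

class CircuitBreaker:
    """Blocks repeated retries after consecutive failures.
    Hypothetical sketch; thresholds are illustrative."""

    def __init__(self, max_failures=3, reset_after=300):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: reset and allow one trial call
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

Wrap each retry attempt in `breaker.allow()`: with this in place, a publish loop that keeps failing stops after three tries instead of firing 47 times.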
Lesson 2: State Management Is Everything
Agents without persistent state are useless. They repeat work, lose context, and make inconsistent decisions.
My agents use a simple JSON state file per agent:
{
  "published": {"post_id": {"platform": "threads", "published_at": "..."}},
  "last_run": "2026-03-08T12:00:00",
  "errors": [],
  "metrics": {"total_published": 192, "total_failed": 12}
}
State files are small, human-readable, and trivially debuggable. No database needed for most agent workloads.
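One caveat with plain JSON files: a crash mid-write can corrupt them. A common remedy is to write to a temp file and rename it into place, since the rename is atomic on POSIX filesystems. A sketch (function names and the default state shape are mine, not from the article):

```python
import json
import os
import tempfile

def load_state(path):
    """Return the agent's state dict, or a fresh one if no file exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"published": {}, "last_run": None, "errors": [], "metrics": {}}

def save_state(path, state):
    """Write atomically: dump to a temp file in the same directory, then
    rename over the original, so a crash never leaves a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX
```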
Lesson 3: Cron > Event-Driven (For Most Cases)
I started with an event-driven architecture using message queues. It was overengineered.
Now everything runs on cron:
*/30 * * * * /opt/auto-publisher/run.sh
0 6 * * * /opt/analytics/collect.sh
0 * * * * /opt/monitor/check.sh
Benefits: dead simple, no message broker to maintain, easy to debug (just check the logs), trivial to pause (comment out the line).
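One thing cron will not do for you is prevent overlapping runs: a `*/30` job that occasionally takes 40 minutes will start a second copy on top of the first. A lock-file guard at the top of each entry point handles this. A Unix-only sketch using `fcntl.flock` (the helper name is mine):

```python
import fcntl
import sys

def run_exclusive(lock_path, job):
    """Run job() only if no other instance holds the lock.
    Returns True if the job ran, False if a previous run is still active.
    Unix-only (fcntl); illustrative helper, not from the article."""
    lock = open(lock_path, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("previous run still active, skipping", file=sys.stderr)
        return False
    try:
        job()
        return True
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```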
The only case for event-driven: when latency matters (chat responses, webhook processing).
Lesson 4: Log Everything, Trust Nothing
Every agent writes structured logs. Every external API call is logged with request, response, and duration. This has saved me dozens of debugging hours.
log.info("Publishing %s to %s", post_id, platform)
# ... publish ...
log.info("SUCCESS: %s -> %s (post_id: %s, duration: %.1fs)",
         post_id, platform, result["post_id"], duration)
When something breaks at 3 AM, logs are all you have. Make them count.
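A per-agent logger with timestamps and rotation takes a few lines of stdlib `logging` setup. A minimal sketch (file paths and size limits are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

def make_logger(name, path):
    """One rotating log file per agent, timestamped lines.
    Sizes and format are illustrative defaults."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=5_000_000, backupCount=3)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    log.addHandler(handler)
    return log
```

Rotation matters on a small VPS: an agent that logs every API call will eventually fill the disk without it.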
Lesson 5: Graceful Degradation Over Perfect Execution
My analytics agent uses Groq's free tier for AI analysis. When the API is down (happens weekly), it falls back to rule-based analysis. When that fails, it returns raw data with no analysis.
try:
    analysis = groq_analyze(data)
except GroqAPIError:
    try:
        analysis = rule_based_analyze(data)
    except Exception:
        analysis = {"raw_data": data, "note": "Analysis unavailable"}
Users get something rather than nothing. The agent keeps running rather than crashing.
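Nested try/except blocks get unwieldy past two levels. The same pattern can be written as a generic fallback chain; this helper and its strategy names are illustrative, not the article's code:

```python
def with_fallbacks(data, *strategies):
    """Try each analysis strategy in order; return the first result
    that succeeds, else a raw-data placeholder. Illustrative helper."""
    for strategy in strategies:
        try:
            return strategy(data)
        except Exception:
            continue  # fall through to the next, cheaper strategy
    return {"raw_data": data, "note": "Analysis unavailable"}

# Usage sketch: analysis = with_fallbacks(data, groq_analyze, rule_based_analyze)
```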
Bonus: The Stack
- Language: Python 3.10 (no frameworks)
- Scheduling: System cron
- State: JSON files
- Logging: Python logging to files
- Hosting: Single VPS, $15/month
- LLM: MiniMax M2.5 + Groq Llama 3.3 (both free tier)
Total monthly cost: $15 for hosting. Everything else is free.
The best AI agent architecture is the simplest one that works reliably.
Building production AI systems? See more at sborka.work