_Originally published at o137.ai_
The demo was impressive. Production is another story.
What enterprise reports really say — and what it means in practice.
Based on: LangChain State of Agents 2026, Cleanlab Enterprise Report, UC Berkeley MAP, McKinsey State of AI, Docker official documentation
The demo/production gap is real — and massive
In 2024-2025, AI agent demos proliferated. An agent that answers in natural language, uses tools, chains actions across multiple steps — on stage or in a notebook, it impresses.
In production, it's different. Not slightly different. Fundamentally different.
Key finding — Cleanlab / MIT 2025
Of 1,837 companies surveyed on their AI agent deployment, only 95 actually had an agent in production with real user interactions. And among those 95, the majority remained in an early maturity phase.
Source: AI Agents in Production 2025, Cleanlab (based on MIT State of AI in Business 2025 data)
It's not a model problem. LLMs work. The problem is everything around them: infrastructure, evaluation, governance, team trust.
"Most so-called AI agents can't reliably do what they claim."
— Curtis Northcutt, CEO of Cleanlab
What "production" really requires
Most guides list the right requirements but without quantified context. Here's what the data shows:
- 57% of surveyed companies have agents in production (LangChain, 1,300+ respondents, 2025)
- 32% cite quality as the main barrier to production
- 89% of production teams have implemented some form of observability
- 68% of agents run fewer than 10 steps before human intervention (Berkeley MAP)
Sources: LangChain State of Agent Engineering (Dec. 2025, n=1,340); UC Berkeley Measuring Agents in Production (n=300+)
Volume and latency. An application with 10,000 requests/day does not have the same constraints as a 10-request prototype. Latency has become the second most cited challenge (20% of teams), especially for multi-step agents where each LLM call adds up. Practical recommendations: aim under 500ms for a conversational agent, under 2 seconds for complex analytics.
Reliability, not uptime. Traditional uptime (99.9%) is not the right metric for an AI agent. An agent can be "available" but produce wrong answers, hallucinate, call the wrong tool, or get stuck in an infinite loop. These silent failures are more dangerous than a crash, because they trigger no alert.
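Silent failures become manageable once every agent step is checked against explicit invariants. A minimal sketch of the idea, with a hypothetical tool allowlist and step budget (the names and thresholds are illustrative, not from the reports):

```python
# Sketch: turn silent failures into alertable signals.
# Assumes each agent step yields a tool name and a text output.

ALLOWED_TOOLS = {"search", "summarize", "send_email"}  # hypothetical allowlist
MAX_STEPS = 10

def check_step(step_index: int, tool: str, output: str) -> list[str]:
    """Return a list of anomaly flags for one agent step."""
    flags = []
    if tool not in ALLOWED_TOOLS:
        flags.append(f"unknown_tool:{tool}")
    if not output.strip():
        flags.append("empty_output")
    if step_index >= MAX_STEPS:
        flags.append("step_budget_exceeded")  # likely a loop
    return flags

# A step that would pass uptime monitoring but is actually broken:
print(check_step(3, "delete_database", ""))
# → ['unknown_tool:delete_database', 'empty_output']
```

Each non-empty flag list can feed a normal alerting pipeline, which is exactly what uptime metrics fail to provide.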
Legal traceability and audit. In regulated sectors, 42% of companies plan to add supervision features (approvals, review controls) — versus only 16% in unregulated sectors. Without auditability of every decision, a production deployment exposes the company to regulatory risk.
Human escalation. Berkeley measured that 92.5% of production agents send their output to humans rather than to other systems. That's not a design flaw — it's a deliberate strategy to maintain reliability.
From localhost to production: the technical path
This is where most guides stop being useful. "Deploy to the cloud" is not a step. Here's the concrete path.
Why localhost ≠ production
When your agent works on your machine, you're typically running:
- API keys hardcoded in a .env file or directly in the code
- A single Python process with no restart on crash
- No logging, no monitoring, no concurrency handling
- Dependencies tied to your local Python version
None of that survives production. Here's how to bridge the gap systematically.
Step 1 — Containerize with Docker
Docker is the standard because it solves the "works on my machine" problem definitively. Your agent runs in an isolated container with its own dependencies, Python version, and environment — identical across dev, staging, and prod.
Dockerfile (Python agent, FastAPI)
# --- Build stage ---
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# --- Runtime stage ---
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY . .
# Never hardcode secrets here
ENV PYTHONUNBUFFERED=1
# Health check: verify the agent AND its dependencies are up
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=3).raise_for_status()"
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Key points:
- Multi-stage build keeps the final image small (no build tools in production)
- HEALTHCHECK verifies the container is actually functional, not just running
- No secrets in the Dockerfile — ever
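The HEALTHCHECK above assumes a /health route in the app. A sketch of the logic such a route might run, with the dependency probes stubbed (a real service would ping Redis, run `SELECT 1` against Postgres, and optionally check the LLM provider):

```python
# Sketch of the logic behind a /health route: probe real dependencies,
# not just process liveness. Probes are stubbed here for illustration.

def check_redis() -> bool:
    return True  # stub: a real probe would run redis_client.ping()

def check_database() -> bool:
    return True  # stub: a real probe would run SELECT 1

def health() -> tuple[int, dict]:
    checks = {"redis": check_redis(), "database": check_database()}
    status = 200 if all(checks.values()) else 503
    return status, {"status": "ok" if status == 200 else "degraded", "checks": checks}

status, body = health()
print(status, body)
```

Returning 503 when any dependency is down is what lets the orchestrator restart or drain the container instead of routing traffic to a half-broken agent.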
Step 2 — Manage secrets properly
The most common mistake: API keys in code or committed .env files.
Local development — .env file (never committed to git):
# .env — add this to .gitignore
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://user:password@localhost:5432/agent_db
REDIS_URL=redis://localhost:6379/0
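However the values are injected, the app should fail fast if one is missing: a crash at startup is far cheaper than a failure mid-request. A minimal loader sketch (the demo value is a placeholder, never a real key):

```python
import os

def require_env(name: str) -> str:
    """Read a required setting, failing at startup rather than mid-request."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Demo with a dummy value; in production these come from the secret manager.
os.environ.setdefault("OPENAI_API_KEY", "sk-dummy-for-demo")
OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Call `require_env` once at import time for every required key, so a misconfigured container dies immediately and the orchestrator's restart/alerting machinery kicks in.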
In docker-compose.yml (local and staging):
services:
agent:
build: .
ports:
- "8000:8000"
env_file:
- .env
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
postgres:
image: postgres:15-alpine
environment:
POSTGRES_DB: agent_db
POSTGRES_USER: agent_user
POSTGRES_PASSWORD: agent_password
healthcheck:
test: ["CMD-SHELL", "pg_isready -U agent_user"]
interval: 10s
timeout: 3s
retries: 3
In production — use your cloud provider's secret manager, never plain env vars:
- AWS → Secrets Manager
- GCP → Secret Manager
- Kubernetes → kubectl create secret + mount as env vars
Step 3 — Add a staging environment
Never deploy directly from localhost to production. The staging environment catches environment-specific bugs (different OS, different network, different secret values) before they hit users.
localhost (dev)
↓
docker-compose up → everything runs locally, identical to prod
↓
staging (cloud) → same Docker image, real secrets, limited traffic
↓
production → same image promoted from staging, full traffic
The key principle: the same Docker image travels through all three environments. You're not rebuilding for prod — you're promoting a tested image.
Step 4 — Choose your production infrastructure
Three main options depending on your scale and team:
| Option | Best for | Scaling | Complexity |
|---|---|---|---|
| Google Cloud Run / AWS Lambda | Stateless agents, variable traffic | Automatic (serverless) | Low |
| AWS ECS / Azure Container Apps | Teams without Kubernetes expertise | Manual or auto | Medium |
| Kubernetes (EKS, GKE, AKS) | Large scale, multi-agent systems | Full control | High |
Practical recommendation: Start with Cloud Run or ECS. Kubernetes is justified only when you have multiple agent types, high traffic, and a dedicated DevOps function.
For Cloud Run (simplest path from Docker to production):
# Build and push your image
docker build -t gcr.io/your-project/your-agent:v1.0.0 .
docker push gcr.io/your-project/your-agent:v1.0.0
# Deploy
gcloud run deploy your-agent \
--image gcr.io/your-project/your-agent:v1.0.0 \
--platform managed \
--region europe-west1 \
--memory 2Gi \
--timeout 60s \
--set-secrets OPENAI_API_KEY=openai-key:latest
Note the --memory 2Gi minimum — LLM applications need at least 1-2GB RAM. And --timeout 60s accounts for multi-step reasoning chains.
Step 5 — Handle concurrency with a queue
At low traffic (< 100 requests/day), a single process is fine. At scale, you need to separate request intake from execution.
Incoming requests → Redis queue → Worker 1
→ Worker 2
→ Worker 3
This prevents a slow agent run (10+ LLM calls) from blocking all other requests. Queue depth (jobs waiting) and worker utilization (CPU/memory per worker) become your main scaling signals — add workers when the queue grows faster than it drains.
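The intake/worker split can be illustrated with stdlib primitives. This is a shape sketch only: production systems use Redis plus a worker framework (RQ, Celery, or similar), but the pattern is identical — enqueue fast, execute elsewhere:

```python
# Minimal sketch of the intake/worker split using stdlib primitives.
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: list[str] = []
lock = threading.Lock()

def worker(worker_id: int) -> None:
    while True:
        request = jobs.get()
        if request is None:        # poison pill: shut the worker down
            jobs.task_done()
            return
        # Stand-in for a slow agent run (10+ LLM calls in real life):
        outcome = f"worker-{worker_id} handled {request}"
        with lock:
            results.append(outcome)
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

for i in range(9):                 # intake: returns immediately
    jobs.put(f"request-{i}")
for _ in threads:                  # one poison pill per worker
    jobs.put(None)

jobs.join()
print(len(results))  # → 9
```

The intake loop never blocks on agent execution, and scaling is just adding threads (or, in the real version, worker containers) when queue depth grows.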
The real problems in production
Hallucinations and output quality
Hallucinations don't work like classic software bugs. An agent doesn't "crash" when it hallucinates — it answers confidently while inventing information. In a multi-step workflow, an early hallucination can contaminate all following steps.
Beware of misleading metrics. An 85% accuracy rate at launch may seem solid. If it drops to 72% three months later, that is a signal of model drift or data misalignment, not normal fluctuation.
Measuring hallucinations in production today relies mainly on the "LLM-as-judge" approach: one model evaluates another model's outputs on consistency, factuality, and grounding in sources. It's imperfect but operational at scale.
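The pattern looks roughly like this. The judge call is stubbed with a word-overlap heuristic so the sketch is runnable; in production it would be a real model call scoring consistency, factuality, and grounding, and all names here are illustrative:

```python
# Sketch of the LLM-as-judge pattern with a stubbed judge model.

JUDGE_PROMPT = (
    "Given the SOURCE and the ANSWER, reply 'grounded' if every claim in "
    "the ANSWER is supported by the SOURCE, else 'ungrounded'.\n"
    "SOURCE: {source}\nANSWER: {answer}"
)

def call_judge_model(prompt: str) -> str:
    # Stub: a real implementation sends `prompt` to a second LLM.
    source = prompt.split("SOURCE: ")[1].split("\nANSWER:")[0]
    answer = prompt.split("ANSWER: ")[1]
    return "grounded" if all(w in source for w in answer.split()) else "ungrounded"

def hallucination_flag(source: str, answer: str) -> bool:
    verdict = call_judge_model(JUDGE_PROMPT.format(source=source, answer=answer))
    return verdict == "ungrounded"

print(hallucination_flag("the invoice total is 40 euros", "total is 40 euros"))  # → False
print(hallucination_flag("the invoice total is 40 euros", "total is 90 euros"))  # → True
```

Run on a sample of real traffic, the flag rate becomes the continuously measured hallucination metric discussed below.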
Drift and stack instability
The AI stack moves fast — too fast to be stable. In the regulated sector, 70% of teams rebuild their agent stack every three months or faster. Each rebuild loses behavioral continuity. What you validated in January may no longer be valid in April if you changed model, framework version, or data pipeline.
Integration with existing systems
Salesforce acknowledged that its Einstein Copilot encountered difficulties in pilot because it could not reliably navigate between customer data silos and existing CRM workflows. This case isn't isolated — it's the norm. McKinsey notes that organizations reporting significant ROI from AI projects are twice as likely to have reconfigured their workflows end-to-end before deploying the agent.
Observability: the non-negotiable foundation
89% of teams with agents in production have implemented some form of observability. Among those planning investments in the year, improving observability is the number one priority (62% of prod teams).
What to trace
An AI agent is not a classic web service. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Standard monitoring tools (uptime, API latency) don't measure what matters.
- Full traces — every reasoning step, every tool call, every intermediate decision, with inputs/outputs
- Quality metrics — relevance, factuality, instruction compliance, consistency over time
- Cost per request — the top 5% most expensive requests often consume 50% of tokens
- Latency by percentile — p50, p95, p99 (not just average: slow requests are the ones that generate complaints)
- Drift detection — compare performance across prompt versions, models, or time windows
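The percentile and cost metrics above are cheap to compute once traces are exported. A sketch over raw trace records (field names are illustrative; real traces come from your observability tool's export):

```python
# Sketch: percentile latency and token-cost concentration from trace records.

def percentile(values: list[float], p: float) -> float:
    ordered = sorted(values)
    index = min(int(p / 100 * len(ordered)), len(ordered) - 1)
    return ordered[index]

traces = [
    {"latency_ms": 420, "tokens": 900},
    {"latency_ms": 380, "tokens": 700},
    {"latency_ms": 2900, "tokens": 14000},  # one expensive outlier
    {"latency_ms": 510, "tokens": 1100},
]

latencies = [t["latency_ms"] for t in traces]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))

# Share of tokens consumed by the most expensive request:
total = sum(t["tokens"] for t in traces)
top = max(t["tokens"] for t in traces)
print(f"top request consumes {top / total:.0%} of tokens")
```

Even this toy sample shows the typical shape: the average latency looks fine while p95 is dominated by one run, and a single request eats most of the token budget.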
Market tools (2025-2026)
- Langfuse (open-source, self-hosted): full traces with replay, prompt versioning, evaluations. De facto standard for teams that want full control of their data.
- Arize Phoenix: unified observability for traditional ML + LLM, "council of judges" approach for evaluation.
- LangSmith (LangChain): native integration for LangChain/LangGraph projects, execution chain visualization.
- Datadog LLM Observability: for teams already on Datadog — integrates AI monitoring into the existing observability stack.
If you're looking for a platform that integrates observability, human supervision, and agent control natively — without stitching together five tools — that's exactly what we built at Origin 137.
Architecture: the real choices
Containerization and orchestration
Docker + Kubernetes is the de facto standard for production deployments. Docker ensures reproducibility. Kubernetes handles scaling, load balancing, and automatic recovery on failure. For execution mode: if your agents must handle traffic spikes, queue mode (Redis + workers) separates scheduling from execution.
RAG vs fine-tuning
Most production teams use off-the-shelf models without fine-tuning, with manually tuned prompts. Fine-tuning complexity is only justified for very specific use cases. RAG (Retrieval-Augmented Generation) remains the preferred solution to ground responses in verifiable sources and reduce hallucinations.
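The core RAG move is small: retrieve the most relevant snippets, then constrain the prompt to them. A deliberately naive sketch (word-overlap scoring stands in for embeddings and a vector store; documents and wording are invented for illustration):

```python
# Minimal sketch of the RAG idea: retrieve, then ground the prompt.

DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Shipping to EU countries takes 3 to 5 business days.",
    "Premium support is available on the Enterprise plan only.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    return f"Answer ONLY from the context below.\nContext:\n{context}\nQuestion: {question}"

print(build_prompt("how long do refunds take?"))
```

The "Answer ONLY from the context" instruction, combined with the judge-style evaluation described earlier, is what makes responses both grounded and auditable against their sources.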
Multi-agent or single agent?
The move toward distributed multi-agent systems is real in large enterprises. But beware: each additional agent multiplies communication paths, conflict scenarios, and coordination requirements. Berkeley teams observe that 68% of production agents stop in fewer than 10 steps before human intervention — a sign that complexity remains deliberately limited.
Common pitfall: Agents can end up in infinite loops — retrying failed operations indefinitely, or continuing to process already completed tasks. Defining explicit termination conditions is not optional.
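What "explicit termination conditions" look like in practice: a hard step budget plus detection of repeated identical actions, the classic retry loop. A sketch (the planner interface is invented for illustration):

```python
# Sketch: explicit termination conditions around an agent loop.

def run_agent(plan_next_step, max_steps: int = 10) -> str:
    """Drive the agent, aborting on loops or an exhausted step budget."""
    seen_actions: set[str] = set()
    for step in range(max_steps):
        action = plan_next_step(step)
        if action == "DONE":
            return "completed"
        if action in seen_actions:       # same action twice: retry loop
            return "aborted: repeated action, likely loop"
        seen_actions.add(action)
    return "aborted: step budget exhausted"

# A planner stuck retrying the same failing tool call:
print(run_agent(lambda step: "call_crm_api(retry)"))
# → aborted: repeated action, likely loop
```

Both abort branches are natural escalation points: hand the partial state to a human instead of burning tokens in a loop nobody is watching.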
Human supervision: not a stopgap
In the vast majority of production cases, agents pass their results to humans rather than to other systems. That's not lack of trust in the technology — it's deliberate architecture.
Forrester states it clearly in its 2025 AI Model Overview Report: AI agents fail in unexpected and costly ways, with failure modes that don't resemble classic software bugs. They emerge from ambiguity, poor coordination, and unpredictable systemic dynamics.
Human supervision isn't a temporary limitation until models improve. It's an architectural component that enables responsible deployment today while maintaining auditability and legal accountability.
The KPIs that actually matter
Uptime (99.9% vs 95%) is a relevant KPI for infrastructure, not for evaluating an AI agent. The metrics that matter in production:
- Task completion rate — does the agent actually accomplish the requested task?
- Hallucination rate — measured continuously via automated evaluations on real traffic samples
- p95 and p99 latency — the slowest users define perceived experience
- Human escalation rate — too low can mean false confidence; too high indicates a quality problem
- Cost per successful request — not total cost, but cost relative to actually useful outputs
- Quality drift over time — weekly or monthly comparison of evaluation scores
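Two of these KPIs, computed from evaluation records (the record fields and numbers are illustrative):

```python
# Sketch: cost per successful request and quality drift from eval records.

runs = [
    {"success": True,  "cost_usd": 0.04},
    {"success": True,  "cost_usd": 0.06},
    {"success": False, "cost_usd": 0.09},  # failed runs still cost money
    {"success": True,  "cost_usd": 0.05},
]

successes = [r for r in runs if r["success"]]
total_cost = sum(r["cost_usd"] for r in runs)
# Cost relative to actually useful outputs, not total cost:
cost_per_success = total_cost / len(successes)
print(f"cost per successful request: ${cost_per_success:.3f}")

# Quality drift: compare current evaluation scores against a baseline window.
baseline_scores = [0.86, 0.84, 0.85]
current_scores = [0.74, 0.72, 0.73]
drift = sum(baseline_scores) / len(baseline_scores) - sum(current_scores) / len(current_scores)
print(f"score drop vs baseline: {drift:.2f}")  # flag if above an agreed threshold
```

Note how failed runs inflate cost per success even though total spend looks flat, and how the drift number reproduces the 85% → 72% degradation pattern described earlier.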
What it means in practice
If you're starting an AI agent project in 2026, the data suggests this sequence:
- Define what "reliable" means for your specific use case — not in general. What error rate is acceptable? What latency? When must a human be in the loop?
- Containerize from day one. A proper Dockerfile + docker-compose.yml from the start eliminates an entire class of "works on my machine" problems before they happen.
- Put observability in before launch. Not after. Langfuse or Arize Phoenix open-source are enough to start. Without full traces, you can't debug, improve, or justify the agent's decisions.
- Use a staging environment. The same Docker image travels from localhost → staging → production. Never rebuild for prod.
- Reconfigure workflows before plugging in the agent. McKinsey data is clear: organizations that re-design their processes upfront are twice as likely to achieve significant ROI.
- Stay simple until complexity is justified. A 5-step agent with well-designed human supervision is more reliable — and more useful — than a 20-step autonomous agent that produces silent errors.
- Plan for stack instability. If 70% of teams in regulated sectors rebuild their stack every three months, that's the norm. Architect with swappable modules. Don't marry one framework.
Main sources
- LangChain, State of Agent Engineering, Dec. 2025 (n=1,340 professionals)
- Cleanlab, AI Agents in Production 2025 (MIT State of AI in Business 2025, n=1,837)
- UC Berkeley, Measuring Agents in Production, Melissa Pan et al. (n=300+ teams)
- McKinsey, State of AI 2025
- Forrester, 2025 AI Model Overview Report
- Docker, Agentic AI Applications — official documentation, 2025-2026
- Docker, Build AI Agents with Docker Compose, Nov. 2025
- MachineLearningMastery, Deploying AI Agents to Production: Architecture, Infrastructure, and Implementation Roadmap, Mar. 2026
- n8n Blog, 15 best practices for deploying AI agents in production, Jan. 2026
- FreeCodeCamp, How to Build and Deploy a Multi-Agent AI System with Python and Docker, Feb. 2026
This article synthesizes public data available in March 2026. Figures may evolve rapidly in this space.
Not sure where to start with your own agent?
We offer a free 20-minute workshop to help you define your first agentic use case — what to automate, how to scope it, and what production readiness actually looks like for your context.