_Originally published at o137.ai_
The demo was impressive. Production is another story.
What enterprise reports really say — and what it means in practice.
Based on: LangChain State of Agents 2026, Cleanlab Enterprise Report, UC Berkeley MAP, McKinsey State of AI, Docker official documentation
The demo/production gap is real — and massive
In 2024-2025, AI agent demos proliferated. An agent that answers in natural language, uses tools, chains actions across multiple steps — on stage or in a notebook, it impresses.
In production, it's different. Not slightly different. Fundamentally different.
Key finding — Cleanlab / MIT 2025
Of 1,837 companies surveyed on their AI agent deployment, only 95 actually had an agent in production with real user interactions. And among those 95, the majority remained in an early maturity phase.
Source: AI Agents in Production 2025, Cleanlab (based on MIT State of AI in Business 2025 data)
It's not a model problem. LLMs work. The problem is everything around them: infrastructure, evaluation, governance, team trust.
"Most so-called AI agents can't reliably do what they claim."
— Curtis Northcutt, CEO of Cleanlab
What "production" really requires
Most guides list the right requirements but without quantified context. Here's what the data shows:
- 57% of surveyed companies have agents in production (LangChain, 1,300+ respondents, 2025)
- 32% cite quality as the main barrier to production
- 89% of production teams have implemented some form of observability
- 68% of agents run fewer than 10 steps before human intervention (Berkeley MAP)
Sources: LangChain State of Agent Engineering (Dec. 2025, n=1,340); UC Berkeley Measuring Agents in Production (n=300+)
Volume and latency. An application with 10,000 requests/day does not have the same constraints as a 10-request prototype. Latency has become the second most cited challenge (20% of teams), especially for multi-step agents where each LLM call adds up. Practical recommendations: aim under 500ms for a conversational agent, under 2 seconds for complex analytics.
Reliability, not uptime. Traditional uptime (99.9%) is not the right metric for an AI agent. An agent can be "available" but produce wrong answers, hallucinate, call the wrong tool, or get stuck in an infinite loop. These silent failures are more dangerous than a crash, because they trigger no alert.
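Silent failures become manageable once every agent step is checked against explicit invariants. A minimal sketch of the idea, with a hypothetical tool allowlist and step budget (the names and thresholds are illustrative, not from the reports):

```python
# Sketch: turn silent failures into alertable signals.
# Assumes each agent step yields a tool name and a text output.

ALLOWED_TOOLS = {"search", "summarize", "send_email"}  # hypothetical allowlist
MAX_STEPS = 10

def check_step(step_index: int, tool: str, output: str) -> list[str]:
    """Return a list of anomaly flags for one agent step."""
    flags = []
    if tool not in ALLOWED_TOOLS:
        flags.append(f"unknown_tool:{tool}")
    if not output.strip():
        flags.append("empty_output")
    if step_index >= MAX_STEPS:
        flags.append("step_budget_exceeded")  # likely a loop
    return flags

# A step that would pass uptime monitoring but is actually broken:
print(check_step(3, "delete_database", ""))
# → ['unknown_tool:delete_database', 'empty_output']
```

Each non-empty flag list can feed a normal alerting pipeline, which is exactly what uptime metrics fail to provide.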
Legal traceability and audit. In regulated sectors, 42% of companies plan to add supervision features (approvals, review controls) — versus only 16% in unregulated sectors. Without auditability of every decision, a production deployment exposes the company to regulatory risk.
Human escalation. Berkeley measured that 92.5% of production agents send their output to humans rather than to other systems. That's not a design flaw — it's a deliberate strategy to maintain reliability.
From localhost to production: the technical path
This is where most guides stop being useful. "Deploy to the cloud" is not a step. Here's the concrete path.
Why localhost ≠ production
When your agent works on your machine, you're typically running:
- API keys hardcoded in a .env file or directly in the code
- A single Python process with no restart on crash
- No logging, no monitoring, no concurrency handling
- Dependencies tied to your local Python version
None of that survives production. Here's how to bridge the gap systematically.
Step 1 — Containerize with Docker
Docker is the standard because it solves the "works on my machine" problem definitively. Your agent runs in an isolated container with its own dependencies, Python version, and environment — identical across dev, staging, and prod.
Dockerfile (Python agent, FastAPI)
# --- Build stage ---
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# --- Runtime stage ---
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY . .
# Never hardcode secrets here
ENV PYTHONUNBUFFERED=1
# Health check: verify the agent AND its dependencies are up
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=3).raise_for_status()"
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Key points:
- Multi-stage build keeps the final image small (no build tools in production)
- HEALTHCHECK verifies the container is actually functional, not just running
- No secrets in the Dockerfile — ever
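The HEALTHCHECK above assumes a /health route in the app. A sketch of the logic such a route might run, with the dependency probes stubbed (a real service would ping Redis, run `SELECT 1` against Postgres, and optionally check the LLM provider):

```python
# Sketch of the logic behind a /health route: probe real dependencies,
# not just process liveness. Probes are stubbed here for illustration.

def check_redis() -> bool:
    return True  # stub: a real probe would run redis_client.ping()

def check_database() -> bool:
    return True  # stub: a real probe would run SELECT 1

def health() -> tuple[int, dict]:
    checks = {"redis": check_redis(), "database": check_database()}
    status = 200 if all(checks.values()) else 503
    return status, {"status": "ok" if status == 200 else "degraded", "checks": checks}

status, body = health()
print(status, body)
```

Returning 503 when any dependency is down is what lets the orchestrator restart or drain the container instead of routing traffic to a half-broken agent.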
Step 2 — Manage secrets properly
The most common mistake: API keys in code or committed .env files.
Local development — .env file (never committed to git):
# .env — add this to .gitignore
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://user:password@localhost:5432/agent_db
REDIS_URL=redis://localhost:6379/0
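However the values are injected, the app should fail fast if one is missing: a crash at startup is far cheaper than a failure mid-request. A minimal loader sketch (the demo value is a placeholder, never a real key):

```python
import os

def require_env(name: str) -> str:
    """Read a required setting, failing at startup rather than mid-request."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Demo with a dummy value; in production these come from the secret manager.
os.environ.setdefault("OPENAI_API_KEY", "sk-dummy-for-demo")
OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Call `require_env` once at import time for every required key, so a misconfigured container dies immediately and the orchestrator's restart/alerting machinery kicks in.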
In docker-compose.yml (local and staging):
services:
agent:
build: .
ports:
- "8000:8000"
env_file:
- .env
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
postgres:
image: postgres:15-alpine
environment:
POSTGRES_DB: agent_db
POSTGRES_USER: agent_user
POSTGRES_PASSWORD: agent_password
healthcheck:
test: ["CMD-SHELL", "pg_isready -U agent_user"]
interval: 10s
timeout: 3s
retries: 3
In production — use your cloud provider's secret manager, never plain env vars:
- AWS → Secrets Manager
- GCP → Secret Manager
- Kubernetes → kubectl create secret + mount as env vars
Step 3 — Add a staging environment
Never deploy directly from localhost to production. The staging environment catches environment-specific bugs (different OS, different network, different secret values) before they hit users.
localhost (dev)
↓
docker-compose up → everything runs locally, identical to prod
↓
staging (cloud) → same Docker image, real secrets, limited traffic
↓
production → same image promoted from staging, full traffic
The key principle: the same Docker image travels through all three environments. You're not rebuilding for prod — you're promoting a tested image.
Step 4 — Choose your production infrastructure
Three main options depending on your scale and team:
| Option | Best for | Scaling | Complexity |
|---|---|---|---|
| Google Cloud Run / AWS Lambda | Stateless agents, variable traffic | Automatic (serverless) | Low |
| AWS ECS / Azure Container Apps | Teams without Kubernetes expertise | Manual or auto | Medium |
| Kubernetes (EKS, GKE, AKS) | Large scale, multi-agent systems | Full control | High |
Practical recommendation: Start with Cloud Run or ECS. Kubernetes is justified only when you have multiple agent types, high traffic, and a dedicated DevOps function.
For Cloud Run (simplest path from Docker to production):
# Build and push your image
docker build -t gcr.io/your-project/your-agent:v1.0.0 .
docker push gcr.io/your-project/your-agent:v1.0.0
# Deploy
gcloud run deploy your-agent \
--image gcr.io/your-project/your-agent:v1.0.0 \
--platform managed \
--region europe-west1 \
--memory 2Gi \
--timeout 60s \
--set-secrets OPENAI_API_KEY=openai-key:latest
Note the --memory 2Gi minimum — LLM applications need at least 1-2GB RAM. And --timeout 60s accounts for multi-step reasoning chains.
Step 5 — Handle concurrency with a queue
At low traffic (< 100 requests/day), a single process is fine. At scale, you need to separate request intake from execution.
Incoming requests → Redis queue → Worker 1
→ Worker 2
→ Worker 3
This prevents a slow agent run (10+ LLM calls) from blocking all other requests. Queue depth (jobs waiting) and worker utilization (CPU/memory per worker) become your main scaling signals — add workers when the queue grows faster than it drains.
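The intake/worker split can be illustrated with stdlib primitives. This is a shape sketch only: production systems use Redis plus a worker framework (RQ, Celery, or similar), but the pattern is identical — enqueue fast, execute elsewhere:

```python
# Minimal sketch of the intake/worker split using stdlib primitives.
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: list[str] = []
lock = threading.Lock()

def worker(worker_id: int) -> None:
    while True:
        request = jobs.get()
        if request is None:        # poison pill: shut the worker down
            jobs.task_done()
            return
        # Stand-in for a slow agent run (10+ LLM calls in real life):
        outcome = f"worker-{worker_id} handled {request}"
        with lock:
            results.append(outcome)
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

for i in range(9):                 # intake: returns immediately
    jobs.put(f"request-{i}")
for _ in threads:                  # one poison pill per worker
    jobs.put(None)

jobs.join()
print(len(results))  # → 9
```

The intake loop never blocks on agent execution, and scaling is just adding threads (or, in the real version, worker containers) when queue depth grows.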
The real problems in production
Hallucinations and output quality
Hallucinations don't work like classic software bugs. An agent doesn't "crash" when it hallucinates — it answers confidently while inventing information. In a multi-step workflow, an early hallucination can contaminate all following steps.
Beware of misleading metrics. An 85% accuracy rate at launch may seem solid. If it drops to 72% three months later, that is a signal of model drift or data misalignment, not normal fluctuation.
Measuring hallucinations in production today relies mainly on the "LLM-as-judge" approach: one model evaluates another model's outputs on consistency, factuality, and grounding in sources. It's imperfect but operational at scale.
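The pattern looks roughly like this. The judge call is stubbed with a word-overlap heuristic so the sketch is runnable; in production it would be a real model call scoring consistency, factuality, and grounding, and all names here are illustrative:

```python
# Sketch of the LLM-as-judge pattern with a stubbed judge model.

JUDGE_PROMPT = (
    "Given the SOURCE and the ANSWER, reply 'grounded' if every claim in "
    "the ANSWER is supported by the SOURCE, else 'ungrounded'.\n"
    "SOURCE: {source}\nANSWER: {answer}"
)

def call_judge_model(prompt: str) -> str:
    # Stub: a real implementation sends `prompt` to a second LLM.
    source = prompt.split("SOURCE: ")[1].split("\nANSWER:")[0]
    answer = prompt.split("ANSWER: ")[1]
    return "grounded" if all(w in source for w in answer.split()) else "ungrounded"

def hallucination_flag(source: str, answer: str) -> bool:
    verdict = call_judge_model(JUDGE_PROMPT.format(source=source, answer=answer))
    return verdict == "ungrounded"

print(hallucination_flag("the invoice total is 40 euros", "total is 40 euros"))  # → False
print(hallucination_flag("the invoice total is 40 euros", "total is 90 euros"))  # → True
```

Run on a sample of real traffic, the flag rate becomes the continuously measured hallucination metric discussed below.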
Drift and stack instability
The AI stack moves fast — too fast to be stable. In the regulated sector, 70% of teams rebuild their agent stack every three months or faster. Each rebuild loses behavioral continuity. What you validated in January may no longer be valid in April if you changed model, framework version, or data pipeline.
Integration with existing systems
Salesforce acknowledged that its Einstein Copilot encountered difficulties in pilot because it could not reliably navigate between customer data silos and existing CRM workflows. This case isn't isolated — it's the norm. McKinsey notes that organizations reporting significant ROI from AI projects are twice as likely to have reconfigured their workflows end-to-end before deploying the agent.
Observability: the non-negotiable foundation
89% of teams with agents in production have implemented some form of observability. Among those planning investments in the year, improving observability is the number one priority (62% of prod teams).
What to trace
An AI agent is not a classic web service. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Standard monitoring tools (uptime, API latency) don't measure what matters.
- Full traces — every reasoning step, every tool call, every intermediate decision, with inputs/outputs
- Quality metrics — relevance, factuality, instruction compliance, consistency over time
- Cost per request — the top 5% most expensive requests often consume 50% of tokens
- Latency by percentile — p50, p95, p99 (not just average: slow requests are the ones that generate complaints)
- Drift detection — compare performance across prompt versions, models, or time windows
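The percentile and cost metrics above are cheap to compute once traces are exported. A sketch over raw trace records (field names are illustrative; real traces come from your observability tool's export):

```python
# Sketch: percentile latency and token-cost concentration from trace records.

def percentile(values: list[float], p: float) -> float:
    ordered = sorted(values)
    index = min(int(p / 100 * len(ordered)), len(ordered) - 1)
    return ordered[index]

traces = [
    {"latency_ms": 420, "tokens": 900},
    {"latency_ms": 380, "tokens": 700},
    {"latency_ms": 2900, "tokens": 14000},  # one expensive outlier
    {"latency_ms": 510, "tokens": 1100},
]

latencies = [t["latency_ms"] for t in traces]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))

# Share of tokens consumed by the most expensive request:
total = sum(t["tokens"] for t in traces)
top = max(t["tokens"] for t in traces)
print(f"top request consumes {top / total:.0%} of tokens")
```

Even this toy sample shows the typical shape: the average latency looks fine while p95 is dominated by one run, and a single request eats most of the token budget.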
Market tools (2025-2026)
- Langfuse (open-source, self-hosted): full traces with replay, prompt versioning, evaluations. De facto standard for teams that want full control of their data.
- Arize Phoenix: unified observability for traditional ML + LLM, "council of judges" approach for evaluation.
- LangSmith (LangChain): native integration for LangChain/LangGraph projects, execution chain visualization.
- Datadog LLM Observability: for teams already on Datadog — integrates AI monitoring into the existing observability stack.
If you're looking for a platform that integrates observability, human supervision, and agent control natively — without stitching together five tools — that's exactly what we built at Origin 137.
Architecture: the real choices
Containerization and orchestration
Docker + Kubernetes is the de facto standard for production deployments. Docker ensures reproducibility. Kubernetes handles scaling, load balancing, and automatic recovery on failure. For execution mode: if your agents must handle traffic spikes, queue mode (Redis + workers) separates scheduling from execution.
RAG vs fine-tuning
Most production teams use off-the-shelf models without fine-tuning, with manually tuned prompts. Fine-tuning complexity is only justified for very specific use cases. RAG (Retrieval-Augmented Generation) remains the preferred solution to ground responses in verifiable sources and reduce hallucinations.
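The core RAG move is small: retrieve the most relevant snippets, then constrain the prompt to them. A deliberately naive sketch (word-overlap scoring stands in for embeddings and a vector store; documents and wording are invented for illustration):

```python
# Minimal sketch of the RAG idea: retrieve, then ground the prompt.

DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Shipping to EU countries takes 3 to 5 business days.",
    "Premium support is available on the Enterprise plan only.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    return f"Answer ONLY from the context below.\nContext:\n{context}\nQuestion: {question}"

print(build_prompt("how long do refunds take?"))
```

The "Answer ONLY from the context" instruction, combined with the judge-style evaluation described earlier, is what makes responses both grounded and auditable against their sources.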
Multi-agent or single agent?
The move toward distributed multi-agent systems is real in large enterprises. But beware: each additional agent multiplies communication paths, conflict scenarios, and coordination requirements. Berkeley teams observe that 68% of production agents stop in fewer than 10 steps before human intervention — a sign that complexity remains deliberately limited.
Common pitfall: Agents can end up in infinite loops — retrying failed operations indefinitely, or continuing to process already completed tasks. Defining explicit termination conditions is not optional.
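What "explicit termination conditions" look like in practice: a hard step budget plus detection of repeated identical actions, the classic retry loop. A sketch (the planner interface is invented for illustration):

```python
# Sketch: explicit termination conditions around an agent loop.

def run_agent(plan_next_step, max_steps: int = 10) -> str:
    """Drive the agent, aborting on loops or an exhausted step budget."""
    seen_actions: set[str] = set()
    for step in range(max_steps):
        action = plan_next_step(step)
        if action == "DONE":
            return "completed"
        if action in seen_actions:       # same action twice: retry loop
            return "aborted: repeated action, likely loop"
        seen_actions.add(action)
    return "aborted: step budget exhausted"

# A planner stuck retrying the same failing tool call:
print(run_agent(lambda step: "call_crm_api(retry)"))
# → aborted: repeated action, likely loop
```

Both abort branches are natural escalation points: hand the partial state to a human instead of burning tokens in a loop nobody is watching.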
Human supervision: not a stopgap
In the vast majority of production cases, agents pass their results to humans rather than to other systems. That's not lack of trust in the technology — it's deliberate architecture.
Forrester states it clearly in its 2025 AI Model Overview Report: AI agents fail in unexpected and costly ways, with failure modes that don't resemble classic software bugs. They emerge from ambiguity, poor coordination, and unpredictable systemic dynamics.
Human supervision isn't a temporary limitation until models improve. It's an architectural component that enables responsible deployment today while maintaining auditability and legal accountability.
The KPIs that actually matter
Uptime (99.9% vs 95%) is a relevant KPI for infrastructure, not for evaluating an AI agent. The metrics that matter in production:
- Task completion rate — does the agent actually accomplish the requested task?
- Hallucination rate — measured continuously via automated evaluations on real traffic samples
- p95 and p99 latency — the slowest users define perceived experience
- Human escalation rate — too low can mean false confidence; too high indicates a quality problem
- Cost per successful request — not total cost, but cost relative to actually useful outputs
- Quality drift over time — weekly or monthly comparison of evaluation scores
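Two of these KPIs, computed from evaluation records (the record fields and numbers are illustrative):

```python
# Sketch: cost per successful request and quality drift from eval records.

runs = [
    {"success": True,  "cost_usd": 0.04},
    {"success": True,  "cost_usd": 0.06},
    {"success": False, "cost_usd": 0.09},  # failed runs still cost money
    {"success": True,  "cost_usd": 0.05},
]

successes = [r for r in runs if r["success"]]
total_cost = sum(r["cost_usd"] for r in runs)
# Cost relative to actually useful outputs, not total cost:
cost_per_success = total_cost / len(successes)
print(f"cost per successful request: ${cost_per_success:.3f}")

# Quality drift: compare current evaluation scores against a baseline window.
baseline_scores = [0.86, 0.84, 0.85]
current_scores = [0.74, 0.72, 0.73]
drift = sum(baseline_scores) / len(baseline_scores) - sum(current_scores) / len(current_scores)
print(f"score drop vs baseline: {drift:.2f}")  # flag if above an agreed threshold
```

Note how failed runs inflate cost per success even though total spend looks flat, and how the drift number reproduces the 85% → 72% degradation pattern described earlier.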
What it means in practice
If you're starting an AI agent project in 2026, the data suggests this sequence:
- Define what "reliable" means for your specific use case — not in general. What error rate is acceptable? What latency? When must a human be in the loop?
- Containerize from day one. A proper Dockerfile + docker-compose.yml from the start eliminates an entire class of "works on my machine" problems before they happen.
- Put observability in before launch. Not after. Langfuse or Arize Phoenix open-source are enough to start. Without full traces, you can't debug, improve, or justify the agent's decisions.
- Use a staging environment. The same Docker image travels from localhost → staging → production. Never rebuild for prod.
- Reconfigure workflows before plugging in the agent. McKinsey data is clear: organizations that re-design their processes upfront are twice as likely to achieve significant ROI.
- Stay simple until complexity is justified. A 5-step agent with well-designed human supervision is more reliable — and more useful — than a 20-step autonomous agent that produces silent errors.
- Plan for stack instability. If 70% of teams in regulated sectors rebuild their stack every three months, that's the norm. Architect with swappable modules. Don't marry one framework.
Main sources
- LangChain, State of Agent Engineering, Dec. 2025 (n=1,340 professionals)
- Cleanlab, AI Agents in Production 2025 (MIT State of AI in Business 2025, n=1,837)
- UC Berkeley, Measuring Agents in Production, Melissa Pan et al. (n=300+ teams)
- McKinsey, State of AI 2025
- Forrester, 2025 AI Model Overview Report
- Docker, Agentic AI Applications — official documentation, 2025-2026
- Docker, Build AI Agents with Docker Compose, Nov. 2025
- MachineLearningMastery, Deploying AI Agents to Production: Architecture, Infrastructure, and Implementation Roadmap, Mar. 2026
- n8n Blog, 15 best practices for deploying AI agents in production, Jan. 2026
- FreeCodeCamp, How to Build and Deploy a Multi-Agent AI System with Python and Docker, Feb. 2026
This article synthesizes public data available in March 2026. Figures may evolve rapidly in this space.
Not sure where to start with your own agent?
We offer a free 20-minute workshop to help you define your first agentic use case — what to automate, how to scope it, and what production readiness actually looks like for your context.