- Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
- Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The agent works on your laptop. It passes evals. Your manager asks when it ships and you say Monday, because the modeling is done. Then you try to put it behind a load balancer and it falls apart, because you deployed it like a web service.
An agent is not a web service. A web service answers in milliseconds and forgets. An agent thinks for minutes, burns tokens across two or three providers, streams partial output to a browser, and sometimes decides to call delete_invoice on the eighth turn. Every deployment decision you make flows from one question: what does this thing do to your infrastructure while it is running?
Here is how to package it, where to hold state, and how to scale a workload whose bottleneck is a model call you do not control.
The shape is decided by the longest step
The single rule that saves you the most pain: an agent's deployment shape is decided by its longest step, not its average step.
A support chatbot answers in two seconds. A code-review agent thinks for six minutes. A research agent runs for forty. You cannot put all three behind the same HTTP endpoint and expect any of them to survive. Pick the pattern that matches the longest step, then cap the rest with timeouts.
- Under 30s → stateless HTTP endpoint (Cloud Run, Fly.io).
- 30s to 5m with a user watching → streaming over WebSocket or SSE.
- 5m to an hour, async → queue plus worker (Temporal, Inngest, or Redis).
- Longer than an hour → still queue plus worker, whether you like it or not.
Do not hold an HTTP request open for forty minutes. Something you did not know existed will kill it at the worst moment: a proxy, a CDN, a load-balancer idle timeout.
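The thresholds above fit in a small function. This is a minimal sketch, not a rule engine: the name `pick_pattern` and its return labels are made up for illustration, and sending a mid-length run with nobody watching to the queue is an assumption the list does not spell out.

```python
def pick_pattern(longest_step_s: float, user_watching: bool = False) -> str:
    """Map an agent's longest step (in seconds) to a deployment pattern."""
    if longest_step_s < 30:
        return "stateless-http"
    if longest_step_s <= 300 and user_watching:
        return "streaming"
    # Assumption: anything slower, or mid-length with nobody watching,
    # goes to a queue plus worker.
    return "queue-worker"
```

Feed it the longest step you have measured, not the average, and let timeouts cap everything else.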
Package it: pin everything, drop root
The base image is the same across every pattern. Pin your Python, pin your SDKs, run as a non-root user, install nothing you do not need.
# Dockerfile

# Stage 1: build wheels so no compiler toolchain reaches production.
FROM python:3.13-slim-bookworm AS builder
ENV PIP_NO_CACHE_DIR=1
WORKDIR /build
COPY requirements.txt .
RUN pip wheel --wheel-dir /wheels -r requirements.txt

# Stage 2: install from the prebuilt wheels and run as a non-root user.
FROM python:3.13-slim-bookworm AS runtime
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1
RUN groupadd -r agent && useradd -r -g agent agent
WORKDIR /app
COPY --from=builder /wheels /wheels
COPY requirements.txt .
RUN pip install --no-index \
    --find-links=/wheels -r requirements.txt
COPY app/ ./app/
USER agent
EXPOSE 8000
CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", "--port", "8000"]
Two things earn their keep. The multi-stage build compiles wheels in stage one and copies only the built wheels into stage two, so no build toolchain ships to production. And slim-bookworm is roughly 130 MB against about 1.1 GB for the default image. As a rough estimate, that smaller pull shaves seconds off cold pod start when you scale up under load.
Never bake an API key into the image. The runtime injects it. On Kubernetes that is a mounted Secret; on GCP it is Secret Manager with Workload Identity; on Fly it is fly secrets set. The image has no credentials, the agent reads them at boot, and it never logs them.
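On the application side, "reads them at boot and never logs them" can look something like the sketch below. `ANTHROPIC_API_KEY` is the variable the Anthropic SDK reads by default; the helper names `load_api_key` and `redact` and the `MissingSecret` error are hypothetical, introduced only for this example.

```python
import os
from typing import Mapping


class MissingSecret(RuntimeError):
    """Raised at boot when a required credential was not injected."""


def load_api_key(env: Mapping[str, str] = os.environ,
                 name: str = "ANTHROPIC_API_KEY") -> str:
    """Read the key once at startup and fail fast if it is absent."""
    key = env.get(name, "").strip()
    if not key:
        # The message names the variable, never its value.
        raise MissingSecret(f"{name} is not set; refusing to start")
    return key


def redact(secret: str) -> str:
    """Form that is safe to log: at most the last four characters."""
    return "***" + secret[-4:] if len(secret) > 8 else "***"
```

Failing at boot turns a missing secret into a pod that never becomes ready, rather than a pod that takes traffic and fails on the first model call.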
Pin exact versions in requirements.txt. SDK field names drift, and a build that worked in February will break in July if the versions float:
anthropic==0.94.1
langgraph==1.1.6
litellm==1.75.1
fastapi==0.118.0
uvicorn[standard]==0.34.0
redis==5.2.1
tenacity==9.1.2
Stateless first: hold state in Redis, not the process
Reach for stateless unless you have a concrete reason not to. Each request starts a fresh agent. Conversation history, tool results, the scratchpad — all of it lives in Redis or Postgres, keyed by a session ID the client sends in. The process holds nothing between requests, so any pod can serve any request and a rolling deploy loses no memory.
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from anthropic import AsyncAnthropic
import redis.asyncio as redis
import json

app = FastAPI()
r = redis.from_url("redis://cache:6379/0")
client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment


class Req(BaseModel):
    session_id: str
    message: str


@app.post("/chat")
async def chat(req: Req):
    # All conversation state lives in Redis, keyed by the client's session ID.
    key = f"hist:{req.session_id}"
    raw = await r.get(key)
    history = json.loads(raw) if raw else []
    history.append(
        {"role": "user", "content": req.message}
    )
    resp = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=history,
    )
    reply = resp.content[0].text
    history.append(
        {"role": "assistant", "content": reply}
    )
    # Write the updated history back with a one-hour TTL.
    await r.set(key, json.dumps(history), ex=3600)
    return {"reply": reply}
The state is the session, not the server. That is the whole trick. When you need a durable, long-running agent instead (one that must survive a worker restart mid-run), you move to a queue and a workflow engine, and the state lives in the workflow store rather than in Redis. Do not fake durability by keeping it in process memory.
Health checks: cheap liveness, expensive readiness
Kubernetes probes are where agent deployments quietly break, because the defaults assume a fast service.
Split the two probes. /healthz is cheap: did the process start, is the event loop alive. /readyz is expensive: can the agent actually reach its provider. A pod that boots with a bad API key should never take traffic.
# app/probes.py
from fastapi import APIRouter, Response
from anthropic import AsyncAnthropic

router = APIRouter()
client = AsyncAnthropic()


@router.get("/healthz")
async def healthz():
    # Liveness: the process is up and the event loop can answer.
    return {"status": "ok"}


@router.get("/readyz")
async def readyz():
    # Readiness: prove the provider is reachable with the injected key.
    try:
        await client.models.list()
        return {"status": "ready"}
    except Exception:
        return Response(status_code=503)
Liveness should almost never kill the pod. A legitimate long agent turn can make the event loop look stuck, and the standard advice, kill it after three failures, will murder a run that was working fine. Point liveness at /healthz with a generous failure threshold, and set terminationGracePeriodSeconds larger than your worst-case turn so a rolling deploy lets in-flight runs finish instead of severing them.
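The grace period only helps if the process uses it. Here is a minimal sketch of the application half: track in-flight turns and, on shutdown, wait for them up to a deadline. The `InFlight` class and its method names are hypothetical, and wiring `drain` into your server's shutdown hook is left out.

```python
import asyncio


class InFlight:
    """Tracks running agent turns so shutdown can wait for them."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def start(self, coro) -> asyncio.Task:
        """Run a turn as a task and forget it once it finishes."""
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def drain(self, grace_s: float) -> int:
        """Wait up to grace_s for running turns; return how many were cut off."""
        if not self._tasks:
            return 0
        _done, pending = await asyncio.wait(set(self._tasks), timeout=grace_s)
        for task in pending:
            task.cancel()
        return len(pending)
```

Keep `grace_s` a little below terminationGracePeriodSeconds so the process cancels stragglers itself before the kubelet sends SIGKILL.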
Scaling: the bottleneck is a lock you do not own
Here is what makes agents different from a normal backend. Adding workers does not add throughput past a point, because every provider caps you on requests per minute and tokens per minute, per key. Scale your pool from 10 to 100 and those limits do not move. You will just generate more 429s.
So the first thing to wire is not autoscaling. It is a semaphore per (provider, model) pair, sized to your real budget, plus retry with backoff and jitter for the noise that slips through.
import asyncio
from tenacity import (
    retry, stop_after_attempt,
    wait_exponential_jitter,
)

# One gate per model. It caps concurrent calls, so size it
# from your RPM budget and your typical call latency.
gate = asyncio.Semaphore(64)


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30),
)
async def call_model(client, **kw):
    # Each retry re-enters the gate, so retries also respect the cap.
    async with gate:
        return await client.messages.create(**kw)
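A single module-level gate covers one model in one process. To get a gate per (provider, model) pair, and to keep ten pods from each claiming the whole budget, one option is a small registry that splits a fleet-wide limit across pods. This is a sketch under assumptions: `build_gates`, `gated`, the even split, and the budget numbers in the usage note are all illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Pair = Tuple[str, str]  # (provider, model)


def build_gates(fleet_budget: Dict[Pair, int],
                pods: int) -> Dict[Pair, asyncio.Semaphore]:
    """Split each fleet-wide concurrency budget evenly across pods.

    Every pair gets at least one slot per pod. Call this from inside
    the running event loop, for example in a startup hook.
    """
    return {
        pair: asyncio.Semaphore(max(1, limit // pods))
        for pair, limit in fleet_budget.items()
    }


async def gated(gates: Dict[Pair, asyncio.Semaphore], provider: str,
                model: str, call: Callable[[], Awaitable]):
    """Run call() while holding the gate for its (provider, model) pair."""
    async with gates[(provider, model)]:
        return await call()
```

With a fleet budget of 64 concurrent calls for one pair and 8 pods, `build_gates({("anthropic", "claude-sonnet-4-6"): 64}, pods=8)` gives each pod a gate of 8.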
The workload blocks on I/O. The model call is a network wait, so concurrency is nearly free at the runtime level and expensive at the provider level. That inversion is the whole scaling story. Your CPU sits idle while a hundred coroutines wait on Claude. So do not autoscale on CPU alone; it will read near zero while you are completely saturated. Scale on in-flight request count, and cap in-flight runs per pod (four is a sane start for a 1-CPU pod with 30-second turns) so a single box does not fan out a thousand concurrent calls and eat your whole rate limit.
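Scaling on in-flight count comes down to one division and one ceiling. The sketch below assumes you already export an in-flight gauge. `desired_replicas`, its parameter names, and the default of 64 for the provider concurrency budget are illustrative; only the per-pod cap of four comes from the text.

```python
import math


def desired_replicas(in_flight: int,
                     per_pod_cap: int = 4,
                     provider_concurrency: int = 64,
                     min_replicas: int = 1) -> int:
    """Replica count from in-flight runs, never past the provider budget."""
    wanted = math.ceil(in_flight / per_pod_cap)
    # Pods beyond this point cannot add throughput; they only add 429s.
    ceiling = max(min_replicas, provider_concurrency // per_pod_cap)
    return max(min_replicas, min(wanted, ceiling))
```

The ceiling is the important line: once the fleet can already fill the provider budget, the autoscaler should stop and the queue should absorb the rest.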
For a TypeScript agent the same gate is a small counting semaphore around the call:
// gate.ts — cap concurrent model calls
let active = 0;
const queue: Array<() => void> = [];
const LIMIT = 64;

export async function withGate<T>(
  fn: () => Promise<T>,
): Promise<T> {
  // Re-check after waking: another caller may have taken the freed slot.
  while (active >= LIMIT) {
    await new Promise<void>((r) => queue.push(r));
  }
  active++;
  try {
    return await fn();
  } finally {
    active--;
    // Wake one waiter, if any.
    queue.shift()?.();
  }
}
When the pool saturates, the queue grows. That is your signal. Shed load at the ingress (return 429 or park the work on a durable queue) rather than letting the agent thrash against the provider. A queue that holds work is also what carries you through a provider outage: the run waits for the model to come back instead of failing.
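Shedding at the ingress can be as simple as a bounded backlog that answers 429 once it is full. This is a sketch only: `BoundedBacklog` is a hypothetical name, and the in-memory deque stands in for the durable queue you would use in production, since it loses its contents on restart.

```python
from collections import deque
from typing import Any, Deque, Optional


class BoundedBacklog:
    """Accept work until the backlog is full, then shed it."""

    def __init__(self, max_waiting: int) -> None:
        self._max = max_waiting
        self._jobs: Deque[Any] = deque()

    def admit(self, job: Any) -> int:
        """Return an HTTP-style status: 202 if queued, 429 if shed."""
        if len(self._jobs) >= self._max:
            return 429
        self._jobs.append(job)
        return 202

    def take(self) -> Optional[Any]:
        """Hand the oldest job to a worker, or None when idle."""
        return self._jobs.popleft() if self._jobs else None
```

Size `max_waiting` from how long a caller will tolerate waiting, not from how much memory the pod has.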
Route through a gateway, not straight at the provider
Point your agent at one gateway that owns fallback, not at a provider SDK directly. When Claude Sonnet rate-limits, the gateway retries on the next model in the chain and your agent code never sees it. LiteLLM Proxy is the self-hosted default; OpenRouter and Portkey are the managed options.
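To make the fallback behaviour concrete, here is roughly what the gateway does for you, reduced to a loop. It is an illustration of the idea, not how LiteLLM Proxy, OpenRouter, or Portkey implement it. `RateLimited` is a stand-in for a provider 429, and `complete_with_fallback` and the `(name, call)` chain shape are hypothetical.

```python
from typing import Awaitable, Callable, List, Tuple


class RateLimited(Exception):
    """Stand-in for a provider 429; real SDKs raise their own error types."""


ModelCall = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    chain: List[Tuple[str, ModelCall]], prompt: str
) -> Tuple[str, str]:
    """Try each (name, call) in order; move on only when one is rate-limited."""
    last_error = None
    for name, call in chain:
        try:
            return name, await call(prompt)
        except RateLimited as err:
            last_error = err
    raise RuntimeError("every model in the chain was rate-limited") from last_error
```

Returning the name of the model that answered matters: you need it in your traces to know how often the fallback is carrying traffic.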
The catch: a fallback chain you have never exercised is a fallback chain you do not have. The failover target needs the same rate-limit headroom as your primary, and a different model means a different tokenizer and tool-call schema. Test the chain against a fixed eval set, reported as its own score, and drill it — force ten percent of traffic through the fallback once a month. If it cannot carry ten percent on a quiet Tuesday, it will not carry a hundred percent at 3 AM.
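One way to run the ten-percent drill is to route by a stable hash of the request ID, so the same request always lands on the same side and the split stays close to the target. A minimal sketch, assuming you have a request ID to hash; `use_fallback` and `drill_percent` are hypothetical names.

```python
import hashlib


def use_fallback(request_id: str, drill_percent: int = 10) -> bool:
    """Deterministically send about drill_percent of requests to the fallback."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < drill_percent
```

Set `drill_percent` to 0 outside the drill window and the function routes everything to the primary.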
The deployment is the easy part
Five patterns, a Dockerfile, a semaphore, and a fallback chain get you something that ships on Monday and does not fall over on Tuesday. The container runs, the gateway routes around outages, the queue holds work when nothing else can. That part is mechanical.
The hard part is what happens at 3 AM when the agent starts returning confident answers to questions nobody asked and burning tokens on a loop that never terminates — which is why deployment is only worth doing on top of tracing and evals. Agents in Production walks the five patterns, the scaling limits, and the degradation ladder end to end; Observability for LLM Applications, its companion in The AI Engineer's Library, is the tracing, evals, and cost-accounting layer that tells you the loop went wrong before your users do.
