Hemang Joshi

Posted on • Originally published at linkedin.com

Deploying Autonomous AI Agents in Production: A 2-Week Playbook

#ai

Most teams quote 3-6 months to deploy production AI agents. We've done it in 2 weeks. Repeatedly. Not demos. Not hackathon toys. Agents that handle real tickets, call real tools, pass real evals, and don't set the observability dashboard on fire at 2 AM.

The difference isn't some secret framework. It's ruthless scoping, opinionated tooling choices, and a hardening phase that most teams skip because they're still arguing about whether to use LangGraph or AutoGen on day 40.

Here is the exact playbook we run for enterprise AI teams. Fourteen days, start to production. Steal it.


Days 1-2: Scoping and Success Metrics

If you skip this step, the rest of the playbook collapses. Ninety percent of "failed" agent projects I've inherited failed here, not in the code.

The first question is not "which framework" or "which model." It is: what does a working agent actually output, measured how, at what cost, at what latency?

On day one we lock down four numbers:

  1. Task success rate target. For a tier-1 support triage agent, we aim for 85% deflection on classified intents with <2% of cases escalated to a human too late. For a sales research agent, 90% correct CRM field population on a 200-row golden set.
  2. P95 latency budget. Streaming-first agents usually land at 4-8s time-to-first-token and 20-40s time-to-final. Non-streaming batch agents get 60-120s. Anything over that and users will abandon or your queue depth explodes.
  3. Unit economics ceiling. Cost per successful task, not cost per token. A $0.40/task agent that replaces a $12 human action is a business. A $0.08/task agent that hallucinates 15% of the time is a liability.
  4. Blast radius. What's the worst thing this agent can do to a production system if it misfires? If the answer is "wire money" or "drop a database," we architect a human-in-the-loop gate from hour one, not hour 200.
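The unit-economics ceiling above is simple arithmetic, but teams keep computing cost per token instead of cost per successful task. A minimal sketch of the metric we actually track (numbers illustrative):

```python
def cost_per_successful_task(total_spend_usd: float,
                             tasks_attempted: int,
                             success_rate: float) -> float:
    """Cost per *successful* task. Failed runs still burn tokens,
    so the denominator is successes, not attempts."""
    successes = tasks_attempted * success_rate
    return total_spend_usd / successes

# An agent spending $80 across 1,000 attempts at 85% success:
# 80 / 850 ≈ $0.094 per successful task.
```

At a 15% hallucination rate, every rework or escalation also belongs in the numerator, which is how the "$0.08/task liability" happens.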

We run a one-hour scoping workshop with the product owner and two senior engineers from the customer side. The output is a one-page spec with those four numbers, three to five representative tasks, and a list of tools the agent must call. That's it. If the customer can't agree on the four numbers, we don't start coding.


Days 3-5: Framework and Tool Selection

By now half of you are shouting "just tell me which framework." Fine.

CrewAI wins when the workflow is naturally role-based and relatively linear: a researcher agent, a writer agent, an editor agent, pass the baton. It's fast to stand up, the abstractions match how non-engineers think about work, and it plays well with both OpenAI and Anthropic tool-calling. Weakness: anything with complex conditional branching or long-running state becomes awkward.

LangGraph wins the moment the agent needs real state, cycles, interrupts, or human-in-the-loop checkpoints. It is our default for anything touching financial workflows, medical triage, or multi-step enterprise processes with approval gates. The graph model maps cleanly to real business processes. The learning curve is real, though; budget a day for your team to internalize the state schema discipline.

AutoGen wins for research-flavored problems where multiple agents genuinely debate, critique, and iterate. It is overkill for anything deterministic, and we usually don't ship it to production untouched. We port the design to LangGraph once the conversation topology stabilizes.

Raw tool-calling loops with no framework win more often than anyone admits. If your agent is a single role with 4-8 tools and a 20-step max horizon, a 200-line custom loop with structured outputs will beat any framework on latency, debuggability, and on-call sanity.
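The no-framework loop fits in a page. A minimal sketch, assuming an OpenAI-style chat client and a dict mapping tool names to plain Python callables (all names here are illustrative, not from a specific codebase):

```python
import json

MAX_STEPS = 20  # the hard horizon from the scoping spec

def run_agent(client, model, messages, tools, tool_impls):
    """Minimal tool-calling loop: ask the model, execute any requested
    tools, feed results back, stop on a final answer or the step budget."""
    for _ in range(MAX_STEPS):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:               # no tool requested: final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            try:
                result = tool_impls[call.function.name](**args)
            except Exception as exc:
                # structured error return so the agent can self-correct
                result = {"error": type(exc).__name__, "detail": str(exc)}
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": json.dumps(result)})
    raise RuntimeError("step budget exhausted; hand off to a human")
```

Everything here is visible in a stack trace, which is the on-call sanity the paragraph above is talking about.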

Tool definitions matter more than framework choice. We standardize on JSON Schema with strict mode enabled (OpenAI) or XML-tagged tool-use blocks (Anthropic). Every tool gets a deterministic name, a one-sentence description optimized for the LLM, typed parameters with enums wherever possible, and a structured error return so the agent can self-correct.
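A tool definition following those rules might look like this — a sketch in the OpenAI strict-mode shape, with an illustrative tool name and fields:

```python
# One tool written to the conventions above: deterministic name,
# one-sentence description aimed at the LLM, enums, strict schema.
lookup_order_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",              # illustrative name
        "description": "Fetch one order by its ID and return its status.",
        "strict": True,                      # OpenAI strict mode
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "detail": {
                    "type": "string",
                    "enum": ["summary", "full"],  # enums wherever possible
                },
            },
            "required": ["order_id", "detail"],
            "additionalProperties": False,   # required by strict mode
        },
    },
}
```

The Anthropic equivalent carries the same schema under `input_schema`; the discipline (name, one-sentence description, enums, typed parameters) is identical either way.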

More on our approach: hjlabs.in/AIML/services/agentic-ai/


Days 6-8: Build the PoC Agent

Three days. Not three weeks. If the PoC takes longer, the scope is wrong.

The system prompt is written last, not first. We start with tool definitions and a golden task, let the model try to solve it with zero instructions, and watch where it fails. The system prompt patches those specific failures. Short, behavioral, present tense, under 800 tokens. Long system prompts are a smell.

Memory: working memory is the conversation plus a scratchpad. Episodic memory goes in Postgres with a simple schema. Semantic memory, if needed at all, goes in the RAG layer.

Error handling: every tool call wraps in a retry with exponential backoff on transient errors, a structured error return on permanent ones, and a hard circuit breaker on the total step budget. An agent that has taken 30 tool calls to solve a task is almost certainly in a loop. Kill it, log the transcript, fall back to human handoff.
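A sketch of that wrapper — transient exception types, retry counts, and the structured-return shape are assumptions, not a prescribed API:

```python
import time

TRANSIENT = (TimeoutError, ConnectionError)  # retry these
MAX_RETRIES = 3
STEP_BUDGET = 30  # the hard circuit breaker from the text

class ToolRunner:
    """Wraps every tool call in retry-with-exponential-backoff on
    transient errors, a structured error on permanent ones, and a
    hard circuit breaker on the total step budget."""
    def __init__(self):
        self.steps = 0

    def call(self, tool, **kwargs):
        self.steps += 1
        if self.steps > STEP_BUDGET:
            # almost certainly a loop: kill, log, hand off to a human
            raise RuntimeError("step budget exceeded; escalate to human")
        delay = 0.5
        for _ in range(MAX_RETRIES):
            try:
                return {"ok": True, "result": tool(**kwargs)}
            except TRANSIENT:
                time.sleep(delay)
                delay *= 2                   # exponential backoff
            except Exception as exc:
                # permanent error: structured return so the agent self-corrects
                return {"ok": False, "error": type(exc).__name__,
                        "detail": str(exc)}
        return {"ok": False, "error": "transient_retries_exhausted"}
```

The structured `{"ok": False, ...}` return goes back to the model as a tool result; the `RuntimeError` does not — it terminates the run and triggers the human handoff.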

We enable prompt caching from day one. On Anthropic, cache the system prompt and tool definitions; on OpenAI, rely on automatic prefix caching and order your prompt carefully to maximize hits. 40-70% cost savings from two lines of code.
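On Anthropic, the "two lines" are `cache_control` markers on the static prefix. A sketch of the request shape, with an illustrative model id, system prompt, and tool:

```python
SYSTEM_PROMPT = "You are a tier-1 support triage agent."   # illustrative
TOOL_DEF = {
    "name": "lookup_order",                                # illustrative
    "description": "Fetch one order by its ID.",
    "input_schema": {"type": "object",
                     "properties": {"order_id": {"type": "string"}},
                     "required": ["order_id"]},
}

request = {
    "model": "claude-sonnet-4-20250514",   # illustrative model id
    "max_tokens": 1024,
    # cache_control marks the end of the static, cacheable prefix:
    "system": [{"type": "text", "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}}],
    "tools": [{**TOOL_DEF, "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "Where is order 9481?"}],
}
# client.messages.create(**request) with the anthropic SDK sends it;
# later calls sharing the same prefix are served from cache.
```

On OpenAI there is nothing to mark: keep the stable content (system prompt, tool definitions) at the front of the prompt and the per-request content at the end, and automatic prefix caching does the rest.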


Days 9-10: RAG Knowledge Base

Most agents that "hallucinate" are actually starving for context. A good RAG layer fixes 80% of perceived agent quality issues.

Our defaults:

  • Chunking. Semantic chunking on paragraph and section boundaries, 400-800 tokens per chunk with 15% overlap. Tables and code get their own chunk type.
  • Embedding model. text-embedding-3-large for English-dominant corpora, voyage-3 for technical documentation, cohere-embed-multilingual-v3 for multilingual. Benchmark on the customer's own content.
  • Vector store. Qdrant for self-hosted production (binary quantization: 32x memory compression, <1% recall loss). Weaviate for hybrid search. Pinecone for zero-ops.
  • Reranker. Non-negotiable. Cohere Rerank 3 or fine-tuned BGE reranker on top-50 candidates, returning top-5 to the LLM. 15-25 points of answer quality for 80ms of latency.
  • Query rewriting. A small model rewrites the user query into 2-3 retrieval queries before search.
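The chunking default above can be approximated in a few lines — a simplified sketch that splits on paragraph boundaries and counts whitespace-separated words as a stand-in for model tokens (production code would use a real tokenizer and section-aware splitting):

```python
def chunk(text: str, max_tokens: int = 600, overlap_ratio: float = 0.15):
    """Greedy paragraph-boundary chunking with ~15% overlap.
    Approximates token counts with whitespace-separated words."""
    paras = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paras:
        words = p.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            keep = int(max_tokens * overlap_ratio)  # carry overlap forward
            current = current[-keep:]
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Tables and code would bypass this path entirely and get their own chunk type, as the bullet above says.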

Deeper writeup: hjlabs.in/AIML/services/rag-systems/


Days 11-12: Evaluation Harness

No evals, no production. Full stop.

The golden dataset is built by hand, by a domain expert, in a spreadsheet. 80-150 tasks, each with the input, the expected tool-call trajectory, and the expected final output shape.

Three layers of evals:

  1. Deterministic checks. Did the agent call the required tools? Did the output parse as valid JSON? Did it stay under the step budget?
  2. LLM-as-judge. A stronger model grades task success against a versioned rubric. Pairwise comparison beats absolute scoring for subjective tasks.
  3. Human spot checks. 10-20 tasks per release, reviewed by the domain expert.
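Layer 1 is plain assertions over the run trace. A sketch, assuming each run is logged as a dict of tool calls plus the raw final output (the trace shape is an assumption, not a fixed schema):

```python
import json

def deterministic_checks(run: dict, required_tools, step_budget: int = 30):
    """Layer-1 evals: required tools called, output parses as JSON,
    step budget respected. `run` is an illustrative trace dict."""
    called = {c["name"] for c in run["tool_calls"]}
    results = {
        "required_tools_called": set(required_tools) <= called,
        "within_step_budget": len(run["tool_calls"]) <= step_budget,
    }
    try:
        json.loads(run["final_output"])
        results["valid_json_output"] = True
    except (ValueError, TypeError):
        results["valid_json_output"] = False
    return results
```

These checks are cheap enough to run on every golden task on every commit, which is what makes the CI gate below practical.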

The eval harness runs in CI on every prompt or tool change. A regression in task success rate blocks the merge.

Langfuse or Arize as the trace and eval backend. Langfuse is faster to self-host; Arize has the edge on enterprise features.


Days 13-14: Production Hardening

The last two days are where most teams run out of budget and ship a prototype to prod. Do not be most teams.

  • Observability: structured trace per agent run with inputs, every tool call, every LLM call with token counts, and the final output.
  • Rate limiting: per-user, per-tenant, per-tool. A runaway agent that hammers an expensive tool 200 times is a six-figure incident.
  • Fallbacks: tiered model strategy. Primary model with streaming, secondary model on timeout, static fallback on total failure.
  • Cost monitoring: dashboards per agent per customer, alert thresholds at 1.5x and 3x daily budget.
  • Guardrails: input and output. Input guardrails block prompt injection. Output guardrails validate structure, redact PII, check policy violations.
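The per-tool rate limit in that list can be as simple as a token bucket in front of each expensive tool — a framework-agnostic sketch with illustrative defaults:

```python
import time

class TokenBucket:
    """Per-tool token bucket: `capacity` calls of burst, refilled at
    `rate` tokens/second. A runaway agent gets `allow() == False`
    instead of producing a six-figure incident."""
    def __init__(self, capacity: int = 10, rate: float = 0.5):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keyed per user, per tenant, and per tool, a denied call becomes a structured error the agent sees, or a hard stop, depending on the tool's blast radius.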

The Real Lesson

Two weeks is not a magic number. It is the result of refusing to let any single phase expand beyond its budget. Scoping eats forever if you let it. Framework debates eat forever if you let them. RAG tuning eats forever if you let it. The playbook works because each phase has a hard stop and a crisp deliverable.

Teams that struggle are almost never struggling with the model. They are struggling with scope discipline, eval discipline, and the unsexy production hardening work that doesn't make for good demos.

Want this playbook executed for your team by engineers who have shipped it to enterprise production more than a dozen times? Book a 30-min scoping call: cal.com/hemangjoshi37a
