Every AI tutorial shows you a chatbot that answers questions. That's not an agent. An agent decides what to do, takes action, observes the result, and adapts. In production, it does all of that reliably, with audit trails, error recovery, and human oversight.
LangGraph — the graph-based orchestration layer from LangChain — has quietly become the framework of choice for teams shipping real agents. Uber routes support workflows through it. LinkedIn uses it for internal knowledge agents. Klarna runs customer-facing agents on it at scale.
This article is the guide I wish I had when I moved from prototype to production. We'll build a Research Assistant agent end-to-end, covering every pattern that matters when uptime counts.
When to Use Agents (and When Not To)
Before writing a single line of agent code, ask yourself: does this task require dynamic decision-making?
Use agents when:
- The number of steps is unknown at design time
- The task requires selecting from multiple tools based on context
- Intermediate results change the execution path
- You need autonomous error recovery
Don't use agents when:
- A fixed pipeline (prompt → LLM → output) solves the problem
- You can enumerate all paths in advance (use a simple chain)
- Latency budget is under 2 seconds (agents loop; loops are slow)
- The cost of a wrong autonomous action is high and you can't add human checkpoints
Agents add complexity. A well-designed chain with structured outputs will outperform a poorly designed agent every time. Start with the simplest approach that works, then graduate to agents when you hit the wall.
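To make the contrast concrete, here is what the fixed-pipeline alternative looks like: a sketch with the model call stubbed out (`call_llm` and `summarize_ticket` are hypothetical stand-ins, not a real client).

```python
def call_llm(prompt: str) -> str:
    # Stubbed for illustration; in real code this is one API call.
    return f"SUMMARY: {prompt[:40]}"

def summarize_ticket(ticket_text: str) -> str:
    """A fixed pipeline: prompt template -> LLM -> output. No loop, no routing."""
    prompt = f"Summarize this support ticket in one sentence:\n{ticket_text}"
    return call_llm(prompt)

print(summarize_ticket("Customer cannot reset password after upgrade."))
```

If this shape solves your problem, you don't need a graph at all.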
LangGraph Core Concepts
LangGraph models agent logic as a directed graph where:
- State is a typed dictionary that flows through the graph
- Nodes are functions that read and write state
- Edges connect nodes (static or conditional)
- Conditional edges inspect state and route to different nodes
Here's the minimal mental model:
```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
from operator import add

class AgentState(TypedDict):
    messages: Annotated[list, add]  # append-only message list
    step_count: int

def process(state: AgentState) -> dict:
    return {"messages": ["processed"], "step_count": state["step_count"] + 1}

def should_continue(state: AgentState) -> str:
    return "end" if state["step_count"] >= 3 else "process"

graph = StateGraph(AgentState)
graph.add_node("process", process)
graph.add_edge(START, "process")
graph.add_conditional_edges("process", should_continue, {"process": "process", "end": END})

app = graph.compile()
result = app.invoke({"messages": [], "step_count": 0})
```
The `Annotated[list, add]` is critical — it tells LangGraph to merge list returns instead of overwriting. Without it, each node would clobber the previous messages.
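You can see the reducer's behavior with plain `operator.add`, which is exactly the function LangGraph applies when merging a node's return value into existing state:

```python
from operator import add

existing = {"messages": ["hello"], "step_count": 1}
node_return = {"messages": ["processed"], "step_count": 2}

# Keys annotated with a reducer are merged...
merged = add(existing["messages"], node_return["messages"])
print(merged)  # ['hello', 'processed']

# ...keys without one are simply overwritten by the node's return value.
print(node_return["step_count"])  # 2
```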
Building the Research Assistant
Let's build something real: an agent that takes a research question, searches the web, reads and summarizes relevant pages, and produces a structured report. This is the kind of agent companies actually deploy.
Step 1: Define the State
```python
from typing import TypedDict, Annotated, Literal
from operator import add
from pydantic import BaseModel

class Source(BaseModel):
    url: str
    title: str
    summary: str
    relevance_score: float

class ResearchState(TypedDict):
    question: str
    search_queries: list[str]
    sources: Annotated[list[Source], add]
    draft_report: str
    critique: str
    final_report: str
    iteration: int
    status: str
```
I'm using Pydantic models for Source — this gives you validation and serialization for free, which matters when you're persisting state to a database.
Step 2: Define the Nodes
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_community.tools.tavily_search import TavilySearchResults

llm = ChatOpenAI(model="gpt-4o", temperature=0)
search_tool = TavilySearchResults(max_results=5)

async def generate_queries(state: ResearchState) -> dict:
    """Turn the research question into targeted search queries."""
    response = await llm.ainvoke([
        SystemMessage(content="Generate 3 specific search queries to research this topic. Return only the queries, one per line."),
        HumanMessage(content=state["question"])
    ])
    queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()]
    return {"search_queries": queries, "status": "searching"}

async def search_web(state: ResearchState) -> dict:
    """Execute searches and collect sources."""
    all_sources = []
    for query in state["search_queries"]:
        results = await search_tool.ainvoke({"query": query})
        for r in results:
            source = Source(
                url=r["url"],
                title=r.get("title", ""),
                summary=r["content"][:500],
                relevance_score=0.0  # scored in next step
            )
            all_sources.append(source)
    return {"sources": all_sources, "status": "analyzing"}

async def write_report(state: ResearchState) -> dict:
    """Synthesize sources into a structured report."""
    source_text = "\n\n".join(
        f"[{s.title}]({s.url})\n{s.summary}" for s in state["sources"]
    )
    response = await llm.ainvoke([
        SystemMessage(content="""Write a detailed research report based on these sources.
Structure: Executive Summary, Key Findings (numbered), Analysis, Conclusion.
Cite sources inline as [1], [2], etc."""),
        HumanMessage(content=f"Question: {state['question']}\n\nSources:\n{source_text}")
    ])
    return {"draft_report": response.content, "status": "reviewing"}

async def critique_report(state: ResearchState) -> dict:
    """Self-critique the draft for gaps and improvements."""
    response = await llm.ainvoke([
        SystemMessage(content="""Review this research report critically. Identify:
1. Factual gaps or unsupported claims
2. Missing perspectives
3. Areas needing more depth
Be specific and actionable. If the report is solid, say "APPROVED"."""),
        HumanMessage(content=state["draft_report"])
    ])
    return {
        "critique": response.content,
        "iteration": state["iteration"] + 1,
        "status": "critiqued"
    }

async def revise_report(state: ResearchState) -> dict:
    """Revise the report based on critique."""
    response = await llm.ainvoke([
        SystemMessage(content="Revise this report to address the critique. Maintain the same structure."),
        HumanMessage(content=f"Report:\n{state['draft_report']}\n\nCritique:\n{state['critique']}")
    ])
    return {"draft_report": response.content, "status": "revised"}

async def finalize(state: ResearchState) -> dict:
    return {"final_report": state["draft_report"], "status": "complete"}
```
Step 3: Wire the Graph
```python
from langgraph.graph import StateGraph, START, END

def route_after_critique(state: ResearchState) -> Literal["revise", "finalize"]:
    if "APPROVED" in state["critique"] or state["iteration"] >= 3:
        return "finalize"
    return "revise"

builder = StateGraph(ResearchState)

# Add nodes
builder.add_node("generate_queries", generate_queries)
builder.add_node("search_web", search_web)
builder.add_node("write_report", write_report)
builder.add_node("critique_report", critique_report)
builder.add_node("revise_report", revise_report)
builder.add_node("finalize", finalize)

# Add edges
builder.add_edge(START, "generate_queries")
builder.add_edge("generate_queries", "search_web")
builder.add_edge("search_web", "write_report")
builder.add_edge("write_report", "critique_report")
builder.add_conditional_edges(
    "critique_report",
    route_after_critique,
    {"revise": "revise_report", "finalize": "finalize"}  # map route names to node names
)
builder.add_edge("revise_report", "critique_report")  # loop back
builder.add_edge("finalize", END)

research_agent = builder.compile()
```
Run it:
```python
result = await research_agent.ainvoke({
    "question": "What are the most effective strategies for reducing LLM hallucinations in production systems?",
    "search_queries": [],
    "sources": [],
    "draft_report": "",
    "critique": "",
    "final_report": "",
    "iteration": 0,
    "status": "starting"
})

print(result["final_report"])
```
State Management and Persistence
In production, agents crash. Servers restart. Users close browsers. You need checkpointing.
LangGraph has built-in support for persisting state at every step via checkpointers:
```python
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"

async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
    await checkpointer.setup()  # creates tables on first run
    research_agent = builder.compile(checkpointer=checkpointer)

    # Every invocation now saves state after each node
    config = {"configurable": {"thread_id": "research-001"}}
    result = await research_agent.ainvoke(initial_state, config)
```
If the process dies mid-execution, restart with the same thread_id and it picks up exactly where it left off:
```python
# Resume from the last checkpoint
result = await research_agent.ainvoke(None, config)
```
Production tip: Use thread_id as your correlation ID across logging, tracing, and customer support. When a user reports a problem, you can replay the exact state transitions.
For high-throughput systems, the Postgres checkpointer supports connection pooling. For simpler setups, SqliteSaver works fine. MemorySaver is fine during development, but it lives in process memory — always switch to a durable store before deploying, especially in serverless environments where nothing survives between invocations.
Human-in-the-Loop Patterns
Fully autonomous agents are a liability in production. The most reliable pattern is human-on-the-loop: the agent runs autonomously but pauses at critical decision points.
LangGraph supports this natively with interrupt:
```python
from langgraph.types import interrupt, Command

async def write_report(state: ResearchState) -> dict:
    # ... generate the draft into draft_content as before ...

    # Pause and wait for human approval
    approval = interrupt({
        "question": "Review this draft report. Reply 'approved' or provide feedback.",
        "draft": draft_content
    })
    if approval.lower() != "approved":
        # Human provided feedback — use it as critique
        return {"draft_report": draft_content, "critique": approval, "status": "human_feedback"}
    return {"draft_report": draft_content, "status": "approved"}
```
On the calling side, you handle the interrupt:
```python
config = {"configurable": {"thread_id": "research-001"}}

# First invocation runs until the interrupt
result = await research_agent.ainvoke(initial_state, config)

# Agent is now paused. Show the draft to the user via your UI.
# When the user responds:
result = await research_agent.ainvoke(
    Command(resume="approved"),  # or resume="Add more detail about X"
    config
)
```
This pattern maps cleanly to web UIs (show a review screen), Slack bots (send a message and wait for reply), or email workflows.
Advanced pattern — tiered autonomy:
```python
def route_by_confidence(state: ResearchState) -> str:
    confidence = state.get("confidence_score", 0)
    if confidence > 0.9:
        return "auto_approve"      # agent proceeds
    elif confidence > 0.7:
        return "notify_human"      # agent proceeds but flags for review
    else:
        return "require_approval"  # agent pauses
```
This lets low-risk actions flow through while escalating uncertain ones — the sweet spot for production throughput.
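One detail worth verifying: because the router uses `state.get("confidence_score", 0)`, a missing score fails closed to the approval path, and the thresholds are exclusive. A quick self-contained check of the boundaries (plain `dict` stands in for the state type):

```python
def route_by_confidence(state: dict) -> str:
    confidence = state.get("confidence_score", 0)
    if confidence > 0.9:
        return "auto_approve"
    elif confidence > 0.7:
        return "notify_human"
    return "require_approval"

# Exactly 0.9 is NOT auto-approved; a missing score requires approval.
assert route_by_confidence({"confidence_score": 0.95}) == "auto_approve"
assert route_by_confidence({"confidence_score": 0.9}) == "notify_human"
assert route_by_confidence({}) == "require_approval"
```

Failing closed is the right default: an agent that can't prove its confidence should wait for a human.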
Tool Calling Best Practices
Tools are how agents interact with the real world. Get this wrong and you get agents that burn API credits, leak data, or take destructive actions.
Structured tool definitions
```python
from typing import Annotated
from langchain_core.tools import tool
from pydantic import Field

@tool
def search_knowledge_base(
    query: Annotated[str, Field(description="Natural language search query")],
    filters: Annotated[dict | None, Field(description="Optional metadata filters: {department: str, date_range: str}")] = None,
    max_results: Annotated[int, Field(ge=1, le=50, description="Number of results to return")] = 10,
) -> list[dict]:
    """Search the internal knowledge base for documents matching the query.

    Use this for company-specific information. For general web information, use web_search instead."""
    # implementation
    ...
```
Key practices:
- Rich descriptions matter more than you think. The LLM reads the docstring and field descriptions to decide when and how to call the tool. Vague descriptions lead to wrong tool selection.
- Constrain inputs. Use `ge`, `le`, enums, and Pydantic validators. An agent that can pass `max_results=10000` will eventually do it.
- Separate read and write tools. Never have a single `database_tool` that can both query and delete. Give the agent `db_query` and `db_delete` separately, and only bind `db_delete` when you've added human approval.
- Tool result formatting. Return structured data, not free text. The LLM processes structured results more reliably:
```python
@tool
def get_order_status(order_id: str) -> dict:
    """Look up the status of a customer order."""
    order = db.get_order(order_id)
    return {
        "order_id": order.id,
        "status": order.status,
        "items_count": len(order.items),
        "estimated_delivery": order.eta.isoformat(),
        "action_available": ["cancel"] if order.status == "processing" else []
    }
```
- Bind tools selectively per node. Not every node needs every tool:
```python
research_llm = llm.bind_tools([search_tool, scrape_tool])
writing_llm = llm.bind_tools([])  # no tools during writing
```
Error Handling and Retry Strategies
Production agents face three categories of failures:
1. Transient failures (API timeouts, rate limits)
Use LangGraph's built-in retry policy:
```python
from langgraph.pregel import RetryPolicy  # newer releases also expose this via langgraph.types
from openai import RateLimitError

builder.add_node(
    "search_web",
    search_web,
    retry=RetryPolicy(
        max_attempts=3,
        initial_interval=1.0,  # seconds
        backoff_factor=2.0,
        retry_on=(TimeoutError, RateLimitError)
    )
)
```
2. LLM failures (malformed output, hallucinated tool calls)
Wrap tool execution with validation:
```python
import asyncio
from langchain_core.messages import ToolMessage
from pydantic import ValidationError

async def safe_tool_executor(state: AgentState) -> dict:
    last_message = state["messages"][-1]
    outputs = []
    for tool_call in last_message.tool_calls:
        try:
            # Validate that the tool exists before executing
            tool = tool_map.get(tool_call["name"])
            if not tool:
                outputs.append(ToolMessage(
                    content=f"Tool '{tool_call['name']}' does not exist. Available: {list(tool_map.keys())}",
                    tool_call_id=tool_call["id"]
                ))
                continue
            # Execute with a per-call timeout
            result = await asyncio.wait_for(
                tool.ainvoke(tool_call["args"]),
                timeout=30.0
            )
            outputs.append(ToolMessage(content=str(result), tool_call_id=tool_call["id"]))
        except ValidationError as e:
            outputs.append(ToolMessage(
                content=f"Invalid arguments: {e}. Please fix and retry.",
                tool_call_id=tool_call["id"]
            ))
    return {"messages": outputs}
```
The agent sees the error message and self-corrects on the next iteration. This works surprisingly well — LLMs are good at fixing their own mistakes when given clear error messages.
3. Logical failures (infinite loops, stuck states)
Guard against these at the graph level:
```python
def route_after_critique(state: ResearchState) -> str:
    # Hard cap on iterations
    if state["iteration"] >= 3:
        return "finalize"
    # Detect a stuck state: same critique twice in a row
    if state.get("prev_critique") == state["critique"]:
        return "finalize"
    return "revise"
```
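For the stuck-state check to fire, something has to write `prev_critique` (add it to the state schema first). A minimal sketch of the partial update the critique node would return, shifting the old critique aside before storing the new one (`critique_update` is an illustrative helper, not part of the article's graph):

```python
def critique_update(state: dict, new_critique: str) -> dict:
    """Build the partial state update for a critique node."""
    return {
        "prev_critique": state.get("critique", ""),  # shift old critique aside
        "critique": new_critique,
        "iteration": state.get("iteration", 0) + 1,
    }

state = {"critique": "Needs more sources", "iteration": 1}
update = critique_update(state, "Needs more sources")
# prev_critique now equals critique, so the router above would finalize
print(update["prev_critique"] == update["critique"])  # True
```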
Also set a global timeout on the entire graph execution:
```python
result = await asyncio.wait_for(
    research_agent.ainvoke(initial_state, config),
    timeout=300.0  # 5 minute hard limit
)
```
Observability with LangSmith
You cannot operate what you cannot see. LangSmith is the observability layer for LangGraph — think Datadog for agent workflows.
Setup is two environment variables:
```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=lsv2_...
```
Every node execution, tool call, LLM invocation, and state transition is now traced automatically. No code changes required.
What to monitor in production:
# Custom metadata for filtering traces
```python
# Custom metadata for filtering traces
config = {
    "configurable": {"thread_id": "research-001"},
    "metadata": {
        "user_id": "u_12345",
        "environment": "production",
        "agent_version": "2.1.0"
    },
    "tags": ["research", "priority-high"]
}
```
Key metrics to track:
- Tokens per task: Set budgets. A research agent shouldn't exceed 50k tokens per run. Alert if it does.
- Iterations per completion: If your average is climbing, your prompts or critique logic are degrading.
- Tool call success rate: Below 95%? Your tool descriptions need work.
- Time to completion: Set SLOs. p50 under 30s, p99 under 120s.
- Human intervention rate: Track how often agents escalate. Trending up = model or prompt regression. Trending down = your agent is learning (or your thresholds are too loose).
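If you export completion times from your traces, the p50/p99 checks above can be computed with the standard library alone. A sketch (the sample latencies are made up for illustration):

```python
import statistics

def latency_percentiles(samples_s: list[float]) -> tuple[float, float]:
    """Return (p50, p99) from completion-time samples, in seconds."""
    cut_points = statistics.quantiles(samples_s, n=100)  # 99 cut points
    return statistics.median(samples_s), cut_points[98]  # index 98 is p99

samples = [12.0] * 90 + [45.0] * 9 + [180.0]  # one slow outlier
p50, p99 = latency_percentiles(samples)
print(f"p50={p50:.1f}s (SLO ok: {p50 <= 30}), p99={p99:.1f}s (SLO ok: {p99 <= 120})")
```

Note how a single slow run drags p99 far past the SLO while p50 stays flat, which is exactly why you track both.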
LangSmith also supports evaluation datasets — curated input/output pairs that you run nightly to catch regressions:
```python
from langsmith import Client

client = Client()

# Create a dataset of expected research outputs
dataset = client.create_dataset("research-agent-evals")
client.create_example(
    inputs={"question": "What is retrieval augmented generation?"},
    outputs={"expected_sections": ["Executive Summary", "Key Findings"]},
    dataset_id=dataset.id
)
```
LangGraph vs. CrewAI vs. AutoGen
The framework landscape has matured significantly. Here's when to use what:
| Aspect | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Architecture | Graph-based, explicit control flow | Role-based multi-agent | Conversation-based multi-agent |
| Best for | Complex workflows, production systems | Team simulation, parallel task delegation | Research, multi-agent debate |
| State management | Built-in, typed, persistent | Limited, via shared memory | Conversation history |
| Human-in-the-loop | First-class (`interrupt`) | Basic approval flows | Chat-based intervention |
| Observability | LangSmith native | Basic logging | AutoGen Studio |
| Learning curve | Moderate (graph concepts) | Low (intuitive role metaphor) | Low-moderate |
| Production readiness | High | Medium | Medium |
Choose LangGraph when:
- You need fine-grained control over execution flow
- Persistence and checkpointing are requirements
- You're building a single agent with complex routing
- You need production-grade observability
Choose CrewAI when:
- Your problem naturally decomposes into roles (researcher, writer, reviewer)
- You want rapid prototyping of multi-agent systems
- Team-based delegation is the core pattern
Choose AutoGen when:
- You're building conversational multi-agent systems
- Agents need to debate or negotiate
- Research and experimentation are the primary goals
Hybrid approach (what I recommend): Use LangGraph as the orchestration layer and implement individual "agents" within it as specialized nodes. You get the reliability of graph-based control flow with the flexibility to swap implementations.
Production Deployment Tips
1. Use LangGraph Platform for managed deployment
In `langgraph.json`:

```json
{
  "graphs": {
    "research_agent": "./agent.py:research_agent"
  },
  "dependencies": ["langchain-openai", "tavily-python"],
  "env": ".env"
}
```
```bash
langgraph dev     # local development server with hot reload
langgraph build   # Docker image for deployment
langgraph deploy  # deploy to LangGraph Cloud
```
The platform gives you a REST API, WebSocket streaming, cron triggers, and a built-in task queue — eliminating significant infrastructure work.
2. Streaming for UX
Never make users stare at a spinner. Stream intermediate state:
```python
async for event in research_agent.astream_events(initial_state, config, version="v2"):
    if event["event"] == "on_chat_model_stream":
        # Token-level streaming for the writing step
        print(event["data"]["chunk"].content, end="", flush=True)
    elif event["event"] == "on_chain_end":
        # Node completion events
        node_name = event.get("name", "")
        print(f"\n[Completed: {node_name}]")
```
3. Rate limiting and cost controls
```python
import tiktoken

class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def check(self, text: str) -> bool:
        tokens = len(self.encoder.encode(text))
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(f"Used {self.used}/{self.max_tokens} tokens")
        return True
```
Wire this into your LLM callbacks. When an agent hits its budget, force it to the finalize step with whatever it has.
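One way to wire the budget in without touching every node is to wrap the call itself. A dependency-free sketch (the `BudgetedCaller` name and the `len // 4` token estimate are illustrative; use real tiktoken counts in practice):

```python
class TokenBudgetExceeded(RuntimeError):
    pass

class BudgetedCaller:
    """Wrap any LLM-call function and enforce a per-run token budget."""

    def __init__(self, call_fn, max_tokens: int = 50_000):
        self.call_fn = call_fn
        self.max_tokens = max_tokens
        self.used = 0

    def __call__(self, prompt: str) -> str:
        estimate = len(prompt) // 4  # rough chars-per-token heuristic
        if self.used + estimate > self.max_tokens:
            raise TokenBudgetExceeded(f"{self.used + estimate}/{self.max_tokens}")
        self.used += estimate
        return self.call_fn(prompt)

caller = BudgetedCaller(lambda p: "ok", max_tokens=10)
print(caller("a" * 20))  # 5 estimated tokens: allowed
try:
    caller("a" * 100)    # 25 more would exceed the budget of 10
except TokenBudgetExceeded:
    print("budget hit: force the agent to the finalize step")
```

Catching the exception at the graph level is where you route to `finalize` with whatever partial result exists.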
4. Version your prompts
Never hardcode prompts in your node functions. Use a prompt registry:
```python
from langsmith import Client

client = Client()

# Pull versioned prompts from LangSmith Hub
system_prompt = client.pull_prompt("research-agent/critique:v3")
```
This lets you A/B test prompts, roll back bad deployments, and track which prompt version produced which outputs.
5. Graceful degradation
Build fallback paths into your graph:
```python
def route_search_results(state: ResearchState) -> str:
    if not state["sources"]:
        return "fallback_generate"  # LLM generates from its own knowledge
    if len(state["sources"]) < 3:
        return "search_again"       # try different queries
    return "write_report"           # proceed normally
```
An agent that returns a partial result is infinitely more useful than one that throws a 500.
Wrapping Up
The gap between an agent demo and a production agent is the same gap between a script and a service — error handling, observability, persistence, and operational controls.
LangGraph gives you the primitives to bridge that gap: typed state, persistent checkpoints, conditional routing, human-in-the-loop interrupts, and native observability. It's opinionated enough to prevent common mistakes but flexible enough to model real workflows.
Start with the simplest graph that solves your problem. Add checkpointing on day one — you'll thank yourself the first time a process crashes mid-run. Add human approval gates before any destructive action. Monitor token usage religiously. And version everything: prompts, tools, graph topology.
The agents that succeed in production aren't the cleverest ones — they're the most predictable ones.
If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more AI engineering content.