DEV Community

Young Gao
Building Production AI Agents with LangGraph: Beyond the Toy Examples (2026)

Every AI tutorial shows you a chatbot that answers questions. That's not an agent. An agent decides what to do, takes action, observes the result, and adapts. In production, it does all of that reliably, with audit trails, error recovery, and human oversight.

LangGraph — the graph-based orchestration layer from LangChain — has quietly become the framework of choice for teams shipping real agents. Uber routes support workflows through it. LinkedIn uses it for internal knowledge agents. Klarna runs customer-facing agents on it at scale.

This article is the guide I wish I had when I moved from prototype to production. We'll build a Research Assistant agent end-to-end, covering every pattern that matters when uptime counts.

When to Use Agents (and When Not To)

Before writing a single line of agent code, ask yourself: does this task require dynamic decision-making?

Use agents when:

  • The number of steps is unknown at design time
  • The task requires selecting from multiple tools based on context
  • Intermediate results change the execution path
  • You need autonomous error recovery

Don't use agents when:

  • A fixed pipeline (prompt → LLM → output) solves the problem
  • You can enumerate all paths in advance (use a simple chain)
  • Latency budget is under 2 seconds (agents loop; loops are slow)
  • The cost of a wrong autonomous action is high and you can't add human checkpoints

Agents add complexity. A well-designed chain with structured outputs will outperform a poorly-designed agent every time. Start with the simplest approach that works, then graduate to agents when you hit the wall.
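To make the contrast concrete, here is what a fixed pipeline looks like: three composed steps, no routing, no loop. The `fake_llm` stub below stands in for a real model call; everything here is illustrative.

```python
# A fixed prompt -> LLM -> output pipeline: no loops, no routing.
# `fake_llm` is a stand-in for a real chat-model client.

def fake_llm(prompt: str) -> str:
    return f"SUMMARY OF: {prompt}"

def build_prompt(ticket: str) -> str:
    return f"Summarize this support ticket in one sentence:\n{ticket}"

def parse_output(raw: str) -> dict:
    return {"summary": raw.strip()}

def summarize_ticket(ticket: str) -> dict:
    # Every run takes exactly these three steps: no agent needed.
    return parse_output(fake_llm(build_prompt(ticket)))

result = summarize_ticket("Customer cannot reset their password.")
```

If you can write your task this way, you don't need a graph, a reducer, or a checkpointer.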

LangGraph Core Concepts

LangGraph models agent logic as a directed graph where:

  • State is a typed dictionary that flows through the graph
  • Nodes are functions that read and write state
  • Edges connect nodes (static or conditional)
  • Conditional edges inspect state and route to different nodes

Here's the minimal mental model:

from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
from operator import add

class AgentState(TypedDict):
    messages: Annotated[list, add]  # append-only message list
    step_count: int

def process(state: AgentState) -> dict:
    return {"messages": ["processed"], "step_count": state["step_count"] + 1}

def should_continue(state: AgentState) -> str:
    return "end" if state["step_count"] >= 3 else "process"

graph = StateGraph(AgentState)
graph.add_node("process", process)
graph.add_edge(START, "process")
graph.add_conditional_edges("process", should_continue, {"process": "process", "end": END})

app = graph.compile()
result = app.invoke({"messages": [], "step_count": 0})

The Annotated[list, add] is critical — it tells LangGraph to merge list returns instead of overwriting. Without it, each node would clobber the previous messages.
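Reducers aren't limited to operator.add: any function that takes (existing, new) and returns the merged value works. A sketch of a hypothetical custom reducer that also de-duplicates on merge:

```python
from typing import Annotated, TypedDict

def merge_unique(existing: list, new: list) -> list:
    """Reducer: append new items, skipping any already present in state."""
    seen = set(existing)
    return existing + [item for item in new if item not in seen]

class DedupState(TypedDict):
    # LangGraph would call merge_unique whenever a node returns this key
    messages: Annotated[list, merge_unique]

# The reducer is an ordinary function, so it's easy to exercise standalone:
merged = merge_unique(["a", "b"], ["b", "c"])
```

This is handy when parallel branches can return overlapping results and you want state to stay clean without a separate dedup node.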

Building the Research Assistant

Let's build something real: an agent that takes a research question, searches the web, reads and summarizes relevant pages, and produces a structured report. This is the kind of agent companies actually deploy.

Step 1: Define the State

from typing import TypedDict, Annotated, Literal
from operator import add
from pydantic import BaseModel

class Source(BaseModel):
    url: str
    title: str
    summary: str
    relevance_score: float

class ResearchState(TypedDict):
    question: str
    search_queries: list[str]
    sources: Annotated[list[Source], add]
    draft_report: str
    critique: str
    final_report: str
    iteration: int
    status: str

I'm using Pydantic models for Source — this gives you validation and serialization for free, which matters when you're persisting state to a database.
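That claim is easy to demonstrate: with pydantic v2, a Source round-trips through JSON with validation running on the way back in, which is exactly what persisting state requires. A minimal sketch:

```python
from pydantic import BaseModel

class Source(BaseModel):
    url: str
    title: str
    summary: str
    relevance_score: float

src = Source(url="https://example.com", title="Example",
             summary="A test source.", relevance_score=0.8)

# Serialize for storage, then rebuild; validation runs on the way in.
payload = src.model_dump_json()
restored = Source.model_validate_json(payload)
```

A plain dict would round-trip too, but it wouldn't catch a missing field or a string where a float belongs.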

Step 2: Define the Nodes

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_community.tools.tavily_search import TavilySearchResults

llm = ChatOpenAI(model="gpt-4o", temperature=0)
search_tool = TavilySearchResults(max_results=5)

async def generate_queries(state: ResearchState) -> dict:
    """Turn the research question into targeted search queries."""
    response = await llm.ainvoke([
        SystemMessage(content="Generate 3 specific search queries to research this topic. Return only the queries, one per line."),
        HumanMessage(content=state["question"])
    ])
    queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()]
    return {"search_queries": queries, "status": "searching"}

async def search_web(state: ResearchState) -> dict:
    """Execute searches and collect sources."""
    all_sources = []
    for query in state["search_queries"]:
        results = await search_tool.ainvoke({"query": query})
        for r in results:
            source = Source(
                url=r["url"],
                title=r.get("title", ""),
                summary=r["content"][:500],
                relevance_score=0.0  # placeholder; a scoring node could fill this in
            )
            all_sources.append(source)
    return {"sources": all_sources, "status": "analyzing"}

async def write_report(state: ResearchState) -> dict:
    """Synthesize sources into a structured report."""
    source_text = "\n\n".join(
        f"[{s.title}]({s.url})\n{s.summary}" for s in state["sources"]
    )
    response = await llm.ainvoke([
        SystemMessage(content="""Write a detailed research report based on these sources.
Structure: Executive Summary, Key Findings (numbered), Analysis, Conclusion.
Cite sources inline as [1], [2], etc."""),
        HumanMessage(content=f"Question: {state['question']}\n\nSources:\n{source_text}")
    ])
    return {"draft_report": response.content, "status": "reviewing"}

async def critique_report(state: ResearchState) -> dict:
    """Self-critique the draft for gaps and improvements."""
    response = await llm.ainvoke([
        SystemMessage(content="""Review this research report critically. Identify:
1. Factual gaps or unsupported claims
2. Missing perspectives
3. Areas needing more depth
Be specific and actionable. If the report is solid, say "APPROVED"."""),
        HumanMessage(content=state["draft_report"])
    ])
    return {
        "critique": response.content,
        "iteration": state["iteration"] + 1,
        "status": "critiqued"
    }

async def revise_report(state: ResearchState) -> dict:
    """Revise the report based on critique."""
    response = await llm.ainvoke([
        SystemMessage(content="Revise this report to address the critique. Maintain the same structure."),
        HumanMessage(content=f"Report:\n{state['draft_report']}\n\nCritique:\n{state['critique']}")
    ])
    return {"draft_report": response.content, "status": "revised"}

async def finalize(state: ResearchState) -> dict:
    return {"final_report": state["draft_report"], "status": "complete"}

Step 3: Wire the Graph

from langgraph.graph import StateGraph, START, END

def route_after_critique(state: ResearchState) -> Literal["revise_report", "finalize"]:
    # Return values must match node names, since no path map is passed below
    if "APPROVED" in state["critique"] or state["iteration"] >= 3:
        return "finalize"
    return "revise_report"

builder = StateGraph(ResearchState)

# Add nodes
builder.add_node("generate_queries", generate_queries)
builder.add_node("search_web", search_web)
builder.add_node("write_report", write_report)
builder.add_node("critique_report", critique_report)
builder.add_node("revise_report", revise_report)
builder.add_node("finalize", finalize)

# Add edges
builder.add_edge(START, "generate_queries")
builder.add_edge("generate_queries", "search_web")
builder.add_edge("search_web", "write_report")
builder.add_edge("write_report", "critique_report")
builder.add_conditional_edges("critique_report", route_after_critique)
builder.add_edge("revise_report", "critique_report")  # loop back
builder.add_edge("finalize", END)

research_agent = builder.compile()

Run it:

result = await research_agent.ainvoke({
    "question": "What are the most effective strategies for reducing LLM hallucinations in production systems?",
    "search_queries": [],
    "sources": [],
    "draft_report": "",
    "critique": "",
    "final_report": "",
    "iteration": 0,
    "status": "starting"
})
print(result["final_report"])

State Management and Persistence

In production, agents crash. Servers restart. Users close browsers. You need checkpointing.

LangGraph has built-in support for persisting state at every step via checkpointers:

from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"

async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
    await checkpointer.setup()  # creates tables on first run

    research_agent = builder.compile(checkpointer=checkpointer)

    # Every invocation now saves state after each node
    config = {"configurable": {"thread_id": "research-001"}}
    result = await research_agent.ainvoke(initial_state, config)

If the process dies mid-execution, restart with the same thread_id and it picks up exactly where it left off:

# Resume from last checkpoint
result = await research_agent.ainvoke(None, config)

Production tip: Use thread_id as your correlation ID across logging, tracing, and customer support. When a user reports a problem, you can replay the exact state transitions.

For high-throughput systems, the Postgres checkpointer supports connection pooling. For simpler setups, SqliteSaver works fine. MemorySaver is convenient during development, but always switch to a durable store before deploying; in serverless environments especially, instance memory disappears between invocations.

Human-in-the-Loop Patterns

Fully autonomous agents are a liability in production. The most reliable pattern is human-on-the-loop: the agent runs autonomously but pauses at critical decision points.

LangGraph supports this natively with interrupt:

from langgraph.types import interrupt, Command

async def write_report(state: ResearchState) -> dict:
    draft_content = ...  # generate the draft as in the earlier version

    # Pause and wait for human approval
    approval = interrupt({
        "question": "Review this draft report. Reply 'approved' or provide feedback.",
        "draft": draft_content
    })

    if approval.lower() != "approved":
        # Human provided feedback — use it as critique
        return {"draft_report": draft_content, "critique": approval, "status": "human_feedback"}

    return {"draft_report": draft_content, "status": "approved"}

On the calling side, you handle the interrupt:

config = {"configurable": {"thread_id": "research-001"}}

# First invocation runs until interrupt
result = await research_agent.ainvoke(initial_state, config)

# Agent is now paused. Show draft to user via your UI.
# When user responds:
result = await research_agent.ainvoke(
    Command(resume="approved"),  # or resume="Add more detail about X"
    config
)

This pattern maps cleanly to web UIs (show a review screen), Slack bots (send a message and wait for reply), or email workflows.

Advanced pattern — tiered autonomy:

def route_by_confidence(state: ResearchState) -> str:
    confidence = state.get("confidence_score", 0)
    if confidence > 0.9:
        return "auto_approve"     # agent proceeds
    elif confidence > 0.7:
        return "notify_human"     # agent proceeds but flags for review
    else:
        return "require_approval" # agent pauses

This lets low-risk actions flow through while escalating uncertain ones — the sweet spot for production throughput.

Tool Calling Best Practices

Tools are how agents interact with the real world. Get this wrong and you get agents that burn API credits, leak data, or take destructive actions.

Structured tool definitions

from typing import Annotated
from langchain_core.tools import tool
from pydantic import Field

@tool
def search_knowledge_base(
    query: Annotated[str, Field(description="Natural language search query")],
    filters: Annotated[dict | None, Field(description="Optional metadata filters: {department: str, date_range: str}")] = None,
    max_results: Annotated[int, Field(ge=1, le=50, description="Number of results to return")] = 10,
) -> list[dict]:
    """Search the internal knowledge base for documents matching the query.
    Use this for company-specific information. For general web information, use web_search instead."""
    # implementation
    ...

Key practices:

  1. Rich descriptions matter more than you think. The LLM reads the docstring and field descriptions to decide when and how to call the tool. Vague descriptions lead to wrong tool selection.

  2. Constrain inputs. Use ge, le, enums, and Pydantic validators. An agent that can pass max_results=10000 will eventually do it.

  3. Separate read and write tools. Never have a single database_tool that can both query and delete. Give the agent db_query and db_delete separately, and only bind db_delete when you've added human approval.

  4. Tool result formatting. Return structured data, not free text. The LLM processes structured results more reliably:

@tool
def get_order_status(order_id: str) -> dict:
    """Look up the status of a customer order."""
    order = db.get_order(order_id)
    return {
        "order_id": order.id,
        "status": order.status,
        "items_count": len(order.items),
        "estimated_delivery": order.eta.isoformat(),
        "action_available": ["cancel"] if order.status == "processing" else []
    }
  5. Bind tools selectively per node. Not every node needs every tool:
research_llm = llm.bind_tools([search_tool, scrape_tool])
writing_llm = llm.bind_tools([])  # no tools during writing
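Practice 2 deserves a concrete example. Here is a sketch of a standalone args schema with hard bounds (a hypothetical SearchArgs, assuming pydantic v2), so an out-of-range call fails before the tool body ever runs:

```python
from pydantic import BaseModel, Field, ValidationError

class SearchArgs(BaseModel):
    query: str = Field(min_length=3, max_length=500)
    max_results: int = Field(default=10, ge=1, le=50)

# In-range arguments validate cleanly...
ok = SearchArgs(query="reduce LLM hallucinations", max_results=25)

# ...while an agent passing max_results=10000 gets a structured error
# it can read and correct, instead of hammering your backend.
try:
    SearchArgs(query="everything", max_results=10_000)
    blocked = False
except ValidationError:
    blocked = True
```

The validation error text itself becomes the tool's response to the LLM, which is usually enough for it to retry with sane arguments.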

Error Handling and Retry Strategies

Production agents face three categories of failures:

1. Transient failures (API timeouts, rate limits)

Use LangGraph's built-in retry policy:

from langgraph.pregel import RetryPolicy
from openai import RateLimitError  # or your provider's equivalent

builder.add_node(
    "search_web",
    search_web,
    retry=RetryPolicy(
        max_attempts=3,
        initial_interval=1.0,  # seconds before the first retry
        backoff_factor=2.0,    # doubles the wait on each attempt
        retry_on=(TimeoutError, RateLimitError)
    )
)

2. LLM failures (malformed output, hallucinated tool calls)

Wrap tool execution with validation:

import asyncio
from langchain_core.messages import ToolMessage
from pydantic import ValidationError

async def safe_tool_executor(state: AgentState) -> dict:
    last_message = state["messages"][-1]
    results = []

    for tool_call in last_message.tool_calls:
        try:
            # Validate the tool exists before executing
            tool = tool_map.get(tool_call["name"])
            if not tool:
                results.append(ToolMessage(
                    content=f"Tool '{tool_call['name']}' does not exist. Available: {list(tool_map.keys())}",
                    tool_call_id=tool_call["id"]
                ))
                continue

            # Execute with a timeout so a hung tool can't stall the graph
            result = await asyncio.wait_for(
                tool.ainvoke(tool_call["args"]),
                timeout=30.0
            )
            results.append(ToolMessage(content=str(result), tool_call_id=tool_call["id"]))

        except ValidationError as e:
            results.append(ToolMessage(
                content=f"Invalid arguments: {e}. Please fix and retry.",
                tool_call_id=tool_call["id"]
            ))
        except asyncio.TimeoutError:
            results.append(ToolMessage(
                content=f"Tool '{tool_call['name']}' timed out after 30s.",
                tool_call_id=tool_call["id"]
            ))

    return {"messages": results}

The agent sees the error message and self-corrects on the next iteration. This works surprisingly well — LLMs are good at fixing their own mistakes when given clear error messages.

3. Logical failures (infinite loops, stuck states)

Guard against these at the graph level:

def route_after_critique(state: ResearchState) -> str:
    # Hard cap on iterations
    if state["iteration"] >= 3:
        return "finalize"

    # Detect a stuck state: same critique twice in a row
    # (assumes a prev_critique field added to ResearchState, set on each revision)
    if state.get("prev_critique") == state["critique"]:
        return "finalize"

    return "revise"

Also set a global timeout on the entire graph execution:

result = await asyncio.wait_for(
    research_agent.ainvoke(initial_state, config),
    timeout=300.0  # 5 minute hard limit
)

Observability with LangSmith

You cannot operate what you cannot see. LangSmith is the observability layer for LangGraph — think Datadog for agent workflows.

Setup is two environment variables:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=lsv2_...

Every node execution, tool call, LLM invocation, and state transition is now traced automatically. No code changes required.

What to monitor in production:

# Custom metadata for filtering traces
config = {
    "configurable": {"thread_id": "research-001"},
    "metadata": {
        "user_id": "u_12345",
        "environment": "production",
        "agent_version": "2.1.0"
    },
    "tags": ["research", "priority-high"]
}

Key metrics to track:

  • Tokens per task: Set budgets. A research agent shouldn't exceed 50k tokens per run. Alert if it does.
  • Iterations per completion: If your average is climbing, your prompts or critique logic are degrading.
  • Tool call success rate: Below 95%? Your tool descriptions need work.
  • Time to completion: Set SLOs. p50 under 30s, p99 under 120s.
  • Human intervention rate: Track how often agents escalate. Trending up = model or prompt regression. Trending down = your agent is learning (or your thresholds are too loose).

LangSmith also supports evaluation datasets — curated input/output pairs that you run nightly to catch regressions:

from langsmith import Client

client = Client()

# Create a dataset of expected research outputs
dataset = client.create_dataset("research-agent-evals")
client.create_example(
    inputs={"question": "What is retrieval augmented generation?"},
    outputs={"expected_sections": ["Executive Summary", "Key Findings"]},
    dataset_id=dataset.id
)

LangGraph vs. CrewAI vs. AutoGen

The framework landscape has matured significantly. Here's when to use what:

| Aspect | LangGraph | CrewAI | AutoGen |
| --- | --- | --- | --- |
| Architecture | Graph-based, explicit control flow | Role-based multi-agent | Conversation-based multi-agent |
| Best for | Complex workflows, production systems | Team simulation, parallel task delegation | Research, multi-agent debate |
| State management | Built-in, typed, persistent | Limited, via shared memory | Conversation history |
| Human-in-the-loop | First-class (interrupt) | Basic approval flows | Chat-based intervention |
| Observability | LangSmith native | Basic logging | AutoGen Studio |
| Learning curve | Moderate (graph concepts) | Low (intuitive role metaphor) | Low-moderate |
| Production readiness | High | Medium | Medium |

Choose LangGraph when:

  • You need fine-grained control over execution flow
  • Persistence and checkpointing are requirements
  • You're building a single agent with complex routing
  • You need production-grade observability

Choose CrewAI when:

  • Your problem naturally decomposes into roles (researcher, writer, reviewer)
  • You want rapid prototyping of multi-agent systems
  • Team-based delegation is the core pattern

Choose AutoGen when:

  • You're building conversational multi-agent systems
  • Agents need to debate or negotiate
  • Research and experimentation are the primary goals

Hybrid approach (what I recommend): Use LangGraph as the orchestration layer and implement individual "agents" within it as specialized nodes. You get the reliability of graph-based control flow with the flexibility to swap implementations.
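Here is roughly what that hybrid shape looks like: each "agent" is just a node that wraps a role prompt around the same model call. Everything below (ROLE_PROMPTS, call_model) is an illustrative stub, not a LangGraph API:

```python
# Role-specialized nodes operating on one shared state dict.
# `call_model` is a stub standing in for a real LLM invocation.

ROLE_PROMPTS = {
    "researcher": "Gather facts relevant to the question.",
    "writer": "Draft a report from the gathered facts.",
    "reviewer": "Critique the draft for gaps.",
}

def call_model(system_prompt: str, content: str) -> str:
    return f"[{system_prompt}] {content}"

def make_role_node(role: str):
    """Factory: returns a node function specialized to one role."""
    def node(state: dict) -> dict:
        output = call_model(ROLE_PROMPTS[role], state["input"])
        return {"outputs": state.get("outputs", {}) | {role: output}}
    return node

# Swapping an implementation means swapping one node, not the whole graph.
state = {"input": "What is RAG?", "outputs": {}}
for role in ROLE_PROMPTS:
    state["outputs"].update(make_role_node(role)(state)["outputs"])
```

In a real build, each factory-produced node would be registered with add_node and routed by conditional edges, while the role prompt and model behind it stay swappable.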

Production Deployment Tips

1. Use LangGraph Platform for managed deployment

// langgraph.json
{
    "graphs": {
        "research_agent": "./agent.py:research_agent"
    },
    "dependencies": ["langchain-openai", "tavily-python"],
    "env": ".env"
}
langgraph dev     # local development server with hot reload
langgraph build   # Docker image for deployment
langgraph deploy  # deploy to LangGraph Cloud

The platform gives you a REST API, WebSocket streaming, cron triggers, and a built-in task queue — eliminating significant infrastructure work.

2. Streaming for UX

Never make users stare at a spinner. Stream intermediate state:

async for event in research_agent.astream_events(initial_state, config, version="v2"):
    if event["event"] == "on_chat_model_stream":
        # Token-level streaming for the writing step
        print(event["data"]["chunk"].content, end="", flush=True)
    elif event["event"] == "on_chain_end":
        # Node completion events
        node_name = event.get("name", "")
        print(f"\n[Completed: {node_name}]")

3. Rate limiting and cost controls

import tiktoken

class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def check(self, text: str) -> bool:
        tokens = len(self.encoder.encode(text))
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(f"Used {self.used}/{self.max_tokens} tokens")
        return True

Wire this into your LLM callbacks. When an agent hits its budget, force it to the finalize step with whatever it has.
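One way to do that wiring without touching callback internals is to wrap the model behind the budget check, so every invocation passes through it. The token counter and model below are stubs, purely illustrative:

```python
class TokenBudgetExceeded(Exception):
    pass

def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: whitespace split is a rough proxy
    return len(text.split())

class BudgetedLLM:
    def __init__(self, llm, max_tokens: int):
        self.llm = llm          # any callable: prompt -> completion
        self.max_tokens = max_tokens
        self.used = 0

    def invoke(self, prompt: str) -> str:
        self.used += count_tokens(prompt)
        if self.used > self.max_tokens:
            # Signal the graph to route to finalize with what it has
            raise TokenBudgetExceeded(f"{self.used}/{self.max_tokens}")
        completion = self.llm(prompt)
        self.used += count_tokens(completion)
        return completion

llm = BudgetedLLM(lambda p: "ok", max_tokens=5)
first = llm.invoke("one two three")     # 3 + 1 tokens, within budget
try:
    llm.invoke("four five six seven")   # pushes the total past 5
    exceeded = False
except TokenBudgetExceeded:
    exceeded = True
```

Catching TokenBudgetExceeded in a conditional edge and routing to finalize gives you a hard cost ceiling per run.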

4. Version your prompts

Never hardcode prompts in your node functions. Use a prompt registry:

from langsmith import Client

client = Client()

# Pull versioned prompts from LangSmith Hub
system_prompt = client.pull_prompt("research-agent/critique:v3")

This lets you A/B test prompts, roll back bad deployments, and track which prompt version produced which outputs.

5. Graceful degradation

Build fallback paths into your graph:

def route_search_results(state: ResearchState) -> str:
    if not state["sources"]:
        return "fallback_generate"  # LLM generates from knowledge
    if len(state["sources"]) < 3:
        return "search_again"       # try different queries
    return "write_report"           # proceed normally

An agent that returns a partial result is infinitely more useful than one that throws a 500.

Wrapping Up

The gap between an agent demo and a production agent is the same gap between a script and a service — error handling, observability, persistence, and operational controls.

LangGraph gives you the primitives to bridge that gap: typed state, persistent checkpoints, conditional routing, human-in-the-loop interrupts, and native observability. It's opinionated enough to prevent common mistakes but flexible enough to model real workflows.

Start with the simplest graph that solves your problem. Add checkpointing on day one — you'll thank yourself the first time a process crashes mid-run. Add human approval gates before any destructive action. Monitor token usage religiously. And version everything: prompts, tools, graph topology.

The agents that succeed in production aren't the cleverest ones — they're the most predictable ones.


If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more AI engineering content.

