Most LangGraph agent examples stop at "hello world." A basic planner, an executor that calls a search tool, and a printed final result. That is fine for a demo, but the moment you try to ship an autonomous agent in production, you immediately run into the same four problems:
- How do I prevent an agent from silently hallucinating missing details between steps?
- How do I build a self-correcting loop that catches bad tool outputs before they poison the final answer?
- How do I optimize costs so I'm not using expensive reasoning models for simple bookkeeping?
- How do I safely integrate this orchestration layer with enterprise data and capabilities?
I spent time solving all of these in a real production environment—a regulated life sciences platform automating scientific workflows over tens of millions of research records—and packaged the result as a template anyone can fork: langgraph-plan-execute-validate.
This post explains the decisions behind each piece.
What is PEV and Why Does it Need a Production Template?
The standard LangGraph Plan-and-Execute pattern has two nodes: Plan and Execute. The gap between that two-node quickstart and a deployable, reliable service is significant.
In production, execution quality is not binary. An agent can technically complete a step while producing output that is incomplete, hallucinated, or missing a critical detail. Without a quality gate, those failures propagate silently to the next step.
You need:
- A structured Validator to score outputs
- A deterministic Router to handle retries and replanning
- An Audit Trail to debug what actually happened
- A cost-effective multi-model strategy
The template gives you all of this wired together and working out of the box.
| Feature | Standard plan-execute | This template |
|---|---|---|
| Planning node | ✓ | ✓ |
| Execution node with tool calls | ✓ | ✓ |
| Validation + confidence score | ✗ | ✓ 0.0 – 1.0 |
| Per-step retry with feedback injection | ✗ | ✓ configurable |
| Automatic replanning on exhausted retries | ✗ | ✓ with failure context |
| Multi-model cost optimization | ✗ | ✓ haiku/sonnet split |
| Full audit trail (every attempt) | ✗ | ✓ operator.add accumulator |
| Structured outputs (no string parsing) | ✗ | ✓ Pydantic models |
The Architecture
The graph adds two nodes to the standard pattern: a Validator that scores every step output, and a Router (pure Python, no LLM) that decides what happens next.
START
│
planner ◄──────────────────────────────────────── (replan)
│
executor ◄─────────────────────────────── (retry)
│
validator
│
router ─┬─ score ≥ threshold, more steps ───► executor (next step)
        ├─ score ≥ threshold, last step ────► END (complete)
        ├─ score < threshold, retries left ─► executor (retry)
        ├─ score < threshold, retries gone ─► planner (replan)
        └─ all limits exhausted ────────────► END (failed)
*(System overview and state machine diagrams: see docs/architecture.md.)*
Problem 1: Silent Failures in 2-Node Workflows
The naive approach passes the Executor's output directly to the next step. If Step 1 returns "I couldn't find the data," the Executor often moves on to Step 2 anyway, hallucinating the missing context.
The template solves this by introducing a third node: a structured Validator that scores every step output (0.0–1.0) against the original intent.
cfg = PEVConfig(
# Quality gate
pass_threshold = 0.80, # score >= this -> step passes
# Loop guards
max_retries = 2, # retries per step before escalating to replan
max_replans = 1, # full replanning cycles before marking failed
)
If the score is below the pass_threshold, the system doesn't move forward—it recovers. The validator produces a structured output with both a numeric score and a one-sentence explanation, which gets injected into the next retry's prompt.
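The validator's contract can be sketched as a small structured model. This is an illustration using a stdlib dataclass (the template itself uses Pydantic); the `score` and `feedback` fields mirror the audit-trail examples later in the post, and the clamping rule is an assumption about how out-of-range scores are handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValidationResult:
    """Structured validator output: a numeric score plus one sentence of feedback."""
    score: float   # confidence in [0.0, 1.0]
    feedback: str  # injected into the next retry's prompt on failure

    def __post_init__(self):
        # Clamp rather than reject: LLMs occasionally emit 1.2 or -0.1
        object.__setattr__(self, "score", max(0.0, min(1.0, self.score)))

    def passes(self, threshold: float = 0.80) -> bool:
        return self.score >= threshold

r = ValidationResult(score=0.55, feedback="Missing Y; result was too generic")
print(r.passes())  # False -> the router triggers a retry carrying this feedback
```

Because the output is a typed object rather than free text, the router never has to parse strings to decide what to do next.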
Problem 2: Automated Retries and Replanning
When an execution fails validation, throwing an exception or failing the entire run is brittle. The LLM needs a chance to correct itself.
The template uses a pure Python Router node that implements deterministic recovery logic. When a step scores below the threshold, the Router injects the Validator's feedback directly into the next prompt and triggers a retry:
# The Router decides how to recover — no LLM involved
if score >= cfg.pass_threshold:
next_idx = idx + 1
if next_idx >= len(plan):
return {"status": "complete", "_next": "complete"}
return {
"current_step_idx": next_idx,
"retry_count": 0,
"status": "executing",
"_next": "execute",
}
if retry_count < cfg.max_retries:
return {
"retry_count": retry_count + 1,
"status": "executing",
"_next": "retry",
}
if replan_count < cfg.max_replans:
return {"status": "planning", "_next": "replan"}
# All recovery options exhausted
return {
"status": "failed",
"error": f"Step '{plan[idx]}' failed after {retry_count} retries and {replan_count} replans.",
"_next": "failed",
}
Design note: LangGraph conditional edges are read-only—they can't update state. The Router is a real node so it can both update `current_step_idx`/`retry_count` and write the routing decision to `state["_next"]`. A single `_dispatch` conditional edge then reads that field to pick the next node.
If max_retries is exhausted, the Router escalates to the Planner, which regenerates the entire remaining plan with the failure context. Nothing moves forward until it passes the quality gate.
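The excerpt above can be packaged as a self-contained pure function and exercised without any LLM calls; the signature below is an assumption made for illustration, not the template's exact API, but the branch logic matches the snippet.

```python
def route(score: float, idx: int, plan: list[str],
          retry_count: int, replan_count: int,
          pass_threshold: float = 0.80,
          max_retries: int = 2, max_replans: int = 1) -> dict:
    """Deterministic recovery: pass -> advance, fail -> retry, then replan, then give up."""
    if score >= pass_threshold:
        next_idx = idx + 1
        if next_idx >= len(plan):
            return {"status": "complete", "_next": "complete"}
        return {"current_step_idx": next_idx, "retry_count": 0,
                "status": "executing", "_next": "execute"}
    if retry_count < max_retries:
        return {"retry_count": retry_count + 1, "status": "executing", "_next": "retry"}
    if replan_count < max_replans:
        return {"status": "planning", "_next": "replan"}
    # All recovery options exhausted
    return {"status": "failed", "_next": "failed",
            "error": f"Step '{plan[idx]}' failed after {retry_count} retries "
                     f"and {replan_count} replans."}

print(route(0.9, 0, ["search", "summarise"], 0, 0)["_next"])  # execute
```

Because the function is pure, every branch of the decision tree is trivially unit-testable, which is exactly what makes the recovery behavior auditable.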
*(Request lifecycle diagram: see docs/architecture.md.)*
Problem 3: Multi-Model Cost Optimization
Routing every trivial check and structured output through your most capable (and expensive) model destroys the ROI of autonomous agents.
The template uses a 3-model routing strategy initialized via the configuration:
cfg = PEVConfig(
# Cheap model for structured JSON planning
planner_model = "claude-haiku-4-5-20251001",
# Capable model for complex tool calls + reasoning
executor_model = "claude-sonnet-4-6",
# Cheap model for scoring + generating one sentence of feedback
validator_model = "claude-haiku-4-5-20251001",
)
The planner and validator only produce structured JSON—a cheaper model handles this perfectly. The executor is where reasoning and tool use happen, so we invest in the capable model there.
Cheap (claude-haiku ~$0.25 / 1M tokens)
├── Planner — JSON output only
└── Validator — score + one sentence
Capable (claude-sonnet ~$3 / 1M tokens)
└── Executor — tool calls + reasoning
This design cuts per-run costs by ~60–70% while maintaining high final quality. A typical 3-step task costs ~$0.01 vs ~$0.027 if you route everything through sonnet.
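The savings claim is easy to sanity-check with back-of-the-envelope arithmetic. The token split below is an assumption for illustration; the per-million rates are the approximate input prices quoted above.

```python
CHEAP_RATE = 0.25 / 1_000_000    # approx haiku $/token, as quoted above
CAPABLE_RATE = 3.00 / 1_000_000  # approx sonnet $/token

def run_cost(cheap_tokens: int, capable_tokens: int) -> float:
    """Blended cost of one run under the multi-model split."""
    return cheap_tokens * CHEAP_RATE + capable_tokens * CAPABLE_RATE

# Assumed split for a 3-step task: planner + three validator calls on the
# cheap model, three executor calls (tools + reasoning) on the capable one.
cheap, capable = 7_000, 3_000
split = run_cost(cheap, capable)
all_sonnet = (cheap + capable) * CAPABLE_RATE
savings = 1 - split / all_sonnet
print(f"${split:.4f} vs ${all_sonnet:.4f} -> {savings:.0%} saved")
```

Under these assumed token counts the split run lands near $0.011 against $0.030 for sonnet-only, in the same ballpark as the figures above; the exact savings depend on how much of your token volume the executor consumes.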
Problem 4: MCP Integration for Enterprise Capabilities
Reliability doesn't stop at orchestration; it extends to your capabilities. Standard tool calling often means hardcoding database credentials or API integrations directly into the agent code.
The template is designed to pair with the FastMCP Production Template. By separating the Orchestrator (PEV) from the Tools (MCP), you can expose enterprise databases and internal APIs securely via a standalone FastMCP server, and pull them into the PEV loop with just a few lines:
from langchain_mcp_adapters.client import MultiServerMCPClient
from pev import create_pev_graph, PEVConfig
# Connect to an MCP server (e.g., built with fastmcp-production-template)
client = MultiServerMCPClient(
    {"enterprise": {"command": "uv", "args": ["run", "mcp_server.py"], "transport": "stdio"}}
)
mcp_tools = await client.get_tools()
# Orchestrate with PEV's quality gates
graph = create_pev_graph(PEVConfig(tools=mcp_tools, pass_threshold=0.85))
This is the Full-Stack AI architecture: PEV acts as the Brain (reasoning, planning, validation) and MCP acts as the Hands (standardized, secure access to enterprise data). You can update tools and data integrations without redeploying your core agent logic.
The Audit Trail: Observability at the Step Level
When something goes wrong with an AI-driven workflow, you need to know exactly where the agent struggled and how many attempts it took.
The template preserves every attempt in step_results via operator.add. Nothing is ever overwritten:
result["step_results"]
# [
# StepResult(step="Search for X", score=0.55, attempts=1, feedback="Missing Y — result was too generic"),
# StepResult(step="Search for X", score=0.88, attempts=2, feedback="Good. All required details present."),
# StepResult(step="Summarise X", score=0.92, attempts=1, feedback="Complete and well-structured."),
# ]
This is the operational signal that matters. It proves the system is catching failures and self-correcting before returning data to the user. You can see exactly where the agent struggled, what feedback it received, and how many attempts each step took.
The state is defined with Annotated[list[StepResult], operator.add] so LangGraph handles the accumulation automatically:
class PEVState(TypedDict):
# operator.add means append-only — nothing is ever overwritten
step_results: Annotated[list[StepResult], operator.add]
...
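Outside of LangGraph, the merge semantics of that annotation are just `operator.add` applied to the old and new channel values, which for lists means concatenation. A minimal stdlib illustration (using `str` entries in place of `StepResult` for brevity):

```python
import operator
from typing import Annotated, TypedDict

class State(TypedDict):
    # LangGraph reads the Annotated metadata and uses operator.add as the reducer
    step_results: Annotated[list[str], operator.add]

# What the reducer does on each node update: append, never overwrite
old = ["attempt 1: score 0.55"]
update = ["attempt 2: score 0.88"]
merged = operator.add(old, update)
print(merged)  # both attempts survive in order
assert old == ["attempt 1: score 0.55"]  # the prior list is not mutated
```

Each node returns only its *delta* (the new `StepResult`), and the reducer folds it into the accumulated history, which is what makes the audit trail append-only by construction.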
Getting Started
git clone https://github.com/ManjunathGovindaraju/langgraph-plan-execute-validate.git
cd langgraph-plan-execute-validate
uv sync
cp .env.example .env # add your ANTHROPIC_API_KEY
Five lines to run your first agent:
from pev import create_pev_graph, initial_state, PEVConfig
graph = create_pev_graph(PEVConfig(pass_threshold=0.85))
result = graph.invoke(initial_state("Research the top 3 vector databases"))
print(result["status"]) # "complete"
print(result["step_results"]) # scored audit trail for every step
Run the included examples:
python examples/research_agent.py # web search + validate
python examples/code_review_agent.py # strict threshold, shows retry flow
python examples/data_analysis_agent.py # no tools, LLM reasoning only
The template also includes a benchmark_reliability.py script that uses a "Flaky Tool" to prove the Validator catches generic results 100% of the time, forcing a retry that retrieves high-fidelity data:
make example-benchmark
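The benchmark's idea is easy to reproduce: a tool that returns a generic answer on its first call and a detailed one on the second, so a validator that gates on specificity forces exactly one retry. A minimal sketch of that idea (the real script's internals may differ, and the keyword-based scorer below stands in for the LLM validator):

```python
class FlakyTool:
    """Returns a vague result first, a detailed one on retry."""
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, query: str) -> str:
        self.calls += 1
        if self.calls == 1:
            return "Some results were found."        # generic -> should fail the gate
        return f"3 matches for {query!r}: A, B, C."  # specific -> should pass

def naive_validate(output: str) -> float:
    # Stand-in scorer: generic phrasing scores low (the template uses an LLM here)
    return 0.3 if "some results" in output.lower() else 0.9

tool = FlakyTool()
first = naive_validate(tool("vector databases"))
retry = naive_validate(tool("vector databases"))
print(first, retry)  # low score on attempt 1, passing score after one retry
```

Without the validation gate, the generic first answer would flow straight into the next step; with it, the retry recovers the high-fidelity data.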
Full Configuration Reference
from pev import PEVConfig
cfg = PEVConfig(
# Model routing
planner_model = "claude-haiku-4-5-20251001",
executor_model = "claude-sonnet-4-6",
validator_model = "claude-haiku-4-5-20251001",
# Quality gate
pass_threshold = 0.80,
# Loop guards
max_retries = 2,
max_replans = 1,
# Tools (planner and validator never see these)
tools = [TavilySearchResults(max_results=3)],
)
| Parameter | Default | Description |
|---|---|---|
| `planner_model` | `claude-haiku-4-5-20251001` | Structured JSON output only |
| `executor_model` | `claude-sonnet-4-6` | Tool calls + reasoning |
| `validator_model` | `claude-haiku-4-5-20251001` | Scoring only |
| `pass_threshold` | `0.80` | Minimum score [0.0–1.0] for a step to pass |
| `max_retries` | `2` | Retries per step before triggering replan |
| `max_replans` | `1` | Full replanning cycles before marking failed |
| `tools` | `[]` | LangChain tools available to the executor |
Project Structure
langgraph-plan-execute-validate/
├── src/pev/
│ ├── __init__.py # Public API: create_pev_graph, initial_state, PEVConfig
│ ├── graph.py # StateGraph wiring, router node, _dispatch edge
│ ├── state.py # PEVState TypedDict, StepResult, Status
│ ├── config.py # PEVConfig dataclass with validation
│ ├── prompts.py # All prompt templates (one place, easy to tune)
│ └── nodes/
│ ├── planner.py # Structured output, replan-aware
│ ├── executor.py # Tool-call loop, feedback injection on retry
│ └── validator.py # Confidence scoring, audit trail append
├── examples/
│ ├── research_agent.py
│ ├── code_review_agent.py
│ ├── data_analysis_agent.py
│ └── mcp_agent.py
├── tests/
│ ├── test_planner.py
│ ├── test_executor.py
│ ├── test_validator.py
│ ├── test_retry_replan.py # Router decision tree — 12 routing scenarios
│ └── test_graph.py
└── docs/
└── architecture.md # Full architecture with Mermaid diagrams
Testing
The test suite is designed to run without API calls in CI:
# Unit tests — no API calls, runs in ~5 seconds
uv run pytest tests/ -m "not slow" -v
# Integration tests — requires ANTHROPIC_API_KEY
uv run pytest tests/ -m slow -v
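For the `-m slow` / `-m "not slow"` selection to work without warnings, the `slow` marker has to be registered with pytest. A typical registration, assuming the project configures pytest via pyproject.toml:

```toml
[tool.pytest.ini_options]
markers = [
    "slow: integration tests that require ANTHROPIC_API_KEY",
]
```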
The router decision tree is the most critical test target—test_retry_replan.py covers all 12 routing branches:
| Test file | What it covers |
|---|---|
| `test_planner.py` | First-plan vs replan, state resets, step injection |
| `test_executor.py` | Context injection, retry feedback, tool-call loop |
| `test_validator.py` | Score/feedback writing, score clamping, audit trail |
| `test_retry_replan.py` | Every router branch — 12 routing scenarios |
| `test_graph.py` | Config validation, graph compilation, `initial_state` |
What This Is Not
This template is opinionated about the things that are always true in production agent workflows: quality gates, retry determinism, audit trails, and cost optimization. It does not make choices about your domain logic or the specific tools you expose.
Fork it, define your domain-specific tools (or load them via MCP), tune the validator prompts in prompts.py for your use case, and you have a production-grade orchestration engine without building the reliability layer from scratch.
github.com/ManjunathGovindaraju/langgraph-plan-execute-validate
If you found this useful, the companion post on the MCP side of this architecture is here: Building a Production-Ready MCP Server: Async PostgreSQL, OpenTelemetry, and Kubernetes in One Template