Deep Agents: Building Long-Running Autonomous Agents with LangChain's New Framework
The era of single-turn, reactive agents is ending. As engineering teams push toward workflows that span hours instead of seconds—internal coding pipelines, multi-stage research synthesis, iterative design systems—the limitations of flat tool-calling architectures become painfully obvious. You cannot orchestrate a code review that spawns test generation, runs CI, waits for human approval, and then deploys without a fundamentally different approach to agent architecture. LangChain's Deep Agents framework, announced alongside Deep Agents Deploy in March 2026, represents the clearest attempt yet to provide production-grade abstractions for this class of problem.
This isn't just another wrapper. Deep Agents introduces a layered architecture where planning loops, persistent memory, and sub-agent delegation are first-class concerns—not afterthoughts bolted onto a ReAct loop. The framework sits deliberately above LangGraph (which gives you fine-grained state machine control) and the standard LangChain abstractions (which optimize for quick iteration). If you're building agents that need to survive restarts, coordinate specialized sub-agents, and maintain coherent long-term memory across sessions, this is the stack to understand.
The use cases driving adoption are telling: Open SWE for internal coding agents that autonomously fix bugs across repositories, GTM workflow automation that coordinates across CRM systems and communication channels, and design iteration systems like the Moda case study where agents loop through revision cycles with human feedback gates. In this deep-dive, we'll dissect the architecture, walk through the key APIs, examine deployment considerations, and build a working research-then-draft agent from scratch.
The Deep Agents Architecture: Planning, Memory, and Sub-Agents
The Deep Agents framework rests on three architectural pillars that distinguish it from simpler agent implementations: planning loops that decompose goals into task graphs, persistent memory that survives across sessions and crashes, and sub-agent delegation that enables specialized capabilities to run in parallel.
Planning Mechanism
Traditional agents operate on flat tool-calling sequences—the model decides the next action based on the current state, executes it, and repeats. This works for simple tasks but falls apart when you need hierarchical goal decomposition. Deep Agents introduces a planning layer that generates task graphs before execution begins. The planning model analyzes the high-level objective, identifies dependencies between subtasks, and produces a directed acyclic graph (DAG) of work items. Each node in this graph can represent a direct tool call, a sub-agent invocation, or a human approval checkpoint.
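To make the task-graph idea concrete, here is a minimal sketch using only Python's standard library (`graphlib`), not the Deep Agents API. The workflow and task names are hypothetical; the point is how a DAG of subtasks decomposes into "waves" of work items with no mutual dependencies, which is exactly what a planner needs in order to parallelize execution:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph for a code-review workflow: each key is a task,
# each value is the set of tasks that must complete before it can start.
task_graph = {
    "fetch_pr": set(),
    "run_linter": {"fetch_pr"},
    "generate_tests": {"fetch_pr"},
    "run_ci": {"run_linter", "generate_tests"},
    "human_approval": {"run_ci"},
    "deploy": {"human_approval"},
}

def execution_waves(graph: dict) -> list:
    """Group tasks into waves; tasks in the same wave have no mutual
    dependencies and could execute concurrently."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())  # every task whose deps are done
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves

print(execution_waves(task_graph))
```

Note how `run_linter` and `generate_tests` land in the same wave: the planner can dispatch both (or hand them to separate sub-agents) at once, while `deploy` stays gated behind the `human_approval` checkpoint node.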
This approach parallels findings from the W&D paper on parallel tool calling, which demonstrated that research agents achieve significant speedups when tool calls without mutual dependencies execute concurrently. Deep Agents extends this to sub-agent coordination: a parent agent can spawn multiple child agents in a single planning step, wait for their results in parallel, and then synthesize findings before proceeding.
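The parallel fan-out itself is worth seeing in miniature. This sketch uses plain `asyncio` rather than the Deep Agents runtime, with a stub standing in for a real sub-agent call; the structural point is that independent invocations are awaited together, so total latency approximates one call instead of N:

```python
import asyncio

async def run_subagent(name: str, query: str) -> dict:
    # Stand-in for a real sub-agent invocation; we just simulate latency.
    await asyncio.sleep(0.05)
    return {"agent": name, "query": query, "status": "done"}

async def fan_out(queries: list) -> list:
    # Spawn one sub-agent per query and await them concurrently.
    # gather() preserves input order, so synthesis can rely on it.
    tasks = [run_subagent(f"search_{i}", q) for i, q in enumerate(queries)]
    return await asyncio.gather(*tasks)

results = asyncio.run(fan_out(["topic A", "topic B", "topic C"]))
print([r["agent"] for r in results])
```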
Memory Subsystem
The memory architecture draws heavily from recent research on temporal reasoning in conversational AI. The APEX-MEM framework introduced property graphs with timestamp annotations for resolving when facts were learned and whether they remain valid. Deep Agents implements a similar approach: events are stored in append-only logs with temporal metadata, and the memory retrieval system can answer queries like "what did we learn about this repository before the last deployment?" rather than just "what do we know about this repository?"
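The append-only log with temporal metadata can be sketched in a few lines of plain Python. This is an illustration of the pattern, not the actual memory backend; the `facts_before_last` query mirrors the "before the last deployment" example above:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Event:
    timestamp: datetime
    kind: str      # e.g. "fact_learned", "deployment"
    subject: str   # e.g. a repository name
    payload: str

class TemporalLog:
    """Append-only event log; queries can be scoped to a time window."""
    def __init__(self):
        self._events = []

    def append(self, event: Event):
        self._events.append(event)

    def facts_before_last(self, subject: str, marker_kind: str) -> list:
        # Find the most recent marker event (e.g. the last deployment),
        # then return facts about `subject` learned strictly before it.
        markers = [e for e in self._events if e.kind == marker_kind]
        if not markers:
            return [e.payload for e in self._events
                    if e.kind == "fact_learned" and e.subject == subject]
        cutoff = max(m.timestamp for m in markers)
        return [e.payload for e in self._events
                if e.kind == "fact_learned" and e.subject == subject
                and e.timestamp < cutoff]

log = TemporalLog()
def t(day):
    return datetime(2026, 3, day, tzinfo=timezone.utc)
log.append(Event(t(1), "fact_learned", "repo-x", "uses pytest"))
log.append(Event(t(2), "deployment", "repo-x", "v1.4"))
log.append(Event(t(3), "fact_learned", "repo-x", "migrated to uv"))
print(log.facts_before_last("repo-x", "deployment"))  # ["uses pytest"]
```

Because events are never mutated, the log can answer "what did we believe at time T?" for any T, which is what distinguishes temporal memory from a plain key-value fact store.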
The MongoDB integration announced in the LangChain + MongoDB partnership provides durable checkpointing out of the box. Every state transition, planning decision, and sub-agent result gets persisted to MongoDB Atlas, enabling crash recovery and long-running workflows that span days.
Sub-Agent Delegation
Sub-agents aren't just function calls with extra steps—they're fully autonomous agents with their own planning loops, memory contexts, and tool access. The parent agent maintains a registry of available sub-agents with capability descriptions, and the planning model can decide to delegate tasks to the most appropriate specialist. This mirrors the "harness" concept from LangChain's architecture documentation: the separation between orchestration logic (what needs to happen and when) and compute execution (the actual work). The parent agent is the orchestrator; sub-agents are the compute layer.
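A minimal registry sketch (hypothetical class names, not the framework's own) shows the shape of this separation: capability descriptions get rendered into the planner's prompt so the model can choose a delegate, and delegation itself is a lookup plus a call:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SubAgentEntry:
    name: str
    description: str
    run: Callable  # the sub-agent's entry point: (task: str) -> str

@dataclass
class SubAgentRegistry:
    _agents: dict = field(default_factory=dict)

    def register(self, entry: SubAgentEntry):
        self._agents[entry.name] = entry

    def capabilities(self) -> str:
        # Rendered into the planning prompt so the model can pick a specialist.
        return "\n".join(f"- {a.name}: {a.description}"
                         for a in self._agents.values())

    def delegate(self, name: str, task: str) -> str:
        # The orchestrator hands work to the chosen compute layer.
        return self._agents[name].run(task)

registry = SubAgentRegistry()
registry.register(SubAgentEntry(
    "search_agent", "Finds primary sources", lambda t: f"sources for {t}"))
registry.register(SubAgentEntry(
    "summarizer", "Synthesizes sources", lambda t: f"summary of {t}"))
print(registry.delegate("search_agent", "parallel tool calling"))
```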
Hands-On: Code Walkthrough
Let's build a research-then-draft agent that demonstrates the core Deep Agents patterns. This agent accepts a topic, spawns specialized sub-agents for search and summarization, uses long-term memory to avoid redundant work, and produces a structured Markdown report.
"""
research_agent.py
A Deep Agent that coordinates search and summarization sub-agents
to produce research reports with persistent memory.
Requires:
- langchain-core >= 0.3.42
- langchain-deepagents >= 0.1.8
- langchain-mongodb >= 0.2.1
- langchain-community >= 0.3.20 (for TavilySearchResults)
"""
from langchain_deepagents import DeepAgent, SubAgent, PlanningConfig
from langchain_mongodb import MongoDBCheckpointer, MongoDBMemory
from langchain_community.tools import TavilySearchResults
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from typing import List, Dict, Any
import os
# Initialize the planning model - Claude 3.5 Sonnet works well for planning tasks
planning_model = ChatAnthropic(
model="claude-sonnet-4-20250514",
max_tokens=4096,
temperature=0.1 # Low temperature for more deterministic planning
)
# Configure MongoDB checkpointer for crash recovery
# This persists every state transition to Atlas
checkpointer = MongoDBCheckpointer(
connection_string=os.environ["MONGODB_URI"],
database_name="deep_agents",
collection_name="research_agent_checkpoints",
# Enable TTL for automatic cleanup of old checkpoints
ttl_seconds=86400 * 30 # 30 days retention
)
# Long-term memory backend with temporal resolution
# Stores facts with timestamps for "when did we learn this?" queries
memory_backend = MongoDBMemory(
connection_string=os.environ["MONGODB_URI"],
database_name="deep_agents",
collection_name="research_memory",
# Enable vector search for semantic retrieval
vector_index_name="memory_vector_index",
embedding_dimensions=1536
)
# Define the search sub-agent's tool
search_tool = TavilySearchResults(
max_results=5,
search_depth="advanced",
include_raw_content=True
)
@tool
def extract_key_claims(content: str) -> List[Dict[str, str]]:
"""Extract key claims and their supporting evidence from content."""
# In production, this would call a specialized extraction model
# Here we're showing the tool interface pattern
return [{"claim": content[:200], "evidence": content[200:400]}]
# Define the search sub-agent
search_subagent = SubAgent(
name="search_agent",
description="Searches the web for current information on a topic. "
"Use this for gathering primary sources and recent developments.",
tools=[search_tool],
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
# Sub-agents can have their own planning constraints
max_iterations=10,
# Memory scope: this sub-agent shares memory with parent
memory_scope="shared"
)
# Define the summarization sub-agent
summarization_subagent = SubAgent(
name="summarization_agent",
description="Condenses and synthesizes information from multiple sources. "
"Use this after gathering raw sources to produce coherent summaries.",
tools=[extract_key_claims],
model=ChatAnthropic(model="claude-sonnet-4-20250514"),
max_iterations=5,
memory_scope="shared"
)
# Configure planning behavior
planning_config = PlanningConfig(
# Maximum depth of planning recursion
max_planning_depth=3,
# Timeout for individual sub-agent invocations (seconds)
sub_agent_timeout=300,
# Enable parallel sub-agent execution when dependencies allow
parallel_execution=True,
# Replan if a sub-agent fails rather than failing the whole workflow
replan_on_failure=True
)
# Initialize the Deep Agent
research_agent = DeepAgent(
name="research_report_agent",
description="Produces comprehensive research reports by coordinating "
"search and summarization specialists.",
planning_model=planning_model,
memory_backend=memory_backend,
checkpointer=checkpointer,
sub_agents=[search_subagent, summarization_subagent],
planning_config=planning_config,
# Path to custom instructions file
instructions_path="./AGENTS.md"
)
# Example invocation
async def generate_report(topic: str, session_id: str) -> str:
"""
Generate a research report on the given topic.
Uses session_id for memory continuity across invocations.
"""
result = await research_agent.ainvoke(
{
"objective": f"Research the topic '{topic}' and produce a "
f"structured Markdown report with citations.",
"output_format": "markdown",
"max_sources": 10
},
config={
"session_id": session_id,
# Enable LangSmith tracing for observability
"callbacks": [], # LangSmith callbacks auto-inject when configured
# Memory retrieval configuration
"memory_config": {
"retrieve_before_planning": True,
"max_memory_items": 50,
# Skip sources we've already processed in this session
"dedupe_by": "url"
}
}
)
return result.output
# Run the agent
if __name__ == "__main__":
import asyncio
report = asyncio.run(generate_report(
topic="Recent advances in parallel tool calling for AI agents",
session_id="research-session-001"
))
print(report)
The AGENTS.md file referenced above provides custom instructions that guide agent behavior:
```markdown
# Research Report Agent Instructions

## Planning Guidelines
- Always check memory for previously researched sources before searching
- Spawn search_agent first to gather sources, then summarization_agent to synthesize
- If fewer than 3 high-quality sources are found, replan with broader search terms

## Output Requirements
- Include inline citations linking to source URLs
- Structure reports with: Executive Summary, Key Findings, Detailed Analysis, Sources
- Flag any conflicting information between sources

## Safety Constraints
- Do not include speculation presented as fact
- Mark any information older than 6 months as potentially outdated
```
To view execution traces and debug planning decisions, access the LangSmith dashboard where each planning step, sub-agent invocation, and memory operation appears as a nested span with latency and token cost attribution.
Deployment Patterns: Deep Agents Deploy and Sandbox Execution
Running long-running agents in production requires infrastructure that most teams don't want to build themselves. Deep Agents Deploy, announced as "an open alternative to Claude Managed Agents", provides a hosted runtime specifically designed for agents that execute over minutes or hours rather than seconds.
Infrastructure Abstraction
Deep Agents Deploy handles the unglamorous but critical concerns: process isolation, crash recovery, timeout management, and resource allocation. When your agent spawns ten sub-agents in parallel for a research task, the runtime distributes these across worker pools and manages backpressure. When a network partition causes a checkpoint failure, the runtime automatically retries from the last durable state.
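The retry-from-last-durable-state behavior reduces to a simple pattern, sketched here with a local JSON file standing in for MongoDB (the step names and file path are illustrative). The atomic write-then-rename is the key detail: a crash mid-write never leaves a corrupt checkpoint behind:

```python
import json
import pathlib

CHECKPOINT = pathlib.Path("workflow_checkpoint.json")
STEPS = ["gather_sources", "summarize", "draft_report", "final_review"]

def load_checkpoint() -> int:
    # Index of the next step to run; 0 if no checkpoint exists yet.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["next_step"]
    return 0

def save_checkpoint(next_step: int):
    # Write to a temp file, then rename: the rename is atomic on POSIX,
    # so readers always see either the old or the new checkpoint, never
    # a half-written one.
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_step": next_step}))
    tmp.replace(CHECKPOINT)

def run_workflow() -> list:
    # On a fresh run this executes every step; after a crash it resumes
    # from wherever the last durable checkpoint left off.
    completed = []
    for i in range(load_checkpoint(), len(STEPS)):
        completed.append(STEPS[i])  # stand-in for the real step's work
        save_checkpoint(i + 1)      # persist progress after each step
    return completed
```

Calling `run_workflow()` after a simulated crash at step 2 (i.e. a checkpoint containing `{"next_step": 2}`) executes only `draft_report` and `final_review`.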
This separation mirrors the architecture pattern OpenAI introduced in their Agents SDK evolution: the harness (orchestration logic) runs on managed infrastructure while compute (the actual LLM calls and tool executions) can be distributed across different environments. Deep Agents Deploy implements this pattern with first-class support for LangGraph-based sub-agents.
Hardware Acceleration
The NVIDIA enterprise integration enables hardware acceleration for compute-intensive sub-agents. When a Deep Agent spawns multiple sub-agents that each need to process large documents or run inference on specialized models, the langchain-nvidia package routes these to NIM microservices running on GPU infrastructure. This becomes significant when your planning DAG has wide parallelism—ten sub-agents each running a 70B parameter model benefit substantially from NIM's optimized batch inference.
Durability and Cost
MongoDB-backed checkpoints provide durability guarantees that in-memory state cannot. If your agent is three hours into a complex workflow and the host process dies, resumption picks up from the last checkpoint rather than starting over. The MongoDB partnership specifically targets this use case: vector search for memory retrieval and document checkpointing unified in a single database you're likely already running.
Cost attribution in long-running agents requires tracking spend across sub-agent invocations. LangSmith Fleet provides identity-based cost tracking where each Deep Agent instance accumulates costs from all its sub-agent invocations, enabling accurate per-workflow billing and optimization.
Security Considerations
For agents that execute code—and most production agents eventually do—sandboxing is non-negotiable. LangSmith Sandboxes provide isolated execution environments for code-generation sub-agents, preventing arbitrary code execution from compromising your infrastructure. For runtime policy enforcement on tool calls, the Agent Governance Toolkit integrates with Deep Agents to enforce constraints like "this agent cannot call external APIs after 6 PM" or "this agent cannot modify files outside the /workspace directory."
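A runtime policy check is essentially a guard applied to every tool call before it executes. This sketch (hypothetical policy functions, not the Governance Toolkit's actual API) shows the shape of the file-path constraint mentioned above — a policy returns an error string to deny the call, or `None` to allow it:

```python
class PolicyViolation(Exception):
    """Raised when a tool call is denied by a runtime policy."""

def enforce(policies: list):
    """Decorator: run every policy against a tool call before executing it."""
    def wrap(tool_fn):
        def guarded(tool_name: str, args: dict):
            for policy in policies:
                error = policy(tool_name, args)
                if error:
                    raise PolicyViolation(error)
            return tool_fn(tool_name, args)
        return guarded
    return wrap

def no_writes_outside_workspace(tool_name: str, args: dict):
    # Deny file writes that escape the sandboxed workspace directory.
    path = args.get("path", "")
    if tool_name == "write_file" and not path.startswith("/workspace/"):
        return f"write denied outside /workspace: {path}"
    return None

@enforce([no_writes_outside_workspace])
def call_tool(tool_name: str, args: dict) -> str:
    # Stand-in for the real tool dispatcher.
    return f"executed {tool_name}"

print(call_tool("write_file", {"path": "/workspace/notes.md"}))
```

Because policies are plain functions over `(tool_name, args)`, time-of-day rules like "no external API calls after 6 PM" fit the same interface without touching the dispatcher.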
Evaluation and Observability for Long-Running Agents
Traditional LLM evaluation—comparing model outputs to gold-standard responses—breaks down for autonomous agents. When an agent takes 47 steps to complete a task, the final output might be correct even if step 23 was wildly inefficient. Conversely, an incorrect final output might result from a single bad decision at step 12 that cascaded through the rest of the workflow.
The Observability Gap
The State of Agent Engineering survey found that 89% of production teams use observability for their agents, but evaluation remains the biggest blocker to deployment confidence. Teams can see what their agents are doing but struggle to systematically assess whether the agents are doing it well.
LangSmith addresses this with trace-based evaluation: each planning step, sub-agent result, and memory operation gets captured as a span that can be individually scored. You define evaluators that examine not just the final output but the quality of intermediate decisions:
```python
from langsmith.evaluation import RunEvaluator

class PlanningQualityEvaluator(RunEvaluator):
    """Evaluates whether the agent's planning was efficient."""

    def evaluate_run(self, run, example=None):
        # Extract planning steps from the trace
        planning_spans = [s for s in run.child_runs
                          if s.name.startswith("planning_")]
        # Score based on planning efficiency metrics
        replan_count = sum(1 for s in planning_spans
                           if "replan" in s.name)
        # Penalize excessive replanning
        score = max(0, 1.0 - (replan_count * 0.2))
        return {"key": "planning_efficiency", "score": score}
```
Skills as Evaluated Capabilities
Deep Agents introduces "Skills"—reusable, versioned capabilities that agents can attach. A "web_research" skill encapsulates the ability to search, filter, and synthesize web content. Each skill has associated evaluators that run during CI to ensure capability regressions don't ship to production. When you upgrade from web_research@1.2 to web_research@1.3, the skill's evaluation suite must pass before deployment.
Harness Optimization
The hill-climbing approach to harness configuration uses evaluation feedback loops to optimize agent behavior. You define success metrics (task completion rate, average step count, cost per task), run the agent against a benchmark suite, and systematically adjust harness parameters—max_planning_depth, sub_agent_timeout, memory retrieval limits—to improve metrics. This transforms agent tuning from intuition-driven to data-driven.
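The loop itself is straightforward. In this sketch a synthetic scoring function stands in for the benchmark suite (real signal would come from evaluation runs, and the "optimal" values baked into the toy function are arbitrary); the climber perturbs one harness parameter at a time and keeps only improving moves:

```python
import random

def evaluate_harness(params: dict) -> float:
    # Stand-in for running the agent against a benchmark suite. This toy
    # score peaks at depth=3, timeout=300 so the loop has a hill to climb;
    # a real metric would combine completion rate, step count, and cost.
    return (-abs(params["max_planning_depth"] - 3)
            - abs(params["sub_agent_timeout"] - 300) / 100)

def hill_climb(params: dict, steps: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best, best_score = dict(params), evaluate_harness(params)
    step_sizes = {"max_planning_depth": 1, "sub_agent_timeout": 60}
    for _ in range(steps):
        key = rng.choice(list(step_sizes))
        candidate = dict(best)
        candidate[key] = max(1, candidate[key]
                             + rng.choice([-1, 1]) * step_sizes[key])
        score = evaluate_harness(candidate)
        if score > best_score:  # accept only improving moves
            best, best_score = candidate, score
    return best

print(hill_climb({"max_planning_depth": 1, "sub_agent_timeout": 600}))
```

Because only improving moves are accepted, the tuned configuration never scores worse than the starting one; in practice you would restart from several seeds to avoid local optima.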
What This Means for Your Stack
When to choose Deep Agents over LangGraph: Deep Agents adds value when your workflow requires persistent memory across sessions, autonomous goal decomposition, or sub-agent coordination. If you're building a customer support bot that handles single queries, LangGraph's lower-level abstractions are simpler and faster to iterate on. If you're building an internal coding agent that maintains context across PRs, learns from past reviews, and coordinates linter, test-runner, and documentation sub-agents, Deep Agents provides the right level of abstraction.
Memory backend selection: The MongoDB Checkpointer makes sense for teams already running Atlas—you get vector search for memory retrieval and checkpointing in a single managed database. For teams with existing PostgreSQL infrastructure, PostgresSaver provides equivalent durability with pgvector for semantic retrieval. The temporal reasoning features from the APEX-MEM research are currently better supported in the MongoDB backend.
Migration path: Existing LangGraph agents can be wrapped as sub-agents within a Deep Agent orchestrator. This enables incremental migration: start by wrapping your most complex workflow as a Deep Agent while keeping simpler agents on LangGraph. The shared memory scope ensures the orchestrator and sub-agents maintain consistent context.
Cost and latency tradeoffs: Long-running agents accumulate token costs across planning iterations and sub-agent invocations. Set max_planning_depth limits based on your cost tolerance. Monitor speculative_waste_ratio in LangSmith—this metric shows how often the planning model generates plans that get abandoned due to execution failures. A high ratio indicates either overly ambitious planning or unreliable tools.
Team readiness: Deep Agents requires observability maturity. If your team isn't already using LangSmith tracing for existing LLM workflows, start there before deploying autonomous agents. The State of Agent Engineering findings make this clear: teams without observability infrastructure struggle to debug and optimize agent behavior.
Security posture: Enable Sandboxes for any code-execution sub-agents—this is not optional for production deployments. Integrate with the Agent Governance Toolkit for runtime policy enforcement. Define explicit constraints on what tools each sub-agent can call and under what conditions.
What to Build This Week
Build a code review coordination agent that demonstrates the full Deep Agents architecture:
- Parent Agent: Accepts a PR URL, plans the review workflow, coordinates sub-agents, maintains memory of past reviews on this repository
- Static Analysis Sub-Agent: Runs linters and type checkers, reports findings
- Security Review Sub-Agent: Scans for common vulnerabilities, checks dependency updates
- Test Coverage Sub-Agent: Identifies untested code paths, suggests test cases
- Documentation Sub-Agent: Checks for missing docstrings, outdated README sections
Configure MongoDB checkpointing so the review survives interruptions. Use the AGENTS.md file to encode your team's code review standards. Set up LangSmith tracing and build a custom evaluator that scores review thoroughness against a benchmark of manually-reviewed PRs.
The goal isn't a production-ready tool—it's hands-on experience with planning DAGs, sub-agent coordination, persistent memory, and trace-based evaluation. These are the primitives that every non-trivial agent will require as we move from demos to deployment.
Sources
- March 2026: LangChain Newsletter
- State of Agent Engineering - LangChain
- LangChain Announces Enterprise Agentic AI Platform Built with NVIDIA
- Announcing the LangChain + MongoDB Partnership: The AI Agent Stack That Runs On The Database You Already Trust
- The next evolution of the Agents SDK - OpenAI
- W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents
- APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI
- Introducing the Agent Governance Toolkit - Microsoft Open Source
*This is part of the **Agentic Engineering Weekly** series — a deep-dive every Monday into the frameworks, patterns, and techniques shaping the next generation of AI systems.*
Follow the Agentic Engineering Weekly series on Dev.to to catch every edition.
Building something agentic? Drop a comment — I'd love to feature reader projects.