1. Introduction: Four Core Pain Points of Single-Agent Architecture in Customer Service
In e-commerce customer service scenarios, user requests are often complex and multi-dimensional. A typical user message might look like this:
"Check the shipping status of Order #123, look up the after-sales warranty policy for this product, and update my delivery address."
This single message contains three independent intents, requires two different data sources, and demands coordinated execution. Single-agent architecture exposes four unavoidable pain points in scenarios like this:
- No complex task decomposition: A single agent cannot break down composite requests into executable subtasks — it either handles only one intent or produces a confused, incomplete response;
- Poor tool call robustness: When an external tool fails (Neo4j timeout, GraphRAG service unavailable), a single agent falls into an infinite retry loop with no circuit-breaking mechanism, blocking the entire service;
- Fragmented multi-source retrieval: Structured order data (Neo4j) and unstructured product documentation (GraphRAG) require completely different retrieval strategies — a single agent cannot coordinate both within a single response;
- No end-to-end governance: Without a unified safety control node, there is no way to implement circuit breaking, content compliance checks, or permission management — failing to meet enterprise-grade compliance requirements.
This article builds on the technical foundations from the first three parts (MinerU multimodal parsing, Neo4j knowledge graph, GraphRAG service wrapping) to present a complete walkthrough of building an enterprise-grade multi-agent system with LangGraph — solving all four pain points through a layered decoupled architecture, precise intent routing, and end-to-end safety governance.
2. Full-Stack System Architecture
The system adopts a three-tier macro architecture with six decoupled sub-layers, fully isolating the underlying infrastructure from the upper-layer business application.
```
┌─────────────────────────────────────────────────────────┐
│              LLM Application Architecture Layer          │
│                                                          │
│ Application: User Service │ Session Service │ KB Service │
│                                                          │
│ Function: Multi-Agent │ Safety Guardrails │ Hybrid KB    │
│           Retrieval │ Offline/Online Index Build │       │
│           Text2Cypher Debug                              │
├─────────────────────────────────────────────────────────┤
│               LLM Technical Architecture Layer           │
│                                                          │
│ Core:      Agent │ RAG │ Workflow                        │
│ Framework: LangChain / LangGraph / Microsoft GraphRAG    │
│ Interface: Vue / FastAPI / SSE / Open API                │
├─────────────────────────────────────────────────────────┤
│               LLM Platform Architecture Layer            │
│                                                          │
│ Model: DeepSeek Online │ vLLM Private Deployment         │
│ Data:  MySQL │ Redis │ Neo4j │ LanceDB │ Local Disk      │
│ Infra: Cloud Server │ GPU Server │ Docker Platform       │
└─────────────────────────────────────────────────────────┘
```
2.1 LLM Application Architecture Layer
The application layer faces users and the frontend directly, comprising three core modules:
- User Service: Login, registration, identity verification, and permission management;
- Session Service: Conversation lifecycle management, context storage, and session state synchronization;
- Knowledge Base Service: Upload, parsing, and index management for product manuals and after-sales policies, integrated with the MinerU multimodal parsing capability from Part 2.
The function layer is the core business capability carrier of the multi-agent system:
- Multi-Agent Architecture: End-to-end coordination covering intent routing, task decomposition, tool execution, and result aggregation;
- Safety Guardrails: Circuit breaking, timeout control, content compliance checks, and request rate limiting;
- Hybrid Knowledge Base Retrieval: Unified query entry point integrating Neo4j structured retrieval and GraphRAG unstructured retrieval;
- Offline/Online Index Construction: Supports batch offline full indexing and real-time incremental updates for streaming data;
- Text2Cypher Debugging: Natural language to Cypher generation, syntax validation, and logic correction.
2.2 LLM Technical Architecture Layer
This layer provides standardized technical capabilities to the upper business layers:
- Core capability layer: Three capability units — Agent scheduling, RAG retrieval augmentation, and Workflow orchestration;
- Framework layer: LangChain/LangGraph for multi-agent workflow orchestration; Microsoft GraphRAG for unstructured knowledge base retrieval;
- Interface layer: Vue frontend, FastAPI backend, SSE streaming responses, and Open API standardized integration.
2.3 LLM Platform Architecture Layer
This layer provides compute, storage, and model capabilities:
- Model layer: Dual-model strategy — DeepSeek online model for general conversation and intent recognition; vLLM private deployment for sensitive business data processing;
- Data layer: Hybrid storage — MySQL for structured business data, Redis for session state caching, Neo4J for the business knowledge graph, LanceDB for vector data;
- Infrastructure layer: Cloud server + GPU server compute foundation with Docker-based containerized deployment.
3. Multi-Agent Workflow: End-to-End Design
Based on LangGraph's StateGraph, the multi-agent collaboration process is abstracted into an observable, governable, and traceable state machine.
```
          ┌─────────┐
          │  Start  │
          └────┬────┘
               │
  ┌────────────▼────────────┐
  │ analyze_and_route_query │
  └────────────┬────────────┘
               │
  ┌────────────▼────────────┐
  │       route_query       │
  └──┬──────┬───────┬────┬──┘
     │      │       │    │
 General Clarify  Query Image
     │      │       │    │
     │      │   ┌───▼───┐│
     │      │   │Planner││
     │      │   └───┬───┘│
     │      │   ┌───┼───┐│
     │      │   │   │   ││
     │      │ Tool1 Tool2 Tool3
     │      │   │   │   ││
     │      │   └───┼───┘│
     │      │       │    │
     └──────┴───────▼────┘
                │
          ┌─────▼─────┐
          │  Summary  │
          └─────┬─────┘
                │
        ┌───────▼──────┐
        │ Final Answer │
        └───────┬──────┘
                │
            ┌───▼───┐
            │  End  │
            └───────┘
```
3.1 Entry Node: analyze_and_route_query
The sole entry point for all user requests. Core responsibilities: receive user input, inject context, and trigger intent classification.
Design decision: Analysis and routing are merged into a single node rather than split into two. The reason is that intent analysis depends on the result of context injection — merging eliminates one state read/write cycle and reduces latency.
3.2 Core Decision Node: route_query
This is the "brain" of the entire workflow. It uses an LLM to perform precise intent classification and routes user requests to one of four processing branches.
The core challenge in classification design is defining clear boundaries between categories to prevent classification drift in ambiguous scenarios. Our approach: define classification boundaries using positive/negative sample contrast. After multiple iterations, classification accuracy improved from 78% in the initial version to 94%.
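Even with well-defined boundaries, the classifier's raw output still needs defensive handling before routing. The following is a minimal sketch of one way to normalize the LLM's label into a branch name; the label strings and the fall-back-to-clarification choice are illustrative assumptions, not the system's actual implementation.

```python
# Normalize the raw label returned by the classification LLM into one of the
# four routing branches; anything unrecognized falls back to "clarify" so the
# system asks the user rather than guessing. Label names are assumptions.
VALID_ROUTES = {"general", "clarify", "query", "image"}

def normalize_route(raw_label: str) -> str:
    label = raw_label.strip().lower()
    # Tolerate common LLM formatting noise: quotes, backticks, trailing periods
    label = label.strip('"\'` .')
    if label in VALID_ROUTES:
        return label
    # Ambiguous or invalid output: route to clarification instead of guessing
    return "clarify"
```

Treating every unrecognized label as a clarification request is what keeps classification drift from silently sending a request down the wrong branch.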
3.3 Four Branch Processing Logic
Branch 1: General Q&A
No external tools required. A response is generated directly via Prompt + LLM.
Use cases: Small talk, greetings, simple rule-based Q&A.
Branch 2: Clarification Required
Core design: Before prompting the user to provide more information, a business relevance check is performed first.
- Relevance check passes → Generate a guided response prompting the user to supply the required parameters;
- Relevance check fails → Return a fallback response directing the user to contact a human agent.
Design decision: The relevance check is anchored to the Neo4j Schema definition and business scope description — not left to free-form LLM judgment. This binds the check result to explicit business boundaries and prevents the LLM from over-generalizing.
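A schema-anchored relevance check can be as simple as testing whether the query touches any vocabulary the graph actually defines. This is a minimal sketch under that assumption; the schema field names (`node_labels`, `relationship_types`, `properties`) and the single-term-overlap heuristic are illustrative, not the production rule.

```python
# Relevance check anchored to the Neo4j schema rather than free-form LLM
# judgment: the query is in scope only if it mentions a schema-defined term.
# Schema key names and the overlap heuristic are illustrative assumptions.
def is_business_relevant(query: str, schema: dict) -> bool:
    vocabulary = set()
    for key in ("node_labels", "relationship_types", "properties"):
        vocabulary.update(term.lower() for term in schema.get(key, []))
    words = query.lower().split()
    # In scope if at least one word matches the graph's declared vocabulary
    return any(word.strip(",.?!") in vocabulary for word in words)
```

Because the vocabulary is derived from the schema, widening or narrowing the business boundary is a data change, not a prompt change.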
Branch 3: Image Q&A
A multimodal LLM parses the image content, extracts key information, and generates the corresponding response.
Use cases: Users uploading screenshots of products, orders, or shipping information to ask questions.
Branch 4: Query Q&A (Core Branch)
This is the system's core processing branch, integrating all technical outputs from the first three parts. It consists of three sub-steps:
Step 1: Planner — Task Decomposition
Decomposes the user's complex query into multiple subtasks that can be executed in parallel or in sequence, specifying the goal, required tool, and execution order for each subtask.
Design decision: The Planner's output format is strictly defined as structured JSON. Core field design:
| Field | Description |
|---|---|
| `task_id` | Unique subtask identifier for ordered result aggregation |
| `task_type` | Subtask type identifier for routing to the corresponding tool |
| `tool` | The tool type required for the subtask |
| `dependencies` | Dependency relationships controlling parallel/sequential execution order |
Enforcing structured output ensures that the downstream tool selection node can parse results unambiguously, eliminating the uncertainty introduced by natural language descriptions.
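For the composite request from the introduction, a plan in this format might look like the sketch below, together with a validator that rejects malformed output before it reaches tool selection. The task names, tool identifiers, and validator are illustrative assumptions built only on the four fields above.

```python
# A hypothetical Planner output for "check shipping status, look up the
# warranty policy, update my delivery address", using the four fields above.
plan = [
    {"task_id": "t1", "task_type": "order_status",
     "tool": "predefined_cypher", "dependencies": []},
    {"task_id": "t2", "task_type": "policy_lookup",
     "tool": "graphrag_query", "dependencies": []},
    {"task_id": "t3", "task_type": "address_update",
     "tool": "generate_cypher", "dependencies": ["t1"]},
]

REQUIRED_FIELDS = {"task_id", "task_type", "tool", "dependencies"}

def validate_plan(tasks: list) -> bool:
    """Reject plans with missing fields or dangling dependency references."""
    ids = {t.get("task_id") for t in tasks}
    for t in tasks:
        if not REQUIRED_FIELDS <= t.keys():
            return False
        # Every dependency must point at a task_id that exists in the plan
        if any(dep not in ids for dep in t["dependencies"]):
            return False
    return True
```

Validating the plan at this boundary means a malformed LLM response triggers regeneration instead of a downstream tool-selection failure.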
Step 2: Tool Selection and Execution
Based on subtask type, requests are automatically routed to one of three tools:
Tool 1: GraphRAG Query
- Use cases: Unstructured data queries (product specifications, after-sales policies, product manuals);
- Integrates the GraphRAG RESTful API wrapped in Part 3, supporting Local / Global / Drift / Basic retrieval modes;
- Tool selection logic: Retrieval mode is automatically selected based on query scope and depth — Local Search for precise local queries, Global Search for broad conceptual queries.
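The mode-selection logic described above could be sketched as a simple heuristic; the keyword list and the default to Local Search are assumptions for illustration, and a production system would more likely classify with the LLM itself.

```python
# Heuristic sketch: pick a GraphRAG retrieval mode from query wording.
# Keyword hints and the Local-Search default are illustrative assumptions.
GLOBAL_HINTS = ("overall", "compare", "summary", "across", "trend")

def select_graphrag_mode(query: str) -> str:
    q = query.lower()
    # Broad conceptual questions span many documents -> Global Search
    if any(hint in q for hint in GLOBAL_HINTS):
        return "global"
    # Precise, entity-anchored questions -> Local Search
    return "local"
```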
Tool 2: Generate Cypher
- Use cases: Custom queries on structured business data (order status, shipping information, delivery address);
- Integrates the Neo4j knowledge graph from Part 2, converting natural language to Cypher via a "Schema injection → LLM generation → syntax validation → execution" pipeline;
- Key design: Generated Cypher is mandatorily validated for syntax and logic. On failure, it is sent back to the LLM for regeneration, with a maximum of 2 retries before falling back to a predefined result.
Tool 3: Predefined Cypher
- Use cases: High-frequency, fixed structured queries (list all orders, check product inventory);
- Matches the user query against predefined requirement descriptions by similarity, then directly fills in parameters and executes — no dynamic LLM generation required;
- Design value: Covers approximately 80% of high-frequency query scenarios, pushing accuracy for this segment to near 100% while significantly reducing latency and token consumption.
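The similarity matching step can be sketched with `difflib` from the standard library; the template descriptions, Cypher statements, and the 0.6 threshold below are illustrative assumptions, not the system's actual templates.

```python
import difflib

# Sketch: match the user query against predefined requirement descriptions by
# similarity; below the threshold we return None and the caller falls through
# to dynamic generation. Templates and threshold are illustrative.
PREDEFINED = {
    "list all orders for a user":
        "MATCH (u:User {id: $uid})-[:PLACED]->(o:Order) RETURN o",
    "check product inventory":
        "MATCH (p:Product {sku: $sku}) RETURN p.stock",
}

def match_predefined(query: str, threshold: float = 0.6):
    best_desc, best_score = None, 0.0
    for desc in PREDEFINED:
        score = difflib.SequenceMatcher(None, query.lower(), desc).ratio()
        if score > best_score:
            best_desc, best_score = desc, score
    if best_score < threshold:
        return None  # no confident match: fall back to dynamic generation
    return PREDEFINED[best_desc]
```

Because matched queries skip the LLM entirely, this path trades generality for deterministic accuracy and low latency, which is exactly the right trade for the high-frequency head of the query distribution.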
Step 3: Safety Governance
Safety guardrails are active throughout the entire tool execution lifecycle:
- Pre-execution: Validate parameter legality, user permissions, and call frequency;
- During execution: Timeout control (configurable threshold per tool call) and circuit breaking (configurable maximum tool calls per conversation turn);
- Post-execution: Validate relevance and compliance of returned results; filter sensitive information.
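The during-execution timeout control above can be sketched with `asyncio.wait_for`, returning a fallback result instead of blocking the conversation turn. The timeout value, result shape, and the two stand-in tools are illustrative assumptions.

```python
import asyncio

# Sketch: per-call timeout guard around tool execution. A timed-out call
# degrades to a fallback result instead of blocking the whole turn.
async def guarded_tool_call(tool, payload, timeout_s: float = 5.0):
    try:
        result = await asyncio.wait_for(tool(payload), timeout=timeout_s)
        return {"status": "success", "result_data": result}
    except asyncio.TimeoutError:
        return {"status": "fallback", "result_data": None}

# Stand-in tools for illustration only
async def fast_tool(payload):
    return payload

async def slow_tool(payload):
    await asyncio.sleep(10)  # simulates a hung Neo4j / GraphRAG call
    return payload
```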
3.4 Result Aggregation: Summary Node
Collects execution results from all branches and subtasks, performs semantic-level fusion, resolves information conflicts, and organizes the output into logically coherent content that conforms to customer service language standards.
Design decision: Sequential tasks are merged in dependency order; parallel tasks are merged by business logic category. The two merge strategies are handled separately to prevent result ordering issues.
4. Production-Grade Core Capabilities
4.1 LangGraph-Based State Persistence and Session Management
Using LangGraph's native Checkpointer mechanism, we implement full-lifecycle session state persistence:
- Checkpointer: Uses `RedisSaver` as the backend; after each node completes, the State snapshot is automatically saved to Redis;
- Hot/cold storage separation: Active session state is stored in Redis (hot data); upon session end, data is automatically synced to MySQL (cold data);
- Seamless session recovery: When a user resumes an interrupted conversation, the state snapshot is loaded directly from the Checkpointer, restoring execution to the interrupted node;
- Long-conversation memory compression: When a conversation exceeds 10 turns, the LLM is automatically invoked to summarize and compress the conversation history, reducing token consumption while preserving core semantics.
```python
from langgraph.checkpoint.redis import RedisSaver

# Initialize Redis Checkpointer
checkpointer = RedisSaver.from_conn_string("redis://localhost:6379")

# Inject Checkpointer when compiling the workflow
app = workflow.compile(checkpointer=checkpointer)

# Carry thread_id on each call for session isolation
config = {"configurable": {"thread_id": session_id}}
result = await app.ainvoke(state, config=config)
```
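The 10-turn compression trigger described above can be sketched as a windowing function; the summarizer is stubbed out here (the real system invokes the LLM), and the number of verbatim recent turns kept is an illustrative assumption.

```python
# Sketch of the long-conversation compression trigger: beyond MAX_TURNS,
# older turns are collapsed into a single summary message. The summarize
# stub stands in for the LLM call; KEEP_RECENT is an assumed setting.
MAX_TURNS = 10
KEEP_RECENT = 4  # how many recent turns survive verbatim (illustrative)

def compress_history(turns: list,
                     summarize=lambda ts: {"role": "system",
                         "content": f"[summary of {len(ts)} earlier turns]"}):
    if len(turns) <= MAX_TURNS:
        return turns  # under the threshold: keep full history
    older, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    # Replace older turns with one summary message, preserving core semantics
    return [summarize(older)] + recent
```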
4.2 Hybrid Knowledge Base Collaborative Retrieval
This is the system's core competitive moat, fully integrating the technical outputs of Parts 2 and 3:
- Automatic routing: The Planner automatically routes to Neo4j structured retrieval or GraphRAG unstructured retrieval based on subtask type; complex tasks invoke both pipelines in parallel;
- Result fusion: The Summary module performs semantic-level fusion of results from both pipelines, resolving information conflicts;
- Fallback isolation: The two retrieval pipelines are fully isolated — a failure in one does not affect the other;
- Index synchronization: When structured business data is updated, the GraphRAG incremental index update API is automatically triggered to ensure data consistency.
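The parallel invocation with fallback isolation described above can be sketched with `asyncio.gather(..., return_exceptions=True)`, so a failure in one pipeline surfaces as a value rather than cancelling the other. The two pipeline functions below are stand-ins, with the GraphRAG one deliberately failing to show the isolation.

```python
import asyncio

# Stand-in retrieval pipelines; graphrag_search fails on purpose to
# demonstrate that the Neo4j result still comes back intact.
async def neo4j_search(query):
    return {"source": "neo4j", "rows": ["order #123: shipped"]}

async def graphrag_search(query):
    raise RuntimeError("GraphRAG service unavailable")

async def hybrid_retrieve(query):
    results = await asyncio.gather(
        neo4j_search(query),
        graphrag_search(query),
        return_exceptions=True,  # failures become values, not cancellations
    )
    # Keep successful results; a failed pipeline simply contributes nothing
    return [r for r in results if not isinstance(r, Exception)]
```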
4.3 End-to-End Observability
Designed for enterprise production operations requirements:
- Distributed tracing: Full-pipeline instrumentation based on OpenTelemetry, enabling end-to-end latency and status tracking from intent routing to final output;
- Core metrics monitoring: Intent classification accuracy, Agent execution success rate, tool call latency/failure rate, average response latency;
- Anomaly alerting: Automated alerts for scenarios such as execution failure rate exceeding threshold or response latency breaching SLA.
5. Production Pitfalls and Solutions
5.1 Agent Tool Call Infinite Loop
Symptom: When a tool call returns an unexpected result, the Agent repeatedly retries the same tool, entering an infinite loop and blocking the entire service for a single user request.
Root cause: Single-agent architecture has no global call counter — each retry is an independent decision, and the Agent has no awareness of how many retries have already occurred.
Solution:
```python
from typing import TypedDict

# Maintain a global tool call counter in State
class AgentState(TypedDict):
    messages: list
    tool_call_count: int  # Global call counter
    max_tool_calls: int   # Configurable threshold based on your SLA

# Add circuit breaker check before the tool execution node
def check_circuit_breaker(state: AgentState):
    if state["tool_call_count"] >= state["max_tool_calls"]:
        return "fallback"      # Route to fallback node
    return "execute_tool"      # Proceed with normal execution

# Increment counter after each tool call
def execute_tool(state: AgentState):
    result = call_tool(state)
    return {
        **state,
        "tool_call_count": state["tool_call_count"] + 1,
        "tool_result": result,
    }
```
By maintaining a global call counter in State and combining it with LangGraph's conditional routing, the infinite loop problem is resolved at the framework level.
5.2 Low Text2Cypher Generation Accuracy
Symptom: Dynamically generated Cypher statements contain syntax errors or logical deviations, causing Neo4j queries to fail or return incorrect results.
Root cause: The LLM has an imprecise understanding of Neo4j's property graph model and tends to hallucinate non-existent node types or relationship types.
Solution:
```python
async def generate_and_validate_cypher(
    query: str,
    schema: dict,
    max_retries: int = 3,
) -> str:
    for attempt in range(max_retries):
        # Inject full Schema to anchor the business model
        cypher = await llm.generate_cypher(query, schema)
        # Syntax validation
        if not validate_cypher_syntax(cypher):
            continue
        # Logic validation: check node/relationship types exist in Schema
        if not validate_against_schema(cypher, schema):
            continue
        return cypher
    # Exceeded retry threshold — fall back to Predefined Cypher matching
    return await match_predefined_cypher(query)
```
Additionally, Cypher templates are predefined for the 80% of high-frequency query scenarios, pushing accuracy for this segment to near 100%.
5.3 Disordered Result Merging in Parallel Multi-Agent Tasks
Symptom: Results returned by multiple tools executing in parallel have inconsistent formats, preventing the Summary module from effectively integrating them and causing logical incoherence in the final response.
Solution: Define a unified tool output schema. All tool return results are required to conform to the same structure:
| Field | Description |
|---|---|
| `task_id` | Corresponds to the subtask ID generated by the Planner |
| `task_type` | Tool type identifier |
| `status` | Execution status (success / failure / fallback) |
| `result_data` | Actual result data |
| `error_msg` | Error information on failure |
| `latency_ms` | Execution latency for performance monitoring |
The Summary node performs ordered aggregation based on task_id: sequential tasks are merged in dependency order; parallel tasks are merged by business logic category.
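Dependency-ordered merging can be sketched with `graphlib` from the standard library. This assumes each result carries the Planner's `dependencies` alongside its `task_id` (a field not in the unified output schema above, added here for illustration).

```python
from graphlib import TopologicalSorter

# Sketch: order sequential task results by their dependency graph before
# merging. Assumes each result carries task_id plus the Planner's
# dependencies list (an illustrative extension of the output schema).
def order_results(results: list) -> list:
    deps = {r["task_id"]: set(r.get("dependencies", [])) for r in results}
    order = list(TopologicalSorter(deps).static_order())
    by_id = {r["task_id"]: r for r in results}
    return [by_id[tid] for tid in order if tid in by_id]
```

Parallel tasks (no dependencies between them) come out in a stable arbitrary order and can then be grouped by business category before the final LLM fusion pass.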
6. Production Results
The following data is based on a manually annotated test set of 100 real complex e-commerce customer service queries (annotated by 3 customer service domain experts; inter-annotator agreement Cohen's Kappa = 0.87) and validated through 1,000-round concurrent load testing:
| Metric | Single-Agent | Multi-Agent | Improvement |
|---|---|---|---|
| Complex query resolution rate | 70% | 92% | ↑ 22 pp |
| Average conversation turns | 8 | 4.5 | ↓ 43.75% |
| Tool call failure rate | 15% | 4% | ↓ 73.3% |
| Session recovery success rate | 60% | 96% | ↑ 36 pp |
| Average response latency | 3.5s | 1.1s | ↓ 68.6% |
Core business impact:
- Human agent escalation rate reduced by 42%, significantly lowering operational costs;
- User satisfaction score improved to 4.8 / 5;
- System availability reached 99.9%, meeting 24/7 enterprise-grade service requirements.
7. Deployment Boundaries and Series Continuity
7.1 Deployment Boundaries
This multi-agent architecture is optimized for complex task handling in e-commerce scenarios. Domains such as healthcare and finance will need to adjust intent classification boundaries and safety policies to fit their own business requirements. Production-grade iteration should supplement additional safety guardrails and disaster recovery mechanisms.
7.2 Series Continuity
- GitHub repository: llm-customer-service (Tag: `v1.0.0-multi-agent`)
- Backward reference: Builds on Part 3 GraphRAG Service Wrapping, addressing the four core pain points of single-agent architecture.
- Next up: Part 5 will focus on the production-grade LLM application safety guardrail system, covering Prompt injection defense, privilege escalation interception, hallucination validation, and more. Stay tuned.
- Series finale: Part 8 will provide a complete retrospective of all architecture decisions, engineering pitfalls, and quantifiable outcomes from MVP to production-grade system, forming a full end-to-end engineering practice record.