James Lee

Posted on Mar 23 • Edited on Jun 14

Hybrid Knowledge Retrieval: Combining Neo4j Graph Queries, GraphRAG and Vector Search for Enterprise LLM Applications

#neo4j #graphrag #rag #llm

1. Introduction: The Blind Spots of Single-Retrieval Approaches and Defining Full-Stack Capability Closure

This is Part 6 of the series 8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Application. In the first five parts, we completed the MVP architecture, multimodal data pipeline, GraphRAG service wrapping, multi-agent workflow design, and end-to-end safety guardrail system. This article completes the final piece of the system's core capability puzzle — a hybrid knowledge retrieval system — achieving full-stack capability closure from user input to compliant output.

We define production-grade capability closure as follows: any legitimate user query can be fully processed within the system through an automated pipeline of "intent recognition → task decomposition → precise retrieval → safety validation → result output", with no manual intervention and no cross-system handoffs, while meeting production-grade requirements for compliance, stability, and low latency.

Enterprise LLM application queries are never "single-type." Some users ask "Where is my shipment for Order #123?" (structured query), others ask "How do I connect this smart bulb to WiFi?" (unstructured knowledge query), and others ask "What is the after-sales policy for the product in Order #123?" (complex hybrid query). Relying on a single retrieval approach creates obvious capability blind spots that make true closure impossible:

| Retrieval Approach | Strengths | Core Limitations |
| --- | --- | --- |
| Neo4j Text2Cypher | Precise structured data queries (orders/inventory/customers), fast response, high accuracy | Requires strict permission control, vulnerable to injection attacks, cannot cover unstructured knowledge queries |
| GraphRAG | Knowledge graph multi-hop reasoning, cross-chapter semantic queries in long documents (e.g., "full product line after-sales policy") | Low efficiency for pure FAQ and short-text fuzzy matching; heavily dependent on graph construction quality |
| Vector Search | Fuzzy semantic matching, unstructured short-text/FAQ queries (e.g., "What is the return process?") | Cannot handle structured relational queries, no multi-hop reasoning support, long-document context easily lost |

No single retrieval approach can satisfy all scenarios — it either fails on structured queries, struggles with unstructured knowledge, or introduces security risks. We must therefore build a hybrid knowledge base system that coordinates Neo4j structured queries + GraphRAG knowledge graph retrieval + vector semantic search, letting each retrieval capability do what it does best, achieving a "1+1+1>3" effect and ultimately delivering full-stack capability closure.

2. Hybrid Retrieval Architecture: Coordinating Three Retrieval Capabilities with End-to-End Governance

The core of our hybrid knowledge base is a full pipeline of "task decomposition → intelligent routing → parallel retrieval → safety validation → result fusion", letting each retrieval approach handle its specialty while providing a unified invocation interface to the upper-layer Agent system, with safety guardrails embedded throughout to ensure production-grade stability and compliance.

2.1 Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                     User Input + User Identity Info                  │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │
┌───────────────────────────────────▼──────────────────────────────────┐
│           Planner (Complex Query Decomposition + Intent Recognition) │
│  Example: "What is the after-sales policy for the product in         │
│            Order #123?"                                              │
│  → Decomposed into: ["Query product info for Order #123",            │
│                      "Query after-sales policy for that product"]    │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │ Subtask list
┌───────────────────────────────────▼──────────────────────────────────┐
│         Tool Selector (Intelligent Routing + Pre-Safety Validation)  │
│   Subtask 1 → Text2Cypher (structured order query)                   │
│   Subtask 2 → GraphRAG (unstructured after-sales policy query)       │
└──────────────────┬────────────────────────────────┬─────────────────┘
                   │                                │
┌──────────────────▼──────────┐   ┌─────────────────▼───────────────────┐
│  Text2Cypher                │   │  GraphRAG                           │
│  Orders / Inventory /       │   │  Knowledge graph multi-hop          │
│  Structured data queries    │   │  reasoning / long-doc unstructured  │
└──────────────────┬──────────┘   └─────────────────┬───────────────────┘
                   └────────────────────────────────┘
                                    │ Retrieval results from all paths
┌───────────────────────────────────▼──────────────────────────────────┐
│       Result Fusion → Factual Consistency Check → Final Answer       │
└──────────────────────────────────────────────────────────────────────┘

Diagram note: User input is first decomposed into independent subtasks by the Planner, then each subtask is routed to the appropriate retrieval tool by the Tool Selector: Text2Cypher for structured data queries and GraphRAG for unstructured knowledge queries. Safety validation is embedded throughout. Results are fused into a compliant final answer, achieving full-stack capability closure. Vector Search serves as a Tier 1 degradation fallback when GraphRAG is unavailable; see Section 2.3.1 for the full degradation strategy.

2.1.1 Complex Query Decomposition (Planner)

A task decomposition prompt framework customized for the target domain (e-commerce in our reference implementation) breaks multi-intent, mixed-type queries into independent, dependency-free subtasks, eliminating the blind spots that arise when a single retrieval approach cannot cover all cases:

PLANNER_SYSTEM_PROMPT = """
You are the task planning component of an enterprise LLM application system.
Your responsibility is to analyze user queries and decompose them into independent,
executable subtasks.

Core rules:
1. Simple single-intent queries do not need decomposition — return the original query directly.
2. Multi-intent mixed queries MUST be decomposed into independent subtasks with no
   dependencies or overlaps between them.
3. Key information such as user identity, order numbers, and product names MUST be
   preserved in each subtask.
4. Return ONLY the subtask list. Do NOT output any other content.
"""

Example: "What beverages does Northwind Trading carry, and what are their after-sales policies?" is decomposed into ["What beverage products does Northwind Trading carry?", "What are the after-sales policies for Northwind Trading's beverage products?"], routed separately to structured query and unstructured retrieval.
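Because the Planner is instructed to return only a subtask list, the calling code still has to parse that reply defensively. A minimal sketch, assuming the model is asked to emit a JSON array of strings (the prompt above does not fix an output format); `parse_planner_output` is a hypothetical helper, and falling back to the original query mirrors rule 1:

```python
from __future__ import annotations

import json
import re


def parse_planner_output(raw: str, original_query: str) -> list[str]:
    """Turn the Planner's raw reply into a clean, ordered subtask list."""
    # Take the span from the first "[" to the last "]", tolerating code fences
    # or stray prose the model may add around the JSON array.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match is None:
        return [original_query]
    try:
        tasks = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [original_query]
    subtasks: list[str] = []
    for task in tasks:
        # Keep non-empty strings only and drop exact duplicates, preserving order.
        if isinstance(task, str) and task.strip() and task.strip() not in subtasks:
            subtasks.append(task.strip())
    # Nothing usable: treat it as a simple single-intent query (rule 1).
    return subtasks or [original_query]
```

Any malformed reply degrades to "no decomposition" rather than failing the request.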

2.1.2 Intelligent Routing Rules (Tool Selector)

For each subtask we apply clear, domain-specific routing rules, ordered by business priority, to assign the right tool and complete pre-flight safety validation:

TOOL_SELECTION_SYSTEM_PROMPT = """
You are the tool selection component of an enterprise LLM application system.
Your responsibility is to select the most appropriate retrieval tool for each subtask.

Tool selection priority and rules:
1. [HIGHEST PRIORITY] Structured data queries (orders, products, inventory, customers,
   logistics, pricing, suppliers, etc.):
   - High-frequency fixed scenarios: use predefined_cypher (pre-built Cypher templates)
   - Complex dynamic queries: use cypher_query (dynamically generated Cypher)

2. Unstructured long-document / cross-chapter knowledge queries (after-sales policies,
   warranty terms, product manuals, troubleshooting guides, etc.):
   Use microsoft_graphrag_query (GraphRAG knowledge graph retrieval)

3. Short-text FAQ / fuzzy semantic matching / similar question lookup:
   Use microsoft_graphrag_query (GraphRAG knowledge graph retrieval)

Output ONLY the tool name. Do NOT output any other content:
predefined_cypher / cypher_query / microsoft_graphrag_query

"""

| Subtask Type | Routed Tool | Example Scenarios |
| --- | --- | --- |
| Structured data queries (products/orders/customers/inventory) | `predefined_cypher` / `cypher_query` | "Check shipping for Order #123" / "How much inventory is left for this product?" |
| Unstructured long-doc knowledge queries (after-sales/manuals/troubleshooting) | `microsoft_graphrag_query` | "What is the return policy?" / "How do I connect the smart bulb to WiFi?" |
| Short-text FAQ / fuzzy semantic matching | `microsoft_graphrag_query` | "Is there anything like a '7-day no-questions-asked' return policy?" |
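The Tool Selector is itself an LLM call, so its reply should be normalized and checked against the three allowed names before anything is dispatched. A minimal sketch; `normalize_tool_choice` is hypothetical, and defaulting to `microsoft_graphrag_query` on an unrecognized reply is our assumption (it is the only tool that never touches the business database), not a rule from the prompt:

```python
ALLOWED_TOOLS = ("predefined_cypher", "cypher_query", "microsoft_graphrag_query")
DEFAULT_TOOL = "microsoft_graphrag_query"  # assumption: safest default, no DB access


def normalize_tool_choice(raw: str) -> str:
    """Map the Tool Selector's raw reply onto exactly one allowed tool name."""
    # Strip whitespace, quotes, backticks and trailing periods the model may add.
    candidate = raw.strip().strip("`'\".").lower()
    if candidate in ALLOWED_TOOLS:
        return candidate
    # Tolerate replies such as "Tool: cypher_query" if exactly one name appears.
    matches = [tool for tool in ALLOWED_TOOLS if tool in candidate]
    if len(matches) == 1:
        return matches[0]
    # Zero or several names (e.g. the model echoed the whole list): use the default.
    return DEFAULT_TOOL
```

A reply that echoes the whole tool list is treated as unrecognized rather than guessed at.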

2.2 Production-Grade Implementation and Security Governance for All Three Retrieval Capabilities

2.2.1 Text2Cypher Structured Queries: Security as the Top Priority

Structured queries directly touch core enterprise business data — security design is the foundational prerequisite for production deployment. We implement three layers of compliance and security, fully inheriting the safety guardrail system from Part 5:

Strong identity binding: All queries must carry the current logged-in user's user_id; only that user's own orders and personal information may be queried. Cross-user order lookups are blocked at the syntax level;
Predefined templates first: 80% of high-frequency queries are encapsulated as predefined_cypher templates — no dynamic Cypher generation required, just parameter substitution and execution, eliminating injection risk at the root;
Triple validation for dynamic generation: For the minority of complex dynamic queries, a three-stage validation pipeline is enforced — syntax validation → operation permission check → input sanitization. Only MATCH/RETURN read operations are permitted; all write operations are blocked; sensitive characters are filtered to prevent injection.
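To make the predefined-template and dynamic-validation layers concrete, here is a minimal sketch. The graph schema in the template (`User`, `PLACED`, `Order` and their properties), the clause blocklist and the set of sensitive characters are all illustrative assumptions, and the checks are string-level heuristics rather than a real Cypher parser. Both paths pass user input to Neo4j only as bound parameters (`$user_id`, `$order_id`), which keeps that input out of the query text itself:

```python
from __future__ import annotations

import re

# Hypothetical schema: labels, relationship type and properties are assumptions.
PREDEFINED_CYPHER = {
    "order_status": (
        "MATCH (u:User {user_id: $user_id})-[:PLACED]->(o:Order {order_id: $order_id}) "
        "RETURN o.order_id AS order_id, o.status AS status"
    ),
}

# Clauses that write data or call procedures; any match rejects the query.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|CALL|LOAD\s+CSV)\b",
    re.IGNORECASE,
)
# Characters refused in user-supplied parameter values (assumed blocklist).
SENSITIVE_CHARS = re.compile(r"[`'\";{}\\]")


def build_predefined_query(name: str, user_id: str, **params: str) -> tuple[str, dict]:
    """Fill a predefined template; user input only travels as bound parameters."""
    cypher = PREDEFINED_CYPHER[name]  # unknown template names raise KeyError
    # The session user_id is always bound last, so it cannot be overridden.
    return cypher, {**params, "user_id": user_id}


def validate_dynamic_cypher(cypher: str, params: dict) -> None:
    """Three-stage check for generated Cypher; raises ValueError on any violation."""
    statement = cypher.strip().rstrip(";")
    # Stage 1: syntax/shape - a single MATCH ... RETURN statement.
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"MATCH\b", statement, re.IGNORECASE) or not re.search(
        r"\bRETURN\b", statement, re.IGNORECASE
    ):
        raise ValueError("only MATCH ... RETURN queries are allowed")
    # Stage 2: operation permissions - no writes/procedures, must be user-scoped.
    if WRITE_CLAUSES.search(statement):
        raise ValueError("write or procedure clauses are not allowed")
    if "$user_id" not in statement:
        raise ValueError("query must be scoped to $user_id")
    # Stage 3: input sanitization - reject sensitive characters in parameter values.
    for key, value in params.items():
        if isinstance(value, str) and SENSITIVE_CHARS.search(value):
            raise ValueError(f"sensitive character in parameter {key!r}")
```

The validated `(cypher, params)` pair would then be handed to the Neo4j driver as a parameterized query; that call is omitted here.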

2.2.2 GraphRAG Unstructured Knowledge Retrieval: End-to-End Data Consistency

Building on the GraphRAG service capabilities from Parts 2 and 3, this layer handles long-document and cross-chapter unstructured knowledge queries, with an added incremental index synchronization mechanism:

When unstructured data such as product manuals or after-sales policies is updated, an incremental index build is automatically triggered — no full rebuild required;
Indexes for different data sources are isolated by directory, preventing interference and ensuring consistency and stability during data updates.
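The article does not show how an update is detected. One common approach, sketched here under our own assumptions, keeps a per-source manifest of content hashes in that source's own index directory and hands only the changed files to the incremental build; the directory layout and all function names are hypothetical, and the actual GraphRAG indexing call is deliberately left out:

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def index_dir_for(index_root: Path, source: str) -> Path:
    """One isolated index directory per data source (e.g. indexes/after_sales/)."""
    path = index_root / source
    path.mkdir(parents=True, exist_ok=True)
    return path


def detect_changes(source_dir: Path, manifest_path: Path) -> dict:
    """Compare current file hashes with the last committed manifest."""
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {
        p.relative_to(source_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    removed = [name for name in previous if name not in current]
    return {"changed": changed, "removed": removed, "manifest": current}


def commit_manifest(manifest_path: Path, manifest: dict) -> None:
    """Persist the manifest via write-then-rename so readers never see a partial file."""
    tmp_path = manifest_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    tmp_path.replace(manifest_path)
```

The manifest is committed only after the incremental build succeeds, so a failed build is simply retried on the next run.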

2.2.3 Vector Search: Supplementary Fallback for Short-Text FAQ Scenarios

As the supplementary fallback capability of the hybrid retrieval system, vector search is optimized for high-frequency short-text FAQ scenarios.

Indexing Pipeline:

Document Input (txt / pdf / docx)
    ↓ rag_service.py — file parsing and chunking
    ↓ embedding_service.py
    SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2') → 384-dim embeddings
    → FAISS IndexFlatL2
    → Persisted to disk (.bin index + .json metadata)

The multilingual model was chosen explicitly to support Chinese semantic retrieval, rather than a generic English-only model. FAISS IndexFlatL2 performs exact (exhaustive) nearest-neighbor search, which is well suited to an FAQ-scale dataset (the top 200 entries).
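"Exact" here means exhaustive: `IndexFlatL2` compares the query with every stored vector and ranks results by squared L2 distance. The dependency-free sketch below reproduces that computation and the .bin/.json persistence split in plain Python, so it runs without FAISS or a model download. It is an illustrative stand-in, not the article's `embedding_service.py`; in the real pipeline the 384-dim vectors come from the SentenceTransformer model and the search is done by FAISS. Class, function and file names are hypothetical:

```python
from __future__ import annotations

import json
from array import array
from pathlib import Path


class FlatL2Index:
    """Pure-Python stand-in for faiss.IndexFlatL2: exhaustive squared-L2 search."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.vectors: list[list[float]] = []

    def add(self, vectors: list[list[float]]) -> None:
        for vec in vectors:
            if len(vec) != self.dim:
                raise ValueError(f"expected {self.dim} dims, got {len(vec)}")
            self.vectors.append(list(vec))

    def search(self, query: list[float], k: int) -> list[tuple[float, int]]:
        """Return up to k (squared_distance, row_id) pairs, closest first."""
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vec)), row_id)
            for row_id, vec in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]


def save_index(index: FlatL2Index, metadata: list[dict], stem: Path) -> None:
    """Persist vectors as float32 binary (.bin) and chunk metadata as JSON (.json)."""
    flat = array("f", (x for vec in index.vectors for x in vec))
    stem.with_suffix(".bin").write_bytes(flat.tobytes())
    payload = {"dim": index.dim, "chunks": metadata}
    stem.with_suffix(".json").write_text(
        json.dumps(payload, ensure_ascii=False), encoding="utf-8"
    )


def load_index(stem: Path) -> tuple[FlatL2Index, list[dict]]:
    """Rebuild the index and its chunk metadata from the .bin/.json pair."""
    meta = json.loads(stem.with_suffix(".json").read_text(encoding="utf-8"))
    flat = array("f")
    flat.frombytes(stem.with_suffix(".bin").read_bytes())
    dim = meta["dim"]
    index = FlatL2Index(dim)
    index.add([list(flat[i : i + dim]) for i in range(0, len(flat), dim)])
    return index, meta["chunks"]
```

The cost of each search grows linearly with the number of stored vectors, which is why an exact flat index stays practical at FAQ scale.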

Retrieval and Output Pipeline:

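# rag_chat_service.py: dual-stream output design (function wrapper is illustrative)
import json


async def stream_rag_answer(retrieval_results: list, response):
    # Stream 1: yield retrieval sources first (frontend renders "X references found")
    yield json.dumps(retrieval_results, ensure_ascii=False) + "\n\n"

    # Stream 2: stream LLM-generated answer token by token
    async for chunk in response:
        # Assumes OpenAI-style streaming chunks; adapt to the LLM client actually used
        content = chunk.choices[0].delta.content
        if content:
            yield f"data: {content}\n\n"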

The dual-stream output pattern allows the frontend to display reference sources immediately while the LLM answer is still being generated — a standard production-grade RAG interaction model that avoids the "blank wait" anti-pattern.

Integration with Hybrid Retrieval:

Retrieval results are deduplicated via reduce_docs before being passed to the LLM context, preventing redundant information from inflating token consumption;
When the GraphRAG service times out or is unavailable, vector search automatically serves as the Tier 1 degradation fallback for unstructured knowledge queries (see Section 2.3.1).
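`reduce_docs` is named but not shown. A minimal sketch of such a merge-and-deduplicate step, assuming each retrieved document is a dict with a `content` key; keying on normalized content is our choice (the real helper might deduplicate by document ID instead), and the first occurrence wins so earlier retrieval paths keep their provenance:

```python
from __future__ import annotations

import hashlib


def reduce_docs(*doc_lists: list[dict]) -> list[dict]:
    """Merge retrieval results from several paths, dropping duplicate content."""
    seen: set[str] = set()
    merged: list[dict] = []
    for docs in doc_lists:
        for doc in docs:
            # Collapse whitespace and case so trivially re-chunked copies match.
            normalized = " ".join(doc["content"].split()).lower()
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged
```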

2.3 Production-Grade Core Capabilities

2.3.1 Hybrid Retrieval Fallback and Degradation Strategy

To ensure 24/7 availability in production, we designed a three-tier degradation strategy so that any single retrieval path failure does not affect the overall service:

Tier 1 degradation (single tool failure): When one retrieval tool times out or becomes unavailable, traffic is automatically rerouted to a backup tool. Example: GraphRAG service timeout → unstructured queries automatically fall back to vector search;
Tier 2 degradation (multiple tool failures): When multiple retrieval paths fail, the system automatically switches to a predefined FAQ fallback library to maintain basic consultation capability;
Tier 3 degradation (full pipeline failure): When core services are unavailable, a standardized fallback response is returned immediately, directing users to contact a human agent — preventing service collapse.

All degradation events are recorded in the audit log and trigger alerting notifications, enabling operations teams to quickly locate and resolve issues.
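The three tiers map naturally onto an ordered chain of handlers. Below is a minimal sketch of how `execute_tool_with_fallback` (called in Section 2.3.4) could implement them; the registries, the 3-second timeout, the keyword-based FAQ lookup and the use of `logging` as a stand-in for the audit log and alerting are all assumptions, and the permission checks that `user_id` feeds into are omitted:

```python
from __future__ import annotations

import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

logger = logging.getLogger("hybrid_kb.degradation")

Tool = Callable[[str, str], str]  # (task, user_id) -> retrieval result text

# Hypothetical registries, populated at service start-up.
TOOLS: dict[str, Tool] = {}
BACKUP_TOOL = {"microsoft_graphrag_query": "vector_search"}  # Tier 1 reroute
FAQ_FALLBACK: dict[str, str] = {}  # Tier 2: keyword -> predefined FAQ answer
HUMAN_HANDOFF = "Sorry, I can't answer that right now. Please contact a human agent."
TIMEOUT_S = 3.0  # assumed per-tool timeout

_pool = ThreadPoolExecutor(max_workers=8)


def _call_with_timeout(name: str, task: str, user_id: str) -> str:
    # result(timeout=...) stops waiting but does not cancel the running call.
    future = _pool.submit(TOOLS[name], task, user_id)
    return future.result(timeout=TIMEOUT_S)


def execute_tool_with_fallback(tool: str, task: str, user_id: str) -> str:
    """Routed tool -> backup tool -> FAQ library -> standardized human handoff."""
    chain = [tool] + ([BACKUP_TOOL[tool]] if tool in BACKUP_TOOL else [])
    for name in chain:
        try:
            return _call_with_timeout(name, task, user_id)
        except Exception as exc:  # timeout, connection error, unknown tool, ...
            # Stand-in for the audit log entry + alert described above.
            logger.warning("degradation: tool=%s failed (%s)", name, type(exc).__name__)
    # Tier 2: all retrieval paths for this task failed -> predefined FAQ library.
    for keyword, answer in FAQ_FALLBACK.items():
        if keyword in task.lower():
            logger.warning("degradation: FAQ fallback for task=%r", task)
            return answer
    # Tier 3: nothing left -> standardized response pointing to a human agent.
    logger.error("degradation: human handoff for task=%r", task)
    return HUMAN_HANDOFF
```

Because `future.result(timeout=...)` only stops waiting, a timed-out call keeps running in its worker thread; real tool clients should also enforce their own timeouts.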

2.3.2 Multi-Source Index Synchronization

Data consistency is the foundational prerequisite of the hybrid retrieval system. We designed two synchronization mechanisms to ensure real-time consistency across structured and unstructured data:

Structured data sync: Business data such as orders, products, and inventory is monitored via Binlog. Data changes are automatically synced to the Neo4j graph database with latency < 1s, ensuring query results are fully consistent with the business system;
Unstructured data sync: Documents such as product manuals and after-sales policies are managed by version number. When a document is added or modified, a MinerU parsing → incremental index build pipeline is automatically triggered. The live index is hot-swapped upon completion with zero service interruption.
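The hot-swap step usually comes down to building the new index version off to the side and then swapping a single reference under a lock. A minimal thread-safe sketch, assuming request handlers fetch the active index once per query; `IndexRegistry` and the `build` callable are hypothetical, and MinerU parsing and the index build itself are outside its scope:

```python
from __future__ import annotations

import threading
from typing import Any, Callable


class IndexRegistry:
    """Holds the live index; new versions are built first, then swapped in atomically."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version: int | None = None
        self._index: Any = None

    def current(self) -> tuple[int | None, Any]:
        # Readers get a consistent (version, index) pair; in-flight queries keep
        # using the object they already hold even if a swap happens meanwhile.
        with self._lock:
            return self._version, self._index

    def publish(self, version: int, build: Callable[[], Any]) -> bool:
        """Build the index for `version` outside the lock, then swap it in."""
        with self._lock:
            if self._version is not None and version <= self._version:
                return False  # stale or duplicate document version: ignore
        new_index = build()  # slow parsing + indexing happens here, unlocked
        with self._lock:
            if self._version is not None and version <= self._version:
                return False  # a newer version was published while we were building
            self._version, self._index = version, new_index
        return True
```

Readers are never blocked by a build; they only contend for the lock during the brief reference swap.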

2.3.3 Result Fusion Prompt Design and Engineering Logic

The quality of multi-path retrieval result fusion directly determines the quality of the final answer. Our core design principle is "aggregate by business logic category + factual consistency validation + target domain communication standards". Core prompt framework:

RESULT_FUSION_SYSTEM_PROMPT = """
You are the result fusion component of an enterprise LLM application system.
Your responsibility is to integrate results returned by multiple retrieval tools into a
single, logically coherent response that conforms to the target domain's communication standards.

Core rules:
1. Generate answers STRICTLY based on retrieval results. Do NOT fabricate any information
   not present in the retrieved content.
2. Present structured query results first, then unstructured query results, in clear
   logical order.
3. If different retrieval results contain conflicting information, the structured business
   database result takes precedence.
4. Language must be friendly, concise, and conform to the target domain's communication standards.
5. Do NOT expose any information about retrieval tools or technical implementation details.
"""

At the engineering level, a secondary factual consistency check is also applied: the generated final answer is re-matched against the original retrieval results to catch hallucinations and false commitments before the answer is returned, reusing the hallucination validation capability from Part 5.
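Rules 2 and 3 of the fusion prompt, and the secondary consistency check, can both be supported in code. A minimal sketch under our own assumptions: `build_fusion_context` orders the `task_results` built in `query_hybrid_kb` (Section 2.3.4) so structured results come first and are marked as taking precedence, and `unsupported_facts` is a deliberately crude lexical stand-in for the consistency check that flags numbers or IDs in the answer appearing in no retrieval result. Neither is the Part 5 validator; a semantic check would still be needed in practice:

```python
from __future__ import annotations

import re

STRUCTURED_TOOLS = {"predefined_cypher", "cypher_query"}
# Numbers such as prices, day counts, quantities and order IDs ("#123" -> "123").
FACT_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def build_fusion_context(task_results: list[dict]) -> str:
    """Render retrieval results for the fusion prompt: structured results first."""
    structured = [r for r in task_results if r["tool"] in STRUCTURED_TOOLS]
    unstructured = [r for r in task_results if r["tool"] not in STRUCTURED_TOOLS]
    # Section labels describe data categories only; no tool names reach the model.
    lines = ["[Business database results - take precedence on conflicts]"]
    lines += [f"- {r['task']}: {r['result']}" for r in structured] or ["- (none)"]
    lines.append("[Knowledge base results]")
    lines += [f"- {r['task']}: {r['result']}" for r in unstructured] or ["- (none)"]
    return "\n".join(lines)


def unsupported_facts(answer: str, retrieval_texts: list[str]) -> list[str]:
    """Numeric/ID tokens in the answer that no retrieval result contains."""
    evidence = set(FACT_TOKEN.findall(" ".join(retrieval_texts)))
    missing: list[str] = []
    for token in FACT_TOKEN.findall(answer):
        if token not in evidence and token not in missing:
            missing.append(token)
    return missing
```

A non-empty result from `unsupported_facts` is a signal to regenerate or escalate; an empty one is necessary but not sufficient evidence that the answer is grounded.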

2.3.4 Unified Interface: Enabling "Transparent Invocation" for Upper-Layer Agents

We provide the upper-layer multi-agent system with a unified knowledge base interface that abstracts away the underlying retrieval methods, security validations, and degradation strategies, significantly reducing system complexity:

from concurrent.futures import ThreadPoolExecutor


def query_hybrid_kb(user_query: str, user_id: str) -> dict:
    """Unified hybrid knowledge base query interface

    Args:
        user_query: Raw user query text
        user_id: Current logged-in user ID, used for permission validation

    Returns:
        Fused final answer and retrieval provenance information
    """
    # planner_decompose, tool_selector, execute_tool_with_fallback,
    # fuse_and_validate_results and record_audit_log are service helpers defined elsewhere.

    # 1. Complex query decomposition
    sub_tasks = planner_decompose(user_query)

    def run_subtask(task: str) -> dict:
        # Tool selection with pre-flight safety validation
        selected_tool = tool_selector(task)
        # Execute retrieval with built-in timeout control and fallback logic
        result = execute_tool_with_fallback(selected_tool, task, user_id)
        return {"task": task, "tool": selected_tool, "result": result}

    # 2. Subtask routing and parallel retrieval (subtasks are independent by design;
    #    pool.map keeps results in subtask order)
    with ThreadPoolExecutor(max_workers=max(1, len(sub_tasks))) as pool:
        task_results = list(pool.map(run_subtask, sub_tasks))

    # 3. Result fusion and factual validation
    final_answer = fuse_and_validate_results(task_results)

    # 4. End-to-end audit log
    record_audit_log(user_id, user_query, task_results)

    return {
        "answer": final_answer,
        "source": task_results
    }

Core value: Upper-layer Agents only need to call this single interface — no need to know which retrieval method was used or what security validations were applied. True capability encapsulation and reuse.

3. End-to-End Validation: Hybrid Knowledge Base vs. Single-Retrieval Approaches

We sampled 1,000 user queries from our e-commerce reference implementation logs (400 structured, 400 unstructured, 200 complex hybrid). Three domain experts manually annotated each answer against the criterion "semantically consistent with the reference answer, no false information, conforms to business rules." Core metrics comparison across the four approaches:

| Metric | Text2Cypher Only | GraphRAG Only | Vector Search Only | Hybrid KB (This Article) |
| --- | --- | --- | --- | --- |
| Answer accuracy | 85% | 78% | 82% | 94% |
| Full-scenario coverage | 55% | 65% | 60% | 98% |
| Avg. response time (s) | 0.8 | 1.3 | 1.1 | 1.2 (see note) |
| Security violation rate | 2% | 0% | 0% | 0% |
| Complex hybrid query resolution | 30% | 65% | 40% | 92% |

Note on response time: The hybrid knowledge base's 1.2s average is 0.1s slower than vector search alone — a deliberate trade-off to achieve 98% scenario coverage and 94% accuracy. It comfortably meets the < 2s real-time response SLA in our reference implementation.

Key Conclusions

Accuracy and coverage leap: The hybrid knowledge base covers 98% of query scenarios in our reference implementation with an overall answer accuracy of 94%. Complex hybrid query resolution jumped from a maximum of 65% (GraphRAG only) to 92%, fundamentally solving the core pain points of "can't answer structured questions" and "weak reasoning on unstructured knowledge";
Fully controlled performance: Average response time of 1.2s comfortably meets the < 2s real-time response SLA in our reference implementation;
Security and compliance baseline: Through end-to-end permission validation + predefined templates + injection protection, the security violation rate dropped to 0, fully satisfying enterprise-grade data security and compliance requirements.

4. Differentiation Analysis: Our Production-Grade Advantages

Compared to general-purpose RAG solutions such as OpenAI's retrieval tooling and open-source frameworks like LlamaIndex, our hybrid knowledge base offers three core advantages in enterprise LLM deployment scenarios:

| Dimension | General Open-Source RAG | This Hybrid KB Solution |
| --- | --- | --- |
| Security design | Basic permission control only; business-layer adaptation required; no industry-specific security templates | End-to-end permission validation + injection protection + fallback; enterprise compliance out of the box for the reference domain |
| Complex query handling | Single-intent queries only; no native complex task decomposition | Planner + Tool Selector deeply customized; native support for multi-intent decomposition and parallel retrieval |
| Full-stack closure | Retrieval module only; Agent integration, security system, and business system connections must be built separately | Complete production-grade closure from data pipeline → GraphRAG service → multi-agent → safety guardrails → hybrid KB, seamlessly integrated with business systems |
| Scenario fit | General-purpose; no industry customization | Deeply adapted to the reference domain (e-commerce); 80% of high-frequency business queries pre-templated; directly extensible to other domains |

Core value: Our solution is not a "toy-grade retrieval module stack" — it is a complete enterprise-grade solution directly deployable to production, genuinely solving the core requirements of "deployable, secure, and full-scenario coverage."

5. Production Outcomes and Extensibility Roadmap

5.1 Production Deployment Results

After full-stack integration, our reference implementation v1.0 achieved:

Full-scenario coverage: From structured data queries to unstructured knowledge consultation, 98% of user questions in our reference implementation are answered automatically;
High stability: Supports 1,000 QPS concurrent load, 24/7 stable operation, zero downtime or data leakage incidents;
Low human intervention: Human agent escalation rate reduced from 40% to 10%, significantly lowering enterprise operational costs;
Compliance met: Satisfies major data protection regulations (e.g., GDPR, PIPL); zero sensitive information leakage incidents.

5.2 Future Extensibility Directions

Multimodal retrieval expansion: Add image/video retrieval capabilities to support scenarios such as "send a photo to diagnose a fault" or "scan a barcode to look up a product," further lowering user interaction friction;
Self-optimizing intelligent routing: Introduce reinforcement learning to let the system automatically learn "which query type fits which retrieval method" based on user feedback and business outcomes, continuously improving routing accuracy;
Streaming response optimization: Integrate LLM streaming output with KV Cache optimization to compress user-perceived time-to-first-token (TTFT) from 1.2s to under 500ms, further improving conversational experience;
A/B testing framework: Establish an A/B testing mechanism for different retrieval strategies and fusion prompts, using real business data to drive continuous iterative optimization of the hybrid knowledge base.

6. Deployment Boundaries and Series Continuity

6.1 Deployment Boundaries

This hybrid knowledge base system has been validated against an e-commerce reference implementation, but the three-retrieval coordination architecture (Text2Cypher + GraphRAG + Vector Search), the fallback degradation strategy, and the unified interface design are directly transferable. Healthcare and finance deployments will need to adjust permission control rules, data synchronization mechanisms, and domain-specific routing logic to align with their respective compliance requirements; the core architecture remains unchanged. Full production deployment requires customized interface integration and data adaptation with the target business system.

6.2 Series Continuity

GitHub repository: llm-customer-service (tag: v1.2.0-hybrid-retrieval)
Backward reference: Builds on all five preceding parts — MVP architecture, data pipeline, GraphRAG service wrapping, multi-agent architecture, and safety guardrail system — completing the system's core capability closure.
Next up: Part 7 will focus on production-grade optimization, providing a complete breakdown of LLM inference cost and performance control strategies, upgrading the system from "functional" to "efficient and cost-effective." Stay tuned.
Series finale: Part 8 will provide a complete retrospective of all architecture decisions, engineering pitfalls, and quantifiable outcomes from MVP to production-grade system, forming a full end-to-end engineering practice record.

DEV Community