1. Introduction: The Hybrid Data Challenge in Intelligent Customer Service
In enterprise-level intelligent customer service scenarios, the system must simultaneously handle two categories of core data: structured data (e.g., e-commerce orders, customer profiles, product inventory stored in relational databases) and unstructured data (e.g., PDF product manuals, service agreements, and after-sales guides). Traditional RAG solutions are typically designed for plain text only, and face three critical limitations when dealing with hybrid data:
- Difficulty integrating structured data: Order and customer data lives in relational databases. Traditional vector retrieval cannot efficiently leverage entity relationships, resulting in very low accuracy for complex queries such as "Find the logistics information for Customer A's Order B."
- Difficulty parsing unstructured data: PDF documents contain multimodal content — text, tables, images, and formulas. Conventional parsing tools (e.g., PyMuPDF) frequently lose table structure and image context, causing semantic fragmentation that severely degrades downstream retrieval quality.
- Difficulty coordinating hybrid retrieval: The retrieval logic for structured and unstructured data is completely siloed, with no unified query entry point. Customer service agents must switch between multiple systems, reducing efficiency and increasing error rates.
This is Article 2 of the From 0 to 1 in 8 Weeks series. It directly addresses the core bottleneck exposed by the MVP version — insufficient support for multi-source data and long documents — by fully implementing a hybrid knowledge base data pipeline, completing the core iteration from v0.1 MVP to v0.5 Knowledge Graph Edition.
The central goal of this article is to build a production-grade hybrid knowledge base data pipeline: using Neo4j to store structured knowledge graphs, MinerU + LitServe + GraphRAG to process unstructured multimodal data, and ultimately achieving unified retrieval and coordination across both data types.
2. Technology Selection and Overall Architecture
2.1 Core Technology Stack
Each technology in the following stack was selected after hands-on validation in production-grade scenarios:
- Neo4j Graph Database
  - Strengths: Natively suited for storing and querying relational data; node-edge structures intuitively express entity relationships (e.g., "Customer → placed → Order", "Product → belongs to → Category").
  - Fit: Cypher query language supports complex path queries and community detection, perfectly matching structured data retrieval needs in customer service.
  - Scalability: Supports distributed deployment to handle large-scale knowledge graph storage and query pressure.
- MinerU + LitServe Multimodal PDF Parsing Service
  - Strengths: MinerU is an open-source tool that supports high-accuracy parsing of text, tables, images, and formulas, outputting structured Markdown and metadata files. Wrapped via LitServe as a RESTful API, it enables multi-GPU parallel parsing to solve the engineering challenge of slow PDF processing.
  - Fit: Optimized for table recognition and image context extraction in e-commerce customer service scenarios; well-suited for parsing product manual PDFs.
  - Engineering capability: Supports async task scheduling and multi-instance load balancing, meeting high-availability requirements in production.
- Microsoft GraphRAG
  - Strengths: Combines knowledge graphs with semantic indexing to achieve deep semantic understanding at the entity-relationship-community level, resolving the semantic loss problem of traditional vector retrieval in long-document and cross-chapter association scenarios.
  - Scalability: Supports custom chunking strategies and entity extraction rules for customer service-specific optimization.
  - Production capability: Provides index construction, incremental updates, and dual-mode retrieval (Local/Global Search) to meet enterprise-grade service requirements.
The engineering implementation and optimization of GraphRAG's four retrieval modes will be covered in detail in Article 3 of this series: GraphRAG Service Wrapping.
2.2 Overall Architecture Design
The hybrid data pipeline follows a layered decoupling + service-oriented encapsulation design philosophy:
- Structured data pipeline: Raw CSV → Data cleaning → Neo4j knowledge graph construction → Cypher query service
- Unstructured data pipeline: Raw PDF → MinerU + LitServe multimodal parsing → Data cleaning & semantic enrichment → GraphRAG entity/relationship extraction → Index construction → Semantic retrieval service
- Upper integration layer: An Agent-based routing layer automatically selects Neo4j structured retrieval or GraphRAG unstructured retrieval based on query type, returning a unified result.
This architecture cleanly separates the three core stages — data processing, index construction, and retrieval service — ensuring module independence while enabling coordinated use of hybrid data.
3. Data Processing Pipeline: From CSV and PDF to a Hybrid Knowledge Base
3.1 Structured Data Processing: Neo4j Knowledge Graph Construction
3.1.1 Knowledge Graph Modeling
For the e-commerce customer service scenario, we defined the following core node and edge types to cover high-frequency query needs:
Core node types:
- `Product`: Product info (ID, name, price, category, etc.)
- `Category`: Product category (ID, name, parent category)
- `Supplier`: Supplier (ID, name, contact info)
- `Customer`: Customer (ID, name, phone number)
- `Order`: Order (ID, timestamp, amount, status)
- `Shipper`: Logistics company (ID, name, contact info)
Core edge types:
- `BELONGS_TO`: Product → Category
- `SUPPLIED_BY`: Product → Supplier
- `PLACED_BY`: Order → Customer
- `CONTAINS`: Order → Product
- `SHIPPED_VIA`: Order → Shipper
This model closely mirrors e-commerce business logic and efficiently supports complex relational queries such as "Retrieve all orders for Customer A" or "Find the supplier for Product X."
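The two example queries above map directly onto this model as parameterized Cypher templates. The sketch below is illustrative: property names such as `contact` are assumptions, and parameterization (`$customer_id`) is used to avoid string-built queries.

```python
# Parameterized Cypher templates against the node/edge model above.
# Property names (e.g., s.contact) are illustrative assumptions.

# "Retrieve all orders for Customer A" — PLACED_BY points Order → Customer
ORDERS_FOR_CUSTOMER = """
MATCH (o:Order)-[:PLACED_BY]->(c:Customer {id: $customer_id})
RETURN o.id AS order_id, o.status AS status, o.amount AS amount
"""

# "Find the supplier for Product X" — SUPPLIED_BY points Product → Supplier
SUPPLIER_FOR_PRODUCT = """
MATCH (p:Product {id: $product_id})-[:SUPPLIED_BY]->(s:Supplier)
RETURN s.name AS supplier, s.contact AS contact
"""
```

Both templates would be executed via `session.run(query, customer_id=...)` with the `neo4j` Python driver.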
3.1.2 Data Import and Engineering Implementation
We implemented automated CSV-to-Neo4j import via Python scripts. The core workflow:
1. Data cleaning: Read `*_nodes.csv` and `*_edges.csv`, remove null values and malformed records, and normalize field types.
2. Batch import: Use the `neo4j` Python driver with `UNWIND` syntax for bulk writes, avoiding the performance bottleneck of row-by-row insertion.
3. Index creation: Create unique constraint indexes on core node IDs (e.g., `Product.id`, `Customer.id`) to improve query performance.
4. Data reset: Execute `MATCH (n) DETACH DELETE n` before import to ensure data consistency. In production, versioned incremental imports are strongly recommended over full truncation to prevent data loss.
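The cleaning and batch-import steps can be sketched as follows. This is a minimal illustration, not the project's actual script: the `Product` MERGE template and field names are assumptions, and `session` is a session from the `neo4j` Python driver.

```python
from itertools import islice

def chunked(records, size):
    """Yield batches of up to `size` records for a single UNWIND write."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

def clean_record(row):
    """Drop empty fields and normalize the id; return None for malformed rows."""
    if not str(row.get("id", "")).strip().isdigit():
        return None
    cleaned = {k: v.strip() for k, v in row.items() if v and v.strip()}
    cleaned["id"] = int(cleaned["id"])
    return cleaned

# One server round-trip per batch instead of one per row.
MERGE_PRODUCTS = """
UNWIND $rows AS row
MERGE (p:Product {id: row.id})
SET p.name = row.name, p.price = toFloat(row.price)
"""

def import_products(session, rows, batch_size=10_000):
    """`session` is a neo4j-driver session (driver.session())."""
    for batch in chunked(filter(None, map(clean_record, rows)), batch_size):
        session.run(MERGE_PRODUCTS, rows=batch)
```

`MERGE` keeps re-runs idempotent, which pairs well with the versioned incremental imports recommended above.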
3.1.3 Quantified Results
- Final knowledge graph: 13,204 nodes and 28,762 edges
- Import throughput: 100,000 records per batch in under 5 minutes, with 100% success rate
- Query performance: Simple queries (e.g., "Retrieve all orders for Customer ID=123") respond in under 100ms; complex path queries respond in under 500ms — fully meeting real-time customer service requirements
3.2 Unstructured Data Processing: MinerU + LitServe + GraphRAG Multimodal Pipeline
Figure: Full unstructured data processing pipeline — MinerU + LitServe handles PDF multimodal parsing, the Dynamic-Aware Chunking algorithm performs structure-aware segmentation, and the output feeds directly into GraphRAG index construction.
This project builds a complete production-grade pipeline for PDF multimodal data:
Raw PDF → MinerU + LitServe multimodal parsing → Data post-processing → Dynamic-Aware Chunking → GraphRAG graph data generation
Three core capabilities are implemented:
- MinerU + LitServe service-oriented parsing: Converts PDFs into structured Markdown and metadata files.
- Dynamic-Aware Chunking algorithm: Structure-aware segmentation by heading hierarchy + sliding window, preserving contextual integrity.
- Multimodal semantic enrichment: Enriches chunk semantics by incorporating table and image metadata.
The complete pipeline consists of 7 core steps, covering the full chain from raw files to graph data:
1. Load raw PDF data: Load product manuals, service agreements, and other PDF documents from local storage or a file server.
2. Send to MinerU service for parsing: Via the LitServe-wrapped RESTful API, send the PDF to the MinerU service. Multi-GPU parallel parsing efficiently extracts text, tables, images, and formulas, outputting structured Markdown and companion metadata files (`.md`, `model.json`, `content_list.json`, etc.).
3. Transfer parsed data to GraphRAG: Return the structured files produced by MinerU to the GraphRAG project directory.
4. Load parsed files: Within the GraphRAG project, load MinerU's output — `.md` text files, `model.json` layout metadata, and `content_list.json` content lists — to extract plain text, table structures, image context, and page metadata.
5. Data post-processing and semantic enrichment: Clean and enrich the parsed content — remove redundant headers/footers, generate business summaries for tables, generate visual descriptions for images, and inject metadata as Markdown comments to enrich semantic information.
6. Dynamic-Aware Chunking: Apply the customized chunking strategy — split by heading hierarchy (H1/H2) to preserve chapter context; apply sliding window segmentation for long text blocks (with overlap); keep table and image blocks intact as standalone chunks without splitting.
7. Generate GraphRAG graph data: Integrate extracted entities, relationships, and community detection results to produce the graph data required for GraphRAG index construction, preparing for downstream semantic retrieval.
3.2.1 PDF Multimodal Parsing: MinerU + LitServe Service Encapsulation
We wrapped MinerU as a standalone RESTful API service to solve the engineering challenges of PDF parsing at scale:
- Service encapsulation: Used LitServe to wrap MinerU's parsing capability as a `/parse` endpoint, supporting PDF file upload and async parsing.
- Multi-GPU parallelism: Configured LitServe's `devices` parameter to enable multi-GPU parallel parsing, reducing average per-page parsing time from 3s to 0.8s.
- Result export: Added a `/download_output_files` endpoint for one-click download of all parsed output files (Markdown, `model.json`, `content_list.json`, images, etc.) for downstream processing.
- High-availability scaling: Multi-instance deployment with load balancing via LitServe further increases parsing throughput for large-scale PDF workloads.
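A condensed sketch of the service shell, using LitServe's documented `LitAPI`/`LitServer` interface. Everything MinerU-specific here is an assumption for illustration: `run_mineru` is a placeholder for the actual parsing entry point, the base64 request field is a made-up contract, and the output-file naming simply follows this article's file list.

```python
import base64

def expected_outputs(stem):
    """Files the pipeline downloads after parsing a document named `stem`.
    Naming follows this article's file list; it may differ by MinerU version."""
    return [f"{stem}.md", "model.json", "content_list.json", f"{stem}_middle.json"]

try:
    import litserve as ls

    class MinerUAPI(ls.LitAPI):
        def setup(self, device):
            # Called once per worker; with devices=2, LitServe runs one worker per GPU.
            self.device = device

        def decode_request(self, request):
            # Assumed contract: the client uploads the PDF as base64 in a JSON body.
            return base64.b64decode(request["pdf_b64"])

        def predict(self, pdf_bytes):
            # run_mineru is a placeholder for the real MinerU parsing call.
            return run_mineru(pdf_bytes, device=self.device)  # noqa: F821

        def encode_response(self, result):
            return {"markdown": result["markdown"],
                    "files": expected_outputs(result["stem"])}

    # Launch sketch (two GPUs in parallel):
    # ls.LitServer(MinerUAPI(), accelerator="gpu", devices=2).run(port=8000)
except ImportError:
    pass  # litserve not installed; the class above still documents the shape
```

The article's `/parse` route and the `/download_output_files` endpoint would be configured on top of this shell; LitServe handles batching the four lifecycle hooks across workers.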
Core output files from MinerU:
- `.md`: Structured Markdown text containing text, tables, and image links
- `model.json`: Document layout, element types, and positional metadata
- `content_list.json`: Type, content, and page number for all content blocks
- `_middle.json`: Document hierarchy and fine-grained content units
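A small loader for the files the downstream "Load parsed files" step consumes can look like this. File names follow the list above and may differ across MinerU versions; treat the function as a sketch.

```python
import json
from pathlib import Path

def load_mineru_output(out_dir):
    """Load the MinerU output files this pipeline consumes.
    Names follow the article's file list; adjust per MinerU version."""
    out = Path(out_dir)
    text = next(out.glob("*.md")).read_text(encoding="utf-8")
    layout = json.loads((out / "model.json").read_text(encoding="utf-8"))
    content_list = json.loads((out / "content_list.json").read_text(encoding="utf-8"))
    return text, layout, content_list
```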
3.2.2 Data Cleaning and Semantic Enrichment
Raw MinerU output contains format redundancy and semantic fragmentation. We applied customer service-specific cleaning and enrichment:
- Table enrichment: Extract table elements from `model.json`, call an LLM to generate a business summary, and inject it into the corresponding Markdown position as metadata.
- Image enrichment: Extract image elements from `content_list.json`, call a vision model to generate image descriptions, and supplement contextual semantic information.
- Text cleaning: Remove redundant headers/footers, blank lines, and garbled characters; fix line-break issues to ensure text coherence.
Example of enriched Markdown output:
```markdown
type: table
description: "The table below compares quarterly product sales for 2023. Q4 achieved the highest revenue at 1.5M CNY."
source_page: 5
parent_headings: ["## 2023 Sales Performance"]

| Quarter | Revenue (10K CNY) |
|---------|-------------------|
| Q1      | 120               |
| Q2      | 135               |
| Q3      | 142               |
| Q4      | 150               |
```
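One way to inject such a metadata block programmatically is to prepend it as an HTML comment, which Markdown renderers hide but chunkers can still read. The function below is a sketch; the exact injection format (comment vs. front matter) is an implementation choice, not prescribed by the article.

```python
import json

def enrich_table(table_md, description, page, headings):
    """Prepend a metadata block (as an HTML comment) so downstream
    chunkers see the table's business context. Format is illustrative."""
    meta = "\n".join([
        "<!--",
        "type: table",
        f"description: {json.dumps(description)}",
        f"source_page: {page}",
        f"parent_headings: {json.dumps(headings)}",
        "-->",
    ])
    return meta + "\n" + table_md
```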
3.2.3 GraphRAG Entity and Relationship Extraction
We customized GraphRAG's extraction rules for the customer service domain:
- Priority entity extraction: Product names, order numbers, after-sales policies, and other high-frequency customer service query entities.
- Relationship extraction optimization: Focus on extracting business-relevant relationships such as "Product → belongs to → Category" and "Policy → applies to → Product."
- Community detection: Leverage GraphRAG's community detection to cluster tightly related entities and relationships into communities, enabling semantic association during retrieval.
3.2.4 GraphRAG Index Construction: Dynamic-Aware Chunking Algorithm
To prevent traditional fixed-size chunking from breaking contextual continuity, we implemented a Dynamic-Aware Chunking strategy:
- Heading-based segmentation: Split the document by H1 headings (`#`) into top-level blocks, then by H2 headings (`##`) into sub-blocks, keeping all content under each heading in the same chunk.
- Dynamic size adjustment: Apply sliding window segmentation to text blocks (window size: 512 tokens, 10% overlap); keep table and image blocks as standalone chunks without splitting.
- Semantic association preservation: Link tables and images to their surrounding context via metadata, ensuring no semantic information is lost after chunking.
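The strategy above can be sketched in a few functions. This is a simplified illustration: it uses whitespace "tokens" in place of a real tokenizer and a crude pipe-prefix check for tables, both assumptions for the sketch.

```python
import re

def split_by_headings(markdown):
    """Split a Markdown document into blocks at H1/H2 headings."""
    blocks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,2} ", line) and current:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def sliding_window(tokens, size=512, overlap=0.1):
    """Fixed-size windows with fractional overlap; step = size * (1 - overlap)."""
    step = max(1, int(size * (1 - overlap)))
    return [tokens[i:i + size] for i in range(0, len(tokens), step) if tokens[i:i + size]]

def chunk_block(block, size=512, overlap=0.1):
    """Tables stay whole; long text blocks are windowed (whitespace 'tokens')."""
    if block.lstrip().startswith("|"):  # crude table detection for the sketch
        return [block]
    words = block.split()
    if len(words) <= size:
        return [block]
    return [" ".join(w) for w in sliding_window(words, size, overlap)]
```

Chaining `split_by_headings` then `chunk_block` over each block yields chunks that respect chapter boundaries while bounding chunk size.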
4. Integration and Retrieval: From Hybrid Data to a Unified Knowledge Base
4.1 Hybrid Data Integration Strategy
We use an upper-layer Agent router to unify retrieval across structured and unstructured data:
- Structured data retrieval: Text2Cypher converts natural language queries into Cypher statements for direct Neo4j queries — suited for structured queries like "Check the status of Customer A's order."
- Unstructured data retrieval: GraphRAG's Global Search interface queries the semantic index of unstructured data — suited for queries like "What is the after-sales policy for Product X?"
- Agent routing strategy: Automatically selects the retrieval method based on query keywords (e.g., "order", "customer" → structured; "manual", "policy" → unstructured). Complex queries invoke both retrieval paths and merge the results.
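The keyword routing can be sketched as below. The keyword sets beyond the article's examples ("order", "customer", "manual", "policy") are illustrative additions, and a real router would likely back this up with an LLM classifier.

```python
# Illustrative keyword sets; only "order"/"customer" and "manual"/"policy"
# come from the article — the rest are assumed extensions.
STRUCTURED_KEYWORDS = {"order", "customer", "inventory", "logistics"}
UNSTRUCTURED_KEYWORDS = {"manual", "policy", "warranty", "guide"}

def route(query):
    """Pick retrieval backends by keyword; complex queries fan out to both."""
    q = query.lower()
    targets = []
    if any(k in q for k in STRUCTURED_KEYWORDS):
        targets.append("neo4j")
    if any(k in q for k in UNSTRUCTURED_KEYWORDS):
        targets.append("graphrag")
    return targets or ["graphrag"]  # default to semantic search
```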
The engineering implementation of Text2Cypher and the hybrid retrieval routing strategy will be covered in detail in Article 6 of this series: End-to-End Closure: Hybrid Knowledge Base and Capability Integration.
4.2 Retrieval Flow Example
User query: "What is the logistics status of my Order #123? What are the after-sales policies for Product A?"
- The Agent identifies the query contains both a structured component (order logistics) and an unstructured component (after-sales policy).
- Structured part: Converts to Cypher — `MATCH (o:Order {id: "123"})-[:SHIPPED_VIA]->(s:Shipper) RETURN s.name, s.contact` — and queries Neo4j.
- Unstructured part: Calls GraphRAG Global Search to retrieve content related to "Product A after-sales policy."
- Merges both results and returns a unified response.
5. Key Pitfalls and Optimizations
5.1 Neo4j: Pitfalls and Optimizations
- Issue 1: Inconsistent data import formats
  - Symptom: Mismatched field types in CSV files caused import failures.
  - Solution: Normalize field types during the data cleaning stage; add type validation logic before import.
- Issue 2: Poor query performance at scale
  - Symptom: Complex path queries exceeded 2s response time.
  - Solution: Create indexes on frequently queried node properties; optimize Cypher statements; use `PROFILE` to analyze query execution plans.
5.2 MinerU + LitServe: Pitfalls and Optimizations
- Issue 1: Table structure loss during parsing
  - Symptom: Complex tables were parsed with scrambled structure.
  - Solution: Switch to MinerU's officially supported `StructTable-InternVL2-1B` table parsing model for improved recognition accuracy.
- Issue 2: Slow parsing speed
  - Symptom: Single-GPU parsing of a 100-page PDF took over 5 minutes.
  - Solution: Enable multi-GPU parallel parsing via LitServe; optimize model loading strategy; combine with multi-instance load balancing to increase throughput.
5.3 GraphRAG: Pitfalls and Optimizations
- Issue 1: Chunking breaks contextual continuity
  - Symptom: Fixed-size chunking split cross-chapter related content into separate chunks.
  - Solution: Implement Dynamic-Aware Chunking to preserve contextual integrity by heading hierarchy.
- Issue 2: Table/image semantic information lost after chunking
  - Symptom: After chunking, tables and images retained only links with no contextual description.
  - Solution: Inject metadata descriptions for tables and images during the enrichment stage, embedded directly in the Markdown at the corresponding position.
6. Quantified Results
All metrics below were validated on 100 e-commerce product manual PDFs, 100 annotated real-world customer service query cases, and a dual RTX 4090 GPU test environment.
| Metric | Result |
|---|---|
| Neo4j total nodes | 13,204 |
| Neo4j total edges | 28,762 |
| Structured query accuracy | 98% |
| Table parsing accuracy | 95% |
| Avg. per-page PDF parsing time | 0.78s (multi-GPU parallel) |
| Entity extraction accuracy | 93% |
| Unstructured retrieval accuracy | 89% |
| Hybrid retrieval avg. response time | 1.2s |
Results may vary across domains and document types.
7. Summary
This article presented the complete construction process of a production-grade hybrid knowledge base data pipeline:
- Neo4j enables structured knowledge graph storage and efficient retrieval, solving complex relational query challenges in intelligent customer service scenarios.
- MinerU + LitServe + GraphRAG delivers parsing, semantic enrichment, and index construction for unstructured multimodal data, resolving semantic loss in long documents and multimodal content.
- Agent-based routing unifies retrieval across both data types, providing a stable and efficient knowledge base foundation for the upper-layer customer service system.
This pipeline has been validated in a production e-commerce customer service project, significantly improving query accuracy and efficiency while maintaining strong extensibility and maintainability that meets enterprise production environment requirements.
The next article in this series will focus on production-grade service encapsulation for GraphRAG indexes, covering API design, query optimization, and high-availability guarantees. Stay tuned.
GitHub repository: Link TBD
Series context: This is Article 2 of From 0 to 1 in 8 Weeks: Full-Stack Engineering Practice of a Production-Grade LLM Intelligent Customer Service System, building directly on the MVP architecture overview in Article 1. Subsequent articles will continue iterating on production-grade service encapsulation, Agent architecture, and more.
