DEV Community

James Lee


Production-Grade GraphRAG Data Pipeline: End-to-End Construction from PDF Parsing to Knowledge Graph

1. Introduction: The Hybrid Data Challenge in Intelligent Customer Service

In enterprise-grade intelligent customer service scenarios, the system must simultaneously handle two core data types: structured data (e.g., e-commerce orders, customer profiles, product inventory stored in relational databases) and unstructured data (e.g., PDF product manuals, service agreements, and after-sales guides). Traditional RAG solutions are typically limited to plain text, and when faced with hybrid data, they suffer from three critical limitations:

  1. Difficulty integrating structured data: Order and customer data stored in relational databases cannot be efficiently leveraged by vector retrieval, which fails to capture entity relationships — leading to poor accuracy on complex queries such as "retrieve the shipping information for Customer A's Order B";
  2. Difficulty parsing unstructured data: PDF documents contain multimodal content including text, tables, images, and formulas. Traditional parsing tools (e.g., PyMuPDF) frequently lose table structure and image context, causing semantic fragmentation that severely degrades downstream retrieval quality;
  3. Difficulty coordinating hybrid retrieval: The retrieval logic for structured and unstructured data is completely siloed, with no unified query entry point — forcing agents to switch between multiple systems, reducing efficiency and increasing error rates.

This is Part 2 of the series 8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Customer Service System. It addresses the core bottleneck exposed in the MVP — insufficient support for multi-source data and long documents — by delivering a complete hybrid knowledge base data pipeline, representing the key iteration from v0.1 MVP to v0.5 Knowledge Graph.

The core objective of this project is to build a production-grade hybrid knowledge base data pipeline: using Neo4j to store structured knowledge graphs, and MinerU + LitServe + GraphRAG to process unstructured multimodal data — ultimately enabling unified retrieval and coordination across both data types, and fundamentally resolving the hybrid data processing challenges in intelligent customer service scenarios.


2. Technology Selection and Overall Architecture

2.1 Core Technology Stack

The following technology stack was selected to address the core requirements of hybrid data processing. Each choice has been validated against production-grade scenarios:

  • Neo4j Graph Database:

    • Strengths: Natively suited for storing and querying relational data; node-edge structures intuitively represent entity relationships (e.g., "Customer → places → Order", "Product → belongs to → Category");
    • Fit: Cypher query language supports complex path queries and community detection, perfectly matching structured data retrieval needs in customer service scenarios;
    • Scalability: Supports distributed deployment to handle large-scale knowledge graph storage and query pressure.
  • MinerU + LitServe Multimodal PDF Parsing Service:

    • Strengths: MinerU is an open-source project supporting high-accuracy parsing of text, tables, images, and formulas, outputting structured Markdown and metadata files; wrapped via LitServe as a RESTful API, it enables multi-GPU parallel parsing to address the engineering challenge of slow PDF processing;
    • Fit: Optimized for table recognition and image context extraction in e-commerce customer service scenarios, well-suited for parsing product manual PDFs;
    • Engineering capability: Supports async task scheduling and multi-instance load balancing, meeting high-availability requirements in production environments.
  • Microsoft GraphRAG:

    • Strengths: Combines knowledge graphs with semantic indexing to achieve deep semantic understanding at the entity-relation-community level, resolving semantic loss in traditional vector retrieval for long documents and cross-chapter associations;
    • Scalability: Supports custom chunking strategies and entity extraction rules, enabling domain-specific optimization for customer service scenarios;
    • Production-grade capability: Provides index construction, incremental updates, and dual-mode retrieval (Local / Global Search), meeting enterprise-level high-availability requirements.

The engineering implementation and optimization of GraphRAG's four retrieval modes will be covered in detail in Part 3 of this series: GraphRAG Service Wrapping.

2.2 Overall Architecture Design

The hybrid data pipeline follows a layered decoupling, service-oriented encapsulation design philosophy. The complete flow is as follows:

  1. Structured data pipeline: Raw CSV data → Data cleaning → Neo4j knowledge graph construction → Cypher query service;
  2. Unstructured data pipeline: Raw PDF documents → MinerU + LitServe multimodal parsing → Data cleaning and semantic enrichment → GraphRAG entity/relation extraction → Index construction → Semantic retrieval service;
  3. Upper integration layer: Agent-based hybrid retrieval routing automatically selects Neo4j structured retrieval or GraphRAG unstructured retrieval based on query type, returning unified results.
┌──────────────────────────────────────────────────────────────┐
│                     Hybrid Knowledge Base Pipeline           │
│                                                              │
│  ┌─────────────┐                    ┌──────────────────────┐ │
│  │ Structured  │                    │  Unstructured Data   │ │
│  │    Data     │                    │  ( PDF Documents )   │ │
│  │  ( CSV )    │                    └──────────┬───────────┘ │
│  └──────┬──────┘                               │             │
│         │                                      ▼             │
│         ▼                         ┌────────────────────────┐ │
│  ┌─────────────┐                  │  MinerU + LitServe     │ │
│  │  Data Clean │                  │  Multimodal Parsing    │ │
│  └──────┬──────┘                  └──────────┬─────────────┘ │
│         │                                    │               │
│         ▼                                    ▼               │
│  ┌─────────────┐                  ┌────────────────────────┐ │
│  │    Neo4j    │                  │   GraphRAG Pipeline    │ │
│  │  Knowledge  │                  │  Chunk → Extract →     │ │
│  │   Graph     │                  │  Index → Search        │ │
│  └──────┬──────┘                  └──────────┬─────────────┘ │
│         │                                    │               │
│         └──────────────┬─────────────────────┘               │
│                        ▼                                     │
│              ┌─────────────────┐                             │
│              │  Agent Router   │                             │
│              │ Structured Query│                             │
│              │   or Semantic   │                             │
│              │     Search      │                             │
│              └─────────────────┘                             │
└──────────────────────────────────────────────────────────────┘

The overall architecture clearly separates three core stages — data processing, index construction, and retrieval service — ensuring module independence while enabling coordinated use of hybrid data, providing a stable knowledge base foundation for the upper-layer intelligent customer service system.


3. Data Processing Pipeline: From CSV and PDF to a Hybrid Knowledge Base

3.1 Structured Data Processing: Neo4j Knowledge Graph Construction

3.1.1 Knowledge Graph Modeling

For the e-commerce customer service scenario, the following node and edge types are defined as illustrative examples; they do not represent a production schema, and real-world implementations should be redesigned around your own entity taxonomy:

  • Core node types:

    • Product: Product information (illustrative business fields);
    • Category: Product category;
    • Supplier: Supplier;
    • Customer: Customer;
    • Order: Order;
    • Shipper: Logistics provider.
  • Core edge types:

    • BELONGS_TO: Product → Category;
    • SUPPLIED_BY: Product → Supplier;
    • PLACED_BY: Order → Customer;
    • CONTAINS: Order → Product;
    • SHIPPED_VIA: Order → Shipper.

This model aligns with e-commerce customer service business logic and efficiently supports complex association queries such as "retrieve all orders for Customer A" and "retrieve supplier information for Product X".
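As a sanity check on the model, the association queries above map directly onto short Cypher patterns. Labels and relationship types follow the example schema; names like "Customer A" are placeholders:

```cypher
// Illustrative queries over the example schema above

// "retrieve all orders for Customer A"
MATCH (o:Order)-[:PLACED_BY]->(c:Customer {name: "Customer A"})
RETURN o;

// "retrieve supplier information for Product X"
MATCH (p:Product {name: "Product X"})-[:SUPPLIED_BY]->(s:Supplier)
RETURN s;
```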

3.1.2 Data Import and Engineering Implementation

CSV data is imported into Neo4j via Python scripts. The core workflow is as follows:

  1. Data cleaning: Read *_nodes.csv and *_edges.csv files; remove null values and malformed data; normalize field types;
  2. Batch import: Use the neo4j Python driver with UNWIND syntax for batch writes, avoiding the performance bottleneck of single-record insertion;
  3. Index creation: Create unique constraint indexes on core node IDs (e.g., Product.id, Customer.id) to improve query efficiency;
  4. Data reset: Execute MATCH (n) DETACH DELETE n before import to clear stale data and ensure consistency. In production, versioned data imports are recommended over full truncation to avoid data loss risk.
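The cleaning and batch-import steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the CSV fields (`id`, `name`) and the `Product` label are assumptions, and the `UNWIND`-based query is one common batching pattern for the `neo4j` Python driver.

```python
# Sketch of batched CSV import into Neo4j; field names are illustrative.
import csv

BATCH_SIZE = 1000

# UNWIND lets Neo4j process a whole batch in one transaction
# instead of paying one network round-trip per row.
MERGE_PRODUCTS = """
UNWIND $rows AS row
MERGE (p:Product {id: row.id})
SET p.name = row.name
"""

def clean_rows(rows):
    """Drop rows with missing ids and normalize field types."""
    for row in rows:
        if not row.get("id"):
            continue  # skip malformed records
        yield {"id": int(row["id"]), "name": (row.get("name") or "").strip()}

def batches(iterable, size=BATCH_SIZE):
    """Yield lists of at most `size` items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def import_nodes(driver, path):
    """Stream a *_nodes.csv file into Neo4j in batches."""
    with open(path, newline="", encoding="utf-8") as f:
        for batch in batches(clean_rows(csv.DictReader(f))):
            with driver.session() as session:
                session.execute_write(
                    lambda tx, rows=batch: tx.run(MERGE_PRODUCTS, rows=rows)
                )
```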

3.1.3 Quantitative Results

  • Knowledge graph constructed: 13,204 nodes and 28,762 edges;
  • Import efficiency: A full import of 100,000 records completes in under 5 minutes with a 100% success rate;
  • Retrieval performance: Simple queries (e.g., "retrieve all orders for Customer ID=123") respond in under 100ms; complex path queries respond in under 500ms — fully meeting real-time requirements for customer service scenarios.

3.2 Unstructured Data Processing: MinerU + LitServe + GraphRAG Multimodal Pipeline

The complete data flow is illustrated below:

┌─────────────────────────────────────────────────────────────────┐
│                         PDF Data Input                          │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      MinerU Parse Service                       │
│                    [ Deployed via LitServe ]                    │
│                                                                 │
│   ┌─────────────────┐   ┌──────────────┐   ┌───────────────┐   │
│   │  Text Content   │   │    Tables    │   │    Images     │   │
│   │   ( .md file )  │   │ ( .json file)│   │ ( .json file )│   │
│   └─────────────────┘   └──────────────┘   └───────────────┘   │
└──────────────────────────────┬──────────────────────────────────┘
                               │  Structured Output
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                       GraphRAG Pipeline                         │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Step 1 · Data Preprocessing                            │   │
│   │  Merge text / table / image into unified structure      │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                 │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Step 2 · Dynamic Chunking                              │   │
│   │  Heading-aware splitting · table/image kept intact      │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                 │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Step 3 · Knowledge Graph Generation                    │   │
│   │  Entity extraction · Relation mapping · Graph storage   │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

This project builds a complete production-grade pipeline for PDF multimodal data — from raw PDF ingestion to multimodal parsing, semantic enrichment, and index construction — with three core capabilities:

  1. MinerU + LitServe service-oriented parsing: Converts PDFs into structured Markdown and metadata files;
  2. Structure-aware chunking strategy: Adaptively adjusts chunk boundaries based on semantic boundaries in the text, preserving contextual integrity;
  3. Multimodal semantic enrichment: Leverages table and image metadata to enrich chunk semantics.

3.2.1 PDF Multimodal Parsing: MinerU + LitServe Service Wrapping

MinerU is wrapped as a standalone RESTful API service to address the engineering challenges of PDF parsing at scale:

  1. Service wrapping: The LitServe framework encapsulates MinerU's parsing capability as a /parse endpoint, supporting PDF file upload and async parsing;
  2. Multi-GPU parallelism: LitServe's devices configuration enables multi-GPU parallel parsing, significantly reducing per-page parsing time to under 1s on average;
  3. Result export: A /download_output_files endpoint is added for one-click download of all parsed output files, facilitating downstream processing;
  4. High-availability scaling: Multi-instance deployment with load balancing via LitServe further improves throughput for large-scale PDF processing.
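The service wrapping follows LitServe's `LitAPI` interface (`setup` / `decode_request` / `predict` / `encode_response`). The sketch below shows the shape of such a wrapper; `run_mineru` is a hypothetical stand-in for the actual MinerU parsing call, and the base64 upload format is an assumption, not the project's real API contract:

```python
# Sketch of wrapping a PDF parser behind LitServe; run_mineru is a
# placeholder, not MinerU's real entry point.
import base64

try:
    import litserve as ls
    BaseAPI = ls.LitAPI
except ImportError:  # litserve is only needed when actually serving
    ls, BaseAPI = None, object

def run_mineru(pdf_bytes: bytes) -> dict:
    """Hypothetical helper standing in for the MinerU pipeline."""
    return {"markdown": "# parsed", "tables": [], "images": []}

class MinerUParseAPI(BaseAPI):
    def setup(self, device):
        # load parsing models onto the GPU LitServe assigned this worker
        self.device = device

    def decode_request(self, request):
        # assume clients POST the PDF as base64 under "file"
        return base64.b64decode(request["file"])

    def predict(self, pdf_bytes):
        return run_mineru(pdf_bytes)

    def encode_response(self, output):
        return output  # JSON-serializable dict

# To serve (requires litserve and GPUs); devices=2 runs one worker
# per GPU with requests load-balanced across them:
#   server = ls.LitServer(MinerUParseAPI(), accelerator="gpu", devices=2)
#   server.run(port=8000)
```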

3.2.2 Data Cleaning and Semantic Enrichment

Raw output from MinerU contains format redundancy and semantic fragmentation. We apply domain-specific cleaning and enrichment for the customer service scenario:

  • Table enrichment: Table elements are extracted from parsed metadata; an LLM generates a business summary, which is inserted back into the corresponding position in the Markdown as metadata;
  • Image enrichment: Image elements are extracted from parsed metadata; a vision model generates image descriptions to supplement contextual semantics;
  • Text cleaning: Redundant headers/footers, blank lines, and garbled characters are removed; line-break issues are corrected to ensure text coherence.
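The enrichment pass can be sketched as a walk over the parsed element list, appending generated summaries next to each table or image. The element shape (`{"type": ..., "content": ...}`) is illustrative rather than MinerU's exact output schema, and `summarize_table` / `describe_image` stand in for the LLM and vision-model calls:

```python
# Sketch of table/image semantic enrichment over parsed elements.
def enrich_elements(elements, summarize_table, describe_image):
    """Append generated descriptions after each table/image element.

    `elements` is a list of dicts like {"type": "table", "content": ...};
    the callables stand in for LLM / vision-model calls.
    """
    enriched = []
    for el in elements:
        enriched.append(el)
        if el["type"] == "table":
            enriched.append({
                "type": "text",
                "content": f"[Table summary] {summarize_table(el['content'])}",
            })
        elif el["type"] == "image":
            enriched.append({
                "type": "text",
                "content": f"[Image description] {describe_image(el['content'])}",
            })
    return enriched
```

Because the summary is inserted immediately after its source element, downstream chunking keeps the generated description adjacent to the table or image it explains.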

3.2.3 GraphRAG Entity and Relation Extraction

GraphRAG extraction rules are customized for the customer service scenario:

  • Priority entity extraction: Focus on high-frequency customer service entities such as product names, order numbers, and after-sales policies;
  • Relation extraction optimization: Prioritize business-relevant relations such as "Product → belongs to → Category" and "Policy → applies to → Product";
  • Community detection: GraphRAG's community detection groups tightly related entities and relations into communities, enabling semantic association during retrieval.
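In GraphRAG, this kind of customization lives in `settings.yaml`. The exact keys depend on your GraphRAG version (newer releases rename some sections, e.g. `entity_extraction` to `extract_graph`), so treat this fragment as a version-dependent sketch rather than a drop-in config; the entity types shown are the customer-service-oriented ones described above:

```yaml
# settings.yaml (fragment) — key names follow GraphRAG 0.x; adjust per version
entity_extraction:
  prompt: "prompts/entity_extraction.txt"      # customized for the CS domain
  entity_types: [product, order, category, policy]
  max_gleanings: 1

cluster_graph:
  max_cluster_size: 10                          # community detection granularity
```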

3.2.4 GraphRAG Index Construction: Structure-Aware Chunking Strategy

To prevent traditional fixed-size chunking from breaking contextual continuity, a structure-aware chunking strategy is implemented:

  • Core approach: Chunk boundaries are adaptively determined based on semantic boundaries in the text (e.g., table headings, paragraph logical breakpoints) rather than fixed lengths, resolving fragmentation issues in mixed text-image layouts;
  • Validation: Compared to fixed-window chunking, retrieval accuracy improves by 12% in table-heavy and mixed text-image scenarios.
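A minimal version of this strategy, assuming MinerU's Markdown output, splits on heading boundaries and packs whole sections up to a soft size limit, so a table or image block inside a section is never cut in half. The size limit and helper names are illustrative:

```python
# Sketch of heading-aware chunking over Markdown; sizes are illustrative.
import re

MAX_CHARS = 1200  # soft limit; sections are atomic, so tables stay intact

def structure_aware_chunks(markdown: str, max_chars: int = MAX_CHARS):
    """Split Markdown on headings, packing sections up to max_chars.

    A section (heading plus its body) is treated as an atomic unit,
    so tables and image blocks inside it are never split.
    """
    # split before every Markdown heading, keeping the heading line
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    sections = [s for s in sections if s.strip()]

    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

An oversized section still becomes its own chunk here rather than being truncated; a production version would additionally split overlong sections at paragraph breakpoints.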

4. Integration and Retrieval: From Hybrid Data to a Unified Knowledge Base

4.1 Hybrid Data Integration Approach

Unified retrieval across structured and unstructured data is achieved via an upper-layer Agent router:

  • Structured data retrieval: Text2Cypher converts natural language queries into Cypher statements for direct Neo4j knowledge graph queries — suited for structured queries such as "check the order status for Customer A";
  • Unstructured data retrieval: GraphRAG's Global Search interface retrieves the semantic index of unstructured data — suited for queries such as "retrieve the after-sales policy for Product X";
  • Agent routing strategy: Keywords in the user query determine the retrieval path (e.g., "order", "customer" → structured; "manual", "policy" → unstructured). Complex queries invoke both retrieval paths and merge the results.
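The keyword routing described above can be sketched in a few lines. The keyword sets and route names here are illustrative, not the project's real configuration; note that a query matching both sets fans out to both retrieval paths:

```python
# Sketch of keyword-based retrieval routing; keyword lists are illustrative.
STRUCTURED_KEYWORDS = {"order", "customer", "shipping", "inventory"}
UNSTRUCTURED_KEYWORDS = {"manual", "policy", "warranty", "how to"}

def route_query(query: str) -> list:
    """Return the retrieval paths a query should hit."""
    q = query.lower()
    routes = []
    if any(k in q for k in STRUCTURED_KEYWORDS):
        routes.append("neo4j")       # Text2Cypher -> Neo4j graph query
    if any(k in q for k in UNSTRUCTURED_KEYWORDS):
        routes.append("graphrag")    # GraphRAG Global Search
    return routes or ["graphrag"]    # default to semantic search
```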

The engineering implementation of Text2Cypher and the hybrid retrieval routing strategy will be covered in detail in Part 6 of this series: End-to-End Wrap-Up: Hybrid Knowledge Base and Capability Closure.

4.2 Retrieval Flow Example

User query: "What is the shipping status of my Order #123? What are the after-sales policies for Product A?"

  1. The Agent identifies that the query contains both a structured component (order shipping) and an unstructured component (after-sales policy);
  2. Structured part: Converted to a Cypher query and executed against Neo4j:
   // Illustrative pseudocode; $order_id is a query parameter
   MATCH (o:Order)-[:SHIPPED_VIA]->(s:Shipper)
   WHERE o.id = $order_id
   RETURN s.name, s.contact
  3. Unstructured part: GraphRAG Global Search is invoked to retrieve content related to "Product A after-sales policy";
  4. Results from both retrieval paths are merged and returned as a unified response.

5. Key Pitfalls and Optimizations

5.1 Neo4j: Pitfalls and Optimizations

  • Issue 1: Inconsistent data import formats

    • Symptom: Non-uniform field types in CSV files caused import failures;
    • Solution: Normalize field types during the data cleaning stage and add type validation logic.
  • Issue 2: Poor query performance at scale

    • Symptom: Complex path queries exceeded 2s response time;
    • Solution: Create indexes on frequently queried node properties; optimize Cypher statements; use PROFILE to analyze query plans.

5.2 MinerU + LitServe: Pitfalls and Optimizations

  • Issue 1: Loss of table structure during parsing

    • Symptom: Complex tables were parsed with corrupted structure;
    • Solution: Use MinerU's officially supported table-specialized parsing model to improve table recognition accuracy.
  • Issue 2: Slow parsing speed

    • Symptom: Single-GPU parsing of a 100-page PDF took over 5 minutes;
    • Solution: Enable multi-GPU parallel parsing via LitServe; optimize model loading strategy; combine with multi-instance load balancing to improve throughput.

5.3 GraphRAG: Pitfalls and Optimizations

  • Issue 1: Chunking breaks contextual continuity

    • Symptom: Traditional fixed-size chunking split cross-chapter associated content;
    • Solution: Apply structure-aware chunking strategy, preserving contextual integrity by respecting heading hierarchy.
  • Issue 2: Loss of table/image semantic information

    • Symptom: After chunking, tables and images retained only links with no contextual description;
    • Solution: Add metadata descriptions for tables and images during the semantic enrichment stage and insert them at the corresponding positions in the Markdown.

6. Quantitative Results

All metrics are validated on 100 e-commerce product manual PDFs, 100 annotated customer service query test cases, and a dual RTX 4090 GPU test environment:

| Metric | Result |
| --- | --- |
| Neo4j total nodes | 13,204 |
| Neo4j total edges | 28,762 |
| Structured query accuracy | 98% |
| Table parsing accuracy | 95% |
| Average per-page PDF parsing time | < 1s (multi-GPU parallel) |
| Entity extraction accuracy | 93% |
| Unstructured retrieval accuracy | 89% |
| Hybrid retrieval average response time | 1.2s |

Results may vary across domains and document types.


7. Deployment Boundaries and Series Continuity

7.1 Deployment Boundaries

This hybrid knowledge base data pipeline is optimized for e-commerce knowledge graph Q&A scenarios. Domains such as healthcare and finance will require adjustments to entity extraction rules and security policies. Production-grade iteration should further incorporate Text2Cypher and hybrid retrieval routing strategies.

7.2 Series Continuity

  • GitHub repository: llm-customer-service (Tag: v0.5.0-graphrag-data-pipeline)
  • Backward reference: Builds on Part 1 Full MVP Architecture Breakdown, addressing the core bottleneck of insufficient multi-source data and long document support.
  • Next up: Part 3 will focus on production-grade service wrapping for GraphRAG indexes, covering API design, retrieval mode decision-making across four modes, and high-availability guarantees. Stay tuned.
  • Series finale: Part 8 will provide a complete retrospective of all architecture decisions, engineering pitfalls, and quantifiable outcomes from MVP to production-grade system, forming a full end-to-end engineering practice record.
