DEV Community

James Lee


Production-Grade GraphRAG Data Pipeline: End-to-End Construction from PDF Parsing to Knowledge Graph

1. Introduction: The Hybrid Data Challenge in Intelligent Customer Service

In enterprise-grade intelligent customer service scenarios, the system must simultaneously handle two core data types: structured data (e.g., e-commerce orders, customer profiles, product inventory stored in relational databases) and unstructured data (e.g., PDF product manuals, service agreements, and after-sales guides). Traditional RAG solutions are typically limited to plain text, and when faced with hybrid data, they suffer from three critical limitations:

  1. Difficulty integrating structured data: Order and customer data stored in relational databases cannot be efficiently leveraged by vector retrieval, which fails to capture entity relationships — leading to poor accuracy on complex queries such as "retrieve the shipping information for Customer A's Order B";
  2. Difficulty parsing unstructured data: PDF documents contain multimodal content including text, tables, images, and formulas. Traditional parsing tools (e.g., PyMuPDF) frequently lose table structure and image context, causing semantic fragmentation that severely degrades downstream retrieval quality;
  3. Difficulty coordinating hybrid retrieval: The retrieval logic for structured and unstructured data is completely siloed, with no unified query entry point — forcing agents to switch between multiple systems, reducing efficiency and increasing error rates.

This is Part 2 of the series 8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Customer Service System. It addresses the core bottleneck exposed in the MVP — insufficient support for multi-source data and long documents — by delivering a complete hybrid knowledge base data pipeline, representing the key iteration from v0.1 MVP to v0.5 Knowledge Graph.

The core objective of this project is to build a production-grade hybrid knowledge base data pipeline: using Neo4j to store structured knowledge graphs, and MinerU + LitServe + GraphRAG to process unstructured multimodal data — ultimately enabling unified retrieval and coordination across both data types, and fundamentally resolving the hybrid data processing challenges in intelligent customer service scenarios.


2. Technology Selection and Overall Architecture

2.1 Core Technology Stack

The following technology stack was selected to address the core requirements of hybrid data processing. Each choice has been validated against production-grade scenarios:

  • Neo4j Graph Database:

    • Strengths: Natively suited for storing and querying relational data; node-edge structures intuitively represent entity relationships (e.g., "Customer → places → Order", "Product → belongs to → Category");
    • Fit: Cypher query language supports complex path queries and community detection, perfectly matching structured data retrieval needs in customer service scenarios;
    • Scalability: Supports distributed deployment to handle large-scale knowledge graph storage and query pressure.
  • MinerU + LitServe Multimodal PDF Parsing Service:

    • Strengths: MinerU is an open-source project supporting high-accuracy parsing of text, tables, images, and formulas, outputting structured Markdown and metadata files; wrapped via LitServe as a RESTful API, it enables multi-GPU parallel parsing to address the engineering challenge of slow PDF processing;
    • Fit: Optimized for table recognition and image context extraction in e-commerce customer service scenarios, well-suited for parsing product manual PDFs;
    • Engineering capability: Supports async task scheduling and multi-instance load balancing, meeting high-availability requirements in production environments.
  • Microsoft GraphRAG:

    • Strengths: Combines knowledge graphs with semantic indexing to achieve deep semantic understanding at the entity-relation-community level, resolving semantic loss in traditional vector retrieval for long documents and cross-chapter associations;
    • Scalability: Supports custom chunking strategies and entity extraction rules, enabling domain-specific optimization for customer service scenarios;
    • Production-grade capability: Provides index construction, incremental updates, and dual-mode retrieval (Local / Global Search), meeting enterprise-level high-availability requirements.

The engineering implementation and optimization of GraphRAG's four retrieval modes will be covered in detail in Part 3 of this series: GraphRAG Service Wrapping.

2.2 Overall Architecture Design

The hybrid data pipeline follows a layered decoupling, service-oriented encapsulation design philosophy. The complete flow is as follows:

  1. Structured data pipeline: Raw CSV data → Data cleaning → Neo4j knowledge graph construction → Cypher query service;
  2. Unstructured data pipeline: Raw PDF documents → MinerU + LitServe multimodal parsing → Data cleaning and semantic enrichment → GraphRAG entity/relation extraction → Index construction → Semantic retrieval service;
  3. Upper integration layer: Agent-based hybrid retrieval routing automatically selects Neo4j structured retrieval or GraphRAG unstructured retrieval based on query type, returning unified results.
┌──────────────────────────────────────────────────────────────┐
│                     Hybrid Knowledge Base Pipeline           │
│                                                              │
│  ┌─────────────┐                    ┌──────────────────────┐ │
│  │ Structured  │                    │  Unstructured Data   │ │
│  │    Data     │                    │  ( PDF Documents )   │ │
│  │  ( CSV )    │                    └──────────┬───────────┘ │
│  └──────┬──────┘                               │             │
│         │                                      ▼             │
│         ▼                         ┌────────────────────────┐ │
│  ┌─────────────┐                  │  MinerU + LitServe     │ │
│  │  Data Clean │                  │  Multimodal Parsing    │ │
│  └──────┬──────┘                  └──────────┬─────────────┘ │
│         │                                    │               │
│         ▼                                    ▼               │
│  ┌─────────────┐                  ┌────────────────────────┐ │
│  │    Neo4j    │                  │   GraphRAG Pipeline    │ │
│  │  Knowledge  │                  │  Chunk → Extract →     │ │
│  │   Graph     │                  │  Index → Search        │ │
│  └──────┬──────┘                  └──────────┬─────────────┘ │
│         │                                    │               │
│         └──────────────┬─────────────────────┘               │
│                        ▼                                     │
│              ┌─────────────────┐                             │
│              │  Agent Router   │                             │
│              │ Structured Query│                             │
│              │   or Semantic   │                             │
│              │     Search      │                             │
│              └─────────────────┘                             │
└──────────────────────────────────────────────────────────────┘

The overall architecture clearly separates three core stages — data processing, index construction, and retrieval service — ensuring module independence while enabling coordinated use of hybrid data, providing a stable knowledge base foundation for the upper-layer intelligent customer service system.


3. Data Processing Pipeline: From CSV and PDF to a Hybrid Knowledge Base

3.1 Structured Data Processing: Neo4j Knowledge Graph Construction

3.1.1 Knowledge Graph Modeling

For the e-commerce customer service scenario, the following node and edge types are defined as illustrative examples; they do not represent a production schema, and real-world implementations should be redesigned around your own entity taxonomy:

  • Core node types:

    • Product: Product information (illustrative business fields);
    • Category: Product category;
    • Supplier: Supplier;
    • Customer: Customer;
    • Order: Order;
    • Shipper: Logistics provider.
  • Core edge types:

    • BELONGS_TO: Product → Category;
    • SUPPLIED_BY: Product → Supplier;
    • PLACED_BY: Order → Customer;
    • CONTAINS: Order → Product;
    • SHIPPED_VIA: Order → Shipper.

This model aligns with e-commerce customer service business logic and efficiently supports complex association queries such as "retrieve all orders for Customer A" and "retrieve supplier information for Product X".
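As a sanity check on the model, the association queries above map directly onto short Cypher patterns. Labels and relationship types follow the example schema; names like "Customer A" are placeholders:

```cypher
// Illustrative queries over the example schema above

// "retrieve all orders for Customer A"
MATCH (o:Order)-[:PLACED_BY]->(c:Customer {name: "Customer A"})
RETURN o;

// "retrieve supplier information for Product X"
MATCH (p:Product {name: "Product X"})-[:SUPPLIED_BY]->(s:Supplier)
RETURN s;
```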

3.1.2 Data Import and Engineering Implementation

CSV data is imported into Neo4j via Python scripts. The core workflow is as follows:

  1. Data cleaning: Read *_nodes.csv and *_edges.csv files; remove null values and malformed data; normalize field types;
  2. Batch import: Use the neo4j Python driver with UNWIND syntax for batch writes, avoiding the performance bottleneck of single-record insertion;
  3. Index creation: Create unique constraint indexes on core node IDs (e.g., Product.id, Customer.id) to improve query efficiency;
  4. Data reset: Execute MATCH (n) DETACH DELETE n before import to clear stale data and ensure consistency. In production, versioned data imports are recommended over full truncation to avoid data loss risk.
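The cleaning and batch-import steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the CSV fields (`id`, `name`) and the `Product` label are assumptions, and the `UNWIND`-based query is one common batching pattern for the `neo4j` Python driver.

```python
# Sketch of batched CSV import into Neo4j; field names are illustrative.
import csv

BATCH_SIZE = 1000

# UNWIND lets Neo4j process a whole batch in one transaction
# instead of paying one network round-trip per row.
MERGE_PRODUCTS = """
UNWIND $rows AS row
MERGE (p:Product {id: row.id})
SET p.name = row.name
"""

def clean_rows(rows):
    """Drop rows with missing ids and normalize field types."""
    for row in rows:
        if not row.get("id"):
            continue  # skip malformed records
        yield {"id": int(row["id"]), "name": (row.get("name") or "").strip()}

def batches(iterable, size=BATCH_SIZE):
    """Yield lists of at most `size` items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def import_nodes(driver, path):
    """Stream a *_nodes.csv file into Neo4j in batches."""
    with open(path, newline="", encoding="utf-8") as f:
        for batch in batches(clean_rows(csv.DictReader(f))):
            with driver.session() as session:
                session.execute_write(
                    lambda tx, rows=batch: tx.run(MERGE_PRODUCTS, rows=rows)
                )
```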

3.1.3 Quantitative Results

  • Knowledge graph constructed: 13,204 nodes and 28,762 edges;
  • Import efficiency: A full import of 100,000 records completes in under 5 minutes with a 100% success rate;
  • Retrieval performance: Simple queries (e.g., "retrieve all orders for Customer ID=123") respond in under 100ms; complex path queries respond in under 500ms — fully meeting real-time requirements for customer service scenarios.

3.2 Unstructured Data Processing: MinerU + LitServe + GraphRAG Multimodal Pipeline

The complete data flow is illustrated below:

┌─────────────────────────────────────────────────────────────────┐
│                         PDF Data Input                          │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      MinerU Parse Service                       │
│                    [ Deployed via LitServe ]                    │
│                                                                 │
│   ┌─────────────────┐   ┌──────────────┐   ┌───────────────┐   │
│   │  Text Content   │   │    Tables    │   │    Images     │   │
│   │   ( .md file )  │   │ ( .json file)│   │ ( .json file )│   │
│   └─────────────────┘   └──────────────┘   └───────────────┘   │
└──────────────────────────────┬──────────────────────────────────┘
                               │  Structured Output
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                       GraphRAG Pipeline                         │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Step 1 · Data Preprocessing                            │   │
│   │  Merge text / table / image into unified structure      │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                 │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Step 2 · Dynamic Chunking                              │   │
│   │  Heading-aware splitting · table/image kept intact      │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                 │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Step 3 · Knowledge Graph Generation                    │   │
│   │  Entity extraction · Relation mapping · Graph storage   │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

This project builds a complete production-grade pipeline for PDF multimodal data — from raw PDF ingestion to multimodal parsing, semantic enrichment, and index construction — with three core capabilities:

  1. MinerU + LitServe service-oriented parsing: Converts PDFs into structured Markdown and metadata files;
  2. Structure-aware chunking strategy: Adaptively adjusts chunk boundaries based on semantic boundaries in the text, preserving contextual integrity;
  3. Multimodal semantic enrichment: Leverages table and image metadata to enrich chunk semantics.

3.2.1 PDF Multimodal Parsing: MinerU + LitServe Service Wrapping

MinerU is wrapped as a standalone RESTful API service to address the engineering challenges of PDF parsing at scale:

  1. Service wrapping: The LitServe framework encapsulates MinerU's parsing capability as a /parse endpoint, supporting PDF file upload and async parsing;
  2. Multi-GPU parallelism: LitServe's devices configuration enables multi-GPU parallel parsing, significantly reducing per-page parsing time to under 1s on average;
  3. Result export: A /download_output_files endpoint is added for one-click download of all parsed output files, facilitating downstream processing;
  4. High-availability scaling: Multi-instance deployment with load balancing via LitServe further improves throughput for large-scale PDF processing.
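The service wrapping follows LitServe's `LitAPI` interface (`setup` / `decode_request` / `predict` / `encode_response`). The sketch below shows the shape of such a wrapper; `run_mineru` is a hypothetical stand-in for the actual MinerU parsing call, and the base64 upload format is an assumption, not the project's real API contract:

```python
# Sketch of wrapping a PDF parser behind LitServe; run_mineru is a
# placeholder, not MinerU's real entry point.
import base64

try:
    import litserve as ls
    BaseAPI = ls.LitAPI
except ImportError:  # litserve is only needed when actually serving
    ls, BaseAPI = None, object

def run_mineru(pdf_bytes: bytes) -> dict:
    """Hypothetical helper standing in for the MinerU pipeline."""
    return {"markdown": "# parsed", "tables": [], "images": []}

class MinerUParseAPI(BaseAPI):
    def setup(self, device):
        # load parsing models onto the GPU LitServe assigned this worker
        self.device = device

    def decode_request(self, request):
        # assume clients POST the PDF as base64 under "file"
        return base64.b64decode(request["file"])

    def predict(self, pdf_bytes):
        return run_mineru(pdf_bytes)

    def encode_response(self, output):
        return output  # JSON-serializable dict

# To serve (requires litserve and GPUs); devices=2 runs one worker
# per GPU with requests load-balanced across them:
#   server = ls.LitServer(MinerUParseAPI(), accelerator="gpu", devices=2)
#   server.run(port=8000)
```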

3.2.2 Data Cleaning and Semantic Enrichment

Raw output from MinerU contains format redundancy and semantic fragmentation. We apply domain-specific cleaning and enrichment for the customer service scenario:

  • Table enrichment: Table elements are extracted from parsed metadata; an LLM generates a business summary, which is inserted back into the corresponding position in the Markdown as metadata;
  • Image enrichment: Image elements are extracted from parsed metadata; a vision model generates image descriptions to supplement contextual semantics;
  • Text cleaning: Redundant headers/footers, blank lines, and garbled characters are removed; line-break issues are corrected to ensure text coherence.
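The enrichment pass can be sketched as a walk over the parsed element list, appending generated summaries next to each table or image. The element shape (`{"type": ..., "content": ...}`) is illustrative rather than MinerU's exact output schema, and `summarize_table` / `describe_image` stand in for the LLM and vision-model calls:

```python
# Sketch of table/image semantic enrichment over parsed elements.
def enrich_elements(elements, summarize_table, describe_image):
    """Append generated descriptions after each table/image element.

    `elements` is a list of dicts like {"type": "table", "content": ...};
    the callables stand in for LLM / vision-model calls.
    """
    enriched = []
    for el in elements:
        enriched.append(el)
        if el["type"] == "table":
            enriched.append({
                "type": "text",
                "content": f"[Table summary] {summarize_table(el['content'])}",
            })
        elif el["type"] == "image":
            enriched.append({
                "type": "text",
                "content": f"[Image description] {describe_image(el['content'])}",
            })
    return enriched
```

Because the summary is inserted immediately after its source element, downstream chunking keeps the generated description adjacent to the table or image it explains.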

3.2.3 GraphRAG Entity and Relation Extraction

GraphRAG extraction rules are customized for the customer service scenario:

  • Priority entity extraction: Focus on high-frequency customer service entities such as product names, order numbers, and after-sales policies;
  • Relation extraction optimization: Prioritize business-relevant relations such as "Product → belongs to → Category" and "Policy → applies to → Product";
  • Community detection: GraphRAG's community detection groups tightly related entities and relations into communities, enabling semantic association during retrieval.
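In GraphRAG, this kind of customization lives in `settings.yaml`. The exact keys depend on your GraphRAG version (newer releases rename some sections, e.g. `entity_extraction` to `extract_graph`), so treat this fragment as a version-dependent sketch rather than a drop-in config; the entity types shown are the customer-service-oriented ones described above:

```yaml
# settings.yaml (fragment) — key names follow GraphRAG 0.x; adjust per version
entity_extraction:
  prompt: "prompts/entity_extraction.txt"      # customized for the CS domain
  entity_types: [product, order, category, policy]
  max_gleanings: 1

cluster_graph:
  max_cluster_size: 10                          # community detection granularity
```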

3.2.4 GraphRAG Index Construction: Structure-Aware Chunking Strategy

To prevent traditional fixed-size chunking from breaking contextual continuity, a structure-aware chunking strategy is implemented:

  • Core approach: Chunk boundaries are adaptively determined based on semantic boundaries in the text (e.g., table headings, paragraph logical breakpoints) rather than fixed lengths, resolving fragmentation issues in mixed text-image layouts;
  • Validation: Compared to fixed-window chunking, retrieval accuracy improves by 12% in table-heavy and mixed text-image scenarios.
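A minimal version of this strategy, assuming MinerU's Markdown output, splits on heading boundaries and packs whole sections up to a soft size limit, so a table or image block inside a section is never cut in half. The size limit and helper names are illustrative:

```python
# Sketch of heading-aware chunking over Markdown; sizes are illustrative.
import re

MAX_CHARS = 1200  # soft limit; sections are atomic, so tables stay intact

def structure_aware_chunks(markdown: str, max_chars: int = MAX_CHARS):
    """Split Markdown on headings, packing sections up to max_chars.

    A section (heading plus its body) is treated as an atomic unit,
    so tables and image blocks inside it are never split.
    """
    # split before every Markdown heading, keeping the heading line
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    sections = [s for s in sections if s.strip()]

    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

An oversized section still becomes its own chunk here rather than being truncated; a production version would additionally split overlong sections at paragraph breakpoints.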

4. Integration and Retrieval: From Hybrid Data to a Unified Knowledge Base

4.1 Hybrid Data Integration Approach

Unified retrieval across structured and unstructured data is achieved via an upper-layer Agent router:

  • Structured data retrieval: Text2Cypher converts natural language queries into Cypher statements for direct Neo4j knowledge graph queries — suited for structured queries such as "check the order status for Customer A";
  • Unstructured data retrieval: GraphRAG's Global Search interface retrieves the semantic index of unstructured data — suited for queries such as "retrieve the after-sales policy for Product X";
  • Agent routing strategy: Keywords in the user query determine the retrieval path (e.g., "order", "customer" → structured; "manual", "policy" → unstructured). Complex queries invoke both retrieval paths and merge the results.
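The keyword routing described above can be sketched in a few lines. The keyword sets and route names here are illustrative, not the project's real configuration; note that a query matching both sets fans out to both retrieval paths:

```python
# Sketch of keyword-based retrieval routing; keyword lists are illustrative.
STRUCTURED_KEYWORDS = {"order", "customer", "shipping", "inventory"}
UNSTRUCTURED_KEYWORDS = {"manual", "policy", "warranty", "how to"}

def route_query(query: str) -> list:
    """Return the retrieval paths a query should hit."""
    q = query.lower()
    routes = []
    if any(k in q for k in STRUCTURED_KEYWORDS):
        routes.append("neo4j")       # Text2Cypher -> Neo4j graph query
    if any(k in q for k in UNSTRUCTURED_KEYWORDS):
        routes.append("graphrag")    # GraphRAG Global Search
    return routes or ["graphrag"]    # default to semantic search
```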

The engineering implementation of Text2Cypher and the hybrid retrieval routing strategy will be covered in detail in Part 6 of this series: End-to-End Wrap-Up: Hybrid Knowledge Base and Capability Closure.

4.2 Retrieval Flow Example

User query: "What is the shipping status of my Order #123? What are the after-sales policies for Product A?"

  1. The Agent identifies that the query contains both a structured component (order shipping) and an unstructured component (after-sales policy);
  2. Structured part: Converted to a Cypher query and executed against Neo4j:
   // Illustrative pseudocode; $order_id is a query parameter
   MATCH (o:Order)-[:SHIPPED_VIA]->(s:Shipper)
   WHERE o.id = $order_id
   RETURN s.name, s.contact
  3. Unstructured part: GraphRAG Global Search is invoked to retrieve content related to "Product A after-sales policy";
  4. Results from both retrieval paths are merged and returned as a unified response.

5. Key Pitfalls and Optimizations

5.1 Neo4j: Pitfalls and Optimizations

  • Issue 1: Inconsistent data import formats

    • Symptom: Non-uniform field types in CSV files caused import failures;
    • Solution: Normalize field types during the data cleaning stage and add type validation logic.
  • Issue 2: Poor query performance at scale

    • Symptom: Complex path queries exceeded 2s response time;
    • Solution: Create indexes on frequently queried node properties; optimize Cypher statements; use PROFILE to analyze query plans.

5.2 MinerU + LitServe: Pitfalls and Optimizations

  • Issue 1: Loss of table structure during parsing

    • Symptom: Complex tables were parsed with corrupted structure;
    • Solution: Use MinerU's officially supported table-specialized parsing model to improve table recognition accuracy.
  • Issue 2: Slow parsing speed

    • Symptom: Single-GPU parsing of a 100-page PDF took over 5 minutes;
    • Solution: Enable multi-GPU parallel parsing via LitServe; optimize model loading strategy; combine with multi-instance load balancing to improve throughput.

5.3 GraphRAG: Pitfalls and Optimizations

  • Issue 1: Chunking breaks contextual continuity

    • Symptom: Traditional fixed-size chunking split cross-chapter associated content;
    • Solution: Apply structure-aware chunking strategy, preserving contextual integrity by respecting heading hierarchy.
  • Issue 2: Loss of table/image semantic information

    • Symptom: After chunking, tables and images retained only links with no contextual description;
    • Solution: Add metadata descriptions for tables and images during the semantic enrichment stage and insert them at the corresponding positions in the Markdown.

6. Quantitative Results

All metrics are validated on 100 e-commerce product manual PDFs, 100 annotated customer service query test cases, and a dual RTX 4090 GPU test environment:

| Metric | Result |
| --- | --- |
| Neo4j total nodes | 13,204 |
| Neo4j total edges | 28,762 |
| Structured query accuracy | 98% |
| Table parsing accuracy | 95% |
| Average per-page PDF parsing time | < 1s (multi-GPU parallel) |
| Entity extraction accuracy | 93% |
| Unstructured retrieval accuracy | 89% |
| Hybrid retrieval average response time | 1.2s |

Results may vary across domains and document types.


7. Deployment Boundaries and Series Continuity

7.1 Deployment Boundaries

This hybrid knowledge base data pipeline is optimized for e-commerce knowledge graph Q&A scenarios. Domains such as healthcare and finance will require adjustments to entity extraction rules and security policies. Production-grade iteration should further incorporate Text2Cypher and hybrid retrieval routing strategies.

7.2 Series Continuity

  • GitHub repository: llm-customer-service (Tag: v0.5.0-graphrag-data-pipeline)
  • Backward reference: Builds on Part 1 Full MVP Architecture Breakdown, addressing the core bottleneck of insufficient multi-source data and long document support.
  • Next up: Part 3 will focus on production-grade service wrapping for GraphRAG indexes, covering API design, retrieval mode decision-making across four modes, and high-availability guarantees. Stay tuned.
  • Series finale: Part 8 will provide a complete retrospective of all architecture decisions, engineering pitfalls, and quantifiable outcomes from MVP to production-grade system, forming a full end-to-end engineering practice record.
