Aarav Rana

GraphRAG Pipeline: Airbyte Ingestion to Neo4j Knowledge Graph

Introduction

I’ve spent the last year building RAG systems, and I’ve hit a hard ceiling.

Vector databases are fantastic at finding similar text. If you ask, "How do I reset my password?", a vector search finds the password reset doc immediately.

But they are terrible at relationships.

If I ask my internal dev bot: "Who is currently working on the ticket related to the billing API latency?", a vector DB fails. It finds documents with "billing" and "latency," but it has no concept of "Assigned To" or "Created By." It sees words; it doesn't see the graph.

To fix this, I stopped trying to force everything into vectors and moved to GraphRAG.

This is the architecture I built to solve it: Airbyte handles the messy API extraction, Neo4j stores the topology, and Gemini 3.0 translates my English questions into Cypher queries.

View the Full Source Code on GitHub


The Architecture

We aren't just moving tables; we are building a semantic network.

  1. Ingestion (Airbyte): Pulls raw data (Issues, Pull Requests, Users) from GitHub/Jira/Salesforce, so I don't have to write custom scrapers.

  2. Storage (Neo4j): We need a native graph database to traverse deep relationships (e.g., User -> Team -> Ticket -> PR).

  3. Reasoning (Gemini 3.0 Pro): We need an LLM that is strictly logical to write accurate database queries. The 3.0 Pro model is significantly better at SQL/Cypher generation than previous iterations.


Why I Ditched Vector-Only RAG

In my previous iteration of this project, I used a standard vector database. It worked fine for questions like "What is an error?" but failed miserably at "Who is responsible for the error?".

I found that Vector DBs are great at Similarity but terrible at Topology. If the relationship between "User" and "Issue" isn't explicitly written in text (e.g., "John created issue #4"), the vector search misses it. Neo4j was the only way to make those invisible links visible to the LLM.


Prerequisites

  • Docker & Docker Compose installed.

  • Airbyte (Open Source version running locally).

  • Neo4j Database (Neo4j Aura Free Tier or local Docker).

  • Python 3.10+.


Step 1: Local Infrastructure

I run Neo4j in Docker to keep my host machine clean. We need the APOC plugin enabled to handle some of the JSON parsing later.

docker-compose.yml

version: '3.8'
services:
  neo4j:
    image: neo4j:5.15.0
    container_name: graphrag-neo4j
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      NEO4J_AUTH: neo4j/password
      NEO4J_PLUGINS: '["apoc"]'

Start the instance:

docker-compose up -d
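
Before moving on, it's worth confirming the database is actually reachable. Here's a quick sanity check with the official Python driver, assuming the credentials from the docker-compose.yml above:

from neo4j import GraphDatabase

# Credentials match the NEO4J_AUTH value in docker-compose.yml.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Raises an exception if the container isn't accepting connections yet.
driver.verify_connectivity()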

Step 2: The "Airbyte" Data Dump

When Airbyte syncs data into a database, it doesn't magically create a graph. It dumps "Raw Tables." For example, a GitHub Issue becomes a node labeled _AirbyteRawGitHubIssues containing a giant JSON blob.

To make this guide reproducible, I wrote a script that mimics this exact "Raw Destination" format.

src/seed_airbyte_data.py

tx.run("""
    CREATE (n:_AirbyteRawGitHubIssues {
        _airbyte_data: $data,
        _airbyte_ab_id: randomUUID()
    })
""", data=json.dumps(issue))
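
For reference, here's a minimal, self-contained sketch of such a seeder (the sample record is illustrative and shaped like GitHub's Issues API payload):

import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Illustrative sample record; the payload shape mirrors GitHub's Issues API.
sample_issues = [
    {"id": 101, "title": "Memory Leak in worker pool", "user": {"login": "ai_wizard"}},
]

def seed_issue(tx, issue):
    # Store the whole payload as a JSON string, just like Airbyte's raw format.
    tx.run("""
        CREATE (n:_AirbyteRawGitHubIssues {
            _airbyte_data: $data,
            _airbyte_ab_id: randomUUID()
        })
    """, data=json.dumps(issue))

with driver.session() as session:
    for issue in sample_issues:
        session.execute_write(seed_issue, issue)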

Run the seeder:

python src/seed_airbyte_data.py

Now we have the data, but it's messy: just disconnected JSON blobs. This is exactly where standard RAG hits its limits.


Step 3: Transforming JSON to Graph

This is the most critical part of the pipeline. We need to take those JSON blobs and "explode" them into entities.

We need to tell the database: "Take the user.login field from the Issue JSON, find the User node with that login, and draw a :CREATED relationship between them."

src/build_graph.py

def transform_graph(tx):
    # Pass 1: explode each raw user blob into a User node.
    tx.run("""
        MATCH (r:_AirbyteRawGitHubUsers)
        WITH r, apoc.convert.fromJsonMap(r._airbyte_data) AS data
        MERGE (u:User {login: data.login})
    """)

    # Pass 2: explode each raw issue blob into an Issue node
    # (keeping the title, which we'll query against later),
    # then link it back to the User who created it.
    tx.run("""
        MATCH (r:_AirbyteRawGitHubIssues)
        WITH r, apoc.convert.fromJsonMap(r._airbyte_data) AS data
        MERGE (i:Issue {id: data.id})
        SET i.title = data.title

        WITH i, data
        MATCH (u:User {login: data.user.login})

        MERGE (u)-[:CREATED]->(i)
    """)
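
For completeness, a minimal sketch of how I'd invoke transform_graph with the driver (connection details assume the Docker setup from Step 1):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# execute_write runs the function in a managed transaction and
# retries automatically on transient errors.
with driver.session() as session:
    session.execute_write(transform_graph)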

Run the build script:

python src/build_graph.py

If you check the Neo4j Browser now, you won't see isolated records. You'll see a web.
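
To verify, run a quick traversal in the Neo4j Browser (the output depends on your seed data):

MATCH (u:User)-[:CREATED]->(i:Issue)
RETURN u.login, i.title
LIMIT 5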


Step 4: The Agent (Gemini 3.0)

Standard RAG searches for keywords. GraphRAG searches for paths.

To make this work, we use LangChain to pipe our graph schema into Gemini 3.0 Pro. We have to be very strict with the prompt, because LLMs love to hallucinate property names (guessing username instead of login).

src/query_graph.py

from langchain.chains import GraphCypherQAChain
from langchain.prompts import PromptTemplate
from langchain_community.graphs import Neo4jGraph
from langchain_google_genai import ChatGoogleGenerativeAI

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

llm = ChatGoogleGenerativeAI(model="gemini-3.0-pro", temperature=0)

CYPHER_GENERATION_TEMPLATE = """
Schema:
{schema}

CRITICAL RULES:
1. For User nodes, ALWAYS use the property 'login'.
2. The relationship is [:CREATED], do NOT guess [:AUTHORED].
3. Use 'CONTAINS' for partial string matching.

The question is: {question}
"""

CYPHER_PROMPT = PromptTemplate(
    input_variables=["schema", "question"],
    template=CYPHER_GENERATION_TEMPLATE,
)

chain = GraphCypherQAChain.from_llm(
    llm,
    graph=graph,
    cypher_prompt=CYPHER_PROMPT,
    verbose=True,
    # Recent LangChain releases require this opt-in, since the
    # generated Cypher executes directly against the database.
    allow_dangerous_requests=True,
)
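
Running a question through the chain is then a single call (GraphCypherQAChain takes its input under the "query" key and returns the answer under "result"):

response = chain.invoke({"query": "Who created the issue about 'Memory Leak'?"})
print(response["result"])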

Step 5: Result

Let's try a query that requires hopping from a User to an Issue based on a partial title match.

Query: "Who created the issue about 'Memory Leak'?"

Generated Cypher:

MATCH (u:User)-[:CREATED]->(i:Issue)
WHERE toLower(i.title) CONTAINS toLower('Memory Leak')
RETURN u.login

Result:

💡 AI: ai_wizard created the issue.



Final Thoughts

Building this pipeline made me realize that data topology is just as important as data content.

By using Airbyte for the plumbing and Neo4j for the structure, we can build AI agents that actually understand context, not just keywords. And by leveraging Gemini 3.0 Pro, we get high-accuracy query generation without the massive cost overhead of older models.

View the Full Source Code on GitHub
