Traditional RAG makes AI agents hallucinate statistics and aggregations. This demo builds a travel booking agent with Strands Agents and compares RAG (FAISS) against GraphRAG (Neo4j) to measure which approach reduces hallucinations when answering queries over 300 hotel FAQ documents.
When AI Agents Don't Just Answer Wrong—They Act Wrong
In the previous blog post, we explored at a high level why AI agents hallucinate and introduced four essential techniques to stop them: GraphRAG, semantic tool selection, neurosymbolic guardrails, and multi-agent validation. Now we're going to dive deeper into each one. This is Part 1: we'll build a travel booking agent, load 300 hotel FAQ documents, and measure exactly where traditional RAG breaks down and how GraphRAG with Neo4j eliminates those failures.
AI agents differ from chatbots. A chatbot giving incorrect information is annoying. An agent hallucinating during execution is catastrophic—it might fabricate API parameters, invent success confirmations after failures, or execute actions based on false beliefs.
Recent research (MetaRAG, 2025) argues you cannot eliminate hallucinations—they're inherent to how LLMs work. The focus has shifted to detecting, containing, and mitigating them in production.
This Series: 4 Production Techniques
Part 1 (This Post): GraphRAG - Relationship-aware knowledge graphs preventing hallucinations in aggregations and precise queries
Part 2: Semantic Tool Selection - Vector-based tool filtering for accurate tool selection
Part 3: Neurosymbolic Guardrails - Symbolic reasoning for verifiable decisions
Part 4: Multi-Agent Validation - Agent teams detecting hallucinations before damage
Code uses Strands Agents.
Go to the GitHub repository: sample-why-agents-fail
git clone https://github.com/aws-samples/sample-why-agents-fail
Part 1: When RAG Makes Agents Hallucinate
Traditional RAG retrieves similar documents using vector search. This works for semantic questions but fails when agents need precise information. Research (RAG-KG-IL, 2025) identifies three types of hallucinations this causes:
Fabricated statistics — the LLM generates plausible-sounding numbers from text chunks instead of computing them. The paper reports that RAG-only systems produced 49 hallucinated statements vs 35 with knowledge graph integration, and a 73% reduction compared to standalone LLMs.
Incomplete retrieval — Vector search returns top-k similar documents, missing relevant data scattered across hundreds of documents. The paper found RAG-only missed information in nearly every question (54 instances), while KG-integrated systems had near-zero incompleteness.
Out-of-domain fabrication — When no relevant data exists, RAG still returns similar-looking results and the LLM fabricates an answer. MetaRAG (2025) confirms this is inherent to how retrieval works: similarity search always returns something, even when nothing is relevant.
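The incomplete-retrieval failure is easy to reproduce with a toy corpus. The sketch below is illustrative only (plain NumPy cosine similarity standing in for FAISS, with made-up vectors rather than the demo's data): ten documents mention a pool, but a top-3 retrieval can surface at most three of them, so any count the LLM derives from retrieved chunks is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy corpus: 10 "pool" documents clustered around one direction, 90 unrelated ones.
pool_dir = np.eye(dim)[0]
pool_docs = pool_dir + 0.1 * rng.normal(size=(10, dim))
other_docs = rng.normal(size=(90, dim))
docs = np.vstack([pool_docs, other_docs])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# The query embedding points at the pool cluster.
query = pool_dir

# Top-k retrieval, as a vector index would do it.
k = 3
top_k = np.argsort(docs @ query)[::-1][:k]

# Ground truth: indices 0-9 are pool documents.
visible = sum(1 for i in top_k if i < 10)
print(f"Pool documents in corpus: 10; visible in top-{k}: {visible}")
```

Raising k helps only until the relevant documents outnumber k again; counting reliably requires querying the full dataset, which is exactly what the graph approach does.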
The Demo: Two Agents, Same Data, Different Approaches
The demo uses two separate agents querying the same 300 hotel FAQ documents:
Agent 1: Traditional RAG Agent
Uses FAISS vector similarity search as a Strands Agents custom tool. Given a query, it finds the 3 most similar documents and lets the LLM summarize them.
from strands import Agent, tool
from strands.models.openai import OpenAIModel
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# `index` (FAISS) and `documents` are built by load_vector_data.py

@tool
def search_faqs(query: str) -> str:
    """Search hotel FAQs using vector similarity (Traditional RAG)."""
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), 3)
    results = []
    for idx in indices[0]:
        doc = documents[idx]
        results.append(f"[{doc['filename']}]\n{doc['text'][:500]}...")
    return "\n\n".join(results)

rag_agent = Agent(
    tools=[search_faqs],
    system_prompt="You are a travel agent. Use vector search to find relevant FAQ information.",
    model=OpenAIModel(model_id="gpt-4o-mini"),
)
Limitation: The agent only sees k documents at a time (3 in this example). It cannot aggregate, count, or traverse relationships across the full dataset.
Note on embeddings: This demo uses SentenceTransformers (all-MiniLM-L6-v2) for vector embeddings — it runs locally, requires no API keys, and costs nothing. You can swap it for any embedding model: Amazon Nova Embeddings, OpenAI text-embedding-3-small, Cohere Embed, etc.
Agent 2: Graph-RAG Agent
Uses a Neo4j knowledge graph built automatically with neo4j-graphrag (neo4j-graphrag-python). The LLM writes Cypher queries to get precise answers.
Go to the GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/01-faq-graphrag-demo
from neo4j import GraphDatabase
from strands import Agent, tool
from strands.models.openai import OpenAIModel

# NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD come from the demo's configuration

@tool
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.

    Node labels: Hotel, Room, Amenity, Policy, Service
    Hotel properties: name, address, guestRating, totalRooms, email, phone
    Room properties: name (e.g. "Standard Room"), price, maxOccupancy
    Amenity properties: name (e.g. "Outdoor Swimming Pool", "WiFi")
    Policy properties: name (e.g. "Check-in Policy"), details

    Relationships:
    - (Hotel)-[:HAS_ROOM]->(Room)
    - (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    - (Hotel)-[:HAS_POLICY]->(Policy)
    - (Hotel)-[:PROVIDES_SERVICE]->(Service)

    Location is in the Hotel.address property (e.g. "789 Corniche el-Nil, Cairo 11519").
    To find hotels by location, use: WHERE h.address CONTAINS 'Cairo'
    """
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    with driver.session() as session:
        result = session.run(cypher_query)
        records = list(result)
        if not records:
            return "No results found."
        output = f"Found {len(records)} results:\n"
        for record in records[:15]:
            output += f"  {dict(record.items())}\n"
        return output

graph_agent = Agent(
    tools=[query_knowledge_graph],
    system_prompt="You are a travel agent. Use the knowledge base to answer questions accurately. You can run multiple queries to explore the data.",
    model=OpenAIModel(model_id="gpt-4o-mini"),
)
Key difference: The agent writes Cypher queries that execute native AVG(), COUNT(), and relationship traversals directly in the database.
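To see what database-level computation buys you, here is a tiny in-memory analogue of those Cypher queries. The hotel records below are invented for illustration; in the demo, Neo4j evaluates AVG(), COUNT(), and the Hotel → Room traversal over the full dataset:

```python
# Illustrative only: a tiny in-memory "graph" mirroring what the Cypher
# AVG()/COUNT()/traversal queries compute inside Neo4j. Hotel names,
# ratings, and prices are made up for this example.
hotels = [
    {"name": "Hotel Lumiere", "city": "Paris", "rating": 4.8,
     "amenities": {"WiFi", "Outdoor Swimming Pool"},
     "rooms": [{"name": "Standard Room", "price": 180}]},
    {"name": "Hotel Rive", "city": "Paris", "rating": 4.6,
     "amenities": {"WiFi"},
     "rooms": [{"name": "Suite", "price": 420}]},
]

# AVG(h.guestRating) over Paris hotels
paris = [h for h in hotels if h["city"] == "Paris"]
avg_rating = sum(h["rating"] for h in paris) / len(paris)

# COUNT(DISTINCT h) for hotels offering a pool amenity
pool_count = sum(1 for h in hotels if any("Pool" in a for a in h["amenities"]))

# Hotel -> Room traversal for the top-rated hotel
top = max(hotels, key=lambda h: h["rating"])
print(avg_rating, pool_count, top["rooms"])
```

The point is that these are exact operations over all records, not an LLM's summary of the top-k chunks it happened to retrieve.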
How Text2Cypher Works
The GraphRAG agent doesn't have hardcoded queries. Instead, it uses the Text2Cypher pattern — the LLM translates natural language into Cypher based on the graph schema described in the tool's docstring:
- User asks: "How many hotels have a swimming pool?"
- LLM reads the tool description containing the schema (node labels, properties, relationships)
- LLM generates: MATCH (h:Hotel)-[:OFFERS_AMENITY]->(a:Amenity) WHERE a.name CONTAINS 'Pool' RETURN COUNT(DISTINCT h)
- Tool executes the query against Neo4j and returns the result
The schema in the docstring is what grounds the LLM — without it, the LLM would guess node names and relationships. With it, the LLM generates valid Cypher that matches the actual graph structure.
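The wiring can be sketched in a few lines. `build_prompt` below is a hypothetical helper for illustration only; Strands Agents does this automatically by sending each tool's docstring to the model as its description:

```python
# Sketch of the grounding step: the schema lives in the tool's docstring,
# and the agent framework ships it to the LLM as the tool description.
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.
    Node labels: Hotel, Room, Amenity, Policy, Service
    Relationships: (Hotel)-[:HAS_ROOM]->(Room), (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    """
    ...  # execute against Neo4j (omitted here)

def build_prompt(question: str) -> str:
    # Hypothetical helper: inline the real schema so the LLM can't guess names.
    schema = query_knowledge_graph.__doc__
    return (
        "Translate the question into a single Cypher query.\n"
        f"Graph schema:\n{schema}\n"
        f"Question: {question}\nCypher:"
    )

prompt = build_prompt("How many hotels have a swimming pool?")
print("OFFERS_AMENITY" in prompt)  # the LLM sees the real relationship names
```

Because the actual label and relationship names are in the prompt, the generated Cypher references nodes that exist instead of invented ones.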
How the Knowledge Graph is Built
The graph is built automatically using neo4j-graphrag — no hardcoded schema. Research on automated knowledge graph construction (RAKG, 2025) shows LLMs can extract entities and relationships from unstructured text:
Go to the GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/01-faq-graphrag-demo
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    embedder=embedder,
    from_pdf=False,
    perform_entity_resolution=True,
)

# LLM discovers entities and relationships from each document
await kg_builder.run_async(text=document_text)
The LLM reads each FAQ and automatically discovers entity types (Hotel, Room, Amenity, Policy) and relationships (HAS_ROOM, OFFERS_AMENITY, HAS_POLICY). No manual schema definition needed — if you add new documents with new entity types (Restaurant, Airport, etc.), the LLM discovers them automatically.
Results: 4 Tests Validating the Research
Both agents answer the same questions. We compare their responses against what the research papers predict:
Run: travel_agent_demo.py
Test 1: Aggregation — "What is the average guest rating across all hotels in Paris?"
Research (RAG-KG-IL, 2025) predicts RAG cannot compute aggregations from text chunks.
- RAG: Manually calculates from the 2 docs it found → 4.7 (correct, but only because it happened to retrieve both)
- GraphRAG: Native AVG(h.guestRating) in Cypher → 4.7 ✅ Database-level computation
Test 2: Precise Counting — "How many hotels have a swimming pool as an amenity?"
Research (MetaRAG, 2025) shows RAG retrieves top-k documents, making counting impossible.
- RAG: ❌ "I don't have the data needed to answer" — cannot count across 300 docs (only sees 3)
- GraphRAG: ✅ "133 hotels" — exact count with Cypher COUNT()
Test 3: Multi-hop Reasoning — "What are the room types and prices for the highest rated hotel?"
Research (RAG-KG-IL, 2025) shows RAG falls short when tasks require deeper inference across interconnected data.
- RAG: ⚠️ Found one hotel but "does not include room types" — cannot traverse relationships
- GraphRAG: Traversed Hotel → Room nodes via Cypher, found top-rated hotels and their room data
Test 4: Out-of-domain — "Tell me about hotels in Antarctica"
Research (MetaRAG, 2025) shows RAG hallucinates when data doesn't exist because vector search always returns similar results.
- RAG: ❌ HALLUCINATED — fabricated "Research Stations", "Expedition Cruises", "Specialized Lodges" that DO NOT exist in the data
- GraphRAG: ✅ "No hotels listed in Antarctica" — honest, does not fabricate
RAG always returns something. GraphRAG returns empty results when data doesn't exist.
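This asymmetry is mechanical, and a sketch makes it visible. The document vectors and hotel names below are invented for illustration: cosine similarity over 300 random vectors still returns a top-3 for an out-of-domain query, while an exact, structured lookup simply comes back empty.

```python
import numpy as np

rng = np.random.default_rng(1)

# 300 toy document vectors -- none of them is about Antarctica.
docs = rng.normal(size=(300, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# An out-of-domain query vector.
query = rng.normal(size=8)
query /= np.linalg.norm(query)

# Vector search still returns its 3 "most similar" documents...
top_3 = np.argsort(docs @ query)[::-1][:3]
print("RAG returns:", top_3)  # always 3 results, relevant or not

# ...while an exact, structured lookup returns nothing to embellish.
hotels_by_city = {"Paris": ["Hotel Lumiere"], "Cairo": ["Nile View Hotel"]}
print("Graph returns:", hotels_by_city.get("Antarctica", []))
```

An empty result set gives the LLM nothing to summarize, which is why the GraphRAG agent answers honestly instead of fabricating.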
When to Use GraphRAG
Use GraphRAG:
- Precise queries (numerical filtering, exact matches)
- Aggregations (counts, averages, sums)
- Relationships (multi-hop traversal)
- Structured data (clear schemas)
- Verifiable results
Use RAG:
- Semantic search (similar concepts)
- Unstructured text (documents, articles)
- Fuzzy matching (approximate results)
- Simple retrieval
Try It Yourself
git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/01-faq-graphrag-demo
uv venv && uv pip install -r requirements.txt
# Build FAISS vector index
uv run load_vector_data.py
# Build Neo4j knowledge graph (requires Neo4j with APOC plugin)
uv run build_graph.py
# Run comparison
uv run travel_agent_demo.py
What's Next
GraphRAG prevents hallucinations in agent responses. But agents still hallucinate during tool selection, choosing the wrong tool.
Part 2: Semantic Tool Selection shows how vector-based tool filtering reduces tool selection errors when agents have dozens of similar tools.
Key Takeaways
- RAG makes agents hallucinate statistics: LLMs estimate instead of calculating
- GraphRAG provides precision: Native aggregations, exact filtering
- Explicit failure prevents hallucinations: Empty results vs similar matches
- Automatic graph construction: neo4j-graphrag discovers entities without hardcoded schemas
- Two-agent comparison: Same data, different tools — measurable difference
- Strands Agents makes this simple: A @tool decorator and a few lines of configuration are all it takes — define a tool, wire it to an agent, and you have a working RAG or GraphRAG system. Swapping between approaches means swapping one tool, not rewriting your agent.
Thanks!