Traditional RAG makes AI agents hallucinate statistics and aggregations. This demo builds a travel booking agent with Strands Agents and compares RAG (FAISS) against GraphRAG (Neo4j) to measure which approach reduces hallucinations when answering queries over 300 hotel FAQ documents.
When AI Agents Don't Just Answer Wrong—They Act Wrong
In the previous blog post, we explored at a high level why AI agents hallucinate and introduced four essential techniques to stop them: GraphRAG, semantic tool selection, neurosymbolic guardrails, and multi-agent validation. Now we're going to dive deeper into each one. This is Part 1: we'll build a travel booking agent, load 300 hotel FAQ documents, and measure exactly where traditional RAG breaks down and how GraphRAG with Neo4j eliminates those failures.
AI agents differ from chatbots. A chatbot giving incorrect information is annoying. An agent hallucinating during execution is catastrophic—it might fabricate API parameters, invent success confirmations after failures, or execute actions based on false beliefs.
Recent research (MetaRAG, 2025) argues you cannot eliminate hallucinations—they're inherent to how LLMs work. The focus has shifted to detecting, containing, and mitigating them in production.
This Series: 4 Production Techniques
Part 1 (This Post): GraphRAG - Relationship-aware knowledge graphs preventing hallucinations in aggregations and precise queries
Part 2: Semantic Tool Selection - Vector-based tool filtering for accurate tool selection
Part 3: Neurosymbolic Guardrails - Symbolic reasoning for verifiable decisions
Part 4: Multi-Agent Validation - Agent teams detecting hallucinations before damage
Code uses Strands Agents.
Go to the GitHub repository: sample-why-agents-fail
git clone https://github.com/aws-samples/sample-why-agents-fail
Part 1: When RAG Makes Agents Hallucinate
Traditional RAG retrieves similar documents using vector search. This works for semantic questions but fails when agents need precise information. Research (RAG-KG-IL, 2025) identifies three types of hallucinations this causes:
Fabricated statistics — the LLM generates plausible-sounding numbers from text chunks instead of computing them. The paper reports that RAG-only systems produced 49 hallucinated statements vs 35 with knowledge graph integration, and a 73% reduction compared to standalone LLMs.
Incomplete retrieval — Vector search returns top-k similar documents, missing relevant data scattered across hundreds of documents. The paper found RAG-only missed information in nearly every question (54 instances), while KG-integrated systems had near-zero incompleteness.
Out-of-domain fabrication — When no relevant data exists, RAG still returns similar-looking results and the LLM fabricates an answer. MetaRAG (2025) confirms this is inherent to how retrieval works: similarity search always returns something, even when nothing is relevant.
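The incomplete-retrieval failure is easy to reproduce with a toy corpus. The sketch below is illustrative only (plain NumPy cosine similarity standing in for FAISS, with made-up vectors rather than the demo's data): ten documents mention a pool, but a top-3 retrieval can surface at most three of them, so any count the LLM derives from retrieved chunks is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy corpus: 10 "pool" documents clustered around one direction, 90 unrelated ones.
pool_dir = np.eye(dim)[0]
pool_docs = pool_dir + 0.1 * rng.normal(size=(10, dim))
other_docs = rng.normal(size=(90, dim))
docs = np.vstack([pool_docs, other_docs])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# The query embedding points at the pool cluster.
query = pool_dir

# Top-k retrieval, as a vector index would do it.
k = 3
top_k = np.argsort(docs @ query)[::-1][:k]

# Ground truth: indices 0-9 are pool documents.
visible = sum(1 for i in top_k if i < 10)
print(f"Pool documents in corpus: 10; visible in top-{k}: {visible}")
```

Raising k helps only until the relevant documents outnumber k again; counting reliably requires querying the full dataset, which is exactly what the graph approach does.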
The Demo: Two Agents, Same Data, Different Approaches
The demo uses two separate agents querying the same 300 hotel FAQ documents:
Agent 1: Traditional RAG Agent
Uses FAISS vector similarity search as a Strands Agents custom tool. Given a query, it finds the 3 most similar documents and lets the LLM summarize them.
from strands import Agent, tool
from strands.models.openai import OpenAIModel
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# `index` (FAISS) and `documents` are built by load_vector_data.py

@tool
def search_faqs(query: str) -> str:
    """Search hotel FAQs using vector similarity (Traditional RAG)."""
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), 3)
    results = []
    for idx in indices[0]:
        doc = documents[idx]
        results.append(f"[{doc['filename']}]\n{doc['text'][:500]}...")
    return "\n\n".join(results)

rag_agent = Agent(
    tools=[search_faqs],
    system_prompt="You are a travel agent. Use vector search to find relevant FAQ information.",
    model=OpenAIModel(model_id="gpt-4o-mini"),
)
Limitation: The agent only sees k documents at a time (3 in this example). It cannot aggregate, count, or traverse relationships across the full dataset.
Note on embeddings: This demo uses SentenceTransformers (all-MiniLM-L6-v2) for vector embeddings — it runs locally, requires no API keys, and costs nothing. You can swap it for any embedding model: Amazon Nova Embeddings, OpenAI text-embedding-3-small, Cohere Embed, etc.
Agent 2: Graph-RAG Agent
Uses a Neo4j knowledge graph built automatically with neo4j-graphrag (neo4j-graphrag-python). The LLM writes Cypher queries to get precise answers.
Go to the GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/01-faq-graphrag-demo
from neo4j import GraphDatabase
from strands import Agent, tool
from strands.models.openai import OpenAIModel

# NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD come from the demo's configuration

@tool
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.

    Node labels: Hotel, Room, Amenity, Policy, Service
    Hotel properties: name, address, guestRating, totalRooms, email, phone
    Room properties: name (e.g. "Standard Room"), price, maxOccupancy
    Amenity properties: name (e.g. "Outdoor Swimming Pool", "WiFi")
    Policy properties: name (e.g. "Check-in Policy"), details

    Relationships:
    - (Hotel)-[:HAS_ROOM]->(Room)
    - (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    - (Hotel)-[:HAS_POLICY]->(Policy)
    - (Hotel)-[:PROVIDES_SERVICE]->(Service)

    Location is in the Hotel.address property (e.g. "789 Corniche el-Nil, Cairo 11519").
    To find hotels by location, use: WHERE h.address CONTAINS 'Cairo'
    """
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    with driver.session() as session:
        result = session.run(cypher_query)
        records = list(result)
        if not records:
            return "No results found."
        output = f"Found {len(records)} results:\n"
        for record in records[:15]:
            output += f"  {dict(record.items())}\n"
        return output

graph_agent = Agent(
    tools=[query_knowledge_graph],
    system_prompt="You are a travel agent. Use the knowledge base to answer questions accurately. You can run multiple queries to explore the data.",
    model=OpenAIModel(model_id="gpt-4o-mini"),
)
Key difference: The agent writes Cypher queries that execute native AVG(), COUNT(), and relationship traversals directly in the database.
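To see what database-level computation buys you, here is a tiny in-memory analogue of those Cypher queries. The hotel records below are invented for illustration; in the demo, Neo4j evaluates AVG(), COUNT(), and the Hotel → Room traversal over the full dataset:

```python
# Illustrative only: a tiny in-memory "graph" mirroring what the Cypher
# AVG()/COUNT()/traversal queries compute inside Neo4j. Hotel names,
# ratings, and prices are made up for this example.
hotels = [
    {"name": "Hotel Lumiere", "city": "Paris", "rating": 4.8,
     "amenities": {"WiFi", "Outdoor Swimming Pool"},
     "rooms": [{"name": "Standard Room", "price": 180}]},
    {"name": "Hotel Rive", "city": "Paris", "rating": 4.6,
     "amenities": {"WiFi"},
     "rooms": [{"name": "Suite", "price": 420}]},
]

# AVG(h.guestRating) over Paris hotels
paris = [h for h in hotels if h["city"] == "Paris"]
avg_rating = sum(h["rating"] for h in paris) / len(paris)

# COUNT(DISTINCT h) for hotels offering a pool amenity
pool_count = sum(1 for h in hotels if any("Pool" in a for a in h["amenities"]))

# Hotel -> Room traversal for the top-rated hotel
top = max(hotels, key=lambda h: h["rating"])
print(avg_rating, pool_count, top["rooms"])
```

The point is that these are exact operations over all records, not an LLM's summary of the top-k chunks it happened to retrieve.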
How Text2Cypher Works
The GraphRAG agent doesn't have hardcoded queries. Instead, it uses the Text2Cypher pattern — the LLM translates natural language into Cypher based on the graph schema described in the tool's docstring:
- User asks: "How many hotels have a swimming pool?"
- LLM reads the tool description containing the schema (node labels, properties, relationships)
- LLM generates: MATCH (h:Hotel)-[:OFFERS_AMENITY]->(a:Amenity) WHERE a.name CONTAINS 'Pool' RETURN COUNT(DISTINCT h)
- Tool executes the query against Neo4j and returns the result
The schema in the docstring is what grounds the LLM — without it, the LLM would guess node names and relationships. With it, the LLM generates valid Cypher that matches the actual graph structure.
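The wiring can be sketched in a few lines. `build_prompt` below is a hypothetical helper for illustration only; Strands Agents does this automatically by sending each tool's docstring to the model as its description:

```python
# Sketch of the grounding step: the schema lives in the tool's docstring,
# and the agent framework ships it to the LLM as the tool description.
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.
    Node labels: Hotel, Room, Amenity, Policy, Service
    Relationships: (Hotel)-[:HAS_ROOM]->(Room), (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    """
    ...  # execute against Neo4j (omitted here)

def build_prompt(question: str) -> str:
    # Hypothetical helper: inline the real schema so the LLM can't guess names.
    schema = query_knowledge_graph.__doc__
    return (
        "Translate the question into a single Cypher query.\n"
        f"Graph schema:\n{schema}\n"
        f"Question: {question}\nCypher:"
    )

prompt = build_prompt("How many hotels have a swimming pool?")
print("OFFERS_AMENITY" in prompt)  # the LLM sees the real relationship names
```

Because the actual label and relationship names are in the prompt, the generated Cypher references nodes that exist instead of invented ones.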
How the Knowledge Graph is Built
The graph is built automatically using neo4j-graphrag — no hardcoded schema. Research on automated knowledge graph construction (RAKG, 2025) shows LLMs can extract entities and relationships from unstructured text:
Go to the GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/01-faq-graphrag-demo
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    embedder=embedder,
    from_pdf=False,
    perform_entity_resolution=True,
)

# LLM discovers entities and relationships from each document
await kg_builder.run_async(text=document_text)
The LLM reads each FAQ and automatically discovers entity types (Hotel, Room, Amenity, Policy) and relationships (HAS_ROOM, OFFERS_AMENITY, HAS_POLICY). No manual schema definition needed — if you add new documents with new entity types (Restaurant, Airport, etc.), the LLM discovers them automatically.
Results: 4 Tests Validating the Research
Both agents answer the same questions. We compare their responses against what the research papers predict:
Run: travel_agent_demo.py
Test 1: Aggregation — "What is the average guest rating across all hotels in Paris?"
Research (RAG-KG-IL, 2025) predicts RAG cannot compute aggregations from text chunks.
- RAG: Manually calculates from the 2 docs it found → 4.7 (correct, but only because it happened to retrieve both)
- GraphRAG: Native AVG(h.guestRating) in Cypher → 4.7 ✅ Database-level computation
Test 2: Precise Counting — "How many hotels have a swimming pool as an amenity?"
Research (MetaRAG, 2025) shows RAG retrieves top-k documents, making counting impossible.
- RAG: ❌ "I don't have the data needed to answer" — cannot count across 300 docs (only sees 3)
- GraphRAG: ✅ "133 hotels" — exact count with Cypher COUNT()
Test 3: Multi-hop Reasoning — "What are the room types and prices for the highest rated hotel?"
Research (RAG-KG-IL, 2025) shows RAG falls short when tasks require deeper inference across interconnected data.
- RAG: ⚠️ Found one hotel but "does not include room types" — cannot traverse relationships
- GraphRAG: Traversed Hotel → Room nodes via Cypher, found top-rated hotels and their room data
Test 4: Out-of-domain — "Tell me about hotels in Antarctica"
Research (MetaRAG, 2025) shows RAG hallucinates when data doesn't exist because vector search always returns similar results.
- RAG: ❌ HALLUCINATED — fabricated "Research Stations", "Expedition Cruises", "Specialized Lodges" that DO NOT exist in the data
- GraphRAG: ✅ "No hotels listed in Antarctica" — honest, does not fabricate
RAG always returns something. GraphRAG returns empty results when data doesn't exist.
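This asymmetry is mechanical, and a sketch makes it visible. The document vectors and hotel names below are invented for illustration: cosine similarity over 300 random vectors still returns a top-3 for an out-of-domain query, while an exact, structured lookup simply comes back empty.

```python
import numpy as np

rng = np.random.default_rng(1)

# 300 toy document vectors -- none of them is about Antarctica.
docs = rng.normal(size=(300, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# An out-of-domain query vector.
query = rng.normal(size=8)
query /= np.linalg.norm(query)

# Vector search still returns its 3 "most similar" documents...
top_3 = np.argsort(docs @ query)[::-1][:3]
print("RAG returns:", top_3)  # always 3 results, relevant or not

# ...while an exact, structured lookup returns nothing to embellish.
hotels_by_city = {"Paris": ["Hotel Lumiere"], "Cairo": ["Nile View Hotel"]}
print("Graph returns:", hotels_by_city.get("Antarctica", []))
```

An empty result set gives the LLM nothing to summarize, which is why the GraphRAG agent answers honestly instead of fabricating.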
When to Use GraphRAG
Use GraphRAG:
- Precise queries (numerical filtering, exact matches)
- Aggregations (counts, averages, sums)
- Relationships (multi-hop traversal)
- Structured data (clear schemas)
- Verifiable results
Use RAG:
- Semantic search (similar concepts)
- Unstructured text (documents, articles)
- Fuzzy matching (approximate results)
- Simple retrieval
Try It Yourself
git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/01-faq-graphrag-demo
uv venv && uv pip install -r requirements.txt
# Build FAISS vector index
uv run load_vector_data.py
# Build Neo4j knowledge graph (requires Neo4j with APOC plugin)
uv run build_graph.py
# Run comparison
uv run travel_agent_demo.py
What's Next
GraphRAG prevents hallucinations in agent responses. But agents still hallucinate during tool selection, choosing the wrong tool.
Part 2: Semantic Tool Selection shows how vector-based tool filtering reduces tool selection errors when agents have dozens of similar tools.
Key Takeaways
- RAG makes agents hallucinate statistics: LLMs estimate instead of calculating
- GraphRAG provides precision: Native aggregations, exact filtering
- Explicit failure prevents hallucinations: Empty results vs similar matches
- Automatic graph construction: neo4j-graphrag discovers entities without hardcoded schemas
- Two-agent comparison: Same data, different tools — measurable difference
- Strands Agents makes this simple: A @tool decorator and a few lines of configuration are all it takes — define a tool, wire it to an agent, and you have a working RAG or GraphRAG system. Swapping between approaches means swapping one tool, not rewriting your agent.
Thanks!