AI agents can hallucinate when executing tasks—fabricating statistics, choosing wrong tools, ignoring business rules, and claiming success when operations fail. This guide demonstrates 4 research-backed techniques to stop these hallucinations: Graph-RAG for precise data retrieval, semantic tool selection for accurate tool choice, neurosymbolic guardrails for rule enforcement, and multi-agent validation for error detection.
What you'll learn:
- How Graph-RAG prevents statistical hallucinations with structured data
- Why semantic tool selection improves accuracy and reduces token costs
- How neurosymbolic guardrails with Strands Agents block invalid operations that prompt engineering can't stop
- How multi-agent validation catches hallucinations before they reach users
Code examples: All techniques include working demos with Strands Agents framework.
```bash
git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations
```
Each demo includes:
- Jupyter notebooks with step-by-step explanations
- Python scripts for quick testing
- Ground truth verification against real data
- Performance metrics and comparisons
Why AI Agent Hallucinations Matter
AI agents differ from chatbots in a critical way. When a chatbot gives incorrect information, it's annoying. When an agent hallucinates during execution, it's catastrophic—fabricating API parameters, inventing success confirmations after failures, or executing actions based on false beliefs.
Recent research (MetaRAG, 2025) shows that hallucinations cannot be fully eliminated—they're inherent to how LLMs work. The practical goal is to detect, contain, and mitigate them before they cause damage.
This guide explores 4 techniques tested on a travel booking agent with:
- 300 hotel FAQ documents for Graph-RAG testing
- 31 similar tools for semantic selection testing
- Complex business rules for neurosymbolic testing
- Multi-step validation flows for multi-agent testing
Prerequisites
Required:
- Python 3.9+
- LLM access: Amazon Bedrock, OpenAI, Anthropic, or Ollama (required for running agents)
- AWS credentials configured if using Bedrock (`aws configure`)
- Basic understanding of AI agents and tool calling
Key libraries: Strands Agents, Neo4j, FAISS, SentenceTransformers
Note on embeddings: Demos use SentenceTransformers (`all-MiniLM-L6-v2`) for vector embeddings — it runs locally with no API costs, making it fast and free to experiment with. You can swap in any embedding provider: Amazon Titan Embeddings, OpenAI, Cohere, etc.
Model configuration: See Strands Model Providers for setup instructions.
Technique 1: Graph-RAG for Precise Data Retrieval
Traditional RAG hallucinates statistics because it retrieves text chunks instead of executing precise calculations
When agents use traditional RAG for data-driven tasks, they face a fundamental limitation: vector search retrieves text, not structured data. Research (RAG-KG-IL, 2025) identifies three types of hallucinations this causes:
- Fabricated statistics — LLM generates plausible-sounding numbers from text chunks instead of computing them ("average rating is approximately 8.7" when no calculation occurred)
- Incomplete retrieval — Vector search returns top-k similar documents, missing relevant data scattered across hundreds of documents
- Out-of-domain fabrication — When no relevant data exists, RAG still returns similar-looking results and the LLM fabricates an answer instead of admitting ignorance
Research (MetaRAG, 2025) confirms this is inherent to how LLMs process unstructured data: they retrieve similar documents, then guess aggregations instead of executing calculations.
The demo compares two different agents querying the same 300 hotel FAQ documents:
- RAG Agent: Uses FAISS vector search → finds top 3 similar docs → LLM summarizes
- Graph-RAG Agent: Uses Neo4j knowledge graph → LLM writes Cypher queries → precise results
The knowledge graph is built automatically using neo4j-graphrag (RAKG, 2025) — the LLM discovers entities (Hotel, Room, Amenity, Policy) and relationships from unstructured text, no hardcoded schema:
```python
from strands import Agent, tool

# embed_model (SentenceTransformer), index (FAISS), documents, driver (Neo4j),
# and model are initialized in the demo's setup cells.

# RAG Agent — vector similarity search
@tool
def search_faqs(query: str) -> str:
    """Search hotel FAQs using vector similarity."""
    query_embedding = embed_model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), 3)
    return "\n".join([documents[idx]['text'][:500] for idx in indices[0]])

# Graph-RAG Agent — Cypher queries on knowledge graph
@tool
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.

    Node labels: Hotel, Room, Amenity, Policy, Service
    Relationships: (Hotel)-[:HAS_ROOM]->(Room), (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    """
    with driver.session() as session:
        result = session.run(cypher_query)
        records = list(result)
        if not records:
            return "No results found."
        return f"Found {len(records)} results:\n" + "\n".join(str(dict(r.items())) for r in records[:15])

rag_agent = Agent(tools=[search_faqs], model=model)
graph_agent = Agent(tools=[query_knowledge_graph], model=model)
```
Key insight: Graph-RAG reduces hallucinations because knowledge graphs provide structured, verifiable data — aggregations are computed by the database, relationships are explicit, and missing data returns empty results instead of fabricated answers. The LLM translates natural language into Cypher queries using the Text2Cypher pattern, grounded by the graph schema described in the tool's docstring.
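To make the Text2Cypher pattern concrete, here are illustrative natural-language → Cypher pairs of the kind the LLM might generate. The property names (`city`, `rating`, `name`) are assumptions for this sketch; the demo's actual schema is discovered automatically by neo4j-graphrag:

```python
# Illustrative natural-language → Cypher pairs for the Text2Cypher pattern.
# Property names (city, rating, name) are assumptions; the demo's schema is
# discovered automatically, so real queries may differ.
TEXT2CYPHER_EXAMPLES = {
    "Average rating of hotels in Paris?":
        "MATCH (h:Hotel {city: 'Paris'}) RETURN avg(h.rating) AS avg_rating",
    "How many hotels have a swimming pool?":
        "MATCH (h:Hotel)-[:OFFERS_AMENITY]->(:Amenity {name: 'Swimming Pool'}) "
        "RETURN count(DISTINCT h) AS hotels_with_pool",
    "Which room types does the best-rated hotel offer?":
        "MATCH (h:Hotel)-[:HAS_ROOM]->(r:Room) "
        "WITH h, collect(r.type) AS rooms ORDER BY h.rating DESC LIMIT 1 "
        "RETURN h.name, rooms",
}
for question, cypher in TEXT2CYPHER_EXAMPLES.items():
    print(question, "->", cypher)
```

Note that the aggregations (`avg`, `count`) run inside the database — the LLM never has to guess a number from text chunks.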
Results
```bash
git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/01-faq-graphrag-demo
```
Demo: Graph-RAG vs Traditional RAG Demo — Jupyter notebook comparing hallucination rates on 300 hotel FAQs with Neo4j knowledge graph.
| Paper Finding | Demo Result | Status |
|---|---|---|
| RAG cannot aggregate (RAG-KG-IL) | RAG failed to count swimming pools across 300 docs | ✅ Validated |
| Graph-RAG computes natively (RAKG) | Cypher returned exact count: 133 hotels with pool | ✅ Validated |
| RAG hallucinates out-of-domain (MetaRAG) | RAG fabricated Antarctica accommodation info | ✅ Validated |
| Graph-RAG fails honestly | "No hotels in Antarctica" — no fabrication | ✅ Validated |
| Query Type | RAG | Graph-RAG |
|---|---|---|
| Aggregation: "Average rating in Paris?" | ⚠️ Calculates from 2 docs only | ✅ Native AVG() across all |
| Counting: "Hotels with swimming pool?" | ❌ "I don't have the data" | ✅ Precise: 133 |
| Multi-hop: "Room types for best hotel?" | ❌ Cannot traverse | ✅ Hotel → Room traversal |
| Out-of-domain: "Hotels in Antarctica" | ❌ Fabricates answers | ✅ Honest: "No hotels" |
Technique 2: Semantic Tool Selection for Accurate Tool Choice
Research (Internal Representations, 2025) identifies a critical agent failure mode: tool-calling hallucinations increase with tool count. When agents have many similar tools, they exhibit:
- Function selection errors - Calling non-existent tools
- Function appropriateness errors - Choosing semantically wrong tools
- Parameter errors - Malformed or invalid arguments
- Completeness errors - Missing required parameters
- Tool bypass behavior - Generating outputs instead of calling tools
The dual problem:
- ❌ Hallucination risk: More tools = more inappropriate selections
- ❌ Token waste: Sending all tool descriptions on every call (e.g., 31 tools = ~4,500 tokens per query)
The root cause: Agents see all tool descriptions in the prompt, creating choice overload that leads to both accuracy degradation AND cost explosion.
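The back-of-envelope arithmetic behind the token-waste claim can be checked directly. The ~145 tokens per tool description is an assumed average chosen to be consistent with the ~4,500 figure above:

```python
# Rough arithmetic behind the token-waste claim. TOKENS_PER_TOOL is an
# assumed average description size, consistent with the ~4,500 figure.
TOKENS_PER_TOOL = 145
N_TOOLS, N_FILTERED = 31, 3

unfiltered = N_TOOLS * TOKENS_PER_TOOL    # tool tokens sent per query, no filtering
filtered = N_FILTERED * TOKENS_PER_TOOL   # tool tokens sent with top-3 filtering
reduction = 1 - N_FILTERED / N_TOOLS      # fraction of tool tokens saved

print(unfiltered, filtered, round(reduction, 2))  # 4495 435 0.9
```

The ~90% figure here lines up with the token reduction reported in the demo.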
The Solution: Semantic Tool Filtering
```bash
git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/02-semantic-tools-demo
```
Filter tools before the agent sees them using vector similarity. The technique compares the user query against tool descriptions using embeddings, then passes only the most relevant tools to the agent:
This demo uses FAISS with SentenceTransformers as a lightweight, local implementation — but the technique works with any vector store and embedding provider (OpenSearch, Pinecone, Amazon Titan Embeddings, etc.):
```python
from sentence_transformers import SentenceTransformer
import faiss

# Build index once
model = SentenceTransformer('all-MiniLM-L6-v2')
tool_embeddings = model.encode([tool.description for tool in ALL_TOOLS])
index = faiss.IndexFlatL2(384)  # 384 = all-MiniLM-L6-v2 embedding dimension
index.add(tool_embeddings.astype('float32'))  # FAISS requires float32

# Filter per query
query_embedding = model.encode([query])
distances, indices = index.search(query_embedding.astype('float32'), 3)
relevant_tools = [ALL_TOOLS[i] for i in indices[0]]
```
Dynamic tool swapping with Strands Agents
Strands provides dynamic tool swapping that preserves conversation memory:
```python
from strands import Agent

# Create agent once
agent = Agent(tools=[...], model=model)

# Swap tools per query without losing conversation history
for query in conversation:
    relevant_tools = search_tools(query, top_k=3)
    agent.tool_registry.registry.clear()
    for tool in relevant_tools:
        agent.tool_registry.register_tool(tool)
    response = agent(query)  # agent.messages preserved
```
Key advantage: Strands Agents maintains memory while tools change dynamically.
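The `search_tools` helper in the loop above can be sketched without any dependencies. The demo uses SentenceTransformers + FAISS; here a toy bag-of-words embedding stands in so the ranking logic is visible end to end (all tool names and descriptions below are illustrative):

```python
# Dependency-free sketch of the search_tools helper. A toy bag-of-words
# embedding replaces the demo's SentenceTransformers + FAISS stack; the
# ranking logic (embed, score, take top-k) is the same shape.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_tools(query: str, tools: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank tool descriptions by similarity to the query, return top-k names."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])),
                    reverse=True)
    return ranked[:top_k]

TOOLS = {
    "book_hotel": "book a hotel room for given dates and guests",
    "search_flights": "search flights between two airports",
    "get_weather": "get the weather forecast for a city",
    "cancel_booking": "cancel an existing hotel booking",
}
print(search_tools("reserve a room at a hotel", TOOLS, top_k=2))
```

The real implementation returns Strands tool objects rather than names, but the filtering step is the same: embed once, score per query, keep the top few.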
Results
Testing on 29 travel queries with ground truth:
Semantic tool selection reduces errors and token costs significantly
Demo: Semantic Tool Selection Demo - Includes token comparison script showing 89% token reduction with FAISS filtering.
Technique 3: Neurosymbolic Guardrails for AI Agents
Symbolic rules block invalid operations that prompt engineering cannot prevent
The Problem: Agents Cannot Be Constrained by Prompts Alone
Research (ATA: Autonomous Trustworthy Agents, 2024) shows that agents hallucinate when business rules are expressed only in natural language prompts:
- Parameter errors: Agent calls `book_hotel(guests=15)` despite "Maximum 10 guests" in the docstring
- Completeness errors: Agent executes bookings without required payment verification
- Tool bypass behavior: Agent confirms success without calling validation tools
Why prompt engineering fails: Prompts are suggestions, not constraints. Agents can ignore docstring instructions because they're processed as text, not executable rules. The agent sees "Maximum 10 guests" as context, not a hard boundary.
The Solution: Strands Agents Hooks for Neurosymbolic Validation
Strands Agents makes neurosymbolic guardrails effortless with hooks—a composable system that intercepts tool calls before execution to enforce symbolic rules.
Key insight: Use Strands hooks to validate tool calls before execution. The LLM cannot bypass rules enforced at the framework level.
```python
from strands import Agent, tool
from strands.hooks import HookProvider, HookRegistry, BeforeToolCallEvent

# Rule (a small dataclass) and validate (which checks each rule's condition
# against a context dict) are lightweight helpers defined in the demo.

# Define symbolic rules
BOOKING_RULES = [
    Rule(
        name="max_guests",
        condition=lambda ctx: ctx.get("guests", 1) <= 10,
        message="Maximum 10 guests per booking"
    ),
]

# Create validation hook
class NeurosymbolicHook(HookProvider):
    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.validate)

    def validate(self, event: BeforeToolCallEvent) -> None:
        ctx = {"guests": event.tool_use["input"].get("guests", 1)}
        passed, violations = validate(BOOKING_RULES, ctx)
        if not passed:
            event.cancel_tool = f"BLOCKED: {', '.join(violations)}"

# Clean tool (no validation logic)
@tool
def book_hotel(hotel: str, guests: int = 1) -> str:
    """Book a hotel room."""
    return f"SUCCESS: Booked {hotel} for {guests} guests"

# Attach hook to agent
hook = NeurosymbolicHook()
agent = Agent(tools=[book_hotel], hooks=[hook])
```
Why Strands Hooks Excel Here
Strands hooks intercept tool calls before execution at the framework level:
- Simple API: just implement HookProvider and register callbacks
- Centralized validation: one hook validates all tools
- Clean tools: no validation logic mixed with business logic
- Type-safe: strongly-typed event objects
- LLM cannot bypass: rules enforced before tool execution
```python
# Test
query = "Book hotel for 15 guests"
result = agent(query)  # ✅ Hook blocks before tool executes
```
Key advantage: Rules are enforced at the framework level, not the prompt level. The LLM receives cancellation messages it cannot override.
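The symbolic core of the hook — the `Rule` and `validate` helpers — needs no framework at all. Here's a self-contained sketch; the names match the demo code, but this implementation is an assumption, not the demo's exact source:

```python
# Hypothetical implementation of the Rule/validate helpers used by the
# NeurosymbolicHook above; the actual demo may define them differently.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # returns True when the rule is satisfied
    message: str

def validate(rules: list, ctx: dict) -> tuple:
    """Return (passed, violation messages) for a tool-call context."""
    violations = [r.message for r in rules if not r.condition(ctx)]
    return (not violations, violations)

BOOKING_RULES = [
    Rule("max_guests", lambda c: c.get("guests", 1) <= 10,
         "Maximum 10 guests per booking"),
]
print(validate(BOOKING_RULES, {"guests": 15}))
```

Because rules are plain predicates over a context dict, adding a new constraint (payment verified, dates in the future, etc.) is one `Rule` entry — no prompt changes, no tool changes.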
Results
| Scenario | Prompt Engineering | Neurosymbolic with Hooks |
|---|---|---|
| Invalid Parameters | ❌ Accepts | ✅ Blocks |
| Missing Prerequisites | ⚠️ Sometimes catches | ✅ Always blocks |
| Rule Bypass | ❌ Possible | ✅ Impossible |
Demo: Neurosymbolic AI Agent Demo - Shows how Strands hooks enforce symbolic rules that LLMs cannot bypass.
Technique 4: Multi-Agent Validation for Error Detection
The Problem: Single Agents Cannot Self-Validate
Research (Markov Chain Multi-Agent Debate, 2024) shows that single agents hallucinate without detection mechanisms:
- Claim success when operations failed - No validation layer catches execution errors
- Use wrong tools for requests - No cross-check verifies tool appropriateness
- Fabricate responses - No second opinion challenges generated content
- Provide inaccurate statistics - No verification against ground truth
The core problem: Single agents operate in isolation. When they hallucinate, there's no mechanism to detect the error before it reaches users.
Multi-agent validation catches hallucinations that single agents miss
The Solution: Multi-Agent Debate for Cross-Validation
Multiple specialized agents validate each other through structured debate:
Key insight: Agents with different roles (Trust, Skeptic, Leader) debate claims until consensus. Research shows this reduces hallucination compared to single-agent approaches.
Why Strands Agents Excels Here
Strands provides Swarm for autonomous agent handoffs with shared context:
```python
from strands import Agent
from strands.multiagent import Swarm

# Define specialized agents
executor = Agent(
    name="executor",
    tools=ALL_TOOLS,
    system_prompt="Execute requests, then hand off to validator"
)
validator = Agent(
    name="validator",
    system_prompt="Check for hallucinations. Say VALID or HALLUCINATION"
)
critic = Agent(
    name="critic",
    system_prompt="Final review. Say APPROVED or REJECTED"
)

# Create swarm - agents hand off autonomously
swarm = Swarm(
    [executor, validator, critic],
    entry_point=executor,
    max_handoffs=5
)
result = swarm("Book grand_hotel for John")
```
Key advantage: Agents decide when to hand off control. Shared context means validator sees executor's actions. Critic sees both. Cross-validation happens automatically.
Results
| Approach | Hallucination Detection | Accuracy | Latency |
|---|---|---|---|
| Single Agent | ❌ None | ⚠️ Fabricates | ✅ Fast |
| Multi-Agent | ✅ Detects errors | ✅ Validates | ⚠️ Slower |
Example: When executor tries to book the_ritz_paris (doesn't exist), validator detects the invalid hotel and critic returns Status.FAILED instead of hallucinating an alternative.
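The logic of that example can be sketched without LLMs at all: the validator checks the executor's claim against ground truth instead of trusting it. Everything below (hotel names, the toy name extraction) is illustrative stub code, not the Swarm implementation:

```python
# Dependency-free sketch of the Executor → Validator → Critic pattern.
# Stubs stand in for the LLM agents; hotel names are illustrative.
KNOWN_HOTELS = {"grand_hotel", "city_inn", "beach_resort"}

def executor(request: str) -> dict:
    # Stand-in for the LLM: extracts (and possibly hallucinates) a hotel name.
    hotel = request.split()[-1]
    return {"action": "book", "hotel": hotel, "claimed": "SUCCESS"}

def validator(result: dict) -> str:
    # Cross-check the executor's claim against ground truth.
    return "VALID" if result["hotel"] in KNOWN_HOTELS else "HALLUCINATION"

def critic(result: dict, verdict: str) -> str:
    # Final gate: only approve what the validator confirmed.
    return "APPROVED" if verdict == "VALID" else "REJECTED"

r = executor("Book the_ritz_paris")
print(critic(r, validator(r)))  # → REJECTED
```

The point is structural: the executor's "SUCCESS" claim never reaches the user unchecked, because a separate role holds the ground truth.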
Demo: Multi-Agent Validation Demo - Shows Executor → Validator → Critic pattern detecting invalid hotel bookings.
Combining Techniques for Production
These techniques stack:
- Graph-RAG ensures data accuracy
- Semantic tool selection reduces tool errors and token costs
- Neurosymbolic rules enforce business constraints
- Multi-agent validation catches remaining hallucinations
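How the four layers compose on a single request path can be sketched with stubs. Every helper below is an illustrative placeholder for the real component named in its comment; none of these are Strands APIs, and the data is made up to echo the demos:

```python
# Hedged sketch of the four techniques stacked on one request path.
# All helpers are stubs; none of these names are Strands APIs.

def enforce_rules(ctx: dict) -> list:
    """Neurosymbolic layer: symbolic rules checked before any execution."""
    return ["Maximum 10 guests per booking"] if ctx.get("guests", 1) > 10 else []

def select_tools(query: str, catalog: dict, k: int = 3) -> list:
    """Semantic layer (stub): word overlap stands in for embedding similarity."""
    q = set(query.lower().split())
    return sorted(catalog, key=lambda t: -len(q & set(catalog[t].split())))[:k]

def query_graph(city: str, facts: dict) -> str:
    """Graph-RAG layer (stub): exact lookup, honest empty result."""
    return str(facts[city]) if city in facts else "No data"

def cross_validate(claim: str, ground_truth: str) -> str:
    """Multi-agent layer (stub): a second role checks the first's claim."""
    return "APPROVED" if claim == ground_truth else "REJECTED"

def handle(query: str, ctx: dict) -> str:
    violations = enforce_rules(ctx)          # 1. block invalid requests early
    if violations:
        return "BLOCKED: " + "; ".join(violations)
    tools = select_tools(query, {"count_hotels": "count hotels with pool"})
    answer = query_graph(ctx["city"], {"Paris": 133})   # 2-3. filtered tool + graph
    verdict = cross_validate(answer, "133")             # 4. validate before replying
    return f"{answer} ({verdict}, via {tools[0]})"

print(handle("count hotels with pool", {"guests": 2, "city": "Paris"}))
```

The ordering matters: guardrails run first (cheapest, hard constraints), retrieval and tools next, validation last, so each layer only sees requests the previous layer allowed.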
Why Strands Agents Python SDK
- Dynamic tool management: Swap tools without losing conversation state
- Native multi-agent: Swarm handles handoffs and shared context
- Tool-level validation: Symbolic rules execute before LLM sees results
- Model flexibility: Works with Bedrock, OpenAI, Anthropic, Ollama
- Production-ready: Built for AWS deployment (Amazon Bedrock Agentcore, AWS Lambda, Amazon ECS, Amazon EC2)
Key Takeaways
- Hallucinations are inevitable - Focus on detection and mitigation, not elimination
- Graph-RAG for precision - Use when you need exact calculations or relationships
- Semantic filtering for scale - Essential when you have 10+ similar tools
- Symbolic rules for compliance - Prompt engineering cannot enforce business rules
- Multi-agent for validation - Cross-validation catches errors single agents miss
- Strands Agents for production - Dynamic tools, native multi-agent, and AWS-ready
References
- MetaRAG: Metamorphic Testing for Hallucination Detection
- Internal Representations as Indicators of Hallucinations in Agent Tool Selection
- Teaming LLMs to Detect and Mitigate Hallucinations
- RAG-KG-IL: Multi-Agent Hybrid Framework
- Strands Agents Documentation
Conclusion
In this post, I showed you how to stop AI agent hallucinations using 4 research-backed techniques. Graph-RAG eliminates statistical hallucinations by using structured data instead of text retrieval. Semantic tool selection reduces errors by up to 86.4% and cuts token costs by 89% through vector-based filtering. Neurosymbolic guardrails with Strands hooks enforce business rules at the framework level that LLMs cannot bypass. Multi-agent validation catches hallucinations through cross-validation before they reach users.
These techniques aren't theoretical—each demo includes working code you can run today. Start with Graph-RAG if you have structured data, add semantic tool selection when you have 10+ similar tools, implement neurosymbolic hooks for business constraints, and use multi-agent validation for critical operations.
What's next? Clone the repository and run the demos. Each folder includes Jupyter notebooks with step-by-step explanations and Python scripts for quick testing. The demos use Strands Agents with Amazon Bedrock, but you can swap in OpenAI, Anthropic, or Ollama—see the model providers documentation for configuration.
Want to dive deeper? Check out the research papers: MetaRAG for hallucination detection, Internal Representations for tool selection, Teaming LLMs for multi-agent validation, and RAG-KG-IL for hybrid frameworks. You can also learn more about Strands Agents for production deployments and Amazon Bedrock for LLM access.
Have you implemented any of these techniques in your agents? What challenges did you face? Share your experience in the comments below.
Thanks!
Top comments (4)
The guardrails approach is solid, but I think there's a step before guardrails that most people skip: making the architecture simpler so the AI needs less context to do its job.
I've been working on a codebase where AI kept hallucinating function signatures from other modules. The fix wasn't better prompting or more guardrails — it was making module boundaries explicit enough that the AI could work on one file without needing to understand the whole project.
Guardrails catch mistakes after they happen. Simpler architecture prevents whole categories of mistakes from happening in the first place. Both matter, but I think we underinvest in the structural side.
Excellent post! Question: can semantic tool search be done with a managed AWS service?
Yes, with Amazon Bedrock AgentCore; here's the info: docs.aws.amazon.com/bedrock-agentc...
😱😱 So cool!! I need to try this!