DEV Community

Elizabeth Fuentes L for AWS

Posted on • Originally published at builder.aws.com

Stop AI Agent Hallucinations: 4 Essential Techniques

AI agents can hallucinate when executing tasks—fabricating statistics, choosing wrong tools, ignoring business rules, and claiming success when operations fail. This guide demonstrates 4 research-backed techniques to stop these hallucinations: Graph-RAG for precise data retrieval, semantic tool selection for accurate tool choice, neurosymbolic guardrails for rule enforcement, and multi-agent validation for error detection.

What you'll learn:

  • How Graph-RAG prevents statistical hallucinations with structured data
  • Why semantic tool selection improves accuracy and reduces token costs
  • How neurosymbolic guardrails with Strands Agents block invalid operations that prompt engineering can't stop
  • How multi-agent validation catches hallucinations before they reach users

Code examples: All techniques include working demos with Strands Agents framework.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations

Each demo includes:

  • Jupyter notebooks with step-by-step explanations
  • Python scripts for quick testing
  • Ground truth verification against real data
  • Performance metrics and comparisons

Why AI Agent Hallucinations Matter

AI agents differ from chatbots in a critical way. When a chatbot gives incorrect information, it's annoying. When an agent hallucinates during execution, it's catastrophic—fabricating API parameters, inventing success confirmations after failures, or executing actions based on false beliefs.

Recent research (MetaRAG, 2025) shows that hallucinations cannot be eliminated—they're inherent to how LLMs work. The solution is detecting, containing, and mitigating them before they cause damage.

This guide explores 4 techniques tested on a travel booking agent with:

  • 300 hotel FAQ documents for Graph-RAG testing
  • 31 similar tools for semantic selection testing
  • Complex business rules for neurosymbolic testing
  • Multi-step validation flows for multi-agent testing

Prerequisites

Required:

  • Python 3.9+
  • LLM access: Amazon Bedrock, OpenAI, Anthropic, or Ollama (required for running agents)
  • AWS credentials configured if using Bedrock (aws configure)
  • Basic understanding of AI agents and tool calling

Key libraries: Strands Agents, Neo4j, FAISS, SentenceTransformers

Note on embeddings: Demos use SentenceTransformers (all-MiniLM-L6-v2) for vector embeddings — it runs locally with no API costs, making it fast and free to experiment. You can swap it for any embedding provider: Amazon Titan Embeddings, OpenAI, Cohere, etc.

Model configuration: See Strands Model Providers for setup instructions.


Technique 1: Graph-RAG for Precise Data Retrieval

RAG Hallucination Problem

Traditional RAG hallucinates statistics because it retrieves text chunks instead of executing precise calculations

When agents use traditional RAG for data-driven tasks, they face a fundamental limitation: vector search retrieves text, not structured data. Research (RAG-KG-IL, 2025) identifies three types of hallucinations this causes:

  1. Fabricated statistics — LLM generates plausible-sounding numbers from text chunks instead of computing them ("average rating is approximately 8.7" when no calculation occurred)
  2. Incomplete retrieval — Vector search returns top-k similar documents, missing relevant data scattered across hundreds of documents
  3. Out-of-domain fabrication — When no relevant data exists, RAG still returns similar-looking results and the LLM fabricates an answer instead of admitting ignorance
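A toy illustration makes the retrieval gap concrete. This is pure Python with synthetic data, using keyword matching as a stand-in for vector similarity (it is not the demo's code):

```python
# Synthetic corpus mirroring the demo's 300 hotel docs (133 mention a pool)
docs = [f"Hotel {i} has a swimming pool." for i in range(133)] + \
       [f"Hotel {i} has no spa or gym." for i in range(133, 300)]

def retrieve_top_k(query_word, k=3):
    """Stand-in for vector search: rank by keyword match, keep the k best."""
    ranked = sorted(docs, key=lambda d: query_word in d, reverse=True)
    return ranked[:k]

chunks = retrieve_top_k("swimming")                    # the LLM sees only 3 chunks
exact_count = sum("swimming pool" in d for d in docs)  # what a database computes

print(len(chunks), exact_count)  # 3 vs 133
```

However plausible the model's summary of those three chunks sounds, the number 133 can only come from computing over the full dataset, which is exactly what the Graph-RAG technique below delegates to the database.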

Research (MetaRAG, 2025) confirms this is inherent to how LLMs process unstructured data: they retrieve similar documents, then guess aggregations instead of executing calculations.

The demo compares two different agents querying the same 300 hotel FAQ documents:

  • RAG Agent: Uses FAISS vector search → finds top 3 similar docs → LLM summarizes
  • Graph-RAG Agent: Uses Neo4j knowledge graph → LLM writes Cypher queries → precise results

The knowledge graph is built automatically using neo4j-graphrag (RAKG, 2025) — the LLM discovers entities (Hotel, Room, Amenity, Policy) and relationships from unstructured text, no hardcoded schema:

import faiss
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
from strands import Agent, tool

# Shared setup (the demo builds index and documents during ingestion;
# connection details below are placeholders)
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# RAG Agent — vector similarity search
@tool
def search_faqs(query: str) -> str:
    """Search hotel FAQs using vector similarity."""
    query_embedding = embed_model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), 3)
    return "\n".join([documents[idx]['text'][:500] for idx in indices[0]])

# Graph-RAG Agent — Cypher queries on knowledge graph
@tool
def query_knowledge_graph(cypher_query: str) -> str:
    """Execute a Cypher query against the hotel knowledge graph.
    Node labels: Hotel, Room, Amenity, Policy, Service
    Relationships: (Hotel)-[:HAS_ROOM]->(Room), (Hotel)-[:OFFERS_AMENITY]->(Amenity)
    """
    with driver.session() as session:
        result = session.run(cypher_query)
        records = list(result)
        if not records:
            return "No results found."
        return f"Found {len(records)} results:\n" + "\n".join(str(dict(r.items())) for r in records[:15])

rag_agent = Agent(tools=[search_faqs], model=model)
graph_agent = Agent(tools=[query_knowledge_graph], model=model)

Key insight: Graph-RAG reduces hallucinations because knowledge graphs provide structured, verifiable data — aggregations are computed by the database, relationships are explicit, and missing data returns empty results instead of fabricated answers. The LLM translates natural language into Cypher queries using the Text2Cypher pattern, grounded by the graph schema described in the tool's docstring.
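For intuition, here is the kind of Text2Cypher translation the graph agent might emit for a counting question. The query text is hypothetical; the labels and relationships come from the schema in the tool docstring above:

```python
question = "How many hotels have a swimming pool?"

# A Cypher query consistent with the documented schema; the count is
# computed by Neo4j rather than guessed by the model.
cypher = """
MATCH (h:Hotel)-[:OFFERS_AMENITY]->(a:Amenity)
WHERE toLower(a.name) CONTAINS 'pool'
RETURN count(DISTINCT h) AS hotels_with_pool
"""
```

If no matching nodes exist (the Antarctica case), the database returns nothing to embellish, so the agent has no text from which to fabricate an answer.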

Results

git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/01-faq-graphrag-demo 

Demo: Graph-RAG vs Traditional RAG — a Jupyter notebook comparing hallucination rates on 300 hotel FAQs with a Neo4j knowledge graph.

| Paper Finding | Demo Result | Status |
|---|---|---|
| RAG cannot aggregate (RAG-KG-IL) | RAG failed to count swimming pools across 300 docs | ✅ Validated |
| Graph-RAG computes natively (RAKG) | Cypher returned exact count: 133 hotels with pool | ✅ Validated |
| RAG hallucinates out-of-domain (MetaRAG) | RAG fabricated Antarctica accommodation info | ✅ Validated |
| Graph-RAG fails honestly | "No hotels in Antarctica" — no fabrication | ✅ Validated |

| Query Type | RAG | Graph-RAG |
|---|---|---|
| Aggregation: "Average rating in Paris?" | ⚠️ Calculates from 2 docs only | ✅ Native AVG() across all |
| Counting: "Hotels with swimming pool?" | ❌ "I don't have the data" | ✅ Precise: 133 |
| Multi-hop: "Room types for best hotel?" | ❌ Cannot traverse | ✅ Hotel → Room traversal |
| Out-of-domain: "Hotels in Antarctica" | ❌ Fabricates answers | ✅ Honest: "No hotels" |

Technique 2: Semantic Tool Selection for Accurate Tool Choice

Diagram showing 31 tools filtered to 3 relevant tools with performance metrics

Research (Internal Representations, 2025) identifies a critical agent failure mode: tool-calling hallucinations increase with tool count. When agents have many similar tools, they exhibit:

  1. Function selection errors - Calling non-existent tools
  2. Function appropriateness errors - Choosing semantically wrong tools
  3. Parameter errors - Malformed or invalid arguments
  4. Completeness errors - Missing required parameters
  5. Tool bypass behavior - Generating outputs instead of calling tools

The dual problem:

  • Hallucination risk: More tools = more inappropriate selections
  • Token waste: Sending all tool descriptions on every call (e.g., 31 tools = ~4,500 tokens per query)

The root cause: Agents see all tool descriptions in the prompt, creating choice overload that leads to both accuracy degradation AND cost explosion.
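The cost side is simple arithmetic. Using the article's own figures (31 tools, roughly 4,500 tokens of descriptions per query), filtering down to 3 tools cuts prompt overhead by about 90%, in line with the ~89% reduction measured in the demo:

```python
TOTAL_TOOLS = 31
TOKENS_ALL = 4500                           # ~4,500 tokens for all 31 descriptions
tokens_per_tool = TOKENS_ALL / TOTAL_TOOLS  # ~145 tokens per description

filtered_tokens = 3 * tokens_per_tool       # only the top-3 descriptions are sent
savings = 1 - filtered_tokens / TOKENS_ALL

print(f"{filtered_tokens:.0f} tokens per query, {savings:.0%} saved")
```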

The Solution: Semantic Tool Filtering

git clone https://github.com/aws-samples/sample-why-agents-fail
cd stop-ai-agent-hallucinations/02-semantic-tools-demo 

Filter tools before the agent sees them using vector similarity. The technique compares the user query against tool descriptions using embeddings, then passes only the most relevant tools to the agent:


This demo uses FAISS with SentenceTransformers as a lightweight, local implementation — but the technique works with any vector store and embedding provider (OpenSearch, Pinecone, Amazon Titan Embeddings, etc.):

from sentence_transformers import SentenceTransformer
import faiss

# Build index once (embed_model is separate from the agent's LLM)
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
tool_embeddings = embed_model.encode([tool.description for tool in ALL_TOOLS])
index = faiss.IndexFlatL2(384)  # all-MiniLM-L6-v2 produces 384-dim vectors
index.add(tool_embeddings.astype('float32'))

# Filter per query
query_embedding = embed_model.encode([query])
distances, indices = index.search(query_embedding.astype('float32'), 3)
relevant_tools = [ALL_TOOLS[i] for i in indices[0]]

Dynamic tool swapping with Strands Agents

Strands provides dynamic tool swapping that preserves conversation memory:

from strands import Agent

# Create agent once
agent = Agent(tools=[...], model=model)

# Swap tools per query without losing conversation history
# (search_tools wraps the FAISS filter from the previous snippet)
for query in conversation:
    relevant_tools = search_tools(query, top_k=3)
    agent.tool_registry.registry.clear()
    for tool in relevant_tools:
        agent.tool_registry.register_tool(tool)

    response = agent(query)  # agent.messages preserved

Key advantage: Strands Agents maintains memory while tools change dynamically.

Results

Testing on 29 travel queries with ground truth:

Semantic Tool Selection Results

Semantic tool selection reduces errors and token costs significantly

Demo: Semantic Tool Selection — includes a token comparison script showing 89% token reduction with FAISS filtering.



Technique 3: Neurosymbolic Guardrails for AI Agents

Neurosymbolic Blocking Results

Symbolic rules block invalid operations that prompt engineering cannot prevent

The Problem: Agents Cannot Be Constrained by Prompts Alone

Research (ATA: Autonomous Trustworthy Agents, 2024) shows that agents hallucinate when business rules are expressed only in natural language prompts:

  • Parameter errors: Agent calls book_hotel(guests=15) despite "Maximum 10 guests" in docstring
  • Completeness errors: Agent executes bookings without required payment verification
  • Tool bypass behavior: Agent confirms success without calling validation tools

Why prompt engineering fails: Prompts are suggestions, not constraints. Agents can ignore docstring instructions because they're processed as text, not executable rules. The agent sees "Maximum 10 guests" as context, not a hard boundary.

The Solution: Strands Agents Hooks for Neurosymbolic Validation

Strands Agents makes neurosymbolic guardrails effortless with hooks—a composable system that intercepts tool calls before execution to enforce symbolic rules.

Key insight: Use Strands hooks to validate tool calls before execution. The LLM cannot bypass rules enforced at the framework level.

from dataclasses import dataclass
from typing import Callable

from strands import Agent, tool
from strands.hooks import HookProvider, HookRegistry, BeforeToolCallEvent

# Minimal rule machinery (the demo ships its own version of these helpers)
@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    message: str

def validate(rules: list, ctx: dict) -> tuple:
    violations = [r.message for r in rules if not r.condition(ctx)]
    return (len(violations) == 0, violations)

# Define symbolic rules
BOOKING_RULES = [
    Rule(
        name="max_guests",
        condition=lambda ctx: ctx.get("guests", 1) <= 10,
        message="Maximum 10 guests per booking"
    ),
]

# Create validation hook
class NeurosymbolicHook(HookProvider):
    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.validate)

    def validate(self, event: BeforeToolCallEvent) -> None:
        ctx = {"guests": event.tool_use["input"].get("guests", 1)}
        passed, violations = validate(BOOKING_RULES, ctx)

        if not passed:
            event.cancel_tool = f"BLOCKED: {', '.join(violations)}"

# Clean tool (no validation logic)
@tool
def book_hotel(hotel: str, guests: int = 1) -> str:
    """Book a hotel room."""
    return f"SUCCESS: Booked {hotel} for {guests} guests"

# Attach hook to agent
hook = NeurosymbolicHook()
agent = Agent(tools=[book_hotel], hooks=[hook])

Why Strands Hooks Excel Here

Strands hooks intercept tool calls before execution at the framework level:

  • Simple API: just implement HookProvider and register callbacks
  • Centralized validation: one hook validates all tools
  • Clean tools: no validation logic mixed with business logic
  • Type-safe: strongly-typed event objects
  • LLM cannot bypass: rules enforced before tool execution

# Test
query = "Book hotel for 15 guests"
result = agent(query)  # ✅ Hook blocks before tool executes

Key advantage: Rules are enforced at the framework level, not the prompt level. The LLM receives cancellation messages it cannot override.
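Because the rule's condition is plain Python, it can be sanity-checked in isolation, with no agent or LLM involved. A standalone check of the max_guests condition defined above (the function name here is illustrative):

```python
# The same symbolic condition the hook enforces, tested by itself
max_guests_ok = lambda ctx: ctx.get("guests", 1) <= 10

assert max_guests_ok({"guests": 8})       # valid booking passes
assert max_guests_ok({})                  # defaults to 1 guest, passes
assert not max_guests_ok({"guests": 15})  # the request the hook would block
print("rule checks pass")
```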

Results

| Scenario | Prompt Engineering | Neurosymbolic with Hooks |
|---|---|---|
| Invalid Parameters | ❌ Accepts | ✅ Blocks |
| Missing Prerequisites | ⚠️ Sometimes catches | ✅ Always blocks |
| Rule Bypass | ❌ Possible | ✅ Impossible |

Demo: Neurosymbolic AI Agent — shows how Strands hooks enforce symbolic rules that LLMs cannot bypass.


Technique 4: Multi-Agent Validation for Error Detection

The Problem: Single Agents Cannot Self-Validate

Research (Markov Chain Multi-Agent Debate, 2024) shows that single agents hallucinate without detection mechanisms:

  • Claim success when operations failed - No validation layer catches execution errors
  • Use wrong tools for requests - No cross-check verifies tool appropriateness
  • Fabricate responses - No second opinion challenges generated content
  • Provide inaccurate statistics - No verification against ground truth

The core problem: Single agents operate in isolation. When they hallucinate, there's no mechanism to detect the error before it reaches users.

Multi-agent validation catches hallucinations that single agents miss

The Solution: Multi-Agent Debate for Cross-Validation

Multiple specialized agents validate each other through structured debate:

Single vs Multi-Agent Accuracy

Key insight: Agents with different roles (Trust, Skeptic, Leader) debate claims until consensus. Research shows this reduces hallucinations compared to single-agent approaches.
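Stripped of the framework, the cross-validation idea fits in a few lines of plain Python. The hotel list and stub "agents" below are illustrative, not the demo's code:

```python
KNOWN_HOTELS = {"grand_hotel", "city_inn", "beach_resort"}

def executor(request):
    # Imagine the LLM confidently "booking" a hotel that does not exist
    return {"hotel": "the_ritz_paris", "status": "SUCCESS"}

def validator(claim):
    # A second agent cross-checks the claim against ground truth
    return "VALID" if claim["hotel"] in KNOWN_HOTELS else "HALLUCINATION"

claim = executor("Book the_ritz_paris for John")
verdict = validator(claim)
print(verdict)  # HALLUCINATION: caught before it reaches the user
```

The real pattern adds a third role (the critic) and lets agents hand off autonomously, but the core mechanism is the same: a claim is checked against ground truth by an agent that did not produce it.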

Why Strands Agents Excels Here

Strands provides Swarm for autonomous agent handoffs with shared context:

from strands import Agent
from strands.multiagent import Swarm

# Define specialized agents
executor = Agent(
    name="executor",
    tools=ALL_TOOLS,  # the full travel tool set defined in the demo
    system_prompt="Execute requests, then hand off to validator"
)

validator = Agent(
    name="validator",
    system_prompt="Check for hallucinations. Say VALID or HALLUCINATION"
)

critic = Agent(
    name="critic",
    system_prompt="Final review. Say APPROVED or REJECTED"
)

# Create swarm - agents hand off autonomously
swarm = Swarm(
    [executor, validator, critic],
    entry_point=executor,
    max_handoffs=5
)

result = swarm("Book grand_hotel for John")

Key advantage: Agents decide when to hand off control. Shared context means validator sees executor's actions. Critic sees both. Cross-validation happens automatically.

Results

| Approach | Hallucination Detection | Accuracy | Latency |
|---|---|---|---|
| Single Agent | ❌ None | ⚠️ Fabricates | ✅ Fast |
| Multi-Agent | ✅ Detects errors | ✅ Validates | ⚠️ Slower |

Example: When executor tries to book the_ritz_paris (doesn't exist), validator detects the invalid hotel and critic returns Status.FAILED instead of hallucinating an alternative.

Demo: Multi-Agent Validation — shows the Executor → Validator → Critic pattern detecting invalid hotel bookings.


Combining Techniques for Production

These techniques stack:

  1. Graph-RAG ensures data accuracy
  2. Semantic tool selection reduces tool errors and token costs
  3. Neurosymbolic rules enforce business constraints
  4. Multi-agent validation catches remaining hallucinations
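A hedged sketch of how the four layers could compose in a single request path. Every helper here is an illustrative stub; none of these names come from the demo:

```python
def semantic_filter(query, k=3):
    """Technique 2: pick the few relevant tools (stub)."""
    return ["query_knowledge_graph"]   # a Graph-RAG tool (technique 1)

def rules_pass(call):
    """Technique 3: symbolic guardrail (stub)."""
    return call.get("guests", 1) <= 10

def validate_answer(answer):
    """Technique 4: cross-validation (stub)."""
    return answer if answer.startswith("SUCCESS") else "REJECTED"

def handle(query, call):
    tools = semantic_filter(query)
    if not rules_pass(call):
        return "BLOCKED"
    answer = f"SUCCESS via {tools[0]}"  # stand-in for the agent run
    return validate_answer(answer)

print(handle("Book a hotel in Paris", {"guests": 2}))   # SUCCESS via query_knowledge_graph
print(handle("Book a hotel in Paris", {"guests": 15}))  # BLOCKED
```

Note the ordering: tool filtering happens before the agent runs, guardrails before any tool executes, and validation before the answer is returned.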

Why Strands Agents Python SDK

  1. Dynamic tool management: Swap tools without losing conversation state
  2. Native multi-agent: Swarm handles handoffs and shared context
  3. Tool-level validation: Symbolic rules execute before LLM sees results
  4. Model flexibility: Works with Bedrock, OpenAI, Anthropic, Ollama
  5. Production-ready: Built for AWS deployment (Amazon Bedrock Agentcore, AWS Lambda, Amazon ECS, Amazon EC2)

Key Takeaways

  1. Hallucinations are inevitable - Focus on detection and mitigation, not elimination
  2. Graph-RAG for precision - Use when you need exact calculations or relationships
  3. Semantic filtering for scale - Essential when you have 10+ similar tools
  4. Symbolic rules for compliance - Prompt engineering cannot enforce business rules
  5. Multi-agent for validation - Cross-validation catches errors single agents miss
  6. Strands Agents for production - Dynamic tools, native multi-agent, and AWS-ready


Conclusion

In this post, I showed you how to stop AI agent hallucinations using 4 research-backed techniques. Graph-RAG eliminates statistical hallucinations by using structured data instead of text retrieval. Semantic tool selection reduces errors by up to 86.4% and cuts token costs by 89% through vector-based filtering. Neurosymbolic guardrails with Strands hooks enforce business rules at the framework level that LLMs cannot bypass. Multi-agent validation catches hallucinations through cross-validation before they reach users.

These techniques aren't theoretical—each demo includes working code you can run today. Start with Graph-RAG if you have structured data, add semantic tool selection when you have 10+ similar tools, implement neurosymbolic hooks for business constraints, and use multi-agent validation for critical operations.

What's next? Clone the repository and run the demos. Each folder includes Jupyter notebooks with step-by-step explanations and Python scripts for quick testing. The demos use Strands Agents with Amazon Bedrock, but you can swap in OpenAI, Anthropic, or Ollama—see the model providers documentation for configuration.

Want to dive deeper? Check out the research papers: MetaRAG for hallucination detection, Internal Representations for tool selection, Teaming LLMs for multi-agent validation, and RAG-KG-IL for hybrid frameworks. You can also learn more about Strands Agents for production deployments and Amazon Bedrock for LLM access.

Have you implemented any of these techniques in your agents? What challenges did you face? Share your experience in the comments below.


Thank you!


Top comments (4)

Matthew Hou

The guardrails approach is solid, but I think there's a step before guardrails that most people skip: making the architecture simpler so the AI needs less context to do its job.

I've been working on a codebase where AI kept hallucinating function signatures from other modules. The fix wasn't better prompting or more guardrails — it was making module boundaries explicit enough that the AI could work on one file without needing to understand the whole project.

Guardrails catch mistakes after they happen. Simpler architecture prevents whole categories of mistakes from happening in the first place. Both matter, but I think we underinvest in the structural side.

ensamblador

Excellent post! Question: can semantic tool search be done using a managed AWS service?

Elizabeth Fuentes L AWS

Yes, with Amazon Bedrock AgentCore. Here's the information: docs.aws.amazon.com/bedrock-agentc...

Camila Hinojosa Anez

😱😱 So cool!! I need to try this.