When AI agents have many similar tools, they often select the wrong one and consume excessive tokens by processing all tool descriptions. Semantic tool selection filters tools before the agent processes them, which improves accuracy and reduces token costs. We'll build a travel agent with Strands Agents and use FAISS to filter 29 tools down to the top 3 most relevant, comparing filtered vs unfiltered results.
In Part 1, GraphRAG prevented agents from hallucinating statistics. But agents still hallucinate during tool selection, choosing the wrong tool even when the underlying data is correct.
This Series: 4 Production Techniques
Part 1 : GraphRAG - Relationship-aware knowledge graphs preventing hallucinations in aggregations and precise queries
Part 2 (This Post): Semantic Tool Selection - Vector-based tool filtering for accurate tool selection
Part 3: Neurosymbolic Guardrails - Symbolic reasoning for verifiable decisions
Part 4: Multi-Agent Validation - Agent teams detecting hallucinations before damage
Code uses Strands Agents.
GitHub repository: sample-why-agents-fail
git clone https://github.com/aws-samples/sample-why-agents-fail
The Dual Problem: Errors + Token Waste
Imagine your travel agent has 29 tools: search_hotels, search_flights, search_hotel_reviews, check_hotel_availability, get_hotel_details, book_hotel... and 25 more similar ones.
Query: "How much does Hotel Marriott cost?"
The agent picks get_hotel_details() instead of get_hotel_pricing(). Wrong tool, wrong answer. And you just burned 4,500 tokens sending all 29 tool descriptions.
This happens because:
- Similar tool names cause confusion
- Generic tools (search, check, get_info) are overused
- More tools = more hallucinations
Research (Internal Representations, 2025) shows tool selection hallucinations increase with tool count. Production systems report 89% token reduction with semantic tool selection (rconnect.tech, 2025).
Prerequisites
Setup:
uv venv && uv pip install -r requirements.txt
Key libraries: Strands Agents (AI agent framework), FAISS (vector search), SentenceTransformers (embeddings)
Model options: This demo uses OpenAI GPT-4o-mini by default. You can change to any provider that Strands supports — see Strands Model Providers for configuration.
The Solution: Semantic Tool Selection with FAISS
Research from the Internal Representations study (2025) identifies critical agent failure modes as tools scale, including:
- Function selection errors - Calling non-existent tools
- Parameter errors - Malformed or invalid arguments
- Completeness errors - Missing required parameters
- Tool bypass behavior - Generating outputs instead of calling tools
When agents have 10+ similar tools, choice overload causes both accuracy degradation AND cost explosion.
Each LLM call sends all 29 tool descriptions. In a 50-step workflow, this results in 29 tools × 50 calls, which creates significant token waste and slows response times due to increased context processing.
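Some back-of-the-envelope math makes the waste concrete. Assuming roughly 155 tokens per tool description (consistent with the ~4,500-token figure for 29 tools mentioned above), the numbers are illustrative rather than measured:

```python
# Rough cost of resending every tool description on every LLM call.
# Assumption: ~155 tokens per tool description (29 tools ≈ 4,500 tokens).
TOKENS_PER_TOOL = 155
ALL_TOOLS = 29
FILTERED_TOOLS = 3
CALLS = 50  # steps in a long agent workflow

unfiltered = TOKENS_PER_TOOL * ALL_TOOLS * CALLS       # all 29 descriptions, every call
filtered = TOKENS_PER_TOOL * FILTERED_TOOLS * CALLS    # only the top-3, every call

print(f"unfiltered: {unfiltered:,} tokens")  # → unfiltered: 224,750 tokens
print(f"filtered:   {filtered:,} tokens")    # → filtered:   23,250 tokens
print(f"saved:      {1 - filtered / unfiltered:.0%}")  # → saved: 90%
```

The tool-description overhead alone scales linearly with both tool count and step count, which is why long workflows feel the problem first.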
The Demo: Two Agents, Same Data, Different Approaches
You can follow along in token_efficiency_analysis.ipynb.
To compare the two agents, I created 13 travel queries with verified ground truth answers:
Test 1: Traditional Approach (All 29 Tools)
# 02-semantic-tools-demo/token_efficiency_analysis.ipynb - all 29 travel tools
from strands import Agent
from strands.models.openai import OpenAIModel
from enhanced_tools import ALL_TOOLS

for query, expected in TESTS:
    agent = Agent(tools=ALL_TOOLS, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)
Results:
Test 2: Semantic Approach (Top-3 Filtered Tools)
Agent receives only the 3 most relevant tools per query.
First, I build my semantic index with FAISS using the registry.py helper library.
# registry.py
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings
index = None
tools = []

def build_index(tool_list):
    """Build a FAISS index from tool names + docstrings."""
    global index, tools
    tools = tool_list
    # Concatenate name + docstring for each tool
    texts = [f"{t.__name__}: {t.__doc__}" for t in tools]
    embeddings = model.encode(texts)
    # Create L2 distance index
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings.astype('float32'))
    return index

def search_tools(query: str, top_k: int = 3):
    """Find the most relevant tools for a query using FAISS."""
    emb = model.encode([query])
    _, indices = index.search(emb.astype('float32'), top_k)
    return [tools[i] for i in indices[0]]
The all-MiniLM-L6-v2 model is lightweight (22M parameters, 384 dimensions), optimized for semantic similarity, and performs efficiently on CPUs. This makes it ideal for matching short queries to tool descriptions.
from enhanced_tools import ALL_TOOLS
from registry import build_index, search_tools
build_index(ALL_TOOLS)
print("✅ FAISS index built")
Vector search filters tools before the agent sees them.
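To see the filtering step in isolation, without downloading an embedding model, here is a toy stand-in: a bag-of-words vectorizer replaces MiniLM, and plain NumPy replaces FAISS, but the top-k nearest-neighbor search has the same shape as the registry code above. The tool names and docstrings here are hypothetical.

```python
import numpy as np

# Hypothetical tool descriptions standing in for the real @tool docstrings.
TOOL_TEXTS = {
    "get_hotel_pricing": "get nightly room rates and total cost for a hotel",
    "get_hotel_details": "get amenities, address and description of a hotel",
    "search_flights": "search available flights between two airports",
    "book_hotel": "book a room reservation at a hotel",
}

def embed(texts, vocab):
    """Toy bag-of-words embedding; a stand-in for SentenceTransformer.encode."""
    return np.array(
        [[t.lower().split().count(w) for w in vocab] for t in texts],
        dtype="float32",
    )

vocab = sorted({w for t in TOOL_TEXTS.values() for w in t.lower().split()})
names = list(TOOL_TEXTS)
tool_vecs = embed(list(TOOL_TEXTS.values()), vocab)

def search_tools(query, top_k=3):
    """Return the top_k tool names by L2 distance, like faiss.IndexFlatL2.search."""
    q = embed([query], vocab)[0]
    dists = np.linalg.norm(tool_vecs - q, axis=1)
    return [names[i] for i in np.argsort(dists)[:top_k]]

print(search_tools("nightly room cost for hotel", top_k=2))
# → ['get_hotel_pricing', 'book_hotel']
```

A pricing query lands on the pricing tool because their description vectors share the most terms; a real embedding model does the same thing, but also matches synonyms and paraphrases that bag-of-words misses.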
On each invocation, the query is first run through search_tools, which performs a semantic search against the FAISS index; the returned tools become the selected set, and the agent is initialized with only those tools.
# 02-semantic-tools-demo/token_efficiency_analysis.ipynb - top-3 filtered tools
from strands import Agent
from strands.models.openai import OpenAIModel

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]
    agent = Agent(tools=selected, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)
Results:
Although accuracy did not improve much, token usage dropped sharply: average tokens per query fell from 1,557 to 275.
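Using the averages reported above, that works out to roughly an 82% per-query reduction:

```python
traditional = 1557  # avg tokens/query with all 29 tools
semantic = 275      # avg tokens/query with top-3 filtering

reduction = 1 - semantic / traditional
print(f"{reduction:.1%}")  # → 82.3%
```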
Key insight: The agent never processes the other 28 tools. These tools aren't removed from the system; they simply never enter the agent's context window.
Learn more about Strands Agents tool handling: Tool Documentation
Test 3: Semantic + Memory (Single Agent)
The basic approach creates a new agent per query, losing conversation history. For production multi-turn conversations, you want a single agent that persists across all queries while its tools are swapped dynamically. Strands Agents provides native tool swapping:
1. Dynamic Tool Swapping: swap_tools
# Add/remove tools at runtime without recreating the agent
from typing import Callable, List

def swap_tools(agent, new_tools: List[Callable]):
    """Swap tools in a live agent without losing conversation memory.

    Clears the agent's tool_registry and re-registers only the given tools.
    Since get_all_tools_config() is called each event loop cycle, the agent
    will see the new tools on the next call.
    """
    reg = agent.tool_registry
    reg.registry.clear()
    reg.dynamic_tools.clear()
    for t in new_tools:
        reg.register_tool(t)
2. Conversation Memory Preservation
# Swap tools between queries while keeping conversation history
swap_tools(agent, new_tools) # agent.messages preserved
3. Runtime Tool Discovery
# Agent picks up tool changes automatically at each event loop
# No manual refresh needed - just modify tool_registry
Strands Agents maintains memory while tools change.
# 02-semantic-tools-demo/token_efficiency_analysis.ipynb
from strands import Agent
from strands.models.openai import OpenAIModel
from enhanced_tools import ALL_TOOLS
from registry import build_index, search_tools, swap_tools

build_index(ALL_TOOLS)
initial_tools = search_tools(TESTS[0][0], top_k=3)
memory_agent = Agent(tools=initial_tools, system_prompt=PROMPT, model=MODEL)

mem_results = []
mem_correct = 0
mem_total_tokens = 0

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]
    swap_tools(memory_agent, selected)
    tools, tokens = run_and_capture_with_tokens(memory_agent, query)
Learn more: Strands Tool Registry
Results:
The trade-off: The Semantic + Memory approach uses more tokens than the Semantic-only approach because conversation context accumulates. However, it still reduces token consumption compared to the traditional method while maintaining full conversation history.
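A toy model shows where the extra tokens come from: the filtered tool block stays constant per call, but accumulated history grows with each turn. The per-turn numbers here are assumptions for illustration, not measurements from the demo.

```python
# Toy model of the memory trade-off. Numbers are illustrative assumptions:
TOOL_TOKENS = 465        # 3 tool descriptions (~155 tokens each, assumed)
HISTORY_PER_TURN = 120   # assumed tokens added per user/assistant exchange

def context_tokens(turn):
    """Approximate prompt size at a given turn for the memory agent."""
    return TOOL_TOKENS + HISTORY_PER_TURN * turn

for turn in (1, 5, 13):
    print(turn, context_tokens(turn))  # history term grows, tool term does not
```

Even at turn 13 the toy total stays near 2,000 tokens, still well under the 4,500-token tool block alone that the unfiltered approach resends on every call.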
Why this works: Strands calls tool_registry.get_all_tools_config() at each event loop cycle, automatically picking up runtime changes. No agent recreation needed—tools change, memory stays. Learn more: Strands Agent Architecture
Key advantages:
- Zero conversation loss: agent.messages preserved across tool swaps
- No agent recreation: Same agent instance handles all queries
- Runtime flexibility: Add/remove tools between any two queries
- Production-ready: Handles long conversations with dynamic tool needs
GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/02-semantic-tools-demo
Demo Improvement + Research Context
This demo achieved perfect accuracy on 13 queries, but this is a controlled test with 29 tools and clear distinctions. Research (Internal Representations, 2025) shows that in production systems with hundreds of tools and ambiguous queries, semantic tool selection achieves up to 86.4% accuracy in detecting and preventing tool selection hallucinations.
The technique still significantly reduces errors compared to traditional approaches (which can drop below 50% with 100+ tools), but it's not a silver bullet. Complex domains with overlapping tool semantics remain challenging.
What's Next
Semantic tool selection reduces tool selection hallucinations and token costs. However, agents can still hallucinate operation success by confirming bookings without processing payments or by ignoring business rules.
Part 3: Neurosymbolic Integration shows how symbolic rules enforce constraints at execution time.
You can have this built-in and ready to integrate with your agents in production using Amazon Bedrock AgentCore Gateway.
Key Takeaways
- Dual problem: Tool selection errors + token waste
- Significant error reduction: Fewer tools means fewer wrong choices
- 89% token savings: 29 → 3 tools per call
- $60K/month savings at production scale
- Simple implementation: ~20 lines with FAISS
- Strands Agents makes this simple: The @tool decorator and dynamic tool loading let you build semantic filtering in a few lines; swap tools at runtime without changing agent logic
References
- Internal Representations as Indicators of Hallucinations
- Solving Context Window Overflow - 7x token reduction
- Semantic Tool Selection in Practice - 89% reduction
- Strands Agents Meta-Tooling
Thank you!