Elizabeth Fuentes L for AWS

Posted on • Originally published at builder.aws.com

Reduce Agent Errors and Token Costs with Semantic Tool Selection

When AI agents have many similar tools, they often select the wrong one and consume excessive tokens by processing all tool descriptions. Semantic tool selection filters tools before the agent processes them, which improves accuracy and reduces token costs. We'll build a travel agent with Strands Agents and use FAISS to filter 29 tools down to the top 3 most relevant, comparing filtered vs unfiltered results.

Reduce Agent Errors and Token Costs

In Part 1, Graph-RAG prevented agents from hallucinating statistics. But agents still hallucinate during tool selection, choosing the wrong tool even when the underlying data is correct.

This Series: 4 Production Techniques

Part 1 : GraphRAG - Relationship-aware knowledge graphs preventing hallucinations in aggregations and precise queries

Part 2 (This Post): Semantic Tool Selection - Vector-based tool filtering for accurate tool selection

Part 3: Neurosymbolic Guardrails - Symbolic reasoning for verifiable decisions

Part 4: Multi-Agent Validation - Agent teams detecting hallucinations before damage

The code uses Strands Agents.

GitHub repository: sample-why-agents-fail

git clone https://github.com/aws-samples/sample-why-agents-fail

The Dual Problem: Errors + Token Waste

Imagine your travel agent has 29 tools: search_hotels, search_flights, search_hotel_reviews, check_hotel_availability, get_hotel_details, book_hotel... and 25 more similar ones.

Query: "How much does Hotel Marriott cost?"

The agent picks get_hotel_details() instead of get_hotel_pricing(). Wrong tool, wrong answer. And you just burned 4,500 tokens sending all 29 tool descriptions.

This happens because:

  • Similar tool names cause confusion
  • Generic tools (search, check, get_info) are overused
  • More tools = more hallucinations
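
To make the confusion concrete, here are two hypothetical tool docstrings of the kind that trip agents up (plain Python functions invented for illustration, not the repo's actual tools). Their descriptions share most of their vocabulary, which is exactly what confuses the model:

```python
def get_hotel_details(hotel_name: str) -> dict:
    """Get detailed information about a hotel, including amenities and location."""

def get_hotel_pricing(hotel_name: str) -> dict:
    """Get pricing information for a hotel, including nightly rates and fees."""

# Word-level (Jaccard) overlap between the two descriptions
a = set(get_hotel_details.__doc__.lower().split())
b = set(get_hotel_pricing.__doc__.lower().split())
overlap = len(a & b) / len(a | b)
print(round(overlap, 2))  # → 0.4
```

Forty percent shared vocabulary for two tools with entirely different jobs: the model has to disambiguate on thin signal, every single call.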

Research (Internal Representations, 2025) shows tool selection hallucinations increase with tool count. Production systems report 89% token reduction with semantic tool selection (rconnect.tech, 2025).

Prerequisites

Setup:

uv venv && uv pip install -r requirements.txt

Key libraries: Strands Agents (AI agent framework), FAISS (vector search), SentenceTransformers (embeddings)

Model options: This demo uses OpenAI GPT-4o-mini by default. You can change to any provider that Strands supports — see Strands Model Providers for configuration.


The Solution: Semantic Tool Selection with FAISS

Research from the Internal Representations study (2025) identifies critical agent failure modes that emerge as tool counts scale:

  1. Function selection errors - Calling non-existent tools
  2. Parameter errors - Malformed or invalid arguments
  3. Completeness errors - Missing required parameters
  4. Tool bypass behavior - Generating outputs instead of calling tools

When agents have 10+ similar tools, choice overload causes both accuracy degradation AND cost explosion.

Each LLM call sends all 29 tool descriptions. In a 50-step workflow, this results in 29 tools × 50 calls, which creates significant token waste and slows response times due to increased context processing.
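A quick back-of-envelope sketch of that overhead (the ~155 tokens per tool description is a hypothetical figure for illustration, not measured from the repo):

```python
TOKENS_PER_TOOL = 155   # hypothetical average tokens per tool description
NUM_TOOLS = 29
STEPS = 50              # LLM calls in the workflow

per_call = NUM_TOOLS * TOKENS_PER_TOOL        # tool schema tokens sent on every call
workflow_total = per_call * STEPS             # the same descriptions, resent 50 times

filtered_per_call = 3 * TOKENS_PER_TOOL       # top-3 filtered tools instead of 29
savings = 1 - filtered_per_call / per_call    # fraction of tool tokens eliminated
print(per_call, workflow_total, round(savings, 2))  # → 4495 224750 0.9
```

Note that the savings fraction (1 − 3/29 ≈ 90%) holds regardless of the per-tool token estimate: it depends only on how aggressively you filter.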



The Demo: Two Agents, Same Data, Different Approaches

Test this code in token_efficiency_analysis.ipynb

To compare the two agents, I created 13 travel queries with verified ground truth answers:

Test Queries

Test 1: Traditional Approach (All 29 Tools)

# 02-semantic-tools-demo/token_efficiency_analysis.ipynb - 29 travel
from strands import Agent
from strands.models.openai import OpenAIModel
from enhanced_tools import ALL_TOOLS

for query, expected in TESTS:
    agent = Agent(tools=ALL_TOOLS, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)

Results:

Test 2: Semantic Approach (Top-3 Filtered Tools)

Agent receives only the 3 most relevant tools per query.

First, I build the semantic index with FAISS using the registry.py helper library.

# registry.py
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings

# Module-level state shared between build_index and search_tools
index = None
registered_tools = []

def build_index(tools):
    """Build FAISS index from tool docstrings"""
    global index, registered_tools
    registered_tools = list(tools)

    # Concatenate name + docstring for each tool
    texts = [f"{t.__name__}: {t.__doc__}" for t in registered_tools]
    embeddings = model.encode(texts)

    # Create L2 distance index
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings.astype('float32'))
    return index

def search_tools(query: str, top_k: int = 3):
    """Find the most relevant tools using FAISS"""
    emb = model.encode([query])
    _, indices = index.search(emb.astype('float32'), top_k)
    return [registered_tools[i] for i in indices[0]]

The all-MiniLM-L6-v2 model is lightweight (22M parameters, 384 dimensions), optimized for semantic similarity, and performs efficiently on CPUs. This makes it ideal for matching short queries to tool descriptions.
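Under the hood, IndexFlatL2 is just exact brute-force nearest-neighbor search by squared L2 distance. A dependency-free toy version with made-up 3-dimensional "embeddings" (both the vectors and the query are invented for illustration; real embeddings are 384-dim):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Tiny made-up "embeddings" for three tools
tool_vectors = {
    "get_hotel_pricing": [0.9, 0.1, 0.0],
    "get_hotel_details": [0.7, 0.3, 0.1],
    "search_flights":    [0.0, 0.1, 0.9],
}

query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "How much does the hotel cost?"

# Rank tools by distance to the query — this is the top-k search FAISS accelerates
ranked = sorted(tool_vectors, key=lambda name: squared_l2(tool_vectors[name], query_vec))
print(ranked[:2])  # → ['get_hotel_pricing', 'get_hotel_details']
```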


from enhanced_tools import ALL_TOOLS
from registry import build_index, search_tools

build_index(ALL_TOOLS)
print("✅ FAISS index built")

Vector search filters tools before the agent sees them.

Each time the agent is invoked, the query runs a semantic search against the tool index via search_tools, and the agent is initialized with only those selected tools.

# 02-semantic-tools-demo/token_efficiency_analysis.ipynb - 29 travel
from strands import Agent
from strands.models.openai import OpenAIModel

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]

    agent = Agent(tools=selected, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)


Results:

Semantic Approach - Top-3 Filtered Tools

Although accuracy did not improve much, there is a large reduction in tokens: average tokens per query dropped from 1,557 to 275.

Key insight: The agent never processes the other 26 tools. These tools aren't removed from the system; they simply never enter the agent's context window.

Learn more about Strands Agents tool handling: Tool Documentation

Test 3: Semantic + Memory (Single Agent)

The basic approach creates a new agent per query, losing conversation history. For production multi-turn conversations, you want the same agent across all queries, with tools swapped dynamically. Strands Agents provides native tool swapping:

1. Dynamic Tool Swapping: swap_tools

# Add/remove tools at runtime without recreating the agent
from typing import Callable, List

def swap_tools(agent, new_tools: List[Callable]):
    """Swap tools in a live agent without losing conversation memory.

    Clears the agent's tool_registry and re-registers only the given tools.
    Since get_all_tools_config() is called each event loop cycle, the agent
    will see the new tools on the next call.
    """
    reg = agent.tool_registry
    reg.registry.clear()
    reg.dynamic_tools.clear()
    for t in new_tools:
        reg.register_tool(t)

2. Conversation Memory Preservation

# Swap tools between queries while keeping conversation history
swap_tools(agent, new_tools)  # agent.messages preserved

3. Runtime Tool Discovery

# Agent picks up tool changes automatically at each event loop
# No manual refresh needed - just modify tool_registry

Strands Agents maintains memory while tools change.
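To see why this preserves memory, here is a minimal stand-in for the pattern in plain Python (ToyAgent and ToyRegistry are invented stand-ins, not the actual Strands classes): swapping mutates the registry's contents in place, while the agent object and its messages list are never replaced.

```python
class ToyRegistry:
    """Stand-in for a tool registry: maps tool names to callables."""
    def __init__(self):
        self.registry = {}
    def register_tool(self, fn):
        self.registry[fn.__name__] = fn

class ToyAgent:
    """Stand-in agent: holds conversation memory and a tool registry."""
    def __init__(self, tools):
        self.messages = []  # conversation memory
        self.tool_registry = ToyRegistry()
        for t in tools:
            self.tool_registry.register_tool(t)

def swap_tools(agent, new_tools):
    # Mutate registry contents in place; the agent and its memory are untouched
    agent.tool_registry.registry.clear()
    for t in new_tools:
        agent.tool_registry.register_tool(t)

def search_hotels(): "Find hotels."
def search_flights(): "Find flights."

agent = ToyAgent([search_hotels])
agent.messages.append({"role": "user", "content": "Find me a hotel"})

swap_tools(agent, [search_flights])        # tools change...
print(list(agent.tool_registry.registry))  # → ['search_flights']
print(len(agent.messages))                 # → 1 (...memory stays)
```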

# 02-semantic-tools-demo/token_efficiency_analysis.ipynb 
from strands import Agent
from strands.models.openai import OpenAIModel
from enhanced_tools import ALL_TOOLS
from registry import build_index, search_tools, swap_tools

initial_tools = search_tools(TESTS[0][0], top_k=3)
memory_agent = Agent(tools=initial_tools, system_prompt=PROMPT, model=MODEL)

mem_results = []
mem_correct = 0
mem_total_tokens = 0

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]
    swap_tools(memory_agent, selected)

    tools, tokens = run_and_capture_with_tokens(memory_agent, query)


Learn more: Strands Tool Registry

Results:

The trade-off: The Semantic + Memory approach uses more tokens than the Semantic-only approach because conversation context accumulates. However, it still reduces token consumption compared to the traditional method while maintaining full conversation history.

Why this works: Strands calls tool_registry.get_all_tools_config() at each event loop cycle, automatically picking up runtime changes. No agent recreation needed—tools change, memory stays. Learn more: Strands Agent Architecture

Key advantages:

  • Zero conversation loss: agent.messages preserved across tool swaps
  • No agent recreation: Same agent instance handles all queries
  • Runtime flexibility: Add/remove tools between any two queries
  • Production-ready: Handles long conversations with dynamic tool needs

GitHub repository: sample-why-agents-fail/stop-ai-agent-hallucinations/02-semantic-tools-demo


Demo Improvement + Research Context

Bar chart comparing metrics: semantic-tool-selection-results

This demo achieved perfect accuracy on 13 queries, but this is a controlled test with 29 tools and clear distinctions. Research (Internal Representations, 2025) shows that in production systems with hundreds of tools and ambiguous queries, semantic tool selection achieves up to 86.4% accuracy in detecting and preventing tool selection hallucinations.

The technique still significantly reduces errors compared to traditional approaches (which can drop below 50% with 100+ tools), but it's not a silver bullet. Complex domains with overlapping tool semantics remain challenging.


What's Next

Semantic tool selection reduces tool selection hallucinations and token costs. However, agents can still hallucinate operation success by confirming bookings without processing payments or by ignoring business rules.

Part 3: Neurosymbolic Integration shows how symbolic rules enforce constraints at execution time.

You can get this built-in and ready to integrate with your production agents using Amazon Bedrock AgentCore Gateway.

Key Takeaways

  1. Dual problem: Tool selection errors + token waste
  2. Significant error reduction: Fewer tools means fewer wrong choices
  3. 89% token savings: 29 → 3 tools per call
  4. $60K/month savings at production scale
  5. Simple implementation: ~20 lines with FAISS
  6. Strands Agents makes this simple: The @tool decorator and dynamic tool loading let you build semantic filtering in a few lines — swap tools at runtime without changing agent logic


Thank you!
