The Problem Nobody Talks About
You've built an agent. You've wired up 100+ tools. You feel good about it.
Then it starts hallucinating. Picking the wrong tool. Collapsing entire workflows over a single misclassified query.
The failure isn't the LLM. It's the architecture.
My previous post covered a genuine use case I found for Gemma4 — and this is exactly where it fits in.
The Naive Approach (and Why It Fails)
Load all tools into LLM context and let it decide.
Sounds simple. It is. And it breaks at scale.
- Hallucinations increase with context length
- Bloated context = slower, more expensive calls
- LLM gets confused choosing between 50+ tool descriptions
Result: slow, unreliable, expensive agents.
The "Slightly Smarter" Approach (Still Broken)
RAG over tool descriptions. Seems reasonable:
User query → embedding → top 5 matches → LLM picks
Sounds clean. But embeddings can't distinguish intent.
Similar words ≠ same meaning.
Example:
User says: "I need iPhone"
Tool:check_product_catalog
Embedding search may:
- Miss the tool completely
- Retrieve irrelevant tools
- Break the entire downstream workflow
Wrong tool gets selected. Entire workflow collapses.
The problem isn't the embedding model. It's that you're asking a single layer to carry too much responsibility.
What Actually Worked: A Layered Filtering Stack
The approach that works in production is layered filtering — not pure semantic search, not raw LLM reasoning. Both together, in the right order.
Here's the stack I use:
| Layer | Tool | Role |
|---|---|---|
| Intent Classification | gemma4:e4b via Ollama (9.6 GB, local) | Fast, private intent routing |
| Semantic Search | nomic-embed-text via Ollama (274 MB) | Embedding over filtered subset |
The 5-Step Architecture
Step 1 — Classify Intent First
A lightweight LLM maps the query to a high-level category before any search happens.
This eliminates entire irrelevant domains upfront. If the user is asking about orders, you never even look at inventory tools.
"I need iPhone" → category: product_discovery
Step 2 — Hard Filter by Metadata
Deterministic rules. Not embeddings.
Only tools matching the classified intent category are eligible. The search space collapses from 100+ tools to maybe 10–15.
category: product_discovery → eligible tools: [check_catalog, search_products, get_inventory, ...]
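The filter itself is trivial by design — a plain equality check over tool metadata, no embeddings involved. Tool names and categories below are illustrative:

```python
# Each tool carries metadata; the hard filter is a deterministic
# equality check over that metadata. No embeddings, no model calls.
TOOLS = [
    {"name": "check_catalog", "category": "product_discovery"},
    {"name": "search_products", "category": "product_discovery"},
    {"name": "get_inventory", "category": "product_discovery"},
    {"name": "cancel_order", "category": "order_management"},
    {"name": "track_shipment", "category": "order_management"},
]

def filter_by_category(tools: list, category: str) -> list:
    return [t for t in tools if t["category"] == category]

eligible = filter_by_category(TOOLS, "product_discovery")
print([t["name"] for t in eligible])
# → ['check_catalog', 'search_products', 'get_inventory']
```

Because this layer is deterministic, it never hallucinates: a tool is either in the category or it isn't.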
Step 3 — Semantic Search Within the Clean Subset
Now RAG works — because it's running over a small, relevant set, not noisy hundreds.
nomic-embed-text finds the closest semantic matches within your filtered pool. False positives drop dramatically.
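A sketch of the ranking step, assuming cosine similarity over embeddings. Real vectors would come from nomic-embed-text via Ollama; the toy 3-dimensional vectors here are stand-ins so the example is self-contained:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, tool_vecs: dict, k: int = 3):
    # Score every tool in the already-filtered subset, highest first.
    scored = {name: cosine(query_vec, vec) for name, vec in tool_vecs.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy vectors standing in for real nomic-embed-text embeddings.
tool_vecs = {
    "check_catalog": [0.9, 0.1, 0.0],
    "search_products": [0.8, 0.3, 0.1],
    "get_inventory": [0.2, 0.9, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, tool_vecs, k=2))
```

The key point: this search runs over 10–15 filtered tools, not 100+, so near-misses in embedding space have far fewer wrong answers to land on.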
Step 4 — Score and Rank
Confidence scoring on the top candidates. Auditable. Explainable.
No black box decisions. You can log exactly why a tool was selected — which matters when things break in production.
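A sketch of what auditable scoring can look like. The weights and score components here are assumptions, not values from this system — the point is that every ranking decision reduces to numbers you can log:

```python
def score_candidates(candidates):
    """Combine semantic similarity with a keyword-overlap bonus.
    The 0.7 / 0.3 weights are illustrative, not tuned values."""
    ranked = []
    for c in candidates:
        score = 0.7 * c["similarity"] + 0.3 * c["keyword_overlap"]
        ranked.append({**c, "score": round(score, 3)})
    ranked.sort(key=lambda c: c["score"], reverse=True)
    for c in ranked:
        # Auditable: log exactly why each tool ranked where it did.
        print(f"{c['name']}: sim={c['similarity']} "
              f"overlap={c['keyword_overlap']} -> {c['score']}")
    return ranked

candidates = [
    {"name": "check_catalog", "similarity": 0.93, "keyword_overlap": 1.0},
    {"name": "search_products", "similarity": 0.88, "keyword_overlap": 0.5},
]
ranked = score_candidates(candidates)
```

When a selection goes wrong in production, those logged components tell you which layer to fix — the embeddings, the keyword signal, or the weights.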
Step 5 — LLM Final Pick
Send the top candidates + the original user query to the LLM.
Now it's choosing between 3–5 relevant tools, not 100+. The context is clean. The decision is accurate.
[check_product_catalog, search_products, get_item_details] + "I need iPhone"
→ LLM picks: check_product_catalog
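The final prompt is small by construction. A sketch of handing the shortlist to the model — the `llm` callable is again a stub standing in for the local model:

```python
def final_pick(query: str, shortlist: list, llm) -> str:
    prompt = (
        "Pick the single best tool for the query below. "
        f"Reply with one tool name from {shortlist}.\n"
        f"Query: {query}"
    )
    choice = llm(prompt).strip()
    # Constrain the answer: the model can only pick from the shortlist.
    return choice if choice in shortlist else shortlist[0]

shortlist = ["check_product_catalog", "search_products", "get_item_details"]
fake_llm = lambda prompt: "check_product_catalog"  # stub for illustration
print(final_pick("I need iPhone", shortlist, fake_llm))
# → check_product_catalog
```

Validating the model's answer against the shortlist means even a hallucinated tool name degrades to the top-ranked candidate instead of crashing the workflow.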
Why This Works
Each layer does one job. No single layer carries too much responsibility.
Intent classifier → reduces domain space
Metadata filter → reduces tool count (deterministic, fast)
Semantic search → finds closest match in clean subset
Scoring → adds confidence + auditability
LLM → makes final call with minimal context
That's why it's reliable. That's why it's fast.
The Part No One Talks About
Even the best architecture fails if your tool descriptions are written like API docs.
```python
# Bad — written for engineers
def check_product_catalog(sku: str, region: str) -> dict:
    """Queries product catalog by SKU with region filtering."""

# Good — written for users
def check_product_catalog(...):
    """
    Use this when someone wants to find a product, check if something
    is available, look up an item by name or model, or browse what's
    in stock. Works for queries like 'do you have iPhone 15?' or
    'show me your laptop options'.
    """
```
Write descriptions in the language your users actually speak.
Think about how someone types a message to an agent — not how an engineer names a function.
The system learns from the language you give it.
Results
| Metric | Value |
|---|---|
| End-to-end latency (all 5 steps) | < 2 seconds |
| Tool selection accuracy | Significantly higher than pure RAG |
| Model infra | Fully local, private |
| Bottleneck | 1000+ concurrent users (next problem to solve) |
The Real Lesson
We don't need smarter models.
We don't need infinite context windows.
We need better system design and the discipline to think like a user.
A layered architecture with good tool descriptions, running on lightweight local models, will outperform a bloated LLM context every time.
That's the real work.
Built this in production. Happy to discuss the concurrent scaling problem — that's a different beast entirely.
Found this useful? Follow for more on agent architecture and production ML systems.