Daathwi Naagh

100s of Tools in Your Agent — Here's How to Actually Pick the Right One

The Problem Nobody Talks About

You've built an agent. You've wired up 100+ tools. You feel good about it.

Then it starts hallucinating. Picking the wrong tool. Collapsing entire workflows over a single misclassified query.

The failure isn't the LLM. It's the architecture.

My previous post covered a genuine use case I found for Gemma4 — and this is exactly where it fits in.


The Naive Approach (and Why It Fails)

Load all tools into LLM context and let it decide.

Sounds simple. It is. And it breaks at scale.

  • Hallucinations increase with context length
  • Bloated context = slower, more expensive calls
  • LLM gets confused choosing between 50+ tool descriptions

Result: Slow. Unreliable. Expensive agents.


The "Slightly Smarter" Approach (Still Broken)

RAG over tool descriptions. Seems reasonable:

User query → embedding → top 5 matches → LLM picks

Sounds clean. But embeddings can't distinguish intent.

Similar words ≠ same meaning.

Example:

User says: "I need iPhone"
Tool: check_product_catalog

Embedding search may:

  • Miss the tool completely
  • Retrieve irrelevant tools
  • Break the entire downstream workflow

Wrong tool gets selected. Entire workflow collapses.

The problem isn't the embedding model. It's that you're asking a single layer to carry too much responsibility.


What Actually Worked: A Layered Filtering Stack

The approach that works in production is layered filtering — not pure semantic search, not raw LLM reasoning. Both together, in the right order.

Here's the stack I use:

| Layer | Tool | Role |
| --- | --- | --- |
| Intent classification | gemma4:e4b via Ollama (9.6 GB, local) | Fast, private intent routing |
| Semantic search | nomic-embed-text via Ollama (274 MB) | Embedding over filtered subset |

The 5-Step Architecture

Step 1 — Classify Intent First

A lightweight LLM maps the query to a high-level category before any search happens.

This eliminates entire irrelevant domains upfront. If the user is asking about orders, you never even look at inventory tools.

"I need iPhone" → category: product_discovery
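
A minimal sketch of this step, assuming a small category list and an injectable `llm` callable (in production this would wrap a local Ollama call; the category names here are illustrative, not the post's actual taxonomy):

```python
# Step 1 sketch: route a query to a high-level category before any search.
# CATEGORIES and the `llm` callable are assumptions for illustration.

CATEGORIES = ["product_discovery", "order_management", "account_support"]

def classify_intent(query: str, llm, categories=CATEGORIES) -> str:
    """Ask a small local model for exactly one category name."""
    prompt = (
        "Classify the user query into exactly one category from this list: "
        f"{', '.join(categories)}.\n"
        f"Query: {query}\n"
        "Answer with the category name only."
    )
    answer = llm(prompt).strip().lower()
    # Guard against an off-list answer: fall back to the first category.
    return answer if answer in categories else categories[0]

# With a stubbed model the routing looks like:
fake_llm = lambda prompt: "product_discovery"
print(classify_intent("I need iPhone", fake_llm))  # product_discovery
```

Injecting the model call keeps the classifier testable without a running Ollama instance.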

Step 2 — Hard Filter by Metadata

Deterministic rules. Not embeddings.

Only tools matching the classified intent category are eligible. The search space collapses from 100+ tools to maybe 10–15.

category: product_discovery → eligible tools: [check_catalog, search_products, get_inventory, ...]
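
The hard filter is just a lookup over tool metadata — no model calls at all. A sketch, with an invented registry (the real tool catalog and its category tags will differ):

```python
# Step 2 sketch: deterministic metadata filter. The registry is illustrative.

TOOL_REGISTRY = {
    "check_catalog":   {"category": "product_discovery"},
    "search_products": {"category": "product_discovery"},
    "get_inventory":   {"category": "product_discovery"},
    "cancel_order":    {"category": "order_management"},
    "track_shipment":  {"category": "order_management"},
}

def eligible_tools(category: str, registry=TOOL_REGISTRY) -> list[str]:
    """Pure dict scan — fast, deterministic, fully auditable."""
    return [name for name, meta in registry.items()
            if meta["category"] == category]

print(eligible_tools("product_discovery"))
# ['check_catalog', 'search_products', 'get_inventory']
```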

Step 3 — Semantic Search Within the Clean Subset

Now RAG works — because it's running over a small, relevant set, not noisy hundreds.

nomic-embed-text finds the closest semantic matches within your filtered pool. False positives drop dramatically.
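
In sketch form, this step is plain cosine similarity run only over the filtered pool. The `embed` callable stands in for nomic-embed-text served by Ollama; the toy 2-d vectors below are fabricated so the example is self-contained:

```python
# Step 3 sketch: semantic search restricted to the pre-filtered subset.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query, tool_descriptions, embed, top_k=3):
    """Rank only the eligible tools by similarity to the query."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc)))
              for name, desc in tool_descriptions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors stand in for real nomic-embed-text output.
toy_vectors = {
    "do you have iPhone 15":  [0.88, 0.12],
    "find a product by name": [0.90, 0.10],
    "check stock levels":     [0.70, 0.30],
}
embed = lambda text: toy_vectors[text]

subset = {"check_catalog": "find a product by name",
          "get_inventory": "check stock levels"}
ranked = semantic_search("do you have iPhone 15", subset, embed)
print(ranked[0][0])  # check_catalog ranks first
```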

Step 4 — Score and Rank

Confidence scoring on the top candidates. Auditable. Explainable.

No black box decisions. You can log exactly why a tool was selected — which matters when things break in production.
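
One way to sketch this: attach an explicit reason string to every candidate so the decision can be logged and replayed later. The threshold value here is an assumption:

```python
# Step 4 sketch: turn raw similarity into an auditable decision record.
# The 0.5 threshold is illustrative, not a tuned production value.

def score_and_rank(candidates, threshold=0.5):
    """candidates: list of (tool_name, similarity). Returns audit records."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [
        {
            "tool": name,
            "confidence": round(sim, 3),
            "passed_threshold": sim >= threshold,
            "reason": f"similarity {sim:.3f} vs threshold {threshold}",
        }
        for name, sim in ranked
    ]

for record in score_and_rank([("check_catalog", 0.91), ("get_inventory", 0.42)]):
    print(record)  # each record says exactly why a tool was (not) kept
```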

Step 5 — LLM Final Pick

Send the top candidates + the original user query to the LLM.

Now it's choosing between 3–5 relevant tools, not 100+. The context is clean. The decision is accurate.

[check_product_catalog, search_products, get_item_details] + "I need iPhone"
→ LLM picks: check_product_catalog
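
A sketch of the final pick, again with the model call injected; the prompt wording is an assumption, and the guard clause covers the model answering off-list:

```python
# Step 5 sketch: the LLM chooses among 3-5 survivors, not 100+ tools.

def final_pick(query: str, candidates: list[str], llm) -> str:
    prompt = (
        "Pick the single best tool for this user query.\n"
        f"Query: {query}\n"
        f"Tools: {', '.join(candidates)}\n"
        "Answer with the tool name only."
    )
    choice = llm(prompt).strip()
    # Guard against an off-list answer: fall back to the top-ranked candidate.
    return choice if choice in candidates else candidates[0]

tools = ["check_product_catalog", "search_products", "get_item_details"]
fake_llm = lambda prompt: "check_product_catalog"
print(final_pick("I need iPhone", tools, fake_llm))  # check_product_catalog
```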

Why This Works

Each layer does one job. No single layer carries too much responsibility.

Intent classifier → reduces domain space
Metadata filter   → reduces tool count (deterministic, fast)
Semantic search   → finds closest match in clean subset
Scoring           → adds confidence + auditability
LLM               → makes final call with minimal context

That's why it's reliable. That's why it's fast.
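
The layers above compose into one small function. A self-contained sketch with trivial stubs standing in for the real classifier, embedder, and LLM (all names are illustrative):

```python
# The five layers as one pipeline; every model call is injected so each
# stage stays independently swappable and testable.

def select_tool(query, registry, classify, search, pick):
    category = classify(query)                       # layer 1: intent
    pool = {name: meta["description"]
            for name, meta in registry.items()
            if meta["category"] == category}         # layer 2: hard filter
    ranked = search(query, pool)                     # layers 3-4: embed + score
    shortlist = [name for name, _ in ranked[:3]]
    return pick(query, shortlist)                    # layer 5: final LLM call

# Wiring with stubs to show the data flow:
registry = {
    "check_catalog": {"category": "product_discovery", "description": "find a product"},
    "cancel_order":  {"category": "order_management",  "description": "cancel an order"},
}
result = select_tool(
    "I need iPhone",
    registry,
    classify=lambda q: "product_discovery",
    search=lambda q, pool: [(name, 1.0) for name in pool],
    pick=lambda q, names: names[0],
)
print(result)  # check_catalog
```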


The Part No One Talks About

Even the best architecture fails if your tool descriptions are written like API docs.

# Bad — written for engineers
def check_product_catalog(sku: str, region: str) -> dict:
    """Queries product catalog by SKU with region filtering."""
# Good — written for users
def check_product_catalog(...):
    """
    Use this when someone wants to find a product, check if something 
    is available, look up an item by name or model, or browse what's 
    in stock. Works for queries like 'do you have iPhone 15?' or 
    'show me your laptop options'.
    """

Write descriptions in the language your users actually speak.

Think about how someone types a message to an agent — not how an engineer names a function.

The system learns from the language you give it.


Results

| Metric | Value |
| --- | --- |
| End-to-end latency (all 5 steps) | < 2 seconds |
| Tool selection accuracy | Significantly higher than pure RAG |
| Model infra | Fully local, private |
| Bottleneck | 1000+ concurrent users (next problem to solve) |

The Real Lesson

We don't need smarter models.

We don't need infinite context windows.

We need better system design and the discipline to think like a user.

A layered architecture with good tool descriptions, running on lightweight local models, will outperform a bloated LLM context every time.

That's the real work.


Built this in production. Happy to discuss the concurrent scaling problem — that's a different beast entirely.

Found this useful? Follow for more on agent architecture and production ML systems.
