The Problem Nobody Talks About
You've built an agent. You've wired up 100+ tools. You feel good about it.
Then it starts hallucinating. Picking the wrong tool. Collapsing entire workflows over a single misclassified query.
The failure isn't the LLM. It's the architecture.
My previous post covered a genuine use case I found for Gemma4 — and this is exactly where it fits in.
The Naive Approach (and Why It Fails)
Load all tools into LLM context and let it decide.
Sounds simple. It is. And it breaks at scale.
- Hallucinations increase with context length
- Bloated context = slower, more expensive calls
- LLM gets confused choosing between 50+ tool descriptions
Result: slow, unreliable, expensive agents.
The "Slightly Smarter" Approach (Still Broken)
RAG over tool descriptions. Seems reasonable:
User query → embedding → top 5 matches → LLM picks
Sounds clean. But embeddings can't distinguish intent.
Similar words ≠ same meaning.
Example:
User says: "I need iPhone"
Tool:check_product_catalog
Embedding search may:
- Miss the tool completely
- Retrieve irrelevant tools
- Break the entire downstream workflow
Wrong tool gets selected. Entire workflow collapses.
The problem isn't the embedding model. It's that you're asking a single layer to carry too much responsibility.
What Actually Worked: A Layered Filtering Stack
The approach that works in production is layered filtering — not pure semantic search, not raw LLM reasoning. Both together, in the right order.
Here's the stack I use:
| Layer | Tool | Role |
|---|---|---|
| Intent Classification | gemma4:e4b via Ollama (9.6 GB, local) | Fast, private intent routing |
| Semantic Search | nomic-embed-text via Ollama (274 MB) | Embedding over filtered subset |
The 5-Step Architecture
Step 1 — Classify Intent First
A lightweight LLM maps the query to a high-level category before any search happens.
This eliminates entire irrelevant domains upfront. If the user is asking about orders, you never even look at inventory tools.
"I need iPhone" → category: product_discovery
Step 2 — Hard Filter by Metadata
Deterministic rules. Not embeddings.
Only tools matching the classified intent category are eligible. The search space collapses from 100+ tools to maybe 10–15.
category: product_discovery → eligible tools: [check_catalog, search_products, get_inventory, ...]
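The filter itself is trivial by design — a plain equality check over tool metadata, no embeddings involved. Tool names and categories below are illustrative:

```python
# Each tool carries metadata; the hard filter is a deterministic
# equality check over that metadata. No embeddings, no model calls.
TOOLS = [
    {"name": "check_catalog", "category": "product_discovery"},
    {"name": "search_products", "category": "product_discovery"},
    {"name": "get_inventory", "category": "product_discovery"},
    {"name": "cancel_order", "category": "order_management"},
    {"name": "track_shipment", "category": "order_management"},
]

def filter_by_category(tools: list, category: str) -> list:
    return [t for t in tools if t["category"] == category]

eligible = filter_by_category(TOOLS, "product_discovery")
print([t["name"] for t in eligible])
# → ['check_catalog', 'search_products', 'get_inventory']
```

Because this layer is deterministic, it never hallucinates: a tool is either in the category or it isn't.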
Step 3 — Semantic Search Within the Clean Subset
Now RAG works — because it's running over a small, relevant set, not noisy hundreds.
nomic-embed-text finds the closest semantic matches within your filtered pool. False positives drop dramatically.
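A sketch of the ranking step, assuming cosine similarity over embeddings. Real vectors would come from nomic-embed-text via Ollama; the toy 3-dimensional vectors here are stand-ins so the example is self-contained:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, tool_vecs: dict, k: int = 3):
    # Score every tool in the already-filtered subset, highest first.
    scored = {name: cosine(query_vec, vec) for name, vec in tool_vecs.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy vectors standing in for real nomic-embed-text embeddings.
tool_vecs = {
    "check_catalog": [0.9, 0.1, 0.0],
    "search_products": [0.8, 0.3, 0.1],
    "get_inventory": [0.2, 0.9, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, tool_vecs, k=2))
```

The key point: this search runs over 10–15 filtered tools, not 100+, so near-misses in embedding space have far fewer wrong answers to land on.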
Step 4 — Score and Rank
Confidence scoring on the top candidates. Auditable. Explainable.
No black box decisions. You can log exactly why a tool was selected — which matters when things break in production.
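A sketch of what auditable scoring can look like. The weights and score components here are assumptions, not values from this system — the point is that every ranking decision reduces to numbers you can log:

```python
def score_candidates(candidates):
    """Combine semantic similarity with a keyword-overlap bonus.
    The 0.7 / 0.3 weights are illustrative, not tuned values."""
    ranked = []
    for c in candidates:
        score = 0.7 * c["similarity"] + 0.3 * c["keyword_overlap"]
        ranked.append({**c, "score": round(score, 3)})
    ranked.sort(key=lambda c: c["score"], reverse=True)
    for c in ranked:
        # Auditable: log exactly why each tool ranked where it did.
        print(f"{c['name']}: sim={c['similarity']} "
              f"overlap={c['keyword_overlap']} -> {c['score']}")
    return ranked

candidates = [
    {"name": "check_catalog", "similarity": 0.93, "keyword_overlap": 1.0},
    {"name": "search_products", "similarity": 0.88, "keyword_overlap": 0.5},
]
ranked = score_candidates(candidates)
```

When a selection goes wrong in production, those logged components tell you which layer to fix — the embeddings, the keyword signal, or the weights.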
Step 5 — LLM Final Pick
Send the top candidates + the original user query to the LLM.
Now it's choosing between 3–5 relevant tools, not 100+. The context is clean. The decision is accurate.
[check_product_catalog, search_products, get_item_details] + "I need iPhone"
→ LLM picks: check_product_catalog
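The final prompt is small by construction. A sketch of handing the shortlist to the model — the `llm` callable is again a stub standing in for the local model:

```python
def final_pick(query: str, shortlist: list, llm) -> str:
    prompt = (
        "Pick the single best tool for the query below. "
        f"Reply with one tool name from {shortlist}.\n"
        f"Query: {query}"
    )
    choice = llm(prompt).strip()
    # Constrain the answer: the model can only pick from the shortlist.
    return choice if choice in shortlist else shortlist[0]

shortlist = ["check_product_catalog", "search_products", "get_item_details"]
fake_llm = lambda prompt: "check_product_catalog"  # stub for illustration
print(final_pick("I need iPhone", shortlist, fake_llm))
# → check_product_catalog
```

Validating the model's answer against the shortlist means even a hallucinated tool name degrades to the top-ranked candidate instead of crashing the workflow.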
Why This Works
Each layer does one job. No single layer carries too much responsibility.
Intent classifier → reduces domain space
Metadata filter → reduces tool count (deterministic, fast)
Semantic search → finds closest match in clean subset
Scoring → adds confidence + auditability
LLM → makes final call with minimal context
That's why it's reliable. That's why it's fast.
The Part No One Talks About
Even the best architecture fails if your tool descriptions are written like API docs.
```python
# Bad — written for engineers
def check_product_catalog(sku: str, region: str) -> dict:
    """Queries product catalog by SKU with region filtering."""

# Good — written for users
def check_product_catalog(...):
    """
    Use this when someone wants to find a product, check if something
    is available, look up an item by name or model, or browse what's
    in stock. Works for queries like 'do you have iPhone 15?' or
    'show me your laptop options'.
    """
```
Write descriptions in the language your users actually speak.
Think about how someone types a message to an agent — not how an engineer names a function.
The system learns from the language you give it.
Results
| Metric | Value |
|---|---|
| End-to-end latency (all 5 steps) | < 2 seconds |
| Tool selection accuracy | Significantly higher than pure RAG |
| Model infra | Fully local, private |
| Bottleneck | 1000+ concurrent users (next problem to solve) |
The Real Lesson
We don't need smarter models.
We don't need infinite context windows.
We need better system design and the discipline to think like a user.
A layered architecture with good tool descriptions, running on lightweight local models, will outperform a bloated LLM context every time.
That's the real work.
Built this in production. Happy to discuss the concurrent scaling problem — that's a different beast entirely.
Found this useful? Follow for more on agent architecture and production ML systems.