LLM agents break when you give them too many tools. I hit this wall with 248 Kubernetes API endpoints — the model's accuracy dropped to 12%. Vector search didn't fix it. Graph-based retrieval did.
Here's the problem, why vector search fails, and how I solved it with graph-tool-call — an open-source, zero-dependency Python library for tool retrieval.
The problem: context overflow kills accuracy
I was building an LLM agent (qwen3:4b) for a Kubernetes cluster. 248 API endpoints, all exposed as tools. Threw them all into the context and asked the model to "scale my deployment."
Accuracy? 12%. The model choked on 8,192 tokens of tool definitions.
This isn't a model problem — it's a retrieval problem. The LLM needs a smaller, relevant subset of tools. But how do you pick the right ones?
Why vector search isn't enough
Natural first instinct: embed all tool descriptions, find the closest matches via cosine similarity. Simple.
Except... when a user says "cancel my order and get a refund," vector search returns cancelOrder. But the actual workflow is:
listOrders → getOrder → cancelOrder → processRefund
Vector search finds one tool. You need the chain. Real API workflows involve sequencing, prerequisites, and complementary operations that flat similarity search completely misses.
The solution: graph-based tool retrieval
I built graph-tool-call — it models tool relationships as a directed graph. Tools have edges like PRECEDES, REQUIRES, COMPLEMENTARY. When you search, it doesn't just find one match — it traverses the graph and returns the whole workflow.
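To make the traversal concrete, here's a minimal sketch of edge-typed expansion from a seed match. The tool names and edge types mirror the article's example, but the adjacency structure and `expand` function are illustrative, not the library's internal representation:

```python
from collections import deque

# Adjacency: tool -> [(related tool, edge type)]. Prerequisite links point
# backwards (cancelOrder REQUIRES getOrder) so expansion recovers the chain.
EDGES = {
    "cancelOrder": [("getOrder", "REQUIRES"), ("processRefund", "COMPLEMENTARY")],
    "getOrder": [("listOrders", "REQUIRES")],
    "listOrders": [],
    "processRefund": [],
}

def expand(seed: str, max_hops: int = 2) -> list[str]:
    """Breadth-first traversal: collect tools reachable within max_hops."""
    seen, order = {seed}, [seed]
    queue = deque([(seed, 0)])
    while queue:
        tool, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor, _edge_type in EDGES.get(tool, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order

print(expand("cancelOrder"))
# → ['cancelOrder', 'getOrder', 'processRefund', 'listOrders']
```

A vector search would have stopped at the seed; the traversal pulls in the rest of the workflow in two hops.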
The retrieval fuses four signals via weighted Reciprocal Rank Fusion (wRRF):
| Signal | What it does |
|---|---|
| BM25 | Keyword matching against tool names & descriptions |
| Graph traversal | Expands results along PRECEDES/REQUIRES/COMPLEMENTARY edges |
| Embedding | Semantic similarity (optional — Ollama, OpenAI, vLLM, etc.) |
| MCP annotations | Prioritizes read-only vs destructive tools based on query intent |
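The fusion step itself is simple. Here's a sketch of weighted Reciprocal Rank Fusion, where each signal contributes `w_s / (k + rank)` per tool; the signal names, weights, and rankings below are illustrative, not the library's actual defaults:

```python
def wrrf(ranked_lists: dict[str, list[str]],
         weights: dict[str, float],
         k: int = 60) -> list[str]:
    """Fuse ranked lists: score(t) = sum over signals of w_s / (k + rank_s(t))."""
    scores: dict[str, float] = {}
    for signal, ranking in ranked_lists.items():
        w = weights.get(signal, 1.0)
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = wrrf(
    {
        "bm25": ["cancelOrder", "getOrder", "listOrders"],
        "graph": ["listOrders", "getOrder", "cancelOrder", "processRefund"],
        "embedding": ["cancelOrder", "processRefund"],
    },
    weights={"bm25": 1.0, "graph": 1.5, "embedding": 0.8},
)
print(fused[0])  # cancelOrder ranks first: it appears in all three signals
```

A tool ranked by several signals beats a tool ranked highly by only one, which is exactly the behavior you want when the signals disagree.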
Benchmark results
Same 248 K8s tools, same model (qwen3:4b, 4-bit quantized):
| Setup | Accuracy | Tokens | Token reduction |
|---|---|---|---|
| All 248 tools (baseline) | 12% | 8,192 | — |
| graph-tool-call (top-5) | 82% | 1,699 | 79% |
| + embedding + ontology | 82% | 1,924 | 76% |
On smaller APIs (19–50 tools), baseline accuracy is already high — but graph-tool-call still cuts tokens by 64–91%.
[Demo: token savings, e-commerce workflow search, and GitHub API search]
Zero dependencies
The core runs on Python stdlib only. No numpy, no torch, no heavy ML frameworks. Install only what you need:
```shell
pip install graph-tool-call                # core — zero deps
pip install "graph-tool-call[embedding]"   # + semantic search
pip install "graph-tool-call[mcp]"         # + MCP server mode
pip install "graph-tool-call[all]"         # everything
```

(The extras are quoted so shells like zsh don't try to glob the brackets.)
Try it in 30 seconds
```shell
uvx graph-tool-call search "user authentication" \
  --source https://petstore.swagger.io/v2/swagger.json
```
As an MCP server
Drop this in your .mcp.json and any MCP client (Claude Code, Cursor, Windsurf) gets smart tool search:
```json
{
  "mcpServers": {
    "tool-search": {
      "command": "uvx",
      "args": ["graph-tool-call[mcp]", "serve",
               "--source", "https://api.example.com/openapi.json"]
    }
  }
}
```
Python API
```python
from graph_tool_call import ToolGraph

tg = ToolGraph.from_url(
    "https://petstore3.swagger.io/api/v3/openapi.json",
    cache="petstore.json",
)

# Retrieve only relevant tools
tools = tg.retrieve("cancel my order", top_k=5)
for t in tools:
    print(f"{t.name}: {t.description}")
```
MCP Proxy: 172 tools → 3 meta-tools
Running multiple MCP servers? Their tool definitions pile up in every LLM turn. MCP Proxy bundles them behind a single server:
- 172 tools across servers → 3 meta-tools (`search_tools`, `get_tool_schema`, `call_backend_tool`)
- After search, matched tools are dynamically injected for 1-hop direct calling
- Saves ~1,200 tokens per turn
```shell
claude mcp add tool-proxy -- \
  uvx "graph-tool-call[mcp]" proxy --config ~/backends.json
```
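To show the shape of the pattern, here's a hypothetical sketch of the three meta-tools: the client sees only these entry points instead of 172 backend tool definitions. The function names follow the article; the bodies are illustrative stubs, not the proxy's real implementation:

```python
# Toy backend registry standing in for the aggregated MCP servers.
BACKEND_TOOLS = {
    "cancelOrder": {"description": "Cancel an order", "schema": {"type": "object"}},
    "processRefund": {"description": "Refund a cancelled order", "schema": {"type": "object"}},
}

def search_tools(query: str) -> list[str]:
    # Real implementation: wRRF retrieval over the tool graph.
    return [name for name, meta in BACKEND_TOOLS.items()
            if query.lower() in meta["description"].lower()]

def get_tool_schema(name: str) -> dict:
    # Schemas are fetched on demand, so they never sit in the context.
    return BACKEND_TOOLS[name]["schema"]

def call_backend_tool(name: str, arguments: dict) -> str:
    # Real implementation: forward the call to the backing MCP server.
    return f"called {name} with {arguments}"

print(search_tools("refund"))  # → ['processRefund']
```

The token savings come from this indirection: full schemas enter the context only for the handful of tools that a search actually surfaces.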
What makes this different
| | Vector-only | graph-tool-call |
|---|---|---|
| Dependencies | Embedding model required | Zero (stdlib only) |
| Tool source | Manual registration | Auto-ingest from OpenAPI / MCP / Python |
| Search | Flat similarity | BM25 + graph + embedding + annotations |
| Workflows | Single tool matches | Multi-step chain retrieval |
| History | None | Demotes used tools, boosts next-step |
| LLM dependency | Required | Optional (better with, works without) |
Get started
GitHub: github.com/SonAIengine/graph-tool-call
PyPI: pip install graph-tool-call
Docs: Architecture · Benchmarks
If you're dealing with large tool sets in production, I'd love to hear what threshold you hit before retrieval became necessary. Drop a comment or open an issue — contributions welcome 🙌