DEV Community

Son Seong Jun

Posted on • Originally published at github.com

I gave an LLM 248 tools and accuracy dropped to 12%. Here's what fixed it.

LLM agents break when you give them too many tools. I hit this wall with 248 Kubernetes API endpoints — the model's accuracy dropped to 12%. Vector search didn't fix it. Graph-based retrieval did.

Here's the problem, why vector search fails, and how I solved it with graph-tool-call — an open-source, zero-dependency Python library for tool retrieval.


The problem: context overflow kills accuracy

I was building an LLM agent (qwen3:4b) for a Kubernetes cluster, with 248 API endpoints all exposed as tools. I put every tool definition into the context and asked the model to "scale my deployment."

Accuracy? 12%. The model choked on 8,192 tokens of tool definitions.

This isn't a model problem — it's a retrieval problem. The LLM needs a smaller, relevant subset of tools. But how do you pick the right ones?

Why vector search isn't enough

Natural first instinct: embed all tool descriptions, find the closest matches via cosine similarity. Simple.

Except... when a user says "cancel my order and get a refund," vector search returns cancelOrder. But the actual workflow is:

listOrders → getOrder → cancelOrder → processRefund

Vector search finds one tool. You need the chain. Real API workflows involve sequencing, prerequisites, and complementary operations that flat similarity search completely misses.
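
Here's a toy sketch of that failure mode, using flat top-k retrieval by cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine` and `top_k` are illustrative names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical embeddings; the axes loosely mean (cancel, refund, lookup).
TOOL_VECTORS = {
    "listOrders":    [0.0, 0.0, 1.0],
    "getOrder":      [0.1, 0.0, 0.9],
    "cancelOrder":   [1.0, 0.1, 0.0],
    "processRefund": [0.1, 1.0, 0.0],
}

def top_k(query_vec, k):
    """Return the k tool names whose vectors are closest to the query."""
    ranked = sorted(
        TOOL_VECTORS,
        key=lambda name: cosine(query_vec, TOOL_VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]

# Stand-in vector for "cancel my order and get a refund".
query = [0.9, 0.5, 0.0]
print(top_k(query, 1))
print(top_k(query, 2))
```

In this toy setup the lookup tools (listOrders, getOrder) score near zero because the query never mentions them, so they don't surface even though the workflow depends on them.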

The solution: graph-based tool retrieval

I built graph-tool-call, which models tool relationships as a directed graph. Tools are linked by typed edges such as PRECEDES, REQUIRES, and COMPLEMENTARY. A search doesn't stop at the single best match: it traverses the graph from that match and returns the whole workflow.
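
A minimal sketch of the traversal idea, using only the standard library. The edge list is a hypothetical hand-written e-commerce graph and `expand` is an illustrative helper; graph-tool-call's own data model and traversal rules may differ.

```python
from collections import deque

# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("listOrders", "PRECEDES", "getOrder"),
    ("getOrder", "PRECEDES", "cancelOrder"),
    ("cancelOrder", "REQUIRES", "getOrder"),
    ("cancelOrder", "COMPLEMENTARY", "processRefund"),
]

def expand(seeds, max_hops=2):
    """Breadth-first expansion from seed tools, following edges in both directions.

    Returns a dict mapping each reached tool to its hop distance from a seed.
    """
    neighbors = {}
    for src, _relation, dst in EDGES:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    seen = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not walk past the hop limit
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

workflow = expand(["cancelOrder"])
print(workflow)
```

Starting from the single match cancelOrder, two hops are enough to pull in getOrder, processRefund, and listOrders. The hop limit is an assumed knob here: it keeps the expansion from swallowing the whole graph.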

The retrieval fuses four signals via weighted Reciprocal Rank Fusion (wRRF):

| Signal | What it does |
| --- | --- |
| BM25 | Keyword matching against tool names & descriptions |
| Graph traversal | Expands results along PRECEDES/REQUIRES/COMPLEMENTARY edges |
| Embedding | Semantic similarity (optional; Ollama, OpenAI, vLLM, etc.) |
| MCP annotations | Prioritizes read-only vs destructive tools based on query intent |
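
For readers who haven't met Reciprocal Rank Fusion, here's a rough sketch of the weighted variant: each signal contributes `weight / (k + rank)` for every tool it ranks, and the contributions are summed. The weights, the constant `k=60` (a common default for RRF), and the ranked lists are all assumptions for illustration, not the library's actual values.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: score(tool) = sum of weight / (k + rank) over lists containing it."""
    scores = {}
    for signal, ranked in rankings.items():
        weight = weights.get(signal, 1.0)
        for rank, tool in enumerate(ranked, start=1):
            scores[tool] = scores.get(tool, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-signal rankings for "cancel my order and get a refund".
rankings = {
    "bm25": ["cancelOrder", "getOrder", "listOrders"],
    "graph": ["getOrder", "cancelOrder", "processRefund"],
    "embedding": ["cancelOrder", "processRefund"],
}
# Hypothetical weights; a signal missing from this dict defaults to 1.0.
weights = {"bm25": 1.0, "graph": 1.5, "embedding": 0.5}

fused = weighted_rrf(rankings, weights)
print(fused)
```

Because only ranks are fused, the signals don't need comparable score scales, which is handy when mixing BM25 scores with cosine similarities.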

Benchmark results

Same 248 K8s tools, same model (qwen3:4b, 4-bit quantized):

| Setup | Accuracy | Tokens | Token reduction |
| --- | --- | --- | --- |
| All 248 tools (baseline) | 12% | 8,192 | n/a |
| graph-tool-call (top-5) | 82% | 1,699 | 79% |
| + embedding + ontology | 82% | 1,924 | 76% |

On smaller APIs (19–50 tools), baseline accuracy is already high — but graph-tool-call still cuts tokens by 64–91%.
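
As a quick check on the 248-tool numbers, token reduction is one minus the ratio of retrieved-prompt tokens to baseline tokens. The helper name below is mine, and the inputs are the token counts from the benchmark table.

```python
BASELINE_TOKENS = 8192  # all 248 tool definitions in context

def token_reduction(tokens, baseline=BASELINE_TOKENS):
    """Fraction of baseline prompt tokens saved by sending fewer tool definitions."""
    return 1 - tokens / baseline

for label, tokens in [("top-5", 1699), ("+ embedding + ontology", 1924)]:
    print(f"{label}: {token_reduction(tokens):.1%}")
```

This prints 79.3% and 76.5%, which the table reports as 79% and 76%.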

Here's what it looks like in action — token savings, e-commerce workflow search, and GitHub API search:

[Image: graph-tool-call demo]

Zero dependencies

The core runs on Python stdlib only. No numpy, no torch, no heavy ML frameworks. Install only what you need:

pip install graph-tool-call                # core — zero deps
pip install graph-tool-call[embedding]     # + semantic search
pip install graph-tool-call[mcp]           # + MCP server mode
pip install graph-tool-call[all]           # everything

Try it in 30 seconds

uvx graph-tool-call search "user authentication" \
  --source https://petstore.swagger.io/v2/swagger.json

As an MCP server

Drop this into your MCP client's config file (.mcp.json for Claude Code; Cursor and Windsurf keep their MCP server entries in their own config files) and the client gets smart tool search:

{
  "mcpServers": {
    "tool-search": {
      "command": "uvx",
      "args": ["graph-tool-call[mcp]", "serve",
               "--source", "https://api.example.com/openapi.json"]
    }
  }
}

Python API

from graph_tool_call import ToolGraph

tg = ToolGraph.from_url(
    "https://petstore3.swagger.io/api/v3/openapi.json",
    cache="petstore.json",
)

# Retrieve only relevant tools
tools = tg.retrieve("cancel my order", top_k=5)
for t in tools:
    print(f"{t.name}: {t.description}")

MCP Proxy: 172 tools → 3 meta-tools

Running multiple MCP servers? Their tool definitions pile up in every LLM turn. MCP Proxy bundles them behind a single server:

  • 172 tools across servers → 3 meta-tools (search_tools, get_tool_schema, call_backend_tool)
  • After search, matched tools are dynamically injected for 1-hop direct calling
  • Saves ~1,200 tokens per turn

Register the proxy with Claude Code:

claude mcp add tool-proxy -- \
  uvx "graph-tool-call[mcp]" proxy --config ~/backends.json
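
To show the shape of the meta-tool pattern, here's a toy in-memory sketch. `ToolProxy`, the backend registry format, and the keyword-overlap search are all hypothetical stand-ins: the real proxy speaks MCP to its backends and uses the retrieval pipeline described earlier, and the dynamic injection step is left out.

```python
class ToolProxy:
    """Toy stand-in for a proxy that exposes three meta-tools over many backend tools."""

    def __init__(self, backends):
        # backends: {server_name: {tool_name: (description, schema, callable)}}
        self._tools = {}
        for server, tools in backends.items():
            for name, (description, schema, fn) in tools.items():
                self._tools[name] = {
                    "server": server,
                    "description": description,
                    "schema": schema,
                    "fn": fn,
                }

    def search_tools(self, query, top_k=5):
        """Rank tools by how many query words appear in the name or description."""
        words = set(query.lower().split())

        def overlap(name):
            text = f"{name} {self._tools[name]['description']}".lower()
            return sum(1 for word in words if word in text)

        ranked = sorted(self._tools, key=overlap, reverse=True)
        return [name for name in ranked if overlap(name) > 0][:top_k]

    def get_tool_schema(self, name):
        """Return the full schema only for a tool the model has chosen."""
        return self._tools[name]["schema"]

    def call_backend_tool(self, name, arguments):
        """Forward the call to whichever backend owns the tool."""
        return self._tools[name]["fn"](**arguments)

# Hypothetical backends with one tool each.
backends = {
    "orders": {
        "cancelOrder": ("Cancel an order by id", {"order_id": "string"},
                        lambda order_id: f"cancelled {order_id}"),
    },
    "billing": {
        "processRefund": ("Refund a payment for an order", {"order_id": "string"},
                          lambda order_id: f"refunded {order_id}"),
    },
}

proxy = ToolProxy(backends)
matches = proxy.search_tools("cancel order")
print(matches)
print(proxy.get_tool_schema(matches[0]))
print(proxy.call_backend_tool(matches[0], {"order_id": "42"}))
```

The model only ever sees the three method signatures; full schemas are fetched on demand, which is where the per-turn token savings come from.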

What makes this different

| | Vector-only | graph-tool-call |
| --- | --- | --- |
| Dependencies | Embedding model required | Zero (stdlib only) |
| Tool source | Manual registration | Auto-ingest from OpenAPI / MCP / Python |
| Search | Flat similarity | BM25 + graph + embedding + annotations |
| Workflows | Single tool matches | Multi-step chain retrieval |
| History | None | Demotes used tools, boosts next-step |
| LLM dependency | Required | Optional (better with, works without) |
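
The History row can be pictured as a small reranking step on top of the retrieval scores. This is only a sketch under assumed names and factors (`rerank_with_history`, `demote=0.5`, `boost=1.5`); how graph-tool-call actually weights history isn't specified here.

```python
def rerank_with_history(scores, used, next_steps, demote=0.5, boost=1.5):
    """Scale scores down for tools already called and up for successors of the last call."""
    last = used[-1] if used else None
    followers = next_steps.get(last, set())
    adjusted = {}
    for tool, score in scores.items():
        if tool in used:
            score *= demote   # already called this session
        elif tool in followers:
            score *= boost    # likely next step after the last call
        adjusted[tool] = score
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Hypothetical retrieval scores and next-step (PRECEDES-style) edges.
scores = {"getOrder": 0.9, "cancelOrder": 0.8, "processRefund": 0.5}
next_steps = {"getOrder": {"cancelOrder"}, "cancelOrder": {"processRefund"}}

order = rerank_with_history(scores, used=["getOrder"], next_steps=next_steps)
print(order)
```

After getOrder has been called, it drops to the bottom and cancelOrder moves to the top, even though getOrder had the highest raw score.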

Get started

GitHub: github.com/SonAIengine/graph-tool-call
PyPI: pip install graph-tool-call
Docs: Architecture · Benchmarks

If you're dealing with large tool sets in production, I'd love to hear what threshold you hit before retrieval became necessary. Drop a comment or open an issue — contributions welcome 🙌
