<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Son Seong Jun</title>
    <description>The latest articles on DEV Community by Son Seong Jun (@sonaiengine).</description>
    <link>https://dev.to/sonaiengine</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811553%2Fd1c21fdb-b932-4ebc-8c08-d9ed37bc4bf1.jpeg</url>
      <title>DEV Community: Son Seong Jun</title>
      <link>https://dev.to/sonaiengine</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sonaiengine"/>
    <language>en</language>
    <item>
      <title>I Built a Graph-Based Tool Search Engine for LLM Agents — Here's What I Learned After 1068 Tools</title>
      <dc:creator>Son Seong Jun</dc:creator>
      <pubDate>Sun, 22 Mar 2026 08:02:38 +0000</pubDate>
      <link>https://dev.to/sonaiengine/i-built-a-graph-based-tool-search-engine-for-llm-agents-heres-what-i-learned-after-1068-tools-4fj4</link>
      <guid>https://dev.to/sonaiengine/i-built-a-graph-based-tool-search-engine-for-llm-agents-heres-what-i-learned-after-1068-tools-4fj4</guid>
      <description>&lt;p&gt;LLM agents need tools. But when you have 248 Kubernetes API endpoints or 1068 GitHub API operations, you can't stuff them all into the context window. The standard fix is vector search — embed tool descriptions, find the closest match. It works for finding &lt;em&gt;one tool&lt;/em&gt;. But real tasks aren't one tool.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/SonAIengine/graph-tool-call" rel="noopener noreferrer"&gt;graph-tool-call&lt;/a&gt;, a Python library that models tool relationships as a graph and retrieves execution chains, not just individual matches. After reaching v0.15, I ran a fair competitive benchmark comparing 6 retrieval strategies on up to 1068 API endpoints. The results were humbling — and led to a complete architecture rethink.&lt;/p&gt;

&lt;p&gt;This post covers what I found, what I broke, and what I built differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Vector Search Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Consider this user request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Cancel my order and process a refund"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vector search finds &lt;code&gt;cancelOrder&lt;/code&gt; — the closest semantic match. But the actual workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listOrders → getOrder → cancelOrder → requestRefund
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need &lt;code&gt;getOrder&lt;/code&gt; first because &lt;code&gt;cancelOrder&lt;/code&gt; requires an &lt;code&gt;order_id&lt;/code&gt;. You need &lt;code&gt;requestRefund&lt;/code&gt; after because that's the business process. Vector search returns one tool; you need a chain of four.&lt;/p&gt;

&lt;p&gt;This isn't a retrieval quality problem. It's a &lt;strong&gt;structural knowledge&lt;/strong&gt; problem. No amount of embedding improvement will teach a vector database that &lt;code&gt;getOrder&lt;/code&gt; must precede &lt;code&gt;cancelOrder&lt;/code&gt;.&lt;/p&gt;
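&lt;p&gt;The structural knowledge in question is just dependency edges. A minimal sketch — with hypothetical &lt;code&gt;REQUIRES&lt;/code&gt; edges, not graph-tool-call's actual data model — shows how a chain falls out of a graph walk in a way no similarity score can:&lt;/p&gt;

```python
# Hypothetical REQUIRES edges: key requires value before it can run.
# Illustrative only — not graph-tool-call's real schema.
requires = {
    "cancelOrder": "getOrder",       # needs an order_id from getOrder
    "getOrder": "listOrders",        # needs a candidate order to inspect
    "requestRefund": "cancelOrder",  # business rule: cancel first
}

def execution_chain(tool):
    """Walk REQUIRES edges backward, then reverse into execution order."""
    chain = [tool]
    while chain[-1] in requires:
        chain.append(requires[chain[-1]])
    chain.reverse()
    return chain

print(execution_chain("requestRefund"))
# → ['listOrders', 'getOrder', 'cancelOrder', 'requestRefund']
```

The chain is recovered from the edges alone; the query never has to mention &lt;code&gt;getOrder&lt;/code&gt; at all.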




&lt;h2&gt;
  
  
  The Competitive Benchmark: 6 Strategies, 9 Datasets
&lt;/h2&gt;

&lt;p&gt;I wanted to know: does graph-tool-call actually beat vector search? Not on my own cherry-picked examples, but on a fair, reproducible benchmark.&lt;/p&gt;

&lt;p&gt;First, I analyzed &lt;a href="https://github.com/langchain-ai/langgraph/tree/main/libs/langgraph-bigtool" rel="noopener noreferrer"&gt;bigtool&lt;/a&gt; (LangGraph's tool retrieval library). Its core is surprisingly simple — it calls &lt;code&gt;store.search()&lt;/code&gt; on LangGraph's vector store. bigtool isn't a retrieval algorithm; it's a wrapper around cosine similarity search.&lt;/p&gt;

&lt;p&gt;So I set up a fair comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cosine similarity with qwen3-embedding (≈ what bigtool does)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25 Only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keyword matching (TF-IDF style)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Graph Only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graph traversal from category nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25 + Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;graph-tool-call default (no embedding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector + BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hybrid without graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 + Graph + Embedding + Annotation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I ran all 6 across 9 datasets ranging from 19 tools (Petstore) to 1068 tools (GitHub full API), using the same queries and the same evaluation metrics.&lt;/p&gt;
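&lt;p&gt;For readers who want to reproduce the numbers, here is roughly how I'd compute the three metrics. This is a sketch; the benchmark's exact definitions (e.g. how far down the candidate list Miss% looks) may differ:&lt;/p&gt;

```python
def evaluate(results, expected, k=5):
    """Sketch of Recall@k, MRR, and Miss%.
    results: one ranked tool-name list per query.
    expected: the gold tool name per query."""
    n = len(expected)
    # Recall@k: gold tool appears in the top k
    hits = sum(1 for ranked, gold in zip(results, expected) if gold in ranked[:k])
    # MRR: reciprocal rank of the gold tool, 0 if never retrieved
    rr = sum(1.0 / (ranked.index(gold) + 1)
             for ranked, gold in zip(results, expected) if gold in ranked)
    # Miss%: gold tool absent from the returned candidates entirely
    missed = sum(1 for ranked, gold in zip(results, expected) if gold not in ranked)
    return hits / n, rr / n, 100.0 * missed / n
```

Defining Miss% over the full candidate list (not just the top 5) is why it can sit below 100 − Recall@5 in the tables.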

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Recall@5&lt;/th&gt;
&lt;th&gt;MRR&lt;/th&gt;
&lt;th&gt;Miss%&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.897&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.1%&lt;/td&gt;
&lt;td&gt;176ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BM25 Only&lt;/td&gt;
&lt;td&gt;91.6%&lt;/td&gt;
&lt;td&gt;0.819&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;1.5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BM25 + Graph&lt;/td&gt;
&lt;td&gt;91.6%&lt;/td&gt;
&lt;td&gt;0.819&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;14ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector + BM25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.897&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.1%&lt;/td&gt;
&lt;td&gt;171ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.897&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.1%&lt;/td&gt;
&lt;td&gt;172ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable truth: embedding dominates.&lt;/strong&gt; Vector Only already hits 96.8% Recall@5. Adding Graph, BM25, or annotations on top of it provides zero additional improvement.&lt;/p&gt;

&lt;p&gt;Without embedding, BM25 alone achieves 91.6% — decent, but clearly below vector search. And here's the worst part: &lt;strong&gt;BM25 + Graph performed the same as BM25 alone.&lt;/strong&gt; The graph wasn't helping at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Bugs That Made Graph Harmful
&lt;/h2&gt;

&lt;p&gt;During the benchmark, I discovered that on some datasets, BM25 + Graph actually scored &lt;em&gt;worse&lt;/em&gt; than BM25 alone. The graph was actively degrading results. I spent days debugging this and found three root causes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 1: &lt;code&gt;set_weights()&lt;/code&gt; Was Silently Ignored
&lt;/h3&gt;

&lt;p&gt;The retrieval engine has adaptive weight selection based on corpus size. When I called &lt;code&gt;set_weights(keyword=1.0, graph=0.0)&lt;/code&gt; to test BM25-only, the adaptive function &lt;strong&gt;overwrote my settings&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_adaptive_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# This always ran, ignoring any manual set_weights() call
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hardcoded!
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hardcoded!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every benchmark strategy produced &lt;strong&gt;identical results&lt;/strong&gt; because the weights never changed. I ran the benchmark 5 times before catching this.&lt;/p&gt;

&lt;p&gt;The fix was adding a &lt;code&gt;_weights_manual&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... set values ...
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_weights_manual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# disable adaptive
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_adaptive_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_weights_manual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_keyword_weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_graph_weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... adaptive logic ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bug 2: Graph Just Echoed BM25
&lt;/h3&gt;

&lt;p&gt;The graph channel worked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query → BM25 finds top 10 → Graph expands their neighbors → Done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since graph expansion started from BM25 results, it could only find tools that were &lt;em&gt;already near&lt;/em&gt; what BM25 found. It provided zero independent signal — just amplified BM25's noise.&lt;/p&gt;
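&lt;p&gt;A toy example makes the failure concrete. With hypothetical neighbor data, expansion seeded only by BM25 hits can never escape BM25's neighborhood:&lt;/p&gt;

```python
# Hypothetical adjacency: two clusters. BM25's seeds sit entirely in
# cluster A, so neighbor expansion can never reach cluster B.
neighbors = {
    "getOrder": ["cancelOrder"],
    "cancelOrder": ["getOrder"],
    "requestRefund": ["listPayments"],   # cluster B: relevant, but unreachable
    "listPayments": ["requestRefund"],
}

def expand(seeds, hops=1):
    """Breadth-first expansion starting from the given seed tools."""
    found = set(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        frontier = [n for t in frontier for n in neighbors.get(t, [])]
        found.update(frontier)
    return found

bm25_seeds = ["getOrder"]
# No matter how many hops, cluster B stays invisible
assert "requestRefund" not in expand(bm25_seeds, hops=3)
```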

&lt;p&gt;I needed the graph to find things BM25 &lt;em&gt;couldn't&lt;/em&gt;. That meant starting from a different place entirely — more on this below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 3: Annotations Overwhelmed Precision at Scale
&lt;/h3&gt;

&lt;p&gt;With 248+ Kubernetes tools, many share the same HTTP method. When the intent classifier detected "create" in a query, annotation scoring boosted &lt;em&gt;every&lt;/em&gt; POST endpoint. The correct tool (&lt;code&gt;createCoreV1NamespacedService&lt;/code&gt;) got pushed below a wrong one (&lt;code&gt;createCoreV1Namespace&lt;/code&gt;) because both are POST requests.&lt;/p&gt;

&lt;p&gt;This is a precision vs recall tradeoff. Annotations help recall (finding more candidates), but at scale they destroy precision (ranking the right one first).&lt;/p&gt;
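&lt;p&gt;The effect is easy to simulate. Assuming a flat per-method boost — a simplification of the real annotation scorer — every POST endpoint ties at the top of the annotation channel:&lt;/p&gt;

```python
# Hypothetical annotation channel: every POST endpoint receives the same
# "create"-intent score, so the channel cannot discriminate among them.
methods = ["POST"] * 120 + ["GET"] * 128   # a ~248-tool, Kubernetes-sized corpus
annotation_scores = [1.0 if m == "POST" else 0.0 for m in methods]

top = max(annotation_scores)
tied_at_top = annotation_scores.count(top)
# 120 tools tie for first place: good for recall, useless for precision
assert tied_at_top == 120
```

With 120-way ties, whichever POST endpoint wins the fused ranking is decided by noise from the other channels.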




&lt;h2&gt;
  
  
  The Architecture Fix: Graph as Candidate Injection
&lt;/h2&gt;

&lt;p&gt;The root problem was putting Graph into the wRRF (weighted Reciprocal Rank Fusion) scoring as a 4th channel alongside BM25, Embedding, and Annotation. Since Graph's standalone accuracy was only 69.8% (vs BM25's 91.6%), it dragged down the fused score.&lt;/p&gt;

&lt;p&gt;I completely removed Graph from wRRF. Instead, Graph now acts as an independent &lt;strong&gt;candidate injection&lt;/strong&gt; channel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (broken):
  BM25 + Graph + Embedding + Annotation → wRRF fusion → results
                  ↑ noise injected here

After (fixed):
  BM25 + Embedding + Annotation → wRRF fusion → primary results
                                                      ↓
  Graph (independent) → inject candidates BM25 missed → final results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical rule: &lt;strong&gt;Graph candidates are always scored below the lowest BM25 result.&lt;/strong&gt; Graph can only &lt;em&gt;add&lt;/em&gt; tools that BM25 missed, never displace a BM25 result. This guarantees &lt;code&gt;BM25 + Graph ≥ BM25 alone&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inject_graph_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;graph_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Only tools NOT already in primary results
&lt;/span&gt;    &lt;span class="n"&gt;new_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;final_scores&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# Score below lowest primary result
&lt;/span&gt;    &lt;span class="n"&gt;min_primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;injection_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min_primary&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g_score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_inject&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;final_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;injection_base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;norm_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
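&lt;p&gt;A toy check of the rule — with made-up scores, not the library's real output — shows why the guarantee holds: injected candidates can only extend the tail of the ranking:&lt;/p&gt;

```python
# Made-up scores for illustration
primary = {"cancelOrder": 0.9, "getOrder": 0.7}       # wRRF-fused results
graph_only = {"requestRefund": 0.5, "cancelOrder": 0.95}

final = dict(primary)
injection_base = min(final.values()) * 0.8            # 0.56, below every primary
for name, score in graph_only.items():
    if name not in final:                             # duplicates are ignored
        final[name] = injection_base * score

ranking = sorted(final, key=final.get, reverse=True)
# Primary order untouched; the graph candidate is appended at the tail
assert ranking == ["cancelOrder", "getOrder", "requestRefund"]
```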






&lt;h2&gt;
  
  
  Making Graph Search Generic
&lt;/h2&gt;

&lt;p&gt;The original graph search had 49 GitHub-specific aliases hardcoded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Old code — only works with GitHub API
&lt;/span&gt;&lt;span class="n"&gt;_RESOURCE_ALIASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pull request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pulls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pulls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 45 more
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I replaced this with &lt;strong&gt;dynamic reverse-indexing&lt;/strong&gt;. When you ingest any OpenAPI spec, tool names and descriptions are automatically tokenized and mapped to their category nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# New code — works with any API
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;neighbor&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_neighbors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category_node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# "requestRefund" → tokens: ["request", "refund"]
&lt;/span&gt;    &lt;span class="c1"&gt;# → "refund" maps to "orders" category
&lt;/span&gt;    &lt;span class="n"&gt;name_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([a-z])([A-Z])&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\1 \2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;neighbor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;name_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;category_node&lt;/span&gt;

    &lt;span class="c1"&gt;# Also index description keywords
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;description_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;category_node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now "refund" → orders, "checkout" → cart, "stargazer" → activity — all automatically from any OpenAPI spec.&lt;/p&gt;




&lt;h2&gt;
  
  
  1068 Tool Stress Test
&lt;/h2&gt;

&lt;p&gt;I fetched the entire GitHub REST API OpenAPI spec — 1068 endpoints. This is where things get interesting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Recall@5&lt;/th&gt;
&lt;th&gt;MRR&lt;/th&gt;
&lt;th&gt;Miss%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector Only&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;0.761&lt;/td&gt;
&lt;td&gt;12.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BM25 + Graph&lt;/td&gt;
&lt;td&gt;78.0%&lt;/td&gt;
&lt;td&gt;0.643&lt;/td&gt;
&lt;td&gt;22.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full Pipeline&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;0.761&lt;/td&gt;
&lt;td&gt;12.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this scale, everything degrades. BM25 + Graph's 22% miss rate is especially high. I analyzed every miss case:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Why it missed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Close an existing issue"&lt;/td&gt;
&lt;td&gt;issues/update&lt;/td&gt;
&lt;td&gt;"close" ≠ "update"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Add org member"&lt;/td&gt;
&lt;td&gt;orgs/set-membership-for-user&lt;/td&gt;
&lt;td&gt;Completely different naming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Register self-hosted runner"&lt;/td&gt;
&lt;td&gt;actions/create-registration-token&lt;/td&gt;
&lt;td&gt;Indirect mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Trigger workflow dispatch"&lt;/td&gt;
&lt;td&gt;actions/create-workflow-dispatch-event&lt;/td&gt;
&lt;td&gt;"trigger" ≠ "create"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;every miss is a semantic gap&lt;/strong&gt; where the query uses different words than the tool name. "Close" means "update status to closed". "Register" means "create registration token". No keyword matching can bridge these gaps — this is fundamentally what embeddings solve.&lt;/p&gt;

&lt;p&gt;This was a humbling realization. I stopped trying to make BM25+Graph beat vector search on ranking accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Graph Actually Wins: Workflow Chains
&lt;/h2&gt;

&lt;p&gt;If Graph can't beat embeddings at ranking individual tools, what &lt;em&gt;can&lt;/em&gt; it do that embeddings can't?&lt;/p&gt;

&lt;p&gt;The answer: &lt;strong&gt;return execution chains, not individual tools.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. listOrders — prerequisite for requestRefund
2. requestRefund — primary action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, an LLM agent receiving just &lt;code&gt;requestRefund&lt;/code&gt; would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call &lt;code&gt;requestRefund&lt;/code&gt; → error: "order_id required"&lt;/li&gt;
&lt;li&gt;Figure out it needs an &lt;code&gt;order_id&lt;/code&gt; → call &lt;code&gt;listOrders&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;requestRefund&lt;/code&gt; again → success&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's &lt;strong&gt;3 LLM round trips&lt;/strong&gt;. With the workflow chain, it's &lt;strong&gt;1 round trip&lt;/strong&gt;. Each round trip is an LLM API call — this directly cuts costs.&lt;/p&gt;

&lt;p&gt;The workflow planner uses the graph's &lt;code&gt;REQUIRES&lt;/code&gt; and &lt;code&gt;PRECEDES&lt;/code&gt; edges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Find primary tool&lt;/strong&gt; via resource-first search (query keywords → category → tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand prerequisites&lt;/strong&gt; — follow REQUIRES edges backward, but only same-category GET/LIST methods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topological sort&lt;/strong&gt; — order by dependency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The "same-category GET/LIST only" filter was critical. Without it, a single query would pull in 12+ unrelated prerequisites through loose cross-resource REQUIRES edges. With it, chains stay focused: 2-4 steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Workflow Editor
&lt;/h3&gt;

&lt;p&gt;Auto-generated chains aren't 100% accurate. "Close an issue" can't automatically map to &lt;code&gt;updateIssue&lt;/code&gt; because of the semantic gap. So instead of chasing 100% automation, I built a visual editor.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;plan.open_editor(tools=tg.tools)&lt;/code&gt; opens a browser-based drag-and-drop editor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drag&lt;/strong&gt; to reorder steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click&lt;/strong&gt; tools in the sidebar to add steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X&lt;/strong&gt; to remove steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export JSON&lt;/strong&gt; to save the workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a single HTML file, zero dependencies — consistent with graph-tool-call's philosophy.&lt;/p&gt;

&lt;p&gt;For code-first users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close an issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reorder&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;getIssue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updateIssue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_param_mapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updateIssue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;getIssue.response.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close_issue.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Remote Deployment: SSE Transport
&lt;/h2&gt;

&lt;p&gt;The MCP server previously only supported stdio — local process communication. This meant every developer had to install and run graph-tool-call locally.&lt;/p&gt;

&lt;p&gt;I added SSE (Server-Sent Events) and Streamable-HTTP transport:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy once on a server&lt;/span&gt;
graph-tool-call serve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://api.example.com/openapi.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transport&lt;/span&gt; sse &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Team members just add a URL to their MCP client config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://tool-search.internal:8000/sse"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;stdio is 1:1 local. SSE is 1:N network. One server, entire team.&lt;/p&gt;

&lt;p&gt;The proxy mode also supports SSE, so you can aggregate multiple MCP servers behind a single remote endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;graph-tool-call proxy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; backends.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transport&lt;/span&gt; sse &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't compete with embeddings on ranking
&lt;/h3&gt;

&lt;p&gt;If you have a good embedding model, vector search will beat keyword + graph at ranking individual tools. Period. The benchmark proved this conclusively.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Graph's value is structural, not semantic
&lt;/h3&gt;

&lt;p&gt;Embeddings find semantically similar tools. Graphs encode &lt;em&gt;relationships&lt;/em&gt; — what must come before what, what requires what. These are different kinds of knowledge. Don't try to use one where the other is needed.&lt;/p&gt;
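&lt;p&gt;As an illustration of the structural side, here is a minimal chain-expansion sketch. The tool names, edges, and plain-dict representation are toy examples, not the library's internals; a ranker (BM25 or embedding) picks the seed, and the edges supply what comes next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# "A PRECEDES B" means A is typically called before B.
# Toy edges for illustration only.
PRECEDES = {
    "listOrders": ["getOrder"],
    "getOrder": ["cancelOrder"],
    "cancelOrder": ["processRefund"],
}

def expand_chain(seed, edges, max_hops=3):
    """Follow PRECEDES edges forward from a matched tool."""
    chain = [seed]
    current = seed
    for _ in range(max_hops):
        nxt = edges.get(current, [])
        if not nxt:
            break
        current = nxt[0]
        chain.append(current)
    return chain

print(expand_chain("listOrders", PRECEDES))
# ['listOrders', 'getOrder', 'cancelOrder', 'processRefund']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;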

&lt;h3&gt;
  
  
  3. Fair benchmarks are humbling
&lt;/h3&gt;

&lt;p&gt;My original story was "Graph beats baseline by 70%." After running a fair 6-strategy comparison, the real story is "Graph ties BM25 at ranking, but uniquely provides workflow chains." Less dramatic. More honest. More useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. 100% automation &amp;lt; good defaults + easy editing
&lt;/h3&gt;

&lt;p&gt;I spent days trying to make the workflow planner automatically find &lt;code&gt;updateIssue&lt;/code&gt; for "close an issue." Then I spent an afternoon building a visual editor. The editor was more valuable than any accuracy improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Zero-dep is a feature
&lt;/h3&gt;

&lt;p&gt;The entire core library has zero Python dependencies. BM25, graph traversal, wRRF fusion — all in stdlib Python. Add &lt;code&gt;[embedding]&lt;/code&gt; for semantic search, &lt;code&gt;[mcp]&lt;/code&gt; for MCP server mode. Users install only what they need.&lt;/p&gt;
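&lt;p&gt;Concretely (quote the extras so shells like zsh don't try to glob the brackets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install graph-tool-call               &lt;span class="c"&gt;# core, stdlib only&lt;/span&gt;
pip install &lt;span class="s2"&gt;"graph-tool-call[embedding]"&lt;/span&gt;  &lt;span class="c"&gt;# + semantic search&lt;/span&gt;
pip install &lt;span class="s2"&gt;"graph-tool-call[mcp]"&lt;/span&gt;        &lt;span class="c"&gt;# + MCP server mode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;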




&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Supported tool scale&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1068&lt;/strong&gt; (tested)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall@5 (no embedding)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall@5 (with embedding)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (no embedding)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token reduction&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64–91%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies (core)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test suite&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;494 tests&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
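&lt;p&gt;For clarity, Recall@5 is read the usual way: the fraction of benchmark queries whose expected tool appears in the top five results. A toy computation (the queries, tool names, and result lists below are made up; the real benchmark runs over the full 1068-tool set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def recall_at_k(results_by_query, truth, k=5):
    """Fraction of queries whose expected tool is in the top k results."""
    hits = sum(
        1 for query, expected in truth.items()
        if expected in results_by_query[query][:k]
    )
    return hits / len(truth)

# Toy ground truth and retrieval output, for illustration only.
truth = {
    "close an issue": "updateIssue",
    "list open pull requests": "listPulls",
}
results = {
    "close an issue": ["getIssue", "updateIssue", "lockIssue"],
    "list open pull requests": ["getPull", "mergePull"],  # a miss
}
print(recall_at_k(results, truth))  # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;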




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;graph-tool-call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;graph_tool_call&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolGraph&lt;/span&gt;

&lt;span class="n"&gt;tg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ToolGraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://petstore3.swagger.io/api/v3/openapi.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;petstore.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Search
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;place an order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Workflow chain
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;buy a pet and place an order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# WorkflowPlan([addPet → placeOrder])
&lt;/span&gt;
&lt;span class="c1"&gt;# Visual editor
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open_editor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try it without installing — the &lt;a href="https://github.com/SonAIengine/graph-tool-call/tree/main/playground" rel="noopener noreferrer"&gt;interactive playground&lt;/a&gt; runs in your browser with demo data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/SonAIengine/graph-tool-call" rel="noopener noreferrer"&gt;github.com/SonAIengine/graph-tool-call&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/graph-tool-call/" rel="noopener noreferrer"&gt;pypi.org/project/graph-tool-call&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback, issues, and stars are welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I gave an LLM 248 tools and accuracy dropped to 12%. Here's what fixed it.</title>
      <dc:creator>Son Seong Jun</dc:creator>
      <pubDate>Sun, 15 Mar 2026 09:45:25 +0000</pubDate>
      <link>https://dev.to/sonaiengine/i-gave-an-llm-248-tools-and-accuracy-dropped-to-12-heres-what-fixed-it-91h</link>
      <guid>https://dev.to/sonaiengine/i-gave-an-llm-248-tools-and-accuracy-dropped-to-12-heres-what-fixed-it-91h</guid>
      <description>&lt;p&gt;LLM agents break when you give them too many tools. I hit this wall with &lt;strong&gt;248 Kubernetes API endpoints&lt;/strong&gt; — the model's accuracy dropped to &lt;strong&gt;12%&lt;/strong&gt;. Vector search didn't fix it. Graph-based retrieval did.&lt;/p&gt;

&lt;p&gt;Here's the problem, why vector search fails, and how I solved it with &lt;a href="https://github.com/SonAIengine/graph-tool-call" rel="noopener noreferrer"&gt;graph-tool-call&lt;/a&gt; — an open-source, zero-dependency Python library for &lt;strong&gt;tool retrieval&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: context overflow kills accuracy
&lt;/h2&gt;

&lt;p&gt;I was building an LLM agent (qwen3:4b) for a Kubernetes cluster. 248 API endpoints, all exposed as tools. Threw them all into the context and asked the model to "scale my deployment."&lt;/p&gt;

&lt;p&gt;Accuracy? &lt;strong&gt;12%.&lt;/strong&gt; The model choked on 8,192 tokens of tool definitions.&lt;/p&gt;

&lt;p&gt;This isn't a model problem — it's a &lt;strong&gt;retrieval&lt;/strong&gt; problem. The LLM needs a smaller, relevant subset of tools. But how do you pick the right ones?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why vector search isn't enough
&lt;/h2&gt;

&lt;p&gt;Natural first instinct: embed all tool descriptions, find the closest matches via cosine similarity. Simple.&lt;/p&gt;

&lt;p&gt;Except... when a user says "cancel my order and get a refund," vector search returns &lt;code&gt;cancelOrder&lt;/code&gt;. But the actual workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listOrders → getOrder → cancelOrder → processRefund
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vector search finds &lt;strong&gt;one tool&lt;/strong&gt;. You need the &lt;strong&gt;chain&lt;/strong&gt;. Real API workflows involve sequencing, prerequisites, and complementary operations that flat similarity search completely misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: graph-based tool retrieval
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/SonAIengine/graph-tool-call" rel="noopener noreferrer"&gt;&lt;strong&gt;graph-tool-call&lt;/strong&gt;&lt;/a&gt; — it models tool relationships as a &lt;strong&gt;directed graph&lt;/strong&gt;. Tools have edges like &lt;code&gt;PRECEDES&lt;/code&gt;, &lt;code&gt;REQUIRES&lt;/code&gt;, &lt;code&gt;COMPLEMENTARY&lt;/code&gt;. When you search, it doesn't just find one match — it &lt;strong&gt;traverses the graph&lt;/strong&gt; and returns the whole workflow.&lt;/p&gt;

&lt;p&gt;The retrieval fuses four signals via &lt;strong&gt;weighted Reciprocal Rank Fusion (wRRF)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keyword matching against tool names &amp;amp; descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Graph traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expands results along PRECEDES/REQUIRES/COMPLEMENTARY edges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic similarity (optional — Ollama, OpenAI, vLLM, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP annotations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prioritizes read-only vs destructive tools based on query intent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
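&lt;p&gt;The fusion step itself is small: each signal emits a ranked list, and a tool's fused score is a weighted sum of reciprocal ranks. Here is a minimal sketch; the weights and the &lt;code&gt;k&lt;/code&gt; constant are illustrative, not the library's defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def wrrf(ranked_lists, weights, k=60):
    """score(tool) = sum over signals of weight / (k + rank)."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative signal outputs for "cancel my order and get a refund".
bm25_hits = ["cancelOrder", "getOrder", "listRefunds"]
graph_hits = ["getOrder", "cancelOrder", "processRefund"]
print(wrrf([bm25_hits, graph_hits], weights=[1.0, 0.8]))
# ['cancelOrder', 'getOrder', 'listRefunds', 'processRefund']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;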

&lt;h2&gt;
  
  
  Benchmark results
&lt;/h2&gt;

&lt;p&gt;Same 248 K8s tools, same model (qwen3:4b, 4-bit quantized):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Token reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All 248 tools (baseline)&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;8,192&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;graph-tool-call (top-5)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,699&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ embedding + ontology&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,924&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
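&lt;p&gt;The token-reduction column is just the relative shrink of the tool-definition payload sent to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;baseline_tokens = 8192   # all 248 tool definitions in context
retrieved_tokens = 1699  # top-5 after graph-tool-call
reduction = 1 - retrieved_tokens / baseline_tokens
print(f"{reduction:.0%}")  # 79%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;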

&lt;p&gt;On smaller APIs (19–50 tools), baseline accuracy is already high — but graph-tool-call still cuts tokens by &lt;strong&gt;64–91%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what it looks like in action — token savings, e-commerce workflow search, and GitHub API search:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FSonAIengine%2Fgraph-tool-call%2Fmain%2Fassets%2Fdemo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FSonAIengine%2Fgraph-tool-call%2Fmain%2Fassets%2Fdemo.gif" alt="graph-tool-call demo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero dependencies
&lt;/h2&gt;

&lt;p&gt;The core runs on &lt;strong&gt;Python stdlib only&lt;/strong&gt;. No numpy, no torch, no heavy ML frameworks. Install only what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;graph-tool-call                &lt;span class="c"&gt;# core — zero deps&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;graph-tool-call[embedding]     &lt;span class="c"&gt;# + semantic search&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;graph-tool-call[mcp]           &lt;span class="c"&gt;# + MCP server mode&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;graph-tool-call[all]           &lt;span class="c"&gt;# everything&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try it in 30 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx graph-tool-call search &lt;span class="s2"&gt;"user authentication"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://petstore.swagger.io/v2/swagger.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  As an MCP server
&lt;/h3&gt;

&lt;p&gt;Drop this in your &lt;code&gt;.mcp.json&lt;/code&gt; and any MCP client (Claude Code, Cursor, Windsurf) gets smart tool search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"graph-tool-call[mcp]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"serve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
               &lt;/span&gt;&lt;span class="s2"&gt;"--source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.example.com/openapi.json"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;graph_tool_call&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolGraph&lt;/span&gt;

&lt;span class="n"&gt;tg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ToolGraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://petstore3.swagger.io/api/v3/openapi.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;petstore.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve only relevant tools
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancel my order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP Proxy: 172 tools → 3 meta-tools
&lt;/h2&gt;

&lt;p&gt;Running multiple MCP servers? Their tool definitions pile up in every LLM turn. MCP Proxy bundles them behind a single server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;172 tools&lt;/strong&gt; across servers → &lt;strong&gt;3 meta-tools&lt;/strong&gt; (&lt;code&gt;search_tools&lt;/code&gt;, &lt;code&gt;get_tool_schema&lt;/code&gt;, &lt;code&gt;call_backend_tool&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;After search, matched tools are &lt;strong&gt;dynamically injected&lt;/strong&gt; for 1-hop direct calling&lt;/li&gt;
&lt;li&gt;Saves &lt;strong&gt;~1,200 tokens per turn&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add tool-proxy &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  uvx &lt;span class="s2"&gt;"graph-tool-call[mcp]"&lt;/span&gt; proxy &lt;span class="nt"&gt;--config&lt;/span&gt; ~/backends.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What makes this different
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Vector-only&lt;/th&gt;
&lt;th&gt;graph-tool-call&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embedding model required&lt;/td&gt;
&lt;td&gt;Zero (stdlib only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual registration&lt;/td&gt;
&lt;td&gt;Auto-ingest from OpenAPI / MCP / Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flat similarity&lt;/td&gt;
&lt;td&gt;BM25 + graph + embedding + annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single tool matches&lt;/td&gt;
&lt;td&gt;Multi-step chain retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;History&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Demotes used tools, boosts next-step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM dependency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;td&gt;Optional (better with, works without)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/SonAIengine/graph-tool-call" rel="noopener noreferrer"&gt;github.com/SonAIengine/graph-tool-call&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;code&gt;pip install graph-tool-call&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://github.com/SonAIengine/graph-tool-call/blob/main/docs/architecture/overview.md" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt; · &lt;a href="https://github.com/SonAIengine/graph-tool-call#benchmark" rel="noopener noreferrer"&gt;Benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're dealing with large tool sets in production, I'd love to hear what threshold you hit before retrieval became necessary. Drop a comment or open an issue — contributions welcome 🙌&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
      <category>openapi</category>
    </item>
  </channel>
</rss>
