LLM agents need tools. But when you have 248 Kubernetes API endpoints or 1068 GitHub API operations, you can't stuff them all into the context window. The standard fix is vector search — embed tool descriptions, find the closest match. It works for finding one tool. But real tasks aren't one tool.
I built graph-tool-call, a Python library that models tool relationships as a graph and retrieves execution chains, not just individual matches. After reaching v0.15, I ran a fair competitive benchmark of 6 retrieval strategies across 1068 API endpoints. The results were humbling — and led to a complete architecture rethink.
This post covers what I found, what I broke, and what I built differently.
## Why Vector Search Isn't Enough
Consider this user request:
> "Cancel my order and process a refund"
Vector search finds `cancelOrder` — the closest semantic match. But the actual workflow is:

`listOrders → getOrder → cancelOrder → requestRefund`

You need `getOrder` first because `cancelOrder` requires an `order_id`. You need `requestRefund` after because that's the business process. Vector search returns one tool; you need a chain of four.
This isn't a retrieval quality problem. It's a structural knowledge problem. No amount of embedding improvement will teach a vector database that getOrder must precede cancelOrder.
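Once ordering is stored as explicit prerequisite edges, the chain falls out of a backward walk. A minimal sketch (the tool names and edge table are illustrative, not graph-tool-call internals):

```python
# Prerequisite edges: each tool lists what must run before it.
REQUIRES = {
    "requestRefund": ["cancelOrder"],
    "cancelOrder": ["getOrder"],    # cancelOrder needs an order_id
    "getOrder": ["listOrders"],     # getOrder needs an order to look up
}

def chain_for(tool: str) -> list[str]:
    """Resolve prerequisites depth-first, emitting tools in execution order."""
    ordered: list[str] = []
    def visit(t: str) -> None:
        for dep in REQUIRES.get(t, []):
            visit(dep)
        if t not in ordered:
            ordered.append(t)
    visit(tool)
    return ordered

print(chain_for("requestRefund"))
# → ['listOrders', 'getOrder', 'cancelOrder', 'requestRefund']
```

No embedding model can produce this ordering; it has to be encoded as structure.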
## The Competitive Benchmark: 6 Strategies, 9 Datasets
I wanted to know: does graph-tool-call actually beat vector search? Not on my own cherry-picked examples, but on a fair, reproducible benchmark.
First, I analyzed bigtool (LangGraph's tool retrieval library). Its core is surprisingly simple — it calls `store.search()` on LangGraph's vector store. bigtool isn't a retrieval algorithm; it's a wrapper around cosine similarity search.
So I set up a fair comparison:
| Strategy | What it does |
|---|---|
| Vector Only | Cosine similarity with qwen3-embedding (≈ what bigtool does) |
| BM25 Only | Keyword matching (TF-IDF style) |
| Graph Only | Graph traversal from category nodes |
| BM25 + Graph | graph-tool-call default (no embedding) |
| Vector + BM25 | Hybrid without graph |
| Full Pipeline | BM25 + Graph + Embedding + Annotation |
I ran all 6 across 9 datasets ranging from 19 tools (Petstore) to 1068 tools (GitHub full API), using the same queries and the same evaluation metrics.
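Recall@K and MRR here are the standard retrieval metrics; a minimal sketch of how a single query is scored (the example rankings are made up):

```python
def recall_at_k(ranked: list[str], expected: str, k: int = 5) -> float:
    """1.0 if the expected tool appears in the top-k results, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0

def mrr(ranked: list[str], expected: str) -> float:
    """Reciprocal of the expected tool's 1-based rank; 0.0 if absent."""
    try:
        return 1.0 / (ranked.index(expected) + 1)
    except ValueError:
        return 0.0

cases = [(["getOrder", "cancelOrder", "listOrders"], "cancelOrder"),
         (["listPets", "addPet", "getPet"], "getPet")]
print(sum(recall_at_k(r, e) for r, e in cases) / len(cases))  # → 1.0
print(sum(mrr(r, e) for r, e in cases) / len(cases))          # ≈ 0.4167
```

Dataset-level numbers are the mean of these per-query scores; Miss% is the fraction of queries where the expected tool never appears at all.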
### Results
| Strategy | Recall@5 | MRR | Miss% | Latency |
|---|---|---|---|---|
| Vector Only | 96.8% | 0.897 | 2.1% | 176ms |
| BM25 Only | 91.6% | 0.819 | 7.5% | 1.5ms |
| BM25 + Graph | 91.6% | 0.819 | 7.5% | 14ms |
| Vector + BM25 | 96.8% | 0.897 | 2.1% | 171ms |
| Full Pipeline | 96.8% | 0.897 | 2.1% | 172ms |
The uncomfortable truth: embedding dominates. Vector Only already hits 96.8% Recall@5. Adding Graph, BM25, or annotations on top of it provides zero additional improvement. (Graph Only, the sixth strategy, trailed far behind every other channel; more on that in the architecture fix below.)
Without embedding, BM25 alone achieves 91.6% — decent, but clearly below vector search. And here's the worst part: BM25 + Graph performed the same as BM25 alone. The graph wasn't helping at all.
## Three Bugs That Made Graph Harmful
During the benchmark, I discovered that on some datasets, BM25 + Graph actually scored worse than BM25 alone. The graph was actively degrading results. I spent days debugging this and found three root causes.
### Bug 1: `set_weights()` Was Silently Ignored
The retrieval engine has adaptive weight selection based on corpus size. When I called `set_weights(keyword=1.0, graph=0.0)` to test BM25-only, the adaptive function overwrote my settings:

```python
def _get_adaptive_weights(self):
    # This always ran, ignoring any manual set_weights() call
    n = len(self._tools)
    if n <= 30:
        return (0.55, 0.30, 0.0, 0.15)  # hardcoded!
    elif n <= 100:
        return (0.50, 0.30, 0.0, 0.20)  # hardcoded!
    # ... more size tiers
```
Every benchmark strategy produced identical results because the weights never changed. I ran the benchmark 5 times before catching this.
The fix was adding a `_weights_manual` flag:

```python
def set_weights(self, *, keyword=None, graph=None, ...):
    # ... set values ...
    self._weights_manual = True  # disable adaptive

def _get_adaptive_weights(self):
    if self._weights_manual:
        return (self._keyword_weight, self._graph_weight, ...)
    # ... adaptive logic ...
```
### Bug 2: Graph Just Echoed BM25
The graph channel worked like this:
Query → BM25 finds top 10 → Graph expands their neighbors → Done
Since graph expansion started from BM25 results, it could only find tools that were already near what BM25 found. It provided zero independent signal — just amplified BM25's noise.
I needed the graph to find things BM25 couldn't. That meant starting from a different place entirely — more on this below.
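The circularity is easy to see on a toy graph (the edges and tool names are made up): one-hop expansion seeded by BM25's own results can only surface tools adjacent to what BM25 already found.

```python
# Toy adjacency table for a small tool graph.
NEIGHBORS = {
    "cancelOrder": {"getOrder"},
    "getOrder": {"cancelOrder", "listOrders", "requestRefund"},
    "listOrders": {"getOrder"},
    "requestRefund": {"getOrder"},
}

def expand(seeds: set[str]) -> set[str]:
    """Return the seeds plus their one-hop graph neighbors."""
    out = set(seeds)
    for s in seeds:
        out |= NEIGHBORS.get(s, set())
    return out

bm25_top = {"cancelOrder"}       # what BM25 found for the query
print(expand(bm25_top))          # {'cancelOrder', 'getOrder'} (order may vary)
# requestRefund is never reached: the channel merely echoes BM25's neighborhood.
```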
### Bug 3: Annotations Overwhelmed Precision at Scale
With 248+ Kubernetes tools, many share the same HTTP method. When the intent classifier detected "create" in a query, annotation scoring boosted every POST endpoint. The correct tool (`createCoreV1NamespacedService`) got pushed below a wrong one (`createCoreV1Namespace`) because both are POST requests.
This is a precision vs recall tradeoff. Annotations help recall (finding more candidates), but at scale they destroy precision (ranking the right one first).
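A toy sketch of why a method-level boost cannot help precision (scores, names, and the 0.15 bump are illustrative; the actual displacement happened during fusion with other signals):

```python
base = {
    "createCoreV1NamespacedService": 0.61,  # the correct tool
    "createCoreV1Namespace": 0.60,
    "createCoreV1Pod": 0.58,
}
method = {name: "POST" for name in base}

def boost(scores: dict[str, float], intent: str) -> dict[str, float]:
    """Add a flat bump to every endpoint whose HTTP method matches the intent."""
    bump = 0.15 if intent == "create" else 0.0
    return {n: s + (bump if method[n] == "POST" else 0.0)
            for n, s in scores.items()}

boosted = boost(base, "create")
# The bump is identical for all three POSTs: it cannot separate the right
# create-endpoint from the wrong ones, it only inflates the whole class.
```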
## The Architecture Fix: Graph as Candidate Injection
The root problem was putting Graph into the wRRF (weighted Reciprocal Rank Fusion) scoring as a 4th channel alongside BM25, Embedding, and Annotation. Since Graph's standalone accuracy was only 69.8% (vs BM25's 91.6%), it dragged down the fused score.
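Weighted RRF scores each tool as a weighted sum of 1/(k + rank) over the channels that returned it. A minimal sketch (k=60 is the conventional RRF constant; the example rankings and weights are made up, not benchmark data):

```python
def wrrf(channels: dict[str, list[str]],
         weights: dict[str, float], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(t) = sum over channels of w_c / (k + rank_c(t))."""
    scores: dict[str, float] = {}
    for channel, ranking in channels.items():
        w = weights.get(channel, 0.0)
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = wrrf(
    {"bm25":      ["cancelOrder", "getOrder"],
     "embedding": ["cancelOrder", "requestRefund"],
     "graph":     ["listOrders", "requestRefund"]},  # the low-accuracy channel
    {"bm25": 0.5, "embedding": 0.5, "graph": 0.3},
)
# The weak graph channel lifts requestRefund above getOrder in the fused
# ranking: the noise-injection effect described above.
```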
I completely removed Graph from wRRF. Instead, Graph now acts as an independent candidate injection channel:
Before (broken):

```
BM25 + Graph + Embedding + Annotation → wRRF fusion → results
       ↑ noise injected here
```

After (fixed):

```
BM25 + Embedding + Annotation → wRRF fusion → primary results
                                                   ↓
Graph (independent) → inject candidates BM25 missed → final results
```
The critical rule: Graph candidates are always scored below the lowest BM25 result. Graph can only add tools that BM25 missed, never displace a BM25 result. This guarantees BM25 + Graph ≥ BM25 alone.
```python
def _inject_graph_candidates(self, final_scores, graph_scores, max_inject, ...):
    # Only tools NOT already in primary results
    new_candidates = {
        name: score for name, score in graph_scores.items()
        if name not in final_scores
    }
    # Score below lowest primary result
    min_primary = min(final_scores.values())
    injection_base = min_primary * 0.8
    # Rank candidates by graph score, normalized against the best one
    ranked = sorted(new_candidates.items(), key=lambda kv: kv[1], reverse=True)
    for name, g_score in ranked[:max_inject]:
        norm_score = g_score / ranked[0][1]
        final_scores[name] = injection_base * norm_score
```
## Making Graph Search Generic
The original graph search had 49 GitHub-specific aliases hardcoded:
```python
# Old code — only works with GitHub API
_RESOURCE_ALIASES = {
    "pull request": "pulls",
    "pr": "pulls",
    "issue": "issues",
    "runner": "actions",
    # ... 45 more
}
```
I replaced this with dynamic reverse-indexing. When you ingest any OpenAPI spec, tool names and descriptions are automatically tokenized and mapped to their category nodes:
```python
# New code — works with any API
for neighbor in graph.get_neighbors(category_node):
    # "requestRefund" → tokens: ["request", "refund"]
    # → "refund" maps to "orders" category
    name_parts = re.sub(r"([a-z])([A-Z])", r"\1 \2", neighbor)
    for token in name_parts.lower().split():
        index[stem(token)] = category_node
    # Also index description keywords
    for token in description_tokens:
        index[stem(token)] = category_node
```
Now "refund" → orders, "checkout" → cart, "stargazer" → activity — all automatically from any OpenAPI spec.
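The idea also works as a self-contained sketch (the crude suffix-stripping `stem()` and the toy category data are mine, not the library's code):

```python
import re

def stem(token: str) -> str:
    """Very crude stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(categories: dict[str, list[str]]) -> dict[str, str]:
    """Map each stemmed name token to the category its tool belongs to."""
    index: dict[str, str] = {}
    for category, tools in categories.items():
        for tool in tools:
            spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", tool)  # camelCase split
            for token in spaced.lower().split():
                index.setdefault(stem(token), category)
    return index

idx = build_index({"orders": ["requestRefund", "listOrders"],
                   "cart": ["checkoutCart"]})
print(idx["refund"], idx["checkout"])  # → orders cart
```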
## 1068 Tool Stress Test
I fetched the entire GitHub REST API OpenAPI spec — 1068 endpoints. This is where things get interesting.
| Strategy | Recall@5 | MRR | Miss% |
|---|---|---|---|
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |
At this scale, everything degrades. The 22% miss rate for BM25 + Graph is high. I analyzed every miss case:
| Query | Expected | Why it missed |
|---|---|---|
| "Close an existing issue" | issues/update | "close" ≠ "update" |
| "Add org member" | orgs/set-membership-for-user | Completely different naming |
| "Register self-hosted runner" | actions/create-registration-token | Indirect mapping |
| "Trigger workflow dispatch" | actions/create-workflow-dispatch-event | "trigger" ≠ "create" |
The pattern: every miss is a semantic gap where the query uses different words than the tool name. "Close" means "update status to closed". "Register" means "create registration token". No keyword matching can bridge these gaps — this is fundamentally what embeddings solve.
This was a humbling realization. I stopped trying to make BM25+Graph beat vector search on ranking accuracy.
## Where Graph Actually Wins: Workflow Chains
If Graph can't beat embeddings at ranking individual tools, what can it do that embeddings can't?
The answer: return execution chains, not individual tools.
```python
plan = tg.plan_workflow("process a refund")
for step in plan.steps:
    print(f"{step.order}. {step.tool.name} — {step.reason}")
```

Output:

```
1. listOrders — prerequisite for requestRefund
2. requestRefund — primary action
```
Without this, an LLM agent receiving just `requestRefund` would:

1. Call `requestRefund` → error: "order_id required"
2. Figure out it needs `getOrder` → call it
3. Call `requestRefund` again → success
That's 3 LLM round trips. With the workflow chain, it's 1 round trip. Each round trip is an LLM API call — this directly cuts costs.
The workflow planner uses the graph's REQUIRES and PRECEDES edges:

1. Find primary tool via resource-first search (query keywords → category → tools)
2. Expand prerequisites — follow REQUIRES edges backward, but only same-category GET/LIST methods
3. Topological sort — order by dependency
The "same-category GET/LIST only" filter was critical. Without it, a single query would pull in 12+ unrelated prerequisites through loose cross-resource REQUIRES edges. With it, chains stay focused: 2-4 steps.
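The three steps, including the filter, can be sketched on a toy tool table (names, the metadata layout, and the REQUIRES edges are illustrative, not graph-tool-call's internal structures; the topological ordering falls out of depth-first prerequisite expansion):

```python
TOOLS = {
    "requestRefund": {"category": "orders", "method": "POST"},
    "getOrder":      {"category": "orders", "method": "GET"},
    "listOrders":    {"category": "orders", "method": "GET"},
    "getUser":       {"category": "users",  "method": "GET"},
}
REQUIRES = {
    "requestRefund": ["getOrder", "getUser"],  # getUser: cross-category noise
    "getOrder": ["listOrders"],
}

def plan(primary: str) -> list[str]:
    """Expand prerequisites depth-first; dependencies come out before users."""
    cat = TOOLS[primary]["category"]
    steps: list[str] = []
    def expand(tool: str) -> None:
        for dep in REQUIRES.get(tool, []):
            # the critical filter: same-category GET prerequisites only
            if TOOLS[dep]["category"] == cat and TOOLS[dep]["method"] == "GET":
                expand(dep)
                if dep not in steps:
                    steps.append(dep)
    expand(primary)
    steps.append(primary)  # primary action last
    return steps

print(plan("requestRefund"))  # → ['listOrders', 'getOrder', 'requestRefund']
```

Without the category/method check, `getUser` would be pulled in too, and on a real spec that loose expansion is what produced 12-step chains.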
## Visual Workflow Editor
Auto-generated chains aren't 100% accurate. "Close an issue" can't automatically map to `updateIssue` because of the semantic gap. So instead of chasing 100% automation, I built a visual editor.

`plan.open_editor(tools=tg.tools)` opens a browser-based drag-and-drop editor:
- Drag to reorder steps
- Click tools in the sidebar to add steps
- X to remove steps
- Export JSON to save the workflow
It's a single HTML file, zero dependencies — consistent with graph-tool-call's philosophy.
For code-first users:
```python
plan = tg.plan_workflow("close an issue")
plan.reorder(["getIssue", "updateIssue"])
plan.set_param_mapping(
    "updateIssue", "issue_id", "getIssue.response.id"
)
plan.save("close_issue.json")
```
## Remote Deployment: SSE Transport
The MCP server previously only supported stdio — local process communication. This meant every developer had to install and run graph-tool-call locally.
I added SSE (Server-Sent Events) and Streamable-HTTP transport:
```shell
# Deploy once on a server
graph-tool-call serve \
  --source https://api.example.com/openapi.json \
  --transport sse \
  --host 0.0.0.0 --port 8000
```
Team members just add a URL to their MCP client config:
```json
{
  "mcpServers": {
    "tool-search": {
      "url": "http://tool-search.internal:8000/sse"
    }
  }
}
```
stdio is 1:1 local. SSE is 1:N network. One server, entire team.
The proxy mode also supports SSE, so you can aggregate multiple MCP servers behind a single remote endpoint:
```shell
graph-tool-call proxy \
  --config backends.json \
  --transport sse --port 8000
```
## What I Learned
### 1. Don't compete with embeddings on ranking
If you have a good embedding model, vector search will beat keyword + graph at ranking individual tools. Period. The benchmark proved this conclusively.
### 2. Graph's value is structural, not semantic
Embeddings find semantically similar tools. Graphs encode relationships — what must come before what, what requires what. These are different kinds of knowledge. Don't try to use one where the other is needed.
### 3. Fair benchmarks are humbling
My original story was "Graph beats baseline by 70%." After running a fair 6-strategy comparison, the real story is "Graph ties BM25 at ranking, but uniquely provides workflow chains." Less dramatic. More honest. More useful.
### 4. 100% automation < good defaults + easy editing
I spent days trying to make the workflow planner automatically find updateIssue for "close an issue." Then I spent an afternoon building a visual editor. The editor was more valuable than any accuracy improvement.
### 5. Zero-dep is a feature
The entire core library has zero Python dependencies. BM25, graph traversal, wRRF fusion — all in stdlib Python. Add `[embedding]` for semantic search, `[mcp]` for MCP server mode. Users install only what they need.
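To show the spirit of it, the textbook Okapi BM25 formula fits comfortably in stdlib Python. This sketch is mine, not the library's implementation:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Score each doc against the query with the standard Okapi BM25 formula."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    scores: dict[str, float] = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = sum(1 for t in tokenized.values() if term in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[name] = score
    return scores

docs = {"requestRefund": "request a refund for an order",
        "addPet": "add a new pet to the store"}
scores = bm25_scores("process a refund", docs)
print(max(scores, key=scores.get))  # → requestRefund
```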
## Numbers
| Metric | Value |
|---|---|
| Supported tool scale | 1068 (tested) |
| Recall@5 (no embedding) | 91.6% |
| Recall@5 (with embedding) | 96.8% |
| Latency (no embedding) | 1.5ms |
| Token reduction | 64–91% |
| Dependencies (core) | 0 |
| Test coverage | 494 tests |
## Try It
```shell
pip install graph-tool-call
```

```python
from graph_tool_call import ToolGraph

tg = ToolGraph.from_url(
    "https://petstore3.swagger.io/api/v3/openapi.json",
    cache="petstore.json",
)

# Search
tools = tg.retrieve("place an order", top_k=5)
for t in tools:
    print(f"{t.name}: {t.description}")

# Workflow chain
plan = tg.plan_workflow("buy a pet and place an order")
print(plan)  # WorkflowPlan([addPet → placeOrder])

# Visual editor
plan.open_editor(tools=tg.tools)
```
Or try it without installing — the interactive playground runs in your browser with demo data.
GitHub: github.com/SonAIengine/graph-tool-call
PyPI: pypi.org/project/graph-tool-call
Feedback, issues, and stars are welcome.