Son Seong Jun

Posted on • Originally published at infoedu.co.kr

I Built a Graph-Based Tool Search Engine for LLM Agents — Here's What I Learned After 1068 Tools

LLM agents need tools. But when you have 248 Kubernetes API endpoints or 1068 GitHub API operations, you can't stuff them all into the context window. The standard fix is vector search — embed tool descriptions, find the closest match. It works for finding one tool. But real tasks aren't one tool.

I built graph-tool-call, a Python library that models tool relationships as a graph and retrieves execution chains, not just individual matches. After reaching v0.15, I ran a fair, reproducible benchmark comparing 6 retrieval strategies across 1068 API endpoints. The results were humbling — and led to a complete architecture rethink.

This post covers what I found, what I broke, and what I built differently.


Why Vector Search Isn't Enough

Consider this user request:

"Cancel my order and process a refund"

Vector search finds cancelOrder — the closest semantic match. But the actual workflow is:

listOrders → getOrder → cancelOrder → requestRefund

You need getOrder first because cancelOrder requires an order_id. You need requestRefund after because that's the business process. Vector search returns one tool; you need a chain of four.

This isn't a retrieval quality problem. It's a structural knowledge problem. No amount of embedding improvement will teach a vector database that getOrder must precede cancelOrder.
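To make that concrete, here is a minimal sketch (hypothetical tool names and a single-predecessor assumption, not the library's API) of the kind of precedence knowledge a graph can store and an embedding index cannot:

```python
# Precedence edges: src must run before each tool in its list.
PRECEDES = {
    "listOrders": ["getOrder"],
    "getOrder": ["cancelOrder"],
    "cancelOrder": ["requestRefund"],
}

def chain_to(target, graph):
    """Walk PRECEDES edges backward to collect the prerequisite chain."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    chain, node = [target], target
    while node in reverse:
        node = reverse[node][0]  # assume a single predecessor for simplicity
        chain.append(node)
    return list(reversed(chain))

print(chain_to("requestRefund", PRECEDES))
# ['listOrders', 'getOrder', 'cancelOrder', 'requestRefund']
```

No similarity score over tool descriptions encodes this ordering; it has to be stated as structure.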


The Competitive Benchmark: 6 Strategies, 9 Datasets

I wanted to know: does graph-tool-call actually beat vector search? Not on my own cherry-picked examples, but on a fair, reproducible benchmark.

First, I analyzed bigtool (LangGraph's tool retrieval library). Its core is surprisingly simple — it calls store.search() on LangGraph's vector store. bigtool isn't a retrieval algorithm; it's a wrapper around cosine similarity search.

So I set up a fair comparison:

| Strategy | What it does |
| --- | --- |
| Vector Only | Cosine similarity with qwen3-embedding (≈ what bigtool does) |
| BM25 Only | Keyword matching (TF-IDF style) |
| Graph Only | Graph traversal from category nodes |
| BM25 + Graph | graph-tool-call default (no embedding) |
| Vector + BM25 | Hybrid without graph |
| Full Pipeline | BM25 + Graph + Embedding + Annotation |

I ran all 6 across 9 datasets ranging from 19 tools (Petstore) to 1068 tools (GitHub full API), using the same queries and the same evaluation metrics.

Results

| Strategy | Recall@5 | MRR | Miss% | Latency |
| --- | --- | --- | --- | --- |
| Vector Only | 96.8% | 0.897 | 2.1% | 176ms |
| BM25 Only | 91.6% | 0.819 | 7.5% | 1.5ms |
| BM25 + Graph | 91.6% | 0.819 | 7.5% | 14ms |
| Vector + BM25 | 96.8% | 0.897 | 2.1% | 171ms |
| Full Pipeline | 96.8% | 0.897 | 2.1% | 172ms |

The uncomfortable truth: embedding dominates. Vector Only already hits 96.8% Recall@5. Adding Graph, BM25, or annotations on top of it provides zero additional improvement.

Without embedding, BM25 alone achieves 91.6% — decent, but clearly below vector search. And here's the worst part: BM25 + Graph performed the same as BM25 alone. The graph wasn't helping at all.


Three Bugs That Made Graph Harmful

During the benchmark, I discovered that on some datasets, BM25 + Graph actually scored worse than BM25 alone. The graph was actively degrading results. I spent days debugging this and found three root causes.

Bug 1: set_weights() Was Silently Ignored

The retrieval engine has adaptive weight selection based on corpus size. When I called set_weights(keyword=1.0, graph=0.0) to test BM25-only, the adaptive function overwrote my settings:

def _get_adaptive_weights(self):
    # This always ran, ignoring any manual set_weights() call
    n = len(self._tools)
    if n <= 30:
        return (0.55, 0.30, 0.0, 0.15)  # hardcoded!
    elif n <= 100:
        return (0.50, 0.30, 0.0, 0.20)  # hardcoded!

Every benchmark strategy produced identical results because the weights never changed. I ran the benchmark 5 times before catching this.

The fix was adding a _weights_manual flag:

def set_weights(self, *, keyword=None, graph=None, ...):
    # ... set values ...
    self._weights_manual = True  # disable adaptive

def _get_adaptive_weights(self):
    if self._weights_manual:
        return (self._keyword_weight, self._graph_weight, ...)
    # ... adaptive logic ...

Bug 2: Graph Just Echoed BM25

The graph channel worked like this:

Query → BM25 finds top 10 → Graph expands their neighbors → Done

Since graph expansion started from BM25 results, it could only find tools that were already near what BM25 found. It provided zero independent signal — just amplified BM25's noise.
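The failure is easy to see in a small sketch (hypothetical names, not the library's code): seeding the graph walk with BM25's own hits means the walk can only reach neighbors of those hits.

```python
# Broken design: graph expansion seeded by the BM25 result set.
def expand_from_seeds(seeds, neighbors):
    """Return the seeds plus their direct graph neighbors."""
    found = set(seeds)
    for s in seeds:
        found.update(neighbors.get(s, []))
    return found

neighbors = {
    "getOrder": ["cancelOrder"],
    "requestRefund": ["listOrders"],
}

# BM25 missed "requestRefund" entirely, so the graph cannot recover it:
# expansion from ["getOrder"] reaches only "getOrder" and "cancelOrder".
print(expand_from_seeds(["getOrder"], neighbors))
```

Whatever BM25 misses stays missed; the channel adds no independent recall.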

I needed the graph to find things BM25 couldn't. That meant starting from a different place entirely — more on this below.

Bug 3: Annotations Overwhelmed Precision at Scale

With 248+ Kubernetes tools, many share the same HTTP method. When the intent classifier detected "create" in a query, annotation scoring boosted every POST endpoint. The correct tool (createCoreV1NamespacedService) got pushed below a wrong one (createCoreV1Namespace) because both are POST requests.

This is a precision vs recall tradeoff. Annotations help recall (finding more candidates), but at scale they destroy precision (ranking the right one first).
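A toy illustration of the failure mode (invented scores, not the library's internals): an annotation boost keyed only on the HTTP method gives every POST endpoint the same lift, so it adds candidates without being able to rank the right one within that group.

```python
# Invented base scores from other channels; both tools are POST endpoints.
scores = {
    "createCoreV1Namespace": 0.60,           # wrong tool for the query
    "createCoreV1NamespacedService": 0.55,   # correct tool
}
POST_BOOST = 0.30  # "create" intent detected, so every POST gets the same lift

boosted = {name: s + POST_BOOST for name, s in scores.items()}
ranked = sorted(boosted, key=boosted.get, reverse=True)
print(ranked[0])  # the wrong tool still ranks first; the boost added no precision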


The Architecture Fix: Graph as Candidate Injection

The root problem was putting Graph into the wRRF (weighted Reciprocal Rank Fusion) scoring as a 4th channel alongside BM25, Embedding, and Annotation. Since Graph's standalone accuracy was only 69.8% (vs BM25's 91.6%), it dragged down the fused score.
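For readers unfamiliar with it, weighted Reciprocal Rank Fusion sums, per candidate, weight / (k + rank) over each channel's ranked list. A minimal sketch (illustrative weights and k, not the library's defaults) shows how a noisy channel leaks into every fused ranking:

```python
def wrrf(rankings, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each channel contributes
    weight / (k + rank) for every candidate it ranks."""
    fused = {}
    for channel, ranked in rankings.items():
        w = weights[channel]
        for rank, name in enumerate(ranked, start=1):
            fused[name] = fused.get(name, 0.0) + w / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# A low-accuracy channel (here "graph") pushes its misranked
# candidates into the fused list alongside the good ones.
rankings = {
    "bm25": ["cancelOrder", "getOrder"],
    "graph": ["listPets", "cancelOrder"],
}
print(wrrf(rankings, {"bm25": 0.6, "graph": 0.4}))
```

With Graph as a fourth channel, every query pays for its 69.8% accuracy inside the fusion itself.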

I completely removed Graph from wRRF. Instead, Graph now acts as an independent candidate injection channel:

Before (broken):
  BM25 + Graph + Embedding + Annotation → wRRF fusion → results
                  ↑ noise injected here

After (fixed):
  BM25 + Embedding + Annotation → wRRF fusion → primary results
                                                      ↓
  Graph (independent) → inject candidates BM25 missed → final results

The critical rule: Graph candidates are always scored below the lowest BM25 result. Graph can only add tools that BM25 missed, never displace a BM25 result. This guarantees BM25 + Graph ≥ BM25 alone.

def _inject_graph_candidates(self, final_scores, graph_scores, max_inject):
    # Only tools NOT already in primary results
    new_candidates = {
        name: score for name, score in graph_scores.items()
        if name not in final_scores
    }
    if not new_candidates:
        return
    # Score below the lowest primary result, so injection can never
    # displace a BM25-ranked tool
    min_primary = min(final_scores.values())
    injection_base = min_primary * 0.8

    # Rank new candidates by graph score, normalized to (0, 1]
    ranked = sorted(new_candidates.items(), key=lambda kv: kv[1], reverse=True)
    top_score = ranked[0][1] or 1.0
    for name, g_score in ranked[:max_inject]:
        final_scores[name] = injection_base * (g_score / top_score)

Making Graph Search Generic

The original graph search had 49 GitHub-specific aliases hardcoded:

# Old code — only works with GitHub API
_RESOURCE_ALIASES = {
    "pull request": "pulls",
    "pr": "pulls",
    "issue": "issues",
    "runner": "actions",
    # ... 45 more
}

I replaced this with dynamic reverse-indexing. When you ingest any OpenAPI spec, tool names and descriptions are automatically tokenized and mapped to their category nodes:

# New code — works with any API
for neighbor in graph.get_neighbors(category_node):
    # "requestRefund" → "request Refund" → tokens ["request", "refund"]
    # → "refund" maps to the "orders" category
    name_parts = re.sub(r"([a-z])([A-Z])", r"\1 \2", neighbor)
    for token in name_parts.lower().split():
        index[stem(token)] = category_node

    # Also index description keywords
    for token in description_tokens:
        index[stem(token)] = category_node

Now "refund" → orders, "checkout" → cart, "stargazer" → activity — all automatically from any OpenAPI spec.
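As a self-contained sketch of the same idea (invented categories, and a crude strip-trailing-"s" stemmer standing in for whatever the library actually uses):

```python
import re

def stem(token):
    # Crude stand-in stemmer: strip a trailing "s" from longer tokens
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_reverse_index(categories):
    """Map every stemmed token of every tool name to its category node."""
    index = {}
    for category, tools in categories.items():
        for tool in tools:
            spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", tool)
            for token in spaced.lower().split():
                index[stem(token)] = category
    return index

categories = {
    "orders": ["requestRefund", "cancelOrder"],
    "cart": ["startCheckout"],
}
index = build_reverse_index(categories)
print(index["refund"], index["checkout"])  # orders cart
```

The index is built once at ingest time, so the lookup at query time is a plain dict access.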


1068 Tool Stress Test

I fetched the entire GitHub REST API OpenAPI spec — 1068 endpoints. This is where things get interesting.

| Strategy | Recall@5 | MRR | Miss% |
| --- | --- | --- | --- |
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |

At this scale, everything degrades. The 22% miss rate for BM25 + Graph is high. I analyzed every miss case:

| Query | Expected | Why it missed |
| --- | --- | --- |
| "Close an existing issue" | issues/update | "close" ≠ "update" |
| "Add org member" | orgs/set-membership-for-user | Completely different naming |
| "Register self-hosted runner" | actions/create-registration-token | Indirect mapping |
| "Trigger workflow dispatch" | actions/create-workflow-dispatch-event | "trigger" ≠ "create" |

The pattern: every miss is a semantic gap where the query uses different words than the tool name. "Close" means "update status to closed". "Register" means "create registration token". No keyword matching can bridge these gaps — this is fundamentally what embeddings solve.

This was a humbling realization. I stopped trying to make BM25+Graph beat vector search on ranking accuracy.


Where Graph Actually Wins: Workflow Chains

If Graph can't beat embeddings at ranking individual tools, what can it do that embeddings can't?

The answer: return execution chains, not individual tools.

plan = tg.plan_workflow("process a refund")
for step in plan.steps:
    print(f"{step.order}. {step.tool.name} — {step.reason}")

Output:

1. listOrders — prerequisite for requestRefund
2. requestRefund — primary action

Without this, an LLM agent receiving just requestRefund would:

  1. Call requestRefund → error: "order_id required"
  2. Figure out it needs getOrder → call it
  3. Call requestRefund again → success

That's 3 LLM round trips. With the workflow chain, it's 1 round trip. Each round trip is an LLM API call — this directly cuts costs.

The workflow planner uses the graph's REQUIRES and PRECEDES edges:

  1. Find primary tool via resource-first search (query keywords → category → tools)
  2. Expand prerequisites — follow REQUIRES edges backward, but only same-category GET/LIST methods
  3. Topological sort — order by dependency

The "same-category GET/LIST only" filter was critical. Without it, a single query would pull in 12+ unrelated prerequisites through loose cross-resource REQUIRES edges. With it, chains stay focused: 2-4 steps.
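Step 3 of the planner can be sketched with Kahn's algorithm (hypothetical tool names and REQUIRES edges, not the library's internals):

```python
from collections import deque

def topo_sort(tools, requires):
    """Kahn's algorithm: order tools so every prerequisite runs first.
    requires[t] is the set of tools t depends on."""
    indegree = {t: len(requires.get(t, ())) for t in tools}
    dependents = {t: [] for t in tools}
    for t, deps in requires.items():
        for d in deps:
            dependents[d].append(t)
    queue = deque(t for t in tools if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

tools = ["requestRefund", "getOrder", "listOrders"]
requires = {"requestRefund": {"getOrder"}, "getOrder": {"listOrders"}}
print(topo_sort(tools, requires))
# ['listOrders', 'getOrder', 'requestRefund']
```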

Visual Workflow Editor

Auto-generated chains aren't 100% accurate. "Close an issue" can't automatically map to updateIssue because of the semantic gap. So instead of chasing 100% automation, I built a visual editor.

plan.open_editor(tools=tg.tools) opens a browser-based drag-and-drop editor:

  • Drag to reorder steps
  • Click tools in the sidebar to add steps
  • X to remove steps
  • Export JSON to save the workflow

It's a single HTML file, zero dependencies — consistent with graph-tool-call's philosophy.

For code-first users:

plan = tg.plan_workflow("close an issue")
plan.reorder(["getIssue", "updateIssue"])
plan.set_param_mapping(
    "updateIssue", "issue_id", "getIssue.response.id"
)
plan.save("close_issue.json")

Remote Deployment: SSE Transport

The MCP server previously only supported stdio — local process communication. This meant every developer had to install and run graph-tool-call locally.

I added SSE (Server-Sent Events) and Streamable-HTTP transport:

# Deploy once on a server
graph-tool-call serve \
  --source https://api.example.com/openapi.json \
  --transport sse \
  --host 0.0.0.0 --port 8000

Team members just add a URL to their MCP client config:

{
  "mcpServers": {
    "tool-search": {
      "url": "http://tool-search.internal:8000/sse"
    }
  }
}

stdio is 1:1 local. SSE is 1:N network. One server, entire team.

The proxy mode also supports SSE, so you can aggregate multiple MCP servers behind a single remote endpoint:

graph-tool-call proxy \
  --config backends.json \
  --transport sse --port 8000

What I Learned

1. Don't compete with embeddings on ranking

If you have a good embedding model, vector search will beat keyword + graph at ranking individual tools. Period. The benchmark proved this conclusively.

2. Graph's value is structural, not semantic

Embeddings find semantically similar tools. Graphs encode relationships — what must come before what, what requires what. These are different kinds of knowledge. Don't try to use one where the other is needed.

3. Fair benchmarks are humbling

My original story was "Graph beats baseline by 70%." After running a fair 6-strategy comparison, the real story is "Graph ties BM25 at ranking, but uniquely provides workflow chains." Less dramatic. More honest. More useful.

4. 100% automation < good defaults + easy editing

I spent days trying to make the workflow planner automatically find updateIssue for "close an issue." Then I spent an afternoon building a visual editor. The editor was more valuable than any accuracy improvement.

5. Zero-dep is a feature

The entire core library has zero Python dependencies. BM25, graph traversal, wRRF fusion — all in stdlib Python. Add [embedding] for semantic search, [mcp] for MCP server mode. Users install only what they need.
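To make the "BM25 in stdlib Python" claim concrete, here is a minimal Okapi BM25 sketch needing nothing beyond math and collections (my own simplification with standard k1=1.5, b=0.75, not the library's code):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each pre-tokenized doc against the query (classic Okapi BM25)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

docs = [["cancel", "order"], ["request", "refund"], ["list", "pets"]]
print(bm25_scores(["refund"], docs))  # only the second doc scores above zero
```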


Numbers

| Metric | Value |
| --- | --- |
| Supported tool scale | 1068 (tested) |
| Recall@5 (no embedding) | 91.6% |
| Recall@5 (with embedding) | 96.8% |
| Latency (no embedding) | 1.5ms |
| Token reduction | 64–91% |
| Dependencies (core) | 0 |
| Test suite | 494 tests |

Try It

pip install graph-tool-call
from graph_tool_call import ToolGraph

tg = ToolGraph.from_url(
    "https://petstore3.swagger.io/api/v3/openapi.json",
    cache="petstore.json",
)

# Search
tools = tg.retrieve("place an order", top_k=5)
for t in tools:
    print(f"{t.name}: {t.description}")

# Workflow chain
plan = tg.plan_workflow("buy a pet and place an order")
print(plan)  # WorkflowPlan([addPet → placeOrder])

# Visual editor
plan.open_editor(tools=tg.tools)

Or try it without installing — the interactive playground runs in your browser with demo data.

GitHub: github.com/SonAIengine/graph-tool-call
PyPI: pypi.org/project/graph-tool-call

Feedback, issues, and stars are welcome.
