How I Reduced My Agent's Token Consumption by 83%
I was building a research agent in HoloDeck: a vectorstore tool for paper search, Brave Search for web lookups, and a memory MCP server for knowledge graphs. Pretty standard stuff.
Then I looked at my API call payload for a simple "hi there" message:
{
"messages": [...],
"tools": [
{"function": {"name": "vectorstore-search_papers", ...}},
{"function": {"name": "brave_search-brave_image_search", ...}},
{"function": {"name": "brave_search-brave_local_search", ...}},
{"function": {"name": "brave_search-brave_news_search", ...}},
{"function": {"name": "brave_search-brave_summarizer", ...}},
{"function": {"name": "brave_search-brave_video_search", ...}},
{"function": {"name": "brave_search-brave_web_search", ...}},
{"function": {"name": "memory-add_observations", ...}},
{"function": {"name": "memory-create_entities", ...}},
{"function": {"name": "memory-create_relations", ...}},
{"function": {"name": "memory-delete_entities", ...}},
{"function": {"name": "memory-delete_observations", ...}},
{"function": {"name": "memory-delete_relations", ...}},
{"function": {"name": "memory-open_nodes", ...}},
{"function": {"name": "memory-read_graph", ...}},
{"function": {"name": "memory-search_nodes", ...}}
]
}
16 tools. For "hi there."
The Brave Search MCP server alone exposes 6 functions with verbose parameter schemas (country codes, language enums, pagination options). The memory server adds another 9. Every single request was burning tokens on tool definitions the model would never use.
The Anthropic Inspiration
Anthropic's engineering team published a fantastic post on advanced tool use that addressed exactly this problem. Their key insight: don't load all tools upfront—discover them on demand.
Their numbers were compelling: a five-server MCP setup went from ~55K tokens to ~8.7K tokens. An 85% reduction.
I wanted that for HoloDeck. But I'm using Microsoft's Semantic Kernel, not Claude's native tool system. So I had to figure out how to make it work.
The Architecture
Here's what I built:
User Query
│
▼
┌─────────────────────────────┐
│ ToolFilterManager │
│ ┌───────────────────────┐ │
│ │ ToolIndex │ │
│ │ • Tool metadata │ │
│ │ • Embeddings │ │
│ │ • BM25 index │ │
│ │ • Usage tracking │ │
│ └───────────────────────┘ │
│ │ │
│ search(query) │
│ │ │
│ ▼ │
│ Filtered tool list │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ FunctionChoiceBehavior │
│ .Auto(filters={ │
│ "included_functions": │
│ ["tool1", "tool2"] │
│ }) │
└─────────────────────────────┘
│
▼
Semantic Kernel Agent Invocation
(only selected tools in context)
Three main components:
- ToolIndex - Indexes all tools from Semantic Kernel plugins with embeddings and BM25 stats
- ToolFilterManager - Orchestrates filtering and integrates with SK's execution settings
- FunctionChoiceBehavior - SK's native mechanism for restricting which functions the LLM sees
Building the Tool Index
The first challenge: extracting tool metadata from Semantic Kernel's plugin system. SK organizes tools as functions within plugins, so I needed to crawl that structure:
async def build_from_kernel(
self,
kernel: Kernel,
embedding_service: EmbeddingGeneratorBase | None = None,
defer_loading_map: dict[str, bool] | None = None,
) -> None:
plugins: dict[str, KernelPlugin] = getattr(kernel, "plugins", {})
for plugin_name, plugin in plugins.items():
functions: dict[str, KernelFunction] = getattr(plugin, "functions", {})
for func_name, func in functions.items():
full_name = f"{plugin_name}-{func_name}"
# Extract description and parameters for search
description = getattr(func, "description", "")
parameters: list[str] = []
func_params: list[KernelParameterMetadata] | None = getattr(
func, "parameters", None
)
if func_params:
for param in func_params:
if param.description:
parameters.append(f"{param.name}: {param.description}")
# Create searchable metadata
tool_metadata = ToolMetadata(
name=func_name,
plugin_name=plugin_name,
full_name=full_name,
description=description,
parameters=parameters,
                defer_loading=(defer_loading_map or {}).get(full_name, True),
)
self.tools[full_name] = tool_metadata
Each tool becomes a searchable document combining its name, plugin, description, and parameter info.
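Under the hood, ToolMetadata is just a small container, and the "searchable document" is those fields flattened into one string. A minimal sketch of what that could look like (the embedding and usage_count fields, and the exact helper, are my assumptions here, not HoloDeck's verbatim code):

from dataclasses import dataclass, field

@dataclass
class ToolMetadata:
    name: str                                   # function name, e.g. "brave_web_search"
    plugin_name: str                            # owning plugin, e.g. "brave_search"
    full_name: str                              # "plugin-function", matching SK's tool naming
    description: str = ""
    parameters: list[str] = field(default_factory=list)  # "name: description" strings
    defer_loading: bool = True                  # from defer_loading_map at index-build time
    embedding: list[float] | None = None        # filled in when embeddings are generated
    usage_count: int = 0                        # supports always_include_top_n_used

def _create_searchable_text(self, tool: ToolMetadata) -> str:
    # Lives on the index class; name, plugin, description, and parameter docs all feed search
    parts = [tool.name, tool.plugin_name, tool.description, *tool.parameters]
    return " ".join(part for part in parts if part)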
Three Search Methods
I implemented three ways to find relevant tools:
1. Semantic Search (Embeddings)
The obvious choice. Embed the query, embed the tools, compute cosine similarity:
async def _semantic_search(
    self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -> list[tuple[ToolMetadata, float]]:
    # No embedding service means no semantic results
    if embedding_service is None:
        return []
    # Generate query embedding
    query_embeddings = await embedding_service.generate_embeddings([query])
    query_embedding = list(query_embeddings[0])
    results: list[tuple[ToolMetadata, float]] = []
    for tool in self.tools.values():
        if tool.embedding:
            score = _cosine_similarity(query_embedding, tool.embedding)
            results.append((tool, score))
    return results
Good for understanding intent. "Find information about refunds" matches get_return_policy even though they share no keywords. Scores range from 0.0 to 1.0, with good matches typically in the 0.4-0.6 range.
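The _cosine_similarity helper isn't shown above; it's a few lines of plain Python, something like this (a sketch, no numpy required):

import math

def _cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|), guarding against zero-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)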
2. BM25 (Keyword Matching)
Classic information retrieval using BM25 (Robertson & Zaragoza, 2009). Sometimes you want exact matches:
def _bm25_score_single(self, query: str, tool: ToolMetadata) -> float:
    query_tokens = _tokenize(query)
    doc_tokens = _tokenize(self._create_searchable_text(tool))
    doc_length = len(doc_tokens)
    # Count term frequencies
    term_freq: dict[str, int] = {}
    for token in doc_tokens:
        term_freq[token] = term_freq.get(token, 0) + 1
    score = 0.0
    for term in query_tokens:
        if term not in term_freq:
            continue
        tf = term_freq[term]
        idf = self._idf_cache.get(term, 0.0)
        # BM25 formula
        numerator = tf * (self._BM25_K1 + 1)
        denominator = tf + self._BM25_K1 * (
            1 - self._BM25_B + self._BM25_B * doc_length / self._avg_doc_length
        )
        score += idf * (numerator / denominator)
    return score
Fast, no embeddings needed. Great for technical terms: "brave_search" should definitely match tools from the Brave Search plugin.
Important gotcha: The tokenizer must split on underscores! Tool names like brave_web_search need to tokenize as ["brave", "web", "search"], not as a single token. Otherwise queries containing "web" won't match the tool. I learned this the hard way when "find papers on the web" was returning brave_image_search instead of brave_web_search.
import re

def _tokenize(text: str) -> list[str]:
    # Use [a-zA-Z0-9]+ to split on underscores (not \w+ which includes them)
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return tokens
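The scorer also relies on _idf_cache and _avg_doc_length, which get precomputed once when the index is built. Roughly like this (a sketch of the idea, not the exact HoloDeck implementation):

import math

def _build_bm25_stats(self) -> None:
    # Tokenize every tool's searchable text once
    docs = [_tokenize(self._create_searchable_text(t)) for t in self.tools.values()]
    n_docs = max(len(docs), 1)
    self._avg_doc_length = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many tool documents contain each term?
    doc_freq: dict[str, int] = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # BM25 IDF, with +1 inside the log so values stay non-negative
    self._idf_cache = {
        term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }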
3. Hybrid (Reciprocal Rank Fusion)
Why choose? Combine both with Reciprocal Rank Fusion (Cormack et al., 2009):
async def _hybrid_search(
self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -> list[tuple[ToolMetadata, float]]:
semantic_results = await self._semantic_search(query, embedding_service)
bm25_results = self._bm25_search(query)
# Reciprocal Rank Fusion
k = 60 # Constant from the original paper
rrf_scores: dict[str, float] = {}
semantic_sorted = sorted(semantic_results, key=lambda x: x[1], reverse=True)
for rank, (tool, _) in enumerate(semantic_sorted):
rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)
bm25_sorted = sorted(bm25_results, key=lambda x: x[1], reverse=True)
for rank, (tool, _) in enumerate(bm25_sorted):
rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)
# Normalize to 0-1 range (raw RRF scores are ~0.01-0.03)
max_score = max(rrf_scores.values()) if rrf_scores else 1.0
normalized = {name: score / max_score for name, score in rrf_scores.items()}
return [(self.tools[name], score) for name, score in normalized.items()]
RRF rewards tools that rank highly in both methods without being dominated by either's raw scores.
Critical detail: Raw RRF scores are tiny (0.01-0.03 range) because of the formula 1/(k+rank+1) with k=60. If you apply a similarity_threshold of 0.3 to raw scores, everything gets filtered out! You must normalize RRF scores to 0-1 range by dividing by the max score. After normalization, good matches score 0.8-1.0.
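To make that concrete: a tool ranked first in both lists scores 1/61 + 1/61 ≈ 0.033, while one ranked tenth in both scores about 0.029. The absolute values are tiny and bunched together, which is why only a relative, normalized threshold works here.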
The Semantic Kernel Integration
Semantic Kernel has a FunctionChoiceBehavior class that controls which functions the LLM can call. It supports a filters parameter:
def create_function_choice_behavior(
self, filtered_tools: list[str]
) -> FunctionChoiceBehavior:
return FunctionChoiceBehavior.Auto(
filters={"included_functions": filtered_tools}
)
That's it. Pass in a list of tool names, and SK only sends those tool definitions to the LLM.
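For context, outside of HoloDeck this is roughly how a filtered behavior plugs into a plain Semantic Kernel chat call (a sketch; chat_service and kernel are assumed to be set up already, and this runs inside an async function):

from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings
from semantic_kernel.contents import ChatHistory

settings = OpenAIChatPromptExecutionSettings()
settings.function_choice_behavior = FunctionChoiceBehavior.Auto(
    filters={"included_functions": ["vectorstore-search_papers", "brave_search-brave_web_search"]}
)

history = ChatHistory()
history.add_user_message("Find papers on transformer architectures on the web")

# kernel holds the registered plugins; only the two tools above reach the LLM
response = await chat_service.get_chat_message_content(
    chat_history=history, settings=settings, kernel=kernel
)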
The manager wires it all together:
async def prepare_execution_settings(
self,
query: str,
base_settings: PromptExecutionSettings,
) -> PromptExecutionSettings:
if not self.config.enabled:
return base_settings
# Filter tools based on query
filtered_tools = await self.filter_tools(query)
# Create behavior with only filtered tools
function_choice = self.create_function_choice_behavior(filtered_tools)
# Clone settings and attach filtered behavior
cloned = self._clone_settings(base_settings)
cloned.function_choice_behavior = function_choice
return cloned
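filter_tools itself isn't shown above, but conceptually it's just the search methods plus the configuration knobs covered next. A simplified sketch of the selection logic (not the exact HoloDeck code; ToolIndex.search and the short-name matching are abbreviated):

async def filter_tools(self, query: str) -> list[str]:
    # Rank every indexed tool against the query (semantic, bm25, or hybrid)
    scored = await self.tool_index.search(query, method=self.config.search_method)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep tools above the normalized threshold, capped at top_k
    selected = [
        tool.full_name
        for tool, score in scored
        if score >= self.config.similarity_threshold
    ][: self.config.top_k]
    # Safety net: always_include tools go in regardless of score
    for name in self.config.always_include:
        if name not in selected:
            selected.append(name)
    return selected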
Configuration
I made it all YAML-configurable because that's the HoloDeck way:
tool_filtering:
enabled: true
top_k: 5 # Max tools per request
similarity_threshold: 0.5 # Minimum score for inclusion
always_include:
- search_papers # Critical tools always available
always_include_top_n_used: 0 # Disable until usage patterns stabilize
search_method: hybrid # Options: semantic, bm25, hybrid
Sensible Defaults
Here's what I recommend starting with:
| Parameter | Default | Rationale |
|---|---|---|
| top_k | 5 | Enough tools for most tasks without token bloat |
| similarity_threshold | 0.5 | Include tools at least 50% as relevant as the top result |
| always_include | [] | Agent-specific; add your critical tools here |
| always_include_top_n_used | 0 | Avoid early usage bias; enable after patterns stabilize |
| search_method | hybrid | Best of semantic + keyword matching |
Threshold Tuning by Search Method
All search methods now return normalized scores in the 0-1 range, making the similarity_threshold consistent across methods:
| Method | Good Match Range | Recommended Threshold |
|---|---|---|
| semantic | 0.4 - 0.6 | 0.3 - 0.4 |
| bm25 (normalized) | 0.8 - 1.0 | 0.5 - 0.6 |
| hybrid (normalized) | 0.8 - 1.0 | 0.5 - 0.6 |
A threshold of 0.5 means "include tools scoring at least 50% of what the top result scores." This filters out clearly irrelevant tools while keeping useful ones.
Configuration Knobs
- top_k: How many tools max per request
- similarity_threshold: Below this score, tools get filtered out
- always_include: Your core tools that should always be available
- always_include_top_n_used: Adaptive optimization; frequently used tools stay in context. Caution: this tracks usage across requests, so early or accidental tool calls can bias future filtering. Keep it at 0 during development.
Here's the full agent configuration I was testing with:
# HoloDeck Research Agent Configuration
name: "research-agent"
description: "Research analysis AI assistant"
model:
provider: azure_openai
name: gpt-5.2
instructions:
file: instructions/system-prompt.md
# Tools Configuration
tools:
# Vectorstore for research paper search
- type: vectorstore
name: search_papers
description: Search research papers and documents for relevant passages
source: data/papers_index.json
embedding_model: text-embedding-3-small
top_k: 5
database:
provider: chromadb
# Brave Search MCP Server (exposes 6 functions)
- type: mcp
name: brave_search
description: Web search using Brave Search API
command: npx
args: ["-y", "@brave/brave-search-mcp-server"]
env:
BRAVE_API_KEY: ${BRAVE_API_KEY}
# Memory MCP Server (exposes 9 functions)
- type: mcp
name: memory
description: Persistent memory using local knowledge graph
command: npx
args: ["-y", "@modelcontextprotocol/server-memory"]
# Tool Filtering - This is where the magic happens
tool_filtering:
enabled: true
top_k: 5
similarity_threshold: 0.5
always_include:
- search_papers
always_include_top_n_used: 0
search_method: hybrid
Three tool sources. 16 total functions exposed. Without filtering, every request sends all 16 tool schemas.
The Results
Let me show you actual API payloads. With filtering off, here's what gets sent for a simple "hi there":
{
"messages": [
{"role": "system", "content": "# System Prompt for research-agent..."},
{"role": "user", "content": "hi there"}
],
"model": "gpt-5.2",
"tools": [
{"type": "function", "function": {"name": "vectorstore-search_papers", "description": "Search research papers...", "parameters": {...}}},
{"type": "function", "function": {"name": "brave_search-brave_image_search", "description": "Performs an image search...", "parameters": {"properties": {"query": {...}, "country": {...}, "search_lang": {...}, "count": {...}, "safesearch": {...}, "spellcheck": {...}}, ...}}},
{"type": "function", "function": {"name": "brave_search-brave_local_search", ...}},
{"type": "function", "function": {"name": "brave_search-brave_news_search", ...}},
{"type": "function", "function": {"name": "brave_search-brave_summarizer", ...}},
{"type": "function", "function": {"name": "brave_search-brave_video_search", ...}},
{"type": "function", "function": {"name": "brave_search-brave_web_search", ...}},
{"type": "function", "function": {"name": "memory-add_observations", ...}},
{"type": "function", "function": {"name": "memory-create_entities", ...}},
{"type": "function", "function": {"name": "memory-create_relations", ...}},
{"type": "function", "function": {"name": "memory-delete_entities", ...}},
{"type": "function", "function": {"name": "memory-delete_observations", ...}},
{"type": "function", "function": {"name": "memory-delete_relations", ...}},
{"type": "function", "function": {"name": "memory-open_nodes", ...}},
{"type": "function", "function": {"name": "memory-read_graph", ...}},
{"type": "function", "function": {"name": "memory-search_nodes", ...}}
]
}
16 tools. 5,888 tokens. For "hi there."
Look at those Brave Search parameter schemas—country code enums, language preferences, pagination options, safesearch filters. Each tool definition is a token hog.
With filtering on:
{
"messages": [
{"role": "system", "content": "# System Prompt for research-agent..."},
{"role": "user", "content": "hi there"}
],
"model": "gpt-5.2",
"tools": [
{"type": "function", "function": {"name": "vectorstore-search_papers", ...}},
{"type": "function", "function": {"name": "brave_search-brave_web_search", ...}}
]
}
2 tools. 1,016 tokens.
That's an 83% reduction, from 5,888 tokens down to 1,016.
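If you want to sanity-check numbers like these yourself, a rough way is to serialize the tools array and count tokens with tiktoken (approximate only; providers tokenize tool schemas a bit differently than raw JSON):

import json
import tiktoken

def approx_tool_tokens(tools: list[dict]) -> int:
    # Rough estimate: token count of the JSON-serialized tool definitions
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(json.dumps(tools)))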
The logs tell the story:
Tool filtering: 2/16 tools selected for query: 'hi there...'
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search']
For a real research query like "Find papers on transformer architectures on the web", the filtering gets smarter:
Tool filtering: 3/16 tools selected
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search', 'memory-search_nodes']
The right tools. Automatically. Based on what the user actually asked.
Lessons Learned
1. MCP servers are tool factories. A single MCP server can expose dozens of functions. Without filtering, your token costs explode.
2. Tokenization matters for BM25. Make sure your tokenizer splits on underscores so brave_web_search becomes ["brave", "web", "search"]. Otherwise keyword matching fails on tool names.
3. Normalize your search scores. Raw BM25 scores range from 0-10+, and raw RRF scores are tiny (0.01-0.03). Both need normalization to 0-1 range, or your similarity_threshold won't work consistently. Semantic search (cosine similarity) is already 0-1.
4. After normalization, thresholds are consistent. With all methods normalized, good matches score 0.8-1.0 for BM25/hybrid, and 0.4-0.6 for semantic. A threshold of 0.5 works well across methods.
5. always_include is your safety net. Some tools are so core to your agent that you never want them filtered out. Make that explicit.
6. Be careful with always_include_top_n_used. This feature tracks usage and auto-includes frequently used tools. Sounds great, but early/accidental usage can bias future requests. Keep it at 0 during development.
What's Next
This is just tool filtering. Anthropic's post also covers:
- Programmatic tool calling : Let the model write code to process intermediate results
- Tool use examples : Providing concrete usage patterns to reduce parameter ambiguity
I might implement those next. But for now, getting 83% token reduction with a few hundred lines of code feels pretty good.
The full implementation is in HoloDeck's tool_filter module. PRs welcome.
