DEV Community

Kim Namhyun

Designing a Tool Architecture for AI Agents — Base Tools, Toolkits, and Dynamic Routing

How do you give an AI agent 30+ tools without drowning the context window? Include everything and you waste tokens. Be selective and the agent can't do its job. Here's how I solved it with a 3-layer architecture.

The Problem: Too Many Tools

As your AI agent grows, so does its toolbox. My personal assistant now has 35+ tools — web search, email, calendar, weather, Git, host PC control, file management, code execution, and more.

Sending all 35 tool schemas to the LLM on every request causes two problems:

  1. Token cost explosion: 35 JSON function schemas easily consume 3,000+ tokens per turn
  2. Selection accuracy drops: The more tools available, the more likely the LLM picks the wrong one

But if you trim tools aggressively, the agent can't handle requests it should be able to.


The Solution: 3-Layer Architecture

         ┌───────────────┐
         │  User Input   │
         └───────┬───────┘
                 ↓
┌────────────────────────────────────┐
│          Tool Registry             │
│                                    │
│  ┌────────────┐  ┌──────────────┐  │
│  │ Base Tools │  │  Toolkits    │  │
│  │ (always ON)│  │(dynamic load)│  │
│  │ 13 general │  │  8 packs     │  │
│  └──────┬─────┘  └──────┬───────┘  │
│         │               │          │
│         │         ┌─────┴──────┐   │
│         │         │   Tasks    │   │
│         │         │(individual)│   │
│         │         └─────┬──────┘   │
│         └───────┬───────┘          │
│                 ↓                  │
│     ┌───────────────────────┐      │
│     │ Selected Tools → LLM  │      │
│     └───────────────────────┘      │
└────────────────────────────────────┘

Layer 1: Base Tools — Always Included

13 general-purpose tools that could be needed for any request:

web_search, read_file, write_file, list_files,
run_command, get_datetime, calculate, run_python_code,
pip_install, recall, forget,
host_list_files, vm_to_host, host_to_vm

Web search, file I/O, code execution, memory (recall/forget) — these are universal. Included in every LLM call regardless of the user's request.
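Each base tool reaches the LLM as an OpenAI-style function-calling schema. A minimal sketch of one always-on entry (the description and parameter shape here are illustrative placeholders, not my exact production schema):

```python
# One entry from the always-on base tool list, in OpenAI
# function-calling format. The timezone parameter is an
# illustrative assumption.
GET_DATETIME_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_datetime",
        "description": "Return the current date and time.",
        "parameters": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "IANA timezone name, e.g. 'Asia/Seoul'",
                },
            },
            "required": [],
        },
    },
}
```

The 13 base schemas are simply prepended to whatever the router selects, so every request can fall back on them.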

Layer 2: Toolkits — Domain-Specific Tool Packs

Related tools grouped into packs, defined as JSON files:

toolkits/
├── calendar.json    # create_event, list_events, update_event, delete_event
├── contacts.json    # find_contact
├── email.json       # send_email, read_email, search_email
├── git.json         # git_clone, git_status, git_commit, git_push
├── host_pc.json     # host_open_url, host_open_app, host_find_file, ...
├── meta.json        # help, show_config, health
├── scheduler.json   # create_task, list_tasks, cancel_task
└── weather.json     # weather

Each toolkit JSON contains:

{
  "name": "weather",
  "tier": "free",
  "description": "Weather and forecast information...",
  "keywords": ["날씨", "weather", "forecast", "temperature", "rain", "umbrella"],
  "tasks": [
    {
      "type": "function",
      "function": {
        "name": "weather",
        "description": "Get weather info...",
        "parameters": { ... }
      }
    }
  ]
}
  • description: Used for embedding similarity matching
  • keywords: Fast keyword-based activation
  • tasks: Actual OpenAI function calling schemas sent to the LLM
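Loading these packs into memory is a one-time directory scan. A sketch of a loader (the field names follow the JSON above; the Toolkit dataclass itself is my illustration, not the article's actual code):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Toolkit:
    name: str
    description: str          # used for embedding similarity
    keywords: list            # fast keyword activation
    tasks: list               # OpenAI function-calling schemas
    embedding: list = field(default_factory=list)  # filled at startup

def load_toolkits(directory="toolkits"):
    """Read every *.json pack in the toolkits directory."""
    toolkits = []
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        toolkits.append(Toolkit(
            name=data["name"],
            description=data["description"],
            keywords=data.get("keywords", []),
            tasks=data.get("tasks", []),
        ))
    return toolkits
```

Keeping toolkits as plain JSON means adding a new domain pack requires no code change, only a new file.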

Layer 3: Tasks — Individual Tool Functions

A Task is a single tool function inside a Toolkit. The weather toolkit has 1 task (weather()), while calendar has 4 (create_event, list_events, update_event, delete_event).


Dynamic Routing: Which Toolkits to Activate?

The key question: given a user input, which toolkits are relevant?

Two-Stage Matching

def select_tools(user_input):
    selected = list(base_tools)  # always included (copy, don't mutate the base list)

    # Stage 1: Keyword matching (fast, deterministic)
    keyword_matched = set()
    for toolkit in all_toolkits:
        if any(keyword in user_input for keyword in toolkit.keywords):
            selected += toolkit.tasks
            keyword_matched.add(toolkit.name)

    # Stage 2: Embedding similarity (catches what keywords miss)
    input_embedding = embed(user_input)
    for toolkit in all_toolkits:
        if toolkit.name in keyword_matched:
            continue  # already activated in Stage 1
        if cosine(input_embedding, toolkit.embedding) >= 0.40:
            selected += toolkit.tasks

    return selected

Stage 1: Keyword Matching

  • "What's the weather today?" → "weather" keyword → weather toolkit activated
  • "Open Chrome" → "열어줘" (Korean "open") keyword → host_pc toolkit activated
  • Fast and precise, but limited coverage

Stage 2: Embedding Similarity (BGE-M3)

  • "Should I bring an umbrella?" → No keyword match, but semantically similar to weather toolkit → activated
  • Threshold: 0.40 (prioritize recall — better to include extra tools than miss needed ones)
  • Model: BGE-M3 (multilingual, runs locally via Ollama)
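The similarity check itself is ordinary cosine similarity over the two vectors. A self-contained sketch (the 0.40 threshold matches the one above; the vectors in the usage note are toy values, not real BGE-M3 embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_a * norm_b)

THRESHOLD = 0.40

def is_activated(input_emb, toolkit_emb):
    """Activate a toolkit when similarity clears the recall-biased threshold."""
    return cosine(input_emb, toolkit_emb) >= THRESHOLD
```

In practice the vectors are 1024-dimensional BGE-M3 outputs, but the math is identical.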

Pre-computed Embeddings

At server startup, all toolkit descriptions are embedded once:

def init():
    for toolkit in all_toolkits:
        toolkit.embedding = get_embedding(toolkit.description)
    # At request time, only the user input needs embedding (1 API call)

Real Example

User: "If it rains tomorrow, plan an indoor workout and add it to my calendar"

[ToolRouter] 18/35 tools | activated: [weather(keyword:1.00), calendar(embed:0.52)]
  • "rain" → weather toolkit via keyword
  • "add to calendar" → calendar toolkit via embedding (similarity 0.52)

Base 13 + weather 1 + calendar 4 = 18 tools sent to LLM. The other 17 tools (git, email, contacts, etc.) are excluded → token savings + better accuracy.


Design Decisions

Decision 1: Why not send all tools every time?

With 35+ tool schemas:

  • Token cost increases (inference cost + response latency)
  • LLM confuses similar tools (send_email vs host_run_command for sending mail)
  • Especially severe with smaller models (8B parameters)

Decision 2: Why not use embeddings only?

Embedding-only approach:

  • Even obvious keywords like "weather" require an embedding API call (unnecessary latency)
  • If the embedding server goes down, everything breaks

Keywords first, with embeddings as a fallback, gets the best of both: zero-latency activation for obvious requests, semantic coverage for everything else.
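That ordering also buys graceful degradation: if the embedding backend is unreachable, Stage 1 results still go through. A sketch of how I'd guard the Stage 2 call (the embed and cosine callables are passed in here purely to keep the sketch self-contained; broken_embed simulates an outage):

```python
def select_with_fallback(user_input, keyword_selected, candidates,
                         embed, cosine, threshold=0.40):
    """Stage 2 with graceful degradation: if the embedding backend
    is down, the keyword matches from Stage 1 still go through."""
    selected = list(keyword_selected)
    try:
        input_embedding = embed(user_input)  # e.g. BGE-M3 via Ollama
    except Exception:
        return selected  # embedding server down -> keyword matches only
    for toolkit in candidates:
        if cosine(input_embedding, toolkit.embedding) >= threshold:
            selected += toolkit.tasks
    return selected

def broken_embed(text):
    # Simulates the local embedding server being unreachable.
    raise ConnectionError("embedding server down")
```

With broken_embed standing in for the real client, the router returns the Stage 1 selection instead of failing the whole request.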

Decision 3: What threshold for similarity?

  • 0.60: Precise but misses relevant toolkits
  • 0.40: May over-activate but never misses
  • Recall 100% is the priority — extra tools in the context are harmless (LLM ignores them), but missing a needed tool means the agent simply can't do its job

Conclusion

Tool management for AI agents comes down to one question: "Which tools should the LLM see for this specific request?"

The 3-layer answer:

  • Base Tools: Universal → always ON
  • Toolkits: Domain packs → dynamically activated via keyword + embedding
  • Tasks: Individual functions inside toolkits

This architecture lets "What's the weather?" include only the weather task, while "Commit my code" includes only git tasks — saving tokens and improving accuracy across the board.
