DEV Community

Kim Namhyun

Designing a Tool Architecture for AI Agents — Base Tools, Toolkits, and Dynamic Routing

How do you give an AI agent 30+ tools without drowning the context window? Include everything and you waste tokens. Be selective and the agent can't do its job. Here's how I solved it with a 3-layer architecture.

The Problem: Too Many Tools

As your AI agent grows, so does its toolbox. My personal assistant now has 35+ tools — web search, email, calendar, weather, Git, host PC control, file management, code execution, and more.

Sending all 35 tool schemas to the LLM on every request causes two problems:

  1. Token cost explosion: 35 JSON function schemas easily consume 3,000+ tokens per turn
  2. Selection accuracy drops: The more tools available, the more likely the LLM picks the wrong one

But if you trim tools aggressively, the agent can't handle requests it should be able to.


The Solution: 3-Layer Architecture

         ┌───────────────┐
         │  User Input   │
         └───────┬───────┘
                 ↓
┌────────────────────────────────────┐
│          Tool Registry             │
│                                    │
│  ┌────────────┐  ┌──────────────┐  │
│  │ Base Tools │  │  Toolkits    │  │
│  │ (always ON)│  │(dynamic load)│  │
│  │ 13 general │  │  8 packs     │  │
│  └──────┬─────┘  └──────┬───────┘  │
│         │               │          │
│         │         ┌─────┴──────┐   │
│         │         │   Tasks    │   │
│         │         │(individual)│   │
│         │         └─────┬──────┘   │
│         └───────┬───────┘          │
│                 ↓                  │
│     ┌───────────────────────┐      │
│     │ Selected Tools → LLM  │      │
│     └───────────────────────┘      │
└────────────────────────────────────┘

Layer 1: Base Tools — Always Included

13 general-purpose tools that could be needed for any request:

web_search, read_file, write_file, list_files,
run_command, get_datetime, calculate, run_python_code,
pip_install, recall, forget,
host_list_files, vm_to_host, host_to_vm

Web search, file I/O, code execution, memory (recall/forget) — these are universal. Included in every LLM call regardless of the user's request.
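Each base tool reaches the LLM as an OpenAI-style function-calling schema. A minimal sketch of one always-on entry (the description and parameter shape here are illustrative placeholders, not my exact production schema):

```python
# One entry from the always-on base tool list, in OpenAI
# function-calling format. The timezone parameter is an
# illustrative assumption.
GET_DATETIME_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_datetime",
        "description": "Return the current date and time.",
        "parameters": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "IANA timezone name, e.g. 'Asia/Seoul'",
                },
            },
            "required": [],
        },
    },
}
```

The 13 base schemas are simply prepended to whatever the router selects, so every request can fall back on them.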

Layer 2: Toolkits — Domain-Specific Tool Packs

Related tools grouped into packs, defined as JSON files:

toolkits/
├── calendar.json    # create_event, list_events, update_event, delete_event
├── contacts.json    # find_contact
├── email.json       # send_email, read_email, search_email
├── git.json         # git_clone, git_status, git_commit, git_push
├── host_pc.json     # host_open_url, host_open_app, host_find_file, ...
├── meta.json        # help, show_config, health
├── scheduler.json   # create_task, list_tasks, cancel_task
└── weather.json     # weather

Each toolkit JSON contains:

{
  "name": "weather",
  "tier": "free",
  "description": "Weather and forecast information...",
  "keywords": ["날씨", "weather", "forecast", "temperature", "rain", "umbrella"],
  "tasks": [
    {
      "type": "function",
      "function": {
        "name": "weather",
        "description": "Get weather info...",
        "parameters": { ... }
      }
    }
  ]
}
  • description: Used for embedding similarity matching
  • keywords: Fast keyword-based activation
  • tasks: Actual OpenAI function calling schemas sent to the LLM
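Loading these packs into memory is a one-time directory scan. A sketch of a loader (the field names follow the JSON above; the Toolkit dataclass itself is my illustration, not the article's actual code):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Toolkit:
    name: str
    description: str          # used for embedding similarity
    keywords: list            # fast keyword activation
    tasks: list               # OpenAI function-calling schemas
    embedding: list = field(default_factory=list)  # filled at startup

def load_toolkits(directory="toolkits"):
    """Read every *.json pack in the toolkits directory."""
    toolkits = []
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        toolkits.append(Toolkit(
            name=data["name"],
            description=data["description"],
            keywords=data.get("keywords", []),
            tasks=data.get("tasks", []),
        ))
    return toolkits
```

Keeping toolkits as plain JSON means adding a new domain pack requires no code change, only a new file.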

Layer 3: Tasks — Individual Tool Functions

A Task is a single tool function inside a Toolkit. The weather toolkit has 1 task (weather()), while calendar has 4 (create_event, list_events, update_event, delete_event).


Dynamic Routing: Which Toolkits to Activate?

The key question: given a user input, which toolkits are relevant?

Two-Stage Matching

def select_tools(user_input):
    selected = list(base_tools)  # always included (copy, don't mutate the base list)

    # Stage 1: Keyword matching (fast, deterministic)
    keyword_matched = set()
    for toolkit in all_toolkits:
        if any(keyword in user_input for keyword in toolkit.keywords):
            selected += toolkit.tasks
            keyword_matched.add(toolkit.name)

    # Stage 2: Embedding similarity (catches what keywords miss)
    input_embedding = embed(user_input)
    for toolkit in all_toolkits:
        if toolkit.name in keyword_matched:
            continue  # already activated in Stage 1
        if cosine(input_embedding, toolkit.embedding) >= 0.40:
            selected += toolkit.tasks

    return selected

Stage 1: Keyword Matching

  • "What's the weather today?" → "weather" keyword → weather toolkit activated
  • "Open Chrome" → "열어줘" (Korean "open") keyword → host_pc toolkit activated
  • Fast and precise, but limited coverage

Stage 2: Embedding Similarity (BGE-M3)

  • "Should I bring an umbrella?" → No keyword match, but semantically similar to weather toolkit → activated
  • Threshold: 0.40 (prioritize recall — better to include extra tools than miss needed ones)
  • Model: BGE-M3 (multilingual, runs locally via Ollama)
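The similarity check itself is ordinary cosine similarity over the two vectors. A self-contained sketch (the 0.40 threshold matches the one above; the vectors in the usage note are toy values, not real BGE-M3 embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_a * norm_b)

THRESHOLD = 0.40

def is_activated(input_emb, toolkit_emb):
    """Activate a toolkit when similarity clears the recall-biased threshold."""
    return cosine(input_emb, toolkit_emb) >= THRESHOLD
```

In practice the vectors are 1024-dimensional BGE-M3 outputs, but the math is identical.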

Pre-computed Embeddings

At server startup, all toolkit descriptions are embedded once:

def init():
    for toolkit in all_toolkits:
        toolkit.embedding = get_embedding(toolkit.description)
    # At request time, only the user input needs embedding (1 API call)

Real Example

User: "If it rains tomorrow, plan an indoor workout and add it to my calendar"

[ToolRouter] 18/35 tools | activated: [weather(keyword:1.00), calendar(embed:0.52)]
  • "rain" → weather toolkit via keyword
  • "add to calendar" → calendar toolkit via embedding (similarity 0.52)

Base 13 + weather 1 + calendar 4 = 18 tools sent to LLM. The other 17 tools (git, email, contacts, etc.) are excluded → token savings + better accuracy.


Design Decisions

Decision 1: Why not send all tools every time?

With 35+ tool schemas:

  • Token cost increases (inference cost + response latency)
  • LLM confuses similar tools (send_email vs host_run_command for sending mail)
  • Especially severe with smaller models (8B parameters)

Decision 2: Why not use embeddings only?

Embedding-only approach:

  • Even obvious keywords like "weather" require an embedding API call (unnecessary latency)
  • If the embedding server goes down, everything breaks

Keywords first, with embeddings as a fallback, gets the best of both: zero-latency activation for obvious requests, semantic coverage for everything else.
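That ordering also buys graceful degradation: if the embedding backend is unreachable, Stage 1 results still go through. A sketch of how I'd guard the Stage 2 call (the embed and cosine callables are passed in here purely to keep the sketch self-contained; broken_embed simulates an outage):

```python
def select_with_fallback(user_input, keyword_selected, candidates,
                         embed, cosine, threshold=0.40):
    """Stage 2 with graceful degradation: if the embedding backend
    is down, the keyword matches from Stage 1 still go through."""
    selected = list(keyword_selected)
    try:
        input_embedding = embed(user_input)  # e.g. BGE-M3 via Ollama
    except Exception:
        return selected  # embedding server down -> keyword matches only
    for toolkit in candidates:
        if cosine(input_embedding, toolkit.embedding) >= threshold:
            selected += toolkit.tasks
    return selected

def broken_embed(text):
    # Simulates the local embedding server being unreachable.
    raise ConnectionError("embedding server down")
```

With broken_embed standing in for the real client, the router returns the Stage 1 selection instead of failing the whole request.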

Decision 3: What threshold for similarity?

  • 0.60: Precise but misses relevant toolkits
  • 0.40: May over-activate but never misses
  • Recall 100% is the priority — extra tools in the context are harmless (LLM ignores them), but missing a needed tool means the agent simply can't do its job

Conclusion

Tool management for AI agents comes down to one question: "Which tools should the LLM see for this specific request?"

The 3-layer answer:

  • Base Tools: Universal → always ON
  • Toolkits: Domain packs → dynamically activated via keyword + embedding
  • Tasks: Individual functions inside toolkits

This architecture lets "What's the weather?" include only the weather task, while "Commit my code" includes only git tasks — saving tokens and improving accuracy across the board.
