Part 1: A theoretical implementation of LLM selection and orchestration
Introduction
One of the greatest strengths of Bob Orchestrator lies in its intelligent multi-LLM routing logic. Instead of relying on a single, costly model for everything, Bob dynamically selects the best tool for the specific task at hand. Need comprehensive documentation? Bob routes it to a model optimized for text. Need complex code generation? Bob leverages the exact LLM architecture best suited for that language or framework. By matching the task to the right LLM “T-Shirt size,” Bob automatically optimizes your cost-per-token, delivering maximum performance at the lowest possible price point.
In this opening installment of the series, I will demonstrate how an orchestration scenario scaffolds a sample application that implements this multi-LLM routing logic.
To keep this walkthrough straightforward, reproducible, and cost-effective, I am running the orchestration entirely locally via Ollama. Below is the specific roster of local models — ranging from lightweight utility models to heavy-duty reasoning engines — that we will use to test an intelligent application’s dynamic routing decisions:
| Model Name | Image ID | Size |
| ------------------------------ | -------------- | ------ |
| **mistral-small3.2:latest** | `5a408ab55df5` | 15 GB |
| **ibm/granite3.3-guardian:8b** | `90a8aabc98eb` | 6.7 GB |
| **granite3.3:8b** | `fd429f23b909` | 4.9 GB |
| **gemma3:4b** | `a2af6cc3eb7f` | 3.3 GB |
| **ibm/granite3.2-guardian:3b** | `4f7cdff62b7e` | 2.7 GB |
| **granite3-guardian:latest** | `ba81a177bd23` | 2.7 GB |
| **granite4.1:3b** | `6fd349357287` | 2.1 GB |
| **ibm/granite4:3b** | `89962fcc7523` | 2.1 GB |
| **llama3.2:latest** | `a80c4f17acd5` | 2.0 GB |
| **ibm/granite4:350m** | `5eee845b49c4` | 708 MB |
| **granite-embedding:30m** | `eb4c533ba6f7` | 62 MB |
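The roster above has the shape of `ollama list` output. As a minimal sketch, the same information can be read from Ollama's local REST endpoint `/api/tags`. It assumes the default port 11434 and that the short ID is the first 12 characters of the model digest. The `sample` payload is made up for illustration (its digests are padded with zeros), so the block runs without a live Ollama server.

```python
import json
from urllib.request import urlopen

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumption: default Ollama port


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Fetch the list of locally installed models (needs a running Ollama server)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def summarize(payload: dict) -> list[tuple[str, str, str]]:
    """Return (name, short id, human-readable size) rows, largest model first."""
    rows = []
    for m in sorted(payload.get("models", []), key=lambda m: m["size"], reverse=True):
        size = m["size"]
        human = f"{size / 1e9:.1f} GB" if size >= 1e9 else f"{size / 1e6:.0f} MB"
        rows.append((m["name"], m["digest"][:12], human))
    return rows


# Illustrative payload only: sizes are rounded and digests are zero-padded.
sample = {
    "models": [
        {"name": "granite-embedding:30m", "digest": "eb4c533ba6f7" + "0" * 52, "size": 62_000_000},
        {"name": "llama3.2:latest", "digest": "a80c4f17acd5" + "0" * 52, "size": 2_000_000_000},
    ]
}

for row in summarize(sample):
    print(row)
```

With a server running, `summarize(fetch_tags())` would produce the live roster instead of the sample.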
Architecture Overview: The Intelligent Cost-Routing Pipeline
> 🚨 Disclaimer: This article is for educational purposes only. It provides a conceptual, hands-on look at how LLM routing logic operates and should not be mistaken for the official enterprise implementation or architecture of the IBM Bob product.
At the heart of the orchestrator’s cost-efficiency engine is a modular pipeline designed to intercept user intent, break down complex requirements, and match each sub-task to the most cost-effective LLM tier. Instead of blindly passing a massive multi-part prompt to a premium, high-cost reasoning model, the orchestrator acts as a smart traffic controller.
                  +------------------------+
                  |  User Complex Request  |
                  +-----------+------------+
                              |
                              v
                  +------------------------+
                  |     Task Splitter      |
                  +-----------+------------+
                              |
                   [Splits into Sub-Tasks]
                              |
                              v
                  +------------------------+
                  |    Task Classifier     |
                  +-----------+------------+
                              |
                 [Identifies Domain & Size]
                              |
                              v
                  +------------------------+
                  |   Dynamic LLM Router   |
                  +-----------+------------+
                              |
         [Queries Cost Registry & Selects Best Fit]
                              |
         +--------------------+--------------------+
         |                    |                    |
         v                    v                    v
+-----------------+  +-----------------+  +-----------------+
| Mistral (Small) |  | Granite (Guard) |  | Llama (Utility) |
+-----------------+  +-----------------+  +-----------------+
The pipeline operates in four distinct stages:
- Task Splitting: Parsing complex, compound requests into a sequence of isolated, clean sub-tasks.
- Intent & Domain Classification: Scoring each sub-task to determine its exact domain (e.g., code, documentation, text transformation) and complexity requirement (“T-Shirt sizing”).
- Registry-Driven Routing: Cross-referencing the classification payload against a centralized model tier and cost ledger to pick the most cost-effective local Ollama target that still meets the quality bar.
- Isolated Execution & Reassembly: Invoking the targeted models and stitching their outputs back together into a single response.
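The last stage is not shown in the article's code, so here is a minimal sketch of how execution and reassembly might look. Everything in it is an assumption made for illustration: the `RoutedTask` type, the `execute_and_merge` helper, and the `fake_model` stub that stands in for a real Ollama call. Parallel execution is also an inference, based on the demo's total wall-clock time being close to its slowest single call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedTask:
    title: str    # section heading used in the merged answer
    model: str    # model name chosen by the router
    prompt: str   # the isolated sub-task text


def execute_and_merge(tasks: list[RoutedTask],
                      call_model: Callable[[str, str], str]) -> str:
    """Run each sub-task on its own model in parallel, then merge in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        # pool.map preserves input order, so sections keep the order of the request
        outputs = list(pool.map(lambda t: call_model(t.model, t.prompt), tasks))
    sections = [f"## {t.title} (via {t.model})\n{out}" for t, out in zip(tasks, outputs)]
    return "\n\n---\n\n".join(sections)


def fake_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an Ollama call, so the sketch runs offline."""
    return f"[{model}] answer to: {prompt}"


merged = execute_and_merge(
    [
        RoutedTask("Documentation", "granite4.1:3b", "Write the API reference"),
        RoutedTask("Code Generation", "gemma3:4b", "Implement the middleware"),
    ],
    fake_model,
)
print(merged)
```

Swapping `fake_model` for a function that calls the local Ollama server would turn the sketch into a working executor; the merge logic stays the same.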
Technical Deep Dive & Core Implementations
Granular Task Decomposition
To avoid processing unnecessary tokens through high-tier engines, the architecture relies on a specialized TaskSplitter. This component evaluates the raw user payload and breaks it down into discrete components so that documentation and code can travel along separate routing paths.
"""
splitter.py
Splits a compound prompt into independent clauses, then classifies each one.
"""
from __future__ import annotations

import re

from src.task_classifier import classify, TaskProfile

# Conjunctions that signal independent sub-tasks within a single prompt.
# Note: (?i) makes the whole pattern case-insensitive, so the "AND" alternative
# also matches a lowercase "and".
_SPLIT_PATTERN = re.compile(
    r"(?i)\s*\b(and also|and then|additionally|plus|as well as|AND|also)\b\s*",
)


def split_and_classify(prompt: str) -> list[TaskProfile]:
    """
    1. Split the prompt on conjunction markers.
    2. Discard separator tokens.
    3. Classify each remaining clause.
    """
    # re.split keeps the captured conjunctions in its result; they are filtered out below
    raw_clauses = _SPLIT_PATTERN.split(prompt)
    # Keep only non-separator parts longer than 8 characters
    clauses = [
        c.strip()
        for c in raw_clauses
        if c and not _SPLIT_PATTERN.fullmatch(c.strip()) and len(c.strip()) > 8
    ]
    if not clauses:
        # Fallback: treat the whole prompt as one task
        clauses = [prompt.strip()]
    return [classify(c) for c in clauses]
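To see what the splitter's regular expression does in practice, the short sketch below applies the same pattern to the prompt used later in the demo. It only exercises `re.split`; the `clauses` helper line is a simplification for illustration, not the article's filter. The second call shows a side effect worth knowing: because the pattern is case-insensitive, an ordinary lowercase "and" inside a phrase is also treated as a task boundary.

```python
import re

# Same pattern as in splitter.py
_SPLIT_PATTERN = re.compile(
    r"(?i)\s*\b(and also|and then|additionally|plus|as well as|AND|also)\b\s*",
)

prompt = (
    "Write the API reference documentation for our REST /users endpoints "
    "AND implement the OAuth2 Bearer-token middleware in Python FastAPI."
)

# Because the pattern has a capturing group, re.split returns the separators too:
# [clause, separator, clause, ...]
parts = _SPLIT_PATTERN.split(prompt)
clauses = parts[::2]  # even positions hold the clauses
for clause in clauses:
    print(repr(clause))

# Caveat: a lowercase "and" inside a phrase is split as well
surprise = _SPLIT_PATTERN.split("Compare the pros and cons of Redis")
print(surprise)
```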
T-Shirt Sizing and Feature Classification
Once separated, each sub-task passes through the task classifier. The classifier looks for domain markers (e.g., security policies, coding constraints, or structural documentation formatting), computes a complexity score between 0 and 1, and maps that score to a complexity tier: LOW, MEDIUM, or HIGH.
"""
task_classifier.py
Classifies a text prompt into a TaskProfile: type, complexity score, and token estimate.
"""
from __future__ import annotations

import math
import re
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    DOCUMENTATION = "documentation"      # READMEs, API docs, inline comments
    CODE_GENERATION = "code_generation"  # new functions / modules / classes
    CODE_REVIEW = "code_review"          # diff analysis, security audit
    REASONING = "reasoning"              # architecture, design decisions
    QA_SIMPLE = "qa_simple"              # factual lookups, definitions
    SUMMARISATION = "summarisation"      # condense / shorten text


class ComplexityTier(Enum):
    LOW = 1     # score 0.00 – 0.34
    MEDIUM = 2  # score 0.35 – 0.64
    HIGH = 3    # score 0.65 – 1.00


@dataclass
class TaskProfile:
    raw_text: str
    task_type: TaskType
    complexity: float  # 0.0 → 1.0
    tier: ComplexityTier
    token_estimate: int
    sub_tasks: list["TaskProfile"] = field(default_factory=list)

    def __repr__(self) -> str:
        return (
            f"TaskProfile(type={self.task_type.value!r}, "
            f"complexity={self.complexity}, tier={self.tier.name}, "
            f"tokens≈{self.token_estimate})"
        )


# ── Keyword signal table: (keywords, base_complexity_score) ───────────────────
_SIGNALS: dict[TaskType, tuple[list[str], float]] = {
    TaskType.DOCUMENTATION: (
        ["write the api", "api reference", "write docs", "write documentation",
         "document", "readme", "docstring", "comment",
         "explain", "describe"], 0.20),
    TaskType.CODE_GENERATION: (
        ["implement", "create function", "write code", "build",
         "develop", "middleware", "endpoint", "class", "module"], 0.55),
    TaskType.CODE_REVIEW: (
        ["review", "audit", "security", "vulnerability",
         "diff", "check", "analyse code", "analyze code"], 0.60),
    TaskType.REASONING: (
        ["architecture", "design", "trade-off", "compare",
         "evaluate", "pros and cons", "should i", "best approach"], 0.80),
    TaskType.QA_SIMPLE: (
        ["what is", "how does", "define", "tell me", "when was"], 0.10),
    TaskType.SUMMARISATION: (
        ["summarise", "summarize", "tldr", "shorten",
         "brief", "condense", "overview"], 0.15),
}

# Multi-word phrases that force a specific TaskType regardless of keyword scoring.
_PHRASE_OVERRIDES: list[tuple[str, TaskType]] = [
    ("api reference", TaskType.DOCUMENTATION),
    ("write the api", TaskType.DOCUMENTATION),
    ("write docs", TaskType.DOCUMENTATION),
    ("write documentation", TaskType.DOCUMENTATION),
    ("code review", TaskType.CODE_REVIEW),
    ("security audit", TaskType.CODE_REVIEW),
]


def _token_estimate(text: str) -> int:
    """Rough GPT-style tokenisation: ~4 chars per token."""
    return max(1, len(text) // 4)


def _depth_penalty(text: str) -> float:
    """
    Extra score for multi-step / conditional reasoning markers.
    Each hit adds 0.04, capped at 0.20.
    """
    markers = [
        "step 1", "first,", "then,", "finally,",
        "however,", "alternatively,", "trade-off",
        "on the other hand", "in addition", "furthermore",
    ]
    lower = text.lower()
    hits = sum(1 for m in markers if m in lower)
    return min(0.20, hits * 0.04)


def classify(text: str) -> TaskProfile:
    """Return a TaskProfile for the given text snippet."""
    best_type, base_score = TaskType.QA_SIMPLE, 0.10
    lower = text.lower()

    # Phase 1: phrase-level overrides (highest priority)
    for phrase, forced_type in _PHRASE_OVERRIDES:
        if phrase in lower:
            best_type = forced_type
            base_score = _SIGNALS[forced_type][1]
            break
    else:
        # Phase 2: single-keyword scoring (only if no phrase matched);
        # among all matching types, the one with the highest base score wins
        for ttype, (keywords, score) in _SIGNALS.items():
            if any(k in lower for k in keywords):
                if score > base_score:
                    best_type, base_score = ttype, score

    tokens = _token_estimate(text)
    token_factor = min(0.15, math.log10(max(1, tokens)) * 0.05)
    depth = _depth_penalty(text)
    final_score = min(1.0, base_score + token_factor + depth)
    tier = (ComplexityTier.LOW if final_score < 0.35 else
            ComplexityTier.MEDIUM if final_score < 0.65 else
            ComplexityTier.HIGH)
    return TaskProfile(
        raw_text=text,
        task_type=best_type,
        complexity=round(final_score, 3),
        tier=tier,
        token_estimate=tokens,
    )
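The scoring arithmetic is easier to follow with numbers. The sketch below recomputes the complexity score by hand for the two clauses of the demo prompt, using the base scores from `_SIGNALS` (0.20 for documentation, 0.55 for code generation). The `score` helper is a hypothetical condensation of `classify` written for this walkthrough; it skips keyword matching and takes the base score as an argument.

```python
import math


def score(clause: str, base_score: float, depth: float = 0.0) -> tuple[int, float, str]:
    """Reproduce the classifier's arithmetic for a clause with a known base score."""
    tokens = max(1, len(clause) // 4)                    # ~4 characters per token
    token_factor = min(0.15, math.log10(tokens) * 0.05)  # small bonus for longer prompts
    final = round(min(1.0, base_score + token_factor + depth), 3)
    tier = "LOW" if final < 0.35 else "MEDIUM" if final < 0.65 else "HIGH"
    return tokens, final, tier


doc = score("Write the API reference documentation for our REST /users endpoints", 0.20)
code = score("implement the OAuth2 Bearer-token middleware in Python FastAPI.", 0.55)
print(doc)   # (16, 0.26, 'LOW')
print(code)  # (15, 0.609, 'MEDIUM')
```

Both results match the `TaskProfile` lines printed by the demo run further down: the documentation clause stays in the LOW tier, while the code clause lands just under the 0.65 boundary of the HIGH tier.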
Dynamic Cost Routing Mechanics
The router pairs the classification metadata with a static cost registry. For each sub-task it keeps only the models that clear a per-task quality threshold, then weighs quality against cost (derived from local resource sizes or token-equivalent costs) to select the local Ollama model best suited for the job.
"""
router.py
Core routing logic: selects the cheapest model that meets the quality threshold
for each TaskProfile, using a utility function U = β·Q − α·C.
"""
from __future__ import annotations

from src.cost_registry import MODEL_REGISTRY, ModelSpec
from src.task_classifier import TaskProfile

# ── Tunable parameters ────────────────────────────────────────────────────────
QUALITY_THRESHOLD = 0.72  # default minimum acceptable quality score
COST_WEIGHT = 0.40        # α: penalise cost
QUALITY_WEIGHT = 0.60     # β: reward quality

# Per-task-type quality thresholds.
# Documentation / summaries tolerate lower quality from a cheap model;
# code and reasoning tasks require a higher bar.
TASK_QUALITY_THRESHOLDS: dict[str, float] = {
    "documentation": 0.60,    # light model acceptable for prose
    "summarisation": 0.55,    # even more lenient for summaries
    "qa_simple": 0.55,
    "code_generation": 0.72,
    "code_review": 0.85,      # security: high bar
    "reasoning": 0.80,
}

# Task types that require the heaviest tier (Tier 4) regardless of complexity score.
# code_review always goes to Mistral (security-aware, largest context)
SECURITY_OVERRIDES: set[str] = {"code_review"}


def _utility(model: ModelSpec, profile: TaskProfile) -> float:
    """
    Utility U = β·Q − α·C
    Q = model.quality_score(complexity) ∈ [0, 1]
    C = model.cost_per_1k (normalised) ∈ [0, 1]
    A higher U means the model is preferred.
    """
    q = model.quality_score(profile.complexity)
    c = model.cost_per_1k
    return QUALITY_WEIGHT * q - COST_WEIGHT * c


def route(profile: TaskProfile) -> ModelSpec:
    """
    Return the optimal ModelSpec for the given TaskProfile.
    Selection rules (in order):
    1. Model must support the task type.
    2. Model context window must fit the estimated token budget.
    3. Model quality_score must meet the per-task threshold, defaulting to
       QUALITY_THRESHOLD (unless it's a security override, in which case tier 4 is forced).
    4. Among qualifying candidates, the one with the highest utility wins.
    5. If no candidate passes the threshold, fall back to the highest-quality model.
    """
    ttype = profile.task_type.value

    # Force tier-4 (heaviest) for security-sensitive task types
    if ttype in SECURITY_OVERRIDES:
        heavy = next(m for m in MODEL_REGISTRY if m.tier == 4)
        return heavy

    threshold = TASK_QUALITY_THRESHOLDS.get(ttype, QUALITY_THRESHOLD)
    candidates = [
        m for m in MODEL_REGISTRY
        if ttype in m.capable_types
        and m.max_tokens >= profile.token_estimate
        and m.quality_score(profile.complexity) >= threshold
    ]
    if not candidates:
        # Safety net (rule 5): pick the highest-quality model regardless of cost.
        # Returning here keeps the utility ranking from overriding the fallback.
        return max(
            MODEL_REGISTRY,
            key=lambda m: m.quality_score(profile.complexity),
        )
    return max(candidates, key=lambda m: _utility(m, profile))


def dispatch(profiles: list[TaskProfile]) -> dict[str, tuple[TaskProfile, ModelSpec]]:
    """
    Route each sub-task independently.
    Returns a mapping of task_type → (profile, selected_model).
    When multiple sub-tasks share the same type, a suffix is appended.
    """
    result: dict[str, tuple[TaskProfile, ModelSpec]] = {}
    type_counts: dict[str, int] = {}
    for profile in profiles:
        key = profile.task_type.value
        count = type_counts.get(key, 0)
        type_counts[key] = count + 1
        unique_key = key if count == 0 else f"{key}_{count}"
        result[unique_key] = (profile, route(profile))
    return result
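The router imports `MODEL_REGISTRY` and `ModelSpec` from a `cost_registry` module that is not shown here. To illustrate how the utility function trades quality against cost, the following is a stand-alone sketch with a made-up registry. Every number in it, the `base_quality` field, and the linear quality decay are assumptions for illustration only, not the contents of the real module. The capability and context-window filters are left out to keep the example short.

```python
from dataclasses import dataclass

QUALITY_WEIGHT, COST_WEIGHT = 0.60, 0.40  # β and α, as in router.py


@dataclass(frozen=True)
class ModelSpec:
    name: str
    tier: int
    cost_per_1k: float   # normalised to [0, 1] (hypothetical values below)
    base_quality: float  # hypothetical quality at zero complexity

    def quality_score(self, complexity: float) -> float:
        # Hypothetical decay: lower tiers lose quality faster as complexity rises
        return max(0.0, self.base_quality - complexity * (5 - self.tier) * 0.1)


MODEL_REGISTRY = [
    ModelSpec("model::light", tier=1, cost_per_1k=0.05, base_quality=0.80),
    ModelSpec("model::balanced", tier=3, cost_per_1k=0.35, base_quality=0.92),
    ModelSpec("model::heavy", tier=4, cost_per_1k=0.80, base_quality=0.97),
]


def utility(model: ModelSpec, complexity: float) -> float:
    """U = β·Q − α·C: reward quality, penalise cost."""
    return QUALITY_WEIGHT * model.quality_score(complexity) - COST_WEIGHT * model.cost_per_1k


def pick(complexity: float, threshold: float) -> ModelSpec:
    """Filter by quality threshold, then take the highest-utility survivor."""
    candidates = [m for m in MODEL_REGISTRY if m.quality_score(complexity) >= threshold]
    if not candidates:
        # Fallback: nothing clears the bar, so take the highest-quality model
        return max(MODEL_REGISTRY, key=lambda m: m.quality_score(complexity))
    return max(candidates, key=lambda m: utility(m, complexity))


print(pick(0.260, 0.60).name)  # documentation sub-task -> model::light
print(pick(0.609, 0.72).name)  # code_generation sub-task -> model::balanced
```

Under these assumed numbers the light model wins the documentation task because all three models clear the 0.60 bar and it is by far the cheapest, while the code task filters the light model out and the cost penalty then favours the balanced model over the heavy one.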
Concrete Cost-Efficiency Proof
By decoupling tasks, the application changes the cost equation for local and hybrid enterprise architectures: each sub-task pays only for the model tier it actually needs.
Consider a compound prompt asking to: “Generate a secure Python CRUD interface and write 3 paragraphs of Markdown documentation for it.”
| Action Strategy | Traditional Monolithic Approach | Orchestrator Pipeline Approach | Financial / Resource Impact |
| -------------------------- | ------------------------------- | --------------------------------------------------- | ------------------------------------ |
| **Code Generation Step**   | Runs on `mistral-small3.2:latest` (15 GB) | Runs on `granite3.3:8b` (4.9 GB) or `mistral-small3.2:latest` | **Targeted Resource Allocation**     |
| **Documentation Step**     | Runs on `mistral-small3.2:latest` (15 GB) | Runs on `llama3.2:latest` (2.0 GB)                  | **~87% smaller model (2.0 GB vs 15 GB)** |
| **Guardrail/Safety Check** | Ignored or rerun on flagship    | Runs on `ibm/granite3.2-guardian:3b` (2.7 GB)       | **Dedicated, low-latency alignment** |
The Token-Size Formula
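When processing large workloads, the total resource overhead or token expense can be written as a sum over sub-tasks and compared with the monolithic case:

$$\text{Cost}_{\text{routed}} = \sum_{i} \text{Tokens}_{i} \times \text{Rate}_{i} \qquad \text{versus} \qquad \text{Cost}_{\text{monolithic}} = \text{Tokens}_{\text{Total}} \times \text{Rate}_{\text{Max}}$$

Here $\text{Tokens}_{i}$ and $\text{Rate}_{i}$ are the token count and the rate of the model chosen for sub-task $i$. A traditional monolithic setup applies the maximum rate ($\text{Rate}_{\text{Max}}$) to all tokens ($\text{Tokens}_{\text{Total}}$). The routed application lowers the total cost footprint by ensuring that only high-complexity sub-tasks scale up to premium tiers, while documentation and utility steps run at significantly reduced rates. In this walkthrough nothing is billed, because every model runs locally through Ollama, so the rates are best read as token-equivalent compute costs.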
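To make the comparison concrete, here is a small sketch that computes a per-sub-task sum of tokens × rate and sets it against total tokens × a single maximum rate. The token counts (48 and 45) come from the demo run shown further down. The two per-1k rates (0.05 and 0.35) are back-derived from that run's Cost column under the assumption that cost = tokens × rate / 1000, and `RATE_MAX` is a made-up value for a heavy tier.

```python
def routed_cost(subtasks: list[tuple[int, float]]) -> float:
    """Sum of tokens_i * rate_i, with rates expressed per 1,000 tokens."""
    return sum(tokens * rate / 1000 for tokens, rate in subtasks)


def monolithic_cost(subtasks: list[tuple[int, float]], rate_max: float) -> float:
    """All tokens billed at the single maximum rate."""
    total_tokens = sum(tokens for tokens, _ in subtasks)
    return total_tokens * rate_max / 1000


# (tokens, rate per 1k tokens) for the documentation and code sub-tasks
subtasks = [(48, 0.05), (45, 0.35)]
RATE_MAX = 0.80  # hypothetical heavy-tier rate, not taken from the article

routed = routed_cost(subtasks)
mono = monolithic_cost(subtasks, RATE_MAX)
print(f"routed={routed:.5f} monolithic={mono:.5f} saving={1 - routed / mono:.0%}")
```

The routed total works out to 0.01815, the same figure the demo reports as `total_cost`; the size of the saving depends entirely on the assumed `RATE_MAX`.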
To test the application, run the `demo.sh` script:
./multi-llm-router/scripts/demo.sh
Running the documentation-vs-code routing scenario…
(Set DRY_RUN=1 to skip LLM calls and inspect routing only)
================================================================================
Multi-LLM Task Router — Demonstration Scenario (4 providers)
================================================================================
Prompt:
Write the API reference documentation for our REST /users endpoints AND implement the OAuth2 Bearer-token middleware in Python FastAPI.
Model tier mapping:
model::light IBM Granite → granite4.1:3b
model::medium Meta LLaMA → llama3.2:latest
model::balanced Google Gemma → gemma3:4b
model::heavy Mistral AI → mistral-small3.2:latest
Detected 2 sub-task(s):
[1] TaskProfile(type='documentation', complexity=0.26, tier=LOW, tokens≈16)
[2] TaskProfile(type='code_generation', complexity=0.609, tier=MEDIUM, tokens≈15)
Routing decisions:
Task Type Complexity Tier Label Provider U
------------------------------------------------------------------------------
documentation 0.260 LOW model::light IBM Granite +0.3616
code_generation 0.609 MEDIUM model::balanced Google Gemma +0.3718
Executing sub-tasks via Ollama… (set DRY_RUN=1 to skip LLM calls)
================================================================================
ROUTING TABLE
================================================================================
Task Type Label Provider Latency Tokens Cost
------------------------------------------------------------------------------
documentation model::light IBM Granite 2839 ms 48 0.002400
code_generation model::balanced Google Gemma 15279 ms 45 0.015750
Total wall-clock: 15327 ms
======================================================================
MERGED RESPONSE
======================================================================
## Documentation _(via model::light, 2839 ms)_
# REST API Reference: `/users` Endpoints
Our RESTful API provides a set of endpoints under the `/users` resource to manage user-related operations. Below is a detailed description of each endpoint, including supported HTTP methods, request/response
---
## Code Generation _(via model::balanced, 15279 ms)_
`python
from fastapi import Depends, HTTPException, status
from fastapi.security import CredentialsTokenAuthentication
import jwt
# Replace with your secret key (keep this secure!)
JWT_SECRET = "your-secret-
======================================================================
FEEDBACK SUMMARY
======================================================================
{
"total_calls": 2,
"total_cost": 0.01815,
"avg_quality": 0.665,
"calls_by_tier": {
"1": 1,
"3": 1
}
}
Conclusion
By treating LLMs as modular, specialized workers rather than monolithic black boxes, this simple orchestrator shows that a high-performing AI strategy can also be a cost-effective one. Through automated splitting, T-shirt sizing, and registry-driven routing, you can systematically lower your token footprint while keeping each sub-task above the quality threshold you set for it.
🎯 Et voilà, that’s a wrap 💯 for part 1 and thanks for reading! Stay tuned for the next episode 📺
Links
- Code repository for this post: https://github.com/aairom/multillm-taskrouting/tree/main/multi-llm-router
- IBM Bob: https://bob.ibm.com/



