Alain Airom (Ayrom)

Posted on Jul 1

The ROI of SDLC: How Bob Orchestration of LLMs Makes Financial Sense (Part 1)

#bob #sdlc #llmrouting #granite

Part 1: A theoretical implementation of LLM selection and orchestration

Introduction

One of the greatest strengths of Bob Orchestrator lies in its intelligent multi-LLM routing logic. Instead of relying on a single, costly model for everything, Bob dynamically selects the best tool for the specific task at hand. Need comprehensive documentation? Bob routes it to a model optimized for text. Need complex code generation? Bob leverages the exact LLM architecture best suited for that language or framework. By matching the task to the right LLM “T-Shirt size,” Bob automatically optimizes your cost-per-token, delivering maximum performance at the lowest possible price point.

In this opening installment of the series, I demonstrate how an orchestration scenario can scaffold a sample application that implements this multi-LLM routing logic.

To keep this walkthrough straightforward, reproducible, and cost-effective, I am running the orchestration entirely locally via Ollama. Below is the specific roster of local models — ranging from lightweight utility models to heavy-duty reasoning engines — that we will use to test an intelligent application’s dynamic routing decisions:

| Model Name                     | Image ID       | Size   |
| ------------------------------ | -------------- | ------ |
| **mistral-small3.2:latest**    | `5a408ab55df5` | 15 GB  |
| **ibm/granite3.3-guardian:8b** | `90a8aabc98eb` | 6.7 GB |
| **granite3.3:8b**              | `fd429f23b909` | 4.9 GB |
| **gemma3:4b**                  | `a2af6cc3eb7f` | 3.3 GB |
| **ibm/granite3.2-guardian:3b** | `4f7cdff62b7e` | 2.7 GB |
| **granite3-guardian:latest**   | `ba81a177bd23` | 2.7 GB |
| **granite4.1:3b**              | `6fd349357287` | 2.1 GB |
| **ibm/granite4:3b**            | `89962fcc7523` | 2.1 GB |
| **llama3.2:latest**            | `a80c4f17acd5` | 2.0 GB |
| **ibm/granite4:350m**          | `5eee845b49c4` | 708 MB |
| **granite-embedding:30m**      | `eb4c533ba6f7` | 62 MB  |

Architecture Overview: The Intelligent Cost-Routing Pipeline

> 🚨 Disclaimer: This article is for educational purposes only. It provides a conceptual, hands-on look at how LLM routing logic operates and should not be mistaken for the official enterprise implementation or architecture of the IBM Bob product.

At the heart of the orchestrator’s cost-efficiency engine is a modular pipeline designed to intercept user intent, break down complex requirements, and match each sub-task to the most cost-effective LLM tier. Instead of blindly passing a massive multi-part prompt to a premium, high-cost reasoning model, the orchestrator acts as a smart traffic controller.

                  +------------------------+
                  |  User Complex Request  |
                  +-----------+------------+
                              |
                              v
                  +------------------------+
                  |    Task Splitter       |
                  +-----------+------------+
                              |
                    [Splits into Sub-Tasks]
                              |
                              v
                  +------------------------+
                  |    Task Classifier     |
                  +-----------+------------+
                              |
                    [Identifies Domain & Size]
                              |
                              v
                  +------------------------+
                  |   Dynamic LLM Router   |
                  +-----------+------------+
                              |
            [Queries Cost Registry & Selects Best Fit]
                              |
         +--------------------+--------------------+
         |                    |                    |
         v                    v                    v
+-----------------+  +-----------------+  +-----------------+
| Mistral (Small) |  | Granite (Guard) |  |  Llama (Utility)|
+-----------------+  +-----------------+  +-----------------+

The pipeline operates in four distinct stages:

1. Task Splitting: Parsing complex, compound requests into a sequence of isolated, clean sub-tasks.
2. Intent & Domain Classification: Scoring each sub-task to determine its domain (e.g., code, documentation, text transformation) and complexity requirement (“T-Shirt sizing”).
3. Registry-Driven Routing: Cross-referencing the classification payload against a centralized model tier and cost ledger to pick the best-fit local Ollama target.
4. Isolated Execution & Reassembly: Invoking the targeted models and stitching the individual outputs back together into one response.

Technical Deep Dive & Core Implementations

Granular Task Decomposition

To avoid processing unnecessary tokens through high-tier engines, the architecture relies on a specialized TaskSplitter. This component evaluates the raw user payload and breaks it down into discrete components so that documentation and code can travel along separate routing paths.

"""
splitter.py
Splits a compound prompt into independent clauses, then classifies each one.
"""

from __future__ import annotations
import re
from src.task_classifier import classify, TaskProfile


# Conjunctions that signal independent sub-tasks within a single prompt
_SPLIT_PATTERN = re.compile(
    r"(?i)\s*\b(and also|and then|additionally|plus|as well as|AND|also)\b\s*",
)


def split_and_classify(prompt: str) -> list[TaskProfile]:
    """
    1. Split the prompt on conjunction markers.
    2. Discard separator tokens.
    3. Classify each remaining clause.
    """
    raw_clauses = _SPLIT_PATTERN.split(prompt)

    # Keep only non-separator, non-empty parts
    clauses = [
        c.strip()
        for c in raw_clauses
        if c and not _SPLIT_PATTERN.match(c.strip()) and len(c.strip()) > 8
    ]

    if not clauses:
        # Fallback: treat the whole prompt as one task
        clauses = [prompt.strip()]

    return [classify(c) for c in clauses]
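
To see what the splitting step does on its own, here is a small standalone sketch. It copies the pattern from the listing above and reproduces only the filtering logic; `split_clauses` is a hypothetical helper name, and it omits both the classification call and the whole-prompt fallback.

```python
import re

# Same pattern as in splitter.py above, copied so this snippet runs on its own.
SPLIT_PATTERN = re.compile(
    r"(?i)\s*\b(and also|and then|additionally|plus|as well as|AND|also)\b\s*",
)


def split_clauses(prompt: str) -> list[str]:
    """Hypothetical helper: the filtering step of split_and_classify only."""
    # re.split keeps the captured separators in the result, so they are
    # filtered out again here, together with fragments of 8 chars or fewer.
    parts = SPLIT_PATTERN.split(prompt)
    return [
        p.strip()
        for p in parts
        if p and not SPLIT_PATTERN.match(p.strip()) and len(p.strip()) > 8
    ]


demo = (
    "Write the API reference documentation for our REST /users endpoints "
    "AND implement the OAuth2 Bearer-token middleware in Python FastAPI."
)
print(split_clauses(demo))

# Because of the (?i) flag, the "AND" alternative also matches a lowercase
# "and" inside an ordinary sentence, and the short tail "gRPC" is dropped.
print(split_clauses("Compare the pros and cons of REST and gRPC"))
```

The second call illustrates a trade-off of this keyword-based design: a single comparison question is cut into two fragments, so prompts that use "and" inside one task may need a stricter pattern.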

T-Shirt Sizing and Feature Classification

Once separated, each distinct sub-task passes through the TaskClassifier. The classifier looks for domain markers (e.g., security policies, coding constraints, or structural documentation formatting) and assigns a complexity tier: LOW, MEDIUM, or HIGH.

"""
task_classifier.py
Classifies a text prompt into a TaskProfile: type, complexity score, and token estimate.
"""

from __future__ import annotations
import math
import re
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    DOCUMENTATION   = "documentation"    # READMEs, API docs, inline comments
    CODE_GENERATION = "code_generation"  # new functions / modules / classes
    CODE_REVIEW     = "code_review"      # diff analysis, security audit
    REASONING       = "reasoning"        # architecture, design decisions
    QA_SIMPLE       = "qa_simple"        # factual lookups, definitions
    SUMMARISATION   = "summarisation"    # condense / shorten text


class ComplexityTier(Enum):
    LOW    = 1   # score 0.00 – 0.34
    MEDIUM = 2   # score 0.35 – 0.64
    HIGH   = 3   # score 0.65 – 1.00


@dataclass
class TaskProfile:
    raw_text:       str
    task_type:      TaskType
    complexity:     float          # 0.0 → 1.0
    tier:           ComplexityTier
    token_estimate: int
    sub_tasks:      list["TaskProfile"] = field(default_factory=list)

    def __repr__(self) -> str:
        return (
            f"TaskProfile(type={self.task_type.value!r}, "
            f"complexity={self.complexity}, tier={self.tier.name}, "
            f"tokens≈{self.token_estimate})"
        )


# ── Keyword signal table: (keywords, base_complexity_score) ───────────────────
_SIGNALS: dict[TaskType, tuple[list[str], float]] = {
    TaskType.DOCUMENTATION:   (
        ["write the api", "api reference", "write docs", "write documentation",
         "document", "readme", "docstring", "comment",
         "explain", "describe"], 0.20),
    TaskType.CODE_GENERATION: (
        ["implement", "create function", "write code", "build",
         "develop", "middleware", "endpoint", "class", "module"], 0.55),
    TaskType.CODE_REVIEW:     (
        ["review", "audit", "security", "vulnerability",
         "diff", "check", "analyse code", "analyze code"], 0.60),
    TaskType.REASONING:       (
        ["architecture", "design", "trade-off", "compare",
         "evaluate", "pros and cons", "should i", "best approach"], 0.80),
    TaskType.QA_SIMPLE:       (
        ["what is", "how does", "define", "tell me", "when was"], 0.10),
    TaskType.SUMMARISATION:   (
        ["summarise", "summarize", "tldr", "shorten",
         "brief", "condense", "overview"], 0.15),
}

# Multi-word phrases that force a specific TaskType regardless of keyword scoring.
_PHRASE_OVERRIDES: list[tuple[str, TaskType]] = [
    ("api reference",       TaskType.DOCUMENTATION),
    ("write the api",       TaskType.DOCUMENTATION),
    ("write docs",          TaskType.DOCUMENTATION),
    ("write documentation", TaskType.DOCUMENTATION),
    ("code review",         TaskType.CODE_REVIEW),
    ("security audit",      TaskType.CODE_REVIEW),
]


def _token_estimate(text: str) -> int:
    """Rough GPT-style tokenisation: ~4 chars per token."""
    return max(1, len(text) // 4)


def _depth_penalty(text: str) -> float:
    """
    Extra score for multi-step / conditional reasoning markers.
    Each hit adds 0.04, capped at 0.20.
    """
    markers = [
        "step 1", "first,", "then,", "finally,",
        "however,", "alternatively,", "trade-off",
        "on the other hand", "in addition", "furthermore",
    ]
    lower = text.lower()
    hits = sum(1 for m in markers if m in lower)
    return min(0.20, hits * 0.04)


def classify(text: str) -> TaskProfile:
    """Return a TaskProfile for the given text snippet."""
    best_type, base_score = TaskType.QA_SIMPLE, 0.10
    lower = text.lower()

    # Phase 1: phrase-level overrides (highest priority)
    for phrase, forced_type in _PHRASE_OVERRIDES:
        if phrase in lower:
            best_type = forced_type
            base_score = _SIGNALS[forced_type][1]
            break
    else:
        # Phase 2: single-keyword scoring (only if no phrase matched)
        for ttype, (keywords, score) in _SIGNALS.items():
            if any(k in lower for k in keywords):
                if score > base_score:
                    best_type, base_score = ttype, score

    tokens       = _token_estimate(text)
    token_factor = min(0.15, math.log10(max(1, tokens)) * 0.05)
    depth        = _depth_penalty(text)
    final_score  = min(1.0, base_score + token_factor + depth)

    tier = (ComplexityTier.LOW    if final_score < 0.35 else
            ComplexityTier.MEDIUM if final_score < 0.65 else
            ComplexityTier.HIGH)

    return TaskProfile(
        raw_text=text,
        task_type=best_type,
        complexity=round(final_score, 3),
        tier=tier,
        token_estimate=tokens,
    )
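
The scoring arithmetic can be checked by hand. The sketch below recomputes the final score from its three ingredients (base score, token estimate, depth-marker hits) using the constants from the listing above; `final_score` and `tier_for` are hypothetical helper names. The inputs are the base scores for "documentation" (0.20) and "code_generation" (0.55) and the token estimates of 16 and 15 reported in the demo run later in this article.

```python
import math


def final_score(base_score: float, tokens: int, depth_hits: int = 0) -> float:
    """Recompute the classifier's score from its three ingredients."""
    token_factor = min(0.15, math.log10(max(1, tokens)) * 0.05)  # length bonus, capped
    depth = min(0.20, depth_hits * 0.04)                         # multi-step markers, capped
    return round(min(1.0, base_score + token_factor + depth), 3)


def tier_for(score: float) -> str:
    """Map a score onto the LOW / MEDIUM / HIGH bands used by ComplexityTier."""
    return "LOW" if score < 0.35 else "MEDIUM" if score < 0.65 else "HIGH"


doc_score = final_score(0.20, 16)    # documentation clause
code_score = final_score(0.55, 15)   # code-generation clause

print(doc_score, tier_for(doc_score))
print(code_score, tier_for(code_score))
```

These values line up with the demo output (complexity 0.26, tier LOW, and complexity 0.609, tier MEDIUM). For short prompts the keyword-driven base score dominates, while the length bonus adds only about 0.06.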

Dynamic Cost Routing Mechanics

The LLMRouter pairs the classification metadata with a static CostRegistry. By weighing expected quality against cost thresholds based on local resource sizes or token-equivalent costs, it selects the local Ollama model best suited for the job.

"""
router.py
Core routing logic: selects the cheapest model that meets the quality threshold
for each TaskProfile, using a utility function U = β·Q − α·C.
"""

from __future__ import annotations
from src.cost_registry import MODEL_REGISTRY, ModelSpec
from src.task_classifier import TaskProfile

# ── Tunable parameters ────────────────────────────────────────────────────────
QUALITY_THRESHOLD = 0.72   # default minimum acceptable quality score
COST_WEIGHT       = 0.40   # α — penalise cost
QUALITY_WEIGHT    = 0.60   # β — reward quality

# Per-task-type quality thresholds.
# Documentation / summaries tolerate lower quality from a cheap model;
# code and reasoning tasks require a higher bar.
TASK_QUALITY_THRESHOLDS: dict[str, float] = {
    "documentation":   0.60,   # light model acceptable for prose
    "summarisation":   0.55,   # even more lenient for summaries
    "qa_simple":       0.55,
    "code_generation": 0.72,
    "code_review":     0.85,   # security: high bar
    "reasoning":       0.80,
}

# Task types that require the heaviest tier (Tier 4) regardless of complexity score.
# code_review always goes to Mistral (security-aware, largest context)
SECURITY_OVERRIDES: set[str] = {"code_review"}


def _utility(model: ModelSpec, profile: TaskProfile) -> float:
    """
    Utility  U = β·Q − α·C
      Q = model.quality_score(complexity)   ∈ [0, 1]
      C = model.cost_per_1k (normalised)    ∈ [0, 1]
    A higher U means the model is preferred.
    """
    q = model.quality_score(profile.complexity)
    c = model.cost_per_1k
    return QUALITY_WEIGHT * q - COST_WEIGHT * c


def route(profile: TaskProfile) -> ModelSpec:
    """
    Return the optimal ModelSpec for the given TaskProfile.

    Selection rules (in order):
      1. Model must support the task type.
      2. Model context window must fit the estimated token budget.
      3. Model quality_score must meet the per-task quality threshold
         (security overrides skip this check and force tier 4).
      4. Among qualifying candidates, the one with the highest utility wins.
      5. If no candidate passes the threshold, fall back to the highest-quality model.
    """
    ttype = profile.task_type.value

    # Force tier-4 (heaviest) for security-sensitive task types
    if ttype in SECURITY_OVERRIDES:
        heavy = next(m for m in MODEL_REGISTRY if m.tier == 4)
        return heavy

    threshold = TASK_QUALITY_THRESHOLDS.get(ttype, QUALITY_THRESHOLD)

    candidates = [
        m for m in MODEL_REGISTRY
        if ttype in m.capable_types
        and m.max_tokens >= profile.token_estimate
        and m.quality_score(profile.complexity) >= threshold
    ]

    if not candidates:
        # Safety net: pick the highest-quality model regardless of cost
        return max(
            MODEL_REGISTRY,
            key=lambda m: m.quality_score(profile.complexity),
        )

    return max(candidates, key=lambda m: _utility(m, profile))


def dispatch(profiles: list[TaskProfile]) -> dict[str, tuple[TaskProfile, ModelSpec]]:
    """
    Route each sub-task independently.

    Returns a mapping of task_type → (profile, selected_model).
    When multiple sub-tasks share the same type, a suffix is appended.
    """
    result: dict[str, tuple[TaskProfile, ModelSpec]] = {}
    type_counts: dict[str, int] = {}

    for profile in profiles:
        key = profile.task_type.value
        count = type_counts.get(key, 0)
        type_counts[key] = count + 1
        unique_key = key if count == 0 else f"{key}_{count}"
        result[unique_key] = (profile, route(profile))

    return result
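
The `cost_registry` module is not shown in this article, so the following is only a toy sketch of how the utility function trades quality against cost. `ToyModel`, its fields, and every number in `REGISTRY` are invented for illustration and are not the values used by the real registry; only the weights 0.60 and 0.40 and the thresholds 0.60 and 0.72 come from the router listing above.

```python
from dataclasses import dataclass

QUALITY_WEIGHT = 0.60   # beta, as in router.py
COST_WEIGHT = 0.40      # alpha, as in router.py


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for ModelSpec; all values below are invented."""
    name: str
    cost_per_1k: float    # assumed to be normalised to [0, 1]
    base_quality: float   # assumed quality on a trivial task
    falloff: float        # assumed quality lost per unit of complexity

    def quality_score(self, complexity: float) -> float:
        return max(0.0, self.base_quality - self.falloff * complexity)


REGISTRY = [
    ToyModel("light", 0.05, 0.80, 0.30),
    ToyModel("balanced", 0.35, 0.92, 0.10),
    ToyModel("heavy", 0.90, 0.97, 0.05),
]


def utility(model: ToyModel, complexity: float) -> float:
    """U = beta * Q - alpha * C."""
    return QUALITY_WEIGHT * model.quality_score(complexity) - COST_WEIGHT * model.cost_per_1k


def pick(complexity: float, threshold: float) -> ToyModel:
    """Keep models that clear the threshold, then take the highest utility."""
    candidates = [m for m in REGISTRY if m.quality_score(complexity) >= threshold]
    if not candidates:
        # Nothing qualifies: fall back to the highest-quality model
        return max(REGISTRY, key=lambda m: m.quality_score(complexity))
    return max(candidates, key=lambda m: utility(m, complexity))


print(pick(0.260, 0.60).name)   # documentation-like task
print(pick(0.609, 0.72).name)   # code-generation-like task
```

With these invented numbers, the cheap model wins the low-complexity task because its cost advantage outweighs the small quality gap. For the harder task it falls below the 0.72 threshold and is filtered out before utility is even compared.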

Concrete Cost-Efficiency Proof

By decoupling tasks, the application changes the cost equation for local and hybrid enterprise architectures: each sub-task pays only for the model tier it actually needs.

Consider a compound prompt asking to: “Generate a secure Python CRUD interface and write 3 paragraphs of Markdown documentation for it.”

| Action Strategy            | Traditional Monolithic Approach           | Orchestrator Pipeline Approach                                | Financial / Resource Impact                           |
| -------------------------- | ----------------------------------------- | ------------------------------------------------------------- | ----------------------------------------------------- |
| **Code Generation Step**   | Runs on `mistral-small3.2:latest` (15 GB) | Runs on `granite3.3:8b` (4.9 GB) or `mistral-small3.2:latest` | **Targeted Resource Allocation**                      |
| **Documentation Step**     | Runs on `mistral-small3.2:latest` (15 GB) | Runs on `llama3.2:latest` (2.0 GB)                            | **~87% smaller model footprint (2.0 GB vs 15 GB)**    |
| **Guardrail/Safety Check** | Ignored or rerun on flagship              | Runs on `ibm/granite3.2-guardian:3b` (2.7 GB)                 | **Dedicated, low-latency alignment**                  |

The Token-Size Formula

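When processing large workloads, the total resource overhead or token expense can be represented by the formula:

$$
\text{Cost}_{\text{total}} = \sum_{i=1}^{n} \text{Tokens}_i \times \text{Rate}_i \;\le\; \text{Tokens}_{\text{Total}} \times \text{Rate}_{\text{Max}}
$$

Here $\text{Tokens}_i$ is the token count of sub-task $i$ and $\text{Rate}_i$ is the rate of the model it is routed to. A traditional monolithic setup applies the maximum rate ($\text{Rate}_{\text{Max}}$) to all tokens ($\text{Tokens}_{\text{Total}}$). The application lowers the total cost footprint by ensuring that only high-complexity sub-tasks scale up to premium tiers, while documentation and utility steps run at significantly reduced rates. In this walkthrough everything runs locally on Ollama, so the monetary cost is zero and the rates are nominal.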
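
As a worked example of this cost comparison, the sketch below plugs in the figures from the demo run later in this article: 48 tokens costing 0.002400 and 45 tokens costing 0.015750. Assuming cost is simply tokens times a per-1,000-token rate, these imply nominal rates of 0.05 and 0.35. That relationship is an inference from the printed numbers, not something the article states, and the function names are hypothetical. The heavy tier's rate is not shown, so the monolithic case uses the highest rate that is visible (0.35).

```python
def routed_cost(subtasks: list[tuple[int, float]]) -> float:
    """Sum of tokens x rate over sub-tasks; rates are per 1,000 tokens."""
    return sum(tokens * rate_per_1k / 1000 for tokens, rate_per_1k in subtasks)


def monolithic_cost(subtasks: list[tuple[int, float]], rate_max_per_1k: float) -> float:
    """Every token is billed at the single maximum rate."""
    total_tokens = sum(tokens for tokens, _ in subtasks)
    return total_tokens * rate_max_per_1k / 1000


# (tokens, nominal rate per 1k tokens), inferred from the demo's routing table
subtasks = [(48, 0.05), (45, 0.35)]

routed = routed_cost(subtasks)
mono = monolithic_cost(subtasks, rate_max_per_1k=0.35)

print(f"routed:     {routed:.5f}")
print(f"monolithic: {mono:.5f}")
print(f"saving:     {1 - routed / mono:.0%}")
```

The routed total reproduces the `total_cost` of 0.01815 reported in the demo's feedback summary. Against a genuinely heavier model the gap would be wider, but that rate is not given here.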

To test the application, run the “demo.sh” script:

./multi-llm-router/scripts/demo.sh

Running the documentation-vs-code routing scenario…
(Set DRY_RUN=1 to skip LLM calls and inspect routing only)

================================================================================
  Multi-LLM Task Router  —  Demonstration Scenario  (4 providers)
================================================================================

Prompt:
  Write the API reference documentation for our REST /users endpoints AND implement the OAuth2 Bearer-token middleware in Python FastAPI.

Model tier mapping:
  model::light          IBM Granite     →  granite4.1:3b
  model::medium         Meta LLaMA      →  llama3.2:latest
  model::balanced       Google Gemma    →  gemma3:4b
  model::heavy          Mistral AI      →  mistral-small3.2:latest

Detected 2 sub-task(s):
  [1] TaskProfile(type='documentation', complexity=0.26, tier=LOW, tokens≈16)
  [2] TaskProfile(type='code_generation', complexity=0.609, tier=MEDIUM, tokens≈15)

Routing decisions:
  Task Type              Complexity  Tier       Label                Provider              U
  ------------------------------------------------------------------------------
  documentation               0.260  LOW        model::light         IBM Granite     +0.3616
  code_generation             0.609  MEDIUM     model::balanced      Google Gemma    +0.3718

Executing sub-tasks via Ollama…  (set DRY_RUN=1 to skip LLM calls)


================================================================================
  ROUTING TABLE
================================================================================
  Task Type              Label                Provider         Latency   Tokens        Cost
  ------------------------------------------------------------------------------
  documentation          model::light         IBM Granite       2839 ms       48    0.002400
  code_generation        model::balanced      Google Gemma     15279 ms       45    0.015750

  Total wall-clock: 15327 ms

======================================================================
  MERGED RESPONSE
======================================================================
## Documentation  _(via model::light, 2839 ms)_

# REST API Reference: `/users` Endpoints

Our RESTful API provides a set of endpoints under the `/users` resource to manage user-related operations. Below is a detailed description of each endpoint, including supported HTTP methods, request/response

---

## Code Generation  _(via model::balanced, 15279 ms)_

`python
from fastapi import Depends, HTTPException, status
from fastapi.security import CredentialsTokenAuthentication
import jwt

# Replace with your secret key (keep this secure!)
JWT_SECRET = "your-secret-

======================================================================
  FEEDBACK SUMMARY
======================================================================
{
  "total_calls": 2,
  "total_cost": 0.01815,
  "avg_quality": 0.665,
  "calls_by_tier": {
    "1": 1,
    "3": 1
  }
}
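
The reassembly stage is not listed in this article, but the merged response above suggests a simple shape for it. The following is a minimal sketch under that assumption: `SubTaskResult` and `merge_responses` are hypothetical names, and the layout simply mirrors the section headers and `---` separators seen in the demo output.

```python
from dataclasses import dataclass


@dataclass
class SubTaskResult:
    """Hypothetical container for one executed sub-task."""
    task_type: str     # e.g. "code_generation"
    label: str         # e.g. "model::balanced"
    latency_ms: int
    text: str


def merge_responses(results: list[SubTaskResult]) -> str:
    """Join per-model outputs into one Markdown document, in task order."""
    sections = []
    for r in results:
        # "code_generation" -> "Code Generation"
        title = r.task_type.replace("_", " ").title()
        header = f"## {title}  _(via {r.label}, {r.latency_ms} ms)_"
        sections.append(f"{header}\n\n{r.text.strip()}")
    return "\n\n---\n\n".join(sections)


merged = merge_responses([
    SubTaskResult("documentation", "model::light", 2839, "# REST API Reference\n"),
    SubTaskResult("code_generation", "model::balanced", 15279, "from fastapi import Depends"),
])
print(merged)
```

Keeping the model label and latency in each section header makes it easy to trace which tier produced which part of the answer.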

Conclusion

By treating LLMs as modular, specialized workers rather than monolithic black boxes, this simple orchestrator demonstrates that a high-performing AI strategy can also be a cost-effective one. Through automated splitting, T-shirt sizing, and registry-driven routing, you can systematically lower your token footprint while still holding each sub-task to the quality threshold its task type requires.

🎯 Et voilà, that’s a wrap 💯 for part 1 and thanks for reading! Stay tuned for the next episode 📺
