DEV Community: Marc Newstead

Why Your RAG Agent Can't Connect the Dots (And How to Fix It)

Marc Newstead — Mon, 13 Jul 2026 09:13:29 +0000

The Problem You've Probably Hit

You've built a RAG agent. It answers questions from your docs brilliantly... until someone asks something that requires connecting information across multiple sources. Then it falls apart.

"Who worked on projects related to the component that failed in production last week?"

Your vector-based agent returns documents about the failure, documents about team members, and documents about projects. But it can't connect them. That's multi-hop reasoning, and it's where vector embeddings hit their ceiling.

What Multi-Hop Actually Looks Like in Code

Let's be concrete. Single-hop reasoning is straightforward:

# Single hop: "What does the auth service do?"
query_embedding = embed("auth service functionality")
results = vector_db.similarity_search(query_embedding, k=5)
# Returns relevant docs about auth service ✓

Multi-hop reasoning chains multiple steps:

# Multi-hop: "Which engineer should fix the auth service bug?"
# Step 1: What is the auth service?
# Step 2: Who maintains it?
# Step 3: Who's currently available?
# Step 4: Who has fixed similar bugs before?

With vectors alone, you're either:

Hoping all that context lives in one chunk (unlikely)
Re-querying multiple times and losing the thread
Jamming everything into the LLM context window and burning tokens

None of these scale.

Why We Defaulted to Vectors

Vectors were the pragmatic choice in 2023. The tooling was mature, implementation was straightforward, and for 80% of use cases—document retrieval, FAQ matching, semantic search—they worked brilliantly.

Pinecone, Weaviate, Chroma: all excellent tools. The problem isn't the technology; it's the architectural assumption that every knowledge retrieval problem is a similarity search problem.

It isn't.

What Graphs Do Differently

Graph databases store knowledge as entities and relationships:

// Neo4j example
(Alice:Engineer)-[:MAINTAINS]->(AuthService:Component)
(AuthService)-[:DEPENDS_ON]->(UserDB:Database)
(UserDB)-[:HOSTED_ON]->(ProdServer:Infrastructure)
(Bob:Engineer)-[:ON_CALL_FOR]->(ProdServer)

Now that multi-hop query becomes traversable:

MATCH (bug:Issue {component: "AuthService"})
      -[:AFFECTS]->(service:Component)
      <-[:MAINTAINS]-(engineer:Engineer)
      -[:FIXED]->(similar:Issue)
WHERE similar.type = bug.type
RETURN engineer.name, COUNT(similar) as experience
ORDER BY experience DESC

The graph encodes the connections. You're not asking an LLM to infer relationships from unstructured text—you're traversing explicit edges.
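To make "traversing explicit edges" concrete without a running Neo4j instance, here is a minimal in-memory sketch in Python. The edge list mirrors the example entities above; the `who_to_page` helper and the two index dictionaries are hypothetical names invented for illustration, not part of any graph library.

```python
from collections import defaultdict

# Explicit edges, mirroring the Neo4j example: (source, relationship, target).
EDGES = [
    ("Alice", "MAINTAINS", "AuthService"),
    ("AuthService", "DEPENDS_ON", "UserDB"),
    ("UserDB", "HOSTED_ON", "ProdServer"),
    ("Bob", "ON_CALL_FOR", "ProdServer"),
]

# Index edges in both directions so they can be walked either way.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for src, rel, dst in EDGES:
    outgoing[(src, rel)].append(dst)
    incoming[(dst, rel)].append(src)


def who_to_page(component: str) -> list[str]:
    """Three hops: component -> database -> server -> on-call engineer."""
    engineers = []
    for db in outgoing[(component, "DEPENDS_ON")]:
        for server in outgoing[(db, "HOSTED_ON")]:
            engineers.extend(incoming[(server, "ON_CALL_FOR")])
    return engineers


print(who_to_page("AuthService"))
```

No similarity search is involved here: each hop is a dictionary lookup over an edge that was stored explicitly.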

The Practical Hybrid Architecture

Here's what's actually working in production:

Use vectors for:

Initial document retrieval
Semantic similarity matching
Unstructured content search

Use graphs for:

Entity relationships
Multi-hop queries
Traversing connected data

Implementation pattern:

Ingest: Extract entities and relationships from your documents (using NER, LLMs, or structured parsers)
Store: Documents as vectors, entities and relationships in a graph
Query: Use vectors to find candidate documents, use graphs to find connected context
Combine: Pass both to your LLM as enriched context
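The four steps above can be sketched roughly as follows. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and `DOCS`, `DOC_ENTITIES`, `GRAPH` and `hybrid_retrieve` are hypothetical names with made-up sample data, not the API of any vector or graph database.

```python
import math
from collections import Counter

DOCS = {
    "incident-42": "auth service outage caused by expired certificate",
    "runbook-7": "how to rotate certificates for the auth service",
    "menu": "canteen lunch menu for this week",
}

# Entities mentioned in each document, and relationships between entities.
# In a real pipeline these would come from NER, an LLM, or a structured parser.
DOC_ENTITIES = {"incident-42": ["AuthService"], "runbook-7": ["AuthService"], "menu": []}
GRAPH = {"AuthService": [("MAINTAINED_BY", "Alice"), ("DEPENDS_ON", "UserDB")]}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query: str, k: int = 2) -> dict:
    q = embed(query)
    # Query, part 1: vector similarity finds candidate documents.
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]
    # Query, part 2: the graph supplies facts connected to entities in those documents.
    facts = []
    for doc_id in ranked:
        for entity in DOC_ENTITIES[doc_id]:
            for rel, target in GRAPH.get(entity, []):
                fact = f"{entity} {rel} {target}"
                if fact not in facts:
                    facts.append(fact)
    # Combine: both are handed to the LLM as enriched context.
    return {"documents": ranked, "graph_facts": facts}


context = hybrid_retrieve("auth service certificate outage")
print(context)
```

The point of the design is that the two stores answer different questions: the vector side decides which documents are relevant, and the graph side adds connected facts that no single chunk contains.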

This isn't theoretical. Teams have measured 40%+ accuracy improvements on multi-hop tasks by adding graph memory to existing vector pipelines. The research on why graph beats vectors shows the performance gap clearly.

Actually Building This

You don't need to rip out your existing stack. Start small:

Identify multi-hop queries in your logs (anything requiring "and then" logic)
Extract key entities from those query domains
Add a graph layer (Neo4j, Amazon Neptune, or even PostgreSQL with recursive CTEs)
Build a hybrid retriever that queries both systems
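If a dedicated graph database feels like too big a first step, the recursive CTE option can be tried on a plain edge table. The sketch below uses Python's built-in `sqlite3` module so it runs anywhere; PostgreSQL supports the same `WITH RECURSIVE` construct, though placeholder syntax and parameter typing depend on the driver. The `edges` table and its columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("AuthService", "DEPENDS_ON", "UserDB"),
        ("UserDB", "HOSTED_ON", "ProdServer"),
    ],
)

# Walk outgoing edges from a starting node, tracking hop depth.
# The depth cap guards against cycles in the data.
REACHABLE = """
WITH RECURSIVE reachable(node, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM edges AS e
    JOIN reachable AS r ON e.src = r.node
    WHERE r.depth < 5
)
SELECT node, depth FROM reachable WHERE depth > 0 ORDER BY depth
"""

rows = conn.execute(REACHABLE, ("AuthService",)).fetchall()
print(rows)
```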

If you're working with a team that specialises in AI automation and software development, they'll likely be working through similar architectural decisions right now.

The Real Tradeoff

Graphs add complexity. You need:

Entity extraction pipelines
Relationship modelling
Graph database expertise
More complex query logic

But if your agent needs to reason across connected information—and most interesting agents do—the accuracy gains justify the overhead.

Vectors are brilliant for similarity. Graphs are brilliant for connectivity. Use both.

Stop Measuring AI Agent ROI Like It's a Chatbot: A Developer's Guide

Marc Newstead — Mon, 13 Jul 2026 09:10:51 +0000

If you've recently shipped an AI agent or are being asked to build one, you've probably been handed a spreadsheet that calculates ROI based on "hours saved" or "tickets deflected." These metrics made sense for traditional automation—but they're killing your agentic AI projects before they start.

Here's why, and what to measure instead.

The Problem: Traditional Metrics Miss the Point

Most organisations are still measuring AI agents like they're measuring a search bar upgrade or a new help desk widget. The conversation goes something like this:

"If the agent handles 1,000 queries per month and saves 5 minutes per query, that's 83 hours saved. Times hourly rate, minus infrastructure cost... ROI positive in 18 months!"

Sounds reasonable. Except agentic AI doesn't just answer questions—it makes decisions and takes actions. A procurement agent doesn't just look up vendor details; it evaluates quotes, flags compliance risks, and routes approvals. A triage agent doesn't just categorise tickets; it assesses severity, assigns priority, and sometimes resolves the issue outright.

When you measure these systems purely on cost savings, you're optimising for the wrong thing. You end up building glorified FAQ bots with extra steps.

A Better Metric: Decisions Automated Per Hour

Instead of "how much did we save?", ask: "How many decisions is this system making per hour?"

This reframes the entire conversation. Suddenly you're thinking about:

Throughput: Can this agent handle 10 routing decisions per hour, or 1,000?
Scope: What categories of decisions can it own end-to-end?
Latency: How quickly does it move from input to action?
Scale: What happens when load doubles?

These are questions developers instinctively ask about any production system. Treating your agent as a decision-making service—not a cost-centre—makes architectural choices clearer.
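As a rough illustration of treating the agent as a decision-making service, the sketch below computes throughput, auto-clear rate and mean latency from a decision log. The `Decision` record and `summarise` function are hypothetical, and the sample log is synthetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Decision:
    made_at: datetime
    kind: str          # e.g. "deploy_approval"
    escalated: bool    # True if a human had to step in
    latency_s: float   # seconds from input to action


def summarise(decisions: list[Decision]) -> dict:
    """Throughput, auto-clear rate and mean latency over the logged window."""
    if not decisions:
        return {"per_hour": 0.0, "auto_rate": 0.0, "mean_latency_s": 0.0}
    times = [d.made_at for d in decisions]
    # Treat very short windows as one hour so a single burst is not extrapolated.
    hours = max((max(times) - min(times)).total_seconds() / 3600, 1.0)
    auto = sum(1 for d in decisions if not d.escalated)
    return {
        "per_hour": len(decisions) / hours,
        "auto_rate": auto / len(decisions),
        "mean_latency_s": sum(d.latency_s for d in decisions) / len(decisions),
    }


start = datetime(2026, 1, 5, 9, 0)
log = [
    Decision(start + timedelta(minutes=12 * i), "deploy_approval", escalated=(i % 5 == 0), latency_s=2.0)
    for i in range(10)
]
stats = summarise(log)
print(stats)
```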

Example: Deployment Approval Agent

Let's say you build an agent that reviews deployment requests for a platform team:

# Traditional ROI thinking:
# "Saves 10 minutes per manual review, ~200 reviews/month"
# = 33 hours saved, ~£1,500/month

# Decisions-per-hour thinking:
# "Handles 50 deployment approvals/hour during peak"
# "Automatically clears 80% with no human in loop"
# "Escalates 20% with full context and risk assessment"

The second framing makes it obvious this isn't about saving 33 hours—it's about removing a bottleneck. You can now deploy 50 times per hour instead of queuing for manual review. The value isn't the time saved; it's the velocity unlocked.

This shift in perspective influences everything: how you design the agent, what data it needs, how you monitor it, and how you sell it to stakeholders. The "decisions matter more" article covers the strategic side of this in detail.

What This Changes for You as a Developer

1. You'll Prioritise Different Features

Cost-reduction logic says: "Make the agent handle the simplest 80% of queries."

Decisions-per-hour logic says: "Make the agent handle the highest-volume decision paths end-to-end."

Very different backlogs.

2. Observability Becomes Central

You need to instrument decision quality, not just uptime. Think:

Decision confidence scores
Escalation rates by decision type
Feedback loops from humans who review edge cases
Drift detection on decision patterns

Your agent is now a service with SLOs. Treat it like one.

3. You'll Design for Throughput, Not Coverage

A chatbot optimises for "Can it answer this question?"

An agent optimises for "Can it close the loop on this workflow?"

That means thinking about state management, rollback strategies, and idempotency from day one. Your agent isn't read-only.
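Idempotency is the easiest of these to show in a few lines. The sketch below is one possible shape, assuming an in-memory store and a hypothetical `ActionExecutor` class: a retried action with the same workflow, name and parameters replays the stored result instead of repeating the side effect.

```python
import hashlib
import json


class ActionExecutor:
    """Runs side-effecting actions at most once per idempotency key."""

    def __init__(self):
        self._results = {}    # idempotency key -> stored result
        self.side_effects = 0

    @staticmethod
    def key_for(workflow_id: str, action: str, params: dict) -> str:
        # Same workflow + action + params always yields the same key.
        payload = json.dumps([workflow_id, action, params], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute(self, workflow_id: str, action: str, params: dict) -> dict:
        key = self.key_for(workflow_id, action, params)
        if key in self._results:
            return self._results[key]   # retry: replay the stored result
        self.side_effects += 1           # stand-in for the real write
        result = {"status": "done", "action": action}
        self._results[key] = result
        return result


executor = ActionExecutor()
first = executor.execute("wf-1", "approve_deploy", {"service": "auth"})
retry = executor.execute("wf-1", "approve_deploy", {"service": "auth"})
print(executor.side_effects)
```

In production the result store would need to be durable and shared between workers; the dictionary here only demonstrates the key scheme.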

The Quality Trap (and How to Avoid It)

Obviously, "decisions per hour" is meaningless if those decisions are rubbish. The key is to instrument quality from the start:

Shadow mode first: Run the agent in parallel, compare to human decisions
Confidence thresholds: Auto-execute high-confidence decisions, escalate the rest
Continuous evaluation: Sample and audit decisions regularly
Feedback signals: Did the human override? Did downstream systems reject it?
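Two of these controls fit in a few lines. The sketch below shows a confidence gate and a shadow-mode agreement rate; the 0.9 threshold, the function names and the sample log are assumptions for illustration, not recommended values.

```python
AUTO_EXECUTE_THRESHOLD = 0.9  # assumed cut-off; tune against audited samples


def gate(decision: str, confidence: float) -> str:
    """Auto-execute confident decisions, escalate everything else."""
    return "auto_execute" if confidence >= AUTO_EXECUTE_THRESHOLD else "escalate"


def shadow_agreement(pairs: list[tuple[str, str]]) -> float:
    """Share of cases where the agent matched the human, from (agent, human) pairs."""
    if not pairs:
        return 0.0
    return sum(agent == human for agent, human in pairs) / len(pairs)


shadow_log = [("approve", "approve"), ("approve", "reject"), ("reject", "reject"), ("approve", "approve")]
print(gate("approve", 0.95), gate("approve", 0.6), shadow_agreement(shadow_log))
```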

If you're working with teams focused on AI automation and software development, these feedback loops should be baked into your delivery process, not bolted on later.

Final Thought

If your current AI project is being measured purely on "cost per query" or "FTE avoided," push back. Ask what decisions the system could own, how many per hour, and what good looks like.

You'll build better systems—and you'll have a much easier time explaining why they matter.

MCP and A2A: Two Protocols Every Multi-Agent Dev Should Know

Marc Newstead — Mon, 06 Jul 2026 09:11:56 +0000

The Two-Layer Problem Nobody Saw Coming

If you're building anything with AI agents right now, you've probably hit the same wall I did: how do you wire multiple agents together without creating a brittle mess of custom integrations?

Turns out, two protocols are emerging to split this problem cleanly down the middle. MCP (Model Context Protocol) handles the vertical—how a single agent talks to its tools and data sources. A2A (Agent-to-Agent) handles the horizontal—how agents discover and communicate with each other.

Understanding this split early saves you from architectural regret later. Let me show you why.

The Vertical Layer: MCP

Think of MCP as the standardised plumbing between your agent and everything it needs to do actual work. Before MCP, every LLM framework had its own way of wrapping API calls, database queries, or file system access.

Here's what MCP standardises:

Tool definitions — A consistent schema for describing what a function does and what parameters it needs
Resource access — How agents read from databases, file systems, or external APIs
Prompts and context — A protocol-level way to inject context and instructions

In practice, this means you can write an MCP server once and plug it into Claude, custom LangChain agents, or any other MCP-compatible runtime:

# Pseudocode: MCP server exposing a tool
# (mcp_tool and fetch_weather_api are placeholders, not a real SDK API)
class WeatherMCPServer:
    @mcp_tool
    def get_forecast(self, location: str) -> dict:
        return fetch_weather_api(location)

Anthropic drives MCP, and adoption is growing fast. If you're building agents that need reliable tool access, MCP is becoming the safe default.

The Horizontal Layer: A2A

A2A solves a different problem: how do agents find each other and coordinate?

Imagine you're building a customer support system with three specialised agents:

An intake agent that routes requests
A technical agent that handles product queries
A billing agent that processes refunds

Without A2A, you'd hard-code those connections. With A2A, agents register their capabilities in a directory, and other agents discover them dynamically.

Key A2A concepts:

Agent discovery — A registry where agents advertise what they can do
Message routing — Standardised envelopes for inter-agent communication
Capability negotiation — Agents declare their skills; others query and invoke them
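To show how these three concepts fit together, here is a toy in-memory registry. It is deliberately not the A2A wire format: `AgentRecord`, `Registry`, the envelope fields and the handlers are all hypothetical, and a real deployment would follow the published spec rather than this shape.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRecord:
    name: str
    skills: set[str]
    handler: Callable[[dict], dict]


@dataclass
class Registry:
    agents: dict[str, AgentRecord] = field(default_factory=dict)

    def register(self, record: AgentRecord) -> None:
        self.agents[record.name] = record

    def discover(self, skill: str) -> list[str]:
        """Names of agents that advertise the given skill."""
        return sorted(n for n, a in self.agents.items() if skill in a.skills)

    def send(self, sender: str, skill: str, body: dict) -> dict:
        """Wrap the request in a simple envelope and route it by skill."""
        candidates = self.discover(skill)
        if not candidates:
            return {"from": "registry", "to": sender, "error": f"no agent offers {skill!r}"}
        target = self.agents[candidates[0]]
        reply = target.handler({"from": sender, "to": target.name, "skill": skill, "body": body})
        return {"from": target.name, "to": sender, "body": reply}


registry = Registry()
registry.register(AgentRecord("billing", {"refund"}, lambda msg: {"refunded": msg["body"]["order_id"]}))
registry.register(AgentRecord("technical", {"product_query"}, lambda msg: {"answer": "see docs"}))

response = registry.send("intake", "refund", {"order_id": "A-17"})
print(response)
```

The intake agent never references the billing agent by name; it asks for a skill and the registry resolves it.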

Google's pushing A2A hard, and major cloud vendors are lining up support. The two-layer stack is quickly becoming the de facto architecture for serious multi-agent systems.

Where Devs Get It Wrong

The biggest mistake I see? Using MCP to solve A2A problems, or vice versa.

For example:

❌ Trying to use MCP tool calls to route messages between agents
❌ Building custom agent discovery when A2A already defines it
❌ Implementing A2A-style capability negotiation inside MCP servers

Keep it clean:

MCP = One agent, many tools
A2A = Many agents, one conversation

Real-World Architecture

Here's how I'd structure a production system:

┌─────────────────────────────────────┐
│   A2A Layer (Agent Coordination)    │
│  - Discovery registry               │
│  - Message bus                      │
│  - Routing logic                    │
└─────────────────────────────────────┘
           ↓           ↓           ↓
    Agent A       Agent B       Agent C
       ↓              ↓              ↓
┌──────────────────────────────────────┐
│    MCP Layer (Tool Integration)      │
│  - Database connector                │
│  - API wrappers                      │
│  - File system access                │
└──────────────────────────────────────┘

Each agent has its own MCP stack. The A2A layer orchestrates the whole system.

The Standards Risk

One caveat: neither protocol is governed by a neutral standards body yet. MCP is Anthropic's baby; A2A is Google's. That's not necessarily a dealbreaker, but it does mean the specs could shift based on commercial priorities.

If you're building something enterprise-grade, keep an eye on governance. The worst-case scenario is ending up locked into one vendor's interpretation of the spec.

Should You Care Right Now?

If you're building:

Single-agent systems → MCP is immediately useful. Start there.
Multi-agent orchestration → You need both layers. Design for A2A from day one.
Enterprise AI at scale → The two-layer stack is becoming table stakes. Teams working on AI automation and software development are already adopting this architecture.

The payoff is real: cleaner code, less coupling, easier to extend. And when the next big LLM framework drops, you won't be rewriting everything.

Where to Start

Experiment with MCP — Anthropic's docs are solid. Build a simple tool server.
Read the A2A spec — Google published it earlier this year. It's clearer than you'd expect.
Design for the split — Even if you're not using both today, structure your code as if you will.

The agent internet is coming. These two protocols are the plumbing.

Why Your Next AI Agent Should Probably Be a Manager

Marc Newstead — Mon, 06 Jul 2026 09:09:16 +0000

If you've built an AI agent that tries to do everything—answer support tickets, query databases, generate reports, and update CRM records—you've probably noticed it's mediocre at all of them. There's a better pattern, and it's one we've used in software architecture for decades: delegation.

Instead of building one massive agent with an enormous context window and dozens of tools, build a supervisor agent that spawns specialists on demand. Think of it as a tech lead who routes work to the right engineer, not a full-stack developer trying to do everything themselves.

The Problem with Jack-of-All-Trades Agents

A single LLM agent handling a complex workflow hits a wall quickly. You give it:

Access to 15+ API endpoints
A 3,000-word system prompt
Instructions for edge cases in six different domains
Tools for everything from data validation to PDF generation

What you get back is an agent that:

Picks the wrong tool 20% of the time
Hallucinates when context exceeds its sweet spot
Can't specialise deeply enough to handle domain-specific nuance
Becomes exponentially harder to debug as complexity grows

Sound familiar? It's the same reason we stopped building monoliths and started building microservices.

Enter the Supervisor Pattern

The supervisor and sub-agents pattern flips the model. Your supervisor agent doesn't do the work—it routes it.

Here's a simplified flow:

# Pseudocode: Supervisor agent receives a task
task = "Analyse this customer complaint and update their support ticket"

# Supervisor decomposes and delegates
supervisor.analyse(task)
# Output: [
#   {"agent": "sentiment_analyser", "input": complaint_text},
#   {"agent": "crm_updater", "input": ticket_id, "sentiment": result}
# ]

# Each specialist does one thing well
sentiment = sentiment_agent.run(complaint_text)
crm_agent.update(ticket_id, sentiment)

The supervisor's job is orchestration, not execution. It:

Breaks down the user's request into subtasks
Determines which specialist agent handles each subtask
Passes context between agents
Aggregates results and responds

Each sub-agent has:

A narrow, well-defined role
A smaller, focused system prompt
Only the tools it actually needs
Higher accuracy within its domain
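A runnable version of that flow might look like the sketch below. The specialists are stubs standing in for LLM-backed agents, and `plan` uses keyword checks where a real supervisor would call a model; every name here is hypothetical.

```python
from typing import Callable


# Each specialist does one narrow job.
def sentiment_analyser(context: dict) -> dict:
    text = context["complaint_text"].lower()
    return {"sentiment": "negative" if "broken" in text or "refund" in text else "neutral"}


def crm_updater(context: dict) -> dict:
    return {"ticket_update": f"{context['ticket_id']} tagged {context['sentiment']}"}


SPECIALISTS: dict[str, Callable[[dict], dict]] = {
    "sentiment_analyser": sentiment_analyser,
    "crm_updater": crm_updater,
}


def plan(task: str) -> list[str]:
    """Stand-in for the supervisor's LLM call: returns an ordered list of specialists."""
    steps = []
    if "complaint" in task.lower():
        steps.append("sentiment_analyser")
    if "ticket" in task.lower():
        steps.append("crm_updater")
    return steps


def supervise(task: str, context: dict) -> dict:
    # Run each planned step, passing accumulated context to the next specialist.
    for name in plan(task):
        context = {**context, **SPECIALISTS[name](context)}
    return context


result = supervise(
    "Analyse this customer complaint and update their support ticket",
    {"complaint_text": "My device arrived broken", "ticket_id": "T-101"},
)
print(result["ticket_update"])
```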

Why This Works (and Why It's Familiar)

If you've built distributed systems, this should feel natural. The same principles apply:

Single Responsibility Principle: Each agent does one thing well
Loose Coupling: Agents don't need to know about each other
Bounded Context: Clear domain boundaries reduce complexity
Graceful Degradation: One agent failing doesn't tank the whole system

You wouldn't build a microservice that handles payments, inventory, and email notifications. Don't build agents that way either.

Dynamic Spawning: The Next Level

Static sub-agent pools work, but the real power comes when your supervisor can spawn agents dynamically. Frameworks like LangGraph, AutoGen, and CrewAI support this.

Imagine your supervisor encounters a task it's never seen:

# Task: "Translate this legal document to French and summarise it"

# Supervisor spawns specialists on-the-fly:
supervisor.spawn_agent(
    role="legal_translator",
    tools=["translation_api"],
    context="French legal terminology, formal tone"
)

supervisor.spawn_agent(
    role="document_summariser",
    tools=["text_analysis"],
    context="Legal summary, bullet points, max 200 words"
)

No need to pre-define every possible agent. The supervisor adapts to the task.

The Gotchas (Because There Always Are)

Routing errors are your biggest risk. If the supervisor sends a task to the wrong specialist, you're worse off than with a generalist. Mitigation strategies:

Use structured outputs (JSON, Pydantic models) for routing decisions
Log every delegation decision for debugging
Implement confidence scores—if the supervisor isn't sure, escalate to a human
Test routing logic obsessively
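The first and third mitigations combine naturally: parse the supervisor's output into a typed structure and send anything malformed, unknown or low-confidence to a human. The sketch uses only the standard library (a Pydantic model would serve the same purpose); the agent names, the `human_review` fallback and the 0.7 threshold are assumptions.

```python
import json
from dataclasses import dataclass

KNOWN_AGENTS = {"refund_agent", "tracking_agent", "support_agent"}
MIN_CONFIDENCE = 0.7  # assumed threshold


@dataclass(frozen=True)
class RoutingDecision:
    agent: str
    confidence: float


def parse_decision(raw: str) -> RoutingDecision:
    """Validate the supervisor's JSON; anything malformed or unsure goes to a human."""
    try:
        data = json.loads(raw)
        decision = RoutingDecision(str(data["agent"]), float(data["confidence"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return RoutingDecision("human_review", 0.0)
    if decision.agent not in KNOWN_AGENTS or decision.confidence < MIN_CONFIDENCE:
        return RoutingDecision("human_review", decision.confidence)
    return decision


print(parse_decision('{"agent": "refund_agent", "confidence": 0.92}'))
print(parse_decision('{"agent": "blockchain_expert", "confidence": 0.99}'))
print(parse_decision("not json"))
```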

Token costs add up. Multiple agents mean multiple LLM calls. Profile your usage and optimise:

Use smaller models for specialist tasks
Cache common decompositions
Consider local models for low-stakes subtasks
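Caching decompositions can be as simple as memoising on a normalised task string. In this sketch the model call is a stub and `llm_calls` only exists to show that the second, trivially different request never reaches it; the function names are hypothetical.

```python
from functools import lru_cache

llm_calls = 0  # counts how often the (stubbed) model is actually invoked


def normalise(task: str) -> str:
    """Collapse case and whitespace so trivially different phrasings share a cache entry."""
    return " ".join(task.lower().split())


@lru_cache(maxsize=1024)
def _decompose_cached(normalised_task: str) -> tuple[str, ...]:
    global llm_calls
    llm_calls += 1
    # Stand-in for an LLM call that returns an ordered plan.
    return ("translate", "summarise") if "translate" in normalised_task else ("summarise",)


def decompose(task: str) -> tuple[str, ...]:
    return _decompose_cached(normalise(task))


plan_a = decompose("Translate this legal document and summarise it")
plan_b = decompose("  translate this LEGAL document and summarise it ")
print(llm_calls)
```

Exact-match caching like this only helps with repeated requests; semantically similar but differently worded tasks would still miss the cache.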

Observability is critical. Distributed agent systems are harder to debug than single agents. Invest in:

Tracing (OpenTelemetry, LangSmith)
Structured logging with request IDs
Dashboards showing agent performance and routing patterns
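Structured logging with request IDs needs nothing beyond the standard library. The sketch below emits one JSON line per event and carries the same `request_id` across delegation steps; the field names and the `JsonFormatter` class are assumptions, and a tracing tool would add spans on top of this.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, including any routing fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in ("request_id", "agent", "step"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # captured in memory here; use a real handler in production
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agents.demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

request_id = str(uuid.uuid4())
logger.info("delegated", extra={"request_id": request_id, "agent": "crm_updater", "step": 1})
logger.info("completed", extra={"request_id": request_id, "agent": "crm_updater", "step": 2})

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```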

Should You Build This?

If your agent workflow has:

Multiple distinct domains (e.g., data retrieval + analysis + reporting)
More than 8–10 tools
Frequent routing mistakes
Growing system prompts that feel unwieldy

...then yes, try the supervisor pattern.

If you're building a simple chatbot or single-purpose assistant, stick with one agent. Don't over-engineer.

Where to Start

Pick one complex agent workflow you've already built. Identify two or three distinct subtasks. Refactor into a supervisor + two specialists. Measure accuracy and token usage before and after.

You'll know quickly if it's the right pattern for your use case.

For teams working on AI automation and software development at scale, this pattern is increasingly becoming the default. It's not about replacing single agents—it's about knowing when orchestration beats execution.

Now go build something modular.

MCP + A2A: You're Building Two Integration Layers Whether You Realise It or Not

Marc Newstead — Mon, 29 Jun 2026 09:05:26 +0000

The Problem You Probably Have Already

If you're building agentic systems in 2025, chances are you've already got two integration layers in your stack:

MCP (Model Context Protocol) — wiring your AI models to databases, APIs, filesystems, internal tools
A2A (Agent-to-Agent) protocols — letting your agents discover, negotiate with, and invoke each other

They're not competing. They're complementary. But without a clear interoperability story, you're setting yourself up for the kind of integration spaghetti that kept enterprise architects busy (and miserable) in the ESB era.

What MCP Actually Does

MCP is Anthropic's answer to a simple problem: how do you give an LLM structured, reliable access to external resources without writing bespoke glue code for every data source?

Instead of hardcoding database queries or API calls into your prompts, you expose them as MCP servers. Your AI client connects via a standard protocol, discovers available tools, and invokes them with typed parameters.

Example use case:

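# MCP server exposes a tool (pseudocode: mcp_tool, db and Order are placeholders)
@mcp_tool
def query_customer_orders(customer_id: str) -> list[Order]:
    # Parameterised query; the parameters are passed as a tuple
    return db.execute("SELECT * FROM orders WHERE customer_id = ?", (customer_id,))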

Your LLM can now call query_customer_orders as a function — no prompt engineering, no brittle scraping, no hoping the model "figures it out".

MCP is vertical integration: connecting one agent to the resources it needs to do its job.

What A2A Actually Does

A2A is horizontal. It's about agents talking to other agents.

Imagine you've got:

A customer service agent that handles support tickets
A logistics agent that tracks shipments
A billing agent that processes refunds

When a customer asks "Where's my refund?", the support agent needs to talk to billing. That's A2A.

A2A protocols define:

Discovery: how does Agent A find Agent B?
Capability negotiation: what can Agent B actually do?
Invocation: how does Agent A call Agent B and handle the response?

Unlike MCP, A2A is still fragmented. There's no single standard. You might be using:

HTTP APIs with custom service meshes
Pub/sub queues (Kafka, RabbitMQ)
gRPC with Protobuf schemas
Proprietary frameworks from LangChain, AutoGen, CrewAI

Each works. None interoperate cleanly.

Why This Feels Familiar (and Not in a Good Way)

If you worked in enterprise integration in the 2000s, this smells like ESB déjà vu.

The Enterprise Service Bus promised to solve the n-squared integration problem: instead of every system talking directly to every other system, route everything through a central bus.

It worked — until it didn't. ESBs became:

Single points of failure
Performance bottlenecks
Proprietary lock-in traps (looking at you, TIBCO and WebSphere)

The two-layer stack emerging now — MCP below, A2A above — risks the same fate if we're not careful.

What You Should Do About It

1. Treat MCP as infrastructure, not application logic

MCP is brilliant for standardising resource access. Don't abuse it by cramming business logic into MCP tools. Keep them thin, composable, and stateless.

2. Pick one A2A pattern per bounded context

Don't mix pub/sub and RPC in the same workflow unless you have a damn good reason. Consistency beats flexibility when debugging multi-agent failures at 2am.

3. Design for replaceability

Whatever A2A framework you choose today will probably be legacy in 18 months. Wrap it. Abstract it. Make it swappable.

Example:

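from typing import Protocol

class AgentInvoker(Protocol):
    def invoke(self, agent_id: str, task: dict) -> dict:
        ...

class KafkaAgentInvoker(AgentInvoker):
    # Implementation today
    def invoke(self, agent_id: str, task: dict) -> dict:
        raise NotImplementedError("publish the task to Kafka and await the reply")

class GrpcAgentInvoker(AgentInvoker):
    # Swap in tomorrow
    def invoke(self, agent_id: str, task: dict) -> dict:
        raise NotImplementedError("call the agent over gRPC")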

4. Monitor both layers separately

MCP failures look different to A2A failures:

MCP: "Tool not found", "Database timeout", "API key expired"
A2A: "Agent unavailable", "Circular dependency", "Message bus backlog"

Your observability stack needs to distinguish them.
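One low-effort way to keep the two layers distinguishable is to give each its own exception type and tag failures at the point they are recorded. The class names and the `record_failure` helper below are hypothetical; the error messages are taken from the examples above.

```python
from collections import Counter


class MCPError(Exception):
    """Failure between one agent and a tool or resource."""


class A2AError(Exception):
    """Failure between agents."""


failures = Counter()


def record_failure(exc: Exception) -> str:
    """Tag a failure with the layer it came from so dashboards can split them."""
    if isinstance(exc, MCPError):
        layer = "mcp"
    elif isinstance(exc, A2AError):
        layer = "a2a"
    else:
        layer = "unknown"
    failures[layer] += 1
    return layer


record_failure(MCPError("Database timeout"))
record_failure(A2AError("Agent unavailable"))
record_failure(MCPError("API key expired"))
print(dict(failures))
```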

The Boring Truth

There's no silver bullet yet. MCP is maturing fast, but A2A is still the Wild West. If you're building production agentic systems — especially in regulated industries or at scale — treat this as an architecture risk, not just a tooling choice.

You don't need to freeze development. But you do need to:

Isolate the integration layer
Version your agent contracts
Plan for migration
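Versioning agent contracts can start very small. As a sketch, assuming each message carries a hypothetical `contract_version` field in `major.minor` form, an agent can refuse anything whose major version it does not implement:

```python
SUPPORTED_MAJOR = 2  # hypothetical: the contract major version this agent implements


def accepts(message: dict) -> bool:
    """Accept a message only if its contract major version matches ours."""
    try:
        major = int(str(message["contract_version"]).split(".")[0])
    except (KeyError, ValueError):
        return False
    return major == SUPPORTED_MAJOR


print(accepts({"contract_version": "2.3", "task": "refund"}))
print(accepts({"contract_version": "1.9", "task": "refund"}))
print(accepts({"task": "refund"}))
```

Rejecting unversioned messages outright is a design choice; a more lenient agent might treat a missing version as the oldest supported contract.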

The teams that get this right won't be the ones with the cleverest agents. They'll be the ones who can rewire them without a full rewrite.

If you're evaluating MCP, A2A, or any other part of the agentic stack and want a second opinion, teams specialising in AI automation and software development can help you de-risk the architecture before you're too deep to reverse course.

But honestly? Just don't build another ESB. We've been there. It wasn't fun.

Stop Hardcoding Your Agent Workflows (or Don't): A Dev's Guide to Supervisor Delegation

Marc Newstead — Mon, 29 Jun 2026 09:02:40 +0000

If you're building anything with LLM agents right now, you've probably hit this fork in the road: do you hardcode which agent handles what, or do you let a "supervisor" agent decide at runtime?

It's tempting to reach for the clever solution—dynamic delegation feels proper, like good OOP. But after shipping a few of these systems, I've learned the hard way that the right answer is annoyingly context-dependent.

Let me walk you through the trade-offs, so you can make the call before you burn through your token budget.

The Two Approaches (in 60 Seconds)

Hardcoded routing is exactly what it sounds like:

def route_task(task):
    if "refund" in task.lower():
        return refund_agent.handle(task)
    elif "track order" in task.lower():
        return tracking_agent.handle(task)
    else:
        return general_support_agent.handle(task)

Simple. Deterministic. Zero tokens spent on routing logic.

Supervisor delegation puts an LLM in charge of routing:

import json

def route_task(task):
    supervisor_prompt = f"""
    Given this task: {task}

    Available agents:
    - refund_agent: handles refund requests
    - tracking_agent: handles order tracking
    - support_agent: general queries

    Which agent should handle this? Return JSON like {{"agent": "<name>"}}.
    """

    # llm.call is assumed to return the model's raw text, so parse it first
    decision = json.loads(llm.call(supervisor_prompt))
    return get_agent(decision['agent']).handle(task)

Flexible. Handles edge cases you didn't anticipate. Also: costs tokens on every single request.

When the Supervisor Actually Earns Its Keep

Dynamic delegation makes sense when your input space is genuinely unpredictable. I'm talking:

Research workflows where you don't know upfront whether a query needs web search, database lookup, or code execution
Multi-domain customer support where a single ticket might touch billing, technical support, and account management
Data pipelines where the shape of incoming data determines which transformation agents fire

The key signal: you can't write the if/else tree because you genuinely don't know the patterns yet.

If you're in this camp, supervisor and sub-agents can save you from an unmaintainable mess of routing logic.

When You Should Just Write the Damn If Statement

Here's the uncomfortable truth: most enterprise use cases are way more constrained than we pretend.

Customer service triage? You've got maybe 6–10 intent categories. Document classification? Probably fewer than 20 types. Internal tooling automation? You know exactly what your users are going to ask for because you control the interface.

If you can enumerate the cases in a planning doc, you can hardcode the routing.

Hardcoded routing gives you:

Predictable costs: no surprise token spikes
Faster execution: skip the LLM call entirely
Easier debugging: stack traces beat "the supervisor made a weird choice"
Simpler observability: you know exactly which code path ran

Start here. Add the supervisor later if you actually need it.

The Hidden Costs Nobody Talks About

Even if dynamic delegation works, it compounds costs in ways that'll bite you:

Supervisor reasoning: tokens on every request
Sub-agent context: each agent needs enough context to work, often duplicating the supervisor's input
Inter-agent communication: if agents collaborate, they're burning tokens talking to each other
Retry logic: when delegation fails, you pay again

I've seen teams blow their monthly OpenAI budget in a week because their supervisor was re-analysing the same 500-word input on every routing decision.

Observability Is Your Real Problem

Debugging "why did the supervisor send this to the wrong agent?" is miserable. You're spelunking through LLM logs, trying to reconstruct reasoning that wasn't deterministic to begin with.

Hardcoded routing? grep works. Supervisor delegation? You need structured logging, trace IDs, and probably a vector database just to understand what happened.

If you're not already doing observability well, adding a supervisor will hurt.

My Default Recommendation

Start with hardcoded routing. Write the simplest if/elif/else chain that handles your known cases. Add a catch-all that logs unknown inputs.

Then instrument heavily:

if match := extract_intent(task):
    # Standard-library logging takes structured fields via `extra`
    logger.info("Routing %s to %s", task_id, match.agent,
                extra={"intent": match.intent, "confidence": match.score})
    return route_to(match.agent, task)
else:
    logger.warning("No route for %s", task_id, extra={"task_text": task})
    return fallback_agent.handle(task)

Once you've got a month of prod data showing you can't write better rules, then evaluate a supervisor.

And if you're working with a team that specialises in AI automation and software development, they'll probably tell you the same thing: solve the problem you have, not the one that sounds clever.

TL;DR

Supervisor delegation ≠ always better
Hardcode first if your domain is constrained
Dynamic routing pays for itself when inputs are genuinely unpredictable
Observability and cost control are harder than you think
Start simple, instrument everything, upgrade if data proves you need it

Your future self (and your token budget) will thank you.

Building Multi-Agent Systems: When Your AI Should Spawn More AIs

Marc Newstead — Mon, 22 Jun 2026 09:12:00 +0000

You've built a chatbot. It works. Now product wants it to handle legal queries, technical troubleshooting, and customer support—all in one conversation. Your first instinct? Build three specialised agents and hardcode the routing logic. But there's another pattern gaining traction: supervisor agents that dynamically spawn sub-agents on demand.

Let's talk about when each approach makes sense, because the choice will fundamentally shape your system's cost, reliability, and maintainability.

The Hardcoded Hierarchy: Routing Rules You Control

In a static architecture, your orchestration logic is explicit code:

def route_request(query):
    if contains_legal_keywords(query):
        return legal_agent.process(query)
    elif is_technical_issue(query):
        return tech_support_agent.process(query)
    else:
        return general_agent.process(query)

This is predictable. You know exactly which agent handles what, your token costs are bounded, and debugging is straightforward. When something goes wrong, you're reading deterministic code, not trying to decipher why an LLM decided to spawn a "blockchain expert" for a password reset.

The trade-off? Brittleness. Every new capability means updating your routing logic. Edge cases pile up. That legal query about API rate limits? Neither your legal agent nor your technical agent was designed for it, and your routing function won't know what to do.

Dynamic Orchestration: Let the LLM Decide

Dynamic supervisors flip the script. Instead of hardcoded rules, the supervisor reasons about which sub-agents to spawn:

# Supervisor system prompt (simplified)
supervisor_prompt = """
You have access to these specialist agents:
- LegalAgent: contract review, compliance, GDPR
- TechAgent: API debugging, infrastructure
- CustomerAgent: billing, account management

Analyse the user's request and spawn appropriate sub-agents.
You may spawn multiple agents if needed.
"""

The supervisor becomes a meta-agent that interprets intent and assembles a response pipeline at runtime. For that legal-technical hybrid query? It might spawn both agents, coordinate their outputs, and synthesise a final answer.

This is powerful. New capabilities can be added by registering the agent and updating the supervisor prompt, with no changes to routing code. The system adapts to novel combinations of requirements you didn't anticipate at design time.

But it's expensive. Every request now includes:

Supervisor reasoning step (100–500 tokens)
Sub-agent spawning decision (variable)
Coordination overhead if multiple agents run
Final synthesis step

You've potentially tripled your token spend, and latency scales with the supervisor's decision complexity.
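
A back-of-the-envelope sketch of where that multiplier comes from. The token counts below are hypothetical, and the model ignores coordination overhead between agents, so treat it as a lower bound rather than a cost model.

```python
def token_multiplier(task_tokens, supervisor_tokens, synthesis_tokens, agents_spawned=1):
    """Ratio of dynamic-orchestration tokens to a single direct agent call."""
    static_total = task_tokens
    dynamic_total = supervisor_tokens + agents_spawned * task_tokens + synthesis_tokens
    return dynamic_total / static_total


# Hypothetical request: 400 tokens of real work, 300 for the supervisor, 200 for synthesis.
print(token_multiplier(400, 300, 200))                    # 2.25
print(token_multiplier(400, 300, 200, agents_spawned=2))  # 3.25
```

Plugging in your own measured token counts tells you quickly whether "tripled" is pessimistic or optimistic for your workload.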

When to Choose Which

Here's the decision framework I use:

Choose static hierarchies when:

Your domain is well-defined and stable
Cost predictability matters more than flexibility
You need deterministic behaviour for compliance/audit
Your team is comfortable maintaining explicit orchestration code

Choose dynamic supervisors when:

Requirements evolve frequently
You're handling truly unpredictable user input
Development velocity matters more than marginal cost
You have robust observability to debug LLM routing decisions

The Observability Problem

Dynamic systems create a new debugging challenge: you're troubleshooting decisions made by a model, not by your code. When a supervisor routes incorrectly, you need:

Full prompt/response logging for every supervisor decision
Structured traces showing which sub-agents were spawned and why
Token usage broken down by orchestration vs. task execution

Without this, you're flying blind. Budget for engineering time to build proper instrumentation—it's not optional.
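
One possible shape for such a trace record, as a sketch. The `SupervisorTrace` class and its field names are assumptions, not part of any framework; the idea is simply one structured record per supervisor decision, with orchestration tokens kept separate from task tokens.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class SupervisorTrace:
    request_id: str
    prompt: str                 # full prompt sent to the supervisor
    response: str               # raw supervisor output
    spawned_agents: list        # which sub-agents were started
    rationale: str              # the supervisor's stated reason
    orchestration_tokens: int   # tokens spent deciding and synthesising
    task_tokens: int            # tokens spent by sub-agents on the task
    timestamp: float = field(default_factory=time.time)

    @property
    def orchestration_share(self) -> float:
        """Fraction of all tokens that went on orchestration."""
        total = self.orchestration_tokens + self.task_tokens
        return self.orchestration_tokens / total if total else 0.0

    def to_json(self) -> str:
        record = asdict(self)
        record["orchestration_share"] = self.orchestration_share
        return json.dumps(record)


trace = SupervisorTrace(
    request_id="req-001",
    prompt="Analyse the user's request...",
    response="spawn: LegalAgent, TechAgent",
    spawned_agents=["LegalAgent", "TechAgent"],
    rationale="Query mentions a contract clause and API rate limits",
    orchestration_tokens=300,
    task_tokens=900,
)
print(trace.to_json())
```

Emitting these as JSON lines means your existing log pipeline can index them without a dedicated tracing product.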

For teams working on AI automation and software development, investing early in observability patterns for multi-agent systems pays dividends once you're handling production traffic.

A Hybrid Approach

In practice, you don't have to pick one or the other. Consider a tiered model:

Static top-level routing for broad categories (legal, technical, sales)
Dynamic sub-agent spawning within each category for nuanced specialisation

This gives you cost control at the entry point while preserving flexibility where it matters. Your supervisor still makes intelligent decisions, but within a bounded domain that limits runaway token usage.
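
A minimal sketch of the tiered model, under some assumptions: the registry contents are hypothetical, the keyword rule is a placeholder, and `choose` stands in for the LLM supervisor call. Whatever the supervisor returns is filtered against the category's allowlist and capped.

```python
from typing import Callable, List, Tuple

# Hypothetical registry: each top-level category owns a fixed set of sub-agents.
SUB_AGENTS = {
    "legal": ["contracts", "gdpr"],
    "technical": ["api_debugging", "infrastructure"],
}


def static_category(query: str) -> str:
    """Cheap, deterministic top-level routing (placeholder keyword rule)."""
    q = query.lower()
    return "legal" if ("contract" in q or "gdpr" in q) else "technical"


def hybrid_route(
    query: str,
    choose: Callable[[str, List[str]], List[str]],
    max_agents: int = 2,
) -> Tuple[str, List[str]]:
    category = static_category(query)
    allowed = SUB_AGENTS[category]
    # `choose` stands in for the LLM supervisor; its output is filtered and capped.
    chosen = [name for name in choose(query, allowed) if name in allowed][:max_agents]
    # Fall back to the category's first agent if nothing valid was chosen.
    return category, chosen or allowed[:1]


def fake_supervisor(query, allowed):
    # Simulates a model that also suggests an agent outside the registry.
    return ["gdpr", "blockchain_expert", "contracts"]


print(hybrid_route("Is this contract GDPR compliant?", fake_supervisor))
# ('legal', ['gdpr', 'contracts'])
```

The allowlist and the `max_agents` cap are what bound token usage here: the supervisor can be as creative as it likes, but it can only spend within the category it was handed.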

The Real Question

When building multi-agent systems, the architecture choice boils down to this: are you optimising for developer control or system adaptability?

Static hierarchies give you determinism and debuggability. Dynamic supervisors give you flexibility and emergent capabilities. Neither is inherently better—it depends on your constraints.

If you're still evaluating which pattern fits your use case, the detailed comparison on how one agent learns to delegate walks through cost modelling and reliability trade-offs worth considering before you commit.

One final tip: whichever you choose, start simple. A two-level hierarchy (supervisor + three sub-agents) is easier to reason about than a recursive tree of agents spawning agents. Get the observability and cost monitoring right at two levels before you go deeper.

Your future self—and your AWS bill—will thank you.

MCP vs A2A: Stop Building Agent Architectures Wrong

Marc Newstead — Mon, 22 Jun 2026 09:08:45 +0000

If you're wiring up AI agents in production right now, you've probably hit the same confusion I did: when do I use MCP, and when do I need A2A?

Turns out, they're not alternatives. They solve different problems at different layers of your stack. Mixing them up will wreck your architecture before you've shipped v1.

Let me break down what I wish someone had told me three months ago.

MCP: Your Agent Talks to Tools

Model Context Protocol (Anthropic's spec) is about agent-to-tool communication. Think of it as the interface layer between your LLM and the stuff it needs to do things.

When your agent needs to:

Query a database
Call an internal API
Read from a file system
Fetch customer records

...you're in MCP territory.

What MCP Actually Gives You

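// MCP server exposes capabilities
const mcpServer = {
  tools: [
    {
      name: "query_customer_db",
      description: "Fetch customer by ID",
      // inputSchema is a JSON Schema object describing the tool's arguments
      inputSchema: {
        type: "object",
        properties: { customerId: { type: "string" } },
        required: ["customerId"]
      }
    }
  ]
};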

The protocol standardises how your agent discovers what tools exist, how to invoke them, and how results flow back. It's RPC with schema negotiation baked in.

Crucially: MCP doesn't care about agent autonomy. It's a request-response pattern. Your agent asks, the tool answers. Done.

A2A: Your Agents Talk to Each Other

The Agent2Agent (A2A) protocol (Google's answer to multi-agent coordination) operates at a completely different layer. This is about autonomous systems negotiating with each other.

When you need:

A research agent to delegate to a summarisation agent
A planning agent to coordinate with execution agents
Agents to negotiate task ownership
Asynchronous handoffs between agent workflows

...you're in A2A territory.

The Key Difference

A2A assumes both sides have agency. They're not calling dumb tools — they're collaborating with other intelligent systems that have their own goals, context, and decision-making.

# A2A coordination (conceptual)
Agent A -> Agent B: "Can you handle UK tax calculation?"
Agent B -> Agent A: "Yes, send me the transaction data"
Agent A -> Agent B: [structured payload]
Agent B -> Agent A: "Completed. Result: £2,450 VAT due"

Notice the back-and-forth? That's negotiation. MCP doesn't do that.

Why This Matters When You're Building

Here's where teams go wrong: they try to use MCP to wire agents together.

Don't do this:

# Anti-pattern: Agent-to-agent via MCP
mcp_server.register_tool(
    name="call_summarisation_agent",  # ❌ This is an agent, not a tool
    handler=lambda x: summarisation_agent.run(x)
)

Why does this break down?

MCP tool calls are request-response, so the caller waits for a result, while agent coordination is often long-running and asynchronous
MCP has no concept of agent state or context handoff
You lose the negotiation layer (what if the agent is busy? unavailable? needs clarification?)

Instead, use MCP to give each agent its own tools, then use A2A (or a message bus, or even HTTP with agent-aware semantics) to let them coordinate.

Better architecture:

┌─────────────────┐         ┌─────────────────┐
│   Agent A       │         │   Agent B       │
│                 │         │                 │
│  ┌───────────┐  │  A2A    │  ┌───────────┐  │
│  │ MCP Tools │  │ ◄─────► │  │ MCP Tools │  │
│  └───────────┘  │         │  └───────────┘  │
└─────────────────┘         └─────────────────┘
       │                           │
       │ MCP                       │ MCP
       ▼                           ▼
   [Database]                  [API]

Each agent gets its own MCP interface to tools. Agents talk to each other via A2A.
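
To show what "agents talk to each other" means at the code level, here is a toy in-process sketch using `asyncio` queues. This is not the A2A wire format or any real SDK; the `Message` type, the `tax_agent` peer and the message kinds are all assumptions. What it illustrates is that the peer can accept or decline, which a plain tool call has no vocabulary for.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    kind: str   # "request", "accept", "decline" or "result"
    body: str


async def tax_agent(inbox: asyncio.Queue, outbox: asyncio.Queue, busy: bool = False) -> None:
    """A peer agent that decides for itself whether to take the work."""
    request = await inbox.get()
    if busy:
        await outbox.put(Message("tax_agent", "decline", "busy, try later"))
        return
    await outbox.put(Message("tax_agent", "accept", "send the transaction data"))
    await outbox.put(Message("tax_agent", "result", f"handled: {request.body}"))


async def delegate(task: str, busy: bool = False):
    """Planner side: send a request, then react to the peer's answer."""
    to_peer, from_peer = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(tax_agent(to_peer, from_peer, busy))
    await to_peer.put(Message("planner", "request", task))
    reply = await from_peer.get()
    if reply.kind == "decline":
        await worker
        return None  # caller can retry or pick another agent
    result = await from_peer.get()
    await worker
    return result.body


print(asyncio.run(delegate("UK VAT calculation")))             # handled: UK VAT calculation
print(asyncio.run(delegate("UK VAT calculation", busy=True)))  # None
```

In a real deployment the queues would be replaced by whatever transport you choose, but the accept/decline/result vocabulary is the part worth keeping.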

What to Do Tomorrow

If you're designing an agentic system:

Map your tool layer first — what external capabilities do agents need? Build MCP servers for those.
Identify agent boundaries — where does one agent's responsibility end and another's begin?
Choose your A2A transport — could be Google's spec, could be a message queue with agent-aware semantics. Just don't use MCP for it.
Keep agents dumb about each other's internals — they should coordinate via high-level intent, not RPC.

The two-layer stack isn't theoretical — it's the separation of concerns your system needs to scale beyond a proof-of-concept.

The Bottom Line

MCP = agent ↔ tool (synchronous, request-response, capability exposure)
A2A = agent ↔ agent (asynchronous, stateful, coordination)

Get this wrong and you'll end up refactoring your entire stack when you need to add agent #3. Get it right and your architecture stays clean as you scale.

If you're building this for real and need architecture help, teams doing serious AI automation and software development are already using this mental model.

Now go build something that doesn't fall over when you add more agents.

Stop Selling AI Projects on Hours Saved (And What to Track Instead)

Marc Newstead — Mon, 15 Jun 2026 09:12:42 +0000

You've just shipped an AI feature that automates part of your product workflow. Marketing wants ROI numbers. Your PM asks: "How many hours does this save?"

It's a trap.

Here's why chasing "hours saved" metrics will sabotage your AI projects—and what you should measure instead.

The Problem with Time-Saved Metrics

Let's say you build an AI classifier that auto-tags support tickets. You measure that it saves your support team 30 minutes per day. Ship it, report the win, move on.

Six months later:

The feature is still running
No headcount has been reduced
The support team is just as busy
Finance asks why costs haven't changed

What happened? You optimised for the wrong metric. Time saved doesn't automatically translate to business value—especially when that time gets absorbed by other work, context switching, or simply Parkinson's Law.

Worse, hour-based ROI invites the wrong conversations. Stakeholders hear "we're automating 10 hours a week" and start asking about redundancies. Your engineering work becomes a political minefield instead of a technical improvement.

What Developers Should Measure Instead

Shift your metrics from inputs (time spent) to outcomes (results achieved). Ask: what is this process actually supposed to accomplish?

Let's revisit that support ticket classifier:

Time-saved framing:

"Saves 30 minutes of manual tagging per day"

Outcome-based framing:

"Reduces average ticket resolution time by 18%"
"Increases first-response accuracy from 73% to 91%"
"Decreases ticket escalations by 40%"

See the difference? The second set of metrics connects directly to what the business cares about: faster support, happier customers, fewer escalations.

Practical Examples by Domain

Here's how this translates across common automation scenarios:

Code review automation:

❌ "Saves 2 hours per week reviewing PRs"
✅ "Reduces time-to-merge by 35%, catches 60% more style issues pre-review"

Content moderation ML:

❌ "Automates 1000 moderation decisions daily"
✅ "Reduces harmful content visibility by 80%, decreases appeal rate by 25%"

Inventory forecasting:

❌ "Saves data team 5 hours weekly on reports"
✅ "Reduces stockouts by 45%, decreases overstock by 30%"

Building This Into Your Workflow

The trick is defining success criteria before you write any code. Here's a lightweight framework:

1. Start with the Business Outcome

Before the sprint starts, write down:

What specific outcome are we trying to improve?
How is it measured today?
What's the baseline metric?

# Document this in your ADR or project brief
BASELINE_METRICS = {
    "ticket_resolution_time_p50": 4.2,  # hours
    "first_response_accuracy": 0.73,
    "escalation_rate": 0.18
}

TARGET_OUTCOMES = {
    "ticket_resolution_time_p50": 3.5,  # ~17% improvement (4.2 -> 3.5)
    "first_response_accuracy": 0.85,    # 12pp improvement
    "escalation_rate": 0.12             # 6pp improvement
}

2. Instrument for Outcomes, Not Activity

Your telemetry should track results, not just usage.

// Less useful
logger.info('AI classifier invoked', { ticketId });

// More useful
logger.info('Ticket resolved', {
  ticketId,
  resolutionTimeMinutes,
  autoTagAccuracy,
  escalated: false,
  aiAssisted: true
});

3. Compare Before/After, Control/Treatment

Run A/B tests where possible. Roll out gradually and measure the delta:

50% of tickets use AI tagging, 50% don't
Compare resolution times, accuracy, escalations
Ship to 100% only if outcomes improve
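
One way to implement the 50/50 split is deterministic hashing, so a ticket always lands in the same arm without storing an assignment table. The sketch below is an assumption about how you might do it: the function names and the salt are hypothetical, and `outcome_delta` is a plain difference of means with no significance test.

```python
import hashlib
from statistics import mean


def assign_arm(ticket_id: str, treatment_share: float = 0.5, salt: str = "ai-tagging-v1") -> str:
    """Map a ticket ID to 'treatment' or 'control' deterministically."""
    digest = hashlib.sha256(f"{salt}:{ticket_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return "treatment" if bucket < treatment_share * 10_000 else "control"


def outcome_delta(results):
    """Mean outcome of treatment minus control, from (arm, value) pairs."""
    by_arm = {"treatment": [], "control": []}
    for arm, value in results:
        by_arm[arm].append(value)
    return mean(by_arm["treatment"]) - mean(by_arm["control"])


# The same ticket always gets the same arm.
print(assign_arm("TICKET-1042") == assign_arm("TICKET-1042"))  # True

# Hypothetical resolution times in hours; negative delta means AI-tagged tickets were faster.
print(outcome_delta([("treatment", 3.0), ("treatment", 4.0), ("control", 5.0)]))  # -1.5
```

Changing the salt reshuffles the assignment, which is handy when you start a new experiment on the same ticket population.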

Why This Matters for Your Career

As a developer, learning to frame your work in business outcomes—not just technical achievements—is a force multiplier. It makes your projects easier to fund, easier to defend, and much harder to cut when budgets tighten.

The companies getting AI automation and software development right aren't the ones chasing headcount reduction. They're the ones connecting automation directly to revenue, customer satisfaction, or strategic goals.

If you're presenting AI work to non-technical stakeholders, skip the "hours saved" pitch. Show them outcomes instead, and watch the conversation shift from "can we afford this?" to "how fast can we scale it?"

Quick Takeaways

Time saved ≠ value delivered. Hours are an input; outcomes are what matter.
Define success metrics before you code. Baseline, target, measurement strategy.
Instrument for business outcomes, not just feature usage. Track resolution time, accuracy, customer impact.
Use A/B testing to prove the delta. Control groups make ROI undeniable.
Frame your work in business terms. It makes you a more effective engineer and a better communicator.

Your AI feature might save time—but if you can't connect it to a better outcome, you're building on sand.

Building Persistent AI Agents: A Dev's Guide to State Management and Long-Running Workflows

Marc Newstead — Mon, 15 Jun 2026 09:10:05 +0000

The Problem with Stateless Agents

Most AI agents we build today are essentially fancy request-response systems. User asks, agent responds, context dies. Rinse and repeat. But what happens when you need an agent that can start a workflow on Monday, wait for external approval on Wednesday, and resume execution on Friday — all while maintaining perfect context?

That's the shift from chatbots to persistent agents. And it changes everything about how we architect AI systems.

What Makes an Agent "Persistent"?

A persistent agent isn't just a chatbot with better memory. It's a system that:

Maintains state across sessions — not just conversation history, but workflow position, pending actions, and decision context
Can pause and resume — waiting on external events, human input, or scheduled triggers without losing its place
Initiates actions autonomously — checking conditions, triggering workflows, and making decisions without explicit user prompts
Recovers gracefully — handling failures, retries, and state corruption without manual intervention

Think less "chat interface" and more "background worker with reasoning capabilities".

The Hard Part Isn't the LLM Calls

Here's what surprised me: building the agent logic itself is relatively straightforward. Frameworks like LangGraph, CrewAI, and AutoGen handle the orchestration. Calling GPT-4 or Claude is trivial. Integrating tools and APIs is just... normal backend work.

The hard part is state management.

State Storage: More Than Just JSON

You need to persist:

{
  "workflow_id": "claim_review_001",
  "current_step": "awaiting_manager_approval",
  "context": {
    "claim_amount": 1250,
    "supporting_docs": [...],
    "previous_decisions": [...]
  },
  "pending_actions": [
    {"type": "wait_for_approval", "timeout": "2024-02-15T17:00:00Z"}
  ],
  "llm_state": {
    "reasoning_trace": [...],
    "tool_call_history": [...]
  }
}

But here's the thing: this state evolves. The agent needs atomic updates. You need versioning for rollbacks. You need to handle concurrent modifications if multiple agents or humans interact with the same workflow.

Suddenly you're designing a state machine with persistence, not just prompt engineering.
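
A minimal sketch of what atomic, versioned updates can look like, using optimistic concurrency. The `WorkflowStore` class is hypothetical and keeps everything in memory; a real backend would do the version check as a compare-and-set inside the database rather than in Python.

```python
import copy


class VersionConflict(Exception):
    """Raised when someone else updated the workflow since we loaded it."""


class WorkflowStore:
    def __init__(self):
        self._history = {}  # workflow_id -> list of state snapshots

    def load(self, workflow_id):
        """Return (version, state); version 0 means nothing saved yet."""
        history = self._history.get(workflow_id, [])
        if not history:
            return 0, {}
        return len(history), copy.deepcopy(history[-1])

    def save(self, workflow_id, expected_version, state):
        """Append a new snapshot only if nobody else has written in between."""
        history = self._history.setdefault(workflow_id, [])
        if len(history) != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, store is at v{len(history)}"
            )
        history.append(copy.deepcopy(state))
        return len(history)

    def rollback(self, workflow_id, version):
        """Discard every snapshot after `version` and return that state."""
        history = self._history[workflow_id]
        if not 1 <= version <= len(history):
            raise ValueError(f"no such version: {version}")
        del history[version:]
        return copy.deepcopy(history[-1])


store = WorkflowStore()
store.save("claim_review_001", 0, {"current_step": "intake"})
version, state = store.load("claim_review_001")
state["current_step"] = "awaiting_manager_approval"
print(store.save("claim_review_001", version, state))  # 2
```

A second writer that still holds version 1 would now get a `VersionConflict` and has to reload before retrying, which is exactly the behaviour you want when a human and an agent touch the same workflow.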

Interruption as a First-Class Concept

In traditional software, interruption is an exception case. In persistent agents, interruption is the normal case.

Your agent will:

Pause to wait for human approval
Stop because a rate limit was hit
Yield while waiting for an external system
Get interrupted because a higher-priority task arrived

Each interruption point needs explicit handling:

def process_claim_step(state):
    if state.requires_human_review():
        return PauseState(
            resume_trigger="approval_received",
            context=state.to_dict(),
            timeout_hours=72
        )
    # Continue processing...

You're not building a linear function anymore. You're building a resumable state machine.
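
Here is a small runnable sketch of that idea. The step names, `ClaimState` and the review threshold are assumptions for illustration; the key property is that `advance` can return early at a pause point, and the serialised state is enough for a different process to pick up later.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

REVIEW_THRESHOLD = 1000  # hypothetical: claims above this need a human


@dataclass
class ClaimState:
    claim_amount: float
    step: str = "validate"
    approved: Optional[bool] = None  # set by the approval event


def advance(state: ClaimState) -> ClaimState:
    """Run steps until the workflow finishes or has to wait for an event."""
    while True:
        if state.step == "validate":
            needs_review = state.claim_amount > REVIEW_THRESHOLD
            state.step = "awaiting_approval" if needs_review else "payout"
        elif state.step == "awaiting_approval":
            if state.approved is None:
                return state  # paused: persist and stop here
            state.step = "payout" if state.approved else "rejected"
        else:
            return state  # "payout" and "rejected" are terminal


# Monday: the agent runs until it needs approval, then the state is persisted.
saved = json.dumps(asdict(advance(ClaimState(claim_amount=1250))))

# Friday: a fresh process reloads the state, applies the approval and resumes.
resumed = ClaimState(**json.loads(saved))
resumed.approved = True
print(advance(resumed).step)  # payout
```

Calling `advance` again on a paused state is harmless: it returns immediately, which makes retries and duplicate wake-up events safe.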

Human-in-the-Loop Isn't Optional

For any agent doing real work — approving expenses, modifying production data, sending customer communications — human oversight isn't a nice-to-have. It's regulatory, ethical, and practical table stakes.

But "human-in-the-loop" means different things:

Approval gates — agent pauses, human approves/rejects, agent continues
Suggested actions — agent proposes, human edits, agent executes
Monitoring dashboards — humans can intervene at any point

This isn't a UI concern. It's an architectural decision that affects your state model, your event system, and your error handling.

If you're building systems that blend AI automation and software development, you need to design for human intervention from day one, not bolt it on later.

Observability Gets Weird

How do you debug an agent that's been running for three days and is currently paused?

Your standard APM tools won't help. You need:

Workflow visualisation — where is each agent in its process?
State inspection — what decisions has it made? What's it waiting for?
Reasoning traces — why did it take action X instead of Y?
Replay capability — can you rerun from a checkpoint with different conditions?

Logging becomes critical. Every LLM call, every tool invocation, every state transition needs to be traceable.

Where to Start

If you're building your first persistent agent:

Choose a state backend early — Redis, PostgreSQL, or a proper workflow engine like Temporal
Design your state schema first — before you write agent logic
Build pause/resume into every step — don't assume linear execution
Make state transitions explicit — log everything, version your state
Test interruption scenarios — not just happy paths

The Shift in Thinking

Persistent agents force us to think differently. We're not building APIs or microservices anymore. We're building systems that think in the background — agents that are always on, always aware, and always ready to pick up where they left off.

The code isn't harder. The architecture is just... different. And that's the real challenge.

Stop Measuring AI Features By Hours Saved (Measure This Instead)

Marc Newstead — Mon, 08 Jun 2026 09:12:39 +0000

The "Time Saved" Trap We Keep Falling Into

You've just shipped an AI feature. Your manager asks: "How much time does this save users?"

It sounds reasonable. We're engineers — we optimise for efficiency. But this question often leads us to build the wrong thing and measure what doesn't matter.

I've watched teams spend months building AI tools that technically saved hours but delivered zero business value. The feature worked. The metrics looked good. Nobody used it after the first week.

Here's why measuring outcomes, not hours, matters more than you think, and how to instrument for it from day one.

Why "Hours Saved" Breaks Your Decision-Making

The labour-hour metric made sense when automation meant replacing repetitive tasks. If your script processes 1,000 invoices instead of a human spending 40 hours doing it, the maths is simple.

But modern AI features don't work like that. They:

Augment decisions (suggesting code completions, not writing entire apps)
Enable new workflows (analysis that wasn't feasible manually)
Shift quality, not just speed (better detection, fewer false positives)

When you measure a code completion tool by "time saved typing", you miss that its real value might be:

Reducing context-switching by keeping developers in flow
Lowering the barrier for junior devs to write idiomatic code
Decreasing cognitive load during complex refactors

None of those show up in a time-saved metric. Worse, optimising for time-saved might lead you to auto-complete aggressively when developers actually want suggestions that help them think, not type faster.

What to Measure Instead: Outcomes Engineers Can Instrument

Shift your instrumentation to capture what changed, not just what was faster.

Example: AI-Powered Code Review Assistant

Don't measure: "Saved 15 minutes per PR review"

Do measure:

Defect escape rate (bugs reaching production)
Time-to-merge for PRs of similar complexity
Reviewer confidence scores (post-merge survey)
Rate of AI suggestions accepted vs. dismissed

Example: Automated Customer Query Classifier

Don't measure: "Replaced 10 hours/week of manual tagging"

Do measure:

First-response accuracy (correct routing)
Customer satisfaction with resolution
Escalation rate to human agents
Query resolution time end-to-end

The Pattern

For any AI feature, ask:

What business outcome does this enable? (faster deployments, fewer incidents, better conversion)
What baseline exists? (instrument before you ship)
What proxy metrics indicate progress? (leading indicators you can measure weekly)

Instrumenting for Outcomes From Day One

This is where most teams fail: they bolt on measurement after launch. You can't retrofit a baseline.

Pre-Launch Checklist

# Sketch: what your instrumentation might look like (in-memory store for illustration)
import uuid
from datetime import datetime, timezone


class AIFeatureMetrics:
    def __init__(self, feature_name):
        self.feature = feature_name
        self.event_store = {}  # interaction_id -> event; swap for a real store

    def log_interaction(self, user_id, action, context):
        """
        Log every meaningful interaction:
        - What did the AI suggest?
        - What did the user do with it?
        - What was the context? (task type, user experience level)
        Returns an interaction ID so the outcome can be attached later.
        """
        interaction_id = str(uuid.uuid4())
        self.event_store[interaction_id] = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'feature': self.feature,
            'user': user_id,
            'action': action,  # accepted, rejected, modified
            'context': context,
            'outcome': None  # filled in later
        }
        return interaction_id

    def link_to_outcome(self, interaction_id, outcome_metric):
        """
        Connect the AI interaction to business outcome:
        - Did the PR with AI suggestions have fewer bugs?
        - Did the AI-routed ticket resolve faster?
        """
        self.event_store[interaction_id]['outcome'] = outcome_metric

Key principle: Capture the interaction and the eventual outcome. This lets you correlate AI assistance with business results.

Making This Work in Practice

For teams working on AI automation and software development, here's the tactical approach:

1. Define Success Before You Code

Write your "definition of done" to include outcome metrics:

## Feature: AI-Powered Incident Classifier

**Success criteria:**
- 80% of incidents routed to correct team (up from 65% baseline)
- Mean-time-to-engagement decreases by 20%
- On-call satisfaction score maintained or improved

**NOT success:**
- "Saves 5 hours/week of manual classification"

2. Build a Baseline Period

Run your instrumentation for 2-4 weeks before enabling the AI feature. You need the counterfactual.

3. Plan Your Feedback Loop

How will you know if outcomes improve?

Weekly cohort analysis (users with AI vs. without)
Monthly business metric reviews
Qualitative feedback sessions (what changed in practice?)

The Bottom Line

Hours saved is easy to measure but often meaningless. Outcomes are harder to instrument but tell you whether you built the right thing.

As engineers, we control the telemetry. Instrument for outcomes from day one, and you'll ship AI features that actually matter.

What outcome metrics are you tracking for your AI features? Let's discuss in the comments.

Stop Using One LLM for Everything: A Dev's Guide to Model Routing

Marc Newstead — Mon, 08 Jun 2026 09:10:08 +0000

The Problem With Your Current LLM Stack

If you're sending every prompt through GPT-4 or Claude Opus because "it's the best model", you're probably burning money on overkill. Classifying a support ticket's sentiment doesn't need the same horsepower as generating a product requirements document. Yet most codebases I see treat LLM calls like they're all created equal.

Model routing solves this. Instead of one model for everything, you dynamically select which model handles each task based on complexity, cost, and latency requirements. Think of it as load balancing, but for intelligence.

What Model Routing Actually Looks Like

At its core, a router is middleware between your app and your LLM providers. Here's the mental model:

def route_llm_request(task):
    complexity = analyse_task(task)

    if complexity == "simple":
        return call_model("gpt-3.5-turbo", task)
    elif complexity == "moderate":
        return call_model("claude-haiku", task)
    else:
        return call_model("gpt-4", task)

Obviously production implementations get more sophisticated, but the principle holds: inspect the task, pick the cheapest model that can handle it reliably.

Mapping Tasks to Models

The hard part isn't the routing logic—it's building a sensible taxonomy of your tasks. Start by auditing what you're actually sending to LLMs:

Classification tasks: Intent detection, sentiment analysis, category assignment. These are often binary or multi-class decisions. GPT-3.5-turbo or even GPT-4o-mini handles these beautifully at a fraction of the cost.
Retrieval-augmented generation: Answering questions from your docs. Moderate complexity. Models like Claude Haiku or Gemini Flash offer solid performance without flagship pricing.
Content generation: Drafting emails, writing code, creating marketing copy. This is where you might actually need GPT-4 or Claude Opus—but only when the stakes justify it.
Structured extraction: Pulling entities from text, parsing invoices. If you can define a JSON schema, smaller models work fine, especially with function calling.

The key insight: most applications have a long tail of simple tasks subsidising a small number of complex ones. Route accordingly.
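
The taxonomy above can be written down as a lookup table rather than a classifier, at least to start with. In this sketch the tier assignments are my reading of the list, the `high_stakes` flag is an assumption, and the model names are the illustrative labels used in this post, not verified API model IDs.

```python
# Tier per task type, following the taxonomy above.
TASK_TIERS = {
    "classification": "small",
    "structured_extraction": "small",
    "rag": "medium",
    "content_generation": "medium",
}

# Illustrative labels only; substitute the model IDs your providers expose.
TIER_MODELS = {"small": "gpt-3.5-turbo", "medium": "claude-haiku", "large": "gpt-4"}


def select_model(task_type: str, high_stakes: bool = False) -> str:
    # Unknown task types go to the largest tier until they have been benchmarked.
    tier = TASK_TIERS.get(task_type, "large")
    # Generation is only escalated when the stakes justify the cost.
    if task_type == "content_generation" and high_stakes:
        tier = "large"
    return TIER_MODELS[tier]


print(select_model("classification"))                        # gpt-3.5-turbo
print(select_model("content_generation", high_stakes=True))  # gpt-4
print(select_model("translation"))                           # gpt-4
```

Defaulting unknown task types to the large tier costs more, but it fails safe: you find out about a new task type from the bill, not from bad output.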

Tracking the Wins

You need telemetry. Log every routing decision with:

{
  taskId: uuid(),
  taskType: "classification",
  modelSelected: "gpt-3.5-turbo",
  tokens: 150,
  cost: 0.0003,
  latency: 420,
  timestamp: Date.now()
}

After a week, aggregate this. You'll likely find:

70%+ of requests are simple and could use cheaper models
Your highest costs come from 5-10% of requests
Latency improves because smaller models are faster
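
Aggregating those logs doesn't need a data warehouse. A minimal sketch, assuming records shaped like the one above, with `latency` in milliseconds and `cost` in a single currency (both assumptions about the log format):

```python
from collections import defaultdict


def summarise(records):
    """Aggregate routing logs per model: request count, cost share, mean latency."""
    totals = defaultdict(lambda: {"requests": 0, "cost": 0.0, "latency": 0.0})
    for record in records:
        entry = totals[record["modelSelected"]]
        entry["requests"] += 1
        entry["cost"] += record["cost"]
        entry["latency"] += record["latency"]
    total_cost = sum(entry["cost"] for entry in totals.values())
    return {
        model: {
            "requests": entry["requests"],
            "cost_share": entry["cost"] / total_cost if total_cost else 0.0,
            "avg_latency_ms": entry["latency"] / entry["requests"],
        }
        for model, entry in totals.items()
    }


# Hypothetical sample: three cheap calls and one expensive one.
logs = [
    {"modelSelected": "gpt-3.5-turbo", "cost": 0.001, "latency": 400},
    {"modelSelected": "gpt-3.5-turbo", "cost": 0.001, "latency": 500},
    {"modelSelected": "gpt-3.5-turbo", "cost": 0.001, "latency": 600},
    {"modelSelected": "gpt-4", "cost": 0.027, "latency": 2000},
]
for model, stats in summarise(logs).items():
    print(model, stats)
```

In this made-up sample a quarter of the requests account for roughly 90% of the spend, which is the kind of skew worth looking for in your own data.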

One team I worked with cut their monthly LLM bill by 60% just by routing classification and extraction tasks away from GPT-4. The business logic didn't change—just the infrastructure underneath.

Fallback Strategies and Provider Diversity

Routing also gives you resilience. If OpenAI's API goes down (and it will), your router can fail over to Anthropic or Gemini. This requires:

Normalised interfaces: Abstract provider-specific SDKs behind a common interface
Retry logic: Catch rate limits and failures, try the next model in your tier
Circuit breakers: Temporarily skip a provider if it's consistently failing

def call_with_fallback(task, models_list):
    for model in models_list:
        try:
            return call_model(model, task)
        except ProviderError:
            continue
    raise AllProvidersFailed()

This multi-provider approach also dodges vendor lock-in. When you're not married to a single API, you can negotiate better pricing and adopt new models faster.
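
The fallback loop above can be extended with a simple circuit breaker. This is a sketch under stated assumptions: `ProviderError`, `AllProvidersFailed` and `CircuitBreaker` are defined here for illustration, `call_model` is passed in as a parameter, and the thresholds are arbitrary.

```python
import time


class ProviderError(Exception):
    """Raised by call_model when a provider fails or rate-limits."""


class AllProvidersFailed(Exception):
    """Raised when every model was skipped or failed."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown_seconds
        self._clock = clock
        self._failures = {}   # model -> consecutive failure count
        self._opened_at = {}  # model -> time the breaker opened

    def is_open(self, model):
        opened = self._opened_at.get(model)
        if opened is None:
            return False
        if self._clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and allow a trial call.
            del self._opened_at[model]
            self._failures[model] = 0
            return False
        return True

    def record_failure(self, model):
        self._failures[model] = self._failures.get(model, 0) + 1
        if self._failures[model] >= self.max_failures:
            self._opened_at[model] = self._clock()

    def record_success(self, model):
        self._failures[model] = 0


def call_with_fallback(task, models, call_model, breaker):
    """Try each model in order, skipping any whose breaker is open."""
    for model in models:
        if breaker.is_open(model):
            continue
        try:
            result = call_model(model, task)
        except ProviderError:
            breaker.record_failure(model)
            continue
        breaker.record_success(model)
        return result
    raise AllProvidersFailed(f"no provider could handle task: {task!r}")
```

Injecting the clock makes the cooldown testable without sleeping, and keeping the breaker separate from the loop means the same instance can be shared across requests.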

Getting Started

You don't need to build a Netflix-scale routing system on day one. Start simple:

Categorise your prompts. Spend an afternoon tagging a sample of requests by complexity.
Benchmark models on each category. Test accuracy, cost, and latency.
Implement a basic router. Even a hardcoded if/else saves money immediately.
Instrument everything. You can't optimise what you don't measure.
Iterate. Add more sophisticated routing rules as your usage patterns emerge.

For a deeper dive into the strategic thinking behind this approach, the team at AI automation and software development have a solid write-up on deploying LLMs at scale that's worth reading.

The Bottom Line

Using one model for everything is like running every database query against your production master. Sure, it works—but it's wasteful and fragile. Model routing gives you cost control, performance headroom, and architectural flexibility.

Start small, measure everything, and let the data guide your routing decisions. Your infrastructure budget will thank you.