DEV Community

Vikas Sah

The Autonomy Slider: A Decision Framework for When to Use Workflows, Single Agents, or Multi-Agent Systems

The industry is over-indexing on multi-agent. Here's a concrete framework — with code — for choosing the right level of autonomy.


The $50,000 Refactor Nobody Talks About

A developer on r/LangChain posted something that stopped me mid-scroll: "I spent three weeks building a CrewAI pipeline with five agents. Then I rewrote it as a single agent with tools and it was faster, cheaper, and more reliable."

This is not an isolated incident. It is the dominant pattern I keep seeing in production deployments: the vast majority of systems marketed as "multi-agent" are actually single agents with routing logic. The companies building the foundational models — Anthropic, OpenAI — rely on single-agent tool-calling architectures for their own flagship products. Claude uses tool-calling. ChatGPT uses tool-calling. That should tell you something.

Yet the conference circuit keeps selling the dream: autonomous agents collaborating like a well-oiled engineering team. The framework ecosystem — CrewAI, AutoGen, LangGraph — has indexed entirely on this vision. And developers keep building five-agent pipelines for problems that need fifty lines of Python.

The issue is not that multi-agent systems are useless. They are genuinely powerful for a narrow set of problems. The issue is that we have created a false binary: "dumb workflows" versus "intelligent multi-agent collaboration." The reality is a spectrum. And most production use cases sit squarely in the middle.

I call it the autonomy slider.

This article is not just another "start simple" sermon. The advice to start simple is correct but incomplete — it gives you a direction without a destination. What follows is the specific decision criterion for when to move up each level of autonomy, so you know exactly when simple stops being enough.

The Autonomy Slider: Five Levels of Agent Architecture

The core insight comes from Anthropic's own engineering guide, "Building Effective Agents," which draws a clear line between workflows and agents. Workflows are predetermined code paths where LLMs are used at specific nodes with deterministic orchestration. Agents are systems where the LLM dynamically directs its own processes and tool usage.

Anthropic's advice is blunt: "Workflows offer predictability and consistency for well-defined tasks, whereas agents are better for open-ended problems where it's hard to predict required steps." Translation — if you can draw the flowchart, you do not need an agent.

But even this is too binary. There are actually five distinct levels of autonomy, and understanding where your system sits — and where it should sit — is the single most important architectural decision in AI engineering today.

The Autonomy Slider — 5 Levels of Agent Architecture

| Level | Architecture | LLM Role | Cost (illustrative, claude-sonnet pricing) | Reliability | Best For |
|---|---|---|---|---|---|
| 0 | Hard-coded workflow | None | $0 | ~99% | Structured data, known patterns |
| 1 | LLM-augmented workflow | Specific nodes | ~$0.02 | ~95% | Classification, extraction, personalization |
| 2 | Single agent + tools | Decides which tool | ~$0.04 | ~90% | Dynamic tasks, tool selection |
| 3 | Orchestrator + workers | Delegates subtasks | ~$0.12 | ~82% | Complex generation with review |
| 4 | Multi-agent collaboration | Independent reasoning | ~$0.20 | ~73% | Adversarial, simulation, parallel search |

Most production systems belong at Level 1 or 2. Let me show you what each level looks like in code.

Industry Context: The Great Simplification

The market signals are unmistakable. Thoughtworks, the consultancy where Martin Fowler is Chief Scientist, published an analysis in early 2025 making a critical distinction: "modular code" is not the same as "multiple autonomous LLMs." You can have clean software architecture — separate modules, well-defined interfaces, single responsibility — without giving each module its own language model. Most "multi-agent" systems conflate good software design with the need for multiple autonomous decision-makers.

MetaGPT, one of the most cited multi-agent research projects, revealed something counterintuitive. Its performance gains came from structured standard operating procedures — waterfall-style handoffs between stages with well-defined schemas — not from agent autonomy. The agents followed strict protocols. When you strip away the "multi-agent" branding, MetaGPT is a workflow engine with LLM-powered nodes. That is Level 1 on the slider, not Level 4.

Meanwhile, the Reflexion paper achieved 91% on HumanEval — a strong coding benchmark — using a single agent with a reflection loop. No second agent reviewing code. No "team of specialists." One model, reflecting on its own output and iterating. The gains came from the loop pattern, not from adding more agents.
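
The loop pattern itself fits in a dozen lines. The sketch below is not Reflexion's actual code; `generate` and `evaluate` are placeholders for an LLM call and a test harness, and the three-iteration cap is an arbitrary default:

```python
from typing import Callable

def reflection_loop(
    generate: Callable[[str, str], str],          # (task, feedback) -> attempt
    evaluate: Callable[[str], tuple[bool, str]],  # attempt -> (passed, feedback)
    task: str,
    max_iters: int = 3,
) -> str:
    """One model iterating on its own output. No second agent anywhere."""
    attempt, feedback = "", ""
    for _ in range(max_iters):
        attempt = generate(task, feedback)   # model sees its own prior feedback
        passed, feedback = evaluate(attempt) # e.g. run the unit tests
        if passed:
            break
    return attempt
```

In the coding setting, `generate` wraps a single model call that includes the previous attempt's feedback in the prompt, and `evaluate` runs the tests. All of the gain comes from the loop, not from extra agents.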

The framework ecosystem is feeling this tension. Enterprise teams that adopted LangChain in 2023 are moving to direct API calls or lighter wrappers. The abstraction cost exceeds the development speed benefit. Pydantic AI and Instructor — libraries focused on structured output without agent overhead — are growing fast. The market is voting with its pip install commands: less framework, more control.

And the strongest signal of all: the companies building frontier models — Anthropic and OpenAI — do not use multi-agent architectures for their own flagship products. Claude is a single agent with tool-calling. ChatGPT is a single agent with tool-calling. If the teams with the deepest understanding of these models have decided multi-agent is unnecessary for their highest-stakes applications, that is evidence worth weighing.

Evidence and Examples: Code at Every Level

Let me walk through each level of the autonomy slider with runnable Python code using the Anthropic SDK. No framework code. That itself is the editorial statement.

Level 0: Hard-Coded Workflow (Zero LLM Calls)

Not everything needs a language model. Before reaching for Claude, ask: can regex and rules handle this?

import re
from dataclasses import dataclass

@dataclass
class TicketRoute:
    category: str
    priority: str
    handler: str

def classify_support_ticket(text: str) -> TicketRoute:
    """Level 0: Pure rules. No LLM. No API cost. No latency."""
    text_lower = text.lower()

    # Pattern matching for known categories
    if re.search(r"(password|login|auth|mfa|2fa)", text_lower):
        return TicketRoute("auth", "high", "security_team")

    if re.search(r"(billing|charge|invoice|refund|payment)", text_lower):
        return TicketRoute("billing", "medium", "finance_team")

    if re.search(r"(crash|error|bug|broken|500|timeout)", text_lower):
        return TicketRoute("technical", "high", "engineering_on_call")

    if re.search(r"(feature|request|suggestion|improve)", text_lower):
        return TicketRoute("feature_request", "low", "product_team")

    return TicketRoute("general", "medium", "support_team")

# Usage
ticket = classify_support_ticket("I can't login and my MFA code isn't working")
print(f"Route to: {ticket.handler} (priority: {ticket.priority})")
# Route to: security_team (priority: high)

When Level 0 is right: You can enumerate every path. Inputs are structured. The domain is well-understood. You will be surprised how often this is the case.

When to move up: Unknown inputs start appearing. Classification accuracy drops below your threshold. Edge cases multiply faster than you can write rules.
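
One concrete trigger for moving up: instrument the Level 0 classifier and watch how often it falls through to the catch-all route. The `RouteAuditor` below is a sketch, and the 10% threshold is an assumption, not a rule:

```python
from collections import Counter
from typing import Callable

class RouteAuditor:
    """Track how often a rules-based router hits its catch-all bucket."""

    def __init__(self, classify: Callable[[str], str], fallback: str = "general"):
        self.classify = classify
        self.fallback = fallback
        self.counts: Counter = Counter()

    def route(self, text: str) -> str:
        category = self.classify(text)
        self.counts[category] += 1
        return category

    def fallback_rate(self) -> float:
        """Share of inputs the rules could not classify."""
        total = sum(self.counts.values())
        return self.counts[self.fallback] / total if total else 0.0

# Usage with the Level 0 classifier above:
#   auditor = RouteAuditor(lambda t: classify_support_ticket(t).category)
#   ... route production traffic through auditor.route(...) ...
#   if auditor.fallback_rate() > 0.10:  # threshold is a judgment call
#       time to consider a Level 1 LLM classifier for that slice
```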

Level 1: LLM-Augmented Workflow (LLM at Specific Nodes)

The workhorse of production AI. Deterministic routing with LLM calls at specific nodes where rules fail. This is what Anthropic calls "prompt chaining" and "routing."

import anthropic
import json

client = anthropic.Anthropic()

def load_template(path: str) -> str:
    """Plain file read for a response template. No LLM involved."""
    with open(path) as f:
        return f.read()

def process_customer_email(email_text: str) -> dict:
    """Level 1: Deterministic workflow, LLM at two nodes."""

    # Node 1 (LLM): Classify intent — the ONLY ambiguous step
    classification = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Classify this customer email into exactly one category.

Categories: billing, technical, feature_request, complaint, praise

Email: {email_text}

Respond with JSON: {{"category": "...", "urgency": "low|medium|high"}}"""
        }]
    )
    result = json.loads(classification.content[0].text)

    # Deterministic routing — no LLM needed here
    templates = {
        "billing": "billing_response.txt",
        "technical": "tech_response.txt",
        "complaint": "escalation_response.txt",
        "feature_request": "feature_ack.txt",
        "praise": "thank_you.txt",
    }
    template_path = templates.get(result["category"], "generic.txt")
    base_template = load_template(template_path)  # plain file read

    # Node 2 (LLM): Personalize the template response
    personalized = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Personalize this template response for the customer.

Template: {base_template}
Original email: {email_text}
Category: {result['category']}

Keep the structure. Adjust tone and add specific references to their issue.
Do not invent information not in the original email."""
        }]
    )

    return {
        "category": result["category"],
        "urgency": result["urgency"],
        "response": personalized.content[0].text,
        "template_used": template_path,
    }

Two LLM calls. Deterministic routing in between. Total cost: roughly $0.02-0.04. Latency: 3-5 seconds. Debuggable: you can log exactly which template was selected and why.

This is where most production AI should live.

Level 2: Single Agent with Tools (LLM Decides What to Do)

When the task requires dynamic tool selection — not just classification, but deciding which action to take — you need a single agent with tools. The LLM chooses which tool to call based on the input.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools the agent can use
tools = [
    {
        "name": "search_knowledge_base",
        "description": "Search internal docs for answers to customer questions",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "lookup_order",
        "description": "Look up order status by order ID or customer email",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "email": {"type": "string"}
            }
        }
    },
    {
        "name": "escalate_to_human",
        "description": "Escalate to a human agent when the issue is too complex",
        "input_schema": {
            "type": "object",
            "properties": {
                "reason": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]}
            },
            "required": ["reason", "priority"]
        }
    }
]

def handle_support_request(user_message: str) -> str:
    """Level 2: Single agent chooses which tools to use."""
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="You are a support agent. Use tools to help customers. "
                   "Be concise and helpful.",
            tools=tools,
            messages=messages,
        )

        # If the model is done, return the text response
        if response.stop_reason == "end_turn":
            # Fall back to empty string if the final turn has no text block
            return next(
                (b.text for b in response.content if b.type == "text"), ""
            )

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })

        # Continue the conversation with tool results
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

Notice: one agent, multiple tools, a loop. The agent decides whether to search the knowledge base, look up an order, or escalate. This is genuinely agentic behavior — the LLM is making decisions — but it is a single agent. No coordination overhead. No inter-agent communication protocol. No framework.

Level 3: Orchestrator + Workers (Structured Delegation)

When you need specialized processing with review, Level 3 introduces an orchestrator that delegates to worker agents. This is Anthropic's "orchestrator-workers" pattern.

import anthropic
import json

client = anthropic.Anthropic()

def generate_and_review_code(task_description: str) -> dict:
    """Level 3: Orchestrator delegates to coder, then to reviewer."""

    # Worker 1: Generate code
    coder_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior Python developer. Write clean, production-ready "
               "code. Include error handling and type hints. Return ONLY code.",
        messages=[{"role": "user", "content": task_description}]
    )
    generated_code = coder_response.content[0].text

    # Worker 2: Review code (adversarial — independent context)
    reviewer_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system="You are a code reviewer. Be critical. Check for: bugs, security "
               "issues, performance problems, missing edge cases. "
               "Return JSON: {\"issues\": [...], \"severity\": \"pass|minor|major\", "
               "\"approved\": true/false}",
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n```
{% endraw %}
python\n{generated_code}\n
{% raw %}
```"
        }]
    )
    review = json.loads(reviewer_response.content[0].text)

    # Orchestrator logic: deterministic decision based on review
    if review.get("approved"):
        return {"code": generated_code, "status": "approved", "review": review}

    # If not approved, revise with feedback (one retry)
    revision_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior Python developer. Fix the issues identified "
               "in the code review. Return ONLY the corrected code.",
        messages=[{
            "role": "user",
            "content": f"Original code:\n```
{% endraw %}
python\n{generated_code}\n
{% raw %}
```\n\n"
                       f"Review feedback:\n{json.dumps(review['issues'])}\n\n"
                       f"Fix these issues."
        }]
    )

    return {
        "code": revision_response.content[0].text,
        "status": "revised",
        "review": review,
    }

Three LLM calls. Structured handoffs — the coder's output becomes the reviewer's input, the review's output drives the orchestrator's decision. The orchestration logic is deterministic Python, not another LLM call.

This is where MetaGPT's gains actually come from. Not from agents negotiating in natural language, but from structured schemas passed between specialized LLM calls with deterministic orchestration in between.

Cost: ~$0.10-0.12 per run (illustrative, claude-sonnet pricing). Latency: ~8-12 seconds. The cost is 3-5x Level 1, so you need to justify it.

Level 4: Multi-Agent Collaboration (When You Actually Need It)

True multi-agent systems. Parallel agents with independent reasoning, adversarial dynamics, or simulation. This is the expensive end of the slider — and the only level where the complexity is justified by the problem structure.

import anthropic
import asyncio
import json

client = anthropic.Anthropic()

async def parallel_research_with_adversarial_review(question: str) -> dict:
    """Level 4: Parallel search agents + adversarial synthesis."""

    # Parallel search agents — genuinely independent research
    search_perspectives = [
        {
            "role": "technical_researcher",
            "prompt": f"Research the TECHNICAL merits and limitations of: {question}\n"
                      f"Focus on benchmarks, architecture tradeoffs, and implementation details."
        },
        {
            "role": "market_researcher",
            "prompt": f"Research the MARKET dynamics around: {question}\n"
                      f"Focus on adoption rates, enterprise usage, and competitive landscape."
        },
        {
            "role": "contrarian_researcher",
            "prompt": f"Find evidence AGAINST the mainstream view on: {question}\n"
                      f"Focus on failures, limitations, overlooked alternatives."
        },
    ]

    # Run all searches in parallel (genuine speedup — not possible single-threaded)
    async def run_search(perspective: dict) -> dict:
        # The synchronous SDK call runs in a worker thread so gather()
        # actually overlaps the three requests instead of serializing them
        response = await asyncio.to_thread(
            client.messages.create,
            model="claude-sonnet-4-20250514",
            max_tokens=1500,
            system=f"You are a {perspective['role']}. Be thorough and specific. "
                   f"Cite concrete examples, numbers, and dates.",
            messages=[{"role": "user", "content": perspective["prompt"]}],
        )
        return {
            "role": perspective["role"],
            "findings": response.content[0].text,
        }

    results = await asyncio.gather(
        *[run_search(p) for p in search_perspectives]
    )

    # Synthesis agent — combines perspectives, flags contradictions
    all_findings = "\n\n---\n\n".join(
        f"## {r['role']}\n{r['findings']}" for r in results
    )

    synthesis = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a research synthesis agent. Combine findings from multiple "
               "researchers. Highlight agreements, contradictions, and gaps. "
               "Be intellectually honest — flag where sources disagree.",
        messages=[{
            "role": "user",
            "content": f"Synthesize these research findings:\n\n{all_findings}"
        }],
    )

    return {
        "individual_findings": results,
        "synthesis": synthesis.content[0].text,
        "agent_count": len(search_perspectives) + 1,
        "estimated_cost": "$0.15-0.20 (illustrative, claude-sonnet pricing)",
    }

Four LLM calls. Three run in parallel — this is the key justification. A single agent would need to search sequentially, tripling the latency. The adversarial perspective (the contrarian researcher) provides genuinely different reasoning than the other two. The synthesis agent resolves contradictions.

This is legitimate multi-agent. The parallelism provides real speedup. The adversarial dynamic produces findings a single agent would not surface. The cost is justified by the problem structure.

But ask yourself: how many of your tasks actually need this?

Before and After: The Same Task, Two Ways

Consider a code review system. Here is how it looks at Level 4 (multi-agent) versus Level 1 (LLM-augmented workflow):

Level 4 (Multi-Agent) — The Over-Engineered Version:

# 5 agents, ~$0.20 per review, 15-20s latency
# Parser Agent → Style Agent → Security Agent → Logic Agent → Summary Agent
# Each agent: own system prompt, own context window, own failure mode
# Inter-agent communication: JSON schemas, retry logic, timeout handling
# Debugging: trace through 5 agent logs to find where review went wrong
# Lines of code: ~300+ (plus error handling for each handoff)

Level 1 (LLM-Augmented Workflow) — The Production Version:

def review_code(code: str) -> dict:
    """One LLM call. Structured output. Done."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior code reviewer. Analyze the code for: "
               "style issues, security vulnerabilities, logic errors, "
               "and performance concerns. Be specific and actionable.",
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n```
{% endraw %}
\n{code}\n
{% raw %}
```\n\n"
                       f"Return JSON: {{\"issues\": [{{\"type\": \"style|security|logic|perf\", "
                       f"\"line\": int, \"description\": str, \"severity\": \"low|medium|high\"}}], "
                       f"\"summary\": str, \"approved\": bool}}"
        }]
    )
    return json.loads(response.content[0].text)

# One call. ~$0.03. ~3 seconds. One log entry to debug.

Same task. The Level 1 version costs 85% less, runs 5x faster, and has exactly one point of failure. The Level 4 version adds complexity that must be justified by measurably better review quality — and in most cases, it is not.

Try This Yourself: The Autonomy Audit

Take a system you have built or are building. For each component, answer these questions:

Step 1: Map your current architecture

Component: [name]
Current level: [0-4]
Number of LLM calls: [count]
Monthly API cost for this component: [$]
P95 latency: [seconds]

Step 2: Apply the decision checklist

Can I enumerate all possible paths?
├── YES → Level 0 (hard-coded workflow)
└── NO
    Does the task require classifying unstructured input?
    ├── YES, but routing is deterministic after → Level 1
    └── YES, and the next action depends on classification
        Does the agent need to choose between tools dynamically?
        ├── YES, one agent can handle it → Level 2
        └── YES, and subtasks need independent reasoning
            Do subtasks benefit from adversarial review?
            ├── YES → Level 3 (orchestrator + workers)
            └── NO, but they benefit from parallel execution
                with independent perspectives → Level 4

Step 3: Compare cost and reliability

The estimates below assume 90% individual agent reliability — benchmark your own system to calibrate these numbers.

def estimate_pipeline_reliability(
    agent_count: int,
    per_agent_reliability: float = 0.90
) -> float:
    """The math most multi-agent advocates skip."""
    return per_agent_reliability ** agent_count

# The reliability tax
for n in range(1, 6):
    r = estimate_pipeline_reliability(n)
    cost_multiplier = n * 1.2  # rough, includes context overhead
    print(f"  {n} agent(s): {r:.1%} reliability, ~{cost_multiplier:.1f}x cost")

#   1 agent(s): 90.0% reliability, ~1.2x cost
#   2 agent(s): 81.0% reliability, ~2.4x cost
#   3 agent(s): 72.9% reliability, ~3.6x cost
#   4 agent(s): 65.6% reliability, ~4.8x cost
#   5 agent(s): 59.0% reliability, ~6.0x cost

Three agents at 90% individual reliability gives you 73% pipeline reliability. Five agents drops you to 59%. Every agent you add multiplies your failure surface. Retry logic and error handling can recover some of this, but they add latency and cost — making the total overhead even higher.
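
To quantify how much a retry buys back: one retry lifts a 90%-reliable step to 1 − (1 − 0.9)² = 99%, at the price of a second LLM call every time the first attempt fails. A sketch extending the estimator above, with the same illustrative 90% assumption:

```python
def reliability_with_retries(
    agent_count: int,
    per_agent_reliability: float = 0.90,
    retries: int = 1,
) -> float:
    """Pipeline reliability when each agent gets `retries` extra attempts."""
    # Probability a single step succeeds within (1 + retries) attempts,
    # assuming independent failures
    per_step = 1 - (1 - per_agent_reliability) ** (1 + retries)
    return per_step ** agent_count

for n in (1, 3, 5):
    print(f"  {n} agent(s), 1 retry each: {reliability_with_retries(n):.1%}")

#   1 agent(s), 1 retry each: 99.0%
#   3 agent(s), 1 retry each: 97.0%
#   5 agent(s), 1 retry each: 95.1%
```

Retries recover most of the reliability, but notice what they do not recover: each failed attempt adds a full LLM call of cost and latency, and the independence assumption is optimistic — a prompt that confuses the model once often confuses it again.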

What Practitioners Should Actually Do

Audit your existing systems. Take every AI component in your stack and place it on the slider. If anything sits at Level 3 or 4, ask: does the problem structure genuinely require this level of autonomy? Or did we default to multi-agent because a tutorial told us to?

Start at Level 0 and move up. For every new feature, begin with rules and regex. Move to Level 1 only when deterministic classification breaks. Move to Level 2 only when the agent needs dynamic tool selection. Move to Level 3 only when adversarial review measurably improves output quality. Move to Level 4 only when you need parallel independent reasoning.

Drop the framework, use the SDK. The code examples above use the Anthropic SDK directly — no LangChain, no CrewAI, no AutoGen. Each example is under 50 lines. Frameworks add abstraction layers that hide complexity you will need to understand when things break at 2am. For most use cases, anthropic.Anthropic() and a while loop is all you need.

Measure before you architect. Before committing to multi-agent, run a baseline with a single well-prompted agent. Measure accuracy, cost, and latency. Then add the second agent. Does accuracy improve enough to justify the 2-3x cost increase and the new failure mode? Often it does not.

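A baseline harness does not need to be fancy. The sketch below is generic: `agent` is any callable you are evaluating (the single well-prompted agent first, then the multi-agent variant), `grade` is your task-specific correctness check, and the 10% bar in the comment is a default to tune, not a law:

```python
import time
from statistics import mean
from typing import Callable, Iterable

def run_baseline(
    agent: Callable[[str], str],
    grade: Callable[[str, str], bool],   # (output, expected) -> correct?
    cases: Iterable[tuple[str, str]],    # (input, expected) pairs
) -> dict:
    """Measure accuracy and latency for one architecture variant."""
    latencies, correct, total = [], 0, 0
    for prompt, expected in cases:
        start = time.perf_counter()
        output = agent(prompt)
        latencies.append(time.perf_counter() - start)
        correct += grade(output, expected)
        total += 1
    return {
        "accuracy": correct / total,
        "mean_latency_s": mean(latencies),
        "n": total,
    }

# Compare variants on the SAME cases before adding an agent:
#   base = run_baseline(single_agent, grade, cases)
#   multi = run_baseline(two_agent_pipeline, grade, cases)
#   if multi["accuracy"] - base["accuracy"] < 0.10:
#       keep the single agent
```
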
Reserve Level 4 for its real use cases. Adversarial red-teaming. Parallel specialized search where latency matters. Simulation with genuinely independent actors. Compliance workflows where separation of concerns has regulatory value. These are real. Everything else should probably be a workflow.

Strategic Takeaways

  1. Most production AI systems should be workflows with one or two LLM calls, not multi-agent architectures. Anthropic's own engineering guide says it. Their own products demonstrate it. Start simple.

  2. Multi-agent adds 3-5x cost and multiplicative failure modes. Three agents at 90% reliability each give you 73% pipeline reliability. The question is not "can I build this with agents?" but "do I need to?"

  3. The autonomy slider has five levels, and most production use cases belong at Level 1-2. Level 0 is rules. Level 1 is the workhorse. Level 2 is for genuine tool selection. Levels 3-4 are for adversarial review and parallel search — not for email classification.

  4. Run a single-agent baseline first. If adding a second agent does not improve the primary metric by at least 10%, do not add it. The coordination cost, additional failure surface, and debugging complexity are only justified by measurable gains — not by architectural elegance.

  5. The best architecture is the one you can debug at 2am when your on-call pager goes off. A single agent with structured logging beats a five-agent pipeline with distributed tracing every time. Operational simplicity is a feature, not a compromise.


Vikas Sah is the founder of Code Coin Cognition LLC, building AI-powered screening and analysis systems. He writes about agentic AI architecture from the practitioner's perspective — what actually works in production, not what looks good in a demo.
