DEV Community: Vikas Sah

Use git diff + Claude to Auto-Generate PR Descriptions in One Bash Alias

Vikas Sah — Mon, 23 Mar 2026 00:54:31 +0000

I used to spend five minutes writing PR descriptions. Now it takes five seconds. One bash alias pipes your diff and commit log into Claude CLI, and you get a formatted description back on stdout.

The Alias

Add this to your ~/.zshrc or ~/.bashrc:

alias prdesc='(echo "## Commits"; git log main..HEAD --oneline; echo "## Files changed"; git diff main..HEAD --stat; echo "## Diff"; git diff main..HEAD) | claude -p "Generate a concise PR description from this git diff. Use this format:

## Summary
One paragraph explaining what changed and why.

## Changes
- Bullet list of key changes (group by theme, not by file)

## Testing
- How to verify these changes work

Keep it under 200 words. No preamble. Start with the ## Summary heading."'

Reload your shell:

source ~/.zshrc  # or: source ~/.bashrc

Usage

Make your changes, commit to a branch, then:

prdesc

That's it. Claude reads your commits, file stats, and full diff, then spits out a ready-to-paste PR description.

Sample Output

For a branch with three commits touching auth middleware:

## Summary
Adds rate limiting to the authentication middleware. Requests exceeding
50 per minute per API key now receive a 429 response with a Retry-After
header. Includes Redis-backed sliding window counter and integration tests.

## Changes
- Add `RateLimiter` class with Redis sliding window algorithm
- Wire rate limiter into auth middleware pipeline
- Return 429 with Retry-After header on limit breach
- Add integration tests covering burst and sustained traffic patterns
- Update API docs with rate limit response schema

## Testing
- Run `pytest tests/test_rate_limiter.py` for unit tests
- Run `pytest tests/integration/test_auth_flow.py` for end-to-end
- Hit `/api/health` 51 times rapidly and verify 429 response

Copy It Straight to gh

Pipe it into your PR creation command:

prdesc | gh pr create --title "Add rate limiting to auth middleware" --body-file -

Or copy to clipboard for the GitHub web UI:

prdesc | pbcopy  # macOS
prdesc | xclip -selection clipboard  # Linux

Why It Works

Claude's -p flag runs in headless mode — it reads stdin, processes the prompt, and prints the result to stdout. No interactive session, no TUI. That makes it composable with any Unix pipeline.

Feeding it three layers of context (commit messages for the why, file stats for the scope, and the full diff for the what) gives the model enough signal to write a description that actually matches the change. Commit messages alone miss detail. Raw diffs alone miss intent. All three together hit the sweet spot.

Gotchas:

If your default branch is master, swap main for master in the alias. Or use git symbolic-ref refs/remotes/origin/HEAD | sed 's@^refs/remotes/origin/@@' to detect it automatically.
If your diff is huge (7,000+ characters), the stdin pipe can hit limits in some Claude CLI versions. For large PRs, drop git diff main..HEAD and keep only --stat — you'll lose line-level detail but the summary stays accurate. Or split the PR. Which you probably should anyway.
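
The large-diff fallback can also be decided by a script instead of by hand. Here is a minimal Python sketch of that idea. The build_pr_context helper and its max_diff_chars parameter are hypothetical names, and the 7,000-character default simply reuses the figure above; it is not a documented Claude CLI limit.

```python
def build_pr_context(commits: str, stat: str, diff: str, max_diff_chars: int = 7000) -> str:
    """Assemble the prompt context, dropping the full diff when it is too large.

    build_pr_context and max_diff_chars are hypothetical, for illustration only.
    """
    sections = ["## Commits", commits.strip(), "## Files changed", stat.strip()]
    if len(diff) <= max_diff_chars:
        sections += ["## Diff", diff.strip()]
    else:
        # Mirror the manual fallback: keep commits and --stat, skip line-level detail
        sections.append("## Diff omitted (too large); summarize from commits and file stats")
    return "\n".join(sections)


small = build_pr_context("abc123 Add limiter", "auth.py | 10 +", "+ new line")
large = build_pr_context("abc123 Add limiter", "auth.py | 10 +", "+" * 8000)
print(small)
print(large)
```

The returned string would be what gets piped into claude -p in place of the three git commands.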

The 3-Line Python Decorator That Tracks Every Token Your AI Agent Spends

Vikas Sah — Mon, 23 Mar 2026 00:53:46 +0000

Jensen Huang just told GTC 2026 that every NVIDIA engineer will get a token budget worth half their base salary — $100K-$150K in compute credits. His argument: in the agentic era, your output is capped by your token access, not your working hours.

Which means somebody has to track those tokens. Here's a decorator that does it in three lines of logic.

The Decorator

import functools
from collections import defaultdict

_token_log = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
_session_total = {"input": 0, "output": 0}

ALERT_THRESHOLD = 500_000  # tokens — adjust to your budget

def track_tokens(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)

        # Extract token counts from the response
        usage = result.usage  # Anthropic field names; OpenAI Chat Completions uses prompt_tokens/completion_tokens
        inp, out = usage.input_tokens, usage.output_tokens

        # Log per-function and per-session
        _token_log[fn.__name__]["calls"] += 1
        _token_log[fn.__name__]["input"] += inp
        _token_log[fn.__name__]["output"] += out
        _session_total["input"] += inp
        _session_total["output"] += out

        total = _session_total["input"] + _session_total["output"]
        if total > ALERT_THRESHOLD:
            print(f"⚠️  TOKEN ALERT: {total:,} tokens used — over {ALERT_THRESHOLD:,} limit")

        return result
    return wrapper

Usage

Wrap any function that returns an LLM response:

import anthropic

client = anthropic.Anthropic()

@track_tokens
def summarize(text: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )

@track_tokens
def classify(text: str):
    return client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Classify sentiment: {text}"}]
    )

# Run your agent workflow
summarize("Long document here...")
classify("I love this product")
classify("Terrible experience")

Check where your tokens are going:

def token_report():
    print(f"\n{'Function':<20} {'Calls':>6} {'Input':>10} {'Output':>10} {'Cost':>10}")
    print("-" * 60)
    for fn_name, stats in _token_log.items():
        # Applies Claude Sonnet 4.5 pricing ($3/M input, $15/M output) to every
        # function, so calls made with a cheaper model such as Haiku are overestimated
        cost = (stats["input"] / 1e6 * 3) + (stats["output"] / 1e6 * 15)
        print(f"{fn_name:<20} {stats['calls']:>6} {stats['input']:>10,} {stats['output']:>10,} ${cost:>8.4f}")
    total_cost = (_session_total["input"] / 1e6 * 3) + (_session_total["output"] / 1e6 * 15)
    print(f"\n{'TOTAL':<20} {'':>6} {_session_total['input']:>10,} {_session_total['output']:>10,} ${total_cost:>8.4f}")

token_report()

Sample Output

Function             Calls      Input     Output       Cost
------------------------------------------------------------
summarize                1      1,247        312   $0.0084
classify                 2        418        124   $0.0031

TOTAL                            1,665        436   $0.0115

And when your agent gets stuck in a loop (they always do):

⚠️  TOKEN ALERT: 502,101 tokens used — over 500,000 limit

Why It Works

The Anthropic Python SDK returns a usage object on every non-streaming response with input_tokens and output_tokens. OpenAI's Chat Completions responses carry the same information under prompt_tokens and completion_tokens. The decorator reads those counts before passing the response through, so your calling code never changes. The defaultdict keeps a running tally per function name, with no setup, no database, and no third-party library.

Adapt It

Swap the pricing constants for your model. Current rates per million tokens:

Model	Input	Output
Claude Opus 4.5	$5.00	$25.00
Claude Sonnet 4.5	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00
GPT-4o	$2.50	$10.00
GPT-4o mini	$0.15	$0.60
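
To price each call by its own model rather than one hard-coded rate, the table can be turned into a lookup. This is a minimal sketch: the PRICING dictionary and estimate_cost function are hypothetical names, the keys are illustrative labels rather than guaranteed API model IDs, and the rates are copied from the table above.

```python
# Rates per million tokens as (input, output), copied from the table above.
# The dictionary keys are illustrative labels, not guaranteed API model IDs.
PRICING = {
    "claude-opus-4-5": (5.00, 25.00),
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one call for the given model."""
    in_rate, out_rate = PRICING[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


print(f"${estimate_cost('claude-sonnet-4-5', 1247, 312):.4f}")
print(f"${estimate_cost('claude-haiku-4-5', 418, 124):.4f}")
```

To use it, the decorator would also need to record which model each function called, which the version above does not do.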

For OpenAI responses, change usage.input_tokens to usage.prompt_tokens and usage.output_tokens to usage.completion_tokens. That's it.
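
If the same decorator has to handle both SDKs, one option is to normalize the field names in a small helper. A minimal sketch, assuming a hypothetical extract_usage function; the SimpleNamespace objects are stand-ins for real SDK usage objects so the example runs without an API key.

```python
from types import SimpleNamespace


def extract_usage(usage) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from either naming convention."""
    if hasattr(usage, "input_tokens"):
        # Anthropic-style field names
        return usage.input_tokens, usage.output_tokens
    # OpenAI Chat Completions-style field names
    return usage.prompt_tokens, usage.completion_tokens


# Stand-ins for real SDK usage objects
anthropic_style = SimpleNamespace(input_tokens=120, output_tokens=30)
openai_style = SimpleNamespace(prompt_tokens=80, completion_tokens=20)

print(extract_usage(anthropic_style))  # (120, 30)
print(extract_usage(openai_style))     # (80, 20)
```

Inside the wrapper, the line that unpacks usage.input_tokens and usage.output_tokens would become inp, out = extract_usage(result.usage).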

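Gotcha: Streaming responses need separate handling, because the decorator expects a finished response with a usage attribute. For Anthropic, usage arrives inside the stream itself: input tokens in the message_start event and cumulative output tokens in the final message_delta event. For OpenAI, set stream_options={"include_usage": True} in the create() call and read usage from the final chunk.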

TIL: One Claude Code Hook That Auto-Approves Safe Commands and Kills the 'Yes' Fatigue

Vikas Sah — Mon, 23 Mar 2026 00:53:45 +0000

You know the drill. Claude Code wants to read a file. You click Yes. It wants to grep something. Yes. List a directory. Yes. Check git status. Yes. Forty-seven times a day, you're a human rubber stamp.

Here's a hook that auto-approves read-only operations. In my testing, it cuts daily approval prompts by roughly 60%.

The Hook

Add this to .claude/settings.json in your project root:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read|Grep|Glob",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/approve-readonly.sh"
          }
        ]
      },
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/approve-safe-bash.sh"
          }
        ]
      }
    ]
  }
}

Create .claude/hooks/approve-readonly.sh:

#!/bin/bash
# Auto-approve all read-only tools (Read, Grep, Glob)
# These tools cannot modify files — safe to skip the prompt

jq -n '{
  hookSpecificOutput: {
    hookEventName: "PreToolUse",
    permissionDecision: "allow",
    permissionDecisionReason: "Read-only operation auto-approved"
  }
}'

Create .claude/hooks/approve-safe-bash.sh:

#!/bin/bash
# Auto-approve safe bash commands, prompt for everything else

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command')

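# Commands that chain, pipe, redirect, or substitute fall through to the normal
# prompt; otherwise "ls && rm -rf ." would pass the prefix check below
if printf '%s' "$COMMAND" | grep -qE '[;&|<>`$]'; then
  exit 0
fi

# Safe read-only commands; extend this list as needed
SAFE_PATTERNS="^(ls|cat|head|tail|grep|rg|find|wc|git status|git log|git diff|git branch|pwd|echo|which|type|file|stat)( |$)"

if echo "$COMMAND" | grep -qE "$SAFE_PATTERNS"; then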
  jq -n '{
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "allow",
      permissionDecisionReason: "Safe read-only bash command"
    }
  }'
else
  # Fall through to normal permission prompt
  exit 0
fi

Make them executable:

chmod +x .claude/hooks/approve-readonly.sh .claude/hooks/approve-safe-bash.sh

Why It Works

Claude Code fires a PreToolUse event before every tool invocation. Your hook intercepts it, inspects the tool name and input, and returns a JSON decision: "allow", "deny", or nothing (falls through to the normal prompt).

The matcher field is a regex. "Read|Grep|Glob" catches all three read-only tools. The Bash hook checks the actual command against a safe-command regex — ls, cat, git status, etc. The ( |$) at the end matches commands both with arguments (ls -la) and without (ls). Anything not on the list still triggers the normal approval prompt.
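
It helps to test the safe-command pattern in isolation before trusting it. The sketch below mirrors only the SAFE_PATTERNS expression in Python, not the rest of the hook script; is_auto_approved is a hypothetical helper name.

```python
import re

# Same expression as SAFE_PATTERNS in approve-safe-bash.sh
SAFE_PATTERN = re.compile(
    r"^(ls|cat|head|tail|grep|rg|find|wc|git status|git log|git diff|git branch"
    r"|pwd|echo|which|type|file|stat)( |$)"
)


def is_auto_approved(command: str) -> bool:
    """True when the command starts with a listed command followed by a space or end of line."""
    return SAFE_PATTERN.search(command) is not None


for cmd in ["ls", "ls -la", "lsof -i", "git status", "git push", "ls && rm -rf build"]:
    print(f"{cmd!r}: {is_auto_approved(cmd)}")
```

The first two commands and git status match, while lsof -i and git push do not. The last case also matches, because the pattern by itself only inspects the start of the command and says nothing about what is chained after it.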

Bonus: Share It With Your Team

Commit .claude/settings.json and the hooks directory to your repo. Every developer who clones the project gets the same auto-approve behavior. No per-machine setup.

If you want these hooks globally (all projects), put the config in ~/.claude/settings.json instead. For personal tweaks that won't affect teammates, use .claude/settings.local.json — same format, automatically gitignored.

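Gotcha: The ( |$) suffix avoids partial matches, so lsof does not slip through on the ls pattern. The pattern only anchors the start of the command, though, so make sure chained or redirected commands such as ls && rm -rf . or echo x > file are not auto-approved, and note that find -delete and git branch -D are destructive despite their safe-looking prefixes. Keep destructive commands (rm, mv, chmod, docker) off the list.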

The Autonomy Slider: A Decision Framework for When to Use Workflows, Single Agents, or Multi-Agent Systems

Vikas Sah — Fri, 20 Mar 2026 19:01:14 +0000

The industry is over-indexing on multi-agent. Here's a concrete framework — with code — for choosing the right level of autonomy.

The $50,000 Refactor Nobody Talks About

A developer on r/LangChain posted something that stopped me mid-scroll: "I spent three weeks building a CrewAI pipeline with five agents. Then I rewrote it as a single agent with tools and it was faster, cheaper, and more reliable."

This is not an isolated incident. It is the dominant pattern I keep seeing in production deployments: the vast majority of systems marketed as "multi-agent" are actually single agents with routing logic. The companies building the foundational models — Anthropic, OpenAI — rely on single-agent tool-calling architectures for their own flagship products. Claude uses tool-calling. ChatGPT uses tool-calling. That should tell you something.

Yet the conference circuit keeps selling the dream: autonomous agents collaborating like a well-oiled engineering team. The framework ecosystem (CrewAI, AutoGen, LangGraph) has leaned heavily into this vision. And developers keep building five-agent pipelines for problems that need fifty lines of Python.

The issue is not that multi-agent systems are useless. They are genuinely powerful for a narrow set of problems. The issue is that we have created a false binary: "dumb workflows" versus "intelligent multi-agent collaboration." The reality is a spectrum. And most production use cases sit squarely in the middle.

I call it the autonomy slider.

This article is not just another "start simple" sermon. The advice to start simple is correct but incomplete — it gives you a direction without a destination. What follows is the specific decision criterion for when to move up each level of autonomy, so you know exactly when simple stops being enough.

The Autonomy Slider: Five Levels of Agent Architecture

The core insight comes from Anthropic's own engineering guide, "Building Effective Agents," which draws a clear line between workflows and agents. Workflows are predetermined code paths where LLMs are used at specific nodes with deterministic orchestration. Agents are systems where the LLM dynamically directs its own processes and tool usage.

Anthropic's advice is blunt: "Workflows offer predictability and consistency for well-defined tasks, whereas agents are better for open-ended problems where it's hard to predict required steps." Translation — if you can draw the flowchart, you do not need an agent.

But even this is too binary. There are actually five distinct levels of autonomy, and understanding where your system sits — and where it should sit — is the single most important architectural decision in AI engineering today.

Level	Architecture	LLM Role	Cost (illustrative, claude-sonnet pricing)	Reliability	Best For
0	Hard-coded workflow	None	$0	~99%	Structured data, known patterns
1	LLM-augmented workflow	Specific nodes	~$0.02	~95%	Classification, extraction, personalization
2	Single agent + tools	Decides which tool	~$0.04	~90%	Dynamic tasks, tool selection
3	Orchestrator + workers	Delegates subtasks	~$0.12	~82%	Complex generation with review
4	Multi-agent collaboration	Independent reasoning	~$0.20	(not given)	Adversarial, simulation, parallel search

Most production systems belong at Level 1 or 2. Let me show you what each level looks like in code.

Industry Context: The Great Simplification

The market signals are unmistakable. Martin Fowler's Thoughtworks published an analysis in early 2025 making a critical distinction: "modular code" is not the same as "multiple autonomous LLMs." You can have clean software architecture — separate modules, well-defined interfaces, single responsibility — without giving each module its own language model. Most "multi-agent" systems conflate good software design with the need for multiple autonomous decision-makers.

MetaGPT, one of the most cited multi-agent research projects, revealed something counterintuitive. Its performance gains came from structured standard operating procedures — waterfall-style handoffs between stages with well-defined schemas — not from agent autonomy. The agents followed strict protocols. When you strip away the "multi-agent" branding, MetaGPT is a workflow engine with LLM-powered nodes. That is Level 1 on the slider, not Level 4.

Meanwhile, the Reflexion paper achieved 91% on HumanEval — a strong coding benchmark — using a single agent with a reflection loop. No second agent reviewing code. No "team of specialists." One model, reflecting on its own output and iterating. The gains came from the loop pattern, not from adding more agents.

The framework ecosystem is feeling this tension. Enterprise teams that adopted LangChain in 2023 are moving to direct API calls or lighter wrappers. The abstraction cost exceeds the development speed benefit. Pydantic AI and Instructor — libraries focused on structured output without agent overhead — are growing fast. The market is voting with its pip install commands: less framework, more control.

And the strongest signal of all: the companies building frontier models — Anthropic and OpenAI — do not use multi-agent architectures for their own flagship products. Claude is a single agent with tool-calling. ChatGPT is a single agent with tool-calling. If the teams with the deepest understanding of these models have decided multi-agent is unnecessary for their highest-stakes applications, that is evidence worth weighing.

Evidence and Examples: Code at Every Level

Let me walk through each level of the autonomy slider with runnable Python code using the Anthropic SDK. No framework code. That itself is the editorial statement.

Level 0: Hard-Coded Workflow (Zero LLM Calls)

Not everything needs a language model. Before reaching for Claude, ask: can regex and rules handle this?

import re
from dataclasses import dataclass

@dataclass
class TicketRoute:
    category: str
    priority: str
    handler: str

def classify_support_ticket(text: str) -> TicketRoute:
    """Level 0: Pure rules. No LLM. No API cost. No latency."""
    text_lower = text.lower()

    # Pattern matching for known categories
    if re.search(r"(password|login|auth|mfa|2fa)", text_lower):
        return TicketRoute("auth", "high", "security_team")

    if re.search(r"(billing|charge|invoice|refund|payment)", text_lower):
        return TicketRoute("billing", "medium", "finance_team")

    if re.search(r"(crash|error|bug|broken|500|timeout)", text_lower):
        return TicketRoute("technical", "high", "engineering_on_call")

    if re.search(r"(feature|request|suggestion|improve)", text_lower):
        return TicketRoute("feature_request", "low", "product_team")

    return TicketRoute("general", "medium", "support_team")

# Usage
ticket = classify_support_ticket("I can't login and my MFA code isn't working")
print(f"Route to: {ticket.handler} (priority: {ticket.priority})")
# Route to: security_team (priority: high)

When Level 0 is right: You can enumerate every path. Inputs are structured. The domain is well-understood. You will be surprised how often this is the case.

When to move up: Unknown inputs start appearing. Classification accuracy drops below your threshold. Edge cases multiply faster than you can write rules.

Level 1: LLM-Augmented Workflow (LLM at Specific Nodes)

The workhorse of production AI. Deterministic routing with LLM calls at specific nodes where rules fail. This is what Anthropic calls "prompt chaining" and "routing."

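import anthropic
import json
from pathlib import Path

client = anthropic.Anthropic()

def load_template(name: str) -> str:
    """Plain file read. Assumes templates live in a local templates/ directory."""
    return Path("templates", name).read_text()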

def process_customer_email(email_text: str) -> dict:
    """Level 1: Deterministic workflow, LLM at two nodes."""

    # Node 1 (LLM): Classify intent — the ONLY ambiguous step
    classification = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Classify this customer email into exactly one category.

Categories: billing, technical, feature_request, complaint, praise

Email: {email_text}

Respond with JSON: {{"category": "...", "urgency": "low|medium|high"}}"""
        }]
    )
    result = json.loads(classification.content[0].text)

    # Deterministic routing — no LLM needed here
    templates = {
        "billing": "billing_response.txt",
        "technical": "tech_response.txt",
        "complaint": "escalation_response.txt",
        "feature_request": "feature_ack.txt",
        "praise": "thank_you.txt",
    }
    template_path = templates.get(result["category"], "generic.txt")
    base_template = load_template(template_path)  # plain file read

    # Node 2 (LLM): Personalize the template response
    personalized = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Personalize this template response for the customer.

Template: {base_template}
Original email: {email_text}
Category: {result['category']}

Keep the structure. Adjust tone and add specific references to their issue.
Do not invent information not in the original email."""
        }]
    )

    return {
        "category": result["category"],
        "urgency": result["urgency"],
        "response": personalized.content[0].text,
        "template_used": template_path,
    }

Two LLM calls. Deterministic routing in between. Total cost: roughly $0.02-0.04. Latency: 3-5 seconds. Debuggable: you can log exactly which template was selected and why.
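
One fragile spot in the workflow above is Node 1, which passes the model's reply straight to json.loads. That raises if the model wraps the JSON in a code fence, and an unexpected category silently changes routing. Below is a minimal sketch of a more defensive parser. The parse_classification helper and its fallback values are assumptions for illustration, not part of the original workflow.

```python
import json
import re

ALLOWED_CATEGORIES = {"billing", "technical", "feature_request", "complaint", "praise"}
ALLOWED_URGENCY = {"low", "medium", "high"}
FALLBACK = {"category": "generic", "urgency": "medium"}  # assumed defaults


def parse_classification(reply_text: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding text or a code fence."""
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match is None:
        return dict(FALLBACK)
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(FALLBACK)
    # Coerce anything outside the known vocabulary to the fallback values
    if data.get("category") not in ALLOWED_CATEGORIES:
        data["category"] = FALLBACK["category"]
    if data.get("urgency") not in ALLOWED_URGENCY:
        data["urgency"] = FALLBACK["urgency"]
    return data


fence = "`" * 3  # build the fence marker without typing it literally
fenced_reply = f'{fence}json\n{{"category": "billing", "urgency": "high"}}\n{fence}'
print(parse_classification(fenced_reply))
print(parse_classification("Sorry, I cannot classify this."))
```

Because the routing table already falls back to generic.txt for unknown categories, the "generic" value slots into the existing templates.get call.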

This is where most production AI should live.

Level 2: Single Agent with Tools (LLM Decides What to Do)

When the task requires dynamic tool selection — not just classification, but deciding which action to take — you need a single agent with tools. The LLM chooses which tool to call based on the input.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools the agent can use
tools = [
    {
        "name": "search_knowledge_base",
        "description": "Search internal docs for answers to customer questions",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "lookup_order",
        "description": "Look up order status by order ID or customer email",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "email": {"type": "string"}
            }
        }
    },
    {
        "name": "escalate_to_human",
        "description": "Escalate to a human agent when the issue is too complex",
        "input_schema": {
            "type": "object",
            "properties": {
                "reason": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]}
            },
            "required": ["reason", "priority"]
        }
    }
]

def handle_support_request(user_message: str) -> str:
    """Level 2: Single agent chooses which tools to use."""
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="You are a support agent. Use tools to help customers. "
                   "Be concise and helpful.",
            tools=tools,
            messages=messages,
        )

        # If the model did not request a tool (end_turn, max_tokens, ...),
        # return whatever text it produced instead of looping with no tool results
        if response.stop_reason != "tool_use":
            return next(
                (b.text for b in response.content if b.type == "text"), ""
            )

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })

        # Continue the conversation with tool results
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

Notice: one agent, multiple tools, a loop. The agent decides whether to search the knowledge base, look up an order, or escalate. This is genuinely agentic behavior — the LLM is making decisions — but it is a single agent. No coordination overhead. No inter-agent communication protocol. No framework.
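
The loop above calls execute_tool without defining it. One possible shape is a plain dispatch table, sketched below. The three handler functions are hypothetical stubs standing in for real knowledge-base, order, and ticketing systems.

```python
from typing import Any, Callable


# Hypothetical stub handlers; real ones would call your own backends.
def search_knowledge_base(query: str) -> dict:
    return {"results": [f"Stub article matching '{query}'"]}


def lookup_order(order_id: str = "", email: str = "") -> dict:
    return {"order_id": order_id or "unknown", "email": email, "status": "stub"}


def escalate_to_human(reason: str, priority: str) -> dict:
    return {"escalated": True, "reason": reason, "priority": priority}


TOOL_HANDLERS: dict[str, Callable[..., dict]] = {
    "search_knowledge_base": search_knowledge_base,
    "lookup_order": lookup_order,
    "escalate_to_human": escalate_to_human,
}


def execute_tool(name: str, tool_input: dict[str, Any]) -> dict:
    """Dispatch a tool call by name and report failures as data, not exceptions."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return handler(**tool_input)
    except TypeError as exc:  # missing or unexpected arguments from the model
        return {"error": f"Bad arguments for {name}: {exc}"}


print(execute_tool("lookup_order", {"order_id": "A-1001"}))
print(execute_tool("delete_everything", {}))
```

Returning errors as ordinary results lets the agent see what went wrong and try a different tool, rather than crashing the loop.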

Level 3: Orchestrator + Workers (Structured Delegation)

When you need specialized processing with review, Level 3 introduces an orchestrator that delegates to worker agents. This is Anthropic's "orchestrator-workers" pattern.

import anthropic
import json

client = anthropic.Anthropic()

def generate_and_review_code(task_description: str) -> dict:
    """Level 3: Orchestrator delegates to coder, then to reviewer."""

    # Worker 1: Generate code
    coder_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior Python developer. Write clean, production-ready "
               "code. Include error handling and type hints. Return ONLY code.",
        messages=[{"role": "user", "content": task_description}]
    )
    generated_code = coder_response.content[0].text

    # Worker 2: Review code (adversarial — independent context)
    reviewer_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system="You are a code reviewer. Be critical. Check for: bugs, security "
               "issues, performance problems, missing edge cases. "
               "Return JSON: {\"issues\": [...], \"severity\": \"pass|minor|major\", "
               "\"approved\": true/false}",
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n```python\n{generated_code}\n```"
        }]
    )
    review = json.loads(reviewer_response.content[0].text)

    # Orchestrator logic: deterministic decision based on review
    if review.get("approved"):
        return {"code": generated_code, "status": "approved", "review": review}

    # If not approved, revise with feedback (one retry)
    revision_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior Python developer. Fix the issues identified "
               "in the code review. Return ONLY the corrected code.",
        messages=[{
            "role": "user",
            "content": f"Original code:\n```python\n{generated_code}\n```\n\n"
                       f"Review feedback:\n{json.dumps(review['issues'])}\n\n"
                       f"Fix these issues."
        }]
    )

    return {
        "code": revision_response.content[0].text,
        "status": "revised",
        "review": review,
    }

Three LLM calls. Structured handoffs — the coder's output becomes the reviewer's input, the review's output drives the orchestrator's decision. The orchestration logic is deterministic Python, not another LLM call.

This is where MetaGPT's gains actually come from. Not from agents negotiating in natural language, but from structured schemas passed between specialized LLM calls with deterministic orchestration in between.

Cost: ~$0.10-0.12 per run (illustrative, claude-sonnet pricing). Latency: ~8-12 seconds. The cost is 3-5x Level 1, so you need to justify it.

Level 4: Multi-Agent Collaboration (When You Actually Need It)

True multi-agent systems. Parallel agents with independent reasoning, adversarial dynamics, or simulation. This is the expensive end of the slider — and the only level where the complexity is justified by the problem structure.

import anthropic
import asyncio
import json

client = anthropic.Anthropic()

async def parallel_research_with_adversarial_review(question: str) -> dict:
    """Level 4: Parallel search agents + adversarial synthesis."""

    # Parallel search agents — genuinely independent research
    search_perspectives = [
        {
            "role": "technical_researcher",
            "prompt": f"Research the TECHNICAL merits and limitations of: {question}\n"
                      f"Focus on benchmarks, architecture tradeoffs, and implementation details."
        },
        {
            "role": "market_researcher",
            "prompt": f"Research the MARKET dynamics around: {question}\n"
                      f"Focus on adoption rates, enterprise usage, and competitive landscape."
        },
        {
            "role": "contrarian_researcher",
            "prompt": f"Find evidence AGAINST the mainstream view on: {question}\n"
                      f"Focus on failures, limitations, overlooked alternatives."
        },
    ]

    # Run all searches in parallel. The client is synchronous, so each call runs
    # in a worker thread; calling it directly would block the event loop and the
    # three searches would execute one after another.
    async def run_search(perspective: dict) -> dict:
        response = await asyncio.to_thread(
            client.messages.create,
            model="claude-sonnet-4-20250514",
            max_tokens=1500,
            system=f"You are a {perspective['role']}. Be thorough and specific. "
                   f"Cite concrete examples, numbers, and dates.",
            messages=[{"role": "user", "content": perspective["prompt"]}],
        )
        return {
            "role": perspective["role"],
            "findings": response.content[0].text,
        }

    results = await asyncio.gather(
        *[run_search(p) for p in search_perspectives]
    )

    # Synthesis agent — combines perspectives, flags contradictions
    all_findings = "\n\n---\n\n".join(
        f"## {r['role']}\n{r['findings']}" for r in results
    )

    synthesis = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a research synthesis agent. Combine findings from multiple "
               "researchers. Highlight agreements, contradictions, and gaps. "
               "Be intellectually honest — flag where sources disagree.",
        messages=[{
            "role": "user",
            "content": f"Synthesize these research findings:\n\n{all_findings}"
        }],
    )

    return {
        "individual_findings": results,
        "synthesis": synthesis.content[0].text,
        "agent_count": len(search_perspectives) + 1,
        "estimated_cost": "$0.15-0.20 (illustrative, claude-sonnet pricing)",
    }

Four LLM calls. Three run in parallel — this is the key justification. A single agent would need to search sequentially, tripling the latency. The adversarial perspective (the contrarian researcher) provides genuinely different reasoning than the other two. The synthesis agent resolves contradictions.

This is legitimate multi-agent. The parallelism provides real speedup. The adversarial dynamic produces findings a single agent would not surface. The cost is justified by the problem structure.
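
The latency claim is easy to check without spending tokens. The sketch below replaces each research call with a hypothetical fake_search coroutine that just sleeps, then times the same three calls run concurrently with asyncio.gather and one after another. The 0.2-second delay is an arbitrary stand-in for network latency.

```python
import asyncio
import time


async def fake_search(role: str, delay: float = 0.2) -> dict:
    """Stand-in for one research call; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay)
    return {"role": role, "findings": f"stub findings from {role}"}


async def run_parallel(roles: list[str]) -> tuple[list[dict], float]:
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_search(r) for r in roles))
    return list(results), time.perf_counter() - start


async def run_sequential(roles: list[str]) -> tuple[list[dict], float]:
    start = time.perf_counter()
    results = [await fake_search(r) for r in roles]
    return results, time.perf_counter() - start


roles = ["technical_researcher", "market_researcher", "contrarian_researcher"]
parallel_results, parallel_s = asyncio.run(run_parallel(roles))
sequential_results, sequential_s = asyncio.run(run_sequential(roles))
print(f"parallel: {parallel_s:.2f}s, sequential: {sequential_s:.2f}s")
```

The parallel run should take roughly one delay and the sequential run roughly three. The speedup only appears when each call actually yields to the event loop; a blocking call inside a coroutine would serialize the searches again.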

But ask yourself: how many of your tasks actually need this?

Before and After: The Same Task, Two Ways

Consider a code review system. Here is how it looks at Level 4 (multi-agent) versus Level 1 (LLM-augmented workflow):

Level 4 (Multi-Agent) — The Over-Engineered Version:

# 5 agents, ~$0.20 per review, 15-20s latency
# Parser Agent → Style Agent → Security Agent → Logic Agent → Summary Agent
# Each agent: own system prompt, own context window, own failure mode
# Inter-agent communication: JSON schemas, retry logic, timeout handling
# Debugging: trace through 5 agent logs to find where review went wrong
# Lines of code: ~300+ (plus error handling for each handoff)

Level 1 (LLM-Augmented Workflow) — The Production Version:

def review_code(code: str) -> dict:
    """One LLM call. Structured output. Done."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system="You are a senior code reviewer. Analyze the code for: "
               "style issues, security vulnerabilities, logic errors, "
               "and performance concerns. Be specific and actionable.",
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n```\n{code}\n```\n\n"
                       f"Return JSON: {{\"issues\": [{{\"type\": \"style|security|logic|perf\", "
                       f"\"line\": int, \"description\": str, \"severity\": \"low|medium|high\"}}], "
                       f"\"summary\": str, \"approved\": bool}}"
        }]
    )
    return json.loads(response.content[0].text)

# One call. ~$0.03. ~3 seconds. One log entry to debug.

Same task. The Level 1 version costs 85% less, runs 5x faster, and has exactly one point of failure. The Level 4 version adds complexity that must be justified by measurably better review quality — and in most cases, it is not.

Try This Yourself: The Autonomy Audit

Take a system you have built or are building. For each component, answer these questions:

Step 1: Map your current architecture

Component: [name]
Current level: [0-4]
Number of LLM calls: [count]
Monthly API cost for this component: [$]
P95 latency: [seconds]
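If you have more than a handful of components, the template is easier to work with as data. A minimal sketch, assuming one record per component. The class name, field names, and the two sample rows are hypothetical placeholders, not real measurements.

```python
from dataclasses import dataclass


@dataclass
class ComponentAudit:
    """One row of the autonomy audit (field names are illustrative)."""
    name: str
    current_level: int            # 0-4 on the autonomy slider
    llm_calls: int                # number of LLM calls, as in the template
    monthly_api_cost_usd: float
    p95_latency_s: float

    def __post_init__(self) -> None:
        if not 0 <= self.current_level <= 4:
            raise ValueError(f"level must be 0-4, got {self.current_level}")

    @property
    def needs_justification(self) -> bool:
        """Levels 3 and 4 are the ones worth questioning first."""
        return self.current_level >= 3


# Hypothetical inventory; replace with your own components and numbers.
audit = [
    ComponentAudit("email-router", 1, 1, 40.0, 1.2),
    ComponentAudit("code-review", 4, 5, 900.0, 18.0),
]

# Most expensive components first, with multi-agent ones flagged.
for c in sorted(audit, key=lambda c: c.monthly_api_cost_usd, reverse=True):
    flag = "REVIEW" if c.needs_justification else "ok"
    print(f"{c.name:<14} L{c.current_level}  {c.llm_calls} calls  "
          f"${c.monthly_api_cost_usd:>7.2f}/mo  p95 {c.p95_latency_s:.1f}s  {flag}")
```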

Step 2: Apply the decision checklist

Can I enumerate all possible paths?
├── YES → Level 0 (hard-coded workflow)
└── NO
    Does the task require classifying unstructured input?
    ├── YES, but routing is deterministic after → Level 1
    └── YES, and the next action depends on classification
        Does the agent need to choose between tools dynamically?
        ├── YES, one agent can handle it → Level 2
        └── YES, and subtasks need independent reasoning
            Do subtasks benefit from adversarial review?
            ├── YES → Level 3 (orchestrator + workers)
            └── NO, but they benefit from parallel execution
                with independent perspectives → Level 4
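The checklist can also be written as a function, which makes it easy to run across every component in an audit. This is a sketch: the `Task` fields are my own names for the checklist questions, and the two branches the tree leaves open (marked in comments) fall back to the lower level as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Answers to the checklist questions (field names are illustrative)."""
    paths_enumerable: bool
    routing_deterministic_after_classification: bool = True
    needs_dynamic_tool_choice: bool = False
    subtasks_need_independent_reasoning: bool = False
    benefits_from_adversarial_review: bool = False
    benefits_from_parallel_perspectives: bool = False


def recommend_level(t: Task) -> int:
    """Walk the checklist top to bottom and return the first level that fits."""
    if t.paths_enumerable:
        return 0
    if t.routing_deterministic_after_classification:
        return 1
    if not t.needs_dynamic_tool_choice:
        return 1  # assumption: this branch is not covered by the checklist
    if not t.subtasks_need_independent_reasoning:
        return 2
    if t.benefits_from_adversarial_review:
        return 3
    if t.benefits_from_parallel_perspectives:
        return 4
    return 2  # assumption: no multi-agent benefit shown, stay single-agent


# Unstructured input, next action depends on it, one agent picks the tools.
print(recommend_level(Task(
    paths_enumerable=False,
    routing_deterministic_after_classification=False,
    needs_dynamic_tool_choice=True,
)))
```

Defaulting every field except `paths_enumerable` to the cheaper answer means a task only climbs a level when you explicitly claim it needs to.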

Step 3: Compare cost and reliability

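The reliability estimate below assumes that each agent succeeds 90% of the time, that failures are independent, and that every agent in the chain must succeed for the pipeline to succeed. Benchmark your own system to calibrate these numbers.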

def estimate_pipeline_reliability(
    agent_count: int,
    per_agent_reliability: float = 0.90
) -> float:
    """The math most multi-agent advocates skip."""
    return per_agent_reliability ** agent_count

# The reliability tax
for n in range(1, 6):
    r = estimate_pipeline_reliability(n)
    cost_multiplier = n * 1.2  # rough, includes context overhead
    print(f"  {n} agent(s): {r:.1%} reliability, ~{cost_multiplier:.1f}x cost")

#   1 agent(s): 90.0% reliability, ~1.2x cost
#   2 agent(s): 81.0% reliability, ~2.4x cost
#   3 agent(s): 72.9% reliability, ~3.6x cost
#   4 agent(s): 65.6% reliability, ~4.8x cost
#   5 agent(s): 59.0% reliability, ~6.0x cost

Three agents at 90% individual reliability give you 73% pipeline reliability. Five agents drop you to 59%. Every agent you add multiplies your failure surface. Retry logic and error handling can recover some of this, but they add latency and cost, which pushes the total overhead even higher.
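To put a number on the retry trade-off, here is a best-case sketch. It assumes every failure is detected, that each retry is an independent attempt with the same success rate, and that all agents must succeed. Real systems usually do worse, because a plausible-looking wrong answer is not detected and the same prompt on the same input tends to fail the same way. The function name and parameters are illustrative.

```python
def reliability_with_retries(
    agent_count: int,
    per_agent_reliability: float = 0.90,
    max_retries: int = 1,
) -> tuple[float, float]:
    """Return (pipeline reliability, expected LLM calls per agent)."""
    p_fail = 1 - per_agent_reliability
    # An agent fails only if the first attempt and every retry fail.
    per_agent_ok = 1 - p_fail ** (max_retries + 1)
    # Attempt i happens only if the previous i attempts all failed.
    expected_calls = sum(p_fail ** i for i in range(max_retries + 1))
    return per_agent_ok ** agent_count, expected_calls


for retries in range(3):
    r, calls = reliability_with_retries(5, 0.90, retries)
    print(f"  5 agents, {retries} retries: {r:.1%} reliability, "
          f"~{calls:.2f} calls per agent")

#   5 agents, 0 retries: 59.0% reliability, ~1.00 calls per agent
#   5 agents, 1 retries: 95.1% reliability, ~1.10 calls per agent
#   5 agents, 2 retries: 99.5% reliability, ~1.11 calls per agent
```

The expected call count looks cheap, but every retry that fires adds a full agent round-trip to that request's latency, and none of this helps with failures you cannot detect.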

What Practitioners Should Actually Do

Audit your existing systems. Take every AI component in your stack and place it on the slider. If anything sits at Level 3 or 4, ask: does the problem structure genuinely require this level of autonomy? Or did we default to multi-agent because a tutorial told us to?

Start at Level 0 and move up. For every new feature, begin with rules and regex. Move to Level 1 only when deterministic classification breaks. Move to Level 2 only when the agent needs dynamic tool selection. Move to Level 3 only when adversarial review measurably improves output quality. Move to Level 4 only when you need parallel independent reasoning.
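A rules-first classifier that escalates only when no rule matches might look like the sketch below. The patterns, labels, and the `llm_fallback` callable are hypothetical. In practice the fallback would wrap a single LLM call, and you would log how often it fires before deciding whether Level 1 is needed at all.

```python
import re
from typing import Callable, Optional

# Hypothetical routing rules for a support inbox; first match wins.
RULES = [
    (re.compile(r"\b(refund|chargeback|invoice)\b", re.I), "billing"),
    (re.compile(r"\b(password|2fa|log ?in)\b", re.I), "account"),
    (re.compile(r"\bunsubscribe\b", re.I), "unsubscribe"),
]


def classify_level0(text: str) -> Optional[str]:
    """Deterministic classification: a label, or None if no rule matches."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return None


def classify(text: str, llm_fallback: Callable[[str], str]) -> tuple[str, int]:
    """Try rules first; use the LLM (Level 1) only when the rules cannot decide.

    Returns (label, level_used) so escalation frequency can be measured.
    """
    label = classify_level0(text)
    if label is not None:
        return label, 0
    return llm_fallback(text), 1


# Stub fallback for demonstration; a real one would make one LLM call.
def stub_fallback(text: str) -> str:
    return "other"


print(classify("I was charged twice, please refund me", stub_fallback))
print(classify("Your product changed how I work", stub_fallback))
```

If the second element of the result is almost always 0 on real traffic, the rules are doing the job and the LLM call is an exception path, not the architecture.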

Drop the framework, use the SDK. The code examples above use the Anthropic SDK directly: no LangChain, no CrewAI, no AutoGen. Each example is under 50 lines. Frameworks add abstraction layers that hide complexity you will need to understand when things break at 2am. For most use cases, anthropic.Anthropic() and a while loop are all you need.

Measure before you architect. Before committing to multi-agent, run a baseline with a single well-prompted agent. Measure accuracy, cost, and latency. Then add the second agent. Does accuracy improve enough to justify the 2-3x cost increase and the new failure mode? Often it does not.
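One way to make that comparison explicit is to record the same three metrics for both architectures on the same evaluation set and apply a fixed threshold. In this sketch the names are illustrative and the sample numbers are invented for demonstration. The 10% default follows the rule in the takeaways, read here as a relative gain in accuracy, which is my interpretation.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one architecture on a shared eval set."""
    accuracy: float        # fraction correct, 0.0-1.0
    cost_per_task: float   # USD
    p95_latency_s: float


def worth_adding_agent(
    baseline: RunStats,
    candidate: RunStats,
    min_relative_gain: float = 0.10,
) -> bool:
    """True only if accuracy improves by at least min_relative_gain."""
    if baseline.accuracy <= 0:
        return candidate.accuracy > 0
    gain = (candidate.accuracy - baseline.accuracy) / baseline.accuracy
    return gain >= min_relative_gain


# Invented numbers for illustration only.
single_agent = RunStats(accuracy=0.82, cost_per_task=0.03, p95_latency_s=3.0)
two_agents = RunStats(accuracy=0.85, cost_per_task=0.07, p95_latency_s=6.5)

cost_ratio = two_agents.cost_per_task / single_agent.cost_per_task
print(f"Cost ratio: {cost_ratio:.1f}x, "
      f"add second agent: {worth_adding_agent(single_agent, two_agents)}")
```

Swap `accuracy` for whatever your primary metric is. What matters is that the threshold is fixed before you run the comparison, not after you see the results.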

Reserve Level 4 for its real use cases. Adversarial red-teaming. Parallel specialized search where latency matters. Simulation with genuinely independent actors. Compliance workflows where separation of concerns has regulatory value. These are real. Everything else should probably be a workflow.

Strategic Takeaways

Most production AI systems should be workflows with one or two LLM calls, not multi-agent architectures. Anthropic's own engineering guide says it. Their own products demonstrate it. Start simple.
Multi-agent adds 3-5x cost and multiplicative failure modes. Three agents at 90% reliability each give you 73% pipeline reliability. The question is not "can I build this with agents?" but "do I need to?"
The autonomy slider has five levels, and most production use cases belong at Level 1-2. Level 0 is rules. Level 1 is the workhorse. Level 2 is for genuine tool selection. Levels 3-4 are for adversarial review and parallel search — not for email classification.
Run a single-agent baseline first. If adding a second agent does not improve the primary metric by at least 10%, do not add it. The coordination cost, additional failure surface, and debugging complexity are only justified by measurable gains — not by architectural elegance.
The best architecture is the one you can debug at 2am when your on-call pager goes off. A single agent with structured logging beats a five-agent pipeline with distributed tracing every time. Operational simplicity is a feature, not a compromise.

Vikas Sah is the founder of Code Coin Cognition LLC, building AI-powered screening and analysis systems. He writes about agentic AI architecture from the practitioner's perspective — what actually works in production, not what looks good in a demo.