The best orchestrator might be the one you don't have to build. Claude Code is already an orchestrator. Stop building infrastructure around it.
I spent a weekend studying GPT-Researcher, an open-source project with 24,000+ GitHub stars. It builds an autonomous research agent that generates comprehensive reports with citations. The architecture is elegant: multiple specialized agents coordinate through LangGraph, parallel execution speeds up research, and quality gates ensure reliable output.
It uses LLM calls to decide which agent to run. It uses LLM calls to generate sub-queries. It uses LLM calls to select tools. It uses LLM calls to coordinate parallel work.
We're wrapping LLMs in infrastructure to teach them orchestration... when they can already orchestrate.
The orchestration overhead
Consider a typical agent workflow:
- LLM API call to analyze the task
- LLM API call to plan the approach
- LLM API call to select tools
- LLM API call to execute (finally, the actual work)
- LLM API call to verify results
Four out of five calls are orchestration overhead. The LLM produces structured output (JSON) that code parses in order to decide what to ask the LLM next, and what context to pass it.
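The five-step workflow above can be sketched as a single function. Everything here is illustrative: `llm_call` is a hypothetical stub standing in for a real LLM API (each invocation would be a network round-trip), and the canned JSON responses are placeholders.

```python
import json

# Hypothetical stand-in for a real LLM API; each call is a network round-trip.
def llm_call(prompt: str) -> str:
    canned = {
        "analyze": '{"task": "research"}',
        "plan": '{"steps": ["search", "summarize"]}',
        "select": '{"tools": ["web_search"]}',
        "verify": '{"ok": true}',
    }
    verb = prompt.split()[0].lower()
    return canned.get(verb, '{"result": "findings"}')

def run_task(query: str) -> dict:
    # Calls 1-3: orchestration overhead -- the LLM only emits JSON for code to parse.
    analysis = json.loads(llm_call(f"Analyze {query}"))
    plan = json.loads(llm_call(f"Plan {query}"))
    tools = json.loads(llm_call(f"Select tools for {query}"))
    # Call 4: the actual work.
    result = json.loads(llm_call(f"Execute {query} with {tools['tools']}"))
    # Call 5: more overhead.
    verified = json.loads(llm_call(f"Verify {result}"))
    return {"analysis": analysis, "plan": plan, "result": result, "ok": verified["ok"]}
```

One useful call, four round-trips of glue.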
This pattern is everywhere. LangChain and LangGraph provide graph-based workflows. AutoGen from Microsoft enables multi-agent conversations. CrewAI offers role-based agent coordination, used by Oracle, PwC, and NVIDIA. Each framework solves real problems: managing state, coordinating agents, handling failures.
But they all share the same assumption: the LLM needs infrastructure to orchestrate.
What GPT-Researcher does well
Credit where it's due: GPT-Researcher is excellent. According to its maintainers, it outperforms Perplexity, OpenAI's research tools, and other systems in benchmarks on citation quality, report quality, and information coverage.
The architecture is sophisticated:
- Multi-agent roles: Chief Editor orchestrates the process. Researchers investigate subtopics. Editors plan structure. Reviewers validate quality. Revisers incorporate feedback. Writers compile reports.
- Parallel execution: Research happens concurrently across subtopics. Multiple retrievers (Tavily, Google, Bing) run in parallel. Web scraping is asynchronous.
- Quality gates: Review cycles catch errors. Revision loops improve output.
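The parallel-execution piece amounts to fanning out async calls and merging the results. A minimal sketch, with the caveat that the retriever names and function signatures here are illustrative, not GPT-Researcher's actual API:

```python
import asyncio

# Hypothetical stand-in for one retriever's async search call (Tavily, Google, Bing).
async def search(retriever: str, subtopic: str) -> list[str]:
    await asyncio.sleep(0)  # placeholder for a real async HTTP request
    return [f"{retriever} result for {subtopic}"]

async def research(subtopics: list[str]) -> list[str]:
    # Every (retriever, subtopic) pair runs concurrently, then results are merged.
    tasks = [
        search(retriever, subtopic)
        for subtopic in subtopics
        for retriever in ("tavily", "google", "bing")
    ]
    results = await asyncio.gather(*tasks)
    return [hit for hits in results for hit in hits]
```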
The architecture tax
Here's what the orchestration layer requires:
```python
# agent_creator.py - LLM decides which agent to use
response = await llm.call(
    "Analyze this query and return JSON with agent type..."
)
agent_type = parse_json(response)  # error handling, retries

# query_processing.py - LLM generates sub-queries
response = await llm.call(
    "Generate search queries for this task..."
)
queries = parse_list(response)  # more parsing, more error handling

# tool_selector.py - LLM selects MCP tools
response = await llm.call(
    "Select relevant tools from this list..."
)
tools = parse_tool_selection(response)  # yet more parsing
```
Each step requires prompt engineering, output parsing, error handling, and retry logic. The orchestration layer is substantial.
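What "parsing, error handling, and retry logic" looks like in practice: a wrapper like the one below ends up attached to every orchestration step. This is a sketch of the pattern, not GPT-Researcher's actual code; the names are hypothetical.

```python
import json
import time

class MalformedOutput(Exception):
    """Raised when the LLM never produces parseable output."""

def call_with_retries(llm_call, prompt, parse=json.loads, max_retries=3):
    # The boilerplate every orchestration step needs: call, parse, back off, retry.
    last_error = None
    for attempt in range(max_retries):
        raw = llm_call(prompt)
        try:
            return parse(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            last_error = exc
            time.sleep(0.1 * (2 ** attempt))  # exponential backoff before re-asking
    raise MalformedOutput(f"no valid output after {max_retries} attempts") from last_error
```

Multiply that by every decision point in the pipeline and the orchestration layer grows fast.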
Claude Code's native capabilities
Here's what Claude Code provides out of the box:
Claude Code doesn't need an LLM call to decide what tools to use. It IS the LLM. It reasons about the task and uses tools directly, in the same context, without round-trips.
When you ask Claude Code to research a topic:
- It analyzes the query (no separate LLM call)
- It generates sub-queries (no separate LLM call)
- It executes parallel searches (native Task agents)
- It synthesizes results (no separate LLM call)
- It writes the report (native Write tool)
What required infrastructure now requires prompts.
The rewrite
I created Claude Researcher to test this. It's not a Python package. It's four commands and one skill file.
Commands:
The /research-team command implements the full multi-agent pattern:
```
Claude Code (Chief Editor)
    │
    ├── [PARALLEL] Research Agent 1 → findings
    ├── [PARALLEL] Research Agent 2 → findings
    └── [PARALLEL] Research Agent 3 → findings
            ↓
       Draft Report
            ↓
    Reviewer Agent → feedback
            ↓
    Reviser Agent → improved draft
            ↓ (repeat until quality gate passes)
       Final Report
```
Quality gates are defined in the command file. Review cycles repeat until scores meet thresholds. The multi-agent patterns from GPT-Researcher, expressed as instructions rather than code.
The difference in sub-query generation:
- GPT-Researcher:

  ```python
  prompt = f"""Write {max_iterations} google search queries...
  You must respond with a list of strings in the following format: [{example}].
  The response should contain ONLY the list."""
  response = await llm.call(prompt)
  queries = json.loads(response)  # parsing, error handling, retries
  ```

- Claude Researcher:

  ```
  Generate 5 search queries to research: "{query}"
  - Each query should explore a different angle
  - Include queries for recent information when relevant
  ```
Claude Code executes the queries directly. No parsing layer. No error handling for malformed output. The LLM produces the queries and uses them in the same context.
When orchestration frameworks make sense
This isn't a claim that frameworks are useless. They solve real problems:
- Production systems with strict SLAs. When you need guaranteed response formats, retry logic, circuit breakers, and observability, frameworks provide battle-tested infrastructure. Claude Code is conversational, not transactional.
- Non-Claude environments. If you're building on GPT-4, Gemini, or open-source models, Claude Code isn't available. Frameworks provide the coordination layer those environments lack.
- Complex state machines. Research is relatively linear: gather, synthesize, write. Workflows with branching logic, human-in-the-loop steps, or long-running state benefit from explicit orchestration.
- Team standardization. Frameworks enforce patterns. When multiple developers build agents, shared infrastructure ensures consistency. Markdown commands are flexible but less structured.
- Audit requirements. Enterprise deployments often need detailed logs of every decision. Frameworks with explicit orchestration make this easier than conversational interfaces.
The question isn't "frameworks vs no frameworks". It's "do you need the framework for THIS task?"
Try it yourself
Clone the repo:
```bash
git clone https://github.com/sderosiaux/claude-researcher
```
Copy to your Claude Code config:
```bash
cp commands/_*.md ~/.claude/commands/
cp skills/researcher.md ~/.claude/skills/
```
Run a research task:
```
/research "Impact of AI agents on software development" --depth=deep
/research-team "Comparison of vector databases" --quality=high
/lookup "What is Claude Opus 4.5 context window?"
```
No pip install. No API keys beyond what Claude Code already uses. No configuration files.
Claude Code IS the orchestrator, and a really good one
There's a pattern in software engineering: we build abstractions to solve problems, then build abstractions to manage our abstractions. Each layer adds capability but also complexity, configuration, and cognitive load.
What if the base layer already does what I need?
Claude Code is an LLM with native tool access, parallel execution, and context management. GPT-Researcher is infrastructure that makes LLMs do those things. For research tasks, the native capabilities are sufficient.
The best orchestrator might be the one you don't have to build.