I was reading through the repo diff when I realized I'd been at it for 40 minutes without looking up. Not because the code itself was mind-blowing — but because I was seeing the same architecture from two completely different angles in the same week, and something just clicked.
Last week I wrote about Freestyle: sandboxes so coding agents can execute stuff without torching your machine. This week Google open-sourced Scion. My first reaction was "great, another agent framework." But no. Scion is not the sandbox. Scion is the conductor. That distinction matters, and I think most devs reading about this right now aren't seeing it yet.
Here are the numbers, the code, and the opinion I formed after actually running it.
What Scion Is — and Why It's Not "Just Another LangChain"
Scion is a testbed (pay attention to that word: it's not a production framework, it's a research platform) that Google DeepMind open-sourced for experimenting with multi-agent orchestration. The repo lives on GitHub under google-deepmind/scion and the associated paper is from the DeepMind team.
First thing I did was clone it and read the README without googling anything else. I wanted the raw impression.
# Clone and see what's inside
git clone https://github.com/google-deepmind/scion
cd scion
tree -L 2
What I found is not a "do this and it works" kind of thing. It's a research architecture. It has components for defining agents, coordinating them, and measuring their behavior on composite tasks. The focus is on evaluation and reproducibility, not on shipping to production tomorrow.
And honestly, that's refreshing. Because the agent ecosystem in 2025 is drowning in people selling you "line up three agents and you've solved everything." Scion comes from the opposite direction: "let's actually measure what happens when agents coordinate."
The core architecture has three concepts:
- Agent: a unit that receives observations and produces actions
- Environment: the context where agents operate (can be code, text, APIs)
- Orchestrator: the component that decides who talks to whom, when, and with what information
That third one is what hooked me. Most agent frameworks I've seen up to now treat orchestration as an afterthought. In Scion, it's the object of study.
What I Ran, What I Measured, What Surprised Me
I set up the environment in a Docker container (with a Railway deploy to come later if I want to share it) and ran the basic coordination examples between two agents.
# Simplified example of how Scion defines coordination
# (adapted from the actual repo code)
from scion import Agent, Orchestrator, Environment

# Define two agents with distinct roles
planner = Agent(
    name="planner",
    role="break tasks into subtasks",
    model="gemini-pro",  # or any compatible backend
)
executor = Agent(
    name="executor",
    role="execute concrete subtasks",
    model="gemini-pro",
)

# The orchestrator defines the communication flow
# This is what sets Scion apart: the orchestrator is a first-class object
orchestrator = Orchestrator(
    agents=[planner, executor],
    # The policy defines WHEN and HOW agents pass information to each other
    policy="sequential_with_feedback",
    max_rounds=5,
)

# The environment is where all of this operates
env = Environment(
    task="analyze this code and propose refactors",
    context={"codebase": "..."},
    # Metrics that Scion tracks automatically
    metrics=["completion_rate", "round_count", "token_usage"],
)

result = orchestrator.run(env)
print(result.metrics)  # here's the real data
What caught my attention: Scion gives you coordination metrics out of the box. How many rounds it took to reach an answer. How many times the planner re-sent to the executor. Where the loop broke down. I didn't see that in LangGraph, didn't see it in CrewAI, didn't see it in AutoGen with the same granularity.
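To make that concrete, here's the kind of post-run check I ended up writing on top of those metrics. The field names (`round_count`, `planner_resends`, `completion_rate`) match what I saw in my runs, but treat them as assumptions about the metrics dict, not guaranteed API:

```python
# Minimal post-run sanity check on coordination metrics.
# Field names (round_count, planner_resends, completion_rate) are
# assumptions based on my runs; adapt to what your version emits.

def coordination_flags(metrics: dict, max_rounds: int = 5) -> list[str]:
    """Return human-readable warnings when coordination looks unhealthy."""
    flags = []
    if metrics.get("round_count", 0) >= max_rounds:
        flags.append("hit max_rounds: the loop likely never converged")
    if metrics.get("planner_resends", 0) > metrics.get("round_count", 1):
        flags.append("re-sends exceed rounds: context is not sticking")
    if metrics.get("completion_rate", 1.0) < 0.8:
        flags.append("completion under 80%: recheck the single-agent baseline")
    return flags

# The numbers from my 2-agent run below produce no flags
run = {"completion_rate": 0.89, "round_count": 3, "planner_resends": 1}
print(coordination_flags(run))  # → []
```

Three dumb checks, but they caught every degenerate run I hit while experimenting.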
I ran the example benchmark with a code analysis task (something similar to what I did with codebase visualization) and here's what came out:
Task: analyze circular dependencies in a 50-file codebase
Single agent (baseline):
- Completion: 67%
- Tokens: 12,400
- Time: 23s
Scion 2 agents (planner + executor):
- Completion: 89%
- Tokens: 18,200
- Time: 41s
- Coordination rounds: 3
- Planner re-sends: 1
Better completion, more tokens, more time. Exactly what I expected. The interesting question is: when is the extra cost worth it? That's the question Scion is designed to answer systematically.
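One quick way to frame that question with the numbers above: normalize token spend by completion rate and compare what each completed percentage point costs. Back-of-the-envelope arithmetic, nothing Scion-specific:

```python
# Back-of-the-envelope: tokens spent per completion percentage point,
# using the benchmark numbers from the runs above.

def tokens_per_point(tokens: int, completion_pct: float) -> float:
    """Cost of each percentage point of task completion."""
    return tokens / completion_pct

single = tokens_per_point(12_400, 67)  # single-agent baseline
multi = tokens_per_point(18_200, 89)   # planner + executor

print(f"single agent: {single:.0f} tokens/point")  # → 185
print(f"two agents:   {multi:.0f} tokens/point")   # → 204
```

By this measure, the two-agent setup paid roughly 10% more per completion point to buy a 22-point jump in completion. Whether that trade is worth it depends entirely on your task.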
Freestyle vs Scion: The Confusion Worth Clearing Up
When I wrote about Freestyle last week, the focus was: how do you get an agent to execute code without breaking your environment? Freestyle solves isolation. The sandbox. Safe execution.
Scion solves something completely different: how do you coordinate multiple agents so the result is better than a single one? The orchestration. The communication protocol. The policy for when to pass context.
They're different layers of the same stack. If you're building a serious multi-agent system in 2025, you need both:
┌─────────────────────────────────────────────┐
│ Your application / product │
├─────────────────────────────────────────────┤
│ SCION (or similar): orchestration │
│ who talks to whom, when, how │
├─────────────────────────────────────────────┤
│ FREESTYLE (or similar): sandbox │
│ safe execution, isolation, resources │
├─────────────────────────────────────────────┤
│ Models / LLM APIs │
│ Gemini, Claude, GPT, local │
└─────────────────────────────────────────────┘
What I see a lot of people doing is skipping the middle and bottom layers — building homemade orchestration without measuring anything, and running agent code directly on the server. That's a ticking bomb. I learned this the hard way when I took down a production server with rm -rf at 18 (yeah, that server taught me more than any course ever did). Code agents without a sandbox are the rm -rf of 2025.
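To make the "sandbox in between" point concrete: the absolute floor is never executing agent output in your own process. Here's a crude isolation-lite sketch using a subprocess with a timeout and a throwaway working directory. To be clear, this is not a real sandbox (no network or filesystem isolation) and no substitute for Freestyle, E2B, or a container:

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: int = 5) -> str:
    """Run agent-generated Python in a child process with a timeout and
    a throwaway working directory. NOT a real sandbox: no network or
    filesystem isolation. Use a proper sandbox layer in production."""
    with tempfile.TemporaryDirectory() as scratch:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=scratch,          # keep file writes off your tree
            capture_output=True,
            text=True,
            timeout=timeout,      # kill runaway loops
        )
    return proc.stdout

print(run_untrusted("print(2 + 2)"))  # → 4
```

Even this ten-line wrapper would have saved my 18-year-old self's server.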
The Mistakes You're Going to Make with Scion (I Measured Them)
1. Treating it like a production framework
It's not. The README says it explicitly, but nobody reads READMEs. It's a research testbed. If you ship it to production tomorrow, it'll blow up in your face the first time Google reworks the API for the next iteration of the research.
2. Assuming more agents = better results
My benchmarks showed that with 3+ agents on simple tasks, the completion rate actually dropped. Coordination has cognitive overhead. A well-prompted single agent for a simple task beats three poorly coordinated agents every time.
# Anti-pattern: throwing agents at a problem because you can
orchestrator = Orchestrator(
    agents=[researcher, planner, executor, reviewer, validator],  # ❌
    task="write a 3-line email"
)

# Better: a well-defined single agent for simple tasks
result = single_agent.run("write a 3-line email")  # ✅
3. Ignoring the coordination metrics
The most valuable feature in Scion isn't that it runs agents — it's that it tells you how they're coordinating. If you're not watching round_count and re-sends, you're using Scion like it's LangChain and you're throwing away 80% of its value.
4. Not versioning your orchestration policies
Changing the orchestration policy (sequential, parallel, hierarchical) changes results as much as changing the model. Treat it like code. Commit it. Linux teaches you that everything is a file — in Scion, everything is a versionable policy.
My Take: Where This Is Actually Heading
Here's the part I haven't seen written anywhere else.
Scion, Freestyle, LangGraph, CrewAI, AutoGen — they're all solving pieces of the same problem from different angles. And the industry is trying to pick "the winner" like this is a web framework war. It won't work that way.
What I think is going to happen, and I'm saying this with real benchmark data in my hands:
Orchestration is going to become infrastructure, not application code. Just like you don't write your own process scheduler (the kernel handles it, as you saw if you read about ELF and dynamic linking), you're not going to write your own agent orchestration. It'll be a managed service.
The differentiator will be the policies, not the models. GPT-4 vs Gemini vs Claude will matter less and less. How you coordinate multiple calls, how you pass context, when you abort a loop — that's going to be the moat.
Coordination metrics are going to be as important as model metrics. Today everyone measures LLM accuracy and latency. In 18 months you'll be measuring round efficiency, context propagation fidelity, coordinator overhead. Scion is the first thing I've seen that takes that seriously at a framework level.
And for those asking whether any of this ties into quantum computing: no, not yet. The quantum timeline for web devs is much further out than the timeline for coordinated agents. The latter is happening right now.
FAQ: What You Actually Want to Know About Scion and Agent Orchestration
Does Scion replace LangGraph or CrewAI?
No. Scion is a research testbed from Google DeepMind, not a production framework. LangGraph and CrewAI have ecosystems, integrations, and production-ready support that Scion doesn't pretend to have. What Scion brings that the others don't is a systematic focus on coordination metrics and experimental reproducibility. You can use Scion's concepts to improve how you design your orchestration in LangGraph — that's actually a great use of it.
When does it make sense to use multiple agents instead of one?
In my benchmarks, multi-agent orchestration paid off when the task had clearly separable subtasks requiring different capabilities — for example, one agent searching for information and another reasoning over it. For homogeneous or simple tasks, a well-prompted single agent wins on efficiency every time. Practical rule of thumb: if you can write the steps as a flat ordered list with no branching, use one agent. If the task has branches and requires different "modes of thinking," that's when coordination starts to make sense.
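That rule of thumb is mechanical enough to encode. A toy version, where the Step representation is purely illustrative (it's mine, not a Scion API):

```python
# Toy encoding of the rule of thumb: flat homogeneous steps -> one agent;
# branching or mixed capabilities -> consider orchestration.
# The Step representation here is illustrative, not a Scion API.

from dataclasses import dataclass

@dataclass
class Step:
    description: str
    capability: str          # e.g. "search", "reason", "codegen"
    branches: bool = False   # does the next step depend on this outcome?

def needs_orchestration(steps: list[Step]) -> bool:
    distinct_capabilities = {s.capability for s in steps}
    has_branching = any(s.branches for s in steps)
    return has_branching or len(distinct_capabilities) > 1

email = [Step("draft", "write"), Step("trim to 3 lines", "write")]
audit = [
    Step("find circular imports", "search", branches=True),
    Step("propose refactors", "reason"),
]

print(needs_orchestration(email))  # → False: one agent is enough
print(needs_orchestration(audit))  # → True: mixed capabilities + branching
```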
Is it safe to let orchestrated agents execute code in production?
This is what worries me most about the current excitement. Orchestration (Scion) and sandboxing (Freestyle, E2B, etc.) are separate layers. Having good orchestration doesn't give you safe execution. You need both. Never let coordinated agents execute code directly on your production server without a sandbox in between. The blast radius of a multi-agent error is way bigger than a single-agent one.
What language do I need to use Scion?
Python. The entire repo is Python. If you're coming from a Next.js/TypeScript stack like me, you'll need to run it in a separate service or a container. There's no official JavaScript/TypeScript SDK yet. What you can do is expose the orchestration as a Python API and consume it from your Next.js app — which is exactly how I set it up in my experiment.
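For reference, that wrapper can be as small as one POST endpoint. A stdlib-only sketch, where run_orchestration is a stub standing in for the real orchestrator.run(env) call:

```python
# Minimal stdlib sketch of exposing an orchestration run over HTTP so a
# Next.js (or any) frontend can call it. run_orchestration is a stub;
# swap in the real Scion orchestrator there.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_orchestration(task: str) -> dict:
    # Stub: replace with orchestrator.run(env) and return its metrics.
    return {"task": task, "completion_rate": 0.89, "round_count": 3}

class OrchestrateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        task = json.loads(body)["task"]
        payload = json.dumps(run_orchestration(task)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# Port 0 picks a free port; use a fixed port in practice.
server = HTTPServer(("127.0.0.1", 0), OrchestrateHandler)
# server.serve_forever()  # uncomment to run; the frontend POSTs {"task": ...}
```

In practice you'd reach for FastAPI or Flask instead of raw http.server, but the contract is the same: the frontend POSTs a task, and gets the result plus coordination metrics back as JSON.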
How does Scion integrate with Google's Gemini models?
Scion is designed to be model-agnostic, but the smoothest integration is with Gemini through the Vertex AI API. In the repo examples, the default backend uses Gemini Pro. You can swap it for Claude or GPT-4 with a wrapper, but the evaluation tooling is most tuned for Gemini. If you already have Google Cloud credits, that's the fastest path to experimenting.
Is it worth learning if I'm just getting started with AI agents?
Honestly, not as a first step. If you're new to agents, start with something that has more end-user documentation — LangChain, CrewAI, even OpenAI's Responses API. Scion is valuable once you have enough experience to read research code and extract concepts, not when you're learning the fundamentals. Once you understand how a basic agent works, come back to Scion to understand how to measure what's happening. That part is gold.
The Conclusion Nobody Wants to Hear
The agent ecosystem in 2025 is at exactly the same moment as web frameworks were in 2012. There are ten different things doing similar stuff, nobody knows which one will survive, and everyone is overselling their solution as "production-ready."
Scion is not the definitive answer. But it's the first thing I've seen that takes measuring coordination seriously instead of assuming more agents is automatically better. That alone makes it worth your time.
My current stack for experimenting: Scion for designing and measuring orchestration, Freestyle for execution sandboxes, Railway for deploying, and a healthy dose of skepticism for anything that promises "autonomous agents" without showing you the numbers.
If you run it this week, let me know what you find. I'm building out comparative benchmarks with real development tasks and I genuinely want to see if the numbers I got replicate in other setups.