The idea that started everything
Some weeks ago, I was thinking about Infrastructure as Code.
The reason IaC became so widely adopted is not because it's technically superior to clicking through a cloud console. It's because it removed the barrier between intent and execution. You write what you want, not how to do it. A DevOps engineer doesn't need to understand the internals of how an EC2 instance is provisioned — they write a YAML file, and the machine figures it out.
I started wondering: why doesn't this exist for AI agents?
If I want to run a multi-agent workflow today, I have two choices. I learn Python and use LangGraph or CrewAI, or I build my own tooling from scratch. Neither option is satisfying. The first forces me into an ecosystem and a language I might not want. The second means rebuilding primitives every time.
What if I could write a YAML file that described what I wanted — which agents, which tools, which LLM providers — and a runtime would just handle the rest? What if a non-developer could read that file and understand what the system does? What if I didn't have to understand how an agent works internally before I could use one?
That question became Routex.
Why Go, not Python
Nearly every AI agent framework today is written in Python. LangChain, LangGraph, CrewAI, AutoGen — Python all the way down. And for good reason: Python has the richest ML ecosystem, the most tutorials, and the lowest barrier to entry for data scientists.
But I'm a Go developer. And I kept thinking: Go should be a natural fit for this.
Here's why. An AI agent is fundamentally a concurrent system. An agent waits for an LLM response, executes tools, waits for tool results, calls the LLM again. Multiple agents run in parallel, passing results to each other through a dependency graph. This is exactly what Go was designed for.
Goroutines are cheap enough that you can run one per agent without thinking about thread pool sizing. Channels give you typed, safe communication between agents without shared state. The context package gives you cancellation and timeout propagation that flows naturally through the entire call stack. You get a single, statically compiled binary you can deploy anywhere without a runtime, a virtualenv, or a requirements.txt.
Go already had everything the problem needed — it just didn't have the framework yet.
So I built it.
What Routex looks like to a user
The core idea is that you should be able to describe an entire multi-agent crew in a YAML file, run it with a single command, and get results — without writing a single line of Go.
Here's what that looks like:
agents.yaml
```yaml
runtime:
  name: "research-crew"
  llm_provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key: "env:ANTHROPIC_API_KEY"

task:
  input: "Compare the top Go web frameworks in 2026"

agents:
  - id: "researcher"
    role: "researcher"
    goal: "Find detailed information about Go web frameworks"
    tools: ["web_search", "wikipedia"]
  - id: "writer"
    role: "writer"
    goal: "Write a clear, structured report from the research"
    depends: ["researcher"]

tools:
  - name: "web_search"
  - name: "wikipedia"
```
Run it:
```bash
routex run agents.yaml
```
That's the entire user experience for the common case. The researcher runs first, uses web search and Wikipedia to gather information, then the writer agent picks up those results and produces a report. The dependency is declared — depends: ["researcher"] — and the runtime handles the ordering automatically.
A non-developer can read this file and understand exactly what it does. A developer can extend it with custom tools, different LLM providers per agent, Redis-backed memory, and OpenTelemetry tracing — all from YAML, all without touching the runtime code.
The technical core: goroutines, channels, and a topological scheduler
Under the YAML surface, Routex is built on three Go primitives: goroutines, channels, and a topological sort.
Each agent is a long-lived goroutine. It sits waiting on an Inbox channel. The scheduler sends it a task, it runs its thinking loop — calling the LLM, executing tools, calling the LLM again — and sends its result back through an Output channel. This model maps so naturally onto Go that the core agent loop is less than fifty lines.
The scheduler uses Kahn's algorithm to determine execution order. It builds a dependency graph from your YAML, identifies which agents have no dependencies, and runs them all in parallel as the first "wave." When that wave completes, it identifies agents whose dependencies are now satisfied and runs those. This continues until all agents have run.
In practice, this means independent agents run concurrently without you having to think about it. If you have three researcher agents gathering data about different topics, they all run at the same time. The writer agent waits until all three are done, then synthesizes their results in a single pass.
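The wave logic itself is small. Here is a sketch of Kahn's algorithm over a depends map — the `waves` helper is hypothetical, not the actual scheduler code:

```go
package main

import "fmt"

// waves groups agents into parallel execution waves using Kahn's
// algorithm. Input maps each agent ID to its dependencies; each
// output wave contains agents whose dependencies were all satisfied
// by earlier waves. Returns nil if the graph contains a cycle.
func waves(deps map[string][]string) [][]string {
	indegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for agent, ds := range deps {
		indegree[agent] = len(ds)
		for _, d := range ds {
			dependents[d] = append(dependents[d], agent)
		}
	}

	var out [][]string
	for len(indegree) > 0 {
		// Collect the current zero-indegree set as one wave.
		var wave []string
		for agent, n := range indegree {
			if n == 0 {
				wave = append(wave, agent)
			}
		}
		if len(wave) == 0 {
			// No runnable agents left: the remainder is a cycle.
			return nil
		}
		// Remove the wave and unlock its dependents.
		for _, agent := range wave {
			delete(indegree, agent)
			for _, d := range dependents[agent] {
				if _, ok := indegree[d]; ok {
					indegree[d]--
				}
			}
		}
		out = append(out, wave)
	}
	return out
}

func main() {
	w := waves(map[string][]string{
		"r1":     nil,
		"r2":     nil,
		"writer": {"r1", "r2"},
	})
	fmt.Println(len(w)) // 2: [r1 r2] run together, then [writer]
}
```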
The thing I didn't plan for: what happens when an agent fails
I finished the scheduler and felt good about it. Then I realised I had a problem.
LLM calls fail. They time out, hit rate limits, return malformed responses. If an agent fails halfway through a crew run, every agent that depends on it gets the wrong answer — or no answer at all. The writer agent would try to synthesise results that don't exist. The whole run corrupts silently.
I needed a way to handle failure that was more principled than wrapping everything in a retry loop.
I started reading about how Erlang handles this problem. Erlang was built by Ericsson in the 1980s for telephone switches — systems that cannot go down. Their solution was the supervision tree: every process is watched by a supervisor, and when a process crashes, the supervisor decides what to do based on a policy. The philosophy is "let it crash" — don't write defensive code trying to handle every possible failure, just let things fail fast and trust the supervisor to recover cleanly.
This maps perfectly onto agents. An agent fails — the supervisor checks its policy:
- one_for_one — restart only this agent, leave the others running
- one_for_all — restart the entire crew
- rest_for_one — restart this agent and everything that depends on it
The supervisor also tracks a restart budget. If an agent crashes three times within one minute, the supervisor stops trying and declares it permanently failed rather than looping forever burning API tokens.
In Routex, you configure this in one line:
```yaml
agents:
  - id: "researcher"
    role: "researcher"
    restart: "one_for_one"
```
The bug that taught me everything about channel protocols
Here is a story about a bug I will not forget.
I had finished the supervisor. Restart policies were working. The supervisor correctly restarted failed agents. I was feeling very good about myself.
Then I ran a test with a researcher agent that was configured to fail on its first attempt and recover on the second. I kicked it off and watched the logs. The supervisor saw the failure. It applied the one_for_one policy. It restarted the agent goroutine. The logs said:
```
supervisor: agent "researcher" restarted
agent "researcher": waiting for message
```
And then... nothing. The application just sat there.
No timeout error. No panic. No LLM calls. Just silence.
I stared at my terminal for a full two minutes assuming the LLM was being slow. I went and made tea. I came back. Still nothing. I started wondering if the Anthropic API was down. I checked the Anthropic status page. Everything was fine.
I added more logging. The agent was alive — it was sitting in its select loop, genuinely waiting for a message on its Inbox channel. It had been restarted correctly. The problem was that the scheduler had no idea.
The scheduler had sent the original task to the agent's Inbox before the failure. The agent crashed mid-run. The supervisor restarted a fresh agent goroutine. That fresh goroutine was now sitting patiently waiting for a new task to arrive on the channel — which it never would, because from the scheduler's perspective, the task had already been sent. The scheduler was blocked waiting for a result from the old goroutine that no longer existed.
Two goroutines. Both alive. Both waiting. Neither knowing the other was waiting. A perfect deadlock dressed up as a slow LLM.
The fix was the FailureReport / Decision protocol. The scheduler now never moves on after a failure until the supervisor explicitly tells it what to do:
```go
// Scheduler sends this when an agent fails
type FailureReport struct {
	AgentID string
	Err     error
	Reply   chan<- Decision
}

// Supervisor responds with this
type Decision struct {
	AgentID string
	Retry   bool
	Err     error
}
```
When an agent fails, the scheduler sends a FailureReport and blocks on the Reply channel. The supervisor restarts the agent, then sends Decision{Retry: true} back. The scheduler receives this, re-sends the original task to the agent's Inbox, and waits for the result again.
Now the scheduler always knows. The agent always gets its message. And I no longer spend time checking the Anthropic status page when my own code is broken.
Parallel tool calls: the LLM asked for three tools at once
When a language model responds with a tool call, most agent frameworks execute it, wait for the result, then call the LLM again. One tool at a time, sequentially.
But modern LLMs can request multiple tools in a single response when those tools are independent. Claude might decide it needs to search the web, read a file, and query Wikipedia simultaneously — and return all three requests in one response. Running them sequentially wastes time.
In Routex, when the LLM returns multiple tool calls, they all execute concurrently:
```go
var wg sync.WaitGroup
results := make([]toolResult, len(toExecute))
for i, tc := range toExecute {
	wg.Add(1)
	go func(i int, tc llm.ToolCallRequest) {
		defer wg.Done()
		out, err := registry.Execute(ctx, tc.ToolName, tc.Input)
		results[i] = toolResult{output: out, err: err}
	}(i, tc)
}
wg.Wait()
```
All results are appended to history in order before the next LLM call. From the LLM's perspective, it asked for three tools and got three results — the parallelism is invisible to it, but the wall-clock time is the slowest single tool rather than the sum of all three.
Calling LLM APIs directly with net/http
Every LLM SDK is just an HTTP client under the hood. Both the Anthropic and OpenAI adapters in Routex use net/http directly — no anthropic-sdk-go, no go-openai in go.mod. The wire format is straightforward JSON over HTTP.
```go
req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.baseURL+"/v1/messages", body)
req.Header.Set("x-api-key", c.apiKey)
req.Header.Set("anthropic-version", "2023-06-01")
req.Header.Set("content-type", "application/json")
```
That is the entire Anthropic adapter setup. Removing the SDKs dropped several megabytes of transitive dependencies from the binary and made the HTTP layer completely transparent — no SDK abstractions, no version mismatches, no wrapping errors in SDK-specific types. When the API changes, you update a struct. That's it.
Multi-LLM crews: different models for different jobs
One pattern that emerges naturally from the YAML-driven design is using different LLM providers for different agents:
```yaml
agents:
  - id: "researcher"
    role: "researcher"
    llm:
      provider: "anthropic"
      model: "claude-haiku-4-5-20251001"  # fast, cheap
  - id: "writer"
    role: "writer"
    llm:
      provider: "openai"
      model: "gpt-4o"                     # more capable
  - id: "critic"
    role: "critic"
    llm:
      provider: "ollama"
      model: "llama3"                     # local, free
```
Each agent has its own LLM configuration. You can run Claude for research, GPT-4o for writing, and a local Llama model for review — all in the same crew, all declared in YAML.
MCP: connecting to the entire ecosystem
Model Context Protocol is Anthropic's open standard for connecting LLMs to external tools via JSON-RPC. Any MCP-compatible server exposes a standard interface that Routex can connect to at startup:
```yaml
tools:
  - name: "mcp"
    extra:
      server_url: "http://localhost:3000"
      server_name: "github"
      header_Authorization: "env:GITHUB_TOKEN"
```
Routex connects, calls tools/list to discover everything the server exposes, and registers each tool automatically. From that point, agents use them exactly like built-in tools.
What the build looked like
The project went through several distinct phases, each with its own surprises.
The YAML config and basic agent loop came together relatively quickly. The topological scheduler took longer — mostly spent making sure cycles were detected cleanly and parallel waves executed correctly. The supervisor was the hardest part by far — not the restart logic itself, but making the channel protocol between the scheduler and supervisor airtight. The deadlock story above is the most vivid evidence of that difficulty.
Parallel tool calls came late in the project, after I noticed the LLM was sometimes requesting multiple tools in one response and the runtime was silently discarding all but the first. Once I understood the pattern, the implementation was clean — but the change rippled through the history format, both LLM adapters, and the agent loop simultaneously.
What I'd do differently: Start with the supervision model. I bolted it on after the scheduler was built, which meant retrofitting the channel protocol. If I were starting again, I'd design the scheduler–supervisor communication contract first and build everything else around it. The deadlock I described above would likely never have happened.
Try it
Routex v1.0.1 is available now:
```bash
go get github.com/Ad3bay0c/routex
```
Or install the CLI:
```bash
go install github.com/Ad3bay0c/routex/cmd/routex@latest
```
Scaffold a new project:
```bash
routex init my-crew
cd my-crew
```
Edit the generated agents.yaml, copy .env.example to .env, fill in your API keys, then run:
```bash
routex run agents.yaml
```
If you're a Go developer who has been watching the AI agent ecosystem from the sidelines — Routex is for you.
Routex is open source under the MIT License. Source, examples, and documentation: https://github.com/Ad3bay0c/routex
