DEV Community: JackChen

A 100% Local Multi-Agent Team in TypeScript (Ollama + Gemma)

JackChen — Thu, 02 Jul 2026 16:07:45 +0000

The API bill is a data-exfiltration receipt

Every call your agents make to a hosted model is two things at once: a line on an invoice, and a copy of your input landing on someone else's server. For a lot of AI features that trade is fine. For some — internal logs, customer records, anything under a compliance regime, or just a side project you don't want metered — it isn't.

The usual answer is "run the model locally," and people assume that means the worker agents run locally while something smarter in the cloud still does the thinking. This post goes further: the coordinator — the agent that reads a goal, decomposes it into a task graph, and dispatches the workers — is itself a ~5B model running on your laptop. No cloud in the loop at all. Zero API cost, and the data never leaves the machine.

I'll show the one line that makes local models first-class, build a fully-local team on Gemma 4, prove the local coordinator actually decomposed the goal (rather than the framework quietly covering for it), and then be honest about the two things that bite: RAM and a thinking-model quirk. At the end, a hybrid variant — cloud coder, local reviewer — with a failure I reproduced and the exact fix.

Everything below was run on an Apple M1 / 16 GB, gemma4:e2b over Ollama. The numbers are from one measured run, not a brochure.

The one move: point `baseURL` at a local endpoint

open-multi-agent talks to models through the OpenAI-compatible protocol, and the common local runtimes (Ollama, vLLM, LM Studio, the llama.cpp server) expose that protocol too. So "use a local model" is not an integration; it's three fields on an agent config: reuse the openai provider, set model, and point baseURL at the local server.

import { OpenMultiAgent } from '@open-multi-agent/core'
import type { AgentConfig } from '@open-multi-agent/core'

const researcher: AgentConfig = {
  name: 'researcher',
  model: 'gemma4:e2b',
  provider: 'openai',                    // OpenAI-compatible protocol, not the OpenAI cloud
  baseURL: 'http://localhost:11434/v1',  // Ollama's OpenAI-compatible endpoint
  apiKey: 'ollama',                      // placeholder; Ollama ignores it, the OpenAI SDK just needs a non-empty string
  systemPrompt: `You are a system researcher. Use bash to run non-destructive,
read-only commands (uname -a, sw_vers, df -h, uptime, etc.) and report results.`,
  tools: ['bash', 'file_write'],
  maxTurns: 8,
}

apiKey is a placeholder on purpose: there's no key, but the SDK requires a non-empty string. The baseURL is the whole trick, and it works against any of these — pick your runtime, keep the rest of the code identical:

Local runtime	OpenAI-compatible `baseURL`
Ollama	`http://localhost:11434/v1`
vLLM	`http://localhost:8000/v1`
LM Studio	`http://localhost:1234/v1`
llama.cpp server	`http://localhost:8080/v1`
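If you switch runtimes often, the table collapses into a small lookup. This is a minimal sketch, not part of open-multi-agent: `localModel`, `LocalRuntime`, and `LOCAL_BASE_URLS` are names made up for illustration, and the ports are the defaults from the table, so adjust them if your server listens elsewhere.

```typescript
// Hypothetical helper; none of these names come from open-multi-agent.
type LocalRuntime = 'ollama' | 'vllm' | 'lmstudio' | 'llamacpp'

// Default endpoints, copied from the table above.
const LOCAL_BASE_URLS: Record<LocalRuntime, string> = {
  ollama: 'http://localhost:11434/v1',
  vllm: 'http://localhost:8000/v1',
  lmstudio: 'http://localhost:1234/v1',
  llamacpp: 'http://localhost:8080/v1',
}

interface LocalModelFields {
  provider: 'openai' // OpenAI-compatible protocol, not the OpenAI cloud
  model: string
  baseURL: string
  apiKey: string
}

// Returns the fields that point an agent at a local OpenAI-compatible server.
function localModel(runtime: LocalRuntime, model: string): LocalModelFields {
  return {
    provider: 'openai',
    model,
    baseURL: LOCAL_BASE_URLS[runtime],
    apiKey: runtime, // placeholder; it only has to be a non-empty string
  }
}

const researcherFields = localModel('ollama', 'gemma4:e2b')
```

Spread the result into an agent config (`{ name: 'researcher', ...researcherFields }`) and the rest of the config stays identical across runtimes.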

One environment gotcha before you run anything: if you have an HTTP_PROXY set, exempt localhost with no_proxy=localhost, or the SDK will try to route your local model calls through the proxy and hang.

A team where even the coordinator is local

The example ships two ways to run the same two-role team (a researcher that gathers system facts with bash, a summarizer that writes them up). Both run 100% on gemma4:e2b.

Part 1 — you own the DAG (runTasks). You declare the tasks and their dependencies explicitly; the framework schedules them:

const orchestrator = new OpenMultiAgent({
  defaultModel: 'gemma4:e2b',
  maxConcurrency: 1,          // keep requests serial: a 16 GB laptop has no headroom for parallel local calls
})

const team = orchestrator.createTeam('explicit', {
  name: 'explicit',
  agents: [researcher, summarizer],
  sharedMemory: true,
})

const result = await orchestrator.runTasks(team, [
  { title: 'Gather system information', assignee: 'researcher', description: '...' },
  { title: 'Summarize the report',      assignee: 'summarizer', description: '...',
    dependsOn: ['Gather system information'] },
])

Part 2 — the local model owns the DAG (runTeam). This is the real claim. You hand the team a one-line goal and let the local Gemma act as coordinator: it decides the decomposition, the assignees, and the dependencies.

// The coordinator is auto-created by runTeam(). These `default*` fields are what
// keep it local too — they point the auto-created coordinator at Ollama, not the cloud.
const orchestrator = new OpenMultiAgent({
  defaultModel: 'gemma4:e2b',
  defaultProvider: 'openai',
  defaultBaseURL: 'http://localhost:11434/v1',
  defaultApiKey: 'ollama',
  maxConcurrency: 1,
})

const team = orchestrator.createTeam('auto', {
  name: 'auto',
  agents: [researcher, summarizer],
  sharedMemory: true,
})

// One natural-language goal; the local Gemma coordinator decomposes it and dispatches.
const result = await orchestrator.runTeam(
  team,
  "Check this machine's Node.js version, npm version, and OS info, then write a short Markdown report.",
)

For that to work, a 5.1B quantized model has to do the two things local models are notoriously bad at: emit a syntactically valid JSON task decomposition, and make real tool calls. It did both.

Proof it was the model, not the fallback

Here's the subtlety that separates a real result from a demo that only looks like one. runTeam has a safety net: if the coordinator's decomposition fails to parse, it silently falls back to a trivial one-task-per-agent plan. A green checkmark alone tells you nothing — you have to prove the plan came from the model.

This is the decomposition gemma4:e2b actually produced, captured raw from the model and reproduced through the framework's planOnly path (valid json fence, strict JSON.parse succeeds):

[
  {
    "title": "Gather System Information",
    "description": "Execute necessary bash commands (e.g., uname -a, sw_vers, node -v, npm -v) to collect the Node.js version, npm version, and OS information from the machine.",
    "assignee": "researcher",
    "dependsOn": []
  },
  {
    "title": "Generate Markdown Report",
    "description": "Read the collected system information and compile it into a concise Markdown summary report.",
    "assignee": "summarizer",
    "dependsOn": ["Gather System Information"]
  }
]
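For reference, here is roughly what "valid json fence, strict JSON.parse succeeds" means as a check. This is a sketch of the idea, not the framework's own parser: `parseTaskPlan` and `isTaskSpec` are hypothetical names, and the four fields checked are simply the ones visible in the plan above.

```typescript
interface TaskSpec {
  title: string
  description: string
  assignee: string
  dependsOn: string[]
}

// Built at runtime so this snippet contains no literal code fence of its own.
const FENCE = '`'.repeat(3)
const PLAN_PATTERN = new RegExp(FENCE + 'json\\s*([\\s\\S]*?)' + FENCE)

// True only when every field of a task spec is present with the right type.
function isTaskSpec(value: unknown): value is TaskSpec {
  if (typeof value !== 'object' || value === null) return false
  const t = value as Record<string, unknown>
  const deps = t['dependsOn']
  return (
    typeof t['title'] === 'string' &&
    typeof t['description'] === 'string' &&
    typeof t['assignee'] === 'string' &&
    Array.isArray(deps) &&
    deps.every((d) => typeof d === 'string')
  )
}

// Returns the parsed plan, or null if the fence is missing, the JSON is
// invalid, or any task is malformed. No repair, no lenient parsing.
function parseTaskPlan(raw: string): TaskSpec[] | null {
  const body = raw.match(PLAN_PATTERN)?.[1]
  if (body === undefined) return null
  let parsed: unknown
  try {
    parsed = JSON.parse(body)
  } catch {
    return null
  }
  if (!Array.isArray(parsed) || parsed.length === 0) return null
  return parsed.every(isTaskSpec) ? (parsed as TaskSpec[]) : null
}
```

A null here is the point at which a safety net like the fallback would take over, which is why the check has to be strict rather than forgiving.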

Why this is genuinely the model and not the fallback:

The titles are the model's own. The fallback names tasks like researcher: <goal…>; the executed tasks were Gather System Information / Generate Markdown Report.
There's a real dependency. The summarizer task dependsOn the researcher task — and the one-task-per-agent fallback never creates dependencies. A dependency edge can only come from a real decomposition.
Correct roles. Researcher gathers, summarizer writes. The model understood which agent does what.
Four consistent data points. Two full end-to-end runs (my instrumented copy and the unmodified shipped file), plus a runAgent raw-output probe and a runTeam({ planOnly }) probe — all produced the same valid 2-task decomposition, all reporting fallback = false.
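The first two checks are mechanical enough to automate in your own harness. A heuristic sketch, assuming the fallback behaves as described in this section (one task per agent, titles shaped like `<agent>: <goal…>`, no dependency edges); `looksLikeFallbackPlan` is a hypothetical name, not a framework API.

```typescript
interface PlannedTask {
  title: string
  assignee: string
  dependsOn: string[]
}

// Returns true when a plan is indistinguishable from the one-task-per-agent
// fallback. Any dependency edge is treated as proof of a real decomposition.
function looksLikeFallbackPlan(tasks: PlannedTask[], agentNames: string[]): boolean {
  if (tasks.some((t) => t.dependsOn.length > 0)) return false
  if (tasks.length !== agentNames.length) return false
  // Assumed fallback title shape: "<assignee>: <goal...>"
  return tasks.every((t) => t.title.indexOf(t.assignee + ': ') === 0)
}
```

Run it against `result.tasks` after a `runTeam` call; a true result means the green checkmark alone should not be trusted.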

(One honest footnote for anyone who reads the raw evidence JSON: my first instrumentation pass wrongly recorded fallbackEngaged: true, because the harness read a collapsed, empty coordinator key. The four probes above are what corrected it: the flag is my measurement bug, not the framework's behavior. I left the flag in the evidence file with a note rather than scrubbing it, because that's what the ground-truthing actually looked like.)

One real run — the ledger

Part 2, runTeam with the local coordinator, one measured run:

Task	Agent	Model	Latency	Tok in	Tok out	Tools	Cost
(decompose + synthesis)	coordinator	gemma4:e2b	—	1615	1677	(none)	$0
Gather System Information	researcher	gemma4:e2b	58.9 s	979	1023	bash	$0
Generate Markdown Report	summarizer	gemma4:e2b	46.9 s	1654	858	file_write	$0
Total			199.9 s wall	4248	3558		$0

The final report.md carried the real, correct values — Node v22.22.3, npm 10.9.8, macOS 26.5 (build 25F71), Darwin 25.5.0 … arm64 — so the workers didn't just run, they produced accurate output. I re-ran the unmodified shipped file as a second confirmation: same outcome, runTasks 182.4 s and runTeam 155.5 s, same valid decomposition.

The honest headline: the cost is $0, and the speed is minutes, not seconds. Read on.

The friction nobody puts in the demo

This is the part you don't get from a vendor page, and it's the most useful part if you're going to run this yourself.

1. It's a "thinking" model — don't cap maxTokens small. gemma4:e2b emits reasoning tokens in a separate channel before its answer. I reproduced the trap directly: a call with max_tokens: 10 returned empty content — the thinking ate the whole budget. The shipped example sets no maxTokens, so Ollama's default applies and it works. But if you tighten the token budget to save memory, an empty coordinator response is exactly what triggers that silent fallback from the last section. On a thinking model, keep maxTokens generous.

2. Plan for ~16 GB RAM, and expect swap. The 7.16 GB Q4 model pushed my 16 GB machine into swap (~6.8–7.3 GB used during runs). It completed correctly, but the bigger gemma4:e4b (9.6 GB) would be worse here. Set expectations: e2b wants 16 GB and still swaps; go bigger and you need more.

3. Slow but functional. Per-call latency ran 5–25 s; a full demo (Part 1 + Part 2) is about 6 minutes. That's fine for a "$0, private, runs overnight" story; it is not interactive-snappy.

4. No quantization pathologies — and you don't need the sampling knobs for this model. Across every run: zero repetition loops, zero hallucinated tool schemas, zero invalid JSON. The repo's local-quantized.ts example (topK / minP / repetition_penalty tuning) targets other MoE quants that misbehave — you don't need it for gemma4:e2b. Which is a nice segue, because tuning does come back to matter in the hybrid case.
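Back to point 1 for a moment: the empty-response trap is easy to detect if you look at the raw reply. A minimal sketch against the OpenAI-compatible chat completion shape, where `finish_reason: 'length'` means the token limit cut generation off. I'm assuming your local server reports that value faithfully; `classifyReply` and `ReplyStatus` are hypothetical names.

```typescript
// Minimal slice of an OpenAI-compatible chat completion response.
interface ChatCompletionLike {
  choices: Array<{
    finish_reason: string | null
    message: { content: string | null }
  }>
}

type ReplyStatus = 'ok' | 'empty-budget-exhausted' | 'empty-other'

// Distinguishes "the model said nothing" from "the thinking ate the budget".
function classifyReply(res: ChatCompletionLike): ReplyStatus {
  const choice = res.choices[0]
  const content = choice?.message.content?.trim() ?? ''
  if (content.length > 0) return 'ok'
  return choice?.finish_reason === 'length' ? 'empty-budget-exhausted' : 'empty-other'
}
```

Logging `empty-budget-exhausted` before the plan is parsed tells you to raise maxTokens instead of chasing a phantom decomposition bug.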

Going hybrid: cloud coder, local reviewer (and where it broke)

The same baseURL trick lets you mix cloud and local in one pipeline: send the hard, non-sensitive work to a strong cloud model and keep the rest local. The shipped ollama.ts does exactly this — a coder plus a reviewer. I ran a faithful copy with a cloud coder (DeepSeek) and the shipped local reviewer (llama3.1 over Ollama).

Agent	Provider	Model	Tools executed	Verdict
coder	deepseek (cloud)	deepseek-v4-pro	bash×3, file_write×3	excellent
reviewer	ollama (local)	llama3.1	none	hallucinated

The cloud coder passed for real. DeepSeek wrote a clean retry.ts (exponential backoff, shouldRetry, withRetry) and a 6-case test file. I ran the tests independently: 6 passed, 0 failed.

The local llama3.1 reviewer failed substantively — twice. It never read the files (tools: [], ~468 input tokens against the ~2,400 the two files would cost), then hallucinated a review: it called TypeScript code "try-except blocks" (that's Python), described a 3-export module as "a single function," and rubber-stamped Verdict: SHIP. Worse, the run reported success: true. A confident review of code it never opened.

Root cause, precisely: llama3.1 didn't emit native tool_calls — it narrated the call as text, and the text was malformed for the safety-net extractor (invalid JSON in one run, a wrong function-as-string shape in the repro). Neither the native path nor the fallback fired, so no file was ever read. This is model-specific: gemma4:e2b in the 100%-local example emitted correct native tool calls and every tool executed.

Fixing the local reviewer: a two-part fix, not a model swap

The obvious fix is "swap the reviewer to a model with real tool-calling." Necessary, but not sufficient — the temperature matters just as much. Same files, same reviewer prompt, only the reviewer config changes:

Reviewer config	`file_read` executed	Input tok	Outcome
llama3.1 (default)	none	335–468	malformed text tool-calls → hallucinated review
gemma4:e2b @ temp 1 (its default)	none	415	emitted no tool call → "I haven't read the files yet"
gemma4:e2b @ temp 0.2, topP 0.9	file_read ×2	3028	read both files → grounded review

Only the last row actually read the code — you can see it in the input tokens jumping from 415 to 3028, and in the review citing real specifics (the makeFlaky helper, testFailureExhaustion at lines 63–65, the exact 'permanent failure' assertion string).
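That token jump makes a decent automated guard against `success: true` on a review that never opened the code. A sketch under stated assumptions: `ReviewerRun` is a hypothetical summary you'd build from your own run metrics, and the ~2,400-token floor is the rough cost of the two files quoted earlier, not a framework constant.

```typescript
// Hypothetical per-run summary; fill it from whatever metrics your run exposes.
interface ReviewerRun {
  toolCalls: string[] // names of tools that actually executed
  inputTokens: number
}

// A review counts as grounded only if the files were read AND the prompt
// was large enough to have contained them.
function reviewLooksGrounded(
  run: ReviewerRun,
  expectedFileReads: number,
  approxFileTokens: number,
): boolean {
  const reads = run.toolCalls.filter((name) => name === 'file_read').length
  return reads >= expectedFileReads && run.inputTokens >= approxFileTokens
}

// The three rows of the table above, checked against two files of ~2,400 tokens.
const llamaDefault = reviewLooksGrounded({ toolCalls: [], inputTokens: 468 }, 2, 2400)
const gemmaTemp1 = reviewLooksGrounded({ toolCalls: [], inputTokens: 415 }, 2, 2400)
const gemmaTempLow = reviewLooksGrounded(
  { toolCalls: ['file_read', 'file_read'], inputTokens: 3028 }, 2, 2400)
```

Only `gemmaTempLow` comes out true, matching the table.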

Why temperature? gemma4:e2b is a thinking model with a default temperature of 1. At temp 1 it narrated "I'll read them later" and emitted no tool call, so the agent loop ended after one turn. At temp 0.2 it followed "read first" and called file_read. (Low temperature makes sampling far less random, not strictly deterministic, so read this as "reliable in my runs" rather than a guarantee.) This is the same "tame your sampling for local models" lesson from local-quantized.ts, shown here to govern tool-call reliability, not just repetition. The recipe for a working local reviewer: (1) a model with real native tool-calling, and (2) a low temperature. With both, the local reviewer genuinely reads and reviews, at $0.

When to go fully local vs hybrid

Fully local when data residency is the hard constraint: nothing leaves the machine, $0 cost, and — as shown — even the coordinator can be local. The price is RAM and latency (minutes, not seconds).
Hybrid when one step genuinely needs a frontier model (the coder above) but the rest can stay home. The plumbing is sound — cloud and local in one runTasks pipeline via baseURL. Just pick your local agents for solid native tool-calling and tame their temperature, or you get a confident reviewer that never read the code.

There's also a higher-level angle worth a link: in Part 2 I let the local model be the coordinator. If you want the mechanics of how a goal becomes a task DAG in this framework, I wrote that up in Goal In, DAG Out. The surprise of this post is that the model driving that decomposition can be 5B and running on your laptop.

Run it

npm install @open-multi-agent/core

You'll need Ollama running with the model pulled:

ollama pull gemma4:e2b        # ~7 GB; wants ~16 GB RAM to run comfortably
# then run the example from the repo (remember: no_proxy=localhost if you use a proxy)

Three example files to read, in increasing spice: gemma4-local.ts (100% local, both runTasks and runTeam), local-quantized.ts (sampling knobs for MoE quants that misbehave), and ollama.ts (the hybrid — cloud + local in one pipeline).

One honest caveat: local tool-calling reliability varies a lot by model — gemma4:e2b was solid, llama3.1 was not in this task — and the project's production validation is still early. If you run a local team of your own, I'd like to hear which local models emitted clean native tool calls and which didn't.

From Transcript to Typed Action Items: Three Parallel Agents in TypeScript

JackChen — Wed, 24 Jun 2026 11:34:38 +0000

Your meeting summarizer is quietly doing three jobs in one prompt

The usual way to summarize a meeting with an LLM is one prompt: "Here's the transcript — give me a summary, pull out the action items, and tell me how everyone felt." One call, one model, one blob of text back.

It works on a demo and frays on a real transcript. Those are three different jobs with three different shapes. A summary wants to flow as prose. Action items want to be a strict list with an owner on every row. Sentiment wants one verdict per speaker. Cram them into a single prompt and they fight: the model pads the summary into the action items, or it forgets to tag a speaker, or the "action items" come back as a paragraph you now have to parse by hand. You also pay for all of it serially, and you get back unstructured text when half of what you wanted was structured data.

There's a cleaner shape. Run three specialists, each doing exactly one job, each at its own temperature, two of them returning typed objects instead of prose — and run them at the same time, because none of them needs another's output. Then a fourth agent merges the three results into one report.

This post builds exactly that, from the meeting-summarizer cookbook example in open-multi-agent. The whole thing is ~280 lines of TypeScript and the parallelism is the point.

What you get out of it

The end product is a single Markdown report with a fixed shape — a prose summary, an action-item table, per-person sentiment, and synthesized next steps. Here's the action-item section from a real run against a 21-line engineering standup — every row came back as typed data, not prose the script had to parse:

Task	Owner	Due
Deploy shadow-write harness for billing-v2 migration	Raj	2026-04-24
Add covering index to reconciliation query before cutover	Raj	2026-04-28
Flip feature flag for checkout redesign to 5% traffic	Priya	2026-04-23
Draft proposal for mandatory second reviewer on multi-region changes	Dan	2026-04-27
Create handoff doc for primary on-call rotation	Dan	—
Follow up with Len about authz refactor timeline	Maya	—

The full report also carries the three-paragraph summary, a per-speaker sentiment read, and a synthesized Next Steps list. All of it is produced by four agents — three of which ran concurrently. Here's how it's wired.

Three specialists, one transcript

Each specialist is a plain Agent with its own system prompt and temperature. Start with the summarizer — prose out, no schema, a slightly higher temperature so it reads naturally:

import type { AgentConfig } from '@open-multi-agent/core'

const summaryConfig: AgentConfig = {
  name: 'summary',
  model: 'claude-sonnet-4-6',
  systemPrompt: `You are a meeting note-taker. Given a transcript, produce a
three-paragraph summary:

1. What was discussed (the agenda).
2. Decisions made.
3. Notable context or risk the team should remember.

Plain prose. No bullet points. 200-300 words total.`,
  maxTurns: 1,
  temperature: 0.3,
}

The other two specialists are where this stops being "call an LLM three times" and starts being reliable: they return typed objects, not text. You declare a Zod schema, hand it to the agent as outputSchema, and read the parsed result off result.structured.

Action items are a list, and every item must carry an owner. The due date is optional, because real meetings only sometimes name one:

import { z } from 'zod'

const ActionItemList = z.object({
  items: z.array(
    z.object({
      task: z.string().describe('The action to be taken'),
      owner: z.string().describe('Name of the person responsible'),
      due_date: z.string().optional().describe('ISO date or human-readable due date if mentioned'),
    }),
  ),
})

const actionItemsConfig: AgentConfig = {
  name: 'action-items',
  model: 'claude-sonnet-4-6',
  systemPrompt: `You extract action items from meeting transcripts. An action
item is a concrete task with a clear owner. Skip vague intentions ("we should
think about X"). Include due dates only when the speaker named one explicitly.

Return JSON matching the schema.`,
  maxTurns: 1,
  temperature: 0.1,
  outputSchema: ActionItemList,
}

Note the temperature: 0.1. Extraction is not a place for creativity — you want the same transcript to yield the same action items. And because outputSchema is set, result.structured comes back as a typed { items: [...] } you can push straight into Jira or Linear. No regex, no "parse the markdown table the model hopefully produced."
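To make "push straight into Jira or Linear" concrete, here is a small dependency-free sketch of consuming the typed result. The `ActionItemListData` interface mirrors what the schema above would infer; `TicketDraft` and `toTicketDrafts` are hypothetical, since every tracker has its own field names.

```typescript
// Mirrors the shape of the ActionItemList schema above.
interface ActionItem {
  task: string
  owner: string
  due_date?: string
}
interface ActionItemListData {
  items: ActionItem[]
}

// Hypothetical ticket payload; map it onto your tracker's real fields.
interface TicketDraft {
  title: string
  assignee: string
  due: string | null
}

// No parsing step: the data is already structured, so this is a plain map.
function toTicketDrafts(data: ActionItemListData): TicketDraft[] {
  return data.items.map((item) => ({
    title: item.task,
    assignee: item.owner,
    due: item.due_date ?? null, // optional in the schema, explicit null here
  }))
}

// Two rows from the report table earlier in the post.
const drafts = toTicketDrafts({
  items: [
    { task: 'Flip feature flag for checkout redesign to 5% traffic', owner: 'Priya', due_date: '2026-04-23' },
    { task: 'Create handoff doc for primary on-call rotation', owner: 'Dan' },
  ],
})
```

The optional `due_date` is the only branch in the whole function, which is the payoff of asking for typed output.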

Sentiment is the same idea with a tighter constraint — tone is an enum, so the model can only return one of four values, and every verdict has to cite evidence:

const SentimentReport = z.object({
  participants: z.array(
    z.object({
      participant: z.string().describe('Name as it appears in the transcript'),
      tone: z.enum(['positive', 'neutral', 'negative', 'mixed']),
      evidence: z.string().describe('Direct quote or brief paraphrase supporting the tone'),
    }),
  ),
})

The evidence field is a cheap hallucination guard: forcing the model to attach a quote to each tone keeps it from inventing a mood nobody expressed. (One naming gotcha if you adapt this: the outer keys are plural — items and participants — and the arrays live under them.)
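If you want to see what the sentiment schema actually enforces, here it is written out by hand as a type guard. This is a stand-in for illustration, not how Zod works internally, and `isSentimentReport` is a hypothetical name. It also makes the plural-key gotcha visible: a bare array fails, because the rows must live under `participants`.

```typescript
const TONES = ['positive', 'neutral', 'negative', 'mixed'] as const
type Tone = (typeof TONES)[number]

interface ParticipantSentiment {
  participant: string
  tone: Tone
  evidence: string
}

// Hand-written equivalent of the SentimentReport schema's checks.
function isSentimentReport(
  value: unknown,
): value is { participants: ParticipantSentiment[] } {
  if (typeof value !== 'object' || value === null) return false
  const list = (value as Record<string, unknown>)['participants']
  if (!Array.isArray(list)) return false // a bare array lands here too
  return list.every((entry) => {
    if (typeof entry !== 'object' || entry === null) return false
    const row = entry as Record<string, unknown>
    return (
      typeof row['participant'] === 'string' &&
      typeof row['evidence'] === 'string' &&
      TONES.indexOf(row['tone'] as Tone) !== -1 // tone must be one of the four
    )
  })
}
```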

Fan out: run the three at once

None of the three specialists depends on another — they all read the same transcript and write independent outputs. That's the textbook condition for fan-out. open-multi-agent's AgentPool runs agents concurrently up to a limit; give it three slots, add the agents, and kick them all off with Promise.all:

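// Assumes these are exported from the package root.
import { Agent, AgentPool, ToolExecutor, ToolRegistry, registerBuiltInTools } from '@open-multi-agent/core'

function buildAgent(config: AgentConfig): Agent {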
  const registry = new ToolRegistry()
  registerBuiltInTools(registry)
  const executor = new ToolExecutor(registry)
  return new Agent(config, registry, executor)
}

const pool = new AgentPool(3) // three specialists can run concurrently
pool.add(buildAgent(summaryConfig))
pool.add(buildAgent(actionItemsConfig))
pool.add(buildAgent(sentimentConfig))

const specialists = ['summary', 'action-items', 'sentiment'] as const

const parallelStart = performance.now()
const timed = await Promise.all(
  specialists.map(async (name) => {
    const t = performance.now()
    const result = await pool.run(name, TRANSCRIPT)
    return { name, result, durationMs: performance.now() - t }
  }),
)
const parallelElapsed = performance.now() - parallelStart

One subtlety worth knowing: AgentPool holds a per-agent lock, so the same agent can't run twice at once — but three differently-named agents run truly in parallel. A pool size of 3 is exactly enough to fit them.

Now the part most fan-out tutorials skip: proving it actually ran in parallel. Measure two things — the wall-clock time around the whole Promise.all, and the sum of each agent's own duration. If the work really overlapped, the wall time is much smaller than the sum:

const serialSum = timed.reduce((acc, r) => acc + r.durationMs, 0)
console.log(`Parallel wall time: ${Math.round(parallelElapsed)}ms`)
console.log(`Serial sum (per-agent): ${Math.round(serialSum)}ms`)
console.log(`Speedup: ${(serialSum / parallelElapsed).toFixed(2)}x`)

if (parallelElapsed >= serialSum * 0.7) {
  console.error('ASSERTION FAILED: parallel wall time is not < 70% of serial sum.')
  process.exit(1)
}

That last block is deliberate, and it's worth keeping in your own version. It's a parallelism self-check: if the three calls didn't substantially overlap — say your provider rate-limited you and quietly serialized the requests — the wall time creeps up toward the serial sum and the script exits non-zero. So if you run this and see ASSERTION FAILED, that's usually not a bug in the code; it's the check earning its keep by telling you the fan-out degraded into a queue.

On a real run against DeepSeek the three specialists overlapped for a 2.21× speedup — 11.7s of wall time against 25.9s of summed per-agent work. The exact number moves with model latency and network, which is the point of measuring it per run instead of quoting a brochure figure.
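The arithmetic behind that figure, using the same 70% threshold as the self-check. `checkParallelism` is a hypothetical helper that takes the two totals directly; the second call uses a made-up wall time purely to show what a degraded run looks like.

```typescript
// Same rule as the assertion in the example: the fan-out "overlapped" only if
// wall time is below `threshold` times the summed per-agent time.
function checkParallelism(wallMs: number, serialSumMs: number, threshold = 0.7) {
  return {
    speedup: serialSumMs / wallMs,
    overlapped: wallMs < serialSumMs * threshold,
  }
}

// The measured run: 11.7 s of wall time against 25.9 s of summed work.
const measured = checkParallelism(11700, 25900)
console.log(measured.speedup.toFixed(2), measured.overlapped) // 2.21 true

// Hypothetical degraded run: requests quietly serialized by a rate limiter.
const degraded = checkParallelism(24000, 25900)
console.log(degraded.overlapped) // false (24000 is above 0.7 * 25900 = 18130)
```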

The fourth agent: the aggregator

Fan-out gets you three results in parallel. You still need them merged into one report — and that's a fourth agent, running after the others because it depends on all three. No hiding it: this pattern is three-parallel-plus-one, not three.

The aggregator takes the prose summary as text and the two structured results as JSON, and is told to emit a fixed four-heading report:

const aggregatorPrompt = `Merge the three analyses below into a single Markdown report.

--- SUMMARY (prose) ---
${byName.get('summary')!.output}

--- ACTION ITEMS (JSON) ---
${JSON.stringify(actionData, null, 2)}

--- SENTIMENT (JSON) ---
${JSON.stringify(sentimentData, null, 2)}

Produce the Markdown report per the system instructions.`

const reportResult = await pool.run('aggregator', aggregatorPrompt)

Its system prompt pins the output structure (## Summary / ## Action Items / ## Sentiment / ## Next Steps, action items as a table) and adds one important rule: do not invent action items that are not grounded in the other data. The aggregator's job is to format and synthesize next steps, not to discover new facts — that line keeps it from drifting.
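The snippet above uses `byName`, `actionData`, and `sentimentData` without showing where they come from. One plausible way to build them from the `timed` array is sketched below. The `RunResultLike` shape is an assumption based on the fields this post mentions (`output`, `structured`, `success`), and `collectSpecialistOutputs` is a hypothetical helper, not code from the example.

```typescript
// Assumed minimal shape of a per-agent run result.
interface RunResultLike {
  success: boolean
  output: string
  structured?: unknown
}
interface TimedResult {
  name: string
  result: RunResultLike
  durationMs: number
}

// Index results by agent name and fail loudly if a specialist is missing,
// so the aggregator never merges a hole.
function collectSpecialistOutputs(timed: TimedResult[]) {
  const byName = new Map<string, RunResultLike>()
  timed.forEach((t) => byName.set(t.name, t.result))

  const need = (name: string): RunResultLike => {
    const r = byName.get(name)
    if (r === undefined || !r.success) {
      throw new Error(`specialist "${name}" did not produce a result`)
    }
    return r
  }

  return {
    summary: need('summary').output,                // prose
    actionData: need('action-items').structured,    // typed { items: [...] }
    sentimentData: need('sentiment').structured,    // typed { participants: [...] }
  }
}
```

Failing before the aggregator runs is deliberate: an aggregator told not to invent facts can still paper over an empty section.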

One real run

The example ships with claude-sonnet-4-6; these numbers are from a run swapped to DeepSeek (deepseek-v4-flash). The agent configs are identical; only the model id changes. The three specialists fanned out, the action-items and sentiment outputs validated against their Zod schemas, and the aggregator produced the report above. Token usage for the full run (three specialists plus the aggregator) was 3,225 input and 4,083 output tokens. (Those are token counts, not a dollar figure; what you pay depends on your provider and model.)

A thing to set expectations on: fan-out buys you wall-clock time, not tokens. You still make four model calls — you've just stopped waiting for them one after another. And you added a call (the aggregator) you wouldn't have with a single prompt. On a tiny transcript the coordination overhead can eat the win; the pattern pays off as each specialist's own work grows.

When this pattern fits — and when it doesn't

Reach for fan-out when one input needs several independent analyses. Meeting → {summary, actions, sentiment} is the canonical case, but so is a PR → {security review, style review, test-coverage check}, or a support ticket → {category, urgency, suggested reply}. Independent jobs, same source, typed outputs you want to use downstream.

Don't reach for it when the steps depend on each other: research-then-write is a pipeline, not a fan-out, and forcing it parallel just breaks the data flow. And don't fan out a single job for the sake of it: one agent is simpler than a pool plus an aggregator.

There's also a higher-level option in the same framework. Here you wired the parallelism by hand: you decided what runs concurrently. If you'd rather describe a goal and let a coordinator decompose it into a task graph and parallelize that for you, that's what runTeam() does; I wrote it up in Goal In, DAG Out. Hand-wired fan-out like the one in this post is the right call when the shape is fixed and you want it explicit; the coordinator is the right call when the shape varies with the goal.

Run it

npm install @open-multi-agent/core

The full example is in the repo — run it from the repository root (it needs ANTHROPIC_API_KEY):

npx tsx packages/core/examples/cookbook/meeting-summarizer.ts

Source to read: the meeting-summarizer example and its transcript fixture. For the same fan-out/aggregate shape stripped to its essentials, see the fan-out-aggregate pattern.

One honest caveat: the transcript here is a synthetic standup, and the project's production validation is still early. If you point this at real meetings, I'd like to hear where the typed extraction held up and where it didn't.

Goal In, DAG Out: How Open-Multi-Agent Turns a Goal into a Task DAG

JackChen — Sun, 21 Jun 2026 06:40:44 +0000

You wrote the graph by hand. Then the requirements changed.

Most TypeScript agent frameworks make you draw the graph yourself. You declare the nodes, wire the edges, decide what runs after what, where it branches, where it joins. It works, right up until the goal shifts and you are back in the graph editor re-wiring a pipeline you already built once.

There is another way to model this: describe the goal, and let a coordinator build the graph for you.

That is what runTeam() does in open-multi-agent. You hand it a team and a sentence. It hands back a result. In between, a coordinator agent decomposes the goal into a task DAG, assigns the tasks to your agents, runs the independent ones in parallel, and synthesizes the final answer. There are no edges to wire.

This post is about what happens in that "in between," because the mechanism is the whole point.

The one call

import { OpenMultiAgent } from '@open-multi-agent/core'

const orchestrator = new OpenMultiAgent({
  defaultModel: 'deepseek-v4-flash',
  defaultProvider: 'deepseek',
})

const team = orchestrator.createTeam('research', {
  name: 'research',
  agents: [
    { name: 'researcher', model: 'deepseek-v4-flash', provider: 'deepseek',
      systemPrompt: 'You research topics and gather concrete facts.' },
    { name: 'writer', model: 'deepseek-v4-flash', provider: 'deepseek',
      systemPrompt: 'You turn research notes into clear prose.' },
  ],
  sharedMemory: true,
})

const result = await orchestrator.runTeam(
  team,
  'Research the tradeoffs of TypeScript decorators, covering the stage-3 standard ' +
  'versus the legacy experimental implementation, runtime and bundle-size cost, and ' +
  'current framework support, then write a clear 500-word explainer for a team ' +
  'deciding whether to adopt them.',
)

console.log(result.agentResults.get('coordinator')?.output)

Three things to notice before we go under the hood:

You never declared a task graph. You wrote the goal in plain English.
Each agent declares its own model. The orchestrator's defaultModel is used by the coordinator; worker agents carry their own. (Swap deepseek for any supported provider: Anthropic, OpenAI, Gemini, a local model, and so on.)
The goal is deliberately specific. A short, single-clause goal is treated as a simple task and skips the coordinator entirely; more on that below.

Run this and the framework does seven things. Here they are, in order.

Step 1: A coordinator decomposes the goal

runTeam() spins up a temporary agent called coordinator. It is not part of your roster. The framework creates it for this run and discards it afterward. It receives your goal, the names of your agents, and one instruction:

Decompose the following goal into tasks for your team (researcher, writer). Return ONLY the JSON task array in a json code fence.

The coordinator answers with a JSON array of task specs. Here is a real decomposition from the run above:

[
  { "title": "Research stage-3 vs legacy experimental decorators",
    "description": "Gather the syntax and behavioral differences ...",
    "assignee": "researcher", "dependsOn": [] },
  { "title": "Research runtime and bundle-size cost of decorators",
    "description": "Investigate helper code, tree-shaking, benchmarks ...",
    "assignee": "researcher", "dependsOn": [] },
  { "title": "Research current framework support for decorators",
    "description": "Survey Angular, NestJS, TypeORM, MobX ...",
    "assignee": "researcher", "dependsOn": [] },
  { "title": "Write 500-word explainer on decorator tradeoffs",
    "description": "Using the three research outputs, write the explainer ...",
    "assignee": "writer",
    "dependsOn": [
      "Research stage-3 vs legacy experimental decorators",
      "Research runtime and bundle-size cost of decorators",
      "Research current framework support for decorators"
    ] }
]

Each task carries a title, a description (the actual instruction the assigned agent will receive), an assignee, and dependsOn, a list of task titles it must wait for. That last field is the DAG, expressed as data instead of as wiring. Notice the coordinator chose to split the research into three independent tasks and make the write task depend on all three. The exact split varies between runs, because the coordinator is an LLM; this was one real plan.

This step costs one extra LLM call. The coordinator runs with a maxTurns of 3 by default. Keep that overhead in mind; we come back to it at the end.

Step 2: The tasks become a dependency graph

The specs load into a TaskQueue. The title-based dependsOn references resolve to real task IDs, so the queue knows the true shape of the graph. A task becomes "ready" only once every task it depends on has completed. Tasks with no dependencies are ready immediately.
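As a rough illustration of that resolution step, and not the framework's TaskQueue itself: the sketch below turns title references into IDs and computes the ready set. `resolveDependencies`, `readyTasks`, and the `task-N` ID format are all hypothetical, and throwing on an unknown title is my choice here, not documented behavior.

```typescript
interface PlanSpec {
  title: string
  dependsOn: string[] // titles, as the coordinator wrote them
}
interface QueuedTask {
  id: string
  title: string
  dependsOn: string[] // task IDs after resolution
}

// Replace title-based references with stable IDs.
function resolveDependencies(specs: PlanSpec[]): QueuedTask[] {
  const idByTitle = new Map<string, string>()
  specs.forEach((s, i) => idByTitle.set(s.title, `task-${i + 1}`))
  return specs.map((s, i) => ({
    id: `task-${i + 1}`,
    title: s.title,
    dependsOn: s.dependsOn.map((title) => {
      const id = idByTitle.get(title)
      if (id === undefined) throw new Error(`unknown dependency: ${title}`)
      return id
    }),
  }))
}

// A task is ready when it has not run yet and everything it waits on is done.
function readyTasks(tasks: QueuedTask[], completed: Set<string>): QueuedTask[] {
  return tasks.filter(
    (t) => !completed.has(t.id) && t.dependsOn.every((dep) => completed.has(dep)),
  )
}
```

Fed the four-task plan from Step 1, the first call to `readyTasks` with an empty completed set returns the three research tasks, and the write task appears only once all three IDs are in the set.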

If the coordinator fails to return usable JSON, the run does not crash. The framework falls back to one task per agent, each handed the original goal as its description. You get a degraded run, not an exception.

Step 3: Unassigned tasks get an owner

The coordinator usually fills in assignee, but it does not have to. Any task left unassigned is handed to the Scheduler, which assigns it to an agent. The default strategy is dependency-first; you can also pick round-robin, least-busy, or capability-match, which scores each agent's name and system prompt against the task.
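To give a feel for two of those strategies, here is a toy version of each. The real Scheduler's scoring is not shown in this post, so treat the keyword-overlap score below as my guess at the spirit of capability-match, with hypothetical names throughout.

```typescript
interface AgentInfo {
  name: string
  systemPrompt: string
}
interface OpenTask {
  title: string
  description: string
}

// Lowercased, de-duplicated words of three or more characters.
function keywords(text: string): string[] {
  const all = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 2)
  return all.filter((w, i) => all.indexOf(w) === i)
}

// Capability-match sketch: pick the agent whose name and system prompt share
// the most words with the task. Ties go to the earlier agent in the roster.
function capabilityMatch(task: OpenTask, agents: AgentInfo[]): string {
  if (agents.length === 0) throw new Error('no agents to assign to')
  const taskWords = keywords(`${task.title} ${task.description}`)
  let bestName = ''
  let bestScore = -1
  for (const agent of agents) {
    const agentWords = keywords(`${agent.name} ${agent.systemPrompt}`)
    const score = taskWords.filter((w) => agentWords.indexOf(w) !== -1).length
    if (score > bestScore) {
      bestName = agent.name
      bestScore = score
    }
  }
  return bestName
}

// Round-robin sketch: cycle through the roster in order.
function roundRobin(tasks: OpenTask[], agents: AgentInfo[]): string[] {
  if (agents.length === 0) throw new Error('no agents to assign to')
  return tasks.map((_, i) => agents[i % agents.length]!.name)
}
```

Exact word overlap is crude ("write" does not match "writer"), which is one reason to let the coordinator set assignee explicitly whenever it can.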

Step 4: Execution, parallel by default

Tasks run through an AgentPool. Independent tasks (nothing pending in their dependsOn) run concurrently, up to maxConcurrency, which defaults to 5. Dependents wait until their inputs are done, then become ready and dispatch. In the real run above, the three research tasks had no dependencies, so they all started in the same instant and ran together; the write task waited until all three finished. You did not schedule any of that. The graph shape decides what can overlap, and the pool runs as much of it in parallel as the limit allows.
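You can predict what will overlap before running anything. The sketch below groups a plan into lockstep "waves" under a concurrency limit. That is a simplification I'm making for readability: a real pool would likely dispatch a task the moment a slot frees up rather than wait for the whole wave. `planWaves` is a hypothetical name.

```typescript
interface DagTask {
  title: string
  dependsOn: string[] // titles of tasks that must finish first
}

// Groups tasks into waves: each wave holds at most `maxConcurrency` tasks
// whose dependencies all finished in earlier waves.
function planWaves(tasks: DagTask[], maxConcurrency: number): string[][] {
  if (maxConcurrency < 1) throw new Error('maxConcurrency must be at least 1')
  const done: string[] = []
  const waves: string[][] = []
  let pending = tasks.slice()
  while (pending.length > 0) {
    const ready = pending.filter((t) =>
      t.dependsOn.every((dep) => done.indexOf(dep) !== -1),
    )
    if (ready.length === 0) {
      throw new Error('cycle or unknown dependency in task graph')
    }
    const wave = ready.slice(0, maxConcurrency).map((t) => t.title)
    waves.push(wave)
    wave.forEach((title) => done.push(title))
    pending = pending.filter((t) => wave.indexOf(t.title) === -1)
  }
  return waves
}
```

For the plan in Step 1, the default limit of 5 gives two waves (three research tasks, then the write task), while a limit of 1 gives four waves, which is exactly the serial behavior you opt into for a local model.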

Step 5: Every result is persisted to shared memory

After each task completes, its output is written to the team's shared memory. That is how the writer sees the researcher's findings: by the time the write task is ready, the three research results are already in memory. Agents communicate through this shared store rather than by you threading outputs from one call into the next.

Step 6: The coordinator synthesizes

Once the queue drains, the coordinator runs a second time. This pass reads every task output and writes the final answer to the goal. This is the result you read from agentResults.get('coordinator').

Want to inspect the plan itself rather than the final prose? The task records are on the result as result.tasks (each with title, assignee, status, and dependsOn), and you can get just the plan without executing anything by calling runTeam(team, goal, { planOnly: true }).

Step 7: You get a structured result

runTeam() resolves to a TeamRunResult: an agentResults map keyed by agent name (here coordinator, researcher, writer), a totalTokenUsage figure, and the tasks record list with statuses and metrics. Everything that happened is inspectable after the fact.

What one real run looks like

Here is the actual output, running the code above against DeepSeek (deepseek-v4-flash):

The coordinator decomposed the goal into three parallel research tasks and one dependent write task, ran the research concurrently, persisted each result, and synthesized the final explainer. runTeam() finished success=true; the explicit runTasks() version below ran the same way.

When a task fails

Failures do not cascade past their own dependents. A failed task is marked failed, and any task that depends on it stays blocked. Every task that does not depend on the failure keeps running to completion. You end the run with partial results plus a clear record of which branch broke, instead of one error tearing down the whole graph.
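The propagation rule can be sketched as a small fixed-point loop: anything pending that depends, directly or transitively, on a failed task becomes blocked, and everything else is left alone. The status names and the `RunTask` shape are assumptions for this example rather than the framework's actual types.

```typescript
// Hypothetical status values and task shape.
type RunStatus = 'pending' | 'completed' | 'failed' | 'blocked'
interface RunTask { title: string; dependsOn: string[]; status: RunStatus }

// Mark every pending task that transitively depends on a failed task as blocked.
function blockDependents(tasks: RunTask[]): RunTask[] {
  const stuck = new Set(tasks.filter((t) => t.status === 'failed').map((t) => t.title))
  let changed = true
  while (changed) {
    changed = false
    for (const t of tasks) {
      if (t.status === 'pending' && !stuck.has(t.title) && t.dependsOn.some((d) => stuck.has(d))) {
        stuck.add(t.title)
        changed = true
      }
    }
  }
  return tasks.map((t) =>
    t.status === 'pending' && stuck.has(t.title) ? { ...t, status: 'blocked' as RunStatus } : t)
}

const afterFailure = blockDependents([
  { title: 'A', dependsOn: [], status: 'failed' },
  { title: 'B', dependsOn: ['A'], status: 'pending' },
  { title: 'C', dependsOn: ['B'], status: 'pending' },
  { title: 'D', dependsOn: [], status: 'pending' },
])
console.log(afterFailure.map((t) => `${t.title}:${t.status}`).join(' '))
// A:failed B:blocked C:blocked D:pending
```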

When you should NOT use the coordinator

Goal-first is not a silver bullet, and the framework is explicit about that.

Simple goals skip the coordinator entirely. If the goal is short (200 characters or fewer) and contains no coordination directives, runTeam() short-circuits: it picks the best-matching agent and runs it directly, with no decomposition and no synthesis pass. There is no reason to pay for two extra LLM calls to "Summarize this paragraph." (This is exactly why the quickstart goal above is spelled out in detail: a one-liner would have been routed straight to a single agent.)
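As a rough illustration of that check: the 200-character limit comes from the description above, but the list of "coordination directive" patterns below is invented for the sketch, since the framework's actual list is not shown in this post.

```typescript
// Illustrative directive patterns; the framework's real list is an unknown here.
const COORDINATION_HINTS: RegExp[] = [
  /\bthen\b/i,
  /\bafter that\b/i,
  /\bin parallel\b/i,
  /\bstep \d/i,
]

// A goal is "simple" when it is short and shows no multi-step wording.
function looksSimple(goal: string, maxLength = 200): boolean {
  if (goal.length > maxLength) return false
  return !COORDINATION_HINTS.some((pattern) => pattern.test(goal))
}

console.log(looksSimple('Summarize this paragraph.'))                  // true
console.log(looksSimple('Research decorators, then write a summary.')) // false
```

A heuristic like this will misfire in both directions, which is one more reason to reach for runTasks() when the routing has to be predictable.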

When you need determinism, write the graph yourself. The coordinator is an LLM, so its decomposition can vary run to run (the example above produced three research tasks on one run and a single research task on another). If you need the exact same pipeline every time (CI, regulated workflows, anything you have to reason about precisely), use runTasks() and supply the DAG directly:

const result = await orchestrator.runTasks(team, [
  {
    title: 'Research decorator tradeoffs',
    description: 'Gather concrete pros and cons of TypeScript decorators.',
    assignee: 'researcher',
  },
  {
    title: 'Write the explainer',
    description: 'Using the research notes, write a 500-word explainer.',
    assignee: 'writer',
    dependsOn: ['Research decorator tradeoffs'],
  },
])

Same queue, same scheduler, same parallel execution. You just own the graph instead of asking for one. (You can also pin a coordinator-generated plan and replay it deterministically, but that is a separate post.)

So the tradeoff is concrete:

runTeam() is goal-first: flexible, two extra LLM calls of planning overhead, a plan that can change between runs.
runTasks() is graph-first: deterministic and cheaper per run, but you maintain the graph.

Goal-first vs graph-first

This is the distinction that actually matters when you choose a framework. Graph-first tools (you wire the nodes) trade maintenance for control and determinism. Goal-first (you describe the outcome) trades an extra planning pass and a non-deterministic plan for flexibility. open-multi-agent ships both behind one API, so you can start goal-first and drop to an explicit graph on the paths that have to be locked down. I wrote more about that split in Goal-Driven Agent Orchestration vs Explicit Graphs.

Try it

npm install @open-multi-agent/core

The team-collaboration example is the smallest end-to-end runTeam() run. If you want to see how far this goes, the Gemma 4 local example puts a 5B local model in the coordinator seat: it does the JSON decomposition and the synthesis on your own machine.

One honest caveat: community and production validation are still early. If you run the coordinator on a real workload, I would like to hear where its plan held up and where you had to drop to runTasks().

Give Your TypeScript AI Agents Long-Term Memory with TencentDB-Agent-Memory

JackChen — Mon, 15 Jun 2026 11:12:40 +0000

A walkthrough wiring open-multi-agent's pluggable MemoryStore to TencentDB-Agent-Memory through its Hermes Gateway, with a real cross-run memory loop measured end to end on DeepSeek, plus two upstream behaviors that are not in any README and will silently cost you your memories if you miss them.

Most multi-agent frameworks have no long-term memory, and that is by design. They orchestrate: decompose a goal, run agents, pass results between them. The moment the run ends, everything the agents learned is gone. There is no notion of "what did this user tell us last week" or "what did we conclude the last three times we looked at this." For a one-shot batch job that is fine. For an assistant, a support bot, or anything a user comes back to, it is the whole game.

open-multi-agent is one of those frameworks. Its SharedMemory is in-process coordination state for a single run, not a knowledge base. So the honest answer to "how do my agents remember things across sessions" is: you bring your own memory layer. This post wires in one specific layer, TencentDB-Agent-Memory (TDAM), an open-source agent memory system from Tencent Cloud that distills raw conversation into searchable long-term memory and keeps all of it on local disk.

By the end you will have:

A MemoryStore adapter that gives an agent team persistent memory across separate process runs.
A measured two-run loop: run one writes and distills memories, run two recalls them and feeds them back into the agents' prompts.
Two upstream gotchas, with server-log evidence, that decide whether anything gets stored at all.
A clear line on when this is the right memory layer and when it is overkill.

Where this sits: three ways to give agents memory

Before any code, the honest landscape, because the right answer for you might be simpler than this whole post.

There are roughly three ways to add long-term memory to an agent system:

Roll your own. A vector database, an embedding model, and your own logic to decide what to store and how to summarize it. Maximum control, and you maintain all of it forever.
Hosted memory SaaS. A managed API that stores and retrieves memories for you. Lowest effort, but your conversation history and extracted facts live on someone else's servers.
Self-hosted distilled memory. A system that runs the extraction pipeline itself, over your data, on infrastructure you control. TDAM is this kind: raw conversations (L0) get distilled into atomic facts (L1), then scenes (L2), then a persona (L3), stored in local SQLite with sqlite-vec, retrieved by BM25 plus vector hybrid search. Zero external API dependency for storage.

This post builds the third option. It is worth your time specifically if you are on a TypeScript stack, you want memory extraction to run on infrastructure you control, and "the data never leaves our machine" is a real requirement (regulated industries, on-prem deployments, privacy-sensitive products). If you just need a key-value scratchpad that survives restarts, point open-multi-agent's MemoryStore at Redis or SQLite directly and skip all of this.

How the two systems meet

open-multi-agent exposes a MemoryStore interface (get / set / list / delete / clear) and lets you inject any implementation as a team's sharedMemoryStore. TDAM, for its part, has no general-purpose SDK; third-party frameworks integrate through its Hermes Gateway, an HTTP sidecar (default 127.0.0.1:8420) exposing capture, search, and recall endpoints, with optional Bearer auth. So the adapter is a MemoryStore that speaks to the Gateway over HTTP.

One mismatch shapes the whole design. MemoryStore is a key-value contract: get(key) must return exactly what set(key, value) wrote. The Gateway has no read-by-key endpoint at all; its search and recall return distilled, formatted text, not the raw record you stored. Forcing key-value reads through a search endpoint would quietly corrupt the orchestrator's bookkeeping, since it reads task results back by key between steps. So the adapter splits responsibilities:

within a run:
  get / list / delete / clear  ───────────────►  local in-process map  (exact KV)
  set(key, value)              ──┬────────────►  local map
                                 └── /capture ──►  TDAM  →  L0 → L1 → L2 → L3  (local SQLite)

across runs:
  recall(topic)  ◄── formatted context ──  TDAM  (BM25 + vector hybrid)  ──►  agent prompts

Within a run, the local map is the source of truth, identical to the default in-memory store. Across runs, the distilled TDAM memories are what persist. That distinction is the entire integration.
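A minimal sketch of that split, with the Gateway call injected as a callback so the shape is visible without any HTTP. The method signatures are simplified to synchronous ones and the `CaptureFn` type is hypothetical; the real MemoryStore interface and capture payload may look different.

```typescript
// Hypothetical capture callback. In a real adapter this would POST to the
// Gateway's capture endpoint; the payload shape is an assumption.
type CaptureFn = (key: string, value: string) => void

class SplitMemoryStore {
  private local = new Map<string, string>()
  private capture: CaptureFn

  constructor(capture: CaptureFn) {
    this.capture = capture
  }

  // Reads never touch the long-term layer: exact key-value semantics.
  get(key: string): string | undefined { return this.local.get(key) }
  list(): string[] { return Array.from(this.local.keys()) }
  delete(key: string): void { this.local.delete(key) }
  clear(): void { this.local.clear() }

  // Writes go to the local map first, then fan out to long-term capture.
  set(key: string, value: string): void {
    this.local.set(key, value)
    this.capture(key, value)
  }
}

const capturedKeys: string[] = []
const demoStore = new SplitMemoryStore((key) => { capturedKeys.push(key) })
demoStore.set('task:research', 'SQLite wins on latency')
console.log(demoStore.get('task:research'), capturedKeys)
```

Note that clear() only empties the local map in this sketch; what was already captured stays in the long-term layer, which matches the within-run versus across-run split.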

Two upstream behaviors that decide whether anything gets stored

This is the part you cannot get from the README, and the part most likely to make you think the integration is broken when it is working exactly as designed.

1. The extractor only remembers the user, never the assistant

TDAM's L1 extraction prompt distills three kinds of memory, all of them about the user: persona, episodic, and instruction. Its "do not extract" list explicitly names the AI assistant's own output. It is a user-memory system, not a transcript archive — and the exclusion is enforced in the extraction prompt, not by a code-level filter.

The first version of my adapter put the agent's result in the assistant_content field of the captured turn, which felt natural: the agent produced the result, so it is the assistant talking. The Gateway accepted the capture, triggered extraction, and stored nothing:

[l1-extractor] Total extracted memories: 0 across 1 scene(s)
[l1] L1 complete: extracted=0, stored=0

The fix is to phrase the capture so the agent reports its result as the user speaking. Same content, different slot. After the change, the same run extracted a memory:

[l1-extractor] Total extracted memories: 1 across 1 scene(s)
[l1] L1 complete: extracted=1, stored=1

If you are feeding any non-conversational producer (an agent, a job, a pipeline) into TDAM, this is the single most important thing to get right.
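One way to express that fix in code is a small helper that builds the captured turn. Only assistant_content is named in this post; the user_content field name, the `CaptureTurn` shape, and the wording of the report are assumptions for illustration.

```typescript
// Hypothetical turn shape; only assistant_content is confirmed by the text above.
interface CaptureTurn { user_content: string; assistant_content: string }

// Put the agent's result in the user slot, phrased as a first-person report,
// and keep the assistant slot to a bare acknowledgement.
function toCaptureTurn(agentName: string, taskTitle: string, resultText: string): CaptureTurn {
  return {
    user_content: `I am the ${agentName}. I completed "${taskTitle}". My result: ${resultText}`,
    assistant_content: 'Noted.',
  }
}

const demoTurn = toCaptureTurn('analyst', 'Compare SQLite and PostgreSQL', 'SQLite for latency, PostgreSQL for concurrency.')
console.log(demoTurn.user_content)
```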

2. Extraction is scheduled, and `session/end` does not force it

Captured turns are not extracted immediately. Extraction fires when a session's conversation count crosses a threshold, or after a 600-second idle timer. The threshold has a warm-up that doubles: it starts at 1, then 2, then 4, before settling at the steady-state value (everyNConversations, default 5).

That doubling is the trap. With the default config and a short run of two captures:

notify: conversation_count=1/1 (warmup: 1)   -> threshold reached, triggering L1
Warm-up advanced -> next threshold 2
notify: conversation_count=1/2 (warmup: 2)   -> L1 idle timer reset (600s)
flushSession: complete

The first capture extracts. The second does not: the counter reset after the first extraction, so the second capture brings it to only 1 against a warm-up threshold that has advanced to 2. Calling POST /session/end drains extraction already in flight but does not force the second capture through. In the log above, flushSession: complete is followed by no second extraction. That memory is sitting in a buffer, waiting for a threshold or a timer that a short-lived demo never hits.

For a long-running production session this scheduling is fine and probably what you want. For a deterministic "capture, then immediately search" loop, set everyNConversations: 1 in tdai-gateway.yaml. The threshold then graduates straight to a steady-state of 1 and every capture extracts on the spot.
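The schedule is easier to reason about as a toy simulation. This models only what is described above (a doubling warm-up, a counter that resets after each extraction, and a steady-state threshold); it ignores the idle timer and is not TDAM's actual code.

```typescript
// Toy model of the extraction schedule. Returns, for each capture in order,
// whether that capture would trigger an extraction.
function simulateCaptures(captures: number, everyNConversations = 5): boolean[] {
  let threshold = 1   // warm-up starts at 1
  let inWarmup = true
  let count = 0
  const extracted: boolean[] = []
  for (let i = 0; i < captures; i++) {
    count++
    if (count >= threshold) {
      extracted.push(true)
      count = 0 // counter resets after each extraction
      if (inWarmup) {
        const next = threshold * 2
        if (next >= everyNConversations) {
          threshold = everyNConversations // settle at steady state
          inWarmup = false
        } else {
          threshold = next
        }
      }
    } else {
      extracted.push(false)
    }
  }
  return extracted
}

console.log(simulateCaptures(2))    // [ true, false ]  the short-demo trap
console.log(simulateCaptures(3, 1)) // [ true, true, true ]
```

Under this model the default config extracts on captures 1, 3, and 7 of a session, which is why a two-capture demo leaves its second memory buffered.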

One smaller note while you are setting up: the Gateway needs Node 22. On Node 20 it fails to start with TypeError: webidl.util.markAsUncloneable is not a function, an undici incompatibility.

The measured loop

Setup: TDAM v0.3.6 from the npm package, Node 22, SQLite backend with embeddings disabled (so retrieval is BM25 / FTS), Bearer auth on, and deepseek-v4-flash driving both the agent team and the Gateway's extraction pipeline. The team is two agents (an analyst and a writer) researching one topic. Auth behaves as documented: GET /health is open, every other endpoint returns 401 without a valid Bearer token.

Run one, cold start. No prior memory exists. The team runs, and every shared-memory write is captured into TDAM:

[1/4] Recalling long-term memory... No long-term memories yet (first run).
[3/4] Captured 2/2 shared-memory writes into TDAM (4 L0 records). Flushing...
[4/4] 1 memories match (strategy: fts).
  [episodic] (priority: 80) The user (analyst) reported completed work comparing
  SQLite and PostgreSQL for agent memory stores, concluding that SQLite is
  preferable for real-time latency and simplicity, while PostgreSQL is better
  for multi-agent concurrency and scalability.

The two captured turns distilled into one episodic memory. Token usage: 4443 in, 2137 out.

Run two, same session, fresh process. This time recall finds something:

[1/4] Recalling long-term memory... Recalled 1 memories (strategy: hybrid)
      -> injecting into agent prompts.

The recalled memory goes into both agents' system prompts, and the writer builds on the prior conclusion instead of starting over. The new run's results capture back, and TDAM does not just append a second memory, it merges the two runs into one upgraded record:

[4/4] 1 memories match (strategy: fts).
  [episodic] (priority: 85) The open-multi-agent team completed work on storage
  choice for AI agent memory: the analyst completed an analysis comparing SQLite
  and PostgreSQL ... the writer completed a memorandum recommending SQLite for
  single-agent local workloads and PostgreSQL for multi-agent concurrent systems.

Priority rose from 80 to 85, and the scene now covers both agents. Token usage: 11384 in, 2489 out, higher because the recalled context is now in the prompts. That is the loop closed: write, distill, recall across a process boundary, feed back, re-distill into something better.

One honest detail on latency. In run one the flush returned in 0.0 seconds, because with everyNConversations: 1 extraction had already completed inline during capture. In run two the flush took 49.9 seconds, because that run's captures were still queued and session/end genuinely waited for the extraction model. Budget for real LLM latency at flush time. On a local model, think minutes, not seconds.

When to reach for this, and when not to

The cost is real and worth stating: TDAM's Gateway is a separate service you run alongside your app. For a framework whose whole pitch is three dependencies, in-process, one call, "now also run a sidecar and an extraction LLM" is real friction. You take that on for a reason, not by default.

Reach for it when long-term memory has to stay on infrastructure you control, when you want layered distillation (facts, scenes, persona) rather than a flat log, and when you are willing to run the sidecar to get it. Skip it when a key-value store that survives restarts is all you need; a MemoryStore backed by Redis or SQLite is a tenth of the moving parts.

The full runnable example (adapter, search toolkit, two-agent demo, README) is in the open-multi-agent repo under examples/integrations/with-tencentdb-memory/, pinned to TDAM v0.3.6 (both gotchas verified unchanged in TDAM's source through 1.0.0), lint and the full test suite green. If you wire TDAM into a different framework and hit a third gotcha, I want to know which one.

Goal-Driven Agent Orchestration vs Explicit Graphs: A TypeScript Framework Taxonomy

JackChen — Wed, 03 Jun 2026 12:57:35 +0000

Most multi-agent framework reviews compare features. This post argues you should compare a different axis first: where the framework places the decomposition cost. Goal-first frameworks pay it at runtime in tokens; graph-first frameworks pay it at design time in code. The right default depends on what kind of work your team actually has.

If you spent any time in 2025 evaluating TypeScript agent frameworks, you probably hit the same wall I did. The product pages do not distinguish themselves on the axes that matter once you ship anything. They all promise "multi-agent". They all show a cooperating-agents diagram. They all link to a half dozen integrations. None of them tell you what the framework is going to make you write at 2am when a customer reports a regression.

The thing the pages do not tell you is who decides the topology. That is the central design choice of every multi-agent framework, and it sorts cleanly into two camps. Graph-first frameworks make you decide. Goal-first frameworks let a coordinator agent decide for you at runtime. Each has costs the other does not.

I am the maintainer of one of the goal-first frameworks (open-multi-agent, the TypeScript-ecosystem answer to CrewAI), so my bias is going to show. I am going to try to be honest about what goal-first costs you, because I think the choice between paradigms is more interesting than which framework wins a feature checklist.

The two-axis problem with most comparison posts

Read any "Top N TypeScript Agent Frameworks 2026" listicle and you will get a feature grid: which framework supports streaming, structured output, retries, observability, MCP, Zod schemas, lifecycle hooks, agent handoffs, and so on. Every framework gets a check or a partial mark.

The grid is not useless, but it conceals the design choice that determines whether you will be productive on the framework six months in. That choice is:

Who decides which agent runs next, and when?

Two answers:

You do, at design time. The framework gives you primitives for nodes, edges, conditions, and you wire them. The execution path is whatever your code says it is.
A coordinator does, at runtime. You declare the agents and the goal. The framework runs an LLM call to plan a task DAG, then executes it.

Call these graph-first and goal-first respectively. They are not feature sets. They are paradigms with different cost shapes, different failure modes, and different right-fit use cases.

This post is the taxonomy. The same two-agent task appears inline below, implemented in four TypeScript frameworks (LangGraph.js, Mastra workflows, KaibanJS, open-multi-agent), so you can read each surface and judge for yourself.

What "graph-first" and "goal-first" mean

Graph-first

You declare the topology yourself. Nodes are agents (or steps). Edges declare execution order or conditions. The compiled object is a deterministic state machine. The framework executes it; it does not change it.

Canonical examples:

LangGraph.js: StateGraph with explicit Annotation.Root schema, addNode per agent, addEdge for transitions, addConditionalEdges for routing. You compile the graph and invoke it.
Mastra workflows: createWorkflow with typed inputSchema and outputSchema, createStep per stage, .then() to chain. A linear-graph DSL on top of typed steps.

Goal-first

You declare agents (role, model, system prompt). You hand the orchestrator a sentence-level goal. At runtime, a coordinator agent (an LLM call you do not write) decomposes the goal into a task DAG, assigns tasks to agents, runs them in dependency order, and synthesizes a final answer.

Canonical examples:

CrewAI (Python): the prototype. Crew.kickoff() on a list of agents plus tasks. CrewAI uses the role/goal/backstory metaphor to seed agent behavior.
open-multi-agent: OpenMultiAgent orchestrator plus runTeam(team, goal). The coordinator pattern lives in src/orchestrator/orchestrator.ts (runTeam method) and is described inline in the JSDoc, six steps: decompose, queue, schedule, execute with parallelism, persist results, synthesize.

What about KaibanJS

KaibanJS lives between the two. You define Agent and Task objects with explicit dependencies (the writer's task description references the researcher's output by convention), then the framework's Kanban board moves tasks across columns. Topology is mostly explicit in your task list, but the state machine that drives execution is hidden inside the board abstraction. Call it a hybrid. Closer to graph-first than goal-first in practice.

Four-way side by side: the same task in four frameworks

The task is small on purpose: a researcher agent gathers a brief on a topic, and a writer agent turns the brief into a 400-word summary. The writer depends on the researcher. The snippets below are the relevant core of each implementation, written to show the API surface side by side rather than as a packaged runnable project.

LangGraph.js

const State = Annotation.Root({
  topic:   Annotation<string>,
  brief:   Annotation<string>,
  summary: Annotation<string>,
})

async function researcher(state: typeof State.State) {
  const res = await model.invoke([
    { role: 'system', content: '...' },
    { role: 'user',   content: `Topic: ${state.topic}` },
  ])
  return { brief: String(res.content) }
}

async function writer(state: typeof State.State) {
  const res = await model.invoke([
    { role: 'system', content: '...' },
    { role: 'user',   content: `Brief:\n${state.brief}` },
  ])
  return { summary: String(res.content) }
}

const graph = new StateGraph(State)
  .addNode('researcher', researcher)
  .addNode('writer',     writer)
  .addEdge('__start__',  'researcher')
  .addEdge('researcher', 'writer')
  .addEdge('writer',     '__end__')
  .compile()

const result = await graph.invoke({ topic: '...' })

What you see in the code: a state schema, two node functions, three edges, a compile-and-invoke. The dependency between researcher and writer is the edge addEdge('researcher', 'writer'). The writer reads state.brief because the researcher wrote to it.

Mastra workflows

const researchStep = createStep({
  id: 'research',
  inputSchema:  z.object({ topic: z.string() }),
  outputSchema: z.object({ brief: z.string() }),
  execute: async ({ inputData }) => ({ brief: await callLLM(inputData.topic) }),
})

const writeStep = createStep({
  id: 'write',
  inputSchema:  z.object({ brief: z.string() }),
  outputSchema: z.object({ summary: z.string() }),
  execute: async ({ inputData }) => ({ summary: await callLLM(inputData.brief) }),
})

const wf = createWorkflow({ id: 'r+w', inputSchema: ..., outputSchema: ... })
  .then(researchStep)
  .then(writeStep)
  .commit()

What you see: Zod-typed steps, a .then() chain. The dependency between research and write is positional in the chain. Mastra workflows expose more types than LangGraph at this complexity level, and trade graph generality for linear DSL clarity.

KaibanJS

const researchTask = new Task({
  description: 'Research the topic {topic} and produce a brief.',
  agent: researcher,
})
const writeTask = new Task({
  description: 'Using the brief produced previously, write a 400-word summary.',
  agent: writer,
})

const team = new Team({
  name: 'Research and Write',
  agents: [researcher, writer],
  tasks: [researchTask, writeTask],
  inputs: { topic: '...' },
})

await team.start()

What you see: agents and tasks declared, dependency implicit in task order and prose references. The board state machine drives execution under the hood.

open-multi-agent

const orchestrator = new OpenMultiAgent({ defaultModel: 'claude-sonnet-4-6', defaultProvider: 'anthropic' })

const team = orchestrator.createTeam('research-and-write', {
  name: 'research-and-write',
  agents: [researcher, writer],   // researcher + writer AgentConfig
  sharedMemory: true,
})

const goal = 'Research "Multi-agent orchestration tradeoffs in TypeScript" and write a 400-word summary.'
const result = await orchestrator.runTeam(team, goal)

What you see: agents declared, a goal sentence, one call. On a goal complex enough to need it, runTeam() runs a coordinator planning pass that decomposes the goal into a DAG (here a researcher → writer dependency), runs the tasks in dependency order, and synthesizes the final summary. One honest caveat, expanded in the cost section below: this particular goal is simple enough that OMA short-circuits and skips the coordinator entirely, dispatching straight to one agent.

Where the cost actually lives

These four snippets look like a "fewer lines = better" comparison. They are not.

Graph-first frameworks make you pay the decomposition cost in code, at design time, in your file. You write the State schema, you write the edges, you decide the routing. The cost is visible, version-controlled, diffable, and stable: the graph behaves the same on Tuesday as it did on Monday because nothing about it has changed.

Goal-first frameworks make you pay the decomposition cost at runtime, in tokens, in a coordinator LLM call. The cost is invisible in your source code but visible in your bill and your trace. A non-trivial runTeam() call spends a coordinator turn to plan the DAG, plus a synthesis turn to combine results, before any of your agents do their actual work. Those two extra turns are the overhead: proportionally heavy on a tiny job, shrinking toward noise as the real work grows. Simple goals skip it entirely, through the short-circuit described below.

The graph-first model is more predictable. The goal-first model is more compressive: the line count is lower because the work moved into a place you cannot see. Neither is free.

This is the part most comparisons skip. Picking a framework is not picking a feature set. It is picking where you want the topology cost to live: in your repo or in your API bill.

Which paradigm fits your work

Graph-first when the shape of the work is fixed. A pipeline that runs the same five steps on every record is worth writing once as an explicit graph; goal-first would just rediscover that DAG every run and bill you a coordinator turn for it. Same when you need an audit trail (compliance, legal, medical, finance), where "the coordinator decided to skip step 4 this run" is not an answer and an explicit edge from node 3 to node 5 is. Same, doubly, when the problem is a real state machine: cycles, retries that mutate state, interrupt-and-resume across long pauses; LangGraph.js is the most mature option there. And the most common reason of all, less technical than the rest: if your senior engineers do not yet trust an LLM making the routing call, build graph-first and earn that trust before you hand it over.

Goal-first when the shape of the work varies. "Summarize this 3-page contract" and "summarize this 80-page master services agreement" want different sub-tasks; hard-coding the maximum graph and gating it with conditionals is doable but ugly, and a coordinator handles it naturally. It is also the faster paradigm while you are still discovering the decomposition, since changing a sentence-level goal beats redrawing a graph after every user conversation. Once you have several similarly-capable agents, letting a coordinator route between them beats encoding a soft "who does what" call as hard edges. And if your product's value is that it adapts to the user's goal rather than running a fixed automation, you want that planning surface visible, not buried in a graph DSL.

The Coordinator: what the framework writes for you in goal-first

Goal-first sounds magical when you describe it from outside. From inside, the trick is mundane: it is one well-prompted LLM call.

When you call runTeam(team, goal) in open-multi-agent, the orchestrator does this in the runTeam method (full source in src/orchestrator/orchestrator.ts, see the JSDoc on the method):

A temporary coordinator agent receives the goal and the list of agents on the team, with their names, models, and system prompts.
The coordinator is asked to output a JSON array of tasks. Each task has a title, description, assignee (one of the agent names), and an optional dependsOn field listing which earlier tasks must complete first.
Title-based dependency tokens are resolved to task IDs and the array is loaded into a TaskQueue.
A scheduler assigns ready tasks to the named agent. The queue's topological dependency resolution (src/task/queue.ts) figures out which tasks are unblocked.
Independent tasks run in parallel up to maxConcurrency. Results are written to shared memory after each task completes, so downstream tasks can read them.
After all tasks complete, the coordinator runs once more to synthesize a final answer from the collected outputs.
A TeamRunResult is returned with per-agent token usage and a total.

There is a short-circuit: if the goal is short and contains no multi-step signals, the orchestrator skips the coordinator and dispatches the goal directly to the best-matching agent. That keeps trivial goals from paying the planning tax.

The cost is real, and the example here has a sharp edge worth knowing. The two-agent goal in the OMA example above is simple enough that the short-circuit fires: the coordinator never runs, the goal is dispatched straight to a single agent, and for this goal the tax is zero. You pay the planning-plus-synthesis overhead only once a goal is genuinely multi-step, and when you do, it is proportionally heavy on small jobs and minor on large ones. Do not trust a generic number for it: measure your own workload from result.totalTokenUsage, because the ratio depends entirely on how large your real agent work is next to the two coordinator turns.
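The arithmetic behind "heavy on small jobs, minor on large ones" is a single ratio. The token counts below are made-up inputs chosen to show the shape of the curve, not measurements, and how you split coordinator tokens out of the total depends on the result fields your version exposes.

```typescript
// Share of a run's tokens spent on the coordinator's planning + synthesis turns.
function coordinatorOverhead(coordinatorTokens: number, totalTokens: number): number {
  if (totalTokens <= 0) return 0
  return coordinatorTokens / totalTokens
}

// Same hypothetical coordinator cost, two very different job sizes.
console.log(coordinatorOverhead(1500, 3000))  // 0.5   half the run is overhead
console.log(coordinatorOverhead(1500, 60000)) // 0.025 overhead is close to noise
```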

What goal-first loses on

Now the honest counterweight: where goal-first is the weaker choice.

The clearest cost is tokens. Whenever a goal is complex enough that the coordinator runs, you pay for that planning turn plus a synthesis turn on top of the actual agent work. Putting the coordinator on a cheap model softens that but does not remove it. There is also a subtler cost that grows with the team: every dependent task carries upstream results forward as context, so agents spend tokens re-reading state they did not produce. Ken W Alger aptly named that the "Prose Tax", and it is the argument for passing explicit, typed results between agents rather than free-form prose.

Then there is control flow. A flat DAG is the natural unit, so anything that is really a state machine (cycles, retry-with-mutation, interrupt-and-resume across long pauses, deeply nested conditionals) can be bent to fit but should not be. If your problem is a state machine, do not pretend it is a DAG.
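A cheap guard for this is to check that a task list really is a DAG before running it. The sketch below uses Kahn's algorithm over title-based dependencies; the `DepTask` shape and `topoOrder` name are invented for the example and are not part of any framework's API.

```typescript
// Hypothetical task shape: dependencies are referenced by title.
interface DepTask { title: string; dependsOn: string[] }

// Kahn's algorithm: returns a valid execution order, or null if the
// dependencies contain a cycle or reference an unknown title.
function topoOrder(tasks: DepTask[]): string[] | null {
  const remaining = new Map<string, number>()
  const dependents = new Map<string, string[]>()
  for (const t of tasks) {
    remaining.set(t.title, t.dependsOn.length)
    dependents.set(t.title, [])
  }
  for (const t of tasks) {
    for (const d of t.dependsOn) {
      const list = dependents.get(d)
      if (!list) return null // dependency on a task that does not exist
      list.push(t.title)
    }
  }
  const ready = tasks.filter((t) => t.dependsOn.length === 0).map((t) => t.title)
  const order: string[] = []
  while (ready.length > 0) {
    const current = ready.shift()!
    order.push(current)
    for (const next of dependents.get(current) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1
      remaining.set(next, left)
      if (left === 0) ready.push(next)
    }
  }
  // If a cycle exists, its tasks never reach zero remaining dependencies.
  return order.length === tasks.length ? order : null
}

console.log(topoOrder([{ title: 'research', dependsOn: [] }, { title: 'write', dependsOn: ['research'] }]))
console.log(topoOrder([{ title: 'draft', dependsOn: ['review'] }, { title: 'review', dependsOn: ['draft'] }])) // null
```

A draft-review loop like the second call is exactly the kind of workload that wants a state machine instead.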

Debugging is less predictable too. The coordinator can pick a slightly different decomposition between runs, which shifts the output shape. Low temperature, a pinned prompt, and reviewing the plan through the onPlanReady hook all help, but none of them are as deterministic as reading a graph file you wrote yourself.

And there is social proof. Graph-first frameworks have shipped at LinkedIn, Klarna, and J.P. Morgan; goal-first is younger in production at that tier, so it is the harder sell in a board-level review today. For a documented case of an LLM-driven routing system hitting these walls and the engineering response, see Mastra's year of network-to-supervisor, traced through its own issue tracker.

Human-in-the-loop: the bridge to production, already shipped

The path from goal-first as a fast-prototyping paradigm to goal-first as a production paradigm runs through human-in-the-loop, and in open-multi-agent that path is already in place. Two primitives, both shipped, let you put a human between the plan and its execution: the onPlanReady hook (a callback that receives the coordinator's task list and returns approve or reject before anything runs) and PlanOnly mode (runTeam(team, goal, { planOnly: true }), which returns the plan without executing it, so you can inspect or edit it first).
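Before a plan ever reaches a human reviewer, an automated pre-check can reject the obviously bad ones. The function below is a standalone policy sketch: the `PlanTask` shape, the `Verdict` type, and the specific rules are assumptions, and how such a function is wired into onPlanReady depends on the hook's actual signature, which this post does not spell out.

```typescript
// Hypothetical shapes: the post only says the hook receives the task list
// and returns approve or reject.
interface PlanTask { title: string; assignee?: string; dependsOn?: string[] }
type Verdict = 'approve' | 'reject'

// Reject plans that are empty, too large, assign work to unknown agents,
// or depend on tasks that are not in the plan.
function reviewPlan(plan: PlanTask[], allowedAgents: string[], maxTasks = 10): Verdict {
  if (plan.length === 0 || plan.length > maxTasks) return 'reject'
  const titles = new Set(plan.map((t) => t.title))
  for (const task of plan) {
    if (task.assignee !== undefined && allowedAgents.indexOf(task.assignee) === -1) return 'reject'
    if ((task.dependsOn ?? []).some((d) => !titles.has(d))) return 'reject'
  }
  return 'approve'
}

const demoPlan: PlanTask[] = [
  { title: 'Research', assignee: 'researcher' },
  { title: 'Write', assignee: 'writer', dependsOn: ['Research'] },
]
console.log(reviewPlan(demoPlan, ['researcher', 'writer'])) // approve
```

Plans that pass can then go to a person for the actual sign-off, so reviewers only spend time on plans that are at least structurally sound.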

This is the move that brings goal-first into the same audit story graph-first has had since day one: the planning is still LLM-driven, but the executed plan is human-approved. You get the goal-first benefit (you did not write the topology yourself) and the graph-first guarantee (a human signed off on it).

That is the piece that turns goal-first from a prototyping convenience into something you can defend in a production review. The ergonomics will keep improving, but the load-bearing primitives are available today.

Decision rubric

Take this as a starting point, not gospel:

Signal	Lean graph-first	Lean goal-first
Pipeline shape varies per input	weak	strong
Audit trail required	strong	weak
Prototyping a new product	weak	strong
Team unfamiliar with LLM planning	strong	weak
3+ agents with overlapping skills	weak	strong
Cycles or interrupts needed	strong	weak
Goals expressed in user language	weak	strong
Token cost per run is the bottleneck	strong	weak

If your signals split, build the first version graph-first. Graph-first is the safer starting place when you cannot tell which way the pendulum goes. You can always swap to goal-first later when you have learned enough to trust the planning. The reverse migration is harder because you have to invent the explicit topology you never wrote down.
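The rubric can be written down as a tiny scoring function. This is a sketch under one loud assumption: every signal counts equally, which the table itself does not claim. Ties go to graph-first, matching the advice above.

```typescript
type Lean = 'graph-first' | 'goal-first'

// Each signal from the rubric, mapped to the side it leans "strong" toward.
const rubric: Record<string, Lean> = {
  'Pipeline shape varies per input': 'goal-first',
  'Audit trail required': 'graph-first',
  'Prototyping a new product': 'goal-first',
  'Team unfamiliar with LLM planning': 'graph-first',
  '3+ agents with overlapping skills': 'goal-first',
  'Cycles or interrupts needed': 'graph-first',
  'Goals expressed in user language': 'goal-first',
  'Token cost per run is the bottleneck': 'graph-first',
}

// Count the observed signals on each side. Unweighted counting is an
// assumption of this sketch; unknown signal names are ignored.
function recommend(observedSignals: string[]): Lean {
  let graph = 0
  let goal = 0
  for (const signal of observedSignals) {
    const lean = rubric[signal]
    if (lean === 'graph-first') graph++
    else if (lean === 'goal-first') goal++
  }
  // Split signals fall back to graph-first, the safer starting place.
  return goal > graph ? 'goal-first' : 'graph-first'
}

console.log(recommend(['Audit trail required', 'Prototyping a new product']))
console.log(recommend([
  'Pipeline shape varies per input',
  'Goals expressed in user language',
  'Audit trail required',
]))
```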

What I'd build first

If you are starting a new TypeScript agent project today and have not picked a framework, I would advise this sequence:

Build the first three weeks graph-first in Mastra workflows. It is the cleanest TypeScript graph DSL right now and gets you to a running system with the least surface area.
When you find that you are spending most of your time editing the graph rather than the agent prompts, that is the signal your task shape is varying. Try the same task in open-multi-agent with a goal sentence and see whether the coordinator's plan matches what you would have written.
If the coordinator plan is consistently good, switch your prototype to goal-first. If it is consistently off, you are in graph-first territory and you have just saved yourself a six-month detour.

None of this is provably right; it is what I would tell a friend. The frameworks are not enemies, and which paradigm fits your work matters more than which one wins a feature checklist. Pick the paradigm, and the framework choice mostly follows.

About open-multi-agent. TypeScript-native multi-agent orchestration framework, MIT-licensed. Goal-first by design, with a coordinator pattern that decomposes a sentence-level goal into a parallel task DAG at runtime. Three runtime dependencies (@anthropic-ai/sdk, openai, zod). The TypeScript-ecosystem answer to CrewAI's role/goal/crew pattern.

Repo: https://github.com/open-multi-agent/open-multi-agent. Coordinator implementation: src/orchestrator/orchestrator.ts.

Related posts.

5 Walls Multi-Agent Frameworks Hit: the empirical companion to this taxonomy. One TypeScript framework's year-long migration from LLM-driven routing to a supervisor tree, traced through its issue tracker, is the production evidence behind the goal-first / graph-first split.

5 walls multi-agent frameworks hit: receipts from Mastra's year of .network() to Supervisor migration

JackChen — Thu, 21 May 2026 16:28:50 +0000

Multi-agent in TypeScript is engineering-hard. Context propagation between agents, routing quality across providers, observability inside LLM-driven decisions, nesting depth, performance under concurrency: each of these has bitten Mastra over the past year, with public GitHub issues to prove it.

This post pulls 5 engineering walls out of Mastra's year-long migration from .network() to the Supervisor pattern. I searched the Mastra GitHub repo for AgentNetwork, multi-agent, supervisor, and network, got 32 relevant issues spanning May 2025 to May 2026, and cite 18 representative ones below.

Why Mastra specifically? Because they are the most public case study. On April 9, 2026, Mastra raised a $22M Series A led by Spark Capital, bringing total funding to $35M. Same day, they launched Mastra Platform. I read the Series A post, the Platform announcement, and the pricing page end to end: the exact phrase "multi-agent" appears zero times across all three. They still mention subagents once in the Series A post, so multi-agent coordination has not vanished as a capability. But "multi-agent" as a positioning word is gone.

This is a shift. Nine months earlier, in July 2025, Mastra published "Beyond Workflows: Introducing Agent Network" and positioned automatic LLM-driven multi-agent routing as a step beyond workflows. Nine months later, the Series A narrative is Studio + Server + Memory Gateway. It is "agent infrastructure platform." It is "framework gives you primitives, platform gives you tools to run at scale."

What happened in between? The 32 issues are far more honest than any blog post. They cover Mastra's full arc: every iteration, every transition, every shift in positioning.

The five walls are below. Each one Mastra hit, with issue receipts.

Context: everyone was racing on multi-agent

To understand the weight of this shift, look at the field. Through 2024 and 2025, LangGraph, CrewAI, AutoGen, and Mastra all pushed multi-agent as a core narrative. The Microsoft AutoGen paper kept emphasizing "multiple agents collaborating outperform a single agent." LangChain promoted LangGraph to a top-line product. CrewAI grew to tens of thousands of stars in a year.

In the TypeScript world, Mastra was the standard bearer. Founded October 2024 by Gatsby co-founder Sam Bhagwat and team. YC W25. $13M seed round announced October 2025, from 100+ investors (the post headline says "120+ others"), including YC, Paul Graham, Guillermo Rauch, Amjad Masad, and Balaji Srinivasan. Three founders from a framework used by hundreds of thousands of developers.

They had every advantage: TS ecosystem, Gatsby pedigree, YC, top-tier VCs, marquee angels, a $22M Series A, and customers including Replit, Brex, Sanity, Factorial, Indeed, Marsh McLennan, MongoDB, Workday, and Salesforce.

And they still moved .network() and the multi-agent framing out of the headline.

The full timeline of Mastra's multi-agent narrative

Date	Event	Position of "multi-agent" in their story
Oct 2024	Team formed	None. Pitch was "TS framework for the next million AI developers"
H1 2025	AgentNetwork v1 (experimental)	Present, but they later admitted v1 was "pretty whack"
Jul 3, 2025	Blog: "Beyond Workflows: Introducing Agent Network (vNext)"	Peak. Original wording: "intelligent AI orchestration that automatically routes and executes complex multi-agent tasks without predetermined workflows"
Aug 26, 2025	Blog: "Improved agent orchestration with AI SDK v5"	The "orchestration" in the title quietly downgrades to single-agent tool orchestration
Oct 8, 2025	$13M seed round announced	Not a core funding narrative
Oct 10, 2025	Blog: "The evolution of AgentNetwork." `.network()` API consolidates	Multi-agent still featured, but the API is simplifying
Nov 2025	v1 Beta	Still mentions `.network()` for agent networks
Jan 2026	v1.0 stable	Multi-agent no longer a top-line feature
Feb 26, 2026	Supervisor pattern launches as the first-class primitive for multi-agent orchestration	`.network()` later marked deprecated in the migration guide
Apr 9, 2026	Mastra Platform + Series A $22M	The exact phrase "multi-agent" appears 0 times across the Series A, Platform, and Pricing pages. `subagents` still mentioned once
May 19, 2026	"Introducing A2A support"	Cross-framework interop protocol. Multi-agent capability continues, but now framed as agent-to-agent interop rather than internal orchestration

Multi-agent as primary headline ran from July 2025 to November 2025, roughly 4 to 5 months, with a quiet downgrade already visible in August. An API migration to Supervisor followed in February, and a vocabulary switch in the Series A in April.

Nine months of repositioning.

Blogs are written for press. Issues are real.

Announcement posts have communications teams. GitHub issues do not.

The Mastra repo currently has about 24.1k stars, 2.1k forks, and 200+ open issues (checked 2026-05-21). Of the 32 issues matching my search, this post cites 18 representative ones. Five themes show up repeatedly. These are not feature requests. They are not typos or doc errors. They are the actual hard problems of running multi-agent systems in production.

Mastra spent a year on them and chose to migrate to a structurally simpler design (Supervisor tree) that sidesteps some, but not all, of them.

Here are the five walls.

Wall 1: Memory, context propagation, and persistence between agents

This is the deepest wall, and the one Mastra hit longest.

Issue #11468, titled simply "Agent Network," was filed December 29, 2025 from their Discord. The original text:

"Using agent.network() I found something that when an orchestrating agent decides which secondary agent to call, the message history is not transferred to the secondary agent, making it difficult for it to understand the context for action. Please, can you help me with this? I haven't found in any documentation how to pass the memory to this flow in the final agent."

Translated to product language: the coordinator decides who to call, but the agent being called does not know why it is being called.

This problem persisted in Mastra's tracker for at least six months. Issue #5381 ("Memory for Networks?") was filed June 23, 2025. Adjacent memory/storage/persistence issues continued after the Supervisor migration, including #15336 ("LibSQL Storage/Memory Error with supervisor agent and sub agents") and #14583 ("Supervisor/Subagent Persistence Duplication"). These are not strictly the same "message history not propagated" bug as #11468, but they share a root: state management between a coordinator and its sub-agents is hard, in multiple ways.

The real engineering hardness: you cannot dump the entire conversation history into every sub-agent (token explosion, privacy, signal-to-noise), but you cannot leave them blind either (they need task context). The tradeoff is an open problem, not a few-months problem.
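To make the tradeoff concrete, here is one possible middle ground: hand the sub-agent the task plus only the most recent messages that fit a budget. This is a sketch, not how Mastra or open-multi-agent does it. Message, buildBrief, and the character budget (a crude stand-in for a token budget) are all assumptions.

```typescript
interface Message {
  role: 'user' | 'assistant'
  content: string
}

// Build a brief for a sub-agent: why it was called, plus the newest messages
// that fit inside a character budget.
function buildBrief(task: string, history: Message[], maxChars: number): string {
  const kept: string[] = []
  let used = 0
  // Walk backwards so the newest messages win when the budget runs out.
  for (let i = history.length - 1; i >= 0; i--) {
    const message = history[i]
    const line = `${message.role}: ${message.content}`
    if (used + line.length > maxChars) break
    kept.unshift(line)
    used += line.length
  }
  return [`Task: ${task}`, 'Recent context:'].concat(kept).join('\n')
}

const history: Message[] = [
  { role: 'user', content: 'Our checkout page times out under load.' },
  { role: 'assistant', content: 'Which region is affected?' },
  { role: 'user', content: 'eu-west, since Tuesday.' },
]

// A budget of 70 characters keeps the last two messages and drops the first.
const brief = buildBrief('Check eu-west latency dashboards', history, 70)
console.log(brief)
```

Recency is only one policy. Which messages a sub-agent should see by default is exactly the part that stays hard.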

Wall 2: Routing quality and prompt fragility

Automatic LLM-driven routing depends on prompt robustness across models. Cross-provider, the same routing prompt behaves very differently.

The receipts:

#9873 (2025-11-07) "Network Agents does not forward the request to sub agents inside the network." Routing literally does not work
#12468 (2026-01-29) "Agent Network Routing Latency." Slow
#12955 (2026-02-11) "The sub agents are returning empty output inside network." Sub-agents return empty
#13621 (2026-02-28) "Agent Network routing prompt has trailing whitespace, causing failures with Bedrock-backed Claude models." A trailing whitespace in a routing prompt breaks the entire routing chain on Bedrock Claude.

The last one is the most diagnostic. A trailing whitespace, undebuggable across providers. This is not a user mistake. This is the brittleness of LLM-driven routing as a paradigm. Switch providers and your routing behavior may need a full re-tune.
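One cheap defensive habit is to normalize whitespace in prompt templates before they are sent. The helper below is a generic sketch (normalizePrompt is a hypothetical name). I am not claiming it is the fix applied for #13621, only that it removes one class of provider-specific surprises.

```typescript
// Strip trailing spaces and tabs from every line, then drop trailing blank
// lines, so a template produces the same bytes however it was authored.
function normalizePrompt(prompt: string): string {
  return prompt
    .split(/\r?\n/)
    .map(line => line.replace(/[ \t]+$/, ''))
    .join('\n')
    .replace(/\s+$/, '')
}

const routingPrompt = 'Pick an agent:  \n- researcher\t\n\n'
const cleaned = normalizePrompt(routingPrompt)
console.log(JSON.stringify(cleaned))
```

Leading indentation is left alone on purpose, since it can carry meaning in list-style prompts.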

Wall 3: AgentNetwork routing observability gap

LLMs do the routing inside AgentNetwork. Users couldn't see why.

Issue #12277 (2026-01-24) "Missing Observability for Routing and Validation LLM Calls in Agent Networks" pointed this out directly. The scope is narrow: it's specifically about tracing for AgentNetwork's internal routing and validation LLM calls, not framework-wide observability. By that date, .network() had been live for roughly three months. Production users of .network() had been flying blind on the routing layer that whole time.

Observability for AgentNetwork's routing and validation had to be a day-one design decision. Adding it months in means months of users hitting "why did the coordinator pick this agent" without an answer.

Wall 4: Three-level nesting already breaks

Issue #15013 (2026-04-03) "3-level sub-agent delegation: no progressive streaming to client."

Three levels of sub-agent delegation is enough to break streaming.

This matters because multi-agent frameworks that aspire to be an "agent OS" or "agent operating system" need to support deep organizational structures. Mastra cracks at three levels of sub-agent delegation. I haven't found public benchmarks for any framework at four levels of agent delegation, and Mastra hasn't disclosed the topology depth of their customer workloads (Brex, Indeed, Marsh McLennan), so I can't claim anything about that ceiling. What I can say is: three-level streaming broke for at least one user of Mastra, and the "deep agent organization" pitch deserves a higher evidence bar than it usually gets.

Wall 5: Performance collapse

Issue #15478 (2026-04-17, closed 2026-05-20), "[RFC] Agent Performance Optimization (Slow Responses)."

This is an RFC, not a bug. Mastra opened a public RFC acknowledging slow agent responses had reached the level of a systemic issue. The RFC was closed on May 20, 2026, the day before this post, after a maintainer commented that it had been taken care of.

The diagnosis comes via #15677 (2026-04-23):

"ObservationalMemoryProcessor.processInputStep blocks every agentic loop step with DB reads and token counting even when far below thresholds."

Translated: every agent loop iteration triggers a database read and token counting. Tolerable on a single agent. Catastrophic when amplified across a supervisor with multiple concurrent sub-agents.
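The general shape of a remedy for this kind of cost is a cheap guard in front of the expensive work. The sketch below is illustrative only and is not Mastra's fix. It assumes a byte-level tokenizer (no more tokens than UTF-8 bytes), and expensiveExactCount is a stand-in for whatever exact tokenizer or DB-backed check a real system would call.

```typescript
// Assumption: a byte-level tokenizer never emits more tokens than UTF-8 bytes,
// and a JavaScript string never needs more than 3 bytes per UTF-16 code unit.
function tokenUpperBound(text: string): number {
  return text.length * 3
}

let exactCounts = 0

// Stand-in for an expensive exact tokenizer or DB-backed check.
function expensiveExactCount(text: string): number {
  exactCounts++
  return text.split(/\s+/).filter(word => word.length > 0).length
}

function overThreshold(text: string, thresholdTokens: number): boolean {
  // Far below the threshold: skip the expensive path entirely.
  if (tokenUpperBound(text) < thresholdTokens) return false
  return expensiveExactCount(text) >= thresholdTokens
}

const small = overThreshold('short step output', 30000)
const callsAfterSmall = exactCounts
const large = overThreshold('a '.repeat(20000), 30000)
console.log(small, callsAfterSmall, large, exactCounts)
```

The point is the ordering: the per-step path stays O(1) until the input is large enough that the exact answer could actually matter.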

The hidden cost of multi-agent is consistently underestimated. Each agent is at least one LLM call. Each call needs context handling, plus observability, tracing, token counting, and memory I/O. Mastra is exposing the real cost of these "lightweight" operations once they are stacked on multi-agent topologies.

What happened after migrating to Supervisor

On February 26, 2026, Mastra officially launched the Supervisor pattern. The changelog described it as "a first-class supervisor pattern, exposed through the same primitives you already use, stream() and generate()." .network() was later marked deprecated in the migration guide: "will be removed in a future release. While existing code will continue to work until then, no new features will be added to it."

The logic of the migration: shift from LLM-driven routing to manually configured sub-agents in a tree. Simpler structure, more predictable decisions, fewer bug surfaces.

The issue data tells a different story.

From March to May 2026, Supervisor-related issues clustered:

#14583 (2026-03-23) Supervisor/Subagent persistence duplication
#14723 (2026-03-26) Supervisor and sub-agent interactions stored as Supervisor-and-User interactions (history pollution)
#14820 (2026-03-29) No way to abort sub-agent execution in supervisor mode
#15013 (2026-04-03) Three-level sub-agent delegation streaming broken
#15336 (2026-04-14) LibSQL storage + sub-agent throws
#15436 (2026-04-16) No control over sub-agent tool results
#15734 (2026-04-24) Suspend/resume breaks when sub-agent owns a workflow
#15887 (2026-04-28) Sub-agent calls serialized under approval mode (concurrency dies)
#16422 (2026-05-11) transformAgent drops sub-agent tool input streaming chunks
#15478 (performance RFC, closed 2026-05-20)

Supervisor is not the destination. It simplified some problems (no more LLM auto-routing), but the core multi-agent challenges (context propagation, persistence consistency, nesting depth, concurrency control, streaming) did not disappear. They got new issue numbers and resurfaced.

"Simplified design" is the story told to the community and to investors. The engineering reality is that they are still patching.

What this means

Three takeaways from a year of Mastra's public behavior.

One. The migration is not a failure of the concept. It is a measure of how hard the engineering is.

Multi-agent as a concept is validated in research and product. LangChain, AutoGen, CrewAI are all doing it. The gap is between "concept works" and "production-stable." Crossing it took Mastra a year, dozens of issues, one major API rewrite, and a vocabulary switch in the Series A. This is not a "pick it up and ship" direction. It is a real-engineering domain.

Two. Multi-agent depth beyond two levels remains a hard, undersolved problem.

Mastra's 3-level streaming bug (#15013) suggests this isn't a Mastra-only ceiling, but I can't speak for frameworks I haven't tested. What I can say from the receipts: prompt robustness, context propagation, streaming, token accounting, error recovery each barely held at two levels in Mastra's case, got shaky at three, and I haven't found public reliable demos at four for any framework. If you have counterexamples, I'd genuinely like to see them.

Three. "Agent OS" as a buzzword does not match engineering reality.

An operating system implies stability, predictability, and deep process nesting. A system that breaks streaming at three levels of delegation, needs an RFC for performance, and took six months to figure out context propagation is at best a framework. Calling it an OS is writing a check that current technology cannot cash.

Mastra clearly knows this. Their funding announcements do not use "OS." They use "Platform." They use "framework." They use "infrastructure." These are bounded words.

Where I am sitting

I have been working on an open-source TypeScript multi-agent framework since April 2026. We call it open-multi-agent. The repo lives at github.com/open-multi-agent/open-multi-agent.

Watching Mastra's migration over the past year did not convince me the road is dead. It convinced me of something else: Mastra chose to migrate .network() into Supervisor and shift the headline vocabulary. We are choosing to walk directly into these five walls and call them out by name.

Where we are today:

Wall 1 (context propagation): SharedMemory as a namespaced key-value store, injected into prompts as markdown summaries, plus MessageBus for point-to-point and broadcast messages between agents. Different mechanism from Mastra's "pass message history." We sidestep the problem rather than solving it directly
Wall 2 (routing): Coordinator decisions are constrained by Zod schema with one automatic retry on validation failure. Local model fallback parses raw JSON and markdown-fenced JSON output formats
Wall 3 (observability): Built-in post-run task DAG dashboard (pure HTML render, no I/O dependency). Every team run renders the task DAG, assignee per task, status, timing, and token usage. Day-one design
Wall 4 (nesting): maxDelegationDepth cap, plus cycle detection (target already in delegationChain is rejected) and agent pool deadlock detection (rejected when availableRunSlots < 1). Three guards from day one
Wall 5 (performance): Three runtime dependencies only (@anthropic-ai/sdk, openai, zod). SharedMemory is in-process by default with no per-step DB I/O. Three-layer Semaphore concurrency control (agent pool, per-agent, tool execution)
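To show what the three Wall 4 guards amount to, here is a standalone sketch. The names maxDelegationDepth, delegationChain, and availableRunSlots come from the list above; the DelegationRequest shape, the checkDelegation function, and the exact depth comparison are assumptions for illustration, not the framework's source.

```typescript
interface DelegationRequest {
  delegationChain: string[]   // agents already on the call path, outermost first
  target: string              // agent being delegated to
  availableRunSlots: number   // free slots in the agent pool
}

type GuardResult = { ok: true } | { ok: false; reason: string }

function checkDelegation(req: DelegationRequest, maxDelegationDepth: number): GuardResult {
  // Guard 1: depth cap (comparing chain length to the cap is an assumption).
  if (req.delegationChain.length >= maxDelegationDepth) {
    return { ok: false, reason: 'depth cap reached' }
  }
  // Guard 2: cycle detection, target already on the call path.
  if (req.delegationChain.indexOf(req.target) !== -1) {
    return { ok: false, reason: 'cycle' }
  }
  // Guard 3: pool deadlock, no free slot to run the delegate in.
  if (req.availableRunSlots < 1) {
    return { ok: false, reason: 'no free run slot' }
  }
  return { ok: true }
}

const allowed = checkDelegation(
  { delegationChain: ['lead'], target: 'researcher', availableRunSlots: 2 }, 3)
const cyclic = checkDelegation(
  { delegationChain: ['lead', 'researcher'], target: 'lead', availableRunSlots: 2 }, 3)
const tooDeep = checkDelegation(
  { delegationChain: ['a', 'b', 'c'], target: 'd', availableRunSlots: 2 }, 3)
const starved = checkDelegation(
  { delegationChain: ['lead'], target: 'researcher', availableRunSlots: 0 }, 3)
console.log(allowed, cyclic, tooDeep, starved)
```

Rejecting up front is the design choice here: a refused delegation is a visible error, whereas a deadlocked pool or an unbounded chain is a hang.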

What we have not solved

None of the five walls is "we figured it out with a clever design." This is industry-hard. Not a single-team intelligence problem.

I should be explicit about where we are short:

Nesting depth: maxDelegationDepth defaults to 3. That is exactly the depth at which Mastra cracked. We have not done serious engineering tests beyond four. Open problem for us too
Performance: We have not systematically load-tested 100 tasks × 10 agents. Mastra hit performance walls in customer production, and our sample size is not comparable yet
Context propagation: SharedMemory + MessageBus is directionally correct, but the policy of "what a sub-agent sees by default" is still iterating. We have not reproduced every Mastra failure case
Cross-provider robustness: We run basic routing-consistency tests across providers. Edge cases like "trailing whitespace breaks Bedrock Claude" have not been systematically swept

This post is not announcing that we solved what Mastra did not.

It is an invitation. Multi-agent is a real direction with real engineering value and real engineering difficulty. We are continuing to push on it. If you also believe this is worth doing, come help.

How to get involved

Repo: github.com/open-multi-agent/open-multi-agent

PRs welcome. Counterarguments welcome in issues. Failure cases that break our claims especially welcome.

Frequently asked questions

Did Mastra abandon multi-agent?

No. Mastra continues to support multi-agent coordination through the Supervisor pattern (launched February 26, 2026), the subagents primitive, and the A2A cross-framework interop protocol (May 19, 2026). What changed is the headline vocabulary: their Series A and Platform announcements (April 9, 2026) no longer use "multi-agent" as a positioning word. The capability stayed; the API and the marketing both shifted.

If I'm starting a multi-agent project in TypeScript today, what should I use?

It depends on what you need. Mastra's Supervisor pattern is backed by a well-funded company with enterprise customers (Replit, Brex, MongoDB, Workday, Salesforce, Indeed), and it fits well when you want manually configured sub-agents with predictable execution. If you want LLM-driven goal-to-DAG decomposition (one goal in, an auto-generated task DAG executed across multiple agents in dependency order, rather than manually configuring a supervisor tree), open-multi-agent takes that approach with 3 runtime dependencies and a Coordinator pattern. LangGraph is also worth a look if you don't mind Python.

What is open-multi-agent and how does it differ from Mastra?

open-multi-agent is an open-source TypeScript framework for multi-agent orchestration, launched April 2026 (github.com/open-multi-agent/open-multi-agent). The core difference: open-multi-agent uses a Coordinator that decomposes a goal into a task DAG via LLM, then executes tasks in dependency order across multiple agents in parallel, whereas Mastra's current Supervisor pattern uses manually configured sub-agents with the supervisor delegating at runtime. Other design choices include 3 runtime dependencies (@anthropic-ai/sdk, openai, zod), in-process SharedMemory with no per-step DB I/O by default, and built-in delegation depth caps with cycle detection.

Does open-multi-agent solve the 5 walls Mastra hit?

Honestly, partially. We have explicit design choices for each wall (SharedMemory + MessageBus for context, Zod-constrained Coordinator decisions for routing, post-run task DAG dashboard for observability, depth caps + cycle detection for nesting, in-process state for performance). But we have not load-tested at 100 tasks × 10 agents, our default maxDelegationDepth is 3 (exactly where Mastra cracked), and cross-provider routing edge cases like "trailing whitespace breaks Bedrock Claude" have not been systematically swept. This is an open invitation, not a solved problem.

Sources

Mastra posts and announcements:

GitHub issues (in order of appearance):

Context propagation: #11468 #5381 #15336 #14583
Routing quality: #9873 #12468 #12955 #13621
Observability: #12277
Nesting: #15013
Performance: #15478 #15677
Supervisor era: #14723 #14820 #15436 #15734 #15887 #16422

I work on open-multi-agent, an open-source TypeScript multi-agent framework. Comments, counterarguments, and failure cases welcome at the repo.

How to Run a Mixed-Model AI Agent Team in TypeScript?

JackChen — Sat, 16 May 2026 08:05:41 +0000

A practical walkthrough that takes you from a single-model team baseline to a mixed-provider production setup with live cost and latency monitoring, using open-multi-agent, the TypeScript-ecosystem answer to CrewAI.

If you have ever priced out a multi-agent system that runs on a single frontier model, you already know the trap. You wire up three agents (plan, build, review), and the monthly bill can hit hundreds of dollars at a modest cadence (100 runs/day) and climb past four figures at production volume, because the architect, the developer, and the reviewer are all eating Opus tokens to argue about a one-line bug.

Most TypeScript agent frameworks assume one provider, one model. You can swap the model, but only globally. That single knob makes the cost-quality tradeoff harder than it needs to be. The frontier models are right for a small fraction of the team's turns. The rest is wasted spend.

This post shows the alternative. Three agents, three different model tiers, one runTeam() call. Architect on Claude Opus 4.7, developer on a cheaper hosted OpenAI model, reviewer on a local model running through Ollama. The mix is configured per agent in AgentConfig, no provider lock-in, no glue code. By the end you will have:

A concrete way to combine the existing OMA examples for multi-model teams, local models, and cost-tiered execution.
A cost ledger that gets you per-agent dollar numbers from a single onProgress callback.
A current pricing table for the providers used here, snapshot 2026-05-16, so you can do back-of-the-envelope monthly forecasts before you commit.
An honest list of what mixed-model teams cost you (because they do cost something).

The repository examples this post connects are:

examples/basics/multi-model-team.ts: different hosted models per agent.
examples/providers/ollama.ts: Claude plus a local Ollama reviewer through an OpenAI-compatible baseURL.
examples/patterns/cost-tiered-pipeline.ts: token usage and cost comparison across model tiers.
examples/providers/gemini.ts: Pro/Flash tiering inside one provider.

There is no separate companion repo for this post. The point is the pattern: per-agent model assignment is already a first-class field on AgentConfig.

What "mixed-model" actually means here

A few definitions before we start, because the term is overloaded.

Single-model team. Every agent runs on the same provider and the same model. Easiest to reason about, easiest to debug, most expensive when the model is a frontier one.

Multi-provider team. Different agents run on different providers, but each agent's model is still picked at design time, not at runtime. This is what we mean by "mixed-model" in this post.

Dynamic model routing. The framework decides which model to use per turn based on cost, latency, or quality signals. Powerful, but adds a layer of indirection and a new failure mode. Out of scope here. Static per-agent assignment gets you 80% of the value with 5% of the complexity.

The framework primitive we lean on is AgentConfig. open-multi-agent's AgentConfig carries provider, model, baseURL, and apiKey on each agent. The orchestrator instantiates the right LLM adapter (Anthropic, OpenAI, Gemini, Grok, Bedrock, Copilot, and OpenAI-compatible local servers via baseURL) lazily, so you only install the SDKs you actually use.

The four pieces, in order

The OMA repo already has the building blocks. Read them in this order; each one isolates a different part of the mixed-model pattern.

Step 1: All-Opus baseline (the cost ceiling)

import { OpenMultiAgent } from '@open-multi-agent/core'
import type { AgentConfig } from '@open-multi-agent/core'

const architect: AgentConfig = {
  name: 'architect',
  provider: 'anthropic',
  model: 'claude-opus-4-7',
  systemPrompt: 'You are a senior software architect...',
  temperature: 0.2,
}
const developer: AgentConfig = { ...architect, name: 'developer', systemPrompt: '...' }
const reviewer:  AgentConfig = { ...architect, name: 'reviewer',  systemPrompt: '...' }

const orchestrator = new OpenMultiAgent({
  defaultModel: 'claude-opus-4-7',
  defaultProvider: 'anthropic',
})

const team = orchestrator.createTeam('design-build-review', {
  name: 'design-build-review',
  agents: [architect, developer, reviewer],
  sharedMemory: true,
})

const result = await orchestrator.runTeam(team, 'Implement retryWithBackoff<T>...')
console.log(result.totalTokenUsage)

That is the entire team. Three agents, one model, one orchestrator. The coordinator that gets spawned by runTeam() decomposes the goal into tasks, assigns each task to an agent, runs them with dependency-aware parallelism, and returns a TeamRunResult with per-agent token usage in agentResults and a total in totalTokenUsage.

We use it as the cost ceiling. Everything that follows will be measured against this baseline.

Step 2: Dual-model (Opus plans, OpenAI executes)

The architect's job is to look at the goal once, decide what to build and what to skip, and lay out an API shape. It is called rarely, but each call is high leverage. Opus is right for that.

The developer and the reviewer fire many times on smaller prompts. The reviewer in particular is mostly running short checklist passes. Spending Opus tokens on those is wasteful. A mini-tier OpenAI model, for example GPT-5.4 mini at $0.75 input and $4.50 output per million tokens as of 2026-05-16, is roughly 5.5x cheaper than Opus on the output side. You can absorb the quality delta on simpler turns.

const architect: AgentConfig = {
  name: 'architect',
  provider: 'anthropic',
  model: 'claude-opus-4-7',
  systemPrompt: '...',
}
const developer: AgentConfig = {
  name: 'developer',
  provider: 'openai',
  model: 'gpt-5.4-mini',
  systemPrompt: '...',
}
const reviewer: AgentConfig = {
  name: 'reviewer',
  provider: 'openai',
  model: 'gpt-5.4-mini',
  systemPrompt: '...',
}

Same createTeam(), same runTeam(). The only thing that changes is provider and model on two of the three agents. The orchestrator handles the adapter switching internally; you do not write a single line of provider-specific code in your team setup.

Step 3: Triple-model with a local reviewer

Now we push the reviewer onto a local model. The argument is the same as Step 2, pushed harder: the reviewer does style checks, edge-case prompts, and short-form QA. A local Ollama model is often good enough for that, and the marginal cloud cost is zero.

The trick is that Ollama exposes an OpenAI-compatible endpoint, so you reuse the openai adapter and override baseURL and apiKey on the agent:

const reviewer: AgentConfig = {
  name: 'reviewer',
  provider: 'openai',
  model: 'llama3.1',
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
  systemPrompt: '...',
}

A note that catches people: even when the local server ignores the key, the OpenAI SDK validates that it is non-empty. Pass any placeholder. 'ollama' works.

Prerequisites for this step: ollama pull llama3.1 once, then ollama serve in a separate terminal. Swap llama3.1 for Gemma, Qwen, or whatever local model you actually use. Same setup works with vLLM, LM Studio, llama-server, or any OpenAI-compatible inference server.

Step 4: Live cost and latency monitoring

The mixed-model team is now functionally complete. The problem is you do not yet know what it costs. In production you want to find out fast when an agent hot-loops on Opus and burns $50 in an afternoon.

open-multi-agent's OrchestratorConfig accepts an onProgress callback that fires for agent_start, agent_complete, task_start, task_complete, error, and a few others. We use the start and complete events to build a per-agent ledger.

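// Per-agent timing ledger, keyed by agent name. A second agent_start for the
// same agent overwrites the earlier entry, so this tracks the latest run only.
const ledger = new Map<string, { startedAt: number; finishedAt?: number; model: string }>()
// Resolve the model for an agent name. The coordinator is not in the agent
// list, so it is mapped to the default model configured below.
const modelForAgent = (agent: string) =>
  [architect, developer, reviewer].find(a => a.name === agent)?.model ??
  (agent === 'coordinator' ? 'claude-opus-4-7' : 'unknown')

const orchestrator = new OpenMultiAgent({
  defaultModel: 'claude-opus-4-7',
  defaultProvider: 'anthropic',
  onProgress: (event) => {
    // Record wall-clock start when an agent begins a turn.
    if (event.type === 'agent_start' && event.agent) {
      ledger.set(event.agent, { startedAt: Date.now(), model: modelForAgent(event.agent) })
    }
    // Record wall-clock finish; latency is finishedAt - startedAt.
    if (event.type === 'agent_complete' && event.agent) {
      const entry = ledger.get(event.agent)
      if (entry) entry.finishedAt = Date.now()
    }
  },
})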

After runTeam() finishes, you walk result.agentResults, pull tokenUsage for each agent, look up the per-million pricing for its model, and compute a dollar number. The same ledger pattern used by the cost-tiered example produces output like this (representative shape; for a real recorded run with measured numbers, see the Update 2026-05-22 section near the end of this post):

agent       | model              | latency  | tokens in/out  | cost
------------+--------------------+----------+----------------+--------
coordinator | claude-opus-4-7    |    3.4s  |   1102/   612  | $0.0208
architect   | claude-opus-4-7    |    6.1s  |   1580/  1140  | $0.0364
developer   | gpt-5.4-mini       |    8.7s  |   2240/  2106  | $0.0112
reviewer    | llama3.1           |   12.0s  |   2680/   480  | $0.0000

Grand total: $0.0684 USD
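The per-row arithmetic behind a table like the one above can be sketched as follows. Prices come from the pricing snapshot in this post. The Usage shape (inputTokens / outputTokens) is an assumption for illustration, so check the actual tokenUsage type the library returns before copying this.

```typescript
// USD per one million tokens, copied from the pricing snapshot in this post.
type Price = { inputPerM: number; outputPerM: number }

const PRICES: Record<string, Price> = {
  'claude-opus-4-7': { inputPerM: 5.0, outputPerM: 25.0 },
  'gpt-5.4-mini': { inputPerM: 0.75, outputPerM: 4.5 },
  'llama3.1': { inputPerM: 0, outputPerM: 0 }, // local: zero marginal cloud cost
}

// Assumed field names; the real tokenUsage object may differ.
type Usage = { inputTokens: number; outputTokens: number }

function costUsd(model: string, usage: Usage): number {
  const price = PRICES[model]
  if (!price) throw new Error(`No price configured for model "${model}"`)
  return (usage.inputTokens * price.inputPerM + usage.outputTokens * price.outputPerM) / 1_000_000
}

// Token counts taken from the representative table above.
const rows: Array<{ agent: string; model: string; usage: Usage }> = [
  { agent: 'coordinator', model: 'claude-opus-4-7', usage: { inputTokens: 1102, outputTokens: 612 } },
  { agent: 'architect', model: 'claude-opus-4-7', usage: { inputTokens: 1580, outputTokens: 1140 } },
  { agent: 'developer', model: 'gpt-5.4-mini', usage: { inputTokens: 2240, outputTokens: 2106 } },
  { agent: 'reviewer', model: 'llama3.1', usage: { inputTokens: 2680, outputTokens: 480 } },
]

const total = rows.reduce((sum, r) => sum + costUsd(r.model, r.usage), 0)
for (const r of rows) {
  console.log(`${r.agent.padEnd(11)} | ${r.model.padEnd(18)} | $${costUsd(r.model, r.usage).toFixed(4)}`)
}
console.log(`Grand total: $${total.toFixed(4)} USD`)
```

Throwing on an unknown model is deliberate: a silently missing price reads as $0 and hides exactly the regression the ledger is there to catch.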

The reviewer is the slowest agent in the table because a local model on a consumer machine is usually slower than a hosted frontier model. That is the local-cost trade-off in concrete terms: zero marginal cloud dollars, more wall-clock seconds. Whether that is the right call depends on whether your team blocks on the reviewer or runs it asynchronously.

Pricing snapshot (2026-05-16)

These are the numbers used by the examples and estimates in this post. They are also the numbers you want for any cost back-of-the-envelope before you ship a mixed-model team to production. Verify on the official pages before you commit to a forecast.

Model	Input ($/1M)	Output ($/1M)	Source
Claude Opus 4.7	$5.00	$25.00	platform.claude.com
Claude Sonnet 4.6	$3.00	$15.00	same
Claude Haiku 4.5	$1.00	$5.00	same
GPT-5.5	$5.00	$30.00	openai.com/api/pricing
GPT-5.4	$2.50	$15.00	same
GPT-5.4 mini	$0.75	$4.50	same
Gemini 2.5 Flash	$0.30	$2.50	ai.google.dev/gemini-api/docs/pricing
Local model via Ollama	$0 marginal	$0 marginal	electricity + amortized hardware

A reasonable starting reading of the table: the hosted models charge 5x to roughly 8x more per output token than per input token. That asymmetry is what you want to design around. Push input-heavy agents (research, summarization, retrieval grounding) onto the cheaper models. Reserve the expensive models for agents that produce a lot of high-stakes output.

A worked cost comparison on a recurring workload

Suppose your team runs the same three-agent task 100 times a day (real-world cadence for an automation that fires on inbound webhooks, scheduled batches, or per-customer pipelines). A representative run uses roughly:

Coordinator: 1.1K input, 0.6K output tokens (Opus 4.7 in all variants)
Architect: 1.6K input, 1.1K output
Developer: 2.2K input, 2.1K output
Reviewer: 2.7K input, 0.5K output

Use this as a representative shape, not a benchmark. Your numbers will differ; the math below shows how to do it. If you want to measure your own workload, start with examples/patterns/cost-tiered-pipeline.ts.

All-Opus run (Step 1 baseline) runs to about $0.15 per execution on the token shape above. At 100/day that is roughly $450/month.

Dual-model (Step 2): Opus on coordinator + architect, GPT-5.4 mini on developer + reviewer. About $0.073 per run. Roughly $219/month.

Triple-model (Step 3): Opus on coordinator + architect, GPT-5.4 mini on developer, local Ollama model on reviewer. About $0.068 per run on the cloud side. Roughly $205/month, plus the local model's wall-clock overhead on a machine you already own.

That is a little over 50% monthly savings against the all-Opus baseline on this token shape, with the quality drop contained to lower-risk roles. The savings shape changes by workload: research-heavy, summarization-heavy, or retrieval-grounded teams can gain more by pushing input-heavy roles to cheaper models. Reasoning-heavy teams gain less and may be wrong to mix at all.
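If you want to redo this comparison for your own token shape, a sketch like the one below is enough. It uses the exact token counts from the representative ledger table rather than the rounded bullets, so the results should land close to, not exactly on, the rounded dollar figures quoted above. The 30-day month is my assumption.

```typescript
type Price = { inputPerM: number; outputPerM: number }
const OPUS: Price = { inputPerM: 5, outputPerM: 25 }
const MINI: Price = { inputPerM: 0.75, outputPerM: 4.5 }
const LOCAL: Price = { inputPerM: 0, outputPerM: 0 }

type Role = 'coordinator' | 'architect' | 'developer' | 'reviewer'
const ROLES: Role[] = ['coordinator', 'architect', 'developer', 'reviewer']

// Token shape of one run, from the representative ledger table.
const TOKENS: Record<Role, { input: number; output: number }> = {
  coordinator: { input: 1102, output: 612 },
  architect: { input: 1580, output: 1140 },
  developer: { input: 2240, output: 2106 },
  reviewer: { input: 2680, output: 480 },
}

// Cloud cost in USD of one run for a given role-to-price assignment.
function runCost(assignment: Record<Role, Price>): number {
  let total = 0
  for (const role of ROLES) {
    const t = TOKENS[role]
    const p = assignment[role]
    total += (t.input * p.inputPerM + t.output * p.outputPerM) / 1_000_000
  }
  return total
}

const RUNS_PER_MONTH = 100 * 30 // 100 runs/day, assuming a 30-day month

const perRun = {
  allOpus: runCost({ coordinator: OPUS, architect: OPUS, developer: OPUS, reviewer: OPUS }),
  dual: runCost({ coordinator: OPUS, architect: OPUS, developer: MINI, reviewer: MINI }),
  triple: runCost({ coordinator: OPUS, architect: OPUS, developer: MINI, reviewer: LOCAL }),
}

for (const [name, cost] of Object.entries(perRun)) {
  const saving = 1 - cost / perRun.allOpus
  console.log(`${name}: $${cost.toFixed(4)}/run, $${(cost * RUNS_PER_MONTH).toFixed(0)}/month, ${(saving * 100).toFixed(0)}% below all-Opus`)
}
```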

When mixed-model is the wrong call

The post would be useless if it told you to always mix. Here is the honest list of when not to.

Single-agent tasks. If your goal is small enough that one agent can finish it, do not split into a team just to mix models. The coordinator overhead and inter-agent context shuffling will dwarf the savings.

High-consistency tasks. Models from different providers disagree on edge cases. If your output is graded against a strict rubric (legal review, medical triage, anything with a regulator-facing audit trail), the variance from mixing providers will bite you. Stay on one model and pay the price.

Tight feedback loops with human reviewers. When you are still iterating on prompts and the team's behavior is unstable, mixing models adds a degree of freedom you do not want. Get the team to converge on a single model first, then push pieces out to cheaper models once each role's prompt is stable.

Cheap models with cheap output. If every agent in your team already fits a single Haiku or GPT-5.4 mini run, the absolute savings from mixing are pennies and the operational complexity is real. Mixed-model pays off when at least one role's monthly cost is meaningful enough to justify the variance.

Provider failure handling that is not in place. A mixed-model team has more failure surfaces than a single-model team. Three providers means three rate-limit ceilings, three auth flows, three latency profiles, and three ways your pipeline can stall on a 503. Decide your retry, fallback, and circuit-breaker strategy before you ship.
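For the last point, here is a minimal circuit-breaker sketch with provider fallback. It is not an open-multi-agent feature; the class, the pickProvider helper, and the thresholds are all hypothetical names and values chosen for illustration.

```typescript
// Minimal circuit breaker: opens after maxFailures consecutive failures,
// then allows requests again once cooldownMs has passed.
class CircuitBreaker {
  private failures = 0
  private openedAt: number | null = null
  private readonly maxFailures: number
  private readonly cooldownMs: number

  constructor(maxFailures: number, cooldownMs: number) {
    this.maxFailures = maxFailures
    this.cooldownMs = cooldownMs
  }

  canRequest(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true
    // After the cooldown, requests are allowed again; another failure re-opens the breaker.
    return now - this.openedAt >= this.cooldownMs
  }

  recordSuccess(): void {
    this.failures = 0
    this.openedAt = null
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1
    if (this.failures >= this.maxFailures) this.openedAt = now
  }
}

// Returns the first provider in preference order whose breaker allows a request.
function pickProvider(order: string[], breakers: Map<string, CircuitBreaker>, now: number = Date.now()): string | null {
  for (const name of order) {
    const breaker = breakers.get(name)
    if (!breaker || breaker.canRequest(now)) return name
  }
  return null
}

const breakers = new Map<string, CircuitBreaker>([
  ['anthropic', new CircuitBreaker(2, 30_000)],
  ['openai', new CircuitBreaker(2, 30_000)],
])
console.log(pickProvider(['anthropic', 'openai'], breakers))
```

You would call recordFailure from your error path (for example on a 503 or a rate-limit response) and recordSuccess after a completed call. Whether a fallback provider is acceptable for a given role is a quality decision, not just an availability one.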

What the cookbook example looks like in mixed-model form

open-multi-agent ships a personalized interview simulator cookbook that runs three agents on Claude Sonnet 4.6: an interviewer, an observer, and a reporter. It is a nice match for the mixed-model pattern.

The interviewer does deep, candidate-specific question generation across many turns. That role earns Opus.

The observer reads the transcript after each turn and writes 3-6 short flags. The role is short-output, repeatable, and structurally simple. Push it to a cheaper hosted model or even a local model.

The reporter runs once at the end of the session against a strict Zod schema (recommendation: 'strong-hire' | 'hire' | ..., plus structured arrays). Structured-output agents are sensitive to the underlying model's JSON adherence. Keep that on a frontier model.

The migration is two provider and model edits, two AgentConfig blocks. You do not touch the orchestration logic. You do not refactor the prompts. You read the schema and decide where the consistency requirements actually live.
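To make "two edits" concrete, here is a sketch of the before and after rosters. RoleConfig is a local stand-in for the AgentConfig fields used in this post, the Sonnet model id is my assumption, and putting the observer on a local llama3.1 is just one of the options mentioned above.

```typescript
// Local stand-in for the AgentConfig fields used in this post.
type RoleConfig = {
  name: string
  model: string
  provider: 'anthropic' | 'openai'
  baseURL?: string
  apiKey?: string
}

const SONNET = 'claude-sonnet-4-6' // assumed model id; check your provider's model list

// Baseline: all three roles on the same model.
const baseline: RoleConfig[] = [
  { name: 'interviewer', model: SONNET, provider: 'anthropic' },
  { name: 'observer', model: SONNET, provider: 'anthropic' },
  { name: 'reporter', model: SONNET, provider: 'anthropic' },
]

// Mixed: interviewer moves up, observer moves to a local model, reporter stays put.
const mixed: RoleConfig[] = [
  { name: 'interviewer', model: 'claude-opus-4-7', provider: 'anthropic' },
  { name: 'observer', model: 'llama3.1', provider: 'openai', baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' },
  { name: 'reporter', model: SONNET, provider: 'anthropic' },
]

// List the roles whose model or provider changed between the two rosters.
const changedRoles = mixed
  .filter((after) => {
    const before = baseline.find((b) => b.name === after.name)
    return before?.model !== after.model || before?.provider !== after.provider
  })
  .map((after) => after.name)

console.log(changedRoles)
```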

Why this lives in the TypeScript ecosystem

A small note on positioning since this is the question I get asked most.

CrewAI established the team-of-agents shape that this post leans on: an agent has a role, agents form crews, a crew has a goal, and the framework orchestrates the goal into work. CrewAI is Python-only, and the TypeScript options for the same pattern have been thin until recently. open-multi-agent treats the TypeScript ecosystem as a first-class target: 100% TypeScript runtime, three runtime dependencies (@anthropic-ai/sdk, openai, zod), and the same Goal → Result one-call surface (runTeam) that you would get from CrewAI's Crew.kickoff(). The mixed-model team is, by design, a first-class pattern rather than a custom adapter you write yourself.

If you are coming from CrewAI and looking for the team-of-roles model in TypeScript, the examples above are the migration target.

Update 2026-05-22: Real run data + thinking-mode cost

The original post (2026-05-16) uses Anthropic + OpenAI as the canonical example because that's the setup most TypeScript readers will recognize. After publishing, I ran the same runTeam-shaped pipeline with DeepSeek + a local Qwen model on an M1 16GB MacBook because those were the API credentials I had at hand. Same code path, different provider/model strings. That's the model-agnostic point of the post, applied honestly to my own constraints.

This section adds (1) the actual recorded ledger so anyone reproducing can work from measured data, (2) a side experiment on thinking-mode cost using one Qwen 3.5 9B model with thinking on and with the /no_think workaround, and (3) a known limitation of the OMA + OpenAI-compatible path for thinking-mode models.

Setup for the real run

coordinator / default: DeepSeek deepseek-chat (non-thinking, $0.14 / $0.28 per MTok)
architect: DeepSeek deepseek-reasoner (thinking mode, same pricing)
developer: DeepSeek deepseek-chat
reviewer: Ollama qwen3:8b (local, accessed via Ollama's OpenAI-compatible endpoint)
DAG: runTasks(team, tasks) with explicit per-task assignee (see "Why runTasks" below)
Hardware: MacBook M1, 16GB unified memory, Ollama 0.20.2

Ledger 1: real run

agent       | model              | latency  | tokens in/out  | cost
------------+--------------------+----------+----------------+--------
architect   | deepseek-reasoner  |    25.3s |   1612/  2450  | $0.0009
developer   | deepseek-chat      |    68.1s | 108219/ 10408  | $0.0181
reviewer    | qwen3:8b           |   208.5s |   1432/   696  | $0 (local)

Grand total: $0.0190 USD
Wall total : 5:03

Per-run cost is $0.0190. At 100 runs/day that is roughly $57/month against the original all-Opus baseline of $450/month. The shape is different from the post's "40-70% savings" claim because the DeepSeek pricing floor is much lower than the OpenAI mini tier the original example used. Use whichever pricing matches your stack.

Why `runTasks` over `runTeam(goal)` for mixed cloud + local

I tried runTeam(goal) three times before switching. Two failures worth recording, because they will hit anyone trying to pin a role to local inference:

Short goal (under 200 chars, no complexity keywords). OMA v1.1.0 introduced "Skip Coordinator for Simple Goals" — when isSimpleGoal(goal) returns true, the coordinator skips DAG decomposition entirely and routes the whole goal to the best-matching agent. The reviewer was never invoked.
Multi-step goal that did trigger the coordinator. The coordinator generated a DAG but folded the review work into the developer's task instead of dispatching to the reviewer agent. Even adding "the reviewer agent must independently audit" to the goal text didn't move the dispatch — the coordinator's routing decision wins over goal text hints.

For mixed cloud+local pipelines where you have intentionally pinned a role to local inference, this matters: a coordinator that optimizes your local agent away costs you both the architectural intent and the cost saving. runTasks(team, tasks) is the path that guarantees per-agent dispatch with compile-time assignee typing:

await orchestrator.runTasks(team, [
  { title: 'design-api', assignee: 'architect', description: '...' },
  { title: 'implement',  assignee: 'developer', description: '...', dependsOn: ['design-api'] },
  { title: 'review',     assignee: 'reviewer',  description: '...', dependsOn: ['implement'] },
])

For goal-driven workloads where you genuinely want the coordinator to decide, runTeam(goal) is still the right call. The two APIs are complementary; pick by whether the dispatch decision is yours or the framework's.
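Since the whole point of runTasks here is that no pinned agent gets skipped, a small runtime check before the call can complement the compile-time typing. This is a sketch of my own, not a library function; TaskSpec mirrors only the fields used in the snippet above.

```typescript
// Mirrors the task fields used in the runTasks call above.
type TaskSpec = { title: string; assignee: string; description: string; dependsOn?: string[] }

// Returns a list of human-readable problems; an empty list means the plan looks sane.
function validateTasks(agentNames: string[], tasks: TaskSpec[]): string[] {
  const problems: string[] = []
  const titles = new Set(tasks.map((t) => t.title))
  const assigned = new Set(tasks.map((t) => t.assignee))

  for (const task of tasks) {
    if (!agentNames.includes(task.assignee)) {
      problems.push(`task "${task.title}" is assigned to unknown agent "${task.assignee}"`)
    }
    for (const dep of task.dependsOn ?? []) {
      if (!titles.has(dep)) problems.push(`task "${task.title}" depends on missing task "${dep}"`)
    }
  }
  // The failure mode from this section: an agent that never gets dispatched.
  for (const name of agentNames) {
    if (!assigned.has(name)) problems.push(`agent "${name}" has no task and will never run`)
  }
  return problems
}

const plan: TaskSpec[] = [
  { title: 'design-api', assignee: 'architect', description: '...' },
  { title: 'implement', assignee: 'developer', description: '...', dependsOn: ['design-api'] },
  { title: 'review', assignee: 'reviewer', description: '...', dependsOn: ['implement'] },
]

console.log(validateTasks(['architect', 'developer', 'reviewer'], plan))
```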

Thinking-mode cost: same model, estimated 4-5x latency

Side experiment. I ran the same DAG two more times, only swapping the reviewer model. First with qwen3.5:9b-mlx (Qwen 3.5 9B, MLX-optimized for Apple Silicon, 8.9GB on disk) with thinking mode on its default. Then with the same model + /no_think appended to the reviewer task description (the prompt-level workaround Qwen 3 series ships with).

Ledger	reviewer config	reviewer latency	reviewer in / out tokens	wall total
1	`qwen3:8b` (no thinking by design)	208s	1432 / 696	5:03
2	`qwen3.5:9b-mlx` (thinking default ON)	1347s	21014 / 1566	24:30
3	`qwen3.5:9b-mlx` (`/no_think` workaround)	554s	38342 / 568	10:44

Ledger 2 vs Ledger 3 is the cleanest control: same model, same hardware, only thinking mode varies. Output tokens dropped 64% (1566 → 568) with /no_think, and reviewer latency dropped 59% (1347s → 554s), even though Ledger 3's reviewer received 82% more input (38K vs 21K) because the developer happened to emit more in that run. Normalizing for the input difference, the pure thinking-mode cost on this M1 16GB is an estimated 4-5x reviewer latency.
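One way to reproduce a number in that 4-5x range from the two ledgers is to compare reviewer latency per input token. That assumes latency scales linearly with input size, which is a crude simplification (generation time depends mostly on output and reasoning tokens), so treat it as a back-of-the-envelope check rather than a measurement.

```typescript
// Reviewer rows from Ledger 2 (thinking on) and Ledger 3 (/no_think).
const thinking = { latencyS: 1347, inputTokens: 21014 }
const noThink = { latencyS: 554, inputTokens: 38342 }

// Raw wall-clock ratio, ignoring that Ledger 3 had more input to read (about 2.4).
const rawRatio = thinking.latencyS / noThink.latencyS

// How much more input the /no_think run received (about 1.8).
const inputRatio = noThink.inputTokens / thinking.inputTokens

// Assumption: latency scales linearly with input tokens.
// Equivalent to comparing seconds per input token between the two runs.
const normalizedRatio = rawRatio * inputRatio

console.log({
  rawRatio: rawRatio.toFixed(2),
  inputRatio: inputRatio.toFixed(2),
  normalizedRatio: normalizedRatio.toFixed(2),
})
```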

For short code-review-shaped tasks (~200-word output), thinking mode is overkill. Match the local model + thinking setting to the role, not to "biggest thing I can pull".

Caveat: OMA + OpenAI-compatible can't fully kill Qwen 3.5 thinking

I looked at three ways to disable thinking from OMA's side. One failed, one I expect to fail for the same reason but did not test, and one partially works:

Ollama's native think: false field via the OpenAI-compatible endpoint. Did not work. Adding "think": false to a request body sent to localhost:11434/v1/chat/completions kept the model thinking for 60+ seconds, with reasoning length still 758 chars. The OpenAI Chat Completions schema doesn't define this field, and Ollama's compatibility layer drops it silently.
OMA's AgentConfig.extraBody is designed exactly for passing provider-specific params like GPT-5.5's reasoning_effort: 'xhigh'. It hits the same OpenAI-compatible endpoint, so the same outcome is the most likely (I did not retest separately; if you do, please report back).
/no_think placed cleanly at the end of the user message / task description. Partially works. Reasoning length drops, output tokens drop, latency drops (Ledger 3 above). Putting /no_think in systemPrompt with extra reinforcement ("skip the <think> block entirely") made it strictly worse: the model pushed 1337 chars into reasoning and produced empty content because the max_tokens budget was consumed by the reasoning phase.

The fully-off path goes through Ollama's native /api/chat endpoint with "think": false. On the same review prompt, that endpoint completes in 6.7 seconds. That is the practical best case for this model + hardware on this prompt. OMA's OpenAI-compatible path leaves measurable performance on the table for thinking-mode models. If you have found a way to pass Ollama's think field through an OpenAI-compatible client without bypassing the framework, please open an issue on the OMA repo. I would like to be wrong about this limitation.
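For anyone who wants to reproduce the native-endpoint number, a direct call looks roughly like the sketch below. It bypasses OMA entirely, which is exactly the limitation described above. The function names are mine, and the response shape (message.content) is what I understand Ollama's /api/chat to return for non-streaming requests, so verify it against your Ollama version.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Request body for Ollama's native chat endpoint with thinking disabled.
function buildNativeChatBody(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: false, think: false }
}

// Sends one review prompt straight to Ollama, outside the framework.
async function reviewWithoutThinking(
  model: string,
  prompt: string,
  host = 'http://localhost:11434',
): Promise<string> {
  const res = await fetch(`${host}/api/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildNativeChatBody(model, [{ role: 'user', content: prompt }])),
  })
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`)
  const data = (await res.json()) as { message?: { content?: string } }
  return data.message?.content ?? ''
}

// Inspect the payload without needing a running server.
const body = buildNativeChatBody('qwen3.5:9b-mlx', [{ role: 'user', content: 'Review this diff.' }])
console.log(JSON.stringify(body))
```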

What this changes about the post above

Nothing about the architectural argument changes — per-agent model assignment through AgentConfig is still the lever, runTasks vs runTeam(goal) is still the dispatch choice, and the cost-tiered framing is still the design pattern. The cost numbers in the original worked example use a representative Anthropic + OpenAI shape; the real-run ledger here is a different provider mix with a different cost floor. Both are valid reference points, and being able to swap between them without rewriting orchestration code is exactly what the post argues for.

Wrap-up: what to take from here

Mixed-model agent teams are not a clever trick. They are the right default once your team grows beyond two agents and the workload starts running on a real cadence. The savings can be material, often 40-70% against an all-frontier baseline depending on token shape, the operational cost is real (more failure modes, more variance), and the design choice that matters most is which agent gets the expensive model.

Three takeaways:

Per-agent model assignment is a design lever, not an optimization. Decide it when you decide the team. Retrofitting it later means rewriting prompts that have already drifted to match the wrong model.
Start with two providers, then add local. Step 2 captures most of the savings with two API keys and zero infrastructure. Step 3 is incremental and depends on whether you can spare the local-model latency.
onProgress is the cheapest insurance you can buy. Twenty lines of TypeScript turn token counts into dollar numbers per run. Without it, mixed-model teams silently regress and you find out from the bill.

Start with the existing repo examples: multi-model-team, providers/ollama, and cost-tiered-pipeline. Run the one closest to your workload, then add the per-agent ledger before you scale it up. If you push this pattern to production, I would like to hear what your real cost shape looks like. The reasonable model split is probably different from the examples, and the right answer is workload-specific.

About open-multi-agent. TypeScript-native multi-agent orchestration framework, MIT-licensed. Goal → Result in one runTeam() call. Three runtime dependencies. Repo: https://github.com/open-multi-agent/open-multi-agent. The framework treats the TypeScript ecosystem as a first-class target rather than a secondary port from Python.

Edits and corrections. If a price has moved since 2026-05-16 or a model has been renamed, please open an issue against the OMA repo and I will refresh the constants in the examples.

Adding Multi-Agent Orchestration to a Vercel AI SDK App

JackChen — Wed, 15 Apr 2026 17:20:14 +0000

I hit a wall recently. I had a working AI SDK app -- streamText, useChat, the whole thing -- and then I needed it to do something that a single agent can't: research a topic with one agent, then hand that research to a second agent for writing.

You can do this manually. Glue two generateText calls together, pass context around, handle the error cases. But once you want a coordinator that figures out which tasks to run in what order, or three agents sharing state, you're writing orchestration infrastructure. I didn't want to write orchestration infrastructure.

So I wired open-multi-agent (OMA) into a Next.js API route next to the AI SDK, and the two libraries turned out to work well together. This is how.

Where each library sits

AI SDK and OMA do different jobs. They don't overlap much.

	Vercel AI SDK	open-multi-agent
What it is	LLM call layer + streaming UI	Multi-agent orchestration framework
Core strength	Unified API for 60+ providers, `useChat`, `streamText`, structured outputs	`runTeam()` -- auto task decomposition, parallel execution, shared memory
Agent model	Single agent with tool loop (`ToolLoopAgent`)	Team of agents with coordinator pattern
Streaming	First-class (`toUIMessageStreamResponse`)	Not streaming-native (batch results)
Ecosystem	23,400+ GitHub stars, 10M+ weekly downloads	5,700+ GitHub stars, 3 runtime deps

AI SDK talks to models and streams tokens. OMA sits above that: given a goal and a roster of agents, it breaks the goal into tasks, runs them in dependency order, and collects the results. The two can share the same API route.

What we're building

A Next.js chat app. User types a topic, two agents collaborate on a researched article, the result streams back through useChat.

Browser (useChat)
    |
    v
POST /api/chat
    |
    +-- Phase 1: OMA runTeam()
    |     coordinator decomposes goal
    |     -> researcher agent gathers info
    |     -> writer agent drafts article
    |     (shared memory passes context between agents)
    |
    +-- Phase 2: AI SDK streamText()
    |     streams the team's output to the browser
    |
    v
useChat renders streamed response

Phase 1: OMA runs the team. A coordinator agent (created automatically by runTeam) analyzes the goal, produces a task plan, and executes it. The researcher's output lands in shared memory so the writer can reference it.

Phase 2: the coordinator's final output gets piped into AI SDK's streamText, which streams it to the browser through useChat. This is the bridge between OMA's batch output and AI SDK's streaming protocol.

Step 1: Project setup

mkdir with-vercel-ai-sdk && cd with-vercel-ai-sdk

package.json:

{
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build"
  },
  "dependencies": {
    "@ai-sdk/openai-compatible": "^2.0.0",
    "@ai-sdk/react": "^3.0.0",
    "@open-multi-agent/open-multi-agent": "^1.1.0",
    "ai": "^6.0.0",
    "next": "^16.0.0",
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  }
}

We're using @ai-sdk/openai-compatible here because the demo points at DeepSeek. If you use Anthropic or OpenAI directly, swap in their provider package instead.

npm install

Step 2: The backend

One API route, two phases. The interesting part is how little glue code the integration needs.

app/api/chat/route.ts:

import { streamText, convertToModelMessages, type UIMessage } from 'ai'
import { createOpenAICompatible } from '@ai-sdk/openai-compatible'
import { OpenMultiAgent } from '@open-multi-agent/open-multi-agent'
import type { AgentConfig } from '@open-multi-agent/open-multi-agent'

export const maxDuration = 120

// --- Provider setup (swap this for your preferred LLM) ---
const BASE_URL = 'https://api.deepseek.com'
const MODEL = 'deepseek-chat'

const provider = createOpenAICompatible({
  name: 'deepseek',
  baseURL: `${BASE_URL}/v1`,
  apiKey: process.env.DEEPSEEK_API_KEY,
})

// --- Agent definitions ---
const researcher: AgentConfig = {
  name: 'researcher',
  model: MODEL,
  provider: 'openai',
  baseURL: BASE_URL,
  apiKey: process.env.DEEPSEEK_API_KEY,
  systemPrompt: `You are a research specialist. Given a topic, provide thorough,
factual research with key findings, relevant data points, and important context.
Be concise but comprehensive. Output structured notes, not prose.`,
  maxTurns: 3,
  temperature: 0.2,
}

const writer: AgentConfig = {
  name: 'writer',
  model: MODEL,
  provider: 'openai',
  baseURL: BASE_URL,
  apiKey: process.env.DEEPSEEK_API_KEY,
  systemPrompt: `You are an expert writer. Using research from team members
(available in shared memory), write a well-structured, engaging article
with clear headings and concise paragraphs.`,
  maxTurns: 3,
  temperature: 0.4,
}

OMA's provider: 'openai' means "use the OpenAI-compatible chat completions API." It works with DeepSeek, Ollama, Together, or anything that speaks that protocol.

Now the request handler:

function extractText(message: UIMessage): string {
  return message.parts
    .filter((p): p is { type: 'text'; text: string } => p.type === 'text')
    .map((p) => p.text)
    .join('')
}

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json()
  const lastText = extractText(messages.at(-1)!)

  // --- Phase 1: OMA multi-agent orchestration ---
  const orchestrator = new OpenMultiAgent({
    defaultModel: MODEL,
    defaultProvider: 'openai',
    defaultBaseURL: BASE_URL,
    defaultApiKey: process.env.DEEPSEEK_API_KEY,
  })

  const team = orchestrator.createTeam('research-writing', {
    name: 'research-writing',
    agents: [researcher, writer],
    sharedMemory: true,
  })

  const teamResult = await orchestrator.runTeam(
    team,
    `Research and write an article about: ${lastText}`,
  )

  const teamOutput =
    teamResult.agentResults.get('coordinator')?.output ?? ''

  // --- Phase 2: Stream result via Vercel AI SDK ---
  const result = streamText({
    model: provider(MODEL),
    system: `You are presenting research from a multi-agent team.
The team has already done the work. Relay their output faithfully
in a well-formatted way.

## Team Output
${teamOutput}`,
    messages: await convertToModelMessages(messages),
  })

  return result.toUIMessageStreamResponse()
}
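One thing the handler above does not guard against is an empty or text-free message list: messages.at(-1)! will throw if the array is empty. A defensive variant might look like the sketch below. The types are simplified local stand-ins for UIMessage, and lastUserText is a hypothetical helper name.

```typescript
// Simplified stand-ins for the AI SDK message types used above.
type TextPart = { type: 'text'; text: string }
type Part = TextPart | { type: string }
type ChatMessageLike = { role: string; parts: Part[] }

// Returns the text of the most recent user message, or null if there is none.
function lastUserText(messages: ChatMessageLike[]): string | null {
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i]
    if (!message || message.role !== 'user') continue
    const text = message.parts
      .filter((p): p is TextPart => p.type === 'text' && 'text' in p)
      .map((p) => p.text)
      .join('')
      .trim()
    return text.length > 0 ? text : null
  }
  return null
}

console.log(lastUserText([{ role: 'user', parts: [{ type: 'text', text: 'Rust async runtimes' }] }]))
```

In the route you would return a 400 response when this yields null instead of starting a 30-60 second team run on an empty topic.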

What runTeam() does internally:

A coordinator agent receives the goal plus the agent roster
It produces a JSON task plan -- tasks, assignments, dependency edges
OMA's TaskQueue topologically sorts the plan. Independent tasks run in parallel; dependent tasks wait.
Each agent writes its output to SharedMemory, so the writer can see what the researcher found
The coordinator synthesizes everything into a final output

You define agents and a goal. The coordinator decides the task graph.
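To make the "independent tasks run in parallel; dependent tasks wait" part concrete, here is a sketch of grouping a task plan into execution waves. This is my own illustration of the idea, not OMA's TaskQueue implementation, and the PlannedTask shape is an assumption.

```typescript
type PlannedTask = { title: string; dependsOn?: string[] }

// Groups tasks into waves: every task in a wave has all its dependencies
// satisfied by earlier waves, so tasks within one wave can run in parallel.
function executionWaves(tasks: PlannedTask[]): string[][] {
  const remaining = new Map<string, Set<string>>()
  for (const task of tasks) remaining.set(task.title, new Set(task.dependsOn ?? []))

  const waves: string[][] = []
  while (remaining.size > 0) {
    const ready: string[] = []
    remaining.forEach((deps, title) => {
      if (deps.size === 0) ready.push(title)
    })
    // Nothing is ready but tasks remain: a cycle or a dependency on an unknown task.
    if (ready.length === 0) throw new Error('Cycle or missing dependency in task plan')

    waves.push(ready)
    for (const title of ready) remaining.delete(title)
    remaining.forEach((deps) => {
      for (const title of ready) deps.delete(title)
    })
  }
  return waves
}

const waves = executionWaves([
  { title: 'research' },
  { title: 'outline' },
  { title: 'write', dependsOn: ['research', 'outline'] },
])
console.log(waves)
```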

Step 3: The frontend

AI SDK v6's useChat handles streaming. A few things changed from v3 that tripped me up: there's no built-in handleSubmit or input state anymore, and messages use parts instead of a content string. The isLoading boolean is gone too -- replaced by a status field with four states ('ready', 'submitted', 'streaming', 'error').

app/page.tsx:

'use client'

import { useState } from 'react'
import { useChat } from '@ai-sdk/react'

export default function Home() {
  const { messages, sendMessage, status, error } = useChat()
  const [input, setInput] = useState('')

  const isLoading = status === 'submitted' || status === 'streaming'

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault()
    if (!input.trim() || isLoading) return
    const text = input
    setInput('')
    await sendMessage({ text })
  }

  return (
    <main style={{ maxWidth: 720, margin: '0 auto', padding: '32px 16px' }}>
      <h1>Research Team</h1>

      {messages.map((m) => (
        <div key={m.id} style={{ marginBottom: 24 }}>
          <strong>{m.role === 'user' ? 'You' : 'Research Team'}</strong>
          <div style={{ whiteSpace: 'pre-wrap' }}>
            {m.parts
              .filter(
                (p): p is { type: 'text'; text: string } =>
                  p.type === 'text',
              )
              .map((p) => p.text)
              .join('')}
          </div>
        </div>
      ))}

      {status === 'submitted' && (
        <p>Agents are collaborating -- this may take a minute...</p>
      )}

      {error && <p style={{ color: 'red' }}>Error: {error.message}</p>}

      <form onSubmit={handleSubmit} style={{ display: 'flex', gap: 8 }}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Enter a topic to research..."
          disabled={isLoading}
          style={{ flex: 1, padding: '10px 14px' }}
        />
        <button type="submit" disabled={isLoading || !input.trim()}>
          Send
        </button>
      </form>
    </main>
  )
}

Step 4: Run it

export DEEPSEEK_API_KEY=sk-...
npm run dev

Open http://localhost:3000 and try a topic.

The OMA orchestration phase takes 30-60 seconds (coordinator planning + two agents running sequentially), then the streaming phase kicks in and you get the article token by token.

One gotcha: @ai-sdk/openai v2 defaults to OpenAI's new Responses API (/responses endpoint). If your provider doesn't support it (most don't yet), use @ai-sdk/openai-compatible instead, or call provider.chat('model-name') explicitly rather than provider('model-name'). Burned about 20 minutes on this.

Under the hood

The full request lifecycle:

1. useChat POSTs to /api/chat with the message history
2. runTeam() starts. Coordinator agent receives the goal.
3. Coordinator produces a task plan via LLM call (JSON with tasks, assignments, dependencies)
4. TaskQueue topologically sorts the tasks
5. Researcher agent runs, output goes to SharedMemory
6. Writer agent runs (reads researcher's output from shared memory), produces the article
7. Coordinator synthesizes the final output
8. streamText() takes that output and streams it through AI SDK's wire protocol
9. useChat renders the tokens in the browser

Steps 2-7 happen inside runTeam(). That's where OMA earns its keep -- you declare agents and a goal, it handles decomposition, ordering, and state passing.

When to use what

AI SDK alone handles most single-agent work: chatbots, RAG, tool-calling agents, structured extraction. If one agent can finish the job in a single conversation loop, adding OMA would just be extra complexity.

Add OMA when you need agents collaborating -- research + writing teams, multi-perspective code review, fan-out data collection, anything where one agent's output feeds into another and the dependency graph isn't something you want to hardcode.

Trade-offs, since every library has them:

	AI SDK	OMA
Provider support	60+ (official + community)	Anthropic, OpenAI-compatible, Gemini, Grok
DevTools	Built-in DevTools, Telemetry integration	`onProgress` / `onTrace` callbacks
Community	Massive (10M+ weekly downloads)	Smaller (5,700+ stars)
Maturity	Years of production use	Newer, iterating fast

OMA's strengths are orchestration-specific: automatic task decomposition, dependency DAGs, shared memory, concurrency control with semaphores. Its provider coverage and tooling ecosystem are thinner. Whether that matters depends on your project.

Full example

The working code is in the open-multi-agent repo:

github.com/open-multi-agent/open-multi-agent/tree/main/packages/core/examples/integrations/with-vercel-ai-sdk

Clone it, set your API key, npm install && npm run dev.

If multi-agent orchestration is new to you, the single-agent example might be a better starting point.
