JackChen

Posted on May 16

How to Run a Mixed-Model AI Agent Team in TypeScript?

#opensource #agents #ai #typescript

A practical walkthrough that takes you from a single-model team baseline to a mixed-provider production setup with live cost and latency monitoring, using open-multi-agent, the TypeScript-ecosystem answer to CrewAI.

If you have ever priced out a multi-agent system that runs on a single frontier model, you already know the trap. You wire up three agents, plan, build, review, and the monthly bill can hit hundreds of dollars at modest cadence (100 runs/day) and climb past four figures at production volume, because the architect, the developer, and the reviewer are all eating Opus tokens to argue about a one-line bug.

Most TypeScript agent frameworks assume one provider, one model. You can swap the model, but only globally. That single knob makes the cost-quality tradeoff harder than it needs to be. The frontier models are right for a small fraction of the team's turns. The rest is wasted spend.

This post shows the alternative. Three agents, three different model tiers, one runTeam() call. Architect on Claude Opus 4.7, developer on a cheaper hosted OpenAI model, reviewer on a local model running through Ollama. The mix is configured per agent in AgentConfig, no provider lock-in, no glue code. By the end you will have:

A concrete way to combine the existing OMA examples for multi-model teams, local models, and cost-tiered execution.
A cost ledger that gets you per-agent dollar numbers from a single onProgress callback.
A current pricing table for the providers used here, snapshot 2026-05-16, so you can do back-of-the-envelope monthly forecasts before you commit.
An honest list of what mixed-model teams cost you (because they do cost something).

The repository examples this post connects are:

examples/basics/multi-model-team.ts: different hosted models per agent.
examples/providers/ollama.ts: Claude plus a local Ollama reviewer through an OpenAI-compatible baseURL.
examples/patterns/cost-tiered-pipeline.ts: token usage and cost comparison across model tiers.
examples/providers/gemini.ts: Pro/Flash tiering inside one provider.

There is no separate companion repo for this post. The point is the pattern: per-agent model assignment is already a first-class field on AgentConfig.

What "mixed-model" actually means here

A few definitions before we start, because the term is overloaded.

Single-model team. Every agent runs on the same provider and the same model. Easiest to reason about, easiest to debug, most expensive when the model is a frontier one.

Multi-provider team. Different agents run on different providers, but each agent's model is still picked at design time, not at runtime. This is what we mean by "mixed-model" in this post.

Dynamic model routing. The framework decides which model to use per turn based on cost, latency, or quality signals. Powerful, but adds a layer of indirection and a new failure mode. Out of scope here. Static per-agent assignment gets you 80% of the value with 5% of the complexity.

The framework primitive we lean on is AgentConfig. open-multi-agent's AgentConfig carries provider, model, baseURL, and apiKey on each agent. The orchestrator instantiates the right LLM adapter (Anthropic, OpenAI, Gemini, Grok, Bedrock, Copilot, and OpenAI-compatible local servers via baseURL) lazily, so you only install the SDKs you actually use.

The four pieces, in order

The OMA repo already has the building blocks. Read them in this order; each one isolates a different part of the mixed-model pattern.

Step 1: All-Opus baseline (the cost ceiling)

import { OpenMultiAgent } from '@open-multi-agent/core'
import type { AgentConfig } from '@open-multi-agent/core'

const architect: AgentConfig = {
  name: 'architect',
  provider: 'anthropic',
  model: 'claude-opus-4-7',
  systemPrompt: 'You are a senior software architect...',
  temperature: 0.2,
}
const developer: AgentConfig = { ...architect, name: 'developer', systemPrompt: '...' }
const reviewer:  AgentConfig = { ...architect, name: 'reviewer',  systemPrompt: '...' }

const orchestrator = new OpenMultiAgent({
  defaultModel: 'claude-opus-4-7',
  defaultProvider: 'anthropic',
})

const team = orchestrator.createTeam('design-build-review', {
  name: 'design-build-review',
  agents: [architect, developer, reviewer],
  sharedMemory: true,
})

const result = await orchestrator.runTeam(team, 'Implement retryWithBackoff<T>...')
console.log(result.totalTokenUsage)

That is the entire team. Three agents, one model, one orchestrator. The coordinator that gets spawned by runTeam() decomposes the goal into tasks, assigns each task to an agent, runs them with dependency-aware parallelism, and returns a TeamRunResult with per-agent token usage in agentResults and a total in totalTokenUsage.

We use it as the cost ceiling. Everything that follows will be measured against this baseline.

Step 2: Dual-model (Opus plans, OpenAI executes)

The architect's job is to look at the goal once, decide what to build and what to skip, and lay out an API shape. It is called rarely, but each call is high leverage. Opus is right for that.

The developer and the reviewer fire many times on smaller prompts. The reviewer in particular is mostly running short checklist passes. Spending Opus tokens on those is wasteful. A mini-tier OpenAI model, for example GPT-5.4 mini at $0.75 input and $4.50 output per million tokens as of 2026-05-16, is roughly 5.5x cheaper than Opus on the output side. You can absorb the quality delta on simpler turns.

const architect: AgentConfig = {
  name: 'architect',
  provider: 'anthropic',
  model: 'claude-opus-4-7',
  systemPrompt: '...',
}
const developer: AgentConfig = {
  name: 'developer',
  provider: 'openai',
  model: 'gpt-5.4-mini',
  systemPrompt: '...',
}
const reviewer: AgentConfig = {
  name: 'reviewer',
  provider: 'openai',
  model: 'gpt-5.4-mini',
  systemPrompt: '...',
}

Same createTeam(), same runTeam(). The only thing that changes is provider and model on two of the three agents. The orchestrator handles the adapter switching internally; you do not write a single line of provider-specific code in your team setup.

Step 3: Triple-model with a local reviewer

Now we push the reviewer onto a local model. The argument is the same as Step 2, pushed harder: the reviewer does style checks, edge-case prompts, and short-form QA. A local Ollama model is often good enough for that, and the marginal cloud cost is zero.

The trick is that Ollama exposes an OpenAI-compatible endpoint, so you reuse the openai adapter and override baseURL and apiKey on the agent:

const reviewer: AgentConfig = {
  name: 'reviewer',
  provider: 'openai',
  model: 'llama3.1',
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
  systemPrompt: '...',
}

A note that catches people: even when the local server ignores the key, the OpenAI SDK validates that it is non-empty. Pass any placeholder. 'ollama' works.

Prerequisites for this step: ollama pull llama3.1 once, then ollama serve in a separate terminal. Swap llama3.1 for Gemma, Qwen, or whatever local model you actually use. Same setup works with vLLM, LM Studio, llama-server, or any OpenAI-compatible inference server.

Step 4: Live cost and latency monitoring

The mixed-model team is now functionally complete. The problem is you do not yet know what it costs. In production you want to find out fast when an agent hot-loops on Opus and burns $50 in an afternoon.

open-multi-agent's OrchestratorConfig accepts an onProgress callback that fires for agent_start, agent_complete, task_start, task_complete, error, and a few others. We use the start and complete events to build a per-agent ledger.

const ledger = new Map<string, { startedAt: number; finishedAt?: number; model: string }>()

const orchestrator = new OpenMultiAgent({
  defaultModel: 'claude-opus-4-7',
  defaultProvider: 'anthropic',
  onProgress: (event) => {
    if (event.type === 'agent_start' && event.agent) {
      const cfg = [architect, developer, reviewer].find(a => a.name === event.agent)
      ledger.set(event.agent, { startedAt: Date.now(), model: cfg?.model ?? 'unknown' })
    }
    if (event.type === 'agent_complete' && event.agent) {
      const entry = ledger.get(event.agent)
      if (entry) entry.finishedAt = Date.now()
    }
  },
})

After runTeam() finishes, you walk result.agentResults, pull tokenUsage for each agent, look up the per-million pricing for its model, and compute a dollar number. The same ledger pattern used by the cost-tiered example produces output like this:

agent       | model              | latency  | tokens in/out  | cost
------------+--------------------+----------+----------------+--------
coordinator | claude-opus-4-7    |    3.4s  |   1102/   612  | $0.0208
architect   | claude-opus-4-7    |    6.1s  |   1580/  1140  | $0.0364
developer   | gpt-5.4-mini       |    8.7s  |   2240/  2106  | $0.0112
reviewer    | llama3.1           |   12.0s  |   2680/   480  | $0.0000

Grand total: $0.0684 USD

The reviewer is the slowest agent in the table because a local model on a consumer machine is usually slower than a hosted frontier model. That is the local-cost trade-off in concrete terms: zero marginal cloud dollars, more wall-clock seconds. Whether that is the right call depends on whether your team blocks on the reviewer or runs it asynchronously.

Pricing snapshot (2026-05-16)

These are the numbers used by the examples and estimates in this post. They are also the numbers you want for any cost back-of-the-envelope before you ship a mixed-model team to production. Verify on the official pages before you commit to a forecast.

Model	Input ($/1M)	Output ($/1M)	Source
Claude Opus 4.7	$5.00	$25.00	platform.claude.com
Claude Sonnet 4.6	$3.00	$15.00	same
Claude Haiku 4.5	$1.00	$5.00	same
GPT-5.5	$5.00	$30.00	openai.com/api/pricing
GPT-5.4	$2.50	$15.00	same
GPT-5.4 mini	$0.75	$4.50	same
Gemini 2.5 Flash	$0.30	$2.50	ai.google.dev/gemini-api/docs/pricing
Local model via Ollama	$0 marginal	$0 marginal	electricity + amortized hardware

A reasonable starting reading of the table: the cloud frontier models charge 4x to 8x more on the output side than on the input side. That is the inversion you want to design against. Push input-heavy agents (research, summarization, retrieval grounding) onto the cheaper models. Reserve the expensive models for agents that produce a lot of high-stakes output.

A worked cost comparison on a recurring workload

Suppose your team runs the same three-agent task 100 times a day (real-world cadence for an automation that fires on inbound webhooks, scheduled batches, or per-customer pipelines). A representative run uses roughly:

Coordinator: 1.1K input, 0.6K output tokens (Opus 4.7 in all variants)
Architect: 1.6K input, 1.1K output
Developer: 2.2K input, 2.1K output
Reviewer: 2.7K input, 0.5K output

Use this as a representative shape, not a benchmark. Your numbers will differ; the math below shows how to do it. If you want to measure your own workload, start with examples/patterns/cost-tiered-pipeline.ts.

All-Opus run (Step 1 baseline) runs to about $0.15 per execution on the token shape above. At 100/day that is roughly $450/month.

Dual-model (Step 2): Opus on coordinator + architect, GPT-5.4 mini on developer + reviewer. About $0.073 per run. Roughly $219/month.

Triple-model (Step 3): Opus on coordinator + architect, GPT-5.4 mini on developer, local Ollama model on reviewer. About $0.068 per run on the cloud side. Roughly $205/month, plus the local model's wall-clock overhead on a machine you already own.

That is a little over 50% monthly savings against the all-Opus baseline on this token shape, with the quality drop contained to lower-risk roles. The savings shape changes by workload: research-heavy, summarization-heavy, or retrieval-grounded teams can gain more by pushing input-heavy roles to cheaper models. Reasoning-heavy teams gain less and may be wrong to mix at all.

When mixed-model is the wrong call

The post would be useless if it told you to always mix. Here is the honest list of when not to.

Single-agent tasks. If your goal is small enough that one agent can finish it, do not split into a team just to mix models. The coordinator overhead and inter-agent context shuffling will dwarf the savings.

High-consistency tasks. Models from different providers disagree on edge cases. If your output is graded against a strict rubric (legal review, medical triage, anything with a regulator-facing audit trail), the variance from mixing providers will bite you. Stay on one model and pay the price.

Tight feedback loops with human reviewers. When you are still iterating on prompts and the team's behavior is unstable, mixing models adds a degree of freedom you do not want. Get the team to converge on a single model first, then push pieces out to cheaper models once each role's prompt is stable.

Cheap models with cheap output. If every agent in your team already fits a single Haiku or GPT-5.4 mini run, the absolute savings from mixing are pennies and the operational complexity is real. Mixed-model pays off when at least one role's monthly cost is meaningful enough to justify the variance.

Provider failure handling that is not in place. A mixed-model team has more failure surfaces than a single-model team. Three providers means three rate-limit ceilings, three auth flows, three latency profiles, and three ways your pipeline can stall on a 503. Decide your retry, fallback, and circuit-breaker strategy before you ship.

What the cookbook example looks like in mixed-model form

open-multi-agent ships a personalized interview simulator cookbook that runs three agents on Claude Sonnet 4.6: an interviewer, an observer, and a reporter. It is a nice match for the mixed-model pattern.

The interviewer does deep, candidate-specific question generation across many turns. That role earns Opus.

The observer reads the transcript after each turn and writes 3-6 short flags. The role is short-output, repeatable, and structurally simple. Push it to a cheaper hosted model or even a local model.

The reporter runs once at the end of the session against a strict Zod schema (recommendation: 'strong-hire' | 'hire' | ..., plus structured arrays). Structured-output agents are sensitive to the underlying model's JSON adherence. Keep that on a frontier model.

The migration is two provider and model edits, two AgentConfig blocks. You do not touch the orchestration logic. You do not refactor the prompts. You read the schema and decide where the consistency requirements actually live.

Why this lives in the TypeScript ecosystem

A small note on positioning since this is the question I get asked most.

CrewAI established the team-of-agents shape that this post leans on: an agent has a role, agents form crews, a crew has a goal, and the framework orchestrates the goal into work. CrewAI is Python-only, and the TypeScript options for the same pattern have been thin until recently. open-multi-agent treats the TypeScript ecosystem as a first-class target: 100% TypeScript runtime, three runtime dependencies (@anthropic-ai/sdk, openai, zod), and the same Goal → Result one-call surface (runTeam) that you would get from CrewAI's Crew.kickoff(). The mixed-model team is, by design, a first-class pattern rather than a custom adapter you write yourself.

If you are coming from CrewAI and looking for the team-of-roles model in TypeScript, the examples above are the migration target.

Wrap-up: what to take from here

Mixed-model agent teams are not a clever trick. They are the right default once your team grows beyond two agents and the workload starts running on a real cadence. The savings can be material, often 40-70% against an all-frontier baseline depending on token shape, the operational cost is real (more failure modes, more variance), and the design choice that matters most is which agent gets the expensive model.

Three takeaways:

Per-agent model assignment is a design lever, not an optimization. Decide it when you decide the team. Retrofitting it later means rewriting prompts that have already drifted to match the wrong model.
Start with two providers, then add local. Step 2 captures most of the savings with two API keys and zero infrastructure. Step 3 is incremental and depends on whether you can spare the local-model latency.
onProgress is the cheapest insurance you can buy. Twenty lines of TypeScript turn token counts into dollar numbers per run. Without it, mixed-model teams silently regress and you find out from the bill.

Start with the existing repo examples: multi-model-team, providers/ollama, and cost-tiered-pipeline. Run the one closest to your workload, then add the per-agent ledger before you scale it up. If you push this pattern to production, I would like to hear what your real cost shape looks like. The reasonable model split is probably different from the examples, and the right answer is workload-specific.

About open-multi-agent. TypeScript-native multi-agent orchestration framework, MIT-licensed. Goal → Result in one runTeam() call. Three runtime dependencies. Repo: https://github.com/open-multi-agent/open-multi-agent. The framework treats the TypeScript ecosystem as a first-class target rather than a secondary port from Python.

Edits and corrections. If a price has moved since 2026-05-16 or a model has been renamed, please open an issue against the OMA repo and I will refresh the constants in the examples.

DEV Community