TL;DR: Agent Duelist is a TypeScript-first framework that pits multiple LLM providers against each other on the same tasks. Get structured, reproducible results for correctness, latency, tokens, and cost—all from one unified interface.
The Problem
You're building with LLMs and you need to answer questions like:
- Should I use GPT-5.2 or Claude Opus 4.6 for this task?
- Is Azure OpenAI faster than standard OpenAI for my use case?
- How much will switching models actually cost me?
- Which provider handles tool calls best?
Right now, answering these questions means:
- Writing separate integration code for each provider
- Manually tracking metrics across runs
- Copying results into spreadsheets
- Making educated guesses about cost
There has to be a better way.
Enter Agent Duelist
Agent Duelist is a benchmarking framework that lets you:
✅ Define tasks once, run them everywhere — OpenAI, Azure, Anthropic, Google Gemini, and any OpenAI-compatible gateway
✅ Get real metrics — Latency, token counts, and cost estimates from a bundled pricing catalog
✅ Compare providers objectively — Built-in scorers for correctness, schema validation, fuzzy similarity, and even LLM-as-judge
✅ Benchmark agent workflows — Full tool-calling support with local handlers
✅ TypeScript-native DX — Strongly typed APIs, Zod schemas, and excellent IDE support
✅ CLI-first — From zero to useful comparison tables in minutes
Quick Start (60 seconds)
Install it:
npm install agent-duelist
Initialize a config:
npx duelist init
This creates arena.config.ts. Here's a minimal example:
import { defineArena, openai, anthropic } from 'agent-duelist'
import { z } from 'zod'
export default defineArena({
providers: [
openai('gpt-5.2'),
anthropic('claude-sonnet-4-6-20260217'),
],
tasks: [
{
name: 'simple-qa',
prompt: 'In one sentence, explain what a monorepo is.',
expected: 'A monorepo is a single repository that contains code for multiple projects.',
},
{
name: 'structured-extraction',
prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
expected: { company: 'Acme', year: 2024 },
schema: z.object({
company: z.string(),
year: z.number(),
}),
},
],
scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
runs: 3,
})
Run the benchmark:
npx duelist run
You get a beautiful table showing:
- Rows: Your tasks (simple-qa, structured-extraction)
- Columns: Your providers (OpenAI GPT-5.2, Anthropic Claude Sonnet 4.6)
- Cells: Correctness score, latency, tokens, estimated cost
For CI or dashboards:
npx duelist run --reporter json > results.json
Why Agent Duelist?
1. Provider-Agnostic
Write your tasks once. Swap models and providers without rewriting anything:
const providers = [
openai('gpt-5.2'),
openai('gpt-5.2-chat-latest'),
azureOpenai('gpt-5.2', { deployment: 'my-deployment' }),
anthropic('claude-opus-4-6-20260205'),
anthropic('claude-sonnet-4-6-20260217'),
gemini('gemini-3.1-pro-preview'),
gemini('gemini-3-flash-preview'),
openaiCompatible({
id: 'local/llama',
name: 'Local LLaMA',
baseURL: 'http://localhost:11434/v1',
model: 'llama-3.3',
}),
]
2. Agent-Focused
Designed for real-world agent workflows with tool calling:
const weatherTool = {
name: 'getCurrentWeather',
description: 'Get the current weather in a given city',
parameters: z.object({ city: z.string() }),
handler: async ({ city }) => ({
city,
tempC: 20,
}),
}
export default defineArena({
providers: [openai('gpt-5.2')],
tasks: [
{
name: 'weather-agent',
prompt: 'What is the temperature in Amsterdam? Use the tool.',
expected: { city: 'Amsterdam' },
tools: [weatherTool],
},
],
scorers: ['latency', 'cost', 'tool-usage'],
})
The model calls the tool, your handler executes, and you get metrics on tool usage accuracy.
3. Realistic Metrics
Latency: Wall-clock response time in milliseconds
Token counts: Direct from the provider APIs—the source of truth
Cost estimation: Transparent and conservative
- Bundled pricing catalog derived from OpenRouter's public data
- Maps (provider, model) → { inputPerM, outputPerM } in USD per 1M tokens
- Azure models resolve back to base OpenAI pricing automatically
- Formula: (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
Example output:
Tokens: prompt: 142, completion: 38
Cost: ~$0.189m (millicents)
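The cost formula above can be sketched as a small helper. The `Pricing` shape mirrors the `{ inputPerM, outputPerM }` mapping described earlier; the function name and the per-million prices in the example are illustrative, not values from the bundled catalog:

```typescript
// Pricing entry: USD per 1M tokens, mirroring the catalog mapping above.
interface Pricing {
  inputPerM: number
  outputPerM: number
}

// Estimated USD cost for a single completion, per the formula above.
function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing,
): number {
  return (
    (promptTokens * pricing.inputPerM +
      completionTokens * pricing.outputPerM) /
    1_000_000
  )
}

// Example: 142 prompt + 38 completion tokens at illustrative
// $1.25 (input) / $10 (output) per 1M tokens.
const usd = estimateCost(142, 38, { inputPerM: 1.25, outputPerM: 10 })
```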
4. Rich Scoring System
Built-in scorers:
| Scorer | What It Measures |
|---|---|
| latency | Wall-clock response time |
| cost | Estimated USD cost from tokens + pricing catalog |
| correctness | Exact match against expected (deep-equal) |
| schema-correctness | Validates output against Zod schema |
| fuzzy-similarity | Jaccard token-overlap similarity |
| llm-judge-correctness | LLM-as-judge scoring (accuracy, completeness, conciseness) |
| tool-usage | Whether expected tools were invoked |
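For intuition, Jaccard token overlap (the idea behind the fuzzy-similarity scorer) can be sketched in a few lines. This is a minimal illustration of the metric, not the library's exact tokenization:

```typescript
// Jaccard similarity over lowercased, whitespace-delimited tokens:
// |A ∩ B| / |A ∪ B|, always in [0, 1].
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean))
  const setA = tokens(a)
  const setB = tokens(b)
  const intersection = [...setA].filter((t) => setB.has(t)).length
  const union = new Set([...setA, ...setB]).size
  // Two empty strings are treated as identical.
  return union === 0 ? 1 : intersection / union
}
```

A fully identical answer scores 1; partial overlaps score proportionally, which makes the metric forgiving of reworded but substantively similar outputs.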
LLM-as-Judge Example:
defineArena({
scorers: ['latency', 'cost', 'llm-judge-correctness'],
judgeModel: 'gpt-5.2-chat-latest', // or 'gemini-3.1-pro-preview'
})
The judge evaluates outputs on three criteria and returns a composite 0–1 score.
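Conceptually, the composite works like the sketch below. Equal weighting of the three criteria is an assumption here for illustration; the framework may weight them differently:

```typescript
// Hypothetical per-criterion judge scores, each in [0, 1].
interface JudgeScores {
  accuracy: number
  completeness: number
  conciseness: number
}

// Composite 0–1 score; equal weighting is an assumption, not the
// library's documented formula.
function compositeScore({ accuracy, completeness, conciseness }: JudgeScores): number {
  return (accuracy + completeness + conciseness) / 3
}
```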
5. TypeScript-Native
- Strongly typed provider interfaces
- Zod schemas for structured outputs
- Full IDE autocomplete and type safety
- No runtime surprises
Real-World Example
Let's say you're building an extraction pipeline and need to choose between the latest frontier models:
import { defineArena, openai, anthropic, gemini } from 'agent-duelist'
import { z } from 'zod'
export default defineArena({
providers: [
openai('gpt-5.2'),
openai('gpt-5.2-chat-latest'),
anthropic('claude-sonnet-4-6-20260217'),
gemini('gemini-3-flash-preview'),
],
tasks: [
{
name: 'extract-company',
prompt: 'Extract company and role as JSON from: "I work at Acme Corp as a senior engineer."',
expected: { company: 'Acme Corp', role: 'senior engineer' },
schema: z.object({ company: z.string(), role: z.string() }),
},
{
name: 'classify-sentiment',
prompt: 'Classify sentiment as "positive", "negative", or "neutral": "The product works great!"',
expected: 'positive',
},
],
scorers: ['latency', 'cost', 'correctness', 'schema-correctness'],
runs: 3,
})
Run npx duelist run and get:
Task: extract-company
Provider Latency Cost Tokens Match Schema
────────────────────────────────────────────────────────────────────────────────
openai/gpt-5.2 1905ms ~$0.312m 140 100% 100%
openai/gpt-5.2-chat-latest 842ms ~$0.091m 132 100% 100%
anthropic/claude-sonnet-4.6 1493ms ~$0.189m 126 100% 100%
gemini/gemini-3-flash-preview 610ms ~$0.041m 119 100% 100%
Summary
◆ Most correct: all providers tied (avg 100%)
◆ Fastest: gemini/gemini-3-flash-preview (avg 610ms)
◆ Cheapest: gemini/gemini-3-flash-preview (avg ~$0.041m)
Now you can make data-driven decisions: in this run, Gemini 3 Flash is fastest and cheapest at identical correctness, so heavier models like GPT-5.2 or Claude Sonnet 4.6 would need to earn their premium on your harder tasks.
Use Cases
🔬 Model Selection
Compare models on your actual tasks before committing
💰 Cost Optimization
Identify which model gives you the best quality/cost ratio — in 2026, prices vary wildly between frontier models
⚡ Performance Tuning
Track latency across providers and deployments
🛠️ Agent Development
Benchmark tool-calling accuracy for multi-step workflows
📊 CI/CD Integration
Run benchmarks in CI and fail if metrics regress:
npx duelist run --reporter json | jq '.summary.avgLatency'
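A CI gate can build on that JSON output. The `summary.avgLatency` shape below matches the jq path above, but verify it against your own `--reporter json` output; the function name and threshold are illustrative:

```typescript
// Hypothetical CI gate: returns true when the run's average latency
// exceeds the budget, so the pipeline can fail on regressions.
// The results.json shape is assumed from the jq example above.
function latencyRegressed(resultsJson: string, maxMs: number): boolean {
  const parsed = JSON.parse(resultsJson) as {
    summary: { avgLatency: number }
  }
  return parsed.summary.avgLatency > maxMs
}
```

Pair it with `process.exit(1)` in a small script step and your pipeline turns red the moment a provider slows down.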
📈 Dashboarding
Export JSON results to Grafana, Datadog, or your metrics platform
What's Next?
Already shipped:
- ✅ OpenAI (GPT-5, GPT-5.2), Azure, Anthropic (Claude 4.6 series), Gemini (3/3.1 series), OpenAI-compatible providers
- ✅ 7 built-in scorers including LLM-as-judge
- ✅ Tool-calling support for agent benchmarking
- ✅ Console & JSON reporters
- ✅ Pricing catalog with refresh script
Roadmap (shaped by community feedback):
- 📜 More providers (OpenRouter-native, more gateways)
- 📜 Markdown/HTML/CSV reporters
- 📜 GitHub Actions summaries
- 📜 Multi-step agent workflows
- 📜 Plugin system for custom scorers
- 📜 Embedding-based semantic similarity
Get Started Now
npm install agent-duelist
npx duelist init
npx duelist run
📦 Package: npmjs.com/package/agent-duelist
📖 GitHub: github.com/DataGobes/agent-duelist
🐛 Issues: github.com/DataGobes/agent-duelist/issues
Contributing
Contributions welcome! 🎉
- Bug reports / ideas: Open a GitHub issue
- Code changes: Fork, branch, test (npm test), build (npm run build), open a PR
Keep PRs focused (one provider, one scorer) for easier review.
Final Thoughts
Choosing the right LLM provider shouldn't involve spreadsheets, guesswork, and manual copy-paste. Agent Duelist gives you a single command to get objective, reproducible comparisons.
Whether you're optimizing costs, improving latency, or validating correctness—let the models duel it out.
⚔️ May the best model win.
Tags
#typescript #llm #ai #benchmarking #openai #anthropic #gemini #agents #devtools #opensource
What LLM provider battle are you running first? Drop a comment!