Gijs Jansen

Posted on Mar 1 • Edited on Mar 2

Agent Duelist: Benchmark LLM Providers in One Command

#ai #productivity #devops #typescript

TL;DR: Agent Duelist is a TypeScript‑first framework for benchmarking multiple LLM providers on your real tasks. Get structured, reproducible metrics for correctness, latency, tokens, and cost from a single unified interface.

What the Output Looks Like

Here’s the CLI benchmark summary you get from a single npx duelist run:

And here’s a run rendered as an HTML table:

Run It Yourself (10 seconds)

npm install agent-duelist
npx duelist init
npx duelist run

The Problem

You're building with LLMs and you need to answer questions like:

Should I use GPT-5.2 or Claude Opus 4.6 for this task?
Is Azure OpenAI faster than standard OpenAI for my use case?
How much will switching models actually cost me?
Which provider handles tool calls best?

Today, answering these questions typically means:

Wiring up separate integrations for each provider
Manually tracking latency, tokens, and errors across runs
Copy‑pasting outputs into spreadsheets
Guesstimating costs from pricing pages and blog posts

There has to be a better way to compare models than ad‑hoc scripts and spreadsheets.

Enter Agent Duelist

Agent Duelist is a benchmarking framework that lets you:

✅ Define tasks once, run them everywhere — Benchmark OpenAI, Azure, Anthropic, Gemini, and any OpenAI‑compatible gateway without changing task code.
✅ Get real metrics — Capture latency, token counts, and cost estimates using a bundled pricing catalog.
✅ Compare providers objectively — Use built‑in scorers for correctness, schema validation, fuzzy similarity, and LLM‑as‑judge.
✅ Benchmark agent workflows — Measure tool‑calling behavior with local handlers.
✅ TypeScript-native DX — Strong types, Zod schemas, and full IDE support.
✅ CLI-first — From zero to comparison tables in a single command.

Quick Start (60 seconds)

Install it:

npm install agent-duelist

Initialize a config:

npx duelist init

This creates arena.config.ts. Here's a minimal example:

import { defineArena, openai, anthropic } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    anthropic('claude-sonnet-4-6-20260217'),
  ],
  tasks: [
    {
      name: 'simple-qa',
      prompt: 'In one sentence, explain what a monorepo is.',
      expected: 'A monorepo is a single repository that contains code for multiple projects.',
    },
    {
      name: 'structured-extraction',
      prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
      expected: { company: 'Acme', year: 2024 },
      schema: z.object({
        company: z.string(),
        year: z.number(),
      }),
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})

Run the benchmark:

npx duelist run

You get a beautiful table showing:

Rows: Your tasks (simple-qa, structured-extraction)
Columns: Your providers (OpenAI GPT-5.2, Anthropic Claude Sonnet 4.6)
Cells: Correctness score, latency, tokens, estimated cost

For CI or dashboards:

npx duelist run --reporter json > results.json

Why Agent Duelist?

1. Provider-Agnostic

Write your tasks once. Swap models and providers without rewriting anything:

const providers = [
  openai('gpt-5.2'),
  openai('gpt-5.2-chat-latest'),
  azureOpenai('gpt-5.2', { deployment: 'my-deployment' }),
  anthropic('claude-opus-4-6-20260205'),
  anthropic('claude-sonnet-4-6-20260217'),
  gemini('gemini-3.1-pro-preview'),
  gemini('gemini-3-flash-preview'),
  openaiCompatible({
    id: 'local/llama',
    name: 'Local LLaMA',
    baseURL: 'http://localhost:11434/v1',
    model: 'llama-3.3',
  }),
]

2. Agent-Focused

Designed for real-world agent workflows with tool calling:

const weatherTool = {
  name: 'getCurrentWeather',
  description: 'Get the current weather in a given city',
  parameters: z.object({ city: z.string() }),
  handler: async ({ city }) => ({
    city,
    tempC: 20,
  }),
}

export default defineArena({
  providers: [openai('gpt-5.2')],
  tasks: [
    {
      name: 'weather-agent',
      prompt: 'What is the temperature in Amsterdam? Use the tool.',
      expected: { city: 'Amsterdam' },
      tools: [weatherTool],
    },
  ],
  scorers: ['latency', 'cost', 'tool-usage'],
})

The model calls the tool, your handler executes, and you get metrics on tool usage accuracy.

3. Realistic Metrics

Latency: Wall-clock response time in milliseconds

Token counts: Direct from the provider APIs—the source of truth

Cost estimation: Transparent and conservative

Bundled pricing catalog derived from OpenRouter's public data
Maps (provider, model) → { inputPerM, outputPerM } in USD per 1M tokens
Azure models resolve back to base OpenAI pricing automatically
Formula: (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000

Example output:

Tokens: prompt: 142, completion: 38
Cost: ~$0.189m (millicents)

4. Rich Scoring System

Built-in scorers:

Scorer	What It Measures
`latency`	Wall-clock response time
`cost`	Estimated USD cost from tokens + pricing catalog
`correctness`	Exact match against expected (deep-equal)
`schema-correctness`	Validates output against Zod schema
`fuzzy-similarity`	Jaccard token-overlap similarity
`llm-judge-correctness`	LLM-as-judge scoring (accuracy, completeness, conciseness)
`tool-usage`	Whether expected tools were invoked

LLM-as-Judge Example:

defineArena({
  scorers: ['latency', 'cost', 'llm-judge-correctness'],
  judgeModel: 'gpt-5.2-chat-latest', // or 'gemini-3.1-pro-preview'
})

The judge evaluates outputs on three criteria and returns a composite 0–1 score.

5. TypeScript-Native

Strongly typed provider interfaces
Zod schemas for structured outputs
Full IDE autocomplete and type safety
No runtime surprises

Real-World Example

Let's say you're building an extraction pipeline and need to choose between the latest frontier models:

import { defineArena, openai, anthropic, gemini } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    openai('gpt-5.2-chat-latest'),
    anthropic('claude-sonnet-4-6-20260217'),
    gemini('gemini-3-flash-preview'),
  ],
  tasks: [
    {
      name: 'extract-company',
      prompt: 'Extract company and role as JSON from: "I work at Acme Corp as a senior engineer."',
      expected: { company: 'Acme Corp', role: 'senior engineer' },
      schema: z.object({ company: z.string(), role: z.string() }),
    },
    {
      name: 'classify-sentiment',
      prompt: 'Classify sentiment as "positive", "negative", or "neutral": "The product works great!"',
      expected: 'positive',
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness'],
  runs: 3,
})

Run npx duelist run and get:

Task: extract-company
Provider                        Latency    Cost       Tokens    Match    Schema
────────────────────────────────────────────────────────────────────────────────
openai/gpt-5.2                  1905ms    ~$0.312m     140      100%     100%
openai/gpt-5.2-chat-latest       842ms    ~$0.091m     132      100%     100%
anthropic/claude-sonnet-4.6     1493ms    ~$0.189m     126      100%     100%
gemini/gemini-3-flash-preview    610ms    ~$0.041m     119      100%     100%

Summary
◆ Most correct: all providers tied (avg 100%)
◆ Fastest: gemini/gemini-3-flash-preview (avg 610ms)
◆ Cheapest: gemini/gemini-3-flash-preview (avg ~$0.041m)

Now you can make data-driven decisions: Gemini 3 Flash is fastest and cheapest, GPT-5.2 gives you maximum reasoning depth, Claude Sonnet 4.6 leads on computer use tasks.

Use Cases

🔬 Model Selection

Compare models on your actual tasks before committing

💰 Cost Optimization

Identify which model gives you the best quality/cost ratio — in 2026, prices vary wildly between frontier models

⚡ Performance Tuning

Track latency across providers and deployments

🛠️ Agent Development

Benchmark tool-calling accuracy for multi-step workflows

📊 CI/CD Integration

Run benchmarks in CI and fail if metrics regress:

npx duelist run --reporter json | jq '.summary.avgLatency'

📈 Dashboarding

Export JSON results to Grafana, Datadog, or your metrics platform

What's Next?

Already shipped:

✅ OpenAI (GPT-5, GPT-5.2), Azure, Anthropic (Claude 4.6 series), Gemini (3/3.1 series), OpenAI-compatible providers
✅ 7 built-in scorers including LLM-as-judge
✅ Tool-calling support for agent benchmarking
✅ Console & JSON reporters
✅ Pricing catalog with refresh script

Roadmap (shaped by community feedback):

📜 More providers (OpenRouter-native, more gateways)
📜 Markdown/HTML/CSV reporters
📜 GitHub Actions summaries
📜 Multi-step agent workflows
📜 Plugin system for custom scorers
📜 Embedding-based semantic similarity

Get Started Now

npm install agent-duelist
npx duelist init
npx duelist run

📦 Package: npm.com/package/agent-duelist

📖 GitHub: github.com/DataGobes/agent-duelist

🐛 Issues: github.com/DataGobes/agent-duelist/issues

Contributing

Contributions welcome! 🎉

Bug reports / ideas: Open a GitHub issue
Code changes: Fork, branch, test (npm test), build (npm run build), PR

Keep PRs focused (one provider, one scorer) for easier review.

Final Thoughts

Choosing the right LLM provider shouldn't involve spreadsheets, guesswork, and manual copy-paste. Agent Duelist gives you a single command to get objective, reproducible comparisons.

Whether you're optimizing costs, improving latency, or validating correctness—let the models duel it out.

⚔️ May the best model win.

DEV Community