
Gijs Jansen

Introducing Agent Duelist: Benchmark LLM Providers Like a Pro

TL;DR: Agent Duelist is a TypeScript-first framework that pits multiple LLM providers against each other on the same tasks. Get structured, reproducible results for correctness, latency, tokens, and cost—all from one unified interface.

[Figure: Agent Duelist out-of-the-box output]


The Problem

You're building with LLMs and you need to answer questions like:

  • Should I use GPT-5.2 or Claude Opus 4.6 for this task?
  • Is Azure OpenAI faster than standard OpenAI for my use case?
  • How much will switching models actually cost me?
  • Which provider handles tool calls best?

Right now, answering these questions means:

  1. Writing separate integration code for each provider
  2. Manually tracking metrics across runs
  3. Copying results into spreadsheets
  4. Making educated guesses about cost

There has to be a better way.

Enter Agent Duelist

Agent Duelist is a benchmarking framework that lets you:

Define tasks once, run them everywhere — OpenAI, Azure, Anthropic, Google Gemini, and any OpenAI-compatible gateway

Get real metrics — Latency, token counts, and cost estimates from a bundled pricing catalog

Compare providers objectively — Built-in scorers for correctness, schema validation, fuzzy similarity, and even LLM-as-judge

Benchmark agent workflows — Full tool-calling support with local handlers

TypeScript-native DX — Strongly typed APIs, Zod schemas, and excellent IDE support

CLI-first — From zero to useful comparison tables in minutes


Quick Start (60 seconds)

Install it:

npm install agent-duelist

Initialize a config:

npx duelist init

This creates arena.config.ts. Here's a minimal example:

import { defineArena, openai, anthropic } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    anthropic('claude-sonnet-4-6-20260217'),
  ],
  tasks: [
    {
      name: 'simple-qa',
      prompt: 'In one sentence, explain what a monorepo is.',
      expected: 'A monorepo is a single repository that contains code for multiple projects.',
    },
    {
      name: 'structured-extraction',
      prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
      expected: { company: 'Acme', year: 2024 },
      schema: z.object({
        company: z.string(),
        year: z.number(),
      }),
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})

Run the benchmark:

npx duelist run

You get a comparison table for each task:

  • Rows: Your providers (OpenAI GPT-5.2, Anthropic Claude Sonnet 4.6)
  • Columns: Latency, estimated cost, tokens, and one column per score (correctness, schema validity)
  • Summary: The most correct, fastest, and cheapest provider

For CI or dashboards:

npx duelist run --reporter json > results.json

Why Agent Duelist?

1. Provider-Agnostic

Write your tasks once. Swap models and providers without rewriting anything:

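// Assumes every provider helper is exported from the package root, like openai and anthropic
import { openai, azureOpenai, anthropic, gemini, openaiCompatible } from 'agent-duelist'

const providers = [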
  openai('gpt-5.2'),
  openai('gpt-5.2-chat-latest'),
  azureOpenai('gpt-5.2', { deployment: 'my-deployment' }),
  anthropic('claude-opus-4-6-20260205'),
  anthropic('claude-sonnet-4-6-20260217'),
  gemini('gemini-3.1-pro-preview'),
  gemini('gemini-3-flash-preview'),
  openaiCompatible({
    id: 'local/llama',
    name: 'Local LLaMA',
    baseURL: 'http://localhost:11434/v1',
    model: 'llama-3.3',
  }),
]

2. Agent-Focused

Designed for real-world agent workflows with tool calling:

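import { defineArena, openai } from 'agent-duelist'
import { z } from 'zod'

const weatherTool = {
  name: 'getCurrentWeather',
  description: 'Get the current weather in a given city',
  parameters: z.object({ city: z.string() }),
  // Local handler: returns a fixed stub reading so runs are repeatable
  handler: async ({ city }: { city: string }) => ({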
    city,
    tempC: 20,
  }),
}

export default defineArena({
  providers: [openai('gpt-5.2')],
  tasks: [
    {
      name: 'weather-agent',
      prompt: 'What is the temperature in Amsterdam? Use the tool.',
      expected: { city: 'Amsterdam' },
      tools: [weatherTool],
    },
  ],
  scorers: ['latency', 'cost', 'tool-usage'],
})

The model calls the tool, your handler executes, and you get metrics on tool usage accuracy.
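As a rough illustration of what a tool-usage metric can look like, the sketch below scores the fraction of expected tools that were actually called. The `ToolCall` shape and the `toolUsageScore` function are hypothetical and written for this example; the built-in `tool-usage` scorer may compute its score differently.

```typescript
// Hypothetical shape for a recorded tool call; the framework's real types may differ.
interface ToolCall {
  name: string
  args: Record<string, unknown>
}

// Fraction of expected tool names that appear among the recorded calls (0 to 1).
function toolUsageScore(expectedTools: string[], calls: ToolCall[]): number {
  if (expectedTools.length === 0) return 1 // nothing expected, nothing to miss
  const called = new Set(calls.map((c) => c.name))
  const hits = expectedTools.filter((toolName) => called.has(toolName)).length
  return hits / expectedTools.length
}

const calls: ToolCall[] = [{ name: 'getCurrentWeather', args: { city: 'Amsterdam' } }]
console.log(toolUsageScore(['getCurrentWeather'], calls)) // 1
console.log(toolUsageScore(['getCurrentWeather', 'getForecast'], calls)) // 0.5
```

A fractional score like this distinguishes "called one of two expected tools" from "called none", which a plain pass/fail flag would hide.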

3. Realistic Metrics

Latency: Wall-clock response time in milliseconds

Token counts: Direct from the provider APIs—the source of truth

Cost estimation: Transparent and conservative

  • Bundled pricing catalog derived from OpenRouter's public data
  • Maps (provider, model) → { inputPerM, outputPerM } in USD per 1M tokens
  • Azure models resolve back to base OpenAI pricing automatically
  • Formula: (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
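To make the formula concrete, here is a minimal standalone sketch. The `Pricing` interface mirrors the catalog fields named above, but `estimateCostUsd` and the prices in `examplePricing` are assumptions made up for illustration, not values from the bundled catalog.

```typescript
// Price entry in USD per 1M tokens, mirroring the catalog fields named above.
interface Pricing {
  inputPerM: number
  outputPerM: number
}

// Direct transcription of the formula from the list above.
function estimateCostUsd(promptTokens: number, completionTokens: number, p: Pricing): number {
  return (promptTokens * p.inputPerM + completionTokens * p.outputPerM) / 1_000_000
}

// Hypothetical prices, chosen only to make the arithmetic easy to follow.
const examplePricing: Pricing = { inputPerM: 2, outputPerM: 10 }

// 142 * 2 = 284, 38 * 10 = 380, (284 + 380) / 1_000_000 = 0.000664
const cost = estimateCostUsd(142, 38, examplePricing)
console.log(cost.toFixed(6)) // "0.000664"
```

Because output tokens usually cost more than input tokens, a short prompt with a long completion can cost more than the reverse, which is why the two counts are priced separately.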

Example output:

Tokens: prompt: 142, completion: 38
Cost: ~$0.189m (m = thousandths of a dollar, so about $0.000189)

4. Rich Scoring System

Built-in scorers:

Scorer                  What It Measures
──────────────────────────────────────────────────────────────────────────────────
latency                 Wall-clock response time
cost                    Estimated USD cost from tokens + pricing catalog
correctness             Exact match against expected (deep-equal)
schema-correctness      Validates output against Zod schema
fuzzy-similarity        Jaccard token-overlap similarity
llm-judge-correctness   LLM-as-judge scoring (accuracy, completeness, conciseness)
tool-usage              Whether expected tools were invoked
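For readers unfamiliar with Jaccard similarity, the sketch below shows the general technique behind a token-overlap score. The tokenization rule (lowercase, split on non-alphanumeric characters) and the function names are assumptions for this example; the built-in `fuzzy-similarity` scorer may tokenize differently.

```typescript
// Assumed tokenization: lowercase, then split on anything that is not an ASCII letter or digit.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0))
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, from 0 (no shared tokens) to 1 (same token set).
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a)
  const tokensB = tokenize(b)
  if (tokensA.size === 0 && tokensB.size === 0) return 1 // two empty texts count as identical
  let shared = 0
  tokensA.forEach((t) => {
    if (tokensB.has(t)) shared += 1
  })
  const unionSize = tokensA.size + tokensB.size - shared
  return shared / unionSize
}

// Tokens {a, single, repository} vs {a, single, repo}: 2 shared, 4 in the union.
console.log(jaccardSimilarity('a single repository', 'A single repo')) // 0.5
```

Token overlap ignores word order and synonyms, so it works best as a coarse signal next to exact-match correctness rather than as a replacement for it.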

LLM-as-Judge Example:

defineArena({
  // providers and tasks omitted for brevity
  scorers: ['latency', 'cost', 'llm-judge-correctness'],
  judgeModel: 'gpt-5.2-chat-latest', // or 'gemini-3.1-pro-preview'
})

The judge evaluates outputs on three criteria and returns a composite 0–1 score.
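The post does not say how the three criteria are combined. One plausible approach is an unweighted mean, sketched below. The `JudgeVerdict` shape, the 0–1 range per criterion, and the equal weighting are all assumptions for illustration, not the framework's documented behavior.

```typescript
// Hypothetical judge verdict: each criterion scored from 0 to 1.
interface JudgeVerdict {
  accuracy: number
  completeness: number
  conciseness: number
}

// Clamp to [0, 1] so a malformed judge reply cannot push the composite out of range.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x))
}

// Assumption: unweighted mean of the three criteria. Real weights may differ.
function compositeScore(v: JudgeVerdict): number {
  return (clamp01(v.accuracy) + clamp01(v.completeness) + clamp01(v.conciseness)) / 3
}

// (1 + 0.5 + 0) / 3 = 0.5
console.log(compositeScore({ accuracy: 1, completeness: 0.5, conciseness: 0 })) // 0.5
```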

5. TypeScript-Native

  • Strongly typed provider interfaces
  • Zod schemas for structured outputs
  • Full IDE autocomplete and type safety
  • No runtime surprises

Real-World Example

Let's say you're building an extraction pipeline and need to choose between the latest frontier models:

import { defineArena, openai, anthropic, gemini } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    openai('gpt-5.2-chat-latest'),
    anthropic('claude-sonnet-4-6-20260217'),
    gemini('gemini-3-flash-preview'),
  ],
  tasks: [
    {
      name: 'extract-company',
      prompt: 'Extract company and role as JSON from: "I work at Acme Corp as a senior engineer."',
      expected: { company: 'Acme Corp', role: 'senior engineer' },
      schema: z.object({ company: z.string(), role: z.string() }),
    },
    {
      name: 'classify-sentiment',
      prompt: 'Classify sentiment as "positive", "negative", or "neutral": "The product works great!"',
      expected: 'positive',
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness'],
  runs: 3,
})

Run npx duelist run and get:

Task: extract-company
Provider                        Latency    Cost       Tokens    Match    Schema
────────────────────────────────────────────────────────────────────────────────
openai/gpt-5.2                  1905ms    ~$0.312m     140      100%     100%
openai/gpt-5.2-chat-latest       842ms    ~$0.091m     132      100%     100%
anthropic/claude-sonnet-4.6     1493ms    ~$0.189m     126      100%     100%
gemini/gemini-3-flash-preview    610ms    ~$0.041m     119      100%     100%

Summary
◆ Most correct: all providers tied (avg 100%)
◆ Fastest: gemini/gemini-3-flash-preview (avg 610ms)
◆ Cheapest: gemini/gemini-3-flash-preview (avg ~$0.041m)

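Now you can make data-driven decisions. On this task all four models scored 100%, so Gemini 3 Flash, the fastest and cheapest, is the obvious pick. Strengths this benchmark does not measure, such as GPT-5.2's reasoning depth or Claude Sonnet 4.6's performance on computer-use tasks, need tasks of their own before they can inform the choice.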


Use Cases

🔬 Model Selection

Compare models on your actual tasks before committing

💰 Cost Optimization

Identify which model gives you the best quality/cost ratio — in 2026, prices vary wildly between frontier models

⚡ Performance Tuning

Track latency across providers and deployments

🛠️ Agent Development

Benchmark tool-calling accuracy for multi-step workflows

📊 CI/CD Integration

Run benchmarks in CI, extract a metric from the JSON report, and fail the build if it regresses past a threshold you set:

npx duelist run --reporter json | jq '.summary.avgLatency'
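The jq command only prints the value, so something still has to compare it against a budget. A minimal sketch of that check follows. It assumes the report exposes `summary.avgLatency` in milliseconds, as the jq path suggests; `BenchmarkReport`, `GateResult`, and `latencyGate` are hypothetical names, and the rest of the report shape is unknown.

```typescript
// Assumed report shape: only `summary.avgLatency` appears in the jq command above.
interface BenchmarkReport {
  summary: { avgLatency: number }
}

interface GateResult {
  ok: boolean
  message: string
}

// Compare the reported average latency (assumed to be in ms) against a budget.
function latencyGate(report: BenchmarkReport, maxAvgLatencyMs: number): GateResult {
  const actual = report.summary.avgLatency
  const ok = actual <= maxAvgLatencyMs
  const message = ok
    ? `avg latency ${actual}ms is within the ${maxAvgLatencyMs}ms budget`
    : `avg latency ${actual}ms exceeds the ${maxAvgLatencyMs}ms budget`
  return { ok, message }
}

// In CI you would parse results.json into `report` and exit non-zero when `ok` is false.
const sample: BenchmarkReport = { summary: { avgLatency: 1212 } }
const gate = latencyGate(sample, 1500)
console.log(gate.message)
```

Latency varies between runs, so a budget with some headroom (or a comparison against a rolling baseline) tends to produce fewer false alarms than a tight fixed number.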

📈 Dashboarding

Export JSON results to Grafana, Datadog, or your metrics platform


What's Next?

Already shipped:

  • ✅ OpenAI (GPT-5, GPT-5.2), Azure, Anthropic (Claude 4.6 series), Gemini (3/3.1 series), OpenAI-compatible providers
  • ✅ 7 built-in scorers including LLM-as-judge
  • ✅ Tool-calling support for agent benchmarking
  • ✅ Console & JSON reporters
  • ✅ Pricing catalog with refresh script

Roadmap (shaped by community feedback):

  • 📜 More providers (OpenRouter-native, more gateways)
  • 📜 Markdown/HTML/CSV reporters
  • 📜 GitHub Actions summaries
  • 📜 Multi-step agent workflows
  • 📜 Plugin system for custom scorers
  • 📜 Embedding-based semantic similarity

Get Started Now

npm install agent-duelist
npx duelist init
npx duelist run

📦 Package: npmjs.com/package/agent-duelist

📖 GitHub: github.com/DataGobes/agent-duelist

🐛 Issues: github.com/DataGobes/agent-duelist/issues


Contributing

Contributions welcome! 🎉

  • Bug reports / ideas: Open a GitHub issue
  • Code changes: Fork, branch, test (npm test), build (npm run build), PR

Keep PRs focused (one provider, one scorer) for easier review.


Final Thoughts

Choosing the right LLM provider shouldn't involve spreadsheets, guesswork, and manual copy-paste. Agent Duelist gives you a single command to get objective, reproducible comparisons.

Whether you're optimizing costs, improving latency, or validating correctness—let the models duel it out.

⚔️ May the best model win.


Tags

#typescript #llm #ai #benchmarking #openai #anthropic #gemini #agents #devtools #opensource


What LLM provider battle are you running first? Drop a comment!
