Stanley Yang

How to Benchmark LLMs From Your Terminal in One Command

With 40+ LLMs worth considering in 2026, picking the right model for your project means actually comparing them. In this tutorial, I'll show you how to benchmark LLMs directly from your terminal using yardstiq, an open-source CLI tool.

No web UI, no notebooks, no setup. Just one command.

Prerequisites

  • Node.js 18+
  • At least one API key (OpenAI, Anthropic, Google, etc.) or Vercel AI Gateway key
  • Optional: Ollama for local models

Step 1: Run Your First Comparison

No install needed. npx handles it:

npx yardstiq "Explain the difference between TCP and UDP" \
  -m claude-sonnet -m gpt-4o

If you'd rather install it globally:

npm i -g yardstiq

yardstiq will prompt you for API keys on first run (you can also rerun yardstiq setup at any time), or you can set them as environment variables:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Or use a single key for all models:

export AI_GATEWAY_API_KEY=your_gateway_key

You'll see responses stream side-by-side in real time, followed by a performance table:

Model              Time     TTFT     Tokens     Tok/sec   Cost
Claude Sonnet ⚡   1.24s    432ms    18→86      69.4 t/s  $0.0013
GPT-4o             1.89s    612ms    18→91      48.1 t/s  $0.0010
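If you're curious where those numbers come from, here's a back-of-envelope reproduction of the table in Python. The formulas and the per-million-token prices are my assumptions, not yardstiq's actual implementation, but they line up with the sample rows above:

```python
# Assumed $/1M tokens (input, output) at the time of writing.
PRICES = {
    "Claude Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
}

def summarize(model: str, total_s: float, tokens_in: int, tokens_out: int):
    """Compute throughput and cost the way the table appears to."""
    p_in, p_out = PRICES[model]
    # Throughput measured over the whole request, not just generation time.
    tok_per_sec = round(tokens_out / total_s, 1)
    cost = round((tokens_in * p_in + tokens_out * p_out) / 1_000_000, 4)
    return tok_per_sec, cost

print(summarize("Claude Sonnet", 1.24, 18, 86))  # matches the 69.4 t/s / $0.0013 row
```

TTFT (time to first token) is the latency metric to watch for interactive apps; Tok/sec matters more for long generations.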

Step 2: Compare More Models

Add as many models as you want with -m:

npx yardstiq "Write a Python function to merge two sorted lists" \
  -m claude-sonnet -m gpt-4o -m gemini-flash -m deepseek

This sends the same prompt to all four models simultaneously and streams all responses in parallel.
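The fan-out pattern itself is simple. Here's a minimal asyncio sketch of the idea, where query_model is a stand-in for a real streaming API call (not yardstiq's code):

```python
import asyncio

async def query_model(model: str, prompt: str) -> str:
    # Stand-in for a real streaming API call; we just simulate latency.
    await asyncio.sleep(0.01)
    return f"{model}: response to {prompt!r}"

async def fan_out(prompt: str, models: list[str]) -> list[str]:
    # One prompt, N models, all requests in flight at once.
    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(query_model(m, prompt) for m in models))

results = asyncio.run(fan_out(
    "Write a Python function to merge two sorted lists",
    ["claude-sonnet", "gpt-4o", "gemini-flash", "deepseek"],
))
for r in results:
    print(r)
```

The practical upshot: adding a fifth model costs you roughly nothing in wall-clock time, since the slowest model dominates.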

Step 3: Add an AI Judge

Want an objective evaluation? Add --judge:

npx yardstiq "Implement a thread-safe singleton in Java" \
  -m claude-sonnet -m gpt-4o -m gemini-pro \
  --judge

The judge (defaults to a strong model) evaluates each response and gives scored verdicts with reasoning. You can customize it:

npx yardstiq "Write unit tests for this function" \
  -m claude-sonnet -m gpt-4o \
  --judge --judge-model gpt-4.1 \
  --judge-criteria "Focus on edge case coverage and test readability"

Step 4: Compare Local Models

If you have Ollama running, prefix models with local::

npx yardstiq "Explain CORS in simple terms" \
  -m local:llama3.2 -m local:mistral

Mix local and cloud for cost comparison:

npx yardstiq "Parse this JSON and extract emails" \
  -m local:llama3.2 -m claude-haiku -m gpt-4o-mini

Step 5: Use System Prompts and File Input

Add context with system prompts:

npx yardstiq "Review this code for security issues" \
  -s "You are a senior security engineer" \
  -m claude-sonnet -m gpt-4o

Read prompts from files:

npx yardstiq -f ./my-prompt.txt -m claude-sonnet -m gpt-4o

Pipe from stdin:

cat code.py | npx yardstiq -s "Find bugs in this code" \
  -m claude-sonnet -m gpt-4o

Step 6: Export Results

Save comparisons in different formats:

# JSON (great for scripting and analysis)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --json > results.json

# Markdown (for documentation)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --markdown > comparison.md

# HTML (self-contained report with dark theme)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --html > report.html
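The JSON export is the one you'll script against. The schema below is my guess at what an export might contain, so check a real results.json before relying on these field names:

```python
import json

# Assumed shape of a yardstiq --json export; verify against real output.
raw = """
{
  "prompt": "Explain monads",
  "results": [
    {"model": "claude-sonnet", "time_s": 1.24, "cost_usd": 0.0013},
    {"model": "gpt-4o",        "time_s": 1.89, "cost_usd": 0.0010}
  ]
}
"""

data = json.loads(raw)
cheapest = min(data["results"], key=lambda r: r["cost_usd"])
fastest = min(data["results"], key=lambda r: r["time_s"])
print(f"cheapest: {cheapest['model']}, fastest: {fastest['model']}")
```

Swap the inline string for open("results.json") and you have the start of a model-selection script.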

Step 7: Run Benchmark Suites

For systematic evaluation, create a YAML benchmark file:

# coding-benchmark.yaml
name: coding-eval
prompts:
  - category: algorithms
    text: "Implement an LRU cache in Python with O(1) operations"
  - category: debugging  
    text: "Find and fix the race condition in this Go code: ..."
  - category: refactoring
    text: "Refactor this 200-line function into clean, testable modules"

Run it:

npx yardstiq benchmark run ./coding-benchmark.yaml \
  -m claude-sonnet -m gpt-4o -m deepseek -m codestral

This runs every prompt against every model and gives you aggregate scores.
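If you'd rather aggregate the per-prompt results yourself, the math is just a per-model mean. The score layout here is invented for illustration (yardstiq's own aggregate output may differ):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt judge scores: (model, category, score out of 10).
scores = [
    ("claude-sonnet", "algorithms", 9), ("claude-sonnet", "debugging", 8),
    ("gpt-4o", "algorithms", 8), ("gpt-4o", "debugging", 9),
    ("deepseek", "algorithms", 9), ("deepseek", "debugging", 7),
]

by_model = defaultdict(list)
for model, _category, score in scores:
    by_model[model].append(score)

# Rank models by average score, best first.
for model, s in sorted(by_model.items(), key=lambda kv: -mean(kv[1])):
    print(f"{model}: {mean(s):.1f}")
```

Grouping by category instead of model tells you where each model is weak, which is often more actionable than a single overall number.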

Step 8: Save and Review History

Save comparisons for later:

npx yardstiq "Explain quicksort" -m claude-sonnet -m gpt-4o --save quicksort-compare

# Later...
npx yardstiq history list
npx yardstiq history show quicksort-compare

Useful Patterns

Model selection for a project:

# Test your actual use case across budget tiers
npx yardstiq -f ./real-prompt.txt \
  -m claude-sonnet -m claude-haiku -m gpt-4o -m gpt-4o-mini -m deepseek \
  --judge

Quick cost comparison:

# Same task, different price points
npx yardstiq "Summarize this article: ..." \
  -m claude-haiku -m gpt-4o-mini -m gemini-flash-lite -m local:llama3.2

Tuning temperature:

npx yardstiq "Write a creative product name for a sleep tracking app" \
  -m claude-sonnet -m gpt-4o -t 0.9

Wrapping Up

yardstiq gives you a fast feedback loop for model comparison without leaving your terminal. It's not a replacement for rigorous evaluation frameworks, but it covers 90% of the "which model should I use?" decisions.
