How to Benchmark LLMs From Your Terminal in One Command
With 40+ LLMs worth considering in 2026, picking the right model for your project means actually comparing them. In this tutorial, I'll show you how to benchmark LLMs directly from your terminal using yardstiq, an open-source CLI tool.
No web UI, no notebooks, no setup. Just one command.
Prerequisites
- Node.js 18+
- At least one API key (OpenAI, Anthropic, Google, etc.) or Vercel AI Gateway key
- Optional: Ollama for local models
Step 1: Run Your First Comparison
No install needed. npx handles it:
npx yardstiq "Explain the difference between TCP and UDP" \
-m claude-sonnet -m gpt-4o
If you'd rather install it globally:
npm i -g yardstiq
On first run, yardstiq prompts you for API keys (you can also run yardstiq setup at any time), or you can set them as environment variables:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
Or use a single key for all models:
export AI_GATEWAY_API_KEY=your_gateway_key
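Before your first comparison, it can help to check which keys are actually exported in the current shell. This is plain bash, nothing yardstiq-specific — and you only need one of these keys, so the warnings are informational:

```shell
# Warn about any provider keys that aren't exported yet.
# ${!var} is bash indirect expansion: it reads the variable named by $var.
for var in ANTHROPIC_API_KEY OPENAI_API_KEY AI_GATEWAY_API_KEY; do
  if [ -z "${!var:-}" ]; then
    echo "warning: $var is not set"
  fi
done
```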
You'll see responses stream side-by-side in real time, followed by a performance table:
Model            Time     TTFT     Tokens    Tok/sec     Cost
Claude Sonnet ⚡  1.24s    432ms    18→86     69.4 t/s    $0.0013
GPT-4o           1.89s    612ms    18→91     48.1 t/s    $0.0010
Step 2: Compare More Models
Add as many models as you want with -m:
npx yardstiq "Write a Python function to merge two sorted lists" \
-m claude-sonnet -m gpt-4o -m gemini-flash -m deepseek
This sends the same prompt to all four models simultaneously and streams all responses in parallel.
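Conceptually, the fan-out works like backgrounded shell jobs: every request starts at the same moment, and the overall run finishes when the slowest model does. A rough shell analogy (the echo stands in for one streaming API call):

```shell
for model in claude-sonnet gpt-4o gemini-flash deepseek; do
  # Each "request" runs as a background job, so all four start immediately.
  echo "querying $model" &
done
wait   # block until every background job has finished
echo "all responses complete"
```

This is why adding more models barely increases total wall-clock time: the requests overlap instead of queuing.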
Step 3: Add an AI Judge
Want an objective evaluation? Add --judge:
npx yardstiq "Implement a thread-safe singleton in Java" \
-m claude-sonnet -m gpt-4o -m gemini-pro \
--judge
The judge (defaults to a strong model) evaluates each response and gives scored verdicts with reasoning. You can customize it:
npx yardstiq "Write unit tests for this function" \
-m claude-sonnet -m gpt-4o \
--judge --judge-model gpt-4.1 \
--judge-criteria "Focus on edge case coverage and test readability"
Step 4: Compare Local Models
If you have Ollama running, prefix models with local::
npx yardstiq "Explain CORS in simple terms" \
-m local:llama3.2 -m local:mistral
Mix local and cloud for cost comparison:
npx yardstiq "Parse this JSON and extract emails" \
-m local:llama3.2 -m claude-haiku -m gpt-4o-mini
Step 5: Use System Prompts and File Input
Add context with system prompts:
npx yardstiq "Review this code for security issues" \
-s "You are a senior security engineer" \
-m claude-sonnet -m gpt-4o
Read prompts from files:
npx yardstiq -f ./my-prompt.txt -m claude-sonnet -m gpt-4o
Pipe from stdin:
cat code.py | npx yardstiq -s "Find bugs in this code" \
-m claude-sonnet -m gpt-4o
Step 6: Export Results
Save comparisons in different formats:
# JSON (great for scripting and analysis)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --json > results.json
# Markdown (for documentation)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --markdown > comparison.md
# HTML (self-contained report with dark theme)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --html > report.html
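The JSON output is the most scriptable of the three. The exact schema isn't documented here, so treat the field names below (model, cost) as assumptions and inspect your own results.json first — the jq pattern is the point, e.g. sorting runs by cost:

```shell
# Assumed schema for illustration; adjust .model / .cost to match
# what your actual results.json contains.
cat > results.json <<'EOF'
[
  {"model": "claude-sonnet", "cost": 0.0013},
  {"model": "gpt-4o", "cost": 0.0010}
]
EOF

# Sort by cost ascending and print "model  $cost" per line.
jq -r 'sort_by(.cost)[] | "\(.model)  $\(.cost)"' results.json
```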
Step 7: Run Benchmark Suites
For systematic evaluation, create a YAML benchmark file:
# coding-benchmark.yaml
name: coding-eval
prompts:
  - category: algorithms
    text: "Implement an LRU cache in Python with O(1) operations"
  - category: debugging
    text: "Find and fix the race condition in this Go code: ..."
  - category: refactoring
    text: "Refactor this 200-line function into clean, testable modules"
Run it:
npx yardstiq benchmark run ./coding-benchmark.yaml \
-m claude-sonnet -m gpt-4o -m deepseek -m codestral
This runs every prompt against every model and gives you aggregate scores.
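Suites pay off most when you re-run them over time — after a model update, say — and compare results. One simple pattern is keeping dated snapshots; the echo below is a placeholder for your actual yardstiq invocation:

```shell
stamp=$(date +%Y-%m-%d)
mkdir -p benchmark-results
# Replace this echo with your real run, e.g.:
#   npx yardstiq benchmark run ./coding-benchmark.yaml -m claude-sonnet -m gpt-4o
echo "benchmark run for $stamp" > "benchmark-results/$stamp.txt"
ls benchmark-results/
```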
Step 8: Save and Review History
Save comparisons for later:
npx yardstiq "Explain quicksort" -m claude-sonnet -m gpt-4o --save quicksort-compare
# Later...
npx yardstiq history list
npx yardstiq history show quicksort-compare
Useful Patterns
Model selection for a project:
# Test your actual use case across budget tiers
npx yardstiq -f ./real-prompt.txt \
-m claude-sonnet -m claude-haiku -m gpt-4o -m gpt-4o-mini -m deepseek \
--judge
Quick cost comparison:
# Same task, different price points
npx yardstiq "Summarize this article: ..." \
-m claude-haiku -m gpt-4o-mini -m gemini-flash-lite -m local:llama3.2
Tuning temperature:
npx yardstiq "Write a creative product name for a sleep tracking app" \
-m claude-sonnet -m gpt-4o -t 0.9
Wrapping Up
yardstiq gives you a fast feedback loop for model comparison without leaving your terminal. It's not a replacement for rigorous evaluation frameworks, but it covers 90% of the "which model should I use?" decisions.
- Repo: github.com/stanleycyang/yardstiq
- npm: npmjs.com/package/yardstiq
- License: MIT