Stanley Yang

How to Benchmark LLMs From Your Terminal in One Command

With 40+ LLMs worth considering in 2026, picking the right model for your project means actually comparing them. In this tutorial, I'll show you how to benchmark LLMs directly from your terminal using yardstiq, an open-source CLI tool.

No web UI, no notebooks, no setup. Just one command.

Prerequisites

  • Node.js 18+
  • At least one API key (OpenAI, Anthropic, Google, etc.) or Vercel AI Gateway key
  • Optional: Ollama for local models

Step 1: Run Your First Comparison

No install needed. npx handles it:

npx yardstiq "Explain the difference between TCP and UDP" \
  -m claude-sonnet -m gpt-4o

If you'd rather install it globally:

npm i -g yardstiq

yardstiq will prompt you for API keys on first run (you can also rerun yardstiq setup at any time), or you can set them as environment variables:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Or use a single key for all models:

export AI_GATEWAY_API_KEY=your_gateway_key

You'll see responses stream side-by-side in real time, followed by a performance table:

Model              Time     TTFT     Tokens     Tok/sec   Cost
Claude Sonnet ⚡   1.24s    432ms    18→86      69.4 t/s  $0.0013
GPT-4o             1.89s    612ms    18→91      48.1 t/s  $0.0010
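If you're curious where those numbers come from, here's a back-of-envelope reproduction of the table in Python. The formulas and the per-million-token prices are my assumptions, not yardstiq's actual implementation, but they line up with the sample rows above:

```python
# Assumed $/1M tokens (input, output) at the time of writing.
PRICES = {
    "Claude Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
}

def summarize(model: str, total_s: float, tokens_in: int, tokens_out: int):
    """Compute throughput and cost the way the table appears to."""
    p_in, p_out = PRICES[model]
    # Throughput measured over the whole request, not just generation time.
    tok_per_sec = round(tokens_out / total_s, 1)
    cost = round((tokens_in * p_in + tokens_out * p_out) / 1_000_000, 4)
    return tok_per_sec, cost

print(summarize("Claude Sonnet", 1.24, 18, 86))  # matches the 69.4 t/s / $0.0013 row
```

TTFT (time to first token) is the latency metric to watch for interactive apps; Tok/sec matters more for long generations.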

Step 2: Compare More Models

Add as many models as you want with -m:

npx yardstiq "Write a Python function to merge two sorted lists" \
  -m claude-sonnet -m gpt-4o -m gemini-flash -m deepseek

This sends the same prompt to all four models simultaneously and streams all responses in parallel.
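The fan-out pattern itself is simple. Here's a minimal asyncio sketch of the idea, where query_model is a stand-in for a real streaming API call (not yardstiq's code):

```python
import asyncio

async def query_model(model: str, prompt: str) -> str:
    # Stand-in for a real streaming API call; we just simulate latency.
    await asyncio.sleep(0.01)
    return f"{model}: response to {prompt!r}"

async def fan_out(prompt: str, models: list[str]) -> list[str]:
    # One prompt, N models, all requests in flight at once.
    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(query_model(m, prompt) for m in models))

results = asyncio.run(fan_out(
    "Write a Python function to merge two sorted lists",
    ["claude-sonnet", "gpt-4o", "gemini-flash", "deepseek"],
))
for r in results:
    print(r)
```

The practical upshot: adding a fifth model costs you roughly nothing in wall-clock time, since the slowest model dominates.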

Step 3: Add an AI Judge

Want an objective evaluation? Add --judge:

npx yardstiq "Implement a thread-safe singleton in Java" \
  -m claude-sonnet -m gpt-4o -m gemini-pro \
  --judge

The judge (defaults to a strong model) evaluates each response and gives scored verdicts with reasoning. You can customize it:

npx yardstiq "Write unit tests for this function" \
  -m claude-sonnet -m gpt-4o \
  --judge --judge-model gpt-4.1 \
  --judge-criteria "Focus on edge case coverage and test readability"

Step 4: Compare Local Models

If you have Ollama running, prefix models with local::

npx yardstiq "Explain CORS in simple terms" \
  -m local:llama3.2 -m local:mistral

Mix local and cloud for cost comparison:

npx yardstiq "Parse this JSON and extract emails" \
  -m local:llama3.2 -m claude-haiku -m gpt-4o-mini

Step 5: Use System Prompts and File Input

Add context with system prompts:

npx yardstiq "Review this code for security issues" \
  -s "You are a senior security engineer" \
  -m claude-sonnet -m gpt-4o

Read prompts from files:

npx yardstiq -f ./my-prompt.txt -m claude-sonnet -m gpt-4o

Pipe from stdin:

cat code.py | npx yardstiq -s "Find bugs in this code" \
  -m claude-sonnet -m gpt-4o

Step 6: Export Results

Save comparisons in different formats:

# JSON (great for scripting and analysis)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --json > results.json

# Markdown (for documentation)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --markdown > comparison.md

# HTML (self-contained report with dark theme)
npx yardstiq "Explain monads" -m claude-sonnet -m gpt-4o --html > report.html
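The JSON export is the one you'll script against. The schema below is my guess at what an export might contain, so check a real results.json before relying on these field names:

```python
import json

# Assumed shape of a yardstiq --json export; verify against real output.
raw = """
{
  "prompt": "Explain monads",
  "results": [
    {"model": "claude-sonnet", "time_s": 1.24, "cost_usd": 0.0013},
    {"model": "gpt-4o",        "time_s": 1.89, "cost_usd": 0.0010}
  ]
}
"""

data = json.loads(raw)
cheapest = min(data["results"], key=lambda r: r["cost_usd"])
fastest = min(data["results"], key=lambda r: r["time_s"])
print(f"cheapest: {cheapest['model']}, fastest: {fastest['model']}")
```

Swap the inline string for open("results.json") and you have the start of a model-selection script.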

Step 7: Run Benchmark Suites

For systematic evaluation, create a YAML benchmark file:

# coding-benchmark.yaml
name: coding-eval
prompts:
  - category: algorithms
    text: "Implement an LRU cache in Python with O(1) operations"
  - category: debugging  
    text: "Find and fix the race condition in this Go code: ..."
  - category: refactoring
    text: "Refactor this 200-line function into clean, testable modules"

Run it:

npx yardstiq benchmark run ./coding-benchmark.yaml \
  -m claude-sonnet -m gpt-4o -m deepseek -m codestral

This runs every prompt against every model and gives you aggregate scores.
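If you'd rather aggregate the per-prompt results yourself, the math is just a per-model mean. The score layout here is invented for illustration (yardstiq's own aggregate output may differ):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt judge scores: (model, category, score out of 10).
scores = [
    ("claude-sonnet", "algorithms", 9), ("claude-sonnet", "debugging", 8),
    ("gpt-4o", "algorithms", 8), ("gpt-4o", "debugging", 9),
    ("deepseek", "algorithms", 9), ("deepseek", "debugging", 7),
]

by_model = defaultdict(list)
for model, _category, score in scores:
    by_model[model].append(score)

# Rank models by average score, best first.
for model, s in sorted(by_model.items(), key=lambda kv: -mean(kv[1])):
    print(f"{model}: {mean(s):.1f}")
```

Grouping by category instead of model tells you where each model is weak, which is often more actionable than a single overall number.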

Step 8: Save and Review History

Save comparisons for later:

npx yardstiq "Explain quicksort" -m claude-sonnet -m gpt-4o --save quicksort-compare

# Later...
npx yardstiq history list
npx yardstiq history show quicksort-compare

Useful Patterns

Model selection for a project:

# Test your actual use case across budget tiers
npx yardstiq -f ./real-prompt.txt \
  -m claude-sonnet -m claude-haiku -m gpt-4o -m gpt-4o-mini -m deepseek \
  --judge

Quick cost comparison:

# Same task, different price points
npx yardstiq "Summarize this article: ..." \
  -m claude-haiku -m gpt-4o-mini -m gemini-flash-lite -m local:llama3.2

Tuning temperature:

npx yardstiq "Write a creative product name for a sleep tracking app" \
  -m claude-sonnet -m gpt-4o -t 0.9

Wrapping Up

yardstiq gives you a fast feedback loop for model comparison without leaving your terminal. It's not a replacement for rigorous evaluation frameworks, but it covers 90% of the "which model should I use?" decisions.
