Every AI developer has hit this problem: you're using GPT-4o for something, you wonder whether Claude would do better, you know Groq is faster but not by how much, and you have no easy way to compare them.
So I built one. It's called llm-bench: a Rust CLI that runs your prompt across every provider in parallel and shows you latency, cost, and output side by side.
GitHub: LakshmiSravyaVedantham/llm-bench
cargo install llm-bench
llm-bench run \
  --prompt "Explain the CAP theorem in one paragraph" \
  --models gpt-4o,claude-3-5-sonnet-20241022,llama3-70b-8192
Output:
Running 3 provider(s) in parallel...
╒════════════════════════════╤═════════╤══════════╤═══════════════╤══════════════════════════════╕
│ Model                      │ Latency │ Cost     │ In/Out Tokens │ Output (first 60 chars)      │
╞════════════════════════════╪═════════╪══════════╪═══════════════╪══════════════════════════════╡
│ gpt-4o                     │ 1240ms  │ $0.00412 │ 12/87         │ The CAP theorem states th... │
├────────────────────────────┼─────────┼──────────┼───────────────┼──────────────────────────────┤
│ claude-3-5-sonnet-20241022 │  890ms  │ $0.00281 │ 12/83         │ In distributed systems, t... │
├────────────────────────────┼─────────┼──────────┼───────────────┼──────────────────────────────┤
│ llama3-70b-8192            │  121ms  │ $0.00008 │ 12/91         │ The CAP theorem, proposed... │
╘════════════════════════════╧═════════╧══════════╧═══════════════╧══════════════════════════════╛
Groq is 10x faster than GPT-4o on this prompt. And 50x cheaper.
Why does this not exist already?
There are Python scripts that do versions of this, but they're slow to start, they aren't single binaries you can drop into a CI pipeline, and they don't run providers in parallel.
llm-bench is a single binary with ~5ms startup time. It runs every provider concurrently using Tokio async tasks. By the time the slowest provider responds, all the others are already done.
How it works
The architecture is ~400 lines across 7 files:
src/
├── providers/
│   ├── mod.rs        # Provider trait + shared types
│   ├── openai.rs     # OpenAI (gpt-4o, o1, o3...)
│   ├── anthropic.rs  # Anthropic (claude-3-5-sonnet...)
│   └── groq.rs       # Groq (llama3, mixtral...)
├── runner.rs         # Parallel async runner
├── report.rs         # Terminal table
└── main.rs           # CLI
The Provider trait
Every provider implements one trait:
#[async_trait]
pub trait Provider: Send + Sync {
    fn name(&self) -> &str;
    async fn complete(&self, req: &CompletionRequest) -> Result<CompletionResult>;
}
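The post never shows `CompletionRequest` itself. A minimal sketch of what such a request type might contain — the field names here are assumptions, not the repo's actual definition:

```rust
// Hypothetical sketch of the request type the Provider trait consumes.
// Field names and types are assumptions; the real struct may differ.
#[derive(Debug, Clone)]
pub struct CompletionRequest {
    pub prompt: String,
    pub max_tokens: u32,
    pub temperature: f64,
}

fn main() {
    let req = CompletionRequest {
        prompt: "Explain the CAP theorem in one paragraph".to_string(),
        max_tokens: 256,
        temperature: 0.0,
    };
    println!("{} chars, up to {} output tokens", req.prompt.len(), req.max_tokens);
}
```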
CompletionResult carries everything you need to compare:
pub struct CompletionResult {
    pub model: String,
    pub output: String,
    pub latency_ms: u128,
    pub input_tokens: u32,
    pub output_tokens: u32,
    pub cost_usd: f64,
}
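The `cost_usd` field is presumably derived from the token counts and a per-model price table. A sketch of that arithmetic, using hypothetical per-million-token prices (real provider pricing varies and changes over time):

```rust
// Hypothetical cost calculation: prices are quoted per one million tokens.
// The figures below are illustrative, not current provider pricing.
fn cost_usd(input_tokens: u32, output_tokens: u32, in_per_m: f64, out_per_m: f64) -> f64 {
    (input_tokens as f64 * in_per_m + output_tokens as f64 * out_per_m) / 1_000_000.0
}

fn main() {
    // 12 input + 87 output tokens at an assumed $2.50/M input, $10.00/M output
    let c = cost_usd(12, 87, 2.50, 10.00);
    println!("${c:.5}"); // prints $0.00090
}
```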
Parallel execution
The runner spawns one Tokio task per provider and collects all results:
pub async fn run_providers(
    providers: Vec<Box<dyn Provider>>,
    req: CompletionRequest,
) -> Vec<(String, Result<CompletionResult>)> {
    let req = Arc::new(req);
    let handles: Vec<_> = providers.into_iter().map(|provider| {
        let req = Arc::clone(&req);
        tokio::spawn(async move {
            let name = provider.name().to_string();
            let result = provider.complete(&req).await;
            (name, result)
        })
    }).collect();
    // Collect all results — they ran concurrently.
    // .ok() drops only tasks that panicked (JoinError); a provider's own
    // failure survives inside the inner Result.
    futures::future::join_all(handles).await
        .into_iter()
        .filter_map(|r| r.ok())
        .collect()
}
The key insight: Arc::clone hands each spawned task a reference-counted pointer to the same request, so nothing is deep-copied. The tasks are independent, so they all run concurrently.
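The same sharing pattern can be shown with std threads, so it runs without Tokio; the async version is identical in shape. Each `Arc::clone` bumps a reference count rather than copying the underlying data:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // One request string, shared by reference count across independent workers.
    let req = Arc::new(String::from("Explain the CAP theorem"));

    let handles: Vec<_> = (0..3)
        .map(|i| {
            let req = Arc::clone(&req); // cheap pointer clone, not a string copy
            thread::spawn(move || format!("worker {i} saw {} bytes", req.len()))
        })
        .collect();

    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```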
Setup
# Install
cargo install llm-bench
# Set API keys (or use a .env file)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GROQ_API_KEY=gsk_...
# Benchmark
llm-bench run --prompt "Your prompt here" --models gpt-4o,claude-3-5-sonnet-20241022,llama3-70b-8192
llm-bench automatically detects which keys are set and only runs those providers. If only GROQ_API_KEY is set, it only runs Groq.
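A sketch of how that key detection might look — this mirrors the env var names from the setup section, but it is an assumption about the repo's internals, not its actual `build_providers()`:

```rust
use std::env;

// Map each provider to the env var that enables it; return only those set.
// A sketch of the detection step, not the repo's actual implementation.
fn enabled_providers() -> Vec<&'static str> {
    [
        ("openai", "OPENAI_API_KEY"),
        ("anthropic", "ANTHROPIC_API_KEY"),
        ("groq", "GROQ_API_KEY"),
    ]
    .into_iter()
    .filter(|(_, key)| env::var(key).is_ok())
    .map(|(name, _)| name)
    .collect()
}

fn main() {
    println!("active providers: {:?}", enabled_providers());
}
```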
Adding a new provider
Create one file implementing the Provider trait:
pub struct GeminiProvider { api_key: String, model: String, client: Client }

#[async_trait]
impl Provider for GeminiProvider {
    fn name(&self) -> &str { &self.model }

    async fn complete(&self, req: &CompletionRequest) -> Result<CompletionResult> {
        // ... HTTP call to Gemini API
    }
}
Then add one match arm in build_providers(). That's it.
What I learned building this
The most surprising finding: Groq's latency advantage shrinks for long outputs. For short outputs (<100 tokens), Groq is 8-12x faster than OpenAI. For long outputs (>500 tokens), it's only 2-3x faster, because the bottleneck shifts from time-to-first-token to sustained token-generation throughput.
You can only discover things like this by running the actual comparison.
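A simple latency model makes the narrowing gap concrete: total time ≈ time-to-first-token + output tokens / generation throughput. The numbers below are illustrative assumptions, not measurements from llm-bench:

```rust
// Hypothetical latency model: fixed TTFT plus per-token generation time.
// All figures below are illustrative assumptions, not benchmark results.
fn latency_ms(ttft_ms: f64, output_tokens: f64, tokens_per_sec: f64) -> f64 {
    ttft_ms + output_tokens / tokens_per_sec * 1000.0
}

fn main() {
    for tokens in [100.0, 500.0] {
        let fast = latency_ms(60.0, tokens, 750.0);  // assumed: low TTFT, high throughput
        let slow = latency_ms(800.0, tokens, 250.0); // assumed: high TTFT, 3x less throughput
        println!("{tokens} output tokens: {:.1}x faster", slow / fast);
    }
}
```

As output length grows, the fixed TTFT term washes out and the speedup converges toward the throughput ratio (3x in this sketch), matching the pattern described above.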
The code is at github.com/LakshmiSravyaVedantham/llm-bench.