This article was originally published on runaihome.com
TL;DR: Picking a local LLM by parameter count is the wrong signal — a well-quantized 14B can outperform a crushed 27B, and a model that barely fits your VRAM will stall at under 10 tok/s. These five tools automate the math: whichllm ranks what to run in one command, LocalScore measures how fast your hardware actually is, and llama-bench gives you the raw throughput numbers to validate both.
| | whichllm | LocalScore | llama-bench |
|---|---|---|---|
| Best for | "What model should I run?" | "How fast is my actual chip?" | Raw tok/s baseline for any config |
| Input required | Your GPU (auto-detected) | GPU + GGUF model file | Any GGUF model file + a compiled llama.cpp |
| Output | Ranked model list with quant | PP speed, TG speed, TTFT | tok/s table across batch/thread configs |
| The catch | Scores rely on merged leaderboards, not local runs | Single-GPU setups only | No quality signal — speed only |
Honest take: Run whichllm first, get a ranked list in under 10 seconds, then validate the top pick's tok/s with llama-bench on your machine before committing to a multi-GB download.
The "fit the biggest model your VRAM holds" heuristic has two failure modes. First, a 14B Q3 can outperform a 7B Q8 on general reasoning and lose badly on code — parameter count is not a quality proxy once quantization enters the picture. Second, a model that barely squeezes into 8GB at Q4 will offload key-value cache to system RAM when context grows past a few thousand tokens, dropping you from 40 tok/s to under 10 tok/s mid-conversation.
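To make the second failure mode concrete, here is a rough sketch of how the key-value cache grows with context length. The layer count, KV head count, and head dimension are illustrative values for a hypothetical 8B-class model with grouped-query attention, not figures for any specific release, and the formula assumes an unquantized 16-bit cache.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a given context length."""
    # Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3


# Hypothetical model dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (2048, 8192, 32768):
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>6} tokens -> {size:.2f} GiB")
```

With these assumed dimensions the cache goes from 0.25 GiB at 2,048 tokens to 4 GiB at 32,768 tokens. That is the kind of growth that pushes a model that "just fits" out of VRAM once a conversation gets long.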
What you actually need is a three-part filter: quality score from real-world evals, verified VRAM fit at your preferred quantization, and measured tokens per second on your specific chip. The five tools below cover that stack.
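As a minimal sketch, the three-part filter could look like the snippet below. The `Candidate` fields, the model names, the numbers, and the 15 tok/s floor are all hypothetical placeholders; in practice the quality score would come from a tool like whichllm and the throughput from your own benchmark run.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float    # 0-100 quality score from real-world evals
    vram_gb: float    # estimated footprint at the chosen quantization
    tok_per_s: float  # generation speed measured on your own chip


def shortlist(candidates: list[Candidate], vram_budget_gb: float,
              min_tok_per_s: float = 15.0) -> list[Candidate]:
    """Keep models that fit and are fast enough, best quality first."""
    usable = [c for c in candidates
              if c.vram_gb <= vram_budget_gb and c.tok_per_s >= min_tok_per_s]
    return sorted(usable, key=lambda c: c.quality, reverse=True)


# Made-up candidates, for illustration only.
models = [
    Candidate("model-a-27b-q5", quality=90.0, vram_gb=19.5, tok_per_s=27.0),
    Candidate("model-b-14b-q4", quality=78.0, vram_gb=9.0, tok_per_s=45.0),
    Candidate("model-c-32b-q4", quality=83.0, vram_gb=20.0, tok_per_s=8.0),
]

for c in shortlist(models, vram_budget_gb=22.0):
    print(c.name, c.quality)
```

Note the ordering: fit and speed act as hard gates, and quality only ranks what survives. A high-scoring model that runs at 8 tok/s never makes the list.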
1. whichllm — one command, ranked results
whichllm is a Python CLI that auto-detects your GPU, CPU, and RAM, then ranks HuggingFace models by a merged benchmark score rather than parameter count. It reached v0.5.7 on May 19, 2026, and has picked up 2,000 GitHub stars since its March 2026 launch.
Install (pick any):
uvx whichllm@latest # one-off run, no persistent install
uv tool install whichllm # persistent
pip install whichllm
brew install andyyyy64/whichllm/whichllm
Hardware auto-detected: NVIDIA GPUs via nvidia-ml-py, AMD GPUs via ROCm/dbgpu, Apple Silicon via Metal, plus CPU core count, system RAM, and available disk space.
How the 0–100 score is built:
- LiveBench, Artificial Analysis Index, and Aider scores (live-merged, highest weight)
- Chatbot Arena ELO and Open LLM Leaderboard v2 (frozen, lower recency weight)
- A log₂-scaled model-size bonus as a knowledge proxy
- A quantization penalty — lower-bit variants take a multiplicative hit
- A runtime-fit penalty: partial offload (layers spilling to system RAM) scores 0.72×, CPU-only runs score 0.50×
- Speed adjustment: ±8 points based on estimated tok/s performance
Those last two factors matter. A model that gets 22 tok/s on your 8GB card scores lower than the same model would on a 24GB card running it fully on-GPU. The model has not changed, but partial offload degrades the experience in a way pure-quality benchmarks miss.
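To see how these factors might interact, here is a hypothetical sketch. Only the 0.72 and 0.50 runtime-fit multipliers and the ±8 speed band come from the list above. The way the terms are combined, the `size_weight` parameter, and the example quantization multiplier are assumptions for illustration, not whichllm's actual formula.

```python
import math

# Runtime-fit multipliers quoted in the factor list.
FIT_MULTIPLIER = {"full_gpu": 1.00, "partial_offload": 0.72, "cpu_only": 0.50}
MAX_SPEED_ADJUST = 8.0  # the +/-8 point speed band


def sketch_score(benchmark: float, params_b: float, quant_multiplier: float,
                 fit: str, speed_adjust: float = 0.0,
                 size_weight: float = 1.0) -> float:
    """One plausible way to combine the factors (assumed, not whichllm's code)."""
    if abs(speed_adjust) > MAX_SPEED_ADJUST:
        raise ValueError("speed_adjust must stay within +/-8 points")
    # log2-scaled size bonus; size_weight is a hypothetical tuning knob.
    size_bonus = size_weight * math.log2(params_b)
    raw = (benchmark + size_bonus) * quant_multiplier * FIT_MULTIPLIER[fit]
    # Clamp to the 0-100 range after applying the speed adjustment.
    return max(0.0, min(100.0, raw + speed_adjust))


# Same hypothetical model and quant, two runtime situations.
full = sketch_score(80.0, 16, 0.9, "full_gpu")
partial = sketch_score(80.0, 16, 0.9, "partial_offload")
print(f"full GPU: {full:.1f}, partial offload: {partial:.1f}")
```

Under these assumptions the same model drops from 75.6 to 54.4 once layers spill to system RAM, which is the effect the runtime-fit penalty is meant to capture.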
Real results by GPU (May 2026 snapshot):
| GPU | Top pick | Quant | Tok/s | Score |
|---|---|---|---|---|
| RTX 5090 32 GB | Qwen3.6-27B | Q6_K | ~40 | 94.7 |
| RTX 4090 24 GB | Qwen3.6-27B | Q5_K_M | ~27 | 92.8 |
| RTX 4090 24 GB (alt) | Qwen3-32B | Q4_K_M | ~31 | 83.0 |
| RTX 4060 8 GB | Qwen3-14B | Q3_K_M | ~22 | 71.0 |
| Apple M3 Max 36 GB | Qwen3.6-27B | Q5_K_M | ~9 | 89.4 |
The gap between the 8GB card (71.0) and the 4090 (92.8) reflects both the model quality ceiling and the Q3 quant penalty — not purely chip speed. An 8GB owner running a Q3 14B gets measurably worse reasoning quality than a 24GB owner running a Q5 27B, independent of tok/s. If you're deciding whether the 16GB step-up is worth $50 on the RTX 5060 Ti, that quality difference is the actual argument — see the full breakdown at /blog/rtx-5060-ti-8gb-vs-16gb-local-ai-2026/.
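The VRAM-fit side of that comparison can be approximated with simple arithmetic: parameter count times bits per weight. The bits-per-weight values below are rounded approximations for llama.cpp quant types and vary by model architecture, and the 1.5 GB reserve for KV cache and runtime buffers is an assumed placeholder, so treat the result as a first-pass estimate only.

```python
# Approximate bits per weight per quant type (rounded, architecture-dependent).
APPROX_BPW = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights in GB (1e9 bytes)."""
    return params_billion * APPROX_BPW[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         reserve_gb: float = 1.5) -> bool:
    """True if weights plus an assumed reserve fit in the given VRAM."""
    return weights_gb(params_billion, quant) + reserve_gb <= vram_gb


print(f"27B Q5_K_M weights: ~{weights_gb(27, 'Q5_K_M'):.1f} GB")
print("fits in 24 GB:", fits(27, "Q5_K_M", 24))
print("fits in 16 GB:", fits(27, "Q5_K_M", 16))
```

By this estimate a 27B model at Q5_K_M needs roughly 19 GB for weights alone, which is why it shows up as a 24GB-class pick and not a 16GB one.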
The honest limitation: whichllm derives its scores from community leaderboard data, not from tests run on your machine. It gives you the best model for your hardware class; your specific chip, driver version, and cooling headroom may produce different throughput numbers. Use it to shortlist, then validate with llama-bench.
2. LocalScore — measure your specific chip
LocalScore is a Mozilla Builders project that runs a standardized test battery on a GGUF model you supply, then (optionally) uploads your result to the community database at localscore.ai. It's built on Llamafile, itself a portable wrapper around llama.cpp, which gives it cross-platform coverage on Windows, Linux, and macOS without requiring a CUDA compile.
Three metrics it measures:
- Prompt Processing (PP) — tokens per second ingesting context. Matters for RAG pipelines with long document chunks and multi-turn conversations with large history.
- Token Generation (TG) — tokens per second producing output. This is the number users feel as "speed."
- Time to First Token (TTFT) — milliseconds before the first character appears. Critical for interactive use; high TTFT makes a fast model feel slow.
These combine into a single LocalScore number you can compare against the community database. Before benchmarking anything yourself, search your GPU model on localscore.ai — there's a good chance someone has already measured the model you're evaluating on identical or similar hardware.
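If you want to sanity-check these three metrics on your own runs, they fall out of three timestamps and two token counts. The sketch below is a simplified illustration, not LocalScore's implementation: it treats TTFT as the time spent processing the prompt, and the function and field names are made up for this example.

```python
def throughput_metrics(prompt_tokens: int, generated_tokens: int,
                       t_start: float, t_first_token: float,
                       t_end: float) -> dict:
    """Derive PP, TG, and TTFT from timestamps given in seconds."""
    prompt_s = t_first_token - t_start
    gen_s = t_end - t_first_token
    if prompt_s <= 0 or gen_s <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return {
        # Prompt tokens ingested per second before the first output token.
        "pp_tok_s": prompt_tokens / prompt_s,
        # The first token arrives at t_first_token, so the remaining
        # generated_tokens - 1 tokens are produced during gen_s.
        "tg_tok_s": (generated_tokens - 1) / gen_s,
        "ttft_ms": prompt_s * 1000,
    }


# Illustrative run: 1,024-token prompt, 129 output tokens.
m = throughput_metrics(1024, 129, t_start=0.0, t_first_token=0.5, t_end=4.5)
print(m)
```

In this made-up run the prompt is ingested at 2,048 tok/s, output arrives at 32 tok/s, and the user waits 500 ms for the first token. A long prompt on a chip with slow prompt processing is what makes TTFT balloon even when generation speed looks healthy.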
Limitation: LocalScore supports single-GPU setups only. Multi-GPU NVLink configs and CPU+GPU hybrid inference are outside its current scope.
For open-source tooling in this space more broadly, aifoss.dev tracks LocalScore alongside other self-hosted AI benchmarking projects.
3. llama-bench — the raw throughput baseline
llama-bench ships inside llama.cpp and is the closest thing to a ground-truth speed measurement for single-process inference. If you have llama.cpp compiled, you already have it at ./llama-bench.
# Minimal: tests both prompt processing and generation
./llama-bench -m model.gguf -ngl 99
# Test multiple batch sizes in one pass
./llama-bench -m model.gguf -ngl 99 -b 128,256,512
# Three repetitions for stable averages
./llama-bench -m model.gguf -ngl 99 --repetitions 3
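-ngl 99 asks llama.cpp to offload up to 99 layers to the GPU, which covers every layer of the models discussed here. With a lower value, some layers stay on the CPU and you measure a hybrid CPU+GPU config, which may look fine but won't tell you what a fully GPU-resident run actually does. Pass it explicitly for apples-to-apples comparisons.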
Output reports pp512 (prompt processing of a 512-token prompt, in tok/s) and tg128 (generation of 128 output tokens, in tok/s), each with standard deviation across repetitions. These are the numbers most hardware reviewers publish, which makes them the most useful for cross-referencing community benchmarks.
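If you want those numbers in a script instead of reading them off the terminal, a small parser over the table output is enough. This sketch assumes the table has `test` and `t/s` columns with values written as `mean ± stddev`; the sample rows are invented for illustration and are not real measurements.

```python
import re

TS_PATTERN = re.compile(r"([\d.]+)\s*±\s*([\d.]+)")

# Invented sample in the assumed table layout (not real results).
SAMPLE = """
| model             |     size | params | backend | ngl |  test |             t/s |
| ----------------- | -------: | -----: | ------- | --: | ----: | --------------: |
| example 7B Q4_K_M | 4.00 GiB | 7.00 B | CUDA    |  99 | pp512 | 2400.00 ± 30.00 |
| example 7B Q4_K_M | 4.00 GiB | 7.00 B | CUDA    |  99 | tg128 |   100.00 ± 0.50 |
"""


def parse_llama_bench(markdown: str) -> dict:
    """Map each test name to a (mean tok/s, stddev) tuple."""
    results = {}
    header = None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # alignment row such as | --- | ---: |
        row = dict(zip(header, cells))
        match = TS_PATTERN.search(row.get("t/s", ""))
        if match:
            # Normalize "pp 512" and "pp512" to the same key.
            test = row.get("test", "").replace(" ", "")
            results[test] = (float(match.group(1)), float(match.group(2)))
    return results


print(parse_llama_bench(SAMPLE))
```

Keeping the standard deviation next to the mean is deliberate: a tg128 figure with a large spread usually points to thermal throttling or background load, and is worth re-running before you compare it against anyone else's numbers.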
What llama-bench won't tell you: quality. A Q2 model might generate 80 tok/s while producing garbled answers. Always pair throughput numbers with a quality benchmark like whichllm's score, or run a quick LiveCodeBench sample before declaring a setup usable.
4. llama-benchy — benchmark across inference backends
llama-benchy addresses a gap that matters for anyone comparing serving backends: llama-bench only works with llama.cpp directly. If you want equivalent throughput numbers for Ollama, vLLM, or SGLang on the same model and hardware, you need a different tool.
pip install llama-benchy
# or via Docker
docker run he