Jovan Chan

Posted on Jun 2 • Originally published at runaihome.com

How to find the best local LLM for your hardware: 5 benchmark tools compared (2026)

#localllm #benchmarking #hardware #whichllm

TL;DR: Picking a local LLM by parameter count is the wrong signal — a well-quantized 14B can outperform a crushed 27B, and a model that barely fits your VRAM will stall at under 10 tok/s. These five tools automate the math: whichllm ranks what to run in one command, LocalScore measures how fast your hardware actually is, and llama-bench gives you the raw throughput numbers to validate both.

	whichllm	LocalScore	llama-bench
Best for	"What model should I run?"	"How fast is my actual chip?"	Raw tok/s baseline for any config
Input required	Your GPU (auto-detected)	GPU + GGUF model file	Any GGUF model + a compiled llama.cpp
Output	Ranked model list with quant	PP speed, TG speed, TTFT	tok/s table across batch/thread configs
The catch	Scores rely on merged leaderboards, not local runs	Single-GPU setups only	No quality signal — speed only

Honest take: Run whichllm first, get a ranked list in under 10 seconds, then validate the top pick's tok/s with llama-bench on your machine before committing to a multi-GB download.

The "fit the biggest model your VRAM holds" heuristic has two failure modes. First, a 14B Q3 can outperform a 7B Q8 on general reasoning and lose badly on code — parameter count is not a quality proxy once quantization enters the picture. Second, a model that barely squeezes into 8GB at Q4 will offload key-value cache to system RAM when context grows past a few thousand tokens, dropping you from 40 tok/s to under 10 tok/s mid-conversation.
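To see why context growth pushes a barely-fitting model off the GPU, here is a rough sketch of the standard KV-cache size estimate. The layer count, KV-head count, and head dimension are hypothetical values for a 14B-class model, not figures from any specific checkpoint, and the estimate assumes an unquantized fp16 cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV-cache size: one key and one value vector per layer, per token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical architecture numbers, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128
GIB = 1024 ** 3

for n_ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx)
    print(f"{n_ctx:>6} tokens -> {size / GIB:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, so a model whose weights leave only a few hundred MB of VRAM free has very little room before something has to spill.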

What you actually need is a three-part filter: quality score from real-world evals, verified VRAM fit at your preferred quantization, and measured tokens per second on your specific chip. The five tools below cover that stack.
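As a minimal sketch of that three-part filter, the snippet below drops anything that does not fit or does not hit a speed floor, then ranks the rest by quality. The `Candidate` type, the `shortlist` helper, and every number in it are hypothetical placeholders, not output from any of the tools covered here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float    # quality score from evals, 0-100
    vram_gib: float   # VRAM needed at the chosen quantization
    tok_per_s: float  # measured generation speed on your chip


def shortlist(candidates, vram_budget_gib, min_tok_per_s):
    """Keep models that fit and are fast enough, best quality first."""
    usable = [
        c for c in candidates
        if c.vram_gib <= vram_budget_gib and c.tok_per_s >= min_tok_per_s
    ]
    return sorted(usable, key=lambda c: c.quality, reverse=True)


# Placeholder data, invented for illustration.
pool = [
    Candidate("model-a-q5", quality=90.0, vram_gib=20.0, tok_per_s=25.0),
    Candidate("model-b-q4", quality=80.0, vram_gib=7.5, tok_per_s=30.0),
    Candidate("model-c-q3", quality=70.0, vram_gib=6.5, tok_per_s=8.0),
]

for c in shortlist(pool, vram_budget_gib=8.0, min_tok_per_s=15.0):
    print(c.name, c.quality)
```

With an 8 GiB budget and a 15 tok/s floor, only `model-b-q4` survives: the higher-quality model does not fit, and the smaller one is too slow.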

1. whichllm — one command, ranked results

whichllm is a Python CLI that auto-detects your GPU, CPU, and RAM, then ranks Hugging Face models by a merged benchmark score rather than parameter count. It reached v0.5.7 on May 19, 2026, and has gathered 2,000 GitHub stars since its March 2026 launch.

Install (pick any):

uvx whichllm@latest       # one-off run, no persistent install
uv tool install whichllm  # persistent
pip install whichllm
brew install andyyyy64/whichllm/whichllm

Hardware auto-detected: NVIDIA GPUs via nvidia-ml-py, AMD GPUs via ROCm/dbgpu, Apple Silicon via Metal, plus CPU core count, system RAM, and available disk space.

How the 0–100 score is built:

LiveBench, Artificial Analysis Index, and Aider scores (live-merged, highest weight)
Chatbot Arena ELO and Open LLM Leaderboard v2 (frozen, lower recency weight)
A log₂-scaled model-size bonus as a knowledge proxy
A quantization penalty — lower-bit variants take a multiplicative hit
A runtime-fit penalty: partial offload (layers spilling to system RAM) scores 0.72×, CPU-only runs score 0.50×
Speed adjustment: ±8 points based on estimated tok/s performance

Those last two factors matter. A model that gets 22 tok/s on your 8GB card scores lower than the same model would on a 24GB card running it fully on-GPU, not because the model changed but because a slower or partially offloaded run degrades the experience in a way pure-quality benchmarks miss.
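To make the arithmetic concrete, here is one way those penalties could combine. Only the 0.72× and 0.50× multipliers and the ±8 point speed range come from the description above; the order of operations, the `adjusted_score` name, and the sample inputs are assumptions for illustration, not whichllm's actual implementation.

```python
# Runtime-fit multipliers as described above; full-GPU runs take no penalty.
FIT_MULTIPLIER = {"full_gpu": 1.0, "partial_offload": 0.72, "cpu_only": 0.50}


def adjusted_score(base, quant_multiplier, fit, speed_points):
    """Assumed combination: multiplicative penalties first, then speed points.

    `base` stands for the merged benchmark score (with any size bonus
    already folded in); `quant_multiplier` is a hypothetical value <= 1.0.
    """
    speed_points = max(-8.0, min(8.0, speed_points))  # clamp to +/- 8
    score = base * quant_multiplier * FIT_MULTIPLIER[fit] + speed_points
    return max(0.0, min(100.0, score))                # keep within 0-100


# Same hypothetical model, three runtime situations.
for fit in ("full_gpu", "partial_offload", "cpu_only"):
    print(fit, round(adjusted_score(80.0, 1.0, fit, 0.0), 1))
```

Even with identical benchmark inputs, the offloaded and CPU-only runs land far below the fully GPU-resident one, which is the effect the score is trying to capture.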

Real results by GPU (May 2026 snapshot):

GPU	Top pick	Quant	Tok/s	Score
RTX 5090 32 GB	Qwen3.6-27B	Q6_K	~40	94.7
RTX 4090 24 GB	Qwen3.6-27B	Q5_K_M	~27	92.8
RTX 4090 24 GB (alt)	Qwen3-32B	Q4_K_M	~31	83.0
RTX 4060 8 GB	Qwen3-14B	Q3_K_M	~22	71.0
Apple M3 Max 36 GB	Qwen3.6-27B	Q5_K_M	~9	89.4

The gap between the 8GB card (71.0) and the 4090 (92.8) reflects both the model quality ceiling and the Q3 quant penalty — not purely chip speed. An 8GB owner running a Q3 14B gets measurably worse reasoning quality than a 24GB owner running a Q5 27B, independent of tok/s. If you're deciding whether the 16GB step-up is worth $50 on the RTX 5060 Ti, that quality difference is the actual argument — see the full breakdown at /blog/rtx-5060-ti-8gb-vs-16gb-local-ai-2026/.

The honest limitation: whichllm derives its scores from community leaderboard data, not from tests run on your machine. It gives you the best model for your hardware class; your specific chip, driver version, and cooling headroom may produce different throughput numbers. Use it to shortlist, then validate with llama-bench.

2. LocalScore — measure your specific chip

LocalScore is a Mozilla Builders project that runs a standardized test battery on a GGUF model you supply, then (optionally) uploads your result to the community database at localscore.ai. It's built on Llamafile, itself a portable wrapper around llama.cpp, which gives it cross-platform coverage on Windows, Linux, and macOS without requiring a CUDA compile.

Three metrics it measures:

Prompt Processing (PP) — tokens per second ingesting context. Matters for RAG pipelines with long document chunks and multi-turn conversations with large history.
Token Generation (TG) — tokens per second producing output. This is the number users feel as "speed."
Time to First Token (TTFT) — milliseconds before the first character appears. Critical for interactive use; high TTFT makes a fast model feel slow.

These combine into a single LocalScore number you can compare against the community database. Before benchmarking anything yourself, search your GPU model on localscore.ai — there's a good chance someone has already measured the model you're evaluating on identical or similar hardware.
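If you want to sanity-check the three metrics against your own timings, this sketch derives them from raw token counts and wall-clock durations. It is not LocalScore's code: the function name is hypothetical, and TTFT is approximated here as prompt-processing time plus one average decode step, which is an assumption.

```python
def throughput_metrics(n_prompt_tokens, prompt_seconds,
                       n_generated_tokens, generation_seconds):
    """Derive PP, TG, and an approximate TTFT from raw timings."""
    pp = n_prompt_tokens / prompt_seconds          # prompt tokens per second
    tg = n_generated_tokens / generation_seconds   # generated tokens per second
    # Assumption: the first token arrives after the full prompt is processed
    # plus one average decode step.
    ttft_ms = (prompt_seconds + generation_seconds / n_generated_tokens) * 1000.0
    return {"pp_tok_s": pp, "tg_tok_s": tg, "ttft_ms": ttft_ms}


# Made-up timings: a 1024-token prompt in 2 s, then 128 tokens in 4 s.
print(throughput_metrics(1024, 2.0, 128, 4.0))
```

The example shows why TTFT deserves its own column: a long prompt can cost seconds before the first token even when generation speed looks healthy.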

Limitation: LocalScore supports single-GPU setups only. Multi-GPU NVLink configs and CPU+GPU hybrid inference are outside its current scope.

For open-source tooling in this space more broadly, aifoss.dev tracks LocalScore alongside other self-hosted AI benchmarking projects.

3. llama-bench — the raw throughput baseline

llama-bench ships inside llama.cpp and is the closest thing to a ground-truth speed measurement for single-process inference. If you have llama.cpp compiled, you already have it at ./llama-bench.

# Minimal: tests both prompt processing and generation
./llama-bench -m model.gguf -ngl 99

# Test multiple batch sizes in one pass
./llama-bench -m model.gguf -ngl 99 -b 128,256,512

# Three repetitions for stable averages
./llama-bench -m model.gguf -ngl 99 --repetitions 3

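-ngl 99 offloads all layers to GPU. Recent llama-bench builds appear to default to this value (check ./llama-bench --help for yours), but passing it explicitly keeps the config visible in shared results. A lower value measures a hybrid CPU+GPU config, which may look fine but won't tell you what a fully GPU-resident inference run actually does. Always include it for apples-to-apples comparisons.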

Output reports pp512 (prompt processing of a 512-token prompt, in tok/s) and tg128 (token generation for 128 output tokens, in tok/s), each with the standard deviation across repetitions. These are the numbers most hardware reviewers publish, which makes them the most useful for cross-referencing community benches.
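If you are collecting results across several models, it helps to parse the output instead of reading tables by eye. The sketch below assumes llama-bench's JSON output mode (`-o json`) and the field names `n_prompt`, `n_gen`, `avg_ts`, and `stddev_ts` as they appear in recent llama.cpp builds; verify both against your version. The sample numbers are invented.

```python
import json


def summarize(json_text):
    """Map llama-bench JSON entries to labels like pp512 / tg128."""
    rows = {}
    for entry in json.loads(json_text):
        n_prompt, n_gen = entry["n_prompt"], entry["n_gen"]
        if n_prompt > 0 and n_gen == 0:
            label = f"pp{n_prompt}"   # prompt-processing test
        elif n_gen > 0 and n_prompt == 0:
            label = f"tg{n_gen}"      # token-generation test
        else:
            continue                  # skip combined or empty entries
        rows[label] = (entry["avg_ts"], entry["stddev_ts"])
    return rows


# Invented sample in the assumed shape, not a real measurement.
SAMPLE = """[
  {"n_prompt": 512, "n_gen": 0, "avg_ts": 1000.0, "stddev_ts": 5.0},
  {"n_prompt": 0, "n_gen": 128, "avg_ts": 50.0, "stddev_ts": 0.5}
]"""

for label, (avg, std) in summarize(SAMPLE).items():
    print(f"{label}: {avg:.1f} ± {std:.1f} tok/s")
```

In practice you would redirect a run such as `./llama-bench -m model.gguf -ngl 99 -o json > bench.json` and pass the file's contents to `summarize`.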

What llama-bench won't tell you: quality. A Q2 model might generate 80 tok/s while producing garbled answers. Always pair throughput numbers with a quality benchmark like whichllm's score, or run a quick LiveCodeBench sample before declaring a setup usable.

4. llama-benchy — benchmark across inference backends

llama-benchy addresses a gap that matters for anyone comparing serving backends: llama-bench only works with llama.cpp directly. If you want equivalent throughput numbers for Ollama, vLLM, or SGLang on the same model and hardware, you need a different tool.


pip install llama-benchy
# or via Docker
docker run he
