DEV Community

Jovan Chan

Posted on • Originally published at runaihome.com

How to find the best local LLM for your hardware: 5 benchmark tools compared (2026)

This article was originally published on runaihome.com

TL;DR: Picking a local LLM by parameter count is the wrong signal — a well-quantized 14B can outperform a crushed 27B, and a model that barely fits your VRAM will stall at under 10 tok/s. These five tools automate the math: whichllm ranks what to run in one command, LocalScore measures how fast your hardware actually is, and llama-bench gives you the raw throughput numbers to validate both.

| | whichllm | LocalScore | llama-bench |
|---|---|---|---|
| Best for | "What model should I run?" | "How fast is my actual chip?" | Raw tok/s baseline for any config |
| Input required | Your GPU (auto-detected) | GPU + GGUF model file | Any compiled GGUF + llama.cpp |
| Output | Ranked model list with quant | PP speed, TG speed, TTFT | tok/s table across batch/thread configs |
| The catch | Scores rely on merged leaderboards, not local runs | Single-GPU setups only | No quality signal, speed only |

Honest take: Run whichllm first, get a ranked list in under 10 seconds, then validate the top pick's tok/s with llama-bench on your machine before committing to a multi-GB download.


The "fit the biggest model your VRAM holds" heuristic has two failure modes. First, a 14B Q3 can outperform a 7B Q8 on general reasoning and lose badly on code — parameter count is not a quality proxy once quantization enters the picture. Second, a model that barely squeezes into 8GB at Q4 will offload key-value cache to system RAM when context grows past a few thousand tokens, dropping you from 40 tok/s to under 10 tok/s mid-conversation.
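A rough KV-cache estimate shows why a tight fit breaks down as context grows. The sketch below uses the common per-token formula (one K and one V tensor per layer, times KV heads, head dimension, and bytes per element). The model dimensions, weight size, and overhead figure are illustrative placeholders rather than the specs of any real model, and both function names are hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


def fits_in_vram(weights_gib: float, kv_bytes: int, vram_gib: float,
                 overhead_gib: float = 0.5) -> bool:
    """True if weights + KV cache + an assumed runtime overhead fit in VRAM."""
    total_gib = weights_gib + kv_bytes / 2**30 + overhead_gib
    return total_gib <= vram_gib


# Illustrative dimensions only (hypothetical model with a 16-bit cache).
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
WEIGHTS_GIB = 6.5  # assumed size of the quantized weights

for ctx in (2048, 8192, 32768):
    kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>6} tokens: KV cache {kv / 2**30:.2f} GiB, "
          f"fits in 8 GiB: {fits_in_vram(WEIGHTS_GIB, kv, 8.0)}")
```

With these placeholder numbers the model fits at a short context and stops fitting once the cache grows, which is the point where a runtime would start spilling to system RAM.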

What you actually need is a three-part filter: quality score from real-world evals, verified VRAM fit at your preferred quantization, and measured tokens per second on your specific chip. The five tools below cover that stack.
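As a minimal sketch of that three-part filter, assuming you have already collected a quality score, a VRAM estimate, and a measured tok/s figure for each candidate: the `Candidate` class, the `shortlist` function, and every number below are hypothetical placeholders, not output from any of the tools.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float      # eval-based score, higher is better
    vram_gib: float     # estimated footprint at the chosen quant
    tok_per_s: float    # measured on your own machine


def shortlist(candidates, vram_budget_gib, min_tok_per_s):
    """Keep candidates that fit and are fast enough, best quality first."""
    usable = [c for c in candidates
              if c.vram_gib <= vram_budget_gib and c.tok_per_s >= min_tok_per_s]
    return sorted(usable, key=lambda c: c.quality, reverse=True)


# Placeholder entries, not real benchmark results.
models = [
    Candidate("model-a-q4", quality=80.0, vram_gib=9.5, tok_per_s=30.0),
    Candidate("model-b-q5", quality=74.0, vram_gib=7.2, tok_per_s=24.0),
    Candidate("model-c-q8", quality=68.0, vram_gib=7.8, tok_per_s=8.0),
]

for c in shortlist(models, vram_budget_gib=8.0, min_tok_per_s=15.0):
    print(c.name, c.quality)
```

Fit and speed act as hard gates here and quality only ranks what survives them, so the highest-scoring model is dropped if it does not fit the budget.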

1. whichllm — one command, ranked results

whichllm is a Python CLI that auto-detects your GPU, CPU, and RAM, then ranks HuggingFace models by a merged benchmark score rather than parameter count. It hit v0.5.7 on May 19, 2026, and has gathered 2,000 GitHub stars since its March 2026 launch.

Install (pick any):

uvx whichllm@latest       # one-off run, no persistent install
uv tool install whichllm  # persistent
pip install whichllm
brew install andyyyy64/whichllm/whichllm

Hardware auto-detected: NVIDIA GPUs via nvidia-ml-py, AMD GPUs via ROCm/dbgpu, Apple Silicon via Metal, plus CPU core count, system RAM, and available disk space.

How the 0–100 score is built:

  • LiveBench, Artificial Analysis Index, and Aider scores (live-merged, highest weight)
  • Chatbot Arena ELO and Open LLM Leaderboard v2 (frozen, lower recency weight)
  • A log₂-scaled model-size bonus as a knowledge proxy
  • A quantization penalty — lower-bit variants take a multiplicative hit
  • A runtime-fit penalty: partial offload (layers spilling to system RAM) scores 0.72×, CPU-only runs score 0.50×
  • Speed adjustment: ±8 points based on estimated tok/s performance

Those last two factors matter. A model that gets 22 tok/s on your 8GB card with partial offload scores lower than the same model would on a 24GB card running it fully on-GPU, not because the model changed but because partial offload degrades the experience in a way pure-quality benchmarks miss.
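To make the interaction of those factors concrete, here is a rough sketch that is not whichllm's actual code. It uses only the multipliers quoted above (0.72× for partial offload, 0.50× for CPU-only, ±8 points for speed). The order of operations, the quantization multiplier value, the target speed, and the linear speed mapping are all assumptions made for illustration, and `adjusted_score` is a hypothetical name.

```python
FIT_MULTIPLIER = {"full_gpu": 1.0, "partial_offload": 0.72, "cpu_only": 0.50}


def adjusted_score(base_quality: float, quant_multiplier: float, fit: str,
                   est_tok_per_s: float, target_tok_per_s: float = 20.0) -> float:
    """Combine quality, quantization, fit, and speed into a 0-100 score.

    Assumed order: multiplicative penalties first, then an additive speed
    adjustment clamped to +/-8 points, then a clamp to the 0-100 range.
    """
    score = base_quality * quant_multiplier * FIT_MULTIPLIER[fit]
    # Assumed mapping: linear in the ratio to a target speed, clamped to +/-8.
    speed_adj = max(-8.0, min(8.0, 8.0 * (est_tok_per_s / target_tok_per_s - 1.0)))
    return max(0.0, min(100.0, score + speed_adj))


# Same hypothetical model and quant, different fit on two cards.
on_24gb = adjusted_score(85.0, 0.95, "full_gpu", est_tok_per_s=30.0)
on_8gb = adjusted_score(85.0, 0.95, "partial_offload", est_tok_per_s=22.0)
print(f"24 GB card: {on_24gb:.1f}, 8 GB card: {on_8gb:.1f}")
```

Even under these simplified assumptions, the fit multiplier outweighs the speed adjustment: a respectable tok/s figure cannot buy back what partial offload costs.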

Real results by GPU (May 2026 snapshot):

| GPU | Top pick | Quant | Tok/s | Score |
|---|---|---|---|---|
| RTX 5090 32 GB | Qwen3.6-27B | Q6_K | ~40 | 94.7 |
| RTX 4090 24 GB | Qwen3.6-27B | Q5_K_M | ~27 | 92.8 |
| RTX 4090 24 GB (alt) | Qwen3-32B | Q4_K_M | ~31 | 83.0 |
| RTX 4060 8 GB | Qwen3-14B | Q3_K_M | ~22 | 71.0 |
| Apple M3 Max 36 GB | Qwen3.6-27B | Q5_K_M | ~9 | 89.4 |

The gap between the 8GB card (71.0) and the 4090 (92.8) reflects both the model quality ceiling and the Q3 quant penalty — not purely chip speed. An 8GB owner running a Q3 14B gets measurably worse reasoning quality than a 24GB owner running a Q5 27B, independent of tok/s. If you're deciding whether the 16GB step-up is worth $50 on the RTX 5060 Ti, that quality difference is the actual argument — see the full breakdown at /blog/rtx-5060-ti-8gb-vs-16gb-local-ai-2026/.

The honest limitation: whichllm derives its scores from community leaderboard data, not from tests run on your machine. It gives you the best model for your hardware class; your specific chip, driver version, and cooling headroom may produce different throughput numbers. Use it to shortlist, then validate with llama-bench.

2. LocalScore — measure your specific chip

LocalScore is a Mozilla Builders project that runs a standardized test battery on a GGUF model you supply, then (optionally) uploads your result to the community database at localscore.ai. It's built on Llamafile, itself a portable wrapper around llama.cpp, which gives it cross-platform coverage on Windows, Linux, and macOS without requiring a CUDA compile.

Three metrics it measures:

  1. Prompt Processing (PP) — tokens per second ingesting context. Matters for RAG pipelines with long document chunks and multi-turn conversations with large history.
  2. Token Generation (TG) — tokens per second producing output. This is the number users feel as "speed."
  3. Time to First Token (TTFT) — milliseconds before the first character appears. Critical for interactive use; high TTFT makes a fast model feel slow.

These combine into a single LocalScore number you can compare against the community database. Before benchmarking anything yourself, search your GPU model on localscore.ai — there's a good chance someone has already measured the model you're evaluating on identical or similar hardware.
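For intuition about what the three metrics measure, the sketch below derives them from three timestamps around a single request. It is a simplification and not LocalScore's implementation: it treats everything before the first token as prompt processing, and it does not attempt to reproduce how LocalScore combines the metrics into one number. The function name and sample values are hypothetical.

```python
def throughput_metrics(prompt_tokens: int, generated_tokens: int,
                       t_start: float, t_first_token: float, t_end: float) -> dict:
    """Derive PP (tok/s), TG (tok/s), and TTFT (ms) from timestamps in seconds."""
    ttft_s = t_first_token - t_start
    gen_s = t_end - t_first_token
    return {
        # Simplification: all time before the first token counts as prompt processing.
        "pp_tok_s": prompt_tokens / ttft_s,
        # The first token is already out at t_first_token, so exclude it here.
        "tg_tok_s": (generated_tokens - 1) / gen_s,
        "ttft_ms": ttft_s * 1000.0,
    }


# Invented timings: 1024-token prompt, 129 generated tokens.
m = throughput_metrics(1024, 129, t_start=0.0, t_first_token=0.8, t_end=4.0)
print(f"PP {m['pp_tok_s']:.0f} tok/s, TG {m['tg_tok_s']:.1f} tok/s, "
      f"TTFT {m['ttft_ms']:.0f} ms")
```

The same generation speed feels very different depending on TTFT, because the wait before the first token grows with prompt length while TG stays roughly constant.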

Limitation: LocalScore supports single-GPU setups only. Multi-GPU NVLink configs and CPU+GPU hybrid inference are outside its current scope.

For open-source tooling in this space more broadly, aifoss.dev tracks LocalScore alongside other self-hosted AI benchmarking projects.

3. llama-bench — the raw throughput baseline

llama-bench ships inside llama.cpp and is the closest thing to a ground-truth speed measurement for single-process inference. If you have llama.cpp compiled, you already have it: ./llama-bench in older Makefile builds, or build/bin/llama-bench in CMake builds.

# Minimal: tests both prompt processing and generation
./llama-bench -m model.gguf -ngl 99

# Test multiple batch sizes in one pass
./llama-bench -m model.gguf -ngl 99 -b 128,256,512

# Three repetitions per test (fewer than the default of five: faster, slightly noisier)
./llama-bench -m model.gguf -ngl 99 --repetitions 3

-ngl 99 offloads all layers to GPU (for any model with 99 or fewer layers). With a lower value you measure a hybrid CPU+GPU config, which may look fine but won't tell you what a fully offloaded inference run actually does. Pass it explicitly for apples-to-apples comparisons rather than relying on whatever default your build uses.

Output reports pp512 (prompt processing for a 512-token prompt, in tok/s) and tg128 (token generation for 128 output tokens, in tok/s), each with standard deviation across repetitions. These are the numbers most hardware reviewers publish, which makes them the most useful for cross-referencing community benchmarks.
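If you want those two numbers in a script rather than a table, llama-bench can emit JSON via `-o json`. The sketch below parses a sample shaped like that output. The field names (`n_prompt`, `n_gen`, `avg_ts`, `stddev_ts`) reflect my understanding of the format and should be checked against your build, the numbers in `SAMPLE` are invented, and `summarize` is a hypothetical helper.

```python
import json

# Placeholder sample shaped like `llama-bench -o json` output; numbers are invented.
SAMPLE = """
[
  {"model_filename": "model.gguf", "n_prompt": 512, "n_gen": 0,
   "avg_ts": 1500.0, "stddev_ts": 12.0},
  {"model_filename": "model.gguf", "n_prompt": 0, "n_gen": 128,
   "avg_ts": 42.0, "stddev_ts": 0.5}
]
"""


def summarize(bench_json: str) -> dict:
    """Map test labels such as 'pp512' or 'tg128' to (mean tok/s, stddev)."""
    results = {}
    for row in json.loads(bench_json):
        if row["n_prompt"] and not row["n_gen"]:
            label = f"pp{row['n_prompt']}"
        elif row["n_gen"] and not row["n_prompt"]:
            label = f"tg{row['n_gen']}"
        else:
            label = f"pp{row['n_prompt']}+tg{row['n_gen']}"
        results[label] = (row["avg_ts"], row["stddev_ts"])
    return results


for label, (mean, std) in summarize(SAMPLE).items():
    print(f"{label}: {mean:.1f} +/- {std:.1f} tok/s")
```

In practice you would replace `SAMPLE` with the captured stdout of your own run, for example by invoking the binary through `subprocess.run` with `capture_output=True`.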

What llama-bench won't tell you: quality. A Q2 model might generate 80 tok/s while producing garbled answers. Always pair throughput numbers with a quality benchmark like whichllm's score, or run a quick LiveCodeBench sample before declaring a setup usable.

4. llama-benchy — benchmark across inference backends

llama-benchy addresses a gap that matters for anyone comparing serving backends: llama-bench only works with llama.cpp directly. If you want equivalent throughput numbers for Ollama, vLLM, or SGLang on the same model and hardware, you need a different tool.


pip install llama-benchy
# or via Docker
docker run he