Why most LLM VRAM calculators are wrong on modern models (and an open-source MIT fix)

Yo — Thu, 04 Jun 2026 16:05:03 +0000

Most "can I run this LLM?" calculators estimate the KV cache with the textbook formula:

  KV ≈ 2 × layers × kv_heads × head_dim × context × bytes

It assumes every layer keeps a full-context KV cache with one head shape. True for
Llama-1/2 — wrong for most 2025–2026 models:

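Gemma 4 interleaves sliding-window and global attention layers 5:1: most layers only hold the last 1024 tokens, and global layers use a different head shape.
Hybrid models with linear-attention layers keep a fixed-size state in those layers instead of token-proportional KV.
MoE keeps every expert resident even if only a few activate per token, so weight memory follows total parameters, not active ones.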
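
To make the gap concrete, here is a minimal Python sketch that compares the textbook formula with a per-layer sum. Everything in it is an assumption for illustration: the `AttnLayer` type, the function names, and the layer counts and head shapes are made up, not read from any real model's config and not taken from FitLLM.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1024 ** 3


@dataclass
class AttnLayer:
    """One attention layer in a HYPOTHETICAL model (illustrative values only)."""
    kv_heads: int
    head_dim: int
    window: Optional[int] = None  # None means global (full-context) attention


def naive_kv_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> int:
    """Textbook formula: every layer caches the full context with one head shape."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem


def per_layer_kv_bytes(layers: List[AttnLayer], context: int,
                       bytes_per_elem: int = 2) -> int:
    """Sum K and V per layer; sliding-window layers cache at most `window` tokens."""
    total = 0
    for layer in layers:
        tokens = context if layer.window is None else min(context, layer.window)
        total += 2 * layer.kv_heads * layer.head_dim * tokens * bytes_per_elem
    return total


# Hypothetical 48-layer model with a 5:1 sliding:global interleave.
sliding = [AttnLayer(kv_heads=8, head_dim=128, window=1024) for _ in range(40)]
global_ = [AttnLayer(kv_heads=2, head_dim=256) for _ in range(8)]
context = 131072

naive = naive_kv_bytes(48, kv_heads=8, head_dim=128, context=context)
exact = per_layer_kv_bytes(sliding + global_, context)

print(f"naive:     {naive / GIB:.2f} GiB")
print(f"per-layer: {exact / GIB:.2f} GiB")
print(f"overcount: {naive / exact:.1f}x")
```

With these made-up shapes the sliding layers contribute almost nothing at long context, so the total is dominated by the few global layers. A linear-attention layer would add a constant term that does not depend on `context` at all.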

So the naive number overcounts the KV-cache term, by ~4× on Qwen 3.6 and ~11× on Gemma 4 31B
at long context. That is enough for a calculator to report "won't fit" for a model that
actually fits. (A second common slip: applying the GGUF weight quant to the KV cache.
llama.cpp keeps KV at f16 by default; weight bits ≠ KV bits.)
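
The weight-bits versus KV-bits slip can be sketched the same way. The snippet below is an illustration, not FitLLM's code: the function names and the element count are hypothetical, and the per-element sizes assume llama.cpp's block layouts (32 values per block, 34 bytes for q8_0 and 18 bytes for q4_0). The KV cache only uses a quantized type if you ask for one, for example with llama.cpp's `--cache-type-k` and `--cache-type-v` options.

```python
# Bytes per cached element for a few KV cache types.
# Assumption: q8_0 and q4_0 store 32 values per block in 34 and 18 bytes.
KV_BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weight memory depends on the weight quantization of the model file."""
    return n_params * bits_per_weight / 8


def kv_bytes(kv_elems: int, kv_type: str = "f16") -> float:
    """KV memory depends on the cache type, which defaults to f16 here."""
    return kv_elems * KV_BYTES_PER_ELEM[kv_type]


# Hypothetical cache holding 2**30 K/V elements in total.
kv_elems = 2 ** 30
correct = kv_bytes(kv_elems)          # f16 cache, whatever the weight quant is
wrong = kv_bytes(kv_elems, "q4_0")    # mistakenly reusing a 4-bit weight quant

print(f"f16 KV:        {correct / 2**30:.2f} GiB")
print(f"'4-bit' guess: {wrong / 2**30:.2f} GiB")
```

Note that this slip pushes the estimate in the opposite direction from the layer-shape error: it undercounts the cache rather than overcounting it.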

FitLLM reads each model's official config.json live and models
sliding-window / linear / global / MoE layers separately — it reproduces Gemma 4 31B's
published 20.78 GiB full-context KV. Covers Apple Silicon and NVIDIA RTX, and you can
paste any Hugging Face model id. It's an estimator, not ground truth (tok/s especially is
bandwidth-bound).
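
As a rough sketch of what "modeling layers separately from config.json" can look like, the snippet below counts layer kinds from a config dictionary. It is not FitLLM's implementation. The field names `layer_types`, `sliding_window`, and `num_hidden_layers` are assumptions modeled on fields that some Hugging Face configs expose, the JSON is a made-up example embedded inline rather than fetched live, and `count_layer_kinds` is a hypothetical helper.

```python
import json
from collections import Counter

# Made-up config for illustration; real configs differ per architecture.
EXAMPLE_CONFIG_JSON = """
{
  "num_hidden_layers": 6,
  "sliding_window": 1024,
  "layer_types": [
    "sliding_attention", "sliding_attention", "sliding_attention",
    "sliding_attention", "sliding_attention", "full_attention"
  ]
}
"""


def count_layer_kinds(config: dict) -> Counter:
    """Count layers by kind; fall back to all-global if no per-layer list exists."""
    layer_types = config.get("layer_types")
    if layer_types is None:
        # Assumption: without a per-layer list, treat every layer as full attention.
        return Counter({"full_attention": config["num_hidden_layers"]})
    return Counter(layer_types)


config = json.loads(EXAMPLE_CONFIG_JSON)
kinds = count_layer_kinds(config)

for kind, count in kinds.items():
    print(f"{kind}: {count} layers")
```

Each kind can then get its own memory rule: a capped token count for sliding-window layers, the full context for global layers, and a constant state size for linear layers.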

The whole calculation engine is one readable MIT file, so you can audit the math, fork
it, or PR a correction:
👉 https://github.com/click6067-ship-it/fitllm-engine

Try it: https://fitllm.run

DEV Community: Yo
