Hey LLM folks!
Whether you're shipping with Ollama on a Mac, vLLM on 8× H100, or anything in between, model planning shouldn't be napkin math.
I built a free calculator: WeightRoom.
→ https://smelukov.github.io/WeightRoom/
No backend, no signup, no telemetry. Pick a model, pick a quant, pick a context window, and get RAM, disk, and TPS estimates in real time. State serializes to the URL, so configs are shareable.
KV cache: 4 formulas, not 1
The thing most calculators get wrong.
WeightRoom ships 4 formulas matching real architectures:
- Standard GQA: Llama, Qwen, Mistral
- Sliding Window: Gemma 2/3 (most layers cap at the window size)
- MLA: DeepSeek V3/R1 (joint K+V latent, ~10× smaller cache than GQA)
- Linear + Full: Qwen 3.5 (only sparse full-attention layers grow)
Pick the model and the right formula is applied automatically. Use the wrong formula on Gemma or DeepSeek and you over-estimate the KV cache by 5–10×.
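If you want to sanity-check the gap yourself, here's a minimal sketch of the first three formulas. Function and parameter names are mine, not WeightRoom's, and the fp16 cache default is an assumption:

```python
def kv_cache_bytes_gqa(layers, kv_heads, head_dim, ctx, dtype_bytes=2):
    # Standard GQA/MHA: every layer stores K and V for the full context.
    return 2 * layers * kv_heads * head_dim * ctx * dtype_bytes

def kv_cache_bytes_sliding(layers, global_layers, kv_heads, head_dim,
                           ctx, window, dtype_bytes=2):
    # Sliding window: local layers cap at the window size; only the
    # global-attention layers grow with the full context.
    tokens = (layers - global_layers) * min(ctx, window) + global_layers * ctx
    return 2 * tokens * kv_heads * head_dim * dtype_bytes

def kv_cache_bytes_mla(layers, latent_dim, ctx, dtype_bytes=2):
    # MLA: one joint K+V latent per layer per token -- no factor of 2
    # and no per-head blow-up.
    return layers * latent_dim * ctx * dtype_bytes

# A Llama-3.3-70B-like shape at 32k context, fp16 cache:
print(kv_cache_bytes_gqa(80, 8, 128, 32768) / 2**30)  # 10.0 GiB
```

Plug a Gemma-like shape into the GQA formula instead of the sliding-window one and you can watch the over-estimate appear.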
MoE: total ≠ active
Another classic mistake: treating MoE models like dense ones.
DeepSeek V3 = 671B params total, 37B active per token.
- RAM/disk uses total params (all experts must fit in memory)
- TPS uses active params (only those are streamed per decode step)
Naive math under-estimates DeepSeek V3 TPS by ~18×. WeightRoom does both correctly.
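The two-sided rule is a couple of lines of arithmetic. A sketch under my own assumptions (1 byte/param, a hypothetical 3000 GB/s memory system):

```python
def moe_memory_gb(total_params_b, bytes_per_param):
    # RAM/disk: every expert must be resident, so size by TOTAL params.
    return total_params_b * bytes_per_param

def moe_roofline_tps(active_params_b, bytes_per_param, bandwidth_gb_s):
    # Decode streams only the ACTIVE params per token, so the
    # bandwidth roofline is set by active params.
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# DeepSeek V3: 671B total, 37B active.
naive = moe_roofline_tps(671, 1.0, 3000)  # treats MoE as dense
moe   = moe_roofline_tps(37, 1.0, 3000)   # counts active params only
print(round(moe / naive, 1))              # 18.1 -- the ~18x gap
```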
Inference engine matters as much as the GPU
Same model, same hardware, different engine: wildly different throughput.
Llama 3.3 70B on 1Γ H100, 8 concurrent users:
- llama.cpp (100% KV pre-alloc): 22 tok/s/user → 172 tok/s system
- vLLM (PagedAttention ~25%): 45 tok/s/user → 356 tok/s system
Different trade-offs. llama.cpp wins on simplicity and local UX, vLLM on aggregate throughput. Both are correct for their use case: pick the engine preset and the math adjusts.
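One way to see why the allocation strategy moves aggregate numbers so much: it changes how many users fit in the same VRAM. A hedged sketch with hypothetical sizes (not WeightRoom's exact model):

```python
def max_concurrent_users(free_vram_gb, kv_full_ctx_gb, prealloc_fraction):
    # prealloc_fraction: 1.0 for llama.cpp-style full pre-allocation,
    # ~0.25 for paged KV at the average utilisation assumed above.
    return int(free_vram_gb // (kv_full_ctx_gb * prealloc_fraction))

# 20 GB free for KV, 10 GB per user at full context:
print(max_concurrent_users(20, 10, 1.0))   # 2 slots, llama.cpp-style
print(max_concurrent_users(20, 10, 0.25))  # 8 slots in the same VRAM, paged
```

More resident users means more tokens per decode pass over the same weight read, which is where the system-throughput gap comes from.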
Compare mode
Side-by-side cards with charts. Pick a budget, see what fits.
Llama 3.3 70B vs Qwen 3 32B on the same H100 (Q4, 32k ctx):
- Llama 70B: ~54 tok/s
- Qwen 32B: ~101 tok/s
Smaller model, ~2× the speed, same VRAM headroom.
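To first order that ratio falls out of bytes streamed per decode step. A sketch with illustrative numbers (the 0.56 bytes/param for Q4 and the KV-read sizes are my assumptions, not WeightRoom's constants):

```python
def dense_roofline_tps(params_b, bytes_per_param, kv_read_gb, bandwidth_gb_s):
    # Bytes streamed per decode step: quantised weights plus the KV
    # cache read at the current context length.
    return bandwidth_gb_s / (params_b * bytes_per_param + kv_read_gb)

# 32B vs 70B at Q4-ish widths on the same hypothetical 3000 GB/s GPU:
small = dense_roofline_tps(32, 0.56, 5, 3000)
large = dense_roofline_tps(70, 0.56, 10, 3000)
print(small > large)  # True
```

The KV-read term is why the observed speedup lands a bit under the raw 70/32 parameter ratio.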
HuggingFace import
Got a model not in the catalog? Paste any HuggingFace URL and it pulls config.json, then auto-detects the attention formula, MoE fields, and precision.
- 34 pre-configured models
- All major quants (Q1 to FP32) for both weights and KV cache
- HF Transformers + MLX repos supported
- GGUF-only repos: import the original Transformers repo, then pick the quant manually
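The auto-detection boils down to looking at which fields exist in config.json. A hypothetical sketch — the key names below are real config.json fields, but the coverage is illustrative, not WeightRoom's actual import code:

```python
def detect_formula(cfg: dict) -> str:
    # Parsed config.json as a plain dict.
    if "kv_lora_rank" in cfg:        # DeepSeek-style MLA latent
        return "mla"
    if cfg.get("sliding_window"):    # Gemma/Mistral window attention
        return "sliding_window"
    heads = cfg.get("num_attention_heads", 0)
    kv_heads = cfg.get("num_key_value_heads", heads)
    return "gqa" if kv_heads < heads else "mha"

def is_moe(cfg: dict) -> bool:
    # Different families name the expert count differently.
    return any(k in cfg for k in
               ("num_experts", "num_local_experts", "n_routed_experts"))
```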
The honest caveat
TPS is a roofline estimate: a bandwidth-bound theoretical maximum.
Real throughput is typically:
- 60–90% of estimate for dense models on a single GPU
- 40–60% for multi-GPU dense (tensor parallel)
- 20–40% for multi-GPU MoE (DeepSeek V3, Mixtral × N)
Not modelled: NVLink sync, MoE expert routing, prefill / attention compute, dequantisation overhead. The UI shows "theoretical maximum" right under each TPS number, and the "How calculations work" footer has the full Limitations section.
Use the numbers for sizing decisions ("does this fit?", "is config A faster than config B?"), not as a substitute for real benchmarks.
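If you want to fold those bands into your own sizing scripts, it's a one-liner. The function and band names here are mine; the percentages are the empirical rules of thumb above, not guarantees:

```python
def realistic_tps_band(roofline_tps, setup):
    # Map a roofline estimate to the empirical efficiency band.
    bands = {
        "dense_single_gpu": (0.60, 0.90),
        "dense_multi_gpu":  (0.40, 0.60),
        "moe_multi_gpu":    (0.20, 0.40),
    }
    lo, hi = bands[setup]
    return roofline_tps * lo, roofline_tps * hi

print(realistic_tps_band(100, "dense_single_gpu"))  # (60.0, 90.0)
```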
Try it / give feedback
→ https://smelukov.github.io/WeightRoom/
Just launched, so the formulas and defaults are still rough around the edges. If you spot a number that disagrees with what you actually measured, please tell me β I'd rather fix the calculator than have anyone quote a wrong figure.
- Reply here
- Open an issue: https://github.com/smelukov/WeightRoom/issues
- PRs welcome
Open source, MIT, all client-side React.
