Hey LLM folks!
Whether you're shipping with Ollama on a Mac, vLLM on 8× H100, or anything in between, model planning shouldn't be napkin math.
I built a free calculator: WeightRoom.
→ https://smelukov.github.io/WeightRoom/
No backend, no signup, no telemetry. Pick a model, pick a quant, pick a context window, and get RAM, disk, and TPS estimates in real time. State serializes to the URL, so configs are shareable.
KV cache: 4 formulas, not 1
The thing most calculators get wrong.
WeightRoom ships 4 formulas matching real architectures:
- Standard GQA: Llama, Qwen, Mistral
- Sliding Window: Gemma 2/3 (most layers cap at the window size)
- MLA: DeepSeek V3/R1 (joint K+V latent, ~10× smaller cache than GQA)
- Linear + Full: Qwen 3.5 (only sparse full-attention layers grow)
Pick the model and the right formula is applied automatically. Use the wrong formula on Gemma or DeepSeek and you over-estimate the KV cache by 5–10×.
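If you want to sanity-check the gap yourself, here's a minimal sketch of the first three formulas. Function and parameter names are mine, not WeightRoom's, and the fp16 cache default is an assumption:

```python
def kv_cache_bytes_gqa(layers, kv_heads, head_dim, ctx, dtype_bytes=2):
    # Standard GQA/MHA: every layer stores K and V for the full context.
    return 2 * layers * kv_heads * head_dim * ctx * dtype_bytes

def kv_cache_bytes_sliding(layers, global_layers, kv_heads, head_dim,
                           ctx, window, dtype_bytes=2):
    # Sliding window: local layers cap at the window size; only the
    # global-attention layers grow with the full context.
    tokens = (layers - global_layers) * min(ctx, window) + global_layers * ctx
    return 2 * tokens * kv_heads * head_dim * dtype_bytes

def kv_cache_bytes_mla(layers, latent_dim, ctx, dtype_bytes=2):
    # MLA: one joint K+V latent per layer per token -- no factor of 2
    # and no per-head blow-up.
    return layers * latent_dim * ctx * dtype_bytes

# A Llama-3.3-70B-like shape at 32k context, fp16 cache:
print(kv_cache_bytes_gqa(80, 8, 128, 32768) / 2**30)  # 10.0 GiB
```

Plug a Gemma-like shape into the GQA formula instead of the sliding-window one and you can watch the over-estimate appear.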
MoE: total ≠ active
Another classic mistake: treating MoE models like dense ones.
DeepSeek V3 = 671B params total, 37B active per token.
- RAM/disk uses total params (all experts must fit in memory)
- TPS uses active params (only those are streamed per decode step)
Naive math under-estimates DeepSeek V3 TPS by ~18×. WeightRoom does both correctly.
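The two-sided rule is a couple of lines of arithmetic. A sketch under my own assumptions (1 byte/param, a hypothetical 3000 GB/s memory system):

```python
def moe_memory_gb(total_params_b, bytes_per_param):
    # RAM/disk: every expert must be resident, so size by TOTAL params.
    return total_params_b * bytes_per_param

def moe_roofline_tps(active_params_b, bytes_per_param, bandwidth_gb_s):
    # Decode streams only the ACTIVE params per token, so the
    # bandwidth roofline is set by active params.
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# DeepSeek V3: 671B total, 37B active.
naive = moe_roofline_tps(671, 1.0, 3000)  # treats MoE as dense
moe   = moe_roofline_tps(37, 1.0, 3000)   # counts active params only
print(round(moe / naive, 1))              # 18.1 -- the ~18x gap
```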
Inference engine matters as much as the GPU
Same model, same hardware, different engine: wildly different throughput.
Llama 3.3 70B on 1Γ H100, 8 concurrent users:
- llama.cpp (100% KV pre-alloc): 22 tok/s/user → 172 tok/s system
- vLLM (PagedAttention ~25%): 45 tok/s/user → 356 tok/s system
Different trade-offs. llama.cpp wins on simplicity and local UX, vLLM on aggregate throughput. Both are correct for their use case: pick the engine preset and the math adjusts.
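One way to see why the allocation strategy moves aggregate numbers so much: it changes how many users fit in the same VRAM. A hedged sketch with hypothetical sizes (not WeightRoom's exact model):

```python
def max_concurrent_users(free_vram_gb, kv_full_ctx_gb, prealloc_fraction):
    # prealloc_fraction: 1.0 for llama.cpp-style full pre-allocation,
    # ~0.25 for paged KV at the average utilisation assumed above.
    return int(free_vram_gb // (kv_full_ctx_gb * prealloc_fraction))

# 20 GB free for KV, 10 GB per user at full context:
print(max_concurrent_users(20, 10, 1.0))   # 2 slots, llama.cpp-style
print(max_concurrent_users(20, 10, 0.25))  # 8 slots in the same VRAM, paged
```

More resident users means more tokens per decode pass over the same weight read, which is where the system-throughput gap comes from.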
Compare mode
Side-by-side cards with charts. Pick a budget, see what fits.
Llama 3.3 70B vs Qwen 3 32B on the same H100 (Q4, 32k ctx):
- Llama 70B: ~54 tok/s
- Qwen 32B: ~101 tok/s
Smaller model, ~2× the speed, same VRAM headroom.
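To first order that ratio falls out of bytes streamed per decode step. A sketch with illustrative numbers (the 0.56 bytes/param for Q4 and the KV-read sizes are my assumptions, not WeightRoom's constants):

```python
def dense_roofline_tps(params_b, bytes_per_param, kv_read_gb, bandwidth_gb_s):
    # Bytes streamed per decode step: quantised weights plus the KV
    # cache read at the current context length.
    return bandwidth_gb_s / (params_b * bytes_per_param + kv_read_gb)

# 32B vs 70B at Q4-ish widths on the same hypothetical 3000 GB/s GPU:
small = dense_roofline_tps(32, 0.56, 5, 3000)
large = dense_roofline_tps(70, 0.56, 10, 3000)
print(small > large)  # True
```

The KV-read term is why the observed speedup lands a bit under the raw 70/32 parameter ratio.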
HuggingFace import
Got a model not in the catalog? Paste any HuggingFace URL and it pulls config.json, then auto-detects the attention formula, MoE fields, and precision.
- 34 pre-configured models
- All major quants (Q1 to FP32) for both weights and KV cache
- HF Transformers + MLX repos supported
- GGUF-only repos: import the original Transformers repo, then pick the quant manually
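The auto-detection boils down to looking at which fields exist in config.json. A hypothetical sketch — the key names below are real config.json fields, but the coverage is illustrative, not WeightRoom's actual import code:

```python
def detect_formula(cfg: dict) -> str:
    # Parsed config.json as a plain dict.
    if "kv_lora_rank" in cfg:        # DeepSeek-style MLA latent
        return "mla"
    if cfg.get("sliding_window"):    # Gemma/Mistral window attention
        return "sliding_window"
    heads = cfg.get("num_attention_heads", 0)
    kv_heads = cfg.get("num_key_value_heads", heads)
    return "gqa" if kv_heads < heads else "mha"

def is_moe(cfg: dict) -> bool:
    # Different families name the expert count differently.
    return any(k in cfg for k in
               ("num_experts", "num_local_experts", "n_routed_experts"))
```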
The honest caveat
TPS is a roofline estimate: a bandwidth-bound theoretical maximum.
Real throughput is typically:
- 60–90% of estimate for dense models on a single GPU
- 40–60% for multi-GPU dense (tensor parallel)
- 20–40% for multi-GPU MoE (DeepSeek V3, Mixtral × N)
Not modelled: NVLink sync, MoE expert routing, prefill / attention compute, dequantisation overhead. The UI shows "theoretical maximum" right under each TPS number, and the "How calculations work" footer has the full Limitations section.
Use the numbers for sizing decisions ("does this fit?", "is config A faster than config B?"), not as a substitute for real benchmarks.
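If you want to fold those bands into your own sizing scripts, it's a one-liner. The function and band names here are mine; the percentages are the empirical rules of thumb above, not guarantees:

```python
def realistic_tps_band(roofline_tps, setup):
    # Map a roofline estimate to the empirical efficiency band.
    bands = {
        "dense_single_gpu": (0.60, 0.90),
        "dense_multi_gpu":  (0.40, 0.60),
        "moe_multi_gpu":    (0.20, 0.40),
    }
    lo, hi = bands[setup]
    return roofline_tps * lo, roofline_tps * hi

print(realistic_tps_band(100, "dense_single_gpu"))  # (60.0, 90.0)
```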
Try it / give feedback
→ https://smelukov.github.io/WeightRoom/
Just launched, so the formulas and defaults are still rough around the edges. If you spot a number that disagrees with what you actually measured, please tell me β I'd rather fix the calculator than have anyone quote a wrong figure.
- Reply here
- Open an issue: https://github.com/smelukov/WeightRoom/issues
- PRs welcome
Open source, MIT, all client-side React.
