Sergey Melyukov

WeightRoom β€” an LLM resource calculator

Hey LLM folks πŸ‘‹

Whether you're shipping with Ollama on a Mac, vLLM on 8× H100, or anything in between — model planning shouldn't be napkin math.

I built a free calculator: WeightRoom.

β†’ https://smelukov.github.io/WeightRoom/

No backend, no signup, no telemetry. Pick a model, pick a quant, pick a context window — get RAM, disk, and TPS estimates in real time. State serializes to the URL, so configs are shareable.


KV cache: 4 formulas, not 1

The thing most calculators get wrong.

WeightRoom ships 4 formulas matching real architectures:

  • Standard GQA β€” Llama, Qwen, Mistral
  • Sliding Window β€” Gemma 2/3 (most layers cap at the window size)
  • MLA β€” DeepSeek V3/R1 (joint K+V latent, ~10Γ— smaller cache than GQA)
  • Linear + Full β€” Qwen 3.5 (only sparse full-attention layers grow)

Pick the model and the right formula is applied automatically. Use the wrong formula on Gemma or DeepSeek and you overestimate KV cache by 5–10×.
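For reference, here is a sketch of the first two formulas in Python. Function and parameter names are illustrative, not WeightRoom's internals; the example numbers assume a Llama-3-70B-style config (80 layers, 8 KV heads, head dim 128) with an FP16 cache:

```python
def gqa_kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Standard GQA: keys + values for every layer at full context length."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

def sliding_window_kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                                  window, bytes_per_elem=2):
    """Gemma-style sliding window: each windowed layer caps its cache at the window size."""
    effective_len = min(ctx_len, window)
    return 2 * n_layers * n_kv_heads * head_dim * effective_len * bytes_per_elem

# Llama-3-70B-like config at 32k context, FP16 cache:
gib = gqa_kv_cache_bytes(80, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # → 10.0 GiB
```

A real model mixes layer types (Gemma interleaves windowed and global layers), so the production formula sums per-layer terms; the sketch shows the shape of the math.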


MoE: total β‰  active

Another classic mistake β€” treating MoE models like dense ones.

DeepSeek V3 = 671B params total, 37B active per token.

  • RAM/disk uses total params (all experts must fit in memory)
  • TPS uses active params (only those are streamed per decode step)

Naive math underestimates DeepSeek V3 TPS by ~18×. WeightRoom does both correctly.
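That ~18× comes straight from the bandwidth roofline: decode streams only the active parameters per token. A sketch (the 3.35 TB/s figure is the H100 SXM HBM spec; 1 byte/param is an illustrative FP8-ish assumption):

```python
def roofline_tps(bandwidth_bytes_s, params, bytes_per_param):
    """Decode is bandwidth-bound: one token per full streaming of the active weights."""
    return bandwidth_bytes_s / (params * bytes_per_param)

BW = 3.35e12                            # H100 SXM memory bandwidth, bytes/s
naive  = roofline_tps(BW, 671e9, 1.0)   # wrong: counts every expert
actual = roofline_tps(BW, 37e9, 1.0)    # right: only active params stream per token
print(round(actual / naive, 1))  # → 18.1
```

The ratio is just total/active params (671/37 ≈ 18.1), independent of the hardware or precision assumed.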


Inference engine matters as much as the GPU

Same model, same hardware, different engine β€” wildly different throughput.

Llama 3.3 70B on 1Γ— H100, 8 concurrent users:

  • llama.cpp (100% KV pre-alloc): 22 tok/s/user β†’ 172 tok/s system
  • vLLM (PagedAttention ~25%): 45 tok/s/user → 356 tok/s system

Different trade-offs. llama.cpp wins on simplicity and local UX, vLLM on aggregate throughput. Both are correct for their use case β€” pick the engine preset, the math adjusts.
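A sketch of why the pre-allocation strategy moves capacity so much. The numbers are illustrative assumptions, roughly Llama 70B Q4 on an 80 GB H100 at 32k context (~40 GiB left for KV after weights, ~10 GiB KV per user at full context):

```python
def max_concurrent_users(kv_budget_gib, kv_per_user_gib, prealloc_fraction):
    """prealloc_fraction=1.0 models llama.cpp-style full per-slot pre-allocation;
    ~0.25 models paged KV where average occupancy is a quarter of max context."""
    return int(kv_budget_gib // (kv_per_user_gib * prealloc_fraction))

budget_gib, per_user_gib = 40, 10
print(max_concurrent_users(budget_gib, per_user_gib, 1.00))  # full pre-alloc → 4 users
print(max_concurrent_users(budget_gib, per_user_gib, 0.25))  # paged ~25%    → 16 users
```

More concurrent slots means more tokens decoded per weight-streaming pass, which is where the aggregate-throughput gap comes from.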


Compare mode

Side-by-side cards with charts. Pick a budget, see what fits.

Llama 3.3 70B vs Qwen 3 32B on the same H100 (Q4, 32k ctx):

  • 🐒 Llama 70B: ~54 tok/s
  • πŸ‡ Qwen 32B: ~101 tok/s

Smaller model, ~2Γ— the speed, same VRAM headroom.


HuggingFace import

Got a model not in the catalog? Paste any HuggingFace URL — it pulls config.json and auto-detects the attention formula, MoE fields, and precision.

  • 34 pre-configured models
  • All major quants (Q1 β†’ FP32) for both weights and KV cache
  • HF Transformers + MLX repos supported
  • GGUF-only repos: import the original Transformers repo, then pick the quant manually
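The mapping from config.json to the calculator's inputs looks roughly like this (a sketch, not WeightRoom's code; the key names are the standard Hugging Face Transformers config schema, with `head_dim` falling back to `hidden_size / num_attention_heads` when absent):

```python
def kv_params_from_config(cfg: dict) -> dict:
    """Extract the fields a KV-cache formula needs from a Hugging Face config.json."""
    n_heads = cfg["num_attention_heads"]
    return {
        "n_layers": cfg["num_hidden_layers"],
        "n_kv_heads": cfg.get("num_key_value_heads", n_heads),  # plain MHA if absent
        "head_dim": cfg.get("head_dim", cfg["hidden_size"] // n_heads),
    }

# Llama-3-70B-style config excerpt:
cfg = {"num_hidden_layers": 80, "num_attention_heads": 64,
       "num_key_value_heads": 8, "hidden_size": 8192}
print(kv_params_from_config(cfg))  # → {'n_layers': 80, 'n_kv_heads': 8, 'head_dim': 128}
```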

The honest caveat

TPS is a roofline estimate — a bandwidth-bound model, i.e. the theoretical maximum.

Real throughput is typically:

  • 60–90% of estimate for dense models on a single GPU
  • 40–60% for multi-GPU dense (tensor parallel)
  • 20–40% for multi-GPU MoE (DeepSeek V3, Mixtral Γ— N)

Not modelled: NVLink sync, MoE expert routing, prefill / attention compute, dequantisation overhead. The UI shows "theoretical maximum" right under each TPS number, and the "How calculations work" footer has the full Limitations section.

Use the numbers for sizing decisions ("does this fit?", "is config A faster than config B?"), not as a substitute for real benchmarks.
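If you want a planning range rather than a single roofline number, you can apply the efficiency bands above directly (a sketch; the bands are the ones quoted in this section):

```python
EFFICIENCY = {                      # observed fraction of the theoretical roofline
    "dense_single_gpu": (0.60, 0.90),
    "dense_multi_gpu":  (0.40, 0.60),
    "moe_multi_gpu":    (0.20, 0.40),
}

def expected_tps_range(roofline_tps, setup):
    """Scale a roofline TPS estimate into a realistic low/high band."""
    lo, hi = EFFICIENCY[setup]
    return roofline_tps * lo, roofline_tps * hi

lo, hi = expected_tps_range(54, "dense_single_gpu")
print(f"{lo:.0f}-{hi:.0f} tok/s")  # → 32-49 tok/s
```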


Try it / give feedback

β†’ https://smelukov.github.io/WeightRoom/

Just launched, so the formulas and defaults are still rough around the edges. If you spot a number that disagrees with what you actually measured, please tell me β€” I'd rather fix the calculator than have anyone quote a wrong figure.

Open source, MIT, all client-side React.
