I added a new tool to the site: the Local AI VRAM Calculator & GPU Planner (Beta).
This came out of a problem I kept running into while experimenting with local models. It is surprisingly easy to end up with a setup that technically works, but does not actually match the kind of workloads you want to run.
This is mainly aimed at figuring out GPU and VRAM requirements for running local LLMs. Most advice in this space is broadly correct, but not very specific. Things like “get more VRAM” or “use NVIDIA” help at a high level, but they do not help much when you are weighing a specific GPU against a manual VRAM tier, or trading off quantization levels against larger context windows.
I wanted something that made those tradeoffs visible before committing to hardware.
What the Planner Does
The Local AI VRAM Calculator & GPU Planner (Beta) takes a few inputs: a GPU from the site snapshot or a manual VRAM tier, system RAM, quantization level, context length, and the primary workload.
From there it tries to give a practical read on whether the setup makes sense. That includes a rough fit score, GPU-specific notes when you pick a card from the snapshot, and a set of model recommendations based on the selected workload.
The part I focused on the most is the estimate breakdown. Instead of showing a single number, the planner separates the estimate into model weights, KV cache, runtime overhead, total VRAM, and storage.
That makes it easier to see what actually changes when you adjust something like context length or quantization. In a lot of cases, the bottleneck is not where you expect.
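To make that breakdown concrete, here is a back-of-the-envelope sketch of the kind of arithmetic involved. The formulas are common approximations, not the planner's actual implementation, and the model shape and overhead figure are illustrative assumptions:

```python
# Rough, illustrative VRAM breakdown for a single model on one GPU.
# These are common approximations, not the planner's exact formulas.

def estimate_vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bits=16, overhead_gb=1.5):
    gib = 1024 ** 3
    # Model weights: parameter count x quantized bits per weight.
    weights = params_b * 1e9 * bits_per_weight / 8 / gib
    # KV cache: 2 tensors (K and V) per layer, per cached token.
    kv = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bits / 8 / gib
    return {
        "weights_gb": round(weights, 2),
        "kv_cache_gb": round(kv, 2),
        "overhead_gb": overhead_gb,  # assumed flat runtime overhead
        "total_gb": round(weights + kv + overhead_gb, 2),
    }

# An 8B model with a Llama-3-like shape (assumed values),
# 4-bit weights, 16-bit KV cache, 8K context.
print(estimate_vram_gb(params_b=8, bits_per_weight=4, n_layers=32,
                       n_kv_heads=8, head_dim=128, context_len=8192))
```

Even this crude version shows the effect the planner surfaces: doubling the context length doubles the KV cache term while leaving the weights untouched, which is exactly the kind of shift that is easy to miss when you only look at one total number.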
The estimates are not meant to be exact. Some are configuration-based, others are heuristic. The tool tries to label that clearly so it is obvious how much confidence to put in each result.
The context selector is also capped by the model metadata currently loaded into the planner. In practice that means the available maximum is based on the current curated model snapshot, plus any public Hugging Face models you import into the tool.
How Much VRAM Do You Need for Local LLMs?
This is the question I kept coming back to, and it is harder to answer than it should be.
As a rough guideline:
- Smaller models (7B–8B) can often run on 8–12 GB of VRAM with quantization.
- 13B–14B models typically need around 12–16 GB.
- Larger models usually require 24 GB or more, or some form of offloading.
- Context length increases memory usage, sometimes more than expected.
- Runtime overhead and KV cache can add a meaningful amount on top of raw model size.
These are not strict rules, but they are useful for avoiding obviously bad configurations.
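The tier boundaries above mostly fall out of simple arithmetic on weight size. A minimal sketch, using nominal parameter counts and counting weight memory only (KV cache and runtime overhead come on top):

```python
# Weight-only memory footprint (GiB) at common quantization levels.
# Parameter counts are nominal; real models vary slightly.
GIB = 1024 ** 3

def weights_gib(params, bits):
    # Total bits for all weights, converted to GiB.
    return params * bits / 8 / GIB

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    row = {f"{bits}-bit": round(weights_gib(params, bits), 1)
           for bits in (16, 8, 4)}
    print(name, row)
```

A 13B model at 4-bit lands around 6 GiB of weights, which is why 12–16 GB cards work for that tier once KV cache and overhead are added, and a 70B model stays above 32 GiB of weights even at 4-bit, which is why it pushes past a single 24 GB card without offloading.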
The planner is meant to make these tradeoffs visible. Instead of guessing whether a model will fit, you can see how each component contributes to total VRAM usage.
Why This Is Single GPU Only
I originally added a multi-GPU option, then removed it.
In practice, two GPUs do not behave like one larger pool of VRAM. Some runtimes can split work across devices, but many workflows still depend on the model fitting mostly on a single card. Performance also depends on details that are hard to generalize, like backend support and interconnect behavior.
Given that, a single-GPU estimate felt more honest. If a setup does not make sense on one card, the tool should not imply that adding another card will automatically fix it.
Where This Fits
I wrote previously about using Tailscale to access private LLMs, which focuses on the networking side of running local models.
This tool is more about the step before that: deciding what kind of hardware and model setup is actually reasonable.
In practice, both pieces are part of the same system. Running local LLMs ends up touching hardware, storage, networking, and a few operational decisions that are easy to overlook at the start.
Try It
If you are trying to figure out whether your GPU can run a specific LLM, you can try it here: Local AI VRAM Calculator & GPU Planner (Beta).
It is not a benchmark or a guarantee of how every runtime will behave. It is closer to a planning tool: something that makes the constraints visible and helps avoid obviously bad decisions within the GPU snapshot and model data the site currently ships.
I will keep updating the underlying data as I test more setups.