karan singh

A Free Tool to Check VRAM Requirements for Any HuggingFace Model

TL;DR: I got tired of guessing whether models would fit on my GPU. So I built vramio — a free API that tells you exactly how much VRAM any HuggingFace model needs. One curl command. Instant answer.

VRAMIO in Action


The Problem Every ML Engineer Knows

You're browsing HuggingFace. You find a model that looks perfect for your project. Then the questions start:

  • "Will this fit on my 24GB RTX 4090?"
  • "Do I need to quantize it?"
  • "What's the actual memory footprint?"

And the answers? They're nowhere.

Some model cards mention it. Most don't. You could download the model and find out the hard way. Or dig through config files, count parameters, multiply by bytes per dtype, add overhead for KV cache...

I've done this calculation dozens of times and even blogged about it here: Calculate vRAM for LLM. It's tedious, and it shouldn't be.

The Solution: One API Call

curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"

That's it. You get back:

{
  "model": "mistralai/Mistral-7B-v0.1",
  "total_parameters": "7.24B",
  "memory_required": "13.49 GB",
  "recommended_vram": "16.19 GB",
  "other_precisions": {
    "fp32": "26.99 GB",
    "fp16": "13.49 GB",
    "int8": "6.75 GB",
    "int4": "3.37 GB"
  }
}

recommended_vram includes the standard 20% overhead for activations and the KV cache during inference (for Mistral-7B, that's 13.49 GB × 1.2 ≈ 16.19 GB). This is the number you actually need.
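If you'd rather call it from code than from the command line, the same query is a few lines of Python. This is a minimal sketch, not an official client; the response fields are just the ones shown above.

```python
import httpx

# Ask vramio for the VRAM estimate of any HuggingFace model ID
resp = httpx.get(
    "https://vramio.ksingh.in/model",
    params={"hf_id": "mistralai/Mistral-7B-v0.1"},
    timeout=30,
)
resp.raise_for_status()
info = resp.json()
print(info["recommended_vram"])  # e.g. "16.19 GB"
```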

How It Works

No magic. No downloads. Just math.

  1. Fetch safetensors metadata from HuggingFace (just the headers, ~50KB)
  2. Parse tensor shapes and data types
  3. Calculate: parameters × bytes_per_dtype
  4. Add 20% for inference overhead

The entire thing is 160 lines of Python with a single dependency (httpx).
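For the curious, here's a rough sketch of that flow. It is not the actual vramio source: it assumes a single, unsharded model.safetensors at the repo root (sharded checkpoints list their files in model.safetensors.index.json, so you'd repeat the header read per shard), and the dtype table only covers the common cases.

```python
import json
import struct

import httpx

# Approximate byte widths for common safetensors dtype tags (assumption: not exhaustive)
BYTES_PER_DTYPE = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I8": 1, "U8": 1}


def estimate_vram(hf_id: str, overhead: float = 0.20) -> dict:
    # Assumes a single, unsharded model.safetensors; gated repos would also need an auth token
    url = f"https://huggingface.co/{hf_id}/resolve/main/model.safetensors"
    with httpx.Client(follow_redirects=True) as client:
        # The first 8 bytes are a little-endian uint64 holding the JSON header length
        first8 = client.get(url, headers={"Range": "bytes=0-7"}).content
        header_len = struct.unpack("<Q", first8)[0]
        # Fetch only the header (tensor names, dtypes, shapes), never the weights themselves
        header = json.loads(
            client.get(url, headers={"Range": f"bytes=8-{8 + header_len - 1}"}).content
        )

    total_params, total_bytes = 0, 0
    for name, tensor in header.items():
        if name == "__metadata__":
            continue
        params = 1
        for dim in tensor["shape"]:
            params *= dim
        total_params += params
        total_bytes += params * BYTES_PER_DTYPE.get(tensor["dtype"], 2)

    gib = total_bytes / 1024**3
    return {
        "total_parameters": f"{total_params / 1e9:.2f}B",
        "memory_required": f"{gib:.2f} GB",
        "recommended_vram": f"{gib * (1 + overhead):.2f} GB",  # +20% for activations / KV cache
    }
```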

Why I Built This

I run models locally. A lot. Every time I wanted to try something new, I'd waste 10 minutes figuring out if it would even fit.

I wanted something dead simple:

  • No signup
  • No rate limits
  • No bloated web UI
  • Just an API endpoint

So I built it over a weekend and deployed it for free on Render.

Try It

Live API: https://vramio.ksingh.in/model?hf_id=YOUR_MODEL_ID

Examples:

# Llama 2 7B
curl "https://vramio.ksingh.in/model?hf_id=meta-llama/Llama-2-7b"

# Phi-2
curl "https://vramio.ksingh.in/model?hf_id=microsoft/phi-2"

# Mistral 7B
curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"

Self-Host It

It's open source. Run your own:

git clone https://github.com/ksingh-scogo/vramio.git
cd vramio
pip install "httpx[http2]"
python server_embedded.py

What's Next

This solves my immediate problem. If people find it useful, I might add:

  • Batch queries for multiple models
  • Training memory estimates (not just inference)
  • Browser extension for HuggingFace

But honestly? The current version does exactly what I needed. Sometimes simple is enough.


GitHub: https://github.com/ksingh-scogo/vramio

Built with help from hf-mem by @alvarobartt.


If this saved you time, consider starring the repo. And if you have ideas for improvements, open an issue — I'd love to hear them.
