karan singh

A Free Tool to Check VRAM Requirements for Any HuggingFace Model

TL;DR: I got tired of guessing whether models would fit on my GPU. So I built vramio — a free API that tells you exactly how much VRAM any HuggingFace model needs. One curl command. Instant answer.

VRAMIO in Action


The Problem Every ML Engineer Knows

You're browsing HuggingFace. You find a model that looks perfect for your project. Then the questions start:

  • "Will this fit on my 24GB RTX 4090?"
  • "Do I need to quantize it?"
  • "What's the actual memory footprint?"

And the answers? They're nowhere.

Some model cards mention it. Most don't. You could download the model and find out the hard way. Or dig through config files, count parameters, multiply by bytes per dtype, add overhead for KV cache...

I've done this calculation dozens of times and even blogged about it here: Calculate vRAM for LLM. It's tedious, and it shouldn't be.

The Solution: One API Call

curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"

That's it. You get back:

{
  "model": "mistralai/Mistral-7B-v0.1",
  "total_parameters": "7.24B",
  "memory_required": "13.49 GB",
  "recommended_vram": "16.19 GB",
  "other_precisions": {
    "fp32": "26.99 GB",
    "fp16": "13.49 GB",
    "int8": "6.75 GB",
    "int4": "3.37 GB"
  }
}

recommended_vram includes the standard 20% overhead for activations and the KV cache during inference (for Mistral-7B, that's 13.49 GB × 1.2 ≈ 16.19 GB). This is the number you actually need.
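If you'd rather call it from code than from the command line, the same query is a few lines of Python. This is a minimal sketch, not an official client; the response fields are just the ones shown above.

```python
import httpx

# Ask vramio for the VRAM estimate of any HuggingFace model ID
resp = httpx.get(
    "https://vramio.ksingh.in/model",
    params={"hf_id": "mistralai/Mistral-7B-v0.1"},
    timeout=30,
)
resp.raise_for_status()
info = resp.json()
print(info["recommended_vram"])  # e.g. "16.19 GB"
```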

How It Works

No magic. No downloads. Just math.

  1. Fetch safetensors metadata from HuggingFace (just the headers, ~50KB)
  2. Parse tensor shapes and data types
  3. Calculate: parameters × bytes_per_dtype
  4. Add 20% for inference overhead

The entire thing is 160 lines of Python with a single dependency (httpx).
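For the curious, here's a rough sketch of that flow. It is not the actual vramio source: it assumes a single, unsharded model.safetensors at the repo root (sharded checkpoints list their files in model.safetensors.index.json, so you'd repeat the header read per shard), and the dtype table only covers the common cases.

```python
import json
import struct

import httpx

# Approximate byte widths for common safetensors dtype tags (assumption: not exhaustive)
BYTES_PER_DTYPE = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I8": 1, "U8": 1}


def estimate_vram(hf_id: str, overhead: float = 0.20) -> dict:
    # Assumes a single, unsharded model.safetensors; gated repos would also need an auth token
    url = f"https://huggingface.co/{hf_id}/resolve/main/model.safetensors"
    with httpx.Client(follow_redirects=True) as client:
        # The first 8 bytes are a little-endian uint64 holding the JSON header length
        first8 = client.get(url, headers={"Range": "bytes=0-7"}).content
        header_len = struct.unpack("<Q", first8)[0]
        # Fetch only the header (tensor names, dtypes, shapes), never the weights themselves
        header = json.loads(
            client.get(url, headers={"Range": f"bytes=8-{8 + header_len - 1}"}).content
        )

    total_params, total_bytes = 0, 0
    for name, tensor in header.items():
        if name == "__metadata__":
            continue
        params = 1
        for dim in tensor["shape"]:
            params *= dim
        total_params += params
        total_bytes += params * BYTES_PER_DTYPE.get(tensor["dtype"], 2)

    gib = total_bytes / 1024**3
    return {
        "total_parameters": f"{total_params / 1e9:.2f}B",
        "memory_required": f"{gib:.2f} GB",
        "recommended_vram": f"{gib * (1 + overhead):.2f} GB",  # +20% for activations / KV cache
    }
```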

Why I Built This

I run models locally. A lot. Every time I wanted to try something new, I'd waste 10 minutes figuring out if it would even fit.

I wanted something dead simple:

  • No signup
  • No rate limits
  • No bloated web UI
  • Just an API endpoint

So I built it over a weekend and deployed it for free on Render.

Try It

Live API: https://vramio.ksingh.in/model?hf_id=YOUR_MODEL_ID

Examples:

# Llama 2 7B
curl "https://vramio.ksingh.in/model?hf_id=meta-llama/Llama-2-7b"

# Phi-2
curl "https://vramio.ksingh.in/model?hf_id=microsoft/phi-2"

# Mistral 7B
curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"

Self-Host It

It's open source. Run your own:

git clone https://github.com/ksingh-scogo/vramio.git
cd vramio
pip install "httpx[http2]"
python server_embedded.py

What's Next

This solves my immediate problem. If people find it useful, I might add:

  • Batch queries for multiple models
  • Training memory estimates (not just inference)
  • Browser extension for HuggingFace

But honestly? The current version does exactly what I needed. Sometimes simple is enough.


GitHub: https://github.com/ksingh-scogo/vramio

Built with help from hf-mem by @alvarobartt.


If this saved you time, consider starring the repo. And if you have ideas for improvements, open an issue — I'd love to hear them.
