TL;DR: I got tired of guessing whether models would fit on my GPU. So I built vramio — a free API that tells you exactly how much VRAM any HuggingFace model needs. One curl command. Instant answer.
The Problem Every ML Engineer Knows
You're browsing HuggingFace. You find a model that looks perfect for your project. Then the questions start:
- "Will this fit on my 24GB RTX 4090?"
- "Do I need to quantize it?"
- "What's the actual memory footprint?"
And the answers? They're nowhere.
Some model cards mention it. Most don't. You could download the model and find out the hard way. Or dig through config files, count parameters, multiply by bytes per dtype, add overhead for KV cache...
I've done this calculation dozens of times and even wrote it up in an earlier post, Calculate vRAM for LLM. It's tedious. It shouldn't be.
The Solution: One API Call
curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"
That's it. You get back:
{
  "model": "mistralai/Mistral-7B-v0.1",
  "total_parameters": "7.24B",
  "memory_required": "13.49 GB",
  "recommended_vram": "16.19 GB",
  "other_precisions": {
    "fp32": "26.99 GB",
    "fp16": "13.49 GB",
    "int8": "6.75 GB",
    "int4": "3.37 GB"
  }
}
recommended_vram includes the standard 20% overhead for activations and KV cache during inference. This is what you actually need.
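A quick sanity check on those numbers: 7.24B parameters × 2 bytes per parameter (fp16) is roughly 14.5 billion bytes, i.e. 13.49 GiB, and 13.49 × 1.2 ≈ 16.19 for the recommended figure. The other precisions are the same parameter count at 4, 1, and 0.5 bytes each.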
How It Works
No magic. No downloads. Just math.
- Fetch safetensors metadata from HuggingFace (just the headers, ~50KB)
- Parse tensor shapes and data types
- Calculate parameters × bytes_per_dtype
- Add 20% for inference overhead
The entire thing is 160 lines of Python with a single dependency (httpx).
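If you want the gist in code, here's a minimal sketch of the same idea (not vramio's actual source): pull the safetensors header with an HTTP range request, sum up tensor sizes, and add the 20% headroom. It assumes a public, single-file model.safetensors at the standard resolve URL; sharded checkpoints would need their index file handled as well.

import json
import struct

import httpx

# Bytes per element for common safetensors dtypes (unknown dtypes default to 2 below).
DTYPE_BYTES = {"F64": 8, "F32": 4, "BF16": 2, "F16": 2, "I64": 8, "I32": 4, "I8": 1, "U8": 1}

def read_safetensors_header(url: str) -> dict:
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header mapping tensor names to dtype/shape/offsets.
    with httpx.Client(follow_redirects=True) as client:
        first = client.get(url, headers={"Range": "bytes=0-7"})
        header_len = struct.unpack("<Q", first.content[:8])[0]
        rest = client.get(url, headers={"Range": f"bytes=8-{7 + header_len}"})
        return json.loads(rest.content)

def estimate_vram(hf_id: str, filename: str = "model.safetensors") -> dict:
    # Assumes an ungated repo with a single weights file; adjust filename otherwise.
    url = f"https://huggingface.co/{hf_id}/resolve/main/{filename}"
    total_params = 0
    total_bytes = 0
    for name, tensor in read_safetensors_header(url).items():
        if name == "__metadata__":  # optional metadata block, not a tensor
            continue
        count = 1
        for dim in tensor["shape"]:
            count *= dim
        total_params += count
        total_bytes += count * DTYPE_BYTES.get(tensor["dtype"], 2)
    gib = total_bytes / 1024**3
    return {
        "parameters": total_params,
        "memory_gib": round(gib, 2),
        "recommended_gib": round(gib * 1.2, 2),  # +20% for activations / KV cache
    }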
Why I Built This
I run models locally. A lot. Every time I wanted to try something new, I'd waste 10 minutes figuring out if it would even fit.
I wanted something dead simple:
- No signup
- No rate limits
- No bloated web UI
- Just an API endpoint
So I built it over a weekend and deployed it for free on Render.
Try It
Live API: https://vramio.ksingh.in/model?hf_id=YOUR_MODEL_ID
Examples:
# Llama 2 7B
curl "https://vramio.ksingh.in/model?hf_id=meta-llama/Llama-2-7b"
# Phi-2
curl "https://vramio.ksingh.in/model?hf_id=microsoft/phi-2"
# Mistral 7B
curl "https://vramio.ksingh.in/model?hf_id=mistralai/Mistral-7B-v0.1"
Self-Host It
It's open source. Run your own:
git clone https://github.com/ksingh-scogo/vramio.git
cd vramio
pip install "httpx[http2]"
python server_embedded.py
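Once it's up, query it the same way you'd query the hosted version. The port below is a guess; check server_embedded.py for the one it actually binds:

curl "http://localhost:8000/model?hf_id=mistralai/Mistral-7B-v0.1"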
What's Next
This solves my immediate problem. If people find it useful, I might add:
- Batch queries for multiple models
- Training memory estimates (not just inference)
- Browser extension for HuggingFace
But honestly? The current version does exactly what I needed. Sometimes simple is enough.
GitHub: https://github.com/ksingh-scogo/vramio
Built with help from hf-mem by @alvarobartt.
If this saved you time, consider starring the repo. And if you have ideas for improvements, open an issue — I'd love to hear them.
