Learn AI Resource

Posted on Jun 7

Running Local LLMs Without Burning Out Your GPU

#ai #llm #gpu #optimization

So you want to play with LLMs locally but your RTX 4090 sounds like a jet engine and your electricity bill just became a mortgage payment. Yeah, I've been there.

The good news? You don't need a monster GPU to actually use language models. You just need to be smart about it.

Start With Quantization

First thing: stop downloading the full 70B parameter model. That's like buying a full-size truck when you just need to haul groceries once a month.

Quantized GGUF models are your friend. Quantization is basically model compression that trades a little quality for big speed and memory savings. Compared with 16-bit weights, 8-bit quantization cuts the memory for the weights roughly in half, and 4-bit cuts it to roughly a quarter.
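To see where the savings come from, here's a back-of-the-envelope sketch in Python. It counts only the weights (parameters times bits per weight) and ignores the KV cache and runtime overhead, so real usage will be higher. The function name and the 7B parameter count are illustrative assumptions, and real 4-bit GGUF formats store slightly more than 4 bits per weight.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 7e9 parameters is an assumed round number for a "7B" model.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

The same function gives about 140 GB for a 70B model at 16 bits, which is why the full-size download is a non-starter on a consumer card.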

Use Ollama (no fancy setup required):

ollama pull mistral:7b-q4
ollama run mistral:7b-q4

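Done. It handles all the boring stuff and runs smoothly on even older hardware. Tag names differ by model, so if that one isn't found, look up the exact quantized tag (for example a q4_0 or q4_K_M variant) on the model's page in the Ollama library.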

GPU Layers: The Middle Ground

Quantization is great, but if you have a GPU, why not use it? Runtimes built on llama.cpp, Ollama included, can offload some of a model's layers to VRAM while keeping the rest in system RAM.

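With Ollama, the layer count is the num_gpu parameter. Set it from inside an interactive session:

ollama run mistral:7b-q4
>>> /set parameter num_gpu 10

To make it stick, add PARAMETER num_gpu 10 to a Modelfile, or pass num_gpu in the options of an API request.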

Adjust the number up or down based on what your GPU can handle. This is genuinely faster than pure CPU while keeping power consumption reasonable.
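If you'd rather not guess a starting value, one rough approach is to divide the model file size by its layer count and see how many layers fit in the VRAM you're willing to spend. This is a sketch under loud assumptions, not how Ollama itself decides: it treats every layer as the same size and sets aside a fixed reserve for the KV cache and buffers. The 4.1 GB file size and the 1.5 GB reserve are placeholder values; 32 is the transformer layer count of Mistral 7B.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_budget_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in a VRAM budget.

    Assumes every layer is the same size and sets aside reserve_gb
    for the KV cache and scratch buffers (both are simplifications).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_budget_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical example: a ~4.1 GB 4-bit file with 32 layers.
for budget in (2, 4, 8):
    print(budget, "GB budget ->", layers_that_fit(4.1, 32, budget), "layers")
```

Treat the result as a first guess, then nudge it while watching actual VRAM usage.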

Better: Use Smaller Models Trained For Your Task

Here's the secret most people miss: you don't need a huge model for most tasks.

Mistral 7B beats the older Llama 2 13B on most benchmarks
Phi-3 is shockingly good for its size
TinyLlama is ridiculous for local coding tasks

If you're building a chatbot or API service, a 7B fine-tuned for your task will often beat a general-purpose 70B that was never tuned for it.

Batch Your Inference

Running one prompt at a time? That's leaving performance on the table.

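from concurrent.futures import ThreadPoolExecutor

from ollama import Client

client = Client()

prompts = [
    "Explain webhooks to a 5-year-old",
    "Write a Python function to validate emails",
    "Best practices for error handling in Node.js"
]

def generate(prompt):
    # Each call is one HTTP request to the local Ollama server.
    response = client.generate(model='mistral:7b-q4', prompt=prompt)
    return response['response']

# Send the prompts concurrently instead of waiting for each one in turn.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for text in pool.map(generate, prompts):
        print(text)

Overlapping requests increase GPU utilization, and idle GPU time is wasted money. How many requests Ollama actually runs at once is capped by the OLLAMA_NUM_PARALLEL setting on the server; anything above that waits in a queue.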

The Real Win: Context Caching

If you're reusing the same system prompt or documents across multiple queries, cache it. Ollama handles this automatically for repeated contexts, as long as the model stays loaded and the repeated text sits at the start of the prompt.

Before: System prompt processed every request
After: System prompt processed once, reused for the session

This is honestly the biggest performance boost that nobody talks about.
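Prefix reuse only helps when the repeated text is identical and comes first, so prompt layout matters. Here's a small sketch of that layout in plain Python. The helper names, the "Acme API" system prompt and the questions are made up for illustration, and no Ollama call is made here.

```python
import os

# Hypothetical system prompt shared by every request.
SYSTEM_PROMPT = "You are a support assistant for the Acme API. Answer briefly."


def build_prompt(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Put the stable text first and the per-request text last."""
    return f"{system_prompt}\n\nQuestion: {question}\nAnswer:"


def shared_prefix_chars(prompts: list[str]) -> int:
    """Length of the leading text that all prompts have in common."""
    return len(os.path.commonprefix(prompts))


prompts = [build_prompt(q) for q in ("How do I rotate a key?", "What is the rate limit?")]
print(shared_prefix_chars(prompts), "characters are identical across requests")
```

If the question came before the system prompt, the shared prefix would drop to zero and there would be nothing to reuse.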

Monitor What You're Actually Using

nvidia-smi --query-gpu=memory.used,memory.total --format=csv,nounits

Run this periodically. You might find you're using way less VRAM than you think, which means you can optimize further.
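If you want those numbers in a script rather than a terminal, a minimal Python wrapper might look like the sketch below. The function names are my own, it adds noheader to the format so only the numbers come back, and the sample line is a made-up value for a 12 GB card rather than a real measurement.

```python
import subprocess


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


# Made-up sample output for a single 12 GB card, not a real measurement.
sample = "8123, 12288\n"
for used, total in parse_memory_csv(sample):
    print(f"{used} / {total} MiB ({used / total:.0%} used)")
```

On a machine with an NVIDIA driver installed, swap the sample for query_gpu_memory() and call it on a timer.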

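Pro tip: Set power limits with nvidia-smi if you're worried about electricity. The allowed range depends on the card (nvidia-smi -q -d POWER shows it), so pick a value inside it: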

sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 200  # 200W limit

Real Numbers

I run Mistral 7B quantized with 10 GPU layers on an RTX 3060 (12GB VRAM):

VRAM used: ~8GB
Generation speed: 50-100 tokens/sec
Power draw: ~100W
Temperature: 65-70°C

That's usable. Not blazing fast, but genuinely practical for APIs, local chat, and automation.
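For a sense of scale, here's what those figures mean per response. It's plain arithmetic on the numbers above, treating power draw as constant; the 500-token response length is an assumed example, not something I measured.

```python
def response_cost(tokens: int, tokens_per_sec: float, watts: float) -> tuple[float, float]:
    """Return (seconds, watt_hours) to generate `tokens` at a steady rate."""
    seconds = tokens / tokens_per_sec
    watt_hours = watts * seconds / 3600
    return seconds, watt_hours


# 500 tokens is an assumed response length; 50-100 tok/s and ~100 W come from the list above.
for rate in (50, 100):
    seconds, wh = response_cost(500, rate, 100)
    print(f"{rate} tok/s: {seconds:.0f} s, {wh:.2f} Wh")
```

Either way it's a fraction of a watt-hour per answer, which is the whole point of staying small.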

The Boring But Important Part

Your setup matters less than your workflow. Before you get a bigger GPU:

Try a smaller, quantized model
Batch your requests
Use cached context when possible
Check if you actually need GPU acceleration

Spoiler: 80% of use cases don't.

Want more on this stuff? Subscribe to LearnAI Weekly for practical AI engineering tips without the hype. We cover real deployment challenges, model selection, and actually useful tools.

Curious what you're building? Drop it in the comments—I actually read them.
