GGUF & Modelfile: The Power User's Guide to Local LLMs
Beyond
ollama pull— download any model from Hugging Face, quantize it, customize it, and import it into Ollama.
What's GGUF?
GGUF (GPT-Generated Unified Format) is the standard file format for running LLMs locally. Think of it as the .mp3 of AI models:
- Compressed — 70-85% smaller than the original float16 weights
- Fast — optimized for CPU and GPU inference
- Portable — one file contains the entire model
- Metadata-rich — includes tokenizer, chat template, and model config
Every ollama pull downloads a GGUF file under the hood. But the real power move is downloading GGUF files directly from Hugging Face and importing them yourself.
Quantization Analogy (Steal This)
Quantization is like JPEG compression for AI models. A RAW photo is 50MB. A JPEG of the same photo is 5MB — 90% smaller, but it still looks 95% as good. That's what Q4_K_M quantization does to a model: 70% smaller, 96% of the intelligence.
Step 1: Finding the Right GGUF File
The Golden Rule
Always look for Q4_K_M — it's the sweet spot of size vs quality for almost every model.
Where to Find GGUFs
| Source | URL | Best For |
|---|---|---|
| Official provider |
huggingface.co/Qwen etc. |
Trustworthy, but often only Q8/Q6 |
| Unsloth | huggingface.co/unsloth |
Best selection of quants (Q2-Q8) |
| Bartowski | huggingface.co/bartowski |
Massive library, every quantization |
| MaziyarPanahi | huggingface.co/MaziyarPanahi |
Merged models, niche architectures |
The GGUF Filename Decoder
Qwen2.5-14B-Q4_K_M.gguf
├── Model name ├── Size └── Quantization
| Quant Code | Compression | Quality | Use Case |
|---|---|---|---|
| Q8_0 | 50% | 99% | When you have VRAM to spare |
| Q6_K | 60% | 98% | High-quality, reasonable size |
| Q4_K_M | 70% | 96% | 🟢 Sweet spot — use this |
| Q3_K_M | 78% | 92% | When VRAM is tight |
| Q2_K | 85% | 85% | Emergency only — quality noticeably drops |
| IQ4_XS | 72% | 95% | Experimental import format |
Step 2: Download & Import a GGUF
Basic Import
# 1. Download Q4_K_M of Qwen 2.5-14B
wget https://huggingface.co/bartowski/Qwen2.5-14B-GGUF/resolve/main/Qwen2.5-14B-Q4_K_M.gguf
# 2. Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen2.5-14B-Q4_K_M.gguf
EOF
# 3. Import into Ollama
ollama create my-custom-model -f Modelfile
# 4. Run it
ollama run my-custom-model
Smart Import (with Optimized Settings)
cat > Modelfile << 'EOF'
FROM ./DeepSeek-R1-14B-Q4_K_M.gguf
# Performance tuning
PARAMETER num_ctx 32768
PARAMETER num_gpu_layers 999
PARAMETER num_thread 8
PARAMETER numa true
# Generation
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Chat template (CRITICAL — must match the model!)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
# System prompt
SYSTEM """You are a helpful AI assistant."""
EOF
ollama create my-r1-custom -f Modelfile
ollama run my-r1-custom
Step 3: Modelfile Reference
A Modelfile is like a Dockerfile for LLMs. Every line is an instruction.
Parameters Reference
| Parameter | What It Does | Default | Recommended Range |
|---|---|---|---|
temperature |
Creativity level | 0.8 | 0.2 (code) – 1.0 (creative) |
top_p |
Nucleus sampling | 0.9 | 0.85 – 0.95 |
top_k |
Top-K sampling | 40 | 20 – 100 |
num_ctx |
Context window size | 2048 | 4096 – 65536 |
num_gpu |
GPU layers | 0 (auto) | 999 (use all VRAM) |
num_thread |
CPU threads | auto | 4 – 16 |
repeat_penalty |
Penalize repetition | 1.1 | 1.0 – 1.2 |
stop |
Stop sequences | varies | `< |
INSTRUCTION vs SYSTEM vs TEMPLATE
{% raw %}
# SYSTEM: Persistent system prompt (like OpenAI's system message)
SYSTEM """You are a helpful assistant."""
# TEMPLATE: How user messages are formatted
TEMPLATE """User: {{ .Prompt }}
Assistant: """
# INSTRUCTION: Model-specific instruction format (rarely needed)
INSTRUCTION """Follow the user's instructions carefully."""
Three Production Configs
1. Coding Assistant
FROM qwen2.5:7b
PARAMETER temperature 0.2
PARAMETER top_p 0.85
PARAMETER num_ctx 65536
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert Python developer. Write clean, tested code."""
2. Creative Writer
FROM mistral
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER num_ctx 16384
SYSTEM """You are a novelist. Be vivid and descriptive."""
3. Customer Support
FROM llama4
PARAMETER temperature 0.5
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are a helpful customer support agent.
Be polite, concise, and solution-oriented.
NEVER mention that you are an AI."""
Step 4: Advanced Techniques
4.1 Multi-GPU Setup
FROM deepseek-r1:70b
# Distribute across 2 GPUs
PARAMETER num_gpu_layers 999
PARAMETER main_gpu 0
PARAMETER tensor_split "0.5,0.5"
4.2 LoRA Adapters (Experimental)
Some Ollama builds support LoRA adapters:
FROM base-model
ADAPTER ./my-finetune-lora.gguf
PARAMETER temperature 0.7
4.3 Custom Stop Tokens
DeepSeek-R1 and Qwen use different stop tokens:
# For Qwen
TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
# For DeepSeek
TEMPLATE """User: {{ .Prompt }}
Assistant: """
PARAMETER stop "User:"
4.4 Emergency: VRAM Too Low
If you get "CUDA out of memory":
# Force CPU for some layers
PARAMETER num_gpu_layers 24 # Only put 24 layers on GPU
PARAMETER num_thread 8 # Use 8 CPU threads for the rest
Step 5: GGUF from Ollama Models (Export)
You can also export a model from Ollama back to a GGUF file:
# Save a model as GGUF
ollama pull qwen2.5:7b
ollama export qwen2.5:7b ./my-export.gguf
# Now you can use it anywhere (llama.cpp, text-generation-webui, etc.)
./llama-cli -m ./my-export.gguf -p "Hello"
This is useful for:
- Moving models between machines without re-downloading
- Using the same model with multiple inference engines
- Sharing a specific quantization with teammates
Performance Cheat Sheet
By GPU
| GPU | VRAM | Best GGUF Model | Expected Speed |
|---|---|---|---|
| RTX 3060 / 4060 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 30-40 tok/s |
| RTX 4070 / 5070 | 12 GB | Qwen 2.5-14B (Q4_K_M) | 35-50 tok/s |
| RTX 4080 / 5080 | 16 GB | DeepSeek-R1-14B (Q4_K_M) | 30-45 tok/s |
| RTX 4090 / 5090 | 24 GB | DeepSeek-R1-32B (Q4_K_M) | 18-25 tok/s |
| Mac M2 Pro | 16 GB | Qwen 2.5-7B (Q4_K_M) | 15-25 tok/s |
| Mac M4 Max | 36 GB | Qwen 3.6-27B (Q4_K_M) | 20-30 tok/s |
CPU-Only Performance
| Model | Quant | RAM | Speed |
|---|---|---|---|
| Qwen 2.5-1.5B | Q4_K_M | 4 GB | 8-15 tok/s |
| Qwen 2.5-7B | Q4_K_M | 16 GB | 1-4 tok/s |
| Qwen 2.5-7B | Q2_K | 8 GB | 2-6 tok/s |
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| "Model not found" after import | Modelfile path is wrong | Use absolute path: FROM /home/user/model.gguf
|
| Gibberish output | Wrong chat template | The TEMPLATE line must match the model's expected format |
| Slow generation | Running on CPU | PARAMETER num_gpu_layers 999 |
| CUDA out of memory | Quantization too large for VRAM | Try smaller quant (Q3_K_M instead of Q4_K_M) |
| Import errors | Corrupt GGUF download | Re-download and verify checksum |
| Temperature not working | Set in Modelfile but overridden in API | Use the same temp in both places |
| Chinese text output | Wrong template or default system prompt | Add `PARAMETER stop "< |
The tl;dr
-
Download: {% raw %}
wget <huggingface-url>/Model-Q4_K_M.gguf -
Create Modelfile:
FROM ./Model.gguf+ your settings -
Import:
ollama create my-model -f Modelfile -
Run:
ollama run my-model - Profit: Free, private, local AI
Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.
Top comments (0)