Local LLM Setup Guide (v5)
Practical Installation & Optimization Guide for Developers
1. Overview & Prerequisites
Running LLMs locally is feasible on consumer hardware, but it requires careful attention to memory management and system configuration.
Hardware Requirements:
- GPU: NVIDIA RTX 30xx/40xx series recommended (8GB+ VRAM)
- CPU: Intel i5-12600K or AMD Ryzen 7 5800X
- RAM: Minimum 16GB, 32GB recommended
- Storage: 500GB+ SSD for model storage
OS: Ubuntu 22.04 LTS or Debian 12
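The RAM and storage figures above can be checked before anything is downloaded. What follows is a minimal sketch, assuming a Linux host as the guide does. The function names are illustrative, and treating the 500GB figure as free space is an assumption (the list only says an SSD of that size).

```python
import os
import shutil

GIB = 1024 ** 3

def total_ram_gib() -> float:
    """Total physical memory in GiB (POSIX sysconf, so Linux/macOS only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB

def free_disk_gib(path: str = ".") -> float:
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB

def check_resources(min_ram_gib: float = 16, min_disk_gib: float = 500,
                    path: str = ".") -> list:
    """Return human-readable warnings; an empty list means both checks passed."""
    warnings = []
    ram = total_ram_gib()
    if ram < min_ram_gib:
        warnings.append(f"RAM: {ram:.1f} GiB is below the {min_ram_gib} GiB minimum")
    disk = free_disk_gib(path)
    if disk < min_disk_gib:
        warnings.append(f"Disk: {disk:.1f} GiB free is below {min_disk_gib} GiB")
    return warnings

# Machines sold as "16GB" usually report slightly less, so treat warnings as hints.
for line in check_resources() or ["RAM and disk meet the guide's minimums"]:
    print(line)
```

Pass a different `path` (for example the directory where models will live) to check the right filesystem.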
Prerequisites Installation:
sudo apt update
sudo apt install -y git cmake build-essential python3-pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
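After the install commands finish, it can help to confirm that the tools are actually on PATH. This is a small sketch using only the standard library. The tool lists are an assumption derived from the packages installed above (`build-essential` provides `gcc` and `make`), and `nvidia-smi` is only present once the NVIDIA driver is installed.

```python
import shutil

REQUIRED_TOOLS = ["git", "cmake", "make", "gcc", "python3", "pip3"]
OPTIONAL_TOOLS = ["nvidia-smi"]  # only present when the NVIDIA driver is installed

def find_missing(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

print("missing required tools:", find_missing(REQUIRED_TOOLS) or "none")
print("missing optional tools:", find_missing(OPTIONAL_TOOLS) or "none")
```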
2. Framework Comparison
| Framework | Pros | Cons | Use Case |
|---|---|---|---|
| llama.cpp | Native C++, no required dependencies, minimal overhead | GPU offload requires a GPU-enabled build (e.g. CUDA) and the -ngl flag | Lightweight inference |
| Ollama | Simple CLI, automatic model downloads | Less control over build and runtime options | Quick prototyping |
| vLLM | High throughput, optimized for serving | Complex setup, requires Python expertise | Production inference |
| LocalAI | API-first, supports multiple backends | Heavy, requires container setup | Enterprise deployments |
Recommendation: Use llama.cpp for development, Ollama for quick testing.
3. Step-by-Step Installation (llama.cpp)
# Clone and build llama.cpp
# Note: recent llama.cpp releases build with CMake and name the binary llama-cli instead of main
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make
# Download a model (example: Mistral 7B)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/
# Basic inference test
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" --temp 0.2
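A failed or redirected download can leave an HTML error page where the model should be, and the inference command then fails with a confusing message. One way to catch this early is to check the file header: GGUF files begin with the four magic bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below assumes that layout, and the function name is illustrative.

```python
import struct

def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version

# Example (path taken from the download step above):
# print(read_gguf_version("models/mistral-7b-v0.1.Q4_K_M.gguf"))
```

This only validates the header. It does not detect a truncated download, for which comparing the file size or a checksum against the hosting page is the safer check.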
4. Model Selection Guide
For Chat Applications:
- Mistral 7B Q4_K_M (balanced quality vs size)
- Llama 2 7B Q4_K_M (broad tooling and community support)
For Code Generation:
- CodeLlama 7B Q4_K_M (best for programming tasks)
- Phi-2 Q4_K_M (smaller, fast)
For Research/Development:
- Llama 3 8B Q4_K_M (latest architecture)
- Mixtral 8x7B Q4_K_M (sparse mixture of experts)
Example usage:
# Chat model
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
-p "You are a helpful AI assistant. User: What is 2+2? Assistant:" \
--temp 0.1 --repeat_penalty 1.1
# Code model
./main -m models/codellama-7b.Q4_K_M.gguf \
-p "def fibonacci(n):" \
--temp 0.0 --repeat_penalty 1.0
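The chat example above packs a system line, the user turn, and an `Assistant:` cue into one string. A small helper keeps that format consistent across turns. This is a sketch of the guide's plain-text format only; the function name and the `history` parameter are illustrative, and instruction-tuned model variants typically define their own chat template, which should be used instead when one exists.

```python
def build_chat_prompt(user_message, history=(), system="You are a helpful AI assistant."):
    """Build a prompt in the guide's 'User: ... Assistant:' format.

    `history` is a sequence of (user, assistant) pairs from earlier turns.
    """
    parts = [system]
    for past_user, past_assistant in history:
        parts.append(f"User: {past_user.strip()} Assistant: {past_assistant.strip()}")
    # End with a bare "Assistant:" so the model continues with its reply.
    parts.append(f"User: {user_message.strip()} Assistant:")
    return " ".join(parts)

print(build_chat_prompt("What is 2+2?"))
```

The returned string can be passed directly as the `-p` argument.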
5. Quantization Types Explained
| Quantization | Approx. size (7B model) | Quality | Use Case |
|---|---|---|---|
| Q4_K_M | 4.2GB | High | General use |
| Q5_K_M | 5.1GB | Very High | Research |
| Q6_K | 6.4GB | Excellent | Premium quality |
| Q8_0 | 8.2GB | Near-lossless | Maximum quality among quantized formats |
| F16 | 16GB | Full (unquantized) | Development/testing |
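The sizes in the table can be turned into a quick "will it fit" check against available VRAM. The sketch below uses the table's figures as given. The 1.5 GB overhead for the KV cache and runtime buffers is an assumed placeholder, not a measured value, and it grows with context size.

```python
# File sizes in GB, copied from the table above.
TABLE_SIZES_GB = {"Q4_K_M": 4.2, "Q5_K_M": 5.1, "Q6_K": 6.4, "Q8_0": 8.2, "F16": 16.0}

def fits_in_vram(size_gb, vram_gb, overhead_gb=1.5):
    """Rough check: model weights plus an assumed allowance for cache and buffers."""
    return size_gb + overhead_gb <= vram_gb

def largest_fitting(vram_gb, overhead_gb=1.5):
    """Name of the largest quantization that fits, or None if none do."""
    fitting = [(size, name) for name, size in TABLE_SIZES_GB.items()
               if fits_in_vram(size, vram_gb, overhead_gb)]
    return max(fitting)[1] if fitting else None

for vram in (8, 12, 24):
    print(f"{vram} GB VRAM -> {largest_fitting(vram)}")
```

Raising `overhead_gb` is a reasonable adjustment when running with a large `--ctx-size`.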
Benchmark comparison (1000 tokens, RTX 4090):
- Q4_K_M: 150 tokens/sec
- Q5_K_M: 130 tokens/sec
- Q8_0: 100 tokens/sec
Quantization command:
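# Step 1: convert the Hugging Face checkpoint to an F16 GGUF file
# (the convert script does not produce K-quants such as q4_k_m directly)
python3 convert-hf-to-gguf.py models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-f16.gguf
# Step 2: quantize the F16 file to Q4_K_M with the quantize tool built alongside main
./quantize mistral-7b-f16.gguf mistral-7b-q4k.gguf Q4_K_M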
6. API Setup and Integration
Simple HTTP API with Python:
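# server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)
DEFAULT_MODEL = 'models/mistral-7b-v0.1.Q4_K_M.gguf'

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt')
    if not isinstance(prompt, str) or not prompt:
        return jsonify({'error': 'missing "prompt"'}), 400
    # Accepting a model path from the client is only safe on a trusted network
    model_path = data.get('model', DEFAULT_MODEL)
    # Arguments are passed as a list (no shell), so the prompt cannot inject commands.
    # --n-predict bounds generation so a request cannot run indefinitely.
    result = subprocess.run([
        './main', '-m', model_path, '-p', prompt,
        '--temp', '0.2', '--repeat_penalty', '1.1',
        '--n-predict', '256'
    ], capture_output=True, text=True)
    if result.returncode != 0:
        return jsonify({'error': result.stderr.strip()}), 500
    # stdout contains the echoed prompt followed by the completion
    return jsonify({'response': result.stdout.strip()})

if __name__ == '__main__':
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep the API local
    app.run(host='0.0.0.0', port=8000)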
Integration with existing tools:
# Test API
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain quantum computing in simple terms:"}'
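The same request can be made from Python without extra dependencies. This is a minimal client sketch, assuming the Flask server above is running on localhost:8000 and returns a JSON object with a `response` key. The function names are illustrative.

```python
import json
import urllib.request

def build_request(prompt, url="http://localhost:8000/generate", model=None):
    """Build the POST request that the curl command above sends."""
    payload = {"prompt": prompt}
    if model is not None:
        payload["model"] = model  # optional override, as accepted by server.py
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, timeout=120, **kwargs):
    """Send the prompt and return the 'response' field of the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=timeout) as resp:
        return json.load(resp)["response"]

# Example (requires the server to be running):
# print(generate("Explain quantum computing in simple terms:"))
```

The generous default timeout reflects that each request loads the model from disk before generating.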
7. Systemd Service for 24/7 Operation
Create /etc/systemd/system/local-llm.service:
[Unit]
Description=Local LLM Service
After=network.target
[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/mistral-7b-v0.1.Q4_K_M.gguf --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10
Environment=LD_LIBRARY_PATH=/usr/local/cuda/lib64
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
sudo systemctl status local-llm
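`systemctl status` reports that the process started, not that it is ready to accept requests, and loading a multi-gigabyte model can take a while. A readiness check can poll the port until it accepts connections. The sketch below assumes the service listens on port 8080 as configured in the unit file. The function name and defaults are illustrative.

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=8080, timeout=60.0, interval=1.0):
    """Return True once a TCP connection succeeds, False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not ready yet.
            time.sleep(interval)
    return False

# Example:
# if not wait_for_port():
#     raise SystemExit("local-llm did not open port 8080 within 60 seconds")
```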
8. Monitoring and Performance Tuning
Memory monitoring script:
#!/bin/bash
# monitor.sh
while true; do
    echo "Timestamp: $(date)"
    echo "GPU Memory:"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    echo "System Memory:"
    free -h
    echo "---"
    sleep 30
done
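With `--format=csv,noheader,nounits`, the `nvidia-smi` query in the script prints one line per GPU such as `1234, 24564` (values in MiB). For alerting or logging, that output is easy to parse. This is a sketch with illustrative function names, assuming every GPU reports numeric values (some devices print `[N/A]` instead, which this does not handle).

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(output):
    """Parse lines like '1234, 24564' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

def percent_used(used_mib, total_mib):
    """GPU memory utilisation as a percentage."""
    return 100.0 * used_mib / total_mib

def gpu_memory():
    """Run nvidia-smi and return parsed memory figures (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)

# Example:
# for used, total in gpu_memory():
#     print(f"{percent_used(used, total):.0f}% of {total} MiB in use")
```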
Performance optimization flags:
# For high throughput (larger batch and context; actual tokens/sec depends on hardware and model)
./main -m model.gguf \
--threads 16 \
--ctx-size 4096 \
--batch-size 512 \
--temp 0.0 \
--repeat_penalty 1.0 \
--n-predict 1000
# For low latency (smaller context and batch to reduce time to first token)
./main -m model.gguf \
--threads 8 \
--ctx-size 1024 \
--batch-size 64 \
--temp 0.2 \
--repeat_penalty 1.1
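When both flag sets are used from scripts, keeping them in one place avoids drift between copies. The sketch below stores the two profiles exactly as listed above and turns them into an argument list suitable for `subprocess.run`. The profile names and the `build_command` helper are illustrative, not part of llama.cpp.

```python
# Flag values copied from the two command examples above.
PROFILES = {
    "throughput": {"threads": 16, "ctx-size": 4096, "batch-size": 512,
                   "temp": 0.0, "repeat_penalty": 1.0, "n-predict": 1000},
    "latency": {"threads": 8, "ctx-size": 1024, "batch-size": 64,
                "temp": 0.2, "repeat_penalty": 1.1},
}

def build_command(model, prompt, profile, binary="./main"):
    """Return the argv list for the chosen profile (no shell quoting needed)."""
    args = [binary, "-m", model, "-p", prompt]
    for flag, value in PROFILES[profile].items():
        args += [f"--{flag}", str(value)]
    return args

print(" ".join(build_command("model.gguf", "Hello", "latency")))
```

Passing the list directly to `subprocess.run` avoids shell quoting problems with prompts that contain quotes or newlines.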
9. Real Command Examples
Complete workflow example:
# 1. Setup directory
mkdir -p ~/llm-dev
cd ~/llm-dev
# 2. Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# 3. Download model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/
# 4. Test inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
-p "Write a bash script that checks disk space:" \
--temp 0.1 --repeat_penalty 1.0
# 5. Monitor performance
watch -n 1 nvidia-smi
Benchmark script:
#!/bin/bash
# benchmark.sh
MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
PROMPT="The quick brown fox jumps over the lazy dog. This is a test prompt for benchmarking."
echo "Starting benchmark..."
time ./main -m "$MODEL_PATH" -p "$PROMPT" --temp 0.0 --repeat_penalty 1.0 --n-predict 100
echo "Benchmark complete"
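The `time` output from the script gives seconds, not tokens per second. Converting is simple arithmetic: tokens generated divided by elapsed time. The sketch below wraps that calculation. It assumes the run produces the full `--n-predict` count, and because wall-clock time includes model loading and prompt processing, the result understates pure generation speed. The function names are illustrative.

```python
import subprocess
import time

def tokens_per_second(n_tokens, elapsed_seconds):
    """Throughput from a token count and a wall-clock duration."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_tokens / elapsed_seconds

def run_benchmark(command, n_predict):
    """Time one run of `command` and return tokens/sec for `n_predict` tokens."""
    start = time.perf_counter()
    subprocess.run(command, capture_output=True, text=True, check=True)
    return tokens_per_second(n_predict, time.perf_counter() - start)

# Example, mirroring benchmark.sh:
# cmd = ["./main", "-m", "models/mistral-7b-v0.1.Q4_K_M.gguf",
#        "-p", "The quick brown fox jumps over the lazy dog.",
#        "--temp", "0.0", "--repeat_penalty", "1.0", "--n-predict", "100"]
# print(f"{run_benchmark(cmd, 100):.1f} tokens/sec")
```

Running the benchmark several times and discarding the first run reduces the effect of cold disk caches.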
Quick deployment command:
# One-liner setup and test
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make && \
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/