Local LLM Setup Guide (v17)
Overview & Prerequisites
Running large language models locally requires understanding hardware constraints and software requirements. This guide assumes you're working with a modern Linux system (Ubuntu 20.04+ recommended) with at least 16GB RAM and an NVIDIA GPU with CUDA support (RTX 30xx or newer).
Hardware Requirements:
- CPU: 4+ cores (8+ recommended)
- RAM: 16GB+ (32GB+ for larger models)
- GPU: NVIDIA RTX 30xx or newer with CUDA support
- Storage: 50GB+ free space (quantized 7B models are roughly 4-8GB each; larger models such as Mixtral-8x7B need several times that)
Prerequisites Installation:
sudo apt update && sudo apt install -y git curl wget build-essential python3-pip
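Before moving on, it can help to confirm that the tools used later in this guide are actually on your PATH. The following is a minimal sketch; the `REQUIRED_TOOLS` list and the `missing_tools` helper are assumptions for illustration, based on the commands that appear in this guide.

```python
import shutil

# Commands the later steps call; this list is an assumption drawn from the guide's own commands.
REQUIRED_TOOLS = ["git", "curl", "wget", "make", "python3"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported missing, install it with apt before continuing.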
Framework Comparison
| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Native C++, extreme portability, minimal dependencies | No GUI, limited model support | Development, lightweight inference |
| Ollama | Simple CLI, automatic model management, optional Docker image | Less control over build and inference settings | Quick prototyping, testing |
| vLLM | Highest throughput, optimized for inference | Complex setup, requires Python | Production environments |
| LocalAI | Web API, model manager, multi-framework support | Heavy dependencies, complex config | API-first applications |
Recommendation: Use llama.cpp for development, Ollama for quick testing, and vLLM for production.
Step-by-Step Installation
1. Install llama.cpp
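# /opt is usually root-owned, so create the target directory and take ownership first
sudo mkdir -p /opt/llama.cpp && sudo chown "$USER" /opt/llama.cpp
git clone https://github.com/ggerganov/llama.cpp /opt/llama.cpp
cd /opt/llama.cpp
# A plain make gives a CPU-only build; GPU offload (-ngl) needs a CUDA-enabled build.
# Recent checkouts build with CMake and name the binaries llama-cli and llama-server.
make clean && make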
2. Download a Model
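# Create the models directory (under root-owned /opt) and make it writable for your user
sudo mkdir -p /opt/models && sudo chown "$USER" /opt/models
cd /opt/models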
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
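An interrupted or redirected download can leave a file that is not a valid model. One quick sanity check is to read the file header: GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below assumes that layout; `read_gguf_version` is a hypothetical helper name, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The version is a little-endian unsigned 32-bit integer right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("/opt/models/mistral-7b-v0.1.Q5_K_M.gguf")` should return a small integer; a `ValueError` suggests the download should be repeated.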
3. Test Basic Inference
cd /opt/llama.cpp
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Hello world" --temp 0.1
4. Setup Ollama (Alternative)
curl -fsSL https://ollama.com/install.sh | sh
ollama run mistral
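Once the Ollama daemon is running, it also serves a local REST API, by default on port 11434. The sketch below calls its `/api/generate` endpoint with streaming turned off. The helper names `build_generate_request` and `generate` are illustrative assumptions, and the model name `mistral` matches the command above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def build_generate_request(prompt, model="mistral"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral", timeout=120):
    """Send the prompt to a running Ollama daemon and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the daemon running and the model pulled, `generate("Hello world")` should return the completion as a string.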
Model Selection Guide
For Chat Applications: Mistral-7B-v0.1 or Phi-3-mini
For Code Generation: CodeLlama-7B or StarCoder2
For Research: Llama-3-8B or Mixtral-8x7B
For Memory-Limited Systems: TinyLlama or Phi-2
# Download recommended models
cd /opt/models
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
wget https://huggingface.co/TheBloke/Phi-3-mini-128k-instruct-GGUF/resolve/main/phi-3-mini-128k-instruct.Q5_K_M.gguf
Quantization Types Explained
Quantization stores weights at lower precision, which shrinks the model file and its memory footprint at some cost in output quality:
- Q4_K_M: 4-bit, smallest of these, good quality for most use cases
- Q5_K_M: 5-bit, balanced quality and size
- Q6_K: 6-bit, excellent quality, larger files
- Q8_0: 8-bit, minimal quality loss, largest files and highest memory use
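# Convert a Hugging Face model directory to an f16 GGUF, then quantize it
# (recent checkouts name these convert_hf_to_gguf.py and llama-quantize)
python3 convert-hf-to-gguf.py /path/to/model-dir --outtype f16 --outfile model-f16.gguf
./quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M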
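To get a feel for how the quantization levels listed above translate into disk and memory use, you can multiply the parameter count by the bits per weight. This is a rough sketch: the `NOMINAL_BITS` table uses the nominal bit widths, while real files also store per-block scale data, so treat the results as lower bounds rather than exact sizes.

```python
# Nominal bits per weight; real files carry extra per-block scale data and come out
# somewhat larger, so these are lower bounds (an assumption, not measured values).
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def estimate_size_gb(n_params, quant):
    """Lower-bound file size in GB (10^9 bytes) for a model with n_params weights."""
    return n_params * NOMINAL_BITS[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in NOMINAL_BITS:
        print(f"{quant}: at least {estimate_size_gb(7e9, quant):.1f} GB for a 7B model")
```

The same estimate is a reasonable starting point for how much GPU memory the weights alone will occupy when fully offloaded.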
API Setup and Integration
Simple HTTP Server with llama.cpp
# Run model as HTTP server (the server binary, not main, accepts --host and --port;
# recent builds name it llama-server)
./server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
Python Integration Example
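import requests


def call_local_llm(prompt, n_predict=100, timeout=120):
    """Send a prompt to the local llama.cpp server and return the generated text."""
    response = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=timeout,
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["content"]


# Usage
result = call_local_llm("Explain quantum computing in simple terms")
print(result)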
Systemd Service for 24/7 Operation
Create /etc/systemd/system/local-llm.service:
[Unit]
Description=Local LLM Service
After=network.target
[Service]
Type=simple
User=your_user
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
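Loading a multi-gigabyte model takes a while, so the service may report as started before it can answer requests. A client script can poll until the server is ready. The sketch below assumes the server build exposes a `GET /health` endpoint on port 8080 (recent llama.cpp server builds do; older ones may not), and the helper names are illustrative.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(url="http://localhost:8080/health", timeout=2):
    """Return True if the server answers the health URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503 while loading.
        return False


def wait_until_ready(probe=server_is_healthy, attempts=30, delay=2.0):
    """Call `probe` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the retry loop easy to test and lets you swap in a different readiness check if your build lacks the health endpoint.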
Monitoring and Performance Tuning
GPU Memory Monitoring
# Monitor GPU usage
watch -n 1 nvidia-smi
# Check memory usage of running process
nvidia-smi pmon -c 1
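For logging or alerting, the same information can be read from a script. `nvidia-smi` supports a machine-readable query mode via `--query-gpu` and `--format=csv,noheader,nounits`; the sketch below parses its output into integers (values in MiB). The function names are assumptions for illustration.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Keeping the parser separate from the subprocess call makes it usable on saved output as well.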
Performance Testing
# Benchmark model performance
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Benchmark test" --temp 0.1 --n-predict 100
Memory Optimization Flags
# For high memory systems (all layers on the GPU, large context)
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 33 --ctx-size 8192 --temp 0.1
# For low memory systems (fewer GPU layers, smaller context)
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 10 --ctx-size 2048 --temp 0.1
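The context size matters for memory because the key/value cache grows linearly with it. A rough estimate, assuming a 16-bit cache and the Mistral-7B-v0.1 architecture (32 layers, 8 key/value heads, head dimension 128), can be sketched as follows; the function name is illustrative and actual usage will include additional buffers.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate size of the key and value caches for a full context window."""
    # Factor 2 covers keys and values; each layer stores n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


# Mistral-7B-v0.1 architecture values (assumed here, not read from the model file).
MISTRAL_7B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for n_ctx in (2048, 8192):
        gib = kv_cache_bytes(n_ctx, **MISTRAL_7B) / 2**30
        print(f"context {n_ctx}: {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache takes about 0.25 GiB at 2048 tokens and 1 GiB at 8192 tokens, on top of the model weights.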
Real Command Examples
Complete Setup Script
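#!/bin/bash
# setup-local-llm.sh
set -euo pipefail
# Create target directories under root-owned /opt and make them writable
sudo mkdir -p /opt/llama.cpp /opt/models
sudo chown "$USER" /opt/llama.cpp /opt/models
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp /opt/llama.cpp
cd /opt/llama.cpp
make clean && make
# Download model (-c resumes a partial download)
cd /opt/models
wget -c https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
# Run benchmark (full path, because the binary lives in /opt/llama.cpp)
echo "Starting benchmark..."
/opt/llama.cpp/main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Test" --temp 0.1 --n-predict 50
echo "Setup complete. Run 'sudo systemctl start local-llm' to start service."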
API Integration with curl
# Basic API test
curl -X POST http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Write a python function to reverse a string", "n_predict": 100}'
# Streaming response
curl -X POST http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain neural networks", "n_predict": 100, "stream": true}'
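When `"stream": true` is set, the response arrives incrementally rather than as one JSON document. The sketch below assumes the server sends server-sent-event lines of the form `data: {...}`, where each JSON object carries a `content` fragment and a `stop` flag; check this against the output of the curl command above for your build. `iter_stream_content` is a hypothetical helper name.

```python
import json


def iter_stream_content(lines):
    """Yield the text fragment from each 'data: {...}' line of a streamed response."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other event fields
        chunk = json.loads(line[len("data:"):])
        yield chunk.get("content", "")
        if chunk.get("stop"):
            break  # the server marks the final chunk with "stop": true
```

With the `requests` library, the lines can come from `requests.post(url, json=payload, stream=True).iter_lines()`, and each yielded fragment can be printed as it arrives.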
Configuration Files
Default llama.cpp Settings
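The llama.cpp binaries take their settings from command-line flags and do not read this file themselves; it is a convenient place to keep defaults for a wrapper script. Create /opt/llama.cpp/config.json: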
{
"model_path": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
"port": 8080,
"host": "0.0.0.0",
"n_gpu_layers": 33,
"ctx_size": 8192,
"temp": 0.1,
"n_predict": 100
}
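A small wrapper can turn these settings into a server command line. The sketch below is one way to do it, under a few assumptions: the key names are the ones from the JSON above, the server binary path `/opt/llama.cpp/server` depends on your build, and only keys with a well-known flag are mapped. `temp` and `n_predict` are left out because they can be sent per request instead.

```python
import json

# Map config keys (from the JSON above) to llama.cpp command-line flags.
FLAG_FOR_KEY = {
    "model_path": "-m",
    "host": "--host",
    "port": "--port",
    "n_gpu_layers": "-ngl",
    "ctx_size": "--ctx-size",
}


def build_server_args(config, binary="/opt/llama.cpp/server"):
    """Return an argv list for launching the server with the settings in `config`."""
    args = [binary]
    for key, flag in FLAG_FOR_KEY.items():
        if key in config:
            args += [flag, str(config[key])]
    return args


def load_config(path="/opt/llama.cpp/config.json"):
    """Read the JSON settings file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The resulting list can be passed directly to `subprocess.run(build_server_args(load_config()))`, which avoids shell quoting issues.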
Environment Variables
# Add to ~/.bashrc
export LOCAL_LLM_MODEL="/opt/models/mistral-7b-v0.1.Q5_K_M.gguf"
export LOCAL_LLM_PORT="8080"
export LOCAL_LLM_NGL="33"
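These variables only take effect if your own scripts read them. A minimal sketch of doing so in Python, falling back to the values shown above when a variable is unset, might look like this (`load_settings` and the returned key names are illustrative):

```python
import os

DEFAULTS = {
    "LOCAL_LLM_MODEL": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
    "LOCAL_LLM_PORT": "8080",
    "LOCAL_LLM_NGL": "33",
}


def load_settings(env=None):
    """Read the LOCAL_LLM_* variables, falling back to the guide's defaults."""
    env = os.environ if env is None else env
    values = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "model": values["LOCAL_LLM_MODEL"],
        "port": int(values["LOCAL_LLM_PORT"]),
        "n_gpu_layers": int(values["LOCAL_LLM_NGL"]),
    }
```

Accepting the environment as a parameter makes the function easy to test without touching the real shell environment.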
Benchmark Results
Model: Mistral-7B Q5_K_M
Hardware: RTX 4090, 32GB RAM
Results:
- Context: 8192 tokens
- Response time: ~1.2s for 100 tokens
- GPU memory usage: ~12GB
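Response time is easier to compare across machines when expressed as throughput. The sketch below converts the figure above into tokens per second and shows one way to time your own runs; `time_generation` takes a hypothetical callable that performs a single generation of a known length.

```python
import time


def tokens_per_second(n_tokens, seconds):
    """Average generation throughput."""
    return n_tokens / seconds


def time_generation(generate, n_tokens):
    """Time one call to `generate` (a callable that produces n_tokens tokens)."""
    start = time.perf_counter()
    generate()
    return tokens_per_second(n_tokens, time.perf_counter() - start)


if __name__ == "__main__":
    # 100 tokens in about 1.2 s, as reported above
    print(f"{tokens_per_second(100, 1.2):.1f} tokens/s")
```

For the reported run this works out to roughly 83 tokens per second.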
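Performance Tips:
- Use --ctx-size (-c) to set the context window; larger values use more memory
- Increase -ngl to offload more layers to the GPU
- Lower --temp for more deterministic output; it does not change generation speed
- Use --n-predict to limit generation length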
This setup provides a production-ready local LLM environment that costs $3-7 to operate while offering performance comparable to cloud services.