matias yoon

Posted on May 24

Local LLM Setup Guide (v5)

#ai #llm #developers #tutorial

Practical Installation & Optimization Guide for Developers

1. Overview & Prerequisites

Running LLMs locally is feasible on consumer-grade hardware, but it demands careful attention to memory management and system configuration.

Hardware Requirements:

GPU: NVIDIA RTX 30xx/40xx series recommended (8GB+ VRAM)
CPU: Intel i5-12600K or AMD Ryzen 7 5800X
RAM: Minimum 16GB, 32GB recommended
Storage: 500GB+ SSD for model storage

OS: Ubuntu 22.04 LTS or Debian 12
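Before installing anything, it can help to check the machine against the list above. The following is a minimal sketch, assuming a Linux host; the function name `check_prerequisites` and the 16 GB default are illustrative choices taken from the minimum RAM figure above, not part of any tool.

```python
import os
import shutil


def check_prerequisites(min_ram_gb: float = 16.0) -> dict:
    """Report which command-line tools are on PATH and how much RAM is installed."""
    tools = ["git", "cmake", "python3", "nvidia-smi"]
    report = {tool: shutil.which(tool) is not None for tool in tools}

    # Physical memory = page size * number of physical pages (Linux sysconf keys).
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    ram_gb = page_size * page_count / 1024**3

    report["ram_gb"] = round(ram_gb, 1)
    report["ram_ok"] = ram_gb >= min_ram_gb
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

A missing `nvidia-smi` usually means the NVIDIA driver is not installed, which is worth resolving before building anything with GPU support.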

Prerequisites Installation:

sudo apt update
sudo apt install -y git cmake build-essential python3-pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2. Framework Comparison

Framework	Pros	Cons	Use Case
llama.cpp	Native C++, no external dependencies, minimal overhead	CPU-only by default; GPU offload needs a CUDA-enabled build	Lightweight inference
Ollama	Simple CLI, automatic model downloads	Less control over low-level inference settings	Quick prototyping
vLLM	High throughput, optimized for serving	Complex setup, requires Python expertise	Production inference
LocalAI	API-first, supports multiple backends	Heavy, requires container setup	Enterprise deployments

Recommendation: Use llama.cpp for development, Ollama for quick testing.

3. Step-by-Step Installation (llama.cpp)

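# Clone and build llama.cpp
# (plain `make` gives a CPU-only build; newer releases build with CMake and
# name the binaries llama-cli and llama-server instead of main and server)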
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make

# Download a model (example: Mistral 7B)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# Basic inference test
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" --temp 0.2
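A failed or interrupted download often leaves an HTML error page or a truncated file under the `.gguf` name. GGUF files start with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number, so a quick header check catches the most obvious cases. This is a minimal sketch; the helper name `read_gguf_version` is hypothetical, and it does not validate the rest of the file.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the header is wrong."""
    with Path(path).open("rb") as fh:
        header = fh.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The version is an unsigned 32-bit little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    print(read_gguf_version("models/mistral-7b-v0.1.Q4_K_M.gguf"))
```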

4. Model Selection Guide

For Chat Applications:

Mistral 7B Q4_K_M (balanced quality vs size)
Llama 2 7B Q4_K_M (best commercial support)

For Code Generation:

CodeLlama 7B Q4_K_M (best for programming tasks)
Phi-2 Q4_K_M (smaller, fast)

For Research/Development:

Llama 3 8B Q4_K_M (latest architecture)
Mixtral 8x7B Q4_K_M (sparse mixture of experts)

Example usage:

# Chat model
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "You are a helpful AI assistant. User: What is 2+2? Assistant:" \
  --temp 0.1 --repeat_penalty 1.1

# Code model
./main -m models/codellama-7b.Q4_K_M.gguf \
  -p "def fibonacci(n):" \
  --temp 0.0 --repeat_penalty 1.0
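When these commands are driven from a script, building the argument list in one place avoids shell-quoting mistakes in the prompt. The sketch below mirrors the two invocations above; `format_chat_prompt` and `build_command` are hypothetical helper names, and the plain "User: / Assistant:" layout is simply the one used in the example, not a model-specific chat template.

```python
def format_chat_prompt(system: str, user: str) -> str:
    """Lay out a single-turn prompt in the same style as the chat example above."""
    return f"{system} User: {user} Assistant:"


def build_command(model_path: str, prompt: str, temp: float = 0.2,
                  repeat_penalty: float = 1.1, binary: str = "./main") -> list[str]:
    """Return an argv list suitable for subprocess.run (no shell involved)."""
    return [
        binary, "-m", model_path, "-p", prompt,
        "--temp", str(temp), "--repeat_penalty", str(repeat_penalty),
    ]


if __name__ == "__main__":
    prompt = format_chat_prompt("You are a helpful AI assistant.", "What is 2+2?")
    print(build_command("models/mistral-7b-v0.1.Q4_K_M.gguf", prompt, temp=0.1))
```

Passing the list to `subprocess.run` without `shell=True` means quotes and newlines in the prompt reach the binary unchanged.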

5. Quantization Types Explained

Quantization	Size (approx.)	Quality	Use Case
Q4_K_M	4.2GB	High	General use
Q5_K_M	5.1GB	Very High	Research
Q6_K	6.4GB	Excellent	Premium quality
Q8_0	8.2GB	Near-lossless	Maximum quality
F16	16GB	Full	Development/testing
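File size follows roughly from parameter count times bits per weight, which gives a quick way to sanity-check a download or to guess whether a model fits in VRAM. The sketch below is an estimate only: the bits-per-weight values for the K-quants are approximations assumed for illustration (Q8_0 at 8.5 and F16 at 16 follow from their storage layouts), and real files also carry metadata and a few tensors kept at higher precision.

```python
# Approximate bits stored per weight. The K-quant values are assumptions
# for illustration; actual files vary slightly by model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a given quantization."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(7e9, quant):.1f} GB for 7B parameters")
```

The weights are not the whole memory budget: the context (KV cache) and runtime buffers need room on top of this figure.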

Benchmark comparison (1000 tokens, RTX 4090):

Q4_K_M: 150 tokens/sec
Q5_K_M: 130 tokens/sec
Q8_0: 100 tokens/sec

Quantization command:

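# Convert the Hugging Face checkpoint to an F16 GGUF, then quantize it.
# (convert-hf-to-gguf.py does not emit K-quants such as Q4_K_M; the quantize tool does.)
python3 convert-hf-to-gguf.py models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-f16.gguf
./quantize mistral-7b-f16.gguf mistral-7b-q4k.gguf Q4_K_M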

6. API Setup and Integration

Simple HTTP API with Python:

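# server.py (requires: pip3 install flask)
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

DEFAULT_MODEL = 'models/mistral-7b-v0.1.Q4_K_M.gguf'


@app.route('/generate', methods=['POST'])
def generate():
    # Parse the JSON body and reject requests that carry no prompt
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt')
    if not prompt:
        return jsonify({'error': 'missing "prompt"'}), 400
    model_path = data.get('model', DEFAULT_MODEL)

    # Run one llama.cpp process per request (the model is reloaded every time)
    result = subprocess.run([
        './main', '-m', model_path, '-p', prompt,
        '--temp', '0.2', '--repeat_penalty', '1.1'
    ], capture_output=True, text=True)
    if result.returncode != 0:
        return jsonify({'error': result.stderr.strip()}), 500

    return jsonify({'response': result.stdout.strip()})


if __name__ == '__main__':
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
    app.run(host='0.0.0.0', port=8000)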

Integration with existing tools:

# Test API
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in simple terms:"}'

7. Systemd Service for 24/7 Operation

Create /etc/systemd/system/local-llm.service:

[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server -m /home/developer/llama.cpp/models/mistral-7b-v0.1.Q4_K_M.gguf --port 8080 --host 0.0.0.0
Restart=always
RestartSec=10
Environment=LD_LIBRARY_PATH=/usr/local/cuda/lib64

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
sudo systemctl status local-llm
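`systemctl status` reports that the process is running, not that the model has finished loading and the port is accepting connections. A small readiness check can cover that gap. The sketch below only tests whether a TCP connection succeeds; the helper name `wait_for_port` is hypothetical, and port 8080 is taken from the unit file above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 8080)
    print("service is accepting connections" if ready else "service did not come up")
```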

8. Monitoring and Performance Tuning

Memory monitoring script:

#!/bin/bash
# monitor.sh
while true; do
    echo "Timestamp: $(date)"
    echo "GPU Memory:"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    echo "System Memory:"
    free -h
    echo "---"
    sleep 30
done
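The `nvidia-smi` query in the script prints one `used, total` pair per GPU in MiB, which is easy to parse for alerting or logging. The following is a minimal sketch of such a parser; the function names and the sample values in the demo are illustrative assumptions.

```python
def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU, into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def usage_fraction(used_mib: int, total_mib: int) -> float:
    """Return the fraction of GPU memory in use (0.0 to 1.0)."""
    return used_mib / total_mib


if __name__ == "__main__":
    sample = "512, 8192\n"  # made-up sample in the format the query produces
    for used, total in parse_gpu_memory(sample):
        print(f"{used}/{total} MiB ({usage_fraction(used, total):.1%})")
```

In practice the input would come from `subprocess.run` on the same `nvidia-smi` command line used in `monitor.sh`.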

Performance optimization flags:

# For high throughput (more threads, larger batch and context; costs more memory)
./main -m model.gguf \
  --threads 16 \
  --ctx-size 4096 \
  --batch-size 512 \
  --temp 0.0 \
  --repeat_penalty 1.0 \
  --n-predict 1000

# For low latency (smaller context and batch reduce time to first token)
./main -m model.gguf \
  --threads 8 \
  --ctx-size 1024 \
  --batch-size 64 \
  --temp 0.2 \
  --repeat_penalty 1.1
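The `--ctx-size` flag trades memory for context length, because the key/value cache grows linearly with it. A rough estimate is 2 (keys and values) × layers × context × KV heads × head dimension × bytes per value. The sketch below assumes an F16 cache and defaults that correspond to Mistral 7B's published configuration (32 layers, 8 KV heads, head dimension 128); other models need their own numbers.

```python
def kv_cache_bytes(ctx_size: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for one sequence.

    Defaults are assumptions matching Mistral 7B with a 16-bit cache.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for ctx in (1024, 4096):
        print(f"ctx {ctx}: {kv_cache_bytes(ctx) / 1024**2:.0f} MiB")
```

Under these assumptions, moving from a 1024-token to a 4096-token context raises the cache from 128 MiB to 512 MiB, on top of the model weights.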

9. Real Command Examples

Complete workflow example:

# 1. Setup directory
mkdir -p ~/llm-dev/models
cd ~/llm-dev

# 2. Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 3. Download model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# 4. Test inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "Write a bash script that checks disk space:" \
  --temp 0.1 --repeat_penalty 1.0

# 5. Monitor performance
watch -n 1 nvidia-smi

Benchmark script:

#!/bin/bash
# benchmark.sh
MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
PROMPT="The quick brown fox jumps over the lazy dog. This is a test prompt for benchmarking."

echo "Starting benchmark..."
time ./main -m "$MODEL_PATH" -p "$PROMPT" --temp 0.0 --repeat_penalty 1.0 --n-predict 100
echo "Benchmark complete"
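The shell script reports wall-clock time only. To turn that into a tokens-per-second figure, divide the number of generated tokens by the elapsed time. The sketch below does this in Python; `run_benchmark` and `tokens_per_second` are hypothetical names, and the result is a lower bound because the timing includes model loading and prompt processing and assumes the model generates all `n_predict` tokens.

```python
import subprocess
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Convert a token count and a wall-clock duration into a throughput figure."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s


def run_benchmark(model_path: str, prompt: str, n_predict: int = 100,
                  binary: str = "./main") -> float:
    """Run one generation and return approximate tokens/sec (includes load time)."""
    cmd = [binary, "-m", model_path, "-p", prompt,
           "--temp", "0.0", "--repeat_penalty", "1.0", "--n-predict", str(n_predict)]
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    return tokens_per_second(n_predict, time.perf_counter() - start)


if __name__ == "__main__":
    rate = run_benchmark("models/mistral-7b-v0.1.Q4_K_M.gguf",
                         "The quick brown fox jumps over the lazy dog.")
    print(f"~{rate:.1f} tokens/sec (wall clock)")
```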

Quick deployment command:


# One-liner setup and test
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make && \
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
