DEV Community

matias yoon


Local LLM Setup Guide (v5)

Practical Installation & Optimization Guide for Developers

1. Overview & Prerequisites

Running LLMs locally is feasible on consumer hardware, but it demands careful attention to memory management and system configuration.

Hardware Requirements:

  • GPU: NVIDIA RTX 30xx/40xx series recommended (8GB+ VRAM)
  • CPU: Intel i5-12600K or AMD Ryzen 7 5800X
  • RAM: Minimum 16GB, 32GB recommended
  • Storage: 500GB+ SSD for model storage

OS: Ubuntu 22.04 LTS or Debian 12

Prerequisites Installation:

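sudo apt update
sudo apt install -y git cmake build-essential python3-pip python3-venv
# Debian 12 blocks system-wide pip installs (PEP 668), so use a virtual environment
# (~/llm-env is an arbitrary example path)
python3 -m venv ~/llm-env && source ~/llm-env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118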
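Before building anything, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch using only the Python standard library; the tool list and the helper names `missing_tools` and `report` are illustrative assumptions, not part of any official setup script.

```python
import shutil

# Tools the build steps in this guide rely on (illustrative list, adjust as needed).
REQUIRED_TOOLS = ("git", "cmake", "gcc", "python3")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def report(tools=REQUIRED_TOOLS):
    """Build a one-line, human-readable summary of the check."""
    missing = missing_tools(tools)
    if missing:
        return "Missing: " + ", ".join(missing)
    return "All required tools found"
```

Calling `print(report())` after the apt step gives a quick sanity check; adding `"nvidia-smi"` to the list would also confirm that the NVIDIA driver utilities are installed.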

2. Framework Comparison

| Framework | Pros | Cons | Use Case |
|-----------|------|------|----------|
| llama.cpp | Native C++, no dependencies, minimal overhead | CPU-only by default; GPU acceleration requires a CUDA-enabled build | Lightweight inference |
| Ollama | Simple CLI, automatic model downloads | Less control over build and inference settings | Quick prototyping |
| vLLM | High throughput, optimized for serving | Complex setup, requires Python expertise | Production inference |
| LocalAI | API-first, supports multiple backends | Heavy, typically deployed in containers | Enterprise deployments |

Recommendation: Use llama.cpp for development, Ollama for quick testing.

3. Step-by-Step Installation (llama.cpp)

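# Clone and build llama.cpp
# Note: these commands assume an older checkout whose Makefile produces ./main;
# newer releases build with CMake and name the binary llama-cli.
# A plain `make` builds CPU-only; GPU offload needs a CUDA-enabled build plus -ngl.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make

# Download a model (example: Mistral 7B)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# Basic inference test (-n caps the number of generated tokens)
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" --temp 0.2 -n 64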
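A failed or interrupted download can leave an HTML error page or a partial file in models/, which only surfaces later as a confusing load error. GGUF files start with the four ASCII bytes `GGUF`, so a quick check is possible before running inference. This is a minimal sketch; the helper name `looks_like_gguf` is hypothetical, and it checks only the magic bytes, not the integrity of the whole file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` exists and starts with the GGUF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

For example, `looks_like_gguf("models/mistral-7b-v0.1.Q4_K_M.gguf")` should return True after a successful download. Comparing a checksum against the one published on the model page would be a stronger test.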

4. Model Selection Guide

For Chat Applications:

  • Mistral 7B Q4_K_M (balanced quality vs size)
  • Llama 2 7B Q4_K_M (best commercial support)

For Code Generation:

  • CodeLlama 7B Q4_K_M (best for programming tasks)
  • Phi-2 Q4_K_M (2.7B parameters; smaller and faster)

For Research/Development:

  • Llama 3 8B Q4_K_M (latest architecture)
  • Mixtral 8x7B Q4_K_M (sparse mixture of experts)

Example usage:

# Chat model
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "You are a helpful AI assistant. User: What is 2+2? Assistant:" \
  --temp 0.1 --repeat_penalty 1.1

# Code model
./main -m models/codellama-7b.Q4_K_M.gguf \
  -p "def fibonacci(n):" \
  --temp 0.0 --repeat_penalty 1.0
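When these commands are launched from scripts, keeping the sampling settings in one place avoids copy-paste drift between the chat and code variants. The sketch below mirrors the two invocations above; the preset names and the helpers `build_command` and `as_shell` are illustrative assumptions, and the flags are exactly the ones already used in this guide.

```python
import shlex

# Sampling presets mirroring the two example commands above.
PRESETS = {
    "chat": {"temp": "0.1", "repeat_penalty": "1.1"},
    "code": {"temp": "0.0", "repeat_penalty": "1.0"},
}


def build_command(model_path, prompt, preset, binary="./main"):
    """Return the argument list for one llama.cpp CLI invocation."""
    settings = PRESETS[preset]  # raises KeyError for an unknown preset
    return [
        binary, "-m", model_path, "-p", prompt,
        "--temp", settings["temp"],
        "--repeat_penalty", settings["repeat_penalty"],
    ]


def as_shell(argv):
    """Render an argument list as a copy-pasteable, safely quoted shell command."""
    return shlex.join(argv)
```

The list returned by `build_command` can be passed directly to `subprocess.run`, which sidesteps shell quoting problems with prompts that contain quotes or newlines.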

5. Quantization Types Explained

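| Quantization | Approx. size (7B model) | Quality | Use Case |
|--------------|-------------------------|---------|----------|
| Q4_K_M | 4.2GB | High | General use |
| Q5_K_M | 5.1GB | Very High | Research |
| Q6_K | 6.4GB | Excellent | Premium quality |
| Q8_0 | 8.2GB | Near-lossless | Maximum quality |
| F16 | 16GB | Full | Development/testing |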
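The sizes in the table translate directly into a fit check against available VRAM. The sketch below uses the table's figures as given; the 1.5 GB headroom for the KV cache and runtime buffers is a rough assumption for illustration, not a measured value, and the function name is hypothetical.

```python
# Approximate file sizes in GB, copied from the table above.
TABLE_SIZES_GB = {
    "Q4_K_M": 4.2,
    "Q5_K_M": 5.1,
    "Q6_K": 6.4,
    "Q8_0": 8.2,
    "F16": 16.0,
}


def quantizations_that_fit(vram_gb, headroom_gb=1.5, sizes=TABLE_SIZES_GB):
    """List quantizations whose weights fit in `vram_gb` after reserving headroom.

    `headroom_gb` is an assumed allowance for the KV cache and scratch buffers.
    """
    budget = vram_gb - headroom_gb
    return [name for name, size in sizes.items() if size <= budget]
```

With the 8GB card from the hardware requirements, `quantizations_that_fit(8)` leaves Q4_K_M, Q5_K_M, and Q6_K under this assumption; larger contexts need more headroom, so the real cutoff may be tighter.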

Benchmark comparison (1000 tokens, RTX 4090):

  • Q4_K_M: 150 tokens/sec
  • Q5_K_M: 130 tokens/sec
  • Q8_0: 100 tokens/sec

Quantization command:

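# Step 1: convert the Hugging Face checkpoint to an F16 GGUF file
python3 convert-hf-to-gguf.py models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-f16.gguf
# Step 2: quantize to Q4_K_M with the quantize tool (the converter does not emit K-quants)
./quantize mistral-7b-f16.gguf mistral-7b-q4k.gguf Q4_K_M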

6. API Setup and Integration

Simple HTTP API with Python:

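# server.py
# Minimal HTTP wrapper that shells out to the llama.cpp CLI once per request.
# Requires Flask: pip install flask
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

DEFAULT_MODEL = 'models/mistral-7b-v0.1.Q4_K_M.gguf'


@app.route('/generate', methods=['POST'])
def generate():
    # silent=True returns None instead of raising on a malformed body
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or not data.get('prompt'):
        return jsonify({'error': "JSON body with a 'prompt' field is required"}), 400
    prompt = data['prompt']
    # Caution: a client-supplied path lets callers load any file the server can read
    model_path = data.get('model', DEFAULT_MODEL)

    try:
        # An argument list (no shell) keeps the prompt from being interpreted as shell code
        result = subprocess.run([
            './main', '-m', model_path, '-p', prompt,
            '--temp', '0.2', '--repeat_penalty', '1.1',
            '-n', '256',  # cap output length so a request cannot run indefinitely
        ], capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        return jsonify({'error': 'generation timed out'}), 504

    if result.returncode != 0:
        return jsonify({'error': result.stderr.strip()}), 500
    return jsonify({'response': result.stdout.strip()})


if __name__ == '__main__':
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
    app.run(host='0.0.0.0', port=8000)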

Integration with existing tools:

# Test API
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in simple terms:"}'
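The same request can be issued from Python without extra dependencies, which is convenient when calling the API from existing scripts. This is a minimal sketch using only the standard library; it assumes the `/generate` endpoint and the `response` field shown in server.py above, and the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/generate"  # endpoint from the curl example above


def build_request(prompt, url=API_URL, model=None):
    """Create the same POST request that the curl example sends."""
    payload = {"prompt": prompt}
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=300, **kwargs):
    """Send the prompt and return the 'response' field of the JSON reply."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the server running, `generate("Explain quantum computing in simple terms:")` returns the generated text as a string. The generous timeout reflects that each request loads the model from disk before generating.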

7. Systemd Service for 24/7 Operation

Create /etc/systemd/system/local-llm.service:

[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
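# The interactive main binary has no --port/--host options and exits after one prompt;
# the server binary built alongside it is the long-running HTTP process.
ExecStart=/home/developer/llama.cpp/server -m /home/developer/llama.cpp/models/mistral-7b-v0.1.Q4_K_M.gguf --port 8080 --host 0.0.0.0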
Restart=always
RestartSec=10
Environment=LD_LIBRARY_PATH=/usr/local/cuda/lib64

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
sudo systemctl status local-llm
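`systemctl status` reports whether the process is alive, but not whether it is accepting connections yet; loading a multi-gigabyte model can take a while. A simple TCP probe covers that gap. The sketch below assumes the port 8080 from the unit file above; the helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host="127.0.0.1", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Polling `port_is_open()` in a loop after `systemctl start` gives a rough readiness check; it confirms only that something is listening, not that the model has finished loading.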

8. Monitoring and Performance Tuning

Memory monitoring script:

#!/bin/bash
# monitor.sh
while true; do
    echo "Timestamp: $(date)"
    echo "GPU Memory:"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    echo "System Memory:"
    free -h
    echo "---"
    sleep 30
done
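The raw CSV from nvidia-smi is easier to alert on once it is parsed into numbers. The sketch below parses the `memory.used,memory.total` output produced with `--format=csv,noheader,nounits` (one line per GPU, values in MiB); the function name and the tuple layout are illustrative choices.

```python
def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into (used, total, percent_used) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used_str, total_str = (field.strip() for field in line.split(","))
        used, total = int(used_str), int(total_str)
        rows.append((used, total, round(100 * used / total, 1)))
    return rows
```

The input can come from `subprocess.check_output` running the same nvidia-smi command as monitor.sh with `text=True`; a caller could then log a warning when the percentage crosses a chosen threshold.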

Performance optimization flags:

# Throughput-oriented settings: more threads, larger context and batch
./main -m model.gguf \
  --threads 16 \
  --ctx-size 4096 \
  --batch-size 512 \
  --temp 0.0 \
  --repeat_penalty 1.0 \
  --n-predict 1000

# Latency-oriented settings: smaller context and batch
./main -m model.gguf \
  --threads 8 \
  --ctx-size 1024 \
  --batch-size 64 \
  --temp 0.2 \
  --repeat_penalty 1.1
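The `--ctx-size` flag matters for memory as well as speed, because the key/value cache grows linearly with context length. The sketch below estimates that cost; the layer count, KV head count, and head dimension are assumed values for Mistral 7B, and the 2-byte entry size assumes a 16-bit cache, so treat the result as a rough guide rather than what llama.cpp will allocate exactly.

```python
# Assumed architecture values for Mistral 7B; adjust for other models.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes 16-bit cache entries


def kv_cache_bytes(ctx_size, n_layers=N_LAYERS, n_kv_heads=N_KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Estimate KV cache size: keys and values (factor 2) for every layer and position."""
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_value


def kv_cache_mib(ctx_size, **kwargs):
    """Same estimate expressed in MiB."""
    return kv_cache_bytes(ctx_size, **kwargs) / (1024 ** 2)
```

Under these assumptions, the 4096-token context in the first command costs about four times the cache memory of the 1024-token context in the second.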

9. Real Command Examples

Complete workflow example:

# 1. Setup directory
mkdir -p ~/llm-dev/models
cd ~/llm-dev

# 2. Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 3. Download model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# 4. Test inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "Write a bash script that checks disk space:" \
  --temp 0.1 --repeat_penalty 1.0

# 5. Monitor performance
watch -n 1 nvidia-smi

Benchmark script:

#!/bin/bash
# benchmark.sh
MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
PROMPT="The quick brown fox jumps over the lazy dog. This is a test prompt for benchmarking."

echo "Starting benchmark..."
# Quote the path so it survives spaces; wall time includes model loading
time ./main -m "$MODEL_PATH" -p "$PROMPT" --temp 0.0 --repeat_penalty 1.0 --n-predict 100
echo "Benchmark complete"
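The `time` output still has to be converted into tokens per second by hand. The sketch below does the same measurement from Python; the helper names are hypothetical, and because wall-clock time includes model loading and prompt processing, the resulting figure is a lower bound on the pure generation rate.

```python
import subprocess
import time


def time_command(argv):
    """Run `argv` to completion and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, completed.returncode


def tokens_per_second(n_tokens, elapsed_seconds):
    """Convert a token count and a duration into a throughput figure."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_tokens / elapsed_seconds
```

For the benchmark above, `time_command` would receive the same arguments as the `./main` call, and `tokens_per_second(100, elapsed)` uses the `--n-predict` value as the token count. That count is an upper bound, since generation can stop early at an end-of-sequence token.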

Quick deployment command:


# One-liner setup and test
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make && \
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)