DEV Community

matias yoon


Local LLM Setup Guide (v12)


Practical Guide for Local LLM Deployment

1. Overview & Prerequisites

Running local LLMs requires understanding your system's capabilities and limitations. For optimal performance, you'll need:

Hardware Requirements:

  • CPU: Intel i5-12600K or AMD Ryzen 5 5600X minimum
  • RAM: 16GB minimum (32GB recommended)
  • GPU: NVIDIA RTX 3060 or better (optional but highly recommended)
  • Storage: 50GB SSD minimum for models and cache

Operating System:
Ubuntu 22.04 LTS or Debian 12 (ARM64 support limited)
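
Before installing anything, it can help to check a machine against the minimums listed above. The following is a minimal sketch, not part of the original setup: the function names `check_requirements` and `probe_system` are hypothetical, the thresholds are the 16GB RAM and 50GB storage figures from this section, and the GPU check only tests whether `nvidia-smi` is on the PATH (it assumes a Linux host).

```python
import os
import shutil

MIN_RAM_GB = 16   # minimum RAM from the hardware list above
MIN_DISK_GB = 50  # minimum free storage from the hardware list above


def check_requirements(ram_gb, free_disk_gb, has_nvidia_gpu):
    """Return a list of warnings; an empty list means every check passed."""
    warnings = []
    if ram_gb < MIN_RAM_GB:
        warnings.append(f"RAM {ram_gb:.1f} GB is below the {MIN_RAM_GB} GB minimum")
    if free_disk_gb < MIN_DISK_GB:
        warnings.append(f"Free disk {free_disk_gb:.1f} GB is below the {MIN_DISK_GB} GB minimum")
    if not has_nvidia_gpu:
        warnings.append("nvidia-smi not found; inference will run on the CPU only")
    return warnings


def probe_system(model_dir="~"):
    """Read total RAM, free disk space, and GPU tool presence (Linux only)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_disk = shutil.disk_usage(os.path.expanduser(model_dir)).free
    return ram_bytes / 1e9, free_disk / 1e9, shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    for line in check_requirements(*probe_system()) or ["All checks passed"]:
        print(line)
```

The pure `check_requirements` function is kept separate from the probing code so the thresholds can be adjusted or tested without touching the system calls.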

Prerequisites Installation:

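sudo apt update && sudo apt install -y git build-essential cmake python3-pip
# Flask is needed for the API wrapper in section 6
pip3 install flask
# Optional: llama.cpp itself does not need PyTorch; install it only if other Python tooling requires it
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118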

2. Framework Comparison

| Framework | Pros | Cons | Best For |
|-----------|------|------|----------|
| llama.cpp | Lightweight, single binary, best for small models | No GUI, limited multi-GPU support | Quick prototyping, embedded systems |
| Ollama | Easy setup, built-in model management | Heavy resource usage | Development environments |
| vLLM | Highest throughput, optimized for serving | Complex setup, requires Python expertise | Production inference services |
| LocalAI | Multi-model support, HTTP API | Resource intensive | Enterprise deployments |

3. Recommended Setup: llama.cpp + Systemd Service

Installation Steps:

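# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support (requires the NVIDIA CUDA toolkit).
# Recent llama.cpp versions build with CMake; the old Makefile build is no longer supported.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"

# Link the binaries into the repo root so the ./llama-cli and ./llama-server paths used in this guide resolve
ln -sf build/bin/llama-cli build/bin/llama-server .

# Verify installation
./llama-cli --help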

Model Download:

# Create models directory
mkdir -p ~/models

# Download a 7B parameter model
cd ~/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
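
A failed or interrupted download can leave an HTML error page or a truncated file where the model should be. GGUF files start with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number, so a quick header check catches the most common problem. This is a minimal sketch; the helper name `read_gguf_version` is hypothetical and it does not validate anything beyond the first eight bytes.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..7 hold the format version as a little-endian unsigned 32-bit integer
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Usage would look like `read_gguf_version(os.path.expanduser("~/models/llama-2-7b.Q4_K_M.gguf"))` after importing `os`; a `ValueError` suggests the file should be downloaded again.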

4. Model Selection Guide

| Use Case | Recommended Model | Quantization | Reason |
|----------|-------------------|--------------|--------|
| General Chat | Llama-2-7B | Q4_K_M | Balance of quality vs size |
| Code Generation | CodeLlama-7B | Q5_K_M | Better code understanding |
| Reasoning | Mistral-7B | Q5_K_M | Strong logical capabilities |
| Lightweight | Phi-2 | Q4_K_M | Fast inference, small size |

Example model usage:

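# Run with a 2048-token context and generate up to 200 tokens
# (-n is the short form of --n-predict, so the limit is given only once)
./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf \
  -c 2048 \
  --temp 0.7 \
  --n-predict 200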

5. Quantization Types Explained

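Q4_K_M: 4-bit k-quant ("K" refers to llama.cpp's block-wise k-quant scheme, "M" to the medium variant, which keeps some tensors at higher precision); a common default that balances size and quality
Q5_K_M: 5-bit k-quant, medium variant; better quality than Q4_K_M at a larger file size
Q6_K: 6-bit k-quant; the highest quality of the three and the largest file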
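
The size difference between these types follows from simple arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The sketch below works this out for Llama-2-7B (about 6.74 billion parameters). The bits-per-weight values are assumed round figures for illustration, since k-quants mix precisions across tensors; llama.cpp prints the exact figure for a given file when it loads the model.

```python
# Approximate effective bits per weight; assumed values for illustration only
APPROX_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6}

LLAMA_2_7B_PARAMS = 6.74e9  # approximate parameter count


def estimate_file_gb(n_params, quant):
    """Estimate model file size in decimal gigabytes: params * bits / 8."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9


for quant in APPROX_BPW:
    print(f"{quant}: ~{estimate_file_gb(LLAMA_2_7B_PARAMS, quant):.1f} GB")
```

The estimate covers the weights only; the context cache and runtime buffers come on top of it at inference time.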

A quick timing and memory check for Llama-2-7B (repeat with each quantized file to compare them):

# Performance test
time ./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf -p "What is 2+2?" --n-predict 10

# Memory usage test
watch -n 1 free -h

6. API Setup and Integration

Create a simple API wrapper:

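# api_server.py
import os
import subprocess

from flask import Flask, request, jsonify

app = Flask(__name__)

# subprocess does not expand "~", so resolve the model path up front
MODEL_PATH = os.path.expanduser('~/models/llama-2-7b.Q4_K_M.gguf')

@app.route('/generate', methods=['POST'])
def generate():
    # Fall back to an empty dict if the body is missing or not valid JSON
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt', '')
    max_tokens = int(data.get('max_tokens', 100))

    # Call llama.cpp with prompt (start this server from the llama.cpp directory;
    # each request spawns a new process, so the model is reloaded every time)
    try:
        result = subprocess.run([
            './llama-cli',
            '-m', MODEL_PATH,
            '-p', prompt,
            '--n-predict', str(max_tokens),
            '--temp', '0.7'
        ], capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        return jsonify({'error': 'generation timed out'}), 504

    if result.returncode != 0:
        return jsonify({'error': result.stderr.strip()}), 500
    return jsonify({'response': result.stdout})

if __name__ == '__main__':
    # 0.0.0.0 exposes the API to the whole network; use 127.0.0.1 for local-only access
    app.run(host='0.0.0.0', port=8080)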

Integration with existing tools:

# Test API
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in simple terms", "max_tokens": 150}'
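
The same call can be made from Python when integrating with other tools. This is a minimal sketch using only the standard library; it assumes the Flask wrapper above is running on `localhost:8080` and returns a JSON object with a `response` field. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=150, base_url="http://localhost:8080"):
    """Build the POST request for the /generate endpoint of the wrapper."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_tokens=150, timeout=300):
    """Send the prompt and return the generated text."""
    req = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    print(generate("Explain quantum computing in simple terms"))
```

Building the request separately from sending it keeps the payload easy to inspect and to reuse with a different base URL.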

7. Systemd Service for 24/7 Operation

Create service file:

sudo nano /etc/systemd/system/llama.service

Service configuration:

[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/llama-server \
  -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \
  --port 8000 \
  --host 0.0.0.0 \
  --threads 8 \
  --ctx-size 2048
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start and enable service:

sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
sudo systemctl status llama.service
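
Once the service is running, `llama-server` can be queried over HTTP. The sketch below assumes the port 8000 from the unit file and uses the server's `/health` and `/completion` endpoints; the request fields shown (`prompt`, `n_predict`, `temperature`) and the `content` field in the response match the llama.cpp server as I understand it, but they are worth checking against the server README for the version installed. The helper names are hypothetical.

```python
import json
import urllib.request

SERVER = "http://localhost:8000"  # port taken from the unit file above


def completion_payload(prompt, n_predict=128, temperature=0.7):
    """Build the JSON body for a /completion request."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": temperature}


def is_healthy(timeout=5):
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{SERVER}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and HTTP error responses
        return False


def complete(prompt, **kwargs):
    """Send a prompt to /completion and return the generated text."""
    req = urllib.request.Request(
        f"{SERVER}/completion",
        data=json.dumps(completion_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp).get("content", "")


if __name__ == "__main__":
    if is_healthy():
        print(complete("What is 2+2?", n_predict=16))
    else:
        print("llama-server is not reachable; check systemctl status llama.service")
```

Unlike the Flask wrapper, the server keeps the model loaded between requests, so repeated calls avoid the model load time.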

8. Monitoring and Performance Tuning

Monitor system resources:

# Real-time monitoring
htop
nvidia-smi -l 1  # For GPU monitoring

# Follow the service logs
journalctl -u llama.service -f
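
For scripted monitoring, `nvidia-smi` can emit machine-readable output through its `--query-gpu` and `--format=csv` options. A minimal sketch, assuming an NVIDIA driver is installed; the function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB ({100 * used / total:.0f}% used)")
```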

Performance tuning parameters:

# Optimize for CPU usage
./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf \
  -c 2048 \
  --threads 4 \
  --batch-size 512 \
  --temp 0.7 \
  --n-predict 200

# For GPU optimization
./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf \
  --gpu-layers 30 \
  --threads 8 \
  --ctx-size 4096
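
The context size in these commands has a direct memory cost, because the key/value cache grows linearly with it. As a rough worked example for Llama-2-7B (32 layers, hidden size 4096), assuming the cache is stored in 16-bit floats and ignoring other runtime buffers:

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_embd=4096, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each holding n_ctx x n_embd values."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


for n_ctx in (1024, 2048, 4096):
    print(f"ctx={n_ctx}: {kv_cache_bytes(n_ctx) / 1024**3:.2f} GiB")
# ctx=1024: 0.50 GiB
# ctx=2048: 1.00 GiB
# ctx=4096: 2.00 GiB
```

Doubling the context from 2048 to 4096 therefore adds about 1 GiB on top of the model weights, which matters when choosing how many layers to offload to a GPU with limited VRAM.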

Memory optimization script:

#!/bin/bash
# memory_optimizer.sh

# Check available memory in whole GiB (the "available" column of free)
AVAILABLE_MEM=$(free -g | awk '/^Mem:/{print $7}')

if [ "$AVAILABLE_MEM" -lt 8 ]; then
  echo "Warning: Low memory detected"
  # Reduce context size for smaller systems
  CTX_SIZE=1024
else
  CTX_SIZE=2048
fi

./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf -c "$CTX_SIZE"
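
The same decision can be made from Python, for example inside a launcher script. This sketch reads `MemAvailable` from `/proc/meminfo` (Linux only) and applies the 8GB threshold used in the shell script; the function names are hypothetical.

```python
def mem_available_gib(meminfo_text):
    """Extract MemAvailable from /proc/meminfo content and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def pick_context_size(available_gib, threshold_gib=8):
    """Use a smaller context when less than threshold_gib of memory is available."""
    return 1024 if available_gib < threshold_gib else 2048


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        available = mem_available_gib(f.read())
    print(f"{available:.1f} GiB available, using -c {pick_context_size(available)}")
```

Reading the value in kB avoids the rounding that `free -g` applies, which truncates to whole gibibytes.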

9. Real Command Examples

Complete deployment workflow:

# 1. Setup environment (recent llama.cpp versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
# Link the binaries into the repo root so ./llama-cli and the service's llama-server path resolve
ln -sf build/bin/llama-cli build/bin/llama-server .

# 2. Download model
mkdir -p ~/models
wget -P ~/models https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# 3. Test with sample prompt (run from ~/llama.cpp)
./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf -p "Write a Python function to calculate Fibonacci numbers" --n-predict 100

# 4. Benchmark performance
time ./llama-cli -m ~/models/llama-2-7b.Q4_K_M.gguf -p "Explain the difference between quantum computing and classical computing" --n-predict 150

# 5. Setup service for automatic startup (uses the unit file from section 7)
sudo systemctl daemon-reload
sudo systemctl enable --now llama.service

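Expected performance metrics (approximate; actual numbers vary with hardware, drivers, context size, and how many layers are offloaded to the GPU):

  • 7B model on RTX 3060: ~10-15 tokens/second
  • 7B model on i7-12700K: ~4-6 tokens/second
  • Memory usage: up to ~8GB for Q4_K_M quantization, including the roughly 4GB model file, the context cache, and runtime overhead
  • Response time: 0.5-2 seconds for short replies; longer completions scale with the number of tokens generated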
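
To relate the throughput figures above to wall-clock time, divide the number of generated tokens by the tokens-per-second rate. A small sketch using the 150-token completion from the benchmark step; it ignores model load and prompt processing time, so real runs will take somewhat longer.

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


# Throughput ranges taken from the list above (tokens/second)
RATES = {"RTX 3060": (10, 15), "i7-12700K": (4, 6)}

for label, (low, high) in RATES.items():
    fastest = generation_seconds(150, high)
    slowest = generation_seconds(150, low)
    print(f"{label}: {fastest:.1f}-{slowest:.1f} s for 150 tokens")
```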

This guide walks through a complete local LLM deployment that balances performance, cost, and practicality. Running llama.cpp under a systemd service keeps it available 24/7 and restarts it automatically after a failure, with minimal overhead.


