matias yoon

Posted on May 25

로컬 LLM 셋업 가이드 (v18)

#ai #llm #developers #tutorial

Local LLM Setup Guide (v18)

1. Overview & Prerequisites

Running LLMs locally requires minimal hardware but careful resource management. This guide assumes:

Ubuntu 20.04/22.04 or Debian 11/12
8GB+ RAM (16GB+ recommended)
NVIDIA GPU with CUDA support (RTX 3060+), or CPU-only setup
20GB+ free disk space for models

For GPU-accelerated inference, install CUDA:

# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-535

# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-4

2. Framework Comparison

Framework	GPU Support	Ease of Use	Performance	Best For
llama.cpp	Yes	Medium	Fast	Quick prototyping
Ollama	Yes	Easy	Fast	Development/testing
vLLM	Yes	Medium	Fastest	Production inference
LocalAI	Yes	Easy	Fast	API-first workflows

Recommendation: Use llama.cpp with Ollama for development workflow.

3. Step-by-Step Installation

Install llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Install Ollama (for easier model management):

curl -fsSL https://ollama.com/install.sh | sh

Test installation:

ollama run llama3:8b

Setup model directory:

mkdir -p ~/llm-models
cd ~/llm-models

4. Model Selection Guide

Use Case: Code Generation

Model: codellama:7b or phi3:3.8b
RAM: 8GB minimum
Command: ollama run codellama:7b

Use Case: Chatbot

Model: llama3:8b or mistral:7b
RAM: 8GB minimum
Command: ollama run llama3:8b

Use Case: High Precision

Model: llama3:70b or mixtral:8x7b
RAM: 16GB minimum
Command: ollama run llama3:70b

5. Quantization Types Explained

Quantization reduces model size while maintaining performance:

Q4_K_M: 4-bit quantization, 4.5GB for 7B model
Q5_K_M: 5-bit quantization, 5.5GB for 7B model
Q8_0: 8-bit quantization, 8GB for 7B model
F16: Full precision, 16GB for 7B model

Example: Download and convert model:

# Download 7B model
ollama pull llama3:8b

# Convert to Q4_K_M (smallest size, good performance)
ollama run llama3:8b --quantize Q4_K_M

6. API Setup and Integration

Create API server with llama.cpp:

# Start llama.cpp server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192

Test API:

curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Write a Python function to reverse a string.",
        "temperature": 0.7,
        "max_tokens": 100
    }'

Integrate with Python:

import requests

def llm_query(prompt):
    response = requests.post(
        'http://localhost:1234/completion',
        json={
            'prompt': prompt,
            'temperature': 0.7,
            'max_tokens': 200
        }
    )
    return response.json()['content']

# Usage
result = llm_query("Explain quantum computing in simple terms")

7. Systemd Service for 24/7 Operation

Create service file:

sudo nano /etc/systemd/system/llm-server.service

Content:

[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username/llama.cpp
ExecStart=/home/your_username/llama.cpp/server \
    -m /home/your_username/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server

8. Monitoring and Performance Tuning

Monitor GPU usage:

nvidia-smi -l 1  # Update every second

Monitor memory usage:

watch -n 1 free -h

Benchmark inference:

# Test 100 token generation
time ./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --prompt "The future of AI is" \
    --max-tokens 100 \
    --threads 8

Performance tuning parameters:

--ctx-size: 8192 for 8B models, 16384 for 70B models
--threads: CPU cores / 2 for optimal performance
--n-gpu-layers: Number of layers on GPU (default: 100 for 8B models)

9. Real Command Examples

Full workflow example:

# 1. Install dependencies
sudo apt update
sudo apt install git cmake build-essential

# 2. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 3. Download model
ollama pull llama3:8b

# 4. Start server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100

# 5. Test API
curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello world", "max_tokens": 10}'

Production-ready startup script:

#!/bin/bash
# ~/start-llm.sh

MODEL_PATH="$HOME/llm-models/llama3-8b-Q4_K_M.gguf"
PORT=1234

if [ ! -f "$MODEL_PATH" ]; then
    echo "Model not found at $MODEL_PATH"
    exit 1
fi

echo "Starting LLM server on port $PORT..."
./server \
    -m "$MODEL_PATH" \
    --host 0.0.0.0 \
    --port $PORT \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100

This setup provides a production-ready local LLM infrastructure with minimal hardware requirements and optimal performance. The combination of llama.cpp for low-level control and Ollama for easy model management gives developers the best of both worlds for local LLM development.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)

DEV Community