DEV Community

matias yoon

Local LLM Setup Guide (v18)

1. Overview & Prerequisites

Running LLMs locally is feasible on modest hardware, but it requires careful resource management. This guide assumes:

  • Ubuntu 20.04/22.04 or Debian 11/12
  • 8GB+ RAM (16GB+ recommended)
  • NVIDIA GPU with CUDA support (RTX 3060+), or CPU-only setup
  • 20GB+ free disk space for models
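
The prerequisites above can be checked with a short script before installing anything. This is a minimal sketch, assuming a Linux host; the function name `check_prerequisites` and its thresholds are illustrative, and finding `nvidia-smi` on the PATH is only a hint that an NVIDIA driver is present.

```python
import os
import shutil


def check_prerequisites(path=".", min_ram_gb=8, min_disk_gb=20):
    """Return a dict describing whether this machine meets the guide's minimums."""
    # Total physical RAM via POSIX sysconf (available on Linux).
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Free space on the filesystem that holds `path`.
    free_disk_bytes = shutil.disk_usage(path).free
    return {
        "ram_gb": round(ram_bytes / 1e9, 1),
        "ram_ok": ram_bytes >= min_ram_gb * 1e9,
        "free_disk_gb": round(free_disk_bytes / 1e9, 1),
        "disk_ok": free_disk_bytes >= min_disk_gb * 1e9,
        # Only a hint: the binary can exist without a working GPU.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in check_prerequisites(os.path.expanduser("~")).items():
        print(f"{key}: {value}")
```

A machine sold as "8GB" may report slightly less usable memory, so treat a near-miss on `ram_ok` as a warning, not a hard failure.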

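For GPU-accelerated inference, install the NVIDIA driver and the CUDA toolkit. The repository path below targets Ubuntu 22.04; substitute the matching path for other releases: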

# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-535

# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-4

2. Framework Comparison

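| Framework | GPU Support | Ease of Use | Performance | Best For |
|-----------|-------------|-------------|-------------|----------|
| llama.cpp | Yes | Medium | Fast | Quick prototyping |
| Ollama | Yes | Easy | Fast | Development/testing |
| vLLM | Yes | Medium | Fastest | Production inference |
| LocalAI | Yes | Easy | Fast | API-first workflows |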

Recommendation: For a development workflow, use Ollama for model management alongside llama.cpp for low-level control.

3. Step-by-Step Installation

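Install llama.cpp. Recent releases build with CMake (the older make build has been retired) and write binaries such as llama-server to build/bin/. For a CUDA-enabled build, add -DGGML_CUDA=ON to the first cmake command:

sudo apt install git cmake build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release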

Install Ollama (for easier model management):

curl -fsSL https://ollama.com/install.sh | sh

Test installation:

ollama run llama3:8b

Setup model directory:

mkdir -p ~/llm-models
cd ~/llm-models

4. Model Selection Guide

Use Case: Code Generation

  • Model: codellama:7b or phi3:3.8b
  • RAM: 8GB minimum
  • Command: ollama run codellama:7b

Use Case: Chatbot

  • Model: llama3:8b or mistral:7b
  • RAM: 8GB minimum
  • Command: ollama run llama3:8b

Use Case: High Precision

  • Model: llama3:70b or mixtral:8x7b
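  • RAM: about 48GB for llama3:70b and 32GB for mixtral:8x7b at 4-bit quantization (the weights alone are roughly 40GB and 26GB)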
  • Command: ollama run llama3:70b

5. Quantization Types Explained

Quantization reduces model size at a small cost in output quality; lower bit widths save more memory but lose more accuracy. Approximate file sizes for a 7B model:

  • Q4_K_M: 4-bit quantization, roughly 4GB
  • Q5_K_M: 5-bit quantization, roughly 5GB
  • Q8_0: 8-bit quantization, roughly 7GB
  • F16: 16-bit half precision (unquantized), roughly 14GB
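
File size for a quantized model follows from simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below is a back-of-envelope estimate only; the bits-per-weight values are approximate assumptions (K-quants mix several bit widths and store per-block scales), and real GGUF files vary by model.

```python
# Approximate bits per weight for common GGUF quantization types.
# These figures are assumptions for illustration; real files vary by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,   # 32 int8 weights plus one 16-bit scale per block
    "F16": 16.0,
}


def estimate_size_gb(n_params, quant):
    """Estimate file size in GB (10^9 bytes) as params * bits_per_weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(7e9, quant):.1f} GB for a 7B model")
```

The same function works for other sizes, for example `estimate_size_gb(70e9, "Q4_K_M")` for a 70B model. Runtime memory is higher than the file size because the context cache and compute buffers are allocated on top of the weights.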

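Example: Download a quantized model:

# Download the default 8B build (Ollama library tags are already quantized)
ollama pull llama3:8b

# `ollama run` has no --quantize flag; to get Q4_K_M, pull a tag published at that
# quantization (tag name assumed here; check the model's tag list on ollama.com)
ollama pull llama3:8b-instruct-q4_K_M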

6. API Setup and Integration

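Create an API server with llama.cpp (recent builds name the binary llama-server under build/bin/; older builds used ./server):

# Start the llama.cpp server
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
./build/bin/llama-server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192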

Test the API (the native /completion endpoint takes n_predict as its token limit):

curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Write a Python function to reverse a string.",
        "temperature": 0.7,
        "n_predict": 100
    }'

Integrate with Python:

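import requests

def llm_query(prompt, max_tokens=200):
    """Send a prompt to the llama.cpp /completion endpoint and return the generated text."""
    response = requests.post(
        'http://localhost:1234/completion',
        json={
            'prompt': prompt,
            'temperature': 0.7,
            'n_predict': max_tokens,  # /completion names its token limit n_predict
        },
        timeout=120,  # avoid hanging forever if the server stalls
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()['content']

# Usage
result = llm_query("Explain quantum computing in simple terms")
print(result)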

7. Systemd Service for 24/7 Operation

Create service file:

sudo nano /etc/systemd/system/llm-server.service

Content:

[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username/llama.cpp
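# Recent llama.cpp builds place the server at build/bin/llama-server (older builds: ./server)
ExecStart=/home/your_username/llama.cpp/build/bin/llama-server \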
    -m /home/your_username/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server
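
After the service starts, the model can take some time to load, so clients that connect immediately may be refused. The sketch below polls the llama.cpp server's /health endpoint until it answers with HTTP 200; the helper name `wait_for_server` and its timeout defaults are illustrative assumptions, and the port matches the 1234 used in this guide.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=120.0, interval_s=2.0):
    """Poll base_url + '/health' until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, or an error status while the model is loading.
            pass
        time.sleep(interval_s)
    return False
```

A typical call would be `wait_for_server("http://localhost:1234")`, followed by the first real request only if it returns `True`.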

8. Monitoring and Performance Tuning

Monitor GPU usage:

nvidia-smi -l 1  # Update every second

Monitor memory usage:

watch -n 1 free -h
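
For logging or alerting, the same GPU figures can be read from a script instead of watched in a terminal. This is a minimal sketch that uses nvidia-smi's CSV query mode; the function names are illustrative, and `gpu_memory()` assumes nvidia-smi is installed and on the PATH.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (one per GPU) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one entry per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Parsing is kept separate from the subprocess call so it can be exercised on machines without a GPU.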

Benchmark inference:

# Benchmark 100-token generation; llama-bench reports tokens per second
# (the server binary does not accept a prompt on the command line)
./build/bin/llama-bench -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    -n 100 \
    -t 8

Performance tuning parameters:

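  • --ctx-size: context window in tokens; keep it at or below the model's trained context (8192 for Llama 3, for both 8B and 70B) and lower it to save memory
  • --threads: start with the number of physical CPU cores, which is often half the logical core count on hyperthreaded CPUs
  • --n-gpu-layers: number of layers to offload to the GPU; a value at or above the model's layer count (such as the 100 used in this guide) offloads every layer, so reduce it if VRAM runs out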
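
The memory cost of --ctx-size comes mostly from the key/value cache, which grows linearly with the context length. The sketch below shows the arithmetic, assuming a 16-bit cache and the Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128); the function name is illustrative, and other models need their own values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Bytes needed for keys and values across all layers at a given context size."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


if __name__ == "__main__":
    # Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128.
    for ctx in (2048, 4096, 8192):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"ctx {ctx}: {gib:.2f} GiB of KV cache at 16-bit")
```

Halving the context halves this figure, which is why lowering --ctx-size is a quick way to fit a model into limited VRAM.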

9. Real Command Examples

Full workflow example:

# 1. Install dependencies
sudo apt update
sudo apt install git cmake build-essential

# 2. Clone and build llama.cpp (CMake build; binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# 3. Download model
# llama-server needs a GGUF file at the path used in step 4; `ollama pull`
# saves to Ollama's own store and serves the model through `ollama run` instead
ollama pull llama3:8b

# 4. Start server
./build/bin/llama-server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 1234 \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100

# 5. Test API (n_predict is the token limit for /completion)
curl http://localhost:1234/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello world", "n_predict": 10}'

Production-ready startup script:

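#!/bin/bash
# ~/start-llm.sh
set -euo pipefail

MODEL_PATH="$HOME/llm-models/llama3-8b-Q4_K_M.gguf"
# Assumes llama.cpp was cloned to ~/llama.cpp and built with CMake
SERVER_BIN="$HOME/llama.cpp/build/bin/llama-server"
PORT=1234

if [ ! -f "$MODEL_PATH" ]; then
    echo "Model not found at $MODEL_PATH" >&2
    exit 1
fi

if [ ! -x "$SERVER_BIN" ]; then
    echo "Server binary not found at $SERVER_BIN" >&2
    exit 1
fi

echo "Starting LLM server on port $PORT..."
# exec replaces the shell so signals reach the server directly
exec "$SERVER_BIN" \
    -m "$MODEL_PATH" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --threads 8 \
    --ctx-size 8192 \
    --n-gpu-layers 100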

This setup gives you a working local LLM stack on modest hardware. Combining llama.cpp for low-level control with Ollama for easy model management covers both ends of local LLM development.


📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
