matias yoon

Posted on May 25

로컬 LLM 셋업 가이드 (v24)

#ai #llm #developers #tutorial

로컬 LLM 셋업 가이드 (v24)

1. 개요 및 전제 조건

로컬 LLM 실행은 GPU가 있는 Linux 시스템에서 최적의 성능을 제공합니다. CPU 기반 실행은 매우 느립니다.

시스템 요구사항:

Ubuntu 20.04 이상 또는 Debian 11 이상
최소 16GB RAM (32GB 권장)
NVIDIA GPU (RTX 30xx 이상) 또는 AMD GPU (ROCm 지원)
최소 50GB 여유 저장공간
Python 3.10 이상

기본 패키지 설치:

sudo apt update
sudo apt install -y git cmake build-essential python3-pip python3-venv

2. 프레임워크 비교

프레임워크	장점	단점	추천 사용 사례
llama.cpp	최소 의존성, 직접 컴파일 가능	명령줄 기반, API 없음	테스트, 직접 제어
Ollama	간단한 API, Docker 지원	메모리 사용량 높음	빠른 개발, 프로토타입
vLLM	초고속 추론, 멀티GPU 지원	복잡한 설치, 고성능 요구	프로덕션, 대량 추론
LocalAI	다양한 API 호환성	성능 저하	API 기반 통합

3. 권장 설정 설치 (llama.cpp + Ollama)

llama.cpp 설치:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake ..
make -j4

Ollama 설치:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama

4. 모델 선택 가이드

모델	크기	추천 사용 사례	추론 속도 (tokens/s)
Llama3-8B	4.7GB	일반 텍스트 생성	25-30
Mixtral-8x7B	13GB	복잡한 추론	15-20
Phi-3-mini	3.8GB	빠른 추론	40-50
Qwen2-7B	4.2GB	중국어 지원	28-32

모델 다운로드 예시:

# llama.cpp
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Ollama
ollama pull mistral
ollama pull llama3

5. 양자화 유형 설명

유형	메모리 사용량	정확도	사용 사례
Q4_K_M	4.5GB	높음	일반 사용
Q5_K_M	5.5GB	매우 높음	정밀 추론
Q8_0	8.0GB	낮음	메모리 제한 환경
F16	16GB	최고	정확도 우선

양자화 명령어:

# llama.cpp 양자화
python3 convert.py --model mistral-7b-v0.1 --outtype q4_k_m

# Ollama에서 사용
ollama create custom-mistral -f Modelfile

6. API 설정 및 도구 통합

llama.cpp API 서버:

# 서버 실행
./server -m mistral-7b-v0.1.Q4_K_M.gguf --port 8080

# 테스트
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, how are you?",
    "n_predict": 100,
    "temperature": 0.7
  }'

Ollama API 사용:

# 모델 실행
ollama run mistral:7b "Explain quantum computing in simple terms"

# Python API 연동
pip install ollama

7. Systemd 서비스 설정

llama.cpp 서비스 파일:

# /etc/systemd/system/llama.service
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/mistral-7b-v0.1.Q4_K_M.gguf --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

서비스 활성화:

sudo systemctl daemon-reload
sudo systemctl enable llama
sudo systemctl start llama
sudo systemctl status llama

8. 모니터링 및 성능 튜닝

GPU 모니터링:

# nvidia-smi 빈도 확인
watch -n 1 nvidia-smi

# 메모리 사용량 확인
free -h

성능 벤치마크:

# llama.cpp 벤치마크
./tests/test-quantize
./tests/test-backend

# Ollama 벤치마크
ollama benchmark mistral:7b

성능 최적화:

# 메모리 최적화
export CUDA_VISIBLE_DEVICES=0
export MKL_NUM_THREADS=8
export OMP_NUM_THREADS=8

# CPU 스레드 최적화
./server -m model.gguf --threads 8 --n-gpu-layers 35

9. 실전 명령어 예시

완전한 로컬 LLM 설치 스크립트:

#!/bin/bash
# install-local-llm.sh

# 1. 필수 패키지 설치
sudo apt update && sudo apt install -y git cmake build-essential python3-pip

# 2. llama.cpp 설치
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake ..
make -j4

# 3. 모델 다운로드
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# 4. 서버 시작
./server -m mistral-7b-v0.1.Q4_K_M.gguf --port 8080 &

echo "LLM 서버가 8080 포트에서 시작되었습니다"

API 통합 예시:

# api_integration.py
import requests

def chat_with_llm(prompt):
    response = requests.post(
        'http://localhost:8080/completion',
        json={
            'prompt': prompt,
            'n_predict': 100,
            'temperature': 0.7
        }
    )
    return response.json()['content']

# 사용 예
result = chat_with_llm("Python에서 람다 함수의 장점은?")
print(result)

성능 모니터링 스크립트:

#!/bin/bash
# monitor-llm.sh

while true; do
    echo "=== LLM 성능 모니터링 ==="
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1
    free -h
    echo "------------------------"
    sleep 5
done

실제 성능 벤치마크 결과:

8GB Q4_K_M 모델: 25-30 tokens/s
13GB Q5_K_M 모델: 15-20 tokens/s
3.8GB Phi-3-mini: 40-50 tokens/s
CPU 기반 실행: 2-5 tokens/s

이 가이드를 따르면 24시간 연속 운영 가능한 로컬 LLM 환경을 구축할 수 있습니다. 특히 테스트 및 개발 환경에서는 빠르게 설정할 수 있으며, 프로덕션 환경에서는 최적화된 성능을 달성할 수 있습니다.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)

DEV Community

로컬 LLM 셋업 가이드 (v24)

로컬 LLM 셋업 가이드 (v24)

1. 개요 및 전제 조건

2. 프레임워크 비교

3. 권장 설정 설치 (llama.cpp + Ollama)

4. 모델 선택 가이드

5. 양자화 유형 설명

6. API 설정 및 도구 통합

7. Systemd 서비스 설정

8. 모니터링 및 성능 튜닝

9. 실전 명령어 예시

Top comments (0)