matias yoon

Posted on May 25

Local LLM Setup Guide (v33)

#ai #llm #developers #tutorial


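1. Overview & Prerequisites

Running an LLM locally is a practical way to cut cloud costs and keep data private. This guide walks through a hands-on setup for current open-weight models on your own hardware, from installation through serving an API.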

Requirements:

Linux (Ubuntu 20.04 or later recommended)
GPU (NVIDIA RTX 30xx or later recommended, at least 8GB VRAM)
At least 16GB RAM (32GB or more recommended)
At least 30GB of disk space (varies by model)

Recommended setup by device:

RTX 3060: 8GB or 12GB VRAM depending on the variant (Q4_K_M models recommended)
RTX 4060: 8GB VRAM (Q4_K_M models recommended)
RTX 4090: 24GB VRAM (Q5_K_M or higher recommended)
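
The device list above boils down to a lookup from available VRAM to a quantization level. A minimal sketch of that rule for a 7B-class model follows; `recommend_quant` is a hypothetical helper, and the cut-off between the 8GB and 24GB rows is an assumption, since the list only gives those two data points.

```python
def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM to a quantization level for a 7B-class model.

    The 8 GB and 24 GB answers follow the device list; everything in
    between is an assumption made for this sketch.
    """
    if vram_gb < 8:
        return "below the 8GB minimum: use a smaller model or CPU inference"
    if vram_gb < 24:  # hypothetical cut-off, not stated in the guide
        return "Q4_K_M"
    return "Q5_K_M or higher"


for card, vram in [("RTX 3060", 8), ("RTX 4060", 8), ("RTX 4090", 24)]:
    print(f"{card} ({vram}GB): {recommend_quant(vram)}")
```

Treat the result as a starting point: longer context windows and larger models shift the answer toward smaller quantization types.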

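2. Framework Comparison

Framework	Pros	Cons	Best for
llama.cpp	Minimal dependencies, optimized C/C++ performance	Command-line centric, few high-level API conveniences	Advanced users, customization
Ollama	Easy single-binary install with Docker-like model management	Limited community support	Developers, testing
vLLM	High throughput, efficient memory management	Complex installation, requires GPU tuning	Production, high-speed serving
LocalAI	Broad API compatibility (OpenAI, Cohere)	Possible performance overhead	Projects that need API integration

Recommendation: for real production environments, use llama.cpp together with its API server.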

3. Installation Guide

3.1 Installing Dependencies

# Based on Ubuntu 22.04
sudo apt update
sudo apt install -y git cmake build-essential python3-pip

# Install CUDA 12.1 (if needed)
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
sudo sh cuda_12.1.1_530.30.02_linux.run

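3.2 Installing llama.cpp

# Download the source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# CPU-only build (current llama.cpp builds with CMake; binaries land in build/bin/)
cmake -B build
cmake --build build --config Release -j $(nproc)

# CUDA build (if needed)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)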

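3.3 Downloading and Preparing a Model

# Create the model directory (/opt requires root, so hand ownership to your user)
sudo mkdir -p /opt/models
sudo chown "$USER" /opt/models
cd /opt/models

# Example: download the Mistral-7B-v0.1 model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf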
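
Before pointing the server at a downloaded file, it can help to confirm that it really is a GGUF file and not, say, an HTML error page saved under a `.gguf` name. The sketch below reads the fixed-size header that GGUF files start with (a 4-byte magic, a 32-bit version, then two 64-bit counts, all little-endian). `read_gguf_header` is a hypothetical helper name, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path: str) -> dict:
    """Read the magic, format version, tensor count and metadata count."""
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values
    version, n_tensors, n_metadata = struct.unpack("<IQQ", head[4:HEADER_SIZE])
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_metadata}
```

Calling `read_gguf_header("/opt/models/mistral-7b-v0.1.Q4_K_M.gguf")` should return a small dict. The check only inspects the first 24 bytes, so it will not notice a download that was cut off partway through; comparing the file size or checksum against the model page covers that case.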

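4. Model Selection Guide

Model	Use case	Recommended GPU	Approx. memory requirement
Mistral-7B-v0.1	General text generation	8GB or more	4GB
Mixtral-8x7B	Complex tasks	24GB or more	about 26GB at Q4_K_M (partial GPU offload on 24GB)
Llama-3-8B	High performance	16GB or more	8GB
Phi-3-mini	Fast responses	8GB or more	3GB

# Inspect the model (metadata from the GGUF header is logged while the model loads)
# llama-cli was named ./main in older make-based checkouts
./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello" -n 8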
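
The memory figures in the table above roughly track the size of the quantized weights. The context window adds a key/value cache on top of that, which is worth estimating before choosing a model for a given card. A back-of-the-envelope sketch, assuming a 16-bit cache and the published Mistral-7B-v0.1 architecture (32 layers, 8 KV heads, head size 128); `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """One K and one V vector per layer, per KV head, per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Mistral-7B-v0.1 with the 2048-token context used elsewhere in this guide
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=2048)
print(f"KV cache at 2048 tokens: {cache / 2**20:.0f} MiB")  # 256 MiB
```

The cache grows linearly with context length, so the same model at a 32k context would need 16 times as much, which can matter more than the choice between two neighboring quantization types.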

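5. Quantization Types

Type	Quality	Size (7B model, approx.)	Typical use
Q4_K_M	Good	4.5GB	General use
Q5_K_M	Very good	5.5GB	Accuracy first
Q6_K	Excellent	6.5GB	Balance of performance and quality
Q8_0	Highest (closest to the original weights)	8.0GB	Highest quality requirements

# Quantization example: convert the Hugging Face model to an F16 GGUF, then quantize it
python3 convert_hf_to_gguf.py /path/to/hf/model --outfile /opt/models/mistral-7b-v0.1.f16.gguf --outtype f16
./build/bin/llama-quantize /opt/models/mistral-7b-v0.1.f16.gguf /opt/models/mistral-7b-v0.1.Q4_K_M.gguf Q4_K_M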
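
The quantization types above trade file size for fidelity by storing weights in fewer bits, with one shared scale per small block. The sketch below illustrates that idea in the style of Q8_0 (blocks of 32 weights, 8-bit integers, one 16-bit scale per block). It is a simplified illustration, not llama.cpp's implementation, and the K-quant types such as Q4_K_M use more elaborate layouts. The weight values are made up.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_block(values: list[float]) -> tuple[float, list[int]]:
    """Quantize one block to signed 8-bit integers with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate weights from the stored integers and scale."""
    return [q * scale for q in quants]


weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]  # made-up data
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}  max abs error={max_err:.6f}")

# Storage cost per block: 32 one-byte integers plus one 2-byte scale
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"bits per weight: {bits_per_weight}")  # 8.5
```

The rounding error per weight is bounded by half the scale, which is why types with more bits sit closer to the original model while lower-bit types save more space.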

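6. API Setup and Integration

6.1 Starting the API Server

# Start a basic API server (llama-server was named ./server in older make-based checkouts)
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep it local-only
./build/bin/llama-server -m /opt/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 -ngl 33 --host 0.0.0.0 --port 8080

# Set environment variables for client scripts
export LLM_API_URL="http://localhost:8080"
export LLM_MODEL="mistral-7b-v0.1.Q4_K_M.gguf"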
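
Loading a multi-gigabyte model takes a while, so scripts that start the server and then send requests right away tend to fail on the first call. A minimal sketch of a readiness check follows, assuming the llama.cpp server's `/health` endpoint, which answers with HTTP 200 once the model is loaded. `server_is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url: str) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while loading
        return False


def wait_until_ready(probe, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

A typical call would be `wait_until_ready(lambda: server_is_ready("http://localhost:8080"))`, using the same address as `LLM_API_URL`. Passing the probe in as a function keeps the polling loop easy to test without a running server.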

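6.2 OpenAI API Compatibility

# config.yaml
# llama.cpp's server takes these settings as command-line flags (section 6.1) and serves
# its OpenAI-compatible /v1 endpoints by default; a file like this is for a wrapper or launcher
server:
  port: 8080
  host: 0.0.0.0
  model: /opt/models/mistral-7b-v0.1.Q4_K_M.gguf
  context: 2048
  ngl: 33

api:
  openai_compat: true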
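
One way to put a config file like this to use is a small launcher that translates it into the command-line flags from section 6.1. The sketch below assumes the YAML has already been loaded into a plain dict (for example with PyYAML's `safe_load`); `FLAG_FOR_KEY` and `build_server_args` are hypothetical names, and the key-to-flag mapping is this sketch's own choice.

```python
config = {
    "server": {
        "port": 8080,
        "host": "0.0.0.0",
        "model": "/opt/models/mistral-7b-v0.1.Q4_K_M.gguf",
        "context": 2048,
        "ngl": 33,
    },
}

# Hypothetical mapping from config keys to the server flags used in section 6.1
FLAG_FOR_KEY = {
    "model": "-m",
    "context": "-c",
    "ngl": "-ngl",
    "host": "--host",
    "port": "--port",
}


def build_server_args(cfg: dict) -> list[str]:
    """Flatten the `server` section into a command-line argument list."""
    args: list[str] = []
    for key, flag in FLAG_FOR_KEY.items():
        if key in cfg["server"]:
            args += [flag, str(cfg["server"][key])]
    return args


print(" ".join(build_server_args(config)))
```

The resulting list can be appended to the server binary's path and handed to `subprocess.run`, which keeps the settings in one file while the server itself still receives ordinary flags.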

7. Systemd Service Setup

# Create the service file
sudo nano /etc/systemd/system/llm-server.service

[Unit]
Description=LLM Server Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/opt/llama.cpp
# Path for a CMake build; older make-based checkouts use /opt/llama.cpp/server
ExecStart=/opt/llama.cpp/build/bin/llama-server -m /opt/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 -ngl 33 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Start the service
sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server
sudo systemctl status llm-server

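8. Monitoring and Performance Optimization

8.1 Performance Monitoring

# Check GPU status (refreshes every second)
nvidia-smi -l 1

# Follow the service logs (systemd sends the server's output to the journal)
journalctl -u llm-server -f

# Monitor CPU and memory
htop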
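
For logging or alerting, the same GPU figures can be read programmatically through `nvidia-smi`'s CSV query mode instead of watching the terminal. A minimal sketch follows; `parse_gpu_line` and `read_gpus` are hypothetical helper names, and the sample row is made up for illustration.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row such as '45, 5120, 24564' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {
        "util_pct": util,
        "used_mib": used,
        "total_mib": total,
        "used_pct": round(100 * used / total, 1),
    }


def read_gpus() -> list[dict]:
    """Query nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


print(parse_gpu_line("45, 5120, 24564"))  # made-up sample row
```

Calling `read_gpus()` on the server host returns live values. Memory usage that stays near the card's total while the model is loaded is expected; a sudden drop usually means the server process has exited.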

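8.2 Performance Tuning Settings

# Tuned launch options
./build/bin/llama-server -m /opt/models/mistral-7b-v0.1.Q4_K_M.gguf \
         -c 2048 \
         -ngl 33 \
         --host 0.0.0.0 \
         --port 8080 \
         --threads 8 \
         --batch-size 512
# To retain the first 1024 prompt tokens when the context overflows,
# send "n_keep": 1024 in the body of each /completion request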

8.3 Benchmark Test

# Text generation benchmark
./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q4_K_M.gguf \
            -ngl 33 \
            -n 256 \
            --temp 0.2 \
            --seed 42 \
            --prompt "Hello, this is a benchmark test for LLM performance."

# Example results (RTX 4090):
# - 1024 tokens generated: 12.3s (83.0 TPS)
# - 2048 tokens generated: 24.8s (82.3 TPS)
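
The TPS figures are simply generated tokens divided by wall-clock seconds; plugging the first example row into that formula gives roughly 83 tokens per second. A small sketch of the arithmetic plus a timing wrapper follows. `tokens_per_second` and `timed` are hypothetical helpers, and `generate` stands for whatever function issues the request.

```python
import time


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput as generated tokens divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


def timed(generate, *args, **kwargs):
    """Run generate(...) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = generate(*args, **kwargs)
    return result, time.perf_counter() - start


print(f"{tokens_per_second(1024, 12.3):.0f} TPS")
```

Measured this way, the number includes prompt processing and any network overhead, so it will read slightly lower than a pure generation rate.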

9. Practical Usage Examples

9.1 API Call Example

# curl request against the OpenAI-compatible completions endpoint
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Python function to calculate Fibonacci numbers",
    "max_tokens": 256,
    "temperature": 0.3
  }'

9.2 Python Client

import requests

def call
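
The listing above stops partway through. As a stand-in, here is a minimal sketch of what a client for this server might look like. It is not the author's code: it uses the standard library's `urllib` instead of `requests` to stay dependency-free, assumes the server's OpenAI-compatible `/v1/completions` endpoint and response shape, and `build_payload`, `extract_text`, and `complete` are hypothetical names.

```python
import json
import urllib.request

API_URL = "http://localhost:8080"  # same address the server listens on in this guide


def build_payload(prompt: str, max_tokens: int = 256, temperature: float = 0.3) -> dict:
    """Request body with the same fields as the curl example."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def extract_text(response: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion response."""
    return response["choices"][0]["text"]


def complete(prompt: str, base_url: str = API_URL, **options) -> str:
    """POST the prompt to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt, **options)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_text(json.load(resp))
```

With the server running, `complete("Write a Python function to calculate Fibonacci numbers")` should return the generated text. Keeping payload construction and response parsing in separate functions makes both easy to check without a live server.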

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
