matias yoon

Posted on May 26

로컬 LLM 셋업 가이드 (v44)

#ai #llm #developers #tutorial

로컬 LLM 셋업 가이드 (v44)

1. 개요 및 사전 요구사항

로컬 LLM(대형 언어 모델)을 실행하는 것은 민감한 코드를 외부로 노출하지 않고 AI 기능을 사용하는 효과적인 방법입니다. 이 가이드는 최소한의 리소스로도 작동하는 실용적인 로컬 LLM 구성을 제공합니다.

필수 사전 조건:

Linux 시스템 (Ubuntu 20.04 이상 권장)
NVIDIA GPU (12GB 이상 메모리 권장)
최소 16GB RAM (32GB 이상 권장)
50GB 이상 여유 저장공간

시스템 확인 명령어:

# GPU 확인
nvidia-smi

# RAM 확인
free -h

# 디스크 공간 확인
df -h

# CPU 정보 확인
lscpu

2. 프레임워크 비교: llama.cpp vs Ollama vs vLLM vs LocalAI

프레임워크	장점	단점	적합성
llama.cpp	최소 리소스, 직접 컴파일 가능, 확장성 높음	복잡한 설치, API 제한	고급 사용자, 성능 최적화
Ollama	간단한 설치, Docker 지원, 모델 관리 용이	GPU 메모리 제약, 상대적 느림	빠른 실험, 개발용
vLLM	높은 성능, 병렬 처리, 대용량 모델 지원	복잡한 설정, 메모리 요구량 높음	프로덕션 환경
LocalAI	다양한 API 호환성, Docker 기반, 쉬운 배포	리소스 소모량 많음	웹 애플리케이션 연동

추천: 최적의 균형을 위해 llama.cpp + Ollama 조합을 사용합니다.

3. 권장 설치 절차

3.1 llama.cpp 설치

# 소스 코드 다운로드
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# 빌드
make clean
make

# 필요 시 CUDA 지원 추가
make clean
make LLAMA_CUDA=1

3.2 모델 다운로드 및 변환

# 모델 디렉터리 생성
mkdir -p models/llama

# 예시: Mistral 7B 모델 다운로드
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -O models/llama/mistral-7b.gguf

# 모델 변환 (필요 시)
python3 convert.py models/llama/mistral-7b.gguf

3.3 Ollama 설치 및 설정

# Ollama 설치 (Ubuntu)
curl -fsSL https://ollama.com/install.sh | sh

# Ollama 시작
sudo systemctl start ollama
sudo systemctl enable ollama

# 모델 로드
ollama pull mistral
ollama pull llama3

4. 모델 선택 가이드

모델	사용 사례	권장 메모리	특징
Mistral 7B Q4_K_M	일반 개발, 코드 생성	8GB	빠른 성능, 저비용
Llama3 8B Q4_K_M	복잡한 프로젝트, 정확도 우선	12GB	최신 기술, 높은 정확도
Mixtral 8x7B Q4_K_M	대량 데이터 처리	24GB+	확장성 높음
Phi-3 3.8B Q4_K_M	빠른 응답, 간단한 작업	6GB	메모리 효율적

# Ollama에서 모델 실행 예시
ollama run mistral:7b "Python 코드를 작성해 주세요"
ollama run llama3:8b "API 문서 작성"

5. 양자화 유형 설명

유형	설명	성능	메모리 사용량
Q4_K_M	LLM-KV 양자화, 최적화된 성능	높음	4GB
Q5_K_M	Q4 + 약간 더 정확한 양자화	매우 높음	5GB
Q6_K	높은 정확도, 중간 메모리	매우 높음	6GB
Q8_0	정밀도 유지, 최대 성능	최고	8GB

# llama.cpp에서 양자화된 모델 실행
./main -m models/llama/mistral-7b.gguf -p "Your prompt here" -n 128

6. API 설정 및 기존 도구 통합

6.1 API 서버 시작

# llama.cpp API 서버 시작
./server -m models/llama/mistral-7b.gguf -c 2048 --host 0.0.0.0 --port 8080

# Ollama API 서버 시작
ollama serve

6.2 Python 클라이언트 예제

# requirements.txt
openai==1.3.5
requests==2.31.0

# api_client.py
import requests
import json

class LocalLLMClient:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url

    def generate(self, prompt, max_tokens=128):
        response = requests.post(
            f"{self.base_url}/completion",
            json={
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": 0.7
            }
        )
        return response.json()['completion']

# 사용 예시
client = LocalLLMClient()
result = client.generate("Python으로 hello world 프로그램 작성")
print(result)

6.3 VSCode 확장 통합

// settings.json
{
    "openai.apiKey": "local",
    "openai.baseURL": "http://localhost:8080/v1",
    "openai.model": "mistral-7b"
}

7. 시스템드 서비스 설정

# /etc/systemd/system/llama-server.service
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/main -m /home/developer/models/llama/mistral-7b.gguf -c 2048 --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# 서비스 시작
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server

# 상태 확인
sudo systemctl status llama-server

8. 모니터링 및 성능 튜닝

8.1 성능 모니터링 스크립트

#!/bin/bash
# monitor.sh

while true; do
    echo "=== Memory Usage ==="
    free -h
    echo "=== GPU Usage ==="
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
    echo "=== CPU Usage ==="
    top -b -n 1 | head -20
    echo "=== Network ==="
    ss -tuln | grep :8080
    sleep 30
done

8.2 메모리 최적화

# 메모리 사용량 줄이기
./main -m models/llama/mistral-7b.gguf \
  -c 2048 \
  -n 128 \
  --ctx-size 2048 \
  --threads 4 \
  --batch-size 512

# 메모리 기반 파라미터
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4

8.3 벤치마킹 명령어


bash
# llama.cpp 벤치마크
./main -m models/llama/mistral-7b.gguf \
  -p "The quick brown fox jumps over

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)

DEV Community

로컬 LLM 셋업 가이드 (v44)

로컬 LLM 셋업 가이드 (v44)

1. 개요 및 사전 요구사항

2. 프레임워크 비교: llama.cpp vs Ollama vs vLLM vs LocalAI

3. 권장 설치 절차

3.1 llama.cpp 설치

3.2 모델 다운로드 및 변환

3.3 Ollama 설치 및 설정

4. 모델 선택 가이드

5. 양자화 유형 설명

6. API 설정 및 기존 도구 통합

6.1 API 서버 시작

6.2 Python 클라이언트 예제

6.3 VSCode 확장 통합

7. 시스템드 서비스 설정

8. 모니터링 및 성능 튜닝

8.1 성능 모니터링 스크립트

8.2 메모리 최적화

8.3 벤치마킹 명령어

Top comments (0)