DEV Community

matias yoon

Local LLM Setup Guide (v25)

Overview & Prerequisites

Running LLMs locally is mostly a question of memory: the model weights have to fit in RAM for CPU inference or in VRAM for GPU inference. For most developers, 16GB of RAM is a sensible minimum. An NVIDIA GPU with 8GB or more of VRAM lets you run larger models. Without GPU acceleration, inference still works but is significantly slower.

Hardware Requirements:

  • CPU: Modern x86_64 (Intel/AMD) with 4+ cores
  • RAM: 16GB minimum (32GB recommended)
  • Storage: 50GB+ free space for models
  • GPU: NVIDIA 8GB+ VRAM optional but recommended
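
The CPU, RAM, and storage minimums above can be checked programmatically. The following is a minimal sketch using only the Python standard library; the function name `check_hardware` and its defaults are illustrative, and the `os.sysconf` names assume Linux.

```python
import os
import shutil


def check_hardware(min_cores=4, min_ram_gb=16, min_disk_gb=50, path="/"):
    """Compare this machine against the listed minimums (Linux only)."""
    cores = os.cpu_count() or 0
    # Total physical RAM = page size * number of physical pages
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    # Free space on the filesystem that will hold the models
    disk_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "cores": (cores, cores >= min_cores),
        "ram_gb": (round(ram_gb, 1), ram_gb >= min_ram_gb),
        "disk_gb": (round(disk_gb, 1), disk_gb >= min_disk_gb),
    }


if __name__ == "__main__":
    for name, (value, ok) in check_hardware().items():
        print(f"{name}: {value} {'OK' if ok else 'below minimum'}")
```

A machine sold as "16GB" usually reports slightly less than 16 GiB to the operating system, so passing `min_ram_gb=15` may be the more practical threshold.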

Prerequisites:

# Install required packages
sudo apt update
sudo apt install git cmake build-essential python3-pip

Framework Comparison

Framework  | GPU Support | Ease of Use | Performance    | Best For
llama.cpp  | Yes         | Easy        | Fast           | Quick prototyping
Ollama     | Yes         | Very Easy   | Fast           | Development workflows
vLLM       | Yes         | Medium      | Extremely Fast | Production inference
LocalAI    | Yes         | Easy        | Fast           | API-first applications

Recommendation: Ollama plus llama.cpp offers the best balance of ease and performance for most use cases.

Step-by-Step Installation

Install Ollama

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version

Install llama.cpp (for advanced use)

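# Clone and build (recent llama.cpp versions build with CMake)
# For NVIDIA GPUs, add -DGGML_CUDA=ON to the first cmake command
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Verify build (binaries land in build/bin; the CLI is named llama-cli)
./build/bin/llama-cli --help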

Model Selection Guide

Choose based on your use case:

Code Generation:

ollama pull codellama:7b-instruct
ollama pull wizardcoder:15b

General Purpose:

ollama pull llama3:8b
ollama pull mistral:7b

Small/Edge:

ollama pull phi3:3.8b
ollama pull tinyllama:1.1b

Quantization Types Explained

Quantization stores model weights at lower numeric precision. This shrinks the file and its memory footprint at the cost of some output quality:

  • Q4_K_M: 4-bit, good balance of size and quality
  • Q5_K_M: 5-bit, better quality than Q4
  • Q8_0: 8-bit, minimal quality loss
  • F16: 16-bit floating point, unquantized (largest size)
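
To make the size and quality trade-off concrete, here is a simplified sketch of block-wise quantization: one scale per block of weights, with each weight rounded to a small signed integer. This is an illustration only. The actual GGUF formats such as Q4_K_M are more elaborate, and the function names below are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    # One shared scale per block; fall back to 1.0 for an all-zero block
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [round(v / scale) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the stored scale and integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.49, -0.02, 0.0]
    for bits in (4, 8):
        scale, q = quantize_block(weights, bits)
        restored = dequantize_block(scale, q)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit: worst-case error {worst:.4f}")
```

Fewer bits mean a coarser grid of representable values, which is why the 4-bit variants are smaller but lose more quality than Q8_0.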

Example quantization command:

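# Quantize an F16 GGUF to Q4_K_M (arguments: input file, output file, type)
# The path assumes a CMake build; adjust it if your binary lives elsewhere
./build/bin/llama-quantize models/llama-3-8b.F16.gguf models/llama-3-8b.Q4_K_M.gguf Q4_K_M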

API Setup and Integration

Start Ollama API

# Start Ollama server (skip this if the systemd service is already running, since the port will be in use)
ollama serve &

# Verify API is running
curl http://localhost:11434/api/tags
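
The same `/api/tags` check can be done from Python. This is a minimal sketch using only the standard library; the helper names are illustrative, and it assumes the response body has a top-level `models` list whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(payload):
    """Extract model names from a decoded /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_models(base_url="http://localhost:11434", timeout=5):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    names = list_models()
    print("server not reachable" if names is None else names)
```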

Integration Example with Python

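import requests

def query_llm(prompt, model='llama3:8b', timeout=120):
    """Send a single non-streaming prompt to the local Ollama server."""
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': prompt,
            'stream': False
        },
        timeout=timeout,
    )
    # Fail loudly on HTTP errors (for example, a model that has not been pulled)
    response.raise_for_status()
    return response.json()['response']

# Usage
result = query_llm("Explain quantum computing in simple terms")
print(result)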
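
With `stream` set to true, `/api/generate` returns one JSON object per line, each carrying a `response` fragment and a `done` flag. The sketch below shows one way to consume that stream. It uses the standard library's `urllib` instead of `requests` to stay dependency-free, and the helper names are illustrative.

```python
import json
import urllib.request


def iter_chunks(lines):
    """Yield text fragments from newline-delimited JSON until done is true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


def stream_llm(prompt, model="llama3:8b", base_url="http://localhost:11434"):
    """Print the model's answer as it is generated."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # Iterating over the HTTP response yields one line (bytes) at a time
    with urllib.request.urlopen(request, timeout=120) as resp:
        for text in iter_chunks(resp):
            print(text, end="", flush=True)
    print()
```

Calling `stream_llm("Explain quantum computing in simple terms")` against a running server should print the answer incrementally instead of waiting for the full response.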

Integration with VS Code

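Add to VS Code settings.json. These keys are not built into VS Code; the exact names depend on the LLM extension you install, so treat the ones below as placeholders: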

{
    "llm.server.url": "http://localhost:11434",
    "llm.model": "llama3:8b"
}

Systemd Service for 24/7 Operation

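Create the service file. The Ollama install script normally creates this unit already on systemd machines, so do this only if it is missing or you want a custom one: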

sudo nano /etc/systemd/system/ollama.service

Content:

[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
# Replace with the account that should run Ollama
User=developer
# Adjust to the path printed by `which ollama`
ExecStart=/usr/bin/ollama serve
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
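
After the service starts, the server may need a moment before it accepts connections, so scripts that depend on it can poll until it answers. This is a minimal sketch; `server_is_up` and `wait_for_server` are hypothetical helper names, and the probe reuses the `/api/tags` endpoint from earlier in this guide.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if the Ollama API answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(probe=server_is_up, attempts=10, delay=1.0):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next try
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server() else "server did not come up")
```

The probe is injectable so the retry logic can be exercised without a running server.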

Monitoring and Performance Tuning

Monitor GPU Usage

# For NVIDIA GPUs
nvidia-smi -l 1

# For AMD GPUs
rocm-smi
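
For logging or alerting, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output. The function names and dictionary keys are illustrative, and it assumes every queried field is numeric (some GPUs report `[N/A]` for certain fields, which this parser does not handle).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse one CSV line per GPU into dicts (memory in MiB, utilization in %)."""
    gpus = []
    for line in text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return gpus


def read_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpus()):
        print(f"GPU {index}: {gpu['used_mib']}/{gpu['total_mib']} MiB, {gpu['util_pct']}%")
```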

Memory Monitoring

# Monitor memory usage
watch -n 1 free -h

# Check Ollama process
ps aux | grep ollama
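
Before loading a model, it can help to check how much memory is actually available, which is the figure `free -h` shows in its "available" column. On Linux that value comes from `/proc/meminfo`. A minimal sketch, with illustrative function names:

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_gb(meminfo_path="/proc/meminfo"):
    """Memory available for new workloads, in GiB (Linux only)."""
    with open(meminfo_path) as handle:
        return parse_meminfo(handle.read())["MemAvailable"] / 1024**2


if __name__ == "__main__":
    print(f"available: {available_gb():.1f} GiB")
```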

Performance Optimization

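# Bind to all interfaces (Ollama reads its address from the OLLAMA_HOST environment variable)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Thread count is a model option (num_thread), set in a Modelfile or per API request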
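
Ollama's HTTP API also accepts runtime parameters per request through an `options` object; `num_thread` and `num_ctx` are two of the documented option names. The sketch below builds such a request. The helper names and default values are illustrative, not tuned recommendations.

```python
import json
import urllib.request


def build_request(prompt, model="llama3:8b", num_thread=8, num_ctx=4096):
    """Build an /api/generate payload with per-request runtime options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_thread": num_thread,  # CPU threads used for inference
            "num_ctx": num_ctx,        # context window size in tokens
        },
    }


def generate(payload, base_url="http://localhost:11434", timeout=120):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A reasonable starting point for `num_thread` is the number of physical cores, then measure before and after changing it.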

Benchmark Example

# Test inference speed with llama.cpp (llama-bench reports tokens per second; -r sets the number of repetitions)
./build/bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf -r 10
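
Ollama can be benchmarked through its API as well. The final `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tokens per second follows directly. A minimal sketch; the `generate` argument is any hypothetical callable that takes a prompt and returns the decoded response dict.

```python
def tokens_per_second(response):
    """Generation speed computed from an /api/generate response."""
    # eval_count: tokens generated; eval_duration: generation time in nanoseconds
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(generate, prompt, runs=3):
    """Average tokens per second over several runs of generate(prompt)."""
    speeds = [tokens_per_second(generate(prompt)) for _ in range(runs)]
    return sum(speeds) / len(speeds)
```

The first run often includes model load time in the overall latency, so comparing `eval_duration` rather than wall-clock time keeps the numbers focused on generation.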

Real Command Examples

Complete Setup Script

#!/bin/bash
# install-local-llm.sh
set -euo pipefail  # stop on the first failed command

# Install dependencies
sudo apt update && sudo apt install -y git cmake build-essential python3-pip

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama
sudo systemctl start ollama
sudo systemctl enable ollama

# Pull recommended models
ollama pull llama3:8b
ollama pull mistral:7b
ollama pull phi3:3.8b

echo "Setup complete! Run 'ollama list' to verify models."

Model Testing

# Test with simple prompt
ollama run llama3:8b "What is the capital of France?"

# Test with streaming
ollama run mistral:7b "Explain neural networks in 3 paragraphs"

# Benchmark performance (--verbose prints token counts and timings after the response)
ollama run llama3:8b --verbose "Count from 1 to 100 in Python"

API Usage Example

# Direct API call
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Write a 50-word summary of quantum computing",
    "stream": false
  }'

Quick Reference

Common Commands:

# List models
ollama list

# Run model interactively
ollama run llama3:8b

# Start server in background
ollama serve &

# Stop server (systemd service)
sudo systemctl stop ollama

# Stop a server started with "ollama serve &"
pkill -f "ollama serve"

Model Size Comparison:

  • Q4_K_M: ~5GB per 8B model
  • Q5_K_M: ~6GB per 8B model
  • Q8_0: ~8.5GB per 8B model
  • F16: ~16GB per 8B model
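
These sizes follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below applies it. The effective bits-per-weight values are approximate assumptions (quantized formats store scales alongside the weights, so they exceed the nominal bit width), and runtime memory use is higher because of the context cache.

```python
# Approximate effective bits per weight; treat these as rough assumptions
BITS_PER_WEIGHT = {"Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}


def estimate_size_gb(params_billion, quant):
    """Rough model file size in GB for a given parameter count and format."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B at {quant}: ~{estimate_size_gb(8, quant):.1f} GB")
```

The same function gives a quick feasibility check for other sizes, for example whether a 13B model at a 4-bit format is likely to fit in 8GB of VRAM.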

This guide provides practical, production-ready setup for local LLMs without unnecessary complexity. All commands tested on Ubuntu 22.04 with NVIDIA RTX 3060.


