Step-by-Step: Manual vLLM Setup on Google Cloud L4 (Debian)

Here’s a complete, consolidated guide to manually setting up vLLM on a Google Cloud VM with an NVIDIA L4 GPU, using Debian and pip. This includes GPU driver installation, Python environment setup, and vLLM deployment.


🧱 1. Create a VM with NVIDIA L4 GPU

  • Go to Google Cloud Console
  • Navigate to Compute Engine > VM instances
  • Click Create Instance
  • Choose:
    • Machine type: g2-standard-4 or higher (on Google Cloud, the L4 is only available with G2 machine types)
    • GPU: NVIDIA L4 (1 unit, included with the G2 machine type)
    • Boot disk: Debian 12 (Bookworm)
    • Disk size: at least 100 GB recommended
  • Enable API access, SSH, and firewall rules if needed (a CLI alternative is shown below)
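
If you prefer the CLI, here's a minimal gcloud sketch (the instance name and zone are placeholders; adjust them for your project):

gcloud compute instances create vllm-l4 \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE

The --maintenance-policy=TERMINATE flag is required because GPU instances cannot live-migrate during host maintenance.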

⚙️ 2. Install System Dependencies

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential dkms linux-headers-$(uname -r)

🧩 3. Enable Non-Free Repositories

echo "deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware" | sudo tee /etc/apt/sources.list.d/non-free.list
sudo apt update
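
To confirm the driver package is now visible from the non-free component:

apt policy nvidia-driver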

🔧 4. Install NVIDIA Driver (DKMS method)

sudo apt install -y nvidia-driver
sudo reboot

After reboot, verify:

nvidia-smi

✅ You should see your NVIDIA L4 GPU listed with driver and CUDA version.
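
For a script-friendly check, nvidia-smi's query mode prints just the fields you ask for:

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv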


🐍 5. Set Up Python Environment

sudo apt install -y python3 python3-pip python3-venv
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install --upgrade pip

🔥 6. Install PyTorch with CUDA Support

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

The cu118 wheels support the L4 (compute capability 8.9). Be aware that recent vLLM releases are built against CUDA 12.x and pin their own PyTorch version, so pip install vllm in step 7 may swap in a different build; if so, adjust the index URL (for example cu121) to match.
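
To confirm which CUDA build of PyTorch actually landed in the environment:

python -c "import torch; print(torch.__version__, torch.version.cuda)"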


🧠 7. Install vLLM

pip install vllm

✅ 8. Verify GPU Access

python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0))"

Expected output:

CUDA available: True
Device: NVIDIA L4

🧪 9. Run vLLM Server

python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b

Note the dots in the module path. Recent vLLM releases also ship a shorter vllm serve command that wraps this entrypoint.

To expose externally:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b --host 0.0.0.0 --port 8000

Make sure port 8000 is open in your firewall settings.
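
On Google Cloud you can open the port with a firewall rule, for example (the rule name is arbitrary, and 0.0.0.0/0 exposes the port to the whole internet, so tighten --source-ranges for anything beyond a quick test):

gcloud compute firewall-rules create allow-vllm-8000 \
  --allow=tcp:8000 \
  --source-ranges=0.0.0.0/0

Once the server is up, a quick smoke test against the OpenAI-compatible completions endpoint:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-1.3b", "prompt": "Hello, my name is", "max_tokens": 32}'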


🧼 Optional Cleanup & Tips

  • Use tmux or screen to keep the server running after logout (see the example after this list)
  • Monitor GPU usage with nvidia-smi
  • Increase the disk size if you plan to load larger models
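
A minimal tmux session, for example (the session name is arbitrary):

tmux new -s vllm
source vllm-env/bin/activate
python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b --host 0.0.0.0 --port 8000
# Detach with Ctrl-b d; reattach later with: tmux attach -t vllm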
