Step-by-Step: Manual vLLM Setup on Google Cloud L4 (Debian)

Here’s a complete, consolidated guide to manually setting up vLLM on a Google Cloud VM with an NVIDIA L4 GPU, using Debian and pip. This includes GPU driver installation, Python environment setup, and vLLM deployment.


🧱 1. Create a VM with NVIDIA L4 GPU

  • Go to Google Cloud Console
  • Navigate to Compute Engine > VM instances
  • Click Create Instance
  • Choose:
    • Machine type: g2-standard-4 or higher (on Google Cloud, the L4 is only available with G2 machine types)
    • GPU: NVIDIA L4 (1 unit, included with the G2 machine type)
    • Boot disk: Debian 12 (Bookworm)
    • Disk size: at least 100 GB recommended
  • Enable API access, SSH, and firewall rules if needed (a CLI alternative is shown below)
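
If you prefer the CLI, here's a minimal gcloud sketch (the instance name and zone are placeholders; adjust them for your project):

gcloud compute instances create vllm-l4 \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE

The --maintenance-policy=TERMINATE flag is required because GPU instances cannot live-migrate during host maintenance.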

⚙️ 2. Install System Dependencies

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential dkms linux-headers-$(uname -r)

🧩 3. Enable Non-Free Repositories

echo "deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware" | sudo tee /etc/apt/sources.list.d/non-free.list
sudo apt update
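
To confirm the driver package is now visible from the non-free component:

apt policy nvidia-driver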

🔧 4. Install NVIDIA Driver (DKMS method)

sudo apt install -y nvidia-driver
sudo reboot

After reboot, verify:

nvidia-smi

✅ You should see your NVIDIA L4 GPU listed with driver and CUDA version.
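
For a script-friendly check, nvidia-smi's query mode prints just the fields you ask for:

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv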


🐍 5. Set Up Python Environment

sudo apt install -y python3 python3-pip python3-venv
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install --upgrade pip

🔥 6. Install PyTorch with CUDA Support

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

The cu118 wheels support the L4 (compute capability 8.9). Be aware that recent vLLM releases are built against CUDA 12.x and pin their own PyTorch version, so pip install vllm in step 7 may swap in a different build; if so, adjust the index URL (for example cu121) to match.
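
To confirm which CUDA build of PyTorch actually landed in the environment:

python -c "import torch; print(torch.__version__, torch.version.cuda)"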


🧠 7. Install vLLM

pip install vllm

✅ 8. Verify GPU Access

python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0))"

Expected output:

CUDA available: True
Device: NVIDIA L4

🧪 9. Run vLLM Server

python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b

Note the dots in the module path. Recent vLLM releases also ship a shorter vllm serve command that wraps this entrypoint.

To expose externally:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b --host 0.0.0.0 --port 8000

Make sure port 8000 is open in your firewall settings.
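
On Google Cloud you can open the port with a firewall rule, for example (the rule name is arbitrary, and 0.0.0.0/0 exposes the port to the whole internet, so tighten --source-ranges for anything beyond a quick test):

gcloud compute firewall-rules create allow-vllm-8000 \
  --allow=tcp:8000 \
  --source-ranges=0.0.0.0/0

Once the server is up, a quick smoke test against the OpenAI-compatible completions endpoint:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-1.3b", "prompt": "Hello, my name is", "max_tokens": 32}'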


🧼 Optional Cleanup & Tips

  • Use tmux or screen to keep the server running after logout (see the example after this list)
  • Monitor GPU usage with nvidia-smi
  • Increase the disk size if you plan to load larger models
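
A minimal tmux session, for example (the session name is arbitrary):

tmux new -s vllm
source vllm-env/bin/activate
python -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b --host 0.0.0.0 --port 8000
# Detach with Ctrl-b d; reattach later with: tmux attach -t vllm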
