Shannon Dias

Beyond `ollama run`: Production-Ready DeepSeek R1 Deployment with vLLM and Nginx

Tired of simplified tutorials that show you how to run DeepSeek R1 on a personal laptop but leave your dedicated server completely exposed to the web? Let's build a secure, high-throughput enterprise deployment.

In this comprehensive guide, we build a production-grade stack using Ubuntu 22.04, Docker, vLLM, and Nginx.

The Stack Architecture

  • Inference Engine: vLLM (PagedAttention for efficient KV-cache memory management, plus continuous batching for throughput).
  • Model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic (optimized for single-node multi-GPU VRAM constraints).
  • Security Layer: UFW firewall rules combined with an Nginx reverse proxy enforcing secure Bearer token authentication.
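
To make the engine and model pieces concrete, here is a minimal sketch of how the vLLM container might be launched. The image tag, container name, port 8000, the 16g shared-memory size, and a tensor-parallel size of 2 are illustrative assumptions, not values taken from the full tutorial; adjust them to your hardware. The script only prints the command unless you set DRY_RUN=0.

```shell
#!/bin/sh
set -eu

# Assumed values for illustration; override them from the environment.
MODEL="${MODEL:-neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic}"
PORT="${PORT:-8000}"          # host port, published on loopback only
SHM_SIZE="${SHM_SIZE:-16g}"   # shared memory for multi-GPU NCCL traffic
TP_SIZE="${TP_SIZE:-2}"       # number of GPUs to shard the model across
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the command, 0 = run it

run_vllm() {
  # Build the docker argument list as positional parameters.
  set -- run -d --name vllm --gpus all \
    --shm-size "${SHM_SIZE}" \
    -p "127.0.0.1:${PORT}:8000" \
    vllm/vllm-openai:latest \
    --model "${MODEL}" \
    --tensor-parallel-size "${TP_SIZE}"

  if [ "${DRY_RUN}" = "1" ]; then
    echo "docker $*"
  else
    docker "$@"
  fi
}

run_vllm
```

Publishing the port as 127.0.0.1:8000 rather than a bare 8000 keeps the API reachable only from the host itself, so the reverse proxy becomes the single public entry point.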

Critical Deployment Elements Covered

  1. Fixing Docker's Firewall Bypass: Docker writes its own iptables rules, so published ports skip UFW entirely. We bind the model port to 127.0.0.1 so your expensive GPU isn't open to the public internet.
  2. Shared Memory Configuration: Allocating an adequate --shm-size so multi-GPU NCCL communication doesn't exhaust the container's shared memory and crash.
  3. Nginx SSE Streaming Fixes: Disabling proxy_buffering and extending timeouts to handle token-by-token text generation seamlessly.
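
As a rough illustration of the proxy side, the sketch below prints an Nginx server block that checks a Bearer token and passes streamed responses through unbuffered. The token, server name, upstream address, and the 3600s timeout are placeholder assumptions, and TLS is left out to keep the example short; the full tutorial's configuration may differ.

```shell
#!/bin/sh
set -eu

# Hypothetical values: replace with your own token, domain, and upstream.
API_TOKEN="${API_TOKEN:-change-me}"
SERVER_NAME="${SERVER_NAME:-llm.example.com}"
UPSTREAM="${UPSTREAM:-127.0.0.1:8000}"

# Print an Nginx server block to stdout; redirect it into your site config.
render_nginx_conf() {
  cat <<EOF
server {
    listen 80;  # TLS directives omitted for brevity; add them before going live
    server_name ${SERVER_NAME};

    location / {
        # Reject requests whose Authorization header is not the expected token.
        if (\$http_authorization != "Bearer ${API_TOKEN}") {
            return 401;
        }

        proxy_pass http://${UPSTREAM};
        proxy_http_version 1.1;

        # Forward tokens as they are generated instead of buffering them.
        proxy_buffering off;
        # Allow long generations; 3600s is an illustrative value.
        proxy_read_timeout 3600s;
    }
}
EOF
}

render_nginx_conf
```

You could write the output to a file under your Nginx configuration directory, validate it with nginx -t, and then reload Nginx.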

For the complete code blocks, configuration files, and commands, see the step-by-step tutorial: Host DeepSeek R1 on a Dedicated Server with vLLM