Chandrani Mukherjee

Deploy Pixtral at Scale: vLLM + Docker Made Simple

Large language models are compute-heavy, and serving them efficiently calls for an optimized inference engine like vLLM. In this guide, we'll walk through containerizing Pixtral (Mistral AI's open multimodal model) with Docker, running it with vLLM, and exposing the model endpoint for external access.


📦 Prerequisites

  • Docker installed (a recent version recommended)
  • An NVIDIA GPU with CUDA drivers and the NVIDIA Container Toolkit, so containers can use --gpus all (a quick smoke test follows below)
  • Pixtral model weights (a Hugging Face repo or a local path)
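
If you're not sure containers can see the GPU, a quick smoke test (assuming the NVIDIA Container Toolkit is installed) looks like this:

docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi

If this prints your GPU table, the --gpus all flag used later will work.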

🛠 Step 1: Create a Dockerfile

FROM nvidia/cuda:12.2.2-runtime-ubuntu22.04

# System dependencies
RUN apt-get update && apt-get install -y \
    git wget curl python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install vLLM
RUN pip3 install --upgrade pip
RUN pip3 install vllm

# Set working directory
WORKDIR /app

# Expose port for API
EXPOSE 8000

# Default command (can be overridden)
CMD ["bash"]
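
Note that this Dockerfile always installs the latest vLLM, so builds aren't reproducible. Pinning the version helps; the pin below is illustrative only, so check the vLLM release notes for a version that supports Pixtral:

# Pin vLLM instead of taking the latest release (version shown is illustrative)
RUN pip3 install "vllm==0.6.1"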

⚙️ Step 2: Build the Docker Image

docker build -t pixtral-vllm .

📂 Step 3: Run the Container with Model Access

You can start the vLLM server inside the container to serve Pixtral.

docker run --gpus all -it -p 8000:8000 pixtral-vllm \
  python3 -m vllm.entrypoints.openai.api_server \
  --model <huggingface_repo_or_local_path_to_pixtral> \
  --host 0.0.0.0
  • Replace <huggingface_repo_or_local_path_to_pixtral> with your model location.
  • The container serves an OpenAI-compatible API on port 8000. By default the served model name is whatever you passed to --model, and clients must use the same name in requests; a concrete example follows below.
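
As a concrete sketch, assuming the public mistralai/Pixtral-12B-2409 weights, with the host's Hugging Face cache mounted so the download survives container restarts (Pixtral uses the Mistral tokenizer format, hence --tokenizer-mode):

docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  pixtral-vllm \
  python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Pixtral-12B-2409 \
  --tokenizer-mode mistral \
  --host 0.0.0.0

If the repo is gated, pass your Hugging Face token into the container with -e HF_TOKEN=<your_token>.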

🌐 Step 4: Test the API Endpoint

After running the container, test with curl:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<huggingface_repo_or_local_path_to_pixtral>", "prompt": "Hello Pixtral!", "max_tokens": 64}'

Expected response (abridged):

{
  "id": "cmpl-1234",
  "object": "text_completion",
  "choices": [
    {"text": "Hello there!"}
  ]
}
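
Since Pixtral is a multimodal model, the more interesting test is the chat endpoint with an image. A sketch using the OpenAI vision message format (the image URL is a placeholder; substitute a real, reachable one):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<huggingface_repo_or_local_path_to_pixtral>",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'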

📡 Step 5: Expose for External Access

If you want to make the API accessible from outside the host:

  • Map the container port 8000 to the host (-p 8000:8000, as done above).
  • Ensure the firewall/security group allows inbound traffic on port 8000; you can then verify connectivity as shown below.
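
A quick sanity check from a different machine (replace <server-ip> with your host's address):

curl http://<server-ip>:8000/v1/models

This lists the models the server is currently serving.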

For production, consider:

  • A reverse proxy such as NGINX or Traefik for TLS and routing
  • Authentication before exposing the API publicly (a minimal option is sketched below)
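
A minimal authentication option, assuming a recent vLLM release, is the server's built-in --api-key flag, which rejects requests that don't carry the matching bearer token:

docker run --gpus all -p 8000:8000 pixtral-vllm \
  python3 -m vllm.entrypoints.openai.api_server \
  --model <huggingface_repo_or_local_path_to_pixtral> \
  --host 0.0.0.0 \
  --api-key <your_secret_key>

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer <your_secret_key>"

This is not a substitute for TLS; terminate HTTPS at the reverse proxy in front of it.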

✅ Conclusion

You now have Pixtral deployed with vLLM inside Docker, exposed through an OpenAI-compatible API. This setup lets you scale inference workloads efficiently while keeping the deployment portable.

