Containerizing standard Python web apps is easy. Containerizing Python apps that need to talk to NVIDIA GPUs, manage gRPC streams, handle WebSockets at scale, and integrate with complex monitoring stacks? That's a different beast.
In this article, I'll share how we structured our Docker deployment for a GPU-accelerated Speech-to-Speech service, moving from a fragile "works on my machine" setup to a robust production infrastructure.
The Challenge: GPU & Environment Complexity
We faced three main challenges:
- Dual Deployment Modes: We needed to support both "Cloud" (NVIDIA NVCF) and "Self-Hosted" (on-premise GPU) modes from the same image.
- Log Management: High-volume WebSocket traffic generates massive logs. We needed structured JSON for machines (Grafana/Loki) but readable text for developers.
- Process Management: Uvicorn needs careful tuning for async workloads to avoid blocking the event loop.
1. The Entrypoint Pattern
Instead of a simple CMD ["uvicorn", ...], we implemented a robust run.sh entrypoint script. This allows us to handle environment setup before the application starts.
Dockerfile:
# ... build steps ...
COPY run.sh .
RUN chmod +x run.sh
ENV FASTAPI_ENV=production
CMD ["./run.sh", "--env", "production"]
run.sh (Simplified):
#!/bin/bash
# Smart defaults based on environment
if [ "$FASTAPI_ENV" = "production" ]; then
    export LOG_FORMAT="json"
    WORKERS=${WORKERS:-4}
else
    export LOG_FORMAT="human"
    WORKERS=1
fi

# exec replaces the shell, so Uvicorn receives signals (SIGTERM) directly.
# LOG_FORMAT is read by the app's logging config (see section 3).
exec uvicorn app:app \
    --host 0.0.0.0 \
    --workers "$WORKERS"
This pattern gives us flexibility: flipping a single environment variable, FASTAPI_ENV, reconfigures the container's entire logging strategy and worker count.
2. Docker Compose Profiles
We use Docker Compose not just for local dev, but for defining deployment "flavors". By having separate compose files sharing the same core image, we document the requirements for each mode effectively.
docker-compose.cloud.yml (Minimal, API Keys only):
services:
  riva-s2s:
    image: riva-s2s:latest
    environment:
      - RIVA_DEPLOYMENT_MODE=cloud
      - RIVA_API_KEY=${RIVA_API_KEY}
docker-compose.selfhosted.yml (Heavy, internal GPU networking):
services:
  riva-s2s:
    image: riva-s2s:latest
    environment:
      - RIVA_DEPLOYMENT_MODE=self-hosted
      - RIVA_ASR_SERVER=10.0.1.5:50051
      - RIVA_TTS_SERVER=10.0.1.6:50051
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
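On the application side, the container branches on RIVA_DEPLOYMENT_MODE at startup. A minimal sketch of what that can look like, using the variable names from the compose files above (the RivaSettings class and load_settings helper are illustrative, not our exact code):

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class RivaSettings:
    """Resolved deployment configuration for one container instance."""
    mode: str
    asr_server: Optional[str] = None
    tts_server: Optional[str] = None
    api_key: Optional[str] = None


def load_settings() -> RivaSettings:
    mode = os.environ.get("RIVA_DEPLOYMENT_MODE", "cloud")
    if mode == "self-hosted":
        # Self-hosted: gRPC endpoints of the on-prem Riva servers are
        # mandatory, so fail fast with a KeyError if they are missing.
        return RivaSettings(
            mode=mode,
            asr_server=os.environ["RIVA_ASR_SERVER"],
            tts_server=os.environ["RIVA_TTS_SERVER"],
        )
    # Cloud (NVCF): only an API key is needed.
    return RivaSettings(mode=mode, api_key=os.environ.get("RIVA_API_KEY"))
```

Failing fast on missing endpoints keeps misconfigured self-hosted containers from starting half-broken.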
3. Solving the Logging Nightmare
In a real-time WebSocket service, plain-text logging is nearly useless: you drown in thousands of "Connected" / "Disconnected" messages. We implemented structured JSON logging that Loki can ingest.
Key Trick:
In your logging_config.py, check the LOG_FORMAT env var. If it's json, switch your formatter to python-json-logger.
# Docker sees this:
{"timestamp": "2026-03-04T12:00:00", "level": "INFO", "session_id": "abc-123", "event": "audio_chunk_received", "size_bytes": 4096}
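A minimal, standard-library-only sketch of that switch (the article uses python-json-logger; this illustrative JsonFormatter implements the same idea without the extra dependency, and the handled extra fields are assumptions):

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (Loki-friendly)."""

    # Structured fields we expect callers to pass via `extra=` (illustrative).
    EXTRA_FIELDS = ("session_id", "event", "size_bytes")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def configure_logging() -> None:
    """Pick the formatter based on LOG_FORMAT, as set by run.sh."""
    handler = logging.StreamHandler()
    if os.environ.get("LOG_FORMAT") == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)
```

A call site then looks like `logger.info("audio_chunk_received", extra={"session_id": "abc-123", "size_bytes": 4096})`, which produces one JSON line per event.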
This allows us to write Grafana queries like:
sum by (session_id) (rate({app="riva-s2s"} | json | event="audio_chunk_received"[1m]))
Benefits of this Approach
- Immutability: The same Docker image runs in cloud prototyping and on-prem production.
- Observability: JSON logging + Loki means we can trace a single audio packet through the entire stack.
- Scalability: The run.sh script makes it trivial for Kubernetes or Docker Swarm to override worker counts without rebuilding.
Conclusion
Containerizing AI services isn't just about wrapping Python in Linux. It's about exposing the necessary knobs (Environment Variables) to control the complex underlying hardware and software stack without rebuilding the container. Treat your Dockerfile and entrypoint as first-class code.