Containerizing standard Python web apps is easy. Containerizing Python apps that need to talk to NVIDIA GPUs, manage gRPC streams, handle WebSockets at scale, and integrate with complex monitoring stacks? That's a different beast.
In this article, I'll share how we structured our Docker deployment for a GPU-accelerated Speech-to-Speech service, moving from a fragile "works on my machine" setup to a robust production infrastructure.
The Challenge: GPU & Environment Complexity
We faced three main challenges:
- Dual Deployment Modes: We needed to support both "Cloud" (NVIDIA NVCF) and "Self-Hosted" (on-premise GPU) modes from the same image.
- Log Management: High-volume WebSocket traffic generates massive logs. We needed structured JSON for machines (Grafana/Loki) but readable text for developers.
- Process Management: Uvicorn needs careful tuning for async workloads to avoid blocking the event loop.
1. The Entrypoint Pattern
Instead of a simple CMD ["uvicorn", ...], we implemented a robust run.sh entrypoint script. This allows us to handle environment setup before the application starts.
Dockerfile:
# ... build steps ...
COPY run.sh .
RUN chmod +x run.sh
ENV FASTAPI_ENV=production
CMD ["./run.sh", "--env", "production"]
run.sh (Simplified):
#!/bin/bash
# Smart defaults based on environment
if [ "$FASTAPI_ENV" = "production" ]; then
    export LOG_FORMAT="json"
    WORKERS=${WORKERS:-4}
else
    export LOG_FORMAT="human"
    WORKERS=1
fi

# exec replaces the shell, so Uvicorn receives signals (SIGTERM) directly.
# LOG_FORMAT is read by the app's logging config (see section 3).
exec uvicorn app:app \
    --host 0.0.0.0 \
    --workers "$WORKERS"
This pattern gives us flexibility: flipping a single environment variable, FASTAPI_ENV, reconfigures the container's entire logging strategy and worker count.
2. Docker Compose Profiles
We use Docker Compose not just for local dev, but for defining deployment "flavors". By having separate compose files sharing the same core image, we document the requirements for each mode effectively.
docker-compose.cloud.yml (Minimal, API Keys only):
services:
  riva-s2s:
    image: riva-s2s:latest
    environment:
      - RIVA_DEPLOYMENT_MODE=cloud
      - RIVA_API_KEY=${RIVA_API_KEY}
docker-compose.selfhosted.yml (Heavy, internal GPU networking):
services:
  riva-s2s:
    image: riva-s2s:latest
    environment:
      - RIVA_DEPLOYMENT_MODE=self-hosted
      - RIVA_ASR_SERVER=10.0.1.5:50051
      - RIVA_TTS_SERVER=10.0.1.6:50051
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
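On the application side, the container branches on RIVA_DEPLOYMENT_MODE at startup. A minimal sketch of what that can look like, using the variable names from the compose files above (the RivaSettings class and load_settings helper are illustrative, not our exact code):

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class RivaSettings:
    """Resolved deployment configuration for one container instance."""
    mode: str
    asr_server: Optional[str] = None
    tts_server: Optional[str] = None
    api_key: Optional[str] = None


def load_settings() -> RivaSettings:
    mode = os.environ.get("RIVA_DEPLOYMENT_MODE", "cloud")
    if mode == "self-hosted":
        # Self-hosted: gRPC endpoints of the on-prem Riva servers are
        # mandatory, so fail fast with a KeyError if they are missing.
        return RivaSettings(
            mode=mode,
            asr_server=os.environ["RIVA_ASR_SERVER"],
            tts_server=os.environ["RIVA_TTS_SERVER"],
        )
    # Cloud (NVCF): only an API key is needed.
    return RivaSettings(mode=mode, api_key=os.environ.get("RIVA_API_KEY"))
```

Failing fast on missing endpoints keeps misconfigured self-hosted containers from starting half-broken.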
3. Solving the Logging Nightmare
In a real-time WebSocket service, plain-text logging is nearly useless: you drown in thousands of "Connected" / "Disconnected" messages. We implemented structured JSON logging that Loki can ingest.
Key Trick:
In your logging_config.py, check the LOG_FORMAT env var. If it's json, switch your formatter to python-json-logger.
# Docker sees this:
{"timestamp": "2026-03-04T12:00:00", "level": "INFO", "session_id": "abc-123", "event": "audio_chunk_received", "size_bytes": 4096}
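A minimal, standard-library-only sketch of that switch (the article uses python-json-logger; this illustrative JsonFormatter implements the same idea without the extra dependency, and the handled extra fields are assumptions):

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (Loki-friendly)."""

    # Structured fields we expect callers to pass via `extra=` (illustrative).
    EXTRA_FIELDS = ("session_id", "event", "size_bytes")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def configure_logging() -> None:
    """Pick the formatter based on LOG_FORMAT, as set by run.sh."""
    handler = logging.StreamHandler()
    if os.environ.get("LOG_FORMAT") == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)
```

A call site then looks like `logger.info("audio_chunk_received", extra={"session_id": "abc-123", "size_bytes": 4096})`, which produces one JSON line per event.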
This allows us to write Grafana queries like:
sum by (session_id) (rate({app="riva-s2s"} | json | event="audio_chunk_received"[1m]))
Benefits of this Approach
- Immutability: The same Docker image runs in cloud prototyping and on-prem production.
- Observability: JSON logging + Loki means we can trace a single audio packet through the entire stack.
- Scalability: The run.sh script makes it trivial for Kubernetes or Docker Swarm to override worker counts without rebuilding.
Conclusion
Containerizing AI services isn't just about wrapping Python in Linux. It's about exposing the necessary knobs (Environment Variables) to control the complex underlying hardware and software stack without rebuilding the container. Treat your Dockerfile and entrypoint as first-class code.