DEV Community

GeneLab_999

How I Run 6 AI Services Simultaneously on RTX 5090 + WSL2 + Docker (And You Can Too)

TL;DR:

I built a multi-service local AI stack (image gen, video gen, voice synthesis, voice cloning) running on RTX 5090 via WSL2 Docker. The key breakthrough was solving the GPU driver passthrough layer that nobody documented. Here's the architecture, the critical gpu-run function, and everything I learned the hard way.


The Problem Nobody Solved

In August 2025, I bought an RTX 5090. Blackwell architecture. 32GB GDDR7. Compute capability sm_120.

And nobody could make it work with WSL2 + Docker + PyTorch.

The issue wasn't any single component. nvidia-smi worked fine in containers. libcuda.so.1 loaded correctly. But PyTorch kept returning torch.cuda.is_available() = False with a cryptic Error 500: named symbol not found.

I spent roughly 40 hours debugging. Here's what I found, and how I turned it into a production multi-service AI environment.


The Root Cause

The failure point was in the interaction layer between WSL2's driver mounting and Docker's GPU runtime.

When you run --gpus all in a Docker container on WSL2, the NVIDIA Container Toolkit mounts /usr/lib/wsl/lib into the container. This directory contains libcuda.so.1 and friends. For most GPUs, this is enough.

For the RTX 5090, it's not.

The actual driver binaries live in a separate directory: /usr/lib/wsl/drivers/nvmdi.inf_amd64_<hash>. This directory contains the real libcuda.so.1.1, libnvdxgdmal.so.1, libnvidia-ptxjitcompiler.so.1, and other dependencies that the PyTorch CUDA runtime needs to initialize the Blackwell architecture.

Without mounting this directory AND setting LD_LIBRARY_PATH to include it, PyTorch's CUDA initialization hits a dead end -- it finds libcuda.so.1 but can't resolve the sm_120-specific symbols.
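You can confirm this layering on your own machine before reaching for the fix. Here's a quick host-side check (a hypothetical helper of mine, not part of the stack; the `nvmdi.inf_amd64_*` pattern is the one from this setup):

```shell
# Hypothetical diagnostic: confirm the driver-private directory exists and
# actually carries the libraries PyTorch needs for sm_120.
# Takes the drivers base directory as an optional argument.
check_wsl_driver () {
  local base=${1:-/usr/lib/wsl/drivers} drv
  drv=$(ls -d "$base"/nvmdi.inf_amd64_* 2>/dev/null | head -n1)
  [ -n "$drv" ] || { echo "no nvmdi.inf_amd64_* directory under $base"; return 1; }
  echo "driver dir: $drv"
  if ls "$drv" | grep -Eq 'libcuda\.so\.1\.1|ptxjitcompiler'; then
    echo "sm_120 driver libraries present"
  else
    echo "driver dir present but expected CUDA libs missing"
  fi
}
# Usage on the WSL2 host: check_wsl_driver
```

If the second message fires even though `nvidia-smi` works, you're looking at exactly the split described above: the stub layer is mounted, the driver-private layer is not.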


The Solution: gpu-run

Here's the function that makes everything work:

gpu-run () {
  local D BN
  # The exit status of this pipeline is head's, not ls's, so test the
  # result explicitly instead of chaining "|| return 1" onto the assignment.
  D=$(ls -d /usr/lib/wsl/drivers/nvmdi.inf_amd64_* 2>/dev/null | head -n1)
  [ -n "$D" ] || { echo "WSL NVIDIA driver directory not found" >&2; return 1; }
  BN=$(basename "$D")
  echo "Using driver path: $D"
  docker run --rm --gpus all \
    -v /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro \
    -v "$D":/usr/lib/wsl/drivers/"$BN":ro \
    -e LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/lib/wsl/drivers/"$BN" \
    "$@"
}

What this does:

  1. Finds the driver directory dynamically -- the hash suffix changes with driver updates
  2. Mounts both WSL lib paths -- the standard /usr/lib/wsl/lib AND the driver-specific directory
  3. Sets LD_LIBRARY_PATH to prioritize these paths for symbol resolution

Verification:

source gpu-run.sh
gpu-run torch-wsl-cu128 python3 -c "
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('GPU:', torch.cuda.get_device_name(0))
print('VRAM:', torch.cuda.get_device_properties(0).total_memory // 1024**3, 'GB')
"

Output:

Using driver path: /usr/lib/wsl/drivers/nvmdi.inf_amd64_fb80e95fa979ce23
PyTorch: 2.9.0.dev20250812+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 5090
VRAM: 32 GB

The Dockerfile Template

Every AI service in my stack uses a variation of this base:

FROM nvidia/cuda:12.8.0-devel-ubuntu22.04

ENV TZ=Asia/Tokyo
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV CUDA_HOME=/usr/local/cuda

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip python3-dev git ffmpeg ca-certificates \
    build-essential cmake ninja-build libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir numpy==1.26.4

RUN pip3 install --no-cache-dir --pre \
    torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu128

RUN python3 -c "import torch; print('PyTorch:', torch.__version__); assert 'cu128' in torch.__version__"

WORKDIR /app

Key decisions:

  • nvidia/cuda:12.8.0-devel-ubuntu22.04 -- CUDA 12.8 is the minimum for sm_120. Using devel (not runtime) because some AI frameworks compile CUDA extensions at build time.
  • PyTorch nightly cu128 -- as of early 2026, stable PyTorch still has incomplete Blackwell support. Nightly cu128 is non-negotiable.
  • numpy pinned to 1.26.4 -- numpy 2.x breaks several AI frameworks that haven't updated their C extensions.
  • Install torch LAST -- many requirements.txt files include torch. If you install dependencies first, they'll pull in a stable torch that doesn't support sm_120. Always install your carefully selected torch version as the final step.

Docker Compose Architecture

Here's how six AI services coexist in a single compose.yaml:

services:
  comfyui:
    build:
      context: ./apps/comfyui
      dockerfile: Dockerfile
    image: comfyui:wsl-cu12
    profiles: ["comfyui", "all"]
    runtime: nvidia
    environment:
      - LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/lib/wsl/drivers/${WSL_DRV_BN}
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro
      - ${WSL_DRV_DIR}:/usr/lib/wsl/drivers/${WSL_DRV_BN}:ro
      - ./data/comfyui-models:/app/models
      - ./shared/models:/shared/models:ro
    ports:
      - "8188:8188"
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864

  sbv2:
    build:
      context: ./apps/sbv2
      dockerfile: Dockerfile
    image: sbv2:wsl-cu12
    profiles: ["sbv2", "all"]
    runtime: nvidia
    environment:
      - LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/lib/wsl/drivers/${WSL_DRV_BN}
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
    volumes:
      - /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro
      - ${WSL_DRV_DIR}:/usr/lib/wsl/drivers/${WSL_DRV_BN}:ro
      - ./data/sbv2-models:/opt/models
    ports:
      - "5000:5000"
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864

  cosyvoice:
    profiles: ["cosyvoice", "all"]
    ports:
      - "7865:7865"

  rvc:
    profiles: ["rvc", "all"]
    ports:
      - "7866:7866"

  framepack:
    profiles: ["framepack", "all"]
    ports:
      - "7862:7862"

(Each service follows the same WSL driver mount pattern -- I've abbreviated the later ones for readability.)

The .env file is auto-generated:

WSL_DRV_DIR=$(ls -d /usr/lib/wsl/drivers/nvmdi.inf_amd64_* | head -n1)
WSL_DRV_BN=$(basename "$WSL_DRV_DIR")
cat > .env << EOF
WSL_DRV_DIR=$WSL_DRV_DIR
WSL_DRV_BN=$WSL_DRV_BN
EOF

Design Decisions That Saved My Sanity

1. Docker Profiles for Resource Isolation

With 32GB VRAM, you can't run everything simultaneously. Video generation alone can eat 24GB. Docker profiles let me spin up exactly what I need:

docker compose --profile comfyui up -d
docker compose --profile sbv2 --profile cosyvoice up -d
docker compose --profile all up -d

2. Shared Model Directory

AI models are enormous. Flux checkpoints, HunyuanVideo weights, voice models -- easily 200GB+. Instead of duplicating them per container:

~/ai-workspace-correct/
  shared/
    models/           # Cross-service shared models
    hf_cache/         # HuggingFace cache (persistent)
  data/
    comfyui-models/   # Service-specific models
    sbv2-models/
    cosyvoice-models/

Each service mounts shared/models read-only. Service-specific models go in their own data/ directory.
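Scaffolding that layout is a one-time step (paths exactly as above, run from the workspace root):

```shell
# Create the shared and per-service model directories once; -p makes this
# idempotent, so it's safe to re-run after a wipe.
mkdir -p shared/models shared/hf_cache \
         data/comfyui-models data/sbv2-models data/cosyvoice-models
```

Pointing `HF_HOME` at `shared/hf_cache` inside each container keeps HuggingFace downloads out of the images; `HF_HOME` is the standard HuggingFace env var, though wiring it into every service this way is my suggestion rather than something shown in the compose file above.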

3. Port Allocation Strategy

I carved out port ranges by domain:

Range       Domain            Services
5000-5009   Voice synthesis   Style-BERT-VITS2
7860-7869   Voice/Video AI    FramePack, CosyVoice, RVC
8180-8189   Image AI          ComfyUI

This avoids collisions and makes firewall rules predictable.
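If you want to enforce the convention, a tiny helper does it (hypothetical sketch; the domain names are mine, the ranges are from the table above):

```shell
# Map a service domain to its reserved host-port block.
port_range () {
  case "$1" in
    voice-synth) echo "5000-5009" ;;
    voice-video) echo "7860-7869" ;;
    image)       echo "8180-8189" ;;
    *) echo "unknown domain: $1" >&2; return 1 ;;
  esac
}
port_range image   # prints 8180-8189
```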

4. The torchaudio Trap

This one cost me hours. Several voice synthesis frameworks call torchaudio.info() and torchaudio.load(), and the nightly cu128 build of torchaudio ships breaking API changes in exactly those calls. The fix is to bypass torchaudio and read audio metadata with soundfile instead:

import soundfile as sf

sample_rate = sf.info(wav_path).samplerate
audio_data, sr = sf.read(wav_path)

I patch these at Docker build time with sed:

RUN sed -i 's/import torchaudio/import torchaudio\nimport soundfile as sf/' /opt/app/webui.py && \
    sed -i 's/torchaudio.info(prompt_wav).sample_rate/sf.info(prompt_wav).samplerate/g' /opt/app/webui.py

Lessons Learned (The Hard Way)

1. Never let requirements.txt install torch.
Strip torch, torchvision, torchaudio from every requirements.txt before installing. Then install your nightly cu128 build as the final step. If you don't, pip will happily overwrite your working torch with a stable version that can't see your GPU.
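In practice the stripping is a one-liner before pip install (a sketch; the file names and the toy requirements.txt are illustrative):

```shell
# Toy requirements.txt to demonstrate the filter.
cat > requirements.txt << 'EOF'
torch==2.4.0
torchvision==0.19.0
torchaudio==2.4.0
numpy==1.26.4
soundfile
EOF

# Drop every torch* pin so pip can't overwrite the nightly cu128 build.
grep -v '^torch' requirements.txt > requirements.notorch.txt
cat requirements.notorch.txt
# -> numpy==1.26.4
#    soundfile

# Then, in the Dockerfile: install the filtered requirements first, and the
# nightly cu128 torch as the very last step:
#   pip3 install --no-cache-dir -r requirements.notorch.txt
#   pip3 install --no-cache-dir --pre torch torchvision torchaudio \
#     --index-url https://download.pytorch.org/whl/nightly/cu128
```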

2. Driver updates break the hash.
The nvmdi.inf_amd64_<hash> directory changes when you update NVIDIA drivers. The gpu-run function handles this with dynamic lookup. But if you hardcode the path anywhere, you'll have a bad time.
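The compose side can be made equally resilient. A sketch (refresh_wsl_env is my name, not part of the stack above): regenerate .env whenever the recorded driver path has gone stale.

```shell
# Rewrite .env when the driver directory it records no longer matches the one
# on disk (i.e. after an NVIDIA driver update changed the hash). Takes the
# drivers base directory as an optional argument.
refresh_wsl_env () {
  local base=${1:-/usr/lib/wsl/drivers} current recorded
  current=$(ls -d "$base"/nvmdi.inf_amd64_* 2>/dev/null | head -n1)
  [ -n "$current" ] || { echo "no WSL NVIDIA driver dir under $base" >&2; return 1; }
  recorded=$(grep '^WSL_DRV_DIR=' .env 2>/dev/null | cut -d= -f2-)
  if [ "$recorded" != "$current" ]; then
    printf 'WSL_DRV_DIR=%s\nWSL_DRV_BN=%s\n' "$current" "$(basename "$current")" > .env
    echo "refreshed .env for $(basename "$current")"
  fi
}
```

Run it from the compose directory before `docker compose up`, and stale hashes stop being a failure mode.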

3. ipc: host is non-negotiable for AI workloads.
Without it, PyTorch's shared memory operations fail silently or with cryptic errors. Always set it.

4. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
This environment variable enables PyTorch's memory-efficient allocation strategy. Without it, even 32GB of VRAM fragments under large models, and you hit out-of-memory errors on workloads that should theoretically fit.

5. Document everything as if you'll have amnesia tomorrow.
I wrote my setup docs with the goal of "restore everything from scratch in 30 minutes." That document has saved me three times already.


Current Stack (February 2026)

Service            Purpose                          Port   Status
ComfyUI            Image generation (Flux, SDXL)    8188   Stable
Style-BERT-VITS2   Japanese TTS voice synthesis     5000   Stable
CosyVoice          Multi-speaker voice synthesis    7865   Stable
RVC                Real-time voice conversion       7866   Stable
FramePack          Video generation (HunyuanVideo)  7862   Stable

All running on:

  • GPU: RTX 5090 32GB GDDR7
  • CPU: Intel Core Ultra 9 285K
  • RAM: 64GB DDR5
  • OS: Windows 11 Pro + WSL2 Ubuntu 22.04
  • Container runtime: Docker with NVIDIA Container Toolkit

Is This Still Unique?

As of February 2026, there are published examples of single-service RTX 5090 + Docker setups (vLLM, ComfyUI, basic PyTorch). What I haven't found elsewhere is:

  • A multi-service Docker Compose stack orchestrating 5+ AI services on Blackwell
  • The specific WSL2 driver mount solution documented with the nvmdi.inf_amd64_* path
  • A systematic approach to dependency isolation across services sharing one GPU
  • Production-grade patterns for model sharing, port management, and environment recovery

If you've done something similar, I'd genuinely love to hear about it. Drop a comment or reach out.


Built with ~40 hours of debugging, 200+ GB of model files, and an unreasonable amount of stubbornness. Based in Tokyo.

#rtx5090 #docker #wsl2 #pytorch #cuda #blackwell #ai #selfhosted
