Most tutorials about running AI inference at scale assume you have access to cloud GPU clusters, Kubernetes, and a team of infrastructure engineers. I had none of that. What I had was a single workstation with 7 NVIDIA RTX 5090 GPUs, a fiber internet connection, and a goal: serve free AI image generation to anyone on the internet without a signup wall.
This is the architecture that makes ZSky AI work. Every design decision here came from a real production failure or bottleneck, not from a whiteboard exercise.
## The Constraints That Shaped Everything
Before diving into the architecture, here are the constraints that ruled out most "standard" approaches:
- Single machine. All 7 GPUs live in one box. No cluster networking, no distributed training frameworks.
- Consumer hardware. RTX 5090s, not A100s or H100s. Consumer drivers, consumer cooling, consumer power delivery.
- Real-time serving. Users expect results in under 4 seconds. Batch processing is not an option.
- Mixed workloads. Image generation and video generation share the same GPU pool. Video uses 28GB+ VRAM; images use 12-22GB. They cannot coexist on the same GPU simultaneously.
- Zero downtime tolerance. If one GPU crashes, the remaining six must continue serving without dropping requests.
## What I Tried and Abandoned

### Kubernetes with NVIDIA Device Plugin
The enterprise-approved approach. I spent two weeks on it. On a single machine, it added 200-400ms of pod scheduling overhead per request, consumed non-trivial memory for the control plane, and made GPU debugging significantly harder because every error was wrapped in three layers of Kubernetes abstraction.
For a multi-node cluster, Kubernetes makes sense. For 7 GPUs on one machine, it is pure overhead.
### Ray Serve
Ray is purpose-built for multi-GPU Python workloads. It worked better than Kubernetes, but its actor model introduced indirection I did not need. Model loading through Ray's object store was measurably slower than direct VRAM loading, and debugging distributed state across Ray actors when a GPU hung was painful.
Ray is excellent for distributed computing across machines. For local multi-GPU inference, it is overkill.
### Triton Inference Server
NVIDIA's own inference server. It is fast and mature, but it assumes a model-serving paradigm where you deploy one model per endpoint. I needed dynamic model loading and eviction across 7 GPUs with shared state, which Triton's static configuration model does not handle well.
## The Architecture That Stuck
After three failed approaches, I arrived at something embarrassingly simple:
```
                 ┌──────────────┐
                 │    Nginx     │
                 │  (TLS term,  │
                 │  rate limit) │
                 └──────┬───────┘
                        │
                 ┌──────┴───────┐
                 │  API Server  │
                 │  (FastAPI +  │
                 │  WebSocket)  │
                 └──────┬───────┘
                        │
            ┌───────────┴───────────┐
            │         Redis         │
            │ (job queue + state +  │
            │   pub/sub progress)   │
            └───────────┬───────────┘
                        │
   ┌──────┬──────┬──────┼──────┬──────┬──────┐
   │      │      │      │      │      │      │
┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐┌──┴──┐
│GPU 0││GPU 1││GPU 2││GPU 3││GPU 4││GPU 5││GPU 6│
│ W0  ││ W1  ││ W2  ││ W3  ││ W4  ││ W5  ││ W6  │
└─────┘└─────┘└─────┘└─────┘└─────┘└─────┘└─────┘
```
Seven independent Python processes, each pinned to a single GPU via CUDA_VISIBLE_DEVICES. They share nothing except Redis. A supervisor process monitors heartbeats and restarts crashed workers.
That is the entire orchestration layer. No service mesh, no container runtime, no scheduler. One process per GPU, one Redis instance, one supervisor.
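The pinning itself is one environment variable per process. A minimal sketch of the launch path, assuming a `worker.py` entrypoint (the entrypoint name and the helper functions are illustrative, not the actual code):

```python
import os
import subprocess

NUM_GPUS = 7

def worker_env(gpu_id: int) -> dict:
    """Build the environment for a worker pinned to one GPU.

    With CUDA_VISIBLE_DEVICES set to a single index, the worker
    sees exactly one device (always cuda:0 from its point of view),
    so the model code needs no per-device bookkeeping.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env

def spawn_workers() -> list:
    # One independent process per GPU; they share nothing but Redis.
    return [
        subprocess.Popen(["python", "worker.py", "--gpu-id", str(g)],
                         env=worker_env(g))
        for g in range(NUM_GPUS)
    ]
```

The supervisor keeps the returned `Popen` handles so it can kill and respawn an individual worker without touching the other six.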
## GPU Queue Management: Model-Affinity Routing
The naive approach is a single FIFO queue: jobs go in, the first available GPU picks one up. This works until you realize that loading a model into VRAM takes 4-6 seconds. If GPU 3 already has the image generation model loaded and GPU 5 has the video model loaded, sending an image generation job to GPU 5 means the user waits an extra 4 seconds while the model loads.
Model-affinity routing fixes this:
```python
class JobRouter:
    def route(self, job: dict) -> int:
        model_needed = job["model"]

        # Priority 1: GPU that already has the model loaded
        # AND has room in its queue
        for gpu_id in range(self.num_gpus):
            state = self.get_gpu_state(gpu_id)
            if model_needed in state["loaded_models"]:
                if state["queue_depth"] < self.max_queue_depth:
                    return gpu_id

        # Priority 2: GPU with the most free VRAM
        # (can load the model without evicting)
        for gpu_id in sorted(range(self.num_gpus),
                             key=lambda g: self.get_free_vram(g),
                             reverse=True):
            if self.get_free_vram(gpu_id) >= MODEL_VRAM[model_needed]:
                return gpu_id

        # Priority 3: Least-busy GPU (will need to evict)
        return min(range(self.num_gpus),
                   key=lambda g: self.get_gpu_state(g)["queue_depth"])
```
Each GPU worker publishes its state to Redis every 5 seconds: loaded models, queue depth, current VRAM usage, temperature. The router reads this state to make decisions.
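A sketch of the worker-side publish, assuming redis-py semantics (`set` with an `ex` TTL); the key naming and the 15-second TTL here are my guesses, not the production values. A short TTL doubles as a liveness signal: if a worker dies, its state key expires and the router stops routing to it.

```python
import json
import time

def build_state_payload(loaded_models: list, queue_depth: int,
                        vram_used_mb: int, temp_c: int) -> str:
    """Serialize the fields the router reads when making decisions."""
    return json.dumps({
        "loaded_models": loaded_models,
        "queue_depth": queue_depth,
        "vram_used_mb": vram_used_mb,
        "temp_c": temp_c,
        "updated": time.time(),
    })

def publish_state(client, gpu_id: int, **fields) -> None:
    # `client` is any redis-like object exposing set(key, value, ex=...);
    # in production this would be a redis.Redis instance.
    client.set(f"gpu:{gpu_id}:state", build_state_payload(**fields), ex=15)
```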
This single optimization -- sending jobs to GPUs that already have the right model cached -- cut median latency by 30%. It is the highest-impact change in the entire system.
## VRAM Management: The LRU Eviction Problem
Each RTX 5090 has 32GB of VRAM. A large image generation model uses about 22GB. A video model uses 28GB. Utility models (upscaling, background removal) use 1-2GB each.
The challenge: when a video generation request arrives at a GPU that currently holds an image model plus two utility models (22 + 2 + 1.5 = 25.5GB), the worker must evict everything to make room for the 28GB video model.
I use LRU eviction with a twist -- model priority. Utility models are always evicted before generation models, regardless of recency:
```python
def evict_for(self, required_mb: int):
    free = self.get_free_vram_mb()
    while free < required_mb and self.loaded_models:
        # Evict lowest-priority, then least-recently-used
        victim = min(
            self.loaded_models.values(),
            key=lambda m: (m.priority, m.last_used)
        )
        self.unload(victim)
        torch.cuda.empty_cache()
        free = self.get_free_vram_mb()
```
Priority-weighted eviction prevents a common pathological case: a burst of small utility requests evicting the main generation model, followed by an image generation request that must reload it from disk. Without priority weighting, this thrashing pattern added 4+ seconds to one in every ten requests during peak traffic.
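The victim-selection rule can be isolated into a tiny helper. The priority constants below are illustrative -- the article only specifies that utility models rank below generation models, so the assumed convention is that a lower number evicts first, matching the `min()` over `(priority, last_used)` above:

```python
import time
from dataclasses import dataclass, field

# Assumed convention: lower priority number evicts first.
PRIORITY_UTILITY = 0
PRIORITY_GENERATION = 1

@dataclass
class LoadedModel:
    name: str
    vram_mb: int
    priority: int
    last_used: float = field(default_factory=time.time)

def pick_victim(loaded_models: dict) -> LoadedModel:
    """Lowest priority tier first; recency only breaks ties within a tier."""
    return min(loaded_models.values(),
               key=lambda m: (m.priority, m.last_used))
```

Because the priority field leads the sort key, a freshly used upscaler still loses to a generation model that has been idle for an hour -- which is exactly what prevents the thrashing pattern described above.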
## Load Balancing Across 7 GPUs
True load balancing across heterogeneous GPU workloads is harder than it sounds. Image generation takes 2-3 seconds; video generation takes 60-90 seconds. If you balance purely on queue depth, a GPU processing a video job shows queue depth 1 for 90 seconds, while image GPUs show queue depth 0 between 3-second jobs. The video GPU looks "available" but is not.
I balance on estimated completion time rather than queue depth:
```python
def estimated_wait(self, gpu_id: int) -> float:
    state = self.get_gpu_state(gpu_id)
    wait = 0.0
    for queued_job in state["queue"]:
        wait += self.estimate_duration(queued_job)
    if state["current_job"]:
        elapsed = time.time() - state["current_job"]["started"]
        remaining = self.estimate_duration(state["current_job"]) - elapsed
        wait += max(0, remaining)
    return wait
Duration estimates come from historical data by model and resolution, stored in a simple rolling average. This gives the router a much more accurate picture of actual GPU availability.
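The article says "simple rolling average"; an exponential moving average keyed by `(model, resolution)` is one plausible shape. The `alpha` smoothing factor and the cold-start default below are assumptions, not the production values:

```python
class DurationEstimator:
    """Rolling estimate of job duration, keyed by (model, resolution)."""

    def __init__(self, alpha: float = 0.2, default_s: float = 3.0):
        self.alpha = alpha          # weight of the newest observation
        self.default_s = default_s  # fallback before any data exists
        self.avg = {}

    def record(self, model: str, resolution: str, duration_s: float) -> None:
        key = (model, resolution)
        if key not in self.avg:
            self.avg[key] = duration_s
        else:
            # Exponential moving average: recent jobs dominate,
            # so the estimate tracks thermal or load drift.
            self.avg[key] = (self.alpha * duration_s
                             + (1 - self.alpha) * self.avg[key])

    def estimate(self, model: str, resolution: str) -> float:
        return self.avg.get((model, resolution), self.default_s)
```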
## Thermal Management: The Underrated Problem
Seven GPUs in one case generate approximately 3,500 watts of heat under full load. Consumer GPU coolers are designed for one or two cards with adequate airspace, not seven cards packed together.
Thermal throttling at 83C reduced my throughput by 15-25% during sustained loads. The fixes, in order of impact:
Fan curve override. Default fan curves prioritize noise over cooling. I run all fans at 80%+ whenever any GPU exceeds 70C using a systemd service that polls nvidia-smi and sets fan speeds via NVML:
```python
def thermal_governor():
    while True:
        for gpu_id in range(NUM_GPUS):
            temp = get_gpu_temp(gpu_id)
            if temp > 75:
                set_fan_speed(gpu_id, min(100, 70 + (temp - 75) * 4))
            elif temp > 70:
                set_fan_speed(gpu_id, 80)
            else:
                set_fan_speed(gpu_id, 50)
        time.sleep(10)
```
Thermal-aware job routing. The router penalizes hot GPUs. A GPU at 80C gets a 2x weight penalty on its estimated wait time, which naturally diverts traffic to cooler GPUs:
```python
def weighted_wait(self, gpu_id: int) -> float:
    base_wait = self.estimated_wait(gpu_id)
    temp = self.get_gpu_temp(gpu_id)
    if temp > 80:
        return base_wait * 2.0
    elif temp > 75:
        return base_wait * 1.3
    return base_wait
```
Physical airflow. This is the least interesting but most effective fix. I removed the side panel, added two 140mm intake fans blowing directly across the GPU backplates, and ensured adequate spacing between cards using riser cables for the most constrained slots. This dropped peak temperatures by 8-12C.
After these changes, sustained full-load operation runs at 72-76C across all seven GPUs. No throttling.
## Latency Optimization: Where the Milliseconds Go
A breakdown of a typical image generation request:
```
Network (TLS + HTTP parse)       →      8 ms
API validation + auth            →     12 ms
Redis enqueue + routing          →      3 ms
Queue wait (p50)                 →     50 ms
Model load (if cached)           →      0 ms  (4,200 ms if cold)
Prompt encoding                  →    180 ms
Denoising (28 steps, compiled)   →  2,050 ms
VAE decode                       →    105 ms
PNG encode (async, off-GPU)      →     45 ms
Redis result + HTTP response     →     12 ms
────────────────────────────────────────────
Total (warm, p50)                →  2,465 ms
```
The optimizations that matter most:
- Model caching eliminates 4,200ms on cache hits. Cache hit rate is 94% in production because most requests use the same model.
- `torch.compile` with `mode="reduce-overhead"` cuts denoising by ~350ms by eliminating Python overhead in the inference loop.
- Prompt embedding cache saves 180ms on repeated prompts. About 15% of prompts are exact repeats.
- CUDA graphs for the most common resolution (1024x1024) saves another ~140ms by replaying a captured GPU execution plan.
- Async PNG encoding does not reduce individual latency but frees the GPU 45ms sooner for the next request.
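The prompt embedding cache only needs exact-match lookups, so a bounded LRU dict is enough. A sketch -- the capacity is an assumption, and in production the cached value would be the text encoder's output tensor rather than an arbitrary object:

```python
from collections import OrderedDict

class PromptEmbeddingCache:
    """Bounded LRU cache keyed by the exact prompt string."""

    def __init__(self, capacity: int = 2048):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, prompt: str):
        if prompt in self._cache:
            self._cache.move_to_end(prompt)  # mark as recently used
            return self._cache[prompt]
        return None  # miss: caller runs the text encoder

    def put(self, prompt: str, embedding) -> None:
        self._cache[prompt] = embedding
        self._cache.move_to_end(prompt)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least-recently-used
```

On a hit this skips the 180ms prompt-encoding step entirely; on a miss it costs one dict lookup.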
## Failure Handling: What Breaks in Production
In four months of production operation, here is what has actually failed:
| Failure | Count | Detection | Recovery |
|---|---|---|---|
| OOM on video generation | 4 | Worker crash | Auto-restart, job re-queued |
| Stuck worker (unknown cause) | 2 | Heartbeat timeout (120s) | Supervisor kill + restart |
| CUDA context corruption | 1 | Garbled output detected by QC | Worker restart |
| Power interruption | 1 | All workers died | Full system restart via systemd |
| NVLink error | 0 | N/A (no NVLink) | N/A |
| Driver crash | 0 | N/A | N/A |
The supervisor is the critical component. Every worker sends a Redis heartbeat every 10 seconds. If the heartbeat stops for 120 seconds, the supervisor kills the process, clears its GPU memory, and spawns a new worker. Jobs that were in-progress are re-queued automatically.
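The detection half of the supervisor loop reduces to a timestamp comparison. A sketch, assuming the heartbeats have already been read from Redis into a plain dict (the helper name is mine):

```python
import time

HEARTBEAT_TIMEOUT_S = 120

def find_dead_workers(heartbeats: dict, now=None) -> list:
    """Return gpu_ids whose last heartbeat is older than the timeout.

    `heartbeats` maps gpu_id -> last heartbeat timestamp. The supervisor
    would then kill/restart those workers and re-queue their jobs.
    """
    now = time.time() if now is None else now
    return [gpu_id for gpu_id, ts in heartbeats.items()
            if now - ts > HEARTBEAT_TIMEOUT_S]
```

The 120-second timeout is deliberately much longer than the 10-second heartbeat interval, so a single dropped heartbeat (GC pause, busy event loop) never triggers a spurious restart.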
The most subtle failure mode is CUDA context corruption, which does not crash the worker but produces garbled images. I added a lightweight quality check on every output -- variance below a threshold (indicating a solid-color or corrupted image) triggers a worker restart and job retry.
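The check itself can stay dependency-light. A sketch over flattened pixel values -- the threshold is illustrative (in production it would be tuned against known-good outputs, and the check would run on the decoded array rather than a Python list):

```python
from statistics import pvariance

VARIANCE_THRESHOLD = 25.0  # assumed value; tune against known-good outputs

def looks_corrupted(pixels) -> bool:
    """Flag outputs that are suspiciously uniform.

    Solid-color or repeated-garbage frames -- the visible symptom of
    CUDA context corruption -- have near-zero pixel variance, while any
    real generation has variance orders of magnitude above it.
    """
    return pvariance(pixels) < VARIANCE_THRESHOLD
```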
## Production Performance
Current production numbers on 7x RTX 5090:
| Metric | Value |
|---|---|
| Median image generation latency | 2.5s |
| p99 image generation latency | 4.1s |
| Median video generation latency | 67s |
| Sustained image throughput | ~2,400 images/hour |
| Model cache hit rate | 94% |
| GPU utilization (avg) | 41% |
| Uptime (last 90 days) | 99.7% |
The 41% average GPU utilization reflects real traffic patterns -- demand is bursty, with peaks during US business hours and valleys overnight. During peak hours, utilization hits 85-90%.
## What I Would Do Differently
Start with per-GPU queues from day one. I initially used a single shared queue and retrofitted per-GPU queues for affinity routing. The refactor was messy.
Instrument everything from the start. I added Prometheus metrics after launch. Having generation-time histograms, queue-depth gauges, and VRAM-utilization metrics from day one would have caught the thermal throttling issue weeks earlier.
Do not try to batch real-time requests. Batching improves throughput but increases latency for the first request in the batch. For user-facing inference, single-request processing with model caching is strictly better.
Budget for thermal engineering from the beginning. I treated cooling as an afterthought and paid for it with two weeks of debugging intermittent throughput drops that turned out to be thermal throttling.
## Conclusion
The final architecture is a Redis job queue with one Python process per GPU, a model-affinity router, an LRU VRAM manager, and a heartbeat-based supervisor. No Kubernetes, no Ray, no Triton. For a single-machine multi-GPU inference setup, simplicity is not just easier -- it is faster and more reliable.
The complexity lives where it should: in VRAM management and job routing, not in orchestration frameworks.
If you want to see the end result, ZSky AI serves 50 free generations per day with no signup required. The architecture described here is what is running behind the scenes.
I build AI infrastructure at ZSky AI, where we run AI image and video generation on self-hosted GPUs. If you are building something similar or just want to talk about GPU inference architecture, find me in the comments.