AIVisionsLab

Running Flux Schnell (12B) + LLMs on a Legacy AMD RX 580 (8GB) via Native Vulkan — Full Architecture Guide [2026]

By 2026, most people had been told the RX 580 was dead for AI: the ecosystem is CUDA-centric, ROCm dropped Polaris support at v5.x, and DirectML was abandoned before it matured. This is the full technical breakdown of how we proved that wrong.

Hardware Setup

  • GPU: AMD RX 580 2048SP — 8GB GDDR5 VRAM (Vulkan 1.x native)
  • CPU: Intel Xeon E5-2690 v3 — 12c/24t @ 3.5GHz boost
  • RAM: 32GB DDR4 REG ECC Quad Channel
  • Storage: NVMe 1TB — critical bottleneck fix
  • OS: Windows 10 Pro + WSL2 Ubuntu 22.04.5

Why everything else failed

Solution | Status | Reason
CUDA | ❌ | Nvidia-only
ROCm | ❌ | Dropped Polaris at v5.x
DirectML | ❌ | OpaqueTensorImpl crash on CLIPTextEncode
OpenVINO | ❌ | ldm/sgm modules missing on Forge

DirectML's fatal error:

NotImplementedError: Cannot access storage of OpaqueTensorImpl

The driver wraps memory in opaque tensors that ComfyUI's attention backends can't read. It's a dead end.

The Solution — Dual Architecture

PATH 1 — GPU Vulkan (RX 580 acceleration)

Path 1 is a native build of stable-diffusion.cpp compiled with -DGGML_VULKAN=ON. The ggml engine drives the GPU through Vulkan, with no ROCm or CUDA dependency. SD 1.5 GGUF models render a 20-step image in ~72 seconds.

PATH 2 — CPU Xeon (SOTA heavy models)

At 16GB, FLUX.1 Schnell exceeds the card's 8GB of physical VRAM. ComfyUI therefore runs on the CPU inside WSL2 and holds the model in ECC system RAM, which stands in for VRAM. A full 768x768 generation takes ~24 minutes.

Hybrid Memory Segmentation for Flux (12B Q4_K)

Component | File | Allocation | Size
Diffusion Model | flux1-schnell-q4_k.gguf | GPU VRAM | ~6.5GB
VAE | ae.safetensors | CPU RAM | ~160MB
CLIP L | clip_l.safetensors | GPU VRAM | ~235MB
T5XXL | t5xxl_fp16.safetensors | CPU RAM | ~9.3GB
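
As a rough sanity check on the split above, this Python sketch totals the table's approximate sizes per device and compares them with the 8GB of VRAM and 32GB of RAM from the hardware list. The names `ALLOCATIONS`, `CAPACITY_GB`, and `budget` are illustrative and not part of stable-diffusion.cpp. The figures cover model weights only and ignore compute buffers, activations, and OS overhead.

```python
# Approximate weight sizes in GB, copied from the allocation table above.
ALLOCATIONS = [
    ("Diffusion Model", "gpu", 6.5),
    ("VAE", "cpu", 0.160),
    ("CLIP L", "gpu", 0.235),
    ("T5XXL", "cpu", 9.3),
]

# Capacities from the hardware list: 8GB VRAM, 32GB system RAM.
CAPACITY_GB = {"gpu": 8.0, "cpu": 32.0}


def budget(allocations, capacity):
    """Return {device: (used_gb, free_gb)}, counting weights only."""
    used = {device: 0.0 for device in capacity}
    for _name, device, size_gb in allocations:
        used[device] += size_gb
    return {
        device: (round(used[device], 3), round(capacity[device] - used[device], 3))
        for device in capacity
    }


if __name__ == "__main__":
    for device, (used, free) in budget(ALLOCATIONS, CAPACITY_GB).items():
        print(f"{device}: {used} GB used, {free} GB free")
```

On these numbers the GPU keeps roughly 1.3GB free after weights, which may help explain why the VAE is kept off the card.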

Production command

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 \
  --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" \
  --vae "E:\models\ae.safetensors" \
  --clip_l "E:\models\clip_l.safetensors" \
  --t5xxl "E:\models\t5xxl_fp16.safetensors" \
  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-on-cpu and --vae-tiling are non-negotiable. Without them, the server crashes immediately with a DeviceMemoryAllocation error.
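
For scripted launches, the same command can be assembled as an argument list, which sidesteps the line-continuation and quoting differences between shells. The sketch below is one possible wrapper, not the author's orchestration script: `build_sd_server_args` and `launch` are hypothetical helper names, and it assumes `sd-server.exe` is on PATH. Every flag and file name comes from the command above.

```python
import subprocess

MODEL_DIR = "E:\\models"  # model directory used in the command above


def build_sd_server_args(model_dir=MODEL_DIR, ip="0.0.0.0", port=7860):
    """Assemble the sd-server.exe argument list from the production command."""
    def model(name):
        return f"{model_dir}\\{name}"

    return [
        "sd-server.exe",
        "--listen-ip", ip,
        "--listen-port", str(port),
        "--diffusion-model", model("flux1-schnell-q4_k.gguf"),
        "--vae", model("ae.safetensors"),
        "--clip_l", model("clip_l.safetensors"),
        "--t5xxl", model("t5xxl_fp16.safetensors"),
        "--cfg-scale", "1.0",
        "--steps", "4",
        # Keep the text encoders and VAE off the GPU, and tile VAE decoding.
        "--clip-on-cpu", "--vae-on-cpu", "--vae-tiling",
    ]


def launch(args):
    """Start the server process (assumes sd-server.exe is on PATH)."""
    return subprocess.Popen(args)


if __name__ == "__main__":
    # Print the command instead of launching, so the sketch is safe to run.
    print(" ".join(build_sd_server_args()))
```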

Real Benchmarks

Workload | Backend | Result
LLM text inference | CPU only | 3–5 tokens/s ❌
LLM text inference | RX 580 Vulkan | 15–16 tokens/s ✅
SD 1.5 20 steps | DirectML | ~450s + crash ❌
SD 1.5 20 steps | Vulkan native | ~72s ✅
Flux 1024x1024 | Xeon CPU WSL2 | ~24 min ✅
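
To put the table in relative terms, this short Python sketch derives the speedups implied by the figures above. The helper name `speedup_range` is illustrative. The DirectML run ended in a crash, so its ~450s figure is approximate and the SD 1.5 ratio is indicative only.

```python
def speedup_range(slow, fast):
    """Return (min, max) speedup for (low, high) ranges where higher is better."""
    return fast[0] / slow[1], fast[1] / slow[0]


# LLM throughput in tokens/s, from the benchmark table.
cpu_tps = (3.0, 5.0)
vulkan_tps = (15.0, 16.0)

# SD 1.5 at 20 steps, wall-clock seconds, from the benchmark table.
directml_s = 450.0
vulkan_s = 72.0

if __name__ == "__main__":
    low, high = speedup_range(cpu_tps, vulkan_tps)
    print(f"LLM: {low:.1f}x to {high:.1f}x faster on Vulkan")
    print(f"SD 1.5: {directml_s / vulkan_s:.2f}x faster on Vulkan")
```

This gives roughly 3x to 5.3x for LLM inference and 6.25x for SD 1.5.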

NVMe impact: Model load time dropped from 25 minutes (HDD) to 4 minutes (NVMe). For Flux 16GB: from 25 min to ~30 seconds. Storage is as critical as compute.
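
The Flux figures translate into an effective read rate, sketched below in Python. The constants come from the paragraph above. The helper name `throughput_mb_s` is illustrative, and 1GB is taken as 1000MB for simplicity.

```python
MODEL_GB = 16.0        # Flux model size quoted above
HDD_SECONDS = 25 * 60  # 25 minutes on HDD
NVME_SECONDS = 30      # ~30 seconds on NVMe


def throughput_mb_s(size_gb, seconds):
    """Effective read rate in MB/s, taking 1 GB as 1000 MB."""
    return size_gb * 1000 / seconds


if __name__ == "__main__":
    hdd = throughput_mb_s(MODEL_GB, HDD_SECONDS)
    nvme = throughput_mb_s(MODEL_GB, NVME_SECONDS)
    print(f"HDD:  {hdd:.1f} MB/s")
    print(f"NVMe: {nvme:.1f} MB/s")
    print(f"Ratio: {nvme / hdd:.0f}x")
```

That works out to about 10.7 MB/s on the HDD against about 533 MB/s on NVMe, a 50x difference in effective load rate.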

Service Map

OpenWebUI Docker :3000
  ├── llama-server.exe :8081  (Vulkan — RX 580)
  ├── sd-server.exe    :7860  (Vulkan — RX 580)
  └── ComfyUI          :8188  (CPU — Xeon WSL2)
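
A quick way to confirm the map above is live is to probe each port. The following Python sketch uses only the standard library. The port numbers come from the service map, while the host `localhost` and the names `SERVICES` and `is_listening` are assumptions made for illustration. Because ComfyUI runs inside WSL2, whether its port is reachable from Windows on localhost may depend on the WSL2 networking setup.

```python
import socket

# Ports from the service map above; the host is an assumption.
SERVICES = {
    "OpenWebUI": 3000,
    "llama-server": 8081,
    "sd-server": 7860,
    "ComfyUI": 8188,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_listening("localhost", port) else "down"
        print(f"{name:<13} :{port:<5} {state}")
```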

Resources

Full documentation, .bat orchestration scripts, compiled binaries and model configs:
👉 https://setup-ia-local-rx580-vulkan.firebaseapp.com/


Hardware doesn't die. It gets liberated by the right software. Are you running legacy AMD cards? Let's discuss your buffer allocation and command queue latency findings in the comments.
