Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide
"Your RX 580 can't run AI. Buy a new GPU."
That was the consensus in 2026. AMD dropped ROCm support for Polaris/GCN4 architecture in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. Every mainstream AI stack gave up on this card.
We didn't.
This is the complete technical record of how we built a full local AI production stack on an AMD RX 580 8GB — running LLMs at 17 tok/s, generating images in 72 seconds, transcribing audio 150× faster than CPU, and even cloning voices. All offline. All free. All on hardware that cost under $50.
The Hardware
Component Spec
GPU AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4)
CPU Intel Xeon E5-2690 v3 — 12c/24t · 3.5GHz (2014)
RAM 32GB DDR4 REG ECC Quad Channel
Storage NVMe 1TB — 1.7–3.5 GB/s
OS Windows 10 Pro + WSL2 / Ubuntu 26.04 LTS
The RX 580 2048SP is the mining-variant with 2048 shader processors instead of the original 2304SP. It's everywhere on the used market for under $50. It performs identically through Vulkan.
One thing nobody talks about: storage matters as much as the GPU. Moving from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to 30 seconds. The bottleneck was never the GPU.
Why Vulkan?
The entire mainstream AI stack runs on either CUDA (Nvidia-only) or ROCm (AMD dropped Polaris in v5.x). That leaves legacy AMD GPUs with no official path.
But there's a third option: Vulkan — a universal graphics/compute API that works on any modern GPU, including the RX 580, which has supported Vulkan 1.x since its 2017 drivers.
The ggml project (the engine behind llama.cpp and stable-diffusion.cpp) implements Vulkan compute backends in pure C++. This means you can compile directly against the Vulkan API and completely bypass the ROCm/CUDA ecosystem. No driver packages. No compatibility layers. Just the GPU doing math.
What We Tried Before Vulkan (And Why It All Failed)
Before finding the working path, we hit every dead end:
DirectML + ComfyUI — The GPU gets detected as privateuseone0, but then:
NotImplementedError: Cannot access storage of OpaqueTensorImpl
DirectML wraps tensor data in opaque objects that ComfyUI's attention backends literally cannot read. Also: Microsoft hasn't updated it since September 2024. It's abandoned.
ROCm on Polaris — AMD officially dropped GCN4/Polaris in ROCm v5.x. Compatibility layers via WSL2 generate kernel panics under inference load. There is no Windows support. Dead end by design.
OpenVINO + Stable Diffusion Forge — Intel's extension was built for the old Automatic1111 architecture. Forge restructured everything. Result:
ModuleNotFoundError: No module named 'ldm'
ModuleNotFoundError: No module named 'sgm'
Error build_unet: Invalid backend: 'openvino'
CPU-only + HDD — Our baseline before any optimization: 85-second startup, ~19 minutes per 512×512 image. The mechanical HDD competing with memory paging made it completely unusable.
The pattern: every "AMD-compatible" option either targets newer hardware, is abandoned, or is simply incompatible with modern pipelines. Vulkan is the only path that actually works.
The Architecture: Dual-Path Stack
The core insight of this project is that not every workload fits in 8GB of VRAM. The solution is intelligent routing between GPU and CPU:
OpenWebUI :3000 (Docker)
│
├──► llama-server :8081 ──► RX 580 Vulkan [llama.cpp]
│ └── Ollama :11434 ──► CPU fallback
│
└──► sd-server :7860 ──► RX 580 Vulkan [stable-diffusion.cpp]
├── SD 1.5 GGUF ──► 72s / image
└── FLUX hybrid ──► ~14 min / image
└──► ComfyUI :8188 ──► Xeon CPU WSL2
Path 1 — GPU Vulkan: LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.
Path 2 — CPU Xeon: FLUX.1 16GB models, AnimateDiff video pipelines. The 32GB ECC RAM acts as "virtual VRAM" for models that don't fit on the card.
Building llama.cpp with Vulkan
Run in Developer PowerShell for Visual Studio:
powershell
cd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20
Validate GPU detection:
powershell
cd build\bin\Release
.\llama-cli.exe --list-devices
Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅
Start the LLM server:
powershell
.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" `
--host 0.0.0.0 --port 8081 --device Vulkan0
How to verify it's actually using the GPU:
ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
17.77 t/s ← RX 580 Vulkan ✅
If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. The --device Vulkan0 flag is mandatory.
Building stable-diffusion.cpp with Vulkan
powershell
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20
Start the image server:
powershell
E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
.\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `
-m "E:\models\dreamshaper8.gguf"
FLUX.1 Schnell: Running a 16GB Model on 8GB VRAM
FLUX.1 Schnell is a 12B parameter SOTA model that nominally requires 16GB. Here's how we run it on 8GB:
The strategy is memory segmentation — put the diffusion model on VRAM, offload everything else to RAM:
Component File Where
Diffusion Model flux1-schnell-q4_k.gguf GPU VRAM (~6.5GB)
VAE ae.safetensors CPU RAM (~160MB)
CLIP L clip_l.safetensors GPU VRAM (~235MB)
T5XXL t5xxl_fp16.safetensors CPU RAM (~9.3GB)
batch
sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
--diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^
--vae "E:\models\ae.safetensors" ^
--clip_l "E:\models\clip_l.safetensors" ^
--t5xxl "E:\models\t5xxl_fp16.safetensors" ^
--cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling
⚠️ --vae-tiling is not optional. Without it, VAE decode causes OOM and crashes the server.
Timing per 1024×1024 image:
Stage Time
T5XXL conditioning 11.49s
Sampling (4 steps) ~838s
VAE decode (9 tiles) 40.45s
Total ~14 min
Critical: Two GGUF formats for FLUX
This trips up almost everyone. There are two different GGUF distributions for FLUX:
Source Compatible with
city96 (HuggingFace) ComfyUI + ComfyUI-GGUF node only
leejet (HuggingFace) stable-diffusion.cpp ✅
Using a city96 GGUF in sd-server returns:
[ERROR] main.cpp:92 - new_sd_ctx_t failed
Always download from: huggingface.co/leejet/FLUX.1-schnell-gguf
whisper.cpp: Audio Transcription on the RX 580
This is where the numbers get absurd.
Build whisper.cpp with Vulkan:
powershell
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF
cmake --build build --config Release -j4
Transcribe a video (MP4 → TXT):
powershell
Extract audio first (Whisper requires WAV on Windows)
ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"
Transcribe
.\build\bin\Release\whisper-cli.exe
-m models\ggml-large-v3-turbo.bin
-f "audio.wav" -l pt --output-txt
Performance on a 15-minute video (Windows):
Stage Time
Model load 4s
Mel spectrogram 1.2s
GPU encode 73s
Decode + batch 168s
Total 307s
VRAM used: only 2.6GB of 8GB. CPU stays at ~5%.
On Linux (Ubuntu 26.04, Mesa RADV), same hardware, same model:
Metric Windows Linux
Time (106s audio) 307s 23.58s
VRAM used 2.6GB 1.6GB
A 13× speedup on the same GPU. Mesa RADV's Vulkan compute path is dramatically more efficient for this workload than the Windows AMD driver.
Windows vs Linux: Full Benchmark Comparison
Workload Windows 10 Ubuntu 26.04 (Mesa RADV) Winner
LLM Qwen3 4B @ 99 layers ~15–17 tok/s ~35 tok/s 🏆 Linux (2×)
LLM Qwen3.6 35B @ max layers 7.62 tok/s (max 10 ngl) 5.18 tok/s (max 20 ngl) ⚖️ Tie
SD 1.5 DreamShaper (50 steps) ~72s ~85s 🏆 Windows
FLUX Schnell (4 steps, 512×512) ~84s ~52s 🏆 Linux
Whisper large-v3-turbo 307s · 2.6GB 23.58s · 1.6GB 🏆 Linux
Why Linux is faster for LLM: Mesa RADV allows up to 20 GPU layers for large models where Windows AMD drivers cap at 10. RADV's memory management is simply more aggressive and efficient.
Why Windows wins SD 1.5: The proprietary AMD driver has more stable direct rendering for this specific workload. Consistent 1.44s/it vs 1.65s/it on Linux.
Voice Cloning: Applio RVC on AMD Windows
We also built a full voice cloning pipeline:
Text → Balabolka (TTS) → WAV → Applio RVC → Cloned Voice
The key insight: instead of using a generative TTS model (which sounds robotic), we use a real voice actor (Antônio Neural, a Microsoft Neural voice) for prosody and emotion, then apply RVC to convert the identity to our target voice (Yuri). Result: 80–95% naturalness vs 60–70% for pure TTS.
AMD-specific critical findings:
DirectML is effectively dead for RVC — torch-directml is locked to torch==2.4.1 while Applio requires torch==2.7.1. Irreconcilable conflict.
Use CPU mode. On Xeon E5-2690 v3 (24 threads): ~6 min/epoch, ~20 hours for 200 epochs. Inference after training: 2 hours of audio → ~30 minutes processing.
The silent failure trap:
powershell
NEVER set these — they silently break feature extraction
set CUDA_VISIBLE_DEVICES=-1
set ROCM_VISIBLE_DEVICES=-1
Training will print "Model trained successfully" but produce nothing
Always verify logs/project/extracted/ contains .npy files before starting training.
The Community Timeline
This project didn't happen in isolation. Three independent researchers, same GPU, same conclusion:
Date Author Contribution
Jan 2025 艾米心 Amihart First LLM via Vulkan on RX 580 — 24.56 tok/s on Debian
Dec 2025 DH / DadHacks First SD via Vulkan — stable-diffusion.cpp breakthrough
2026 AIVisionsLab Full Windows + Linux production stack, voice cloning, transcription
The shared foundation: ggml by Georgi Gerganov. Vulkan compute backends in pure C++ that bypass the entire proprietary driver ecosystem.
Real Benchmarks Summary
Workload Model Backend Result
LLM inference Mistral 7B Q4_K_M RX 580 Vulkan (Win) 17–18 tok/s
LLM inference Qwen3 4B Q4_K_M RX 580 Vulkan (Linux) ~35 tok/s
LLM baseline Mistral 7B Q4_K_M Xeon CPU pure 3–5 tok/s
Image gen DreamShaper 8 SD1.5 RX 580 Vulkan ~72s / 512×512
Image gen flux1-schnell-q4_k GPU+CPU hybrid ~14 min @ 1024×1024
Audio transcription Whisper large-v3-turbo RX 580 Vulkan (Linux) 23.58s / 106s audio
Video frames AnimateDiff Xeon WSL2 CPU ~141s/frame
Voice inference Applio RVC Xeon CPU ~30 min / 2h audio
Troubleshooting: The Most Common Failures
generate_image returned no results / frozen terminal Bug in sd-server with Seed: -1. Fix: set a fixed integer seed (42, 1337) in OpenWebUI.
new_sd_ctx_t failed with FLUX You're using a city96 GGUF. Download from leejet instead.
Docker can't reach sd-server Windows Defender blocks the Docker subnet (172.x.x.x). Run as Administrator:
powershell
New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `
-Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow
--override-tensor exps=CPU slows down Vulkan This flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU transfer overhead destroys any gains. Don't apply CUDA-optimized flags to Vulkan backends.
Full Documentation
This post covers the core architecture. Full guides for each component:
📖 Master documentation (PT/EN): setup-ia-local-rx580-vulkan.web.app
💻 GitHub repository: github.com/aivisionslab-studios/rx580-local-ai-guide
🎥 YouTube: @aivisionslab-hub
Conclusion
The narrative that legacy AMD GPUs can't run AI is a software problem, not a hardware limitation. The RX 580 has supported Vulkan since 2017. The compute capability was always there.
What changed is that ggml and its ecosystem built Vulkan backends that bypass the entire proprietary driver stack. The result is a GPU from 2017 running SOTA models from 2026 — locally, privately, for free.
RX 580 (2017) + Xeon (2014) + Vulkan + ggml = SOTA AI in 2026
The problem was never the GPU.
AIVisionsLab — Documenting local AI on legacy hardware. São Paulo, Brazil 🇧🇷
Description
How we built a full local AI stack (LLMs, Stable Diffusion, FLUX, Whisper, Voice Cloning) on a 2017 GPU using Vulkan. No CUDA. No ROCm. No cloud. Real benchmarks, real failures, real solutions.
Title
Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide
Published
true
Tags
ai, amd, vulkan, opensource
Cover_image
https://setup-ia-local-rx580-vulkan.web.app/og-image.png
Canonical_url
https://setup-ia-local-rx580-vulkan.web.app

Top comments (0)