Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide

#ai #amd #vulkan #opensource

"An RX 580 from 2017 running a 12B parameter model in 2026.
No CUDA. No ROCm. No cloud. Here's exactly how."

The consensus in 2026 was that this could not be done. AMD dropped ROCm support for the Polaris/GCN4 architecture in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. Every mainstream AI stack gave up on this card.
We didn't.
This is the complete technical record of how we built a full local AI production stack on an AMD RX 580 8GB — running LLMs at 17 tok/s, generating images in 72 seconds, transcribing audio 150× faster than CPU, and even cloning voices. All offline. All free. All on hardware that cost under $50.

The Hardware
| Component | Spec |
|---|---|
| GPU | AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4) |
| CPU | Intel Xeon E5-2690 v3, 12c/24t · 3.5GHz (2014) |
| RAM | 32GB DDR4 REG ECC Quad Channel |
| Storage | NVMe 1TB, 1.7–3.5 GB/s |
| OS | Windows 10 Pro + WSL2 / Ubuntu 26.04 LTS |
The RX 580 2048SP is the mining-variant with 2048 shader processors instead of the original 2304SP. It's everywhere on the used market for under $50. It performs identically through Vulkan.
One thing nobody talks about: storage matters as much as the GPU. Moving from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to 30 seconds. The bottleneck was never the GPU.
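As a rough sanity check on those load times, the sketch below converts them into effective read throughput. The 16 GB figure is an assumption borrowed from the nominal FLUX.1 footprint quoted elsewhere in this guide, not a measured file size.

```python
# Effective read throughput implied by the reported FLUX.1 load times.
# ASSUMPTION: roughly 16 GB of model data is read in both cases.
MODEL_SIZE_GB = 16.0


def effective_throughput_mb_s(size_gb: float, load_seconds: float) -> float:
    """Average throughput in MB/s needed to read size_gb in load_seconds."""
    return size_gb * 1000.0 / load_seconds


hdd = effective_throughput_mb_s(MODEL_SIZE_GB, 25 * 60)  # 25 minutes on HDD
nvme = effective_throughput_mb_s(MODEL_SIZE_GB, 30)      # 30 seconds on NVMe
speedup = (25 * 60) / 30

print(f"HDD:  {hdd:.1f} MB/s effective")
print(f"NVMe: {nvme:.1f} MB/s effective")
print(f"Load-time speedup: {speedup:.0f}x")
```

Under that assumption the HDD figure lands far below what a mechanical drive can read sequentially, which suggests the disk was being thrashed by something else, such as memory paging, rather than simply being slow.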

Why Vulkan?
The entire mainstream AI stack runs on either CUDA (Nvidia-only) or ROCm (AMD dropped Polaris in v5.x). That leaves legacy AMD GPUs with no official path.
But there's a third option: Vulkan — a universal graphics/compute API that works on any modern GPU, including the RX 580, which has supported Vulkan 1.x since its 2017 drivers.
The ggml project (the engine behind llama.cpp and stable-diffusion.cpp) implements Vulkan compute backends in pure C++. This means you can compile directly against the Vulkan API and completely bypass the ROCm/CUDA ecosystem. No driver packages. No compatibility layers. Just the GPU doing math.

What We Tried Before Vulkan (And Why It All Failed)
Before finding the working path, we hit every dead end:
DirectML + ComfyUI — The GPU gets detected as privateuseone0, but then:
NotImplementedError: Cannot access storage of OpaqueTensorImpl
DirectML wraps tensor data in opaque objects that ComfyUI's attention backends literally cannot read. Also: Microsoft hasn't updated it since September 2024. It's abandoned.
ROCm on Polaris — AMD officially dropped GCN4/Polaris in ROCm v5.x. Compatibility layers via WSL2 generate kernel panics under inference load. There is no Windows support. Dead end by design.
OpenVINO + Stable Diffusion Forge — Intel's extension was built for the old Automatic1111 architecture. Forge restructured everything. Result:
ModuleNotFoundError: No module named 'ldm'
ModuleNotFoundError: No module named 'sgm'
Error build_unet: Invalid backend: 'openvino'
CPU-only + HDD — Our baseline before any optimization: 85-second startup, ~19 minutes per 512×512 image. The mechanical HDD competing with memory paging made it completely unusable.
The pattern: every "AMD-compatible" option either targets newer hardware, is abandoned, or is simply incompatible with modern pipelines. Vulkan is the only path that actually works.

The Architecture: Dual-Path Stack
The core insight of this project is that not every workload fits in 8GB of VRAM. The solution is intelligent routing between GPU and CPU:
OpenWebUI :3000 (Docker)
    │
    ├──► llama-server :8081  ──►  RX 580 Vulkan  [llama.cpp]
    │        └── Ollama :11434  ──►  CPU fallback
    │
    ├──► sd-server :7860  ──►  RX 580 Vulkan  [stable-diffusion.cpp]
    │        ├── SD 1.5 GGUF  ──►  72s / image
    │        └── FLUX hybrid  ──►  ~14 min / image
    │
    └──► ComfyUI :8188  ──►  Xeon CPU WSL2

Path 1 — GPU Vulkan: LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.
Path 2 — CPU Xeon: FLUX.1 16GB models, AnimateDiff video pipelines. The 32GB ECC RAM acts as "virtual VRAM" for models that don't fit on the card.
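The routing idea can be sketched as a simple fit check: send a workload to the GPU path when its estimated footprint fits in VRAM, otherwise to the CPU path. This is a deliberate simplification. The `Workload` type, the `pick_path` function, the 1 GB headroom, and the example footprints are all hypothetical, and the real stack routes by service rather than by a single function.

```python
from dataclasses import dataclass

VRAM_GB = 8.0           # RX 580 8GB
VRAM_HEADROOM_GB = 1.0  # ASSUMPTION: reserve for context, buffers, desktop


@dataclass
class Workload:
    name: str
    footprint_gb: float  # estimated memory the model needs on one device


def pick_path(w: Workload) -> str:
    """Return 'gpu-vulkan' if the workload fits the VRAM budget, else 'cpu-xeon'."""
    budget = VRAM_GB - VRAM_HEADROOM_GB
    return "gpu-vulkan" if w.footprint_gb <= budget else "cpu-xeon"


# Example footprints are illustrative, not measurements.
for w in (
    Workload("7B LLM, Q4 (hypothetical ~4.5 GB)", 4.5),
    Workload("FLUX.1 full model (16 GB)", 16.0),
):
    print(f"{w.name:36s} -> {pick_path(w)}")
```

A single threshold cannot express a hybrid placement where only part of a model sits in VRAM, so treat this as the first-pass decision only.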

Building llama.cpp with Vulkan
Run in Developer PowerShell for Visual Studio:
cd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20
Validate GPU detection:
cd build\bin\Release
.\llama-cli.exe --list-devices

Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅

Start the LLM server:
.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" `
--host 0.0.0.0 --port 8081 --device Vulkan0
How to verify it's actually using the GPU:
ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
17.77 t/s ← RX 580 Vulkan ✅
If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. The --device Vulkan0 flag is mandatory.
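If you capture the server's startup output to a file, the same check can be automated. The sketch below is a minimal example that assumes the log lines look like the two `ggml_vulkan:` lines shown above; the function names are hypothetical, and real log output may carry extra fields after the device name.

```python
import re


def vulkan_devices(log_text: str) -> list[str]:
    """Return device names from 'ggml_vulkan: <index> = <name> | ...' lines."""
    pattern = re.compile(
        r"^ggml_vulkan:\s*\d+\s*=\s*(.+?)\s*(?:\||$)", re.MULTILINE
    )
    return [m.group(1) for m in pattern.finditer(log_text)]


def running_on_gpu(log_text: str) -> bool:
    """True if the log lists at least one Vulkan device."""
    return bool(vulkan_devices(log_text))


sample = """ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
"""
print(vulkan_devices(sample))
print(running_on_gpu("server started without any Vulkan lines"))
```

An empty list means no device line was found, which matches the CPU-only symptom described above.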

Building stable-diffusion.cpp with Vulkan
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
mkdir build; cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20
Start the image server:
E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
.\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `
-m "E:\models\dreamshaper8.gguf"

FLUX.1 Schnell: Running a 16GB Model on 8GB VRAM
FLUX.1 Schnell is a 12B parameter SOTA model that nominally requires 16GB. Here's how we run it on 8GB:
The strategy is memory segmentation — put the diffusion model on VRAM, offload everything else to RAM:
| Component | File | Where |
|---|---|---|
| Diffusion Model | flux1-schnell-q4_k.gguf | GPU VRAM (~6.5GB) |
| VAE | ae.safetensors | CPU RAM (~160MB) |
| CLIP L | clip_l.safetensors | CPU RAM (~235MB) |
| T5XXL | t5xxl_fp16.safetensors | CPU RAM (~9.3GB) |
sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
--diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^
--vae "E:\models\ae.safetensors" ^
--clip_l "E:\models\clip_l.safetensors" ^
--t5xxl "E:\models\t5xxl_fp16.safetensors" ^
--cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

⚠️ --vae-tiling is not optional. Without it, VAE decode causes OOM and crashes the server.
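To see why this split fits, the sketch below adds up the approximate component sizes per device. It assumes that `--clip-on-cpu` keeps both text encoders (CLIP L and T5XXL) in system RAM and that `--vae-on-cpu` does the same for the VAE; the sizes are the approximate figures quoted in this section, and working buffers are not counted.

```python
# Approximate component sizes in GB, as quoted in this section.
SIZES_GB = {
    "diffusion": 6.5,   # flux1-schnell-q4_k.gguf
    "vae": 0.16,        # ae.safetensors
    "clip_l": 0.235,    # clip_l.safetensors
    "t5xxl": 9.3,       # t5xxl_fp16.safetensors
}

# ASSUMPTION: placement implied by --clip-on-cpu and --vae-on-cpu.
ON_CPU = {"vae", "clip_l", "t5xxl"}

VRAM_GB = 8.0
RAM_GB = 32.0

gpu_total = sum(size for name, size in SIZES_GB.items() if name not in ON_CPU)
cpu_total = sum(size for name, size in SIZES_GB.items() if name in ON_CPU)

print(f"GPU: {gpu_total:.2f} GB of {VRAM_GB:.0f} GB VRAM")
print(f"CPU: {cpu_total:.2f} GB of {RAM_GB:.0f} GB RAM")
print(f"VRAM left for activations and buffers: {VRAM_GB - gpu_total:.2f} GB")
```

The remaining VRAM is small, which is consistent with the VAE decode needing tiling to avoid running out of memory.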

Timing per 1024×1024 image:
| Stage | Time |
|---|---|
| T5XXL conditioning | 11.49s |
| Sampling (4 steps) | ~838s |
| VAE decode (9 tiles) | 40.45s |
| Total | ~14 min |
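A quick breakdown of those stage times shows where the minutes go. The sketch only re-uses the figures reported above; the per-step and per-tile values are simple averages, not separate measurements.

```python
# Stage times in seconds, as reported for one 1024x1024 image.
stages_s = {
    "T5XXL conditioning": 11.49,
    "Sampling (4 steps)": 838.0,
    "VAE decode (9 tiles)": 40.45,
}

total_s = sum(stages_s.values())
per_step_s = stages_s["Sampling (4 steps)"] / 4
per_tile_s = stages_s["VAE decode (9 tiles)"] / 9

for name, seconds in stages_s.items():
    print(f"{name:22s} {seconds:8.2f}s  {100 * seconds / total_s:5.1f}%")
print(f"Total: {total_s:.2f}s ({total_s / 60:.1f} min)")
print(f"Average per sampling step: {per_step_s:.1f}s")
print(f"Average per VAE tile: {per_tile_s:.2f}s")
```

Sampling dominates the run, so any speedup effort is best spent on the diffusion model rather than on the text encoder or the VAE.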
Critical: Two GGUF formats for FLUX
This trips up almost everyone. There are two different GGUF distributions for FLUX:
| Source | Compatible with |
|---|---|
| city96 (HuggingFace) | ComfyUI + ComfyUI-GGUF node only |
| leejet (HuggingFace) | stable-diffusion.cpp ✅ |
Using a city96 GGUF in sd-server returns:
[ERROR] main.cpp:92 - new_sd_ctx_t failed
Always download from: huggingface.co/leejet/FLUX.1-schnell-gguf

whisper.cpp: Audio Transcription on the RX 580
This is where the numbers get absurd.
Build whisper.cpp with Vulkan:
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF
cmake --build build --config Release -j4
Transcribe a video (MP4 → TXT):
# Extract audio first (Whisper requires WAV on Windows)
ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"

# Transcribe
.\build\bin\Release\whisper-cli.exe -m models\ggml-large-v3-turbo.bin `
-f "audio.wav" -l pt --output-txt
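For batch jobs, the two steps can be chained from a small script. This is a minimal sketch that builds the same ffmpeg and whisper-cli invocations shown above; the function names are hypothetical, the paths assume you run it from the whisper.cpp folder on Windows, and ffmpeg is assumed to be on PATH.

```python
import subprocess
from pathlib import Path

# Paths taken from the commands above, relative to the whisper.cpp folder.
WHISPER_CLI = r".\build\bin\Release\whisper-cli.exe"
MODEL = r"models\ggml-large-v3-turbo.bin"


def ffmpeg_cmd(video: str, wav: str) -> list[str]:
    """Build the ffmpeg call that extracts 16 kHz mono 16-bit PCM audio."""
    return ["ffmpeg", "-i", video, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", wav]


def whisper_cmd(wav: str, language: str = "pt") -> list[str]:
    """Build the whisper-cli call that writes a plain-text transcript."""
    return [WHISPER_CLI, "-m", MODEL, "-f", wav, "-l", language, "--output-txt"]


def transcribe(video: str, language: str = "pt") -> None:
    """Extract audio next to the video, then transcribe it."""
    wav = str(Path(video).with_suffix(".wav"))
    subprocess.run(ffmpeg_cmd(video, wav), check=True)
    subprocess.run(whisper_cmd(wav, language), check=True)


# Show the commands without running them.
print(" ".join(ffmpeg_cmd("video.mp4", "audio.wav")))
print(" ".join(whisper_cmd("audio.wav")))
```

Calling `transcribe("video.mp4")` runs both steps and stops at the first failure because of `check=True`.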
Performance on a 15-minute video (Windows):
| Stage | Time |
|---|---|
| Model load | 4s |
| Mel spectrogram | 1.2s |
| GPU encode | 73s |
| Decode + batch | 168s |
| Total | 307s |
VRAM used: only 2.6GB of 8GB. CPU stays at ~5%.
On Linux (Ubuntu 26.04, Mesa RADV), same hardware, same model:
| Metric | Windows | Linux |
|---|---|---|
| Time (106s audio) | 307s | 23.58s |
| VRAM used | 2.6GB | 1.6GB |
A 13× speedup on the same GPU. Mesa RADV's Vulkan compute path is dramatically more efficient for this workload than the Windows AMD driver.
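The speedup figure is just the ratio of the two wall-clock times, and the same numbers give a real-time factor (seconds of audio processed per second of compute). The sketch below is illustrative arithmetic only. Note that the Windows run is described above both as a 15-minute video and under a 106 s column header, so the Windows real-time factor is computed under both readings; the code cannot settle which one applies.

```python
def speedup(slow_s: float, fast_s: float) -> float:
    """How many times faster the fast run is."""
    return slow_s / fast_s


def realtime_factor(audio_s: float, wall_s: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_s / wall_s


WINDOWS_S = 307.0
LINUX_S = 23.58

print(f"Wall-clock ratio: {speedup(WINDOWS_S, LINUX_S):.1f}x")
print(f"Linux RTF (106 s clip): {realtime_factor(106, LINUX_S):.2f}x")
# ASSUMPTION A: the Windows run used the same 106 s clip.
print(f"Windows RTF if 106 s clip: {realtime_factor(106, WINDOWS_S):.2f}x")
# ASSUMPTION B: the Windows run used the 15-minute video.
print(f"Windows RTF if 15 min clip: {realtime_factor(15 * 60, WINDOWS_S):.2f}x")
```

When comparing platforms, it is worth benchmarking the same clip on both so the wall-clock ratio and the real-time factors tell the same story.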

Windows vs Linux: Full Benchmark Comparison
| Workload | Windows 10 | Ubuntu 26.04 (Mesa RADV) | Winner |
|---|---|---|---|
| LLM Qwen3 4B @ 99 layers | ~15–17 tok/s | ~35 tok/s | 🏆 Linux (2×) |
| LLM Qwen3.6 35B @ max layers | 7.62 tok/s (max 10 ngl) | 5.18 tok/s (max 20 ngl) | ⚖️ Tie |
| SD 1.5 DreamShaper (50 steps) | ~72s | ~85s | 🏆 Windows |
| FLUX Schnell (4 steps, 512×512) | ~84s | ~52s | 🏆 Linux |
| Whisper large-v3-turbo | 307s · 2.6GB | 23.58s · 1.6GB | 🏆 Linux |
Why Linux is faster for LLM: Mesa RADV allows up to 20 GPU layers for large models where Windows AMD drivers cap at 10. RADV's memory management is simply more aggressive and efficient.
Why Windows wins SD 1.5: The proprietary AMD driver has more stable direct rendering for this specific workload. Consistent 1.44s/it vs 1.65s/it on Linux.

Voice Cloning: Applio RVC on AMD Windows
We also built a full voice cloning pipeline:
Text → Balabolka (TTS) → WAV → Applio RVC → Cloned Voice
The key insight: instead of generating the target voice directly with a TTS model (which sounds robotic), we use a natural-sounding stock voice (Antônio Neural, a Microsoft Neural voice) for prosody and emotion, then apply RVC to convert the identity to our target voice (Yuri). Result: 80–95% naturalness vs 60–70% for pure TTS.
AMD-specific critical findings:
DirectML is effectively dead for RVC — torch-directml is locked to torch==2.4.1 while Applio requires torch==2.7.1. Irreconcilable conflict.
Use CPU mode. On Xeon E5-2690 v3 (24 threads): ~6 min/epoch, ~20 hours for 200 epochs. Inference after training: 2 hours of audio → ~30 minutes processing.
The silent failure trap:
# NEVER set these: they silently break feature extraction
set CUDA_VISIBLE_DEVICES=-1
set ROCM_VISIBLE_DEVICES=-1
# Training will print "Model trained successfully" but produce nothing

Always verify logs/project/extracted/ contains .npy files before starting training.
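That check is easy to script so it runs before every training job. The sketch below is a minimal example: the function names are hypothetical, `logs/project` stands in for your own model folder, and the search is recursive on the assumption that the `.npy` files may sit in subfolders of `extracted/`.

```python
from pathlib import Path


def extracted_feature_files(project_dir: str) -> list[Path]:
    """List .npy files under <project_dir>/extracted, searched recursively."""
    extracted = Path(project_dir) / "extracted"
    if not extracted.is_dir():
        return []
    return sorted(extracted.rglob("*.npy"))


def ready_for_training(project_dir: str) -> bool:
    """True only if feature extraction actually produced .npy files."""
    return len(extracted_feature_files(project_dir)) > 0


project = "logs/project"  # placeholder path from the text above
files = extracted_feature_files(project)
if files:
    print(f"OK: {len(files)} feature files found, safe to train")
else:
    print("STOP: no .npy files found, feature extraction failed silently")
```

Refusing to start training when the list is empty avoids spending hours of CPU time on a run that cannot produce a usable model.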

The Community Timeline
This project didn't happen in isolation. Three independent researchers, same GPU, same conclusion:
| Date | Author | Contribution |
|---|---|---|
| Jan 2025 | 艾米心 Amihart | First LLM via Vulkan on RX 580 (24.56 tok/s on Debian) |
| Dec 2025 | DH / DadHacks | First SD via Vulkan (stable-diffusion.cpp breakthrough) |
| 2026 | AIVisionsLab | Full Windows + Linux production stack, voice cloning, transcription |
The shared foundation: ggml by Georgi Gerganov. Vulkan compute backends in pure C++ that bypass the entire proprietary driver ecosystem.

Real Benchmarks Summary
| Workload | Model | Backend | Result |
|---|---|---|---|
| LLM inference | Mistral 7B Q4_K_M | RX 580 Vulkan (Win) | 17–18 tok/s |
| LLM inference | Qwen3 4B Q4_K_M | RX 580 Vulkan (Linux) | ~35 tok/s |
| LLM baseline | Mistral 7B Q4_K_M | Xeon CPU pure | 3–5 tok/s |
| Image gen | DreamShaper 8 SD1.5 | RX 580 Vulkan | ~72s / 512×512 |
| Image gen | flux1-schnell-q4_k | GPU+CPU hybrid | ~14 min @ 1024×1024 |
| Audio transcription | Whisper large-v3-turbo | RX 580 Vulkan (Linux) | 23.58s / 106s audio |
| Video frames | AnimateDiff | Xeon WSL2 CPU | ~141s/frame |
| Voice inference | Applio RVC | Xeon CPU | ~30 min / 2h audio |

Troubleshooting: The Most Common Failures
generate_image returned no results / frozen terminal
Bug in sd-server with Seed: -1. Fix: set a fixed integer seed (42, 1337) in OpenWebUI.
new_sd_ctx_t failed with FLUX
You're using a city96 GGUF. Download from leejet instead.
Docker can't reach sd-server
Windows Defender Firewall blocks inbound connections from the Docker subnet (172.x.x.x). Run as Administrator:
New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `
-Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow
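To confirm the rule took effect, you can test the TCP connection from inside the container. The sketch below is a minimal reachability probe using only the standard library; `port_reachable` is a hypothetical helper, and using `host.docker.internal` as the host name assumes Docker Desktop, which provides that name for reaching the Windows host.

```python
import socket
import sys


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Usage: python check_port.py host.docker.internal 7860
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 7860
    state = "reachable" if port_reachable(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

A successful connection only proves the port is open; it does not prove sd-server is healthy, so follow it with a real image request.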
--override-tensor exps=CPU slows down Vulkan
This flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU transfer overhead destroys any gains. Don't apply CUDA-optimized flags to Vulkan backends.

Full Documentation
This post covers the core architecture. Full guides for each component:

📖 Master documentation (PT/EN): setup-ia-local-rx580-vulkan.web.app
💻 GitHub repository: github.com/aivisionslab-studios/rx580-local-ai-guide
🎥 YouTube: @aivisionslab-hub

Conclusion
The narrative that legacy AMD GPUs can't run AI is a software problem, not a hardware limitation. The RX 580 has supported Vulkan since 2017. The compute capability was always there.
What changed is that ggml and its ecosystem built Vulkan backends that bypass the entire proprietary driver stack. The result is a GPU from 2017 running SOTA models from 2026 — locally, privately, for free.
RX 580 (2017) + Xeon (2014) + Vulkan + ggml = SOTA AI in 2026
The problem was never the GPU.

AIVisionsLab — Documenting local AI on legacy hardware.
São Paulo, Brazil 🇧🇷