DEV Community: AIVisionsLab

Rodei um modelo MoE de 35B de parâmetros em uma RX 580 de 8GB de 2017 (e quase desisti três vezes)

AIVisionsLab — Tue, 16 Jun 2026 20:50:45 +0000

Isso começou como uma pergunta idiota feita depois de já ter visto o Qwen3 4B rodando a 35 tokens/s via Vulkan na mesma máquina: se isso já funciona, até onde vai o limite real?

A resposta levou duas sessões de testes, cinco tentativas falhas, um erro de protocolo, dois esgotamentos de contexto, um timeout de cliente e, só então, uma resposta completa. No dia seguinte, mais três testes confirmaram exatamente o que tinha dado errado e por quê. Esse é o relato inteiro, sem cortar as partes em que o negócio quebrou.

O hardware: nada disso é novo

Xeon E5-2690 v3, 12 núcleos / 24 threads, lançado em 2014. 31,8GB de RAM DDR4 REG ECC em quad-channel. E uma AMD Radeon RX 580 2048SP de 8GB GDDR5, lançada em 2017 — a placa que todo mundo associa a mineração de criptomoeda, não a inferência de modelos de linguagem.

Componente	Especificação
CPU	Intel Xeon E5-2690 v3 — 12C/24T, 3,05GHz turbo
RAM	31,8GB DDR4 REG ECC quad-channel
GPU	AMD RX 580 2048SP — 8GB GDDR5
Storage	NVMe (modelos) + HDD (swap/sistema)
Driver AMD	31.0.21925.1001
Backend	Vulkan — sem CUDA, sem ROCm, sem DirectML

Stack de software: llama.cpp (build b9049), Vulkan SDK 1.4.350.0, OpenWebUI v0.9.6 e SearXNG via Docker para web search. Nada exótico, tudo open source, tudo rodando local.

O modelo era o Qwen3.5-35B-A3B-Uncensored Q6_K: 34,66 bilhões de parâmetros totais, arquitetura Mixture of Experts com 256 experts, dos quais apenas 8 são ativados por token — 3,1% do modelo "aceso" a cada passo. Esse detalhe é o motivo pelo qual a história inteira é possível. Em um modelo denso de 35B, os 35 bilhões de parâmetros entram em jogo para cada token. No MoE, o roteador escolhe 8 experts relevantes e ignora os outros 248 naquele instante. Isso não reduz o tamanho do arquivo no disco (ainda são 26,55GB), mas reduz brutalmente o que precisa estar disponível com baixa latência ao mesmo tempo. (Logs brutos completos dessa rodada na Seção 33 do laboratório.)

O fitting automático: 1,15 segundo decidindo onde colocar 26GB

Aqui está a parte que, na minha opinião, é mais interessante que o benchmark final. Eu não passei nenhuma flag manual de camadas (-ngl, --override-tensor ou qualquer split rígido). O comando de inicialização foi literalmente isso:

.\llama-server.exe -m "E:\models\Qwen3.5-35B-A3B-...-Q6_K.gguf" --host 0.0.0.0 --port 8081

E o llama.cpp resolveu o problema sozinho. A situação inicial era inviável: o modelo completo precisava de ~32.961 MiB de VRAM e havia 7.366 MiB livres. Um déficit de mais de 26GB. O algoritmo de fitting fez, em sequência, e em pouco mais de um segundo:

Reduziu o contexto de 262.144 tokens (o máximo de treino do modelo) para 4.096 tokens, liberando ~5.347 MiB.
Moveu todos os 256 experts MoE para a RAM, mapeados via mmap — 25.613 MiB saindo da equação da GPU.
Realocou as camadas densas residuais de volta para a GPU, de trás para frente (back-to-front), até ocupar 3.048 MiB.
Preencheu o resto front-to-back com overflow fracionado no gate layer, terminando com 41 camadas na GPU (36 delas "overflowing"), uso final de 6.255 MiB e apenas 1.111 MiB livres.

O resultado: 41 camadas densas + a output layer ficaram na VRAM da RX 580 (5.154 MiB), e os 256 experts MoE foram para CPU_Mapped, ocupando 26.784 MiB de RAM via mmap. Em cima disso ainda rodava KV cache (80 MiB), buffer recorrente (251 MiB), compute buffer (770 MiB) e mais alguns buffers menores — tudo somando entre 6,2 e 7,2GB de uso efetivo de VRAM, ou seja, 77–90% dos 8GB físicos.

O sistema acabou usando quatro níveis de memória simultaneamente:

VRAM GDDR5 da RX 580 (~400 GB/s) para as camadas densas.
RAM DDR4 ECC quad-channel (~51 GB/s) para os experts MoE via mmap.
SSD NVMe (1,7–3,5 GB/s) como origem do arquivo .gguf.
HDD via swap do Windows (~120–180 MB/s) quando a RAM passava de 97% de uso.

Esse último nível é o vilão da história, e chega a aparecer de novo mais adiante.

Os números da primeira sessão

Com flash attention habilitado automaticamente, fused gated delta net (autoregressive e chunked) ativos, e 4 slots paralelos configurados, a geração ficou estável em torno de 5,6 tokens/segundo, com prompt eval em ~34–40 tok/s.

Sessão	Prompt Eval	Geração	Tokens totais	Tempo total
1	34,13 tok/s	5,57 tok/s	1.377	~107s
2	~40,00 tok/s	5,64 tok/s	2.929	~533s

A temperatura nunca passou de 80°C, com a placa operando entre 44–64°C na maior parte do tempo e só subindo de fato quando o web search estava ativo (70–75°C). O throttling térmico da RX 580 fica em torno de 90°C — sobrou margem o tempo inteiro. Em nenhum momento a placa crashou ou resetou, mesmo sob 12h+ de stress acumulado entre as duas sessões.

A pergunta que quebrou cinco vezes antes de funcionar

O prompt de teste foi sempre o mesmo: peça para explicar atenção em transformers e por que MoE é mais eficiente que modelos densos, em português. Documentei cada tentativa porque cada falha ensinou algo diferente:

Teste 1 — thinking ON + web search ON + geração de imagem ON. Resultado: erro de protocolo. O OpenWebUI injetou um prefill de resposta incompatível com a flag enable_thinking, e o servidor cancelou a chamada depois de já ter gastado um minuto pensando e recuperado 10 fontes do SearXNG.

Teste 2 — thinking ON + web search ON, 30 pesquisas consecutivas. O contexto esgotou: n_tokens = 3285, truncated = 1. As 30 buscas injetaram tanto texto no histórico que sobrou pouquíssimo espaço para o raciocínio interno do modelo, que tentou alocar seu próprio buffer e estourou os 4.096 tokens definidos pelo fitting automático.

Teste 3 — mesma configuração, 25 pesquisas em uma rodada acumulada. Esgotamento de novo, agora batendo exatamente no teto: n_tokens = 4095, truncated = 1. O OpenWebUI reenviou o histórico anterior, e o prompt cresceu até o limite físico.

Teste 4 — thinking ON, web search OFF. A temperatura caiu para 51°C estáveis, mas o raciocínio interno levou tempo demais na CPU Xeon de 2014, e a interface desistiu por timeout antes do servidor terminar de gerar qualquer coisa.

Teste 5 — thinking ON, web search OFF, prompt reduzido a 45 tokens. Sucesso completo. O modelo pensou por 4 minutos sob o Xeon, rascunhou a matemática internamente e entregou uma resposta técnica e estruturada em português sobre atenção scaled dot-product, multi-head e a eficiência do roteamento esparso do MoE.

A conclusão da primeira sessão ficou clara: a GPU e a CPU nunca foram o problema. Em nenhuma das cinco tentativas houve instabilidade física, reset ou crash. Todas as falhas foram de calibração de software — thinking mode e contexto de 4.096 tokens não combinam quando o histórico (especialmente com web search) cresce demais.

Para contexto: outro projeto da comunidade, do Matheus Fertunani, rodou o mesmo Qwen3.5 35B em Q8 usando CPU pura com 192GB de RAM em Linux, atingindo 7–8 tokens/s. Esse setup, com uma GPU de menos de R$400 no mercado de segunda mão somada a 32GB de RAM ECC, chegou a 5,64 tokens/s. A diferença de hardware profissional para hardware reaproveitado é grande, mas o resultado prático fica surpreendentemente próximo.

Capítulo 2: provando a hipótese, no dia seguinte

A hipótese da primeira sessão era simples — "o problema nunca foi o hardware, foi thinking + web search esgotando os 4.096 tokens de contexto." No dia seguinte, três testes foram desenhados especificamente para confirmar isso. (Documentação completa desses três testes na Seção 34.)

O curl resolve o que o navegador não resolvia

Primeiro, era preciso isolar se o timeout do Teste 4 vinha do servidor ou da camada AJAX do cliente OpenWebUI. A resposta veio batendo o endpoint /v1/chat/completions diretamente via curl, com timeout de 600 segundos e sem nenhuma interface no meio:

curl.exe -X POST http://localhost:8081/v1/chat/completions -H "Content-Type: application/json" --max-time 600 -d "@E:\teste.json"

Resultado: truncated = 0. Resposta completa entregue — 1.955 tokens, 266,42 segundos de tempo total, 255,97 segundos de eval a 6,57 tok/s (já usando Q4_K_M, que entrou no lugar do Q6_K para esse segundo bloco de testes). O hardware sempre conseguiu terminar o trabalho; era o navegador que desistia da conexão TCP enquanto o servidor continuava gerando em segundo plano.

Durante esse mesmo teste, o OpenWebUI foi conectado em paralelo disparando a mesma pergunta por outro canal. O agendador do llama.cpp processou as duas tarefas concorrentemente sem travar nada: uma gerando a 5,96 tok/s e outra a 5,08 tok/s, simultaneamente, com a GPU em 63°C e o uso de RAM do sistema em 91% — seguro, sem travar o Windows.

--ctx-size 8192 e os "pensamentos" capturados do modelo

O segundo teste simplesmente subiu o contexto manualmente:

.\llama-server.exe -m "...Q4_K_M.gguf" --host 0.0.0.0 --port 8081 --ctx-size 8192

Com esse buffer dobrado, o modelo recebeu o prompt "explique por que MoE permite rodar modelos grandes em hardware com pouca VRAM" e processou por 9 minutos inteiros de raciocínio em background, sem nenhum corte. A parte curiosa veio da interceptação direta dos blocos <think> gerados nas três rodadas de teste — o que dá uma visão rara de como o modelo "argumenta" internamente antes de responder.

No bloco capturado durante o esgotamento de contexto (Teste com Q4_K_M ainda em 4.096 tokens), o modelo literalmente calculou a memória necessária para 35B de parâmetros em diferentes quantizações — FP32 em 140GB, FP16 em 70GB, INT8 em 35GB, INT4 em 17,5GB — concluiu que nenhuma dessas contas fechava com 8GB de VRAM, e então corrigiu a própria premissa: se o modelo está rodando mesmo assim, só pode ser por offloading agressivo dos experts inativos para a RAM, mantendo na GPU apenas o roteador e os blocos compartilhados. Esse raciocínio descreveu, com bastante precisão, a própria arquitetura de mmap que estava sustentando ele em tempo real — sem ter qualquer acesso aos metadados do sistema de arquivos do laboratório. Esse bloco específico não chegou a ser entregue, porque o contexto de 4.096 tokens estourou antes da resposta final ser formatada.

Já com --ctx-size 8192 ativo, o mesmo tipo de raciocínio — mais focado, menos exploratório, claramente comprimindo o rascunho até caber em "direto e conciso" como pedido — terminou em sucesso absoluto, zero cortes.

Métrica	Sessão Q6_K	Sessão Q4_K_M (ctx 4096)	Sessão Q4_K_M (ctx 8192)
Duração do raciocínio	~4 min	~11 min	~9 min
Tokens do bloco `<think>`	~2.000	~3.500	~3.000
Questionou a premissa do prompt?	Não	Sim, com cálculo de memória	Sim, reajuste técnico
Resultado	Sucesso	Estourou contexto	Sucesso absoluto

Q4_K_M venceu na prática

Trocar Q6_K (28,51GB) por Q4_K_M (21,17GB) liberou cerca de 7GB na carga física de RAM, o que reduziu a dependência do swap em HDD — o ponto mais lento de toda a cadeia de memória. O resultado prático: geração subindo para 6,42–6,65 tok/s, pico de temperatura caindo para 74°C (10°C mais frio que a sessão anterior) e zero atividade de swap visível durante a inferência.

O que sobrou disso tudo

Algumas conclusões que valem para qualquer pessoa tentando algo parecido com hardware velho:

O thermal throttling nunca foi um risco real nesse setup — mesmo sob horas de stress acumulado, a RX 580 ficou sempre 10–16°C abaixo do limite de 90°C. O fitting automático do llama.cpp via cálculo de grafos resolveu a distribuição entre GPU e RAM melhor do que qualquer tentativa manual de split por flags rígidas teria feito. O thinking mode do Qwen3.5 consome sozinho entre 2.000 e 3.500 tokens antes de produzir qualquer resposta visível, então qualquer contexto abaixo de 8.192 tokens vira um gargalo quase garantido se você também ligar web search ou histórico de conversa. E o timeout do cliente (navegador/interface) é, na prática, mais limitante que a capacidade real do hardware — bater direto no endpoint via curl revelou isso de forma inequívoca.

No fim, o veredito é direto: uma GPU de 2017 com 8GB de VRAM e uma CPU de datacenter de 2014 rodam um modelo MoE de 35B de parâmetros de forma estável, sem crash, sem throttling e sem custo adicional além da eletricidade. Não é prático para uso diário — 5,6 a 6,6 tokens/s e um contexto efetivo limitado não competem com qualquer GPU moderna —, mas como prova de conceito sobre até onde sparsity de MoE e fitting automático de memória conseguem levar hardware obsoleto, a resposta é: bem mais longe do que o mercado de hardware sugere.

Hardware: Xeon E5-2690 v3 + RX 580 2048SP 8GB + 32GB DDR4 ECC. Stack: llama.cpp (Vulkan) + OpenWebUI + SearXNG. Todos os números vêm de logs reais de duas sessões de teste, sem nenhuma camada de marketing por cima.

Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide

AIVisionsLab — Thu, 11 Jun 2026 23:35:39 +0000

Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide
"An RX 580 from 2017 running a 12B parameter model in 2026.
No CUDA. No ROCm. No cloud. Here's exactly how."

That was the consensus in 2026. AMD dropped ROCm support for Polaris/GCN4 architecture in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. Every mainstream AI stack gave up on this card.
We didn't.
This is the complete technical record of how we built a full local AI production stack on an AMD RX 580 8GB — running LLMs at 17 tok/s, generating images in 72 seconds, transcribing audio 150× faster than CPU, and even cloning voices. All offline. All free. All on hardware that cost under $50.

The Hardware
ComponentSpecGPUAMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4)CPUIntel Xeon E5-2690 v3 — 12c/24t · 3.5GHz (2014)RAM32GB DDR4 REG ECC Quad ChannelStorageNVMe 1TB — 1.7–3.5 GB/sOSWindows 10 Pro + WSL2 / Ubuntu 26.04 LTS
The RX 580 2048SP is the mining-variant with 2048 shader processors instead of the original 2304SP. It's everywhere on the used market for under $50. It performs identically through Vulkan.
One thing nobody talks about: storage matters as much as the GPU. Moving from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to 30 seconds. The bottleneck was never the GPU.

Why Vulkan?
The entire mainstream AI stack runs on either CUDA (Nvidia-only) or ROCm (AMD dropped Polaris in v5.x). That leaves legacy AMD GPUs with no official path.
But there's a third option: Vulkan — a universal graphics/compute API that works on any modern GPU, including the RX 580, which has supported Vulkan 1.x since its 2017 drivers.
The ggml project (the engine behind llama.cpp and stable-diffusion.cpp) implements Vulkan compute backends in pure C++. This means you can compile directly against the Vulkan API and completely bypass the ROCm/CUDA ecosystem. No driver packages. No compatibility layers. Just the GPU doing math.

What We Tried Before Vulkan (And Why It All Failed)
Before finding the working path, we hit every dead end:
DirectML + ComfyUI — The GPU gets detected as privateuseone0, but then:
NotImplementedError: Cannot access storage of OpaqueTensorImpl
DirectML wraps tensor data in opaque objects that ComfyUI's attention backends literally cannot read. Also: Microsoft hasn't updated it since September 2024. It's abandoned.
ROCm on Polaris — AMD officially dropped GCN4/Polaris in ROCm v5.x. Compatibility layers via WSL2 generate kernel panics under inference load. There is no Windows support. Dead end by design.
OpenVINO + Stable Diffusion Forge — Intel's extension was built for the old Automatic1111 architecture. Forge restructured everything. Result:
ModuleNotFoundError: No module named 'ldm'
ModuleNotFoundError: No module named 'sgm'
Error build_unet: Invalid backend: 'openvino'
CPU-only + HDD — Our baseline before any optimization: 85-second startup, ~19 minutes per 512×512 image. The mechanical HDD competing with memory paging made it completely unusable.
The pattern: every "AMD-compatible" option either targets newer hardware, is abandoned, or is simply incompatible with modern pipelines. Vulkan is the only path that actually works.

The Architecture: Dual-Path Stack
The core insight of this project is that not every workload fits in 8GB of VRAM. The solution is intelligent routing between GPU and CPU:
OpenWebUI :3000 (Docker)
│
├──► llama-server :8081 ──► RX 580 Vulkan [llama.cpp]
│ └── Ollama :11434 ──► CPU fallback
│
└──► sd-server :7860 ──► RX 580 Vulkan [stable-diffusion.cpp]
├── SD 1.5 GGUF ──► 72s / image
└── FLUX hybrid ──► ~14 min / image

└──► ComfyUI       :8188  ──►  Xeon CPU WSL2

Path 1 — GPU Vulkan: LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.
Path 2 — CPU Xeon: FLUX.1 16GB models, AnimateDiff video pipelines. The 32GB ECC RAM acts as "virtual VRAM" for models that don't fit on the card.

Building llama.cpp with Vulkan
Run in Developer PowerShell for Visual Studio:
powershellcd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20
Validate GPU detection:
powershellcd build\bin\Release
.\llama-cli.exe --list-devices

Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅

Start the LLM server:
powershell.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" `
--host 0.0.0.0 --port 8081 --device Vulkan0
How to verify it's actually using the GPU:
ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
17.77 t/s ← RX 580 Vulkan ✅
If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. The --device Vulkan0 flag is mandatory.

Building stable-diffusion.cpp with Vulkan
powershellgit clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20
Start the image server:
powershellE:
cd "E:\stable-diffusion.cpp\build\bin\Release"
.\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `
-m "E:\models\dreamshaper8.gguf"

FLUX.1 Schnell: Running a 16GB Model on 8GB VRAM
FLUX.1 Schnell is a 12B parameter SOTA model that nominally requires 16GB. Here's how we run it on 8GB:
The strategy is memory segmentation — put the diffusion model on VRAM, offload everything else to RAM:
ComponentFileWhereDiffusion Modelflux1-schnell-q4_k.ggufGPU VRAM (~6.5GB)VAEae.safetensorsCPU RAM (~160MB)CLIP Lclip_l.safetensorsGPU VRAM (~235MB)T5XXLt5xxl_fp16.safetensorsCPU RAM (~9.3GB)
batchsd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
--diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^
--vae "E:\models\ae.safetensors" ^
--clip_l "E:\models\clip_l.safetensors" ^
--t5xxl "E:\models\t5xxl_fp16.safetensors" ^
--cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

⚠️ --vae-tiling is not optional. Without it, VAE decode causes OOM and crashes the server.

Timing per 1024×1024 image:
StageTimeT5XXL conditioning11.49sSampling (4 steps)~838sVAE decode (9 tiles)40.45sTotal~14 min
Critical: Two GGUF formats for FLUX
This trips up almost everyone. There are two different GGUF distributions for FLUX:
SourceCompatible withcity96 (HuggingFace)ComfyUI + ComfyUI-GGUF node onlyleejet (HuggingFace)stable-diffusion.cpp ✅
Using a city96 GGUF in sd-server returns:
[ERROR] main.cpp:92 - new_sd_ctx_t failed
Always download from: huggingface.co/leejet/FLUX.1-schnell-gguf

whisper.cpp: Audio Transcription on the RX 580
This is where the numbers get absurd.
Build whisper.cpp with Vulkan:
powershellgit clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF
cmake --build build --config Release -j4
Transcribe a video (MP4 → TXT):
powershell# Extract audio first (Whisper requires WAV on Windows)
ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"

Transcribe

.\build\bin\Release\whisper-cli.exe -m models\ggml-large-v3-turbo.bin
-f "audio.wav" -l pt --output-txt
Performance on a 15-minute video (Windows):
StageTimeModel load4sMel spectrogram1.2sGPU encode73sDecode + batch168sTotal307s
VRAM used: only 2.6GB of 8GB. CPU stays at ~5%.
On Linux (Ubuntu 26.04, Mesa RADV), same hardware, same model:
MetricWindowsLinuxTime (106s audio)307s23.58sVRAM used2.6GB1.6GB
A 13× speedup on the same GPU. Mesa RADV's Vulkan compute path is dramatically more efficient for this workload than the Windows AMD driver.

Windows vs Linux: Full Benchmark Comparison
WorkloadWindows 10Ubuntu 26.04 (Mesa RADV)WinnerLLM Qwen3 4B @ 99 layers~15–17 tok/s~35 tok/s🏆 Linux (2×)LLM Qwen3.6 35B @ max layers7.62 tok/s (max 10 ngl)5.18 tok/s (max 20 ngl)⚖️ TieSD 1.5 DreamShaper (50 steps)~72s~85s🏆 WindowsFLUX Schnell (4 steps, 512×512)~84s~52s🏆 LinuxWhisper large-v3-turbo307s · 2.6GB23.58s · 1.6GB🏆 Linux
Why Linux is faster for LLM: Mesa RADV allows up to 20 GPU layers for large models where Windows AMD drivers cap at 10. RADV's memory management is simply more aggressive and efficient.
Why Windows wins SD 1.5: The proprietary AMD driver has more stable direct rendering for this specific workload. Consistent 1.44s/it vs 1.65s/it on Linux.

Voice Cloning: Applio RVC on AMD Windows
We also built a full voice cloning pipeline:
Text → Balabolka (TTS) → WAV → Applio RVC → Cloned Voice
The key insight: instead of using a generative TTS model (which sounds robotic), we use a real voice actor (Antônio Neural, a Microsoft Neural voice) for prosody and emotion, then apply RVC to convert the identity to our target voice (Yuri). Result: 80–95% naturalness vs 60–70% for pure TTS.
AMD-specific critical findings:
DirectML is effectively dead for RVC — torch-directml is locked to torch==2.4.1 while Applio requires torch==2.7.1. Irreconcilable conflict.
Use CPU mode. On Xeon E5-2690 v3 (24 threads): ~6 min/epoch, ~20 hours for 200 epochs. Inference after training: 2 hours of audio → ~30 minutes processing.
The silent failure trap:
powershell# NEVER set these — they silently break feature extraction

set CUDA_VISIBLE_DEVICES=-1

set ROCM_VISIBLE_DEVICES=-1

Training will print "Model trained successfully" but produce nothing

Always verify logs/project/extracted/ contains .npy files before starting training.

The Community Timeline
This project didn't happen in isolation. Three independent researchers, same GPU, same conclusion:
DateAuthorContributionJan 2025艾米心 AmihartFirst LLM via Vulkan on RX 580 — 24.56 tok/s on DebianDec 2025DH / DadHacksFirst SD via Vulkan — stable-diffusion.cpp breakthrough2026AIVisionsLabFull Windows + Linux production stack, voice cloning, transcription
The shared foundation: ggml by Georgi Gerganov. Vulkan compute backends in pure C++ that bypass the entire proprietary driver ecosystem.

Real Benchmarks Summary
WorkloadModelBackendResultLLM inferenceMistral 7B Q4_K_MRX 580 Vulkan (Win)17–18 tok/sLLM inferenceQwen3 4B Q4_K_MRX 580 Vulkan (Linux)~35 tok/sLLM baselineMistral 7B Q4_K_MXeon CPU pure3–5 tok/sImage genDreamShaper 8 SD1.5RX 580 Vulkan~72s / 512×512Image genflux1-schnell-q4_kGPU+CPU hybrid~14 min @ 1024×1024Audio transcriptionWhisper large-v3-turboRX 580 Vulkan (Linux)23.58s / 106s audioVideo framesAnimateDiffXeon WSL2 CPU~141s/frameVoice inferenceApplio RVCXeon CPU~30 min / 2h audio

Troubleshooting: The Most Common Failures
generate_image returned no results / frozen terminal
Bug in sd-server with Seed: -1. Fix: set a fixed integer seed (42, 1337) in OpenWebUI.
new_sd_ctx_t failed with FLUX
You're using a city96 GGUF. Download from leejet instead.
Docker can't reach sd-server
Windows Defender blocks the Docker subnet (172.x.x.x). Run as Administrator:
powershellNew-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `
-Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow
--override-tensor exps=CPU slows down Vulkan
This flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU transfer overhead destroys any gains. Don't apply CUDA-optimized flags to Vulkan backends.

Full Documentation
This post covers the core architecture. Full guides for each component:

📖 Master documentation (PT/EN): setup-ia-local-rx580-vulkan.web.app
💻 GitHub repository: github.com/aivisionslab-studios/rx580-local-ai-guide
🎥 YouTube: @aivisionslab-hub

Conclusion
The narrative that legacy AMD GPUs can't run AI is a software problem, not a hardware limitation. The RX 580 has supported Vulkan since 2017. The compute capability was always there.
What changed is that ggml and its ecosystem built Vulkan backends that bypass the entire proprietary driver stack. The result is a GPU from 2017 running SOTA models from 2026 — locally, privately, for free.
RX 580 (2017) + Xeon (2014) + Vulkan + ggml = SOTA AI in 2026
The problem was never the GPU.

AIVisionsLab — Documenting local AI on legacy hardware.
São Paulo, Brazil 🇧🇷

Three researchers. One GPU. Two years. How the RX 580 became an AI platform.

AIVisionsLab — Sun, 24 May 2026 13:20:37 +0000

All images in this article were generated on the RX 580 8GB — the same GPU everyone said couldn't run AI.

This is collective knowledge

Three independent researchers. No coordination. Same GPU. Same conclusion.

January 2025 — 艾米心 Amihart

Platform: Debian Linux
Published: Medium

Amihart was the first to document LLM inference via Vulkan on the RX 580.

Compiled llama.cpp with -DGGML_VULKAN=on on Debian, connected a Celeron G6900 CPU setup, and measured:

CPU only: 5.45 tok/s
RX 580 via Vulkan: 24.56 tok/s

A 4.5× uplift on hardware that officially "doesn't support AI."

But then came this line — honest, and correct for the time:

"Sadly, even though Vulkan seems to do a pretty good job with the RX580, I am unaware of any way to get Vulkan to work with Stable Diffusion. If you want to use Stable Diffusion, you will need ROCm."

That sentence opened a question that the next researcher answered.

December 2025 — DH / DadHacks

Platform: Linux/Debian
Published: dadhacks.org

DadHacks refuted Amihart's limitation — not as a criticism, but as proof that the software evolved.

stable-diffusion.cpp had matured. With -DSD_VULKAN=ON (equivalent to -DGGML_VULKAN=ON in newer versions), image generation via Vulkan on the RX 580 worked.

Including FLUX.1 Schnell in Q4 quantization, with CPU offloading for components that exceeded VRAM.

The barrier Amihart correctly identified in January had fallen by December.

2026 — AIVisionsLab

Platform: Windows 10 Pro + WSL2
Published: setup-ia-local-rx580-vulkan.web.app

The third step was integration.

Both previous projects ran on Linux. Neither connected everything into a unified daily-use system on Windows. Neither documented the failures (DirectML, ROCm, OpenVINO). Neither built automation scripts. Neither integrated OpenWebUI.

AIVisionsLab filled those gaps:

Full Windows stack with .bat automation
OpenWebUI integration via Docker with firewall notes
Dual architecture: GPU Vulkan for fast models, Xeon CPU WSL2 for FLUX 16GB
Documented every failure with root cause analysis
Discovered the critical GGUF incompatibility: city96 vs leejet formats

The question each project answered

Project	Question	Answer
Amihart	Can LLMs run on Vulkan RX 580?	Yes. 24.56 tok/s
DadHacks	Can Stable Diffusion run on Vulkan RX 580?	Yes. sd.cpp works
AIVisionsLab	Can all this run integrated on Windows daily?	Yes. Full stack documented

The common denominator

All three converge on the same engine:

ggml (Georgi Gerganov)
  ├── llama.cpp    → LLMs via Vulkan
  └── stable-diffusion.cpp (leejet) → Images via Vulkan

ggml ported deep learning tensor operations to C and exposed Vulkan hooks. That single decision freed legacy AMD hardware from the CUDA/ROCm dependency trap.

Three philosophies, same conclusion

Amihart:

"Despite how ancient this card is, it is technically possible to use it for AI."

DadHacks:

"This setup provides an accessible pathway for leveraging existing hardware investments without requiring expensive upgrades or specialized software stacks like ROCm."

AIVisionsLab:

"Commercial planned obsolescence is a market choice, not an engineering barrier. Legacy hardware doesn't die — it's liberated by the right software."

Full documentation

📖 setup-ia-local-rx580-vulkan.web.app — complete guide in PT/EN/ES/FR/AR
📦 github.com/aivisionslab-studios/rx580-local-ai-guide
🤗 huggingface.co/aivisionslab/ai-local-rx580-stack

Running FLUX.1 Schnell on an RX 580 8GB — GPU/CPU hybrid architecture

AIVisionsLab — Sun, 24 May 2026 13:18:29 +0000

Image above: generated by FLUX.1 Schnell running on the hybrid architecture described in this post.

The problem

FLUX.1 Schnell is a 12B parameter model. Full precision needs more VRAM than the RX 580 has.

The solution: split the components between GPU and CPU RAM.

Memory map

Component	File	Where	Size
Diffusion model	flux1-schnell-q4_k.gguf	GPU VRAM	~6.5GB
VAE	ae.safetensors	CPU RAM	~160MB
CLIP L	clip_l.safetensors	GPU VRAM	~235MB
T5XXL	t5xxl_fp16.safetensors	CPU RAM	~9.3GB

Total VRAM used: ~6.7GB / 8GB available
Total RAM used: ~9.5GB

The T5XXL encoder dominates RAM usage. If you're tight on RAM, t5xxl_fp8.safetensors reduces it to ~5GB.

⚠️ Critical: use leejet GGUF, not city96

Two different GGUF formats exist for FLUX. They have similar names but are NOT interchangeable:

Source	For
city96 on HuggingFace	ComfyUI + ComfyUI-GGUF node
leejet on HuggingFace	stable-diffusion.cpp ✅

Using city96 GGUF with sd-server returns:

[ERROR] stable-diffusion.cpp:355 - get sd version from file failed
[ERROR] main.cpp:92 - new_sd_ctx_t failed

Download from: https://huggingface.co/leejet/FLUX.1-schnell-gguf

The command

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
  --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^
  --vae "E:\models\ae.safetensors" ^
  --clip_l "E:\models\clip_l.safetensors" ^
  --t5xxl "E:\models\t5xxl_fp16.safetensors" ^
  --cfg-scale 1.0 --steps 4 ^
  --clip-on-cpu --vae-on-cpu --vae-tiling

Flag breakdown:

Flag	Why
`--clip-on-cpu`	Frees ~235MB VRAM
`--vae-on-cpu`	Frees ~160MB VRAM
`--vae-tiling`	Prevents OOM at high resolution
`--cfg-scale 1.0`	Required for FLUX — higher values distort
`--steps 4`	Schnell converges in 4 steps by design

Real benchmark

Stage	Time
T5XXL conditioning	11.49s
Sampling (4 steps @ 1024×1024)	~838s (~14 min)
VAE decode (9 tiles)	40.45s
Total	~14 min

Terminal status at generation:

Listening on http://0.0.0.0:7860
VRAM: 7.6/8.0 GB | RAM: ~9.5 GB | Temp: 66°C

Windows Firewall fix

If OpenWebUI can't reach the server even with --listen-ip 0.0.0.0:

# Run as Administrator
New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `
  -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow

Docker runs in an isolated WSL2 network — 127.0.0.1 won't work. Use your machine's actual local IP.

Full documentation

📖 setup-ia-local-rx580-vulkan.web.app
📦 github.com/aivisionslab-studios/rx580-local-ai-guide

Everything that failed before Vulkan saved our RX 580 AI setup

AIVisionsLab — Sun, 24 May 2026 13:14:16 +0000

All images in this article were generated locally on the RX 580 8GB — after we fixed everything described below.

The graveyard

Before Vulkan worked, we tried everything. This is the technical autopsy.

1. DirectML — Microsoft's promise that crashed

The attempt: torch-directml with --directml flag in ComfyUI.

The GPU was detected as privateuseone0. Looked promising.

Then this appeared on every run:

WARNING: torch-directml barely works, is very slow,
has not been updated in over 1 year and might be
removed soon, please don't use it.

NotImplementedError: Cannot access storage of OpaqueTensorImpl

Root cause: DirectML wraps tensor data in opaque objects called OpaqueTensorImpl. When ComfyUI's modern attention backends try to read the raw memory contents, the Microsoft layer blocks access entirely.

The project hasn't been updated in over a year. It's effectively abandoned.

Manual fix attempt: Downgrade to the May 2024 dev build:

pip uninstall torch torch-directml torchaudio
pip install torch==2.3.1+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-directml==0.2.1.dev240521 --no-deps

This stops the crash but the performance is so slow it's unusable.

2. ROCm — officially dead for GCN4

The attempt: AMD's official GPGPU framework.

The reality: AMD dropped official support for Polaris/GCN4 architecture in ROCm v5.x. Permanently. There is no workaround.

On Windows: no native ROCm support at all.
On WSL2 with compatibility layers: kernel panics under heavy inference load.

The only working ROCm path for the RX 580 is via Docker containers that emulate gfx803 — which is what Amihart documented in January 2025. It works for Stable Diffusion, but requires Docker overhead and doesn't support modern FLUX architecture.

3. OpenVINO + Stable Diffusion Forge

The attempt: Intel's sd-webui-openvino extension inside Forge.

ModuleNotFoundError: No module named 'ldm'
ModuleNotFoundError: No module named 'sgm'
Error build_unet: Invalid backend: 'openvino'

Root cause: The extension was designed for the old AUTOMATIC1111 architecture. Forge completely restructured the codebase and replaced the native ldm and sgm modules. The OpenVINO injection fails at the foundation level.

4. CPU + HDD — the baseline disaster

Before any GPU acceleration:

Boot time: 85 seconds
LLM response: 3–5 tok/s
Image generation: ~19 minutes per 512×512 image
FLUX 16GB model load: 25 minutes from HDD

The mechanical drive was as much of a bottleneck as the missing GPU acceleration.

What actually worked

After all of this: Vulkan.

The ggml engine in llama.cpp and stable-diffusion.cpp uses Vulkan as a native GPU backend. The RX 580 has supported Vulkan 1.x since 2017 drivers. No special installation. No compatibility layers. Just compile with -DGGML_VULKAN=ON.

Results after switching:

LLM: 15–16 tok/s (from 3–5)
Image: ~72s (from ~19 min)
FLUX load: 30 seconds (from 25 min, after NVMe migration)

The lesson

The hardware was never the problem. Every failure above was a software problem:

DirectML: abandoned by Microsoft
ROCm: architecture policy decision by AMD
OpenVINO: extension not maintained for modern frontends
HDD: wrong storage choice

The RX 580 was waiting for ggml + Vulkan.

Full documentation

📖 setup-ia-local-rx580-vulkan.web.app
📦 github.com/aivisionslab-studios/rx580-local-ai-guide

Rodei Flux Schnell + LLM numa GPU de R$300. Sem CUDA. Sem cloud. Sem ROCm.

AIVisionsLab — Sun, 24 May 2026 13:08:33 +0000

Todas as imagens deste artigo foram geradas localmente na RX 580 8GB descrita abaixo.

A narrativa era clara

Em 2026, todo guia diz a mesma coisa:

"Sua AMD RX 580 não roda IA. Compra uma GPU nova."

A AMD removeu suporte ROCm para Polaris/GCN4 na v5.x.
DirectML travava com erros de OpaqueTensorImpl.
OpenVINO falhava silenciosamente.

GPU de 8GB parada em 0% de uso enquanto o CPU respondia LLMs a 3 tokens por segundo.

A gente recusou comprar uma GPU nova.

A solução: Vulkan

O projeto ggml — engine base do llama.cpp e stable-diffusion.cpp — suporta Vulkan como backend de GPU. Vulkan é um padrão aberto que ainda suporta a RX 580 nativamente desde os drivers de 2017.

Sem CUDA. Sem ROCm. Sem DirectML. Só Vulkan.

Resultados reais (logs do terminal, não benchmarks sintéticos)

Workload	Modelo	Velocidade
LLM	Mistral 7B Q4	15–16 tok/s
Geração de imagem	DreamShaper 8 GGUF	~72s/imagem
FLUX.1 Schnell	flux1-schnell-q4_k híbrido	~14 min @ 1024×1024

CPU sem GPU: 3–5 tok/s.
Ganho com Vulkan: 3–4× numa GPU que "não suporta IA".

Hardware

GPU:     AMD RX 580 2048SP — 8GB GDDR5 (Polaris / GCN4)
CPU:     Intel Xeon E5-2690 v3 — 12c/24t (2014)
RAM:     32GB DDR4 REG ECC
Storage: NVMe 1TB — 1.7–3.5 GB/s
OS:      Windows 10 Pro + WSL2 Ubuntu 22.04

O NVMe sozinho reduziu o carregamento do FLUX de 25 minutos para 30 segundos.
Storage é tão crítico quanto a GPU.

Compilar llama.cpp com Vulkan

# Executar no Developer PowerShell do VS
cd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20

Validação:

cd build\bin\Release
.\llama-cli.exe --list-devices
# Esperado: Vulkan0: AMD Radeon RX 580 2048SP ✅

Compilar stable-diffusion.cpp com Vulkan

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp && mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20

Subir o servidor

E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
  -m "E:\models\dreamshaper_8.safetensors"

No OpenWebUI → Admin → Imagens → Automatic1111 → http://SEU_IP_LOCAL:7860/

⚠️ Crítico: dois tipos de GGUF incompatíveis

Se você tentar rodar FLUX e receber new_sd_ctx_t failed — você baixou o GGUF errado.

Fonte	Compatível com
city96 (HuggingFace)	ComfyUI apenas
leejet (HuggingFace)	stable-diffusion.cpp ✅

Sempre use: https://huggingface.co/leejet/FLUX.1-schnell-gguf

O que não funcionou (documentado com causa raiz)

Tentativa	Erro	Motivo
DirectML	`OpaqueTensorImpl`	Tensores MS incompatíveis com ComfyUI
ROCm	Kernel panics	GCN4 removido no v5.x — permanente
OpenVINO	`No module 'ldm'`	Extensão para arquitetura antiga A1111
CPU + HDD	19 min/imagem	Zero GPU + gargalo de I/O mecânico

Documentação completa

📖 Guia master (PT/EN/ES/FR/AR) com diagramas, benchmarks, scripts de automação:
👉 setup-ia-local-rx580-vulkan.web.app

📦 GitHub (scripts + docs):
👉 github.com/aivisionslab-studios/rx580-local-ai-guide

O problema nunca foi a placa.

I ran Flux Schnell + LLMs on a $50 GPU. No CUDA. No cloud. No ROCm.

AIVisionsLab — Sun, 24 May 2026 13:04:51 +0000

All images in this article were generated locally on the RX 580 8GB described below.

The narrative was clear

In 2026, every guide says the same thing:

"Your AMD RX 580 can't run AI. Buy a new GPU."

AMD dropped ROCm support for Polaris/GCN4 in v5.x.
DirectML crashed with OpaqueTensorImpl errors.
OpenVINO failed silently.

So we had a 8GB GPU sitting at 0% utilization while the CPU burned through LLM responses at 3 tokens/second.

We refused to buy a new GPU.

The fix: Vulkan

The ggml project — the engine behind llama.cpp and stable-diffusion.cpp — supports Vulkan as a GPU backend. Vulkan is an open standard that still supports the RX 580 natively since its 2017 drivers.

No CUDA. No ROCm. No DirectML. Just Vulkan.

Results (real terminal logs, not benchmarks)

Workload	Model	Speed
LLM inference	Mistral 7B Q4	15–16 tok/s
Image generation	DreamShaper 8 GGUF	~72s/image
FLUX.1 Schnell	flux1-schnell-q4_k (hybrid)	~14 min @ 1024×1024

CPU baseline without GPU: 3–5 tok/s.
Vulkan uplift: 3–4× on a GPU that "doesn't support AI."

Hardware

GPU:     AMD RX 580 2048SP — 8GB GDDR5 (Polaris / GCN4)
CPU:     Intel Xeon E5-2690 v3 — 12c/24t (2014)
RAM:     32GB DDR4 REG ECC
Storage: NVMe 1TB — 1.7–3.5 GB/s
OS:      Windows 10 Pro + WSL2 Ubuntu 22.04

The NVMe alone reduced FLUX model load time from 25 minutes to 30 seconds.
Storage is as critical as the GPU.

Build llama.cpp with Vulkan

# Run in Developer PowerShell for VS
cd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20

Validate:

cd build\bin\Release
.\llama-cli.exe --list-devices
# Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅

Build stable-diffusion.cpp with Vulkan

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp && mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20

Run the server

E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
  -m "E:\models\dreamshaper_8.safetensors"

Connect OpenWebUI → Admin → Images → Automatic1111 → http://YOUR_LOCAL_IP:7860/

⚠️ Critical: two types of GGUF

If you try to run FLUX and get new_sd_ctx_t failed — you downloaded the wrong GGUF.

Source	Compatible with
city96 (HuggingFace)	ComfyUI only
leejet (HuggingFace)	stable-diffusion.cpp ✅

Always use: https://huggingface.co/leejet/FLUX.1-schnell-gguf

What failed (documented)

Attempt	Error	Why
DirectML	`OpaqueTensorImpl`	MS tensors can't talk to ComfyUI backends
ROCm	Kernel panics	GCN4 dropped in v5.x — permanent
OpenVINO	`No module 'ldm'`	Extension targets old A1111 arch
CPU + HDD	19 min/image	No GPU + mechanical I/O bottleneck

Full documentation

📖 Complete guide (PT/EN/ES/FR/AR) with architecture diagrams, benchmarks, automation scripts:
👉 setup-ia-local-rx580-vulkan.web.app

📦 GitHub (scripts + docs):
👉 github.com/aivisionslab-studios/rx580-local-ai-guide

The problem was never the GPU.

Запуск Flux Schnell (12B) + LLM на устаревшей AMD RX 580 (8 ГБ) через Vulkan — Полное архитектурное руководство [2026]

AIVisionsLab — Fri, 22 May 2026 18:24:02 +0000

Многие считали, что RX 580 «мертва» для ИИ в 2026 году. Экосистемы, завязанные только на CUDA, прекращение поддержки Polaris в ROCm начиная с версии 5.x, и DirectML, который так и не был доведен до ума. Это подробный технический отчет о том, как мы доказали обратное.

Аппаратное обеспечение

GPU: AMD RX 580 2048SP — 8 ГБ GDDR5 VRAM (нативная поддержка Vulkan 1.x)
CPU: Intel Xeon E5-2690 v3 — 12 ядер/24 потока @ 3.5 ГГц boost
RAM: 32 ГБ DDR4 REG ECC Quad Channel
Накопитель: NVMe 1 ТБ (критически важно для устранения «узких мест»)
ОС: Windows 10 Pro + WSL2 Ubuntu 22.04.5

Почему другие решения не работают?

Решение	Статус	Причина
CUDA	❌	Только для Nvidia
ROCm	❌	Поддержка Polaris прекращена в v5.x
DirectML	❌	Ошибка `OpaqueTensorImpl` в CLIPTextEncode
OpenVINO	❌	Отсутствие модулей `ldm/sgm` в Forge

Фатальная ошибка DirectML:

NotImplementedError: Cannot access storage of OpaqueTensorImpl

Драйвер упаковывает память в непрозрачные тензоры (opaque tensors), которые бэкенды внимания ComfyUI не могут считать. Это тупик.

Решение — Двухуровневая архитектура

ПУТЬ 1 — GPU Vulkan (ускорение RX 580)

Нативная сборка stable-diffusion.cpp, скомпилированная с -DGGML_VULKAN=ON. Движок ggml работает напрямую с GPU без необходимости в ROCm или CUDA. Модели SD 1.5 GGUF генерируют изображение примерно за 72 секунды.

ПУТЬ 2 — CPU Xeon (тяжелые SOTA модели)

FLUX.1 Schnell (16 ГБ) превышает объем физической VRAM. ComfyUI работает через CPU внутри WSL2, используя ECC RAM в качестве стабильной виртуальной VRAM. Генерация 768x768 занимает ~24 минуты.

Гибридная сегментация памяти (Flux 12B Q4_K)

Компонент	Файл	Выделение памяти
Diffusion Model	flux1-schnell-q4_k.gguf	GPU VRAM ~6.5 ГБ
VAE	ae.safetensors	CPU RAM ~160 МБ
CLIP L	clip_l.safetensors	GPU VRAM ~235 МБ
T5XXL	t5xxl_fp16.safetensors	CPU RAM ~9.3 ГБ

Команда для запуска

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 \
  --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" \
  --vae "E:\models\ae.safetensors" \
  --clip_l "E:\models\clip_l.safetensors" \
  --t5xxl "E:\models\t5xxl_fp16.safetensors" \
  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-on-cpu и --vae-tiling обязательны. Без них ошибка DeviceMemoryAllocation возникает мгновенно.

Бенчмарки

Задача	Бэкенд	Результат
LLM инференс	Только CPU	3–5 токенов/с ❌
LLM инференс	RX 580 Vulkan	15–16 токенов/с ✅
SD 1.5 20 шагов	DirectML	~450с + сбой ❌
SD 1.5 20 шагов	Vulkan натив	~72с ✅
Flux 1024x1024	Xeon CPU WSL2	~24 мин ✅

Примечание: Время загрузки моделей сократилось с 25 мин (HDD) до 4 мин (NVMe).

Карта сервисов

OpenWebUI Docker :3000
  ├── llama-server.exe :8081  (Vulkan — RX 580)
  ├── sd-server.exe    :7860  (Vulkan — RX 580)
  └── ComfyUI          :8188  (CPU — Xeon WSL2)

Ресурсы

Полная документация, .bat скрипты оркестрации и скомпилированные бинарные файлы:
👉 https://setup-ia-local-rx580-vulkan.web.app/

Железо не умирает. Оно просто получает вторую жизнь благодаря правильному ПО. Используете старые карты AMD для ИИ? Давайте обсудим оптимизацию буферов и задержки в комментариях.

Совет: Для тегов на Dev.to используйте: russia, ai, hardware, amd, vulkan.