<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AIVisionsLab</title>
    <description>The latest articles on DEV Community by AIVisionsLab (@aivisionslab).</description>
    <link>https://dev.to/aivisionslab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946564%2F2e4047e5-fedf-4680-9e84-b8d8a1f32be6.png</url>
      <title>DEV Community: AIVisionsLab</title>
      <link>https://dev.to/aivisionslab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aivisionslab"/>
    <language>en</language>
    <item>
      <title>Rodei um modelo MoE de 35B de parâmetros em uma RX 580 de 8GB de 2017 (e quase desisti três vezes)</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Tue, 16 Jun 2026 20:50:45 +0000</pubDate>
      <link>https://dev.to/aivisionslab/rodei-um-modelo-moe-de-35b-de-parametros-em-uma-rx-580-de-8gb-de-2017-e-quase-desisti-tres-vezes-2i1i</link>
      <guid>https://dev.to/aivisionslab/rodei-um-modelo-moe-de-35b-de-parametros-em-uma-rx-580-de-8gb-de-2017-e-quase-desisti-tres-vezes-2i1i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcob5ovtkes3ocijxat17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcob5ovtkes3ocijxat17.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Isso começou como uma pergunta idiota feita depois de já ter visto o Qwen3 4B rodando a 35 tokens/s via Vulkan na mesma máquina: &lt;em&gt;se isso já funciona, até onde vai o limite real?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A resposta levou duas sessões de testes, cinco tentativas falhas, um erro de protocolo, dois esgotamentos de contexto, um timeout de cliente e, só então, uma resposta completa. No dia seguinte, mais três testes confirmaram exatamente o que tinha dado errado e por quê. Esse é o relato inteiro, sem cortar as partes em que o negócio quebrou.&lt;/p&gt;

&lt;h2&gt;
  
  
  O hardware: nada disso é novo
&lt;/h2&gt;

&lt;p&gt;Xeon E5-2690 v3, 12 núcleos / 24 threads, lançado em 2014. 31,8GB de RAM DDR4 REG ECC em quad-channel. E uma AMD Radeon RX 580 2048SP de 8GB GDDR5, lançada em 2017 — a placa que todo mundo associa a mineração de criptomoeda, não a inferência de modelos de linguagem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Componente&lt;/th&gt;
&lt;th&gt;Especificação&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;Intel Xeon E5-2690 v3 — 12C/24T, 3,05GHz turbo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;31,8GB DDR4 REG ECC quad-channel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;AMD RX 580 2048SP — 8GB GDDR5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;NVMe (modelos) + HDD (swap/sistema)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Driver AMD&lt;/td&gt;
&lt;td&gt;31.0.21925.1001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Vulkan — sem CUDA, sem ROCm, sem DirectML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Stack de software: llama.cpp (build b9049), Vulkan SDK 1.4.350.0, OpenWebUI v0.9.6 e SearXNG via Docker para web search. Nada exótico, tudo open source, tudo rodando local.&lt;/p&gt;

&lt;p&gt;O modelo era o &lt;strong&gt;Qwen3.5-35B-A3B-Uncensored Q6_K&lt;/strong&gt;: 34,66 bilhões de parâmetros totais, arquitetura Mixture of Experts com 256 experts, dos quais apenas 8 são ativados por token — 3,1% do modelo "aceso" a cada passo. Esse detalhe é o motivo pelo qual a história inteira é possível. Em um modelo denso de 35B, os 35 bilhões de parâmetros entram em jogo para cada token. No MoE, o roteador escolhe 8 experts relevantes e ignora os outros 248 naquele instante. Isso não reduz o tamanho do arquivo no disco (ainda são 26,55GB), mas reduz brutalmente o que precisa estar disponível com baixa latência ao mesmo tempo. &lt;em&gt;(Logs brutos completos dessa rodada na &lt;a href="https://setup-ia-local-rx580-vulkan.web.app/#limit_qwen_35b" rel="noopener noreferrer"&gt;Seção 33 do laboratório&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  O fitting automático: 1,15 segundo decidindo onde colocar 26GB
&lt;/h2&gt;

&lt;p&gt;Aqui está a parte que, na minha opinião, é mais interessante que o benchmark final. Eu não passei nenhuma flag manual de camadas (&lt;code&gt;-ngl&lt;/code&gt;, &lt;code&gt;--override-tensor&lt;/code&gt; ou qualquer split rígido). O comando de inicialização foi literalmente isso:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;.\llama&lt;span class="na"&gt;-server&lt;/span&gt;.exe &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="s2"&gt;"E:\models\Qwen3.5-35B-A3B-...-Q6_K.gguf"&lt;/span&gt; &lt;span class="na"&gt;--host &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0 &lt;span class="na"&gt;--port &lt;/span&gt;&lt;span class="m"&gt;8081&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E o llama.cpp resolveu o problema sozinho. A situação inicial era inviável: o modelo completo precisava de ~32.961 MiB de VRAM e havia 7.366 MiB livres. Um déficit de mais de 26GB. O algoritmo de fitting fez, em sequência, e em pouco mais de um segundo:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduziu o contexto&lt;/strong&gt; de 262.144 tokens (o máximo de treino do modelo) para 4.096 tokens, liberando ~5.347 MiB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moveu todos os 256 experts MoE para a RAM&lt;/strong&gt;, mapeados via mmap — 25.613 MiB saindo da equação da GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realocou as camadas densas residuais de volta para a GPU&lt;/strong&gt;, de trás para frente (back-to-front), até ocupar 3.048 MiB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preencheu o resto front-to-back com overflow fracionado no gate layer&lt;/strong&gt;, terminando com 41 camadas na GPU (36 delas "overflowing"), uso final de 6.255 MiB e apenas 1.111 MiB livres.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;O resultado: 41 camadas densas + a output layer ficaram na VRAM da RX 580 (5.154 MiB), e os 256 experts MoE foram para &lt;code&gt;CPU_Mapped&lt;/code&gt;, ocupando 26.784 MiB de RAM via mmap. Em cima disso ainda rodava KV cache (80 MiB), buffer recorrente (251 MiB), compute buffer (770 MiB) e mais alguns buffers menores — tudo somando entre 6,2 e 7,2GB de uso efetivo de VRAM, ou seja, 77–90% dos 8GB físicos.&lt;/p&gt;

&lt;p&gt;O sistema acabou usando &lt;strong&gt;quatro níveis de memória simultaneamente&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VRAM GDDR5 da RX 580 (~400 GB/s) para as camadas densas.&lt;/li&gt;
&lt;li&gt;RAM DDR4 ECC quad-channel (~51 GB/s) para os experts MoE via mmap.&lt;/li&gt;
&lt;li&gt;SSD NVMe (1,7–3,5 GB/s) como origem do arquivo .gguf.&lt;/li&gt;
&lt;li&gt;HDD via swap do Windows (~120–180 MB/s) quando a RAM passava de 97% de uso.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Esse último nível é o vilão da história, e chega a aparecer de novo mais adiante.&lt;/p&gt;

&lt;h2&gt;
  
  
  Os números da primeira sessão
&lt;/h2&gt;

&lt;p&gt;Com flash attention habilitado automaticamente, fused gated delta net (autoregressive e chunked) ativos, e 4 slots paralelos configurados, a geração ficou estável em torno de &lt;strong&gt;5,6 tokens/segundo&lt;/strong&gt;, com prompt eval em ~34–40 tok/s.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sessão&lt;/th&gt;
&lt;th&gt;Prompt Eval&lt;/th&gt;
&lt;th&gt;Geração&lt;/th&gt;
&lt;th&gt;Tokens totais&lt;/th&gt;
&lt;th&gt;Tempo total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;34,13 tok/s&lt;/td&gt;
&lt;td&gt;5,57 tok/s&lt;/td&gt;
&lt;td&gt;1.377&lt;/td&gt;
&lt;td&gt;~107s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;~40,00 tok/s&lt;/td&gt;
&lt;td&gt;5,64 tok/s&lt;/td&gt;
&lt;td&gt;2.929&lt;/td&gt;
&lt;td&gt;~533s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A temperatura nunca passou de 80°C, com a placa operando entre 44–64°C na maior parte do tempo e só subindo de fato quando o web search estava ativo (70–75°C). O throttling térmico da RX 580 fica em torno de 90°C — sobrou margem o tempo inteiro. Em nenhum momento a placa crashou ou resetou, mesmo sob 12h+ de stress acumulado entre as duas sessões.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pergunta que quebrou cinco vezes antes de funcionar
&lt;/h2&gt;

&lt;p&gt;O prompt de teste foi sempre o mesmo: peça para explicar atenção em transformers e por que MoE é mais eficiente que modelos densos, em português. Documentei cada tentativa porque cada falha ensinou algo diferente:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teste 1&lt;/strong&gt; — thinking ON + web search ON + geração de imagem ON. Resultado: erro de protocolo. O OpenWebUI injetou um prefill de resposta incompatível com a flag &lt;code&gt;enable_thinking&lt;/code&gt;, e o servidor cancelou a chamada depois de já ter gastado um minuto pensando e recuperado 10 fontes do SearXNG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teste 2&lt;/strong&gt; — thinking ON + web search ON, 30 pesquisas consecutivas. O contexto esgotou: &lt;code&gt;n_tokens = 3285, truncated = 1&lt;/code&gt;. As 30 buscas injetaram tanto texto no histórico que sobrou pouquíssimo espaço para o raciocínio interno do modelo, que tentou alocar seu próprio buffer e estourou os 4.096 tokens definidos pelo fitting automático.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teste 3&lt;/strong&gt; — mesma configuração, 25 pesquisas em uma rodada acumulada. Esgotamento de novo, agora batendo exatamente no teto: &lt;code&gt;n_tokens = 4095, truncated = 1&lt;/code&gt;. O OpenWebUI reenviou o histórico anterior, e o prompt cresceu até o limite físico.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teste 4&lt;/strong&gt; — thinking ON, web search OFF. A temperatura caiu para 51°C estáveis, mas o raciocínio interno levou tempo demais na CPU Xeon de 2014, e a interface desistiu por timeout antes do servidor terminar de gerar qualquer coisa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teste 5&lt;/strong&gt; — thinking ON, web search OFF, prompt reduzido a 45 tokens. Sucesso completo. O modelo pensou por 4 minutos sob o Xeon, rascunhou a matemática internamente e entregou uma resposta técnica e estruturada em português sobre atenção scaled dot-product, multi-head e a eficiência do roteamento esparso do MoE.&lt;/p&gt;

&lt;p&gt;A conclusão da primeira sessão ficou clara: a GPU e a CPU nunca foram o problema. Em nenhuma das cinco tentativas houve instabilidade física, reset ou crash. Todas as falhas foram de calibração de software — thinking mode e contexto de 4.096 tokens não combinam quando o histórico (especialmente com web search) cresce demais.&lt;/p&gt;

&lt;p&gt;Para contexto: outro projeto da comunidade, do Matheus Fertunani, rodou o mesmo Qwen3.5 35B em Q8 usando CPU pura com 192GB de RAM em Linux, atingindo 7–8 tokens/s. Esse setup, com uma GPU de menos de R$400 no mercado de segunda mão somada a 32GB de RAM ECC, chegou a 5,64 tokens/s. A diferença de hardware profissional para hardware reaproveitado é grande, mas o resultado prático fica surpreendentemente próximo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capítulo 2: provando a hipótese, no dia seguinte
&lt;/h2&gt;

&lt;p&gt;A hipótese da primeira sessão era simples — "o problema nunca foi o hardware, foi thinking + web search esgotando os 4.096 tokens de contexto." No dia seguinte, três testes foram desenhados especificamente para confirmar isso. &lt;em&gt;(Documentação completa desses três testes na &lt;a href="https://setup-ia-local-rx580-vulkan.web.app/#proving_hypothesis_35b" rel="noopener noreferrer"&gt;Seção 34&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  O curl resolve o que o navegador não resolvia
&lt;/h3&gt;

&lt;p&gt;Primeiro, era preciso isolar se o timeout do Teste 4 vinha do servidor ou da camada AJAX do cliente OpenWebUI. A resposta veio batendo o endpoint &lt;code&gt;/v1/chat/completions&lt;/code&gt; diretamente via curl, com timeout de 600 segundos e sem nenhuma interface no meio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;&lt;span class="nb"&gt;curl.exe&lt;/span&gt; &lt;span class="na"&gt;-X &lt;/span&gt;&lt;span class="kd"&gt;POST&lt;/span&gt; &lt;span class="kd"&gt;http&lt;/span&gt;://localhost:8081/v1/chat/completions &lt;span class="na"&gt;-H &lt;/span&gt;&lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="na"&gt;--max-time &lt;/span&gt;&lt;span class="m"&gt;600&lt;/span&gt; &lt;span class="na"&gt;-d &lt;/span&gt;&lt;span class="s2"&gt;"@E:\teste.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resultado: &lt;code&gt;truncated = 0&lt;/code&gt;. Resposta completa entregue — 1.955 tokens, 266,42 segundos de tempo total, 255,97 segundos de eval a 6,57 tok/s (já usando Q4_K_M, que entrou no lugar do Q6_K para esse segundo bloco de testes). O hardware sempre conseguiu terminar o trabalho; era o navegador que desistia da conexão TCP enquanto o servidor continuava gerando em segundo plano.&lt;/p&gt;

&lt;p&gt;Durante esse mesmo teste, o OpenWebUI foi conectado em paralelo disparando a mesma pergunta por outro canal. O agendador do llama.cpp processou as duas tarefas concorrentemente sem travar nada: uma gerando a 5,96 tok/s e outra a 5,08 tok/s, simultaneamente, com a GPU em 63°C e o uso de RAM do sistema em 91% — seguro, sem travar o Windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  --ctx-size 8192 e os "pensamentos" capturados do modelo
&lt;/h3&gt;

&lt;p&gt;O segundo teste simplesmente subiu o contexto manualmente:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;.\llama&lt;span class="na"&gt;-server&lt;/span&gt;.exe &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="s2"&gt;"...Q4_K_M.gguf"&lt;/span&gt; &lt;span class="na"&gt;--host &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0 &lt;span class="na"&gt;--port &lt;/span&gt;&lt;span class="m"&gt;8081&lt;/span&gt; &lt;span class="na"&gt;--ctx-size &lt;/span&gt;&lt;span class="m"&gt;8192&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Com esse buffer dobrado, o modelo recebeu o prompt "explique por que MoE permite rodar modelos grandes em hardware com pouca VRAM" e processou por 9 minutos inteiros de raciocínio em background, sem nenhum corte. A parte curiosa veio da interceptação direta dos blocos &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; gerados nas três rodadas de teste — o que dá uma visão rara de como o modelo "argumenta" internamente antes de responder.&lt;/p&gt;

&lt;p&gt;No bloco capturado durante o esgotamento de contexto (Teste com Q4_K_M ainda em 4.096 tokens), o modelo literalmente calculou a memória necessária para 35B de parâmetros em diferentes quantizações — FP32 em 140GB, FP16 em 70GB, INT8 em 35GB, INT4 em 17,5GB — concluiu que nenhuma dessas contas fechava com 8GB de VRAM, e então corrigiu a própria premissa: se o modelo está rodando mesmo assim, só pode ser por offloading agressivo dos experts inativos para a RAM, mantendo na GPU apenas o roteador e os blocos compartilhados. Esse raciocínio descreveu, com bastante precisão, a própria arquitetura de mmap que estava sustentando ele em tempo real — sem ter qualquer acesso aos metadados do sistema de arquivos do laboratório. Esse bloco específico não chegou a ser entregue, porque o contexto de 4.096 tokens estourou antes da resposta final ser formatada.&lt;/p&gt;

&lt;p&gt;Já com &lt;code&gt;--ctx-size 8192&lt;/code&gt; ativo, o mesmo tipo de raciocínio — mais focado, menos exploratório, claramente comprimindo o rascunho até caber em "direto e conciso" como pedido — terminou em sucesso absoluto, zero cortes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Métrica&lt;/th&gt;
&lt;th&gt;Sessão Q6_K&lt;/th&gt;
&lt;th&gt;Sessão Q4_K_M (ctx 4096)&lt;/th&gt;
&lt;th&gt;Sessão Q4_K_M (ctx 8192)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duração do raciocínio&lt;/td&gt;
&lt;td&gt;~4 min&lt;/td&gt;
&lt;td&gt;~11 min&lt;/td&gt;
&lt;td&gt;~9 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens do bloco &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~2.000&lt;/td&gt;
&lt;td&gt;~3.500&lt;/td&gt;
&lt;td&gt;~3.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Questionou a premissa do prompt?&lt;/td&gt;
&lt;td&gt;Não&lt;/td&gt;
&lt;td&gt;Sim, com cálculo de memória&lt;/td&gt;
&lt;td&gt;Sim, reajuste técnico&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resultado&lt;/td&gt;
&lt;td&gt;Sucesso&lt;/td&gt;
&lt;td&gt;Estourou contexto&lt;/td&gt;
&lt;td&gt;Sucesso absoluto&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Q4_K_M venceu na prática
&lt;/h3&gt;

&lt;p&gt;Trocar Q6_K (28,51GB) por Q4_K_M (21,17GB) liberou cerca de 7GB na carga física de RAM, o que reduziu a dependência do swap em HDD — o ponto mais lento de toda a cadeia de memória. O resultado prático: geração subindo para 6,42–6,65 tok/s, pico de temperatura caindo para 74°C (10°C mais frio que a sessão anterior) e zero atividade de swap visível durante a inferência.&lt;/p&gt;

&lt;h2&gt;
  
  
  O que sobrou disso tudo
&lt;/h2&gt;

&lt;p&gt;Algumas conclusões que valem para qualquer pessoa tentando algo parecido com hardware velho:&lt;/p&gt;

&lt;p&gt;O thermal throttling nunca foi um risco real nesse setup — mesmo sob horas de stress acumulado, a RX 580 ficou sempre 10–16°C abaixo do limite de 90°C. O fitting automático do llama.cpp via cálculo de grafos resolveu a distribuição entre GPU e RAM melhor do que qualquer tentativa manual de split por flags rígidas teria feito. O thinking mode do Qwen3.5 consome sozinho entre 2.000 e 3.500 tokens antes de produzir qualquer resposta visível, então qualquer contexto abaixo de 8.192 tokens vira um gargalo quase garantido se você também ligar web search ou histórico de conversa. E o timeout do cliente (navegador/interface) é, na prática, mais limitante que a capacidade real do hardware — bater direto no endpoint via curl revelou isso de forma inequívoca.&lt;/p&gt;

&lt;p&gt;No fim, o veredito é direto: uma GPU de 2017 com 8GB de VRAM e uma CPU de datacenter de 2014 rodam um modelo MoE de 35B de parâmetros de forma estável, sem crash, sem throttling e sem custo adicional além da eletricidade. Não é prático para uso diário — 5,6 a 6,6 tokens/s e um contexto efetivo limitado não competem com qualquer GPU moderna —, mas como prova de conceito sobre até onde sparsity de MoE e fitting automático de memória conseguem levar hardware obsoleto, a resposta é: bem mais longe do que o mercado de hardware sugere.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hardware: Xeon E5-2690 v3 + RX 580 2048SP 8GB + 32GB DDR4 ECC. Stack: llama.cpp (Vulkan) + OpenWebUI + SearXNG. Todos os números vêm de logs reais de duas sessões de teste, sem nenhuma camada de marketing por cima.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllama</category>
      <category>llamacpp</category>
      <category>vulkan</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Thu, 11 Jun 2026 23:35:39 +0000</pubDate>
      <link>https://dev.to/aivisionslab/running-local-ai-on-an-amd-rx-580-in-2026-the-complete-vulkan-guide-52a5</link>
      <guid>https://dev.to/aivisionslab/running-local-ai-on-an-amd-rx-580-in-2026-the-complete-vulkan-guide-52a5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yluqzyzu6akde4khwu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yluqzyzu6akde4khwu8.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running Local AI on an AMD RX 580 in 2026 — The Complete Vulkan Guide&lt;br&gt;
"An RX 580 from 2017 running a 12B parameter model in 2026. &lt;br&gt;
No CUDA. No ROCm. No cloud. Here's exactly how."&lt;/p&gt;

&lt;p&gt;That was the consensus in 2026. AMD dropped ROCm support for Polaris/GCN4 architecture in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. Every mainstream AI stack gave up on this card.&lt;br&gt;
We didn't.&lt;br&gt;
This is the complete technical record of how we built a full local AI production stack on an AMD RX 580 8GB — running LLMs at 17 tok/s, generating images in 72 seconds, transcribing audio 150× faster than CPU, and even cloning voices. All offline. All free. All on hardware that cost under $50.&lt;/p&gt;

&lt;p&gt;The Hardware&lt;br&gt;
ComponentSpecGPUAMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4)CPUIntel Xeon E5-2690 v3 — 12c/24t · 3.5GHz (2014)RAM32GB DDR4 REG ECC Quad ChannelStorageNVMe 1TB — 1.7–3.5 GB/sOSWindows 10 Pro + WSL2 / Ubuntu 26.04 LTS&lt;br&gt;
The RX 580 2048SP is the mining-variant with 2048 shader processors instead of the original 2304SP. It's everywhere on the used market for under $50. It performs identically through Vulkan.&lt;br&gt;
One thing nobody talks about: storage matters as much as the GPU. Moving from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to 30 seconds. The bottleneck was never the GPU.&lt;/p&gt;

&lt;p&gt;Why Vulkan?&lt;br&gt;
The entire mainstream AI stack runs on either CUDA (Nvidia-only) or ROCm (AMD dropped Polaris in v5.x). That leaves legacy AMD GPUs with no official path.&lt;br&gt;
But there's a third option: Vulkan — a universal graphics/compute API that works on any modern GPU, including the RX 580, which has supported Vulkan 1.x since its 2017 drivers.&lt;br&gt;
The ggml project (the engine behind llama.cpp and stable-diffusion.cpp) implements Vulkan compute backends in pure C++. This means you can compile directly against the Vulkan API and completely bypass the ROCm/CUDA ecosystem. No driver packages. No compatibility layers. Just the GPU doing math.&lt;/p&gt;

&lt;p&gt;What We Tried Before Vulkan (And Why It All Failed)&lt;br&gt;
Before finding the working path, we hit every dead end:&lt;br&gt;
DirectML + ComfyUI — The GPU gets detected as privateuseone0, but then:&lt;br&gt;
NotImplementedError: Cannot access storage of OpaqueTensorImpl&lt;br&gt;
DirectML wraps tensor data in opaque objects that ComfyUI's attention backends literally cannot read. Also: Microsoft hasn't updated it since September 2024. It's abandoned.&lt;br&gt;
ROCm on Polaris — AMD officially dropped GCN4/Polaris in ROCm v5.x. Compatibility layers via WSL2 generate kernel panics under inference load. There is no Windows support. Dead end by design.&lt;br&gt;
OpenVINO + Stable Diffusion Forge — Intel's extension was built for the old Automatic1111 architecture. Forge restructured everything. Result:&lt;br&gt;
ModuleNotFoundError: No module named 'ldm'&lt;br&gt;
ModuleNotFoundError: No module named 'sgm'&lt;br&gt;
Error build_unet: Invalid backend: 'openvino'&lt;br&gt;
CPU-only + HDD — Our baseline before any optimization: 85-second startup, ~19 minutes per 512×512 image. The mechanical HDD competing with memory paging made it completely unusable.&lt;br&gt;
The pattern: every "AMD-compatible" option either targets newer hardware, is abandoned, or is simply incompatible with modern pipelines. Vulkan is the only path that actually works.&lt;/p&gt;

&lt;p&gt;The Architecture: Dual-Path Stack&lt;br&gt;
The core insight of this project is that not every workload fits in 8GB of VRAM. The solution is intelligent routing between GPU and CPU:&lt;br&gt;
OpenWebUI  :3000  (Docker)&lt;br&gt;
    │&lt;br&gt;
    ├──► llama-server  :8081  ──►  RX 580 Vulkan  [llama.cpp]&lt;br&gt;
    │         └── Ollama      :11434  ──►  CPU fallback&lt;br&gt;
    │&lt;br&gt;
    └──► sd-server     :7860  ──►  RX 580 Vulkan  [stable-diffusion.cpp]&lt;br&gt;
              ├── SD 1.5 GGUF      ──►  72s / image&lt;br&gt;
              └── FLUX hybrid      ──►  ~14 min / image&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;└──► ComfyUI       :8188  ──►  Xeon CPU WSL2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Path 1 — GPU Vulkan: LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.&lt;br&gt;
Path 2 — CPU Xeon: FLUX.1 16GB models, AnimateDiff video pipelines. The 32GB ECC RAM acts as "virtual VRAM" for models that don't fit on the card.&lt;/p&gt;

&lt;p&gt;Building llama.cpp with Vulkan&lt;br&gt;
Run in Developer PowerShell for Visual Studio:&lt;br&gt;
powershellcd E:\&lt;br&gt;
git clone &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;https://github.com/ggerganov/llama.cpp&lt;/a&gt;&lt;br&gt;
cd llama.cpp&lt;br&gt;
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release&lt;br&gt;
cmake --build build --config Release -j20&lt;br&gt;
Validate GPU detection:&lt;br&gt;
powershellcd build\bin\Release&lt;br&gt;
.\llama-cli.exe --list-devices&lt;/p&gt;

&lt;h1&gt;
  
  
  Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅
&lt;/h1&gt;

&lt;p&gt;Start the LLM server:&lt;br&gt;
powershell.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" `&lt;br&gt;
  --host 0.0.0.0 --port 8081 --device Vulkan0&lt;br&gt;
How to verify it's actually using the GPU:&lt;br&gt;
ggml_vulkan: Found 1 Vulkan device(s)&lt;br&gt;
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB&lt;br&gt;
17.77 t/s  ← RX 580 Vulkan ✅&lt;br&gt;
If you see 3–5 t/s with no ggml_vulkan line — it's running on CPU. The --device Vulkan0 flag is mandatory.&lt;/p&gt;

&lt;p&gt;Building stable-diffusion.cpp with Vulkan&lt;br&gt;
powershellgit clone --recursive &lt;a href="https://github.com/leejet/stable-diffusion.cpp" rel="noopener noreferrer"&gt;https://github.com/leejet/stable-diffusion.cpp&lt;/a&gt;&lt;br&gt;
cd stable-diffusion.cpp&lt;br&gt;
mkdir build &amp;amp;&amp;amp; cd build&lt;br&gt;
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release&lt;br&gt;
cmake --build . --config Release -j20&lt;br&gt;
Start the image server:&lt;br&gt;
powershellE:&lt;br&gt;
cd "E:\stable-diffusion.cpp\build\bin\Release"&lt;br&gt;
.\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `&lt;br&gt;
  -m "E:\models\dreamshaper8.gguf"&lt;/p&gt;

&lt;p&gt;FLUX.1 Schnell: Running a 16GB Model on 8GB VRAM&lt;br&gt;
FLUX.1 Schnell is a 12B parameter SOTA model that nominally requires 16GB. Here's how we run it on 8GB:&lt;br&gt;
The strategy is memory segmentation — put the diffusion model on VRAM, offload everything else to RAM:&lt;br&gt;
ComponentFileWhereDiffusion Modelflux1-schnell-q4_k.ggufGPU VRAM (~6.5GB)VAEae.safetensorsCPU RAM (~160MB)CLIP Lclip_l.safetensorsGPU VRAM (~235MB)T5XXLt5xxl_fp16.safetensorsCPU RAM (~9.3GB)&lt;br&gt;
batchsd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^&lt;br&gt;
  --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^&lt;br&gt;
  --vae "E:\models\ae.safetensors" ^&lt;br&gt;
  --clip_l "E:\models\clip_l.safetensors" ^&lt;br&gt;
  --t5xxl "E:\models\t5xxl_fp16.safetensors" ^&lt;br&gt;
  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling&lt;/p&gt;

&lt;p&gt;⚠️ --vae-tiling is not optional. Without it, VAE decode causes OOM and crashes the server.&lt;/p&gt;

&lt;p&gt;Timing per 1024×1024 image:&lt;br&gt;
StageTimeT5XXL conditioning11.49sSampling (4 steps)~838sVAE decode (9 tiles)40.45sTotal~14 min&lt;br&gt;
Critical: Two GGUF formats for FLUX&lt;br&gt;
This trips up almost everyone. There are two different GGUF distributions for FLUX:&lt;br&gt;
SourceCompatible withcity96 (HuggingFace)ComfyUI + ComfyUI-GGUF node onlyleejet (HuggingFace)stable-diffusion.cpp ✅&lt;br&gt;
Using a city96 GGUF in sd-server returns:&lt;br&gt;
[ERROR] main.cpp:92 - new_sd_ctx_t failed&lt;br&gt;
Always download from: huggingface.co/leejet/FLUX.1-schnell-gguf&lt;/p&gt;

&lt;p&gt;whisper.cpp: Audio Transcription on the RX 580&lt;br&gt;
This is where the numbers get absurd.&lt;br&gt;
Build whisper.cpp with Vulkan:&lt;br&gt;
powershellgit clone &lt;a href="https://github.com/ggml-org/whisper.cpp" rel="noopener noreferrer"&gt;https://github.com/ggml-org/whisper.cpp&lt;/a&gt;&lt;br&gt;
cd whisper.cpp&lt;br&gt;
cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF&lt;br&gt;
cmake --build build --config Release -j4&lt;br&gt;
Transcribe a video (MP4 → TXT):&lt;br&gt;
powershell# Extract audio first (Whisper requires WAV on Windows)&lt;br&gt;
ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"&lt;/p&gt;

&lt;h1&gt;
  
  
  Transcribe
&lt;/h1&gt;

&lt;p&gt;.\build\bin\Release\whisper-cli.exe &lt;code&gt;&lt;br&gt;
  -m models\ggml-large-v3-turbo.bin&lt;/code&gt;&lt;br&gt;
  -f "audio.wav" -l pt --output-txt&lt;br&gt;
Performance on a 15-minute video (Windows):&lt;br&gt;
StageTimeModel load4sMel spectrogram1.2sGPU encode73sDecode + batch168sTotal307s&lt;br&gt;
VRAM used: only 2.6GB of 8GB. CPU stays at ~5%.&lt;br&gt;
On Linux (Ubuntu 26.04, Mesa RADV), same hardware, same model:&lt;br&gt;
MetricWindowsLinuxTime (106s audio)307s23.58sVRAM used2.6GB1.6GB&lt;br&gt;
A 13× speedup on the same GPU. Mesa RADV's Vulkan compute path is dramatically more efficient for this workload than the Windows AMD driver.&lt;/p&gt;

&lt;p&gt;Windows vs Linux: Full Benchmark Comparison&lt;br&gt;
WorkloadWindows 10Ubuntu 26.04 (Mesa RADV)WinnerLLM Qwen3 4B @ 99 layers~15–17 tok/s~35 tok/s🏆 Linux (2×)LLM Qwen3.6 35B @ max layers7.62 tok/s (max 10 ngl)5.18 tok/s (max 20 ngl)⚖️ TieSD 1.5 DreamShaper (50 steps)~72s~85s🏆 WindowsFLUX Schnell (4 steps, 512×512)~84s~52s🏆 LinuxWhisper large-v3-turbo307s · 2.6GB23.58s · 1.6GB🏆 Linux&lt;br&gt;
Why Linux is faster for LLM: Mesa RADV allows up to 20 GPU layers for large models where Windows AMD drivers cap at 10. RADV's memory management is simply more aggressive and efficient.&lt;br&gt;
Why Windows wins SD 1.5: The proprietary AMD driver has more stable direct rendering for this specific workload. Consistent 1.44s/it vs 1.65s/it on Linux.&lt;/p&gt;

&lt;p&gt;Voice Cloning: Applio RVC on AMD Windows&lt;br&gt;
We also built a full voice cloning pipeline:&lt;br&gt;
Text → Balabolka (TTS) → WAV → Applio RVC → Cloned Voice&lt;br&gt;
The key insight: instead of using a generative TTS model (which sounds robotic), we use a real voice actor (Antônio Neural, a Microsoft Neural voice) for prosody and emotion, then apply RVC to convert the identity to our target voice (Yuri). Result: 80–95% naturalness vs 60–70% for pure TTS.&lt;br&gt;
AMD-specific critical findings:&lt;br&gt;
DirectML is effectively dead for RVC — torch-directml is locked to torch==2.4.1 while Applio requires torch==2.7.1. Irreconcilable conflict.&lt;br&gt;
Use CPU mode. On Xeon E5-2690 v3 (24 threads): ~6 min/epoch, ~20 hours for 200 epochs. Inference after training: 2 hours of audio → ~30 minutes processing.&lt;br&gt;
The silent failure trap:&lt;br&gt;
powershell# NEVER set these — they silently break feature extraction&lt;/p&gt;

&lt;h1&gt;
  
  
  set CUDA_VISIBLE_DEVICES=-1
&lt;/h1&gt;

&lt;h1&gt;
  
  
  set ROCM_VISIBLE_DEVICES=-1
&lt;/h1&gt;

&lt;h1&gt;
  
  
  Training will print "Model trained successfully" but produce nothing
&lt;/h1&gt;

&lt;p&gt;Always verify logs/project/extracted/ contains .npy files before starting training.&lt;/p&gt;

&lt;p&gt;The Community Timeline&lt;br&gt;
This project didn't happen in isolation. Three independent researchers, same GPU, same conclusion:&lt;br&gt;
DateAuthorContributionJan 2025艾米心 AmihartFirst LLM via Vulkan on RX 580 — 24.56 tok/s on DebianDec 2025DH / DadHacksFirst SD via Vulkan — stable-diffusion.cpp breakthrough2026AIVisionsLabFull Windows + Linux production stack, voice cloning, transcription&lt;br&gt;
The shared foundation: ggml by Georgi Gerganov. Vulkan compute backends in pure C++ that bypass the entire proprietary driver ecosystem.&lt;/p&gt;

&lt;p&gt;Real Benchmarks Summary&lt;br&gt;
WorkloadModelBackendResultLLM inferenceMistral 7B Q4_K_MRX 580 Vulkan (Win)17–18 tok/sLLM inferenceQwen3 4B Q4_K_MRX 580 Vulkan (Linux)~35 tok/sLLM baselineMistral 7B Q4_K_MXeon CPU pure3–5 tok/sImage genDreamShaper 8 SD1.5RX 580 Vulkan~72s / 512×512Image genflux1-schnell-q4_kGPU+CPU hybrid~14 min @ 1024×1024Audio transcriptionWhisper large-v3-turboRX 580 Vulkan (Linux)23.58s / 106s audioVideo framesAnimateDiffXeon WSL2 CPU~141s/frameVoice inferenceApplio RVCXeon CPU~30 min / 2h audio&lt;/p&gt;

&lt;p&gt;Troubleshooting: The Most Common Failures&lt;br&gt;
generate_image returned no results / frozen terminal&lt;br&gt;
Bug in sd-server with Seed: -1. Fix: set a fixed integer seed (42, 1337) in OpenWebUI.&lt;br&gt;
new_sd_ctx_t failed with FLUX&lt;br&gt;
You're using a city96 GGUF. Download from leejet instead.&lt;br&gt;
Docker can't reach sd-server&lt;br&gt;
Windows Defender blocks the Docker subnet (172.x.x.x). Run as Administrator:&lt;br&gt;
powershellNew-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `&lt;br&gt;
  -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow&lt;br&gt;
--override-tensor exps=CPU slows down Vulkan&lt;br&gt;
This flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU transfer overhead destroys any gains. Don't apply CUDA-optimized flags to Vulkan backends.&lt;/p&gt;

&lt;p&gt;Full Documentation&lt;br&gt;
This post covers the core architecture. Full guides for each component:&lt;/p&gt;

&lt;p&gt;📖 Master documentation (PT/EN): setup-ia-local-rx580-vulkan.web.app&lt;br&gt;
💻 GitHub repository: github.com/aivisionslab-studios/rx580-local-ai-guide&lt;br&gt;
🎥 YouTube: @aivisionslab-hub&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
The narrative that legacy AMD GPUs can't run AI is a software problem, not a hardware limitation. The RX 580 has supported Vulkan since 2017. The compute capability was always there.&lt;br&gt;
What changed is that ggml and its ecosystem built Vulkan backends that bypass the entire proprietary driver stack. The result is a GPU from 2017 running SOTA models from 2026 — locally, privately, for free.&lt;br&gt;
RX 580 (2017) + Xeon (2014) + Vulkan + ggml = SOTA AI in 2026&lt;br&gt;
The problem was never the GPU.&lt;/p&gt;

&lt;p&gt;AIVisionsLab — Documenting local AI on legacy hardware.&lt;br&gt;
São Paulo, Brazil 🇧🇷&lt;/p&gt;

</description>
      <category>ai</category>
      <category>amd</category>
      <category>vulkan</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Three researchers. One GPU. Two years. How the RX 580 became an AI platform.</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Sun, 24 May 2026 13:20:37 +0000</pubDate>
      <link>https://dev.to/aivisionslab/three-researchers-one-gpu-two-years-how-the-rx-580-became-an-ai-platform-5989</link>
      <guid>https://dev.to/aivisionslab/three-researchers-one-gpu-two-years-how-the-rx-580-became-an-ai-platform-5989</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxqtvno31j3ttgqdvj5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxqtvno31j3ttgqdvj5n.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All images in this article were generated on the RX 580 8GB — the same GPU everyone said couldn't run AI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  This is collective knowledge
&lt;/h2&gt;

&lt;p&gt;Three independent researchers. No coordination. Same GPU. Same conclusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  January 2025 — 艾米心 Amihart
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Platform:&lt;/strong&gt; Debian Linux&lt;br&gt;
&lt;strong&gt;Published:&lt;/strong&gt; &lt;a href="https://medium.com/@amihart" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amihart was the first to document LLM inference via Vulkan on the RX 580.&lt;/p&gt;

&lt;p&gt;Compiled &lt;code&gt;llama.cpp&lt;/code&gt; with &lt;code&gt;-DGGML_VULKAN=on&lt;/code&gt; on Debian, connected a Celeron G6900 CPU setup, and measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU only:&lt;/strong&gt; 5.45 tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RX 580 via Vulkan:&lt;/strong&gt; 24.56 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 4.5× uplift on hardware that officially "doesn't support AI."&lt;/p&gt;

&lt;p&gt;But then came this line — honest, and correct for the time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Sadly, even though Vulkan seems to do a pretty good job with the RX580, I am unaware of any way to get Vulkan to work with Stable Diffusion. If you want to use Stable Diffusion, you will need ROCm."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sentence opened a question that the next researcher answered.&lt;/p&gt;


&lt;h2&gt;
  
  
  December 2025 — DH / DadHacks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Platform:&lt;/strong&gt; Linux/Debian&lt;br&gt;
&lt;strong&gt;Published:&lt;/strong&gt; &lt;a href="https://dadhacks.org/2025/12/05/ai-image-generation-on-rx-580-using-vulkan-a-cost-effective-solution/" rel="noopener noreferrer"&gt;dadhacks.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DadHacks refuted Amihart's limitation — not as a criticism, but as proof that the software evolved.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stable-diffusion.cpp&lt;/code&gt; had matured. With &lt;code&gt;-DSD_VULKAN=ON&lt;/code&gt; (equivalent to &lt;code&gt;-DGGML_VULKAN=ON&lt;/code&gt; in newer versions), image generation via Vulkan on the RX 580 worked.&lt;/p&gt;

&lt;p&gt;Including FLUX.1 Schnell in Q4 quantization, with CPU offloading for components that exceeded VRAM.&lt;/p&gt;

&lt;p&gt;The barrier Amihart correctly identified in January had fallen by December.&lt;/p&gt;


&lt;h2&gt;
  
  
  2026 — AIVisionsLab
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Platform:&lt;/strong&gt; Windows 10 Pro + WSL2&lt;br&gt;
&lt;strong&gt;Published:&lt;/strong&gt; &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third step was integration.&lt;/p&gt;

&lt;p&gt;Both previous projects ran on Linux. Neither connected everything into a unified daily-use system on Windows. Neither documented the failures (DirectML, ROCm, OpenVINO). Neither built automation scripts. Neither integrated OpenWebUI.&lt;/p&gt;

&lt;p&gt;AIVisionsLab filled those gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Windows stack with &lt;code&gt;.bat&lt;/code&gt; automation&lt;/li&gt;
&lt;li&gt;OpenWebUI integration via Docker with firewall notes&lt;/li&gt;
&lt;li&gt;Dual architecture: GPU Vulkan for fast models, Xeon CPU WSL2 for FLUX 16GB&lt;/li&gt;
&lt;li&gt;Documented every failure with root cause analysis&lt;/li&gt;
&lt;li&gt;Discovered the critical GGUF incompatibility: city96 vs leejet formats&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The question each project answered
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amihart&lt;/td&gt;
&lt;td&gt;Can LLMs run on Vulkan RX 580?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes.&lt;/strong&gt; 24.56 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DadHacks&lt;/td&gt;
&lt;td&gt;Can Stable Diffusion run on Vulkan RX 580?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes.&lt;/strong&gt; sd.cpp works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIVisionsLab&lt;/td&gt;
&lt;td&gt;Can all this run integrated on Windows daily?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes.&lt;/strong&gt; Full stack documented&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  The common denominator
&lt;/h2&gt;

&lt;p&gt;All three converge on the same engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml (Georgi Gerganov)
  ├── llama.cpp    → LLMs via Vulkan
  └── stable-diffusion.cpp (leejet) → Images via Vulkan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ggml&lt;/code&gt; ported deep learning tensor operations to C and exposed Vulkan hooks. That single decision freed legacy AMD hardware from the CUDA/ROCm dependency trap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three philosophies, same conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amihart:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Despite how ancient this card is, it is technically possible to use it for AI."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;DadHacks:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"This setup provides an accessible pathway for leveraging existing hardware investments without requiring expensive upgrades or specialized software stacks like ROCm."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;AIVisionsLab:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Commercial planned obsolescence is a market choice, not an engineering barrier. Legacy hardware doesn't die — it's liberated by the right software."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Full documentation
&lt;/h2&gt;

&lt;p&gt;📖 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt; — complete guide in PT/EN/ES/FR/AR&lt;br&gt;
📦 &lt;a href="https://github.com/aivisionslab-studios/rx580-local-ai-guide" rel="noopener noreferrer"&gt;github.com/aivisionslab-studios/rx580-local-ai-guide&lt;/a&gt;&lt;br&gt;
🤗 &lt;a href="https://huggingface.co/aivisionslab/ai-local-rx580-stack" rel="noopener noreferrer"&gt;huggingface.co/aivisionslab/ai-local-rx580-stack&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>amd</category>
      <category>hystory</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Running FLUX.1 Schnell on an RX 580 8GB — GPU/CPU hybrid architecture</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Sun, 24 May 2026 13:18:29 +0000</pubDate>
      <link>https://dev.to/aivisionslab/running-flux1-schnell-on-an-rx-580-8gb-gpucpu-hybrid-architecture-ipb</link>
      <guid>https://dev.to/aivisionslab/running-flux1-schnell-on-an-rx-580-8gb-gpucpu-hybrid-architecture-ipb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5a23do7yme5xr69qlqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5a23do7yme5xr69qlqz.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Image above: generated by FLUX.1 Schnell running on the hybrid architecture described in this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;FLUX.1 Schnell is a 12B parameter model. Full precision needs more VRAM than the RX 580 has.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;split the components between GPU and CPU RAM&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory map
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Diffusion model&lt;/td&gt;
&lt;td&gt;flux1-schnell-q4_k.gguf&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPU VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~6.5GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE&lt;/td&gt;
&lt;td&gt;ae.safetensors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CPU RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~160MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLIP L&lt;/td&gt;
&lt;td&gt;clip_l.safetensors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPU VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~235MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T5XXL&lt;/td&gt;
&lt;td&gt;t5xxl_fp16.safetensors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CPU RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~9.3GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total VRAM used:&lt;/strong&gt; ~6.7GB / 8GB available&lt;br&gt;
&lt;strong&gt;Total RAM used:&lt;/strong&gt; ~9.5GB&lt;/p&gt;

&lt;p&gt;The T5XXL encoder dominates RAM usage. If you're tight on RAM, &lt;code&gt;t5xxl_fp8.safetensors&lt;/code&gt; reduces it to ~5GB.&lt;/p&gt;


&lt;h2&gt;
  
  
  ⚠️ Critical: use leejet GGUF, not city96
&lt;/h2&gt;

&lt;p&gt;Two different GGUF formats exist for FLUX. They have similar names but are NOT interchangeable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;city96 on HuggingFace&lt;/td&gt;
&lt;td&gt;ComfyUI + ComfyUI-GGUF node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;leejet on HuggingFace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;stable-diffusion.cpp ✅&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using city96 GGUF with sd-server returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] stable-diffusion.cpp:355 - get sd version from file failed
[ERROR] main.cpp:92 - new_sd_ctx_t failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download from: &lt;code&gt;https://huggingface.co/leejet/FLUX.1-schnell-gguf&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The command
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;&lt;span class="kd"&gt;sd&lt;/span&gt;&lt;span class="na"&gt;-server&lt;/span&gt;.exe &lt;span class="na"&gt;--listen-ip &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0 &lt;span class="na"&gt;--listen-port &lt;/span&gt;&lt;span class="m"&gt;7860&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--diffusion-model &lt;/span&gt;&lt;span class="s2"&gt;"E:\models\flux1-schnell-q4_k.gguf"&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--vae &lt;/span&gt;&lt;span class="s2"&gt;"E:\models\ae.safetensors"&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--clip&lt;/span&gt;_l &lt;span class="s2"&gt;"E:\models\clip_l.safetensors"&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--t&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="kd"&gt;xxl&lt;/span&gt; &lt;span class="s2"&gt;"E:\models\t5xxl_fp16.safetensors"&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--cfg-scale &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0 &lt;span class="na"&gt;--steps &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;--clip-on-cpu --vae-on-cpu --vae-tiling
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Flag breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--clip-on-cpu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Frees ~235MB VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--vae-on-cpu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Frees ~160MB VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--vae-tiling&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevents OOM at high resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cfg-scale 1.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Required for FLUX — higher values distort&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--steps 4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Schnell converges in 4 steps by design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Real benchmark
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T5XXL conditioning&lt;/td&gt;
&lt;td&gt;11.49s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sampling (4 steps @ 1024×1024)&lt;/td&gt;
&lt;td&gt;~838s (~14 min)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE decode (9 tiles)&lt;/td&gt;
&lt;td&gt;40.45s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~14 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Terminal status at generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Listening on http://0.0.0.0:7860
VRAM: 7.6/8.0 GB | RAM: ~9.5 GB | Temp: 66°C
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Windows Firewall fix
&lt;/h2&gt;

&lt;p&gt;If OpenWebUI can't reach the server even with &lt;code&gt;--listen-ip 0.0.0.0&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run as Administrator&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;New-NetFirewallRule&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DisplayName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sd-server AIVisionsLab"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Inbound&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Protocol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;TCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-LocalPort&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;7860&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Action&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Allow&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker runs in an isolated WSL2 network — &lt;code&gt;127.0.0.1&lt;/code&gt; won't work. Use your machine's actual local IP.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full documentation
&lt;/h2&gt;

&lt;p&gt;📖 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt;&lt;br&gt;
📦 &lt;a href="https://github.com/aivisionslab-studios/rx580-local-ai-guide" rel="noopener noreferrer"&gt;github.com/aivisionslab-studios/rx580-local-ai-guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>flux</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>stablediffusion</category>
    </item>
    <item>
      <title>Everything that failed before Vulkan saved our RX 580 AI setup</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Sun, 24 May 2026 13:14:16 +0000</pubDate>
      <link>https://dev.to/aivisionslab/everything-that-failed-before-vulkan-saved-our-rx-580-ai-setup-4apj</link>
      <guid>https://dev.to/aivisionslab/everything-that-failed-before-vulkan-saved-our-rx-580-ai-setup-4apj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lyfny98rk42la9ot4j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lyfny98rk42la9ot4j0.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All images in this article were generated locally on the RX 580 8GB — after we fixed everything described below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The graveyard
&lt;/h2&gt;

&lt;p&gt;Before Vulkan worked, we tried everything. This is the technical autopsy.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. DirectML — Microsoft's promise that crashed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The attempt:&lt;/strong&gt; torch-directml with &lt;code&gt;--directml&lt;/code&gt; flag in ComfyUI.&lt;/p&gt;

&lt;p&gt;The GPU was detected as &lt;code&gt;privateuseone0&lt;/code&gt;. Looked promising.&lt;/p&gt;

&lt;p&gt;Then this appeared on every run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: torch-directml barely works, is very slow,
has not been updated in over 1 year and might be
removed soon, please don't use it.

NotImplementedError: Cannot access storage of OpaqueTensorImpl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; DirectML wraps tensor data in opaque objects called &lt;code&gt;OpaqueTensorImpl&lt;/code&gt;. When ComfyUI's modern attention backends try to read the raw memory contents, the Microsoft layer blocks access entirely.&lt;/p&gt;

&lt;p&gt;The project hasn't been updated in over a year. It's effectively abandoned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual fix attempt:&lt;/strong&gt; Downgrade to the May 2024 dev build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip uninstall torch torch-directml torchaudio
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.3.1+cpu &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cpu
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch-directml&lt;span class="o"&gt;==&lt;/span&gt;0.2.1.dev240521 &lt;span class="nt"&gt;--no-deps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stops the crash but the performance is so slow it's unusable.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. ROCm — officially dead for GCN4
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The attempt:&lt;/strong&gt; AMD's official GPGPU framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; AMD dropped official support for Polaris/GCN4 architecture in ROCm v5.x. Permanently. There is no workaround.&lt;/p&gt;

&lt;p&gt;On Windows: no native ROCm support at all.&lt;br&gt;
On WSL2 with compatibility layers: kernel panics under heavy inference load.&lt;/p&gt;

&lt;p&gt;The only working ROCm path for the RX 580 is via Docker containers that emulate &lt;code&gt;gfx803&lt;/code&gt; — which is what &lt;a href="https://medium.com/@amihart" rel="noopener noreferrer"&gt;Amihart documented in January 2025&lt;/a&gt;. It works for Stable Diffusion, but requires Docker overhead and doesn't support modern FLUX architecture.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. OpenVINO + Stable Diffusion Forge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The attempt:&lt;/strong&gt; Intel's &lt;code&gt;sd-webui-openvino&lt;/code&gt; extension inside Forge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ModuleNotFoundError: No module named 'ldm'
ModuleNotFoundError: No module named 'sgm'
Error build_unet: Invalid backend: 'openvino'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The extension was designed for the old AUTOMATIC1111 architecture. Forge completely restructured the codebase and replaced the native &lt;code&gt;ldm&lt;/code&gt; and &lt;code&gt;sgm&lt;/code&gt; modules. The OpenVINO injection fails at the foundation level.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. CPU + HDD — the baseline disaster
&lt;/h2&gt;

&lt;p&gt;Before any GPU acceleration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boot time: &lt;strong&gt;85 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;LLM response: &lt;strong&gt;3–5 tok/s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Image generation: &lt;strong&gt;~19 minutes per 512×512 image&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;FLUX 16GB model load: &lt;strong&gt;25 minutes from HDD&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mechanical drive was as much of a bottleneck as the missing GPU acceleration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually worked
&lt;/h2&gt;

&lt;p&gt;After all of this: &lt;strong&gt;Vulkan&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ggml&lt;/code&gt; engine in &lt;code&gt;llama.cpp&lt;/code&gt; and &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; uses Vulkan as a native GPU backend. The RX 580 has supported Vulkan 1.x since 2017 drivers. No special installation. No compatibility layers. Just compile with &lt;code&gt;-DGGML_VULKAN=ON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Results after switching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM: &lt;strong&gt;15–16 tok/s&lt;/strong&gt; (from 3–5)&lt;/li&gt;
&lt;li&gt;Image: &lt;strong&gt;~72s&lt;/strong&gt; (from ~19 min)&lt;/li&gt;
&lt;li&gt;FLUX load: &lt;strong&gt;30 seconds&lt;/strong&gt; (from 25 min, after NVMe migration)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;The hardware was never the problem. Every failure above was a software problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DirectML: abandoned by Microsoft&lt;/li&gt;
&lt;li&gt;ROCm: architecture policy decision by AMD&lt;/li&gt;
&lt;li&gt;OpenVINO: extension not maintained for modern frontends&lt;/li&gt;
&lt;li&gt;HDD: wrong storage choice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The RX 580 was waiting for &lt;code&gt;ggml&lt;/code&gt; + Vulkan.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Full documentation
&lt;/h2&gt;

&lt;p&gt;📖 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt;&lt;br&gt;
📦 &lt;a href="https://github.com/aivisionslab-studios/rx580-local-ai-guide" rel="noopener noreferrer"&gt;github.com/aivisionslab-studios/rx580-local-ai-guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>amd</category>
      <category>playwright</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Rodei Flux Schnell + LLM numa GPU de R$300. Sem CUDA. Sem cloud. Sem ROCm.</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Sun, 24 May 2026 13:08:33 +0000</pubDate>
      <link>https://dev.to/aivisionslab/rodei-flux-schnell-llm-numa-gpu-de-r300-sem-cuda-sem-cloud-sem-rocm-575e</link>
      <guid>https://dev.to/aivisionslab/rodei-flux-schnell-llm-numa-gpu-de-r300-sem-cuda-sem-cloud-sem-rocm-575e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qh3ho60mfgjjsdwlh2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qh3ho60mfgjjsdwlh2e.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Todas as imagens deste artigo foram geradas localmente na RX 580 8GB descrita abaixo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A narrativa era clara
&lt;/h2&gt;

&lt;p&gt;Em 2026, todo guia diz a mesma coisa:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Sua AMD RX 580 não roda IA. Compra uma GPU nova."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A AMD removeu suporte ROCm para Polaris/GCN4 na v5.x.&lt;br&gt;
DirectML travava com erros de &lt;code&gt;OpaqueTensorImpl&lt;/code&gt;.&lt;br&gt;
OpenVINO falhava silenciosamente.&lt;/p&gt;

&lt;p&gt;GPU de 8GB parada em 0% de uso enquanto o CPU respondia LLMs a 3 tokens por segundo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A gente recusou comprar uma GPU nova.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  A solução: Vulkan
&lt;/h2&gt;

&lt;p&gt;O projeto &lt;code&gt;ggml&lt;/code&gt; — engine base do &lt;code&gt;llama.cpp&lt;/code&gt; e &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; — suporta Vulkan como backend de GPU. Vulkan é um padrão aberto que ainda suporta a RX 580 nativamente desde os drivers de 2017.&lt;/p&gt;

&lt;p&gt;Sem CUDA. Sem ROCm. Sem DirectML. Só Vulkan.&lt;/p&gt;


&lt;h2&gt;
  
  
  Resultados reais (logs do terminal, não benchmarks sintéticos)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Modelo&lt;/th&gt;
&lt;th&gt;Velocidade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Mistral 7B Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15–16 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geração de imagem&lt;/td&gt;
&lt;td&gt;DreamShaper 8 GGUF&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~72s/imagem&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLUX.1 Schnell&lt;/td&gt;
&lt;td&gt;flux1-schnell-q4_k híbrido&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~14 min @ 1024×1024&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CPU sem GPU: &lt;strong&gt;3–5 tok/s&lt;/strong&gt;.&lt;br&gt;
Ganho com Vulkan: &lt;strong&gt;3–4×&lt;/strong&gt; numa GPU que "não suporta IA".&lt;/p&gt;


&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU:     AMD RX 580 2048SP — 8GB GDDR5 (Polaris / GCN4)
CPU:     Intel Xeon E5-2690 v3 — 12c/24t (2014)
RAM:     32GB DDR4 REG ECC
Storage: NVMe 1TB — 1.7–3.5 GB/s
OS:      Windows 10 Pro + WSL2 Ubuntu 22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;O NVMe sozinho reduziu o carregamento do FLUX de &lt;strong&gt;25 minutos para 30 segundos&lt;/strong&gt;.&lt;br&gt;
Storage é tão crítico quanto a GPU.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Compilar llama.cpp com Vulkan
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Executar no Developer PowerShell do VS&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;E:\&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/ggerganov/llama.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llama.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-B&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nx"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Release&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-j20&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Validação:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build\bin\Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\llama-cli.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--list-devices&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# Esperado: Vulkan0: AMD Radeon RX 580 2048SP ✅&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Compilar stable-diffusion.cpp com Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--recursive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/leejet/stable-diffusion.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;stable-diffusion.cpp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nx"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Release&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-j20&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Subir o servidor
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;&lt;span class="kd"&gt;E&lt;/span&gt;:
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"E:\stable-diffusion.cpp\build\bin\Release"&lt;/span&gt;
&lt;span class="kd"&gt;sd&lt;/span&gt;&lt;span class="na"&gt;-server&lt;/span&gt;.exe &lt;span class="na"&gt;--listen-ip &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0 &lt;span class="na"&gt;--listen-port &lt;/span&gt;&lt;span class="m"&gt;7860&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="s2"&gt;"E:\models\dreamshaper_8.safetensors"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No OpenWebUI → Admin → Imagens → Automatic1111 → &lt;code&gt;http://SEU_IP_LOCAL:7860/&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Crítico: dois tipos de GGUF incompatíveis
&lt;/h2&gt;

&lt;p&gt;Se você tentar rodar FLUX e receber &lt;code&gt;new_sd_ctx_t failed&lt;/code&gt; — você baixou o GGUF errado.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fonte&lt;/th&gt;
&lt;th&gt;Compatível com&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;city96&lt;/strong&gt; (HuggingFace)&lt;/td&gt;
&lt;td&gt;ComfyUI apenas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;leejet&lt;/strong&gt; (HuggingFace)&lt;/td&gt;
&lt;td&gt;stable-diffusion.cpp ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sempre use: &lt;code&gt;https://huggingface.co/leejet/FLUX.1-schnell-gguf&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  O que não funcionou (documentado com causa raiz)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tentativa&lt;/th&gt;
&lt;th&gt;Erro&lt;/th&gt;
&lt;th&gt;Motivo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DirectML&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OpaqueTensorImpl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tensores MS incompatíveis com ComfyUI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROCm&lt;/td&gt;
&lt;td&gt;Kernel panics&lt;/td&gt;
&lt;td&gt;GCN4 removido no v5.x — permanente&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO&lt;/td&gt;
&lt;td&gt;&lt;code&gt;No module 'ldm'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extensão para arquitetura antiga A1111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU + HDD&lt;/td&gt;
&lt;td&gt;19 min/imagem&lt;/td&gt;
&lt;td&gt;Zero GPU + gargalo de I/O mecânico&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Documentação completa
&lt;/h2&gt;

&lt;p&gt;📖 Guia master (PT/EN/ES/FR/AR) com diagramas, benchmarks, scripts de automação:&lt;br&gt;
👉 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📦 GitHub (scripts + docs):&lt;br&gt;
👉 &lt;a href="https://github.com/aivisionslab-studios/rx580-local-ai-guide" rel="noopener noreferrer"&gt;github.com/aivisionslab-studios/rx580-local-ai-guide&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;O problema nunca foi a placa.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>programacao</category>
    </item>
    <item>
      <title>I ran Flux Schnell + LLMs on a $50 GPU. No CUDA. No cloud. No ROCm.</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Sun, 24 May 2026 13:04:51 +0000</pubDate>
      <link>https://dev.to/aivisionslab/i-ran-flux-schnell-llms-on-a-50-gpu-no-cuda-no-cloud-no-rocm-55ap</link>
      <guid>https://dev.to/aivisionslab/i-ran-flux-schnell-llms-on-a-50-gpu-no-cuda-no-cloud-no-rocm-55ap</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b3bc3im4u06uc5f5itk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b3bc3im4u06uc5f5itk.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All images in this article were generated locally on the RX 580 8GB described below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The narrative was clear
&lt;/h2&gt;

&lt;p&gt;In 2026, every guide says the same thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your AMD RX 580 can't run AI. Buy a new GPU."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AMD dropped ROCm support for Polaris/GCN4 in v5.x.&lt;br&gt;
DirectML crashed with &lt;code&gt;OpaqueTensorImpl&lt;/code&gt; errors.&lt;br&gt;
OpenVINO failed silently.&lt;/p&gt;

&lt;p&gt;So we had a 8GB GPU sitting at 0% utilization while the CPU burned through LLM responses at 3 tokens/second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We refused to buy a new GPU.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The fix: Vulkan
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ggml&lt;/code&gt; project — the engine behind &lt;code&gt;llama.cpp&lt;/code&gt; and &lt;code&gt;stable-diffusion.cpp&lt;/code&gt; — supports Vulkan as a GPU backend. Vulkan is an open standard that still supports the RX 580 natively since its 2017 drivers.&lt;/p&gt;

&lt;p&gt;No CUDA. No ROCm. No DirectML. Just Vulkan.&lt;/p&gt;


&lt;h2&gt;
  
  
  Results (real terminal logs, not benchmarks)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM inference&lt;/td&gt;
&lt;td&gt;Mistral 7B Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15–16 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;DreamShaper 8 GGUF&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~72s/image&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLUX.1 Schnell&lt;/td&gt;
&lt;td&gt;flux1-schnell-q4_k (hybrid)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~14 min @ 1024×1024&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CPU baseline without GPU: &lt;strong&gt;3–5 tok/s&lt;/strong&gt;.&lt;br&gt;
Vulkan uplift: &lt;strong&gt;3–4×&lt;/strong&gt; on a GPU that "doesn't support AI."&lt;/p&gt;


&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU:     AMD RX 580 2048SP — 8GB GDDR5 (Polaris / GCN4)
CPU:     Intel Xeon E5-2690 v3 — 12c/24t (2014)
RAM:     32GB DDR4 REG ECC
Storage: NVMe 1TB — 1.7–3.5 GB/s
OS:      Windows 10 Pro + WSL2 Ubuntu 22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;The NVMe alone reduced FLUX model load time from &lt;strong&gt;25 minutes to 30 seconds&lt;/strong&gt;.&lt;br&gt;
Storage is as critical as the GPU.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Build llama.cpp with Vulkan
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run in Developer PowerShell for VS&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;E:\&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/ggerganov/llama.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llama.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-B&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nx"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Release&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-j20&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Validate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build\bin\Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\llama-cli.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--list-devices&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Build stable-diffusion.cpp with Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--recursive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://github.com/leejet/stable-diffusion.cpp&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;stable-diffusion.cpp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mkdir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;build&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nx"&gt;cmake&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Release&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-j20&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Run the server
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;&lt;span class="kd"&gt;E&lt;/span&gt;:
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"E:\stable-diffusion.cpp\build\bin\Release"&lt;/span&gt;
&lt;span class="kd"&gt;sd&lt;/span&gt;&lt;span class="na"&gt;-server&lt;/span&gt;.exe &lt;span class="na"&gt;--listen-ip &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0 &lt;span class="na"&gt;--listen-port &lt;/span&gt;&lt;span class="m"&gt;7860&lt;/span&gt; &lt;span class="se"&gt;^
&lt;/span&gt;  &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="s2"&gt;"E:\models\dreamshaper_8.safetensors"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect OpenWebUI → Admin → Images → Automatic1111 → &lt;code&gt;http://YOUR_LOCAL_IP:7860/&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Critical: two types of GGUF
&lt;/h2&gt;

&lt;p&gt;If you try to run FLUX and get &lt;code&gt;new_sd_ctx_t failed&lt;/code&gt; — you downloaded the wrong GGUF.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Compatible with&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;city96&lt;/strong&gt; (HuggingFace)&lt;/td&gt;
&lt;td&gt;ComfyUI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;leejet&lt;/strong&gt; (HuggingFace)&lt;/td&gt;
&lt;td&gt;stable-diffusion.cpp ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Always use: &lt;code&gt;https://huggingface.co/leejet/FLUX.1-schnell-gguf&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What failed (documented)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attempt&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DirectML&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OpaqueTensorImpl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MS tensors can't talk to ComfyUI backends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROCm&lt;/td&gt;
&lt;td&gt;Kernel panics&lt;/td&gt;
&lt;td&gt;GCN4 dropped in v5.x — permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO&lt;/td&gt;
&lt;td&gt;&lt;code&gt;No module 'ldm'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extension targets old A1111 arch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU + HDD&lt;/td&gt;
&lt;td&gt;19 min/image&lt;/td&gt;
&lt;td&gt;No GPU + mechanical I/O bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Full documentation
&lt;/h2&gt;

&lt;p&gt;📖 Complete guide (PT/EN/ES/FR/AR) with architecture diagrams, benchmarks, automation scripts:&lt;br&gt;
👉 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app" rel="noopener noreferrer"&gt;setup-ia-local-rx580-vulkan.web.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📦 GitHub (scripts + docs):&lt;br&gt;
👉 &lt;a href="https://github.com/aivisionslab-studios/rx580-local-ai-guide" rel="noopener noreferrer"&gt;github.com/aivisionslab-studios/rx580-local-ai-guide&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The problem was never the GPU.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Запуск Flux Schnell (12B) + LLM на устаревшей AMD RX 580 (8 ГБ) через Vulkan — Полное архитектурное руководство [2026]</title>
      <dc:creator>AIVisionsLab</dc:creator>
      <pubDate>Fri, 22 May 2026 18:24:02 +0000</pubDate>
      <link>https://dev.to/aivisionslab/zapusk-flux-schnell-12b-llm-na-ustarievshiei-amd-rx-580-8-gb-chieriez-vulkan-polnoie-arkhitiekturnoie-273d</link>
      <guid>https://dev.to/aivisionslab/zapusk-flux-schnell-12b-llm-na-ustarievshiei-amd-rx-580-8-gb-chieriez-vulkan-polnoie-arkhitiekturnoie-273d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofim06vygdktsaknbjvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofim06vygdktsaknbjvc.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Многие считали, что RX 580 «мертва» для ИИ в 2026 году. Экосистемы, завязанные только на CUDA, прекращение поддержки Polaris в ROCm начиная с версии 5.x, и DirectML, который так и не был доведен до ума. Это подробный технический отчет о том, как мы доказали обратное.&lt;/p&gt;

&lt;h2&gt;
  
  
  Аппаратное обеспечение
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; AMD RX 580 2048SP — 8 ГБ GDDR5 VRAM (нативная поддержка Vulkan 1.x)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; Intel Xeon E5-2690 v3 — 12 ядер/24 потока @ 3.5 ГГц boost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 32 ГБ DDR4 REG ECC Quad Channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Накопитель:&lt;/strong&gt; NVMe 1 ТБ (критически важно для устранения «узких мест»)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ОС:&lt;/strong&gt; Windows 10 Pro + WSL2 Ubuntu 22.04.5&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Почему другие решения не работают?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Решение&lt;/th&gt;
&lt;th&gt;Статус&lt;/th&gt;
&lt;th&gt;Причина&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CUDA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Только для Nvidia&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ROCm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Поддержка Polaris прекращена в v5.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DirectML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Ошибка &lt;code&gt;OpaqueTensorImpl&lt;/code&gt; в CLIPTextEncode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenVINO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Отсутствие модулей &lt;code&gt;ldm/sgm&lt;/code&gt; в Forge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Фатальная ошибка DirectML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NotImplementedError: Cannot access storage of OpaqueTensorImpl

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Драйвер упаковывает память в непрозрачные тензоры (opaque tensors), которые бэкенды внимания ComfyUI не могут считать. Это тупик.&lt;/p&gt;

&lt;h2&gt;
  
  
  Решение — Двухуровневая архитектура
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ПУТЬ 1 — GPU Vulkan (ускорение RX 580)
&lt;/h3&gt;

&lt;p&gt;Нативная сборка &lt;code&gt;stable-diffusion.cpp&lt;/code&gt;, скомпилированная с &lt;code&gt;-DGGML_VULKAN=ON&lt;/code&gt;. Движок &lt;code&gt;ggml&lt;/code&gt; работает напрямую с GPU без необходимости в ROCm или CUDA. Модели SD 1.5 GGUF генерируют изображение примерно за 72 секунды.&lt;/p&gt;

&lt;h3&gt;
  
  
  ПУТЬ 2 — CPU Xeon (тяжелые SOTA модели)
&lt;/h3&gt;

&lt;p&gt;FLUX.1 Schnell (16 ГБ) превышает объем физической VRAM. ComfyUI работает через CPU внутри WSL2, используя ECC RAM в качестве стабильной виртуальной VRAM. Генерация 768x768 занимает ~24 минуты.&lt;/p&gt;

&lt;h3&gt;
  
  
  Гибридная сегментация памяти (Flux 12B Q4_K)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Компонент&lt;/th&gt;
&lt;th&gt;Файл&lt;/th&gt;
&lt;th&gt;Выделение памяти&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Diffusion Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;flux1-schnell-q4_k.gguf&lt;/td&gt;
&lt;td&gt;GPU VRAM ~6.5 ГБ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VAE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ae.safetensors&lt;/td&gt;
&lt;td&gt;CPU RAM ~160 МБ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLIP L&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;clip_l.safetensors&lt;/td&gt;
&lt;td&gt;GPU VRAM ~235 МБ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T5XXL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t5xxl_fp16.safetensors&lt;/td&gt;
&lt;td&gt;CPU RAM ~9.3 ГБ&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Команда для запуска
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sd-server.exe &lt;span class="nt"&gt;--listen-ip&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--listen-port&lt;/span&gt; 7860 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--diffusion-model&lt;/span&gt; &lt;span class="s2"&gt;"E:&lt;/span&gt;&lt;span class="se"&gt;\m&lt;/span&gt;&lt;span class="s2"&gt;odels&lt;/span&gt;&lt;span class="se"&gt;\f&lt;/span&gt;&lt;span class="s2"&gt;lux1-schnell-q4_k.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vae&lt;/span&gt; &lt;span class="s2"&gt;"E:&lt;/span&gt;&lt;span class="se"&gt;\m&lt;/span&gt;&lt;span class="s2"&gt;odels&lt;/span&gt;&lt;span class="se"&gt;\a&lt;/span&gt;&lt;span class="s2"&gt;e.safetensors"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clip_l&lt;/span&gt; &lt;span class="s2"&gt;"E:&lt;/span&gt;&lt;span class="se"&gt;\m&lt;/span&gt;&lt;span class="s2"&gt;odels&lt;/span&gt;&lt;span class="se"&gt;\c&lt;/span&gt;&lt;span class="s2"&gt;lip_l.safetensors"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--t5xxl&lt;/span&gt; &lt;span class="s2"&gt;"E:&lt;/span&gt;&lt;span class="se"&gt;\m&lt;/span&gt;&lt;span class="s2"&gt;odels&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s2"&gt;5xxl_fp16.safetensors"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cfg-scale&lt;/span&gt; 1.0 &lt;span class="nt"&gt;--steps&lt;/span&gt; 4 &lt;span class="nt"&gt;--clip-on-cpu&lt;/span&gt; &lt;span class="nt"&gt;--vae-on-cpu&lt;/span&gt; &lt;span class="nt"&gt;--vae-tiling&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--vae-on-cpu&lt;/code&gt; и &lt;code&gt;--vae-tiling&lt;/code&gt; обязательны. Без них ошибка &lt;code&gt;DeviceMemoryAllocation&lt;/code&gt; возникает мгновенно.&lt;/p&gt;

&lt;h2&gt;
  
  
  Бенчмарки
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Задача&lt;/th&gt;
&lt;th&gt;Бэкенд&lt;/th&gt;
&lt;th&gt;Результат&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM инференс&lt;/td&gt;
&lt;td&gt;Только CPU&lt;/td&gt;
&lt;td&gt;3–5 токенов/с ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM инференс&lt;/td&gt;
&lt;td&gt;RX 580 Vulkan&lt;/td&gt;
&lt;td&gt;15–16 токенов/с ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SD 1.5 20 шагов&lt;/td&gt;
&lt;td&gt;DirectML&lt;/td&gt;
&lt;td&gt;~450с + сбой ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SD 1.5 20 шагов&lt;/td&gt;
&lt;td&gt;Vulkan натив&lt;/td&gt;
&lt;td&gt;~72с ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flux 1024x1024&lt;/td&gt;
&lt;td&gt;Xeon CPU WSL2&lt;/td&gt;
&lt;td&gt;~24 мин ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Примечание: Время загрузки моделей сократилось с 25 мин (HDD) до 4 мин (NVMe).&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Карта сервисов
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenWebUI Docker :3000
  ├── llama-server.exe :8081  (Vulkan — RX 580)
  ├── sd-server.exe    :7860  (Vulkan — RX 580)
  └── ComfyUI          :8188  (CPU — Xeon WSL2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ресурсы
&lt;/h2&gt;

&lt;p&gt;Полная документация, &lt;code&gt;.bat&lt;/code&gt; скрипты оркестрации и скомпилированные бинарные файлы:&lt;br&gt;
👉 &lt;a href="https://setup-ia-local-rx580-vulkan.web.app/" rel="noopener noreferrer"&gt;https://setup-ia-local-rx580-vulkan.web.app/&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Железо не умирает. Оно просто получает вторую жизнь благодаря правильному ПО.&lt;/strong&gt; &lt;em&gt;Используете старые карты AMD для ИИ? Давайте обсудим оптимизацию буферов и задержки в комментариях.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Совет:&lt;/strong&gt; Для тегов на Dev.to используйте: &lt;code&gt;russia&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, &lt;code&gt;hardware&lt;/code&gt;, &lt;code&gt;amd&lt;/code&gt;, &lt;code&gt;vulkan&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
