<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David </title>
    <description>The latest articles on DEV Community by David  (@purpledoubled).</description>
    <link>https://dev.to/purpledoubled</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802440%2Fbd0118a6-e9df-4efa-965a-8f8f9c2ef510.png</url>
      <title>DEV Community: David </title>
      <link>https://dev.to/purpledoubled</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/purpledoubled"/>
    <language>en</language>
    <item>
      <title>anthropic charges $25/M tokens for opus 4.7. alibaba just released the same capability for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:14:04 +0000</pubDate>
      <link>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</link>
      <guid>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</guid>
      <description>&lt;p&gt;Anthropic charges $25 per million output tokens for Claude Opus 4.7. That's their new flagship coding model, released today. It's good — 13% better than Opus 4.6 on coding benchmarks, improved vision, stronger at multi-step agentic work.&lt;/p&gt;

&lt;p&gt;Meanwhile, also this week: Alibaba released Qwen3.6-35B-A3B under Apache 2.0. Scores 73.4 on SWE-bench Verified. Runs on an 8 GB GPU. Costs nothing.&lt;/p&gt;

&lt;p&gt;Two models. Same week. Completely opposite philosophies. Let's break down what's actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cloud tax is getting harder to justify
&lt;/h2&gt;

&lt;p&gt;When GPT-4 launched in 2023, there was nothing local that came close. Paying for API access made sense because there was no alternative.&lt;/p&gt;

&lt;p&gt;In 2024, open models started catching up. Llama 3, Qwen 2.5, Mistral — good enough for many tasks, but still clearly behind frontier models on the hard stuff.&lt;/p&gt;

&lt;p&gt;In 2026, the gap has narrowed to the point where you have to really think about whether the remaining difference is worth $25 per million output tokens.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. A developer using Opus 4.7 as their primary coding agent, running maybe 50 complex coding sessions a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average session: ~10K input tokens (code context) + ~5K output tokens (response)&lt;/li&gt;
&lt;li&gt;50 sessions: 500K input + 250K output tokens&lt;/li&gt;
&lt;li&gt;Daily cost: $2.50 + $6.25 = &lt;strong&gt;$8.75/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Monthly (~22 working days): &lt;strong&gt;~$190&lt;/strong&gt; for one developer&lt;/li&gt;
&lt;/ul&gt;
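
&lt;p&gt;The arithmetic above as a quick shell sketch (session sizes are the rough estimates from the list; "monthly" assumes ~22 working days):&lt;/p&gt;

```shell
# Opus 4.7 list prices: $5/M input, $25/M output.
# Session sizes are the article's estimates: 50 sessions of ~10K in + ~5K out.
summary=$(awk 'BEGIN {
  in_tok  = 50 * 10000                      # 500K input tokens/day
  out_tok = 50 * 5000                       # 250K output tokens/day
  daily   = in_tok/1e6*5 + out_tok/1e6*25   # $2.50 + $6.25
  printf "daily: $%.2f  monthly (22 workdays): $%.2f", daily, daily*22
}')
echo "$summary"
```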

&lt;p&gt;Now scale that to a team of 5. That's nearly $1,000/month on AI coding assistance.&lt;/p&gt;

&lt;p&gt;The same team could buy a single RTX 4070 ($550 one-time) and run Qwen3.6 at 20+ tokens/second, with no ongoing costs beyond electricity.&lt;/p&gt;

&lt;h2&gt;
  
  
  what you actually get for $0
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B isn't just "a free model." It's specifically designed for the exact use case Opus 4.7 targets — coding agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic coding benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench Verified: 73.4 (fix real bugs in real repos autonomously)&lt;/li&gt;
&lt;li&gt;Terminal-Bench 2.0: 51.5 (operate a terminal to solve coding tasks)&lt;/li&gt;
&lt;li&gt;MCPMark: 37.0 (tool calling and agent protocols)&lt;/li&gt;
&lt;li&gt;QwenWebBench: 1397 Elo (frontend artifact generation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture advantages for local deployment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE: 35B total params, 3B active — runs like a small model, thinks like a big one&lt;/li&gt;
&lt;li&gt;Gated DeltaNet: 3 of 4 layers use linear attention — memory efficient on long contexts&lt;/li&gt;
&lt;li&gt;Native vision: understand screenshots, diagrams, code images without a separate model&lt;/li&gt;
&lt;li&gt;262K context window: plenty for most codebases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you give up vs Opus 4.7:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probably some edge on the hardest 10% of tasks&lt;/li&gt;
&lt;li&gt;Anthropic's specific safety/self-verification features&lt;/li&gt;
&lt;li&gt;The polish of a model trained with massive RLHF compute&lt;/li&gt;
&lt;li&gt;Cloud convenience (no GPU needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you gain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your code never leaves your machine&lt;/li&gt;
&lt;li&gt;No rate limits, no outages, no API key management&lt;/li&gt;
&lt;li&gt;No per-token costs, ever&lt;/li&gt;
&lt;li&gt;Full control over the model behavior&lt;/li&gt;
&lt;li&gt;Works offline, on a plane, in an air-gapped environment&lt;/li&gt;
&lt;li&gt;Apache 2.0 — fine-tune it, modify it, deploy it commercially&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  the $25/M question
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 is genuinely impressive. Anthropic's coding models have been best-in-class for a while and this extends that lead. The self-verification feature — where the model checks its own work before reporting back — is particularly useful for autonomous workflows.&lt;/p&gt;

&lt;p&gt;But the honest question every developer should ask is: &lt;strong&gt;for my specific tasks, does the delta between Opus 4.7 and Qwen3.6 justify the cost?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a solo developer building a startup: probably not. Qwen3.6 autonomously resolves 73.4% of SWE-bench Verified issues (real bugs from real GitHub repositories). That's more than enough for daily coding work.&lt;/p&gt;

&lt;p&gt;For a large enterprise with strict compliance requirements and deep pockets: maybe. The convenience and Anthropic's enterprise features have real value.&lt;/p&gt;

&lt;p&gt;For anyone processing sensitive code: local wins by default. No amount of ToS promises equals "the data literally never left my hardware."&lt;/p&gt;

&lt;h2&gt;
  
  
  how to try both and decide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API key from anthropic.com
Model: claude-opus-4-7
$5/M input, $25/M output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
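
&lt;p&gt;If you'd rather hit the API directly than go through a client, it's a single POST. A minimal sketch (the endpoint and headers are the standard Anthropic Messages API shape; the model id is the one listed above):&lt;/p&gt;

```shell
# Build the request body; only sent if a key is configured.
BODY='{"model":"claude-opus-4-7","max_tokens":1024,"messages":[{"role":"user","content":"Fix the failing test in utils.py"}]}'
echo "$BODY"
if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
  curl -s https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$BODY"
fi
```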



&lt;p&gt;&lt;strong&gt;Qwen3.6 locally:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or for a complete setup with a coding agent, vision, and tool calling — &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; v2.3.3 supports both. Connect Anthropic's API for Opus 4.7 when you need it, run Qwen3.6 locally for everything else. Switch between them in the same interface. Best of both worlds.&lt;/p&gt;

&lt;h2&gt;
  
  
  where this is heading
&lt;/h2&gt;

&lt;p&gt;The pattern is clear. Every 3-4 months, a new open model appears that matches the paid frontier model from 6 months ago. The cost of "good enough" is trending toward zero.&lt;/p&gt;

&lt;p&gt;Anthropic, OpenAI, and Google will keep pushing the frontier. Open models will keep closing the gap. And the developers in the middle will increasingly ask: "Is the remaining gap worth $25 per million tokens?"&lt;/p&gt;

&lt;p&gt;Today, for most coding tasks, the answer is already no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for running AI locally. Supports cloud APIs AND local models. Chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>claude opus 4.7 just dropped. here's what runs locally for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:13:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</link>
      <guid>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</guid>
      <description>&lt;p&gt;Anthropic just released Claude Opus 4.7. It's their best coding model yet — 13% better than Opus 4.6 on their internal 93-task benchmark, better vision, stronger at long-running agentic tasks.&lt;/p&gt;

&lt;p&gt;It's also $5 per million input tokens and $25 per million output tokens. API only. Every character you type goes through Anthropic's servers.&lt;/p&gt;

&lt;p&gt;Let's talk about what you can do locally for $0.&lt;/p&gt;

&lt;h2&gt;
  
  
  what opus 4.7 actually brings
&lt;/h2&gt;

&lt;p&gt;Based on Anthropic's announcement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;13% improvement&lt;/strong&gt; over Opus 4.6 on a 93-task coding benchmark, including 4 tasks neither Opus 4.6 nor Sonnet 4.6 could solve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better vision&lt;/strong&gt; — higher resolution image understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stronger agentic workflows&lt;/strong&gt; — handles complex, multi-step tasks without losing context or stopping early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-verification&lt;/strong&gt; — the model checks its own outputs before reporting back&lt;/li&gt;
&lt;li&gt;Available on Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real improvements. Opus has been the go-to for serious coding work, and 4.7 makes it better.&lt;/p&gt;

&lt;p&gt;But here's the thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cost of frontier cloud AI
&lt;/h2&gt;

&lt;p&gt;At $5/$25 per million tokens, a heavy coding session with Opus 4.7 can easily run $2-5/day. A team of developers using it as their primary coding agent? That's hundreds per month.&lt;/p&gt;
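
&lt;p&gt;To make "$2-5/day" concrete, with hypothetical figures (not from Anthropic's announcement): a day totalling 300K input and 100K output tokens lands squarely in that range:&lt;/p&gt;

```shell
# $5/M input, $25/M output; 300K in + 100K out are illustrative numbers.
cost=$(awk 'BEGIN { printf "$%.2f/day", 300000/1e6*5 + 100000/1e6*25 }')
echo "$cost"
```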

&lt;p&gt;And every line of your proprietary code flows through someone else's infrastructure. Every prompt, every codebase context, every business logic snippet — stored, processed, potentially used for training (even with opt-outs, you're trusting the provider).&lt;/p&gt;

&lt;p&gt;For hobby projects, fine. For anything sensitive — financial code, healthcare logic, proprietary algorithms — that's a real risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  what runs locally right now
&lt;/h2&gt;

&lt;p&gt;The local model landscape has changed dramatically in the last few months. Here's what's available today at $0/month:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; (released this week)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;35B total parameters, 3B active (MoE architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — autonomous bug fixing on real GitHub repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding&lt;/li&gt;
&lt;li&gt;Built-in vision, 262K context&lt;/li&gt;
&lt;li&gt;Runs on &lt;strong&gt;8 GB VRAM&lt;/strong&gt; with Q4_K_M quantization&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is it as good as Opus 4.7? On raw capability, probably not — Anthropic has massive compute advantages. But on the tasks most developers actually do daily (fixing bugs, writing functions, understanding codebases, code review), Qwen3.6 is genuinely competitive. And it runs on hardware you already own.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real comparison isn't benchmarks
&lt;/h2&gt;

&lt;p&gt;It's this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$5/$25 per million tokens&lt;/td&gt;
&lt;td&gt;$0 forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Cloud-processed&lt;/td&gt;
&lt;td&gt;Never leaves your machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Subject to API congestion&lt;/td&gt;
&lt;td&gt;As fast as your GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;Depends on Anthropic's uptime&lt;/td&gt;
&lt;td&gt;Runs offline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;Anthropic's policy&lt;/td&gt;
&lt;td&gt;You control everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic coding&lt;/td&gt;
&lt;td&gt;Yes (strong)&lt;/td&gt;
&lt;td&gt;Yes (73.4 SWE-bench)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;API key + credit card&lt;/td&gt;
&lt;td&gt;Ollama + 10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  how to set up the local alternative
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Or if you want a full desktop experience with a coding agent, vision support, and model management:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with Qwen3.6 day-0 support. It wraps Ollama into a desktop app with a built-in coding agent that streams live between tool calls, agent mode with 13 tools and MCP integration, and remote access from your phone. Open source, AGPL-3.0.&lt;/p&gt;
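
&lt;p&gt;And once the model is pulled, anything that speaks HTTP can use it. A sketch against Ollama's local REST API (default port 11434; the model tag is the one from the command above):&lt;/p&gt;

```shell
# Request body for Ollama's /api/generate endpoint.
BODY='{"model":"qwen3.6:35b-a3b","prompt":"Write a bash one-liner that counts TODO comments","stream":false}'
echo "$BODY"
# Only send it if a local Ollama server is actually up.
if curl -sf -o /dev/null http://localhost:11434/api/version; then
  curl -s http://localhost:11434/api/generate -d "$BODY"
fi
```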

&lt;h2&gt;
  
  
  when cloud still makes sense
&lt;/h2&gt;

&lt;p&gt;Being honest: there are cases where Opus 4.7 is worth the money.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need the absolute frontier of capability and $25/M output tokens is pocket change for your use case&lt;/li&gt;
&lt;li&gt;You're doing something that requires Anthropic's specific safety features&lt;/li&gt;
&lt;li&gt;You need the model to handle tasks that are genuinely beyond what open models can do today&lt;/li&gt;
&lt;li&gt;You don't have a GPU (though even a laptop with 8GB VRAM works for Qwen3.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyone else — the gap between cloud and local is closing fast. A model that scores 73.4 on SWE-bench running on a gaming laptop would have been science fiction two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  the trajectory matters more than today's snapshot
&lt;/h2&gt;

&lt;p&gt;Every few months, a new open model drops that would have been frontier-class the year before. The pricing gap between cloud and local is structural: cloud will always cost per token, while local costs little beyond electricity once you own the hardware.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is impressive. But the question isn't whether it's good — it's whether it's $5/$25 per million tokens better than what you can run yourself.&lt;/p&gt;

&lt;p&gt;For a growing number of developers, the answer is no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. Qwen3.6 day-0 support. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>i cancelled my AI subscriptions. qwen3.6 on my own GPU does the same thing for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:56:16 +0000</pubDate>
      <link>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</link>
      <guid>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</guid>
      <description>&lt;p&gt;You're paying $20/month for ChatGPT. $10 for Copilot. Maybe another $20 for Midjourney. And every prompt you type goes through someone else's server.&lt;/p&gt;

&lt;p&gt;Meanwhile, Alibaba just open-sourced a model that scores 73.4 on SWE-bench Verified — the benchmark where an AI autonomously reads a GitHub issue, understands the codebase, writes a fix, and runs the tests. That's frontier-level coding ability. And it runs on your gaming laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  the model
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B. It's a Mixture-of-Experts model: 35 billion parameters total, but only 3 billion active per token. Each token is routed through 9 experts (8 routed + 1 shared) out of 256; the rest sit idle for that token.&lt;/p&gt;

&lt;p&gt;Result: it runs like a 3B model but thinks like a 30B+ model.&lt;/p&gt;

&lt;p&gt;Apache 2.0 license. No usage restrictions. No rate limits. No one reading your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  what your $0/month gets you
&lt;/h2&gt;

&lt;p&gt;Let's do the math on what you're replacing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT Plus ($20/month)&lt;/strong&gt; — Qwen3.6 scores 86.0 on GPQA Diamond (graduate-level reasoning), 83.6 on HMMT (Harvard-MIT Math Tournament), and handles 119 languages. It has vision built in — drag an image into the chat and ask questions about it. For most daily tasks, you won't notice a difference. For coding tasks, this model is arguably better than GPT-4 for the stuff you actually do (fixing bugs, writing functions, understanding codebases).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot ($10/month)&lt;/strong&gt; — 73.4 on SWE-bench means this model can autonomously fix real bugs in real repositories. 51.5 on Terminal-Bench means it can operate a terminal to solve coding tasks. With the right frontend, it functions as a full coding agent, not just autocomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud API costs&lt;/strong&gt; — no per-token pricing. Run it 24/7 on your own hardware. The model doesn't get slower during peak hours. It doesn't have outages. It doesn't change its behavior because the provider decided to add more safety filters.&lt;/p&gt;

&lt;h2&gt;
  
  
  the hardware you already own is enough
&lt;/h2&gt;

&lt;p&gt;This is the part that surprises people. With Q4_K_M quantization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 GB VRAM&lt;/strong&gt; (RTX 4060, RTX 3070): runs at 30+ tokens/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 GB+ VRAM&lt;/strong&gt; (RTX 4070, RTX 3090): Q8_0 quantization (~12-14 GB), 20+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon M1/M2/M3&lt;/strong&gt;: runs great on unified memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you bought a GPU in the last 3-4 years, you probably have enough. The MoE architecture is the key — your GPU only processes 3B parameters per token regardless of the total model size.&lt;/p&gt;

&lt;h2&gt;
  
  
  the catch (being honest)
&lt;/h2&gt;

&lt;p&gt;There are trade-offs. You should know them before you cancel anything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No real-time internet access&lt;/strong&gt; — the model only knows what it was trained on. No "search the web" or "check the latest docs." You need to paste context manually or use RAG.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setup isn't zero&lt;/strong&gt; — you need Ollama or a similar runtime, and a frontend. It's not "open a browser tab and start typing." More like 10-15 minutes to set up if you've never done it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long context costs more locally&lt;/strong&gt; — 262K native context is great on paper, but processing 100K+ tokens on consumer hardware gets slow. Cloud APIs hide this cost from you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No multimodal generation&lt;/strong&gt; — Qwen3.6 can understand images (vision input) but can't generate them. For image generation you need a separate model (Stable Diffusion, Flux, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Updates are manual&lt;/strong&gt; — when a better model drops, you download and switch yourself. No silent upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
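
&lt;p&gt;Point 1 in practice: "paste context manually" just means bundling the file into the prompt yourself. A sketch (the file name and contents are hypothetical):&lt;/p&gt;

```shell
# Create a throwaway example file (hypothetical name/content).
printf 'def add(a, b):\n    return a + b\n' > utils_example.py
# Prepend it to the question; this string is what you pipe to the model.
PROMPT="Explain this module:
$(cat utils_example.py)"
echo "$PROMPT"
# then: echo "$PROMPT" | ollama run qwen3.6:35b-a3b
```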

&lt;p&gt;For people who type "write me a poem" into ChatGPT twice a week, this is overkill. For developers, researchers, and anyone processing sensitive data — the trade-offs are overwhelmingly in favor of local.&lt;/p&gt;

&lt;h2&gt;
  
  
  the stack that replaces everything
&lt;/h2&gt;

&lt;p&gt;Here's what a complete local setup looks like in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat + reasoning&lt;/strong&gt;: Qwen3.6-35B-A3B (this article)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image generation&lt;/strong&gt;: Stable Diffusion 3.5, Flux, or SDXL via ComfyUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video generation&lt;/strong&gt;: Wan 2.1, FramePack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code completion&lt;/strong&gt;: same Qwen3.6, connected as a coding agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-text&lt;/strong&gt;: Whisper (runs on CPU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total cost after hardware you already own: $0/month. Forever.&lt;/p&gt;

&lt;p&gt;Or use a tool that bundles all of this. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; wraps Ollama + ComfyUI into one desktop app — chat, image gen, video gen, coding agent. v2.3.3 has Qwen3.6 day-0 support with vision and a full agent mode. AGPL-3.0, open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real question
&lt;/h2&gt;

&lt;p&gt;It's not "is local AI good enough yet?" — it passed that threshold months ago.&lt;/p&gt;

&lt;p&gt;The real question is: how much longer are you going to pay monthly fees to send your data to someone else's server when the same capability runs on hardware sitting under your desk?&lt;/p&gt;

&lt;p&gt;Qwen3.6 weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. No cloud, no subscription. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>qwen3.6 scores 73.4 on SWE-bench with only 3B active parameters. here's why that matters.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:43:39 +0000</pubDate>
      <link>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</link>
      <guid>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</guid>
      <description>&lt;p&gt;Alibaba just mass-released Qwen3.6 and the first model is already turning heads. Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35 billion total parameters — but only 3 billion are active at inference time.&lt;/p&gt;

&lt;p&gt;That means it runs on an 8GB GPU. And it just scored 73.4 on SWE-bench Verified.&lt;/p&gt;

&lt;p&gt;For context, Gemma4-31B — a dense model using all 31 billion parameters for every single token — scores 17.4 on the same benchmark. Qwen3.6 uses a tenth of the compute and scores four times higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture is genuinely different
&lt;/h2&gt;

&lt;p&gt;Most MoE models just slap a router on top of a standard transformer. Qwen3.6 does something more interesting.&lt;/p&gt;

&lt;p&gt;Three out of every four layers use &lt;strong&gt;Gated DeltaNet&lt;/strong&gt; — a linear attention mechanism that's significantly cheaper than standard attention. Only every fourth layer uses full Gated Attention with KV cache. This hybrid layout means you get near-full-attention quality at a fraction of the memory cost, especially on long contexts.&lt;/p&gt;

&lt;p&gt;The expert setup: 256 total experts, 8 routed + 1 shared active per token. That's where the 35B→3B compression comes from. Each token only touches the experts it needs.&lt;/p&gt;

&lt;p&gt;And it has &lt;strong&gt;vision built in&lt;/strong&gt;. Not bolted on — the model is natively multimodal (Image-Text-to-Text). MMMU score of 81.7, RealWorldQA at 85.3.&lt;/p&gt;

&lt;h2&gt;
  
  
  the benchmarks that matter
&lt;/h2&gt;

&lt;p&gt;I'm not going to dump every number. Here are the ones that actually tell you something:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench Verified: 73.4&lt;/strong&gt; — this is the "can you autonomously fix real GitHub issues" test. The model reads the issue, understands the codebase, writes a fix, and runs the tests. 73.4 means it successfully fixes nearly three out of four real-world bugs thrown at it. Its predecessor (Qwen3.5-35B-A3B) scored 70.0. Gemma4-31B scored 17.4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0: 51.5&lt;/strong&gt; — agentic terminal coding. Can the model operate a terminal to solve coding tasks? Qwen3.6 beats its predecessor (40.5), the dense Qwen3.5-27B (41.6), and Gemma4-31B (42.9). An 11-point jump over the previous version is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QwenWebBench: 1397 Elo&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. A 400+ Elo jump in one generation. For chess players: that's going from a club player to a titled player.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPQA Diamond: 86.0&lt;/strong&gt; — graduate-level science reasoning. This is the benchmark where PhD students in physics, chemistry, and biology try to answer questions outside their subfield and fail about half the time. 86.0 is competitive with models many times this size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCPMark: 37.0&lt;/strong&gt; — general agent benchmark testing MCP (Model Context Protocol) tool use. Predecessor scored 27.0. Gemma4-31B scored 36.3. This model was clearly trained with agentic tool calling in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  what 3B active parameters actually means for your hardware
&lt;/h2&gt;

&lt;p&gt;Here's the thing people keep getting wrong about MoE models. The total parameter count (35B) determines the model's knowledge capacity (how much it "knows") and its footprint on disk and in memory. The active parameter count (3B) determines how much compute each generated token costs, which is what sets inference speed.&lt;/p&gt;

&lt;p&gt;So while all 256 experts have to be resident somewhere (in VRAM, or partially offloaded to system RAM), each token's forward pass only runs through its 9 selected experts. The rest sit in memory untouched until the router picks them.&lt;/p&gt;
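
&lt;p&gt;A back-of-envelope way to see the speed difference, assuming the common rule of thumb of roughly 2 FLOPs per active parameter per generated token:&lt;/p&gt;

```shell
# ~2 FLOPs per active parameter per token (rough rule of thumb).
flops=$(awk 'BEGIN {
  printf "MoE (3B active): ~%.0f GFLOPs/token; dense 31B: ~%.0f GFLOPs/token",
         2*3e9/1e9, 2*31e9/1e9
}')
echo "$flops"
```

Roughly a 10x gap in per-token compute, which is why the MoE's decode speed tracks a 3B dense model.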

&lt;p&gt;Practical VRAM requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M quantized: ~6-8 GB&lt;/strong&gt; — runs on an RTX 3060 12GB at 30+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0 quantized: ~12-14 GB&lt;/strong&gt; — RTX 4070 territory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP8 official: ~35 GB&lt;/strong&gt; — an A6000 (48 GB) or dual consumer GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 full: ~70 GB&lt;/strong&gt; — multi-GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can run a 7B model, you can run this. The speed profile is similar to a 3B dense model, but the output quality is closer to a 30B+ dense model.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real competition
&lt;/h2&gt;

&lt;p&gt;The real competition for Qwen3.6 isn't Gemma4-31B. It's the proprietary models.&lt;/p&gt;

&lt;p&gt;73.4 on SWE-bench Verified puts it in the same ballpark as frontier closed-source models — except this one is Apache 2.0 licensed, runs on consumer hardware, and never sends your code to anyone's server.&lt;/p&gt;

&lt;p&gt;For coding specifically, the combination of high SWE-bench scores + strong terminal/agent capabilities + MCP support makes this arguably the best local coding model per compute dollar right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  how to actually run it
&lt;/h2&gt;

&lt;p&gt;The model just dropped, so GGUF quantizations are still rolling out. Check HuggingFace for the latest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FP8 variant: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B-FP8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once GGUFs land, &lt;code&gt;ollama run qwen3.6:35b-a3b&lt;/code&gt; should work.&lt;/p&gt;

&lt;p&gt;For a full desktop setup with model management, vision support, and a built-in coding agent, &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with day-0 Qwen3.6 support. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  the bottom line
&lt;/h2&gt;

&lt;p&gt;3B active parameters scoring 73.4 on SWE-bench is the kind of efficiency gain that changes what's possible on consumer hardware. A year ago you needed a 70B+ dense model or API access for this level of coding capability. Now it runs on a gaming laptop.&lt;/p&gt;

&lt;p&gt;Apache 2.0. No strings attached.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally — chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:20:08 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</guid>
      <description>&lt;p&gt;Qwen just released &lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.&lt;/p&gt;

&lt;p&gt;Translation: big-model quality at small-model speed. And this time it has vision built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this model matters
&lt;/h2&gt;

&lt;p&gt;The numbers speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1397 Elo on QwenWebBench&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;86.0 on GPQA Diamond&lt;/strong&gt; — graduate-level science reasoning. Competitive with models many times its size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision support&lt;/strong&gt; — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full benchmark picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;th&gt;Qwen3.5-35B-A3B&lt;/th&gt;
&lt;th&gt;Gemma4-31B&lt;/th&gt;
&lt;th&gt;Qwen3.5-27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;17.4&lt;/td&gt;
&lt;td&gt;51.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.5&lt;/td&gt;
&lt;td&gt;42.9&lt;/td&gt;
&lt;td&gt;41.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Multilingual&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.2&lt;/td&gt;
&lt;td&gt;69.3&lt;/td&gt;
&lt;td&gt;60.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QwenWebBench (Elo)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1397&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;978&lt;/td&gt;
&lt;td&gt;1178&lt;/td&gt;
&lt;td&gt;1197&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;27.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCPMark&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;36.3&lt;/td&gt;
&lt;td&gt;15.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.2&lt;/td&gt;
&lt;td&gt;84.3&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU&lt;/td&gt;
&lt;td&gt;81.7&lt;/td&gt;
&lt;td&gt;81.4&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gated DeltaNet attention&lt;/strong&gt; — 3 out of every 4 layers use linear attention (Gated DeltaNet) instead of standard attention. Only every 4th layer uses full Gated Attention. This makes it much more memory-efficient for long contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;256 experts, 9 active&lt;/strong&gt; — 8 routed + 1 shared expert active per token. That's where the "35B total, 3B active" comes from. Most of the model sits idle while only the relevant experts fire.&lt;/p&gt;
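
&lt;p&gt;A toy sketch of what that per-token routing looks like (illustrative NumPy only; the hidden size is made up and this is not Qwen's actual routing code):&lt;/p&gt;

```python
import numpy as np

def route_tokens(hidden, gate_w, top_k=8):
    """Toy MoE router: score all 256 experts, keep the top_k per token.
    The shared expert (the 9th active one) is always on, so it needs
    no routing decision."""
    logits = hidden @ gate_w                               # (tokens, 256)
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # top-8 expert ids
    picked = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(picked - picked.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # mixing weights
    return top, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 64))    # 4 tokens, toy hidden size 64
gate_w = rng.normal(size=(64, 256))  # router projection to 256 experts
experts, weights = route_tokens(hidden, gate_w)
print(experts.shape, weights.shape)  # each token fires 8 routed experts
```

&lt;p&gt;Each token's output is the weighted sum of its 8 routed experts plus the shared expert; the other 248 experts never run, which is why the compute cost tracks 3B parameters rather than 35B.&lt;/p&gt;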

&lt;p&gt;&lt;strong&gt;Vision encoder built in&lt;/strong&gt; — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking Preservation&lt;/strong&gt; — new feature that retains reasoning context from previous messages. Less overhead for iterative coding sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;262K native context&lt;/strong&gt; — extensible beyond that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0 license&lt;/strong&gt; — fully open, commercial use allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware requirements
&lt;/h2&gt;

&lt;p&gt;The beauty of MoE: generation speed is governed by the 3B active parameters, not the 35B total. The full quantized weights still have to fit somewhere, but runtimes can park idle experts in system RAM and keep only the hot path on the GPU.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Expected speed / hardware&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M quant&lt;/td&gt;
&lt;td&gt;~6-8 GB&lt;/td&gt;
&lt;td&gt;30+ tok/s on RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0 quant&lt;/td&gt;
&lt;td&gt;~12-14 GB&lt;/td&gt;
&lt;td&gt;20+ tok/s on RTX 4070&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8 (official)&lt;/td&gt;
&lt;td&gt;~35 GB&lt;/td&gt;
&lt;td&gt;RTX 4090 or A6000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 full&lt;/td&gt;
&lt;td&gt;~70 GB&lt;/td&gt;
&lt;td&gt;Multi-GPU setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you can run a 7B dense model, you can run this at comparable speed. The 3B active parameter count is what governs token generation; just make sure the full quantized weights fit in combined RAM and VRAM.&lt;/p&gt;
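
&lt;p&gt;Quick sanity math on those numbers (back-of-envelope; the bits-per-weight values are typical for each quant, and runtime overhead like KV cache is ignored):&lt;/p&gt;

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size: parameters x bits / 8 bits-per-byte."""
    return params_billion * bits_per_weight / 8

TOTAL, ACTIVE = 35, 3  # billions of parameters
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP8", 8.0), ("FP16", 16.0)]:
    print(f"{name}: full weights ~{weight_gb(TOTAL, bits):.0f} GB, "
          f"active slice ~{weight_gb(ACTIVE, bits):.1f} GB")
```

&lt;p&gt;The FP16 total lands on the ~70 GB from the table. The much smaller VRAM figures for the quantized rows assume the runtime keeps idle experts in system RAM and only the active slice (plus attention and KV cache) on the GPU.&lt;/p&gt;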

&lt;h2&gt;
  
  
  How to run it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Ollama (easiest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that tag isn't live yet, wait for the GGUFs to appear; they usually land within hours of release. Check &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; for the latest quantized versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm
vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
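
&lt;p&gt;Once the server is up, it speaks the OpenAI-compatible chat API on port 8000 by default. A minimal stdlib client sketch; the request is built but the send is left commented so it runs even without the server:&lt;/p&gt;

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reverse a string in Python."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",   # vLLM's default endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send and print the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```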



&lt;h3&gt;
  
  
  Option 3: Transformers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 4: Locally Uncensored (full GUI + model management)
&lt;/h3&gt;

&lt;p&gt;If you want a clean desktop app that handles downloading, model management, and chatting in one place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grab &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — it's open source (AGPL-3.0)&lt;/li&gt;
&lt;li&gt;v2.3.3 just shipped with day-0 Qwen3.6 support&lt;/li&gt;
&lt;li&gt;Download the model directly from the app, pick your quantization, and start chatting&lt;/li&gt;
&lt;li&gt;Vision works out of the box — drag and drop images into the chat&lt;/li&gt;
&lt;li&gt;The new Codex mode with live streaming is particularly nice for coding tasks with this model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should care
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local AI coders&lt;/strong&gt; — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-focused devs&lt;/strong&gt; — Apache 2.0, runs on consumer hardware, no data leaves your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal users&lt;/strong&gt; — built-in vision means one model for text AND image understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone running Qwen3.5-35B-A3B&lt;/strong&gt; — this is a straight upgrade. Same architecture class, better everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.&lt;/p&gt;

&lt;p&gt;Weights are on &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. FP8 variant &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;here&lt;/a&gt;. GGUFs incoming.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Run GLM 4.7 Flash Locally with Ollama — 30B Quality at 3B Speed</title>
      <dc:creator>David </dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:28:13 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</link>
      <guid>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</guid>
      <description>&lt;p&gt;ZhipuAI quietly dropped GLM 4.7 Flash and it's been blowing up — 830K+ downloads on HuggingFace, 1,600+ likes. The pitch: 30B-parameter MoE model with only 3B active parameters per token. Translation: you get 30B-class quality at the speed and VRAM cost of a 3B model.&lt;/p&gt;

&lt;p&gt;The benchmarks back it up. AIME 25: 91.6% (on par with GPT-class models). SWE-bench Verified: 59.2% (nearly 3x Qwen3-30B-A3B). And it's MIT licensed — commercial use, fine-tuning, whatever you want.&lt;/p&gt;

&lt;p&gt;I've been building a local AI desktop app (&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;) and just added GLM 4.7 support. Here's how to run it locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install GLM 4.7 Flash with Ollama
&lt;/h2&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Ollama handles the download and quantization. The default is Q4_K_M, which gives you the best quality-to-size ratio.&lt;/p&gt;

&lt;p&gt;If you want a specific quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7:q4_k_m    &lt;span class="c"&gt;# ~5 GB, recommended&lt;/span&gt;
ollama run glm4.7:q8_0      &lt;span class="c"&gt;# ~10 GB, higher quality&lt;/span&gt;
ollama run glm4.7:q2_k      &lt;span class="c"&gt;# ~3 GB, if VRAM is tight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why GLM 4.7 Flash Matters
&lt;/h2&gt;

&lt;p&gt;The MoE (Mixture of Experts) architecture is the key. The model has 30B total parameters but only activates 3B per token. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Token generation is fast — comparable to running a 3B dense model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt;: Only needs 6-8 GB for Q4 quantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: Reasoning and coding performance matches models 10x its active size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how it compares:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GLM 4.7 Flash (30B-A3B)&lt;/th&gt;
&lt;th&gt;Qwen3-30B-A3B&lt;/th&gt;
&lt;th&gt;GPT-OSS-20B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AIME 25&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;91.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA&lt;/td&gt;
&lt;td&gt;75.2&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;71.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;59.2&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;34.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;τ²-Bench (agentic)&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;49.0&lt;/td&gt;
&lt;td&gt;47.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowseComp&lt;/td&gt;
&lt;td&gt;42.8&lt;/td&gt;
&lt;td&gt;2.29&lt;/td&gt;
&lt;td&gt;28.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agentic benchmarks are insane. τ²-Bench at 79.5 vs Qwen3's 49.0 — that's not a marginal improvement, that's a different league. This model was built for tool calling and multi-step reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  VRAM Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q2_K&lt;/strong&gt;: ~3-4 GB VRAM (or CPU-only with 8 GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt;: 6-8 GB VRAM — the sweet spot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0&lt;/strong&gt;: 10-12 GB VRAM — if you have the room&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16&lt;/strong&gt;: 20+ GB — only for research/fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a GTX 1660 (6 GB) or better, Q4_K_M runs comfortably. On Apple Silicon with 16 GB unified memory, it flies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Mode with GLM 4.7
&lt;/h2&gt;

&lt;p&gt;This is where GLM 4.7 really shines. The model was specifically optimized for agentic tasks — it has a "Preserved Thinking" mode that keeps chain-of-thought reasoning active across multi-turn tool interactions.&lt;/p&gt;

&lt;p&gt;In practice: you give it a tool (web search, file read, code execution) and it actually uses it intelligently. The 59.2% SWE-bench score means it can navigate real codebases, understand context, and produce working patches — not just toy completions.&lt;/p&gt;

&lt;p&gt;In Locally Uncensored, GLM 4.7 is auto-detected as an agent-capable model. Enable Agent mode in the UI and it gets access to web search, file operations, and code execution out of the box.&lt;/p&gt;
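
&lt;p&gt;Under the hood, agent mode is a loop: ask the model, run whatever tool it requests, feed the result back, repeat until it answers. A stripped-down sketch of that loop (the tool set and message shapes here are illustrative, not LU's real internals; the model call is stubbed so the demo runs offline):&lt;/p&gt;

```python
import json
import os
import tempfile

TOOLS = {"read_file": lambda path: open(path).read()}  # real apps expose many more

def run_agent(messages, call_model, max_iters=20):
    """Loop: query the model, execute any requested tool, feed the
    result back as a tool message, stop when it answers in plain text."""
    for _ in range(max_iters):
        reply = call_model(messages)
        if "tool_call" not in reply:          # plain answer, we're done
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "max iterations reached"

# Demo with a scripted stand-in for the model:
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("hello from disk")
tmp.close()
script = iter([
    {"tool_call": {"name": "read_file", "arguments": {"path": tmp.name}}},
    {"content": "The file says: hello from disk"},
])
answer = run_agent([{"role": "user", "content": "What does the file say?"}],
                   lambda msgs: next(script))
print(answer)
os.unlink(tmp.name)
```

&lt;p&gt;The 20-iteration cap is the same idea LU uses: the agent gets a bounded number of tool round-trips per task so a confused model can't loop forever.&lt;/p&gt;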

&lt;h2&gt;
  
  
  GLM 4.7 vs the Competition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;vs Qwen3-30B-A3B&lt;/strong&gt;: Same architecture class (30B MoE, 3B active), but GLM 4.7 dominates on agentic and coding tasks and edges it out on math (91.6 vs 85.0 on AIME 25).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Gemma 4 E4B&lt;/strong&gt;: Gemma 4 is smaller (4.5B effective) and faster, but GLM 4.7 has significantly better reasoning depth. If you need an agent that can handle complex multi-step tasks, GLM 4.7 wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Llama 3.3 70B&lt;/strong&gt;: Llama needs 3-4x the VRAM for similar coding performance. GLM 4.7 is the efficiency play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the Catch?
&lt;/h2&gt;

&lt;p&gt;Honestly, not much. The usual concerns all check out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chinese-English bilingual&lt;/strong&gt; — Trained on both, works great in both. If you only need English, it's still excellent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt; — Supports up to 128K tokens. More than enough for most use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT license&lt;/strong&gt; — Fully open. No restrictions on commercial use, modification, or redistribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main caveat: if you want vision/multimodal, GLM 4.7 Flash is text-only. Look at GLM-4V or Gemma 4 for image input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you want a full desktop UI with agent mode, image gen, and A/B model comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — free, open source, AGPL-3.0. Single .exe/.AppImage, no Docker needed. GLM 4.7 is in the recommended models list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running GLM 4.7 on your hardware? I'd love to hear your tok/s numbers and use case. Drop a comment.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 licensed. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>ollama</category>
      <category>opensource</category>
    </item>
    <item>
      <title>we shipped a ComfyUI fix 12 hours after the bug report. here's what v2.3.1 changes.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Sat, 11 Apr 2026 12:29:24 +0000</pubDate>
      <link>https://dev.to/purpledoubled/we-shipped-a-comfyui-fix-12-hours-after-the-bug-report-heres-what-v231-changes-1d6j</link>
      <guid>https://dev.to/purpledoubled/we-shipped-a-comfyui-fix-12-hours-after-the-bug-report-heres-what-v231-changes-1d6j</guid>
      <description>&lt;p&gt;v2.3.1 just dropped — small version number, big quality-of-life improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  in-app Ollama install
&lt;/h2&gt;

&lt;p&gt;no more "go to ollama.com, download, install, restart". click &lt;strong&gt;Install Ollama&lt;/strong&gt; in the onboarding wizard, watch the progress bar (with real download speed and elapsed time), done. silent install, auto-start, auto-detect. zero manual steps.&lt;/p&gt;

&lt;p&gt;this is what "plug &amp;amp; play" should feel like.&lt;/p&gt;

&lt;h2&gt;
  
  
  configurable ComfyUI port &amp;amp; path
&lt;/h2&gt;

&lt;p&gt;this one's a direct response to user feedback. ComfyUI recently moved from a browser-based interface to a desktop app — which uses a different port than the old default 8188.&lt;/p&gt;

&lt;p&gt;result: LU couldn't find ComfyUI even though it was running. multiple users reported this in our GitHub Discussions within hours of v2.3.0.&lt;/p&gt;

&lt;p&gt;v2.3.1 fix: the ComfyUI port is now fully configurable in &lt;strong&gt;Settings &amp;gt; ComfyUI (Image &amp;amp; Video)&lt;/strong&gt;. same for the path — editable with a Connect button. no need to go through onboarding again.&lt;/p&gt;

&lt;p&gt;previously the port was hardcoded in 20+ places across the codebase. now it's a single config value.&lt;/p&gt;
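
&lt;p&gt;the shape of that fix, sketched in python rather than the actual rust (names are made up): one settings object owns the port, and every call site derives the URL from it.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class ComfyUISettings:
    """single source of truth for how to reach ComfyUI."""
    host: str = "127.0.0.1"
    port: int = 8188   # old browser default; the Desktop app picks its own

    @property
    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"

# the field the user edits in Settings updates one value,
# and everything else reads base_url instead of a hardcoded port
cfg = ComfyUISettings(port=8000)
print(cfg.base_url)
```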

&lt;h2&gt;
  
  
  ComfyUI install progress (actually works now)
&lt;/h2&gt;

&lt;p&gt;the one-click ComfyUI install now shows real step-by-step progress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1/3: Clone repository&lt;/li&gt;
&lt;li&gt;Step 2/3: Install PyTorch&lt;/li&gt;
&lt;li&gt;Step 3/3: Install dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with an elapsed timer. before this, the install screen got stuck at "Starting..." forever because the Rust backend thread never reported completion back to the frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  provider status that actually means something
&lt;/h2&gt;

&lt;p&gt;the connection dots next to providers in Settings now show actual status:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🟢 green = connected and responding&lt;/li&gt;
&lt;li&gt;🔴 red = connection failed&lt;/li&gt;
&lt;li&gt;⚪ gray = not checked yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;before this they were always green. which is worse than showing nothing at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  the bigger picture
&lt;/h2&gt;

&lt;p&gt;v2.3.0 (yesterday) was the big feature release — ComfyUI plug &amp;amp; play, 20 model bundles, image-to-image, image-to-video on 6 GB VRAM.&lt;/p&gt;

&lt;p&gt;v2.3.1 is the "we actually listen to our users" release. someone reported ComfyUI Desktop not connecting at 11pm, the fix shipped by noon next day.&lt;/p&gt;

&lt;p&gt;if you tried v2.3.0 and ComfyUI didn't work — update and check Settings. the port config should fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;PurpleDoubleD/locally-uncensored&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Download:&lt;/strong&gt; &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases/tag/v2.3.1" rel="noopener noreferrer"&gt;v2.3.1 Release&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>comfyui</category>
      <category>desktop</category>
    </item>
    <item>
      <title>France is ditching Windows for Linux. your AI should be next.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 10 Apr 2026 16:54:35 +0000</pubDate>
      <link>https://dev.to/purpledoubled/france-is-ditching-windows-for-linux-your-ai-should-be-next-4le9</link>
      <guid>https://dev.to/purpledoubled/france-is-ditching-windows-for-linux-your-ai-should-be-next-4le9</guid>
      <description>&lt;p&gt;France just announced they're migrating government systems from Windows to Linux to reduce dependence on US tech companies. This is a sovereign nation saying "we can't trust our infrastructure to foreign corporations."&lt;/p&gt;

&lt;p&gt;Now apply the same logic to AI.&lt;/p&gt;

&lt;p&gt;Every prompt you type into ChatGPT routes through OpenAI's servers in San Francisco. Every image you generate with Midjourney lives on their infrastructure. Every code snippet Copilot sees passes through Microsoft's cloud.&lt;/p&gt;

&lt;p&gt;You're outsourcing your thinking to US corporations. And unlike Windows, where France at least had source-code access for audits, AI services are complete black boxes. You don't know what happens to your data, how long it's stored, who can access it, or whether it's used to train the next model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local Alternative Exists Now
&lt;/h2&gt;

&lt;p&gt;Two years ago, "run AI locally" meant a janky Python script talking to a quantized 7B model that could barely hold a conversation. That era is over.&lt;/p&gt;

&lt;p&gt;Today's local stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 3.5 35B&lt;/strong&gt; — matches GPT-4 on reasoning, 256K context, runs on 16 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 27B&lt;/strong&gt; — Google's latest with native vision and tool calling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GLM 5.1&lt;/strong&gt; — 754B MoE, MIT license, leads SWE-Bench Pro. Released this week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLUX 2 Klein&lt;/strong&gt; — text-to-image that rivals Midjourney, runs locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FramePack F1&lt;/strong&gt; — image-to-video on 6 GB VRAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All open-weight. All running on consumer hardware. No API keys, no subscriptions, no data leaving your network.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Piece Was Always the Frontend
&lt;/h2&gt;

&lt;p&gt;The models exist. The backends exist (Ollama, vLLM, llama.cpp). What was missing was a unified frontend that doesn't require a PhD in YAML configuration to set up.&lt;/p&gt;

&lt;p&gt;That's what I've been building. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; auto-detects 12 local backends, handles ComfyUI installation and model downloads with one click, and bundles chat, coding agent, and image/video generation into a single desktop app.&lt;/p&gt;

&lt;p&gt;No Docker. No terminal commands. Install the .exe, launch it, the setup wizard handles everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Digital Sovereignty Isn't Just for Governments
&lt;/h2&gt;

&lt;p&gt;France is making this move at the national level because they understand the risk. But individuals and companies face the same dependency problem.&lt;/p&gt;

&lt;p&gt;When OpenAI changes their pricing — and they will — you're locked in. When Anthropic gets blacklisted by your government — it happened this week — your workflows break. When Midjourney decides your prompt violates their content policy — no appeal, no alternative, your subscription just becomes less useful.&lt;/p&gt;

&lt;p&gt;Local AI has no terms of service. No content policy. No pricing changes. No government blacklists. The model runs on your hardware, and the only person who can restrict what it does is you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh
ollama pull gemma4:27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes. Then your AI is yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;PurpleDoubleD/locally-uncensored&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License&lt;/strong&gt;: AGPL-3.0&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>linux</category>
      <category>privacy</category>
    </item>
    <item>
      <title>i cancelled ChatGPT, Midjourney, and Copilot. here's the $0/month stack that replaced all three.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 10 Apr 2026 16:53:27 +0000</pubDate>
      <link>https://dev.to/purpledoubled/i-cancelled-chatgpt-midjourney-and-copilot-heres-the-0month-stack-that-replaced-all-three-30hn</link>
      <guid>https://dev.to/purpledoubled/i-cancelled-chatgpt-midjourney-and-copilot-heres-the-0month-stack-that-replaced-all-three-30hn</guid>
      <description>&lt;p&gt;ChatGPT Plus: $20/month. Midjourney: $10/month. GitHub Copilot: $10/month.&lt;/p&gt;

&lt;p&gt;That's $480/year for AI tools that send every keystroke to someone else's server. I cancelled all three and replaced them with a local stack that costs nothing after the initial setup.&lt;/p&gt;

&lt;p&gt;Here's exactly what I'm running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat: Ollama + Qwen 3.5 (replaces ChatGPT)
&lt;/h2&gt;

&lt;p&gt;Qwen 3.5 9B runs on 8 GB VRAM. Qwen 3.5 35B (MoE) runs on 16 GB with 256K context — longer than ChatGPT's 128K. It handles reasoning, analysis, writing, and code generation. For math and logic, it matches GPT-4o on most benchmarks.&lt;/p&gt;

&lt;p&gt;Gemma 4 27B is the alternative if you want native vision (describe images, analyze screenshots) without an API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.5:9b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Running. Forever. No subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Images: ComfyUI + FLUX (replaces Midjourney)
&lt;/h2&gt;

&lt;p&gt;FLUX.1 Dev generates images that compete with Midjourney v6. FLUX 2 Klein is the newer, faster variant. Z-Image does uncensored generation — no "I can't generate that" refusals.&lt;/p&gt;

&lt;p&gt;The catch with ComfyUI has always been setup complexity. Model files in the wrong folder, custom nodes breaking, workflow JSONs that don't load.&lt;/p&gt;

&lt;p&gt;I built a wrapper that handles all of it. ComfyUI auto-detection, one-click install if it's missing, model bundles with one-click download, and a Dynamic Workflow Builder that constructs the correct pipeline based on what you have installed. You never touch a workflow JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Agent: MCP Tools (replaces Copilot)
&lt;/h2&gt;

&lt;p&gt;Not autocomplete — a full coding agent. It reads your project files, writes code, runs shell commands, executes tests, and iterates on errors. 13 MCP tools: file I/O, shell execution, web search, code execution, screenshots.&lt;/p&gt;

&lt;p&gt;The difference from Copilot: it doesn't just suggest the next line. You say "add input validation to this form and write tests" and it reads the code, writes the validation, creates the test file, runs the tests, and fixes failures. Up to 20 tool iterations per task.&lt;/p&gt;

&lt;p&gt;Works with any model. Native tool calling for Qwen, Gemma, Llama. XML fallback for everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The App That Ties It Together
&lt;/h2&gt;

&lt;p&gt;All of this runs through &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — one desktop app, one window, swap between chat, code agent, and image/video generation.&lt;/p&gt;

&lt;p&gt;It auto-detects 12 local backends (Ollama, LM Studio, vLLM, KoboldCpp, etc.) and has 20+ cloud provider presets if you occasionally need a frontier model. A/B model comparison lets you test two models side by side with the same prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not Electron.&lt;/strong&gt; Tauri v2 with a Rust backend. The app itself uses ~80 MB of RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Lost
&lt;/h2&gt;

&lt;p&gt;Honestly? Two things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4's creative writing&lt;/strong&gt; is still noticeably better than local models for fiction and marketing copy. For technical writing, code, and analysis — local models are equal or better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Midjourney's aesthetic consistency&lt;/strong&gt; across different prompts is hard to match locally. FLUX is technically more capable, but Midjourney has a "house style" that's effortlessly good. Locally, you need to dial in your prompts more carefully.&lt;/p&gt;

&lt;p&gt;Everything else — speed, privacy, availability, cost — is better locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monthly Cost Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chat AI&lt;/td&gt;
&lt;td&gt;$20/mo (ChatGPT Plus)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Gen&lt;/td&gt;
&lt;td&gt;$10/mo (Midjourney)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Agent&lt;/td&gt;
&lt;td&gt;$10/mo (Copilot)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$40/mo ($480/yr)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hardware requirement: a GPU with 8+ GB VRAM. If you have a gaming PC from the last 5 years, you probably already have one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;PurpleDoubleD/locally-uncensored&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License&lt;/strong&gt;: AGPL-3.0 — fully open source, no telemetry, no cloud dependency.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>the FBI can read your ChatGPT history. they can't read mine.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 10 Apr 2026 16:53:26 +0000</pubDate>
      <link>https://dev.to/purpledoubled/the-fbi-can-read-your-chatgpt-history-they-cant-read-mine-21ed</link>
      <guid>https://dev.to/purpledoubled/the-fbi-can-read-your-chatgpt-history-they-cant-read-mine-21ed</guid>
      <description>&lt;p&gt;The FBI just retrieved deleted Signal messages through iPhone notification data. That was supposed to be the most private messenger available.&lt;/p&gt;

&lt;p&gt;Your ChatGPT conversations? Those sit on OpenAI's servers. Subpoena-ready. Every prompt you've written, every document you've pasted, every code snippet you've shared. OpenAI's privacy policy explicitly states they can share data with law enforcement.&lt;/p&gt;

&lt;p&gt;Same goes for Claude, Gemini, Copilot. Every cloud AI provider stores your conversations. Some for 30 days, some for "model improvement," some indefinitely. And when a court order arrives, they comply. They have to.&lt;/p&gt;

&lt;p&gt;I stopped using cloud AI for anything that matters six months ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Use
&lt;/h2&gt;

&lt;p&gt;Everything runs on my machine. No server, no API calls, no conversation logs on someone else's infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For text/chat&lt;/strong&gt;: Ollama running Qwen 3.5 or Gemma 4. The model weights live on my SSD. The conversation exists in RAM while I'm using it, and nowhere else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For code&lt;/strong&gt;: A local coding agent with MCP tools — file read/write, shell execution, web search. It reads my codebase directly from disk. No code leaves my network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For images&lt;/strong&gt;: ComfyUI with FLUX and Z-Image. Prompts and generated images stay on my hard drive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For video&lt;/strong&gt;: FramePack F1 — image-to-video on 6 GB VRAM.&lt;/p&gt;

&lt;p&gt;All of this runs through one app: &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;. It auto-detects whatever backend you have installed, or walks you through setup if you have nothing. One desktop app, everything local.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of "Free" Cloud AI
&lt;/h2&gt;

&lt;p&gt;People think ChatGPT Plus at $20/month is cheap. But the actual cost isn't money — it's that every conversation becomes training data, legal evidence, and a data breach waiting to happen.&lt;/p&gt;

&lt;p&gt;The Mistral breach just exposed internal data. OpenAI employees have warned about internal security practices. Anthropic got blacklisted by the US government. Disney just cancelled a billion-dollar OpenAI deal over trust concerns.&lt;/p&gt;

&lt;p&gt;These aren't hypothetical risks. They're this week's headlines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local Setup Is Easier Than You Think
&lt;/h2&gt;

&lt;p&gt;Five years ago, running AI locally meant compiling CUDA kernels and hand-editing config files. Today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Ollama (one command)&lt;/li&gt;
&lt;li&gt;Pull a model: &lt;code&gt;ollama pull qwen3.5:9b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Open Locally Uncensored, which auto-detects Ollama&lt;/li&gt;
&lt;li&gt;Chat, code, generate — everything stays on your machine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: under 10 minutes. Total recurring cost: $0. Total data sent to the cloud: zero bytes.&lt;/p&gt;
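&lt;p&gt;The chat step above can also be scripted. Ollama exposes a local HTTP API on port 11434; here is a standard-library-only sketch (the model tag is whatever you pulled, &lt;code&gt;qwen3.5:9b&lt;/code&gt; in this example). Nothing here talks to any host but localhost.&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    # stream=False makes Ollama return one JSON object
    # instead of a stream of JSONL chunks
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask("qwen3.5:9b", "Explain RSS in one sentence."))
```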

&lt;h2&gt;
  
  
  No, Local Models Aren't "Worse"
&lt;/h2&gt;

&lt;p&gt;Qwen 3.5 35B matches GPT-4 on most benchmarks. Gemma 4 27B has native vision and tool calling. GLM 5.1 (754B, MIT license, released this week) leads SWE-Bench Pro.&lt;/p&gt;

&lt;p&gt;The "cloud models are better" argument made sense in 2023. In 2026, it's marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;PurpleDoubleD/locally-uncensored&lt;/a&gt; — AGPL-3.0, Tauri v2, no telemetry.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>OpenAI just mass-unsubscribed paying users — the case for running AI locally</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:41:28 +0000</pubDate>
      <link>https://dev.to/purpledoubled/openai-just-mass-unsubscribed-paying-users-the-case-for-running-ai-locally-5ehp</link>
      <guid>https://dev.to/purpledoubled/openai-just-mass-unsubscribed-paying-users-the-case-for-running-ai-locally-5ehp</guid>
      <description>&lt;p&gt;Yesterday, thousands of ChatGPT Plus and Pro subscribers woke up to find their paid accounts downgraded to free tier — mid-work, mid-exam, mid-project. OpenAI confirmed it was a billing bug and said they're working on it, but offered no timeline and no mention of refunds.&lt;/p&gt;

&lt;p&gt;Around the same time, Sam Altman publicly admitted that the viral GPT-4o image generation feature "broke" their GPUs. Free-tier users saw image generation disappear entirely. Paid users experienced multi-minute generation times where it had been instant the day before.&lt;/p&gt;

&lt;p&gt;This isn't the first time. And it won't be the last.&lt;/p&gt;

&lt;h2&gt;
  
  
  what actually happened
&lt;/h2&gt;

&lt;p&gt;The 4o image generation update went massively viral — everyone was making Ghibli-style portraits, editing product photos, creating memes. The demand overwhelmed OpenAI's infrastructure to the point where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Billing systems broke&lt;/strong&gt;, mass-downgrading paying customers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU capacity was exhausted&lt;/strong&gt;, degrading service for everyone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free tier access was pulled&lt;/strong&gt;, with users calling it a "bait-and-switch"&lt;/li&gt;
&lt;li&gt;Multiple users reported &lt;strong&gt;losing 60-90 minutes of paid access&lt;/strong&gt; during work deadlines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://old.reddit.com/r/ChatGPT/comments/1jvuahj/" rel="noopener noreferrer"&gt;r/ChatGPT thread&lt;/a&gt; about the mass-unsubscription has 488+ comments. People are frustrated.&lt;/p&gt;

&lt;h2&gt;
  
  
  meanwhile, in the local AI world
&lt;/h2&gt;

&lt;p&gt;On the same day this was happening, the local LLM community was busy doing its own thing — running models on their own hardware with zero downtime. No billing bugs. No GPU shortages. No service degradation.&lt;/p&gt;

&lt;p&gt;Here's what the local AI landscape looks like right now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text/Chat models you can run locally:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Llama 3&lt;/strong&gt; — Meta's latest, runs great on consumer GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral&lt;/strong&gt; — just committed to always releasing open models alongside commercial ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phi-4&lt;/strong&gt; — Microsoft's new reasoning model (though &lt;a href="https://old.reddit.com/r/LocalLLaMA/comments/1jxbuil/" rel="noopener noreferrer"&gt;r/LocalLLaMA is roasting it&lt;/a&gt; for being over-censored — the irony of an open model with closed guardrails)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 2.5&lt;/strong&gt; — Alibaba's strong multilingual models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 3&lt;/strong&gt; — Google's compact open models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image generation locally:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stable Diffusion&lt;/strong&gt; / &lt;strong&gt;SDXL&lt;/strong&gt; / &lt;strong&gt;Flux&lt;/strong&gt; — the OG local image gen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ComfyUI&lt;/strong&gt; — node-based workflow editor, incredibly flexible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fooocus&lt;/strong&gt; — simple Midjourney-like interface, runs on 6GB VRAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools to run them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/a&gt; — dead simple CLI for running LLMs locally&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;LM Studio&lt;/strong&gt;&lt;/a&gt; — nice GUI, one-click model downloads&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-webui/open-webui" rel="noopener noreferrer"&gt;&lt;strong&gt;Open WebUI&lt;/strong&gt;&lt;/a&gt; — ChatGPT-like web interface for local models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Locally Uncensored&lt;/strong&gt;&lt;/a&gt; — all-in-one app bundling chat, image gen, and video gen with no Docker needed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://jan.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Jan&lt;/strong&gt;&lt;/a&gt; — offline-first desktop app&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  the actual tradeoff
&lt;/h2&gt;

&lt;p&gt;Let's be honest: local models aren't as good as GPT-4o at everything. The frontier cloud models still lead in complex reasoning and multimodal tasks.&lt;/p&gt;

&lt;p&gt;But here's what you get by running locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;100% uptime&lt;/strong&gt; — your GPU doesn't have billing bugs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No subscription fees&lt;/strong&gt; — one-time hardware cost, then it's free forever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete privacy&lt;/strong&gt; — your data never leaves your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No censorship surprises&lt;/strong&gt; — you control the guardrails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No "bait-and-switch"&lt;/strong&gt; — features don't disappear because a company overprovisioned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most daily tasks — writing, brainstorming, code assistance, image generation, summarization — a 7B or 14B parameter model running locally is more than enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real question
&lt;/h2&gt;

&lt;p&gt;Every time a cloud AI service goes down, the same conversation happens: "maybe I should run this stuff locally." Then the service comes back up and everyone forgets.&lt;/p&gt;

&lt;p&gt;But the gap is closing fast. Local models are getting better every month. The tooling is getting simpler. A decent GPU from 2-3 years ago can run surprisingly capable models.&lt;/p&gt;

&lt;p&gt;Maybe this time it's worth actually trying it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to get started, &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is probably the easiest entry point. Install it, run &lt;code&gt;ollama run llama3&lt;/code&gt;, and you've got a local chatbot in 30 seconds. For image generation, &lt;a href="https://github.com/comfyanonymous/ComfyUI" rel="noopener noreferrer"&gt;ComfyUI&lt;/a&gt; or &lt;a href="https://github.com/lllyasviel/Fooocus" rel="noopener noreferrer"&gt;Fooocus&lt;/a&gt; are solid choices. And if you want everything in one package, &lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; bundles chat + image gen + video gen with a simple installer.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
    <item>
      <title>i generated AI video on a GTX 1660. here's what it actually takes.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:17:54 +0000</pubDate>
      <link>https://dev.to/purpledoubled/i-generated-ai-video-on-a-gtx-1660-heres-what-it-actually-takes-2i6l</link>
      <guid>https://dev.to/purpledoubled/i-generated-ai-video-on-a-gtx-1660-heres-what-it-actually-takes-2i6l</guid>
      <description>&lt;p&gt;A few weeks ago, generating video from a single image required either a cloud API with per-second billing, or a GPU with 24+ GB VRAM. FramePack changed that.&lt;/p&gt;

&lt;p&gt;FramePack F1 generates video from a single image on &lt;strong&gt;6 GB VRAM&lt;/strong&gt;. That's a GTX 1660, an RTX 3060, or basically any GPU sold in the last 5 years. I've been running it locally and the results are genuinely usable — not "proof of concept" usable, but "I'd actually put this in a project" usable.&lt;/p&gt;

&lt;p&gt;Here's what's actually involved, because "runs on 6 GB VRAM" doesn't tell the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You're Actually Downloading
&lt;/h2&gt;

&lt;p&gt;FramePack isn't one file. It's a pipeline with five components, and you need all of them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FramePack F1 I2V Model (FP8)&lt;/td&gt;
&lt;td&gt;13 GB&lt;/td&gt;
&lt;td&gt;The core diffusion model — generates video frames&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaVA LLaMA3 Text Encoder (FP8)&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;Understands your text prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HunyuanVideo VAE&lt;/td&gt;
&lt;td&gt;2.3 GB&lt;/td&gt;
&lt;td&gt;Encodes your input image to latent space, decodes generated frames back to pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SigCLIP Vision Encoder&lt;/td&gt;
&lt;td&gt;900 MB&lt;/td&gt;
&lt;td&gt;Understands the content of your input image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLIP-L Text Encoder&lt;/td&gt;
&lt;td&gt;240 MB&lt;/td&gt;
&lt;td&gt;Additional text understanding (shared with HunyuanVideo)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total download: ~25 GB.&lt;/strong&gt; Plus you need ComfyUI installed and the &lt;a href="https://github.com/kijai/ComfyUI-FramePackWrapper" rel="noopener noreferrer"&gt;ComfyUI-FramePackWrapper&lt;/a&gt; custom nodes from Kijai.&lt;/p&gt;

&lt;p&gt;So yeah — the model fits in 6 GB VRAM, but your hard drive needs 25 GB and the initial download takes a while.&lt;/p&gt;
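&lt;p&gt;The "~25 GB" figure is just the sum of the table above, which is easy to sanity-check:&lt;/p&gt;

```python
# Component sizes from the table above, in GB
components = {
    "FramePack F1 I2V model (FP8)": 13.0,
    "LLaVA LLaMA3 text encoder (FP8)": 8.5,
    "HunyuanVideo VAE": 2.3,
    "SigCLIP vision encoder": 0.9,
    "CLIP-L text encoder": 0.24,
}

total_gb = sum(components.values())
print(f"Total download: ~{total_gb:.1f} GB")  # ~24.9 GB
```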

&lt;h2&gt;
  
  
  Why 6 GB VRAM Works
&lt;/h2&gt;

&lt;p&gt;Most video generation models load everything into VRAM at once. A 14B parameter model at FP16 needs ~28 GB just for the weights. That's why Wan 2.1 14B needs a 3090 or better.&lt;/p&gt;
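&lt;p&gt;The weight math behind that "~28 GB" is simple: parameter count times bytes per parameter. A rough back-of-the-envelope helper:&lt;/p&gt;

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weights only; activations, latents, and encoder models add more on top.
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * bytes_per_param

print(weight_memory_gb(14, 2))  # FP16 (2 bytes/param), 14B params: 28.0 GB
print(weight_memory_gb(14, 1))  # FP8  (1 byte/param),  14B params: 14.0 GB
```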

&lt;p&gt;FramePack uses &lt;strong&gt;next-frame prediction&lt;/strong&gt;. Instead of generating all frames simultaneously, it generates one frame at a time, keeping only what it needs in memory. The model itself is 13 GB on disk but the FP8 quantization and frame-by-frame approach mean it peaks around 6 GB of VRAM usage.&lt;/p&gt;

&lt;p&gt;The trade-off is speed. Generating a 3-second clip takes several minutes on a mid-range GPU. On a high-end card it's faster, but it's never going to be real-time. The architecture is optimized for memory, not throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Uses Under the Hood
&lt;/h2&gt;

&lt;p&gt;FramePack F1 is built on the &lt;strong&gt;HunyuanVideo&lt;/strong&gt; backbone. That's why it shares components with HunyuanVideo (the VAE, CLIP-L encoder). The pipeline works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SigCLIP Vision Encoder&lt;/strong&gt; looks at your input image and creates visual embeddings — a numerical representation of what's in the image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DualCLIPLoader&lt;/strong&gt; loads both text encoders (CLIP-L + LLaVA LLaMA3) to process your text prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAE&lt;/strong&gt; encodes your input image into latent space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FramePackSampler&lt;/strong&gt; takes the image latent, vision embeddings, and text conditioning, then generates video frames one at a time using next-frame prediction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAE&lt;/strong&gt; decodes the generated latent frames back into actual pixels&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The sampler has a &lt;code&gt;gpu_memory_preservation&lt;/code&gt; parameter set to 6.0 GB by default — it actively manages memory to stay within that budget.&lt;/p&gt;
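&lt;p&gt;If you drive ComfyUI programmatically, a graph like the one described above gets submitted as JSON to its local &lt;code&gt;/prompt&lt;/code&gt; endpoint. The sketch below is illustrative only: the node class names follow this article's description of the pipeline, and the exact input names depend on the ComfyUI-FramePackWrapper version you have installed.&lt;/p&gt;

```python
import json
import urllib.request

# Illustrative API-format graph (not a complete workflow -- the text
# encoders, VAE, and vision encoder nodes are omitted for brevity).
workflow = {
    "1": {"class_type": "LoadImage",
          "inputs": {"image": "portrait.png"}},
    "2": {"class_type": "FramePackSampler",
          "inputs": {"image": ["1", 0],  # wire node 1's first output in
                     "gpu_memory_preservation": 6.0,  # VRAM budget in GB
                     "prompt": "subtle head turn, natural blinking"}},
}

def submit(graph: dict, host: str = "http://127.0.0.1:8188") -> bytes:
    # ComfyUI queues the graph when it's POSTed to /prompt
    payload = json.dumps({"prompt": graph}).encode()
    req = urllib.request.Request(
        f"{host}/prompt", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    submit(workflow)
```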

&lt;h2&gt;
  
  
  What the Results Look Like
&lt;/h2&gt;

&lt;p&gt;FramePack does &lt;strong&gt;motion from a still image&lt;/strong&gt;. Give it a photo of a person and it'll add natural movement — head turns, blinking, subtle body motion. Give it a landscape and it'll add wind, clouds, water flow.&lt;/p&gt;

&lt;p&gt;It's strongest with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Portraits and people&lt;/strong&gt; — natural micro-movements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nature scenes&lt;/strong&gt; — wind, water, atmospheric effects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple compositions&lt;/strong&gt; — one clear subject against a background&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It struggles with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-person scenes&lt;/strong&gt; — tracking gets confused&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast action&lt;/strong&gt; — it's tuned for gentle, natural motion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long durations&lt;/strong&gt; — quality degrades after ~4 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output resolution follows your input image. Feed it a 512x768 portrait and you get a 512x768 video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;

&lt;p&gt;If you want to set this up manually: install ComfyUI, clone the FramePackWrapper custom nodes, download all five model files to the correct ComfyUI subdirectories, build a workflow connecting all the nodes in the right order, and pray nothing conflicts with your existing setup.&lt;/p&gt;

&lt;p&gt;Or — and this is what I built — &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; handles the entire pipeline. Open the Create tab, pick the FramePack bundle, one-click download all five components, upload an image, write what motion you want, generate. The app builds the correct workflow automatically.&lt;/p&gt;

&lt;p&gt;It also does text-to-image, image-to-image, and text-to-video with other models (Wan 2.1, CogVideoX, FLUX, SDXL). ComfyUI gets auto-detected or one-click installed. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;6 GB VRAM video generation is real, and it works. But let's not pretend it's magic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;25 GB download&lt;/strong&gt; before you generate anything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Several minutes per clip&lt;/strong&gt; on mid-range hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3-4 seconds&lt;/strong&gt; of usable output per generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality varies&lt;/strong&gt; — some images animate beautifully, others look weird&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a tool for specific use cases, not a replacement for cloud video gen services. But for those use cases — quick social media content, animated product shots, bringing concept art to life — running it locally for free on a GPU you already own is genuinely compelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;PurpleDoubleD/locally-uncensored&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>video</category>
    </item>
  </channel>
</rss>
