<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David </title>
    <description>The latest articles on DEV Community by David  (@purpledoubled).</description>
    <link>https://dev.to/purpledoubled</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802440%2Fbd0118a6-e9df-4efa-965a-8f8f9c2ef510.png</url>
      <title>DEV Community: David </title>
      <link>https://dev.to/purpledoubled</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/purpledoubled"/>
    <language>en</language>
    <item>
      <title>Abliterated Models Guide - Qwen 3.6, Gemma 4 Heretic, Llama 3.1 Uncensored Download Links</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:58:02 +0000</pubDate>
      <link>https://dev.to/purpledoubled/abliterated-models-guide-qwen-36-gemma-4-heretic-llama-31-uncensored-download-links-1f4e</link>
      <guid>https://dev.to/purpledoubled/abliterated-models-guide-qwen-36-gemma-4-heretic-llama-31-uncensored-download-links-1f4e</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/abliterated-models-guide.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you've looked at the Discover tab in any local-AI app and wondered why some Llama variants have &lt;em&gt;abliterated&lt;/em&gt; in the name, this is the post that explains it. Plus the curated download list for 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Abliteration Actually Is
&lt;/h2&gt;

&lt;p&gt;Modern instruction-tuned LLMs have a learned &lt;strong&gt;refusal direction&lt;/strong&gt; in their residual stream. When a prompt activates that direction strongly enough, the model outputs "I cannot help with that." The direction was put there during RLHF.&lt;/p&gt;

&lt;p&gt;Abliteration removes it via &lt;strong&gt;orthogonalisation&lt;/strong&gt;. You take a corpus of refused prompts, isolate the activation direction that distinguishes them from accepted prompts, then project that direction out of the weight matrices that write into the residual stream. The result is a model with the same training and essentially the same capabilities, but one that is no longer prone to categorical refusal.&lt;/p&gt;

&lt;p&gt;It's a clean technique - not a finetune, not a jailbreak, not a system-prompt trick. Original paper: &lt;em&gt;"Refusal in Language Models Is Mediated by a Single Direction"&lt;/em&gt; (Arditi et al., 2024).&lt;/p&gt;
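
&lt;p&gt;Concretely - a sketch in the paper's terms, with my own notation: if &lt;code&gt;r&lt;/code&gt; is the unit-norm refusal direction extracted from the refused-vs-accepted activation difference, every matrix &lt;code&gt;W&lt;/code&gt; that writes into the residual stream is replaced with &lt;code&gt;W' = W - r r^T W&lt;/code&gt;. The model can then no longer write any output component along &lt;code&gt;r&lt;/code&gt;, while everything orthogonal to &lt;code&gt;r&lt;/code&gt; is untouched - which is why the quality hit in the table below is so small.&lt;/p&gt;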

&lt;h2&gt;
  
  
  Abliterated vs Other Uncensored Approaches
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;th&gt;Quality impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Abliteration&lt;/td&gt;
&lt;td&gt;Project out refusal direction&lt;/td&gt;
&lt;td&gt;hours on GPU&lt;/td&gt;
&lt;td&gt;1-3% degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full finetune (Dolphin, Hermes)&lt;/td&gt;
&lt;td&gt;Re-train on uncensored corpus&lt;/td&gt;
&lt;td&gt;days, expensive&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA finetune&lt;/td&gt;
&lt;td&gt;Adapter on uncensored data&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;td&gt;Minor, reversible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge (Frankenmerges)&lt;/td&gt;
&lt;td&gt;Combine multiple finetunes&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;td&gt;Highly variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt jailbreak&lt;/td&gt;
&lt;td&gt;Persona-style instructions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Brittle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Abliteration is the cleanest research-grounded option. Dolphin and Hermes are battle-tested production finetunes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Abliterated Models (2026)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Qwen 3.6 Family
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;richardyoung/qwen3-14b-abliterated:q4_K_M&lt;/strong&gt; - 9 GB, fits 12 GB VRAM, vision-capable. Comes in &lt;code&gt;:q4_K_M&lt;/code&gt; (chat) and &lt;code&gt;:agent&lt;/code&gt; (tool-calling) tags via Ollama.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 3.6 27B Samantha (huihui-ai variant)&lt;/strong&gt; - abliterated dense 27B with the Samantha personality finetune.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gemma 4 Heretic
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stabhappy/gemma-4-31B-it-heretic-Gguf&lt;/strong&gt; - Gemma 4 31B base abliterated. ~17 GB at Q4_K_M. Native vision, tool calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B MoE HERETIC&lt;/strong&gt; - 26B brain with 4B active. Smaller VRAM peak, MoE-fast inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Llama 3.1 Family
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mannix/llama3.1-8b-abliterated:q5_K_M&lt;/strong&gt; - 5.7 GB. The most-pulled abliterated Llama on Ollama. Comes with &lt;code&gt;:agent&lt;/code&gt; tag for tool calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated&lt;/strong&gt; - the canonical reference variant.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hermes 3
&lt;/h3&gt;

&lt;p&gt;Hermes 3 is technically a full finetune, not abliteration, but functions similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;hermes3:8b&lt;/strong&gt; via Ollama - 4.7 GB, fits 8 GB GPUs. Good chat default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hermes3:70b&lt;/strong&gt; - 40 GB, needs 48 GB VRAM or aggressive quantisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GLM 5.1 Heretic
&lt;/h3&gt;

&lt;p&gt;The newest entrant: &lt;strong&gt;huihui-ai/Huihui-GLM-5.1-abliterated-GGUF&lt;/strong&gt;. The 754B MoE GLM 5.1 abliterated. 236 GB at IQ2_M - not consumer hardware, but if you have a Mac Studio M4 Ultra, it's the strongest open abliterated model, period.&lt;/p&gt;
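
&lt;p&gt;As far as I know there's no single Ollama tag for something this size, so the usual route is pulling the GGUF shards straight from Hugging Face. A minimal sketch with &lt;code&gt;huggingface-cli&lt;/code&gt; - the repo name is from above, while the include pattern and target folder are just illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install -U "huggingface_hub[cli]"
huggingface-cli download huihui-ai/Huihui-GLM-5.1-abliterated-GGUF --include "*IQ2_M*" --local-dir ./glm-5.1-heretic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;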

&lt;h2&gt;
  
  
  How to Download and Run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path 1 - Ollama (one command)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull richardyoung/qwen3-14b-abliterated:q4_K_M
ollama run richardyoung/qwen3-14b-abliterated:q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Path 2 - Locally Uncensored (one click)
&lt;/h3&gt;

&lt;p&gt;Open &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, navigate to &lt;strong&gt;Model Manager &amp;gt; Discover &amp;gt; Text&lt;/strong&gt;, click the &lt;strong&gt;UNCENSORED&lt;/strong&gt; filter tab. The 34 curated abliterated GGUFs are all there with one-click download.&lt;/p&gt;

&lt;p&gt;The new &lt;a href="https://locallyuncensored.com/blog/locally-uncensored-v2-4-0-release.html" rel="noopener noreferrer"&gt;v2.4.0 Settings &amp;gt; Model Storage&lt;/a&gt; override lets you redirect the GGUF download folder if you want them on a separate drive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Best Abliterated Pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;Llama 3.1 8B abliterated Q4_K_M&lt;/td&gt;
&lt;td&gt;Fits with headroom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 GB (RTX 3060)&lt;/td&gt;
&lt;td&gt;Qwen 3 14B abliterated Q4_K_M&lt;/td&gt;
&lt;td&gt;Sweet spot, ~15 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Heretic Q4_K_M&lt;/td&gt;
&lt;td&gt;Best general-purpose at this VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB (RTX 3090/4090)&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Heretic Q5_K_M&lt;/td&gt;
&lt;td&gt;Higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48 GB+&lt;/td&gt;
&lt;td&gt;Hermes 3 70B or GLM 5.1 Heretic IQ2&lt;/td&gt;
&lt;td&gt;Frontier-tier quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Common Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Will an abliterated model write me malware?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Probably not the way you're thinking. Abliteration removes the categorical refusal but the model still has training-time priors against obviously-bad outputs. The models work best for legitimate-but-edge-case use cases: security research, fiction with violence, medical questions the base model deflects, legal grey areas, adult creative writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are abliterated models dangerous?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No more dangerous than the underlying base model. Abliteration removes a layer of guardrails; the model's knowledge itself is unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I abliterate a model myself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. The technique is well-documented and the code is on GitHub (search &lt;em&gt;abliterator&lt;/em&gt;). You need a GPU with the model loaded, a few thousand refused-vs-accepted prompt pairs, and a few hours.&lt;/p&gt;
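
&lt;p&gt;The exact invocation depends on which repo you pick, so the following is a purely hypothetical sketch of the shape of the workflow - the repo URL, script name, and flags are made up, not taken from any specific project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# hypothetical - substitute the real repo and its documented flags&lt;/span&gt;
git clone https://github.com/someone/some-abliterator
cd some-abliterator
pip install -r requirements.txt
python abliterate.py --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --refused refused_prompts.txt --accepted accepted_prompts.txt \
  --output ./llama3.1-8b-abliterated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;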




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>opensource</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Run Qwen 3.6 Locally - 27B Dense, 35B MoE, and Coding Variants Setup Guide</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:58:02 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen-36-locally-27b-dense-35b-moe-and-coding-variants-setup-guide-4di</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen-36-locally-27b-dense-35b-moe-and-coding-variants-setup-guide-4di</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/how-to-run-qwen-3-6-locally.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Qwen 3.6 dropped on April 21, 2026. Two main families: a &lt;strong&gt;27B dense&lt;/strong&gt; model that activates every parameter per token and a &lt;strong&gt;35B MoE&lt;/strong&gt; with 3B active per token. Both ship with vision, agentic coding, thinking-mode preservation, and a 256K context window.&lt;/p&gt;

&lt;p&gt;If you only have time for the short version: install &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, open Model Manager &amp;gt; Discover &amp;gt; Text, search Qwen 3.6, hit the download arrow on the variant that fits your VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Qwen 3.6 Variant Should You Pick?
&lt;/h2&gt;

&lt;p&gt;The biggest decision is dense vs MoE. The second biggest is which quant.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;27B dense&lt;/strong&gt; activates all 27B parameters for every token. Slower per token, but every token gets the full model. Quality is consistent. Recommended default for general chat, reasoning, and most coding.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;35B MoE&lt;/strong&gt; only activates 3B parameters per token via routing. Much faster per token (often 2-3x throughput at similar quants). VRAM peak during inference is lower than the model size suggests. But routing introduces variance. The MoE wins on coding benchmarks (SWE-bench specifically) when you pick the coding-specialised variant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quant Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Disk&lt;/th&gt;
&lt;th&gt;VRAM Target&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;UD-IQ2_XXS&lt;/td&gt;
&lt;td&gt;8.7 GB&lt;/td&gt;
&lt;td&gt;8 GB GPU&lt;/td&gt;
&lt;td&gt;Good (low-VRAM lifesaver)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;13 GB&lt;/td&gt;
&lt;td&gt;12 GB GPU&lt;/td&gt;
&lt;td&gt;Very good (RTX 3060 sweet spot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;16 GB GPU&lt;/td&gt;
&lt;td&gt;Recommended default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;16 GB GPU&lt;/td&gt;
&lt;td&gt;Better quality per GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;td&gt;20 GB GPU&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q6_K&lt;/td&gt;
&lt;td&gt;21 GB&lt;/td&gt;
&lt;td&gt;24 GB GPU&lt;/td&gt;
&lt;td&gt;Near-lossless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;32 GB GPU&lt;/td&gt;
&lt;td&gt;Effectively lossless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;24 GB GPU&lt;/td&gt;
&lt;td&gt;Recommended for MoE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;NVFP4&lt;/td&gt;
&lt;td&gt;22 GB&lt;/td&gt;
&lt;td&gt;22 GB GPU (RTX 50+)&lt;/td&gt;
&lt;td&gt;Smallest with full quality on Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE coding&lt;/td&gt;
&lt;td&gt;NVFP4&lt;/td&gt;
&lt;td&gt;22 GB&lt;/td&gt;
&lt;td&gt;22 GB GPU (RTX 50+)&lt;/td&gt;
&lt;td&gt;Best coding-bench-per-GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;BF16&lt;/td&gt;
&lt;td&gt;71 GB&lt;/td&gt;
&lt;td&gt;96 GB GPU&lt;/td&gt;
&lt;td&gt;Reference quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Recommendation by Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 GB VRAM&lt;/strong&gt; (RTX 3060 8GB, RTX 4060 8GB): 27B UD-IQ2_XXS - the only quant that fits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 GB VRAM&lt;/strong&gt; (RTX 3060 12GB, RTX 3080 Ti, RTX 4070): 27B Q3_K_M - sweet spot, ~15-25 tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16 GB VRAM&lt;/strong&gt;: 27B Q4_K_M or UD-Q4_K_XL - the recommended default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24 GB VRAM&lt;/strong&gt; (RTX 3090, RTX 4090): 27B Q6_K for max dense quality, OR 35B MoE Q4_K_M for coding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTX 50+ (Blackwell)&lt;/strong&gt;: 35B MoE NVFP4 - smallest size with native quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon M3/M4&lt;/strong&gt;: 35B MoE MLX BF16 via MLX runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU only with 32 GB RAM&lt;/strong&gt;: 27B Q4_K_M at 1-3 tok/s - usable for short tasks&lt;/li&gt;
&lt;/ul&gt;
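
&lt;p&gt;All of the picks above are keyed to VRAM. If you're not sure what your NVIDIA card actually reports (dedicated VRAM, not shared system memory), there's a one-liner for that - &lt;code&gt;nvidia-smi&lt;/code&gt; ships with the driver on both Windows and Linux:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi --query-gpu=name,memory.total --format=csv   &lt;span class="c"&gt;# e.g. "NVIDIA GeForce RTX 3060, 12288 MiB"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;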

&lt;h2&gt;
  
  
  Installation Path 1 - Ollama (CLI)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.6:27b           &lt;span class="c"&gt;# dense Q4_K_M, 16 GB&lt;/span&gt;
ollama pull qwen3.6                &lt;span class="c"&gt;# 35B MoE Q4_K_M, 24 GB&lt;/span&gt;
ollama pull qwen3.6:35b-a3b-coding-nvfp4   &lt;span class="c"&gt;# coding NVFP4&lt;/span&gt;
ollama run qwen3.6:27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
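
&lt;p&gt;Once a pull finishes, the model is also reachable over Ollama's local HTTP API on the default port 11434, which is handy if you want to script against it. A minimal sketch against the native &lt;code&gt;/api/generate&lt;/code&gt; endpoint, using the dense tag pulled above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:27b",
  "prompt": "Explain the difference between a dense model and a MoE in two sentences.",
  "stream": false
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;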



&lt;h2&gt;
  
  
  Installation Path 2 - Locally Uncensored (GUI)
&lt;/h2&gt;

&lt;p&gt;If you want a one-click experience plus chat, agent mode, image generation, and A/B model comparison in the same window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;v2.4.0 installer&lt;/a&gt; for your OS&lt;/li&gt;
&lt;li&gt;First-launch wizard auto-detects Ollama (or offers one-click install)&lt;/li&gt;
&lt;li&gt;Model Manager &amp;gt; Discover &amp;gt; Text &amp;gt; search Qwen 3.6&lt;/li&gt;
&lt;li&gt;Click the download arrow on the variant matching your VRAM&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Performance on RTX 3060 12 GB
&lt;/h2&gt;

&lt;p&gt;Tested with Qwen 3.6 27B Q3_K_M, 4096-token context, fp16 KV cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold first response&lt;/td&gt;
&lt;td&gt;~3 (model load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm chat (50-token answers)&lt;/td&gt;
&lt;td&gt;22-26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form (1000 tokens)&lt;/td&gt;
&lt;td&gt;18-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking-mode enabled&lt;/td&gt;
&lt;td&gt;15-18&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
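
&lt;p&gt;To sanity-check these numbers on your own card, Ollama prints load and generation stats when you pass &lt;code&gt;--verbose&lt;/code&gt;. The sketch below uses the default Q4_K_M tag from Path 1 - swap in whichever quant you actually pulled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:27b "Summarise what a KV cache does in three sentences." --verbose
&lt;span class="c"&gt;# the stats printed after the reply include an "eval rate" line in tokens/s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;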

&lt;h2&gt;
  
  
  Vision Support
&lt;/h2&gt;

&lt;p&gt;Both 27B dense and 35B MoE accept image input. Drag-and-drop a screenshot, photo, or chart. VRAM cost for vision is +1-2 GB on top of the base model.&lt;/p&gt;
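
&lt;p&gt;Outside a GUI, the same &lt;code&gt;/api/generate&lt;/code&gt; endpoint shown earlier accepts an &lt;code&gt;images&lt;/code&gt; array of base64-encoded files. A rough sketch (GNU &lt;code&gt;base64&lt;/code&gt; shown; on macOS use &lt;code&gt;base64 -i screenshot.png&lt;/code&gt; instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;IMG=$(base64 -w0 screenshot.png)
curl http://localhost:11434/api/generate -d "{
  \"model\": \"qwen3.6:27b\",
  \"prompt\": \"Describe what this screenshot shows.\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;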

&lt;h2&gt;
  
  
  Coding Performance
&lt;/h2&gt;

&lt;p&gt;The 35B MoE coding-specialised variants are tuned on SWE-bench training data. The coding NVFP4 variant scores in the same ballpark as Claude 3.5 Sonnet on SWE-bench Verified at a fraction of the inference cost.&lt;/p&gt;

&lt;p&gt;For day-to-day coding inside &lt;a href="https://locallyuncensored.com/blog/codex-cli-universal-model-support.html" rel="noopener noreferrer"&gt;LU's Codex agent&lt;/a&gt;, the 27B dense Q4_K_M is the better default - consistent quality, no MoE-routing variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qwen 3.6 vs Qwen 3.5
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Qwen 3.5&lt;/th&gt;
&lt;th&gt;Qwen 3.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (both 27B and 35B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking mode&lt;/td&gt;
&lt;td&gt;QwQ-only&lt;/td&gt;
&lt;td&gt;Preserved across variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding-specific MoE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (35B-a3b-coding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVFP4 quant&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (35B MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLX variant for Apple Silicon&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>qwen</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Locally Uncensored v2.4.0 — Settings Polish, Linux Drag Fix, and Configurable HuggingFace Path</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:53:15 +0000</pubDate>
      <link>https://dev.to/purpledoubled/locally-uncensored-v240-settings-polish-linux-drag-fix-and-configurable-huggingface-path-34bo</link>
      <guid>https://dev.to/purpledoubled/locally-uncensored-v240-settings-polish-linux-drag-fix-and-configurable-huggingface-path-34bo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/locally-uncensored-v2-4-0-release.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; v2.4.0 is a polish release. Eight fixes, two of them surfaced through community feedback on Discord, six caught during an internal end-to-end pass on the v2.3.9 build. No new headline features — this release exists so the next feature release lands on a cleaner foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-instance lock&lt;/strong&gt; — double-clicking the shortcut focuses the existing window instead of spawning a second process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Model Storage&lt;/strong&gt; — paste or pick the folder where HuggingFace GGUF downloads land&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Privacy&lt;/strong&gt; — in-app statement of what runs locally and what doesn't&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Onboarding&lt;/strong&gt; — a button that re-runs the first-launch wizard on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset tutorial&lt;/strong&gt; — the button now actually does what its label promises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux window drag&lt;/strong&gt; — the title-bar drag works on Ubuntu 24.04 again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discover&lt;/strong&gt; — the HuggingFace download path is no longer printed twice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace search heuristic&lt;/strong&gt; — search results for repos with a quant tag in the name no longer 404 on download&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Single-Instance Lock
&lt;/h2&gt;

&lt;p&gt;Before v2.4.0, double-clicking the desktop shortcut started a second &lt;code&gt;locally-uncensored.exe&lt;/code&gt; process. Both instances would race each other writing to the store backup file — not a frequent corruption source, but a real one when both happened to flush at the same millisecond.&lt;/p&gt;

&lt;p&gt;v2.4.0 ships with &lt;code&gt;tauri-plugin-single-instance&lt;/code&gt;. The second launch focuses, un-minimizes, and brings the existing window to front. No new process. Verified with three back-to-back launches: only one PID survives.&lt;/p&gt;
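
&lt;p&gt;You can reproduce that check on Linux too. A rough sketch, assuming the installed binary is on your PATH as &lt;code&gt;locally-uncensored&lt;/code&gt; (name assumed here, not taken from the release notes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;locally-uncensored &amp;amp;
sleep 2
locally-uncensored &amp;amp;
sleep 2
pgrep -fc locally-uncensored   &lt;span class="c"&gt;# should print 1 on v2.4.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;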

&lt;h2&gt;
  
  
  Settings → Model Storage — Configurable HuggingFace Folder
&lt;/h2&gt;

&lt;p&gt;The Model Manager → Discover → Text tab lets you download GGUF models from HuggingFace. Until v2.4.0, the destination folder was always auto-detected from the active openai-compat provider — usually LM Studio's models folder.&lt;/p&gt;

&lt;p&gt;That worked fine for single-disk setups. It did &lt;strong&gt;not&lt;/strong&gt; work for dual-boot users who wanted a shared model partition between Linux and Windows, or anyone running a NAS-mounted models folder. Reported on Discord by &lt;code&gt;diimmortalis&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;v2.4.0 adds a dedicated &lt;strong&gt;Settings → Model Storage&lt;/strong&gt; section with a path input, a Browse button, and a Reset button. The override takes effect immediately. Verified end-to-end with a Gemma 4 E4B download (4.6 GB) landing in a custom folder while the LM Studio default folder stayed untouched.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux Window Drag Fix
&lt;/h2&gt;

&lt;p&gt;On Ubuntu 24.04 the title-bar drag threw an unhandled Promise rejection — &lt;code&gt;core:window:allow-start-dragging&lt;/code&gt; was missing from the capability list. Reported on Discord by &lt;code&gt;diimmortalis&lt;/code&gt; with a clean Promise-rejection dump. One-line fix in &lt;code&gt;src-tauri/capabilities/default.json&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests &amp;amp; Verification
&lt;/h2&gt;

&lt;p&gt;Test suite went from 2205 to 2216 (+11 regression tests). &lt;code&gt;cargo check&lt;/code&gt; clean. &lt;code&gt;tsc --noEmit&lt;/code&gt; clean.&lt;/p&gt;

&lt;p&gt;Live end-to-end on the installed v2.4.0 build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-instance&lt;/strong&gt;: 3 back-to-back exe launches → 1 PID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HF download override&lt;/strong&gt;: typed custom path, Discover subtitle updated, Gemma 4 E4B partial download (35.9 MB at 897 KB/s) landed in the picked folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-run onboarding&lt;/strong&gt;: click → marker deleted → 6-step wizard renders → marker re-created&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset tutorial&lt;/strong&gt;: click → flag flipped → new chat → Agent toggle → tutorial renders&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Download
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases/tag/v2.4.0" rel="noopener noreferrer"&gt;GitHub Releases&lt;/a&gt;. Signed installers for Windows (.exe, .msi) and Linux (.deb, .rpm, .AppImage). Auto-update picks the new build up on next launch for anyone on v2.3.x.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>opensource</category>
      <category>tauri</category>
      <category>release</category>
    </item>
    <item>
      <title>Anthropic is Rationing Claude Code on Pro — Here's a Local Alternative</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 10:34:18 +0000</pubDate>
      <link>https://dev.to/purpledoubled/anthropic-is-rationing-claude-code-on-pro-heres-a-local-alternative-574n</link>
      <guid>https://dev.to/purpledoubled/anthropic-is-rationing-claude-code-on-pro-heres-a-local-alternative-574n</guid>
      <description>&lt;p&gt;Earlier this week, Anthropic ran a quiet test: a small slice (~2%) of new Pro plan subscribers found that Claude Code wasn't included with their $20/month subscription. The pricing page was updated to reflect this. It made some noise on Reddit and X, Anthropic walked it back, and the page was reverted.&lt;/p&gt;

&lt;p&gt;But the incident highlights something real: &lt;strong&gt;the economics of hosted AI are strained.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;Anthropic's head of growth &lt;a href="https://arstechnica.com/ai/2026/04/anthropic-tested-removing-claude-code-from-the-pro-plan/" rel="noopener noreferrer"&gt;clarified on social media&lt;/a&gt; that the test affected about 2% of new prosumer signups. The reasoning was straightforward: usage patterns have changed dramatically. Users have moved from brief chat sessions to "nearly always-on, multi-agent workflows" that consume vastly more tokens. The current plans weren't built for this.&lt;/p&gt;

&lt;p&gt;To be clear: this wasn't a crisis. It was a business experiment that got rolled back quickly. But it was also a signal — one that shouldn't be surprising if you've been paying attention to how compute-heavy AI tools have become.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trend Is Clear
&lt;/h2&gt;

&lt;p&gt;Claude Code isn't unique here. OpenAI has introduced peak-hour caps. Anthropic has added tighter limits during high-traffic periods. Gemini, ChatGPT, and others have all introduced various forms of rate limiting as agentic workflows (long-running, multi-step tasks) have taken off.&lt;/p&gt;

&lt;p&gt;This isn't malice — it's math. Running a model that can handle complex, hours-long agentic tasks requires significant GPU compute. At $20/month, there's a real gap between what heavy users consume and what the subscription covers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Local Models
&lt;/h2&gt;

&lt;p&gt;This is where running AI locally becomes genuinely compelling, not just theoretically interesting.&lt;/p&gt;

&lt;p&gt;Tools like &lt;strong&gt;Ollama&lt;/strong&gt;, &lt;strong&gt;LM Studio&lt;/strong&gt;, and &lt;strong&gt;&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;&lt;/strong&gt; let you run capable language models on your own hardware. No subscription. No per-token billing. No rate limits. No plan changes.&lt;/p&gt;

&lt;p&gt;The tradeoff is real: you need decent hardware (a modern Mac with unified memory, a gaming PC with a good GPU, or a dedicated home server), and the experience differs from hosted APIs. But for developers who rely on agentic workflows — the exact users feeling the squeeze from providers — the local path is increasingly viable.&lt;/p&gt;

&lt;p&gt;Recent open-weight models from Mistral, Qwen, and the Llama family are genuinely capable for coding tasks. They're not matching the frontier models on every benchmark, but for the majority of real-world dev work, the gap has shrunk considerably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This a Sales Pitch?
&lt;/h2&gt;

&lt;p&gt;Not really — and I want to be clear about that. If Anthropic's pricing works for you and you don't hit limits, there's no urgent reason to change. Their models are excellent.&lt;/p&gt;

&lt;p&gt;But if you've been on the receiving end of a rate limit mid-flow, or if you're watching your usage climb and wondering what happens next, it's worth knowing that the local option exists and has gotten significantly easier to set up over the past year.&lt;/p&gt;

&lt;p&gt;The local ecosystem isn't for everyone. But for developers who have built automated workflows around AI — the exact users Anthropic was quietly trying to ration — it might be worth an afternoon of experimentation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What do you think — is the local-first approach realistic for your use case, or are you all-in on hosted APIs? I'd genuinely like to know what you're running into.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>localmodels</category>
      <category>privacy</category>
      <category>programming</category>
    </item>
    <item>
      <title>qwen3.6-27b scores 77.2% on SWE-bench. the dense model is winning against MoE.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:30:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/qwen36-27b-scores-772-on-swe-bench-the-dense-model-is-winning-against-moe-3e4b</link>
      <guid>https://dev.to/purpledoubled/qwen36-27b-scores-772-on-swe-bench-the-dense-model-is-winning-against-moe-3e4b</guid>
      <description>&lt;p&gt;When Alibaba released Qwen3.6-35B-A3B, the MoE (Mixture of Experts) design stole all the headlines. 35 billion parameters, 3 billion activated per token — everyone's been focused on that ratio.&lt;/p&gt;

&lt;p&gt;Then they dropped Qwen3.6-27B. A plain old dense model. 27 billion parameters, all active.&lt;/p&gt;

&lt;p&gt;On SWE-bench Verified, the 27B dense scores &lt;strong&gt;77.2%&lt;/strong&gt;. The 35B MoE scores &lt;strong&gt;73.4%&lt;/strong&gt;. The dense model is outperforming the MoE by nearly 4 points — on the benchmark that measures real software engineering capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  what SWE-bench actually measures
&lt;/h2&gt;

&lt;p&gt;SWE-bench gives an LLM a real GitHub issue and a codebase. It has to understand the problem, find the right files, write the fix, and get the tests to pass. It's not multiple choice — it requires actual coding.&lt;/p&gt;

&lt;p&gt;Qwen3.6-27B at 77.2% puts it in range of proprietary models. Claude Opus 4.5 scores 80.9%. The gap is real but narrowing — and Qwen3.6-27B does it on your own GPU under Apache 2.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  why is the dense model winning?
&lt;/h2&gt;

&lt;p&gt;Two factors seem to be driving this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Full parameter utilization.&lt;/strong&gt; In a MoE model like the 35B-A3B, only 3B of 35B parameters are active per token. The routing layer decides which experts to use. This is efficient for inference speed, but the model can't "use" all of its knowledge simultaneously. A dense model can activate its full capacity for harder reasoning tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Architecture: Gated DeltaNet.&lt;/strong&gt; Qwen3.6-27B isn't a vanilla dense transformer. It uses a Gated DeltaNet + Gated Attention hybrid — alternating layers of linear-gated attention (DeltaNet) with standard gated attention. DeltaNet processes information in compressed deltas rather than full representations, which lets it handle long contexts more efficiently while maintaining reasoning depth.&lt;/p&gt;

&lt;p&gt;The result is a model that can do 262K context natively (extendable to 1M tokens) while still being a strong coder.&lt;/p&gt;

&lt;h2&gt;
  
  
  the benchmark breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Qwen3.6-27B (dense)&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B (MoE)&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;+3.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49.5&lt;/td&gt;
&lt;td&gt;+4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51.5&lt;/td&gt;
&lt;td&gt;+7.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SkillsBench Avg5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28.7&lt;/td&gt;
&lt;td&gt;+19.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QwenWebBench&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1487&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1397&lt;/td&gt;
&lt;td&gt;+90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;29.4&lt;/td&gt;
&lt;td&gt;+6.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Terminal-Bench (real terminal operations) and SkillsBench show the largest gaps. These are tasks where the model needs to chain together multiple operations — the kind of thing where full parameter access seems to matter most.&lt;/p&gt;

&lt;h2&gt;
  
  
  the tradeoff
&lt;/h2&gt;

&lt;p&gt;Dense models aren't free. The 27B activates all 27B parameters per forward pass. The 35B MoE activates only 3B. During inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35B MoE is faster per token&lt;/strong&gt; (3B vs 27B compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;35B MoE uses less memory&lt;/strong&gt; for the active computation (but total disk/loaded size is still large)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27B dense is better at hard coding tasks&lt;/strong&gt; (SWE-bench, terminal operations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're doing simple chat, the MoE will be faster. If you're running an agent that needs to reason through a complex codebase — the dense model is showing real advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  vision included
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-27B is an image-text-to-text model. The vision encoder is built in. That means you can screenshot a UI and ask it to fix the bug, read a diagram and explain the architecture, or debug from screenshots. The 35B MoE is text-only.&lt;/p&gt;

&lt;h2&gt;
  
  
  running it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, you also get image input, a built-in code agent, and fully local outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/PurpleDoubleD/locally-uncensored
&lt;span class="nb"&gt;cd &lt;/span&gt;locally-uncensored &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run tauri dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The MoE vs dense debate isn't settled. But on coding agent tasks, Qwen3.6-27B is making a strong case that raw parameter count isn't everything — architecture and full utilization matter too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 license.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>how to run qwen3.6-27b locally — the dense 27B that beats the 35B MoE on coding</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:29:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen36-27b-locally-the-dense-27b-that-beats-the-35b-moe-on-coding-172e</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen36-27b-locally-the-dense-27b-that-beats-the-35b-moe-on-coding-172e</guid>
      <description>&lt;p&gt;Alibaba just dropped Qwen3.6-27B, a 27-billion parameter dense model that scores 77.2% on SWE-bench Verified. That's higher than Qwen3.6-35B-A3B (73.4%) — the MoE version everyone was talking about last week.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, a desktop AI app, and we just added Qwen3.6-27B support.&lt;/p&gt;

&lt;h2&gt;
  
  
  install with ollama
&lt;/h2&gt;

&lt;p&gt;If you already have Ollama set up, it's a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. If you want a specific quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b:q4_K_M   &lt;span class="c"&gt;# 16GB RAM recommended&lt;/span&gt;
ollama run qwen3.6-27b:q8_0     &lt;span class="c"&gt;# 27GB RAM recommended&lt;/span&gt;
ollama run qwen3.6-27b:fp8      &lt;span class="c"&gt;# needs ~27GB VRAM (FP8)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: if &lt;code&gt;ollama run qwen3.6-27b&lt;/code&gt; returns "model not found", give it a minute — Ollama's library updates periodically. You can also pull manually with &lt;code&gt;ollama pull qwen3.6-27b&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  what makes qwen3.6-27b different
&lt;/h2&gt;

&lt;p&gt;The 35B-A3B is a Mixture-of-Experts model: 35B total params but only 3B activated per token. Qwen3.6-27B is a different beast — a &lt;strong&gt;dense&lt;/strong&gt; 27B model with a Gated DeltaNet + Gated Attention hybrid architecture.&lt;/p&gt;

&lt;p&gt;Key specs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;27B parameters (all active, no MoE routing)&lt;/li&gt;
&lt;li&gt;64 layers, 5120 hidden dimension&lt;/li&gt;
&lt;li&gt;262,144 token context natively (extensible to 1,010,000)&lt;/li&gt;
&lt;li&gt;Vision encoder included (image-text-to-text)&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gated DeltaNet architecture processes tokens through alternating Gated DeltaNet and Gated Attention layers — a hybrid that combines linear-attention efficiency with gated selective attention. It's a different design philosophy from both vanilla transformers and the 35B MoE.&lt;/p&gt;

&lt;h2&gt;
  
  
  benchmark table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Qwen3.6-27B&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;th&gt;Gemma4-31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;52.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49.5&lt;/td&gt;
&lt;td&gt;35.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51.5&lt;/td&gt;
&lt;td&gt;42.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SkillsBench Avg5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28.7&lt;/td&gt;
&lt;td&gt;23.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU-Pro&lt;/td&gt;
&lt;td&gt;86.2&lt;/td&gt;
&lt;td&gt;85.2&lt;/td&gt;
&lt;td&gt;85.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2026&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.7&lt;/td&gt;
&lt;td&gt;89.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All numbers from the &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B" rel="noopener noreferrer"&gt;official Qwen3.6-27B model card&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 27B dense model is pulling ahead of the 35B MoE on agentic coding tasks — SWE-bench, Terminal-Bench, SkillsBench. The gap is especially wide on SkillsBench (48.2 vs 28.7) which tests real-world dev skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  vram requirements
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-27B is a dense model, so all 27B parameters stay in memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;VRAM (approx)&lt;/th&gt;
&lt;th&gt;Recommended GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K&lt;/td&gt;
&lt;td&gt;10-11 GB&lt;/td&gt;
&lt;td&gt;RTX 3060, RTX 4060&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;16-17 GB&lt;/td&gt;
&lt;td&gt;RTX 4070, RTX 3080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;27-28 GB&lt;/td&gt;
&lt;td&gt;RTX 4090, A5000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;RTX 4090, H100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;54 GB&lt;/td&gt;
&lt;td&gt;dual GPU or professional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: these are for the base model only. With the vision encoder + KV cache for long context, add 2-4 GB overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  why not just use the 35B MoE?
&lt;/h2&gt;

&lt;p&gt;The 35B-A3B activates fewer params per token, which means faster inference and lower memory during generation. But if you're doing agentic coding with longer context windows, the dense 27B is showing real advantages on benchmark tasks that require deep repository reasoning.&lt;/p&gt;

&lt;p&gt;The 35B MoE also requires more total disk space (the full expert bank is still loaded even if only 3B activate per token) and the routing decisions can introduce variability.&lt;/p&gt;

&lt;h2&gt;
  
  
  try it with locally uncensored
&lt;/h2&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — a cross-platform desktop app that lets you run Qwen3.6-27B (and other models) with uncensored outputs, image understanding, and a built-in code agent.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-click model setup via Ollama&lt;/li&gt;
&lt;li&gt;Image + text input&lt;/li&gt;
&lt;li&gt;Built-in code agent mode&lt;/li&gt;
&lt;li&gt;Chat history and export&lt;/li&gt;
&lt;li&gt;No cloud, no data leaving your machine
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# clone and run&lt;/span&gt;
git clone https://github.com/PurpleDoubleD/locally-uncensored
&lt;span class="nb"&gt;cd &lt;/span&gt;locally-uncensored &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run tauri dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;GitHub releases&lt;/a&gt; for pre-built binaries.&lt;/p&gt;




&lt;p&gt;What GPU are you running? And have you tried the 27B vs the 35B MoE side-by-side? Drop a comment with your setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 license.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>anthropic charges $25/M tokens for opus 4.7. alibaba just released the same capability for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:14:04 +0000</pubDate>
      <link>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</link>
      <guid>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</guid>
      <description>&lt;p&gt;Anthropic charges $25 per million output tokens for Claude Opus 4.7. That's their new flagship coding model, released today. It's good — 13% better than Opus 4.6 on coding benchmarks, improved vision, stronger at multi-step agentic work.&lt;/p&gt;

&lt;p&gt;Meanwhile, also this week: Alibaba released Qwen3.6-35B-A3B under Apache 2.0. Scores 73.4 on SWE-bench Verified. Runs on an 8 GB GPU. Costs nothing.&lt;/p&gt;

&lt;p&gt;Two models. Same week. Completely opposite philosophies. Let's break down what's actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cloud tax is getting harder to justify
&lt;/h2&gt;

&lt;p&gt;When GPT-4 launched in 2023, there was nothing local that came close. Paying for API access made sense because there was no alternative.&lt;/p&gt;

&lt;p&gt;In 2024, open models started catching up. Llama 3, Qwen 2.5, Mistral — good enough for many tasks, but still clearly behind frontier models on the hard stuff.&lt;/p&gt;

&lt;p&gt;In 2026, the gap has narrowed to the point where you have to really think about whether the remaining difference is worth $25 per million output tokens.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. A developer using Opus 4.7 as their primary coding agent, running maybe 50 complex coding sessions a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average session: ~10K input tokens (code context) + ~5K output tokens (response)&lt;/li&gt;
&lt;li&gt;50 sessions: 500K input + 250K output tokens&lt;/li&gt;
&lt;li&gt;Daily cost: $2.50 + $6.25 = &lt;strong&gt;$8.75/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Monthly: &lt;strong&gt;~$190/month&lt;/strong&gt; just for one developer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now scale that to a team of 5. That's nearly $1,000/month on AI coding assistance.&lt;/p&gt;

&lt;p&gt;The same team could buy a single RTX 4070 ($550 one-time) and run Qwen3.6 at 20+ tokens/second with zero ongoing costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  what you actually get for $0
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B isn't just "a free model." It's specifically designed for the exact use case Opus 4.7 targets — coding agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic coding benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench Verified: 73.4 (fix real bugs in real repos autonomously)&lt;/li&gt;
&lt;li&gt;Terminal-Bench 2.0: 51.5 (operate a terminal to solve coding tasks)&lt;/li&gt;
&lt;li&gt;MCPMark: 37.0 (tool calling and agent protocols)&lt;/li&gt;
&lt;li&gt;QwenWebBench: 1397 Elo (frontend artifact generation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture advantages for local deployment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE: 35B total params, 3B active — runs like a small model, thinks like a big one&lt;/li&gt;
&lt;li&gt;Gated DeltaNet: 3 of 4 layers use linear attention — memory efficient on long contexts&lt;/li&gt;
&lt;li&gt;Native vision: understand screenshots, diagrams, code images without a separate model&lt;/li&gt;
&lt;li&gt;262K context: plenty for most codebase contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you give up vs Opus 4.7:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probably some edge on the hardest 10% of tasks&lt;/li&gt;
&lt;li&gt;Anthropic's specific safety/self-verification features&lt;/li&gt;
&lt;li&gt;The polish of a model trained with massive RLHF compute&lt;/li&gt;
&lt;li&gt;Cloud convenience (no GPU needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you gain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your code never leaves your machine&lt;/li&gt;
&lt;li&gt;No rate limits, no outages, no API key management&lt;/li&gt;
&lt;li&gt;No per-token costs, ever&lt;/li&gt;
&lt;li&gt;Full control over the model behavior&lt;/li&gt;
&lt;li&gt;Works offline, on a plane, in an air-gapped environment&lt;/li&gt;
&lt;li&gt;Apache 2.0 — fine-tune it, modify it, deploy it commercially&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  the $25/M question
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 is genuinely impressive. Anthropic's coding models have been best-in-class for a while and this extends that lead. The self-verification feature — where the model checks its own work before reporting back — is particularly useful for autonomous workflows.&lt;/p&gt;

&lt;p&gt;But the honest question every developer should ask is: &lt;strong&gt;for my specific tasks, does the delta between Opus 4.7 and Qwen3.6 justify the cost?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a solo developer building a startup: probably not. Qwen3.6 autonomously resolves 73.4% of the real GitHub issues in SWE-bench Verified. That's more than enough for daily coding work.&lt;/p&gt;

&lt;p&gt;For a large enterprise with strict compliance requirements and deep pockets: maybe. The convenience and Anthropic's enterprise features have real value.&lt;/p&gt;

&lt;p&gt;For anyone processing sensitive code: local wins by default. No amount of ToS promises equals "the data literally never left my hardware."&lt;/p&gt;

&lt;h2&gt;
  
  
  how to try both and decide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API key from anthropic.com
Model: claude-opus-4-7
$5/M input, $25/M output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen3.6 locally:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or for a complete setup with a coding agent, vision, and tool calling — &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; v2.3.3 supports both. Connect Anthropic's API for Opus 4.7 when you need it, run Qwen3.6 locally for everything else. Switch between them in the same interface. Best of both worlds.&lt;/p&gt;
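
&lt;p&gt;If you'd rather wire that comparison into your own tooling, Ollama also exposes an OpenAI-compatible endpoint on the same port, so most clients built for hosted APIs can be pointed at the local model instead. A minimal sketch (no API key needed for the local endpoint):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3.6:35b-a3b",
  "messages": [{"role": "user", "content": "Review this function for off-by-one errors."}]
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;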

&lt;h2&gt;
  
  
  where this is heading
&lt;/h2&gt;

&lt;p&gt;The pattern is clear. Every 3-4 months, a new open model appears that matches the paid frontier model from 6 months ago. The cost of "good enough" is trending toward zero.&lt;/p&gt;

&lt;p&gt;Anthropic, OpenAI, and Google will keep pushing the frontier. Open models will keep closing the gap. And the developers in the middle will increasingly ask: "Is the remaining gap worth $25 per million tokens?"&lt;/p&gt;

&lt;p&gt;Today, for most coding tasks, the answer is already no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for running AI locally. Supports cloud APIs AND local models. Chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>claude opus 4.7 just dropped. here's what runs locally for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:13:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</link>
      <guid>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</guid>
      <description>&lt;p&gt;Anthropic just released Claude Opus 4.7. It's their best coding model yet — 13% better than Opus 4.6 on their internal 93-task benchmark, better vision, stronger at long-running agentic tasks.&lt;/p&gt;

&lt;p&gt;It's also $5 per million input tokens and $25 per million output tokens. API only. Every character you type goes through Anthropic's servers.&lt;/p&gt;

&lt;p&gt;Let's talk about what you can do locally for $0.&lt;/p&gt;

&lt;h2&gt;
  
  
  what opus 4.7 actually brings
&lt;/h2&gt;

&lt;p&gt;Based on Anthropic's announcement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;13% improvement&lt;/strong&gt; over Opus 4.6 on a 93-task coding benchmark, including 4 tasks neither Opus 4.6 nor Sonnet 4.6 could solve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better vision&lt;/strong&gt; — higher resolution image understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stronger agentic workflows&lt;/strong&gt; — handles complex, multi-step tasks without losing context or stopping early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-verification&lt;/strong&gt; — the model checks its own outputs before reporting back&lt;/li&gt;
&lt;li&gt;Available on Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real improvements. Opus has been the go-to for serious coding work, and 4.7 makes it better.&lt;/p&gt;

&lt;p&gt;But here's the thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cost of frontier cloud AI
&lt;/h2&gt;

&lt;p&gt;At $5/$25 per million tokens, a heavy coding session with Opus 4.7 can easily run $2-5/day. A team of developers using it as their primary coding agent? That's hundreds per month.&lt;/p&gt;

&lt;p&gt;And every line of your proprietary code flows through someone else's infrastructure. Every prompt, every codebase context, every business logic snippet — stored, processed, potentially used for training (even with opt-outs, you're trusting the provider).&lt;/p&gt;

&lt;p&gt;For hobby projects, fine. For anything sensitive — financial code, healthcare logic, proprietary algorithms — that's a real risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  what runs locally right now
&lt;/h2&gt;

&lt;p&gt;The local model landscape has changed dramatically in the last few months. Here's what's available today at $0/month:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; (released this week)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;35B total parameters, 3B active (MoE architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — autonomous bug fixing on real GitHub repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding&lt;/li&gt;
&lt;li&gt;Built-in vision, 262K context&lt;/li&gt;
&lt;li&gt;Runs on &lt;strong&gt;8 GB VRAM&lt;/strong&gt; with Q4_K_M quantization&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is it as good as Opus 4.7? On raw capability, probably not — Anthropic has massive compute advantages. But on the tasks most developers actually do daily (fixing bugs, writing functions, understanding codebases, code review), Qwen3.6 is genuinely competitive. And it runs on hardware you already own.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real comparison isn't benchmarks
&lt;/h2&gt;

&lt;p&gt;It's this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$5/$25 per million tokens&lt;/td&gt;
&lt;td&gt;$0 forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Cloud-processed&lt;/td&gt;
&lt;td&gt;Never leaves your machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Subject to API congestion&lt;/td&gt;
&lt;td&gt;As fast as your GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;Depends on Anthropic's uptime&lt;/td&gt;
&lt;td&gt;Runs offline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;Anthropic's policy&lt;/td&gt;
&lt;td&gt;You control everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic coding&lt;/td&gt;
&lt;td&gt;Yes (strong)&lt;/td&gt;
&lt;td&gt;Yes (73.4 SWE-bench)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;API key + credit card&lt;/td&gt;
&lt;td&gt;Ollama + 10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  how to set up the local alternative
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Or if you want a full desktop experience with a coding agent, vision support, and model management:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with Qwen3.6 day-0 support. It wraps Ollama into a desktop app with a built-in coding agent that streams live between tool calls, agent mode with 13 tools and MCP integration, and remote access from your phone. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  when cloud still makes sense
&lt;/h2&gt;

&lt;p&gt;Being honest: there are cases where Opus 4.7 is worth the money.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need the absolute frontier of capability and $25/M output tokens is pocket change for your use case&lt;/li&gt;
&lt;li&gt;You're doing something that requires Anthropic's specific safety features&lt;/li&gt;
&lt;li&gt;You need the model to handle tasks that are genuinely beyond what open models can do today&lt;/li&gt;
&lt;li&gt;You don't have a GPU (though even a laptop with 8GB VRAM works for Qwen3.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyone else — the gap between cloud and local is closing fast. A model that scores 73.4 on SWE-bench running on a gaming laptop would have been science fiction two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  the trajectory matters more than today's snapshot
&lt;/h2&gt;

&lt;p&gt;Every few months, a new open model drops that would have been frontier-class the year before. The pricing gap between cloud and local is structural: cloud will always bill per token, while local costs nothing beyond the hardware and the electricity to run it.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is impressive. But the question isn't whether it's good — it's whether it's $5/$25 per million tokens better than what you can run yourself.&lt;/p&gt;

&lt;p&gt;For a growing number of developers, the answer is no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. Qwen3.6 day-0 support. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>i cancelled my AI subscriptions. qwen3.6 on my own GPU does the same thing for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:56:16 +0000</pubDate>
      <link>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</link>
      <guid>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</guid>
      <description>&lt;p&gt;You're paying $20/month for ChatGPT. $10 for Copilot. Maybe another $20 for Midjourney. And every prompt you type goes through someone else's server.&lt;/p&gt;

&lt;p&gt;Meanwhile, Alibaba just open-sourced a model that scores 73.4 on SWE-bench Verified — the benchmark where an AI autonomously reads a GitHub issue, understands the codebase, writes a fix, and runs the tests. That's frontier-level coding ability. And it runs on your gaming laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  the model
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B. It's a Mixture-of-Experts model: 35 billion parameters total, but only 3 billion active per token. For each token, the router activates 9 experts (8 routed + 1 shared) out of 256; the other 247 sit in memory but do no work.&lt;/p&gt;

&lt;p&gt;Result: it runs like a 3B model but thinks like a 30B+ model.&lt;/p&gt;

&lt;p&gt;Apache 2.0 license. No usage restrictions. No rate limits. No one reading your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  what your $0/month gets you
&lt;/h2&gt;

&lt;p&gt;Let's do the math on what you're replacing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT Plus ($20/month)&lt;/strong&gt; — Qwen3.6 scores 86.0 on GPQA Diamond (graduate-level reasoning), 83.6 on HMMT (Harvard-MIT Math Tournament), and handles 119 languages. It has vision built in — drag an image into the chat and ask questions about it. For most daily tasks, you won't notice a difference. For coding tasks, this model is arguably better than GPT-4 for the stuff you actually do (fixing bugs, writing functions, understanding codebases).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot ($10/month)&lt;/strong&gt; — 73.4 on SWE-bench means this model can autonomously fix real bugs in real repositories. 51.5 on Terminal-Bench means it can operate a terminal to solve coding tasks. With the right frontend, it functions as a full coding agent, not just autocomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud API costs&lt;/strong&gt; — no per-token pricing. Run it 24/7 on your own hardware. The model doesn't get slower during peak hours. It doesn't have outages. It doesn't change its behavior because the provider decided to add more safety filters.&lt;/p&gt;
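
&lt;p&gt;A practical bonus: Ollama also exposes an OpenAI-compatible endpoint, so most scripts and tools written against the paid APIs can be pointed at the local server instead. A hedged sketch, assuming the default port and the model tag used in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: reuse OpenAI-style client code against a local Ollama server.
# The api_key is ignored by a local instance but the client requires one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="qwen3.6:35b-a3b",  # hypothetical local tag from this post
    messages=[{"role": "user", "content": "Write a Python function that parses ISO 8601 dates."}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;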

&lt;h2&gt;
  
  
  the hardware you already own is enough
&lt;/h2&gt;

&lt;p&gt;This is the part that surprises people. With Q4_K_M quantization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 GB VRAM&lt;/strong&gt; (RTX 3060, RTX 4060): runs at 30+ tokens/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12-14 GB VRAM&lt;/strong&gt; (RTX 4070, RTX 3090): Q8 quantization, 20+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon M1/M2/M3&lt;/strong&gt;: runs great on unified memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you bought a GPU in the last 3-4 years, you probably have enough. The MoE architecture is the key — your GPU only processes 3B parameters per token regardless of the total model size.&lt;/p&gt;

&lt;h2&gt;
  
  
  the catch (being honest)
&lt;/h2&gt;

&lt;p&gt;There are trade-offs. You should know them before you cancel anything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No real-time internet access&lt;/strong&gt; — the model only knows what it was trained on. No "search the web" or "check the latest docs." You need to paste context manually or use RAG.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setup isn't zero&lt;/strong&gt; — you need Ollama or a similar runtime, and a frontend. It's not "open a browser tab and start typing." More like 10-15 minutes to set up if you've never done it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long context costs more locally&lt;/strong&gt; — 262K native context is great on paper, but processing 100K+ tokens on consumer hardware gets slow. Cloud APIs hide this cost from you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No multimodal generation&lt;/strong&gt; — Qwen3.6 can understand images (vision input) but can't generate them. For image generation you need a separate model (Stable Diffusion, Flux, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Updates are manual&lt;/strong&gt; — when a better model drops, you download and switch yourself. No silent upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For people who type "write me a poem" into ChatGPT twice a week, this is overkill. For developers, researchers, and anyone processing sensitive data — the trade-offs are overwhelmingly in favor of local.&lt;/p&gt;

&lt;h2&gt;
  
  
  the stack that replaces everything
&lt;/h2&gt;

&lt;p&gt;Here's what a complete local setup looks like in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat + reasoning&lt;/strong&gt;: Qwen3.6-35B-A3B (this article)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image generation&lt;/strong&gt;: Stable Diffusion 3.5, Flux, or SDXL via ComfyUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video generation&lt;/strong&gt;: Wan 2.1, FramePack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code completion&lt;/strong&gt;: same Qwen3.6, connected as a coding agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-text&lt;/strong&gt;: Whisper (runs on CPU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total cost after hardware you already own: $0/month. Forever.&lt;/p&gt;
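
&lt;p&gt;To make that concrete, here is a rough sketch of two pieces of the stack chained together: Whisper transcribes audio on CPU, then the transcript goes to the chat model through Ollama. The file name and model tag are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch of a local pipeline: speech-to-text with Whisper, summary with the chat model.
# Assumes `pip install openai-whisper requests`, ffmpeg on PATH, and an Ollama server on the default port.
import whisper
import requests

stt = whisper.load_model("base")                      # small Whisper model, fine on CPU
transcript = stt.transcribe("meeting.mp3")["text"]    # placeholder file name

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.6:35b-a3b",  # hypothetical tag from this post
        "prompt": "Summarise this meeting transcript in five bullet points:\n" + transcript,
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;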

&lt;p&gt;Or use a tool that bundles all of this. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; wraps Ollama + ComfyUI into one desktop app — chat, image gen, video gen, coding agent. v2.3.3 has Qwen3.6 day-0 support with vision and a full agent mode. AGPL-3.0, open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real question
&lt;/h2&gt;

&lt;p&gt;It's not "is local AI good enough yet?" — it passed that threshold months ago.&lt;/p&gt;

&lt;p&gt;The real question is: how much longer are you going to pay monthly fees to send your data to someone else's server when the same capability runs on hardware sitting under your desk?&lt;/p&gt;

&lt;p&gt;Qwen3.6 weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. No cloud, no subscription. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>qwen3.6 scores 73.4 on SWE-bench with only 3B active parameters. here's why that matters.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:43:39 +0000</pubDate>
      <link>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</link>
      <guid>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</guid>
      <description>&lt;p&gt;Alibaba just mass-released Qwen3.6 and the first model is already turning heads. Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35 billion total parameters — but only 3 billion are active at inference time.&lt;/p&gt;

&lt;p&gt;That means it runs on an 8GB GPU. And it just scored 73.4 on SWE-bench Verified.&lt;/p&gt;

&lt;p&gt;For context, Gemma4-31B — a dense model using all 31 billion parameters for every single token — scores 17.4 on the same benchmark. Qwen3.6 uses a tenth of the compute and scores four times higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture is genuinely different
&lt;/h2&gt;

&lt;p&gt;Most MoE models just slap a router on top of a standard transformer. Qwen3.6 does something more interesting.&lt;/p&gt;

&lt;p&gt;Three out of every four layers use &lt;strong&gt;Gated DeltaNet&lt;/strong&gt; — a linear attention mechanism that's significantly cheaper than standard attention. Only every fourth layer uses full Gated Attention with KV cache. This hybrid layout means you get near-full-attention quality at a fraction of the memory cost, especially on long contexts.&lt;/p&gt;

&lt;p&gt;The expert setup: 256 total experts, 8 routed + 1 shared active per token. That's where the 35B→3B compression comes from. Each token only touches the experts it needs.&lt;/p&gt;
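
&lt;p&gt;A toy sketch of what that routing looks like per token, using the numbers from this post. The real router and experts are learned weight matrices; this just shows why only a sliver of the weights gets computed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy illustration of MoE routing: for each token, a router scores 256 experts,
# the top 8 plus one always-on shared expert do the work, the other 247 are skipped.
# Purely illustrative; shapes and weights are random stand-ins.
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 64
rng = np.random.default_rng(0)

router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))
experts = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))  # tiny stand-in experts
shared_expert = rng.standard_normal((HIDDEN, HIDDEN))

def moe_layer(token_vec):
    scores = token_vec @ router_w                            # router scores per expert
    top = np.argsort(scores)[-TOP_K:]                        # indices of the 8 routed experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the winners
    out = token_vec @ shared_expert                          # shared expert always runs
    for w, idx in zip(gate, top):
        out = out + w * (token_vec @ experts[idx])           # only 8 of 256 experts computed
    return out

y = moe_layer(rng.standard_normal(HIDDEN))
print(y.shape)  # (64,) -- same output shape, roughly 3.5% of the expert weights touched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;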

&lt;p&gt;And it has &lt;strong&gt;vision built in&lt;/strong&gt;. Not bolted on — the model is natively multimodal (Image-Text-to-Text). MMMU score of 81.7, RealWorldQA at 85.3.&lt;/p&gt;

&lt;h2&gt;
  
  
  the benchmarks that matter
&lt;/h2&gt;

&lt;p&gt;I'm not going to dump every number. Here are the ones that actually tell you something:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench Verified: 73.4&lt;/strong&gt; — this is the "can you autonomously fix real GitHub issues" test. The model reads the issue, understands the codebase, writes a fix, and runs the tests. 73.4 means it successfully fixes nearly three out of four real-world bugs thrown at it. Its predecessor (Qwen3.5-35B-A3B) scored 70.0. Gemma4-31B scored 17.4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0: 51.5&lt;/strong&gt; — agentic terminal coding. Can the model operate a terminal to solve coding tasks? Qwen3.6 beats its predecessor (40.5), the dense Qwen3.5-27B (41.6), and Gemma4-31B (42.9). An 11-point jump over the previous version is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QwenWebBench: 1397 Elo&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. A 400+ Elo jump in one generation. For chess players: that's going from a club player to a titled player.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPQA Diamond: 86.0&lt;/strong&gt; — graduate-level science reasoning. This is the benchmark where PhD students in physics, chemistry, and biology try to answer questions outside their subfield and fail about half the time. 86.0 is competitive with models many times this size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCPMark: 37.0&lt;/strong&gt; — general agent benchmark testing MCP (Model Context Protocol) tool use. Predecessor scored 27.0. Gemma4-31B scored 36.3. This model was clearly trained with agentic tool calling in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  what 3B active parameters actually means for your hardware
&lt;/h2&gt;

&lt;p&gt;Here's the thing people keep getting wrong about MoE models. The total parameter count (35B) determines the model's knowledge capacity — how much it "knows." The active parameter count (3B) determines how much compute each token costs, which is what sets your tokens-per-second.&lt;/p&gt;

&lt;p&gt;So while the model file is large (it contains all 256 experts) and still has to fit in memory, at inference time only the 9 active experts per token are actually computed. The rest sit in memory doing nothing until the router picks them.&lt;/p&gt;

&lt;p&gt;Practical VRAM requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M quantized: ~6-8 GB&lt;/strong&gt; — runs on an RTX 3060 12GB at 30+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0 quantized: ~12-14 GB&lt;/strong&gt; — RTX 4070 territory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP8 official: ~35 GB&lt;/strong&gt; — RTX 4090 or A6000&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 full: ~70 GB&lt;/strong&gt; — multi-GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can run a 7B model, you can run this. The speed profile is similar to a 3B dense model, but the output quality is closer to a 30B+ dense model.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real competition
&lt;/h2&gt;

&lt;p&gt;The model Qwen3.6 is really competing against isn't Gemma4-31B. It's proprietary models.&lt;/p&gt;

&lt;p&gt;73.4 on SWE-bench Verified puts it in the same ballpark as frontier closed-source models — except this one is Apache 2.0 licensed, runs on consumer hardware, and never sends your code to anyone's server.&lt;/p&gt;

&lt;p&gt;For coding specifically, the combination of high SWE-bench scores + strong terminal/agent capabilities + MCP support makes this arguably the best local coding model per compute dollar right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  how to actually run it
&lt;/h2&gt;

&lt;p&gt;The model just dropped so GGUF quantizations are still rolling out. Check HuggingFace for the latest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FP8 variant: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B-FP8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once GGUFs land, &lt;code&gt;ollama run qwen3.6:35b-a3b&lt;/code&gt; should work.&lt;/p&gt;

&lt;p&gt;For a full desktop setup with model management, vision support, and a built-in coding agent, &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with day-0 Qwen3.6 support. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  the bottom line
&lt;/h2&gt;

&lt;p&gt;3B active parameters scoring 73.4 on SWE-bench is the kind of efficiency gain that changes what's possible on consumer hardware. A year ago you needed a 70B+ dense model or API access for this level of coding capability. Now it runs on a gaming laptop.&lt;/p&gt;

&lt;p&gt;Apache 2.0. No strings attached.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally — chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:20:08 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</guid>
      <description>&lt;p&gt;Qwen just released &lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.&lt;/p&gt;

&lt;p&gt;Translation: big-model quality at small-model speed. And this time it has vision built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this model matters
&lt;/h2&gt;

&lt;p&gt;The numbers speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1397 Elo on QwenWebBench&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;86.0 on GPQA Diamond&lt;/strong&gt; — graduate-level science reasoning. Competitive with models many times its size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision support&lt;/strong&gt; — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full benchmark picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;th&gt;Qwen3.5-35B-A3B&lt;/th&gt;
&lt;th&gt;Gemma4-31B&lt;/th&gt;
&lt;th&gt;Qwen3.5-27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;17.4&lt;/td&gt;
&lt;td&gt;51.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.5&lt;/td&gt;
&lt;td&gt;42.9&lt;/td&gt;
&lt;td&gt;41.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Multilingual&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.2&lt;/td&gt;
&lt;td&gt;69.3&lt;/td&gt;
&lt;td&gt;60.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QwenWebBench (Elo)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1397&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;978&lt;/td&gt;
&lt;td&gt;1178&lt;/td&gt;
&lt;td&gt;1197&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;27.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCPMark&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;36.3&lt;/td&gt;
&lt;td&gt;15.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.2&lt;/td&gt;
&lt;td&gt;84.3&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU&lt;/td&gt;
&lt;td&gt;81.7&lt;/td&gt;
&lt;td&gt;81.4&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gated DeltaNet attention&lt;/strong&gt; — 3 out of every 4 layers use linear attention (Gated DeltaNet) instead of standard attention. Only every 4th layer uses full Gated Attention. This makes it much more memory-efficient for long contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;256 experts, 9 active&lt;/strong&gt; — 8 routed + 1 shared expert active per token. That's where the "35B total, 3B active" comes from. Most of the model sits idle while only the relevant experts fire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision encoder built in&lt;/strong&gt; — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking Preservation&lt;/strong&gt; — new feature that retains reasoning context from previous messages. Less overhead for iterative coding sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;262K native context&lt;/strong&gt; — extensible beyond that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0 license&lt;/strong&gt; — fully open, commercial use allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware requirements
&lt;/h2&gt;

&lt;p&gt;The beauty of MoE: per-token compute only touches the active parameters, not the total count, so it runs at the speed of a 3B model even though the full set of experts is much larger.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Expected speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M quant&lt;/td&gt;
&lt;td&gt;~6-8 GB&lt;/td&gt;
&lt;td&gt;30+ tok/s on RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0 quant&lt;/td&gt;
&lt;td&gt;~12-14 GB&lt;/td&gt;
&lt;td&gt;20+ tok/s on RTX 4070&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8 (official)&lt;/td&gt;
&lt;td&gt;~35 GB&lt;/td&gt;
&lt;td&gt;RTX 4090 or A6000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 full&lt;/td&gt;
&lt;td&gt;~70 GB&lt;/td&gt;
&lt;td&gt;Multi-GPU setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you can run a 7B model, you can run this. The 3B active parameter count is the number that matters for speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Ollama (easiest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for GGUFs to appear — usually within hours of release. Check &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; for the latest quantized versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm
vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
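
&lt;p&gt;vLLM serves an OpenAI-compatible API on port 8000 by default, so once the command above is running you can query it like any hosted endpoint. A minimal sketch; the prompt is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Query the vLLM server started above. It speaks the OpenAI chat-completions format
# on port 8000 by default; no API key is needed for a local instance.
import requests

r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": "Write a unit test for a function that reverses a string."}],
    },
    timeout=300,
).json()
print(r["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;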



&lt;h3&gt;
  
  
  Option 3: Transformers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
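
&lt;p&gt;To actually generate text, the usual chat-template flow applies. A short sketch that continues from the snippet above (text-only; image input would go through the model's processor, which isn't shown here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Continues from the loading snippet above: build a chat prompt, generate, decode.
messages = [{"role": "user", "content": "Write a function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;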



&lt;h3&gt;
  
  
  Option 4: Locally Uncensored (full GUI + model management)
&lt;/h3&gt;

&lt;p&gt;If you want a clean desktop app that handles downloading, model management, and chatting in one place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grab &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — it's open source (AGPL-3.0)&lt;/li&gt;
&lt;li&gt;v2.3.3 just shipped with day-0 Qwen3.6 support&lt;/li&gt;
&lt;li&gt;Download the model directly from the app, pick your quantization, and start chatting&lt;/li&gt;
&lt;li&gt;Vision works out of the box — drag and drop images into the chat&lt;/li&gt;
&lt;li&gt;The new Codex mode with live streaming is particularly nice for coding tasks with this model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should care
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local AI coders&lt;/strong&gt; — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-focused devs&lt;/strong&gt; — Apache 2.0, runs on consumer hardware, no data leaves your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal users&lt;/strong&gt; — built-in vision means one model for text AND image understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone running Qwen3.5-35B-A3B&lt;/strong&gt; — this is a straight upgrade. Same architecture class, better everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.&lt;/p&gt;

&lt;p&gt;Weights are on &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. FP8 variant &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;here&lt;/a&gt;. GGUFs incoming.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Run GLM 4.7 Flash Locally with Ollama — 30B Quality at 3B Speed</title>
      <dc:creator>David </dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:28:13 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</link>
      <guid>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</guid>
      <description>&lt;p&gt;ZhipuAI quietly dropped GLM 4.7 Flash and it's been blowing up — 830K+ downloads on HuggingFace, 1,600+ likes. The pitch: 30B-parameter MoE model with only 3B active parameters per token. Translation: you get 30B-class quality at the speed and VRAM cost of a 3B model.&lt;/p&gt;

&lt;p&gt;The benchmarks back it up. AIME 25: 91.6% (on par with GPT-class models). SWE-bench Verified: 59.2% (nearly 3x Qwen3-30B-A3B). And it's MIT licensed — commercial use, fine-tuning, whatever you want.&lt;/p&gt;

&lt;p&gt;I've been building a local AI desktop app (&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;) and just added GLM 4.7 support. Here's how to run it locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install GLM 4.7 Flash with Ollama
&lt;/h2&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Ollama handles the download and quantization. Default is Q4_K_M which gives you the best quality-to-size ratio.&lt;/p&gt;

&lt;p&gt;If you want a specific quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7:q4_k_m    &lt;span class="c"&gt;# ~5 GB, recommended&lt;/span&gt;
ollama run glm4.7:q8_0      &lt;span class="c"&gt;# ~10 GB, higher quality&lt;/span&gt;
ollama run glm4.7:q2_k      &lt;span class="c"&gt;# ~3 GB, if VRAM is tight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why GLM 4.7 Flash Matters
&lt;/h2&gt;

&lt;p&gt;The MoE (Mixture of Experts) architecture is the key. The model has 30B total parameters but only activates 3B per token. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Token generation is fast — comparable to running a 3B dense model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt;: Only needs 6-8 GB for Q4 quantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: Reasoning and coding performance matches models 10x its active size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how it compares:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GLM 4.7 Flash (30B-A3B)&lt;/th&gt;
&lt;th&gt;Qwen3-30B-A3B&lt;/th&gt;
&lt;th&gt;GPT-OSS-20B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AIME 25&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;91.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA&lt;/td&gt;
&lt;td&gt;75.2&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;71.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;59.2&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;34.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;τ²-Bench (agentic)&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;49.0&lt;/td&gt;
&lt;td&gt;47.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowseComp&lt;/td&gt;
&lt;td&gt;42.8&lt;/td&gt;
&lt;td&gt;2.29&lt;/td&gt;
&lt;td&gt;28.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agentic benchmarks are insane. τ²-Bench at 79.5 vs Qwen3's 49.0 — that's not a marginal improvement, that's a different league. This model was built for tool calling and multi-step reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  VRAM Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q2_K&lt;/strong&gt;: ~3-4 GB VRAM (or CPU-only with 8 GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt;: 6-8 GB VRAM — the sweet spot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0&lt;/strong&gt;: 10-12 GB VRAM — if you have the room&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16&lt;/strong&gt;: 20+ GB — only for research/fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a GTX 1660 (6 GB) or better, Q4_K_M runs comfortably. On Apple Silicon with 16 GB unified memory, it flies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Mode with GLM 4.7
&lt;/h2&gt;

&lt;p&gt;This is where GLM 4.7 really shines. The model was specifically optimized for agentic tasks — it has a "Preserved Thinking" mode that keeps chain-of-thought reasoning active across multi-turn tool interactions.&lt;/p&gt;

&lt;p&gt;In practice: you give it a tool (web search, file read, code execution) and it actually uses it intelligently. The 59.2% SWE-bench score means it can navigate real codebases, understand context, and produce working patches — not just toy completions.&lt;/p&gt;
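
&lt;p&gt;At the API level, that tool loop looks roughly like this: you describe a tool as a JSON schema, the model decides whether to call it, and your code runs the call and feeds the result back. A hedged sketch against Ollama's chat endpoint; whether a given model tag actually emits tool calls depends on the model itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a single tool-call round trip through Ollama's chat API.
# The tool name and question are illustrative; a real agent loop would execute the
# requested call and send the result back to the model as a "tool" message.
import requests

read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file from the local project",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "glm4.7",
        "messages": [{"role": "user", "content": "What does config.yaml set the timeout to?"}],
        "tools": [read_file_tool],
        "stream": False,
    },
    timeout=300,
).json()

# If the model chose to call the tool, the call shows up on the message.
print(r["message"].get("tool_calls", r["message"]["content"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;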

&lt;p&gt;In Locally Uncensored, GLM 4.7 is auto-detected as an agent-capable model. Enable Agent mode in the UI and it gets access to web search, file operations, and code execution out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  GLM 4.7 vs the Competition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;vs Qwen3-30B-A3B&lt;/strong&gt;: Same architecture class (30B MoE, 3B active) but GLM 4.7 dominates on agentic and coding tasks. Qwen3 is better at pure math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Gemma 4 E4B&lt;/strong&gt;: Gemma 4 is smaller (4.5B effective) and faster, but GLM 4.7 has significantly better reasoning depth. If you need an agent that can handle complex multi-step tasks, GLM 4.7 wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Llama 3.3 70B&lt;/strong&gt;: Llama needs 3-4x the VRAM for similar coding performance. GLM 4.7 is the efficiency play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the Catch?
&lt;/h2&gt;

&lt;p&gt;Honestly, not much. A few things worth knowing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chinese-English bilingual&lt;/strong&gt; — Trained on both, works great in both. If you only need English, it's still excellent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt; — Supports up to 128K tokens. More than enough for most use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT license&lt;/strong&gt; — Fully open. No restrictions on commercial use, modification, or redistribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main caveat: if you want vision/multimodal, GLM 4.7 Flash is text-only. Look at GLM-4V or Gemma 4 for image input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you want a full desktop UI with agent mode, image gen, and A/B model comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — free, open source, AGPL-3.0. Single .exe/.AppImage, no Docker needed. GLM 4.7 is in the recommended models list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running GLM 4.7 on your hardware? I'd love to hear your tok/s numbers and use case. Drop a comment.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 licensed. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>ollama</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
