<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David </title>
    <description>The latest articles on DEV Community by David  (@purpledoubled).</description>
    <link>https://dev.to/purpledoubled</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802440%2Fbd0118a6-e9df-4efa-965a-8f8f9c2ef510.png</url>
      <title>DEV Community: David </title>
      <link>https://dev.to/purpledoubled</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/purpledoubled"/>
    <language>en</language>
    <item>
      <title>Abliterated Models Guide - Qwen 3.6, Gemma 4 Heretic, Llama 3.1 Uncensored Download Links</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:58:02 +0000</pubDate>
      <link>https://dev.to/purpledoubled/abliterated-models-guide-qwen-36-gemma-4-heretic-llama-31-uncensored-download-links-1f4e</link>
      <guid>https://dev.to/purpledoubled/abliterated-models-guide-qwen-36-gemma-4-heretic-llama-31-uncensored-download-links-1f4e</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/abliterated-models-guide.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you've looked at the Discover tab in any local-AI app and wondered why some Llama variants have &lt;em&gt;abliterated&lt;/em&gt; in the name, this is the post that explains it. Plus the curated download list for 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Abliteration Actually Is
&lt;/h2&gt;

&lt;p&gt;Modern instruction-tuned LLMs have a learned &lt;strong&gt;refusal direction&lt;/strong&gt; in their residual stream. When a prompt activates that direction strongly enough, the model outputs "I cannot help with that." The direction was put there during RLHF.&lt;/p&gt;

&lt;p&gt;Abliteration removes it via &lt;strong&gt;orthogonalisation&lt;/strong&gt;. You take a corpus of refused prompts, isolate the activation direction that distinguishes them from accepted prompts, then project that direction out of the weight matrices that write into the residual stream. The result is a model with the same training and essentially the same capabilities, but one that is no longer prone to categorical refusal.&lt;/p&gt;

&lt;p&gt;It's a clean technique - not a finetune, not a jailbreak, not a system-prompt trick. Original paper: &lt;em&gt;"Refusal in Language Models Is Mediated by a Single Direction"&lt;/em&gt; (Arditi et al., 2024).&lt;/p&gt;
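
&lt;p&gt;Concretely - a sketch in the paper's terms, with my own notation: if &lt;code&gt;r&lt;/code&gt; is the unit-norm refusal direction extracted from the refused-vs-accepted activation difference, every matrix &lt;code&gt;W&lt;/code&gt; that writes into the residual stream is replaced with &lt;code&gt;W' = W - r r^T W&lt;/code&gt;. The model can then no longer write any output component along &lt;code&gt;r&lt;/code&gt;, while everything orthogonal to &lt;code&gt;r&lt;/code&gt; is untouched - which is why the quality hit in the table below is so small.&lt;/p&gt;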

&lt;h2&gt;
  
  
  Abliterated vs Other Uncensored Approaches
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;th&gt;Quality impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Abliteration&lt;/td&gt;
&lt;td&gt;Project out refusal direction&lt;/td&gt;
&lt;td&gt;hours on GPU&lt;/td&gt;
&lt;td&gt;1-3% degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full finetune (Dolphin, Hermes)&lt;/td&gt;
&lt;td&gt;Re-train on uncensored corpus&lt;/td&gt;
&lt;td&gt;days, expensive&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA finetune&lt;/td&gt;
&lt;td&gt;Adapter on uncensored data&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;td&gt;Minor, reversible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge (Frankenmerges)&lt;/td&gt;
&lt;td&gt;Combine multiple finetunes&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;td&gt;Highly variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt jailbreak&lt;/td&gt;
&lt;td&gt;Persona-style instructions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Brittle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Abliteration is the cleanest research-grounded option. Dolphin and Hermes are battle-tested production finetunes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Abliterated Models (2026)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Qwen 3.6 Family
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;richardyoung/qwen3-14b-abliterated:q4_K_M&lt;/strong&gt; - 9 GB, fits 12 GB VRAM, vision-capable. Comes in &lt;code&gt;:q4_K_M&lt;/code&gt; (chat) and &lt;code&gt;:agent&lt;/code&gt; (tool-calling) tags via Ollama.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 3.6 27B Samantha (huihui-ai variant)&lt;/strong&gt; - abliterated dense 27B with the Samantha personality finetune.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gemma 4 Heretic
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stabhappy/gemma-4-31B-it-heretic-Gguf&lt;/strong&gt; - Gemma 4 31B base abliterated. ~17 GB at Q4_K_M. Native vision, tool calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B MoE HERETIC&lt;/strong&gt; - 26B brain with 4B active. Smaller VRAM peak, MoE-fast inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Llama 3.1 Family
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mannix/llama3.1-8b-abliterated:q5_K_M&lt;/strong&gt; - 5.7 GB. The most-pulled abliterated Llama on Ollama. Comes with &lt;code&gt;:agent&lt;/code&gt; tag for tool calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated&lt;/strong&gt; - the canonical reference variant.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hermes 3
&lt;/h3&gt;

&lt;p&gt;Hermes 3 is technically a full finetune, not abliteration, but functions similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;hermes3:8b&lt;/strong&gt; via Ollama - 4.7 GB, fits 8 GB GPUs. Good chat default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hermes3:70b&lt;/strong&gt; - 40 GB, needs 48 GB VRAM or aggressive quantisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GLM 5.1 Heretic
&lt;/h3&gt;

&lt;p&gt;The newest entrant: &lt;strong&gt;huihui-ai/Huihui-GLM-5.1-abliterated-GGUF&lt;/strong&gt;. The 754B MoE GLM 5.1 abliterated. 236 GB at IQ2_M - not consumer hardware, but if you have a Mac Studio M4 Ultra, it's the strongest open abliterated model, period.&lt;/p&gt;
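
&lt;p&gt;As far as I know there's no single Ollama tag for something this size, so the usual route is pulling the GGUF shards straight from Hugging Face. A minimal sketch with &lt;code&gt;huggingface-cli&lt;/code&gt; - the repo name is from above, while the include pattern and target folder are just illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install -U "huggingface_hub[cli]"
huggingface-cli download huihui-ai/Huihui-GLM-5.1-abliterated-GGUF --include "*IQ2_M*" --local-dir ./glm-5.1-heretic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;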

&lt;h2&gt;
  
  
  How to Download and Run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path 1 - Ollama (one command)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull richardyoung/qwen3-14b-abliterated:q4_K_M
ollama run richardyoung/qwen3-14b-abliterated:q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Path 2 - Locally Uncensored (one click)
&lt;/h3&gt;

&lt;p&gt;Open &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, navigate to &lt;strong&gt;Model Manager &amp;gt; Discover &amp;gt; Text&lt;/strong&gt;, click the &lt;strong&gt;UNCENSORED&lt;/strong&gt; filter tab. The 34 curated abliterated GGUFs are all there with one-click download.&lt;/p&gt;

&lt;p&gt;The new &lt;a href="https://locallyuncensored.com/blog/locally-uncensored-v2-4-0-release.html" rel="noopener noreferrer"&gt;v2.4.0 Settings &amp;gt; Model Storage&lt;/a&gt; override lets you redirect the GGUF download folder if you want them on a separate drive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Best Abliterated Pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;Llama 3.1 8B abliterated Q4_K_M&lt;/td&gt;
&lt;td&gt;Fits with headroom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 GB (RTX 3060)&lt;/td&gt;
&lt;td&gt;Qwen 3 14B abliterated Q4_K_M&lt;/td&gt;
&lt;td&gt;Sweet spot, ~15 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Heretic Q4_K_M&lt;/td&gt;
&lt;td&gt;Best general-purpose at this VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB (RTX 3090/4090)&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Heretic Q5_K_M&lt;/td&gt;
&lt;td&gt;Higher quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48 GB+&lt;/td&gt;
&lt;td&gt;Hermes 3 70B or GLM 5.1 Heretic IQ2&lt;/td&gt;
&lt;td&gt;Frontier-tier quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Common Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Will an abliterated model write me malware?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Probably not the way you're thinking. Abliteration removes the categorical refusal but the model still has training-time priors against obviously-bad outputs. The models work best for legitimate-but-edge-case use cases: security research, fiction with violence, medical questions the base model deflects, legal grey areas, adult creative writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are abliterated models dangerous?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No more dangerous than the underlying base model. Abliteration removes a layer of guardrails; the model's knowledge itself is unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I abliterate a model myself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. The technique is well-documented and the code is on GitHub (search &lt;em&gt;abliterator&lt;/em&gt;). You need a GPU with the model loaded, a few thousand refused-vs-accepted prompt pairs, and a few hours.&lt;/p&gt;
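
&lt;p&gt;The exact invocation depends on which repo you pick, so the following is a purely hypothetical sketch of the shape of the workflow - the repo URL, script name, and flags are made up, not taken from any specific project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# hypothetical - substitute the real repo and its documented flags&lt;/span&gt;
git clone https://github.com/someone/some-abliterator
cd some-abliterator
pip install -r requirements.txt
python abliterate.py --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --refused refused_prompts.txt --accepted accepted_prompts.txt \
  --output ./llama3.1-8b-abliterated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;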




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>opensource</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Run Qwen 3.6 Locally - 27B Dense, 35B MoE, and Coding Variants Setup Guide</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:58:02 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen-36-locally-27b-dense-35b-moe-and-coding-variants-setup-guide-4di</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen-36-locally-27b-dense-35b-moe-and-coding-variants-setup-guide-4di</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/how-to-run-qwen-3-6-locally.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Qwen 3.6 dropped on April 21, 2026. Two main families: a &lt;strong&gt;27B dense&lt;/strong&gt; model that activates every parameter per token and a &lt;strong&gt;35B MoE&lt;/strong&gt; with 3B active per token. Both ship with vision, agentic coding, thinking-mode preservation, and a 256K context window.&lt;/p&gt;

&lt;p&gt;If you only have time for the short version: install &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, open Model Manager &amp;gt; Discover &amp;gt; Text, search Qwen 3.6, hit the download arrow on the variant that fits your VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Qwen 3.6 Variant Should You Pick?
&lt;/h2&gt;

&lt;p&gt;The biggest decision is dense vs MoE. The second biggest is which quant.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;27B dense&lt;/strong&gt; activates all 27B parameters for every token. Slower per token, but every token gets the full model. Quality is consistent. Recommended default for general chat, reasoning, and most coding.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;35B MoE&lt;/strong&gt; only activates 3B parameters per token via routing. Much faster per token (often 2-3x throughput at similar quants). VRAM peak during inference is lower than the model size suggests. But routing introduces variance. The MoE wins on coding benchmarks (SWE-bench specifically) when you pick the coding-specialised variant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quant Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Disk&lt;/th&gt;
&lt;th&gt;VRAM Target&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;UD-IQ2_XXS&lt;/td&gt;
&lt;td&gt;8.7 GB&lt;/td&gt;
&lt;td&gt;8 GB GPU&lt;/td&gt;
&lt;td&gt;Good (low-VRAM lifesaver)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;13 GB&lt;/td&gt;
&lt;td&gt;12 GB GPU&lt;/td&gt;
&lt;td&gt;Very good (RTX 3060 sweet spot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;16 GB GPU&lt;/td&gt;
&lt;td&gt;Recommended default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;16 GB GPU&lt;/td&gt;
&lt;td&gt;Better quality per GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;td&gt;20 GB GPU&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q6_K&lt;/td&gt;
&lt;td&gt;21 GB&lt;/td&gt;
&lt;td&gt;24 GB GPU&lt;/td&gt;
&lt;td&gt;Near-lossless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27B dense&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;32 GB GPU&lt;/td&gt;
&lt;td&gt;Effectively lossless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;24 GB GPU&lt;/td&gt;
&lt;td&gt;Recommended for MoE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;NVFP4&lt;/td&gt;
&lt;td&gt;22 GB&lt;/td&gt;
&lt;td&gt;22 GB GPU (RTX 50+)&lt;/td&gt;
&lt;td&gt;Smallest with full quality on Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE coding&lt;/td&gt;
&lt;td&gt;NVFP4&lt;/td&gt;
&lt;td&gt;22 GB&lt;/td&gt;
&lt;td&gt;22 GB GPU (RTX 50+)&lt;/td&gt;
&lt;td&gt;Best coding-bench-per-GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35B MoE&lt;/td&gt;
&lt;td&gt;BF16&lt;/td&gt;
&lt;td&gt;71 GB&lt;/td&gt;
&lt;td&gt;96 GB GPU&lt;/td&gt;
&lt;td&gt;Reference quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Recommendation by Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 GB VRAM&lt;/strong&gt; (RTX 3060 8GB, RTX 4060 8GB): 27B UD-IQ2_XXS - the only quant that fits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 GB VRAM&lt;/strong&gt; (RTX 3060 12GB, RTX 3080 Ti, RTX 4070): 27B Q3_K_M - sweet spot, ~15-25 tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16 GB VRAM&lt;/strong&gt;: 27B Q4_K_M or UD-Q4_K_XL - the recommended default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24 GB VRAM&lt;/strong&gt; (RTX 3090, RTX 4090): 27B Q6_K for max dense quality, OR 35B MoE Q4_K_M for coding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTX 50+ (Blackwell)&lt;/strong&gt;: 35B MoE NVFP4 - smallest size with native quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon M3/M4&lt;/strong&gt;: 35B MoE MLX BF16 via MLX runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU only with 32 GB RAM&lt;/strong&gt;: 27B Q4_K_M at 1-3 tok/s - usable for short tasks&lt;/li&gt;
&lt;/ul&gt;
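
&lt;p&gt;All of the picks above are keyed to VRAM. If you're not sure what your NVIDIA card actually reports (dedicated VRAM, not shared system memory), there's a one-liner for that - &lt;code&gt;nvidia-smi&lt;/code&gt; ships with the driver on both Windows and Linux:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi --query-gpu=name,memory.total --format=csv   &lt;span class="c"&gt;# e.g. "NVIDIA GeForce RTX 3060, 12288 MiB"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;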

&lt;h2&gt;
  
  
  Installation Path 1 - Ollama (CLI)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.6:27b           &lt;span class="c"&gt;# dense Q4_K_M, 16 GB&lt;/span&gt;
ollama pull qwen3.6                &lt;span class="c"&gt;# 35B MoE Q4_K_M, 24 GB&lt;/span&gt;
ollama pull qwen3.6:35b-a3b-coding-nvfp4   &lt;span class="c"&gt;# coding NVFP4&lt;/span&gt;
ollama run qwen3.6:27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
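
&lt;p&gt;Once a pull finishes, the model is also reachable over Ollama's local HTTP API on the default port 11434, which is handy if you want to script against it. A minimal sketch against the native &lt;code&gt;/api/generate&lt;/code&gt; endpoint, using the dense tag pulled above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:27b",
  "prompt": "Explain the difference between a dense model and a MoE in two sentences.",
  "stream": false
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;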



&lt;h2&gt;
  
  
  Installation Path 2 - Locally Uncensored (GUI)
&lt;/h2&gt;

&lt;p&gt;If you want a one-click experience plus chat, agent mode, image generation, and A/B model comparison in the same window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;v2.4.0 installer&lt;/a&gt; for your OS&lt;/li&gt;
&lt;li&gt;First-launch wizard auto-detects Ollama (or offers one-click install)&lt;/li&gt;
&lt;li&gt;Model Manager &amp;gt; Discover &amp;gt; Text &amp;gt; search Qwen 3.6&lt;/li&gt;
&lt;li&gt;Click the download arrow on the variant matching your VRAM&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Performance on RTX 3060 12 GB
&lt;/h2&gt;

&lt;p&gt;Tested with Qwen 3.6 27B Q3_K_M, 4096-token context, fp16 KV cache:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold first response&lt;/td&gt;
&lt;td&gt;~3 (model load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm chat (50-token answers)&lt;/td&gt;
&lt;td&gt;22-26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form (1000 tokens)&lt;/td&gt;
&lt;td&gt;18-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking-mode enabled&lt;/td&gt;
&lt;td&gt;15-18&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
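
&lt;p&gt;To sanity-check these numbers on your own card, Ollama prints load and generation stats when you pass &lt;code&gt;--verbose&lt;/code&gt;. The sketch below uses the default Q4_K_M tag from Path 1 - swap in whichever quant you actually pulled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:27b "Summarise what a KV cache does in three sentences." --verbose
&lt;span class="c"&gt;# the stats printed after the reply include an "eval rate" line in tokens/s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;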

&lt;h2&gt;
  
  
  Vision Support
&lt;/h2&gt;

&lt;p&gt;Both 27B dense and 35B MoE accept image input. Drag-and-drop a screenshot, photo, or chart. VRAM cost for vision is +1-2 GB on top of the base model.&lt;/p&gt;
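
&lt;p&gt;Outside a GUI, the same &lt;code&gt;/api/generate&lt;/code&gt; endpoint shown earlier accepts an &lt;code&gt;images&lt;/code&gt; array of base64-encoded files. A rough sketch (GNU &lt;code&gt;base64&lt;/code&gt; shown; on macOS use &lt;code&gt;base64 -i screenshot.png&lt;/code&gt; instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;IMG=$(base64 -w0 screenshot.png)
curl http://localhost:11434/api/generate -d "{
  \"model\": \"qwen3.6:27b\",
  \"prompt\": \"Describe what this screenshot shows.\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;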

&lt;h2&gt;
  
  
  Coding Performance
&lt;/h2&gt;

&lt;p&gt;The 35B MoE coding-specialised variants are tuned on SWE-bench training data. The coding NVFP4 variant scores in the same ballpark as Claude 3.5 Sonnet on SWE-bench Verified at a fraction of the inference cost.&lt;/p&gt;

&lt;p&gt;For day-to-day coding inside &lt;a href="https://locallyuncensored.com/blog/codex-cli-universal-model-support.html" rel="noopener noreferrer"&gt;LU's Codex agent&lt;/a&gt;, the 27B dense Q4_K_M is the better default - consistent quality, no MoE-routing variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qwen 3.6 vs Qwen 3.5
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Qwen 3.5&lt;/th&gt;
&lt;th&gt;Qwen 3.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (both 27B and 35B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking mode&lt;/td&gt;
&lt;td&gt;QwQ-only&lt;/td&gt;
&lt;td&gt;Preserved across variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding-specific MoE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (35B-a3b-coding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVFP4 quant&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (35B MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLX variant for Apple Silicon&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>qwen</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Locally Uncensored v2.4.0 — Settings Polish, Linux Drag Fix, and Configurable HuggingFace Path</title>
      <dc:creator>David </dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:53:15 +0000</pubDate>
      <link>https://dev.to/purpledoubled/locally-uncensored-v240-settings-polish-linux-drag-fix-and-configurable-huggingface-path-34bo</link>
      <guid>https://dev.to/purpledoubled/locally-uncensored-v240-settings-polish-linux-drag-fix-and-configurable-huggingface-path-34bo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://locallyuncensored.com/blog/locally-uncensored-v2-4-0-release.html" rel="noopener noreferrer"&gt;locallyuncensored.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; v2.4.0 is a polish release. Eight fixes, two of them surfaced through community feedback on Discord, six caught during an internal end-to-end pass on the v2.3.9 build. No new headline features — this release exists so the next feature release lands on a cleaner foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-instance lock&lt;/strong&gt; — double-clicking the shortcut focuses the existing window instead of spawning a second process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Model Storage&lt;/strong&gt; — paste or pick the folder where HuggingFace GGUF downloads land&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Privacy&lt;/strong&gt; — in-app statement of what runs locally and what doesn't&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings → Onboarding&lt;/strong&gt; — a button that re-runs the first-launch wizard on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset tutorial&lt;/strong&gt; — the button now actually does what its label promises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux window drag&lt;/strong&gt; — the title-bar drag works on Ubuntu 24.04 again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discover&lt;/strong&gt; — the HuggingFace download path is no longer printed twice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace search heuristic&lt;/strong&gt; — search results for repos with a quant tag in the name no longer 404 on download&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Single-Instance Lock
&lt;/h2&gt;

&lt;p&gt;Before v2.4.0, double-clicking the desktop shortcut started a second &lt;code&gt;locally-uncensored.exe&lt;/code&gt; process. Both instances would race each other writing to the store backup file — not a frequent corruption source, but a real one when both happened to flush at the same millisecond.&lt;/p&gt;

&lt;p&gt;v2.4.0 ships with &lt;code&gt;tauri-plugin-single-instance&lt;/code&gt;. The second launch focuses, un-minimizes, and brings the existing window to front. No new process. Verified with three back-to-back launches: only one PID survives.&lt;/p&gt;
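
&lt;p&gt;You can reproduce that check on Linux too. A rough sketch, assuming the installed binary is on your PATH as &lt;code&gt;locally-uncensored&lt;/code&gt; (name assumed here, not taken from the release notes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;locally-uncensored &amp;amp;
sleep 2
locally-uncensored &amp;amp;
sleep 2
pgrep -fc locally-uncensored   &lt;span class="c"&gt;# should print 1 on v2.4.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;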

&lt;h2&gt;
  
  
  Settings → Model Storage — Configurable HuggingFace Folder
&lt;/h2&gt;

&lt;p&gt;The Model Manager → Discover → Text tab lets you download GGUF models from HuggingFace. Until v2.4.0, the destination folder was always auto-detected from the active openai-compat provider — usually LM Studio's models folder.&lt;/p&gt;

&lt;p&gt;That worked fine for single-disk setups. It did &lt;strong&gt;not&lt;/strong&gt; work for dual-boot users who wanted a shared model partition between Linux and Windows, or anyone running a NAS-mounted models folder. Reported on Discord by &lt;code&gt;diimmortalis&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;v2.4.0 adds a dedicated &lt;strong&gt;Settings → Model Storage&lt;/strong&gt; section with a path input, a Browse button, and a Reset button. The override takes effect immediately. Verified end-to-end with a Gemma 4 E4B download (4.6 GB) landing in a custom folder while the LM Studio default folder stayed untouched.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux Window Drag Fix
&lt;/h2&gt;

&lt;p&gt;On Ubuntu 24.04 the title-bar drag threw an unhandled Promise rejection — &lt;code&gt;core:window:allow-start-dragging&lt;/code&gt; was missing from the capability list. Reported on Discord by &lt;code&gt;diimmortalis&lt;/code&gt; with a clean Promise-rejection dump. One-line fix in &lt;code&gt;src-tauri/capabilities/default.json&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests &amp;amp; Verification
&lt;/h2&gt;

&lt;p&gt;Test suite went from 2205 to 2216 (+11 regression tests). &lt;code&gt;cargo check&lt;/code&gt; clean. &lt;code&gt;tsc --noEmit&lt;/code&gt; clean.&lt;/p&gt;

&lt;p&gt;Live end-to-end on the installed v2.4.0 build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-instance&lt;/strong&gt;: 3 back-to-back exe launches → 1 PID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HF download override&lt;/strong&gt;: typed custom path, Discover subtitle updated, Gemma 4 E4B partial download (35.9 MB at 897 KB/s) landed in the picked folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-run onboarding&lt;/strong&gt;: click → marker deleted → 6-step wizard renders → marker re-created&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset tutorial&lt;/strong&gt;: click → flag flipped → new chat → Agent toggle → tutorial renders&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Download
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases/tag/v2.4.0" rel="noopener noreferrer"&gt;GitHub Releases&lt;/a&gt;. Signed installers for Windows (.exe, .msi) and Linux (.deb, .rpm, .AppImage). Auto-update picks the new build up on next launch for anyone on v2.3.x.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Locally Uncensored is AGPL-3.0 licensed. Built by &lt;a href="https://github.com/PurpleDoubleD" rel="noopener noreferrer"&gt;PurpleDoubleD&lt;/a&gt;. Bug reports on &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or in the &lt;a href="https://discord.gg/nHnGnDw2c8" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>opensource</category>
      <category>tauri</category>
      <category>release</category>
    </item>
    <item>
      <title>Anthropic is Rationing Claude Code on Pro — Here's a Local Alternative</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 10:34:18 +0000</pubDate>
      <link>https://dev.to/purpledoubled/anthropic-is-rationing-claude-code-on-pro-heres-a-local-alternative-574n</link>
      <guid>https://dev.to/purpledoubled/anthropic-is-rationing-claude-code-on-pro-heres-a-local-alternative-574n</guid>
      <description>&lt;p&gt;Earlier this week, Anthropic ran a quiet test: a small slice (~2%) of new Pro plan subscribers found that Claude Code wasn't included with their $20/month subscription. The pricing page was updated to reflect this. It made some noise on Reddit and X, Anthropic walked it back, and the page was reverted.&lt;/p&gt;

&lt;p&gt;But the incident highlights something real: &lt;strong&gt;the economics of hosted AI are strained.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;Anthropic's head of growth &lt;a href="https://arstechnica.com/ai/2026/04/anthropic-tested-removing-claude-code-from-the-pro-plan/" rel="noopener noreferrer"&gt;clarified on social media&lt;/a&gt; that the test affected about 2% of new prosumer signups. The reasoning was straightforward: usage patterns have changed dramatically. Users have moved from brief chat sessions to "nearly always-on, multi-agent workflows" that consume vastly more tokens. The current plans weren't built for this.&lt;/p&gt;

&lt;p&gt;To be clear: this wasn't a crisis. It was a business experiment that got rolled back quickly. But it was also a signal — one that shouldn't be surprising if you've been paying attention to how compute-heavy AI tools have become.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trend Is Clear
&lt;/h2&gt;

&lt;p&gt;Claude Code isn't unique here. OpenAI has introduced peak-hour caps. Anthropic has added tighter limits during high-traffic periods. Gemini, ChatGPT, and others have all introduced various forms of rate limiting as agentic workflows (long-running, multi-step tasks) have taken off.&lt;/p&gt;

&lt;p&gt;This isn't malice — it's math. Running a model that can handle complex, hours-long agentic tasks requires significant GPU compute. At $20/month, there's a real gap between what heavy users consume and what the subscription covers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Local Models
&lt;/h2&gt;

&lt;p&gt;This is where running AI locally becomes genuinely compelling, not just theoretically interesting.&lt;/p&gt;

&lt;p&gt;Tools like &lt;strong&gt;Ollama&lt;/strong&gt;, &lt;strong&gt;LM Studio&lt;/strong&gt;, and &lt;strong&gt;&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;&lt;/strong&gt; let you run capable language models on your own hardware. No subscription. No per-token billing. No rate limits. No plan changes.&lt;/p&gt;

&lt;p&gt;The tradeoff is real: you need decent hardware (a modern Mac with unified memory, a gaming PC with a good GPU, or a dedicated home server), and the experience differs from hosted APIs. But for developers who rely on agentic workflows — the exact users feeling the squeeze from providers — the local path is increasingly viable.&lt;/p&gt;

&lt;p&gt;Recent open-weight models from Mistral, Qwen, and the Llama family are genuinely capable for coding tasks. They're not matching the frontier models on every benchmark, but for the majority of real-world dev work, the gap has shrunk considerably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This a Sales Pitch?
&lt;/h2&gt;

&lt;p&gt;Not really — and I want to be clear about that. If Anthropic's pricing works for you and you don't hit limits, there's no urgent reason to change. Their models are excellent.&lt;/p&gt;

&lt;p&gt;But if you've been on the receiving end of a rate limit mid-flow, or if you're watching your usage climb and wondering what happens next, it's worth knowing that the local option exists and has gotten significantly easier to set up over the past year.&lt;/p&gt;

&lt;p&gt;The local ecosystem isn't for everyone. But for developers who have built automated workflows around AI — the exact users Anthropic was quietly trying to ration — it might be worth an afternoon of experimentation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What do you think — is the local-first approach realistic for your use case, or are you all-in on hosted APIs? I'd genuinely like to know what you're running into.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>localmodels</category>
      <category>privacy</category>
      <category>programming</category>
    </item>
    <item>
      <title>qwen3.6-27b scores 77.2% on SWE-bench. the dense model is winning against MoE.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:30:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/qwen36-27b-scores-772-on-swe-bench-the-dense-model-is-winning-against-moe-3e4b</link>
      <guid>https://dev.to/purpledoubled/qwen36-27b-scores-772-on-swe-bench-the-dense-model-is-winning-against-moe-3e4b</guid>
      <description>&lt;p&gt;When Alibaba released Qwen3.6-35B-A3B, the MoE (Mixture of Experts) design stole all the headlines. 35 billion parameters, 3 billion activated per token — everyone's been focused on that ratio.&lt;/p&gt;

&lt;p&gt;Then they dropped Qwen3.6-27B. A plain old dense model. 27 billion parameters, all active.&lt;/p&gt;

&lt;p&gt;On SWE-bench Verified, the 27B dense scores &lt;strong&gt;77.2%&lt;/strong&gt;. The 35B MoE scores &lt;strong&gt;73.4%&lt;/strong&gt;. The dense model is outperforming the MoE by nearly 4 points — on the benchmark that measures real software engineering capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  what SWE-bench actually measures
&lt;/h2&gt;

&lt;p&gt;SWE-bench gives an LLM a real GitHub issue and a codebase. It has to understand the problem, find the right files, write the fix, and get the tests to pass. It's not multiple choice — it requires actual coding.&lt;/p&gt;

&lt;p&gt;Qwen3.6-27B at 77.2% puts it in range of proprietary models. Claude Opus 4.5 scores 80.9%. The gap is real but narrowing — and Qwen3.6-27B does it on your own GPU under Apache 2.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  why is the dense model winning?
&lt;/h2&gt;

&lt;p&gt;Two factors seem to be driving this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Full parameter utilization.&lt;/strong&gt; In a MoE model like the 35B-A3B, only 3B of 35B parameters are active per token. The routing layer decides which experts to use. This is efficient for inference speed, but the model can't "use" all of its knowledge simultaneously. A dense model can activate its full capacity for harder reasoning tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Architecture: Gated DeltaNet.&lt;/strong&gt; Qwen3.6-27B isn't a vanilla dense transformer. It uses a Gated DeltaNet + Gated Attention hybrid — alternating layers of linear-gated attention (DeltaNet) with standard gated attention. DeltaNet processes information in compressed deltas rather than full representations, which lets it handle long contexts more efficiently while maintaining reasoning depth.&lt;/p&gt;

&lt;p&gt;The result is a model that can do 262K context natively (extendable to 1M tokens) while still being a strong coder.&lt;/p&gt;

&lt;h2&gt;
  
  
  the benchmark breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Qwen3.6-27B (dense)&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B (MoE)&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;+3.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49.5&lt;/td&gt;
&lt;td&gt;+4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51.5&lt;/td&gt;
&lt;td&gt;+7.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SkillsBench Avg5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28.7&lt;/td&gt;
&lt;td&gt;+19.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QwenWebBench&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1487&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1397&lt;/td&gt;
&lt;td&gt;+90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;29.4&lt;/td&gt;
&lt;td&gt;+6.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Terminal-Bench (real terminal operations) and SkillsBench show the largest gaps. These are tasks where the model needs to chain together multiple operations — the kind of thing where full parameter access seems to matter most.&lt;/p&gt;

&lt;h2&gt;
  
  
  the tradeoff
&lt;/h2&gt;

&lt;p&gt;Dense models aren't free. The 27B activates all 27B parameters per forward pass. The 35B MoE activates only 3B. During inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35B MoE is faster per token&lt;/strong&gt; (3B vs 27B compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;35B MoE uses less memory&lt;/strong&gt; for the active computation (but total disk/loaded size is still large)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27B dense is better at hard coding tasks&lt;/strong&gt; (SWE-bench, terminal operations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're doing simple chat, the MoE will be faster. If you're running an agent that needs to reason through a complex codebase — the dense model is showing real advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  vision included
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-27B is an image-text-to-text model. The vision encoder is built in. That means you can screenshot a UI and ask it to fix the bug, read a diagram and explain the architecture, or debug from screenshots. The 35B MoE is text-only.&lt;/p&gt;

&lt;h2&gt;
  
  
  running it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, you also get image input, a built-in code agent, and fully local outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/PurpleDoubleD/locally-uncensored
&lt;span class="nb"&gt;cd &lt;/span&gt;locally-uncensored &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run tauri dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The MoE vs dense debate isn't settled. But on coding agent tasks, Qwen3.6-27B is making a strong case that raw parameter count isn't everything — architecture and full utilization matter too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 license.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>how to run qwen3.6-27b locally — the dense 27B that beats the 35B MoE on coding</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:29:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen36-27b-locally-the-dense-27b-that-beats-the-35b-moe-on-coding-172e</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen36-27b-locally-the-dense-27b-that-beats-the-35b-moe-on-coding-172e</guid>
      <description>&lt;p&gt;Alibaba just dropped Qwen3.6-27B, a 27-billion parameter dense model that scores 77.2% on SWE-bench Verified. That's higher than Qwen3.6-35B-A3B (73.4%) — the MoE version everyone was talking about last week.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;, a desktop AI app, and we just added Qwen3.6-27B support.&lt;/p&gt;

&lt;h2&gt;
  
  
  install with ollama
&lt;/h2&gt;

&lt;p&gt;If you already have Ollama set up, it's a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. If you want a specific quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6-27b:q4_K_M   &lt;span class="c"&gt;# 16GB RAM recommended&lt;/span&gt;
ollama run qwen3.6-27b:q8_0     &lt;span class="c"&gt;# 27GB RAM recommended&lt;/span&gt;
ollama run qwen3.6-27b:fp8      &lt;span class="c"&gt;# needs ~27GB VRAM (FP8)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: if &lt;code&gt;ollama run qwen3.6-27b&lt;/code&gt; returns "model not found", give it a minute — Ollama's library updates periodically. You can also pull manually with &lt;code&gt;ollama pull qwen3.6-27b&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  what makes qwen3.6-27b different
&lt;/h2&gt;

&lt;p&gt;The 35B-A3B is a Mixture-of-Experts model: 35B total params but only 3B activated per token. Qwen3.6-27B is a different beast — a &lt;strong&gt;dense&lt;/strong&gt; 27B model with a Gated DeltaNet + Gated Attention hybrid architecture.&lt;/p&gt;

&lt;p&gt;Key specs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;27B parameters (all active, no MoE routing)&lt;/li&gt;
&lt;li&gt;64 layers, 5120 hidden dimension&lt;/li&gt;
&lt;li&gt;262,144 token context natively (extensible to 1,010,000)&lt;/li&gt;
&lt;li&gt;Vision encoder included (image-text-to-text)&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gated DeltaNet architecture processes tokens through alternating Gated DeltaNet and Gated Attention layers — a hybrid that combines linear-attention efficiency with gated selective attention. It's a different design philosophy from both vanilla transformers and the 35B MoE.&lt;/p&gt;

&lt;h2&gt;
  
  
  benchmark table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Qwen3.6-27B&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;th&gt;Gemma4-31B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;52.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49.5&lt;/td&gt;
&lt;td&gt;35.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;51.5&lt;/td&gt;
&lt;td&gt;42.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SkillsBench Avg5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28.7&lt;/td&gt;
&lt;td&gt;23.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU-Pro&lt;/td&gt;
&lt;td&gt;86.2&lt;/td&gt;
&lt;td&gt;85.2&lt;/td&gt;
&lt;td&gt;85.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td&gt;83.9&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2026&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.7&lt;/td&gt;
&lt;td&gt;89.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All numbers from the &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B" rel="noopener noreferrer"&gt;official Qwen3.6-27B model card&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 27B dense model is pulling ahead of the 35B MoE on agentic coding tasks — SWE-bench, Terminal-Bench, SkillsBench. The gap is especially wide on SkillsBench (48.2 vs 28.7) which tests real-world dev skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  vram requirements
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-27B is a dense model, so all 27B parameters stay in memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;VRAM (approx)&lt;/th&gt;
&lt;th&gt;Recommended GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K&lt;/td&gt;
&lt;td&gt;10-11 GB&lt;/td&gt;
&lt;td&gt;RTX 3060, RTX 4060&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;16-17 GB&lt;/td&gt;
&lt;td&gt;RTX 4070, RTX 3080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;27-28 GB&lt;/td&gt;
&lt;td&gt;RTX 4090, A5000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;RTX 4090, H100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;54 GB&lt;/td&gt;
&lt;td&gt;dual GPU or professional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: these are for the base model only. With the vision encoder + KV cache for long context, add 2-4 GB overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  why not just use the 35B MoE?
&lt;/h2&gt;

&lt;p&gt;The 35B-A3B activates fewer params per token, which means faster inference and lower memory during generation. But if you're doing agentic coding with longer context windows, the dense 27B is showing real advantages on benchmark tasks that require deep repository reasoning.&lt;/p&gt;

&lt;p&gt;The 35B MoE also requires more total disk space (the full expert bank is still loaded even if only 3B activate per token) and the routing decisions can introduce variability.&lt;/p&gt;

&lt;h2&gt;
  
  
  try it with locally uncensored
&lt;/h2&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — a cross-platform desktop app that lets you run Qwen3.6-27B (and other models) with uncensored outputs, image understanding, and a built-in code agent.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-click model setup via Ollama&lt;/li&gt;
&lt;li&gt;Image + text input&lt;/li&gt;
&lt;li&gt;Built-in code agent mode&lt;/li&gt;
&lt;li&gt;Chat history and export&lt;/li&gt;
&lt;li&gt;No cloud, no data leaving your machine
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# clone and run&lt;/span&gt;
git clone https://github.com/PurpleDoubleD/locally-uncensored
&lt;span class="nb"&gt;cd &lt;/span&gt;locally-uncensored &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run tauri dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored/releases" rel="noopener noreferrer"&gt;GitHub releases&lt;/a&gt; for pre-built binaries.&lt;/p&gt;




&lt;p&gt;What GPU are you running? And have you tried the 27B vs the 35B MoE side-by-side? Drop a comment with your setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 license.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>anthropic charges $25/M tokens for opus 4.7. alibaba just released the same capability for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:14:04 +0000</pubDate>
      <link>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</link>
      <guid>https://dev.to/purpledoubled/anthropic-charges-25m-tokens-for-opus-47-alibaba-just-released-the-same-capability-for-free-3o11</guid>
      <description>&lt;p&gt;Anthropic charges $25 per million output tokens for Claude Opus 4.7. That's their new flagship coding model, released today. It's good — 13% better than Opus 4.6 on coding benchmarks, improved vision, stronger at multi-step agentic work.&lt;/p&gt;

&lt;p&gt;Meanwhile, also this week: Alibaba released Qwen3.6-35B-A3B under Apache 2.0. Scores 73.4 on SWE-bench Verified. Runs on an 8 GB GPU. Costs nothing.&lt;/p&gt;

&lt;p&gt;Two models. Same week. Completely opposite philosophies. Let's break down what's actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cloud tax is getting harder to justify
&lt;/h2&gt;

&lt;p&gt;When GPT-4 launched in 2023, there was nothing local that came close. Paying for API access made sense because there was no alternative.&lt;/p&gt;

&lt;p&gt;In 2024, open models started catching up. Llama 3, Qwen 2.5, Mistral — good enough for many tasks, but still clearly behind frontier models on the hard stuff.&lt;/p&gt;

&lt;p&gt;In 2026, the gap has narrowed to the point where you have to really think about whether the remaining difference is worth $25 per million output tokens.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. A developer using Opus 4.7 as their primary coding agent, running maybe 50 complex coding sessions a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average session: ~10K input tokens (code context) + ~5K output tokens (response)&lt;/li&gt;
&lt;li&gt;50 sessions: 500K input + 250K output tokens&lt;/li&gt;
&lt;li&gt;Daily cost: $2.50 + $6.25 = &lt;strong&gt;$8.75/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Monthly: &lt;strong&gt;~$190/month&lt;/strong&gt; just for one developer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now scale that to a team of 5. That's nearly $1,000/month on AI coding assistance.&lt;/p&gt;

&lt;p&gt;The same team could buy a single RTX 4070 ($550 one-time) and run Qwen3.6 at 20+ tokens/second with zero ongoing costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  what you actually get for $0
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B isn't just "a free model." It's specifically designed for the exact use case Opus 4.7 targets — coding agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic coding benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench Verified: 73.4 (fix real bugs in real repos autonomously)&lt;/li&gt;
&lt;li&gt;Terminal-Bench 2.0: 51.5 (operate a terminal to solve coding tasks)&lt;/li&gt;
&lt;li&gt;MCPMark: 37.0 (tool calling and agent protocols)&lt;/li&gt;
&lt;li&gt;QwenWebBench: 1397 Elo (frontend artifact generation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture advantages for local deployment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE: 35B total params, 3B active — runs like a small model, thinks like a big one&lt;/li&gt;
&lt;li&gt;Gated DeltaNet: 3 of 4 layers use linear attention — memory efficient on long contexts&lt;/li&gt;
&lt;li&gt;Native vision: understand screenshots, diagrams, code images without a separate model&lt;/li&gt;
&lt;li&gt;262K context: plenty for most codebase contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you give up vs Opus 4.7:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probably some edge on the hardest 10% of tasks&lt;/li&gt;
&lt;li&gt;Anthropic's specific safety/self-verification features&lt;/li&gt;
&lt;li&gt;The polish of a model trained with massive RLHF compute&lt;/li&gt;
&lt;li&gt;Cloud convenience (no GPU needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you gain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your code never leaves your machine&lt;/li&gt;
&lt;li&gt;No rate limits, no outages, no API key management&lt;/li&gt;
&lt;li&gt;No per-token costs, ever&lt;/li&gt;
&lt;li&gt;Full control over the model behavior&lt;/li&gt;
&lt;li&gt;Works offline, on a plane, in an air-gapped environment&lt;/li&gt;
&lt;li&gt;Apache 2.0 — fine-tune it, modify it, deploy it commercially&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  the $25/M question
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 is genuinely impressive. Anthropic's coding models have been best-in-class for a while and this extends that lead. The self-verification feature — where the model checks its own work before reporting back — is particularly useful for autonomous workflows.&lt;/p&gt;

&lt;p&gt;But the honest question every developer should ask is: &lt;strong&gt;for my specific tasks, does the delta between Opus 4.7 and Qwen3.6 justify the cost?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a solo developer building a startup: probably not. Qwen3.6 autonomously resolves 73.4% of the real GitHub issues in SWE-bench Verified. That's more than enough for daily coding work.&lt;/p&gt;

&lt;p&gt;For a large enterprise with strict compliance requirements and deep pockets: maybe. The convenience and Anthropic's enterprise features have real value.&lt;/p&gt;

&lt;p&gt;For anyone processing sensitive code: local wins by default. No amount of ToS promises equals "the data literally never left my hardware."&lt;/p&gt;

&lt;h2&gt;
  
  
  how to try both and decide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.7:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API key from anthropic.com
Model: claude-opus-4-7
$5/M input, $25/M output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen3.6 locally:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or for a complete setup with a coding agent, vision, and tool calling — &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; v2.3.3 supports both. Connect Anthropic's API for Opus 4.7 when you need it, run Qwen3.6 locally for everything else. Switch between them in the same interface. Best of both worlds.&lt;/p&gt;
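
&lt;p&gt;If you'd rather wire that comparison into your own tooling, Ollama also exposes an OpenAI-compatible endpoint on the same port, so most clients built for hosted APIs can be pointed at the local model instead. A minimal sketch (no API key needed for the local endpoint):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3.6:35b-a3b",
  "messages": [{"role": "user", "content": "Review this function for off-by-one errors."}]
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;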

&lt;h2&gt;
  
  
  where this is heading
&lt;/h2&gt;

&lt;p&gt;The pattern is clear. Every 3-4 months, a new open model appears that matches the paid frontier model from 6 months ago. The cost of "good enough" is trending toward zero.&lt;/p&gt;

&lt;p&gt;Anthropic, OpenAI, and Google will keep pushing the frontier. Open models will keep closing the gap. And the developers in the middle will increasingly ask: "Is the remaining gap worth $25 per million tokens?"&lt;/p&gt;

&lt;p&gt;Today, for most coding tasks, the answer is already no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for running AI locally. Supports cloud APIs AND local models. Chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>claude opus 4.7 just dropped. here's what runs locally for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:13:20 +0000</pubDate>
      <link>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</link>
      <guid>https://dev.to/purpledoubled/claude-opus-47-just-dropped-heres-what-runs-locally-for-free-5665</guid>
      <description>&lt;p&gt;Anthropic just released Claude Opus 4.7. It's their best coding model yet — 13% better than Opus 4.6 on their internal 93-task benchmark, better vision, stronger at long-running agentic tasks.&lt;/p&gt;

&lt;p&gt;It's also $5 per million input tokens and $25 per million output tokens. API only. Every character you type goes through Anthropic's servers.&lt;/p&gt;

&lt;p&gt;Let's talk about what you can do locally for $0.&lt;/p&gt;

&lt;h2&gt;
  
  
  what opus 4.7 actually brings
&lt;/h2&gt;

&lt;p&gt;Based on Anthropic's announcement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;13% improvement&lt;/strong&gt; over Opus 4.6 on a 93-task coding benchmark, including 4 tasks neither Opus 4.6 nor Sonnet 4.6 could solve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better vision&lt;/strong&gt; — higher resolution image understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stronger agentic workflows&lt;/strong&gt; — handles complex, multi-step tasks without losing context or stopping early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-verification&lt;/strong&gt; — the model checks its own outputs before reporting back&lt;/li&gt;
&lt;li&gt;Available on Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real improvements. Opus has been the go-to for serious coding work, and 4.7 makes it better.&lt;/p&gt;

&lt;p&gt;But here's the thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  the cost of frontier cloud AI
&lt;/h2&gt;

&lt;p&gt;At $5/$25 per million tokens, a heavy coding session with Opus 4.7 can easily run $2-5/day. A team of developers using it as their primary coding agent? That's hundreds per month.&lt;/p&gt;

&lt;p&gt;And every line of your proprietary code flows through someone else's infrastructure. Every prompt, every codebase context, every business logic snippet — stored, processed, potentially used for training (even with opt-outs, you're trusting the provider).&lt;/p&gt;

&lt;p&gt;For hobby projects, fine. For anything sensitive — financial code, healthcare logic, proprietary algorithms — that's a real risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  what runs locally right now
&lt;/h2&gt;

&lt;p&gt;The local model landscape has changed dramatically in the last few months. Here's what's available today at $0/month:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; (released this week)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;35B total parameters, 3B active (MoE architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — autonomous bug fixing on real GitHub repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding&lt;/li&gt;
&lt;li&gt;Built-in vision, 262K context&lt;/li&gt;
&lt;li&gt;Runs on &lt;strong&gt;8 GB VRAM&lt;/strong&gt; with Q4_K_M quantization&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is it as good as Opus 4.7? On raw capability, probably not — Anthropic has massive compute advantages. But on the tasks most developers actually do daily (fixing bugs, writing functions, understanding codebases, code review), Qwen3.6 is genuinely competitive. And it runs on hardware you already own.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real comparison isn't benchmarks
&lt;/h2&gt;

&lt;p&gt;It's this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$5/$25 per million tokens&lt;/td&gt;
&lt;td&gt;$0 forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Cloud-processed&lt;/td&gt;
&lt;td&gt;Never leaves your machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Subject to API congestion&lt;/td&gt;
&lt;td&gt;As fast as your GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;Depends on Anthropic's uptime&lt;/td&gt;
&lt;td&gt;Runs offline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;Anthropic's policy&lt;/td&gt;
&lt;td&gt;You control everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic coding&lt;/td&gt;
&lt;td&gt;Yes (strong)&lt;/td&gt;
&lt;td&gt;Yes (73.4 SWE-bench)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;API key + credit card&lt;/td&gt;
&lt;td&gt;Ollama + 10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  how to set up the local alternative
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Or if you want a full desktop experience with a coding agent, vision support, and model management:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with Qwen3.6 day-0 support. It wraps Ollama into a desktop app with a built-in coding agent that streams live between tool calls, agent mode with 13 tools and MCP integration, and remote access from your phone. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  when cloud still makes sense
&lt;/h2&gt;

&lt;p&gt;Being honest: there are cases where Opus 4.7 is worth the money.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need the absolute frontier of capability and $25/M output tokens is pocket change for your use case&lt;/li&gt;
&lt;li&gt;You're doing something that requires Anthropic's specific safety features&lt;/li&gt;
&lt;li&gt;You need the model to handle tasks that are genuinely beyond what open models can do today&lt;/li&gt;
&lt;li&gt;You don't have a GPU (though even a laptop with 8GB VRAM works for Qwen3.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyone else — the gap between cloud and local is closing fast. A model that scores 73.4 on SWE-bench running on a gaming laptop would have been science fiction two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  the trajectory matters more than today's snapshot
&lt;/h2&gt;

&lt;p&gt;Every few months, a new open model drops that would have been frontier-class the year before. The pricing gap between cloud and local is structural: cloud will always bill per token, while local costs nothing beyond the hardware and the electricity to run it.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is impressive. But the question isn't whether it's good — it's whether it's $5/$25 per million tokens better than what you can run yourself.&lt;/p&gt;

&lt;p&gt;For a growing number of developers, the answer is no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. Qwen3.6 day-0 support. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>i cancelled my AI subscriptions. qwen3.6 on my own GPU does the same thing for free.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:56:16 +0000</pubDate>
      <link>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</link>
      <guid>https://dev.to/purpledoubled/i-cancelled-my-ai-subscriptions-qwen36-on-my-own-gpu-does-the-same-thing-for-free-493h</guid>
      <description>&lt;p&gt;You're paying $20/month for ChatGPT. $10 for Copilot. Maybe another $20 for Midjourney. And every prompt you type goes through someone else's server.&lt;/p&gt;

&lt;p&gt;Meanwhile, Alibaba just open-sourced a model that scores 73.4 on SWE-bench Verified — the benchmark where an AI autonomously reads a GitHub issue, understands the codebase, writes a fix, and runs the tests. That's frontier-level coding ability. And it runs on your gaming laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  the model
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B. It's a Mixture-of-Experts model: 35 billion parameters total, but only 3 billion active per token. For each token, the router activates 9 experts (8 routed + 1 shared) out of 256; the other 247 sit in memory but do no work.&lt;/p&gt;

&lt;p&gt;Result: it runs like a 3B model but thinks like a 30B+ model.&lt;/p&gt;

&lt;p&gt;Apache 2.0 license. No usage restrictions. No rate limits. No one reading your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  what your $0/month gets you
&lt;/h2&gt;

&lt;p&gt;Let's do the math on what you're replacing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT Plus ($20/month)&lt;/strong&gt; — Qwen3.6 scores 86.0 on GPQA Diamond (graduate-level reasoning), 83.6 on HMMT (Harvard-MIT Math Tournament), and handles 119 languages. It has vision built in — drag an image into the chat and ask questions about it. For most daily tasks, you won't notice a difference. For coding tasks, this model is arguably better than GPT-4 for the stuff you actually do (fixing bugs, writing functions, understanding codebases).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot ($10/month)&lt;/strong&gt; — 73.4 on SWE-bench means this model can autonomously fix real bugs in real repositories. 51.5 on Terminal-Bench means it can operate a terminal to solve coding tasks. With the right frontend, it functions as a full coding agent, not just autocomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud API costs&lt;/strong&gt; — no per-token pricing. Run it 24/7 on your own hardware. The model doesn't get slower during peak hours. It doesn't have outages. It doesn't change its behavior because the provider decided to add more safety filters.&lt;/p&gt;
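
&lt;p&gt;A practical bonus: Ollama also exposes an OpenAI-compatible endpoint, so most scripts and tools written against the paid APIs can be pointed at the local server instead. A hedged sketch, assuming the default port and the model tag used in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: reuse OpenAI-style client code against a local Ollama server.
# The api_key is ignored by a local instance but the client requires one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="qwen3.6:35b-a3b",  # hypothetical local tag from this post
    messages=[{"role": "user", "content": "Write a Python function that parses ISO 8601 dates."}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;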

&lt;h2&gt;
  
  
  the hardware you already own is enough
&lt;/h2&gt;

&lt;p&gt;This is the part that surprises people. With Q4_K_M quantization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 GB VRAM&lt;/strong&gt; (RTX 3060, RTX 4060): runs at 30+ tokens/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12-14 GB VRAM&lt;/strong&gt; (RTX 4070, RTX 3090): Q8 quantization, 20+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon M1/M2/M3&lt;/strong&gt;: runs great on unified memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you bought a GPU in the last 3-4 years, you probably have enough. The MoE architecture is the key — your GPU only processes 3B parameters per token regardless of the total model size.&lt;/p&gt;

&lt;h2&gt;
  
  
  the catch (being honest)
&lt;/h2&gt;

&lt;p&gt;There are trade-offs. You should know them before you cancel anything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No real-time internet access&lt;/strong&gt; — the model only knows what it was trained on. No "search the web" or "check the latest docs." You need to paste context manually or use RAG.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setup isn't zero&lt;/strong&gt; — you need Ollama or a similar runtime, and a frontend. It's not "open a browser tab and start typing." More like 10-15 minutes to set up if you've never done it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long context costs more locally&lt;/strong&gt; — 262K native context is great on paper, but processing 100K+ tokens on consumer hardware gets slow. Cloud APIs hide this cost from you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No multimodal generation&lt;/strong&gt; — Qwen3.6 can understand images (vision input) but can't generate them. For image generation you need a separate model (Stable Diffusion, Flux, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Updates are manual&lt;/strong&gt; — when a better model drops, you download and switch yourself. No silent upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For people who type "write me a poem" into ChatGPT twice a week, this is overkill. For developers, researchers, and anyone processing sensitive data — the trade-offs are overwhelmingly in favor of local.&lt;/p&gt;

&lt;h2&gt;
  
  
  the stack that replaces everything
&lt;/h2&gt;

&lt;p&gt;Here's what a complete local setup looks like in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat + reasoning&lt;/strong&gt;: Qwen3.6-35B-A3B (this article)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image generation&lt;/strong&gt;: Stable Diffusion 3.5, Flux, or SDXL via ComfyUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video generation&lt;/strong&gt;: Wan 2.1, FramePack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code completion&lt;/strong&gt;: same Qwen3.6, connected as a coding agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-text&lt;/strong&gt;: Whisper (runs on CPU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total cost after hardware you already own: $0/month. Forever.&lt;/p&gt;
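
&lt;p&gt;To make that concrete, here is a rough sketch of two pieces of the stack chained together: Whisper transcribes audio on CPU, then the transcript goes to the chat model through Ollama. The file name and model tag are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch of a local pipeline: speech-to-text with Whisper, summary with the chat model.
# Assumes `pip install openai-whisper requests`, ffmpeg on PATH, and an Ollama server on the default port.
import whisper
import requests

stt = whisper.load_model("base")                      # small Whisper model, fine on CPU
transcript = stt.transcribe("meeting.mp3")["text"]    # placeholder file name

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.6:35b-a3b",  # hypothetical tag from this post
        "prompt": "Summarise this meeting transcript in five bullet points:\n" + transcript,
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;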

&lt;p&gt;Or use a tool that bundles all of this. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; wraps Ollama + ComfyUI into one desktop app — chat, image gen, video gen, coding agent. v2.3.3 has Qwen3.6 day-0 support with vision and a full agent mode. AGPL-3.0, open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real question
&lt;/h2&gt;

&lt;p&gt;It's not "is local AI good enough yet?" — it passed that threshold months ago.&lt;/p&gt;

&lt;p&gt;The real question is: how much longer are you going to pay monthly fees to send your data to someone else's server when the same capability runs on hardware sitting under your desk?&lt;/p&gt;

&lt;p&gt;Qwen3.6 weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — open-source desktop app for local AI. Chat, coding agents, image gen, video gen. No cloud, no subscription. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>qwen3.6 scores 73.4 on SWE-bench with only 3B active parameters. here's why that matters.</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:43:39 +0000</pubDate>
      <link>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</link>
      <guid>https://dev.to/purpledoubled/qwen36-scores-734-on-swe-bench-with-only-3b-active-parameters-heres-why-that-matters-2fmp</guid>
      <description>&lt;p&gt;Alibaba just mass-released Qwen3.6 and the first model is already turning heads. Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35 billion total parameters — but only 3 billion are active at inference time.&lt;/p&gt;

&lt;p&gt;That means it runs on an 8GB GPU. And it just scored 73.4 on SWE-bench Verified.&lt;/p&gt;

&lt;p&gt;For context, Gemma4-31B — a dense model using all 31 billion parameters for every single token — scores 17.4 on the same benchmark. Qwen3.6 uses a tenth of the compute and scores four times higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture is genuinely different
&lt;/h2&gt;

&lt;p&gt;Most MoE models just slap a router on top of a standard transformer. Qwen3.6 does something more interesting.&lt;/p&gt;

&lt;p&gt;Three out of every four layers use &lt;strong&gt;Gated DeltaNet&lt;/strong&gt; — a linear attention mechanism that's significantly cheaper than standard attention. Only every fourth layer uses full Gated Attention with KV cache. This hybrid layout means you get near-full-attention quality at a fraction of the memory cost, especially on long contexts.&lt;/p&gt;

&lt;p&gt;The expert setup: 256 total experts, 8 routed + 1 shared active per token. That's where the 35B→3B compression comes from. Each token only touches the experts it needs.&lt;/p&gt;
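
&lt;p&gt;A toy sketch of what that routing looks like per token, using the numbers from this post. The real router and experts are learned weight matrices; this just shows why only a sliver of the weights gets computed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy illustration of MoE routing: for each token, a router scores 256 experts,
# the top 8 plus one always-on shared expert do the work, the other 247 are skipped.
# Purely illustrative; shapes and weights are random stand-ins.
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 64
rng = np.random.default_rng(0)

router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))
experts = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))  # tiny stand-in experts
shared_expert = rng.standard_normal((HIDDEN, HIDDEN))

def moe_layer(token_vec):
    scores = token_vec @ router_w                            # router scores per expert
    top = np.argsort(scores)[-TOP_K:]                        # indices of the 8 routed experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the winners
    out = token_vec @ shared_expert                          # shared expert always runs
    for w, idx in zip(gate, top):
        out = out + w * (token_vec @ experts[idx])           # only 8 of 256 experts computed
    return out

y = moe_layer(rng.standard_normal(HIDDEN))
print(y.shape)  # (64,) -- same output shape, roughly 3.5% of the expert weights touched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;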

&lt;p&gt;And it has &lt;strong&gt;vision built in&lt;/strong&gt;. Not bolted on — the model is natively multimodal (Image-Text-to-Text). MMMU score of 81.7, RealWorldQA at 85.3.&lt;/p&gt;

&lt;h2&gt;
  
  
  the benchmarks that matter
&lt;/h2&gt;

&lt;p&gt;I'm not going to dump every number. Here are the ones that actually tell you something:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench Verified: 73.4&lt;/strong&gt; — this is the "can you autonomously fix real GitHub issues" test. The model reads the issue, understands the codebase, writes a fix, and runs the tests. 73.4 means it successfully fixes nearly three out of four real-world bugs thrown at it. Its predecessor (Qwen3.5-35B-A3B) scored 70.0. Gemma4-31B scored 17.4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench 2.0: 51.5&lt;/strong&gt; — agentic terminal coding. Can the model operate a terminal to solve coding tasks? Qwen3.6 beats its predecessor (40.5), the dense Qwen3.5-27B (41.6), and Gemma4-31B (42.9). An 11-point jump over the previous version is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QwenWebBench: 1397 Elo&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. A 400+ Elo jump in one generation. For chess players: that's going from a club player to a titled player.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPQA Diamond: 86.0&lt;/strong&gt; — graduate-level science reasoning. This is the benchmark where PhD students in physics, chemistry, and biology try to answer questions outside their subfield and fail about half the time. 86.0 is competitive with models many times this size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCPMark: 37.0&lt;/strong&gt; — general agent benchmark testing MCP (Model Context Protocol) tool use. Predecessor scored 27.0. Gemma4-31B scored 36.3. This model was clearly trained with agentic tool calling in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  what 3B active parameters actually means for your hardware
&lt;/h2&gt;

&lt;p&gt;Here's the thing people keep getting wrong about MoE models. The total parameter count (35B) determines the model's knowledge capacity — how much it "knows." The active parameter count (3B) determines how much compute each token costs, which is what sets your tokens-per-second.&lt;/p&gt;

&lt;p&gt;So while the model file is large (it contains all 256 experts) and still has to fit in memory, at inference time only the 9 active experts per token are actually computed. The rest sit in memory doing nothing until the router picks them.&lt;/p&gt;

&lt;p&gt;Practical VRAM requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M quantized: ~6-8 GB&lt;/strong&gt; — runs on an RTX 3060 12GB at 30+ tok/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0 quantized: ~12-14 GB&lt;/strong&gt; — RTX 4070 territory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP8 official: ~35 GB&lt;/strong&gt; — RTX 4090 or A6000&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 full: ~70 GB&lt;/strong&gt; — multi-GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can run a 7B model, you can run this. The speed profile is similar to a 3B dense model, but the output quality is closer to a 30B+ dense model.&lt;/p&gt;

&lt;h2&gt;
  
  
  the real competition
&lt;/h2&gt;

&lt;p&gt;The model Qwen3.6 is really competing against isn't Gemma4-31B. It's proprietary models.&lt;/p&gt;

&lt;p&gt;73.4 on SWE-bench Verified puts it in the same ballpark as frontier closed-source models — except this one is Apache 2.0 licensed, runs on consumer hardware, and never sends your code to anyone's server.&lt;/p&gt;

&lt;p&gt;For coding specifically, the combination of high SWE-bench scores + strong terminal/agent capabilities + MCP support makes this arguably the best local coding model per compute dollar right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  how to actually run it
&lt;/h2&gt;

&lt;p&gt;The model just dropped so GGUF quantizations are still rolling out. Check HuggingFace for the latest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official weights: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FP8 variant: &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;Qwen/Qwen3.6-35B-A3B-FP8&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once GGUFs land, &lt;code&gt;ollama run qwen3.6:35b-a3b&lt;/code&gt; should work.&lt;/p&gt;

&lt;p&gt;For a full desktop setup with model management, vision support, and a built-in coding agent, &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; just shipped v2.3.3 with day-0 Qwen3.6 support. Open source, AGPL-3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  the bottom line
&lt;/h2&gt;

&lt;p&gt;3B active parameters scoring 73.4 on SWE-bench is the kind of efficiency gain that changes what's possible on consumer hardware. A year ago you needed a 70B+ dense model or API access for this level of coding capability. Now it runs on a gaming laptop.&lt;/p&gt;

&lt;p&gt;Apache 2.0. No strings attached.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally — chat, coding agents, image gen, video gen. AGPL-3.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size</title>
      <dc:creator>David </dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:20:08 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</link>
      <guid>https://dev.to/purpledoubled/how-to-run-qwen36-35b-a3b-locally-the-coding-moe-that-beats-models-10x-its-active-size-3pbh</guid>
      <description>&lt;p&gt;Qwen just released &lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; — the first model in their 3.6 series. It's a Mixture-of-Experts model with 35 billion total parameters but only 3 billion active at inference time.&lt;/p&gt;

&lt;p&gt;Translation: big-model quality at small-model speed. And this time it has vision built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this model matters
&lt;/h2&gt;

&lt;p&gt;The numbers speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;73.4 on SWE-bench Verified&lt;/strong&gt; — this is an agentic coding benchmark where the model autonomously fixes real GitHub issues. For reference, Gemma4-31B (a dense model with all 31B params active) scores 17.4. Qwen3.6 scores 4x higher with 10x fewer active parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;51.5 on Terminal-Bench 2.0&lt;/strong&gt; — agentic terminal coding. It beats Qwen3.5-27B (41.6), its own predecessor Qwen3.5-35B-A3B (40.5), and even Gemma4-31B (42.9).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1397 Elo on QwenWebBench&lt;/strong&gt; — frontend artifact generation. The predecessor scored 978. That's a 400+ Elo jump in one generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;86.0 on GPQA Diamond&lt;/strong&gt; — graduate-level science reasoning. Competitive with models many times its size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision support&lt;/strong&gt; — handles image-text-to-text tasks natively. MMMU score of 81.7, RealWorldQA at 85.3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full benchmark picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Qwen3.6-35B-A3B&lt;/th&gt;
&lt;th&gt;Qwen3.5-35B-A3B&lt;/th&gt;
&lt;th&gt;Gemma4-31B&lt;/th&gt;
&lt;th&gt;Qwen3.5-27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;17.4&lt;/td&gt;
&lt;td&gt;51.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.5&lt;/td&gt;
&lt;td&gt;42.9&lt;/td&gt;
&lt;td&gt;41.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Multilingual&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.2&lt;/td&gt;
&lt;td&gt;69.3&lt;/td&gt;
&lt;td&gt;60.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QwenWebBench (Elo)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1397&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;978&lt;/td&gt;
&lt;td&gt;1178&lt;/td&gt;
&lt;td&gt;1197&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;27.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCPMark&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;36.3&lt;/td&gt;
&lt;td&gt;15.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.2&lt;/td&gt;
&lt;td&gt;84.3&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU&lt;/td&gt;
&lt;td&gt;81.7&lt;/td&gt;
&lt;td&gt;81.4&lt;/td&gt;
&lt;td&gt;80.4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;This isn't just a bigger Qwen3.5. The architecture got meaningful upgrades:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gated DeltaNet attention&lt;/strong&gt; — 3 out of every 4 layers use linear attention (Gated DeltaNet) instead of standard attention. Only every 4th layer uses full Gated Attention. This makes it much more memory-efficient for long contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;256 experts, 9 active&lt;/strong&gt; — 8 routed + 1 shared expert active per token. That's where the "35B total, 3B active" comes from. Most of the model sits idle while only the relevant experts fire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision encoder built in&lt;/strong&gt; — it's a true multimodal model (Image-Text-to-Text), not a text model with a bolted-on adapter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking Preservation&lt;/strong&gt; — new feature that retains reasoning context from previous messages. Less overhead for iterative coding sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;262K native context&lt;/strong&gt; — extensible beyond that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0 license&lt;/strong&gt; — fully open, commercial use allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware requirements
&lt;/h2&gt;

&lt;p&gt;The beauty of MoE: per-token compute only touches the active parameters, not the total count, so it runs at the speed of a 3B model even though the full set of experts is much larger.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Expected speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M quant&lt;/td&gt;
&lt;td&gt;~6-8 GB&lt;/td&gt;
&lt;td&gt;30+ tok/s on RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0 quant&lt;/td&gt;
&lt;td&gt;~12-14 GB&lt;/td&gt;
&lt;td&gt;20+ tok/s on RTX 4070&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8 (official)&lt;/td&gt;
&lt;td&gt;~35 GB&lt;/td&gt;
&lt;td&gt;RTX 4090 or A6000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 full&lt;/td&gt;
&lt;td&gt;~70 GB&lt;/td&gt;
&lt;td&gt;Multi-GPU setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you can run a 7B model, you can run this. The 3B active parameter count is the number that matters for speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Ollama (easiest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for GGUFs to appear — usually within hours of release. Check &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; for the latest quantized versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: vLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm
vllm serve Qwen/Qwen3.6-35B-A3B &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
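
&lt;p&gt;vLLM serves an OpenAI-compatible API on port 8000 by default, so once the command above is running you can query it like any hosted endpoint. A minimal sketch; the prompt is just an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Query the vLLM server started above. It speaks the OpenAI chat-completions format
# on port 8000 by default; no API key is needed for a local instance.
import requests

r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": "Write a unit test for a function that reverses a string."}],
    },
    timeout=300,
).json()
print(r["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;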



&lt;h3&gt;
  
  
  Option 3: Transformers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
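
&lt;p&gt;To actually generate text, the usual chat-template flow applies. A short sketch that continues from the snippet above (text-only; image input would go through the model's processor, which isn't shown here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Continues from the loading snippet above: build a chat prompt, generate, decode.
messages = [{"role": "user", "content": "Write a function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;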



&lt;h3&gt;
  
  
  Option 4: Locally Uncensored (full GUI + model management)
&lt;/h3&gt;

&lt;p&gt;If you want a clean desktop app that handles downloading, model management, and chatting in one place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grab &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — it's open source (AGPL-3.0)&lt;/li&gt;
&lt;li&gt;v2.3.3 just shipped with day-0 Qwen3.6 support&lt;/li&gt;
&lt;li&gt;Download the model directly from the app, pick your quantization, and start chatting&lt;/li&gt;
&lt;li&gt;Vision works out of the box — drag and drop images into the chat&lt;/li&gt;
&lt;li&gt;The new Codex mode with live streaming is particularly nice for coding tasks with this model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LU also has agent mode with 13 tools, remote access from your phone, and a bunch of other stuff that pairs well with an agentic model like this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should care
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local AI coders&lt;/strong&gt; — if you use AI for coding and want to run it locally, this is now the best MoE option. 73.4 SWE-bench with 3B active params is absurd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-focused devs&lt;/strong&gt; — Apache 2.0, runs on consumer hardware, no data leaves your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal users&lt;/strong&gt; — built-in vision means one model for text AND image understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone running Qwen3.5-35B-A3B&lt;/strong&gt; — this is a straight upgrade. Same architecture class, better everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B is what happens when you optimize MoE properly. 3B active parameters shouldn't be this good, but here we are. The coding benchmarks in particular are hard to argue with — 73.4 on SWE-bench Verified puts it in the same league as much larger, closed-source models.&lt;/p&gt;

&lt;p&gt;Weights are on &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. FP8 variant &lt;a href="https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8" rel="noopener noreferrer"&gt;here&lt;/a&gt;. GGUFs incoming.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; is an open-source desktop app for running AI models locally with full privacy. Handles model downloads, chat, coding agents, image generation, and more. AGPL-3.0 licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Run GLM 4.7 Flash Locally with Ollama — 30B Quality at 3B Speed</title>
      <dc:creator>David </dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:28:13 +0000</pubDate>
      <link>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</link>
      <guid>https://dev.to/purpledoubled/how-to-run-glm-47-flash-locally-with-ollama-30b-quality-at-3b-speed-2lii</guid>
      <description>&lt;p&gt;ZhipuAI quietly dropped GLM 4.7 Flash and it's been blowing up — 830K+ downloads on HuggingFace, 1,600+ likes. The pitch: 30B-parameter MoE model with only 3B active parameters per token. Translation: you get 30B-class quality at the speed and VRAM cost of a 3B model.&lt;/p&gt;

&lt;p&gt;The benchmarks back it up. AIME 25: 91.6% (on par with GPT-class models). SWE-bench Verified: 59.2% (nearly 3x Qwen3-30B-A3B). And it's MIT licensed — commercial use, fine-tuning, whatever you want.&lt;/p&gt;

&lt;p&gt;I've been building a local AI desktop app (&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt;) and just added GLM 4.7 support. Here's how to run it locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install GLM 4.7 Flash with Ollama
&lt;/h2&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Ollama handles the download and quantization. Default is Q4_K_M which gives you the best quality-to-size ratio.&lt;/p&gt;

&lt;p&gt;If you want a specific quantization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7:q4_k_m    &lt;span class="c"&gt;# ~5 GB, recommended&lt;/span&gt;
ollama run glm4.7:q8_0      &lt;span class="c"&gt;# ~10 GB, higher quality&lt;/span&gt;
ollama run glm4.7:q2_k      &lt;span class="c"&gt;# ~3 GB, if VRAM is tight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why GLM 4.7 Flash Matters
&lt;/h2&gt;

&lt;p&gt;The MoE (Mixture of Experts) architecture is the key. The model has 30B total parameters but only activates 3B per token. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Token generation is fast — comparable to running a 3B dense model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt;: Only needs 6-8 GB for Q4 quantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: Reasoning and coding performance matches models 10x its active size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how it compares:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GLM 4.7 Flash (30B-A3B)&lt;/th&gt;
&lt;th&gt;Qwen3-30B-A3B&lt;/th&gt;
&lt;th&gt;GPT-OSS-20B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AIME 25&lt;/td&gt;
&lt;td&gt;91.6&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;91.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA&lt;/td&gt;
&lt;td&gt;75.2&lt;/td&gt;
&lt;td&gt;73.4&lt;/td&gt;
&lt;td&gt;71.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;59.2&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;34.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;τ²-Bench (agentic)&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;49.0&lt;/td&gt;
&lt;td&gt;47.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowseComp&lt;/td&gt;
&lt;td&gt;42.8&lt;/td&gt;
&lt;td&gt;2.29&lt;/td&gt;
&lt;td&gt;28.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agentic benchmarks are insane. τ²-Bench at 79.5 vs Qwen3's 49.0 — that's not a marginal improvement, that's a different league. This model was built for tool calling and multi-step reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  VRAM Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q2_K&lt;/strong&gt;: ~3-4 GB VRAM (or CPU-only with 8 GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt;: 6-8 GB VRAM — the sweet spot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q8_0&lt;/strong&gt;: 10-12 GB VRAM — if you have the room&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16&lt;/strong&gt;: 20+ GB — only for research/fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a GTX 1660 (6 GB) or better, Q4_K_M runs comfortably. On Apple Silicon with 16 GB unified memory, it flies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Mode with GLM 4.7
&lt;/h2&gt;

&lt;p&gt;This is where GLM 4.7 really shines. The model was specifically optimized for agentic tasks — it has a "Preserved Thinking" mode that keeps chain-of-thought reasoning active across multi-turn tool interactions.&lt;/p&gt;

&lt;p&gt;In practice: you give it a tool (web search, file read, code execution) and it actually uses it intelligently. The 59.2% SWE-bench score means it can navigate real codebases, understand context, and produce working patches — not just toy completions.&lt;/p&gt;
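
&lt;p&gt;At the API level, that tool loop looks roughly like this: you describe a tool as a JSON schema, the model decides whether to call it, and your code runs the call and feeds the result back. A hedged sketch against Ollama's chat endpoint; whether a given model tag actually emits tool calls depends on the model itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a single tool-call round trip through Ollama's chat API.
# The tool name and question are illustrative; a real agent loop would execute the
# requested call and send the result back to the model as a "tool" message.
import requests

read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file from the local project",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "glm4.7",
        "messages": [{"role": "user", "content": "What does config.yaml set the timeout to?"}],
        "tools": [read_file_tool],
        "stream": False,
    },
    timeout=300,
).json()

# If the model chose to call the tool, the call shows up on the message.
print(r["message"].get("tool_calls", r["message"]["content"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;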

&lt;p&gt;In Locally Uncensored, GLM 4.7 is auto-detected as an agent-capable model. Enable Agent mode in the UI and it gets access to web search, file operations, and code execution out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  GLM 4.7 vs the Competition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;vs Qwen3-30B-A3B&lt;/strong&gt;: Same architecture class (30B MoE, 3B active) but GLM 4.7 dominates on agentic and coding tasks. Qwen3 is better at pure math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Gemma 4 E4B&lt;/strong&gt;: Gemma 4 is smaller (4.5B effective) and faster, but GLM 4.7 has significantly better reasoning depth. If you need an agent that can handle complex multi-step tasks, GLM 4.7 wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs Llama 3.3 70B&lt;/strong&gt;: Llama needs 3-4x the VRAM for similar coding performance. GLM 4.7 is the efficiency play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the Catch?
&lt;/h2&gt;

&lt;p&gt;Honestly, not much. A few things worth knowing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chinese-English bilingual&lt;/strong&gt; — Trained on both, works great in both. If you only need English, it's still excellent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt; — Supports up to 128K tokens. More than enough for most use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT license&lt;/strong&gt; — Fully open. No restrictions on commercial use, modification, or redistribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main caveat: if you want vision/multimodal, GLM 4.7 Flash is text-only. Look at GLM-4V or Gemma 4 for image input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run glm4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you want a full desktop UI with agent mode, image gen, and A/B model comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — free, open source, AGPL-3.0. Single .exe/.AppImage, no Docker needed. GLM 4.7 is in the recommended models list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running GLM 4.7 on your hardware? I'd love to hear your tok/s numbers and use case. Drop a comment.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://locallyuncensored.com" rel="noopener noreferrer"&gt;Locally Uncensored&lt;/a&gt; — AGPL-3.0 licensed. &lt;a href="https://github.com/PurpleDoubleD/locally-uncensored" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>ollama</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
