Qwen 3.6 Ollama Release, Consumer GPU Benchmarks, GGUF Quantization Fixes
Today's Highlights
Today's local AI news highlights the official release of Qwen 3.6 models on Ollama, offering easy access to the new MoE architecture at various quantization levels. Developers are also sharing critical performance optimizations for Qwen 3.6 on consumer hardware and novel techniques to enhance GGUF quantization quality.
New on Ollama: batiai/qwen3.6-35b — full Qwen 3.6 lineup with tools + thinking (r/Ollama)
Source: https://reddit.com/r/ollama/comments/1soyu4s/new_on_ollama_batiaiqwen3635b_full_qwen_36_lineup/
This update announces the immediate availability of the new Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model on the Ollama platform, hosted under the batiai/ namespace. Users can now easily pull and run various quantized versions of Qwen 3.6, which are specifically tailored for efficient local inference on diverse consumer hardware, with a particular focus on Mac systems with varying RAM capacities.
The release prominently features iq3 (13 GB, suitable for 16 GB Macs) and iq4 (18 GB, for 24 GB Macs) quantization levels, making the Qwen 3.6 architecture practical for a wider range of users running models locally. The integration into Ollama streamlines deploying and experimenting with cutting-edge open-weight models, furthering the platform's role in the self-hosted AI ecosystem. The models are also noted to include "tools + thinking" capabilities, suggesting support for agentic workflows out of the box.
This release directly addresses the growing demand for user-friendly access to high-performance open-weight models on personal machines, making it simpler for developers and enthusiasts to leverage Qwen 3.6 for their projects without relying on cloud-based services. The emphasis on Mac-first tuning is particularly beneficial for that segment of the local AI community.
Comment: This is a big one for Ollama users. Qwen 3.6’s MoE architecture with these optimized quantizations means I can now run a more capable, instruction-tuned model locally on my MacBook Pro for coding tasks, directly pulling it with ollama pull.
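Trying one of these builds follows the standard Ollama workflow. The exact tag names below are assumptions inferred from the quantization labels in the post (`iq3`/`iq4`); check the batiai/ namespace listing for the actual tags before pulling:

```
# Pull the iq4 build (~18 GB, sized for 24 GB Macs). The ":iq4" tag is
# an assumption -- browse the batiai/ namespace for the real tag names.
ollama pull batiai/qwen3.6-35b:iq4

# Start an interactive session once the download completes.
ollama run batiai/qwen3.6-35b:iq4
```

On a 16 GB Mac, the smaller iq3 build (13 GB) leaves more headroom for context and the OS; the iq4 build trades that headroom for higher quantization fidelity.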
RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_5070_ti_9800x3d_running_qwen3635ba3b_at_79_ts/
A notable achievement in local inference performance has been reported, showcasing the Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model running efficiently on a consumer-grade hardware setup. The user successfully achieved a generation speed of 79 tokens per second (t/s) while utilizing a very large 128K context window, all on an RTX 5070 Ti GPU paired with a 9800X3D CPU.
The critical insight from this benchmark is the impact of the --n-cpu-moe flag. In llama.cpp, this flag keeps the MoE expert weights of the first N layers in system RAM while the attention and shared layers run on the GPU. This hybrid split sidesteps the VRAM limits that MoE models usually hit on consumer GPUs, allowing significantly higher throughput and deeper context handling than such hardware would otherwise support.
This finding is invaluable for the local AI community, particularly those working with MoE architectures. It demonstrates that with precise configuration and optimal hardware utilization, high-context, high-speed inference is not only possible but highly performant on readily available consumer hardware. Such optimizations are crucial for advancing the capabilities of self-hosted LLMs and making advanced models more practical for everyday use.
Comment: Finding the right flags for MoE models is crucial for performance on my setup. The --n-cpu-moe tip for Qwen3.6 is exactly the kind of optimization detail that makes a difference between barely running a model and actually using it productively for 128K context.
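A llama.cpp invocation along the lines of the post might look like the sketch below. The model filename and the --n-cpu-moe value of 20 are illustrative assumptions, not the poster's exact settings; the usual approach is to start high and lower the count until the remaining weights fit in VRAM:

```
# Hybrid MoE offload in llama.cpp: -ngl pushes all layers to the GPU,
# while --n-cpu-moe keeps the expert (FFN) weights of the first N layers
# in system RAM. Filename and the value 20 are illustrative -- tune the
# layer count to your VRAM.
llama-server \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  -c 131072 \
  -ngl 99 \
  --n-cpu-moe 20
```

Because the A3B model only activates a small fraction of its experts per token, the CPU-side expert lookups cost far less than full CPU offload would, which is why throughput stays high even with most expert weights outside VRAM.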
Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1sp2l72/qwen3635ba3buncensoredwassersteingguf/
This news item announces a significant technical improvement in the quality of quantized GGUF models, specifically addressing the Qwen 3.6-35B-A3B model. The developer has identified and implemented a solution to fix the "ssm_conv1d tensor drift" issue, a common problem that can degrade the performance and accuracy of models after quantization. This drift often leads to noticeable discrepancies between the full-precision model's output and its quantized counterpart.
The proposed solution involves leveraging the Wasserstein metric (W1), a mathematical distance measure, during the quantization process. By applying this metric, the developer has found a way to minimize the drift in critical tensors, resulting in GGUF models that maintain higher fidelity to the original unquantized model. This improvement translates directly into more reliable and capable local inference, as the compressed models perform closer to their full-size counterparts.
For the local AI community, where GGUF is a foundational format for running large language models on consumer hardware, this development is crucial. Enhancing the quality and stability of quantized models directly addresses a core challenge in local inference, making advanced open-weight models like Qwen 3.6 more robust and trustworthy for various applications, from creative writing to complex coding tasks.
Comment: Tensor drift has been a hidden problem in many quantized models, reducing their real-world effectiveness. Using the Wasserstein metric to stabilize ssm_conv1d tensors in GGUF is a clever fix that could significantly improve the quality of future local inference models, making them much more reliable.
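The post does not spell out the exact algorithm, but the core idea of Wasserstein-guided quantization can be sketched as follows: instead of scaling a tensor by its max-abs value alone, scan candidate scales and keep the one whose dequantized weights are closest to the originals in empirical W1 distance. Everything below (function names, the candidate-scan strategy, the synthetic stand-in tensor) is an illustrative assumption, not the developer's actual implementation:

```python
import numpy as np

def w1_distance(a, b):
    # Empirical 1-Wasserstein distance between two equal-sized samples:
    # the mean absolute difference of their sorted values.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def quantize(w, scale, bits=4):
    # Symmetric integer quantization at a given scale, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def w1_guided_scale(w, bits=4, candidates=50):
    # Scan scales below the max-abs baseline and keep the one whose
    # dequantized weights minimize W1 drift from the originals.
    base = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    best_scale, best_d = base, w1_distance(w, quantize(w, base, bits))
    for f in np.linspace(0.5, 1.0, candidates):
        s = base * f
        d = w1_distance(w, quantize(w, s, bits))
        if d < best_d:
            best_scale, best_d = s, d
    return best_scale, best_d

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)  # synthetic stand-in for an ssm_conv1d tensor
naive_d = w1_distance(w, quantize(w, np.max(np.abs(w)) / 7))
scale, guided_d = w1_guided_scale(w)
print(f"naive W1 drift:  {naive_d:.6f}")
print(f"guided W1 drift: {guided_d:.6f}")
```

The guided scale can never do worse than the max-abs baseline, since the baseline is included in the candidate scan; for bell-shaped weight distributions a slightly smaller scale typically wins, trading tail clipping for finer resolution where most of the mass sits.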