DEV Community

soy

Posted on • Originally published at media.patentllm.org

Qwen3.6 MoE, WritHer Offline AI, & llama.cpp Benchmarks Lead Local AI News


Today's Highlights

This week, the open-source Qwen3.6-35B-A3B MoE model landed with strong multimodal and agentic coding capabilities and significant local inference potential. Meanwhile, WritHer, a new offline voice assistant for Windows, integrates Whisper and Ollama, and fresh llama.cpp benchmarks of Qwen3.6 highlight KV cache optimizations.

Qwen3.6-35B-A3B Open-Source MoE Model Released (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sn3izh/qwen3635ba3b_released/

Qwen3.6-35B-A3B, a new sparse Mixture-of-Experts (MoE) model, has been released under the Apache 2.0 license, making it openly available for local deployment. This model boasts 35 billion total parameters with only 3 billion active parameters, offering a compelling balance between performance and computational efficiency. Key highlights include strong agentic coding capabilities, reportedly on par with models ten times its active size, and robust multimodal perception.
The release of Qwen3.6 represents a significant step forward for developers and enthusiasts focused on running advanced AI models on consumer-grade hardware. Its MoE architecture is designed to provide high-quality outputs with lower inference costs compared to dense models of similar total parameter count. The multimodal capabilities expand its utility beyond traditional text-only applications, allowing for more complex interactions and understanding. Developers can access the model via Hugging Face, with discussions already surfacing on optimizing its performance for local inference engines like llama.cpp.
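To make the "35B total, 3B active" distinction concrete, here is a minimal sketch of sparse MoE top-k routing: a router scores every expert per token, but only the top-k experts actually run, so most parameters stay idle on any given forward pass. The expert count and k below are hypothetical placeholders, not Qwen3.6's actual configuration.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top_idx = np.argsort(router_logits)[::-1][:k]   # indices of the chosen experts
    top_logits = router_logits[top_idx]
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()                        # mixing weights over the k experts
    return top_idx, weights

rng = np.random.default_rng(0)
num_experts, k = 64, 4                              # hypothetical counts
logits = rng.normal(size=num_experts)
experts, weights = route_top_k(logits, k)
# Only k of num_experts experts fire for this token; their weights sum to 1.
print(len(experts), float(weights.sum()))
```

The compute saving is the same idea at model scale: only the selected experts' feed-forward weights are touched per token, which is why a 35B-parameter MoE can run with roughly the per-token cost of a 3B dense model.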

Comment: This MoE architecture with only 3B active params is a game-changer for local machines. The multimodal aspect means it's not just a chat bot but can tackle more complex real-world tasks locally.

WritHer: 100% Offline Voice Assistant & Dictation for Windows (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1sne3el/writher_100_offline_voice_assistant_dictation_for/

WritHer is a newly released open-source project designed to deliver a fully offline voice assistant and dictation experience for Windows users. This privacy-focused tool integrates OpenAI's Whisper for robust speech-to-text capabilities with Ollama, enabling local large language model (LLM) inference. Users can leverage WritHer for seamless dictation, voice commands, and an interactive AI assistant without any internet dependency, addressing significant privacy concerns often associated with cloud-based AI services.
The project emphasizes ease of use for self-hosting enthusiasts. By combining Whisper for accurate transcription and Ollama for powerful local LLM processing, WritHer demonstrates a practical application of running advanced AI components entirely on consumer hardware. This approach not only enhances data privacy but also offers low-latency responses, making it ideal for creative writing, coding, and general productivity tasks where an internet connection might be unstable or undesirable. Its open-source nature invites community contributions and further customization, solidifying its place as a valuable tool in the local AI ecosystem.
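The Whisper-to-Ollama glue is conceptually simple: transcribe audio locally, then post the transcript to Ollama's local HTTP API. The sketch below builds a request body for Ollama's documented /api/chat endpoint; the model name and system prompt are placeholders, not WritHer's actual values.

```python
import json

# Ollama's default local endpoint; no data leaves the machine.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(transcript: str, model: str = "llama3") -> str:
    """Wrap a dictated transcript in an Ollama chat request body."""
    payload = {
        "model": model,
        "stream": False,  # wait for the full reply rather than streaming tokens
        "messages": [
            {"role": "system", "content": "You are a local dictation assistant."},
            {"role": "user", "content": transcript},
        ],
    }
    return json.dumps(payload)

body = build_chat_request("summarize yesterday's notes")
print(json.loads(body)["model"])
```

Sending it is one HTTP POST to OLLAMA_CHAT_URL with this body; swapping in a Whisper transcription call upstream completes the offline loop.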

Comment: WritHer is exactly what many of us want: a fully local, privacy-preserving voice assistant. Integrating Whisper with Ollama for Windows makes it incredibly accessible for everyday use.

Qwen 3.6 MoE vs Qwen 3.5 MoE Comparison on Local Inferencing (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sn9iff/comparison_qwen_36_35b_moe_vs_qwen_35_35b_moe_on/

A recent comparison on r/LocalLLaMA delves into the performance differences between the newly released Qwen3.6 35B MoE and its predecessor, Qwen 3.5 35B MoE, specifically when run on local hardware using llama.cpp. The benchmark focuses on a practical task: generating a web application from a research paper, with reasoning capabilities turned off to isolate core generation performance. The results highlight the iterative improvements in the Qwen series, providing valuable insights for users considering an upgrade or evaluating the efficiency of MoE architectures for their self-hosted AI setups.
The comparison specifically notes the configuration settings, emphasizing the importance of preserve_thinking for optimal KV cache reuse in Qwen3.6. This practical tip, often overlooked, can significantly impact inference speed and memory footprint, making it a crucial detail for local deployment. Such direct, user-contributed benchmarks are vital for the local AI community, offering real-world performance data and configuration advice that complements official model releases. They underscore the ongoing efforts to maximize efficiency and capability of open-weight models on consumer GPUs, demonstrating the tangible benefits of advancements in model architecture and inference engines like llama.cpp.
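The KV cache point can be illustrated with a toy example: llama.cpp can skip recomputing keys and values only for the leading tokens that exactly match the previous request. Stripping thinking blocks between turns rewrites that prefix and invalidates the cache, while preserving them keeps it intact. The token lists below are stand-ins, not real Qwen tokenizations.

```python
def reusable_prefix(prev: list[int], curr: list[int]) -> int:
    """Count leading tokens shared with the previous request (cache hits)."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

turn1 = [1, 2, 3, 90, 91, 92, 4, 5]      # 90-92 stand in for thinking tokens
turn2_kept = turn1 + [6, 7]              # thinking preserved: prefix intact
turn2_stripped = [1, 2, 3, 4, 5, 6, 7]   # thinking stripped: prefix diverges early

print(reusable_prefix(turn1, turn2_kept))      # 8 — whole previous turn reused
print(reusable_prefix(turn1, turn2_stripped))  # 3 — most of the cache recomputed
```

This is why a flag that keeps thinking tokens in the context, rather than pruning them, translates directly into faster multi-turn inference on local hardware.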

Comment: The direct comparison of Qwen 3.6 with 3.5 on llama.cpp provides concrete performance data. The preserve_thinking tip is golden for optimizing KV cache and getting better speeds on local hardware.
