llama.cpp supports Sparse MoE, new Qwen3.6 GGUF, & WebWorld for local agents
Today's Highlights
Today's local AI news features a significant llama.cpp update adding support for Xiaomi's Mimo v2.5 Sparse MoE model, enhancing architectural diversity for local inference. Additionally, a new uncensored Qwen3.6 27B model has been released in GGUF, alongside a Qwen3-based WebWorld series for local web agent development.
llama.cpp Adds Support for Xiaomi's Mimo v2.5 Sparse MoE Model (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t67lvx/feat_add_mimo_v25_model_support_by_aessedai_pull/
The popular llama.cpp project, a C/C++ inference engine for LLMs, has merged a pull request adding support for Xiaomi's MiMo-V2.5 model. MiMo-V2.5 is a Sparse Mixture of Experts (MoE) model with 310 billion total parameters, of which only 15 billion are active per token during inference. The integration lets enthusiasts and developers experiment with large, modern MoE architectures directly in llama.cpp on local hardware: because only a small fraction of the weights participate in each forward pass, MoE models typically match the quality of similarly sized dense models at a far lower per-token compute cost, making them more feasible on consumer-grade GPUs (though the full weight set must still fit in memory).
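To make the sparse-MoE idea concrete, here is a minimal Python sketch of top-k expert routing (the function names and toy experts are illustrative, not MiMo's or llama.cpp's actual implementation): a router scores every expert, but only the k highest-scoring experts are ever evaluated, which is why a 310B-total model can cost roughly what a 15B dense model does per token.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Sparse MoE layer sketch: route input x to the top_k experts only.

    experts        -- list of callables, each mapping a vector to a vector
    router_weights -- one score-weight row per expert (toy linear router)
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    # Select the top_k experts; the rest are never evaluated (the "sparse" part).
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize gate weights over the chosen experts and mix their outputs.
    total = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        gate = probs[i] / total
        out = [o + gate * yi for o, yi in zip(out, experts[i](x))]
    return out, chosen
```

In a real model the router and experts are learned neural networks and routing happens per token per layer, but the control flow, score, pick top-k, run only those, is the same.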
Comment: This is a fantastic update for llama.cpp users. Running a 310B MoE model (even if only 15B are active) locally with llama.cpp is a testament to its optimization, and it's exciting to see more diverse architectures supported.
New Qwen3.6 27B Heretic v2 Model Released in GGUF & NVFP4 Quantizations (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t5yajb/qwen36_27b_uncensored_heretic_v2_native_mtp/
A new iteration of the Qwen3.6 model, named "Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved," has been released, providing an uncensored option for local AI enthusiasts. The release reports a Kullback-Leibler divergence (KLD) of just 0.0021 against the base model, suggesting the uncensoring process barely shifted its output distribution, along with only 6 refusals out of 100 prompts. Crucially for local inference, it is available in several practical formats: Safetensors, GGUF (for llama.cpp and Ollama), and NVFP4. The GGUF releases in particular enable efficient quantized inference on consumer GPUs, making this 27B model accessible to a broad audience for applications where an unfiltered, capable language model is desired.
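For readers unfamiliar with the KLD figure quoted above: it measures how far the modified model's next-token distribution has drifted from the original's, and a value like 0.0021 is very small. A minimal sketch of the computation (toy four-token distributions, not the evaluation harness the release actually used):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
base     = [0.70, 0.20, 0.07, 0.03]   # original model
modified = [0.69, 0.21, 0.07, 0.03]   # uncensored variant, barely shifted

print(f"KLD = {kl_divergence(base, modified):.4f}")
```

Identical distributions give a KLD of exactly 0, and nearly identical ones (as above) give values well under 0.001, which is the sense in which 0.0021 indicates the finetune stayed close to the base model.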
Comment: An uncensored 27B Qwen model in GGUF is a big win for local privacy and flexibility. The reported low refusal rate and MTP preservation make it very appealing for self-hosted creative and analytical tasks.
Qwen3-Based WebWorld Models Released for Local Web Agent Development (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t6c6vs/qwenwebworld_32b14b8b_qwen3_finetune/
The "WebWorld" series introduces a set of large-scale open-web world models built on Qwen3, specifically designed for training and evaluating web agents. These models are fine-tuned on over 1 million real-world web interaction trajectories, utilizing a scalable hierarchical data collection and training pipeline. The availability of multiple parameter sizes – 32B, 14B, and 8B – makes this series highly versatile for local deployment, catering to users with varying GPU memory capacities. WebWorld models aim to equip local LLMs with enhanced capabilities for navigating and interacting with web environments, pushing the boundaries of what can be achieved with self-hosted AI for automated web tasks and research.
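A quick back-of-the-envelope way to match the three sizes to GPU memory (a rough sketch: 4.5 bits per weight approximates a mid-range 4-bit GGUF quant, and real usage adds KV-cache and runtime overhead on top):

```python
def quantized_weight_gb(params_billions, bits_per_weight=4.5):
    # Weight storage only: params * bits / 8, expressed in GB (1e9 bytes).
    return params_billions * bits_per_weight / 8

for size in (32, 14, 8):
    print(f"{size:>2}B @ 4.5 bpw ~ {quantized_weight_gb(size):.1f} GB of weights")
```

By this estimate the 8B fits comfortably on an 8 GB card, the 14B targets 12 GB class GPUs, and the 32B wants 24 GB, which is presumably why the series ships in all three sizes.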
Comment: This Qwen3 finetune for web agents is incredibly practical, especially with the multiple sizes. It directly enables advanced local applications and is exactly the kind of open-weight utility model we look for.