Ollama v0.30.0, Qwen3.5 35B, & 1-bit Multimodal AI on WebGPU

#ai #llm #selfhosted

Today's Highlights

This week, the Ollama v0.30.0 pre-release hints at improved llama.cpp interoperability, while a new Qwen3.5 35B model ships in several quantization formats for local inference. For accessibility, the standout is PrismML's Bonsai Image 4B, which runs 1-bit text-to-image diffusion directly in the browser via WebGPU.

Qwen3.5 35B A3B Uncensored Heretic: GGUF & GPTQ Formats Available (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1tnzalm/qwen35_35b_a3b_uncensored_heretic_native_mtp/

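This release is an "uncensored heretic" variant of Qwen3.5 35B A3B, and it adds another open-weight option for local inference. Following Qwen's naming convention, "35B A3B" indicates a mixture-of-experts model with roughly 35 billion total parameters, of which about 3 billion are active per token. The post emphasizes that the model keeps its "native MTP" weights, reportedly retaining all 785 of them. MTP here most likely stands for multi-token prediction: extra prediction heads, trained alongside the main model, that can draft several tokens ahead for speculative decoding. Modified releases do not always keep these weights, so preserving them should mainly benefit generation speed on runtimes that support MTP, rather than conversational quality.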

The model is distributed in several formats: raw Safetensors, GGUF quantizations (including NVFP4 GGUFs), and GPTQ-Int4. Quantization is what makes a 35-billion-parameter model practical on consumer-grade GPUs, since 4-bit weights need roughly a quarter of the memory of 16-bit weights, so more of the community can experiment without professional-tier hardware. This variety also eases integration with popular local inference tools such as llama.cpp (which consumes GGUF) and text-generation-webui, giving self-hosted deployments more flexibility.
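To make the memory argument concrete, here is a back-of-the-envelope sketch of weight storage at different precisions. The helper name weight_size_gb is made up for this example, and the figures are idealized: real GGUF and GPTQ files also store scales and metadata and often keep some layers at higher precision, so actual downloads will be somewhat larger.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes).

    Idealized estimate: ignores quantization scales, metadata,
    and any layers kept at higher precision.
    """
    return params * bits_per_weight / 8 / 1e9


# 35e9 is the nominal total parameter count of a "35B" model.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_size_gb(35e9, bits):.1f} GB")
```

At 4 bits the weights alone come to about 17.5 GB, against about 70 GB at 16 bits, which is the difference between fitting on a high-end consumer GPU and not fitting at all.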

Comment: This release offers a substantial open-weight model in optimized formats, making it accessible for users with beefier consumer GPUs to explore more advanced capabilities locally.

PrismML Releases Bonsai Image 4B: 1-bit/Ternary Vision on WebGPU (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1togflk/prismml_just_released_binary_and_ternary_bonsai/

PrismML has released Binary and Ternary Bonsai Image 4B, two text-to-image models for local use. Both are diffusion transformers with extremely low-precision weights: the binary variant uses 1-bit weights and the ternary variant restricts each weight to one of three values, which sharply reduces model size and memory use. At approximately 3GB, they are far smaller than comparable models such as FLUX.2 Klein 4B (~16GB), which makes them a good fit for constrained environments, including consumer GPUs.
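The source does not say how PrismML stores its weights, so the following is only an illustration of why ternary weights are so compact. One common scheme, assumed here rather than taken from the release, packs five ternary values into a single byte in base 3 (3^5 = 243 fits in 256), for 1.6 bits per weight. The function names are hypothetical.

```python
def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five per byte (base 3)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        chunk = weights[i:i + 5]
        value = 0
        # Most significant digit first; map -1/0/+1 to base-3 digits 0/1/2.
        for w in chunk:
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w!r}")
            value = value * 3 + (w + 1)
        # Pad a short final chunk with zero weights (digit 1).
        for _ in range(5 - len(chunk)):
            value = value * 3 + 1
        out.append(value)  # maximum is 242, so it always fits in a byte
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; count is the original number of weights."""
    weights = []
    for byte in data:
        digits = []
        for _ in range(5):
            byte, digit = divmod(byte, 3)  # least significant digit first
            digits.append(digit - 1)
        weights.extend(reversed(digits))
    return weights[:count]


weights = [-1, 0, 1, 1, -1, 0, 0]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
print(f"{len(weights)} weights stored in {len(packed)} bytes")
```

Production kernels typically also store per-block scale factors and choose layouts that are cheap to decode on the GPU, so this shows the storage idea only, not a usable inference format.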

Their standout feature is that they run entirely locally in a web browser through WebGPU. That removes the need for a complex local setup or a dedicated high-end GPU, although the browser and device must still support WebGPU. The models are released under the Apache-2.0 license and are available on Hugging Face, so developers and enthusiasts can experiment with browser-based image generation on a broad range of consumer devices. The release shows how extreme quantization and WebGPU together can make image generation models far more accessible.

Comment: Achieving 1-bit diffusion directly in the browser with WebGPU is a significant leap for local, accessible multimodal AI, dramatically lowering hardware barriers.

Ollama v0.30.0 Pre-release Hints at `llama.cpp` Interoperability (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1tnomhq/ollama_v0300_prerelease/

The Ollama v0.30.0 pre-release points to upcoming changes for users of the popular local AI runtime. Detailed release notes for the full update are still pending, but the community discussion around the pre-release centers on one anticipated feature: improved interoperability with llama.cpp. This would address a long-standing pain point for users who work with both Ollama and plain llama.cpp.

Today, users often download and store the same model multiple times, in different formats or directories, to satisfy each tool. The discussion mentions achieving "interoperability... without downloading the same models multiple times or relying on improvised workarounds", which suggests that Ollama may be moving toward more unified model management, possibly by standardizing its internal formats or by converting or linking GGUF files directly. If so, this would streamline workflows and reduce disk usage for developers and enthusiasts running open-weight models locally, and make the ecosystem more cohesive.
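The duplication problem described above is easy to measure on your own machine. The sketch below is not part of Ollama or llama.cpp and says nothing about how v0.30.0 will work; it simply scans whatever directories you pass in and reports files with identical content, using a size check first so that large model files are only hashed when a duplicate is plausible.

```python
import hashlib
import sys
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte models fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Return {sha256: [paths]} for files whose content appears more than once."""
    by_size = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:  # only hash files that share a size
            for path in paths:
                by_hash[sha256_of(path)].append(path)
    return {h: sorted(p) for h, p in by_hash.items() if len(p) > 1}


if __name__ == "__main__":
    # Pass the model directories to compare as command-line arguments.
    for digest, paths in find_duplicates(sys.argv[1:]).items():
        print(digest[:12], *paths, sep="\n  ")
```

Files that are already hard-linked or symlinked to each other will also show up as duplicates here, since the script compares content rather than inodes.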

Comment: Improved llama.cpp interoperability in Ollama would simplify model management, allowing users to leverage GGUF models across both ecosystems more efficiently without redundancy.

DEV Community
