BeeLlama.cpp enhances llama.cpp, Qwen 35B hits 128K context, iOS local LLMs with Ollama
Today's Highlights
This week sees major advancements in local inference, with a new llama.cpp fork enhancing performance and multimodal capabilities. Additionally, a powerful Qwen model demonstrates high-context processing on consumer GPUs, and an open-source iOS app enables on-device LLM inference.
BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t88zvv/beellamacpp_advanced_dflash_turboquant_with/
A new fork of llama.cpp, dubbed BeeLlama.cpp, has emerged, focusing on advanced optimizations and expanded capabilities for local LLM inference. This project introduces DFlash and TurboQuant techniques, promising significant acceleration—up to 2-3x faster than the baseline llama.cpp, achieving peak speeds of 135 tokens per second.
Key features include support for large context windows, demonstrated by running Qwen 3.6 27B Q5 with 200,000 tokens of context on a single RTX 3090. Crucially, BeeLlama.cpp also incorporates vision capabilities, making it a step forward for multimodal models on consumer hardware. The fork aims to provide a Windows-friendly inference environment with speculative decoding, enabling high context processing without excessive quantization, thereby balancing performance and model fidelity. This development is vital for users looking to push the boundaries of what's achievable with local LLMs on readily available GPUs.
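As a rough illustration of the long-context setup described above, here is a minimal sketch using the upstream llama-cpp-python bindings. The model filename, exact context size, and prompt are assumptions, and BeeLlama.cpp's fork-specific DFlash/TurboQuant switches are not exposed through this API.

```python
# Minimal sketch (assumptions noted inline): loading a long-context Q5 GGUF
# on a single 24 GB GPU with the upstream llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q5_k_m.gguf",  # hypothetical filename
    n_ctx=200_000,       # the large context window claimed in the post
    n_gpu_layers=-1,     # offload all layers to the GPU
    flash_attn=True,     # flash attention keeps attention memory manageable
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the document above in five bullets."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Fitting a context of this size in 24 GB typically also relies on quantizing the KV cache (upstream llama.cpp exposes this via the --cache-type-k and --cache-type-v options), which trades a little fidelity for memory.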
Comment: This fork brings impressive speed and multimodal features to llama.cpp, making it possible to run large context vision models efficiently on a single GPU. It's a game-changer for accessible advanced local AI.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t82zxv/80_toksec_and_128k_context_on_12gb_vram_with/
The open-source community is abuzz with the impressive performance of the Qwen3.6 35B A3B model, which shows remarkable speed and context handling on consumer-grade GPUs with limited VRAM. Users report speeds of 80 tokens per second and context windows of up to 128,000 tokens on systems with just 12GB of VRAM, such as an RTX 3060. This is achieved with recent llama.cpp builds that support Multi-Token Prediction (MTP), a speculative-decoding technique in which the model drafts several tokens per forward pass and verifies them, cutting the number of full decoding steps.
The Qwen3.6 35B A3B model itself is a significant release, featuring a Mixture-of-Experts (MoE) architecture; the A3B suffix indicates roughly 3 billion active parameters per token, which is what keeps inference fast despite the 35B total parameter count. It is distributed in Safetensors, GGUF, NVFP4 GGUF, and GPTQ-Int4 formats, ensuring broad compatibility and efficiency across different hardware setups. Its performance on 12GB of VRAM makes large language models more accessible than ever, pushing the envelope for self-hosted AI applications and showcasing the power of optimized quantization and acceleration techniques like MTP.
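For readers who want to sanity-check throughput numbers like these themselves, the sketch below queries a locally running llama-server (llama.cpp's bundled HTTP server, which exposes an OpenAI-compatible /v1/chat/completions endpoint) and computes a rough tokens-per-second figure. The port, model name, and prompt are assumptions; MTP and the 128K context window are configured when the server is launched, not from this client.

```python
# Rough throughput check against a local llama.cpp server (llama-server).
# Port, model name, and prompt are assumptions; speculative decoding / MTP
# and the context size are set server-side at launch time.
import time
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "qwen3.6-35b-a3b",  # illustrative model name
    "messages": [{"role": "user", "content": "Explain KV-cache quantization briefly."}],
    "max_tokens": 512,
    "stream": False,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

completion_tokens = resp.get("usage", {}).get("completion_tokens", 0)
print(resp["choices"][0]["message"]["content"])
print(f"~{completion_tokens / elapsed:.1f} tok/s over {elapsed:.1f}s")
```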
Comment: Achieving 80 tok/sec with 128K context on a 12GB card using Qwen 3.6 35B and llama.cpp is phenomenal. It proves that powerful LLMs can run very effectively on common consumer hardware.
Open sourced an iOS app that runs LLMs on-device with llama.cpp, and lets you plug in your own Ollama for automatic health insights from HealthKit (r/Ollama)
Source: https://reddit.com/r/ollama/comments/1t889r4/open_sourced_an_ios_app_that_runs_llms_ondevice/
A new open-source iOS application named Priv AI has been released, enabling users to run large language models entirely on their iPhone devices. This app leverages llama.cpp, a leading library for efficient local inference, to execute popular GGUF models such as SmolLM2, Qwen 2.5, Llama 3.2, and Gemma directly on the phone's hardware, ensuring privacy and offline functionality.
Beyond basic inference, Priv AI integrates with Apple's HealthKit, allowing the local LLM to generate automatic health insights based on personal health data. Users also have the flexibility to connect the app to their self-hosted Ollama instances, further extending its capabilities and allowing access to a wider range of models or custom configurations. This development marks a significant step towards truly private and portable AI, providing a practical example of how local AI can enhance personal applications without reliance on cloud services.
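To illustrate what "plugging in your own Ollama" looks like from the client side, here is a minimal sketch that posts a small HealthKit-style snapshot to a self-hosted Ollama instance via its documented /api/chat endpoint. The host address, model name, and metric values are purely illustrative, and the app's actual request format may differ.

```python
# Minimal sketch of a health-summary request to a self-hosted Ollama server.
# Host, model, and the HealthKit-style numbers below are illustrative only.
import json
import requests

OLLAMA_URL = "http://192.168.1.50:11434/api/chat"  # assumed self-hosted address

health_snapshot = {
    "steps_today": 7421,
    "resting_heart_rate_bpm": 62,
    "sleep_hours_last_night": 6.8,
}

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama3.2",  # any model already pulled into Ollama
        "messages": [
            {"role": "system", "content": "You are a cautious health assistant."},
            {"role": "user", "content": "Summarize these metrics in two sentences: "
                                        + json.dumps(health_snapshot)},
        ],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```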
Comment: This open-source iOS app is a fantastic practical demonstration of on-device LLM inference using llama.cpp and Ollama. HealthKit integration is a smart use case for truly private mobile AI.