Local AI Updates: llama.cpp MTP, vLLM Gemma 4 Speeds, Ollama Coder Benchmarks
Today's Highlights
This week, llama.cpp gains Multi-Token Prediction (MTP) for roughly 40% speedups on Gemma 4 26B, while vLLM pushes Gemma 4 26B to 600 tok/s on a single RTX 5090 with DFlash speculative decoding. The Ollama community also delivers practical benchmarks comparing Qwen and DeepSeek coding models for local development.
Multi-Token Prediction (MTP) for llama.cpp Speeds Up Gemma 4 by 40% (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t6se6r/multitoken_prediction_mtp_for_llamacpp_gemma_4/
The popular llama.cpp project has introduced Multi-Token Prediction (MTP), a significant acceleration technique for local large language model inference. The feature lets llama.cpp draft several tokens at once and then verify them with the main model, so fewer sequential forward passes are needed per generated token. The result is higher decoding throughput and a noticeably more responsive local LLM experience.
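To make the draft-and-verify idea concrete, here is a minimal, self-contained Python sketch of one MTP-style decode step. The draft and verify callables are toy stand-ins, not llama.cpp internals; the point is only that each step can commit several verified tokens instead of one.

    # Conceptual sketch of an MTP-style decode step: a draft head proposes
    # K tokens, the main model verifies them in one pass, and the longest
    # accepted prefix is kept. The callables below are toy stand-ins.
    from typing import Callable, List

    def mtp_decode_step(
        draft_k_tokens: Callable[[List[int]], List[int]],      # proposes K candidate tokens
        verify_tokens: Callable[[List[int], List[int]], int],  # returns how many candidates are accepted
        context: List[int],
    ) -> List[int]:
        proposed = draft_k_tokens(context)
        n_accepted = verify_tokens(context, proposed)
        # At least one token is always committed, so the step never stalls.
        return context + proposed[: max(1, n_accepted)]

    if __name__ == "__main__":
        draft = lambda ctx: [ctx[-1] + i + 1 for i in range(4)]  # toy: count upward
        verify = lambda ctx, proposed: 2                          # toy: accept the first two
        ctx = [0]
        for _ in range(3):
            ctx = mtp_decode_step(draft, verify, ctx)
        print(ctx)  # the context grows by two tokens per step instead of one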
Early benchmarks using quantized Gemma 4 assistant models in GGUF format show impressive gains. Tests on a MacBook Pro with an M5 Max, a powerful consumer device, found that a Gemma 4 26B model running with MTP generated tokens about 40% faster. That kind of improvement matters for anyone trying to maximize inference throughput on consumer-grade hardware, and the integration of MTP into llama.cpp is another example of the open-source community steadily pushing the limits of efficient local AI.
Comment: MTP in llama.cpp is a game-changer for my MacBook Pro. Seeing a 40% boost on Gemma 26B means my local dev loop just got a lot faster, especially with GGUF models.
Gemma 4 26B Achieves 600 Tok/s on RTX 5090 with vLLM DFlash Speculative Decoding (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t796qe/gemma_4_26b_hits_600_toks_on_one_rtx_5090/
New benchmarks highlight the performance of the Gemma 4 26B model, specifically the cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit variant, which reaches an impressive 600 tokens per second on a single RTX 5090 with 32 GB of VRAM. The speed was achieved with vLLM 0.19.2rc1 using DFlash speculative decoding, a technique pioneered by z-lab for significant inference acceleration.
The setup uses a smaller draft model to propose candidate token sequences, which the main model then validates in a single pass. Because several drafted tokens can be accepted per forward pass of the large model, the number of sequential decoding steps drops sharply and throughput rises. For developers and enthusiasts running large open-weight models locally, these results show what combining powerful consumer hardware with acceleration techniques like DFlash and efficient AWQ 4-bit quantization can deliver: near-real-time generation on a single high-end consumer GPU, and a clear target for tuning local inference setups.
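For readers who want to try a similar setup, the sketch below shows how a draft-model speculative-decoding configuration is typically expressed with vLLM's offline LLM API. The draft model name is a placeholder, and the DFlash-specific options in 0.19.2rc1 may differ from the generic speculative_config shown here.

    # Minimal sketch: serving the AWQ-quantized Gemma 4 26B with a draft-model
    # speculative decoding config in vLLM. The draft model name is a placeholder
    # and the DFlash-specific flags may differ from this generic configuration.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",   # target model from the post
        quantization="awq",
        speculative_config={
            "model": "z-lab/dflash-draft-placeholder",  # hypothetical draft model
            "num_speculative_tokens": 5,                # tokens proposed per verification step
        },
        gpu_memory_utilization=0.90,                    # leave headroom on a 32 GB RTX 5090
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain speculative decoding in two sentences."], params)
    print(outputs[0].outputs[0].text)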
Comment: 600 tok/s on a single 5090 with Gemma 4 and DFlash is incredible. It really shows how vLLM and smart decoding can turn powerful consumer GPUs into serious inference machines, especially with AWQ quantization.
Ollama Community Benchmarks Qwen3.6, Qwen3-Coder, and DeepSeek-Coder for Local Code Generation (r/Ollama)
Source: https://reddit.com/r/ollama/comments/1t76uh0/compared_qwen36_qwen3coder_and_deepseekcoder_on/
The Ollama community has published a useful comparison of several popular open-weight coding models, all running locally through the Ollama platform. The benchmark evaluates qwen3.6, qwen3-coder, and deepseek-coder across three task categories: general code generation, the precision of function calling, and multi-step problem solving through a "thought chain" task.
This community-driven effort helps users decide which model best fits their local development needs without extensive personal experimentation, and it shows how easily Ollama lets you run and evaluate multiple LLMs on a self-hosted machine. With direct performance and capability comparisons in hand, developers can pick the most effective model for their self-hosted coding agents and tools and allocate disk space and compute on consumer machines accordingly.
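For anyone who wants to sanity-check the results on their own machine, a lightweight version of the code-generation comparison can be run through the Ollama Python client, as sketched below. The model tags are taken from the post and may need adjusting to match what is actually pulled locally; the function-calling and thought-chain tasks are not covered here.

    # Lightweight local comparison: send the same coding prompt to each model
    # via the Ollama Python client and report speed. Model tags come from the
    # post and may need adjusting to match locally pulled models.
    import ollama

    MODELS = ["qwen3.6", "qwen3-coder", "deepseek-coder"]
    PROMPT = "Write a Python function that merges two sorted lists in O(n)."

    for model in MODELS:
        resp = ollama.generate(model=model, prompt=PROMPT)
        # eval_count and eval_duration (nanoseconds) are part of Ollama's generate response
        tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
        print(f"{model}: {tok_per_s:.1f} tok/s")
        print(resp["response"][:200], "...\n")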
Comment: This Ollama comparison is super useful for choosing a local coding LLM. Instead of guessing, I can quickly see if Qwen or DeepSeek-Coder performs better for my specific code generation tasks, saving disk space and time.