Gemma4 Apex GGUF, Ollama Context Optimization, & Llama3 Benchmarks

#ai #llm #selfhosted

Today's Highlights

This week: an Apex GGUF quantization of Gemma4 26B-A4B reported at 38 tokens per second with a 90,000-token context; an Ollama + Memgraph workflow that cut prompt context by 89.44% on a Python task; and a benchmark in which TinyLlama and Llama3.2:3b both scored about 50% on boolean logic.

Gemma4 26b a4b Apex quant is quite good (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1tl9woz/gemma4_26b_a4b_apex_quant_is_quite_good/

This post highlights the performance of Mudler's Apex quantization of the Gemma4 26B-A4B-it model. The reported figure is 38 tokens per second (tps) with a 90,000-token context window and no quality degradation noticed by the poster. The specific model is mudler/gemma-4-26B-A4B-it-APEX-GGUF; GGUF is the file format used for CPU and GPU inference by llama.cpp and tools built on it. Sustaining that throughput at such a large context is a notable result for running a large language model locally on consumer hardware.
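For readers who want to check a throughput figure like this on their own hardware, the sketch below times any token stream and converts a tokens-per-second rate into wall-clock time. It is a minimal, backend-agnostic illustration: `measure_tps`, `seconds_for`, and `fake_stream` are hypothetical names, and the fake stream stands in for whatever streaming call your llama.cpp or Ollama client exposes.

```python
import time
from typing import Callable, Iterable


def measure_tps(stream_tokens: Callable[[], Iterable[str]]) -> float:
    """Consume a token stream and return generated tokens per second."""
    start = time.perf_counter()
    count = sum(1 for _ in stream_tokens())
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast fake streams.
    return count / elapsed if elapsed > 0 else float("inf")


def seconds_for(n_tokens: int, tps: float) -> float:
    """Wall-clock seconds needed to generate n_tokens at a given rate."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return n_tokens / tps


def fake_stream() -> Iterable[str]:
    """Assumption: placeholder for a real streaming generation call."""
    for i in range(50):
        time.sleep(0.001)  # simulate per-token latency
        yield f"tok{i}"


if __name__ == "__main__":
    print(f"measured: {measure_tps(fake_stream):.1f} tps")
    print(f"380 tokens at 38 tps: {seconds_for(380, 38.0):.1f} s")
```

At the reported 38 tps, `seconds_for(380, 38.0)` comes to 10.0 seconds for a 380-token reply. Note that this measures generation speed only; time spent processing a long prompt is a separate cost.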

The result reflects steady progress in model compression, which lets larger and more capable models run efficiently outside the data center. For developers and enthusiasts, applications that need extensive context, such as self-hosted AI agents and private chatbots, become feasible on local machines. It also underscores the value of well-optimized GGUF releases for the open-source community.

Comment: Running a 26B model at 38 tps with 90k context locally is wild; this Apex quant is a game-changer for my long-context retrieval-augmented generation experiments.

I cut prompt context by 89.44% on a Python task with Ollama + Memgraph (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1tl9y3j/i_cut_prompt_context_by_8944_on_a_python_task/

A developer shared a workflow, dubbed "SiliconBrain," that combines Ollama for reasoning with Memgraph as structured external memory. On a Python task, the author reports cutting prompt context by 89.44%. The reduction targets a common bottleneck in local LLM deployments: large contexts consume substantial VRAM and slow down inference, so a shorter prompt makes local setups more practical.

The "SiliconBrain" workflow likely uses Memgraph to store information and retrieve only the relevant parts at query time, so the entire history or knowledge base does not have to be fed into the LLM's context window on every call. This kind of context management matters for local AI agents, particularly for tasks with complex, evolving state or extensive document analysis, and it is a key lever for inference performance on consumer-grade hardware. The general pattern should be reproducible on other local Ollama setups, although the summary here does not include the implementation details.
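The general idea of graph-backed context selection can be sketched in a few lines. This is not the SiliconBrain implementation: an in-memory dict stands in for Memgraph, the node names and facts are invented for illustration, and tokens are approximated by whitespace-separated words. A real setup would presumably replace the dict lookup with a Cypher query against Memgraph.

```python
from collections import deque

# Hypothetical knowledge graph: node -> (fact text, neighbouring nodes).
GRAPH = {
    "parse_config": ("parse_config(path) reads a TOML file and returns a dict.",
                     ["load_file", "validate"]),
    "load_file": ("load_file(path) returns the file contents as a string.", []),
    "validate": ("validate(cfg) raises ValueError on missing keys.", ["schema"]),
    "schema": ("schema lists the required configuration keys.", []),
    "render_report": ("render_report(data) writes an HTML summary.", ["template"]),
    "template": ("template holds the report layout.", []),
}


def neighbourhood(graph, start, max_hops):
    """Breadth-first walk returning nodes within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph[node][1]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


def build_prompt(graph, nodes, question):
    """Join the selected facts and append the question."""
    facts = "\n".join(graph[n][0] for n in nodes)
    return f"{facts}\n\nQuestion: {question}"


def rough_tokens(text):
    """Crude token proxy: whitespace-separated words (an assumption)."""
    return len(text.split())


def reduction_percent(baseline, reduced):
    """Percentage by which the prompt shrank relative to the baseline."""
    return (baseline - reduced) / baseline * 100.0


if __name__ == "__main__":
    question = "Why does parse_config fail on my file?"
    full = build_prompt(GRAPH, list(GRAPH), question)
    focused = build_prompt(GRAPH, neighbourhood(GRAPH, "parse_config", 1), question)
    saved = reduction_percent(rough_tokens(full), rough_tokens(focused))
    print(f"context reduced by {saved:.2f}%")
```

The percentage is computed as (baseline - reduced) / baseline. As a purely illustrative figure, not one taken from the post, shrinking a 10,000-token prompt to 1,056 tokens would be an 89.44% reduction.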

Comment: Context management is everything for local agents, and an 89% cut with Ollama + Memgraph means I can run much more complex, long-running tasks without blowing my GPU VRAM.

I benchmarked tinyllama and llama3.2:3b on boolean logic. Both scored 50% — coin flip. Here's the proof. (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1tli1g0/i_benchmarked_tinyllama_and_llama323b_on_boolean/

A user benchmarked the logical reasoning of TinyLlama and Llama3.2:3b, running both locally via Ollama. The tests focused on boolean logic, and both models scored approximately 50%. On questions with a True/False answer that is chance level, hence the "coin flip" in the title. The result points to a significant limitation: these small models cannot reliably perform basic logical operations. To run the tests, the developer built a Python-based boolean engine designed to catch hallucinations without excessive overhead.
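A harness of this kind can be approximated as follows. This is a minimal sketch, not the poster's engine: it generates random boolean expressions, computes the ground truth deterministically, and scores whatever callable is passed in as the model. The names `make_tree`, `score`, and `always_true` are hypothetical, and the prompt wording is an assumption.

```python
import random


def make_tree(rng, depth):
    """Build a random boolean expression tree of the given depth."""
    if depth == 0:
        return rng.choice([True, False])
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", make_tree(rng, depth - 1))
    return (op, make_tree(rng, depth - 1), make_tree(rng, depth - 1))


def evaluate(tree):
    """Ground truth: evaluate the tree without involving any model."""
    if isinstance(tree, bool):
        return tree
    if tree[0] == "not":
        return not evaluate(tree[1])
    left, right = evaluate(tree[1]), evaluate(tree[2])
    return (left and right) if tree[0] == "and" else (left or right)


def render(tree):
    """Turn the tree into a fully parenthesised expression string."""
    if isinstance(tree, bool):
        return str(tree)
    if tree[0] == "not":
        return f"(not {render(tree[1])})"
    return f"({render(tree[1])} {tree[0]} {render(tree[2])})"


def parse_answer(reply):
    """Map a model reply to True/False, or None if it is neither."""
    text = reply.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def score(model, n_cases, depth, seed=0):
    """Fraction of expressions the model answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_cases):
        tree = make_tree(rng, depth)
        reply = model(f"Evaluate: {render(tree)}. Answer True or False.")
        correct += parse_answer(reply) == evaluate(tree)
    return correct / n_cases


def always_true(prompt):
    """Stand-in model (assumption): replace with a real Ollama call."""
    return "True"


if __name__ == "__main__":
    print(f"accuracy: {score(always_true, n_cases=200, depth=3):.2%}")
```

Because every question has only two possible answers, a degenerate strategy such as always answering "True" can land near 50% on a balanced set, which is why a score in that range carries little evidence of actual reasoning. Unparseable replies count as wrong here; a stricter harness might report them separately.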

The benchmark offers a useful data point on smaller open-weight language models: they can perform well on many generative tasks while still struggling with basic logic. For developers building applications that need precise reasoning or decision-making, the finding argues for integrating an external reasoning module or prompting very carefully to mitigate the limitation. It also highlights an area for further research on local LLMs and gives others a practical way to test model capabilities.
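One way to read "external reasoning module" is to hand boolean expressions to deterministic code instead of the model. The sketch below is an assumption about how that could look, not something described in the post: it uses Python's standard `ast` module to evaluate only `True`, `False`, `and`, `or`, and `not`, and rejects everything else. The function name `eval_bool` is hypothetical.

```python
import ast


def eval_bool(expr: str) -> bool:
    """Evaluate a boolean expression made only of True/False/and/or/not."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, bool):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.BoolOp):
            values = [walk(v) for v in node.values]
            # BoolOp carries either ast.And or ast.Or as its operator.
            return all(values) if isinstance(node.op, ast.And) else any(values)
        # Names, calls, comparisons, etc. are refused rather than executed.
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


if __name__ == "__main__":
    print(eval_bool("(True or False) and not False"))
```

An agent could route any sub-question that parses cleanly through `eval_bool` and fall back to the LLM only for the rest. Walking the syntax tree, rather than calling `eval`, keeps model-generated text from executing arbitrary code.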

Comment: The 50% score for boolean logic on these Llama models via Ollama is a stark reminder to use external tools or very specific prompting for any logic-heavy tasks, rather than relying solely on the LLM.