Gemma4 Tool Calling Fixes in llama.cpp, RTX cuBLAS MatMul Bug, & Local Ollama + Whisper UI

#ai #llm #selfhosted

Gemma4 Tool Calling Fixes in llama.cpp, RTX cuBLAS MatMul Bug, & Local Ollama + Whisper UI

Today's Highlights

This week features significant technical updates for local AI, including critical fixes for Gemma4's tool calling in llama.cpp, a deep dive into a major cuBLAS performance bug affecting RTX GPUs, and a new local-first UI integrating Whisper and Ollama for multimodal tasks.

More Gemma4 fixes in the past 24 hours (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1shs6sx/more_gemma4_fixes_in_the_past_24_hours/

Recent updates to llama.cpp address key issues affecting the Gemma4 model, particularly related to tool calling and reasoning capabilities. A significant 'reasoning budget fix' has been merged into the ggml-org/llama.cpp repository, indicated by pull request #21697. This fix is crucial for improving Gemma4's ability to process and generate logical responses, particularly in complex tasks.

In addition to the reasoning budget, Google has released new chat templates specifically designed to rectify existing problems with Gemma4's tool calling functionality for its 31B variant. Tool calling is a vital feature for integrating large language models with external applications and services, allowing them to execute specific actions or retrieve information. These templates are expected to enhance Gemma4's reliability and effectiveness when interacting with tools in local inference environments, making it more robust for diverse applications on consumer hardware.

Comment: These llama.cpp updates are vital for anyone running Gemma4 locally, directly addressing core issues that impact its practical usability for reasoning and external tool integration. It's a significant step towards more reliable local model deployments.

I built AmicoScript: A local-first Whisper UI with Speaker Diarization and Ollama integration for summaries. (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1shverq/i_built_amicoscript_a_localfirst_whisper_ui_with/

AmicoScript is a new local-first application designed to streamline the process of converting audio recordings into structured, usable data, all without relying on cloud services. Built as a FastAPI-based tool, AmicoScript intelligently combines the power of Whisper for highly accurate speech-to-text transcription and Speaker Diarization for identifying and separating different speakers within an audio file.

The tool further integrates with Ollama, enabling users to leverage local large language models to generate concise summaries from the transcribed audio. This local-first approach ensures privacy and control over data, making it ideal for sensitive information or environments with limited internet access. AmicoScript provides a compelling example of self-hosted multimodal AI, allowing users to perform complex audio processing and summarization tasks directly on their consumer GPUs or local machines.

Comment: AmicoScript offers a compelling local-first solution for multimodal tasks. Integrating Whisper for diarization and Ollama for summarization on a local machine is a great practical use case for consumer hardware. Readers can set this up for private, efficient audio processing.

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 D

Source: https://reddit.com/r/MachineLearning/comments/1shtv0r/d_60_matmul_performance_bug_in_cublas_on_rtx_5090/

A significant performance bug has been identified in NVIDIA's cuBLAS library, impacting matrix multiplication (MatMul) operations on RTX 5090 GPUs and potentially all RTX series cards. The bug causes cuBLAS to dispatch an inefficient kernel for a wide range of batched FP32 workloads, specifically those ranging from 256x256 to 8192x8192x8 dimensions. This inefficiency results in the GPUs utilizing only approximately 40% of their available compute power for these critical operations.

This discovery is highly relevant for local AI inference, as FP32 batched MatMul operations are fundamental to many large language models and other deep learning workloads. The underperformance means that users running models locally on RTX GPUs may be experiencing significantly slower inference speeds than theoretically possible, impacting the perceived efficiency and practical throughput of their setups. Acknowledging and resolving this bug could lead to substantial performance gains for local inference on NVIDIA hardware.

Comment: This cuBLAS bug is a critical finding for anyone running LLMs locally on NVIDIA RTX GPUs. A 60% performance hit on MatMul for FP32 is massive and directly impacts inference speed; understanding this bottleneck is key for optimization efforts and anticipating future driver updates.