DEV Community

soy

Posted on • Originally published at media.patentllm.org

Gemma 4 Benchmarks, iMac G3 Local LLM, and Ollama Android Client for On-Device Inference

Today's Highlights

This week features impressive benchmarks for the new Gemma 4, highlighting its potential for local inference, alongside an incredible feat of running an LLM on a 1998 iMac G3. Additionally, a new native Android client for Ollama allows seamless interaction with self-hosted models from mobile devices.

I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sdnw7l/i_technically_got_an_llm_running_locally_on_a/

An astonishing demonstration of extreme local inference has surfaced: an LLM running on a vintage 1998 iMac G3. The machine, equipped with a 233 MHz PowerPC 750 processor and a mere 32 MB of RAM, successfully loaded and ran Andrej Karpathy's 260K-parameter TinyStories model, a Llama 2-style architecture whose checkpoint weighs in at roughly 1 MB. Getting even such a tiny model running on hardware this old is a notable achievement in low-resource deployment, and it underscores the local AI community's ongoing push to stretch what is possible on consumer hardware, even systems long considered obsolete.

The project highlights the potential for highly optimized, small-scale models to run on virtually any computing device, extending the reach of local AI beyond modern GPUs. While the performance might not be suitable for complex tasks, it serves as a powerful proof-of-concept for pervasive AI applications. This kind of innovative tinkering paves the way for understanding the minimal requirements for local inference and inspires further research into ultra-efficient model architectures and quantization techniques suitable for highly constrained environments.
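A quick back-of-the-envelope calculation shows why this fits at all. The figures below come from the post (260K parameters, ~1 MB checkpoint, 32 MB of RAM); the assumption that weights are stored as 32-bit floats matches how llama2.c checkpoints are typically laid out, but is not confirmed by the post itself:

```python
# Rough memory-footprint estimate for the 260K-parameter TinyStories
# checkpoint. Numbers are a sketch, not measurements from the iMac run.
PARAMS = 260_000        # ~260K parameters, per the post
BYTES_PER_PARAM = 4     # assuming float32 weights

checkpoint_mb = PARAMS * BYTES_PER_PARAM / (1024 ** 2)
print(f"weights: ~{checkpoint_mb:.2f} MiB")  # ~0.99 MiB, matching the ~1 MB checkpoint

IMAC_RAM_MB = 32
print(f"share of the iMac's RAM: ~{checkpoint_mb / IMAC_RAM_MB:.1%}")
```

In other words, the weights alone occupy only about 3% of the machine's memory; the real constraints are the OS footprint, activation buffers, and the 233 MHz CPU's throughput.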

Comment: This is a wild testament to how far model compression and basic LLM architecture have come. Running a Llama 2-based model on 32MB RAM opens up insane possibilities for embedded AI.

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sdcotc/gemma_4_just_casually_destroyed_every_model_on/

The local AI community is buzzing with the release of Gemma 4, Google DeepMind's latest open-weight model. Initial benchmarks shared on r/LocalLLaMA indicate that the 31B-parameter version of Gemma 4 surpassed every other model on a community-maintained leaderboard, with the sole exceptions being high-tier proprietary models Opus 4.6 and GPT-5.2. The report claims 100% survival, 5 out of 5 profitable runs, and a strong median ROI, all at just $0.20 per run via API access, which suggests the model is cost-efficient even when not run entirely locally.
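To make the reported metrics concrete, here is a minimal sketch of how the leaderboard's summary numbers (survival rate, profitable runs, median ROI, cost per run) would be computed. The per-run ROI values below are invented for illustration; only the $0.20/run figure and the metric definitions come from the post:

```python
import statistics

# Hypothetical per-run results; the real leaderboard data is not public
# in the post, so these ROI values are placeholders.
runs = [
    {"survived": True, "roi": 0.42},
    {"survived": True, "roi": 0.10},
    {"survived": True, "roi": 0.55},
    {"survived": True, "roi": 0.08},
    {"survived": True, "roi": 0.31},
]

survival = sum(r["survived"] for r in runs) / len(runs)
profitable = sum(r["roi"] > 0 for r in runs)
median_roi = statistics.median(r["roi"] for r in runs)
total_cost = 0.20 * len(runs)  # $0.20 per run, per the post

print(f"survival: {survival:.0%}, profitable: {profitable}/{len(runs)}, "
      f"median ROI: {median_roi:.0%}, total cost: ${total_cost:.2f}")
```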

This strong showing positions Gemma 4 as a compelling option for local inference, offering a blend of high capability and accessibility. For developers and enthusiasts focused on self-hosting, the benchmark results provide crucial validation for investing compute resources into this new open-weight contender. Its performance against more expensive proprietary models suggests that the gap in capabilities for locally deployable models is rapidly closing, reinforcing the value proposition of open-source innovation in the AI space.

Comment: Gemma 4's performance is a game-changer for open models. Seeing a 31B model punch so far above its weight on benchmarks makes it a top priority for local experimentation and deployment.

FolliA v0.6: Native Android client for Ollama with Real-Time Streaming and Markdown support (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1sdzaxo/follia_v06_native_android_client_for_ollama_with/

FolliA v0.6 has been released as a native Android client designed to connect seamlessly with local Ollama instances, bringing the power of self-hosted LLMs directly to mobile devices. Developed by an IT student, the application emphasizes a lightweight design and offers key features such as real-time streaming of responses and robust Markdown support for formatted output. This client addresses a growing need within the local AI community for accessible interfaces that allow users to interact with their personal LLM setups from anywhere within their network, effectively turning an Android phone into a powerful AI front-end.

The ability to connect to a local Ollama server from an Android device simplifies the process of interacting with models hosted on a home server or powerful workstation. It empowers users to leverage their own compute resources without relying on cloud-based services, aligning perfectly with the principles of privacy and control inherent in self-hosting. As more individuals deploy Ollama for various tasks, tools like FolliA become indispensable for creating a truly integrated and convenient local AI ecosystem, supporting ongoing exploration of local inference capabilities on consumer devices.
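Under the hood, real-time streaming against Ollama means consuming newline-delimited JSON from the server's `/api/generate` endpoint, where each line carries a `response` fragment and a `done` flag. FolliA's actual implementation is not shown in the post; the sketch below only illustrates how any client can assemble such a stream, with fabricated sample lines standing in for a live HTTP response body:

```python
import json

def assemble_stream(ndjson_lines):
    """Join the 'response' fragments from Ollama's streaming
    newline-delimited JSON output into a single reply string."""
    reply = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        reply.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(reply)

# Fabricated sample of what a local Ollama server streams back;
# a real client would iterate over the HTTP response body instead.
sample = [
    '{"model":"gemma","response":"Hello","done":false}',
    '{"model":"gemma","response":", world","done":false}',
    '{"model":"gemma","response":"!","done":true}',
]
print(assemble_stream(sample))  # Hello, world!
```

Rendering each fragment as it arrives, rather than waiting for `done`, is what gives a mobile client its responsive, token-by-token feel.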

Comment: Having a native Android client for Ollama with real-time streaming is huge. It makes local inference truly mobile, allowing me to access my powerful server-side models from my phone without any fuss.
