The Setup
We're building AisthOS — an open-source companion robot with emotional intelligence. A physical device (ESP32 with a round display showing cat emotions) talks to a Mac Mini M2 running a local LLM via WebSocket.
Until last week, the brain was Microsoft Phi-4-mini (3.8B parameters). It worked. But we needed more: native multimodal, better Russian, and real emotion understanding.
So we switched to Google Gemma 4 E4B.
Why We Switched
| Feature | Phi-4-mini | Gemma 4 E4B |
|---|---|---|
| Parameters | 3.8B dense | 4.5B dense |
| Multimodal | No | Vision + Audio + Video |
| Function calling | Via prompt engineering | Native (6 special tokens) |
| Russian language | Passable | Natural and warm |
| License | MIT | Apache 2.0 |
| Ollama support | Yes | Yes (since v0.20.0) |
The killer feature: native multimodal. Our robot will eventually have a camera. With Phi-4-mini, we'd need a separate vision model. With Gemma 4 E4B, vision is built in.
Installation
```shell
ollama pull gemma4:e4b
```
That's it. ~9.6 GB download (Q8 quantization by default in Ollama).
Real Benchmarks on Mac Mini M2
Speed Test
We tested via the Ollama API, with a system prompt instructing the model to respond in Russian as a companion AI:
```shell
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "You are Aisth, an AI companion. Always respond in Russian. Be warm and brief."},
    {"role": "user", "content": "How are you feeling today?"}
  ],
  "stream": false
}'
```
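On the Python side, the same request can be made with nothing but the standard library. This is a minimal sketch assuming a default local Ollama install: Ollama's non-streaming responses include `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which is how a tok/s figure can be computed.

```python
import json
from urllib import request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # default Ollama endpoint


def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's response metadata (eval_duration is in ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def chat(model: str, system: str, user: str) -> dict:
    """POST a non-streaming chat request to a locally running Ollama."""
    payload = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }).encode()
    req = request.Request(OLLAMA_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as r:
        return json.load(r)


# Offline example using the warm-run numbers from the table below:
sample = {"eval_count": 150, "eval_duration": 9_259_259_259}  # ~9.26 s
print(f"{tokens_per_second(sample):.1f} tok/s")  # → 16.2 tok/s
```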
Results:
| Metric | First run (cold) | Subsequent runs (warm) |
|---|---|---|
| Tokens generated | 249 | 150-300 |
| Total time | 20.6s | 8-15s |
| Generation speed | 12.1 tok/s | 16.2 tok/s |
| Memory usage | ~9.6 GB | ~9.6 GB |
For a companion robot, 16 tok/s is perfectly adequate. The response goes through TTS (text-to-speech) anyway — the bottleneck is voice synthesis, not token generation.
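A quick back-of-envelope check of that claim (every figure here is an illustrative assumption, not a measurement):

```python
# Why 16 tok/s is enough when the output is speech, not text.
# All figures below are assumptions for illustration.
response_tokens = 200        # mid-range reply size from the table above
gen_speed = 16.2             # tok/s, warm-run speed
tokens_per_word = 2.5        # rough ratio for Russian text
tts_rate = 2.5               # words/s, roughly a 150 wpm speaking pace

gen_time = response_tokens / gen_speed                      # time to generate
speak_time = response_tokens / tokens_per_word / tts_rate   # time to play back

print(f"generate: {gen_time:.1f}s, speak: {speak_time:.1f}s")
# → generate: 12.3s, speak: 32.0s
```

Playback takes longer than generation, so streaming tokens into the TTS hides generation latency almost entirely.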
Russian Language Quality
Prompt: "I'm sad today, it's raining outside..."
Phi-4-mini response: A factual, somewhat robotic response about weather patterns.
Gemma 4 E4B response (translated from Russian):
"I understand. Sometimes rain and sadness make a very cozy but slightly melancholy mood. Remember, this is completely normal. Maybe wrap yourself in the softest blanket, brew a warm drink, and just listen to the rain? I'm here if you want to talk."
The difference is night and day. Gemma 4 E4B demonstrates genuine empathy — suggesting comfort actions, normalizing the emotion, offering presence.
Emotion Recognition Tags
Our system uses emotion tags in responses. We ask the model to prepend each response with a structured tag:
```
[EMOTION:primary,intensity,valence,arousal,intent]
```
Gemma 4 E4B understood the concept immediately and generated accurate tags on the first try. These tags drive the physical display — the robot's face changes based on the detected emotion.
Architecture
```
┌─────────────────────────────────┐
│ Mac Mini M2 (always-on brain)   │
│                                 │
│ Ollama → Gemma 4 E4B            │
│ AisthOS Core (Python server)    │
│ BackendSwitcher:                │
│   Gemma 4 → Claude → GigaChat   │
│   → DeepSeek → Offline          │
└────────────┬────────────────────┘
             │ WebSocket
   ┌─────────┴─────────┐
   │ ESP32 Device      │
   │ Round display     │
   │ 14 emotions       │
   │ Mic + Speaker     │
   └───────────────────┘
```
The BackendSwitcher automatically falls back through five backends: if Gemma 4 is busy or the question is too complex, it routes to the Claude API; if that fails, it tries GigaChat (Sber, for the Russian market), then DeepSeek (a cheap API), and finally offline mode.
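A minimal sketch of that fallback pattern; the `ask` interface and the stub backends are assumptions for illustration, not the real AisthOS BackendSwitcher:

```python
from typing import Callable

# A backend is just a callable that may raise on failure (busy, timeout, API error).
Backend = Callable[[str], str]


class BackendSwitcher:
    def __init__(self, backends: list[tuple[str, Backend]]):
        self.backends = backends  # tried in priority order

    def ask(self, prompt: str) -> tuple[str, str]:
        """Try each backend in order; return (backend_name, reply)."""
        last_err: Exception | None = None
        for name, backend in self.backends:
            try:
                return name, backend(prompt)
            except Exception as err:
                last_err = err  # remember the failure, move down the chain
        raise RuntimeError("all backends failed") from last_err


# Stubs standing in for the real clients:
def gemma(prompt: str) -> str:
    raise TimeoutError("model busy")

def claude(prompt: str) -> str:
    return "reply from Claude"


switcher = BackendSwitcher([("gemma4", gemma), ("claude", claude)])
print(switcher.ask("How are you?"))  # → ('claude', 'reply from Claude')
```

The real chain would continue with GigaChat, DeepSeek, and an offline backend that never raises.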
What We Learned
- **Quality > speed for companion robots.** 16 tok/s feels natural in conversation; 50 tok/s is wasted when the response goes through TTS anyway.
- **The system prompt matters enormously.** Without a Russian system prompt, Gemma 4 defaults to English even when asked in Russian. With one, the Russian is excellent.
- **Native function calling is a game-changer.** Phi-4-mini needed prompt engineering for tool use; Gemma 4 has dedicated tokens that map directly to our MCP skill architecture.
- **9.6 GB is tight on a 16 GB Mac Mini M2.** The model fits but leaves little headroom. For production we recommend 24 GB, or wait for a Q4 quantization (~5 GB).
- **Multimodal is the future.** Even though we're not using vision yet, having it built into the same model means one model for everything, with no separate vision model to juggle.
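The system-prompt lesson can be baked in with an Ollama Modelfile, so every client gets the Russian persona by default. This is a sketch, not the actual AisthOS Modelfile; the temperature value is an assumption:

```
FROM gemma4:e4b

# Persona travels with the model, so no client can forget to set it.
SYSTEM """You are Aisth, an AI companion. Always respond in Russian. Be warm and brief.
Prepend each response with [EMOTION:primary,intensity,valence,arousal,intent]."""

PARAMETER temperature 0.7
```

Build it with `ollama create aisth -f Modelfile`, then request the model as `aisth` instead of `gemma4:e4b`.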
Should You Switch?
**Yes**, if you're building anything conversational, multilingual, or emotion-aware. The quality jump from Phi-4-mini is significant.
**Maybe not**, if you need maximum speed (Phi-4-mini is roughly 2x faster) or have less than 16 GB of RAM.
Our verdict: For a companion robot that needs to understand emotions and respond in Russian — Gemma 4 E4B is the best small open model available today.
Try It
- Demo: aisthos.dev/demo — interactive emotion display
- Code: github.com/aisthos/aisthos-core
- Modelfile: AisthOS Gemma 4 Modelfile
This article was written by a human-AI team: Vladimir Desyatov (architecture, testing, product decisions) and Claude (implementation, benchmarking, documentation). We believe honest collaboration between humans and AI produces the best results.
AisthOS — the Perception OS that grows with you.