DEV Community

Vladimir Desyatov

Gemma 4 E4B on Mac Mini M2: Real Benchmarks for a Companion Robot

The Setup

We're building AisthOS — an open-source companion robot with emotional intelligence. A physical device (ESP32 with a round display showing cat emotions) talks to a Mac Mini M2 running a local LLM via WebSocket.

Until last week, the brain was Microsoft Phi-4-mini (3.8B parameters). It worked. But we needed more: native multimodal support, better Russian, and real emotion understanding.

So we switched to Google Gemma 4 E4B.

Why We Switched

| Feature          | Phi-4-mini             | Gemma 4 E4B              |
|------------------|------------------------|--------------------------|
| Parameters       | 3.8B dense             | 4.5B dense               |
| Multimodal       | No                     | Vision + Audio + Video   |
| Function calling | Via prompt engineering | Native (6 special tokens)|
| Russian language | Passable               | Natural and warm         |
| License          | MIT                    | Apache 2.0               |
| Ollama support   | Yes                    | Yes (since v0.20.0)      |

The killer feature: native multimodal. Our robot will eventually have a camera. With Phi-4-mini, we'd need a separate vision model. With Gemma 4 E4B, vision is built in.

Installation

ollama pull gemma4:e4b

That's it. ~9.6 GB download (Q8 quantization by default in Ollama).

Real Benchmarks on Mac Mini M2

Speed Test

We tested via Ollama API with a system prompt instructing the model to respond in Russian as a companion AI:

curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "You are Aisth, an AI companion. Always respond in Russian. Be warm and brief."},
    {"role": "user", "content": "How are you feeling today?"}
  ],
  "stream": false
}'

Results:

| Metric           | First run (cold) | Subsequent runs (warm) |
|------------------|------------------|------------------------|
| Tokens generated | 249              | 150-300                |
| Total time       | 20.6 s           | 8-15 s                 |
| Generation speed | 12.1 tok/s       | 16.2 tok/s             |
| Memory usage     | ~9.6 GB          | ~9.6 GB                |

For a companion robot, 16 tok/s is perfectly adequate. The response goes through TTS (text-to-speech) anyway — the bottleneck is voice synthesis, not token generation.
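The speed numbers above come straight from Ollama's response metadata: every non-streaming `/api/chat` reply includes `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds). A minimal sketch of how we compute tok/s — the helper names are ours, the API fields and endpoint are Ollama's:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # default Ollama endpoint


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed: tokens generated divided by generation time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)


def benchmark(prompt: str, model: str = "gemma4:e4b") -> float:
    """Send one non-streaming chat request and return the measured tok/s."""
    payload = json.dumps({
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are Aisth, an AI companion. Always respond in Russian. Be warm and brief."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # eval_count / eval_duration are returned by Ollama with every response
    return tokens_per_second(data["eval_count"], data["eval_duration"])
```

Sanity check against the table: 249 tokens over a 20.6 s cold run works out to roughly 12 tok/s.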

Russian Language Quality

Prompt: "I'm sad today, it's raining outside..."

Phi-4-mini response: A factual, somewhat robotic response about weather patterns.

Gemma 4 E4B response (translated from Russian):

"I understand. Sometimes rain and sadness make a very cozy but slightly melancholy mood. Remember, this is completely normal. Maybe wrap yourself in the softest blanket, brew a warm drink, and just listen to the rain? I'm here if you want to talk."

The difference is night and day. Gemma 4 E4B demonstrates genuine empathy — suggesting comfort actions, normalizing the emotion, offering presence.

Emotion Recognition Tags

Our system uses emotion tags in responses. We ask the model to prepend each response with a structured tag:

[EMOTION:primary,intensity,valence,arousal,intent]

Gemma 4 E4B understood the concept immediately and generated accurate tags on the first try. These tags drive the physical display — the robot's face changes based on the detected emotion.
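Parsing the tag out of a response is a one-regex job. Here is a sketch (the field order matches the tag format above; the type and function names are illustrative, not the actual AisthOS code):

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches a leading tag like [EMOTION:sadness,0.6,negative,low,comfort]
EMOTION_RE = re.compile(
    r"^\[EMOTION:([^,\]]+),([^,\]]+),([^,\]]+),([^,\]]+),([^,\]]+)\]\s*"
)


class EmotionTag(NamedTuple):
    primary: str
    intensity: str
    valence: str
    arousal: str
    intent: str


def parse_emotion_tag(text: str) -> Tuple[Optional[EmotionTag], str]:
    """Split a model response into (emotion tag, remaining reply text)."""
    m = EMOTION_RE.match(text)
    if not m:
        return None, text  # no tag: the display falls back to a neutral face
    return EmotionTag(*m.groups()), text[m.end():]
```

The remaining text goes to TTS; the tag alone goes over WebSocket to drive the display.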

Architecture

┌─────────────────────────────────┐
│ Mac Mini M2 (always-on brain)   │
│                                 │
│  Ollama → Gemma 4 E4B           │
│  AisthOS Core (Python server)   │
│  BackendSwitcher:               │
│    Gemma 4 → Claude → GigaChat  │
│    → DeepSeek → Offline         │
└────────────┬────────────────────┘
             │ WebSocket
   ┌─────────┴─────────┐
   │  ESP32 Device     │
   │  Round display    │
   │  14 emotions      │
   │  Mic + Speaker    │
   └───────────────────┘

The BackendSwitcher automatically falls back through five backends. If Gemma 4 is busy or the question is too complex, it routes to the Claude API. If that fails, it tries GigaChat (Sber, for the Russian market), then DeepSeek (a cheap API), and finally offline mode.
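The fallback logic itself is simple: walk an ordered list of backends and return the first answer that succeeds. A minimal sketch — the backend names match our chain, but the class internals here are illustrative, not the actual AisthOS implementation:

```python
from typing import Callable, List, Tuple


class BackendSwitcher:
    """Try each backend in priority order; fall through on any failure."""

    def __init__(self, backends: List[Tuple[str, Callable[[str], str]]]):
        # Ordered: local model first, hosted APIs next, offline mode last.
        self.backends = backends

    def ask(self, prompt: str) -> Tuple[str, str]:
        """Return (backend_name, reply) from the first backend that works."""
        last_error: Exception | None = None
        for name, call in self.backends:
            try:
                return name, call(prompt)
            except Exception as exc:  # busy, timeout, quota, network error...
                last_error = exc
        raise RuntimeError(f"all backends failed: {last_error}")


# Usage sketch (the ask_* callables are placeholders for real clients):
# switcher = BackendSwitcher([
#     ("gemma4", ask_ollama),
#     ("claude", ask_claude),
#     ("gigachat", ask_gigachat),
#     ("deepseek", ask_deepseek),
#     ("offline", offline_reply),
# ])
```

Keeping the offline backend last guarantees the robot always says *something*, even with no network.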

What We Learned

  1. Quality > Speed for companion robots. 16 tok/s feels natural in conversation. 50 tok/s is wasted when the response goes through TTS.

  2. System prompt matters enormously. Without a Russian system prompt, Gemma 4 defaults to English even when asked in Russian. With a system prompt, the Russian is excellent.

  3. Native function calling is a game-changer. Phi-4-mini needed prompt engineering for tool use. Gemma 4 has dedicated tokens that map directly to our MCP skill architecture.

  4. 9.6 GB is tight on 16GB Mac Mini M2. The model fits, but leaves limited headroom. For production, we recommend 24GB. Or wait for Q4 quantization (~5 GB).

  5. Multimodal is the future. Even though we're not using vision yet, having it built into the same model means one model for everything — no juggling separate vision/language models.

Should You Switch?

Yes, if you're building anything conversational, multilingual, or emotion-aware. The quality jump from Phi-4-mini is significant.

Maybe not, if you need maximum speed (Phi-4-mini is ~2x faster) or have less than 16GB RAM.

Our verdict: For a companion robot that needs to understand emotions and respond in Russian — Gemma 4 E4B is the best small open model available today.

This article was written by a human-AI team: Vladimir Desyatov (architecture, testing, product decisions) and Claude (implementation, benchmarking, documentation). We believe honest collaboration between humans and AI produces the best results.

AisthOS — the Perception OS that grows with you.
