The Setup
We're building AisthOS — an open-source companion robot with emotional intelligence. A physical device (ESP32 with a round display showing cat emotions) talks to a Mac Mini M2 running a local LLM via WebSocket.
Until last week, the brain was Microsoft Phi-4-mini (3.8B parameters). It worked. But we needed more: native multimodal, better Russian, and real emotion understanding.
So we switched to Google Gemma 4 E4B.
Why We Switched
| Feature | Phi-4-mini | Gemma 4 E4B |
|---|---|---|
| Parameters | 3.8B dense | 4.5B dense |
| Multimodal | No | Vision + Audio + Video |
| Function calling | Via prompt engineering | Native (6 special tokens) |
| Russian language | Passable | Natural and warm |
| License | MIT | Apache 2.0 |
| Ollama support | Yes | Yes (since v0.20.0) |
The killer feature: native multimodal. Our robot will eventually have a camera. With Phi-4-mini, we'd need a separate vision model. With Gemma 4 E4B, vision is built in.
Installation
```shell
ollama pull gemma4:e4b
```
That's it. ~9.6 GB download (Q8 quantization by default in Ollama).
Real Benchmarks on Mac Mini M2
Speed Test
We tested via the Ollama API, with a system prompt instructing the model to respond in Russian as a companion AI:
```shell
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    {"role": "system", "content": "You are Aisth, an AI companion. Always respond in Russian. Be warm and brief."},
    {"role": "user", "content": "How are you feeling today?"}
  ],
  "stream": false
}'
```
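On the Python side, the same request can be made with nothing but the standard library. This is a minimal sketch assuming a default local Ollama install: Ollama's non-streaming responses include `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which is how a tok/s figure can be computed.

```python
import json
from urllib import request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # default Ollama endpoint


def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's response metadata (eval_duration is in ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def chat(model: str, system: str, user: str) -> dict:
    """POST a non-streaming chat request to a locally running Ollama."""
    payload = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }).encode()
    req = request.Request(OLLAMA_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as r:
        return json.load(r)


# Offline example using the warm-run numbers from the table below:
sample = {"eval_count": 150, "eval_duration": 9_259_259_259}  # ~9.26 s
print(f"{tokens_per_second(sample):.1f} tok/s")  # → 16.2 tok/s
```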
Results:
| Metric | First run (cold) | Subsequent runs (warm) |
|---|---|---|
| Tokens generated | 249 | 150-300 |
| Total time | 20.6s | 8-15s |
| Generation speed | 12.1 tok/s | 16.2 tok/s |
| Memory usage | ~9.6 GB | ~9.6 GB |
For a companion robot, 16 tok/s is perfectly adequate. The response goes through TTS (text-to-speech) anyway — the bottleneck is voice synthesis, not token generation.
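A quick back-of-envelope check of that claim (every figure here is an illustrative assumption, not a measurement):

```python
# Why 16 tok/s is enough when the output is speech, not text.
# All figures below are assumptions for illustration.
response_tokens = 200        # mid-range reply size from the table above
gen_speed = 16.2             # tok/s, warm-run speed
tokens_per_word = 2.5        # rough ratio for Russian text
tts_rate = 2.5               # words/s, roughly a 150 wpm speaking pace

gen_time = response_tokens / gen_speed                      # time to generate
speak_time = response_tokens / tokens_per_word / tts_rate   # time to play back

print(f"generate: {gen_time:.1f}s, speak: {speak_time:.1f}s")
# → generate: 12.3s, speak: 32.0s
```

Playback takes longer than generation, so streaming tokens into the TTS hides generation latency almost entirely.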
Russian Language Quality
Prompt: "I'm sad today, it's raining outside..."
Phi-4-mini response: A factual, somewhat robotic response about weather patterns.
Gemma 4 E4B response (translated from Russian):
"I understand. Sometimes rain and sadness make a very cozy but slightly melancholy mood. Remember, this is completely normal. Maybe wrap yourself in the softest blanket, brew a warm drink, and just listen to the rain? I'm here if you want to talk."
The difference is night and day. Gemma 4 E4B demonstrates genuine empathy — suggesting comfort actions, normalizing the emotion, offering presence.
Emotion Recognition Tags
Our system uses emotion tags in responses. We ask the model to prepend each response with a structured tag:
```
[EMOTION:primary,intensity,valence,arousal,intent]
```
Gemma 4 E4B understood the concept immediately and generated accurate tags on the first try. These tags drive the physical display — the robot's face changes based on the detected emotion.
Architecture
```
┌─────────────────────────────────┐
│ Mac Mini M2 (always-on brain)   │
│                                 │
│ Ollama → Gemma 4 E4B            │
│ AisthOS Core (Python server)    │
│ BackendSwitcher:                │
│   Gemma 4 → Claude → GigaChat   │
│   → DeepSeek → Offline          │
└────────────┬────────────────────┘
             │ WebSocket
   ┌─────────┴─────────┐
   │ ESP32 Device      │
   │ Round display     │
   │ 14 emotions       │
   │ Mic + Speaker     │
   └───────────────────┘
```
The BackendSwitcher automatically falls back through five backends: if Gemma 4 is busy or the question is too complex, it routes to the Claude API; if that fails, it tries GigaChat (Sber, for the Russian market), then DeepSeek (a cheap API), and finally offline mode.
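A minimal sketch of that fallback pattern; the `ask` interface and the stub backends are assumptions for illustration, not the real AisthOS BackendSwitcher:

```python
from typing import Callable

# A backend is just a callable that may raise on failure (busy, timeout, API error).
Backend = Callable[[str], str]


class BackendSwitcher:
    def __init__(self, backends: list[tuple[str, Backend]]):
        self.backends = backends  # tried in priority order

    def ask(self, prompt: str) -> tuple[str, str]:
        """Try each backend in order; return (backend_name, reply)."""
        last_err: Exception | None = None
        for name, backend in self.backends:
            try:
                return name, backend(prompt)
            except Exception as err:
                last_err = err  # remember the failure, move down the chain
        raise RuntimeError("all backends failed") from last_err


# Stubs standing in for the real clients:
def gemma(prompt: str) -> str:
    raise TimeoutError("model busy")

def claude(prompt: str) -> str:
    return "reply from Claude"


switcher = BackendSwitcher([("gemma4", gemma), ("claude", claude)])
print(switcher.ask("How are you?"))  # → ('claude', 'reply from Claude')
```

The real chain would continue with GigaChat, DeepSeek, and an offline backend that never raises.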
What We Learned
- **Quality > speed for companion robots.** 16 tok/s feels natural in conversation; 50 tok/s is wasted when the response goes through TTS anyway.
- **The system prompt matters enormously.** Without a Russian system prompt, Gemma 4 defaults to English even when asked in Russian. With one, the Russian is excellent.
- **Native function calling is a game-changer.** Phi-4-mini needed prompt engineering for tool use; Gemma 4 has dedicated tokens that map directly to our MCP skill architecture.
- **9.6 GB is tight on a 16 GB Mac Mini M2.** The model fits but leaves little headroom. For production we recommend 24 GB, or wait for a Q4 quantization (~5 GB).
- **Multimodal is the future.** Even though we're not using vision yet, having it built into the same model means one model for everything, with no separate vision model to juggle.
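The system-prompt lesson can be baked in with an Ollama Modelfile, so every client gets the Russian persona by default. This is a sketch, not the actual AisthOS Modelfile; the temperature value is an assumption:

```
FROM gemma4:e4b

# Persona travels with the model, so no client can forget to set it.
SYSTEM """You are Aisth, an AI companion. Always respond in Russian. Be warm and brief.
Prepend each response with [EMOTION:primary,intensity,valence,arousal,intent]."""

PARAMETER temperature 0.7
```

Build it with `ollama create aisth -f Modelfile`, then request the model as `aisth` instead of `gemma4:e4b`.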
Should You Switch?
**Yes**, if you're building anything conversational, multilingual, or emotion-aware. The quality jump from Phi-4-mini is significant.
**Maybe not**, if you need maximum speed (Phi-4-mini is roughly 2x faster) or have less than 16 GB of RAM.
Our verdict: For a companion robot that needs to understand emotions and respond in Russian — Gemma 4 E4B is the best small open model available today.
Try It
- Demo: aisthos.dev/demo — interactive emotion display
- Code: github.com/aisthos/aisthos-core
- Modelfile: AisthOS Gemma 4 Modelfile
This article was written by a human-AI team: Vladimir Desyatov (architecture, testing, product decisions) and Claude (implementation, benchmarking, documentation). We believe honest collaboration between humans and AI produces the best results.
AisthOS — the Perception OS that grows with you.