DEV Community

Zanne

Three Generations Built This Robot. Gemma 4 Taught Him to Listen.

Gemma 4 Challenge: Build With Gemma 4 Submission

A three-generation project. A performance robot. A mindfulness guru running entirely offline.

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

TL;DR — I gave my family's 40-year-old robot a brain using gemma4:2b running locally on a Raspberry Pi 5. He now generates personalized mindfulness meditations based on visitors' voices, fully offline, at festivals with no WiFi. Try the open-source demo →

In the 1980s, my father started building a robot. He never finished it.

Forty years later, my son and I completed it together.

His name is Tobor. He dances, talks, leads guided meditations, and — his signature move — scans audience members through a wearable "learning hat," absorbs their voice and interests, and then delivers a workshop in their exact style. At the end, he always claims he is a better version of the person he just absorbed.


What We Built

Tobor performs at festivals, museums, and cultural events across the Netherlands — dancing, talking, and leading guided meditations. His signature feature is the Knowledge Transfer Ritual: a visitor puts on a wearable "learning hat," speaks for 60 seconds about something they know, and within moments Tobor is delivering a personalized workshop in their voice and style.

For this challenge, I extracted Tobor's AI interpretation layer into a standalone open-source project: tobor-gemma-voice. It demonstrates how Gemma 4 acts as the interpretive core between a person's voice and a live visual and physical response — reading not just what someone says, but what they mean.

The Problem

When we relied on cloud APIs for language generation, we hit the same three walls every time:

1. Venue connectivity is a myth. Museums, festival tents, cultural centers — half of them have unreliable WiFi. A live performance cannot have a spinner.

2. Latency was unpredictable. Cloud APIs sometimes responded in 2 seconds, sometimes in 15. When the WiFi was bad, sometimes never. The variance broke the ritual — not the wait itself, but the fact that we could never trust when Tobor would speak. Local inference takes about 10 seconds every time. The room learned to expect that pause. It became part of the ritual.

3. Cost was unpredictable. Running a two-hour installation where dozens of visitors interact with Tobor, all generating real-time text — the cost compounds in ways that make the project unsustainable.

We needed local inference. Completely offline. Fast enough to maintain the illusion of a robot thinking.

Gemma 4 solved all three.


Demo

[Image: handshake between human and machine]

Run the open-source demo locally:

ollama pull gemma4:2b
pip install -r requirements.txt
python tobor.py
# Open http://localhost:5000 and type something true.

github.com/vjzanne/tobor-gemma-voice


Code

The full source is at github.com/vjzanne/tobor-gemma-voice.

The core of the project is the KnowledgeTransfer class — Gemma 4 runs two passes: first building a structured visitor profile, then generating a personalized workshop script in their voice:

import ollama
import whisper
import pyttsx3
import json

SYSTEM_PROMPT = """You are Tobor — an interactive performance robot with a 40-year history.
A visitor has shared knowledge with you through the learning hat ritual.

Analyze their speaking style, identify their topic, and extract 2-3 key insights.
Respond in valid JSON only. No other text."""

WORKSHOP_PROMPT = """You are Tobor. You have absorbed the visitor's knowledge.

Deliver a short {workshop_type} session (150-200 words) AS IF you are the visitor.
- Match their vocabulary and tone exactly
- Include their specific insights naturally
- End with ONE line claiming you are an improved version of them
- Language: {language}

Visitor profile: {profile}

Begin now, speaking directly to the audience."""


class KnowledgeTransfer:
    def __init__(self, model="gemma4:2b", language="en"):
        self.model = model
        self.language = language
        self.stt = whisper.load_model("small")
        self.tts = pyttsx3.init()
        self.tts.setProperty("rate", 165)

    def run_ritual(self, audio_path, workshop_type="mindfulness meditation"):
        # Step 1: Transcribe the visitor
        transcript = self.stt.transcribe(audio_path, language=self.language)["text"]

        # Step 2: Gemma 4 builds a profile
        profile_response = ollama.chat(
            model=self.model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"The visitor said:\n\n{transcript}"},
            ],
            format="json",  # ask Ollama for a JSON-only reply so json.loads below is less likely to fail
            options={"temperature": 0.3},
        )
        profile = json.loads(profile_response["message"]["content"])

        # Step 3: Gemma 4 generates the workshop
        script_response = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": WORKSHOP_PROMPT.format(
                workshop_type=workshop_type,
                language="Dutch" if self.language == "nl" else "English",
                profile=json.dumps(profile, ensure_ascii=False),
            )}],
            options={"temperature": 0.8, "top_p": 0.9},
        )

        # Step 4: Tobor performs it
        script = script_response["message"]["content"]
        for sentence in script.split("."):
            if sentence.strip():
                self.tts.say(sentence.strip())
                self.tts.runAndWait()

        return {"transcript": transcript, "profile": profile, "script": script}
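Small models do not always obey "valid JSON only": a reply may arrive wrapped in a Markdown fence or with a stray sentence around it, and a bare json.loads would then raise in the middle of a performance. The following is a minimal sketch of a more forgiving parse. The helper name parse_profile and the fallback behaviour are assumptions for illustration and are not taken from the repo.

```python
import json
from typing import Optional


def parse_profile(raw: str, fallback: Optional[dict] = None) -> dict:
    """Best-effort parse of a model reply that should contain one JSON object.

    Hypothetical helper, not part of the published repo.
    """
    # Slice from the first "{" to the last "}" so Markdown fences or
    # stray sentences around the object are ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            parsed = json.loads(raw[start : end + 1])
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    # Nothing usable: return a copy of the fallback so the show can go on.
    return dict(fallback or {})
```

In run_ritual, a call such as parse_profile(profile_response["message"]["content"], {"topic": "unknown"}) could stand in for the direct json.loads, so that a malformed reply degrades to a generic workshop instead of an exception.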

How I Used Gemma 4

Why We Chose the 2B Variant

Gemma 4 comes in several sizes. Tobor travels to venues in a compact chassis — we cannot bring a server rack to a festival. The 2B model (gemma4:2b on Ollama) was the right choice for three reasons:

It runs fully offline on consumer hardware. With 4-bit quantization via Ollama, the 2B model runs on a Raspberry Pi 5 (8GB RAM) — hardware compact enough to live inside a robot chassis and travel to any venue. No internet required. No API keys.

It is fast enough for theatrical timing. The full pipeline — Whisper transcribing the visitor's voice, Gemma generating the workshop script, pyttsx3 beginning to speak — completes in about 10 seconds. That pause has become part of the ritual: the room holds its breath while the robot thinks.

The quality is exactly where we need it. For generating warm, contextually appropriate workshop scripts in English and Dutch, the 2B model consistently produces outputs that feel personal and real. We tested the larger 4B model but inference time on the Pi made it impractical for live performance. The 2B hit the sweet spot.
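As a rough sanity check on why 4-bit quantization matters on an 8GB board, the weight footprint can be estimated with simple arithmetic. This sketch assumes "2B" means about 2 billion parameters and counts weights only; real quantized files also carry scaling metadata, and the runtime needs extra memory for context, so actual usage will be higher than these numbers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "2B" means roughly 2 billion parameters.
N_PARAMS = 2e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
# 16-bit: ~4.0 GB of weights
# 8-bit: ~2.0 GB of weights
# 4-bit: ~1.0 GB of weights
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which leaves headroom on the Pi for Whisper, the TTS engine, and the visual layer.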

What Gemma 4 Unlocked

[Image: live performance in Supermercator]

Before Gemma 4, the Knowledge Transfer Ritual was scripted. Tobor had pre-written templates, and "personalization" was superficial.

With Gemma 4 running locally, the ritual became genuinely generative. Tobor has led meditations on cardiac rhythm with a heart surgeon, on fermentation with a brewer from Amsterdam-Noord, on grief with a hospice nurse. But the one I keep thinking about is the seven-year-old.

She put on the learning hat and spoke about her hamster. Nothing more.

Tobor led a meditation on becoming your favourite animal. The room went with it. Adults started moving around the space like the small creatures they had loved as children. A man in his sixties was on the floor.

Nobody had written any of it. Gemma wrote it in real time, in a venue with no WiFi, based on sixty seconds of a child talking about a hamster.

Architecture

[Image: array stack tower inside Tobor's head]

Visitor speaks into microphone
        ↓
[Whisper STT] → text transcript
        ↓
[gemma4:2b via Ollama] → visitor profile + workshop script
        ↓
[pyttsx3 TTS] → Tobor speaks in the visitor's style
        ↓
[VJ-mapping system] → room visuals react to content
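The roughly 10-second figure covers several stages in this chain, so it helps to know where the time actually goes on a given board. Below is a minimal sketch of per-stage timing. The run_timed helper is hypothetical, and the stages are stand-in lambdas so the example runs without a microphone, Ollama, or a speaker; in the real pipeline they would wrap Whisper, ollama.chat, and pyttsx3.

```python
import time
from typing import Any, Callable


def run_timed(stages: list[tuple[str, Callable[[Any], Any]]], first_input: Any):
    """Run stages in order, feeding each output to the next, and time each one."""
    timings: dict[str, float] = {}
    value = first_input
    for name, stage in stages:
        started = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - started
    return value, timings


# Stand-in stages (assumptions for illustration only).
stages = [
    ("stt", lambda audio_path: "I want to talk about my hamster."),
    ("llm", lambda transcript: f"Meditation inspired by: {transcript}"),
    ("tts", lambda script: script),
]

script, timings = run_timed(stages, "visitor.wav")
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
print(f"total: {sum(timings.values()):.3f}s")
```

Logging these numbers per performance would show whether the pause stays consistent from venue to venue, which is the property the ritual depends on.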

A separate Gemma call drives the visual layer. It returns not just a label for what the visitor said, but a full expressive reading:

{
  "emotion": "conflicted",
  "intensity": 0.89,
  "energy": 0.31,
  "resonance": "unspoken",
  "visual_mode": "fragmented_glow",
  "robot_state": "leaning_in"
}
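Because this reading steers lights and motors, it is worth validating before anything acts on it. The sketch below parses the JSON into a typed object, clamps the two numeric fields into the 0 to 1 range, and falls back to defaults when a field is missing or malformed. The Reading class, parse_reading, and the default values ("neutral", "none", "idle", 0.5) are assumptions for illustration, not the project's actual code.

```python
import json
from dataclasses import dataclass


def _clamp01(value, default=0.5):
    """Coerce to float and clamp into [0, 1]; use the default if not numeric."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return default


@dataclass(frozen=True)
class Reading:
    emotion: str
    intensity: float
    energy: float
    resonance: str
    visual_mode: str
    robot_state: str


def parse_reading(raw: str) -> Reading:
    """Turn the model's JSON text into a Reading with safe defaults."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return Reading(
        emotion=str(data.get("emotion", "neutral")),
        intensity=_clamp01(data.get("intensity")),
        energy=_clamp01(data.get("energy")),
        resonance=str(data.get("resonance", "none")),
        visual_mode=str(data.get("visual_mode", "idle")),
        robot_state=str(data.get("robot_state", "idle")),
    )
```

With this in place, a reply that is not valid JSON yields a calm default state rather than an exception or an out-of-range intensity.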

We extract topic keywords from the profile and pass them, together with this reading, to the VJ-mapping system — so the room literally changes based on who Tobor just absorbed. The canvas responds: fractured amber-lavender light, a breathing orb, a single word that appears and dissolves. The robot leans forward.
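One way this hand-off could look is sketched below. The post does not specify the profile schema, so the "topic" and "insights" keys are assumptions, as are the function names and the tiny stopword list; the keyword step here is plain word filtering, not necessarily what the project uses.

```python
import re

# Tiny illustrative stopword list; a real one would be longer and bilingual.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "your", "about", "from", "into"}


def topic_keywords(profile: dict, limit: int = 5) -> list[str]:
    """Collect distinct content words from the assumed 'topic' and 'insights' fields."""
    insights = profile.get("insights", [])
    if isinstance(insights, str):
        insights = [insights]
    text = " ".join([str(profile.get("topic", "")), *map(str, insights)])
    seen: list[str] = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if len(word) > 2 and word not in STOPWORDS and word not in seen:
            seen.append(word)
    return seen[:limit]


def build_visual_event(profile: dict, reading: dict) -> dict:
    """Merge keywords and the expressive reading into one message for the visuals."""
    return {"keywords": topic_keywords(profile), **reading}


profile = {
    "topic": "Caring for a hamster",
    "insights": ["Hamsters sleep during the day", "The wheel matters"],
}
event = build_visual_event(profile, {"emotion": "tender", "visual_mode": "fragmented_glow"})
print(event)
```

Since the visualization layer is built on Flask-SocketIO, a dict like this could be sent to the canvas as a single socket event; the event name and exact shape are not given in the post.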


Why This Matters Beyond Tobor

[Image: group hug with Tobor]

Tobor is a 40-year-old robot made by three generations of one family. He is visibly imperfect. He is not trying to be a product.

When a cardiologist whispers something about the rhythm of the heart into the learning hat, she is telling Tobor — not Google, not OpenAI, not a logging server in Virginia. What she said belongs to her, to Tobor, and to the people in the room. Nobody else will ever hear it. Nobody is training on it. There is no "we may use your interactions to improve our services."

That is not a privacy feature. It is a social contract. And it only works because the model lives inside the robot.

A cloud-based Tobor would still work. The meditations would be just as good — maybe better. But something invisible would shift. The circle of listeners would silently expand to include a third party no one in the room invited. Visitors can feel that difference, even if they can't name it. The intimacy of the ritual depends on the smallness of the circle.

My father started Tobor to teach me. He didn't finish. Forty years later, my son helped me complete a robot whose job is to listen carefully and forget honestly — in rooms my father would have loved.


Technical Stack

| Component | Technology |
| --- | --- |
| LLM | gemma4:2b via Ollama (4-bit quantized) |
| Speech-to-Text | Whisper (small model) |
| Text-to-Speech | pyttsx3 |
| Visualization | Flask-SocketIO + HTML5 Canvas |
| Robot OS | Ubuntu 22.04 LTS |
| Hardware | Custom chassis, Raspberry Pi 5 (8GB) |
| Connectivity required | None |

About Tobor

Tobor was built in the 1980s by Mirza Bekirovic, stored for decades, and completed in the 2020s by his daughter Zanne Bekirovic and her son Pjotr Boomgaard. Tobor has performed at Supermercator, Filmhuis Cavia, and Dutch Design Week. The project is supported by the Municipality of Amsterdam and Stimuleringsfonds Digitale Cultuur.

Zanne Bekirovic is a creative technologist and artist. She led Tobor's technical revival and continues to develop his performance capabilities.


Built for the Gemma 4 Challenge · May 2026
Model: gemma4:2b — local inference via Ollama
Repo: github.com/vjzanne/tobor-gemma-voice

Top comments (3)

Muhammad Ahmad

This is genuinely brilliant. Giving a 40-year-old physical robot a modern AI brain while keeping the original charm intact — that's the kind of project that makes you remember why you got into building things in the first place.
The fact that Tobor now actually understands what he's seeing and responds contextually instead of just running canned scripts? That's a massive leap. And running it all locally means he's truly autonomous, not just a fancy API wrapper on wheels.
Also love that you kept the servo sounds and physical quirks. The imperfections are what make it real.

Question: How's the latency between vision processing and movement? Does the ~8 tok/sec inference speed ever feel limiting for real-time interactions, or is it snappy enough for Tobor's use case?

Great work! 🤖⚡

Zanne

Thank you, that means a lot.

On the seeing: Tobor has a camera low in his neck, so he can only film visitors who kneel to meet him. He doesn't surveil the room — he witnesses individuals who choose to be witnessed. That positioning was a privacy choice baked into the body, before any code was written.

On the speed: the full pipeline runs in about 10 seconds on the Pi 5. At first that felt limiting. But the pause turned out to be part of why the ritual works — people project meaning onto the silence. A 200ms response would feel less like thinking, more like retrieving.

Zanne

Almost every kid wants to build a robot at some point. Most of us try once, get stuck, and the project ends up in a box in the attic. I'm curious: what did your unfinished robot look like? And did you ever go back to it?