My Daughter's Doodles Came to Life: Building an AI Tutor with Gemini & SAM 2
An entry for the #GeminiLiveAgentChallenge.
By a Mom Developer
The Spark: First Grade Jitters
My daughter just started first grade here in Japan. It's a huge milestone: the shiny new randoseru (school backpack), the excitement of new friends, but also the daunting reality of homework.
Watching her sit at her desk, I noticed two things: she loved drawing, but she found studying to be a chore. She would doodle monsters and princesses on the corners of her math worksheets instead of solving the problems.
That’s when it hit me. What if those doodles could help her study?
As a developer (and a mom), I decided to build a solution. I wanted to bridge the gap between her creativity and her education. I wanted to bring her drawings to life, not just as animations, but as intelligent, interactive study partners.
Introducing AmiBuddy
AmiBuddy is the result of that late-night "mom coding." It’s an AI-powered application that transforms static children's drawings into "Living AI Agents."
Demo & Links
- 📺 Watch the Demo:
- 🚀 Try the App: AmiBuddy Live Site
Architecture Overview
I built AmiBuddy using a hybrid architecture: a React Native (Expo) frontend for the best mobile experience, and a Python (FastAPI) backend on Cloud Run for heavy AI processing.
```mermaid
graph LR
    subgraph Frontend ["Frontend Layer (React Native / Expo)"]
        direction TB
        MobileClient["📱 Mobile App"]
        WebClient["💻 Web App"]
    end

    subgraph External_AI ["External AI Services"]
        direction TB
        Gemini["✨ Gemini 2.5 Flash (Vision/Reasoning)"]
        ElevenLabs["🗣️ ElevenLabs (TTS)"]
    end

    subgraph Cloud_Run ["Backend (Cloud Run)"]
        direction TB
        OrchestratorAPI["🚀 Orchestrator API"]
        SAM2["🧩 SAM 2"]
        RiggingAgent["🦴 Rigging Agent"]
    end

    Frontend -.->|Direct API Call| Gemini
    Frontend -.->|Direct API Call| ElevenLabs
    Frontend ==>|Upload Image| OrchestratorAPI
    OrchestratorAPI --> SAM2
    OrchestratorAPI --> RiggingAgent
    RiggingAgent -.->|Query Structure| Gemini
```
Technical Deep Dive: The "Living Agent" Pipeline
The core magic happens in the backend, where we turn a static image into a rigged character.
Phase 1: Structural Understanding (Rigging Agent)
Before we can animate a drawing, we need to understand what it is. Where is the head? Where are the hands?
I used Google Gemini 2.5 Flash for this. Unlike traditional object-detection models, which need task-specific training data, Gemini has incredible "zero-shot" understanding: I simply prompt it to find the joints and body parts.
Code Highlight: The Rigging Prompt (agents/rigging.py)
I specifically ask Gemini to be "generous" with bounding boxes to ensure we don't crop off limbs.
```python
prompt = """
Analyze this character drawing for skeletal animation.
Provide a JSON object with the following structure:
{
  "joints": {
    "neck": [x, y],
    "left_shoulder": [x, y],
    ...
  },
  "parts": {
    "head": [ymin, xmin, ymax, xmax],
    "left_arm": [ymin, xmin, ymax, xmax],
    ...
  }
}
IMPORTANT: For "parts", ensure the bounding boxes are GENEROUS and INCLUSIVE.
Include the ENTIRE limb, even if it overlaps with the body slightly.
Do not crop off hands, feet, or joints.
RETURN ONLY JSON.
"""
```
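Even with "RETURN ONLY JSON" in the prompt, Gemini sometimes wraps its answer in markdown fences or adds commentary, so the response needs defensive parsing before the rig data can be used. Here is a minimal sketch of that step; the helper name and fallback behavior are my own, not taken from the AmiBuddy repo:

```python
import json
import re

def parse_rig_response(raw: str) -> dict:
    """Extract the JSON rig data from a Gemini text response.

    Strips ```json ... ``` fences if present, and falls back to
    grabbing the outermost braces before parsing.
    """
    text = raw.strip()
    # Strip markdown code fences if the model added them
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back: take the substring between the first '{' and the last '}'
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```

A guard like this keeps one chatty model response from crashing the whole rigging pipeline.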
Phase 2: Surgical Extraction (Segmentation Agent)
Once Gemini tells us where the parts are (roughly), we need to cut them out with pixel-perfect precision.
For this, I used Meta SAM 2 (Segment Anything Model 2). The magic here is the hand-off: we use the bounding boxes from Gemini as the "prompt" for SAM 2.
Code Highlight: From Gemini Box to SAM 2 Mask (agents/segmentation.py)
```python
for part_name, box in parts.items():
    # Convert normalized coordinates from Gemini to pixels
    ymin, xmin, ymax, xmax = box
    input_box = np.array([xmin * width, ymin * height, xmax * width, ymax * height])

    # Add padding to ensure SAM 2 gets the whole context
    padding = 10
    input_box[0] = max(0, input_box[0] - padding)
    ...

    # Ask SAM 2 to predict the mask for this box
    masks, scores, _ = self.predictor.predict(
        point_coords=None,
        point_labels=None,
        box=input_box[None, :],
        multimask_output=False,
    )
    # The result is a perfect alpha mask for that body part!
```
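SAM 2 returns a boolean mask per box; to turn that into a sprite the frontend can move around, the mask becomes the alpha channel of a cropped RGBA image. A NumPy sketch of that step (the function and variable names are illustrative, not from the actual codebase):

```python
import numpy as np

def cut_out_part(image_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Turn a SAM 2 boolean mask into a cropped RGBA sprite.

    image_rgb: (H, W, 3) uint8 array of the original drawing.
    mask:      (H, W) boolean array from predictor.predict().
    Returns a (h, w, 4) uint8 array ready to save as a transparent PNG.
    """
    h, w, _ = image_rgb.shape
    rgba = np.zeros((h, w, 4), dtype=np.uint8)
    rgba[..., :3] = image_rgb
    rgba[..., 3] = mask.astype(np.uint8) * 255  # boolean mask becomes the alpha channel

    # Crop to the mask's bounding box so the frontend gets a small sprite
    ys, xs = np.nonzero(mask)
    return rgba[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

Everything outside the mask ends up fully transparent, so overlapping limbs layer cleanly when the parts are re-assembled.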
Phase 3: Life & Voice
Finally, the frontend (LiveBuddy component) takes these cut-out parts and animates them using simple sine waves (math-based animation) to simulate breathing and floating.
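The "breathing" is just layered sine waves with a per-part phase offset. Here is that math as a small Python sketch; the frontend applies the equivalent transform per frame in the LiveBuddy component, and the frequencies and amplitudes below are illustrative guesses, not the app's actual values:

```python
import math

def idle_transform(t: float, part_index: int = 0) -> dict:
    """Compute a gentle 'breathing' transform for one body part at time t (seconds).

    Each part gets a phase offset so the limbs don't all move in lockstep.
    """
    phase = part_index * 0.7  # de-synchronize parts
    breathe = math.sin(2 * math.pi * 0.4 * t + phase)  # ~0.4 Hz breathing cycle
    return {
        "scale_y": 1.0 + 0.02 * breathe,  # chest rises and falls by ~2%
        "offset_y": 3.0 * math.sin(2 * math.pi * 0.25 * t + phase),  # slow float, a few px
        "rotation": 2.0 * breathe,  # sway a couple of degrees
    }
```

Because the motion is pure math over elapsed time, no animation assets or rig files need to be shipped to the client.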
The Study Buddy Agents
Once the character is "alive," it needs to be "smart." I implemented two more agents on the Frontend side.
Phase 4: Homework Analysis (Gemini Vision Agent)
When the child uploads a homework sheet, it’s not enough to just OCR the text. The AI needs to understand the intent and difficulty to explain it to a child.
Code Highlight: Homework Analysis (src/services/geminiService.ts)
I use a structured JSON prompt with Gemini Vision to extract metadata about the homework. Note how I switch the prompt language based on the user's locale!
```typescript
const prompt = locale === 'en'
  ? `Analyze this homework image for a young student.
Provide a JSON response with:
- description: A short, encouraging summary of what the homework is about (in simple English).
- topics: A list of key topics found (e.g., "Addition", "Kanji").
- difficulty: estimated difficulty ("Easy", "Medium", "Hard").
Keep the tone friendly and encouraging.
Return ONLY valid JSON.`
  // Japanese branch: the same instructions, phrased in simple Japanese for a Japanese locale
  : `この宿題の画像を子供向けに分析してください。
以下のJSON形式で返してください:
- description: 宿題の内容についての短く励ますような説明(やさしい日本語で)。
- topics: 見つかった主要なトピックのリスト(例:「たしざん」、「かんじ」)。
...`;

const result = await model.generateContent([
  prompt,
  { inlineData: { mimeType, data: base64 } } // Multimodal input: prompt + image
]);
```
Phase 5: Voice Conversation (Speech-to-Text & TTS)
For a 6-year-old, typing is a barrier. AmiBuddy is voice-first.
I used Gemini 2.5 Flash again for Speech-to-Text (transcription) because it handles mixed languages (Japanese/English) better than traditional STT engines, and ElevenLabs for the character's voice.
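For reference, the ElevenLabs text-to-speech call is a single POST per utterance. Here is a minimal Python sketch of assembling that request, kept separate from the HTTP client so it is easy to test; the endpoint and headers follow the ElevenLabs API docs as I understand them, and the model choice is an assumption:

```python
import json

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(text: str, voice_id: str, api_key: str) -> tuple[str, dict, bytes]:
    """Assemble the URL, headers, and JSON body for an ElevenLabs TTS call.

    The returned parts can be sent with any HTTP client, e.g.:
        url, headers, body = build_tts_request("Hi!", "my-voice-id", key)
        audio = requests.post(url, headers=headers, data=body).content
    """
    url = ELEVENLABS_TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": api_key,             # ElevenLabs auth header
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",            # response body is an MP3 stream
    }
    body = json.dumps({
        "text": text,
        "model_id": "eleven_multilingual_v2",  # assumed: a model that handles Japanese + English
    }).encode("utf-8")
    return url, headers, body
```

The multilingual model matters here: the character has to sound natural whether my daughter asks in Japanese or English.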
Code Highlight: Context-Aware Chat (src/services/voiceConversationService.ts)
The AI needs to know what homework we are looking at to answer questions correctly.
```typescript
// Build conversation context
const contextPrompt = locale === 'en'
  ? `You are a friendly teacher helping a child with homework.
Homework context: ${homeworkContext}
Answer the child's question simply and kindly.
Keep answers short (2-3 sentences).`
  // Japanese branch: the same teacher persona and brevity rules, in Japanese
  : `あなたは子供の宿題を手伝う優しい先生です。
宿題の内容: ${homeworkContext}
子供の質問に対して、わかりやすく、優しく答えてください。
答えは2-3文で簡潔にしてください。`;
```
Workflow Summary
```mermaid
sequenceDiagram
    participant User as 👤 User
    participant API as 🚀 Backend
    participant Gemini as ✨ Gemini
    participant SAM2 as 🧩 SAM 2
    participant Storage as ☁️ Firebase
    participant Eleven as 🗣️ ElevenLabs

    Note over User, Storage: 1. Character Creation
    User->>API: Upload Drawing
    API->>Gemini: "Rig this character"
    Gemini-->>API: Skeleton Data
    loop Each Part
        API->>SAM2: "Segment this part"
        SAM2-->>API: Part Mask
    end
    API-->>User: Living Character!

    Note over User, Eleven: 2. Study Session
    User->>Gemini: Upload Homework Image
    Gemini-->>User: "This looks like Math!"
    User->>User: Ask by Voice: "How do I do this?"
    User->>Gemini: Transcribe Voice -> LLM Answer
    Gemini-->>User: "Try counting backwards!"
    User->>Eleven: Speak Response
    Eleven-->>User: Character speaks!
```
Why This Matters
For my daughter, AmiBuddy turned homework from a struggle into play. She’s excited to show her "buddy" what she learned today.
For me, it’s a testament to how AI can be personal. We often talk about AI in terms of productivity or business, but its potential to connect with us emotionally—to nurture a child's curiosity—is what excites me the most.
I’m open-sourcing parts of this journey to help other parents and developers build their own magic. Because sometimes, the best way to predict the future is to invent it for your kids.
