My Daughter's Doodles Came to Life: Building an AI Tutor with Gemini & SAM 2
An entry for the #GeminiLiveAgentChallenge.
By a Mom Developer
The Spark: First Grade Jitters
My daughter just started first grade here in Japan. It's a huge milestone: the shiny new randoseru (school backpack), the excitement of new friends, but also the daunting reality of homework.
Watching her sit at her desk, I noticed two things: she loved drawing, but she found studying to be a chore. She would doodle monsters and princesses on the corners of her math worksheets instead of solving the problems.
That’s when it hit me. What if those doodles could help her study?
As a developer (and a mom), I decided to build a solution. I wanted to bridge the gap between her creativity and her education. I wanted to bring her drawings to life, not just as animations, but as intelligent, interactive study partners.
Introducing AmiBuddy
AmiBuddy is the result of that late-night "mom coding." It’s an AI-powered application that transforms static children's drawings into "Living AI Agents."
Demo & Links
- 📺 Watch the Demo:
- 🚀 Try the App: AmiBuddy Live Site
Architecture Overview
I built AmiBuddy using a hybrid architecture: a React Native (Expo) frontend for the best mobile experience, and a Python (FastAPI) backend on Cloud Run for heavy AI processing.
```mermaid
graph LR
    subgraph Frontend ["Frontend Layer (React Native / Expo)"]
        direction TB
        MobileClient["📱 Mobile App"]
        WebClient["💻 Web App"]
    end

    subgraph External_AI ["External AI Services"]
        direction TB
        Gemini["✨ Gemini 2.5 Flash (Vision/Reasoning)"]
        ElevenLabs["🗣️ ElevenLabs (TTS)"]
    end

    subgraph Cloud_Run ["Backend (Cloud Run)"]
        direction TB
        OrchestratorAPI["🚀 Orchestrator API"]
        SAM2["🧩 SAM 2"]
        RiggingAgent["🦴 Rigging Agent"]
    end

    Frontend -.->|Direct API Call| Gemini
    Frontend -.->|Direct API Call| ElevenLabs
    Frontend ==>|Upload Image| OrchestratorAPI
    OrchestratorAPI --> SAM2
    OrchestratorAPI --> RiggingAgent
    RiggingAgent -.->|Query Structure| Gemini
```
Technical Deep Dive: The "Living Agent" Pipeline
The core magic happens in the backend, where we turn a static image into a rigged character.
Phase 1: Structural Understanding (Rigging Agent)
Before we can animate a drawing, we need to understand what it is. Where is the head? Where are the hands?
I used Google Gemini 2.5 Flash for this. Unlike traditional object-detection models, which need task-specific training data, Gemini has incredible "zero-shot" understanding: I simply prompt it to find the joints and body parts.
Code Highlight: The Rigging Prompt (agents/rigging.py)
I specifically ask Gemini to be "generous" with bounding boxes to ensure we don't crop off limbs.
```python
prompt = """
Analyze this character drawing for skeletal animation.
Provide a JSON object with the following structure:
{
  "joints": {
    "neck": [x, y],
    "left_shoulder": [x, y],
    ...
  },
  "parts": {
    "head": [ymin, xmin, ymax, xmax],
    "left_arm": [ymin, xmin, ymax, xmax],
    ...
  }
}
IMPORTANT: For "parts", ensure the bounding boxes are GENEROUS and INCLUSIVE.
Include the ENTIRE limb, even if it overlaps with the body slightly.
Do not crop off hands, feet, or joints.
RETURN ONLY JSON.
"""
```
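Even with "RETURN ONLY JSON" in the prompt, Gemini sometimes wraps its answer in markdown fences or adds commentary, so the response needs defensive parsing before the rig data can be used. Here is a minimal sketch of that step; the helper name and fallback behavior are my own, not taken from the AmiBuddy repo:

```python
import json
import re

def parse_rig_response(raw: str) -> dict:
    """Extract the JSON rig data from a Gemini text response.

    Strips ```json ... ``` fences if present, and falls back to
    grabbing the outermost braces before parsing.
    """
    text = raw.strip()
    # Strip markdown code fences if the model added them
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back: take the substring between the first '{' and the last '}'
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```

A guard like this keeps one chatty model response from crashing the whole rigging pipeline.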
Phase 2: Surgical Extraction (Segmentation Agent)
Once Gemini tells us where the parts are (roughly), we need to cut them out with pixel-perfect precision.
For this, I used Meta SAM 2 (Segment Anything Model 2). The magic here is the hand-off: we use the bounding boxes from Gemini as the "prompt" for SAM 2.
Code Highlight: From Gemini Box to SAM 2 Mask (agents/segmentation.py)
```python
for part_name, box in parts.items():
    # Convert normalized coordinates from Gemini to pixels
    ymin, xmin, ymax, xmax = box
    input_box = np.array([xmin * width, ymin * height, xmax * width, ymax * height])

    # Add padding to ensure SAM 2 gets the whole context
    padding = 10
    input_box[0] = max(0, input_box[0] - padding)
    ...

    # Ask SAM 2 to predict the mask for this box
    masks, scores, _ = self.predictor.predict(
        point_coords=None,
        point_labels=None,
        box=input_box[None, :],
        multimask_output=False,
    )
    # The result is a perfect alpha mask for that body part!
```
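SAM 2 returns a boolean mask per box; to turn that into a sprite the frontend can move around, the mask becomes the alpha channel of a cropped RGBA image. A NumPy sketch of that step (the function and variable names are illustrative, not from the actual codebase):

```python
import numpy as np

def cut_out_part(image_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Turn a SAM 2 boolean mask into a cropped RGBA sprite.

    image_rgb: (H, W, 3) uint8 array of the original drawing.
    mask:      (H, W) boolean array from predictor.predict().
    Returns a (h, w, 4) uint8 array ready to save as a transparent PNG.
    """
    h, w, _ = image_rgb.shape
    rgba = np.zeros((h, w, 4), dtype=np.uint8)
    rgba[..., :3] = image_rgb
    rgba[..., 3] = mask.astype(np.uint8) * 255  # boolean mask becomes the alpha channel

    # Crop to the mask's bounding box so the frontend gets a small sprite
    ys, xs = np.nonzero(mask)
    return rgba[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

Everything outside the mask ends up fully transparent, so overlapping limbs layer cleanly when the parts are re-assembled.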
Phase 3: Life & Voice
Finally, the frontend (LiveBuddy component) takes these cut-out parts and animates them using simple sine waves (math-based animation) to simulate breathing and floating.
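The "breathing" is just layered sine waves with a per-part phase offset. Here is that math as a small Python sketch; the frontend applies the equivalent transform per frame in the LiveBuddy component, and the frequencies and amplitudes below are illustrative guesses, not the app's actual values:

```python
import math

def idle_transform(t: float, part_index: int = 0) -> dict:
    """Compute a gentle 'breathing' transform for one body part at time t (seconds).

    Each part gets a phase offset so the limbs don't all move in lockstep.
    """
    phase = part_index * 0.7  # de-synchronize parts
    breathe = math.sin(2 * math.pi * 0.4 * t + phase)  # ~0.4 Hz breathing cycle
    return {
        "scale_y": 1.0 + 0.02 * breathe,  # chest rises and falls by ~2%
        "offset_y": 3.0 * math.sin(2 * math.pi * 0.25 * t + phase),  # slow float, a few px
        "rotation": 2.0 * breathe,  # sway a couple of degrees
    }
```

Because the motion is pure math over elapsed time, no animation assets or rig files need to be shipped to the client.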
The Study Buddy Agents
Once the character is "alive," it needs to be "smart." I implemented two more agents on the Frontend side.
Phase 4: Homework Analysis (Gemini Vision Agent)
When the child uploads a homework sheet, it’s not enough to just OCR the text. The AI needs to understand the intent and difficulty to explain it to a child.
Code Highlight: Homework Analysis (src/services/geminiService.ts)
I use a structured JSON prompt with Gemini Vision to extract metadata about the homework. Note how I switch the prompt language based on the user's locale!
```typescript
const prompt = locale === 'en'
  ? `Analyze this homework image for a young student.
Provide a JSON response with:
- description: A short, encouraging summary of what the homework is about (in simple English).
- topics: A list of key topics found (e.g., "Addition", "Kanji").
- difficulty: estimated difficulty ("Easy", "Medium", "Hard").
Keep the tone friendly and encouraging.
Return ONLY valid JSON.`
  // Japanese branch: the same instructions, phrased in simple Japanese for a Japanese locale
  : `この宿題の画像を子供向けに分析してください。
以下のJSON形式で返してください:
- description: 宿題の内容についての短く励ますような説明(やさしい日本語で)。
- topics: 見つかった主要なトピックのリスト(例:「たしざん」、「かんじ」)。
...`;

const result = await model.generateContent([
  prompt,
  { inlineData: { mimeType, data: base64 } } // Multimodal input: prompt + image
]);
```
Phase 5: Voice Conversation (Speech-to-Text & TTS)
For a 6-year-old, typing is a barrier. AmiBuddy is voice-first.
I used Gemini 2.5 Flash again for Speech-to-Text (transcription) because it handles mixed languages (Japanese/English) better than traditional STT engines, and ElevenLabs for the character's voice.
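For reference, the ElevenLabs text-to-speech call is a single POST per utterance. Here is a minimal Python sketch of assembling that request, kept separate from the HTTP client so it is easy to test; the endpoint and headers follow the ElevenLabs API docs as I understand them, and the model choice is an assumption:

```python
import json

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(text: str, voice_id: str, api_key: str) -> tuple[str, dict, bytes]:
    """Assemble the URL, headers, and JSON body for an ElevenLabs TTS call.

    The returned parts can be sent with any HTTP client, e.g.:
        url, headers, body = build_tts_request("Hi!", "my-voice-id", key)
        audio = requests.post(url, headers=headers, data=body).content
    """
    url = ELEVENLABS_TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": api_key,             # ElevenLabs auth header
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",            # response body is an MP3 stream
    }
    body = json.dumps({
        "text": text,
        "model_id": "eleven_multilingual_v2",  # assumed: a model that handles Japanese + English
    }).encode("utf-8")
    return url, headers, body
```

The multilingual model matters here: the character has to sound natural whether my daughter asks in Japanese or English.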
Code Highlight: Context-Aware Chat (src/services/voiceConversationService.ts)
The AI needs to know what homework we are looking at to answer questions correctly.
```typescript
// Build conversation context
const contextPrompt = locale === 'en'
  ? `You are a friendly teacher helping a child with homework.
Homework context: ${homeworkContext}
Answer the child's question simply and kindly.
Keep answers short (2-3 sentences).`
  // Japanese branch: the same teacher persona and brevity rules, in Japanese
  : `あなたは子供の宿題を手伝う優しい先生です。
宿題の内容: ${homeworkContext}
子供の質問に対して、わかりやすく、優しく答えてください。
答えは2-3文で簡潔にしてください。`;
```
Workflow Summary
```mermaid
sequenceDiagram
    participant User as 👤 User
    participant API as 🚀 Backend
    participant Gemini as ✨ Gemini
    participant SAM2 as 🧩 SAM 2
    participant Storage as ☁️ Firebase
    participant Eleven as 🗣️ ElevenLabs

    Note over User, Storage: 1. Character Creation
    User->>API: Upload Drawing
    API->>Gemini: "Rig this character"
    Gemini-->>API: Skeleton Data
    loop Each Part
        API->>SAM2: "Segment this part"
        SAM2-->>API: Part Mask
    end
    API-->>User: Living Character!

    Note over User, Eleven: 2. Study Session
    User->>Gemini: Upload Homework Image
    Gemini-->>User: "This looks like Math!"
    User->>User: Ask by Voice: "How do I do this?"
    User->>Gemini: Transcribe Voice -> LLM Answer
    Gemini-->>User: "Try counting backwards!"
    User->>Eleven: Speak Response
    Eleven-->>User: Character speaks!
```
Why This Matters
For my daughter, AmiBuddy turned homework from a struggle into play. She’s excited to show her "buddy" what she learned today.
For me, it’s a testament to how AI can be personal. We often talk about AI in terms of productivity or business, but its potential to connect with us emotionally—to nurture a child's curiosity—is what excites me the most.
I’m open-sourcing parts of this journey to help other parents and developers build their own magic. Because sometimes, the best way to predict the future is to invent it for your kids.
