Created for the Gemini Live Agent Challenge
The Problem: Children Are Losing the Habit of Reading
Children worldwide are spending more time on social media than ever before — and it's replacing reading, creative play, and healthy development.
- U.S. teens spend an average of 4.8 hours per day on social media (Gallup, 2023)
- Kids aged 4-18 spend an average of 112 minutes daily on TikTok — 60% more than YouTube (Qustodio via TechCrunch, 2024)
- ASEAN school-aged children (ages 6-14) spend 2.77 hours per day on screens, exceeding the recommended 2-hour limit (HealthcareAsia, 2024)
- Over 90 minutes of daily screen time is linked to below-average performance in communication, writing, and numeracy for children ages 2-8 (Centre for Social Justice, 2024)
Children are drawn to screens because the content is engaging — colorful, animated, interactive. But most of that content is passive consumption. There is a lack of interactive, creative, educational digital experiences that match the engagement level of social media while actually benefiting children's development.
As a developer from Timor-Leste — where over 72% of the population is under 35 but children's book access remains extremely limited — I wanted to change that. What if a child could simply speak their idea and watch it transform into a fully illustrated, narrated storybook?
That's exactly what I built with KidStory: Storybook for Kids — an AI-powered platform that transforms screen time from passive consumption into active creation.
For Parents
Register an account, then create stories for your children to read and enjoy together.
For Kids
Let your child speak their own story idea and watch the AI bring it to life with pictures and narration.
Screenshot of the app homepage with logo and "Create Your Story" button
The Challenge: Creative Storyteller Category
I built this project for the Gemini Live Agent Challenge in the Creative Storyteller category, which focuses on multimodal storytelling with interleaved output. The challenge was to create an agent that thinks like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream.
The key requirement? Use Gemini's native interleaved output capabilities: not just generating text and then images separately, but creating them together in one cohesive stream.
The Magic: Interleaved Output in Action
What is Interleaved Output?
Traditional AI workflows generate content sequentially:
- Generate story text
- Wait for completion
- Generate images based on text
- Wait for completion
- Generate audio
This creates a disjointed experience with long wait times.
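To put rough numbers on the difference, here is a back-of-the-envelope comparison. The stage durations below are invented purely for illustration; real timings depend on story length and model load:

```typescript
// Illustrative latency comparison; all durations are invented for the example.
const textSecs = 20;  // story text generation
const imageSecs = 40; // illustrations
const audioSecs = 15; // narration

// Sequential pipeline: each stage waits for the previous one
const sequentialTotal = textSecs + imageSecs + audioSecs; // 75

// Interleaved: text and images stream together in one call (bounded by the
// slower of the two), while audio is generated in parallel
const interleavedTotal = Math.max(Math.max(textSecs, imageSecs), audioSecs); // 40

console.log({ sequentialTotal, interleavedTotal });
```

Even with these made-up numbers, the bigger win is perceived latency: the user starts seeing text and pictures within seconds instead of staring at a spinner.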
Interleaved output changes everything. With Gemini's gemini-2.5-flash-image model, I can request both text AND images in a single API call, and they stream back together:
```typescript
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: prompt,
  config: {
    responseModalities: ["TEXT", "IMAGE"], // The magic happens here!
  },
});
```
Diagram showing traditional sequential generation vs interleaved generation
The User Experience
When a parent/child creates a story, here's what happens:
- Voice or Text Input: The child speaks or types their story idea
- Real-time Generation: Story text and illustrations generate together
- Magical Painting Effect: Each page's illustration appears as the story is being written
- Parallel Audio: While images stream, audio narration generates in parallel
- Complete Storybook: Within minutes, a fully illustrated, narrated storybook is ready
Screenshot of story generation in progress
The Architecture: Orchestrating Multiple Gemini Models
One of the most interesting aspects of this project was learning to orchestrate multiple Gemini models, each specialized for different tasks:
1. The Creative Director: gemini-2.5-flash-image
This model handles the core story generation with interleaved output:
```typescript
// In /app/api/generate-story/route.ts
const result = await ai.models.generateContentStream({
  model: "gemini-2.5-flash-image",
  contents: [{ role: "user", parts }],
  config: {
    responseModalities: ["TEXT", "IMAGE"],
    responseMimeType: "application/json",
  },
});

// Stream back to client with Server-Sent Events
// (generateContentStream returns an async iterable of chunks)
for await (const chunk of result) {
  // Send story JSON chunks
  // Send image data as it arrives
}
```
Code snippet showing the interleaved generation setup
2. The Voice: gemini-2.5-flash-preview-tts
For narration, I use Gemini's text-to-speech model with three distinct narrator personalities:
```typescript
// In /lib/gcp/tts.ts
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [
    {
      role: "user",
      parts: [{ text: pageText }],
    },
  ],
  config: {
    responseModalities: ["AUDIO"], // TTS models require the AUDIO modality
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: selectedVoice } },
    },
  },
});
```
Kids can choose from:
- Luna: Warm and gentle
- Stella: Energetic and playful
- Kiko: Calm and soothing
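Internally, each narrator personality just resolves to one of Gemini's prebuilt TTS voices. A minimal sketch of that mapping is shown below; the voice names on the right are assumptions for illustration, not necessarily the ones the app uses:

```typescript
// Hypothetical mapping of narrator personalities to prebuilt Gemini TTS voices.
// The voice names are illustrative; pick from the Gemini TTS voice list.
const NARRATOR_VOICES: Record<string, string> = {
  Luna: "Aoede",  // warm and gentle
  Stella: "Puck", // energetic and playful
  Kiko: "Kore",   // calm and soothing
};

const selectedVoice = NARRATOR_VOICES["Luna"];
console.log(selectedVoice);
```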
Screenshot showing narrator voice selection
3. The Quiz Master: gemini-2.5-flash
After reading, kids can take an interactive quiz. I optimized this by using the text-only model for quiz generation (faster and more cost-effective), then adding TTS for audio:
```typescript
// In /app/api/live-quiz/route.ts
const quizResponse = await ai.models.generateContent({
  model: "gemini-2.5-flash", // Text-only for speed
  contents: quizPrompt,
  config: {
    responseMimeType: "application/json",
  },
});

// Then generate audio separately
const audioResponse = await generateAudio(question.text);
```
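Because the quiz call uses JSON mode, the response parses directly into a typed structure. The schema below is an assumption for illustration; the real shape depends on the quiz prompt:

```typescript
// Hypothetical quiz schema returned by the JSON-mode call above
interface QuizQuestion {
  text: string;
  options: string[];
  answerIndex: number; // index into `options`
}

// Example of what the model's JSON text might parse into
const raw =
  '{"questions":[{"text":"Who helped the little dragon?","options":["Luna","Max","Pip"],"answerIndex":2}]}';
const quiz: { questions: QuizQuestion[] } = JSON.parse(raw);

const q = quiz.questions[0];
console.log(q.options[q.answerIndex]); // "Pip"
```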
Screenshot of the Magic Quiz interface with voice input
Technical Deep Dive
The Stack
- Frontend: Next.js 16 with React 19 and TypeScript
- Styling: Tailwind CSS 4
- AI Models: Google Gemini via Vertex AI
- Database: Firebase Firestore
- Storage: Google Cloud Storage
- Hosting: Google Cloud Run
- Authentication: Firebase Auth
KidStory system architecture showing all components
Key Technical Challenges & Solutions
1. Streaming Interleaved Content
The biggest challenge was handling the interleaved stream of text and images. Gemini returns them mixed together, so I needed to:
```typescript
// Parse the stream and separate text from images
for await (const chunk of result) {
  const parts = chunk.candidates?.[0]?.content?.parts || [];
  for (const part of parts) {
    if (part.text) {
      // Handle story JSON text
      storyBuffer += part.text;
    } else if (part.inlineData) {
      // Handle image data
      const imageData = part.inlineData.data;
      // Upload to Cloud Storage
      // Send URL to client
    }
  }
}
```
2. Character Consistency
Kids can upload photos of themselves or loved ones to appear in the story. To maintain consistency across all pages:
```typescript
// Compress and include reference images
const referenceImages = await Promise.all(
  characterPhotos.map(async (photo) => {
    const compressed = await compressImage(photo);
    return {
      inlineData: {
        mimeType: "image/jpeg",
        data: compressed,
      },
    };
  }),
);

// Include in prompt
const prompt = {
  role: "user",
  parts: [
    { text: storyPrompt },
    ...referenceImages, // Gemini uses these for consistency
  ],
};
```
3. Real-time Progress Updates
To create the "magical painting" effect, I used Server-Sent Events (SSE) to stream updates to the client:
```typescript
// Server: Send updates as they arrive
const encoder = new TextEncoder();
const stream = new ReadableStream({
  async start(controller) {
    // Send story chunks
    controller.enqueue(encoder.encode(`data: ${JSON.stringify(chunk)}\n\n`));
    // Send image URLs as they're uploaded
    controller.enqueue(
      encoder.encode(`data: ${JSON.stringify(imageUpdate)}\n\n`),
    );
  },
});

return new Response(stream, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  },
});
```
```typescript
// Client: React hook to handle streaming
const useStoryGenerator = () => {
  const [progress, setProgress] = useState<Record<string, string>>({});

  useEffect(() => {
    const eventSource = new EventSource("/api/generate-story");
    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === "chunk") {
        // Update story text
      } else if (data.type === "image") {
        // Show image for specific page
        setProgress((prev) => ({
          ...prev,
          [`page_${data.pageNumber}`]: "complete",
        }));
      }
    };
    return () => eventSource.close(); // Close the stream on unmount
  }, []);

  return { progress };
};
```
4. Voice Input for Kids
Using the Web Speech API, kids can speak their story ideas:
```typescript
// Custom hook for voice recognition (Web Speech API, WebKit-prefixed)
const useVoiceInput = () => {
  const [transcript, setTranscript] = useState("");
  const recognitionRef = useRef<any>(null);

  const startListening = () => {
    const recognition = new (window as any).webkitSpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;
    recognition.onresult = (event: any) => {
      const text = Array.from(event.results)
        .map((result: any) => result[0].transcript)
        .join("");
      setTranscript(text);
    };
    recognition.start();
    recognitionRef.current = recognition;
  };

  const stopListening = () => recognitionRef.current?.stop();

  return { startListening, stopListening, transcript };
};
```
5. PDF Generation with Compression
Stories can be downloaded as PDFs. To keep file sizes manageable, I implemented image compression:
```typescript
// Compress images before adding to PDF
async function compressImage(dataUrl: string): Promise<string> {
  return new Promise((resolve) => {
    const img = new Image();
    img.onload = () => {
      const canvas = document.createElement("canvas");
      const maxWidth = 800;
      let width = img.width;
      let height = img.height;
      if (width > maxWidth) {
        height = (height * maxWidth) / width;
        width = maxWidth;
      }
      canvas.width = width;
      canvas.height = height;
      const ctx = canvas.getContext("2d")!;
      ctx.drawImage(img, 0, 0, width, height);
      // Compress as JPEG with 70% quality
      const compressed = canvas.toDataURL("image/jpeg", 0.7);
      resolve(compressed);
    };
    img.src = dataUrl;
  });
}
```
This reduced PDF sizes by 70-85% while maintaining good visual quality.
Screenshot of PDF download with sample pages
Deployment on Google Cloud
The entire application runs on Google Cloud Platform:
Cloud Run Deployment
I created a simple deployment script that handles everything:
```bash
#!/bin/bash
# deploy.sh
gcloud run deploy storybook-for-kids \
  --source=. \
  --region=us-central1 \
  --platform=managed \
  --allow-unauthenticated \
  --min-instances=1 \
  --set-env-vars=GOOGLE_CLOUD_PROJECT=storybook-for-kids \
  --set-env-vars=GCS_BUCKET_NAME=storybook-for-kids-media
# ... other environment variables
```
Screenshot of Cloud Run console showing deployed service
IAM Permissions
One challenge was setting up the right permissions. I automated this with a script:
```bash
#!/bin/bash
# fix_permissions.sh

# Grant Vertex AI access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/aiplatform.user"

# Grant Firestore access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/datastore.user"

# Grant Cloud Storage access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/storage.objectAdmin"
```
Deployment architecture diagram
Performance Optimizations
1. Retry Logic for Rate Limiting
Gemini API can hit rate limits during peak usage. I implemented exponential backoff:
```typescript
async function generateWithRetry(prompt: any, maxRetries = 8) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await ai.models.generateContent(prompt);
    } catch (error: any) {
      // Retry on 429 (rate limit) with capped exponential backoff plus jitter
      if (error.status === 429 && attempt < maxRetries - 1) {
        const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay + jitter));
        continue;
      }
      throw error;
    }
  }
}
```
2. Parallel Processing
Images and audio generate in parallel to reduce total wait time:
```typescript
// Generate all images and audio simultaneously
const results = await Promise.all([
  ...pages.map((page) => generateImage(page.imagePrompt)),
  ...pages.map((page) => generateAudio(page.text)),
]);
```
3. Image Compression
All images are compressed before storage:
- Resize to max 800px width
- Convert to JPEG with 70% quality
- Reduces storage costs and improves load times
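The resize rule from the steps above can be pulled out into a small standalone helper, which makes the aspect-ratio math easy to verify in isolation (this is a sketch of the same logic used in the PDF compression function, not extra code the app needs):

```typescript
// Cap width at 800px while preserving aspect ratio
function fitToMaxWidth(width: number, height: number, maxWidth = 800) {
  if (width <= maxWidth) return { width, height };
  return { width: maxWidth, height: Math.round((height * maxWidth) / width) };
}

console.log(fitToMaxWidth(1600, 1200)); // { width: 800, height: 600 }
console.log(fitToMaxWidth(400, 300));   // unchanged: { width: 400, height: 300 }
```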
Lessons Learned
1. Interleaved Output is Powerful
The difference between sequential and interleaved generation is night and day. Users see progress immediately, and the overall experience feels much more "magical."
2. Model Selection Matters
Using the right model for each task improved both performance and cost:
- gemini-2.5-flash-image for interleaved story generation
- gemini-2.5-flash (text-only) for quiz generation (faster, cheaper)
- gemini-2.5-flash-preview-tts for audio narration
3. Error Handling is Critical
With multiple API calls and streaming, things can go wrong. Robust error handling and retry logic are essential for production apps.
4. User Experience First
Technical capabilities mean nothing if the UX isn't great. I spent significant time on:
- Smooth animations with Framer Motion
- Clear progress indicators
- Voice input that "just works"
- Child-friendly interface design
The Results
Here's what the app can do:
✅ Voice or text input for story ideas
✅ Interleaved generation of text and images
✅ Character consistency with photo uploads
✅ Three narrator voices with distinct personalities
✅ Interactive quizzes with voice input
✅ PDF export with compressed images
✅ Story library with Firebase Firestore
✅ Google authentication for user accounts
✅ Deployed on Cloud Run with auto-scaling
Try It Yourself
The project is live at ai.kidstory.app, and the source is on GitHub: https://github.com/ajitonelsonn/torybook-for-Kids
To run it locally:
```bash
# Clone the repository
git clone https://github.com/ajitonelsonn/torybook-for-Kids
cd storybook-for-kids-app

# Install dependencies
npm install

# Set up environment variables
cp .env.example .env.local
# Edit .env.local with your credentials

# Run development server
npm run dev
```
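Based on the environment variables passed in the deploy script earlier, .env.local will need at least the following values. This is a minimal sketch with placeholder values; the repository's .env.example is the authoritative list:

```shell
# .env.local — minimal sketch; replace with your own project values
GOOGLE_CLOUD_PROJECT=storybook-for-kids
GCS_BUCKET_NAME=storybook-for-kids-media
```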
What's Next?
I'm excited to continue developing this project. Future features I'm considering:
- Video generation for animated stories
- Multi-language support for global accessibility
- Collaborative stories where multiple kids contribute
- Story templates for different genres
- Parent dashboard to track reading progress
Conclusion
Building KidStory: Storybook for Kids taught me so much about:
- Gemini's powerful interleaved output capabilities
- Orchestrating multiple AI models effectively
- Creating delightful user experiences with AI
- Deploying production-ready apps on Google Cloud
The Gemini Live Agent Challenge pushed me to think beyond simple text-in/text-out interactions and create something truly multimodal. The result is an app that makes reading magical for kids and demonstrates the incredible potential of AI in education.
If you're interested in building with Gemini, I highly recommend exploring the interleaved output capabilities. It's a game-changer for creating rich, multimedia experiences.
This project was created for the Gemini Live Agent Challenge.
Tech Stack: Next.js, React, TypeScript, Gemini AI, Vertex AI, Cloud Run, Firestore, Cloud Storage, Firebase Auth
Category: Creative Storyteller - Multimodal Storytelling with Interleaved Output
GitHub: https://github.com/ajitonelsonn/torybook-for-Kids
Live Demo: https://ai.kidstory.app
Video Demo:
Questions?
Feel free to reach out or open an issue on GitHub. I'd love to hear your thoughts and answer any questions about building with Gemini's interleaved output!
Made with ❤️ for children everywhere












