Ajito Nelson Lucio da Costa

Building an AI-Powered Storybook with Gemini's Interleaved Output

Created for the Gemini Live Agent Challenge


The Problem: Children Are Losing the Habit of Reading

Children worldwide are spending more time on social media than ever before — and it's replacing reading, creative play, and healthy development.

  • U.S. teens spend an average of 4.8 hours per day on social media (Gallup, 2023)
  • Kids aged 4-18 spend an average of 112 minutes daily on TikTok — 60% more than YouTube (Qustodio via TechCrunch, 2024)
  • ASEAN school-aged children (ages 6-14) spend 2.77 hours per day on screens, exceeding the recommended 2-hour limit (HealthcareAsia, 2024)
  • Over 90 minutes of daily screen time is linked to below-average performance in communication, writing, and numeracy for children ages 2-8 (Centre for Social Justice, 2024)

Children are drawn to screens because the content is engaging — colorful, animated, interactive. But most of that content is passive consumption. There is a lack of interactive, creative, educational digital experiences that match the engagement level of social media while actually benefiting children's development.

As a developer from Timor-Leste — where over 72% of the population is under 35 but children's book access remains extremely limited — I wanted to change that. What if a child could simply speak their idea and watch it transform into a fully illustrated, narrated storybook?

That's exactly what I built with KidStory: Storybook for Kids — an AI-powered platform that transforms screen time from passive consumption into active creation.

For Parents

Register an account, then create stories for your children to read and enjoy together.

For Kids

Let your child speak their own story idea and watch the AI bring it to life with pictures and narration.

Screenshot of the app homepage with logo and "Create Your Story" button


The Challenge: Creative Storyteller Category

I built this project for the Gemini Live Agent Challenge in the Creative Storyteller category, which focuses on multimodal storytelling with interleaved output. The challenge was to create an agent that thinks like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream.

The key requirement? Use Gemini's native interleaved output capabilities: not just generating text and then images separately, but creating them together in one cohesive stream.


The Magic: Interleaved Output in Action

What is Interleaved Output?

Traditional AI workflows generate content sequentially:

  1. Generate story text
  2. Wait for completion
  3. Generate images based on text
  4. Wait for completion
  5. Generate audio

This creates a disjointed experience with long wait times.
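To make the contrast concrete, here is a toy simulation of both pipelines. This is not the real API: the timings, chunk contents, and function names are invented purely to illustrate why an interleaved stream lets the UI react per chunk instead of per phase.

```typescript
// Toy simulation only: invented timings and chunk contents.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Sequential pipeline: nothing is shown until each whole phase finishes.
async function sequential(): Promise<string[]> {
  const events: string[] = [];
  await sleep(30); events.push("text done");   // steps 1-2
  await sleep(30); events.push("images done"); // steps 3-4
  await sleep(30); events.push("audio done");  // step 5
  return events;
}

// Interleaved pipeline: one stream yields text and image chunks together,
// so the UI can render each chunk the moment it arrives.
async function* interleavedStream() {
  yield { type: "text", data: "Once upon a time..." };
  await sleep(10);
  yield { type: "image", data: "<page 1 illustration>" };
  yield { type: "text", data: "...the dragon smiled." };
  await sleep(10);
  yield { type: "image", data: "<page 2 illustration>" };
}

async function interleaved(): Promise<string[]> {
  const rendered: string[] = [];
  for await (const chunk of interleavedStream()) {
    rendered.push(chunk.type); // render immediately, per chunk
  }
  return rendered;
}
```

With the sequential version, the user stares at a spinner through three full phases; with the interleaved version, the first words and the first illustration appear almost at once.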

Interleaved output changes everything. With Gemini's gemini-2.5-flash-image model, I can request both text AND images in a single API call, and they stream back together:

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: prompt,
  config: {
    responseModalities: ["TEXT", "IMAGE"], // The magic happens here!
  },
});

  • Traditional: 3 separate API calls, long wait times, a disjointed experience
  • Interleaved: 1 API call, streaming updates, a magical user experience

Diagram showing traditional sequential generation vs interleaved generation

The User Experience

When a parent or child creates a story, here's what happens:

  1. Voice or Text Input: The child speaks or types their story idea
  2. Real-time Generation: Story text and illustrations generate together
  3. Magical Painting Effect: Each page's illustration appears as the story is being written
  4. Parallel Audio: While images stream, audio narration generates in parallel
  5. Complete Storybook: Within minutes, a fully illustrated, narrated storybook is ready

Screenshot of story generation in progress


The Architecture: Orchestrating Multiple Gemini Models

One of the most interesting aspects of this project was learning to orchestrate multiple Gemini models, each specialized for different tasks:

1. The Creative Director: gemini-2.5-flash-image

This model handles the core story generation with interleaved output:

// In /app/api/generate-story/route.ts
const result = await ai.models.generateContentStream({
  model: "gemini-2.5-flash-image",
  contents: [{ role: "user", parts }],
  config: {
    responseModalities: ["TEXT", "IMAGE"],
    responseMimeType: "application/json",
  },
});

// Stream back to client with Server-Sent Events
// (@google/genai returns an async iterable; iterate the result directly)
for await (const chunk of result) {
  // Send story JSON chunks
  // Send image data as it arrives
}

Code snippet showing the Vertex AI setup and interleaved generation configuration

2. The Voice: gemini-2.5-flash-preview-tts

For narration, I use Gemini's text-to-speech model with three distinct narrator personalities:

// In /lib/gcp/tts.ts
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [
    {
      role: "user",
      parts: [{ text: pageText }],
    },
  ],
  config: {
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: selectedVoice } },
    },
  },
});

Kids can choose from:

  • Luna: Warm and gentle
  • Stella: Energetic and playful
  • Kiko: Calm and soothing
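Under the hood, those persona names have to resolve to a `prebuiltVoiceConfig` voice for the TTS call shown above. A minimal sketch of that mapping, with hypothetical assignments: "Kore", "Puck", and "Zephyr" are real Gemini TTS prebuilt voice names, but which one backs each persona is my guess, not taken from the app.

```typescript
// Hypothetical persona-to-voice mapping. The persona names come from the
// app; the prebuilt voice assignments are illustrative assumptions.
const NARRATOR_VOICES: Record<string, string> = {
  luna: "Kore",    // warm and gentle
  stella: "Puck",  // energetic and playful
  kiko: "Zephyr",  // calm and soothing
};

function resolveVoice(narrator: string): string {
  // Fall back to a default voice for unknown narrator names.
  return NARRATOR_VOICES[narrator.toLowerCase()] ?? "Kore";
}
```

The resolved name is what gets passed as `voiceName` in the `speechConfig` above.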

Screenshot showing narrator voice selection

3. The Quiz Master: gemini-2.5-flash

After reading, kids can take an interactive quiz. I optimized this by using the text-only model for quiz generation (faster and more cost-effective), then adding TTS for audio:

// In /app/api/live-quiz/route.ts
const quizResponse = await ai.models.generateContent({
  model: "gemini-2.5-flash", // Text-only for speed
  contents: quizPrompt,
  config: {
    responseMimeType: "application/json",
  },
});

// Then generate audio separately
const audioResponse = await generateAudio(question.text);

Screenshot of the Magic Quiz interface with voice input


Technical Deep Dive

The Stack

  • Frontend: Next.js 16 with React 19 and TypeScript
  • Styling: Tailwind CSS 4
  • AI Models: Google Gemini via Vertex AI
  • Database: Firebase Firestore
  • Storage: Google Cloud Storage
  • Hosting: Google Cloud Run
  • Authentication: Firebase Auth

KidStory system architecture showing all components

Key Technical Challenges & Solutions

1. Streaming Interleaved Content

The biggest challenge was handling the interleaved stream of text and images. Gemini returns them mixed together, so I needed to:

// Parse the stream and separate text from images
for await (const chunk of result) {
  const parts = chunk.candidates?.[0]?.content?.parts || [];

  for (const part of parts) {
    if (part.text) {
      // Handle story JSON text
      storyBuffer += part.text;
    } else if (part.inlineData) {
      // Handle image data
      const imageData = part.inlineData.data;
      // Upload to Cloud Storage
      // Send URL to client
    }
  }
}
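The `inlineData` branch in the loop above elides the actual upload. Here is a small, testable helper for the step right before it, decoding the base64 chunk and choosing an object path; the path naming scheme is my assumption, not taken from the app.

```typescript
// Sketch: turn a streamed inlineData part into bytes ready for upload.
// The "stories/page-N.ext" naming scheme is a hypothetical example.
interface InlinePart {
  inlineData: { mimeType: string; data: string };
}

function toUploadable(part: InlinePart, pageNumber: number) {
  // inlineData.data is base64-encoded image bytes
  const bytes = Buffer.from(part.inlineData.data, "base64");
  // Derive a file extension from the MIME type, e.g. "image/png" -> "png"
  const ext = part.inlineData.mimeType.split("/")[1] ?? "png";
  const objectPath = `stories/page-${pageNumber}.${ext}`;
  return { objectPath, bytes };
}
```

The returned bytes would then be handed to a Cloud Storage client's upload call, and the resulting public URL streamed to the client over SSE.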

2. Character Consistency

Kids can upload photos of themselves or loved ones to appear in the story. To maintain consistency across all pages:

// Compress and include reference images
const referenceImages = await Promise.all(
  characterPhotos.map(async (photo) => {
    const compressed = await compressImage(photo);
    return {
      inlineData: {
        mimeType: "image/jpeg",
        data: compressed,
      },
    };
  }),
);

// Include in prompt
const prompt = {
  role: "user",
  parts: [
    { text: storyPrompt },
    ...referenceImages, // Gemini uses these for consistency
  ],
};

3. Real-time Progress Updates

To create the "magical painting" effect, I used Server-Sent Events (SSE) to stream updates to the client:

// Server: Send updates as they arrive
const encoder = new TextEncoder();
const stream = new ReadableStream({
  async start(controller) {
    // Send story chunks
    controller.enqueue(encoder.encode(`data: ${JSON.stringify(chunk)}\n\n`));

    // Send image URLs as they're uploaded
    controller.enqueue(
      encoder.encode(`data: ${JSON.stringify(imageUpdate)}\n\n`),
    );
  },
});

return new Response(stream, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  },
});

// Client: React hook to handle streaming
// (EventSource is created once in useEffect and closed on unmount)
import { useEffect, useState } from "react";

const useStoryGenerator = () => {
  const [progress, setProgress] = useState<Record<string, string>>({});

  useEffect(() => {
    const eventSource = new EventSource("/api/generate-story");

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);

      if (data.type === "chunk") {
        // Update story text
      } else if (data.type === "image") {
        // Show image for specific page
        setProgress((prev) => ({
          ...prev,
          [`page_${data.pageNumber}`]: "complete",
        }));
      }
    };

    // Close the stream when the component unmounts
    return () => eventSource.close();
  }, []);

  return progress;
};

4. Voice Input for Kids

Using the Web Speech API, kids can speak their story ideas:

// Custom hook for voice recognition (Web Speech API)
import { useRef, useState } from "react";

const useVoiceInput = () => {
  const [transcript, setTranscript] = useState("");
  const recognitionRef = useRef<any>(null);

  const startListening = () => {
    // Chrome still ships this API behind the webkit prefix
    const SpeechRecognition =
      (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
    const recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;

    recognition.onresult = (event: any) => {
      const transcript = Array.from(event.results)
        .map((result: any) => result[0].transcript)
        .join("");
      setTranscript(transcript);
    };

    recognition.start();
    recognitionRef.current = recognition;
  };

  const stopListening = () => recognitionRef.current?.stop();

  return { startListening, stopListening, transcript };
};

5. PDF Generation with Compression

Stories can be downloaded as PDFs. To keep file sizes manageable, I implemented image compression:

// Compress images before adding to PDF
async function compressImage(dataUrl: string): Promise<string> {
  return new Promise((resolve) => {
    const img = new Image();
    img.onload = () => {
      const canvas = document.createElement("canvas");
      const maxWidth = 800;

      let width = img.width;
      let height = img.height;

      if (width > maxWidth) {
        height = (height * maxWidth) / width;
        width = maxWidth;
      }

      canvas.width = width;
      canvas.height = height;

      const ctx = canvas.getContext("2d")!; // 2d context is always available
      ctx.drawImage(img, 0, 0, width, height);

      // Compress as JPEG with 70% quality
      const compressed = canvas.toDataURL("image/jpeg", 0.7);
      resolve(compressed);
    };
    img.src = dataUrl;
  });
}

This reduced PDF sizes by 70-85% while maintaining good visual quality.

Screenshot of PDF download with sample pages


Deployment on Google Cloud

The entire application runs on Google Cloud Platform:

Cloud Run Deployment

I created a simple deployment script that handles everything:

#!/bin/bash
# deploy.sh

gcloud run deploy storybook-for-kids \
  --source=. \
  --region=us-central1 \
  --platform=managed \
  --allow-unauthenticated \
  --min-instances=1 \
  --set-env-vars=GOOGLE_CLOUD_PROJECT=storybook-for-kids \
  --set-env-vars=GCS_BUCKET_NAME=storybook-for-kids-media \
  # ... other environment variables

Screenshot of the Cloud Run console showing the deployed service

IAM Permissions

One challenge was setting up the right permissions. I automated this with a script:

#!/bin/bash
# fix_permissions.sh

# Grant Vertex AI access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/aiplatform.user"

# Grant Firestore access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/datastore.user"

# Grant Cloud Storage access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/storage.objectAdmin"

Deployment architecture diagram


Performance Optimizations

1. Retry Logic for Rate Limiting

Gemini API can hit rate limits during peak usage. I implemented exponential backoff:

async function generateWithRetry(prompt: any, maxRetries = 8) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await ai.models.generateContent(prompt);
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay + jitter));
        continue;
      }
      throw error;
    }
  }
}

2. Parallel Processing

Images and audio generate in parallel to reduce total wait time:

// Generate all images and audio simultaneously
const results = await Promise.all([
  ...pages.map((page) => generateImage(page.imagePrompt)),
  ...pages.map((page) => generateAudio(page.text)),
]);

3. Image Compression

All images are compressed before storage:

  • Resize to max 800px width
  • Convert to JPEG with 70% quality
  • Reduces storage costs and improves load times
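The resize rule in that list is plain aspect-ratio arithmetic; here it is as a pure helper, mirroring the browser-canvas version shown earlier:

```typescript
// Cap width at maxWidth and scale height to preserve the aspect ratio.
// Images already narrower than maxWidth are left untouched.
function fitToMaxWidth(width: number, height: number, maxWidth = 800) {
  if (width <= maxWidth) return { width, height };
  return { width: maxWidth, height: Math.round((height * maxWidth) / width) };
}
```

For example, a 1600x1200 illustration becomes 800x600 before JPEG encoding, which is where most of the size reduction comes from.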

Lessons Learned

1. Interleaved Output is Powerful

The difference between sequential and interleaved generation is night and day. Users see progress immediately, and the overall experience feels much more "magical."

2. Model Selection Matters

Using the right model for each task improved both performance and cost:

  • gemini-2.5-flash-image for interleaved story generation
  • gemini-2.5-flash (text-only) for quiz generation (faster, cheaper)
  • gemini-2.5-flash-preview-tts for audio narration

3. Error Handling is Critical

With multiple API calls and streaming, things can go wrong. Robust error handling and retry logic are essential for production apps.

4. User Experience First

Technical capabilities mean nothing if the UX isn't great. I spent significant time on:

  • Smooth animations with Framer Motion
  • Clear progress indicators
  • Voice input that "just works"
  • Child-friendly interface design

The Results

Here's what the app can do:

Voice or text input for story ideas

Interleaved generation of text and images

Character consistency with photo uploads

Three narrator voices with distinct personalities

Interactive quizzes with voice input

PDF export with compressed images

Story library with Firebase Firestore

Google authentication for user accounts

Deployed on Cloud Run with auto-scaling


Try It Yourself

The project is live at ai.kidstory.app, and the source is on GitHub: https://github.com/ajitonelsonn/torybook-for-Kids

To run it locally:

# Clone the repository
git clone https://github.com/ajitonelsonn/torybook-for-Kids
cd storybook-for-kids-app

# Install dependencies
npm install

# Set up environment variables
cp .env.example .env.local
# Edit .env.local with your credentials

# Run development server
npm run dev

What's Next?

I'm excited to continue developing this project. Future features I'm considering:

  • Video generation for animated stories
  • Multi-language support for global accessibility
  • Collaborative stories where multiple kids contribute
  • Story templates for different genres
  • Parent dashboard to track reading progress

Conclusion

Building KidStory: Storybook for Kids taught me so much about:

  • Gemini's powerful interleaved output capabilities
  • Orchestrating multiple AI models effectively
  • Creating delightful user experiences with AI
  • Deploying production-ready apps on Google Cloud

The Gemini Live Agent Challenge pushed me to think beyond simple text-in/text-out interactions and create something truly multimodal. The result is an app that makes reading magical for kids and demonstrates the incredible potential of AI in education.

If you're interested in building with Gemini, I highly recommend exploring the interleaved output capabilities. It's a game-changer for creating rich, multimedia experiences.


This project was created for the Gemini Live Agent Challenge.

#GeminiLiveAgentChallenge

Tech Stack: Next.js, React, TypeScript, Gemini AI, Vertex AI, Cloud Run, Firestore, Cloud Storage, Firebase Auth

Category: Creative Storyteller - Multimodal Storytelling with Interleaved Output

GitHub: https://github.com/ajitonelsonn/torybook-for-Kids
Live Demo: https://ai.kidstory.app

Video Demo:


Questions?

Feel free to reach out or open an issue on GitHub. I'd love to hear your thoughts and answer any questions about building with Gemini's interleaved output!


Made with ❤️ for children everywhere
