Ajito Nelson Lucio da Costa

Building an AI-Powered Storybook with Gemini's Interleaved Output

Created for the Gemini Live Agent Challenge


The Problem: Children Are Losing the Habit of Reading

Children worldwide are spending more time on social media than ever before — and it's replacing reading, creative play, and healthy development.

  • U.S. teens spend an average of 4.8 hours per day on social media (Gallup, 2023)
  • Kids aged 4-18 spend an average of 112 minutes daily on TikTok — 60% more than YouTube (Qustodio via TechCrunch, 2024)
  • ASEAN school-aged children (ages 6-14) spend 2.77 hours per day on screens, exceeding the recommended 2-hour limit (HealthcareAsia, 2024)
  • Over 90 minutes of daily screen time is linked to below-average performance in communication, writing, and numeracy for children ages 2-8 (Centre for Social Justice, 2024)

Children are drawn to screens because the content is engaging — colorful, animated, interactive. But most of that content is passive consumption. There is a lack of interactive, creative, educational digital experiences that match the engagement level of social media while actually benefiting children's development.

As a developer from Timor-Leste — where over 72% of the population is under 35 but children's book access remains extremely limited — I wanted to change that. What if a child could simply speak their idea and watch it transform into a fully illustrated, narrated storybook?

That's exactly what I built with KidStory: Storybook for Kids — an AI-powered platform that transforms screen time from passive consumption into active creation.

For Parents

Register an account, then create stories for your children to read and enjoy together.

For Kids

Let your child speak their own story idea and watch the AI bring it to life with pictures and narration.

Screenshot of the app homepage with logo and "Create Your Story" button


The Challenge: Creative Storyteller Category

I built this project for the Gemini Live Agent Challenge in the Creative Storyteller category, which focuses on multimodal storytelling with interleaved output. The challenge was to create an agent that thinks like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream.

The key requirement? Use Gemini's native interleaved output capabilities: not just generating text and then images separately, but creating them together in one cohesive stream.


The Magic: Interleaved Output in Action

What is Interleaved Output?

Traditional AI workflows generate content sequentially:

  1. Generate story text
  2. Wait for completion
  3. Generate images based on text
  4. Wait for completion
  5. Generate audio

This creates a disjointed experience with long wait times.
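To make the contrast concrete, here is a toy simulation of both pipelines. This is not the real API: the timings, chunk contents, and function names are invented purely to illustrate why an interleaved stream lets the UI react per chunk instead of per phase.

```typescript
// Toy simulation only: invented timings and chunk contents.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Sequential pipeline: nothing is shown until each whole phase finishes.
async function sequential(): Promise<string[]> {
  const events: string[] = [];
  await sleep(30); events.push("text done");   // steps 1-2
  await sleep(30); events.push("images done"); // steps 3-4
  await sleep(30); events.push("audio done");  // step 5
  return events;
}

// Interleaved pipeline: one stream yields text and image chunks together,
// so the UI can render each chunk the moment it arrives.
async function* interleavedStream() {
  yield { type: "text", data: "Once upon a time..." };
  await sleep(10);
  yield { type: "image", data: "<page 1 illustration>" };
  yield { type: "text", data: "...the dragon smiled." };
  await sleep(10);
  yield { type: "image", data: "<page 2 illustration>" };
}

async function interleaved(): Promise<string[]> {
  const rendered: string[] = [];
  for await (const chunk of interleavedStream()) {
    rendered.push(chunk.type); // render immediately, per chunk
  }
  return rendered;
}
```

With the sequential version, the user stares at a spinner through three full phases; with the interleaved version, the first words and the first illustration appear almost at once.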

Interleaved output changes everything. With Gemini's gemini-2.5-flash-image model, I can request both text AND images in a single API call, and they stream back together:

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: prompt,
  config: {
    responseModalities: ["TEXT", "IMAGE"], // The magic happens here!
  },
});

  • Traditional: 3 separate API calls, long wait times, a disjointed experience
  • Interleaved: 1 API call, streaming updates, a magical user experience

Diagram showing traditional sequential generation vs interleaved generation

The User Experience

When a parent or child creates a story, here's what happens:

  1. Voice or Text Input: The child speaks or types their story idea
  2. Real-time Generation: Story text and illustrations generate together
  3. Magical Painting Effect: Each page's illustration appears as the story is being written
  4. Parallel Audio: While images stream, audio narration generates in parallel
  5. Complete Storybook: Within minutes, a fully illustrated, narrated storybook is ready

Screenshot of story generation in progress


The Architecture: Orchestrating Multiple Gemini Models

One of the most interesting aspects of this project was learning to orchestrate multiple Gemini models, each specialized for different tasks:

1. The Creative Director: gemini-2.5-flash-image

This model handles the core story generation with interleaved output:

// In /app/api/generate-story/route.ts
const result = await ai.models.generateContentStream({
  model: "gemini-2.5-flash-image",
  contents: [{ role: "user", parts }],
  config: {
    responseModalities: ["TEXT", "IMAGE"],
    responseMimeType: "application/json",
  },
});

// Stream back to client with Server-Sent Events
// (@google/genai returns an async iterable; iterate the result directly)
for await (const chunk of result) {
  // Send story JSON chunks
  // Send image data as it arrives
}

Code snippet showing the Vertex AI setup and interleaved generation configuration

2. The Voice: gemini-2.5-flash-preview-tts

For narration, I use Gemini's text-to-speech model with three distinct narrator personalities:

// In /lib/gcp/tts.ts
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [
    {
      role: "user",
      parts: [{ text: pageText }],
    },
  ],
  config: {
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: selectedVoice } },
    },
  },
});

Kids can choose from:

  • Luna: Warm and gentle
  • Stella: Energetic and playful
  • Kiko: Calm and soothing
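Under the hood, those persona names have to resolve to a `prebuiltVoiceConfig` voice for the TTS call shown above. A minimal sketch of that mapping, with hypothetical assignments: "Kore", "Puck", and "Zephyr" are real Gemini TTS prebuilt voice names, but which one backs each persona is my guess, not taken from the app.

```typescript
// Hypothetical persona-to-voice mapping. The persona names come from the
// app; the prebuilt voice assignments are illustrative assumptions.
const NARRATOR_VOICES: Record<string, string> = {
  luna: "Kore",    // warm and gentle
  stella: "Puck",  // energetic and playful
  kiko: "Zephyr",  // calm and soothing
};

function resolveVoice(narrator: string): string {
  // Fall back to a default voice for unknown narrator names.
  return NARRATOR_VOICES[narrator.toLowerCase()] ?? "Kore";
}
```

The resolved name is what gets passed as `voiceName` in the `speechConfig` above.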

Screenshot showing narrator voice selection

3. The Quiz Master: gemini-2.5-flash

After reading, kids can take an interactive quiz. I optimized this by using the text-only model for quiz generation (faster and more cost-effective), then adding TTS for audio:

// In /app/api/live-quiz/route.ts
const quizResponse = await ai.models.generateContent({
  model: "gemini-2.5-flash", // Text-only for speed
  contents: quizPrompt,
  config: {
    responseMimeType: "application/json",
  },
});

// Then generate audio separately
const audioResponse = await generateAudio(question.text);

Screenshot of the Magic Quiz interface with voice input


Technical Deep Dive

The Stack

  • Frontend: Next.js 16 with React 19 and TypeScript
  • Styling: Tailwind CSS 4
  • AI Models: Google Gemini via Vertex AI
  • Database: Firebase Firestore
  • Storage: Google Cloud Storage
  • Hosting: Google Cloud Run
  • Authentication: Firebase Auth

KidStory system architecture showing all components

Key Technical Challenges & Solutions

1. Streaming Interleaved Content

The biggest challenge was handling the interleaved stream of text and images. Gemini returns them mixed together, so I needed to:

// Parse the stream and separate text from images
for await (const chunk of result) {
  const parts = chunk.candidates?.[0]?.content?.parts || [];

  for (const part of parts) {
    if (part.text) {
      // Handle story JSON text
      storyBuffer += part.text;
    } else if (part.inlineData) {
      // Handle image data
      const imageData = part.inlineData.data;
      // Upload to Cloud Storage
      // Send URL to client
    }
  }
}
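The `inlineData` branch in the loop above elides the actual upload. Here is a small, testable helper for the step right before it, decoding the base64 chunk and choosing an object path; the path naming scheme is my assumption, not taken from the app.

```typescript
// Sketch: turn a streamed inlineData part into bytes ready for upload.
// The "stories/page-N.ext" naming scheme is a hypothetical example.
interface InlinePart {
  inlineData: { mimeType: string; data: string };
}

function toUploadable(part: InlinePart, pageNumber: number) {
  // inlineData.data is base64-encoded image bytes
  const bytes = Buffer.from(part.inlineData.data, "base64");
  // Derive a file extension from the MIME type, e.g. "image/png" -> "png"
  const ext = part.inlineData.mimeType.split("/")[1] ?? "png";
  const objectPath = `stories/page-${pageNumber}.${ext}`;
  return { objectPath, bytes };
}
```

The returned bytes would then be handed to a Cloud Storage client's upload call, and the resulting public URL streamed to the client over SSE.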

2. Character Consistency

Kids can upload photos of themselves or loved ones to appear in the story. To maintain consistency across all pages:

// Compress and include reference images
const referenceImages = await Promise.all(
  characterPhotos.map(async (photo) => {
    const compressed = await compressImage(photo);
    return {
      inlineData: {
        mimeType: "image/jpeg",
        data: compressed,
      },
    };
  }),
);

// Include in prompt
const prompt = {
  role: "user",
  parts: [
    { text: storyPrompt },
    ...referenceImages, // Gemini uses these for consistency
  ],
};

3. Real-time Progress Updates

To create the "magical painting" effect, I used Server-Sent Events (SSE) to stream updates to the client:

// Server: Send updates as they arrive
const encoder = new TextEncoder();
const stream = new ReadableStream({
  async start(controller) {
    // Send story chunks
    controller.enqueue(encoder.encode(`data: ${JSON.stringify(chunk)}\n\n`));

    // Send image URLs as they're uploaded
    controller.enqueue(
      encoder.encode(`data: ${JSON.stringify(imageUpdate)}\n\n`),
    );
  },
});

return new Response(stream, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  },
});

// Client: React hook to handle streaming
// (EventSource is created once in useEffect and closed on unmount)
import { useEffect, useState } from "react";

const useStoryGenerator = () => {
  const [progress, setProgress] = useState<Record<string, string>>({});

  useEffect(() => {
    const eventSource = new EventSource("/api/generate-story");

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);

      if (data.type === "chunk") {
        // Update story text
      } else if (data.type === "image") {
        // Show image for specific page
        setProgress((prev) => ({
          ...prev,
          [`page_${data.pageNumber}`]: "complete",
        }));
      }
    };

    // Close the stream when the component unmounts
    return () => eventSource.close();
  }, []);

  return progress;
};

4. Voice Input for Kids

Using the Web Speech API, kids can speak their story ideas:

// Custom hook for voice recognition (Web Speech API)
import { useRef, useState } from "react";

const useVoiceInput = () => {
  const [transcript, setTranscript] = useState("");
  const recognitionRef = useRef<any>(null);

  const startListening = () => {
    // Chrome still ships this API behind the webkit prefix
    const SpeechRecognition =
      (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
    const recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;

    recognition.onresult = (event: any) => {
      const transcript = Array.from(event.results)
        .map((result: any) => result[0].transcript)
        .join("");
      setTranscript(transcript);
    };

    recognition.start();
    recognitionRef.current = recognition;
  };

  const stopListening = () => recognitionRef.current?.stop();

  return { startListening, stopListening, transcript };
};

5. PDF Generation with Compression

Stories can be downloaded as PDFs. To keep file sizes manageable, I implemented image compression:

// Compress images before adding to PDF
async function compressImage(dataUrl: string): Promise<string> {
  return new Promise((resolve) => {
    const img = new Image();
    img.onload = () => {
      const canvas = document.createElement("canvas");
      const maxWidth = 800;

      let width = img.width;
      let height = img.height;

      if (width > maxWidth) {
        height = (height * maxWidth) / width;
        width = maxWidth;
      }

      canvas.width = width;
      canvas.height = height;

      const ctx = canvas.getContext("2d")!; // 2d context is always available
      ctx.drawImage(img, 0, 0, width, height);

      // Compress as JPEG with 70% quality
      const compressed = canvas.toDataURL("image/jpeg", 0.7);
      resolve(compressed);
    };
    img.src = dataUrl;
  });
}

This reduced PDF sizes by 70-85% while maintaining good visual quality.

Screenshot of PDF download with sample pages


Deployment on Google Cloud

The entire application runs on Google Cloud Platform:

Cloud Run Deployment

I created a simple deployment script that handles everything:

#!/bin/bash
# deploy.sh

gcloud run deploy storybook-for-kids \
  --source=. \
  --region=us-central1 \
  --platform=managed \
  --allow-unauthenticated \
  --min-instances=1 \
  --set-env-vars=GOOGLE_CLOUD_PROJECT=storybook-for-kids \
  --set-env-vars=GCS_BUCKET_NAME=storybook-for-kids-media \
  # ... other environment variables

Screenshot of the Cloud Run console showing the deployed service

IAM Permissions

One challenge was setting up the right permissions. I automated this with a script:

#!/bin/bash
# fix_permissions.sh

# Grant Vertex AI access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/aiplatform.user"

# Grant Firestore access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/datastore.user"

# Grant Cloud Storage access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${COMPUTE_SA}" \
  --role="roles/storage.objectAdmin"

Deployment architecture diagram


Performance Optimizations

1. Retry Logic for Rate Limiting

Gemini API can hit rate limits during peak usage. I implemented exponential backoff:

async function generateWithRetry(prompt: any, maxRetries = 8) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await ai.models.generateContent(prompt);
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay + jitter));
        continue;
      }
      throw error;
    }
  }
}

2. Parallel Processing

Images and audio generate in parallel to reduce total wait time:

// Generate all images and audio simultaneously
const results = await Promise.all([
  ...pages.map((page) => generateImage(page.imagePrompt)),
  ...pages.map((page) => generateAudio(page.text)),
]);

3. Image Compression

All images are compressed before storage:

  • Resize to max 800px width
  • Convert to JPEG with 70% quality
  • Reduces storage costs and improves load times
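The resize rule in that list is plain aspect-ratio arithmetic; here it is as a pure helper, mirroring the browser-canvas version shown earlier:

```typescript
// Cap width at maxWidth and scale height to preserve the aspect ratio.
// Images already narrower than maxWidth are left untouched.
function fitToMaxWidth(width: number, height: number, maxWidth = 800) {
  if (width <= maxWidth) return { width, height };
  return { width: maxWidth, height: Math.round((height * maxWidth) / width) };
}
```

For example, a 1600x1200 illustration becomes 800x600 before JPEG encoding, which is where most of the size reduction comes from.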

Lessons Learned

1. Interleaved Output is Powerful

The difference between sequential and interleaved generation is night and day. Users see progress immediately, and the overall experience feels much more "magical."

2. Model Selection Matters

Using the right model for each task improved both performance and cost:

  • gemini-2.5-flash-image for interleaved story generation
  • gemini-2.5-flash (text-only) for quiz generation (faster, cheaper)
  • gemini-2.5-flash-preview-tts for audio narration

3. Error Handling is Critical

With multiple API calls and streaming, things can go wrong. Robust error handling and retry logic are essential for production apps.

4. User Experience First

Technical capabilities mean nothing if the UX isn't great. I spent significant time on:

  • Smooth animations with Framer Motion
  • Clear progress indicators
  • Voice input that "just works"
  • Child-friendly interface design

The Results

Here's what the app can do:

Voice or text input for story ideas

Interleaved generation of text and images

Character consistency with photo uploads

Three narrator voices with distinct personalities

Interactive quizzes with voice input

PDF export with compressed images

Story library with Firebase Firestore

Google authentication for user accounts

Deployed on Cloud Run with auto-scaling


Try It Yourself

The project is live at ai.kidstory.app, and the source is on GitHub: https://github.com/ajitonelsonn/torybook-for-Kids

To run it locally:

# Clone the repository
git clone https://github.com/ajitonelsonn/torybook-for-Kids
cd storybook-for-kids-app

# Install dependencies
npm install

# Set up environment variables
cp .env.example .env.local
# Edit .env.local with your credentials

# Run development server
npm run dev

What's Next?

I'm excited to continue developing this project. Future features I'm considering:

  • Video generation for animated stories
  • Multi-language support for global accessibility
  • Collaborative stories where multiple kids contribute
  • Story templates for different genres
  • Parent dashboard to track reading progress

Conclusion

Building KidStory: Storybook for Kids taught me so much about:

  • Gemini's powerful interleaved output capabilities
  • Orchestrating multiple AI models effectively
  • Creating delightful user experiences with AI
  • Deploying production-ready apps on Google Cloud

The Gemini Live Agent Challenge pushed me to think beyond simple text-in/text-out interactions and create something truly multimodal. The result is an app that makes reading magical for kids and demonstrates the incredible potential of AI in education.

If you're interested in building with Gemini, I highly recommend exploring the interleaved output capabilities. It's a game-changer for creating rich, multimedia experiences.


This project was created for the Gemini Live Agent Challenge.

#GeminiLiveAgentChallenge

Tech Stack: Next.js, React, TypeScript, Gemini AI, Vertex AI, Cloud Run, Firestore, Cloud Storage, Firebase Auth

Category: Creative Storyteller - Multimodal Storytelling with Interleaved Output

GitHub: https://github.com/ajitonelsonn/torybook-for-Kids
Live Demo: https://ai.kidstory.app

Video Demo:


Questions?

Feel free to reach out or open an issue on GitHub. I'd love to hear your thoughts and answer any questions about building with Gemini's interleaved output!


Made with ❤️ for children everywhere
