How We Built Toon World: An AI-Powered Interactive Learning App for Kids Using Google Gemini and Google Cloud
A deep dive into building a real-time interleaved text, image, and voice educational experience — entirely on Google Cloud infrastructure
There is a moment in every good lesson when something clicks. Not because the teacher repeated themselves louder, but because they showed you something at exactly the right moment. The word on the page and the picture in your mind aligned, and suddenly the abstract became real.
That is the experience we set out to build for children aged four to eight. And the technology that made it possible — Google Gemini's interleaved multimodal output running on Vertex AI — turned out to be exactly the right tool for the job.
This is the story of how we built Toon World, an interactive educational app where original cartoon characters teach children subjects like counting, the alphabet, and the solar system through AI-generated lessons that weave text and illustrations together in real time. Every lesson is unique. Every character is original. Every image is created from scratch to match the exact words being spoken at that moment.
The Problem With How Kids Learn Online
Most digital educational content falls into one of two categories. Either it is a static webpage — text and pre-drawn images that never change — or it is a video, which a child watches passively without any real engagement. Both formats share a fundamental flaw: the content was made once, for everyone, and it never changes.
We wanted to build something different. Something where the content is made for this child, right now, and will never exist in exactly this form again. Something where the teacher is not reading from a script but genuinely inhabiting the lesson — showing you what they are describing the moment they describe it.
The question that drove the design of Toon World was simple: what if an AI model could be both the storyteller and the illustrator simultaneously?
The Core Insight: Interleaved Multimodal Generation
Most people think of AI image generation and AI text generation as separate systems. You write a prompt, get text. You write a different prompt, get an image. You stitch them together yourself.
Google Gemini's responseModalities feature breaks this assumption entirely.
By passing responseModalities: [Modality.TEXT, Modality.IMAGE] to the @google/genai SDK, you instruct Gemini to produce text and images from the same generation pass — the same forward pass through the same neural network, with full context of everything generated so far. The model does not generate the story and then separately commission illustrations. It thinks in both modalities simultaneously.
This is the architectural foundation of Toon World. When Luna the Star Fox explains that three apples plus two apples equals five apples, the illustration of Luna holding five apples appears because the model's internal state at that moment of generation contained the semantic context of what it had just written. The image is not a decoration — it is a direct visual expression of the model's reasoning at that exact point in the story.
The result is a lesson that feels coherent in a way that no stitched-together pipeline ever could, because it actually is coherent. The text and images share a single origin.
Architecture Overview
Before diving into each component, it helps to understand the shape of the system at a high level.
Toon World has three layers. The browser runs a React 18 application that handles the user interface and streams lesson content to the screen as it arrives. A Node.js server running on Google Cloud Run acts as the secure backend, using the official @google/genai SDK to communicate with Vertex AI. Vertex AI hosts the Gemini model that generates the interleaved lessons.
There is no database. Lesson content is generated fresh on every request and never stored. Characters and subjects are plain JavaScript data files bundled into the frontend at build time. The architecture is deliberately minimal — every component exists because it must, not because it was easy to add.
Authentication flows through Google Cloud IAM using Application Default Credentials. On Cloud Run, the service account attached to the container automatically authenticates with Vertex AI. No API keys live in the codebase. No secrets are managed manually.
Building the Frontend: React 18 with Vite
The frontend is a React 18 single-page application built with Vite, using React Router for client-side navigation. The user interface is built around two screens.
The Home Screen: The Sentence Builder
The home screen presents a simple fill-in-the-blank sentence: "I want to learn ___ with ___!"
Each blank is an interactive tile picker — a horizontally scrollable row of large emoji tiles, one for each subject or character. When a child taps a subject tile, the character picker automatically opens. When a character is chosen, a "Let's Go!" button slides up from the bottom of the screen.
The tile picker is built as a pair of reusable components. TilePicker.jsx manages the scrollable row and animation state. TileCard.jsx renders an individual tile with its emoji, label, accent colour, and selection state. The data for all tiles lives in two plain files: subjects.js and characters.js, each exporting a simple array of objects.
// client/src/data/subjects.js (excerpt)
const subjects = [
  {
    value: 'basic addition',
    label: 'Addition',
    emoji: '➕',
    accent: '#fb923c',
    bg: 'rgba(251,146,60,.15)',
  },
  // ...21 subjects total
]
This structure means adding a new subject is a single new object in a data file. No database migration. No API change. The entire content catalogue is static data.
The Lesson Screen: Real-Time Streaming
The lesson page, LessonPage.jsx, is where the interesting engineering lives. When it mounts, it immediately calls streamLesson() from gemini.js and begins rendering blocks as they arrive.
The streaming architecture works as follows. The browser sends a POST request to /api/stream on the Node server. The server opens a Server-Sent Events (SSE) connection back to the browser — a long-lived HTTP connection that the server can push data into at any time. As Gemini generates each part of the lesson, the server pushes it down this connection immediately. The browser reads each chunk using the Fetch API's ReadableStream, parses it, and calls onBlock() with the completed block. LessonPage adds that block to React state, which triggers a re-render, which shows the new paragraph or image on screen.
The effect is that the lesson writes itself in front of the child. A paragraph of text appears. Then an illustration materialises beneath it. Then the next paragraph. Then the next illustration. The entire sequence takes fifteen to thirty seconds from click to complete lesson, and the child watches every piece of it arrive.
// client/src/pages/LessonPage.jsx (core streaming loop)
async function runLesson() {
  setBlocks([])
  setStatus('loading')
  let firstBlock = true
  await streamLesson(subject, character, visualDescription, (block) => {
    if (firstBlock) {
      firstBlock = false
      setStatus('streaming')
    }
    // Each block appended to state immediately — React re-renders on each one
    setBlocks(prev => [...prev, block])
  })
  setStatus('done')
}
The status transitions from loading (before any block arrives) to streaming (after the first block, while more are still coming) to done (when the stream closes). Each status drives different UI — a full-screen spinner gives way to a subtle inline indicator, which disappears entirely when the lesson is complete.
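The browser side of this contract is small. Here is a minimal sketch of the SSE parsing that gemini.js performs, under our own assumptions: the helper name parseSSEBuffer is hypothetical, and the real file additionally turns inlineData payloads into renderable data URLs.

```javascript
// Hypothetical helper: split an accumulated SSE text buffer into complete
// "data: ..." events, returning any trailing partial frame for the next read.
function parseSSEBuffer(buffer) {
  const events = []
  const frames = buffer.split('\n\n')
  const rest = frames.pop() // last frame may be incomplete; keep it for later
  for (const frame of frames) {
    const line = frame.trim()
    if (!line.startsWith('data: ')) continue
    const payload = line.slice('data: '.length)
    if (payload === '[DONE]') continue // end-of-stream sentinel from the server
    events.push(JSON.parse(payload))
  }
  return { events, rest }
}
```

Reading the POST /api/stream response with response.body.getReader(), decoding each chunk, and feeding the accumulated text through a parser like this yields one JSON chunk object per completed event, regardless of how the network fragments the stream.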
Voice: Browser Web Speech API
Each text paragraph has a speaker button. Clicking it reads the paragraph aloud using the browser's built-in Web Speech API — no external service, no API key, no cost. The useSpeech custom hook manages which paragraph is currently speaking, cancels any playing audio when a new paragraph is started, and cleans up on unmount.
// client/src/api/tts.js
export function speak(text, { onEnd, rate = 0.9, pitch = 1.05 } = {}) {
  window.speechSynthesis.cancel()
  const utterance = new SpeechSynthesisUtterance(text)
  utterance.rate = rate
  utterance.pitch = pitch
  utterance.onend = () => onEnd?.()
  window.speechSynthesis.speak(utterance)
  return { cancel: () => window.speechSynthesis.cancel() }
}
Building the Backend: Node.js with the @google/genai SDK
The backend is a Node.js server with a deliberately narrow surface area. It does three things: serves the built React app as static files, handles the POST /api/stream endpoint that powers lesson generation, and manages authentication with Vertex AI.
The @google/genai SDK
The most important architectural decision in the backend is using the official @google/genai SDK rather than making raw HTTP requests. This is not merely a convenience choice — it is what satisfies the Google GenAI SDK requirement for this project, and it fundamentally changes the quality of the code.
The SDK is initialised once at server startup:
const { GoogleGenAI, Modality } = require('@google/genai');

const ai = new GoogleGenAI({
  vertexai: true,
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: 'global',
});
The vertexai: true flag tells the SDK to use Vertex AI rather than AI Studio. It also tells the SDK to handle authentication using Application Default Credentials automatically. On Cloud Run, this means the SDK fetches a fresh OAuth2 token from the GCE metadata server on every request, caches it, and refreshes it before it expires — all without a single line of token management code in the application.
Before adopting the SDK, we had written approximately 250 lines of manual code to handle token fetching, request construction, chunked transfer encoding, SSE parsing, and error handling. The SDK replaced all of that with a single generateContentStream call.
The Streaming Handler
The core of the backend is the handleStream function, called for every POST /api/stream request:
async function handleStream(req, res) {
  // Set SSE headers — keep the connection open for streaming
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Access-Control-Allow-Origin': '*',
  });

  // promptText is built from the subject/character fields in the request body.
  // The core SDK call — the heart of the entire application
  const stream = await ai.models.generateContentStream({
    model: 'gemini-2.5-flash-image',
    contents: [{ role: 'user', parts: [{ text: promptText }] }],
    config: {
      responseModalities: [Modality.TEXT, Modality.IMAGE],
    },
  });

  // Forward each chunk to the browser the moment it arrives. Write each
  // chunk at most once, and only if it actually carries text or image data.
  for await (const chunk of stream) {
    const parts = chunk?.candidates?.[0]?.content?.parts ?? [];
    if (parts.some((part) => part.text || part.inlineData)) {
      res.write(`data: ${JSON.stringify(chunk)}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
}
The for await loop is the key pattern. The SDK's generateContentStream returns an async iterator — each iteration yields a chunk from Vertex AI as soon as it arrives over the network. By immediately writing each chunk to the SSE response, we achieve true end-to-end streaming: the first image pixel and the first text character reach the browser as soon as Gemini produces them, with no intermediate buffering.
The Prompt Engineering
Getting Gemini to reliably produce a well-structured interleaved lesson required careful prompt design. The prompt separates text and image rules explicitly, because we discovered that Gemini's content filters are applied to the entire prompt context when generating images — meaning a character name mentioned anywhere in the prompt could trigger a PROHIBITED_CONTENT error when generating an image.
The solution was a two-layer naming system. Character names are used freely in the text narration rules. Image generation instructions reference only a visual description — "a friendly orange fox with big curious eyes, a fluffy tail with a glowing star at the tip, wearing a tiny blue cape" — with an explicit instruction that image descriptions must contain zero character names, zero brand names, and zero IP references.
function buildLessonPrompt(character, subject, visualDescription) {
  return `
━━━ TEXT SECTION RULES ━━━
• Written from the perspective of ${character}
• May freely use the narrator's name

━━━ IMAGE SECTION RULES ━━━
• The main character looks like this: ${visualDescription}
• CRITICAL: Zero character names, zero brand names, zero IP references

━━━ LESSON STRUCTURE ━━━
[TEXT 1] ${character} gives a warm introduction...
[IMAGE 1] Vibrant Pixar 3D CGI. The main character (${visualDescription}) waves hello...
`
}
This structure gives the model clear, separate instructions for each output type, and the explicit separation prevents character names from bleeding into image generation context.
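Because a single leaked name can trip the image-safety filter, it is worth verifying the separation mechanically. A sketch of such a guard, with the helper name findNameLeaks being our own invention rather than part of the codebase:

```javascript
// Hypothetical guard: scan the [IMAGE n] instruction lines of a generated
// prompt and flag any forbidden names that leaked into the image context.
function findNameLeaks(prompt, forbiddenNames) {
  const leaks = []
  const imageLines = prompt
    .split('\n')
    .filter((line) => /^\[IMAGE \d+\]/.test(line.trim()))
  for (const line of imageLines) {
    for (const name of forbiddenNames) {
      if (line.toLowerCase().includes(name.toLowerCase())) {
        leaks.push({ name, line: line.trim() })
      }
    }
  }
  return leaks
}
```

Run over every character/subject combination at build time, a check like this turns a runtime PROHIBITED_CONTENT surprise into a failing test.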
The Characters: Original, IP-Free by Design
One of the most important early decisions was to build entirely original characters rather than use recognisable fictional IP. This was initially driven by practical necessity — Gemini's content safety filters reliably block image generation requests that mention Disney, Pixar, or DreamWorks characters by name — but it turned out to be the right creative decision for other reasons too.
An original character can be designed for a specific educational archetype. Professor Twigs is a wise old owl in a tweed jacket with a mortarboard hat — visual shorthand for academic knowledge. Blaze the Fire Dragon is small, friendly, and enthusiastic — perfect energy for a child who needs encouragement. Celeste the Stargazer sits on a crescent moon surrounded by constellation patterns — immediately evocative of astronomy and wonder.
Each character has three descriptions in the data file. The label is the display name shown in the UI. The visualDescription is used in image prompts — a detailed physical description that Gemini uses to generate consistent character representations. The voicePersonality (a future feature) will describe the character's speaking style for text-to-speech integration.
{
  value: 'Professor Twigs',
  label: 'Prof. Twigs',
  emoji: '🦉',
  accent: '#a78bfa',
  bg: 'rgba(167,139,250,.15)',
  visualDescription: 'a wise old owl with oversized round spectacles, wearing a tiny tweed jacket with elbow patches, a mortarboard graduation cap that is slightly too big, and carrying a miniature stack of books',
  voicePersonality: 'A scholarly, warm elderly voice with gentle authority...',
}
Google Cloud Infrastructure
Vertex AI vs AI Studio
Toon World uses Vertex AI rather than AI Studio for a specific reason: Vertex AI is Google Cloud's enterprise AI platform, designed for production workloads. It uses service account authentication rather than API keys, which means credentials are never stored in code. It supports Application Default Credentials, which means authentication is handled automatically based on where the code is running.
AI Studio, by contrast, requires an API key that must be stored somewhere in the application — and a key in a frontend application is a key that anyone can steal from the browser's network tab. Vertex AI eliminates this attack surface entirely.
The Vertex AI endpoint for Gemini 2.5 Flash Image is at aiplatform.googleapis.com, using the global location for models released after mid-2025. The full model path in the request is:
/v1/projects/{PROJECT}/locations/global/publishers/google/models/gemini-2.5-flash-image
The @google/genai SDK constructs and manages this URL automatically when initialised with vertexai: true.
Cloud Run
The backend is deployed on Cloud Run, Google's serverless container platform. Cloud Run is an ideal host for this kind of application because it scales to zero when not in use (no cost when nobody is learning), scales up automatically when multiple children are using the app simultaneously, and provides each container with a managed service account identity that the @google/genai SDK uses for ADC authentication.
Deployment is handled by a Dockerfile and a deploy.sh script that runs gcloud builds submit — which builds the Docker container entirely in the cloud using Google Cloud Build, with no Docker installation required locally. The script then deploys the built image to Cloud Run and grants the service account the roles/aiplatform.user IAM role required to call Vertex AI.
# Two-stage build: compile React frontend, then copy into lean Node runtime
FROM node:20-slim AS build
WORKDIR /app/client
COPY client/package*.json ./
RUN npm install
COPY client/ ./
RUN npm run build

FROM node:20-slim AS runtime
WORKDIR /app
COPY package*.json ./
RUN npm install --omit=dev
COPY server.js ./
# Copy the built frontend (Vite's outDir is configured as ../public)
COPY --from=build /app/public ./public
ENV PORT=8080
CMD ["node", "server.js"]
IAM and Security
The security model is simple and correct. The Cloud Run service account is granted exactly one IAM role: roles/aiplatform.user. This role allows the service account to call Vertex AI. It cannot read or write Cloud Storage. It cannot access other GCP services. It cannot create or modify GCP resources. The principle of least privilege is enforced by the IAM configuration.
Key Technical Challenges
Making the Stream Truly Feel Interleaved
The first working version of Toon World used the non-streaming generateContent endpoint. The lesson was technically interleaved — Gemini returned alternating text and image blocks — but the experience was a 40 to 60 second spinner followed by all seven blocks appearing simultaneously. It looked and felt like a bulk content dump.
Switching to streamGenerateContent changed everything. But streaming introduced a new challenge: base64-encoded images can span multiple SSE chunks. A naive implementation that renders each chunk immediately would show broken partial images while the rest of the data was still in transit.
The solution was an accumulator in gemini.js that detects part-type transitions. When the stream switches from image data back to text, the accumulator knows the image is complete:
function processPart(part) {
  if (part.text != null) {
    flushImage() // complete any pending image before starting text
    pendingText += part.text
  } else if (part.inlineData?.mimeType?.startsWith('image/')) {
    flushText() // complete any pending text before starting image
    pendingImageData += part.inlineData.data
    pendingImageMime = part.inlineData.mimeType
  }
}
This pattern ensures that every block rendered on screen is complete — no partial images, no truncated paragraphs.
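The excerpt above relies on state held in closure variables. Restated as a self-contained sketch (the class form and method names are ours, for illustration; the behaviour matches the accumulator described above):

```javascript
// Illustrative accumulator: buffers streamed parts and emits only complete
// blocks via the onBlock callback, detecting part-type transitions.
class BlockAccumulator {
  constructor(onBlock) {
    this.onBlock = onBlock
    this.pendingText = ''
    this.pendingImageData = ''
    this.pendingImageMime = ''
  }
  processPart(part) {
    if (part.text != null) {
      this.flushImage() // a run of image parts has ended, so the image is complete
      this.pendingText += part.text
    } else if (part.inlineData?.mimeType?.startsWith('image/')) {
      this.flushText() // a run of text parts has ended, so the paragraph is complete
      this.pendingImageData += part.inlineData.data
      this.pendingImageMime = part.inlineData.mimeType
    }
  }
  flushText() {
    if (this.pendingText) {
      this.onBlock({ type: 'text', text: this.pendingText })
      this.pendingText = ''
    }
  }
  flushImage() {
    if (this.pendingImageData) {
      this.onBlock({ type: 'image', mime: this.pendingImageMime, data: this.pendingImageData })
      this.pendingImageData = ''
      this.pendingImageMime = ''
    }
  }
  end() {
    // Stream closed: emit whatever is still pending
    this.flushText()
    this.flushImage()
  }
}
```

The end() call matters: without it, the final paragraph or illustration of a lesson would sit in the buffer forever, since no further part-type transition arrives to flush it.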
Content Safety Filters
Gemini's content safety filters are aggressive when it comes to generating images of named fictional characters. Early versions of the app used real character names in image prompts and encountered PROHIBITED_CONTENT errors on the majority of requests.
The two-layer naming approach described above — character names in text rules, visual descriptions in image rules — reduced these errors dramatically. Creating entirely original characters rather than fictional IP references eliminated them almost entirely. The remaining source of PROHIBITED_CONTENT errors turned out to be lesson topics phrased in ways that inadvertently triggered safety heuristics, which we resolved by rewording the most sensitive subjects in the data file.
Getting the Right Vertex AI Configuration
The Vertex AI configuration for newer Gemini models differs from older ones in non-obvious ways. gemini-2.5-flash-image requires the global location, not a regional location like us-central1. The endpoint hostname for the global location is aiplatform.googleapis.com, not global-aiplatform.googleapis.com as one might expect. The @google/genai SDK handles these details automatically, but discovering them during the period when we were using raw HTTP requests cost several debugging sessions.
The Role Requirement
Vertex AI requires every entry in the contents array to have an explicit role field set to either "user" or "model". AI Studio is more lenient and accepts contents without role fields. This subtle API difference was the source of a frustrating 400 Bad Request error after migrating from AI Studio to Vertex AI, resolved by ensuring every content object included role: 'user'.
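A tiny defensive helper makes this class of bug hard to reintroduce. A sketch, where the name ensureRoles is our own invention:

```javascript
// Hypothetical helper: Vertex AI rejects contents entries without a role,
// so default any missing role to 'user' before sending the request.
function ensureRoles(contents, defaultRole = 'user') {
  return contents.map((entry) =>
    entry.role ? entry : { ...entry, role: defaultRole }
  )
}
```

Passing every contents array through a normaliser like this at the single call site turns a silent 400 Bad Request into a non-event.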
Lessons Learned
The SDK is not optional. We spent a significant portion of development manually implementing things the @google/genai SDK already did correctly — token fetching, request formatting, SSE parsing, error handling. The SDK exists because these details are complex and easy to get wrong. Using it from the start would have saved weeks.
Streaming changes the product, not just the implementation. The switch from batch to streaming was not a performance optimisation — it was a product transformation. A lesson that writes itself in front of a child is fundamentally different from a lesson that appears all at once. The experience is what matters, and streaming is what creates the right experience.
Prompt structure matters as much as prompt content. A prompt that asks for interleaved content without explicitly separating the rules for text sections from the rules for image sections will produce inconsistent results. The structured prompt format with clearly separated rules for each modality was the difference between reliable interleaved output and occasional interleaved output.
Original characters are a feature, not a limitation. The inability to use named fictional IP forced us to invent original characters, which turned out to be more valuable. Original characters can be designed for specific pedagogical roles, their visual descriptions can be crafted to generate consistently, and they create a visual identity that is uniquely Toon World's.
What Is Next
The foundation of Toon World is built to grow. The most immediate roadmap items are an expanded character library and an expanded subject library; both are plain data-file additions, so this is creative work rather than engineering work.
The most exciting technical addition on the horizon is expressive text-to-speech. The current browser Web Speech API produces functional but robotic voice output. We are evaluating two approaches: ElevenLabs Voice Design, which can generate a unique voice from a plain-English personality description (perfectly aligned with how we already describe characters), and Google's Gemini Text-to-Speech API, which would keep the entire stack within the Google ecosystem and enable voice generation from the same model family that generates the lesson content.
Beyond voice, the next major product feature is interactive quiz moments between lesson blocks. After the third lesson point, the character would ask a question and the child's response would trigger a new Gemini generation — either an image of the character celebrating a correct answer, or a gentle encouraging scene prompting them to try again. This would transform Toon World from something a child watches into something a child truly participates in.
Conclusion
Toon World demonstrates something that has not been possible before: an AI system that is simultaneously the author, the illustrator, and the voice of an educational experience, generating all three in a single coherent stream. The technology that makes it possible — Gemini's interleaved multimodal output via the @google/genai SDK on Vertex AI — is not a gimmick. It is a genuinely new capability that enables a genuinely new kind of educational content.
The architecture is straightforward in retrospect: a React frontend that streams blocks to the screen as they arrive, a Node.js server on Cloud Run that uses the official SDK to call Gemini with responseModalities: [TEXT, IMAGE], and Vertex AI handling the generation with ADC keeping everything secure. But the experience that architecture enables — a child watching their chosen character teach them something in a story that has never existed before and will never exist again — is something worth building.
Every lesson is a world premiere. We think that matters.
Built with React 18, Node.js, @google/genai SDK, Vertex AI, Google Cloud Run, and Gemini 2.5 Flash Image.