Talha Tahir

Posted on Mar 16

An AI That Turns a Parent's Voice Into a Personalised Illustrated Storybook

#ai #webdev #geminiliveagentchallenge #vertexai

I Built an AI That Turns a Parent's Voice Into a Personalized Illustrated Storybook — Here's How

From Gemini Live API to real-time SSE streaming on Cloud Run — the full technical story of DreamBook

My daughter went through a dinosaur phase. Then a space phase. Then superheroes. No bookstore in the world can keep up with a six-year-old's rotating obsessions — and even if one could, it still wouldn't know her name, her fears, or the lesson I was trying to teach her that week.

That's the problem DreamBook solves.

You speak. You say something like "Emma, age five, loves dinosaurs and painting, scared of thunder, I want her to learn that being brave doesn't mean not being scared." Ninety seconds later, you have a fully illustrated, narrated, personalized storybook — text streaming in live, illustrations fading in as they generate, audio narration ready to play per page, and a PDF you can print and keep forever.

I built this for the Gemini Live Agent Hackathon. This post is the full technical story — what I built, every bug I hit, and what I learned along the way.

The Stack

Before we dive in, here's the full picture:

Backend: NestJS + TypeScript on Google Cloud Run
Frontend: Next.js 15 (App Router) on Vercel
AI: @google/genai SDK throughout
- gemini-3.1-pro-preview for story generation
- gemini-3.1-flash-image-preview (Nano Banana) for illustrations
- gemini-2.5-flash-preview-tts for audio narration
- gemini-2.5-flash-native-audio-preview via Live API for voice input
Infrastructure: Firestore, Cloud Storage, Cloud Build, Firebase Auth

The Architecture

The system has three layers of real-time communication happening simultaneously:

Browser
  ├── Socket.io WebSocket → NestJS /voice gateway → Gemini Live API
  │   (PCM audio chunks → real-time transcript → StoryRequest JSON)
  │
  └── fetch() SSE stream ← NestJS StoryController
      (page:text, page:image, page:audio events as they generate)

NestJS (Cloud Run)
  ├── GeminiService     → gemini-3.1-pro-preview (text streaming)
  ├── ImagenService     → Nano Banana (illustrations, concurrent)
  ├── TtsService        → gemini-2.5-flash-preview-tts (per page)
  └── PdfService        → pdf-lib → Cloud Storage

The interesting part is that text, illustrations, and audio all generate concurrently. Gemini streams the story text; each time an [IMAGE: ...] directive appears in the stream, an illustration job fires immediately without waiting for the story to finish. TTS runs the same way. By the time Gemini finishes generating the last page of text, most of the illustrations and narrations are already done.
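
To make the fire-and-forget pattern concrete, here's a minimal sketch of starting jobs as directives arrive and only waiting for them after the text stream ends. JobTracker and PageJob are hypothetical names, not DreamBook's actual classes, and the jobs stand in for the illustration and TTS calls:

```typescript
// Hypothetical job type: any async task keyed by page number.
type PageJob<T> = { pageNumber: number; run: () => Promise<T> };

class JobTracker<T> {
  private readonly pending: Promise<void>[] = [];
  readonly results = new Map<number, T>();
  readonly failures = new Map<number, unknown>();

  // Start the job immediately; the caller does not await it.
  start(job: PageJob<T>): void {
    const settled = job.run().then(
      (value) => { this.results.set(job.pageNumber, value); },
      (error) => { this.failures.set(job.pageNumber, error); },
    );
    this.pending.push(settled);
  }

  // Wait for everything that was started, typically after the text stream ends.
  async drain(): Promise<void> {
    await Promise.all(this.pending);
  }
}
```

Recording failures per page, instead of letting one rejection abort everything, means a rate-limited illustration costs one picture rather than the whole book.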

The Voice Input Pipeline

This is where the Gemini Live API comes in — and it's genuinely impressive.

The browser captures raw PCM audio from the microphone using an AudioWorklet:

// Inline AudioWorklet — converts Float32 mic samples to Int16 PCM
const PCM_PROCESSOR_CODE = `
class PcmProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0];
    if (!input || !input[0]) return true;
    const float32 = input[0];
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    this.port.postMessage(int16.buffer, [int16.buffer]);
    return true;
  }
}
registerProcessor('pcm-processor', PcmProcessor);
`;

Those PCM chunks stream over Socket.io to NestJS, which opens a Gemini Live session per user:

const liveSession = await this.ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},  // enables real-time transcription
    systemInstruction: {
      parts: [{ text: 'You are a transcription assistant. Just transcribe.' }],
    },
  },
  callbacks: {
    // @google/genai passes the LiveServerMessage itself, not a DOM MessageEvent
    onmessage: (message: LiveServerMessage): void => {
      const content = message.serverContent;
      if (content?.inputTranscription?.text) {
        session.transcript += content.inputTranscription.text;
        socket.emit('voice:transcript', { text: session.transcript });
      }
    },
  },
});
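
The snippet above doesn't show how the Socket.io chunks reach the Live session. Here's a sketch of that step, assuming the browser captures at 16 kHz and that each chunk arrives as an ArrayBuffer of 16-bit mono PCM. toRealtimeAudio and INPUT_SAMPLE_RATE are hypothetical names:

```typescript
// Payload shape for realtime audio input: base64 data plus a MIME type.
interface RealtimeAudio {
  data: string;
  mimeType: string;
}

// Assumption: the browser's AudioContext runs at 16 kHz.
const INPUT_SAMPLE_RATE = 16000;

function toRealtimeAudio(chunk: ArrayBuffer, sampleRate = INPUT_SAMPLE_RATE): RealtimeAudio {
  return {
    data: Buffer.from(chunk).toString('base64'),
    mimeType: `audio/pcm;rate=${sampleRate}`,
  };
}

// In the gateway's chunk handler (sketch):
// liveSession.sendRealtimeInput({ audio: toRealtimeAudio(chunk) });
```

If the AudioContext runs at a different rate, the rate in the MIME type has to match it, otherwise the transcript quality suffers.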

When the user stops speaking, Gemini Flash extracts a structured StoryRequest from the accumulated transcript:

const response = await this.ai.models.generateContent({
  model: 'gemini-2.0-flash',
  contents: [{
    role: 'user',
    parts: [{ text: `Extract story params from: "${transcript}". 
    Return raw JSON only: { childName, childAge, interests, pageCount, 
    illustrationStyle, language } — optionally lesson and fears.` }],
  }],
});
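
Models sometimes wrap JSON in a markdown code fence or omit fields even when asked for raw JSON, so the reply is worth validating before it reaches the story pipeline. A minimal sketch, assuming the field types below (the prompt only names the fields) and with the default values being hypothetical:

```typescript
interface StoryRequest {
  childName: string;
  childAge: number;
  interests: string[];
  pageCount: number;
  illustrationStyle: string;
  language: string;
  lesson?: string;
  fears?: string[];
}

const FENCE = '`'.repeat(3); // a markdown code fence marker

function parseStoryRequest(raw: string): StoryRequest {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line (with its optional "json" tag) and the closing fence.
    text = text.slice(text.indexOf('\n') + 1);
    const end = text.lastIndexOf(FENCE);
    if (end !== -1) text = text.slice(0, end);
  }
  const value: unknown = JSON.parse(text);
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    throw new Error('StoryRequest must be a JSON object');
  }
  const o = value as Record<string, unknown>;
  const childName = typeof o.childName === 'string' ? o.childName.trim() : '';
  if (!childName) throw new Error('childName is required');
  const toStrings = (v: unknown): string[] =>
    Array.isArray(v) ? v.filter((x): x is string => typeof x === 'string') : [];
  const pageCount = Number(o.pageCount);
  return {
    childName,
    childAge: Number(o.childAge),
    interests: toStrings(o.interests),
    // Hypothetical defaults for anything the model left out:
    pageCount: Number.isInteger(pageCount) && pageCount > 0 ? pageCount : 8,
    illustrationStyle: typeof o.illustrationStyle === 'string' ? o.illustrationStyle : 'watercolor',
    language: typeof o.language === 'string' ? o.language : 'en',
    lesson: typeof o.lesson === 'string' ? o.lesson : undefined,
    fears: o.fears === undefined ? undefined : toStrings(o.fears),
  };
}
```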

One thing I noticed: the inputTranscription.finished flag doesn't always fire for the native audio model. The fix is to accumulate all transcript chunks continuously rather than waiting for a "final" marker.

The Story Generation Pipeline

This is the core of the app. gemini-3.1-pro-preview generates the story with a specific format requirement:

const prompt = `
Write exactly ${pageCount} pages for a storybook about ${childName}.
After each page, add an image directive:
[IMAGE: <detailed illustration prompt in ${style} style>]
Only narrative text and [IMAGE:] directives. Nothing else.
`;

const streamResult = await this.ai.models.generateContentStream({
  model: 'gemini-3.1-pro-preview',
  contents: [{ role: 'user', parts: [{ text: prompt }] }],
});

As chunks stream in, I scan for the [IMAGE:] directive pattern:

// No 'g' flag: every search starts at the beginning of the (already trimmed) buffer
const IMAGE_DIRECTIVE_RE = /\[IMAGE:\s*([^\]]+)\]/i;

let pageTextBuffer = '';
let pageNumber = 0;

for await (const chunk of streamResult) {
  const candidate = chunk.candidates?.[0];
  const chunkText = candidate?.content?.parts
    ?.map((p) => p.text ?? '').join('') ?? '';

  pageTextBuffer += chunkText;

  // One chunk can complete more than one page, so loop until no directive is left
  let imageMatch: RegExpExecArray | null;
  while ((imageMatch = IMAGE_DIRECTIVE_RE.exec(pageTextBuffer)) !== null) {
    pageNumber++;
    const imagePrompt = imageMatch[1].trim();
    const narrativeText = pageTextBuffer.slice(0, imageMatch.index).trim();

    // Emit text immediately over SSE
    subject.next({ event: 'page:text', data: { pageNumber, text: narrativeText } });

    // Fire illustration job concurrently (don't await)
    subject.next({ event: 'page:image', data: { pageNumber, imagePrompt } });

    // Keep only the text after the directive for the next page
    pageTextBuffer = pageTextBuffer.slice(imageMatch.index + imageMatch[0].length);
  }
}
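
The directive scanning is easy to pull out into a small class that can be unit-tested without a model call. A minimal sketch, with DirectiveSplitter and ParsedPage as hypothetical names. It covers two cases worth testing: a directive split across two chunks, and a chunk that completes two pages at once:

```typescript
interface ParsedPage {
  pageNumber: number;
  text: string;
  imagePrompt: string;
}

class DirectiveSplitter {
  private buffer = '';
  private pageNumber = 0;

  // Feed one streamed chunk; returns every page it completed (possibly none).
  push(chunk: string): ParsedPage[] {
    this.buffer += chunk;
    const pages: ParsedPage[] = [];
    // Non-global regex: exec always searches from the start of the buffer.
    const re = /\[IMAGE:\s*([^\]]+)\]/i;
    let match = re.exec(this.buffer);
    while (match !== null) {
      this.pageNumber += 1;
      pages.push({
        pageNumber: this.pageNumber,
        text: this.buffer.slice(0, match.index).trim(),
        imagePrompt: match[1].trim(),
      });
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = re.exec(this.buffer);
    }
    return pages;
  }

  // Text left over when the stream ends, e.g. a final page without a directive.
  remainder(): string {
    return this.buffer.trim();
  }
}
```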

Important: In @google/genai, a streaming chunk exposes its text as a getter property (chunk.text), not as a method, so the chunk.text() call from the old @google/generative-ai SDK no longer works. Reading candidates[0].content.parts[].text directly, as in the loop above, works too. This bug cost me several hours.

Illustrations with Nano Banana

Nano Banana (gemini-3.1-flash-image-preview) is a huge improvement over the old Imagen 4 Vertex AI REST approach. No access token fetching, no endpoint construction, just the SDK:

const response = await this.ai.models.generateContent({
  model: 'gemini-3.1-flash-image-preview',
  contents: fullPrompt,
  config: {
    responseModalities: ['TEXT', 'IMAGE'],
  },
});

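const imagePart = response.candidates?.[0]?.content?.parts
  ?.find((p) => p.inlineData?.mimeType?.startsWith('image/'));

// The response may contain text only, so guard before reading the image data
const imageBase64 = imagePart?.inlineData?.data;
if (!imageBase64) {
  throw new Error('No image returned for this prompt');
}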

The image comes back as base64 inline data. I upload it to Cloud Storage and return a signed URL.
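
Before the upload, it helps to pull the MIME type out along with the data so the stored object gets the right extension and content type. A minimal sketch using structural types that cover only the fields read here; extractImage and ExtractedImage are hypothetical names:

```typescript
// Minimal structural types: only the fields this helper reads.
interface InlineData { mimeType?: string; data?: string }
interface Part { text?: string; inlineData?: InlineData }
interface ImageResponse { candidates?: { content?: { parts?: Part[] } }[] }

interface ExtractedImage {
  mimeType: string;
  base64: string;
  extension: string;
}

function extractImage(response: ImageResponse): ExtractedImage | null {
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  for (const part of parts) {
    const mimeType = part.inlineData?.mimeType;
    const base64 = part.inlineData?.data;
    if (mimeType?.startsWith('image/') && base64) {
      // "image/png" -> "png", usable as the file extension of the stored object
      const extension = mimeType.slice('image/'.length).split(';')[0];
      return { mimeType, base64, extension };
    }
  }
  // Text-only response: the caller decides whether to retry or show a placeholder.
  return null;
}
```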

Audio Narration

TTS uses generateContent with audio response modality — no Live API needed here:

const response = await this.ai.models.generateContent({
  model: 'gemini-2.5-flash-preview-tts',
  contents: [{ role: 'user', parts: [{ text: narratePrompt }] }],
  config: {
    responseModalities: ['AUDIO'],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Kore' },
      },
    },
  },
});

Gemini TTS returns audio/L16;codec=pcm;rate=24000: headerless 16-bit PCM at 24 kHz, base64-encoded in the response. Browsers can't play raw PCM directly, so I wrap it in a WAV header:

private pcmToWav(pcm: Buffer, sampleRate = 24000): Buffer {
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);   // PCM chunk size
  header.writeUInt16LE(1, 20);    // PCM format
  header.writeUInt16LE(1, 22);    // mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate
  header.writeUInt16LE(2, 32);    // block align
  header.writeUInt16LE(16, 34);   // 16-bit
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}
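
The 24000 default matches the MIME type above, but the rate can also be read from the response instead of hard-coded. A minimal sketch with two hypothetical helpers, parseSampleRate and pcmDurationSeconds, assuming 16-bit mono PCM as in the header code:

```typescript
// Parse the sample rate out of a MIME type such as "audio/L16;codec=pcm;rate=24000".
function parseSampleRate(mimeType: string, fallback = 24000): number {
  const match = /(?:^|;)\s*rate=(\d+)/i.exec(mimeType);
  return match ? Number(match[1]) : fallback;
}

// Duration of 16-bit mono PCM: 2 bytes per sample.
function pcmDurationSeconds(byteLength: number, sampleRate: number): number {
  return byteLength / (sampleRate * 2);
}
```

The duration is handy on the frontend for showing a progress bar per page before the audio has loaded.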

The SSE Streaming Problem

This was the most painful bug of the entire project.

Everything worked perfectly locally. On Cloud Run, the frontend received pings every 15 seconds but zero page events — even though the server logs showed pages being emitted correctly.

The cause, as far as I could tell: Cloud Run's Google load balancer buffers HTTP/2 responses for compression efficiency. SSE events were held in a buffer and only released when the buffer filled or the connection closed, which meant the entire story arrived in one batch after generation finished, not page by page.

The fix required three things together:

// 1. Disable compression entirely
res.setHeader('Content-Encoding', 'identity');

// 2. Disable nginx/proxy buffering
res.setHeader('X-Accel-Buffering', 'no');
res.setHeader('Cache-Control', 'no-cache, no-store, no-transform');

// 3. Explicitly flush after EVERY write. res.flush() only exists when the
//    compression middleware is installed, hence the typeof check.
const flush = () => {
  if (typeof (res as any).flush === 'function') {
    (res as any).flush();
  }
};

const write = (event: string, data: unknown) => {
  res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);
  flush(); // ← this is the critical one
};

No single one of these was sufficient on its own. All three together fixed it.
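
On the browser side, a fetch()-based stream has to split the incoming text into events itself, since there is no EventSource doing it. A minimal sketch of that parser, assuming "\n" line endings and JSON payloads as produced by the write() helper above; SseParser and SseEvent are hypothetical names:

```typescript
interface SseEvent {
  event: string;
  data: unknown;
}

class SseParser {
  private buffer = '';

  // Feed decoded text from the response body; returns the events completed by it.
  push(text: string): SseEvent[] {
    this.buffer += text;
    const events: SseEvent[] = [];
    let boundary = this.buffer.indexOf('\n\n');
    while (boundary !== -1) {
      const frame = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      let event = 'message'; // SSE default when no "event:" line is present
      const dataLines: string[] = [];
      for (const line of frame.split('\n')) {
        if (line.startsWith('event:')) event = line.slice(6).trim();
        else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
        // Lines starting with ":" are comments (keep-alive pings) and are skipped.
      }
      if (dataLines.length > 0) {
        events.push({ event, data: JSON.parse(dataLines.join('\n')) });
      }
      boundary = this.buffer.indexOf('\n\n');
    }
    return events;
  }
}
```

With fetch(), the text passed to push() would come from response.body.getReader(), decoded with a TextDecoder in streaming mode so multi-byte characters split across reads stay intact.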

The GCS Signed URL Problem

On Cloud Run, getSignedUrl() from @google-cloud/storage throws:

Permission 'iam.serviceAccounts.signBlob' denied

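This doesn't happen locally because the service account JSON key file contains a private key, so the client signs URLs itself. On Cloud Run there is no key file, so signing goes through the IAM signBlob API, and the compute service account needs explicit permission to call it: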

gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/iam.serviceAccountTokenCreator"

You also need to tell the GCS client which service account to use when signing. I auto-fetch this from the GCP metadata server:

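const res = await fetch(
  'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email',
  { headers: { 'Metadata-Flavor': 'Google' } },
);
if (!res.ok) {
  throw new Error(`Metadata server returned ${res.status}`);
}
// Trim defensively so stray whitespace never ends up in the signing request
this.serviceAccountEmail = (await res.text()).trim();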

Then pass it to getSignedUrl():

const [url] = await this.bucket.file(gcsPath).getSignedUrl({
  action: 'read',
  expires: Date.now() + 60 * 60 * 1000,
  serviceAccountEmail: this.serviceAccountEmail,
});

The Firebase Auth Race Condition

On Vercel production, every API call on first page load returned 401 Not authenticated — even for logged-in users.

Firebase's auth.currentUser is null on first render, even when a session exists in localStorage. Firebase needs a tick to rehydrate. The onAuthStateChanged listener is the right way to wait for it, but there's a subtlety: on first load, Firebase emits null immediately before emitting the real user. If you reject on the first emission, you'll always get 401.

function waitForAuth(): Promise<User> {
  return new Promise((resolve, reject) => {
    if (auth.currentUser) {
      resolve(auth.currentUser);
      return;
    }

    let settled = false;

    // 8-second timeout, declared first so the listener can cancel it
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      unsub();
      if (auth.currentUser) resolve(auth.currentUser);
      else reject(new Error('Not authenticated'));
    }, 8000);

    // Wait for a definitive state and ignore the initial null
    const unsub = auth.onAuthStateChanged((user) => {
      // If null: Firebase is still initializing, so keep waiting
      if (settled || !user) return;
      settled = true;
      clearTimeout(timer);
      unsub();
      resolve(user);
    });
  });
}
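
With waitForAuth in place, API calls can wait for the user before attaching a token. A minimal sketch, assuming the backend expects the Firebase ID token as a Bearer token in the Authorization header; authHeaders and TokenSource are hypothetical names:

```typescript
// The only thing needed from a Firebase user here: a way to get an ID token.
interface TokenSource {
  getIdToken(): Promise<string>;
}

async function authHeaders(
  waitForUser: () => Promise<TokenSource>,
): Promise<Record<string, string>> {
  const user = await waitForUser(); // e.g. the waitForAuth function above
  const token = await user.getIdToken();
  return { Authorization: `Bearer ${token}` };
}

// Usage sketch:
// const res = await fetch(url, { headers: await authHeaders(waitForAuth) });
```

Taking the wait function as a parameter keeps the helper testable without a Firebase project.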

The Secret With a Newline

After fixing auth, every token verification still returned 401 with this message:

Firebase ID token has incorrect "aud" claim.
Expected "live-agent-challenge-489310\n" but got "live-agent-challenge-489310"

That \n at the end of Expected is a literal newline character embedded in the secret. When I created the secret in Secret Manager, I used echo without the -n flag:

# WRONG — adds a trailing newline
echo "live-agent-challenge-489310" | gcloud secrets create FIREBASE_PROJECT_ID --data-file=-

# CORRECT — no newline
echo -n "live-agent-challenge-489310" | gcloud secrets create FIREBASE_PROJECT_ID --data-file=-

Fix: add a new version of the secret with the correct value and redeploy.
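
A cheap defense against this class of bug is to trim secrets when reading them, so a stray newline never reaches a comparison. A minimal sketch, with readSecret as a hypothetical helper:

```typescript
function readSecret(env: Record<string, string | undefined>, name: string): string {
  const raw = env[name];
  if (raw === undefined) {
    throw new Error(`Missing required secret: ${name}`);
  }
  // Strip leading/trailing whitespace, including a newline added by echo.
  const value = raw.trim();
  if (value === '') {
    throw new Error(`Secret ${name} is empty`);
  }
  return value;
}

// Usage sketch:
// const projectId = readSecret(process.env, 'FIREBASE_PROJECT_ID');
```

This doesn't replace fixing the stored value, but it turns a confusing 401 into a non-issue.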

React StrictMode Double-Firing

In development, React 18 StrictMode mounts each component, unmounts it, and mounts it again, so effects run twice. That helps catch missing cleanup, but it also made the story generation pipeline fire twice: two competing Gemini calls, two sets of concurrent Imagen calls hitting rate limits, and random pages missing illustrations.

The fix was using a ref instead of state to guard the pipeline:

// ❌ State: the second effect run still sees the stale `false` from the same render
const [streamStarted, setStreamStarted] = useState(false);

// ✅ Ref: the write is visible immediately, so the second effect run bails out
const hasStarted = useRef(false);

useEffect(() => {
  if (!storyId || hasStarted.current) return;
  hasStarted.current = true;
  startStream(storyId);
}, [storyId]);

I also added a server-side guard using a Set of active pipeline IDs — if a pipeline is already running for a storyId, subsequent requests complete immediately without starting another generation.
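
The server-side guard can be as small as this. A sketch with hypothetical names (PipelineRegistry, runOnce); note that an in-memory Set only protects a single instance, so it won't deduplicate requests that land on different Cloud Run instances:

```typescript
class PipelineRegistry {
  private readonly active = new Set<string>();

  // True if the caller now owns the pipeline; false if one is already running.
  tryStart(storyId: string): boolean {
    if (this.active.has(storyId)) return false;
    this.active.add(storyId);
    return true;
  }

  finish(storyId: string): void {
    this.active.delete(storyId);
  }
}

// Runs the pipeline unless one is already active for this story.
// Returns whether this call actually ran it.
async function runOnce(
  registry: PipelineRegistry,
  storyId: string,
  pipeline: () => Promise<void>,
): Promise<boolean> {
  if (!registry.tryStart(storyId)) return false;
  try {
    await pipeline();
    return true;
  } finally {
    // Always release the ID, even if generation throws.
    registry.finish(storyId);
  }
}
```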

Deploying to Cloud Run

The setup that works:

Memory: 2 GiB — concurrent TTS + Imagen + PDF generation needs headroom
CPU: 2 — parallel processing per story
Request timeout: 3600 seconds — SSE stream stays open during full generation (default 300s kills it mid-story)
Session affinity: enabled — required for stateful WebSocket voice sessions
Execution environment: 2nd gen — better network performance for streaming
CI/CD: GitHub repo → Cloud Build → Cloud Run on every push to main

The most important setting people miss is the request timeout. 300 seconds sounds like a lot until you have an 8-page story with TTS and illustrations running concurrently.

What I'd Do Differently

Start with @google/genai from day one. Migrating from the deprecated @google/generative-ai mid-project cost real time. The streaming API is different, the response shape is different, and assuming parity between the two is a mistake.

Test SSE streaming through a reverse proxy early. I spent hours debugging a problem that only existed in production because I didn't simulate the Cloud Run load balancer locally. ngrok with compression enabled would have caught the buffering issue much earlier.

Use structured logging from the start. The Cloud Logging queries I ran to diagnose production issues (resource.type="cloud_run_revision" AND resource.labels.service_name="dream-book-api") only worked because I had consistent log formatting throughout. Good logging is not optional for cloud deployments.
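
On Cloud Run, writing one JSON object per line to stdout is enough for Cloud Logging to treat the output as structured: it reads severity from the object and keeps the other fields queryable under jsonPayload. A minimal sketch, with formatLogEntry and the field names storyId and pageNumber as illustrative assumptions:

```typescript
type Severity = 'DEBUG' | 'INFO' | 'WARNING' | 'ERROR';

function formatLogEntry(
  severity: Severity,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  // Spread the extra fields first so they can never overwrite severity or message.
  return JSON.stringify({ ...fields, severity, message });
}

// Usage sketch: one JSON line per event on stdout.
// console.log(formatLogEntry('INFO', 'page emitted', { storyId, pageNumber }));
```

With fields like these in place, the service-level query above can be narrowed further, for example with jsonPayload.storyId="...".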