I Built an AI That Turns a Parent's Voice Into a Personalized Illustrated Storybook — Here's How
From Gemini Live API to real-time SSE streaming on Cloud Run — the full technical story of DreamBook
My daughter went through a dinosaur phase. Then a space phase. Then superheroes. No bookstore in the world can keep up with a six-year-old's rotating obsessions — and even if one could, it still wouldn't know her name, her fears, or the lesson I was trying to teach her that week.
That's the problem DreamBook solves.
You speak. You say something like "Emma, age five, loves dinosaurs and painting, scared of thunder, I want her to learn that being brave doesn't mean not being scared." Ninety seconds later, you have a fully illustrated, narrated, personalized storybook — text streaming in live, illustrations fading in as they generate, audio narration ready to play per page, and a PDF you can print and keep forever.
I built this for the Gemini Live Agent Hackathon. This post is the full technical story — what I built, every bug I hit, and what I learned along the way.
The Stack
Before we dive in, here's the full picture:
- Backend: NestJS + TypeScript on Google Cloud Run
- Frontend: Next.js 15 (App Router) on Vercel
- AI: @google/genai SDK throughout
  - gemini-3.1-pro-preview for story generation
  - gemini-3.1-flash-image-preview (Nano Banana) for illustrations
  - gemini-2.5-flash-preview-tts for audio narration
  - gemini-2.5-flash-native-audio-preview via Live API for voice input
- Infrastructure: Firestore, Cloud Storage, Cloud Build, Firebase Auth
The Architecture
The system has three layers of real-time communication happening simultaneously:
    Browser
    ├── Socket.io WebSocket → NestJS /voice gateway → Gemini Live API
    │     (PCM audio chunks → real-time transcript → StoryRequest JSON)
    │
    └── fetch() SSE stream ← NestJS StoryController
          (page:text, page:image, page:audio events as they generate)

    NestJS (Cloud Run)
    ├── GeminiService → gemini-3.1-pro-preview (text streaming)
    ├── ImagenService → Nano Banana (illustrations, concurrent)
    ├── TtsService → gemini-2.5-flash-preview-tts (per page)
    └── PdfService → pdf-lib → Cloud Storage
The interesting part is that text, illustrations, and audio all generate concurrently. Gemini streams the story text; each time an [IMAGE: ...] directive appears in the stream, an illustration job fires immediately without waiting for the story to finish. TTS runs the same way. By the time Gemini finishes generating the last page of text, most of the illustrations and narrations are already done.
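The "fire now, collect later" pattern described above can be sketched in a few lines. This is a minimal illustration, not DreamBook's actual code: `JobTracker`, `PageJob`, and `JobResult` are hypothetical names, and each job is assumed to resolve to a URL string.

```typescript
// Hypothetical job shape: any async task tied to a page (illustration or narration).
type PageJob = {
  pageNumber: number;
  kind: 'image' | 'audio';
  run: () => Promise<string>; // assumed to resolve to a URL
};

type JobResult = {
  pageNumber: number;
  kind: PageJob['kind'];
  url?: string;
  error?: string;
};

class JobTracker {
  private readonly pending: Promise<JobResult>[] = [];

  // Start the job immediately and remember its promise; the caller never awaits here.
  start(job: PageJob): void {
    const settled = job.run().then(
      (url): JobResult => ({ pageNumber: job.pageNumber, kind: job.kind, url }),
      (err: unknown): JobResult => ({
        pageNumber: job.pageNumber,
        kind: job.kind,
        error: err instanceof Error ? err.message : String(err),
      }),
    );
    this.pending.push(settled);
  }

  // Call once the text stream has ended: wait for whatever is still running.
  // Failures are already converted to results, so this never rejects.
  drain(): Promise<JobResult[]> {
    return Promise.all(this.pending);
  }
}
```

Because failures are captured as results instead of rejections, one failed illustration does not take down the rest of the story; the page can simply be rendered without its image.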
The Voice Input Pipeline
This is where the Gemini Live API comes in — and it's genuinely impressive.
The browser captures raw PCM audio from the microphone using an AudioWorklet:
    // Inline AudioWorklet: converts Float32 mic samples to Int16 PCM
    const PCM_PROCESSOR_CODE = `
      class PcmProcessor extends AudioWorkletProcessor {
        process(inputs) {
          const input = inputs[0];
          if (!input || !input[0]) return true;
          const float32 = input[0];
          const int16 = new Int16Array(float32.length);
          for (let i = 0; i < float32.length; i++) {
            const s = Math.max(-1, Math.min(1, float32[i]));
            int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
          }
          this.port.postMessage(int16.buffer, [int16.buffer]);
          return true;
        }
      }
      registerProcessor('pcm-processor', PcmProcessor);
    `;
Those PCM chunks stream over Socket.io to NestJS, which opens a Gemini Live session per user:
    import { Modality, type LiveServerMessage } from '@google/genai';

    const liveSession = await this.ai.live.connect({
      model: 'gemini-2.5-flash-native-audio-preview-12-2025',
      config: {
        responseModalities: [Modality.AUDIO],
        inputAudioTranscription: {}, // enables real-time transcription
        systemInstruction: {
          parts: [{ text: 'You are a transcription assistant. Just transcribe.' }],
        },
      },
      callbacks: {
        // The SDK passes the parsed LiveServerMessage directly, not a MessageEvent wrapper
        onmessage: (message: LiveServerMessage): void => {
          const content = message.serverContent;
          if (content?.inputTranscription?.text) {
            session.transcript += content.inputTranscription.text;
            socket.emit('voice:transcript', { text: session.transcript });
          }
        },
      },
    });
When the user stops speaking, Gemini Flash extracts a structured StoryRequest from the accumulated transcript:
const response = await this.ai.models.generateContent({
model: 'gemini-2.0-flash',
contents: [{
role: 'user',
parts: [{ text: `Extract story params from: "${transcript}".
Return raw JSON only: { childName, childAge, interests, pageCount,
illustrationStyle, language } — optionally lesson and fears.` }],
}],
});
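The model's reply still has to be turned into a typed object, and "raw JSON only" is a request, not a guarantee. Below is a minimal parsing sketch. The field names come from the extraction prompt above; the `StoryRequest` field types, the validation rules, and the helper names `parseStoryRequest` and `toStringArray` are assumptions for illustration.

```typescript
// Field names follow the extraction prompt; the types are assumed.
interface StoryRequest {
  childName: string;
  childAge: number;
  interests: string[];
  pageCount: number;
  illustrationStyle: string;
  language: string;
  lesson?: string;
  fears?: string[];
}

// Accept either an array or a single string and normalise to string[].
function toStringArray(value: unknown): string[] {
  if (Array.isArray(value)) return value.map(String);
  if (typeof value === 'string' && value.trim() !== '') return [value.trim()];
  return [];
}

function parseStoryRequest(raw: string): StoryRequest {
  // Tolerate prose or code-fence wrappers by slicing from the first "{" to the last "}".
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('No JSON object found in model reply');
  const data = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;

  const childName = typeof data.childName === 'string' ? data.childName.trim() : '';
  const childAge = Number(data.childAge);
  const pageCount = Number(data.pageCount);
  if (!childName) throw new Error('childName is missing');
  if (!Number.isFinite(childAge)) throw new Error('childAge is not a number');
  if (!Number.isInteger(pageCount) || pageCount < 1) throw new Error('pageCount is invalid');

  const fears = toStringArray(data.fears);
  return {
    childName,
    childAge,
    interests: toStringArray(data.interests),
    pageCount,
    illustrationStyle: String(data.illustrationStyle ?? ''),
    language: String(data.language ?? ''),
    ...(typeof data.lesson === 'string' ? { lesson: data.lesson } : {}),
    ...(fears.length > 0 ? { fears } : {}),
  };
}
```

Throwing on a missing name or page count lets the voice gateway ask the parent to repeat the request instead of starting a generation run with half the parameters.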
One thing I noticed: the inputTranscription.finished flag doesn't always fire for the native audio model. The fix is to accumulate all transcript chunks continuously rather than waiting for a "final" marker.
The Story Generation Pipeline
This is the core of the app. gemini-3.1-pro-preview generates the story with a specific format requirement:
const prompt = `
Write exactly ${pageCount} pages for a storybook about ${childName}.
After each page, add an image directive:
[IMAGE: <detailed illustration prompt in ${style} style>]
Only narrative text and [IMAGE:] directives. Nothing else.
`;
const streamResult = await this.ai.models.generateContentStream({
model: 'gemini-3.1-pro-preview',
contents: [{ role: 'user', parts: [{ text: prompt }] }],
});
As chunks stream in, I scan for the [IMAGE:] directive pattern:
    const IMAGE_DIRECTIVE_RE = /\[IMAGE:\s*([^\]]+)\]/gi;

    for await (const chunk of streamResult) {
      const candidate = chunk.candidates?.[0];
      const chunkText = candidate?.content?.parts
        ?.map((p) => p.text ?? '').join('') ?? '';
      pageTextBuffer += chunkText;

      const imageMatch = IMAGE_DIRECTIVE_RE.exec(pageTextBuffer);
      if (imageMatch) {
        pageNumber++;
        const imagePrompt = imageMatch[1].trim();
        const narrativeText = pageTextBuffer.slice(0, imageMatch.index).trim();

        // Emit text immediately over SSE
        subject.next({ event: 'page:text', data: { pageNumber, text: narrativeText } });

        // Fire illustration job concurrently (don't await)
        subject.next({ event: 'page:image', data: { pageNumber, imagePrompt } });

        pageTextBuffer = pageTextBuffer.slice(imageMatch.index + imageMatch[0].length);
        IMAGE_DIRECTIVE_RE.lastIndex = 0;
      }
    }
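One edge case in the loop above is a single chunk that contains two complete directives: only the first is handled until the next chunk arrives. A possible variation, sketched here as a pure function, drains every complete directive from the buffer and returns the unfinished tail. `drainPages` and `ParsedPage` are hypothetical names, not part of the original code.

```typescript
interface ParsedPage {
  text: string;
  imagePrompt: string;
}

// Extract every complete "[IMAGE: ...]" directive from the buffer.
// Returns the finished pages plus the leftover text that is still streaming.
function drainPages(buffer: string): { pages: ParsedPage[]; rest: string } {
  // A fresh regex per call avoids carrying lastIndex state between buffers.
  const re = /\[IMAGE:\s*([^\]]+)\]/gi;
  const pages: ParsedPage[] = [];
  let consumed = 0;
  let match: RegExpExecArray | null;

  while ((match = re.exec(buffer)) !== null) {
    pages.push({
      text: buffer.slice(consumed, match.index).trim(),
      imagePrompt: match[1].trim(),
    });
    consumed = match.index + match[0].length;
  }

  // A directive split across chunks has no closing "]" yet, so it stays in rest.
  return { pages, rest: buffer.slice(consumed) };
}
```

In the streaming loop this would be called after each append, assigning `rest` back to the buffer and emitting one `page:text` and one `page:image` event per returned page.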
Important: In @google/genai, streaming chunks don't have a .text() method. I extract the text from candidates[0].content.parts[].text; the SDK also exposes a text getter on each chunk, but it is a property, not a call. This is different from the old @google/generative-ai SDK, where chunk.text() was a method. This bug cost me several hours.
Illustrations with Nano Banana
Nano Banana (gemini-3.1-flash-image-preview) is a huge improvement over the old Imagen 4 Vertex AI REST approach. No access token fetching, no endpoint construction, just the SDK:
const response = await this.ai.models.generateContent({
model: 'gemini-3.1-flash-image-preview',
contents: fullPrompt,
config: {
responseModalities: ['TEXT', 'IMAGE'],
},
});
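    const imagePart = response.candidates?.[0]?.content?.parts
      ?.find((p) => p.inlineData?.mimeType?.startsWith('image/'));
    // The response may contain no image part, so guard before reading it
    if (!imagePart?.inlineData?.data) {
      throw new Error('No image returned for this prompt');
    }
    const imageBase64 = imagePart.inlineData.data;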
The image comes back as base64 inline data. I upload it to Cloud Storage and return a signed URL.
Audio Narration
TTS uses generateContent with audio response modality — no Live API needed here:
const response = await this.ai.models.generateContent({
model: 'gemini-2.5-flash-preview-tts',
contents: [{ role: 'user', parts: [{ text: narratePrompt }] }],
config: {
responseModalities: ['AUDIO'],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: { voiceName: 'Kore' },
},
},
},
});
Gemini TTS returns audio/L16;codec=pcm;rate=24000, which is raw 16-bit PCM at 24 kHz with no container. Browsers can't play raw PCM directly, so I wrap it in a WAV header:
    private pcmToWav(pcm: Buffer, sampleRate = 24000): Buffer {
      const header = Buffer.alloc(44);
      header.write('RIFF', 0);
      header.writeUInt32LE(36 + pcm.length, 4);
      header.write('WAVE', 8);
      header.write('fmt ', 12);
      header.writeUInt32LE(16, 16);             // PCM chunk size
      header.writeUInt16LE(1, 20);              // PCM format
      header.writeUInt16LE(1, 22);              // mono
      header.writeUInt32LE(sampleRate, 24);
      header.writeUInt32LE(sampleRate * 2, 28); // byte rate
      header.writeUInt16LE(2, 32);              // block align
      header.writeUInt16LE(16, 34);             // 16-bit
      header.write('data', 36);
      header.writeUInt32LE(pcm.length, 40);
      return Buffer.concat([header, pcm]);
    }
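A quick way to sanity-check a header like this is to read the fields back and derive the clip duration from them. The sketch below assumes the exact 44-byte layout written above (no extra chunks between "fmt " and "data"); `readWavHeader` and `WavInfo` are hypothetical names. It takes a `Uint8Array`, so a Node `Buffer` can be passed in directly.

```typescript
interface WavInfo {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
  dataBytes: number;
  durationSeconds: number;
}

// Reads back the canonical 44-byte PCM WAV header (little-endian fields).
function readWavHeader(wav: Uint8Array): WavInfo {
  if (wav.byteLength < 44) throw new Error('Buffer is shorter than a 44-byte WAV header');
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (at: number): string =>
    String.fromCharCode(wav[at], wav[at + 1], wav[at + 2], wav[at + 3]);

  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(36) !== 'data') {
    throw new Error('Not a canonical RIFF/WAVE header');
  }

  const channels = view.getUint16(22, true);
  const sampleRate = view.getUint32(24, true);
  const byteRate = view.getUint32(28, true);
  const bitsPerSample = view.getUint16(34, true);
  const dataBytes = view.getUint32(40, true);
  if (byteRate === 0) throw new Error('Byte rate is zero');

  // At 24 kHz, mono, 16-bit, the byte rate is 48,000 bytes per second.
  return { channels, sampleRate, bitsPerSample, dataBytes, durationSeconds: dataBytes / byteRate };
}
```

The derived duration is also handy on the frontend, for example to show a narration length per page before the audio element has loaded its metadata.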
The SSE Streaming Problem
This was the most painful bug of the entire project.
Everything worked perfectly locally. On Cloud Run, the frontend received pings every 15 seconds but zero page events — even though the server logs showed pages being emitted correctly.
The cause: Cloud Run's Google load balancer buffers HTTP/2 responses for compression efficiency. SSE events were being held in a buffer and only released when the buffer filled or the connection closed — which meant the entire story arrived in one batch after generation finished, not page by page.
The fix required three things together:
    // 1. Disable compression entirely
    res.setHeader('Content-Encoding', 'identity');

    // 2. Disable nginx/proxy buffering
    res.setHeader('X-Accel-Buffering', 'no');
    res.setHeader('Cache-Control', 'no-cache, no-store, no-transform');

    // 3. Explicitly flush after EVERY write
    const flush = () => {
      if (typeof (res as any).flush === 'function') {
        (res as any).flush();
      }
    };

    const write = (event: string, data: unknown) => {
      res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);
      flush(); // ← this is the critical one
    };
None of these was sufficient on its own. All three together fixed it.
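On the browser side, a fetch()-based reader receives arbitrary byte chunks, so frames have to be reassembled before they can be dispatched. The following is a minimal sketch of that step, assuming the frame format written by the server code above (an `event:` line, a JSON `data:` line, a blank line, and `\n` line endings). `parseSseFrames` and `SseEvent` are hypothetical names.

```typescript
interface SseEvent {
  event: string;
  data: unknown;
}

// Split a text buffer into complete SSE frames and return the unfinished tail.
function parseSseFrames(buffer: string): { events: SseEvent[]; rest: string } {
  const events: SseEvent[] = [];
  const frames = buffer.split('\n\n');
  // The last piece is either empty or an incomplete frame; keep it for the next chunk.
  const rest = frames.pop() ?? '';

  for (const frame of frames) {
    let event = 'message'; // default event name when no "event:" line is present
    const dataLines: string[] = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
      // Lines starting with ":" are comments (often keep-alives) and are skipped.
    }
    if (dataLines.length === 0) continue;
    // Assumption: every data payload is JSON, as in the server's write() helper.
    events.push({ event, data: JSON.parse(dataLines.join('\n')) });
  }
  return { events, rest };
}
```

A reader loop would decode each chunk with a streaming `TextDecoder`, append it to the leftover `rest`, and call this function again, which also makes it easy to log exactly when each page event arrives when diagnosing buffering.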
The GCS Signed URL Problem
On Cloud Run, getSignedUrl() from @google-cloud/storage throws:
Permission 'iam.serviceAccounts.signBlob' denied
This doesn't happen locally because the service account JSON key file contains a private key, so the client signs URLs itself. On Cloud Run there is no key file, so the client has to call the IAM signBlob API instead, and the compute service account needs explicit permission to do that:
    gcloud projects add-iam-policy-binding YOUR_PROJECT \
      --member="serviceAccount:YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
      --role="roles/iam.serviceAccountTokenCreator"
You also need to tell the GCS client which service account to use when signing. I auto-fetch this from the GCP metadata server:
const res = await fetch(
'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email',
{ headers: { 'Metadata-Flavor': 'Google' } },
);
this.serviceAccountEmail = await res.text();
Then pass it to getSignedUrl():
const [url] = await this.bucket.file(gcsPath).getSignedUrl({
action: 'read',
expires: Date.now() + 60 * 60 * 1000,
serviceAccountEmail: this.serviceAccountEmail,
});
The Firebase Auth Race Condition
On Vercel production, every API call on first page load returned 401 Not authenticated — even for logged-in users.
Firebase's auth.currentUser is null on first render, even when a session exists in localStorage. Firebase needs a tick to rehydrate. The onAuthStateChanged listener is the right way to wait for it, but there's a subtlety: on first load, Firebase emits null immediately before emitting the real user. If you reject on the first emission, you'll always get 401.
    import type { User } from 'firebase/auth';

    function waitForAuth(): Promise<User> {
      return new Promise((resolve, reject) => {
        if (auth.currentUser) {
          resolve(auth.currentUser);
          return;
        }

        let settled = false;

        // Wait for a definitive state; ignore the initial null
        const unsub = auth.onAuthStateChanged((user) => {
          if (settled) return;
          if (user) {
            settled = true;
            unsub();
            resolve(user);
          }
          // If null, Firebase is still initialising, so keep waiting
        });

        // 8-second timeout
        setTimeout(() => {
          if (!settled) {
            settled = true;
            unsub();
            auth.currentUser ? resolve(auth.currentUser) : reject(new Error('Not authenticated'));
          }
        }, 8000);
      });
    }
The Secret With a Newline
After fixing auth, every token verification still returned 401 with this message:
Firebase ID token has incorrect "aud" claim.
Expected "live-agent-challenge-489310\n" but got "live-agent-challenge-489310"
That \n at the end of Expected is a literal newline character embedded in the secret. When I created the secret in Secret Manager, I used echo without the -n flag:
# WRONG — adds a trailing newline
echo "live-agent-challenge-489310" | gcloud secrets create FIREBASE_PROJECT_ID --data-file=-
# CORRECT — no newline
echo -n "live-agent-challenge-489310" | gcloud secrets create FIREBASE_PROJECT_ID --data-file=-
Fix: add a new version of the secret with the correct value and redeploy.
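As an optional second line of defense, the service can trim configuration values when it reads them, so a stray newline in a secret fails loudly or is ignored instead of surfacing as a confusing "aud" mismatch. This is a sketch of that idea, not what DreamBook does; `readRequiredEnv` is a hypothetical helper, and the environment is passed in as a plain record so the function stays easy to test.

```typescript
// Read a required config value, stripping surrounding whitespace such as a trailing "\n".
function readRequiredEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable ${name}`);
  }
  return value;
}

// Typical call site in a Node service (assumes the secret is mounted as an env var):
//   const projectId = readRequiredEnv('FIREBASE_PROJECT_ID', process.env);
```

Trimming at the boundary does not replace fixing the secret itself, since other consumers of the same secret would still see the newline.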
React StrictMode Double-Firing
In development, React StrictMode mounts each component, unmounts it, and mounts it again to surface unsafe side effects, so every effect runs twice. This caused the story generation pipeline to fire twice: two competing Gemini calls, two sets of concurrent illustration calls hitting rate limits, and random pages missing illustrations.
The fix was using a ref instead of state to guard the pipeline:
    // ❌ A state flag set inside the effect is not yet visible when StrictMode re-runs the effect
    const [streamStarted, setStreamStarted] = useState(false);

    // ✅ A ref is mutated synchronously and is preserved across the StrictMode re-run
    const hasStarted = useRef(false);

    useEffect(() => {
      if (!storyId || hasStarted.current) return;
      hasStarted.current = true;
      startStream(storyId);
    }, [storyId]);
I also added a server-side guard using a Set of active pipeline IDs — if a pipeline is already running for a storyId, subsequent requests complete immediately without starting another generation.
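The server-side guard can be sketched as a small class around a Set. This is an illustration under assumptions rather than the actual service: `PipelineGuard`, `tryStart`, `finish`, and `runOnce` are hypothetical names, and the state lives in memory, so it only deduplicates requests that reach the same instance.

```typescript
class PipelineGuard {
  private readonly active = new Set<string>();

  // Returns true if this caller acquired the slot, false if a run is already active.
  tryStart(storyId: string): boolean {
    if (this.active.has(storyId)) return false;
    this.active.add(storyId);
    return true;
  }

  // Release the slot once the pipeline has finished or failed.
  finish(storyId: string): void {
    this.active.delete(storyId);
  }

  // Run the pipeline only if no run is active for this story.
  // Resolves to undefined for duplicate requests.
  async runOnce<T>(storyId: string, pipeline: () => Promise<T>): Promise<T | undefined> {
    if (!this.tryStart(storyId)) return undefined;
    try {
      return await pipeline();
    } finally {
      this.finish(storyId);
    }
  }
}
```

The check and the add happen synchronously in `tryStart`, so two requests arriving in the same event-loop turn cannot both acquire the slot. The `finally` block ensures a failed generation does not leave the story locked.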
Deploying to Cloud Run
The setup that works:
- Memory: 2 GiB — concurrent TTS + Imagen + PDF generation needs headroom
- CPU: 2 — parallel processing per story
- Request timeout: 3600 seconds — SSE stream stays open during full generation (default 300s kills it mid-story)
- Session affinity: enabled — required for stateful WebSocket voice sessions
- Execution environment: 2nd gen — better network performance for streaming
- CI/CD: GitHub repo → Cloud Build → Cloud Run on every push to main
The most important setting people miss is the request timeout. 300 seconds sounds like a lot until you have an 8-page story with TTS and illustrations running concurrently.
What I'd Do Differently
Start with @google/genai from day one. Migrating from the deprecated @google/generative-ai mid-project cost real time. The streaming API is different, the response shape is different, and assuming parity between the two is a mistake.
Test SSE streaming through a reverse proxy early. I spent hours debugging a problem that only existed in production because I didn't simulate the Cloud Run load balancer locally. ngrok with compression enabled would have caught the buffering issue much earlier.
Use structured logging from the start. The Cloud Logging queries I ran to diagnose production issues (resource.type="cloud_run_revision" AND resource.labels.service_name="dream-book-api") only worked because I had consistent log formatting throughout. Good logging is not optional for cloud deployments.
Try It
🌐 Live: https://dream-book-web.vercel.app
📦 Backend repo: https://github.com/Talha-Tahir2001/dream-book-api
📦 Frontend repo: https://github.com/Talha-Tahir2001/dream-book-web
Tags
#googlecloud #gemini #nestjs #nextjs #typescript #ai #hackathon #webdev
Built for the Gemini Live Agent Hackathon 2026. If you have questions about any part of the implementation, drop them in the comments — happy to dig into the details.