I created this blog post to detail my project for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem: History Is Everywhere, But It's Buried
Drop a pin anywhere on a map. There's a story there.
The field where a decisive battle was fought. The street corner where a movement began. The mountain that an entire civilization built its mythology around. The port city that connected two worlds for five hundred years.
Most of us walk past these places — or scroll past them on a map — without any idea what happened there. And when we do get curious, we hit a wall: Wikipedia gives us a list of facts. Search gives us ten blue links. None of it feels like a story. None of it makes you feel the weight of what happened.
We built Tapestry to fix that.
Tapestry is a location-based historical research agent. You click anywhere on a globe, and Tapestry generates a rich, multi-layered historical narrative about that place — complete with AI-generated imagery, a narrated audio track, a chronological timeline, and four distinct ways to experience the story. It's not a search engine. It's a storyteller.
The Core Insight: History Has a Shape
Before we wrote a single line of code, we spent time thinking about what makes a great historical narrative. Not a Wikipedia article — a story.
Great historical storytelling has a consistent shape:
- An opening — set the scene, establish the place, make the reader feel present
- Origins and discovery — where did this place come from? Who was here first?
- Pivotal moments — the events that changed everything
- The human layer — the people, not just the events. The lives lived here.
- The present day — what remains? What was lost? What does it mean now?
- A closing — bring it home. Leave the reader with something to carry.
This six-act structure became the backbone of Tapestry's research schema. Every piece of content the AI generates maps to one of these stages. The result is a narrative that has rhythm and arc — not just a dump of historical facts.
The Architecture: A Research Pipeline That Thinks Like a Writer
Tapestry's backend is a multi-stage pipeline, each stage building on the last:
Stage 1: Deep Research
When a user selects a location, we kick off a parallel research sweep using Tavily (web search) and Gemini (synthesis). Tavily pulls current, grounded web sources. Gemini synthesizes them into structured research notes organized by our six-act schema.
This two-step approach is deliberate. We don't ask Gemini to invent history — we ask it to synthesize from real sources. Every claim is grounded in web-retrieved content. The result is a ResearchOutput object: a typed schema with sections, timeline entries, location coordinates, and source citations.
```typescript
// The research schema that drives everything
interface ResearchSection {
  stage: 'opening' | 'discovery' | 'key_events' | 'human_layer' | 'today' | 'closing';
  title: string;
  blocks: ContentBlock[]; // text, image, map, diagram
}
```
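The retrieve-then-synthesize flow can be sketched as a small function with the search and synthesis calls injected as parameters. The function and type names here are illustrative, not the project's actual API — `searchFn` and `synthesizeFn` stand in for the Tavily and Gemini clients:

```typescript
// Hypothetical shape of a web-retrieved source
interface WebSource {
  url: string;
  title: string;
  content: string;
}

// Retrieve real sources first, then ask the model to synthesize from them.
// If retrieval comes back empty, refuse to generate rather than let the
// model fall back on training data alone.
async function groundedResearch(
  location: string,
  searchFn: (query: string) => Promise<WebSource[]>,
  synthesizeFn: (location: string, sources: WebSource[]) => Promise<string>
): Promise<{ notes: string; sources: WebSource[] }> {
  const sources = await searchFn(`history of ${location}`);
  if (sources.length === 0) {
    throw new Error(`No sources found for ${location}; refusing to generate ungrounded content`);
  }
  const notes = await synthesizeFn(location, sources);
  return { notes, sources };
}
```

Injecting the clients as parameters also makes the grounding step trivially testable with stubs.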
Stage 2: Visual Storytelling (Interleaved Output)
This is where Tapestry goes beyond research. Once the text narrative is complete, a second Gemini pass generates the Visual Story — an interleaved sequence of narration passages and AI-generated image prompts, woven together into a single cinematic flow.
The key word is interleaved. We're not generating text and then separately generating images. We're generating a unified narrative where text and imagery are designed together — each image prompt is written to complement the passage it follows, creating a coherent visual arc across the whole story.
```typescript
// Interleaved output structure
type InterleavedPart =
  | { type: 'text'; text: string }      // narration passage
  | { type: 'image'; imageUrl: string } // AI-generated illustration
```
The image generation uses Google Imagen via Vertex AI. Each prompt is crafted by Gemini to match the tone and period of the narrative — a battle scene gets dramatic lighting, a founding moment gets a sense of dawn and possibility, a human story gets intimacy and warmth.
Stage 3: Audio Narration (Streaming TTS)
The text narrative doesn't just sit on the page — it speaks.
Every research output can be narrated using Google Cloud Text-to-Speech with the en-US-Journey-D voice — a natural, documentary-style voice tuned with a slightly deeper pitch (-1.0 semitones) and a measured speaking rate (0.95x) that fits long-form historical content. This is a deliberate choice: the voice sounds like a documentary narrator, not a virtual assistant.
We stream the audio in chunks: the text is split into ~50-word segments at sentence boundaries, each synthesized independently and sent to the client as a Server-Sent Event. The client starts playing the first chunk while the rest are still being generated. The result is near-instant audio playback — users hear the narration within seconds of clicking "Listen", with no waiting for the full script to synthesize.
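The chunking step can be sketched as follows — a minimal version that splits on sentence-ending punctuation and packs sentences into segments of roughly `maxWords` words (the real splitter may handle abbreviations and edge cases this naive regex does not):

```typescript
// Split narration text into ~maxWords-word chunks, breaking only at
// sentence boundaries so each TTS request receives complete sentences.
function chunkAtSentences(text: string, maxWords = 50): string[] {
  // Naive sentence split: ., !, or ? followed by whitespace or end of text.
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = (current + sentence).trim();
    if (current && candidate.split(/\s+/).length > maxWords) {
      // Adding this sentence would overflow the chunk: flush and start fresh.
      chunks.push(current.trim());
      current = sentence;
    } else {
      current = candidate + " ";
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each returned chunk is then synthesized independently and streamed to the client in order.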
The audio layer is what completes the interleaved experience. The Visual Story gives you text and images woven together. The narration adds a third dimension — you can read along, look at the imagery, and listen simultaneously. All three modalities telling the same story at the same time.
Stage 4: Persistence and Sharing
Everything is persisted to MongoDB: the full research output, the interleaved narrative parts, hero images, sources, and metadata. Users can return to any past research through a history sidebar. Any research can be shared via a public link with a unique token — the share page reconstructs the full experience from the stored data.
The Four Display Modes: One Story, Four Experiences
This is the part of Tapestry we're most proud of. The same historical content can be experienced in four completely different ways, each with its own visual identity and interaction model.
1. Ancient Scroll — Documentary Scroll
The default mode. The content unrolls like an ancient parchment scroll, complete with wooden rollers at the top and bottom, aged paper texture, SVG displacement filters for the wavy torn-edge effect, and parallax stain overlays that move as you scroll. Drop caps on opening paragraphs. Ornamental flourishes between sections. A ❧ divider that feels like it belongs in a medieval manuscript.
This mode is designed for reading — long-form, immersive, unhurried.
2. Stone Tablet — Timeline Explorer
A vertical chronological timeline rendered as carved stone. Dark granite background (#2a2218), chiseled text with textShadow drop and highlight, Roman numerals alongside year labels, and — the detail we spent the most time on — genuinely rough-hewn edges.
The edges use multi-layer clip-path polygons with dozens of irregular vertices, an SVG displacement filter applied on top, and a fracture highlight line at the break. The result looks like someone actually chipped this out of rock, not like a CSS border-radius.
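A simplified sketch of how such an irregular polygon could be generated — the actual tablet uses hand-tuned vertices plus the SVG displacement filter, but the idea is points marching along each edge of the box with deterministic perpendicular jitter:

```typescript
// Generate an irregular clip-path polygon for a "rough-hewn" edge.
// The jitter is pseudo-random but deterministic, so the shape is
// stable across renders (a seeded PRNG would work equally well).
function roughEdgePolygon(steps: number, amplitude: number): string {
  const jitter = (i: number) => Math.abs(Math.sin(i * 12.9898)) * amplitude;
  const pts: string[] = [];
  // Top edge, left to right: jitter pushes points down from y = 0
  for (let i = 0; i <= steps; i++) pts.push(`${(i / steps) * 100}% ${jitter(i)}%`);
  // Right edge, top to bottom
  for (let i = 1; i <= steps; i++) pts.push(`${100 - jitter(i + steps)}% ${(i / steps) * 100}%`);
  // Bottom edge, right to left
  for (let i = 1; i <= steps; i++) pts.push(`${100 - (i / steps) * 100}% ${100 - jitter(i + 2 * steps)}%`);
  // Left edge, bottom to top (stop before closing the loop)
  for (let i = 1; i < steps; i++) pts.push(`${jitter(i + 3 * steps)}% ${100 - (i / steps) * 100}%`);
  return `polygon(${pts.join(", ")})`;
}
```

With `steps = 10` this yields a 40-vertex polygon, in the same spirit as the tablet's 40-plus-vertex edges.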
Events animate into view with whileInView as you scroll down, with alternating left/right layout on desktop.
3. Celluloid Film — Visual Story
The interleaved narrative rendered as a vintage film strip. Warm amber-brown base (#2a1f0e), translucent sprocket holes with inset shadows, sepia-toned text, and a film grain SVG filter layered over everything.
Text passages are styled as screenplay intertitles with scene numbers. Images are styled as projected slides with light leak, vignette, and slight desaturation — like frames from an old documentary. A parallax grain overlay moves at a different rate than the content as you scroll.
This mode only activates when the Visual Story has been generated. If it hasn't, it falls back gracefully to a condensed text view.
4. Hardcover Book — Flipbook
The most technically complex mode. The content is split into pages — title page, one page per section, a timeline page, a sources page — and presented as an open hardcover book.
The left side is a leather spine with gold ornaments and the title printed vertically in writing-mode: vertical-rl. The right side is a cream page with paper grain texture, running headers, and corner page numbers.
The page-turn animation uses two stacked layers: the destination page sits underneath, and the current page folds away using rotateY from the right edge (or left edge for backward navigation), revealing the next page beneath. The back face of the folding page shows slightly darker paper — the reverse side of the leaf. A pulsing corner triangle hints that the page is turnable.
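The pagination step above can be sketched as a pure function from research content to an ordered page list. The field names here are illustrative rather than the exact project schema:

```typescript
interface BookPage {
  kind: 'title' | 'section' | 'timeline' | 'sources';
  heading: string;
}

// Split a research output into flipbook pages: title page first,
// one page per section, then the timeline and sources pages.
function paginate(title: string, sectionTitles: string[]): BookPage[] {
  return [
    { kind: 'title', heading: title },
    ...sectionTitles.map((t): BookPage => ({ kind: 'section', heading: t })),
    { kind: 'timeline', heading: 'Timeline' },
    { kind: 'sources', heading: 'Sources' },
  ];
}
```

Because the page list is derived from the same typed schema as every other mode, adding a page type means touching only this function, not the research pipeline.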
The Interleaved Output: Text + Image + Audio as One
The hackathon's Creative Storyteller category asks for agents that "think and create like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream."
Tapestry's answer is three modalities working together — not sequentially, not independently, but as a unified experience.
Text + Image: Designed Together
The key is that Gemini doesn't generate text and images separately. It generates a unified creative brief — a sequence of narrative beats where each beat is either a passage of prose or an image that illustrates it. The image prompts are written with the surrounding text in mind. The prose is written knowing an image will follow. They're designed together.
Here's a simplified version of the prompt structure:
You are a documentary filmmaker and writer. Generate an interleaved narrative
about [location] that alternates between:
- Narration passages (2-3 sentences, cinematic, present-tense)
- Image descriptions (vivid, specific, painterly — describe what the camera sees)
The narrative should follow this arc: [six-act structure]
Each image should visually complement the passage before it.
Output as a JSON array of { type: "text"|"image", content: string } objects.
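On the receiving end, the model's JSON output has to be validated before it can drive the Visual Story. A minimal sketch of that parse-and-validate step (the real pipeline rejects and retries on failure):

```typescript
interface InterleavedBeat {
  type: 'text' | 'image';
  content: string;
}

// Parse the model's raw JSON output. Anything that isn't an array of
// well-formed { type, content } beats throws, so the caller can retry
// rather than render malformed content.
function parseInterleaved(raw: string): InterleavedBeat[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error('expected a JSON array');
  return parsed.map((beat, i) => {
    const ok =
      typeof beat === 'object' && beat !== null &&
      (beat.type === 'text' || beat.type === 'image') &&
      typeof beat.content === 'string';
    if (!ok) throw new Error(`malformed beat at index ${i}`);
    return beat as InterleavedBeat;
  });
}
```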
The result is a story where you read a passage about the moment a city was founded, then see an AI-generated image of that dawn — torches, mud walls, the first stones being laid. Then you read about the first traders arriving, and see the harbor as it might have looked. Text and image, woven together, each making the other more powerful.
Audio: The Third Pillar
The visual story doesn't just read and show — it speaks.
Hit the narration button and the entire research output is converted to audio using Google Cloud Text-to-Speech with the en-US-Journey-D voice — a natural, documentary-style voice with a slightly deeper pitch and measured pace, tuned specifically for long-form narration. This isn't a robotic screen reader. It sounds like a BBC documentary.
The audio is synchronized with the written narrative. As the voice reads through the opening, you're looking at the same prose on screen. As it moves into the pivotal moments section, the imagery from that section is in view. The three modalities — prose, image, voice — are all telling the same story at the same time, each reinforcing the others.
Critically, the audio streams in chunks rather than waiting for the full narration to synthesize. The text is split into ~50-word segments at sentence boundaries, each synthesized independently and sent to the client as a Server-Sent Event. Playback starts within seconds of clicking the button — the first chunk plays while the rest are still being generated. This is what makes it feel live rather than like a file download.
User clicks "Listen" →
Chunk 1 synthesized → sent → starts playing immediately
Chunk 2 synthesized → queued → plays when chunk 1 ends
Chunk 3 synthesized → queued → ...
[audio plays continuously, seamlessly, while generation is still running]
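The queue behavior above can be sketched as a small async queue: the producer enqueues chunks as they are synthesized, and the consumer plays them strictly in order, waiting whenever it catches up with generation. This is a simplified illustration, not the project's actual player code:

```typescript
// Producer/consumer queue: enqueue never blocks, dequeue awaits the
// next chunk if the consumer has caught up with generation.
class ChunkQueue<T> {
  private items: T[] = [];
  private waiters: ((item: T) => void)[] = [];

  enqueue(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);   // a consumer is already waiting for this chunk
    else this.items.push(item); // buffer until the consumer asks for it
  }

  dequeue(): Promise<T> {
    const item = this.items.shift();
    if (item !== undefined) return Promise.resolve(item);
    return new Promise((resolve) => this.waiters.push(resolve));
  }
}
```

The consumer simply loops `await queue.dequeue()` and plays each chunk, so playback is continuous even while synthesis is still running.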
The result is a genuinely immersive experience: you can read along, look at the images, and listen to the narration simultaneously. That's text, image, and audio interleaved — not three separate features, but one coherent story told across three senses.
Google Cloud Services: The Full Stack
| Service | Role |
|---|---|
| Gemini 2.0 Flash | Research synthesis, narrative generation, interleaved storytelling |
| Google Imagen (Vertex AI) | AI image generation for the Visual Story mode |
| Google Cloud Text-to-Speech | Streaming audio narration (en-US-Journey-D) |
| Google Cloud Translate | Full research translation into 15+ languages |
| Google Maps Embed API | Interactive maps embedded in content blocks |
| Google Cloud Storage | Hero image and generated image hosting |
| MongoDB Atlas | Research persistence, history, share tokens |
| Next.js on Cloud Run | Frontend and API routes, server-side rendering |
The pipeline is entirely Google Cloud native. Research, synthesis, image generation, audio, translation — every AI capability runs through a Google service. This isn't a wrapper around OpenAI with a Google Maps widget bolted on. It's built on the Google AI stack from the ground up.
Key Technical Decisions
Streaming Everything
Tapestry streams at every layer. The research pipeline sends Server-Sent Events as each stage completes — the client shows a live pipeline progress indicator as research, storytelling, and image generation happen in sequence. The TTS narration streams chunk by chunk so audio starts playing within seconds. The user never stares at a loading spinner waiting for everything to finish.
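Each progress update is one Server-Sent Event frame, which is just a small text format over a kept-open HTTP response. A sketch of the framing (event and payload names here are illustrative):

```typescript
// Format one pipeline progress update as a Server-Sent Event frame.
// SSE frames are "event:" and "data:" lines terminated by a blank line.
function sseFrame(event: string, data: object): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The client subscribes with `EventSource` (or a fetch-based reader) and updates the pipeline progress indicator as each frame arrives.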
Typed Research Schema
The ResearchOutput schema is the contract between the AI and the UI. Every display mode reads from the same typed structure. This means we can add a new display mode without touching the research pipeline, and improve the research pipeline without touching the display modes. The schema enforces that every section has a stage, every block has a type, every source has a url.
Grounded Research — No Hallucinations by Design
This is the most important reliability decision we made, and it directly addresses the judging criterion: "Does the agent avoid hallucinations? Is there evidence of grounding?"
The answer is structural, not just a prompt instruction.
Step 1: Retrieve before generating. Before Gemini writes a single sentence, Tavily fetches real web sources for the location — Wikipedia articles, historical society pages, academic references, news archives. These are passed directly into the Gemini prompt as source material.
Step 2: Synthesize, don't invent. The research prompt explicitly instructs Gemini to synthesize from the provided sources, not to draw on training data alone. If a claim isn't supported by the retrieved sources, it shouldn't appear in the narrative.
Step 3: Cite everything. Every source retrieved by Tavily is stored in the ResearchOutput alongside a grounded: boolean flag. Sources that were directly retrieved from the web are marked grounded: true. These citations appear in the Sources page of the Flipbook and the Sources section of the Scroll — visible to the user, not hidden in metadata.
Step 4: Structured output enforces coherence. Gemini outputs to a strict JSON schema. If the output doesn't conform — wrong stage names, missing required fields, malformed blocks — the pipeline rejects it and retries. The schema is the guardrail. A hallucinated "stage" that doesn't exist in the schema simply can't make it into the final output.
The result: Tapestry's narratives are grounded in real sources, cited transparently, and structurally validated before they ever reach the user. It's not foolproof — no LLM system is — but the architecture makes hallucination significantly harder than a naive "ask Gemini about this place" approach.
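The structural guardrail in Step 4 can be sketched as a simple stage check against the six-act schema — a section whose `stage` isn't in the allowed set is rejected before it can reach the UI, triggering a retry upstream:

```typescript
// The six valid stages from the research schema.
const VALID_STAGES = ['opening', 'discovery', 'key_events', 'human_layer', 'today', 'closing'] as const;
type Stage = typeof VALID_STAGES[number];

// Reject any section whose stage is not part of the schema.
function validateStages(sections: { stage: string }[]): Stage[] {
  return sections.map((s) => {
    if (!(VALID_STAGES as readonly string[]).includes(s.stage)) {
      throw new Error(`unknown stage "${s.stage}": rejecting output for retry`);
    }
    return s.stage as Stage;
  });
}
```

The full validator also checks block types and required fields; this sketch shows only the stage-name guardrail.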
Dark Mode Throughout
Every display mode has a carefully designed dark variant. The parchment scroll becomes deep amber-on-dark. The stone tablet stays dark but the chiseled text brightens. The film strip deepens to near-black with warm amber accents. The flipbook pages go to a warm near-black (#1e1c18). Dark mode isn't an afterthought — it's a first-class experience.
Challenges We Faced
Interleaved output consistency: Getting Gemini to reliably output a well-structured interleaved JSON array — with the right balance of text and image beats, and image prompts that actually match the surrounding prose — required significant prompt engineering. The final prompt includes explicit examples, length constraints per beat, and a reminder that image prompts should describe what the camera sees, not abstract concepts.
Image generation latency: Imagen takes 3-8 seconds per image. With 6-10 images per story, sequential generation would take a minute. We generate images in parallel where possible and stream them to the client as they complete, so the Visual Story mode populates progressively rather than all at once.
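Bounded parallel generation can be sketched with a worker-pool pattern: run up to `limit` jobs at once, each worker pulling the next index when it finishes. The `job` parameter stands in for the Imagen call (the real pipeline also streams each result to the client as it completes):

```typescript
// Run jobs with bounded concurrency: at most `limit` are in flight at
// any moment, and results land at their original indices.
async function parallelMap<T, R>(
  items: T[],
  limit: number,
  job: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop: no race on next
      results[i] = await job(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With 6-10 image prompts and a limit matched to the API's quota, wall-clock time drops from the sum of the latencies to roughly the slowest batch.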
Page-fold animation: The Flipbook's page-turn effect went through three complete rewrites. The first used react-pageflip (which takes over the DOM with absolute positioning and breaks Tailwind's fluid layout). The second used framer-motion AnimatePresence with rotateY (which looked like a slide, not a fold). The final version uses two stacked layers — destination page underneath, current page folding away on top — with backfaceVisibility: hidden on both faces so you see the front going away and the back briefly before it's gone. That's what makes it feel like a real page turning.
TTS streaming state machine: The audio player has four states (idle, loading, playing, paused) and a chunk queue that fills asynchronously while playback is already running. The original implementation had a stale closure bug — playNextChunk was defined with empty useCallback deps, so it captured a frozen reference to the queue. We fixed it by storing the function in a ref that's updated every render, so async callbacks always call the current implementation.
Stone tablet edges: Getting the timeline's edges to look genuinely rough — not like a CSS clip-path — required two overlapping clip-path polygon layers (outer chip + inner fracture), an SVG displacement filter applied on top, and a fracture highlight line at the break. The final polygon has 40+ vertices per edge.
What We Learned
The schema is the product. The ResearchOutput schema — six stages, typed blocks, timeline entries — is what makes Tapestry work. Every display mode, every AI prompt, every database query is built around it. Getting the schema right early saved us from a dozen refactors later. The schema is also what makes grounding tractable: when the output is structured, you can validate it. When you can validate it, you can reject malformed or hallucinated content before it reaches the user.
True interleaved output is qualitatively different from "generate text, then add images." When Gemini writes the prose and the image prompts in the same pass — knowing that an image will follow each passage — the result is a coherent visual arc, not an illustrated article. The images feel like they were directed, not retrieved. That's the difference between a creative director and a search engine.
Audio completes the triad. Text and images alone are a rich experience. Add narration and it becomes something else — something that doesn't require active reading, that you can close your eyes and listen to. The en-US-Journey-D voice with its documentary cadence turns a history article into something that feels like it belongs on a streaming platform. The three modalities together are more than the sum of their parts.
Grounding is an architecture decision, not a prompt instruction. You can tell Gemini "don't make things up" in the system prompt. That helps. But what actually prevents hallucinations is retrieving real sources first, passing them as context, enforcing a strict output schema, and citing everything. The prompt instruction is the last line of defense. The architecture is the first.
Visual identity matters for immersion. We could have made all four display modes look the same — a clean card with a white background. Instead, each mode has its own material metaphor: parchment, stone, celluloid, paper. Users notice. The stone tablet feels ancient. The film strip feels like a documentary. The aesthetic isn't decoration — it's part of the storytelling. A story about ancient Rome told on a stone tablet lands differently than the same story in a sans-serif card.
Streaming changes the experience. When the research pipeline streams its progress in real time ("Researching sources... Generating narrative... Creating visual story... Generating images..."), users feel like they're watching something being made. It's more engaging than a spinner. It builds anticipation. And when the audio starts playing within two seconds of clicking "Listen" — before the full narration has even finished synthesizing — it feels instant. Latency is a UX problem as much as a technical one.
Try It Yourself
- GitHub: https://github.com/SarthakRawat-1/tapestry
- Demo Video: https://vimeo.com/1174045515?share=copy&fl=sv&fe=ci
Built with Gemini 2.0 Flash, Google Imagen, Google Cloud Text-to-Speech, Google Cloud Translate, Google Maps, Google Cloud Storage, Next.js, Framer Motion, MongoDB, and Tailwind CSS — all running on Google Cloud.
Created for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge