This article was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
There's a version of photo editing where you don't touch a single slider. You click the part of the image you want to change, say what you want out loud, and watch it happen. That's Say Edit.
Over the past few weeks I built Say Edit — a voice-first AI workspace that lets you edit images and navigate documents entirely by speaking. It's powered by the Gemini Live API and Gemini image generation, and deployed on Google Cloud Run. This article is the behind-the-scenes of how I built it, what broke, and what surprised me.
The Core Idea
Most AI tools make you type. You open a chat window, describe what you want, wait for a response, copy it somewhere, repeat. I wanted to eliminate every one of those steps for two use cases I found genuinely painful:
- Editing a photo — you know exactly what you want to change, but you have to hunt through menus, masks, and sliders to get there.
- Reading a dense document — you have a question, but finding the exact passage means scrolling through 80 pages.
The answer to both is the same: a persistent voice session that listens continuously, understands your intent, and acts immediately.
The Voice Loop — Gemini Live API
The foundation of Say Edit is a bi-directional WebSocket connection to the Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025). This is not a push-to-talk button: the model listens the entire time you're in the workspace.
Here's what the audio pipeline looks like on the browser side:
const source = audioCtx.createMediaStreamSource(stream);
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioCtx.destination);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  // Float32 → Int16 → base64
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
  }
  session.sendRealtimeInput({
    media: {
      data: btoa(String.fromCharCode(...new Uint8Array(int16.buffer))),
      mimeType: 'audio/pcm;rate=16000',
    },
  });
};
16kHz PCM, streamed continuously. The model responds with audio at 24kHz, which I queue through AudioBufferSourceNode for gapless playback.
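The playback side is the inverse conversion: the model's audio arrives as base64-encoded 16-bit PCM and has to become Float32 samples before it can fill an AudioBuffer. A minimal sketch of that decode step (the function name is mine, not from the Say Edit codebase):

```typescript
// Decode base64-encoded 16-bit PCM (as the Live API sends it) into
// Float32 samples in [-1, 1), ready to copy into an AudioBuffer.
function pcm16ToFloat32(base64: string): Float32Array {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  const int16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768; // undo the encode-side scaling
  }
  return float32;
}
```

Each decoded chunk becomes one AudioBuffer at 24kHz; scheduling the buffers back-to-back against the AudioContext clock is what makes the playback gapless.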
The part that changed how I think about this kind of app: interruption. When you speak while the model is responding, the Live API sends serverContent.interrupted. I clear the audio queue instantly and the model re-listens. No waiting. That single behavior makes the whole thing feel like talking to a person rather than waiting for a request.
Image Editing — Hotspot + Voice + Gemini Image Generation
The image editor works in three steps:
1. Click anywhere on the image. The frontend records the pixel coordinates (x, y) as a "hotspot" and renders a crosshair.
2. Speak your edit. The Gemini Live session calls the edit_image_region tool with your spoken instruction and the hotspot coordinates.
3. Watch the edit apply. The current image is sent as base64 to gemini-2.0-flash-exp-image-generation alongside the prompt and coordinates. The result replaces the current frame in the history stack.
// Tool definition registered with the Live session
{
  name: 'edit_image_region',
  description: 'Edit a specific region of the current image.',
  parameters: {
    type: Type.OBJECT,
    properties: {
      edit_prompt: { type: Type.STRING },
      hotspot_x: { type: Type.NUMBER },
      hotspot_y: { type: Type.NUMBER },
    },
    required: ['edit_prompt', 'hotspot_x', 'hotspot_y'],
  },
}
The edit call to Gemini image generation:
const response = await ai.models.generateContent({
  model: 'gemini-2.0-flash-exp-image-generation',
  contents: [
    { inlineData: { mimeType: file.type, data: base64 } },
    { text: `Edit: "${editPrompt}". Focus around pixel (x: ${x}, y: ${y}). Return ONLY the edited image.` },
  ],
  config: { responseModalities: ['IMAGE', 'TEXT'] },
});
Every edit is non-destructive — pushed onto a history stack with full undo/redo and a hold-to-compare button that fades back to the original.
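The stack itself is little more than an array and an index. A simplified sketch (not the exact Say Edit implementation): a new edit truncates any redo branch before pushing, and frame zero backs the hold-to-compare button.

```typescript
// Non-destructive edit history: push discards the redo branch,
// undo/redo just move the index, and frame 0 is the original image.
class EditHistory<T> {
  private frames: T[];
  private index = 0;

  constructor(initial: T) {
    this.frames = [initial];
  }

  push(frame: T): void {
    this.frames = this.frames.slice(0, this.index + 1);
    this.frames.push(frame);
    this.index = this.frames.length - 1;
  }

  undo(): T {
    if (this.index > 0) this.index--;
    return this.current;
  }

  redo(): T {
    if (this.index < this.frames.length - 1) this.index++;
    return this.current;
  }

  get current(): T {
    return this.frames[this.index];
  }

  get original(): T {
    return this.frames[0]; // hold-to-compare fades back to this
  }
}
```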
The Stale Closure Problem
This one burned me for a while. The Live API's onmessage callback is registered once at session start. Any React state it closes over is immediately stale — meaning history[historyIndex] would always point to the original image, no matter how many edits had been applied.
The fix was to maintain refs that mirror the state and always read from those inside the callback:
const historyRef = useRef<File[]>([initialFile]);
const historyIndexRef = useRef<number>(0);
historyRef.current = history; // kept in sync on every render
historyIndexRef.current = historyIndex;
// Inside the async tool handler — always gets the live value
const imageFile = historyRef.current[historyIndexRef.current];
This is now a fixture of how I build anything on top of the Live API.
Document Navigation — Spatial Search + Live Highlights
The document workspace is a different beast. When you ask a question, the AI shouldn't just answer — it should show you exactly where in the document the answer lives.
The pipeline:
1. Ingestion (NestJS backend on Google Cloud Run)
When you upload a PDF, the backend runs it through pdfjs-dist, extracts text items with their transform matrices, groups words into lines by Y-coordinate proximity, and merges lines into sentence-level chunks. Every chunk stores a tight bounding box [x, y, width, height] alongside the text.
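The line-grouping step can be sketched roughly like this (a simplified illustration: real pdfjs-dist transforms also carry rotation and scale, and the tolerance value here is an assumption of mine):

```typescript
// Group pdfjs text items into lines by Y proximity, then read each
// line left to right. transform[4]/transform[5] are the x/y translation.
interface TextItem {
  str: string;
  transform: number[];
}

function groupIntoLines(items: TextItem[], yTolerance = 2): string[] {
  const lines: { y: number; items: TextItem[] }[] = [];
  for (const item of items) {
    const y = item.transform[5];
    const line = lines.find((l) => Math.abs(l.y - y) <= yTolerance);
    if (line) line.items.push(item);
    else lines.push({ y, items: [item] });
  }
  return lines
    .sort((a, b) => b.y - a.y) // PDF y grows upward, so top of page first
    .map((l) =>
      l.items
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((i) => i.str)
        .join(' ')
    );
}
```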
One gotcha: PDF coordinates use bottom-left origin, but the React PDF viewer uses top-left. Every Y coordinate gets flipped during ingestion:
y = pageHeight - transform[5] - (height || 10)
Each chunk is then embedded with gemini-embedding-001 and stored in Supabase with pgvector, using an HNSW index for fast cosine similarity search.
2. Search during the Live session
When you ask a question, the Live session calls search_document. The frontend hits the backend's /query endpoint, which embeds your query and runs a cosine similarity search against the document's chunks. The top results come back with page numbers and bounding boxes.
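For intuition, cosine similarity is just a normalized dot product; pgvector's <=> operator returns the cosine distance, i.e. one minus this value. This is illustration only, not the production query path, which runs inside Postgres:

```typescript
// Cosine similarity between two embedding vectors: dot product divided
// by the product of magnitudes. pgvector's <=> operator is 1 minus this.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```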
3. Spatial highlight
The model then calls focus_document_section with the page and an array of bounding box coordinates. The frontend jumps the PDF viewer to the right page and renders yellow highlight overlays at the exact pixel positions of the relevant sentences:
// Convert stored PDF-point bboxes to percentage overlays
const highlight = {
  pageIndex,
  left: (bx / pageWidthPx) * 100,
  top: (by / pageHeightPx) * 100,
  width: (bw / pageWidthPx) * 100,
  height: (bh / pageHeightPx) * 100,
};
You hear the answer and see it highlighted on the page at the same moment.
The Compose Studio
Beyond single-image editing, Say Edit has a Compose Studio that merges two photos by voice:
"Dress the person from A in the outfit from B."
"Put the product on the background from B."
Both images are encoded as base64 and sent to Gemini image generation together:
const response = await ai.models.generateContent({
  model: 'gemini-2.0-flash-exp-image-generation',
  contents: [
    { inlineData: { mimeType: imgA.type, data: base64A } },
    { inlineData: { mimeType: imgB.type, data: base64B } },
    { text: `Composition request: "${prompt}". Return ONLY the composed image.` },
  ],
  config: { responseModalities: ['IMAGE', 'TEXT'] },
});
The result lands in the same history stack as regular edits — so you can compose two images together and then keep refining the result by voice.
Infrastructure — Google Cloud Run
Both the frontend (React + Vite) and the backend (NestJS) are containerized and deployed to Google Cloud Run. The Dockerfile for the frontend bakes the Vite environment variables at build time using ARG injection:
ARG VITE_GEMINI_API_KEY
ARG VITE_BACKEND_URL
RUN echo "VITE_GEMINI_API_KEY=$VITE_GEMINI_API_KEY" > .env
RUN echo "VITE_BACKEND_URL=$VITE_BACKEND_URL" >> .env
RUN npx vite build
Deploy with:
gcloud run deploy say-edit \
  --source . \
  --region us-central1 \
  --allow-unauthenticated \
  --set-build-env-vars "VITE_GEMINI_API_KEY=...,VITE_BACKEND_URL=..."
What I Learned
The Live API's interruption model is the whole game. Every other voice interface I've used makes you wait. The ability to interrupt mid-sentence — backed by a proper server-side signal rather than a client-side hack — is what makes Say Edit feel like a real conversation.
Tool design is UX design. The get_current_hotspot tool exists purely so users can say "edit this" without re-stating coordinates. Getting the tool schema right — what the model calls, when, and with what arguments — determines the quality of the interaction more than any UI element.
Refs are mandatory in Live API callbacks. Any async callback registered at session start captures stale React state. The ref-mirror pattern is non-negotiable.
PDF extraction is harder than it looks. pdfjs-dist gives you transform matrices, not sentences. The grouping and chunking pipeline is the most underestimated part of the whole system.
Try It
Live demo: https://glyph-client-229364276486.us-central1.run.app/
Client repo: https://github.com/greatsage-raphael/say_edit
Server repo: https://github.com/greatsage-raphael/say_edit_server
Built with the Gemini Live API, Nano Banana, Google Cloud Run, and Supabase.