Introduction
Developer communication is broken. We spend countless hours writing bug reports, creating design specs, and explaining UI issues through screenshots and text. What if you could just show what you mean, draw on it, and talk through it? That's exactly what we built with Gemini Screen Scribe.
This article walks through the technical architecture, challenges, and solutions behind building a Chrome extension that captures screen recordings with voice narration and freehand annotations, then uses Google's Gemini Live API to generate developer-ready code and instructions.
The Problem Space
Every developer has experienced this workflow:
- Spot a bug or see a design you want to recreate
- Take multiple screenshots
- Open a text editor
- Spend 15-30 minutes writing a detailed explanation
- Hope the person reading it understands your intent
Screenshots lack context. Text descriptions miss visual nuances. The gap between what you see and what you can communicate is frustrating.
We needed a solution that captures the full context: what you're looking at, what you're pointing to, and what you're thinking.
Solution Architecture
Gemini Screen Scribe is a Chrome Manifest V3 extension with a Cloud Run backend. The architecture consists of several key components:
1. Chrome Extension Components
Popup UI (React + TypeScript + Tailwind CSS)
The popup serves as the control panel. Users select between two modes (Edit & Fix or Inspire), start/stop recordings, and view generated prompts. We built it with React 19 and Tailwind CSS for a modern, responsive interface.
// Core state management in the popup
const [mode, setMode] = useState<Mode>('edit');
const [isRecording, setIsRecording] = useState(false);
const [apiKey, setApiKey] = useState<string>('');
const [resultPrompt, setResultPrompt] = useState<string>('');
The popup communicates with the background service worker through Chrome's message passing API. When a user clicks "Start Screen Scribe", it sends a message to initiate the recording flow.
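A sketch of that round trip (the typed helper below is our own illustration; the extension's real message payloads may differ):

```typescript
// Hypothetical message shapes -- the real extension's payloads may differ.
type ScribeCommand = { type: 'start-recording' | 'stop-recording' };

// Build the command object the popup sends. Keeping it a pure helper makes
// the protocol easy to type-check and test outside the browser.
function makeCommand(type: ScribeCommand['type']): ScribeCommand {
  return { type };
}

// In the popup's click handler (browser context):
// const reply = await chrome.runtime.sendMessage(makeCommand('start-recording'));
// setIsRecording(Boolean(reply?.success));
```

Centralizing the message types like this keeps the popup and the background service worker agreeing on the same discriminated union.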
Content Script (Shadow DOM Overlay)
The content script injects a drawing overlay on every active tab. This is where users draw annotations while recording. The critical decision here was using Shadow DOM for complete style isolation.
// Creating the isolated Shadow DOM container
const container = document.createElement('div');
container.id = 'gemini-screen-scribe-root';
container.style.position = 'fixed';
container.style.top = '0';
container.style.left = '0';
container.style.width = '100vw';
container.style.height = '100vh';
container.style.pointerEvents = 'none';
container.style.zIndex = '2147483647';
document.body.appendChild(container);
const shadow = container.attachShadow({ mode: 'open' });
Shadow DOM ensures our overlay works on any website without CSS conflicts. The host page's styles don't affect our overlay, and our styles don't leak into the host page.
Background Service Worker
The background service worker is the orchestration layer. It coordinates communication between the popup, content script, and offscreen document. It also handles API calls to the Cloud Run backend.
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  if (message.type === 'start-recording') {
    handleStartRecording(sendResponse);
    return true; // Keep the message channel open for the async sendResponse
  }
  if (message.type === 'stop-recording') {
    handleStopRecording(sendResponse);
    return true; // Keep the message channel open for the async sendResponse
  }
});
Offscreen Document
This is the most critical piece for Manifest V3 compliance. Service workers cannot access getUserMedia or getDisplayMedia APIs. Chrome requires an offscreen document to handle media capture.
// `recorder` is declared at module scope in the offscreen script
async function startRecording() {
  // Screen capture; audio: true also requests tab/system audio
  const displayStream = await navigator.mediaDevices.getDisplayMedia({
    video: { width: { max: 1920 }, height: { max: 1080 } },
    audio: true,
  });
  // Microphone is optional -- fall back to screen-only audio if denied
  const micStream = await navigator.mediaDevices.getUserMedia({
    audio: true
  }).catch(() => null);
  const tracks = [...displayStream.getTracks()];
  if (micStream) {
    tracks.push(...micStream.getAudioTracks());
  }
  const combinedStream = new MediaStream(tracks);
  recorder = new MediaRecorder(combinedStream, { mimeType: 'video/webm' });
  recorder.start();
}
The offscreen document combines screen capture, system audio, and microphone audio into a single MediaRecorder stream. When recording stops, it converts the blob to base64 for transmission.
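That conversion can be sketched as follows (a minimal helper of our own, assuming the recorded chunks have already been assembled into a single Blob; the actual implementation may use FileReader instead):

```typescript
// Minimal sketch: encode a recording Blob as a base64 data URL for JSON transport.
// Assumption: the MediaRecorder chunks are already assembled into one Blob.
async function blobToDataUrl(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  // btoa is available in browsers and in Node 16+
  return `data:${blob.type};base64,${btoa(binary)}`;
}
```

The resulting string has the `data:<mime>;base64,<payload>` shape that the backend later splits on the comma to recover the raw base64 data.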
2. Drawing System
We use the perfect-freehand library for smooth, natural drawing strokes. The library takes pointer events and generates SVG paths that feel like real pen strokes.
const handlePointerDown = (e: React.PointerEvent) => {
if (!isDrawingMode) return;
(e.target as Element).setPointerCapture(e.pointerId);
setCurrentPath([[e.clientX, e.clientY, e.pressure]]);
};
const handlePointerMove = (e: React.PointerEvent) => {
if (!isDrawingMode || e.buttons !== 1) return;
setCurrentPath((c) => [...c, [e.clientX, e.clientY, e.pressure]]);
};
The key insight is using pointer capture to ensure smooth tracking even when the cursor moves quickly. We collect points with pressure data, then use perfect-freehand to generate the final stroke path.
const getSvgPathFromStroke = (stroke: number[][]) => {
if (!stroke.length) return '';
const d = stroke.reduce(
(acc, [x0, y0], i, arr) => {
const [x1, y1] = arr[(i + 1) % arr.length];
acc.push(x0, y0, (x0 + x1) / 2, (y0 + y1) / 2);
return acc;
},
['M', ...stroke[0], 'Q']
);
d.push('Z');
return d.join(' ');
};
3. Cloud Run Backend
The backend is a Node.js service built with Hono, a lightweight web framework. It handles two main responsibilities: processing videos with Gemini and managing prompt history in Firestore.
app.post('/process-video', async (c) => {
const { video, mode, sessionId } = await c.req.json();
// Strip the data URL prefix to get the raw base64 payload
const base64Data = video.split(',')[1];
// Call Gemini Live API
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: 'gemini-flash-latest',
contents: [{
role: 'user',
parts: [
{ text: "Analyze this screen recording according to the mode." },
{ inlineData: { mimeType: "video/webm", data: base64Data } }
]
}],
config: {
systemInstruction: mode === 'edit' ? EDIT_PROMPT : INSPIRE_PROMPT,
temperature: 0.2,
}
});
const prompt = response.text;
// Save to Firestore
await db.collection('sessions').doc(sessionId)
.collection('prompts').add({
prompt,
mode,
timestamp: Date.now()
});
return c.json({ success: true, prompt });
});
The backend holds the Gemini API key securely, so users never need to provide their own key. This also allows us to implement rate limiting and usage tracking.
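As an example of what that enables, a fixed-window rate limiter keyed by session could look like this (our own sketch, not code from the actual backend; the limits and window are illustrative):

```typescript
// Hypothetical fixed-window rate limiter: allow `limit` requests per
// session per `windowMs`. The `now` parameter is injectable for testing.
const hits = new Map<string, { count: number; windowStart: number }>();

function allowRequest(
  sessionId: string,
  limit = 5,
  windowMs = 60_000,
  now = Date.now()
): boolean {
  const h = hits.get(sessionId);
  // First request, or the previous window has expired: start a new window
  if (!h || now - h.windowStart >= windowMs) {
    hits.set(sessionId, { count: 1, windowStart: now });
    return true;
  }
  if (h.count >= limit) return false; // Over the limit for this window
  h.count += 1;
  return true;
}
```

An in-memory map like this resets on each Cloud Run cold start; a production version would likely back the counters with Firestore or Redis.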
4. Gemini Live API Integration
The Gemini Live API is the core of this project. It's designed for real-time multimodal interactions, processing video, audio, and visual context together.
We wrote two distinct system prompts for the two modes:
Edit & Fix Mode Prompt:
const EDIT_PROMPT = `
You are an expert Senior UI/UX Engineer and Frontend Developer.
The user is providing a screen recording of an interface. They are narrating their desired changes and drawing on the screen to point out specific bugs or UI modifications.
Your job is to act as their pair programmer.
Analyze the video, listen to the audio, and look at the drawn annotations.
Output a highly structured set of instructions or code snippets (React/Tailwind preferred) that directly solves the user's request.
Be concise, focus on the code, and explain why you are making the change.
`;
Inspire Mode Prompt:
const INSPIRE_PROMPT = `
You are a world-class Frontend Architect and UI Designer.
The user is providing a screen recording of a website they find inspiring. They might narrate what they like about it or draw to highlight specific animations, layouts, or components.
Your job is to reverse-engineer the "vibe" and structure of what they are showing you.
Analyze the video to understand the layout (CSS Grid/Flexbox), the color palette, typography choices, and animations.
Output a clear breakdown of how to build this in modern React and Tailwind CSS. Provide foundational code snippets to recreate the visual aesthetic they are pointing out.
`;
The system prompts completely change how Gemini interprets the same video input. This is the power of multimodal AI with proper context.
5. Firestore Integration
We use Firestore to store prompt history. Each user gets an anonymous session ID stored in local storage. All their generated prompts are saved under that session.
// Firestore structure
sessions/{sessionId}/prompts/{promptId}
- prompt: string
- mode: 'edit' | 'inspire'
- timestamp: number
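In TypeScript terms, each stored document matches a shape like this (the interface and guard are our own sketch, derived from the structure above):

```typescript
// Document shape for sessions/{sessionId}/prompts/{promptId}
interface StoredPrompt {
  prompt: string;
  mode: 'edit' | 'inspire';
  timestamp: number;
}

// Runtime guard, useful when reading untyped Firestore data
function isStoredPrompt(v: unknown): v is StoredPrompt {
  const o = v as Record<string, unknown>;
  return (
    typeof o?.prompt === 'string' &&
    (o?.mode === 'edit' || o?.mode === 'inspire') &&
    typeof o?.timestamp === 'number'
  );
}
```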
The popup fetches history on load:
app.get('/history/:sessionId', async (c) => {
const sessionId = c.req.param('sessionId');
const snapshot = await db.collection('sessions').doc(sessionId)
.collection('prompts')
.orderBy('timestamp', 'desc')
.limit(20)
.get();
const prompts = snapshot.docs.map(doc => ({
id: doc.id,
...doc.data()
}));
return c.json({ prompts });
});
Technical Challenges and Solutions
Challenge 1: Manifest V3 Media Capture
Problem: Chrome's Manifest V3 service workers run in a restricted context. They cannot access getUserMedia or getDisplayMedia APIs, which are essential for screen recording.
Solution: We implemented an offscreen document. This is a hidden HTML page that runs in a full browser context with access to all web APIs. The service worker creates the offscreen document, sends it messages to start/stop recording, and receives the video blob back.
async function setupOffscreenDocument(path: string) {
const offscreenUrl = chrome.runtime.getURL(path);
const existingContexts = await chrome.runtime.getContexts({
contextTypes: [chrome.runtime.ContextType.OFFSCREEN_DOCUMENT],
documentUrls: [offscreenUrl]
});
if (existingContexts.length > 0) return;
await chrome.offscreen.createDocument({
url: path,
reasons: [
chrome.offscreen.Reason.DISPLAY_MEDIA,
chrome.offscreen.Reason.USER_MEDIA
],
justification: 'Recording screen and microphone for Gemini AI analysis',
});
}
Challenge 2: Shadow DOM Style Isolation
Problem: The drawing overlay needs to work on any website without breaking the host page's layout or having the host page's CSS affect our overlay.
Solution: Shadow DOM provides true style encapsulation. We attach a shadow root to our container and inject our styles only into the shadow tree.
const shadow = container.attachShadow({ mode: 'open' });
const styleElement = document.createElement('style');
styleElement.textContent = shadowCss;
shadow.appendChild(styleElement);
const renderRoot = document.createElement('div');
shadow.appendChild(renderRoot);
const root = createRoot(renderRoot);
root.render(<Overlay />);
The shadow boundary ensures complete isolation. Our overlay has worked flawlessly on every website we've tested, from simple blogs to complex web apps.
Challenge 3: Video Size Management
Problem: High-quality screen recordings get large fast. A 2-minute recording at 1920x1080 can easily exceed 50MB. We're sending this to an API, so size matters.
Solution: We implemented several optimizations:
- Cap recording duration at 2 minutes with an automatic timeout
- Limit resolution to 1920x1080 (sufficient for most use cases)
- Use WebM format with efficient compression
- Clean up blob data immediately after encoding
// Automatic timeout in offscreen document
setTimeout(() => {
if(recorder?.state === 'recording') {
chrome.runtime.sendMessage({ type: 'recording-timeout' });
}
}, 120000); // 2 minutes
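One more size consideration: base64-encoding the video for the JSON payload inflates it by roughly a third, since every 3 input bytes become 4 output characters. A quick estimator (our own sketch, not code from the extension):

```typescript
// Estimate the base64-encoded size (in characters) of `rawBytes` bytes.
// Base64 maps every 3 input bytes to 4 output characters, padding the tail.
function base64Size(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}
```

At the 50 MB mentioned above, that works out to about 67 MB of text on the wire, which is another reason the 2-minute cap matters.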
Challenge 4: Cross-Context State Management
Problem: We have four separate JavaScript contexts (popup, background, content, offscreen) that need to share state. Each context has different capabilities and lifecycles.
Solution: We use Chrome's storage APIs and message passing strategically:
- chrome.storage.local for persistent data (API key, history)
- chrome.storage.session for temporary recording state
- chrome.runtime.sendMessage for commands and responses
- chrome.tabs.sendMessage for content script communication
// Background worker coordinates everything
async function handleStartRecording(sendResponse: (x: any) => void) {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  if (!tab?.id) {
    sendResponse({ success: false, error: 'No active tab' });
    return;
  }
  await chrome.storage.session.set({ recordingTabId: tab.id });
  await setupOffscreenDocument('src/offscreen/index.html');
  const response = await chrome.runtime.sendMessage({
    type: 'start-recording',
    target: 'offscreen',
  });
  if (response?.success) {
    chrome.tabs.sendMessage(tab.id, {
      type: 'toggle-drawing-mode',
      active: true
    });
  }
  sendResponse(response);
}
Challenge 5: Audio Stream Combination
Problem: We need to capture both the screen/tab audio and the user's microphone audio in a single recording.
Solution: We create separate streams and combine their tracks into a single MediaStream before passing it to MediaRecorder.
const displayStream = await navigator.mediaDevices.getDisplayMedia({
video: { width: { max: 1920 }, height: { max: 1080 } },
audio: true,
});
const micStream = await navigator.mediaDevices.getUserMedia({
audio: true
}).catch(() => null);
const tracks = [...displayStream.getTracks()];
if (micStream) {
tracks.push(...micStream.getAudioTracks());
}
const combinedStream = new MediaStream(tracks);
The key is proper track lifecycle management. When recording stops, we must stop all tracks to release the camera/microphone:
recorder.onstop = () => {
combinedStream.getTracks().forEach((t) => t.stop());
};
Performance Considerations
Frontend Performance
The drawing overlay needs to be responsive. We use React's state management efficiently to avoid unnecessary re-renders:
// Only re-render when drawing state changes
const [paths, setPaths] = useState<number[][][]>([]);
const [currentPath, setCurrentPath] = useState<number[][]>([]);
We also use pointer capture to ensure smooth tracking:
(e.target as Element).setPointerCapture(e.pointerId);
Backend Performance
The Cloud Run backend scales automatically based on load. Each request is independent, so we can handle multiple concurrent video processing requests.
We set a low temperature (0.2) for Gemini to get more deterministic, code-focused outputs:
config: {
systemInstruction: systemPrompt,
temperature: 0.2,
}
Memory Management
Video blobs can spike memory usage. We clean up immediately after encoding:
reader.onloadend = () => {
data = []; // Clear the blob array
recorder = null;
resolve(reader.result as string);
};
Development Workflow
We used Vite with the CRXJS plugin for development. This gives us hot module replacement for Chrome extensions, which is a massive productivity boost.
// vite.config.ts
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import { crx } from '@crxjs/vite-plugin';
import manifest from './manifest';

export default defineConfig({
  plugins: [react(), crx({ manifest })],
  build: {
    rollupOptions: {
      input: {
        offscreen: 'src/offscreen/index.html',
      },
    },
  },
});
The CRXJS plugin automatically handles:
- Manifest generation
- Content script injection
- Background service worker bundling
- Hot reload for all extension contexts
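For context, the manifest CRXJS consumes can itself be a TypeScript module. A minimal sketch (the field values here are illustrative assumptions, not the project's actual manifest):

```typescript
// manifest.ts -- illustrative sketch; the real project's entries may differ.
import { defineManifest } from '@crxjs/vite-plugin';

export default defineManifest({
  manifest_version: 3,
  name: 'Gemini Screen Scribe',
  version: '1.0.0',
  action: { default_popup: 'src/popup/index.html' },
  background: { service_worker: 'src/background/index.ts', type: 'module' },
  content_scripts: [{ matches: ['<all_urls>'], js: ['src/content/index.tsx'] }],
  permissions: ['offscreen', 'storage', 'activeTab', 'tabs'],
});
```

Defining the manifest in TypeScript lets CRXJS type-check entries and rewrite paths at build time.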
Deployment
The extension is built with npm run build, which outputs to the dist/ folder. Users load this as an unpacked extension in Chrome.
The backend deploys to Cloud Run with a simple Dockerfile:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "index.js"]
We use environment variables for secrets:
gcloud run deploy gemini-screen-scribe \
--source . \
--set-env-vars GEMINI_API_KEY=$GEMINI_API_KEY \
--allow-unauthenticated
Results and Impact
Gemini Screen Scribe reduces communication time dramatically:
- Bug reporting: 15 minutes → 2 minutes
- Design specs: 30 minutes → 3 minutes
The multimodal approach captures context that text alone cannot. Gemini sees what you're pointing at, hears your explanation, and understands the visual context together.
Key Learnings
1. Manifest V3 Requires Architectural Thinking
You can't just port V2 extensions to V3. The service worker model forces you to think about state management, message passing, and context boundaries differently. Offscreen documents are powerful but add complexity.
2. Multimodal AI Needs Good Prompts
The same video input produces wildly different outputs based on the system prompt. Spending time crafting mode-specific prompts was critical to output quality.
3. Shadow DOM is Perfect for Extensions
Shadow DOM solves the CSS isolation problem elegantly. It's more lightweight than iframes and provides true encapsulation.
4. Developer Experience Matters
Using Vite + CRXJS made development enjoyable. Hot reload for extensions is a game changer. Invest in good tooling early.
Future Roadmap
Real-Time Analysis
Instead of waiting until recording ends, stream the video to Gemini for live suggestions. This requires WebSocket connections and streaming API support.
Team Collaboration
Add sharing features so teams can collaborate on recordings and prompts. This requires user authentication and access control.
Custom Modes
Let power users write their own system prompts for specific workflows. Accessibility audits, performance reviews, security analysis, etc.
IDE Integration
Direct export to VS Code or GitHub. Generate the prompt, make the changes, commit them, all in one flow.
Enhanced Drawing Tools
Add color picker, shapes, text annotations, and an eraser. Right now it's just freehand purple strokes.
Conclusion
Building Gemini Screen Scribe taught us that the future of developer communication is multimodal. Text alone is insufficient for visual problems. By combining screen recording, voice narration, and freehand annotations with Gemini's Live API, we created a tool that captures the full context of what developers want to communicate.
The technical challenges were significant: Manifest V3 restrictions, Shadow DOM isolation, video size management, and cross-context state coordination. But the result is a polished extension that just works.
The Gemini Live API is the key enabler. Its ability to process video, audio, and visual context together as a continuous conversation is what makes this possible. We're excited to see what else can be built with multimodal AI.
Try Gemini Screen Scribe. Stop writing bug reports. Start showing them.
Technical Specifications
Frontend Stack:
- React 19
- TypeScript 5.9
- Tailwind CSS 3.4
- Vite 7.3
- CRXJS 2.0
Backend Stack:
- Node.js 20
- Hono (web framework)
- Google Cloud Run
- Google Firestore
APIs and Libraries:
- Gemini Live API via @google/genai
- perfect-freehand for drawing
- MediaRecorder API for video capture
- Chrome Extension APIs (Manifest V3)
Browser Support:
- Chrome 120+
- Edge 120+ (Chromium-based)
Resources
- GitHub Repository: https://github.com/bravian1/uiextension
- Demo Video: https://youtu.be/7IsrCzPnSVg
About the Author
Built for the Gemini Live Agent Challenge. This project showcases what's possible when you combine real-time multimodal AI with practical developer workflows.