Introduction
Developer communication is broken. We spend countless hours writing bug reports, creating design specs, and explaining UI issues through screenshots and text. What if you could just show what you mean, draw on it, and talk through it? That's exactly what we built with Gemini Screen Scribe.
This article walks through the technical architecture, challenges, and solutions behind building a Chrome extension that captures screen recordings with voice narration and freehand annotations, then uses Google's Gemini Live API to generate developer-ready code and instructions.
The Problem Space
Every developer has experienced this workflow:
- Spot a bug or see a design you want to recreate
- Take multiple screenshots
- Open a text editor
- Spend 15-30 minutes writing a detailed explanation
- Hope the person reading it understands your intent
Screenshots lack context. Text descriptions miss visual nuances. The gap between what you see and what you can communicate is frustrating.
We needed a solution that captures the full context: what you're looking at, what you're pointing to, and what you're thinking.
Solution Architecture
Gemini Screen Scribe is a Chrome Manifest V3 extension with a Cloud Run backend. The architecture consists of several key components:
1. Chrome Extension Components
Popup UI (React + TypeScript + Tailwind CSS)
The popup serves as the control panel. Users select between two modes (Edit & Fix or Inspire), start/stop recordings, and view generated prompts. We built it with React 19 and Tailwind CSS for a modern, responsive interface.
// Core state management in the popup
const [mode, setMode] = useState<Mode>('edit');
const [isRecording, setIsRecording] = useState(false);
const [apiKey, setApiKey] = useState<string>('');
const [resultPrompt, setResultPrompt] = useState<string>('');
The popup communicates with the background service worker through Chrome's message passing API. When a user clicks "Start Screen Scribe", it sends a message to initiate the recording flow.
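A sketch of that round trip (the typed helper below is our own illustration; the extension's real message payloads may differ):

```typescript
// Hypothetical message shapes -- the real extension's payloads may differ.
type ScribeCommand = { type: 'start-recording' | 'stop-recording' };

// Build the command object the popup sends. Keeping it a pure helper makes
// the protocol easy to type-check and test outside the browser.
function makeCommand(type: ScribeCommand['type']): ScribeCommand {
  return { type };
}

// In the popup's click handler (browser context):
// const reply = await chrome.runtime.sendMessage(makeCommand('start-recording'));
// setIsRecording(Boolean(reply?.success));
```

Centralizing the message types like this keeps the popup and the background service worker agreeing on the same discriminated union.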
Content Script (Shadow DOM Overlay)
The content script injects a drawing overlay on every active tab. This is where users draw annotations while recording. The critical decision here was using Shadow DOM for complete style isolation.
// Creating the isolated Shadow DOM container
const container = document.createElement('div');
container.id = 'gemini-screen-scribe-root';
container.style.position = 'fixed';
container.style.top = '0';
container.style.left = '0';
container.style.width = '100vw';
container.style.height = '100vh';
container.style.pointerEvents = 'none';
container.style.zIndex = '2147483647';
document.body.appendChild(container);
const shadow = container.attachShadow({ mode: 'open' });
Shadow DOM ensures our overlay works on any website without CSS conflicts. The host page's styles don't affect our overlay, and our styles don't leak into the host page.
Background Service Worker
The background service worker is the orchestration layer. It coordinates communication between the popup, content script, and offscreen document. It also handles API calls to the Cloud Run backend.
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  if (message.type === 'start-recording') {
    handleStartRecording(sendResponse);
    return true; // Keep the message channel open for the async sendResponse
  }
  if (message.type === 'stop-recording') {
    handleStopRecording(sendResponse);
    return true; // Keep the message channel open for the async sendResponse
  }
});
Offscreen Document
This is the most critical piece for Manifest V3 compliance. Service workers cannot access getUserMedia or getDisplayMedia APIs. Chrome requires an offscreen document to handle media capture.
// `recorder` is declared at module scope in the offscreen script
async function startRecording() {
  // Screen capture; audio: true also requests tab/system audio
  const displayStream = await navigator.mediaDevices.getDisplayMedia({
    video: { width: { max: 1920 }, height: { max: 1080 } },
    audio: true,
  });
  // Microphone is optional -- fall back to screen-only audio if denied
  const micStream = await navigator.mediaDevices.getUserMedia({
    audio: true
  }).catch(() => null);
  const tracks = [...displayStream.getTracks()];
  if (micStream) {
    tracks.push(...micStream.getAudioTracks());
  }
  const combinedStream = new MediaStream(tracks);
  recorder = new MediaRecorder(combinedStream, { mimeType: 'video/webm' });
  recorder.start();
}
The offscreen document combines screen capture, system audio, and microphone audio into a single MediaRecorder stream. When recording stops, it converts the blob to base64 for transmission.
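That conversion can be sketched as follows (a minimal helper of our own, assuming the recorded chunks have already been assembled into a single Blob; the actual implementation may use FileReader instead):

```typescript
// Minimal sketch: encode a recording Blob as a base64 data URL for JSON transport.
// Assumption: the MediaRecorder chunks are already assembled into one Blob.
async function blobToDataUrl(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  // btoa is available in browsers and in Node 16+
  return `data:${blob.type};base64,${btoa(binary)}`;
}
```

The resulting string has the `data:<mime>;base64,<payload>` shape that the backend later splits on the comma to recover the raw base64 data.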
2. Drawing System
We use the perfect-freehand library for smooth, natural drawing strokes. The library takes pointer events and generates SVG paths that feel like real pen strokes.
const handlePointerDown = (e: React.PointerEvent) => {
if (!isDrawingMode) return;
(e.target as Element).setPointerCapture(e.pointerId);
setCurrentPath([[e.clientX, e.clientY, e.pressure]]);
};
const handlePointerMove = (e: React.PointerEvent) => {
if (!isDrawingMode || e.buttons !== 1) return;
setCurrentPath((c) => [...c, [e.clientX, e.clientY, e.pressure]]);
};
The key insight is using pointer capture to ensure smooth tracking even when the cursor moves quickly. We collect points with pressure data, then use perfect-freehand to generate the final stroke path.
const getSvgPathFromStroke = (stroke: number[][]) => {
if (!stroke.length) return '';
const d = stroke.reduce(
(acc, [x0, y0], i, arr) => {
const [x1, y1] = arr[(i + 1) % arr.length];
acc.push(x0, y0, (x0 + x1) / 2, (y0 + y1) / 2);
return acc;
},
['M', ...stroke[0], 'Q']
);
d.push('Z');
return d.join(' ');
};
3. Cloud Run Backend
The backend is a Node.js service built with Hono, a lightweight web framework. It handles two main responsibilities: processing videos with Gemini and managing prompt history in Firestore.
app.post('/process-video', async (c) => {
const { video, mode, sessionId } = await c.req.json();
// Strip the data URL prefix to get the raw base64 payload
const base64Data = video.split(',')[1];
// Call Gemini Live API
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: 'gemini-flash-latest',
contents: [{
role: 'user',
parts: [
{ text: "Analyze this screen recording according to the mode." },
{ inlineData: { mimeType: "video/webm", data: base64Data } }
]
}],
config: {
systemInstruction: mode === 'edit' ? EDIT_PROMPT : INSPIRE_PROMPT,
temperature: 0.2,
}
});
const prompt = response.text;
// Save to Firestore
await db.collection('sessions').doc(sessionId)
.collection('prompts').add({
prompt,
mode,
timestamp: Date.now()
});
return c.json({ success: true, prompt });
});
The backend holds the Gemini API key securely, so users never need to provide their own key. This also allows us to implement rate limiting and usage tracking.
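As an example of what that enables, a fixed-window rate limiter keyed by session could look like this (our own sketch, not code from the actual backend; the limits and window are illustrative):

```typescript
// Hypothetical fixed-window rate limiter: allow `limit` requests per
// session per `windowMs`. The `now` parameter is injectable for testing.
const hits = new Map<string, { count: number; windowStart: number }>();

function allowRequest(
  sessionId: string,
  limit = 5,
  windowMs = 60_000,
  now = Date.now()
): boolean {
  const h = hits.get(sessionId);
  // First request, or the previous window has expired: start a new window
  if (!h || now - h.windowStart >= windowMs) {
    hits.set(sessionId, { count: 1, windowStart: now });
    return true;
  }
  if (h.count >= limit) return false; // Over the limit for this window
  h.count += 1;
  return true;
}
```

An in-memory map like this resets on each Cloud Run cold start; a production version would likely back the counters with Firestore or Redis.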
4. Gemini Live API Integration
The Gemini Live API is the core of this project. It's designed for real-time multimodal interactions, processing video, audio, and visual context together.
We wrote two distinct system prompts for the two modes:
Edit & Fix Mode Prompt:
const EDIT_PROMPT = `
You are an expert Senior UI/UX Engineer and Frontend Developer.
The user is providing a screen recording of an interface. They are narrating their desired changes and drawing on the screen to point out specific bugs or UI modifications.
Your job is to act as their pair programmer.
Analyze the video, listen to the audio, and look at the drawn annotations.
Output a highly structured set of instructions or code snippets (React/Tailwind preferred) that directly solves the user's request.
Be concise, focus on the code, and explain why you are making the change.
`;
Inspire Mode Prompt:
const INSPIRE_PROMPT = `
You are a world-class Frontend Architect and UI Designer.
The user is providing a screen recording of a website they find inspiring. They might narrate what they like about it or draw to highlight specific animations, layouts, or components.
Your job is to reverse-engineer the "vibe" and structure of what they are showing you.
Analyze the video to understand the layout (CSS Grid/Flexbox), the color palette, typography choices, and animations.
Output a clear breakdown of how to build this in modern React and Tailwind CSS. Provide foundational code snippets to recreate the visual aesthetic they are pointing out.
`;
The system prompts completely change how Gemini interprets the same video input. This is the power of multimodal AI with proper context.
5. Firestore Integration
We use Firestore to store prompt history. Each user gets an anonymous session ID stored in local storage. All their generated prompts are saved under that session.
// Firestore structure
sessions/{sessionId}/prompts/{promptId}
- prompt: string
- mode: 'edit' | 'inspire'
- timestamp: number
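In TypeScript terms, each stored document matches a shape like this (the interface and guard are our own sketch, derived from the structure above):

```typescript
// Document shape for sessions/{sessionId}/prompts/{promptId}
interface StoredPrompt {
  prompt: string;
  mode: 'edit' | 'inspire';
  timestamp: number;
}

// Runtime guard, useful when reading untyped Firestore data
function isStoredPrompt(v: unknown): v is StoredPrompt {
  const o = v as Record<string, unknown>;
  return (
    typeof o?.prompt === 'string' &&
    (o?.mode === 'edit' || o?.mode === 'inspire') &&
    typeof o?.timestamp === 'number'
  );
}
```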
The popup fetches history on load:
app.get('/history/:sessionId', async (c) => {
const sessionId = c.req.param('sessionId');
const snapshot = await db.collection('sessions').doc(sessionId)
.collection('prompts')
.orderBy('timestamp', 'desc')
.limit(20)
.get();
const prompts = snapshot.docs.map(doc => ({
id: doc.id,
...doc.data()
}));
return c.json({ prompts });
});
Technical Challenges and Solutions
Challenge 1: Manifest V3 Media Capture
Problem: Chrome's Manifest V3 service workers run in a restricted context. They cannot access getUserMedia or getDisplayMedia APIs, which are essential for screen recording.
Solution: We implemented an offscreen document. This is a hidden HTML page that runs in a full browser context with access to all web APIs. The service worker creates the offscreen document, sends it messages to start/stop recording, and receives the video blob back.
async function setupOffscreenDocument(path: string) {
const offscreenUrl = chrome.runtime.getURL(path);
const existingContexts = await chrome.runtime.getContexts({
contextTypes: [chrome.runtime.ContextType.OFFSCREEN_DOCUMENT],
documentUrls: [offscreenUrl]
});
if (existingContexts.length > 0) return;
await chrome.offscreen.createDocument({
url: path,
reasons: [
chrome.offscreen.Reason.DISPLAY_MEDIA,
chrome.offscreen.Reason.USER_MEDIA
],
justification: 'Recording screen and microphone for Gemini AI analysis',
});
}
Challenge 2: Shadow DOM Style Isolation
Problem: The drawing overlay needs to work on any website without breaking the host page's layout or having the host page's CSS affect our overlay.
Solution: Shadow DOM provides true style encapsulation. We attach a shadow root to our container and inject our styles only into the shadow tree.
const shadow = container.attachShadow({ mode: 'open' });
const styleElement = document.createElement('style');
styleElement.textContent = shadowCss;
shadow.appendChild(styleElement);
const renderRoot = document.createElement('div');
shadow.appendChild(renderRoot);
const root = createRoot(renderRoot);
root.render(<Overlay />);
The shadow boundary ensures complete isolation. Our overlay has worked flawlessly on every website we've tested, from simple blogs to complex web apps.
Challenge 3: Video Size Management
Problem: High-quality screen recordings get large fast. A 2-minute recording at 1920x1080 can easily exceed 50MB. We're sending this to an API, so size matters.
Solution: We implemented several optimizations:
- Cap recording duration at 2 minutes with an automatic timeout
- Limit resolution to 1920x1080 (sufficient for most use cases)
- Use WebM format with efficient compression
- Clean up blob data immediately after encoding
// Automatic timeout in offscreen document
setTimeout(() => {
if(recorder?.state === 'recording') {
chrome.runtime.sendMessage({ type: 'recording-timeout' });
}
}, 120000); // 2 minutes
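One more size consideration: base64-encoding the video for the JSON payload inflates it by roughly a third, since every 3 input bytes become 4 output characters. A quick estimator (our own sketch, not code from the extension):

```typescript
// Estimate the base64-encoded size (in characters) of `rawBytes` bytes.
// Base64 maps every 3 input bytes to 4 output characters, padding the tail.
function base64Size(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}
```

At the 50 MB mentioned above, that works out to about 67 MB of text on the wire, which is another reason the 2-minute cap matters.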
Challenge 4: Cross-Context State Management
Problem: We have four separate JavaScript contexts (popup, background, content, offscreen) that need to share state. Each context has different capabilities and lifecycles.
Solution: We use Chrome's storage APIs and message passing strategically:
- chrome.storage.local for persistent data (API key, history)
- chrome.storage.session for temporary recording state
- chrome.runtime.sendMessage for commands and responses
- chrome.tabs.sendMessage for content script communication
// Background worker coordinates everything
async function handleStartRecording(sendResponse: (x: any) => void) {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  if (!tab?.id) {
    sendResponse({ success: false, error: 'No active tab' });
    return;
  }
  await chrome.storage.session.set({ recordingTabId: tab.id });
  await setupOffscreenDocument('src/offscreen/index.html');
  const response = await chrome.runtime.sendMessage({
    type: 'start-recording',
    target: 'offscreen',
  });
  if (response?.success) {
    chrome.tabs.sendMessage(tab.id, {
      type: 'toggle-drawing-mode',
      active: true
    });
  }
  sendResponse(response);
}
Challenge 5: Audio Stream Combination
Problem: We need to capture both the screen/tab audio and the user's microphone audio in a single recording.
Solution: We create separate streams and combine their tracks into a single MediaStream before passing it to MediaRecorder.
const displayStream = await navigator.mediaDevices.getDisplayMedia({
video: { width: { max: 1920 }, height: { max: 1080 } },
audio: true,
});
const micStream = await navigator.mediaDevices.getUserMedia({
audio: true
}).catch(() => null);
const tracks = [...displayStream.getTracks()];
if (micStream) {
tracks.push(...micStream.getAudioTracks());
}
const combinedStream = new MediaStream(tracks);
The key is proper track lifecycle management. When recording stops, we must stop all tracks to release the camera/microphone:
recorder.onstop = () => {
combinedStream.getTracks().forEach((t) => t.stop());
};
Performance Considerations
Frontend Performance
The drawing overlay needs to be responsive. We use React's state management efficiently to avoid unnecessary re-renders:
// Only re-render when drawing state changes
const [paths, setPaths] = useState<number[][][]>([]);
const [currentPath, setCurrentPath] = useState<number[][]>([]);
We also use pointer capture to ensure smooth tracking:
(e.target as Element).setPointerCapture(e.pointerId);
Backend Performance
The Cloud Run backend scales automatically based on load. Each request is independent, so we can handle multiple concurrent video processing requests.
We set a low temperature (0.2) for Gemini to get more deterministic, code-focused outputs:
config: {
systemInstruction: systemPrompt,
temperature: 0.2,
}
Memory Management
Video blobs can spike memory usage. We clean up immediately after encoding:
reader.onloadend = () => {
data = []; // Clear the blob array
recorder = null;
resolve(reader.result as string);
};
Development Workflow
We used Vite with the CRXJS plugin for development. This gives us hot module replacement for Chrome extensions, which is a massive productivity boost.
// vite.config.ts
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import { crx } from '@crxjs/vite-plugin';
import manifest from './manifest';

export default defineConfig({
  plugins: [react(), crx({ manifest })],
  build: {
    rollupOptions: {
      input: {
        offscreen: 'src/offscreen/index.html',
      },
    },
  },
});
The CRXJS plugin automatically handles:
- Manifest generation
- Content script injection
- Background service worker bundling
- Hot reload for all extension contexts
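For context, the manifest CRXJS consumes can itself be a TypeScript module. A minimal sketch (the field values here are illustrative assumptions, not the project's actual manifest):

```typescript
// manifest.ts -- illustrative sketch; the real project's entries may differ.
import { defineManifest } from '@crxjs/vite-plugin';

export default defineManifest({
  manifest_version: 3,
  name: 'Gemini Screen Scribe',
  version: '1.0.0',
  action: { default_popup: 'src/popup/index.html' },
  background: { service_worker: 'src/background/index.ts', type: 'module' },
  content_scripts: [{ matches: ['<all_urls>'], js: ['src/content/index.tsx'] }],
  permissions: ['offscreen', 'storage', 'activeTab', 'tabs'],
});
```

Defining the manifest in TypeScript lets CRXJS type-check entries and rewrite paths at build time.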
Deployment
The extension is built with npm run build, which outputs to the dist/ folder. Users load this as an unpacked extension in Chrome.
The backend deploys to Cloud Run with a simple Dockerfile:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "index.js"]
We use environment variables for secrets:
gcloud run deploy gemini-screen-scribe \
--source . \
--set-env-vars GEMINI_API_KEY=$GEMINI_API_KEY \
--allow-unauthenticated
Results and Impact
Gemini Screen Scribe reduces communication time dramatically:
- Bug reporting: 15 minutes → 2 minutes
- Design specs: 30 minutes → 3 minutes
The multimodal approach captures context that text alone cannot. Gemini sees what you're pointing at, hears your explanation, and understands the visual context together.
Key Learnings
1. Manifest V3 Requires Architectural Thinking
You can't just port V2 extensions to V3. The service worker model forces you to think about state management, message passing, and context boundaries differently. Offscreen documents are powerful but add complexity.
2. Multimodal AI Needs Good Prompts
The same video input produces wildly different outputs based on the system prompt. Spending time crafting mode-specific prompts was critical to output quality.
3. Shadow DOM is Perfect for Extensions
Shadow DOM solves the CSS isolation problem elegantly. It's more lightweight than iframes and provides true encapsulation.
4. Developer Experience Matters
Using Vite + CRXJS made development enjoyable. Hot reload for extensions is a game changer. Invest in good tooling early.
Future Roadmap
Real-Time Analysis
Instead of waiting until recording ends, stream the video to Gemini for live suggestions. This requires WebSocket connections and streaming API support.
Team Collaboration
Add sharing features so teams can collaborate on recordings and prompts. This requires user authentication and access control.
Custom Modes
Let power users write their own system prompts for specific workflows. Accessibility audits, performance reviews, security analysis, etc.
IDE Integration
Direct export to VS Code or GitHub. Generate the prompt, make the changes, commit them, all in one flow.
Enhanced Drawing Tools
Add color picker, shapes, text annotations, and an eraser. Right now it's just freehand purple strokes.
Conclusion
Building Gemini Screen Scribe taught us that the future of developer communication is multimodal. Text alone is insufficient for visual problems. By combining screen recording, voice narration, and freehand annotations with Gemini's Live API, we created a tool that captures the full context of what developers want to communicate.
The technical challenges were significant: Manifest V3 restrictions, Shadow DOM isolation, video size management, and cross-context state coordination. But the result is a polished extension that just works.
The Gemini Live API is the key enabler. Its ability to process video, audio, and visual context together as a continuous conversation is what makes this possible. We're excited to see what else can be built with multimodal AI.
Try Gemini Screen Scribe. Stop writing bug reports. Start showing them.
Technical Specifications
Frontend Stack:
- React 19
- TypeScript 5.9
- Tailwind CSS 3.4
- Vite 7.3
- CRXJS 2.0
Backend Stack:
- Node.js 20
- Hono (web framework)
- Google Cloud Run
- Google Firestore
APIs and Libraries:
- Gemini Live API via @google/genai
- perfect-freehand for drawing
- MediaRecorder API for video capture
- Chrome Extension APIs (Manifest V3)
Browser Support:
- Chrome 120+
- Edge 120+ (Chromium-based)
Resources
- GitHub Repository: https://github.com/bravian1/uiextension
- Demo Video: https://youtu.be/7IsrCzPnSVg
About the Author
Built for the Gemini Live Agent Challenge. This project showcases what's possible when you combine real-time multimodal AI with practical developer workflows.