Brandi Kinard
I Built an AI Agent That Sees Through Smart Glasses and Tells You How to Fix Anything

Last week I was standing in my driveway, staring at my Ford Maverick's engine bay, wearing Ray-Ban Meta smart glasses, and talking to an AI agent I built called Clutch. I asked it to circle the dipstick. It drew a bounding box around it — live, on my phone — from what it saw through my glasses camera.

That moment made the past 12 days of hackathon chaos worth it.

The Problem Nobody's Solved

Here's the thing about learning how to do stuff with your hands: YouTube is terrible at it.

You're under the hood of your car with greasy hands. You can't scroll. You can't pause and rewind. You definitely can't hold your phone and a wrench at the same time. And that 14-minute video has 3 minutes of actual content buried under intros, sponsors, and "don't forget to like and subscribe."

Enterprise AR platforms like PTC Vuforia solve this beautifully — holographic step-by-step overlays on the actual equipment. But they cost thousands, require pre-authored content, and target factory floors, not your driveway.

Consumer smart glasses like Meta's Ray-Ban line can see your environment and chat about it. But ask "how do I check my oil?" and you get a conversational paragraph. No structured steps. No images. No progress tracking.

Nobody has built the bridge: glasses that see your task → AI that generates a structured how-to → multimedia guidance delivered to your phone. That's Clutch.

What Clutch Does

You put on Ray-Ban Meta smart glasses (or just use your phone camera). You ask a question like "How do I check the oil in my truck?" and Clutch:

  1. Sees what you're looking at through the camera
  2. Generates step-by-step instructions using Gemini 2.5 Flash
  3. Creates AI reference images for each step using Imagen 4 Fast
  4. Finds relevant YouTube tutorials automatically
  5. Guides you through each step with voice narration
  6. Annotates objects in your camera view when you ask ("circle the dipstick")
  7. Switches languages mid-conversation (English, Spanish, Vietnamese, French, Chinese)
  8. Exports the steps as a PDF to save for later

All in real-time. All voice-controlled. All while your hands are busy doing the actual task.

The Tech Stack

Clutch runs on:

  • Gemini Live API — bidirectional audio/vision streaming. The agent hears you and sees through your camera simultaneously.
  • Google ADK (Agent Development Kit) — orchestrates tool calls. The agent decides when to generate steps, search YouTube, annotate objects, or advance the wizard.
  • Imagen 4 Fast — generates reference images for each step in parallel (~5 seconds for 4 images).
  • Google Cloud Run — hosts the Python backend with WebSocket support for real-time communication.
  • YouTube Data API v3 — finds relevant how-to videos matched to your specific task.
  • Gemini 2.5 Flash Vision — powers the annotation tool, identifying objects in camera frames and drawing bounding boxes with labels.
  • Meta Wearables DAT SDK — streams 720p/30fps video from Ray-Ban Meta glasses to the iOS companion app.
  • SwiftUI — native iOS app with glasses connectivity, audio routing to Bluetooth speakers, and the full wizard UI.

The web app works in any browser as a fallback — no glasses required.
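On the annotation side, Gemini's vision models can be prompted to return bounding boxes as [ymin, xmin, ymax, xmax] coordinates normalized to a 0–1000 grid, which the frontend must convert to pixels before drawing over the camera frame. A minimal sketch of that conversion (the function name is mine, not Clutch's actual code):

```python
# Sketch: convert a Gemini-style normalized bounding box to pixel
# coordinates for drawing on a camera frame. Gemini returns boxes as
# [ymin, xmin, ymax, xmax] on a 0-1000 grid regardless of image size.

def box_to_pixels(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Return (x, y, w, h) in pixels for an overlay rectangle."""
    ymin, xmin, ymax, xmax = box_2d
    x = int(xmin / 1000 * width)
    y = int(ymin / 1000 * height)
    w = int((xmax - xmin) / 1000 * width)
    h = int((ymax - ymin) / 1000 * height)
    return x, y, w, h
```

For a 1280×720 frame, a box of [250, 100, 750, 900] maps to a rectangle at (128, 180) that is 1024×360 pixels — ready to hand to SwiftUI or a canvas overlay.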

The Architecture

The data flow is:

```
Ray-Ban Meta Glasses (camera + mic)
    ↓ Bluetooth
iPhone App (SwiftUI) or Web Browser
    ↓ WebSocket
Google Cloud Run (Python ADK Agent)
    ↓ Gemini Live API (bidi-streaming audio + vision)
    ↓ Tools: generate_steps, search_youtube, annotate_image, search_products
    ↓ Imagen 4 Fast (parallel image generation)
    ↓ YouTube Data API v3
```

Everything except the glasses camera stream runs on Google Cloud.
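The "parallel image generation" stage in the flow above is the kind of fan-out that asyncio.gather handles naturally — launch one request per step, await them together, and total latency is roughly one call instead of four. A sketch under stated assumptions (generate_image is a stand-in for the real Imagen request, with simulated latency):

```python
import asyncio

# Sketch: fan out per-step reference-image generation in parallel.
# generate_image is a hypothetical stand-in for an Imagen API call.

async def generate_image(prompt: str) -> bytes:
    await asyncio.sleep(0.01)  # simulated network latency
    return f"image-for:{prompt}".encode()

async def generate_step_images(steps: list[str]) -> list[bytes]:
    """One task per step, awaited together; results stay in step order."""
    return await asyncio.gather(*(generate_image(s) for s in steps))
```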

The Hardest Problem: Bidi Stream Size Limits

The Gemini Live API uses bidirectional WebSocket streaming for audio. It's incredible for real-time conversation. But it has a hard size limit on messages — and when I tried to return 4 base64-encoded images through that stream, it crashed with a 1008 policy violation error every time.

The fix was an out-of-band pattern: tools store their heavy payloads (images, video results, product data) in a server-side dictionary keyed by session ID. They return only a tiny summary to the bidi stream (e.g., {"action": "steps_summary", "count": 8}). The server intercepts the tool response, pulls the full data from the dictionary, and sends it directly to the frontend via a separate WebSocket message — bypassing the bidi stream entirely.

This pattern ended up being the architectural backbone of the entire app. Every tool that returns anything larger than a few hundred bytes uses it: generate_steps, annotate_image, search_youtube, and search_products.
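A minimal sketch of the pattern (names are illustrative, not the actual Clutch code): the tool stashes its payload in a session-keyed store and returns only the summary; a server-side hook drains the store and ships the full data over the separate WebSocket.

```python
import json

# Sketch of the out-of-band payload pattern: heavy tool results never
# touch the size-limited bidi stream. Names here are illustrative.

PAYLOAD_STORE: dict[str, dict] = {}  # keyed by session ID

def generate_steps(session_id: str, steps: list[dict]) -> dict:
    """Tool handler: stash the full result, return a tiny summary."""
    PAYLOAD_STORE[session_id] = {"action": "steps", "steps": steps}
    # Only this small dict goes back through the bidi stream.
    return {"action": "steps_summary", "count": len(steps)}

def flush_payload(session_id: str) -> str | None:
    """Server hook: after the tool call, pull the stored payload to send
    to the frontend on a separate WebSocket message."""
    payload = PAYLOAD_STORE.pop(session_id, None)
    return json.dumps(payload) if payload is not None else None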

The Other Hard Problem: Model Hallucination

The Gemini model powering the Live API is brilliant at conversation but unreliable at tool calling. It frequently says "I've highlighted the glasses for you" — without actually calling the annotate_image tool. The user sees nothing on screen.

No amount of prompt engineering fully solved this. The system prompt literally says:

"You MUST call the annotate_image tool FIRST. Do NOT say 'I've highlighted it' before the tool runs — the annotation only appears on screen when the tool actually executes."

The model still sometimes ignores it. For the demo, you just keep trying until it cooperates. This is the reality of building on bleeding-edge AI APIs during a hackathon.
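One partial mitigation — sketched here as an assumption, not something Clutch ships — is to cross-check the model's transcript against the tools that actually executed in the turn, so the server can retry or surface a warning when the model claims an annotation that never ran:

```python
# Sketch: detect "phantom" annotation claims by comparing the model's
# transcript against the set of tools that actually executed this turn.
# Phrase list and function name are hypothetical.

CLAIM_PHRASES = ("highlighted", "circled", "annotated")

def detect_phantom_annotation(transcript: str, tools_called: set[str]) -> bool:
    """True if the model claims an annotation happened but the
    annotate_image tool was never actually called."""
    claims = any(p in transcript.lower() for p in CLAIM_PHRASES)
    return claims and "annotate_image" not in tools_called
```

This doesn't stop the hallucination, but it turns a silent failure (user sees nothing) into one the server can detect and act on.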

What I'd Build Next

Clutch today is a proof of concept. The vision is much bigger:

  • Auto-advancing steps using voice detection ("done" → next step)
  • Per-step image generation beyond the first 4 steps
  • Real product recommendations via Google Shopping API (currently mock data)
  • Spatial AR overlays when glasses hardware supports it
  • Collaborative mode — an expert watches your glasses feed remotely and annotates
  • Egocentric learning — record yourself completing a task, extract it into a reusable how-to for others

That last one is where this gets really interesting. Imagine every skilled tradesperson, surgeon, or chef wearing glasses that capture their process — and Clutch turning that into structured, teachable content automatically. That's not a hackathon project. That's a company.

The Solo Build Reality

I built Clutch alone in 12 days. No team. I'm a creator and designer who thinks at the systems level — orchestrating multiple AI models into autonomous pipelines. This project pushed me deep into Python backend development, iOS Swift, WebSocket architecture, and Google Cloud deployment.

The hardest part wasn't the code. It was the API instability. The Gemini Live API goes through periods where it simply doesn't process audio — the connection opens, the model loads, but nothing happens. No errors. Just silence. Then you refresh and it works perfectly. Then it doesn't again.

Building a demo on top of that required patience, persistence, and a lot of screen recordings where I got lucky.

Try It

Clutch — the only tool in your box that tells you what to do next.


Built for the Gemini Live Agent Challenge. Powered by Gemini Live API, ADK, Imagen 4, and Google Cloud Run.
