Thor 雷神 Schaeff for Google AI

Realtime Multimodal AI on Ray-Ban Meta Glasses with Gemini Live & LiveKit

Imagine walking down the street, asking your glasses what kind of plant you're looking at, and getting a response in near real-time. With the Gemini Live API, LiveKit, and the Meta Wearables SDK, this isn't science fiction anymore; it's something you can build today.

In this post, we’ll walk through how to set up a vision-enabled AI agent that connects to Ray-Ban Meta glasses via a secure WebRTC proxy.

The Architecture

The setup involves several layers to ensure low-latency, secure communication between the wearable device and the AI:

  1. Ray-Ban Meta Glasses: Capture video and audio and connect to your phone via Bluetooth.
  2. Phone (Android/iOS): Acts as the gateway, connecting via WebRTC to LiveKit Cloud.
  3. LiveKit Cloud: Serves as a secure, high-performance proxy for the Gemini Live API.
  4. Gemini Live API: Processes the stream via WebSockets, enabling real-time multimodal interaction.


The Backend: Building the Gemini Live Agent

We use the LiveKit Agents framework to act as a secure WebRTC proxy for the Gemini Live API. This agent joins the LiveKit room, listens to the audio, and processes the video stream from the glasses.

Setting up the Assistant

The core of our agent is the AgentSession. We use the google.beta.realtime.RealtimeModel to interface with Gemini. Crucially, we enable video_input in the RoomOptions to allow the agent to "see."

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    # Tag every log line from this job with the room name for easier debugging
    ctx.log_context_fields = {"room": ctx.room.name}

    session = AgentSession(
        # Gemini Live handles speech-to-speech and vision natively over WebSockets
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.5-flash-native-audio-preview-12-2025",
            proactivity=True,             # let the model decide when to chime in
            enable_affective_dialog=True  # adapt tone and delivery to the user
        ),
        vad=ctx.proc.userdata["vad"],     # voice activity detection (typically preloaded in a prewarm hook)
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
        room_options=room_io.RoomOptions(
            video_input=True,  # subscribe to the video track published by the glasses
        )
    )
    await ctx.connect()
    await session.generate_reply()

By setting video_input=True, the agent automatically requests the video track from the room, which in this case is the 1FPS stream coming from the glasses.
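
The Assistant referenced in session.start isn't shown in the snippet above. A minimal sketch of what it might look like, assuming the standard Agent base class from the LiveKit Agents framework and hypothetical instructions text:

from livekit.agents import Agent


class Assistant(Agent):
    def __init__(self) -> None:
        # System instructions shape how Gemini talks about what the camera sees
        super().__init__(
            instructions=(
                "You are a helpful voice assistant running on smart glasses. "
                "Answer questions about what the user is looking at, "
                "keeping replies brief and conversational."
            )
        )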

Running the Agent

To start your agent in development mode and make it accessible globally via LiveKit Cloud, simply run:

uv run agent.py dev
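The agent also needs credentials for LiveKit and Gemini at runtime. A minimal .env sketch, assuming the standard variable names read by the LiveKit Agents framework and the Google plugin (adjust to your own project):

# .env for the agent process
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=<YOUR_API_KEY>
LIVEKIT_API_SECRET=<YOUR_API_SECRET>
GOOGLE_API_KEY=<YOUR_GEMINI_API_KEY>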

Find the full Gemini Live vision agent example in the LiveKit docs.


Connection & Authentication

To connect your frontend to LiveKit, you need a short-lived access token.

CLI Token Generation

For testing and demos, you can quickly generate a token using the LiveKit CLI:

lk token create \
  --api-key <YOUR_API_KEY> \
  --api-secret <YOUR_API_SECRET> \
  --join \
  --room <ROOM_NAME> \
  --identity <PARTICIPANT_IDENTITY> \
  --valid-for 24h

In a production environment, you should always issue tokens from a secure backend to keep your API secrets safe.
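
For illustration, here's a rough sketch of the token minting such a backend might do, using the livekit-api Python package; the function name and identity handling are assumptions to adapt to your own auth flow:

from datetime import timedelta

from livekit import api


def create_join_token(identity: str, room: str) -> str:
    # The API key/secret never leave the server; only the short-lived JWT
    # is handed to the phone/glasses client.
    return (
        api.AccessToken()  # reads LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room))
        .with_ttl(timedelta(hours=1))
        .to_jwt()
    )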


The Frontend: Meta Wearables Integration

This example targets Android devices (like the Google Pixel). You'll need the Meta Wearables Toolkit and its accompanying sample project.

  • Clone the Sample: Get the Android client example.
  • Configure local.properties: Add your GitHub Token as required by the Meta SDK.
  • Update Connection Details: In StreamScreen.kt, replace the server URL and token with your LiveKit details:
// streamViewModel.connectToLiveKit
connectToLiveKit(
    url = "wss://your-project.livekit.cloud",
    token = "your-generated-token"
)
  • Run the App: Connect your device via USB and deploy from Android Studio.

Conclusion

By bridging Meta Wearables with Gemini Live via LiveKit, we've created a powerful, low-latency vision AI experience. This architecture is scalable and secure, providing a foundation for the next generation of wearable AI applications.

Happy hacking! 🚀
