
Kevin Lu

Voice-first PC Builder Agent Built on Gemini Live API

This post was written for my submission to the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

architecture diagram

Inspiration

I've built PCs before and know the pain firsthand: squinting at tiny connectors, pausing YouTube every 10 seconds with greasy fingers, and second-guessing whether I'm about to break my build. But the real breaking point was trying to help a friend build their first PC remotely over a video call. I was watching a shaky camera feed, trying to describe which cable goes where, while they couldn't find the front panel headers or the parts I was referencing unless I pulled up pictures. I realized what they needed wasn't me on a video call; it was an AI that could see what they see, talk them through it hands-free, and not lose patience the 15th time they ask, "which one is the 8-pin?"

What it does

BuildBuddy is a hands-free, voice-first AI assistant that guides users through a PC build in real time. You talk to it, it talks back, no typing, no pausing, no touching your phone with thermal-paste fingers. Point your camera at a part or connector you don't recognize, and it identifies it visually. A built-in parts reference shows labeled diagrams for tricky connectors like PSU cables. It tracks your build progress step-by-step and logs every action with timestamps and camera snapshots to a shareable timeline so a friend, mentor, or forum can review exactly what happened and verify the AI's guidance. Because AI can be wrong, and accountability matters.

main screen

How we built it

  • Gemini Live API: for real-time bidirectional audio streaming: the user speaks, the AI responds with voice, all over a persistent WebSocket connection
  • Google ADK (Agent Development Kit): for agent orchestration, tool definitions, and session management
  • Custom tool calls: for build progress tracking (update_part_status), connector image references (get_connector_image, show_user_part), enabling the AI to trigger UI updates mid-conversation
  • Cloud Run: for deployment with session affinity to maintain WebSocket connections
  • Cloud Build: for automated CI/CD, pushes to the repo trigger a Docker image build and deployment to Cloud Run without manual steps
  • Firestore: for logging every build event with timestamps, part status, and notes
  • Google Cloud Storage: for storing camera snapshots at each build step
  • Vanilla HTML/CSS/JS frontend: no frameworks; mobile-first and designed for one-handed use with a phone propped up next to your build
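To make the custom tool calls concrete, here is a minimal sketch of what tools like update_part_status and get_connector_image might look like. ADK can register plain Python functions (with docstrings and typed parameters) as tools; the in-memory store and the connector image paths below are illustrative stand-ins for the real Firestore log and static assets, not BuildBuddy's actual code.

```python
from datetime import datetime, timezone

# In-memory stand-ins for the Firestore-backed build log (illustrative only).
BUILD_LOG: list[dict] = []
PART_STATUS: dict[str, str] = {}

def update_part_status(part: str, status: str, note: str = "") -> dict:
    """Mark a part as e.g. 'pending' or 'installed' and log the event."""
    PART_STATUS[part] = status
    BUILD_LOG.append({
        "part": part,
        "status": status,
        "note": note,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    # Returning a dict lets the model confirm the result back to the user.
    return {"ok": True, "part": part, "status": status}

def get_connector_image(connector: str) -> dict:
    """Look up a labeled reference image for a connector (paths illustrative)."""
    images = {
        "24-pin ATX": "/static/connectors/atx-24pin.png",
        "8-pin EPS": "/static/connectors/eps-8pin.png",
    }
    url = images.get(connector)
    return {"ok": url is not None, "image_url": url}
```

Because the functions return plain dicts, the agent can both narrate the result over voice and push the returned image URL to the frontend as a UI update mid-conversation.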

Challenges we ran into

Real-time audio was the hardest part of the entire project. The PCM microphone recorder would flood the WebSocket with audio buffer data, causing crashes, and Nvidia Broadcast made it worse with what we suspect were buffer timing issues. Muting the mic after speech ended and eventually bypassing Nvidia Broadcast mostly resolved the symptoms, but it took real debugging time to isolate. On top of that, camera frames kept streaming during tool-call execution, interrupting the audio data flow and causing the WebSocket to lose the connection entirely. The Gemini Live API and ADK expect very specific WebSocket data timing, so when tool calls and camera frames collide with the audio stream, everything falls apart. The fix required async waits and blocking mechanisms to prevent simultaneous data floods.

Cloud Run deployment was its own adventure. The app broke immediately when limited to a single concurrent instance, because the frontend needs to fetch the JS and static files while the WebSocket holds a connection open; with one instance, we'd get rate-limited on our own requests. The solution was setting a minimum of one warm instance and a max of two, plus enabling session affinity to keep WebSocket connections stable. GCS bucket configuration for public image access was also non-obvious: disabling the "prevent public access" setting isn't enough; you also need to explicitly grant the allUsers principal the Storage Object Viewer role. I would have preferred per-object public access, but the bucket's uniform access policy doesn't allow mixed permissions.
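For reference, the allUsers grant can be done with the google-cloud-storage client roughly like this. This is a config sketch, not runnable without GCP credentials, and the bucket name is hypothetical; with uniform bucket-level access enabled, the binding necessarily applies to every object in the bucket.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("buildbuddy-snapshots")  # hypothetical bucket name

# Append a bucket-level binding making every object publicly readable.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"allUsers"},
})
bucket.set_iam_policy(policy)
```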

Finally, ADK documentation didn't always match the actual behavior of the current version, which meant a lot of trial-and-error to figure out how things actually worked versus how they were documented.

Accomplishments that we're proud of

The thing I'm most proud of is that it actually works end-to-end as a real-time voice agent. You can have a natural conversation with your hands full while building a PC, and the AI genuinely helps. The shareable build timeline turned out to be a surprisingly compelling feature. It reframes AI assistance from "trust the black box" to "here's a reviewable record of everything that happened."

report feature

What we learned

This was my deep dive into Gemini's Live API and real-time multimodal streaming. The biggest takeaway is that voice-first interfaces have fundamentally different UX constraints than text-based ones: latency matters more than token count, interruption handling is critical, and you can't show a wall of text to someone whose hands are inside a PC case. I also learned that tool calls in a streaming audio context need careful state management to prevent race conditions between camera frames, audio chunks, and tool executions happening simultaneously. On the deployment side, Cloud Run's session affinity is non-negotiable for WebSocket applications, something that's obvious in hindsight but cost real debugging time.

What's next for BuildBuddy

The current version is a single-user hackathon demo, but for production:

  • Multi-user sessions: right now there's no session isolation. Adding proper authentication and per-session Firestore collections would let multiple people use BuildBuddy simultaneously, each with their own build timeline
  • AR connector overlay: use the camera feed to overlay labels directly on the motherboard showing exactly where each cable connects
  • Build templates: pre-loaded guides for popular builds so the AI has part-specific knowledge from the start, with compatibility checking to warn about mismatches before you start building
  • Motherboard manual RAG: use embeddings to index motherboard manuals so the AI can reference exact pin layouts, header locations, and BIOS settings specific to the user's board since the motherboard manual is the real source of truth for every build
  • Community review: let experienced builders comment on shared timelines to catch AI mistakes, suggest better cable routing, or help troubleshoot blocked steps
  • Multi-language support: the voice assistant should support more than just English
  • Noise robustness: the hackathon demo runs in a quiet space, but a real user has fans, tools, and background noise. Proper noise cancellation and wake-word detection would be essential
  • Production security: the hackathon version uses broad permissions and a public GCS bucket for convenience. In production, I'd use presigned URLs with expiration for snapshot access, lock down Firestore rules per user, and apply least-privilege IAM roles for each service account across Cloud Run, Firestore, and GCS

Devpost Submission | Demo Video
