For the Gemini Live Agent Challenge, I wanted to solve a real-world problem: making homework engaging and accessible for students in Singapore. The result is SgStudyPal, an AI-powered tutoring platform that combines real-time voice with multimodal image recognition to act as a personalized tutor.
Here is a breakdown of how I built the infrastructure using Google's ecosystem.
The Tech Stack
- Frontend: Next.js 14 (App Router), Tailwind CSS
- Backend: Node.js, Vercel AI SDK
- AI Models: Google Gemini 2.5 Flash (Multimodal) & Gemini Live (WebSockets)
- Auth & DB: Firebase Authentication & Firestore
- Infrastructure: Google Cloud Run (Docker)
Architecture & Google Cloud Integration
To get a scalable, stateful environment, I skipped standard serverless edge functions and instead deployed a multi-stage Docker container on Google Cloud Run.
- Multimodal Homework Help: Students can upload a photo of a complex math worksheet. The Next.js backend parses the image into a binary buffer and streams it securely to the gemini-2.5-flash model. Strict prompt engineering makes the AI skip the standard chat pleasantries and immediately break the visual math problem down step by step.
- Real-Time Video Tutor (Gwen): Using WebSockets, the app connects directly to the Gemini Live API, giving students fluid, interruptible, low-latency audio/video conversations with their AI tutor.
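The homework-help hand-off above can be sketched in TypeScript. This is a minimal sketch, not SgStudyPal's actual code: the helper name `bufferToInlinePart` is mine, but the `{ inlineData }` base64 shape is the format the Gemini API accepts for inline images.

```typescript
// Sketch of the image hand-off step, assuming the uploaded worksheet
// arrives server-side as a Node Buffer. Helper name is illustrative.
interface InlinePart {
  inlineData: { data: string; mimeType: string };
}

// Convert the raw upload into the base64 inline-data part Gemini accepts.
function bufferToInlinePart(buffer: Buffer, mimeType: string): InlinePart {
  return {
    inlineData: {
      data: buffer.toString("base64"),
      mimeType,
    },
  };
}

// The part can then be sent alongside a tutoring prompt, e.g. with the
// @google/generative-ai SDK:
//   const model = genAI.getGenerativeModel({ model: "gemini-2.5-flash" });
//   await model.generateContent([tutorPrompt, bufferToInlinePart(buf, "image/png")]);
```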
Overcoming Deployment Challenges
Deploying a Turborepo Next.js app to Cloud Run requires careful environment variable management. To ensure Firebase client variables (NEXT_PUBLIC_) were securely baked into the production bundle while keeping Gemini API keys strictly isolated as runtime secrets, I implemented a 3-stage Docker build process (deps, builder, runner).
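The three stages described above might look something like this. The stage names (deps, builder, runner) come from the post; the specific variable names are illustrative assumptions, and the layout is simplified to a single app (a real Turborepo build would typically add a prune step first).

```dockerfile
# Stage 1: deps — install dependencies in isolation for better layer caching
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

# Stage 2: builder — NEXT_PUBLIC_ vars arrive as build args so Next.js
# can bake them into the client bundle at `next build` time
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
ARG NEXT_PUBLIC_FIREBASE_API_KEY
ENV NEXT_PUBLIC_FIREBASE_API_KEY=$NEXT_PUBLIC_FIREBASE_API_KEY
RUN npm run build

# Stage 3: runner — minimal production image. The Gemini API key is NOT
# copied in here; Cloud Run injects it at runtime (e.g. via Secret Manager)
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./package.json
# Cloud Run provides the port via the PORT env var
EXPOSE 8080
CMD ["npm", "start"]
```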
By combining the low latency of Google Cloud Run with the incredible multimodal capabilities of Gemini 2.5 Flash, SgStudyPal transforms static worksheets into interactive, real-time learning experiences.