This is a submission for the Google AI Studio Multimodal Challenge
What I Built
BabySafe AI is a user-friendly web applet designed to give parents and caregivers peace of mind. It solves the often overwhelming problem of childproofing a home for a mobile baby (6-18 months) by acting as an AI-powered "second set of eyes."
Users simply upload a photo of a room, and the applet leverages the Google Gemini model's vision capabilities to analyze the image, identify common household safety hazards, and present them in a clear, actionable list. The experience is designed to be simple, fast, and reassuring, turning a potentially stressful task into a straightforward safety check.
Demo
Here's a walkthrough of the BabySafe AI user experience:
Upload: The user is greeted with a clean, simple interface where they can drag-and-drop or click to upload a photo of a room.
Preview & Analyze: Once an image is selected, a preview is shown. The user then clicks the "Analyze Safety" button to begin the AI analysis (a sketch of how the photo is prepared for the model follows this list).
Analysis in Progress: A clear loading indicator lets the user know the AI is working.
Hazard Report: If hazards are found, the results are displayed in a two-panel layout. The original image is on the left, and a detailed, scrollable list of identified hazards is on the right. Each item in the list includes the hazard's name, the specific risk it poses, and a clear description of its location.
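Under the hood, the selected photo has to reach the model as base64 data. Here is a minimal sketch of that client-side step, assuming a standard browser FileReader; the function and type names are my own illustrations, not the applet's actual code.

```typescript
// Illustrative sketch: read an uploaded File into base64 for the model
// and produce an object URL for the on-screen preview.
export interface SelectedImage {
  base64Data: string; // raw base64 payload sent to the model
  mimeType: string;   // e.g. "image/jpeg"
  previewUrl: string; // object URL used for the preview panel
}

export function readImageForAnalysis(file: File): Promise<SelectedImage> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onerror = () => reject(reader.error);
    reader.onload = () => {
      // reader.result is a data URL: "data:image/jpeg;base64,...."
      const dataUrl = reader.result as string;
      const base64Data = dataUrl.split(",")[1];
      resolve({
        base64Data,
        mimeType: file.type,
        previewUrl: URL.createObjectURL(file),
      });
    };
    reader.readAsDataURL(file);
  });
}
```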
How I Used Google AI Studio
I used the gemini-2.5-flash model from the Google AI platform to power the core functionality of BabySafe AI. My implementation focused on two key multimodal capabilities:
Vision Understanding: The primary function relies on the model's ability to interpret and understand the contents of an image. It goes beyond simple object detection to recognize items in the context of child safety.
Structured JSON Output: I used Gemini to help draft a detailed system prompt and paired it with a strict responseSchema. Together, these instruct the AI to act as a "Baby Safety Expert" and force it to return its findings in a predictable JSON format. This was crucial for reliably parsing the AI's response and rendering the hazard list in the user interface. By defining the expected data structure, I turned a powerful, general-purpose vision model into a specialized and reliable analysis tool.
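For illustration, here is a minimal sketch of such a call, assuming the @google/genai JavaScript SDK. The abbreviated system prompt and the hazard_name and risk field names are stand-ins for the real ones; location_description is the field the report actually surfaces.

```typescript
import { GoogleGenAI, Type } from "@google/genai";

// Assumes the API key is injected at build time.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Response schema that constrains the model to a predictable JSON array of hazards.
const hazardSchema = {
  type: Type.ARRAY,
  items: {
    type: Type.OBJECT,
    properties: {
      hazard_name: { type: Type.STRING },
      risk: { type: Type.STRING },
      location_description: { type: Type.STRING },
    },
    required: ["hazard_name", "risk", "location_description"],
  },
};

export async function analyzeRoom(base64Image: string, mimeType: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType, data: base64Image } },
      { text: "Identify every child-safety hazard visible in this room." },
    ],
    config: {
      systemInstruction:
        "You are a Baby Safety Expert reviewing rooms for hazards to a mobile baby (6-18 months).",
      responseMimeType: "application/json",
      responseSchema: hazardSchema,
    },
  });

  // With a responseSchema set, the response text should already be JSON
  // matching the schema, so it can be parsed directly for rendering.
  return JSON.parse(response.text ?? "[]");
}
```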
Multimodal Features
The core multimodal feature of this applet is Image-to-Structured-Text Analysis. It takes a visual, unstructured input (a user's photo) and transforms it into structured, actionable text data (a JSON object listing hazards).
This enhances the user experience in several key ways:
Intuitive Input: Users don't need to describe their room with words. They can simply show the AI the environment with a photo, which is a far more natural and efficient way to communicate complex visual information.
Contextual Hazard Identification: The AI doesn't just list objects; it identifies them as hazards. It understands that an electrical outlet low on a wall is a risk to a crawling baby, or that a dangling cord is an entanglement threat. This contextual understanding is only possible through analyzing the visual data.
Spatially-Aware Descriptions: The model generates a location_description for each hazard (e.g., "On the floor in the bottom-left corner"). This spatial awareness, derived directly from the image, makes the feedback immediately actionable. The user knows exactly where to look and what to fix, transforming a generic warning into a specific, helpful instruction.
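As an example of the structured output the applet consumes, a single parsed hazard entry might look like the sketch below; field names other than location_description are illustrative.

```typescript
// Illustrative shape of one parsed hazard entry.
interface Hazard {
  hazard_name: string;          // e.g. "Dangling blind cord"
  risk: string;                 // e.g. "Entanglement hazard for a reaching baby"
  location_description: string; // spatial hint derived from the image
}

const exampleHazard: Hazard = {
  hazard_name: "Dangling blind cord",
  risk: "Entanglement hazard for a reaching baby",
  location_description: "Hanging beside the window on the right side of the image",
};
```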