This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I built "DIY-AI Fix-It", a comprehensive, AI-powered assistant that guides users through common household repairs. The app is designed to solve the entire problem from start to finish: from not knowing what's wrong, to successfully completing the fix with confidence.
A user simply uploads a photo of their problem and describes it. The AI then generates a complete, professional-grade repair plan (a rough sketch of its structure follows the list below), including:
A Clear Diagnosis: A simple explanation of what's wrong.
Complete Tool & Material List: A checklist of everything needed for the job.
Difficulty & Time Estimates: So you know what you're getting into before you start.
Step-by-Step Instructions: Safe, easy-to-follow steps, with automated warnings for hazardous tasks.
Potential Pitfalls: Critical advice on common mistakes to avoid.
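To make that concrete, here is a minimal sketch of the kind of structured response the app works with. The field names are illustrative assumptions, not the exact schema my deployed prompt returns:

```typescript
// Illustrative shape of the structured repair plan.
// Field names are assumptions, not the exact schema of the deployed prompt.
interface RepairPlan {
  diagnosis: string;            // plain-language explanation of what's wrong
  toolsAndMaterials: string[];  // checklist of everything needed for the job
  difficulty: "easy" | "moderate" | "hard";
  estimatedTime: string;        // e.g. "30-45 minutes"
  steps: { instruction: string; safetyWarning?: string }[];
  pitfalls: string[];           // common mistakes to avoid
}
```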
The experience is fully interactive. Users can ask clarifying questions in a Contextual Follow-up Chat and even have the instructions read aloud with a Text-to-Speech feature for hands-free guidance during a repair.
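For the read-aloud feature, the browser's standard Web Speech API is enough for a hands-free flow. This is a minimal sketch of that approach; the details in the deployed app may differ:

```typescript
// Minimal sketch of hands-free guidance using the browser's built-in
// SpeechSynthesis API (assumed approach; the deployed app may differ).
function readStepAloud(stepText: string): void {
  const utterance = new SpeechSynthesisUtterance(stepText);
  utterance.rate = 0.9;             // slightly slower, easier to follow mid-repair
  window.speechSynthesis.cancel();  // stop any step already being read
  window.speechSynthesis.speak(utterance);
}
```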
Link to my deployed applet:
Screenshots of the app in action:
The main interface where a user uploads an image and describes the problem.
The initial, structured solution provided by the AI assistant.
The conversational follow-up feature, where the user can ask for more help.
How I Used Google AI Studio
I used Google AI Studio's Freeform prompt to build the entire logic core of my AI assistant. The prompt I created is designed to be highly intelligent and state-aware. It instructs the Gemini model to analyze a user's request and determine if it's an initial query or a follow-up question within an existing conversation.
This conditional logic, built entirely within a single prompt, allows the model to respond in two distinct ways:
For new problems, it provides a structured, predictable JSON output that my React app can easily parse and display.
For follow-up questions, it switches to a conversational mode, providing helpful, context-aware answers in natural language.
I was able to test, refine, and perfect this complex logic flow entirely within the AI Studio environment before deploying it.
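To give a feel for that conditional logic, here is a simplified sketch of how a state-aware prompt can branch on whether a chat history exists. The wording and field names are illustrative, not the exact prompt I iterated on in AI Studio:

```typescript
// Simplified sketch of a state-aware prompt. The real prompt refined in
// AI Studio is longer; the wording here is illustrative only.
function buildPrompt(userMessage: string, chatHistory: string[]): string {
  const isFollowUp = chatHistory.length > 0;
  return [
    "You are a household repair assistant. Analyze the attached photo.",
    isFollowUp
      ? "This is a FOLLOW-UP question in an existing conversation. " +
        "Answer conversationally in plain language, grounded in the photo " +
        "and the conversation so far:\n" + chatHistory.join("\n")
      : "This is a NEW problem. Respond ONLY with JSON containing: " +
        "diagnosis, toolsAndMaterials, difficulty, estimatedTime, steps, pitfalls.",
    "User message: " + userMessage,
  ].join("\n\n");
}
```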
Multimodal Features
My app is built from the ground up on Gemini's powerful multimodal capabilities, specifically combining image/video analysis with contextual text.
The core feature is the AI's ability to ground its entire analysis and conversation in the visual information provided by the user. When it provides instructions, it's referring directly to the components it can "see" in the photo.
The most advanced multimodal feature is the conversational repair logic. The AI must hold the initial image/video in context while processing the new, text-based questions from the chat_history. This creates an experience where a user can essentially have a conversation about a physical object, asking questions like "Are you sure I should turn that blue knob?" and getting an intelligent response. This deep integration of visual data and conversational text is what makes the app so uniquely helpful.
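As a rough sketch of how the image and the ongoing conversation can travel in a single request (assuming the @google/generative-ai JavaScript SDK; the model name and wiring in the deployed app may differ):

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Sketch: send the original photo plus the follow-up question in one request,
// so the model answers with the physical object still in view.
// (Assumes the @google/generative-ai SDK; model name is illustrative.)
async function askFollowUp(
  apiKey: string,
  imageBase64: string,    // the user's original photo, base64-encoded
  chatHistory: string[],  // prior questions and answers, as plain text
  question: string        // e.g. "Are you sure I should turn that blue knob?"
): Promise<string> {
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

  const result = await model.generateContent([
    { inlineData: { data: imageBase64, mimeType: "image/jpeg" } },
    "Conversation so far:\n" + chatHistory.join("\n"),
    "Follow-up question: " + question,
  ]);
  return result.response.text();
}
```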