This is a submission for the Google AI Studio Multimodal Challenge.
What I Built
Whiteboard Wizard is an AI-powered applet designed to bridge the gap between analog brainstorming and digital development. It solves the common problem of valuable technical diagrams being trapped on physical whiteboards, where they are difficult to share, edit, and analyze.
The application allows a user to upload a photograph of a hand-drawn diagram and supplement it with an audio recording where they narrate the context, explain the logic, or describe a problem they are facing. Whiteboard Wizard then uses a multimodal AI model to:
- Digitize the drawing into clean, editable Mermaid syntax.
- Analyze the logical flow and architecture described in both the diagram and the narration.
- Provide a detailed textual analysis, identifying potential errors, inefficiencies, or areas for improvement.
- Suggest concrete solutions, including updated diagrams and code snippets, to help the user debug and refine their ideas.
Essentially, it acts as an expert pair-programmer, instantly transforming a static image and spoken thoughts into an interactive, digital, and actionable development tool.
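To make the output concrete, here is a hedged sketch of what one response might look like once parsed. The three field names (mermaidCode, analysis, suggestions) are the ones the app actually returns; the diagram and analysis text below are invented purely for illustration.

```python
import json

# Hypothetical example of a single Whiteboard Wizard response;
# the diagram content and analysis text are invented for illustration.
sample_response = json.dumps({
    "mermaidCode": (
        "flowchart TD\n"
        "    A[User submits form] --> B{Input valid?}\n"
        "    B -- Yes --> C[Save to database]\n"
        "    B -- No --> D[Show error message]"
    ),
    "analysis": "The flow lacks a retry path after validation failure.",
    "suggestions": ["Route the error state back to the form for correction."],
})

result = json.loads(sample_response)
print(result["mermaidCode"].splitlines()[0])  # header line of the digitized diagram
```

Because the diagram arrives as plain Mermaid text, the user can immediately edit it or paste it into any Mermaid-aware tool.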
Demo
Live Link: Click Here
How I Used Google AI Studio
Google AI Studio was instrumental in prototyping and refining the core multimodal prompt that powers Whiteboard Wizard. The application's success hinges on its ability to receive an image, an audio recording, and a text prompt, and return a perfectly structured JSON object.
- Prompt Engineering: I used AI Studio's iterative environment to craft a detailed prompt for the gemini-2.5-flash model. This involved defining the AI's persona as an "expert software architect," outlining its specific tasks, and providing critical rules for generating valid Mermaid syntax—including examples of what not to do.
- Multimodal Input Testing: AI Studio was perfect for testing how the model would interpret various combinations of images and audio files. This allowed me to quickly see how different diagram styles or narration qualities affected the output quality.
- Structured Output (JSON Schema): The most critical capability I leveraged was Gemini's JSON mode. I designed a responseSchema and tested it extensively in AI Studio to ensure the model would consistently return the required mermaidCode, analysis, and suggestions fields. This reliability is key to parsing the response and rendering the results in the UI without errors.
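As a rough sketch of that last point, the schema below mirrors the three required fields, written in the OpenAPI-style type dialect that Gemini's JSON mode accepts. This is an assumed shape for illustration, not the production responseSchema, and the validator is just a defensive client-side check before rendering.

```python
# Sketch of a responseSchema for Gemini's JSON mode. The field names
# match the ones the app depends on; everything else is an assumption.
RESPONSE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "mermaidCode": {"type": "STRING"},
        "analysis": {"type": "STRING"},
        "suggestions": {"type": "ARRAY", "items": {"type": "STRING"}},
    },
    "required": ["mermaidCode", "analysis", "suggestions"],
}

def is_well_formed(payload: dict) -> bool:
    """Check the parsed response before rendering: JSON mode is reliable,
    but the UI should still fail gracefully if a field is missing."""
    return all(key in payload for key in RESPONSE_SCHEMA["required"])
```

Testing the schema against deliberately malformed responses in AI Studio is much faster than discovering a parsing failure in the deployed UI.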
Multimodal Features
Whiteboard Wizard is built around a core multimodal feature: the fusion of visual (image) and auditory (audio) inputs to generate a comprehensive analysis. This enhances the user experience in several crucial ways:
- Contextual Understanding: A diagram by itself lacks intent. The user's audio narration provides the critical "why" behind the "what." It allows the user to explain their goals, point out areas of concern, and ask specific questions. The AI uses this context to provide analysis and suggestions that are highly relevant to the user's actual problem, rather than just performing a generic transcription.
- Natural and Efficient Interaction: The workflow mimics how developers collaborate in the real world—by pointing to a diagram and talking through it. This is a far more natural and faster way to convey complex information than writing a lengthy text description to accompany an image.
- Deeper, More Accurate Analysis: By processing both modalities simultaneously, the AI gains a much deeper understanding of the user's work. It can correlate a specific shape on the diagram with a concept the user describes in the audio, leading to more insightful and accurate debugging. For example, if a user mentions "scalability concerns" in the audio while pointing to a database symbol in the diagram, the AI can specifically look for and flag potential bottlenecks in that part of the architecture.
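One way to fuse the two modalities is to send the photo and the narration as inline parts of a single generateContent request. The sketch below assembles a REST-style request body with plain dicts and base64 data; the part structure follows the public Gemini API, but the MIME types, prompt text, and helper name are placeholders, not the app's actual code.

```python
import base64

def build_request_body(image_bytes: bytes, audio_bytes: bytes, prompt: str) -> dict:
    """Assemble one multimodal turn: drawing + narration + instructions."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inline_data": {"mime_type": "image/png",
                                 "data": base64.b64encode(image_bytes).decode()}},
                {"inline_data": {"mime_type": "audio/mp3",
                                 "data": base64.b64encode(audio_bytes).decode()}},
                {"text": prompt},
            ],
        }],
    }

# Placeholder bytes stand in for a real photo and recording.
body = build_request_body(
    b"\x89PNG...", b"ID3...",
    "Digitize this diagram and flag scalability risks the narration mentions.",
)
```

Because both modalities travel in the same turn, the model can resolve references like "this database" in the audio against the shapes in the image.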