This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I've created an application called "Reverse Engineering Reality." It's a creative tool that lets users upload a photo of any everyday object and receive a detailed, imaginative, AI-generated set of instructions for either assembling it from scratch or disassembling it.
The app turns curiosity into creativity: it transforms the passive observation of an object ("I wonder how that's made?") into an active, engaging, and educational experience. It gives users a fictional "blueprint" for the world around them, complete with materials, tools, step-by-step guides, and custom illustrations, fostering a deeper appreciation for design and engineering.
Demo
Try out the applet here on a deployed Cloud Run instance.
How I Used Google AI Studio
This app is built directly on the Gemini API, the same technology that powers Google AI Studio. The development process mirrors the iterative prompting and schema design one would perform in the Studio.
Here's how I leveraged its capabilities:
Model Selection: I primarily use gemini-2.5-flash for its speed and powerful reasoning capabilities, which are perfect for analyzing images, generating structured text, and powering the chat assistant. For image generation, I use imagen-4.0-generate-001.
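For reference, the model wiring is roughly the following. This is a minimal sketch assuming the @google/genai JS SDK; the constant names and environment variable are my own:

```typescript
import { GoogleGenAI } from "@google/genai";

// One client serves both models. The env var name is an assumption.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// gemini-2.5-flash: image analysis, blueprint generation, and chat.
// imagen-4.0-generate-001: the per-step illustrations.
const TEXT_MODEL = "gemini-2.5-flash";
const IMAGE_MODEL = "imagen-4.0-generate-001";
```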
Structured Output (JSON Mode): This is a critical feature. I provide the Gemini model with a strict JSON schema to ensure the output for the instructions (object name, materials, tools, steps, etc.) and object detection (bounding boxes) is predictable and machine-readable. This allows me to easily parse the AI's response and render it into a structured, user-friendly interface without complex string manipulation.
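A minimal sketch of that setup with the @google/genai SDK; the schema here is a trimmed, illustrative version of the real one:

```typescript
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Trimmed-down blueprint schema; field names are illustrative.
const blueprintSchema = {
  type: Type.OBJECT,
  properties: {
    objectName: { type: Type.STRING },
    materials: { type: Type.ARRAY, items: { type: Type.STRING } },
    tools: { type: Type.ARRAY, items: { type: Type.STRING } },
    steps: {
      type: Type.ARRAY,
      items: {
        type: Type.OBJECT,
        properties: {
          description: { type: Type.STRING },
          imagePrompt: { type: Type.STRING },
        },
        required: ["description", "imagePrompt"],
      },
    },
  },
  required: ["objectName", "materials", "tools", "steps"],
};

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Generate an assembly blueprint for the selected object.",
  config: {
    responseMimeType: "application/json",
    responseSchema: blueprintSchema,
  },
});

// Because the output is schema-constrained, it parses directly.
const blueprint = JSON.parse(response.text ?? "{}");
```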
System Instructions: I use system instructions to set the context for the AI. For instruction generation, the AI is prompted to act as an "expert reverse engineer and master craftsman." For the chat feature, it's prompted to be a helpful "AI Assembly Assistant," ensuring its responses are focused on the provided blueprint.
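In the SDK, a system instruction is just a config field. A sketch, with the persona paraphrased from the prompts described above:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Produce a disassembly blueprint for this object.",
  config: {
    // Persona paraphrased from the app's instruction-generation prompt.
    systemInstruction:
      "You are an expert reverse engineer and master craftsman. " +
      "Write imaginative but plausible blueprints for everyday objects.",
  },
});
console.log(response.text);
```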
Chat Functionality: The app uses the Gemini API's chat capabilities (ai.chats.create) to create a conversational assistant that has memory of the generated instructions, allowing users to ask follow-up questions in a natural way.
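A sketch of that pattern, assuming the generated blueprint is serialized and injected as the opening turn (the helper name and history wording are my own):

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Seed the chat with the blueprint so follow-up questions have context.
function createAssemblyAssistant(blueprintJson: string) {
  return ai.chats.create({
    model: "gemini-2.5-flash",
    config: {
      systemInstruction:
        "You are a helpful AI Assembly Assistant. Answer questions " +
        "strictly about the provided blueprint.",
    },
    history: [
      { role: "user", parts: [{ text: `Here is the blueprint:\n${blueprintJson}` }] },
      { role: "model", parts: [{ text: "Got it. Ask me anything about these steps." }] },
    ],
  });
}

const chat = createAssemblyAssistant('{"objectName":"stapler","steps":[]}');
const reply = await chat.sendMessage({ message: "What tools does step 3 need?" });
console.log(reply.text);
```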
Multimodal Features
The app is fundamentally multimodal, combining image and text inputs and outputs to create a rich, interactive experience.
Image-to-Text (Core Analysis): The primary multimodal feature is the app's ability to understand an image uploaded by the user. It takes visual data (a photo of an object) and outputs structured text (a JSON object containing the full assembly/disassembly blueprint). This demonstrates a deep visual reasoning capability.
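Under the hood this is a single multimodal generateContent call; a sketch assuming a base64-encoded JPEG upload:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// `base64Photo` is the user's upload; the JPEG mime type is an assumption.
async function analyzeObject(base64Photo: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/jpeg", data: base64Photo } },
      { text: "Create a step-by-step assembly blueprint for this object." },
    ],
  });
  // In the real app, this call also carries the JSON schema config shown earlier.
  return response.text;
}
```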
Object Detection from Image: Before generating instructions, the app first analyzes the image to identify and locate distinct objects, returning their names and bounding box coordinates. This is another form of image-to-text functionality that enhances user control by allowing them to select the specific object of interest.
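The detection pass reuses the structured-output approach. A sketch with illustrative field names; Gemini's documented convention is to return boxes as [ymin, xmin, ymax, xmax] normalized to 0-1000:

```typescript
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// One entry per detected object; `box2d` follows Gemini's 0-1000 convention.
const detectionSchema = {
  type: Type.ARRAY,
  items: {
    type: Type.OBJECT,
    properties: {
      label: { type: Type.STRING },
      box2d: { type: Type.ARRAY, items: { type: Type.INTEGER } },
    },
    required: ["label", "box2d"],
  },
};

async function detectObjects(base64Photo: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/jpeg", data: base64Photo } },
      { text: "List every distinct object with its name and 2D bounding box." },
    ],
    config: {
      responseMimeType: "application/json",
      responseSchema: detectionSchema,
    },
  });
  return JSON.parse(response.text ?? "[]");
}
```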
Text-to-Image (Illustrations): To make the instructions more intuitive and engaging, the app uses a powerful text-to-image workflow. For each step in the generated blueprint, the AI also creates a descriptive imagePrompt (text). This text is then fed to the imagen-4.0-generate-001 model to generate a custom, diagram-style illustration for that specific step. This combination—analyzing an image to produce text, then using that text to create a new image—is a sophisticated multimodal pipeline that greatly enhances the final product.
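The Imagen half of that pipeline is a single call per step. A sketch; the diagram-style prompt prefix is my own guess at how the app keeps illustrations consistent:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Turn a step's generated imagePrompt into an illustration.
async function illustrateStep(imagePrompt: string) {
  const result = await ai.models.generateImages({
    model: "imagen-4.0-generate-001",
    prompt: `Clean instruction-manual diagram: ${imagePrompt}`,
    config: { numberOfImages: 1 },
  });
  // Base64 image bytes, ready for a data: URL in the UI.
  return result.generatedImages?.[0]?.image?.imageBytes;
}
```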
Together, these features allow a user to seamlessly translate a real-world object into a fully illustrated, interactive digital guide.