This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I built the AI Halloween Costume Generator, a comprehensive web application designed to solve the age-old problem of choosing the perfect Halloween costume. It acts as a creative partner, transforming vague ideas or even simple images into complete, ready-to-make costume guides.
The app provides a multi-faceted approach to inspiration:
Search: Users can type in any theme or idea (e.g., "costumes for my dog," "spooky sci-fi ideas") and receive five distinct, fully-detailed costume concepts to compare and choose from.
Generate from Image: Users can upload a photo of an object, a person, or a pet, and the AI will generate a unique costume concept based on the visual input.
Surprise Me!: For users who are completely stumped, a "Surprise Me!" button generates a totally random and creative idea out of the blue.
Once an idea is selected, the app provides a complete DIY guide, including a list of materials, an estimated cost and difficulty level, and most importantly, detailed step-by-step instructions with custom-generated, additive illustrations that show the costume coming together.
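The guide data described above can be modeled as a simple structure. This is a sketch for illustration only — the field names are my assumptions, not the app's actual schema:

```typescript
// Sketch of the data shape for one generated costume guide.
// Field names are illustrative assumptions, not the app's actual schema.
interface CostumeStep {
  instruction: string;  // step-by-step text for this stage of the build
  imageUrl?: string;    // additive illustration generated for this step
}

interface CostumeGuide {
  name: string;
  description: string;
  materials: string[];
  estimatedCost: string;                  // e.g. "$15-$30"
  difficulty: "Easy" | "Medium" | "Hard";
  steps: CostumeStep[];
}

// Example instance, as a search result or image upload might produce:
const example: CostumeGuide = {
  name: "Cardboard Robot",
  description: "A retro robot built from boxes and foil.",
  materials: ["cardboard boxes", "aluminum foil", "hot glue"],
  estimatedCost: "$15",
  difficulty: "Easy",
  steps: [{ instruction: "Cut arm and head holes in the large box." }],
};
```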
Demo
To try it out: Cloud Run Demo link
How I Used Google AI Studio
I leveraged a suite of multimodal models from the Gemini API to power the application's core features, orchestrating them to create a seamless user experience.
gemini-2.5-flash: This was the primary model for all text and structured data generation. I used it with a strictly defined responseSchema to ensure the AI's output was always in a predictable JSON format. This model was responsible for:
Generating the costume's name, description, materials, and detailed text instructions.
Powering the search feature by creating five distinct costume concepts from a single prompt.
Handling the conversational "Refine" feature, where it would modify a costume based on follow-up user input.
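A responseSchema for this kind of structured output might look like the sketch below. It is written as a plain object so it runs without the SDK; with @google/genai you would express the types with the SDK's Type enum and pass the schema via config.responseSchema alongside responseMimeType: "application/json". The property names are my assumptions for illustration:

```typescript
// Sketch of a responseSchema for the search feature, which asks the model
// for an array of distinct costume concepts in predictable JSON.
// Property names are illustrative assumptions, not the app's actual schema.
const costumeSchema = {
  type: "ARRAY",
  items: {
    type: "OBJECT",
    properties: {
      name: { type: "STRING" },
      description: { type: "STRING" },
      materials: { type: "ARRAY", items: { type: "STRING" } },
      estimatedCost: { type: "STRING" },
      difficulty: { type: "STRING" },
      instructions: { type: "ARRAY", items: { type: "STRING" } },
    },
    required: ["name", "description", "materials", "instructions"],
  },
};
```

Constraining the output this way means the app can parse the model's response directly into UI components without brittle text scraping.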
imagen-4.0-generate-001: This powerful image generation model was used to create the crucial first image for each set of instructions, establishing the visual foundation for the step-by-step guide.
gemini-2.5-flash-image-preview: This versatile image editing model was the key to creating the app's most unique feature. It was used to generate all subsequent instruction images by taking the previous step's image as input and adding the new details described in the current step's text.
Multimodal Features
The app is built around two core multimodal functionalities that create a rich and intuitive user experience.
Vision Understanding: Image to Costume Idea
The ability for a user to upload an image and receive a relevant costume idea is a powerful multimodal feature. It goes beyond simple text prompts by allowing for visual context. A user can upload a picture of their pet, a favorite object, or a friend, and the AI can creatively interpret that visual data to generate a highly personalized and often unexpected costume concept. This makes the brainstorming process more personal and engaging.
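An image-to-idea request of this kind is assembled as multimodal content parts: one inline image plus one text prompt. The part shape below follows the Gemini API's inlineData format, but the helper function and prompt wording are my own sketch, not the app's code:

```typescript
// Build a multimodal "contents" payload for an image-to-costume request:
// an inline base64 image part followed by a text instruction part.
// The helper name and prompt text are illustrative assumptions.
function buildImagePrompt(base64Image: string, mimeType: string) {
  return {
    parts: [
      { inlineData: { mimeType, data: base64Image } },
      { text: "Suggest a creative Halloween costume inspired by this image." },
    ],
  };
}

// Usage: pass the encoded upload and its MIME type.
const request = buildImagePrompt("aGVsbG8=", "image/jpeg");
```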
Additive Image Generation: A Cohesive Visual Guide
The app's standout feature is its ability to create a set of instruction images that build upon one another. Instead of generating a new, disconnected image for each step, the system uses an iterative, multimodal process:
- Step 1: Generate a base image from a text prompt.
- Step 2+: Feed the image from the previous step plus the text for the current step into the image editing model (gemini-2.5-flash-image-preview).
This creates a coherent visual narrative, letting the user watch the costume come together from one image to the next. It makes the instructions far easier to follow than a series of isolated diagrams, and it transforms the app from a simple idea generator into a true step-by-step visual crafting guide.
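The iterative process above can be sketched as a loop that threads each generated image into the next editing call. Here the model calls are injected as functions so the orchestration runs standalone: generateBase and editWithPrevious stand in for calls to imagen-4.0-generate-001 and gemini-2.5-flash-image-preview, and their names are my assumptions:

```typescript
type ImageData = string; // e.g. a base64-encoded PNG

// Sketch of the additive image pipeline. The two injected functions are
// placeholders for the real model calls (names are illustrative assumptions).
async function buildInstructionImages(
  stepTexts: string[],
  generateBase: (prompt: string) => Promise<ImageData>,
  editWithPrevious: (prev: ImageData, stepText: string) => Promise<ImageData>,
): Promise<ImageData[]> {
  const images: ImageData[] = [];
  // Step 1: generate the base image from the first step's text alone.
  let current = await generateBase(stepTexts[0]);
  images.push(current);
  // Steps 2+: feed the previous image plus the current step's text.
  for (const text of stepTexts.slice(1)) {
    current = await editWithPrevious(current, text);
    images.push(current);
  }
  return images;
}
```

Because each call receives the previous step's output, details accumulate across the sequence instead of being re-imagined from scratch at every step.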