This is a submission for the Built with Google Gemini: Writing Challenge
What I Built with Google Gemini
I built an AI Halloween Costume Generator because picking a costume is harder than it should be. You either spend hours scrolling through the same generic ideas online, or you have a vague concept but no clue how to actually make it. I wanted something that could take any input (a text idea, a photo, or literally nothing) and turn it into a complete DIY guide with materials, steps, and visuals.
The app works three ways. You can search for costume ideas by typing anything ("costumes for my dog," "spooky sci-fi"), and it gives you five different concepts to pick from. You can upload a photo of an object, person, or pet, and it generates a costume based on that image. Or if you're completely stuck, there's a "Surprise Me" button that creates something random.
Once you pick an idea, you get a full breakdown with materials, estimated cost, difficulty level, and step-by-step instructions. The interesting part is the visuals. Instead of generic stock photos or disconnected diagrams, each instruction step has an image that builds on the previous one. You literally watch the costume come together piece by piece.
I used three different Gemini models for this. gemini-2.5-flash handles all the text generation and structured data: costume names, descriptions, materials lists, and instructions. I defined a JSON schema so the output is always consistent and easy to work with. imagen-4.0-generate-001 creates the first image for each costume guide. Then gemini-2.5-flash-image-preview does something clever: it takes the previous step's image and adds the new elements described in the current step. So instead of generating five separate images, the app builds one image progressively.
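To make the structured-output idea concrete, here's a minimal sketch of what the setup might look like with the google-genai Python SDK. The schema fields and the `generate_costume_guide` helper are my illustrative assumptions, not the app's actual code:

```python
# Hypothetical response schema for one costume guide; the field names
# are illustrative assumptions, not the app's real schema.
COSTUME_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "difficulty": {"type": "string", "enum": ["easy", "medium", "hard"]},
        "estimated_cost_usd": {"type": "number"},
        "materials": {"type": "array", "items": {"type": "string"}},
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "description", "materials", "steps"],
}

def generate_costume_guide(client, idea: str) -> dict:
    """Ask gemini-2.5-flash for a guide constrained to COSTUME_SCHEMA.

    Sketch only: assumes `client` is a google-genai Client instance.
    """
    import json
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Create a DIY Halloween costume guide for: {idea}",
        config={
            "response_mime_type": "application/json",
            "response_schema": COSTUME_SCHEMA,
        },
    )
    # With a schema attached, response.text is guaranteed-parseable JSON.
    return json.loads(response.text)
```

Because the model is forced to match the schema, the UI can bind directly to `guide["materials"]` and `guide["steps"]` without any text parsing.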
That additive image generation was the hardest part to get right. The model needs to understand what's already in the image, what the text is asking it to add, and how to blend them naturally. It took experimentation to figure out the right prompts and image parameters, but when it works, it makes the instructions way easier to follow than text alone.
Demo
What I Learned
I learned that multimodal doesn't just mean "uses images and text." It's about how those modes interact. The additive image feature only works because the model can see the previous image and understand the text instruction at the same time. That's different from just generating images from text prompts.
Structured outputs made this project manageable. Without the JSON schema, I'd be parsing freeform text and dealing with inconsistent formats. With it, I know exactly what I'm getting back every time, which makes building a UI straightforward.
The search feature taught me something about prompt design. Asking for "five costume ideas" in one call is way more efficient than making five separate calls, and the results are actually more diverse because the model can differentiate them in context.
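Batching works especially well with a schema that pins the result to an array of exactly five items. Again a sketch with assumed field names, not the app's real code:

```python
# Asking for all five ideas in one call: an array schema with a fixed
# length. Field names here are illustrative assumptions.
IDEAS_SCHEMA = {
    "type": "array",
    "minItems": 5,
    "maxItems": 5,
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "one_line_pitch": {"type": "string"},
        },
        "required": ["name", "one_line_pitch"],
    },
}

def search_ideas(client, query: str) -> list[dict]:
    """One gemini-2.5-flash call returning five distinct concepts.

    Sketch only: assumes `client` is a google-genai Client instance.
    """
    import json
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=(
            f"Suggest five costume concepts for: {query}. "
            "Make each one clearly different from the others."
        ),
        config={
            "response_mime_type": "application/json",
            "response_schema": IDEAS_SCHEMA,
        },
    )
    return json.loads(response.text)
```

One call also means one round trip and one prompt's worth of shared context, which is where the extra diversity comes from.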
Google Gemini Feedback
The image editing model (gemini-2.5-flash-image-preview) was the star here. Being able to iteratively build on an image is powerful and not something I've seen done well elsewhere. It worked better than I expected for this use case.
Structured outputs continue to be essential. They're the difference between a prototype and something you can actually ship.
The friction came from tuning the image generation. Sometimes the additive steps would drift from the original concept, or the model would reinterpret elements instead of just adding to them. I had to be specific in the prompts about what to preserve and what to add. It wasn't a model limitation as much as figuring out how to communicate clearly with it.
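The "communicate clearly" part mostly meant spelling out, in every edit prompt, what must stay untouched and what gets added. A tiny illustrative helper (my own wording, not the prompts the app actually shipped with):

```python
def additive_edit_prompt(step_text: str, preserve: list[str]) -> str:
    """Compose an edit prompt that names what to keep and what to add.

    Purely illustrative: the real prompts took experimentation, but the
    structure was the same: preserve-list first, then the single addition.
    """
    kept = "; ".join(preserve)
    return (
        f"Keep these elements exactly as they appear: {kept}. "
        "Do not reinterpret, restyle, or move them. "
        f"Add only the following new elements: {step_text}."
    )

# Example: step 3 of a wizard costume, preserving the earlier steps' work.
prompt = additive_edit_prompt(
    "a felt wizard hat with a wide brim",
    ["the cardboard base", "the painted silver stars"],
)
```

Being explicit about the preserve-list is what kept later steps from drifting away from the original concept.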
One thing I'd like is better control over image composition in the editing model. Being able to specify regions or layers would make the additive process more predictable. But overall, the multimodal capabilities let me build something I couldn't have built otherwise.

