Prompt-to-Puzzle: Generating Infinite 'Spot the Difference' Games with Gemini

#devchallenge #googleaichallenge #ai #gemini

Google AI Challenge Submission

What I Built

I built Prompt-to-Puzzle, a web application that turns a text description into a playable 'Spot the Difference' game using a two-stage multimodal AI pipeline.

Instead of drawing on a static library of pre-made puzzles, Prompt-to-Puzzle lets the user choose the scene. Describe anything you can imagine, such as "A futuristic city with flying cars at sunset" or "A cozy cat cafe on a rainy day", and the app generates a brand-new game board in seconds.

The core idea was to create a truly generative experience where AI acts as a creative partner. The app uses a two-step AI process:

1. Imagen 4 generates a high-quality base image from the user's text prompt.
2. Gemini 2.5 Flash Image then takes that base image and a second prompt, intelligently and subtly altering it to create the second "different" image.

Finally, the app applies classic computer vision techniques on the client side, using JavaScript and the Canvas API, to detect these differences algorithmically, turning the AI-generated art into a fully interactive game.
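To make the detection step more concrete, here is a minimal sketch of one way it could work. It is not the app's actual code: the function names `diffMask` and `findRegions`, the thresholds, and the choice of a per-pixel difference followed by a flood fill are all assumptions made for illustration. In the browser, the two pixel buffers would come from `getImageData` on a canvas that holds each image.

```javascript
// Sketch: locate changed regions between two same-sized RGBA pixel buffers.
// In the browser, each buffer would come from
// ctx.getImageData(0, 0, width, height).data on a canvas holding one image.

// Mark every pixel whose summed R+G+B difference exceeds `threshold`.
function diffMask(a, b, width, height, threshold = 60) {
  const mask = new Uint8Array(width * height);
  for (let i = 0; i < mask.length; i++) {
    const p = i * 4; // 4 bytes per pixel: R, G, B, A
    const delta =
      Math.abs(a[p] - b[p]) +
      Math.abs(a[p + 1] - b[p + 1]) +
      Math.abs(a[p + 2] - b[p + 2]);
    if (delta > threshold) mask[i] = 1;
  }
  return mask;
}

// Group marked pixels into 4-connected blobs and return their bounding boxes.
function findRegions(mask, width, height, minPixels = 4) {
  const seen = new Uint8Array(mask.length);
  const regions = [];
  for (let start = 0; start < mask.length; start++) {
    if (!mask[start] || seen[start]) continue;
    let minX = width, minY = height, maxX = -1, maxY = -1, pixels = 0;
    const stack = [start];
    seen[start] = 1;
    while (stack.length > 0) {
      const idx = stack.pop();
      const x = idx % width;
      const y = (idx - x) / width;
      pixels++;
      minX = Math.min(minX, x);
      maxX = Math.max(maxX, x);
      minY = Math.min(minY, y);
      maxY = Math.max(maxY, y);
      const neighbours = [];
      if (x > 0) neighbours.push(idx - 1);
      if (x < width - 1) neighbours.push(idx + 1);
      if (y > 0) neighbours.push(idx - width);
      if (y < height - 1) neighbours.push(idx + width);
      for (const n of neighbours) {
        if (mask[n] && !seen[n]) {
          seen[n] = 1;
          stack.push(n);
        }
      }
    }
    // Drop tiny blobs, which are more likely noise than a real edit.
    if (pixels >= minPixels) {
      regions.push({
        x: minX,
        y: minY,
        width: maxX - minX + 1,
        height: maxY - minY + 1,
        pixels,
      });
    }
  }
  return regions;
}
```

Calling `findRegions(diffMask(a, b, w, h), w, h)` returns one bounding box per changed area. With real AI-edited images, the threshold and minimum blob size would likely need tuning, since an edited image can differ slightly from the base image even outside the intended changes.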

Demo

Deployed Applet Link: https://prompt-to-puzzle-1010886538823.us-west1.run.app/

Screenshots & Video

Here’s a look at the app in action, from generation to gameplay.

Generating a New Game: The user simply types a prompt to kick off the AI generation pipeline.
The Generated Game Board: The app presents the two images side-by-side, ready for the user to find the differences.
The Manual Editor: For perfect results, users can enter an editor mode to fine-tune the clickable "difference" regions found by the algorithm.
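For readers curious how clickable regions might behave during play, the sketch below shows one plausible approach. The region shape `{ x, y, width, height }` and the helpers `toImageCoords` and `hitTest` are hypothetical and not taken from the app. The idea is to map a click on the scaled, on-screen image back to image pixels, then check it against each region with a little padding so small differences stay clickable.

```javascript
// Convert a click position (viewport coordinates) into image pixel coordinates.
// `rect` has the shape returned by element.getBoundingClientRect().
function toImageCoords(clientX, clientY, rect, imageWidth, imageHeight) {
  return {
    x: ((clientX - rect.left) / rect.width) * imageWidth,
    y: ((clientY - rect.top) / rect.height) * imageHeight,
  };
}

// Return the index of the first region containing the point, or -1 for a miss.
// `padding` (in image pixels) enlarges each region to forgive imprecise clicks.
function hitTest(regions, x, y, padding = 8) {
  return regions.findIndex(
    (r) =>
      x >= r.x - padding &&
      x <= r.x + r.width + padding &&
      y >= r.y - padding &&
      y <= r.y + r.height + padding
  );
}
```

An editor mode like the one described above could then amount to letting the user move, resize, add, or delete entries in the `regions` array before the game starts.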

How I Used Google AI Studio

Google AI Studio was the central hub for the entire AI development process.

Prototyping and Prompt Engineering: Before writing a single line of application code, I used the AI Studio playground extensively. I experimented with different prompts for Imagen 4 to understand how to generate clear, detailed base images. More importantly, I spent a lot of time crafting the perfect prompt for Gemini 2.5 Flash Image. The goal was to instruct it to act as an "image editor" and make 3-5 structural changes (like adding or removing an object) while explicitly avoiding simple color or brightness shifts that wouldn't make for a good game. AI Studio's rapid feedback loop was essential for this.
API Integration: Once I had prompts that consistently produced great results, I used the "Get Code" feature in AI Studio to get the boilerplate API call code. This made the transition from prototype to the React application seamless.
Model Selection: AI Studio made it easy to browse and compare models. I chose imagen-4.0-generate-001 for its text-to-image quality and gemini-2.5-flash-image-preview because it is fast and can edit an image from combined image and text input.
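As a rough illustration of the first API call, here is a sketch of the Imagen request. It assumes the `@google/genai` JavaScript SDK and a client created with `new GoogleGenAI({ apiKey })`. The helper name `generateBaseImage` is hypothetical, and the app's real code, which came from AI Studio's "Get Code" output, may differ. The client is passed in as a parameter so the function can be exercised with a stub.

```javascript
// Assumes `ai` is a client from the @google/genai SDK, created elsewhere, e.g.:
//   import { GoogleGenAI } from "@google/genai";
//   const ai = new GoogleGenAI({ apiKey: "YOUR_API_KEY" });

// Ask Imagen for a single base image and return it as a base64 string.
async function generateBaseImage(ai, scenePrompt) {
  const response = await ai.models.generateImages({
    model: "imagen-4.0-generate-001",
    prompt: scenePrompt,
    config: { numberOfImages: 1 },
  });
  const first = response.generatedImages?.[0];
  if (!first?.image?.imageBytes) {
    throw new Error("Imagen returned no image");
  }
  return first.image.imageBytes; // base64-encoded image data
}
```

In a deployed app, a call like this would normally run behind a server or proxy so the API key is not exposed to the browser.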

Multimodal Features

This applet is built entirely around a creative, two-stage multimodal pipeline. The specific features used are text-to-image generation and image-and-text-to-image modification.

Initial Scene Creation (Text-to-Image with Imagen 4)

The experience begins when the user's text prompt is sent to the Imagen 4 model. This is the first multimodal step, translating a linguistic concept ("A whimsical fantasy library with floating books") into a rich, visual representation. This forms the canvas and foundation for our game.

Intelligent Difference Generation (Image-and-Text-to-Image with Gemini 2.5 Flash Image)

This is the core of the app's magic. We provide Gemini 2.5 Flash Image with two distinct inputs:
The base image generated by Imagen.
A text prompt that instructs it to modify the image in specific ways.

The prompt is the secret sauce, telling the model: "Here is an image. Please make 3 to 5 significant but subtle changes. Add a new object, remove an existing one, or alter the structure of something. Do not just change colors or textures."

This image-plus-text reasoning is what makes the applet possible. Gemini doesn't just see pixels; it understands the content of the image and can execute complex edit commands based on natural language. This enhances the user experience by creating an endless stream of novel and surprising puzzles that feel hand-crafted, turning a simple text idea into a complete, interactive experience.
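The image-plus-text request described in this section could look roughly like the sketch below. It assumes the `@google/genai` SDK, with the client passed in as `ai`. The helper names `buildEditPrompt` and `generateModifiedImage` are hypothetical. The prompt text is adapted from the wording quoted above rather than copied from the app, and the `image/png` default is an assumption about the base image's format.

```javascript
// Build the edit instruction. The wording follows the prompt quoted in the
// article; the exact production prompt is not shown there.
function buildEditPrompt(minChanges = 3, maxChanges = 5) {
  return (
    `Here is an image. Please make ${minChanges} to ${maxChanges} ` +
    "significant but subtle changes. Add a new object, remove an existing " +
    "one, or alter the structure of something. " +
    "Do not just change colors or textures."
  );
}

// Send the base image plus the instruction; return the edited image as base64.
// `ai` is assumed to be a GoogleGenAI client from the @google/genai SDK.
async function generateModifiedImage(ai, baseImageBase64, mimeType = "image/png") {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-image-preview",
    contents: [
      { inlineData: { mimeType, data: baseImageBase64 } },
      { text: buildEditPrompt() },
    ],
  });
  // The reply can mix text and image parts; pick the first image part.
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  const imagePart = parts.find((part) => part.inlineData?.data);
  if (!imagePart) {
    throw new Error("Gemini returned no image part");
  }
  return imagePart.inlineData.data;
}
```

Because the model can reply with text only, checking for a missing image part lets the app retry or show an error instead of rendering a broken board.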
