This is a submission for the Google AI Studio Multimodal Challenge
What I Built
Wordsketcher is an interactive app that transforms words into images. Users place words on a digital canvas, and their arrangement serves as a compositional guide for an AI-powered image generator. Designed for language learners, it helps connect a word's form to its meaning through imagery, leveraging AI's multimodal capabilities.
The app offers three creative modes:
Challenge Mode: Provides a specific scene to create (e.g., "a cozy house under a smiling sun") with a predefined set of words.
Topic Mode: Offers a theme (e.g., "At the Beach") and a bank of related words for users to build a scene.
Freeform Mode: Gives users a blank canvas to add any words they like, offering complete creative freedom.
Demo
Here is a short video of the project in action:
https://youtu.be/Oif1XgAlGRU?feature=shared
The link to the deployed app: https://wordsketcher-286603958397.us-west1.run.app/
The app is simple to use and works across all devices.
How I Used Google AI Studio
From start to finish, the app was developed and deployed entirely in Google AI Studio. The application leverages the Gemini API for two core intelligent features:
AI Image Generation (imagen-4.0-generate-001): This is the app's central feature. When a user clicks "Sketch it!", the application constructs a highly detailed text prompt that is sent to the Gemini API. This prompt intelligently combines the base prompt (from the selected challenge or topic), the user-selected art style (like 'Sketchbook' or 'Watercolor'), and, most importantly, a list of all words on the canvas. For each word, its position is translated into a descriptive location (e.g., "the word 'sun' should appear generally in the top-right"), effectively turning the visual layout into a set of compositional instructions for the AI.
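As a rough illustration, here is what that call might look like through the official @google/genai JavaScript SDK. This is a minimal sketch, not the app's actual source: the function name sketchIt, the config values, and the example prompt text are all assumptions.

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical "Sketch it!" handler: send the assembled prompt to Imagen.
async function sketchIt(prompt: string): Promise<string | undefined> {
  const response = await ai.models.generateImages({
    model: "imagen-4.0-generate-001",
    prompt,
    config: { numberOfImages: 1 },
  });
  // The SDK returns base64-encoded image bytes, ready for a data URL.
  return response.generatedImages?.[0]?.image?.imageBytes;
}

// An assembled prompt: base scene + art style + per-word placement hints.
sketchIt(
  "A cozy house under a smiling sun, in a watercolor style. " +
    "Composition: the word 'sun' should appear generally in the top-right; " +
    "the word 'house' should appear generally in the bottom-center."
);
```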
Reverse Dictionary (gemini-2.5-flash): The "What's the Word?" feature uses a text-generation model to act as a reverse dictionary. When a user provides a description (e.g., "a yellow fruit that's long and curvy"), the app sends this to the Gemini API with instructions to guess the single most likely word. The model's response ("banana") is then presented to the user to add to their canvas.
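A sketch of this feature, again assuming the @google/genai SDK; the helper name whatsTheWord and the exact instruction wording are illustrative, not the real implementation.

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical reverse-dictionary helper: description in, single word out.
async function whatsTheWord(description: string): Promise<string> {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents:
      "You are a reverse dictionary. Reply with the single most likely " +
      `English word for this description, and nothing else: "${description}"`,
  });
  return response.text?.trim() ?? "";
}

// e.g. whatsTheWord("a yellow fruit that's long and curvy") -> "banana"
```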
Multimodal Features
The primary multimodal capability demonstrated in Wordsketcher is Text-to-Image Generation. The application takes user input in one modality (text) and uses the Gemini API to generate output in another modality (image).
A unique aspect of this implementation is how it interprets spatial information as part of the prompt. The user provides input that is both textual (the word itself) and spatial (its X/Y position on the canvas). The application translates this combined multimodal input into a sophisticated text prompt that guides the AI's understanding of the desired image composition. This allows the user to "sketch" with words in a very literal sense, influencing not just what appears in the image, but also where it appears.
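To make the spatial translation concrete, here is a minimal sketch of how a word's normalized X/Y position could be mapped onto a location phrase like the ones quoted above. It assumes a simple 3x3 grid over the canvas; the real app's mapping may be finer-grained.

```ts
// Map a normalized canvas position (0..1 on each axis) to a region name.
function describePosition(x: number, y: number): string {
  const col = x < 1 / 3 ? "left" : x < 2 / 3 ? "center" : "right";
  const row = y < 1 / 3 ? "top" : y < 2 / 3 ? "middle" : "bottom";
  if (row === "middle" && col === "center") return "center";
  if (row === "middle") return `middle-${col}`;
  if (col === "center") return `${row}-center`;
  return `${row}-${col}`;
}

// Each word on the canvas becomes one compositional instruction.
const words = [
  { text: "sun", x: 0.85, y: 0.1 },
  { text: "house", x: 0.5, y: 0.8 },
];
const hints = words.map(
  (w) =>
    `the word '${w.text}' should appear generally in the ${describePosition(w.x, w.y)}`
);
// -> "...'sun'... in the top-right", "...'house'... in the bottom-center"
```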
The app is just the seed, but I hope you like the concept!