This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I built an interactive web application that brings the classic drawing-and-guessing game into the digital age with a modern twist. The app challenges a player to draw a word provided by the game, while Google's Gemini model attempts to guess the drawing in near real-time.
This creates an engaging single-player experience in which the user's artistic skills are pitted against the AI's image-recognition capabilities. It removes the need for multiple players in a game of Pictionary and provides a fun, interactive way to experience the power of multimodal AI.
Demo
How I Used Google AI Studio
I leveraged the Gemini API, accessible through the @google/genai SDK, to power the core guessing mechanic of the game. Specifically, I used the gemini-2.5-flash model for its speed and powerful multimodal capabilities.
The implementation captures the user's drawing from the HTML canvas as a PNG image, converts it to a base64 string, and sends it to the Gemini model alongside a carefully crafted text prompt: "What is this a drawing of? Look at the image carefully and provide your best guess in a single word." The model processes the combined visual and textual input and returns its guess as a single word, a straightforward image-to-text (visual understanding) use case.
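Here is a minimal sketch of that flow, assuming a canvas element with id `drawing-canvas` and an API key exposed as `process.env.API_KEY` via a bundler (both names are my assumptions; only the model id and the prompt come from the project as described):

```typescript
import { GoogleGenAI } from "@google/genai";

// Hypothetical setup: the element id and env variable name are assumptions.
const ai = new GoogleGenAI({ apiKey: process.env.API_KEY });
const canvas = document.getElementById("drawing-canvas") as HTMLCanvasElement;

async function guessDrawing(): Promise<string> {
  // Capture the canvas as a PNG data URL and strip the
  // "data:image/png;base64," prefix to get the raw base64 payload.
  const base64Data = canvas.toDataURL("image/png").split(",")[1];

  // Send the image and the text prompt together in one multimodal request.
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/png", data: base64Data } },
      {
        text:
          "What is this a drawing of? Look at the image carefully and " +
          "provide your best guess in a single word.",
      },
    ],
  });

  // response.text holds the model's single-word guess.
  return (response.text ?? "").trim();
}
```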
Multimodal Features
The central multimodal feature of this application is visual reasoning and description. The app seamlessly integrates two distinct modalities:
Image Input: The user's free-form drawing on the canvas serves as the primary visual input.
Text Output: The Gemini model analyzes this visual information and generates a textual guess.
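For completeness, a hedged sketch of how the game loop might consume that text output and compare it against the round's target word (the helper name and normalization are illustrative assumptions, not the project's actual code):

```typescript
// Hypothetical round check: normalize both words before comparing, since the
// model may vary casing or include stray whitespace around its guess.
function isCorrectGuess(guess: string, targetWord: string): boolean {
  const normalize = (w: string) => w.trim().toLowerCase();
  return normalize(guess) === normalize(targetWord);
}

// Example: a guess of "Cat " matches the round's word "cat".
console.log(isCorrectGuess("Cat ", "cat")); // true
```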