

Look & Learn: a Google AI Multimodal Challenge Entry

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I created Look & Learn, an app for language learners. The app generates an image of an interesting scene, then asks you questions about that image in the language you're trying to learn. At the beginner level, all questions are multiple choice; at the intermediate and advanced levels, you have to type out your answers.

Demo

Try Look & Learn here

Screenshots:

Screenshot of the Look & Learn start screen. Here, the user has options to select the language they speak, the language they want to learn, and their level of fluency. A button at the bottom lets them start the quiz.

Image showing a multiple choice question where the user has picked the wrong answer. The wrong answer is highlighted in red, while the correct one is shown in green. At the bottom, an explanation in the user's native language is shown.

Screenshot from an intermediate-level Dutch quiz:
Screenshot showing a question where the user had to type in an answer. At the bottom, a message indicates that the user answered correctly, but points out and explains a verb-conjugation error.

How I Used Google AI Studio

I wanted to see how far I could take Google AI Studio while touching the code by hand as little as possible. While I'm mostly skeptical of vibe coding, this challenge seemed like an interesting opportunity to give it a try. So I mostly wrote prompts and gave the model feedback in natural language.

Multimodal Features

At the start of the quiz, the application either generates an interesting image with Imagen or takes an existing one from Google Cloud Storage; currently it picks a stored image 80% of the time. It then uses Gemini 2.5 Flash to generate questions about the image, giving it the image together with a prompt that includes guidelines for the questions as well as the user's fluency level.
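In code, that flow looks roughly like the sketch below. This is a minimal sketch assuming the @google/genai TypeScript SDK that AI Studio apps typically use; the Imagen model id, the prompt text, and the pickStoredImageUrl helper are illustrative, not the app's actual code:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical helper: picks a random image from the Cloud Storage pool.
declare function pickStoredImageUrl(): Promise<string>;

// Reuse a stored image 80% of the time; otherwise generate a fresh one with Imagen.
async function getSceneImage(): Promise<string> {
  if (Math.random() < 0.8) {
    return pickStoredImageUrl();
  }
  const result = await ai.models.generateImages({
    model: "imagen-3.0-generate-002", // illustrative model id
    prompt: "A lively street market with many distinct, describable details",
    config: { numberOfImages: 1 },
  });
  const bytes = result.generatedImages?.[0]?.image?.imageBytes;
  if (!bytes) throw new Error("Imagen returned no image");
  return `data:image/png;base64,${bytes}`;
}

// Ask Gemini 2.5 Flash for questions about the image, tuned to the user's level.
async function generateQuestions(imageBase64: string, targetLanguage: string, level: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/png", data: imageBase64 } },
      { text: `Write quiz questions in ${targetLanguage} about this image for a ${level} learner.` },
    ],
  });
  return response.text;
}
```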

For multiple-choice questions there is a well-defined correct answer, so the application gives the user feedback immediately. For text-entry questions, we again send the image to Gemini 2.5 Flash, along with the question and the user's answer, to get an evaluation of correctness as well as of the user's vocabulary and grammar.
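Under the same assumptions, the evaluation step could be sketched like this; only the pattern of sending image, question, and answer to Gemini 2.5 Flash comes from the app, while the prompt wording and the JSON shape are illustrative:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Evaluate a free-text answer: Gemini sees the same image, the question, and the
// learner's answer, and returns a verdict plus vocabulary/grammar feedback.
async function evaluateAnswer(imageBase64: string, question: string, answer: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/png", data: imageBase64 } },
      {
        text:
          `Question: ${question}\nLearner's answer: ${answer}\n` +
          `Judge whether the answer is correct for the image, and comment on the ` +
          `learner's vocabulary and grammar. ` +
          `Reply as JSON: {"correct": boolean, "feedback": string}.`,
      },
    ],
    config: { responseMimeType: "application/json" }, // ask for machine-readable output
  });
  return JSON.parse(response.text ?? "{}");
}
```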

I also pass the image to Gemini 2.5 Flash to generate alt text for it. The alt text contains enough detail to answer all the quiz questions, and it's written in the user's native language, so screen reader users still have to go through the exercise of finding the correct answer in the description and translating it. I've also tried to ensure that every element that may appear in a different language has a matching lang attribute, so that screen readers pronounce it correctly.
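Both accessibility pieces could be sketched the same way; the prompt, the component, and the language codes below are illustrative examples, not the app's actual markup:

```tsx
import React from "react";
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Generate alt text in the learner's NATIVE language, detailed enough
// to answer every quiz question from the description alone.
async function generateAltText(imageBase64: string, nativeLanguage: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      { inlineData: { mimeType: "image/png", data: imageBase64 } },
      { text: `Describe this image in ${nativeLanguage} with enough detail to answer quiz questions about it.` },
    ],
  });
  return response.text ?? "";
}

// Mixed-language markup: lang attributes let screen readers switch
// pronunciation between the target and native languages.
function QuestionCard({ question, explanation }: { question: string; explanation: string }) {
  return (
    <div>
      <p lang="nl">{question}</p>     {/* question in the target language, e.g. Dutch */}
      <p lang="en">{explanation}</p>  {/* explanation in the learner's native language */}
    </div>
  );
}
```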

Top comments (1)

Pravesh Sudha

Great project, a fun way to test language proficiency!