This is a submission for the Google AI Studio Multimodal Challenge
What I Built
The AI First Aid Assistant is a multimodal, emergency-response web application designed to provide immediate, clear, and accessible first-aid guidance during high-stress situations. By leveraging the full spectrum of Gemini's multimodal capabilities, it transforms a user's phone into an intelligent crisis-response tool. Users can describe an emergency using text, photos, video, or audio, and in return, receive a comprehensive, easy-to-understand set of instructions enhanced with AI-generated images and videos.
What Problem It Solves
In a medical emergency, panic and uncertainty are the biggest barriers to effective action. People often don't know the correct procedures, and fumbling through search engines for text-based instructions can be slow, confusing, and impractical—especially if their hands are occupied. Furthermore, complex medical actions are difficult to understand from text alone, and real-world images can be graphic and distressing. Language barriers can also prevent individuals from getting the help they need.
Our application solves these problems by:
- Reducing Panic: Providing a single, trusted source for immediate, step-by-step guidance.
- Breaking Down Complexity: Using AI-generated visuals (images and videos) to make instructions clear and easy to follow.
- Overcoming Input Barriers: Allowing users to communicate the situation in the most natural way possible—whether by speaking, typing, or showing.
- Ensuring Accessibility: Offering bilingual support (English/Portuguese) and text-to-speech narration for hands-free operation.
What Experience It Creates
The AI First Aid Assistant creates an experience of empowerment and calm in the face of chaos. Instead of feeling helpless, a user is guided by a calm, authoritative assistant.
The user journey is designed to be seamless and intuitive:
- Multimodal Input: A user facing an emergency (e.g., a deep cut, a burn, or someone choking) can instantly capture the situation. They can type a description ("deep cut on arm"), take a photo of the injury, record a short video of the scene, or record a voice memo describing what's happening.
- AI-Powered Multimodal Guidance: The application processes the multimodal input and delivers a comprehensive response:
- Core Instructions: Gemini 2.5 Flash, grounded with Google Search for accuracy, generates clear, numbered, step-by-step first-aid instructions tailored to the specific emergency.
- Illustrative Images: For each step, Imagen 4.0 generates a custom, photorealistic illustration. This visual aid clarifies the instruction (e.g., how to apply pressure to a wound) without showing graphic or distressing real-world imagery.
- Demonstration Videos: To eliminate any ambiguity, the user can click a button on any step to generate a short, animated demonstration video, powered by Veo 2.0. This transforms a static instruction into a dynamic, easy-to-follow visual guide, showing the exact motion and technique required.
- Audio Narration: The full set of instructions can be read aloud via the browser's text-to-speech engine, allowing the user to follow along hands-free while administering aid (a minimal sketch of the narration logic follows this list).
- Integrated Emergency Tools: The experience goes beyond just instructions. The interface provides immediate, one-tap access to call local emergency services (911 or 192, depending on the language setting) and a map to find nearby hospitals, creating a complete emergency response hub.
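To give a flavor of how simple the hands-free mode is, here is a minimal narration sketch using the browser's built-in Web Speech API. The `Step` shape mirrors the JSON format described later in this post; the function name and language codes are illustrative, not verbatim app code:

```ts
// Minimal narration sketch using the browser's built-in Web Speech API.
// The `Step` shape mirrors the JSON the app requests from Gemini (see below).
interface Step {
  step_number: number;
  instruction_text: string;
}

function narrateSteps(steps: Step[], lang: "pt-BR" | "en-US"): void {
  speechSynthesis.cancel(); // stop any narration already in progress
  for (const step of steps) {
    const utterance = new SpeechSynthesisUtterance(
      `${step.step_number}. ${step.instruction_text}`
    );
    utterance.lang = lang; // matches the app's English/Portuguese toggle
    utterance.rate = 0.9;  // slightly slower pace for stressful situations
    speechSynthesis.speak(utterance); // utterances queue automatically
  }
}
```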
By seamlessly integrating text, image, video, and audio generation, the AI First Aid Assistant demystifies first-aid procedures and empowers anyone to become a capable first responder, potentially bridging the critical gap between when an incident occurs and when professional help arrives.
Demo
The app defaults to Portuguese, but you can click the button to translate it to English.
https://first-aid-frontend-496246101066.us-central1.run.app/
How I Used Google AI Studio
Multimodal Features
Leveraging Google AI Studio & Multimodal Capabilities
Our AI First Aid Assistant was born from the iterative, rapid-prototyping environment of Google AI Studio. The platform was not just a starting point but an essential tool throughout the development process, allowing us to design, test, and refine the core logic before writing a single line of production code. The application is fundamentally multimodal, creating a seamless loop from user input in any format to AI-generated guidance in multiple formats.
How We Leveraged Google AI Studio
Google AI Studio was our primary workbench for prompt engineering and validating our multimodal strategy.
Rapid Prototyping and Prompt Engineering: We used AI Studio's web-based interface to craft and test the complex system instructions for Gemini 2.5 Flash.
This was particularly crucial for the main instruction-generation prompt, where we needed the model to:
- Act as a calm and reliable first-aid assistant.
- Use the Google Search tool for grounding to ensure the accuracy of medical advice.
- Consistently output a response in a strict, non-negotiable JSON format ({steps: [{step_number, instruction_text}]}). We iterated dozens of times in the Studio to perfect the instruction that forced this output reliably, even when a tool was in use (a sketch of the resulting call appears after this list).
- Validating the Multimodal Input Strategy: The core premise of our app is that a user can describe an emergency in whatever way is easiest for them. We used AI Studio to simulate this by:
- Uploading various test images (e.g., photos of simulated cuts, burns, insect bites) alongside text prompts like "what do I do for this?"
- Testing how Gemini would interpret the combination of text and media to understand the full context of the emergency. This gave us confidence that the model could synthesize information from multiple sources before we built the file upload and camera functionalities.
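To make this concrete, the following is a minimal sketch of the kind of call we converged on, using the `@google/genai` SDK. The system-instruction wording, helper names, and the fence-stripping parser are illustrative rather than verbatim app code; note that the JSON shape is enforced purely by the instruction text, which is exactly why the iteration in AI Studio mattered:

```ts
import { GoogleGenAI } from "@google/genai";

// Illustrative sketch of the grounded instruction-generation call.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

interface Step {
  step_number: number;
  instruction_text: string;
}

async function getFirstAidSteps(description: string): Promise<Step[]> {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: description,
    config: {
      systemInstruction:
        "You are a calm, reliable first-aid assistant. Respond ONLY with JSON " +
        'of the form {"steps": [{"step_number": number, "instruction_text": string}]}.',
      tools: [{ googleSearch: {} }], // ground the medical advice in Google Search
    },
  });

  // The format is forced by the prompt alone, so strip any stray code fences
  // before parsing the JSON.
  const raw = (response.text ?? "").replace(/```(json)?/g, "").trim();
  return JSON.parse(raw).steps;
}
```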
Fine-Tuning Prompts for Media Generation: We used AI Studio to experiment with the prompts sent to our image and video generation models. We crafted and tested prompt templates to ensure the outputs were:
- For Images (Imagen 4.0): Clear, illustrative, photorealistic but not graphic or gory. We settled on a formula like: "Generate a clear, simple, photorealistic illustration for this first-aid step... Style: modern textbook illustration. No text or gore."
- For Videos (Veo 2.0): Focused, slow-motion, and easy to understand. We refined prompts to emphasize the core action, such as: "Create a short, clear, slow-motion video demonstrating this first-aid instruction. Focus on the action described." (Both templates appear in the sketch below.)
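Here are the two templates as simple helpers, wired into an illustrative Imagen call. The helper names and the model ID are assumptions, not verbatim app code; check the current model list before reusing this:

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const imagePrompt = (instruction: string) =>
  `Generate a clear, simple, photorealistic illustration for this first-aid ` +
  `step: "${instruction}". Style: modern textbook illustration. No text or gore.`;

const videoPrompt = (instruction: string) =>
  `Create a short, clear, slow-motion video demonstrating this first-aid ` +
  `instruction: "${instruction}". Focus on the action described.`;

async function illustrateStep(instruction: string): Promise<string | undefined> {
  const result = await ai.models.generateImages({
    model: "imagen-4.0-generate-001", // assumed ID for Imagen 4.0
    prompt: imagePrompt(instruction),
    config: { numberOfImages: 1 },
  });
  // Base64-encoded image bytes, ready to display via a data: URL.
  return result.generatedImages?.[0]?.image?.imageBytes;
}
```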
Implemented Multimodal Capabilities
Our application is a true end-to-end multimodal experience, leveraging Gemini's ability to both understand and generate content across different formats.
1. Multimodal INPUT: Understanding the Emergency
The user is empowered to communicate their situation using the most intuitive method available to them in a high-stress moment. Gemini 2.5 Flash serves as the central intelligence, fusing these varied inputs into a coherent understanding of the emergency (a helper for packaging captured media is sketched after this list):
- Text: The user can type a description (e.g., "my friend fell and has a deep cut on his arm").
- Image: The user can take a photo with their camera or upload an image of the injury, providing critical visual context that words might fail to capture.
- Video: The user can record a short video of the scene, which can show the severity of bleeding, the person's state, or the environment.
- Audio: For a hands-free approach, the user can record a voice memo describing what happened.
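Under the hood, each of these media types is packaged the same way before being sent to Gemini. The following is a minimal sketch, assuming the `inlineData` part shape used by the `@google/genai` SDK; the helper name is ours:

```ts
// Illustrative helper: turn a captured File/Blob (photo, video clip, or voice
// memo) into the inlineData part shape the GenAI SDK expects.
function fileToPart(
  file: File
): Promise<{ inlineData: { mimeType: string; data: string } }> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      // Drop the "data:<mime>;base64," prefix to keep only the payload.
      const data = (reader.result as string).split(",")[1];
      resolve({ inlineData: { mimeType: file.type, data } });
    };
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Usage sketch: fuse typed text with any captured media into one request.
// const parts = [{ text: userDescription }, await fileToPart(photoFile)];
// await ai.models.generateContent({ model: "gemini-2.5-flash", contents: parts });
```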
2. Multimodal OUTPUT: Delivering Actionable Guidance
Once the emergency is understood, the application delivers a comprehensive, multi-layered response designed for maximum clarity and usability under pressure.
- Grounded Text Instructions: Gemini 2.5 Flash, with its reasoning capabilities grounded by the Google Search tool, generates accurate, numbered, step-by-step instructions.
- Generated Illustrative Images: For each text instruction, we call Imagen 4.0 to generate a custom, photorealistic image. This visual aid clarifies the action (e.g., how to apply pressure, the correct way to position a limb) without the distressing or graphic nature of real photos.
- Generated Demonstration Videos: To eliminate all ambiguity for complex actions, the user can generate a short, dynamic video for any step. We use the illustrative image and the instruction text as inputs for Veo 2.0, which creates a video showing the precise technique required, transforming a static instruction into an easy-to-follow demonstration (sketched after this list).
- Synthesized Audio (Text-to-Speech): The application uses the browser's built-in speech synthesis engine to read the full set of instructions aloud. This provides a crucial hands-free mode, allowing the user to listen and act simultaneously without needing to look at their screen.
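For completeness, here is a sketch of the image-plus-text video-generation call referenced above. Veo runs as a long-running operation, so the client polls until the video is ready; the model ID, the 10-second poll interval, and the helper name are all assumptions:

```ts
import { GoogleGenAI } from "@google/genai";

// Illustrative Veo sketch: seed the video with the step's Imagen illustration
// plus the instruction text, then poll the long-running operation until done.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function demonstrateStep(
  instruction: string,
  imageBytes: string // base64 output of the Imagen step
): Promise<string | undefined> {
  let operation = await ai.models.generateVideos({
    model: "veo-2.0-generate-001", // assumed ID for Veo 2.0
    prompt:
      `Create a short, clear, slow-motion video demonstrating this first-aid ` +
      `instruction: "${instruction}". Focus on the action described.`,
    image: { imageBytes, mimeType: "image/png" },
  });

  // Video generation takes a while; poll every 10 seconds (arbitrary choice).
  while (!operation.done) {
    await new Promise((resolve) => setTimeout(resolve, 10_000));
    operation = await ai.operations.getVideosOperation({ operation });
  }
  return operation.response?.generatedVideos?.[0]?.video?.uri;
}
```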
By integrating this full spectrum of multimodal inputs and outputs, the AI First Aid Assistant transforms a simple query into rich, interactive, and potentially life-saving guidance, truly showcasing the power of Gemini's multimodal ecosystem.
The application was developed entirely by me with Google AI Studio.