This is a submission for the Google AI Studio Multimodal Challenge
What I Built
An applet that lets users create realistic travel photos of themselves in iconic places (New York, Rome, Amsterdam, Marrakesh, etc.).
The user takes a live photo, then enters a studio to:
- pick a city and either a curated landmark or a precise spot via a 3D map,
- tune scene parameters (day/night, weather, clothing style, selfie vs. normal),
- optionally use voice to navigate the map to a place.
Two generation modes:
- Immersive Map → Street View (2-step compose): the user positions a virtual camera; the app captures the center lat/lng, fetches a Street View image for that location, then composes the user into that background with Gemini.
- Landmark Grid (text-to-image compose): the user picks a famous place; the app describes the place (no Street View fetch) and asks Gemini to generate the scene and integrate the user.
Demo
- Take a live photo in-app
- Choose a city
- Select Landmark (grid) or switch to Map and place the camera at the exact spot
- Adjust time, weather, clothing, selfie/normal
- Generate → preview → download/share the image
- (Optional) Say “Navigate to Times Square” → the map centers there via voice command
How I Used Google AI Studio
Gemini 2.5 Flash (text)
- Parse voice commands into a {placeName, lat, lng} result.
- Build the final scene prompt (lighting, outfit, style, constraints).
Gemini 2.5 Flash Image (image)
- Image-to-image compose: merge the user's live photo with either (a) the Street View background (map mode) or (b) a text-described landmark scene (grid mode).
- Enforce constraints (single subject, photorealism, perspective/lighting consistency).
Models used in code:
- gemini-2.5-flash (text) for NLU & JSON extraction
- gemini-2.5-flash-image-preview (image) for generation/compose
Multimodal Features
Pipeline A: Immersive Map → Street View → Compose (2 steps)
Step A1. 3D Map camera placement
- API: Google Maps JS API (v=beta) with libraries=maps3d,marker
- Component: Web Component (props: center, range, tilt, heading)
- Action: User moves camera; we read center lat/lng from the map element.
- Optional: Voice navigation (see Step A0).
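The camera step above can be sketched as a few helpers. This is a minimal sketch, assuming the Maps JS 3D beta element exposes `center` and a `flyCameraTo` method as described in the beta docs; the interface is typed loosely here so the snippet stands alone without DOM typings, and the helper names are my own.

```typescript
// Loose shape of the 3D map element (assumption based on the Maps 3D beta API).
interface Camera3D {
  center: { lat: number; lng: number; altitude?: number };
  flyCameraTo(opts: {
    endCamera: { center: { lat: number; lng: number }; tilt: number; range: number };
    durationMillis: number;
  }): void;
}

// Normalize a heading into [0, 360) before passing it to the Street View request.
function normalizeHeading(deg: number): number {
  return ((deg % 360) + 360) % 360;
}

// Read the lat/lng under the camera crosshair.
function readCenter(map: Camera3D): { lat: number; lng: number } {
  return { lat: map.center.lat, lng: map.center.lng };
}

// Fly the camera to a new spot (tilt/range values are illustrative defaults).
function flyTo(map: Camera3D, lat: number, lng: number): void {
  map.flyCameraTo({
    endCamera: { center: { lat, lng }, tilt: 67.5, range: 500 },
    durationMillis: 2000,
  });
}
```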
Step A0. Voice → Place → Lat/Lng (optional)
- API (browser): Web Speech API (webkitSpeechRecognition) for STT
- AI: gemini-2.5-flash to interpret the transcript and return JSON:
{ "placeName": "Times Square", "lat": 40.7580, "lng": -73.9855 }
- Action: Fly the camera to that coordinate via flyCameraTo.
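Because the model is asked for JSON but may wrap its answer in a code fence, the response is validated before flying the camera. A minimal sketch of that validation step (the function name and fence-stripping approach are mine, not from the app's code):

```typescript
interface PlaceResult {
  placeName: string;
  lat: number;
  lng: number;
}

// Strip an optional ```json fence, then parse and validate the model's reply.
// Returns null on malformed or incomplete output so the caller can ignore it.
function parsePlaceResponse(text: string): PlaceResult | null {
  const cleaned = text.replace(/```json|```/g, "").trim();
  try {
    const obj = JSON.parse(cleaned);
    if (
      typeof obj.placeName === "string" &&
      typeof obj.lat === "number" &&
      typeof obj.lng === "number"
    ) {
      return obj as PlaceResult;
    }
  } catch {
    // malformed JSON: fall through to null
  }
  return null;
}
```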
Step A2. Fetch Street View
- API: Street View Static API
  GET https://maps.googleapis.com/maps/api/streetview/metadata?location={lat},{lng}&key=GMAP_API_KEY (check availability)
  GET https://maps.googleapis.com/maps/api/streetview?size=640x480&location={lat},{lng}&fov=90&heading=…&pitch=…&key=GMAP_API_KEY
- Output: background image (base64) for the exact spot.
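The two requests above can be expressed as a small URL builder (a sketch; the heading/pitch come from the 3D camera, and checking the metadata endpoint first avoids requesting an image where no panorama exists):

```typescript
const STREET_VIEW = "https://maps.googleapis.com/maps/api/streetview";

// Build the availability-check and image URLs for a given camera position.
function streetViewUrls(
  lat: number,
  lng: number,
  heading: number,
  pitch: number,
  key: string
): { metadata: string; image: string } {
  const location = `${lat},${lng}`;
  return {
    // If the metadata response's status is not "OK", there is no panorama here.
    metadata: `${STREET_VIEW}/metadata?location=${location}&key=${key}`,
    image: `${STREET_VIEW}?size=640x480&location=${location}&fov=90&heading=${heading}&pitch=${pitch}&key=${key}`,
  };
}
```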
Step A3. Compose with Gemini (image)
- Model: gemini-2.5-flash-image-preview
- Inputs:
  - Inline image 1 = user live photo (base64)
  - Inline image 2 = Street View background (base64)
  - Text prompt = scene params (time, weather, clothing, selfie/normal) + constraints (one person only, photorealism, light/shadow match)
- Output: final image (base64, shown + downloadable).
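The request body for this compose step looks roughly like the following. This is a sketch assuming the `inlineData`/`text` part shapes of the Google Gen AI JS SDK; only the part assembly is shown, with the actual call left as a comment.

```typescript
// Part shapes assumed from the @google/genai JS SDK.
type Part =
  | { inlineData: { mimeType: string; data: string } }
  | { text: string };

// Assemble the two inline images plus the scene prompt in the order
// described above: live photo first, Street View background second.
function buildComposeParts(
  userPhotoB64: string,
  backgroundB64: string,
  scenePrompt: string
): Part[] {
  return [
    { inlineData: { mimeType: "image/jpeg", data: userPhotoB64 } }, // image 1: live photo
    { inlineData: { mimeType: "image/jpeg", data: backgroundB64 } }, // image 2: Street View background
    { text: scenePrompt }, // scene params + constraints
  ];
}

// Usage (assumed SDK call shape):
// const res = await ai.models.generateContent({
//   model: "gemini-2.5-flash-image-preview",
//   contents: buildComposeParts(photoB64, streetViewB64, prompt),
// });
```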
Pipeline B: Landmark Grid → Text-described Scene → Compose
Step B1. Select a famous place
- Data: curated places[] per city (name + description).
- No Street View call in this mode.
Step B2. Generate with Gemini (image)
- Model: gemini-2.5-flash-image-preview
- Inputs:
  - Inline image 1 = user live photo (base64)
  - Text prompt = detailed scene description of the landmark (lighting, weather, outfit, selfie/normal, constraints)
- Output: final photorealistic image.
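Since this mode has no background image, the prompt carries the whole scene. A minimal sketch of how the curated place description and the studio parameters could be folded into one prompt (the field names and wording are illustrative, not the app's actual prompt):

```typescript
interface SceneParams {
  time: "day" | "night";
  weather: string;
  clothing: string;
  selfie: boolean;
}

// Combine the curated landmark description with the user's scene settings
// and the fixed generation constraints into a single text prompt.
function buildScenePrompt(
  place: { name: string; description: string },
  p: SceneParams
): string {
  return [
    `Place the person from the photo at ${place.name}: ${place.description}.`,
    `Time of day: ${p.time}. Weather: ${p.weather}. Outfit: ${p.clothing}.`,
    p.selfie
      ? "Frame the shot as a handheld selfie."
      : "Frame the shot as a photo taken by someone else.",
    "Constraints: exactly one person, photorealistic, consistent perspective and lighting.",
  ].join(" ");
}
```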
Why Multimodality Helps
- 3D map + camera metaphors: users choose the exact angle and spot, leading to more believable compositions.
- Voice navigation: faster targeting of places (hands-free) before capture.
- Dual generation paths: precision (Street View background) or speed (landmark text scene).
- One-tap export: download/share directly.