This is a submission for the Google AI Studio Multimodal Challenge
What I Built
An applet that lets users create realistic travel photos of themselves in iconic places (New York, Rome, Amsterdam, Marrakesh, etc.).
The user takes a live photo, then enters a studio to:
- pick a city and either a curated landmark or a precise spot via a 3D map,
- tune scene parameters (day/night, weather, clothing style, selfie vs. normal),
- optionally use voice to navigate the map to a place.
Two generation modes:
- Immersive Map → Street View (2-step compose): the user positions a virtual camera; the app captures the center lat/lng, fetches a Street View image for that location, then composes the user into that background with Gemini.
- Landmark Grid (text-to-image compose): the user picks a famous place; the app describes the place (no Street View fetch) and asks Gemini to generate the scene and integrate the user.
Demo
- Take a live photo in-app
- Choose a city
- Select Landmark (grid) or switch to Map and place the camera at the exact spot
- Adjust time, weather, clothing, selfie/normal
- Generate → preview → download/share the image
- (Optional) Say “Navigate to Times Square” → the map centers there via voice command
How I Used Google AI Studio
Gemini 2.5 Flash (text)
- Parse voice commands into a {placeName, lat, lng} result.
- Build the final scene prompt (lighting, outfit, style, constraints).
Gemini 2.5 Flash Image (image)
- Image-to-image compose: merge the user's live photo with either (a) the Street View background (map mode) or (b) a text-described landmark scene (grid mode).
- Enforce constraints (single subject, photorealism, perspective/lighting consistency).
Models used in code:
- gemini-2.5-flash (text) for NLU & JSON extraction
- gemini-2.5-flash-image-preview (image) for generation/compose
Multimodal Features
Pipeline A: Immersive Map → Street View → Compose (2 steps)
Step A1. 3D Map camera placement
- API: Google Maps JS API (v=beta) with libraries=maps3d,marker
- Component: Web Component (props: center, range, tilt, heading)
- Action: User moves camera; we read center lat/lng from the map element.
- Optional: Voice navigation (see Step A0).
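The camera step above can be sketched as a few helpers. This is a minimal sketch, assuming the Maps JS 3D beta element exposes `center` and a `flyCameraTo` method as described in the beta docs; the interface is typed loosely here so the snippet stands alone without DOM typings, and the helper names are my own.

```typescript
// Loose shape of the 3D map element (assumption based on the Maps 3D beta API).
interface Camera3D {
  center: { lat: number; lng: number; altitude?: number };
  flyCameraTo(opts: {
    endCamera: { center: { lat: number; lng: number }; tilt: number; range: number };
    durationMillis: number;
  }): void;
}

// Normalize a heading into [0, 360) before passing it to the Street View request.
function normalizeHeading(deg: number): number {
  return ((deg % 360) + 360) % 360;
}

// Read the lat/lng under the camera crosshair.
function readCenter(map: Camera3D): { lat: number; lng: number } {
  return { lat: map.center.lat, lng: map.center.lng };
}

// Fly the camera to a new spot (tilt/range values are illustrative defaults).
function flyTo(map: Camera3D, lat: number, lng: number): void {
  map.flyCameraTo({
    endCamera: { center: { lat, lng }, tilt: 67.5, range: 500 },
    durationMillis: 2000,
  });
}
```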
Step A0. Voice → Place → Lat/Lng (optional)
- API (browser): Web Speech API (webkitSpeechRecognition) for STT
- AI: gemini-2.5-flash to interpret the transcript and return JSON:
{ "placeName": "Times Square", "lat": 40.7580, "lng": -73.9855 }
- Action: Fly the camera to that coordinate via flyCameraTo.
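Because the model is asked for JSON but may wrap its answer in a code fence, the response is validated before flying the camera. A minimal sketch of that validation step (the function name and fence-stripping approach are mine, not from the app's code):

```typescript
interface PlaceResult {
  placeName: string;
  lat: number;
  lng: number;
}

// Strip an optional ```json fence, then parse and validate the model's reply.
// Returns null on malformed or incomplete output so the caller can ignore it.
function parsePlaceResponse(text: string): PlaceResult | null {
  const cleaned = text.replace(/```json|```/g, "").trim();
  try {
    const obj = JSON.parse(cleaned);
    if (
      typeof obj.placeName === "string" &&
      typeof obj.lat === "number" &&
      typeof obj.lng === "number"
    ) {
      return obj as PlaceResult;
    }
  } catch {
    // malformed JSON: fall through to null
  }
  return null;
}
```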
Step A2. Fetch Street View
- API: Street View Static API
  GET https://maps.googleapis.com/maps/api/streetview/metadata?location={lat},{lng}&key=GMAP_API_KEY (check availability)
  GET https://maps.googleapis.com/maps/api/streetview?size=640x480&location={lat},{lng}&fov=90&heading=…&pitch=…&key=GMAP_API_KEY
- Output: background image (base64) for the exact spot.
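The two requests above can be expressed as a small URL builder (a sketch; the heading/pitch come from the 3D camera, and checking the metadata endpoint first avoids requesting an image where no panorama exists):

```typescript
const STREET_VIEW = "https://maps.googleapis.com/maps/api/streetview";

// Build the availability-check and image URLs for a given camera position.
function streetViewUrls(
  lat: number,
  lng: number,
  heading: number,
  pitch: number,
  key: string
): { metadata: string; image: string } {
  const location = `${lat},${lng}`;
  return {
    // If the metadata response's status is not "OK", there is no panorama here.
    metadata: `${STREET_VIEW}/metadata?location=${location}&key=${key}`,
    image: `${STREET_VIEW}?size=640x480&location=${location}&fov=90&heading=${heading}&pitch=${pitch}&key=${key}`,
  };
}
```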
Step A3. Compose with Gemini (image)
- Model: gemini-2.5-flash-image-preview
- Inputs:
  - Inline image 1 = user live photo (base64)
  - Inline image 2 = Street View background (base64)
  - Text prompt = scene params (time, weather, clothing, selfie/normal) + constraints (one person only, photorealism, light/shadow match)
- Output: final image (base64, shown + downloadable).
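The request body for this compose step looks roughly like the following. This is a sketch assuming the `inlineData`/`text` part shapes of the Google Gen AI JS SDK; only the part assembly is shown, with the actual call left as a comment.

```typescript
// Part shapes assumed from the @google/genai JS SDK.
type Part =
  | { inlineData: { mimeType: string; data: string } }
  | { text: string };

// Assemble the two inline images plus the scene prompt in the order
// described above: live photo first, Street View background second.
function buildComposeParts(
  userPhotoB64: string,
  backgroundB64: string,
  scenePrompt: string
): Part[] {
  return [
    { inlineData: { mimeType: "image/jpeg", data: userPhotoB64 } }, // image 1: live photo
    { inlineData: { mimeType: "image/jpeg", data: backgroundB64 } }, // image 2: Street View background
    { text: scenePrompt }, // scene params + constraints
  ];
}

// Usage (assumed SDK call shape):
// const res = await ai.models.generateContent({
//   model: "gemini-2.5-flash-image-preview",
//   contents: buildComposeParts(photoB64, streetViewB64, prompt),
// });
```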
Pipeline B: Landmark Grid → Text-described Scene → Compose
Step B1. Select a famous place
- Data: curated places[] per city (name + description).
- No Street View call in this mode.
Step B2. Generate with Gemini (image)
- Model: gemini-2.5-flash-image-preview
- Inputs:
  - Inline image 1 = user live photo (base64)
  - Text prompt = detailed scene description of the landmark (lighting, weather, outfit, selfie/normal, constraints)
- Output: final photorealistic image.
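Since this mode has no background image, the prompt carries the whole scene. A minimal sketch of how the curated place description and the studio parameters could be folded into one prompt (the field names and wording are illustrative, not the app's actual prompt):

```typescript
interface SceneParams {
  time: "day" | "night";
  weather: string;
  clothing: string;
  selfie: boolean;
}

// Combine the curated landmark description with the user's scene settings
// and the fixed generation constraints into a single text prompt.
function buildScenePrompt(
  place: { name: string; description: string },
  p: SceneParams
): string {
  return [
    `Place the person from the photo at ${place.name}: ${place.description}.`,
    `Time of day: ${p.time}. Weather: ${p.weather}. Outfit: ${p.clothing}.`,
    p.selfie
      ? "Frame the shot as a handheld selfie."
      : "Frame the shot as a photo taken by someone else.",
    "Constraints: exactly one person, photorealistic, consistent perspective and lighting.",
  ].join(" ");
}
```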
Why Multimodality Helps
- 3D map + camera metaphors: users choose the exact angle and spot, leading to more believable compositions.
- Voice navigation: faster targeting of places (hands-free) before capture.
- Dual generation paths: precision (Street View background) or speed (landmark text scene).
- One-tap export: download/share directly.