Amine ZAMANI

๐ŸŒโœจ MapShot : From Landmarks to Local Shops: Capture Yourself Anywhere using Gemini API Flash 2.5

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

An applet that lets users create realistic travel photos of themselves in iconic places (New York, Rome, Amsterdam, Marrakesh, etc.).
The user takes a live photo, then enters a studio to:

pick a city and either a curated landmark or a precise spot via a 3D map,

tune scene parameters (day/night, weather, clothing style, selfie vs normal),

optionally use voice to navigate the map to a place.

Two generation modes:

Immersive Map → Street View (2-step compose): the user positions a virtual camera; the app captures the center lat/lng, fetches a Street View image for that location, then composes the user into that background with Gemini.

Landmark Grid (text-to-image compose): user picks a famous place; the app describes the place (no Street View fetch) and asks Gemini to generate the scene + integrate the user.

Demo

VIDEO DEMO 🎥

LIVE DEMO 📸

  1. Take a live photo in-app 📸
  2. Choose a city 🌍
  3. Select Landmark (grid) or switch to Map and place the camera at the exact spot 🗺️
  4. Adjust time, weather, clothing, selfie/normal ✨
  5. Generate → preview → download/share the image 🎉

(Optional) Say: “Navigate to Times Square” → the map centers there via voice command

How I Used Google AI Studio

Gemini 2.5 Flash (text)

  • Parse voice commands into a {placeName, lat, lng} result.
  • Build the final scene prompt (lighting, outfit, style, constraints).

Gemini 2.5 Flash Image (image)

  • Image-to-image compose: merge the user’s live photo with either (a) Street View background (map mode) or (b) text-described landmark scene (grid mode).
  • Enforce constraints (single subject, photorealism, perspective/lighting consistency).

Models used in code:

  • gemini-2.5-flash (text) for NLU & JSON extraction
  • gemini-2.5-flash-image-preview (image) for generation/compose

Multimodal Features

Pipeline A: Immersive Map → Street View → Compose (2 steps)
Step A1. 3D Map camera placement

  • API: Google Maps JS API (v=beta) with libraries=maps3d,marker
  • Component: gmp-map-3d Web Component (props: center, range, tilt, heading)
  • Action: User moves camera; we read center lat/lng from the map element.
  • Optional: Voice navigation (see Step A0).

Step A0. Voice โ†’ Place โ†’ Lat/Lng (optional)

  • API (browser): Web Speech API (webkitSpeechRecognition) for STT
  • AI: gemini-2.5-flash to interpret the transcript and return JSON:

{ "placeName": "Times Square", "lat": 40.7580, "lng": -73.9855 }

  • Action: Fly the camera to that coordinate via flyCameraTo.
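
Model replies are free text, so the JSON above may arrive wrapped in extra words or a code fence. The following is a minimal sketch of how the reply could be validated before moving the camera. The helper names (parsePlaceResult, toFlyOptions) are hypothetical, and the option shape passed to flyCameraTo, along with the range, tilt, and duration values, is an assumption rather than something taken from the app's code.

```typescript
// Shape the article says gemini-2.5-flash returns for a voice command.
interface PlaceResult {
  placeName: string;
  lat: number;
  lng: number;
}

// Hypothetical helper: pull the JSON object out of the model's reply
// (which may carry surrounding text) and validate every field.
function parsePlaceResult(reply: string): PlaceResult | null {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const { placeName, lat, lng } = data as Record<string, unknown>;
  if (typeof placeName !== "string" || placeName.trim() === "") return null;
  // Reject non-numeric, non-finite, or out-of-range coordinates.
  if (typeof lat !== "number" || !isFinite(lat) || Math.abs(lat) > 90) return null;
  if (typeof lng !== "number" || !isFinite(lng) || Math.abs(lng) > 180) return null;
  return { placeName, lat, lng };
}

// Assumed option shape for the 3D map's flyCameraTo call.
// range, tilt, and durationMillis are illustrative values only.
function toFlyOptions(place: PlaceResult) {
  return {
    endCamera: {
      center: { lat: place.lat, lng: place.lng, altitude: 0 },
      range: 800,
      tilt: 60,
    },
    durationMillis: 3000,
  };
}
```

Returning null instead of throwing lets the UI fall back to a "place not found" message without a try/catch at the call site.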

Step A2. Fetch Street View

  • API: Street View Static API

GET https://maps.googleapis.com/maps/api/streetview/metadata?location={lat},{lng}&key=GMAP_API_KEY (check availability)

GET https://maps.googleapis.com/maps/api/streetview?size=640x480&location={lat},{lng}&fov=90&heading=…&pitch=…&key=GMAP_API_KEY

  • Output: background image (base64) for the exact spot.
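
The two requests above can be assembled with a small URL builder. This is a sketch under assumptions: the interface and function names are illustrative, and normalizing the heading to 0-360 and clamping the pitch to ±90 are defensive choices of this example, not something the article describes. The metadata endpoint reports availability through a status field, where "OK" means imagery exists.

```typescript
// Parameters for one Street View lookup. Field names are illustrative.
interface StreetViewRequest {
  lat: number;
  lng: number;
  heading: number; // degrees; normalized to [0, 360) below
  pitch: number; // degrees; clamped to [-90, 90] below
  apiKey: string;
}

const STREET_VIEW_BASE = "https://maps.googleapis.com/maps/api/streetview";

// Encode key/value pairs as a query string, preserving the given order.
function toQuery(params: Array<[string, string | number]>): string {
  return params
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(String(value)))
    .join("&");
}

// URL for the availability check (metadata request).
function streetViewMetadataUrl(req: StreetViewRequest): string {
  return STREET_VIEW_BASE + "/metadata?" + toQuery([
    ["location", req.lat + "," + req.lng],
    ["key", req.apiKey],
  ]);
}

// URL for the 640x480 background image, matching the request in the article.
function streetViewImageUrl(req: StreetViewRequest): string {
  const heading = ((req.heading % 360) + 360) % 360;
  const pitch = Math.max(-90, Math.min(90, req.pitch));
  return STREET_VIEW_BASE + "?" + toQuery([
    ["size", "640x480"],
    ["location", req.lat + "," + req.lng],
    ["fov", 90],
    ["heading", heading],
    ["pitch", pitch],
    ["key", req.apiKey],
  ]);
}

// The metadata response carries a status field; "OK" means imagery exists.
function hasStreetViewImagery(metadata: { status?: string }): boolean {
  return metadata.status === "OK";
}
```

In use, the app would fetch the metadata URL first and only request the image when hasStreetViewImagery returns true, so the user can be asked to move the camera when a spot has no coverage.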

Step A3. Compose with Gemini (image)

  • Model: gemini-2.5-flash-image-preview

  • Inputs:

Inline image 1 = user live photo (base64)

Inline image 2 = Street View background (base64)

Text prompt = scene params (time, weather, clothing, selfie/normal) + constraints (1 person only, photorealism, light/shadow match).

  • Output: final image (base64, shown + downloadable).
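
As a rough illustration of Step A3, the sketch below assembles the three inputs as request parts and pulls the generated image back out of a response. The type and function names, the prompt wording, and the image/jpeg MIME type are assumptions made for this example; only the inputs (two inline images plus a text prompt) and the constraints come from the description above.

```typescript
// One request part: either text or an inline base64 image.
type ComposePart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

// Scene parameters named in the article; field names are illustrative.
interface SceneParams {
  timeOfDay: "day" | "night";
  weather: string;
  clothing: string;
  shot: "selfie" | "normal";
}

// Hypothetical prompt wording; the app's real prompt is not shown in the article.
function buildComposePrompt(scene: SceneParams): string {
  return [
    "Place the person from the first image into the street scene from the second image.",
    "Time of day: " + scene.timeOfDay + ". Weather: " + scene.weather + ".",
    "Clothing style: " + scene.clothing + ". Framing: " + scene.shot + ".",
    "Constraints: exactly one person, photorealistic, lighting and shadows must match the background.",
  ].join("\n");
}

// Order matters for the prompt above: user photo first, background second.
function buildComposeParts(userPhotoB64: string, streetViewB64: string, scene: SceneParams): ComposePart[] {
  return [
    { inlineData: { mimeType: "image/jpeg", data: userPhotoB64 } }, // MIME type assumed
    { inlineData: { mimeType: "image/jpeg", data: streetViewB64 } },
    { text: buildComposePrompt(scene) },
  ];
}

// Minimal view of a generateContent-style response.
interface ComposeResponse {
  candidates?: Array<{
    content?: {
      parts?: Array<{ text?: string; inlineData?: { mimeType?: string; data?: string } }>;
    };
  }>;
}

// Return the first inline image found, or null if the model replied with text only.
function extractImageBase64(response: ComposeResponse): string | null {
  const candidates = response.candidates || [];
  for (const candidate of candidates) {
    const parts = (candidate.content && candidate.content.parts) || [];
    for (const part of parts) {
      if (part.inlineData && part.inlineData.data) return part.inlineData.data;
    }
  }
  return null;
}
```

The parts array would be sent as the content of a single request to gemini-2.5-flash-image-preview. Handling the null case matters because the model can answer with text alone, for example when it declines a request.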

Pipeline B: Landmark Grid → Text-described Scene → Compose
Step B1. Select a famous place

  • Data: curated places[] per city (name + description).

  • No Street View call in this mode.

Step B2. Generate with Gemini (image)

  • Model: gemini-2.5-flash-image-preview
  • Inputs:

Inline image 1 = user live photo (base64)

Text prompt = detailed scene description of the landmark (lighting, weather, outfit, selfie/normal, constraints).

  • Output: final photorealistic image.
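
In this mode the text prompt carries the whole scene, so it helps to see how a curated entry might turn into a prompt. The sketch below is illustrative only: the sample places and their descriptions are placeholder data, and the names (placesByCity, findPlace, buildLandmarkPrompt) and the prompt wording are assumptions, not the app's actual code.

```typescript
// A curated entry: name + description, as described in Step B1.
interface Place {
  name: string;
  description: string;
}

// Scene parameters named in the article; field names are illustrative.
interface LandmarkScene {
  timeOfDay: "day" | "night";
  weather: string;
  clothing: string;
  shot: "selfie" | "normal";
}

// Placeholder data for illustration; the real curated list is not shown in the article.
const placesByCity: Record<string, Place[]> = {
  Rome: [{ name: "Colosseum", description: "ancient stone amphitheatre with tiered arches" }],
  "New York": [{ name: "Times Square", description: "busy intersection lined with large illuminated billboards" }],
};

// Case-insensitive lookup of a place within one city.
function findPlace(city: string, placeName: string): Place | null {
  const places = placesByCity[city] || [];
  for (const place of places) {
    if (place.name.toLowerCase() === placeName.toLowerCase()) return place;
  }
  return null;
}

// Hypothetical prompt wording: describe the landmark in text, then add scene params.
function buildLandmarkPrompt(city: string, place: Place, scene: LandmarkScene): string {
  return [
    "Generate a photo of the person from the provided image at " + place.name + " in " + city + ".",
    "Scene: " + place.description + ".",
    "Time of day: " + scene.timeOfDay + ". Weather: " + scene.weather + ".",
    "Clothing style: " + scene.clothing + ". Framing: " + scene.shot + ".",
    "Constraints: exactly one person, photorealistic, consistent perspective and lighting.",
  ].join("\n");
}
```

The resulting string would be sent together with the user's live photo as the only inline image, since no Street View background exists in this mode.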

Multimodal Features (Why it helps)

  • 3D map + camera metaphors: users choose the exact angle and spot, leading to more believable compositions.
  • Voice navigation: faster targeting of places (hands-free) before capture.
  • Dual generation paths: precision (Street View background) or speed (landmark text scene).
  • One-tap export: download/share directly.
