Multimodal AI is no longer a futuristic concept — it's a practical tool that can analyze text reviews, product images, and podcast audio in a single workflow. In this post, I walk through the GSP524 Challenge Lab from Google Cloud Skills Boost, where we use the Gemini 2.5 Flash model on Vertex AI to extract actionable marketing insights from three different data modalities for a fictional brand called Cymbal Direct.
If you're preparing for this lab or want to understand how multimodal prompting with Gemini actually works in practice, this guide covers every task with the reasoning behind each solution.
## The Scenario
Cymbal Direct has just launched a new line of athletic apparel. Our job is to analyze social media engagement across three channels:
- Text — Customer reviews and social media posts (sentiment, themes, product mentions).
- Images — Influencer and customer photos (style trends, visual messaging, target audience).
- Audio — A podcast interview with a Cymbal Direct representative (satisfaction drivers, biases, recommendations).
Finally, we synthesize everything into a comprehensive Markdown report and upload it to Cloud Storage.
## Environment Setup (Task 1)
The lab provides a pre-configured Vertex AI Workbench instance with a Jupyter notebook (`gsp524-challenge.ipynb`). Task 1 has no TODOs — you just run the provided cells to:

- Install the Google Gen AI SDK (`google-genai`).
- Restart the kernel (important — the new package won't load without this).
- Import all required libraries, including `Part`, `ThinkingConfig`, and `GenerateContentConfig` from `google.genai.types`.
- Initialize the Gen AI client pointing to your lab project.
- Set the model ID to `gemini-2.5-flash`.
Two critical objects are set up here that you'll reuse throughout the lab:
```python
# The client — your gateway to Gemini
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# The model
MODEL_ID = "gemini-2.5-flash"
```
Later, a config object enables Gemini thinking (extended reasoning) with dynamic budget:
```python
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(
        include_thoughts=True,
        thinking_budget=-1,  # Dynamic: model decides how much to reason
    )
)
```
This config is the key difference between a basic call and a deep-reasoning call. You'll use it in every "Deep Dive" section.
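With `include_thoughts=True`, the response parts carry both reasoning and the final answer, and the lab's `print_thoughts()` helper separates them for display. A minimal sketch of that separation logic, assuming the google-genai response shape where each part exposes a boolean `thought` flag and a `text` field (the `split_thoughts` name is illustrative, not the lab's exact code):

```python
def split_thoughts(parts):
    """Separate Gemini's reasoning parts from its final-answer parts.

    Assumes each part has a boolean `thought` attribute and a `text`
    attribute, as in google-genai response candidates.
    """
    thoughts = [p.text for p in parts if getattr(p, "thought", False)]
    answers = [p.text for p in parts if not getattr(p, "thought", False)]
    return thoughts, answers

# Against a real response, usage would look roughly like:
# thoughts, answers = split_thoughts(response.candidates[0].content.parts)
```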
## Task 2: Analyzing Customer Reviews (Text)

### Initial Analysis
The first real challenge is constructing a prompt that tells Gemini exactly what to extract from the raw text data. The reviews are loaded from a file, and we embed them directly into the prompt using an f-string:
```python
prompt = f"""
Analyze the following customer reviews and social media posts about
Cymbal Direct's new athletic apparel line. For each review or post:

- Identify the overall sentiment (positive, negative, or neutral).
- Extract key themes and topics discussed, such as product quality,
  fit, style, customer service, and pricing.
- Identify any frequently mentioned product names or specific features.

Provide a structured summary of your findings in Markdown format.

Customer Reviews and Social Media Posts:
{text_data}
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
)
```
**Why this works:** The prompt is explicit about the three dimensions we care about (sentiment, themes, product names) and asks for structured Markdown output. Gemini handles the rest — it categorizes each review and surfaces patterns across the dataset.
### Deep Dive with Thinking
Now we go deeper. The second prompt asks Gemini to reason about what's driving sentiment and to role-play as a marketing consultant:
```python
thinking_mode_prompt = f"""
Analyze the following customer reviews and social media posts in detail.
Specifically:

- Identify the main factors driving positive and negative sentiment.
- Assess the overall impact on brand perception.
- Identify three key areas where Cymbal Direct can improve.
- Highlight the three most important takeaways as if presenting to
  the Cymbal Direct marketing team.

Customer Reviews and Social Media Posts:
{text_data}
"""

thinking_model_response = client.models.generate_content(
    model=MODEL_ID,
    contents=thinking_mode_prompt,
    config=config,  # <-- This enables thinking mode
)
```
The only API-level difference is passing `config=config`. But the output is dramatically richer — Gemini shows its chain of thought before delivering the final answer, and the `print_thoughts()` helper function separates these for display.

The analysis is saved to `analysis/text_analysis.md` for use in the final synthesis.
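The save step itself is ordinary file I/O; a minimal sketch, where the `analysis_markdown` placeholder stands in for the model's `response.text`:

```python
from pathlib import Path

# Placeholder for the model output (response.text in the notebook)
analysis_markdown = "# Text Analysis\n\n(model findings would go here)"

out_path = Path("analysis") / "text_analysis.md"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create analysis/ if missing
out_path.write_text(analysis_markdown)
```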
## Task 3: Analyzing Images (Visual Content)

### Initial Analysis
Images require a different content structure. Instead of embedding data in the prompt string, we pass a list of `Part` objects alongside the prompt:
```python
prompt = """
Analyze the following images of Cymbal Direct's new athletic apparel line.
For each image:

- Identify the apparel items shown.
- Describe the attributes of each item (color, style, material, branding).
- Identify any prominent style trends or preferences across the images.
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[prompt] + image_parts,  # Prompt + list of image Part objects
)
```
**Key pattern:** For multimodal content, `contents` accepts a list where the first element is the text prompt and subsequent elements are `Part` objects (images, audio, video). The images are loaded as bytes and wrapped with `Part.from_bytes()`.
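How the bytes-and-MIME pairs get assembled is handled in the lab's setup cells; here's a hypothetical helper showing the idea (the `image_part_args` name and the commented `Part.from_bytes` usage are illustrative, not the lab's exact code):

```python
import mimetypes
from pathlib import Path

def image_part_args(path):
    """Return the (data, mime_type) pair that Part.from_bytes() expects."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    return Path(path).read_bytes(), mime

# In the notebook, the parts would then be built roughly as:
# image_parts = [Part.from_bytes(data=d, mime_type=m)
#                for d, m in (image_part_args(p) for p in image_paths)]
```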
### Reasoning on Image Trends
The deep dive asks Gemini to go beyond description into inference — hypothesizing about target audience, analyzing visual composition, and comparing to broader fashion trends:
```python
thinking_mode_prompt = """
Analyze the images in greater detail:

- Hypothesize about the target audience for each image.
- Analyze how visual elements contribute to the overall message and appeal.
- Compare observed trends with broader athletic wear fashion trends.
- Provide recommendations for future marketing campaigns.
"""

thinking_model_response_image = client.models.generate_content(
    model=MODEL_ID,
    contents=[thinking_mode_prompt] + image_parts,
    config=config,
)
```
Same pattern: prompt + image parts + thinking config. Results are saved to `analysis/image_analysis.md`.
## Task 4: Analyzing Audio (Podcast)

### Initial Analysis
Audio follows the same multimodal pattern, but uses `Part.from_uri()` instead of `Part.from_bytes()` since the audio file lives in Cloud Storage:
```python
# Audio part (created in a setup cell)
audio_part = Part.from_uri(
    file_uri=f"gs://{PROJECT_ID}-bucket/media/audio/cymbal_direct_expert_interview.wav",
    mime_type="audio/wav",
)

prompt = """
Analyze the following audio recording:

- Transcribe the conversation, identifying different speakers.
- Provide sentiment analysis (positive, negative, neutral opinions).
- Identify key themes (comfort, fit, performance, style, competitor comparisons).
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[audio_part, prompt],  # Audio first, then prompt
)
```
**Note the order:** For audio, the `audio_part` comes before the prompt in the `contents` list. This is a subtle but important detail — Gemini processes the audio first, then applies the prompt instructions to it.
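If you want to avoid getting the order wrong, a tiny hypothetical helper can encode the convention this lab follows (`build_contents` is illustrative, not part of the SDK):

```python
def build_contents(prompt, media_parts, modality):
    """Order the contents list per this lab's convention:
    audio parts go before the prompt; image parts go after it."""
    if modality == "audio":
        return list(media_parts) + [prompt]
    return [prompt] + list(media_parts)
```

The call site then reads the same for every modality, e.g. `contents=build_contents(prompt, [audio_part], "audio")`.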
### Reasoning on Audio Insights
The deep dive extracts strategic intelligence from the conversation:
```python
thinking_mode_prompt = """
Analyze the audio recording in greater detail:

- Reason about overall customer satisfaction.
- Deduce key factors influencing customer perception.
- Develop three data-driven recommendations.
- Identify potential biases or limitations in the audio data.
"""

thinking_model_response = client.models.generate_content(
    model=MODEL_ID,
    contents=[audio_part, thinking_mode_prompt],
    config=config,
)
```
This is particularly interesting because Gemini can identify biases like interviewer framing or selection bias in who was invited to the podcast — something that requires genuine reasoning, not just transcription.
## Task 5: Synthesizing Multimodal Insights
The final task loads all three analysis files and asks Gemini to produce a unified report:
```python
comprehensive_report_prompt = f"""
Based on the following combined analysis of text reviews, image analysis,
and audio insights, generate a comprehensive report:

- Summarize overall sentiment across all data modalities.
- Identify key themes and trends in customer feedback.
- Provide insights on style preferences, usage patterns, and behavior.
- Evaluate how audio insights fit with product image and text feedback.
- Offer actionable recommendations for marketing strategy and positioning.

Format the report in well-structured Markdown with clear sections.

Combined Analysis Results:
{all_analysis}
"""

thinking_model_response = client.models.generate_content(
    model=MODEL_ID,
    contents=comprehensive_report_prompt,
    config=config,
)
```
After generating the report, it's saved locally and uploaded to Cloud Storage:
```shell
!gcloud storage cp analysis/final_report.md gs://{PROJECT_ID}-bucket/analysis/final_report.md
```
This last step is what the grading system checks, so don't skip it.
## Key Learnings

- **One API, three modalities.** The `generate_content` method handles text, images, and audio with the same interface — the only difference is how you construct the `contents` list.
- **Thinking mode is a single config toggle.** Adding `config=config` with `include_thoughts=True` transforms a surface-level response into a reasoned analysis. The `-1` thinking budget lets the model decide how deep to go based on prompt complexity.
- **Prompt specificity drives output quality.** Vague prompts produce vague results. Each prompt in this lab explicitly lists the dimensions to analyze (sentiment, themes, audience, recommendations), and the output quality reflects that precision.
- **Content ordering matters for multimodal inputs.** For images, the prompt comes first, followed by image parts. For audio, the audio part comes first. This isn't arbitrary — it affects how the model processes the input.
- **Chaining analyses enables synthesis.** By saving intermediate results to files and feeding them into a final prompt, we build a pipeline where each modality's insights compound into a richer final report.
## Best Practices

- **Always ask for structured output.** Requesting "Markdown format with clear sections" gives you parseable, presentable results instead of a wall of text.
- **Use thinking mode for analysis, skip it for extraction.** Initial passes (transcription, item identification) don't need extended reasoning. Deep dives (inferring audience, identifying biases, generating recommendations) benefit enormously from it.
- **Embed data directly in prompts for text; use Part objects for binary data.** Text data fits naturally inside f-strings. Images and audio should always go through `Part.from_bytes()` or `Part.from_uri()`.
- **Save intermediate results.** Writing each analysis to a file creates a paper trail and enables the final synthesis step without re-running expensive model calls.
- **Don't forget the upload.** In challenge labs, the grading system checks Cloud Storage — your analysis could be perfect, but if the file isn't in the bucket, you won't pass.
## Conclusion

This challenge lab demonstrates a realistic workflow for multimodal AI analysis: ingest data from different sources, extract structured insights from each, apply deeper reasoning where it matters, and synthesize everything into a decision-ready report. The Gemini 2.5 Flash model on Vertex AI makes this surprisingly straightforward — the same `generate_content` call handles text, images, and audio, and the thinking mode adds genuine analytical depth without requiring a different model or API.
The patterns here — structured prompts, multimodal content lists, thinking configuration, and chained analyses — are directly applicable to real-world use cases like brand monitoring, market research, and content analysis. The hard part isn't the API calls; it's crafting prompts that extract the right insights from the right data.