<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: William Schnaider Torres Bermon</title>
    <description>The latest articles on DEV Community by William Schnaider Torres Bermon (@willtorber).</description>
    <link>https://dev.to/willtorber</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3728365%2F69cdea6c-28ad-4266-9b7d-0e3dc79a8910.jpg</url>
      <title>DEV Community: William Schnaider Torres Bermon</title>
      <link>https://dev.to/willtorber</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/willtorber"/>
    <language>en</language>
    <item>
      <title>Spec Kit vs BMAD vs OpenSpec: Choosing an SDD Framework in 2026</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 05:08:44 +0000</pubDate>
      <link>https://dev.to/willtorber/spec-kit-vs-bmad-vs-openspec-choosing-an-sdd-framework-in-2026-d3j</link>
      <guid>https://dev.to/willtorber/spec-kit-vs-bmad-vs-openspec-choosing-an-sdd-framework-in-2026-d3j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2srl7y9v81kjdjzqw10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2srl7y9v81kjdjzqw10.png" alt="Spec Kit vs BMAD vs OpenSpec" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the AI writes the code, the spec is the artifact. That's the entire thesis. Everything else is tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Pick based on your codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing codebase, adding features&lt;/strong&gt; → &lt;strong&gt;OpenSpec&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New project from scratch&lt;/strong&gt; → &lt;strong&gt;Spec Kit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance, audit trails, regulated&lt;/strong&gt; → &lt;strong&gt;BMAD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsure?&lt;/strong&gt; → &lt;strong&gt;OpenSpec.&lt;/strong&gt; It tends to minimize adoption friction compared to the others, works on both greenfield and brownfield, and won't lock you in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that's all you needed, stop here. The rest is the reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclosure
&lt;/h2&gt;

&lt;p&gt;I haven't run all three of these in production. This is structural analysis: docs, case studies, design choices, and community reports — not a veteran's field guide. I'll flag where I'm extrapolating versus citing something documented. If you've shipped with any of these, your experience outranks this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SDD actually is (and isn't)
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development isn't a 2025 invention. BDD, formal requirements docs, ICDs — all versions of the same idea. What changed is that LLMs turned natural-language specs into something you can execute. A Markdown file plus Claude or GPT produces working code. No custom DSL, no code generator, no parser.&lt;/p&gt;

&lt;p&gt;The workflow, across all frameworks, is roughly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Constitution&lt;/strong&gt; — standards that apply to every change (tests, stack, security).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specification&lt;/strong&gt; — &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design&lt;/strong&gt; — &lt;em&gt;how&lt;/em&gt;, architecture decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; — ordered implementation units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt; — the agent executes; you review.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1–4 used to fit in a three-line Jira ticket because writing them properly cost more than the code itself. That calculation flipped. AI generates a draft spec in minutes. But "draft" is doing work in that sentence — catching missing edge cases, validating assumptions, and detecting hallucinations still costs real human time. LLMs collapsed the cost of &lt;em&gt;drafting&lt;/em&gt;, not the cost of &lt;em&gt;quality&lt;/em&gt;. The difference matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The economic shift
&lt;/h2&gt;

&lt;p&gt;Old pattern: planning is compressed. Tickets are thin. The real spec is in the developer's head, in Confluence pages nobody updates, in Slack threads from two sprints ago. Code is the expensive part, so you optimize for coding time.&lt;/p&gt;

&lt;p&gt;New pattern: code is cheap. AI writes it. The expensive thing is now &lt;em&gt;intent&lt;/em&gt; — making sure the AI builds what you actually need. Suddenly an exhaustive spec with acceptance criteria, Gherkin scenarios, error-handling sections, and architectural constraints is worth producing because the AI uses it and the cost of generating the draft is trivial.&lt;/p&gt;

&lt;p&gt;Spec is the source of truth. Code is the build output. That's the inversion. The frameworks below are different implementations of the same idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frameworks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spec Kit (GitHub)
&lt;/h3&gt;

&lt;p&gt;GitHub's open-source SDD &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;toolkit&lt;/a&gt;, with 90K+ GitHub stars at the time of writing; the CLI is called &lt;code&gt;specify&lt;/code&gt;. It integrates with a broad range of AI coding agents (the project lists 30+), including GitHub Copilot, Claude Code, Cursor, and Gemini CLI.&lt;/p&gt;

&lt;p&gt;The workflow uses slash commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/speckit.constitution → project principles
/speckit.specify      → feature specification
/speckit.plan         → technical design
/speckit.tasks        → implementation breakdown
/speckit.implement    → agent executes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;constitution.md&lt;/code&gt; is the piece worth understanding. It's not just a rules file — it's the document every subsequent spec references. Your testing strategy, your security posture, your stack constraints, your error-handling conventions. Write it well once and it multiplies across every feature. Write it badly and you get exactly the chaos documented below.&lt;/p&gt;
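&lt;p&gt;To make that concrete, here's the kind of entry that earns its keep. This is an illustrative excerpt with an invented stack, not Spec Kit's official template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# constitution.md (excerpt)

## Testing
Every feature ships with integration tests against a real Postgres
instance (testcontainers). Mock only external HTTP services.

## Stack
Python 3.12, FastAPI, SQLAlchemy. No new runtime dependencies without
a line here explaining why.

## Security
All user input is validated at the request boundary with Pydantic
models. Secrets come from the environment, never from committed files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
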

&lt;p&gt;Spec Kit is greenfield-optimized. Its branch-per-spec model treats specs as change artifacts, not long-lived capability contracts. On a mature codebase, that means every feature starts with reverse-engineering and the artifacts don't compound into system-level documentation. Microsoft Learn now has a brownfield module for Spec Kit, and presets help, but the underlying model is still change-scoped. If your codebase is 3 years old and you want specs that describe the &lt;em&gt;system&lt;/em&gt;, not just the &lt;em&gt;next PR&lt;/em&gt;, this is a friction point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Python 3.11+ and &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;. Pin a release tag for stability (check &lt;a href="https://github.com/github/spec-kit/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; for the latest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;specify-cli &lt;span class="nt"&gt;--from&lt;/span&gt; git+https://github.com/github/spec-kit.git@v0.7.2
specify init my-project &lt;span class="nt"&gt;--ai&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  BMAD-METHOD
&lt;/h3&gt;

&lt;p&gt;BMAD ("Breakthrough Method for Agile AI-Driven Development") is a different animal. It's a multi-agent &lt;a href="https://github.com/bmad-code-org/BMAD-METHOD" rel="noopener noreferrer"&gt;framework&lt;/a&gt; with 43K+ stars at the time of writing — 12+ AI personas (Analyst, PM, Architect, Scrum Master, Developer, QA, UX Designer...) modeled as Markdown "Agent-as-Code" files. v6 hit stable recently after an extended alpha, with features like Scale Adaptive workflows, BMad-CORE engine, and a builder toolkit for custom agents.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyst → PM (PRD) → Architect → Scrum Master (stories) → Developer → QA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each handoff is a versioned artifact. Audit trail out of the box. Every decision is traceable from requirement to PR.&lt;/p&gt;

&lt;p&gt;That structure is impressive when your deployment target is a SOC 2 audit, a consulting deliverable, or a multi-team platform. For a two-person startup, it's a trap. Here's why: BMAD is a process multiplier, not a process creator. If your team already thinks in PRDs, architecture docs, and sprint stories, BMAD will accelerate that and make it auditable. If your team doesn't have structured processes, BMAD won't conjure them — it'll reproduce your chaos across seven agents and you'll spend more time debugging agent coordination than writing code.&lt;/p&gt;

&lt;p&gt;Concrete costs people forget about: more agents means more tokens per cycle. Handoff failures between personas are a real debugging surface. And when the Architect agent makes an assumption the PM agent didn't document, the Scrum Master propagates it into stories, and the Developer implements it confidently. You find out in QA, or worse, in production. The pipeline is only as good as the weakest handoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Node.js v20+. The interactive installer handles module selection and IDE-specific file generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx bmad-method &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenSpec (Fission-AI)
&lt;/h3&gt;

&lt;p&gt;Lightweight, brownfield-first SDD. npm package (&lt;a href="https://www.npmjs.com/package/@fission-ai/openspec" rel="noopener noreferrer"&gt;&lt;code&gt;@fission-ai/openspec&lt;/code&gt;&lt;/a&gt;, 77K+ downloads). &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. Works with 25+ AI assistants via slash commands and an &lt;code&gt;AGENTS.md&lt;/code&gt; file that acts as a "README for robots."&lt;/p&gt;

&lt;p&gt;OpenSpec's core idea is change-centric specs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openspec/
  project.md                  ← project context
  specs/                      ← current system behavior
  changes/
    add-dark-mode/
      proposal.md             ← what's changing and why
      design.md               ← technical approach
      tasks.md                ← checklist
      specs/                  ← deltas: ADDED / MODIFIED / REMOVED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delta markers are the thing that makes this work for existing codebases. You're forced to categorize every change as ADDED, MODIFIED, or REMOVED relative to what exists. That discipline prevents the agent from hallucinating new requirements onto existing behavior. When the change ships, the deltas merge into the main specs, so your system-level documentation compounds over time. That's the right model for brownfield, and it's a model Spec Kit doesn't natively emphasize.&lt;/p&gt;
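&lt;p&gt;A delta file stays small. For the &lt;code&gt;add-dark-mode&lt;/code&gt; change above it might look roughly like this (my sketch, not OpenSpec's verbatim template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# changes/add-dark-mode/specs/settings/spec.md

## ADDED Requirements
### Requirement: Theme preference
The system persists a per-user theme preference (light or dark) and
applies it on every page load.

## MODIFIED Requirements
### Requirement: Settings page
The settings page now includes a theme toggle next to the existing
notification controls.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
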

&lt;p&gt;Limitations are real: specs don't self-update during implementation. If the agent drifts (and it will — more on that below), you resync manually. There's no multi-agent orchestration; a single agent runs the whole flow. And for simple tasks — a bug fix, a copy change — the overhead of a full proposal-design-tasks cycle can feel like performing surgery with a forklift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Node.js &amp;gt;= 20.19.0. Install globally and initialize inside your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @fission-ai/openspec
openspec init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A different category: SDD as a product, not a framework
&lt;/h3&gt;

&lt;p&gt;The three frameworks above are CLI tools you bolt onto your existing editor. There's another approach: products that bake SDD directly into their own environment. Two worth tracking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kiro&lt;/strong&gt; (AWS) is a VS Code fork with spec-driven development built into the IDE itself. You describe a feature, Kiro generates requirements in EARS notation, produces a technical design, and breaks it into trackable tasks — all inside the editor, no CLI involved. Powered by Claude Sonnet via Amazon Bedrock, $20/month. If you're AWS-native and want the tightest possible integration between specs and implementation, Kiro removes the seams. The tradeoff is vendor lock: you adopt their IDE, their model pipeline, their ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augment Intent&lt;/strong&gt; is a standalone desktop app (Mac, public beta as of early 2026) built around "living specs" — specifications that update themselves as agents work, solving the drift problem the CLI frameworks leave manual. Intent uses a coordinator/specialist/verifier agent architecture where multiple agents execute in parallel on isolated git worktrees, all sharing the same evolving spec. Pricing is credit-based ($20–200/month depending on tier), and it supports BYOA (Bring Your Own Agent — Claude Code, Codex, OpenCode) alongside Augment's own agents. The living-spec approach is the most interesting architectural bet in this space right now: if it works reliably, it makes the manual reconciliation step described later in this article unnecessary. It's still beta, though, and independent production validation is limited.&lt;/p&gt;

&lt;p&gt;These aren't competitors to Spec Kit, BMAD, or OpenSpec — they're a different layer. A CLI framework gives you a spec workflow inside the tools you already use. Kiro and Intent ask you to move into their environment. Whether that trade is worth it depends on how much friction you're willing to accept for tighter integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roll your own
&lt;/h3&gt;

&lt;p&gt;If you already have project context documented (stack, standards, workflows), four custom slash commands and an &lt;code&gt;AGENTS.md&lt;/code&gt; get you surprisingly far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/commands/
  plan-feature.md       → produce spec + design from intent
  break-into-tasks.md   → decompose spec into tasks
  implement.md          → execute one task within your conventions
  review-spec.md        → critique the spec for gaps
AGENTS.md               → project rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 80% of what the established frameworks do, shaped to your workflow. The other 20% — Spec Kit's presets, OpenSpec's delta markers, BMAD's agent handoffs — is the reason people use frameworks. Start custom if your workflow is idiosyncratic enough that the frameworks fight you. Otherwise, pick a framework and extend it.&lt;/p&gt;
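&lt;p&gt;If you go this route, the &lt;code&gt;AGENTS.md&lt;/code&gt; does most of the work. A sketch of what goes in it (the stack and rules here are invented for the example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AGENTS.md

## Project
Python 3.12 / FastAPI monolith, Postgres, deployed from main.

## Rules
- Run pytest before declaring a task done.
- Every new endpoint gets an entry in docs/api.md.
- Never edit generated files under src/generated/.

## Workflow
Plan with /plan-feature, run /review-spec on the draft, split with
/break-into-tasks, implement one task per commit with /implement.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
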

&lt;h2&gt;
  
  
  The spec drift problem (and what to do about it)
&lt;/h2&gt;

&lt;p&gt;This gets its own section because it's the single most common failure mode and none of the current frameworks handle it well.&lt;/p&gt;

&lt;p&gt;Here's what happens: you write a spec. The agent starts implementing. Partway through task 3 of 8, the agent encounters something the spec didn't anticipate — a library API that doesn't work as expected, a database constraint that forces a different approach, an edge case the spec didn't cover. The agent adapts. It writes working code that solves the real problem. But the spec still describes the &lt;em&gt;planned&lt;/em&gt; approach, not the &lt;em&gt;actual&lt;/em&gt; one.&lt;/p&gt;

&lt;p&gt;Now the spec is fiction. The next engineer who reads it (or the next agent that uses it as context) gets misled. As &lt;a href="https://www.augmentcode.com/blog/what-spec-driven-development-gets-wrong" rel="noopener noreferrer"&gt;Amelia Wattenberger&lt;/a&gt; put it: a stale design doc misleads the next engineer who happens to read it; a stale spec misleads agents that don't know any better, and they'll execute a plan that no longer matches reality without flagging anything wrong.&lt;/p&gt;

&lt;p&gt;This isn't a corner case. It's the default behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do about it.&lt;/strong&gt; There's no automation that fully solves this today. The practical approach is a post-implementation reconciliation step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the agent finishes implementation, run a comparison pass: "Read the spec. Read the code. List every place where they diverge."&lt;/li&gt;
&lt;li&gt;For each divergence, decide: was the agent's adaptation correct? If yes, update the spec. If no, fix the code.&lt;/li&gt;
&lt;li&gt;Commit the updated spec alongside the code diff.&lt;/li&gt;
&lt;/ol&gt;
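&lt;p&gt;The comparison pass works best as one explicit prompt rather than an open-ended "review this". Something along these lines (wording is mine, not from any framework):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read openspec/changes/add-dark-mode/design.md and tasks.md.
Read the diff on this branch.
List every place where the implementation diverges from the spec:
file, spec section, what the spec says, what the code does.
Do not fix anything yet. Output the list only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
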

&lt;p&gt;OpenSpec has &lt;code&gt;/opsx:sync&lt;/code&gt; for this. Spec Kit recently added a drift reconciliation extension (&lt;code&gt;/speckit.reconcile&lt;/code&gt;). In BMAD, you'd do it manually via a QA agent review. None of these are automatic — you have to trigger them, and you have to review the output. That's overhead, and it's the overhead that most teams skip until their specs are six months out of date.&lt;/p&gt;

&lt;p&gt;The emerging approach — what Augment Intent is built around — is bidirectional spec updates: agents write changes back to the spec as they work. That closes the loop in theory. Whether it holds up reliably across complex codebases is the open question, and it's the single biggest feature gap separating the CLI frameworks from the next generation of SDD tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the constitution fails: a real example
&lt;/h2&gt;

&lt;p&gt;EPAM published a detailed &lt;a href="https://www.epam.com/insights/ai/blogs/using-spec-kit-for-brownfield-codebase" rel="noopener noreferrer"&gt;case study&lt;/a&gt; of using Spec Kit on a brownfield codebase. One finding stands out. Their &lt;code&gt;constitution.md&lt;/code&gt; contained an explicit rule: &lt;strong&gt;"NO try-catch blocks in route handlers — use global middleware."&lt;/strong&gt; The rule was unambiguous. The agent ignored it and added try-catch blocks in route handlers anyway.&lt;/p&gt;

&lt;p&gt;This isn't a Spec Kit bug. It's a model behavior issue: the agent was pattern-matching against what it had seen in millions of codebases where try-catch in handlers is the norm, and the constitution's single-line prohibition wasn't enough to override that prior. The fix was obvious in hindsight — reinforce the rule in the constitution with context explaining &lt;em&gt;why&lt;/em&gt; (middleware-based error handling enables centralized logging and consistent error responses), not just &lt;em&gt;what&lt;/em&gt;. Models follow "why" better than "don't."&lt;/p&gt;

&lt;p&gt;The deeper lesson: a constitution isn't a config file. Writing "don't do X" isn't enough. You need "don't do X because Y, and instead do Z." The constitution that works is the one written as if you were onboarding a smart but literal-minded junior developer who has never seen your codebase. Because that's exactly what the agent is.&lt;/p&gt;
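&lt;p&gt;In practice that means rewriting prohibitions as rationale plus alternative. For the rule above, something like this (my wording, not EPAM's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Error handling
Route handlers MUST NOT contain try-catch blocks.
Why: errors are handled by the global error middleware so that logging
stays centralized and error responses stay consistent across the API.
Instead: let exceptions propagate. If a handler needs a domain-specific
error, throw a typed error and map it to a response in the middleware.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
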

&lt;h2&gt;
  
  
  Mistakes that will cost you a sprint
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SDD as waterfall.&lt;/strong&gt; Gojko Adzic flagged this when Spec Kit launched. He's right. A 50-page spec you freeze before implementation is not SDD — it's BDUF with Markdown. Specs should change during implementation. The iterative loop is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-page spec with no edge cases.&lt;/strong&gt; Looks thorough. Covers the happy path beautifully. Says nothing about what happens when the input is malformed, the downstream service 500s, or the user's session expires mid-request. The agent implements exactly what's specified. You ship a demo. It breaks the first day in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Green tests, wrong behavior.&lt;/strong&gt; Every acceptance criterion passes. Tests are green. But the solution doesn't actually solve the user's problem. Acceptance criteria are a proxy for intent, not intent itself. Add a "Why this matters" and "Non-goals" section to every spec so the agent stays grounded in the problem, not just the checklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework shopping.&lt;/strong&gt; You can burn a sprint evaluating four frameworks. You will learn nothing that four weeks of actual use on real tickets wouldn't teach you faster. Pick from the TL;DR. Start. Reconsider in a month if you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The cost of drafting specifications has collapsed. Tests, tickets, architecture docs, ADRs — artifacts that used to get skipped because they cost too much time are now cheap to produce in draft form. The review work didn't go away — but the activation energy for producing the document in the first place did. That's the change, and it's permanent regardless of which framework wins.&lt;/p&gt;

&lt;p&gt;One thing worth saying plainly: SDD tooling is still early-stage. Patterns are emerging, not standardized, and most teams are still figuring out what "good" looks like in practice. The frameworks in this article are the best available answers right now — not settled ones.&lt;/p&gt;

&lt;p&gt;If in doubt, start with OpenSpec. Invest an hour in your constitution. Wire up your MCPs so the agent can open PRs, update tickets, and run tests. And when the spec drifts from the code — not if, when — take the thirty minutes to reconcile them. That's where SDD succeeds or fails in practice, and it's the part no framework will do for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've shipped with any of these, the war stories are more useful than the docs. What broke?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>Solving the Gemini API Challenge Lab on Vertex AI: Text, Function Calling &amp; Video Understanding</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 01:43:52 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-the-gemini-api-challenge-lab-on-vertex-ai-text-function-calling-video-understanding-6pn</link>
      <guid>https://dev.to/willtorber/solving-the-gemini-api-challenge-lab-on-vertex-ai-text-function-calling-video-understanding-6pn</guid>
      <description>&lt;p&gt;The "Explore Generative AI with the Gemini API in Vertex AI: Challenge Lab" on Google Cloud Skills Boost throws three Gemini capabilities at you in one sitting: a raw REST call from Cloud Shell, function calling from a Jupyter notebook, and multimodal video analysis. None of it is hard once you know what the verifier is actually checking — but a couple of things are easy to get wrong on the first attempt and the lab gives you almost no feedback when you do.&lt;/p&gt;

&lt;p&gt;This walkthrough is the version of the solution I wish I had read before starting. I'll show you the working code for every task, but more importantly, I'll explain &lt;em&gt;why&lt;/em&gt; each piece works the way it does — including a deep dive into the function-call response object, which is genuinely interesting once you understand it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenge in one paragraph
&lt;/h2&gt;

&lt;p&gt;You're playing the role of a developer at a video-analysis startup. Your job is to prove you can wire up three Gemini features end-to-end: generating text via a direct REST call, declaring a tool that Gemini can decide to invoke, and feeding a video from Cloud Storage into the model so it can describe what it sees. The lab provides a half-finished Jupyter notebook with &lt;code&gt;INSERT&lt;/code&gt; placeholders, and your job is to fill in the blanks.&lt;/p&gt;

&lt;p&gt;The model used throughout is &lt;code&gt;gemini-2.5-flash&lt;/code&gt;, and the notebook uses the new &lt;code&gt;google-genai&lt;/code&gt; SDK (not the legacy &lt;code&gt;vertexai&lt;/code&gt; one — this matters because the class names and import paths are different).&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 1: Text generation via curl from Cloud Shell
&lt;/h2&gt;

&lt;p&gt;The first task is the simplest in concept and the most annoying in practice. You open Cloud Shell, you &lt;code&gt;curl&lt;/code&gt; the Vertex AI endpoint, you ask Gemini why the sky is blue, you get an answer back. Done.&lt;/p&gt;

&lt;p&gt;Except the verifier won't accept your call unless you hit a very specific endpoint. More on that in a moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up the environment
&lt;/h3&gt;

&lt;p&gt;The lab pre-fills these variables for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;qwiklabs-gcp-00-207c94de3534   &lt;span class="c"&gt;# yours will differ&lt;/span&gt;
&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east1
&lt;span class="nv"&gt;API_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-aiplatform&lt;/span&gt;.googleapis.com
&lt;span class="nv"&gt;MODEL_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-flash"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you need to make sure the Vertex AI API is enabled. The lab tells you to do this in the Console, but the CLI is faster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;aiplatform.googleapis.com &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The curl call (with the gotcha)
&lt;/h3&gt;

&lt;p&gt;Here's the part where the lab can quietly waste 20 minutes of your time. The Vertex AI generative endpoints expose two methods: &lt;code&gt;generateContent&lt;/code&gt; (returns one big response) and &lt;code&gt;streamGenerateContent&lt;/code&gt; (returns a stream of chunks). Both work. Both return valid Gemini answers. &lt;strong&gt;Only one of them satisfies the lab verifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The verifier checks for &lt;code&gt;streamGenerateContent&lt;/code&gt;. Use this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud auth print-access-token&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;API_ENDPOINT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/publishers/google/models/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:streamGenerateContent"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "contents": [
      {
        "role": "user",
        "parts": [
          { "text": "Why is the sky blue?" }
        ]
      }
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a JSON array back where each element contains a &lt;code&gt;candidates[].content.parts[].text&lt;/code&gt; field with text about Rayleigh scattering, you're good. Hit "Check my progress" and Task 1 turns green.&lt;/p&gt;

&lt;p&gt;If you get &lt;code&gt;403 PERMISSION_DENIED&lt;/code&gt;, the API enablement probably hasn't finished propagating yet — wait 30 seconds after enabling and try again. If you get &lt;code&gt;404&lt;/code&gt;, you've got a typo in the region or model name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; the difference between &lt;code&gt;generateContent&lt;/code&gt; and &lt;code&gt;streamGenerateContent&lt;/code&gt; is operational, not semantic. Streaming is what you'd actually want in production for any user-facing chatbot, because it lets the UI display tokens as they arrive instead of making the user stare at a spinner. The lab is implicitly nudging you toward that pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 2: Open the notebook in Vertex AI Workbench
&lt;/h2&gt;

&lt;p&gt;This task has no scoring — it's purely navigational. From the Console: &lt;strong&gt;Navigation menu → Vertex AI → Workbench&lt;/strong&gt;. Find the &lt;code&gt;generative-ai-jupyterlab&lt;/code&gt; instance (it should already be running), click &lt;strong&gt;Open JupyterLab&lt;/strong&gt;, and once the new tab loads, double-click &lt;code&gt;gemini-explorer-challenge.ipynb&lt;/code&gt;. When the kernel selector pops up, pick &lt;strong&gt;Python 3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's it. Now the real work begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 3: Function calling with Gemini
&lt;/h2&gt;

&lt;p&gt;Function calling is the feature that turns Gemini from a chatbot into something that can actually &lt;em&gt;do things&lt;/em&gt; in the world. The idea: you describe a function to the model — its name, what it does, what arguments it takes — and the model decides on its own whether and when to invoke it based on what the user is asking.&lt;/p&gt;

&lt;p&gt;The notebook has four cells to fill in. Let's do them.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 — Load the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.1
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just the model identifier as a string. The new SDK doesn't make you instantiate a model object the way the legacy &lt;code&gt;vertexai&lt;/code&gt; library did — you pass the model name straight into &lt;code&gt;client.models.generate_content()&lt;/code&gt;.&lt;/p&gt;
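&lt;p&gt;For context, a minimal sketch of what that call looks like with the new SDK, assuming a Vertex AI-backed client (the project ID and region are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google import genai

# Placeholders: substitute your own project ID and the lab's region.
client = genai.Client(vertexai=True, project="your-project-id", location="us-east1")

model_id = "gemini-2.5-flash"
response = client.models.generate_content(
    model=model_id,                    # the model name goes in as a plain string
    contents="Why is the sky blue?",   # a bare string is fine for simple prompts
)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
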

&lt;h3&gt;
  
  
  3.2 — Declare the function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.2
&lt;/span&gt;&lt;span class="n"&gt;get_current_weather_func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_current_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the current weather in a given location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;FunctionDeclaration&lt;/code&gt; (already imported at the top of the notebook from &lt;code&gt;google.genai.types&lt;/code&gt;) is how you describe a function to Gemini. Notice that you're not giving it any actual code — you're giving it a &lt;em&gt;schema&lt;/em&gt;. The &lt;code&gt;description&lt;/code&gt; field is critical: this is what Gemini reads to decide whether your function is relevant to the user's prompt. A vague description means the model might not call your function when it should, or might call it when it shouldn't.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;parameters&lt;/code&gt; block is JSON Schema. If your real function took more arguments — say, &lt;code&gt;unit&lt;/code&gt; for Celsius vs Fahrenheit — you'd add them here.&lt;/p&gt;
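&lt;p&gt;As an illustration (a hypothetical extension, not part of the lab's notebook), adding that &lt;code&gt;unit&lt;/code&gt; argument would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.genai.types import FunctionDeclaration

# Hypothetical extension of the lab's declaration: adds a "unit" argument.
get_current_weather_func = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather in a given location",
    parameters={
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. Boston",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit to use in the response",
            },
        },
        "required": ["location"],
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
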

&lt;h3&gt;
  
  
  3.3 — Wrap it in a Tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.3
&lt;/span&gt;&lt;span class="n"&gt;weather_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_current_weather_func&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;Tool&lt;/code&gt; is a container for one or more related function declarations. You could bundle &lt;code&gt;get_current_weather&lt;/code&gt; and &lt;code&gt;get_forecast&lt;/code&gt; and &lt;code&gt;get_historical_weather&lt;/code&gt; into a single tool, and Gemini would pick whichever one fits the user's question.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 — Invoke the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.4
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the weather like in Boston?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;weather_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;temperature=0&lt;/code&gt; is important here: when you're asking the model to make a structured decision (call this function with these args), you want it to be deterministic, not creative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoding the response (the interesting part)
&lt;/h3&gt;

&lt;p&gt;Run the cell and you'll see something that looks alarming the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;GenerateContentResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Candidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;avg_logprobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;0.5011326244899205&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FunctionCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;Max&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;Max&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;thought_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\xcb\x01\x01\x8f&lt;/span&gt;&lt;span class="s"&gt;=k_u&lt;/span&gt;&lt;span class="se"&gt;\x91\xe5\x14&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FinishReason&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STOP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STOP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentResponseUsageMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;text&lt;/code&gt; anywhere in the response. That's not a bug — that's the entire point. Let me unpack what's happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Part&lt;/code&gt; with &lt;code&gt;function_call&lt;/code&gt; instead of &lt;code&gt;text&lt;/code&gt;.&lt;/strong&gt; Normally a &lt;code&gt;Part&lt;/code&gt; carries a &lt;code&gt;text&lt;/code&gt; field with whatever the model wrote. This one carries a &lt;code&gt;function_call&lt;/code&gt; instead. What Gemini is telling you is: &lt;em&gt;"I cannot answer 'what's the weather in Boston' from my training data, but the user gave me a tool called &lt;code&gt;get_current_weather&lt;/code&gt; that can. I'm not going to make up an answer — I'm going to ask the caller to invoke that tool with &lt;code&gt;location='Boston'&lt;/code&gt; and pass me back the result."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;... Max depth ...&amp;gt;&lt;/code&gt; you see is just Python's &lt;code&gt;repr&lt;/code&gt; truncating the output for display. The data is there. If you actually want to read it, do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# "get_current_weather"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# {"location": "Boston"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;thought_signature&lt;/code&gt; (those scary-looking bytes).&lt;/strong&gt; Gemini 2.5 is a &lt;em&gt;thinking model&lt;/em&gt; — it does internal chain-of-thought reasoning before producing output. The &lt;code&gt;thought_signature&lt;/code&gt; is an opaque, signed blob of that reasoning. You don't read it. Its only purpose is to be passed back to Gemini in a follow-up call (the second turn of the function-calling loop, see below) so the model can resume its reasoning without having to re-derive everything from scratch. It's a cache key for the model's internal state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;finish_reason=STOP&lt;/code&gt;.&lt;/strong&gt; The model finished cleanly. Not truncated by token limit, not blocked by a safety filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The token counts.&lt;/strong&gt; This is where Gemini 2.5 gets fun:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prompt_token_count=25&lt;/code&gt;: your prompt plus the function declaration consumed 25 input tokens.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;candidates_token_count=7&lt;/code&gt;: the function call output was 7 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thoughts_token_count=39&lt;/code&gt;: the model spent &lt;strong&gt;39 tokens thinking internally&lt;/strong&gt; before deciding to call the function. This is the cost of the chain-of-thought. You're billed for it, and it's only present on the 2.5 family.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;total_token_count=71&lt;/code&gt;: the sum, which is what hits your bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The full function-calling loop (which the lab doesn't make you complete)
&lt;/h3&gt;

&lt;p&gt;What you just saw is step 2 of a 4-step dance. In a real application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You&lt;/strong&gt; send a prompt plus tool definitions to Gemini.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; returns a &lt;code&gt;function_call&lt;/code&gt; saying which function to invoke and with what args. ← &lt;em&gt;the lab stops here&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You&lt;/strong&gt; actually execute the function — call a real weather API, hit a database, whatever — and send the result back to Gemini as a &lt;code&gt;function_response&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; uses that result to compose a natural-language answer like &lt;em&gt;"It's currently 18°C and partly cloudy in Boston."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The lab only grades you up to step 2 because what's being demonstrated is that the model &lt;em&gt;understands&lt;/em&gt; the tool and knows &lt;em&gt;when&lt;/em&gt; to use it. The actual execution lives in your application code, not in Gemini's responsibilities. Once you grasp this separation of concerns, function calling stops feeling magical and starts feeling like a very natural API contract.&lt;/p&gt;
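&lt;p&gt;If you want to close the loop yourself, here's a minimal sketch of steps 3 and 4, continuing from the notebook's &lt;code&gt;response&lt;/code&gt; object. The weather values are fake stand-ins for a real API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.genai import types

# Step 3: read the function call Gemini asked for and execute it yourself.
fc = response.candidates[0].content.parts[0].function_call
weather = {"temperature_c": 18, "conditions": "partly cloudy"}  # pretend API result

# Package the result as a function_response part.
function_response = types.Part.from_function_response(
    name=fc.name,
    response={"result": weather},
)

# Step 4: send the whole exchange back so Gemini can write the final answer.
# Passing the model's own turn back also carries the thought_signature along.
followup = client.models.generate_content(
    model=model_id,
    contents=[
        types.Content(role="user", parts=[types.Part.from_text(text=prompt)]),
        response.candidates[0].content,
        types.Content(role="user", parts=[function_response]),
    ],
    config=types.GenerateContentConfig(tools=[weather_tool], temperature=0),
)
print(followup.text)  # something like "It's currently 18°C and partly cloudy in Boston."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
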

&lt;h2&gt;
  
  
  Task 4: Describing video contents
&lt;/h2&gt;

&lt;p&gt;Same model, same client, but now you're going to feed it a video file from Cloud Storage and ask it to describe what's in it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 — Load the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 4.1
&lt;/span&gt;&lt;span class="n"&gt;multimodal_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same model as before. &lt;code&gt;gemini-2.5-flash&lt;/code&gt; is natively multimodal — it doesn't need a separate "vision" or "video" variant. You hand it text, images, audio, or video, and it figures it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 — Generate the description
&lt;/h3&gt;

&lt;p&gt;The notebook has two &lt;code&gt;INSERT&lt;/code&gt; placeholders here, plus you have to recognize that it's expecting a streaming call (the &lt;code&gt;for response in responses:&lt;/code&gt; loop at the bottom is the giveaway).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 4.2 Generate a video description
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
What is shown in this video?
Where should I go to see it?
What are the top 5 places in the world that look like this?
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://github-repo/img/gemini/multimodality_usecases_overview/mediterraneansea.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;multimodal_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------Prompt--------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print_multimodal_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;-------Response--------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Part.from_uri&lt;/code&gt; is how you reference Cloud Storage assets.&lt;/strong&gt; You don't download the video to the notebook and base64-encode it — Gemini reads it directly from &lt;code&gt;gs://&lt;/code&gt;. Faster, cheaper, and works for files much larger than what you could comfortably embed inline. The &lt;code&gt;mime_type&lt;/code&gt; is required so the model knows how to decode the bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;contents&lt;/code&gt; is a list mixing text and media.&lt;/strong&gt; You pass &lt;code&gt;[prompt, video]&lt;/code&gt; and the SDK figures out what each element is. You could pass &lt;code&gt;[image, prompt, video, image, prompt]&lt;/code&gt; if you wanted — the model treats it as a sequential multimodal message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;generate_content_stream&lt;/code&gt;, not &lt;code&gt;generate_content&lt;/code&gt;.&lt;/strong&gt; This is the second &lt;code&gt;INSERT&lt;/code&gt; and it's the one most people miss. The &lt;code&gt;for response in responses:&lt;/code&gt; loop at the bottom of the cell only makes sense if &lt;code&gt;responses&lt;/code&gt; is iterable — which it is for the streaming version. If you used the non-streaming &lt;code&gt;generate_content&lt;/code&gt;, you'd get back a single response object and the &lt;code&gt;for&lt;/code&gt; loop would iterate over its attributes and break in confusing ways. The lab's hint is in the comment links: one of them points to the "stream response" docs.&lt;/p&gt;

&lt;p&gt;When you run it, you'll see the video embedded in the notebook and then a streaming description fill in chunk by chunk — turquoise water, rocky cliffs, the Mediterranean — followed by a top-5 list with places like Amalfi, Santorini, the Côte d'Azur, Mallorca, and Croatia's Dalmatian coast.&lt;/p&gt;

&lt;p&gt;Hit "Check my progress" and Task 4 goes green.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key learnings
&lt;/h2&gt;

&lt;p&gt;A few things worth taking away from this lab beyond just passing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;google-genai&lt;/code&gt; SDK is not the old &lt;code&gt;vertexai&lt;/code&gt; SDK.&lt;/strong&gt; If you've used Vertex AI's generative features before, you're probably used to &lt;code&gt;from vertexai.generative_models import GenerativeModel&lt;/code&gt;. That's the legacy path. The new path is &lt;code&gt;from google import genai&lt;/code&gt; plus &lt;code&gt;from google.genai.types import ...&lt;/code&gt;. Class names like &lt;code&gt;FunctionDeclaration&lt;/code&gt;, &lt;code&gt;Tool&lt;/code&gt;, and &lt;code&gt;Part&lt;/code&gt; are similar but live in different modules. Don't mix them — pick one and stick with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function calling is a contract, not an execution.&lt;/strong&gt; Gemini will never actually call your function. It will tell you &lt;em&gt;that you should&lt;/em&gt; call your function, with these args, and then wait for you to pass the result back. The model is the brain; your code is the hands. This separation is what makes function calling safe to deploy in production — you control exactly what the model can and cannot reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking tokens are real and they cost money.&lt;/strong&gt; Gemini 2.5 Flash's &lt;code&gt;thoughts_token_count&lt;/code&gt; is a separate billable line item from input and output tokens. For most prompts it's small, but for complex reasoning tasks it can dominate the bill. If you're cost-optimizing, this is worth measuring.&lt;/p&gt;
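&lt;p&gt;Measuring it is one line against the &lt;code&gt;usage_metadata&lt;/code&gt; shown earlier; a quick sketch using the Task 3 response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Log the thinking cost per call using the usage_metadata fields shown above.
usage = response.usage_metadata
print(
    f"prompt={usage.prompt_token_count} "
    f"output={usage.candidates_token_count} "
    f"thinking={usage.thoughts_token_count} "
    f"total={usage.total_token_count}"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
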

&lt;p&gt;&lt;strong&gt;Multimodal inputs come from Cloud Storage, not from your notebook.&lt;/strong&gt; For anything bigger than a small image, the right pattern is to upload to GCS and reference with &lt;code&gt;Part.from_uri&lt;/code&gt;. This avoids round-tripping bytes through your runtime and is dramatically faster for video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming vs non-streaming is a real choice.&lt;/strong&gt; &lt;code&gt;generateContent&lt;/code&gt; returns a single payload. &lt;code&gt;streamGenerateContent&lt;/code&gt; returns chunks as they're produced. Pick streaming for any user-facing experience and non-streaming for server-to-server batch jobs where latency-to-first-token doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices
&lt;/h2&gt;

&lt;p&gt;A few things I'd do differently in real code compared with what the lab asks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never hard-code the project ID.&lt;/strong&gt; The notebook has &lt;code&gt;PROJECT_ID = "qwiklabs-gcp-..."&lt;/code&gt; because the lab is ephemeral, but in production read it from &lt;code&gt;google.auth.default()&lt;/code&gt; or an environment variable (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write detailed function descriptions.&lt;/strong&gt; "Get the current weather" is fine for a demo. For real tools, describe what the function returns, what units it uses, what error conditions it can surface, and anything else that helps the model decide when to invoke it. The model only sees what you write.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always set &lt;code&gt;temperature=0&lt;/code&gt; for tool calls.&lt;/strong&gt; Creative variation in a function-call decision is almost never what you want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle the multi-turn flow.&lt;/strong&gt; A demo that stops at step 2 of the function-calling loop isn't a real integration. Build out the full round-trip: receive the function call, execute it, send the &lt;code&gt;function_response&lt;/code&gt; back, get the natural-language answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate tool arguments before executing.&lt;/strong&gt; Gemini is good at structured outputs but not perfect. Your function executor should treat the args as untrusted input and validate them against the schema before doing anything destructive.&lt;/li&gt;
&lt;/ul&gt;
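
&lt;p&gt;For the first and last bullets, a minimal sketch (the tool name, the schema checks, and the &lt;code&gt;get_current_weather&lt;/code&gt; implementation are illustrative, not from the lab):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import google.auth

# Resolve the project from the environment instead of hard-coding it
credentials, project_id = google.auth.default()

def run_tool(function_call):
    """Treat model-produced args as untrusted input before executing anything."""
    args = dict(function_call.args)
    if function_call.name != "get_current_weather":
        raise ValueError(f"Unexpected tool: {function_call.name}")
    location = args.get("location")
    if not isinstance(location, str) or not location.strip():
        raise ValueError("Missing or invalid 'location' argument")
    return get_current_weather(location=location)  # your real implementation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;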

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The Gemini API challenge lab covers a small surface area but is a surprisingly good introduction to three patterns you'll use constantly if you build with Vertex AI: direct REST access for quick experiments, function calling for tool-using agents, and multimodal inputs from Cloud Storage. The three things that tripped me up — the &lt;code&gt;streamGenerateContent&lt;/code&gt; requirement in Task 1, the meaning of the function-call response object in Task 3, and the streaming method in Task 4 — are the things worth remembering, because they all reflect how you'd actually use these APIs in production.&lt;/p&gt;

&lt;p&gt;Now go build something with it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>googleaichallenge</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Solving "Analyze and Reason on Multimodal Data with Gemini: Challenge Lab" — A Complete Guide</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Wed, 08 Apr 2026 04:45:09 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-analyze-and-reason-on-multimodal-data-with-gemini-challenge-lab-a-complete-guide-4che</link>
      <guid>https://dev.to/willtorber/solving-analyze-and-reason-on-multimodal-data-with-gemini-challenge-lab-a-complete-guide-4che</guid>
      <description>&lt;p&gt;Multimodal AI is no longer a futuristic concept — it's a practical tool that can analyze text reviews, product images, and podcast audio in a single workflow. In this post, I walk through the &lt;strong&gt;&lt;a href="https://www.skills.google/course_templates/1240/labs/618945?locale=en" rel="noopener noreferrer"&gt;GSP524 Challenge Lab&lt;/a&gt;&lt;/strong&gt; from Google Cloud Skills Boost, where we use the &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; model on Vertex AI to extract actionable marketing insights from three different data modalities for a fictional brand called &lt;strong&gt;Cymbal Direct&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're preparing for this lab or want to understand how multimodal prompting with Gemini actually works in practice, this guide covers every task with the reasoning behind each solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Cymbal Direct has just launched a new line of athletic apparel. Our job is to analyze social media engagement across three channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt; — Customer reviews and social media posts (sentiment, themes, product mentions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt; — Influencer and customer photos (style trends, visual messaging, target audience).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt; — A podcast interview with a Cymbal Direct representative (satisfaction drivers, biases, recommendations).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, we synthesize everything into a comprehensive Markdown report and upload it to Cloud Storage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment Setup (Task 1)
&lt;/h2&gt;

&lt;p&gt;The lab provides a pre-configured &lt;strong&gt;Vertex AI Workbench&lt;/strong&gt; instance with a Jupyter notebook (&lt;code&gt;gsp524-challenge.ipynb&lt;/code&gt;). Task 1 has no TODOs — you just run the provided cells to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Google Gen AI SDK (&lt;code&gt;google-genai&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Restart the kernel (important — the new package won't load without this).&lt;/li&gt;
&lt;li&gt;Import all required libraries, including &lt;code&gt;Part&lt;/code&gt;, &lt;code&gt;ThinkingConfig&lt;/code&gt;, and &lt;code&gt;GenerateContentConfig&lt;/code&gt; from &lt;code&gt;google.genai.types&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Initialize the Gen AI client pointing to your lab project.&lt;/li&gt;
&lt;li&gt;Set the model ID to &lt;code&gt;gemini-2.5-flash&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two critical objects are set up here that you'll reuse throughout the lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The client — your gateway to Gemini
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LOCATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The model
&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later, a &lt;code&gt;config&lt;/code&gt; object enables &lt;strong&gt;Gemini thinking&lt;/strong&gt; (extended reasoning) with a dynamic thinking budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;include_thoughts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Dynamic: model decides how much to reason
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;config&lt;/code&gt; is the key difference between a basic call and a deep-reasoning call. You'll use it in every "Deep Dive" section.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 2: Analyzing Customer Reviews (Text)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;The first real challenge is constructing a prompt that tells Gemini exactly what to extract from the raw text data. The reviews are loaded from a file, and we embed them directly into the prompt using an f-string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following customer reviews and social media posts about
Cymbal Direct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s new athletic apparel line. For each review or post:
- Identify the overall sentiment (positive, negative, or neutral).
- Extract key themes and topics discussed, such as product quality,
  fit, style, customer service, and pricing.
- Identify any frequently mentioned product names or specific features.

Provide a structured summary of your findings in Markdown format.

Customer Reviews and Social Media Posts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; The prompt is explicit about the three dimensions we care about (sentiment, themes, product names) and asks for structured Markdown output. Gemini handles the rest — it categorizes each review and surfaces patterns across the dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Dive with Thinking
&lt;/h3&gt;

&lt;p&gt;Now we go deeper. The second prompt asks Gemini to &lt;em&gt;reason&lt;/em&gt; about what's driving sentiment and to role-play as a marketing consultant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following customer reviews and social media posts in detail.
Specifically:
- Identify the main factors driving positive and negative sentiment.
- Assess the overall impact on brand perception.
- Identify three key areas where Cymbal Direct can improve.
- Highlight the three most important takeaways as if presenting to
  the Cymbal Direct marketing team.

Customer Reviews and Social Media Posts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- This enables thinking mode
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only API-level difference is passing &lt;code&gt;config=config&lt;/code&gt;. But the output is dramatically richer — Gemini shows its chain of thought before delivering the final answer, and the &lt;code&gt;print_thoughts()&lt;/code&gt; helper function separates these for display.&lt;/p&gt;
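
&lt;p&gt;The lab ships that helper for you; a minimal version of the same idea (assuming the &lt;code&gt;google-genai&lt;/code&gt; response shape when &lt;code&gt;include_thoughts=True&lt;/code&gt;) looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def print_thoughts(response):
    # With include_thoughts=True, each part is flagged as reasoning or answer
    for part in response.candidates[0].content.parts:
        if not part.text:
            continue
        label = "THOUGHT" if part.thought else "ANSWER"
        print(f"--- {label} ---\n{part.text}\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;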

&lt;p&gt;The analysis is saved to &lt;code&gt;analysis/text_analysis.md&lt;/code&gt; for use in the final synthesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 3: Analyzing Images (Visual Content)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;Images require a different content structure. Instead of embedding data in the prompt string, we pass a list of &lt;code&gt;Part&lt;/code&gt; objects alongside the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following images of Cymbal Direct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s new athletic apparel line.
For each image:
- Identify the apparel items shown.
- Describe the attributes of each item (color, style, material, branding).
- Identify any prominent style trends or preferences across the images.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;image_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt + list of image Part objects
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key pattern:&lt;/strong&gt; For multimodal content, &lt;code&gt;contents&lt;/code&gt; accepts a list that mixes a text prompt with &lt;code&gt;Part&lt;/code&gt; objects (images, audio, video). Here the prompt comes first, followed by the image parts, which are loaded as bytes and wrapped with &lt;code&gt;Part.from_bytes()&lt;/code&gt;.&lt;/p&gt;
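
&lt;p&gt;The loading step itself is just reading bytes and wrapping them (the folder and file extension here are placeholders for however the lab stages its images):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
from google.genai.types import Part

image_parts = []
for image_file in sorted(Path("images").glob("*.png")):  # hypothetical local folder
    image_parts.append(
        Part.from_bytes(data=image_file.read_bytes(), mime_type="image/png")
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;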

&lt;h3&gt;
  
  
  Reasoning on Image Trends
&lt;/h3&gt;

&lt;p&gt;The deep dive asks Gemini to go beyond description into &lt;em&gt;inference&lt;/em&gt; — hypothesizing about target audience, analyzing visual composition, and comparing to broader fashion trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the images in greater detail:
- Hypothesize about the target audience for each image.
- Analyze how visual elements contribute to the overall message and appeal.
- Compare observed trends with broader athletic wear fashion trends.
- Provide recommendations for future marketing campaigns.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;image_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern: prompt + image parts + thinking config. Results are saved to &lt;code&gt;analysis/image_analysis.md&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 4: Analyzing Audio (Podcast)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;Audio follows the same multimodal pattern, but uses &lt;code&gt;Part.from_uri()&lt;/code&gt; instead of &lt;code&gt;Part.from_bytes()&lt;/code&gt; since the audio file lives in Cloud Storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Audio part (created in a setup cell)
&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-bucket/media/audio/cymbal_direct_expert_interview.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following audio recording:
- Transcribe the conversation, identifying different speakers.
- Provide sentiment analysis (positive, negative, neutral opinions).
- Identify key themes (comfort, fit, performance, style, competitor comparisons).
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Audio first, then prompt
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note the order:&lt;/strong&gt; For audio, the &lt;code&gt;audio_part&lt;/code&gt; comes &lt;em&gt;before&lt;/em&gt; the prompt in the contents list. This is a subtle but important detail — Gemini processes the audio first, then applies the prompt instructions to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning on Audio Insights
&lt;/h3&gt;

&lt;p&gt;The deep dive extracts strategic intelligence from the conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the audio recording in greater detail:
- Reason about overall customer satisfaction.
- Deduce key factors influencing customer perception.
- Develop three data-driven recommendations.
- Identify potential biases or limitations in the audio data.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is particularly interesting because Gemini can identify biases like interviewer framing or selection bias in who was invited to the podcast — something that requires genuine reasoning, not just transcription.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 5: Synthesizing Multimodal Insights
&lt;/h2&gt;

&lt;p&gt;The final task loads all three analysis files and asks Gemini to produce a unified report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;comprehensive_report_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Based on the following combined analysis of text reviews, image analysis,
and audio insights, generate a comprehensive report:
- Summarize overall sentiment across all data modalities.
- Identify key themes and trends in customer feedback.
- Provide insights on style preferences, usage patterns, and behavior.
- Evaluate how audio insights fit with product image and text feedback.
- Offer actionable recommendations for marketing strategy and positioning.

Format the report in well-structured Markdown with clear sections.

Combined Analysis Results:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;all_analysis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;comprehensive_report_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After generating the report, it's saved locally and uploaded to Cloud Storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;gcloud&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last step is what the grading system checks, so don't skip it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One API, three modalities.&lt;/strong&gt; The &lt;code&gt;generate_content&lt;/code&gt; method handles text, images, and audio with the same interface — the only difference is how you construct the &lt;code&gt;contents&lt;/code&gt; list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thinking mode is a single config toggle.&lt;/strong&gt; Adding &lt;code&gt;config=config&lt;/code&gt; with &lt;code&gt;include_thoughts=True&lt;/code&gt; transforms a surface-level response into a reasoned analysis. The &lt;code&gt;-1&lt;/code&gt; thinking budget lets the model decide how deep to go based on prompt complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt specificity drives output quality.&lt;/strong&gt; Vague prompts produce vague results. Each prompt in this lab explicitly lists the dimensions to analyze (sentiment, themes, audience, recommendations), and the output quality reflects that precision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content ordering matters for multimodal inputs.&lt;/strong&gt; For images, the prompt comes first followed by image parts. For audio, the audio part comes first. This isn't arbitrary — it affects how the model processes the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaining analyses enables synthesis.&lt;/strong&gt; By saving intermediate results to files and feeding them into a final prompt, we build a pipeline where each modality's insights compound into a richer final report (a sketch of the combining step follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
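
&lt;p&gt;The combining step itself is simple file concatenation (a sketch; the audio file name is an assumption, while the text and image names match the earlier tasks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

analysis_files = [
    "analysis/text_analysis.md",
    "analysis/image_analysis.md",
    "analysis/audio_analysis.md",   # assumed name for the Task 4 output
]
all_analysis = "\n\n".join(Path(f).read_text() for f in analysis_files)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;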




&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always ask for structured output.&lt;/strong&gt; Requesting "Markdown format with clear sections" gives you parseable, presentable results instead of a wall of text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use thinking mode for analysis, skip it for extraction.&lt;/strong&gt; Initial passes (transcription, item identification) don't need extended reasoning. Deep dives (inferring audience, identifying biases, generating recommendations) benefit enormously from it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embed data directly in prompts for text; use Part objects for binary data.&lt;/strong&gt; Text data fits naturally inside f-strings. Images and audio should always go through &lt;code&gt;Part.from_bytes()&lt;/code&gt; or &lt;code&gt;Part.from_uri()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save intermediate results.&lt;/strong&gt; Writing each analysis to a file creates a paper trail and enables the final synthesis step without re-running expensive model calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't forget the upload.&lt;/strong&gt; In challenge labs, the grading system checks Cloud Storage — your analysis could be perfect, but if the file isn't in the bucket, you won't pass.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This challenge lab demonstrates a realistic workflow for multimodal AI analysis: ingest data from different sources, extract structured insights from each, apply deeper reasoning where it matters, and synthesize everything into a decision-ready report. The Gemini 2.5 Flash model on Vertex AI makes this surprisingly straightforward — the same &lt;code&gt;generate_content&lt;/code&gt; call handles text, images, and audio, and the thinking mode adds genuine analytical depth without requiring a different model or API.&lt;/p&gt;

&lt;p&gt;The patterns here — structured prompts, multimodal content lists, thinking configuration, and chained analyses — are directly applicable to real-world use cases like brand monitoring, market research, and content analysis. The hard part isn't the API calls; it's crafting prompts that extract the right insights from the right data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googleaichallenge</category>
      <category>python</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Solving "Use Machine Learning APIs on Google Cloud: Challenge Lab" — A Complete Guide</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 19 Mar 2026 01:41:08 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-use-machine-learning-apis-on-google-cloud-challenge-lab-a-complete-guide-4no6</link>
      <guid>https://dev.to/willtorber/solving-use-machine-learning-apis-on-google-cloud-challenge-lab-a-complete-guide-4no6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This &lt;a href="https://www.skills.google/course_templates/630/labs/612231?locale=en" rel="noopener noreferrer"&gt;challenge&lt;/a&gt; lab tests your ability to build an end-to-end pipeline that extracts text from images using the &lt;strong&gt;Cloud Vision API&lt;/strong&gt;, translates it with the &lt;strong&gt;Cloud Translation API&lt;/strong&gt;, and loads the results into &lt;strong&gt;BigQuery&lt;/strong&gt;. Unlike guided labs, you're expected to fill in the blanks of a partially written Python script and configure IAM permissions yourself.&lt;/p&gt;

&lt;p&gt;Let's walk through every task with clear explanations of &lt;em&gt;why&lt;/em&gt; each step matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Python script reads image files from a &lt;strong&gt;Cloud Storage&lt;/strong&gt; bucket&lt;/li&gt;
&lt;li&gt;Each image is sent to the &lt;strong&gt;Cloud Vision API&lt;/strong&gt; for text detection&lt;/li&gt;
&lt;li&gt;The extracted text is saved back to Cloud Storage as a &lt;code&gt;.txt&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;If the text is &lt;strong&gt;not&lt;/strong&gt; in Japanese (&lt;code&gt;locale != 'ja'&lt;/code&gt;), it's sent to the &lt;strong&gt;Translation API&lt;/strong&gt; to get a Japanese translation&lt;/li&gt;
&lt;li&gt;All results (original text, locale, translation) are uploaded to a &lt;strong&gt;BigQuery&lt;/strong&gt; table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn4p2j9bdb8nyx6y3zh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn4p2j9bdb8nyx6y3zh7.png" alt="Graphic description of the challenge" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 1: Configure a Service Account
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why a Service Account?
&lt;/h3&gt;

&lt;p&gt;The Python script needs programmatic access to Vision API, Translation API, Cloud Storage, and BigQuery. A service account acts as the script's identity, and IAM roles define what it can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set your project ID&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create the service account&lt;/span&gt;
gcloud iam service-accounts create my-ml-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ML API Service Account"&lt;/span&gt;

&lt;span class="c"&gt;# Grant BigQuery Data Editor role (to insert rows)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/bigquery.dataEditor"&lt;/span&gt;

&lt;span class="c"&gt;# Grant Cloud Storage Object Admin role (to read images and write text files)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectAdmin"&lt;/span&gt;

&lt;span class="c"&gt;# Grant Service Usage Consumer role (required to make API calls within the project)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Without &lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt;, the service account cannot consume any enabled APIs in the project (BigQuery, Vision, Translation, etc.), even if it has data-level roles like &lt;code&gt;dataEditor&lt;/code&gt; or &lt;code&gt;storage.objectAdmin&lt;/code&gt;. This results in a &lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; error.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects get-iam-policy &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flatten&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings[].members"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings.members:my-ml-sa@"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;roles/bigquery.dataEditor&lt;/code&gt;, &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;, and &lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt; listed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 2: Create and Download Credentials
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Download a Key?
&lt;/h3&gt;

&lt;p&gt;While Cloud Shell has default credentials for the logged-in user, the challenge explicitly requires you to create a JSON key file and point the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable to it. This simulates how credentials work in production environments outside GCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the JSON key file&lt;/span&gt;
gcloud iam service-accounts keys create ml-sa-key.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--iam-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-ml-sa@&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.iam.gserviceaccount.com

&lt;span class="c"&gt;# Set the environment variable so Google Cloud client libraries find the key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/ml-sa-key.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Task 3: Modify the Script — Vision API Text Detection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Get the Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;gs://&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/analyze-images-v2.py &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to Modify
&lt;/h3&gt;

&lt;p&gt;The script has four sections that need your attention: three &lt;code&gt;# TBD:&lt;/code&gt; comments and one commented-out BigQuery upload line. Open the script with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano analyze-images-v2.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TBD #1 — Create a Vision API image object:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: Create a Vision API image object called image_object
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;image_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an &lt;code&gt;Image&lt;/code&gt; object from the raw bytes downloaded from Cloud Storage (&lt;code&gt;file_content&lt;/code&gt;). The Vision API requires this object format to process images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TBD #2 — Call the Vision API to detect text:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: Detect text in the image and save the response data into an object called response
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;document_text_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends the image to the Vision API's &lt;code&gt;document_text_detection&lt;/code&gt; method, which performs OCR optimized for dense blocks of text and handles the lab's sign images well. Note that the client variable is called &lt;code&gt;vision_client&lt;/code&gt; (as defined earlier in the script), and the image parameter uses the &lt;code&gt;image_object&lt;/code&gt; we just created.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test It
&lt;/h3&gt;

&lt;p&gt;Run the script after completing TBDs #1 and #2 to verify text extraction works before moving on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see extracted text appearing in the console output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 4: Modify the Script — Translation API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What to Modify
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TBD #3 — Translate non-Japanese text to Japanese:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: According to the target language pass the description data to the translation API
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;translation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use &lt;code&gt;desc&lt;/code&gt; (not a generic variable like &lt;code&gt;text&lt;/code&gt;) because that's the variable name the script assigns to the extracted description earlier: &lt;code&gt;desc = response.text_annotations[0].description&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The target language is &lt;code&gt;'ja'&lt;/code&gt; (Japanese) as specified in the lab instructions&lt;/li&gt;
&lt;li&gt;The result is stored in &lt;code&gt;translation&lt;/code&gt;, and the script already accesses &lt;code&gt;translation['translatedText']&lt;/code&gt; on the next line&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enable the BigQuery Upload
&lt;/h2&gt;

&lt;p&gt;At the very end of the script, find the commented-out line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# errors = bq_client.insert_rows(table, rows_for_bq)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove the &lt;code&gt;#&lt;/code&gt; to enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The line immediately after (&lt;code&gt;assert errors == []&lt;/code&gt;) will verify the upload succeeded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complete Modified Script Reference
&lt;/h3&gt;

&lt;p&gt;Here's a summary of all four changes in the script:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location in Script&lt;/th&gt;
&lt;th&gt;What to Add / Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: Create a Vision API image object&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;image_object = vision.Image(content=file_content)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: Detect text in the image&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;response = vision_client.document_text_detection(image=image_object)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: According to the target language&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;translation = translate_client.translate(desc, target_language='ja')&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last commented line&lt;/td&gt;
&lt;td&gt;Remove &lt;code&gt;#&lt;/code&gt; from &lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Run the Complete Script
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch the output — you should see text being extracted from each image, locale detection, and Japanese translations for non-Japanese text, followed by "Writing Vision API image data to BigQuery..."&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Python Script (&lt;code&gt;analyze-images-v2.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Before modifying the script, it's important to understand what it does. Here's a general overview followed by a line-by-line breakdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  General Overview
&lt;/h3&gt;

&lt;p&gt;The script is an automated image-processing pipeline. It connects to four Google Cloud services simultaneously: Cloud Storage (to read images and write text files), Vision API (to extract text from images via OCR), Translation API (to translate non-Japanese text into Japanese), and BigQuery (to store the final results in a queryable table).&lt;/p&gt;

&lt;p&gt;The workflow for each image is: download the image bytes from the bucket → send them to the Vision API → save the detected text back to Cloud Storage as a &lt;code&gt;.txt&lt;/code&gt; file → check the language locale → if not Japanese, translate to Japanese → collect all results → batch-upload everything to BigQuery at the end.&lt;/p&gt;
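
&lt;p&gt;Roughly paraphrased (this is not the script's literal code; it just stitches the TBD snippets into the loop they live in, and the text-file naming and row layout are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rows_for_bq = []
bucket = storage_client.bucket(bucket_name)

for blob in storage_client.list_blobs(bucket_name):
    if not blob.name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    file_content = blob.download_as_bytes()

    # OCR the image (TBD #1 and #2)
    image_object = vision.Image(content=file_content)
    response = vision_client.document_text_detection(image=image_object)
    desc = response.text_annotations[0].description
    locale = response.text_annotations[0].locale

    # Save the raw text back to Cloud Storage (naming scheme assumed)
    bucket.blob(blob.name + ".txt").upload_from_string(desc)

    # Translate anything that is not already Japanese (TBD #3)
    if locale != "ja":
        translation = translate_client.translate(desc, target_language="ja")
        translated_text = translation["translatedText"]
    else:
        translated_text = desc

    # Row layout is illustrative; the real table schema drives the order
    rows_for_bq.append((desc, locale, translated_text, blob.name))

# Batch-upload once at the end, then verify
errors = bq_client.insert_rows(table, rows_for_bq)
assert errors == []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;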

&lt;h3&gt;
  
  
  Line-by-Line Breakdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dataset: image_classification_dataset
# Table name: image_text_detail
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 1-4:&lt;/strong&gt; Comments documenting the target BigQuery dataset/table. Imports &lt;code&gt;os&lt;/code&gt; (to read environment variables) and &lt;code&gt;sys&lt;/code&gt; (to read command-line arguments).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translate_v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 7:&lt;/strong&gt; Imports the five Google Cloud client libraries. &lt;code&gt;storage&lt;/code&gt; for Cloud Storage, &lt;code&gt;bigquery&lt;/code&gt; for BigQuery, &lt;code&gt;language&lt;/code&gt; for Natural Language API (not used in this script but imported from the original template), &lt;code&gt;vision&lt;/code&gt; for Vision API, and &lt;code&gt;translate_v2&lt;/code&gt; for the Translation API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
        &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The GOOGLE_APPLICATION_CREDENTIALS file does not exist.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The GOOGLE_APPLICATION_CREDENTIALS environment variable is not defined.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 9-15:&lt;/strong&gt; &lt;strong&gt;Credentials check.&lt;/strong&gt; Verifies two things: (1) the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable is set, and (2) the file it points to actually exists on disk. If either check fails, the script exits immediately with an error message. This is a safety gate — without valid credentials, no API call will work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You must provide parameters for the Google Cloud project ID and Storage bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3 &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[PROJECT_NAME] [BUCKET_NAME]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;project_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 17-23:&lt;/strong&gt; &lt;strong&gt;Argument parsing.&lt;/strong&gt; The script requires two command-line arguments: the GCP project ID and the Cloud Storage bucket name. In this lab, both are the same value (your project ID). If you forget to pass them, the script prints usage instructions and exits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nl_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 26-28:&lt;/strong&gt; &lt;strong&gt;Client initialization (part 1).&lt;/strong&gt; Creates client objects for Cloud Storage, BigQuery (bound to your project), and the Natural Language API. The &lt;code&gt;nl_client&lt;/code&gt; is inherited from the original template but not used in this challenge.&lt;br&gt;
&lt;/p&gt;
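
&lt;p&gt;All of these clients pick up the key automatically from &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;. If you ever need to load it explicitly, for example outside Cloud Shell, a sketch using the google-auth library (key filename from Task 2, project ID is a placeholder) would be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery
from google.oauth2 import service_account

# Load the key file explicitly instead of relying on the environment variable.
credentials = service_account.Credentials.from_service_account_file('ml-sa-key.json')
bq_client = bigquery.Client(project='your-project-id', credentials=credentials)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;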

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vision_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ImageAnnotatorClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;translate_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 31-32:&lt;/strong&gt; &lt;strong&gt;Client initialization (part 2).&lt;/strong&gt; Creates the Vision API client (for text detection) and the Translation API client (for translating text). These are the two ML API clients you'll use in the TBD sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_classification_dataset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_text_detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 35-38:&lt;/strong&gt; &lt;strong&gt;BigQuery table setup.&lt;/strong&gt; Creates a reference chain: dataset name → dataset object → table name → table object. The &lt;code&gt;get_table()&lt;/code&gt; call actually contacts BigQuery to verify the table exists and retrieves its schema. This is where the &lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; error occurs if the service account lacks the &lt;code&gt;serviceUsageConsumer&lt;/code&gt; role.&lt;br&gt;
&lt;/p&gt;
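
&lt;p&gt;If you want that 403 to fail with a clearer message than a raw traceback, one optional approach (not required by the lab) is to wrap the &lt;code&gt;get_table()&lt;/code&gt; call and catch the google-api-core exceptions; the project ID below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.api_core import exceptions
from google.cloud import bigquery

bq_client = bigquery.Client(project='your-project-id')

try:
    table = bq_client.get_table('image_classification_dataset.image_text_detail')
except exceptions.Forbidden as err:
    # The 403 USER_PROJECT_DENIED case: the service account is missing
    # roles/serviceusage.serviceUsageConsumer on the project.
    print('Permission problem reaching BigQuery: ' + str(err))
except exceptions.NotFound:
    print('Dataset or table is missing; the lab pre-creates both.')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;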

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rows_for_bq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 41:&lt;/strong&gt; &lt;strong&gt;Results buffer.&lt;/strong&gt; Initializes an empty list that will accumulate tuples of &lt;code&gt;(description, locale, translated_text, filename)&lt;/code&gt; for each processed image. These get batch-uploaded to BigQuery at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;list_blobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 44-45:&lt;/strong&gt; &lt;strong&gt;Bucket access.&lt;/strong&gt; &lt;code&gt;list_blobs()&lt;/code&gt; returns an iterator over every file (blob) in the bucket. The &lt;code&gt;bucket&lt;/code&gt; object is saved separately because we'll need it later to upload text files.&lt;br&gt;
&lt;/p&gt;
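
&lt;p&gt;Before running the full script, you can sanity-check what the loop will see with a few lines that only list the object names (the bucket name below is a placeholder; in this lab it equals your project ID):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import storage

storage_client = storage.Client()
bucket_name = 'your-project-id'

# Print every object name so you can confirm the sample images are in the bucket.
for blob in storage_client.bucket(bucket_name).list_blobs():
    print(blob.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;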

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing image files from GCS. This will take a few minutes..&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 47:&lt;/strong&gt; Status message so you know the script is working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;  &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_as_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 50-52:&lt;/strong&gt; &lt;strong&gt;Main loop start.&lt;/strong&gt; Iterates over every blob in the bucket, filters for image files (&lt;code&gt;.jpg&lt;/code&gt; or &lt;code&gt;.png&lt;/code&gt;), and downloads the image as raw bytes into &lt;code&gt;file_content&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# TBD: Create a Vision API image object called image_object
&lt;/span&gt;        &lt;span class="n"&gt;image_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 55 (TBD #1):&lt;/strong&gt; Wraps the raw image bytes into a &lt;code&gt;vision.Image&lt;/code&gt; object. The Vision API cannot accept raw bytes directly — it needs this structured object that can hold either image bytes (&lt;code&gt;content&lt;/code&gt;) or a GCS URI (&lt;code&gt;source&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;
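
&lt;p&gt;For reference, the same &lt;code&gt;vision.Image&lt;/code&gt; type can instead point at an object already in Cloud Storage rather than carrying the bytes. The lab downloads the bytes, but the URI form, with a placeholder path, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import vision

# Alternative to content=: reference the object in Cloud Storage by URI.
# The gs:// path below is a placeholder.
image_object = vision.Image(
    source=vision.ImageSource(image_uri='gs://your-bucket/sign1.jpg')
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;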

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# TBD: Detect text in the image and save the response data into an object called response
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;document_text_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 59 (TBD #2):&lt;/strong&gt; Sends the image to the Vision API's &lt;code&gt;document_text_detection&lt;/code&gt; method. This performs OCR (Optical Character Recognition) optimized for dense text. The response contains a list of &lt;code&gt;text_annotations&lt;/code&gt; — the first element holds the full concatenated text and the detected language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;text_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 62:&lt;/strong&gt; Extracts the full detected text from the first annotation. When text is detected, &lt;code&gt;text_annotations&lt;/code&gt; holds the complete text at index &lt;code&gt;[0]&lt;/code&gt;, with individual word-level detections in the subsequent indices; if an image contains no text at all, the list is empty and this line would raise an &lt;code&gt;IndexError&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
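
&lt;p&gt;The lab images all contain text, so indexing &lt;code&gt;[0]&lt;/code&gt; directly is safe here. In your own code you would probably guard against images with no detections and surface per-image API errors; a sketch of that defensive handling, reusing the &lt;code&gt;response&lt;/code&gt; and &lt;code&gt;file&lt;/code&gt; variables from the loop above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        # Defensive handling around the document_text_detection response.
        if response.error.message:
            # Per-image failures are reported here instead of raising an exception.
            print('Vision API error for ' + file.name + ': ' + response.error.message)
        elif not response.text_annotations:
            print('No text detected in ' + file.name)
        else:
            text_data = response.text_annotations[0].description
            locale = response.text_annotations[0].locale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;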

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 65-67:&lt;/strong&gt; &lt;strong&gt;Save text to Cloud Storage.&lt;/strong&gt; Converts the image filename (e.g., &lt;code&gt;sign1.jpg&lt;/code&gt;) to a text filename (&lt;code&gt;sign1.txt&lt;/code&gt;), creates a blob reference, and uploads the extracted text. This creates a text file in the same bucket for each processed image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
        &lt;span class="n"&gt;locale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;locale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 72-73:&lt;/strong&gt; Extracts the description (full text) and locale (language code like &lt;code&gt;'en'&lt;/code&gt;, &lt;code&gt;'ja'&lt;/code&gt;, &lt;code&gt;'fr'&lt;/code&gt;) from the response. Note that &lt;code&gt;desc&lt;/code&gt; is the same value as &lt;code&gt;text_data&lt;/code&gt; — the script extracts it again for clarity of variable naming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;locale&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;translated_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# TBD: According to the target language pass the description data to the translation API
&lt;/span&gt;            &lt;span class="n"&gt;translation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;
            &lt;span class="n"&gt;translated_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translation&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;translatedText&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 77-83 (TBD #3):&lt;/strong&gt; &lt;strong&gt;Translation logic.&lt;/strong&gt; If the locale is empty (no language detected), the original text is used as-is. Otherwise, the text is sent to the Translation API with &lt;code&gt;target_language='ja'&lt;/code&gt; (Japanese). The API returns a dictionary; the translated text is in the &lt;code&gt;'translatedText'&lt;/code&gt; key.&lt;br&gt;
&lt;/p&gt;
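
&lt;p&gt;To make the shape of that dictionary concrete, here is a small standalone sketch (the sample text and target language are placeholders, not the lab's values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import translate_v2

translate_client = translate_v2.Client()

# translate() accepts a string (or list of strings) and returns a dict per input.
translation = translate_client.translate('Bonjour le monde', target_language='en')

print(translation['translatedText'])          # the translated string
print(translation['detectedSourceLanguage'])  # e.g. 'fr' when no source language is given
print(translation['input'])                   # the original text that was sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;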

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 84:&lt;/strong&gt; Prints the translated (or original) text to the console so you can monitor progress.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 88-89:&lt;/strong&gt; &lt;strong&gt;Collect results.&lt;/strong&gt; If the Vision API found any text (safety check), appends a tuple with the original text, locale, translated text, and filename to the results buffer. This tuple matches the BigQuery table schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Writing Vision API image data to BigQuery...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU UNCOMMENT THIS
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 91-93:&lt;/strong&gt; &lt;strong&gt;BigQuery upload.&lt;/strong&gt; After all images are processed, uses &lt;code&gt;insert_rows()&lt;/code&gt; to perform a streaming insert of all collected rows into the BigQuery table. The &lt;code&gt;assert&lt;/code&gt; verifies that no errors occurred — if any row failed to insert, the script crashes with an &lt;code&gt;AssertionError&lt;/code&gt;.&lt;/p&gt;
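
&lt;p&gt;If you prefer a friendlier failure mode than the bare &lt;code&gt;assert&lt;/code&gt;, a sketch that reports what went wrong (reusing the script's &lt;code&gt;bq_client&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, and &lt;code&gt;rows_for_bq&lt;/code&gt; variables):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)
if errors:
    # insert_rows() returns one error dictionary per rejected row.
    for row_error in errors:
        print('Row failed to insert: ' + str(row_error))
else:
    print('Inserted ' + str(len(rows_for_bq)) + ' rows into BigQuery')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;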




&lt;h2&gt;
  
  
  Task 5: Validate with BigQuery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run the Verification Query
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;BigQuery&lt;/strong&gt; in the Console or use the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bq query &lt;span class="nt"&gt;--use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT locale, COUNT(locale) as lcount FROM image_classification_dataset.image_text_detail GROUP BY locale ORDER BY lcount DESC'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a breakdown of language codes (e.g., &lt;code&gt;ja&lt;/code&gt;, &lt;code&gt;en&lt;/code&gt;, &lt;code&gt;fr&lt;/code&gt;, &lt;code&gt;de&lt;/code&gt;) with their counts. This confirms the full pipeline worked end-to-end.&lt;/p&gt;
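
&lt;p&gt;You can run the same check from Python with the BigQuery client library; a sketch with a placeholder project ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

bq_client = bigquery.Client(project='your-project-id')

query = (
    'SELECT locale, COUNT(locale) AS lcount '
    'FROM image_classification_dataset.image_text_detail '
    'GROUP BY locale ORDER BY lcount DESC'
)

# Rows come back with attribute access matching the column aliases.
for row in bq_client.query(query).result():
    print(row.locale, row.lcount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;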




&lt;h2&gt;
  
  
  Quick Reference — All Commands in Order
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 1: Create service account + bind roles&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;

gcloud iam service-accounts create my-ml-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ML API Service Account"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/bigquery.dataEditor"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 2: Create credentials + set env var&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
gcloud iam service-accounts keys create ml-sa-key.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--iam-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-ml-sa@&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.iam.gserviceaccount.com

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/ml-sa-key.json

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 3 &amp;amp; 4: Copy and modify the script&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;gs://&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/analyze-images-v2.py &lt;span class="nb"&gt;.&lt;/span&gt;
nano analyze-images-v2.py

&lt;span class="c"&gt;# --- Inside nano, make these 4 edits: ---&lt;/span&gt;
&lt;span class="c"&gt;# 1. After "TBD: Create a Vision API image object":&lt;/span&gt;
&lt;span class="c"&gt;#        image_object = vision.Image(content=file_content)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 2. After "TBD: Detect text in the image":&lt;/span&gt;
&lt;span class="c"&gt;#        response = vision_client.document_text_detection(image=image_object)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 3. After "TBD: According to the target language":&lt;/span&gt;
&lt;span class="c"&gt;#        translation = translate_client.translate(desc, target_language='ja')&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 4. Uncomment the last line:&lt;/span&gt;
&lt;span class="c"&gt;#        errors = bq_client.insert_rows(table, rows_for_bq)&lt;/span&gt;
&lt;span class="c"&gt;# --- Save with Ctrl+O, Enter, Ctrl+X ---&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 5: Run script and validate&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;

bq query &lt;span class="nt"&gt;--use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT locale, COUNT(locale) as lcount FROM image_classification_dataset.image_text_detail GROUP BY locale ORDER BY lcount DESC'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; on BigQuery or API calls&lt;/td&gt;
&lt;td&gt;Add the missing role: &lt;code&gt;gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:my-ml-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/serviceusage.serviceUsageConsumer"&lt;/code&gt; — wait 1-2 min for propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;403 ACCESS_DENIED&lt;/code&gt; on Cloud Storage&lt;/td&gt;
&lt;td&gt;You may have used &lt;code&gt;roles/storage.admin&lt;/code&gt; instead of &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;. Fix: bind the correct role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;PERMISSION_DENIED&lt;/code&gt; on Vision/Translate API calls&lt;/td&gt;
&lt;td&gt;Enable the APIs: &lt;code&gt;gcloud services enable vision.googleapis.com translate.googleapis.com&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;PERMISSION_DENIED&lt;/code&gt; on BigQuery&lt;/td&gt;
&lt;td&gt;Verify the &lt;code&gt;dataEditor&lt;/code&gt; role was bound correctly; wait 1-2 minutes for IAM propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ModuleNotFoundError&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Install packages: &lt;code&gt;pip3 install google-cloud-vision google-cloud-translate google-cloud-bigquery google-cloud-storage google-cloud-language&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credentials file error&lt;/td&gt;
&lt;td&gt;Verify: &lt;code&gt;echo $GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; and &lt;code&gt;ls -la ml-sa-key.json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'image_object' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #1 is missing — add &lt;code&gt;image_object = vision.Image(content=file_content)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'response' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #2 is missing — add the &lt;code&gt;vision_client.document_text_detection()&lt;/code&gt; call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'translation' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #3 is missing — add the &lt;code&gt;translate_client.translate()&lt;/code&gt; call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty BigQuery table&lt;/td&gt;
&lt;td&gt;Confirm you uncommented &lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;AssertionError&lt;/code&gt; on &lt;code&gt;assert errors == []&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Check that the BigQuery table &lt;code&gt;image_text_detail&lt;/code&gt; exists in dataset &lt;code&gt;image_classification_dataset&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Script argument error&lt;/td&gt;
&lt;td&gt;Ensure you pass both arguments: &lt;code&gt;python3 analyze-images-v2.py $PROJECT_ID $PROJECT_ID&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service accounts&lt;/strong&gt; are the standard way to provide application-level credentials in GCP. Each service account can have granular IAM roles scoped to specific services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;&lt;/strong&gt; is the universal environment variable that all Google Cloud client libraries check for authentication.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Vision API&lt;/strong&gt; requires an &lt;code&gt;Image&lt;/code&gt; object created from raw bytes — you can't pass the bytes directly to the detection method.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Vision API's &lt;code&gt;document_text_detection&lt;/code&gt;&lt;/strong&gt; returns a structured response where the first element in &lt;code&gt;text_annotations&lt;/code&gt; contains the full detected text and its locale.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Translation API's &lt;code&gt;translate()&lt;/code&gt; method&lt;/strong&gt; returns a dictionary with &lt;code&gt;translatedText&lt;/code&gt;, &lt;code&gt;detectedSourceLanguage&lt;/code&gt;, and &lt;code&gt;input&lt;/code&gt; keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery's &lt;code&gt;insert_rows()&lt;/code&gt;&lt;/strong&gt; performs streaming inserts and returns an empty list on success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always read the existing code&lt;/strong&gt; before modifying — variable names like &lt;code&gt;vision_client&lt;/code&gt;, &lt;code&gt;desc&lt;/code&gt;, and &lt;code&gt;image_object&lt;/code&gt; are defined by the script and must be matched exactly in the lines you add.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;&lt;/strong&gt; instead of &lt;code&gt;roles/storage.admin&lt;/code&gt; — it grants object-level read/write/delete without unnecessary bucket-level management permissions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Principle of least privilege&lt;/strong&gt;: Only grant the roles your service account actually needs (&lt;code&gt;dataEditor&lt;/code&gt; for BigQuery writes, &lt;code&gt;storage.objectAdmin&lt;/code&gt; for GCS object access, &lt;code&gt;serviceUsageConsumer&lt;/code&gt; for API consumption).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test incrementally&lt;/strong&gt;: Run the script after each modification to catch errors early rather than debugging everything at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables for credentials&lt;/strong&gt;: Never hard-code paths to credential files in your scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the existing code carefully&lt;/strong&gt;: Variable names matter — using &lt;code&gt;vision_client&lt;/code&gt; vs &lt;code&gt;client&lt;/code&gt; or &lt;code&gt;desc&lt;/code&gt; vs &lt;code&gt;text&lt;/code&gt; can cause &lt;code&gt;NameError&lt;/code&gt; exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;document_text_detection&lt;/code&gt; over &lt;code&gt;text_detection&lt;/code&gt;&lt;/strong&gt; when dealing with dense text in images — it uses an OCR model optimized for dense, document-style text (see the short comparison sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
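
&lt;p&gt;Both OCR entry points are called the same way; only the method name (and the underlying model) changes. A short comparison sketch, using a placeholder local image rather than a file from the lab bucket:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import vision

vision_client = vision.ImageAnnotatorClient()

# 'sign1.jpg' is a placeholder local image.
with open('sign1.jpg', 'rb') as image_file:
    image_object = vision.Image(content=image_file.read())

# Sparse text (street signs, labels): the basic OCR model.
sparse_response = vision_client.text_detection(image=image_object)

# Dense text (documents, menus): the document-optimized OCR model used in this lab.
dense_response = vision_client.document_text_detection(image=image_object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;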




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This challenge lab walks you through a realistic ML pipeline pattern: ingest raw data (images), enrich it using ML APIs (Vision + Translation), and store structured results for analysis (BigQuery). These same building blocks — Cloud Storage for data lake, ML APIs for enrichment, BigQuery for analytics — appear in production architectures across industries. Mastering this flow gives you a solid foundation for building more complex ML data pipelines on Google Cloud.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>googleaichallenge</category>
    </item>
  </channel>
</rss>
