<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: William Schnaider Torres Bermon</title>
    <description>The latest articles on DEV Community by William Schnaider Torres Bermon (@willtorber).</description>
    <link>https://dev.to/willtorber</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3728365%2F69cdea6c-28ad-4266-9b7d-0e3dc79a8910.jpg</url>
      <title>DEV Community: William Schnaider Torres Bermon</title>
      <link>https://dev.to/willtorber</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/willtorber"/>
    <language>en</language>
    <item>
      <title>Spec Kit vs BMAD vs OpenSpec: Choosing an SDD Framework in 2026</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 05:08:44 +0000</pubDate>
      <link>https://dev.to/willtorber/spec-kit-vs-bmad-vs-openspec-choosing-an-sdd-framework-in-2026-d3j</link>
      <guid>https://dev.to/willtorber/spec-kit-vs-bmad-vs-openspec-choosing-an-sdd-framework-in-2026-d3j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2srl7y9v81kjdjzqw10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2srl7y9v81kjdjzqw10.png" alt="Spec Kit vs BMAD vs OpenSpec" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the AI writes the code, the spec is the artifact. That's the entire thesis. Everything else is tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Pick based on your codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing codebase, adding features&lt;/strong&gt; → &lt;strong&gt;OpenSpec&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New project from scratch&lt;/strong&gt; → &lt;strong&gt;Spec Kit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance, audit trails, regulated&lt;/strong&gt; → &lt;strong&gt;BMAD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsure?&lt;/strong&gt; → &lt;strong&gt;OpenSpec.&lt;/strong&gt; It tends to minimize adoption friction compared to the others, works on both greenfield and brownfield, and won't lock you in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that's all you needed, stop here. The rest is the reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclosure
&lt;/h2&gt;

&lt;p&gt;I haven't run all three of these in production. This is structural analysis: docs, case studies, design choices, and community reports — not a veteran's field guide. I'll flag where I'm extrapolating versus citing something documented. If you've shipped with any of these, your experience outranks this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SDD actually is (and isn't)
&lt;/h2&gt;

&lt;p&gt;Spec-Driven Development isn't a 2025 invention. BDD, formal requirements docs, ICDs — all versions of the same idea. What changed is that LLMs turned natural-language specs into something you can execute. A Markdown file plus Claude or GPT produces working code. No custom DSL, no code generator, no parser.&lt;/p&gt;

&lt;p&gt;The workflow, across all frameworks, is roughly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Constitution&lt;/strong&gt; — standards that apply to every change (tests, stack, security).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specification&lt;/strong&gt; — &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design&lt;/strong&gt; — &lt;em&gt;how&lt;/em&gt;, architecture decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; — ordered implementation units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt; — the agent executes; you review.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1–4 used to fit in a three-line Jira ticket because writing them properly cost more than the code itself. That calculation flipped. AI generates a draft spec in minutes. But "draft" is doing work in that sentence — catching missing edge cases, validating assumptions, and detecting hallucinations still costs real human time. LLMs collapsed the cost of &lt;em&gt;drafting&lt;/em&gt;, not the cost of &lt;em&gt;quality&lt;/em&gt;. The difference matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The economic shift
&lt;/h2&gt;

&lt;p&gt;Old pattern: planning is compressed. Tickets are thin. The real spec is in the developer's head, in Confluence pages nobody updates, in Slack threads from two sprints ago. Code is the expensive part, so you optimize for coding time.&lt;/p&gt;

&lt;p&gt;New pattern: code is cheap. AI writes it. The expensive thing is now &lt;em&gt;intent&lt;/em&gt; — making sure the AI builds what you actually need. Suddenly an exhaustive spec with acceptance criteria, Gherkin scenarios, error-handling sections, and architectural constraints is worth producing because the AI uses it and the cost of generating the draft is trivial.&lt;/p&gt;

&lt;p&gt;Spec is the source of truth. Code is the build output. That's the inversion. The frameworks below are different implementations of the same idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frameworks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spec Kit (GitHub)
&lt;/h3&gt;

&lt;p&gt;GitHub's open-source SDD &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;toolkit&lt;/a&gt;, with 90K+ GitHub stars at the time of writing; the CLI is called &lt;code&gt;specify&lt;/code&gt;. It integrates with a broad range of AI coding agents (the project lists 30+), including GitHub Copilot, Claude Code, Cursor, and Gemini CLI.&lt;/p&gt;

&lt;p&gt;The workflow uses slash commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/speckit.constitution → project principles
/speckit.specify      → feature specification
/speckit.plan         → technical design
/speckit.tasks        → implementation breakdown
/speckit.implement    → agent executes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;constitution.md&lt;/code&gt; is the piece worth understanding. It's not just a rules file — it's the document every subsequent spec references. Your testing strategy, your security posture, your stack constraints, your error-handling conventions. Write it well once and it multiplies across every feature. Write it badly and you get exactly the chaos documented below.&lt;/p&gt;
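&lt;p&gt;To make that concrete, here's the kind of entry that earns its keep. This is an illustrative excerpt with an invented stack, not Spec Kit's official template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# constitution.md (excerpt)

## Testing
Every feature ships with integration tests against a real Postgres
instance (testcontainers). Mock only external HTTP services.

## Stack
Python 3.12, FastAPI, SQLAlchemy. No new runtime dependencies without
a line here explaining why.

## Security
All user input is validated at the request boundary with Pydantic
models. Secrets come from the environment, never from committed files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
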

&lt;p&gt;Spec Kit is greenfield-optimized. Its branch-per-spec model treats specs as change artifacts, not long-lived capability contracts. On a mature codebase, that means every feature starts with reverse-engineering and the artifacts don't compound into system-level documentation. Microsoft Learn now has a brownfield module for Spec Kit, and presets help, but the underlying model is still change-scoped. If your codebase is 3 years old and you want specs that describe the &lt;em&gt;system&lt;/em&gt;, not just the &lt;em&gt;next PR&lt;/em&gt;, this is a friction point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Python 3.11+ and &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;. Pin a release tag for stability (check &lt;a href="https://github.com/github/spec-kit/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; for the latest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;specify-cli &lt;span class="nt"&gt;--from&lt;/span&gt; git+https://github.com/github/spec-kit.git@v0.7.2
specify init my-project &lt;span class="nt"&gt;--ai&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  BMAD-METHOD
&lt;/h3&gt;

&lt;p&gt;BMAD ("Breakthrough Method for Agile AI-Driven Development") is a different animal. It's a multi-agent &lt;a href="https://github.com/bmad-code-org/BMAD-METHOD" rel="noopener noreferrer"&gt;framework&lt;/a&gt; with 43K+ stars at the time of writing — 12+ AI personas (Analyst, PM, Architect, Scrum Master, Developer, QA, UX Designer...) modeled as Markdown "Agent-as-Code" files. v6 hit stable recently after an extended alpha, with features like Scale Adaptive workflows, BMad-CORE engine, and a builder toolkit for custom agents.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyst → PM (PRD) → Architect → Scrum Master (stories) → Developer → QA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each handoff is a versioned artifact. Audit trail out of the box. Every decision is traceable from requirement to PR.&lt;/p&gt;

&lt;p&gt;That structure is impressive when your deployment target is a SOC 2 audit, a consulting deliverable, or a multi-team platform. For a two-person startup, it's a trap. Here's why: BMAD is a process multiplier, not a process creator. If your team already thinks in PRDs, architecture docs, and sprint stories, BMAD will accelerate that and make it auditable. If your team doesn't have structured processes, BMAD won't conjure them — it'll reproduce your chaos across seven agents and you'll spend more time debugging agent coordination than writing code.&lt;/p&gt;

&lt;p&gt;Concrete costs people forget about: more agents means more tokens per cycle. Handoff failures between personas are a real debugging surface. And when the Architect agent makes an assumption the PM agent didn't document, the Scrum Master propagates it into stories, and the Developer implements it confidently. You find out in QA, or worse, in production. The pipeline is only as good as the weakest handoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Node.js v20+. The interactive installer handles module selection and IDE-specific file generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx bmad-method &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenSpec (Fission-AI)
&lt;/h3&gt;

&lt;p&gt;Lightweight, brownfield-first SDD. npm package (&lt;a href="https://www.npmjs.com/package/@fission-ai/openspec" rel="noopener noreferrer"&gt;&lt;code&gt;@fission-ai/openspec&lt;/code&gt;&lt;/a&gt;, 77K+ downloads). &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. Works with 25+ AI assistants via slash commands and an &lt;code&gt;AGENTS.md&lt;/code&gt; file that acts as a "README for robots."&lt;/p&gt;

&lt;p&gt;OpenSpec's core idea is change-centric specs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openspec/
  project.md                  ← project context
  specs/                      ← current system behavior
  changes/
    add-dark-mode/
      proposal.md             ← what's changing and why
      design.md               ← technical approach
      tasks.md                ← checklist
      specs/                  ← deltas: ADDED / MODIFIED / REMOVED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delta markers are the thing that makes this work for existing codebases. You're forced to categorize every change as ADDED, MODIFIED, or REMOVED relative to what exists. That discipline prevents the agent from hallucinating new requirements onto existing behavior. When the change ships, the deltas merge into the main specs, so your system-level documentation compounds over time. That's the right model for brownfield, and it's a model Spec Kit doesn't natively emphasize.&lt;/p&gt;
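&lt;p&gt;A delta file stays small. For the &lt;code&gt;add-dark-mode&lt;/code&gt; change above it might look roughly like this (my sketch, not OpenSpec's verbatim template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# changes/add-dark-mode/specs/settings/spec.md

## ADDED Requirements
### Requirement: Theme preference
The system persists a per-user theme preference (light or dark) and
applies it on every page load.

## MODIFIED Requirements
### Requirement: Settings page
The settings page now includes a theme toggle next to the existing
notification controls.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
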

&lt;p&gt;Limitations are real: specs don't self-update during implementation. If the agent drifts (and it will — more on that below), you resync manually. There's no multi-agent orchestration; a single agent runs the whole flow. And for simple tasks — a bug fix, a copy change — the overhead of a full proposal-design-tasks cycle can feel like performing surgery with a forklift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting started.&lt;/strong&gt; Requires Node.js &amp;gt;= 20.19.0. Install globally and initialize inside your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @fission-ai/openspec
openspec init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  A different category: SDD as a product, not a framework
&lt;/h3&gt;

&lt;p&gt;The three frameworks above are CLI tools you bolt onto your existing editor. There's another approach: products that bake SDD directly into their own environment. Two worth tracking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kiro&lt;/strong&gt; (AWS) is a VS Code fork with spec-driven development built into the IDE itself. You describe a feature, Kiro generates requirements in EARS notation, produces a technical design, and breaks it into trackable tasks — all inside the editor, no CLI involved. Powered by Claude Sonnet via Amazon Bedrock, $20/month. If you're AWS-native and want the tightest possible integration between specs and implementation, Kiro removes the seams. The tradeoff is vendor lock: you adopt their IDE, their model pipeline, their ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augment Intent&lt;/strong&gt; is a standalone desktop app (Mac, public beta as of early 2026) built around "living specs" — specifications that update themselves as agents work, solving the drift problem the CLI frameworks leave manual. Intent uses a coordinator/specialist/verifier agent architecture where multiple agents execute in parallel on isolated git worktrees, all sharing the same evolving spec. Pricing is credit-based ($20–200/month depending on tier), and it supports BYOA (Bring Your Own Agent — Claude Code, Codex, OpenCode) alongside Augment's own agents. The living-spec approach is the most interesting architectural bet in this space right now: if it works reliably, it makes the manual reconciliation step described later in this article unnecessary. It's still beta, though, and independent production validation is limited.&lt;/p&gt;

&lt;p&gt;These aren't competitors to Spec Kit, BMAD, or OpenSpec — they're a different layer. A CLI framework gives you a spec workflow inside the tools you already use. Kiro and Intent ask you to move into their environment. Whether that trade is worth it depends on how much friction you're willing to accept for tighter integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roll your own
&lt;/h3&gt;

&lt;p&gt;If you already have project context documented (stack, standards, workflows), four custom slash commands and an &lt;code&gt;AGENTS.md&lt;/code&gt; get you surprisingly far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/commands/
  plan-feature.md       → produce spec + design from intent
  break-into-tasks.md   → decompose spec into tasks
  implement.md          → execute one task within your conventions
  review-spec.md        → critique the spec for gaps
AGENTS.md               → project rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 80% of what the established frameworks do, shaped to your workflow. The other 20% — Spec Kit's presets, OpenSpec's delta markers, BMAD's agent handoffs — is the reason people use frameworks. Start custom if your workflow is idiosyncratic enough that the frameworks fight you. Otherwise, pick a framework and extend it.&lt;/p&gt;
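&lt;p&gt;If you go this route, the &lt;code&gt;AGENTS.md&lt;/code&gt; does most of the work. A sketch of what goes in it (the stack and rules here are invented for the example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AGENTS.md

## Project
Python 3.12 / FastAPI monolith, Postgres, deployed from main.

## Rules
- Run pytest before declaring a task done.
- Every new endpoint gets an entry in docs/api.md.
- Never edit generated files under src/generated/.

## Workflow
Plan with /plan-feature, run /review-spec on the draft, split with
/break-into-tasks, implement one task per commit with /implement.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
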

&lt;h2&gt;
  
  
  The spec drift problem (and what to do about it)
&lt;/h2&gt;

&lt;p&gt;This gets its own section because it's the single most common failure mode and none of the current frameworks handle it well.&lt;/p&gt;

&lt;p&gt;Here's what happens: you write a spec. The agent starts implementing. Partway through task 3 of 8, the agent encounters something the spec didn't anticipate — a library API that doesn't work as expected, a database constraint that forces a different approach, an edge case the spec didn't cover. The agent adapts. It writes working code that solves the real problem. But the spec still describes the &lt;em&gt;planned&lt;/em&gt; approach, not the &lt;em&gt;actual&lt;/em&gt; one.&lt;/p&gt;

&lt;p&gt;Now the spec is fiction. The next engineer who reads it (or the next agent that uses it as context) gets misled. As &lt;a href="https://www.augmentcode.com/blog/what-spec-driven-development-gets-wrong" rel="noopener noreferrer"&gt;Amelia Wattenberger&lt;/a&gt; put it: a stale design doc misleads the next engineer who happens to read it; a stale spec misleads agents that don't know any better, and they'll execute a plan that no longer matches reality without flagging anything wrong.&lt;/p&gt;

&lt;p&gt;This isn't a corner case. It's the default behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do about it.&lt;/strong&gt; There's no automation that fully solves this today. The practical approach is a post-implementation reconciliation step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the agent finishes implementation, run a comparison pass: "Read the spec. Read the code. List every place where they diverge."&lt;/li&gt;
&lt;li&gt;For each divergence, decide: was the agent's adaptation correct? If yes, update the spec. If no, fix the code.&lt;/li&gt;
&lt;li&gt;Commit the updated spec alongside the code diff.&lt;/li&gt;
&lt;/ol&gt;
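&lt;p&gt;The comparison pass works best as one explicit prompt rather than an open-ended "review this". Something along these lines (wording is mine, not from any framework):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read openspec/changes/add-dark-mode/design.md and tasks.md.
Read the diff on this branch.
List every place where the implementation diverges from the spec:
file, spec section, what the spec says, what the code does.
Do not fix anything yet. Output the list only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
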

&lt;p&gt;OpenSpec has &lt;code&gt;/opsx:sync&lt;/code&gt; for this. Spec Kit recently added a drift reconciliation extension (&lt;code&gt;/speckit.reconcile&lt;/code&gt;). In BMAD, you'd do it manually via a QA agent review. None of these are automatic — you have to trigger them, and you have to review the output. That's overhead, and it's the overhead that most teams skip until their specs are six months out of date.&lt;/p&gt;

&lt;p&gt;The emerging approach — what Augment Intent is built around — is bidirectional spec updates: agents write changes back to the spec as they work. That closes the loop in theory. Whether it holds up reliably across complex codebases is the open question, and it's the single biggest feature gap separating the CLI frameworks from the next generation of SDD tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the constitution fails: a real example
&lt;/h2&gt;

&lt;p&gt;EPAM published a detailed &lt;a href="https://www.epam.com/insights/ai/blogs/using-spec-kit-for-brownfield-codebase" rel="noopener noreferrer"&gt;case study&lt;/a&gt; of using Spec Kit on a brownfield codebase. One finding stands out. Their &lt;code&gt;constitution.md&lt;/code&gt; contained an explicit rule: &lt;strong&gt;"NO try-catch blocks in route handlers — use global middleware."&lt;/strong&gt; The rule was unambiguous. The agent ignored it and added try-catch blocks in route handlers anyway.&lt;/p&gt;

&lt;p&gt;This isn't a Spec Kit bug. It's a model behavior issue: the agent was pattern-matching against what it had seen in millions of codebases where try-catch in handlers is the norm, and the constitution's single-line prohibition wasn't enough to override that prior. The fix was obvious in hindsight — reinforce the rule in the constitution with context explaining &lt;em&gt;why&lt;/em&gt; (middleware-based error handling enables centralized logging and consistent error responses), not just &lt;em&gt;what&lt;/em&gt;. Models follow "why" better than "don't."&lt;/p&gt;

&lt;p&gt;The deeper lesson: a constitution isn't a config file. Writing "don't do X" isn't enough. You need "don't do X because Y, and instead do Z." The constitution that works is the one written as if you were onboarding a smart but literal-minded junior developer who has never seen your codebase. Because that's exactly what the agent is.&lt;/p&gt;
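&lt;p&gt;In practice that means rewriting prohibitions as rationale plus alternative. For the rule above, something like this (my wording, not EPAM's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Error handling
Route handlers MUST NOT contain try-catch blocks.
Why: errors are handled by the global error middleware so that logging
stays centralized and error responses stay consistent across the API.
Instead: let exceptions propagate. If a handler needs a domain-specific
error, throw a typed error and map it to a response in the middleware.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
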

&lt;h2&gt;
  
  
  Mistakes that will cost you a sprint
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SDD as waterfall.&lt;/strong&gt; Gojko Adzic flagged this when Spec Kit launched. He's right. A 50-page spec you freeze before implementation is not SDD — it's BDUF with Markdown. Specs should change during implementation. The iterative loop is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-page spec with no edge cases.&lt;/strong&gt; Looks thorough. Covers the happy path beautifully. Says nothing about what happens when the input is malformed, the downstream service 500s, or the user's session expires mid-request. The agent implements exactly what's specified. You ship a demo. It breaks the first day in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Green tests, wrong behavior.&lt;/strong&gt; Every acceptance criterion passes. Tests are green. But the solution doesn't actually solve the user's problem. Acceptance criteria are a proxy for intent, not intent itself. Add a "Why this matters" and "Non-goals" section to every spec so the agent stays grounded in the problem, not just the checklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework shopping.&lt;/strong&gt; You can burn a sprint evaluating four frameworks. You will learn nothing that four weeks of actual use on real tickets wouldn't teach you faster. Pick from the TL;DR. Start. Reconsider in a month if you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The cost of drafting specifications has collapsed. Tests, tickets, architecture docs, ADRs — artifacts that used to get skipped because they cost too much time are now cheap to produce in draft form. The review work didn't go away — but the activation energy for producing the document in the first place did. That's the change, and it's permanent regardless of which framework wins.&lt;/p&gt;

&lt;p&gt;One thing worth saying plainly: SDD tooling is still early-stage. Patterns are emerging, not standardized, and most teams are still figuring out what "good" looks like in practice. The frameworks in this article are the best available answers right now — not settled ones.&lt;/p&gt;

&lt;p&gt;If in doubt, start with OpenSpec. Invest an hour in your constitution. Wire up your MCPs so the agent can open PRs, update tickets, and run tests. And when the spec drifts from the code — not if, when — take the thirty minutes to reconcile them. That's where SDD succeeds or fails in practice, and it's the part no framework will do for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've shipped with any of these, the war stories are more useful than the docs. What broke?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>Solving the Gemini API Challenge Lab on Vertex AI: Text, Function Calling &amp; Video Understanding</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 23 Apr 2026 01:43:52 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-the-gemini-api-challenge-lab-on-vertex-ai-text-function-calling-video-understanding-6pn</link>
      <guid>https://dev.to/willtorber/solving-the-gemini-api-challenge-lab-on-vertex-ai-text-function-calling-video-understanding-6pn</guid>
      <description>&lt;p&gt;The "Explore Generative AI with the Gemini API in Vertex AI: Challenge Lab" on Google Cloud Skills Boost throws three Gemini capabilities at you in one sitting: a raw REST call from Cloud Shell, function calling from a Jupyter notebook, and multimodal video analysis. None of it is hard once you know what the verifier is actually checking — but a couple of things are easy to get wrong on the first attempt and the lab gives you almost no feedback when you do.&lt;/p&gt;

&lt;p&gt;This walkthrough is the version of the solution I wish I had read before starting. I'll show you the working code for every task, but more importantly, I'll explain &lt;em&gt;why&lt;/em&gt; each piece works the way it does — including a deep dive into the function-call response object, which is genuinely interesting once you understand it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenge in one paragraph
&lt;/h2&gt;

&lt;p&gt;You're playing the role of a developer at a video-analysis startup. Your job is to prove you can wire up three Gemini features end-to-end: generating text via a direct REST call, declaring a tool that Gemini can decide to invoke, and feeding a video from Cloud Storage into the model so it can describe what it sees. The lab provides a half-finished Jupyter notebook with &lt;code&gt;INSERT&lt;/code&gt; placeholders, and your job is to fill in the blanks.&lt;/p&gt;

&lt;p&gt;The model used throughout is &lt;code&gt;gemini-2.5-flash&lt;/code&gt;, and the notebook uses the new &lt;code&gt;google-genai&lt;/code&gt; SDK (not the legacy &lt;code&gt;vertexai&lt;/code&gt; one — this matters because the class names and import paths are different).&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 1: Text generation via curl from Cloud Shell
&lt;/h2&gt;

&lt;p&gt;The first task is the simplest in concept and the most annoying in practice. You open Cloud Shell, you &lt;code&gt;curl&lt;/code&gt; the Vertex AI endpoint, you ask Gemini why the sky is blue, you get an answer back. Done.&lt;/p&gt;

&lt;p&gt;Except the verifier won't accept your call unless you hit a very specific endpoint. More on that in a moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up the environment
&lt;/h3&gt;

&lt;p&gt;The lab pre-fills these variables for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;qwiklabs-gcp-00-207c94de3534   &lt;span class="c"&gt;# yours will differ&lt;/span&gt;
&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east1
&lt;span class="nv"&gt;API_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-aiplatform&lt;/span&gt;.googleapis.com
&lt;span class="nv"&gt;MODEL_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-flash"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you need to make sure the Vertex AI API is enabled. The lab tells you to do this in the Console, but the CLI is faster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;aiplatform.googleapis.com &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The curl call (with the gotcha)
&lt;/h3&gt;

&lt;p&gt;Here's the part where the lab can quietly waste 20 minutes of your time. The Vertex AI generative endpoints expose two methods: &lt;code&gt;generateContent&lt;/code&gt; (returns one big response) and &lt;code&gt;streamGenerateContent&lt;/code&gt; (returns a stream of chunks). Both work. Both return valid Gemini answers. &lt;strong&gt;Only one of them satisfies the lab verifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The verifier checks for &lt;code&gt;streamGenerateContent&lt;/code&gt;. Use this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud auth print-access-token&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;API_ENDPOINT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/projects/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/locations/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LOCATION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/publishers/google/models/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:streamGenerateContent"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "contents": [
      {
        "role": "user",
        "parts": [
          { "text": "Why is the sky blue?" }
        ]
      }
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a JSON array back where each element contains a &lt;code&gt;candidates[].content.parts[].text&lt;/code&gt; field with text about Rayleigh scattering, you're good. Hit "Check my progress" and Task 1 turns green.&lt;/p&gt;

&lt;p&gt;If you get &lt;code&gt;403 PERMISSION_DENIED&lt;/code&gt;, the API enablement probably hasn't finished propagating yet — wait 30 seconds after enabling and try again. If you get &lt;code&gt;404&lt;/code&gt;, you've got a typo in the region or model name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; the difference between &lt;code&gt;generateContent&lt;/code&gt; and &lt;code&gt;streamGenerateContent&lt;/code&gt; is operational, not semantic. Streaming is what you'd actually want in production for any user-facing chatbot, because it lets the UI display tokens as they arrive instead of making the user stare at a spinner. The lab is implicitly nudging you toward that pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 2: Open the notebook in Vertex AI Workbench
&lt;/h2&gt;

&lt;p&gt;This task has no scoring — it's purely navigational. From the Console: &lt;strong&gt;Navigation menu → Vertex AI → Workbench&lt;/strong&gt;. Find the &lt;code&gt;generative-ai-jupyterlab&lt;/code&gt; instance (it should already be running), click &lt;strong&gt;Open JupyterLab&lt;/strong&gt;, and once the new tab loads, double-click &lt;code&gt;gemini-explorer-challenge.ipynb&lt;/code&gt;. When the kernel selector pops up, pick &lt;strong&gt;Python 3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's it. Now the real work begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 3: Function calling with Gemini
&lt;/h2&gt;

&lt;p&gt;Function calling is the feature that turns Gemini from a chatbot into something that can actually &lt;em&gt;do things&lt;/em&gt; in the world. The idea: you describe a function to the model — its name, what it does, what arguments it takes — and the model decides on its own whether and when to invoke it based on what the user is asking.&lt;/p&gt;

&lt;p&gt;The notebook has four cells to fill in. Let's do them.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 — Load the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.1
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just the model identifier as a string. The new SDK doesn't make you instantiate a model object the way the legacy &lt;code&gt;vertexai&lt;/code&gt; library did — you pass the model name straight into &lt;code&gt;client.models.generate_content()&lt;/code&gt;.&lt;/p&gt;
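&lt;p&gt;For context, a minimal sketch of what that call looks like with the new SDK, assuming a Vertex AI-backed client (the project ID and region are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google import genai

# Placeholders: substitute your own project ID and the lab's region.
client = genai.Client(vertexai=True, project="your-project-id", location="us-east1")

model_id = "gemini-2.5-flash"
response = client.models.generate_content(
    model=model_id,                    # the model name goes in as a plain string
    contents="Why is the sky blue?",   # a bare string is fine for simple prompts
)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
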

&lt;h3&gt;
  
  
  3.2 — Declare the function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.2
&lt;/span&gt;&lt;span class="n"&gt;get_current_weather_func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_current_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the current weather in a given location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;FunctionDeclaration&lt;/code&gt; (already imported at the top of the notebook from &lt;code&gt;google.genai.types&lt;/code&gt;) is how you describe a function to Gemini. Notice that you're not giving it any actual code — you're giving it a &lt;em&gt;schema&lt;/em&gt;. The &lt;code&gt;description&lt;/code&gt; field is critical: this is what Gemini reads to decide whether your function is relevant to the user's prompt. A vague description means the model might not call your function when it should, or might call it when it shouldn't.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;parameters&lt;/code&gt; block is JSON Schema. If your real function took more arguments — say, &lt;code&gt;unit&lt;/code&gt; for Celsius vs Fahrenheit — you'd add them here.&lt;/p&gt;
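&lt;p&gt;As an illustration (a hypothetical extension, not part of the lab's notebook), adding that &lt;code&gt;unit&lt;/code&gt; argument would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.genai.types import FunctionDeclaration

# Hypothetical extension of the lab's declaration: adds a "unit" argument.
get_current_weather_func = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather in a given location",
    parameters={
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. Boston",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit to use in the response",
            },
        },
        "required": ["location"],
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
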

&lt;h3&gt;
  
  
  3.3 — Wrap it in a Tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.3
&lt;/span&gt;&lt;span class="n"&gt;weather_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_current_weather_func&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;Tool&lt;/code&gt; is a container for one or more related function declarations. You could bundle &lt;code&gt;get_current_weather&lt;/code&gt; and &lt;code&gt;get_forecast&lt;/code&gt; and &lt;code&gt;get_historical_weather&lt;/code&gt; into a single tool, and Gemini would pick whichever one fits the user's question.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 — Invoke the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 3.4
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the weather like in Boston?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;weather_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;temperature=0&lt;/code&gt; is important here: when you're asking the model to make a structured decision (call this function with these args), you want it to be deterministic, not creative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoding the response (the interesting part)
&lt;/h3&gt;

&lt;p&gt;Run the cell and you'll see something that looks alarming the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;GenerateContentResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Candidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;avg_logprobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mf"&gt;0.5011326244899205&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FunctionCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;Max&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;Max&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;thought_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\xcb\x01\x01\x8f&lt;/span&gt;&lt;span class="s"&gt;=k_u&lt;/span&gt;&lt;span class="se"&gt;\x91\xe5\x14&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FinishReason&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STOP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STOP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentResponseUsageMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_token_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;text&lt;/code&gt; anywhere in the response. That's not a bug — that's the entire point. Let me unpack what's happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Part&lt;/code&gt; with &lt;code&gt;function_call&lt;/code&gt; instead of &lt;code&gt;text&lt;/code&gt;.&lt;/strong&gt; Normally a &lt;code&gt;Part&lt;/code&gt; carries a &lt;code&gt;text&lt;/code&gt; field with whatever the model wrote. This one carries a &lt;code&gt;function_call&lt;/code&gt; instead. What Gemini is telling you is: &lt;em&gt;"I cannot answer 'what's the weather in Boston' from my training data, but the user gave me a tool called &lt;code&gt;get_current_weather&lt;/code&gt; that can. I'm not going to make up an answer — I'm going to ask the caller to invoke that tool with &lt;code&gt;location='Boston'&lt;/code&gt; and pass me back the result."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;... Max depth ...&amp;gt;&lt;/code&gt; you see is just Python's &lt;code&gt;repr&lt;/code&gt; truncating the output for display. The data is there. If you actually want to read it, do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# "get_current_weather"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# {"location": "Boston"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;thought_signature&lt;/code&gt; (those scary-looking bytes).&lt;/strong&gt; Gemini 2.5 is a &lt;em&gt;thinking model&lt;/em&gt; — it does internal chain-of-thought reasoning before producing output. The &lt;code&gt;thought_signature&lt;/code&gt; is an opaque, signed blob of that reasoning. You don't read it. Its only purpose is to be passed back to Gemini in a follow-up call (the second turn of the function-calling loop, see below) so the model can resume its reasoning without having to re-derive everything from scratch. It's a cache key for the model's internal state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;finish_reason=STOP&lt;/code&gt;.&lt;/strong&gt; The model finished cleanly. Not truncated by token limit, not blocked by a safety filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The token counts.&lt;/strong&gt; This is where Gemini 2.5 gets fun:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prompt_token_count=25&lt;/code&gt;: your prompt plus the function declaration consumed 25 input tokens.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;candidates_token_count=7&lt;/code&gt;: the function call output was 7 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thoughts_token_count=39&lt;/code&gt;: the model spent &lt;strong&gt;39 tokens thinking internally&lt;/strong&gt; before deciding to call the function. This is the cost of the chain-of-thought. You're billed for it, and it's only present on the 2.5 family.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;total_token_count=71&lt;/code&gt;: the sum, which is what hits your bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The full function-calling loop (which the lab doesn't make you complete)
&lt;/h3&gt;

&lt;p&gt;What you just saw is step 2 of a 4-step dance. In a real application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You&lt;/strong&gt; send a prompt plus tool definitions to Gemini.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; returns a &lt;code&gt;function_call&lt;/code&gt; saying which function to invoke and with what args. ← &lt;em&gt;the lab stops here&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You&lt;/strong&gt; actually execute the function — call a real weather API, hit a database, whatever — and send the result back to Gemini as a &lt;code&gt;function_response&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; uses that result to compose a natural-language answer like &lt;em&gt;"It's currently 18°C and partly cloudy in Boston."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The lab only grades you up to step 2 because what's being demonstrated is that the model &lt;em&gt;understands&lt;/em&gt; the tool and knows &lt;em&gt;when&lt;/em&gt; to use it. The actual execution lives in your application code, not in Gemini's responsibilities. Once you grasp this separation of concerns, function calling stops feeling magical and starts feeling like a very natural API contract.&lt;/p&gt;
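&lt;p&gt;If you want to close the loop yourself, here's a minimal sketch of steps 3 and 4, continuing from the notebook's &lt;code&gt;response&lt;/code&gt; object. The weather values are fake stand-ins for a real API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.genai import types

# Step 3: read the function call Gemini asked for and execute it yourself.
fc = response.candidates[0].content.parts[0].function_call
weather = {"temperature_c": 18, "conditions": "partly cloudy"}  # pretend API result

# Package the result as a function_response part.
function_response = types.Part.from_function_response(
    name=fc.name,
    response={"result": weather},
)

# Step 4: send the whole exchange back so Gemini can write the final answer.
# Passing the model's own turn back also carries the thought_signature along.
followup = client.models.generate_content(
    model=model_id,
    contents=[
        types.Content(role="user", parts=[types.Part.from_text(text=prompt)]),
        response.candidates[0].content,
        types.Content(role="user", parts=[function_response]),
    ],
    config=types.GenerateContentConfig(tools=[weather_tool], temperature=0),
)
print(followup.text)  # something like "It's currently 18°C and partly cloudy in Boston."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
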

&lt;h2&gt;
  
  
  Task 4: Describing video contents
&lt;/h2&gt;

&lt;p&gt;Same model, same client, but now you're going to feed it a video file from Cloud Storage and ask it to describe what's in it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 — Load the model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 4.1
&lt;/span&gt;&lt;span class="n"&gt;multimodal_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same model as before. &lt;code&gt;gemini-2.5-flash&lt;/code&gt; is natively multimodal — it doesn't need a separate "vision" or "video" variant. You hand it text, images, audio, or video, and it figures it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 — Generate the description
&lt;/h3&gt;

&lt;p&gt;The notebook has two &lt;code&gt;INSERT&lt;/code&gt; placeholders here, plus you have to recognize that it's expecting a streaming call (the &lt;code&gt;for response in responses:&lt;/code&gt; loop at the bottom is the giveaway).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Task 4.2 Generate a video description
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
What is shown in this video?
Where should I go to see it?
What are the top 5 places in the world that look like this?
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://github-repo/img/gemini/multimodality_usecases_overview/mediterraneansea.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;multimodal_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------Prompt--------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print_multimodal_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;-------Response--------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Part.from_uri&lt;/code&gt; is how you reference Cloud Storage assets.&lt;/strong&gt; You don't download the video to the notebook and base64-encode it — Gemini reads it directly from &lt;code&gt;gs://&lt;/code&gt;. Faster, cheaper, and works for files much larger than what you could comfortably embed inline. The &lt;code&gt;mime_type&lt;/code&gt; is required so the model knows how to decode the bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;contents&lt;/code&gt; is a list mixing text and media.&lt;/strong&gt; You pass &lt;code&gt;[prompt, video]&lt;/code&gt; and the SDK figures out what each element is. You could pass &lt;code&gt;[image, prompt, video, image, prompt]&lt;/code&gt; if you wanted — the model treats it as a sequential multimodal message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;generate_content_stream&lt;/code&gt;, not &lt;code&gt;generate_content&lt;/code&gt;.&lt;/strong&gt; This is the second &lt;code&gt;INSERT&lt;/code&gt; and it's the one most people miss. The &lt;code&gt;for response in responses:&lt;/code&gt; loop at the bottom of the cell only makes sense if &lt;code&gt;responses&lt;/code&gt; is iterable — which it is for the streaming version. If you used the non-streaming &lt;code&gt;generate_content&lt;/code&gt;, you'd get back a single response object and the &lt;code&gt;for&lt;/code&gt; loop would iterate over its attributes and break in confusing ways. The lab's hint is in the comment links: one of them points to the "stream response" docs.&lt;/p&gt;

&lt;p&gt;When you run it, you'll see the video embedded in the notebook and then a streaming description fill in chunk by chunk — turquoise water, rocky cliffs, the Mediterranean — followed by a top-5 list with places like Amalfi, Santorini, the Côte d'Azur, Mallorca, and Croatia's Dalmatian coast.&lt;/p&gt;

&lt;p&gt;Hit "Check my progress" and Task 4 goes green.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key learnings
&lt;/h2&gt;

&lt;p&gt;A few things worth taking away from this lab beyond just passing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;google-genai&lt;/code&gt; SDK is not the old &lt;code&gt;vertexai&lt;/code&gt; SDK.&lt;/strong&gt; If you've used Vertex AI's generative features before, you're probably used to &lt;code&gt;from vertexai.generative_models import GenerativeModel&lt;/code&gt;. That's the legacy path. The new path is &lt;code&gt;from google import genai&lt;/code&gt; plus &lt;code&gt;from google.genai.types import ...&lt;/code&gt;. Class names like &lt;code&gt;FunctionDeclaration&lt;/code&gt;, &lt;code&gt;Tool&lt;/code&gt;, and &lt;code&gt;Part&lt;/code&gt; are similar but live in different modules. Don't mix them — pick one and stick with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function calling is a contract, not an execution.&lt;/strong&gt; Gemini will never actually call your function. It will tell you &lt;em&gt;that you should&lt;/em&gt; call your function, with these args, and then wait for you to pass the result back. The model is the brain; your code is the hands. This separation is what makes function calling safe to deploy in production — you control exactly what the model can and cannot reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking tokens are real and they cost money.&lt;/strong&gt; Gemini 2.5 Flash's &lt;code&gt;thoughts_token_count&lt;/code&gt; is a separate billable line item from input and output tokens. For most prompts it's small, but for complex reasoning tasks it can dominate the bill. If you're cost-optimizing, this is worth measuring.&lt;/p&gt;
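&lt;p&gt;Measuring it is one line against the &lt;code&gt;usage_metadata&lt;/code&gt; shown earlier; a quick sketch using the Task 3 response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Log the thinking cost per call using the usage_metadata fields shown above.
usage = response.usage_metadata
print(
    f"prompt={usage.prompt_token_count} "
    f"output={usage.candidates_token_count} "
    f"thinking={usage.thoughts_token_count} "
    f"total={usage.total_token_count}"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
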

&lt;p&gt;&lt;strong&gt;Multimodal inputs come from Cloud Storage, not from your notebook.&lt;/strong&gt; For anything bigger than a small image, the right pattern is to upload to GCS and reference with &lt;code&gt;Part.from_uri&lt;/code&gt;. This avoids round-tripping bytes through your runtime and is dramatically faster for video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming vs non-streaming is a real choice.&lt;/strong&gt; &lt;code&gt;generateContent&lt;/code&gt; returns a single payload. &lt;code&gt;streamGenerateContent&lt;/code&gt; returns chunks as they're produced. Pick streaming for any user-facing experience and non-streaming for server-to-server batch jobs where latency-to-first-token doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices
&lt;/h2&gt;

&lt;p&gt;A few things I'd do differently in real code compared with what the lab asks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never hard-code the project ID.&lt;/strong&gt; The notebook has &lt;code&gt;PROJECT_ID = "qwiklabs-gcp-..."&lt;/code&gt; because the lab is ephemeral, but in production read it from &lt;code&gt;google.auth.default()&lt;/code&gt; or an environment variable (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write detailed function descriptions.&lt;/strong&gt; "Get the current weather" is fine for a demo. For real tools, describe what the function returns, what units it uses, what error conditions it can surface, and anything else that helps the model decide when to invoke it. The model only sees what you write.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always set &lt;code&gt;temperature=0&lt;/code&gt; for tool calls.&lt;/strong&gt; Creative variation in a function-call decision is almost never what you want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle the multi-turn flow.&lt;/strong&gt; A demo that stops at step 2 of the function-calling loop isn't a real integration. Build out the full round-trip: receive the function call, execute it, send the &lt;code&gt;function_response&lt;/code&gt; back, get the natural-language answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate tool arguments before executing.&lt;/strong&gt; Gemini is good at structured outputs but not perfect. Your function executor should treat the args as untrusted input and validate them against the schema before doing anything destructive.&lt;/li&gt;
&lt;/ul&gt;
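
&lt;p&gt;For the first and last bullets, a minimal sketch (the tool name, the schema checks, and the &lt;code&gt;get_current_weather&lt;/code&gt; implementation are illustrative, not from the lab):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import google.auth

# Resolve the project from the environment instead of hard-coding it
credentials, project_id = google.auth.default()

def run_tool(function_call):
    """Treat model-produced args as untrusted input before executing anything."""
    args = dict(function_call.args)
    if function_call.name != "get_current_weather":
        raise ValueError(f"Unexpected tool: {function_call.name}")
    location = args.get("location")
    if not isinstance(location, str) or not location.strip():
        raise ValueError("Missing or invalid 'location' argument")
    return get_current_weather(location=location)  # your real implementation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;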

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The Gemini API challenge lab covers a small surface area but is a surprisingly good introduction to three patterns you'll use constantly if you build with Vertex AI: direct REST access for quick experiments, function calling for tool-using agents, and multimodal inputs from Cloud Storage. The three things that tripped me up — the &lt;code&gt;streamGenerateContent&lt;/code&gt; requirement in Task 1, the meaning of the function-call response object in Task 3, and the streaming method in Task 4 — are the things worth remembering, because they all reflect how you'd actually use these APIs in production.&lt;/p&gt;

&lt;p&gt;Now go build something with it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>googleaichallenge</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Solving "Analyze and Reason on Multimodal Data with Gemini: Challenge Lab" — A Complete Guide</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Wed, 08 Apr 2026 04:45:09 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-analyze-and-reason-on-multimodal-data-with-gemini-challenge-lab-a-complete-guide-4che</link>
      <guid>https://dev.to/willtorber/solving-analyze-and-reason-on-multimodal-data-with-gemini-challenge-lab-a-complete-guide-4che</guid>
      <description>&lt;p&gt;Multimodal AI is no longer a futuristic concept — it's a practical tool that can analyze text reviews, product images, and podcast audio in a single workflow. In this post, I walk through the &lt;strong&gt;&lt;a href="https://www.skills.google/course_templates/1240/labs/618945?locale=en" rel="noopener noreferrer"&gt;GSP524 Challenge Lab&lt;/a&gt;&lt;/strong&gt; from Google Cloud Skills Boost, where we use the &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; model on Vertex AI to extract actionable marketing insights from three different data modalities for a fictional brand called &lt;strong&gt;Cymbal Direct&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're preparing for this lab or want to understand how multimodal prompting with Gemini actually works in practice, this guide covers every task with the reasoning behind each solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Cymbal Direct has just launched a new line of athletic apparel. Our job is to analyze social media engagement across three channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt; — Customer reviews and social media posts (sentiment, themes, product mentions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt; — Influencer and customer photos (style trends, visual messaging, target audience).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt; — A podcast interview with a Cymbal Direct representative (satisfaction drivers, biases, recommendations).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, we synthesize everything into a comprehensive Markdown report and upload it to Cloud Storage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Environment Setup (Task 1)
&lt;/h2&gt;

&lt;p&gt;The lab provides a pre-configured &lt;strong&gt;Vertex AI Workbench&lt;/strong&gt; instance with a Jupyter notebook (&lt;code&gt;gsp524-challenge.ipynb&lt;/code&gt;). Task 1 has no TODOs — you just run the provided cells to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Google Gen AI SDK (&lt;code&gt;google-genai&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Restart the kernel (important — the new package won't load without this).&lt;/li&gt;
&lt;li&gt;Import all required libraries, including &lt;code&gt;Part&lt;/code&gt;, &lt;code&gt;ThinkingConfig&lt;/code&gt;, and &lt;code&gt;GenerateContentConfig&lt;/code&gt; from &lt;code&gt;google.genai.types&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Initialize the Gen AI client pointing to your lab project.&lt;/li&gt;
&lt;li&gt;Set the model ID to &lt;code&gt;gemini-2.5-flash&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two critical objects are set up here that you'll reuse throughout the lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The client — your gateway to Gemini
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LOCATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The model
&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later, a &lt;code&gt;config&lt;/code&gt; object enables &lt;strong&gt;Gemini thinking&lt;/strong&gt; (extended reasoning) with a dynamic thinking budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;include_thoughts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Dynamic: model decides how much to reason
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;config&lt;/code&gt; is the key difference between a basic call and a deep-reasoning call. You'll use it in every "Deep Dive" section.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 2: Analyzing Customer Reviews (Text)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;The first real challenge is constructing a prompt that tells Gemini exactly what to extract from the raw text data. The reviews are loaded from a file, and we embed them directly into the prompt using an f-string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following customer reviews and social media posts about
Cymbal Direct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s new athletic apparel line. For each review or post:
- Identify the overall sentiment (positive, negative, or neutral).
- Extract key themes and topics discussed, such as product quality,
  fit, style, customer service, and pricing.
- Identify any frequently mentioned product names or specific features.

Provide a structured summary of your findings in Markdown format.

Customer Reviews and Social Media Posts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; The prompt is explicit about the three dimensions we care about (sentiment, themes, product names) and asks for structured Markdown output. Gemini handles the rest — it categorizes each review and surfaces patterns across the dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Dive with Thinking
&lt;/h3&gt;

&lt;p&gt;Now we go deeper. The second prompt asks Gemini to &lt;em&gt;reason&lt;/em&gt; about what's driving sentiment and to role-play as a marketing consultant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following customer reviews and social media posts in detail.
Specifically:
- Identify the main factors driving positive and negative sentiment.
- Assess the overall impact on brand perception.
- Identify three key areas where Cymbal Direct can improve.
- Highlight the three most important takeaways as if presenting to
  the Cymbal Direct marketing team.

Customer Reviews and Social Media Posts:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- This enables thinking mode
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only API-level difference is passing &lt;code&gt;config=config&lt;/code&gt;. But the output is dramatically richer — Gemini shows its chain of thought before delivering the final answer, and the &lt;code&gt;print_thoughts()&lt;/code&gt; helper function separates these for display.&lt;/p&gt;
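
&lt;p&gt;The lab ships that helper for you; a minimal version of the same idea (assuming the &lt;code&gt;google-genai&lt;/code&gt; response shape when &lt;code&gt;include_thoughts=True&lt;/code&gt;) looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def print_thoughts(response):
    # With include_thoughts=True, each part is flagged as reasoning or answer
    for part in response.candidates[0].content.parts:
        if not part.text:
            continue
        label = "THOUGHT" if part.thought else "ANSWER"
        print(f"--- {label} ---\n{part.text}\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;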

&lt;p&gt;The analysis is saved to &lt;code&gt;analysis/text_analysis.md&lt;/code&gt; for use in the final synthesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 3: Analyzing Images (Visual Content)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;Images require a different content structure. Instead of embedding data in the prompt string, we pass a list of &lt;code&gt;Part&lt;/code&gt; objects alongside the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following images of Cymbal Direct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s new athletic apparel line.
For each image:
- Identify the apparel items shown.
- Describe the attributes of each item (color, style, material, branding).
- Identify any prominent style trends or preferences across the images.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;image_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Prompt + list of image Part objects
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key pattern:&lt;/strong&gt; For multimodal content, &lt;code&gt;contents&lt;/code&gt; accepts a list that mixes a text prompt with &lt;code&gt;Part&lt;/code&gt; objects (images, audio, video). Here the prompt comes first, followed by the image parts, which are loaded as bytes and wrapped with &lt;code&gt;Part.from_bytes()&lt;/code&gt;.&lt;/p&gt;
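
&lt;p&gt;The loading step itself is just reading bytes and wrapping them (the folder and file extension here are placeholders for however the lab stages its images):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
from google.genai.types import Part

image_parts = []
for image_file in sorted(Path("images").glob("*.png")):  # hypothetical local folder
    image_parts.append(
        Part.from_bytes(data=image_file.read_bytes(), mime_type="image/png")
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;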

&lt;h3&gt;
  
  
  Reasoning on Image Trends
&lt;/h3&gt;

&lt;p&gt;The deep dive asks Gemini to go beyond description into &lt;em&gt;inference&lt;/em&gt; — hypothesizing about target audience, analyzing visual composition, and comparing to broader fashion trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the images in greater detail:
- Hypothesize about the target audience for each image.
- Analyze how visual elements contribute to the overall message and appeal.
- Compare observed trends with broader athletic wear fashion trends.
- Provide recommendations for future marketing campaigns.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;image_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern: prompt + image parts + thinking config. Results are saved to &lt;code&gt;analysis/image_analysis.md&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 4: Analyzing Audio (Podcast)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Analysis
&lt;/h3&gt;

&lt;p&gt;Audio follows the same multimodal pattern, but uses &lt;code&gt;Part.from_uri()&lt;/code&gt; instead of &lt;code&gt;Part.from_bytes()&lt;/code&gt; since the audio file lives in Cloud Storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Audio part (created in a setup cell)
&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-bucket/media/audio/cymbal_direct_expert_interview.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the following audio recording:
- Transcribe the conversation, identifying different speakers.
- Provide sentiment analysis (positive, negative, neutral opinions).
- Identify key themes (comfort, fit, performance, style, competitor comparisons).
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Audio first, then prompt
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note the order:&lt;/strong&gt; For audio, the &lt;code&gt;audio_part&lt;/code&gt; comes &lt;em&gt;before&lt;/em&gt; the prompt in the contents list. This is a subtle but important detail — Gemini processes the audio first, then applies the prompt instructions to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning on Audio Insights
&lt;/h3&gt;

&lt;p&gt;The deep dive extracts strategic intelligence from the conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Analyze the audio recording in greater detail:
- Reason about overall customer satisfaction.
- Deduce key factors influencing customer perception.
- Develop three data-driven recommendations.
- Identify potential biases or limitations in the audio data.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thinking_mode_prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is particularly interesting because Gemini can identify biases like interviewer framing or selection bias in who was invited to the podcast — something that requires genuine reasoning, not just transcription.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 5: Synthesizing Multimodal Insights
&lt;/h2&gt;

&lt;p&gt;The final task loads all three analysis files and asks Gemini to produce a unified report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;comprehensive_report_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Based on the following combined analysis of text reviews, image analysis,
and audio insights, generate a comprehensive report:
- Summarize overall sentiment across all data modalities.
- Identify key themes and trends in customer feedback.
- Provide insights on style preferences, usage patterns, and behavior.
- Evaluate how audio insights fit with product image and text feedback.
- Offer actionable recommendations for marketing strategy and positioning.

Format the report in well-structured Markdown with clear sections.

Combined Analysis Results:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;all_analysis&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;thinking_model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;comprehensive_report_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After generating the report, it's saved locally and uploaded to Cloud Storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;gcloud&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last step is what the grading system checks, so don't skip it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One API, three modalities.&lt;/strong&gt; The &lt;code&gt;generate_content&lt;/code&gt; method handles text, images, and audio with the same interface — the only difference is how you construct the &lt;code&gt;contents&lt;/code&gt; list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thinking mode is a single config toggle.&lt;/strong&gt; Adding &lt;code&gt;config=config&lt;/code&gt; with &lt;code&gt;include_thoughts=True&lt;/code&gt; transforms a surface-level response into a reasoned analysis. The &lt;code&gt;-1&lt;/code&gt; thinking budget lets the model decide how deep to go based on prompt complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt specificity drives output quality.&lt;/strong&gt; Vague prompts produce vague results. Each prompt in this lab explicitly lists the dimensions to analyze (sentiment, themes, audience, recommendations), and the output quality reflects that precision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content ordering matters for multimodal inputs.&lt;/strong&gt; For images, the prompt comes first followed by image parts. For audio, the audio part comes first. This isn't arbitrary — it affects how the model processes the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaining analyses enables synthesis.&lt;/strong&gt; By saving intermediate results to files and feeding them into a final prompt, we build a pipeline where each modality's insights compound into a richer final report (a sketch of the combining step follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
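
&lt;p&gt;The combining step itself is simple file concatenation (a sketch; the audio file name is an assumption, while the text and image names match the earlier tasks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

analysis_files = [
    "analysis/text_analysis.md",
    "analysis/image_analysis.md",
    "analysis/audio_analysis.md",   # assumed name for the Task 4 output
]
all_analysis = "\n\n".join(Path(f).read_text() for f in analysis_files)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;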




&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always ask for structured output.&lt;/strong&gt; Requesting "Markdown format with clear sections" gives you parseable, presentable results instead of a wall of text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use thinking mode for analysis, skip it for extraction.&lt;/strong&gt; Initial passes (transcription, item identification) don't need extended reasoning. Deep dives (inferring audience, identifying biases, generating recommendations) benefit enormously from it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embed data directly in prompts for text; use Part objects for binary data.&lt;/strong&gt; Text data fits naturally inside f-strings. Images and audio should always go through &lt;code&gt;Part.from_bytes()&lt;/code&gt; or &lt;code&gt;Part.from_uri()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save intermediate results.&lt;/strong&gt; Writing each analysis to a file creates a paper trail and enables the final synthesis step without re-running expensive model calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't forget the upload.&lt;/strong&gt; In challenge labs, the grading system checks Cloud Storage — your analysis could be perfect, but if the file isn't in the bucket, you won't pass.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This challenge lab demonstrates a realistic workflow for multimodal AI analysis: ingest data from different sources, extract structured insights from each, apply deeper reasoning where it matters, and synthesize everything into a decision-ready report. The Gemini 2.5 Flash model on Vertex AI makes this surprisingly straightforward — the same &lt;code&gt;generate_content&lt;/code&gt; call handles text, images, and audio, and the thinking mode adds genuine analytical depth without requiring a different model or API.&lt;/p&gt;

&lt;p&gt;The patterns here — structured prompts, multimodal content lists, thinking configuration, and chained analyses — are directly applicable to real-world use cases like brand monitoring, market research, and content analysis. The hard part isn't the API calls; it's crafting prompts that extract the right insights from the right data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googleaichallenge</category>
      <category>python</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Solving "Use Machine Learning APIs on Google Cloud: Challenge Lab" — A Complete Guide</title>
      <dc:creator>William Schnaider Torres Bermon</dc:creator>
      <pubDate>Thu, 19 Mar 2026 01:41:08 +0000</pubDate>
      <link>https://dev.to/willtorber/solving-use-machine-learning-apis-on-google-cloud-challenge-lab-a-complete-guide-4no6</link>
      <guid>https://dev.to/willtorber/solving-use-machine-learning-apis-on-google-cloud-challenge-lab-a-complete-guide-4no6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This &lt;a href="https://www.skills.google/course_templates/630/labs/612231?locale=en" rel="noopener noreferrer"&gt;challenge&lt;/a&gt; lab tests your ability to build an end-to-end pipeline that extracts text from images using the &lt;strong&gt;Cloud Vision API&lt;/strong&gt;, translates it with the &lt;strong&gt;Cloud Translation API&lt;/strong&gt;, and loads the results into &lt;strong&gt;BigQuery&lt;/strong&gt;. Unlike guided labs, you're expected to fill in the blanks of a partially written Python script and configure IAM permissions yourself.&lt;/p&gt;

&lt;p&gt;Let's walk through every task with clear explanations of &lt;em&gt;why&lt;/em&gt; each step matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Python script reads image files from a &lt;strong&gt;Cloud Storage&lt;/strong&gt; bucket&lt;/li&gt;
&lt;li&gt;Each image is sent to the &lt;strong&gt;Cloud Vision API&lt;/strong&gt; for text detection&lt;/li&gt;
&lt;li&gt;The extracted text is saved back to Cloud Storage as a &lt;code&gt;.txt&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;If the text is &lt;strong&gt;not&lt;/strong&gt; in Japanese (&lt;code&gt;locale != 'ja'&lt;/code&gt;), it's sent to the &lt;strong&gt;Translation API&lt;/strong&gt; to get a Japanese translation&lt;/li&gt;
&lt;li&gt;All results (original text, locale, translation) are uploaded to a &lt;strong&gt;BigQuery&lt;/strong&gt; table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn4p2j9bdb8nyx6y3zh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn4p2j9bdb8nyx6y3zh7.png" alt="Graphic description of the challenge" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 1: Configure a Service Account
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why a Service Account?
&lt;/h3&gt;

&lt;p&gt;The Python script needs programmatic access to Vision API, Translation API, Cloud Storage, and BigQuery. A service account acts as the script's identity, and IAM roles define what it can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set your project ID&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create the service account&lt;/span&gt;
gcloud iam service-accounts create my-ml-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ML API Service Account"&lt;/span&gt;

&lt;span class="c"&gt;# Grant BigQuery Data Editor role (to insert rows)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/bigquery.dataEditor"&lt;/span&gt;

&lt;span class="c"&gt;# Grant Cloud Storage Object Admin role (to read images and write text files)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectAdmin"&lt;/span&gt;

&lt;span class="c"&gt;# Grant Service Usage Consumer role (required to make API calls within the project)&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Without &lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt;, the service account cannot consume any enabled APIs in the project (BigQuery, Vision, Translation, etc.), even if it has data-level roles like &lt;code&gt;dataEditor&lt;/code&gt; or &lt;code&gt;storage.objectAdmin&lt;/code&gt;. This results in a &lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; error.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects get-iam-policy &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flatten&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings[].members"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bindings.members:my-ml-sa@"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;roles/bigquery.dataEditor&lt;/code&gt;, &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;, and &lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt; listed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 2: Create and Download Credentials
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Download a Key?
&lt;/h3&gt;

&lt;p&gt;While Cloud Shell has default credentials for the logged-in user, the challenge explicitly requires you to create a JSON key file and point the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable to it. This simulates how credentials work in production environments outside GCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the JSON key file&lt;/span&gt;
gcloud iam service-accounts keys create ml-sa-key.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--iam-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-ml-sa@&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.iam.gserviceaccount.com

&lt;span class="c"&gt;# Set the environment variable so Google Cloud client libraries find the key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/ml-sa-key.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Task 3: Modify the Script — Vision API Text Detection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Get the Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;gs://&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/analyze-images-v2.py &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to Modify
&lt;/h3&gt;

&lt;p&gt;The script has four sections that need your attention: three &lt;code&gt;# TBD:&lt;/code&gt; comments and one commented-out BigQuery upload line. Open the script with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano analyze-images-v2.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TBD #1 — Create a Vision API image object:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: Create a Vision API image object called image_object
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;image_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an &lt;code&gt;Image&lt;/code&gt; object from the raw bytes downloaded from Cloud Storage (&lt;code&gt;file_content&lt;/code&gt;). The Vision API requires this object format to process images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TBD #2 — Call the Vision API to detect text:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: Detect text in the image and save the response data into an object called response
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;document_text_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends the image to the Vision API's &lt;code&gt;document_text_detection&lt;/code&gt; method, which performs OCR optimized for dense blocks of text and handles the lab's sign images well. Note that the client variable is called &lt;code&gt;vision_client&lt;/code&gt; (as defined earlier in the script), and the image parameter uses the &lt;code&gt;image_object&lt;/code&gt; we just created.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test It
&lt;/h3&gt;

&lt;p&gt;Run the script after completing TBDs #1 and #2 to verify text extraction works before moving on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see extracted text appearing in the console output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 4: Modify the Script — Translation API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What to Modify
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TBD #3 — Translate non-Japanese text to Japanese:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the comment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# TBD: According to the target language pass the description data to the translation API
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add below it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;translation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use &lt;code&gt;desc&lt;/code&gt; (not a generic variable like &lt;code&gt;text&lt;/code&gt;) because that's the variable name the script assigns to the extracted description earlier: &lt;code&gt;desc = response.text_annotations[0].description&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The target language is &lt;code&gt;'ja'&lt;/code&gt; (Japanese) as specified in the lab instructions&lt;/li&gt;
&lt;li&gt;The result is stored in &lt;code&gt;translation&lt;/code&gt;, and the script already accesses &lt;code&gt;translation['translatedText']&lt;/code&gt; on the next line&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enable the BigQuery Upload
&lt;/h2&gt;

&lt;p&gt;At the very end of the script, find the commented-out line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# errors = bq_client.insert_rows(table, rows_for_bq)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove the &lt;code&gt;#&lt;/code&gt; to enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The line immediately after (&lt;code&gt;assert errors == []&lt;/code&gt;) will verify the upload succeeded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complete Modified Script Reference
&lt;/h3&gt;

&lt;p&gt;Here's a summary of all four changes in the script:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location in Script&lt;/th&gt;
&lt;th&gt;What to Add / Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: Create a Vision API image object&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;image_object = vision.Image(content=file_content)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: Detect text in the image&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;response = vision_client.document_text_detection(image=image_object)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;# TBD: According to the target language&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;translation = translate_client.translate(desc, target_language='ja')&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last commented line&lt;/td&gt;
&lt;td&gt;Remove &lt;code&gt;#&lt;/code&gt; from &lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Run the Complete Script
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch the output — you should see text being extracted from each image, locale detection, and Japanese translations for non-Japanese text, followed by "Writing Vision API image data to BigQuery..."&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Python Script (&lt;code&gt;analyze-images-v2.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Before modifying the script, it's important to understand what it does. Here's a general overview followed by a line-by-line breakdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  General Overview
&lt;/h3&gt;

&lt;p&gt;The script is an automated image-processing pipeline. It connects to four Google Cloud services simultaneously: Cloud Storage (to read images and write text files), Vision API (to extract text from images via OCR), Translation API (to translate non-Japanese text into Japanese), and BigQuery (to store the final results in a queryable table).&lt;/p&gt;

&lt;p&gt;The workflow for each image is: download the image bytes from the bucket → send them to the Vision API → save the detected text back to Cloud Storage as a &lt;code&gt;.txt&lt;/code&gt; file → check the language locale → if not Japanese, translate to Japanese → collect all results → batch-upload everything to BigQuery at the end.&lt;/p&gt;
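
&lt;p&gt;Roughly paraphrased (this is not the script's literal code; it just stitches the TBD snippets into the loop they live in, and the text-file naming and row layout are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rows_for_bq = []
bucket = storage_client.bucket(bucket_name)

for blob in storage_client.list_blobs(bucket_name):
    if not blob.name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    file_content = blob.download_as_bytes()

    # OCR the image (TBD #1 and #2)
    image_object = vision.Image(content=file_content)
    response = vision_client.document_text_detection(image=image_object)
    desc = response.text_annotations[0].description
    locale = response.text_annotations[0].locale

    # Save the raw text back to Cloud Storage (naming scheme assumed)
    bucket.blob(blob.name + ".txt").upload_from_string(desc)

    # Translate anything that is not already Japanese (TBD #3)
    if locale != "ja":
        translation = translate_client.translate(desc, target_language="ja")
        translated_text = translation["translatedText"]
    else:
        translated_text = desc

    # Row layout is illustrative; the real table schema drives the order
    rows_for_bq.append((desc, locale, translated_text, blob.name))

# Batch-upload once at the end, then verify
errors = bq_client.insert_rows(table, rows_for_bq)
assert errors == []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;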

&lt;h3&gt;
  
  
  Line-by-Line Breakdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dataset: image_classification_dataset
# Table name: image_text_detail
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 1-4:&lt;/strong&gt; Comments documenting the target BigQuery dataset/table. Imports &lt;code&gt;os&lt;/code&gt; (to read environment variables) and &lt;code&gt;sys&lt;/code&gt; (to read command-line arguments).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translate_v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 7:&lt;/strong&gt; Imports the five Google Cloud client libraries. &lt;code&gt;storage&lt;/code&gt; for Cloud Storage, &lt;code&gt;bigquery&lt;/code&gt; for BigQuery, &lt;code&gt;language&lt;/code&gt; for Natural Language API (not used in this script but imported from the original template), &lt;code&gt;vision&lt;/code&gt; for Vision API, and &lt;code&gt;translate_v2&lt;/code&gt; for the Translation API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
        &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The GOOGLE_APPLICATION_CREDENTIALS file does not exist.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The GOOGLE_APPLICATION_CREDENTIALS environment variable is not defined.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 9-15:&lt;/strong&gt; &lt;strong&gt;Credentials check.&lt;/strong&gt; Verifies two things: (1) the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable is set, and (2) the file it points to actually exists on disk. If either check fails, the script exits immediately with an error message. This is a safety gate — without valid credentials, no API call will work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You must provide parameters for the Google Cloud project ID and Storage bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3 &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[PROJECT_NAME] [BUCKET_NAME]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;project_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 17-23:&lt;/strong&gt; &lt;strong&gt;Argument parsing.&lt;/strong&gt; The script requires two command-line arguments: the GCP project ID and the Cloud Storage bucket name. In this lab, both are the same value (your project ID). If you forget to pass them, the script prints usage instructions and exits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nl_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 26-28:&lt;/strong&gt; &lt;strong&gt;Client initialization (part 1).&lt;/strong&gt; Creates client objects for Cloud Storage, BigQuery (bound to your project), and the Natural Language API. The &lt;code&gt;nl_client&lt;/code&gt; is inherited from the original template but not used in this challenge.&lt;br&gt;
&lt;/p&gt;
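
&lt;p&gt;All of these clients pick up the key automatically from &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;. If you ever need to load it explicitly, for example outside Cloud Shell, a sketch using the google-auth library (key filename from Task 2, project ID is a placeholder) would be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery
from google.oauth2 import service_account

# Load the key file explicitly instead of relying on the environment variable.
credentials = service_account.Credentials.from_service_account_file('ml-sa-key.json')
bq_client = bigquery.Client(project='your-project-id', credentials=credentials)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;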

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vision_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ImageAnnotatorClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;translate_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 31-32:&lt;/strong&gt; &lt;strong&gt;Client initialization (part 2).&lt;/strong&gt; Creates the Vision API client (for text detection) and the Translation API client (for translating text). These are the two ML API clients you'll use in the TBD sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_classification_dataset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_text_detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 35-38:&lt;/strong&gt; &lt;strong&gt;BigQuery table setup.&lt;/strong&gt; Creates a reference chain: dataset name → dataset object → table name → table object. The &lt;code&gt;get_table()&lt;/code&gt; call actually contacts BigQuery to verify the table exists and retrieves its schema. This is where the &lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; error occurs if the service account lacks the &lt;code&gt;serviceUsageConsumer&lt;/code&gt; role.&lt;br&gt;
&lt;/p&gt;
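
&lt;p&gt;If you want that 403 to fail with a clearer message than a raw traceback, one optional approach (not required by the lab) is to wrap the &lt;code&gt;get_table()&lt;/code&gt; call and catch the google-api-core exceptions; the project ID below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.api_core import exceptions
from google.cloud import bigquery

bq_client = bigquery.Client(project='your-project-id')

try:
    table = bq_client.get_table('image_classification_dataset.image_text_detail')
except exceptions.Forbidden as err:
    # The 403 USER_PROJECT_DENIED case: the service account is missing
    # roles/serviceusage.serviceUsageConsumer on the project.
    print('Permission problem reaching BigQuery: ' + str(err))
except exceptions.NotFound:
    print('Dataset or table is missing; the lab pre-creates both.')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;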

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rows_for_bq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 41:&lt;/strong&gt; &lt;strong&gt;Results buffer.&lt;/strong&gt; Initializes an empty list that will accumulate tuples of &lt;code&gt;(description, locale, translated_text, filename)&lt;/code&gt; for each processed image. These get batch-uploaded to BigQuery at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;list_blobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 44-45:&lt;/strong&gt; &lt;strong&gt;Bucket access.&lt;/strong&gt; &lt;code&gt;list_blobs()&lt;/code&gt; returns an iterator over every file (blob) in the bucket. The &lt;code&gt;bucket&lt;/code&gt; object is saved separately because we'll need it later to upload text files.&lt;br&gt;
&lt;/p&gt;
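
&lt;p&gt;Before running the full script, you can sanity-check what the loop will see with a few lines that only list the object names (the bucket name below is a placeholder; in this lab it equals your project ID):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import storage

storage_client = storage.Client()
bucket_name = 'your-project-id'

# Print every object name so you can confirm the sample images are in the bucket.
for blob in storage_client.bucket(bucket_name).list_blobs():
    print(blob.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;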

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing image files from GCS. This will take a few minutes..&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 47:&lt;/strong&gt; Status message so you know the script is working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;  &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_as_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 50-52:&lt;/strong&gt; &lt;strong&gt;Main loop start.&lt;/strong&gt; Iterates over every blob in the bucket, filters for image files (&lt;code&gt;.jpg&lt;/code&gt; or &lt;code&gt;.png&lt;/code&gt;), and downloads the image as raw bytes into &lt;code&gt;file_content&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# TBD: Create a Vision API image object called image_object
&lt;/span&gt;        &lt;span class="n"&gt;image_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 55 (TBD #1):&lt;/strong&gt; Wraps the raw image bytes into a &lt;code&gt;vision.Image&lt;/code&gt; object. The Vision API cannot accept raw bytes directly — it needs this structured object that can hold either image bytes (&lt;code&gt;content&lt;/code&gt;) or a GCS URI (&lt;code&gt;source&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;
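
&lt;p&gt;For reference, the same &lt;code&gt;vision.Image&lt;/code&gt; type can instead point at an object already in Cloud Storage rather than carrying the bytes. The lab downloads the bytes, but the URI form, with a placeholder path, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import vision

# Alternative to content=: reference the object in Cloud Storage by URI.
# The gs:// path below is a placeholder.
image_object = vision.Image(
    source=vision.ImageSource(image_uri='gs://your-bucket/sign1.jpg')
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;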

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# TBD: Detect text in the image and save the response data into an object called response
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;document_text_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 59 (TBD #2):&lt;/strong&gt; Sends the image to the Vision API's &lt;code&gt;document_text_detection&lt;/code&gt; method. This performs OCR (Optical Character Recognition) optimized for dense text. The response contains a list of &lt;code&gt;text_annotations&lt;/code&gt; — the first element holds the full concatenated text and the detected language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;text_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 62:&lt;/strong&gt; Extracts the full detected text from the first annotation. When text is detected, &lt;code&gt;text_annotations&lt;/code&gt; holds the complete text at index &lt;code&gt;[0]&lt;/code&gt;, with individual word-level detections in the subsequent indices; if an image contains no text at all, the list is empty and this line would raise an &lt;code&gt;IndexError&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
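
&lt;p&gt;The lab images all contain text, so indexing &lt;code&gt;[0]&lt;/code&gt; directly is safe here. In your own code you would probably guard against images with no detections and surface per-image API errors; a sketch of that defensive handling, reusing the &lt;code&gt;response&lt;/code&gt; and &lt;code&gt;file&lt;/code&gt; variables from the loop above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        # Defensive handling around the document_text_detection response.
        if response.error.message:
            # Per-image failures are reported here instead of raising an exception.
            print('Vision API error for ' + file.name + ': ' + response.error.message)
        elif not response.text_annotations:
            print('No text detected in ' + file.name)
        else:
            text_data = response.text_annotations[0].description
            locale = response.text_annotations[0].locale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;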

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 65-67:&lt;/strong&gt; &lt;strong&gt;Save text to Cloud Storage.&lt;/strong&gt; Converts the image filename (e.g., &lt;code&gt;sign1.jpg&lt;/code&gt;) to a text filename (&lt;code&gt;sign1.txt&lt;/code&gt;), creates a blob reference, and uploads the extracted text. This creates a text file in the same bucket for each processed image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
        &lt;span class="n"&gt;locale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;locale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 72-73:&lt;/strong&gt; Extracts the description (full text) and locale (language code like &lt;code&gt;'en'&lt;/code&gt;, &lt;code&gt;'ja'&lt;/code&gt;, &lt;code&gt;'fr'&lt;/code&gt;) from the response. Note that &lt;code&gt;desc&lt;/code&gt; is the same value as &lt;code&gt;text_data&lt;/code&gt; — the script extracts it again for clarity of variable naming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;locale&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;translated_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# TBD: According to the target language pass the description data to the translation API
&lt;/span&gt;            &lt;span class="n"&gt;translation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU ADD THIS
&lt;/span&gt;
            &lt;span class="n"&gt;translated_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translation&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;translatedText&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 77-83 (TBD #3):&lt;/strong&gt; &lt;strong&gt;Translation logic.&lt;/strong&gt; If the locale is empty (no language detected), the original text is used as-is. Otherwise, the text is sent to the Translation API with &lt;code&gt;target_language='ja'&lt;/code&gt; (Japanese). The API returns a dictionary; the translated text is in the &lt;code&gt;'translatedText'&lt;/code&gt; key.&lt;br&gt;
&lt;/p&gt;
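
&lt;p&gt;To make the shape of that dictionary concrete, here is a small standalone sketch (the sample text and target language are placeholders, not the lab's values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import translate_v2

translate_client = translate_v2.Client()

# translate() accepts a string (or list of strings) and returns a dict per input.
translation = translate_client.translate('Bonjour le monde', target_language='en')

print(translation['translatedText'])          # the translated string
print(translation['detectedSourceLanguage'])  # e.g. 'fr' when no source language is given
print(translation['input'])                   # the original text that was sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;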

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 84:&lt;/strong&gt; Prints the translated (or original) text to the console so you can monitor progress.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_annotations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 88-89:&lt;/strong&gt; &lt;strong&gt;Collect results.&lt;/strong&gt; If the Vision API found any text (safety check), appends a tuple with the original text, locale, translated text, and filename to the results buffer. This tuple matches the BigQuery table schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Writing Vision API image data to BigQuery...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows_for_bq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← YOU UNCOMMENT THIS
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lines 91-93:&lt;/strong&gt; &lt;strong&gt;BigQuery upload.&lt;/strong&gt; After all images are processed, uses &lt;code&gt;insert_rows()&lt;/code&gt; to perform a streaming insert of all collected rows into the BigQuery table. The &lt;code&gt;assert&lt;/code&gt; verifies that no errors occurred — if any row failed to insert, the script crashes with an &lt;code&gt;AssertionError&lt;/code&gt;.&lt;/p&gt;
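
&lt;p&gt;If you prefer a friendlier failure mode than the bare &lt;code&gt;assert&lt;/code&gt;, a sketch that reports what went wrong (reusing the script's &lt;code&gt;bq_client&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, and &lt;code&gt;rows_for_bq&lt;/code&gt; variables):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)
if errors:
    # insert_rows() returns one error dictionary per rejected row.
    for row_error in errors:
        print('Row failed to insert: ' + str(row_error))
else:
    print('Inserted ' + str(len(rows_for_bq)) + ' rows into BigQuery')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;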




&lt;h2&gt;
  
  
  Task 5: Validate with BigQuery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run the Verification Query
&lt;/h3&gt;

&lt;p&gt;Go to &lt;strong&gt;BigQuery&lt;/strong&gt; in the Console or use the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bq query &lt;span class="nt"&gt;--use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT locale, COUNT(locale) as lcount FROM image_classification_dataset.image_text_detail GROUP BY locale ORDER BY lcount DESC'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a breakdown of language codes (e.g., &lt;code&gt;ja&lt;/code&gt;, &lt;code&gt;en&lt;/code&gt;, &lt;code&gt;fr&lt;/code&gt;, &lt;code&gt;de&lt;/code&gt;) with their counts. This confirms the full pipeline worked end-to-end.&lt;/p&gt;
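
&lt;p&gt;You can run the same check from Python with the BigQuery client library; a sketch with a placeholder project ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

bq_client = bigquery.Client(project='your-project-id')

query = (
    'SELECT locale, COUNT(locale) AS lcount '
    'FROM image_classification_dataset.image_text_detail '
    'GROUP BY locale ORDER BY lcount DESC'
)

# Rows come back with attribute access matching the column aliases.
for row in bq_client.query(query).result():
    print(row.locale, row.lcount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;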




&lt;h2&gt;
  
  
  Quick Reference — All Commands in Order
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 1: Create service account + bind roles&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;

gcloud iam service-accounts create my-ml-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ML API Service Account"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/bigquery.dataEditor"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:my-ml-sa@&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 2: Create credentials + set env var&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
gcloud iam service-accounts keys create ml-sa-key.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--iam-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-ml-sa@&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.iam.gserviceaccount.com

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/ml-sa-key.json

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 3 &amp;amp; 4: Copy and modify the script&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;gs://&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/analyze-images-v2.py &lt;span class="nb"&gt;.&lt;/span&gt;
nano analyze-images-v2.py

&lt;span class="c"&gt;# --- Inside nano, make these 4 edits: ---&lt;/span&gt;
&lt;span class="c"&gt;# 1. After "TBD: Create a Vision API image object":&lt;/span&gt;
&lt;span class="c"&gt;#        image_object = vision.Image(content=file_content)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 2. After "TBD: Detect text in the image":&lt;/span&gt;
&lt;span class="c"&gt;#        response = vision_client.document_text_detection(image=image_object)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 3. After "TBD: According to the target language":&lt;/span&gt;
&lt;span class="c"&gt;#        translation = translate_client.translate(desc, target_language='ja')&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# 4. Uncomment the last line:&lt;/span&gt;
&lt;span class="c"&gt;#        errors = bq_client.insert_rows(table, rows_for_bq)&lt;/span&gt;
&lt;span class="c"&gt;# --- Save with Ctrl+O, Enter, Ctrl+X ---&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TASK 5: Run script and validate&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;
python3 analyze-images-v2.py &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;

bq query &lt;span class="nt"&gt;--use_legacy_sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT locale, COUNT(locale) as lcount FROM image_classification_dataset.image_text_detail GROUP BY locale ORDER BY lcount DESC'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;403 USER_PROJECT_DENIED&lt;/code&gt; on BigQuery or API calls&lt;/td&gt;
&lt;td&gt;Add the missing role: &lt;code&gt;gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:my-ml-sa@${PROJECT_ID}.iam.gserviceaccount.com" --role="roles/serviceusage.serviceUsageConsumer"&lt;/code&gt; — wait 1-2 min for propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;403 ACCESS_DENIED&lt;/code&gt; on Cloud Storage&lt;/td&gt;
&lt;td&gt;You may have used &lt;code&gt;roles/storage.admin&lt;/code&gt; instead of &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;. Fix: bind the correct role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;PERMISSION_DENIED&lt;/code&gt; on Vision/Translate API calls&lt;/td&gt;
&lt;td&gt;Enable the APIs: &lt;code&gt;gcloud services enable vision.googleapis.com translate.googleapis.com&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;PERMISSION_DENIED&lt;/code&gt; on BigQuery&lt;/td&gt;
&lt;td&gt;Verify the &lt;code&gt;dataEditor&lt;/code&gt; role was bound correctly; wait 1-2 minutes for IAM propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ModuleNotFoundError&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Install packages: &lt;code&gt;pip3 install google-cloud-vision google-cloud-translate google-cloud-bigquery google-cloud-storage google-cloud-language&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credentials file error&lt;/td&gt;
&lt;td&gt;Verify: &lt;code&gt;echo $GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; and &lt;code&gt;ls -la ml-sa-key.json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'image_object' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #1 is missing — add &lt;code&gt;image_object = vision.Image(content=file_content)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'response' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #2 is missing — add the &lt;code&gt;vision_client.document_text_detection()&lt;/code&gt; call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NameError: name 'translation' is not defined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TBD #3 is missing — add the &lt;code&gt;translate_client.translate()&lt;/code&gt; call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty BigQuery table&lt;/td&gt;
&lt;td&gt;Confirm you uncommented &lt;code&gt;errors = bq_client.insert_rows(table, rows_for_bq)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;AssertionError&lt;/code&gt; on &lt;code&gt;assert errors == []&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Check that the BigQuery table &lt;code&gt;image_text_detail&lt;/code&gt; exists in dataset &lt;code&gt;image_classification_dataset&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Script argument error&lt;/td&gt;
&lt;td&gt;Ensure you pass both arguments: &lt;code&gt;python3 analyze-images-v2.py $PROJECT_ID $PROJECT_ID&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service accounts&lt;/strong&gt; are the standard way to provide application-level credentials in GCP. Each service account can have granular IAM roles scoped to specific services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;&lt;/strong&gt; is the universal environment variable that all Google Cloud client libraries check for authentication.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Vision API&lt;/strong&gt; requires an &lt;code&gt;Image&lt;/code&gt; object created from raw bytes — you can't pass the bytes directly to the detection method.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Vision API's &lt;code&gt;document_text_detection&lt;/code&gt;&lt;/strong&gt; returns a structured response where the first element in &lt;code&gt;text_annotations&lt;/code&gt; contains the full detected text and its locale.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Translation API's &lt;code&gt;translate()&lt;/code&gt; method&lt;/strong&gt; returns a dictionary with &lt;code&gt;translatedText&lt;/code&gt;, &lt;code&gt;detectedSourceLanguage&lt;/code&gt;, and &lt;code&gt;input&lt;/code&gt; keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery's &lt;code&gt;insert_rows()&lt;/code&gt;&lt;/strong&gt; performs streaming inserts and returns an empty list on success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always read the existing code&lt;/strong&gt; before modifying — variable names like &lt;code&gt;vision_client&lt;/code&gt;, &lt;code&gt;desc&lt;/code&gt;, and &lt;code&gt;image_object&lt;/code&gt; are defined by the script and must be matched exactly in the lines you add.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;&lt;/strong&gt; instead of &lt;code&gt;roles/storage.admin&lt;/code&gt; — it grants object-level read/write/delete without unnecessary bucket-level management permissions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Principle of least privilege&lt;/strong&gt;: Only grant the roles your service account actually needs (&lt;code&gt;dataEditor&lt;/code&gt; for BigQuery writes, &lt;code&gt;storage.objectAdmin&lt;/code&gt; for GCS object access, &lt;code&gt;serviceUsageConsumer&lt;/code&gt; for API consumption).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test incrementally&lt;/strong&gt;: Run the script after each modification to catch errors early rather than debugging everything at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables for credentials&lt;/strong&gt;: Never hard-code paths to credential files in your scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the existing code carefully&lt;/strong&gt;: Variable names matter — using &lt;code&gt;vision_client&lt;/code&gt; vs &lt;code&gt;client&lt;/code&gt; or &lt;code&gt;desc&lt;/code&gt; vs &lt;code&gt;text&lt;/code&gt; can cause &lt;code&gt;NameError&lt;/code&gt; exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;document_text_detection&lt;/code&gt; over &lt;code&gt;text_detection&lt;/code&gt;&lt;/strong&gt; when dealing with dense text in images — it uses an OCR model optimized for dense, document-style text (see the short comparison sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
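
&lt;p&gt;Both OCR entry points are called the same way; only the method name (and the underlying model) changes. A short comparison sketch, using a placeholder local image rather than a file from the lab bucket:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import vision

vision_client = vision.ImageAnnotatorClient()

# 'sign1.jpg' is a placeholder local image.
with open('sign1.jpg', 'rb') as image_file:
    image_object = vision.Image(content=image_file.read())

# Sparse text (street signs, labels): the basic OCR model.
sparse_response = vision_client.text_detection(image=image_object)

# Dense text (documents, menus): the document-optimized OCR model used in this lab.
dense_response = vision_client.document_text_detection(image=image_object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;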




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This challenge lab walks you through a realistic ML pipeline pattern: ingest raw data (images), enrich it using ML APIs (Vision + Translation), and store structured results for analysis (BigQuery). These same building blocks — Cloud Storage for data lake, ML APIs for enrichment, BigQuery for analytics — appear in production architectures across industries. Mastering this flow gives you a solid foundation for building more complex ML data pipelines on Google Cloud.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>googleaichallenge</category>
    </item>
  </channel>
</rss>
