Solving the Gemini API Challenge Lab on Vertex AI: Text, Function Calling & Video Understanding

The "Explore Generative AI with the Gemini API in Vertex AI: Challenge Lab" on Google Cloud Skills Boost throws three Gemini capabilities at you in one sitting: a raw REST call from Cloud Shell, function calling from a Jupyter notebook, and multimodal video analysis. None of it is hard once you know what the verifier is actually checking — but a couple of things are easy to get wrong on the first attempt and the lab gives you almost no feedback when you do.

This walkthrough is the version of the solution I wish I had read before starting. I'll show you the working code for every task, but more importantly, I'll explain why each piece works the way it does — including a deep dive into the function-call response object, which is genuinely interesting once you understand it.

The challenge in one paragraph

You're playing the role of a developer at a video-analysis startup. Your job is to prove you can wire up three Gemini features end-to-end: generating text via a direct REST call, declaring a tool that Gemini can decide to invoke, and feeding a video from Cloud Storage into the model so it can describe what it sees. The lab provides a half-finished Jupyter notebook with INSERT placeholders, and your job is to fill in the blanks.

The model used throughout is gemini-2.5-flash, and the notebook uses the new google-genai SDK (not the legacy vertexai one — this matters because the class names and import paths are different).

Task 1: Text generation via curl from Cloud Shell

The first task is the simplest in concept and the most annoying in practice. You open Cloud Shell, you curl the Vertex AI endpoint, you ask Gemini why the sky is blue, you get an answer back. Done.

Except the verifier won't accept your call unless you hit a very specific endpoint. More on that in a moment.

Setting up the environment

The lab pre-fills these variables for you:

PROJECT_ID=qwiklabs-gcp-00-207c94de3534   # yours will differ
LOCATION=us-east1
API_ENDPOINT=${LOCATION}-aiplatform.googleapis.com
MODEL_ID="gemini-2.5-flash"

Then you need to make sure the Vertex AI API is enabled. The lab tells you to do this in the Console, but the CLI is faster:

gcloud services enable aiplatform.googleapis.com --project=${PROJECT_ID}

The curl call (with the gotcha)

Here's the part where the lab can quietly waste 20 minutes of your time. The Vertex AI generative endpoints expose two methods: generateContent (returns one big response) and streamGenerateContent (returns a stream of chunks). Both work. Both return valid Gemini answers. Only one of them satisfies the lab verifier.

The verifier checks for streamGenerateContent. Use this:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:streamGenerateContent" -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          { "text": "Why is the sky blue?" }
        ]
      }
    ]
  }'

If you get a JSON array back where each element contains a candidates[].content.parts[].text field with text about Rayleigh scattering, you're good. Hit "Check my progress" and Task 1 turns green.

If you get 403 PERMISSION_DENIED, the API enablement probably hasn't finished propagating — wait 30 seconds after enabling and try again. If you get 404, check for a typo in the region or model name.

Why this matters: the difference between generateContent and streamGenerateContent is operational, not semantic. Streaming is what you'd actually want in production for any user-facing chatbot, because it lets the UI display tokens as they arrive instead of making the user stare at a spinner. The lab is implicitly nudging you toward that pattern.
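To see why streaming matters for perceived latency, here's a toy analogy with no API calls at all — a non-streaming call hands you the whole answer at the end, while a streaming call yields chunks you can display as they're produced (the chunk text and timings are invented for illustration):

```python
import time

# Non-streaming: the caller sees nothing until the entire answer exists.
def generate_all(chunks):
    time.sleep(0.01 * len(chunks))  # pretend the full generation happens first
    return "".join(chunks)

# Streaming: each chunk is yielded as soon as it is "produced",
# so a UI can render tokens immediately.
def generate_stream(chunks):
    for chunk in chunks:
        time.sleep(0.01)  # pretend each chunk takes time to generate
        yield chunk

chunks = ["The sky ", "is blue ", "because of ", "Rayleigh scattering."]
streamed = "".join(generate_stream(chunks))
assert streamed == generate_all(chunks)  # same final text, different delivery
```

The final text is identical either way; the only difference is when the first characters become available to the caller — which is exactly the generateContent vs streamGenerateContent distinction.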

Task 2: Open the notebook in Vertex AI Workbench

This task has no scoring — it's purely navigational. From the Console: Navigation menu → Vertex AI → Workbench. Find the generative-ai-jupyterlab instance (it should already be running), click Open JupyterLab, and once the new tab loads, double-click gemini-explorer-challenge.ipynb. When the kernel selector pops up, pick Python 3.

That's it. Now the real work begins.

Task 3: Function calling with Gemini

Function calling is the feature that turns Gemini from a chatbot into something that can actually do things in the world. The idea: you describe a function to the model — its name, what it does, what arguments it takes — and the model decides on its own whether and when to invoke it based on what the user is asking.

The notebook has four cells to fill in. Let's do them.

3.1 — Load the model

# Task 3.1
model_id = "gemini-2.5-flash"

Just the model identifier as a string. The new SDK doesn't make you instantiate a model object the way the legacy vertexai library did — you pass the model name straight into client.models.generate_content().

3.2 — Declare the function

# Task 3.2
get_current_weather_func = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather in a given location",
    parameters={
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "Location"
            }
        }
    },
)

FunctionDeclaration (already imported at the top of the notebook from google.genai.types) is how you describe a function to Gemini. Notice that you're not giving it any actual code — you're giving it a schema. The description field is critical: this is what Gemini reads to decide whether your function is relevant to the user's prompt. A vague description means the model might not call your function when it should, or might call it when it shouldn't.

The parameters block is JSON Schema. If your real function took more arguments — say, unit for Celsius vs Fahrenheit — you'd add them here.
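For instance, a hypothetical unit argument (not part of the lab's declaration) would just be another property in the same JSON Schema, with an enum constraining the model to valid values:

```python
# Hypothetical extended schema for get_current_weather — the "unit" field
# and its enum values are illustrative additions, not part of the lab.
weather_params = {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": "City and state, e.g. Boston, MA",
        },
        "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature unit to use in the response",
        },
    },
    "required": ["location"],
}
```

The enum is worth the extra lines: it tells the model the only two strings it may produce for that argument, which removes a whole class of "the model invented a unit name" bugs.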

3.3 — Wrap it in a Tool

# Task 3.3
weather_tool = Tool(
    function_declarations=[get_current_weather_func],
)

A Tool is a container for one or more related function declarations. You could bundle get_current_weather and get_forecast and get_historical_weather into a single tool, and Gemini would pick whichever one fits the user's question.

3.4 — Invoke the model

# Task 3.4
prompt = "What is the weather like in Boston?"
response = client.models.generate_content(
    model=model_id,
    contents=prompt,
    config=GenerateContentConfig(
        tools=[weather_tool],
        temperature=0,
    ),
)
response

temperature=0 is important here: when you're asking the model to make a structured decision (call this function with these args), you want it to be deterministic, not creative.

Decoding the response (the interesting part)

Run the cell and you'll see something that looks alarming the first time:

GenerateContentResponse(
  candidates=[
    Candidate(
      avg_logprobs=-0.5011326244899205,
      content=Content(
        parts=[
          Part(
            function_call=FunctionCall(
              args=<... Max depth ...>,
              name=<... Max depth ...>
            ),
            thought_signature=b'\n\xcb\x01\x01\x8f=k_u\x91\xe5\x14...'
          ),
        ],
        role='model'
      ),
      finish_reason=<FinishReason.STOP: 'STOP'>
    ),
  ],
  ...
  usage_metadata=GenerateContentResponseUsageMetadata(
    candidates_token_count=7,
    prompt_token_count=25,
    thoughts_token_count=39,
    total_token_count=71,
    ...
  )
)

There is no text anywhere in the response. That's not a bug — that's the entire point. Let me unpack what's happening.

Part with function_call instead of text. Normally a Part carries a text field with whatever the model wrote. This one carries a function_call instead. What Gemini is telling you is: "I cannot answer 'what's the weather in Boston' from my training data, but the user gave me a tool called get_current_weather that can. I'm not going to make up an answer — I'm going to ask the caller to invoke that tool with location='Boston' and pass me back the result."

The <... Max depth ...> you see is just Python's repr truncating the output for display. The data is there. If you actually want to read it, do:

fc = response.candidates[0].content.parts[0].function_call
print(fc.name)   # "get_current_weather"
print(fc.args)   # {"location": "Boston"}

thought_signature (those scary-looking bytes). Gemini 2.5 is a thinking model — it does internal chain-of-thought reasoning before producing output. The thought_signature is an opaque, signed blob of that reasoning. You don't read it. Its only purpose is to be passed back to Gemini in a follow-up call (the second turn of the function-calling loop, see below) so the model can resume its reasoning without having to re-derive everything from scratch. It's a cache key for the model's internal state.

finish_reason=STOP. The model finished cleanly. Not truncated by token limit, not blocked by a safety filter.

The token counts. This is where Gemini 2.5 gets fun:

  • prompt_token_count=25: your prompt plus the function declaration consumed 25 input tokens.
  • candidates_token_count=7: the function call output was 7 tokens.
  • thoughts_token_count=39: the model spent 39 tokens thinking internally before deciding to call the function. This is the cost of the chain-of-thought. You're billed for it, and it's only present on the 2.5 family.
  • total_token_count=71: the sum, which is what hits your bill.
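As a sanity check, the three counts really do add up to the billed total from the response above:

```python
# Token counts copied from the usage_metadata in the response above.
prompt_tokens = 25    # prompt text + function declaration
output_tokens = 7     # the function-call output (candidates_token_count)
thought_tokens = 39   # internal reasoning (thoughts_token_count, 2.5 family only)

total = prompt_tokens + output_tokens + thought_tokens
assert total == 71    # matches total_token_count — the number you're billed for
```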

The full function-calling loop (which the lab doesn't make you complete)

What you just saw is step 2 of a 4-step dance. In a real application:

  1. You send a prompt plus tool definitions to Gemini.
  2. Gemini returns a function_call saying which function to invoke and with what args. ← the lab stops here
  3. You actually execute the function — call a real weather API, hit a database, whatever — and send the result back to Gemini as a function_response.
  4. Gemini uses that result to compose a natural-language answer like "It's currently 18°C and partly cloudy in Boston."

The lab only grades you up to step 2 because what's being demonstrated is that the model understands the tool and knows when to use it. The actual execution lives in your application code, not in Gemini's responsibilities. Once you grasp this separation of concerns, function calling stops feeling magical and starts feeling like a very natural API contract.
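To make that separation concrete, here's a minimal sketch of step 3 — executing the function the model asked for. The weather backend is entirely hypothetical (the lab provides none), and step 4 would send the result back to Gemini as a function_response:

```python
# Hypothetical local backend — in production this would call a real weather API.
def get_current_weather(location: str) -> dict:
    return {"location": location, "temperature_c": 18, "conditions": "partly cloudy"}

# Dispatch table: the function name Gemini returns maps to actual code.
FUNCTIONS = {"get_current_weather": get_current_weather}

def execute_function_call(name: str, args: dict) -> dict:
    """Step 3: run the function Gemini requested, with the args it produced."""
    fn = FUNCTIONS.get(name)
    if fn is None:
        raise ValueError(f"model requested unknown function: {name}")
    return fn(**args)

# With fc.name == "get_current_weather" and fc.args == {"location": "Boston"}:
result = execute_function_call("get_current_weather", {"location": "Boston"})
# Step 4 (not shown): wrap `result` in a function_response part and send it
# back in a second generate_content call so Gemini can phrase the answer.
```

Note that the dispatch table is also your security boundary: the model can only reach functions you explicitly registered.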

Task 4: Describing video contents

Same model, same client, but now you're going to feed it a video file from Cloud Storage and ask it to describe what's in it.

4.1 — Load the model

# Task 4.1
multimodal_model = "gemini-2.5-flash"

Same model as before. gemini-2.5-flash is natively multimodal — it doesn't need a separate "vision" or "video" variant. You hand it text, images, audio, or video, and it figures it out.

4.2 — Generate the description

The notebook has two INSERT placeholders here, plus you have to recognize that it's expecting a streaming call (the for response in responses: loop at the bottom is the giveaway).

# Task 4.2 Generate a video description
prompt = """
What is shown in this video?
Where should I go to see it?
What are the top 5 places in the world that look like this?
"""
video = Part.from_uri(
    file_uri="gs://github-repo/img/gemini/multimodality_usecases_overview/mediterraneansea.mp4",
    mime_type="video/mp4",
)
contents = [prompt, video]

responses = client.models.generate_content_stream(
    model=multimodal_model,
    contents=contents
)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

Three things to notice.

Part.from_uri is how you reference Cloud Storage assets. You don't download the video to the notebook and base64-encode it — Gemini reads it directly from gs://. Faster, cheaper, and works for files much larger than what you could comfortably embed inline. The mime_type is required so the model knows how to decode the bytes.

contents is a list mixing text and media. You pass [prompt, video] and the SDK figures out what each element is. You could pass [image, prompt, video, image, prompt] if you wanted — the model treats it as a sequential multimodal message.

generate_content_stream, not generate_content. This is the second INSERT and it's the one most people miss. The for response in responses: loop at the bottom of the cell only makes sense if responses is iterable — which it is for the streaming version. If you used the non-streaming generate_content, you'd get back a single response object and the for loop would iterate over its attributes and break in confusing ways. The lab's hint is in the comment links: one of them points to the "stream response" docs.

When you run it, you'll see the video embedded in the notebook and then a streaming description fill in chunk by chunk — turquoise water, rocky cliffs, the Mediterranean — followed by a top-5 list with places like Amalfi, Santorini, the Côte d'Azur, Mallorca, and Croatia's Dalmatian coast.

Hit "Check my progress" and Task 4 goes green.

Key learnings

A few things worth taking away from this lab beyond just passing it.

The google-genai SDK is not the old vertexai SDK. If you've used Vertex AI's generative features before, you're probably used to from vertexai.generative_models import GenerativeModel. That's the legacy path. The new path is from google import genai plus from google.genai.types import .... Class names like FunctionDeclaration, Tool, and Part are similar but live in different modules. Don't mix them — pick one and stick with it.

Function calling is a contract, not an execution. Gemini will never actually call your function. It will tell you that you should call your function, with these args, and then wait for you to pass the result back. The model is the brain; your code is the hands. This separation is what makes function calling safe to deploy in production — you control exactly what the model can and cannot reach.

Thinking tokens are real and they cost money. Gemini 2.5 Flash's thoughts_token_count is a separate billable line item from input and output tokens. For most prompts it's small, but for complex reasoning tasks it can dominate the bill. If you're cost-optimizing, this is worth measuring.

Multimodal inputs come from Cloud Storage, not from your notebook. For anything bigger than a small image, the right pattern is to upload to GCS and reference with Part.from_uri. This avoids round-tripping bytes through your runtime and is dramatically faster for video.

Streaming vs non-streaming is a real choice. generateContent returns a single payload. streamGenerateContent returns chunks as they're produced. Pick streaming for any user-facing experience and non-streaming for server-to-server batch jobs where latency-to-first-token doesn't matter.

Best practices

A few things I'd do differently in real code than what the lab asks for:

  • Never hard-code the project ID. The notebook has PROJECT_ID = "qwiklabs-gcp-..." because the lab is ephemeral, but in production read it from google.auth.default() or an environment variable.
  • Write detailed function descriptions. "Get the current weather" is fine for a demo. For real tools, describe what the function returns, what units, what error conditions, and anything else that helps the model decide when to invoke it. The model only sees what you write.
  • Always set temperature=0 for tool calls. Creative variation in a function-call decision is almost never what you want.
  • Handle the multi-turn flow. A demo that stops at step 2 of the function-calling loop isn't a real integration. Build out the full round-trip: receive the function call, execute it, send the function_response back, get the natural-language answer.
  • Validate tool arguments before executing. Gemini is good at structured outputs but not perfect. Your function executor should treat the args as untrusted input and validate them against the schema before doing anything destructive.
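A minimal sketch of that last point — hand-rolled validation against a schema mirroring the lab's get_current_weather declaration (the Python-type mapping here is an illustrative simplification, not a full JSON Schema validator):

```python
# Expected shape of the args, mirroring the function declaration.
# Maps each property to the Python type we'll accept — a simplification.
SCHEMA = {
    "required": ["location"],
    "properties": {"location": str},
}

def validate_args(args: dict) -> dict:
    """Treat model-produced args as untrusted input before executing anything."""
    for key in SCHEMA["required"]:
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        expected = SCHEMA["properties"].get(key)
        if expected is None:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return args
```

In real code you'd likely reach for a proper JSON Schema validator (e.g. the jsonschema package) rather than hand-rolling this, but the principle is the same: validate first, execute second.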

Wrapping up

The Gemini API challenge lab is a small surface area but a surprisingly good introduction to three patterns you'll use constantly if you build with Vertex AI: direct REST access for quick experiments, function calling for tool-using agents, and multimodal inputs from Cloud Storage. The three things that tripped me up — the streamGenerateContent requirement in Task 1, the meaning of the function-call response object in Task 3, and the streaming method in Task 4 — are the things worth remembering, because they all reflect how you'd actually use these APIs in production.

Now go build something with it.
