Romin Irani for Google AI

Posted on Jun 27 • Originally published at Medium on Jun 26

Mastering Gemini Computer Use: A Comprehensive Hands-on Guide

#gemini #googleai #google #computeruse

Gemini Computer Use lets you build AI agents that control real screens by looking at screenshots and deciding what to click, type, or scroll, just like a human would.

You can read the announcement post below:

Introducing computer use in Gemini 3.5 Flash

This tutorial walks you through five progressive steps (from “Hello Screenshot” to Gemini Enterprise Agent Platform), then shows five real-world use cases. All code is included and runnable.

What is Computer Use?

If you’ve ever written a Selenium test or a Puppeteer script, you know the friction that comes with it. You spend hours crafting CSS selectors, XPath expressions, and fragile wait conditions. Then one day, the website changes a class name, and everything breaks. This is the fragile nature of selector-based automation: it depends on the implementation details of the UI rather than on what’s actually visible on the screen.

And there’s a deeper problem. Not every application exposes an API. Legacy enterprise software, desktop apps, mobile interfaces: there’s an enormous surface area of software that can only be operated by a human looking at a screen and clicking around.

The Solution: AI That Sees

Gemini Computer Use takes a radically different approach. Instead of parsing HTML or querying the DOM, the model looks at a screenshot and decides what to do, exactly the way a human would. It doesn’t need selectors. It doesn’t need APIs. It just needs eyes.

The paradigm is beautifully simple: Observe → Think → Act.

Observe: Capture a screenshot of the current screen state
Think: Send it to Gemini 3.5 Flash, which analyzes the visual content and decides the next action
Act: Execute the action (click a button, type text, scroll down), then capture a new screenshot and loop back

This loop continues until the model decides the task is complete and responds with text instead of a function call.

How Computer Use Works: Brain, Eyes, and Hands

Every Computer Use agent has three parts:

The Brain is always the same — Gemini 3.5 Flash. But the Hands change depending on what you’re controlling:

┌─────────────────────────────────────────────────────────────────┐
│                   🧠 BRAIN (always the same)                    │
│                       Gemini 3.5 Flash                          │
│          "Look at this screenshot, what should I do?"           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
              Model returns: click(x=396, y=185)
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     🖐️ Browser        🖐️ Mobile       🖐️ Enterprise
     Playwright           ADB            CDP Sandbox
     page.mouse        adb shell        Remote browser
   .click(506,148)     input tap         in the cloud
                        506 148

Why this matters: The model doesn’t know or care what “hands” you’re using. It just sees screenshots and returns actions. You can swap Playwright for Selenium, ADB for a physical device, or a local browser for a cloud sandbox — the model’s output is identical. The only thing that changes is the environment parameter you declare (browser, mobile, or desktop), which tells the model what kind of screen it's looking at.

Why Gemini 3.5 Flash is Special

Computer Use isn’t a separate model or a bolted-on capability. In gemini-3.5-flash, it's a native tool, declared alongside other tools like function calling and code execution. This design decision has key implications, and it’s worthwhile to understand them:

Multi-tool composition: The model can interleave browser actions with custom function calls in the same conversation. Click a button, then call save_product() to store data, then scroll down for more.
Three environments: One model supports ENVIRONMENT_BROWSER, ENVIRONMENT_MOBILE, and ENVIRONMENT_DESKTOP. You declare the environment and the model adapts its available actions.
Thinking mode: You can enable ThinkingConfig(include_thoughts=True) to see the model's reasoning before each action, invaluable for debugging.

Key Technical Concepts

Normalized Coordinates (0–999): The model doesn’t know your screen resolution. Instead, it outputs coordinates in a normalized 0–999 range. You convert them to pixels with:

pixel = int(normalized / 1000 * screen_dimension)

For a 1280×800 viewport, coordinate (500, 500) maps to pixel (640, 400) — the centre of the screen.
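As a quick check of the conversion above, here is a minimal sketch. The helper name `denormalize` and the default 1280×800 viewport are illustrative assumptions, not part of the SDK.

```python
def denormalize(x: int, y: int, width: int = 1280, height: int = 800) -> tuple:
    """Map model coordinates (0-999 grid) to pixel coordinates."""
    return int(x / 1000 * width), int(y / 1000 * height)


print(denormalize(500, 500))  # (640, 400)
print(denormalize(396, 185))  # (506, 148)
```

The second call reproduces the click(x=396, y=185) example used throughout this guide.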

The intent Field: Each function call includes an intent string describing what the model is trying to achieve, not just the mechanical action. This is crucial when you debug the whole flow.

Safety Decisions: When the model encounters potentially sensitive actions (making a purchase, sending a message, deleting data), it includes a safety_decision object requesting explicit user confirmation.
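A confirmation gate for such actions might look like the sketch below. It assumes the safety_decision object arrives as a dictionary inside the function call arguments; the key names `decision` and `explanation` and the value `require_confirmation` are assumptions for illustration, so check the actual response payload before relying on them. The helper `should_execute` is hypothetical.

```python
from typing import Callable


def should_execute(args: dict, confirm: Callable[[str], bool]) -> bool:
    """Return True if the action may run, asking the user when the model requests it."""
    decision = args.get("safety_decision")
    if not decision:
        return True  # nothing flagged, proceed
    # Hypothetical field names; verify against the real response payload.
    if decision.get("decision") == "require_confirmation":
        return confirm(decision.get("explanation", "Sensitive action requested."))
    return True


# Usage with a stand-in confirm callback that always declines.
flagged = {
    "x": 10,
    "y": 20,
    "safety_decision": {
        "decision": "require_confirmation",
        "explanation": "This will submit a purchase.",
    },
}
print(should_execute(flagged, confirm=lambda msg: False))             # False
print(should_execute({"x": 10, "y": 20}, confirm=lambda msg: False))  # True
```

Passing the confirm callback in as a parameter keeps the gate testable; an interactive agent could supply a function that prompts on the terminal.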

What Changed from Legacy to Current

The previous gemini-2.5-computer-use-preview model was an early experiment with a separate API surface. The current gemini-3.5-flash approach integrates Computer Use as a standard tool. This means the same model, the same SDK, and the same generate_content() call that you already know.

Two Paths: Gemini API vs Enterprise Platform

There are two ways to use Gemini Computer Use, and which one you choose depends on where you are in the development lifecycle.

Gemini API (ai.google.dev)

The fastest path from zero to working agent. Grab an API key, pip install the SDK, and you’re running. You build your own execution environment (Playwright for browsers, ADB for Android), which gives you maximum flexibility and zero cloud dependencies.

Enterprise Agent Platform (Vertex AI)

When you’re ready for production, Vertex AI adds IAM-based authentication, managed browser sandboxes (no local browser needed), VPC Service Controls, audit logging, and all the enterprise trimmings. The important part? The code is almost identical. You change one line.

The migration between paths is a single-line change in client construction:

Before — Gemini API (prototyping)

# Gemini API (prototyping)
client = genai.Client(api_key="YOUR_KEY")

After — Vertex AI (production)

# That's it — that's the migration
client = genai.Client(
    vertexai=True,
    project="my-project",
    location="us-central1",
)
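One way to keep that switch out of the agent code is to derive the constructor arguments from the environment. This is a minimal sketch: the helper name `client_kwargs` and the `USE_VERTEX` flag are hypothetical, while `GEMINI_API_KEY`, `GCP_PROJECT_ID`, and `GCP_LOCATION` are the `.env` names used in the setup sections of this tutorial.

```python
import os


def client_kwargs(env: dict) -> dict:
    """Build keyword arguments for genai.Client from environment-style settings."""
    if env.get("USE_VERTEX") == "1":  # hypothetical toggle
        return {
            "vertexai": True,
            "project": env["GCP_PROJECT_ID"],
            "location": env.get("GCP_LOCATION", "us-central1"),
        }
    return {"api_key": env["GEMINI_API_KEY"]}


# In a real script: client = genai.Client(**client_kwargs(dict(os.environ)))
print(client_kwargs({"GEMINI_API_KEY": "YOUR_KEY"}))
print(client_kwargs({"USE_VERTEX": "1", "GCP_PROJECT_ID": "my-project"}))
```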

Prerequisites & Setup

This section lists the software you need. What you install depends on which steps and use cases you plan to run.

The table in the What You Need section maps each requirement to the step or use case that needs it. For example, if you are not interested in the Android scenarios or in running on Google Cloud, Python 3.10+ and a Gemini API key are enough. Each step and use case states its requirements, so keep them handy as you go along.

What You Need

Step-by-Step Installation

The examples are code-heavy, so I have made a code repository available. The tutorial walks through each folder in the repository (one per step or use case), and you can follow along to run the examples.

GitHub - rominirani/computer-use-tutorial: Mastering Gemini Computer Use: A Comprehensive Hands-On Guide - Build AI agents that control browsers, mobile devices, and desktops with Gemini 3.5 Flash

The Repository structure is shown below:

computer-use-tutorial/
├── README.md                     ← You are here
├── requirements.txt              ← Python dependencies
├── .env.example                  ← Template for API key
├── 01-hello-screenshot/          ← Step 1: Visual understanding
│   └── hello_screenshot.py
├── 02-single-action/             ← Step 2: One screenshot → one click
│   └── single_action.py
├── 03-browser-agent/             ← Step 3: Full agentic browser loop
│   ├── browser_agent.py
│   └── playwright_env.py
├── 04-mobile-agent/              ← Step 4: Android + Interactions API
│   ├── mobile_agent.py
│   ├── adb_bridge.py
│   └── setup_emulator.sh
├── 05-enterprise-platform/       ← Step 5: Vertex AI + sandboxes
│   └── enterprise_agent.py
└── 06-use-cases/                 ← 5 real-world use cases
    ├── usecase1_qa_testing/
    ├── usecase2_price_comparison/
    ├── usecase3_mobile_testing/
    ├── usecase4_web_research/
    └── usecase5_form_filling/

1. Clone the repository

git clone https://github.com/rominirani/computer-use-tutorial
cd computer-use-tutorial

2. Create a Python virtual environment

python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# On Windows: .venv\Scripts\activate

⚠️ Important: Always activate the virtual environment before running any tutorial script. If your shell prompt doesn’t show (.venv), run source .venv/bin/activate again.

3. Install Python dependencies

pip install -r requirements.txt

The requirements.txt is intentionally lean:

google-genai>=2.7.0
playwright==1.55.0
pydantic>=2.0
rich
termcolor
python-dotenv

What is Playwright doing here?

Playwright is the “hands” of the agent. Gemini Computer Use gives the model eyes (it can see screenshots) and a brain (it decides what to do), but it has no way to actually interact with a browser. That’s where Playwright comes in.

It is an open-source browser automation library from Microsoft. It lets Python code control a real Chromium browser i.e. open pages, click at pixel coordinates, type text, scroll, take screenshots, intercept popups, etc.

Why do we need it?

The Gemini model returns structured instructions like:

click(x=396, y=185)
type(text="hello world")
scroll(direction="down")

But these are just data. The model can’t execute them. Something needs to translate click(x=396, y=185) into an actual mouse click inside a real browser. Playwright does that:

Model says click(x=396, y=185) → Playwright runs page.mouse.click(506, 148)
Model says type("hello") → Playwright runs page.keyboard.type("hello")
Model says take a screenshot → Playwright runs page.screenshot() → PNG bytes
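The translation above can be sketched as a small dispatch function. This is an illustration under assumptions: `FakePage` is a hypothetical stand-in that only records calls so the snippet runs without a browser, and the two action names mirror the examples above rather than the model's full action set. With Playwright, you would pass a real page, whose mouse.click and keyboard.type methods have the same shape.

```python
class _Recorder:
    """Records method calls instead of driving a browser."""

    def __init__(self, log: list, prefix: str):
        self._log, self._prefix = log, prefix

    def click(self, x, y):
        self._log.append(f"{self._prefix}.click({x}, {y})")

    def type(self, text):
        self._log.append(f"{self._prefix}.type({text!r})")


class FakePage:
    """Hypothetical stand-in for a Playwright page."""

    def __init__(self):
        self.log = []
        self.mouse = _Recorder(self.log, "mouse")
        self.keyboard = _Recorder(self.log, "keyboard")


def execute(page, name: str, args: dict, width: int = 1280, height: int = 800) -> None:
    """Translate one model action into calls on the page object."""
    if name == "click":
        # Normalized 0-999 coordinates become viewport pixels.
        page.mouse.click(int(args["x"] / 1000 * width), int(args["y"] / 1000 * height))
    elif name == "type":
        page.keyboard.type(args["text"])
    else:
        raise ValueError(f"Unsupported action: {name}")


page = FakePage()
execute(page, "click", {"x": 396, "y": 185})
execute(page, "type", {"text": "hello"})
print(page.log)  # ['mouse.click(506, 148)', "keyboard.type('hello')"]
```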

Could you use something else?

Yes you could. Selenium, Puppeteer, or any browser automation tool would work. Playwright is chosen here because it’s fast, has a clean Python API, handles headless mode well, and has built-in screenshot capture. For mobile, we use ADB instead of Playwright (same role, different “hands”).

By default, Playwright doesn’t use the Chrome/Chromium already on your machine. It downloads its own browser builds, matched to the installed Playwright version, and drives Chromium programmatically (click, type, screenshot, navigate, etc.) through the Chrome DevTools Protocol.

playwright install chromium downloads this Chromium build (~150 MB) into a Playwright-managed directory (e.g. ~/Library/Caches/ms-playwright/).

4. Install the Chromium browser binary

playwright install chromium

Verify it works:

python3 -c "from playwright.sync_api import sync_playwright; pw = sync_playwright().start(); b = pw.chromium.launch(); print('✅ Playwright Chromium works'); b.close(); pw.stop()"

5. Configure your Gemini API key

Please get a Gemini API Key from here.

cp .env.example .env
# Edit .env and replace 'your-api-key-here' with your actual key

Your .env file should look like:

GEMINI_API_KEY=AIzaSy...(your actual key)

6. Verify — run Step 01

cd 01-hello-screenshot
python hello_screenshot.py

If you see the model describe the Hacker News page, your setup is complete for Steps 1–3 and browser-based use cases. 🎉

In my case here, this is what I got as output:

✓ Gemini client initialised

============================================================
  Step 1 → Launch browser
  Opening https://news.ycombinator.com
============================================================

  → Page loaded: Hacker News
  → Viewport : 1280×800

============================================================
  Step 2 → Capture screenshot
  Taking a full-viewport PNG snapshot
============================================================

  → Screenshot captured: 206.7 KB
  → Saved to : /Users/romin/computer-use-tutorial/01-hello-screenshot/screenshot.png

============================================================
  Step 3 → Send to Gemini
  Asking the model to describe the screenshot
============================================================

============================================================
  Step 4 → Model response
  Gemini's description of the screenshot
============================================================

Based on the screenshot provided, here is a detailed description of the webpage:

### **Website Name and Identity**
* **Site Name:**  **Hacker News** (a social news website focusing on computer science and entrepreneurship, run by the investment fund Y Combinator).
* **Logo:** A small, bright orange square containing a stylized white letter **"Y"** sits in the top-left corner.

### **Color Scheme**
* **Primary Accent:** Bright safety orange (`#ff6600`), used exclusively for the top navigation bar.
* **Background:** The main content area has a very light, warm gray/off-white background, while the outer margins of the page are plain white.
* **Text:** Dark gray to black for the main headlines, and a lighter, muted gray for metadata (points, time, domains, and comment counts).

### **Layout and Navigation**
* **Structure:** The site features an extremely minimalist, text-heavy, "Web 1.0" style layout. The content is centered in a single vertical column with wide white margins on the left and right.
* **Header Bar:** A solid orange horizontal bar spans the top of the content column. 
    * On the left, next to the "Y" logo, is the bold text **"Hacker News"** followed by a pipe-separated menu of navigation links: `new | past | comments | ask | show | jobs | submit`.
    * On the far right of this bar is a single `login` link.
* **Main List:** Below the header is a sequentially numbered list of stories (numbered 1 through 22 are visible). 

### **Anatomy of a List Item**
Each item in the list follows a strict, consistent format:
1. **Number and Upvote Arrow:** A number (e.g., "1.") followed by a small, light-gray upward-pointing triangle (▲) used for voting.
2. **Headline:** The title of the article, styled as a clickable dark link.
3. **Source Domain:** Directly next to the headline, the hosting website's domain is listed in parentheses in light gray (e.g., `(om.co)` or `(github.com/plbrault)`).
4. **Metadata Line:** Positioned directly below the headline in smaller, lighter text. This includes:
    * The score (e.g., "843 points")
    * The submitting user's handle (e.g., "by minimaxir")
    * How long ago it was posted (e.g., "11 hours ago")
    * A "hide" link
    * The comment count, which links to the discussion page (e.g., "96 comments").

### **Visible Top Headlines**
The top stories visible on this front page cover tech news, programming, hardware, and history. Notable headlines include:
1. *Om Malik has died* (om.co)
2. *Why current LLM costs are not sustainable* (patadia.org)
3. *An entire Herculaneum scroll has been read for the first time* (scrollprize.org)
4. ....
10. ....

────────────────────────────────────────────────────────────
Cleaning up browser resources...
✓ Done — screenshot saved to screenshot.png

(Optional) Android Emulator Setup

Skip this if you only want the browser-based examples (Steps 1–3, Use Cases 1–2, 4–5).

Option A: Use the included setup script

cd 04-mobile-agent
chmod +x setup_emulator.sh
./setup_emulator.sh

This idempotent script installs Java and the Android SDK, creates an AVD named ComputerUseTutorial, and prints the export lines for your shell.

If you are familiar with Android SDK, feel free to look at the setup_emulator.sh script and create an AVD named ComputerUseTutorial.

Option B: Existing Android SDK

# Common ANDROID_HOME locations (use the one that matches your install):
export ANDROID_HOME=~/Library/Android/sdk # Android Studio default
# export ANDROID_HOME=/opt/homebrew/share/android-commandlinetools # Homebrew (Apple Silicon)
# Add to PATH
export PATH="$ANDROID_HOME/emulator:$ANDROID_HOME/platform-tools:$PATH"
# Add to ~/.zshrc so it persists
echo 'export ANDROID_HOME=~/Library/Android/sdk' >> ~/.zshrc
echo 'export PATH="$ANDROID_HOME/emulator:$ANDROID_HOME/platform-tools:$PATH"' >> ~/.zshrc
source ~/.zshrc

Starting and verifying the emulator

# List and start
emulator -list-avds
emulator -avd <your-avd-name>
# Verify it's ready
adb devices # → emulator-5554 device
adb shell getprop sys.boot_completed # → 1

(Optional) Google Cloud Setup

Skip this if you don’t have a Google Cloud project or only want the Gemini API.

# Authenticate
gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com

Add to your .env:

GCP_PROJECT_ID=your-gcp-project-id
GCP_LOCATION=us-central1

Required IAM role: roles/aiplatform.user

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="user:your-email@domain.com" \
    --role="roles/aiplatform.user"

Understanding the Code: Key Computer Use Patterns

Before diving into the step-by-step code, here are the patterns you’ll see in every script.

Pattern 1: Normalized Coordinates (0–999)

The model always outputs coordinates in a 0–999 grid, regardless of screen size. Your code must denormalize:

# Model returns: click(x=396, y=185) — these are NOT pixels!
pixel_x = int(396 / 1000 * viewport_width) # 396 → 506 on a 1280px screen
pixel_y = int(185 / 1000 * viewport_height) # 185 → 148 on an 800px screen

This is the most common source of bugs: forgetting to denormalize causes clicks to land in the wrong place.

Pattern 2: The Agentic Loop

1. Take screenshot → send to model with task description
2. Model returns: function_call (e.g., click, type, scroll)
3. Execute the action in the browser/device
4. Take a NEW screenshot → send back as FunctionResponse
5. Goto 2 — until model returns plain text (= task complete)

The key insight: the model signals it’s done by returning text instead of function_calls.
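The loop can be sketched without any SDK by stubbing the model and the environment. Everything here (`ScriptedModel`, the dict-shaped replies, `run_agent`) is hypothetical scaffolding meant to show the control flow only, not the real request or response types.

```python
from typing import Callable


class ScriptedModel:
    """Hypothetical stand-in for the model: replays a fixed list of replies."""

    def __init__(self, replies: list):
        self._replies = iter(replies)

    def next_step(self, screenshot: bytes) -> dict:
        return next(self._replies)


def run_agent(model, take_screenshot: Callable[[], bytes],
              execute: Callable[[str, dict], None], max_turns: int = 10) -> str:
    """Observe, think, act, and repeat until the model answers with plain text."""
    for _ in range(max_turns):
        reply = model.next_step(take_screenshot())  # observe + think
        if "function_call" not in reply:            # plain text means done
            return reply["text"]
        call = reply["function_call"]
        execute(call["name"], call["args"])         # act, then loop again
    raise RuntimeError("Turn limit reached before the task completed")


actions = []
model = ScriptedModel([
    {"function_call": {"name": "click", "args": {"x": 396, "y": 185}}},
    {"function_call": {"name": "scroll", "args": {"direction": "down"}}},
    {"text": "Task complete."},
])
result = run_agent(model, lambda: b"fake-png", lambda name, args: actions.append(name))
print(result, actions)  # Task complete. ['click', 'scroll']
```

The max_turns cap is a design choice worth keeping in a real agent: it bounds cost if the model never produces a final text answer.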

Pattern 3: FunctionResponse Must Include a Screenshot

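types.FunctionResponse(
    name="click",
    response={"result": "ok"},
    # The screenshot travels in `parts`, matching the Step 3 agent code
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob(
                mime_type="image/png",
                data=screenshot_png,
            ),
        ),
    ],
)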

Without the screenshot, the model is blind and can’t plan its next move.

Pattern 4: Tool Declaration

tools=[
    types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
            # or ENVIRONMENT_MOBILE / ENVIRONMENT_DESKTOP
        )
    )
]

This is NOT a regular function declaration — it activates the model’s built-in understanding of screenshots and UI interaction.

Pattern 5: Screenshot Pruning

Screenshots are ~100–300 KB each. In a 20-turn conversation, that’s 2–6 MB of image data in the context:

# Keep screenshots only in the 3 most recent turns
for old_turn in history[:-3]:
    for part in old_turn.parts or []:
        if part.inline_data is not None:
            part.inline_data = None # drop image bytes, keep text

Step-by-Step Tutorial

This tutorial has five steps. Steps 1 to 3 need only Python and a Gemini API key. Step 4 covers Android, while Step 5 covers Google Cloud.

Step 1: Hello Screenshot (01-hello-screenshot/)

Before we can build an agent that controls a screen, we need to make sure it can see one. In this step, you’ll launch a headless browser, navigate to Hacker News, capture a screenshot, and send it to Gemini 3.5 Flash to get a visual description. There’s no Computer Use here, just the two foundational building blocks (screenshots + the model) that every subsequent step builds on.

Key Concept: This step uses no Computer Use at all. It establishes the two foundational building blocks that every subsequent step depends on:

Capturing screenshots
Sending them to Gemini

Code Walkthrough

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Describe what you see..."),
                types.Part(
                    inline_data=types.Blob(
                        mime_type="image/png",
                        data=screenshot_bytes,
                    )
                ),
            ],
        )
    ],
)

The SDK handles base64 encoding. You pass raw bytes and the MIME type.

Running It

cd 01-hello-screenshot
python hello_screenshot.py

Expected Output

✓ Gemini client initialised
Step 1 → Launch browser
  Opening https://news.ycombinator.com
  → Page loaded: Hacker News
  → Viewport : 1280×800
  Step 4 → Model response
  The screenshot shows Hacker News (news.ycombinator.com)...

Step 2: Your First Computer Use Action (02-single-action/)

Now we give the model its “hands.” You’ll show Gemini a screenshot of Wikipedia’s language selection page and ask it to click on the English link. The model will return a function_call with normalized coordinates, and you'll translate those into real pixel positions and execute the click with Playwright. This is a single round-trip (one screenshot in, one action out), but it covers the most critical concept in Computer Use: coordinate normalization.

Key Concept: This is where Computer Use enters the picture. The critical new concept is coordinate normalization.

Code Walkthrough

# Declare the Computer Use tool
computer_use_tool = types.Tool(
    computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER,
    )
)
# Send request
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=request_contents,
    config=types.GenerateContentConfig(tools=[computer_use_tool]),
)
# Parse and execute
for part in response.candidates[0].content.parts:
    if part.function_call:
        fc = part.function_call
        # Convert normalized (0-999) coordinates to viewport pixels
        pixel_x = int(fc.args["x"] / 1000 * SCREEN_WIDTH)
        pixel_y = int(fc.args["y"] / 1000 * SCREEN_HEIGHT)
        page.mouse.click(pixel_x, pixel_y)

Running It

cd 02-single-action
python single_action.py

Expected Output

✓ Gemini client initialised

================================================================
  Step 1 → Launch browser
  Navigating to https://www.wikipedia.org
================================================================

  → Page loaded : Wikipedia
  → Current URL : https://www.wikipedia.org/
  → Viewport : 1280×800

================================================================
  Step 2 → Capture screenshot
  Grabbing the Wikipedia portal page
================================================================

  → Screenshot size : 124.7 KB
  → Saved to : /Users/romin/computer-use-tutorial/02-single-action/01_before_click.png

================================================================
  Step 3 → Ask Gemini to click
  Sending screenshot + instruction with Computer Use tool
================================================================

  → Response received from Gemini

================================================================
  Step 4 → Parse response
  Extracting the function call from the model
================================================================

  Function call received:
    Name : click
    x : 396
    intent: Click on the English Wikipedia link to navigate to the English Wikipedia.
    y : 182

================================================================
  Step 5 → Execute click
  Converting coordinates and clicking
================================================================

  Coordinate conversion:
    Normalised : (396, 182) [0-999 space]
    Pixel : (506, 145) [1280×800 viewport]

  → Click executed at pixel (506, 145)
  → Page navigated to: https://www.wikipedia.org/

================================================================
  Step 6 → Verify result
  Taking a post-click screenshot
================================================================

  → Screenshot size : 315.8 KB
  → Saved to : /Users/romin/computer-use-tutorial/02-single-action/02_after_click.png
  → Current URL : https://en.wikipedia.org/wiki/Main_Page
  → Page title : Wikipedia, the free encyclopedia

  ✅ SUCCESS — Navigated to English Wikipedia!

────────────────────────────────────────────────────────────────
Cleaning up browser resources...
✓ Done — screenshots saved to 01_before_click.png & 02_after_click.png

Step 3: Building a Full Browser Agent (03-browser-agent/)

This is the big leap: going from a single action to a fully autonomous agent. You’ll build a complete browser agent that can accept any task (“find the top 3 stories on Hacker News”), take screenshots, send them to the model, execute the returned actions, capture new screenshots, and loop until the task is done. Along the way, you’ll tackle the real-world challenges: managing a growing conversation history, pruning old screenshots to stay within context limits, handling safety decisions, and recovering from transient API errors with exponential backoff.

Key Concept: This is the big step, going from a single action to a fully autonomous agent.
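Exponential backoff, mentioned above, can be sketched in a few lines. This is a generic illustration rather than the repository's implementation: `with_backoff` is a hypothetical helper, and it catches every Exception for brevity, whereas a real agent would retry only on the SDK's transient error types.

```python
import random
import time


def with_backoff(fn, retries: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on exception with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            # 1s, 2s, 4s, ... plus a little jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Usage with a stand-in call that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(with_backoff(flaky, sleep=lambda seconds: None))  # ok
```

The sleep function is injectable so the example runs instantly; in the agent you would leave the default time.sleep in place.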

Critical Pattern: Screenshot Pruning

def _prune_old_screenshots(self) -> None:
    screenshots_seen = 0
    for content in reversed(self._history):
        if content.role != "user" or not content.parts:
            continue
        has_screenshot = any(
            p.function_response and p.function_response.parts
            for p in content.parts
        )
        if has_screenshot:
            screenshots_seen += 1
            if screenshots_seen > self.MAX_SCREENSHOTS_IN_HISTORY:
                for part in content.parts:
                    if part.function_response and part.function_response.parts:
                        part.function_response.parts = None
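To see the same newest-first pruning logic in isolation, here is a runnable sketch that swaps the SDK types for a plain dataclass. `Turn` and `prune_screenshots` are hypothetical stand-ins, not names from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """Hypothetical stand-in for one history entry."""
    role: str
    text: str
    screenshot: Optional[bytes] = None


def prune_screenshots(history: list, keep: int = 3) -> None:
    """Drop image bytes from all but the `keep` most recent screenshot turns."""
    seen = 0
    for turn in reversed(history):  # walk newest first
        if turn.screenshot is None:
            continue
        seen += 1
        if seen > keep:
            turn.screenshot = None  # text stays, image goes


history = [Turn("user", f"turn {i}", b"png") for i in range(5)]
prune_screenshots(history, keep=3)
print([t.screenshot is not None for t in history])  # [False, False, True, True, True]
```

Walking the history in reverse means the counter reaches the limit on the newest screenshots first, so only older ones are cleared.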

Critical Pattern: FunctionResponse Construction

FunctionResponse(
    name=fc.name,
    response={"url": current_url},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob(
                mime_type="image/png",
                data=post_action_png,
            ),
        ),
    ],
)

Running It

cd 03-browser-agent
python browser_agent.py --task "Go to https://news.ycombinator.com and tell me the top 3 stories"

Expected Output

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Gemini Computer Use — Browser Agent │
│ Model: gemini-3.5-flash | Viewport: 1280×800 | Headless: False │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
────────────────────────────────────────────────────────── Browser Agent Starting ──────────────────────────────────────────────────────────
Task: Go to https://news.ycombinator.com and tell me the top 3 stories

Initial screenshot captured.

─────────────────────────────────────────────────────────────── Iteration 1 ────────────────────────────────────────────────────────────────
╭───────────────────────────────────────────────────────────── Model Thinking ─────────────────────────────────────────────────────────────╮
│ Okay, here's my thought process on this task: │
│ │
│ **Accessing and Prioritizing Hacker News Articles** │
│ │
│ Right, so the goal is clear: identify the top three stories currently dominating Hacker News. My immediate instinct is to open the site. │
│ It's second nature at this point – navigate to `https://news.ycombinator.com`. Once I'm there, the real work begins. I'll need to sift │
│ through the current submissions, using my knowledge of the site's ranking algorithm and usual trends, to identify which articles are │
│ garnering the most attention. I'll be looking at score, comment count, and potentially the recency of the posts to figure out what's │
│ currently hot and relevant. The top three are what I'm looking for, and I can be flexible with the ranking methodology. │
│ │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  ▶ navigate(intent=Navigate to Hacker News, url=https://news.ycombinator.com)
  → Screenshot captured | URL: https://news.ycombinator.com/

─────────────────────────────────────────────────────────────── Iteration 2 ────────────────────────────────────────────────────────────────
╭───────────────────────────────────────────────────────────── Model Thinking ─────────────────────────────────────────────────────────────╮
│ Okay, here's my take on those Hacker News headlines, as I see them: │
│ │
│ **Hacker News Digest - A Quick Glance** │
│ │
│ Alright, let's process these top stories on Hacker News. First up, the Patadia article on LLM costs – always a crucial point for │
│ long-term viability, and something I've been tracking myself. The core of this research is likely to be a deep dive into the │
│ computational inefficiencies, scaling issues, and the sheer energy consumption that underpins current models. I'm especially keen to see │
│ what specific cost-saving strategies the author proposes, and whether they involve architectural improvements, hardware optimization, or │
│ more fundamental algorithmic innovations. That's definitely one to bookmark and dig into. │
│ │
│ Then, a somber note: Om Malik has passed. A prominent voice in the tech sphere is gone. Always a difficult thing to grapple with when a │
│ familiar and respected figure leaves the scene. My thoughts are with his friends and family. A pause for reflection is definitely │
│ warranted. │
│ │
│ Finally, the Herculaneum scroll! This is truly fascinating. Reading an entire scroll from Herculaneum is a significant breakthrough. I'm │
│ guessing the team has leveraged advanced imaging techniques like X-ray tomography and computational tools for deciphering fragmented │
│ texts. What surprises me is this is a first. I'll need to research what the issues have been in deciphering this, if the material has │
│ been too degraded for even cutting-edge processing, or if this new tech is truly groundbreaking. The content itself will undoubtedly │
│ offer invaluable insights into ancient history, philosophy, and everyday life. A promising development. │
│ │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

────────────────────────────────────────────────────────────── Task Complete ───────────────────────────────────────────────────────────────

Agent Summary: The top 3 stories on Hacker News are:

1. **Why current LLM costs are not sustainable** (patadia.org)
2. **Om Malik has died** (om.co)
3. **An entire Herculaneum scroll has been read for the first time** (scrollprize.org)

─────────────────────────────────────────────────────────────── Final Result ───────────────────────────────────────────────────────────────

The top 3 stories on Hacker News are:

1. **Why current LLM costs are not sustainable** (patadia.org)
2. **Om Malik has died** (om.co)
3. **An entire Herculaneum scroll has been read for the first time** (scrollprize.org)

Architecture: browser_agent.py (AI loop) + playwright_env.py (browser abstraction).

Step 4: Android Mobile Agent (04-mobile-agent/)

We swap the browser for a phone. In this step, you’ll connect to an Android emulator via ADB, capture screenshots of the device screen, and let Gemini drive it with touch gestures: taps, swipes, and long-presses. This step also introduces the Interactions API, a stateful alternative to generateContent where the server manages conversation history for you. Instead of sending the full history every turn, you just chain calls with previous_interaction_id, which makes the code simpler.

Interactions API vs generateContent

ADB Bridge Pattern

def click(self, y: int, x: int, **_) -> None:
    px, py = self._denormalize(x, y)
    self._execute(["shell", "input", "tap", str(px), str(py)])
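The bridge method above boils down to building an adb command from denormalized coordinates. A minimal sketch follows; `tap_command` is a hypothetical helper, and the 1080×2400 screen size is an assumed example resolution, not a value from the repository.

```python
def tap_command(x: int, y: int, width: int, height: int) -> list:
    """Build the adb argument list for a tap at normalized (x, y)."""
    px = int(x / 1000 * width)
    py = int(y / 1000 * height)
    return ["adb", "shell", "input", "tap", str(px), str(py)]


# A 1080x2400 screen is an assumed example resolution.
print(tap_command(500, 500, 1080, 2400))
# ['adb', 'shell', 'input', 'tap', '540', '1200']
```

To actually send the tap, the list can be handed to subprocess.run(cmd, check=True) while an emulator or device is connected.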

Running It

# Start emulator first (see Android Setup above)
cd 04-mobile-agent
python mobile_agent.py "Open Settings and check the Android version"

Expected Output

╔══════════════════════════════════════════════════════════╗
║ Gemini Computer Use — Android Mobile Agent ║
║ Model: gemini-3.5-flash | API: Interactions ║
╚══════════════════════════════════════════════════════════╝
Turn 1/50
  🔧 Action: open_app({'package_name': 'com.android.settings'})
  ✓ Result: {'status': 'ok'}
  ... (4-6 turns: scrolls, clicks About Phone, reads version) ...
✅ Agent completed the task!
   Model response: The Android version is 16.

Step 5 : Enterprise Platform 05-enterprise-platform/

Everything you’ve built so far works with a free Gemini API key. But when you’re ready for production, with IAM authentication, VPC isolation, audit logging, and managed browser sandboxes, you can move to Vertex AI. The best part? The agent code is virtually identical. You change one line to switch from API-key auth to IAM, and optionally connect to a cloud-hosted sandbox browser instead of launching a local one. This step walks through both approaches.

The One-Line Migration

Your agent code — the agentic loop, the action dispatch, the screenshot pruning — is 100% identical between Gemini API and Vertex AI. The only change:
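
As a hedged sketch of what that switch can look like, assuming the google-genai Python SDK: `genai.Client` accepts either `api_key=...` or `vertexai=True` together with `project` and `location`. The helper name `client_kwargs` and the `us-central1` default are choices made for this example, and the environment variable names follow the ones used in this tutorial; the repository may wire this differently.

```python
import os


def client_kwargs(use_vertex, env=os.environ, location="us-central1"):
    """Return keyword arguments for genai.Client for either backend."""
    if use_vertex:
        # IAM / Application Default Credentials; no API key involved.
        return {"vertexai": True, "project": env["GCP_PROJECT_ID"], "location": location}
    return {"api_key": env["GEMINI_API_KEY"]}


# Usage (needs `pip install google-genai`; commented out so the sketch
# runs without the SDK installed):
#   from google import genai
#   client = genai.Client(**client_kwargs(use_vertex=True))

demo_env = {"GEMINI_API_KEY": "demo-key", "GCP_PROJECT_ID": "my-company-prod-123"}
print(client_kwargs(False, env=demo_env))
print(client_kwargs(True, env=demo_env))
```

Keeping the choice in one helper means the agentic loop never needs to know which backend it is talking to.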

CDP Sandbox Connection

# Local browser (Steps 1-3):
browser = pw.chromium.launch(headless=True)
# Managed sandbox (Step 5):
browser = pw.chromium.connect_over_cdp(cdp_endpoint)
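
One way to keep both paths behind a single call is sketched below. `open_browser` is a hypothetical helper, and `fake_pw` is a stand-in object so the example runs without Playwright installed; with the real library you would pass the handle from `sync_playwright()` instead. The endpoint URL is a placeholder.

```python
from types import SimpleNamespace


def open_browser(pw, cdp_endpoint=None):
    """Attach to a remote browser over CDP if an endpoint is given, else launch locally."""
    if cdp_endpoint:
        return pw.chromium.connect_over_cdp(cdp_endpoint)
    return pw.chromium.launch(headless=True)


# Stand-in for a Playwright handle (illustration only).
fake_pw = SimpleNamespace(chromium=SimpleNamespace(
    launch=lambda headless: f"local(headless={headless})",
    connect_over_cdp=lambda url: f"remote({url})",
))

print(open_browser(fake_pw))                                   # local(headless=True)
print(open_browser(fake_pw, "ws://sandbox.example/devtools"))  # remote(ws://sandbox.example/devtools)
```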

Running It

cd 05-enterprise-platform
# Approach 1: Self-managed browser (your local Playwright)
python enterprise_agent.py --approach self-managed --project YOUR_PROJECT_ID
# Approach 2: Managed sandbox (cloud-hosted browser)
python enterprise_agent.py --approach managed-sandbox --project YOUR_PROJECT_ID

💡 Tip : Replace YOUR_PROJECT_ID with your actual Google Cloud project ID. You can also set it in .env as GCP_PROJECT_ID=my-company-prod-123.

Part 5: Real-World Use Cases

Each use case in 06-use-cases/ is a complete, self-contained script demonstrating a practical application.

Use Case 1: QA Testing (TodoMVC)

Scenario : Run end-to-end QA tests on a React web app without writing fragile selectors. The agent sees the app and tests it the way a human QA tester would.

Architecture: generateContent API with a full agentic loop + custom report_qa_result function (multi-tool composition).
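
Multi-tool composition can be sketched as a dispatcher that routes each function call either to the browser or to your own Python code. Everything below is illustrative: the handler arguments mirror the `report_qa_result(test_name, passed, details)` calls visible in this use case's output, while `dispatch`, `BROWSER_ACTIONS`, and the returned dictionaries are assumptions rather than the repository's actual code.

```python
qa_results = []


def report_qa_result(test_name, passed, details=""):
    """Custom tool: record one test verdict reported by the model."""
    qa_results.append({"test": test_name, "passed": bool(passed), "details": details})
    return {"status": "recorded"}


CUSTOM_TOOLS = {"report_qa_result": report_qa_result}
BROWSER_ACTIONS = {"click", "type", "scroll"}  # assumed subset of UI actions


def dispatch(name, args):
    """Route a model function call to custom code or to the browser layer."""
    if name in CUSTOM_TOOLS:
        return CUSTOM_TOOLS[name](**args)
    if name in BROWSER_ACTIONS:
        # A real agent would drive Playwright here and capture a new screenshot.
        return {"status": "browser_action", "action": name}
    return {"status": "error", "message": f"unknown function {name}"}


def summary():
    passed = sum(r["passed"] for r in qa_results)
    return f"Overall: {passed}/{len(qa_results)} PASSED"


dispatch("click", {"x": 603, "y": 180})
dispatch("report_qa_result", {"test_name": "Add Todo Items", "passed": True})
dispatch("report_qa_result", {"test_name": "Mark Complete", "passed": True})
dispatch("report_qa_result", {"test_name": "Filter Completed", "passed": True})
print(summary())  # Overall: 3/3 PASSED
```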

The Prompt

Navigate to https://todomvc.com/examples/react/dist/ and perform QA testing:
1. Add three todos: 'Buy groceries', 'Read a book', 'Write code'
2. Mark 'Read a book' as complete
3. Verify items are displayed correctly
4. Click the 'Completed' filter
5. Report what you see

Running It

cd 06-use-cases/usecase1_qa_testing
python qa_agent.py

Step-by-Step Walkthrough

1. Agent navigates to the TodoMVC React app
2. Clicks the “What needs to be done?” input and types “Buy groceries” + Enter
3. Repeats for “Read a book” and “Write code”
4. Clicks the circle next to “Read a book” to mark it complete
5. Verifies all three items are visible with correct state
6. Clicks the “Completed” filter
7. Confirms only “Read a book” appears with strikethrough
8. Returns a structured test report

Expected Results

╔══════════════════════════════════════════════════════════════╗
║ Use Case 1: Automated QA Testing Agent — TodoMVC ║
╚══════════════════════════════════════════════════════════════╝

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Step 1 → Validate environment
  Checking GEMINI_API_KEY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✓ API key found

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Step 2 → Launch browser
  Opening https://todomvc.com/examples/react/dist/
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✓ Browser launched — viewport (1440, 900)
  ✓ Page loaded: https://todomvc.com/examples/react/dist/
  ✓ Initial screenshot: 87 KB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Step 3 → Initialise QA agent
  Setting up Gemini Computer Use loop
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✓ Agent ready — initial screenshot attached to prompt

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Step 4 → Run QA test plan
  Agent is now autonomously executing tests…
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 1
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: **QA Test Plan: TodoMVC React Application - Add Todo Items**

Okay, let's get down to brass tacks. I need to run a QA test plan on this TodoMVC React application. The first test is pretty straightforw…
  ▶ Action: click(x=603, y=180, intent=Click the todo input field to focus it.)
     📸 Screenshot captured (87 KB) — URL: https://todomvc.com/examples/react/dist/

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 2
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Here's my attempt at summarizing the thought process, tailored for an expert audience:

**Initiating Task Execution**

Okay, so the prompt is straightforward. It's an instruction set. It's not asking …
  ▶ Action: type(intent=Type 'Buy groceries' and press Enter., press_enter=True, text=Buy groceries)
     📸 Screenshot captured (94 KB) — URL: https://todomvc.com/examples/react/dist/

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 3
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Here's my attempt at summarizing the thought process, tailored for an expert audience:

**Initiating a Simple Instruction**

Okay, so the prompt is incredibly straightforward, almost rudimentary. It's…
  ▶ Action: type(text=Read a book, intent=Type 'Read a book' and press Enter., press_enter=True)
     📸 Screenshot captured (98 KB) — URL: https://todomvc.com/examples/react/dist/

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 4
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Here's my attempt at summarizing the thought process, tailored for an expert audience:

**Initiating the Inquiry into Gemini's Capabilities**

Okay, so the prompt directs me to explore the "Learn Gemi…
  ▶ Action: type(intent=Type 'Learn Gemini Computer Use' and press Enter., text=Learn Gemini Computer Use, press_enter=True)
     📸 Screenshot captured (103 KB) — URL: https://todomvc.com/examples/react/dist/

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 5
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Okay, here's my understanding, formatted as requested:

**Assessing Task Completion**

Right, let's just make absolutely sure we've got a handle on this. The task asks me to verify the presence of thr…
  ▶ Action: report_qa_result(test_name=Add Todo Items, details=Successfully added 'Buy groceries', 'Read a book', and 'Learn Gemini Computer Use' to the todo list., passed=True)

  📋 QA Result: Add Todo Items → [PASS ✅]
     Details : Successfully added 'Buy groceries', 'Read a book', and 'Learn Gemini Computer Use' to the todo list.

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 6
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: **Marking an Item as Complete**

Okay, so the next test is to mark an item as complete. Specifically, I need to find the "Read a book" task and visually confirm that it's marked as done after I intera…
  ▶ Action: click(y=316, intent=Click the toggle checkbox next to 'Read a book'., x=429)
     📸 Screenshot captured (105 KB) — URL: https://todomvc.com/examples/react/dist/

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 7
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Okay, here's my interpretation of that thought process, assuming I'm an expert in this particular task:

**Test Result Reporting - Iteration Complete**

Alright, that's done. I see "Read a book" is cr…
  ▶ Action: report_qa_result(test_name=Mark Complete, passed=True, details=Successfully marked 'Read a book' as completed, showing a strikethrough style and checked state.)

  📋 QA Result: Mark Complete → [PASS ✅]
     Details : Successfully marked 'Read a book' as completed, showing a strikethrough style and checked state.

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 8
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: **Executing the "Filter Completed" Test**

Okay, test three is up. We're verifying the "Completed" filter function. Let's start by identifying and clicking the "Completed" filter link at the bottom of…
  ▶ Action: click(intent=Click the 'Completed' filter link., y=438, x=637)
     📸 Screenshot captured (96 KB) — URL: https://todomvc.com/examples/react/dist/#/completed

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 9
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.
  💭 Reasoning: Okay, here's my take on those thoughts, framed as my own reasoning:

**Test Result Verification and Reporting**

Alright, let's just make absolutely sure we're on the right track here. First, I need t…
  ▶ Action: report_qa_result(details=Successfully filtered by 'Completed' and verified that only 'Read a book' is visible in the list., test_name=Filter Completed, passed=True)

  📋 QA Result: Filter Completed → [PASS ✅]
     Details : Successfully filtered by 'Completed' and verified that only 'Read a book' is visible in the list.

────────────────────────────────────────────────────────────────
  🤖 Agent Turn 10
────────────────────────────────────────────────────────────────
Tools at indices [1] are not compatible with automatic function calling (AFC). AFC is disabled. If AFC is intended, please include python callables in the tool list, and do not include function declaration and MCP server in the tool list.

  ✅ Agent finished.
  Final reasoning: Here's your summary:

**QA Session Complete**

Okay, that's it. All three tests have been successfully run and are now complete. Time to formally declare the QA session as finished. I've confirmed all the data, reviewed the results, and I'm satisfied that we can move forward. No outstanding issues or anomalies were discovered. Let's get the final report generated and distributed.

 The QA session is complete. All three tests passed successfully:
1. **Add Todo Items** : Successfully added "Buy gr…

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Step 5 → QA Report
  Printing final test results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

================================================
           QA TEST REPORT
================================================
  Test: Add Todo Items [PASS] ✅
  Test: Mark Complete [PASS] ✅
  Test: Filter Completed [PASS] ✅
================================================
  Overall: 3/3 PASSED
================================================

✓ Browser closed. QA session complete.

Key Takeaway: Computer Use can replace hundreds of lines of Selenium/Cypress test code with a single natural-language task description.

Use Case 2: Price Comparison

Scenario: Compare product prices across the web. The agent searches Amazon, extracts prices, and saves them through a custom function, demonstrating multi-tool composition.

Architecture: Computer Use (for browsing) + custom save_product() function in the same conversation.
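
A minimal sketch of the custom-function half of this setup. The declaration uses the JSON-schema style that Gemini function calling accepts, and the `name`, `price`, and `source` arguments come from the prompt for this use case; the descriptions, the `products` list, and the return value are assumptions made for illustration.

```python
# Declaration handed to the model alongside the Computer Use tool.
# Field descriptions are this example's wording, not the repository's.
SAVE_PRODUCT_DECLARATION = {
    "name": "save_product",
    "description": "Save one product found in the search results.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Product name"},
            "price": {"type": "string", "description": "Price as displayed"},
            "source": {"type": "string", "description": "Site the price came from"},
        },
        "required": ["name", "price", "source"],
    },
}

products = []


def save_product(name, price, source):
    """Handler invoked once per product the model identifies."""
    products.append({"name": name, "price": price, "source": source})
    return {"status": "saved", "count": len(products)}


print(save_product("Sony WH-CH720N", "INR 7,084.34", "Amazon"))
# {'status': 'saved', 'count': 1}
```

Price is kept as a string on purpose: the model reports what is displayed on screen, currency symbol included, and parsing can happen later in your own code.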

The Prompt

You are a price comparison agent. Your job is to search for products and
extract pricing information.

## Instructions

1. You are on Amazon.com. In the search box, type "wireless noise
   cancelling headphones" and press Enter to search.

2. Wait for results to load. You should see product listings with names
   and prices.

3. Look at the search results. For each of the top 3-5 products you can
   see, call the `save_product` function with:
   - name: the product name (e.g. "Sony WH-1000XM5")
   - price: the displayed price (e.g. "$278.00")
   - source: "Amazon"

4. If you need to scroll down to see more products, do so.

5. Try to find at least 3 products with visible prices. Save each one
   using `save_product`.

6. After saving 3-5 products, state that the comparison is complete.

IMPORTANT: Call `save_product` once for each product you identify.
Do NOT try to save all products in one call.

Running It

cd 06-use-cases/usecase2_price_comparison
python price_agent.py

Expected Results


.... Agent Turn(s) ...
    🔍 PRICE COMPARISON RESULTS                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Product ┃ Price ┃ Source ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ Soundcore by Anker Q20i Hybrid Active Noise │ INR 3,774.86 │ Amazon │
│ Cancelling Headphones, Wireless Over-Ear │ │ │
│ Bluetooth, 40H Long ANC Playtime, Hi-Res Audio, │ │ │
│ Big Bass, Customize via an App, Transparency Mode │ │ │
│ (Black) │ │ │
├────────────────────────────────────────────────────┼───────────────┼────────────────────┤
│ Sony WH-CH720N Noise Canceling Wireless Headphones │ INR 7,084.34 │ Amazon │
│ Bluetooth Over The Ear Headset with Microphone and │ │ │
│ Alexa Built-in, Black New │ │ │
├────────────────────────────────────────────────────┼───────────────┼────────────────────┤
│ Bose QuietComfort Headphones - Wireless Bluetooth │ INR 16,896.70 │ Amazon │
│ Headphones, Active Over Ear Noise Cancelling and │ │ │
│ Mic, USB-C Charging, Deep Bass, Up to 24 Hours of │ │ │
│ Playtime, Twilight Blue - Limited Edition... │ │ │
├────────────────────────────────────────────────────┼───────────────┼────────────────────┤
│ Bose QuietComfort Headphones - Wireless Bluetooth │ INR 16,896.70 │ Amazon │
│ Headphones, Active Over Ear Noise Cancelling and │ │ │
│ Mic, USB-C Charging, Deep Bass, Up to 24 Hours of │ │ │
│ Playtime, Black │ │ │
└────────────────────────────────────────────────────┴───────────────┴────────────────────┘

  📊 Total products found: 4

Key Takeaway: Multi-tool composition is the standout feature here. The model seamlessly switches between “driving the browser” and “calling your business logic functions.”

Use Case 3: Mobile App Testing

Scenario : Verify that the Android Settings app works: toggle dark mode, open About phone, and read the system info.

Architecture : Uses the Interactions API with ENVIRONMENT_MOBILE.

Prompt

You are an automated QA agent testing an Android device.

Your current test plan:
1. Open the Android Settings app (package: com.android.settings).
2. Navigate to "Display" settings.
3. Check whether Dark theme / Dark mode is currently ON or OFF.
4. Toggle Dark mode (turn it ON if it is OFF, or OFF if it is ON).
5. Press Back to return to the main Settings screen.
6. Navigate to "About phone" (scroll down if needed).
7. Read the Android version string shown on that page.
8. Report a test summary with: dark mode before, dark mode after, Android version.

Rules:
* Use the tools provided. Scroll down before assuming an item is missing.
* When the task is complete, output a **Test Summary** with all findings.
* Do NOT use the on-screen keyboard — use the `type` tool instead.

Running It

cd 06-use-cases/usecase3_mobile_testing
python app_test_agent.py

⚠️ Prerequisite : You need a running Android emulator. See the Android Setup section above.

Key Takeaway: The Interactions API handles state management server-side, making mobile agents significantly simpler to build.

Use Case 4: Web Research & Report

Scenario: Research a topic across multiple websites and produce a structured Markdown report using dual custom functions (save_finding + generate_report).
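
The two custom functions can be sketched as a collector plus a renderer. The argument names (title, URL, key point, category) follow the prompt for this use case; the Markdown layout, the `topic` parameter, and the return values are assumptions for illustration, not the repository's implementation.

```python
findings = []


def save_finding(title, url, key_point, category):
    """Custom tool: store one insight extracted from a page."""
    findings.append({"title": title, "url": url,
                     "key_point": key_point, "category": category})
    return {"status": "saved", "total": len(findings)}


def generate_report(topic):
    """Custom tool: render the collected findings as a Markdown document."""
    lines = [f"# Research Report: {topic}", ""]
    for i, item in enumerate(findings, start=1):
        lines += [f"## {i}. {item['title']}",
                  f"- Category: {item['category']}",
                  f"- Source: {item['url']}",
                  f"- Key point: {item['key_point']}",
                  ""]
    return "\n".join(lines)


# Placeholder data, not a real research result.
save_finding("Example finding", "https://example.com/a",
             "Placeholder key point.", "algorithm")
report = generate_report("AI safety regulations 2026")
print(report)
```

Writing `report` to a timestamped `.md` file is then a single `pathlib.Path(...).write_text(report)` call.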

Prompt

You are a web research assistant. Your job:

1. The browser is open to DuckDuckGo. Search for: "{search_query}"
2. Browse the first 2-3 search result pages.
3. On each page, extract **key findings** and call the ``save_finding`` tool
   for each distinct insight (with a descriptive title, the page URL, the
   key point, and a category like "hardware", "algorithm", "partnership",
   "investment", or "breakthrough").
4. After collecting findings from 2-3 pages, call ``generate_report`` to
   signal you are done.
5. Finally, output a brief natural-language summary.

Rules:
* You MUST call ``save_finding`` at least once per page visited.
* Call ``generate_report`` exactly once when finished.
* Navigate with clicks and scrolls — read the page content on screen.
* If a page fails to load, use the browser back button and try the next result.

Running It

cd 06-use-cases/usecase4_web_research
python research_agent.py
# Or with a custom topic:
python research_agent.py --search "AI safety regulations 2026"

Expected Results

A Markdown report saved to disk (report_AI_safety_regulations_2026_2026-06-26_152011.md in the run below) with structured findings, source URLs, and a summary.

✓ Report saved to: /Users/romin/computer-use-tutorial/06-use-cases/usecase4_web_research/report_AI_safety_regulations_2026_2026-06-26_152011.md

                                                               Collected Findings (7)                                                               
╭─────┬──────────────────────────────────────────────────────────────┬─────────────┬───────────────────────────────────────────────────────────────╮
│ # │ Title │ Category │ Source │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 1 │ EU AI Act High-Risk AI System Compliance and Penalties │ algorithm │ https://is4.ai/blog/our-blog-1/ai-safety-legislation-2026-gl… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 2 │ California SB 1047 Large-Scale AI Model Safety Protocols │ algorithm │ https://is4.ai/blog/our-blog-1/ai-safety-legislation-2026-gl… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 3 │ China Generative AI Content and Alignment Regulations │ algorithm │ https://is4.ai/blog/our-blog-1/ai-safety-legislation-2026-gl… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 4 │ NIST Launches Autonomous AI Agent Standards Initiative │ algorithm │ https://www.hungyichen.com/en/insights/ai-governance-regulat… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 5 │ Singapore IMDA Model AI Governance Framework for Agentic AI │ algorithm │ https://www.hungyichen.com/en/insights/ai-governance-regulat… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 6 │ US Federal AI Litigation Task Force Challenges State AI Laws │ algorithm │ https://www.gunder.com/en/news-insights/insights/2026-ai-law… │
├─────┼──────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────────────────────────────────────────┤
│ 7 │ US Restricts BEAD Funding for States with Onerous AI Laws │ partnership │ https://www.gunder.com/en/news-insights/insights/2026-ai-law… │
╰─────┴──────────────────────────────────────────────────────────────┴─────────────┴───────────────────────────────────────────────────────────────╯

✓ Done — report at: /Users/romin/computer-use-tutorial/06-use-cases/usecase4_web_research/report_AI_safety_regulations_2026_2026-06-26_152011.md

Key Takeaway: Computer Use + custom functions turns Gemini into a research agent that can navigate the open web and produce structured, citable output.

Use Case 5: Form Filling

Scenario : Fill out a complex HTML form with text inputs, radio buttons, checkboxes, dropdowns, and text areas. A classic RPA task.

The Prompt

Fill out the local practice registration form with:
- Name: Jane Smith
- Email: jane.smith@example.com
- Gender: Female
- Mobile: 555-0123
- Subject: Computer Science
- Hobby: Reading
- Address: 123 AI Street, Tech City, CA 90210
- State: California, City: San Francisco

Then submit the form and verify the confirmation.

Running It

cd 06-use-cases/usecase5_form_filling
python form_agent.py

Expected Results

╭───────────────────────────────────────────────────────────────────────────────────── Agent Report ──────────────────────────────────────────────────────────────────────────────────────╮
│ Okay, here's my take on those thoughts, tailored for an expert's perspective: │
│ │
│ **Registration Confirmation and Report Preparation** │
│ │
│ Alright, the registration's gone through. Got the confirmation message – "Registration Submitted Successfully!" – so that's a good starting point. Now, let's nail down what needs to │
│ be in this final report. I need to clearly articulate what specific fields were populated correctly, essentially a field-by-field breakdown to demonstrate data integrity. Of course, │
│ the form submission success is key, and I’ll confirm that the confirmation message, the visual indicator of a successful transaction, was present. Time to outline this response. │
│ │
│ │
│ I have successfully completed the form-filling task! │
│ │
│ ### 1. Fields Successfully Filled: │
│ * **First Name** : Jane │
│ * **Last Name** : Smith │
│ * **Email** : jane.smith@example.com │
│ * **Gender** : Female │
│ * **Mobile** : 555-0123 │
│ * **Date of Birth** : 1995-06-15 │
│ * **Subject** : Computer Science │
│ * **Hobbies** : Reading │
│ * **Current Address** : 123 AI Street, Tech City, CA 90210 │
│ * **State** : California │
│ * **City** : San Francisco │
│ │
│ ### 2. Submission Status: │
│ * **Successfully Submitted** : Yes, the form was submitted successfully, and the green confirmation box with **"Registration Submitted Successfully!"** appeared on the screen, │
│ displaying all the correct values. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✓ Browser closed
                                                                                   Form Filling Summary                                                                                    

  Metric Value                                                                                                                                                                      
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 
  Target URL file:///Users/romin/computer-use-tutorial/06-use-cases/usecase5_form_filling/practice_form.html                                                                            

  Total turns 39                                                                                                                                                                         

  Model gemini-3.5-flash                                                                                                                                                           

  API generateContent (browser)                                                                                                                                                  

  Data filled First Name: Jane, Last Name: Smith, Email: jane.smith@example.com, Gender: Female, Mobile: 555-0123, Subject: Computer Science, Hobbies: Reading, Address: 123 AI Street,  
                Tech City, CA 90210, State: California, City: San Francisco

✓ Form filling complete.

Key Takeaway: Computer Use handles diverse HTML controls (radio buttons, checkboxes, dropdowns) that would each require different selector strategies in traditional automation. The model just sees the form and fills it in.

Summary

Gemini Computer Use represents a fundamental shift in how we automate software: from brittle, selector-based scripts to AI agents that see and interact with screens just like humans do. In this tutorial, you went from sending a single screenshot to building autonomous agents that control browsers, drive Android devices, and scale to enterprise with Vertex AI, all powered by the same model and the same Brain → Eyes → Hands architecture.

The five use cases showed that this isn't a toy demo: QA testing, price comparison, mobile automation, web research, and form filling are real workflows being automated today.

The code is open, the patterns are reusable, and with gemini-3.5-flash, you're one API call away from giving your next application the ability to operate any screen. Go for it.