Hassann

Posted on Jun 3 • Originally published at apidog.com

How to build a computer-use agent with Qwen 3.7 Plus

Qwen 3.7 Plus scores 79.0 on ScreenSpot Pro, a benchmark for reading a screenshot and returning the exact pixel coordinates to click. That capability is the core of a computer-use agent: software that sees a screen, chooses the next action, executes it, and repeats until the task is complete.


In this guide, you’ll build a working Python agent loop with Qwen 3.7 Plus and Playwright. You’ll define a strict action schema, send screenshots to the model, execute returned actions in the browser, and add basic guardrails for cost, reliability, and safety. For model background, see the Qwen 3.7 Plus overview. For raw multimodal request structure, see the Qwen 3.7 Plus API guide. You can test the model calls in Apidog before wiring them into the browser loop.

TL;DR

A computer-use agent runs this loop:

Capture a screenshot.
Send the screenshot and goal to Qwen 3.7 Plus.
Receive a structured action such as click, type, scroll, or done.
Execute the action with an automation driver like Playwright.
Repeat until the goal is complete or the step limit is reached.

Qwen 3.7 Plus is a good fit because it can ground GUI elements to coordinates and supports multimodal workflows. The implementation challenges are mostly outside the model: coordinate scaling, step limits, token cost, retry handling, and sandboxing.

What a computer-use agent does

At implementation level, a computer-use agent is a controlled loop:

Perceive: capture the current screen or browser page.
Decide: ask the model for the next action.
Act: execute the action through a driver.
Check: take another screenshot and verify progress.

The model only handles the “decide” step. Your code owns the execution, limits, retries, and safety rules.


Why Qwen 3.7 Plus fits this use case

Qwen 3.7 Plus is useful for computer-use agents for three reasons:

GUI grounding: it can map visual UI elements to coordinates.
Hybrid workflows: it can support GUI and CLI-style task flows.
Lower multimodal cost: at $0.40 per million input tokens, it is practical for repeated screenshot calls.

For a comparison with the text-only flagship model, see the Qwen 3.7 Plus vs Max comparison.

Step 1: Define a strict action schema

Do not ask the model for free-form instructions. Free-form prose is hard to execute safely.

Instead, constrain the model to a small set of actions and require JSON output.

import os
import json
import base64
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

SYSTEM = """You are a GUI agent. You see a screenshot and a goal.
Reply with ONE JSON action and nothing else:
{"action": "click", "x": <int>, "y": <int>}
{"action": "type", "text": "<string>"}
{"action": "scroll", "dy": <int>}
{"action": "done", "reason": "<string>"}
Coordinates are pixels in the screenshot you were given."""

The action vocabulary is intentionally small:

click: click a pixel coordinate.
type: type text into the focused field.
scroll: scroll vertically.
done: stop the loop.
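
Because the loop executes whatever the model returns, it can help to check each parsed action against this vocabulary before acting on it. The following is a minimal sketch of such a check. The names `validate_action` and `REQUIRED_FIELDS` are illustrative and not part of any Qwen or Playwright API, and the bounds check assumes the 1280x800 viewport used later in this guide.

```python
# Hypothetical helper: checks a parsed action against the four-action schema.
REQUIRED_FIELDS = {
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "scroll": {"dy": int},
    "done": {"reason": str},
}


def validate_action(action, width=1280, height=800):
    """Return the action unchanged if it is well-formed, else raise ValueError."""
    if not isinstance(action, dict):
        raise ValueError("Action must be a JSON object")

    name = action.get("action")
    if name not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action: {name!r}")

    for field, expected_type in REQUIRED_FIELDS[name].items():
        value = action.get(field)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"Field {field!r} must be {expected_type.__name__}")

    if name == "click":
        # Assumed bounds: coordinates must fall inside the screenshot
        if not (0 <= action["x"] < width and 0 <= action["y"] < height):
            raise ValueError("Click coordinates fall outside the screenshot")

    return action
```

Calling this right after parsing the model reply means a malformed or out-of-bounds action fails loudly before any click or keystroke is sent.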

Step 2: Send the screenshot to Qwen 3.7 Plus

This helper encodes a PNG screenshot, sends it with the user goal, and parses the model response as JSON.

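def next_action(goal, png_bytes):
    """Ask the model for the next action, given the goal and a PNG screenshot."""
    # Inline the screenshot as a base64 data URL
    b64 = base64.b64encode(png_bytes).decode()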

    resp = client.chat.completions.create(
        model="qwen3.7-plus",
        messages=[
            {"role": "system", "content": SYSTEM},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Goal: {goal}"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            },
        ],
    )

    return json.loads(resp.choices[0].message.content)

Before shipping, confirm the current model ID in the Model Studio docs, since identifiers can change.

Step 3: Execute actions with Playwright

Playwright lets the agent interact with a real browser.

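The key implementation detail: make the screenshot size match the viewport size. That way, model coordinates map directly to Playwright coordinates without scaling. This holds at Playwright's default device scale factor of 1; a higher factor yields a larger screenshot, and model coordinates then need scaling back to viewport pixels.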

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)

    page = browser.new_page(
        viewport={"width": 1280, "height": 800}
    )

    page.goto("https://example.com")

    goal = "Open the pricing page and find the cheapest plan"

    for step in range(15):                 # hard cap on steps
        shot = page.screenshot()           # 1280x800 PNG
        action = next_action(goal, shot)

        print(step, action)

        if action["action"] == "done":
            break

        if action["action"] == "click":
            page.mouse.click(action["x"], action["y"])

        elif action["action"] == "type":
            page.keyboard.type(action["text"])

        elif action["action"] == "scroll":
            page.mouse.wheel(0, action["dy"])

        else:
            # Unknown action: stop instead of silently burning steps
            raise ValueError(f"Unsupported action: {action!r}")

        page.wait_for_timeout(800)         # let the UI settle

    browser.close()

That is the complete browser agent loop:

screenshot -> model -> JSON action -> Playwright -> screenshot -> ...

The same pattern works for desktop apps if you replace Playwright with a desktop automation driver and capture screenshots of the relevant OS window.

Cost and reliability controls

Screenshots are the main cost driver. Each image becomes input tokens, and a 1280-wide screenshot can consume a few thousand tokens. A 15-step loop can therefore add up quickly.
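
To make that concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the 3,000-token figure is an assumed stand-in for "a few thousand", the function name `estimate_input_cost` is illustrative, and output tokens and prompt text are ignored.

```python
# Assumed figures: 3,000 tokens per screenshot is a placeholder for
# "a few thousand"; $0.40 per million input tokens is the price quoted earlier.
def estimate_input_cost(steps, tokens_per_screenshot=3_000, usd_per_million=0.40):
    """Estimate screenshot input cost in USD for one agent run."""
    total_tokens = steps * tokens_per_screenshot
    return total_tokens * usd_per_million / 1_000_000


per_run = estimate_input_cost(15)
print(f"{per_run:.4f} USD per 15-step run")
print(f"{per_run * 10_000:.2f} USD per 10,000 runs")
```

Under these assumptions a single run is cheap, but the total grows linearly with steps, image size, and the number of runs, which is why the controls below are worth adding early.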

Use these controls from the start:

Downscale screenshots when full resolution is unnecessary.
Crop to the active panel if the task only needs part of the screen.
Cap the step count so the agent cannot loop forever.
Verify after every action by taking a new screenshot.
Stop on repeated actions to avoid clicking or scrolling in circles.

For more cost strategies, see the guide on reducing agent token costs. For workflow failure modes, see agentic workflow wiring patterns and pitfalls.

Handle common failure cases

1. The model returns prose instead of JSON

Parse the reply defensively so that malformed output raises a clear error, then retry once with a short repair prompt.

Example strategy:

def parse_action(raw):
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("Model did not return valid JSON")

In production, you can retry with a message like:

Return only one valid JSON action. Do not include explanations.

If the retry fails, stop the loop and surface the screenshot to a human.
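
The retry itself can be sketched as a small wrapper. This is one possible shape, not a prescribed implementation: `ask_model` is a hypothetical callable that takes the prompt text and returns the raw reply string, so the sketch stays independent of how the screenshot is attached to the request.

```python
import json

REPAIR_PROMPT = "Return only one valid JSON action. Do not include explanations."


def action_with_retry(ask_model, goal, max_retries=1):
    """Call ask_model until it returns parseable JSON or retries run out.

    ask_model is a hypothetical callable: (prompt_text) -> raw reply string.
    """
    prompt = f"Goal: {goal}"
    for _ in range(max_retries + 1):
        raw = ask_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Append the repair instruction for the next attempt
            prompt = f"Goal: {goal}\n{REPAIR_PROMPT}"
    raise ValueError("Model did not return valid JSON after retrying")
```

The final `ValueError` is the signal to stop the loop and hand the screenshot to a human.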

2. A click misses the target

Do not blindly repeat the same click. Take a fresh screenshot and ask the model again.

A missed click often means:

the page moved,
the element shifted,
the screenshot and viewport sizes do not match,
the model selected the wrong visual target.
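
A size mismatch is the easiest of these causes to rule out in code. The sketch below reads the dimensions straight from the PNG header using only the standard library; `png_size` and `check_matches_viewport` are illustrative names, and the `viewport` argument is assumed to be the same dict passed to `browser.new_page`.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(png_bytes):
    """Read (width, height) from a PNG's IHDR chunk without decoding the image."""
    if len(png_bytes) < 24 or png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("Not a PNG image")
    if png_bytes[12:16] != b"IHDR":
        raise ValueError("PNG is missing its IHDR header")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23
    return struct.unpack(">II", png_bytes[16:24])


def check_matches_viewport(png_bytes, viewport):
    """Raise ValueError if the screenshot size differs from the viewport size."""
    width, height = png_size(png_bytes)
    if (width, height) != (viewport["width"], viewport["height"]):
        raise ValueError(
            f"Screenshot is {width}x{height}, viewport is "
            f"{viewport['width']}x{viewport['height']}"
        )
```

Running `check_matches_viewport(shot, {"width": 1280, "height": 800})` once at the start of a run is usually enough to catch a scaling problem before the first click.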

3. The loop spins without progress

Track recent actions.

recent_actions = []

# after each action
recent_actions.append(action)

if len(recent_actions) > 5:
    recent_actions.pop(0)

if len(recent_actions) == 5 and all(a == recent_actions[0] for a in recent_actions):
    raise RuntimeError("Agent appears stuck")

This gives you an escape hatch before the step cap is reached.

Safety checklist

A computer-use agent performs real clicks and keystrokes. Add guardrails before using it on anything important.

Minimum safety rules:

Run in a sandbox or throwaway browser profile.
Never start with a logged-in production session.
Require human confirmation for destructive actions such as delete, send, submit, or pay.
Log every screenshot and action.
Keep a hard step limit.
Restrict allowed domains during browser automation.
Mock API calls during development where possible.
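
As an illustration of the domain restriction, the sketch below checks a URL against an allowlist. The function name `is_allowed` and the allowlist contents are assumptions for this example; in the Playwright loop you could pass it `page.url` after each action and stop when it returns False.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if url's host is an allowed domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
```

Matching on the exact host or a dot-prefixed suffix avoids accepting lookalike hosts such as `evil-example.com` when only `example.com` is allowed.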

Test the model calls with Apidog

Most failures start in the model response. Before connecting Playwright, test the “decide” step directly.

Use Apidog to:

Send a sample screenshot to Qwen 3.7 Plus.
Inspect the raw response.
Verify that the model returns one valid JSON action.
Tune the system prompt.
Store your Model Studio key per environment.
Mock the endpoint while developing the agent loop.

When the full loop is running, Apidog’s AI agent debugger helps inspect the sequence of calls and identify the step where the agent derailed.

To generate UI code from a design instead of driving an existing UI, see the companion guide on screenshot-to-code with Qwen 3.7 Plus.

Download Apidog to test and debug the model calls behind your agent.

FAQ

What is a computer-use agent?

A computer-use agent is software that reads a screen through screenshots, decides the next action with a model, and executes that action through an automation driver until a goal is complete.

Can Qwen 3.7 Plus control my desktop?

Not directly. The model returns actions such as click, type, or scroll. Your automation driver executes those actions. Use Playwright for browsers or a desktop automation library for native apps.

How much does each step cost?

Mostly the screenshot. A single screen image can use a few thousand input tokens; at $0.40 per million input tokens, that is roughly a tenth of a cent of input per step, assuming about 3,000 tokens per image. Downscaling, cropping, and capping the number of steps are the main cost controls.

Is it reliable enough for production?

It can work for bounded, well-defined tasks with verification after each step. For critical systems or open-ended control, keep a human in the loop and run the agent in a sandbox.

Do I need to scale coordinates?

Not if your screenshot resolution matches your viewport. If they differ, multiply each model coordinate by the viewport dimension divided by the screenshot dimension along the same axis.

Example:

scaled_x = action["x"] * viewport_width / screenshot_width
scaled_y = action["y"] * viewport_height / screenshot_height
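
Wrapped as a helper, the same scaling might look like the sketch below. The name `to_viewport` is illustrative, and the rounding and clamping are additions for safety rather than something the model or Playwright requires.

```python
def to_viewport(x, y, screenshot_size, viewport_size):
    """Map model coordinates from screenshot pixels to viewport pixels."""
    shot_w, shot_h = screenshot_size
    view_w, view_h = viewport_size
    scaled_x = round(x * view_w / shot_w)
    scaled_y = round(y * view_h / shot_h)
    # Clamp so rounding never pushes a click outside the viewport
    return (
        min(max(scaled_x, 0), view_w - 1),
        min(max(scaled_y, 0), view_h - 1),
    )
```

For example, if you send the model a 640x400 downscaled image of a 1280x800 viewport, `to_viewport(320, 200, (640, 400), (1280, 800))` maps the click to the viewport centre.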

The bottom line

A computer-use agent is a loop around a capable multimodal model. Qwen 3.7 Plus provides the GUI grounding needed for coordinate-based actions, while your code handles execution, limits, retries, and safety.

Build the loop, cap it, sandbox it, verify every action, and test the model calls in Apidog before letting the agent click through real workflows.
