Thanawat Wongchai

Posted on Jun 3 • Originally published at apidog.com

How to Build a Computer-Use AI Agent with Qwen 3.7 Plus

Qwen 3.7 Plus scores 79.0 on ScreenSpot Pro, a benchmark for reading screenshots and returning pixel coordinates to click. That strength lets a chat model act as a computer-use agent: a system that sees the screen, picks the next action, and tells an automation tool to carry it out. This article walks you through building a working agent loop in Python.

We focus on what you need to implement: the agent loop structure, a prompt that forces the model to reply with an action, a runnable Playwright example, token cost control, and guardrails before real use. If you want background on the model first, read the Qwen 3.7 Plus overview. For the raw payload of a multimodal request, read the Qwen 3.7 Plus API guide. You can use Apidog to test the agent's requests and responses during development.

TL;DR

A computer-use agent runs as a loop:

Take a screenshot
Send the image and the goal to Qwen 3.7 Plus
Receive a structured action such as click(x, y), type(text), scroll(dy)
Use Playwright or another driver to perform the action
Take a new screenshot and repeat until done

The parts to watch are capping the number of iterations, matching coordinates to the viewport, controlling the token cost of screenshots, and sandboxing actions so that a wrong click has no real-world impact.

How the agent loop works

The basic structure has four stages:

Perceive: capture the screen or the browser viewport
Decide: show the model the image along with the goal and get back the next action
Act: execute the action, such as click/type/scroll, through an automation driver
Check: capture a new image to see whether the task is finished

The model handles only the Decide stage. Everything else is control code that you must write to be deterministic and safe.
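
The four stages above can be sketched as a driver-agnostic loop. This is a minimal sketch rather than the article's implementation: `run_agent`, `perceive`, `decide`, and `act` are hypothetical names, and the three callables stand in for your screenshot function, the model call, and the automation driver.

```python
from typing import Callable


def run_agent(
    perceive: Callable[[], bytes],
    decide: Callable[[bytes], dict],
    act: Callable[[dict], None],
    max_steps: int = 15,
) -> list:
    """Run perceive -> decide -> act until the model says done or the cap is hit."""
    history = []
    for _ in range(max_steps):
        shot = perceive()        # Perceive: capture the current screen
        action = decide(shot)    # Decide: the only stage the model handles
        history.append(action)
        if action.get("action") == "done":
            break                # the model reports that the goal is reached
        act(action)              # Act: hand the action to the driver
        # Check: the next iteration starts with a fresh capture
    return history


# Usage with scripted stand-ins, so no model or browser is needed
scripted = iter([
    {"action": "click", "x": 10, "y": 20},
    {"action": "done", "reason": "goal completed"},
])
executed = []
history = run_agent(
    perceive=lambda: b"fake-png",
    decide=lambda shot: next(scripted),
    act=executed.append,
)
print(history)
```

Passing the stages in as callables keeps the control code testable without spending tokens, since the model call can be replaced by a scripted sequence.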

Why Qwen 3.7 Plus suits computer-use agents

Qwen 3.7 Plus fits this use case for three main reasons:

It understands GUIs well enough to return coordinates you can actually click
It supports workflows that mix GUI and CLI, so it works for agents that both click and run commands
Multimodal input is priced at $0.40 per million tokens, so repeated vision calls in an agent loop stay affordable

To compare it with the high-end text-only model, read Qwen 3.7 Plus vs Max.

Step 1: Force the model to reply with a JSON action

Don't let the model reply with long explanations, because they are hard to execute. Keep the action space small and force JSON-only output.

An easy-to-use schema:

{"action": "click", "x": 120, "y": 300}
{"action": "type", "text": "hello"}
{"action": "scroll", "dy": 500}
{"action": "done", "reason": "goal completed"}

A Python example that calls Qwen 3.7 Plus through the OpenAI-compatible API:

import os
import json
import base64
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

SYSTEM = """You are a GUI agent. You see a screenshot and a goal.
Reply with ONE JSON action and nothing else:
{"action": "click", "x": <int>, "y": <int>}
{"action": "type", "text": "<string>"}
{"action": "scroll", "dy": <int>}
{"action": "done", "reason": "<string>"}
Coordinates are pixels in the screenshot you were given."""

The next_action() function takes the goal and PNG bytes and returns an action:

def next_action(goal, png_bytes):
    b64 = base64.b64encode(png_bytes).decode()

    resp = client.chat.completions.create(
        model="qwen3.7-plus",
        messages=[
            {"role": "system", "content": SYSTEM},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Goal: {goal}"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            },
        ],
    )

    return json.loads(resp.choices[0].message.content)

Before real use, check the model ID in the Model Studio documentation, because model identifiers can change.

Step 2: Build a browser agent with Playwright

Playwright drives a real browser, so the agent can act on websites directly.

Key point: set the viewport to match the screenshot size so that coordinates from the model map 1:1, with no extra scaling.

from playwright.sync_api import sync_playwright

MAX_STEPS = 15
VIEWPORT = {"width": 1280, "height": 800}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)

    page = browser.new_page(viewport=VIEWPORT)
    page.goto("https://example.com")

    goal = "Open the pricing page and find the cheapest plan"

    for step in range(MAX_STEPS):
        # the screenshot has the same size as the viewport: 1280x800
        shot = page.screenshot()

        action = next_action(goal, shot)
        print(step, action)

        if action["action"] == "done":
            print("DONE:", action.get("reason"))
            break

        if action["action"] == "click":
            page.mouse.click(action["x"], action["y"])

        elif action["action"] == "type":
            page.keyboard.type(action["text"])

        elif action["action"] == "scroll":
            page.mouse.wheel(0, action["dy"])

        else:
            raise ValueError(f"Unknown action: {action}")

        # let the UI finish loading or animating before the next screenshot
        page.wait_for_timeout(800)

    browser.close()

This is the minimal agent loop that actually works:

screenshot
ask model
execute action
repeat

For a desktop app, swap Playwright for a desktop automation driver and send a screenshot of the window or desktop instead.

Step 3: Validate the action before executing it

Don't execute the model's JSON right away without checking it. At minimum, validate the action type and the coordinate bounds first.

def validate_action(action, width=1280, height=800):
    allowed = {"click", "type", "scroll", "done"}

    if action.get("action") not in allowed:
        raise ValueError(f"Invalid action: {action}")

    if action["action"] == "click":
        x = action.get("x")
        y = action.get("y")

        if not isinstance(x, int) or not isinstance(y, int):
            raise ValueError("Click coordinates must be integers")

        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"Click out of bounds: {x}, {y}")

    if action["action"] == "type":
        if not isinstance(action.get("text"), str):
            raise ValueError("Type action requires text")

    if action["action"] == "scroll":
        if not isinstance(action.get("dy"), int):
            raise ValueError("Scroll action requires integer dy")

    return action

Use it in the loop:

action = validate_action(next_action(goal, shot))

Step 4: Control cost and reliability

Screenshots consume the most tokens. A 1280px image can use several thousand input tokens, so a 15-step loop has a real cost through the API.

Practices to adopt from the start:

Shrink the image: send the smallest image in which the model can still read the UI
Crop to the relevant area: if you know the action is inside a panel or modal, send only that part
Cap the number of steps: always use MAX_STEPS
Verify after each action: don't assume a click succeeded, let the next screenshot confirm it
Keep logs: log the action, the screenshot, and the raw response so you can debug afterwards
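
The logging practice in the list above could look like the following. It is a minimal sketch under assumptions: `log_step`, the `steps.jsonl` file name, and the record fields are all hypothetical choices, not something the article prescribes.

```python
import json
import tempfile
import time
from pathlib import Path


def log_step(log_dir, step, action, png_bytes, raw_response):
    """Save the screenshot as a PNG and append one JSON line describing the step."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # One PNG per step, so the screenshot the model saw can be replayed later
    shot_path = log_dir / f"step_{step:03d}.png"
    shot_path.write_bytes(png_bytes)

    record = {
        "step": step,
        "time": time.time(),
        "action": action,              # the parsed action that was executed
        "raw_response": raw_response,  # the unparsed model reply, for debugging
        "screenshot": shot_path.name,
    }
    with open(log_dir / "steps.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Usage: log one step into a temporary directory
log_dir = tempfile.mkdtemp()
record = log_step(
    log_dir, 0, {"action": "done", "reason": "ok"}, b"\x89PNG", '{"action": "done"}'
)
print(record["screenshot"])
```

An append-only JSON Lines file keeps each step on its own line, which makes it easy to scan a failed run in order.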

For further reading, see the guide to reducing agent token costs and the guide to agent workflow integration patterns and common mistakes.

When the agent gets stuck, fix it like this

There are three common problems:

1. The model replies with text instead of JSON

Retry once with a short prompt:

FIX_JSON_PROMPT = "Reply with valid JSON only. No markdown. No explanation."

If the reply still cannot be parsed, stop the loop and hand it to a human for review.
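
The retry-once rule can be sketched as follows. This is an illustration under assumptions: `parse_action`, `action_with_retry`, and the `ask` callable are hypothetical names. `ask(extra_prompt)` stands in for a model call that returns the raw reply text and appends the extra prompt when one is given.

```python
import json

FIX_JSON_PROMPT = "Reply with valid JSON only. No markdown. No explanation."
FENCE = "`" * 3  # a markdown code fence marker


def parse_action(raw):
    """Parse a model reply into a dict; raise ValueError if it is not a JSON object."""
    text = raw.strip()
    # Tolerate a reply wrapped in a markdown code fence
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Not valid JSON: {raw!r}") from exc
    if not isinstance(action, dict):
        raise ValueError(f"Expected a JSON object: {raw!r}")
    return action


def action_with_retry(ask):
    """Call ask(None), then retry exactly once with FIX_JSON_PROMPT on failure."""
    try:
        return parse_action(ask(None))
    except ValueError:
        # A second failure propagates so the caller can stop and escalate
        return parse_action(ask(FIX_JSON_PROMPT))


# Usage with scripted replies: the first is prose, the second is valid JSON
replies = iter(["Sure! I will click the button.", '{"action": "click", "x": 5, "y": 7}'])
prompts = []


def fake_ask(extra_prompt):
    prompts.append(extra_prompt)
    return next(replies)


action = action_with_retry(fake_ask)
print(action)
```

Letting the second failure raise keeps the stop-and-escalate decision in the calling loop instead of hiding it inside the parser.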

2. Clicks miss the target

Don't click the same coordinates again. Take a new screenshot and let the model decide again, because the UI may have changed or the coordinates may be off.

3. The loop repeats without progress

Keep the most recent actions, and stop if the same one repeats several times:

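# Before the loop: keep a short history of recent actions
recent_actions = []

# Inside the loop, after validating each action
recent_actions.append(action)
recent_actions = recent_actions[-3:]

# Three identical actions in a row means the agent is not making progress
if len(recent_actions) == 3 and recent_actions[0] == recent_actions[1] == recent_actions[2]:
    raise RuntimeError("Agent is stuck in a repeated action loop")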

Safety checklist before real use

A computer-use agent can click real things, so it needs guardrails:

Use a sandbox or a separate browser profile, never a production session
Never give the agent unrestricted access to important accounts
High-impact actions such as delete, submit, or payment must be confirmed by a human
Log every action together with its screenshot
Set a domain allowlist if the agent works on the web
Always set a timeout and MAX_STEPS
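
Two of the guardrails above, the domain allowlist and human confirmation, can be sketched with the standard library. Everything here is an assumption for illustration: `ALLOWED_DOMAINS`, `HIGH_IMPACT_WORDS`, `is_allowed_url`, and `needs_confirmation` are hypothetical. The keyword match is deliberately crude, and it assumes you supply a text description of the click target yourself, because the action schema in this article carries only coordinates.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com"}                # hypothetical allowlist
HIGH_IMPACT_WORDS = ("delete", "submit", "pay")  # hypothetical keyword list


def is_allowed_url(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def needs_confirmation(description):
    """True if a description of the target suggests a high-impact action."""
    text = description.lower()
    return any(word in text for word in HIGH_IMPACT_WORDS)


# Usage: check the current page URL and a target label before acting
print(is_allowed_url("https://example.com/pricing"))
print(is_allowed_url("https://example.com.evil.io/"))
print(needs_confirmation("Delete account"))
```

Comparing the parsed hostname, rather than searching the URL string, avoids accepting lookalike hosts that merely contain an allowed domain.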

Test agent requests with Apidog

Before wiring up Playwright, answer this question first: does the model return correct actions?

Use Apidog to:

Send sample screenshots to Qwen 3.7 Plus
Inspect the raw JSON response
Tune the system prompt until the model follows the schema consistently
Store the Model Studio key per environment
Mock the endpoint to test the loop without spending tokens every time

Once the full loop is connected, Apidog's AI agent debugger shows the sequence of each step, which makes it easier to find the action that broke the workflow.

If you want to generate UI code from an image rather than control a UI, read the guide on converting screenshots to code with Qwen 3.7 Plus.

Download Apidog to test and debug the model calls behind your agent.

FAQ

What is a computer-use agent?

Software that perceives the screen through screenshots, lets a model decide the action, executes it through an automation driver, and repeats until the goal is reached.

Can Qwen 3.7 Plus control a desktop on its own?

Not directly. The model only returns actions. You need a driver such as Playwright for browsers, or a desktop automation library for native apps.

How much does each step cost?

Most of the cost comes from the screenshot, because a single image can use several thousand input tokens at $0.40 per million tokens. Shrinking the image and capping the number of steps are therefore the main ways to control spend.
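
As a rough worked example of that arithmetic, the sketch below multiplies tokens per screenshot by the number of steps and by the quoted input price. The figure of 3,000 tokens per screenshot is an illustrative assumption, not a measured value, and `estimate_input_cost` is a hypothetical helper that ignores output tokens and the text part of the prompt.

```python
PRICE_PER_MILLION_INPUT = 0.40  # USD, the multimodal input price quoted in this article


def estimate_input_cost(tokens_per_screenshot, steps,
                        price_per_million=PRICE_PER_MILLION_INPUT):
    """Estimate the screenshot input cost in USD for one agent run."""
    total_tokens = tokens_per_screenshot * steps
    return total_tokens * price_per_million / 1_000_000


# Assumption: 3,000 input tokens per screenshot (illustrative only), 15 steps
cost = estimate_input_cost(3000, 15)
print(f"${cost:.4f}")  # prints $0.0180
```

Under these assumptions a full 15-step run costs under two cents of image input, and halving the tokens per screenshot halves that figure.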

Is it reliable enough for production?

It works for tasks with a clear scope, validation, and a check at every step. For critical systems or high-impact actions, always add human approval and a sandbox.

Do I need to scale coordinates?

No, as long as the screenshot size matches the browser viewport. If it does not, scale x and y by the ratio between the viewport and the screenshot.
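
For the mismatched case, the scaling is a per-axis ratio. A minimal sketch, assuming the screenshot was resized uniformly from the viewport with no cropping; `scale_to_viewport` is a hypothetical helper name.

```python
def scale_to_viewport(x, y, shot_size, viewport_size):
    """Map a point in screenshot pixels to viewport pixels."""
    shot_w, shot_h = shot_size
    view_w, view_h = viewport_size
    # Scale each axis by viewport / screenshot, then round to whole pixels
    return round(x * view_w / shot_w), round(y * view_h / shot_h)


# A screenshot downscaled to 640x400 for a 1280x800 viewport
print(scale_to_viewport(320, 200, (640, 400), (1280, 800)))  # prints (640, 400)
```

If the sizes already match, the function returns the point unchanged, so it is safe to call on every click.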

Conclusion

A computer-use agent is a short loop around a vision model that can return actions. Qwen 3.7 Plus offers the GUI capability and the cost profile that suit this kind of loop. What you need to implement well is the action schema, validation, the step limit, the sandbox, and logging. Then test the model requests in Apidog before letting the agent click on a real system.
