Hassan Naeem

I built a browser agent that plays a game by looking at pixels

A lot of "AI agents" are just an LLM with a REST API stitched to it. I wanted to build something where the agent can't cheat: no DOM shortcuts, no hidden game state. Just a browser, a screen, and a mouse.

So I built a small agent that plays Block Champ on CrazyGames.

Repo: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent

Here's what was interesting about it, and the techniques all transfer to any canvas-based web app (whiteboards, in-browser IDEs, design tools, games).

The problem with "browser agents"

Most browser automation tutorials assume the app cooperates: meaningful HTML, readable labels, queryable buttons. Games don't. The whole board lives inside a <canvas> element, which from the DOM's point of view is just a black box.

On top of that, this game lives inside nested iframes, a CrazyGames wrapper frame, and inside it, the actual game frame. So the agent needs to:

  1. Land on the page
  2. Find the outer iframe
  3. Switch context into it
  4. Find the inner iframe
  5. Switch context again
  6. Find the canvas
  7. Figure out its absolute position on the screen
  8. Drag the mouse in global coordinates that map back to the canvas

Every step is a place where a naive script breaks.

Step 1 — Traversing nested iframes

Playwright's content_frame() is the key. You query for the iframe element as usual, then ask Playwright to give you the frame's context so you can query inside it:

await page.wait_for_selector('iframe#game-iframe', timeout=15000)
iframe_element = await page.query_selector('iframe#game-iframe')
iframe = await iframe_element.content_frame()

# And again, one level deeper
nested_iframe_element = await iframe.query_selector(
    'iframe[src*="block-champ/2/index.html"]'
)
nested_game_frame = await nested_iframe_element.content_frame()

Small gotcha: iframes load asynchronously, so you need a short page.wait_for_timeout() between switches. Query too early and you get None back with no obvious error.
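A fixed sleep works, but since content_frame() just returns None while the frame is still loading, polling is more robust. Here's a small helper sketch (wait_for_frame is my name, not something from the repo):

```python
import asyncio

async def wait_for_frame(get_frame, attempts=10, delay=0.5):
    """Poll until get_frame() returns a non-None frame.

    Playwright's content_frame() silently returns None while the
    iframe is still loading, so retrying beats a fixed sleep.
    """
    for _ in range(attempts):
        frame = await get_frame()
        if frame is not None:
            return frame
        await asyncio.sleep(delay)
    raise TimeoutError("iframe never produced a content frame")
```

Usage would look like `iframe = await wait_for_frame(iframe_element.content_frame)`, once per nesting level.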

Step 2 — Canvas positioning

Once you've got the canvas, bounding_box() gives you its position inside its frame. But the mouse API operates on the page's global coordinates. So you have to add them:

canvas_box = await canvas.bounding_box()
iframe_box = await nested_iframe_element.bounding_box()

global_x = iframe_box['x'] + canvas_box['x'] + canvas_box['width'] / 2
global_y = iframe_box['y'] + canvas_box['y'] + canvas_box['height'] * 0.85

This is the thing most people miss: every iframe has its own coordinate system. Nested iframes stack those offsets. Forget to add them, and your clicks land on nothing.
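That offset arithmetic generalizes: walk the chain of bounding boxes from the outermost iframe inward and sum the offsets. A minimal sketch (the helper name and the fractional-point interface are my own, not the repo's):

```python
def canvas_point_to_global(frame_boxes, canvas_box, fx, fy):
    """Map a fractional point (fx, fy in 0..1) inside the canvas to
    global page coordinates by summing the offsets of every
    enclosing iframe's bounding box (outermost first)."""
    x = canvas_box["x"] + canvas_box["width"] * fx
    y = canvas_box["y"] + canvas_box["height"] * fy
    for box in frame_boxes:
        x += box["x"]
        y += box["y"]
    return x, y
```

With one enclosing iframe, `canvas_point_to_global([iframe_box], canvas_box, 0.5, 0.85)` reproduces the global_x/global_y computed above, and adding a second box to the list handles a deeper nesting level with no other change.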

Step 3 — Humanlike mouse movement

Block Champ only registers drags that look like real drags. If you teleport the mouse from A to B, the game ignores it. You need intermediate moves:

steps = 40
for i in range(1, steps + 1):
    # i runs up to `steps` so the final move lands exactly on the target
    intermediate_x = source_x + (target_x - source_x) * (i / steps)
    intermediate_y = source_y + (target_y - source_y) * (i / steps)
    await self.page.mouse.move(intermediate_x, intermediate_y)
    await asyncio.sleep(0.01)

Forty steps with a 10ms sleep between each gives you a ~400ms drag — smooth enough to register. This same technique works for bot-detection bypass in legitimate testing contexts, drag-and-drop in design tools, slider inputs, anywhere mouse dynamics matter.
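Linear interpolation is enough for Block Champ, but if a game checks velocity profiles, an eased path looks more natural: slow at the ends, fast in the middle. A sketch of pure path generation (feeding the points to mouse.move is the same loop as above; note Playwright's mouse.move also accepts its own steps parameter for linear interpolation):

```python
def ease_in_out(t):
    """Smoothstep easing: slow start, fast middle, slow end —
    closer to how a human wrist moves than constant velocity."""
    return t * t * (3 - 2 * t)

def drag_path(src, dst, steps=40):
    """Yield intermediate (x, y) points from src to dst,
    finishing exactly on dst."""
    sx, sy = src
    dx, dy = dst
    for i in range(steps):
        t = ease_in_out((i + 1) / steps)
        yield (sx + (dx - sx) * t, sy + (dy - sy) * t)
```

Swapping the easing function is a one-line change, which makes it easy to experiment with whatever a given game's input handling expects.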

Step 4 — Reading a canvas with pixels

Here's the fun part. The canvas is opaque to the DOM, but Playwright can screenshot just the canvas element and hand you raw bytes:

img_bytes = await canvas.screenshot()
img = Image.open(io.BytesIO(img_bytes))

From there, it's just PIL. For Block Champ, I divided the canvas into a 10×10 grid, sampled one pixel at the center of each cell, and classified it as empty or filled based on brightness:

img = img.convert("RGB")  # drop alpha so each pixel is a plain (r, g, b)
pixels = img.load()
width, height = img.size

grid_size = 10
cell_w = width // grid_size
cell_h = height // grid_size
threshold = 220  # empty cells are near-white in this game
empty_count = filled_count = 0

for gx in range(grid_size):
    for gy in range(grid_size):
        px = gx * cell_w + cell_w // 2
        py = gy * cell_h + cell_h // 2
        r, g, b = pixels[px, py]
        if r > threshold and g > threshold and b > threshold:
            empty_count += 1
        else:
            filled_count += 1

Crude, but it works. And it's the pattern you'd use to let an LLM "see" any canvas-based UI: screenshot, grid, classify, pass the grid as text to the model.
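The last step of that pattern, passing the grid as text, can be as simple as this (an illustrative helper, not code from the repo):

```python
def grid_to_text(grid):
    """Render a 2D boolean board as compact text an LLM can parse:
    '#' for filled cells, '.' for empty ones, one row per line."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in grid
    )
```

A 10×10 board becomes a ten-line block of # and . characters, which is cheap in tokens and unambiguous for a model to reason about.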

What's still missing

The current version makes random moves. The structure is there for smarter play; choose_best_move(board_state) is a placeholder waiting for real logic.
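For illustration, here's one shape that logic could take: a greedy that tries every legal placement and keeps the one completing the most lines. The board representation (10×10 booleans) and piece format (a list of (dx, dy) cell offsets) are my assumptions for this sketch, not the repo's actual types:

```python
GRID = 10

def fits(board, piece, ox, oy):
    """True if every cell of `piece` lands on an empty board cell."""
    for dx, dy in piece:
        x, y = ox + dx, oy + dy
        if not (0 <= x < GRID and 0 <= y < GRID) or board[y][x]:
            return False
    return True

def lines_cleared(board, piece, ox, oy):
    """Count full rows/columns after hypothetically placing the piece."""
    b = [row[:] for row in board]
    for dx, dy in piece:
        b[oy + dy][ox + dx] = True
    full_rows = sum(all(row) for row in b)
    full_cols = sum(all(b[y][x] for y in range(GRID)) for x in range(GRID))
    return full_rows + full_cols

def choose_best_move(board, piece):
    """Greedy: among all legal placements, prefer the one that
    completes the most lines; returns (ox, oy) or None."""
    best, best_score = None, -1
    for oy in range(GRID):
        for ox in range(GRID):
            if fits(board, piece, ox, oy):
                score = lines_cleared(board, piece, ox, oy)
                if score > best_score:
                    best, best_score = (ox, oy), score
    return best
```

Greedy is far from optimal (it happily leaves unfillable holes), but it's a baseline to compare the LLM-driven version against.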

Next steps I'm exploring:

  • Send the board state to an LLM as structured text and let the model pick moves
  • Use a small local vision model to classify cell shapes instead of just "empty/filled"
  • Run it headless on a schedule and track scores over time

Why this matters beyond a game

The exact same pattern (traverse → locate → screenshot → classify → act) is how you'd build agents for:

  • Figma/Miro automation
  • In-browser CAD or music tools
  • Any legacy internal tool that renders to canvas
  • Visual testing of rich web apps

The DOM is no longer the only API for the web. If you can take a screenshot and send a mouse event, you can build an agent.

Repo with full code: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent

If you're building agents or doing browser automation, I'd love to hear what you're working on. Drop a comment or find me on GitHub.
