Hassan Naeem

I built a browser agent that plays a game by looking at pixels

A lot of "AI agents" are just an LLM with a REST API stitched to it. I wanted to build something where the agent can't cheat: no DOM shortcuts, no hidden game state. Just a browser, a screen, and a mouse.

So I built a small agent that plays Block Champ on CrazyGames.

Repo: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent

Here's what was interesting about it, and the techniques all transfer to any canvas-based web app (whiteboards, in-browser IDEs, design tools, games).

The problem with "browser agents"

Most browser automation tutorials assume the app cooperates: meaningful HTML, readable labels, queryable buttons. Games don't. The whole board lives inside a <canvas> element, which from the DOM's point of view is just a black box.

On top of that, this game lives inside nested iframes, a CrazyGames wrapper frame, and inside it, the actual game frame. So the agent needs to:

  1. Land on the page
  2. Find the outer iframe
  3. Switch context into it
  4. Find the inner iframe
  5. Switch context again
  6. Find the canvas
  7. Figure out its absolute position on the screen
  8. Drag the mouse in global coordinates that map back to the canvas

Every step is a place where a naive script breaks.

Step 1 — Traversing nested iframes

Playwright's content_frame() is the key. You query for the iframe element as usual, then ask Playwright to give you the frame's context so you can query inside it:

await page.wait_for_selector('iframe#game-iframe', timeout=15000)
iframe_element = await page.query_selector('iframe#game-iframe')
iframe = await iframe_element.content_frame()

# And again, one level deeper
nested_iframe_element = await iframe.query_selector(
    'iframe[src*="block-champ/2/index.html"]'
)
nested_game_frame = await nested_iframe_element.content_frame()

Small gotcha: iframes load asynchronously, so you need a short page.wait_for_timeout() between switches. Query too early and you get None back with no obvious error.
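A fixed sleep works, but since content_frame() just returns None while the frame is still loading, polling is more robust. Here's a small helper sketch (wait_for_frame is my name, not something from the repo):

```python
import asyncio

async def wait_for_frame(get_frame, attempts=10, delay=0.5):
    """Poll until get_frame() returns a non-None frame.

    Playwright's content_frame() silently returns None while the
    iframe is still loading, so retrying beats a fixed sleep.
    """
    for _ in range(attempts):
        frame = await get_frame()
        if frame is not None:
            return frame
        await asyncio.sleep(delay)
    raise TimeoutError("iframe never produced a content frame")
```

Usage would look like `iframe = await wait_for_frame(iframe_element.content_frame)`, once per nesting level.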

Step 2 — Canvas positioning

Once you've got the canvas, bounding_box() gives you its position inside its frame. But the mouse API operates on the page's global coordinates. So you have to add them:

canvas_box = await canvas.bounding_box()
iframe_box = await nested_iframe_element.bounding_box()

global_x = iframe_box['x'] + canvas_box['x'] + canvas_box['width'] / 2
global_y = iframe_box['y'] + canvas_box['y'] + canvas_box['height'] * 0.85

This is the thing most people miss: every iframe has its own coordinate system. Nested iframes stack those offsets. Forget to add them, and your clicks land on nothing.
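That offset arithmetic generalizes: walk the chain of bounding boxes from the outermost iframe inward and sum the offsets. A minimal sketch (the helper name and the fractional-point interface are my own, not the repo's):

```python
def canvas_point_to_global(frame_boxes, canvas_box, fx, fy):
    """Map a fractional point (fx, fy in 0..1) inside the canvas to
    global page coordinates by summing the offsets of every
    enclosing iframe's bounding box (outermost first)."""
    x = canvas_box["x"] + canvas_box["width"] * fx
    y = canvas_box["y"] + canvas_box["height"] * fy
    for box in frame_boxes:
        x += box["x"]
        y += box["y"]
    return x, y
```

With one enclosing iframe, `canvas_point_to_global([iframe_box], canvas_box, 0.5, 0.85)` reproduces the global_x/global_y computed above, and adding a second box to the list handles a deeper nesting level with no other change.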

Step 3 — Humanlike mouse movement

Block Champ only registers drags that look like real drags. If you teleport the mouse from A to B, the game ignores it. You need intermediate moves:

steps = 40
for i in range(1, steps + 1):
    # i runs up to `steps` so the final move lands exactly on the target
    intermediate_x = source_x + (target_x - source_x) * (i / steps)
    intermediate_y = source_y + (target_y - source_y) * (i / steps)
    await self.page.mouse.move(intermediate_x, intermediate_y)
    await asyncio.sleep(0.01)

Forty steps with a 10ms sleep between each gives you a ~400ms drag — smooth enough to register. This same technique works for bot-detection bypass in legitimate testing contexts, drag-and-drop in design tools, slider inputs, anywhere mouse dynamics matter.
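Linear interpolation is enough for Block Champ, but if a game checks velocity profiles, an eased path looks more natural: slow at the ends, fast in the middle. A sketch of pure path generation (feeding the points to mouse.move is the same loop as above; note Playwright's mouse.move also accepts its own steps parameter for linear interpolation):

```python
def ease_in_out(t):
    """Smoothstep easing: slow start, fast middle, slow end —
    closer to how a human wrist moves than constant velocity."""
    return t * t * (3 - 2 * t)

def drag_path(src, dst, steps=40):
    """Yield intermediate (x, y) points from src to dst,
    finishing exactly on dst."""
    sx, sy = src
    dx, dy = dst
    for i in range(steps):
        t = ease_in_out((i + 1) / steps)
        yield (sx + (dx - sx) * t, sy + (dy - sy) * t)
```

Swapping the easing function is a one-line change, which makes it easy to experiment with whatever a given game's input handling expects.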

Step 4 — Reading a canvas with pixels

Here's the fun part. The canvas is opaque to the DOM, but Playwright can screenshot just the canvas element and hand you raw bytes:

img_bytes = await canvas.screenshot()
img = Image.open(io.BytesIO(img_bytes))

From there, it's just PIL. For Block Champ, I divided the canvas into a 10×10 grid, sampled one pixel at the center of each cell, and classified it as empty or filled based on brightness:

img = img.convert("RGB")  # drop alpha so each pixel is a plain (r, g, b)
pixels = img.load()
width, height = img.size

grid_size = 10
cell_w = width // grid_size
cell_h = height // grid_size
threshold = 220  # empty cells are near-white in this game
empty_count = filled_count = 0

for gx in range(grid_size):
    for gy in range(grid_size):
        px = gx * cell_w + cell_w // 2
        py = gy * cell_h + cell_h // 2
        r, g, b = pixels[px, py]
        if r > threshold and g > threshold and b > threshold:
            empty_count += 1
        else:
            filled_count += 1

Crude, but it works. And it's the pattern you'd use to let an LLM "see" any canvas-based UI: screenshot, grid, classify, pass the grid as text to the model.
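The last step of that pattern, passing the grid as text, can be as simple as this (an illustrative helper, not code from the repo):

```python
def grid_to_text(grid):
    """Render a 2D boolean board as compact text an LLM can parse:
    '#' for filled cells, '.' for empty ones, one row per line."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in grid
    )
```

A 10×10 board becomes a ten-line block of # and . characters, which is cheap in tokens and unambiguous for a model to reason about.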

What's still missing

The current version makes random moves. The structure is there for smarter play; choose_best_move(board_state) is a placeholder waiting for real logic.
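For illustration, here's one shape that logic could take: a greedy that tries every legal placement and keeps the one completing the most lines. The board representation (10×10 booleans) and piece format (a list of (dx, dy) cell offsets) are my assumptions for this sketch, not the repo's actual types:

```python
GRID = 10

def fits(board, piece, ox, oy):
    """True if every cell of `piece` lands on an empty board cell."""
    for dx, dy in piece:
        x, y = ox + dx, oy + dy
        if not (0 <= x < GRID and 0 <= y < GRID) or board[y][x]:
            return False
    return True

def lines_cleared(board, piece, ox, oy):
    """Count full rows/columns after hypothetically placing the piece."""
    b = [row[:] for row in board]
    for dx, dy in piece:
        b[oy + dy][ox + dx] = True
    full_rows = sum(all(row) for row in b)
    full_cols = sum(all(b[y][x] for y in range(GRID)) for x in range(GRID))
    return full_rows + full_cols

def choose_best_move(board, piece):
    """Greedy: among all legal placements, prefer the one that
    completes the most lines; returns (ox, oy) or None."""
    best, best_score = None, -1
    for oy in range(GRID):
        for ox in range(GRID):
            if fits(board, piece, ox, oy):
                score = lines_cleared(board, piece, ox, oy)
                if score > best_score:
                    best, best_score = (ox, oy), score
    return best
```

Greedy is far from optimal (it happily leaves unfillable holes), but it's a baseline to compare the LLM-driven version against.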

Next steps I'm exploring:

  • Send the board state to an LLM as structured text and let the model pick moves
  • Use a small local vision model to classify cell shapes instead of just "empty/filled"
  • Run it headless on a schedule and track scores over time

Why this matters beyond a game

The exact same pattern (traverse → locate → screenshot → classify → act) is how you'd build agents for:

  • Figma/Miro automation
  • In-browser CAD or music tools
  • Any legacy internal tool that renders to canvas
  • Visual testing of rich web apps

The DOM is no longer the only API for the web. If you can take a screenshot and send a mouse event, you can build an agent.

Repo with full code: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent

If you're building agents or doing browser automation, I'd love to hear what you're working on. Drop a comment or find me on GitHub.
