A lot of "AI agents" are just an LLM with a REST API stitched to it. I wanted to build something where the agent can't cheat, no DOM shortcuts, no hidden game state. Just a browser, a screen, and a mouse.
So I built a small agent that plays Block Champ on CrazyGames.
Repo: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent
Here's what was interesting about it, and the techniques all transfer to any canvas-based web app (whiteboards, in-browser IDEs, design tools, games).
The problem with "browser agents"
Most browser automation tutorials assume the app cooperates: meaningful HTML, readable labels, queryable buttons. Games don't. The whole board lives inside a <canvas> element, which from the DOM's point of view is just a black box.
On top of that, this game lives inside nested iframes: a CrazyGames wrapper frame, and inside it, the actual game frame. So the agent needs to:
- Land on the page
- Find the outer iframe
- Switch context into it
- Find the inner iframe
- Switch context again
- Find the canvas
- Figure out its absolute position on the screen
- Drag the mouse in global coordinates that map back to the canvas
Every step is a place where a naive script breaks.
Step 1 — Traversing nested iframes
Playwright's content_frame() is the key. You query for the iframe element as usual, then ask Playwright to give you the frame's context so you can query inside it:
await page.wait_for_selector('iframe#game-iframe', timeout=15000)
iframe_element = await page.query_selector('iframe#game-iframe')
iframe = await iframe_element.content_frame()
# And again, one level deeper
nested_iframe_element = await iframe.query_selector(
    'iframe[src*="block-champ/2/index.html"]'
)
nested_game_frame = await nested_iframe_element.content_frame()
Small gotcha: you need a wait_for_timeout between switches. Iframes load asynchronously, and if you query too early, you get None back with no obvious error.
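A more robust alternative to a fixed sleep is to poll until the frame is actually attached. A minimal sketch (the helper name and timings are mine, not from the repo) that retries until content_frame() stops returning None:

```python
import asyncio

async def resolve_frame(context, selector, timeout_s=15.0, poll_s=0.25):
    """Poll until the iframe matching `selector` exists and its content
    frame is attached. Both query_selector() and content_frame() can
    return None while the iframe is still loading, so one-shot queries
    fail silently; polling turns that into an explicit timeout."""
    deadline = asyncio.get_running_loop().time() + timeout_s
    while True:
        element = await context.query_selector(selector)
        if element is not None:
            frame = await element.content_frame()
            if frame is not None:
                return frame
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError(f"iframe {selector!r} never became ready")
        await asyncio.sleep(poll_s)
```

Because `context` can be either a Page or an already-resolved Frame, the same helper handles both levels of nesting.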
Step 2 — Canvas positioning
Once you've got the canvas, bounding_box() gives you its position inside its frame. But the mouse API operates on the page's global coordinates. So you have to add them:
canvas_box = await canvas.bounding_box()
iframe_box = await nested_iframe_element.bounding_box()
global_x = iframe_box['x'] + canvas_box['x'] + canvas_box['width'] / 2
global_y = iframe_box['y'] + canvas_box['y'] + canvas_box['height'] * 0.85
This is the thing most people miss: every iframe has its own coordinate system. Nested iframes stack those offsets. Forget to add them, and your clicks land on nothing.
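To avoid repeating that arithmetic for every click target, the offset-summing can be folded into a small helper. A sketch (the function and its argument shapes are my own; it assumes frame-relative bounding boxes as in the snippet above, and if your tool already reports page-relative boxes, the iframe offsets must be dropped):

```python
def canvas_point_to_page(iframe_boxes, canvas_box, fx, fy):
    """Map a fractional point (fx, fy in 0..1) inside the canvas to
    page-global coordinates by summing the offset of every enclosing
    iframe. Boxes are dicts with x/y/width/height, the shape that
    bounding_box() returns."""
    # Start from the point inside the canvas itself...
    x = canvas_box["x"] + canvas_box["width"] * fx
    y = canvas_box["y"] + canvas_box["height"] * fy
    # ...then stack every enclosing iframe's origin on top.
    for box in iframe_boxes:
        x += box["x"]
        y += box["y"]
    return x, y
```

The center-bottom target above becomes `canvas_point_to_page([iframe_box], canvas_box, 0.5, 0.85)`, and a third level of nesting is just one more box in the list.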
Step 3 — Humanlike mouse movement
Block Champ only registers drags that look like real drags. If you teleport the mouse from A to B, the game ignores it. You need intermediate moves:
steps = 40
for i in range(steps):
    intermediate_x = source_x + (target_x - source_x) * (i / steps)
    intermediate_y = source_y + (target_y - source_y) * (i / steps)
    await self.page.mouse.move(intermediate_x, intermediate_y)
    await asyncio.sleep(0.01)
Forty steps with a 10 ms sleep between each gives you a ~400 ms drag — smooth enough to register. The same technique works for drag-and-drop in design tools, slider inputs, and bot-detection handling in legitimate testing contexts: anywhere mouse dynamics matter.
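If constant-velocity drags ever get flagged, easing is the usual next step. A sketch (the function name is mine) that precomputes a smoothstep path, so the pointer accelerates out of the source and decelerates into the target the way a hand does:

```python
def drag_path(x0, y0, x1, y1, steps=40):
    """Intermediate points for a drag with smoothstep easing instead of
    linear interpolation; feed each point to page.mouse.move()."""
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        s = t * t * (3 - 2 * t)  # smoothstep: slow start, fast middle, slow end
        points.append((x0 + (x1 - x0) * s, y0 + (y1 - y0) * s))
    return points
```

The loop above stays the same; you just iterate over `drag_path(...)` instead of computing the linear points inline.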
Step 4 — Reading a canvas with pixels
Here's the fun part. The canvas is opaque to the DOM, but Playwright can screenshot just the canvas element and hand you raw bytes:
img_bytes = await canvas.screenshot()
img = Image.open(io.BytesIO(img_bytes))
From there, it's just PIL. For Block Champ, I divided the canvas into a 10×10 grid, sampled one pixel at the center of each cell, and classified it as empty or filled based on brightness:
grid_size = 10
width, height = img.size
cell_w = width // grid_size
cell_h = height // grid_size
threshold = 220  # empty cells are near-white in this game

pixels = img.convert("RGB").load()  # force RGB so (r, g, b) unpacking is safe
empty_count = filled_count = 0
for gx in range(grid_size):
    for gy in range(grid_size):
        px = gx * cell_w + cell_w // 2
        py = gy * cell_h + cell_h // 2
        r, g, b = pixels[px, py]
        if r > threshold and g > threshold and b > threshold:
            empty_count += 1
        else:
            filled_count += 1
Crude, but it works. And it's the pattern you'd use to let an LLM "see" any canvas-based UI: screenshot, grid, classify, pass the grid as text to the model.
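The last step of that pattern, passing the grid as text, can be as simple as an ASCII dump. A minimal sketch (the symbols and row convention are my choice):

```python
def grid_to_text(cells):
    """Serialize a 2D list of booleans (True = filled) into a compact
    ASCII board that fits in an LLM prompt: '#' for filled, '.' for empty."""
    return "\n".join(
        "".join("#" if filled else "." for filled in row) for row in cells
    )
```

Ten lines of `#` and `.` are something any text-only model can reason about, with no vision API required.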
What's still missing
The current version makes random moves. The structure is there for smarter play; choose_best_move(board_state) is a placeholder waiting for real logic.
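For reference, here's the shape such logic could take: a hypothetical greedy filler (not the repo's code; the piece and board encodings are assumptions) that tries every legal placement and keeps the one clearing the most lines:

```python
def choose_best_move(board, piece):
    """Greedy sketch: `board` is an n-by-n list of bools (True = filled),
    `piece` a list of (row, col) offsets from its top-left cell. Try every
    legal anchor and return the (row, col) whose placement completes the
    most full rows plus full columns, or None if the piece doesn't fit."""
    n = len(board)
    best, best_score = None, -1
    for r in range(n):
        for c in range(n):
            cells = [(r + dr, c + dc) for dr, dc in piece]
            if any(not (0 <= rr < n and 0 <= cc < n) or board[rr][cc]
                   for rr, cc in cells):
                continue  # out of bounds or overlapping an existing block
            trial = [row[:] for row in board]
            for rr, cc in cells:
                trial[rr][cc] = True
            # Score = completed rows + completed columns after placement.
            score = (sum(all(trial[i]) for i in range(n))
                     + sum(all(trial[i][j] for i in range(n)) for j in range(n)))
            if score > best_score:
                best, best_score = (r, c), score
    return best
```

It's brute force, but a 10×10 board is small enough that exhaustive search per move is effectively free.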
Next steps I'm exploring:
- Send the board state to an LLM as structured text and let the model pick moves
- Use a small local vision model to classify cell shapes instead of just "empty/filled"
- Run it headless on a schedule and track scores over time
Why this matters beyond a game
The exact same pattern — traverse → locate → screenshot → classify → act — is how you'd build agents for:
- Figma/Miro automation
- In-browser CAD or music tools
- Any legacy internal tool that renders to canvas
- Visual testing of rich web apps
The DOM is no longer the only API for the web. If you can take a screenshot and send a mouse event, you can build an agent.
Repo with full code: https://github.com/Hassan-Naeem-code/Browser-Operating-Agent
If you're building agents or doing browser automation, I'd love to hear what you're working on. Drop a comment or find me on GitHub.