Hector Flores

Posted on Jun 11 • Edited on Jun 16 • Originally published at htek.dev

I Replaced Playwright With Raw CDP

#ai #devex #testing #automation

The Agent Made a Better Call Than I Would Have

I was building a responsive design testing pipeline for a client project. The goal was simple: capture screenshots of every page section at 11 viewport sizes, feed them to an AI vision model, get a structured report of what's broken.

I handed the task to an agent and expected Playwright. It's the obvious choice — well-documented, clean API, every tutorial defaults to it. The agent had a different idea.

It reached for raw Chrome DevTools Protocol over WebSocket. No Playwright, no Puppeteer — just JSON-RPC messages sent directly to Chrome. When I dug into why, the answer was immediate: Playwright was failing to resize the browser window correctly at certain viewport dimensions. Direct Emulation.setDeviceMetricsOverride via CDP handled it cleanly. No abstraction layer fighting against you. Just a direct instruction to the browser.

I kept it.

That wasn't even the interesting part. What the agent built next — the approach it invented for getting AI to analyze multiple screenshots — turned out to be a general pattern I hadn't encountered before. I've started calling it compaction.

The Responsive Testing Problem

Manual responsive testing is one of those things that sounds manageable until you try to do it systematically. Eleven viewport sizes across a multi-section page with a password gate. That's potentially hundreds of screenshots. Reviewing them by hand isn't a workflow; it's a punishment.

You could automate the comparison with perceptual diff tools like Chromatic or Percy, but those require baseline screenshots and tell you that something changed — not whether the layout is actually correct. A broken layout you've never seen before passes as "no regression."

What I wanted was something different: an AI that could look at a layout and say "this section is cropped at 390px, that column collapses wrong at 768px, this text is illegible on ultrawide." Natural language, structural, semantic feedback — not a pixel diff.

The challenge was getting that feedback efficiently.

Why CDP and Not Playwright

The Chrome DevTools Protocol is the actual wire protocol underneath Chrome-based browser automation. Playwright translates high-level method calls into CDP messages for Chromium. So does Puppeteer. Selenium's DevTools integration does the same.

Going raw means connecting directly via WebSocket to a Chrome instance launched with --remote-debugging-port, then firing JSON-RPC commands yourself:

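// Connect to Chrome.
// CDPClient is the tool's own thin WebSocket wrapper, not an npm package;
// target is one entry from Chrome's http://127.0.0.1:<port>/json/list response.
const client = new CDPClient(target.webSocketDebuggerUrl);
await client.connect();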

// Set viewport — direct, no Playwright wrapper
await client.send('Emulation.setDeviceMetricsOverride', {
  width: 390,
  height: 844,
  deviceScaleFactor: 3,
  mobile: true,
  screenOrientation: { angle: 0, type: 'portraitPrimary' },
});

// Capture screenshot
const shot = await client.send('Page.captureScreenshot', {
  format: 'png',
  fromSurface: true,
  captureBeyondViewport: false,
});

The only runtime requirement is Node.js 22+, which has a stable built-in WebSocket global. The tool has exactly one npm dependency: sharp, for image compositing. Everything else is Node built-ins.
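For readers wondering what a wrapper like CDPClient involves, here is a minimal sketch. It is my own assumption of how such a class could look, not the tool's actual source: it matches requests to responses by id and ignores CDP events. The injectable `WebSocketImpl` parameter is hypothetical and exists only so the class can be exercised without a running browser.

```javascript
// Minimal CDP client sketch. The class name mirrors the snippet above, but
// this implementation is an assumption, not the tool's real code.
class CDPClient {
  // WebSocketImpl is a hypothetical injection point; it defaults to the
  // WebSocket global that Node.js 22+ provides.
  constructor(url, WebSocketImpl = globalThis.WebSocket) {
    this.url = url;
    this.WebSocketImpl = WebSocketImpl;
    this.nextId = 1;
    this.pending = new Map(); // request id -> { resolve, reject }
  }

  connect() {
    return new Promise((resolve, reject) => {
      this.ws = new this.WebSocketImpl(this.url);
      this.ws.addEventListener('open', () => resolve());
      this.ws.addEventListener('error', () => reject(new Error('CDP socket error')));
      this.ws.addEventListener('message', (event) => this.onMessage(event.data));
    });
  }

  onMessage(raw) {
    const msg = JSON.parse(raw);
    // Messages without a matching id are events (for example
    // Page.loadEventFired); this sketch simply drops them.
    const waiter = this.pending.get(msg.id);
    if (!waiter) return;
    this.pending.delete(msg.id);
    if (msg.error) waiter.reject(new Error(msg.error.message));
    else waiter.resolve(msg.result);
  }

  // Sends one command and resolves with the `result` of the matching response.
  send(method, params = {}) {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.ws.send(JSON.stringify({ id, method, params }));
    });
  }

  close() {
    this.ws.close();
  }
}
```

The `webSocketDebuggerUrl` passed to the constructor comes from the target list that Chrome serves over HTTP on the debugging port (`/json/list`). A fuller client would also dispatch events and add timeouts; this sketch skips both to keep the request/response matching visible.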

There's something clarifying about working at this level. You stop debugging "why is Playwright doing X" and start reasoning directly about what Chrome is doing. When viewport resizing wasn't behaving, there was no abstraction to blame and nowhere to hide — which made the fix obvious.
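To show how those two commands chain across viewports, here is a hedged sketch of a capture loop. The function name `captureViewports` and the `VIEWPORTS` list are hypothetical. Only the iPhone numbers come from the example above; the slugs and sizes of the other two entries appear later in this post, while their scale factors and mobile flags are my assumptions. `client` is anything exposing `send(method, params)`.

```javascript
// Hypothetical viewport list. Scale factors and mobile flags for the last
// two entries are assumptions made for illustration.
const VIEWPORTS = [
  { slug: 'iphone14-portrait', width: 390, height: 844, deviceScaleFactor: 3, mobile: true },
  { slug: 'ipad-portrait', width: 768, height: 1024, deviceScaleFactor: 2, mobile: true },
  { slug: 'ultrawide-3440x1440', width: 3440, height: 1440, deviceScaleFactor: 1, mobile: false },
];

// Captures one PNG per viewport and returns a Map of slug -> Buffer.
async function captureViewports(client, viewports) {
  const shots = new Map();
  // Sequential on purpose: the metrics override is state on the page target,
  // so parallel captures against one target would overwrite each other.
  for (const vp of viewports) {
    await client.send('Emulation.setDeviceMetricsOverride', {
      width: vp.width,
      height: vp.height,
      deviceScaleFactor: vp.deviceScaleFactor,
      mobile: vp.mobile,
    });
    // Page.captureScreenshot returns the image as a base64 string in `data`.
    const { data } = await client.send('Page.captureScreenshot', { format: 'png' });
    shots.set(vp.slug, Buffer.from(data, 'base64'));
  }
  // Restore the page's natural metrics once all captures are done.
  await client.send('Emulation.clearDeviceMetricsOverride');
  return shots;
}
```

A real run would probably wait for layout to settle between the resize and the capture; this sketch leaves that out.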

Raw CDP architecture: Node.js speaks JSON-RPC directly to Chrome, with no Playwright in between and only one npm dependency (sharp); everything else, including node:fs, is a Node built-in

The Compaction Insight

Here's where it gets interesting.

The naive approach to AI visual validation is: one screenshot per viewport, one vision API call per screenshot, aggregate results. For 11 viewports across 8 sections, that's 88 API calls. That's slow, expensive, and you lose something important: the ability to compare layouts side by side.

The agent built something smarter. For each page section, it composites all 11 viewport screenshots into a single labeled grid image using sharp:

┌─────────────────────────────────────────────────────────────┐
│  section-02 (hero) · Homepage Hero                          │
├──────────────────┬──────────────────┬──────────────────┬───┤
│ iphone14-portrait│ android-360x800  │ ipad-portrait    │...│
│  390×844         │ 360×800          │ 768×1024         │   │
│ [screenshot]     │ [screenshot]     │ [screenshot]     │   │
└──────────────────┴──────────────────┴──────────────────┴───┘

Each cell has a header strip showing the viewport slug and exact dimensions. The top banner shows the section ID and label. Everything the AI needs to orient itself is embedded in the image.
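One way to produce a header strip like that is to render the label as a small SVG, since sharp accepts an SVG buffer as a composite input. The sketch below is an assumption about how it could be done, not the tool's code; `headerStripSvg`, `escapeXml`, the colors, and the font settings are all hypothetical.

```javascript
// Escapes the characters that would break SVG/XML markup.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Builds a header strip showing the viewport slug and its exact dimensions.
// Colors, font, and strip height are illustrative choices.
function headerStripSvg({ slug, width, height }, stripWidth, stripHeight = 40) {
  const label = escapeXml(`${slug}  ${width}×${height}`);
  const baseline = Math.round(stripHeight * 0.65);
  return [
    `<svg xmlns="http://www.w3.org/2000/svg" width="${stripWidth}" height="${stripHeight}">`,
    `<rect width="100%" height="100%" fill="#111"/>`,
    `<text x="8" y="${baseline}" font-family="monospace" font-size="16" fill="#fff">${label}</text>`,
    `</svg>`,
  ].join('');
}
```

Passing `Buffer.from(headerStripSvg(viewport, cellWidth))` as the `input` of a sharp composite layer draws the strip above its screenshot. High-contrast text at a generous size matters here, because the vision model has to read these labels back.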

One image. One vision call. 11 viewports analyzed together.

That's the compaction. Instead of asking the AI to be precise about pixel coordinates across dozens of separate images, you compact everything into a single reference frame where the labels are the coordinates.

A vision model can describe an image in natural language, but it has a hard time being precise about positioning. Compacting all the views into one image, each with its own text label, works around that: the AI sees every layout at once and can tie each observation to a label.

The math works out too: one call per section instead of one per (section × viewport). For 8 sections, that's 8 calls instead of 88, an 11× reduction, with better analysis quality because the model is comparing layouts in context rather than evaluating each in isolation.
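The grid itself is mostly arithmetic. Here is a rough sketch of the layout step, under assumed dimensions: `planMosaic` and every default (4 columns, 480×640 cells, 40px header strips, a 56px banner, 8px gaps) are hypothetical, chosen only to make the numbers concrete.

```javascript
// Computes where each header strip and screenshot lands in the mosaic.
// All dimensions are illustrative defaults, not the tool's real values.
function planMosaic(cells, {
  columns = 4,
  cellWidth = 480,
  cellHeight = 640,
  headerHeight = 40,
  bannerHeight = 56,
  gap = 8,
} = {}) {
  const rows = Math.ceil(cells.length / columns);
  const slotHeight = headerHeight + cellHeight; // header strip + screenshot
  const placements = cells.map((cell, i) => {
    const col = i % columns;
    const row = Math.floor(i / columns);
    const left = gap + col * (cellWidth + gap);
    const top = bannerHeight + gap + row * (slotHeight + gap);
    return {
      slug: cell.slug,
      header: { left, top },
      image: { left, top: top + headerHeight },
    };
  });
  return {
    width: gap + columns * (cellWidth + gap),
    height: bannerHeight + gap + rows * (slotHeight + gap),
    placements,
  };
}
```

With sharp, the plan could then drive a blank canvas created via `sharp({ create: { width, height, channels: 3, background: '#fff' } })` followed by `.composite(layers)`, where each layer is `{ input, left, top }` and each screenshot has first been resized to fit its cell. Eleven cells in four columns gives three rows, with the last row one cell short.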

The mosaic compaction pattern — 11 viewports → 1 labeled grid → 1 AI vision call. The cell headers are the coordinates. 11× reduction in API calls.

The Label→Mapping Loop

The output structure is what makes this a pattern rather than a one-off hack.

The vision prompt asks for strict JSON keyed by viewport slug:

{
  "section_id": "section-02",
  "viewport_results": {
    "iphone14-portrait": {
      "status": "ok",
      "issues": []
    },
    "ultrawide-3440x1440": {
      "status": "fail",
      "issues": [{
        "type": "empty_space",
        "severity": "high",
        "description": "Content occupies ~30% of horizontal space at 3440px — missing max-width constraint.",
        "suggested_css": "@media (min-width: 2400px) { .hero { max-width: 1800px; margin: 0 auto; } }"
      }]
    }
  }
}

The labels in the mosaic header become the keys in the output JSON. No post-processing, no coordinate math, no trying to figure out what the AI "meant" — the structure maps directly to the input labels.

That's the loop: you label your inputs, the AI returns findings indexed by those labels. Structured output from unstructured visual analysis.
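Sketched in code, the loop has two halves: build the prompt from the labels, then check that the reply came back keyed by those same labels. The function names and the prompt wording below are my assumptions; the JSON shape follows the example above. The check is a guard for the case where a reply drops or renames a key, not a transformation of the output.

```javascript
// Builds a prompt that names the exact keys the reply must use.
// The wording is illustrative, not the tool's actual prompt.
function buildVisionPrompt(sectionId, slugs) {
  return [
    `The image is a labeled grid of section "${sectionId}" rendered at ${slugs.length} viewports.`,
    'Each cell header shows the viewport slug and its dimensions.',
    'Respond with strict JSON only, in this shape:',
    '{"section_id": string, "viewport_results": {"<slug>": {"status": "ok"|"fail", "issues": [...]}}}',
    `Use exactly these keys in viewport_results: ${slugs.join(', ')}`,
  ].join('\n');
}

// Parses the reply, verifies the keys match the labels, and returns only
// the failing viewports.
function parseVisionReply(raw, sectionId, slugs) {
  const reply = JSON.parse(raw);
  if (reply.section_id !== sectionId) {
    throw new Error(`unexpected section_id: ${reply.section_id}`);
  }
  const results = reply.viewport_results ?? {};
  const missing = slugs.filter((slug) => !(slug in results));
  const unknown = Object.keys(results).filter((key) => !slugs.includes(key));
  if (missing.length || unknown.length) {
    throw new Error(`label mismatch; missing=[${missing}] unknown=[${unknown}]`);
  }
  return slugs
    .filter((slug) => results[slug].status === 'fail')
    .map((slug) => ({ slug, issues: results[slug].issues }));
}
```

Because the same `slugs` array feeds the mosaic headers, the prompt, and the check, a mismatch surfaces as an error instead of a silently incomplete report.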

The label→mapping loop: viewport slugs in the mosaic become keys in the JSON — no post-processing, no coordinate math, direct structured output from visual analysis

What It Actually Caught

I ran this pipeline on the SurgiQuip proposal page — a password-gated, multi-section client site I'd been building.

The result: it caught everything. Every single thing.

Every layout break I'd missed during development, every section that needed max-width handling at wide viewports, every place where the responsive grid didn't collapse cleanly. After re-running the resizes based on the AI's CSS suggestions, every aspect ratio worked.

The AI suggestions aren't a push-button fix — they're a starting point that still needs a human review before applying. "Looks right in the mosaic" isn't the same as "verified in a real browser." But as a first-pass audit that catches structural problems before a client sees them, it's genuinely remarkable.

This is exactly the kind of AI-augmented QA pattern that doesn't replace human judgment — it surfaces what human eyes would miss.

Where Else This Pattern Applies

When I stepped back after the build, the obvious question was: is this just for responsive testing? Honestly, no.

"You can do all kinds of stuff with this pattern. I found that fascinating." That's what I keep coming back to.

The compaction pattern solves a general problem: how do you get structured AI feedback across multiple visual states without making N separate API calls?

A few directions this applies:

Multi-state UI comparison. Composite "empty", "loading", "populated", "error" states of the same component side by side. Ask AI: "Which states have accessibility issues?" One call, structured answer.

Before/after design diffs. Instead of perceptual diffs, composite old vs. new side by side and ask AI: "What changed? Is any change unintentional?" Semantic diff instead of pixel diff.

Cross-browser visual regression. Same page, Chrome vs. Firefox vs. Safari, composited. The AI spots the same rendering inconsistencies a diff would catch, and it also tells you what kind of inconsistency each one is.

The key in all cases: labels in the mosaic become keys in the output JSON. You control the structure by controlling the labels.

The Honest Limits

This pipeline requires Chrome running locally with --remote-debugging-port. It doesn't run in a standard CI environment out of the box — you'd need headless Chrome configured to accept CDP connections, which is possible but not the default GitHub Actions setup.
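For anyone who wants to try this in CI, the missing piece is roughly "start headless Chrome with the debugging port open, then wait until it answers." The sketch below is untested against any particular CI provider. `chromeArgs`, `waitForCdp`, the port, and the profile directory are hypothetical; the flags and the `/json/version` endpoint are standard Chrome.

```javascript
// Command-line flags for a headless Chrome that accepts CDP connections.
// The port and profile directory are illustrative defaults.
function chromeArgs({ port = 9222, userDataDir = '/tmp/cdp-profile' } = {}) {
  return [
    '--headless=new',
    `--remote-debugging-port=${port}`,
    `--user-data-dir=${userDataDir}`,
    '--no-first-run',
    'about:blank',
  ];
}

// Polls Chrome's HTTP endpoint until the debugger is reachable, then returns
// the browser-level WebSocket URL. fetchImpl is injectable for testing.
async function waitForCdp(port, { fetchImpl = fetch, attempts = 20, delayMs = 250 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchImpl(`http://127.0.0.1:${port}/json/version`);
      if (res.ok) return (await res.json()).webSocketDebuggerUrl;
    } catch {
      // Chrome is not listening yet; fall through and retry.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Chrome did not expose CDP on port ${port}`);
}
```

The args would go to `spawn` from `node:child_process` along with the path to the Chrome binary, which varies by runner. Containerized runners may also need sandbox-related flags, so treat this as a starting point.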

Label quality directly affects analysis precision. Vague labels like section-01 give vague feedback. Section IDs and heading text embedded in the mosaic header give the AI something to reason about specifically.

And the CSS suggestions need human review. The AI is pattern-matching against known layout problems — it will catch max-width issues reliably, but complex responsive grid fixes should be read carefully before applying. This is an augmentation tool, not an autopilot.

The Tool Is in the Repo

The full pipeline lives in tools/responsive-design-testing/ — six scripts that chain together: capture.mjs (raw CDP), composite.mjs (sharp grid), analyze.mjs (vision queue builder), report.mjs, fix.mjs, and run.mjs as the single-command orchestrator.

Single-command usage (PowerShell syntax; in bash or zsh, swap the trailing backticks for backslashes):

node tools/responsive-design-testing/run.mjs `
  --url https://yoursite.com `
  --password optional-gate-pw

If you're using AI in your workflow and need visual validation of any kind — not just responsive testing — the compaction pattern is worth adding to your toolkit. The insight isn't the CDP part. It's the label→mapping loop. Once you see it, you'll find uses for it everywhere.

The Bottom Line

The agent chose a better tool than I would have, and in doing so, invented an approach I hadn't considered. Fewer abstraction layers meant more direct control over viewport behavior. One labeled composite per section meant 11× fewer API calls with better cross-viewport analysis.

That's two good ideas from one build — neither of which was in my original plan.

The pattern generalizes. Any time you need structured AI feedback across multiple visual states — responsive breakpoints, component states, browser diffs, before/after comparisons — compaction is the pattern. Label your inputs, get output mapped to those labels, skip the coordinate math entirely.

What would you use it for?