As an SDET or Automation Engineer, failing tests are part of the daily grind. With the rise of Agentic AI, fixing scripts is easier than ever—but there’s a catch that tutorials rarely mention: Scale.
In a real-world enterprise suite, you aren’t dealing with 10 tests; you’re dealing with 500. When 200 of them fail right before a major release—often due to a single upstream change by another team—feeding all that data into an AI Agent is a "token bonfire." It’s expensive, slow, and often unnecessary.
So, what’s the move? Do you use the Playwright MCP server, a standalone LLM, or the Playwright CLI?
The answer is: It depends on the scope of the fix.
In this post, we’ll explore how to leverage the Playwright CLI to analyze and resolve mass failures efficiently, keeping your token costs low and your velocity high.
What Exactly Is Playwright, and Where Do You Use It?
Before we talk about tokens, let's clear up the confusion that tripped me up for weeks.
There are three different Playwright tools that all sound similar:
- npx playwright test → the test runner you already know. Runs your spec files, reports pass/fail.
- @playwright/cli → a NEW, separate package (playwright-cli), built specifically for coding agents. Controls a live browser one command at a time.
- Playwright MCP Server → for persistent agent browser sessions. Keeps the browser alive across a long investigation.
Most QA engineers know the first one. Almost nobody has heard of the second. And the confusion is understandable — they all carry the "Playwright" name and they all involve browsers. But they're built for completely different consumers.
npx playwright test is built for humans and CI pipelines. It runs your spec files, prints results to a terminal, exits with a code. A human reads the output. A pipeline checks the exit code. Done.
playwright-cli is built for coding agents. It was created specifically because tools like Claude Code and GitHub Copilot need to interact with a live browser — not run pre-written scripts, but actually observe a page, click things, see what changed, and loop. Every command returns a compact, structured snapshot of the current page state that a machine can read and reason about efficiently.
Installing It
# It's a separate package from @playwright/test
npm install -g @playwright/cli@latest
# This is important — installs structured command docs locally
# so Claude Code knows all available commands without running --help
playwright-cli install --skills
That --skills flag is a small thing that saves tokens on every single agent session. Instead of the agent running playwright-cli --help to discover commands (costing ~300 tokens), it reads the locally installed docs once at session start.
What It Looks Like In Use
# Open the live app in a browser
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
# Get the current page state — returns all interactive elements
playwright-cli snapshot
# Interact using element refs from the snapshot
playwright-cli click e45
playwright-cli fill e12 "Morning Run"
playwright-cli press Enter
# Capture state visually
playwright-cli screenshot
# Check what network calls are happening
playwright-cli network
The snapshot command is the key one. Here's what it actually returns:
Page: Habit Tracker
URL: https://habit-tracker-one-dun-53.vercel.app/
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e14 select.status-selector combobox
e45 button.submit-habit-form button "Add Habit"
e67 a[href="/trends"] link "Trends"
e68 a[href="/about"] link "About"
That's it. ~150 tokens. Every interactive element on the page, with a ref ID the agent can use to interact. No noise, no full HTML, no CSS, no scripts — just what an agent needs to act.
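If you're scripting around the agent, that snapshot format is simple enough to parse directly. Here's a minimal sketch, assuming the line shape shown above (the real playwright-cli output may vary between versions, so treat the regex as illustrative):

```javascript
// Parse the compact snapshot format into a ref-to-element map.
// Assumption: lines look like `e45 button.submit-habit-form button "Add Habit"`.
function parseSnapshot(text) {
  const elements = {};
  for (const line of text.split("\n")) {
    // ref, selector (may contain quoted spaces), role, optional quoted label
    const m = line.trim().match(/^(e\d+)\s+(.+?)\s+([a-z][a-z ]*?)(?:\s+"([^"]*)")?$/);
    if (m) {
      const [, ref, selector, role, label] = m;
      elements[ref] = { selector, role, label: label ?? null };
    }
  }
  return elements;
}

const snapshot = `
Page: Habit Tracker
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e45 button.submit-habit-form button "Add Habit"
`;

const refs = parseSnapshot(snapshot);
console.log(refs.e45.selector); // button.submit-habit-form
```

A few lines of parsing turn the snapshot into lookups an agent harness can reason over: "does a button labeled 'Add Habit' exist, and what is its current selector?"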
This is the foundation of why playwright-cli is so much more token-efficient than alternatives. But let's make that concrete.
The App I'm Testing
Throughout this post I'll use a real example: a React Habit Tracker app I built and deployed on Vercel.
Live app: https://habit-tracker-one-dun-53.vercel.app/
One spec file, two test categories:
- @smoke — happy path: add habit, mark complete, delete
- @regression — edge cases, validation, error states
Nothing exotic. Simple enough to follow along, realistic enough to make the problem interesting.
The Token Problem With npx playwright test Alone
Here's what happens when a naive agent uses only npx playwright test to investigate a failure:
You: "Why is the add-habit smoke test failing?"
Agent (without playwright-cli):
Step 1: Read entire spec file to understand the test ~3,200 tokens
Step 2: Read playwright.config.js for context ~180 tokens
Step 3: Run npx playwright test --reporter=list ~800 tokens
Step 4: Parse terminal output, identify failure ~600 tokens
Step 5: Read source components to find selector ~2,800 tokens
Step 6: Guess what the DOM looks like now ~1,400 tokens
Step 7: Propose a fix (might be wrong) ~700 tokens
──────────────────────────────────────────────────────
Total: ~9,680 tokens
Certainty of diagnosis: ~60%
The agent is reading source files and guessing what the live DOM looks like. It never actually looks at the app. Its fix might be wrong and cost you another loop.
Here's the same task with playwright-cli:
You: "Why is the add-habit smoke test failing?"
Agent (with playwright-cli):
Step 1: npx playwright test --grep "@smoke" --reporter=json ~420 tokens
Step 2: playwright-cli open + snapshot on failure ~190 tokens
Step 3: Read ONLY the failing test block ~280 tokens
Step 4: Compare snapshot to test, diagnose ~480 tokens
Step 5: Propose exact fix from what it actually saw ~150 tokens
──────────────────────────────────────────────────────
Total: ~1,520 tokens
Certainty of diagnosis: ~95%
84% fewer tokens. Higher confidence. No guessing.
The agent saw the actual live DOM. It didn't read five files hoping to reconstruct what the UI looks like — it just looked.
Why playwright-cli Is So Token-Efficient
1. Snapshot output is compact by design
The playwright-cli snapshot command returns a structured, minimal representation of the current page — only interactive elements, only what an agent needs to act:
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
playwright-cli snapshot
Output:
Page: Habit Tracker
URL: https://habit-tracker-one-dun-53.vercel.app/
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e14 select.status-selector combobox
e45 button.submit-habit-form button "Add Habit"
e67 a[href="/trends"] link "Trends"
e68 a[href="/about"] link "About"
That's ~150 tokens. It tells the agent everything it needs: what elements exist, what their refs are, what their current classes and labels are.
Compare that to fetching the full page HTML — which would be 2,000–8,000 tokens of noise the agent has to parse to extract the same 6 elements.
2. Ref-based interaction means no selector archaeology
Once the agent has a snapshot, it uses refs — not selectors it has to guess:
playwright-cli click e45 # click "Add Habit" button
playwright-cli fill e12 "Morning Run"
playwright-cli press Enter
playwright-cli snapshot # see what changed
No reading source code to figure out what class name the button has. No risk of using a stale selector from a file that hasn't been updated since last deploy. The snapshot tells you the current truth.
This is the core insight: playwright-cli trades file reading for direct observation. File reading costs hundreds to thousands of tokens per file. Observation costs ~150 tokens for the entire page state.
3. The --skills flag prevents repeated --help calls
playwright-cli install --skills
This installs structured command documentation locally. Claude Code reads it once at the start of a session — it never needs to run playwright-cli --help to discover available commands. That's another ~300 tokens saved per session, and it means the agent's responses about playwright-cli are accurate and consistent.
4. Scoped execution with --grep on the test runner side
The other half of token efficiency is on the npx playwright test side. Never let an agent run your full suite when it only needs a subset:
# Full suite — 200 tests, large JSON output, expensive to parse
npx playwright test --reporter=json
# Scoped to smoke — 6 tests, tiny JSON output, cheap to parse
npx playwright test --grep "@smoke" --reporter=json
# Single failing test — 1 result, minimal output
npx playwright test --grep "should add a new habit" --reporter=json
The JSON output for 1 test is ~80 tokens. For 200 tests it's ~3,000 tokens. The agent gets the same actionable information either way. Always scope.
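To make scoping the default rather than a discipline you have to remember, a harness can construct the command for the agent. `buildTestCommand` below is a hypothetical helper, not part of Playwright; it just encodes the "always --reporter=json, always scoped when possible" rule:

```javascript
// Build a scoped, JSON-reporter Playwright command string for an agent to run.
// buildTestCommand is an illustrative helper, not a Playwright API.
function buildTestCommand({ grep, reporter = "json" } = {}) {
  const cmd = ["npx", "playwright", "test", `--reporter=${reporter}`];
  if (grep) cmd.push("--grep", `"${grep}"`);
  return cmd.join(" ");
}

console.log(buildTestCommand({ grep: "@smoke" }));
// npx playwright test --reporter=json --grep "@smoke"
```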
The Complete Token Breakdown Across a Real QA Cycle
Here's a full pre-merge QA run on my habit tracker, with actual token estimates at each step:
Task: "Is the habit tracker ready to merge?"
─── Phase 1: Discovery ──────────────────────────────────────────
npx playwright test --list --grep "@smoke"
→ Returns 6 test names
→ Cost: ~80 tokens
─── Phase 2: Smoke Run ──────────────────────────────────────────
npx playwright test --grep "@smoke" --reporter=json
→ 5 passed, 1 failed
→ JSON output parsed by agent
→ Cost: ~420 tokens
─── Phase 3: Triage (playwright-cli) ───────────────────────────
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
→ Cost: ~30 tokens
playwright-cli snapshot
→ Returns current DOM state
→ Cost: ~150 tokens
→ Agent sees: button class is "submit-habit-form" not "habit-submit-btn"
→ Diagnosis: selector-broken. Immediate. No guessing.
playwright-cli screenshot
→ Visual confirmation
→ Cost: ~40 tokens
─── Phase 4: Read only the failing test block ───────────────────
→ ~280 tokens (one test block, not the whole file)
─── Phase 5: Fix proposal ───────────────────────────────────────
→ Agent writes the corrected selector
→ Cost: ~150 tokens
─── Phase 6: Verify fix ────────────────────────────────────────
npx playwright test --grep "should add a new habit" --reporter=json
→ 1 passed ✅
→ Cost: ~80 tokens
─── Phase 7: Regression (smoke now clean) ──────────────────────
npx playwright test --grep "@regression" --reporter=json
→ 14 passed ✅
→ Cost: ~900 tokens
─────────────────────────────────────────────────────────────────
TOTAL: ~2,130 tokens
Full investigation. Real diagnosis. Verified fix. Regression confirmed.
Naive approach (same outcome, no playwright-cli): ~18,000–22,000 tokens
The Debugging Difference
Most Playwright engineers know --debug mode. Here's why it doesn't work for agents and why playwright-cli exists:
npx playwright test --debug
→ Interactive REPL
→ Pauses at each step, waits for human input
→ Requires a human sitting at the keyboard
→ Useless in CI, useless for autonomous agents
playwright-cli
→ Atomic commands, each returns structured output
→ No human needed between steps
→ Agent calls command → reads output → decides next command → loops
→ Works headlessly in CI, works autonomously in Claude Code
The mental model shift:
Human debugging:
open browser → look → click → look → understand
Agent with playwright-cli:
snapshot → reason → click → snapshot → reason → fix
Agent without playwright-cli:
read file → read file → read file → guess → run test → read output → guess again
The agent without playwright-cli is flying blind. It's doing archaeology — piecing together what the live UI looks like from source files that may be outdated. playwright-cli gives the agent real-time ground truth.
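The observe-reason-act loop is simple enough to sketch. The runner below is a stub standing in for real playwright-cli calls, and the snapshot line and ref are illustrative; the point is the control flow, not the specific commands:

```javascript
// Sketch of the snapshot -> reason -> act loop an agent runs.
// `run` stands in for shelling out to playwright-cli.
function triageLoop(run, expectedLabel, maxSteps = 3) {
  for (let step = 0; step < maxSteps; step++) {
    const state = run("snapshot");            // observe the live page
    if (state.includes(expectedLabel)) {      // reason: target element exists
      run("screenshot");                      // act: capture proof
      return "element present in live DOM; test selector is likely stale";
    }
  }
  return "element never appeared; possible real bug or timing issue";
}

// Stub runner: pretend the snapshot shows the renamed button.
const diagnosis = triageLoop(
  cmd => (cmd === "snapshot" ? 'e45 button.submit-habit-form button "Add Habit"' : "ok"),
  "Add Habit"
);
console.log(diagnosis);
```

The key property: every decision is made against observed state, never against a guess reconstructed from source files.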
What playwright-cli Commands Actually Look Like In Practice
Here's a real triage session my agent ran on a broken selector:
# Test failure: "locator('.habit-submit-btn') timed out"
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
# Agent: app loaded, ready to investigate
playwright-cli snapshot
# Agent reads:
# e45 button.submit-habit-form (button, "Add Habit")
# Agent: class is "submit-habit-form" not "habit-submit-btn"
# Diagnosis: selector-broken, CSS class renamed in a recent deploy
playwright-cli fill e12 "Morning Run"
playwright-cli snapshot
# Agent: input filled correctly, no issues with the input selector
playwright-cli click e45
playwright-cli snapshot
# Agent: habit appeared in the list, the button works
# Confirmation: it's purely a selector name issue, app logic is fine
playwright-cli screenshot
# Agent: captured visual proof of the working flow with correct selector
Total playwright-cli token cost for this investigation: ~420 tokens.
Time: under 10 seconds.
Outcome: exact diagnosis, zero guessing, confirmed fix path.
The Agent Files That Use playwright-cli
I run all of this through Claude Code sub-agents defined in .claude/agents/. Here are the two most important ones for the playwright-cli story.
.claude/agents/qa-smoke-runner.md
---
name: qa-smoke-runner
description: Runs @smoke Playwright tests against the Habit Tracker and
triages failures using playwright-cli. Use proactively when code changes
are made or when asked to verify the app works.
model: claude-sonnet-4-5
tools: Bash, Read, Glob
---
You are a QA smoke testing agent. Follow these steps exactly.
## Step 1 — Run smoke tests (scoped, JSON reporter only)
npx playwright test --grep "@smoke" --reporter=json 2>/dev/null
Parse the JSON. Extract: passed/failed counts, error messages per failure.
## Step 2 — If all pass
Output summary and stop. Never run regression from this agent.
## Step 3 — For each failure, triage with playwright-cli
DO NOT read source files first. Look at the live app first.
1. playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
2. playwright-cli snapshot
→ Read the element refs. Compare against the error message.
→ Can you already see the problem? (renamed class, missing element?)
3. Reproduce the failing steps:
playwright-cli fill <ref> "test value"
playwright-cli click <ref>
playwright-cli snapshot ← after every interaction
playwright-cli screenshot ← at the failure point
4. Only THEN read the failing test block if needed for assertion details.
Read ONLY that test block. Not the whole file.
5. Categorise:
- selector-broken → element exists, selector is wrong
- timing → element not ready when test interacts
- real-bug → app behaviour is actually wrong
- env-issue → Vercel/network problem
- test-data → hardcoded data assumption wrong
## Token rules
- --reporter=json always. Never --reporter=html.
- playwright-cli snapshot before reading any source file.
- --grep "@smoke" always. Never run full suite.
- Read only failing test block, not full spec.
## Output
Result: X passed / Y failed
[Per failure: category, root cause, playwright-cli commands used, fix]
.claude/agents/qa-failure-triager.md
---
name: qa-failure-triager
description: Deep-dives into a specific Playwright test failure using
playwright-cli to reproduce and diagnose the exact root cause. Use
when you need more than a summary — you need the exact fix.
model: claude-sonnet-4-5
tools: Bash, Read, Glob
---
You are a QA triage specialist. Your primary tool is playwright-cli.
Read source files only as a last resort.
## Investigation order (follow exactly)
### 1. Look at the live app first
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
playwright-cli snapshot
Before reading any code, answer from the snapshot:
- Does the element the test expects actually exist?
- Is its selector/class what the test expects?
- Is the page in the right state?
### 2. Reproduce the failure step by step
For each action in the failing test, execute it with playwright-cli:
playwright-cli snapshot # document initial state
playwright-cli fill <ref> "test value" # replicate test actions
playwright-cli snapshot # observe state change
playwright-cli click <ref>
playwright-cli snapshot # observe result
playwright-cli screenshot # capture at failure point
### 3. Check deeper only if needed
playwright-cli network # if failure could be API/data related
playwright-cli console # if failure could be a JS runtime error
### 4. Read failing test code last
Only after playwright-cli investigation. Read only the failing test
block — find it by line number from the JSON reporter output.
### 5. Write the exact fix
// Before (broken)
await page.locator('.old-class').click()
// After (fixed — prefer role-based selectors)
await page.getByRole('button', { name: 'Add Habit' }).click()
### 6. Verify
npx playwright test --grep "[exact test name]" --reporter=json
## Output
Category: [category]
Divergence point: [step where expectation broke]
playwright-cli commands used: [list]
Root cause: [2-3 sentences]
Fix: [exact before/after code]
Verified: ✅ / ❌
The --reporter=json Detail Nobody Talks About
One more token efficiency lever that pairs with playwright-cli: always use --reporter=json when running tests for an agent to consume.
Here's the difference in what an agent has to process:
List reporter (default):
Running 6 tests using 1 worker
✓ 1 [chromium] › add-habit-form.spec.js:12:5 › @smoke should add a new habit (1.2s)
✓ 2 [chromium] › add-habit-form.spec.js:28:5 › @smoke should mark habit complete (0.8s)
✗ 3 [chromium] › add-habit-form.spec.js:45:5 › @smoke should delete a habit (5.0s)
Error: locator('.delete-habit-btn') - waiting for locator
at add-habit-form.spec.js:52
Looks clean to a human. For an agent, it's unstructured text it has to parse with pattern matching — brittle and expensive.
JSON reporter:
{
"stats": { "expected": 5, "unexpected": 1, "duration": 7243 },
"suites": [{
"title": "add-habit-form.spec.js",
"specs": [{
"title": "@smoke should delete a habit",
"ok": false,
"tests": [{
"status": "failed",
"results": [{
"error": {
"message": "locator('.delete-habit-btn') exceeded timeout",
"location": { "file": "add-habit-form.spec.js", "line": 52 }
}
}]
}]
}]
}]
}
Structured. Parseable. The agent extracts exactly what it needs in one pass: which test, what error, which file, which line. No pattern matching, no ambiguity, fewer tokens wasted on parsing overhead.
Combined with --grep to scope the run, you get the smallest possible output surface for an agent to work with.
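Because the output is structured, the extraction itself is a few lines of code. This sketch walks the same shape as the trimmed example above; the real reporter emits additional fields, but these paths match its structure:

```javascript
// Reduce a Playwright JSON-reporter report to just the facts an agent needs:
// which test failed, with what error, in which file, at which line.
function extractFailures(report) {
  const failures = [];
  for (const suite of report.suites ?? []) {
    for (const spec of suite.specs ?? []) {
      if (spec.ok) continue;
      for (const test of spec.tests ?? []) {
        for (const result of test.results ?? []) {
          if (!result.error) continue;
          failures.push({
            test: spec.title,
            message: result.error.message,
            file: result.error.location?.file,
            line: result.error.location?.line,
          });
        }
      }
    }
  }
  return failures;
}

// The trimmed report shape from the example above.
const report = {
  stats: { expected: 5, unexpected: 1, duration: 7243 },
  suites: [{
    title: "add-habit-form.spec.js",
    specs: [{
      title: "@smoke should delete a habit",
      ok: false,
      tests: [{
        status: "failed",
        results: [{
          error: {
            message: "locator('.delete-habit-btn') exceeded timeout",
            location: { file: "add-habit-form.spec.js", line: 52 },
          },
        }],
      }],
    }],
  }],
};

console.log(extractFailures(report));
```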
When playwright-cli Isn't Enough: The MCP Server
playwright-cli is stateless — each command opens the browser, acts, closes. For most triage tasks that's fine. But some investigations need state across multiple steps:
- Verify a habit persists after page refresh
- Check that streak counts update correctly after adding habits on consecutive days
- Explore an entire multi-page flow without re-navigating from scratch
For those cases, the Playwright MCP server keeps a browser alive across the entire agent session. Think of it as a playwright-cli session that doesn't close between commands.
For my habit tracker I use it in the qa-test-generator agent — the one that explores the live app and generates new test coverage. It needs to navigate multiple routes, build up app state, and compare across page transitions. playwright-cli's per-command statelessness would make that awkward. The MCP server handles it naturally.
But — and this is important — don't reach for the MCP server by default. playwright-cli covers 90% of QA agent use cases with lower overhead. Use the MCP server specifically when your investigation needs persistent browser state.
The Mental Model, Simplified
npx playwright test = your test runner
answers: WHAT broke?
cost: low (with --grep + --reporter=json)
playwright-cli = your agent's eyes on the live app
answers: WHY did it break?
cost: very low (snapshot is ~150 tokens)
playwright MCP server = your agent's stateful browser session
answers: WHAT ELSE might be broken?
cost: low, but only use when you need state
The shift is this: you stop reading files to reconstruct what the UI looks like. You just look at it. That's what playwright-cli makes possible for agents — and it's why the token difference is so dramatic.
Quick Reference: Token Cost By Command
npx playwright test --list ~80 tokens
npx playwright test --grep "@smoke" --reporter=json ~400 tokens
npx playwright test --reporter=json (full suite) ~3,000 tokens
playwright-cli open <url> ~30 tokens
playwright-cli snapshot ~150 tokens
playwright-cli click <ref> ~20 tokens
playwright-cli screenshot ~40 tokens
playwright-cli network ~200 tokens
Reading entire spec file ~3,200 tokens
Reading one test block ~280 tokens
Reading playwright.config.js ~180 tokens
The pattern is clear. Every playwright-cli command costs less than reading a single source file. And it gives you ground truth about the live app rather than a static snapshot of code that may not reflect the current deployed state.
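With costs this predictable, you can budget an investigation before running it. A quick sketch using the per-command numbers from the table (these are this post's rough estimates, not measured values):

```javascript
// Rough per-step token costs, taken from the table above (estimates only).
const COST = {
  smokeRun: 400,      // npx playwright test --grep "@smoke" --reporter=json
  open: 30,           // playwright-cli open <url>
  snapshot: 150,      // playwright-cli snapshot
  click: 20,          // playwright-cli click <ref>
  screenshot: 40,     // playwright-cli screenshot
  readTestBlock: 280, // reading one failing test block
};

// A typical triage plan: scoped run, observe, reproduce, confirm, read test.
const plan = ["smokeRun", "open", "snapshot", "click", "snapshot", "screenshot", "readTestBlock"];
const budget = plan.reduce((sum, step) => sum + COST[step], 0);
console.log(budget); // 1070
```

A full triage comes in around a thousand tokens, which is less than the cost of reading a single spec file.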
What I'd Do Differently
Use --grep from day one. My agents initially ran the full suite every time. Scoping to @smoke for quick checks cut token usage by 6x immediately.
playwright-cli snapshot before any file read. The answer is usually in the snapshot. Only go to source files when you need to understand assertion logic, not when you need to understand the DOM.
Role-based selectors over CSS classes. CSS class names change with deploys — getByRole('button', { name: 'Add Habit' }) stays stable. It also makes playwright-cli triage easier because role and name are visible in the snapshot output.
The GitHub Repo
All the agent files, CLAUDE.md, and the full test suite are here:
👉 https://github.com/anirseven/habit-tracker/tree/main/.claude/agents
If you're building something similar or have questions, drop a comment. Happy to go deeper on any part of this.