As an SDET or Automation Engineer, failing tests are part of the daily grind. With the rise of Agentic AI, fixing scripts is easier than ever—but there’s a catch that tutorials rarely mention: Scale.
In a real-world enterprise suite, you aren’t dealing with 10 tests; you’re dealing with 500. When 200 of them fail right before a major release—often due to a single upstream change by another team—feeding all that data into an AI Agent is a "token bonfire." It’s expensive, slow, and often unnecessary.
So, what’s the move? Do you use the Playwright MCP server, a standalone LLM, or the Playwright CLI?
The answer is: It depends on the scope of the fix.
In this post, we’ll explore how to leverage the Playwright CLI to analyze and resolve mass failures efficiently, keeping your token costs low and your velocity high.
What Exactly Is Playwright, and Where Do You Use It?
Before we talk about tokens, let's clear up the confusion that tripped me up for weeks.
There are three different Playwright tools that all sound similar:
- npx playwright test → the test runner you already know. Runs your spec files, reports pass/fail.
- @playwright/cli → a NEW, separate package (playwright-cli), built specifically for coding agents. Controls a live browser one command at a time.
- Playwright MCP Server → for persistent agent browser sessions. Keeps the browser alive across a long investigation.
Most QA engineers know the first one. Almost nobody has heard of the second. And the confusion is understandable — they all carry the "Playwright" name and they all involve browsers. But they're built for completely different consumers.
npx playwright test is built for humans and CI pipelines. It runs your spec files, prints results to a terminal, exits with a code. A human reads the output. A pipeline checks the exit code. Done.
playwright-cli is built for coding agents. It was created specifically because tools like Claude Code and GitHub Copilot need to interact with a live browser — not run pre-written scripts, but actually observe a page, click things, see what changed, and loop. Every command returns a compact, structured snapshot of the current page state that a machine can read and reason about efficiently.
Installing It
# It's a separate package from @playwright/test
npm install -g @playwright/cli@latest
# This is important — installs structured command docs locally
# so Claude Code knows all available commands without running --help
playwright-cli install --skills
That --skills flag is a small thing that saves tokens on every single agent session. Instead of the agent running playwright-cli --help to discover commands (costing ~300 tokens), it reads the locally installed docs once at session start.
What It Looks Like In Use
# Open the live app in a browser
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
# Get the current page state — returns all interactive elements
playwright-cli snapshot
# Interact using element refs from the snapshot
playwright-cli click e45
playwright-cli fill e12 "Morning Run"
playwright-cli press Enter
# Capture state visually
playwright-cli screenshot
# Check what network calls are happening
playwright-cli network
The snapshot command is the key one. Here's what it actually returns:
Page: Habit Tracker
URL: https://habit-tracker-one-dun-53.vercel.app/
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e14 select.status-selector combobox
e45 button.submit-habit-form button "Add Habit"
e67 a[href="/trends"] link "Trends"
e68 a[href="/about"] link "About"
That's it. ~150 tokens. Every interactive element on the page, with a ref ID the agent can use to interact. No noise, no full HTML, no CSS, no scripts — just what an agent needs to act.
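If you're scripting around the agent, that snapshot format is simple enough to parse directly. Here's a minimal sketch, assuming the line shape shown above (the real playwright-cli output may vary between versions, so treat the regex as illustrative):

```javascript
// Parse the compact snapshot format into a ref-to-element map.
// Assumption: lines look like `e45 button.submit-habit-form button "Add Habit"`.
function parseSnapshot(text) {
  const elements = {};
  for (const line of text.split("\n")) {
    // ref, selector (may contain quoted spaces), role, optional quoted label
    const m = line.trim().match(/^(e\d+)\s+(.+?)\s+([a-z][a-z ]*?)(?:\s+"([^"]*)")?$/);
    if (m) {
      const [, ref, selector, role, label] = m;
      elements[ref] = { selector, role, label: label ?? null };
    }
  }
  return elements;
}

const snapshot = `
Page: Habit Tracker
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e45 button.submit-habit-form button "Add Habit"
`;

const refs = parseSnapshot(snapshot);
console.log(refs.e45.selector); // button.submit-habit-form
```

A few lines of parsing turn the snapshot into lookups an agent harness can reason over: "does a button labeled 'Add Habit' exist, and what is its current selector?"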
This is the foundation of why playwright-cli is so much more token-efficient than alternatives. But let's make that concrete.
The App I'm Testing
Throughout this post I'll use a real example: a React Habit Tracker app I built and deployed on Vercel.
Live app: https://habit-tracker-one-dun-53.vercel.app/
One spec file, two test categories:
- @smoke — happy path: add habit, mark complete, delete
- @regression — edge cases, validation, error states
Nothing exotic. Simple enough to follow along, realistic enough to make the problem interesting.
The Token Problem With npx playwright test Alone
Here's what happens when a naive agent uses only npx playwright test to investigate a failure:
You: "Why is the add-habit smoke test failing?"
Agent (without playwright-cli):
Step 1: Read entire spec file to understand the test ~3,200 tokens
Step 2: Read playwright.config.js for context ~180 tokens
Step 3: Run npx playwright test --reporter=list ~800 tokens
Step 4: Parse terminal output, identify failure ~600 tokens
Step 5: Read source components to find selector ~2,800 tokens
Step 6: Guess what the DOM looks like now ~1,400 tokens
Step 7: Propose a fix (might be wrong) ~700 tokens
──────────────────────────────────────────────────────
Total: ~9,680 tokens
Certainty of diagnosis: ~60%
The agent is reading source files and guessing what the live DOM looks like. It never actually looks at the app. Its fix might be wrong and cost you another loop.
Here's the same task with playwright-cli:
You: "Why is the add-habit smoke test failing?"
Agent (with playwright-cli):
Step 1: npx playwright test --grep "@smoke" --reporter=json ~420 tokens
Step 2: playwright-cli open + snapshot on failure ~190 tokens
Step 3: Read ONLY the failing test block ~280 tokens
Step 4: Compare snapshot to test, diagnose ~480 tokens
Step 5: Propose exact fix from what it actually saw ~150 tokens
──────────────────────────────────────────────────────
Total: ~1,520 tokens
Certainty of diagnosis: ~95%
84% fewer tokens. Higher confidence. No guessing.
The agent saw the actual live DOM. It didn't read five files hoping to reconstruct what the UI looks like — it just looked.
Why playwright-cli Is So Token-Efficient
1. Snapshot output is compact by design
The playwright-cli snapshot command returns a structured, minimal representation of the current page — only interactive elements, only what an agent needs to act:
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
playwright-cli snapshot
Output:
Page: Habit Tracker
URL: https://habit-tracker-one-dun-53.vercel.app/
Interactive elements:
e12 input[placeholder="Enter habit name..."] textbox
e13 input[type="date"] date picker
e14 select.status-selector combobox
e45 button.submit-habit-form button "Add Habit"
e67 a[href="/trends"] link "Trends"
e68 a[href="/about"] link "About"
That's ~150 tokens. It tells the agent everything it needs: what elements exist, what their refs are, what their current classes and labels are.
Compare that to fetching the full page HTML — which would be 2,000–8,000 tokens of noise the agent has to parse to extract the same 6 elements.
2. Ref-based interaction means no selector archaeology
Once the agent has a snapshot, it uses refs — not selectors it has to guess:
playwright-cli click e45 # click "Add Habit" button
playwright-cli fill e12 "Morning Run"
playwright-cli press Enter
playwright-cli snapshot # see what changed
No reading source code to figure out what class name the button has. No risk of using a stale selector from a file that hasn't been updated since last deploy. The snapshot tells you the current truth.
This is the core insight: playwright-cli trades file reading for direct observation. File reading costs hundreds to thousands of tokens per file. Observation costs ~150 tokens for the entire page state.
3. The --skills flag prevents repeated --help calls
playwright-cli install --skills
This installs structured command documentation locally. Claude Code reads it once at the start of a session — it never needs to run playwright-cli --help to discover available commands. That's another ~300 tokens saved per session, and it means the agent's responses about playwright-cli are accurate and consistent.
4. Scoped execution with --grep on the test runner side
The other half of token efficiency is on the npx playwright test side. Never let an agent run your full suite when it only needs a subset:
# Full suite — 200 tests, large JSON output, expensive to parse
npx playwright test --reporter=json
# Scoped to smoke — 6 tests, tiny JSON output, cheap to parse
npx playwright test --grep "@smoke" --reporter=json
# Single failing test — 1 result, minimal output
npx playwright test --grep "should add a new habit" --reporter=json
The JSON output for 1 test is ~80 tokens. For 200 tests it's ~3,000 tokens. The agent gets the same actionable information either way. Always scope.
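To make scoping the default rather than a discipline you have to remember, a harness can construct the command for the agent. `buildTestCommand` below is a hypothetical helper, not part of Playwright; it just encodes the "always --reporter=json, always scoped when possible" rule:

```javascript
// Build a scoped, JSON-reporter Playwright command string for an agent to run.
// buildTestCommand is an illustrative helper, not a Playwright API.
function buildTestCommand({ grep, reporter = "json" } = {}) {
  const cmd = ["npx", "playwright", "test", `--reporter=${reporter}`];
  if (grep) cmd.push("--grep", `"${grep}"`);
  return cmd.join(" ");
}

console.log(buildTestCommand({ grep: "@smoke" }));
// npx playwright test --reporter=json --grep "@smoke"
```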
The Complete Token Breakdown Across a Real QA Cycle
Here's a full pre-merge QA run on my habit tracker, with actual token estimates at each step:
Task: "Is the habit tracker ready to merge?"
─── Phase 1: Discovery ──────────────────────────────────────────
npx playwright test --list --grep "@smoke"
→ Returns 6 test names
→ Cost: ~80 tokens
─── Phase 2: Smoke Run ──────────────────────────────────────────
npx playwright test --grep "@smoke" --reporter=json
→ 5 passed, 1 failed
→ JSON output parsed by agent
→ Cost: ~420 tokens
─── Phase 3: Triage (playwright-cli) ───────────────────────────
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
→ Cost: ~30 tokens
playwright-cli snapshot
→ Returns current DOM state
→ Cost: ~150 tokens
→ Agent sees: button class is "submit-habit-form" not "habit-submit-btn"
→ Diagnosis: selector-broken. Immediate. No guessing.
playwright-cli screenshot
→ Visual confirmation
→ Cost: ~40 tokens
─── Phase 4: Read only the failing test block ───────────────────
→ ~280 tokens (one test block, not the whole file)
─── Phase 5: Fix proposal ───────────────────────────────────────
→ Agent writes the corrected selector
→ Cost: ~150 tokens
─── Phase 6: Verify fix ────────────────────────────────────────
npx playwright test --grep "should add a new habit" --reporter=json
→ 1 passed ✅
→ Cost: ~80 tokens
─── Phase 7: Regression (smoke now clean) ──────────────────────
npx playwright test --grep "@regression" --reporter=json
→ 14 passed ✅
→ Cost: ~900 tokens
─────────────────────────────────────────────────────────────────
TOTAL: ~2,130 tokens
Full investigation. Real diagnosis. Verified fix. Regression confirmed.
Naive approach (same outcome, no playwright-cli): ~18,000–22,000 tokens
The Debugging Difference
Most Playwright engineers know --debug mode. Here's why it doesn't work for agents and why playwright-cli exists:
npx playwright test --debug
→ Interactive REPL
→ Pauses at each step, waits for human input
→ Requires a human sitting at the keyboard
→ Useless in CI, useless for autonomous agents
playwright-cli
→ Atomic commands, each returns structured output
→ No human needed between steps
→ Agent calls command → reads output → decides next command → loops
→ Works headlessly in CI, works autonomously in Claude Code
The mental model shift:
Human debugging:
open browser → look → click → look → understand
Agent with playwright-cli:
snapshot → reason → click → snapshot → reason → fix
Agent without playwright-cli:
read file → read file → read file → guess → run test → read output → guess again
The agent without playwright-cli is flying blind. It's doing archaeology — piecing together what the live UI looks like from source files that may be outdated. playwright-cli gives the agent real-time ground truth.
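The observe-reason-act loop is simple enough to sketch. The runner below is a stub standing in for real playwright-cli calls, and the snapshot line and ref are illustrative; the point is the control flow, not the specific commands:

```javascript
// Sketch of the snapshot -> reason -> act loop an agent runs.
// `run` stands in for shelling out to playwright-cli.
function triageLoop(run, expectedLabel, maxSteps = 3) {
  for (let step = 0; step < maxSteps; step++) {
    const state = run("snapshot");            // observe the live page
    if (state.includes(expectedLabel)) {      // reason: target element exists
      run("screenshot");                      // act: capture proof
      return "element present in live DOM; test selector is likely stale";
    }
  }
  return "element never appeared; possible real bug or timing issue";
}

// Stub runner: pretend the snapshot shows the renamed button.
const diagnosis = triageLoop(
  cmd => (cmd === "snapshot" ? 'e45 button.submit-habit-form button "Add Habit"' : "ok"),
  "Add Habit"
);
console.log(diagnosis);
```

The key property: every decision is made against observed state, never against a guess reconstructed from source files.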
What playwright-cli Commands Actually Look Like In Practice
Here's a real triage session my agent ran on a broken selector:
# Test failure: "locator('.habit-submit-btn') timed out"
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
# Agent: app loaded, ready to investigate
playwright-cli snapshot
# Agent reads:
# e45 button.submit-habit-form (button, "Add Habit")
# Agent: class is "submit-habit-form" not "habit-submit-btn"
# Diagnosis: selector-broken, CSS class renamed in a recent deploy
playwright-cli fill e12 "Morning Run"
playwright-cli snapshot
# Agent: input filled correctly, no issues with the input selector
playwright-cli click e45
playwright-cli snapshot
# Agent: habit appeared in the list, the button works
# Confirmation: it's purely a selector name issue, app logic is fine
playwright-cli screenshot
# Agent: captured visual proof of the working flow with correct selector
Total playwright-cli token cost for this investigation: ~420 tokens.
Time: under 10 seconds.
Outcome: exact diagnosis, zero guessing, confirmed fix path.
The Agent Files That Use playwright-cli
I run all of this through Claude Code sub-agents defined in .claude/agents/. Here are the two most important ones for the playwright-cli story.
.claude/agents/qa-smoke-runner.md
---
name: qa-smoke-runner
description: Runs @smoke Playwright tests against the Habit Tracker and
triages failures using playwright-cli. Use proactively when code changes
are made or when asked to verify the app works.
model: claude-sonnet-4-5
tools: Bash, Read, Glob
---
You are a QA smoke testing agent. Follow these steps exactly.
## Step 1 — Run smoke tests (scoped, JSON reporter only)
npx playwright test --grep "@smoke" --reporter=json 2>/dev/null
Parse the JSON. Extract: passed/failed counts, error messages per failure.
## Step 2 — If all pass
Output summary and stop. Never run regression from this agent.
## Step 3 — For each failure, triage with playwright-cli
DO NOT read source files first. Look at the live app first.
1. playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
2. playwright-cli snapshot
→ Read the element refs. Compare against the error message.
→ Can you already see the problem? (renamed class, missing element?)
3. Reproduce the failing steps:
playwright-cli fill <ref> "test value"
playwright-cli click <ref>
playwright-cli snapshot ← after every interaction
playwright-cli screenshot ← at the failure point
4. Only THEN read the failing test block if needed for assertion details.
Read ONLY that test block. Not the whole file.
5. Categorise:
- selector-broken → element exists, selector is wrong
- timing → element not ready when test interacts
- real-bug → app behaviour is actually wrong
- env-issue → Vercel/network problem
- test-data → hardcoded data assumption wrong
## Token rules
- --reporter=json always. Never --reporter=html.
- playwright-cli snapshot before reading any source file.
- --grep "@smoke" always. Never run full suite.
- Read only failing test block, not full spec.
## Output
Result: X passed / Y failed
[Per failure: category, root cause, playwright-cli commands used, fix]
.claude/agents/qa-failure-triager.md
---
name: qa-failure-triager
description: Deep-dives into a specific Playwright test failure using
playwright-cli to reproduce and diagnose the exact root cause. Use
when you need more than a summary — you need the exact fix.
model: claude-sonnet-4-5
tools: Bash, Read, Glob
---
You are a QA triage specialist. Your primary tool is playwright-cli.
Read source files only as a last resort.
## Investigation order (follow exactly)
### 1. Look at the live app first
playwright-cli open https://habit-tracker-one-dun-53.vercel.app/
playwright-cli snapshot
Before reading any code, answer from the snapshot:
- Does the element the test expects actually exist?
- Is its selector/class what the test expects?
- Is the page in the right state?
### 2. Reproduce the failure step by step
For each action in the failing test, execute it with playwright-cli:
playwright-cli snapshot # document initial state
playwright-cli fill <ref> "test value" # replicate test actions
playwright-cli snapshot # observe state change
playwright-cli click <ref>
playwright-cli snapshot # observe result
playwright-cli screenshot # capture at failure point
### 3. Check deeper only if needed
playwright-cli network # if failure could be API/data related
playwright-cli console # if failure could be a JS runtime error
### 4. Read failing test code last
Only after playwright-cli investigation. Read only the failing test
block — find it by line number from the JSON reporter output.
### 5. Write the exact fix
// Before (broken)
await page.locator('.old-class').click()
// After (fixed — prefer role-based selectors)
await page.getByRole('button', { name: 'Add Habit' }).click()
### 6. Verify
npx playwright test --grep "[exact test name]" --reporter=json
## Output
Category: [category]
Divergence point: [step where expectation broke]
playwright-cli commands used: [list]
Root cause: [2-3 sentences]
Fix: [exact before/after code]
Verified: ✅ / ❌
The --reporter=json Detail Nobody Talks About
One more token efficiency lever that pairs with playwright-cli: always use --reporter=json when running tests for an agent to consume.
Here's the difference in what an agent has to process:
List reporter (default):
Running 6 tests using 1 worker
✓ 1 [chromium] › add-habit-form.spec.js:12:5 › @smoke should add a new habit (1.2s)
✓ 2 [chromium] › add-habit-form.spec.js:28:5 › @smoke should mark habit complete (0.8s)
✗ 3 [chromium] › add-habit-form.spec.js:45:5 › @smoke should delete a habit (5.0s)
Error: locator('.delete-habit-btn') - waiting for locator
at add-habit-form.spec.js:52
Looks clean to a human. For an agent, it's unstructured text it has to parse with pattern matching — brittle and expensive.
JSON reporter:
{
"stats": { "expected": 5, "unexpected": 1, "duration": 7243 },
"suites": [{
"title": "add-habit-form.spec.js",
"specs": [{
"title": "@smoke should delete a habit",
"ok": false,
"tests": [{
"status": "failed",
"results": [{
"error": {
"message": "locator('.delete-habit-btn') exceeded timeout",
"location": { "file": "add-habit-form.spec.js", "line": 52 }
}
}]
}]
}]
}]
}
Structured. Parseable. The agent extracts exactly what it needs in one pass: which test, what error, which file, which line. No pattern matching, no ambiguity, fewer tokens wasted on parsing overhead.
Combined with --grep to scope the run, you get the smallest possible output surface for an agent to work with.
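Because the output is structured, the extraction itself is a few lines of code. This sketch walks the same shape as the trimmed example above; the real reporter emits additional fields, but these paths match its structure:

```javascript
// Reduce a Playwright JSON-reporter report to just the facts an agent needs:
// which test failed, with what error, in which file, at which line.
function extractFailures(report) {
  const failures = [];
  for (const suite of report.suites ?? []) {
    for (const spec of suite.specs ?? []) {
      if (spec.ok) continue;
      for (const test of spec.tests ?? []) {
        for (const result of test.results ?? []) {
          if (!result.error) continue;
          failures.push({
            test: spec.title,
            message: result.error.message,
            file: result.error.location?.file,
            line: result.error.location?.line,
          });
        }
      }
    }
  }
  return failures;
}

// The trimmed report shape from the example above.
const report = {
  stats: { expected: 5, unexpected: 1, duration: 7243 },
  suites: [{
    title: "add-habit-form.spec.js",
    specs: [{
      title: "@smoke should delete a habit",
      ok: false,
      tests: [{
        status: "failed",
        results: [{
          error: {
            message: "locator('.delete-habit-btn') exceeded timeout",
            location: { file: "add-habit-form.spec.js", line: 52 },
          },
        }],
      }],
    }],
  }],
};

console.log(extractFailures(report));
```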
When playwright-cli Isn't Enough: The MCP Server
playwright-cli is stateless — each command opens the browser, acts, closes. For most triage tasks that's fine. But some investigations need state across multiple steps:
- Verify a habit persists after page refresh
- Check that streak counts update correctly after adding habits on consecutive days
- Explore an entire multi-page flow without re-navigating from scratch
For those cases, the Playwright MCP server keeps a browser alive across the entire agent session. Think of it as a playwright-cli session that doesn't close between commands.
For my habit tracker I use it in the qa-test-generator agent — the one that explores the live app and generates new test coverage. It needs to navigate multiple routes, build up app state, and compare across page transitions. playwright-cli's per-command statelessness would make that awkward. The MCP server handles it naturally.
But — and this is important — don't reach for the MCP server by default. playwright-cli covers 90% of QA agent use cases with lower overhead. Use the MCP server specifically when your investigation needs persistent browser state.
The Mental Model, Simplified
npx playwright test = your test runner
answers: WHAT broke?
cost: low (with --grep + --reporter=json)
playwright-cli = your agent's eyes on the live app
answers: WHY did it break?
cost: very low (snapshot is ~150 tokens)
playwright MCP server = your agent's stateful browser session
answers: WHAT ELSE might be broken?
cost: low, but only use when you need state
The shift is this: you stop reading files to reconstruct what the UI looks like. You just look at it. That's what playwright-cli makes possible for agents — and it's why the token difference is so dramatic.
Quick Reference: Token Cost By Command
npx playwright test --list ~80 tokens
npx playwright test --grep "@smoke" --reporter=json ~400 tokens
npx playwright test --reporter=json (full suite) ~3,000 tokens
playwright-cli open <url> ~30 tokens
playwright-cli snapshot ~150 tokens
playwright-cli click <ref> ~20 tokens
playwright-cli screenshot ~40 tokens
playwright-cli network ~200 tokens
Reading entire spec file ~3,200 tokens
Reading one test block ~280 tokens
Reading playwright.config.js ~180 tokens
The pattern is clear. Every playwright-cli command costs less than reading a single source file. And it gives you ground truth about the live app rather than a static snapshot of code that may not reflect the current deployed state.
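With costs this predictable, you can budget an investigation before running it. A quick sketch using the per-command numbers from the table (these are this post's rough estimates, not measured values):

```javascript
// Rough per-step token costs, taken from the table above (estimates only).
const COST = {
  smokeRun: 400,      // npx playwright test --grep "@smoke" --reporter=json
  open: 30,           // playwright-cli open <url>
  snapshot: 150,      // playwright-cli snapshot
  click: 20,          // playwright-cli click <ref>
  screenshot: 40,     // playwright-cli screenshot
  readTestBlock: 280, // reading one failing test block
};

// A typical triage plan: scoped run, observe, reproduce, confirm, read test.
const plan = ["smokeRun", "open", "snapshot", "click", "snapshot", "screenshot", "readTestBlock"];
const budget = plan.reduce((sum, step) => sum + COST[step], 0);
console.log(budget); // 1070
```

A full triage comes in around a thousand tokens, which is less than the cost of reading a single spec file.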
What I'd Do Differently
Use --grep from day one. My agents initially ran the full suite every time. Scoping to @smoke for quick checks cut token usage by 6x immediately.
playwright-cli snapshot before any file read. The answer is usually in the snapshot. Only go to source files when you need to understand assertion logic, not when you need to understand the DOM.
Role-based selectors over CSS classes. CSS class names change with deploys — getByRole('button', { name: 'Add Habit' }) stays stable. It also makes playwright-cli triage easier because role and name are visible in the snapshot output.
The GitHub Repo
All the agent files, CLAUDE.md, and the full test suite are here:
👉 https://github.com/anirseven/habit-tracker/tree/main/.claude/agents
If you're building something similar or have questions, drop a comment. Happy to go deeper on any part of this.