DEV Community: Ricardo Costa

Something new to read about Playwright CLI

Ricardo Costa — Wed, 17 Jun 2026 21:09:24 +0000

Ricardo Costa

Jun 11

Playwright CLI for agent-driven workflows: sessions, debugging, and CI Sharding

#cicd #aiops #playwright #cli

7 min read

Getting deeper into AI-powered QA workflows lately, loving it so far.

Ricardo Costa — Thu, 11 Jun 2026 11:02:56 +0000

Playwright CLI for agent-driven workflows: sessions, debugging, and CI Sharding

Ricardo Costa — Thu, 11 Jun 2026 09:50:58 +0000

Playwright has excellent tooling around browser automation, but most of the ecosystem still treats it as a test framework. For teams running AI coding agents and automated browser workflows, there is a different set of requirements:

browser automation
    ↓
session persistence across runs
    ↓
debuggable traces when things go wrong
    ↓
parallel execution across CI shards

The Playwright CLI directly addresses these gaps. It ships as a standalone npm package and exposes browser operations as CLI commands (open, click, type, snapshot) without requiring a Node.js script or test runner.

npm package: @playwright/cli
GitHub: https://github.com/microsoft/playwright-cli

The current implementation focuses on:

session persistence with named instances and portable state
video and trace recording built into every session
CI sharding for parallel execution at scale

session persistence

The default behaviour keeps browser state in memory. Cookies and localStorage are preserved between CLI calls within the session, but cleared when the browser closes. For repeatable workflows, that breaks down fast — logging into an application before every run wastes time and introduces flakiness.

Named sessions let you run multiple browser instances simultaneously and address them by name:

playwright-cli -s=admin open https://app.example.com/admin
playwright-cli -s=checkout open https://app.example.com/checkout

Each session is an isolated browser instance. An agent can orchestrate workflows across multiple authenticated contexts without state leaking between them. The goal is straightforward:

the same CLI binary should be able to maintain independent browser contexts for parallel workflows without requiring environment-specific configuration.

The critical piece for CI and agent reuse is state persistence:

# log in once
playwright-cli -s=admin open https://app.example.com/login
playwright-cli -s=admin fill "#username" "admin"
playwright-cli -s=admin fill "#password" "$ADMIN_PASS"
playwright-cli -s=admin click "button[type=submit]"

# save cookies + localStorage to portable JSON
playwright-cli -s=admin state-save admin-auth.json

# restore in any future session, no re-authentication
playwright-cli -s=admin state-load admin-auth.json
playwright-cli -s=admin open https://app.example.com/dashboard

The state-save and state-load commands persist cookies and localStorage to a portable JSON file, so a single login can be restored in every later session without re-authenticating.
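Before restoring a saved state in CI, it can help to check that the auth cookie in the file is still valid. The sketch below assumes the saved file follows Playwright's storage-state layout (a `cookies` array plus per-origin `localStorage`); that layout, the cookie name `session_id`, and the helper `hasLiveCookie` are assumptions for illustration, not confirmed behaviour of the CLI.

```typescript
// Assumed shape: Playwright's storage-state JSON (cookies + per-origin localStorage).
interface SavedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}

interface SavedState {
  cookies: SavedCookie[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Returns true when the named cookie exists and has not expired at `nowSeconds`.
function hasLiveCookie(state: SavedState, cookieName: string, nowSeconds: number): boolean {
  return state.cookies.some(
    (c) => c.name === cookieName && (c.expires === -1 || c.expires > nowSeconds),
  );
}

// Hypothetical data standing in for the parsed contents of admin-auth.json.
const exampleState: SavedState = {
  cookies: [
    { name: "session_id", value: "abc", domain: "app.example.com", path: "/", expires: 2000 },
  ],
  origins: [],
};

console.log(hasLiveCookie(exampleState, "session_id", 1000)); // true
console.log(hasLiveCookie(exampleState, "session_id", 3000)); // false
```

If the check fails, a CI job could fall back to the login steps and save a fresh state file instead of loading a stale one.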

For long-lived workflows that need full browser profile persistence across restarts:

playwright-cli open https://app.example.com --persistent

The --persistent flag saves the complete browser profile to disk.
Cookies, extensions, service workers, and IndexedDB survive browser restarts.
This is effectively a reusable browser identity.

Session management at scale:

playwright-cli list                    # list all active sessions
playwright-cli -s=stale close          # stop a named browser
playwright-cli -s=stale delete-data    # clean up user data
playwright-cli close-all               # close all browsers
playwright-cli kill-all                # forcefully kill all browser processes

Agents can also pick up the session name from the environment:

PLAYWRIGHT_CLI_SESSION=todo-app claude .

why not just screenshot failures?

Many teams rely on failure screenshots as their primary debugging signal. That approach tends to be fragile because:

a screenshot captures one moment, not the sequence that led to it
timing issues are invisible in a static image
network requests and console errors are absent
the agent performing the actions may interact with the page in unexpected ways

Instead, the Playwright CLI provides two built-in recording mechanisms that capture the full execution context.

video recording

The CLI can record .webm video of an entire session:

playwright-cli -s=checkout video-start session-debug.webm
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout video-stop

During recording, you can annotate actions with callouts and chapter markers:

playwright-cli video-show-actions      # annotate each action with a callout
playwright-cli video-chapter "Login flow"
playwright-cli video-chapter "Checkout flow"
playwright-cli video-hide-actions      # stop annotating

This produces a timestamped, annotated video of exactly what happened. When an AI agent clicks the wrong element or navigates unexpectedly, the video shows the sequence as it occurred, not a single post-mortem screenshot.

Video can also be enabled declaratively in the config file:

{
  "saveVideo": {
    "width": 1280,
    "height": 720
  }
}

trace recording

Traces go deeper than video.
A Playwright Trace file contains:

full DOM snapshots at each action
network requests and responses
console logs
execution timeline

Recording a trace:

playwright-cli -s=checkout tracing-start
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout tracing-stop

The output is a trace file inspectable in the Playwright Trace Viewer:

npx playwright show-trace trace.zip

Or open https://trace.playwright.dev in a browser and drop the file in.

Inside the viewer you can step through every action, inspect the DOM at each point, examine network requests, view console output, and see exactly what the browser rendered at each step. This is critical when an agent does something unexpected; you are not guessing from a screenshot, you are replaying the entire session.

Traces can also be enabled via environment variable:

PLAYWRIGHT_MCP_SAVE_TRACE=1

the visual dashboard

For real-time observation of running agent sessions:

playwright-cli show

This opens a window with a session grid showing all active sessions grouped by workspace, each with a live screencast preview, current URL, and page title. Click any session to zoom in and take control; click into the viewport to drive the browser manually, and press Escape to release. From the grid you can also close sessions or delete data for inactive ones.

For design review and UI feedback, the dashboard supports annotations:

playwright-cli show --annotate

CI sharding

Sharding is where the CLI integrates with Playwright Test's parallel execution model. The core idea: split your test suite into N shards, run each shard on a separate CI job, and merge the results.

Playwright Test natively supports sharding:

npx playwright test --shard=1/4
npx playwright test --shard=2/4
npx playwright test --shard=3/4
npx playwright test --shard=4/4

Each shard runs an approximately equal portion of the test files. When fullyParallel: true is enabled in the config, sharding balances at the individual test level rather than the file level, producing more even distribution.
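To make the split concrete, here is a rough sketch of dividing an ordered list of test files into contiguous, near-equal shards. It is an illustration only, not Playwright's actual distribution algorithm; `shardSlice` and the file names are hypothetical.

```typescript
// Returns the slice of `items` assigned to shard `index` (1-based) out of `total`.
// The first `items.length % total` shards each receive one extra item.
function shardSlice<T>(items: T[], index: number, total: number): T[] {
  if (total < 1 || index < 1 || index > total) {
    throw new RangeError(`invalid shard ${index}/${total}`);
  }
  const base = Math.floor(items.length / total);
  const extra = items.length % total;
  const start = (index - 1) * base + Math.min(index - 1, extra);
  const size = base + (index <= extra ? 1 : 0);
  return items.slice(start, start + size);
}

const files = ["a.spec.ts", "b.spec.ts", "c.spec.ts", "d.spec.ts", "e.spec.ts", "f.spec.ts"];
for (let shard = 1; shard <= 4; shard++) {
  console.log(`shard ${shard}/4: ${shardSlice(files, shard, 4).join(", ")}`);
}
// shard 1/4: a.spec.ts, b.spec.ts
// shard 2/4: c.spec.ts, d.spec.ts
// shard 3/4: e.spec.ts
// shard 4/4: f.spec.ts
```

Every file lands in exactly one shard, and shard sizes differ by at most one, which is the property that keeps CI job durations roughly comparable when tests take similar time.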

The CLI session model composes naturally with sharded CI jobs. Each shard gets its own named session:

# CI job for shard 1
export PLAYWRIGHT_CLI_SESSION="shard-1"
npx playwright test --shard=1/4

# CI job for shard 2
export PLAYWRIGHT_CLI_SESSION="shard-2"
npx playwright test --shard=2/4

This keeps browser state isolated between shards. Sessions run headlessly by default on CI; pass --headed to the open command only when you need to observe a specific session.

merging reports across shards

Each shard produces its own report.
To produce a unified view, use the blob reporter:

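// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // blob output on CI so shard reports can be merged later; HTML locally
  reporter: process.env.CI ? 'blob' : 'html',
});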

Blob reports contain all test results plus attachments: traces, screenshots, and video.
After all shards complete, merge:

npx playwright merge-reports --reporter html ./all-blob-reports

This produces a single HTML report in playwright-report/ with the combined results from every shard, including all traces and videos from every session.

GitHub Actions example

name: Playwright Tests
on:
  push:
    branches: [main]
jobs:
  test:
    strategy:
      fail-fast: false  # let the remaining shards finish when one shard fails
      matrix:
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          PLAYWRIGHT_CLI_SESSION: "shard-${{ matrix.shardIndex }}"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shardIndex }}
          path: blob-report/
  merge-reports:
    if: always()
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - uses: actions/download-artifact@v4
        with:
          pattern: blob-report-*
          path: all-blob-reports
          merge-multiple: true  # place every shard's blob files in one directory for merge-reports
      - run: npx playwright merge-reports --reporter html ./all-blob-reports
      - uses: actions/upload-artifact@v4
        with:
          name: html-report
          path: playwright-report/


Each shard runs independently, uploads its blob report, and a final job merges everything into a single HTML report with all traces and videos attached.

the agent debugging loop

The part I find most interesting is how these features compose into a debugging workflow for AI-driven browser automation.

When an agent runs a workflow and something fails, the typical debugging approach involves:

- thousands of log lines
- screenshots
- console output
- stack traces

This works, but scales poorly.
As more workflows run through agents, the volume of debugging context grows quickly.

With the CLI tooling, the debugging signal is structured from the start:

1. Open the merged HTML report → see which shard and which test failed
2. Open the trace for that session → step through every action frame by frame
3. Watch the session video → see the visual result of each action
4. Inspect network requests in the trace → confirm API calls returned expected data
5. Check console output in the trace → catch JavaScript errors on the page

The goal is not just better debugging. The goal is to reduce the time between "something went wrong" and "I can see exactly what happened."

open questions

Some areas I am currently exploring:

should session state be committed to the repository for deterministic replay, or kept ephemeral?
at what point does per-shard video recording become too expensive in storage?
can trace diffs between passing and failing runs be automated to highlight the exact divergence point?
should agents receive the full trace or a machine-readable summary first?
how much of the debugging loop can be automated before human review becomes necessary?
what is the smallest useful artifact set for an agent to diagnose a failure: one trace, one video, one screenshot, or all three?

next steps

Current roadmap items include:

deeper integration between CLI session state and Playwright Test fixtures
automated trace diffing between baseline and failing runs
agent-friendly failure summaries as structured output
shard-aware video and trace artifact routing
persistent session profiles as reusable CI artifacts
locator stability analysis from trace data

Curious how other teams running Playwright CLI in production agent-driven workflows are approaching these problems.

Building a CI helper for Playwright Java

Ricardo Costa — Tue, 02 Jun 2026 18:29:35 +0000

Playwright has excellent tooling around browser automation, but most of the ecosystem still feels heavily Node.js-centric.

For Java teams, there's a surprising amount of infrastructure work that sits between:

git push
   ↓
ci execution
   ↓
useful failure diagnostics

To explore that gap, I built a small Java CLI:

GitHub repo:
https://github.com/ricardo-costa0405/playwright-java-ci-helper

The current implementation focuses on:

build system detection
test execution
artifact collection
machine-readable failure summaries

build system detection

The first requirement was zero project configuration.

The helper attempts to detect:

./mvnw
pom.xml
./gradlew
build.gradle
build.gradle.kts

and automatically generates the appropriate execution strategy.
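The detection order can be sketched roughly as follows, written in TypeScript purely for illustration (the helper itself is a Java CLI). Preferring wrapper scripts over a globally installed `mvn` or `gradle`, checking Maven before Gradle, and the exact commands returned are all assumptions rather than the helper's confirmed behaviour; `detectTestCommand` is a hypothetical name.

```typescript
// Maps the marker files found in a project root to a test command.
// Wrappers are preferred so the project's pinned build-tool version is used.
function detectTestCommand(rootFiles: ReadonlySet<string>): string | null {
  const hasMaven = rootFiles.has("pom.xml");
  const hasGradle = rootFiles.has("build.gradle") || rootFiles.has("build.gradle.kts");
  if (hasMaven) {
    return rootFiles.has("mvnw") ? "./mvnw test" : "mvn test";
  }
  if (hasGradle) {
    return rootFiles.has("gradlew") ? "./gradlew test" : "gradle test";
  }
  return null; // no supported build system detected
}

console.log(detectTestCommand(new Set(["pom.xml", "mvnw"]))); // ./mvnw test
console.log(detectTestCommand(new Set(["build.gradle.kts"]))); // gradle test
console.log(detectTestCommand(new Set(["README.md"]))); // null
```

Taking a set of file names instead of touching the filesystem keeps the decision logic easy to unit test; the caller would list the project directory once and pass the names in.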

The goal is straightforward:

the same binary should be able to run inside arbitrary Playwright Java repositories without requiring repository-specific configuration.

This allows the tool to work consistently across Maven and Gradle projects while keeping onboarding friction close to zero.

test execution

The helper can execute either an automatically detected build command or a user-supplied command.

Examples:

java -jar playwright-java-ci-helper.jar \
  --project-dir my-project

java -jar playwright-java-ci-helper.jar \
  --test-command "mvn test -Dtest=LoginTest"

An optional setup phase can also be executed before running tests.

This allows repositories to perform environment preparation, Playwright installation, or custom bootstrap steps before execution begins.

why not parse console logs?

Many CI systems still derive test status from stdout.

That approach tends to be fragile because:

log formats change
plugins inject additional output
parallel execution interleaves messages
different frameworks produce different structures

Instead, the helper parses JUnit XML directly and extracts:

tests
failures
errors
skipped

from the actual source of truth.

This produces deterministic results regardless of how verbose or customized the console output becomes.
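As a rough illustration of reading counts from the report rather than from stdout, the sketch below pulls the four attributes off each `<testsuite>` element in a JUnit XML string. It uses a regular expression to stay dependency-free, which is a shortcut taken for brevity; a real implementation would more likely use an XML parser. `summarizeJUnit` and the sample XML are hypothetical.

```typescript
interface JUnitCounts {
  tests: number;
  failures: number;
  errors: number;
  skipped: number;
}

// Sums the count attributes across every <testsuite ...> opening tag in the XML.
// The \b after "testsuite" prevents a match on a wrapping <testsuites> element.
function summarizeJUnit(xml: string): JUnitCounts {
  const totals: JUnitCounts = { tests: 0, failures: 0, errors: 0, skipped: 0 };
  const suiteTags = xml.match(/<testsuite\b[^>]*>/g) ?? [];
  for (const tag of suiteTags) {
    for (const key of Object.keys(totals) as (keyof JUnitCounts)[]) {
      const attr = new RegExp(`\\b${key}="(\\d+)"`).exec(tag);
      if (attr) totals[key] += Number(attr[1]);
    }
  }
  return totals;
}

const sampleXml = `
<testsuite name="LoginTest" tests="3" failures="1" errors="0" skipped="1">
  <testcase name="logsIn"/>
</testsuite>
<testsuite name="CartTest" tests="2" failures="0" errors="1" skipped="0"/>`;

console.log(JSON.stringify(summarizeJUnit(sampleXml)));
// {"tests":5,"failures":1,"errors":1,"skipped":1}
```

Note that this sketch would double count if suites were nested inside one another, which is one more reason to prefer a proper parser outside of a quick example.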

artifact collection

The less obvious challenge is artifact discovery.

A failing Playwright run can generate output across multiple locations:

target/surefire-reports
target/failsafe-reports
build/test-results
build/reports/tests
playwright-report
test-results
screenshots
videos
traces

depending on:

build tool
project structure
reporting configuration
team conventions

The helper currently collects only artifacts generated during the active execution window.

This avoids a common CI problem where stale artifacts from previous executions are accidentally included in failure analysis.
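One way to implement such a window, sketched under the assumption that file modification times are a reliable signal (how the helper actually determines the window is not described here). The `ArtifactFile` shape and `artifactsFromRun` are hypothetical names.

```typescript
interface ArtifactFile {
  path: string;
  modifiedMs: number; // last-modified time in epoch milliseconds
}

// Keeps only files last modified inside [startMs, endMs], i.e. produced by this run.
function artifactsFromRun(files: ArtifactFile[], startMs: number, endMs: number): ArtifactFile[] {
  return files.filter((f) => f.modifiedMs >= startMs && f.modifiedMs <= endMs);
}

const found: ArtifactFile[] = [
  { path: "target/surefire-reports/TEST-LoginTest.xml", modifiedMs: 1500 },
  { path: "traces/old-run.zip", modifiedMs: 200 }, // left over from an earlier run
  { path: "traces/login.zip", modifiedMs: 1800 },
];

console.log(artifactsFromRun(found, 1000, 2000).map((f) => f.path).join(", "));
// target/surefire-reports/TEST-LoginTest.xml, traces/login.zip
```

The stale `traces/old-run.zip` is dropped because it predates the run, which is exactly the case the paragraph above warns about.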

CI sharding

One area I wanted to support from the beginning was CI parallelization.
The helper exports:

PW_JAVA_CI_SHARD_INDEX
PW_JAVA_CI_SHARD_TOTAL
PW_JAVA_CI_WORKERS

and automatically injects equivalent parameters into Maven and Gradle executions.

Example:

java -jar playwright-java-ci-helper.jar \
  --shard-index 2 \
  --shard-total 4 \
  --workers 3

The idea is to keep orchestration concerns outside the test implementation itself.

machine-readable failure context

The part I find most interesting isn't the reporting itself.
It's creating a deterministic interface between CI systems and automated tooling.
Today, many teams experimenting with agents and AI-assisted debugging are still passing large amounts of raw information:

thousands of log lines
screenshots
reports
traces
console output

The approach works, but it scales poorly.
As more platforms move toward API-based billing models, context size starts becoming an engineering concern rather than just an implementation detail.

Instead of sending:

4000+ lines of CI logs

a tool can provide:

{
  "tests": 182,
  "failures": 3,
  "screenshots": 3,
  "traces": 3,
  "failedTests": [...]
}

The goal isn't only to improve signal quality.
The goal is to reduce the amount of context required for an agent to reason about a failure.

This becomes increasingly important when traces, screenshots, reports, and execution logs start accumulating across hundreds or thousands of CI runs.

I suspect we'll see more tooling move in this direction as agents become part of the standard engineering workflow.

generating playwright java skeletons

I've also been experimenting with generating Playwright Java test skeletons from browser interaction flows and agent command scripts.

For example:

playwright-cli open https://demo.playwright.dev/todomvc
playwright-cli type "Buy groceries"
playwright-cli press Enter
playwright-cli screenshot

can be transformed into a Java test template.
One interesting limitation is locator generation.

Agent references such as:

e21
e37
e42

cannot safely be translated into stable Playwright locators.
The generated code compiles, but locator selection remains a human responsibility.

At least for now, a human-in-the-loop approach feels significantly more realistic than fully autonomous test generation.
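A minimal sketch of that translation step, written in TypeScript for illustration. It handles only the four commands shown above, does no string escaping, and turns any command that targets an agent reference into a TODO comment instead of guessing a locator. `toJavaStatement` is a hypothetical name, and the mapping is an assumption about what such a generator might emit, not the author's implementation.

```typescript
// Translates one playwright-cli command line into a Playwright Java statement.
// Commands that target an agent element ref (e.g. "e21") become TODO comments,
// because a ref carries no information about a stable locator.
function toJavaStatement(command: string): string {
  const parts = command.trim().match(/"[^"]*"|\S+/g) ?? [];
  const [tool, verb, ...args] = parts;
  if (tool !== "playwright-cli" || !verb) {
    return `// TODO: unrecognized line: ${command}`;
  }
  const arg = (args[0] ?? "").replace(/^"|"$/g, ""); // strip surrounding quotes
  if (/^e\d+$/.test(arg)) {
    return `// TODO: choose a stable locator for ${verb} (agent ref ${arg})`;
  }
  switch (verb) {
    case "open":
      return `page.navigate("${arg}");`;
    case "type":
      return `page.keyboard().type("${arg}");`;
    case "press":
      return `page.keyboard().press("${arg}");`;
    case "screenshot":
      return `page.screenshot();`;
    default:
      return `// TODO: unsupported command: ${verb}`;
  }
}

const script = [
  "playwright-cli open https://demo.playwright.dev/todomvc",
  'playwright-cli type "Buy groceries"',
  "playwright-cli press Enter",
  "playwright-cli click e21",
];
console.log(script.map(toJavaStatement).join("\n"));
// page.navigate("https://demo.playwright.dev/todomvc");
// page.keyboard().type("Buy groceries");
// page.keyboard().press("Enter");
// // TODO: choose a stable locator for click (agent ref e21)
```

Emitting a TODO for each reference keeps the skeleton honest: the generated file shows exactly where a human still has to pick a locator.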

open questions

Some areas I'm currently exploring:

should JUnit parsing remain framework-agnostic, or should framework-specific adapters be introduced for richer diagnostics (e.g. TestNG retries, groups and dependencies)?
is artifact collection better handled through plugins than filesystem discovery?
what is the smallest useful schema for agent-driven failure analysis?
can locator repair be performed safely without introducing additional flakiness?
how much CI context should be exposed to agents before signal becomes noise?

next steps

Current roadmap items include:

TestNG support
richer failure diagnostics
AI-friendly summaries
SARIF output
environment validation ("doctor" command)
locator repair suggestions
deeper agent integrations

The project is still in its early stages, but the objective is simple:
Build better tooling around the gap between test execution and actionable failure diagnostics for Playwright Java teams.

I'm curious how other teams running Playwright Java at scale are approaching these problems.