Ricardo Costa

Posted on Jun 11 • Edited on Jun 16

Playwright CLI for agent-driven workflows: sessions, debugging, and CI sharding

#cicd #aiops #playwright #cli

Playwright has excellent tooling around browser automation, but most of the ecosystem still treats it as a test framework. For teams running AI coding agents and automated browser workflows, there is a different set of requirements:

browser automation
    ↓
session persistence across runs
    ↓
debuggable traces when things go wrong
    ↓
parallel execution across CI shards

The Playwright CLI directly addresses these gaps. It ships as a standalone npm package and exposes browser operations as CLI commands (open, click, type, snapshot) without requiring a Node.js script or test runner.

npm package: @playwright/cli
GitHub: https://github.com/microsoft/playwright-cli

The current implementation focuses on:

session persistence with named instances and portable state
video and trace recording built into every session
CI sharding for parallel execution at scale

session persistence

The default behaviour keeps browser state in memory. Cookies and localStorage are preserved between CLI calls within the session, but cleared when the browser closes. For repeatable workflows, that breaks down fast: logging into an application before every run wastes time and introduces flakiness.

Named sessions let you run multiple browser instances simultaneously and address them by name:

playwright-cli -s=admin open https://app.example.com/admin
playwright-cli -s=checkout open https://app.example.com/checkout

Each session is an isolated browser instance. An agent can orchestrate workflows across multiple authenticated contexts without state leaking between them. The goal is straightforward:

the same CLI binary should be able to maintain independent browser contexts for parallel workflows without requiring environment-specific configuration.
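For an agent harness that shells out to the CLI, the per-session addressing above can be wrapped in a small helper. This is a minimal sketch that assumes only the `-s=<name>` flag shown in the examples; `sessionArgs` is a hypothetical helper, not part of the CLI.

```typescript
// Hypothetical helper: builds the argument list for one playwright-cli call
// against a named session. Only the `-s=<name>` flag from the examples above
// is assumed; nothing here talks to a browser.
function sessionArgs(session: string, command: string, ...rest: string[]): string[] {
  // Keep names simple so they are safe to pass on a command line.
  if (!/^[A-Za-z0-9_-]+$/.test(session)) {
    throw new Error(`invalid session name: ${session}`);
  }
  return [`-s=${session}`, command, ...rest];
}

const adminOpen = sessionArgs("admin", "open", "https://app.example.com/admin");
const checkoutOpen = sessionArgs("checkout", "open", "https://app.example.com/checkout");
console.log(adminOpen.join(" "));
console.log(checkoutOpen.join(" "));
```

The resulting arrays can be handed to something like Node's `child_process.spawn("playwright-cli", args)`, which sidesteps shell quoting problems with selectors.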

The critical piece for CI and agent reuse is state persistence:

# log in once
playwright-cli -s=admin open https://app.example.com/login
playwright-cli -s=admin fill "#username" "admin"
playwright-cli -s=admin fill "#password" "$ADMIN_PASS"
playwright-cli -s=admin click "button[type=submit]"

# save cookies + localStorage to portable JSON
playwright-cli -s=admin state-save admin-auth.json

# restore in any future session - no re-authentication
playwright-cli -s=admin state-load admin-auth.json
playwright-cli -s=admin open https://app.example.com/dashboard

The state-save and state-load commands persist cookies and localStorage to a portable JSON file. Log in once, then restore that auth state in every future session instead of re-authenticating.
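Before restoring a saved file in CI, it can help to check that it still contains usable cookies. The sketch below assumes that `state-save` writes the same JSON shape as Playwright's `storageState` (a `cookies` array with `expires` in Unix seconds and `-1` for session cookies, plus an `origins` array). That shape is an assumption about the CLI output, and `hasLiveCookie` is a hypothetical helper.

```typescript
// Assumed shape: mirrors Playwright's storageState JSON. Verify against a
// real admin-auth.json before relying on it.
interface SavedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}

interface SavedState {
  cookies: SavedCookie[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Returns true when at least one cookie for the domain has not expired yet.
function hasLiveCookie(state: SavedState, domain: string, nowSeconds: number): boolean {
  return state.cookies.some(
    (c) => c.domain.endsWith(domain) && (c.expires === -1 || c.expires > nowSeconds)
  );
}

// Hand-written sample data, not real CLI output.
const sample: SavedState = {
  cookies: [
    { name: "sid", value: "abc", domain: "app.example.com", path: "/", expires: 2000 },
    { name: "old", value: "x", domain: "app.example.com", path: "/", expires: 500 },
  ],
  origins: [],
};

console.log(hasLiveCookie(sample, "example.com", 1000)); // true: "sid" expires at 2000
console.log(hasLiveCookie(sample, "example.com", 3000)); // false: both cookies have expired
```

A CI job could run a check like this first and fall back to the login sequence only when it returns false.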

For long-lived workflows that need full browser profile persistence across restarts:

playwright-cli open https://app.example.com --persistent

The --persistent flag saves the complete browser profile to disk.
Cookies, extensions, service workers, and IndexedDB survive browser restarts.
This is effectively a reusable browser identity.

Session management at scale:

playwright-cli list                    # list all active sessions
playwright-cli -s=stale close          # stop a named browser
playwright-cli -s=stale delete-data    # clean up user data
playwright-cli close-all               # close all browsers
playwright-cli kill-all                # forcefully kill all browser processes

Agents can also pick up the session name from the environment:

PLAYWRIGHT_CLI_SESSION=todo-app claude .

why not just screenshot failures?

Many teams rely on failure screenshots as their primary debugging signal. That approach tends to be fragile because:

a screenshot captures one moment, not the sequence that led to it
timing issues are invisible in a static image
network requests and console errors are absent
the agent performing the actions may interact with the page in unexpected ways

Instead, the Playwright CLI provides two built-in recording mechanisms that capture the full execution context.

video recording

The CLI can record .webm video of an entire session:

playwright-cli -s=checkout video-start session-debug.webm
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout video-stop

During recording, you can annotate actions with callouts and chapter markers:

playwright-cli video-show-actions      # annotate each action with a callout
playwright-cli video-chapter "Login flow"
playwright-cli video-chapter "Checkout flow"
playwright-cli video-hide-actions      # stop annotating

This produces a timestamped, annotated video of exactly what happened. When an AI agent clicks the wrong element or navigates unexpectedly, the video shows the sequence as it occurred, not a single post-mortem screenshot.

Video can also be enabled declaratively in the config file:

{
  "saveVideo": {
    "width": 1280,
    "height": 720
  }
}

trace recording

Traces go deeper than video.
A Playwright Trace file contains:

full DOM snapshots at each action
network requests and responses
console logs
execution timeline

Recording a trace:

playwright-cli -s=checkout tracing-start
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout tracing-stop

The output is a trace file inspectable in the Playwright Trace Viewer:

npx playwright show-trace trace.zip

Or open https://trace.playwright.dev in a browser and drop the file in.

Inside the viewer you can step through every action, inspect the DOM at each point, examine network requests, view console output, and see exactly what the browser rendered at each step. This is critical when an agent does something unexpected; you are not guessing from a screenshot, you are replaying the entire session.

Traces can also be enabled via environment variable:

PLAYWRIGHT_MCP_SAVE_TRACE=1

the visual dashboard

For real-time observation of running agent sessions:

playwright-cli show

This opens a window with a session grid showing all active sessions grouped by workspace, each with a live screencast preview, current URL, and page title. Click any session to zoom in and take control: click into the viewport to drive the browser manually, and press Escape to release. From the grid you can also close sessions or delete data for inactive ones.

For design review and UI feedback, the dashboard supports annotations:

playwright-cli show --annotate

CI sharding

Sharding is where the CLI integrates with Playwright Test's parallel execution model. The core idea: split your test suite into N shards, run each shard on a separate CI job, and merge the results.

Playwright Test natively supports sharding:

npx playwright test --shard=1/4
npx playwright test --shard=2/4
npx playwright test --shard=3/4
npx playwright test --shard=4/4

Each shard runs an approximately equal portion of the test files. When fullyParallel: true is enabled in the config, sharding balances at the individual test level rather than the file level, producing more even distribution.
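To make the file-level split concrete, here is an illustrative sketch of dividing an ordered list of test files into contiguous, near-equal shards. It is not Playwright's actual algorithm, only one way to obtain "approximately equal portions"; `shardSlice` and the file names are hypothetical.

```typescript
// Illustrative only: contiguous near-equal split of an ordered list.
// shardIndex is 1-based, matching the --shard=1/4 notation.
function shardSlice<T>(items: T[], shardIndex: number, shardTotal: number): T[] {
  if (
    !Number.isInteger(shardIndex) ||
    !Number.isInteger(shardTotal) ||
    shardIndex < 1 ||
    shardIndex > shardTotal
  ) {
    throw new Error(`invalid shard ${shardIndex}/${shardTotal}`);
  }
  // Integer boundaries guarantee that adjacent shards neither overlap nor leave gaps.
  const start = Math.floor(((shardIndex - 1) * items.length) / shardTotal);
  const end = Math.floor((shardIndex * items.length) / shardTotal);
  return items.slice(start, end);
}

const files = ["a.spec.ts", "b.spec.ts", "c.spec.ts", "d.spec.ts", "e.spec.ts", "f.spec.ts"];
for (let i = 1; i <= 4; i++) {
  console.log(`shard ${i}/4:`, shardSlice(files, i, 4));
}
```

With this scheme every file lands in exactly one shard and shard sizes differ by at most one, which is why uneven test durations inside files can still leave shards unbalanced.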

The CLI session model composes naturally with sharded CI jobs. Each shard gets its own named session:

# CI job for shard 1
export PLAYWRIGHT_CLI_SESSION="shard-1"
npx playwright test --shard=1/4

# CI job for shard 2
export PLAYWRIGHT_CLI_SESSION="shard-2"
npx playwright test --shard=2/4

This keeps browser state isolated between shards: any playwright-cli command run inside a shard's job picks up that shard's session name from the environment. Sessions run headless by default on CI; pass --headed when you need to observe a specific session.
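If the session name should follow the shard automatically, it can be derived from the same `index/total` string that is passed to `--shard`. A minimal sketch: `sessionForShard` is a hypothetical helper, and the `shard-N` naming simply follows the example above.

```typescript
// Hypothetical helper: turns a --shard value such as "2/4" into the
// session name used in the examples above ("shard-2").
function sessionForShard(shardSpec: string): string {
  const match = /^(\d+)\/(\d+)$/.exec(shardSpec);
  if (match === null) {
    throw new Error(`expected "index/total", got "${shardSpec}"`);
  }
  const index = Number(match[1]);
  const total = Number(match[2]);
  if (index < 1 || index > total) {
    throw new Error(`shard index ${index} is outside 1..${total}`);
  }
  return `shard-${index}`;
}

console.log(sessionForShard("2/4")); // shard-2
```

The returned value would then be exported as PLAYWRIGHT_CLI_SESSION for the job, so the shard number and the session name cannot drift apart.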

merging reports across shards

Each shard produces its own report.
To produce a unified view, use the blob reporter:

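// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // On CI, emit blob reports so shards can be merged later; locally, use HTML.
  reporter: process.env.CI ? 'blob' : 'html',
});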

Blob reports contain all test results plus attachments: traces, screenshots, and video.
After all shards complete, merge:

npx playwright merge-reports --reporter html ./all-blob-reports

This produces a single HTML report in playwright-report/ with the combined results from every shard, including all traces and videos from every session.

GitHub Actions example

name: Playwright Tests
on:
  push:
    branches: [main]
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          PLAYWRIGHT_CLI_SESSION: "shard-${{ matrix.shardIndex }}"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shardIndex }}
          path: blob-report/
  merge-reports:
    if: always()
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - uses: actions/download-artifact@v4
        with:
          pattern: blob-report-*
          path: all-blob-reports
          merge-multiple: true
      - run: npx playwright merge-reports --reporter html ./all-blob-reports
      - uses: actions/upload-artifact@v4
        with:
          name: html-report
          path: playwright-report/


Each shard runs independently, uploads its blob report, and a final job merges everything into a single HTML report with all traces and videos attached.

the agent debugging loop

The part I find most interesting is how these features compose into a debugging workflow for AI-driven browser automation.

When an agent runs a workflow and something fails, the typical debugging approach involves:

- thousands of log lines
- screenshots
- console output
- stack traces

This works, but scales poorly.
As more workflows run through agents, the volume of debugging context grows quickly.

With the CLI tooling, the debugging signal is structured from the start:

1. Open the merged HTML report → see which shard and which test failed
2. Open the trace for that session → step through every action frame by frame
3. Watch the session video → see the visual result of each action
4. Inspect network requests in the trace → confirm API calls returned expected data
5. Check console output in the trace → catch JavaScript errors on the page

The goal is not just better debugging. The goal is to reduce the time between "something went wrong" and "I can see exactly what happened."
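The first step of that loop, finding which shard and which test failed and where the artifacts live, can be sketched as a small filter over per-test results. The `TestResult` shape below is hypothetical and is not the format of Playwright's reports; it only illustrates what an agent-readable triage list might contain.

```typescript
// Hypothetical result record; field names are illustrative, not a Playwright format.
interface TestResult {
  title: string;
  shard: number;
  status: "passed" | "failed" | "skipped";
  tracePath?: string;
  videoPath?: string;
}

interface TriageItem {
  title: string;
  shard: number;
  artifacts: string[]; // trace first, then video, mirroring the steps above
}

// Keeps only failures and lists their artifacts in the order a reviewer would open them.
function triage(results: TestResult[]): TriageItem[] {
  return results
    .filter((r) => r.status === "failed")
    .map((r) => ({
      title: r.title,
      shard: r.shard,
      artifacts: [r.tracePath, r.videoPath].filter((p): p is string => p !== undefined),
    }))
    .sort((a, b) => a.shard - b.shard);
}

// Hand-written sample data, not real report output.
const items = triage([
  { title: "checkout", shard: 3, status: "failed", tracePath: "trace-3.zip", videoPath: "video-3.webm" },
  { title: "login", shard: 1, status: "passed" },
  { title: "admin", shard: 2, status: "failed", tracePath: "trace-2.zip" },
]);
console.log(JSON.stringify(items, null, 2));
```

Handing an agent a short list like this before the full trace keeps the first pass cheap; the trace and video are opened only for the entries that need them.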

open questions

Some areas I am currently exploring:

should session state be committed to the repository for deterministic replay, or kept ephemeral?
at what point does per-shard video recording become too expensive in storage?
can trace diffs between passing and failing runs be automated to highlight the exact divergence point?
should agents receive the full trace or a machine-readable summary first?
how much of the debugging loop can be automated before human review becomes necessary?
what is the smallest useful artifact set for an agent to diagnose a failure: one trace, one video, one screenshot, or all three?

next steps

Current roadmap items include:

deeper integration between CLI session state and Playwright Test fixtures
automated trace diffing between baseline and failing runs
agent-friendly failure summaries as structured output
shard-aware video and trace artifact routing
persistent session profiles as reusable CI artifacts
locator stability analysis from trace data

Curious how other teams running Playwright CLI in production agent-driven workflows are approaching these problems.

Top comments (3)

Alex Shev • Jun 11

Playwright is a good fit for agent-driven workflows because it gives the agent a real interface contract instead of a vague browser instruction. Sessions, traces, screenshots, and sharding all create evidence the agent can use or hand back to a human.

The part I like most is that browser automation becomes inspectable. If the agent says the flow works, you can ask for the trace or screenshot instead of trusting a text summary.

Ricardo Costa • Jun 11 • Edited

I've been using it in my framework, and it helps a lot with the follow-up.

await test.step('Open Portfolio', async () => { ...});

and

npx playwright test --trace on

which provides screenshots, DOM snapshots, network activity, and trace playback.

Double CHEN • Jun 30

Great breakdown of the CLI session model for agents. I hit a wall with Playwright on sites that fingerprint — LinkedIn, Reddit, anything behind Cloudflare. Switched to browser-act CLI which adds stealth Chromium on top of the same session concept. The session management API is similar, but the anti-detection layer means you don't solve auth for every protected site. Different tradeoff: Playwright has richer traces, browser-act handles anti-bot automatically. npx skills add browser-act/skills --skill browser-act if you want to compare approaches.