Ricardo Costa

Playwright CLI for agent-driven workflows: sessions, debugging, and CI Sharding

Playwright has excellent tooling around browser automation, but most of the ecosystem still treats it as a test framework. For teams running AI coding agents and automated browser workflows, there is a different set of requirements:

browser automation
    ↓
session persistence across runs
    ↓
debuggable traces when things go wrong
    ↓
parallel execution across CI shards

The Playwright CLI addresses these gaps directly. It ships as a standalone npm package and exposes browser operations as CLI commands (open, click, type, snapshot) without requiring a Node.js script or test runner.

npm package: @playwright/cli
GitHub: https://github.com/microsoft/playwright-cli

The current implementation focuses on:

  • session persistence with named instances and portable state
  • video and trace recording built into every session
  • CI sharding for parallel execution at scale

session persistence

The default behaviour keeps browser state in memory. Cookies and localStorage are preserved between CLI calls within the session, but cleared when the browser closes. For repeatable workflows, that breaks down fast: logging into an application before every run wastes time and introduces flakiness.

Named sessions let you run multiple browser instances simultaneously and address them by name:

playwright-cli -s=admin open https://app.example.com/admin
playwright-cli -s=checkout open https://app.example.com/checkout

Each session is an isolated browser instance. An agent can orchestrate workflows across multiple authenticated contexts without state leaking between them. The goal is straightforward: the same CLI binary should maintain independent browser contexts for parallel workflows without requiring environment-specific configuration.
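
When an agent drives several sessions from Node.js, it helps to build each command's argument list in one place. The following is a minimal sketch: `sessionArgs` is a hypothetical helper, not part of the CLI, and it relies only on the `-s=<name>` flag shown above.

```typescript
// Build the argument list for one playwright-cli call in a named session.
// `sessionArgs` is a hypothetical helper; only the `-s=<name>` flag from the
// examples above is assumed to exist.
function sessionArgs(session: string, command: string, ...args: string[]): string[] {
  // Restrict names to a conservative character set so they are safe to log and pass along.
  if (!/^[A-Za-z0-9_-]+$/.test(session)) {
    throw new Error(`unsafe session name: ${session}`);
  }
  return [`-s=${session}`, command, ...args];
}

// Two isolated sessions, mirroring the commands above.
const adminOpen = sessionArgs("admin", "open", "https://app.example.com/admin");
const checkoutOpen = sessionArgs("checkout", "open", "https://app.example.com/checkout");

console.log(adminOpen.join(" "));
console.log(checkoutOpen.join(" "));
```

Each array can be handed to `execFileSync("playwright-cli", adminOpen)` from `node:child_process`; passing an array rather than a single string avoids shell quoting problems with URLs and selectors.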

The critical piece for CI and agent reuse is state persistence.

Log in once:

playwright-cli -s=admin open https://app.example.com/login
playwright-cli -s=admin fill "#username" "admin"
playwright-cli -s=admin fill "#password" "$ADMIN_PASS"
playwright-cli -s=admin click "button[type=submit]"

Save cookies and localStorage to portable JSON:

playwright-cli -s=admin state-save admin-auth.json

Restore in any future session, with no re-authentication:

playwright-cli -s=admin state-load admin-auth.json
playwright-cli -s=admin open https://app.example.com/dashboard

The state-save and state-load commands persist cookies and localStorage to a portable JSON file. Log in once, then restore auth in every future session without re-authenticating.
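
A saved state file is only useful while its cookies are still valid, so a CI job may want to check it before reuse. Here is a minimal sketch of that check, assuming state-save writes Playwright's storage-state JSON (a `cookies` array plus per-origin `localStorage`). The cookie name `sid` and the sample values are hypothetical.

```typescript
// Subset of the fields in a Playwright storage-state file.
// Assumption: `state-save` writes this same format.
interface StoredCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}

interface StorageState {
  cookies: StoredCookie[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// True when the named cookie exists and has not expired at `nowSeconds`.
function hasFreshCookie(state: StorageState, name: string, nowSeconds: number): boolean {
  return state.cookies.some(
    (c) => c.name === name && (c.expires === -1 || c.expires > nowSeconds),
  );
}

// Hypothetical saved state; in practice this would be parsed from admin-auth.json.
const saved: StorageState = {
  cookies: [
    { name: "sid", value: "abc", domain: "app.example.com", path: "/", expires: 2000 },
  ],
  origins: [
    { origin: "https://app.example.com", localStorage: [{ name: "theme", value: "dark" }] },
  ],
};

console.log(hasFreshCookie(saved, "sid", 1000)); // true: the cookie expires later
console.log(hasFreshCookie(saved, "sid", 3000)); // false: the cookie has expired
```

If the check fails, the job can fall back to the login steps and save a fresh file.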

For long-lived workflows that need full browser profile persistence across restarts:

playwright-cli open https://app.example.com --persistent

The --persistent flag saves the complete browser profile to disk.
Cookies, extensions, service workers, and IndexedDB survive browser restarts.
This is effectively a reusable browser identity.

Session management at scale:

playwright-cli list                    # list all active sessions
playwright-cli -s=stale close          # stop a named browser
playwright-cli -s=stale delete-data    # clean up user data
playwright-cli close-all               # close all browsers
playwright-cli kill-all                # forcefully kill all browser processes

Agents can also pick up the session name from the environment:

PLAYWRIGHT_CLI_SESSION=todo-app claude .

why not just screenshot failures?

Many teams rely on failure screenshots as their primary debugging signal. That approach tends to be fragile because:

  • a screenshot captures one moment, not the sequence that led to it
  • timing issues are invisible in a static image
  • network requests and console errors are absent
  • the agent performing the actions may interact with the page in unexpected ways

Instead, the Playwright CLI provides two built-in recording mechanisms that capture the full execution context.


video recording

The CLI can record .webm video of an entire session:

playwright-cli -s=checkout video-start session-debug.webm
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout video-stop

During recording, you can annotate actions with callouts and chapter markers:

playwright-cli video-show-actions      # annotate each action with a callout
playwright-cli video-chapter "Login flow"
playwright-cli video-chapter "Checkout flow"
playwright-cli video-hide-actions      # stop annotating

This produces a timestamped, annotated video of exactly what happened. When an AI agent clicks the wrong element or navigates unexpectedly, the video shows the sequence as it occurred, not a single post-mortem screenshot.

Video can also be enabled declaratively in the config file:

json
{
  "saveVideo": {
    "width": 1280,
    "height": 720
  }
}

trace recording

Traces go deeper than video.
A Playwright Trace file contains:

  • full DOM snapshots at each action
  • network requests and responses
  • console logs
  • execution timeline

Recording a trace:

playwright-cli -s=checkout tracing-start
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout tracing-stop

The output is a trace file inspectable in the Playwright Trace Viewer:

npx playwright show-trace trace.zip

Or open https://trace.playwright.dev in a browser and drop the file in.

Inside the viewer you can step through every action, inspect the DOM at each point, examine network requests, view console output, and see exactly what the browser rendered at each step. This is critical when an agent does something unexpected; you are not guessing from a screenshot, you are replaying the entire session.

Traces can also be enabled via environment variable:

PLAYWRIGHT_MCP_SAVE_TRACE=1

the visual dashboard

For real-time observation of running agent sessions:

playwright-cli show

This opens a window with a grid of all active sessions, grouped by workspace, each with a live screencast preview, current URL, and page title. Click any session to zoom in and take control: click into the viewport to drive the browser manually, and press Escape to release. From the grid you can also close sessions or delete data for inactive ones.

For design review and UI feedback, the dashboard supports annotations:

playwright-cli show --annotate

CI sharding

Sharding is where the CLI integrates with Playwright Test's parallel execution model. The core idea: split your test suite into N shards, run each shard on a separate CI job, and merge the results.

Playwright Test natively supports sharding:

npx playwright test --shard=1/4
npx playwright test --shard=2/4
npx playwright test --shard=3/4
npx playwright test --shard=4/4

Each shard runs an approximately equal portion of the test files. When fullyParallel: true is enabled in the config, sharding balances at the individual test level rather than the file level, producing more even distribution.
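
To make "approximately equal portion" concrete, the sketch below splits a list into contiguous shards whose sizes differ by at most one. It is an illustration of the idea only, not Playwright's actual balancing code, and the file names are hypothetical.

```typescript
// Split `items` into `total` contiguous shards whose sizes differ by at most one.
// Illustration only: this is not Playwright's actual balancing code.
function splitIntoShards<T>(items: T[], total: number): T[][] {
  if (Math.floor(total) !== total || total < 1) {
    throw new Error("total must be a positive integer");
  }
  const base = Math.floor(items.length / total);
  const extra = items.length % total; // the first `extra` shards get one more item
  const shards: T[][] = [];
  let start = 0;
  for (let i = 0; i < total; i++) {
    const size = base + (i < extra ? 1 : 0);
    shards.push(items.slice(start, start + size));
    start += size;
  }
  return shards;
}

// Ten hypothetical test files across four shards.
const files: string[] = [];
for (let i = 1; i <= 10; i++) {
  files.push(`spec-${i}.ts`);
}
const shardSizes = splitIntoShards(files, 4).map((s) => s.length);
console.log(shardSizes.join(", ")); // 3, 3, 2, 2
```

With file-level splitting, one slow file can still dominate a shard, which is why test-level balancing under fullyParallel tends to even things out.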

The CLI session model composes naturally with sharded CI jobs. Each shard gets its own named session:

CI job for shard 1

export PLAYWRIGHT_CLI_SESSION="shard-1"
npx playwright test --shard=1/4

CI job for shard 2

export PLAYWRIGHT_CLI_SESSION="shard-2"
npx playwright test --shard=2/4

This keeps browser state isolated between shards. Sessions run headlessly by default on CI; pass --headed only when you need to observe a specific session.
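
Rather than hard-coding both values in every job, a small script can derive them from the shard index. This is a sketch only: `shardJob` is a hypothetical helper, and the `shard-<n>` session naming simply follows the examples above.

```typescript
// Derive the per-job values used above from a 1-based shard index and a total.
// `shardJob` is a hypothetical helper; the "shard-<n>" naming follows the examples above.
function shardJob(index: number, total: number): { shardArg: string; session: string } {
  const isWhole = (n: number) => Math.floor(n) === n;
  if (!isWhole(index) || !isWhole(total) || index < 1 || index > total) {
    throw new Error(`invalid shard ${index}/${total}`);
  }
  return { shardArg: `--shard=${index}/${total}`, session: `shard-${index}` };
}

const job = shardJob(2, 4);
console.log(job.shardArg); // --shard=2/4
console.log(job.session); // shard-2
```

Validating the index up front turns a misconfigured matrix into an immediate error instead of a confusing test run.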


merging reports across shards

Each shard produces its own report.
To produce a unified view, use the blob reporter:

typescript
// playwright.config.ts
export default defineConfig({
  reporter: process.env.CI ? 'blob' : 'html',
});

Blob reports contain all test results plus attachments: traces, screenshots, and video.
After all shards complete, merge:

npx playwright merge-reports --reporter html ./all-blob-reports

This produces a single HTML report in playwright-report/ with the combined results from every shard, including all traces and videos from every session.


GitHub Actions example

yaml
name: Playwright Tests
on:
  push:
    branches: [main]
jobs:
  test:
    strategy:
      fail-fast: false  # let the remaining shards finish when one fails
      matrix:
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          PLAYWRIGHT_CLI_SESSION: "shard-${{ matrix.shardIndex }}"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shardIndex }}
          path: blob-report/
  merge-reports:
    if: always()
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - uses: actions/download-artifact@v4
        with:
          pattern: blob-report-*
          path: all-blob-reports
          merge-multiple: true  # flatten all shard artifacts into one directory
      - run: npx playwright merge-reports --reporter html ./all-blob-reports
      - uses: actions/upload-artifact@v4
        with:
          name: html-report
          path: playwright-report/


Each shard runs independently, uploads its blob report, and a final job merges everything into a single HTML report with all traces and videos attached.

the agent debugging loop

The part I find most interesting is how these features compose into a debugging workflow for AI-driven browser automation.

When an agent runs a workflow and something fails, the typical debugging approach involves:

- thousands of log lines
- screenshots
- console output
- stack traces

This works, but scales poorly.
As more workflows run through agents, the volume of debugging context grows quickly.

With the CLI tooling, the debugging signal is structured from the start:

1. Open the merged HTML report → see which shard and which test failed
2. Open the trace for that session → step through every action frame by frame
3. Watch the session video → see the visual result of each action
4. Inspect network requests in the trace → confirm API calls returned expected data
5. Check console output in the trace → catch JavaScript errors on the page
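
The first two steps can also be scripted, so an agent starts from a short list of failures and trace paths. This sketch assumes the shards were additionally merged with the JSON reporter, which the workflow above does not do. It models only a subset of that report's fields, and the sample data is hypothetical.

```typescript
// Subset of the fields in Playwright's JSON reporter output.
// Assumption: a JSON report is available alongside the HTML one.
interface Attachment { name: string; path?: string; contentType: string }
interface Result { status: string; attachments: Attachment[] }
interface Spec { title: string; ok: boolean; tests: { results: Result[] }[] }
interface Suite { title: string; specs: Spec[]; suites?: Suite[] }

interface FailureSummary { title: string; tracePaths: string[] }

// Walk nested suites and collect each failed spec with its trace attachments.
function summarizeFailures(suites: Suite[], parents: string[] = []): FailureSummary[] {
  const out: FailureSummary[] = [];
  for (const suite of suites) {
    const chain = parents.concat([suite.title]);
    for (const spec of suite.specs) {
      if (spec.ok) continue;
      const tracePaths: string[] = [];
      for (const test of spec.tests) {
        for (const result of test.results) {
          for (const a of result.attachments) {
            if (a.name === "trace" && a.path !== undefined) tracePaths.push(a.path);
          }
        }
      }
      out.push({ title: chain.concat([spec.title]).join(" > "), tracePaths });
    }
    out.push(...summarizeFailures(suite.suites ?? [], chain));
  }
  return out;
}

// Hypothetical report with one passing and one failing spec.
const report: Suite[] = [
  {
    title: "checkout.spec.ts",
    specs: [
      { title: "adds item to cart", ok: true, tests: [{ results: [{ status: "passed", attachments: [] }] }] },
      {
        title: "completes payment",
        ok: false,
        tests: [{ results: [{ status: "failed", attachments: [
          { name: "trace", path: "test-results/payment/trace.zip", contentType: "application/zip" },
        ] }] }],
      },
    ],
  },
];

const failures = summarizeFailures(report);
console.log(JSON.stringify(failures, null, 2));
```

Each `tracePaths` entry can then be opened with `npx playwright show-trace`.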

The goal is not just better debugging. The goal is to reduce the time between "something went wrong" and "I can see exactly what happened."


open questions

Some areas I am currently exploring:

  • should session state be committed to the repository for deterministic replay, or kept ephemeral?
  • at what point does per-shard video recording become too expensive in storage?
  • can trace diffs between passing and failing runs be automated to highlight the exact divergence point?
  • should agents receive the full trace or a machine-readable summary first?
  • how much of the debugging loop can be automated before human review becomes necessary?
  • what is the smallest useful artifact set for an agent to diagnose a failure: one trace, one video, one screenshot, or all three?
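
As a starting point for the trace-diff question, the simplest version of a "divergence point" is the first index at which two action sequences differ. The sketch below assumes the actions have already been extracted into a flat list. The `ActionRecord` shape is hypothetical and is not Playwright's trace format.

```typescript
// Hypothetical action record; this is not Playwright's trace format.
interface ActionRecord { command: string; target: string }

// Index of the first position where the two runs differ, or -1 if they are identical.
function firstDivergence(passing: ActionRecord[], failing: ActionRecord[]): number {
  const shared = Math.min(passing.length, failing.length);
  for (let i = 0; i < shared; i++) {
    if (passing[i].command !== failing[i].command || passing[i].target !== failing[i].target) {
      return i;
    }
  }
  // One run is a prefix of the other: they diverge where the shorter one ends.
  return passing.length === failing.length ? -1 : shared;
}

const passingRun: ActionRecord[] = [
  { command: "open", target: "https://app.example.com/checkout" },
  { command: "click", target: "#add-to-cart" },
  { command: "click", target: "#checkout" },
];
const failingRun: ActionRecord[] = [
  { command: "open", target: "https://app.example.com/checkout" },
  { command: "click", target: "#add-to-cart" },
  { command: "click", target: "#login" },
];

console.log(firstDivergence(passingRun, failingRun)); // 2
```

A real diff would also need to tolerate benign differences such as timing and retries, which this first-mismatch comparison ignores.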

next steps

Current roadmap items include:

  • deeper integration between CLI session state and Playwright Test fixtures
  • automated trace diffing between baseline and failing runs
  • agent-friendly failure summaries as structured output
  • shard-aware video and trace artifact routing
  • persistent session profiles as reusable CI artifacts
  • locator stability analysis from trace data

Curious how other teams running Playwright CLI in production agent-driven workflows are approaching these problems.

Top comments (2)

Alex Shev

Playwright is a good fit for agent-driven workflows because it gives the agent a real interface contract instead of a vague browser instruction. Sessions, traces, screenshots, and sharding all create evidence the agent can use or hand back to a human.

The part I like most is that browser automation becomes inspectable. If the agent says the flow works, you can ask for the trace or screenshot instead of trusting a text summary.

Ricardo Costa • Edited

I've been using it in my framework, and it helps a lot with follow-up debugging.

await test.step('Open Portfolio', async () => { ...});

and

npx playwright test --trace on

which provides screenshots, DOM snapshots, network activity, and trace playback.