Playwright has excellent tooling around browser automation, but most of the ecosystem still treats it as a test framework. For teams running AI coding agents and automated browser workflows, there is a different set of requirements:
browser automation
↓
session persistence across runs
↓
debuggable traces when things go wrong
↓
parallel execution across CI shards
The Playwright CLI directly addresses these gaps. It ships as a standalone npm package and exposes every browser operation as a CLI command (open, click, type, snapshot) without requiring a Node.js script or test runner.
npm package: @playwright/cli
GitHub: https://github.com/microsoft/playwright-cli
The current implementation focuses on:
- session persistence with named instances and portable state
- video and trace recording built into every session
- CI sharding for parallel execution at scale
session persistence
The default behaviour keeps browser state in memory. Cookies and localStorage are preserved between CLI calls within the session, but cleared when the browser closes. For repeatable workflows, that breaks down fast: logging into an application before every run wastes time and introduces flakiness.
Named sessions let you run multiple browser instances simultaneously and address them by name:
playwright-cli -s=admin open https://app.example.com/admin
playwright-cli -s=checkout open https://app.example.com/checkout
Each session is an isolated browser instance, so an agent can orchestrate workflows across multiple authenticated contexts without state leaking between them. The goal is straightforward: the same CLI binary should maintain independent browser contexts for parallel workflows without requiring environment-specific configuration.
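When an agent drives several sessions from code, it helps to build each call in one place. The following is a minimal sketch, not part of the CLI: `inSession` and `CliCall` are hypothetical names, and the only thing taken from the examples above is the `-s=<name>` flag form.

```typescript
// Hypothetical helper: builds the argument list for one playwright-cli call
// scoped to a named session, using the -s=<name> form shown above.
type CliCall = { bin: string; args: string[] };

function inSession(session: string, command: string, ...rest: string[]): CliCall {
  // Session names end up in commands and file paths, so keep them simple.
  if (!/^[A-Za-z0-9_-]+$/.test(session)) {
    throw new Error(`invalid session name: ${session}`);
  }
  return { bin: "playwright-cli", args: [`-s=${session}`, command, ...rest] };
}

// Two independent contexts an agent could drive side by side.
const calls: CliCall[] = [
  inSession("admin", "open", "https://app.example.com/admin"),
  inSession("checkout", "open", "https://app.example.com/checkout"),
];
```

Each `CliCall` can be handed to Node's `execFileSync(call.bin, call.args)` from `node:child_process`. Passing arguments as an array avoids shell quoting problems with selectors such as `button[type=submit]`.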
The critical piece for CI and agent reuse is state persistence:
# log in once
playwright-cli -s=admin open https://app.example.com/login
playwright-cli -s=admin fill "#username" "admin"
playwright-cli -s=admin fill "#password" "$ADMIN_PASS"
playwright-cli -s=admin click "button[type=submit]"
# save cookies + localStorage to portable JSON
playwright-cli -s=admin state-save admin-auth.json
# restore in any future session - no re-authentication
playwright-cli -s=admin state-load admin-auth.json
playwright-cli -s=admin open https://app.example.com/dashboard
The state-save and state-load commands persist cookies and localStorage to a portable JSON file, so a single login can be restored in every future session without re-authenticating.
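A saved state file is only useful while its cookies are still valid, so it can be worth checking the file before loading it. The sketch below assumes the file follows Playwright's storage-state shape (a `cookies` array with `expires` in Unix seconds and `-1` for session cookies, plus per-origin `localStorage`). I have not verified that `state-save` writes exactly this, and `expiredCookies` and `isReusable` are hypothetical helpers.

```typescript
// Assumed shape: the Playwright storage-state JSON. Whether state-save
// writes exactly this format is an assumption of this sketch.
interface SavedCookie {
  name: string;
  domain: string;
  expires: number; // Unix seconds; -1 marks a session cookie
}

interface SavedState {
  cookies: SavedCookie[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Returns the names of cookies that have already expired at `nowSeconds`.
function expiredCookies(state: SavedState, nowSeconds: number): string[] {
  return state.cookies
    .filter((c) => c.expires !== -1 && c.expires <= nowSeconds)
    .map((c) => c.name);
}

// Hypothetical policy: reuse the file only if it has cookies and none expired.
function isReusable(state: SavedState, nowSeconds: number): boolean {
  return state.cookies.length > 0 && expiredCookies(state, nowSeconds).length === 0;
}
```

In a CI step, the state would come from `JSON.parse` on the contents of `admin-auth.json`, with `Date.now() / 1000` as the current time. If `isReusable` returns false, the job falls back to the login flow and saves a fresh file.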
For long-lived workflows that need full browser profile persistence across restarts:
playwright-cli open https://app.example.com --persistent
The --persistent flag saves the complete browser profile to disk.
Cookies, extensions, service workers, and IndexedDB survive browser restarts.
This is effectively a reusable browser identity.
Session management at scale:
playwright-cli list # list all active sessions
playwright-cli -s=stale close # stop a named browser
playwright-cli -s=stale delete-data # clean up user data
playwright-cli close-all # close all browsers
playwright-cli kill-all # forcefully kill all browser processes
Agents can also pick up the session name from the environment:
PLAYWRIGHT_CLI_SESSION=todo-app claude .
why not just screenshot failures?
Many teams rely on failure screenshots as their primary debugging signal. That approach tends to be fragile because:
- a screenshot captures one moment, not the sequence that led to it
- timing issues are invisible in a static image
- network requests and console errors are absent
- the agent performing the actions may interact with the page in unexpected ways
Instead, the Playwright CLI provides two built-in recording mechanisms that capture the full execution context.
video recording
The CLI can record .webm video of an entire session:
# open the session first, then record in that same named session
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout video-start session-debug.webm
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout video-stop
During recording, you can annotate actions with callouts and chapter markers:
playwright-cli video-show-actions # annotate each action with a callout
playwright-cli video-chapter "Login flow"
playwright-cli video-chapter "Checkout flow"
playwright-cli video-hide-actions # stop annotating
This produces a timestamped, annotated video of exactly what happened. When an AI agent clicks the wrong element or navigates unexpectedly, the video shows the sequence as it occurred, not a single post-mortem screenshot.
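For agent-driven runs, the chapter structure can be planned up front and expanded into CLI calls. This is a rough sketch under stated assumptions: `planRecording` and `Chapter` are hypothetical, the command names are the ones used above, and two choices are mine rather than documented behaviour: scoping every call with `-s=<session>`, and opening the session before recording starts.

```typescript
// Hypothetical planner: turns a chaptered workflow into an ordered list of
// playwright-cli argument arrays.
interface Chapter {
  title: string;
  steps: string[][]; // each step is a command followed by its arguments
}

function planRecording(
  session: string,
  url: string,
  file: string,
  chapters: Chapter[],
): string[][] {
  // Assumption: every call, including the recording commands, is session-scoped.
  const scoped = (...args: string[]): string[] => [`-s=${session}`, ...args];
  const plan: string[][] = [scoped("open", url), scoped("video-start", file)];
  for (const chapter of chapters) {
    plan.push(scoped("video-chapter", chapter.title));
    for (const step of chapter.steps) {
      plan.push(scoped(...step));
    }
  }
  plan.push(scoped("video-stop"));
  return plan;
}

const plan = planRecording(
  "checkout",
  "https://app.example.com/checkout",
  "session-debug.webm",
  [{ title: "Checkout flow", steps: [["click", "#add-to-cart"], ["click", "#checkout"]] }],
);
```

Keeping the plan as data means the same list can be logged next to the video, so a reviewer can match each chapter marker to the commands that produced it.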
Video can also be enabled declaratively in the config file:
json
{
  "saveVideo": {
    "width": 1280,
    "height": 720
  }
}
trace recording
Traces go deeper than video.
A Playwright Trace file contains:
- full DOM snapshots at each action
- network requests and responses
- console logs
- execution timeline
Recording a trace:
# open the session first, then trace in that same named session
playwright-cli -s=checkout open https://app.example.com/checkout
playwright-cli -s=checkout tracing-start
playwright-cli -s=checkout click "#add-to-cart"
playwright-cli -s=checkout click "#checkout"
playwright-cli -s=checkout tracing-stop
The output is a trace file inspectable in the Playwright Trace Viewer:
npx playwright show-trace trace.zip
Or open https://trace.playwright.dev in a browser and drop the file in.
Inside the viewer you can step through every action, inspect the DOM at each point, examine network requests, view console output, and see exactly what the browser rendered at each step. This is critical when an agent does something unexpected; you are not guessing from a screenshot, you are replaying the entire session.
Traces can also be enabled via environment variable:
PLAYWRIGHT_MCP_SAVE_TRACE=1
the visual dashboard
For real-time observation of running agent sessions:
playwright-cli show
This opens a window with a session grid showing all active sessions grouped by workspace, each with a live screencast preview, current URL, and page title. Click any session to zoom in and take control: click into the viewport to drive the browser manually, and press Escape to release. From the grid you can also close sessions or delete data for inactive ones.
For design review and UI feedback, the dashboard supports annotations:
playwright-cli show --annotate
CI sharding
Sharding is where the CLI integrates with Playwright Test's parallel execution model. The core idea: split your test suite into N shards, run each shard on a separate CI job, and merge the results.
Playwright Test natively supports sharding:
npx playwright test --shard=1/4
npx playwright test --shard=2/4
npx playwright test --shard=3/4
npx playwright test --shard=4/4
Each shard runs an approximately equal portion of the test files. When fullyParallel: true is enabled in the config, sharding balances at the individual test level rather than the file level, producing more even distribution.
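To make "approximately equal" concrete, here is a simplified illustration of splitting a list into shards. It is not Playwright's actual implementation, and `shardSlice` is a hypothetical name. The items would be test files by default, or individual tests when fullyParallel is on.

```typescript
// Simplified illustration of shard splitting, not Playwright's own code:
// items are cut into `total` contiguous slices whose sizes differ by at most one.
function shardSlice<T>(items: T[], index: number, total: number): T[] {
  if (!Number.isInteger(total) || total < 1) {
    throw new Error("total must be an integer >= 1");
  }
  if (!Number.isInteger(index) || index < 1 || index > total) {
    throw new Error("index must be between 1 and total");
  }
  const base = Math.floor(items.length / total);
  const extra = items.length % total; // the first `extra` shards get one more item
  const start = (index - 1) * base + Math.min(index - 1, extra);
  const size = base + (index <= extra ? 1 : 0);
  return items.slice(start, start + size);
}

// Ten hypothetical test files across four shards.
const files = Array.from({ length: 10 }, (_, i) => `spec-${i + 1}.ts`);
const sizes = [1, 2, 3, 4].map((i) => shardSlice(files, i, 4).length); // [3, 3, 2, 2]
```

Equal counts do not mean equal durations. One slow file can still dominate a shard, which is why splitting at the test level tends to balance better.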
The CLI session model composes naturally with sharded CI jobs. Each shard gets its own named session:
# CI job for shard 1
export PLAYWRIGHT_CLI_SESSION="shard-1"
npx playwright test --shard=1/4
# CI job for shard 2
export PLAYWRIGHT_CLI_SESSION="shard-2"
npx playwright test --shard=2/4
This keeps browser state isolated between shards. Sessions run headless by default on CI; pass --headed to the open command only when you need to observe a specific session.
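When the shard count changes often, the per-shard environment and command can be generated rather than written by hand. This is a small sketch: `shardJobs` and `ShardJob` are hypothetical names, and the `shard-<index>` naming simply mirrors the export lines above.

```typescript
// Hypothetical generator for per-shard job definitions.
interface ShardJob {
  env: { PLAYWRIGHT_CLI_SESSION: string };
  command: string[];
}

function shardJobs(total: number): ShardJob[] {
  if (!Number.isInteger(total) || total < 1) {
    throw new Error("total must be an integer >= 1");
  }
  return Array.from({ length: total }, (_, i) => {
    const index = i + 1; // shard indexes are 1-based
    return {
      env: { PLAYWRIGHT_CLI_SESSION: `shard-${index}` },
      command: ["npx", "playwright", "test", `--shard=${index}/${total}`],
    };
  });
}

const jobs = shardJobs(4);
```

Each entry maps onto one CI job: `env` becomes the job's environment and `command` its run step.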
merging reports across shards
Each shard produces its own report.
To produce a unified view, use the blob reporter:
typescript
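// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Blob output on CI so shard results can be merged later; HTML locally.
  reporter: process.env.CI ? 'blob' : 'html',
});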
Blob reports contain all test results plus attachments: traces, screenshots, and video.
After all shards complete, merge:
npx playwright merge-reports --reporter html ./all-blob-reports
This produces a single HTML report in playwright-report/ with the combined results from every shard, including all traces and videos from every session.
GitHub Actions example
yaml
name: Playwright Tests
on:
  push:
    branches: [main]
jobs:
  test:
    strategy:
      fail-fast: false # let the remaining shards finish when one fails
      matrix:
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          PLAYWRIGHT_CLI_SESSION: "shard-${{ matrix.shardIndex }}"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shardIndex }}
          path: blob-report/
  merge-reports:
    if: always()
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - uses: actions/download-artifact@v4
        with:
          pattern: blob-report-*
          path: all-blob-reports
          merge-multiple: true # flatten all shard artifacts into one directory
      - run: npx playwright merge-reports --reporter html ./all-blob-reports
      - uses: actions/upload-artifact@v4
        with:
          name: html-report
          path: playwright-report/
Each shard runs independently, uploads its blob report, and a final job merges everything into a single HTML report with all traces and videos attached.
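One gap worth guarding against: because the merge job runs even when shards fail, a shard that crashed before uploading would simply be absent from the merged report. Below is a minimal sketch of a completeness check, assuming the artifact names follow the `blob-report-<shardIndex>` pattern from the workflow. `missingShards` is a hypothetical helper, and how the list of artifact names is obtained is left open.

```typescript
// Hypothetical check: which shard indexes have no uploaded blob report?
// Artifact names follow the blob-report-<shardIndex> pattern from the workflow.
function missingShards(artifactNames: string[], shardTotal: number): number[] {
  const seen = new Set<number>();
  for (const name of artifactNames) {
    const match = /^blob-report-(\d+)$/.exec(name);
    if (match) {
      seen.add(Number(match[1]));
    }
  }
  const missing: number[] = [];
  for (let i = 1; i <= shardTotal; i++) {
    if (!seen.has(i)) {
      missing.push(i);
    }
  }
  return missing;
}

const missing = missingShards(["blob-report-1", "blob-report-2", "blob-report-4"], 4); // [3]
```

Failing the merge job when the result is non-empty makes a lost shard visible instead of producing a report that looks complete.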
the agent debugging loop
The part I find most interesting is how these features compose into a debugging workflow for AI-driven browser automation.
When an agent runs a workflow and something fails, the typical debugging approach involves:
- thousands of log lines
- screenshots
- console output
- stack traces
This works, but scales poorly.
As more workflows run through agents, the volume of debugging context grows quickly.
With the CLI tooling, the debugging signal is structured from the start:
1. Open the merged HTML report → see which shard and which test failed
2. Open the trace for that session → step through every action frame by frame
3. Watch the session video → see the visual result of each action
4. Inspect network requests in the trace → confirm API calls returned expected data
5. Check console output in the trace → catch JavaScript errors on the page
The goal is not just better debugging. The goal is to reduce the time between "something went wrong" and "I can see exactly what happened."
open questions
Some areas I am currently exploring:
- should session state be committed to the repository for deterministic replay, or kept ephemeral?
- at what point does per-shard video recording become too expensive in storage?
- can trace diffs between passing and failing runs be automated to highlight the exact divergence point?
- should agents receive the full trace or a machine-readable summary first?
- how much of the debugging loop can be automated before human review becomes necessary?
- what is the smallest useful artifact set for an agent to diagnose a failure: one trace, one video, one screenshot, or all three?
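For the storage question, a back-of-the-envelope estimate is a reasonable starting point. The sketch below is only arithmetic. Every number in the example is an assumption, especially the megabytes-per-minute figure, which should be measured on real recordings. `retainedGigabytes` and `VideoLoad` are hypothetical names.

```typescript
// Rough storage estimate for per-shard video; all inputs are assumptions.
interface VideoLoad {
  shards: number;
  runsPerDay: number;
  minutesPerShard: number;
  megabytesPerMinute: number; // assumed; measure on real recordings
  retentionDays: number;
}

function retainedGigabytes(load: VideoLoad): number {
  const perRunMb = load.shards * load.minutesPerShard * load.megabytesPerMinute;
  const perDayMb = perRunMb * load.runsPerDay;
  return (perDayMb * load.retentionDays) / 1000; // decimal gigabytes
}

// 4 shards x 5 min x 3 MB/min = 60 MB per run; x 50 runs = 3000 MB per day;
// x 14 days retained = 42000 MB, or 42 GB.
const estimate = retainedGigabytes({
  shards: 4,
  runsPerDay: 50,
  minutesPerShard: 5,
  megabytesPerMinute: 3,
  retentionDays: 14,
});
```

The estimate scales linearly in every input. Shortening retention, or keeping video only for failed shards, therefore cuts storage by the same proportion.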
next steps
Current roadmap items include:
- deeper integration between CLI session state and Playwright Test fixtures
- automated trace diffing between baseline and failing runs
- agent-friendly failure summaries as structured output
- shard-aware video and trace artifact routing
- persistent session profiles as reusable CI artifacts
- locator stability analysis from trace data
Curious how other teams running Playwright CLI in production agent-driven workflows are approaching these problems.