Ricardo Costa

Building a CI helper for Playwright Java

Playwright has excellent tooling around browser automation, but most of the ecosystem still feels heavily Node.js-centric.

For Java teams, there's a surprising amount of infrastructure work that sits between:

git push
   ↓
CI execution
   ↓
useful failure diagnostics

To explore that gap, I built a small Java CLI:

GitHub repo:
https://github.com/ricardo-costa0405/playwright-java-ci-helper

The current implementation focuses on:

  • build system detection
  • test execution
  • artifact collection
  • machine-readable failure summaries

build system detection

The first requirement was zero project configuration.

The helper attempts to detect:

./mvnw
pom.xml
./gradlew
build.gradle
build.gradle.kts

and automatically generates the appropriate execution strategy.

The goal is straightforward:

The same binary should be able to run inside arbitrary Playwright Java repositories without requiring repository-specific configuration.

This allows the tool to work consistently across Maven and Gradle projects while keeping onboarding friction close to zero.
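As a rough illustration of marker-file detection, here is a minimal sketch. It is not the helper's actual code: the class name, the `Strategy` record, the probe order (taken from the list above), and the `test` goal/task are all assumptions, and it assumes Java 17 or later.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of marker-file detection; not the helper's real implementation. */
final class BuildSystemDetector {

    /** The marker file that matched and the command that would be run for it. */
    record Strategy(String marker, List<String> command) {}

    // Probed in the order listed in the article; each wrapper script
    // precedes its plain build file so a pinned tool version wins.
    // The "test" goal/task is an assumption made for this sketch.
    private static final List<Strategy> CANDIDATES = List.of(
            new Strategy("mvnw", List.of("./mvnw", "test")),
            new Strategy("pom.xml", List.of("mvn", "test")),
            new Strategy("gradlew", List.of("./gradlew", "test")),
            new Strategy("build.gradle", List.of("gradle", "test")),
            new Strategy("build.gradle.kts", List.of("gradle", "test")));

    /** Returns the first candidate whose marker file exists in projectDir, if any. */
    static Optional<Strategy> detect(Path projectDir) {
        return CANDIDATES.stream()
                .filter(c -> Files.isRegularFile(projectDir.resolve(c.marker())))
                .findFirst();
    }
}
```

Returning an `Optional` keeps the "nothing detected" case explicit, so the caller can fall back to a user-supplied command instead of guessing.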


test execution

The helper can execute either an automatically detected build command or a user-supplied command.

Examples:

java -jar playwright-java-ci-helper.jar \
  --project-dir my-project

or

java -jar playwright-java-ci-helper.jar \
  --test-command "mvn test -Dtest=LoginTest"

An optional setup phase can also be executed before running tests.

This allows repositories to perform environment preparation, Playwright installation, or custom bootstrap steps before execution begins.
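The choice between a detected and a user-supplied command could be sketched roughly as follows. The class and method names are hypothetical, and wrapping the user's string in `sh -c` is an assumption that only holds on Unix-like CI runners; the real helper may tokenize or execute commands differently.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: choose a command, then run it in the project directory. */
final class TestRunner {

    /** A user-supplied command wins; otherwise fall back to the detected one. */
    static List<String> resolveCommand(Optional<String> userCommand, List<String> detectedCommand) {
        // Assumption: the user's string is handed to a POSIX shell unchanged.
        return userCommand
                .map(c -> List.of("sh", "-c", c))
                .orElse(detectedCommand);
    }

    /** Runs the command with inherited stdout/stderr and returns its exit code. */
    static int run(Path projectDir, List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(projectDir.toFile())
                .inheritIO() // stream build output straight into the CI log
                .start();
        return process.waitFor();
    }
}
```

An optional setup phase would fit the same shape: call `run` once with the setup command, check the exit code, then call it again with the test command.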


why not parse console logs?

Many CI systems still derive test status from stdout.

That approach tends to be fragile because:

  • log formats change
  • plugins inject additional output
  • parallel execution interleaves messages
  • different frameworks produce different structures

Instead, the helper parses JUnit XML directly and extracts:

tests
failures
errors
skipped

from the actual source of truth.

This produces deterministic results regardless of how verbose or customized the console output becomes.
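For readers who have not parsed these reports before, a minimal sketch of reading the four counters from a JUnit-style XML report with the JDK's built-in DOM parser might look like this. The class and record names are hypothetical, and the sketch assumes `<testsuite>` elements are not nested inside each other (nested suites would be counted twice).

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: sum the counters from every <testsuite> element in a report. */
final class JUnitXmlSummary {

    record Counts(int tests, int failures, int errors, int skipped) {
        Counts plus(Counts o) {
            return new Counts(tests + o.tests, failures + o.failures,
                    errors + o.errors, skipped + o.skipped);
        }
    }

    static Counts parse(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reports are files found on disk, so refuse DOCTYPE declarations (XXE hardening).
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        NodeList suites = factory.newDocumentBuilder().parse(xml)
                .getElementsByTagName("testsuite");

        Counts total = new Counts(0, 0, 0, 0);
        for (int i = 0; i < suites.getLength(); i++) {
            Element suite = (Element) suites.item(i);
            total = total.plus(new Counts(
                    attr(suite, "tests"), attr(suite, "failures"),
                    attr(suite, "errors"), attr(suite, "skipped")));
        }
        return total;
    }

    /** Missing attributes count as zero; some writers omit "skipped" entirely. */
    private static int attr(Element element, String name) {
        String value = element.getAttribute(name);
        return value.isEmpty() ? 0 : Integer.parseInt(value);
    }
}
```

Because the counters come from attributes rather than log lines, the result does not depend on plugin output or message interleaving.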


artifact collection

The less obvious challenge is artifact discovery.

A failing Playwright run can generate output across multiple locations:

target/surefire-reports
target/failsafe-reports
build/test-results
build/reports/tests
playwright-report
test-results
screenshots
videos
traces

depending on:

  • build tool
  • project structure
  • reporting configuration
  • team conventions

The helper currently collects only artifacts generated during the active execution window.

This avoids a common CI problem where stale artifacts from previous executions are accidentally included in failure analysis.
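One way to approximate an "active execution window" is to record a timestamp before the run and keep only files modified at or after it. The sketch below assumes that approach; the article does not say how the helper actually decides freshness, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch: collect files under known directories that changed during this run. */
final class ArtifactCollector {

    static List<Path> collectSince(Path projectDir, List<String> candidateDirs, Instant runStart)
            throws IOException {
        List<Path> fresh = new ArrayList<>();
        for (String dir : candidateDirs) {
            Path root = projectDir.resolve(dir);
            if (!Files.isDirectory(root)) {
                continue; // most projects only have a few of the candidate locations
            }
            try (Stream<Path> files = Files.walk(root)) {
                files.filter(Files::isRegularFile)
                        // keep files modified at or after the recorded start time
                        .filter(p -> !lastModified(p).isBefore(runStart))
                        .forEach(fresh::add);
            }
        }
        return fresh;
    }

    /** Unreadable files are treated as infinitely old, so they are skipped. */
    private static Instant lastModified(Path path) {
        try {
            return Files.getLastModifiedTime(path).toInstant();
        } catch (IOException e) {
            return Instant.MIN;
        }
    }
}
```

Modification time is a coarse signal: restored CI caches or clock skew between containers can defeat it, which is one reason plugin-based collection may be worth exploring.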


CI sharding

One area I wanted to support from the beginning was CI parallelization.
The helper exports:

PW_JAVA_CI_SHARD_INDEX
PW_JAVA_CI_SHARD_TOTAL
PW_JAVA_CI_WORKERS

and automatically injects equivalent parameters into Maven and Gradle executions.

Example:

java -jar playwright-java-ci-helper.jar \
  --shard-index 2 \
  --shard-total 4 \
  --workers 3

The idea is to keep orchestration concerns outside the test implementation itself.
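To make the exported variables concrete, here is one hypothetical way a build or test-selection step could consume them. The variable names come from the list above; everything else is an assumption, in particular that the shard index is 1-based and that tests are assigned to shards by hashing the class name.

```java
import java.util.Map;

/** Hypothetical sketch: decide whether a test class belongs to the current shard. */
final class ShardFilter {

    private final int index; // assumed to be 1-based, i.e. 1..total
    private final int total;

    ShardFilter(int index, int total) {
        if (total < 1 || index < 1 || index > total) {
            throw new IllegalArgumentException("invalid shard " + index + "/" + total);
        }
        this.index = index;
        this.total = total;
    }

    /** Reads the exported variables; a missing variable means "not sharded". */
    static ShardFilter fromEnv(Map<String, String> env) {
        int index = Integer.parseInt(env.getOrDefault("PW_JAVA_CI_SHARD_INDEX", "1"));
        int total = Integer.parseInt(env.getOrDefault("PW_JAVA_CI_SHARD_TOTAL", "1"));
        return new ShardFilter(index, total);
    }

    /** Each class name maps to exactly one shard, so no test runs twice or is dropped. */
    boolean includes(String testClassName) {
        return Math.floorMod(testClassName.hashCode(), total) == index - 1;
    }
}
```

In a real process the filter would be built with `ShardFilter.fromEnv(System.getenv())`. Hash-based assignment is stable across runs but does not balance by test duration.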


machine-readable failure context

The part I find most interesting isn't the reporting itself.
It's creating a deterministic interface between CI systems and automated tooling.
Today, many teams experimenting with agents and AI-assisted debugging are still passing large amounts of raw information:

thousands of log lines
screenshots
reports
traces
console output

The approach works, but it scales poorly.
As more platforms move toward API-based billing models, context size starts becoming an engineering concern rather than just an implementation detail.

Instead of sending:

4000+ lines of CI logs

a tool can provide:

{
  "tests": 182,
  "failures": 3,
  "screenshots": 3,
  "traces": 3,
  "failedTests": [...]
}

The goal isn't only to improve signal quality.
The goal is to reduce the amount of context required for an agent to reason about a failure.

This becomes increasingly important when traces, screenshots, reports, and execution logs start accumulating across hundreds or thousands of CI runs.

I suspect we'll see more tooling move in this direction as agents become part of the standard engineering workflow.
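As an illustration of how small such a summary can be, the sketch below serializes the fields from the JSON example earlier in this section without any external library. The record name is hypothetical, and representing `failedTests` as a list of plain test names is an assumption, since the example leaves that field's shape open.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: a compact, machine-readable failure summary. */
final class FailureSummary {

    record Summary(int tests, int failures, int screenshots, int traces, List<String> failedTests) {

        String toJson() {
            // Assumption: failed tests are emitted as an array of name strings.
            String failed = failedTests.stream()
                    .map(name -> "\"" + escape(name) + "\"")
                    .collect(Collectors.joining(","));
            return "{\"tests\":" + tests
                    + ",\"failures\":" + failures
                    + ",\"screenshots\":" + screenshots
                    + ",\"traces\":" + traces
                    + ",\"failedTests\":[" + failed + "]}";
        }
    }

    /** Escapes backslashes and quotes only; control characters are not handled here. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A production tool would more likely use a JSON library, but the point stands: the payload is a few hundred bytes rather than thousands of log lines.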


generating playwright java skeletons

I've also been experimenting with generating Playwright Java test skeletons from browser interaction flows and agent command scripts.

For example:

playwright-cli open https://demo.playwright.dev/todomvc
playwright-cli type "Buy groceries"
playwright-cli press Enter
playwright-cli screenshot

can be transformed into a Java test template.
One interesting limitation is locator generation.

Agent references such as:

e21
e37
e42

cannot safely be translated into stable Playwright locators.
The generated code compiles, but locator selection remains a human responsibility.

At least for now, a human-in-the-loop approach feels significantly more realistic than fully autonomous test generation.
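A very small version of that transformation can be sketched as a line-by-line mapping from commands to Java statements. This is not the generator used in the project: the class name and the set of supported verbs are assumptions. The emitted statements use `page.navigate`, `page.keyboard().type`, `page.keyboard().press`, and `page.screenshot` from the Playwright Java API, and anything the sketch does not recognize becomes a TODO comment for a human to resolve.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: map playwright-cli style lines to Playwright Java statements. */
final class SkeletonGenerator {

    static List<String> toJavaStatements(List<String> cliLines) {
        List<String> out = new ArrayList<>();
        for (String line : cliLines) {
            // Expected shape: playwright-cli <verb> [argument]
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 2) {
                continue; // blank or incomplete line
            }
            String arg = parts.length == 3 ? unquote(parts[2]) : "";
            switch (parts[1]) {
                case "open" -> out.add("page.navigate(" + literal(arg) + ");");
                case "type" -> out.add("page.keyboard().type(" + literal(arg) + ");");
                case "press" -> out.add("page.keyboard().press(" + literal(arg) + ");");
                case "screenshot" -> out.add("page.screenshot();");
                // Unknown verbs, including ref-based ones, are left for a human.
                default -> out.add("// TODO: unsupported command: " + line.trim());
            }
        }
        return out;
    }

    /** Strips one pair of surrounding double quotes, if present. */
    private static String unquote(String s) {
        boolean quoted = s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
        return quoted ? s.substring(1, s.length() - 1) : s;
    }

    /** Renders a Java string literal, escaping backslashes and quotes. */
    private static String literal(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

For the four commands shown earlier, this yields `page.navigate("https://demo.playwright.dev/todomvc");`, `page.keyboard().type("Buy groceries");`, `page.keyboard().press("Enter");`, and `page.screenshot();`. None of these statements needs a locator, which is exactly why they are the easy cases.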


open questions

Some areas I'm currently exploring:

  • should JUnit parsing remain framework-agnostic?
  • or should framework-specific adapters be introduced for richer diagnostics (e.g. TestNG retries, groups and dependencies)?
  • is artifact collection better handled through plugins than filesystem discovery?
  • what is the smallest useful schema for agent-driven failure analysis?
  • can locator repair be performed safely without introducing additional flakiness?
  • how much CI context should be exposed to agents before signal becomes noise?

next steps

Current roadmap items include:

  • TestNG support
  • richer failure diagnostics
  • AI-friendly summaries
  • SARIF output
  • environment validation ("doctor" command)
  • locator repair suggestions
  • deeper agent integrations

The project is still in its early stages, but the objective is simple:
Build better tooling around the gap between test execution and actionable failure diagnostics for Playwright Java teams.

I'm curious how other teams running Playwright Java at scale are approaching these problems.
