DEV Community: Yash Pandey

Why Playwright + Vitest is the Future of Web Testing

Yash Pandey — Sun, 26 Apr 2026 00:33:48 +0000

Cypress had its moment. Selenium still works. But if you're starting a new project today and choosing a web testing stack, the combination of Playwright + Vitest deserves serious attention.

Here's why this pairing is becoming the go-to for modern frontend teams.

What Playwright Gets Right (That Cypress Doesn't)

1. True Multi-Browser Support

Playwright runs against Chromium, Firefox, and WebKit (Safari's engine) out of the box. Cypress supports Chromium-family browsers and Firefox, but its WebKit support arrived later and is flagged as experimental.

# Run against all browsers in one command
npx playwright test --project=chromium --project=firefox --project=webkit

2. No iframe or Multi-Tab Restrictions

Cypress notoriously struggles with iframes and multiple tabs. Playwright handles both natively:

// Switch to a new tab
const [newPage] = await Promise.all([
  context.waitForEvent('page'),
  page.click('a[target="_blank"]')
]);
await newPage.waitForLoadState();
await expect(newPage).toHaveURL(/dashboard/);

3. Auto-Waiting That Actually Works

Playwright checks that an element is visible, stable, enabled, and able to receive events before acting on it, so most tests need no manual waitFor, sleep, or arbitrary delay.

// This just works — no manual wait needed
await page.click('button[data-testid="submit"]');
await expect(page.locator('.success-toast')).toBeVisible();
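
Conceptually, auto-waiting is a poll-until-ready loop with a timeout. The sketch below illustrates that idea in plain Python; it is not Playwright's actual implementation, and `wait_until`, `FakeButton`, and the timing values are hypothetical names chosen for the example.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


class FakeButton:
    """Stand-in for a DOM element that becomes clickable after a delay."""

    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after
        self.clicked = False

    def is_actionable(self):
        # A real tool would check visibility, stability, and enabled state here
        return time.monotonic() >= self._ready_at

    def click(self):
        wait_until(self.is_actionable, timeout=2.0)  # wait first, then act
        self.clicked = True


button = FakeButton(ready_after=0.2)
button.click()
print(button.clicked)
```

The point of the design is that the wait lives inside the action, so test code never has to guess how long to sleep.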

Where Vitest Comes In

Vitest is a unit/integration test runner built on Vite. It's fast (native ESM, with files transformed on demand through Vite's pipeline), largely Jest-compatible, and pairs naturally with modern React/Vue/Svelte setups.

The key insight: you don't need two unrelated testing styles anymore.

Vitest     → Unit tests, component tests, integration tests
Playwright → E2E tests, visual regression, multi-browser

Both expose a similar test/expect surface, so your team learns one assertion style and applies it across both layers. Mocking differs (vi.mock in Vitest, page.route in Playwright), but the overall test structure carries over.

// Vitest unit test
import { describe, it, expect } from 'vitest'
import { formatDate } from './utils'

describe('formatDate', () => {
  it('formats ISO string to readable date', () => {
    expect(formatDate('2024-01-15')).toBe('Jan 15, 2024')
  })
})

// Playwright E2E test: similar `expect` style, with async, auto-retrying assertions
import { test, expect } from '@playwright/test'

test('shows formatted date on dashboard', async ({ page }) => {
  await page.goto('/dashboard')
  await expect(page.locator('.date-display')).toContainText('Jan 15, 2024')
})

Component Testing: Where They Overlap Nicely

Playwright now has an experimental Component Testing mode for React, Vue, and Svelte. It lets you test components in a real browser — not jsdom — which catches rendering bugs that unit tests miss.

// playwright/ct — component test
import { test, expect } from '@playwright/experimental-ct-react'
import { Button } from './Button'

test('renders disabled state correctly', async ({ mount }) => {
  const component = await mount(<Button disabled label="Submit" />)
  await expect(component).toBeDisabled()
  await expect(component).toHaveCSS('opacity', '0.5')
})

This is the gap Vitest's jsdom environment can't fully cover — and Playwright fills it.

The Setup (Minimal)

npm create vite@latest my-app -- --template react-ts
cd my-app
npm install -D vitest jsdom @testing-library/react
npm install -D @playwright/test
npx playwright install

vitest.config.ts

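import { configDefaults, defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    environment: 'jsdom',
    globals: true,
    setupFiles: './src/test/setup.ts',
    // Keep Playwright specs in ./e2e out of the Vitest run
    exclude: [...configDefaults.exclude, 'e2e/**'],
  },
})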

playwright.config.ts

import { defineConfig, devices } from '@playwright/test'

export default defineConfig({
  testDir: './e2e',
  use: { baseURL: 'http://localhost:5173' },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
  webServer: {
    command: 'npm run dev',
    url: 'http://localhost:5173',
    reuseExistingServer: !process.env.CI,
  },
})

Is This Stack Perfect?

No. Fair criticisms:

Playwright's learning curve is steeper than Cypress's for beginners
Component testing is still experimental and occasionally rough
Vitest's ecosystem is younger than Jest's, so some plugins and matchers lag behind
Debugging Playwright failures locally is less visual than Cypress's time-travel debugger (though Trace Viewer closes this gap significantly)

Use this stack when: you're on a modern Vite-based project, your team is comfortable with TypeScript, and you need real multi-browser coverage.

Written by Yash | Senior SDET catching failures other layers miss by cross-validating UI, API, and DB simultaneously, plus test infrastructure.

How to Test LLM-Powered Applications Effectively

Yash Pandey — Sun, 26 Apr 2026 00:31:14 +0000

Testing a CRUD app is deterministic. You input X, you expect Y, you assert equality. Testing an LLM-powered application is different in a way that breaks most of your existing instincts.

The model's output is probabilistic. The same prompt can return different phrasing across runs. "Correct" is often subjective. Traditional assertEqual doesn't work here.

Here's how to think about testing LLM apps properly.

The Three Layers of an LLM App

Before writing a single test, map out what you're actually testing:

[ User Input ]
     ↓
[ Prompt Construction ]   ← Layer 1: Deterministic. Testable normally.
     ↓
[ LLM API Call ]          ← Layer 2: Non-deterministic. Mock in unit tests.
     ↓
[ Output Parsing ]        ← Layer 3: Deterministic. Testable normally.
     ↓
[ App Response ]

Most bugs aren't in the LLM — they're in layers 1 and 3. Start there.
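
The diagram marks the LLM API call as the step to mock in unit tests. A minimal sketch of that idea using the standard library's `unittest.mock` follows; `answer_question` and the client's `complete` method are hypothetical names, so swap in whatever your real client exposes.

```python
from unittest.mock import Mock


def answer_question(llm_client, user_query: str) -> str:
    """Hypothetical app function: build a prompt, call the model, clean the reply."""
    prompt = f"Answer concisely.\nUser: {user_query}"
    raw = llm_client.complete(prompt)  # `complete` is an assumed method name
    return raw.strip()


def test_answer_question_uses_mocked_llm():
    # The mock replaces the network call, so the test is fast and repeatable
    fake_client = Mock()
    fake_client.complete.return_value = "  Refunds are accepted within 30 days.\n"

    result = answer_question(fake_client, "What is the refund policy?")

    assert result == "Refunds are accepted within 30 days."
    # Verify the prompt that reached the (mocked) model contains the query
    sent_prompt = fake_client.complete.call_args.args[0]
    assert "What is the refund policy?" in sent_prompt
```

Because the model is replaced by a canned reply, any failure here points at your own code rather than at model variance.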

Layer 1: Test Prompt Construction

Your prompt builder is plain code. Test it like code.

def build_prompt(user_query: str, context: str) -> str:
    return f"""You are a helpful assistant.
Context: {context}
User: {user_query}
Answer concisely."""

def test_prompt_includes_context():
    prompt = build_prompt("What is the policy?", "Refund window is 30 days.")
    assert "Refund window is 30 days." in prompt

def test_prompt_has_system_instruction():
    prompt = build_prompt("Hi", "")
    assert "You are a helpful assistant" in prompt

These tests are fast, make no API calls, and catch prompt-template regressions before they ever reach the model.

Layer 3: Test Output Parsing

If your app parses structured data from LLM output, test the parser independently with canned responses:

import json
import re

import pytest

def parse_llm_json_response(raw: str) -> dict:
    # Greedy match: spans from the first "{" to the last "}" in the response
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group())

def test_parser_extracts_json():
    raw = "Here is the result: {\"score\": 8, \"reason\": \"Clear\"}"
    result = parse_llm_json_response(raw)
    assert result["score"] == 8

def test_parser_raises_on_no_json():
    with pytest.raises(ValueError):
        parse_llm_json_response("Sorry, I cannot help with that.")

Layer 2: Evaluating LLM Output Quality

For actual model output, shift from assertion-based testing to evaluation-based testing. Three practical approaches:

1. Rubric Scoring (LLM-as-Judge)

def evaluate_response(question, answer, criteria):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # Call your LLM here and parse response
    ...
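
One way to fill in the judge call is to inject the model as a plain callable, which keeps the scoring logic testable with a stub. This is a sketch under that assumption: `judge_response`, `call_llm`, `min_score`, and `fake_judge` are hypothetical names, not part of any library.

```python
import json
from typing import Callable


def judge_response(question: str, answer: str, criteria: str,
                   call_llm: Callable[[str], str], min_score: int = 4) -> dict:
    """Ask a judge model to score an answer, then validate the verdict."""
    eval_prompt = (
        "Rate the following answer on a scale of 1-5 against the criteria.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criteria: {criteria}\n"
        'Return JSON: {"score": int, "reason": str}'
    )
    verdict = json.loads(call_llm(eval_prompt))
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned out-of-range score: {score}")
    return {"score": score, "reason": verdict["reason"], "passed": score >= min_score}


# Stubbed judge so the sketch runs without an API key
def fake_judge(prompt: str) -> str:
    return '{"score": 5, "reason": "Directly answers the question."}'


result = judge_response("What is 2+2?", "4", "Correctness", fake_judge)
print(result["passed"])
```

Validating the score range matters because the judge is itself an LLM and can return malformed or out-of-range verdicts.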

2. Semantic Similarity (for factual tasks)

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def is_semantically_similar(expected, actual, threshold=0.85):
    emb1 = model.encode(expected, convert_to_tensor=True)
    emb2 = model.encode(actual, convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()
    return score >= threshold
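
For intuition about what the threshold compares: cosine similarity is the dot product of two vectors divided by the product of their lengths. The toy sketch below computes it by hand on made-up three-dimensional vectors; `cosine_similarity` is an illustrative helper, not the library function used above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), orthogonal)
```

A suitable threshold depends on the embedding model and the task, so treat 0.85 as a starting point to tune against a handful of labelled pairs.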

3. Behavioral Testing (what should never happen)

FORBIDDEN_PHRASES = ["I cannot", "As an AI", "I don't have access"]

def test_no_refusals_on_valid_queries(llm_client):
    response = llm_client.ask("What is the return policy?")
    for phrase in FORBIDDEN_PHRASES:
        # Compare case-insensitively so "as an AI" is caught too
        assert phrase.lower() not in response.lower(), f"Got refusal: {phrase}"

Testing for Regressions: Golden Datasets

Build a golden dataset — a curated set of input/expected-output pairs — and run evaluations on every model or prompt change:

Input                            Min Score   Pass?
"Summarize this in 3 points"     4/5         ✅
"Translate to French"            4/5         ✅
"What's 2+2?" (sanity check)     5/5         ✅

This won't catch everything, but it will catch regressions — which is the main goal.
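
A golden-dataset run can be sketched as a loop that generates an answer per case, scores it, and compares the score with the case's minimum. Everything named here (`GoldenCase`, `run_golden_dataset`, and the stub generator and scorer) is hypothetical; in practice `generate` would call your model and `score` would be one of the evaluators above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    prompt: str
    min_score: int  # lowest acceptable score on a 1-5 scale


def run_golden_dataset(cases: List[GoldenCase],
                       generate: Callable[[str], str],
                       score: Callable[[str, str], int]) -> List[dict]:
    """Generate an answer per case, score it, and record pass/fail."""
    results = []
    for case in cases:
        answer = generate(case.prompt)
        got = score(case.prompt, answer)
        results.append({"prompt": case.prompt, "score": got,
                        "passed": got >= case.min_score})
    return results


# Stubs stand in for the real model and evaluator
cases = [GoldenCase("What's 2+2?", 5), GoldenCase("Translate to French", 4)]
fake_generate = lambda prompt: "4" if "2+2" in prompt else "???"
fake_score = lambda prompt, answer: 5 if answer == "4" else 2

results = run_golden_dataset(cases, fake_generate, fake_score)
failures = [r["prompt"] for r in results if not r["passed"]]
print(failures)
```

Failing the CI job whenever `failures` is non-empty turns the dataset into a regression gate for prompt and model changes.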

Tools Worth Knowing

Promptfoo: open-source LLM eval framework, define test cases in YAML
LangSmith: tracing + eval if you're on LangChain
DeepEval: pytest-style assertions for LLM metrics

Building Self-Healing Selenium Frameworks with AI

Yash Pandey — Sun, 26 Apr 2026 00:23:41 +0000

One of the biggest pain points in UI test automation is flaky locators. A developer renames a class, restructures a component, and suddenly 40 tests are failing — not because the feature broke, but because the test couldn't find the element.

Self-healing frameworks solve this. With a bit of AI in the mix, your tests can recover from locator failures at runtime instead of crashing.

What "Self-Healing" Actually Means

A self-healing test framework doesn't just retry. It:

Detects a broken locator at runtime
Uses fallback strategies (or AI ranking) to find the correct element
Optionally updates the locator in source so the same error doesn't repeat

This is different from flaky test retries — you're fixing the root cause, not suppressing the symptom.

The Core Pattern: Locator Fallback Chain

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

LOCATOR_STRATEGIES = [
    (By.ID, "submit-btn"),
    (By.CSS_SELECTOR, "button[data-testid='submit']"),
    (By.XPATH, "//button[contains(text(),'Submit')]"),
    (By.CSS_SELECTOR, "form button[type='submit']"),
]

def find_element_with_healing(driver):
    for strategy, locator in LOCATOR_STRATEGIES:
        try:
            el = driver.find_element(strategy, locator)
            print(f"Located using: {strategy} → {locator}")
            return el
        except NoSuchElementException:
            continue
    raise NoSuchElementException("Element not found via any strategy")

This is the foundation. Every locator has a priority list. If the primary fails, it cascades.

Adding AI: Ranking Candidates with Similarity Scoring

The smarter version doesn't just fall back blindly — it scores candidates on the page and picks the best match. A simple approach uses DOM attribute similarity:

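from difflib import SequenceMatcher

from selenium.webdriver.common.by import By

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def find_best_candidate(driver, target_attributes: dict, threshold=0.6):
    if not target_attributes:
        return None
    # Scanning every element is slow on large pages; narrow the selector if you can
    candidates = driver.find_elements(By.CSS_SELECTOR, "*")
    best_score = 0
    best_element = None

    for el in candidates:
        total = 0
        for attr, value in target_attributes.items():
            el_attr = el.get_attribute(attr) or ""
            total += similarity(el_attr, value)
        # Average across attributes so the score stays between 0 and 1
        score = total / len(target_attributes)

        if score > best_score:
            best_score = score
            best_element = el

    return best_element if best_score > threshold else None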

# Usage
element = find_best_candidate(driver, {
    "id": "submit-btn",
    "class": "btn-primary",
    "type": "submit"
})

For production use, replace the similarity scorer with an embedding model (OpenAI, Sentence Transformers) to compare semantic similarity of element context — not just attribute strings.
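
To see how attribute scoring ranks candidates without launching a browser, the sketch below runs the same idea over plain dictionaries. The candidate elements are made up for illustration, and `attribute_score` is a hypothetical helper that averages similarity across the target's attributes.

```python
from difflib import SequenceMatcher


def attribute_score(candidate: dict, target: dict) -> float:
    """Average string similarity across the target's attributes (0.0 to 1.0)."""
    if not target:
        return 0.0
    total = sum(
        SequenceMatcher(None, candidate.get(attr, ""), value).ratio()
        for attr, value in target.items()
    )
    return total / len(target)


# What the test used to look for
target = {"id": "submit-btn", "class": "btn-primary", "type": "submit"}

# Hypothetical elements on the page after a refactor renamed the id
candidates = [
    {"id": "submit-button", "class": "btn-primary", "type": "submit"},
    {"id": "cancel-btn", "class": "btn-secondary", "type": "button"},
    {"id": "", "class": "nav-link", "type": ""},
]

ranked = sorted(candidates, key=lambda c: attribute_score(c, target), reverse=True)
best = ranked[0]
print(best["id"])
```

Averaging keeps the score on a fixed 0 to 1 scale, so one threshold works no matter how many attributes you compare.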

Closing the Loop: Auto-Updating Locators

The final piece is writing the winning locator back to your test config so future runs use it directly:

import json

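def update_locator_store(key, strategy, locator, path="locators.json"):
    # Assumes the store file already exists and contains a JSON object
    with open(path, "r+") as f:
        store = json.load(f)
        store[key] = {"strategy": strategy, "locator": locator}
        f.seek(0)
        json.dump(store, f, indent=2)
        # Drop leftover bytes if the new JSON is shorter than the old file
        f.truncate()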

Combine this with your CI pipeline to generate a PR or comment when a locator heals — giving your team visibility without manual intervention.
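
One way to feed that visibility is to append each healing event to a JSON Lines file during the run, then turn the file into a short summary a CI step can post. This is a sketch under those assumptions: the file name, the event fields, and both function names are hypothetical, and the actual PR or comment creation is left to your CI tooling.

```python
import json
import os
import tempfile
from collections import Counter


def record_healing_event(log_path, key, old_locator, new_locator):
    """Append one healing event as a JSON line."""
    event = {"key": key, "old": old_locator, "new": new_locator}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def summarize_healing_log(log_path):
    """Build a plain-text summary suitable for a PR comment or CI log."""
    with open(log_path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]
    counts = Counter(e["key"] for e in events)
    lines = [f"{len(events)} healing event(s) in this run:"]
    for key, n in counts.most_common():
        lines.append(f"- {key}: healed {n} time(s)")
    return "\n".join(lines)


# Demo against a temporary file
log_path = os.path.join(tempfile.mkdtemp(), "healing.jsonl")
record_healing_event(log_path, "submit_button", "#submit-btn", "button[data-testid='submit']")
record_healing_event(log_path, "submit_button", "#submit-btn", "form button[type='submit']")
record_healing_event(log_path, "search_box", "#q", "input[name='q']")
summary = summarize_healing_log(log_path)
print(summary)
```

Elements that heal repeatedly are good candidates for a stable data-testid, which removes the need for healing in the first place.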

When Not to Use This

Self-healing adds complexity. If your app has a stable design system and disciplined data-testid usage, you probably don't need it. This pattern is most valuable in:

Legacy apps with unstable DOM structures
Teams where devs and QA are siloed
Frequent UI redesigns without test ownership

Tips to Take This Further

Healenium — open-source proxy that adds self-healing to existing Selenium setups with minimal code change
Sentence Transformers — cosine similarity gives better semantic matching than SequenceMatcher
Log every healing event to a dashboard — you'll spot design instability patterns quickly
