Ritik Pal
How I Built a Production-Grade E2E Test Automation Framework for an AI Testing Product



When your product lets users write tests in plain English and an AI agent executes them across browsers, mobile devices, and cloud environments — how do you test that?

This is the exact problem I had to solve.

I'm Ritik Pal, Member of Technical Staff at an AI testing company — and over the past year I built two complete end-to-end test automation frameworks from scratch. One for the frontend (61 test specs, 18 feature areas). One for the backend (192 API endpoints, 32 typed clients).

Together, they form the quality backbone of the AI testing product I work on.

No hand-waving. No theory. In this article I'll walk you through the real architecture decisions, the actual code, and the hard problems that took weeks to solve.


The Problem: Who Tests the Testing Tool?

The AI testing product I work on is a GenAI-native testing agent. Teams write test cases in natural language like:

"Login with valid credentials and verify the dashboard loads correctly"

...and the agent converts that into actual automation that runs across browsers and environments. It's genuinely impressive technology.

But here's the uncomfortable question nobody asks until it's too late: who tests the testing tool?

We needed a quality layer that could:

  • Validate 18 different UI feature areas — projects, test cases, test runs, reports, AI agent integration, Jira integration, and more

  • Hit every single API endpoint (192 of them) with proper auth, retry logic, and schema validation

  • Run across 4 environments in parallel: US Staging, EU Staging, US Production, EU Production

  • Gate every PR with a smoke suite and run full regression nightly

  • Scale to 15 parallel workers on a cloud execution grid

Two repos. One mission. Zero existing framework to build on.


The High-Level Architecture

Before getting into the details, here's the full picture:

Frontend: E2E-Frontend-Automation

Tech Stack:  Playwright 1.49 + TypeScript 5.7 + Node 20
Pattern:     Page Object Model with custom Playwright fixtures
Tests:       61 specs across 18 feature areas
Execution:   Local → Remote (CDP Grid) → Cloud (HyperExecute)

Backend: E2E-Backend-Automation

Tech Stack:  Playwright Test + Axios + TypeScript 5.7 + AJV + Faker.js
Pattern:     Typed API clients with interceptor chain architecture
Tests:       192 endpoints, 41 test files, ~180 test cases
Tiers:       Smoke (30s) · Regression (full) · Flows (E2E) · Benchmark (latency)

Both frameworks are built on five shared principles:

  1. Type safety everywhere — TypeScript catches breaking changes at compile time, before a single test runs

  2. Zero boilerplate per test — fixtures handle setup and teardown automatically

  3. Test data isolation — every test creates its own data and cleans up after itself

  4. Multi-environment support — one environment variable switches everything

  5. Smart retry logic — distinguishes transient failures from real bugs


Deep Dive: Frontend Framework

The 6-Layer Architecture

┌─────────────────────────────────────────────┐
│  Test Layer (61 specs, @smoke/@regression)  │
├─────────────────────────────────────────────┤
│  Fixture Layer (20 page objects + API setup)│
├─────────────────────────────────────────────┤
│  Page Object Layer (20 modules, 40 files)   │
├─────────────────────────────────────────────┤
│  API Layer (REST API + Jira clients)        │
├─────────────────────────────────────────────┤
│  Config & Utils (env, waits, retry, random) │
├─────────────────────────────────────────────┤
│  Reporting & CI/CD (4 reporters, 3 pipelines│
└─────────────────────────────────────────────┘

Each layer has one job. The test layer just tests. The fixture layer manages lifecycle. The page objects handle interactions. This separation is what keeps the framework maintainable as it grows.


BasePage: Minimal by Design

Most Page Object frameworks I've seen make the same mistake: they try to abstract away the underlying test library entirely. They wrap every Playwright method in a custom method, add unnecessary complexity, and end up with a framework that fights against the tool it's built on.

My BasePage has exactly 3 core methods:

export class BasePage {
  constructor(readonly page: Page) {}

  /**
   * Auto-detects XPath, testid, or CSS selectors.
   * Removes the "which locator type?" decision from test authors.
   */
  loc(selector: string): Locator {
    if (selector.startsWith('testid:')) {
      return this.page.getByTestId(selector.slice(7));
    }
    return selector.startsWith('/') || selector.startsWith('(')
      ? this.page.locator(`xpath=${selector}`)
      : this.page.locator(selector);
  }

  /**
   * Replace {{placeholders}} in selector strings.
   * Enables dynamic selectors without string concatenation.
   */
  tpl(selector: string, replacements: Record<string, string>): string {
    let result = selector;
    for (const [key, value] of Object.entries(replacements)) {
      result = result.replaceAll(`{{${key}}}`, value);
    }
    return result;
  }

  /**
   * Retry any action with configurable attempts and delay.
   * Used for operations that are legitimately flaky by nature.
   */
  async retry(
    action: () => Promise<void>,
    options?: { retries?: number; delayMs?: number; label?: string },
  ): Promise<void> {
    const { retries = 3, delayMs = 2_000, label = 'action' } = options ?? {};
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        await action();
        return;
      } catch (error) {
        if (attempt === retries) throw error;
        await this.page.waitForTimeout(delayMs);
      }
    }
  }
}

The design rationale: Page objects use Playwright's native API directly. The base class adds just enough — auto-detecting locator types, template interpolation, and retry logic. Everything else stays in Playwright's hands.

The tpl method deserves a special mention. Instead of building selectors with string concatenation like '.project-' + name + '-button' (which breaks on special characters), you define a selector template once and interpolate:

// In page object selectors file
const PROJECT_BUTTON = '.project-{{name}}-action';

// In page object method
async clickProject(name: string) {
  await this.loc(this.tpl(PROJECT_BUTTON, { name })).click();
}

Clean, readable, and debuggable.


The Fixture System

This is where the real power lives. One fixture file exports everything a test needs — auth, browser context, page objects, and environment routing:

export const test = base.extend<AppFixtures>({
  page: async ({ page }, use, testInfo) => {

    // LOCAL MODE: just navigate and hand over the page
    if (process.env.TEST_MODE !== 'remote') {
      await page.goto(EnvConfig.baseUrl, { waitUntil: 'domcontentloaded' });
      await use(page);
      return;
    }

    // REMOTE MODE: per-test CDP session on remote browser grid
    const testName = testInfo.titlePath.slice(1).join(' > ');
    const wsEndpoint = getCdpEndpoint(profile, runProfile, testName);

    let browser, context, remotePage;
    for (let attempt = 1; attempt <= 2; attempt++) {
      try {
        browser = await chromium.connect(wsEndpoint);
        context = await browser.newContext({
          storageState: testInfo.project.use.storageState,
        });
        remotePage = await context.newPage();
        await remotePage.goto(EnvConfig.baseUrl);
        break;
      } catch (error) {
        if (attempt === 2) throw error;
        // Grid congestion on attempt 1 — retry once before giving up
      }
    }

    await use(remotePage);

    // Update CI dashboard with pass/fail status
    await remotePage.evaluate(
      () => {},
      `remote_action: ${JSON.stringify({
        action: 'setTestStatus',
        arguments: { status: testInfo.status }
      })}`
    );
  },

  // 20 page objects — each lazily initialized per test
  projectPage:  async ({ page }, use) => { await use(new ProjectPage(page)); },
  testCasePage: async ({ page }, use) => { await use(new TestCasePage(page)); },
  // ... 18 more page objects
});

The clever part: In local mode, tests use the default Playwright page. In remote mode, the same fixture creates a per-test CDP session on the remote browser grid, retries once on connection failure, and reports the final test status back to the CI dashboard. The test code itself never knows which mode it's running in.


Composite Fixtures for Zero-Setup Tests

The most powerful pattern in the entire framework: composite fixtures that create entire entity hierarchies before the test even starts.

// Three fixture levels — each builds on the previous:

// Level 1: Just a project
projectOnly: creates project → opens it → USE → deletes it

// Level 2: Project with a test case
projectWithTestCase: creates project → creates TC → USE → deletes both

// Level 3: Full hierarchy
projectWithTestCaseInFolder: creates project → folder → TC → USE → deletes all

A test that needs a full hierarchy? Just destructure it:

test('verify test case appears in folder', async ({
  projectWithTestCaseInFolder,
  testCasePage,
}) => {
  // Everything is already created and opened.
  // Just test the thing you care about.
  await expect(testCasePage.folderItem).toBeVisible();
  // Cleanup happens automatically after this line.
});

Zero setup lines. Zero teardown lines. Zero cognitive overhead about state management.

This pattern transformed how the team writes tests. New test authors don't need to understand fixture wiring — they just pick the fixture level they need.


Smart Waits — The Hardest Problem

Flaky tests are almost always a wait problem. Here's what I found in practice and how I solved each case:

Problem 1: Search field debounce

Typing into a search field and immediately pressing Enter fires before the debounce completes:

async fillAndSubmit(locator: Locator, value: string): Promise<void> {
  await locator.fill(value);
  await this.page.waitForTimeout(800); // explicit debounce wait
  await this.page.keyboard.press('Enter');
}

Problem 2: Network settle after navigation

networkidle is the ideal wait — but it's strict. On CI runners or apps with persistent WebSocket connections, it times out:

async clickAndWaitForNetwork(locator: Locator): Promise<void> {
  await locator.click();
  try {
    await this.page.waitForLoadState('networkidle');
  } catch {
    // Fallback: DOM is ready even if network isn't fully settled
    await this.page.waitForLoadState('domcontentloaded');
  }
}

The principle: Never use fixed waitForTimeout as a crutch for unknown timing. Use it only when you know exactly what you're waiting for (like a debounce) and pair explicit waits with network/DOM state checks.
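A generic condition poller backs this principle: wait for an observable state with a bounded timeout instead of sleeping blind. Here's a minimal sketch in plain TypeScript; `waitFor` is my own helper name, not part of the framework or Playwright:

```typescript
// Generic explicit wait: poll a condition until it is true or a timeout expires.
// Unlike a blind sleep, this returns as soon as the state is reached.
export async function waitFor(
  condition: () => Promise<boolean> | boolean,
  options: { timeoutMs?: number; intervalMs?: number; label?: string } = {},
): Promise<void> {
  const { timeoutMs = 10_000, intervalMs = 100, label = 'condition' } = options;
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return; // condition met: stop waiting immediately
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms waiting for ${label}`);
}
```

Usage looks like `await waitFor(() => page.url().includes('/dashboard'), { label: 'dashboard URL' })` in place of a fixed sleep after navigation.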


Deep Dive: Backend Framework

The Interceptor Chain Architecture

This is the core architectural innovation in the backend framework. Every API client is built on Axios with a 4-layer interceptor chain:

Request  →  [HTTP Logger]  →  [Auth Injector]  →  Server
Response ←  [Error Handler] ←  [Retry Logic]  ←  Server
export const createClient = (options: ClientOptions = {}): AxiosInstance => {
  const client = axios.create({
    baseURL: serviceConfig.baseURL,
    timeout: 30_000,
  });

  // Layer 1: HTTP logger — captures timing from the moment request fires
  attachHttpLogger(client);

  // Layer 2: Auth — inject Bearer token or Basic auth depending on service
  client.interceptors.request.use(async (config) => {
    await applyAuth({ headers: config.headers }, authStrategy);
    return config;
  });

  // Layer 3: Error handler — context-aware error classification
  client.interceptors.response.use(
    (res) => res,
    (err) => {
      // 401, 404, 422 in negative tests → EXPECTED (don't alarm)
      // 500 "record not found" → EXPECTED (deterministic)
      // Everything else → ACTUAL ERROR (needs attention)
      return Promise.reject(err);
    },
  );

  // Layer 4: Retry — exponential backoff for transient infrastructure failures
  attachRetryInterceptor(client);

  return client;
};

Why this specific ordering matters:

  • Logger runs first — captures timing from request inception, not after auth delay

  • Auth runs before Error handler — auth failures are classified correctly as errors

  • Error handler runs before Retry — only actually retryable errors get retried

  • Retry runs last — replays the complete chain including auth refresh on retry

Get the order wrong and you get incorrect timing data, silent auth failures, or infinite retry loops on permanent errors.


Exponential Backoff Retry

Not all 5xx errors deserve a retry. A "record not found" from the database is permanent — retrying it wastes time. A 502 gateway timeout is transient — retrying it often succeeds:

const RETRYABLE_STATUS_CODES = new Set([500, 502, 503, 504]);
const RETRYABLE_NETWORK_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'ECONNABORTED']);

const isRetryable = (error: AxiosError): boolean => {
  if (error.response?.status === 500) {
    const body = JSON.stringify(error.response.data);
    // Deterministic application error — not retryable
    if (body.includes('record not found')) return false;
  }
  return (
    RETRYABLE_STATUS_CODES.has(error.response?.status ?? 0) ||
    RETRYABLE_NETWORK_CODES.has(error.code ?? '')
  );
};

// Exponential backoff: 1s → 2s → 4s (max 3 retries)
const delay = Math.pow(2, retryCount) * 1000;

This nuance — distinguishing application errors from infrastructure errors — is the difference between a retry system that helps and one that masks real bugs.
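The retry decision and the backoff schedule can be factored into two pure functions, which keeps them unit-testable without an Axios instance. A sketch of that factoring, with a simplified stand-in for the AxiosError fields the logic reads:

```typescript
// Simplified stand-in for the AxiosError fields the retry logic inspects.
interface ApiError {
  status?: number;  // HTTP status, if a response arrived
  code?: string;    // Node network error code, e.g. ECONNRESET
  body?: string;    // serialized response body
}

const RETRYABLE_STATUS = new Set([500, 502, 503, 504]);
const RETRYABLE_NETWORK = new Set(['ECONNRESET', 'ETIMEDOUT', 'ECONNABORTED']);

// Application-level 500s like "record not found" are deterministic: skip retry.
export const shouldRetry = (error: ApiError): boolean => {
  if (error.status === 500 && error.body?.includes('record not found')) return false;
  return RETRYABLE_STATUS.has(error.status ?? 0) || RETRYABLE_NETWORK.has(error.code ?? '');
};

// Exponential backoff schedule: attempt 1 → 1s, attempt 2 → 2s, attempt 3 → 4s.
export const backoffMs = (attempt: number): number => 2 ** (attempt - 1) * 1000;
```

The attach step then just wires these into a response interceptor that re-issues `client.request(config)` while `shouldRetry` holds and attempts remain.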


Worker-Scoped Fixtures with LIFO Cleanup

Running 10 parallel workers means 10 simultaneous test suites. Each worker needs its own project context to avoid data collisions:

export const test = base.extend<{}, WorkerFixtures>({

  // CleanupRegistry tracks everything created — cleans up in reverse order
  cleanup: [async ({}, use) => {
    const registry = new CleanupRegistry();
    await use(registry);
    await registry.run(); // LIFO: last created = first deleted
  }, { scope: 'worker' }],

  // One project per worker — shared across all tests in that worker
  projectId: [async ({ cleanup }, use) => {
    const res = await ProjectsClient.create(fakeProject());
    cleanup.add(async () => {
      await ProjectsClient.delete(res.data.id);
    });
    await use(res.data.id);
  }, { scope: 'worker' }],

});

Why LIFO (Last In, First Out) cleanup is essential:

Create order:   Project → Folder → TestCase → TestStep
Cleanup order:  TestStep → TestCase → Folder → Project  ✓

If you delete the Project first, the Folder, TestCase, and TestStep deletions return 404. With LIFO, every entity is deleted in the correct dependency order, automatically.
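The CleanupRegistry itself isn't shown in the fixture snippet; at its core it is just a stack of async disposers drained in reverse. This is my minimal reconstruction, not the framework's exact class:

```typescript
type CleanupFn = () => Promise<void>;

// Stack of async disposers, executed in reverse registration order (LIFO)
// so dependent entities are deleted before their parents.
export class CleanupRegistry {
  private fns: CleanupFn[] = [];

  add(fn: CleanupFn): void {
    this.fns.push(fn);
  }

  async run(): Promise<void> {
    // Pop from the end; one failed delete should not block the rest.
    while (this.fns.length > 0) {
      const fn = this.fns.pop()!;
      try {
        await fn();
      } catch (error) {
        console.warn('Cleanup step failed:', error);
      }
    }
  }
}
```

Swallowing individual cleanup failures is a deliberate choice here: a 404 on one delete (say, the entity was already removed by a cascading delete) shouldn't leave every later disposer unrun.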


4-Tier Test Strategy

One test suite that runs everywhere is an anti-pattern. Different contexts need different test configurations:

| Tier       | Timeout | Workers    | Retries | When                           |
|------------|---------|------------|---------|--------------------------------|
| Smoke      | 30s     | 10         | 1       | Every PR — must be fast        |
| Regression | 60s     | 5          | 2       | Nightly — full coverage        |
| Flows      | 120s    | 1 (serial) | 0       | On demand — E2E business flows |
| Benchmark  | 180s    | 10         | 0       | Weekly — latency profiling     |

The Flows tier runs serially by design. These tests validate complete multi-step business processes:

Create Project → Create Folder → Add Test Case → Create Test Run → Verify Results

Running these in parallel would mean multiple tests modifying the same entities simultaneously — a recipe for race conditions and false failures.

The Benchmark tier has zero retries by design. If you retry a latency measurement, the p95 and p99 numbers become meaningless. One attempt. Record the truth.
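In Playwright terms, the tiers map naturally onto separate projects with per-project timeout and retries. This is an illustrative sketch, not the exact production config; names and paths are mine, and worker counts are passed per run on the CLI since workers is a global setting:

```typescript
// playwright.config.ts — illustrative tier wiring (assumed names and paths).
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'smoke',      testDir: './tests', grep: /@smoke/,      timeout: 30_000, retries: 1 },
    { name: 'regression', testDir: './tests', grep: /@regression/, timeout: 60_000, retries: 2 },
    // Flows run serially: parallelism off, single worker at invocation time.
    { name: 'flows',      testDir: './flows', timeout: 120_000, retries: 0, fullyParallel: false },
    // Benchmark: zero retries so latency percentiles stay honest.
    { name: 'benchmark',  testDir: './bench', timeout: 180_000, retries: 0 },
  ],
});

// Invocation (workers are a CLI/global concern):
//   npx playwright test --project=smoke --workers=10
//   npx playwright test --project=flows --workers=1
```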


Test Data Factories

15 parallel workers creating test data simultaneously needs one thing above all else: guaranteed uniqueness.

const uniqueName = (entity: string): string =>
  `AutoTest_${entity}_${faker.string.nanoid(6)}`;

export const fakeProject = (overrides?: Partial<ProjectPayload>) => ({
  name: uniqueName('Proj'),          // → AutoTest_Proj_x9m3p7
  description: faker.lorem.sentence(),
  tags: [faker.lorem.word()],
  ...overrides,
});

export const fakeTestCase = (projectId: string, overrides?) => ({
  project_id: projectId,
  test_cases: [{
    title: uniqueName('TC'),         // → AutoTest_TC_a8k2n1
    priority: 'Medium',
    type: 'Functional',
    ...overrides,
  }],
});

The overrides pattern is the most underrated part of this design. Default data is auto-generated for the 80% of tests that don't care about specific field values. But any test can override exactly what it needs:

// Test doesn't care about priority — use default
const tc = fakeTestCase(projectId);

// Test specifically validates Critical priority behaviour
const tc = fakeTestCase(projectId, { priority: 'Critical' });

// Test validates multiple field combinations
const tc = fakeTestCase(projectId, {
  priority: 'High',
  type: 'Integration',
  status: 'Draft',
});

No factory method explosion. No duplicate data builders. One factory, infinite flexibility.


The Hard Problems (And How I Solved Them)

1. Multi-Environment Region Routing

Two tiers (staging and production) across two regions: four deployments, each with its own URL set. One wrong URL and tests silently hit the wrong region — passing when they should be testing something else entirely.

stage    → stage.app.internal     (US)
eu-stage → eu-stage.app.internal  (EU — different subdomain!)
prod     → app.example.com        (US)
eu-prod  → eu-app.example.com     (EU)

Solution: a single TEST_ENV variable that resolves the complete configuration:

const ENV_MAP = {
  'stage':    { project: 'us-chromium', region: 'us' },
  'eu-stage': { project: 'eu-chromium', region: 'eu' },
  'prod':     { project: 'us-chromium', region: 'us' },
  'eu-prod':  { project: 'eu-chromium', region: 'eu' },
};

One source of truth. No URL mixing across config files. Switching from staging to production is one variable change.
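A sketch of how a single TEST_ENV value can fan out into the full run configuration. The URLs are the placeholder hosts from above, and the resolver shape is my own; the key point is failing fast on an unknown name rather than falling back silently:

```typescript
// Resolve everything a run needs from one TEST_ENV value.
interface EnvConfig {
  baseUrl: string;
  project: string;
  region: 'us' | 'eu';
}

const ENV_MAP: Record<string, EnvConfig> = {
  'stage':    { baseUrl: 'https://stage.app.internal',    project: 'us-chromium', region: 'us' },
  'eu-stage': { baseUrl: 'https://eu-stage.app.internal', project: 'eu-chromium', region: 'eu' },
  'prod':     { baseUrl: 'https://app.example.com',       project: 'us-chromium', region: 'us' },
  'eu-prod':  { baseUrl: 'https://eu-app.example.com',    project: 'eu-chromium', region: 'eu' },
};

export function resolveEnv(name = process.env.TEST_ENV ?? 'stage'): EnvConfig {
  const config = ENV_MAP[name];
  // Fail fast on typos: a silent fallback could hit the wrong region.
  if (!config) {
    throw new Error(`Unknown TEST_ENV "${name}". Valid values: ${Object.keys(ENV_MAP).join(', ')}`);
  }
  return config;
}
```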


2. Remote Grid Stability Under Parallel Load

Running 15 tests simultaneously on a remote browser grid is inherently unstable. Connections drop. Sessions time out. Grid gets congested during peak hours.

My solution was three-pronged:

Inflated timeouts for remote execution:

Local:  test=120s, expect=15s, action=30s, navigation=60s
Remote: test=300s, expect=25s, action=45s, navigation=90s

Per-test connection retry:

for (let attempt = 1; attempt <= 2; attempt++) {
  try {
    browser = await chromium.connect(wsEndpoint);
    break;
  } catch (error) {
    if (attempt === 2) throw error;
    // Grid congestion — wait briefly and retry
  }
}

Named sessions: Every remote test gets its own CDP session with the test name embedded. The CI dashboard shows individual test results, not a blob of unnamed sessions — which makes debugging failures in CI actually possible.


3. Auth Token Sharing Across Workers

Logging in once per test would trigger rate limiting. Logging in once per worker still means 15 login calls at startup. The solution: global setup with file-based token caching.

Frontend (cookie-based):

// global-setup.ts — runs ONCE before any worker starts
const authFilePath = '.auth/user.json';
const ONE_HOUR = 60 * 60 * 1000;

if (
  fs.existsSync(authFilePath) &&
  Date.now() - fs.statSync(authFilePath).mtimeMs < ONE_HOUR
) {
  return; // Reuse cached cookies — no login needed
}

// Otherwise: API login → save cookies → all workers read the file

Backend (JWT-based):

// Fetch JWT once at startup → save to .auth-token file
// Every worker reads from the same file on each request
const token = await getAuthToken(); // reads from file, not network
config.headers['Authorization'] = `Bearer ${token}`;

15 workers. 1 login. No rate limiting.
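The file-based cache is the same pattern in both repos: check the file's mtime, reuse the token if fresh, refresh otherwise. A runnable sketch with the login call stubbed out as a parameter (the real implementation hits the auth endpoint):

```typescript
import fs from 'node:fs';

const TOKEN_TTL_MS = 60 * 60 * 1000; // 1 hour

// Reuse the cached token if the file is fresh enough; otherwise fetch
// a new one and rewrite the cache. `fetchToken` stands in for the real
// login request so the sketch stays self-contained.
export async function getAuthToken(
  cachePath: string,
  fetchToken: () => Promise<string>,
): Promise<string> {
  if (
    fs.existsSync(cachePath) &&
    Date.now() - fs.statSync(cachePath).mtimeMs < TOKEN_TTL_MS
  ) {
    return fs.readFileSync(cachePath, 'utf8'); // cache hit: no network call
  }
  const token = await fetchToken(); // cache miss: exactly one real login
  fs.writeFileSync(cachePath, token);
  return token;
}
```

Every worker calls this on startup, but only the first one (or a stale cache) triggers a real login; everyone else reads the file.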


4. Flaky Test Detection

A test that fails on its first attempt but passes on a retry is not "passing" — it's flaky. Retries hide the problem. You need to track it:

class FlakyReporter implements Reporter {
  private flakyTests: FlakyTest[] = [];

  onTestEnd(test: TestCase, result: TestResult) {
    if (test.outcome() === 'flaky') {
      this.flakyTests.push({
        title:          test.title,
        file:           test.location.file,
        failureMessage: result.errors[0]?.message,
        attempts:       result.retry + 1,
      });
    }
  }

  onEnd() {
    if (this.flakyTests.length > 0) {
      // Still exits with code 1 — flaky tests are not OK
      console.error(`⚠️  ${this.flakyTests.length} flaky test(s) detected`);
      this.flakyTests.forEach(t => console.error(`  • ${t.title}`));
      process.exit(1);
    }
  }
}

Flaky tests don't silently pass in this framework. They're visible, logged, and they break the build — because a flaky test is a real problem waiting to become a consistent failure.


5. API Schema Validation at Scale

192 endpoints. Manually writing JSON schemas for each one is weeks of work and immediately goes stale when the backend team changes a response shape.

Solution: auto-discovery and drift detection.

First run:   Real API response → auto-generate JSON schema → save to schemas/discovered/
All future:  Real API response → validate against saved schema → FAIL if shape changed
// On first encounter of an endpoint
const schema = generateSchema(response.data);
fs.writeFileSync(`schemas/discovered/${endpointKey}.json`, JSON.stringify(schema));

// On subsequent runs
const savedSchema = JSON.parse(fs.readFileSync(`schemas/discovered/${endpointKey}.json`));
const valid = ajv.validate(savedSchema, response.data);
if (!valid) {
  throw new Error(`Schema drift detected on ${endpointKey}: ${ajv.errorsText()}`);
}

100+ schemas, auto-generated. When the backend team silently changes a response shape (it happens more than you'd think), the CI pipeline catches it before it reaches production.
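A minimal version of the schema generator infers a JSON-Schema-like shape recursively from a sample response. This is my reconstruction of the idea, not the production code, which feeds AJV and handles more edge cases (nullable fields, heterogeneous arrays):

```typescript
// Infer a minimal JSON-Schema-like shape from a sample value.
// Arrays are typed from their first element; empty arrays get no `items`.
type Schema = {
  type: string;
  properties?: Record<string, Schema>;
  required?: string[];
  items?: Schema;
};

export function generateSchema(value: unknown): Schema {
  if (value === null) return { type: 'null' };
  if (Array.isArray(value)) {
    return value.length > 0
      ? { type: 'array', items: generateSchema(value[0]) }
      : { type: 'array' };
  }
  if (typeof value === 'object') {
    const properties: Record<string, Schema> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      properties[key] = generateSchema(child);
    }
    // Every observed key becomes required: dropping a field counts as drift.
    return { type: 'object', properties, required: Object.keys(properties) };
  }
  return { type: typeof value }; // 'string' | 'number' | 'boolean'
}
```

On the first run the generated schema is written to disk; later runs hand the saved schema plus the live response to AJV for validation.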


CI/CD Pipelines

Frontend Pipeline

# ci.yml — Triggered on every PR
steps:
  - TypeScript compile check (tsc --noEmit)
  # Catches type errors before running a single browser

# test.yml — Manual trigger with environment selection
steps:
  - Select: environment (stage | eu-stage | prod | eu-prod)
  - Select: tag (@smoke | @regression)
  - Run: 2 workers, 2 retries
  - Upload: playwright-report, allure-results, test-results artifacts

# hyperexecute.yml — Cloud-scale execution
steps:
  - 15 workers with autosplit (distributes test files automatically)
  - Stable build ID (all workers grouped in CI dashboard)
  - 90-minute global timeout
  - HyperExecute YAML with concurrency: 15

Backend Pipeline

# api-tests.yml — Manual trigger with suite selection
steps:
  - Step 1: TypeScript compile check (tsc --noEmit)
  - Step 2: Run by suite (smoke | regression | flows | all)
  - Upload: HTML report, JSON results, API timing summary

# The compile check in Step 1 is not optional.
# It catches more API contract bugs than the tests themselves.

The Numbers

| Metric                 | Frontend      | Backend             |
|------------------------|---------------|---------------------|
| Test specs             | 61            | 41                  |
| API endpoints covered  | —             | 192 (100% of v1)    |
| Feature areas          | 18            | 32 API clients      |
| Custom reporters       | 4             | 2                   |
| Environments supported | 4 × 2 regions | 6 configs           |
| Max parallel workers   | 15            | 10                  |
| Smoke run time         | ~2 minutes    | ~30 seconds         |
| Full regression        | ~15 minutes   | ~8 minutes          |
| Framework code (lines) | ~4,000        | ~6,000              |
| Schema drift detectors | —             | 100+ auto-generated |

Key Takeaways

These aren't principles I copied from a blog post. They're conclusions from building this specific system and watching what broke and what held:

1. Your base class should be thin. My BasePage has 3 methods. Everything else uses Playwright directly. Every additional abstraction layer is a maintenance burden you're betting will pay off. Most don't.

2. Fixtures beat beforeEach/afterEach every time. Playwright fixtures handle dependency injection, scoping, and automatic cleanup. They compose. They're reusable across files. beforeEach is a local solution to a global problem.

3. Separate test data from test logic. Faker factories with override patterns give you unique data with zero maintenance. Never hardcode test data. Never share test data between tests.

4. Design for parallel execution from day one. Worker-scoped fixtures, unique test data, LIFO cleanup — these aren't optimizations you add later. They're requirements. Retrofitting parallelism into a serial framework is painful.

5. Retry logic needs nuance. Not all errors are retryable. Not all retries should be silent. Build flaky test detection, not just retry counts.

6. One environment variable should switch everything. If changing environments requires modifying more than one config value, your setup is too fragile for real CI/CD usage.

7. TypeScript catches bugs before tests run. tsc --noEmit in CI caught more API contract issues than the actual tests did. Type safety is not a nice-to-have in test automation — it's a first-class quality gate.


What's Next

These two frameworks now run hundreds of tests every day across 4 environments, catching regressions before they reach our users.

But a framework is never finished. The next priorities are:

  • Visual regression testing for the product's AI-generated UI components

  • Contract testing between frontend and backend using Pact

  • Performance benchmarking with automated regression alerts at p99 latency thresholds

If you're building test automation infrastructure for a complex product and want to talk through architecture decisions — I'm always open to connecting.


Found This Useful?

If this breakdown helped you think differently about test automation architecture, share it with your team or QA community. Every framework starts with someone asking "how should I structure this?" — maybe this article gives you a starting point.

Follow me on LinkedIn for weekly posts on software engineering, open source, and developer tools: 👉 linkedin.com/in/ritikpal

Check out webguardx — my open-source web audit tool built with Playwright: 👉 npmjs.com/package/webguardx

Subscribe to this blog to get notified when I publish the next post: "How to set up web accessibility testing in your CI/CD pipeline"


Have questions about any of the patterns in this article? Drop them in the comments — I read every one.


#TestAutomation #Playwright #TypeScript #SDET #QAEngineering #OpenSource #SoftwareEngineering
