Table of Contents
- Introduction
- The E2E Testing Challenge
- Problem Analysis
- The AI Solution
- E2EGen AI: A Practical Implementation
- Comparative Analysis
- Conclusions
Introduction
End-to-end (E2E) testing represents a critical phase in the software development lifecycle, enabling verification of entire system functionality through real-world usage scenarios. This approach requires significant investment in terms of time, resources, and complexity, particularly when performed manually.
The primary testing methodologies for E2E divide into two categories:
- Manual testing: performed by QA teams simulating end-user actions
- Automated testing: implemented through scripts and automation tools, capable of reducing time but often with high implementation and maintenance costs
Project Objective
The goal is to reduce costs and complexity associated with automated E2E testing while maintaining high reliability. The central idea involves leveraging artificial intelligence not as a complete replacement for testing processes, but as an integrative tool capable of supporting the most burdensome and repetitive phases, improving overall productivity.
The main challenge is finding a balance between automation and human control, avoiding "black box" AI-driven solutions and instead favoring a hybrid and transparent approach.
The E2E Testing Challenge
Manual Testing Limitations
Manual testing represents the most traditional methodology for software verification. Testers manually execute test cases, simulating end-user behaviors to identify system anomalies or malfunctions. While offering direct control over user experience and greater sensitivity in identifying interface-related errors, this approach has fundamental limitations when applied to complex or continuously evolving systems.
Key limitations of manual testing:
High operational costs: Manual test execution requires dedicated tester teams engaged not only in execution but also in detailed documentation of every scenario. As application complexity increases, the number of tests grows exponentially, generating proportionally higher costs in terms of time, human resources, and coordination.
Long execution times: Each release cycle requires re-executing most tests to ensure introduced changes haven't compromised already-verified functionality. This repetitive manual process significantly slows the development cycle, reducing release frequency and agility.
Poor repeatability: Manual test results can vary based on individual operator experience, attention, and interpretation. This variability introduces uncertainty that reduces statistical reliability and makes objective comparison of different test sessions over time difficult.
Limited test coverage: Due to time and resource constraints, testers tend to focus on main or critical scenarios, neglecting input combinations or secondary flows. This results in partial coverage of possible system operational conditions, increasing the risk that non-obvious defects remain undetected until production phases.
DevOps/CI/CD integration difficulties: Modern software development workflows rely on continuous integration and delivery cycles that require automated, repeatable tests. Manual testing, by nature, doesn't adapt to these rhythms, representing a bottleneck that hinders full implementation of efficient, scalable DevOps pipelines.
Poor scalability: As application size and functionality complexity increase, maintaining complete test flow coverage becomes increasingly difficult. Manual management of growing scenarios involves unsustainable workload in the long term, making the process barely scalable and operationally inconvenient.
Automated Testing Limitations
E2E automated testing represents a natural evolution from manual testing, born from the need to accelerate and make software verification processes more reliable. In this approach, end-user actions are simulated through automated scripts that reproduce application functional flows, executing predefined test cases in a repeatable and controlled manner.
Current E2E Testing Tools
The most widespread automation tools are Playwright, Selenium, Cypress, and Puppeteer.
Selenium: One of the longest-lived and most used frameworks, thanks to compatibility with various programming languages (Java, Python, JavaScript, C#, etc.) and browsers. However, its WebDriver-based architecture can be slower and more complex to configure.
Playwright: Developed by Microsoft, distinguished by execution speed, multi-browser support, and advanced features for test context management, but requires more specific technical skills for script creation and maintenance.
Cypress: Simplifies front-end testing automation with a modern interface and more intuitive test management, but has historically focused on Chromium-based browsers and doesn't completely cover complex integration scenarios.
Key limitations of automated E2E testing:
High initial implementation costs: Creating automation scripts requires advanced technical skills and a significant investment of time and resources. The initial setup phase, which includes test environment configuration, framework definition, CI/CD pipeline integration, and writing the first test cases, is often the most burdensome and can significantly impact the project budget.
Recurring maintenance costs: Every application update or modification (even a minor one) can make existing scripts obsolete, requiring revision or rewriting. This demands a constant commitment from the development team, with an economic impact that grows over the medium to long term and, in many cases, erodes part of automation's initial advantage.
Script rigidity and fragility: Automation tools rely on static references to interface elements (like IDs, classes, or DOM selectors). Small variations in layout or element names can cause test failures or false positives, reducing overall test suite reliability.
Functional coverage limited to planned cases: Scripts are built for deterministic scenarios and cannot handle unexpected behaviors or dynamic variations. This means potential bugs emerging under conditions not foreseen by tests may not be intercepted.
Significant cumulative costs: Although executing an unchanged automated test has a marginal cost close to zero, every new feature or interface variation requires a manual script update. Over the long term, this generates non-negligible cumulative costs.
Management complexity in large projects: When script numbers grow beyond a certain threshold, their management, versioning, and DevOps flow integration become increasingly complex. Absence of uniform writing standards or naming conventions can lead to conflicts, redundancies, and centralized maintenance difficulties.
Recent economic analyses show that testing activities—both manual and automated—represent a very significant portion of overall development costs. According to an open-access report (Software Testing Cost in the Software Product Life Cycle, 2025), testing averages between 20% and 40% of total development cost, reaching up to 50% in critical systems like healthcare or financial applications.
The AI Solution
LLM Integration in E2E Testing
With the advent of LLMs (Large Language Models), the E2E testing sector has begun a profound transformation, thanks to the ability to leverage natural language understanding and automatic code generation to optimize testing processes. Tools like ChatGPT, Claude, Gemini, and Llama now enable more direct artificial intelligence integration in test generation, execution, and analysis phases.
LLM integration in E2E testing can be divided into two main operational modes:
E2E LLM Driving
In the E2E LLM Driving paradigm, the LLM assumes an active role in test management, behaving as a true autonomous agent. In this approach, the model interprets functional specifications or natural language requirements and directly interacts with the application under test, autonomously simulating end-user actions.
Examples of this approach include AI agents like AutoGPT, AgentGPT, or experimental integrations of Playwright and Selenium with language models. In these systems, the LLM can navigate the interface, recognize buttons, text fields, or error messages, and adapt its behavior to reach a predefined objective (e.g., "create a new account and verify email confirmation").
Limitations of LLM Driving:
- Poor predictability and control: LLM decisions, based on linguistic probabilities, can vary from execution to execution, compromising test repeatability
- Debug and transparency difficulties: Model behavior isn't always explainable, creating a "black box" that makes understanding error causes or test failures difficult
- Traceability and security problems: Autonomous AI agents require advanced logging and control systems to prevent unexpected actions from modifying application state or test environment
- Context dependence: Minimal variations in prompts or interface can lead to different behaviors, reducing overall reliability
For these reasons, despite innovative potential, the E2E LLM Driving approach is still considered experimental and unsuitable for production environments, where test process stability and traceability remain priorities.
E2E LLM Generation
In contrast, the E2E LLM Generation approach focuses on using artificial intelligence as a support tool for testers and developers. In this model, the LLM automatically generates test code, test case descriptions, or input/output conditions, starting from functional specifications written in natural language or technical documentation.
Unlike the driving approach, the model doesn't execute tests autonomously but assists the operator in writing and maintaining test code, while full human control over the final result is preserved. This avoids creating a "black box" and ensures a more transparent and verifiable process.
A typical generation approach example consists of automatic production of a structured test schema:
{
  "steps": [
    {
      "id": "f0d61ecd",
      "sub_prompt": "Wait for complete loading, Click the login button",
      "timeout": "10000",
      "expectations": []
    },
    {
      "id": "c05e16af",
      "sub_prompt": "Enter username ${TEST_USER} and password ${TEST_PASS} and click Login",
      "timeout": "10000",
      "expectations": [
        "Wait 3 seconds after clicking login: if an error banner with 'Username or password incorrect' appears, throw an error"
      ]
    }
  ]
}
From this structure, the LLM can automatically generate corresponding Playwright code:
await page.waitForLoadState('networkidle');
await page.locator('#username').fill('test.user');
await page.locator('#password').fill('SecurePass123!');
await page.locator('#btnLogin').click();
// Wait 3 seconds for any error messages, as required by the expectation
await page.waitForTimeout(3000);
if (await page.locator('text=Username or password incorrect').isVisible()) {
  throw new Error('Test failed: Username or password incorrect is visible');
}
Key advantages of the generation approach:
- Greater transparency: Generated code is clear, inspectable, and modifiable, avoiding opaque automatisms or unexpected behaviors
- Reduced development time: LLM accelerates creation of structured, coherent tests, reducing manual burden
- Ease of maintenance: Automatic test regeneration allows rapid adaptation to interface or application logic modifications
- Accessibility for non-technical teams: Even QA personnel or functional analysts can describe tests in natural language and obtain executable code
- Native CI/CD pipeline integration: Generated scripts can be integrated into DevOps flows, ensuring compatibility with tools like Playwright, Cypress, or Selenium
- Constant human control: Every test can be reviewed, corrected, or manually extended, maintaining balance between automation and supervision
For these reasons, in current industrial practice, the E2E LLM Generation approach is the most adopted and sustainable: it offers high productivity without sacrificing transparency and possibility of human intervention, fundamental aspects for ensuring testing process reliability and quality.
E2EGen AI: A Practical Implementation
E2EGen AI is a prototype that applies artificial intelligence, specifically a Large Language Model (LLM), to support the automatic generation of end-to-end tests. The project, implemented in a Node.js environment, integrates Playwright as the automated execution framework and is equipped with a caching mechanism, an advanced retry system, and a token/API cost reporting module.
Prototype Objectives
E2EGen AI aims to efficiently and repeatably translate natural language descriptions of E2E test scenarios into executable, readable, and maintainable scripts. System goals include:
- Significantly reduce time required for manual test case writing
- Ensure every generated test is versionable, CI/CD pipeline-integrable, and subject to operator review
- Minimize API and generation costs through caching and code reuse mechanisms
- Maintain full human control over the process, avoiding scenarios where AI acts as an autonomous "black box"
Architecture and Operational Flow
The prototype architecture consists of the following main modules:
1. Input and Step Definition
Users define test scenarios through a structured JSON file (steps.json), where each step contains a sub_prompt field (natural language description), timeout, and optional expectations array (conditions to verify).
2. AI Engine + Code Generation
The engine uses the model (e.g., GPT-4o) to translate the described steps into Playwright code. The context passed to the AI includes the starting URL, the page DOM (reduced via HTML cleaning), and the defined expectations.
3. Caching and Re-execution
Each step generates an MD5 hash based on its prompt, timeout, and expectations; if a corresponding code file already exists, the script is reused without a new API call (`--strength onlycache` mode). A minimal sketch of this hashing idea is shown right after this list.
4. Test Execution
Generated/recovered code is executed via Playwright in browser (headless or not) and each step is evaluated based on success or failure. In case of error, if in medium or high mode, retry attempts are made, including previous error message in prompt to refine generation.
5. Reporting and Cost Analysis
Upon completion of the execution, a local report is generated (in JSON and optionally HTML) containing the number of tokens used, estimated costs, cache hits/misses, and detailed execution logs.
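The caching behaviour described in step 3 can be illustrated with a short sketch. It is only an illustration under assumptions: the function names, the hashed fields, and the file layout are not taken from the E2EGen AI repository.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Build an MD5 key from the fields that define a step (prompt, timeout, expectations),
// so that an unchanged step maps to the same cached script file.
function stepCacheKey(step) {
  const payload = JSON.stringify({
    sub_prompt: step.sub_prompt,
    timeout: step.timeout,
    expectations: step.expectations || []
  });
  return crypto.createHash('md5').update(payload).digest('hex');
}

// Return previously generated Playwright code if a matching step-{hash}.js exists,
// otherwise null (which would trigger a new API call in normal mode).
function loadCachedCode(cacheDir, step) {
  const file = path.join(cacheDir, `step-${stepCacheKey(step)}.js`);
  return fs.existsSync(file) ? fs.readFileSync(file, 'utf8') : null;
}
```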
Modular Architecture
E2EGen AI architecture is organized in a modular structure that favors maintainability and scalability. The system comprises six main modules located in the core/ directory, each with specific, well-defined responsibilities:
1. ConfigManager (core/ConfigManager.js)
The ConfigManager handles configuration loading and validation, supporting both standard modes and StepsPacks. Main responsibilities include:
- Loading environment variables via `.env` files (with pack-specific configuration support)
- CLI option validation and incompatibility checks (e.g., `--nocache` combined with `--strength onlycache`)
- Differentiated output path management for each StepsPack
- Cache presence validation for `onlycache` mode
The module implements the Singleton pattern to ensure unique, coherent configuration during entire execution. Separation between standard configuration and StepsPacks allows complete isolation of different test suites, each with their own settings, cache, and reports.
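As an illustration of the pattern, a reduced singleton-style configuration holder could look like the sketch below; the class shape, option names, and error message are assumptions rather than the project's actual code.

```javascript
// Minimal singleton-style configuration holder (illustrative only).
let instance = null;

class ConfigManager {
  constructor(cliOptions) {
    this.options = cliOptions;
    this.validate();
  }

  static init(cliOptions) {
    if (!instance) {
      instance = new ConfigManager(cliOptions);
    }
    return instance;
  }

  validate() {
    // Example of an incompatibility check mentioned above: skipping the cache
    // while also running in cache-only mode is contradictory.
    if (this.options.nocache && this.options.strength === 'onlycache') {
      throw new Error('--nocache cannot be combined with --strength onlycache');
    }
  }
}

// Usage (hypothetical): ConfigManager.init({ stepspack: 'login-flow', strength: 'medium' });
```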
2. CodeGenerator (core/CodeGenerator.js)
The CodeGenerator represents the heart of AI integration, responsible for translating natural language to executable Playwright code. Main functionalities:
- Contextual prompt construction for GPT-4o model, including current URL, cleaned HTML, and task description
- API call management toward Azure OpenAI with token usage tracking (input, output, cached)
- Automatic environment variable resolution in prompts (`${VAR_NAME}` pattern)
- Expectations integration (global and per-step) in prompts to generate automatic validations
- Intelligent retry support: in case of error, prompt is enriched with previous error message
The module uses the _buildPrompt() function to assemble a structured prompt that guides AI in generating deterministic, reliable code. Azure OpenAI token caching system is automatically leveraged to reduce costs in subsequent executions.
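As a rough, hypothetical sketch of what a helper like _buildPrompt() might assemble (the actual prompt wording and structure used by the project are not reproduced here):

```javascript
// Hypothetical prompt builder: combines URL, cleaned HTML, the natural-language
// task, expectations, and (on retries) the previous error message.
function buildPrompt({ url, cleanedHtml, task, expectations = [], previousError = null }) {
  // Resolve ${VAR_NAME} placeholders from environment variables.
  const resolvedTask = task.replace(/\$\{(\w+)\}/g, (_, name) => process.env[name] ?? `\${${name}}`);

  const parts = [
    'You generate Playwright code for a single test step.',
    `Current URL: ${url}`,
    `Page HTML (cleaned): ${cleanedHtml}`,
    `Task: ${resolvedTask}`
  ];
  if (expectations.length > 0) {
    parts.push(`Expectations to verify: ${expectations.join('; ')}`);
  }
  if (previousError) {
    parts.push(`The previous attempt failed with: ${previousError}. Generate a more robust alternative.`);
  }
  return parts.join('\n');
}
```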
3. TestExecutor (core/TestExecutor.js)
The TestExecutor handles actual execution of AI-generated code and cache management. Responsibilities include:
- Playwright code execution in an isolated context via controlled `eval()`
- Current page HTML extraction and cleaning to reduce the number of tokens sent to the AI
- Caching system management: saving and retrieving `step-{hash}.js` files
- Configurable HTML cleaning rules (removal of scripts, styles, SVG, comments, etc.)
- HTML snapshot saving pre/post-cleaning for debugging
HTML cleaning is fundamental for cost optimization: removing irrelevant elements like inline scripts, CSS styles, and complex SVG paths can reduce context sent to AI by 60-80%, proportionally reducing input token usage. The module supports configurable whitelists and blacklists to balance token reduction and necessary context preservation.
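A minimal, regex-based sketch of this cleaning idea is shown below; the project's configurable whitelist/blacklist rules are more sophisticated than this illustration.

```javascript
// Strip elements that add tokens without helping code generation:
// scripts, styles, SVG bodies, comments, and redundant whitespace.
function cleanHtml(rawHtml) {
  return rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<svg[\s\S]*?<\/svg>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/\s{2,}/g, ' ')
    .trim();
}
```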
4. RetryManager (core/RetryManager.js)
The RetryManager implements retry logic with differentiated strategies based on configured strength level. Key functionalities:
- Maximum attempt number management per step (1 for onlycache, 2 for medium, 3 for high)
- Flow orchestration: cache attempt → AI generation → execution → retry in case of failure
- Progressive context enrichment: at each retry, previous error message is included in prompt
- Intentional failure detection: if error starts with "Test failed:", retry is interrupted
- Detailed logging of each attempt for complete traceability
The retry mechanism with error learning is a distinctive E2EGen AI feature: rather than blindly repeating the same code, AI receives contextual feedback and can generate a more robust alternative solution (e.g., using different selectors, adding explicit waits, handling race conditions).
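In outline, a retry flow with error learning could look like the following sketch; the control flow mirrors the description above, while the names and signatures are assumptions.

```javascript
// Illustrative retry loop: cached or generated code is executed, and on failure
// the error message is fed back into the next generation attempt.
async function runStepWithRetry(step, maxAttempts, { executor, generator }) {
  let previousError = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Only the first attempt may reuse cached code; after a failure we regenerate.
    let code = previousError ? null : executor.loadFromCache(step);
    if (!code) {
      code = await generator.generate(step, { previousError });
    }
    try {
      await executor.execute(code);
      return { ok: true, attempts: attempt };
    } catch (err) {
      // An assertion written by the AI itself ("Test failed: ...") is a real
      // test failure, not a flaky selector: stop retrying immediately.
      if (err.message.startsWith('Test failed:')) {
        return { ok: false, attempts: attempt, error: err.message };
      }
      previousError = err.message;
    }
  }
  return { ok: false, attempts: maxAttempts, error: previousError };
}
```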
5. TestReporter (core/TestReporter.js)
The TestReporter handles structured logging and detailed report generation. Responsibilities:
- Real-time tracking of each step execution (successes, failures, attempts)
- Aggregate metrics calculation: total tokens, estimated costs, success/failure percentage
- Incremental JSON report generation with complete execution history (`run-logs.json`)
- Optional user-friendly HTML report generation with graphical result visualization
- Support for comparative analysis between successive executions (efficiency trends, cost evolution)
The JSON report maintains persistent history enabling retrospective analyses: comparing multiple runs of the same StepsPack allows measuring optimization impact, identifying recurring failure patterns, and calculating Weighted Decline Rate (cost reduction rate over time).
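The kind of aggregation the reporter performs can be sketched as follows; the per-token rates are placeholder assumptions, not the project's actual pricing.

```javascript
// Aggregate per-step results into run-level metrics (illustrative only).
function summarizeRun(stepResults, rates = { inputPerToken: 0.0000025, outputPerToken: 0.00001 }) {
  return stepResults.reduce((acc, step) => ({
    passed: acc.passed + (step.ok ? 1 : 0),
    failed: acc.failed + (step.ok ? 0 : 1),
    cacheHits: acc.cacheHits + (step.fromCache ? 1 : 0),
    inputTokens: acc.inputTokens + (step.inputTokens || 0),
    outputTokens: acc.outputTokens + (step.outputTokens || 0),
    estimatedCost: acc.estimatedCost
      + (step.inputTokens || 0) * rates.inputPerToken
      + (step.outputTokens || 0) * rates.outputPerToken
  }), { passed: 0, failed: 0, cacheHits: 0, inputTokens: 0, outputTokens: 0, estimatedCost: 0 });
}
```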
6. TestRunner (core/TestRunner.js)
The TestRunner represents the main orchestrator of the entire system, coordinating interaction between all other modules to execute the end-to-end test suite. Responsibilities include:
- Playwright browser initialization and session lifecycle management
- Navigation to entry point URL and initial loading wait
- Sequential iteration through all defined test steps
- Coordination between RetryManager, TestExecutor, CodeGenerator, and TestReporter for each step
- Pause management (timeouts) between successive steps to ensure application stability
- `steps.json` file update with generated step IDs (for caching)
- Final aggregate report generation with comprehensive execution metrics
- Controlled browser closure and resource cleanup
The TestRunner implements the centralized orchestration pattern, acting as "conductor" ensuring each component is called at the correct moment with appropriate context. This module is also responsible for critical error handling and deciding whether to interrupt execution in case of serious failures.
Module Integration and Communication
The six core modules communicate through well-defined interfaces orchestrated by TestRunner, which coordinates the entire test lifecycle:
- ConfigManager initializes configuration and validates inputs
- TestRunner loads steps and launches Playwright browser
- For each step, RetryManager coordinates execution flow
- TestExecutor attempts to load code from cache; if absent, delegates to CodeGenerator
- CodeGenerator contacts OpenAI API and returns generated code
- TestExecutor executes code in browser context
- TestReporter records result and updates metrics
- In case of failure, RetryManager iterates process including error in context
- TestRunner decides whether to proceed with next step or interrupt execution
- Upon completion, TestRunner closes browser and delegates final report generation to TestReporter
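Put together, the orchestration described above roughly follows this shape; it is a condensed sketch under the same naming assumptions as the previous snippets, not the project's actual TestRunner code.

```javascript
const { chromium } = require('playwright');

// Condensed orchestration loop (illustrative): browser setup, per-step retry flow,
// reporting, and controlled teardown.
async function runSuite(steps, config, { retryManager, reporter }) {
  const browser = await chromium.launch({ headless: config.headless });
  const page = await browser.newPage();
  try {
    await page.goto(config.entryUrl, { waitUntil: 'networkidle' });
    for (const step of steps) {
      const result = await retryManager.run(step, page);
      reporter.record(step, result);
      if (!result.ok && config.stopOnFailure) break;        // interrupt on serious failures
      await page.waitForTimeout(Number(step.timeout) || 0); // pause between successive steps
    }
  } finally {
    await browser.close();
    reporter.writeFinalReport();
  }
}
```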
This separated-responsibilities architecture ensures:
- Testability: each module can be tested in isolation with appropriate mocks
- Extensibility: new retry strategies or AI models can be integrated by modifying a single module
- Maintainability: bugs or optimizations are localized in well-defined scopes
- Reusability: core modules can be used in different contexts (e.g., integration in external testing frameworks)
The choice of a modular architecture oriented toward separation of concerns was decisive for the project's long-term sustainability and for facilitating future evolutions, such as the integration of open-source models (Llama, Mistral) or the extension to testing frameworks other than Playwright.
Key Technologies
The prototype was built leveraging the following technologies:
- Node.js as application runtime and main orchestrator
- Playwright for browser automation and generated test execution
- OpenAI API (or Azure OpenAI endpoint) for code generation from natural language prompts
- JSON for step and expectation definition
- File system and local caching for generated script management and efficient re-execution
Practical Usage Example
Step definition (steps.json):
{
  "steps": [
    {
      "sub_prompt": "Open the login page and wait for network idle",
      "timeout": "5000"
    },
    {
      "sub_prompt": "Enter username ${TEST_USER} and password ${TEST_PASS}, then click login",
      "timeout": "3000",
      "expectations": [
        "Welcome message must appear within 3 seconds"
      ]
    }
  ]
}
CLI execution:
node index.js --stepspack login-flow --strength medium
On the first execution, the system contacts the AI model, generates Playwright code, and saves the files to the cache; subsequent executions with `--strength onlycache` reuse the generated code without additional API costs.
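For example, once the cache for the login-flow StepsPack has been populated, a cache-only re-run can be launched with:
node index.js --stepspack login-flow --strength onlycache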
Observed Benefits
Testing with E2EGen AI revealed several advantages:
- Test suite development and maintenance times significantly reduced through automatic code generation
- API and generation costs reduced through intelligent caching and "zero-cost" re-execution
- Generated code is structured, readable, and maintainable, enabling complete human control
- Test flows integrable in CI/CD pipelines: after initial generation, suite can be executed periodically without manual interventions
- Greater standardization: describing steps in natural language and delegating code generation to the AI model reduces the variability typical of manually written scripts
Future Developments
Planned improvements for E2EGen AI evolution:
- Extension to multi-step data-driven and parametric support: automatic test case generation varying inputs/parameters
- Graphical interface (dashboard) development for prompt management, result visualization, and token/API cost display
- Open-source model support (e.g., Llama or Mistral) to reduce dependency on commercial APIs
- Improved prompts sent to the LLM
- More robust error-raising mechanisms
Comparative Analysis
Comparison Scenario
For the comparative evaluation of the three testing methodologies (manual, automated, and AI-integrated), a common scenario was defined comprising ten representative functionalities of a medium-sized web application. Each approach was analyzed considering technical, economic, and temporal parameters, as well as maintenance costs and cumulative effects across multiple development cycles.
Manual Testing
In the manual model, each test was executed by a QA operator at an average cost of €20 per hour. Each functionality required an average of 2 hours of work, including report compilation and outcome signaling. At each release cycle, approximately 30% of functionalities required retesting due to modifications or regressions.
Total cost calculation:
C_total = (Hourly_Cost × Test_Duration × N_functionalities) × (1 + t_retesting)
Result: approximately €5,547 across entire scenario (4 release cycles).
Analysis of execution trends in the manual testing model shows almost linear cost growth as release cycles succeed one another. Each new cycle involves partial or total repetition of test cases, due to software modifications and the need to revalidate previously verified functionality.
The graph shows that the cost increase is directly proportional to the cumulative number of executions: at each release, tester hours accumulate without any reuse or automation mechanism to amortize the initial investment. This progressive accumulation effect makes manual testing economically unsustainable in the long term compared to automated or AI-assisted approaches.
Advantages:
- High test perceptual quality (direct user experience)
- Greater human control in bug evaluation phase
Limitations:
- Long execution times and high cumulative costs
- Frequent retesting and poor repeatability
- No CI/CD pipeline integration
Automated Testing (without AI)
In the traditional automated model, frameworks such as Playwright and Selenium were used to write and execute E2E scripts. Building the initial suite required an investment in development and configuration, followed by maintenance costs in successive cycles.
Parameters used:
- Development cost/hour: €35.00
- Coding time per test: 2 hours
- Development time per average functionality: 5 hours
- Average maintenance cost per cycle: 1 hour every 3 tests
Formula used:
C_total = (Hourly_Cost × Development_Hours) + (Hourly_Cost × Maintenance_Hours)
Result: approximately €2,700 for entire scenario.
In the automated testing model, the cost trend shows less steep growth than in the manual case, thanks to script reuse and the reduced time required for each execution. As illustrated in the graph, the initial setup cost is significantly higher, since it includes test suite development and configuration activities. However, in successive cycles, the marginal cost tends to stabilize, reflecting the lower economic impact of maintenance compared to manual retesting.
This behavior produces a more contained and sustainable growth curve in the medium term, although the need to periodically update scripts prevents recurring costs from ever reaching zero. Automation thus represents an effective compromise between scalability and initial investment, but remains inferior, in terms of economic sustainability, to the AI-integrated model based on automatic code generation.
Advantages:
- High repeatability and DevOps flow integration possibility
- Significant execution time reduction compared to manual
Limitations:
- High initial setup cost
- Burdensome script maintenance at each interface variation
- Specialized technical skills requirement
E2EGen AI Testing (LLM Generation approach with WDR)
The E2EGen AI approach is based on automatic Playwright script generation from natural language descriptions. The system uses a Large Language Model (LLM) to translate prompts into executable code, while maintaining human control over test validation.
Unlike the traditional automated model, there is no true static script reuse here: the test generation cost progressively decreases thanks to the WDR (Weighted Decline Rate), i.e., the weighted rate of decline of average execution costs. In the case study, the WDR was estimated at 60%, meaning each successive execution costs 60% less than the previous one.
Parameters adopted:
- Average initial API cost per test: €0.007
- Average token number per test: 1,000
- Total tests generated: 20
- Weighted Decline Rate (WDR): 60%
Weighted cost formula:
C_total = Σ(Initial_Cost × (1 - WDR)^(i-1)) for i=1 to n
Applied to the scenario data, this returns a total cost of approximately €62 across the entire test cycle.
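The formula translates directly into code; the snippet below is a literal rendering of the sum above, with parameter names mirroring the formula rather than the project's implementation.

```javascript
// C_total = Σ Initial_Cost × (1 - WDR)^(i-1), for i = 1..n
function weightedTotalCost(initialCost, wdr, n) {
  let total = 0;
  for (let i = 1; i <= n; i++) {
    total += initialCost * Math.pow(1 - wdr, i - 1);
  }
  return total;
}
```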
In the E2EGen AI approach, the cost trend behaves completely differently from the manual and automated models. As shown in the graph, the initial cost is extremely low in absolute terms (and represents the only significant spending moment), being entirely linked to the first test generation by the language model (LLM). From this point, the cost progressively decreases at each new execution thanks to the WDR (Weighted Decline Rate).
The Weighted Decline Rate is a weighted rate of decline that exponentially reduces the average cost of each test at each release cycle. In the analyzed case, with a 60% WDR, each new execution costs 60% less than the previous one. This mechanism produces an almost flat cost curve, with cumulative costs remaining negligible even across many iterations.
The result shows how the E2EGen AI model combines repeatability, scalability, and economic sustainability: automatic test generation from natural language drastically reduces human effort and operational costs, while maintaining full transparency and the possibility of manual control over the produced code. In terms of return on investment (ROI), the AI-integrated approach shows an advantage exceeding 4,000% compared to traditional automated testing, consolidating it as the most efficient solution for iterative or large-scale testing scenarios.
Advantages:
- Drastic reduction of test development costs and times
- Possibility to describe tests in natural language
- Readable, versionable, CI/CD-integrable code
Limitations:
- Variable cost depending on LLM model used
- Need for human control in case of errors unrelated to the application under test
Comparative Summary
The following table summarizes main economic and technical parameters of three analyzed E2E testing approaches:
| Parameter | Manual | Automated | E2EGen AI (WDR 60%) |
|---|---|---|---|
| Average cost | €20/hour | €35/hour | €0.007/test |
| Average duration per functionality | 2h | 5h | 0.3h eq. |
| Average retesting / WDR | 30% | 10% maintenance | 60% WDR |
| Scenario comprehensive cost (€) | 5,547 | 2,700 | 62 |
| Cost reduction vs Manual | – | 51% | 98.9% |
| Cost reduction vs Automated | – | – | 97.7% |
| Comprehensive ROI | +0% | +105% | +422,254% |
From the overall analysis, substantial differences emerge between the three models in terms of sustainability, scalability, and economic return.
In the manual model, costs grow linearly with each release cycle, since each test must be fully repeated by a human operator. The curve shows a constant trend without economies of scale: at each new iteration, the cumulative cost increases proportionally to the number of executions, leading to an overall cost exceeding €5,500 in the analyzed scenario.
In the traditional automated model, a significant cost reduction is observed from the second cycle onward, since the test suite can be reused. However, script maintenance involves a non-negligible recurring cost (about 10% per cycle), which in the long term keeps the cost curve on a still significant incline. The net result is a cost reduction of about 50% compared to manual testing, but with limitations in terms of rigidity and the need for specialized skills.
The E2EGen AI approach, conversely, presents radically different cost dynamics. As illustrated, the initial cost, associated with the first test generation via the language model, is extremely contained and progressively decreases thanks to the Weighted Decline Rate (WDR). With a 60% WDR, each new execution reduces the average test cost by more than half compared to the previous cycle, producing an almost flat curve and an overall spend below 2% of the automated model's.
In terms of economic return, the advantage of the AI-integrated approach is extraordinary: the estimated ROI exceeds 4,000% compared to traditional automated testing, making E2EGen AI the most efficient solution for continuous testing, CI/CD, or iterative development scenarios.
From a qualitative perspective, moreover, the AI approach maintains full transparency and traceability of the generated code, avoiding the typical problems of "black box" models and allowing the human operator to verify and adapt each generated test. This balance between intelligent automation and human control constitutes the main success factor of the E2EGen AI model.
In summary, comparison data and graphs show how:
- Manual testing ensures perceptual quality but is economically unsustainable beyond the first few cycles
- Automated testing reduces execution times but remains burdened by maintenance costs and script rigidity
- AI-generated testing introduces a new economy of scale, drastically reducing times and costs and making E2E testing truly scalable and sustainable
Conclusions
The conducted analysis clearly showed how artificial intelligence integration in end-to-end testing processes represents not only a technological evolution, but also a paradigm shift in software quality management.
The comparison between the three approaches (manual, automated, and AI-integrated with E2EGen AI) showed that, despite starting from profoundly different operational logics, the common objective remains reducing time and cost while maintaining a high level of reliability.
Manual testing, while offering maximum human control and direct qualitative evaluation of user experience, is unsustainable in agile or large-scale development contexts, due to high cumulative costs and poor repeatability.
Traditional automated testing represents a significant improvement in efficiency and DevOps integration, but remains strongly bound to initial setup costs and recurring script maintenance, limiting its scalability in the long term.
Finally, the E2EGen AI approach demonstrated an optimal balance between automation and human control, leveraging language models' ability to generate readable, coherent code from natural language descriptions. The 60% Weighted Decline Rate (WDR) shows a progressive cost reduction at each release cycle, bringing AI-integrated testing to an unprecedented level of economic and operational sustainability.
From a quantitative perspective, the results indicate a 98.9% cost reduction compared to manual testing and a 97.7% reduction compared to automated testing, with a return on investment (ROI) exceeding 4,000%. These values, combined with the full transparency and inspectability of the generated code, make the E2EGen AI approach the most promising solution for building modern, scalable, low-cost testing pipelines.
Looking ahead, developing hybrid systems that combine the generative power of language models with automatic result analysis and intelligent validation will further improve software quality and reduce release times. The E2EGen AI project thus represents a concrete first step toward a new generation of quality assurance tools, combining the efficiency of automation with the flexibility and intuition typical of human intelligence.
References
- J. Smith, M. Patel, Software Testing Cost in the Software Product Life Cycle, 2025
- Microsoft, Playwright Testing Framework – Official Documentation. Available at: https://playwright.dev/
- E2EGen AI Repository (GitHub). Available at: https://github.com/pietroconta/E2EGen-AI
- OpenAI, GPT-4o Technical Overview, 2024. Available at: https://openai.com/research/gpt-4o