
I Built an AI-Powered Test Data Generator That Analyzes Any URL and Creates Test Data JSON

I got tired of manually inspecting HTML to find selectors. So I taught my framework to do it instead.


[Image: Architecture flow]

Here’s a question that kept me up at night:

Why am I spending more time finding selectors than writing actual tests?

I watched myself burn 30 minutes on a simple login test — not writing the test itself, but hunting through DevTools for the right selectors, creating fixture files, and crafting test data that would actually work.

What if the framework could just… look at the page and figure it out?

The Problem Nobody Talks About

Here’s the dirty secret of test automation: writing the actual test is the easy part.

The hard part? Finding #username vs input[name="user"] vs .login-field. Creating realistic test data. Building fixture files that match the actual form structure.

Every new page means:

Open DevTools

Inspect elements

Copy selectors

Hope they’re stable

Create JSON fixtures

Hope nothing changes tomorrow

Most “AI-powered” testing tools focus on running tests or analyzing failures. But what about the beginning — the tedious setup that drains your time before you write a single assertion?

The Experiment: Teaching AI to See

The idea was simple but audacious: give the AI a URL and let it figure out everything else.

Not mock data. Not hardcoded selectors. Real selectors from real HTML.

Here’s what I wanted:

python qa_automation.py "Test login" --url https://the-internet.herokuapp.com/login

And the framework should:

Fetch the actual page

Analyze the HTML structure

Extract real, working selectors

Generate meaningful test cases

Save everything as a Cypress fixture

Then generate tests that use that data

Sounds impossible? I thought so too.

How It Actually Works

The magic happens in about 50 lines of Python:

import json

import requests
from langchain_openai import ChatOpenAI

def generate_test_data_from_url(url: str, requirements: list) -> dict:
    # Step 1: Fetch the real page
    resp = requests.get(url, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
    html = resp.text[:5000]  # First 5KB is usually enough for a form page

    # Step 2: Ask AI to analyze it
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    prompt = f"""Analyze this HTML and generate test data.

    URL: {url}
    HTML: {html}
    Requirements: {requirements}

    Return JSON with:
    - Real selectors from the HTML
    - Valid test case with working data
    - Invalid test case for error handling
    """

    # Step 3: Parse and save as fixture
    raw = llm.invoke(prompt).content
    # Models sometimes wrap JSON in markdown fences; strip them before parsing
    raw = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    test_data = json.loads(raw)

    with open("cypress/fixtures/url_test_data.json", 'w') as f:
        json.dump(test_data, f, indent=2)

    return test_data

The AI doesn’t guess. It reads the actual HTML and extracts what’s really there.
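
Calling it directly looks like this (a quick usage sketch; the printed values depend on what the model actually extracts):

# Minimal usage sketch: build a fixture for the demo login page.
# Assumes OPENAI_API_KEY is set in the environment.
data = generate_test_data_from_url(
    "https://the-internet.herokuapp.com/login",
    requirements=["Test login"],
)

print(data["selectors"])        # e.g. {'username': '#username', ...}
print(len(data["test_cases"]))  # typically 2: one valid, one invalid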


[Image: Complete Workflow]

What The AI Sees vs What It Returns

When I point it at a login page, here’s the actual flow:

Input: Just a URL

--url https://the-internet.herokuapp.com/login

What the AI analyzes:

<input type="text" id="username" name="username">
<input type="password" id="password" name="password">
<button type="submit" class="radius">Login</button>

What it generates:

{
  "url": "https://the-internet.herokuapp.com/login",
  "selectors": {
    "username": "#username",
    "password": "#password",
    "submit": "button[type='submit']"
  },
  "test_cases": [
    {
      "name": "valid_test",
      "username": "tomsmith",
      "password": "SuperSecretPassword!",
      "expected": "success"
    },
    {
      "name": "invalid_test", 
      "username": "wronguser",
      "password": "badpassword",
      "expected": "error"
    }
  ]
}

Real selectors. Actual test data. Zero manual inspection.

The Generated Test Uses It All

The framework then generates a Cypress test that consumes this fixture:

describe('Login Tests', function () {
    beforeEach(function () {
        cy.fixture('url_test_data').then((data) => {
            this.testData = data;
        });
    });

    it('should login with valid credentials', function () {
        cy.visit(this.testData.url);
        const valid = this.testData.test_cases.find(tc => tc.name === 'valid_test');

        cy.get(this.testData.selectors.username).type(valid.username);
        cy.get(this.testData.selectors.password).type(valid.password);
        cy.get(this.testData.selectors.submit).click();

        cy.url().should('include', '/secure');
    });

    it('should show error with invalid credentials', function () {
        cy.visit(this.testData.url);
        const invalid = this.testData.test_cases.find(tc => tc.name === 'invalid_test');

        cy.get(this.testData.selectors.username).type(invalid.username);
        cy.get(this.testData.selectors.password).type(invalid.password);
        cy.get(this.testData.selectors.submit).click();

        cy.get('#flash').should('contain', 'invalid');
    });
});

Notice something? The selectors come from the fixture, not hardcoded in the test.

If the page changes, update the fixture. Tests stay clean.

Two Ways to Feed Data

Sometimes you already have test data. Maybe from a previous run. Maybe from your team’s shared fixtures.

So I added a second option:

# Option 1: AI analyzes live URL
python qa_automation.py "Test login" --url https://example.com/login

# Option 2: Use existing JSON file
python qa_automation.py "Test login" --data cypress/fixtures/my_data.json

Same test generation. Different data sources. Your choice.
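
Wiring up the two flags doesn't take much. Here's a minimal sketch of how the entry point might branch, reusing generate_test_data_from_url from earlier (the argparse names are illustrative, not the script's exact internals):

import argparse
import json

parser = argparse.ArgumentParser(description="Generate Cypress tests from natural language")
parser.add_argument("requirement", help="plain-language description, e.g. 'Test login'")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("--url", help="live page for the AI to analyze")
source.add_argument("--data", help="path to an existing fixture JSON")
args = parser.parse_args()

if args.url:
    # Fresh analysis: fetch the page and let the LLM build the fixture
    test_data = generate_test_data_from_url(args.url, [args.requirement])
else:
    # Reuse existing data: skip the fetch-and-analyze step entirely
    with open(args.data) as f:
        test_data = json.load(f)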

The Part That Surprised Me

I expected the AI to find basic selectors. What I didn’t expect was how well it understood context.

When analyzing a registration form, it didn’t just find #email — it generated test data like:

Valid: testuser@example.com

Invalid: not-an-email

For password fields:

Valid: SecurePass123!

Invalid: 123 (too short)

The AI understood what kind of data each field expected. Not because I told it — because it read the HTML attributes, labels, and validation patterns.
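
You can see those same signals yourself without an LLM. A quick sketch with BeautifulSoup (not part of the framework, purely to show what the HTML exposes) dumps the attribute hints a model would pick up on:

# Illustrative only: print the validation hints each input field exposes.
# Assumes: pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup

html = requests.get("https://the-internet.herokuapp.com/login", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for field in soup.find_all("input"):
    # type, required, pattern, and minlength are exactly the attributes
    # that tell an analyzer what data a field expects
    print(field.get("id"), field.get("type"), field.get("required"),
          field.get("pattern"), field.get("minlength"))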

The Gotcha: Fixtures Need function() Syntax

One thing tripped me up for hours. Cypress fixtures with this.testData require a specific pattern:

// WRONG - arrow functions don't get Mocha's test context as 'this'
describe('Test', () => {
    beforeEach(() => {
        cy.fixture('data').then((d) => { this.testData = d; }); // undefined!
    });
});

// RIGHT - function() receives Mocha's test context as 'this'
describe('Test', function () {
    beforeEach(function () {
        cy.fixture('data').then((data) => { this.testData = data; });
    });

    it('works', function () {
        console.log(this.testData); // actual data!
    });
});

The framework now enforces this pattern in generated tests. Lesson learned the hard way.
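
The enforcement itself can be a simple lint pass over the generated spec. A hypothetical check (the file path and function name are illustrative, not the framework's actual code):

import re

# Rough heuristic: flag any describe/it/beforeEach that takes an arrow function,
# since 'this.testData' would be undefined inside it
ARROW_CALLBACK = re.compile(r"\b(describe|it|beforeEach)\(.*=>")

def uses_arrow_callbacks(spec_source: str) -> bool:
    return any(ARROW_CALLBACK.search(line) for line in spec_source.splitlines())

with open("cypress/e2e/generated/login.cy.js") as f:  # hypothetical path
    if uses_arrow_callbacks(f.read()):
        raise ValueError("Generated spec uses arrow callbacks; use function() instead")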

What This Means For Your Workflow

Before:

Open page in browser

Inspect elements manually

Copy selectors to notepad

Create fixture JSON by hand

Write test using those selectors

Fix typos in selectors

Run test

Debug why selectors don’t work

After:

Run one command with URL

Framework handles the rest

That’s not an exaggeration. The 30-minute login test? Under 2 minutes now.

Try It Yourself

The framework is open source:

git clone https://github.com/aiqualitylab/cypress-natural-language-tests.git
cd cypress-natural-language-tests
pip install -r requirements.txt

Set your API keys:

export OPENAI_API_KEY=your_key_here
export OPENROUTER_API_KEY=your_openrouter_api_key_here

Generate tests from any URL:

python qa_automation.py "Test the login form" --url https://the-internet.herokuapp.com/login

Check what it created:

cat cypress/fixtures/url_test_data.json
cat cypress/e2e/generated/*.cy.js

The Bigger Picture

We’re at an interesting moment in test automation. The tooling is getting smarter, but the real breakthrough isn’t replacing testers — it’s eliminating the tedious parts.

Finding selectors is tedious. Creating fixture files is tedious. Debugging why #submit-btn worked yesterday but not today is tedious.

Let AI handle the tedious. Let humans handle the important.

That’s the framework I’m building.

Follow for more AI + QA experiments:

GitHub: https://github.com/aiqualitylab/cypress-natural-language-tests.git
