DEV Community: Yaniv

Why I Built an AI-Powered Test Data Generator (and When You Shouldn't Use AI for Fixtures)

Yaniv — Fri, 17 Apr 2026 19:11:03 +0000

Every test suite has the same dirty secret: name="Test User", email="test@test.com", bio="Lorem ipsum". Copy-pasted across 50 tests, never catching real edge cases, never feeling like production data.

I built FixtureForge to fix this — but along the way, I learned that AI is the wrong tool for most of the problem. Here's what I mean.

The Problem With "Just Use Faker"

Faker is great for structured fields — names, emails, phone numbers, addresses. But it can't generate a realistic user bio, a convincing product review, or an angry customer complaint that actually tests your edge cases.

# This is what most test data looks like:
user = User(name="Test User", email="test@test.com", bio="Lorem ipsum...")

# It doesn't catch real-world edge cases.
# It doesn't feel like production data.
# Writing 500 of them by hand? Not happening.

The obvious answer in 2026 is "use AI." But sending every field to an LLM is expensive, slow, and unnecessary. An email address doesn't need AI. An auto-incrementing ID definitely doesn't need AI.

The Insight: Only Semantic Fields Need AI

FixtureForge splits every model field into four tiers:

Tier	Examples	Generator	API Cost
Structural	`id`, `user_id`, `created_at`	Internal counters / FK registry	Free
Standard	`name`, `email`, `phone`	Faker	Free
Computed	`@computed_field` properties	Pydantic	Free
Semantic	`bio`, `description`, `review`	LLM (batched)	API tokens
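
To make the tiers concrete, here is a minimal sketch of how a field could be routed by name. The name sets and the classify_field helper are hypothetical and only illustrate the idea; FixtureForge's actual routing rules may differ.

```python
# Hypothetical routing table: these name sets are assumptions for illustration.
STRUCTURAL = {"id", "created_at", "updated_at"}
STANDARD = {"name", "email", "phone", "address"}
SEMANTIC = {"bio", "description", "review", "message"}


def classify_field(field_name: str, is_computed: bool = False) -> str:
    """Return the tier a field would be routed to."""
    if is_computed:
        return "computed"      # derived by the model itself, nothing to generate
    if field_name in STRUCTURAL or field_name.endswith("_id"):
        return "structural"    # counters and foreign keys
    if field_name in STANDARD:
        return "standard"      # Faker-style generators
    if field_name in SEMANTIC:
        return "semantic"      # the only tier that costs API tokens
    return "standard"          # unknown fields default to the free path


for name in ["id", "user_id", "email", "bio"]:
    print(name, "->", classify_field(name))
```

Defaulting unknown fields to the free path is a design choice in this sketch: a misclassified field then costs realism, not money.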

The key: 100 users with 2 semantic fields = 2 API calls, not 200. FixtureForge batches all semantic values into a single prompt and asks the LLM to return a JSON array.

from fixtureforge import Forge
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str
    email: str
    bio: str

forge = Forge()
users = forge.create_batch(User, count=50, context="SaaS platform users")

FixtureForge routes id to a counter, name and email to Faker, and only bio hits the AI — once, for all 50 records.
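
The batching idea can be sketched in a few lines. This is not FixtureForge's internal code: generate_semantic_batch and fake_llm are hypothetical names, and the fake stands in for a real provider so the example runs offline.

```python
import json
from typing import Callable, List


def generate_semantic_batch(
    field: str, count: int, context: str, llm: Callable[[str], str]
) -> List[str]:
    """Ask for `count` values of one field in a single prompt."""
    prompt = (
        f"Return a JSON array of {count} realistic '{field}' values "
        f"for: {context}"
    )
    values = json.loads(llm(prompt))   # one call covers the whole batch
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    return values


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider; records every call it receives."""
    calls.append(prompt)
    return json.dumps([f"sample text {i}" for i in range(100)])


bios = generate_semantic_batch("bio", 100, "SaaS platform users", fake_llm)
reviews = generate_semantic_batch("review", 100, "SaaS platform users", fake_llm)
print(len(bios), len(reviews), len(calls))  # 100 100 2
```

Two semantic fields across 100 records produce two calls. The length check matters in practice, because a model can return fewer items than requested.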

CI Mode: No AI, No Network, No Flakiness

This is the part that matters most. In CI, you don't want non-deterministic AI calls making your pipeline flaky. FixtureForge has a deterministic mode:

forge = Forge(use_ai=False, seed=42)
users = forge.create_batch(User, count=100)
# Identical output every run — no network calls

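With seed=42 the output is identical on every run and every machine, provided the FixtureForge and Faker versions are pinned (Faker's seeded output can change between releases). Faker handles the standard fields deterministically, and semantic fields fall back to template-based generation. No API key required.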

The Context Parameter Is Where It Gets Interesting

The real power isn't generating random data — it's generating data that tests specific scenarios:

angry_users = forge.create_batch(
    Review,
    count=20,
    context="1-star reviews from angry holiday shoppers"
)

Each bio or review field comes back with realistic frustration, specific complaints, edge-case formatting (ALL CAPS, emoji, long rants). This is the kind of data that catches bugs in text processing, truncation, rendering, and content moderation — bugs that "Lorem ipsum" never finds.

pytest Integration

In conftest.py:

from fixtureforge import forge_fixture
from myapp.models import User, Order

forge_fixture(User, count=50)
forge_fixture(Order, count=200)

In your tests:

def test_users_have_emails(users):
    assert all(u.email for u in users)

def test_order_count(orders):
    assert len(orders) == 200

The users and orders fixtures are registered automatically; tests just request them by name. No factory classes to maintain, no fixture files to update.

When You Should NOT Use This

I want to be honest about the limitations:

Don't use FixtureForge if:

Your tests only need IDs and emails — Faker alone is sufficient and simpler
You're in a strict air-gapped environment with no API access — CI mode works, but you lose the AI-generated quality
Your test data needs to match a specific production database schema exactly — use database dumps or migrations instead

Do use FixtureForge if:

You need realistic text content (bios, reviews, descriptions) at scale
You want to test edge cases in text processing without writing them by hand
You need deterministic CI with realistic dev-time data from one tool
You're tired of maintaining factory_boy factory classes for every model change

How It Compares

Feature	FixtureForge	factory_boy	faker	hypothesis
AI-generated content	Yes	No	No	No
Deterministic seed	Yes	Yes	Yes	Yes
FK relationships	Auto	Manual	No	No
pytest plugin	Yes	Via pytest-factoryboy	No	Yes
Large datasets (100k+)	Yes	Manual loops	Manual loops	No
Zero config	Yes	Factory classes needed	Provider setup	Strategy setup

FixtureForge isn't a replacement for Faker — it uses Faker internally. It's the layer between "I need data" and "I need it to feel real."

Try It

pip install fixtureforge

GitHub: Yaniv2809/fixtureforge
Docs: yaniv2809.github.io/fixtureforge

If you've built something similar or have opinions on AI-generated test data vs traditional fixtures, I'd like to hear about it.

Yaniv Metuku (yaniv2809) — QA Automation Engineer. Also building Financial-Integrity-Ecosystem and Failscope.

Stop writing fake test data by hand — I built a library that generates it for you

Yaniv — Thu, 16 Apr 2026 19:03:16 +0000

Every Python project I've worked on has the same problem in the test suite:

user = User(
    name="Test User",
    email="test@test.com",
    age=25,
    bio="Lorem ipsum dolor sit amet",
)

It's not realistic. It doesn't catch edge cases. And when you need 200 of them,
nobody writes them — you just copy-paste the same record and pretend it's a dataset.

I got tired of this and built FixtureForge.

The idea

Define a Pydantic model. Get realistic data.

from fixtureforge import Forge
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str
    email: str
    bio: str

forge = Forge()
users = forge.create_batch(User, count=50, context="SaaS platform users")

FixtureForge routes each field to the right generator:

Field	Generator	Cost
`id`	Sequential counter	Free
`name`, `email`	Faker	Free
`bio`	LLM (batched)	1 API call for all 50

Only semantic fields — descriptions, bios, reviews, messages — hit the AI.
Everything else is free.

CI mode: zero AI, fully deterministic

forge = Forge(use_ai=False, seed=42)
users = forge.create_batch(User, count=100)
# Same output on every machine, every run, forever

The seed= parameter controls both Faker and random generation at the instance level —
two Forge(seed=42) instances produce identical data without interfering with each other.
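
Instance-level seeding is easy to show with the standard library. The SeededGenerator class below is a hypothetical stand-in, not the Forge class; it only demonstrates why a private random.Random instance keeps two generators independent. Faker offers seed_instance() for the same purpose.

```python
import random


class SeededGenerator:
    """Hypothetical stand-in for a generator with its own seed."""

    def __init__(self, seed: int):
        # A private Random instance: seeding it never touches the global
        # `random` module state or any other generator.
        self._rng = random.Random(seed)
        self._next_id = 1

    def user(self) -> dict:
        user_id = self._next_id
        self._next_id += 1
        return {"id": user_id, "age": self._rng.randint(18, 90)}


a = SeededGenerator(seed=42)
b = SeededGenerator(seed=42)

first_a = [a.user() for _ in range(3)]
random.random()                      # unrelated global RNG use in between
first_b = [b.user() for _ in range(3)]

print(first_a == first_b)  # True
```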

pytest plugin

# conftest.py
from fixtureforge import forge_fixture
from myapp.models import User, Order

forge_fixture(User, count=50)
forge_fixture(Order, count=200)

# test_users.py
def test_all_users_have_emails(users):
    assert all(u.email for u in users)

def test_order_count(orders):
    assert len(orders) == 200

No boilerplate. Fixtures are named automatically from the model
(User → users, OrderItem → order_items).
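
The naming rule can be approximated as follows. This is a sketch under the assumption that the plugin converts CamelCase to snake_case and appends an "s"; the fixture_name helper is hypothetical and the real plugin may handle irregular plurals differently.

```python
import re


def fixture_name(model_name: str) -> str:
    """Turn a CamelCase model name into a snake_case plural fixture name."""
    # Insert "_" before every capital letter except the first one.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", model_name).lower()
    return snake + "s"   # naive plural: no special cases such as "y" -> "ies"


print(fixture_name("User"))       # users
print(fixture_name("OrderItem"))  # order_items
```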

Verbose mode

Not sure where a value came from? Turn on verbose:

forge = Forge(seed=42, verbose=True)  # AI left on so the bio line shows the [ai] route
user = forge.create(User)

# [structural] id    = 1
# [faker]      name  = 'Allison Hill'
# [faker]      email = 'donaldgarcia@example.net'
# [ai]         bio   = 'Passionate developer with 8 years of experience...'

Foreign keys

customers = forge.create_batch(Customer, count=10)
orders = forge.create_batch(Order, count=100)
# order.customer_id always points to a real customer.id — automatically
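
One way to get this guarantee is a registry of already-generated primary keys. The sketch below assumes that approach; ForeignKeyRegistry and the convention of mapping customer_id to the "customer" model are illustrative names, not FixtureForge's API.

```python
import random
from collections import defaultdict


class ForeignKeyRegistry:
    """Remembers generated primary keys so later models can reference them."""

    def __init__(self, seed: int = 0):
        self._ids = defaultdict(list)
        self._rng = random.Random(seed)

    def register(self, model: str, pk: int) -> None:
        self._ids[model].append(pk)

    def pick(self, model: str) -> int:
        if not self._ids[model]:
            raise LookupError(f"no {model} rows generated yet")
        return self._rng.choice(self._ids[model])


registry = ForeignKeyRegistry(seed=42)

customers = [{"id": i} for i in range(1, 11)]
for customer in customers:
    registry.register("customer", customer["id"])

# Assumed convention: "customer_id" refers to the "customer" model.
orders = [
    {"id": i, "customer_id": registry.pick("customer")} for i in range(1, 101)
]

customer_ids = {c["id"] for c in customers}
print(all(o["customer_id"] in customer_ids for o in orders))  # True
```

The LookupError makes ordering mistakes visible: generating orders before customers fails loudly instead of producing dangling references.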

Provider-agnostic

export GROQ_API_KEY=gsk_...      # Groq (free tier — 14,400 req/day)
export ANTHROPIC_API_KEY=sk-...  # Claude
export OPENAI_API_KEY=sk-...     # GPT
# No key? Falls back to Faker-only mode. CI never breaks.

What it's not

This isn't a replacement for faker — it uses faker internally.
It's not a replacement for hypothesis — different problem.

It's the layer between "I need realistic data" and
"I need it to feel like production."

How to get it

pip install fixtureforge
pip install "fixtureforge[groq]"  # + AI support via Groq free tier

Docs: yaniv2809.github.io/fixtureforge
GitHub: github.com/Yaniv2809/fixtureforge

I'd genuinely like to hear: what's your current approach to test data?
factory_boy? raw Faker? just hardcoded dicts?
And is there a use case this doesn't cover that you'd want it to?

CSV vs JSON for Test Data: What I Learned Using Both in the Same Framework

Yaniv — Mon, 13 Apr 2026 11:35:22 +0000

Most test automation tutorials pick one data format and stick with it. I used both — CSV for Web tests, JSON for API and Mobile — in the same framework. Here's why, and what I'd change.

The Setup

My framework tests a financial expense tracker across three platforms: Web (Playwright), API (Flask/requests), and Mobile (Appium). Each platform has its own test suite, and each suite needs external test data.

The core requirement: test data should live outside the test code, so adding a new test case means adding a row to a file — not modifying Python.

Why Two Formats?

CSV for Web Tests

Web test data is tabular and flat. Every expense has the same fields: name, amount, date, category. CSV maps perfectly:

test_id,expense_name,amount,category,date,status
test01,Business Lunch Web,150,Food,2025-05-20,success
test02,Taxi to office,50,Transportation,2025-05-21,success
test02,Client Dinner,320,Food,2025-05-22,success
test02,New Monitor,850,Accommodation,2025-05-23,success

The test_id column is the key design decision. Multiple rows can share the same test_id — this is how one test function runs multiple data sets. test02 has three rows, so test02 runs three times.

JSON for API and Mobile Tests

API and Mobile test data often needs nested structures or type-specific values that CSV can't express cleanly:

[
  {"test_id": "test03", "expense_name": "DDT_Coffee1", "amount": "15", "date": "2026-03-01", "category": "Food"},
  {"test_id": "test03", "expense_name": "DDT_Flight Ticket2", "amount": "1500", "date": "2026-03-02", "category": "Transportation"},
  {"test_id": "test03", "expense_name": "DDT_Hotel_Berlin3", "amount": "850.50", "date": "2026-03-03", "category": "Transportation"}
]

JSON preserves field names with every record (self-documenting), handles special characters better, and works natively with Python's json.load().

The Filtering Pattern

The real power is in the test_id filtering. Instead of loading all data for every test, each test requests only its own subset:

def read_data_from_csv_by_test(file_path, test_id):
    all_data = read_data_from_csv(file_path)
    return [row for row in all_data if row.get("test_id") == test_id]

def read_json_data_by_test(file_path, test_id):
    with open(file_path, 'r', encoding='utf-8') as f:
        all_data = json.load(f)
    return [row for row in all_data if row.get("test_id") == test_id]

This means one master data file per platform — not one file per test. When I need to add a new expense variant for test03, I add a row to the JSON file. The test automatically picks it up via @pytest.mark.parametrize.
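
The snippet above calls read_data_from_csv without showing it. Here is a self-contained sketch of what that helper could look like with csv.DictReader; the body is my assumption of a minimal version, and the sample data is written to a temporary file so the example runs anywhere.

```python
import csv
import os
import tempfile

SAMPLE = """test_id,expense_name,amount,category,date,status
test01,Business Lunch Web,150,Food,2025-05-20,success
test02,Taxi to office,50,Transportation,2025-05-21,success
test02,Client Dinner,320,Food,2025-05-22,success
"""


def read_data_from_csv(file_path):
    """Load every row as a dict keyed by the header row."""
    with open(file_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def read_data_from_csv_by_test(file_path, test_id):
    """Keep only the rows that belong to one test."""
    return [r for r in read_data_from_csv(file_path) if r.get("test_id") == test_id]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "expense_data.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(SAMPLE)
    rows = read_data_from_csv_by_test(path, "test02")

print([r["expense_name"] for r in rows])  # ['Taxi to office', 'Client Dinner']
print(type(rows[0]["amount"]).__name__)   # str
```

Note that the amount comes back as a string. With CSV, any numeric conversion has to happen in the test or workflow code.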

Wiring It Into pytest

Here's how the data flows into the test:

Web (CSV):

@pytest.mark.parametrize("expense_data", read_data_from_csv_by_test(MASTER_CSV, "test02"))
def test02_create_multiple_expenses_ddt(self, expense_data):
    WebWorkflows.create_expense(
        page=self.page,
        expense_name=expense_data["expense_name"],
        amount=expense_data["amount"],
        date=expense_data["date"],
        category=expense_data["category"]
    )

API (JSON):

@pytest.mark.parametrize("expense_data", read_json_data_by_test(MASTER_API_DATA, "test03"))
def test03_create_multiple_expenses_api(self, expense_data):
    response = APIWorkflows.create_expense(
        session=self.session,
        expense_name=expense_data["expense_name"],
        amount=expense_data["amount"],
        date=expense_data["date"],
        category=expense_data["category"]
    )
    APIVerifications.verify_status_code(response, 201)

Same pattern, different data source. pytest's parametrize handles the multiplication — 3 rows for test02 means 3 test executions in the report, each with its own pass/fail status.

What This Looks Like in Practice

The Allure report shows each parameterized run as a separate test case. If the "DDT_Hotel_Berlin3" dataset fails but the other two pass, you see exactly which data combination broke — not just "test03 failed."

Adding test coverage for a new edge case (say, an expense with a zero amount) means adding one line to the JSON file:

{"test_id": "test03", "expense_name": "Zero Amount", "amount": "0", "date": "2026-03-04", "category": "Food"}

No code change. No new test function. The next pytest run picks it up automatically.

What I'd Do Differently

1. I'd standardize on JSON for everything.

CSV seemed simpler at first, but maintaining two reader functions and two data formats adds friction. JSON handles everything CSV does, plus nested data, plus better type preservation. The CSV advantage (editable in Excel) never mattered in practice.

2. I'd add schema validation to the data files.

Right now, if someone adds a row with a typo in a field name (expnse_name instead of expense_name), the test fails with a KeyError — which looks like a code bug, not a data bug. A quick JSON Schema or Pydantic validation on load would catch this immediately.

3. I'd separate test data from test metadata.

The test_id field mixes data with routing information. A cleaner approach would be a separate mapping file that says "test02 uses rows 2-4 from the data file" — but honestly, for a framework this size, the inline test_id approach is simple and works.
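
The validation from point 2 does not need much code. Below is a stdlib-only sketch; a JSON Schema or a Pydantic model would do the same job with richer type checks. REQUIRED_FIELDS and validate_rows are hypothetical names, and the field set is taken from the JSON examples above.

```python
import json

REQUIRED_FIELDS = {"test_id", "expense_name", "amount", "date", "category"}


def validate_rows(rows, source="<data>"):
    """Fail fast with a data-focused message instead of a KeyError mid-test."""
    problems = []
    for index, row in enumerate(rows):
        missing = REQUIRED_FIELDS - set(row)
        unknown = set(row) - REQUIRED_FIELDS
        if missing or unknown:
            problems.append(
                f"{source} row {index}: "
                f"missing={sorted(missing)} unknown={sorted(unknown)}"
            )
    if problems:
        raise ValueError("Invalid test data:\n" + "\n".join(problems))
    return rows


good = json.loads(
    '[{"test_id": "test03", "expense_name": "Coffee", '
    '"amount": "15", "date": "2026-03-01", "category": "Food"}]'
)
bad = json.loads(
    '[{"test_id": "test03", "expnse_name": "Coffee", '
    '"amount": "15", "date": "2026-03-01", "category": "Food"}]'
)

validate_rows(good)
try:
    validate_rows(bad, source="expenses_json_data.json")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Calling validate_rows inside the reader function means a typo surfaces at collection time, with the file name and row index in the message.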

The Tradeoff Table

Factor	CSV	JSON
Readability	Easy for flat tables	Better for nested data
Tooling	Excel, Google Sheets	Any text editor, APIs
Type safety	Everything is a string	Supports numbers, booleans, null
Self-documenting	Headers only at top	Field names in every record
Python parsing	`csv.DictReader`	`json.load()`
Edge cases	Commas, quotes, and newlines in values require quoting	Handles special characters natively
Version control diffs	Clean line-by-line diffs	Noisier diffs (brackets, commas)

For test automation specifically, JSON wins in most categories. CSV's only real advantage is the Excel factor — and if your QA team doesn't use Excel for test data (most don't), it's not a factor.

Full Source

The complete implementation — both data formats, both reader functions, and all parameterized tests:

GitHub: Financial-Integrity-Ecosystem

Data files: data/web/expense_data.csv, data/ddt/expenses_json_data.json, data/ddt/mobile_expense_data.json
Reader functions: utils/common_ops.py

This is part 3 of a series on building a multi-layer test automation framework. Part 1 covered Set Theory for data integrity validation. Part 2 covered the CI/CD pipeline.

Yaniv Metuku — QA Automation Engineer

How I Built a 12-Step CI/CD Pipeline That Spins Up MySQL, Flask, and Playwright From Scratch

Yaniv — Sat, 11 Apr 2026 14:44:47 +0000

Setting up CI for a web app is straightforward. Setting up CI for a test automation framework that needs a real database, two backend servers, and a headless browser — all starting from zero on every run — is a different problem.

This is how I built a GitHub Actions pipeline that provisions the entire infrastructure, runs 37 tests across 3 layers, and deploys a historical test report — every time I push to main.

The Problem: "Works On My Machine" Doesn't Scale

Locally, my test framework needs:

MySQL 8.0 with a specific schema loaded
A JSON Server running on port 3000
A Flask API server running on port 5000
Playwright with Chromium installed
Environment variables for DB credentials and API keys

Running pytest locally assumes all of this is already set up. In CI, nothing exists. Every run starts from an empty Ubuntu container.

The challenge isn't running the tests — it's building the world they need to run in.

The 12 Steps

Here's the full pipeline, and why each step exists:

Steps 1-3: Foundation

- name: 1. Checkout Code
  uses: actions/checkout@v4

- name: 2. Set up Python 3.13
  uses: actions/setup-python@v5
  with:
    python-version: '3.13'

- name: 3. Set up Node.js 18
  uses: actions/setup-node@v4
  with:
    node-version: '18'

Python for the test framework. Node.js for JSON Server. Nothing surprising here, but note the explicit version pinning — 3.13, not 3.x. CI flakiness often starts with "we let the runner pick the version."

Steps 4-5: Dependencies

- name: 4. Install Python Dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt

- name: 5. Install Playwright Browsers
  run: playwright install chromium --with-deps

--with-deps is critical. Without it, Playwright installs the browser binary but not the OS-level libraries it needs (libgbm, libasound, etc.). The test run will fail with a cryptic error about missing shared objects.

Step 6: Database Schema

services:
  mysql:
    image: mysql:8.0
    env:
      MYSQL_ROOT_PASSWORD: root_password
      MYSQL_DATABASE: expense_test_db
      MYSQL_USER: test_user
      MYSQL_PASSWORD: test_password
    ports:
      - 3306:3306
    options: >-
      --health-cmd="mysqladmin ping"
      --health-interval=10s
      --health-timeout=5s
      --health-retries=5

# In steps:
- name: 6. Initialize MySQL Schema
  run: |
    sudo apt-get install -y mysql-client
    mysql -h 127.0.0.1 -u test_user -ptest_password expense_test_db < data/init_mysql.sql

The MySQL service container starts alongside the job. The health-cmd ensures the container is actually ready before we try to load the schema. Without health checks, you'll hit "Connection refused" errors roughly 30% of the time.

The schema itself is minimal — one table with a CHECK constraint:

CREATE TABLE IF NOT EXISTS expenses (
    id INT PRIMARY KEY AUTO_INCREMENT,
    expense_name VARCHAR(255),
    amount DOUBLE CHECK (amount >= 0),
    date VARCHAR(50),
    category VARCHAR(100)
);

That CHECK constraint is actually a test target — one of my E2E tests validates that negative amounts get rejected at the DB level even though the UI accepts them.

Steps 7-9: Server Orchestration

- name: 7. Install & Start JSON Server
  run: |
    npm install -g json-server
    json-server --watch json-server/db.json --port 3000 &

- name: 8. Start Flask Server
  env:
    DB_TYPE: mysql
    MYSQL_HOST: 127.0.0.1
  run: |
    python server/app.py &

- name: 9. Wait for Servers to be Ready
  run: |
    curl --retry 10 --retry-delay 2 --retry-connrefused http://localhost:3000/expenses
    curl --retry 10 --retry-delay 2 --retry-connrefused http://localhost:5000/expenses
    echo "All servers are up and running!"

The & at the end of each server command runs it in the background. Step 9 is the safety net: each curl call polls its server until it responds, and gives up after 10 retries spaced 2 seconds apart (roughly 20 seconds per server).

This is a pattern I see skipped in a lot of CI setups. People start a server and immediately run tests, then wonder why they get intermittent connection errors. Always add a readiness check.
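
The same readiness check works outside of curl, for example in a local entrypoint script. This is a minimal Python sketch, not part of the pipeline; wait_for_http is a hypothetical helper, and its defaults mirror the 10 retries and 2-second delay used in step 9.

```python
import http.client
import time


def wait_for_http(host, port, path="/", attempts=10, delay=2.0):
    """Poll an HTTP endpoint until it answers, or give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        conn = http.client.HTTPConnection(host, port, timeout=5)
        try:
            conn.request("GET", path)
            conn.getresponse().read()
            return True      # any HTTP response counts as "up" in this sketch
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or a half-started server.
            if attempt < attempts:
                time.sleep(delay)
        finally:
            conn.close()
    return False
```

A call such as wait_for_http("localhost", 3000, "/expenses") returns True once the server answers and False when the retries are exhausted, so the caller can exit non-zero and fail the job early.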

Step 10: The Actual Tests

- name: 10. Run Tests (exclude Mobile)
  env:
    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
  run: |
    pytest -m "not mobile" --alluredir=allure-results --ai-analysis

-m "not mobile" excludes tests that require a physical Android device. The remaining 37 tests cover Web (Playwright), API, Database, and cross-layer E2E.

--ai-analysis triggers an optional Groq LLM call on test failures to classify the root cause. The API key is stored as a GitHub Secret.

Steps 11-12: Reporting

- name: 11. Generate Allure Report
  uses: simple-elf/allure-report-action@master
  if: always()
  with:
    allure_results: allure-results
    allure_history: allure-history
    keep_reports: 20

- name: 12. Deploy Allure Report to GitHub Pages
  uses: peaceiris/actions-gh-pages@v3
  if: always()
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: allure-history

if: always() is key — the report is generated even when tests fail. Without this, failures produce no report, which is exactly when you need one most.

keep_reports: 20 maintains the last 20 runs, so you get historical trend analysis in Allure — pass rates over time, flakiness detection, duration trends.

What I Learned Building This

1. Health checks prevent 80% of CI flakiness. The MySQL health-cmd and the curl retry loops in step 9 eliminated almost all intermittent failures. Before adding them, roughly 1 in 4 runs failed due to timing issues.

2. if: always() on reporting steps is non-negotiable. The whole point of CI reports is to understand failures. If the report step only runs on success, it's useless.

3. Service containers beat docker-compose in GitHub Actions. I originally tried running docker-compose up inside the workflow. It works, but it's slower and harder to debug. Native service containers integrate better with the runner's networking.

4. Pin your versions. Python 3.13, Node 18, MySQL 8.0 — not latest, not 3.x. A version bump in a dependency should be a deliberate commit, not a surprise in CI.

The Docker Alternative

For local development, the same test suite runs via Docker Compose with a single command:

docker-compose up --build

This uses a custom entrypoint script that replicates the CI steps — waits for MySQL, starts both servers, runs pytest:

[1/5] Waiting for MySQL...        ✓
[2/5] Starting JSON Server...     ✓
[3/5] Starting Flask Server...    ✓
[4/5] Waiting for servers...      ✓
[5/5] Running tests...
========================= 34 passed, 3 xfailed =========================

Same tests, same infrastructure, same results — whether it runs in GitHub Actions or on a developer's laptop.

Full Source

The complete workflow file and Docker setup are in the repo:

GitHub: Financial-Integrity-Ecosystem

The CI config is at .github/workflows/ci.yml. The Docker setup is in Dockerfile, docker-compose.yml, and docker-entrypoint.sh.

This is part 2 of a series on building a multi-layer test automation framework. Part 1 covered using Set Theory for cross-layer data integrity validation.

Yaniv Metuku — QA Automation Engineer

How I Used Set Theory to Catch Bugs That Unit Tests Miss

Yaniv — Fri, 10 Apr 2026 16:40:02 +0000

Most test automation tutorials teach you to test layers in isolation: UI tests check buttons, API tests check status codes, DB tests check records. But the bugs that actually cost money in production? They live between the layers.

I learned this the hard way while building a test automation framework for a financial expense tracker. This post is about one specific technique — Set Theory validation — that catches data integrity bugs that no single-layer test will ever find.

The Problem: Everything Passes, But Data Is Wrong

Imagine this scenario:

A user creates an expense for $100 through the Web UI
The UI shows a success message ✅
The API returns 201 Created ✅
The database has... $0. Or two records. Or nothing.

Every individual layer test passes. The UI test confirms the success message appeared. The API test confirms the status code. But nobody verified that the actual data made it through the entire pipeline correctly.

In financial applications, this is not a cosmetic bug — it's a silent data inconsistency that can go unnoticed until an audit.

The Approach: Database State as a Mathematical Set

Instead of checking "does a record exist?", I treat the entire database table as a mathematical set, and use set difference to prove exactly what changed.

Here's the concept, in pseudocode:

# Step 1: Capture DB state BEFORE the action
old_set = {(id, name, amount) for each row in expenses}
old_sum = SUM(amount) from expenses

# Step 2: Perform the action (create expense via UI or API)

# Step 3: Capture DB state AFTER the action
new_set = {(id, name, amount) for each row in expenses}
new_sum = SUM(amount) from expenses

# Step 4: Validate using set difference
isolated_record = new_set - old_set

assert len(isolated_record) == 1          # Exactly ONE new record
assert new_sum - old_sum == expected_amount  # Amount is correct

This is powerful because:

It's operation-independent. Whether the expense was created via UI, API, or direct SQL — the validation is the same.
It catches duplicates. If a bug causes two records to be inserted, len(isolated_record) will be 2.
It catches phantom data. If some other process modified the table during the test, the set difference will include unexpected records.
It catches amount drift. If $100 was entered but $99.99 was stored (floating point issues, rounding bugs), the sum check catches it.
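
The pseudocode translates almost directly into runnable Python. The sketch below uses an in-memory SQLite database instead of MySQL so it runs without any setup; the snapshot helper is a hypothetical name, and a direct INSERT stands in for the UI or API call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expenses (id INTEGER PRIMARY KEY, expense_name TEXT, amount REAL)"
)
conn.execute("INSERT INTO expenses (expense_name, amount) VALUES ('Rent', 1200)")


def snapshot(connection):
    """Return the table as a set of row tuples plus the total amount."""
    rows = set(connection.execute("SELECT id, expense_name, amount FROM expenses"))
    total = connection.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM expenses"
    ).fetchone()[0]
    return rows, total


old_set, old_sum = snapshot(conn)

# The action under test; here a direct INSERT stands in for the UI or API call.
conn.execute("INSERT INTO expenses (expense_name, amount) VALUES ('Lunch', 100)")

new_set, new_sum = snapshot(conn)

added = new_set - old_set
assert len(added) == 1, f"Expected 1 new record, got {len(added)}"
assert new_sum - old_sum == 100
print(added)
```

If the action inserted the row twice, the first assertion would fail with a count of 2, which is exactly the duplicate case described above.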

Real Implementation

In my framework, the pattern looks like this across the stack:

DB helper that captures state:

@staticmethod
@allure.step("DB: Get all expenses as set")
def get_all_expenses_as_set(cursor):
    cursor.execute("SELECT id, expense_name, amount FROM expenses")
    return {(row[0], row[1], row[2]) for row in cursor.fetchall()}

@staticmethod
@allure.step("DB: Get sum of amounts")
def get_sum_of_amounts(cursor):
    cursor.execute("SELECT COALESCE(SUM(amount), 0) FROM expenses")
    return cursor.fetchone()[0]

Cross-layer E2E test that uses it:

def test_api_create_reflects_in_db(self):
    # Capture pre-state
    old_set = DBActions.get_all_expenses_as_set(cursor)
    old_sum = DBActions.get_sum_of_amounts(cursor)

    # Act: Create expense via API
    response = APIActions.post(session, url, payload)
    APIVerification.verify_status_code(response, 201)

    # Capture post-state
    new_set = DBActions.get_all_expenses_as_set(cursor)
    new_sum = DBActions.get_sum_of_amounts(cursor)

    # Validate integrity
    diff = new_set - old_set
    assert len(diff) == 1, f"Expected 1 new record, got {len(diff)}"
    assert new_sum - old_sum == expected_amount

The same pattern applies to update (set difference shows one record with changed values) and delete (old_set - new_set shows the removed record).
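
For update and delete, taking the difference in both directions tells the two apart. A small sketch with hard-coded row tuples (the data is made up for illustration):

```python
old_set = {(1, "Rent", 1200.0), (2, "Lunch", 100.0), (3, "Taxi", 50.0)}

# State after an UPDATE of row 2 and a DELETE of row 3.
new_set = {(1, "Rent", 1200.0), (2, "Lunch", 120.0)}

removed = old_set - new_set   # rows that vanished or whose old values were replaced
added = new_set - old_set     # rows that are new or carry new values

# An id on both sides means an update; an id only in `removed` means a delete.
updated_ids = {row[0] for row in removed} & {row[0] for row in added}
deleted_ids = {row[0] for row in removed} - updated_ids

print(updated_ids)  # {2}
print(deleted_ids)  # {3}
```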

Where This Caught Real Bugs

While building this, the Set Theory approach caught two issues that single-layer tests completely missed:

1. MySQL CHECK constraint silently rejecting negative amounts.
The UI happily accepted -50 as an expense amount. The API returned 201. But MySQL's CHECK (amount >= 0) constraint blocked the INSERT — so the record never existed in the DB. The set difference was empty when it should have contained one record. Without the cross-layer test, this would have looked like a perfectly passing test suite.

2. VARCHAR(255) overflow truncation.
A 300-character expense name was entered through the UI. The API accepted it. MySQL truncated it to 255 characters silently. The set difference caught the mismatch because the stored record didn't match the expected data.
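
The first bug is easy to reproduce in miniature. The sketch below uses SQLite rather than MySQL, and fake_api_create is a hypothetical endpoint that swallows the database error; it only illustrates how a 201 response and an empty set difference can coexist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expenses ("
    "id INTEGER PRIMARY KEY, expense_name TEXT, amount REAL CHECK (amount >= 0))"
)


def fake_api_create(name, amount):
    """Hypothetical buggy endpoint: hides the DB error and still reports success."""
    try:
        conn.execute(
            "INSERT INTO expenses (expense_name, amount) VALUES (?, ?)",
            (name, amount),
        )
    except sqlite3.IntegrityError:
        pass            # the bug: the constraint failure never reaches the caller
    return 201


def rows():
    return set(conn.execute("SELECT id, expense_name, amount FROM expenses"))


old_set = rows()
status = fake_api_create("Refund typo", -50)
new_set = rows()

print(status)                  # 201
print(len(new_set - old_set))  # 0
```

A status-code assertion passes here. Only the check that exactly one record was added exposes the problem.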

The Full Picture

This technique is one piece of a larger framework I built with 53 tests across 4 layers (Web/Playwright, API/Flask, Mobile/Appium, Database/MySQL). The cross-layer E2E tests that use Set Theory are a small percentage of the total test count, but they catch the highest-risk bugs.

The architecture enforces strict separation:

Tests → Workflows → Actions/Verifications → Page Objects + Data

Every layer has one job. Tests never call raw UI or API actions directly — they go through workflows that compose actions into business flows. This keeps the set theory validation reusable across different test scenarios.

When You Should (and Shouldn't) Use This

Use it when:

Your application handles financial data, inventory, or any domain where data accuracy matters more than UI polish
Data flows through multiple systems (frontend → backend → database → reporting)
You've had production bugs where "the UI said X but the DB had Y"

Don't use it when:

You're testing a static website or content-only app
The DB is behind a well-tested ORM with strong constraints and you trust the abstraction
Test execution time is critical and you can't afford the extra DB queries

Try It Yourself

The full framework is open source:

GitHub: Financial-Integrity-Ecosystem

You can run the entire suite with a single command:

git clone https://github.com/Yaniv2809/Financial-Integrity-Ecosystem.git
cd Financial-Integrity-Ecosystem
docker-compose up --build

This spins up MySQL, Flask, JSON Server, and Playwright, then runs all 37 non-mobile tests automatically.

The cross-layer E2E tests are in tests/api/test_e2e_api_db_expense.py and tests/test_e2e_web_api_db.py if you want to see the set theory pattern in action.

I recently shared this project on r/QualityAssurance and got valuable feedback that led to several improvements, including adding a testing philosophy section that explains the business risk behind each layer. If you have feedback or have used similar patterns in production, I'd genuinely like to hear about it.

Yaniv Metuku — QA Automation Engineer