Most test automation tutorials pick one data format and stick with it. I used both — CSV for Web tests, JSON for API and Mobile — in the same framework. Here's why, and what I'd change.
The Setup
My framework tests a financial expense tracker across three platforms: Web (Playwright), API (Flask/requests), and Mobile (Appium). Each platform has its own test suite, and each suite needs external test data.
The core requirement: test data should live outside the test code, so adding a new test case means adding a row to a file — not modifying Python.
Why Two Formats?
CSV for Web Tests
Web test data is tabular and flat. Every expense has the same fields: name, amount, date, category. CSV maps perfectly:
test_id,expense_name,amount,category,date,status
test01,Business Lunch Web,150,Food,2025-05-20,success
test02,Taxi to office,50,Transportation,2025-05-21,success
test02,Client Dinner,320,Food,2025-05-22,success
test02,New Monitor,850,Accommodation,2025-05-23,success
The test_id column is the key design decision. Multiple rows can share the same test_id — this is how one test function runs multiple data sets. test02 has three rows, so test02 runs three times.
JSON for API and Mobile Tests
API and Mobile test data often needs nested structures or type-specific values that CSV can't express cleanly:
[
{"test_id": "test03", "expense_name": "DDT_Coffee1", "amount": "15", "date": "2026-03-01", "category": "Food"},
{"test_id": "test03", "expense_name": "DDT_Flight Ticket2", "amount": "1500", "date": "2026-03-02", "category": "Transportation"},
{"test_id": "test03", "expense_name": "DDT_Hotel_Berlin3", "amount": "850.50", "date": "2026-03-03", "category": "Transportation"}
]
JSON preserves field names with every record (self-documenting), handles special characters better, and works natively with Python's json.load().
The Filtering Pattern
The real power is in the test_id filtering. Instead of loading all data for every test, each test requests only its own subset:
import json

def read_data_from_csv_by_test(file_path, test_id):
    all_data = read_data_from_csv(file_path)
    return [row for row in all_data if row.get("test_id") == test_id]

def read_json_data_by_test(file_path, test_id):
    with open(file_path, 'r', encoding='utf-8') as f:
        all_data = json.load(f)
    return [row for row in all_data if row.get("test_id") == test_id]
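The CSV reader above delegates to a read_data_from_csv helper whose body isn't shown in the post (the real one lives in utils/common_ops.py). A minimal sketch of what it presumably looks like, assuming the file's first row is a header:

```python
import csv

def read_data_from_csv(file_path):
    # Assumed helper (the real implementation is in utils/common_ops.py):
    # parse the master CSV into a list of dicts keyed by the header row.
    with open(file_path, 'r', encoding='utf-8', newline='') as f:
        return list(csv.DictReader(f))
```

csv.DictReader gives each row the same dict shape as a JSON record, which is what lets both readers share the identical test_id filtering step.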
This means one master data file per platform — not one file per test. When I need to add a new expense variant for test03, I add a row to the JSON file. The test automatically picks it up via @pytest.mark.parametrize.
Wiring It Into pytest
Here's how the data flows into the test:
Web (CSV):
@pytest.mark.parametrize("expense_data", read_data_from_csv_by_test(MASTER_CSV, "test02"))
def test02_create_multiple_expenses_ddt(self, expense_data):
    WebWorkflows.create_expense(
        page=self.page,
        expense_name=expense_data["expense_name"],
        amount=expense_data["amount"],
        date=expense_data["date"],
        category=expense_data["category"]
    )
API (JSON):
@pytest.mark.parametrize("expense_data", read_json_data_by_test(MASTER_API_DATA, "test03"))
def test03_create_multiple_expenses_api(self, expense_data):
    response = APIWorkflows.create_expense(
        session=self.session,
        expense_name=expense_data["expense_name"],
        amount=expense_data["amount"],
        date=expense_data["date"],
        category=expense_data["category"]
    )
    APIVerifications.verify_status_code(response, 201)
Same pattern, different data source. pytest's parametrize handles the multiplication — 3 rows for test02 means 3 test executions in the report, each with its own pass/fail status.
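One optional refinement, not part of the original framework: pytest's ids= argument replaces the default expense_data0/expense_data1 labels with names derived from the data itself, which makes each parameterized row readable at a glance in the report. The sample rows below are illustrative:

```python
import pytest

# Illustrative rows in the same shape as the master JSON file.
SAMPLE_ROWS = [
    {"test_id": "test03", "expense_name": "DDT_Coffee1", "amount": "15"},
    {"test_id": "test03", "expense_name": "DDT_Flight Ticket2", "amount": "1500"},
]

def expense_ids(rows):
    # One human-readable id per data row, e.g. 'DDT_Coffee1-15'.
    return [f"{row['expense_name']}-{row['amount']}" for row in rows]

@pytest.mark.parametrize("expense_data", SAMPLE_ROWS, ids=expense_ids(SAMPLE_ROWS))
def test_create_expense(expense_data):
    # Placeholder body; the real test would call the workflow layer.
    assert expense_data["test_id"] == "test03"
```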
What This Looks Like in Practice
The Allure report shows each parameterized run as a separate test case. If the "DDT_Hotel_Berlin3" dataset fails but the other two pass, you see exactly which data combination broke — not just "test03 failed."
Adding test coverage for a new edge case (say, an expense with a zero amount) means adding one line to the JSON file:
{"test_id": "test03", "expense_name": "Zero Amount", "amount": "0", "date": "2026-03-04", "category": "Food"}
No code change. No new test function. The next pytest run picks it up automatically.
What I'd Do Differently
1. I'd standardize on JSON for everything.
CSV seemed simpler at first, but maintaining two reader functions and two data formats adds friction. JSON handles everything CSV does, plus nested data, plus better type preservation. The CSV advantage (editable in Excel) never mattered in practice.
2. I'd add schema validation to the data files.
Right now, if someone adds a row with a typo in a field name (expnse_name instead of expense_name), the test fails with a KeyError — which looks like a code bug, not a data bug. A quick JSON Schema or Pydantic validation on load would catch this immediately.
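A minimal hand-rolled sketch of that idea, using only the standard library (a real version would use JSON Schema or Pydantic, as suggested above); the field list mirrors the data files shown earlier:

```python
import json

REQUIRED_FIELDS = {"test_id", "expense_name", "amount", "date", "category"}

def load_validated(file_path):
    # Load the data file and fail fast with a data-level error message
    # instead of a KeyError deep inside a test.
    with open(file_path, 'r', encoding='utf-8') as f:
        rows = json.load(f)
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(
                f"Data error in row {i} of {file_path}: missing {sorted(missing)}"
            )
    return rows
```

A typo like expnse_name now surfaces as "Data error in row N" pointing at the file, not as a mysterious KeyError in the test code.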
3. I'd separate test data from test metadata.
The test_id field mixes data with routing information. A cleaner approach would be a separate mapping file that says "test02 uses rows 2-4 from the data file" — but honestly, for a framework this size, the inline test_id approach is simple and works.
The Tradeoff Table
| Factor | CSV | JSON |
|---|---|---|
| Readability | Easy for flat tables | Better for nested data |
| Tooling | Excel, Google Sheets | Any text editor, APIs |
| Type safety | Everything is a string | Supports numbers, booleans, null |
| Self-documenting | Headers only at top | Field names in every record |
| Python parsing | csv.DictReader | json.load() |
| Edge cases | Commas in values need quoting | Handles special characters natively |
| Version control diffs | Clean line-by-line diffs | Noisier diffs (brackets, commas) |
For test automation specifically, JSON wins in most categories. CSV's only real advantage is the Excel factor — and if your QA team doesn't use Excel for test data (most don't), it's not a factor.
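For completeness: the comma problem is manageable as long as the file is written and read through Python's csv module, which quotes fields automatically; hand-edited files are where it bites. A quick sketch:

```python
import csv
import io

# Write a value containing a comma through csv.DictWriter, which
# quotes the field automatically, then read it back intact.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["expense_name", "amount"])
writer.writeheader()
writer.writerow({"expense_name": "Lunch, team offsite", "amount": "150"})

rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[0]["expense_name"])  # Lunch, team offsite
```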
Full Source
The complete implementation — both data formats, both reader functions, and all parameterized tests:
GitHub: Financial-Integrity-Ecosystem
Data files: data/web/expense_data.csv, data/ddt/expenses_json_data.json, data/ddt/mobile_expense_data.json
Reader functions: utils/common_ops.py
This is part 3 of a series on building a multi-layer test automation framework. Part 1 covered Set Theory for data integrity validation. Part 2 covered the CI/CD pipeline.
Yaniv Metuku — QA Automation Engineer