Data Pipeline Testing Kit
Comprehensive testing framework for PySpark data pipelines — from unit tests to integration validation.
By Datanest Digital | Version 1.0.0 | $39
What You Get
A complete testing toolkit for data pipelines running on Databricks and PySpark, including:
- Test Framework — base classes and runners for PySpark unit/integration tests
- Data Generators — realistic synthetic data factories for customers, orders, events
- Custom Assertions — DataFrame-level assertions for schema, row count, nulls, uniqueness
- Mock Utilities — helpers for mocking spark, dbutils, Delta tables, and external APIs
- Snapshot Testing — golden-file comparison for pipeline output validation
- Sample Fixtures — ready-to-use JSON test data (customers, orders, expected outputs)
- Pipeline Tests — complete examples testing bronze, silver, and gold layers
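To give a feel for the mock utilities, here is a minimal sketch of how a Databricks dbutils stand-in can be built with the standard library's unittest.mock. The helper name make_mock_dbutils and the dict-backed secrets lookup are illustrative assumptions; the kit's mock_utils.py may be shaped differently.

```python
from unittest.mock import MagicMock

def make_mock_dbutils(secrets=None):
    """Build a stand-in for Databricks dbutils (hypothetical helper;
    the kit's mock_utils.py may differ in shape)."""
    secrets = secrets or {}
    dbutils = MagicMock(name="dbutils")
    # Resolve secrets from a plain dict keyed by (scope, key)
    dbutils.secrets.get.side_effect = lambda scope, key: secrets[(scope, key)]
    # Widgets always resolve to a fixed test value
    dbutils.widgets.get.return_value = "test"
    return dbutils

dbutils = make_mock_dbutils({("etl", "api_key"): "dummy"})
print(dbutils.secrets.get("etl", "api_key"))  # dummy
```

Because MagicMock records every call, tests can also assert that pipeline code fetched the secret it was supposed to.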
File Tree
data-pipeline-testing/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│ ├── test_framework.py # Base test classes and PySpark test runner
│ ├── data_generators.py # Synthetic data factories
│ ├── assertions.py # DataFrame assertion library
│ ├── mock_utils.py # Spark/dbutils/Delta mocking helpers
│ └── snapshot_testing.py # Golden-file snapshot comparison
├── fixtures/
│ ├── sample_customers.json # 50 customer records
│ ├── sample_orders.json # 100 order records
│ └── expected_outputs/
│ └── customer_summary.json
├── tests/
│ ├── conftest.py # Shared pytest fixtures with SparkSession
│ ├── test_bronze_pipeline.py # Bronze layer ingestion tests
│ ├── test_silver_pipeline.py # Silver layer transformation tests
│ └── test_gold_pipeline.py # Gold layer aggregation tests
├── configs/
│ └── test_config.yaml # Test environment configuration
└── guides/
└── testing-data-pipelines.md
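The test_config.yaml in the tree above might hold settings along these lines. This is a hypothetical sketch only; the actual keys shipped with the kit may differ.

```yaml
# Hypothetical sketch of configs/test_config.yaml — actual keys may differ
spark:
  master: local[2]          # small local cluster for fast test runs
  shuffle_partitions: 2     # keep shuffles cheap in tests
paths:
  fixtures: fixtures/
  snapshots: fixtures/expected_outputs/
delta:
  enabled: true
```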
Getting Started
1. Install Dependencies
pip install pyspark delta-spark pytest pyyaml
2. Use the Test Framework
from test_framework import SparkTestCase

class TestMyPipeline(SparkTestCase):
    def test_ingestion(self):
        # Create test data
        df = self.create_dataframe(
            [("Alice", 100), ("Bob", 200)],
            schema=["name", "amount"]
        )

        # Run your pipeline logic
        result = my_transform(df)

        # Assert results
        self.assert_row_count(result, 2)
        self.assert_no_nulls(result, ["name", "amount"])
3. Generate Test Data
from data_generators import CustomerGenerator, OrderGenerator
customers = CustomerGenerator(seed=42).generate(count=1000)
orders = OrderGenerator(seed=42).generate(
    count=5000,
    customer_ids=customers.select("customer_id")
)
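Passing the same seed yields identical data on every run, which keeps tests deterministic. Stripped of Spark, the core idea can be sketched as follows; this is illustrative only, and the kit's generators return Spark DataFrames with richer fields.

```python
import random

class CustomerGenerator:
    """Seeded synthetic-customer factory (illustrative sketch only;
    the kit's generator returns a Spark DataFrame with richer fields)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # isolated RNG for reproducibility

    def generate(self, count):
        statuses = ["active", "inactive", "churned"]
        return [
            {"customer_id": i,
             "name": f"customer_{i}",
             "status": self.rng.choice(statuses)}
            for i in range(count)
        ]

# Same seed => identical data on every run
assert CustomerGenerator(seed=42).generate(count=5) == \
       CustomerGenerator(seed=42).generate(count=5)
```

Using a private random.Random instance (rather than the module-level functions) keeps the generator's determinism independent of anything else that touches the global RNG.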
4. Use Custom Assertions
from assertions import DataFrameAssertions
assertions = DataFrameAssertions(spark)
assertions.assert_schema_matches(result_df, expected_schema)
assertions.assert_column_values_in(result_df, "status", ["active", "inactive", "churned"])
assertions.assert_unique(result_df, ["customer_id"])
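The logic behind checks like assert_unique and assert_no_nulls can be sketched on plain dicts, independent of Spark. These function bodies are assumptions for illustration; the kit's versions run the equivalent checks with DataFrame operations such as groupBy/count.

```python
def assert_unique(rows, cols):
    """Fail if any (col1, col2, ...) key appears more than once.
    Illustrative core logic on plain dicts; the kit's version runs
    the equivalent check on a DataFrame."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            raise AssertionError(f"duplicate key {key} for columns {cols}")
        seen.add(key)

def assert_no_nulls(rows, cols):
    """Fail if any of the listed columns holds None."""
    for i, row in enumerate(rows):
        for c in cols:
            if row[c] is None:
                raise AssertionError(f"null in column {c!r} at row {i}")

assert_unique([{"customer_id": 1}, {"customer_id": 2}], ["customer_id"])
assert_no_nulls([{"name": "Alice", "amount": 100}], ["name", "amount"])
```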
5. Snapshot Testing
from snapshot_testing import SnapshotTester
tester = SnapshotTester(snapshot_dir="fixtures/expected_outputs")
tester.assert_matches(result_df, "customer_summary") # Compares to golden file
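Snapshot testing boils down to serializing the pipeline output once as a golden file, then comparing subsequent runs against it. The following is a minimal sketch of that mechanic on plain JSON rows; the class shape and the update flag are assumptions, and the kit's SnapshotTester collects a Spark DataFrame before comparing.

```python
import json
import tempfile
from pathlib import Path

class SnapshotTester:
    """Golden-file comparison on plain JSON rows (illustrative sketch;
    the kit's version collects a Spark DataFrame before comparing)."""

    def __init__(self, snapshot_dir):
        self.snapshot_dir = Path(snapshot_dir)

    def assert_matches(self, rows, name, update=False):
        path = self.snapshot_dir / f"{name}.json"
        if update or not path.exists():
            # First run (or explicit refresh): record the golden file
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(rows, sort_keys=True, indent=2))
            return
        expected = json.loads(path.read_text())
        assert rows == expected, f"snapshot mismatch for {name!r}"

with tempfile.TemporaryDirectory() as d:
    tester = SnapshotTester(snapshot_dir=d)
    rows = [{"customer_id": 1, "total": 300}]
    tester.assert_matches(rows, "customer_summary")  # first call writes the golden file
    tester.assert_matches(rows, "customer_summary")  # second call compares against it
```

Sorting keys on write keeps the golden files stable under dict-ordering changes, so diffs in version control stay meaningful.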
Architecture
┌──────────────────────────────────────────────────────────┐
│ Test Runner (pytest) │
├──────────────┬──────────────┬──────────────┬─────────────┤
│ test_bronze │ test_silver │ test_gold │ your tests │
├──────────────┴──────────────┴──────────────┴─────────────┤
│ Test Framework Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐│
│ │ Assertions │ │ Generators │ │ Snapshot Testing ││
│ └────────────┘ └────────────┘ └────────────────────────┘│
├──────────────────────────────────────────────────────────┤
│ Mock / Fixture Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐│
│ │ Mock Utils │ │ conftest │ │ JSON Fixtures ││
│ └────────────┘ └────────────┘ └────────────────────────┘│
├──────────────────────────────────────────────────────────┤
│ PySpark (local) / Delta Spark │
└──────────────────────────────────────────────────────────┘
Requirements
- Python 3.10+
- PySpark 3.5+
- delta-spark 3.1+
- pytest 7+
- PyYAML 6+
- Java 11+ (for local Spark)
Related Products
- Data Quality Framework — Data quality checks and validation rules
- Airflow DAG Templates — Production Apache Airflow DAG patterns
- Spark ETL Framework — Production Spark ETL pipeline patterns
This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete Data Pipeline Testing Kit with all files, templates, and documentation for $39.
Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.