Data Quality Framework
Trust your data. A pluggable quality engine with built-in checks for completeness,
uniqueness, validity, freshness, and consistency — plus automated reporting to Slack,
HTML, and Delta Lake.
By Datanest Digital | Version 1.0.0 | $49
What You Get
- Quality Engine — Rule-based engine that loads checks from YAML, executes them against any Spark DataFrame, aggregates results, and produces structured reports
- 6 Check Types — Completeness (null/empty), uniqueness (duplicates), validity (regex, range, enum), freshness (staleness), consistency (cross-table), and custom (arbitrary SQL expressions)
- 3 Reporters — Slack webhook notifications, standalone HTML reports, and Delta Lake audit table writer for historical trending
- YAML Configuration — Define rules and thresholds in human-readable YAML; no code changes needed to add new checks
- Databricks Notebook — Ready-to-run notebook for executing quality checks as a scheduled job
- Strategy Guide — Best practices for implementing data quality at scale
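To make the check types concrete, here is a minimal sketch of the ratio a completeness check computes, shown with a plain Python list rather than a Spark DataFrame (the function name and signature are illustrative, not the framework's API):

```python
# Hypothetical sketch of a completeness check's core logic, using a plain
# Python list in place of a Spark DataFrame column.
def completeness_ratio(values):
    """Fraction of values that are neither None nor empty strings."""
    if not values:
        return 1.0
    non_empty = [v for v in values if v is not None and v != ""]
    return len(non_empty) / len(values)

emails = ["a@example.com", None, "b@example.com", ""]
ratio = completeness_ratio(emails)  # 2 non-empty out of 4 -> 0.5
```

A rule passes when this ratio meets its configured threshold (e.g. 0.99 for the email rule shown under Getting Started).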
File Tree
```
data-quality-framework/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│   ├── quality_engine.py          # Core engine: load, execute, report
│   ├── checks/
│   │   ├── completeness.py        # Null/empty field checks
│   │   ├── uniqueness.py          # Duplicate detection
│   │   ├── validity.py            # Regex, range, enum validation
│   │   ├── freshness.py           # Data staleness checks
│   │   ├── consistency.py         # Cross-table consistency
│   │   └── custom.py              # Arbitrary SQL expression checks
│   └── reporters/
│       ├── slack_reporter.py      # Slack webhook notifications
│       ├── html_reporter.py       # Standalone HTML report
│       └── delta_reporter.py      # Delta Lake audit table writer
├── configs/
│   ├── quality_rules.yaml         # Rule definitions
│   └── thresholds.yaml            # Pass/warn/fail thresholds
├── notebooks/
│   └── run_quality_checks.py      # Databricks notebook
├── tests/
│   ├── conftest.py                # Shared fixtures
│   └── test_quality_engine.py     # Unit tests
└── guides/
    └── data-quality-strategy.md   # Best practices guide
```
Getting Started
1. Define your quality rules
Edit configs/quality_rules.yaml to specify which checks to run:
```yaml
rules:
  - name: "customer_email_not_null"
    table: "analytics.silver.customers"
    check_type: "completeness"
    columns: ["email"]
    threshold: 0.99  # 99% must be non-null

  - name: "order_id_unique"
    table: "analytics.silver.orders"
    check_type: "uniqueness"
    columns: ["order_id"]
    threshold: 1.0  # 100% unique
```
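As an illustration of how a rule like this might translate into an executable check, the sketch below builds the Spark SQL a completeness rule could run (a hypothetical helper, not the engine's actual code). Since `COUNT(column)` skips NULLs, dividing it by `COUNT(*)` yields the non-null ratio:

```python
# Hypothetical sketch: translating a completeness rule into Spark SQL.
# COUNT(column) ignores NULLs, so the ratio below is the non-null fraction.
def completeness_sql(table: str, column: str) -> str:
    return (
        f"SELECT COUNT({column}) / COUNT(*) AS non_null_ratio "
        f"FROM {table}"
    )

sql = completeness_sql("analytics.silver.customers", "email")
# The engine would compare the resulting ratio against the rule's 0.99 threshold.
```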
2. Run quality checks
```python
from src.quality_engine import QualityEngine

engine = QualityEngine.from_config(
    rules_path="configs/quality_rules.yaml",
    thresholds_path="configs/thresholds.yaml",
)

# Execute all rules and get a report
report = engine.run_all()
print(report.summary())

# Check if all rules passed
if not report.passed:
    print(f"FAILED: {report.failed_count} of {report.total_count} checks failed")
```
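The report object's internals aren't shown in this listing; a plausible minimal shape consistent with the attributes used above (`passed`, `failed_count`, `total_count`, `summary()`) might look like:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool

@dataclass
class QualityReport:
    results: list

    @property
    def total_count(self) -> int:
        return len(self.results)

    @property
    def failed_count(self) -> int:
        return sum(1 for r in self.results if not r.passed)

    @property
    def passed(self) -> bool:
        return self.failed_count == 0

    def summary(self) -> str:
        return f"{self.total_count - self.failed_count}/{self.total_count} checks passed"
```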
3. Send notifications
```python
from src.reporters.slack_reporter import SlackReporter
from src.reporters.delta_reporter import DeltaReporter

# Send Slack alert for failures
slack = SlackReporter(webhook_url="https://hooks.slack.com/services/T.../B.../xxx")
slack.send(report)

# Persist results to Delta Lake for trending
delta_reporter = DeltaReporter(audit_table="analytics.ops.quality_audit")
delta_reporter.write(report)
```
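For reference, a Slack incoming-webhook message is just a JSON payload with a `text` field; here is a hypothetical sketch of what a webhook reporter might build before POSTing (the helper name and message format are assumptions, not the framework's code):

```python
# Hypothetical sketch of the JSON payload a Slack webhook reporter might build.
def build_slack_payload(summary: str, failed_names: list) -> dict:
    lines = [f":rotating_light: Data quality: {summary}"]
    lines += [f"- {name}" for name in failed_names]
    return {"text": "\n".join(lines)}

payload = build_slack_payload("7/9 checks passed", ["customer_email_not_null"])
# The reporter would then POST it, e.g.:
# requests.post(webhook_url, json=payload)
```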
Requirements
- Databricks Runtime 13.3 LTS or later
- Apache Spark 3.4+
- Delta Lake 2.4+
- Python 3.10+
- requests (for Slack reporter)
Architecture
```
┌──────────────────┐     ┌────────────────────┐
│  quality_rules   │────▶│   Quality Engine   │
│  .yaml           │     │                    │
└──────────────────┘     │  1. Load rules     │
┌──────────────────┐     │  2. Execute checks │
│  thresholds      │────▶│  3. Aggregate      │
│  .yaml           │     │  4. Report         │
└──────────────────┘     └─────────┬──────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
     ┌────────────────┐   ┌────────────────┐   ┌────────────────┐
     │ Slack Reporter │   │ HTML Reporter  │   │ Delta Reporter │
     │ (webhook)      │   │ (standalone)   │   │ (audit table)  │
     └────────────────┘   └────────────────┘   └────────────────┘
```
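The flow in the diagram can be sketched as a small dispatch loop (illustrative only; the names are assumptions, not the engine's actual internals):

```python
# Illustrative dispatch loop matching the diagram: execute the matching check
# for each rule, then fan the aggregated results out to every reporter.
def run_rules(rules, checkers, reporters):
    results = [checkers[rule["check_type"]](rule) for rule in rules]
    for reporter in reporters:
        reporter(results)  # Slack, HTML, Delta each receive the same results
    return results
```

Keeping checks and reporters behind simple callable interfaces like this is what makes the engine pluggable: adding a new check type or output channel means registering one more entry, not modifying the loop.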
Related Products
- Data Pipeline Testing — Unit and integration tests for data pipelines
- Data Observability Setup — Pipeline monitoring and alerting
- Data Catalog Builder — Build searchable data catalogs
This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete Data Quality Framework with all files, templates, and documentation for $49.
Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.