Data Quality Framework
Trust your data. A pluggable quality engine with built-in checks for completeness,
uniqueness, validity, freshness, and consistency — plus automated reporting to Slack,
HTML, and Delta Lake.
By Datanest Digital | Version 1.0.0 | $49
What You Get
- Quality Engine — Rule-based engine that loads checks from YAML, executes them against any Spark DataFrame, aggregates results, and produces structured reports
- 6 Check Types — Completeness (null/empty), uniqueness (duplicates), validity (regex, range, enum), freshness (staleness), consistency (cross-table), and custom (arbitrary SQL expressions)
- 3 Reporters — Slack webhook notifications, standalone HTML reports, and Delta Lake audit table writer for historical trending
- YAML Configuration — Define rules and thresholds in human-readable YAML; no code changes needed to add new checks
- Databricks Notebook — Ready-to-run notebook for executing quality checks as a scheduled job
- Strategy Guide — Best practices for implementing data quality at scale
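To make the check types concrete, here is a minimal sketch of the ratio a completeness check computes, shown with a plain Python list rather than a Spark DataFrame (the function name and signature are illustrative, not the framework's API):

```python
# Hypothetical sketch of a completeness check's core logic, using a plain
# Python list in place of a Spark DataFrame column.
def completeness_ratio(values):
    """Fraction of values that are neither None nor empty strings."""
    if not values:
        return 1.0
    non_empty = [v for v in values if v is not None and v != ""]
    return len(non_empty) / len(values)

emails = ["a@example.com", None, "b@example.com", ""]
ratio = completeness_ratio(emails)  # 2 non-empty out of 4 -> 0.5
```

A rule passes when this ratio meets its configured threshold (e.g. 0.99 for the email rule shown under Getting Started).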
File Tree
```
data-quality-framework/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│   ├── quality_engine.py          # Core engine: load, execute, report
│   ├── checks/
│   │   ├── completeness.py        # Null/empty field checks
│   │   ├── uniqueness.py          # Duplicate detection
│   │   ├── validity.py            # Regex, range, enum validation
│   │   ├── freshness.py           # Data staleness checks
│   │   ├── consistency.py         # Cross-table consistency
│   │   └── custom.py              # Arbitrary SQL expression checks
│   └── reporters/
│       ├── slack_reporter.py      # Slack webhook notifications
│       ├── html_reporter.py       # Standalone HTML report
│       └── delta_reporter.py      # Delta Lake audit table writer
├── configs/
│   ├── quality_rules.yaml         # Rule definitions
│   └── thresholds.yaml            # Pass/warn/fail thresholds
├── notebooks/
│   └── run_quality_checks.py      # Databricks notebook
├── tests/
│   ├── conftest.py                # Shared fixtures
│   └── test_quality_engine.py     # Unit tests
└── guides/
    └── data-quality-strategy.md   # Best practices guide
```
Getting Started
1. Define your quality rules
Edit configs/quality_rules.yaml to specify which checks to run:
```yaml
rules:
  - name: "customer_email_not_null"
    table: "analytics.silver.customers"
    check_type: "completeness"
    columns: ["email"]
    threshold: 0.99  # 99% must be non-null

  - name: "order_id_unique"
    table: "analytics.silver.orders"
    check_type: "uniqueness"
    columns: ["order_id"]
    threshold: 1.0  # 100% unique
```
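As an illustration of how a rule like this might translate into an executable check, the sketch below builds the Spark SQL a completeness rule could run (a hypothetical helper, not the engine's actual code). Since `COUNT(column)` skips NULLs, dividing it by `COUNT(*)` yields the non-null ratio:

```python
# Hypothetical sketch: translating a completeness rule into Spark SQL.
# COUNT(column) ignores NULLs, so the ratio below is the non-null fraction.
def completeness_sql(table: str, column: str) -> str:
    return (
        f"SELECT COUNT({column}) / COUNT(*) AS non_null_ratio "
        f"FROM {table}"
    )

sql = completeness_sql("analytics.silver.customers", "email")
# The engine would compare the resulting ratio against the rule's 0.99 threshold.
```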
2. Run quality checks
```python
from src.quality_engine import QualityEngine

engine = QualityEngine.from_config(
    rules_path="configs/quality_rules.yaml",
    thresholds_path="configs/thresholds.yaml",
)

# Execute all rules and get a report
report = engine.run_all()
print(report.summary())

# Check if all rules passed
if not report.passed:
    print(f"FAILED: {report.failed_count} of {report.total_count} checks failed")
```
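The report object's internals aren't shown in this listing; a plausible minimal shape consistent with the attributes used above (`passed`, `failed_count`, `total_count`, `summary()`) might look like:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool

@dataclass
class QualityReport:
    results: list

    @property
    def total_count(self) -> int:
        return len(self.results)

    @property
    def failed_count(self) -> int:
        return sum(1 for r in self.results if not r.passed)

    @property
    def passed(self) -> bool:
        return self.failed_count == 0

    def summary(self) -> str:
        return f"{self.total_count - self.failed_count}/{self.total_count} checks passed"
```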
3. Send notifications
```python
from src.reporters.slack_reporter import SlackReporter
from src.reporters.delta_reporter import DeltaReporter

# Send Slack alert for failures
slack = SlackReporter(webhook_url="https://hooks.slack.com/services/T.../B.../xxx")
slack.send(report)

# Persist results to Delta Lake for trending
delta_reporter = DeltaReporter(audit_table="analytics.ops.quality_audit")
delta_reporter.write(report)
```
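For reference, a Slack incoming-webhook message is just a JSON payload with a `text` field; here is a hypothetical sketch of what a webhook reporter might build before POSTing (the helper name and message format are assumptions, not the framework's code):

```python
# Hypothetical sketch of the JSON payload a Slack webhook reporter might build.
def build_slack_payload(summary: str, failed_names: list) -> dict:
    lines = [f":rotating_light: Data quality: {summary}"]
    lines += [f"- {name}" for name in failed_names]
    return {"text": "\n".join(lines)}

payload = build_slack_payload("7/9 checks passed", ["customer_email_not_null"])
# The reporter would then POST it, e.g.:
# requests.post(webhook_url, json=payload)
```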
Requirements
- Databricks Runtime 13.3 LTS or later
- Apache Spark 3.4+
- Delta Lake 2.4+
- Python 3.10+
- requests (for Slack reporter)
Architecture
```
┌──────────────────┐     ┌────────────────────┐
│  quality_rules   │────▶│   Quality Engine   │
│  .yaml           │     │                    │
└──────────────────┘     │  1. Load rules     │
┌──────────────────┐     │  2. Execute checks │
│  thresholds      │────▶│  3. Aggregate      │
│  .yaml           │     │  4. Report         │
└──────────────────┘     └─────────┬──────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
     ┌────────────────┐   ┌────────────────┐   ┌────────────────┐
     │ Slack Reporter │   │ HTML Reporter  │   │ Delta Reporter │
     │ (webhook)      │   │ (standalone)   │   │ (audit table)  │
     └────────────────┘   └────────────────┘   └────────────────┘
```
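The flow in the diagram can be sketched as a small dispatch loop (illustrative only; the names are assumptions, not the engine's actual internals):

```python
# Illustrative dispatch loop matching the diagram: execute the matching check
# for each rule, then fan the aggregated results out to every reporter.
def run_rules(rules, checkers, reporters):
    results = [checkers[rule["check_type"]](rule) for rule in rules]
    for reporter in reporters:
        reporter(results)  # Slack, HTML, Delta each receive the same results
    return results
```

Keeping checks and reporters behind simple callable interfaces like this is what makes the engine pluggable: adding a new check type or output channel means registering one more entry, not modifying the loop.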
Related Products
- Data Pipeline Testing — Unit and integration tests for data pipelines
- Data Observability Setup — Pipeline monitoring and alerting
- Data Catalog Builder — Build searchable data catalogs
This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete Data Quality Framework with all files, templates, and documentation for $49.
Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.