Thesius Code

Originally published at datanest-stores.pages.dev

Data Quality Framework

Trust your data. A pluggable quality engine with built-in checks for completeness,
uniqueness, validity, freshness, and consistency — plus automated reporting to Slack,
HTML, and Delta Lake.

By Datanest Digital | Version 1.0.0 | $49


What You Get

  • Quality Engine — Rule-based engine that loads checks from YAML, executes them against any Spark DataFrame, aggregates results, and produces structured reports
  • 6 Check Types — Completeness (null/empty), uniqueness (duplicates), validity (regex, range, enum), freshness (staleness), consistency (cross-table), and custom (arbitrary SQL expressions)
  • 3 Reporters — Slack webhook notifications, standalone HTML reports, and Delta Lake audit table writer for historical trending
  • YAML Configuration — Define rules and thresholds in human-readable YAML; no code changes needed to add new checks
  • Databricks Notebook — Ready-to-run notebook for executing quality checks as a scheduled job
  • Strategy Guide — Best practices for implementing data quality at scale
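
To make the check types above more concrete, here is a minimal plain-Python sketch of the validity (regex, range, enum) and freshness ideas. It is not the product's code: the shipped checks run against Spark DataFrames, and the helper names `validity_ratio` and `is_fresh`, the sample data, and the treatment of empty input are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration; the real checks operate on Spark DataFrames.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validity_ratio(values, predicate):
    """Return the fraction of values for which predicate is true."""
    values = list(values)
    if not values:
        return 1.0  # Assumption: an empty column counts as fully valid.
    return sum(1 for v in values if predicate(v)) / len(values)


def is_fresh(latest_ts, max_age, now=None):
    """Return True when the newest record is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - latest_ts <= max_age


# Regex validity: one of four values is not shaped like an email address.
emails = ["a@example.com", "not-an-email", "b@example.org", "c@example.net"]
regex_score = validity_ratio(emails, lambda v: bool(EMAIL_RE.match(v)))

# Range validity: one of four amounts falls outside 0..1000.
amounts = [10, 250, -5, 40]
range_score = validity_ratio(amounts, lambda v: 0 <= v <= 1000)

# Enum validity: one of four statuses is not in the allowed set.
statuses = ["open", "closed", "open", "unknown"]
enum_score = validity_ratio(statuses, lambda v: v in {"open", "closed"})

# Freshness: the newest record is 6 hours old against a 24-hour limit.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
fresh = is_fresh(latest, timedelta(hours=24), now=now)
```

Each validity check reduces to a ratio that can be compared with a threshold, while freshness is a yes/no answer about the age of the newest record.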

File Tree

data-quality-framework/
├── README.md
├── manifest.json
├── LICENSE
├── src/
│   ├── quality_engine.py              # Core engine: load, execute, report
│   ├── checks/
│   │   ├── completeness.py            # Null/empty field checks
│   │   ├── uniqueness.py              # Duplicate detection
│   │   ├── validity.py                # Regex, range, enum validation
│   │   ├── freshness.py               # Data staleness checks
│   │   ├── consistency.py             # Cross-table consistency
│   │   └── custom.py                  # Arbitrary SQL expression checks
│   └── reporters/
│       ├── slack_reporter.py          # Slack webhook notifications
│       ├── html_reporter.py           # Standalone HTML report
│       └── delta_reporter.py          # Delta Lake audit table writer
├── configs/
│   ├── quality_rules.yaml             # Rule definitions
│   └── thresholds.yaml                # Pass/warn/fail thresholds
├── notebooks/
│   └── run_quality_checks.py          # Databricks notebook
├── tests/
│   ├── conftest.py                    # Shared fixtures
│   └── test_quality_engine.py         # Unit tests
└── guides/
    └── data-quality-strategy.md       # Best practices guide

Getting Started

1. Define your quality rules

Edit configs/quality_rules.yaml to specify which checks to run:

rules:
  - name: "customer_email_not_null"
    table: "analytics.silver.customers"
    check_type: "completeness"
    columns: ["email"]
    threshold: 0.99  # 99% must be non-null

  - name: "order_id_unique"
    table: "analytics.silver.orders"
    check_type: "uniqueness"
    columns: ["order_id"]
    threshold: 1.0  # 100% unique
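
The two rules above compare a score with a `threshold`. As a rough sketch of what that comparison might look like, the following plain-Python example computes a completeness ratio and a uniqueness ratio over a list of dicts. The function names, the definition of uniqueness as distinct values divided by total rows, and the sample rows are assumptions; the framework itself evaluates Spark DataFrames and may define these scores differently.

```python
def completeness_ratio(rows, column):
    """Fraction of rows where column is neither None nor an empty string."""
    if not rows:
        return 1.0  # Assumption: an empty table counts as complete.
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness_ratio(rows, column):
    """Distinct values in column divided by the number of rows (one possible definition)."""
    if not rows:
        return 1.0
    values = [r.get(column) for r in rows]
    return len(set(values)) / len(values)


def passes(score, threshold):
    """A rule passes when its score meets or exceeds the threshold."""
    return score >= threshold


customers = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": "b@example.com"},
    {"email": ""},
]
orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}, {"order_id": 3}]

email_score = completeness_ratio(customers, "email")
order_score = uniqueness_ratio(orders, "order_id")

email_ok = passes(email_score, 0.99)  # threshold from customer_email_not_null
order_ok = passes(order_score, 1.0)   # threshold from order_id_unique
```

With this sample data both rules fail: half of the emails are missing, and one order ID is duplicated.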

2. Run quality checks

from src.quality_engine import QualityEngine

engine = QualityEngine.from_config(
    rules_path="configs/quality_rules.yaml",
    thresholds_path="configs/thresholds.yaml",
)

# Execute all rules and get a report
report = engine.run_all()
print(report.summary())

# Check if all rules passed
if not report.passed:
    print(f"FAILED: {report.failed_count} of {report.total_count} checks failed")
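
When the checks run as a scheduled job, printing a failure message does not stop downstream tasks. One common pattern is to raise an exception, since an uncaught exception in a Databricks notebook marks the run as failed. The sketch below assumes only the `passed`, `failed_count`, and `total_count` attributes used above; `ReportStub`, `QualityGateError`, and `enforce` are hypothetical names introduced for illustration, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class ReportStub:
    """Hypothetical stand-in exposing the report attributes used above."""

    total_count: int
    failed_count: int

    @property
    def passed(self) -> bool:
        return self.failed_count == 0


class QualityGateError(RuntimeError):
    """Raised to stop a pipeline when quality checks fail."""


def enforce(report) -> None:
    """Raise QualityGateError if any check in the report failed."""
    if not report.passed:
        raise QualityGateError(
            f"{report.failed_count} of {report.total_count} checks failed"
        )


# A clean report passes through silently.
enforce(ReportStub(total_count=5, failed_count=0))

# A report with failures raises, which would fail the surrounding job.
try:
    enforce(ReportStub(total_count=5, failed_count=2))
except QualityGateError as exc:
    message = str(exc)
```

In a real job you would call `enforce(report)` with the object returned by `engine.run_all()` and let the exception propagate.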

3. Send notifications

from src.reporters.slack_reporter import SlackReporter
from src.reporters.delta_reporter import DeltaReporter

# Send Slack alert for failures
slack = SlackReporter(webhook_url="https://hooks.slack.com/services/T.../B.../xxx")
slack.send(report)

# Persist results to Delta Lake for trending
delta_reporter = DeltaReporter(audit_table="analytics.ops.quality_audit")
delta_reporter.write(report)
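
For context on what a Slack webhook notification involves: a Slack incoming webhook accepts an HTTP POST with a JSON body containing a `text` field. The sketch below builds such a request with the standard library rather than `requests`, and deliberately does not send it. The function name `build_slack_request`, the message wording, and the `SLACK_WEBHOOK_URL` environment variable are assumptions; `SlackReporter` may format and send its messages differently.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, failed, total):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"Data quality: {failed} of {total} checks failed"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: the webhook URL comes from an environment variable so the
# secret stays out of source control; the fallback is the placeholder above.
webhook_url = os.environ.get(
    "SLACK_WEBHOOK_URL", "https://hooks.slack.com/services/T.../B.../xxx"
)
request = build_slack_request(webhook_url, failed=2, total=10)

# urllib.request.urlopen(request) would deliver the message; omitted here.
```

Reading the URL from the environment (or a secret store) is a design choice worth considering, because anyone holding the webhook URL can post to the channel.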

Requirements

  • Databricks Runtime 13.3 LTS or later
  • Apache Spark 3.4+
  • Delta Lake 2.4+
  • Python 3.10+
  • requests (for Slack reporter)

Architecture

┌──────────────────┐     ┌────────────────────┐
│  quality_rules   │────▶│   Quality Engine    │
│  .yaml           │     │                     │
└──────────────────┘     │  1. Load rules      │
┌──────────────────┐     │  2. Execute checks  │
│  thresholds      │────▶│  3. Aggregate       │
│  .yaml           │     │  4. Report          │
└──────────────────┘     └─────────┬──────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                     ▼
     ┌────────────────┐  ┌────────────────┐   ┌────────────────┐
     │  Slack Reporter │  │ HTML Reporter  │   │ Delta Reporter │
     │  (webhook)      │  │ (standalone)   │   │ (audit table)  │
     └────────────────┘  └────────────────┘   └────────────────┘
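
The four engine steps in the diagram can be sketched in a few lines of plain Python. This is an illustration of the flow only, under stated assumptions: rules are modeled as `(name, callable)` pairs returning a score, thresholds use hypothetical `pass_at` and `warn_at` keys, and reporters are plain callables. None of these names come from the framework, and the actual layout of thresholds.yaml may differ.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    score: float
    status: str  # "pass", "warn", or "fail"


def classify(score, pass_at, warn_at):
    """Map a score to a status using assumed pass/warn cut-offs."""
    if score >= pass_at:
        return "pass"
    if score >= warn_at:
        return "warn"
    return "fail"


def run_pipeline(rules, thresholds, reporters):
    """Execute checks, aggregate the results, and hand them to each reporter."""
    # 1. Load rules: in this sketch they are already in memory.
    # 2. Execute checks and classify each score.
    results = []
    for name, check in rules:
        score = check()
        results.append(CheckResult(name, score, classify(score, **thresholds)))
    # 3. Aggregate.
    summary = {
        "total": len(results),
        "failed": sum(1 for r in results if r.status == "fail"),
    }
    # 4. Report: every reporter receives the same summary and results.
    for reporter in reporters:
        reporter(summary, results)
    return summary, results


received = []  # Stands in for Slack, HTML, and Delta reporters.
rules = [
    ("emails_present", lambda: 0.995),
    ("ids_unique", lambda: 0.97),
    ("data_fresh", lambda: 0.5),
]
thresholds = {"pass_at": 0.99, "warn_at": 0.95}
summary, results = run_pipeline(
    rules, thresholds, [lambda s, r: received.append(s)]
)
```

Keeping reporters as interchangeable callables mirrors the fan-out at the bottom of the diagram: the engine produces one result set, and each reporter decides how to present it.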

Related Products


This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete Data Quality Framework with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

