DEV Community: vahid Saber

Add Data Quality Checks to Your Airflow DAG in 5 Minutes

vahid Saber — Thu, 07 May 2026 20:28:58 +0000

Most Airflow DAGs have zero data quality checks. The pipeline runs, data lands in the warehouse, and you find out something is wrong when a stakeholder asks why the dashboard numbers look off. Three days later.

Adding quality checks feels like a project: pick a tool, configure it, write checks for every table, maintain them as schemas change. So it never happens.

Here's how to add auto-generated data quality checks to any Airflow DAG in under 5 minutes. No configuration, no writing checks by hand.

Option 1: BashOperator (zero install beyond pip)

If you already have DQLens installed in your Airflow environment:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG("my_pipeline", start_date=datetime(2026, 1, 1), schedule="@daily") as dag:

    load_data = BashOperator(
        task_id="load_data",
        bash_command="python load_script.py",
    )

    quality_check = BashOperator(
        task_id="quality_check",
        bash_command=(
            "dqlens init $DATABASE_URL --schema public && "
            "dqlens profile && "
            "dqlens run --ci --focus high"
        ),
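        # Demo credentials only; in practice, read the URL from an Airflow connection or secret.
        env={"DATABASE_URL": "postgresql://user:pass@host:5432/db"},
        # Keep the worker's existing environment (including PATH) alongside DATABASE_URL.
        append_env=True,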
    )

    load_data >> quality_check

That's it. After your data loads, DQLens profiles every table, compares against the previous run, and fails the task if it finds HIGH severity problems.

Option 2: DQLensOperator (cleaner, typed)

pip install airflow-provider-dqlens

from airflow import DAG
from dqlens_airflow.operators import DQLensOperator
from datetime import datetime

with DAG("my_pipeline", start_date=datetime(2026, 1, 1), schedule="@daily") as dag:

    load_data = ...

    quality_check = DQLensOperator(
        task_id="quality_check",
        conn_id="my_postgres",
        schema="public",
        focus="high",
    )

    load_data >> quality_check

The operator reads your Airflow connection, profiles the database, and fails the task if it finds problems at the configured focus level. Results are pushed to XCom so downstream tasks can access them.

What it catches (without you writing anything)

On the first run, DQLens profiles your tables and stores a baseline. On every subsequent run, it compares and flags:

Null rate spikes: email column went from 0.1% null to 12% null
Row count anomalies: table grew 50% overnight (possible duplicate ingestion)
Schema drift: a column was dropped or changed type
Empty strings: columns that pass not-null checks but carry no information
Freshness: data that hasn't been updated recently

Every finding has a severity level (HIGH / MEDIUM / LOW). The focus="high" parameter means only structural problems (FK violations, schema changes, major null spikes) fail the task. Medium and low findings are logged but don't block the pipeline.
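
A minimal sketch of how severity gating like this could work, assuming each finding carries a severity label and focus acts as a minimum severity. The Finding class, diff_schema, and blocking_findings are hypothetical names for illustration, not the DQLens API, and the severity assigned to each kind of change is an assumption. Schema drift serves as the example check:

```python
from dataclasses import dataclass

# Severity names mirror the article; the numeric ranking is an assumption.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one detected problem."""
    table: str
    check: str
    severity: str  # "high", "medium", or "low"
    detail: str


def diff_schema(table, baseline, current):
    """Compare two {column: type} snapshots and return schema-drift findings."""
    findings = []
    for column, old_type in baseline.items():
        if column not in current:
            findings.append(
                Finding(table, "schema_drift", "high", f"column {column} was dropped")
            )
        elif current[column] != old_type:
            findings.append(
                Finding(table, "schema_drift", "high",
                        f"column {column} changed type {old_type} -> {current[column]}")
            )
    # New columns are treated as low severity here (an assumption).
    for column in sorted(current.keys() - baseline.keys()):
        findings.append(
            Finding(table, "schema_drift", "low", f"column {column} was added")
        )
    return findings


def blocking_findings(findings, focus="high"):
    """Return only the findings at or above the focus severity."""
    floor = SEVERITY_RANK[focus]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= floor]


# Made-up snapshots: "id" changed type, "email" was dropped, "note" was added.
baseline = {"id": "integer", "email": "text", "amount": "numeric"}
current = {"id": "text", "amount": "numeric", "note": "text"}

findings = diff_schema("orders", baseline, current)
blockers = blocking_findings(findings, focus="high")

for f in findings:
    print(f"[{f.severity.upper()}] {f.table}: {f.detail}")
if blockers:
    print(f"{len(blockers)} blocking finding(s): a task gated on focus='high' would fail here")
```

With focus="high", the added column is still reported but does not block; lowering the focus to "low" would make all three findings blocking.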

Why not Great Expectations or Soda?

Both are built around checks you define by hand. Great Expectations needs Python expectation suites. Soda needs YAML check definitions. For 200 tables, that's days of work and ongoing maintenance as schemas change.

DQLens generates checks automatically from your data. You add one task to your DAG and get coverage you never had to write.

Accessing results downstream

from airflow.operators.python import PythonOperator

def review_results(**context):
    results = context["ti"].xcom_pull(task_ids="quality_check")
    if results:
        print(f"Tables profiled: {results['tables_profiled']}")
        print(f"Findings: {results['findings_count']}")
        print(f"Passed: {results['passed_count']}")

review = PythonOperator(
    task_id="review",
    python_callable=review_results,
)

quality_check >> review

Supported databases

PostgreSQL, DuckDB, SQLite, MySQL. The operator reads your Airflow connection type and builds the right connection URL automatically.
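
As a rough illustration of what building a URL from a connection type can look like, here is a sketch under stated assumptions. ConnInfo is a hypothetical stand-in for the fields of an Airflow connection, and the SCHEMES mapping and build_url function are illustrative, not the provider's actual code:

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote_plus


@dataclass
class ConnInfo:
    """Hypothetical stand-in for the fields of an Airflow connection."""
    conn_type: str
    host: Optional[str] = None  # for file-based engines, assumed to hold the file path
    login: Optional[str] = None
    password: Optional[str] = None
    port: Optional[int] = None
    schema: Optional[str] = None  # database name


# Assumed mapping from connection type to URL scheme.
SCHEMES = {
    "postgres": "postgresql",
    "mysql": "mysql",
    "sqlite": "sqlite",
    "duckdb": "duckdb",
}


def build_url(conn: ConnInfo) -> str:
    """Build a database URL from connection fields."""
    if conn.conn_type not in SCHEMES:
        raise ValueError(f"unsupported connection type: {conn.conn_type}")
    scheme = SCHEMES[conn.conn_type]

    if conn.conn_type in ("sqlite", "duckdb"):
        # File-based engines need only a path.
        return f"{scheme}:///{conn.host or ''}"

    auth = ""
    if conn.login:
        # Percent-encode credentials so characters like "@" cannot break the URL.
        auth = quote_plus(conn.login)
        if conn.password:
            auth += ":" + quote_plus(conn.password)
        auth += "@"
    port = f":{conn.port}" if conn.port else ""
    return f"{scheme}://{auth}{conn.host}{port}/{conn.schema or ''}"


pg = ConnInfo("postgres", host="db.internal", login="etl",
              password="p@ss", port=5432, schema="analytics")
print(build_url(pg))
print(build_url(ConnInfo("sqlite", host="/tmp/demo.db")))
```

Encoding the credentials matters in practice: a password containing "@" or "/" would otherwise produce a URL that parses incorrectly.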

Try it

pip install airflow-provider-dqlens

Add one task to your DAG. Run it. See what it finds.

GitHub: github.com/vahid110/airflow-provider-dqlens
Core engine: github.com/vahid110/dqlens

If your DAG loads data but doesn't check it, you're flying blind. One task fixes that.

What dbt Tests Miss (and How to Catch It Automatically)

vahid Saber — Tue, 05 May 2026 12:42:22 +0000

If you use dbt, you probably have some tests. A few not_null checks, maybe unique on your primary keys, possibly some accepted_values on status columns.

But be honest: how many of your columns actually have tests? 10%? 20%?

The rest are untested. Not because you don't care, but because writing test YAML for 200 columns across 40 models is tedious work that never makes it to the top of the sprint.

The gap in dbt testing

dbt tests are rule-based. You write a rule, it checks that rule. If you didn't write a rule, nothing gets checked. This creates three blind spots:

1. Drift goes undetected.
Your email column had 0.1% nulls last month. Today it's 12%. No dbt test catches this because you never wrote one that says "null rate should stay below X%." You find out when a PM asks why the marketing numbers look off.

2. Structural changes slip through.
A column gets dropped upstream. A type changes from integer to text. dbt won't tell you unless you wrote a test for that specific column. By the time your Spark job fails, the damage is downstream.

3. Nobody tests what they don't know about.
Orphaned foreign keys, outlier values 10x beyond normal range, columns that are technically "not null" but 40% empty strings. These are real problems in real databases that nobody writes tests for because they don't know they exist until something breaks.
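
The empty-string case is easy to reproduce. The sketch below uses an in-memory SQLite table with made-up rows (the table and column names are illustrative) to show a column that satisfies NOT NULL while 40% of its values carry no information:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
# Two of the five rows are empty or whitespace-only, yet none are NULL.
rows = [
    (1, "a@example.com"),
    (2, ""),
    (3, "b@example.com"),
    (4, "  "),
    (5, "c@example.com"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Count NULLs and blank strings in one pass over the column.
null_count, empty_count, total = conn.execute(
    """
    SELECT
        SUM(email IS NULL),
        SUM(email IS NOT NULL AND TRIM(email) = ''),
        COUNT(*)
    FROM customers
    """
).fetchone()
conn.close()

empty_rate = empty_count / total
print(f"nulls: {null_count}, blank strings: {empty_count} of {total} ({empty_rate:.0%})")
```

A not_null test passes on this table, which is why the blank-string rate has to be measured separately.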

What if dbt tests wrote themselves?

That's what I built. dbt-dqlens profiles your models and generates the test YAML for you.

After your normal dbt run:

pip install dbt-dqlens
dqlens-dbt profile        # profiles all models using your dbt connection
dqlens-dbt generate-tests # outputs _dqlens_tests.yml

It reads your profiles.yml, connects to the same warehouse dbt uses, profiles every column (nulls, uniqueness, distributions, patterns, foreign keys, percentiles), and generates native dbt tests based on what it finds.

The output, _dqlens_tests.yml, is a standard dbt schema file you commit to your repo:

models:
  - name: orders
    tags: [dqlens]
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - dqlens_no_null_drift:
              baseline_pct: 0.1
              threshold_multiplier: 3
      - name: amount
        tests:
          - dqlens_no_outliers:
              lower_bound: -110.0
              upper_bound: 210.0
      - name: customer_id
        tests:
          - dqlens_no_orphans:
              target_model: ref('customers')
              target_column: id

Then dbt test --select tag:dqlens runs them as native dbt tests. They show up in dbt docs, dbt Cloud, your CI pipeline. Nothing changes about your workflow except now you have tests you didn't write.

What it catches that dbt tests don't

Problem	Standard dbt test	dbt-dqlens
Null rate increased 10x from last week	No (unless you wrote a threshold)	Yes (baseline comparison)
Column dropped upstream	No	Yes (schema drift detection)
FK references non-existent rows	Only with relationships test (manual)	Yes (auto-detected from schema)
40% empty strings masquerading as "not null"	No	Yes (empty string rate check)
Values 10x beyond normal range	No	Yes (IQR-based outlier detection)
Column type changed	No	Yes (type change detection)
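
For the IQR-based row, a common formulation flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that multiplier and Python's inclusive quantile method; whether dbt-dqlens uses exactly these choices is not stated in this post. The sample data is made up and chosen so that the quartiles are 10 and 90, which gives the same -110.0 and 210.0 bounds shown in the generated YAML earlier:

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds: k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up baseline sample: Q1 = 10, Q3 = 90, so IQR = 80.
baseline_amounts = [10, 10, 50, 90, 90]
lower, upper = iqr_bounds(baseline_amounts)
print(lower, upper)  # -110.0 210.0

# Values from a later run are checked against the stored bounds.
new_amounts = [25, 80, 2100]
outliers = [v for v in new_amounts if v < lower or v > upper]
print(outliers)  # [2100]
```

Because the bounds come from the profiled baseline, nobody has to guess a sensible range for each numeric column.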

It's behavior-based, not rule-based

The key difference: dbt tests check static rules you defined. dbt-dqlens checks behavior. It learns what your data looks like (the baseline) and flags when something changes.

You don't define thresholds. It computes them from your data. If your email column is normally 0.1% null and jumps to 5%, that's a finding. If your orders table normally grows 2-5% daily and suddenly jumps 50%, that's a finding.

This is the kind of check nobody writes by hand because you'd need to know the baseline first. The tool knows it because it profiled your data.
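
One plausible reading of the baseline_pct and threshold_multiplier parameters from the generated YAML is that a finding fires when the current null rate exceeds the baseline times the multiplier. The function below is a sketch of that interpretation, not the tool's actual logic, and null_drift is a hypothetical name:

```python
def null_drift(baseline_pct, current_pct, threshold_multiplier=3.0):
    """Return (drifted, limit_pct) under the assumed rule current > baseline * multiplier."""
    limit_pct = baseline_pct * threshold_multiplier
    return current_pct > limit_pct, limit_pct


# The email example from the text: normally 0.1% null, now 5%.
drifted, limit_pct = null_drift(baseline_pct=0.1, current_pct=5.0)
print(f"limit: {limit_pct:.2f}%, drifted: {drifted}")

# A small wobble from 0.1% to 0.2% stays under the limit.
small_change, _ = null_drift(baseline_pct=0.1, current_pct=0.2)
print(f"small change flagged: {small_change}")
```

A purely multiplicative rule flags any null at all when the baseline is 0%, so a real implementation would likely need a minimum floor as well.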

Try it

pip install dbt-dqlens
dqlens-dbt run  # profiles + generates tests in one step
dbt test --select tag:dqlens

It reads your existing profiles.yml. No new connections to configure. Works with PostgreSQL today, more databases coming.

GitHub: github.com/vahid110/dbt-dqlens

The core engine (DQLens) also works standalone if you don't use dbt. Same profiling, same detection, just a CLI instead of dbt integration.

If you've been meaning to add data quality tests but never found the time, this is the shortcut. Three commands, zero YAML writing, and you get coverage you never had.

Nice read

vahid Saber — Mon, 27 Apr 2026 08:43:53 +0000

abdu masah

Apr 21

4x Faster Redshift Reads With One Line of Python

#python #aws #dataengineering #opensource

Introducing sqlxport: Export SQL Query Results to Parquet or CSV and Upload to S3 or MinIO

vahid Saber — Wed, 04 Jun 2025 19:19:25 +0000

In today’s data pipelines, exporting data from SQL databases into flexible and efficient formats like Parquet or CSV is a frequent need — especially when integrating with tools like AWS Athena, Pandas, Spark, or Delta Lake.

That’s where sqlxport comes in.

🚀 What is sqlxport?

sqlxport is a simple, powerful CLI tool that lets you:

Run a SQL query against PostgreSQL or Redshift
Export the results as Parquet or CSV
Optionally upload the result to S3 or MinIO

It’s open source, Python-based, and available on PyPI.

🛠️ Use Cases

Export Redshift query results to S3 in a single command
Prepare Parquet files for data science in DuckDB or Pandas
Integrate your SQL results into Spark Delta Lake pipelines
Automate backups or snapshots from your production databases

✨ Key Features

✅ PostgreSQL and Redshift support
✅ Parquet and CSV output
✅ Supports partitioning
✅ MinIO and AWS S3 support
✅ CLI-friendly and scriptable
✅ MIT licensed

📦 Quickstart

pip install sqlxport

sqlxport run \
    --db-url postgresql://user:pass@host:5432/dbname \
    --query "SELECT * FROM sales" \
    --format parquet \
    --output-file sales.parquet

Want to upload it to MinIO or S3?

sqlxport run \
    ... \
    --upload-s3 \
    --s3-bucket my-bucket \
    --s3-key sales.parquet \
    --aws-access-key-id XXX \
    --aws-secret-access-key YYY

🧪 Live Demo

We provide a full end-to-end demo using:

PostgreSQL
MinIO (S3-compatible)
Apache Spark with Delta Lake
DuckDB for preview

👉 See it on GitHub

🌐 Where to Find It

📦 PyPI: sqlxport
💻 GitHub: sqlxport
🐦 Follow updates on Twitter/X

🙌 Contributions Welcome

We’re just getting started. Feel free to open issues, submit PRs, or suggest ideas for future features and integrations.