DEV Community

Shiva

Posted on

Why every data quality tool tells you what broke — but leaves you alone to figure out why

Most data quality tools describe what the error is. None of them describes why. Last year I found that out the hard way: a foreign key check on an orders table failed, and the tool's entire output was a one-line error. I opened a notebook, ran some queries, dug through pipeline logs, and eventually traced it back to a test account that had been deleted without cleaning up its associated orders. The fix took ten minutes. Finding the cause took three hours.

What bothered me wasn't that the tool missed it — it caught it. What bothered me was that the tool handed me a one-line error and expected me to do all the detective work myself.

Every DQ tool has the same blind spot

I've used Great Expectations, Soda Core, and dbt tests across different teams. They're all good tools. But they all answer the same question: did this check pass or fail?

That's genuinely useful. You know something is wrong. But knowing something is wrong is only the first step, and in my experience it's the easy step.

The hard part is what comes after the alert fires. You open a notebook and query the failing table. You look at the bad rows and figure out what they have in common. You check pipeline logs to find when the bad data arrived. You trace it upstream to find which source or transform wrote it. Then you decide what to do — fix the data, alert a source team, add a guard.
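That triage usually translates into a handful of ad-hoc queries. Here's a minimal sketch of the first two steps with stdlib sqlite3 — the schema and data are made up to mirror the demo later in the post, not taken from any real pipeline:

```python
import sqlite3

# Hypothetical minimal schema: a deleted customer leaves orphaned orders behind.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 99, 'placed'), (12, 99, 'placed');
""")

# Step 1: pull the failing rows and see what they have in common.
bad = con.execute("""
    SELECT o.* FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE c.id IS NULL
""").fetchall()

# Step 2: which referenced IDs are missing? Next you'd check whether
# they were recently deleted, and by what.
missing_ids = {row[1] for row in bad}
```

Two queries in, and you still haven't answered "why" — you've only narrowed the search.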

None of the tooling helps with any of that. You get FAILED and then you're on your own.

For the engineer who built the pipeline, this is annoying but manageable. For whoever is on call at 2am rebuilding context from scratch, it's genuinely painful.

What good diagnosis actually looks like

Here's the same foreign key failure, described the way a senior engineer would explain it to you:

Rule:    orders_customer_fk  (critical)
Table:   orders
Failed:  3 rows

Explanation:
  3 orders reference customer_id=99 which does not exist in the
  customers table. Downstream revenue attribution for these orders
  will be silently dropped in any JOIN-based report.

Likely cause:
  customer_id=99 appears to be a test account that was deleted from
  the customers table without cleaning up associated orders.

Recommended action:
  1. SELECT * FROM orders WHERE customer_id = 99
  2. Check customers table: was id=99 recently deleted?
  3. If test data: DELETE FROM orders WHERE customer_id = 99
  4. Add a cleanup job or FK constraint to prevent recurrence

Proposed SQL:
  DELETE FROM orders WHERE customer_id NOT IN (SELECT id FROM customers);

Same failure. Completely different experience. A bare FAILED stops you cold; this gives you a path forward.

The approach: rule validation + LLM diagnosis

The idea is straightforward — keep the deterministic rule engine (fast, free, reliable) and add an LLM layer that only runs when rules fail.

The LLM doesn't replace the validator. It interprets the failure. It receives the rule definition, the failing rows, the table schema, and any common causes you've documented — and produces the explanation you'd normally piece together manually.
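The exact prompt Aegis builds isn't shown here, but the shape of the context it assembles might look like this — function name, field names, and the row cap are all illustrative assumptions, not Aegis's actual API:

```python
def build_diagnosis_prompt(rule, schema, failing_rows, known_causes):
    # Cap the row sample to bound token cost; a handful of examples is
    # usually enough for the model to spot the pattern.
    rows = "\n".join(str(r) for r in failing_rows[:20])
    causes = "\n".join(f"- {c}" for c in known_causes) or "(none documented)"
    return (
        f"A data quality rule failed.\n"
        f"Rule: {rule['name']} ({rule['severity']})\n"
        f"Table schema: {schema}\n"
        f"Failing rows (sample):\n{rows}\n"
        f"Common causes documented by the team:\n{causes}\n"
        "Explain the failure, the likely root cause, and a safe remediation."
    )

prompt = build_diagnosis_prompt(
    {"name": "orders_customer_fk", "severity": "critical"},
    "orders(id, customer_id, status)",
    [(11, 99, "placed"), (12, 99, "placed")],
    ["test accounts are sometimes deleted without cascading to orders"],
)
```

The documented common causes are what turn a generic explanation into one that sounds like your team's senior engineer.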

The pipeline runs like this:

validate → classify → diagnose → root cause → SQL fix → report

Tables run in parallel. For each failure, a classifier decides whether it's worth an LLM call based on severity and failure rate. Only failures that cross the threshold get diagnosed. If everything passes, nothing is charged.
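The gating logic can be sketched in a few lines. The specific policy below (critical always diagnosed, everything else gated on a 1% failure-rate threshold) is my illustration of the idea, not Aegis's actual thresholds:

```python
def should_diagnose(severity: str, failure_rate: float,
                    rate_threshold: float = 0.01) -> bool:
    """Gate LLM calls: always diagnose critical failures; diagnose
    lower-severity ones only when the failure rate crosses a threshold."""
    if severity == "critical":
        return True
    return failure_rate >= rate_threshold

# A warning-level rule with 0.2% of rows failing stays cheap (no LLM call),
# while any critical failure gets diagnosed regardless of rate.
should_diagnose("warning", 0.002)    # no call
should_diagnose("critical", 0.0001)  # diagnosed
```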

Every LLM call is logged to a local SQLite audit trail — the exact prompt, response, cost, and latency. Nothing is a black box.
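An audit trail like that is just an insert per call. A sketch — table and column names are assumptions, and the real thing writes to a local file rather than memory:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")  # illustrative; a real audit DB lives on disk
con.execute("""CREATE TABLE IF NOT EXISTS llm_audit (
    ts REAL, rule TEXT, prompt TEXT, response TEXT,
    cost_usd REAL, latency_ms REAL)""")

def log_llm_call(rule, prompt, response, cost_usd, latency_ms):
    # One row per call: everything you need to reconstruct, re-bill,
    # or debug a diagnosis after the fact.
    con.execute("INSERT INTO llm_audit VALUES (?, ?, ?, ?, ?, ?)",
                (time.time(), rule, prompt, response, cost_usd, latency_ms))
    con.commit()

log_llm_call("orders_customer_fk", "…prompt…", "…diagnosis…", 0.0005, 820.0)
total_cost = con.execute("SELECT SUM(cost_usd) FROM llm_audit").fetchone()[0]
```

Because it's plain SQLite, cost reports are just queries.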

I built this as Aegis DQ, an open-source Python framework. Here's what a full run looks like against a demo database with intentional dirty data:

╭──────────────── Validation Summary ─────────────────╮
│  Rules checked  │  12                               │
│  Passed         │  1   │  Failed  │  11             │
│  Pass rate      │  8%  │  Cost    │  $0.005576      │
╰─────────────────────────────────────────────────────╯

LLM Diagnoses
  orders_customer_fk  →  Order placed with customer_id=99 that does not exist.
                         Likely cause: customer deleted or test record not cleaned up.
                         Action: Verify customer_id=99; check recent deletions.

  products_sku_unique →  Duplicate SKU-001 — two products share the same identifier.
                         Likely cause: duplicate import from supplier feed.

Remediation SQL
  orders_status_valid     UPDATE orders SET status = 'SHIPPED' WHERE status = 'DISPATCHED';
  products_price_positive UPDATE products SET price = ABS(price) WHERE price < 0;

11 failures, full diagnosis, proposed SQL for each fixable failure — $0.005 with Claude Haiku.

Cost is not the problem you think it is

The first question most people ask is whether running an LLM on every failure is expensive.

With --no-llm it's free — pure rule validation, no API needed. With Claude Haiku, the demo above (12 rules, 4 tables, 11 failures) costs half a cent per run. With Ollama it's $0 at any scale because everything runs locally on your machine.

The key design decision: only failed rules trigger LLM calls. In a healthy pipeline, most runs cost nothing. The LLM cost scales with your actual failure rate, not your rule count.
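A back-of-envelope model makes the scaling concrete. The per-diagnosis figure below is inferred from the demo above ($0.005576 across 11 diagnoses is roughly $0.0005 each with Haiku) — treat it as an estimate, not a price sheet:

```python
def expected_run_cost(n_rules: int, failure_rate: float,
                      cost_per_diagnosis: float = 0.0005) -> float:
    # Only failed rules trigger an LLM call, so expected cost is
    # proportional to the failure rate, not the rule count.
    return n_rules * failure_rate * cost_per_diagnosis

# 500 rules with 1% failing: well under a cent per run.
expected_run_cost(500, 0.01)  # 0.0025
# A fully healthy run costs exactly zero.
expected_run_cost(500, 0.0)   # 0.0
```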

Generating rules from your schema

Writing rules by hand for every new table always felt like unnecessary work. The schema already tells you most of what you need — which columns can't be null, what the value ranges look like, which columns should be unique.

aegis generate introspects your table and writes the first draft:

aegis generate orders --db warehouse.duckdb --output rules.yaml
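The structural half of this needs no LLM at all — the information schema already carries it. Here's a minimal sketch of the idea using sqlite3's `PRAGMA table_info` (Aegis's actual introspection and rule format may differ; the dict keys here mirror the YAML shown below but are my assumption):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    status TEXT NOT NULL)""")

def draft_rules(table: str) -> list:
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk)
    # per column -- enough to draft not-null and uniqueness rules.
    rules = []
    for _cid, name, _type, notnull, _default, pk in con.execute(
            f"PRAGMA table_info({table})"):
        if notnull or pk:
            rules.append({"rule": f"{table}_{name}_not_null",
                          "logic": {"type": "not_null", "column": name},
                          "status": "draft"})
        if pk:
            rules.append({"rule": f"{table}_{name}_unique",
                          "logic": {"type": "unique", "column": name},
                          "status": "draft"})
    return rules

rules = draft_rules("orders")
```

Value ranges and accepted-values lists need data profiling or a policy document on top, which is where the `--kb` flag below comes in.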

Pass a plain-text policy document with your business logic and it generates business validation rules alongside the structural ones:

aegis generate orders --db warehouse.duckdb \
  --kb docs/orders_policy.md \
  --output rules.yaml

A policy file like this:

- status must be one of: placed, confirmed, shipped, delivered, cancelled
- revenue must be greater than 0
- discount_pct must be between 0 and 0.5
- email must be a valid email format

Produces rules like this:

- logic:
    type: accepted_values
    values: [placed, confirmed, shipped, delivered, cancelled]

- logic:
    type: sql_expression
    expression: "revenue > 0"

- logic:
    type: between
    min_value: 0
    max_value: 0.5

- logic:
    type: regex_match
    pattern: "^[^@]+@[^@]+\\.[^@]+$"

Generated rules come out stamped status: draft — you review them, promote the ones you want to enforce to active, and commit to version control. The LLM handles the boilerplate. You handle the judgement.
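In the YAML, that lifecycle is a one-field flip (the `status` values come from the workflow above; the surrounding keys are illustrative):

```yaml
# As generated -- present in the file but not enforced:
- rule: orders_status_valid
  status: draft
  logic:
    type: accepted_values
    values: [placed, confirmed, shipped, delivered, cancelled]

# After review -- promoted and committed to version control:
- rule: orders_status_valid
  status: active
  logic:
    type: accepted_values
    values: [placed, confirmed, shipped, delivered, cancelled]
```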

Where this doesn't replace what you have

If you have an existing Great Expectations suite or dbt tests, there's no reason to replace them. Aegis has a dbt manifest parser that converts your existing tests to Aegis rules if you want to layer diagnosis on top — but that's additive, not a migration.

If you need a business-user UI or an enterprise support contract, Aegis isn't the right fit today. It's a CLI and Python framework, not a SaaS platform.


The gap between "what failed" and "why it failed" is where a lot of data engineering time quietly disappears. Tooling has gotten very good at detection. Diagnosis is still mostly manual. That feels like the wrong place to spend engineering hours.

GitHub: https://github.com/aegis-dq/aegis-dq
Install: pip install aegis-dq
Docs: https://aegis-dq.dev

Happy to answer questions in the comments — especially curious what people are using for DQ today and what pain points I might have missed.
