Abhishek Kaushik

I built a CLI data quality tool that goes beyond schema checks - here's what I learned

What SageScan does differently

SageScan is a CLI tool that runs statistical validation using a YAML config.

Instead of checking rules you define manually, it checks:
whether your data behaves like it used to.

1. Distribution drift (KS test)

Compares a column's current distribution against a saved baseline using the two-sample Kolmogorov–Smirnov test.

Catches:

  • ETL bugs
  • upstream schema changes
  • silent corruption
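The post doesn't show the engine's internals, but the idea is a few lines of scipy. A minimal sketch with synthetic data (the column values and the 1% significance threshold are illustrative, not SageScan's actual defaults):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(loc=50.0, scale=5.0, size=5_000)  # e.g. yesterday's fares
current = rng.normal(loc=55.0, scale=5.0, size=5_000)   # today's fares, quietly shifted

# Two-sample KS test: max gap between the two empirical CDFs
stat, p_value = ks_2samp(baseline, current)
drifted = p_value < 0.01  # reject "same distribution" at 1% significance
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}, drift={drifted}")
```

A schema check would pass both samples happily; the KS test flags the shift immediately.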

2. Outlier detection (Z-score + IQR)

Flags statistically abnormal rows.

Not:

"outside a fixed range"

But:

"outside what the data itself considers normal"
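Both detectors are standard and easy to sketch with pandas (synthetic data; the thresholds 3 and 1.5 are the common conventions, not necessarily SageScan's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=100.0, scale=10.0, size=1_000))
values.iloc[0] = 500.0  # inject one obviously abnormal row

# Z-score: distance from the mean in standard deviations
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR: outside 1.5x the interquartile range -- robust to the outlier itself
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

Note the difference: the fences move with the data. If the whole column shifts to a new regime, "normal" shifts with it, with no fixed range to maintain.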


3. Population Stability Index (PSI)

Used in ML pipelines for drift detection.

Quantifies:
how much a column’s distribution has shifted
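PSI isn't in scipy, but it's short to write in numpy. A sketch (the 10-bin default and the 0.1/0.25 interpretation bands are industry rules of thumb, not SageScan specifics):

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the baseline range
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_pct = np.histogram(current, bins=edges)[0] / len(current)
    b_pct = np.clip(b_pct, 1e-6, None)  # avoid log(0) on empty bins
    c_pct = np.clip(c_pct, 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))

rng = np.random.default_rng(7)
base = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(0.5, 1.0, 10_000)  # mean moved by half a standard deviation

score = psi(base, shifted)
# common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```

Because PSI is a single number per column, it's easy to threshold in CI and to track over time.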


4. Categorical drift (Chi-square test)

Detects changes in category distribution.

Example:

  • Credit card usage drops from 80% → 45%

That's not invalid data.
That's a signal.
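A sketch of that signal with scipy (the payment-type counts below are invented for illustration):

```python
from scipy.stats import chi2_contingency

# hypothetical weekly counts per payment type
baseline = {"credit_card": 8000, "cash": 1500, "dispute": 500}
current = {"credit_card": 4500, "cash": 4800, "dispute": 700}

# 2 x 3 contingency table: rows = periods, columns = categories
table = [list(baseline.values()), list(current.values())]
chi2, p_value, dof, expected = chi2_contingency(table)
drifted = p_value < 0.01
```

Every individual row here is a valid payment type, so a schema check passes. Only the change in the category mix reveals that something upstream changed.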


Architecture (the controversial part)

This is where I'd love feedback.

SageScan is:

  • Go CLI
  • Python engine

They communicate via JSON over stdin/stdout.

Why?

  • Go → fast, portable CLI (great for CI)
  • Python → pandas, scipy, rich data ecosystem

Instead of choosing one:
I used both.

Flow:

  1. Go binary parses config
  2. Sends JSON to Python
  3. Python runs checks
  4. Returns results
  5. CLI exits with CI-friendly status
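The Python side of steps 2-4 can be sketched as below. The request/response field names ("checks", "results", "exit_code") are assumptions for illustration, not SageScan's actual wire format:

```python
import json
import sys

def run_checks(request: dict) -> dict:
    """Toy engine loop: run each requested check, report pass/fail."""
    results = []
    for check in request.get("checks", []):
        # a real engine would dispatch to the KS / PSI / chi-square code here
        results.append({"name": check["name"], "passed": True})
    exit_code = 0 if all(r["passed"] for r in results) else 1
    return {"results": results, "exit_code": exit_code}

def main() -> int:
    request = json.load(sys.stdin)    # the Go CLI writes one JSON request
    response = run_checks(request)
    json.dump(response, sys.stdout)   # the engine replies on stdout
    return response["exit_code"]      # Go maps this to the process exit status
```

The nice property of this shape: the boundary is one JSON document each way, so either side can be swapped out (or tested) independently.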

Is this the “right” approach?

Honestly, I don’t know.

But:

  • It shipped
  • It works
  • It was faster than rewriting everything in one stack

Curious how others would approach this.


The AI layer (kept intentionally minimal)

There's an optional AI feature.

When a check fails:

  • Structured context is sent to an LLM
  • It returns possible root causes

Example:

"Negative fare amounts typically indicate chargebacks or voided transactions…"

Important:

  • ❌ AI does NOT replace checks
  • ✅ It only explains failures
  • ✅ It's optional
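The "structured context" part matters, so here's a sketch of what such a payload could look like. All field names and values are hypothetical; the post doesn't document the actual shape SageScan sends:

```python
# Hypothetical failure context -- the fields SageScan actually sends may differ.
failure_context = {
    "check": "outlier_detection",
    "column": "fare_amount",
    "stats": {"min": -52.0, "flagged_rows": 41, "total_rows": 120_000},
    "sample_values": [-52.0, -13.5, -4.0],
}

# The prompt carries only structured facts; the LLM never sees raw data dumps
# and never makes the pass/fail decision.
prompt = (
    "A data-quality check failed. Suggest plausible root causes.\n"
    f"Context: {failure_context}"
)
```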

What I’d do differently

If I started again:

1. Add Polars earlier
Pandas struggles with larger datasets.

2. Improve packaging
Go + Python split adds friction.

3. Build connectors sooner
Everyone asked for:

  • Postgres
  • Snowflake

CSV-first was good for shipping, but not enough.


Try it

```shell
pip install sagescan-data
sagescan validate rules.yaml
```

Looking for feedback

Would love thoughts on:

  • Go + Python architecture — good tradeoff or bad idea?
  • Are these statistical checks enough / overkill?
  • What would you add for real-world pipelines?
