Abhishek Kaushik

I built a CLI data quality tool that goes beyond schema checks - here's what I learned

What SageScan does differently

SageScan is a CLI tool that runs statistical validation using a YAML config.

Instead of checking rules you define manually, it checks:
whether your data behaves like it used to.

1. Distribution drift (KS test)

Compares a column's current distribution against a saved baseline using the two-sample Kolmogorov–Smirnov test.

Catches:

  • ETL bugs
  • upstream schema changes
  • silent corruption
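The post doesn't show the engine's internals, but the idea is a few lines of scipy. A minimal sketch with synthetic data (the column values and the 1% significance threshold are illustrative, not SageScan's actual defaults):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(loc=50.0, scale=5.0, size=5_000)  # e.g. yesterday's fares
current = rng.normal(loc=55.0, scale=5.0, size=5_000)   # today's fares, quietly shifted

# Two-sample KS test: max gap between the two empirical CDFs
stat, p_value = ks_2samp(baseline, current)
drifted = p_value < 0.01  # reject "same distribution" at 1% significance
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}, drift={drifted}")
```

A schema check would pass both samples happily; the KS test flags the shift immediately.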

2. Outlier detection (Z-score + IQR)

Flags statistically abnormal rows.

Not:

"outside a fixed range"

But:

"outside what the data itself considers normal"
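Both detectors are standard and easy to sketch with pandas (synthetic data; the thresholds 3 and 1.5 are the common conventions, not necessarily SageScan's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=100.0, scale=10.0, size=1_000))
values.iloc[0] = 500.0  # inject one obviously abnormal row

# Z-score: distance from the mean in standard deviations
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR: outside 1.5x the interquartile range -- robust to the outlier itself
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

Note the difference: the fences move with the data. If the whole column shifts to a new regime, "normal" shifts with it, with no fixed range to maintain.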


3. Population Stability Index (PSI)

Used in ML pipelines for drift detection.

Quantifies:
how much a column’s distribution has shifted
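PSI isn't in scipy, but it's short to write in numpy. A sketch (the 10-bin default and the 0.1/0.25 interpretation bands are industry rules of thumb, not SageScan specifics):

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the baseline range
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_pct = np.histogram(current, bins=edges)[0] / len(current)
    b_pct = np.clip(b_pct, 1e-6, None)  # avoid log(0) on empty bins
    c_pct = np.clip(c_pct, 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))

rng = np.random.default_rng(7)
base = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(0.5, 1.0, 10_000)  # mean moved by half a standard deviation

score = psi(base, shifted)
# common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```

Because PSI is a single number per column, it's easy to threshold in CI and to track over time.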


4. Categorical drift (Chi-square test)

Detects changes in category distribution.

Example:

  • Credit card usage drops from 80% → 45%

That's not invalid data.
That's a signal.
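A sketch of that signal with scipy (the payment-type counts below are invented for illustration):

```python
from scipy.stats import chi2_contingency

# hypothetical weekly counts per payment type
baseline = {"credit_card": 8000, "cash": 1500, "dispute": 500}
current = {"credit_card": 4500, "cash": 4800, "dispute": 700}

# 2 x 3 contingency table: rows = periods, columns = categories
table = [list(baseline.values()), list(current.values())]
chi2, p_value, dof, expected = chi2_contingency(table)
drifted = p_value < 0.01
```

Every individual row here is a valid payment type, so a schema check passes. Only the change in the category mix reveals that something upstream changed.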


Architecture (the controversial part)

This is where I'd love feedback.

SageScan is:

  • Go CLI
  • Python engine

They communicate via JSON over stdin/stdout.

Why?

  • Go → fast, portable CLI (great for CI)
  • Python → pandas, scipy, rich data ecosystem

Instead of choosing one:
I used both.

Flow:

  1. Go binary parses config
  2. Sends JSON to Python
  3. Python runs checks
  4. Returns results
  5. CLI exits with CI-friendly status
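The Python side of steps 2-4 can be sketched as below. The request/response field names ("checks", "results", "exit_code") are assumptions for illustration, not SageScan's actual wire format:

```python
import json
import sys

def run_checks(request: dict) -> dict:
    """Toy engine loop: run each requested check, report pass/fail."""
    results = []
    for check in request.get("checks", []):
        # a real engine would dispatch to the KS / PSI / chi-square code here
        results.append({"name": check["name"], "passed": True})
    exit_code = 0 if all(r["passed"] for r in results) else 1
    return {"results": results, "exit_code": exit_code}

def main() -> int:
    request = json.load(sys.stdin)    # the Go CLI writes one JSON request
    response = run_checks(request)
    json.dump(response, sys.stdout)   # the engine replies on stdout
    return response["exit_code"]      # Go maps this to the process exit status
```

The nice property of this shape: the boundary is one JSON document each way, so either side can be swapped out (or tested) independently.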

Is this the “right” approach?

Honestly, I don’t know.

But:

  • It shipped
  • It works
  • It was faster than rewriting everything in one stack

Curious how others would approach this.


The AI layer (kept intentionally minimal)

There's an optional AI feature.

When a check fails:

  • Structured context is sent to an LLM
  • It returns possible root causes

Example:

"Negative fare amounts typically indicate chargebacks or voided transactions…"

Important:

  • ❌ AI does NOT replace checks
  • ✅ It only explains failures
  • ✅ It's optional
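The "structured context" part matters, so here's a sketch of what such a payload could look like. All field names and values are hypothetical; the post doesn't document the actual shape SageScan sends:

```python
# Hypothetical failure context -- the fields SageScan actually sends may differ.
failure_context = {
    "check": "outlier_detection",
    "column": "fare_amount",
    "stats": {"min": -52.0, "flagged_rows": 41, "total_rows": 120_000},
    "sample_values": [-52.0, -13.5, -4.0],
}

# The prompt carries only structured facts; the LLM never sees raw data dumps
# and never makes the pass/fail decision.
prompt = (
    "A data-quality check failed. Suggest plausible root causes.\n"
    f"Context: {failure_context}"
)
```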

What I’d do differently

If I started again:

1. Add Polars earlier
Pandas struggles with larger datasets.

2. Improve packaging
Go + Python split adds friction.

3. Build connectors sooner
Everyone asked for:

  • Postgres
  • Snowflake

CSV-first was good for shipping, but not enough.


Try it

```shell
pip install sagescan-data
sagescan validate rules.yaml
```

Looking for feedback

Would love thoughts on:

  • Go + Python architecture — good tradeoff or bad idea?
  • Are these statistical checks enough / overkill?
  • What would you add for real-world pipelines?
