What SageScan does differently
SageScan is a CLI tool that runs statistical validation using a YAML config.
Instead of checking rules you define manually, it checks:
whether your data behaves like it used to.
1. Distribution drift (KS test)
Compares current vs baseline distribution.
Catches:
- ETL bugs
- upstream schema changes
- silent corruption
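To make this concrete, here's a minimal sketch of what a KS-based drift check looks like with scipy. This is my illustration of the idea, not SageScan's actual code; the function name and `alpha` threshold are my own:

```python
import numpy as np
from scipy import stats

def ks_drift(baseline, current, alpha=0.05):
    """Flag drift when a two-sample KS test rejects 'same distribution'."""
    statistic, p_value = stats.ks_2samp(baseline, current)
    return p_value < alpha

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 5000)
ks_drift(x, x)                       # identical samples: no drift (p = 1.0)
ks_drift(x, rng.normal(2, 1, 5000))  # mean shifted by 2 sigma: drift
```

The nice property: no hand-written thresholds on the column's values, only a significance level.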
2. Outlier detection (Z-score + IQR)
Flags statistically abnormal rows.
Not:
"outside a fixed range"
But:
"outside what the data itself considers normal"
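A rough sketch of what "the data itself" deciding normality means in practice, combining both methods (again my illustration; thresholds 3.0 and 1.5 are the conventional defaults, not necessarily SageScan's):

```python
import numpy as np

def outlier_mask(values, z_thresh=3.0, iqr_k=1.5):
    """Flag values abnormal relative to the data itself (Z-score OR IQR fence)."""
    values = np.asarray(values, dtype=float)
    z = np.abs((values - values.mean()) / values.std())
    q1, q3 = np.percentile(values, [25, 75])
    fence_lo = q1 - iqr_k * (q3 - q1)
    fence_hi = q3 + iqr_k * (q3 - q1)
    return (z > z_thresh) | (values < fence_lo) | (values > fence_hi)

data = np.array([10, 11, 9, 10, 12, 11, 10, 500])
# 500 inflates the mean and std so much that its own Z-score stays under 3,
# but the IQR fence still catches it — one reason to combine both methods.
outlier_mask(data)
```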
3. Population Stability Index (PSI)
Used in ML pipelines for drift detection.
Quantifies:
how much a column’s distribution has shifted
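One common PSI formulation, sketched below: bin the baseline into quantiles, then compare bin fractions. A widely used rule of thumb reads PSI < 0.1 as stable, 0.1–0.25 as moderate shift, and > 0.25 as significant shift. Bin count, the clipping epsilon, and the continuous-data assumption are my choices here, not SageScan's:

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index with quantile bins taken from the baseline.
    Assumes continuous data (distinct quantile edges)."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    current = np.clip(current, edges[0], edges[-1])  # fold out-of-range values into edge bins
    b_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_frac = np.histogram(current, bins=edges)[0] / len(current)
    b_frac = np.clip(b_frac, 1e-6, None)  # avoid log(0) on empty bins
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

rng = np.random.default_rng(7)
base = rng.normal(0, 1, 10_000)
psi(base, base)                       # ~0: no shift
psi(base, rng.normal(1, 1, 10_000))  # well above 0.25: clear shift
```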
4. Categorical drift (Chi-square test)
Detects changes in category distribution.
Example:
- Credit card usage drops from 80% → 45%
That's not invalid data.
That's a signal.
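Sketching that exact example with scipy's chi-square test on a 2×N contingency table of category counts (counts are made up to match the 80% → 45% story):

```python
from scipy.stats import chi2_contingency

# Hypothetical payment-method counts per batch.
baseline_counts = [800, 150, 50]   # credit, debit, cash → 80% credit
current_counts = [450, 350, 200]   # credit drops to 45%

chi2, p_value, dof, _ = chi2_contingency([baseline_counts, current_counts])
drifted = p_value < 0.05  # the data is still "valid", but the mix changed
```

Every row still validates individually; only the distribution test sees the story change.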
Architecture (the controversial part)
This is where I'd love feedback.
SageScan is:
- Go CLI
- Python engine
They communicate via JSON over stdin/stdout.
Why?
- Go → fast, portable CLI (great for CI)
- Python → pandas, scipy, rich data ecosystem
Instead of choosing one:
I used both.
Flow:
- Go binary parses config
- Sends JSON to Python
- Python runs checks
- Returns results
- CLI exits with CI-friendly status
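The Python half of that flow is essentially "read one JSON job, write one JSON reply". Here's a toy sketch of what that contract could look like; the field names (`checks`, `results`, `failed`) and the placeholder dispatch are mine, not SageScan's actual protocol:

```python
import io
import json
import sys

def run_checks(job):
    # Placeholder dispatch: the real engine would run KS/PSI/chi-square here.
    results = [{"name": c["name"], "passed": True} for c in job.get("checks", [])]
    return {"results": results, "failed": sum(not r["passed"] for r in results)}

def main(stdin=sys.stdin, stdout=sys.stdout):
    # The Go binary writes one JSON job to our stdin and reads one JSON reply,
    # then maps the failure count to a CI-friendly exit code on its side.
    json.dump(run_checks(json.load(stdin)), stdout)

# Simulate the Go side with in-memory streams:
reply = io.StringIO()
main(io.StringIO('{"checks": [{"name": "ks_test"}, {"name": "psi"}]}'), reply)
```

The upside of a line-oriented JSON contract: either side can be swapped out or tested in isolation.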
Is this the “right” approach?
Honestly, I don’t know.
But:
- It shipped
- It works
- It was faster than rewriting everything in one stack
Curious how others would approach this.
The AI layer (kept intentionally minimal)
There's an optional AI feature.
When a check fails:
- Structured context is sent to an LLM
- It returns possible root causes
Example:
"Negative fare amounts typically indicate chargebacks or voided transactions…"
Important:
- ❌ AI does NOT replace checks
- ✅ It only explains failures
- ✅ It's optional
What I’d do differently
If I started again:
1. Add Polars earlier
Pandas struggles with larger datasets.
2. Improve packaging
Go + Python split adds friction.
3. Build connectors sooner
Everyone asked for:
- Postgres
- Snowflake
CSV-first was good for shipping, but not enough.
Try it
pip install sagescan-data
sagescan validate rules.yaml
Looking for feedback
Would love thoughts on:
- Go + Python architecture — good tradeoff or bad idea?
- Are these statistical checks enough / overkill?
- What would you add for real-world pipelines?