DEV Community

Vignesh

How I built a data quality API that runs at the edge in milliseconds

Bad data is quiet. That's the problem.

Your pipeline doesn't crash. Your tests pass. Three weeks later someone notices the revenue dashboard is wrong, you trace it back, and find that one column started arriving as strings six weeks ago. The ETL swallowed it. The warehouse stored it. Everything looked fine.
I've debugged enough of these to know the pattern. The fix is always obvious in retrospect — validate the data before it enters the system. The hard part is actually doing it without adding another tool, another config file, another YAML-driven framework to maintain.
So I built DataScreenIQ — a data quality API. You POST rows, you get a verdict back. No setup. No infrastructure. One call.

curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "X-API-Key: dsiq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "source": "orders",
    "rows": [
      {"order_id": "ORD-001", "amount": 99.50, "email": "alice@corp.com"},
      {"order_id": "ORD-002", "amount": "broken", "email": null},
      {"order_id": "ORD-003", "amount": 75.00,   "email": null}
    ]
  }'


Response

{
  "status": "BLOCK",
  "health_score": 0.34,
  "issues": {
    "type_mismatches": ["amount"],
    "null_rates": {"email": 0.67}
  },
  "drift": [
    {
      "kind": "type_changed",
      "field": "amount",
      "detail": "Field changed type from number to mixed"
    }
  ],
  "latency_ms": 38
}

Three verdicts: PASS, WARN, or BLOCK. Your pipeline decides what to do with it.
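Because the verdict is just a string, the gating logic stays trivial. Here's a minimal sketch of a three-way router; the function name and the action labels are mine, not part of the SDK:

```python
def route(report: dict) -> str:
    # Map a screening verdict to a pipeline action.
    # Unknown or missing verdicts fail closed.
    status = report.get("status")
    if status == "PASS":
        return "load"
    if status == "WARN":
        return "load_and_alert"
    return "dead_letter"
```

Failing closed on a missing verdict is deliberate: if the gate can't tell you the data is good, treat it as bad.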

The 18 checks it runs

Every payload gets 18 checks; the highlights:

  1. Schema fingerprinting (SHA-256 hash of field structure)
  2. Null rate per column
  3. Type stability (what % of values match the expected type)
  4. Empty string rate
  5. Duplicate detection
  6. IQR outlier detection on numeric columns
  7. HyperLogLog approximate distinct counts
  8. Enum cardinality tracking (new values appearing)
  9. Row count anomaly detection
  10. Schema drift against the established baseline
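To make the list concrete, here's roughly what check #6 looks like in plain Python. This is my own sketch of Tukey's IQR fences, not the service's actual code:

```python
from statistics import quantiles

def iqr_outliers(values: list[float]) -> list[float]:
    # Tukey's fences: flag values outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```

The appeal of IQR over z-scores is that the fences themselves aren't dragged around by the outliers they're trying to catch.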

The drift detection is the interesting bit. On first run with a new source, it builds a baseline — field types, null rates, schema fingerprint. On every subsequent run it compares the incoming data against that baseline. If your amount field was numeric for six weeks and suddenly 40% of values are strings, that fires a BLOCK.
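That comparison reduces to two cheap operations: hash the shape, diff the stats. A sketch of the fingerprint half, assuming (per the check list above) a SHA-256 hash over the field structure; the exact serialization is my guess:

```python
import hashlib
import json

def schema_fingerprint(row: dict) -> str:
    # Hash the sorted (field, type) pairs; values don't
    # matter, only the shape does.
    shape = sorted((k, type(v).__name__) for k, v in row.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()
```

Two rows with the same fields and types produce the same hash regardless of key order, so checking for schema drift is a string comparison against the stored baseline.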

The Python SDK

pip install datascreeniq

import datascreeniq as dsiq
from datascreeniq.exceptions import DataQualityError

client = dsiq.Client("dsiq_live_...")

# Screen a list of dicts
report = client.screen(rows, source="orders")
print(report.status)       # PASS / WARN / BLOCK
print(report.health_pct)   # 34.0%

# Raise on block — useful as a pipeline gate
try:
    client.screen(rows, source="orders").raise_on_block()
    load_to_warehouse(rows)
except DataQualityError as e:
    print(f"Blocked: {e.report.issues}")
    send_to_dead_letter_queue(rows)

Airflow integration

from airflow.decorators import dag, task
import datascreeniq as dsiq

@task
def quality_gate(rows: list, source: str) -> dict:
    client = dsiq.Client()  # reads DATASCREENIQ_API_KEY from env
    report = client.screen(rows, source=source)
    if report.is_blocked:
        raise ValueError(f"Data blocked: {report.summary()}")
    return report.to_dict()

@dag
def my_pipeline():
    raw = extract()
    gate = quality_gate(raw, source="orders")
    gate >> load(raw)  # load only runs if the gate passes

my_pipeline()

Screening CSV files directly
The API accepts raw CSV — no conversion needed:

curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "Content-Type: text/csv" \
  -H "X-API-Key: dsiq_live_..." \
  -H "X-Source: orders" \
  --data-binary @orders.csv

The SDK handles CSV, Excel, JSON and XML files too:

report = client.screen_file("orders.csv", source="orders")
report = client.screen_file("orders.xlsx", source="orders")

# pandas DataFrame
import pandas as pd
df = pd.read_csv("orders.csv")
report = client.screen_dataframe(df, source="orders")


Large files get chunked automatically — 10K rows per request, merged into one report.
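The SDK does this for you, but conceptually it's just slicing plus a worst-verdict merge. A sketch of that idea (mine, not the SDK internals):

```python
SEVERITY = {"PASS": 0, "WARN": 1, "BLOCK": 2}

def chunked(rows: list, size: int = 10_000):
    # Yield consecutive slices of at most `size` rows.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def merge_statuses(statuses: list[str]) -> str:
    # The merged report takes the worst chunk verdict:
    # one BLOCKed chunk blocks the whole file.
    return max(statuses, key=SEVERITY.get)
```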

Why the edge

The API runs on Cloudflare Workers (V8 isolates deployed globally). A few things fall out of this naturally:

  - Raw payload data never touches persistent storage. The Worker reads the rows, computes the statistics, returns the response, and the runtime ends. No database write of your actual data, just the aggregated metrics (null rates, type distributions, schema hashes).
  - Latency is 30-50ms end to end. The compute itself is under 10ms; the rest is network round trip.
  - It scales to zero and to high throughput without any configuration on your end.

Resetting baselines

If you've fixed your pipeline and want to start fresh:

client.reset_baseline("orders")

Or via curl:

curl -X DELETE https://api.datascreeniq.com/v1/schema/orders \
  -H "X-API-Key: dsiq_live_..."

The next screen call builds a new baseline from scratch.

Slack alerts

You can configure a Slack incoming webhook in the dashboard — any BLOCK or WARN verdict fires an alert to your channel automatically. Useful if you're running scheduled pipelines and want to know when something breaks without polling the API.
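If you'd rather wire the alert yourself (say, from the dead-letter branch of your pipeline), a Slack incoming webhook only needs a JSON `{"text": ...}` POST. A rough sketch; the report fields mirror the API response above, and the webhook URL is whatever Slack generated for your channel:

```python
import json
from urllib import request

def alert_payload(report: dict) -> dict:
    # Build a Slack incoming-webhook message from a screening report.
    issues = ", ".join(report.get("issues", {})) or "none"
    return {"text": (f"{report['status']} on "
                     f"'{report.get('source', 'unknown')}' "
                     f"(health {report['health_score']:.0%}); "
                     f"issues: {issues}")}

def send_alert(webhook_url: str, report: dict) -> None:
    # POST the payload to the configured webhook.
    req = request.Request(
        webhook_url,
        data=json.dumps(alert_payload(report)).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```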

What it doesn't replace
  - Great Expectations, if you need declarative test suites with dozens of custom expectations.
  - Monte Carlo, if you need table-level monitoring across your entire warehouse with lineage tracking.
  - dbt tests, if your checks live in the transformation layer.
It's a different thing — a lightweight gate you drop in front of a data ingestion point. One call, synchronous, returns immediately. Your pipeline blocks or passes based on the verdict.

Try it
Free tier: 500K rows/month, no credit card.

# Get an API key
# datascreeniq.com

pip install datascreeniq

python3 -c "
import datascreeniq as dsiq
client = dsiq.Client('your_key')
print(client.health())
"

GitHub:
