DEV Community

Vignesh

How I built a data quality API that runs at the edge in milliseconds

Bad data is quiet. That's the problem.

Your pipeline doesn't crash. Your tests pass. Three weeks later someone notices the revenue dashboard is wrong, you trace it back, and find that one column started arriving as strings six weeks ago. The ETL swallowed it. The warehouse stored it. Everything looked fine.
I've debugged enough of these to know the pattern. The fix is always obvious in retrospect — validate the data before it enters the system. The hard part is actually doing it without adding another tool, another config file, another YAML-driven framework to maintain.
So I built DataScreenIQ — a data quality API. You POST rows, you get a verdict back. No setup. No infrastructure. One call.

curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "X-API-Key: dsiq_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "source": "orders",
    "rows": [
      {"order_id": "ORD-001", "amount": 99.50, "email": "alice@corp.com"},
      {"order_id": "ORD-002", "amount": "broken", "email": null},
      {"order_id": "ORD-003", "amount": 75.00,   "email": null}
    ]
  }'


Response

{
  "status": "BLOCK",
  "health_score": 0.34,
  "issues": {
    "type_mismatches": ["amount"],
    "null_rates": {"email": 0.67}
  },
  "drift": [
    {
      "kind": "type_changed",
      "field": "amount",
      "detail": "Field changed type from number to mixed"
    }
  ],
  "latency_ms": 38
}

Three verdicts: PASS, WARN, or BLOCK. Your pipeline decides what to do with it.
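Because the verdict is just a string, the gating logic stays trivial. Here's a minimal sketch of a three-way router; the function name and the action labels are mine, not part of the SDK:

```python
def route(report: dict) -> str:
    # Map a screening verdict to a pipeline action.
    # Unknown or missing verdicts fail closed.
    status = report.get("status")
    if status == "PASS":
        return "load"
    if status == "WARN":
        return "load_and_alert"
    return "dead_letter"
```

Failing closed on a missing verdict is deliberate: if the gate can't tell you the data is good, treat it as bad.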

The 18 checks it runs

Every payload gets 18 checks; the highlights:

  1. Schema fingerprinting (SHA-256 hash of field structure)
  2. Null rate per column
  3. Type stability (what % of values match the expected type)
  4. Empty string rate
  5. Duplicate detection
  6. IQR outlier detection on numeric columns
  7. HyperLogLog approximate distinct counts
  8. Enum cardinality tracking (new values appearing)
  9. Row count anomaly detection
  10. Schema drift against the established baseline
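To make the list concrete, here's roughly what check #6 looks like in plain Python. This is my own sketch of Tukey's IQR fences, not the service's actual code:

```python
from statistics import quantiles

def iqr_outliers(values: list[float]) -> list[float]:
    # Tukey's fences: flag values outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```

The appeal of IQR over z-scores is that the fences themselves aren't dragged around by the outliers they're trying to catch.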

The drift detection is the interesting bit. On first run with a new source, it builds a baseline — field types, null rates, schema fingerprint. On every subsequent run it compares the incoming data against that baseline. If your amount field was numeric for six weeks and suddenly 40% of values are strings, that fires a BLOCK.
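That comparison reduces to two cheap operations: hash the shape, diff the stats. A sketch of the fingerprint half, assuming (per the check list above) a SHA-256 hash over the field structure; the exact serialization is my guess:

```python
import hashlib
import json

def schema_fingerprint(row: dict) -> str:
    # Hash the sorted (field, type) pairs; values don't
    # matter, only the shape does.
    shape = sorted((k, type(v).__name__) for k, v in row.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()
```

Two rows with the same fields and types produce the same hash regardless of key order, so checking for schema drift is a string comparison against the stored baseline.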

The Python SDK

pip install datascreeniq

import datascreeniq as dsiq
from datascreeniq.exceptions import DataQualityError

client = dsiq.Client("dsiq_live_...")

# Screen a list of dicts
report = client.screen(rows, source="orders")
print(report.status)       # PASS / WARN / BLOCK
print(report.health_pct)   # 34.0%

# Raise on block — useful as a pipeline gate
try:
    client.screen(rows, source="orders").raise_on_block()
    load_to_warehouse(rows)
except DataQualityError as e:
    print(f"Blocked: {e.report.issues}")
    send_to_dead_letter_queue(rows)

Airflow integration

from airflow.decorators import dag, task
import datascreeniq as dsiq

@task
def quality_gate(rows: list, source: str) -> dict:
    client = dsiq.Client()  # reads DATASCREENIQ_API_KEY from env
    report = client.screen(rows, source=source)
    if report.is_blocked:
        raise ValueError(f"Data blocked: {report.summary()}")
    return report.to_dict()

@dag
def my_pipeline():
    raw = extract()
    gate = quality_gate(raw, source="orders")
    gate >> load(raw)  # load only runs if the gate passes

my_pipeline()

Screening CSV files directly
The API accepts raw CSV — no conversion needed:

curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "Content-Type: text/csv" \
  -H "X-API-Key: dsiq_live_..." \
  -H "X-Source: orders" \
  --data-binary @orders.csv

The SDK handles CSV, Excel, JSON and XML files too:

report = client.screen_file("orders.csv", source="orders")
report = client.screen_file("orders.xlsx", source="orders")

# pandas DataFrame
import pandas as pd
df = pd.read_csv("orders.csv")
report = client.screen_dataframe(df, source="orders")


Large files get chunked automatically — 10K rows per request, merged into one report.
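The SDK does this for you, but conceptually it's just slicing plus a worst-verdict merge. A sketch of that idea (mine, not the SDK internals):

```python
SEVERITY = {"PASS": 0, "WARN": 1, "BLOCK": 2}

def chunked(rows: list, size: int = 10_000):
    # Yield consecutive slices of at most `size` rows.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def merge_statuses(statuses: list[str]) -> str:
    # The merged report takes the worst chunk verdict:
    # one BLOCKed chunk blocks the whole file.
    return max(statuses, key=SEVERITY.get)
```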

Why the edge

The API runs on Cloudflare Workers (V8 isolates deployed globally). A few things fall out of this naturally:

  - Raw payload data never touches persistent storage. The Worker reads the rows, computes the statistics, returns the response, and the runtime ends. No database write of your actual data, just the aggregated metrics (null rates, type distributions, schema hashes).
  - Latency is 30-50ms end to end. The compute itself is under 10ms; the rest is network round trip.
  - It scales to zero and to high throughput without any configuration on your end.

Resetting baselines

If you've fixed your pipeline and want to start fresh:

client.reset_baseline("orders")

Or via curl:

curl -X DELETE https://api.datascreeniq.com/v1/schema/orders \
  -H "X-API-Key: dsiq_live_..."

The next screen call builds a new baseline from scratch.

Slack alerts

You can configure a Slack incoming webhook in the dashboard — any BLOCK or WARN verdict fires an alert to your channel automatically. Useful if you're running scheduled pipelines and want to know when something breaks without polling the API.
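If you'd rather wire the alert yourself (say, from the dead-letter branch of your pipeline), a Slack incoming webhook only needs a JSON `{"text": ...}` POST. A rough sketch; the report fields mirror the API response above, and the webhook URL is whatever Slack generated for your channel:

```python
import json
from urllib import request

def alert_payload(report: dict) -> dict:
    # Build a Slack incoming-webhook message from a screening report.
    issues = ", ".join(report.get("issues", {})) or "none"
    return {"text": (f"{report['status']} on "
                     f"'{report.get('source', 'unknown')}' "
                     f"(health {report['health_score']:.0%}); "
                     f"issues: {issues}")}

def send_alert(webhook_url: str, report: dict) -> None:
    # POST the payload to the configured webhook.
    req = request.Request(
        webhook_url,
        data=json.dumps(alert_payload(report)).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```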

What it doesn't replace
  - Great Expectations, if you need declarative test suites with dozens of custom expectations.
  - Monte Carlo, if you need table-level monitoring across your entire warehouse with lineage tracking.
  - dbt tests, if your checks live in the transformation layer.
It's a different thing — a lightweight gate you drop in front of a data ingestion point. One call, synchronous, returns immediately. Your pipeline blocks or passes based on the verdict.

Try it
Free tier: 500K rows/month, no credit card.

# Get an API key
# datascreeniq.com

pip install datascreeniq

python3 -c "
import datascreeniq as dsiq
client = dsiq.Client('your_key')
print(client.health())
"

GitHub:
