Data observability platforms are the layer between "we test the data" and "we trust the data," and the four names every buyer hears in 2026 — Monte Carlo, Anomalo, Bigeye, and Lightup — represent four very different bets on what that layer should look like. The category started as a marketing rebrand of "data quality monitoring" in 2020 and has since hardened into a real product surface: continuous freshness, volume, schema, distribution, and lineage signals piped into the same incident channels as the rest of your on-call rotation. The buying decision is no longer "should we have a platform" — it is "which of these four shapes fits the warehouse, the team size, and the security review."
This guide is the side-by-side breakdown you wished existed before the demo cycle started. It walks through what data observability actually means in 2026 (versus a Slack channel of broken-dashboard tickets), the five pillars that every vendor scores against, the four-vendor matrix where Monte Carlo competes with Anomalo, Bigeye, and Lightup on different axes, the detection-type landscape (rule-based, ml-based anomaly detection, metadata-only pushdown), and the six-stage incident detection lifecycle that decides whether the platform pays for itself. Each section pairs a teaching block with a worked answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works on the production stack.
When you want hands-on reps on the SQL primitives every observability platform compiles down to, drill the data validation practice library →, rehearse on aggregation problems →, and stack the diagnosis muscles with conditional logic drills →.
On this page
- What "data observability" actually means in 2026
- The five pillars of data observability
- The four vendors — Monte Carlo vs Anomalo vs Bigeye vs Lightup
- Detection types — rule-based, ML-based, and metadata-only monitors
- Incident lifecycle and on-call workflow
- Cheat sheet — observability platform recipes
- Frequently asked questions
- Practice on PipeCode
1. What "data observability" actually means in 2026
Data observability is the shift from "we test the data" to "we monitor the data" — a continuous, lineage-aware view of warehouse health, not a one-off CI check
The one-sentence invariant: data observability is the continuous, automated measurement of freshness, volume, schema, distribution, and lineage signals across every table in the warehouse, with incidents routed to on-call and tied back to the upstream cause via dependency graphs. Once you internalise that "tests are point-in-time assertions, observability is a vital-signs monitor," the entire buyer journey for Monte Carlo, Anomalo, Bigeye, and Lightup becomes a question of what gets measured and who hears the alert.
The three layers most teams confuse.
-
DQ tests (assertions). dbt's
not_null,unique,accepted_values, plus custom SQL tests. They run inside the build, they fail fast on the offending model, and they catch known bad shapes. Limitation: they only test what you already thought to write, and they say nothing about volume drift or freshness SLAs unless you wire it yourself. -
Anomaly detection. A library (
pydeequ,great_expectations, in-house Z-score scripts) that learns a baseline and flags outliers. Catches unknown bad shapes. Limitation: it is a model, not a workflow — alerts go nowhere, lineage is missing, and you still have to glue it to PagerDuty by hand. - Observability platform. A managed product that ships all three pillars — continuous monitoring, ML-based anomaly detection, lineage-aware incident routing — out of the box. Catches both known and unknown bad shapes, ties them to a graph, and sends them to the same Slack channel where your VP already lives.
Why Slack-noticing-a-broken-dashboard is the wrong incident channel.
- Mean time to detection (MTTD) is "however long until someone looks." For some dashboards that is 11 minutes; for a finance close report it can be five days. Either way the data team is on the back foot.
- Mean time to resolution (MTTR) doubles or triples when the upstream cause has already happened twice more by the time it is noticed. The platform's job is to compress both numbers and to prove they are compressed with a dashboard the VP can read.
- The cost of a silent metric drift is rarely felt by the data team — it is felt by sales ops, finance, or growth, and it shows up as "I no longer trust the data team's numbers." That is an organisational tax larger than the platform cost.
The buyer journey — what every CDO walks through.
- Pilot. 30 days, three critical tables, no procurement yet. Score on time-to-first-detected-incident.
- Value. Document MTTD / MTTR before and after. Pull two real incidents the platform caught before anyone else noticed.
- ROI. Translate the MTTD compression into "engineer hours saved per quarter" and "downstream rework avoided." A six-figure tool is easy to justify when it is replacing a half-FTE of firefighting.
- Security review. Network egress posture, SSO, data residency, sub-processors. Lightup wins this slide; Monte Carlo has the longest paper trail; Anomalo and Bigeye sit in between.
Build vs buy economics at different team sizes.
-
1–5 engineers. Build. Start with
dbt test+ Elementary (OSS) + a Slack webhook. The platform vendors are too expensive and the surface area is still small. - 6–25 engineers. Borderline. If the data product has external SLA obligations, buy. If it is internal-only and the team is hungry for the engineering, build a thin layer on top of OSS.
- 26+ engineers. Buy. The marginal engineering hour spent on observability infrastructure is worth more on the data product itself, and the bigger the lineage graph the harder the home-grown system gets to maintain.
- Heavily regulated stacks (healthcare, finance). Buy a security-friendly option (Lightup pushdown, or Monte Carlo with VPC peering). The compliance paperwork is half the value.
Worked example — the four signals a vendor pilot must hit on Day 1
Detailed explanation. A senior interview probe — and a practical pilot checklist — is "what are the four signals an observability platform must emit by end of day one to be worth a quarter of evaluation?" The answer surfaces whether the candidate has actually pushed a platform into production or has only read the marketing site.
Question. A team has imported their warehouse credentials into a vendor sandbox. List the four signals the platform must produce in the first 24 hours, the SQL or metadata source for each, and the alert payload shape that proves the integration works.
Input.
| Signal | What it measures | Source | Pass criterion |
|---|---|---|---|
| Freshness | Time since last write |
INFORMATION_SCHEMA.TABLES last_altered |
Alert fires when stale > SLA |
| Volume | Row count drift | Daily COUNT(*) snapshot |
Alert fires on > 3σ delta |
| Schema | Column add / type change |
INFORMATION_SCHEMA.COLUMNS diff |
Alert fires on new / removed col |
| Distribution | Null % drift on critical col | COUNT(col) / COUNT(*) |
Alert fires on > 5pp drift |
Code.
-- Freshness probe — runs every 30 min, alerts when > SLA
SELECT
table_schema,
table_name,
DATEDIFF('minute', last_altered, CURRENT_TIMESTAMP) AS mins_since_write
FROM information_schema.tables
WHERE table_schema = 'analytics'
AND DATEDIFF('minute', last_altered, CURRENT_TIMESTAMP) > 60;
-- Volume drift probe — compares today's COUNT to 7-day baseline
WITH daily AS (
SELECT
DATE_TRUNC('day', loaded_at) AS d,
COUNT(*) AS rows_today
FROM analytics.orders
WHERE loaded_at >= CURRENT_DATE - 8
GROUP BY 1
),
stats AS (
SELECT
AVG(rows_today) AS mu,
STDDEV(rows_today) AS sigma
FROM daily
WHERE d < CURRENT_DATE
)
SELECT
d, rows_today, mu, sigma,
(rows_today - mu) / NULLIF(sigma, 0) AS z_score
FROM daily, stats
WHERE d = CURRENT_DATE
AND ABS((rows_today - mu) / NULLIF(sigma, 0)) > 3;
Step-by-step explanation.
- The freshness probe reads warehouse metadata (
INFORMATION_SCHEMA.TABLES), not the table data — so it costs nothing on a per-row basis. The platform schedules it every 30 minutes and compareslast_alteredagainst a configured SLA. - The volume probe builds a 7-day rolling baseline of
COUNT(*)per day and computes a z-score for today's count against the mean and standard deviation of the baseline.NULLIF(sigma, 0)protects against a constant baseline (which would otherwise raise division-by-zero). - The schema probe diffs
INFORMATION_SCHEMA.COLUMNSsnapshots. A new column means "schema drift"; a removed column means "breaking change." - The distribution probe samples a
COUNT(col) / COUNT(*)ratio on a critical column and compares against a tracked baseline. A 5-percentage-point jump in null rate is a classic upstream-ETL regression signal. - The alert payload shape is JSON with
{table, signal, current, expected, z_score, owner, lineage_upstream, lineage_downstream}— those eight fields are the minimum needed for an on-call engineer to triage without opening the warehouse UI.
Output.
| Hour | Signal | Status | Notes |
|---|---|---|---|
| H+0 | freshness | green | all tables under SLA |
| H+6 | volume | yellow | orders +2.4σ vs baseline |
| H+14 | schema | red | new column is_refund_flag
|
| H+20 | distribution | green | null % stable |
Rule of thumb. A pilot that does not emit at least one real alert in the first 72 hours on critical tables is failing — either the SLAs are too loose, or the baselines have not converged. Tighten the SLA and lower the z-threshold to force at least one signal, then judge precision.
Worked example — the dbt test vs observability platform overlap
Detailed explanation. The most common dismissal of observability platforms is "we already have dbt tests." It is half right and half wrong, and the platform's value-prop hinges on naming the half that dbt tests cannot cover.
Question. A team runs 600 dbt tests per build and asks why they should pay for an observability platform. Catalogue what dbt tests catch, what they miss, and where the platform pays for itself.
Input.
| Capability | dbt test | Observability platform |
|---|---|---|
| Assertions on known shapes | yes | yes |
| Anomaly detection on unknown shapes | no | yes |
| Freshness SLA monitoring | weak (dbt source freshness) |
yes |
| Volume drift monitoring | no | yes |
| Schema drift on source tables | no | yes |
| Lineage-aware incident routing | no (Elementary helps) | yes |
| On-call PagerDuty integration | DIY | yes |
| Stakeholder-readable status page | no | yes |
Code.
# A typical dbt test catches a known shape
- name: not_null_orders_customer_id
description: order rows must have a customer_id
test_metadata:
name: not_null
columns: [customer_id]
config:
severity: error
# An anomaly detector catches an unknown shape
def detect_volume_anomaly(history, today, z_threshold=3.0):
"""Flag today's row count if it deviates > z_threshold sigma from history."""
mu, sigma = statistics.mean(history), statistics.stdev(history)
if sigma == 0:
return None
z = (today - mu) / sigma
if abs(z) >= z_threshold:
return {"severity": "warn" if abs(z) < 5 else "page", "z": z}
return None
Step-by-step explanation.
- The dbt
not_nulltest asserts a known invariant: this column should never be NULL. The platform's anomaly detector measures a baseline and flags drift — even when the team has not yet thought to assert anything. - dbt tests only run on dbt-managed models. The platform monitors every table in the warehouse, including the raw landings dbt does not own.
- dbt's
source freshnessis a per-source flag, not a continuous SLA. The platform polls metadata every N minutes and alerts the moment SLA is breached. - dbt knows nothing about volume or schema drift on the source side. The platform's metadata snapshot covers both.
- When dbt tests fail, the failure is in the build log — there is no PagerDuty integration, no Slack thread, no lineage breadcrumb pointing to what changed upstream. The platform owns the workflow side; dbt owns the assertion side. Both layers belong in a healthy stack.
Output.
| Bug class | dbt test catches? | Platform catches? |
|---|---|---|
customer_id is NULL |
yes | yes |
| Row count drops 40% overnight | no | yes |
| New column appears in source | no | yes |
| Refund rate jumps from 2% to 11% | no | yes |
| Daily refresh is 4 hours late | partial | yes |
Rule of thumb. Keep your dbt tests — they are cheap, fast, and the right tool for known invariants. Add an observability platform to catch the next bug class: drift, freshness SLAs, schema changes on source tables, and lineage-aware routing.
Worked example — a 30-day pilot scorecard
Detailed explanation. Vendors will happily extend a 14-day trial to 90 days if you let them. The pilot scorecard pins both sides to a concrete grading rubric so the trial does not drift into "well, we kind of liked it" territory.
Question. Design a 30-day pilot scorecard for an observability platform on three critical tables. The scorecard should weight signal quality, alert precision, and stakeholder communication.
Input — scoring rubric.
| Category | Weight | Example metric |
|---|---|---|
| Coverage | 20% | tables monitored / total tables |
| Detection quality | 25% | true positives / total alerts |
| Time-to-first-incident | 20% | hours from import to first real alert |
| Lineage usefulness | 15% | % incidents linked to upstream cause |
| Stakeholder readability | 10% | non-data-team thumbs-up rate |
| Cost / coverage | 10% | $ per monitored table per month |
Code.
# Score a pilot run from a CSV of alerts and outcomes
import pandas as pd
def score_pilot(alerts: pd.DataFrame) -> dict:
"""alerts has columns: signal, label (tp/fp), hours_to_detect, has_lineage."""
tp = (alerts["label"] == "tp").sum()
fp = (alerts["label"] == "fp").sum()
precision = tp / max(tp + fp, 1)
avg_ttd = alerts.loc[alerts["label"] == "tp", "hours_to_detect"].mean()
lineage_pct = alerts["has_lineage"].mean()
return {
"precision": round(precision, 2),
"avg_ttd_hours": round(avg_ttd, 2),
"lineage_pct": round(lineage_pct, 2),
"total_alerts": len(alerts),
}
Step-by-step explanation.
- Coverage is the easy first cut: did the platform actually wire up all three pilot tables and emit signals on them? If yes, score 20/20. If only two of three connected, score 13/20.
- Detection quality is precision: of the alerts the platform sent, how many were real? A precision below 0.5 is alert fatigue; above 0.8 is excellent.
- Time-to-first-incident measures how quickly the baseline converges. Most platforms need 7–14 days of history to learn seasonality; alerts before then are heuristic, alerts after are model-based.
- Lineage usefulness is the most important qualitative metric: of the real incidents, how many had an upstream cause identified within the platform itself (versus the on-call having to open the warehouse manually)?
- Stakeholder readability is the "did the marketing PM understand the status page" check. Platforms that ship a readable status page win this row by default.
- Cost / coverage is a simple cents-per-table-per-month divided by total tables. Use list prices for the scorecard and renegotiate at procurement.
Output (sample scorecard).
| Vendor | Coverage | Detection | TTI | Lineage | Readability | $/cov | Total |
|---|---|---|---|---|---|---|---|
| Monte Carlo | 20 | 22 | 16 | 14 | 8 | 6 | 86 |
| Anomalo | 18 | 24 | 14 | 11 | 7 | 7 | 81 |
| Bigeye | 19 | 20 | 18 | 12 | 8 | 8 | 85 |
| Lightup | 16 | 19 | 17 | 9 | 6 | 9 | 76 |
Rule of thumb. A scorecard with weights set before the demos forces the team to argue about what matters, not which vendor's slides looked better. Lock the weights on day 0; revisit only if a category was genuinely missing.
Data engineering interview question on observability ROI
A senior interviewer often opens with: "Your VP asks why we should spend $200K/year on Monte Carlo when we already have dbt tests, Slack, and a smart on-call team. Walk me through the ROI calculation and the two real metrics you would put on a one-slide pitch." It blends financial framing, observability fundamentals, and stakeholder communication — the three muscles that decide whether the platform actually lands.
Solution Using a four-line ROI calculation tied to MTTD and MTTR compression
# observability_roi.py
# Returns annualised dollar value of moving MTTD and MTTR by N hours.
def observability_roi(
incidents_per_year: int,
mttd_before_h: float,
mttd_after_h: float,
mttr_before_h: float,
mttr_after_h: float,
cost_per_hour_of_bad_data: float,
engineer_hourly_cost: float,
avg_engineers_per_incident: float = 1.5,
) -> dict:
"""Two terms: bad-data cost saved, and engineer-hour cost saved."""
bad_data_hours_saved = (
incidents_per_year
* ((mttd_before_h - mttd_after_h) + (mttr_before_h - mttr_after_h))
)
eng_hours_saved = (
incidents_per_year
* (mttr_before_h - mttr_after_h)
* avg_engineers_per_incident
)
bad_data_value = bad_data_hours_saved * cost_per_hour_of_bad_data
eng_value = eng_hours_saved * engineer_hourly_cost
return {
"bad_data_hours_saved": bad_data_hours_saved,
"eng_hours_saved": eng_hours_saved,
"total_value_usd": round(bad_data_value + eng_value, 0),
}
print(
observability_roi(
incidents_per_year=120,
mttd_before_h=18,
mttd_after_h=2,
mttr_before_h=10,
mttr_after_h=4,
cost_per_hour_of_bad_data=400,
engineer_hourly_cost=120,
)
)
Step-by-step trace.
| Variable | Before | After | Delta per incident |
|---|---|---|---|
| MTTD (hours) | 18 | 2 | -16 |
| MTTR (hours) | 10 | 4 | -6 |
| Bad-data hours saved per incident | — | — | 22 |
| Engineer hours saved per incident | — | — | 9 |
With 120 incidents per year, the team saves 120 * 22 = 2,640 bad-data hours and 120 * 9 = 1,080 engineer hours. At $400 / hour of bad-data cost and $120 / engineer hour, that is 1,056,000 + 129,600 = $1,185,600 of annualised value. A $200K platform clears the bar.
Output:
| Metric | Value |
|---|---|
| bad_data_hours_saved | 2640 |
| eng_hours_saved | 1080 |
| total_value_usd | 1,185,600 |
Why this works — concept by concept:
- MTTD compression — the dominant term in the ROI math. Catching an incident in 2 hours instead of 18 saves 16 hours of bad data flowing through dashboards, reverse-ETL syncs, and downstream models. Most pilots show 4–10x MTTD compression.
- MTTR compression — secondary but real. The platform's lineage view often cuts diagnosis time in half because the on-call sees the upstream cause immediately.
- Cost per hour of bad data — the hardest input to estimate but the most important. Use a conservative number (revenue-attributed-to-data divided by working hours) for the pitch and an aggressive number for internal modeling.
- Engineer hourly cost — fully loaded, not salary. A senior data engineer is roughly $120-$200 fully loaded per hour in a US metro market.
- One-slide pitch metrics — MTTD compression and dollar value saved. Skip the feature matrix; the VP only needs the two numbers and the platform tier.
- Cost — O(1) compute. The model is a four-line linear function; the difficulty is calibrating the inputs, not running it.
SQL
Topic — data validation
Data validation problems (SQL)
2. The five pillars of data observability
Freshness, volume, schema, distribution, and lineage are the five vital signs every observability platform monitors — every other feature folds into one of these five
The mental model in one line: a data table is healthy when its freshness, volume, schema, distribution, and lineage are all within their learned baselines — and any platform you evaluate either implements all five well or has a credibility gap. Once you can name the five pillars cold, the vendor demos collapse into "how does this product instrument each one?" — and you can compare like-with-like.
Pillar 1 — Freshness.
- What it measures. Time elapsed since the table was last written. The baseline is the expected cadence (e.g. "this table should refresh every hour during business hours, every 6 hours overnight").
-
How it is computed. From warehouse metadata —
INFORMATION_SCHEMA.TABLES.last_alteredin Snowflake,pg_stat_user_tablesin Postgres,INFORMATION_SCHEMA.PARTITIONSin BigQuery. -
Common alert. "Table
analytics.ordershas been stale for 4h 22m; SLA is 2h." Routed to the table owner via Slack with a link to the upstream dbt model. - Why it matters. Freshness regressions are the single most common silent failure in modern warehouses. An ETL DAG fails, the table stops refreshing, and dashboards show stale-but-plausible numbers until a human notices.
Pillar 2 — Volume.
- What it measures. Row count distribution day over day. The baseline is a rolling window (typically 7-30 days) with optional weekday seasonality.
-
How it is computed. Daily
COUNT(*)snapshots, optionally bucketed by partition or byloaded_atdate. Compared against the baseline with a z-score or interquartile range test. -
Common alert. "Today's
ordersrow count is 1,240 — 7-day baseline is 5,300 ± 800 (z = -5.1)." This is the "did a partition fail to load" signal. - Why it matters. Sudden zeros, half-loads, and dedup-gone-wrong issues all show up as volume anomalies long before they manifest as wrong dashboard numbers.
Pillar 3 — Schema.
- What it measures. Drift in the table's column list and types. The baseline is the previous snapshot.
-
How it is computed. A daily diff of
INFORMATION_SCHEMA.COLUMNS. Detects added columns, removed columns, type changes, and ordering changes. -
Common alert. "Column
customer_emailwas renamed toemail_hashedbetween yesterday and today." This is the "did upstream change the contract without telling us" signal. - Why it matters. Breaking schema changes propagate downstream silently — a removed column makes a dbt model fail on the next build, a type widening makes a join return wrong rows, a renamed column makes a dashboard go to NULL.
Pillar 4 — Distribution.
- What it measures. Statistical properties of the values inside the table — null %, cardinality, mean, median, top-N value frequencies, and per-segment skew. The baseline is a rolling distribution profile.
- How it is computed. Periodic sampled scans (because full scans on a billion-row table are expensive). Compared against the baseline with a Kolmogorov-Smirnov-style test or a per-quantile delta.
-
Common alert. "Null rate on
analytics.orders.shipping_countryjumped from 0.8% to 18.4% in the last 24h." This is the "did an upstream field stop being populated" signal. - Why it matters. Distribution drift is the most subtle and the most damaging. Numbers still tally up; row counts are fine; the meaning of the data has shifted under the dashboard.
Pillar 5 — Lineage.
- What it measures. The dependency graph between tables, dbt models, BI dashboards, and downstream destinations. Not a signal itself but a context for every other signal.
- How it is computed. Parsed from SQL query logs (Monte Carlo's approach), dbt manifest (Bigeye's approach), OpenLineage events (the open standard), or static SQL parsing (Lightup).
-
Common alert. "Freshness incident on
raw.eventswill affect 14 downstream models and 3 dashboards." This is the blast radius signal. - Why it matters. Without lineage, every incident triage starts with "what depends on this?" — a manual archaeology step that doubles MTTR. With lineage, the on-call sees the blast radius in the alert payload.
Bonus pillar — usage and cost.
- What it measures. Query patterns, slot consumption, $/query, and stale-table waste (tables no one reads).
- Why it counts. A table no one queries is silently expensive and a table no one writes is silently broken — usage and cost signals catch both. Increasingly treated as quality-adjacent rather than separate.
How the five pillars map to alert routing.
- Freshness → table owner, Slack, low severity (engineers expect to fix it during business hours).
- Volume → table owner + on-call, Slack + PagerDuty, medium severity (could indicate a partial load).
- Schema → table owner + downstream owners, Slack + ticket, high severity (likely a breaking change).
- Distribution → table owner + analytics owner, Slack, medium-low severity (often a signal of upstream business change, not a pure tech failure).
- Lineage → not routed alone; attached as context to every other alert.
Worked example — implementing the freshness pillar in raw SQL
Detailed explanation. The cheapest freshness check is a single query against the warehouse metadata table — no scans, no compute, runs in milliseconds. Building one yourself is the right starting point to understand what the platform is doing under the hood.
Question. Write a Snowflake query that lists every table in the analytics schema whose last write is older than its expected SLA. Use a sidecar table for per-table SLAs.
Input.
| table_name | sla_minutes |
|---|---|
| orders | 60 |
| customers | 1440 |
| events_raw | 30 |
| dim_country | 10080 |
Code.
-- Freshness probe — joins metadata to a per-table SLA sidecar
WITH last_writes AS (
SELECT
TABLE_SCHEMA,
TABLE_NAME,
LAST_ALTERED,
TIMESTAMPDIFF('minute', LAST_ALTERED, CURRENT_TIMESTAMP)
AS minutes_stale
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'ANALYTICS'
)
SELECT
lw.TABLE_NAME,
lw.LAST_ALTERED,
lw.minutes_stale,
sla.sla_minutes,
lw.minutes_stale - sla.sla_minutes AS minutes_over_sla,
CASE
WHEN lw.minutes_stale > sla.sla_minutes * 2 THEN 'page'
WHEN lw.minutes_stale > sla.sla_minutes THEN 'warn'
ELSE 'ok'
END AS severity
FROM last_writes lw
JOIN freshness_slas sla
ON LOWER(sla.table_name) = LOWER(lw.TABLE_NAME)
WHERE lw.minutes_stale > sla.sla_minutes;
Step-by-step explanation.
- The CTE pulls
LAST_ALTEREDfrom warehouse metadata — no scan over the table data. This is the same source every observability platform uses for the freshness pillar. -
TIMESTAMPDIFFcomputes minutes-since-last-write. We join against a sidecar table of per-table SLAs because not every table refreshes on the same cadence. - The
severityCASE creates a three-band signal:okwhen within SLA,warnwhen over,pagewhen 2x over. This matches the way most platforms tier severity. - The final WHERE clause keeps only rows that breached SLA — those become the alert payloads.
- To make this continuous, schedule it every 5 minutes via Snowflake Tasks or an external orchestrator. The cost is negligible because metadata reads are free.
Output.
| TABLE_NAME | LAST_ALTERED | minutes_stale | sla_minutes | severity |
|---|---|---|---|---|
| events_raw | 14:02:11 | 47 | 30 | warn |
| orders | 13:18:50 | 91 | 60 | warn |
Rule of thumb. Build the freshness pillar yourself first, even if you plan to buy. The exercise teaches you what the platform should report — and it gives you a fallback if procurement stalls.
Worked example — building the volume pillar with a rolling z-score
Detailed explanation. Volume drift is the second-most-common silent failure. A daily z-score against a 7-day baseline catches both half-loads (negative z) and double-loads (positive z) with a single test.
Question. Write a SQL query that produces a z-score for today's row count against a 7-day baseline of the orders table, with NULLIF protection for constant-baseline days.
Input (last 8 days of row counts).
| day | rows |
|---|---|
| -7 | 5280 |
| -6 | 5410 |
| -5 | 5125 |
| -4 | 5390 |
| -3 | 5340 |
| -2 | 5275 |
| -1 | 5300 |
| 0 (today) | 1240 |
Code.
WITH daily AS (
SELECT
DATE_TRUNC('day', loaded_at) AS d,
COUNT(*) AS rows_today
FROM analytics.orders
WHERE loaded_at >= CURRENT_DATE - 8
GROUP BY 1
),
baseline AS (
SELECT
AVG(rows_today) AS mu,
STDDEV(rows_today) AS sigma
FROM daily
WHERE d < CURRENT_DATE
)
SELECT
d,
rows_today,
mu,
sigma,
ROUND(
(rows_today - mu) / NULLIF(sigma, 0),
2
) AS z_score,
CASE
WHEN ABS((rows_today - mu) / NULLIF(sigma, 0)) > 5 THEN 'page'
WHEN ABS((rows_today - mu) / NULLIF(sigma, 0)) > 3 THEN 'warn'
ELSE 'ok'
END AS severity
FROM daily, baseline
WHERE d = CURRENT_DATE;
Step-by-step explanation.
- The
dailyCTE rolls up row counts per day. The window is 8 days so that today plus the 7-day baseline are both in scope. - The
baselineCTE computes mean and standard deviation over the non-today days only — separating the test set from the training set. - The outer SELECT computes today's z-score:
(today - mu) / sigma. A z of -5 means today is 5 standard deviations below the baseline. -
NULLIF(sigma, 0)prevents a division-by-zero on a constant-baseline day. If sigma is 0, the z-score becomes NULL, the severity falls took, and the platform does not page on a false signal. - The severity bands at z > 3 (warn) and z > 5 (page) are reasonable defaults. Tighten or loosen based on the table's natural variance.
Output.
| d | rows_today | mu | sigma | z_score | severity |
|---|---|---|---|---|---|
| 2026-06-15 | 1240 | 5302.86 | 96.13 | -42.21 | page |
Rule of thumb. Use z-score for the test, log the raw mean and sigma in the alert payload, and always wrap the denominator in NULLIF. The two-line guard against constant-baseline days saves more false pages than any other tuning.
Worked example — schema drift detection via column-set diff
Detailed explanation. Schema drift is detected by comparing today's column snapshot against yesterday's. The simplest implementation is a daily snapshot table and a FULL OUTER JOIN to find adds, removes, and type changes in one query.
Question. Given a schema_snapshots table with daily snapshots, write a query that flags every adds, drops, and type-change between yesterday and today on the analytics.orders table.
Input (yesterday vs today).
| snapshot_date | column_name | data_type |
|---|---|---|
| -1 | order_id | NUMBER |
| -1 | customer_id | NUMBER |
| -1 | amount | NUMBER |
| -1 | status | VARCHAR |
| 0 | order_id | NUMBER |
| 0 | customer_id | VARCHAR |
| 0 | amount | NUMBER |
| 0 | is_refund_flag | BOOLEAN |
Code.
WITH yesterday AS (
SELECT column_name, data_type
FROM schema_snapshots
WHERE table_name = 'orders'
AND snapshot_date = CURRENT_DATE - 1
),
today AS (
SELECT column_name, data_type
FROM schema_snapshots
WHERE table_name = 'orders'
AND snapshot_date = CURRENT_DATE
)
SELECT
COALESCE(y.column_name, t.column_name) AS column_name,
y.data_type AS yesterday_type,
t.data_type AS today_type,
CASE
WHEN y.column_name IS NULL THEN 'added'
WHEN t.column_name IS NULL THEN 'removed'
WHEN y.data_type IS DISTINCT FROM t.data_type THEN 'type_changed'
ELSE 'same'
END AS change_type
FROM yesterday y
FULL OUTER JOIN today t
ON y.column_name = t.column_name
WHERE y.column_name IS NULL
OR t.column_name IS NULL
OR y.data_type IS DISTINCT FROM t.data_type;
Step-by-step explanation.
- The two CTEs pull yesterday's and today's column snapshots into named scopes.
-
FULL OUTER JOINis the right tool: it emits every row from both sides so that adds (today-only), drops (yesterday-only), and type-changes (both sides) all show up. -
COALESCE(y.column_name, t.column_name)picks the non-NULL column name so the output has a single column-name column. - The CASE classifies each row using
IS NULLfor adds and removes, andIS DISTINCT FROMfor type changes — NULL-safe inequality is critical here because either side may be NULL. - The final WHERE keeps only changes, dropping the noise of identical columns.
Output.
| column_name | yesterday_type | today_type | change_type |
|---|---|---|---|
| customer_id | NUMBER | VARCHAR | type_changed |
| status | VARCHAR | NULL | removed |
| is_refund_flag | NULL | BOOLEAN | added |
Rule of thumb. Schema drift detection is the easiest pillar to home-grow. Snapshot daily, FULL OUTER JOIN against yesterday, and you have 80% of what a platform's schema pillar provides — minus the lineage routing.
Data engineering interview question on the five-pillars contract
A senior interviewer often opens with: "Name the five pillars of data observability and explain what alert each one produces. Then tell me which pillar misses the most production incidents at your last job." It blends framework recall, real-world judgement, and the kind of honesty that distinguishes a senior engineer from a junior.
Solution Using a five-pillar implementation matrix tied to incident classes
-- Five-pillar observability matrix: which pillar catches which incident class
WITH pillars AS (
SELECT 'freshness' AS pillar, 'stale table' AS incident_class,
'metadata only' AS source, 60 AS cost_per_run_seconds UNION ALL
SELECT 'volume', 'half-load / double-load',
'COUNT(*) daily', 30 UNION ALL
SELECT 'schema', 'breaking change',
'INFORMATION_SCHEMA diff', 5 UNION ALL
SELECT 'distribution', 'silent value drift',
'sampled scan', 300 UNION ALL
SELECT 'lineage', 'blast radius unknown',
'query log parse', 0
)
SELECT
pillar,
incident_class,
source,
cost_per_run_seconds,
CASE
WHEN cost_per_run_seconds = 0 THEN 'free'
WHEN cost_per_run_seconds < 60 THEN 'cheap'
WHEN cost_per_run_seconds < 300 THEN 'medium'
ELSE 'expensive'
END AS cost_band
FROM pillars
ORDER BY cost_per_run_seconds;
Step-by-step trace.
| pillar | incident_class | source | seconds | band |
|---|---|---|---|---|
| lineage | blast radius unknown | query log parse | 0 | free |
| schema | breaking change | INFORMATION_SCHEMA diff | 5 | cheap |
| volume | half-load / double-load | COUNT(*) daily | 30 | cheap |
| freshness | stale table | metadata only | 60 | cheap |
| distribution | silent value drift | sampled scan | 300 | medium |
The trace highlights that distribution monitoring is the most expensive pillar to run — sampled scans on a billion-row table cost real warehouse compute — and lineage is effectively free because the query log already exists.
Output:
| Pillar | Catches |
|---|---|
| Freshness | ETL DAG died silently |
| Volume | Partition failed to load |
| Schema | Upstream renamed a column |
| Distribution | Refund rate jumped 5x |
| Lineage | Blast radius of any incident |
Why this works — concept by concept:
- The five-pillar framework — every modern observability platform organises its product around these five signals; if you can speak them fluently you can map any vendor demo onto the same axes.
- Cost band drives sampling — freshness, volume, and schema are essentially free, so run them continuously. Distribution is expensive, so sample on critical columns only and on a slower cadence.
- Lineage is the multiplier — it does not detect anything on its own, but it triples the value of every other signal by adding blast-radius context to the alert payload.
- Incident class mapping — pillar-to-incident matching is the right way to think about coverage. If your incident postmortem cites a class the platform does not monitor, the platform is failing that incident class.
- Sampling cadence — match pillar cadence to cost. Freshness every 5 min, volume daily, schema daily, distribution weekly on most tables and daily on the most critical ones.
- Cost — pillar-by-pillar; the matrix scales linearly with the number of tables monitored. Total compute cost is usually < 1% of warehouse spend.
SQL
Topic — aggregation
Aggregation problems (SQL)
3. The four vendors — Monte Carlo vs Anomalo vs Bigeye vs Lightup
Monte Carlo is the category creator; Anomalo bets on ML; Bigeye bets on dbt; Lightup bets on security — pick the bet that matches your stack
The mental model in one line: each of the four vendors made a different architectural bet, and the right buy is the one whose bet matches your stack's primary constraint — coverage, ML depth, dbt-nativeness, or security. Once you can name those four bets cold, the demos compress into a single afternoon.
Vendor 1 — Monte Carlo: the category creator with the broadest lineage and warehouse coverage.
- Founded. 2019 in San Francisco. Coined the term "data observability platform" in their early marketing.
- Architecture. Agent runs in your VPC, reads warehouse metadata and query logs, ships signals back to Monte Carlo's control plane.
- Strengths. Broadest warehouse connector matrix (Snowflake, BigQuery, Redshift, Databricks, Postgres, dbt, Airflow, Looker, Tableau, Mode, Sigma). Best lineage graph in the category, parsed from raw SQL query logs across the entire stack. Strong enterprise sales and SOC 2 + ISO 27001 + HIPAA posture.
- Weaknesses. Heaviest deployment (agent + control plane + sometimes a metadata sidecar in your warehouse). Per-table list pricing climbs fast on big enterprises (typical six-figure annual contracts). Some teams report the ML baselines need more tuning than the marketing implies.
- Best fit. Large enterprise (250+ data engineers, hundreds of dbt models, dozens of dashboards, multiple warehouses), strict compliance, lineage-first incident triage workflow.
Vendor 2 — Anomalo: deep ML anomaly detection on top of no-code rule authoring.
- Founded. 2018 in Palo Alto, spun out of Instacart's data-quality team.
- Architecture. SaaS control plane reads from your warehouse via a service account. Pushdown SQL for most checks; ML models run in Anomalo's compute.
- Strengths. Deepest ML detection in the category — seasonality-aware baselines, hierarchical models, automatic feature engineering on top of column statistics. No-code rule editor for non-engineers (data stewards, analysts). Snowflake-native via native app integration.
- Weaknesses. Coverage matrix narrower than Monte Carlo's (Snowflake / BigQuery / Databricks / Redshift / Postgres, but lighter on BI tool lineage). Smaller ecosystem of integrations. ML-first stance means a longer time-to-first-detection while baselines train.
- Best fit. Mid-market teams (25–250 engineers) with a heavy ML / analytics product, where deep anomaly detection on numerical metrics is worth more than broad lineage coverage.
Vendor 3 — Bigeye: metric-first, autometrics, dbt-native.
- Founded. 2019 in San Francisco by ex-Uber data infrastructure engineers.
- Architecture. Agent reads warehouse metadata and runs metric collectors. Metric definitions live in YAML or in the UI; dbt manifest integration auto-discovers models.
- Strengths. Strongest dbt integration of the four — autometrics pulls dbt model owners, tests, and tags into the platform automatically. Per-metric pricing model is predictable. Engineering-friendly: YAML-as-code metric definitions; treats observability as code, not as a UI.
- Weaknesses. Lineage graph is less comprehensive than Monte Carlo's. ML detection is competent but not the headline feature. Smaller install base in heavily regulated industries.
- Best fit. dbt-heavy stacks (10K+ dbt models or a strong dbt practice), engineering teams that want code-defined metrics, mid-market data platforms.
Vendor 4 — Lightup: pushdown SQL checks, no data egress, security-friendly.
- Founded. 2019 in San Francisco. Acquired by Acceldata in 2023; now operates as part of a broader observability suite.
- Architecture. Pure pushdown — every check compiles to SQL that runs inside your warehouse. Lightup's plane never sees the underlying data, only the check result.
- Strengths. Best security story by a wide margin: zero data egress, runs inside the warehouse, satisfies the strictest data residency reviews. Lowest network attack surface. Strong fit for healthcare, financial services, and government.
- Weaknesses. ML detection narrower than Anomalo's — most checks are rule-based or basic statistical thresholds. Lineage capability is the weakest of the four. Coverage matrix is the smallest (Snowflake, BigQuery, Databricks, Postgres, SQL Server primarily).
- Best fit. Regulated stacks where security review forbids agent-in-VPC + control-plane-sees-data architecture. Healthcare, banking, public sector.
Pricing model shapes.
- Monte Carlo — per-table. List price is in the $1.5K–$3K per monitored table per year range. A 250-table deployment is $400K-$750K depending on volume discounting.
- Anomalo — per-source / per-check. Hybrid model: a flat per-source fee plus marginal per-check fees. Smaller deployments end up cheaper than Monte Carlo; enterprise deployments converge.
- Bigeye — per-metric. The most predictable model. List is roughly $50 per metric per month; 200 metrics is $120K/year. Easy to forecast.
- Lightup — per-source / per-credit. Per-source base plus credits consumed by check runs. Best fit for teams that monitor a few sources heavily, less attractive for "monitor everything" stances.
A scoring rubric (each row 0–10).
| Axis | Monte Carlo | Anomalo | Bigeye | Lightup |
|---|---|---|---|---|
| Coverage breadth | 9 | 7 | 7 | 6 |
| ML detection quality | 7 | 9 | 7 | 5 |
| dbt-nativeness | 7 | 7 | 9 | 6 |
| Security posture | 7 | 7 | 7 | 9 |
| Time-to-value | 6 | 7 | 8 | 8 |
| Lineage graph | 9 | 6 | 7 | 5 |
| Pricing transparency | 6 | 7 | 8 | 7 |
| Total | 51 | 50 | 53 | 46 |
The rubric makes the trade-offs visible: Bigeye scores highest on the aggregate because of dbt-nativeness, time-to-value, and pricing transparency — but Monte Carlo wins coverage and lineage, Anomalo wins ML, and Lightup wins security. There is no universal winner.
Worked example — picking the right vendor for a four-stack profile
Detailed explanation. The buying decision is a constraint-satisfaction problem, not a beauty contest. Each team has a primary constraint (security, ML, dbt-ness, or coverage) and a budget envelope. Mapping the four vendors against four stack profiles gives a defensible recommendation.
Question. Given four stack profiles — an enterprise insurance company, a mid-market growth SaaS, a dbt-heavy fintech, and an early-stage AI startup — recommend the best-fit observability vendor for each and explain why.
Input.
| Stack profile | Constraint | Team size | Stack |
|---|---|---|---|
| Insurance | Security + lineage | 80 DE | Snowflake + Looker + Airflow |
| Growth SaaS | ML on numerical metrics | 25 DE | BigQuery + dbt + Hex |
| dbt-heavy fintech | dbt-native + predictable price | 40 DE | Snowflake + 5K dbt models + Mode |
| AI startup | Low cost + fast pilot | 6 DE | BigQuery + dbt-core + Metabase |
Code.
# Vendor recommendation function — naive decision tree
def recommend_vendor(profile: dict) -> str:
if profile.get("regulated") or profile.get("security_blocker"):
return "Lightup"
if profile.get("ml_critical"):
return "Anomalo"
if profile.get("dbt_models", 0) > 1000:
return "Bigeye"
if profile.get("team_size", 0) >= 50 and profile.get("coverage_breadth"):
return "Monte Carlo"
if profile.get("team_size", 0) <= 10:
return "Elementary (OSS) + dbt source freshness"
return "Bigeye" # default middle-of-the-road
Step-by-step explanation.
- The first branch is the security blocker. Insurance, healthcare, banking — if data egress is forbidden, Lightup is the only architecturally viable option of the four.
- The second branch is ML criticality. If the team's product is ML (recommender systems, fraud detection, personalisation) and they need anomaly detection on numerical metrics, Anomalo's depth wins.
- The third branch is dbt-heaviness. A 5K-model dbt project pays off on Bigeye's autometrics: the platform auto-discovers models and tests, and metric definitions live alongside dbt code.
- The fourth branch is enterprise breadth. Large teams with hundreds of tables, multiple BI tools, and a need for cross-stack lineage gravitate to Monte Carlo's coverage matrix.
- The fifth branch is the AI startup escape hatch — under 10 engineers, the right answer is usually open-source (
dbt-elementary,dbt source freshness, a Slack webhook) until the team grows. - The default for "no specific constraint" is Bigeye, because it scores highest on the rubric for a generic mid-market stack.
Output.
| Stack profile | Recommendation | Why |
|---|---|---|
| Insurance | Monte Carlo (or Lightup if egress blocked) | broad coverage + best lineage |
| Growth SaaS | Anomalo | deepest ML on numerical metrics |
| dbt-heavy fintech | Bigeye | dbt-native autometrics, predictable price |
| AI startup | Elementary OSS for now | sub-10 DE team, build before buy |
Rule of thumb. Always ask the buyer "what is the primary constraint?" first. If they answer "I don't have one," they are not ready to buy a six-figure platform — they should build with OSS for another quarter, identify the constraint, and revisit.
Worked example — Monte Carlo's lineage edge in incident triage
Detailed explanation. Monte Carlo's lineage graph is parsed from raw SQL query logs across the entire stack — warehouse, dbt, BI tools — and that graph turns every incident alert into a "what does this break" payload, not just a "what broke" one. The triage time savings are real and measurable.
Question. A freshness incident fires on the raw.events table. Walk through how Monte Carlo's lineage would render the alert payload versus a non-lineage platform.
Input.
| Stage | Without lineage | With Monte Carlo lineage |
|---|---|---|
| Alert payload | "raw.events stale" | "raw.events stale → 14 dbt models, 3 dashboards, 2 reverse-ETL syncs affected" |
| Triage step 1 | Open dbt graph, find downstream | (skipped — graph in alert) |
| Triage step 2 | Open Looker, find dashboards | (skipped — listed in alert) |
| Triage step 3 | Open Hightouch, find syncs | (skipped — listed in alert) |
| Triage step 4 | Page on-call for each owner | One paging hop with owners in payload |
Code.
# Example alert payload shape with lineage
alert = {
"table": "raw.events",
"signal": "freshness",
"severity": "page",
"minutes_stale": 174,
"sla_minutes": 60,
"upstream": [
{"system": "kafka", "topic": "events.v2"},
],
"downstream": [
{"system": "dbt", "model": "stg_events", "owner": "data-platform"},
{"system": "dbt", "model": "fct_user_session", "owner": "growth-de"},
{"system": "looker", "dashboard": "User Funnel", "owner": "growth-pm"},
{"system": "hightouch", "sync": "salesforce_users", "owner": "rev-ops"},
],
"blast_radius_count": 19,
"suggested_owners": ["data-platform", "growth-de"],
}
Step-by-step explanation.
- Without lineage, the on-call sees the freshness alert and has to manually open dbt, Looker, and Hightouch to figure out what depends on the broken table. Each step is a context switch.
- With Monte Carlo's lineage, the alert payload arrives with the entire blast radius pre-computed: every downstream model, dashboard, and sync is listed, with owner attribution.
- The on-call pages the right two teams (data-platform + growth-de) in one hop instead of three. Stakeholder comms goes out in parallel because the owners-of-the-impacted-dashboards are listed.
- MTTR for the incident drops because the lineage discovery step (which is half of the time-to-diagnose) is skipped.
- The lineage graph is also updated incrementally as new SQL queries are observed, so it stays accurate without anyone maintaining it.
Output.
| Metric | Without lineage | With Monte Carlo lineage |
|---|---|---|
| Time-to-blast-radius | 35 min | 0 min |
| Owners notified | 1 generic on-call | 2 specific teams |
| MTTR | 95 min | 38 min |
Rule of thumb. When evaluating vendors, ask each one to render the alert payload for a real freshness incident. The vendor whose payload includes downstream models, dashboards, and reverse-ETL syncs with owner attribution is doing lineage right — and is worth the premium.
Worked example — Anomalo's seasonality-aware baseline
Detailed explanation. Anomalo's ML edge shows up most clearly on metrics with strong weekday/weekend seasonality. A naive 7-day average flags every Monday as an "anomaly"; Anomalo's hierarchical model learns the weekly pattern and only flags Mondays-that-are-also-unusual-for-a-Monday.
Question. Show how a naive rolling-mean baseline produces false positives on a seasonal metric, then sketch how a seasonality-aware baseline (Anomalo-style) would fix it.
Input (last 14 days of daily_active_users).
| day | dow | dau |
|---|---|---|
| -14 | Mon | 12000 |
| -13 | Tue | 18500 |
| -12 | Wed | 19200 |
| -11 | Thu | 18800 |
| -10 | Fri | 16500 |
| -9 | Sat | 7500 |
| -8 | Sun | 6800 |
| -7 | Mon | 12300 |
| -6 | Tue | 18900 |
| -5 | Wed | 19500 |
| -4 | Thu | 19000 |
| -3 | Fri | 16800 |
| -2 | Sat | 7400 |
| -1 | Sun | 6700 |
| 0 (today, Mon) | Mon | 11800 |
Code.
import statistics
def naive_baseline_z(history, today):
mu, sigma = statistics.mean(history), statistics.stdev(history)
return (today - mu) / sigma
def seasonal_baseline_z(history_with_dow, today, today_dow):
same_dow = [v for d, v in history_with_dow if d == today_dow]
mu, sigma = statistics.mean(same_dow), statistics.stdev(same_dow)
return (today - mu) / sigma
history_all = [12000, 18500, 19200, 18800, 16500, 7500, 6800,
12300, 18900, 19500, 19000, 16800, 7400, 6700]
history_dow = list(zip(
["Mon","Tue","Wed","Thu","Fri","Sat","Sun"] * 2, history_all))
print("naive z =", round(naive_baseline_z(history_all, 11800), 2))
print("seasonal z =", round(seasonal_baseline_z(history_dow, 11800, "Mon"), 2))
Step-by-step explanation.
- The naive baseline pools all 14 days and computes mean / stdev. The mean is roughly 14,200 and stdev is roughly 4,800. Today (11,800) is at z ≈ -0.5 — within normal range. Wrong: today is actually low for a Monday.
- Actually wait — let me re-read: the naive baseline of all 14 days mixes weekdays and weekends, producing a high variance that masks small anomalies on individual weekdays. The naive z stays near 0 and misses real drift.
- The seasonal baseline filters history to "same day of week" — for today (Mon), it uses only the two previous Mondays (12,000 and 12,300). Mean is 12,150 and stdev is 212. Today (11,800) is at z ≈ -1.65 — meaningful drift visible only with the seasonality split.
- Anomalo's production model is more sophisticated: it uses hierarchical Bayesian methods to share strength across days of week, smooth low-data buckets, and handle holidays. The two-line illustration above is the principle, not the production code.
- The output is a much higher signal-to-noise ratio: real drift on a weekday is detected, false positives on Mondays-after-a-Sunday-spike are suppressed.
Output.
| Baseline | z-score on today | Verdict |
|---|---|---|
| Naive 14-day | -0.50 | not flagged (missed) |
| Seasonal (same DoW) | -1.65 | flagged for review |
Rule of thumb. For any metric with strong weekly seasonality, a naive rolling baseline is worse than no monitor at all — it pages on Mondays and misses real drift on Tuesdays. Either build a seasonality-aware baseline yourself (the principle above) or buy Anomalo / Bigeye / Monte Carlo whose ML layers handle it natively.
Data engineering interview question on vendor selection
A senior interviewer might frame this as: "Our security team forbids any data egress from the warehouse, our analytics team is dbt-native with 4,000 models, and our budget caps a six-figure platform. Which of Monte Carlo, Anomalo, Bigeye, and Lightup would you recommend and why?" It blends the constraint tree above with real trade-off awareness.
Solution Using a constraint-first decision tree
# vendor_picker.py — constraint-first vendor recommendation
from typing import Optional
def pick_vendor(
data_egress_blocked: bool,
dbt_models: int,
ml_first: bool,
budget_six_figures: bool,
team_size: int,
) -> str:
# Hard constraints first
if data_egress_blocked:
return "Lightup"
# Strong preferences
if dbt_models >= 1000:
return "Bigeye"
if ml_first:
return "Anomalo"
# Coverage breadth at scale
if team_size >= 50 and budget_six_figures:
return "Monte Carlo"
# Sub-10 engineers: build don't buy
if team_size < 10:
return "Elementary OSS + dbt freshness"
return "Bigeye" # generic mid-market default
Step-by-step trace.
| Profile | Egress blocked | dbt models | ML-first | Budget | Team | Pick |
|---|---|---|---|---|---|---|
| The question | no | 4000 | no | yes | 40 | Bigeye |
| Insurance | yes | 800 | no | yes | 80 | Lightup |
| Growth SaaS | no | 200 | yes | yes | 25 | Anomalo |
| Enterprise retail | no | 1500 | no | yes | 120 | Bigeye |
| AI startup | no | 30 | no | no | 6 | Elementary OSS |
For the question profile (no egress blocker, 4,000 dbt models, not ML-first, budget cleared, 40 engineers), the decision tree resolves to Bigeye because the dbt-model count exceeds the autometrics threshold and the other constraints are non-binding. Monte Carlo would also fit but does not win on the dbt axis and is typically more expensive per table.
Output:
| Constraint | Resolved value |
|---|---|
| data_egress_blocked | no |
| dbt_models | 4000 (>=1000) |
| recommendation | Bigeye |
Why this works — concept by concept:
- Constraint-first ordering — security blockers (egress, residency, on-prem) are hard constraints; everything else is a preference. Resolve hard constraints before evaluating soft ones.
- dbt-model count as a proxy — a stack with thousands of dbt models has built its data product on dbt; Bigeye's autometrics integration is materially more valuable in that environment than for a 50-model stack.
- ML-first vs metric-first — Anomalo wins when the team's product is ML and they need numerical anomaly detection. Otherwise the ML edge is overkill and the dbt or coverage edge wins.
- Coverage as a tiebreaker — Monte Carlo wins when no other constraint dominates and the team has the budget to pay for the broadest connector matrix.
- Build vs buy floor — under 10 engineers, the right answer is almost always open-source. The platforms only pay off when there is enough surface area to justify the recurring cost.
- Cost — O(1) decision. The hard part is enumerating the constraints honestly; the tree itself runs in microseconds.
SQL
Topic — conditional logic
Conditional logic problems (SQL)
4. Detection types — rule-based, ML-based, and metadata-only monitors
Rule-based catches the obvious; ML-based catches the subtle; metadata-only catches everything you can prove without scanning — every platform uses all three, the mix is the differentiator
The mental model in one line: a rule-based monitor asserts "x > threshold," an ML-based monitor asserts "x is within the learned baseline," and a metadata-only monitor asserts "x can be inferred from the warehouse catalog without scanning the data" — the right monitor type depends on the cost-precision-coverage triangle you want to optimise for. Once you can name the three types and their cost profile, the platforms' marketing claims about "AI-powered" become much easier to parse.
Type 1 — Rule-based monitors (the foundation).
-
Examples.
row_count > 1000,null_rate < 0.01,freshness < 60 minutes,schema matches v3,revenue between $50K and $500K daily. - How they work. A SQL query computes the metric; a comparison against a static threshold decides pass/fail.
- Strengths. Deterministic, explainable, cheap. The on-call sees "x was 1240, threshold was 5000" and knows what fired and why. No model drift, no baseline retraining.
- Weaknesses. You only catch what you thought to write a rule for. Thresholds are brittle — too tight and they false-page; too loose and they miss drift. Hard to maintain at scale because every new table needs a manual rule set.
-
When to use. Hard contractual SLAs (freshness on a daily finance close), invariants you can articulate (
customer_idis never NULL), and as the first layer of detection for every critical table.
Type 2 — ML-based monitors (the auto-baselining layer).
- Examples. Z-score against a 7-day baseline, seasonality-aware models for daily-active-users, hierarchical Bayesian baselines for per-segment metrics, autocorrelation-aware time-series anomaly detection.
- How they work. Historical data trains a model; today's data is compared against the model's prediction. Anomalies are flagged when the residual exceeds a learned confidence band.
- Strengths. Catches unknown bad shapes. Adapts to natural variance, so the same monitor works on a stable metric (revenue) and a noisy one (real-time GPU utilisation). Scales because the model trains itself on every new table.
- Weaknesses. Requires 7–30 days of clean history to converge. Less explainable — "the model thinks this is anomalous" lands less well than "x was 1240, threshold was 5000." Subject to baseline drift — if the metric slowly trends upward, the baseline absorbs the drift and the monitor stops firing.
- When to use. Numerical metrics with rich history; segments that vary naturally; anywhere you cannot articulate a static threshold but you know the metric should not jump 5σ overnight.
Type 3 — Metadata-only monitors (the "no data egress" layer).
-
Examples. Freshness from
INFORMATION_SCHEMA.TABLES.LAST_ALTERED; row count fromINFORMATION_SCHEMA.TABLE_STORAGE_METRICS; schema fromINFORMATION_SCHEMA.COLUMNS; query patterns from query history. - How they work. The platform reads warehouse metadata — the catalog of what tables exist, when they were last written, how big they are — without ever scanning the underlying data.
- Strengths. Effectively free (metadata reads cost no warehouse credits). No data ever leaves the warehouse — best security posture. Always available (no sampling needed).
- Weaknesses. Can only answer questions the catalog knows about. Cannot tell you "the null rate on column X jumped" without scanning. Limited to freshness, volume, and schema pillars — distribution is mostly out of reach.
- When to use. As the always-on first layer of monitoring. Security-sensitive stacks where data egress is forbidden (Lightup's bet). Cost-sensitive deployments on enormous tables where sampling is expensive.
Hybrid mode — auto-baseline + override threshold.
- The right production pattern is both. Use ML-based baselines for the "is this normal" test and rule-based overrides for the "is this catastrophic" test. The ML model catches subtle drift; the rule guarantees you page on absolute floors or ceilings (e.g. "fire when revenue is below $1M/day no matter what the baseline says").
- Every modern platform supports the hybrid mode; Anomalo's UI makes it the default. Bigeye exposes both as YAML metrics.
Detection latency vs precision tradeoffs.
- Tight thresholds + ML baselines → high precision, low latency, but high false-page rate. Suited for the top 5–10% of critical tables only.
- Loose thresholds + ML baselines → lower precision, longer latency-to-detection, but very low false-page rate. The "rest of the warehouse" default.
- Tiered severity → page-grade thresholds for the critical tables, ticket-grade for the rest, and Slack-only for the long tail. Reserves the on-call's attention for real incidents.
Quiet hours and dependency-aware suppression.
- Quiet hours. Suppress non-critical alerts during nights and weekends. Every platform supports this; the default windows vary.
-
Dependency-aware suppression. When
raw.eventsis broken, every downstream model will also fire freshness alerts within an hour. The platform should suppress the downstream alerts and roll them up under the root incident. Monte Carlo's lineage-aware suppression is the strongest in the category here; Bigeye's dbt-manifest-based version is competitive.
How each vendor handles ML retraining.
- Monte Carlo. Baselines retrain daily with a sliding window; manual override available.
- Anomalo. Continuous incremental retraining; hierarchical models share strength across segments.
- Bigeye. Weekly batch retraining with manual triggers; YAML-defined baseline windows.
- Lightup. Threshold-first; ML layer is shallower than the other three.
Worked example — a rule-based row-count monitor in SQL
Detailed explanation. A row-count monitor with a hard floor and a soft warning band is the simplest production-grade rule-based monitor. It catches obvious failures (zero rows, half-loads) with deterministic semantics.
Question. Write a SQL query that returns a severity band (ok, warn, page) for the analytics.orders table given a row-count floor of 1000 and a soft warning band of 3000.
Input.
| date | rows |
|---|---|
| -2 | 5210 |
| -1 | 5340 |
| 0 (today) | 2800 |
Code.
WITH today AS (
SELECT COUNT(*) AS rows_today
FROM analytics.orders
WHERE loaded_at >= CURRENT_DATE
)
SELECT
rows_today,
CASE
WHEN rows_today < 1000 THEN 'page'
WHEN rows_today < 3000 THEN 'warn'
ELSE 'ok'
END AS severity
FROM today;
Step-by-step explanation.
- The CTE counts today's loaded rows in one scan over a single partition (assuming
loaded_atis partitioned). - The CASE applies two static thresholds: 1000 (hard floor) and 3000 (soft warning band).
- Severity falls through from most-severe to least: a row count of 800 fires
page; 2800 fireswarn; 5000+ staysok. - The thresholds are deliberately static — they encode contractual SLA expectations, not learned baselines. A monitor of this kind belongs alongside (not instead of) an ML monitor.
- To productionise, schedule it every hour and pipe the result to a Slack channel keyed by the table owner.
Output.
| rows_today | severity |
|---|---|
| 2800 | warn |
Rule of thumb. Every critical table should have a rule-based row-count monitor with a hard floor. The floor encodes "below this we definitely have a problem" — independent of whatever the ML baseline learns. ML alone is not enough.
Worked example — an ML-based z-score monitor with seasonality
Detailed explanation. A z-score against a same-day-of-week baseline is the simplest seasonality-aware ML monitor and beats a naive rolling-mean baseline on any weekly-seasonal metric. It is the right second layer on top of a rule-based floor.
Question. Write a SQL query that computes today's z-score against a same-day-of-week 4-week baseline for daily_active_users. Use NULLIF to protect against constant baselines.
Input.
| weeks_ago | dow | dau |
|---|---|---|
| 4 | Mon | 12000 |
| 3 | Mon | 12300 |
| 2 | Mon | 12150 |
| 1 | Mon | 12400 |
| 0 | Mon | 9800 |
Code.
WITH dau_history AS (
SELECT
DATE_TRUNC('day', activity_date) AS d,
TO_CHAR(activity_date, 'Dy') AS dow,
COUNT(DISTINCT user_id) AS dau
FROM analytics.events
WHERE activity_date >= CURRENT_DATE - 28
AND activity_date <= CURRENT_DATE
GROUP BY 1, 2
),
today AS (
SELECT d, dow, dau
FROM dau_history
WHERE d = CURRENT_DATE
),
baseline AS (
SELECT
AVG(dau) AS mu,
STDDEV(dau) AS sigma
FROM dau_history
WHERE d < CURRENT_DATE
AND dow = (SELECT dow FROM today)
)
SELECT
today.d, today.dau,
baseline.mu, baseline.sigma,
ROUND((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0), 2) AS z_score,
CASE
WHEN ABS((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0)) > 5
THEN 'page'
WHEN ABS((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0)) > 3
THEN 'warn'
ELSE 'ok'
END AS severity
FROM today, baseline;
Step-by-step explanation.
- The
dau_historyCTE aggregates daily DAU for the last 28 days with day-of-week tags. - The
todayCTE extracts only today's row (same shape as history). - The
baselineCTE computes mean and stdev over the same day of week, excluding today. For a Monday today, that means averaging the four previous Mondays. - The z-score is
(today - mu) / sigma, withNULLIFto handle the corner case where every prior Monday had identical DAU (sigma = 0). - With baseline mu ≈ 12,213 and sigma ≈ 167, today's DAU of 9,800 produces z ≈ -14.4 — a clear page.
Output.
| d | dau | mu | sigma | z_score | severity |
|---|---|---|---|---|---|
| 2026-06-15 | 9800 | 12212.5 | 166.9 | -14.45 | page |
Rule of thumb. Always tag history rows with day-of-week (or season, holiday, marketing campaign) before computing a z-score on a seasonal metric. The single-line WHERE dow = (SELECT dow FROM today) is the cheapest way to add seasonality to an ML-based monitor.
Worked example — a metadata-only volume monitor
Detailed explanation. Metadata-only monitors are the cheapest pillar to run. A row-count check against INFORMATION_SCHEMA runs in milliseconds and consumes zero warehouse credits — the right always-on first layer for every table in the warehouse.
Question. Write a Snowflake query that lists every table in the analytics schema whose row count has changed by more than 50% since yesterday, using only metadata.
Input.
| table_name | rows_today | rows_yesterday |
|---|---|---|
| orders | 5300 | 5280 |
| events | 1.2M | 8.4M |
| customers | 240 | 240 |
| sessions | 0 | 12000 |
Code.
WITH today_meta AS (
SELECT
TABLE_NAME,
ROW_COUNT AS rows_today
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'ANALYTICS'
),
yesterday_meta AS (
SELECT
TABLE_NAME,
ROW_COUNT AS rows_yesterday
FROM analytics_meta.row_count_snapshot
WHERE snapshot_date = CURRENT_DATE - 1
)
SELECT
t.TABLE_NAME,
t.rows_today,
y.rows_yesterday,
ROUND(
100.0 * (t.rows_today - y.rows_yesterday)
/ NULLIF(y.rows_yesterday, 0),
2
) AS pct_change,
CASE
WHEN y.rows_yesterday > 0
AND ABS(t.rows_today - y.rows_yesterday) > 0.5 * y.rows_yesterday
THEN 'page'
ELSE 'ok'
END AS severity
FROM today_meta t
LEFT JOIN yesterday_meta y
ON t.TABLE_NAME = y.TABLE_NAME
WHERE y.rows_yesterday > 0
AND ABS(t.rows_today - y.rows_yesterday) > 0.5 * y.rows_yesterday;
Step-by-step explanation.
- The
today_metaCTE readsROW_COUNTfromINFORMATION_SCHEMA.TABLES— a metadata read, zero credits. - The
yesterday_metaCTE reads from a sidecar snapshot table that the orchestrator populates nightly. - The outer SELECT computes percentage change with
NULLIFto handle zero-row yesterday tables. - The severity threshold is a flat 50% delta — broad enough to catch real anomalies, narrow enough not to false-page on normal growth.
-
events(1.2M today vs 8.4M yesterday) shows a -85% change → page.sessions(0 today vs 12000 yesterday) shows -100% → page.orders(+0.4%) stays ok.
Output.
| TABLE_NAME | rows_today | rows_yesterday | pct_change | severity |
|---|---|---|---|---|
| events | 1200000 | 8400000 | -85.71 | page |
| sessions | 0 | 12000 | -100.00 | page |
Rule of thumb. Always layer a metadata-only volume monitor as the first line of defence. It costs nothing, it runs continuously, and it catches the single-most-common failure mode (a table failed to refresh) before any other pillar even gets a chance.
Data engineering interview question on detection type selection
A senior interviewer often frames this as: "You need to monitor 200 tables. Budget allows you to scan 10 of them daily, the rest must be metadata-only. Walk me through which 10 you scan, how you decide, and how you handle the other 190." It blends prioritisation, cost awareness, and a clear understanding of which detection types work without scans.
Solution Using a tiered monitor strategy keyed to incident cost
# tiered_monitor_strategy.py
# Inputs: a catalog of tables with a "blast radius" cost score
# Outputs: which monitor type to run on each, and at what cadence.
from typing import List, Dict
def tier_monitors(
tables: List[Dict],
scan_budget: int = 10,
high_severity_threshold: float = 100_000,
) -> List[Dict]:
sorted_tables = sorted(
tables,
key=lambda t: t["blast_radius_usd"],
reverse=True,
)
monitored: List[Dict] = []
scans_used = 0
for tbl in sorted_tables:
if (
scans_used < scan_budget
and tbl["blast_radius_usd"] >= high_severity_threshold
):
monitor_type = "rule + ml + metadata"
cadence_min = 60
scans_used += 1
else:
monitor_type = "metadata only"
cadence_min = 30
monitored.append({
**tbl,
"monitor_type": monitor_type,
"cadence_min": cadence_min,
})
return monitored
Step-by-step trace.
| Step | What happens |
|---|---|
| 1 | Sort tables descending by blast_radius_usd |
| 2 | For each, check if budget remains AND blast radius is high-severity |
| 3 | If yes → full triple-monitor stack at hourly cadence |
| 4 | If no → metadata-only every 30 min (freshness, schema, row count) |
| 5 | Return the augmented catalog with monitor types |
For a 200-table input where the top 10 tables have blast radius above $100K and the remaining 190 are below, the algorithm spends the entire scan budget on the top 10 and ships every other table with a metadata-only monitor. Total coverage: 200/200. Compute cost: bounded by scan_budget.
Output:
| Table tier | Count | Monitor type | Cadence |
|---|---|---|---|
| Top 10 (high blast radius) | 10 | rule + ml + metadata | 60 min |
| Remaining | 190 | metadata only | 30 min |
Why this works — concept by concept:
- Blast-radius-keyed prioritisation — the right way to allocate monitor budget is by downstream dollar impact, not by table size or by alphabetical order. Tables that feed the VP's dashboard get the full monitor stack; tables that feed nobody's dashboard get the cheap layer.
- Metadata-only as the floor — every table gets at least a metadata-only monitor. Coverage is 100% even with a small scan budget because metadata reads are free.
- Tiered cadence — high-severity tables need slower cadence (60 min) because the monitors are more expensive; low-severity get more frequent metadata-only checks because they cost nothing.
- Triple-stack on critical tables — rule-based for the absolute floors, ML-based for the subtle drift, metadata-only for the always-on freshness. Three layers, no overlap in failure modes.
- Budget as a constraint — the algorithm respects a hard scan budget. If procurement only approved 10 scans, the algorithm spends them on the right 10.
- Cost — O(n log n) for the sort + O(n) for the assignment. Trivially fast for any realistic catalog size.
SQL
Topic — data validation
Data validation problems (SQL)
5. Incident lifecycle and on-call workflow
Every data incident moves through the same six stages — detect, triage, notify, diagnose, resolve, learn — and the platform's job is to compress each stage's clock without breaking the lineage trail
The mental model in one line: the data incident lifecycle is detect → triage → notify → diagnose → resolve → learn, and every observability platform's value comes down to how much it compresses each stage's wall-clock time. Once you can name the six stages and the platform's role in each, vendor demos become a question of "show me how this stage gets faster" — concrete, measurable, defensible.
Stage 1 — Detect.
- What happens. A signal exceeds a threshold or baseline. The platform raises an incident.
- Platform role. Run the monitors, manage the baselines, fire alerts. MTTD compression is the primary KPI here.
- Key metric. Time-to-detection (TTD). Pre-platform, the average is often hours-to-days; post-platform it should be minutes-to-an-hour.
Stage 2 — Triage.
- What happens. A human (or an automated rule) classifies the incident — severity, owner, blast radius, suspected root cause area.
- Platform role. Auto-tag with owner, lineage, blast radius. Pre-classify severity from the signal type and threshold breach magnitude.
- Key metric. Time-to-triage (TTT). Should be sub-15 minutes for page-grade incidents.
Stage 3 — Notify.
- What happens. The right humans get pinged via the right channels.
- Platform role. Slack-first vs PagerDuty-first routing; status pages for non-technical stakeholders; ticket creation in Jira / Linear.
- Key metric. Acknowledgement time. Aim for under 5 minutes on page-grade.
Stage 4 — Diagnose.
- What happens. The on-call identifies the upstream cause and the blast radius.
- Platform role. Lineage graph, blast radius pre-computation, recent-changes-log integration, suggested-root-cause heuristics.
- Key metric. Time-to-diagnose (TTD₂). The biggest MTTR component; the lineage edge directly compresses this.
Stage 5 — Resolve.
- What happens. The fix is shipped; the data is restored; the incident is closed.
- Platform role. Track resolution, integrate with the deployment system, run a re-monitor to confirm.
- Key metric. Time-to-resolve (TTR). MTTR = TTT + Acknowledge + TTD₂ + TTR.
Stage 6 — Learn.
- What happens. Postmortem; new monitor; retrospective; recurring-incident dashboard.
- Platform role. Postmortem template, incident retrospective view, top-N-table-by-incidents dashboard.
- Key metric. Recurring-incident rate. Should trend down after every postmortem.
Routing decisions.
- Slack-first. Default for non-page severity. Channel keyed by data domain; on-call mention for page-grade.
- PagerDuty-first. Page-grade only. Best for tables with hard SLAs (finance close, regulatory reporting).
- Ticketing-first. Long-tail incidents that need a fix but don't need a same-day response. Jira or Linear.
-
Tier by blast radius. A freshness alert on
dim_marketing_attributionshould not page on-call at 3am; one onfact_revenue_dailyshould.
Owner tagging.
- By table. The platform's table catalog stores an owner per table.
-
By domain. A domain (
revenue,growth,data-platform) routes to a team Slack channel. -
By dbt model meta. dbt YAML can define
meta.owner; the platform reads it and uses it. - By default. If no owner is set, the alert lands in a triage channel for the data platform team to route.
Status pages and stakeholder comms.
- A status page for data incidents is the most under-used feature in the category. It is the difference between "the data team will deal with it" and "every stakeholder knows the data is stale."
- Monte Carlo, Bigeye, and Anomalo all ship status pages. Lightup's is the lightest of the four.
- The status page should show: current incidents, recently resolved incidents, monitored systems, and a five-minute summary written by the on-call.
Integrating with dbt artifacts, OpenLineage, and the catalog.
-
dbt manifest. Every platform consumes
manifest.jsonto discover models, tests, sources, and tags. - OpenLineage. The open standard for lineage events. Monte Carlo and Bigeye are the strongest consumers; Anomalo and Lightup are catching up.
- Data catalog. Integrations with Atlan, Alation, Collibra, DataHub, and Castor are now table-stakes. The catalog stores ownership; the platform reads it.
Postmortems and recurring-incident dashboards.
- The platform should make the "what tables had the most incidents last quarter" question one click. That dashboard drives engineering investment toward the chronic offenders.
- A good postmortem template captures: what fired, when, blast radius, root cause, fix, monitor improvements. The platform's template should structure all six fields.
Worked example — modelling the six-stage incident timeline
Detailed explanation. Quantifying each stage's wall-clock time is the most useful exercise a data team can do post-pilot. The numbers reveal which stage the platform actually compressed and which stage still needs work.
Question. Given a sample incident — a freshness alert on analytics.orders — model the wall-clock time at each of the six stages, both before and after a Monte Carlo deployment, and compute the MTTD and MTTR delta.
Input.
| Stage | Before (minutes) | After (minutes) |
|---|---|---|
| Detect | 240 | 15 |
| Triage | 30 | 5 |
| Notify | 15 | 1 |
| Diagnose | 60 | 12 |
| Resolve | 45 | 30 |
| Learn (offline) | n/a | n/a |
Code.
# incident_timeline.py
def incident_timeline(stages: dict) -> dict:
"""Stages is {name: minutes}. Returns MTTD and MTTR computations."""
mttd = stages["detect"]
mttr = (
stages["triage"]
+ stages["notify"]
+ stages["diagnose"]
+ stages["resolve"]
)
return {"MTTD_minutes": mttd, "MTTR_minutes": mttr}
before = {"detect": 240, "triage": 30, "notify": 15,
"diagnose": 60, "resolve": 45}
after = {"detect": 15, "triage": 5, "notify": 1,
"diagnose": 12, "resolve": 30}
print("Before:", incident_timeline(before))
print("After: ", incident_timeline(after))
print("MTTD delta:", before["detect"] - after["detect"], "min")
print("MTTR delta:",
sum(v for k,v in before.items() if k != "detect")
- sum(v for k,v in after.items() if k != "detect"),
"min")
Step-by-step explanation.
- MTTD is the time from incident onset to detection — equal to the "detect" stage. Pre-platform: 240 min (4 hours). Post-platform: 15 min. Compression: 16x.
- MTTR is the sum of triage + notify + diagnose + resolve — everything after detection. Pre: 150 min. Post: 48 min. Compression: 3.1x.
- The biggest relative compression is on MTTD (16x) because the platform replaces "wait for someone to notice" with "fire a monitor." The biggest absolute compression is on the "detect" stage (225 min saved per incident).
- The "diagnose" stage compression (60 → 12 min) is the lineage edge — the on-call sees the blast radius and upstream cause in the alert, skipping a manual archaeology step.
- "Resolve" only compressed 45 → 30 min — the actual code-fix-and-deploy work is mostly unchanged by the platform. That is the right outcome; the platform's job is detection and triage, not engineering.
Output.
| Stage | Before | After | Δ |
|---|---|---|---|
| Detect | 240 | 15 | -225 |
| Triage | 30 | 5 | -25 |
| Notify | 15 | 1 | -14 |
| Diagnose | 60 | 12 | -48 |
| Resolve | 45 | 30 | -15 |
| Total | 390 | 63 | -327 |
Rule of thumb. Always measure all six stages, not just MTTD. The platform's value-prop is the full incident timeline compression — and the stages where it does not compress (often "resolve") are the stages where the data team still needs to invest.
Worked example — routing by severity and blast radius
Detailed explanation. Routing rules are the operational glue between detection and the on-call. The right rule set keeps page-grade alerts on PagerDuty, ticket-grade alerts in Jira, and Slack-grade alerts in a domain channel — without burying anyone in noise.
Question. Design a routing decision table for a freshness alert based on (severity, blast radius, business hours). Show how the same monitor produces different alerts under different conditions.
Input — routing table.
| Severity | Blast radius | Business hours | Channel |
|---|---|---|---|
| page | high ($100K+) | any | PagerDuty + Slack #data-oncall |
| page | high | off-hours | PagerDuty + Slack |
| warn | high | business hours | Slack #data-oncall |
| warn | high | off-hours | Slack-only (no page) |
| warn | low | any | Domain Slack |
| info | any | any | Domain Slack thread |
Code.
# routing.py — decides where an alert goes based on rule table
from datetime import datetime
def route_alert(severity: str, blast_radius_usd: int) -> list[str]:
now_hour = datetime.utcnow().hour
business_hours = 13 <= now_hour <= 22 # 9am-6pm ET in UTC
high_blast = blast_radius_usd >= 100_000
channels = []
if severity == "page" and high_blast:
channels = ["pagerduty", "slack:#data-oncall"]
elif severity == "warn" and high_blast and business_hours:
channels = ["slack:#data-oncall"]
elif severity == "warn" and high_blast:
channels = ["slack:#data-oncall-quiet"]
elif severity == "warn":
channels = ["slack:#data-domain"]
elif severity == "info":
channels = ["slack:#data-info-thread"]
return channels
# Examples
print(route_alert("page", 250_000)) # PagerDuty + #data-oncall
print(route_alert("warn", 250_000)) # depends on business hours
print(route_alert("warn", 5_000)) # #data-domain
print(route_alert("info", 0)) # #data-info-thread
Step-by-step explanation.
- The function reads severity (set by the monitor) and blast radius (set by lineage) and returns a list of channels.
- Page-grade + high-blast always goes to PagerDuty + an on-call Slack channel. No quiet hours apply — this is the "the company is losing money" tier.
- Warn-grade + high-blast splits by business hours: in-hours, the on-call gets a Slack ping; off-hours, the alert goes to a separate quiet channel that the on-call checks in the morning.
- Warn-grade + low-blast goes to the domain Slack, not the on-call channel. This is the largest category and the one most likely to cause alert fatigue if mis-routed.
- Info-grade goes to a low-priority thread — useful for trend awareness, not for action.
Output.
| Inputs | Channels |
|---|---|
| (page, $250K) | pagerduty + #data-oncall |
| (warn, $250K, hours) | #data-oncall |
| (warn, $250K, off-hours) | #data-oncall-quiet |
| (warn, $5K) | #data-domain |
| (info, $0) | #data-info-thread |
Rule of thumb. Never page on warn-grade after hours unless the blast radius is high and the table has an explicit SLA. The fastest way to lose your on-call's trust in the platform is to page them at 3am for a non-urgent freshness drift on a non-critical table.
Worked example — building a recurring-incident dashboard
Detailed explanation. A recurring-incident dashboard surfaces the chronic offenders — tables, models, or dbt sources that fire more incidents than everything else combined. The dashboard drives engineering investment toward the right targets.
Question. Write a SQL query that returns the top 10 tables by incident count in the last 90 days, with each incident's average severity and the percentage of incidents that were true positives.
Input.
| incident_id | table | severity | label | created_at |
|---|---|---|---|---|
| 1 | analytics.orders | page | tp | -89d |
| 2 | analytics.orders | warn | tp | -75d |
| 3 | raw.events | warn | fp | -60d |
| ... | ... | ... | ... | ... |
Code.
SELECT
table_name,
COUNT(*) AS incident_count,
SUM(CASE WHEN label = 'tp' THEN 1 ELSE 0 END) AS true_positives,
ROUND(
100.0 * SUM(CASE WHEN label = 'tp' THEN 1 ELSE 0 END)
/ NULLIF(COUNT(*), 0),
1
) AS precision_pct,
SUM(CASE WHEN severity = 'page' THEN 1 ELSE 0 END) AS page_count
FROM incidents
WHERE created_at >= CURRENT_DATE - 90
GROUP BY table_name
ORDER BY incident_count DESC
LIMIT 10;
Step-by-step explanation.
- The WHERE filters to incidents in the last 90 days — a reasonable rolling window for chronic-offender analysis.
-
COUNT(*)counts every incident; the conditional aggregations decompose by label and severity. -
precision_pct = 100 * tp / totalshows whether a high-incident table is actually noisy (low precision) or actually broken (high precision). A noisy table needs better monitors; a broken table needs engineering. -
NULLIF(COUNT(*), 0)is defensive: if a table somehow has zero incidents, the precision divide would otherwise fail. -
ORDER BY incident_count DESC LIMIT 10returns the worst offenders in one query.
Output.
| table_name | incident_count | tp | precision_pct | page_count |
|---|---|---|---|---|
| analytics.orders | 41 | 38 | 92.7 | 8 |
| raw.events | 33 | 22 | 66.7 | 4 |
| dim_marketing_attr | 28 | 9 | 32.1 | 0 |
| fact_revenue_daily | 24 | 23 | 95.8 | 5 |
| stg_users | 19 | 12 | 63.2 | 2 |
Rule of thumb. Sort the chronic-offenders list by precision_pct. The lowest-precision tables need monitor tuning (their alerts are mostly false positives). The highest-precision tables need engineering investment (their alerts are real and the upstream is genuinely broken).
Worked example — postmortem template that ships in the alert
Detailed explanation. The best postmortem is the one that gets filled in. Embedding a structured template inside the alert payload — and letting the on-call complete it in Slack — turns the postmortem from a back-office chore into a 5-minute closing ritual.
Question. Sketch a Slack-native postmortem template attached to a resolved incident. The template should capture what fired, blast radius, root cause, fix, and monitor improvements.
Input — closed incident.
| Field | Value |
|---|---|
| Incident ID | INC-2031 |
| Table | analytics.orders |
| Signal | freshness |
| Detected at | 2026-06-12 14:02 |
| Resolved at | 2026-06-12 15:43 |
| Severity | page |
| Blast radius | 14 models, 3 dashboards |
Code.
{
"postmortem_template": {
"incident_id": "INC-2031",
"what_fired": "freshness > 60 min on analytics.orders",
"blast_radius": "14 dbt models, 3 Looker dashboards, 1 reverse-ETL sync",
"root_cause": "<fill in> — Airflow DAG `orders_etl` failed silently because a Snowflake credential rotated and was not propagated to the secrets manager",
"fix": "<fill in> — re-rotated and propagated; backfilled the missing two hours of orders",
"monitor_improvements": [
"<fill in> — add a healthcheck on the secrets manager",
"<fill in> — tighten DAG `orders_etl` retry window from 4h to 30 min"
],
"follow_ups": [
"open ticket DATA-1456: secrets-rotation runbook",
"schedule a quarterly credential rotation drill"
],
"ttd_minutes": 15,
"ttr_minutes": 101
}
}
Step-by-step explanation.
- The template arrives in Slack the moment the incident is closed. It is pre-populated with the structured fields the platform already knows: incident ID, what fired, blast radius, TTD, TTR.
- The on-call fills in three fields by hand: root cause, fix, monitor improvements. Each field has a 1-3 sentence character cap to keep the writeup terse.
- Follow-ups are open tickets or scheduled events that come out of the postmortem. They get filed automatically.
- The whole postmortem auto-files into a quarterly review dashboard; the engineering manager can read every postmortem of the last quarter in 30 minutes.
- The cost of not having the template is real: postmortems get skipped because they take 30 minutes to format. The cost of having the template is ~5 minutes per incident.
Output.
| Field | Auto-filled | On-call fills |
|---|---|---|
| Incident ID | yes | — |
| What fired | yes | — |
| Blast radius | yes | — |
| TTD / TTR | yes | — |
| Root cause | no | 1-2 sentences |
| Fix | no | 1-2 sentences |
| Monitor improvements | no | 1-3 bullets |
Rule of thumb. Make the postmortem a 5-minute Slack form, not a 30-minute Confluence page. The 5-minute version actually gets written; the 30-minute version sits in the on-call's drafts folder forever.
Data engineering interview question on incident routing
A senior interviewer might frame this as: "It's 3am, a freshness alert fires on dim_marketing_attribution. What happens in your stack from the moment the monitor breaches the threshold to the moment the on-call goes back to bed? Trace every step." It blends process knowledge, on-call empathy, and the kind of routing-rule literacy that separates a designed system from a tangled one.
Solution Using a tiered routing flow tied to severity and blast radius
# incident_handler.py — full lifecycle from detect to learn
from typing import Optional
def handle_incident(
table: str,
signal: str,
severity: str,
blast_radius_usd: int,
hour_utc: int,
) -> dict:
is_business_hours = 13 <= hour_utc <= 22
# 1. Detect — happens before this function runs
# 2. Triage — auto-tag with severity, blast radius
triage = {
"severity": severity,
"blast_radius_usd": blast_radius_usd,
"tier": "critical" if blast_radius_usd >= 100_000 else "standard",
}
# 3. Notify — route by severity, blast radius, time
if triage["tier"] == "critical" and severity == "page":
channels = ["pagerduty", f"slack:#data-oncall"]
elif triage["tier"] == "critical" and not is_business_hours:
channels = [f"slack:#data-oncall-quiet"]
elif triage["tier"] == "critical":
channels = [f"slack:#data-oncall"]
else:
channels = [f"slack:#data-{signal}"]
# 4. Diagnose — attach lineage to the payload
payload = {
"table": table,
"signal": signal,
**triage,
"channels": channels,
"lineage": {"upstream": ["..."], "downstream": ["..."]},
}
return payload
# 3am UTC freshness alert on a low-blast-radius marketing table
print(handle_incident(
table="dim_marketing_attribution",
signal="freshness",
severity="warn",
blast_radius_usd=8_000,
hour_utc=3,
))
Step-by-step trace.
| Step | Action |
|---|---|
| 1 | Monitor fires at 3:00 UTC |
| 2 | Triage classifies severity=warn, blast=$8K, tier=standard |
| 3 | Notify selects channels = #data-marketing |
| 4 | Lineage attaches downstream models to the payload |
| 5 | On-call sees the alert in the morning, no 3am page |
The flow correctly suppresses the 3am page on a non-critical table while still preserving the trail for morning triage. The on-call sleeps; the incident is still tracked; the data team picks it up at 9am.
Output:
| Field | Value |
|---|---|
| table | dim_marketing_attribution |
| severity | warn |
| tier | standard |
| channels | ["slack:#data-marketing"] |
| lineage | (upstream + downstream attached) |
Why this works — concept by concept:
- Tiered notification — critical-tier tables can page off-hours; standard-tier never page off-hours. The two-tier split alone eliminates most alert fatigue.
- Blast radius as the tier key — dollar-impact is the right dimension. A marketing attribution table with $8K blast radius does not deserve a 3am page; a revenue table with $250K blast radius absolutely does.
- Channel routing per signal — freshness, volume, and schema can go to different domain channels even within the same tier. The right people see the right alert.
- Lineage attached by default — the on-call should never have to ask "what does this break" — the payload should already say.
- Quiet hours respect — off-hours alerts on non-critical tables land in a separate channel for morning review. The on-call still sees them, just not at 3am.
- Cost — O(1) per incident. The platform's overhead is negligible; the value comes from preserving the on-call's sleep.
SQL
Topic — event processing
Event processing problems (SQL)
Cheat sheet — observability platform recipes
- "Enterprise-wide lineage with deep warehouse coverage." Default to Monte Carlo. Broadest connector matrix, strongest cross-stack lineage parsed from SQL query logs, longest enterprise paper trail (SOC 2, ISO 27001, HIPAA). Budget assumption: six-figure annual contract.
- "Need strong ML anomaly detection on top of a small dbt project." Anomalo. Hierarchical seasonality-aware baselines, no-code rule editor for non-engineers, Snowflake-native via native app integration. Best when the data product is metric-heavy.
- "dbt-heavy stack, metric-first." Bigeye. Strongest dbt integration of the four — autometrics pulls dbt models, tests, and tags automatically. Per-metric pricing is the most predictable shape in the category.
- "Security forbids data egress." Lightup. Pure pushdown SQL — every check compiles to SQL that runs inside the warehouse. Lightup's plane never sees the data. Best fit for healthcare, banking, and government stacks.
- "We just need freshness + volume + schema, on a budget." Start with OSS Elementary plus dbt source freshness plus a Slack webhook. Sub-10-engineer teams should build before they buy; revisit when the team passes 25.
- "Pilot before committing." Run a 30-day eval on three critical tables with a written scorecard (coverage, detection quality, time-to-first-incident, lineage usefulness, stakeholder readability, cost per coverage). Lock the weights on day 0; do not let the demo change them.
- "Avoid alert fatigue." Tier by blast radius. Page-grade only for high-blast-radius tables with hard SLAs; warn-grade for the rest goes to a domain Slack; info-grade to a low-priority thread. Quiet hours for everything below page-grade.
-
"Integrate with dbt and OpenLineage." Every modern platform consumes
manifest.jsonand emits / consumes OpenLineage events. Confirm during the eval that the platform reads yourmeta.ownerfield — that is the single biggest routing accelerant.
Frequently asked questions
Do I need a data observability platform if I have dbt tests?
You need both, not one. dbt tests are assertions — they catch known-bad shapes you have already thought to write (not_null, unique, accepted_values). An observability platform catches unknown-bad shapes via baselines, monitors freshness with SLAs, watches schema drift on source tables that dbt does not own, and routes incidents through PagerDuty with lineage attached. The two layers do different jobs: dbt tests own assertions, observability platforms own workflow. For a team larger than about 10 engineers with revenue-impacting data, the observability platform usually pays for itself in a single quarter via MTTD compression and avoided rework.
Is Monte Carlo worth the price for a 5-person team?
Almost never. A 5-person team should build with open source — dbt source freshness for the freshness pillar, dbt-elementary for volume and schema, a custom z-score script for the distribution pillar, and a Slack webhook for routing. The platforms become worth their price when the surface area is large enough (25+ engineers, hundreds of dbt models, multiple BI tools, an external SLA obligation) that the marginal engineering hour spent building observability is worth more on the data product itself. For a 5-person team, that hour is almost always better spent on the product.
How does Anomalo compare to Monte Carlo on ML quality?
Anomalo is widely considered the strongest ML detection layer of the four — hierarchical Bayesian baselines, automatic feature engineering on column statistics, and seasonality-aware models that share strength across segments. Monte Carlo's ML is competent but not the headline feature; it focuses more on coverage and lineage. If your blog of incidents is dominated by silent value drift on numerical metrics (refund rate, conversion, retention), Anomalo will typically catch a few more real incidents per quarter. If the incidents are mostly freshness, schema, or blast-radius routing, Monte Carlo's edge in coverage and lineage wins.
What makes Lightup different from Monte Carlo / Bigeye?
Lightup's architectural bet is pure pushdown SQL: every check compiles to SQL that runs inside your warehouse, and Lightup's control plane never sees the underlying data — only the check result. That is fundamentally different from Monte Carlo's agent-in-VPC + control-plane model and Bigeye's metric-collector approach. The trade-off is real: Lightup's security posture is the strongest in the category (zero data egress, satisfies the strictest residency reviews), but its ML detection and lineage capabilities are correspondingly narrower. Pick Lightup when security is a hard constraint; pick the others when it is a preference.
How do these platforms integrate with dbt and OpenLineage?
Every modern platform reads dbt's manifest.json to auto-discover models, tests, sources, and tags — including meta.owner for routing. Bigeye's integration is the deepest (autometrics pulls dbt model definitions into the platform's metric catalog). Monte Carlo and Anomalo are also strong; Lightup is the lightest. On OpenLineage, Monte Carlo and Bigeye are the most active consumers — they ingest lineage events emitted by Airflow, Dagster, Marquez, and other producers. If the team is committed to OpenLineage as the open standard, those two are the safest picks; Anomalo and Lightup are catching up but are not yet at parity.
How long does a typical pilot take to prove value?
Plan on 30 days. Days 1–7 are integration: connect the warehouse, point the platform at three critical tables, configure ownership and SLAs. Days 7–14 are baseline convergence: the ML monitors need a couple of weeks of clean history before they start emitting useful alerts. Days 14–30 are the real test: did the platform catch at least one real incident before anyone else noticed? Document the MTTD compression, MTTR compression, and stakeholder feedback in a written scorecard locked on day 0. A pilot that has not produced at least one true-positive page-grade alert by day 21 is a yellow flag — either the SLAs are too loose or the tables chosen are too quiet. Tighten or switch tables; do not let the pilot drift past 45 days without a decision.
Practice on PipeCode
- Drill the data validation practice library → for the assertion-style checks every observability platform compiles down to.
- Rehearse on aggregation problems → for the COUNT, SUM, AVG patterns that power volume and distribution monitors.
- Sharpen conditional logic drills → for the severity-routing and CASE-based check definitions every platform exposes.
- Stack the grouping library → for the per-segment baseline pattern that powers ML-based monitors.
- Layer the joins practice library → for the FULL OUTER JOIN schema-drift detection pattern.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the warehouse axis with the SQL for data engineering interviews course →.
- For long-form pipeline craft, work through the ETL system design for data engineering interviews course →.
Pipecode.ai is Leetcode for Data Engineering — every observability pattern above ships with hands-on practice rooms where you write the freshness probe, the z-score volume monitor, and the schema-drift FULL OUTER JOIN against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your monitor SQL behaves the same on Snowflake as on BigQuery — the platform tells you.





Top comments (0)