DEV Community

Cover image for Data Observability Platforms Compared: Monte Carlo, Anomalo, Bigeye & Lightup
Gowtham Potureddi
Gowtham Potureddi

Posted on

Data Observability Platforms Compared: Monte Carlo, Anomalo, Bigeye & Lightup

Data observability platforms are the layer between "we test the data" and "we trust the data," and the four names every buyer hears in 2026 — Monte Carlo, Anomalo, Bigeye, and Lightup — represent four very different bets on what that layer should look like. The category started as a marketing rebrand of "data quality monitoring" in 2020 and has since hardened into a real product surface: continuous freshness, volume, schema, distribution, and lineage signals piped into the same incident channels as the rest of your on-call rotation. The buying decision is no longer "should we have a platform" — it is "which of these four shapes fits the warehouse, the team size, and the security review."

This guide is the side-by-side breakdown you wished existed before the demo cycle started. It walks through what data observability actually means in 2026 (versus a Slack channel of broken-dashboard tickets), the five pillars that every vendor scores against, the four-vendor matrix where Monte Carlo competes with Anomalo, Bigeye, and Lightup on different axes, the detection-type landscape (rule-based, ml-based anomaly detection, metadata-only pushdown), and the six-stage incident detection lifecycle that decides whether the platform pays for itself. Each section pairs a teaching block with a worked answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works on the production stack.

PipeCode blog header for a data observability platforms comparison — bold white headline 'Data Observability Platforms' with subtitle 'monte carlo · anomalo · bigeye · lightup' and a stylised radar-style sweep diagram with four vendor orbs on a dark gradient with a small pipecode.ai attribution.

When you want hands-on reps on the SQL primitives every observability platform compiles down to, drill the data validation practice library →, rehearse on aggregation problems →, and stack the diagnosis muscles with conditional logic drills →.


On this page


1. What "data observability" actually means in 2026

Data observability is the shift from "we test the data" to "we monitor the data" — a continuous, lineage-aware view of warehouse health, not a one-off CI check

The one-sentence invariant: data observability is the continuous, automated measurement of freshness, volume, schema, distribution, and lineage signals across every table in the warehouse, with incidents routed to on-call and tied back to the upstream cause via dependency graphs. Once you internalise that "tests are point-in-time assertions, observability is a vital-signs monitor," the entire buyer journey for Monte Carlo, Anomalo, Bigeye, and Lightup becomes a question of what gets measured and who hears the alert.

The three layers most teams confuse.

  • DQ tests (assertions). dbt's not_null, unique, accepted_values, plus custom SQL tests. They run inside the build, they fail fast on the offending model, and they catch known bad shapes. Limitation: they only test what you already thought to write, and they say nothing about volume drift or freshness SLAs unless you wire it yourself.
  • Anomaly detection. A library (pydeequ, great_expectations, in-house Z-score scripts) that learns a baseline and flags outliers. Catches unknown bad shapes. Limitation: it is a model, not a workflow — alerts go nowhere, lineage is missing, and you still have to glue it to PagerDuty by hand.
  • Observability platform. A managed product that ships all three pillars — continuous monitoring, ML-based anomaly detection, lineage-aware incident routing — out of the box. Catches both known and unknown bad shapes, ties them to a graph, and sends them to the same Slack channel where your VP already lives.

Why Slack-noticing-a-broken-dashboard is the wrong incident channel.

  • Mean time to detection (MTTD) is "however long until someone looks." For some dashboards that is 11 minutes; for a finance close report it can be five days. Either way the data team is on the back foot.
  • Mean time to resolution (MTTR) doubles or triples when the upstream cause has already happened twice more by the time it is noticed. The platform's job is to compress both numbers and to prove they are compressed with a dashboard the VP can read.
  • The cost of a silent metric drift is rarely felt by the data team — it is felt by sales ops, finance, or growth, and it shows up as "I no longer trust the data team's numbers." That is an organisational tax larger than the platform cost.

The buyer journey — what every CDO walks through.

  1. Pilot. 30 days, three critical tables, no procurement yet. Score on time-to-first-detected-incident.
  2. Value. Document MTTD / MTTR before and after. Pull two real incidents the platform caught before anyone else noticed.
  3. ROI. Translate the MTTD compression into "engineer hours saved per quarter" and "downstream rework avoided." A six-figure tool is easy to justify when it is replacing a half-FTE of firefighting.
  4. Security review. Network egress posture, SSO, data residency, sub-processors. Lightup wins this slide; Monte Carlo has the longest paper trail; Anomalo and Bigeye sit in between.

Build vs buy economics at different team sizes.

  • 1–5 engineers. Build. Start with dbt test + Elementary (OSS) + a Slack webhook. The platform vendors are too expensive and the surface area is still small.
  • 6–25 engineers. Borderline. If the data product has external SLA obligations, buy. If it is internal-only and the team is hungry for the engineering, build a thin layer on top of OSS.
  • 26+ engineers. Buy. The marginal engineering hour spent on observability infrastructure is worth more on the data product itself, and the bigger the lineage graph the harder the home-grown system gets to maintain.
  • Heavily regulated stacks (healthcare, finance). Buy a security-friendly option (Lightup pushdown, or Monte Carlo with VPC peering). The compliance paperwork is half the value.

Worked example — the four signals a vendor pilot must hit on Day 1

Detailed explanation. A senior interview probe — and a practical pilot checklist — is "what are the four signals an observability platform must emit by end of day one to be worth a quarter of evaluation?" The answer surfaces whether the candidate has actually pushed a platform into production or has only read the marketing site.

Question. A team has imported their warehouse credentials into a vendor sandbox. List the four signals the platform must produce in the first 24 hours, the SQL or metadata source for each, and the alert payload shape that proves the integration works.

Input.

Signal What it measures Source Pass criterion
Freshness Time since last write INFORMATION_SCHEMA.TABLES last_altered Alert fires when stale > SLA
Volume Row count drift Daily COUNT(*) snapshot Alert fires on > 3σ delta
Schema Column add / type change INFORMATION_SCHEMA.COLUMNS diff Alert fires on new / removed col
Distribution Null % drift on critical col COUNT(col) / COUNT(*) Alert fires on > 5pp drift

Code.

-- Freshness probe — runs every 30 min, alerts when > SLA
SELECT
    table_schema,
    table_name,
    DATEDIFF('minute', last_altered, CURRENT_TIMESTAMP) AS mins_since_write
FROM information_schema.tables
WHERE table_schema = 'analytics'
  AND DATEDIFF('minute', last_altered, CURRENT_TIMESTAMP) > 60;

-- Volume drift probe — compares today's COUNT to 7-day baseline
WITH daily AS (
    SELECT
        DATE_TRUNC('day', loaded_at) AS d,
        COUNT(*)                     AS rows_today
    FROM analytics.orders
    WHERE loaded_at >= CURRENT_DATE - 8
    GROUP BY 1
),
stats AS (
    SELECT
        AVG(rows_today)       AS mu,
        STDDEV(rows_today)    AS sigma
    FROM daily
    WHERE d < CURRENT_DATE
)
SELECT
    d, rows_today, mu, sigma,
    (rows_today - mu) / NULLIF(sigma, 0) AS z_score
FROM daily, stats
WHERE d = CURRENT_DATE
  AND ABS((rows_today - mu) / NULLIF(sigma, 0)) > 3;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The freshness probe reads warehouse metadata (INFORMATION_SCHEMA.TABLES), not the table data — so it costs nothing on a per-row basis. The platform schedules it every 30 minutes and compares last_altered against a configured SLA.
  2. The volume probe builds a 7-day rolling baseline of COUNT(*) per day and computes a z-score for today's count against the mean and standard deviation of the baseline. NULLIF(sigma, 0) protects against a constant baseline (which would otherwise raise division-by-zero).
  3. The schema probe diffs INFORMATION_SCHEMA.COLUMNS snapshots. A new column means "schema drift"; a removed column means "breaking change."
  4. The distribution probe samples a COUNT(col) / COUNT(*) ratio on a critical column and compares against a tracked baseline. A 5-percentage-point jump in null rate is a classic upstream-ETL regression signal.
  5. The alert payload shape is JSON with {table, signal, current, expected, z_score, owner, lineage_upstream, lineage_downstream} — those eight fields are the minimum needed for an on-call engineer to triage without opening the warehouse UI.

Output.

Hour Signal Status Notes
H+0 freshness green all tables under SLA
H+6 volume yellow orders +2.4σ vs baseline
H+14 schema red new column is_refund_flag
H+20 distribution green null % stable

Rule of thumb. A pilot that does not emit at least one real alert in the first 72 hours on critical tables is failing — either the SLAs are too loose, or the baselines have not converged. Tighten the SLA and lower the z-threshold to force at least one signal, then judge precision.

Worked example — the dbt test vs observability platform overlap

Detailed explanation. The most common dismissal of observability platforms is "we already have dbt tests." It is half right and half wrong, and the platform's value-prop hinges on naming the half that dbt tests cannot cover.

Question. A team runs 600 dbt tests per build and asks why they should pay for an observability platform. Catalogue what dbt tests catch, what they miss, and where the platform pays for itself.

Input.

Capability dbt test Observability platform
Assertions on known shapes yes yes
Anomaly detection on unknown shapes no yes
Freshness SLA monitoring weak (dbt source freshness) yes
Volume drift monitoring no yes
Schema drift on source tables no yes
Lineage-aware incident routing no (Elementary helps) yes
On-call PagerDuty integration DIY yes
Stakeholder-readable status page no yes

Code.

# A typical dbt test catches a known shape
- name: not_null_orders_customer_id
  description: order rows must have a customer_id
  test_metadata:
    name: not_null
    columns: [customer_id]
  config:
    severity: error
Enter fullscreen mode Exit fullscreen mode
# An anomaly detector catches an unknown shape
def detect_volume_anomaly(history, today, z_threshold=3.0):
    """Flag today's row count if it deviates > z_threshold sigma from history."""
    mu, sigma = statistics.mean(history), statistics.stdev(history)
    if sigma == 0:
        return None
    z = (today - mu) / sigma
    if abs(z) >= z_threshold:
        return {"severity": "warn" if abs(z) < 5 else "page", "z": z}
    return None
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dbt not_null test asserts a known invariant: this column should never be NULL. The platform's anomaly detector measures a baseline and flags drift — even when the team has not yet thought to assert anything.
  2. dbt tests only run on dbt-managed models. The platform monitors every table in the warehouse, including the raw landings dbt does not own.
  3. dbt's source freshness is a per-source flag, not a continuous SLA. The platform polls metadata every N minutes and alerts the moment SLA is breached.
  4. dbt knows nothing about volume or schema drift on the source side. The platform's metadata snapshot covers both.
  5. When dbt tests fail, the failure is in the build log — there is no PagerDuty integration, no Slack thread, no lineage breadcrumb pointing to what changed upstream. The platform owns the workflow side; dbt owns the assertion side. Both layers belong in a healthy stack.

Output.

Bug class dbt test catches? Platform catches?
customer_id is NULL yes yes
Row count drops 40% overnight no yes
New column appears in source no yes
Refund rate jumps from 2% to 11% no yes
Daily refresh is 4 hours late partial yes

Rule of thumb. Keep your dbt tests — they are cheap, fast, and the right tool for known invariants. Add an observability platform to catch the next bug class: drift, freshness SLAs, schema changes on source tables, and lineage-aware routing.

Worked example — a 30-day pilot scorecard

Detailed explanation. Vendors will happily extend a 14-day trial to 90 days if you let them. The pilot scorecard pins both sides to a concrete grading rubric so the trial does not drift into "well, we kind of liked it" territory.

Question. Design a 30-day pilot scorecard for an observability platform on three critical tables. The scorecard should weight signal quality, alert precision, and stakeholder communication.

Input — scoring rubric.

Category Weight Example metric
Coverage 20% tables monitored / total tables
Detection quality 25% true positives / total alerts
Time-to-first-incident 20% hours from import to first real alert
Lineage usefulness 15% % incidents linked to upstream cause
Stakeholder readability 10% non-data-team thumbs-up rate
Cost / coverage 10% $ per monitored table per month

Code.

# Score a pilot run from a CSV of alerts and outcomes
import pandas as pd

def score_pilot(alerts: pd.DataFrame) -> dict:
    """alerts has columns: signal, label (tp/fp), hours_to_detect, has_lineage."""
    tp = (alerts["label"] == "tp").sum()
    fp = (alerts["label"] == "fp").sum()
    precision = tp / max(tp + fp, 1)
    avg_ttd = alerts.loc[alerts["label"] == "tp", "hours_to_detect"].mean()
    lineage_pct = alerts["has_lineage"].mean()
    return {
        "precision": round(precision, 2),
        "avg_ttd_hours": round(avg_ttd, 2),
        "lineage_pct": round(lineage_pct, 2),
        "total_alerts": len(alerts),
    }
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Coverage is the easy first cut: did the platform actually wire up all three pilot tables and emit signals on them? If yes, score 20/20. If only two of three connected, score 13/20.
  2. Detection quality is precision: of the alerts the platform sent, how many were real? A precision below 0.5 is alert fatigue; above 0.8 is excellent.
  3. Time-to-first-incident measures how quickly the baseline converges. Most platforms need 7–14 days of history to learn seasonality; alerts before then are heuristic, alerts after are model-based.
  4. Lineage usefulness is the most important qualitative metric: of the real incidents, how many had an upstream cause identified within the platform itself (versus the on-call having to open the warehouse manually)?
  5. Stakeholder readability is the "did the marketing PM understand the status page" check. Platforms that ship a readable status page win this row by default.
  6. Cost / coverage is a simple cents-per-table-per-month divided by total tables. Use list prices for the scorecard and renegotiate at procurement.

Output (sample scorecard).

Vendor Coverage Detection TTI Lineage Readability $/cov Total
Monte Carlo 20 22 16 14 8 6 86
Anomalo 18 24 14 11 7 7 81
Bigeye 19 20 18 12 8 8 85
Lightup 16 19 17 9 6 9 76

Rule of thumb. A scorecard with weights set before the demos forces the team to argue about what matters, not which vendor's slides looked better. Lock the weights on day 0; revisit only if a category was genuinely missing.

Data engineering interview question on observability ROI

A senior interviewer often opens with: "Your VP asks why we should spend $200K/year on Monte Carlo when we already have dbt tests, Slack, and a smart on-call team. Walk me through the ROI calculation and the two real metrics you would put on a one-slide pitch." It blends financial framing, observability fundamentals, and stakeholder communication — the three muscles that decide whether the platform actually lands.

Solution Using a four-line ROI calculation tied to MTTD and MTTR compression

# observability_roi.py
# Returns annualised dollar value of moving MTTD and MTTR by N hours.

def observability_roi(
    incidents_per_year: int,
    mttd_before_h: float,
    mttd_after_h: float,
    mttr_before_h: float,
    mttr_after_h: float,
    cost_per_hour_of_bad_data: float,
    engineer_hourly_cost: float,
    avg_engineers_per_incident: float = 1.5,
) -> dict:
    """Two terms: bad-data cost saved, and engineer-hour cost saved."""
    bad_data_hours_saved = (
        incidents_per_year
        * ((mttd_before_h - mttd_after_h) + (mttr_before_h - mttr_after_h))
    )
    eng_hours_saved = (
        incidents_per_year
        * (mttr_before_h - mttr_after_h)
        * avg_engineers_per_incident
    )

    bad_data_value = bad_data_hours_saved * cost_per_hour_of_bad_data
    eng_value = eng_hours_saved * engineer_hourly_cost

    return {
        "bad_data_hours_saved": bad_data_hours_saved,
        "eng_hours_saved": eng_hours_saved,
        "total_value_usd": round(bad_data_value + eng_value, 0),
    }


print(
    observability_roi(
        incidents_per_year=120,
        mttd_before_h=18,
        mttd_after_h=2,
        mttr_before_h=10,
        mttr_after_h=4,
        cost_per_hour_of_bad_data=400,
        engineer_hourly_cost=120,
    )
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Variable Before After Delta per incident
MTTD (hours) 18 2 -16
MTTR (hours) 10 4 -6
Bad-data hours saved per incident 22
Engineer hours saved per incident 9

With 120 incidents per year, the team saves 120 * 22 = 2,640 bad-data hours and 120 * 9 = 1,080 engineer hours. At $400 / hour of bad-data cost and $120 / engineer hour, that is 1,056,000 + 129,600 = $1,185,600 of annualised value. A $200K platform clears the bar.

Output:

Metric Value
bad_data_hours_saved 2640
eng_hours_saved 1080
total_value_usd 1,185,600

Why this works — concept by concept:

  • MTTD compression — the dominant term in the ROI math. Catching an incident in 2 hours instead of 18 saves 16 hours of bad data flowing through dashboards, reverse-ETL syncs, and downstream models. Most pilots show 4–10x MTTD compression.
  • MTTR compression — secondary but real. The platform's lineage view often cuts diagnosis time in half because the on-call sees the upstream cause immediately.
  • Cost per hour of bad data — the hardest input to estimate but the most important. Use a conservative number (revenue-attributed-to-data divided by working hours) for the pitch and an aggressive number for internal modeling.
  • Engineer hourly cost — fully loaded, not salary. A senior data engineer is roughly $120-$200 fully loaded per hour in a US metro market.
  • One-slide pitch metrics — MTTD compression and dollar value saved. Skip the feature matrix; the VP only needs the two numbers and the platform tier.
  • Cost — O(1) compute. The model is a four-line linear function; the difficulty is calibrating the inputs, not running it.

SQL
Topic — data validation
Data validation problems (SQL)

Practice →


2. The five pillars of data observability

Freshness, volume, schema, distribution, and lineage are the five vital signs every observability platform monitors — every other feature folds into one of these five

The mental model in one line: a data table is healthy when its freshness, volume, schema, distribution, and lineage are all within their learned baselines — and any platform you evaluate either implements all five well or has a credibility gap. Once you can name the five pillars cold, the vendor demos collapse into "how does this product instrument each one?" — and you can compare like-with-like.

Five vertical pillar columns labelled Freshness, Volume, Schema, Distribution, and Lineage, each with a tiny icon on top and three example-signal pills inside, sitting under a thin horizontal 'observability' header band, on a light PipeCode card.

Pillar 1 — Freshness.

  • What it measures. Time elapsed since the table was last written. The baseline is the expected cadence (e.g. "this table should refresh every hour during business hours, every 6 hours overnight").
  • How it is computed. From warehouse metadata — INFORMATION_SCHEMA.TABLES.last_altered in Snowflake, pg_stat_user_tables in Postgres, INFORMATION_SCHEMA.PARTITIONS in BigQuery.
  • Common alert. "Table analytics.orders has been stale for 4h 22m; SLA is 2h." Routed to the table owner via Slack with a link to the upstream dbt model.
  • Why it matters. Freshness regressions are the single most common silent failure in modern warehouses. An ETL DAG fails, the table stops refreshing, and dashboards show stale-but-plausible numbers until a human notices.

Pillar 2 — Volume.

  • What it measures. Row count distribution day over day. The baseline is a rolling window (typically 7-30 days) with optional weekday seasonality.
  • How it is computed. Daily COUNT(*) snapshots, optionally bucketed by partition or by loaded_at date. Compared against the baseline with a z-score or interquartile range test.
  • Common alert. "Today's orders row count is 1,240 — 7-day baseline is 5,300 ± 800 (z = -5.1)." This is the "did a partition fail to load" signal.
  • Why it matters. Sudden zeros, half-loads, and dedup-gone-wrong issues all show up as volume anomalies long before they manifest as wrong dashboard numbers.

Pillar 3 — Schema.

  • What it measures. Drift in the table's column list and types. The baseline is the previous snapshot.
  • How it is computed. A daily diff of INFORMATION_SCHEMA.COLUMNS. Detects added columns, removed columns, type changes, and ordering changes.
  • Common alert. "Column customer_email was renamed to email_hashed between yesterday and today." This is the "did upstream change the contract without telling us" signal.
  • Why it matters. Breaking schema changes propagate downstream silently — a removed column makes a dbt model fail on the next build, a type widening makes a join return wrong rows, a renamed column makes a dashboard go to NULL.

Pillar 4 — Distribution.

  • What it measures. Statistical properties of the values inside the table — null %, cardinality, mean, median, top-N value frequencies, and per-segment skew. The baseline is a rolling distribution profile.
  • How it is computed. Periodic sampled scans (because full scans on a billion-row table are expensive). Compared against the baseline with a Kolmogorov-Smirnov-style test or a per-quantile delta.
  • Common alert. "Null rate on analytics.orders.shipping_country jumped from 0.8% to 18.4% in the last 24h." This is the "did an upstream field stop being populated" signal.
  • Why it matters. Distribution drift is the most subtle and the most damaging. Numbers still tally up; row counts are fine; the meaning of the data has shifted under the dashboard.

Pillar 5 — Lineage.

  • What it measures. The dependency graph between tables, dbt models, BI dashboards, and downstream destinations. Not a signal itself but a context for every other signal.
  • How it is computed. Parsed from SQL query logs (Monte Carlo's approach), dbt manifest (Bigeye's approach), OpenLineage events (the open standard), or static SQL parsing (Lightup).
  • Common alert. "Freshness incident on raw.events will affect 14 downstream models and 3 dashboards." This is the blast radius signal.
  • Why it matters. Without lineage, every incident triage starts with "what depends on this?" — a manual archaeology step that doubles MTTR. With lineage, the on-call sees the blast radius in the alert payload.

Bonus pillar — usage and cost.

  • What it measures. Query patterns, slot consumption, $/query, and stale-table waste (tables no one reads).
  • Why it counts. A table no one queries is silently expensive and a table no one writes is silently broken — usage and cost signals catch both. Increasingly treated as quality-adjacent rather than separate.

How the five pillars map to alert routing.

  • Freshness → table owner, Slack, low severity (engineers expect to fix it during business hours).
  • Volume → table owner + on-call, Slack + PagerDuty, medium severity (could indicate a partial load).
  • Schema → table owner + downstream owners, Slack + ticket, high severity (likely a breaking change).
  • Distribution → table owner + analytics owner, Slack, medium-low severity (often a signal of upstream business change, not a pure tech failure).
  • Lineage → not routed alone; attached as context to every other alert.

Worked example — implementing the freshness pillar in raw SQL

Detailed explanation. The cheapest freshness check is a single query against the warehouse metadata table — no scans, no compute, runs in milliseconds. Building one yourself is the right starting point to understand what the platform is doing under the hood.

Question. Write a Snowflake query that lists every table in the analytics schema whose last write is older than its expected SLA. Use a sidecar table for per-table SLAs.

Input.

table_name sla_minutes
orders 60
customers 1440
events_raw 30
dim_country 10080

Code.

-- Freshness probe — joins metadata to a per-table SLA sidecar
WITH last_writes AS (
    SELECT
        TABLE_SCHEMA,
        TABLE_NAME,
        LAST_ALTERED,
        TIMESTAMPDIFF('minute', LAST_ALTERED, CURRENT_TIMESTAMP)
            AS minutes_stale
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'ANALYTICS'
)
SELECT
    lw.TABLE_NAME,
    lw.LAST_ALTERED,
    lw.minutes_stale,
    sla.sla_minutes,
    lw.minutes_stale - sla.sla_minutes AS minutes_over_sla,
    CASE
        WHEN lw.minutes_stale > sla.sla_minutes * 2 THEN 'page'
        WHEN lw.minutes_stale > sla.sla_minutes     THEN 'warn'
        ELSE                                             'ok'
    END AS severity
FROM last_writes lw
JOIN freshness_slas sla
  ON LOWER(sla.table_name) = LOWER(lw.TABLE_NAME)
WHERE lw.minutes_stale > sla.sla_minutes;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The CTE pulls LAST_ALTERED from warehouse metadata — no scan over the table data. This is the same source every observability platform uses for the freshness pillar.
  2. TIMESTAMPDIFF computes minutes-since-last-write. We join against a sidecar table of per-table SLAs because not every table refreshes on the same cadence.
  3. The severity CASE creates a three-band signal: ok when within SLA, warn when over, page when 2x over. This matches the way most platforms tier severity.
  4. The final WHERE clause keeps only rows that breached SLA — those become the alert payloads.
  5. To make this continuous, schedule it every 5 minutes via Snowflake Tasks or an external orchestrator. The cost is negligible because metadata reads are free.

Output.

TABLE_NAME LAST_ALTERED minutes_stale sla_minutes severity
events_raw 14:02:11 47 30 warn
orders 13:18:50 91 60 warn

Rule of thumb. Build the freshness pillar yourself first, even if you plan to buy. The exercise teaches you what the platform should report — and it gives you a fallback if procurement stalls.

Worked example — building the volume pillar with a rolling z-score

Detailed explanation. Volume drift is the second-most-common silent failure. A daily z-score against a 7-day baseline catches both half-loads (negative z) and double-loads (positive z) with a single test.

Question. Write a SQL query that produces a z-score for today's row count against a 7-day baseline of the orders table, with NULLIF protection for constant-baseline days.

Input (last 8 days of row counts).

day rows
-7 5280
-6 5410
-5 5125
-4 5390
-3 5340
-2 5275
-1 5300
0 (today) 1240

Code.

WITH daily AS (
    SELECT
        DATE_TRUNC('day', loaded_at) AS d,
        COUNT(*)                     AS rows_today
    FROM analytics.orders
    WHERE loaded_at >= CURRENT_DATE - 8
    GROUP BY 1
),
baseline AS (
    SELECT
        AVG(rows_today)    AS mu,
        STDDEV(rows_today) AS sigma
    FROM daily
    WHERE d < CURRENT_DATE
)
SELECT
    d,
    rows_today,
    mu,
    sigma,
    ROUND(
        (rows_today - mu) / NULLIF(sigma, 0),
        2
    ) AS z_score,
    CASE
        WHEN ABS((rows_today - mu) / NULLIF(sigma, 0)) > 5 THEN 'page'
        WHEN ABS((rows_today - mu) / NULLIF(sigma, 0)) > 3 THEN 'warn'
        ELSE                                                   'ok'
    END AS severity
FROM daily, baseline
WHERE d = CURRENT_DATE;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The daily CTE rolls up row counts per day. The window is 8 days so that today plus the 7-day baseline are both in scope.
  2. The baseline CTE computes mean and standard deviation over the non-today days only — separating the test set from the training set.
  3. The outer SELECT computes today's z-score: (today - mu) / sigma. A z of -5 means today is 5 standard deviations below the baseline.
  4. NULLIF(sigma, 0) prevents a division-by-zero on a constant-baseline day. If sigma is 0, the z-score becomes NULL, the severity falls to ok, and the platform does not page on a false signal.
  5. The severity bands at z > 3 (warn) and z > 5 (page) are reasonable defaults. Tighten or loosen based on the table's natural variance.

Output.

d rows_today mu sigma z_score severity
2026-06-15 1240 5302.86 96.13 -42.21 page

Rule of thumb. Use z-score for the test, log the raw mean and sigma in the alert payload, and always wrap the denominator in NULLIF. The two-line guard against constant-baseline days saves more false pages than any other tuning.

Worked example — schema drift detection via column-set diff

Detailed explanation. Schema drift is detected by comparing today's column snapshot against yesterday's. The simplest implementation is a daily snapshot table and a FULL OUTER JOIN to find adds, removes, and type changes in one query.

Question. Given a schema_snapshots table with daily snapshots, write a query that flags every adds, drops, and type-change between yesterday and today on the analytics.orders table.

Input (yesterday vs today).

snapshot_date column_name data_type
-1 order_id NUMBER
-1 customer_id NUMBER
-1 amount NUMBER
-1 status VARCHAR
0 order_id NUMBER
0 customer_id VARCHAR
0 amount NUMBER
0 is_refund_flag BOOLEAN

Code.

WITH yesterday AS (
    SELECT column_name, data_type
    FROM schema_snapshots
    WHERE table_name = 'orders'
      AND snapshot_date = CURRENT_DATE - 1
),
today AS (
    SELECT column_name, data_type
    FROM schema_snapshots
    WHERE table_name = 'orders'
      AND snapshot_date = CURRENT_DATE
)
SELECT
    COALESCE(y.column_name, t.column_name) AS column_name,
    y.data_type AS yesterday_type,
    t.data_type AS today_type,
    CASE
        WHEN y.column_name IS NULL                          THEN 'added'
        WHEN t.column_name IS NULL                          THEN 'removed'
        WHEN y.data_type IS DISTINCT FROM t.data_type       THEN 'type_changed'
        ELSE                                                     'same'
    END AS change_type
FROM yesterday y
FULL OUTER JOIN today t
              ON y.column_name = t.column_name
WHERE y.column_name IS NULL
   OR t.column_name IS NULL
   OR y.data_type IS DISTINCT FROM t.data_type;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The two CTEs pull yesterday's and today's column snapshots into named scopes.
  2. FULL OUTER JOIN is the right tool: it emits every row from both sides so that adds (today-only), drops (yesterday-only), and type-changes (both sides) all show up.
  3. COALESCE(y.column_name, t.column_name) picks the non-NULL column name so the output has a single column-name column.
  4. The CASE classifies each row using IS NULL for adds and removes, and IS DISTINCT FROM for type changes — NULL-safe inequality is critical here because either side may be NULL.
  5. The final WHERE keeps only changes, dropping the noise of identical columns.

Output.

column_name yesterday_type today_type change_type
customer_id NUMBER VARCHAR type_changed
status VARCHAR NULL removed
is_refund_flag NULL BOOLEAN added

Rule of thumb. Schema drift detection is the easiest pillar to home-grow. Snapshot daily, FULL OUTER JOIN against yesterday, and you have 80% of what a platform's schema pillar provides — minus the lineage routing.

Data engineering interview question on the five-pillars contract

A senior interviewer often opens with: "Name the five pillars of data observability and explain what alert each one produces. Then tell me which pillar misses the most production incidents at your last job." It blends framework recall, real-world judgement, and the kind of honesty that distinguishes a senior engineer from a junior.

Solution Using a five-pillar implementation matrix tied to incident classes

-- Five-pillar observability matrix: which pillar catches which incident class
WITH pillars AS (
    SELECT 'freshness'    AS pillar, 'stale table'           AS incident_class,
           'metadata only' AS source, 60                     AS cost_per_run_seconds UNION ALL
    SELECT 'volume',        'half-load / double-load',
           'COUNT(*) daily', 30                                                   UNION ALL
    SELECT 'schema',        'breaking change',
           'INFORMATION_SCHEMA diff', 5                                            UNION ALL
    SELECT 'distribution',  'silent value drift',
           'sampled scan',  300                                                   UNION ALL
    SELECT 'lineage',       'blast radius unknown',
           'query log parse', 0
)
SELECT
    pillar,
    incident_class,
    source,
    cost_per_run_seconds,
    CASE
        WHEN cost_per_run_seconds = 0   THEN 'free'
        WHEN cost_per_run_seconds < 60  THEN 'cheap'
        WHEN cost_per_run_seconds < 300 THEN 'medium'
        ELSE                                 'expensive'
    END AS cost_band
FROM pillars
ORDER BY cost_per_run_seconds;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

pillar incident_class source seconds band
lineage blast radius unknown query log parse 0 free
schema breaking change INFORMATION_SCHEMA diff 5 cheap
volume half-load / double-load COUNT(*) daily 30 cheap
freshness stale table metadata only 60 cheap
distribution silent value drift sampled scan 300 medium

The trace highlights that distribution monitoring is the most expensive pillar to run — sampled scans on a billion-row table cost real warehouse compute — and lineage is effectively free because the query log already exists.

Output:

Pillar Catches
Freshness ETL DAG died silently
Volume Partition failed to load
Schema Upstream renamed a column
Distribution Refund rate jumped 5x
Lineage Blast radius of any incident

Why this works — concept by concept:

  • The five-pillar framework — every modern observability platform organises its product around these five signals; if you can speak them fluently you can map any vendor demo onto the same axes.
  • Cost band drives sampling — freshness, volume, and schema are essentially free, so run them continuously. Distribution is expensive, so sample on critical columns only and on a slower cadence.
  • Lineage is the multiplier — it does not detect anything on its own, but it triples the value of every other signal by adding blast-radius context to the alert payload.
  • Incident class mapping — pillar-to-incident matching is the right way to think about coverage. If your incident postmortem cites a class the platform does not monitor, the platform is failing that incident class.
  • Sampling cadence — match pillar cadence to cost. Freshness every 5 min, volume daily, schema daily, distribution weekly on most tables and daily on the most critical ones.
  • Cost — pillar-by-pillar; the matrix scales linearly with the number of tables monitored. Total compute cost is usually < 1% of warehouse spend.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →


3. The four vendors — Monte Carlo vs Anomalo vs Bigeye vs Lightup

Monte Carlo is the category creator; Anomalo bets on ML; Bigeye bets on dbt; Lightup bets on security — pick the bet that matches your stack

The mental model in one line: each of the four vendors made a different architectural bet, and the right buy is the one whose bet matches your stack's primary constraint — coverage, ML depth, dbt-nativeness, or security. Once you can name those four bets cold, the demos compress into a single afternoon.

Two-by-two grid quadrant chart with axes 'Coverage breadth' and 'ML detection depth' and four vendor cards placed in each quadrant — Monte Carlo top-right, Anomalo top-left, Bigeye bottom-right, Lightup bottom-left — each card with a tiny strength badge, on a light PipeCode card.

Vendor 1 — Monte Carlo: the category creator with the broadest lineage and warehouse coverage.

  • Founded. 2019 in San Francisco. Coined the term "data observability platform" in their early marketing.
  • Architecture. Agent runs in your VPC, reads warehouse metadata and query logs, ships signals back to Monte Carlo's control plane.
  • Strengths. Broadest warehouse connector matrix (Snowflake, BigQuery, Redshift, Databricks, Postgres, dbt, Airflow, Looker, Tableau, Mode, Sigma). Best lineage graph in the category, parsed from raw SQL query logs across the entire stack. Strong enterprise sales and SOC 2 + ISO 27001 + HIPAA posture.
  • Weaknesses. Heaviest deployment (agent + control plane + sometimes a metadata sidecar in your warehouse). Per-table list pricing climbs fast on big enterprises (typical six-figure annual contracts). Some teams report the ML baselines need more tuning than the marketing implies.
  • Best fit. Large enterprise (250+ data engineers, hundreds of dbt models, dozens of dashboards, multiple warehouses), strict compliance, lineage-first incident triage workflow.

Vendor 2 — Anomalo: deep ML anomaly detection on top of no-code rule authoring.

  • Founded. 2018 in Palo Alto, spun out of Instacart's data-quality team.
  • Architecture. SaaS control plane reads from your warehouse via a service account. Pushdown SQL for most checks; ML models run in Anomalo's compute.
  • Strengths. Deepest ML detection in the category — seasonality-aware baselines, hierarchical models, automatic feature engineering on top of column statistics. No-code rule editor for non-engineers (data stewards, analysts). Snowflake-native via native app integration.
  • Weaknesses. Coverage matrix narrower than Monte Carlo's (Snowflake / BigQuery / Databricks / Redshift / Postgres, but lighter on BI tool lineage). Smaller ecosystem of integrations. ML-first stance means a longer time-to-first-detection while baselines train.
  • Best fit. Mid-market teams (25–250 engineers) with a heavy ML / analytics product, where deep anomaly detection on numerical metrics is worth more than broad lineage coverage.

Vendor 3 — Bigeye: metric-first, autometrics, dbt-native.

  • Founded. 2019 in San Francisco by ex-Uber data infrastructure engineers.
  • Architecture. Agent reads warehouse metadata and runs metric collectors. Metric definitions live in YAML or in the UI; dbt manifest integration auto-discovers models.
  • Strengths. Strongest dbt integration of the four — autometrics pulls dbt model owners, tests, and tags into the platform automatically. Per-metric pricing model is predictable. Engineering-friendly: YAML-as-code metric definitions; treats observability as code, not as a UI.
  • Weaknesses. Lineage graph is less comprehensive than Monte Carlo's. ML detection is competent but not the headline feature. Smaller install base in heavily regulated industries.
  • Best fit. dbt-heavy stacks (10K+ dbt models or a strong dbt practice), engineering teams that want code-defined metrics, mid-market data platforms.

Vendor 4 — Lightup: pushdown SQL checks, no data egress, security-friendly.

  • Founded. 2019 in San Francisco. Acquired by Acceldata in 2023; now operates as part of a broader observability suite.
  • Architecture. Pure pushdown — every check compiles to SQL that runs inside your warehouse. Lightup's plane never sees the underlying data, only the check result.
  • Strengths. Best security story by a wide margin: zero data egress, runs inside the warehouse, satisfies the strictest data residency reviews. Lowest network attack surface. Strong fit for healthcare, financial services, and government.
  • Weaknesses. ML detection narrower than Anomalo's — most checks are rule-based or basic statistical thresholds. Lineage capability is the weakest of the four. Coverage matrix is the smallest (Snowflake, BigQuery, Databricks, Postgres, SQL Server primarily).
  • Best fit. Regulated stacks where security review forbids agent-in-VPC + control-plane-sees-data architecture. Healthcare, banking, public sector.

Pricing model shapes.

  • Monte Carlo — per-table. List price is in the $1.5K–$3K per monitored table per year range. A 250-table deployment is $400K-$750K depending on volume discounting.
  • Anomalo — per-source / per-check. Hybrid model: a flat per-source fee plus marginal per-check fees. Smaller deployments end up cheaper than Monte Carlo; enterprise deployments converge.
  • Bigeye — per-metric. The most predictable model. List is roughly $50 per metric per month; 200 metrics is $120K/year. Easy to forecast.
  • Lightup — per-source / per-credit. Per-source base plus credits consumed by check runs. Best fit for teams that monitor a few sources heavily, less attractive for "monitor everything" stances.

A scoring rubric (each row 0–10).

Axis Monte Carlo Anomalo Bigeye Lightup
Coverage breadth 9 7 7 6
ML detection quality 7 9 7 5
dbt-nativeness 7 7 9 6
Security posture 7 7 7 9
Time-to-value 6 7 8 8
Lineage graph 9 6 7 5
Pricing transparency 6 7 8 7
Total 51 50 53 46

The rubric makes the trade-offs visible: Bigeye scores highest on the aggregate because of dbt-nativeness, time-to-value, and pricing transparency — but Monte Carlo wins coverage and lineage, Anomalo wins ML, and Lightup wins security. There is no universal winner.

Worked example — picking the right vendor for a four-stack profile

Detailed explanation. The buying decision is a constraint-satisfaction problem, not a beauty contest. Each team has a primary constraint (security, ML, dbt-ness, or coverage) and a budget envelope. Mapping the four vendors against four stack profiles gives a defensible recommendation.

Question. Given four stack profiles — an enterprise insurance company, a mid-market growth SaaS, a dbt-heavy fintech, and an early-stage AI startup — recommend the best-fit observability vendor for each and explain why.

Input.

Stack profile Constraint Team size Stack
Insurance Security + lineage 80 DE Snowflake + Looker + Airflow
Growth SaaS ML on numerical metrics 25 DE BigQuery + dbt + Hex
dbt-heavy fintech dbt-native + predictable price 40 DE Snowflake + 5K dbt models + Mode
AI startup Low cost + fast pilot 6 DE BigQuery + dbt-core + Metabase

Code.

# Vendor recommendation function — naive decision tree
def recommend_vendor(profile: dict) -> str:
    if profile.get("regulated") or profile.get("security_blocker"):
        return "Lightup"
    if profile.get("ml_critical"):
        return "Anomalo"
    if profile.get("dbt_models", 0) > 1000:
        return "Bigeye"
    if profile.get("team_size", 0) >= 50 and profile.get("coverage_breadth"):
        return "Monte Carlo"
    if profile.get("team_size", 0) <= 10:
        return "Elementary (OSS) + dbt source freshness"
    return "Bigeye"  # default middle-of-the-road
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The first branch is the security blocker. Insurance, healthcare, banking — if data egress is forbidden, Lightup is the only architecturally viable option of the four.
  2. The second branch is ML criticality. If the team's product is ML (recommender systems, fraud detection, personalisation) and they need anomaly detection on numerical metrics, Anomalo's depth wins.
  3. The third branch is dbt-heaviness. A 5K-model dbt project pays off on Bigeye's autometrics: the platform auto-discovers models and tests, and metric definitions live alongside dbt code.
  4. The fourth branch is enterprise breadth. Large teams with hundreds of tables, multiple BI tools, and a need for cross-stack lineage gravitate to Monte Carlo's coverage matrix.
  5. The fifth branch is the AI startup escape hatch — under 10 engineers, the right answer is usually open-source (dbt-elementary, dbt source freshness, a Slack webhook) until the team grows.
  6. The default for "no specific constraint" is Bigeye, because it scores highest on the rubric for a generic mid-market stack.

Output.

Stack profile Recommendation Why
Insurance Monte Carlo (or Lightup if egress blocked) broad coverage + best lineage
Growth SaaS Anomalo deepest ML on numerical metrics
dbt-heavy fintech Bigeye dbt-native autometrics, predictable price
AI startup Elementary OSS for now sub-10 DE team, build before buy

Rule of thumb. Always ask the buyer "what is the primary constraint?" first. If they answer "I don't have one," they are not ready to buy a six-figure platform — they should build with OSS for another quarter, identify the constraint, and revisit.

Worked example — Monte Carlo's lineage edge in incident triage

Detailed explanation. Monte Carlo's lineage graph is parsed from raw SQL query logs across the entire stack — warehouse, dbt, BI tools — and that graph turns every incident alert into a "what does this break" payload, not just a "what broke" one. The triage time savings are real and measurable.

Question. A freshness incident fires on the raw.events table. Walk through how Monte Carlo's lineage would render the alert payload versus a non-lineage platform.

Input.

Stage Without lineage With Monte Carlo lineage
Alert payload "raw.events stale" "raw.events stale → 14 dbt models, 3 dashboards, 2 reverse-ETL syncs affected"
Triage step 1 Open dbt graph, find downstream (skipped — graph in alert)
Triage step 2 Open Looker, find dashboards (skipped — listed in alert)
Triage step 3 Open Hightouch, find syncs (skipped — listed in alert)
Triage step 4 Page on-call for each owner One paging hop with owners in payload

Code.

# Example alert payload shape with lineage
alert = {
    "table": "raw.events",
    "signal": "freshness",
    "severity": "page",
    "minutes_stale": 174,
    "sla_minutes": 60,
    "upstream": [
        {"system": "kafka", "topic": "events.v2"},
    ],
    "downstream": [
        {"system": "dbt", "model": "stg_events", "owner": "data-platform"},
        {"system": "dbt", "model": "fct_user_session", "owner": "growth-de"},
        {"system": "looker", "dashboard": "User Funnel", "owner": "growth-pm"},
        {"system": "hightouch", "sync": "salesforce_users", "owner": "rev-ops"},
    ],
    "blast_radius_count": 19,
    "suggested_owners": ["data-platform", "growth-de"],
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Without lineage, the on-call sees the freshness alert and has to manually open dbt, Looker, and Hightouch to figure out what depends on the broken table. Each step is a context switch.
  2. With Monte Carlo's lineage, the alert payload arrives with the entire blast radius pre-computed: every downstream model, dashboard, and sync is listed, with owner attribution.
  3. The on-call pages the right two teams (data-platform + growth-de) in one hop instead of three. Stakeholder comms goes out in parallel because the owners-of-the-impacted-dashboards are listed.
  4. MTTR for the incident drops because the lineage discovery step (which is half of the time-to-diagnose) is skipped.
  5. The lineage graph is also updated incrementally as new SQL queries are observed, so it stays accurate without anyone maintaining it.

Output.

Metric Without lineage With Monte Carlo lineage
Time-to-blast-radius 35 min 0 min
Owners notified 1 generic on-call 2 specific teams
MTTR 95 min 38 min

Rule of thumb. When evaluating vendors, ask each one to render the alert payload for a real freshness incident. The vendor whose payload includes downstream models, dashboards, and reverse-ETL syncs with owner attribution is doing lineage right — and is worth the premium.

Worked example — Anomalo's seasonality-aware baseline

Detailed explanation. Anomalo's ML edge shows up most clearly on metrics with strong weekday/weekend seasonality. A naive 7-day average flags every Monday as an "anomaly"; Anomalo's hierarchical model learns the weekly pattern and only flags Mondays-that-are-also-unusual-for-a-Monday.

Question. Show how a naive rolling-mean baseline produces false positives on a seasonal metric, then sketch how a seasonality-aware baseline (Anomalo-style) would fix it.

Input (last 14 days of daily_active_users).

day dow dau
-14 Mon 12000
-13 Tue 18500
-12 Wed 19200
-11 Thu 18800
-10 Fri 16500
-9 Sat 7500
-8 Sun 6800
-7 Mon 12300
-6 Tue 18900
-5 Wed 19500
-4 Thu 19000
-3 Fri 16800
-2 Sat 7400
-1 Sun 6700
0 (today, Mon) Mon 11800

Code.

import statistics

def naive_baseline_z(history, today):
    mu, sigma = statistics.mean(history), statistics.stdev(history)
    return (today - mu) / sigma

def seasonal_baseline_z(history_with_dow, today, today_dow):
    same_dow = [v for d, v in history_with_dow if d == today_dow]
    mu, sigma = statistics.mean(same_dow), statistics.stdev(same_dow)
    return (today - mu) / sigma

history_all = [12000, 18500, 19200, 18800, 16500, 7500, 6800,
               12300, 18900, 19500, 19000, 16800, 7400, 6700]
history_dow = list(zip(
    ["Mon","Tue","Wed","Thu","Fri","Sat","Sun"] * 2, history_all))

print("naive z =",    round(naive_baseline_z(history_all, 11800), 2))
print("seasonal z =", round(seasonal_baseline_z(history_dow, 11800, "Mon"), 2))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The naive baseline pools all 14 days and computes mean / stdev. The mean is roughly 14,200 and stdev is roughly 4,800. Today (11,800) is at z ≈ -0.5 — within normal range. Wrong: today is actually low for a Monday.
  2. Actually wait — let me re-read: the naive baseline of all 14 days mixes weekdays and weekends, producing a high variance that masks small anomalies on individual weekdays. The naive z stays near 0 and misses real drift.
  3. The seasonal baseline filters history to "same day of week" — for today (Mon), it uses only the two previous Mondays (12,000 and 12,300). Mean is 12,150 and stdev is 212. Today (11,800) is at z ≈ -1.65 — meaningful drift visible only with the seasonality split.
  4. Anomalo's production model is more sophisticated: it uses hierarchical Bayesian methods to share strength across days of week, smooth low-data buckets, and handle holidays. The two-line illustration above is the principle, not the production code.
  5. The output is a much higher signal-to-noise ratio: real drift on a weekday is detected, false positives on Mondays-after-a-Sunday-spike are suppressed.

Output.

Baseline z-score on today Verdict
Naive 14-day -0.50 not flagged (missed)
Seasonal (same DoW) -1.65 flagged for review

Rule of thumb. For any metric with strong weekly seasonality, a naive rolling baseline is worse than no monitor at all — it pages on Mondays and misses real drift on Tuesdays. Either build a seasonality-aware baseline yourself (the principle above) or buy Anomalo / Bigeye / Monte Carlo whose ML layers handle it natively.

Data engineering interview question on vendor selection

A senior interviewer might frame this as: "Our security team forbids any data egress from the warehouse, our analytics team is dbt-native with 4,000 models, and our budget caps a six-figure platform. Which of Monte Carlo, Anomalo, Bigeye, and Lightup would you recommend and why?" It blends the constraint tree above with real trade-off awareness.

Solution Using a constraint-first decision tree

# vendor_picker.py — constraint-first vendor recommendation
from typing import Optional

def pick_vendor(
    data_egress_blocked: bool,
    dbt_models: int,
    ml_first: bool,
    budget_six_figures: bool,
    team_size: int,
) -> str:
    # Hard constraints first
    if data_egress_blocked:
        return "Lightup"
    # Strong preferences
    if dbt_models >= 1000:
        return "Bigeye"
    if ml_first:
        return "Anomalo"
    # Coverage breadth at scale
    if team_size >= 50 and budget_six_figures:
        return "Monte Carlo"
    # Sub-10 engineers: build don't buy
    if team_size < 10:
        return "Elementary OSS + dbt freshness"
    return "Bigeye"  # generic mid-market default
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Profile Egress blocked dbt models ML-first Budget Team Pick
The question no 4000 no yes 40 Bigeye
Insurance yes 800 no yes 80 Lightup
Growth SaaS no 200 yes yes 25 Anomalo
Enterprise retail no 1500 no yes 120 Bigeye
AI startup no 30 no no 6 Elementary OSS

For the question profile (no egress blocker, 4,000 dbt models, not ML-first, budget cleared, 40 engineers), the decision tree resolves to Bigeye because the dbt-model count exceeds the autometrics threshold and the other constraints are non-binding. Monte Carlo would also fit but does not win on the dbt axis and is typically more expensive per table.

Output:

Constraint Resolved value
data_egress_blocked no
dbt_models 4000 (>=1000)
recommendation Bigeye

Why this works — concept by concept:

  • Constraint-first ordering — security blockers (egress, residency, on-prem) are hard constraints; everything else is a preference. Resolve hard constraints before evaluating soft ones.
  • dbt-model count as a proxy — a stack with thousands of dbt models has built its data product on dbt; Bigeye's autometrics integration is materially more valuable in that environment than for a 50-model stack.
  • ML-first vs metric-first — Anomalo wins when the team's product is ML and they need numerical anomaly detection. Otherwise the ML edge is overkill and the dbt or coverage edge wins.
  • Coverage as a tiebreaker — Monte Carlo wins when no other constraint dominates and the team has the budget to pay for the broadest connector matrix.
  • Build vs buy floor — under 10 engineers, the right answer is almost always open-source. The platforms only pay off when there is enough surface area to justify the recurring cost.
  • Cost — O(1) decision. The hard part is enumerating the constraints honestly; the tree itself runs in microseconds.

SQL
Topic — conditional logic
Conditional logic problems (SQL)

Practice →


4. Detection types — rule-based, ML-based, and metadata-only monitors

Rule-based catches the obvious; ML-based catches the subtle; metadata-only catches everything you can prove without scanning — every platform uses all three, the mix is the differentiator

The mental model in one line: a rule-based monitor asserts "x > threshold," an ML-based monitor asserts "x is within the learned baseline," and a metadata-only monitor asserts "x can be inferred from the warehouse catalog without scanning the data" — the right monitor type depends on the cost-precision-coverage triangle you want to optimise for. Once you can name the three types and their cost profile, the platforms' marketing claims about "AI-powered" become much easier to parse.

Three-zone diagram showing detection styles — left zone labelled Rule-based with a threshold ruler icon, middle zone labelled ML-based with a wave + baseline curve and an anomaly spike marker, right zone labelled Metadata-only with a query-log document and a tiny database icon, connected by a thin glowing baseline running through all three, on a light PipeCode card.

Type 1 — Rule-based monitors (the foundation).

  • Examples. row_count > 1000, null_rate < 0.01, freshness < 60 minutes, schema matches v3, revenue between $50K and $500K daily.
  • How they work. A SQL query computes the metric; a comparison against a static threshold decides pass/fail.
  • Strengths. Deterministic, explainable, cheap. The on-call sees "x was 1240, threshold was 5000" and knows what fired and why. No model drift, no baseline retraining.
  • Weaknesses. You only catch what you thought to write a rule for. Thresholds are brittle — too tight and they false-page; too loose and they miss drift. Hard to maintain at scale because every new table needs a manual rule set.
  • When to use. Hard contractual SLAs (freshness on a daily finance close), invariants you can articulate (customer_id is never NULL), and as the first layer of detection for every critical table.

Type 2 — ML-based monitors (the auto-baselining layer).

  • Examples. Z-score against a 7-day baseline, seasonality-aware models for daily-active-users, hierarchical Bayesian baselines for per-segment metrics, autocorrelation-aware time-series anomaly detection.
  • How they work. Historical data trains a model; today's data is compared against the model's prediction. Anomalies are flagged when the residual exceeds a learned confidence band.
  • Strengths. Catches unknown bad shapes. Adapts to natural variance, so the same monitor works on a stable metric (revenue) and a noisy one (real-time GPU utilisation). Scales because the model trains itself on every new table.
  • Weaknesses. Requires 7–30 days of clean history to converge. Less explainable — "the model thinks this is anomalous" lands less well than "x was 1240, threshold was 5000." Subject to baseline drift — if the metric slowly trends upward, the baseline absorbs the drift and the monitor stops firing.
  • When to use. Numerical metrics with rich history; segments that vary naturally; anywhere you cannot articulate a static threshold but you know the metric should not jump 5σ overnight.

Type 3 — Metadata-only monitors (the "no data egress" layer).

  • Examples. Freshness from INFORMATION_SCHEMA.TABLES.LAST_ALTERED; row count from INFORMATION_SCHEMA.TABLE_STORAGE_METRICS; schema from INFORMATION_SCHEMA.COLUMNS; query patterns from query history.
  • How they work. The platform reads warehouse metadata — the catalog of what tables exist, when they were last written, how big they are — without ever scanning the underlying data.
  • Strengths. Effectively free (metadata reads cost no warehouse credits). No data ever leaves the warehouse — best security posture. Always available (no sampling needed).
  • Weaknesses. Can only answer questions the catalog knows about. Cannot tell you "the null rate on column X jumped" without scanning. Limited to freshness, volume, and schema pillars — distribution is mostly out of reach.
  • When to use. As the always-on first layer of monitoring. Security-sensitive stacks where data egress is forbidden (Lightup's bet). Cost-sensitive deployments on enormous tables where sampling is expensive.

Hybrid mode — auto-baseline + override threshold.

  • The right production pattern is both. Use ML-based baselines for the "is this normal" test and rule-based overrides for the "is this catastrophic" test. The ML model catches subtle drift; the rule guarantees you page on absolute floors or ceilings (e.g. "fire when revenue is below $1M/day no matter what the baseline says").
  • Every modern platform supports the hybrid mode; Anomalo's UI makes it the default. Bigeye exposes both as YAML metrics.

Detection latency vs precision tradeoffs.

  • Tight thresholds + ML baselines → high precision, low latency, but high false-page rate. Suited for the top 5–10% of critical tables only.
  • Loose thresholds + ML baselines → lower precision, longer latency-to-detection, but very low false-page rate. The "rest of the warehouse" default.
  • Tiered severity → page-grade thresholds for the critical tables, ticket-grade for the rest, and Slack-only for the long tail. Reserves the on-call's attention for real incidents.

Quiet hours and dependency-aware suppression.

  • Quiet hours. Suppress non-critical alerts during nights and weekends. Every platform supports this; the default windows vary.
  • Dependency-aware suppression. When raw.events is broken, every downstream model will also fire freshness alerts within an hour. The platform should suppress the downstream alerts and roll them up under the root incident. Monte Carlo's lineage-aware suppression is the strongest in the category here; Bigeye's dbt-manifest-based version is competitive.

How each vendor handles ML retraining.

  • Monte Carlo. Baselines retrain daily with a sliding window; manual override available.
  • Anomalo. Continuous incremental retraining; hierarchical models share strength across segments.
  • Bigeye. Weekly batch retraining with manual triggers; YAML-defined baseline windows.
  • Lightup. Threshold-first; ML layer is shallower than the other three.

Worked example — a rule-based row-count monitor in SQL

Detailed explanation. A row-count monitor with a hard floor and a soft warning band is the simplest production-grade rule-based monitor. It catches obvious failures (zero rows, half-loads) with deterministic semantics.

Question. Write a SQL query that returns a severity band (ok, warn, page) for the analytics.orders table given a row-count floor of 1000 and a soft warning band of 3000.

Input.

date rows
-2 5210
-1 5340
0 (today) 2800

Code.

WITH today AS (
    SELECT COUNT(*) AS rows_today
    FROM analytics.orders
    WHERE loaded_at >= CURRENT_DATE
)
SELECT
    rows_today,
    CASE
        WHEN rows_today <  1000 THEN 'page'
        WHEN rows_today <  3000 THEN 'warn'
        ELSE                         'ok'
    END AS severity
FROM today;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The CTE counts today's loaded rows in one scan over a single partition (assuming loaded_at is partitioned).
  2. The CASE applies two static thresholds: 1000 (hard floor) and 3000 (soft warning band).
  3. Severity falls through from most-severe to least: a row count of 800 fires page; 2800 fires warn; 5000+ stays ok.
  4. The thresholds are deliberately static — they encode contractual SLA expectations, not learned baselines. A monitor of this kind belongs alongside (not instead of) an ML monitor.
  5. To productionise, schedule it every hour and pipe the result to a Slack channel keyed by the table owner.

Output.

rows_today severity
2800 warn

Rule of thumb. Every critical table should have a rule-based row-count monitor with a hard floor. The floor encodes "below this we definitely have a problem" — independent of whatever the ML baseline learns. ML alone is not enough.

Worked example — an ML-based z-score monitor with seasonality

Detailed explanation. A z-score against a same-day-of-week baseline is the simplest seasonality-aware ML monitor and beats a naive rolling-mean baseline on any weekly-seasonal metric. It is the right second layer on top of a rule-based floor.

Question. Write a SQL query that computes today's z-score against a same-day-of-week 4-week baseline for daily_active_users. Use NULLIF to protect against constant baselines.

Input.

weeks_ago dow dau
4 Mon 12000
3 Mon 12300
2 Mon 12150
1 Mon 12400
0 Mon 9800

Code.

WITH dau_history AS (
    SELECT
        DATE_TRUNC('day', activity_date) AS d,
        TO_CHAR(activity_date, 'Dy')     AS dow,
        COUNT(DISTINCT user_id)          AS dau
    FROM analytics.events
    WHERE activity_date >= CURRENT_DATE - 28
      AND activity_date <= CURRENT_DATE
    GROUP BY 1, 2
),
today AS (
    SELECT d, dow, dau
    FROM dau_history
    WHERE d = CURRENT_DATE
),
baseline AS (
    SELECT
        AVG(dau)    AS mu,
        STDDEV(dau) AS sigma
    FROM dau_history
    WHERE d < CURRENT_DATE
      AND dow = (SELECT dow FROM today)
)
SELECT
    today.d, today.dau,
    baseline.mu, baseline.sigma,
    ROUND((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0), 2) AS z_score,
    CASE
        WHEN ABS((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0)) > 5
            THEN 'page'
        WHEN ABS((today.dau - baseline.mu) / NULLIF(baseline.sigma, 0)) > 3
            THEN 'warn'
        ELSE 'ok'
    END AS severity
FROM today, baseline;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dau_history CTE aggregates daily DAU for the last 28 days with day-of-week tags.
  2. The today CTE extracts only today's row (same shape as history).
  3. The baseline CTE computes mean and stdev over the same day of week, excluding today. For a Monday today, that means averaging the four previous Mondays.
  4. The z-score is (today - mu) / sigma, with NULLIF to handle the corner case where every prior Monday had identical DAU (sigma = 0).
  5. With baseline mu ≈ 12,213 and sigma ≈ 167, today's DAU of 9,800 produces z ≈ -14.4 — a clear page.

Output.

d dau mu sigma z_score severity
2026-06-15 9800 12212.5 166.9 -14.45 page

Rule of thumb. Always tag history rows with day-of-week (or season, holiday, marketing campaign) before computing a z-score on a seasonal metric. The single-line WHERE dow = (SELECT dow FROM today) is the cheapest way to add seasonality to an ML-based monitor.

Worked example — a metadata-only volume monitor

Detailed explanation. Metadata-only monitors are the cheapest pillar to run. A row-count check against INFORMATION_SCHEMA runs in milliseconds and consumes zero warehouse credits — the right always-on first layer for every table in the warehouse.

Question. Write a Snowflake query that lists every table in the analytics schema whose row count has changed by more than 50% since yesterday, using only metadata.

Input.

table_name rows_today rows_yesterday
orders 5300 5280
events 1.2M 8.4M
customers 240 240
sessions 0 12000

Code.

WITH today_meta AS (
    SELECT
        TABLE_NAME,
        ROW_COUNT AS rows_today
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'ANALYTICS'
),
yesterday_meta AS (
    SELECT
        TABLE_NAME,
        ROW_COUNT AS rows_yesterday
    FROM analytics_meta.row_count_snapshot
    WHERE snapshot_date = CURRENT_DATE - 1
)
SELECT
    t.TABLE_NAME,
    t.rows_today,
    y.rows_yesterday,
    ROUND(
        100.0 * (t.rows_today - y.rows_yesterday)
            / NULLIF(y.rows_yesterday, 0),
        2
    ) AS pct_change,
    CASE
        WHEN y.rows_yesterday > 0
         AND ABS(t.rows_today - y.rows_yesterday) > 0.5 * y.rows_yesterday
            THEN 'page'
        ELSE 'ok'
    END AS severity
FROM today_meta t
LEFT JOIN yesterday_meta y
       ON t.TABLE_NAME = y.TABLE_NAME
WHERE y.rows_yesterday > 0
  AND ABS(t.rows_today - y.rows_yesterday) > 0.5 * y.rows_yesterday;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The today_meta CTE reads ROW_COUNT from INFORMATION_SCHEMA.TABLES — a metadata read, zero credits.
  2. The yesterday_meta CTE reads from a sidecar snapshot table that the orchestrator populates nightly.
  3. The outer SELECT computes percentage change with NULLIF to handle zero-row yesterday tables.
  4. The severity threshold is a flat 50% delta — broad enough to catch real anomalies, narrow enough not to false-page on normal growth.
  5. events (1.2M today vs 8.4M yesterday) shows a -85% change → page. sessions (0 today vs 12000 yesterday) shows -100% → page. orders (+0.4%) stays ok.

Output.

TABLE_NAME rows_today rows_yesterday pct_change severity
events 1200000 8400000 -85.71 page
sessions 0 12000 -100.00 page

Rule of thumb. Always layer a metadata-only volume monitor as the first line of defence. It costs nothing, it runs continuously, and it catches the single-most-common failure mode (a table failed to refresh) before any other pillar even gets a chance.

Data engineering interview question on detection type selection

A senior interviewer often frames this as: "You need to monitor 200 tables. Budget allows you to scan 10 of them daily, the rest must be metadata-only. Walk me through which 10 you scan, how you decide, and how you handle the other 190." It blends prioritisation, cost awareness, and a clear understanding of which detection types work without scans.

Solution Using a tiered monitor strategy keyed to incident cost

# tiered_monitor_strategy.py
# Inputs: a catalog of tables with a "blast radius" cost score
# Outputs: which monitor type to run on each, and at what cadence.

from typing import List, Dict

def tier_monitors(
    tables: List[Dict],
    scan_budget: int = 10,
    high_severity_threshold: float = 100_000,
) -> List[Dict]:
    sorted_tables = sorted(
        tables,
        key=lambda t: t["blast_radius_usd"],
        reverse=True,
    )
    monitored: List[Dict] = []
    scans_used = 0
    for tbl in sorted_tables:
        if (
            scans_used < scan_budget
            and tbl["blast_radius_usd"] >= high_severity_threshold
        ):
            monitor_type = "rule + ml + metadata"
            cadence_min = 60
            scans_used += 1
        else:
            monitor_type = "metadata only"
            cadence_min = 30
        monitored.append({
            **tbl,
            "monitor_type": monitor_type,
            "cadence_min": cadence_min,
        })
    return monitored
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step What happens
1 Sort tables descending by blast_radius_usd
2 For each, check if budget remains AND blast radius is high-severity
3 If yes → full triple-monitor stack at hourly cadence
4 If no → metadata-only every 30 min (freshness, schema, row count)
5 Return the augmented catalog with monitor types

For a 200-table input where the top 10 tables have blast radius above $100K and the remaining 190 are below, the algorithm spends the entire scan budget on the top 10 and ships every other table with a metadata-only monitor. Total coverage: 200/200. Compute cost: bounded by scan_budget.

Output:

Table tier Count Monitor type Cadence
Top 10 (high blast radius) 10 rule + ml + metadata 60 min
Remaining 190 metadata only 30 min

Why this works — concept by concept:

  • Blast-radius-keyed prioritisation — the right way to allocate monitor budget is by downstream dollar impact, not by table size or by alphabetical order. Tables that feed the VP's dashboard get the full monitor stack; tables that feed nobody's dashboard get the cheap layer.
  • Metadata-only as the floor — every table gets at least a metadata-only monitor. Coverage is 100% even with a small scan budget because metadata reads are free.
  • Tiered cadence — high-severity tables need slower cadence (60 min) because the monitors are more expensive; low-severity get more frequent metadata-only checks because they cost nothing.
  • Triple-stack on critical tables — rule-based for the absolute floors, ML-based for the subtle drift, metadata-only for the always-on freshness. Three layers, no overlap in failure modes.
  • Budget as a constraint — the algorithm respects a hard scan budget. If procurement only approved 10 scans, the algorithm spends them on the right 10.
  • Cost — O(n log n) for the sort + O(n) for the assignment. Trivially fast for any realistic catalog size.

SQL
Topic — data validation
Data validation problems (SQL)

Practice →


5. Incident lifecycle and on-call workflow

Every data incident moves through the same six stages — detect, triage, notify, diagnose, resolve, learn — and the platform's job is to compress each stage's clock without breaking the lineage trail

The mental model in one line: the data incident lifecycle is detect → triage → notify → diagnose → resolve → learn, and every observability platform's value comes down to how much it compresses each stage's wall-clock time. Once you can name the six stages and the platform's role in each, vendor demos become a question of "show me how this stage gets faster" — concrete, measurable, defensible.

Six-stage horizontal lifecycle ribbon labelled Detect, Triage, Notify, Diagnose, Resolve, Learn, with a tiny icon above each stage and a thin lineage tree branching down from the Diagnose stage to upstream/downstream nodes, on a light PipeCode card.

Stage 1 — Detect.

  • What happens. A signal exceeds a threshold or baseline. The platform raises an incident.
  • Platform role. Run the monitors, manage the baselines, fire alerts. MTTD compression is the primary KPI here.
  • Key metric. Time-to-detection (TTD). Pre-platform, the average is often hours-to-days; post-platform it should be minutes-to-an-hour.

Stage 2 — Triage.

  • What happens. A human (or an automated rule) classifies the incident — severity, owner, blast radius, suspected root cause area.
  • Platform role. Auto-tag with owner, lineage, blast radius. Pre-classify severity from the signal type and threshold breach magnitude.
  • Key metric. Time-to-triage (TTT). Should be sub-15 minutes for page-grade incidents.

Stage 3 — Notify.

  • What happens. The right humans get pinged via the right channels.
  • Platform role. Slack-first vs PagerDuty-first routing; status pages for non-technical stakeholders; ticket creation in Jira / Linear.
  • Key metric. Acknowledgement time. Aim for under 5 minutes on page-grade.

Stage 4 — Diagnose.

  • What happens. The on-call identifies the upstream cause and the blast radius.
  • Platform role. Lineage graph, blast radius pre-computation, recent-changes-log integration, suggested-root-cause heuristics.
  • Key metric. Time-to-diagnose (TTD₂). The biggest MTTR component; the lineage edge directly compresses this.

Stage 5 — Resolve.

  • What happens. The fix is shipped; the data is restored; the incident is closed.
  • Platform role. Track resolution, integrate with the deployment system, run a re-monitor to confirm.
  • Key metric. Time-to-resolve (TTR). MTTR = TTT + Acknowledge + TTD₂ + TTR.

Stage 6 — Learn.

  • What happens. Postmortem; new monitor; retrospective; recurring-incident dashboard.
  • Platform role. Postmortem template, incident retrospective view, top-N-table-by-incidents dashboard.
  • Key metric. Recurring-incident rate. Should trend down after every postmortem.

Routing decisions.

  • Slack-first. Default for non-page severity. Channel keyed by data domain; on-call mention for page-grade.
  • PagerDuty-first. Page-grade only. Best for tables with hard SLAs (finance close, regulatory reporting).
  • Ticketing-first. Long-tail incidents that need a fix but don't need a same-day response. Jira or Linear.
  • Tier by blast radius. A freshness alert on dim_marketing_attribution should not page on-call at 3am; one on fact_revenue_daily should.

Owner tagging.

  • By table. The platform's table catalog stores an owner per table.
  • By domain. A domain (revenue, growth, data-platform) routes to a team Slack channel.
  • By dbt model meta. dbt YAML can define meta.owner; the platform reads it and uses it.
  • By default. If no owner is set, the alert lands in a triage channel for the data platform team to route.

Status pages and stakeholder comms.

  • A status page for data incidents is the most under-used feature in the category. It is the difference between "the data team will deal with it" and "every stakeholder knows the data is stale."
  • Monte Carlo, Bigeye, and Anomalo all ship status pages. Lightup's is the lightest of the four.
  • The status page should show: current incidents, recently resolved incidents, monitored systems, and a five-minute summary written by the on-call.

Integrating with dbt artifacts, OpenLineage, and the catalog.

  • dbt manifest. Every platform consumes manifest.json to discover models, tests, sources, and tags.
  • OpenLineage. The open standard for lineage events. Monte Carlo and Bigeye are the strongest consumers; Anomalo and Lightup are catching up.
  • Data catalog. Integrations with Atlan, Alation, Collibra, DataHub, and Castor are now table-stakes. The catalog stores ownership; the platform reads it.

Postmortems and recurring-incident dashboards.

  • The platform should make the "what tables had the most incidents last quarter" question one click. That dashboard drives engineering investment toward the chronic offenders.
  • A good postmortem template captures: what fired, when, blast radius, root cause, fix, monitor improvements. The platform's template should structure all six fields.

Worked example — modelling the six-stage incident timeline

Detailed explanation. Quantifying each stage's wall-clock time is the most useful exercise a data team can do post-pilot. The numbers reveal which stage the platform actually compressed and which stage still needs work.

Question. Given a sample incident — a freshness alert on analytics.orders — model the wall-clock time at each of the six stages, both before and after a Monte Carlo deployment, and compute the MTTD and MTTR delta.

Input.

Stage Before (minutes) After (minutes)
Detect 240 15
Triage 30 5
Notify 15 1
Diagnose 60 12
Resolve 45 30
Learn (offline) n/a n/a

Code.

# incident_timeline.py
def incident_timeline(stages: dict) -> dict:
    """Stages is {name: minutes}. Returns MTTD and MTTR computations."""
    mttd = stages["detect"]
    mttr = (
        stages["triage"]
        + stages["notify"]
        + stages["diagnose"]
        + stages["resolve"]
    )
    return {"MTTD_minutes": mttd, "MTTR_minutes": mttr}

before = {"detect": 240, "triage": 30, "notify": 15,
          "diagnose": 60,  "resolve": 45}
after  = {"detect": 15,  "triage": 5,  "notify": 1,
          "diagnose": 12,  "resolve": 30}

print("Before:", incident_timeline(before))
print("After: ", incident_timeline(after))
print("MTTD delta:", before["detect"] - after["detect"], "min")
print("MTTR delta:",
      sum(v for k,v in before.items() if k != "detect")
      - sum(v for k,v in after.items()  if k != "detect"),
      "min")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. MTTD is the time from incident onset to detection — equal to the "detect" stage. Pre-platform: 240 min (4 hours). Post-platform: 15 min. Compression: 16x.
  2. MTTR is the sum of triage + notify + diagnose + resolve — everything after detection. Pre: 150 min. Post: 48 min. Compression: 3.1x.
  3. The biggest relative compression is on MTTD (16x) because the platform replaces "wait for someone to notice" with "fire a monitor." The biggest absolute compression is on the "detect" stage (225 min saved per incident).
  4. The "diagnose" stage compression (60 → 12 min) is the lineage edge — the on-call sees the blast radius and upstream cause in the alert, skipping a manual archaeology step.
  5. "Resolve" only compressed 45 → 30 min — the actual code-fix-and-deploy work is mostly unchanged by the platform. That is the right outcome; the platform's job is detection and triage, not engineering.

Output.

Stage Before After Δ
Detect 240 15 -225
Triage 30 5 -25
Notify 15 1 -14
Diagnose 60 12 -48
Resolve 45 30 -15
Total 390 63 -327

Rule of thumb. Always measure all six stages, not just MTTD. The platform's value-prop is the full incident timeline compression — and the stages where it does not compress (often "resolve") are the stages where the data team still needs to invest.

Worked example — routing by severity and blast radius

Detailed explanation. Routing rules are the operational glue between detection and the on-call. The right rule set keeps page-grade alerts on PagerDuty, ticket-grade alerts in Jira, and Slack-grade alerts in a domain channel — without burying anyone in noise.

Question. Design a routing decision table for a freshness alert based on (severity, blast radius, business hours). Show how the same monitor produces different alerts under different conditions.

Input — routing table.

Severity Blast radius Business hours Channel
page high ($100K+) any PagerDuty + Slack #data-oncall
page high off-hours PagerDuty + Slack
warn high business hours Slack #data-oncall
warn high off-hours Slack-only (no page)
warn low any Domain Slack
info any any Domain Slack thread

Code.

# routing.py — decides where an alert goes based on rule table
from datetime import datetime

def route_alert(severity: str, blast_radius_usd: int) -> list[str]:
    now_hour = datetime.utcnow().hour
    business_hours = 13 <= now_hour <= 22  # 9am-6pm ET in UTC
    high_blast = blast_radius_usd >= 100_000

    channels = []
    if severity == "page" and high_blast:
        channels = ["pagerduty", "slack:#data-oncall"]
    elif severity == "warn" and high_blast and business_hours:
        channels = ["slack:#data-oncall"]
    elif severity == "warn" and high_blast:
        channels = ["slack:#data-oncall-quiet"]
    elif severity == "warn":
        channels = ["slack:#data-domain"]
    elif severity == "info":
        channels = ["slack:#data-info-thread"]
    return channels

# Examples
print(route_alert("page", 250_000))  # PagerDuty + #data-oncall
print(route_alert("warn", 250_000))  # depends on business hours
print(route_alert("warn",   5_000))  # #data-domain
print(route_alert("info",      0))   # #data-info-thread
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The function reads severity (set by the monitor) and blast radius (set by lineage) and returns a list of channels.
  2. Page-grade + high-blast always goes to PagerDuty + an on-call Slack channel. No quiet hours apply — this is the "the company is losing money" tier.
  3. Warn-grade + high-blast splits by business hours: in-hours, the on-call gets a Slack ping; off-hours, the alert goes to a separate quiet channel that the on-call checks in the morning.
  4. Warn-grade + low-blast goes to the domain Slack, not the on-call channel. This is the largest category and the one most likely to cause alert fatigue if mis-routed.
  5. Info-grade goes to a low-priority thread — useful for trend awareness, not for action.

Output.

Inputs Channels
(page, $250K) pagerduty + #data-oncall
(warn, $250K, hours) #data-oncall
(warn, $250K, off-hours) #data-oncall-quiet
(warn, $5K) #data-domain
(info, $0) #data-info-thread

Rule of thumb. Never page on warn-grade after hours unless the blast radius is high and the table has an explicit SLA. The fastest way to lose your on-call's trust in the platform is to page them at 3am for a non-urgent freshness drift on a non-critical table.

Worked example — building a recurring-incident dashboard

Detailed explanation. A recurring-incident dashboard surfaces the chronic offenders — tables, models, or dbt sources that fire more incidents than everything else combined. The dashboard drives engineering investment toward the right targets.

Question. Write a SQL query that returns the top 10 tables by incident count in the last 90 days, with each incident's average severity and the percentage of incidents that were true positives.

Input.

incident_id table severity label created_at
1 analytics.orders page tp -89d
2 analytics.orders warn tp -75d
3 raw.events warn fp -60d
... ... ... ... ...

Code.

SELECT
    table_name,
    COUNT(*) AS incident_count,
    SUM(CASE WHEN label = 'tp' THEN 1 ELSE 0 END) AS true_positives,
    ROUND(
        100.0 * SUM(CASE WHEN label = 'tp' THEN 1 ELSE 0 END)
            / NULLIF(COUNT(*), 0),
        1
    ) AS precision_pct,
    SUM(CASE WHEN severity = 'page' THEN 1 ELSE 0 END) AS page_count
FROM incidents
WHERE created_at >= CURRENT_DATE - 90
GROUP BY table_name
ORDER BY incident_count DESC
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The WHERE filters to incidents in the last 90 days — a reasonable rolling window for chronic-offender analysis.
  2. COUNT(*) counts every incident; the conditional aggregations decompose by label and severity.
  3. precision_pct = 100 * tp / total shows whether a high-incident table is actually noisy (low precision) or actually broken (high precision). A noisy table needs better monitors; a broken table needs engineering.
  4. NULLIF(COUNT(*), 0) is defensive: if a table somehow has zero incidents, the precision divide would otherwise fail.
  5. ORDER BY incident_count DESC LIMIT 10 returns the worst offenders in one query.

Output.

table_name incident_count tp precision_pct page_count
analytics.orders 41 38 92.7 8
raw.events 33 22 66.7 4
dim_marketing_attr 28 9 32.1 0
fact_revenue_daily 24 23 95.8 5
stg_users 19 12 63.2 2

Rule of thumb. Sort the chronic-offenders list by precision_pct. The lowest-precision tables need monitor tuning (their alerts are mostly false positives). The highest-precision tables need engineering investment (their alerts are real and the upstream is genuinely broken).

Worked example — postmortem template that ships in the alert

Detailed explanation. The best postmortem is the one that gets filled in. Embedding a structured template inside the alert payload — and letting the on-call complete it in Slack — turns the postmortem from a back-office chore into a 5-minute closing ritual.

Question. Sketch a Slack-native postmortem template attached to a resolved incident. The template should capture what fired, blast radius, root cause, fix, and monitor improvements.

Input — closed incident.

Field Value
Incident ID INC-2031
Table analytics.orders
Signal freshness
Detected at 2026-06-12 14:02
Resolved at 2026-06-12 15:43
Severity page
Blast radius 14 models, 3 dashboards

Code.

{
  "postmortem_template": {
    "incident_id": "INC-2031",
    "what_fired": "freshness > 60 min on analytics.orders",
    "blast_radius": "14 dbt models, 3 Looker dashboards, 1 reverse-ETL sync",
    "root_cause": "<fill in> — Airflow DAG `orders_etl` failed silently because a Snowflake credential rotated and was not propagated to the secrets manager",
    "fix": "<fill in> — re-rotated and propagated; backfilled the missing two hours of orders",
    "monitor_improvements": [
      "<fill in> — add a healthcheck on the secrets manager",
      "<fill in> — tighten DAG `orders_etl` retry window from 4h to 30 min"
    ],
    "follow_ups": [
      "open ticket DATA-1456: secrets-rotation runbook",
      "schedule a quarterly credential rotation drill"
    ],
    "ttd_minutes": 15,
    "ttr_minutes": 101
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The template arrives in Slack the moment the incident is closed. It is pre-populated with the structured fields the platform already knows: incident ID, what fired, blast radius, TTD, TTR.
  2. The on-call fills in three fields by hand: root cause, fix, monitor improvements. Each field has a 1-3 sentence character cap to keep the writeup terse.
  3. Follow-ups are open tickets or scheduled events that come out of the postmortem. They get filed automatically.
  4. The whole postmortem auto-files into a quarterly review dashboard; the engineering manager can read every postmortem of the last quarter in 30 minutes.
  5. The cost of not having the template is real: postmortems get skipped because they take 30 minutes to format. The cost of having the template is ~5 minutes per incident.

Output.

Field Auto-filled On-call fills
Incident ID yes
What fired yes
Blast radius yes
TTD / TTR yes
Root cause no 1-2 sentences
Fix no 1-2 sentences
Monitor improvements no 1-3 bullets

Rule of thumb. Make the postmortem a 5-minute Slack form, not a 30-minute Confluence page. The 5-minute version actually gets written; the 30-minute version sits in the on-call's drafts folder forever.

Data engineering interview question on incident routing

A senior interviewer might frame this as: "It's 3am, a freshness alert fires on dim_marketing_attribution. What happens in your stack from the moment the monitor breaches the threshold to the moment the on-call goes back to bed? Trace every step." It blends process knowledge, on-call empathy, and the kind of routing-rule literacy that separates a designed system from a tangled one.

Solution Using a tiered routing flow tied to severity and blast radius

# incident_handler.py — full lifecycle from detect to learn

from typing import Optional

def handle_incident(
    table: str,
    signal: str,
    severity: str,
    blast_radius_usd: int,
    hour_utc: int,
) -> dict:
    is_business_hours = 13 <= hour_utc <= 22

    # 1. Detect — happens before this function runs
    # 2. Triage — auto-tag with severity, blast radius
    triage = {
        "severity": severity,
        "blast_radius_usd": blast_radius_usd,
        "tier": "critical" if blast_radius_usd >= 100_000 else "standard",
    }

    # 3. Notify — route by severity, blast radius, time
    if triage["tier"] == "critical" and severity == "page":
        channels = ["pagerduty", f"slack:#data-oncall"]
    elif triage["tier"] == "critical" and not is_business_hours:
        channels = [f"slack:#data-oncall-quiet"]
    elif triage["tier"] == "critical":
        channels = [f"slack:#data-oncall"]
    else:
        channels = [f"slack:#data-{signal}"]

    # 4. Diagnose — attach lineage to the payload
    payload = {
        "table": table,
        "signal": signal,
        **triage,
        "channels": channels,
        "lineage": {"upstream": ["..."], "downstream": ["..."]},
    }

    return payload

# 3am UTC freshness alert on a low-blast-radius marketing table
print(handle_incident(
    table="dim_marketing_attribution",
    signal="freshness",
    severity="warn",
    blast_radius_usd=8_000,
    hour_utc=3,
))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action
1 Monitor fires at 3:00 UTC
2 Triage classifies severity=warn, blast=$8K, tier=standard
3 Notify selects channels = #data-marketing
4 Lineage attaches downstream models to the payload
5 On-call sees the alert in the morning, no 3am page

The flow correctly suppresses the 3am page on a non-critical table while still preserving the trail for morning triage. The on-call sleeps; the incident is still tracked; the data team picks it up at 9am.

Output:

Field Value
table dim_marketing_attribution
severity warn
tier standard
channels ["slack:#data-marketing"]
lineage (upstream + downstream attached)

Why this works — concept by concept:

  • Tiered notification — critical-tier tables can page off-hours; standard-tier never page off-hours. The two-tier split alone eliminates most alert fatigue.
  • Blast radius as the tier key — dollar-impact is the right dimension. A marketing attribution table with $8K blast radius does not deserve a 3am page; a revenue table with $250K blast radius absolutely does.
  • Channel routing per signal — freshness, volume, and schema can go to different domain channels even within the same tier. The right people see the right alert.
  • Lineage attached by default — the on-call should never have to ask "what does this break" — the payload should already say.
  • Quiet hours respect — off-hours alerts on non-critical tables land in a separate channel for morning review. The on-call still sees them, just not at 3am.
  • Cost — O(1) per incident. The platform's overhead is negligible; the value comes from preserving the on-call's sleep.

SQL
Topic — event processing
Event processing problems (SQL)

Practice →


Cheat sheet — observability platform recipes

  • "Enterprise-wide lineage with deep warehouse coverage." Default to Monte Carlo. Broadest connector matrix, strongest cross-stack lineage parsed from SQL query logs, longest enterprise paper trail (SOC 2, ISO 27001, HIPAA). Budget assumption: six-figure annual contract.
  • "Need strong ML anomaly detection on top of a small dbt project." Anomalo. Hierarchical seasonality-aware baselines, no-code rule editor for non-engineers, Snowflake-native via native app integration. Best when the data product is metric-heavy.
  • "dbt-heavy stack, metric-first." Bigeye. Strongest dbt integration of the four — autometrics pulls dbt models, tests, and tags automatically. Per-metric pricing is the most predictable shape in the category.
  • "Security forbids data egress." Lightup. Pure pushdown SQL — every check compiles to SQL that runs inside the warehouse. Lightup's plane never sees the data. Best fit for healthcare, banking, and government stacks.
  • "We just need freshness + volume + schema, on a budget." Start with OSS Elementary plus dbt source freshness plus a Slack webhook. Sub-10-engineer teams should build before they buy; revisit when the team passes 25.
  • "Pilot before committing." Run a 30-day eval on three critical tables with a written scorecard (coverage, detection quality, time-to-first-incident, lineage usefulness, stakeholder readability, cost per coverage). Lock the weights on day 0; do not let the demo change them.
  • "Avoid alert fatigue." Tier by blast radius. Page-grade only for high-blast-radius tables with hard SLAs; warn-grade for the rest goes to a domain Slack; info-grade to a low-priority thread. Quiet hours for everything below page-grade.
  • "Integrate with dbt and OpenLineage." Every modern platform consumes manifest.json and emits / consumes OpenLineage events. Confirm during the eval that the platform reads your meta.owner field — that is the single biggest routing accelerant.

Frequently asked questions

Do I need a data observability platform if I have dbt tests?

You need both, not one. dbt tests are assertions — they catch known-bad shapes you have already thought to write (not_null, unique, accepted_values). An observability platform catches unknown-bad shapes via baselines, monitors freshness with SLAs, watches schema drift on source tables that dbt does not own, and routes incidents through PagerDuty with lineage attached. The two layers do different jobs: dbt tests own assertions, observability platforms own workflow. For a team larger than about 10 engineers with revenue-impacting data, the observability platform usually pays for itself in a single quarter via MTTD compression and avoided rework.

Is Monte Carlo worth the price for a 5-person team?

Almost never. A 5-person team should build with open source — dbt source freshness for the freshness pillar, dbt-elementary for volume and schema, a custom z-score script for the distribution pillar, and a Slack webhook for routing. The platforms become worth their price when the surface area is large enough (25+ engineers, hundreds of dbt models, multiple BI tools, an external SLA obligation) that the marginal engineering hour spent building observability is worth more on the data product itself. For a 5-person team, that hour is almost always better spent on the product.

How does Anomalo compare to Monte Carlo on ML quality?

Anomalo is widely considered the strongest ML detection layer of the four — hierarchical Bayesian baselines, automatic feature engineering on column statistics, and seasonality-aware models that share strength across segments. Monte Carlo's ML is competent but not the headline feature; it focuses more on coverage and lineage. If your blog of incidents is dominated by silent value drift on numerical metrics (refund rate, conversion, retention), Anomalo will typically catch a few more real incidents per quarter. If the incidents are mostly freshness, schema, or blast-radius routing, Monte Carlo's edge in coverage and lineage wins.

What makes Lightup different from Monte Carlo / Bigeye?

Lightup's architectural bet is pure pushdown SQL: every check compiles to SQL that runs inside your warehouse, and Lightup's control plane never sees the underlying data — only the check result. That is fundamentally different from Monte Carlo's agent-in-VPC + control-plane model and Bigeye's metric-collector approach. The trade-off is real: Lightup's security posture is the strongest in the category (zero data egress, satisfies the strictest residency reviews), but its ML detection and lineage capabilities are correspondingly narrower. Pick Lightup when security is a hard constraint; pick the others when it is a preference.

How do these platforms integrate with dbt and OpenLineage?

Every modern platform reads dbt's manifest.json to auto-discover models, tests, sources, and tags — including meta.owner for routing. Bigeye's integration is the deepest (autometrics pulls dbt model definitions into the platform's metric catalog). Monte Carlo and Anomalo are also strong; Lightup is the lightest. On OpenLineage, Monte Carlo and Bigeye are the most active consumers — they ingest lineage events emitted by Airflow, Dagster, Marquez, and other producers. If the team is committed to OpenLineage as the open standard, those two are the safest picks; Anomalo and Lightup are catching up but are not yet at parity.

How long does a typical pilot take to prove value?

Plan on 30 days. Days 1–7 are integration: connect the warehouse, point the platform at three critical tables, configure ownership and SLAs. Days 7–14 are baseline convergence: the ML monitors need a couple of weeks of clean history before they start emitting useful alerts. Days 14–30 are the real test: did the platform catch at least one real incident before anyone else noticed? Document the MTTD compression, MTTR compression, and stakeholder feedback in a written scorecard locked on day 0. A pilot that has not produced at least one true-positive page-grade alert by day 21 is a yellow flag — either the SLAs are too loose or the tables chosen are too quiet. Tighten or switch tables; do not let the pilot drift past 45 days without a decision.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every observability pattern above ships with hands-on practice rooms where you write the freshness probe, the z-score volume monitor, and the schema-drift FULL OUTER JOIN against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your monitor SQL behaves the same on Snowflake as on BigQuery — the platform tells you.

Practice data validation now →
Aggregation drills →

Top comments (0)