Blaine Elliott

Posted on • Originally published at blog.anomalyarmor.ai

Data Anomaly Detection: The Complete Guide for Data Engineers

Data anomaly detection is the process of identifying data points, patterns, or values that deviate from expected behavior. It catches schema changes, stale tables, row count spikes, and statistical outliers before they break dashboards or corrupt downstream analytics. Modern data anomaly detection combines statistical methods like z-scores and Welford's algorithm with machine learning models that learn seasonal patterns from historical data.

This guide explains the four types of data anomalies, the algorithms used to detect each one, and how to implement detection in Snowflake, Databricks, and PostgreSQL.

What is data anomaly detection?

Data anomaly detection is the automated identification of unexpected values, patterns, or changes in a dataset. In data engineering, it monitors production tables for problems like:

  • A column gets renamed, dropped, or changes type (schema drift)
  • A daily-updated table hasn't received new rows in 36 hours (freshness failure)
  • Row counts drop by 80% overnight (volume anomaly)
  • Null rate in a critical column spikes from 2% to 40% (quality anomaly)
  • A customer ID in a fact table references a non-existent record (referential anomaly)

The goal is to catch these problems before they reach dashboards, ML models, or customer-facing applications.

The four types of data anomalies

1. Schema anomalies

Schema anomalies occur when the structure of a table changes unexpectedly. Common examples:

  • Column added: A new column appears upstream, which can break SELECT * queries
  • Column dropped: A column disappears, breaking any query that references it
  • Column renamed: The column exists under a different name, breaking queries that reference the old name (and causing silent NULLs in semi-structured extractions of the old key)
  • Type changed: A VARCHAR becomes an INTEGER, causing cast failures

Schema anomalies are the most common cause of silent data failures because queries often continue to run without error, returning wrong results.

2. Freshness anomalies

Freshness anomalies happen when a table stops updating on its expected schedule. A table that normally updates every hour but hasn't received new rows in 6 hours has a freshness anomaly. These are caused by:

  • Upstream pipeline failures
  • Source system outages
  • Broken scheduled jobs
  • Permission changes

Freshness is typically measured as "time since last insert" or "max(timestamp_column)".
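In application code the same freshness check is a small helper. A minimal sketch, assuming a daily-updated table (the timestamps here are hypothetical; in practice `last_insert` would come from a metadata query):

```python
from datetime import datetime, timedelta, timezone

def hours_stale(last_insert, now, expected_interval_hours=24):
    """Return hours since the last insert, and whether that exceeds the expected interval."""
    age = (now - last_insert).total_seconds() / 3600
    return age, age > expected_interval_hours

# Hypothetical: table last updated 36 hours ago, expected to update daily
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(hours=36)
age, is_stale = hours_stale(last, now)
```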

3. Volume anomalies

Volume anomalies are unexpected changes in row counts. A daily sales table that normally receives 10,000-12,000 rows suddenly receiving 500 rows (or 100,000) is a volume anomaly. Causes include:

  • Upstream filter changes
  • Duplicate data ingestion
  • Failed partial loads
  • Fraud or bot activity

4. Value anomalies

Value anomalies are statistical outliers in column values. Examples:

  • A revenue column where 5% of rows are negative when they should always be positive
  • A foreign key column where null rates spike from 2% to 40%
  • A timestamp column with future dates

Value anomalies are detected using statistical methods applied to specific columns.

How data anomaly detection works

Anomaly detection uses three main approaches: static thresholds, statistical methods, and machine learning.

Static thresholds

The simplest approach. You define the expected range manually:

SELECT 'anomaly' AS status
FROM orders
-- aggregates can't appear in WHERE; use HAVING (valid without GROUP BY)
HAVING COUNT(*) < 1000 OR COUNT(*) > 50000;

Static thresholds work for stable metrics but fail for anything with seasonality (weekend traffic drops, end-of-month spikes).

Statistical methods

Statistical anomaly detection uses historical data to compute expected ranges automatically. The most common approach is the z-score:

z = (current_value - historical_mean) / historical_stddev

If the absolute z-score exceeds a threshold (typically 2 or 3), the value is flagged as anomalous. A z-score of 2 catches values more than 2 standard deviations from the mean, which is roughly the top or bottom 2.5% of a normal distribution.
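As a minimal sketch of this check (standard library only; the daily row counts are hypothetical):

```python
from statistics import mean, stdev

def z_score(history, current):
    """How many standard deviations `current` is from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs len(history) >= 2
    if sigma == 0:
        return 0.0
    return (current - mu) / sigma

# Hypothetical daily row counts for the last week
history = [10200, 11000, 10800, 11500, 10400, 10900, 11100]

z = z_score(history, 4200)  # today's count dropped sharply
is_anomaly = abs(z) > 2
```

A typical value like 10,900 would score well inside the threshold, while the sharp drop to 4,200 is flagged immediately.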

Welford's algorithm is a numerically stable, single-pass way to compute a running mean and standard deviation for anomaly detection. It maintains three numbers (count, mean, and sum of squared deviations) and updates them incrementally with each new data point, requiring constant memory:

def update_stats(count, mean, m2, value):
    """One Welford step: fold `value` into the running statistics."""
    count += 1
    delta = value - mean   # distance from the old mean
    mean += delta / count
    delta2 = value - mean  # distance from the new mean
    m2 += delta * delta2   # accumulates the sum of squared deviations
    return count, mean, m2

def get_variance(count, m2):
    # Sample variance (Bessel's correction); undefined below 2 points
    return m2 / (count - 1) if count > 1 else 0

This is the foundation of most production anomaly detection systems because it scales to high-volume event streams without storing historical data.
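A hedged usage sketch of that streaming approach (the stream values are hypothetical; the two Welford functions are restated so the snippet runs standalone). Statistics are checked before each update so a spike does not inflate its own baseline:

```python
import math

def update_stats(count, mean, m2, value):
    # Welford's incremental update
    count += 1
    delta = value - mean
    mean += delta / count
    delta2 = value - mean
    m2 += delta * delta2
    return count, mean, m2

def get_variance(count, m2):
    return m2 / (count - 1) if count > 1 else 0

def detect(stream, threshold=3.0, warmup=10):
    """Yield (value, z) for points whose |z| exceeds threshold.

    The first `warmup` points only build the baseline and are never flagged.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for value in stream:
        if count >= warmup:
            sigma = math.sqrt(get_variance(count, m2))
            if sigma > 0:
                z = (value - mean) / sigma
                if abs(z) > threshold:
                    yield value, z
        count, mean, m2 = update_stats(count, mean, m2, value)

# Hypothetical stream: stable around 100, with one obvious spike
stream = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 500, 99]
anomalies = list(detect(stream))
```

Note that the spike still enters the baseline after being flagged; production systems often exclude flagged points from the update, at the cost of slower adaptation to genuine level shifts.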

Machine learning methods

For data with complex seasonality (weekly patterns, business hours, holiday effects), machine learning models outperform simple statistics. The most common approach is Prophet (Facebook's time-series forecasting library), which decomposes a series into trend, weekly seasonality, and yearly seasonality, then flags values outside the prediction interval.

Prophet requires at least 14 data points to detect weekly patterns and 365 points to detect yearly patterns. For tables with less history, fall back to z-scores.

How to detect data anomalies in Snowflake

Snowflake provides metadata views that make anomaly detection straightforward.

Schema anomalies: Track column changes via INFORMATION_SCHEMA.COLUMNS:

SELECT table_name, column_name, data_type, last_altered
FROM information_schema.columns
WHERE table_schema = 'PRODUCTION'
  AND last_altered > DATEADD(hour, -24, CURRENT_TIMESTAMP());

Freshness anomalies: Check ACCOUNT_USAGE.TABLES for last DML operation:

SELECT table_name,
       last_altered,
       DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) AS hours_stale
FROM snowflake.account_usage.tables
WHERE table_schema = 'PRODUCTION'
  AND DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) > 24;

Volume anomalies: Compare today's row count against a rolling 30-day average:

WITH daily_counts AS (
  SELECT DATE(created_at) AS day, COUNT(*) AS row_count
  FROM orders
  WHERE created_at >= DATEADD(day, -30, CURRENT_DATE())
  GROUP BY DATE(created_at)
),
stats AS (
  SELECT AVG(row_count) AS mean, STDDEV(row_count) AS stddev
  FROM daily_counts
  WHERE day < CURRENT_DATE()
)
SELECT row_count,
       (row_count - mean) / NULLIF(stddev, 0) AS z_score
FROM daily_counts, stats
WHERE day = CURRENT_DATE()
  AND ABS((row_count - mean) / NULLIF(stddev, 0)) > 2;

How to detect data anomalies in Databricks

Databricks offers Delta Live Tables expectations for inline anomaly detection:

import dlt
from pyspark.sql.functions import col

@dlt.table
@dlt.expect_or_drop("valid_order_total", "order_total > 0")
@dlt.expect_or_fail("recent_data", "created_at > current_date() - interval 2 days")
def clean_orders():
    return spark.read.table("raw_orders")

For volume and statistical anomalies, combine Unity Catalog metadata with scheduled queries that snapshot row counts and latest ingestion times:

SELECT 'production.orders' AS table_name,
       COUNT(*) AS row_count,
       MAX(ingestion_time) AS last_update
FROM production.orders;

How to detect data anomalies in PostgreSQL

PostgreSQL doesn't have built-in anomaly detection, but you can approximate it with pg_stat_user_tables and custom queries (note that last_autoanalyze is only a rough freshness proxy, since autovacuum timing varies with write volume):

SELECT relname AS table_name,
       n_live_tup AS row_count,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND last_autoanalyze < NOW() - INTERVAL '24 hours';

For value anomalies, use window functions to compute rolling statistics:

WITH rolling_stats AS (
  SELECT order_id,
         amount,
         AVG(amount) OVER (ORDER BY created_at ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS rolling_mean,
         STDDEV(amount) OVER (ORDER BY created_at ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS rolling_stddev
  FROM orders
)
SELECT order_id, amount, rolling_mean, rolling_stddev,
       (amount - rolling_mean) / NULLIF(rolling_stddev, 0) AS z_score
FROM rolling_stats
WHERE ABS((amount - rolling_mean) / NULLIF(rolling_stddev, 0)) > 3;

Build vs buy: data anomaly detection tools

Building anomaly detection in-house gives you control but requires engineering time to maintain. Most data teams outgrow custom solutions because:

  • Alert fatigue: Static thresholds fire too often and get ignored
  • Seasonality blindness: Simple statistics miss weekly and yearly patterns
  • Cross-platform monitoring: Different code for Snowflake, Databricks, and Postgres
  • Incident triage: No unified view of which alerts matter most

AnomalyArmor is a data observability platform that uses AI to configure anomaly detection automatically. You connect your data warehouse, describe what you want to monitor in plain English, and the AI agent sets up schema drift alerts, freshness schedules, and statistical anomaly detection across all your tables. It works on Snowflake, Databricks, PostgreSQL, and BigQuery.

Data anomaly detection FAQ

What is the difference between anomaly detection and data validation?

Data validation checks if data matches explicit rules (e.g., "order_id is not null"). Anomaly detection uses statistical methods to identify values that deviate from historical patterns. Validation catches known problems. Anomaly detection catches unknown ones.

What is the best algorithm for data anomaly detection?

For most production use cases, z-scores computed with Welford's algorithm work well. For data with strong weekly or yearly seasonality, Prophet or similar time-series models are better. For high-dimensional data, isolation forests outperform statistical methods.

How do I detect schema drift automatically?

Query your database's INFORMATION_SCHEMA or metadata views on a schedule, store the previous state, and diff the current state against the stored version. When columns change, type definitions change, or tables are added or removed, fire an alert. AnomalyArmor does this automatically for Snowflake, Databricks, and PostgreSQL.
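A hedged sketch of that diff step (the snapshot format, column names, and types are hypothetical; in practice each snapshot would come from an INFORMATION_SCHEMA query and be persisted between runs):

```python
def diff_schemas(previous, current):
    """Compare two {column_name: data_type} snapshots and return the changes."""
    changes = []
    for col in previous.keys() - current.keys():
        changes.append(("dropped", col, previous[col]))
    for col in current.keys() - previous.keys():
        changes.append(("added", col, current[col]))
    for col in previous.keys() & current.keys():
        if previous[col] != current[col]:
            changes.append(("type_changed", col, f"{previous[col]} -> {current[col]}"))
    return sorted(changes)

# Hypothetical snapshots of the same table, 24 hours apart
yesterday = {"order_id": "INTEGER", "amount": "VARCHAR", "created_at": "TIMESTAMP"}
today     = {"order_id": "INTEGER", "amount": "NUMERIC", "region": "VARCHAR"}

drift = diff_schemas(yesterday, today)
# any non-empty result fires an alert
```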

What is a z-score and how is it used in anomaly detection?

A z-score measures how many standard deviations a value is from the historical mean. A z-score of 2 means the value is 2 standard deviations above the mean, which occurs in roughly 2.5% of a normal distribution. Most anomaly detection systems use z-scores between 2 and 3 as thresholds.

How much historical data do I need for anomaly detection?

Statistical methods like z-scores need at least 7-10 data points to produce meaningful baselines. Machine learning methods like Prophet need at least 14 points for weekly seasonality and 365 points for yearly seasonality. During the learning phase, most systems don't fire alerts.

What is the difference between data observability and anomaly detection?

Anomaly detection is one component of data observability. Data observability also includes lineage tracking, impact analysis, schema change detection, and root cause analysis. Anomaly detection tells you something is wrong. Observability tells you what, where, and why.

Can AI improve data anomaly detection?

Yes. AI improves anomaly detection in three ways. First, AI agents can configure monitoring rules from natural language instead of YAML or GUI forms. Second, LLMs can analyze alert patterns to reduce false positives. Third, AI can correlate anomalies across tables to identify root causes faster than manual investigation.

How do I avoid alert fatigue in anomaly detection?

Use adaptive thresholds that learn from historical patterns instead of static rules. Set sensitivity per table based on how critical it is. Group related alerts so a single upstream failure generates one notification instead of ten. Suppress alerts during known maintenance windows.

What data platforms support anomaly detection natively?

Snowflake has data metric functions and ACCOUNT_USAGE views. Databricks has Delta Live Tables expectations and Unity Catalog lineage. BigQuery has table metadata and scheduled queries. PostgreSQL has pg_stat_user_tables. None of these are full anomaly detection systems, but they provide the raw metrics needed to build one.

How real-time should anomaly detection be?

It depends on the use case. Schema drift and freshness checks should run every 5-15 minutes. Row count and statistical anomalies should run hourly for most tables and daily for slower-changing ones. Real-time streaming anomaly detection (sub-second) is rarely needed for data warehouses but is critical for fraud detection and security monitoring.

Summary

Data anomaly detection catches schema changes, freshness failures, volume spikes, and statistical outliers before they break downstream analytics. The four main types of anomalies require different detection approaches: schema changes need metadata diffs, freshness needs time-since-update checks, volume needs historical baselines, and value anomalies need statistical methods like z-scores or machine learning models like Prophet.

Modern data observability platforms combine all four detection methods with AI-powered configuration to make anomaly detection practical at scale. Whether you build in-house or buy a tool, the fundamental algorithms are the same: maintain historical baselines, compute expected ranges, and flag deviations beyond your sensitivity threshold.


Want to see data anomaly detection in action? Watch a 30-second demo of AI configuring schema drift monitoring in real time.
