<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Blaine Elliott</title>
    <description>The latest articles on DEV Community by Blaine Elliott (@iblaine).</description>
    <link>https://dev.to/iblaine</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872144%2F91b5234f-bf95-4c8a-8909-c40be588d7bb.png</url>
      <title>DEV Community: Blaine Elliott</title>
      <link>https://dev.to/iblaine</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iblaine"/>
    <language>en</language>
    <item>
      <title>Data Anomaly Detection: The Complete Guide for Data Engineers</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:48:30 +0000</pubDate>
      <link>https://dev.to/iblaine/data-anomaly-detection-the-complete-guide-for-data-engineers-3ifk</link>
      <guid>https://dev.to/iblaine/data-anomaly-detection-the-complete-guide-for-data-engineers-3ifk</guid>
      <description>&lt;p&gt;Data anomaly detection is the process of identifying data points, patterns, or values that deviate from expected behavior. It catches schema changes, stale tables, row count spikes, and statistical outliers before they break dashboards or corrupt downstream analytics. Modern data anomaly detection combines statistical methods like z-scores and Welford's algorithm with machine learning models that learn seasonal patterns from historical data.&lt;/p&gt;

&lt;p&gt;This guide explains the four types of data anomalies, the algorithms used to detect each one, and how to implement detection in Snowflake, Databricks, and PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is data anomaly detection?
&lt;/h2&gt;

&lt;p&gt;Data anomaly detection is the automated identification of unexpected values, patterns, or changes in a dataset. In data engineering, it monitors production tables for problems like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A column gets renamed, dropped, or changes type (schema drift)&lt;/li&gt;
&lt;li&gt;A daily-updated table hasn't received new rows in 36 hours (freshness failure)&lt;/li&gt;
&lt;li&gt;Row counts drop by 80% overnight (volume anomaly)&lt;/li&gt;
&lt;li&gt;Null rate in a critical column spikes from 2% to 40% (quality anomaly)&lt;/li&gt;
&lt;li&gt;A customer ID in a fact table references a non-existent record (referential anomaly)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to catch these problems before they reach dashboards, ML models, or customer-facing applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four types of data anomalies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Schema anomalies
&lt;/h3&gt;

&lt;p&gt;Schema anomalies occur when the structure of a table changes unexpectedly. Common examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column added&lt;/strong&gt;: A new column appears upstream, which can break &lt;code&gt;SELECT *&lt;/code&gt; queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column dropped&lt;/strong&gt;: A column disappears, breaking any query that references it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column renamed&lt;/strong&gt;: The column exists under a different name, which breaks queries that reference the old name and can silently return NULLs in semi-structured lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type changed&lt;/strong&gt;: A VARCHAR becomes an INTEGER, causing cast failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Schema anomalies are the most common cause of silent data failures because queries often continue to run without error, returning wrong results.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Freshness anomalies
&lt;/h3&gt;

&lt;p&gt;Freshness anomalies happen when a table stops updating on its expected schedule. A table that normally updates every hour but hasn't received new rows in 6 hours has a freshness anomaly. These are caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upstream pipeline failures&lt;/li&gt;
&lt;li&gt;Source system outages&lt;/li&gt;
&lt;li&gt;Broken scheduled jobs&lt;/li&gt;
&lt;li&gt;Permission changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Freshness is typically measured as time since the last insert, computed either from load metadata or as the gap between now and &lt;code&gt;MAX(timestamp_column)&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Volume anomalies
&lt;/h3&gt;

&lt;p&gt;Volume anomalies are unexpected changes in row counts. If a daily sales table that normally receives 10,000-12,000 rows suddenly receives 500 rows (or 100,000), that is a volume anomaly. Causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upstream filter changes&lt;/li&gt;
&lt;li&gt;Duplicate data ingestion&lt;/li&gt;
&lt;li&gt;Failed partial loads&lt;/li&gt;
&lt;li&gt;Fraud or bot activity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Value anomalies
&lt;/h3&gt;

&lt;p&gt;Value anomalies are statistical outliers in column values. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A revenue column where 5% of rows are negative when they should always be positive&lt;/li&gt;
&lt;li&gt;A foreign key column where null rates spike from 2% to 40%&lt;/li&gt;
&lt;li&gt;A timestamp column with future dates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Value anomalies are detected using statistical methods applied to specific columns.&lt;/p&gt;

&lt;h2&gt;
  
  
  How data anomaly detection works
&lt;/h2&gt;

&lt;p&gt;Anomaly detection uses three main approaches: static thresholds, statistical methods, and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Static thresholds
&lt;/h3&gt;

&lt;p&gt;This is the simplest approach, where you define the expected range manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'anomaly'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Static thresholds work for stable metrics but fail for anything with seasonality (weekend traffic drops, end-of-month spikes).&lt;/p&gt;
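&lt;p&gt;One lightweight middle ground before reaching for a full model is to keep a separate baseline per day of week. A sketch in Python (the row counts here are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict
import statistics

def weekday_baselines(history):
    """history: iterable of (weekday, row_count) pairs, weekday 0-6.
    Returns {weekday: (mean, stdev)} so Saturdays are judged against
    Saturday traffic instead of the overall average."""
    by_day = defaultdict(list)
    for weekday, row_count in history:
        by_day[weekday].append(row_count)
    return {day: (statistics.mean(counts), statistics.stdev(counts))
            for day, counts in by_day.items() if len(counts) > 1}

# Mondays land near 10,000 rows, Saturdays near 2,000.
history = [(0, 10100), (0, 9900), (0, 10050), (5, 2100), (5, 1900), (5, 2000)]
baselines = weekday_baselines(history)

# A 2,100-row Saturday is within one standard deviation of the Saturday
# baseline, even though it looks like an 80% drop against the global mean.
saturday_mean, saturday_stdev = baselines[5]
z = abs(2100 - saturday_mean) / saturday_stdev
print(z)  # 1.0
```

&lt;p&gt;The same idea extends to hour-of-day or month-end buckets; the trade-off is that each bucket needs enough history of its own.&lt;/p&gt;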

&lt;h3&gt;
  
  
  Statistical methods
&lt;/h3&gt;

&lt;p&gt;Statistical anomaly detection uses historical data to compute expected ranges automatically. The most common approach is the z-score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;historical_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;historical_stddev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the absolute z-score exceeds a threshold (typically 2 or 3), the value is flagged as anomalous. A z-score of 2 catches values more than 2 standard deviations from the mean, which is roughly the top or bottom 2.5% of a normal distribution.&lt;/p&gt;
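&lt;p&gt;The check can be expressed in a few lines of Python; a minimal sketch (the history values and the 2.0 threshold are illustrative):&lt;/p&gt;

```python
import statistics

def is_anomalous(current_value, history, threshold=2.0):
    """Flag current_value if it sits more than `threshold` standard
    deviations from the mean of the historical observations."""
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)  # sample standard deviation
    if stddev == 0:
        return False  # constant history gives no basis for a z-score
    z = (current_value - mean) / stddev
    return abs(z) > threshold

# A 500-row day against a stable 10,000-12,000 baseline is flagged;
# an 11,000-row day is not.
history = [10500, 11200, 10800, 11900, 10300, 11600, 10900]
print(is_anomalous(500, history))    # True
print(is_anomalous(11000, history))  # False
```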

&lt;p&gt;&lt;strong&gt;Welford's algorithm&lt;/strong&gt; is the most efficient way to compute running mean and standard deviation for anomaly detection. It maintains three numbers (count, mean, and sum of squared deviations) and updates them incrementally with each new data point, requiring constant memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
    &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
    &lt;span class="n"&gt;delta2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
    &lt;span class="n"&gt;m2&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;delta2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_variance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the foundation of most production anomaly detection systems because it scales to high-volume event streams without storing historical data.&lt;/p&gt;
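&lt;p&gt;Wiring those two helpers into a streaming check might look like the following sketch (the functions are repeated so the snippet is self-contained; the warm-up length, 2-sigma threshold, and sample values are illustrative):&lt;/p&gt;

```python
import math

def update_stats(count, mean, m2, value):
    # Welford's incremental update: one pass, constant memory per metric.
    count += 1
    delta = value - mean
    mean += delta / count
    delta2 = value - mean
    m2 += delta * delta2
    return count, mean, m2

def get_variance(count, m2):
    return m2 / (count - 1) if count > 1 else 0

# Stream daily row counts through the running statistics, skip the first
# few points as a learning phase, and flag anything beyond 2 sigma.
anomalies = []
count, mean, m2 = 0, 0.0, 0.0
for value in [10500, 11200, 10800, 11900, 10300, 11600, 500]:
    stddev = math.sqrt(get_variance(count, m2))
    if count >= 5 and stddev > 0 and abs(value - mean) / stddev > 2.0:
        anomalies.append(value)
    count, mean, m2 = update_stats(count, mean, m2, value)

print(anomalies)  # only the 500-row day is flagged
```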

&lt;h3&gt;
  
  
  Machine learning methods
&lt;/h3&gt;

&lt;p&gt;For data with complex seasonality (weekly patterns, business hours, holiday effects), machine learning models outperform simple statistics. The most common approach is &lt;strong&gt;Prophet&lt;/strong&gt; (Facebook's time-series forecasting library), which decomposes a series into trend, weekly seasonality, and yearly seasonality, then flags values outside the prediction interval.&lt;/p&gt;

&lt;p&gt;Prophet requires at least 14 data points to detect weekly patterns and 365 points to detect yearly patterns. For tables with less history, fall back to z-scores.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to detect data anomalies in Snowflake
&lt;/h2&gt;

&lt;p&gt;Snowflake provides metadata views that make anomaly detection straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema anomalies&lt;/strong&gt;: Track column changes via &lt;code&gt;INFORMATION_SCHEMA.COLUMNS&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_altered&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PRODUCTION'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;last_altered&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Freshness anomalies&lt;/strong&gt;: Check &lt;code&gt;ACCOUNT_USAGE.TABLES&lt;/code&gt; for last DML operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_altered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_altered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hours_stale&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PRODUCTION'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;DATEDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_altered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Volume anomalies&lt;/strong&gt;: Compare today's row count against a rolling 30-day average:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;daily_counts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;STDDEV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_counts&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;row_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;z_score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;row_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to detect data anomalies in Databricks
&lt;/h2&gt;

&lt;p&gt;Databricks offers Delta Live Tables expectations for inline anomaly detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;

&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_order_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_total &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recent_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at &amp;gt; current_date() - interval 2 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_orders&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For volume and statistical anomalies, combine Unity Catalog metadata with scheduled queries that snapshot row counts and last-update times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingestion_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_update&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to detect data anomalies in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL doesn't have built-in anomaly detection, but you can implement it with &lt;code&gt;pg_stat_user_tables&lt;/code&gt; (where &lt;code&gt;n_live_tup&lt;/code&gt; is an estimate maintained by autovacuum) and custom queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_autoanalyze&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;last_autoanalyze&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For value anomalies, use window functions to compute rolling statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;rolling_stats&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;STDDEV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rolling_stddev&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rolling_stddev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rolling_stddev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;z_score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rolling_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rolling_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rolling_stddev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build vs buy: data anomaly detection tools
&lt;/h2&gt;

&lt;p&gt;Building anomaly detection in-house gives you control but requires engineering time to maintain. Most data teams outgrow custom solutions because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alert fatigue&lt;/strong&gt;: Static thresholds fire too often and get ignored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality blindness&lt;/strong&gt;: Simple statistics miss weekly and yearly patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform monitoring&lt;/strong&gt;: Different code for Snowflake, Databricks, and Postgres&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident triage&lt;/strong&gt;: No unified view of which alerts matter most&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://anomalyarmor.ai" rel="noopener noreferrer"&gt;AnomalyArmor&lt;/a&gt; is a data observability platform that uses AI to configure anomaly detection automatically. You connect your data warehouse, describe what you want to monitor in plain English, and the AI agent sets up schema drift alerts, freshness schedules, and statistical anomaly detection across all your tables. It works on Snowflake, Databricks, PostgreSQL, and BigQuery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data anomaly detection FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between anomaly detection and data validation?
&lt;/h3&gt;

&lt;p&gt;Data validation checks if data matches explicit rules (e.g., "order_id is not null"). Anomaly detection uses statistical methods to identify values that deviate from historical patterns. Validation catches known problems. Anomaly detection catches unknown ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the best algorithm for data anomaly detection?
&lt;/h3&gt;

&lt;p&gt;For most production use cases, z-scores computed with Welford's algorithm work well. For data with strong weekly or yearly seasonality, Prophet or similar time-series models are better. For high-dimensional data, isolation forests outperform statistical methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I detect schema drift automatically?
&lt;/h3&gt;

&lt;p&gt;Query your database's &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; or metadata views on a schedule, store the previous state, and diff the current state against the stored version. When columns change, type definitions change, or tables are added or removed, fire an alert. AnomalyArmor does this automatically for Snowflake, Databricks, and PostgreSQL.&lt;/p&gt;
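&lt;p&gt;A minimal version of that snapshot-and-diff step, with each snapshot represented as a plain &lt;code&gt;{column_name: data_type}&lt;/code&gt; dict (the column names here are made up):&lt;/p&gt;

```python
def diff_schema(previous, current):
    """Compare two {column_name: data_type} snapshots and report drift."""
    changes = []
    for col in sorted(previous.keys() - current.keys()):
        changes.append(f"dropped: {col}")
    for col in sorted(current.keys() - previous.keys()):
        changes.append(f"added: {col}")
    for col in sorted(previous.keys() & current.keys()):
        if previous[col] != current[col]:
            changes.append(f"type changed: {col} {previous[col]} -> {current[col]}")
    return changes

# Yesterday's snapshot vs. today's: one column dropped, one added,
# and one silently changed type.
previous = {"order_id": "INTEGER", "amount": "NUMERIC", "status": "VARCHAR"}
current = {"order_id": "INTEGER", "amount": "VARCHAR", "region": "VARCHAR"}
for change in diff_schema(previous, current):
    print(change)
```

&lt;p&gt;In production you would persist each snapshot (for example, in a monitoring table) and fire an alert whenever the diff is non-empty.&lt;/p&gt;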

&lt;h3&gt;
  
  
  What is a z-score and how is it used in anomaly detection?
&lt;/h3&gt;

&lt;p&gt;A z-score measures how many standard deviations a value is from the historical mean. A z-score of 2 means the value is 2 standard deviations above the mean, which occurs in roughly 2.5% of a normal distribution. Most anomaly detection systems use z-scores between 2 and 3 as thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much historical data do I need for anomaly detection?
&lt;/h3&gt;

&lt;p&gt;Statistical methods like z-scores need at least 7-10 data points to produce meaningful baselines. Machine learning methods like Prophet need at least 14 points for weekly seasonality and 365 points for yearly seasonality. During the learning phase, most systems don't fire alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between data observability and anomaly detection?
&lt;/h3&gt;

&lt;p&gt;Anomaly detection is one component of data observability. Data observability also includes lineage tracking, impact analysis, schema change detection, and root cause analysis. Anomaly detection tells you something is wrong. Observability tells you what, where, and why.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can AI improve data anomaly detection?
&lt;/h3&gt;

&lt;p&gt;Yes. AI improves anomaly detection in three ways. First, AI agents can configure monitoring rules from natural language instead of YAML or GUI forms. Second, LLMs can analyze alert patterns to reduce false positives. Third, AI can correlate anomalies across tables to identify root causes faster than manual investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I avoid alert fatigue in anomaly detection?
&lt;/h3&gt;

&lt;p&gt;Use adaptive thresholds that learn from historical patterns instead of static rules. Set sensitivity per table based on how critical it is. Group related alerts so a single upstream failure generates one notification instead of ten. Suppress alerts during known maintenance windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  What data platforms support anomaly detection natively?
&lt;/h3&gt;

&lt;p&gt;Snowflake has data metric functions and &lt;code&gt;ACCOUNT_USAGE&lt;/code&gt; views. Databricks has Delta Live Tables expectations and Unity Catalog lineage. BigQuery has table metadata and scheduled queries. PostgreSQL has &lt;code&gt;pg_stat_user_tables&lt;/code&gt;. None of these are full anomaly detection systems, but they provide the raw metrics needed to build one.&lt;/p&gt;

&lt;h3&gt;
  
  
  How real-time should anomaly detection be?
&lt;/h3&gt;

&lt;p&gt;It depends on the use case. Schema drift and freshness checks should run every 5-15 minutes. Row count and statistical anomalies should run hourly for most tables and daily for slower-changing ones. Real-time streaming anomaly detection (sub-second) is rarely needed for data warehouses but is critical for fraud detection and security monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Data anomaly detection catches schema changes, freshness failures, volume spikes, and statistical outliers before they break downstream analytics. The four main types of anomalies require different detection approaches: schema changes need metadata diffs, freshness needs time-since-update checks, volume needs historical baselines, and value anomalies need statistical methods like z-scores or machine learning models like Prophet.&lt;/p&gt;

&lt;p&gt;Modern data observability platforms combine all four detection methods with AI-powered configuration to make anomaly detection practical at scale. Whether you build in-house or buy a tool, the fundamental algorithms are the same: maintain historical baselines, compute expected ranges, and flag deviations beyond your sensitivity threshold.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to see data anomaly detection in action? &lt;a href="https://blog.anomalyarmor.ai/using-ai-to-set-up-schema-drift-detection/" rel="noopener noreferrer"&gt;Watch a 30-second demo of AI configuring schema drift monitoring in real time.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>You Don't Need to Write Data Tests</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:47:37 +0000</pubDate>
      <link>https://dev.to/iblaine/you-dont-need-to-write-data-tests-4llg</link>
      <guid>https://dev.to/iblaine/you-dont-need-to-write-data-tests-4llg</guid>
      <description>&lt;p&gt;Spend five minutes in any data engineering forum and you'll find the same confession repeated in different words: "We just eyeball row counts and pray." It shows up on Reddit, Hacker News, the dbt Community Forum, Stack Overflow. The phrasing changes but the story doesn't.&lt;/p&gt;

&lt;p&gt;Data engineers know they should be testing. They're not skipping tests because they're lazy or because they don't understand the value. They're skipping tests because everything else in their environment conspires against it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why data engineers don't test
&lt;/h2&gt;

&lt;p&gt;If you talk to enough practitioners (or read enough forum threads), the same reasons surface over and over:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nobody gives them time.&lt;/strong&gt; Organizations reward fast delivery, not reliable delivery. If decision makers don't prioritize testing, it never becomes a standard. The incentive structure actively punishes thoroughness. You get more credit for shipping a pipeline in two days than for spending a week making it bulletproof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data changes faster than tests can keep up.&lt;/strong&gt; This is what separates data testing from software testing. Your code doesn't change overnight. Your data does. A source team renames a column. A third-party API changes its response format. A bulk operation shifts row counts by 40%. Tests written last month don't account for changes that happened last night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data quality is invisible until it breaks.&lt;/strong&gt; The fundamental problem in data engineering is that a bad query still returns results. Results, but not necessarily correct ones. If nobody can see when things are broken, nobody builds the political will to prevent breakage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data is inherently hard to test.&lt;/strong&gt; You can test code. Data is another story. Unit tests verify that your transformation logic works. They don't verify that the data you received is what you expected. These are fundamentally different problems, and the second one causes far more real-world failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code testing vs data testing
&lt;/h2&gt;

&lt;p&gt;This is the distinction the industry has been dancing around for years. Unit tests and data quality checks are different things, and conflating them is why most testing advice falls flat for data teams.&lt;/p&gt;

&lt;p&gt;Unit tests verify your code does what you intended. They answer: "Does my transformation produce the right output given known input?"&lt;/p&gt;

&lt;p&gt;Data quality checks verify the data you received is what you expected. They answer: "Did 50,000 rows actually arrive? Is the schema the same as yesterday? Are null rates within normal bounds? Did the distribution shift?"&lt;/p&gt;

&lt;p&gt;In data engineering, the second category catches far more production failures than the first. Your dbt model can be perfectly correct and still produce garbage if the source data changed underneath it.&lt;/p&gt;

&lt;p&gt;Most testing advice aimed at data engineers focuses on the first category. Write unit tests for your transformations. Test your SQL with fixtures. Use dbt tests. This is useful, but it misses the failures that actually page people at 3am.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Make testing easier" is the wrong frame
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom is: testing is too hard, so let's make it easier. Better frameworks. Better test runners. Better dbt test macros. AI-assisted test generation.&lt;/p&gt;

&lt;p&gt;That's genuinely helpful for teams that have the bandwidth to maintain a test suite. But it doesn't address the actual constraint. The problem isn't that testing is too hard. The problem is that testing is another thing to maintain in an environment where there's already not enough time.&lt;/p&gt;

&lt;p&gt;Making tests 50% easier to write doesn't help when nobody has time to write them at all. And even if you find time to write them, data changes faster than tests can keep up.&lt;/p&gt;

&lt;p&gt;The better frame: don't make testing easier. Make it unnecessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated data testing: tests you never write
&lt;/h2&gt;

&lt;p&gt;Automated data testing flips the model. Instead of engineers defining what "correct" looks like for every table, the system learns what normal looks like and alerts when something deviates.&lt;/p&gt;

&lt;p&gt;This covers the checks that catch the majority of real incidents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema change detection.&lt;/strong&gt; A column gets renamed, removed, or changes type. This breaks downstream models, joins, and dashboards. You don't need a handwritten test for this. You need a system that tracks schema state and alerts on any change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Freshness monitoring.&lt;/strong&gt; A table that updates every hour hasn't been touched in six hours. The pipeline didn't error. It just silently stopped. A system that learns update patterns and flags deviations catches this without any configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume anomalies.&lt;/strong&gt; A table that normally loads 100,000 rows per day suddenly loads 1,000. Or zero. Or 500,000. Anomaly detection against historical baselines catches this without anyone defining thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribution shifts.&lt;/strong&gt; A column's null rate jumps from 2% to 35%. A numeric field's average drops by half. These are the subtle failures that pass a "did it run?" check but corrupt downstream analytics.&lt;/p&gt;

&lt;p&gt;None of these require writing tests. They require connecting to your data warehouse and letting the system build baselines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;You connect your Snowflake, Databricks, BigQuery, PostgreSQL, or Redshift warehouse. The system runs discovery: what tables exist, what schemas they have, when they typically update, what their normal row counts and distributions look like.&lt;/p&gt;

&lt;p&gt;From that point, monitoring is automatic. Schema changes trigger alerts. Stale tables trigger alerts. Volume and distribution anomalies trigger alerts. All of this happens without writing a single line of test code.&lt;/p&gt;

&lt;p&gt;When something fires, you get context: which table, what changed, when it changed, and which downstream assets are affected. The alert isn't "test failed." The alert is "the &lt;code&gt;orders_fact&lt;/code&gt; table hasn't updated in 4 hours, and 12 downstream models depend on it."&lt;/p&gt;

&lt;p&gt;This is what &lt;a href="https://www.anomalyarmor.ai" rel="noopener noreferrer"&gt;AnomalyArmor&lt;/a&gt; does. Five-minute setup, no test authoring, no test maintenance. It watches your warehouse and tells you when something looks wrong. The coverage scales with your warehouse, not with your team's bandwidth to write tests. See the &lt;a href="https://docs.anomalyarmor.ai/quickstart/overview" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; to connect your first data source.&lt;/p&gt;

&lt;h2&gt;
  
  
  This doesn't replace all testing
&lt;/h2&gt;

&lt;p&gt;To be clear: automated data testing doesn't eliminate the need for all handwritten tests. If you have specific business rules (revenue must be positive, email must contain @, every order must have a customer), those still need explicit validation.&lt;/p&gt;

&lt;p&gt;But most data teams don't have any testing at all. They're eyeballing row counts and praying. For those teams, automated data testing provides 80% of the coverage with 0% of the authoring effort.&lt;/p&gt;

&lt;p&gt;Start with automated monitoring. Add handwritten tests for your most critical business rules. That's the order that matches reality for time-constrained data teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real question
&lt;/h2&gt;

&lt;p&gt;The real question isn't whether every possible scenario has been tested. It's how much uncertainty your organization is willing to tolerate before it starts verifying the numbers it depends on.&lt;/p&gt;

&lt;p&gt;For most data teams, the answer has been: a lot of uncertainty. Because the alternative was writing and maintaining tests they didn't have time for.&lt;/p&gt;

&lt;p&gt;Automated data testing changes that tradeoff. The cost of coverage drops to near zero. The question stops being "can we afford to test?" and becomes "why aren't we?"&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Joe Reis, &lt;a href="https://joereis.substack.com/p/the-2026-state-of-data-engineering" rel="noopener noreferrer"&gt;2026 State of Data Engineering Survey&lt;/a&gt; (2026). 1,101 respondents. Found data teams spend 34% of time on data quality, 26% on firefighting.&lt;/li&gt;
&lt;li&gt;AnomalyArmor, &lt;a href="https://docs.anomalyarmor.ai/quickstart/overview" rel="noopener noreferrer"&gt;Quickstart Guide&lt;/a&gt;. Connect your first data source and set up automated monitoring.&lt;/li&gt;
&lt;li&gt;AnomalyArmor, &lt;a href="https://docs.anomalyarmor.ai/schema-monitoring/overview" rel="noopener noreferrer"&gt;Schema Monitoring Docs&lt;/a&gt;. How automated schema change detection works.&lt;/li&gt;
&lt;li&gt;AnomalyArmor, &lt;a href="https://docs.anomalyarmor.ai/data-quality/overview" rel="noopener noreferrer"&gt;Data Quality Monitoring Docs&lt;/a&gt;. Volume, distribution, and anomaly monitoring reference.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Automated Data Testing FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is automated data testing?
&lt;/h3&gt;

&lt;p&gt;Automated data testing is software that continuously validates data without requiring engineers to write explicit test cases. It learns patterns from historical data (volume, schema, distributions, freshness) and alerts when new data deviates from those patterns. It's the opposite of manual test writing like dbt tests or custom SQL assertions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is automated data testing different from dbt tests?
&lt;/h3&gt;

&lt;p&gt;dbt tests are deterministic rules you write manually: "this column is unique", "this foreign key exists". Automated data testing learns baselines from historical data and flags statistical deviations. dbt tests catch known problems. Automated testing catches unknown problems. Most production teams use both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I still need to write data tests if I use automated testing?
&lt;/h3&gt;

&lt;p&gt;Yes, for business-critical invariants. Some rules must be enforced explicitly: "revenue must never be negative", "user_id in orders must exist in users". Write these as dbt tests or validation rules. Use automated testing for everything else (statistical anomalies, freshness, schema changes, volume drops).&lt;/p&gt;

&lt;h3&gt;
  
  
  What can automated data testing detect that manual tests can't?
&lt;/h3&gt;

&lt;p&gt;Automated testing catches things you didn't know to look for: a column's null rate drifting from 2% to 15% over two weeks, row count dropping by 30% on Tuesdays only, a new category appearing in an enum column, a schema change that silently returns NULL for one in a million rows. These are invisible to explicit rules unless you already anticipated them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why don't data engineers write more tests?
&lt;/h3&gt;

&lt;p&gt;Three reasons. First, writing tests requires knowing what to test, and data changes faster than test coverage. Second, test maintenance scales linearly with the number of tables, so a team with 500 tables drowns in test code. Third, the ROI of manual tests is invisible until something breaks, so test writing is hard to justify against feature work.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do automated data tests learn what's normal?
&lt;/h3&gt;

&lt;p&gt;They compute baselines from historical data using statistical methods: running mean and standard deviation (often via Welford's algorithm), distribution fingerprints, seasonality models like Prophet, and moving averages. The baselines update incrementally as new data arrives. Most systems require 7-14 days of history before alerts start firing.&lt;/p&gt;
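&lt;p&gt;Welford's algorithm is small enough to sketch in full. This is the textbook version, maintaining a running mean and sum of squared deviations in O(1) memory per metric:&lt;/p&gt;

```python
# Welford's algorithm: update mean and variance incrementally, one data
# point at a time, without storing the full history.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def stdev(self) -> float:
        # Sample standard deviation; zero until two points have arrived.
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

stats = RunningStats()
for count in [100, 102, 98, 101, 99]:
    stats.update(count)
print(stats.mean, round(stats.stdev, 2))  # 100.0 1.58
```

&lt;p&gt;Because each update is a constant-time arithmetic step, a monitoring system can keep one &lt;code&gt;RunningStats&lt;/code&gt; per table metric and fold in every new observation as it arrives.&lt;/p&gt;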

&lt;h3&gt;
  
  
  What's the false positive rate of automated data testing?
&lt;/h3&gt;

&lt;p&gt;Well-tuned systems run at 5-15% false positive rates using z-scores with sensitivity thresholds of 2-3 standard deviations. Poorly tuned systems can exceed 50%. The key factors are: enough historical data to establish stable baselines, seasonality-aware models for data with weekly or daily patterns, and sensitivity tuning per table based on business criticality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can AI replace data engineers writing tests?
&lt;/h3&gt;

&lt;p&gt;AI can configure and maintain monitoring based on patterns it learns from your data. It can't replace business logic validation. A data engineer still needs to specify what matters to the business. But AI removes the grunt work of writing 500 tests for 500 tables, which is where most test-writing effort is wasted.&lt;/p&gt;

&lt;h3&gt;
  
  
  What tools provide automated data testing?
&lt;/h3&gt;

&lt;p&gt;Leaders in this space include AnomalyArmor, Monte Carlo, Metaplane, Bigeye, and Datafold. Each uses statistical methods to learn baselines and detect anomalies. Open-source options include re_data and Elementary. Traditional tools like Great Expectations require manual test writing but can be combined with profiling to semi-automate.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much historical data do I need before automated testing works?
&lt;/h3&gt;

&lt;p&gt;Minimum 7 days for basic z-score detection on daily data, 14 days for weekly seasonality detection, and 365 days for yearly seasonality. During the initial learning period, alerts should be suppressed or downgraded to warnings. Most tools have a "learning phase" flag that prevents false alerts until the baseline is stable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stop writing and maintaining data tests. &lt;a href="https://blog.anomalyarmor.ai/using-ai-to-set-up-schema-drift-detection/" rel="noopener noreferrer"&gt;See how AnomalyArmor's AI agent configures monitoring from a single sentence.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Data Pipeline Monitoring: How to Stop Silent Failures Before They Hit Production</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:32:03 +0000</pubDate>
      <link>https://dev.to/iblaine/data-pipeline-monitoring-how-to-stop-silent-failures-before-they-hit-production-4i7l</link>
      <guid>https://dev.to/iblaine/data-pipeline-monitoring-how-to-stop-silent-failures-before-they-hit-production-4i7l</guid>
      <description>&lt;p&gt;Your Airflow DAG shows all green. Every task completed. No errors in the logs.&lt;/p&gt;

&lt;p&gt;But the revenue dashboard is showing yesterday's numbers. A downstream ML model is training on stale features. The finance team is about to close the quarter using incomplete data.&lt;/p&gt;

&lt;p&gt;This is the most dangerous type of pipeline failure: the one that doesn't look like a failure at all. And it's far more common than the kind that throws an error.&lt;/p&gt;

&lt;p&gt;Data pipeline monitoring exists to catch exactly this. Not job-level "did it run?" checks. Outcome-level "did the data actually arrive, and does it look right?" checks. The difference between those two questions is where most data incidents live.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is data pipeline monitoring?
&lt;/h2&gt;

&lt;p&gt;Data pipeline monitoring is continuous validation that data is flowing correctly through every stage of your pipeline, from ingestion to transformation to the tables your stakeholders query.&lt;/p&gt;

&lt;p&gt;It covers five dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt;: Is data arriving on schedule?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: Are the expected number of rows landing?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Have columns been added, removed, or changed type?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution&lt;/strong&gt;: Do the values look normal, or has something shifted?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage&lt;/strong&gt;: When something breaks, which downstream tables and dashboards are affected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams start with the first two and add the rest as they scale. But even basic freshness and volume checks catch the majority of incidents that slip past orchestration tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 types of pipeline failures (and which ones your tools miss)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The successful failure
&lt;/h3&gt;

&lt;p&gt;A DAG runs to completion. Zero errors. But the source API returned an empty response, so the pipeline wrote zero rows. The orchestrator sees a successful run. The table is now empty or stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches it&lt;/strong&gt;: Volume monitoring. If a table that normally receives 50,000 rows per load suddenly gets zero, that's an alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The schema surprise
&lt;/h3&gt;

&lt;p&gt;Someone on the source team renames a column from &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;userId&lt;/code&gt;. Your pipeline doesn't error, it just silently drops the column or fills it with nulls. Downstream joins break. Metrics go wrong. Nobody notices for three days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches it&lt;/strong&gt;: Schema change detection. Any added, removed, or type-changed column triggers an alert before downstream transformations run.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The slow drift
&lt;/h3&gt;

&lt;p&gt;Data volumes gradually decrease by 5% per week. No single day looks alarming, but after a month you're missing roughly 20% of your records. The cause might be a filter change upstream, a timezone bug, or a partition misconfiguration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches it&lt;/strong&gt;: Distribution and volume trend monitoring. Anomaly detection that compares today's load against historical patterns, not just a static threshold.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The partial load
&lt;/h3&gt;

&lt;p&gt;The pipeline runs, but only processes data from 3 of 5 source partitions. Row counts look lower than normal, but not dramatically. The missing data is from one region, so the aggregate metrics look "close enough" to pass a quick glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches it&lt;/strong&gt;: Volume monitoring with granular baselines, comparing expected vs actual row counts at the partition or segment level.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The delayed cascade
&lt;/h3&gt;

&lt;p&gt;A source table updates 4 hours late. Downstream transformations ran on schedule and processed stale input. The numbers are technically "fresh" (the downstream table updated on time) but wrong (it used yesterday's source data).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches it&lt;/strong&gt;: Freshness monitoring on source tables, combined with lineage awareness that understands the dependency chain. The downstream table looks fresh, but tracing upstream reveals the root cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why orchestration alerts aren't enough
&lt;/h2&gt;

&lt;p&gt;Airflow, Dagster, Prefect, and similar tools monitor the process: did the job start, run, and finish? They answer "did my code execute?" not "did my data arrive correctly?"&lt;/p&gt;

&lt;p&gt;Three specific gaps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Successful jobs that produce wrong output.&lt;/strong&gt; A job can complete with exit code 0 and write garbage. The orchestrator has no opinion about data content. It ran your code. That's its job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No cross-system visibility.&lt;/strong&gt; Your pipeline pulls from a Postgres source, transforms in dbt, and lands in Snowflake. The orchestrator sees the dbt run. It doesn't know the Postgres source stopped updating two hours before the dbt run kicked off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No historical baselines.&lt;/strong&gt; Orchestration tools tell you about this run. They don't tell you whether this run's output looks normal compared to the last 30 runs. A table loading 1,000 rows isn't alarming, unless it normally loads 100,000.&lt;/p&gt;

&lt;p&gt;Data pipeline monitoring sits on top of orchestration. It checks what the orchestrator can't: the actual data that landed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What good data pipeline monitoring looks like
&lt;/h2&gt;

&lt;p&gt;Effective monitoring has four properties:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. It monitors outcomes, not processes
&lt;/h3&gt;

&lt;p&gt;Check the table, not the job. Did rows arrive? Are the columns intact? Do the values fall within expected ranges? This is the fundamental shift from orchestration monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. It adapts to patterns
&lt;/h3&gt;

&lt;p&gt;A static threshold of "alert if fewer than 10,000 rows" breaks when your table legitimately receives 2,000 rows on weekends. Good monitoring learns the pattern and alerts on deviations from it, not from a fixed number.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. It maps dependencies
&lt;/h3&gt;

&lt;p&gt;When a source table is late, you need to know which downstream tables, dashboards, and reports are affected. Without lineage, you're manually tracing dependencies across systems during an incident, which is the worst time to be doing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. It routes alerts to the right people
&lt;/h3&gt;

&lt;p&gt;A freshness alert on the marketing analytics table should go to the data engineering team that owns that pipeline, not to a shared #data-alerts channel that everyone has muted. Alert routing by ownership turns monitoring from noise into action.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to set up data pipeline monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Identify your critical tables
&lt;/h3&gt;

&lt;p&gt;You don't need to monitor everything on day one. Start with the 10-20 tables that power:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executive dashboards&lt;/li&gt;
&lt;li&gt;Customer-facing data products&lt;/li&gt;
&lt;li&gt;Financial reporting&lt;/li&gt;
&lt;li&gt;ML model features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the tables where a silent failure causes the most damage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Set freshness and volume baselines
&lt;/h3&gt;

&lt;p&gt;For each critical table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt;: How often should this table update? Set the SLA slightly longer than the expected interval. A table that updates hourly gets a 2-hour SLA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: How many rows does a typical load produce? Set a range based on the last 30 days, accounting for weekday/weekend variation.&lt;/li&gt;
&lt;/ul&gt;
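&lt;p&gt;One way to implement the weekday/weekend-aware volume baseline described above (a sketch only; a real system would persist baselines and handle sparse history, and the dates and counts here are illustrative):&lt;/p&gt;

```python
# Per-weekday volume baselines, so weekend loads are judged against
# weekend history rather than the overall average.
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def weekday_baselines(history):
    """history: (load_date, row_count) pairs -> {weekday: (mean, stdev)}"""
    by_day = defaultdict(list)
    for d, count in history:
        by_day[d.weekday()].append(count)
    return {
        day: (mean(counts), stdev(counts) if len(counts) > 1 else 0.0)
        for day, counts in by_day.items()
    }

def is_volume_anomaly(d: date, count: int, baselines: dict, sigma: float = 3.0) -> bool:
    mu, sd = baselines[d.weekday()]
    return sd > 0 and abs(count - mu) > sigma * sd

history = [
    (date(2026, 1, 5), 100_000), (date(2026, 1, 12), 101_000),
    (date(2026, 1, 19), 99_000), (date(2026, 1, 26), 100_000),  # Mondays
    (date(2026, 1, 3), 2_000), (date(2026, 1, 10), 2_100),
    (date(2026, 1, 17), 1_900), (date(2026, 1, 24), 2_000),     # Saturdays
]
baselines = weekday_baselines(history)
# 2,000 rows is normal for a Saturday but a major anomaly for a Monday.
```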

&lt;h3&gt;
  
  
  Step 3: Enable schema change detection
&lt;/h3&gt;

&lt;p&gt;Schema changes are the most common cause of silent pipeline failures. Any column added, removed, renamed, or type-changed should generate an alert. This catches problems at the source before they propagate downstream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Connect your alert channels
&lt;/h3&gt;

&lt;p&gt;Route alerts to Slack, PagerDuty, or email based on table ownership. The person who gets the alert should be the person who can fix it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Expand gradually
&lt;/h3&gt;

&lt;p&gt;Once your critical tables are monitored, expand to the next tier. Most teams reach full coverage within a few weeks, not months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The build vs buy decision
&lt;/h2&gt;

&lt;p&gt;You can build basic monitoring with SQL queries and a scheduler. Check &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; for freshness, run &lt;code&gt;COUNT(*)&lt;/code&gt; for volume, compare schemas against a stored baseline.&lt;/p&gt;
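&lt;p&gt;A sketch of the DIY version. The query uses PostgreSQL's &lt;code&gt;pg_stat_user_tables&lt;/code&gt;, whose analyze timestamps are only a rough freshness proxy; other warehouses expose their own metadata views, and the helper below is illustrative:&lt;/p&gt;

```python
# Minimal DIY freshness check: fetch a last-activity timestamp per table,
# then compare its age against the table's expected update interval.
from datetime import datetime, timedelta, timezone

# Rough proxy on PostgreSQL: GREATEST ignores NULL arguments, so this
# picks whichever analyze timestamp is most recent.
FRESHNESS_QUERY = """
SELECT relname, GREATEST(last_analyze, last_autoanalyze) AS last_activity
FROM pg_stat_user_tables
WHERE schemaname = 'public'
"""

def is_stale(last_updated: datetime, expected_interval: timedelta,
             grace: float = 2.0) -> bool:
    """Alert once a table exceeds `grace` times its expected interval."""
    age = datetime.now(timezone.utc) - last_updated
    return age > expected_interval * grace
```

&lt;p&gt;Scheduling this, storing history for baselines, and routing the resulting alerts is exactly the maintenance burden the list below describes.&lt;/p&gt;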

&lt;p&gt;This works for 5-10 tables. At 50+ tables across multiple databases, you're maintaining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A custom scheduler running checks every 15-60 minutes&lt;/li&gt;
&lt;li&gt;Per-table configurations for thresholds and SLAs&lt;/li&gt;
&lt;li&gt;Historical storage for baselines and trend comparison&lt;/li&gt;
&lt;li&gt;Alert routing logic by table ownership&lt;/li&gt;
&lt;li&gt;A UI for your team to see monitoring status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the monitoring system is its own engineering project. The question is whether your team's time is better spent maintaining monitoring infrastructure or building data products.&lt;/p&gt;

&lt;p&gt;Purpose-built tools like &lt;a href="https://www.anomalyarmor.ai" rel="noopener noreferrer"&gt;AnomalyArmor&lt;/a&gt; handle this out of the box. Connect your warehouse, and freshness, volume, and schema monitoring start automatically. AI-powered analysis explains what changed and why, so you spend less time investigating and more time fixing. Setup takes minutes, not weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes to avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setting thresholds too tight.&lt;/strong&gt; A freshness SLA of 61 minutes on a table that updates hourly will fire every time there's a minor delay. Start generous and tighten over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring everything equally.&lt;/strong&gt; Not every table is critical. A staging table that only you use doesn't need PagerDuty integration. Prioritize by downstream impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring weekends and holidays.&lt;/strong&gt; Many pipelines have legitimately different patterns on weekends. Your monitoring needs to account for this or you'll get false alerts every Saturday.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert channel sprawl.&lt;/strong&gt; Sending every alert to a shared Slack channel guarantees they'll be ignored. Route alerts to the specific team that owns the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating monitoring as a one-time setup.&lt;/strong&gt; Your pipelines change. New tables get added, old ones get deprecated, schedules shift. Monitoring configuration needs to evolve with your data stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's the difference between data pipeline monitoring and data observability?
&lt;/h3&gt;

&lt;p&gt;Data pipeline monitoring focuses on whether data is flowing correctly through your pipelines: freshness, volume, schema. Data observability is the broader discipline that includes monitoring plus lineage, root cause analysis, and historical context. Monitoring is the foundation. Observability is the full picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need monitoring if I already use dbt tests?
&lt;/h3&gt;

&lt;p&gt;Yes. dbt tests validate data at transformation time. They check "is this data correct right now?" Monitoring checks "is this data arriving on schedule, in the expected volume, with the expected schema?" They answer different questions. dbt tests catch logic bugs. Monitoring catches infrastructure and upstream failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many tables should I monitor?
&lt;/h3&gt;

&lt;p&gt;Start with your 10-20 most critical tables. Expand from there. Most teams reach full coverage (all production tables) within a few weeks. The goal is 100% coverage of anything that powers a decision, dashboard, or downstream system.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the right alert threshold for freshness?
&lt;/h3&gt;

&lt;p&gt;Set it at 1.5-2x your expected update interval. A table that updates every hour should alert at 2 hours. A daily table should alert at 25-26 hours. This avoids false alarms from minor delays while catching real failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I build my own pipeline monitoring?
&lt;/h3&gt;

&lt;p&gt;You can, and many teams start there. SQL queries checking freshness and row counts are straightforward for a handful of tables. The maintenance burden grows quickly at scale. Most teams that start DIY either invest significant engineering time maintaining it or switch to a purpose-built tool within 6-12 months.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Observability vs Data Quality: What's the Difference and Do You Need Both?</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:31:15 +0000</pubDate>
      <link>https://dev.to/iblaine/data-observability-vs-data-quality-whats-the-difference-and-do-you-need-both-1mno</link>
      <guid>https://dev.to/iblaine/data-observability-vs-data-quality-whats-the-difference-and-do-you-need-both-1mno</guid>
      <description>&lt;p&gt;Data observability and data quality get used interchangeably, but they solve different problems. Confusing them leads to buying the wrong tool, building the wrong monitors, and missing the issues that actually break things.&lt;/p&gt;

&lt;p&gt;Here's the short version: data observability tells you whether your pipelines are working. Data quality tells you whether the data itself is correct. One watches the plumbing. The other checks the water.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data observability: watching the pipes
&lt;/h2&gt;

&lt;p&gt;Data observability monitors the infrastructure that moves data. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did this table update on schedule? (freshness)&lt;/li&gt;
&lt;li&gt;Did the number of rows change unexpectedly? (volume)&lt;/li&gt;
&lt;li&gt;Did someone add, remove, or rename columns? (schema changes)&lt;/li&gt;
&lt;li&gt;Where did this data come from, and what depends on it? (lineage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all things you can measure without knowing anything about what the data means. You don't need business logic. You don't need to know that &lt;code&gt;revenue&lt;/code&gt; should always be positive or that &lt;code&gt;email&lt;/code&gt; should contain an &lt;code&gt;@&lt;/code&gt; sign. You're just watching patterns and alerting when they break.&lt;/p&gt;

&lt;p&gt;Data observability catches problems like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Airflow DAG failed silently at 3am and your morning dashboards show stale data&lt;/li&gt;
&lt;li&gt;A backend engineer renamed &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;account_id&lt;/code&gt; and broke 12 downstream models&lt;/li&gt;
&lt;li&gt;A bulk delete wiped 40% of your rows and nobody noticed for two days&lt;/li&gt;
&lt;li&gt;A table that normally updates every hour hasn't been touched in six hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are infrastructure failures. The data pipeline broke, and observability tells you where and when.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data quality: checking the water
&lt;/h2&gt;

&lt;p&gt;Data quality validates the actual content of your data. It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is &lt;code&gt;email&lt;/code&gt; always a valid email address? (validity)&lt;/li&gt;
&lt;li&gt;Are there duplicate rows in the orders table? (uniqueness)&lt;/li&gt;
&lt;li&gt;Does every &lt;code&gt;order_id&lt;/code&gt; in the line items table exist in the orders table? (referential integrity)&lt;/li&gt;
&lt;li&gt;Is &lt;code&gt;price&lt;/code&gt; always positive? (range/business rules)&lt;/li&gt;
&lt;li&gt;Are null rates for critical columns within expected bounds? (completeness)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks require domain knowledge. Someone has to decide that &lt;code&gt;price&lt;/code&gt; should be positive, that &lt;code&gt;email&lt;/code&gt; should match a pattern, that &lt;code&gt;country_code&lt;/code&gt; should be in a known list. The tool can automate the checking, but a human has to define what "correct" means.&lt;/p&gt;
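&lt;p&gt;As a sketch, a rule set like that can be as simple as a dictionary of predicates. The column names and rules here are illustrative, the kind of thing a human with domain knowledge would write down:&lt;/p&gt;

```python
import re

# Each rule is a predicate a human defined; the tool just runs them.
RULES = {
    "price":        lambda v: v is not None and v > 0,
    "email":        lambda v: v is not None and re.fullmatch(r"[^@\s]+@[^@\s]+", v),
    "country_code": lambda v: v in {"US", "GB", "DE", "FR"},
}

def violations(row: dict) -> list:
    """Names of the rules this row breaks."""
    return [col for col, ok in RULES.items() if col in row and not ok(row[col])]

print(violations({"price": -5, "email": "a@b.com", "country_code": "US"}))
# ['price']
```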

&lt;p&gt;Data quality catches problems like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A third-party API started sending prices in cents instead of dollars&lt;/li&gt;
&lt;li&gt;A form change allowed empty email addresses into the database&lt;/li&gt;
&lt;li&gt;Duplicate records from a retry bug inflated conversion metrics by 15%&lt;/li&gt;
&lt;li&gt;A timezone bug shifted all timestamps by 5 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are data content failures. The pipeline worked fine. The data arrived on time, with the right schema, in the right volume. It was just wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where they overlap
&lt;/h2&gt;

&lt;p&gt;The line between observability and quality isn't always clean. Some examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume anomalies&lt;/strong&gt; sit in both camps. A sudden drop in row count could be a pipeline failure (observability) or a business change (quality). The monitoring is the same. The response is different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Null rate spikes&lt;/strong&gt; are technically a quality metric, but a sudden increase in nulls for a column that's always been 100% populated usually means something broke upstream. That's an observability signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema changes&lt;/strong&gt; are pure observability, but they can cause data quality problems downstream. A column type change from &lt;code&gt;int&lt;/code&gt; to &lt;code&gt;varchar&lt;/code&gt; might not break the pipeline, but it could produce garbage in your aggregations.&lt;/p&gt;

&lt;p&gt;Most modern tools handle both to some degree. The question is emphasis.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use which
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with observability if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't have any monitoring today and want coverage fast&lt;/li&gt;
&lt;li&gt;Your biggest pain point is stale dashboards and broken pipelines&lt;/li&gt;
&lt;li&gt;You want automated detection without writing rules for every table&lt;/li&gt;
&lt;li&gt;You have hundreds of tables and can't manually define quality checks for all of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability tools can start monitoring the day you connect. They learn what "normal" looks like and alert on deviations. No configuration needed for the basics.&lt;/p&gt;
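&lt;p&gt;A toy version of that "learn normal, alert on deviation" idea is a z-score against recent history. The row counts below are made up; production tools use more sophisticated models that handle seasonality:&lt;/p&gt;

```python
import statistics

def is_anomalous(history: list, today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's value if it sits more than z_threshold standard
    deviations from the mean of recent history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_150]
print(is_anomalous(daily_row_counts, 10_100))  # False: normal variation
print(is_anomalous(daily_row_counts, 6_000))   # True: likely a broken pipeline
```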

&lt;p&gt;&lt;strong&gt;Add quality checks when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have specific business rules that must always hold (prices &amp;gt; 0, no duplicate orders)&lt;/li&gt;
&lt;li&gt;You're dealing with data from external sources you don't control&lt;/li&gt;
&lt;li&gt;Regulatory compliance requires you to prove data accuracy&lt;/li&gt;
&lt;li&gt;Your data powers ML models where subtle incorrectness compounds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quality checks are more effort to set up but catch problems that observability misses. A table can be perfectly fresh, with the right schema and normal volume, and still be full of wrong data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical answer:&lt;/strong&gt; Start with observability for broad coverage, then layer quality checks on your most critical tables. You get 80% of the value from observability with 20% of the setup effort. Quality checks fill the gap for the tables where correctness actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the tools stack up
&lt;/h2&gt;

&lt;p&gt;Most tools in this space started on one side and expanded toward the other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability-first tools&lt;/strong&gt; (AnomalyArmor, Bigeye, Metaplane) give you automated schema, freshness, and volume monitoring out of the box. You connect a database, and within minutes you have baseline coverage across every table. Quality features were added later: custom metrics, validity rules, referential checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance-first tools&lt;/strong&gt; (Monte Carlo) started with enterprise data governance, cataloging, and compliance, then expanded into observability and monitoring. They're comprehensive but come with enterprise pricing and longer setup times. If your primary need is pipeline monitoring, you're paying for a lot of surface area you don't use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality-first tools&lt;/strong&gt; (Great Expectations, Soda, dbt tests) start with explicit validation rules that you write. You define expectations ("this column should never be null," "row count should be between 1000 and 5000") and the tool checks them on a schedule. Observability features like freshness monitoring and lineage are bolted on or require additional setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trend is convergence.&lt;/strong&gt; Every observability tool now adds quality metrics. Every quality tool now has some form of freshness monitoring. Governance tools are expanding down-market. The difference is which side is mature and which side feels like an afterthought.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to look for in practice
&lt;/h2&gt;

&lt;p&gt;Skip the category debate and focus on what actually matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to first alert.&lt;/strong&gt; How fast can you go from zero monitoring to getting notified when something breaks? If the answer is weeks of configuration, that's a quality-first tool pretending to do observability. If the answer is hours, that's observability-first, which is what you want for starting out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positive rate.&lt;/strong&gt; A tool that alerts on everything is worse than no tool. AI-powered anomaly detection that learns your data's patterns produces fewer false alarms than static thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom rule support.&lt;/strong&gt; At some point you'll need business-specific checks. Can you define custom SQL metrics? Can you set validity rules? Can you do referential integrity checks across tables?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lineage.&lt;/strong&gt; When something breaks, can you see what's affected downstream? Lineage turns a "this table looks weird" alert into "this table looks weird and it feeds your executive dashboard, the churn model, and the finance report."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration with your stack.&lt;/strong&gt; Alerts should go where your team works (Slack, PagerDuty). The tool should connect to what you already run (dbt, Airflow, Snowflake, Databricks, PostgreSQL). Bonus points for AI agent integration via MCP so your coding assistant can check data health.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Data observability and data quality are complementary, not competing. Observability gives you broad, automated coverage across your entire data estate. Quality gives you precise, rule-based validation on critical data.&lt;/p&gt;

&lt;p&gt;If you're starting from zero, start with observability. Connect your databases, get baseline monitoring, and stop finding out about broken pipelines from angry stakeholders. Then add quality checks where they matter most.&lt;/p&gt;

&lt;p&gt;If you already have dbt tests or Great Expectations running, you have quality covered. Add observability to catch the problems that explicit tests can't: the pipeline that failed silently, the schema that changed without notice, the table that stopped updating on a holiday.&lt;/p&gt;

&lt;p&gt;Either way, the goal is the same: find out about data problems before your stakeholders do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Observability vs Data Quality FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is data observability?
&lt;/h3&gt;

&lt;p&gt;Data observability is the practice of monitoring data systems end-to-end to understand the health, reliability, and performance of data pipelines. It tracks freshness, volume, schema changes, lineage, and incidents across your data stack. The term is borrowed from software observability but applied to data infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is data quality?
&lt;/h3&gt;

&lt;p&gt;Data quality is the measure of how well data meets the needs of its users. It covers dimensions like accuracy, completeness, consistency, timeliness, uniqueness, and validity. Data quality focuses on the data itself, while observability focuses on the systems producing and moving the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need both data observability and data quality?
&lt;/h3&gt;

&lt;p&gt;Most production data teams need both. Observability catches pipeline failures, stale tables, and schema drift. Quality catches bad values, missing records, and business rule violations. They overlap in some areas (freshness, volume anomalies) but diverge in others (lineage vs validation rules). The cleanest approach is to use observability for infrastructure monitoring and quality rules for content validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between data observability and data monitoring?
&lt;/h3&gt;

&lt;p&gt;Data monitoring is a subset of data observability. Monitoring tracks specific metrics and fires alerts. Observability adds context: lineage showing which pipeline caused a problem, incident history, cross-system correlation, and root cause analysis. Observability is what you do with monitoring data to understand the why, not just the what.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the five pillars of data observability?
&lt;/h3&gt;

&lt;p&gt;The five commonly cited pillars are: &lt;strong&gt;Freshness&lt;/strong&gt; (is the data up to date?), &lt;strong&gt;Volume&lt;/strong&gt; (is the expected amount of data arriving?), &lt;strong&gt;Schema&lt;/strong&gt; (has the structure changed?), &lt;strong&gt;Lineage&lt;/strong&gt; (what depends on what?), and &lt;strong&gt;Distribution&lt;/strong&gt; (are the values within expected ranges?). Some vendors add Quality as a sixth pillar.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does data observability differ from application observability?
&lt;/h3&gt;

&lt;p&gt;Application observability tracks request latency, error rates, and resource usage in services. Data observability tracks data characteristics: freshness, volume, schema, and statistical properties. The underlying principle is the same (instrument everything so you can diagnose problems), but the metrics and tools are different.&lt;/p&gt;

&lt;h3&gt;
  
  
  What tools provide data observability?
&lt;/h3&gt;

&lt;p&gt;Popular data observability platforms include AnomalyArmor, Monte Carlo, Metaplane, Bigeye, Datafold, Soda, and Databand. Open-source options include Great Expectations, Elementary, and re_data. Each has different strengths in terms of platform support, setup complexity, and price.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is dbt enough for data quality?
&lt;/h3&gt;

&lt;p&gt;dbt provides tests (schema tests, custom SQL tests) that work for deterministic validation inside your transformation layer. On its own, dbt is usually not enough for production data quality: &lt;code&gt;dbt source freshness&lt;/code&gt; covers only declared sources and only runs when invoked, and dbt doesn't provide cross-pipeline lineage or detect statistical anomalies. Most teams pair dbt tests with a data observability tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does data observability cost?
&lt;/h3&gt;

&lt;p&gt;Pricing varies widely. Enterprise tools like Monte Carlo start at $15-25k/year for small deployments. Mid-market tools like Metaplane and AnomalyArmor price per monitored table, typically $5-10/table/month. Open-source tools have no license cost but require engineering time to maintain. Budget based on your number of tables and the criticality of your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I build data observability in-house?
&lt;/h3&gt;

&lt;p&gt;Yes, but most teams outgrow custom solutions within 6-12 months. In-house data observability typically covers 2-3 pillars well (usually freshness and volume) but falls short on lineage, incident management, and statistical anomaly detection. If you have &amp;lt;20 critical tables and a strong data engineering team, in-house can work. Past that, buying a tool is cheaper than building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AnomalyArmor combines data observability and quality monitoring in one platform. &lt;a href="https://blog.anomalyarmor.ai/using-ai-to-set-up-schema-drift-detection/" rel="noopener noreferrer"&gt;Try the schema drift demo&lt;/a&gt; to see how the AI agent handles both.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataobservability</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Data Quality Monitoring for Snowflake and Databricks: A Practical Guide</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:04:59 +0000</pubDate>
      <link>https://dev.to/iblaine/data-quality-monitoring-for-snowflake-and-databricks-a-practical-guide-512e</link>
      <guid>https://dev.to/iblaine/data-quality-monitoring-for-snowflake-and-databricks-a-practical-guide-512e</guid>
      <description>&lt;p&gt;Last Tuesday a source team renamed &lt;code&gt;order_status&lt;/code&gt; to &lt;code&gt;status&lt;/code&gt; in Snowflake. No announcement. Your dbt models kept running, but every query referencing &lt;code&gt;order_status&lt;/code&gt; silently returned NULLs. The revenue dashboard showed a 40% drop. It took four hours to trace it back to a one-word column rename.&lt;/p&gt;

&lt;p&gt;Schema changes like this are the #1 cause of silent data pipeline failures, and neither Snowflake nor Databricks will warn you when they happen. Data quality monitoring catches these problems automatically. Here's how to set it up for the two most common cloud data platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four types of data quality monitoring
&lt;/h2&gt;

&lt;p&gt;Data quality monitoring breaks down into four categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema change detection.&lt;/strong&gt; A column gets added, removed, renamed, or changes type. Your query doesn't error. It just returns NULLs or wrong values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/data-freshness-monitoring"&gt;Freshness monitoring&lt;/a&gt;.&lt;/strong&gt; A daily table hasn't updated in 36 hours. Something upstream is broken and nobody knows yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomaly detection.&lt;/strong&gt; Row count drops 80% overnight. Null rate in a critical column spikes from 2% to 40%. Statistical anomalies surface data quality issues before dashboards break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness checks.&lt;/strong&gt; Is &lt;code&gt;order_total&lt;/code&gt; ever negative? Does every &lt;code&gt;customer_id&lt;/code&gt; in the fact table exist in the dimension table? Business-logic validations specific to your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data quality monitoring in Snowflake
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema tracking
&lt;/h3&gt;

&lt;p&gt;Snowflake's &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; tracks table and column metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordinal_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PUBLIC'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordinal_position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The limitation: this shows current state, not change history. To detect changes, you need to snapshot metadata on a schedule and diff it.&lt;/p&gt;
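&lt;p&gt;A snapshot-and-diff sketch, assuming each snapshot is the result of a metadata query like the one above loaded into a dict keyed by (table, column). The table and column names are illustrative:&lt;/p&gt;

```python
def diff_schemas(old: dict, new: dict) -> dict:
    """old/new map (table, column) -> data_type, one metadata snapshot each."""
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

yesterday = {("ORDERS", "ORDER_STATUS"): "VARCHAR", ("ORDERS", "TOTAL"): "NUMBER"}
today     = {("ORDERS", "STATUS"): "VARCHAR",       ("ORDERS", "TOTAL"): "FLOAT"}

print(diff_schemas(yesterday, today))
```

A rename shows up as one column removed plus one added, which is exactly the signature of the &lt;code&gt;order_status&lt;/code&gt; incident described earlier. The remaining work is scheduling this per table and routing the diff somewhere someone will see it.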

&lt;h3&gt;
  
  
  Freshness
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;INFORMATION_SCHEMA.TABLES.LAST_ALTERED&lt;/code&gt; gives you the last DDL or DML timestamp. For row-level freshness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_data_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where Snowflake's built-in tools fall short
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time Travel&lt;/strong&gt; provides 90 days of history but querying it at scale for monitoring is expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streams and Tasks&lt;/strong&gt; detect changes but require setup per table with no unified view.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Tables&lt;/strong&gt; handle some freshness concerns but don't cover schema changes or anomalies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data quality monitoring in Databricks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema tracking
&lt;/h3&gt;

&lt;p&gt;Delta Lake enforces schemas by default. &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; shows schema evolution over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more useful than Snowflake's current-state-only metadata because you can see what changed and when.&lt;/p&gt;

&lt;h3&gt;
  
  
  Freshness
&lt;/h3&gt;

&lt;p&gt;Delta table history includes commit timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where Databricks' built-in tools fall short
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog&lt;/strong&gt; focuses on access control, not data quality monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse Monitoring&lt;/strong&gt; (preview) provides profiling but is limited to Databricks-native tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Live Tables expectations&lt;/strong&gt; handle correctness within DLT pipelines but don't cover freshness or schema drift across your broader data estate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What built-in tools miss
&lt;/h2&gt;

&lt;p&gt;Both platforms provide the primitives: metadata queries, change history, schema enforcement. Turning those primitives into monitoring requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt; checks across hundreds of tables every 15-60 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diffing&lt;/strong&gt; current state against historical snapshots to detect changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-table thresholds&lt;/strong&gt; because a real-time events table and a weekly rollup need different SLAs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert routing&lt;/strong&gt; to the team that owns each pipeline, not a shared channel everyone ignores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform visibility&lt;/strong&gt; because most teams run Snowflake and Databricks alongside BigQuery, Redshift, or PostgreSQL. Monitoring each platform separately means blind spots at the boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this yourself is the common starting point. A scheduled SQL job checks freshness, a Python script diffs schemas, a cron sends Slack messages. It works at 20 tables. At 200 tables across three databases, it becomes its own engineering project. (The &lt;a href="https://dev.to/state-of-data-engineering-2026-firefighting"&gt;2026 State of Data Engineering Survey&lt;/a&gt; found that this kind of reactive maintenance consumes 60% of data engineering time.)&lt;/p&gt;

&lt;p&gt;This is where dedicated monitoring tools earn their keep. Instead of maintaining a patchwork of scripts, you connect your warehouses once and get schema change detection, freshness monitoring, and anomaly alerts across everything. &lt;a href="https://www.anomalyarmor.ai" rel="noopener noreferrer"&gt;AnomalyArmor&lt;/a&gt; does exactly this for Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. Email &lt;a href="mailto:support@anomalyarmor.ai"&gt;support@anomalyarmor.ai&lt;/a&gt; for a trial code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;If you're evaluating data quality monitoring for Snowflake or Databricks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory your critical tables.&lt;/strong&gt; Which 10-20 tables would cause the most pain if they broke silently?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up schema change detection first.&lt;/strong&gt; Highest value, lowest effort. Schema changes cause silent failures that are expensive to debug after the fact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add freshness monitoring for daily-batch tables.&lt;/strong&gt; The most common source of "the dashboard is showing yesterday's data" incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer in anomaly detection once you have baseline data.&lt;/strong&gt; Anomaly detection needs history to establish normal patterns. Start collecting now.&lt;/li&gt;
&lt;/ol&gt;







&lt;h2&gt;
  
  
  Snowflake and Databricks Data Quality FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How is data quality monitoring different in Snowflake vs Databricks?
&lt;/h3&gt;

&lt;p&gt;Snowflake exposes metadata through &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; and &lt;code&gt;ACCOUNT_USAGE&lt;/code&gt;, which makes schema tracking and freshness checks straightforward via SQL. Databricks uses Unity Catalog plus Delta Live Tables expectations, which puts data quality checks inside the pipeline itself. Both platforms support custom SQL monitors but have different strengths: Snowflake is better for cross-database monitoring, Databricks is better for streaming and pipeline-level quality gates.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is schema drift in Snowflake?
&lt;/h3&gt;

&lt;p&gt;Schema drift in Snowflake refers to unexpected changes in table structure: columns added, dropped, renamed, or having their types changed. Snowflake's &lt;code&gt;INFORMATION_SCHEMA.COLUMNS&lt;/code&gt; view shows only the current structure (the &lt;code&gt;LAST_ALTERED&lt;/code&gt; timestamp lives on &lt;code&gt;INFORMATION_SCHEMA.TABLES&lt;/code&gt;), so schema drift detection requires a scheduled query that snapshots the view and compares current state against the previous snapshot.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do Delta Live Tables expectations work?
&lt;/h3&gt;

&lt;p&gt;Delta Live Tables (DLT) expectations are SQL predicates attached to tables in a pipeline. You write a check like &lt;code&gt;@dlt.expect("valid_id", "id IS NOT NULL")&lt;/code&gt; and DLT enforces it on every row. Rows that fail can be dropped, logged, or cause the pipeline to fail. It's data quality enforcement built into the pipeline, rather than monitoring after the fact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use the same data quality tool for Snowflake and Databricks?
&lt;/h3&gt;

&lt;p&gt;Yes. Tools like AnomalyArmor, Monte Carlo, Metaplane, and Soda work across Snowflake, Databricks, BigQuery, and PostgreSQL. They abstract the warehouse-specific metadata queries behind a unified interface. This is important if you use both platforms, because maintaining separate monitoring systems for each creates operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Snowflake Data Metric Functions?
&lt;/h3&gt;

&lt;p&gt;Snowflake Data Metric Functions (DMFs) are built-in functions that measure specific data quality attributes: null count, distinct count, duplicate count, freshness, and custom metrics you define. They run as scheduled jobs on a table and store results in a metadata table. DMFs are native to Snowflake but limited to the platform and lack alert routing or cross-table analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I monitor data quality across multiple warehouses?
&lt;/h3&gt;

&lt;p&gt;Use a tool that connects to all your warehouses and provides a unified view. Running separate monitoring per warehouse creates blind spots at the boundaries: when data moves from Snowflake to Databricks to a downstream warehouse, you need end-to-end visibility. Cross-warehouse tools handle schema drift tracking, freshness monitoring, and anomaly detection in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the cost of data quality monitoring in Snowflake?
&lt;/h3&gt;

&lt;p&gt;Monitoring queries consume Snowflake credits. Schema metadata queries are cheap (milliseconds). Statistical queries on large tables can be expensive if run frequently. Most monitoring tools optimize by sampling, caching, and running queries on off-peak warehouses. Budget 1-5% of your Snowflake spend for monitoring queries if you monitor 50+ critical tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use dbt tests or a data observability tool?
&lt;/h3&gt;

&lt;p&gt;dbt tests are good for deterministic validation: "this column should never be null", "these two tables should have matching row counts". Data observability tools are good for statistical anomaly detection, schema drift tracking, freshness monitoring, and cross-table analysis. Most teams use both: dbt tests inside the transformation layer, and observability tools for production monitoring across all data sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I detect when a Databricks table stops updating?
&lt;/h3&gt;

&lt;p&gt;Query &lt;code&gt;information_schema.tables&lt;/code&gt; for the &lt;code&gt;last_altered&lt;/code&gt; timestamp, or check the Delta transaction log for the most recent commit. Compare against a freshness SLA. You can also watch for a gap between the expected job schedule and the actual last update time. Dedicated monitoring tools automate this by tracking per-table freshness patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the minimum setup time for data quality monitoring?
&lt;/h3&gt;

&lt;p&gt;For a single table with basic checks (row count, null rate, schema), you can set up monitoring in under 10 minutes with most tools. For comprehensive coverage across 50+ tables with alert routing, SLAs, and seasonality detection, budget 1-2 days of configuration. Tools with AI-powered setup (like AnomalyArmor's agent) can configure monitoring from natural language in minutes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AnomalyArmor monitors Snowflake, Databricks, PostgreSQL, BigQuery, and more from a single interface. &lt;a href="https://blog.anomalyarmor.ai/using-ai-to-set-up-schema-drift-detection/" rel="noopener noreferrer"&gt;See how the AI agent configures monitoring in seconds.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>databricks</category>
      <category>monitoring</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Data Freshness Monitoring: How to Detect Stale Data Before It Breaks Dashboards</title>
      <dc:creator>Blaine Elliott</dc:creator>
      <pubDate>Sat, 11 Apr 2026 22:04:45 +0000</pubDate>
      <link>https://dev.to/iblaine/data-freshness-monitoring-how-to-detect-stale-data-before-it-breaks-dashboards-5hkm</link>
      <guid>https://dev.to/iblaine/data-freshness-monitoring-how-to-detect-stale-data-before-it-breaks-dashboards-5hkm</guid>
      <description>&lt;p&gt;A dashboard shows yesterday's revenue as $0. The CEO pings the data team. Someone checks the pipeline, finds the source table hasn't updated in 18 hours, and kicks off a manual backfill. The dashboard was wrong for half a day before anyone noticed.&lt;/p&gt;

&lt;p&gt;This is the most common data quality incident in production, and the easiest to prevent. Data freshness monitoring checks whether your tables are updating on schedule and alerts you when they stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is data freshness in data engineering?
&lt;/h2&gt;

&lt;p&gt;Data freshness measures how recently a table or dataset was updated. A table with a freshness SLA of 1 hour should have new data no older than 60 minutes. If it falls behind, something is broken upstream: a failed job, a delayed extract, a stuck queue.&lt;/p&gt;

&lt;p&gt;Freshness is different from correctness. A table can be perfectly fresh and full of wrong data. But staleness is the most visible failure mode because dashboards go blank, reports show zeros, and stakeholders notice immediately.&lt;/p&gt;

&lt;p&gt;Most data teams discover freshness issues reactively. Someone complains, an engineer investigates, and the root cause turns out to be a pipeline that failed silently hours ago. Industry surveys consistently find that data teams spend over half their time on this kind of reactive work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why stale data is expensive
&lt;/h2&gt;

&lt;p&gt;Stale data costs more than the incident itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Investigation time.&lt;/strong&gt; The engineer who gets paged spends 30-60 minutes tracing the staleness back through the pipeline. Which upstream table stopped updating? Which job failed? When?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust erosion.&lt;/strong&gt; Every time a dashboard shows stale numbers, stakeholders trust the data less. Once trust is gone, they build their own spreadsheets and stop using the centralized data platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading failures.&lt;/strong&gt; One stale source table can affect dozens of downstream tables, dashboards, and reports. By the time someone notices, the blast radius has grown.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How data freshness monitoring works
&lt;/h2&gt;

&lt;p&gt;A freshness monitor does three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tracks update timestamps.&lt;/strong&gt; For each monitored table, it records when new data last arrived. This can use metadata queries (&lt;code&gt;information_schema&lt;/code&gt;), row timestamps, or partition metadata depending on your warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compares against a threshold.&lt;/strong&gt; You define how stale is too stale. A real-time events table might need a 15-minute threshold. A daily aggregate might tolerate 25 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Alerts when the threshold is breached.&lt;/strong&gt; The monitor sends an alert to Slack, email, or PagerDuty. The alert should include which table, how late it is, and ideally what upstream dependency is the likely cause.&lt;/p&gt;
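&lt;p&gt;The three steps fit in a few lines. A minimal sketch in Python (the table name, SLA value, and alert payload shape are illustrative, not a prescribed format):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def check_freshness(table, last_update, sla, now=None):
    """Return an alert payload if the table has breached its freshness SLA,
    or None if it is still fresh."""
    now = now or datetime.now(timezone.utc)
    age = now - last_update          # step 1: how old is the latest data?
    if age > sla:                    # step 2: compare against the threshold
        # step 3: hand this payload to Slack/email/PagerDuty delivery
        return {"table": table, "age": age, "late_by": age - sla}
    return None
```

&lt;p&gt;In production, &lt;code&gt;last_update&lt;/code&gt; would come from warehouse metadata or a max timestamp column, and the returned payload would be routed to your alerting channel.&lt;/p&gt;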

&lt;h2&gt;
  
  
  How to set data freshness SLAs
&lt;/h2&gt;

&lt;p&gt;The most common mistake is setting the same threshold for every table. A 1-hour SLA on a table that updates weekly creates noise. A 24-hour SLA on a real-time table misses every incident.&lt;/p&gt;

&lt;p&gt;Start with how the table is consumed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time / streaming&lt;/strong&gt; (event logs, clickstream): 15-30 minute SLA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hourly batch&lt;/strong&gt; (hourly aggregates): 2-hour SLA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily batch&lt;/strong&gt; (daily snapshots, dim tables): 25-26 hour SLA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly&lt;/strong&gt; (weekly rollups): 8-day SLA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set the SLA slightly longer than the expected update interval. A table that updates every hour should have a 2-hour SLA, not a 61-minute one. This avoids false alerts from minor delays while still catching real failures.&lt;/p&gt;
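&lt;p&gt;In code, per-table SLAs often start as nothing more than a lookup. A sketch using the defaults above (the table names and cadence assignments are hypothetical):&lt;/p&gt;

```python
from datetime import timedelta

# Default SLAs per update cadence, mirroring the guidance above.
DEFAULT_SLAS = {
    "realtime": timedelta(minutes=30),
    "hourly": timedelta(hours=2),
    "daily": timedelta(hours=26),
    "weekly": timedelta(days=8),
}

# Hypothetical assignment of tables to cadences.
TABLE_CADENCE = {
    "events_raw": "realtime",
    "orders_hourly": "hourly",
    "daily_revenue": "daily",
    "weekly_rollup": "weekly",
}

def sla_for(table: str) -> timedelta:
    """Look up the freshness SLA for a table via its update cadence."""
    return DEFAULT_SLAS[TABLE_CADENCE[table]]
```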

&lt;h2&gt;
  
  
  Why pipeline orchestration alerts aren't enough
&lt;/h2&gt;

&lt;p&gt;Airflow, Dagster, and Prefect already alert on job failures. Why add freshness monitoring?&lt;/p&gt;

&lt;p&gt;Because jobs can succeed without producing data. A DAG completes with a green checkmark and writes zero rows because the source API returned an empty response. The orchestrator sees success. The table is stale. Nobody knows until a dashboard breaks.&lt;/p&gt;

&lt;p&gt;This is the blind spot that catches most teams. Orchestration monitors the process: did the job run? Freshness monitoring checks the outcome: did data actually arrive? Those are different questions with different answers, and the gap between them is where stale data incidents live.&lt;/p&gt;

&lt;p&gt;If you've ever had a "successful" pipeline run that produced an empty table, you've hit this blind spot. Freshness monitoring is the only way to catch it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DIY freshness monitoring breaks down
&lt;/h2&gt;

&lt;p&gt;You can query &lt;code&gt;INFORMATION_SCHEMA.TABLES.LAST_ALTERED&lt;/code&gt; in Snowflake, run &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; in Databricks, or check &lt;code&gt;last_modified_time&lt;/code&gt; in BigQuery. One table is easy.&lt;/p&gt;
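&lt;p&gt;For one table, the whole check is a metadata query plus a timestamp comparison. A sketch of the Snowflake flavor (the database, schema, and table names are made up; the comparison logic is the same on any warehouse):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Snowflake: LAST_ALTERED from INFORMATION_SCHEMA.TABLES.
# BigQuery and Databricks expose equivalent metadata.
STALENESS_SQL = """
SELECT last_altered
FROM analytics.information_schema.tables
WHERE table_schema = 'PUBLIC' AND table_name = 'ORDERS'
"""

def is_stale(last_altered: datetime, sla: timedelta, now=None) -> bool:
    """True if the metadata timestamp is older than the table's SLA."""
    now = now or datetime.now(timezone.utc)
    return now - last_altered > sla
```

&lt;p&gt;Running the query and feeding the result into &lt;code&gt;is_stale&lt;/code&gt; is the easy part. Everything after that is where DIY gets expensive.&lt;/p&gt;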

&lt;p&gt;The problem is scale. At 50 tables across two databases, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A scheduler running checks every 15-60 minutes&lt;/li&gt;
&lt;li&gt;Per-table thresholds (not every table has the same SLA)&lt;/li&gt;
&lt;li&gt;Historical tracking to distinguish "late today" from "always late on Sundays"&lt;/li&gt;
&lt;li&gt;Alert routing to the team that owns each pipeline, not a shared channel everyone ignores&lt;/li&gt;
&lt;li&gt;Cross-database visibility so you can trace a stale table to its upstream root cause in a different warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cron job with a SQL query gets you started. Maintaining that cron job across hundreds of tables, multiple warehouses, and changing schedules becomes its own engineering project. Most teams that start with DIY eventually spend more time maintaining the monitoring than they saved by catching stale data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with data freshness monitoring
&lt;/h2&gt;

&lt;p&gt;If you're setting up freshness monitoring for the first time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with 5-10 critical tables.&lt;/strong&gt; The ones that power executive dashboards and key reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set generous SLAs initially.&lt;/strong&gt; Tighten them after you understand normal patterns. Starting too tight creates alert fatigue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route alerts to owners.&lt;/strong&gt; A freshness alert should go to the team that owns the pipeline, not a shared channel where it gets ignored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track resolution time.&lt;/strong&gt; How long between "table went stale" and "table is fresh again"? This tells you if monitoring is actually reducing incident duration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How AnomalyArmor handles freshness monitoring
&lt;/h2&gt;

&lt;p&gt;AnomalyArmor is built to handle the problems described above so you don't have to build and maintain the monitoring infrastructure yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic pattern detection.&lt;/strong&gt; When you connect a database, AnomalyArmor runs schema discovery and analyzes historical update patterns for each table. Instead of manually figuring out that &lt;code&gt;orders&lt;/code&gt; updates hourly and &lt;code&gt;daily_revenue&lt;/code&gt; updates at 6am, the system detects those patterns and suggests appropriate thresholds. You review and adjust, but you're not starting from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-table thresholds with SLA tracking.&lt;/strong&gt; Every table gets its own freshness configuration. Set different thresholds for your real-time event tables (15 minutes) and your weekly rollups (8 days). The freshness chart shows historical update patterns alongside SLA lines so you can see at a glance whether a table is trending toward a violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operating schedules and blackout windows.&lt;/strong&gt; Not every alert matters at 3am on a Sunday. AnomalyArmor lets you define operating schedules (only alert during business hours) and blackout windows (suppress alerts during planned maintenance). This is the difference between a monitoring tool your team trusts and one they mute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-warehouse visibility.&lt;/strong&gt; Monitor Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL from a single dashboard. When a table goes stale, trace the issue upstream across databases using dbt lineage integration to find the root cause, not just the symptom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert routing to the right people.&lt;/strong&gt; Freshness alerts go to Slack, email, or webhooks. Route different tables to different channels or teams. An alert about a finance table goes to the data platform team. An alert about marketing attribution goes to the analytics team. No shared channel where everything gets ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-powered investigation.&lt;/strong&gt; When a freshness violation fires, AnomalyArmor's AI investigation correlates it with other recent incidents, identifies the likely root cause, and suggests resolution steps. Instead of spending 30 minutes tracing the problem, you get a summary of what broke and why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.anomalyarmor.ai/sign-up" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; to explore a demo database preloaded with freshness violations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Freshness Monitoring FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is data freshness monitoring?
&lt;/h3&gt;

&lt;p&gt;Data freshness monitoring is the process of automatically checking whether tables, datasets, or data streams are updating on their expected schedule. It alerts you when data stops arriving so you can fix the problem before downstream dashboards, reports, or ML models are affected.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is data freshness different from data quality?
&lt;/h3&gt;

&lt;p&gt;Data freshness measures how recently data was updated. Data quality measures whether the data itself is correct, complete, and consistent. A table can be perfectly fresh but contain wrong values. Freshness is one dimension of data quality, but most teams monitor them separately because the failure modes and fixes are different.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a data freshness SLA?
&lt;/h3&gt;

&lt;p&gt;A data freshness SLA is a service level agreement that specifies how often a table should update. For example, "the orders table must have new rows every 1 hour during business hours, or alert." SLAs turn vague expectations into measurable thresholds that monitoring tools can check.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I detect stale data in Snowflake?
&lt;/h3&gt;

&lt;p&gt;In Snowflake, query &lt;code&gt;ACCOUNT_USAGE.TABLES&lt;/code&gt; for the &lt;code&gt;LAST_ALTERED&lt;/code&gt; timestamp on each table, or check the max value of a timestamp column in the data itself. Compare against an expected freshness SLA. If the data is older than the SLA, fire an alert. Most production systems automate this with a scheduled query or a dedicated monitoring tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I detect stale data in Databricks?
&lt;/h3&gt;

&lt;p&gt;Databricks exposes table modification times through Unity Catalog and Delta Lake transaction logs. Query &lt;code&gt;information_schema.tables&lt;/code&gt; or the Delta history to find the last write. Compare against a freshness SLA. You can also use Delta Live Tables expectations to fail pipelines when source data is too old.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why aren't Airflow (or other orchestration) success alerts enough for freshness monitoring?
&lt;/h3&gt;

&lt;p&gt;Orchestration alerts tell you whether jobs ran successfully. They don't tell you whether the data that arrived is fresh. A job can succeed while producing stale data if the upstream source is delayed, if the extract reads from a cached snapshot, or if the job runs but processes zero new rows. Freshness monitoring checks the data itself, not the job status.&lt;/p&gt;

&lt;h3&gt;
  
  
  What causes data freshness failures?
&lt;/h3&gt;

&lt;p&gt;The most common causes are: upstream source system outages, failed or delayed scheduled jobs, permission changes that block reads, rate-limited APIs, exhausted compute resources, schema changes that break extract logic, and timezone bugs that misalign expected run times with actual run times.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should I check data freshness?
&lt;/h3&gt;

&lt;p&gt;The check frequency should match the SLA. If a table updates hourly, check every 5-15 minutes. If it updates daily, check hourly. Checking too frequently wastes compute. Checking too infrequently means you find failures hours after they happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I monitor data freshness without a dedicated tool?
&lt;/h3&gt;

&lt;p&gt;Yes, for a small number of tables. Write a scheduled check in Airflow or dbt that queries the max timestamp column and fails when it is older than the SLA. This breaks down at 50+ tables because you need alert routing, incident history, SLA tracking, pattern detection (is this table always late on Mondays?), and snoozing. Dedicated tools handle all of that.&lt;/p&gt;
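&lt;p&gt;dbt has this built in via source freshness checks, run with &lt;code&gt;dbt source freshness&lt;/code&gt;. A minimal config sketch (the source, schema, column, and table names are illustrative):&lt;/p&gt;

```yaml
# models/sources.yml
sources:
  - name: raw
    database: analytics
    schema: raw
    loaded_at_field: _loaded_at   # timestamp column used to measure staleness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 26, period: hour}
    tables:
      - name: orders              # inherits the source-level freshness config
```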

&lt;h3&gt;
  
  
  What tools can monitor data freshness?
&lt;/h3&gt;

&lt;p&gt;Popular options include AnomalyArmor, Monte Carlo, Metaplane, Soda, Datafold, and dbt tests (with custom freshness macros). Open-source options include Great Expectations and re_data. Custom scripts work for small teams but don't scale past a handful of tables.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to set up freshness monitoring in minutes? &lt;a href="https://app.anomalyarmor.ai/signup" rel="noopener noreferrer"&gt;Try AnomalyArmor&lt;/a&gt; or &lt;a href="https://blog.anomalyarmor.ai/using-ai-to-set-up-schema-drift-detection/" rel="noopener noreferrer"&gt;watch the schema drift demo&lt;/a&gt; for a taste of how our AI agent configures monitoring.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datafreshness</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
