DEV Community

Blaine Elliott

Posted on • Originally published at blog.anomalyarmor.ai

The 6 Dimensions of Data Quality: Definitions, Examples, and How to Monitor Each

The six dimensions of data quality are accuracy, completeness, consistency, timeliness, validity, and uniqueness. Each dimension measures a different aspect of whether data is fit for its intended use. Together they define whether a dataset can be trusted for analytics, machine learning, or customer-facing applications.

This guide defines each dimension with practical examples, SQL detection patterns, and monitoring strategies for production data pipelines.

What are the dimensions of data quality?

Data quality dimensions are measurable attributes that describe different ways data can be wrong. The widely accepted framework includes six core dimensions:

| # | Dimension    | Question it answers                                    |
|---|--------------|--------------------------------------------------------|
| 1 | Accuracy     | Does the data reflect real-world truth?                |
| 2 | Completeness | Is any expected data missing?                          |
| 3 | Consistency  | Does the same fact match across systems?               |
| 4 | Timeliness   | Is the data current enough to be useful?               |
| 5 | Validity     | Does the data conform to expected formats and rules?   |
| 6 | Uniqueness   | Are there duplicate records where there shouldn't be?  |

These six dimensions come from the DAMA International Data Management Body of Knowledge (DMBOK) and are used by organizations including the UK Government Data Quality Hub, Monte Carlo, Collibra, and Informatica. Different sources sometimes add dimensions like integrity or conformity, but the core six cover the vast majority of data quality failures.

Why do data quality dimensions matter?

Without a framework, data teams describe quality problems anecdotally: "the data looks off," "something's wrong with customer IDs," "the numbers don't match the dashboard." These complaints are hard to prioritize and harder to fix systematically.

The six dimensions convert vague complaints into measurable categories. A data team that says "we have a completeness problem on 3% of rows and a timeliness problem on 2 tables" can write monitoring rules, assign owners, and track improvement over time. A team that just says "data quality is bad" cannot.

1. Accuracy

Definition: Accuracy measures how closely data reflects the real-world entity or event it describes.

A customer's street address stored as "123 Mai Street" when it should be "123 Main Street" is inaccurate. A transaction recorded as $100 when the actual amount was $1000 is inaccurate. A birth date of 1900-01-01 for a 30-year-old customer is inaccurate.

Accuracy is the hardest dimension to verify automatically because it requires comparing data to an authoritative external truth. Most teams verify accuracy through:

  • Cross-reference with source systems: Compare warehouse data against the upstream OLTP database
  • Sampling and manual review: Audit a random subset against original documents
  • Reference data checks: Compare against a trusted master data source (e.g., a zip code database)
  • Statistical sanity checks: Flag values that are impossibly high or low

-- Detect impossibly old ages (accuracy check)
SELECT customer_id, birth_date, DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) AS age
FROM customers
WHERE DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) > 120
   OR DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) < 0;

2. Completeness

Definition: Completeness measures whether all expected data is present. It covers both row-level completeness (no missing rows) and column-level completeness (no missing values in required fields).

A daily sales table that should contain one row per store per day but is missing rows for three stores has a row-level completeness problem. A customers table with email IS NULL for 15% of records has a column-level completeness problem.

Completeness checks are straightforward to automate:

-- Column-level completeness: null rate for required fields
SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS rows_with_email,
  COUNT(*) - COUNT(email) AS null_emails,
  ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2) AS null_rate_pct
FROM customers;

-- Row-level completeness: missing expected records
SELECT e.store_id, e.sale_date
FROM expected_stores_and_dates AS e
LEFT JOIN daily_sales AS d
  ON e.store_id = d.store_id AND e.sale_date = d.sale_date
WHERE d.store_id IS NULL;

The hard part isn't writing the query. It's deciding what "expected" means. You need a ground truth for what should exist, which usually comes from a reference table, a calendar, or a contract with the upstream source.
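To make "expected" concrete, here is a minimal Python sketch of the same idea (store IDs and dates are hypothetical): materialize the store-by-date grid from a reference list and a calendar, then diff it against the rows that actually arrived.

```python
from datetime import date, timedelta
from itertools import product

def missing_rows(stores, start, end, actual_rows):
    """Return (store_id, sale_date) pairs that were expected but never arrived."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    expected = set(product(stores, days))       # every store, every day
    return sorted(expected - set(actual_rows))  # expected rows absent from actuals

# Example: store "s2" is missing its 2024-01-02 row
actual = {("s1", date(2024, 1, 1)), ("s1", date(2024, 1, 2)),
          ("s2", date(2024, 1, 1))}
gaps = missing_rows(["s1", "s2"], date(2024, 1, 1), date(2024, 1, 2), actual)
print(gaps)  # [('s2', datetime.date(2024, 1, 2))]
```

The expensive part in production is maintaining the reference inputs (the store list and the calendar), not the set difference itself.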

3. Consistency

Definition: Consistency measures whether the same fact matches across different systems, tables, or timestamps.

If the customer table shows 10,000 active users and the billing table shows 9,850 active users, there's a consistency problem. If a transaction amount appears as $100 in one system and $100.00 in another, that's usually formatting, not a consistency failure. But if the same transaction appears as $100 in one system and $1000 in another, that's a critical consistency failure.

Consistency checks compare aggregate or row-level values across data sources:

-- Cross-system consistency: customer count reconciliation
WITH crm_count AS (
  SELECT COUNT(*) AS n FROM crm_customers WHERE status = 'active'
),
warehouse_count AS (
  SELECT COUNT(*) AS n FROM dim_customers WHERE is_active = TRUE
)
SELECT
  crm_count.n AS crm_active_customers,
  warehouse_count.n AS warehouse_active_customers,
  ABS(crm_count.n - warehouse_count.n) AS delta
FROM crm_count, warehouse_count;

Consistency problems often stem from timing: one system was updated, the other hasn't synced yet. The monitoring question is whether the gap is within an acceptable SLA or has exceeded it.
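That SLA question can be sketched in Python (the counts and tolerance are illustrative): alert only when cross-system drift exceeds an agreed threshold, not on every nonzero delta.

```python
def reconcile(crm_active, warehouse_active, tolerance_pct=1.0):
    """Compare two systems' active-customer counts against an SLA tolerance."""
    delta = abs(crm_active - warehouse_active)
    baseline = max(crm_active, warehouse_active, 1)  # avoid divide-by-zero
    drift_pct = 100.0 * delta / baseline
    return {"delta": delta, "drift_pct": round(drift_pct, 2),
            "breach": drift_pct > tolerance_pct}

print(reconcile(10_000, 9_850))  # 150 apart = 1.5% drift -> breach
print(reconcile(10_000, 9_950))  # 0.5% drift, within tolerance
```

Tuning `tolerance_pct` per table pair is what separates a useful reconciliation alert from a noisy one.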

4. Timeliness

Definition: Timeliness measures whether data is fresh enough to be useful. A timely dataset is updated on its expected schedule and is current relative to the real-world events it describes.

A dashboard showing "sales last hour" that's actually showing data from 6 hours ago has a timeliness problem. A machine learning model trained on data that's 3 months stale may produce incorrect predictions. A fraud detection system running on yesterday's transactions is useless.

Timeliness is measured in two ways:

  • Freshness lag: How long since the last update? (CURRENT_TIMESTAMP - MAX(inserted_at))
  • Schedule adherence: Did the expected update happen on time?

-- Freshness: hours since last row was added
SELECT
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(inserted_at), HOUR) AS hours_since_last_insert,
  MAX(inserted_at) AS most_recent_row
FROM orders
HAVING hours_since_last_insert > 2;  -- alert if stale beyond SLA

Timeliness is the easiest dimension to monitor at scale because it only requires a single max-timestamp query per table. This is why freshness monitoring is typically the first data quality check teams implement.
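As a hedged Python sketch of the same freshness check (the SLA and timestamps are illustrative), the alert condition reduces to a lag comparison:

```python
from datetime import datetime, timedelta, timezone

def freshness_breach(last_inserted_at, sla_hours=2, now=None):
    """True when the newest row is older than the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return (now - last_inserted_at) > timedelta(hours=sla_hours)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_breach(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), now=now))   # True: 3h stale
print(freshness_breach(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), now=now))  # False: 1h
```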

5. Validity

Definition: Validity measures whether data conforms to defined formats, types, ranges, and business rules.

An email field containing "not-an-email" is invalid. A phone number field with "call my cell" is invalid. A country field with "Martian Empire" is invalid. A percentage field with 150 is invalid. A timestamp in the year 9999 is invalid.

Validity is the most rule-heavy dimension. It requires explicit definitions of what "valid" means for each field:

-- Validity: email format check
SELECT customer_id, email
FROM customers
WHERE email IS NOT NULL
  AND NOT REGEXP_CONTAINS(email, r'^[^@\s]+@[^@\s]+\.[^@\s]+$');

-- Validity: range check
SELECT order_id, discount_pct
FROM orders
WHERE discount_pct < 0 OR discount_pct > 100;

-- Validity: enum check
SELECT order_id, status
FROM orders
WHERE status NOT IN ('pending', 'paid', 'shipped', 'delivered', 'refunded');

Modern data quality tools automate validity checks by profiling historical data to learn expected formats, then flagging new records that deviate.
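A minimal sketch of that profiling idea in Python (the threshold and status values are hypothetical): learn the historically common value set for a field, then flag new records that fall outside it.

```python
from collections import Counter

def learn_enum(history, min_share=0.01):
    """Treat values covering at least min_share of history as the valid set."""
    counts = Counter(history)
    total = sum(counts.values())
    return {v for v, n in counts.items() if n / total >= min_share}

def invalid_values(new_rows, valid_set):
    """Return new values that never (or almost never) appeared historically."""
    return [v for v in new_rows if v not in valid_set]

history = ["paid"] * 60 + ["pending"] * 30 + ["shipped"] * 10
valid = learn_enum(history)
print(invalid_values(["paid", "PAID?", "pending"], valid))  # ['PAID?']
```

The `min_share` cutoff is the judgment call: too low and garbage values get baked into the baseline, too high and legitimate rare values get flagged.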

6. Uniqueness

Definition: Uniqueness measures whether records that should be unique are unique. It covers both primary key uniqueness and business-level deduplication.

A customers table should have exactly one row per customer. A transactions table should have exactly one row per transaction. When the same customer appears twice with slightly different spellings, or the same transaction appears twice because of a retry bug, you have a uniqueness failure.

Uniqueness checks are simple to write:

-- Primary key uniqueness
SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Business-level uniqueness (same email, different IDs = probable duplicate)
SELECT LOWER(TRIM(email)) AS normalized_email, COUNT(*) AS dup_count,
       ARRAY_AGG(customer_id) AS customer_ids
FROM customers
WHERE email IS NOT NULL
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;

The hard part is defining the business rule for uniqueness. Primary keys are enforced by the database. Business-level deduplication (same person, different spellings) requires fuzzy matching, normalization, or entity resolution algorithms.
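For illustration only (production entity resolution uses dedicated libraries and more robust blocking), a normalize-then-fuzzy-match pass can be sketched with the standard library's difflib:

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())

def probable_duplicates(names, threshold=0.85):
    """Return pairs whose normalized similarity meets the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = normalize(names[i]), normalize(names[j])
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((names[i], names[j]))
    return pairs

print(probable_duplicates(["Jane  Doe", "jane doe", "John Smith"]))
# [('Jane  Doe', 'jane doe')]
```

The pairwise loop is O(n²), so real pipelines block candidates first (e.g. by zip code or email domain) before scoring similarity.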

How do these dimensions relate to each other?

The six dimensions overlap and interact. A single data quality failure often affects multiple dimensions at once:

  • Duplicate records violate uniqueness, but also affect accuracy (counts are wrong) and sometimes completeness (aggregates miss data)
  • Schema drift violates validity (new values don't match expected format), often triggers completeness failures (previously required columns become null), and degrades accuracy (wrong values flow through)
  • Pipeline delays violate timeliness, but also create consistency problems between source and destination systems

Good monitoring tracks all six dimensions because a problem in one often predicts problems in others. A sudden spike in uniqueness failures for customer IDs is often an upstream completeness problem (nulls being converted to a default value).

How do you measure data quality across all six dimensions?

The standard approach is to calculate a quality score per table per dimension, then aggregate:

  1. Per-dimension score: For each table and each dimension, compute pass/fail against defined rules
  2. Rollup to table score: Average the six dimension scores (or weight by business importance)
  3. Rollup to dataset score: Average across all tables in a dataset
  4. Track over time: Plot the score daily to catch degradation trends
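The rollup in steps 1-3 can be sketched as a weighted average (the scores and weights here are illustrative):

```python
def table_score(dimension_scores, weights=None):
    """Weighted average of per-dimension pass rates (each 0.0-1.0)."""
    weights = weights or {d: 1.0 for d in dimension_scores}
    total_w = sum(weights[d] for d in dimension_scores)
    return sum(dimension_scores[d] * weights[d] for d in dimension_scores) / total_w

scores = {"accuracy": 0.99, "completeness": 0.97, "consistency": 1.0,
          "timeliness": 0.90, "validity": 0.95, "uniqueness": 1.0}
print(round(table_score(scores), 4))  # 0.9683 (unweighted mean)
# Weighting timeliness 3x pulls the score down, reflecting business priority
print(round(table_score(scores, {**{d: 1.0 for d in scores}, "timeliness": 3.0}), 4))
```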

For production data pipelines, modern data observability tools automate this by:

  1. Profiling historical data to learn baselines (typical null rates, value distributions, update frequencies)
  2. Detecting anomalies in new data against those baselines
  3. Tagging each anomaly by the dimension it violates
  4. Rolling up to dashboards that show quality over time per table and per dimension

The key insight is that you cannot manually write rules for every edge case across 500 tables. You need statistical baselines that learn from the data itself, with explicit rules for the invariants that matter most to the business.
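A statistical baseline can be as simple as a z-score against recent history. This Python sketch (the threshold and null rates are illustrative) flags a metric that drifts more than three sigmas from its 7-day baseline:

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's metric when it sits more than z_threshold sigmas from baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # flat history: any change is anomalous
    return abs(today - mu) / sigma > z_threshold

null_rates = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1]  # last 7 days, in percent
print(is_anomalous(null_rates, 2.1))   # False: within normal variation
print(is_anomalous(null_rates, 15.0))  # True: spike well outside baseline
```

Real observability tools layer seasonality and trend handling on top of this, but the core idea is the same: the data's own history defines "normal."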

Data Quality Dimensions FAQ

What are the 6 dimensions of data quality?

The six dimensions of data quality are accuracy, completeness, consistency, timeliness, validity, and uniqueness. Accuracy measures truth against reality, completeness measures missing data, consistency measures cross-system agreement, timeliness measures freshness, validity measures conformance to rules, and uniqueness measures duplicate records.

Are there more than 6 dimensions of data quality?

Yes. Some frameworks add dimensions like integrity (referential relationships), conformity (adherence to standards), reasonableness (within expected bounds), or auditability (traceable to source). The DAMA DMBOK lists six core dimensions that cover the most common failure modes, which is why the "six dimensions" framework is the most widely cited.

Which data quality dimension is most important?

It depends on the use case. For financial reporting, accuracy and consistency matter most. For real-time dashboards, timeliness is critical. For machine learning features, completeness and validity drive model performance. Most production data teams treat timeliness and completeness as the top two because their failures are easiest to detect and most visible to downstream users.

How do you measure data quality dimensions?

Each dimension is measured by running rule-based or statistical checks and counting pass/fail rates. Accuracy is typically measured by sampling and cross-reference. Completeness is measured as null rate or row-count against expectation. Consistency is measured by reconciling aggregates across systems. Timeliness is measured as lag from expected update. Validity is measured by format and range checks. Uniqueness is measured by primary key and business-level dedup queries.

What is the difference between data quality and data integrity?

Data quality is the broader concept covering accuracy, completeness, consistency, timeliness, validity, and uniqueness. Data integrity is a narrower concept focused on referential relationships and constraint enforcement (foreign keys resolve, required fields aren't null, allowed values are enforced). Integrity is sometimes listed as a seventh dimension of quality, but most frameworks treat it as a subset of validity and completeness.

Can you have high data quality in one dimension and low in another?

Yes, and this is common. A table can have perfect uniqueness (no duplicates) but terrible timeliness (updated weekly when it should be hourly). A dataset can be perfectly complete (no missing rows) but inaccurate (values are wrong). Monitoring each dimension separately reveals these patterns. A single "data quality score" that averages all six hides the specific failure modes you need to fix.

How is data quality different from data observability?

Data quality is the outcome: whether data is fit for use. Data observability is the practice: continuously monitoring data pipelines to detect quality issues in production. You can have high data quality without observability (if nothing ever breaks), but in practice you need observability to maintain quality over time as systems evolve and upstream sources change.

What tools automate data quality dimension monitoring?

Modern data observability platforms including AnomalyArmor, Monte Carlo, Metaplane, Bigeye, and Datafold automate monitoring across all six dimensions by profiling historical baselines and flagging anomalies. Open-source tools like Great Expectations, Soda Core, and dbt tests cover rule-based validity and completeness checks but require manual rule writing. Most production teams combine both: automated baseline monitoring for the long tail plus explicit rules for business-critical invariants.

How much historical data do you need to monitor data quality dimensions?

Statistical baselines typically require 7-14 days of historical data for basic anomaly detection. Weekly seasonality needs at least 4 weeks. Yearly seasonality requires 12-18 months. For rule-based checks (validity, uniqueness, primary key enforcement), no history is needed; you can run them on any new data as it arrives.

Can you fix low data quality after the fact?

Sometimes yes, often no. Validity and uniqueness problems can often be fixed retroactively by cleaning and deduplication. Completeness problems can sometimes be fixed by re-running upstream loads. Accuracy problems usually can't be fixed without access to the original source, which may have been lost. Timeliness problems can't be fixed at all: once data is late, it's late. Prevention through monitoring is always cheaper than retroactive cleanup.


Data quality dimensions are only useful if you can measure them in production. See how AnomalyArmor automatically monitors accuracy, completeness, consistency, timeliness, validity, and uniqueness across your data pipelines.
