Data Observability Guide
A practical guide to implementing data observability for Databricks-based data platforms.
1. The Five Pillars of Data Observability
Data observability borrows from software observability but applies it specifically to data quality and reliability. The five pillars are:
1.1 Freshness
Question: Is the data up to date?
- Track the last modification timestamp of every table
- Define SLAs per table/domain (e.g., "orders must update within 2 hours")
- Measure "time since last update" continuously
- Alert when SLAs are breached
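The checks above can be sketched in a few lines of plain Python. On Databricks you would typically read the real timestamp from Delta metadata (e.g. the `lastModified` field of `DESCRIBE DETAIL`); here the SLA registry and the `check_freshness` helper are illustrative names, not part of the toolkit's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA registry: maximum allowed staleness per table.
FRESHNESS_SLAS = {
    "orders": timedelta(hours=2),
    "customers": timedelta(hours=6),
}

def check_freshness(table, last_modified, now=None):
    """Return the staleness of a table and whether its freshness SLA is breached."""
    now = now or datetime.now(timezone.utc)
    staleness = now - last_modified
    sla = FRESHNESS_SLAS[table]
    return {
        "table": table,
        "staleness_hours": staleness.total_seconds() / 3600,
        "sla_hours": sla.total_seconds() / 3600,
        "breached": staleness > sla,
    }
```

Running this on a schedule (and emitting the result to a metrics table) gives you the continuous "time since last update" measurement the pillar calls for.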
1.2 Volume
Question: Is the expected amount of data arriving?
- Record row counts per pipeline run
- Track bytes written to each table
- Compare against historical baselines (moving average, percentiles)
- Flag zero-row writes, dramatic drops, or unexplained spikes
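A minimal baseline comparison might look like the sketch below, using a moving average of recent per-run row counts. The function name and the drop/spike thresholds are assumptions; in practice you would tune the ratios per table or use percentile bands instead.

```python
from statistics import mean

def volume_anomaly(history, latest, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the latest row count against a moving-average baseline.

    history: recent per-run row counts for the same pipeline.
    Returns "zero_rows", "drop", "spike", or "ok".
    """
    if latest == 0:
        return "zero_rows"
    baseline = mean(history)
    if latest < baseline * drop_ratio:
        return "drop"
    if latest > baseline * spike_ratio:
        return "spike"
    return "ok"
```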
1.3 Schema
Question: Has the structure of the data changed?
- Monitor for added/dropped/renamed columns
- Detect type changes (string → int, nullable → required)
- Version schemas and track drift over time
- Use schema evolution policies in Delta Lake
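Detecting added, dropped, and type-changed columns reduces to diffing two snapshots of the schema. A sketch, assuming schemas are captured as `{column: type}` mappings (on Databricks you could build these from `DataFrame.schema`):

```python
def schema_diff(old, new):
    """Diff two {column: type} mappings and report schema drift."""
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    type_changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {
        "added": added,
        "dropped": dropped,
        "type_changed": type_changed,
        "drift": bool(added or dropped or type_changed),
    }
```

Storing each run's schema snapshot lets you version schemas and replay drift over time, as the pillar suggests.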
1.4 Distribution
Question: Are the values within expected ranges?
- Profile key columns: null rates, distinct counts, min/max values
- Detect distribution shifts (mean, variance, percentile changes)
- Monitor categorical distributions for unexpected new values
- Use statistical tests (KS test, chi-squared) for formal detection
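The profiling step can be illustrated without any statistics library: the sketch below computes null rate, distinct count, and min/max for a column, and flags categorical values not seen before. The formal KS or chi-squared tests mentioned above would sit on top of profiles like this (e.g. via `scipy.stats`); the function name here is our own.

```python
def profile_column(values, expected_categories=None):
    """Profile one column: null rate, distinct count, min/max,
    plus any categorical values outside the expected set."""
    non_null = [v for v in values if v is not None]
    profile = {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
    if expected_categories is not None:
        profile["new_values"] = sorted(set(non_null) - set(expected_categories))
    return profile
```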
1.5 Lineage
Question: Where did this data come from, and what depends on it?
- Track source → transformation → target relationships
- Enable impact analysis ("what breaks if this source goes down?")
- Support root cause analysis ("why is this dashboard wrong?")
- Satisfy regulatory requirements (GDPR, SOX, HIPAA traceability)
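Impact analysis over lineage is a graph traversal: given source → target edges, walk downstream from a failing asset to find everything it can break. A self-contained sketch (edge list and function name are illustrative):

```python
from collections import defaultdict, deque

def downstream_impact(edges, source):
    """Given (upstream, downstream) lineage edges, return every asset
    that transitively depends on `source` (breadth-first traversal)."""
    graph = defaultdict(list)
    for up, down in edges:
        graph[up].append(down)
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Reversing the edge direction gives the root-cause view ("which upstream sources feed this broken dashboard?").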
2. SLIs and SLOs for Data
2.1 Service Level Indicators (SLIs)
SLIs are the metrics you measure:
| SLI | Description | Example |
|---|---|---|
| Freshness | Hours since last update | 1.5 hours |
| Completeness | % of expected rows received | 99.2% |
| Validity | % of rows passing quality checks | 98.7% |
| Timeliness | Pipeline duration | 12 minutes |
| Accuracy | % of records matching source of truth | 99.9% |
2.2 Service Level Objectives (SLOs)
SLOs are the targets you commit to:
| Table | Freshness SLO | Completeness SLO | Validity SLO |
|---|---|---|---|
| orders (Tier 1) | < 2 hours | > 99% | > 99% |
| customers (Tier 1) | < 6 hours | > 99% | > 98% |
| product_catalog (Tier 2) | < 24 hours | > 95% | > 95% |
| web_analytics (Tier 3) | < 48 hours | > 90% | > 90% |
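Evaluating measured SLIs against these targets is a straightforward comparison. A sketch, assuming an SLO registry mirroring the table above (the registry and function are illustrative, not the toolkit's API); note the SLOs are strict bounds ("< 2 hours", "> 99%"), so equality counts as a breach:

```python
# Hypothetical SLO registry mirroring the table above.
SLOS = {
    "orders": {"freshness_hours": 2, "completeness_pct": 99, "validity_pct": 99},
    "product_catalog": {"freshness_hours": 24, "completeness_pct": 95, "validity_pct": 95},
}

def evaluate_slos(table, freshness_hours, completeness_pct, validity_pct):
    """Return the list of SLOs the table currently violates."""
    slo = SLOS[table]
    breaches = []
    if freshness_hours >= slo["freshness_hours"]:
        breaches.append("freshness")
    if completeness_pct <= slo["completeness_pct"]:
        breaches.append("completeness")
    if validity_pct <= slo["validity_pct"]:
        breaches.append("validity")
    return breaches
```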
2.3 Error Budgets
- If your SLO is 99% completeness, you have a 1% error budget per month
- Track error budget consumption over rolling windows
- When the budget is exhausted, freeze deployments and fix reliability
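Budget tracking is simple arithmetic: a 99% SLO over one million expected rows allows 10,000 failures per month. A sketch of the remaining-budget calculation (function name is ours):

```python
def error_budget_remaining(slo_pct, total_rows, failed_rows):
    """Fraction of the error budget still unspent for the window.

    A 99% completeness SLO means up to (100 - 99)% of `total_rows`
    may be missing or invalid before the budget is exhausted.
    """
    budget = total_rows * (100 - slo_pct) / 100
    return max(0.0, 1 - failed_rows / budget)
```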
3. Incident Response for Data Issues
3.1 Severity Classification
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| SEV-1 | Revenue-impacting, customer-facing data wrong | 15 min | Billing data missing |
| SEV-2 | Tier 1 SLA breach, internal dashboards wrong | 1 hour | Orders table stale |
| SEV-3 | Tier 2 SLA breach, degraded analytics | 4 hours | Product catalog delayed |
| SEV-4 | Tier 3 anomaly, no immediate business impact | Next business day | Web analytics spike |
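Alert routing benefits from encoding this table rather than deciding severity ad hoc. A minimal sketch mapping table tier and revenue impact to a severity level (the function and its arguments are assumptions, not a toolkit API):

```python
def classify_severity(tier, revenue_impacting=False):
    """Map an SLA breach to a severity level per the classification table:
    revenue-impacting issues are always SEV-1; otherwise severity
    follows the table's tier."""
    if revenue_impacting:
        return "SEV-1"
    return {1: "SEV-2", 2: "SEV-3", 3: "SEV-4"}[tier]
```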
3.2 Incident Workflow
- Detect: Automated monitoring flags the issue
- Alert: Route to the correct on-call team based on severity
- Triage: Identify scope (which tables/pipelines are affected)
- Investigate: Use lineage to trace root cause
- Mitigate: Apply a fix or workaround
- Communicate: Update stakeholders via status page
- Resolve: Confirm data is correct and SLAs are restored
- Post-mortem: Document root cause, timeline, and prevention steps
3.3 Runbook Template
Title: [Pipeline Name] — [Issue Type]
Triggered by: [Alert rule name]
Severity: SEV-[1-4]
Symptoms:
- [What the alert says]
Investigation Steps:
1. Check pipeline logs: [link]
2. Query lineage for upstream status
3. Verify source system availability
4. Check for schema changes
Mitigation:
- Option A: Re-run pipeline
- Option B: Revert to last known good state
- Option C: Disable downstream consumers
Escalation:
- SEV-1/2: Page on-call data engineer
- SEV-3/4: Slack #data-incidents channel
4. Tool Selection
4.1 Build vs. Buy
| Approach | Pros | Cons |
|---|---|---|
| Build (this toolkit) | Full control, no vendor lock-in, cost-effective | Maintenance burden, limited UI |
| Monte Carlo / Bigeye | Polished UI, ML-powered anomaly detection | Expensive, vendor lock-in |
| Great Expectations | Open source, extensive checks | No built-in alerting, batch only |
| dbt tests | Integrated with dbt workflows | Limited to SQL-based checks |
4.2 When This Toolkit is Right
- You are on Databricks and want native Delta integration
- You need lineage tracking alongside metric collection
- You want full control over detection algorithms
- Your budget favours a one-time purchase over SaaS subscriptions
5. Implementation Checklist
- [ ] Run `notebooks/setup_observability.py` to create tables
- [ ] Configure SLA definitions in `configs/observability_config.yaml`
- [ ] Add lineage tracking calls to your existing pipelines
- [ ] Add metric collection (row counts, durations) to pipeline code
- [ ] Configure alert channels (Slack, PagerDuty, email)
- [ ] Define alert rules in `configs/alert_rules.yaml`
- [ ] Schedule anomaly detection to run after each pipeline
- [ ] Set up the observability dashboard notebook
- [ ] Write runbooks for your top 5 most critical pipelines
- [ ] Conduct a tabletop exercise with your team using a simulated incident
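To make the configuration steps concrete, here is an illustrative YAML sketch of what an SLA/alert configuration could contain. Every key below is hypothetical — the actual schema is defined by the toolkit's `configs/observability_config.yaml` and `configs/alert_rules.yaml` files.

```yaml
# Illustrative sketch only — not the toolkit's real schema.
tables:
  orders:
    tier: 1
    freshness_sla_hours: 2
    completeness_slo_pct: 99
    validity_slo_pct: 99
  web_analytics:
    tier: 3
    freshness_sla_hours: 48
alert_channels:
  sev1: pagerduty
  sev2: pagerduty
  sev3: "#data-incidents"
```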
By Datanest Digital | Data Observability Setup v1.0.0