
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Observability Setup: Data Observability Guide


A practical guide to implementing data observability for Databricks-based data platforms.

By Datanest Digital


1. The Five Pillars of Data Observability

Data observability borrows from software observability but applies it specifically to data quality and reliability. The five pillars are:

1.1 Freshness

Question: Is the data up to date?

  • Track the last modification timestamp of every table
  • Define SLAs per table/domain (e.g., "orders must update within 2 hours")
  • Measure "time since last update" continuously
  • Alert when SLAs are breached
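On Databricks, a Delta table's last modification time is available from `DESCRIBE DETAIL` (the `lastModified` column). The SLA check itself can be sketched in pure Python; the `FRESHNESS_SLAS` registry below is a hypothetical stand-in for your per-table configuration:

```python
from datetime import datetime, timezone

# Hypothetical SLA registry: maximum allowed hours between updates, per table.
FRESHNESS_SLAS = {"orders": 2, "customers": 6, "product_catalog": 24}

def freshness_breach(table: str, last_updated: datetime, now: datetime) -> bool:
    """Return True when the table's age exceeds its freshness SLA."""
    age_hours = (now - last_updated).total_seconds() / 3600
    return age_hours > FRESHNESS_SLAS[table]
```

Run this on a schedule (not only after pipeline runs) so a pipeline that silently stops running still trips the alert.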

1.2 Volume

Question: Is the expected amount of data arriving?

  • Record row counts per pipeline run
  • Track bytes written to each table
  • Compare against historical baselines (moving average, percentiles)
  • Flag zero-row writes, dramatic drops, or unexplained spikes
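The baseline comparison can be sketched as a z-score against recent run history. The window size and threshold below are assumptions to tune per table:

```python
from statistics import mean, stdev

def volume_anomaly(history, current, min_history=7, z_threshold=3.0):
    """Flag zero-row writes and counts far outside the recent baseline.

    history: row counts from recent runs; current: this run's row count.
    Returns an anomaly label, or None if the count looks normal.
    """
    if current == 0:
        return "zero_rows"
    if len(history) < min_history:
        return None  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation at all is suspicious.
        return "spike_or_drop" if current != mu else None
    z = (current - mu) / sigma
    if z <= -z_threshold:
        return "drop"
    if z >= z_threshold:
        return "spike"
    return None
```

A percentile-based baseline is more robust when run volumes are seasonal (e.g. weekday vs. weekend); the z-score version is simply the shortest illustration.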

1.3 Schema

Question: Has the structure of the data changed?

  • Monitor for added/dropped/renamed columns
  • Detect type changes (string → int, nullable → required)
  • Version schemas and track drift over time
  • Use schema evolution policies in Delta Lake
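A drift check reduces to diffing two column-to-type mappings (in Spark you would build these from `df.schema`). A minimal sketch:

```python
def schema_diff(old: dict, new: dict) -> dict:
    """Compare column->type mappings between two schema snapshots.

    Note: a renamed column surfaces as one 'added' plus one 'dropped'
    entry; pairing them back up is a heuristic left to the caller.
    """
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "dropped": dropped, "type_changed": changed}
```

Persist a snapshot per pipeline run and diff consecutive snapshots to get drift history for free.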

1.4 Distribution

Question: Are the values within expected ranges?

  • Profile key columns: null rates, distinct counts, min/max values
  • Detect distribution shifts (mean, variance, percentile changes)
  • Monitor categorical distributions for unexpected new values
  • Use statistical tests (KS test, chi-squared) for formal detection
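In practice you would likely reach for `scipy.stats.ks_2samp`; for illustration, a dependency-free sketch of the two-sample KS statistic (the largest gap between the two empirical CDFs):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for numeric samples a and b."""
    a, b = sorted(a), sorted(b)

    def cdf(xs, t):
        # Fraction of xs that are <= t (empirical CDF at t).
        return bisect.bisect_right(xs, t) / len(xs)

    # The maximum gap occurs at one of the observed sample points.
    return max(abs(cdf(a, t) - cdf(b, t)) for t in a + b)
```

Compare today's sample of a key column against a trailing reference window and alert when the statistic exceeds a tuned threshold.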

1.5 Lineage

Question: Where did this data come from, and what depends on it?

  • Track source → transformation → target relationships
  • Enable impact analysis ("what breaks if this source goes down?")
  • Support root cause analysis ("why is this dashboard wrong?")
  • Satisfy regulatory requirements (GDPR, SOX, HIPAA traceability)
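Once edges are recorded per pipeline run, impact analysis is a graph traversal. A sketch with a hypothetical in-memory edge map (in this toolkit the edges would come from your lineage table):

```python
from collections import deque

# Hypothetical lineage edges: source -> direct downstream targets.
LINEAGE = {
    "raw.orders": ["silver.orders"],
    "silver.orders": ["gold.daily_revenue", "gold.customer_ltv"],
    "gold.daily_revenue": ["dashboard.exec_kpis"],
}

def impact(node):
    """Breadth-first walk collecting everything downstream of a node."""
    seen, queue = set(), deque([node])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

Root cause analysis is the same walk with the edges reversed: start from the broken dashboard and traverse upstream.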

2. SLIs and SLOs for Data

2.1 Service Level Indicators (SLIs)

SLIs are the metrics you measure:

| SLI | Description | Example |
| --- | --- | --- |
| Freshness | Hours since last update | 1.5 hours |
| Completeness | % of expected rows received | 99.2% |
| Validity | % of rows passing quality checks | 98.7% |
| Timeliness | Pipeline duration | 12 minutes |
| Accuracy | % of records matching source of truth | 99.9% |

2.2 Service Level Objectives (SLOs)

SLOs are the targets you commit to:

| Table | Freshness SLO | Completeness SLO | Validity SLO |
| --- | --- | --- | --- |
| orders (Tier 1) | < 2 hours | > 99% | > 99% |
| customers (Tier 1) | < 6 hours | > 99% | > 98% |
| product_catalog (Tier 2) | < 24 hours | > 95% | > 95% |
| web_analytics (Tier 3) | < 48 hours | > 90% | > 90% |
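Evaluating measured SLIs against these targets is a straightforward lookup. A sketch with a hypothetical `SLOS` registry holding the Tier 1 rows above:

```python
# Hypothetical SLO registry mirroring the Tier 1 rows of the table above.
SLOS = {
    "orders": {"freshness_hours": 2, "completeness_pct": 99, "validity_pct": 99},
    "customers": {"freshness_hours": 6, "completeness_pct": 99, "validity_pct": 98},
}

def slo_violations(table, freshness_hours, completeness_pct, validity_pct):
    """Return which SLOs the measured SLIs currently breach."""
    slo = SLOS[table]
    violations = []
    if freshness_hours > slo["freshness_hours"]:
        violations.append("freshness")
    if completeness_pct < slo["completeness_pct"]:
        violations.append("completeness")
    if validity_pct < slo["validity_pct"]:
        violations.append("validity")
    return violations
```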

2.3 Error Budgets

  • If your SLO is 99% completeness, you have a 1% error budget per month
  • Track error budget consumption over rolling windows
  • When the budget is exhausted, freeze deployments and fix reliability
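The budget arithmetic, assuming completeness is measured in rows over a monthly window:

```python
def error_budget_remaining(slo_pct, total_rows, bad_rows):
    """Fraction of the monthly error budget still unspent (0.0 = exhausted).

    With a 99% completeness SLO over 100,000 expected rows, the budget is
    1,000 missing-or-bad rows for the month.
    """
    budget_rows = (100 - slo_pct) / 100 * total_rows
    if budget_rows == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return max(0.0, 1 - bad_rows / budget_rows)
```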

3. Incident Response for Data Issues

3.1 Severity Classification

| Severity | Criteria | Response Time | Example |
| --- | --- | --- | --- |
| SEV-1 | Revenue-impacting, customer-facing data wrong | 15 min | Billing data missing |
| SEV-2 | Tier 1 SLA breach, internal dashboards wrong | 1 hour | Orders table stale |
| SEV-3 | Tier 2 SLA breach, degraded analytics | 4 hours | Product catalog delayed |
| SEV-4 | Tier 3 anomaly, no immediate business impact | Next business day | Web analytics spike |
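Classification can be table-driven so alert routing never depends on a human judgment call at 3 a.m. A sketch assuming severity follows table tier and blast radius, which is an assumption; your criteria may differ:

```python
def classify_severity(tier: int, customer_facing: bool) -> int:
    """Map table tier (1-3) and blast radius to a SEV level."""
    if customer_facing:
        return 1  # anything customer-facing is SEV-1 regardless of tier
    return {1: 2, 2: 3, 3: 4}[tier]

# Routing table mirroring the escalation policy: who gets paged, and how fast.
ROUTES = {
    1: {"channel": "pagerduty", "ack_minutes": 15},
    2: {"channel": "pagerduty", "ack_minutes": 60},
    3: {"channel": "slack:#data-incidents", "ack_minutes": 240},
    4: {"channel": "slack:#data-incidents", "ack_minutes": None},  # next business day
}
```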

3.2 Incident Workflow

  1. Detect: Automated monitoring flags the issue
  2. Alert: Route to the correct on-call team based on severity
  3. Triage: Identify scope (which tables/pipelines are affected)
  4. Investigate: Use lineage to trace root cause
  5. Mitigate: Apply a fix or workaround
  6. Communicate: Update stakeholders via status page
  7. Resolve: Confirm data is correct and SLAs are restored
  8. Post-mortem: Document root cause, timeline, and prevention steps

3.3 Runbook Template

```
Title: [Pipeline Name] — [Issue Type]
Triggered by: [Alert rule name]
Severity: SEV-[1-4]

Symptoms:
  - [What the alert says]

Investigation Steps:
  1. Check pipeline logs: [link]
  2. Query lineage for upstream status
  3. Verify source system availability
  4. Check for schema changes

Mitigation:
  - Option A: Re-run pipeline
  - Option B: Revert to last known good state
  - Option C: Disable downstream consumers

Escalation:
  - SEV-1/2: Page on-call data engineer
  - SEV-3/4: Slack #data-incidents channel
```

4. Tool Selection

4.1 Build vs. Buy

| Approach | Pros | Cons |
| --- | --- | --- |
| Build (this toolkit) | Full control, no vendor lock-in, cost-effective | Maintenance burden, limited UI |
| Monte Carlo / Bigeye | Polished UI, ML-powered anomaly detection | Expensive, vendor lock-in |
| Great Expectations | Open source, extensive checks | No built-in alerting, batch only |
| dbt tests | Integrated with dbt workflows | Limited to SQL-based checks |

4.2 When This Toolkit is Right

  • You are on Databricks and want native Delta integration
  • You need lineage tracking alongside metric collection
  • You want full control over detection algorithms
  • Your budget favours a one-time purchase over SaaS subscriptions

5. Implementation Checklist

  • [ ] Run notebooks/setup_observability.py to create tables
  • [ ] Configure SLA definitions in configs/observability_config.yaml
  • [ ] Add lineage tracking calls to your existing pipelines
  • [ ] Add metric collection (row counts, durations) to pipeline code
  • [ ] Configure alert channels (Slack, PagerDuty, email)
  • [ ] Define alert rules in configs/alert_rules.yaml
  • [ ] Schedule anomaly detection to run after each pipeline
  • [ ] Set up the observability dashboard notebook
  • [ ] Write runbooks for your top 5 most critical pipelines
  • [ ] Conduct a tabletop exercise with your team using a simulated incident

By Datanest Digital | Data Observability Setup v1.0.0


This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete [Data Observability Setup] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

