Data Observability Guide
A practical guide to implementing data observability for Databricks-based data platforms.
1. The Five Pillars of Data Observability
Data observability borrows from software observability but applies it specifically to data quality and reliability. The five pillars are:
1.1 Freshness
Question: Is the data up to date?
- Track the last modification timestamp of every table
- Define SLAs per table/domain (e.g., "orders must update within 2 hours")
- Measure "time since last update" continuously
- Alert when SLAs are breached
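The checks above can be sketched in a few lines of plain Python. On Databricks you would typically read the real timestamp from Delta metadata (e.g. the `lastModified` field of `DESCRIBE DETAIL`); here the SLA registry and the `check_freshness` helper are illustrative names, not part of the toolkit's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA registry: maximum allowed staleness per table.
FRESHNESS_SLAS = {
    "orders": timedelta(hours=2),
    "customers": timedelta(hours=6),
}

def check_freshness(table, last_modified, now=None):
    """Return the staleness of a table and whether its freshness SLA is breached."""
    now = now or datetime.now(timezone.utc)
    staleness = now - last_modified
    sla = FRESHNESS_SLAS[table]
    return {
        "table": table,
        "staleness_hours": staleness.total_seconds() / 3600,
        "sla_hours": sla.total_seconds() / 3600,
        "breached": staleness > sla,
    }
```

Running this on a schedule (and emitting the result to a metrics table) gives you the continuous "time since last update" measurement the pillar calls for.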
1.2 Volume
Question: Is the expected amount of data arriving?
- Record row counts per pipeline run
- Track bytes written to each table
- Compare against historical baselines (moving average, percentiles)
- Flag zero-row writes, dramatic drops, or unexplained spikes
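A minimal baseline comparison might look like the sketch below, using a moving average of recent per-run row counts. The function name and the drop/spike thresholds are assumptions; in practice you would tune the ratios per table or use percentile bands instead.

```python
from statistics import mean

def volume_anomaly(history, latest, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the latest row count against a moving-average baseline.

    history: recent per-run row counts for the same pipeline.
    Returns "zero_rows", "drop", "spike", or "ok".
    """
    if latest == 0:
        return "zero_rows"
    baseline = mean(history)
    if latest < baseline * drop_ratio:
        return "drop"
    if latest > baseline * spike_ratio:
        return "spike"
    return "ok"
```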
1.3 Schema
Question: Has the structure of the data changed?
- Monitor for added/dropped/renamed columns
- Detect type changes (string → int, nullable → required)
- Version schemas and track drift over time
- Use schema evolution policies in Delta Lake
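Detecting added, dropped, and type-changed columns reduces to diffing two snapshots of the schema. A sketch, assuming schemas are captured as `{column: type}` mappings (on Databricks you could build these from `DataFrame.schema`):

```python
def schema_diff(old, new):
    """Diff two {column: type} mappings and report schema drift."""
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    type_changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {
        "added": added,
        "dropped": dropped,
        "type_changed": type_changed,
        "drift": bool(added or dropped or type_changed),
    }
```

Storing each run's schema snapshot lets you version schemas and replay drift over time, as the pillar suggests.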
1.4 Distribution
Question: Are the values within expected ranges?
- Profile key columns: null rates, distinct counts, min/max values
- Detect distribution shifts (mean, variance, percentile changes)
- Monitor categorical distributions for unexpected new values
- Use statistical tests (KS test, chi-squared) for formal detection
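The profiling step can be illustrated without any statistics library: the sketch below computes null rate, distinct count, and min/max for a column, and flags categorical values not seen before. The formal KS or chi-squared tests mentioned above would sit on top of profiles like this (e.g. via `scipy.stats`); the function name here is our own.

```python
def profile_column(values, expected_categories=None):
    """Profile one column: null rate, distinct count, min/max,
    plus any categorical values outside the expected set."""
    non_null = [v for v in values if v is not None]
    profile = {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
    if expected_categories is not None:
        profile["new_values"] = sorted(set(non_null) - set(expected_categories))
    return profile
```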
1.5 Lineage
Question: Where did this data come from, and what depends on it?
- Track source → transformation → target relationships
- Enable impact analysis ("what breaks if this source goes down?")
- Support root cause analysis ("why is this dashboard wrong?")
- Satisfy regulatory requirements (GDPR, SOX, HIPAA traceability)
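Impact analysis over lineage is a graph traversal: given source → target edges, walk downstream from a failing asset to find everything it can break. A self-contained sketch (edge list and function name are illustrative):

```python
from collections import defaultdict, deque

def downstream_impact(edges, source):
    """Given (upstream, downstream) lineage edges, return every asset
    that transitively depends on `source` (breadth-first traversal)."""
    graph = defaultdict(list)
    for up, down in edges:
        graph[up].append(down)
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Reversing the edge direction gives the root-cause view ("which upstream sources feed this broken dashboard?").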
2. SLIs and SLOs for Data
2.1 Service Level Indicators (SLIs)
SLIs are the metrics you measure:
| SLI | Description | Example |
|---|---|---|
| Freshness | Hours since last update | 1.5 hours |
| Completeness | % of expected rows received | 99.2% |
| Validity | % of rows passing quality checks | 98.7% |
| Timeliness | Pipeline duration | 12 minutes |
| Accuracy | % of records matching source of truth | 99.9% |
2.2 Service Level Objectives (SLOs)
SLOs are the targets you commit to:
| Table | Freshness SLO | Completeness SLO | Validity SLO |
|---|---|---|---|
| orders (Tier 1) | < 2 hours | > 99% | > 99% |
| customers (Tier 1) | < 6 hours | > 99% | > 98% |
| product_catalog (Tier 2) | < 24 hours | > 95% | > 95% |
| web_analytics (Tier 3) | < 48 hours | > 90% | > 90% |
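Evaluating measured SLIs against these targets is a straightforward comparison. A sketch, assuming an SLO registry mirroring the table above (the registry and function are illustrative, not the toolkit's API); note the SLOs are strict bounds ("< 2 hours", "> 99%"), so equality counts as a breach:

```python
# Hypothetical SLO registry mirroring the table above.
SLOS = {
    "orders": {"freshness_hours": 2, "completeness_pct": 99, "validity_pct": 99},
    "product_catalog": {"freshness_hours": 24, "completeness_pct": 95, "validity_pct": 95},
}

def evaluate_slos(table, freshness_hours, completeness_pct, validity_pct):
    """Return the list of SLOs the table currently violates."""
    slo = SLOS[table]
    breaches = []
    if freshness_hours >= slo["freshness_hours"]:
        breaches.append("freshness")
    if completeness_pct <= slo["completeness_pct"]:
        breaches.append("completeness")
    if validity_pct <= slo["validity_pct"]:
        breaches.append("validity")
    return breaches
```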
2.3 Error Budgets
- If your SLO is 99% completeness, you have a 1% error budget per month
- Track error budget consumption over rolling windows
- When the budget is exhausted, freeze deployments and fix reliability
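Budget tracking is simple arithmetic: a 99% SLO over one million expected rows allows 10,000 failures per month. A sketch of the remaining-budget calculation (function name is ours):

```python
def error_budget_remaining(slo_pct, total_rows, failed_rows):
    """Fraction of the error budget still unspent for the window.

    A 99% completeness SLO means up to (100 - 99)% of `total_rows`
    may be missing or invalid before the budget is exhausted.
    """
    budget = total_rows * (100 - slo_pct) / 100
    return max(0.0, 1 - failed_rows / budget)
```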
3. Incident Response for Data Issues
3.1 Severity Classification
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| SEV-1 | Revenue-impacting, customer-facing data wrong | 15 min | Billing data missing |
| SEV-2 | Tier 1 SLA breach, internal dashboards wrong | 1 hour | Orders table stale |
| SEV-3 | Tier 2 SLA breach, degraded analytics | 4 hours | Product catalog delayed |
| SEV-4 | Tier 3 anomaly, no immediate business impact | Next business day | Web analytics spike |
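Alert routing benefits from encoding this table rather than deciding severity ad hoc. A minimal sketch mapping table tier and revenue impact to a severity level (the function and its arguments are assumptions, not a toolkit API):

```python
def classify_severity(tier, revenue_impacting=False):
    """Map an SLA breach to a severity level per the classification table:
    revenue-impacting issues are always SEV-1; otherwise severity
    follows the table's tier."""
    if revenue_impacting:
        return "SEV-1"
    return {1: "SEV-2", 2: "SEV-3", 3: "SEV-4"}[tier]
```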
3.2 Incident Workflow
- Detect: Automated monitoring flags the issue
- Alert: Route to the correct on-call team based on severity
- Triage: Identify scope (which tables/pipelines are affected)
- Investigate: Use lineage to trace root cause
- Mitigate: Apply a fix or workaround
- Communicate: Update stakeholders via status page
- Resolve: Confirm data is correct and SLAs are restored
- Post-mortem: Document root cause, timeline, and prevention steps
3.3 Runbook Template
Title: [Pipeline Name] — [Issue Type]
Triggered by: [Alert rule name]
Severity: SEV-[1-4]
Symptoms:
- [What the alert says]
Investigation Steps:
1. Check pipeline logs: [link]
2. Query lineage for upstream status
3. Verify source system availability
4. Check for schema changes
Mitigation:
- Option A: Re-run pipeline
- Option B: Revert to last known good state
- Option C: Disable downstream consumers
Escalation:
- SEV-1/2: Page on-call data engineer
- SEV-3/4: Slack #data-incidents channel
4. Tool Selection
4.1 Build vs. Buy
| Approach | Pros | Cons |
|---|---|---|
| Build (this toolkit) | Full control, no vendor lock-in, cost-effective | Maintenance burden, limited UI |
| Monte Carlo / Bigeye | Polished UI, ML-powered anomaly detection | Expensive, vendor lock-in |
| Great Expectations | Open source, extensive checks | No built-in alerting, batch only |
| dbt tests | Integrated with dbt workflows | Limited to SQL-based checks |
4.2 When This Toolkit is Right
- You are on Databricks and want native Delta integration
- You need lineage tracking alongside metric collection
- You want full control over detection algorithms
- Your budget favours a one-time purchase over SaaS subscriptions
5. Implementation Checklist
- [ ] Run `notebooks/setup_observability.py` to create tables
- [ ] Configure SLA definitions in `configs/observability_config.yaml`
- [ ] Add lineage tracking calls to your existing pipelines
- [ ] Add metric collection (row counts, durations) to pipeline code
- [ ] Configure alert channels (Slack, PagerDuty, email)
- [ ] Define alert rules in `configs/alert_rules.yaml`
- [ ] Schedule anomaly detection to run after each pipeline
- [ ] Set up the observability dashboard notebook
- [ ] Write runbooks for your top 5 most critical pipelines
- [ ] Conduct a tabletop exercise with your team using a simulated incident
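To make the configuration steps concrete, here is an illustrative YAML sketch of what an SLA/alert configuration could contain. Every key below is hypothetical — the actual schema is defined by the toolkit's `configs/observability_config.yaml` and `configs/alert_rules.yaml` files.

```yaml
# Illustrative sketch only — not the toolkit's real schema.
tables:
  orders:
    tier: 1
    freshness_sla_hours: 2
    completeness_slo_pct: 99
    validity_slo_pct: 99
  web_analytics:
    tier: 3
    freshness_sla_hours: 48
alert_channels:
  sev1: pagerduty
  sev2: pagerduty
  sev3: "#data-incidents"
```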
By Datanest Digital | Data Observability Setup v1.0.0