A story about discovering the right materialization for the right use case
The Problem: "Will My Data Be Stale in 90 Days?"
Picture this: You're building a daily monitoring dashboard for ridepooling stop usage. Your stakeholder asks a simple question: "What happens if I query this data in 90 days? Will I get stale data?"
That question led me down a rabbit hole that taught me more about dbt materializations than any tutorial ever could.
The Discovery: Everything Goes Stale
My initial stop_network_usage model was materialized as a table with a hard-coded 30-day lookback window:
{{ config(materialized="table") }}

select
    stops.stop_id,
    date_trunc('day', trips.occurred_at) as usage_date,
    count(trips.trip_id) as daily_usage
from stops
inner join trips on [...]
where trips.occurred_at >= current_date - interval '30' day
group by 1, 2
The harsh reality hit: without regular dbt runs, every table in our data warehouse goes stale.
- Fresh events flow into source systems continuously
- dbt tables only update when manually executed
- Queries read stale data until the next dbt run
If I didn't run dbt for 90 days, my dashboard would be showing a window that ended 90 days ago and stretched back 120 days (90 days since the last run plus the 30-day lookback). Ouch.
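A quick check along these lines makes that math visible (the schema and table name are placeholders for wherever your project builds the model):

-- Staleness check against the built table (schema/table name is illustrative).
-- Both values are frozen at whatever day the last dbt run happened.
select
    max(usage_date) as newest_day_in_table,  -- the day of the last dbt run, at best
    min(usage_date) as oldest_day_in_table   -- that day minus the 30-day window
from analytics.stop_network_usage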
The "Complex" Solution Path
My first instinct was to over-engineer:
- Convert to incremental materialization ✓
- Add scheduling infrastructure (cron jobs, Airflow, AWS EventBridge)
- Implement monitoring and alerting
- Add sliding window logic for data retention
- Set up error handling and recovery procedures
I started down this path, converting my model to incremental:
{{ config(materialized="incremental", unique_key=["stop_id", "usage_date"]) }}

-- Complex incremental logic with is_incremental() conditions
{% if is_incremental() %}
    and date_trunc('day', trips.occurred_at) > (select max(usage_date) from {{ this }})
{% else %}
    and trips.occurred_at >= current_date - interval '90' day
{% endif %}
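For context, the complete model would have looked roughly like this. It's a sketch that mirrors the table model above; the pickup-stop join condition is an assumption borrowed from the view model further down, since the original join was elided:

{{ config(materialized="incremental", unique_key=["stop_id", "usage_date"]) }}

select
    stops.stop_id,
    date_trunc('day', trips.occurred_at) as usage_date,
    count(trips.trip_id) as daily_usage
from stops
-- assumed join on the pickup stop column; the original join condition was elided
inner join trips on trips.link__stop_id__pickup = stops.stop_id
where 1 = 1
{% if is_incremental() %}
    -- incremental runs only process days newer than what the table already holds
    and date_trunc('day', trips.occurred_at) > (select max(usage_date) from {{ this }})
{% else %}
    -- the first run (or --full-refresh) backfills 90 days of history
    and trips.occurred_at >= current_date - interval '90' day
{% endif %}
group by 1, 2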
While learning about incremental models was valuable (they're genuinely powerful for large datasets), I realized I was building a complex solution for a simple problem.
The Simple Solution: Views to the Rescue
Then it hit me. For my use case, I had a much simpler option: views.
{{ config(materialized="view") }}

select
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en as stop_name_en,
    date(trip_usage.occurred_at) as trip_date,
    count(distinct trip_usage.trip_id) as daily_trip_count
from stops
left join (
    select link__stop_id__pickup as stop_id, trip_id, occurred_at
    from trips
    where link__stop_id__pickup is not null
    union
    select link__stop_id__delivery as stop_id, trip_id, occurred_at
    from trips
    where link__stop_id__delivery is not null
) trip_usage on stops.stop_id = trip_usage.stop_id
group by [...]
Problem solved. Views query the source data directly every time, so they're always fresh—no automation required.
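The lookback window also moves to where it belongs: the consuming query. Something like this (view name and schema are placeholders) is what the dashboard would run, and because the model is a view, the 30-day window is always anchored to today rather than to the last dbt run:

-- Example dashboard query (view name/schema are illustrative). The view's
-- SELECT re-reads stops and trips at query time, so results are always current.
select stop_name_en, trip_date, daily_trip_count
from analytics.stop_network_usage
where trip_date >= current_date - interval '30' day
order by trip_date desc, daily_trip_count desc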
Views vs. Tables vs. Incremental: Choose Your Fighter
This experience taught me that dbt materializations aren't just technical choices; they're architectural decisions with real business implications. Here's how they compare (with a minimal config example after the breakdown):
📊 Views: Always Fresh, Simple
- Best for: Small-medium datasets, infrequent queries, freshness > performance
- Pros: Always fresh data, no scheduling complexity
- Cons: Slower queries, higher compute costs
- My use case: ✅ Perfect for daily dashboard monitoring
🗄️ Tables: Fast Performance, Can Go Stale
- Best for: Large datasets, frequent queries, acceptable staleness
- Pros: Pre-computed results, fast queries
- Cons: Requires regular refresh, can go stale
- My use case: ❌ Would need complex automation
⚡ Incremental: Best of Both Worlds (When You Need It)
- Best for: Very large datasets, regular updates, historical accumulation
- Pros: Efficient processing, fast subsequent runs, accumulates history
- Cons: Complex logic, still needs scheduling, overkill for simple use cases
- My use case: ❌ Over-engineered for my requirements
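The nice part is that switching between them is a one-line change at the top of the model (or a folder-level default in dbt_project.yml). A minimal, illustrative model using the pickup column from the trips table above shows where that choice lives:

-- Swap this one value ("view", "table", or "incremental") to change how dbt
-- builds the same SELECT; incremental would additionally want a unique_key
-- and an is_incremental() filter, as shown earlier.
{{ config(materialized="view") }}

select link__stop_id__pickup as stop_id, count(*) as pickup_count
from trips
where link__stop_id__pickup is not null
group by 1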
The Key Insight: Match the Tool to the Problem
The temptation in data engineering is to always reach for the most sophisticated solution. Incremental models are powerful and automated pipelines are impressive, but sometimes the simple answer is the right answer.
My requirements:
- Daily UI monitoring ✓
- Always fresh data ✓
- Reasonable data volumes ✓
- Simple aggregations ✓
Views delivered exactly what I needed without any of the operational overhead.
Lessons Learned
- Question your assumptions: Not every data model needs to be a table or incremental
- Understand the trade-offs: Performance vs. freshness vs. complexity
- Size your solution to your problem: Views are underrated for the right use cases
- Think about operations: Who will maintain your complex automation?
The Bottom Line
Sometimes the best engineering decision is choosing the simplest solution that meets your requirements. Views might not be glamorous, but they solved my stale data problem elegantly.
The next time someone asks "Will my data be stale in 90 days?", I'll confidently answer: "Not if you use views."
What materialization strategies have you found work best for your use cases? I'd love to hear about times when simple solutions beat complex ones in your data projects.