DEV Community

Cover image for Streaming Tables vs. Materialized Views: Stop Guessing Your Databricks Refresh Strategy
Aniket Abhishek Soni
Aniket Abhishek Soni

Posted on

Streaming Tables vs. Materialized Views: Stop Guessing Your Databricks Refresh Strategy

Why I chose this topic: In the last year, I’ve seen three separate teams in healthcare and fintech burn through six-figure cloud budgets because they defaulted to "Materialized Views" for everything. They treated Databricks like a traditional RDBMS, ignoring the fundamental architecture of Delta Live Tables (DLT). This post is the guide I wish I could have handed them six months ago.

I still remember the 3:00 AM pager alert that haunted my last project. We were running a standard batch job to aggregate patient billing records—a simple INSERT OVERWRITE pattern that had worked fine for two years. But as the volume hit the multi-terabyte scale, the orchestration overhead and the "stop-the-world" latency of full refreshes became a liability. We weren't just missing SLAs; we were blowing through our reserved instances because the cluster had to re-process 90% of the data just to capture the 10% that changed.

Most engineers reach for Materialized Views (MVs) because they look like familiar SQL. You write a CREATE MATERIALIZED VIEW, you set a schedule, and you walk away. But if you’re building a pipeline that consumes high-velocity event data, using an MV is like trying to hydrate a marathon runner with a fire hose once an hour. You’re either flooding the system or leaving it parched.

The industry is obsessed with "simplicity," but in data engineering, simplicity that hides cost is just technical debt with a fancy UI. It’s time to stop treating Databricks like a legacy data warehouse and start understanding why your choice between Streaming Tables and Materialized Views is the difference between a performant pipeline and a PagerDuty incident.

The real problem: The Refresh Fallacy

The core confusion stems from the Databricks UI making both options look like "magic." An MV is declarative—you define the state you want, and Databricks figures out the delta. But "figures out" involves overhead. If your source data is growing linearly, your MV refresh time will eventually hit a wall where the compute cost of the full re-computation exceeds your budget.

Streaming Tables (STs) are fundamentally different. They are designed for "incremental-only" processing. They don't re-read the source; they track the offset and process only what’s new since the last run. If you are doing aggregations on time-series data or CDC (Change Data Capture) feeds, MVs are a performance trap. You’re paying to re-read data you’ve already processed.

Photo by Deng Xiang on Unsplash
Photo by Deng Xiang on Unsplash

Step 1: Analyze your source data topology

Before you write a single line of SQL, look at your source. If your source is a Delta table with a _change_data_feed enabled or a Kafka topic, you should be using a Streaming Table.

If your source is a static lookup table that changes once a week, use an MV. If your source is a massive transaction log, stop using REFRESH MATERIALIZED VIEW on a schedule. You’re just burning money.

-- The wrong way for high-velocity data (MV)
CREATE MATERIALIZED VIEW raw_events_mv AS
SELECT * FROM source_events;

-- The right way for high-velocity data (ST)
CREATE STREAMING TABLE raw_events_st AS
SELECT * FROM STREAM(source_events);
Enter fullscreen mode Exit fullscreen mode

Step 2: Configure for incremental state management

The magic of Streaming Tables is STREAM(). When you use this, you are telling the Databricks engine to maintain a "checkpoint." This checkpoint is the secret sauce. It stores the file offsets so that every subsequent pipeline execution knows exactly where it left off.

If you don't use STREAM(), DLT treats the source as a static snapshot. You’ll see the pipeline trigger, and it will look like it’s working, but it’s re-processing the entire source table every time. I’ve seen this mistake cost a company $15k in a single weekend. Always verify your plan in the DLT UI to ensure you see "Incremental" in the execution graph.

-- Ensure your sink is also optimized for incremental append
CREATE OR REFRESH STREAMING TABLE aggregated_metrics
AS
SELECT 
  window(event_time, '1 hour') as hour,
  count(*) as total_events
FROM STREAM(raw_events_st)
GROUP BY 1;
Enter fullscreen mode Exit fullscreen mode

Step 3: Implement Watermarking

One of the biggest issues I see in production healthcare data is late-arriving data. If you use a Materialized View, late data often results in silent failures or inaccurate reporting because the "refresh" doesn't account for events arriving after the window closed.

Streaming Tables allow for withWatermark. This is non-negotiable if you’re doing time-windowed aggregations. It tells the engine, "I’m willing to wait X minutes for late data, but after that, drop it." This keeps your state store from ballooning into a memory-leaking monster.

CREATE OR REFRESH STREAMING TABLE hourly_patient_stats
AS
SELECT * FROM STREAM(patient_activity)
  .withWatermark('event_timestamp', '10 minutes')
  .groupBy(window('event_timestamp', '1 hour'))
  .count();
Enter fullscreen mode Exit fullscreen mode

Photo by Chris Ried on Unsplash
Photo by Chris Ried on Unsplash

Lessons learned from production

  1. The "Full Refresh" Trap: If you change your logic in an ST, you often need to perform a full refresh. Use REFRESH TABLE table_name FULL sparingly. It deletes the checkpoint. If your source data is massive, this can take hours. Plan your schema evolution before you commit to the Streaming model.
  2. Cluster Selection: MVs can run on "Serverless" SQL warehouses, which are convenient but pricey. Streaming Tables usually run on DLT pipelines. Don't mix these up. I’ve seen teams try to run streaming logic on SQL Warehouses; the latency will kill you because the warehouse isn't optimized for continuous streaming offsets.
  3. Monitoring is non-negotiable: If your Streaming Table falls behind, the DLT UI will show a "lag" metric. If you ignore this, you’re flying blind. Set up alerts on the pipeline_lag metric in your Databricks SQL Alerts.

Production considerations

  • Cost: If your data is small (under 100GB) and doesn't change often, use MVs. They are easier to manage and require zero checkpoint maintenance.
  • Complexity: Streaming Tables require you to think about "state." If you have a complex join between two massive streaming tables, you will need to increase your spark.sql.streaming.stateStore.providerClass to use RocksDB, otherwise, your executor will OOM (Out of Memory) as soon as the state exceeds the executor heap.
  • Governance: Remember that GRANT statements on DLT tables are managed via Unity Catalog. If you aren't using UC, you’re just making your security team miserable.

Conclusion

The choice between Streaming Tables and Materialized Views isn't about SQL syntax—it’s about data lifecycle management. If your data is flowing, use Streaming Tables and respect the checkpoint. If your data is static, use Materialized Views and keep your pipeline simple.

Don't let the "easy" button of a schedule refresh turn into a budget-breaking nightmare. Take the time to understand your data velocity, and your infrastructure will thank you.

Try it: Go to your current DLT pipeline configuration. Identify one "Materialized View" that is being refreshed hourly. Check the source data volume. If it’s over 50GB, rewrite it as a Streaming Table, implement a watermark, and watch your cluster uptime percentage climb.


SEO keywords: databricks streaming tables, materialized views vs streaming tables, delta live tables best practices, spark streaming performance

Tags: #databricks #dataengineering #spark #delta

Cover photo by Taylor Vick on Unsplash.

Top comments (0)