Delta Live Tables

TL;DR - Delta Live Tables aka DLT

DLT is a framework on top of Delta Lake that does its magic simsalabim out of the box, so you can process large amounts of data without having to know the mechanics underneath. But you also have the option to configure it in a fine-grained way via a JSON settings file when creating the pipeline.
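Such a settings file could look roughly like this; a minimal sketch with made-up names and a made-up notebook path, see the Databricks documentation for the full schema:

{
  "name": "dlt-medallion-demo",
  "edition": "ADVANCED",
  "continuous": false,
  "development": true,
  "clusters": [
    {
      "label": "default",
      "autoscale": { "min_workers": 1, "max_workers": 4 }
    }
  ],
  "libraries": [
    { "notebook": { "path": "/Repos/barbara/dlt-demo/medallion" } }
  ],
  "target": "dlt_demo"
}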

Key features of Delta Live Tables

  • Different dataset types (see the Python sketch after this list):
    ◦ Streaming table: each record is processed exactly once; this assumes an append-only source.
    ◦ Materialized view: records are processed as required to return accurate results for the current data state; use materialized views for sources with updates, deletions or aggregations, and for change data capture (CDC) processing.
    ◦ View: records are processed each time the view is queried; use views for intermediate transformations and data quality checks that should not be published to public datasets.
  • You can write DLT pipelines in Python or SQL.
  • You can choose between the "Core", "Pro" and "Advanced" editions.
  • You can use it to orchestrate tasks and build pipelines very quickly and with a lot less code.
  • It takes care of cluster management by itself, but you can also configure the clusters yourself in the pipeline's JSON settings if needed.
  • You get built-in monitoring: within the Delta Live Tables user interface you can see pipeline status, latency, throughput, error rates and the data quality as defined by you.
  • You can add data quality checks (expectations) in a very simple way, but they are only enabled in the "Advanced" edition:
@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
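A minimal Python sketch of the three dataset types; the source path, table names and columns are made up, and spark is the session a Delta Live Tables pipeline provides:

import dlt
from pyspark.sql.functions import col

# Streaming table: the function returns a streaming DataFrame,
# so each record from the append-only source is processed exactly once
@dlt.table(comment="Streaming table ingesting append-only JSON files")
def events_stream():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader
            .option("cloudFiles.format", "json")
            .load("/raw/events")                   # made-up path
    )

# Materialized view: the function returns a batch DataFrame,
# recomputed as needed to reflect updates, deletes and aggregations
@dlt.table(comment="Materialized view counting events per user")
def events_per_user():
    return dlt.read("events_stream").groupBy("user_name").count()

# View: re-evaluated on every query and not published to the target schema,
# good for intermediate transformations and data quality checks
@dlt.view(comment="Intermediate view that filters out anonymous events")
def named_events():
    return dlt.read("events_stream").where(col("user_name").isNotNull())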

Sample: Medallion Architecture done with Delta Live Tables

import dlt 
# bumms - magic imported
# expected to run as part of a Delta Live Tables pipeline
from pyspark.sql.functions import *

# if you want to ingest json data
json_path = "your_path"

# STEP 1 - Bronze Layer
# @dlt.table is the decorator that declares the table
@dlt.table(
    comment="ingests raw data from wherever you want"
    # you could assign a different table name here if you don't want it to default to the function name, e.g.
    # name= "my_bronze_layer" 
)

# the function name becomes the name of the DLT table (unless you set name= in @dlt.table)
# the decorated function always needs to follow the @dlt.table declaration
def bronze_layer():
    """
    This function ingests raw data from a given source and stores it in a table called "bronze_layer"
    """
    # df = spark. read...whatever you want, like filter data as long as you return a DataFrame
    return (spark.read.format("json").load(json_path)) # a dataframe


# STEP 2 - Silver Layer
@dlt.table(
  comment="Create a silver layer with selected, quality-checked data"
)

@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")

# new table creation
def silver_layer():
  return (
    # live table depending on the table built in STEP 1
    dlt.read("bronze_layer") # after this you can go ahead with spark as usual
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("user_name", "user")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("user", "click_count", "previous_page_title")
  )

# STEP 3 - Gold Layer
@dlt.table(
  comment="A table containing the top pages linking to the checkout page."
)
def gold_layer():
  return (
    dlt.read("silver_layer")
      .filter(expr("current_page_title == 'Checkout'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .sort(desc("click_count"))
      .select("referrer", "click_count")
      .limit(10)
  )
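The expectations above either warn (expect) or stop the pipeline (expect_or_fail). There are also variants that drop bad records or bundle several rules into one decorator; a small sketch reusing the columns from the example above, assuming the "Advanced" edition:

import dlt

rules = {
    "valid_user": "user IS NOT NULL",
    "valid_count": "click_count > 0",
}

@dlt.table(comment="Silver layer variant that silently drops records violating any rule")
@dlt.expect_all_or_drop(rules)
def silver_layer_dropped():
    return (
        dlt.read("bronze_layer")
          .selectExpr("user_name AS user", "CAST(n AS INT) AS click_count")
    )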

Further reading:
Delta Live Tables
Delta Lake
Databricks
