Bartosz Gajda

Posted on • Originally published at bartoszgajda.com

Delta Lake and data lakes — getting started

Data lakes are being adopted by more and more companies looking for an efficient way to store their data assets. The idea behind them is quite simple, and it contrasts with the industry-standard data warehouse. This post explains the logical foundation of data lakes and presents a practical use case with a tool called Delta Lake. Enjoy!

What is a data lake?

A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
*Amazon Web Services*

The rationale behind data lakes is similar to that of the widely used data warehouse, but although both fall into the same category, the logic behind them differs. A data warehouse stores information that has already been pre-processed: the reason for storing the data has to be known up front and the data model well defined. A data lake takes the opposite approach: neither the reason for storing the data nor the data model needs to be defined in advance. The two can be compared as follows:

    +-----------+----------------------+-------------------+
    |           | Data Warehouse       | Data Lake         |
    +-----------+----------------------+-------------------+
    | Data      | Structured           | Unstructured data |
    | Schema    | Schema on write      | Schema on read    |
    | Storage   | High-cost storage    | Low-cost storage  |
    | Users     | Business analysts    | Data scientists   |
    | Analytics | BI and visualization | Data Science      |
    +-----------+----------------------+-------------------+
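To make the "schema on read" idea more concrete, here is a minimal Spark sketch: raw JSON files are stored in the lake as-is, and their structure is only inferred at the moment they are loaded (the path below is just an example, not part of the original setup):

    // Schema on read: the raw JSON stays untouched in storage,
    // and Spark infers the structure only when the data is loaded.
    val events = spark.read.json("/data/raw/events.json") // example path
    events.printSchema() // schema is discovered at read time, not at write time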

Using Delta Lake OSS to create a data lake

Now let’s apply that theoretical knowledge using Delta Lake OSS. Delta Lake is an open source storage layer that runs on top of Apache Spark and is used to read, manage and transform data in a data lake. Getting started is quite simple: you will need an Apache Spark project (use this link for more guidance). First, add Delta Lake as an SBT dependency:

    libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
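The snippets below assume a spark session is already available in your project. If you are not working in an interactive shell, a minimal local session could look like this (the application name and master setting are arbitrary choices for illustration):

    import org.apache.spark.sql.SparkSession

    // Minimal local SparkSession used by the examples that follow
    val spark = SparkSession.builder()
      .appName("delta-lake-demo")
      .master("local[*]")
      .getOrCreate()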

Saving data to Delta

Next, let’s create a first table. For this, you will need a Spark DataFrame, which can be an arbitrary dataset or data read from another format, such as JSON or Parquet.

    // Create a DataFrame with ids 0-49 and save it as a Delta table
    val data = spark.range(0, 50)
    data.write.format("delta").save("/data/delta-table")
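If you want to add rows to the table rather than create it from scratch, the same write can be done in append mode. A small sketch (the id range here is arbitrary):

    // Append another batch of rows to the existing Delta table
    val moreData = spark.range(50, 100)
    moreData.write.format("delta").mode("append").save("/data/delta-table")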

Reading data from Delta

Reading the data is as simple as writing it. Just specify the path and the correct format, the same as you would with CSV or JSON data.

    val df = spark.read.format("delta").load("/data/delta-table")
    df.show()
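Because Delta keeps a transaction log of every write, you can also read an older snapshot of the table ("time travel"). A quick sketch, assuming version 0 refers to the very first write made above:

    // Time travel: load the table as it looked at version 0
    val firstVersion = spark.read
      .format("delta")
      .option("versionAsOf", 0)
      .load("/data/delta-table")
    firstVersion.show()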

Updating the data in Delta

Delta Lake OSS supports a range of update options, thanks to its ACID transaction model. Let’s use that to run a batch update that overwrites the existing data, using the following code:

    // Overwrite the table with a larger batch of ids 0-99
    val data = spark.range(0, 100)
    data.write.format("delta").mode("overwrite").save("/data/delta-table")
    // Show the table again to see the effect of the overwrite
    df.show()
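Overwriting is only one option. Conditional updates and deletes go through the DeltaTable API; a sketch based on the quick start (the predicates below are just examples):

    import io.delta.tables._
    import org.apache.spark.sql.functions._

    // Get a handle to the Delta table stored at the given path
    val deltaTable = DeltaTable.forPath(spark, "/data/delta-table")

    // Add 100 to every even id
    deltaTable.update(
      condition = expr("id % 2 == 0"),
      set = Map("id" -> expr("id + 100")))

    // Remove every odd id
    deltaTable.delete(condition = expr("id % 2 != 0"))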

Summary

I hope you have found this post useful. If so, don’t hesitate to like or share it. You can also follow me on social media if you fancy :)

Sources:
https://docs.delta.io/latest/quick-start.html
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
