
Rashmiranjan Sahoo


Creating Azure Data Lake Gen2 - Learning Day 3

Disclaimer: I am writing this to revisit the concepts of Azure data engineering and strengthen my fundamentals. If you find this article helpful, I would love that.

What is Azure Data Lake Gen2?

Azure Data Lake Storage Gen2 is the data lake solution provided by Azure.
It is built on Azure Blob Storage and combines it with the capabilities of Data Lake Gen1 to provide a highly scalable and secure data lake for big data storage and analytics.

How to create a data lake?

  • To create a data lake in Azure:

  • Click on Create a storage account

  • Fill in all the details

  • Click on Next

  • In the Advanced tab there is a section:

    • Data Lake Storage Gen2
    • Enable hierarchical namespace ☑
    • Tick this checkbox
  • Ticking this checkbox turns our storage account into an Azure Data Lake Gen2 account with a hierarchical namespace, that is, real directories, which are not available in plain blob storage.

  • In the background, Azure configures the data lake and provisions the disk space and clusters for us, so we do not manage any infrastructure ourselves.
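The portal checkbox corresponds to a single property on the storage account resource. As a minimal sketch, the same settings can be written as an ARM-style resource definition, built here as a Python dictionary. The account name, location, SKU, and API version are placeholder assumptions; `isHnsEnabled` is the property that matches "Enable hierarchical namespace".

```python
import json


def storage_account_definition(name, location, hierarchical_namespace=True):
    """Build an ARM-style definition for a storage account.

    With hierarchical_namespace=True the account is a Data Lake Gen2
    account, the same effect as ticking the checkbox in the portal.
    """
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": "2023-01-01",  # assumption: any recent API version should do
        "name": name,
        "location": location,
        "kind": "StorageV2",  # general-purpose v2 account
        "sku": {"name": "Standard_LRS"},  # assumption: locally redundant storage
        "properties": {
            "isHnsEnabled": hierarchical_namespace,
        },
    }


if __name__ == "__main__":
    # "learningday3lake" and "eastus" are hypothetical example values.
    definition = storage_account_definition("learningday3lake", "eastus")
    print(json.dumps(definition, indent=2))
```

The printed JSON is only the resource fragment; a deployable template would wrap it in the usual template structure.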

When we open our storage account:

  • Click on Containers to create a container
    • Give it a name
    • Click on Create
  • Click on the container we just created
  • The top bar now shows several options, including Add Directory, which gives us a disk-like folder structure
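The same two portal actions can be sketched in Python. The function below only assumes that `service_client` behaves like `DataLakeServiceClient` from the `azure-storage-file-datalake` package; the function name and the example container and directory names are hypothetical.

```python
def create_container_with_directory(service_client, container_name, directory_path):
    """Create a container (file system) and one directory inside it.

    `service_client` is expected to behave like
    azure.storage.filedatalake.DataLakeServiceClient.
    """
    # Same as creating a container in the portal.
    file_system_client = service_client.create_file_system(file_system=container_name)
    # Same as "Add Directory" on the container's top bar.
    directory_client = file_system_client.create_directory(directory_path)
    return directory_client


# Possible usage (assumes `pip install azure-storage-file-datalake azure-identity`
# and that the signed-in identity is allowed to write to the account):
#
#   from azure.identity import DefaultAzureCredential
#   from azure.storage.filedatalake import DataLakeServiceClient
#
#   client = DataLakeServiceClient(
#       account_url="https://<account-name>.dfs.core.windows.net",
#       credential=DefaultAzureCredential(),
#   )
#   create_container_with_directory(client, "raw", "weblogs/2024")
```

Passing the client in as a parameter keeps the sketch independent of how we authenticate.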

What is a hierarchical namespace?

The hierarchical namespace is modeled on file systems such as the Linux file system and the Hadoop Distributed File System (HDFS).

It organizes files into a hierarchy of directories for efficient data access.
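One way to see why real directories help is to compare renaming a folder in a flat namespace, where a "directory" is only a prefix in each object name, with renaming it in a hierarchy. The sketch below is a toy in-memory model, not how Azure implements storage; all names in it are made up for illustration.

```python
def rename_dir_flat(blobs, old, new):
    """Flat namespace: a 'directory' is only a name prefix, so every
    object under it has to be renamed one by one."""
    old_prefix = old.rstrip("/") + "/"
    new_prefix = new.rstrip("/") + "/"
    touched = 0
    # Copy the matching keys first so the dict is not changed while iterating.
    for key in [k for k in blobs if k.startswith(old_prefix)]:
        blobs[new_prefix + key[len(old_prefix):]] = blobs.pop(key)
        touched += 1
    return touched


def rename_dir_hierarchical(tree, old, new):
    """Hierarchical namespace: a directory is a real node, so renaming it
    is a single change on its parent, however many files it holds."""
    tree[new] = tree.pop(old)
    return 1


# 1,000 hypothetical log files stored both ways.
flat = {f"logs/2024/file{i}.json": b"" for i in range(1000)}
tree = {"logs": {"2024": {f"file{i}.json": b"" for i in range(1000)}}}

print("flat namespace, objects touched:", rename_dir_flat(flat, "logs", "archive"))
print("hierarchical, objects touched:", rename_dir_hierarchical(tree, "logs", "archive"))
```

With 1,000 files the flat rename touches 1,000 objects, while the hierarchical rename touches one.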

We can store any kind of data in Data Lake Gen2: it handles all three classes of data, i.e., structured, semi-structured, and unstructured.
Data can be web-server log data, relational data, streamed data, etc.

We can process this data with Synapse Analytics, Databricks, and Stream Analytics.
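Spark-based engines such as Databricks and Synapse Spark usually address files in a Gen2 account through an `abfss://` URI. The small helper below shows how that URI is put together; the function name and the example account, container, and path are hypothetical.

```python
def abfss_uri(account, container, path=""):
    """Return the abfss:// URI for a path inside a Data Lake Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# "learningday3lake", "raw" and the file path are hypothetical example values.
print(abfss_uri("learningday3lake", "raw", "/weblogs/2024/01/access.log"))
```

In a Spark notebook the resulting string could be passed to a reader such as `spark.read.json(...)`, assuming the cluster is already authorized to access the account.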

Access Tiers

An access tier is a storage option in cloud-based storage services that helps us manage the cost and performance of our data based on how frequently we access it.

Types of access tiers:

  • Hot
    It is optimized for storing data that is frequently accessed.
    It offers low latency and is suitable where data is frequently read and written.

  • Cool
    It is optimized for storing data that is infrequently accessed and stored for at least 30 days.
    An early deletion fee is charged if the data is deleted before then.

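What is an early deletion fee?
Suppose we store data in the cool tier, which has a 30-day minimum, and on the 15th day we decide we no longer need it and delete it.

In that case we pay a fee for the remaining days of the minimum period.

This fee is called the early deletion fee.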
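The arithmetic can be sketched as follows, assuming the fee is prorated: we are charged as if the data had stayed for the remaining days of the minimum period. The price used here is a made-up number for illustration, not an actual Azure price, and the function name is hypothetical.

```python
def early_deletion_fee(size_gb, days_stored, minimum_days, price_per_gb_month,
                       days_per_month=30):
    """Estimate a prorated early deletion fee.

    The fee covers the days still missing from the tier's minimum
    storage period; it is zero once the minimum has been reached.
    """
    remaining_days = max(minimum_days - days_stored, 0)
    daily_price_per_gb = price_per_gb_month / days_per_month
    return size_gb * daily_price_per_gb * remaining_days


# 100 GB deleted from the cool tier on day 15 of a 30-day minimum.
# 0.01 per GB per month is a hypothetical price.
fee = early_deletion_fee(size_gb=100, days_stored=15, minimum_days=30,
                         price_per_gb_month=0.01)
print(f"estimated early deletion fee: {fee:.2f}")
```

The same function works for the archive tier by passing its 180-day minimum.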

  • Archive
    It is optimized for storing data that is rarely accessed and stored for at least 180 days.
    It offers the lowest storage costs but the highest data retrieval times.
    This tier is suitable for long-term storage, such as compliance data, backup archives, and historical records.
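To tie the three tiers together, here is a rough rule of thumb written as code. The minimum storage periods (30 and 180 days) come from the descriptions above; the cut-off for "frequently accessed" and the function itself are assumptions made for this sketch, not an official recommendation.

```python
FREQUENT_READS_PER_MONTH = 10  # hypothetical cut-off for "frequently accessed"
COOL_MIN_DAYS = 30             # minimum storage period of the cool tier
ARCHIVE_MIN_DAYS = 180         # minimum storage period of the archive tier


def suggest_tier(reads_per_month, retention_days, can_wait_hours_to_read):
    """Suggest an access tier from access frequency and retention."""
    if reads_per_month >= FREQUENT_READS_PER_MONTH:
        return "Hot"
    # Archive only makes sense if we keep the data long enough and can
    # wait hours before reading it again.
    if retention_days >= ARCHIVE_MIN_DAYS and can_wait_hours_to_read:
        return "Archive"
    if retention_days >= COOL_MIN_DAYS:
        return "Cool"
    # Short-lived data stays in hot to avoid early deletion fees.
    return "Hot"


print(suggest_tier(reads_per_month=0, retention_days=365, can_wait_hours_to_read=True))
print(suggest_tier(reads_per_month=1, retention_days=60, can_wait_hours_to_read=False))
```

A real decision would also weigh storage and access prices, which this sketch leaves out.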

What is rehydration?

To read data from the archive tier we must first change the blob's tier to hot or cool.

This process is called rehydration and can take hours to complete.

  • Standard priority:
    The rehydration request is processed in the order it was received and may take up to 15 hours.

  • High priority:
    The rehydration request is prioritized over standard requests and may finish in under 1 hour for files smaller than 10 GB.
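A rehydration request can also be sent from code. The sketch below assumes `blob_client` behaves like `BlobClient` from the `azure-storage-blob` package, whose `set_standard_blob_tier` method takes a `rehydrate_priority` keyword; the wrapper function and its validation are my own additions for illustration.

```python
def rehydrate(blob_client, target_tier="Hot", priority="Standard"):
    """Ask the service to move an archived blob back to an online tier.

    `blob_client` is expected to behave like azure.storage.blob.BlobClient.
    The call only starts the rehydration; the blob stays unreadable until
    the service has finished, which can take hours.
    """
    if target_tier not in ("Hot", "Cool"):
        raise ValueError("target_tier must be 'Hot' or 'Cool'")
    if priority not in ("Standard", "High"):
        raise ValueError("priority must be 'Standard' or 'High'")
    blob_client.set_standard_blob_tier(target_tier, rehydrate_priority=priority)


# Possible usage (assumes `pip install azure-storage-blob azure-identity`;
# the container and blob names are hypothetical):
#
#   from azure.identity import DefaultAzureCredential
#   from azure.storage.blob import BlobClient
#
#   blob = BlobClient(
#       account_url="https://<account-name>.blob.core.windows.net",
#       container_name="raw",
#       blob_name="weblogs/2019/access.log",
#       credential=DefaultAzureCredential(),
#   )
#   rehydrate(blob, target_tier="Cool", priority="High")
```

High priority is typically billed at a higher rate than standard priority, so it is worth reserving for data that is needed urgently.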

The next article will cover other features of Azure Data Lake Gen2, such as lifecycle policies, security, etc.
