AWS Lake Formation Summary

what is a data lake?

AWS Lake Formation helps you create data lakes, but what is a data lake?
Well, a data lake is a central repository that holds all of your data in one place so that you can run analytics on top of it.

what is Lake Formation?

Lake Formation is a fully managed service that makes it easy to set up a data lake. Setting one up usually takes months, but with Lake Formation it takes just a few days to get started.

Lake Formation helps you discover, cleanse, transform, and ingest data into your data lake. It automates many complex manual steps, such as collecting, cleansing, moving, and cataloging data, and it uses machine learning transforms for de-duplication.

In this data lake, you can combine structured and unstructured data sources. Lake Formation also comes with blueprints.
Blueprints come out of the box, and they help you migrate data from a source into the central data lake.
There are blueprints for Amazon S3, Amazon RDS, relational databases that you run on-premises, NoSQL databases, and so on.

why do you set up Lake Formation?

Well, you have everything in one place, and on top of that you get fine-grained access control for your applications at the row and column level.
That means any application that reads data through AWS Lake Formation is subject to fine-grained access control, and this is a huge plus.
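To make column-level access control a bit more concrete, here is a minimal sketch of what such a grant could look like with boto3's `grant_permissions` call on the Lake Formation client. The role ARN, database name, table name, and column names are all hypothetical placeholders, not values from this post. The request is built as a plain dictionary first so that it can be inspected before anything is sent to AWS.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for Lake Formation's GrantPermissions API.

    The grant allows SELECT only on the listed columns of one table.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request):
    """Send the grant to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the request works without boto3

    client = boto3.client("lakeformation")
    return client.grant_permissions(**request)


# Hypothetical principal, database, table, and columns, for illustration only.
column_grant = build_column_grant(
    "arn:aws:iam::111122223333:role/analyst",
    "sales_db",
    "orders",
    ["order_id", "order_date", "total"],
)
```

Calling `apply_grant(column_grant)` would submit the request. Any column left out of `ColumnNames` stays hidden from that principal when it queries the table through an integrated service.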

how does Lake Formation work?

Well, it's a layer on top of AWS Glue, but you don't interact with Glue directly.
Lake Formation lets you create a data lake whose data is stored in Amazon S3.
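Since the data lake lives in Amazon S3, one of the first steps is usually telling Lake Formation which S3 location it should manage. A rough sketch of that step, assuming boto3's `register_resource` call and a hypothetical bucket named `example-data-lake`, could look like this:

```python
def build_register_request(bucket, prefix=""):
    """Build keyword arguments for Lake Formation's RegisterResource API."""
    arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        # Normalize the prefix so the ARN has no leading or trailing slash.
        arn += "/" + prefix.strip("/")
    # UseServiceLinkedRole lets Lake Formation access the location
    # through its service-linked role.
    return {"ResourceArn": arn, "UseServiceLinkedRole": True}


def register_location(request):
    """Register the S3 location. Needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the request works without boto3

    return boto3.client("lakeformation").register_resource(**request)


# Hypothetical bucket and prefix, for illustration only.
register_request = build_register_request("example-data-lake", "raw/")
```

Here `register_request["ResourceArn"]` is `arn:aws:s3:::example-data-lake/raw`, and `register_location(register_request)` would perform the actual registration.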

The data sources could be Amazon S3, RDS, Aurora, or your on-premises databases, whether SQL or NoSQL, and you ingest the data using the blueprints available in Lake Formation.

Lake Formation comes with source crawlers, ETL and data preparation tools, and a data catalog, and all of this comes from the underlying Glue service.

Then we have security settings and access controls to make sure that the data in your data lake is protected.

The services that can leverage Lake Formation include Athena, Redshift, EMR, and other analytics tools such as the Apache Spark framework. As a user, you connect to these services, which in turn connect to Lake Formation and your data lake.

why do we wanna use Lake Formation?

Well, one key aspect is centralized permissions.
Say, for example, that your company uses Athena and QuickSight to analyze data, your users must see only the data they have permission to see, and your data sources include Amazon S3, RDS, Aurora, and so on.
You could try to set up security in Athena, or in QuickSight at the user level, or with S3 bucket policies, or with database users in RDS or Aurora.
So you have multiple places where you manage security, and it becomes a mess.

Lake Formation solves this problem because access control is managed in one central place, and you get column-level and row-level security.
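As an illustration of row-level security defined once in Lake Formation rather than in each analytics tool, the sketch below builds a data cells filter and a matching grant. It assumes boto3's `create_data_cells_filter` and `grant_permissions` calls. The account ID, names, and filter expression are hypothetical.

```python
def build_row_filter(account_id, database, table, filter_name, expression, columns):
    """Build keyword arguments for Lake Formation's CreateDataCellsFilter API."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Only rows matching this expression are visible through the filter.
            "RowFilter": {"FilterExpression": expression},
            # Only these columns are visible through the filter.
            "ColumnNames": list(columns),
        }
    }


def build_filter_grant(principal_arn, filter_request):
    """Build a GrantPermissions request that gives SELECT through the filter."""
    data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": data["TableCatalogId"],
                "DatabaseName": data["DatabaseName"],
                "TableName": data["TableName"],
                "Name": data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_filter_and_grant(filter_request, grant_request):
    """Create the filter and grant it. Needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the requests works without boto3

    client = boto3.client("lakeformation")
    client.create_data_cells_filter(**filter_request)
    client.grant_permissions(**grant_request)


# Hypothetical values, for illustration only.
row_filter = build_row_filter(
    "111122223333", "sales_db", "orders", "emea_only",
    "region = 'EMEA'", ["order_id", "region", "total"],
)
filter_grant = build_filter_grant("arn:aws:iam::111122223333:role/analyst", row_filter)
```

With a filter like this in place, the same restriction would apply whether the analyst role queries the table from Athena or from another integrated service, so the rule does not have to be repeated in each tool.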
