
Elizabeth Adeotun Adegbaju for AWS Community Builders

Originally published at awstip.com

Building Data Lakes on AWS

This is the first in a series of articles I plan to write on AWS solutions and services. I am an AWS Certified Solutions Architect at the Associate level, and this article was written on my journey to the Professional level. It is not training material, but a more-than-basic explanation of AWS data lakes, giving you enough information to decide whether it is the right solution for you or whether you would like to learn more about it.

What is a Data Lake?

A data lake is a central repository that stores data in its original form, in virtually any format, where it can be prepared for analysis. Data here refers to all kinds of data: structured, semi-structured, and unstructured. A data lake is built for big data; it imposes no practical size limits and is designed for fast analytics over large volumes. The first data I worked with in a data lake was a CSV file, and I was able to process it, query it, and get the results I expected. The benefit of a data lake, according to AWS, is that it enables you to “analyze more data from more sources in less time”.

[Figure: AWS Data Lake breakdown, showing the parts of a data lake in order of processing and their corresponding AWS services]

Studying these individual AWS services will help you know which one works for the solution you are trying to design. Data ingestion, for instance, uses a different service depending on the type of data source.
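
For example, streaming sources are commonly ingested with Amazon Kinesis Data Firehose, which buffers records and delivers them into the lake's S3 bucket. The minimal boto3 sketch below assumes a delivery stream named clickstream-to-datalake already exists and points at your bucket; both names are hypothetical.

```python
import json
import boto3

# Hypothetical delivery stream that already exists and writes to the data lake's S3 bucket.
STREAM_NAME = "clickstream-to-datalake"

firehose = boto3.client("firehose")

# Push a single event; Firehose buffers records and delivers them to S3 in batches.
event = {"user_id": "1234", "action": "page_view", "page": "/pricing"}
response = firehose.put_record(
    DeliveryStreamName=STREAM_NAME,
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
print(response["RecordId"])
```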

Security in a Data Lake

  • Secure all your data sources within and outside AWS — this is probably the most important point because, at this stage, your data is in its most complete form, and a compromise here means everything is exposed, even if you keep the data secure once it is inside AWS.
  • Keep private data private — masking data renders it useless to people who aren’t meant to see it, so important information stays hidden and the impact of a compromise is limited. Amazon Macie can be utilized to detect sensitive data.
  • Control access to staged data using roles — the principle of least privilege has been saving lives. Your data has probably been transformed through different stages, and not everyone needs access to the raw form or to a particular stage of processing/transformation. Take note of that and grant access only to what is needed (see the sketch after this list).
  • Monitor your data lake with Amazon CloudWatch and AWS CloudTrail.
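
As a rough illustration of the least-privilege point above, here is a minimal boto3 sketch that attaches an inline policy letting an analytics role read only the curated prefix of the lake bucket, keeping the raw zone out of reach. The role name, bucket, and prefixes are hypothetical placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical names; substitute your own role and bucket.
ROLE_NAME = "analytics-reader"
BUCKET = "my-data-lake-bucket"

# Allow reads from the curated/ prefix only; raw/ stays out of reach.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/curated/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*"]}},
        },
    ],
}

iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="curated-read-only",
    PolicyDocument=json.dumps(policy),
)
```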

Optimizing your Data Lake

A major way to optimize your data lake is to use lifecycle management in Amazon S3 and to transform your data into columnar, compressed file formats. This not only saves you a lot of money in the long run, it also improves processing times: data you do not currently need to process is moved out of the way, and the data you do process is stored in a format known for faster query performance.
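
To make the lifecycle half of that concrete, here is a minimal boto3 sketch that moves objects under a raw/ prefix to a colder storage class after 90 days. The bucket name, prefix, and retention window are assumptions you would tune to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; tune the transition window to your access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Move raw data you rarely touch to Glacier after 90 days.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```

The columnar-format half is typically handled by an ETL step, such as a Glue job, that writes Parquet or ORC instead of CSV.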

AWS Lake Formation

We cannot conclude a discussion on data lakes in AWS without mentioning AWS Lake Formation. Building a data lake can be complicated when you look at the many types of data you can have and the different parts of a data lake. AWS Lake Formation handles some of the challenging steps for you: data ingestion, the data stores, and the catalog and processing layers can be completely managed by Lake Formation. All you need to provide is the data from your data sources, and you can then search, query, and eventually visualize your data. This sounds too good to be true, right? All that unstructured data from multiple sources? Yes, it works!


[Figure: Three (3) stages of data lake setup]

Blueprints and Workflows in AWS Lake Formation

An AWS Lake Formation blueprint is a template that can be used to create workflows for common data sources. An AWS Lake Formation workflow orchestrates extract, transform, and load (ETL) activities in the data lake. This is basically what the AWS Glue service does (ETL operations), and you probably guessed already that workflows are built on Glue jobs and crawlers. You can create workflows manually in AWS Glue, but if you want to automate the process, blueprints are the way forward.
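
To make the manual Glue route concrete, here is a rough boto3 sketch that creates a crawler to catalog the raw zone, an empty workflow, and a scheduled trigger that starts the crawler. The names, IAM role, S3 path, and schedule are all hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names, role, and paths.
CRAWLER = "raw-zone-crawler"
WORKFLOW = "nightly-ingest"
ROLE_ARN = "arn:aws:iam::123456789012:role/GlueServiceRole"

# Crawler that discovers the schema of data landing in the raw zone.
glue.create_crawler(
    Name=CRAWLER,
    Role=ROLE_ARN,
    DatabaseName="data_lake_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
)

# Workflow container that the trigger and crawler hang off.
glue.create_workflow(Name=WORKFLOW)

# Scheduled trigger that kicks off the crawler every night at 01:00 UTC.
glue.create_trigger(
    Name="nightly-crawl",
    WorkflowName=WORKFLOW,
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"CrawlerName": CRAWLER}],
    StartOnCreation=True,
)
```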

By the point you are considering blueprints, you already understand how data is added to your data source(s) and the best approach to discovering and ingesting data into your data lake. Using blueprints can remove your manual involvement in that step, and the workflow can run on the data based on triggers that have been set. There are three (3) types of blueprint you can use in AWS Lake Formation:

  1. Database snapshot  — This is advised when existing data could have changed. You can use exclusion patterns to leave out data you don’t want to process.
  2. Incremental database  — This is advised when existing data has not been modified and does not need to be processed again. An example of this type of data is time-stamped data that you are sure only has new rows added to it.
  3. Log file  — This is the option to use when you have log file sources.

Visualizing Data with Amazon QuickSight

Earlier on, I mentioned that third-party data visualization tools can be used to visualize data from your data lake, but if you want an AWS service, you can use Amazon QuickSight to achieve this.

Amazon QuickSight is a business intelligence service provided by AWS. You can generate insights dynamically and embed your visualizations in applications you have built, even outside AWS. This is a nice end to all that data lake processing: you can “export” your visualizations anywhere you need them and are not restricted to viewing your results only on AWS.

Imagine having interactive dashboards in your application without having to write the logic in your application code. All the work has been set up in your data lake and is being served to you by Amazon QuickSight.
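
A sketch of what that embedding could look like with boto3 is below: the call returns a URL that your application can load in an iframe. The account ID, user ARN, and dashboard ID are placeholders, not real identifiers.

```python
import boto3

quicksight = boto3.client("quicksight")

# Hypothetical identifiers; replace with your account, registered user, and dashboard.
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "sales-overview-dashboard"}
    },
    SessionLifetimeInMinutes=60,
)

# Embed this URL in an iframe in your application.
print(response["EmbedUrl"])
```

Your application only renders the returned URL; the dashboard itself stays managed in QuickSight.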

