<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joël FARVAULT</title>
    <description>The latest articles on DEV Community by Joël FARVAULT (@jol_farvault_72301b8e349).</description>
    <link>https://dev.to/jol_farvault_72301b8e349</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F634328%2F713a6414-1f97-4cac-94af-661f7c3168c8.jpeg</url>
      <title>DEV Community: Joël FARVAULT</title>
      <link>https://dev.to/jol_farvault_72301b8e349</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jol_farvault_72301b8e349"/>
    <language>en</language>
    <item>
      <title>Architecture options for building a basic Data Lake on AWS - Part 1</title>
      <dc:creator>Joël FARVAULT</dc:creator>
      <pubDate>Wed, 02 Jun 2021 11:21:53 +0000</pubDate>
      <link>https://dev.to/aws-builders/architecture-options-for-building-a-basic-data-lake-on-aws-part-1-18hc</link>
      <guid>https://dev.to/aws-builders/architecture-options-for-building-a-basic-data-lake-on-aws-part-1-18hc</guid>
      <description>&lt;p&gt;This article is a result of a chat discussion with &lt;a class="mentioned-user" href="https://dev.to/aditmodi"&gt;@aditmodi&lt;/a&gt;, Willian ‘Bill’ Rocha, Kevin Peng, Rich Dudley, Patrick Orwat and Welly Tambunan. Any other contributors are welcome 😊&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;data lake&lt;/strong&gt; is the key foundation of data analytics. The data lake is the central repository that can store structured data (such as tabular or relational data), semi-structured data (key/value or document) and unstructured data (such as pictures or audio). &lt;br&gt;
The data lake is scalable and provides the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all data stored in a single place with a low-cost model&lt;/li&gt;
&lt;li&gt;support for a wide variety of data formats&lt;/li&gt;
&lt;li&gt;fast data ingestion and consumption&lt;/li&gt;
&lt;li&gt;schema-on-read instead of the traditional schema-on-write&lt;/li&gt;
&lt;li&gt;decoupling of storage and compute&lt;/li&gt;
&lt;/ul&gt;
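&lt;p&gt;To make the schema-on-read idea concrete, here is a minimal Python sketch (the records and field names are hypothetical, not from any real pipeline): raw files keep whatever shape they arrived in, and a schema is only projected onto them at query time.&lt;/p&gt;

```python
import json

# Hypothetical raw records as they might land in a data lake: each file
# keeps its original shape, and nothing is enforced on write.
raw_records = [
    '{"user_id": 1, "event": "click", "ts": "2021-06-01T10:00:00Z"}',
    '{"user_id": 2, "event": "view"}',  # a missing field is fine on write
]

def apply_schema(line, schema):
    """Schema-on-read: project a raw record onto the schema at query time,
    filling absent fields with None instead of rejecting the record."""
    record = json.loads(line)
    return {field: record.get(field) for field in schema}

schema = ["user_id", "event", "ts"]
rows = [apply_schema(line, schema) for line in raw_records]
```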

&lt;p&gt;Building a data lake implies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setup the right storage (and the corresponding lifecycle management)&lt;/li&gt;
&lt;li&gt;define the solution for data movement&lt;/li&gt;
&lt;li&gt;clean, catalog and prepare the data&lt;/li&gt;
&lt;li&gt;configure policies and a governance &lt;/li&gt;
&lt;li&gt;make data available for visualization &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data lake is a very powerful solution for big data, but it has some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data management is challenging because a data lake stores data as a “bunch of files” in different formats&lt;/li&gt;
&lt;li&gt;managing ACID transactions or rollbacks requires writing specific ETL/ELT logic&lt;/li&gt;
&lt;li&gt;query performance depends on choosing the data format wisely (such as Parquet or ORC)&lt;/li&gt;
&lt;/ul&gt;
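&lt;p&gt;To illustrate why columnar formats such as Parquet or ORC help query performance, here is a pure-Python sketch (the sample rows are invented for illustration): once records are pivoted into one list per column, an aggregation only has to scan a single column instead of reading every full row.&lt;/p&gt;

```python
# Row-oriented records, as a data lake might store JSON lines.
rows = [
    {"id": 1, "country": "FR", "amount": 120.0},
    {"id": 2, "country": "DE", "amount": 75.5},
    {"id": 3, "country": "FR", "amount": 42.0},
]

def to_columnar(rows):
    """Pivot row-oriented records into a columnar layout: one list per
    column, which is the core idea behind Parquet and ORC."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

columns = to_columnar(rows)
# An aggregation now scans a single column instead of every row:
total = sum(columns["amount"])
```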

&lt;p&gt;A first alternative is the &lt;strong&gt;lakehouse&lt;/strong&gt; architecture, which brings together the best of two worlds: the data lake (low-cost object store) and the data warehouse (transactions and data management). The lakehouse provides a metadata layer on top of the data lake (object) storage that defines which objects are part of a table version. This metadata layer makes ACID transactions possible while the data itself stays in low-cost data lake storage.&lt;/p&gt;

&lt;p&gt;Another alternative that fits well in complex and distributed environments is the &lt;strong&gt;data mesh&lt;/strong&gt; architecture. Unlike a “monolithic” central data lake, which handles consumption, storage and transformation in one place, a data mesh architecture supports distributed, domain-specific data consumers and views “data-as-a-product,” with each domain handling its own data pipelines. &lt;br&gt;
The domains are all connected through an interoperability layer that applies the same syntax and data standards. The data mesh architecture is based on a few principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain-oriented data owners and pipelines&lt;/li&gt;
&lt;li&gt;Self-serve functionality&lt;/li&gt;
&lt;li&gt;Interoperability and standardization of communications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decentralized architecture brings more autonomy and flexibility.&lt;/p&gt;

&lt;p&gt;Each data architecture model has benefits and shortcomings; there is no universally good or bad approach, as the right choice depends on the context and the use cases. Let’s evaluate how these data architecture patterns can be applied using a concrete example.&lt;/p&gt;

&lt;h2&gt;Requirements for a data platform&lt;/h2&gt;

&lt;p&gt;Your company needs a data repository with the following expectations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest data manually or from data pipelines&lt;/li&gt;
&lt;li&gt;produce ML models, QuickSight dashboards or external APIs from the output data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What is the best AWS architecture for these requirements? &lt;/p&gt;

&lt;h2&gt;Data Lake architecture&lt;/h2&gt;

&lt;p&gt;The diagram below presents a first solution for the requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrb03c9vokczn5mok98m.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrb03c9vokczn5mok98m.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS data lake architecture is based on several components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The data ingestion / collection&lt;/strong&gt; layer connects different data sources through batch or real-time modes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/dms/" rel="noopener noreferrer"&gt;AWS Data Migration Services&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;one-time migration of a database (cloud or on-premises) and replication of ongoing changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/aws-transfer-family/" rel="noopener noreferrer"&gt;AWS Transfer Family&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;transfer of data over SFTP ➡️ &lt;a href="https://dev.to/aws-builders/aws-transfer-family-ftp-for-efs-and-s3-32e5"&gt;article dev.to&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/kinesis/data-firehose/" rel="noopener noreferrer"&gt;Kinesis Firehose&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;fully managed delivery of real-time streaming data to Amazon S3 ➡️ &lt;a href="https://dev.to/aws-builders/event-streaming-and-aws-kinesis-4877"&gt;article dev.to&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;serverless, event-based data integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
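&lt;p&gt;As an illustration of the Lambda-based ingestion pattern, here is a hedged sketch of an S3-triggered handler. The bucket names are hypothetical and the actual boto3 copy call is left as a comment; the only real AWS detail relied on is that S3 event notifications deliver URL-encoded object keys.&lt;/p&gt;

```python
import urllib.parse

def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload.
    Keys arrive URL-encoded in the event, so decode them before use."""
    return [
        (
            record["s3"]["bucket"]["name"],
            urllib.parse.unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in event.get("Records", [])
    ]

def handler(event, context):
    """Hypothetical Lambda entry point: move each new landing-zone object
    into the raw bucket (the boto3 call is sketched as a comment)."""
    pairs = parse_s3_event(event)
    for bucket, key in pairs:
        # boto3.client("s3").copy_object(
        #     Bucket="my-raw-bucket", Key=key,
        #     CopySource={"Bucket": bucket, "Key": key})
        print(f"would ingest s3://{bucket}/{key}")
    return {"ingested": len(pairs)}
```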

&lt;p&gt;&lt;strong&gt;2. The data storage&lt;/strong&gt; layer holds massive amounts of data in raw format. This is the core of the data lake. The best storage service for this purpose is &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;. &lt;br&gt;
As highlighted in the diagram, the best approach is to create dedicated buckets for the landing zone, raw data and cleaned data.&lt;/p&gt;
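&lt;p&gt;For the lifecycle management mentioned earlier, a rule like the sketch below could transition aging landing-zone objects to cheaper storage and eventually expire them once they have been promoted to the raw bucket. The bucket name, storage class and retention periods are illustrative assumptions, not recommendations from this article.&lt;/p&gt;

```python
# A sketch of an S3 lifecycle configuration for a hypothetical
# landing-zone bucket: move objects to infrequent-access storage after
# 30 days, delete them after 90. The exact periods depend on your needs.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-landing-zone",
            "Filter": {"Prefix": ""},  # apply to every object in the bucket
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
            "Expiration": {"Days": 90},
        }
    ]
}

# Applying it would look like this (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-landing-zone",
#     LifecycleConfiguration=lifecycle_configuration)
```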

&lt;p&gt;&lt;strong&gt;3. The data cleaning and preparation&lt;/strong&gt; step involves partitioning, indexing, cataloging and transforming data (especially into a columnar format for performance optimization). The data catalog is created or updated automatically by a crawler that extracts the schemas and adds metadata tags to the catalog. The AWS options are AWS Glue or Glue DataBrew.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ETL/ELT service that can extract both data and metadata to build catalogs and perform transformations. You can either use the UI or author your own scripts in Python or Scala (running on Apache Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/glue/features/databrew/" rel="noopener noreferrer"&gt;AWS Glue DataBrew&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;a visual data preparation tool for profiling, transforming, cleaning and normalizing data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can also use other tools, such as &lt;a href="https://www.amundsen.io/" rel="noopener noreferrer"&gt;Amundsen&lt;/a&gt; for intuitive data discovery and business data cataloguing. &lt;a href="https://www.talend.com/" rel="noopener noreferrer"&gt;Talend&lt;/a&gt; can also be an option for ETL.&lt;/p&gt;
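&lt;p&gt;As a sketch of how the crawler described above could be defined with boto3, the parameters below target a raw-data bucket and populate the Glue Data Catalog on a nightly schedule. The database name, IAM role, bucket and schedule are all illustrative assumptions.&lt;/p&gt;

```python
# Hypothetical parameters for a Glue crawler that scans the raw bucket,
# infers schemas and registers the resulting tables in the Data Catalog.
crawler_params = {
    "Name": "raw-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed role
    "DatabaseName": "my_datalake_db",
    "Targets": {"S3Targets": [{"Path": "s3://my-raw-bucket/"}]},
    # Run nightly so new partitions and schema changes are picked up.
    "Schedule": "cron(0 2 * * ? *)",
}

# Creating and starting it would look like this (needs boto3 + credentials):
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```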

&lt;p&gt;&lt;strong&gt;4. The data processing and analytics&lt;/strong&gt; step creates insights from data (following the rule « garbage in, garbage out », it is recommended to use cleaned data to get the best insights). The data is structured and analyzed to identify information or to support decision-making. &lt;br&gt;
The data processing can be done through &lt;strong&gt;batch mode&lt;/strong&gt;, &lt;strong&gt;interactive analysis&lt;/strong&gt;, &lt;strong&gt;streaming&lt;/strong&gt; or &lt;strong&gt;machine learning&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch mode&lt;/th&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Querying or processing large amounts of data on a regular frequency (daily, weekly or monthly)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; - &lt;a href="https://aws.amazon.com/emr/" rel="noopener noreferrer"&gt;EMR&lt;/a&gt; - &lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;Redshift&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interactive analysis&lt;/th&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ad hoc querying for data exploration and analysis&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/athena/" rel="noopener noreferrer"&gt;Athena&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Streaming&lt;/th&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingest and analyze a sequence of data continuously generated in high volume and/or high velocity&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/kinesis/data-analytics/" rel="noopener noreferrer"&gt;Kinesis Data Analytics&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine Learning&lt;/th&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Perform on-demand data computation using inference endpoints&lt;/td&gt;
&lt;td&gt;SageMaker inference endpoints ➡️ &lt;a href="https://dev.to/aws-builders/automate-sagemaker-machine-learning-inference-pipeline-in-a-serverless-way-bpk"&gt;article dev.to&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Based on the data lake requirements, the first approach for the data analytics &amp;amp; processing leverages the following AWS services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnbp7ns2s6s2vsu2tp5i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnbp7ns2s6s2vsu2tp5i.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Athena&lt;/strong&gt; for interactive analysis. Athena is a serverless query service (based on Apache Presto) that lets you run SQL queries against relational or non-relational data stored in Amazon S3 (and even outside S3 with Athena Federated Query). Athena is well suited for use cases such as ad hoc queries, data exploration or integration with BI tools.&lt;/p&gt;
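&lt;p&gt;A minimal sketch of running such an ad hoc query with boto3 is shown below. The database, table and output location are hypothetical; the one real constraint reflected here is that Athena writes its result files to the S3 output location you provide, and that queries are asynchronous (start, then poll for a terminal state).&lt;/p&gt;

```python
import time

# Hypothetical query against a table that a Glue crawler registered.
query_params = {
    "QueryString": "SELECT country, count(*) AS n FROM events GROUP BY country",
    "QueryExecutionContext": {"Database": "my_datalake_db"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}

def run_query(athena, params, delay=1.0):
    """Start the query and poll until Athena reports a terminal state."""
    execution_id = athena.start_query_execution(**params)["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return execution_id, state
        time.sleep(delay)

# With boto3 and credentials this would be:
# athena = boto3.client("athena")
# execution_id, state = run_query(athena, query_params)
# rows = athena.get_query_results(QueryExecutionId=execution_id)
```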

&lt;p&gt;&lt;strong&gt;SageMaker inference&lt;/strong&gt; for machine learning. This service executes ML models deployed through SageMaker endpoints.&lt;/p&gt;
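&lt;p&gt;As a sketch, invoking a deployed endpoint with boto3 could look like the following. The endpoint name and payload shape are assumptions: they depend entirely on how the model was deployed and which serializer it expects.&lt;/p&gt;

```python
import json

# Hypothetical JSON payload for a model expecting one feature vector.
payload = {"instances": [[0.5, 1.2, 3.4]]}
body = json.dumps(payload)

# With boto3 and credentials, the invocation would be:
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="my-model-endpoint",   # assumed endpoint name
#     ContentType="application/json",
#     Body=body,
# )
# prediction = json.loads(response["Body"].read())
```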

&lt;p&gt;This architecture also makes it possible to expose some data through &lt;strong&gt;API Gateway&lt;/strong&gt; using a &lt;strong&gt;Lambda&lt;/strong&gt; function and Athena.&lt;/p&gt;

&lt;p&gt;Another option for the data lake architecture is to use &lt;strong&gt;Redshift&lt;/strong&gt; as a curated data warehouse. This approach will be described in &lt;em&gt;part 2, “the lakehouse architecture”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The data visualization&lt;/strong&gt; layer provides a view of the data for your data consumers. The purpose of &lt;a href="https://aws.amazon.com/quicksight/" rel="noopener noreferrer"&gt;Amazon QuickSight&lt;/a&gt; is to empower your data consumers (especially business users) to build their own visualizations. QuickSight is powered by SPICE (Super-Fast, Parallel, In-Memory Calculation Engine) and makes it easy to build dashboards and share them with other users. &lt;br&gt;
As highlighted in the proposed architecture, QuickSight is integrated with Athena, but it can be integrated with other data sources.&lt;/p&gt;

&lt;p&gt;This article provides only a high-level view of the topic; for more information, I recommend these very useful dev.to posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/aws-data-lake-with-terraform-part-1-of-6-4bf1"&gt;Building a data lake with Terraform&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/valaug"&gt;@valaug&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/awscommunity-asean/introduction-to-the-aws-big-data-portfolio-2539"&gt;Presentation of the Big Data Portfolio&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/klescosia"&gt;@klescosia&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>analytics</category>
      <category>database</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
