This blog post shares my personal experience and work on a project that closely aligns with work at my current employer. Feeling unfamiliar with the technical jargon and concepts being discussed, I decided to take a proactive approach by engaging in hands-on learning. I logged into my AWS (Amazon Web Services) account, provisioned the necessary resources, developed the required code, and set out to gain a deeper understanding of the project. This article chronicles my journey, highlighting the challenges faced, the lessons learned, and the ultimate satisfaction of acquiring practical knowledge and expertise in the field.
By sharing my story, I hope to inspire others to adopt a similar approach and embrace the power of experiential learning in their professional pursuits.
Building a Simple Data Lake on AWS: Harnessing the Power of Glue, Athena, RDS, and S3
I highlight the process of constructing a basic data lake on Amazon Web Services (AWS) by leveraging a combination of powerful services. Below is the list of services:
AWS SERVICES
• Amazon Simple Storage Service (Amazon S3)
• AWS Glue Studio
• AWS Glue Data Catalog
• AWS Glue Connections
• AWS Glue Crawlers
• AWS Glue Jobs
• Amazon Athena
• AWS CLI
To better understand the scope of this project, it's best that I explain what is out of scope.
Architecture (out of scope)
• Change Data Capture (CDC): Handling changes to systems of record
• Transactional Data Lake: Table formats like Apache Hudi, Apache Iceberg, Delta Table
• Fine-grained Authorization: database-, table-, column-, and row-level permissions
• Data Lineage: Tracking data as it flows from data sources to consumption
• Data Governance: Managing the availability, usability, integrity and security of the data
• Streaming Data: Data that is generated continuously
• Data Inspection: Scanning data for sensitive or unexpected content (PII)
• DataOps: Automating testing, deployment, execution of data pipelines
• Infrastructure as Code (IaC): Infrastructure provisioning automation
• Data Lake Tiered Storage
• Backup, HA, and DR
Understanding Data Lakes: Empowering Data-Driven Insights
A data lake is a centralized storage repository that allows organizations to store and manage vast amounts of raw and unstructured data.
Unlike traditional data storage systems, data lakes accommodate data in its original format, without the need for upfront structuring or transformation.
Databricks' medallion architecture offers a framework to follow. It defines three layers, Bronze, Silver, and Gold, for managing and processing data efficiently.
The Bronze layer serves as the foundation, storing raw data directly from various sources. It provides a low-cost and reliable storage solution.
The Silver layer focuses on data transformation and data quality checks. Here, data is cleaned, organized, and prepared for analysis.
The Gold layer represents the final stage, where data is enriched, aggregated, and made available for business intelligence and advanced analytics. It provides a curated and optimized dataset for decision-making and extracting valuable insights.
These layers work together to streamline the data pipeline and enable effective data analysis and decision-making processes.
Data Lake Naming Conventions
+-------------+---------------------------------------------------------------------+
| Prefix | Description |
+-------------+---------------------------------------------------------------------+
| source_ | Data source metadata (Amazon RDS) |
| bronze_ | Bronze/Raw data from data sources |
| silver_ | Silver/Augmented data - raw data with initial ELT/cleansing applied |
| gold_ | Gold/Curated data - aggregated/joined refined data |
+-------------+---------------------------------------------------------------------+
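To make these conventions concrete, here is a small sketch of how they might map onto S3 prefixes and Glue Data Catalog table names. The bucket and table names are illustrative placeholders, not the exact ones used in this project.

```python
# A minimal sketch of the naming conventions above applied to hypothetical
# S3 prefixes and Glue Data Catalog table names.

DATA_LAKE_BUCKET = "my-data-lake-demo"  # hypothetical bucket name

LAYER_PREFIXES = {
    "bronze": f"s3://{DATA_LAKE_BUCKET}/bronze/",  # raw copies of source tables
    "silver": f"s3://{DATA_LAKE_BUCKET}/silver/",  # cleansed/augmented data
    "gold":   f"s3://{DATA_LAKE_BUCKET}/gold/",    # curated, aggregated data
}

# Example catalog table names for the TICKIT "event" table as it moves through the layers
source_table = "source_tickit_event"   # metadata for the Amazon RDS source
bronze_table = "bronze_tickit_event"   # raw data landed in S3
silver_table = "silver_tickit_event"   # cleansed data
gold_table   = "gold_event_sales"      # aggregated/joined data
```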
Most of the sample data stored in the databases comes from AWS's TICKIT sample database. This small database consists of seven tables: two fact tables and five dimension tables. You can load the TICKIT dataset from CSV files into your RDS instances to start extracting, transforming, and loading the data into your S3 buckets. You can find the link to the dataset below:
https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
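If you want a starting point for loading one of the pipe-delimited TICKIT files into an RDS PostgreSQL instance, here is a rough psycopg2 sketch. The endpoint, credentials, table, and file name are placeholders you would replace with your own values.

```python
# A rough sketch of loading one TICKIT file into an RDS PostgreSQL instance.
# Host, credentials, table name, and file path are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-postgres.xxxxxxxx.us-east-1.rds.amazonaws.com",  # hypothetical RDS endpoint
    dbname="tickit",
    user="admin",
    password="REPLACE_ME",
)

with conn, conn.cursor() as cur, open("allevents_pipe.txt") as f:
    # The TICKIT sample files are pipe-delimited; COPY streams the file into the table
    cur.copy_expert("COPY event FROM STDIN WITH (FORMAT csv, DELIMITER '|')", f)

conn.close()
```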
A while back, I had little knowledge about data models until I stumbled upon a fascinating illustration. It was during my exploration of the physical model design that I started to grasp the concept better. This stage proved crucial in understanding the intricate relationships, variable formatting, and schema that greatly influenced the work I was involved in.
Here is a conceptual model of the database tables that AWS uses.
The table below provides an illustration of how the RDS databases, including PostgreSQL, MySQL, and MS SQL, can represent systems commonly used by businesses. Specifically, we highlight an event management system, an e-commerce platform, and a customer relationship management (CRM) platform. This analysis helps us gain insights into the data tracking requirements of a company and the reasons behind capturing specific information about its business operations.
Data Preparation & Instructions
Step 1:
We'll use three AWS Glue crawlers and AWS Glue connections to talk to our three data sources: the PostgreSQL, MySQL, and SQL Server databases. The crawlers will catalog the seven tables across the three Amazon RDS databases into our AWS Glue Data Catalog.
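Below is a minimal boto3 sketch of what registering one of these connections and crawlers might look like. The connection name, JDBC URL, credentials, subnet, security group, IAM role, and catalog database name are all hypothetical placeholders, not the exact values used in this project.

```python
# Sketch: register a JDBC connection and a crawler for one RDS source with boto3.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# JDBC connection to the PostgreSQL source (placeholder values)
glue.create_connection(
    ConnectionInput={
        "Name": "tickit-postgresql-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-postgres.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/tickit",
            "USERNAME": "admin",
            "PASSWORD": "REPLACE_ME",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

# Crawler that catalogs the source tables into the Glue Data Catalog
glue.create_crawler(
    Name="tickit-postgresql-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="tickit_demo",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "tickit-postgresql-connection", "Path": "tickit/public/%"}
        ]
    },
    TablePrefix="source_",
)

glue.start_crawler(Name="tickit-postgresql-crawler")
```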
Step 2:
We'll copy the data from our three data sources (the seven tables across the three databases) into the bronze, or raw, area of our data lake using a series of AWS Glue jobs based on Apache Spark and written in Python. The raw data will be stored in Apache Parquet format and cataloged in our AWS Glue Data Catalog.
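Here is a simplified sketch of what one of these bronze-layer Glue jobs might look like in PySpark: it reads a table that the crawler cataloged and lands it in S3 as Parquet. The catalog database, table name, and bucket path are placeholders, not the exact ones used in the project.

```python
# Sketch of a bronze-layer Glue job: catalog table -> S3 Parquet.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that the crawler registered in the Data Catalog
source_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="tickit_demo",
    table_name="source_tickit_public_event",
)

# Write the raw copy to the bronze area of the data lake as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=source_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-demo/bronze/tickit/event/"},
    format="parquet",
)

job.commit()
```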
Step 3:
We'll cleanse, augment, and prepare the data for analytics using another series of AWS Glue jobs. The data will be written into the silver area of our data lake, also in Apache Parquet format. Once again, the refined (silver) data will be cataloged in our AWS Glue Data Catalog.
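The sketch below shows the kind of cleansing a silver-layer job might apply, using the TICKIT event table as an example: drop duplicates, trim strings, enforce types, and filter out bad rows. The S3 paths are placeholders, and the exact transformations in the project's jobs may differ.

```python
# Sketch of a silver-layer transformation: bronze Parquet -> cleansed silver Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-tickit-event").getOrCreate()

bronze_df = spark.read.parquet("s3://my-data-lake-demo/bronze/tickit/event/")

silver_df = (
    bronze_df.dropDuplicates(["eventid"])                            # remove duplicate rows
    .withColumn("eventname", F.trim(F.col("eventname")))             # clean string columns
    .withColumn("starttime", F.col("starttime").cast("timestamp"))   # enforce types
    .filter(F.col("eventid").isNotNull())                            # basic data quality check
)

silver_df.write.mode("overwrite").parquet("s3://my-data-lake-demo/silver/tickit/event/")
```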
Step 4:
Lastly, we will use Amazon Athena to produce curated datasets by joining several tables in the silver area of our data lake. We will produce multiple views of the data and partition it based on the most common query and filtering patterns of our end users. These curated datasets will be written back to the gold, or curated, section of our data lake as partitioned Apache Parquet files.
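Here is a hedged example of the kind of Athena CTAS (CREATE TABLE AS SELECT) query that can produce a partitioned, curated gold dataset from silver tables, submitted via boto3. The table, bucket, and database names are placeholders; the join mirrors the TICKIT sales/event relationship rather than the project's exact queries.

```python
# Sketch: run an Athena CTAS query that writes a partitioned gold dataset to S3.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas_query = """
CREATE TABLE gold_sales_by_event
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake-demo/gold/sales_by_event/',
    partitioned_by = ARRAY['sale_month']
) AS
SELECT e.eventname,
       SUM(s.qtysold)   AS tickets_sold,
       SUM(s.pricepaid) AS revenue,
       date_format(s.saletime, '%Y-%m') AS sale_month
FROM silver_tickit_sales s
JOIN silver_tickit_event e ON s.eventid = e.eventid
GROUP BY e.eventname, date_format(s.saletime, '%Y-%m')
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "tickit_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-demo/athena-results/"},
)
```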
FINAL SUMMARY
In this demonstration we built a Data Lake on AWS:
1. Using AWS Glue and Amazon Athena, we extracted data from our Amazon RDS databases, which represented our enterprise systems (MySQL, PostgreSQL, and SQL Server).
2. We combined all of that data in the bronze bucket of our data lake (raw data).
3. We then refined, augmented, and cleansed that data and wrote it into the silver bucket of our data lake (structured data).
4. We then used the silver data to create curated datasets, performing complex joins, aggregations, and SQL functions, and wrote the results into the gold, or curated, area of our data lake (aggregated data).
5. To finish the demonstration, we looked at some ways to write more efficient queries against our data lake using Amazon Athena.
Lessons Learned
Embarking on this project was not without its fair share of obstacles and hurdles. This section aims to shed light on the various issues and struggles encountered throughout the journey towards project completion. To be quite frank, this is where I list out all the complaints and frustrations I had along the way. Enjoy, haha:
- Difficulty in understanding and configuring AWS Glue's various components and services.
- Issues in correctly configuring AWS Glue crawlers to establish connections with data sources.
- Challenges in setting up the necessary IAM roles and permissions for AWS Glue jobs.
- Lack of clear documentation or examples for specific use cases, leading to trial and error.
- Compatibility issues between different versions of AWS Glue and related dependencies.
- Troublesome configuration of database connection parameters such as credentials and endpoint URLs.
- Incompatibility between the IDE and database drivers, resulting in connection failures.
- Insufficient knowledge of database-specific connection options and configurations.
- Difficulty in troubleshooting connection issues due to limited error messages or log details.
- Delays caused by the need to navigate complex networking setups involving security groups, subnets, and internet gateways.
- Confusion in properly configuring security group rules to allow database connections from specific IP addresses or ranges.
- Lack of familiarity with network access control lists (ACLs) and their impact on connectivity.
- Misconfigurations of route tables and subnets, leading to failed network communications.
- Troublesome configuration of NAT gateways or instances for outbound internet access from private subnets.
- Firewall restrictions preventing successful connection establishment.
- Network latency issues affecting the responsiveness and performance of the database connection.
- Difficulties in maintaining consistent and reliable connectivity across different availability zones.
Despite the numerous challenges faced throughout this project, the journey has been incredibly rewarding. By diving headfirst into hands-on learning and persevering through the struggles, I have gained valuable insights, expertise, and a sense of accomplishment. This experience serves as a testament to the power of determination, adaptability, and the willingness to learn by doing.