September 15, 2020 - Forrest Brazeal announced this new challenge. The goal of this challenge is to automate an ETL processing pipeline for COVID-19 data using Python and cloud services. Just like his previous challenge, The Cloud Resume Challenge, this challenge will help folks like me, who have neither current cloud experience nor a cloud profession, to understand cloud technologies better and to use the result in interviews or a portfolio to help land a cloud job!
To be honest, even though I was busy with something else at the time and had no concrete knowledge of anything like this, it still excited me and I felt that I needed to work on this challenge.
First, I immediately opened the AWS console and set up the scheduled processing job using CloudWatch rules. However, I saw that I would need to build my infrastructure as code (IaC) in the later steps, so I immediately set up my environment. A lot of things didn't go as expected, though. Well, I guess that applies to everything you want to do: the first step is the hardest, but we need to keep going!
EXTRACTION and TRANSFORMATION
Initially, I attempted to use the csv library that python3 already has, but after searching further I saw that the pandas library is much better at handling these kinds of datasets. Here's a blog that will help you get started.
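As a rough illustration of why pandas is convenient here, the sketch below reads a small CSV, parses the Date column, and drops the rows that have no usable date. The sample data, the function name, and every column name except Date are made up for the example; this is not the exact code from this project.

```python
import io

import pandas as pd

# Made-up sample standing in for the real COVID-19 CSV; "cases" and "deaths"
# are assumed column names used only for illustration.
SAMPLE_CSV = """Date,cases,deaths
2020-01-21,1,0
2020-01-22,1,0
,5,1
not-a-date,7,2
"""


def extract_and_clean(csv_text: str) -> pd.DataFrame:
    """Read CSV text, parse the Date column, and drop rows without a usable date."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Missing or unparseable dates become NaT instead of raising an error.
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    return df.dropna(subset=["Date"]).reset_index(drop=True)


clean = extract_and_clean(SAMPLE_CSV)
print(clean)
```

Doing the same with the csv module would mean looping over the rows and validating each date by hand, which is the part pandas takes care of.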
I made a package with multiple modules to abstract the code from the previous steps away from my Lambda handler. The challenge also gave us a hint by linking this.
I preferred to use DynamoDB as I wanted to learn more about it and explore it further. I converted my transformed and cleaned data frame to JSON so I could easily load it into DynamoDB.
for entry in json:
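A loop like the one above could be fleshed out roughly as follows. This is a sketch under assumptions: the table is expected to behave like a boto3 DynamoDB Table resource (anything with a put_item(Item=...) method works), and load_entries, FakeTable, and the sample fields are hypothetical names for illustration. One detail worth knowing is that boto3 does not accept Python floats for DynamoDB numbers, so the JSON is parsed with Decimal.

```python
import json
from decimal import Decimal


def load_entries(table, records_json: str) -> None:
    """Put every record from a JSON array into the given table."""
    # parse_float=Decimal because boto3 rejects Python floats for DynamoDB numbers.
    entries = json.loads(records_json, parse_float=Decimal)
    for entry in entries:
        table.put_item(Item=entry)


class FakeTable:
    """Stand-in for a boto3 Table so the sketch runs without AWS credentials."""

    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)


# With real AWS access the table would come from something like:
#   table = boto3.resource("dynamodb").Table("my-table-name")  # hypothetical name
table = FakeTable()
sample = '[{"Date": "2020-01-21", "cases": 1, "rate": 0.5}]'
load_entries(table, sample)
print(table.items)
```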
I did some further research on this one before I tackled it. At first, I used DynamoDB Streams with StreamViewType: NEW_IMAGE to trigger a Lambda function that sends a notification using SNS. It worked, but it flooded my email: even though I set BatchSize: 1000 (the maximum for DynamoDB Streams) for my Lambda function, I noticed that it sent an email notification for every 10 new items in my DynamoDB table. I ended up setting a counter in my putItem loop and sending the notification after the loop succeeds.
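The counter approach can be sketched like this. It assumes objects that behave like a boto3 Table resource (put_item) and a boto3 SNS client (publish with TopicArn, Subject, and Message); the function name, the fake classes, and the topic ARN are hypothetical and only there so the example runs without AWS access.

```python
def load_and_notify(table, sns, topic_arn: str, entries: list) -> int:
    """Write all entries, then publish a single summary message."""
    written = 0
    for entry in entries:
        table.put_item(Item=entry)
        written += 1  # count the items instead of emailing once per item
    # One publish call after the loop, so subscribers get one email per run.
    sns.publish(
        TopicArn=topic_arn,
        Subject="ETL load finished",
        Message=f"{written} item(s) written to DynamoDB.",
    )
    return written


class FakeTable:
    """Stand-in for a boto3 Table."""

    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)


class FakeSNS:
    """Stand-in for a boto3 SNS client; it only records publish calls."""

    def __init__(self):
        self.published = []

    def publish(self, **kwargs):
        self.published.append(kwargs)


table, sns = FakeTable(), FakeSNS()
rows = [{"Date": f"2020-01-2{day}"} for day in range(1, 4)]
# Hypothetical ARN, used only for illustration.
total = load_and_notify(table, sns, "arn:aws:sns:us-east-1:123456789012:etl-updates", rows)
print(total, sns.published)
```

Because the publish call sits after the loop, an exception inside the loop means no success message is sent at all.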
I surrounded every transformation step that I made with a try/except/finally clause to handle exceptions and notify the interested users. After that, I made two functions: the first one is for the initial load of data to DynamoDB and the second one is for the daily updates to DynamoDB. I performed a scan operation to check whether my DynamoDB table has items, and the scan result decides whether I run the initial load function or the daily update function. Finally, I dropped all the rows in my dataframe that have NaN or no value in the Date column which is my
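The control flow described in this step (scan first, then pick initial load or daily update, with failures reported to subscribers) might look roughly like the sketch below. Everything named here is an assumption for illustration: the table only needs a scan(Limit=...) method like a boto3 Table resource has, and run_pipeline, table_is_empty, and the helper callables are hypothetical.

```python
def table_is_empty(table) -> bool:
    """Scan at most one item; an empty result means nothing has been loaded yet."""
    response = table.scan(Limit=1)
    return len(response.get("Items", [])) == 0


def run_pipeline(table, rows, initial_load, daily_update, notify) -> str:
    """Pick the load mode from the scan result and always report the outcome."""
    status = "failed"
    try:
        if table_is_empty(table):
            initial_load(table, rows)
            status = "initial load succeeded"
        else:
            daily_update(table, rows)
            status = "daily update succeeded"
    except Exception as exc:  # broad on purpose: any failure should reach subscribers
        status = f"failed: {exc}"
    finally:
        notify(status)  # runs on success and on failure
    return status


class FakeTable:
    """Stand-in for a boto3 Table with just enough behaviour for the demo."""

    def __init__(self, items=None):
        self.items = list(items or [])

    def scan(self, Limit=None):
        found = self.items[:Limit] if Limit else self.items
        return {"Items": found, "Count": len(found)}


def put_all(table, rows):
    table.items.extend(rows)


def broken_update(table, rows):
    raise ValueError("bad row")


messages = []
first = run_pipeline(FakeTable(), [{"Date": "2020-01-21"}], put_all, put_all, messages.append)
second = run_pipeline(FakeTable([{"Date": "2020-01-21"}]), [], put_all, broken_update, messages.append)
print(messages)
```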
I searched and read a lot about python3 imports, but I still had an issue importing my test module, so I sought help on the Python Discord and someone tried to help me figure it out. After some time it still didn't work, so I just put my test_***.py in the same directory where my Lambda function lives. In this step I implemented the Unit Testing framework; here's a video to get you started.
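For readers new to Python's unittest module, a test file of this kind can be as small as the sketch below. The function under test, drop_rows_without_date, is a hypothetical stand-in defined inline so the example is self-contained; in a real project it would be imported from the module that sits next to the test file.

```python
import unittest


def drop_rows_without_date(rows):
    """Hypothetical transformation: keep only rows that have a non-empty Date."""
    return [row for row in rows if row.get("Date")]


class DropRowsWithoutDateTest(unittest.TestCase):
    def test_rows_with_missing_date_are_removed(self):
        rows = [
            {"Date": "2020-01-21", "cases": 1},
            {"Date": None, "cases": 5},
            {"cases": 7},
        ]
        expected = [{"Date": "2020-01-21", "cases": 1}]
        self.assertEqual(drop_rows_without_date(rows), expected)

    def test_empty_input_gives_empty_output(self):
        self.assertEqual(drop_rows_without_date([]), [])


# Run the test case programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DropRowsWithoutDateTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In an actual test_*.py file the last three lines are usually replaced by running python -m unittest from the directory that contains the file, which discovers the tests on its own.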
IaC and Source Control
I had already implemented these technologies because I believe they are prerequisites when you start developing an application, especially in the cloud.
It surprised me that AWS QuickSight doesn't support DynamoDB! But I had gotten too far to change my data source and logic.
There are workarounds, but they are costly. It took me a while to decide whether to stick with native AWS technologies and build a data pipeline using DynamoDB Streams -> Lambda -> S3 -> Athena -> QuickSight. Honestly, that is more complicated than I thought, so I searched further and looked at Tableau, Qlik, and many more apps until I found Redash. For me, these 3 apps would all fulfill this step, but I chose the free one. I chose Redash because it gives you a 30-day trial without registering a credit card. While integrating my data source (DynamoDB) with Redash I also learned DynamoDB Query Language; as the name says, it's a simple, SQL-ish language for DynamoDB. Even with this tool, querying data in DynamoDB is still complex for me.
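To give an idea of what the native route involves, here is a minimal sketch of the Lambda step in a DynamoDB Streams -> Lambda -> S3 pipeline. It is not something built in this project. It assumes the stream uses NEW_IMAGE, handles only string and number attributes, and writes newline-delimited JSON; the bucket name, object key, and FakeS3 class are hypothetical.

```python
import json


def plain_value(attribute: dict):
    """Convert one DynamoDB-typed attribute ({"S": ...} or {"N": ...}) to a plain value."""
    if "S" in attribute:
        return attribute["S"]
    if "N" in attribute:
        number = attribute["N"]  # numbers arrive as strings in stream records
        try:
            return int(number)
        except ValueError:
            return float(number)
    raise ValueError(f"unsupported attribute type: {attribute}")


def records_to_json_lines(event: dict) -> str:
    """Turn the NewImage of every INSERT/MODIFY record into one JSON line."""
    lines = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE records carry no NewImage
        image = record["dynamodb"]["NewImage"]
        row = {name: plain_value(value) for name, value in image.items()}
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


def export_batch(event: dict, s3, bucket: str, key: str) -> int:
    """Write one stream batch to S3 and return the number of rows written."""
    body = records_to_json_lines(event)
    if not body:
        return 0
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return body.count("\n") + 1


class FakeS3:
    """Stand-in for a boto3 S3 client; it only records put_object calls."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"NewImage": {"Date": {"S": "2020-01-21"}, "cases": {"N": "1"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": {"Date": {"S": "2020-01-20"}}}},
    ]
}
s3 = FakeS3()
written = export_batch(sample_event, s3, "my-export-bucket", "covid/batch-0001.json")
print(written, s3.objects)
```

Athena and QuickSight would still have to be configured on top of the bucket, which is where most of the extra complexity comes from.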
So here's my simple Redash Dashboard
In the end, I know that there are a lot of steps where I fell short and a lot of things I need to improve. Nevertheless, I completed the challenge using the steps that I understand. The knowledge that I gained throughout this challenge is GOLD, so I'm thankful that @forrestbrazeal keeps initiating these kinds of challenges.
- I will integrate CI/CD for this application.
- Before my 30-day Redash trial expires, I will continue by creating a new data pipeline that sticks with native AWS technologies, because I think the complexity of building it is worth it.
- Implement Lambda Layers
- Introduce more data visualization
To whoever managed to see this post and reach the end, thank you! I hope you understood it. Please also tell me the vital aspects that I need to improve; I'm open to all of your suggestions!