<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: William Raffaelle</title>
    <description>The latest articles on DEV Community by William Raffaelle (@wraffaelle98).</description>
    <link>https://dev.to/wraffaelle98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F893461%2F7ac8f380-937c-4534-a6b7-262b2e1d9617.png</url>
      <title>DEV Community: William Raffaelle</title>
      <link>https://dev.to/wraffaelle98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wraffaelle98"/>
    <language>en</language>
    <item>
      <title>Event-Driven Python on AWS</title>
      <dc:creator>William Raffaelle</dc:creator>
      <pubDate>Tue, 02 Aug 2022 02:58:00 +0000</pubDate>
      <link>https://dev.to/wraffaelle98/event-driven-python-on-aws-4jc1</link>
      <guid>https://dev.to/wraffaelle98/event-driven-python-on-aws-4jc1</guid>
      <description>&lt;p&gt;Tonight I finished my first data engineering project on AWS. This project was based off of a challenge on the &lt;a href="https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl" rel="noopener noreferrer"&gt;A Cloud Guru website&lt;/a&gt;. The challenge involves automating an ETL processing pipeline for COVID-19 data using Python and cloud services. This was my introduction to data engineering and ETL processing on AWS. I got to use my Python skills and learn about different cloud services in completing this project.&lt;/p&gt;

&lt;h2&gt;ETL&lt;/h2&gt;

&lt;p&gt;To begin, I created a Python compute job (a Lambda function). A CloudWatch rule triggers it once a day, so the function processes the COVID-19 data each time it is updated.&lt;/p&gt;

&lt;h2&gt;DATA&lt;/h2&gt;

&lt;p&gt;Two datasets were used in this project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/nytimes/covid-19-data/blob/master/us.csv" rel="noopener noreferrer"&gt;New York Times repository&lt;/a&gt; updated daily&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/datasets/covid-19/master/data/time-series-19-covid-combined.csv" rel="noopener noreferrer"&gt;Johns Hopkins dataset&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Python function downloads each dataset with the pandas library, then performs an inner merge so that any day missing from either dataset is dropped.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;url_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv'&lt;br&gt;
data_csv = pd.read_csv(url_data)&lt;br&gt;
url_data2 = 'https://raw.githubusercontent.com/datasets/covid-19/master/data/time-series-19-covid-combined.csv'&lt;br&gt;
data_csv2 = pd.read_csv(url_data2)&lt;br&gt;
data_csv = data_csv.merge(data_csv2, how='inner', on='date')&lt;/code&gt;&lt;/p&gt;
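&lt;p&gt;The effect of the inner merge can be sketched with a couple of made-up rows (the column names here are illustrative, not the real CSV schemas): dates that appear in only one frame are dropped.&lt;/p&gt;

```python
# Minimal sketch of the inner merge on two tiny, made-up frames.
import pandas as pd

nyt = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
                    'cases': [100, 110, 120]})
jhu = pd.DataFrame({'date': ['2021-01-02', '2021-01-03'],
                    'recovered': [5, 7]})

# how='inner' keeps only dates present in BOTH frames;
# 2021-01-01 exists only in the NYT frame and is dropped.
merged = nyt.merge(jhu, how='inner', on='date')
print(len(merged))  # 2
```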

&lt;h2&gt;DATA CLEANING&lt;/h2&gt;

&lt;p&gt;Next, the data was transformed: any non-US rows were dropped from the merged dataset.&lt;/p&gt;
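&lt;p&gt;A minimal sketch of this cleaning step, assuming the merged frame carries a &lt;code&gt;Country/Region&lt;/code&gt; column as in the Johns Hopkins CSV (the column name and sample rows are assumptions, not the project's actual schema):&lt;/p&gt;

```python
# Keep only US rows; the 'Country/Region' column name is an assumption
# about the Johns Hopkins CSV schema.
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
                   'Country/Region': ['US', 'Canada', 'US'],
                   'Recovered': [5, 3, 7]})

# Boolean mask selects US rows; reset_index tidies up the row labels.
df = df[df['Country/Region'] == 'US'].reset_index(drop=True)
print(len(df))  # 2
```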

&lt;h2&gt;LOAD&lt;/h2&gt;

&lt;p&gt;The data was then loaded into a DynamoDB table using the Boto3 put_item method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m8c0dqy5bhf7z8l6bfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m8c0dqy5bhf7z8l6bfw.png" alt="DynamoDB" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;
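&lt;p&gt;One detail worth noting: DynamoDB does not accept Python floats, so each row is converted to &lt;code&gt;Decimal&lt;/code&gt; values first. The snippet below demonstrates that conversion on a made-up row; the &lt;code&gt;put_item&lt;/code&gt; call itself is omitted.&lt;/p&gt;

```python
# Round-trip a pandas row through JSON so floats become Decimals,
# which is what boto3 requires for DynamoDB number attributes.
import json
from decimal import Decimal
import pandas as pd

row = pd.Series({'date': '2021-01-02', 'cases': 110, 'recovered': 5.0})
item = json.loads(row.to_json(), parse_float=Decimal)

# 'recovered' is now Decimal('5.0'); integers stay plain ints.
print(item['recovered'])
```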

&lt;h2&gt;NOTIFY&lt;/h2&gt;

&lt;p&gt;An SNS topic was created to notify individuals that the database had been loaded. This message includes the number of rows updated in the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgl5lzrmzkn1bgxg5pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgl5lzrmzkn1bgxg5pw.png" alt="ETL NOTIFICATION" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;ERROR HANDLING&lt;/h2&gt;

&lt;p&gt;Some error handling was implemented to speed up data processing and to ensure the compute job responds properly to malformed data.&lt;br&gt;
To speed up ETL processing, the function determines whether it is performing an initial load or an incremental update: it checks whether each row already exists in the database, and stops as soon as it finds one that does. This way only the current day's numbers are written, not the entire dataset.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;r = table.get_item(Key={'date': row.date})&lt;br&gt;
if r.get('Item') is None:&lt;br&gt;
    batch.put_item(json.loads(row.to_json(), parse_float=Decimal))&lt;br&gt;
    new += 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Second, the function checks for malformed data. If the date format is incorrect, or the case or death counts are not numeric, processing stops and a notification is sent stating that the data is malformed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedr90c1enbctkq4euy6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedr90c1enbctkq4euy6h.png" alt="SNS Notification" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;
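&lt;p&gt;A hedged sketch of such a validation check (the field names and date format are assumptions; the project's actual checks may differ):&lt;/p&gt;

```python
# Validate one row before loading: the date must parse and the
# case/death fields must be numeric.
from datetime import datetime

def is_well_formed(row):
    try:
        datetime.strptime(row['date'], '%Y-%m-%d')
    except (ValueError, TypeError):
        return False
    return all(isinstance(row[k], (int, float)) for k in ('cases', 'deaths'))

print(is_well_formed({'date': '2021-01-02', 'cases': 110, 'deaths': 2}))  # True
print(is_well_formed({'date': '01/02/2021', 'cases': 110, 'deaths': 2}))  # False
```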

&lt;p&gt;Using Systems Manager Parameter Store, the function can detect whether processing was cancelled the last time it ran. If so, and the data is still malformed, the function skips the malformed rows and continues processing.&lt;/p&gt;

&lt;h2&gt;INFRASTRUCTURE&lt;/h2&gt;

&lt;p&gt;A CloudFormation template was then written to define the project's resources, so the project can be deployed in any AWS environment. The resources include a Lambda function, a CloudWatch rule, an SNS trigger, and a DynamoDB table.&lt;/p&gt;

&lt;h2&gt;DASHBOARD&lt;/h2&gt;

&lt;p&gt;The dashboard was built with Amazon QuickSight. It charts the sum of cases by date, the sum of deaths by date, and the sum of recoveries by date.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69e72wf2fknnlrtvuh1y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69e72wf2fknnlrtvuh1y.jpg" alt="Cases" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9h15l56h2b50p4iq4kn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9h15l56h2b50p4iq4kn.jpg" alt="Deaths" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9hknjya6cb3i4ardang.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9hknjya6cb3i4ardang.jpg" alt="Recovered" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;ADDITIONAL&lt;/h2&gt;

&lt;p&gt;A second Lambda function was written to export the updated DynamoDB table to a .csv file in S3, so that QuickSight can build the dashboard from the latest data. This function is also triggered by a CloudWatch rule and runs daily, immediately after the first function has updated the database.&lt;/p&gt;
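&lt;p&gt;The export step can be sketched as serializing the table's items to CSV in memory; the real function would then upload the result with &lt;code&gt;s3.put_object&lt;/code&gt;. The items below are made up:&lt;/p&gt;

```python
# Build a CSV string from a list of item dicts (stand-ins for a
# DynamoDB scan result), ready to upload to S3 as the object body.
import pandas as pd

items = [{'date': '2021-01-01', 'cases': 100, 'deaths': 1},
         {'date': '2021-01-02', 'cases': 110, 'deaths': 2}]

body = pd.DataFrame(items).to_csv(index=False)
print(body.splitlines()[0])  # date,cases,deaths
```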

&lt;p&gt;The code for this project can be found on &lt;a href="https://github.com/wraffaelle98/Event-Driven-Python-on-AWS" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>etl</category>
      <category>python</category>
      <category>cloud</category>
      <category>database</category>
    </item>
    <item>
      <title>Cloud Resume Challenge</title>
      <dc:creator>William Raffaelle</dc:creator>
      <pubDate>Mon, 25 Jul 2022 02:36:00 +0000</pubDate>
      <link>https://dev.to/wraffaelle98/cloud-resume-challenge-3chf</link>
      <guid>https://dev.to/wraffaelle98/cloud-resume-challenge-3chf</guid>
      <description>&lt;p&gt;I recently completed the Cloud Resume Challenge by Forrest Brazeal. I thought the challenge was well-thought-out and had the right amount of difficulty. Although I already had some cloud experience prior to this project, the project put my skills to the test and I learned some stuff along the way. This was my first time deploying a full-stack web application using an infrastructure as code tool (CloudFormation). I was also exposed to GitHub Actions which I used to create a deployment pipeline. All in all this challenge was rewarding and I am happy with the end result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1. Front End&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The resume website was written in HTML. I used the following example to help me: &lt;a href="https://codepen.io/emzarts/pen/OXzmym" rel="noopener noreferrer"&gt;A simple HTML Resume&lt;/a&gt;. I also used Adobe Dreamweaver to fix some issues I had with spacing. Next, I styled the website with CSS. The code was then uploaded to an S3 bucket which had static web hosting enabled. I experienced a strange error with Route 53, so I decided to use Cloudflare as a DNS provider.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femz1sjjv1luyblnaxvs3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femz1sjjv1luyblnaxvs3.png" alt="HTML Resume" width="800" height="847"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2. API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To start this phase, I first created a DynamoDB table to store the visitor counter. I then created a REST API using AWS API Gateway with a single GET method, and integrated that method with a Lambda function written in Python with Boto3. The function increments the visitor counter stored under the table's primary key, then retrieves and returns the new count. The back end is represented by the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figz7i7dqen0hu2yfji8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figz7i7dqen0hu2yfji8i.png" alt="Infrastructure Diagram" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;
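&lt;p&gt;The counter logic can be sketched as follows. The key name &lt;code&gt;id&lt;/code&gt; and attribute name &lt;code&gt;visits&lt;/code&gt; are assumptions, and a tiny in-memory stand-in plays the DynamoDB table so the flow can run without AWS; the real handler would call &lt;code&gt;boto3.resource('dynamodb').Table(...)&lt;/code&gt; with the same &lt;code&gt;update_item&lt;/code&gt; arguments.&lt;/p&gt;

```python
# Increment-and-return visitor counter, using an in-memory stand-in
# that mimics DynamoDB's 'ADD' update expression semantics.
class FakeTable:
    def __init__(self):
        self.item = {'id': 'counter', 'visits': 0}

    def update_item(self, Key, UpdateExpression, ExpressionAttributeValues,
                    ReturnValues):
        # Mimics UpdateExpression='ADD visits :one', ReturnValues='UPDATED_NEW'
        self.item['visits'] += ExpressionAttributeValues[':one']
        return {'Attributes': {'visits': self.item['visits']}}

def handler(table):
    # Atomically add 1 and get the updated value back in one call.
    resp = table.update_item(
        Key={'id': 'counter'},
        UpdateExpression='ADD visits :one',
        ExpressionAttributeValues={':one': 1},
        ReturnValues='UPDATED_NEW',
    )
    return resp['Attributes']['visits']

table = FakeTable()
print(handler(table), handler(table))  # 1 2
```

&lt;p&gt;Using ADD with ReturnValues avoids a read-then-write race: the increment and the read of the new value happen in a single request.&lt;/p&gt;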

&lt;p&gt;&lt;strong&gt;Phase 3. Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was finally time to integrate my front end with my back end. To make this happen, I wrote some JavaScript that uses the fetch() method to call the API, which updates and returns the visitor counter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fetch('')&lt;br&gt;
        .then(response =&amp;gt; response.json()) &lt;br&gt;
        .then((response) =&amp;gt; {&lt;br&gt;
          console.log(response)&lt;br&gt;
            document.getElementById('replaceme').innerText = response&lt;br&gt;
        })&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I used Open Up The Cloud's Cloud Resume Challenge video series to help me with this part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4. Automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was by far the most challenging part of the Cloud Resume Challenge. I now needed to represent my cloud resources as configuration. To complete the Infrastructure as Code step, I used AWS CloudFormation. The template defines all the back end resources: DynamoDB table, API Gateway (method, stage, deployment), and the Lambda function. This step took a lot of troubleshooting. I used different examples I could find on the internet to help me construct the template. Here are some things I noted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When defining a method with a Lambda integration, the IntegrationHttpMethod must be POST even if the HttpMethod is GET&lt;/li&gt;
&lt;li&gt;A GatewayResponse can be defined to create ResponseParameters to avoid CORS issues. I defined the following ResponseParameters: &lt;code&gt;gatewayresponse.header.Access-Control-Allow-Headers: "'*'"
    gatewayresponse.header.Access-Control-Allow-Methods: "'*'"
    gatewayresponse.header.Access-Control-Allow-Origin: "'*'"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then used GitHub Actions to deploy my application. If I were to complete this challenge again, I would set this up earlier, since it saves a lot of time. One workflow deploys the CloudFormation template whenever the template changes, another pushes code changes to the Lambda function, and a third pushes HTML/CSS/JavaScript changes to the S3 bucket where the website code is stored. I can now change either my front end or back end code and the changes will be pushed to the cloud. CI/CD is awesome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 5. Blog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As soon as I hit the Publish button, I will have completed the Cloud Resume Challenge. I enjoyed this challenge because it helped me understand how full-stack web applications are deployed to AWS. Although I had written Lambda functions and CloudFormation templates before taking on this challenge, I had never used API Gateway and CloudFormation together. The challenge made me realize just how powerful and important infrastructure as code is. I also had no prior experience with GitHub Actions; I now see how useful it is and how much time it can save when working on a full-stack web application in the cloud. Overall, I thought this challenge was great and recommend it to anyone who is new to the cloud or has some experience.&lt;/p&gt;

&lt;p&gt;Link to resume website: &lt;a href="https://raffaelle-resume.com/" rel="noopener noreferrer"&gt;https://raffaelle-resume.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>html</category>
      <category>javascript</category>
      <category>api</category>
    </item>
  </channel>
</rss>
