Jay Viloria

Posted on Oct 1, 2020

Event-Driven Python ETL: ACloudGuru September 2020 Challenge

#serverless #aws #python

I just completed the ACloudGuru September 2020 challenge but in reality I feel like I have only reached one ending of a Choose-Your-Own-Adventure game.

Basically, the challenge is to create an automated ETL process (ran once daily) that takes two COVID-19 data sources, merge and clean them, apply some transformations and save the result to a database of our choosing and send notifications about the results of the process. A dashboard was then required that used the post-ETL data as a source. Here is the official challenge announcement.

Although there were many individual steps listed in the instructions, my approach can be broken into the following parts which I'll expand on in later sections:

Analysis: Time spent exploring data sources, considering different solution routes
CI/CD
ETL Construction: Developing the Extract/Transform/Load logic
DB Choice: Choosing the right DB and how I went with a csv file and S3 as my "database"
Notifications: SNS Notification set-up...and then tear-down...and then re-set-up...
ETL/Unit/Smoke Testing: Manually testing process, writing unit tests and trying to do a proper smoke test
Data Consumption/Dashboard

Analysis

I spent a fair bit of time before doing any "work" just thinking about how I wanted to approach the challenge. I read the instructions a few times and took a look at the data. I'll list some of my initial thoughts/observations and later relate them to future choices,

After looking over the QuickSight and Tableau Public sites, I made the decision to do the frontend/dashboard myself. I wasn't sure at that time if I was going to use a framework or simply vanilla JS. I was leaning towards React though.
As far as CI/CD of resources, I have been doing a lot with GitHub Actions so I decided to continue using that, instead of the Code* line of products that AWS offers.
After glancing at the initial data, I knew I wanted to add fields, specifically, the incremental data. The existing fields were all cumulative cases/deaths/recoveries. I also noticed that at times, older data had their records updated (adjustments to cases/deaths/recoveries)
My initial impressions were that I wanted a relational DB, although cost/complexity was a big concern. This was one of the considerations that made the analysis step take longer than expected.
Forrest listed some extra challenging steps, one of them being a smoke test that ran the process with dummy data and verified the notification message. I knew I wanted to include this in my solution.

CI/CD

After deciding on using GitHub Actions for my workflow, I split my project into two repositories: back-end/ETL/Data API and front-end/dashboard.

ETL/API

I used SAM to handle the back-end. Because I decided to do the front-end myself, I didn't have native integrations into a dashboarding software/service so I knew I needed an API for the data to be consumed. I considered splitting the API into a separate SAM project but decided against it. My rationale being that for a hypothetical ETL change, for example, adding a data source, the API would probably be changed as well to make the new data available to the user.

Dashboard

By this time, I decided to go with React for the dashboard so I chose CDK to structure my project/provision my resources. This was a choice of convenience because I had just completed a template for a React project that is deployed using CDK. You can find the template repo here.

ETL Construction

This was a fairly straight-forward portion of the project. I split the Extract/Transform/Load steps into separate .py files and had one lambda handler control everything. I used one lambda for the core ETL process, but I considered an alternative approach using multiple de-coupled parts for the Extract/Transform/Load steps. In the end, I stuck with the one lambda approach mostly out of simplicity but I'll briefly describe an alternative.

Alternative Approach

I considered having an Extract lambda save the downloaded data to a "raw data" S3 Bucket, that triggered the Transform/Load lambdas. The Transform lambda would have needed logic to ensure that all data sources were extracted before doing its thing. I considered this approach when I thought about scalability of one ETL Lambda. I thought about the case where the data sets were larger or data sources more numerous. In that case, the Extract step might be much longer and the penalty for a failure in the Transform/Load steps much more severe (restarting downloads).

By that same logic, I considered a separate Transform lambda/process. This is because as the data gets bigger, the Transform process could be beyond the Lambda capabilities (15 minute run time and memory constraints) and would possibly need a Fargate/EC2 process to complete.

AWS Glue

I didn't have experience with AWS Glue but it seemed like a viable option, especially with my "database" choice described later. Part of me wants to "choose another path" and re-do an alternate version of this project using Glue instead.

DB Choice

I spent a lot of time going over this choice. My knee-jerk reaction when first looking at the data was a relational DB. This feeling was strengthened by my desire to add fields as mentioned in the Analysis section.

Unfortunately, I looked into RDS and it was expensive as well as being more complex to integrate with AWS Lambda. The main blocker was the cost though, to incur the costs associated with standing up a RDS instance for 200-300 records seemed inappropriate. I briefly looked into Aurora Serverless but I wasn't sure how scaling down to/scaling up from 0 ACU worked. This was a concern because my data was to be exposed as an API Gateway/Lambda for consumption so I wasn't sure how it would affect latency.

My other choices were DynamoDB and the alternative that I ended up choosing, a csv file stored in an S3 Bucket. I'll describe what went into my choice below,

Daily Data Diff

I noticed during my initial analysis that when the data was updated, at times, older records were updated/adjusted. The DataFrame object has nice methods tools to find the global differences between two objects. pandas allows for a csv file to be converted to a DataFrame as one operation. With DynamoDB, the comparisons have to be done at the record/item level or I suppose the result of a scan could be converted to a DataFrame. Either way, it seemed as if more operations were necessary.

Data Consumption Flexibility

My initial leaning towards a relational DB was because of the ability to query the data flexibly. In terms of scalability, I considered the case where this project was extended to include data at the state or county level or many more fields were added. In that case, when consuming the data, I believe it should be filtered on the server(less) side first. Also flexible queries in DynamoDB require setting up Secondary Indexes so I didn't like it as a choice.

I felt that a csv file stored in S3, combined with a Lambda using the S3 Select functionality gave me the best options:

In the case of new fields, there is a clear way to add new query flexibility in terms of query string parameters in the API endpoint.
If the amount of data scaled up allowing for different segmentations (country, state/province, county) and new consumers of this data, the data consumed by existing users would be relatively predictable because with S3 Select, you pay for the data after filtering takes place.
If the amount of data scaled up enough to facilitate a migration to an RDS instance, the S3 Select logic should be mostly re-usable.

Notifications

The DB choice was the most difficult choice I made in this project, but the choice on how to implement ETL notifications was by far the most frustrating choice.

At first it seemed so simple, Lambda has a Destinations feature that allows automatic integration with a few different AWS services, one of them being SNS, for asynchronous Lambda invocations. I thought it was perfect, I'll just hook that up and I get automatic success/failure publishing to my subscribers (my email address). Then I saw the notification...

I looked into the Destinations documentation and the above was what I was stuck with (Note, some content was blocked out from the screenshot above). I understand that Destinations are probably meant more for machine triggers of services so sending over a JSON-structured object makes more sense. For a human-readable notification though, it was...less than ideal. For a while, I thought, well technically, it meets the requirements of the challenge, but it looks terrible and it makes the smoke test much more difficult... But it seemed so easy to just leave that Destinations configuration in my SAM template, oh well, time to move on.

I broke up the ETL Lambda-SNS connection and instead configured the Destination of my ETL Process to be another Lambda function responsible for publishing to SNS. This accomplished two main things,

Better Structured/Formatted Notifications

Filtered Notifications

I was able to use the SNS message attributes to filter messages to subscribers. I used this to create a subscriber that only received messages during my smoke test run using an environment parameter to denote testing vs production runs.

I used an SQS Queue as the target of my testing notifications because it allowed for programmatic verification of received messages and was a direct source of the message. What I mean by that is I considered using another lambda for the test messages and writing the message content to S3/DynamoDB to verify the SNS notifications were functioning. In that sense though, I am also testing the Lambda -> {S3, DynamoDB} connection which I felt muddied the waters of the test, so to speak. This is one area where I would like to learn what is standard practice because my solution felt very hacky.

ETL/Unit/Smoke Testing

By this time I had three Lambda functions:

Core ETL Function
SNS Publishing Function
API Gateway Proxy Function

Most of my unit tests were focused on the ETL function, I tested the Extract/Transform/Load modules separately and tried to cover different data problems that might come up.

One unfortunate consequence of using S3/S3 Select as my database/data consumption mechanism is that moto, the standard mocking library for AWS services, does not yet have support for it so I couldn't include a proper unit test for my API Gateway Proxy function.

Smoke Test

Based on all the machinery I set up around the smoke test, e.g. SNS Message Attributes, SQS Queue, SNS Lambda Publisher. The actual smoke test script was very straight forward. It invokes the ETL Lambda with a testing environment value. It then polls an SQS Queue to determine if the correct SNS message was sent, then cleans up any test ETL files created in the "database". In this case, I used S3 prefixes to separate production vs testing data.

Data Consumption/Dashboard

For this part of the project, I put together a quick React app and used the Recharts library to create a graph out of the data returned by the API call I set up above.

I included a few controls to toggle how the data is aggregated or whether or not the user is viewing Cases/Deaths/Recoveries. Although initially I wanted to add filters for some of the fields I added during the Transform process, I left them as possible future enhancements, translation, I got a little tired and wanted to wrap up this project. A screenshot of the dashboard is provided

Real-Time Dashboard Updates

I didn't implement this in my project but it was listed as an extra challenging step in the directions so I thought I would describe how I might approach that.

Since both data sources were in public repos, I think I would create a function that hits the GitHub API endpoint for both sources with some level of frequency and does a diff comparison of those specific files. If there are changes to those files, trigger the ETL event.

This wouldn't be technically be real-time but with a frequent enough period of the GitHub API endpoint, it could be near it. It would also avoid downloading the file(s) unless necessary. Although I'm not sure if there are quota restrictions on the GitHub API.

Conclusion

I very much enjoyed this challenge. I still feel like there were a bunch of different ways of accomplishing certain steps and I am very tempted to do multiple iterations of everything.

Feel free to check out my dashboard or the two repos that make up the project: front-end and back-end

The overall back-end architecture is provided below,