Participating in Forrest Brazeal's Cloud Resume Challenge in July was a great experience, and I now have a resume website that I am proud to share. I am excited that he introduced another challenge in September. I would recommend that anyone participate, whether for fun or to learn, and whether a fresher or a seasoned professional.
Goal: Automate an ETL processing pipeline for COVID-19 data using Python and cloud services.
I was able to modify my template from the previous challenge to quickly jumpstart this project. I made a few templates to try out different scenarios but decided on using Lambda and DynamoDB for the ETL pipeline. The decision was based on cost-efficiency and on keeping complexity down by limiting the number of services used.
Visualizing the application was simple, but translating that into code was tricky. Transforming the dataframe into a DynamoDB-compliant format was by far my biggest pain point. Fortunately, I learned about Jupyter, which helped me speed up the testing of different functions and packages.
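To illustrate the kind of conversion involved, here is a minimal sketch, not the code from my repository. It assumes each dataframe row has already been turned into a plain Python dict (for example with `df.to_dict("records")`), and the function names and columns are made up for the example. The boto3 DynamoDB resource API rejects Python floats, so numbers go in as `Decimal`, and dates are stored as ISO strings.

```python
import math
from datetime import date, datetime
from decimal import Decimal


def to_dynamodb_value(value):
    """Convert one Python value into a type the boto3 DynamoDB resource API accepts."""
    if isinstance(value, float):
        # Go through str() so 1.5 becomes Decimal("1.5") rather than a long binary expansion.
        return Decimal(str(value))
    if isinstance(value, (datetime, date)):
        # Dates are stored as ISO 8601 strings, which also sort chronologically.
        return value.isoformat()
    return value


def to_dynamodb_item(row):
    """Convert a row dict into a DynamoDB item, skipping missing values (None or NaN)."""
    item = {}
    for key, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        item[key] = to_dynamodb_value(value)
    return item
```

Each resulting dict could then be written with a table's `put_item` call or a `batch_writer()`. Skipping missing values instead of storing them is a design choice for this sketch; a real pipeline might prefer to fail loudly on gaps in the data.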
I use SNS to alert me of any runtime failures in both functions and to report the number of rows added when the ETL job completes. I added CI/CD using GitHub Actions. I prefer this method because it's free, supports the aws-cli, and introduces an extra layer of automation.
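The notification step can be sketched roughly as follows. This is an illustration under assumptions rather than my exact code: the function names, subjects, and message wording are hypothetical, and the SNS client is passed in (in practice something like `boto3.client("sns")`) so the message-building logic can be tried without AWS credentials.

```python
def build_notification(rows_added=None, error=None):
    """Build the Subject and Message for either a failure or a success alert."""
    if error is not None:
        return {
            "Subject": "ETL job failed",
            "Message": f"The ETL job failed: {error}",
        }
    return {
        "Subject": "ETL job succeeded",
        "Message": f"The ETL job completed. Rows added: {rows_added}",
    }


def notify(sns_client, topic_arn, rows_added=None, error=None):
    """Publish the alert to an SNS topic using the client's publish call."""
    payload = build_notification(rows_added=rows_added, error=error)
    return sns_client.publish(TopicArn=topic_arn, **payload)
```

Inside a Lambda handler, the ETL work would sit in a `try` block, with `notify(..., error=exc)` in the `except` branch and `notify(..., rows_added=n)` after a successful load.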
Lastly, data isn't meaningful without a way to visualize it. BI tools are expensive and many have no native support for DynamoDB. I extended my template to accommodate a custom real-time dashboard and created an additional template to provision a CloudFront distribution with an S3 origin. API Gateway triggers a second Lambda function that uses Matplotlib to create graphs and also calculates the recent totals of cases, recoveries, and deaths. The graphs are stored in an S3 bucket, and the API response returns the calculated totals in JSON format.
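The totals part of that second function could look something like the sketch below. It is a simplified illustration, not the deployed code: it assumes each item carries an ISO `date` string plus cumulative `cases`, `recovered`, and `deaths` attributes (hypothetical names), that "recent totals" means the values on the latest date, and that the function sits behind an API Gateway Lambda proxy integration. The graphing and S3 upload are left out.

```python
import json
from decimal import Decimal


def recent_totals(items):
    """Return the totals from the item with the most recent date."""
    # ISO 8601 date strings compare in chronological order.
    latest = max(items, key=lambda item: item["date"])
    return {
        "date": latest["date"],
        # DynamoDB numbers come back as Decimal, which json.dumps cannot serialize.
        "cases": int(latest["cases"]),
        "recovered": int(latest["recovered"]),
        "deaths": int(latest["deaths"]),
    }


def build_response(items):
    """Wrap the totals in the response shape a Lambda proxy integration expects."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(recent_totals(items)),
    }
```

Converting the `Decimal` values to `int` before serializing avoids the `TypeError` that `json.dumps` raises on `Decimal` objects.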
While I have not built the dashboard yet, the infrastructure is available for when I do. In the meantime, I learned about Redash, a SaaS BI tool that supports DynamoDB, so I decided to use the free trial and host my dashboard there for the time being.
Each project I do makes the next one easier. I completed this in roughly two weeks, whereas the previous challenge took about 1.5 months.
A few things that I would like to improve upon:
- Restructure the src folder to remove duplicate code
- Error handling/logging
- Make use of Lambda layers
The source code and additional details are on GitHub.