<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Noel Martin Llevares</title>
    <description>The latest articles on DEV Community by Noel Martin Llevares (@dashmug).</description>
    <link>https://dev.to/dashmug</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F415855%2F6a156eac-cf5c-4e7e-b36c-18510f67a9c7.jpeg</url>
      <title>DEV Community: Noel Martin Llevares</title>
      <link>https://dev.to/dashmug</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dashmug"/>
    <language>en</language>
    <item>
      <title>Event-Driven Python in AWS - #CloudGuruChallenge</title>
      <dc:creator>Noel Martin Llevares</dc:creator>
      <pubDate>Thu, 15 Oct 2020 00:40:12 +0000</pubDate>
      <link>https://dev.to/dashmug/event-driven-python-in-aws-cloudguruchallenge-20la</link>
      <guid>https://dev.to/dashmug/event-driven-python-in-aws-cloudguruchallenge-20la</guid>
      <description>&lt;h3&gt;
  
  
  TLDR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl"&gt;Challenge Instructions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d21xiw2qs8azw2.cloudfront.net/"&gt;My Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dashmug/us-covid-stats"&gt;Project GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;A few weeks ago, I came across &lt;a href="https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl"&gt;A Cloud Guru's #CloudGuruChallenge&lt;/a&gt;. The challenge was to create a simple event-driven &lt;strong&gt;Extract-Transform-Load (ETL) pipeline&lt;/strong&gt; using some publicly available US COVID-19 statistics. As the title of the challenge suggests, it is meant to exercise and demonstrate knowledge and usage of AWS services.&lt;/p&gt;

&lt;p&gt;I have recently shifted roles from being a &lt;em&gt;software engineer&lt;/em&gt; to being a &lt;em&gt;data engineer&lt;/em&gt;, so getting to work on ETL pipelines is a good learning exercise for me. Thus, I decided to take on this challenge and see what I could come up with. As an added bonus, the challenge would let me experiment with new ideas that I had wanted to try out but couldn't easily do in my own projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenges
&lt;/h3&gt;

&lt;p&gt;I'll summarize them as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Schedule a daily ETL job that will extract data from two different sources, clean/filter the data, merge the two sources, and store the merged data. &lt;/li&gt;
&lt;li&gt;Subsequent runs should only add/update rows that were added/updated since the last ETL run.&lt;/li&gt;
&lt;li&gt;At each ETL run, notify interested subscribers via email.&lt;/li&gt;
&lt;li&gt;The ETL process should be resilient enough to handle off-by-one issues (when one dataset has an extra row not present in the other one).&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Infrastructure-as-Code&lt;/strong&gt; as much as possible.&lt;/li&gt;
&lt;li&gt;Hook up the ETL results into a dashboard or a reporting tool.&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;Continuous Integration&lt;/strong&gt;/&lt;strong&gt;Continuous Deployment&lt;/strong&gt; (&lt;strong&gt;CI/CD&lt;/strong&gt;) pipeline for updating the infrastructure and application when changes are made to the source code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additional challenges I wanted for myself are the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a custom frontend to visualize the transformed data.&lt;/li&gt;
&lt;li&gt;Use as few hard-coded values in the infrastructure as possible so that multiple developers can work on separate stacks at the same time without sharing or clashing resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I used &lt;strong&gt;Serverless Framework&lt;/strong&gt; to manage the backend. It provides a higher level of abstraction than &lt;em&gt;SAM&lt;/em&gt; or &lt;em&gt;CloudFormation&lt;/em&gt;. When I need to define custom resources that &lt;code&gt;serverless&lt;/code&gt; does not manage, I can still do so by using CloudFormation in my &lt;code&gt;serverless&lt;/code&gt; configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ReactAppBackendUrlParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SSM::Parameter&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;REACT_APP_BACKEND_URL&lt;/span&gt;
        &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
        &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Join&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="nv"&gt;HttpApi&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.execute-api.'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::Region'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::URLSuffix'&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;OnRefreshDataFromSourcesNotification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SNS::Topic&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;DisplayName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;US COVID Stats&lt;/span&gt;
        &lt;span class="na"&gt;Subscription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.subscription.${self:custom.subscriptionEnabled}}&lt;/span&gt;
    &lt;span class="na"&gt;DataTable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.DataTable}&lt;/span&gt;
    &lt;span class="na"&gt;DataBucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.DataBucket}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendBucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.FrontendBucket}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendBucketPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.FrontendBucketPolicy}&lt;/span&gt;
    &lt;span class="na"&gt;CloudFrontOriginAccessIdentity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.CloudFrontOriginAccessIdentity}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendDistribution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.FrontendDistribution}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendBucketParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.FrontendBucketParameter}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendDistributionParameter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfResources.FrontendDistributionParameter}&lt;/span&gt;
  &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DataBucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfOutputs.DataBucketName}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendBucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfOutputs.FrontBucketName}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendDistributionId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfOutputs.FrontendDistribution}&lt;/span&gt;
    &lt;span class="na"&gt;FrontendUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:custom.cfOutputs.FrontendUrl}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the custom resources above are defined in a separate &lt;code&gt;cloudformation.yml&lt;/code&gt;. This allows me to use CloudFormation tools like &lt;code&gt;cfn-lint&lt;/code&gt; on that file. The only ones defined in &lt;code&gt;serverless.yml&lt;/code&gt; are the ones that depend on &lt;code&gt;serverless&lt;/code&gt;-managed resources.&lt;/p&gt;

&lt;p&gt;I am also using a &lt;code&gt;serverless&lt;/code&gt; plugin that handles the packaging of Python dependencies.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;serverless&lt;/code&gt; service consists of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;REST API Gateway&lt;/li&gt;
&lt;li&gt;Lambda function that responds to the API Gateway requests.&lt;/li&gt;
&lt;li&gt;Lambda function that is triggered daily to do the ETL process.&lt;/li&gt;
&lt;li&gt;Lambda function that gets triggered &lt;em&gt;after&lt;/em&gt; the ETL process (via &lt;em&gt;Lambda Destinations&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Multiple custom CloudFormation resources such as DynamoDB table, S3 buckets, CloudFront distribution, SNS topic, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Scheduled ETL
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fei468jrrdfsfvexshtxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fei468jrrdfsfvexshtxz.png" alt="ETL (1)" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ETL requirement for this challenge could actually be met by simply using Python's &lt;code&gt;csv&lt;/code&gt; module. However, most ETL tasks in the real world are not as simple as this challenge; most would involve more sophisticated transforms than simple merging. This is where &lt;code&gt;pandas&lt;/code&gt; (a data analysis library for Python) comes in handy.&lt;/p&gt;
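&lt;p&gt;As a rough illustration of the extract step (the function and sample data below are hypothetical, not the project's actual code), each source can be pulled in as a CSV and reduced to the columns of interest:&lt;/p&gt;

```python
import io

import pandas as pd


def extract(csv_text: str, columns: list) -> pd.DataFrame:
    """Parse one source's CSV payload and keep only the columns we need."""
    return pd.read_csv(io.StringIO(csv_text))[columns]


# In the real pipeline the text would come from an HTTP fetch of the source URL.
sample = "date,cases,deaths\n2020-10-01,100,5\n2020-10-02,150,7\n"
cases = extract(sample, ["date", "cases"])
```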

&lt;p&gt;With &lt;code&gt;pandas&lt;/code&gt;, data cleanup and transformation were a breeze. Still, pulling in such a big dependency (within Lambda's environment) for so small a requirement would waste most of &lt;code&gt;pandas&lt;/code&gt;' power, so I made sure to use more of its features in my REST API.&lt;/p&gt;

&lt;p&gt;For example, I calculated the day-over-day changes and aggregated them by week and by month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0bwth2m3tg60k3fd6q0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0bwth2m3tg60k3fd6q0n.png" alt="Grouped Tables and Charts" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;
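&lt;p&gt;A minimal sketch of that aggregation (with made-up numbers; the column name is assumed):&lt;/p&gt;

```python
import pandas as pd

# Cumulative daily totals, indexed by date (hypothetical sample data)
data = pd.DataFrame(
    {"cases": [100, 150, 210, 280, 360, 450, 550, 660]},
    index=pd.date_range("2020-10-01", periods=8, freq="D"),
)

# Day-over-day changes; the first day's change is its cumulative count
daily = data.diff().fillna(data.iloc[0]).astype(int)

# Aggregate the daily changes by week and by month
weekly = daily.resample("W").sum()
monthly = daily.groupby(daily.index.to_period("M")).sum()
```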

&lt;p&gt;As the data querying requirements are simple, I chose DynamoDB as the primary datastore. It's fast, easy to use, very cheap, and very scalable.&lt;/p&gt;

&lt;h4&gt;
  
  
  Incremental Loading
&lt;/h4&gt;

&lt;p&gt;A full load only happens on the first ETL run. Subsequent runs do not need to recreate rows that did not change. To achieve this, I keep a CSV snapshot of the same transformed data. After the transformation process, I compare the newly-transformed dataset with the previous one and extract the changed rows.&lt;/p&gt;
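&lt;p&gt;The snapshot comparison can be sketched like this (hypothetical data and column names; the real implementation may differ):&lt;/p&gt;

```python
import pandas as pd

# Previous snapshot vs. newly-transformed data, both indexed by date
previous = pd.DataFrame({"cases": [100, 150]}, index=["2020-10-01", "2020-10-02"])
current = pd.DataFrame(
    {"cases": [100, 155, 210]}, index=["2020-10-01", "2020-10-02", "2020-10-03"]
)

# Keep rows that are new or whose value differs from the snapshot.
# reindex() yields NaN for dates missing from the snapshot, and eq() is
# False against NaN, so brand-new rows are kept too.
changed = current[~current["cases"].eq(previous["cases"].reindex(current.index))]
```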

&lt;p&gt;Only the changed rows are then saved to DynamoDB in a batch write operation. This ensures the process consumes only as many DynamoDB WCUs as necessary. Also, when the datasets get updated retroactively, the changes to previous rows are stored as well, not just the latest row.&lt;/p&gt;
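&lt;p&gt;With &lt;code&gt;boto3&lt;/code&gt;, the batch write can be sketched as follows (the table and attribute names here are assumptions, not taken from the project):&lt;/p&gt;

```python
def to_items(rows):
    """Turn (date, cases) pairs into DynamoDB item dicts (attribute names assumed)."""
    return [{"date": date, "cases": cases} for date, cases in rows]


def save_changed_rows(table_name, rows):
    """Batch-write only the changed rows so no extra WCUs are consumed."""
    import boto3  # imported lazily so the pure helper above is testable offline

    table = boto3.resource("dynamodb").Table(table_name)
    # batch_writer() buffers puts into BatchWriteItem calls of up to 25 items
    # and retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for item in to_items(rows):
            batch.put_item(Item=item)
```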

&lt;h4&gt;
  
  
  Email Notifications
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foa9d5fi44b1b1vkglrjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foa9d5fi44b1b1vkglrjk.png" alt="Screen Shot 2020-10-15 at 10.17.05 am" width="634" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Email notifications are implemented via SNS. The ETL Lambda (&lt;code&gt;RefreshDataFromSources&lt;/code&gt;) is configured to use &lt;code&gt;OnRefreshDataFromSources&lt;/code&gt; as an asynchronous destination for both success and failure states. &lt;code&gt;OnRefreshDataFromSources&lt;/code&gt;, in turn, publishes a corresponding message to an SNS topic, which then sends an email to subscribers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RefreshDataFromSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us_covid_stats/etl/handler.refresh_data_from_sources&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(1 day)&lt;/span&gt;
    &lt;span class="na"&gt;destinations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;onSuccess&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnRefreshDataFromSources&lt;/span&gt;
      &lt;span class="na"&gt;onFailure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnRefreshDataFromSources&lt;/span&gt;
  &lt;span class="na"&gt;OnRefreshDataFromSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us_covid_stats/etl/handler.on_refresh_data_from_sources&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
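&lt;p&gt;The destination handler can then branch on the invocation outcome. Lambda Destinations wraps the function's result in an event whose &lt;code&gt;requestContext.condition&lt;/code&gt; is &lt;code&gt;Success&lt;/code&gt; or &lt;code&gt;RetriesExhausted&lt;/code&gt;; the sketch below (the message wording and topic-ARN lookup are assumptions) composes the SNS message from it:&lt;/p&gt;

```python
def build_notification(event):
    """Compose a human-readable message from a Lambda Destinations event."""
    condition = event["requestContext"]["condition"]
    payload = event.get("responsePayload")
    if condition == "Success":
        return "ETL run succeeded: {}".format(payload)
    return "ETL run failed ({}): {}".format(condition, payload)


def on_refresh_data_from_sources(event, context=None):
    """Publish the outcome to the SNS topic (env-var name is assumed)."""
    import os

    import boto3  # lazy import keeps build_notification testable offline

    boto3.client("sns").publish(
        TopicArn=os.environ["TOPIC_ARN"],
        Subject="US COVID Stats",
        Message=build_notification(event),
    )
```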



&lt;h4&gt;
  
  
  ETL Resiliency
&lt;/h4&gt;

&lt;p&gt;Since we are dealing with two different data sources that we need to merge using the date as the join condition, it may happen that one source is more up to date than the other. Our transformation has to account for that and ignore rows that are not present in both datasets.&lt;/p&gt;

&lt;p&gt;It's literally just a one-liner with pandas' &lt;code&gt;DataFrame.join()&lt;/code&gt;. The rest is just trivial cleanup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge_cases_with_recoveries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recoveries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recoveries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recovered&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recoveries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recoveries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code includes unit tests that check this functionality against different scenarios.&lt;/p&gt;
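&lt;p&gt;For instance, a &lt;code&gt;pytest&lt;/code&gt;-style check of the off-by-one scenario might look like this (the test data is made up; the function is reproduced from the listing above):&lt;/p&gt;

```python
from pandas import DataFrame


def merge_cases_with_recoveries(cases: DataFrame, recoveries: DataFrame) -> DataFrame:
    return (
        cases.join(recoveries, how="inner", on="date")
        .fillna(0)
        .rename(columns={"Recovered": "recoveries"})
        .astype({"recoveries": "int"})
    )


def test_extra_row_in_one_source_is_ignored():
    # "cases" has a date that "recoveries" does not have yet
    cases = DataFrame({"date": ["2020-10-01", "2020-10-02"], "cases": [100, 150]})
    recoveries = DataFrame({"Recovered": [10.0]}, index=["2020-10-01"])
    merged = merge_cases_with_recoveries(cases, recoveries)
    # the inner join drops the date that is missing from one source
    assert merged["date"].tolist() == ["2020-10-01"]
    assert merged["recoveries"].tolist() == [10]
```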

&lt;h4&gt;
  
  
  Infrastructure-as-Code
&lt;/h4&gt;

&lt;p&gt;I am a fan of Infrastructure-as-Code. Done well, it saves so much time when spinning up and tearing down environments.&lt;/p&gt;

&lt;p&gt;In this project, I am using Serverless Framework and CloudFormation to manage the infrastructure. Everything that needs to be provisioned is just two deploy commands away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend 

&lt;ul&gt;
&lt;li&gt;When the backend gets deployed (through &lt;code&gt;serverless deploy&lt;/code&gt;), it creates all the necessary &lt;em&gt;backend&lt;/em&gt; services plus the infrastructure that the &lt;em&gt;frontend&lt;/em&gt; will eventually need (i.e. S3 bucket, CloudFront distribution). The autogenerated REST API endpoints, the S3 bucket name, and the CloudFront distribution ID get stored in SSM Parameter Store. This makes it easy for the &lt;em&gt;frontend&lt;/em&gt; deployment process to retrieve them.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Frontend

&lt;ul&gt;
&lt;li&gt;During the build process, it uses SSM to retrieve the autogenerated API endpoints of the backend.&lt;/li&gt;
&lt;li&gt;During the deployment process, it uses SSM to get the S3 bucket name to use in storing the built artifacts and to get the CloudFront distribution ID for invalidating the CDN cache.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dashboard/Report App
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv2n5v0ue8lkflnfsp6ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv2n5v0ue8lkflnfsp6ys.png" alt="Untitled Diagram (3)" width="526" height="235"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;I have made a simple React frontend to present the transformed data. It is available at &lt;a href="https://d21xiw2qs8azw2.cloudfront.net/"&gt;https://d21xiw2qs8azw2.cloudfront.net/&lt;/a&gt; (it's hosted on an S3 bucket behind a CloudFront distribution).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d21xiw2qs8azw2.cloudfront.net/"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftbkgbvhex32p5hf442we.png" alt="Frontend full page screenshot" width="800" height="1306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the REST API, I used &lt;code&gt;pandas&lt;/code&gt; to provide daily, weekly, and monthly aggregate counts. I made 4 &lt;code&gt;GET&lt;/code&gt; endpoints -- &lt;code&gt;/data&lt;/code&gt;, &lt;code&gt;/daily&lt;/code&gt;, &lt;code&gt;/weekly&lt;/code&gt;, &lt;code&gt;/monthly&lt;/code&gt;. For simplicity, they are all handled by the same Lambda function.&lt;/p&gt;
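&lt;p&gt;The single-handler routing can be sketched like this (the &lt;code&gt;rawPath&lt;/code&gt; field assumes an API Gateway HTTP API v2 event, and the handler bodies are placeholders, not the project's actual code):&lt;/p&gt;

```python
import json


def api_handler(event, context=None):
    """Dispatch the four GET endpoints from one Lambda function."""
    routes = {
        "/data": "raw",
        "/daily": "daily",
        "/weekly": "weekly",
        "/monthly": "monthly",
    }
    resolution = routes.get(event.get("rawPath"))
    if resolution is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    # A real handler would query DynamoDB and aggregate with pandas here.
    return {"statusCode": 200, "body": json.dumps({"resolution": resolution})}
```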

&lt;h4&gt;
  
  
  Continuous Integration
&lt;/h4&gt;

&lt;p&gt;I used GitHub Actions to run multiple checks on pull requests and on the main branch. There are checks for the frontend, the backend, and the &lt;code&gt;cloudformation.yml&lt;/code&gt; files. I've also added GitHub's Code Scanning service and SonarCloud's Code Analysis.&lt;/p&gt;

&lt;p&gt;For the frontend, we simply use &lt;code&gt;create-react-app&lt;/code&gt;'s build process because it runs linting, type-checking, and the build in one command.&lt;/p&gt;

&lt;p&gt;For the backend, we check for linting errors using &lt;code&gt;flake8&lt;/code&gt;, check for type errors using &lt;code&gt;mypy&lt;/code&gt;, run unit tests using &lt;code&gt;pytest&lt;/code&gt;, and then build the deployment package using &lt;code&gt;serverless&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this workflow, I can write features and push them as a PR. All the actions trigger and perform checks on my code. If everything passes, I can merge to the main branch, which then triggers deployment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Continuous Deployment
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo3vhl7ebqrhy1h981nmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo3vhl7ebqrhy1h981nmv.png" alt="Screen Shot 2020-10-15 at 11.57.36 am" width="664" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For deployment, I use CodeBuild. The CodeBuild project configurations for both frontend and backend are defined in a separate &lt;code&gt;cloudformation.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this deployment pipeline in place, I can simply merge PRs to the main branch (or commit trivial changes directly to main) and they will deploy automatically.&lt;/p&gt;

&lt;p&gt;Sometimes, changes in the ETL logic require re-triggering the ETL process afterwards, so it is useful to make the ETL function also listen for the CodeBuild event via CloudWatch Events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;RefreshDataFromSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us_covid_stats/etl/handler.refresh_data_from_sources&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(1 day)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cloudwatchEvent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;aws.codebuild&lt;/span&gt;
            &lt;span class="na"&gt;detail-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CodeBuild Build State Change&lt;/span&gt;
            &lt;span class="na"&gt;detail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;build-status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Things I wanted to add but didn't have time to
&lt;/h3&gt;

&lt;p&gt;It's a shame I ran out of time. There are still a couple of things I wanted to try (I may still try them in the future).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provide a SageMaker notebook instance with some initial code that loads our CSV snapshot.&lt;/li&gt;
&lt;li&gt;Provide a few Athena named queries that use our CSV snapshot.&lt;/li&gt;
&lt;li&gt;Implement a post-deployment test using CloudWatch Synthetics to verify that the app works as intended. This test can also be triggered after each ETL run.&lt;/li&gt;
&lt;li&gt;Provision all of the above using CloudFormation, of course.&lt;/li&gt;
&lt;li&gt;Switch from CloudFormation to CDK.&lt;/li&gt;
&lt;li&gt;Using realtime updates for data that refreshes only once a day is not so exciting. So, I wanted to implement a 1990s-style realtime views counter. Or, maybe a guestbook. :-)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;I found this challenge to be a fun one. I learned a lot, especially about things I don't use day-to-day. I look forward to the next one.&lt;/p&gt;

&lt;p&gt;Thank you &lt;a href="https://acloudguru.com"&gt;A Cloud Guru&lt;/a&gt; for organizing this.&lt;/p&gt;

</description>
      <category>cloudguruchallenge</category>
      <category>serverless</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
