<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nxtra</title>
    <description>The latest articles on DEV Community by Nxtra (@nxtra).</description>
    <link>https://dev.to/nxtra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F474779%2F67ade894-b924-4149-977c-a085bd315c20.jpg</url>
      <title>DEV Community: Nxtra</title>
      <link>https://dev.to/nxtra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nxtra"/>
    <language>en</language>
    <item>
      <title>Serverless data pipelines: ETL workflow with Step Functions and Athena</title>
      <dc:creator>Nxtra</dc:creator>
      <pubDate>Thu, 04 Feb 2021 07:31:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/serverless-data-pipelines-etl-workflow-with-step-functions-and-athena-4hhf</link>
      <guid>https://dev.to/aws-builders/serverless-data-pipelines-etl-workflow-with-step-functions-and-athena-4hhf</guid>
      <description>&lt;p&gt;This blog is Part 3 of a multi-part series around analysing Flanders’ traffic whilst leveraging the power of cloud components!&lt;br&gt;&lt;br&gt;
For part 1 see: &lt;a href="https://medium.com/cloudway/real-time-data-processing-with-kinesis-data-analytics-ad52ad338c6d" rel="noopener noreferrer"&gt;https://medium.com/cloudway/real-time-data-processing-with-kinesis-data-analytics-ad52ad338c6d&lt;/a&gt;&lt;br&gt;&lt;br&gt;
For part 2 see: &lt;a href="https://medium.com/cloubis/serverless-data-transform-with-kinesis-e468abd33409" rel="noopener noreferrer"&gt;https://medium.com/cloubis/serverless-data-transform-with-kinesis-e468abd33409&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  What is our goal?
&lt;/h1&gt;

&lt;p&gt;This blog aims to explore the use of the AWS Glue service in conjunction with the AWS Athena service to repartition raw streaming data events.&lt;br&gt;&lt;br&gt;
We previously landed these events on an Amazon S3 bucket partitioned according to the processing time on Kinesis.&lt;br&gt;&lt;br&gt;
However, we would like to have these events partitioned according to event timestamps to allow for meaningful batch analysis.&lt;/p&gt;
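&lt;p&gt;To make the difference concrete, here is a minimal sketch of deriving a partition prefix from the event's own timestamp rather than its arrival time. The &lt;code&gt;timestamp&lt;/code&gt; field name and the &lt;code&gt;year=/month=/day=&lt;/code&gt; layout are assumptions for illustration, not the exact schema used in this series.&lt;/p&gt;

```python
from datetime import datetime, timezone

def event_partition_prefix(event):
    """Build an S3 partition prefix from the event's own timestamp.

    Assumes a hypothetical 'timestamp' field in ISO 8601 format;
    partitioning by processing time would instead use the moment
    the record arrived on Kinesis.
    """
    ts = datetime.fromisoformat(event["timestamp"]).astimezone(timezone.utc)
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"

# An event produced late on 3 February lands in the 3 February partition,
# even if Kinesis only processed it the next day.
prefix = event_partition_prefix({"timestamp": "2021-02-03T23:55:00+00:00"})
```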

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fedke2jtpg43tp6rykekk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fedke2jtpg43tp6rykekk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  First, a short introduction to AWS Glue
&lt;/h1&gt;

&lt;p&gt;AWS Glue (introduced in August 2017) is a serverless, cloud-optimized Extract, Transform and Load (ETL) service.&lt;br&gt;
This service can be used to automate ETL processes that organize, locate, move and transform data sets stored within a variety of data sources, allowing users to efficiently prepare these datasets for data analysis. These data sources can be, for example, data lakes in Amazon Simple Storage Service (S3), data warehouses in Amazon Redshift or other databases that are part of the Amazon Relational Database Service. Other database engines such as MySQL, Oracle, Microsoft SQL Server and PostgreSQL are also supported in AWS Glue.&lt;/p&gt;

&lt;p&gt;Since AWS Glue is a serverless service, users are not required to provision, configure and spin-up servers, and they do not need to spend time managing servers.&lt;/p&gt;

&lt;p&gt;At the heart of AWS Glue is the Catalogue, a centralized metadata repository for all data assets.&lt;br&gt;
All relevant information about data assets (such as table definitions, data locations, file types, schema information) is stored in this repository.&lt;/p&gt;

&lt;p&gt;In order to get this information into the Catalogue, AWS Glue uses crawlers. These crawlers can scan data stores and automatically infer the schema of any structured and semi-structured data that might be contained within the data stores and also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;discover file types&lt;/li&gt;
&lt;li&gt;extract the schema&lt;/li&gt;
&lt;li&gt;automatically discover datasets&lt;/li&gt;
&lt;li&gt;store all this information in the Data Catalogue&lt;/li&gt;
&lt;/ul&gt;
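&lt;p&gt;As a rough sketch of how such a crawler can be registered with &lt;code&gt;boto3&lt;/code&gt; (the crawler name, role ARN, database and S3 path below are placeholders, not the resources used in this series), the request body for &lt;code&gt;glue.create_crawler&lt;/code&gt; could be built like this:&lt;/p&gt;

```python
def crawler_definition(name, role_arn, database, s3_path):
    """Build the request body for glue_client.create_crawler(**definition).

    A minimal sketch: the role ARN, database and S3 path are placeholders.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

definition = crawler_definition(
    "raw-events-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "traffic_events",
    "s3://my-raw-events-bucket/events/",
)
# boto3.client("glue").create_crawler(**definition) would register the
# crawler, and start_crawler(Name="raw-events-crawler") would run it.
```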

&lt;p&gt;When data has been catalogued, it can then be accessed and ETL jobs can be performed on it.&lt;br&gt;
AWS Glue provides the capability to automatically generate ETL scripts, which can be used as a starting point, meaning users do not have to start from scratch when developing ETL processes.&lt;br&gt;
In this blog however, we will be focussing on the use of an alternative to the AWS Glue ETL jobs.&lt;br&gt;
We will be making use of SQL queries implemented in AWS Athena to perform the ETL process.&lt;/p&gt;
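&lt;p&gt;To give an idea of what such an Athena-based ETL step can look like, here is a sketch of building the repartitioning statement as an &lt;code&gt;INSERT INTO&lt;/code&gt; query. The table and column names are assumptions for illustration, not the exact schema from this series.&lt;/p&gt;

```python
def repartition_query(source_table, target_table, year, month, day):
    """Build an Athena INSERT INTO statement that rewrites one day of
    processing-time-partitioned source data into event-time partitions.

    A sketch under assumed names: 'event_timestamp' and the table names
    are hypothetical placeholders.
    """
    return (
        f"INSERT INTO {target_table} "
        f"SELECT *, "
        f"year(event_timestamp) AS year, "
        f"month(event_timestamp) AS month, "
        f"day(event_timestamp) AS day "
        f"FROM {source_table} "
        f"WHERE year = {year} AND month = {month} AND day = {day}"
    )

query = repartition_query("raw_events", "partitioned_events", 2021, 2, 3)
```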

&lt;p&gt;For the implementation and orchestration of more complex ETL processes, AWS Glue provides users with the option of using workflows. These can be used to coordinate more complex ETL activities involving multiple crawlers, jobs and triggers. We will, however, be using an alternative to these AWS Glue workflows, namely a state machine with Step Functions to coordinate our ETL process.&lt;/p&gt;

&lt;p&gt;To reiterate, AWS Glue has 3 main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Data Catalogue, a centralized metadata repository, where all metadata information concerning your data is stored. This includes information about tables (which define the metadata representations or schemas of the stored datasets), schemas and partitions. The metadata properties are inferred within data sources by crawlers, which also provide connections with them.&lt;/li&gt;
&lt;li&gt;The Apache Spark ETL engine. Once metadata is available in the data catalogue and source and target data stores can be selected from the catalogue, the Apache Spark ETL engine allows for the creation of ETL jobs that can be used to process the data.&lt;/li&gt;
&lt;li&gt;The Scheduler. Users can set up a schedule for their AWS ETL jobs. This schedule can be linked to a specific trigger (e.g. the completion of another ETL job) or a particular time of day, or a job can be set up to run on-demand.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  State machine
&lt;/h1&gt;

&lt;p&gt;As mentioned above, our goal is the creation of an ETL pipeline which will repartition the data we already landed in an S3 data lake.&lt;br&gt;
This repartitioning will make sure the data is partitioned based on a timestamp within the event.&lt;br&gt;
This fits our analysis purposes, as opposed to partitioning based on the timestamp at which the record arrived on Kinesis Firehose.&lt;/p&gt;

&lt;p&gt;In order to achieve this we build an ETL job to extract the existing data from S3, transform it by creating new columns based on the event timestamp from within the data and land it in new partitions.&lt;/p&gt;

&lt;p&gt;Specifically, the ETL process achieved the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we needed to figure out what the current data looked like.
In other words, we needed to register a schema for our source data (i.e. the data partitioned according to Kinesis Firehose timestamps, which we landed on S3) in the Glue Catalogue.&lt;/li&gt;
&lt;li&gt;To determine this schema, we needed to run a crawler, exploring the existing data and determining this data format.
Running the crawler created a schema for the source data and registered that schema with the Glue catalogue.&lt;/li&gt;
&lt;li&gt;Next, we needed to run an ETL process in order to transform the data into new partitions.
As already mentioned in this blog, we will focus on the use of Athena to achieve the repartitioning.
Read our next blog to see how it is done with AWS Glue.&lt;/li&gt;
&lt;li&gt;After the data was repartitioned, we, of course, wanted to be able to query the data for analysis purposes.
To achieve this, we needed to run a crawler again to establish what the repartitioned data looked like.
The crawler then registered the new schema in the Glue Catalogue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running this process continuously would not be very efficient.&lt;br&gt;
On the other hand, running it too infrequently (only once a week, for example) would mean that we would have to wait too long to be able to report on new data.&lt;/p&gt;

&lt;p&gt;We have a process of a few managed steps (running crawler, registering schema, executing ETL job, running crawler) that we need to orchestrate on a regular basis.&lt;br&gt;
Hence, it is an ideal candidate for orchestration with AWS Step Functions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In AWS Step Functions, you define your workflows in the Amazon States Language.&lt;br&gt;
The Step Functions console provides a graphical representation of that state machine to help visualize your application logic.&lt;br&gt;
States are elements in your state machine.&lt;br&gt;
A state is referred to by its name, which can be any string, but which must be unique within the scope of the entire state machine.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is an overview of what our state machine looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxpuz2k6t2jvafwu22hnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxpuz2k6t2jvafwu22hnl.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we have a finite number of steps executed one after the other.&lt;/p&gt;
&lt;h3&gt;
  
  
  ASL - Amazon States Language
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ASL&lt;/code&gt; is a JSON-based language used to define the steps of your state machine.&lt;br&gt;
Later we will look deeper into the logic executed in each step.&lt;br&gt;
Let's first look at the &lt;code&gt;ASL&lt;/code&gt; that defines these steps.&lt;/p&gt;

&lt;p&gt;AWS SAM and the Serverless Framework both allow you to specify the ASL as &lt;code&gt;YAML&lt;/code&gt;.&lt;br&gt;
We found that using &lt;code&gt;YAML&lt;/code&gt; improved readability.&lt;br&gt;
As such we defined our &lt;code&gt;ASL&lt;/code&gt; as follows (complete &lt;code&gt;ASL&lt;/code&gt; schema available &lt;a href="https://github.com/becloudway/serverless-data-pipelines-batch-processing" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;BatchProcessingStateMachine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(1 day)&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BatchProcessingStateMachine&lt;/span&gt;
  &lt;span class="na"&gt;definition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;machine&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;batch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;processing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pipeline"&lt;/span&gt;
    &lt;span class="na"&gt;StartAt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RunDataCrawler&lt;/span&gt;
    &lt;span class="na"&gt;States&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;RunDataCrawler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
        &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-RunDataCrawler&lt;/span&gt;
        &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitCrawler&lt;/span&gt;
      &lt;span class="na"&gt;WaitCrawler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Wait&lt;/span&gt;
        &lt;span class="na"&gt;Seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GetCrawlerState&lt;/span&gt;
      &lt;span class="na"&gt;GetCrawlerState&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
        &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-GetCrawlerState&lt;/span&gt;
        &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CheckCrawlerState&lt;/span&gt;
      &lt;span class="na"&gt;CheckCrawlerState&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Choice&lt;/span&gt;
        &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitCrawler&lt;/span&gt;
        &lt;span class="na"&gt;Choices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;And&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Variable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$.CrawlerState'&lt;/span&gt;
              &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;READY&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Variable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$.CrawlerStatus'&lt;/span&gt;
              &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;
            &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RunETLInsertAthena&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;And&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;ASL&lt;/code&gt; describes the same workflow as the state machine image above.&lt;br&gt;
It is just much harder for human eyes to read.&lt;/p&gt;

&lt;p&gt;Note that we indeed have the steps: running the crawler, registering the schema, executing the ETL job and running the crawler again.&lt;br&gt;
But we also have "wait" steps, where we periodically check whether a crawler is finished with its work.&lt;br&gt;
And we have fail states that we use to react to possible failures in our process.&lt;/p&gt;

&lt;p&gt;Since this blog focuses on data and not on how to build state machines, here is a link if you want to know more about &lt;code&gt;AWS State Machines&lt;/code&gt; and &lt;code&gt;Step Functions&lt;/code&gt;: &lt;a href="https://aws.amazon.com/getting-started/hands-on/create-a-serverless-workflow-step-functions-lambda/" rel="noopener noreferrer"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the resources, you'll find a link to a great course by &lt;a href="https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions" rel="noopener noreferrer"&gt;Yan Cui&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logic of Step functions
&lt;/h3&gt;

&lt;p&gt;Now it is time to look a little deeper into what happens during every step.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: Choose descriptive names for your steps so that it is immediately clear what happens in a certain step.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here are a few of our steps (again, check out the repository if you want to see all the logic):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RunDataCrawler&lt;/strong&gt;&lt;br&gt;
This triggers the execution of a Lambda function, which in turn triggers a Glue Crawler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from datetime import datetime

import boto3
import pytz

glue_client = boto3.client('glue')
CRAWLER_NAME = os.environ['CRAWLER_NAME']


def handle(event, context):
    timezone = pytz.timezone('Europe/Brussels')
    now = datetime.now(timezone)
    response = glue_client.start_crawler(Name=CRAWLER_NAME)
    # Default to the previous day's partition when no date is passed in.
    return {'response': response,
            'year': event.get('year', now.year),
            'month': event.get('month', now.month),
            'day': event.get('day', now.day - 1)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GetCrawlerState&lt;/strong&gt;&lt;br&gt;
We periodically check the state of the running crawler.&lt;br&gt;
Since there is no direct integration for crawler events with Step Functions (yet?), we have to check this using a Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import boto3

glue_client = boto3.client('glue')
CRAWLER_NAME = os.environ['CRAWLER_NAME']


def handle(event, context):
    response = glue_client.get_crawler(Name=CRAWLER_NAME)['Crawler']
    return {'CrawlerState': response['State'],
            'CrawlerStatus': response.get('LastCrawl', {'Status': None})['Status'],
            'year': event['year'],
            'month': event['month'],
            'day': event['day']}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the state of the crawler, thus telling us whether or not the crawler is finished.&lt;br&gt;
As you can see from the diagram and the &lt;code&gt;ASL&lt;/code&gt;, we'll use this status to make a &lt;code&gt;choice&lt;/code&gt; about which step to execute next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RunETLInsertAthena&lt;/strong&gt;&lt;br&gt;
When the crawler is finished, it is time to run the ETL job.&lt;br&gt;
This is done using AWS &lt;code&gt;Athena&lt;/code&gt;.&lt;br&gt;
Read more about the how and what of &lt;code&gt;Athena&lt;/code&gt; in the next paragraph.&lt;/p&gt;

&lt;p&gt;It is, however, the job of a Lambda function to start the ETL job in &lt;code&gt;Athena&lt;/code&gt; and to check when it is finished.&lt;/p&gt;

&lt;p&gt;The handler of the lambda function that starts the ETL job looks as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;execution_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionIds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;execution_ids&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define the queries, specifying which data range you want to repartition.&lt;/li&gt;
&lt;li&gt;Pass these queries to &lt;code&gt;Athena&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Return the Athena execution IDs: IDs that we can use to check on the state of the ETL job with Athena.&lt;/li&gt;
&lt;/ul&gt;
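&lt;p&gt;The &lt;code&gt;execute_query&lt;/code&gt; helper used above could look roughly like the sketch below, wrapping Athena's &lt;code&gt;start_query_execution&lt;/code&gt; call. The database name and output location are placeholders, and a small stand-in client is included so the sketch runs without AWS credentials; in the real Lambda the client would be &lt;code&gt;boto3.client('athena')&lt;/code&gt;.&lt;/p&gt;

```python
def execute_query(athena_client, query, database, output_location):
    """Submit one query to Athena and return its execution ID.

    A minimal sketch: 'database' and 'output_location' are placeholders
    for the Glue database and the S3 bucket receiving query results.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


class FakeAthena:
    """Stand-in for boto3.client('athena') so the sketch runs offline."""

    def start_query_execution(self, **kwargs):
        return {"QueryExecutionId": "demo-execution-id"}


execution_id = execute_query(
    FakeAthena(), "SELECT 1", "traffic_events", "s3://query-results/"
)
```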

&lt;p&gt;The next function checks if the ETL job is finished.&lt;br&gt;
It does so by using the execution IDs that were returned from the previous step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;athena_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_get_query_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QueryExecutionIds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionIds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AthenaState&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionIds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;QueryExecutionIds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AthenaState&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;QueryExecutionIds&lt;/code&gt; from the previous step are now used to get the status of each query.&lt;/p&gt;

&lt;p&gt;We have now seen the workflow steps necessary to repartition our data.&lt;br&gt;
This repartitioning was achieved with Athena.&lt;br&gt;
Let's dive deeper into that in the next section.&lt;/p&gt;
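&lt;p&gt;The check function above slots into a Step Functions polling loop: wait, run the check Lambda, branch on its output. A hypothetical sketch of that fragment in Amazon States Language (state names, the Lambda ARN and the wait time are illustrative assumptions, not taken from the original workflow):&lt;/p&gt;

```python
import json

# Hypothetical fragment of the Step Functions definition: wait, invoke the
# check Lambda, then branch on its AthenaState output. Names are illustrative.
poll_loop = {
    "WaitBeforeCheck": {"Type": "Wait", "Seconds": 30, "Next": "CheckQueryStatus"},
    "CheckQueryStatus": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:check-etl-status",
        "Next": "IsQueryFinished",
    },
    "IsQueryFinished": {
        "Type": "Choice",
        "Choices": [
            {"Variable": "$.AthenaState", "StringEquals": "SUCCEEDED", "Next": "NextStep"},
            {"Variable": "$.AthenaState", "StringEquals": "FAILED", "Next": "FailState"},
        ],
        # Still RUNNING or QUEUED: loop back and poll again.
        "Default": "WaitBeforeCheck",
    },
}
print(json.dumps(poll_loop, indent=2))
```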
&lt;h1&gt;
  
  
  Athena Service
&lt;/h1&gt;

&lt;p&gt;As stated above, we used AWS Athena to run the ETL job, instead of a Glue ETL job with an auto-generated script.&lt;/p&gt;

&lt;p&gt;AWS Athena natively supports querying datasets and data sources registered in the Glue Data Catalogue. This means Athena uses the Glue Data Catalogue as a centralized location where it stores and retrieves table metadata. This metadata tells the Athena query engine where to read data, how to read it, and what additional information is required to process it.&lt;br&gt;
It is, for example, possible to run an INSERT INTO DML query against a source table registered with the Data Catalogue. Such a query inserts rows into the destination table based on a SELECT statement run against the source table.&lt;br&gt;&lt;br&gt;
Directly below we show part of our complete INSERT INTO DML query, which has additional nested subqueries in which data from the source table is transformed step by step so that it can be repartitioned and used for analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="nv"&gt;"traffic"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"sls_data_pipelines_batch_transformed"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recordTimestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentSpeed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bezettingsgraad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previousSpeed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avgSpeed3Minutes&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avgSpeed3Minutes&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt;
    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trafficJamIndicator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avgSpeed20Minutes&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avgSpeed20Minutes&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt;
    &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;trafficJamIndicatorLong&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speedDiffindicator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avgSpeed3Minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avgSpeed20Minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;uniqueId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recordTimestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currentSpeed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bezettingsgraad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previousSpeed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trafficIntensityClass5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentSpeed&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;previousSpeed&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentSpeed&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;previousSpeed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt;
  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;speedDiffindicator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentSpeed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;uniqueId&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;originalTimestamp&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;FOLLOWING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avgSpeed3Minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentSpeed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;uniqueId&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;originalTimestamp&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;FOLLOWING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avgSpeed20Minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The part of the INSERT INTO DML query shown directly above performs the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final selection of relevant information for data analysis.
Not all information contained in the raw data was useful for analysis, and some data was possibly invalid (e.g. due to malfunctioning measuring equipment).&lt;/li&gt;
&lt;li&gt;The computation of aggregate values and derived fields to be used for analysis purposes.
For example, calculation of the average speed and the implementation of the logic of what we consider a traffic jam.&lt;/li&gt;
&lt;li&gt;The repartitioning of the data by event time (i.e. the year, month, day and hour values of the originalTimestamp).
Repartitioning is achieved by first defining a target table in the AWS Glue Catalogue in which year, month, day and hour bigint fields were designated as partition keys.
Subsequently, we extracted the year, month, day and hour values of the originalTimestamp (i.e. the timestamp of the measurement itself, not the timestamp of the processing time on Kinesis) and assigned these values to the corresponding partition key fields in the target table.&lt;/li&gt;
&lt;/ul&gt;
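&lt;p&gt;To make the CASE and window-function logic concrete, here is the same computation expressed in plain Python. This is a sketch for illustration only; the actual transformation runs as the Athena SQL shown above.&lt;/p&gt;

```python
# Same thresholds as the CASE expressions in the query: an average speed of
# 0-40 indicates a traffic jam (1), 41-250 free flow (0), and anything else
# is treated as invalid (-1).
def traffic_jam_indicator(avg_speed):
    if 0 <= avg_speed <= 40:
        return 1
    if 41 <= avg_speed <= 250:
        return 0
    return -1

# Counterpart of avg(currentSpeed) OVER (... ROWS BETWEEN n-1 PRECEDING
# AND 0 FOLLOWING): a trailing window average over the last `window` samples.
def trailing_avg(speeds, window):
    out = []
    for i in range(len(speeds)):
        win = speeds[max(0, i + 1 - window):i + 1]
        out.append(sum(win) / len(win))
    return out
```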

&lt;p&gt;The additional nested subqueries performed the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The selection and transformation (where necessary) of relevant information from the source data for the computation of aggregate values and derived fields.&lt;/li&gt;
&lt;li&gt;The selection of a subset of locations from the total amount of 4600 measurement locations and the natural regrouping of these locations (e.g. grouping of sets of lanes on the same road).&lt;/li&gt;
&lt;li&gt;The splitting of queries into date ranges of at most 4 days (i.e. a range between a start day and an end day).
Because Amazon imposes a limit of 100 partitions written simultaneously by a single INSERT INTO statement, we implemented a Lambda function to execute multiple concurrent queries. Splitting the queries limits the number of simultaneously written partitions to 96 (4 days of hourly partitions), safely below that limit.&lt;/li&gt;
&lt;/ul&gt;
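&lt;p&gt;The partition arithmetic behind the 4-day split can be checked directly. A small illustration, assuming hourly partitions as described above:&lt;/p&gt;

```python
# With partition keys year/month/day/hour, each day of data produces 24
# hourly partitions. Athena's INSERT INTO writes at most 100 partitions,
# so 4 days (96 partitions) fit, while 5 days (120) would exceed the limit.
ATHENA_PARTITION_LIMIT = 100
HOURS_PER_DAY = 24

def partitions_written(days):
    return days * HOURS_PER_DAY

def max_days_per_query(limit=ATHENA_PARTITION_LIMIT):
    return limit // HOURS_PER_DAY
```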

&lt;p&gt;For a link to the complete INSERT INTO DML query, please refer to &lt;a href="https://github.com/becloudway/serverless-data-pipelines-batch-processing/blob/master/queries/InsertETL.sql" rel="noopener noreferrer"&gt;https://github.com/becloudway/serverless-data-pipelines-batch-processing/blob/master/queries/InsertETL.sql&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
For a link to the explanation of field definitions, please refer to &lt;a href="https://github.com/becloudway/serverless-data-pipelines-batch-processing" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When using AWS Athena to perform the ETL job, as opposed to using Glue ETL jobs, there is no functionality to automatically start the next process in a workflow.&lt;br&gt;
Therefore we also implemented a polling mechanism to periodically check for crawler/ETL query completion.&lt;/p&gt;

&lt;h1&gt;
  
  
  Alternative solution for the ETL workflow
&lt;/h1&gt;

&lt;p&gt;As already mentioned several times, we could also have used Glue ETL jobs for the implementation of the ETL workflow. These ETL jobs handle all processing and repartitioning of the data through Python scripts running on Spark.&lt;/p&gt;

&lt;p&gt;In our next blog in the series, we will explore the practical implementation of this alternative solution and compare the advantages and disadvantages of using Glue ETL jobs vs. AWS Athena ETL queries for the implementation of ETL workflows.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Step Function course: &lt;a href="https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions" rel="noopener noreferrer"&gt;https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Serverless workflow with Step Functions: &lt;a href="https://aws.amazon.com/getting-started/hands-on/create-a-serverless-workflow-step-functions-lambda/" rel="noopener noreferrer"&gt;https://aws.amazon.com/getting-started/hands-on/create-a-serverless-workflow-step-functions-lambda/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Data - field definitions: &lt;a href="https://github.com/becloudway/serverless-data-pipelines-batch-processing" rel="noopener noreferrer"&gt;https://github.com/becloudway/serverless-data-pipelines-batch-processing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cover - image: &lt;a href="https://unsplash.com/@hertwashere" rel="noopener noreferrer"&gt;https://unsplash.com/@hertwashere&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Get in touch with the authors:
&lt;/h3&gt;

&lt;p&gt;Nick van Hoof - AWS Community builder - Trans-IT: &lt;a href="https://twitter.com/TheNickVanHoof" rel="noopener noreferrer"&gt;https://twitter.com/TheNickVanHoof&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;David Smits - &lt;a href="https://cloubis.be/" rel="noopener noreferrer"&gt;Cloubis&lt;/a&gt; &amp;amp; &lt;a href="https://www.cloudway.be/en/home" rel="noopener noreferrer"&gt;Cloudway&lt;/a&gt;: &lt;a href="mailto:david.smits@cloubis.be"&gt;david.smits@cloubis.be&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reviewed by - &lt;a href="https://www.linkedin.com/in/mitch-mommers-199a0a12b" rel="noopener noreferrer"&gt;Mitch Mommers - Cloud Native Consultant&lt;/a&gt; &lt;/p&gt;

</description>
      <category>aws</category>
      <category>data</category>
      <category>cloud</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Use lambda resolvers in your GraphQL API with AWS Amplify</title>
      <dc:creator>Nxtra</dc:creator>
      <pubDate>Sun, 06 Dec 2020 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/use-lambda-resolvers-in-your-graphql-api-with-aws-amplify-5e13</link>
      <guid>https://dev.to/aws-builders/use-lambda-resolvers-in-your-graphql-api-with-aws-amplify-5e13</guid>
      <description>&lt;p&gt;&lt;sup&gt;Written by &lt;strong&gt;&lt;a href="https://twitter.com/TheNickVanHoof" rel="noopener noreferrer"&gt;Nick Van Hoof&lt;/a&gt;&lt;/strong&gt;&lt;/sup&gt;. &lt;sup&gt;Big thanks to &lt;strong&gt;&lt;a href="https://twitter.com/@andthensumm" rel="noopener noreferrer"&gt;Matthew Marks&lt;/a&gt;&lt;/strong&gt; for the review.&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;TL;DR - Find out how to update your API resources and corresponding DynamoDB tables with a lambda resolver!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz9qbghbh7oaqnzt3gibm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz9qbghbh7oaqnzt3gibm.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The use case
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;You have posts with comments.&lt;br&gt;
When you remove a post, you want to remove its comments too.&lt;br&gt;&lt;br&gt;
How do you do that with one GraphQL mutation?&lt;br&gt;
Use a LAMBDA RESOLVER!&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;To put it more formally 😄 :&lt;br&gt;&lt;br&gt;
I want to show you how to update multiple of your Amplify API &lt;code&gt;@model&lt;/code&gt;-annotated types (and their corresponding DynamoDB tables) in one GraphQL request.&lt;/p&gt;
&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;Aight, let's see our API definition. (I took this straight from the Amplify docs and made my additions with the custom query)&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;Hooray!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A custom mutation to delete a post and corresponding comments.&lt;/li&gt;
&lt;li&gt;The response of this mutation containing the id of the deleted post.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;@function&lt;/code&gt; directive, which creates a pipeline resolver to resolve our custom mutation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to follow along, run &lt;code&gt;amplify add api&lt;/code&gt; and use the schema above.&lt;br&gt;&lt;br&gt;
But hold your horses, do not deploy. This won't work yet. We first have to create the function.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Lambda Function
&lt;/h2&gt;

&lt;p&gt;Since we used the &lt;code&gt;@function&lt;/code&gt; directive Amplify will make sure that AppSync resolves the &lt;code&gt;deletePostAndComments&lt;/code&gt; mutation using the Lambda Function.&lt;br&gt;&lt;br&gt;
Let's create that function, shall we?!&lt;/p&gt;

&lt;p&gt;In the function we follow these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import the name of our model tables. Via environment variables. We'll see later how to add these.&lt;/li&gt;
&lt;li&gt;Specify a resolver for the mutation type and &lt;code&gt;deletePostAndComments&lt;/code&gt; field.&lt;/li&gt;
&lt;li&gt;In our handler, select the right resolver.&lt;/li&gt;
&lt;li&gt;Execute the resolver logic and return the id of the deleted post&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here it is. More explanation below! 😍&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;A brief word on the resolver logic that is executed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find all comments linked to this post by querying the GSI &lt;code&gt;byPost&lt;/code&gt; that is created in the API specification in &lt;code&gt;schema.graphql&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Execute the requests to remove all corresponding comments and the post. Notice that we execute these requests in parallel by resolving the promises together.&lt;/li&gt;
&lt;li&gt;The comments are deleted via the &lt;code&gt;batchWrite&lt;/code&gt; API of DynamoDB. A post might have tens or hundreds of comments. You don't want to execute hundreds of separate requests (because of latency). Kudos to &lt;a href="https://twitter.com/dabit3" rel="noopener noreferrer"&gt;Nader Dabit&lt;/a&gt; for providing the Document Client Cheat sheet (find the link in the resources)&lt;/li&gt;
&lt;/ul&gt;
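&lt;p&gt;The original resolver is a Node.js function (embedded above), but the batching idea is easy to sketch language-neutrally: DynamoDB's batchWrite accepts at most 25 write requests per call, so the comment deletes are chunked. In this Python sketch the table name and key schema are illustrative assumptions, not the real resolver code.&lt;/p&gt;

```python
BATCH_WRITE_LIMIT = 25  # DynamoDB batchWrite accepts at most 25 requests per call

def chunked(items, size=BATCH_WRITE_LIMIT):
    """Split a list of items into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_delete_batches(comment_ids):
    """One batchWrite payload per chunk of 25 comment ids.

    Table name and key schema here are illustrative assumptions.
    """
    return [
        {'RequestItems': {'CommentTable': [
            {'DeleteRequest': {'Key': {'id': cid}}} for cid in batch
        ]}}
        for batch in chunked(comment_ids)
    ]
```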

&lt;p&gt;Now add the function using the Amplify cli.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amplify add function
? Select which capability you want to add: Lambda function (serverless function)
? Provide an AWS Lambda function name: editPostAndComments
? Choose the runtime that you want to use: NodeJS
? Choose the function template that you want to use: Hello World
? Do you want to configure advanced settings? No
? Do you want to edit the local lambda function now? Yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And add the code above in &lt;code&gt;index.js&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At this point you can run &lt;code&gt;amplify push&lt;/code&gt; for the first time.&lt;br&gt;&lt;br&gt;
However, did you notice that we chose the Hello World template? This won't configure the necessary access rights to the DynamoDB resources. Neither of the templates will; we'll do this ourselves in the section below. Also, we need environment variables for our Lambda function!&lt;/p&gt;

&lt;p&gt;Read on to find out how to setup custom permissions and environment variables.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuring our Lambda Function
&lt;/h2&gt;

&lt;p&gt;Now we dive deep into working with Amplify! 💁&lt;/p&gt;

&lt;p&gt;Adding the function to our project created a bunch of files to configure the function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Famplify-function-resources.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Famplify-function-resources.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing to be scared of, yet it can frighten new users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt;: contains the CloudFormation that defines all the resources related to this function, e.g. the role, permissions, triggers and the function itself.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parameters.json&lt;/code&gt;: a file that we will create below. It contains the parameters that will be passed into this template from other CloudFormation stacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You guessed it, we have to add our environment variables and permissions to this cloudformation template.&lt;/p&gt;

&lt;p&gt;In the parameters inside &lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt; add:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"apiGraphQLAPIIdOutput": {
  "Type": "String"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will be the parameter that receives the Id of your AppSync GraphQL API.&lt;/p&gt;

&lt;p&gt;Under environment variables inside &lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt; add:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"POSTTABLE": {
  "Fn::ImportValue": {
    "Fn::Sub": "${apiGraphQLAPIIdOutput}:GetAtt:PostTable:Name"
  }
},
"COMMENTTABLE": {
  "Fn::ImportValue": {
    "Fn::Sub": "${apiGraphQLAPIIdOutput}:GetAtt:CommentTable:Name"
  }
},
"AWS_NODEJS_CONNECTION_REUSE_ENABLED": 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here we import the table names which are exported by our API stacks. Also notice the &lt;code&gt;CONNECTION_REUSE&lt;/code&gt; environment variable; this makes sure connections with DynamoDB are reused in our function instead of being opened and closed all the time. Hence, a performance gain! 😎&lt;/p&gt;

&lt;p&gt;Now let's add the policies with the permissions for accessing our DynamoDB API resources. Under &lt;code&gt;resources&lt;/code&gt; inside &lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt; add:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;This grants access to execute the queries that are specified in the lambda's &lt;code&gt;index.js&lt;/code&gt;. Mind that we need access to the tables, but also to the index &lt;code&gt;byPost&lt;/code&gt; on the comments table.&lt;/p&gt;
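&lt;p&gt;For reference, such a policy statement typically has the following shape. This is a hedged sketch: the action list, resource ARNs and index path are assumptions based on the queries described, not the exact policy from the embedded gist.&lt;/p&gt;

```python
# Hypothetical IAM policy statement granting the function access to both
# tables and the byPost index on the comments table. ARNs are placeholders.
access_dynamodb_policy = {
    "Effect": "Allow",
    "Action": ["dynamodb:Query", "dynamodb:BatchWriteItem", "dynamodb:DeleteItem"],
    "Resource": [
        "arn:aws:dynamodb:REGION:ACCOUNT:table/PostTable",
        "arn:aws:dynamodb:REGION:ACCOUNT:table/CommentTable",
        "arn:aws:dynamodb:REGION:ACCOUNT:table/CommentTable/index/byPost",
    ],
}
```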

&lt;p&gt;People often think that custom configuration in &lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt; will be overwritten with every push.&lt;br&gt;&lt;br&gt;
This is NOT TRUE.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom configuration is not overwritten on running &lt;code&gt;amplify push&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Name your custom resources (here &lt;code&gt;AccessDynamoDBApiResourcesPolicy&lt;/code&gt;) with a name that is not used as a standard by the framework (e.g. don't use &lt;code&gt;lambdaexecutionpolicy&lt;/code&gt; for your custom policy). That way your policy will survive even when running &lt;code&gt;amplify function update&lt;/code&gt; for our function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As long as you follow these guidelines, you are pretty safe.&lt;/p&gt;

&lt;p&gt;Next add a file &lt;code&gt;parameters.json&lt;/code&gt; next to &lt;code&gt;editPostAndComments-cloudformation-template.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "apiGraphQLAPIIdOutput": {
    "Fn::GetAtt": ["api&amp;lt;nameOfYourAppSyncAPI&amp;gt;", "Outputs.GraphQLAPIIdOutput"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are telling Amplify to get the outputted &lt;code&gt;GraphQLAPIIdOutput&lt;/code&gt; from the &lt;code&gt;api&amp;lt;nameOfYourAppSyncAPI&amp;gt;&lt;/code&gt; stack. We are basically connecting our stacks here by referencing a variable from another stack which will be passed into our template.&lt;/p&gt;

&lt;p&gt;And we have lift off! Run &lt;code&gt;amplify push&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing our setup
&lt;/h2&gt;

&lt;p&gt;Log into the AWS console, go to your lambda function and verify that it has the correct permissions and environment variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Flambda-permissions.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Flambda-permissions.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to the AppSync console! Create a post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mutation CreatePost {
  createPost(input: {blogID: "dfeebd0d-3d18-4e3e-9d10-88fcf0fb8b4d", title: "Hello Cloud"}) {
    id
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify the returned id as &lt;code&gt;query variable&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "postId": "50571e27-e0f3-48ce-9432-c7eb0307010d"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create some comments for the post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mutation CreateCommentsForPost($postId: ID!) {
  create1: createComment(input: {postID: $postId, content: "This is a short comment"}) {
    id
  }
  create2: createComment(input: {postID: $postId, content: "This is a second comment"}) {
    id
  }
  create3: createComment(input: {postID: $postId, content: "This is the last comment"}) {
    id
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now for the big moment. Delete the post and its comments at once!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mutation DeletePostAndComments($postId: ID!) {
  deletePostAndComments(postId: $postId) {
    id
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;😱 Nice, that should look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Fappsync-test-mutation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Fappsync-test-mutation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you made that work, you are a ⭐!&lt;br&gt;
If you did not make it work but made it this far in the blog post, you are also a ⭐.&lt;br&gt;&lt;br&gt;
If you need guidance, reach out to me: &lt;a href="https://twitter.com/TheNickVanHoof" rel="noopener noreferrer"&gt;Nick on Twitter&lt;/a&gt; 🐥&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing performance
&lt;/h2&gt;

&lt;p&gt;Let's say you have a post with 500 comments! When deleting the post and its comments via a synchronous API call, how much time would that take?&lt;br&gt;&lt;br&gt;
Well, let's test it out!&lt;/p&gt;

&lt;p&gt;I created a post and added 500 comments. If you look into the logic of the Lambda function, you'll find that the comments are deleted using the &lt;code&gt;batchWrite&lt;/code&gt; API of DynamoDB. However, a single batchWrite request can contain at most 25 delete requests, meaning we need 20 batchWrite calls to delete 500 comments. &lt;/p&gt;
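&lt;p&gt;The batching described above could be sketched like this. This is a minimal illustration, not the exact code from the repo: the helper names (&lt;code&gt;chunk&lt;/code&gt;, &lt;code&gt;buildDeleteBatches&lt;/code&gt;) and the key schema are assumptions; only &lt;code&gt;batchWrite&lt;/code&gt; and its 25-item-per-request limit come from the DynamoDB API.&lt;/p&gt;

```javascript
// Sketch: split a post's comment ids into batchWrite chunks of 25.
// Helper names and the { id } key schema are illustrative assumptions.
const BATCH_SIZE = 25; // hard limit per DynamoDB batchWrite request

function chunk(items, size) {
  const batches = [];
  while (items.length > batches.length * size) {
    const start = batches.length * size;
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

function buildDeleteBatches(tableName, commentIds) {
  // One batchWrite params object per chunk of 25 delete requests
  return chunk(commentIds, BATCH_SIZE).map((ids) => ({
    RequestItems: {
      [tableName]: ids.map((id) => ({ DeleteRequest: { Key: { id } } })),
    },
  }));
}

// In the Lambda, each params object would then be sent with
// documentClient.batchWrite(params).promise() — 20 calls for 500 comments.
```

&lt;p&gt;For 500 comment ids this yields exactly 20 request objects of 25 delete requests each.&lt;/p&gt;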

&lt;p&gt;That is still quite a lot! 😯 Let's see how it performs via AWS X-Ray:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Fperformance-test-result.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftheclouddeveloper.io%2F%2Fimages%2F2020-11-27-amplify-lambda-resolvers%2Fperformance-test-result.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It took about 600 ms to resolve the &lt;code&gt;deletePostAndComments&lt;/code&gt; mutation for 1 post with 500 comments, using a Lambda function with 512 MB of memory configured.&lt;/p&gt;

&lt;p&gt;If executing your GraphQL mutation always has to update (or delete) this many (or more) items, at some point it becomes a better solution to execute that work asynchronously.&lt;br&gt;&lt;br&gt;
You could delete a post using a normal VTL resolver generated by Amplify. Then you can catch the delete event on the DynamoDB change stream and delete the comments with a Lambda function. Something for the next blog. 😏&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Working example for the demo in this blog: &lt;a href="https://github.com/Nxtra/Amplify-API-Lambda-Resolver" rel="noopener noreferrer"&gt;https://github.com/Nxtra/Amplify-API-Lambda-Resolver&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;Amplify direct lambda resolvers by &lt;a href="https://twitter.com/@andthensumm" rel="noopener noreferrer"&gt;Matt Marks&lt;/a&gt;: &lt;a href="https://dev.to/aws-builders/direct-lambda-resolvers-with-aws-amplify-and-appsync-2k3j"&gt;https://dev.to/aws-builders/direct-lambda-resolvers-with-aws-amplify-and-appsync-2k3j&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Document client cheat sheet by &lt;a href="https://twitter.com/dabit3" rel="noopener noreferrer"&gt;Nader Dabit&lt;/a&gt;: &lt;a href="https://github.com/dabit3/dynamodb-documentclient-cheat-sheet" rel="noopener noreferrer"&gt;https://github.com/dabit3/dynamodb-documentclient-cheat-sheet&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Credits
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A big thanks to &lt;a href="https://twitter.com/@andthensumm" rel="noopener noreferrer"&gt;Matthew Marks&lt;/a&gt; for reviewing this blog post.&lt;br&gt;
He does great work for the Amplify community, check it out &lt;a href="https://dev.to/andthensumm"&gt;here&lt;/a&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Featured image by &lt;a href="https://unsplash.com/@rockthechaos" rel="noopener noreferrer"&gt;Kolleen Gladden&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloud</category>
      <category>amplify</category>
      <category>appsync</category>
      <category>lambda</category>
    </item>
  </channel>
</rss>
