<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ani Channarasappa</title>
    <description>The latest articles on DEV Community by Ani Channarasappa (@achannarasappa).</description>
    <link>https://dev.to/achannarasappa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F278388%2F59eccd22-c13a-47e8-8404-a5c8e5c39acc.jpeg</url>
      <title>DEV Community: Ani Channarasappa</title>
      <link>https://dev.to/achannarasappa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/achannarasappa"/>
    <language>en</language>
    <item>
      <title>Serverless apartment web scraper with NodeJS, AWS Lambda, and Locust - Part 2</title>
      <dc:creator>Ani Channarasappa</dc:creator>
      <pubDate>Mon, 27 Jan 2020 01:06:28 +0000</pubDate>
      <link>https://dev.to/achannarasappa/serverless-apartment-web-scraper-with-nodejs-aws-lambda-and-locust-part-2-1joo</link>
      <guid>https://dev.to/achannarasappa/serverless-apartment-web-scraper-with-nodejs-aws-lambda-and-locust-part-2-1joo</guid>
      <description>&lt;p&gt;This is part two of a three part series in which we'll seek to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What areas in New York are most popular, have the best public transit connectivity, and offer the best amenities for their asking price?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you haven't already, check out part one &lt;a href="https://dev.to/achannarasappa/serverless-apartment-web-scraper-with-nodejs-aws-lambda-and-locust-ngk"&gt;here&lt;/a&gt; to get caught up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking ahead
&lt;/h2&gt;

&lt;p&gt;In this article we'll cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Terraform to provision the infrastructure for a serverless web crawler&lt;/li&gt;
&lt;li&gt;Setup a recursive serverless function&lt;/li&gt;
&lt;li&gt;Connecting to datastores and external systems&lt;/li&gt;
&lt;li&gt;Schedule a daily run for the crawl job&lt;/li&gt;
&lt;li&gt;Deploying the system to AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;Thus far, we've put together and tested locally a configuration file that defines how the scraper will extract apartment listings from Craigslist. That configuration should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ./src/job.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;moment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;moment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// non-configuration truncated for brevity&lt;/span&gt;
&lt;span class="c1"&gt;// see here for full file: https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/src/job.js&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;transformListing&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext #titletextonly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;price&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext .price&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext .housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext small&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;datetime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postinginfo time&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;datetime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;images&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#thumbs .thumb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;attributes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.mapAndAttrs p.attrgroup:not(:nth-of-type(1)) span&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google_maps_link&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.mapaddress a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;description&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#postingbody&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isListingUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;saveListing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://newyork.craigslist.org/search/apa&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apartment-listings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;concurrencyLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;depthLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;isIndexUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;isListingUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`ws://localhost:3000`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The next steps are to design the system, set up the infrastructure, and deploy the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Design
&lt;/h2&gt;

&lt;p&gt;Let's define some non-functional requirements and considerations to guide the design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No pre-existing infrastructure or systems - a greenfield build&lt;/li&gt;
&lt;li&gt;Listings change frequently so the crawl should be run on a regular interval&lt;/li&gt;
&lt;li&gt;Locust requires a Redis and Chrome instance for it's queue and HTTP requests respectively&lt;/li&gt;
&lt;li&gt;Network access

&lt;ul&gt;
&lt;li&gt;Serverless run context will need network access to the data store for listings&lt;/li&gt;
&lt;li&gt;Serverless run context will need network access to the Redis and Chrome instances for Locust&lt;/li&gt;
&lt;li&gt;Chrome will need access to the internet to execute HTTP requests&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;A database schema will need to be defined for the data store before it is usable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these in mind, the system diagram would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RFrb1Xp3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r09c4ezx7prgx2jubqio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RFrb1Xp3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r09c4ezx7prgx2jubqio.png" alt="system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: the database will be in the public subnet to simplify initial setup&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure setup
&lt;/h2&gt;

&lt;p&gt;To setup and manage infrastructure, we'll use &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; to define our infrastructure as configuration. The some of the Terraform resources needed for this setup are low level and not part of the core problem so we'll pull in a few Terraform modules that provide higher order abstractions for these common resource collections. These are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS VPC - &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-vpc"&gt;terraform-aws-modules/vpc/aws&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS RDS - &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-rds"&gt;terraform-aws-modules/rds/aws&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Locust internal resources - &lt;a href="https://github.com/achannarasappa/locust-aws-terraform"&gt;github.com/achannarasappa/locust-aws-terraform&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compute (AWS Lambda)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kQLchaqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iht1ebhxeq3w47mq9ntf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kQLchaqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iht1ebhxeq3w47mq9ntf.png" alt="compute"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First we'll start by setting up the Locust job in an AWS Lambda function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./infra/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"aws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lambda_function"&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;function_name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"apartment-listings"&lt;/span&gt;
  &lt;span class="nx"&gt;filename&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./src.zip"&lt;/span&gt;
  &lt;span class="nx"&gt;source_code_hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;filebase64sha256&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"./src.zip"&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"src/handler.start"&lt;/span&gt;
  &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nodejs10.x"&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note here that a handler of &lt;code&gt;src/handler.start&lt;/code&gt; is referenced along with a file bundle &lt;code&gt;./src.zip&lt;/code&gt;. &lt;code&gt;src/handler.start&lt;/code&gt; is the AWS Lambda function handler that is called when the function is triggered. Since with each Locust job run, the next job's data is pulled from Redis queue, no arguments are needed from the handler and the handler ends up being fairly straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ./src/handler.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;execute&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@achannarasappa/locust&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./job.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, the source along with dependencies will need to be &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/package.json#L7"&gt;bundled into &lt;code&gt;./src.zip&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; zip &lt;span class="nt"&gt;-r&lt;/span&gt; ./infra/src.zip ./src package&lt;span class="k"&gt;*&lt;/span&gt;.json node_modules
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Since &lt;code&gt;source_code_hash&lt;/code&gt; has been set to &lt;code&gt;filebase64sha256&lt;/code&gt; of the zip file, a rebundle will result in a diff in Terraform and the new file bundle will be pushed up.&lt;/p&gt;

&lt;p&gt;From this point, the lambda can be provisioned to AWS with &lt;code&gt;terraform apply&lt;/code&gt; but it won't be all that useful since it still lacks connection information and network access to other resources in addition to basic permissions to run. We will come back to this Terraform block later to add those pieces once they've been setup elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking (VPC)
&lt;/h3&gt;

&lt;p&gt;In order to provision many of the resources needed for this system, a VPC is required. The &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-vpc"&gt;terraform-aws-modules/vpc/aws&lt;/a&gt; module can be used to setup a VPC along with some common resources associated with networking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./infra/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/vpc/aws"&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"apartment-listings"&lt;/span&gt;

  &lt;span class="nx"&gt;cidr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;

  &lt;span class="nx"&gt;azs&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1d"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.2.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;public_subnets&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.101.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.102.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# enable public access to database for initial setup&lt;/span&gt;
  &lt;span class="nx"&gt;create_database_subnet_group&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_database_subnet_route_table&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_database_internet_gateway_route&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_hostnames&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_support&lt;/span&gt;                     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With the VPC setup, we can start adding resources to it starting with the database&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage (AWS RDS)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5P6y4Sfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/lgpmtjfn8id6dvs1v5w1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5P6y4Sfm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/lgpmtjfn8id6dvs1v5w1.png" alt="database"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the database, we'll need to provision a Postgres instance to AWS RDS along with set up the schema. The configuration for a minimal database will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./infra/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/rds/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 2.0"&lt;/span&gt;

  &lt;span class="nx"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"apartment-listings-postgres"&lt;/span&gt;

  &lt;span class="nx"&gt;engine&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt;
  &lt;span class="nx"&gt;engine_version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.10"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.t3.micro"&lt;/span&gt;
  &lt;span class="nx"&gt;allocated_storage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="nx"&gt;storage_encrypted&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_database&lt;/span&gt;
  &lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_user&lt;/span&gt;
  &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_password&lt;/span&gt;
  &lt;span class="nx"&gt;port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_port&lt;/span&gt;

  &lt;span class="nx"&gt;publicly_accessible&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="nx"&gt;maintenance_window&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Mon:00:00-Mon:03:00"&lt;/span&gt;
  &lt;span class="nx"&gt;backup_window&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"03:00-06:00"&lt;/span&gt;
  &lt;span class="nx"&gt;backup_retention_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;family&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres10"&lt;/span&gt;
  &lt;span class="nx"&gt;major_engine_version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.10"&lt;/span&gt;

  &lt;span class="nx"&gt;enabled_cloudwatch_logs_exports&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"postgresql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"upgrade"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_protection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note here that the RDS instance is marked as publicly accessible and part of a public subnet so that we can perform the one time setup of the database schema. There are also no &lt;code&gt;vpc_security_group_ids&lt;/code&gt; defined yet which will need to be added later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"local-database-access"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${module.vpc.vpc_id}"&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;self&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tonumber&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_port&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tonumber&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;postgres_port&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"${chomp(data.http.myip.body)}/32"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"http"&lt;/span&gt; &lt;span class="s2"&gt;"myip"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://ipv4.icanhazip.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"null_resource"&lt;/span&gt; &lt;span class="s2"&gt;"db_setup"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provisioner&lt;/span&gt; &lt;span class="s2"&gt;"local-exec"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PGPASSWORD=${var.postgres_password} psql -h ${module.db.this_db_instance_address} -p ${var.postgres_port} -f ../db/schema/setup.sql ${var.postgres_database} ${var.postgres_user}"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;aws_security_group_rule&lt;/code&gt; will add a firewall rule that allows access from the machine being used to provision this system while the &lt;code&gt;null_resource&lt;/code&gt; named &lt;code&gt;db_setup&lt;/code&gt; will execute an ad-hoc sql query using &lt;a href="https://www.postgresql.org/download/"&gt;&lt;code&gt;psql&lt;/code&gt;&lt;/a&gt; that will create the table and schema in the database (this will run locally so psql will need to be installed on the local machine). The &lt;code&gt;db&lt;/code&gt; resource will also need to be updated with the newly created security group for local access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"${aws_security_group.local-database-access}"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With the infra defined for the database, the we'll need &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/db/schema/setup.sql"&gt;sql statements&lt;/a&gt; that sets up the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bedroom_count&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;size&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_posted&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Looking back at the &lt;code&gt;./src/job.js&lt;/code&gt; file, the properties here correspond 1:1 with the output of the &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/src/job.js#L54"&gt;&lt;code&gt;transformListing&lt;/code&gt; function&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now all the pieces are in place to provision the database. Also note that there are several variables defined in the preceding terraform blocks that will need to defined in &lt;code&gt;variables.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"postgres_user"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"postgres_password"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"postgres_database"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"postgres_port"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"5432"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Scheduling runs (AWS Cloudwatch)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qzTUV7FW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kmauinbkfcju9g2mw2jj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qzTUV7FW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kmauinbkfcju9g2mw2jj.png" alt="cron"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to have the crawl execute on an interval, a cron-like solution will be needed that interfaces well with AWS Lambda. One way to achieve that is through a scheduled CloudWatch event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_rule"&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Crawls apartment listings on a schedule"&lt;/span&gt;

  &lt;span class="nx"&gt;schedule_expression&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"rate(1 day)"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_target"&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${aws_cloudwatch_event_rule.apartment_listings_crawler.name}"&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${aws_lambda_function.apartment_listings_crawler.arn}"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will trigger the Lambda once per day which will start a crawler job that will continue until a stop condition is met spawning additional Lambdas bounded by the &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/src/job.js#L104"&gt;parameters in the job definition file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An additional resource-based permission is needed to allow CloudWatch events to trigger Lambdas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lambda_permission"&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;action&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"lambda:InvokeFunction"&lt;/span&gt;
  &lt;span class="nx"&gt;function_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${aws_lambda_function.apartment_listings_crawler.function_name}"&lt;/span&gt;
  &lt;span class="nx"&gt;principal&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"events.amazonaws.com"&lt;/span&gt;
  &lt;span class="nx"&gt;source_arn&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_cloudwatch_event_rule&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apartment_listings_crawler&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Locust internal resources
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_Voo8atQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/c7vw0etjv9vtblpmtvma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Voo8atQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/c7vw0etjv9vtblpmtvma.png" alt="locust"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last remaining set of resources to add are the chrome instance which Locust will use to execute HTTP requests in a browser context and the Redis instance which will power Locust's job queue. These are all defined within the Terraform module &lt;a href="https://github.com/achannarasappa/locust-aws-terraform"&gt;&lt;code&gt;github.com/achannarasappa/locust-aws-terraform&lt;/code&gt;&lt;/a&gt;. Inputs for this module are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;vpc_id&lt;/em&gt; - VPC id from &lt;code&gt;apartment-listings&lt;/code&gt; VPC defined earlier&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;private_subnet_ids&lt;/em&gt; - list of private subnet ids from &lt;code&gt;apartment-listings&lt;/code&gt; VPC defined earlier&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;public_subnet_ids&lt;/em&gt; - list of public subnet ids from &lt;code&gt;apartment-listings&lt;/code&gt; VPC defined earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And outputs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;redis_hostname&lt;/em&gt; - hostname of the Redis instance which will need to be passed to the AWS Lambda running Locust&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;chrome_hostname&lt;/em&gt; - hostname of the Chrome instance which will need to be passed to the AWS Lambda running Locust&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;security_group_id&lt;/em&gt; - AWS security group that the Redis and Chrome instances are a part of&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;iam_role_arn&lt;/em&gt; - AWS IAM role with the proper permissions to access Chrome, Redis, and run Locust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll need to revisit the Lambda configuration to add the hostnames, role ARN, and security group with the outputs from this module in the next section. The security group can also be reused by the &lt;code&gt;db&lt;/code&gt; module to allow access from the Lambda to Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="err"&gt;...&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"${module.locust.security_group_id}"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Tying everything together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xgN-IM-t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nawzn5btc7flzom3t2yl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xgN-IM-t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nawzn5btc7flzom3t2yl.png" alt="tying"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Earlier we set up a placeholder Lambda function that was missing a few key pieces that we now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM role&lt;/li&gt;
&lt;li&gt;VPC subnets&lt;/li&gt;
&lt;li&gt;Security groups with dependent resources&lt;/li&gt;
&lt;li&gt;Hostnames for Redis and Chrome plus connection information for Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that other resources have been setup, the &lt;code&gt;aws_lambda_function&lt;/code&gt; can be updated with this information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lambda_function"&lt;/span&gt; &lt;span class="s2"&gt;"apartment_listings_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="err"&gt;...&lt;/span&gt;

  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${module.locust.iam_role_arn}"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;concat&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"${module.locust.security_group_id}"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;variables&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;CHROME_HOST&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${module.locust.chrome_hostname}"&lt;/span&gt;
      &lt;span class="nx"&gt;REDIS_HOST&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${module.locust.redis_hostname}"&lt;/span&gt;
      &lt;span class="nx"&gt;POSTGRES_HOST&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${module.db.this_db_instance_address}"&lt;/span&gt;
      &lt;span class="nx"&gt;POSTGRES_USER&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.postgres_user}"&lt;/span&gt;
      &lt;span class="nx"&gt;POSTGRES_PASSWORD&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.postgres_password}"&lt;/span&gt;
      &lt;span class="nx"&gt;POSTGRES_DATABASE&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.postgres_database}"&lt;/span&gt;
      &lt;span class="nx"&gt;POSTGRES_PORT&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.postgres_port}"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Connection information for dependencies are passed into the Lambda run context to tell Locust &lt;em&gt;where&lt;/em&gt; to connect. The security groups, subnets, and IAM role allow the Lambda to make outbound connections to Postgres, Chrome, and Redis.&lt;/p&gt;

&lt;p&gt;Now that connection information for AWS is being passed into the Locust run context, the various &lt;code&gt;localhost&lt;/code&gt; references in &lt;code&gt;./src/job.js&lt;/code&gt; can be updated to use those environment variables.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the connection to Postgres (&lt;code&gt;saveListing&lt;/code&gt;s function):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POSTGRES_HOST&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POSTGRES_DATABASE&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POSTGRES_USER&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POSTGRES_PASSWORD&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;POSTGRES_PORT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the connection object for Redis and Chrome:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REDIS_HOST&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`ws://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CHROME_HOST&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:3000`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With all of the connection details setup, the last step is to replace the dummy &lt;a href="https://locust.dev/docs/api#function-start"&gt;&lt;code&gt;start&lt;/code&gt; function&lt;/a&gt; with a function that will trigger a new job run. This will allow Locust to recursively trigger itself until a &lt;a href="https://locust.dev/docs/concepts#stop-condition"&gt;stop condition&lt;/a&gt; is met. In this case, we need to initiate a new Lambda function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2015-03-31&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apartment-listings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;InvocationType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Event&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nx"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying to AWS
&lt;/h2&gt;

&lt;p&gt;The final setup is to provision the infrastructure and push the bundled source for the crawler. With the &lt;code&gt;source_code_hash = filebase64sha256("./src.zip")&lt;/code&gt; in resource block for &lt;code&gt;aws_lambda_function&lt;/code&gt;, the bundle &lt;code&gt;./src.zip&lt;/code&gt; will be pushed along with a &lt;code&gt;terraform apply&lt;/code&gt; so no distinct step is needed for that.&lt;/p&gt;

&lt;p&gt;Bundle the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; ./infra/src.zip &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; zip &lt;span class="nt"&gt;-r&lt;/span&gt; ./infra/src.zip ./src package&lt;span class="k"&gt;*&lt;/span&gt;.json node_modules
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Double check thar &lt;code&gt;terraform&lt;/code&gt; and &lt;code&gt;psql&lt;/code&gt; are installed locally then apply the changes with terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ./infra &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; terraform apply &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The provisioning will take about 10 minutes then the system should be up and running. The CloudWatch will automatically trigger the job once a day so no additional ad-hoc commands are need to run the crawler.&lt;/p&gt;

&lt;p&gt;If you'd like to trigger the crawler immediately, this command can be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--invocation-type&lt;/span&gt; Event &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--function-name&lt;/span&gt; apartment_listings_crawler &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1  &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--profile&lt;/span&gt; default &lt;span class="se"&gt;\&lt;/span&gt;
out.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Refer to the Locust operational guide for tips on how to manage Locust and debug issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Thus far in the series, we've learned how to build a serverless crawler with Locust in part 1 including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzing how web data is related on a particular website and how this can be used by a crawler to discover page on the fly&lt;/li&gt;
&lt;li&gt;Identifying relevant elements of a web page and how to extract them using Web APIs&lt;/li&gt;
&lt;li&gt;Filtering out noise and optimizing crawler efficiency&lt;/li&gt;
&lt;li&gt;Controlling crawler behaviors and setting stop conditions&lt;/li&gt;
&lt;li&gt;Persisting to a datastore&lt;/li&gt;
&lt;li&gt;Cleaning data before persistence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we've covered how to deploy the crawler to AWS including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Terraform to provision the infrastructure for a serverless web crawler&lt;/li&gt;
&lt;li&gt;Setup a recursive serverless function&lt;/li&gt;
&lt;li&gt;Connecting to datastores and external systems&lt;/li&gt;
&lt;li&gt;Schedule a daily run for the crawl job&lt;/li&gt;
&lt;li&gt;Deploying the system to AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next article in the series, we'll take a look at the data that's been gathered by the crawler to come to a data driven answer to the original question of where are the best areas to live in New York City.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>aws</category>
      <category>serverless</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Serverless apartment web scraper with NodeJS, AWS Lambda, and Locust</title>
      <dc:creator>Ani Channarasappa</dc:creator>
      <pubDate>Sun, 22 Dec 2019 15:32:22 +0000</pubDate>
      <link>https://dev.to/achannarasappa/serverless-apartment-web-scraper-with-nodejs-aws-lambda-and-locust-ngk</link>
      <guid>https://dev.to/achannarasappa/serverless-apartment-web-scraper-with-nodejs-aws-lambda-and-locust-ngk</guid>
      <description>&lt;p&gt;New York's apartment rental market is competitive with rentals in desireable neighborhoods being rented quickly. Let's build a Craigslist apartment listing web scraper to understand the market better and make a data driven decision on where to move.&lt;/p&gt;

&lt;p&gt;Let's focus on this aspect of the of the apartment rental market:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What areas in New York are most popular, have the best public transit connectivity, and offer the best amenities for their asking price?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This will be the first of a three part series:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gathering rental market data - Building a web scraper &lt;/li&gt;
&lt;li&gt;Gathering rental market data - Deploying and operating the web scraper&lt;/li&gt;
&lt;li&gt;Deriving rental market insights - Analyzing the data&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution Space
&lt;/h2&gt;

&lt;p&gt;While there are a number of different tools that can be used for web data extraction, let's impose some criteria for this project to help refine solution selection.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Minimize infrastructure costs (idle + active)&lt;/li&gt;
&lt;li&gt;Horizontally scalability of data extraction&lt;/li&gt;
&lt;li&gt;Maintainability of data extraction logic&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Technologies
&lt;/h3&gt;

&lt;p&gt;The solution space of web data extraction is quite crowded with a number of open source projects and commercial offerings. In this case we will use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS RDS&lt;/strong&gt; (storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt; (compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NodeJS&lt;/strong&gt; (runtime)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://locust.dev" rel="noopener noreferrer"&gt;&lt;strong&gt;Locust&lt;/strong&gt;&lt;/a&gt; (scraping framework)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disclosure: Locust is developed by me&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;First, we'll divide the web scraping problem into a more manageable sub-problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand site and page structure

&lt;ul&gt;
&lt;li&gt;How to pages relate to one another?&lt;/li&gt;
&lt;li&gt;Which pages contain relevant information?&lt;/li&gt;
&lt;li&gt;What data attributes are useful for this problem?&lt;/li&gt;
&lt;li&gt;Is any processing needed to clean up or restructure the data?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configuring the web scraper

&lt;ul&gt;
&lt;li&gt;When should the scraper stop gathering listings?&lt;/li&gt;
&lt;li&gt;How can we gather data quickly while being considerate of site load?&lt;/li&gt;
&lt;li&gt;How should we handle error conditions?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Persisting data

&lt;ul&gt;
&lt;li&gt;How do the entities we store relate to one another?&lt;/li&gt;
&lt;li&gt;How do we structure the data we store?&lt;/li&gt;
&lt;li&gt;Should raw output or cleaned/formatted data be stored?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Deployment and infrastructure on AWS

&lt;ul&gt;
&lt;li&gt;What infrastructure do we need to provision on AWS?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;p&gt;We'll also need to validate some assumptions during initial discovery and as we begin capturing data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Site and page structure

&lt;ol&gt;
&lt;li&gt;There are only two types of pages - indexes and details&lt;/li&gt;
&lt;li&gt;There is only one page structure for each type of entity with minor variations&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Site and user behaviors

&lt;ol&gt;
&lt;li&gt;When listings are removed or retired, the unit is taken by a new tenant&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Discovery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Page categorization
&lt;/h3&gt;

&lt;p&gt;Starting by visiting the &lt;a href="https://newyork.craigslist.org/search/apa?s=120" rel="noopener noreferrer"&gt;CL New York page apartment listing page&lt;/a&gt; and exploring, there's ostensibly only two relevant groupings of pages each with different types of information we need to extract:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Entity index&lt;/strong&gt; - list of multiple entities with some limited detail
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi3snzq7whmzlgurkngbj.png" alt="entity index"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity detail&lt;/strong&gt; - detailed information on a single entity
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftxyfjfz3ro2kuyj2much.png" alt="entity detail"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Page relationships
&lt;/h3&gt;

&lt;p&gt;Web pages are linked to one another with anchor elements (&lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tags). The &lt;code&gt;href&lt;/code&gt; attributes of these elements link to other related pages and  can be used to crawl the entirety of the site. Since we're only interested in the above two type of entities, the only links we are interested in are those to other entities.&lt;/p&gt;

&lt;p&gt;To get an idea of what links are on an entity index and entity detail page, &lt;code&gt;$$('a').map(el =&amp;gt; el.href)&lt;/code&gt; can be run in Chrome Developer Tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxmfk3s3qw5bpdzqsoxwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxmfk3s3qw5bpdzqsoxwj.png" alt="links on page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, there are 350+ links from this page which are mostly not relevant or duplicates. However through examining the results, we find that there are two link patterns that correspond to the two types of entities identified above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Entity index - &lt;code&gt;https://newyork.craigslist.org/search/apa?s=&amp;lt;page offset&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Entity detail - &lt;code&gt;https://newyork.craigslist.org/&amp;lt;region&amp;gt;/apa/d/&amp;lt;listing name&amp;gt;/&amp;lt;listing id&amp;gt;.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The scraper will need to bound it's crawl of the site to these two types of pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Entity attributes
&lt;/h3&gt;

&lt;p&gt;In the previous step, we've already identified links as one of the data attributes that need to be extracted to crawl a site. Since the entity information on an entity index page is rather limited, we'll focus on extracting entity attributes from the entity detail page.&lt;/p&gt;

&lt;p&gt;Since it's not yet clear at this stage, what listing elements influence apartment popularity, let's capture as many attributes as possible and cleave away irrelevant attributes at a later time.&lt;/p&gt;

&lt;p&gt;Below are some attributes and their corresponding locations on the page to capture as a first pass:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv5sk1s5gt807a0f36s31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv5sk1s5gt807a0f36s31.png" alt="page attributes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;title&lt;/li&gt;
&lt;li&gt;price&lt;/li&gt;
&lt;li&gt;bedroom_count&lt;/li&gt;
&lt;li&gt;size&lt;/li&gt;
&lt;li&gt;attributes&lt;/li&gt;
&lt;li&gt;latitude&lt;/li&gt;
&lt;li&gt;longitude&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each of these, we'll need to find the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors" rel="noopener noreferrer"&gt;CSS selectors&lt;/a&gt;. In some cases, (e.g. &lt;code&gt;bedroom_count&lt;/code&gt;) we'll need to capture the an element that contains the data attributes value and use regular expressions later on to process the data and extract the information needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;At this point, we have enough understanding of the site to start writing code / configuration. Before moving on from discovery, let's summarize what we've learned about the site:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are two types of pages that have data we're interested in:

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Entity index&lt;/strong&gt; - list of multiple entities with some limited detail

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Information to extract&lt;/strong&gt;: links to other entity indexes and entity detail pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforms&lt;/strong&gt; - filtering out links to extraneous pages that are not entity indexes or entity detail pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs&lt;/strong&gt; - list of links to entity index and entity detail pages that should be fed back into the web scraper to scrape next&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Entity detail&lt;/strong&gt; - detailed information on a single entity

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Information to extract&lt;/strong&gt; - attributes of the single entity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforms&lt;/strong&gt; - formatting, cleaning, or restructuring entity attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs&lt;/strong&gt; - a single entity to persist to a datastore&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;Refer to the &lt;a href="https://github.com/achannarasappa/locust-examples/tree/master/apartment-listings#setup" rel="noopener noreferrer"&gt;setup section&lt;/a&gt; in the example repo for instructions on how to setup the required tools and dependencies to run the subsequent steps locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;The high level process flow will look something like this:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzbjhxqxkaya9bgjzddyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzbjhxqxkaya9bgjzddyz.png" alt="process flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Locust will handle the labeled scraping and queueing steps with the right job configuration file. The only logic that needs to be developed is the integration with the persistence layer.&lt;/p&gt;

&lt;p&gt;Steps 3, 4, and 5 will loop until a stop condition (step 6) is met at which point the crawl will end.&lt;/p&gt;
&lt;h3&gt;
  
  
  Defining the job
&lt;/h3&gt;

&lt;p&gt;We'll start by defining some base properties for the job that will govern how it will operate. We'll choose some reasonable starting values for these and work to refine them as we learn more about the site behaviors and limitations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entrypoint - As is standard for web crawlers, an entrypoint url defines the first page that is crawled and where links to subsequent pages is extracted. A good starting url will link to other relevant pages and in this case, that would be the first entity index page &lt;code&gt;https://newyork.craigslist.org/search/apa&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Stop Conditions - When should the job stop? As a starting point, we'll set a depth limit of 2 indicating that the job shouldn't crawl pages that are more than two degrees of separation from the entrypoint page.&lt;/li&gt;
&lt;li&gt;Throttling - How should we limit the web crawler so it does not put too great a load on the site? Many servers will enforce rate limitations and ban clients that exceed those limitations. We need to define some starting limitations for the crawler to obey so as to not come up against these limitations. We can start with two concurrent job at any given time and introduce a delay of 3000ms before each job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is a &lt;a href="https://locust.dev/docs/api#object-jobdefinition" rel="noopener noreferrer"&gt;Locust job definition&lt;/a&gt; that captures that above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://newyork.craigslist.org/search/apa&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// entrypoint url where the job start&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apartment-listings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;concurrencyLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// maximum concurrent number of jobs&lt;/span&gt;
    &lt;span class="na"&gt;depthLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// maximum link distance of a page from the entrypoint url to be scraped&lt;/span&gt;
    &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// delay in milliseconds before starting a scrape job&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// locust queue connection details&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// locust chrome connection details&lt;/span&gt;
      &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ws://localhost:3000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: Locust's CLI tool can be used to interactively generate this file with &lt;a href="https://locust.dev/docs/develop#create-a-job" rel="noopener noreferrer"&gt;&lt;code&gt;locust generate&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, let's test that this job works with &lt;a href="https://locust.dev/docs/develop#run" rel="noopener noreferrer"&gt;&lt;code&gt;locust run job.js&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ locust run job.js &lt;span class="nt"&gt;-l&lt;/span&gt;
Running &lt;span class="k"&gt;in &lt;/span&gt;single job mode. Queue related hooks and configuration will be ignored. Check docs &lt;span class="k"&gt;for &lt;/span&gt;more information.
response:
  ok:         &lt;span class="nb"&gt;true
  &lt;/span&gt;status:     200
  statusText: OK
  headers:
    last-modified:             Sat, 30 Nov 2019 17:26:56 GMT
    cache-control:             max-age&lt;span class="o"&gt;=&lt;/span&gt;900, public
    &lt;span class="nb"&gt;date&lt;/span&gt;:                      Sat, 30 Nov 2019 17:26:55 GMT
    content-encoding:          &lt;span class="nb"&gt;gzip
    &lt;/span&gt;vary:                      Accept-Encoding
    content-length:            36348
    content-type:              text/html&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;utf-8
    x-frame-options:           SAMEORIGIN                                                           
    server:                    Apache
    expires:                   Sat, 30 Nov 2019 17:41:56 GMT
    set-cookie:                &lt;span class="nv"&gt;cl_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4|c67de625ad2525f94f6b813ca1498758bbff6f5a|1575135224cQqUI&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nv"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.craigslist.org&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nv"&gt;expires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Fri, 01-Jan-2038 00:00:00 GMT
    strict-transport-security: max-age&lt;span class="o"&gt;=&lt;/span&gt;86400
  url:        https://newyork.craigslist.org/search/apa
links:
  - https://newyork.craigslist.org/
  - https://newyork.craigslist.org/
  - https://post.craigslist.org/c/nyc
  - https://accounts.craigslist.org/login/home
  - https://newyork.craigslist.org/search/apa#
  - https://newyork.craigslist.org/search/apa#
  ... 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here again we see the ~350 links. Next let's strip out links to pages that are not relevant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filtering links
&lt;/h3&gt;

&lt;p&gt;In order to filter the links down to just entity index and detail pages, we can apply a &lt;a href="https://locust.dev/docs/api#function-filter" rel="noopener noreferrer"&gt;filter function&lt;/a&gt; with a couple regular expressions. Referring back to the two page patterns identified as relevant earlier, these can be converted into regular expressions to bound the pages the job run on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isDetailUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="sr"&gt;/newyork&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;craigslist&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;org&lt;/span&gt;&lt;span class="se"&gt;\/(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;)\/?&lt;/span&gt;&lt;span class="sr"&gt;apa&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;d&lt;/span&gt;&lt;span class="se"&gt;\/(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;)\.&lt;/span&gt;&lt;span class="sr"&gt;html&lt;/span&gt;&lt;span class="se"&gt;(?&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;!#&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isIndexUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="sr"&gt;/newyork&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;craigslist&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;org&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;search&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;apa&lt;/span&gt;&lt;span class="se"&gt;\?&lt;/span&gt;&lt;span class="sr"&gt;s=&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;isIndexUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nf"&gt;isDetailUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;locust run job.js -l&lt;/code&gt; again will yield a much less noisy set of links. We still see duplicates however these will be filtered out internally by Locust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting data
&lt;/h3&gt;

&lt;p&gt;Using upon the page elements identified earlier, we can add an &lt;a href="https://locust.dev/docs/api#function-extract" rel="noopener noreferrer"&gt;extract function&lt;/a&gt; to define entity attributes to extract from the page for our job. We'll also need to handle cases when an element at a selector does not exist since we have two page structures that need to be handled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext #titletextonly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;price&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext .price&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext .housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.postingtitletext small&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;$&lt;/code&gt; convenience function selects the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent" rel="noopener noreferrer"&gt;text content&lt;/a&gt; of the first element the CSS selector matches.&lt;/p&gt;

&lt;p&gt;We also want to extract out the listing attributes which correspond to multiple HTML elements with attributes we're interested in. Locuts' &lt;code&gt;$&lt;/code&gt; is design to only extract a single element from the page so we'll need to use Puppeteer's version of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll" rel="noopener noreferrer"&gt;Document.querySelectorAll&lt;/a&gt;, &lt;a href="https://pptr.dev/#?product=Puppeteer&amp;amp;version=v1.18.1&amp;amp;show=api-pageevalselector-pagefunction-args" rel="noopener noreferrer"&gt;page.$$eval&lt;/a&gt; to extract multiple attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;images&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#thumbs .thumb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying the same approach to the other entity attributes identified earlier, we will end up with an extract function that looks something like &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/src/job.js#L72" rel="noopener noreferrer"&gt;this&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;Again running this with Locust CLI returns the unformatted data that we expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ locust run job.js   
Running &lt;span class="k"&gt;in &lt;/span&gt;single job mode. Queue related hooks and configuration will be ignored. Check docs &lt;span class="k"&gt;for &lt;/span&gt;more information.
data: 
  title:            Great Location 1 Bd Kent Ave
  price:            &lt;span class="nv"&gt;$1995&lt;/span&gt;
  housing:          / 1br - 550ft2 - 
  location:          &lt;span class="o"&gt;(&lt;/span&gt;Bed Sty/ Clinton Hill&lt;span class="o"&gt;)&lt;/span&gt;
  datetime:         2019-11-30T09:18:35-0500
  images: 
    - https://images.craigslist.org/00n0n_4f3tg9LaeXL_600x450.jpg
    - https://images.craigslist.org/00202_6CW2GEUYqb5_600x450.jpg
    - https://images.craigslist.org/01313_dP3ybMPhO0j_600x450.jpg
    - https://images.craigslist.org/00909_71bNJzxnYCJ_600x450.jpg
    - https://images.craigslist.org/00606_aJQr6Xo6hFU_600x450.jpg
    - https://images.craigslist.org/00C0C_9dQLT85mc4e_600x450.jpg
    - https://images.craigslist.org/00Y0Y_b1LXFSOQtEH_600x450.jpg
  attributes: 
    - application fee details: &lt;span class="nv"&gt;$20&lt;/span&gt; credit check
    - broker fee details: one month
    - cats are OK - purrr
    - apartment
    - laundry &lt;span class="k"&gt;in &lt;/span&gt;bldg
    - listed by: Lawrence Amrhein/Exit All Seasons
  google_maps_link: https://www.google.com/maps/preview/@40.694989,-73.959472,16z
url:      https://newyork.craigslist.org/brk/apa/d/brooklyn-great-location-1-bd-kent-ave/7029456524.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at a few of the attributes, all the off the data is present but not in a fully usable state (e.g. housing). Next, we'll setup some transformations to clean up the data before we persist it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transforming data
&lt;/h3&gt;

&lt;p&gt;Some of the data that the page exposes can be used as is however there some attributes that we want to clean, transform, or split. Below are the attributes that we'll seek to pull from the raw output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;price - parse into numerical value with two decimal places&lt;/li&gt;
&lt;li&gt;bedroom count - parse number followed by &lt;code&gt;br&lt;/code&gt; from &lt;code&gt;housing&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;size - parse number followed by &lt;code&gt;ft2&lt;/code&gt; from &lt;code&gt;housing&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;latitude - parse string from &lt;code&gt;google_maps_link&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;longitude - parse string from &lt;code&gt;google_maps_link&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;date_posted - parse ISO 8601 datetime from human readable datetime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That transform function would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;moment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;moment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transformListing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\$([&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\((&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;)\)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;bedroom_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;br/&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;housing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;ft2/&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;date_posted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;datetime&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;moment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;YYYY-MM-DD HH:mm:ss&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;images&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google_maps_link&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/@&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;,/&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google_maps_link&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/,&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;,/&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchObjectPropertyRegexOrNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;property&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;property&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;property&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;property&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;transformListing&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layering the transform function into the job definition file and running with the CLI, the output should include the transformed output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ locust run ./apartment-listings/src/job.js
Running &lt;span class="k"&gt;in &lt;/span&gt;single job mode. Queue related hooks and configuration will be ignored. Check docs &lt;span class="k"&gt;for &lt;/span&gt;more information.
data: 
  title:         Great Location 1 Bd Kent Ave
  price:         1995
  location:      Bed Sty/ Clinton Hill
  bedroom_count: 1
  size:          550
  date_posted:   2019-11-30 09:18:35
  attributes: 
    - application fee details: &lt;span class="nv"&gt;$20&lt;/span&gt; credit check
    - broker fee details: one month
    - cats are OK - purrr
    - apartment
    - laundry &lt;span class="k"&gt;in &lt;/span&gt;bldg
    - listed by: Lawrence Amrhein/Exit All Seasons
  images: 
    - https://images.craigslist.org/00n0n_4f3tg9LaeXL_600x450.jpg
    - https://images.craigslist.org/00202_6CW2GEUYqb5_600x450.jpg
    - https://images.craigslist.org/01313_dP3ybMPhO0j_600x450.jpg
    - https://images.craigslist.org/00909_71bNJzxnYCJ_600x450.jpg
    - https://images.craigslist.org/00606_aJQr6Xo6hFU_600x450.jpg
    - https://images.craigslist.org/00C0C_9dQLT85mc4e_600x450.jpg
    - https://images.craigslist.org/00Y0Y_b1LXFSOQtEH_600x450.jpg
  latitude:      40.694989
  longitude:     &lt;span class="nt"&gt;-73&lt;/span&gt;.959472
url:      https://newyork.craigslist.org/brk/apa/d/brooklyn-great-location-1-bd-kent-ave/7029456524.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the right data attributes, the next step is to start persisting the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Persisting data
&lt;/h3&gt;

&lt;p&gt;Since the attributes and structure of listing data is consistent for the most part, a relational database is a suitable storage solution. &lt;/p&gt;

&lt;h4&gt;
  
  
  Postgres Setup
&lt;/h4&gt;

&lt;p&gt;Let's proceed with starting up a local Postgres server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="nt"&gt;--name&lt;/span&gt; listings-pg postgres:10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then creating a Postgres Schema and table with schema matching the transformed data structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bedroom_count&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;size&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_posted&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="nb"&gt;character&lt;/span&gt; &lt;span class="nb"&gt;varying&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the Postgres database setup with the proper schema, the next step is to update the job to insert listings.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating the job
&lt;/h4&gt;

&lt;p&gt;In order to insert a new listing after each job run, a postgres client will be needed and the popular &lt;a href="https://github.com/brianc/node-postgres" rel="noopener noreferrer"&gt;&lt;code&gt;pg&lt;/code&gt; library&lt;/a&gt; will work.&lt;/p&gt;

&lt;p&gt;In the job file, a connection will also need to be established for each job run since all jobs run in independent AWS Lambda functions along with a call to execute an &lt;code&gt;INSERT&lt;/code&gt; query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;saveListing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postgres&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO listing.home&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;(title, price, "location", bedroom_count, "size", date_posted, "attributes", images, description, latitude, longitude)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;VALUES(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$1,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$2,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$3,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$4,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$5,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$6,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$7,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$8,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$9,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$10,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;$11&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;);&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, a Locust &lt;a href="https://locust.dev/docs/api#function-start" rel="noopener noreferrer"&gt;&lt;code&gt;after&lt;/code&gt; hook&lt;/a&gt; will need to be added to the job definition file in which the &lt;code&gt;saveListing&lt;/code&gt; function will be called after scraping the site and transforming the output data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;saveListing&lt;/code&gt; should also only be called on the entity detail pages and not on the entity index pages so a conditional is in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// job.js&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// defined earlier for the filter function&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isListingUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveListing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the integration of the persistence layer, the job definition is for the most part complete. The next step is to do a test run of the job locally before deploying to AWS.&lt;/p&gt;

&lt;p&gt;The complete job definition file can be found in &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/src/job.js" rel="noopener noreferrer"&gt;the example repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;Earlier, &lt;code&gt;locust run&lt;/code&gt; was used to scrape a single page to validate that the &lt;code&gt;extract&lt;/code&gt; function worked as expected with the queue related features of Locust disabled. Before going through the trouble of setting up infrastructure on AWS and pushing the job up, it is best to run the the job locally with &lt;a href="https://locust.dev/docs/develop#run" rel="noopener noreferrer"&gt;&lt;code&gt;locust start&lt;/code&gt;&lt;/a&gt;. This will run the job very similarly to how it will operate on AWS Lambda (or any cloud provider). This will also run a CLI UI that shows active jobs, their status, and queue information which is useful to tracking job progress and uncovering issues with the job.&lt;/p&gt;

&lt;p&gt;First, ensure that dependent systems are up (postgres, redis, chrome) from &lt;a href="https://github.com/achannarasappa/locust-examples/blob/master/apartment-listings/docker-compose.yml" rel="noopener noreferrer"&gt;this docker-compose.yml&lt;/a&gt; file and start them if not with &lt;code&gt;docker-compose up&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, run the start command with the job file and monitor it's progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;locust start ./job.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnroko6ie4gb8kxzomg23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnroko6ie4gb8kxzomg23.png" alt="monitor run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connecting to the Postgres database and &lt;code&gt;SELECT&lt;/code&gt;ing contents of the &lt;code&gt;listing.home&lt;/code&gt; table, we can observe new listings being added while the job is running:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fr0kyy9d2srjs4fq2le62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fr0kyy9d2srjs4fq2le62.png" alt="postgres"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a good indication that the job is stable and is suitable to push up to AWS.&lt;/p&gt;

&lt;p&gt;Up until this point, the we've hardcoded configuration for local runs in the job definition file. Before pushing up to AWS, AWS-specific integrations will need to be added including environment variables and a Locust &lt;a href="https://locust.dev/docs/api#function-start" rel="noopener noreferrer"&gt;&lt;code&gt;start&lt;/code&gt; hook&lt;/a&gt; to define for Locust how to invoke a new Lambda instance on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In part two, we'll deploy the scraper to AWS and begin gathering data.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>aws</category>
      <category>serverless</category>
      <category>scraping</category>
    </item>
  </channel>
</rss>
