<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andreas</title>
    <description>The latest articles on DEV Community by Andreas (@andreasa).</description>
    <link>https://dev.to/andreasa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F261519%2F8dfeabf3-0003-4fd3-97b0-7b084a608cbb.jpg</url>
      <title>DEV Community: Andreas</title>
      <link>https://dev.to/andreasa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andreasa"/>
    <language>en</language>
    <item>
      <title>Why and how I built an Email Permutator</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Tue, 25 Jul 2023 20:42:46 +0000</pubDate>
      <link>https://dev.to/andreasa/why-and-how-i-built-an-email-permutator-2adf</link>
      <guid>https://dev.to/andreasa/why-and-how-i-built-an-email-permutator-2adf</guid>
      <description>&lt;p&gt;Sometimes it takes a while to automate the trivial tasks … 10 years back, I started working as a marketer in the B2B software niche. I was confronted with email marketing, contact &amp;amp; and newsletter messaging from day one. The fact that I have an academic background in mechanical engineering and hence automation and programmatic approaches have always been a strong part of the way I work, make me question even more why it has taken me 10 years to automate something that I have solved manually over and over again: email permutation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is email permutation?
&lt;/h2&gt;

&lt;p&gt;Email permutation is the systematic variation of a person’s first and last name, combined with a domain, in order to guess their email address. The basic assumption is that business email addresses follow commonly used syntax patterns, like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;johnsmith@example.com
john.smith@example.com
jsmith@example.com
j.smith@example.com
johns@example.com
john.s@example.com
js@example.com
j.s@example.com
smithjohn@example.com
smith.john@example.com
smithj@example.com
smith.j@example.com
sjohn@example.com
s.john@example.com
sj@example.com
s.j@example.com
john-smith@example.com
j-smith@example.com
john-s@example.com
j-s@example.com
smith-john@example.com
smith-j@example.com
s-john@example.com
s-j@example.com
john_smith@example.com
j_smith@example.com
john_s@example.com
j_s@example.com
smith_john@example.com
smith_j@example.com
s_john@example.com
s_j@example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Converting "john", "smith", and "example.com" into variables turns the list into the following templates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;${firstname}@${domain}
${firstname}${lastname}@${domain}
${firstname}.${lastname}@${domain}
${firstLetterFirstname}${lastname}@${domain}
${firstLetterFirstname}.${lastname}@${domain}
${firstname}${firstLetterLastname}@${domain}
${firstname}.${firstLetterLastname}@${domain}
${firstLetterFirstname}${firstLetterLastname}@${domain}
${firstLetterFirstname}.${firstLetterLastname}@${domain}
${lastname}${firstname}@${domain}
${lastname}.${firstname}@${domain}
${lastname}${firstLetterFirstname}@${domain}
${lastname}.${firstLetterFirstname}@${domain}
${firstLetterLastname}${firstname}@${domain}
${firstLetterLastname}.${firstname}@${domain}
${firstLetterLastname}${firstLetterFirstname}@${domain}
${firstLetterLastname}.${firstLetterFirstname}@${domain}
${firstname}-${lastname}@${domain}
${firstLetterFirstname}-${lastname}@${domain}
${firstname}-${firstLetterLastname}@${domain}
${firstLetterFirstname}-${firstLetterLastname}@${domain}
${lastname}-${firstname}@${domain}
${lastname}-${firstLetterFirstname}@${domain}
${firstLetterLastname}-${firstname}@${domain}
${firstLetterLastname}-${firstLetterFirstname}@${domain}
${firstname}_${lastname}@${domain}
${firstLetterFirstname}_${lastname}@${domain}
${firstname}_${firstLetterLastname}@${domain}
${firstLetterFirstname}_${firstLetterLastname}@${domain}
${lastname}_${firstname}@${domain}
${lastname}_${firstLetterFirstname}@${domain}
${firstLetterLastname}_${firstname}@${domain}
${firstLetterLastname}_${firstLetterFirstname}@${domain}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
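&lt;p&gt;The templates above can be expanded programmatically. Here is a minimal Python sketch (not the code behind the actual tool, which is not shown here) that generates the candidate list for a given name and domain:&lt;/p&gt;

```python
def permute_emails(first, last, domain):
    """Expand the common business-email syntax patterns for one person."""
    first, last = first.lower(), last.lower()
    f, l = first[0], last[0]
    # name-part combinations, mirroring the template list above
    combos = [
        (first,),
        (first, last), (f, last), (first, l), (f, l),
        (last, first), (last, f), (l, first), (l, f),
    ]
    separators = ["", ".", "-", "_"]
    emails = []
    for combo in combos:
        for sep in separators:
            if len(combo) == 1 and sep:
                continue  # a lone first name needs no separator
            emails.append(sep.join(combo) + "@" + domain)
    # drop duplicates (these occur when an initial equals a full name)
    return list(dict.fromkeys(emails))

print(len(permute_emails("John", "Smith", "example.com")))  # 33 candidates
```

&lt;p&gt;For John Smith at example.com, this yields the candidate addresses listed above; at most one of them is usually real, which is where validation comes in.&lt;/p&gt;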



&lt;p&gt;Since my business also provides an email validation service (&lt;a href="https://emailvalidation.io"&gt;emailvalidation.io&lt;/a&gt;), I have published the &lt;a href="https://emailvalidation.io/free-email-permutator/"&gt;email permutator&lt;/a&gt; on the website of our service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can be done with the retrieved list
&lt;/h2&gt;

&lt;p&gt;The nature of the list above implies that at most one of these addresses is actually valid. Sending emails to all of them will quickly destroy your sender reputation and drive future email campaigns into your recipients’ spam folders. &lt;/p&gt;

&lt;p&gt;Hence, the real magic starts when you feed the retrieved list into an email validation service as a second step. This leaves you with the addresses that are actually valid and can be targeted in your next email campaign. Of course, you should always get people’s consent before adding them to an email campaign. &lt;/p&gt;

</description>
      <category>email</category>
      <category>marketing</category>
    </item>
    <item>
      <title>The most common email validation problems</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Thu, 01 Dec 2022 15:26:57 +0000</pubDate>
      <link>https://dev.to/andreasa/the-most-common-email-validation-problems-4j9o</link>
      <guid>https://dev.to/andreasa/the-most-common-email-validation-problems-4j9o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Email validation can be a little confusing, but it's an important part of your email marketing strategy. Why? Email validation helps ensure that your emails reach the right person and don't get marked as spam. It also keeps your engagement metrics meaningful, so you can track how many people are opening and clicking on your messages and improve them over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not validating your email addresses
&lt;/h2&gt;

&lt;p&gt;Email validation is the process of confirming that an email address exists and can receive mail. It helps you determine whether a message will actually be delivered to the intended recipient.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A common misconception is that a “no-reply” or invalid email address means that no one can reply to your message. However, this isn't always true—it may sometimes mean that mail sent to this address will bounce back with an error message because it's unknown or not working properly (and you might still get some kind of automated response).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you're sending out emails to a list of contacts and want to ensure that each person receives your communication, make sure all addresses are validated before sending out any messages!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Not tracking and analyzing your results
&lt;/h2&gt;

&lt;p&gt;Tracking and analyzing the results of your emails is critical to growing your business.&lt;br&gt;
This can be done manually or using a tool like Mailchimp. It’s important to keep track of open rates, click rates and unsubscribes so you can identify trends in your audience and adjust your strategy accordingly.&lt;br&gt;
For example, if you notice that people are opening most of their emails in the morning but not clicking on them at all, then maybe it’s time to try sending an email at another time of day. If your email validation has shown that people aren’t clicking on certain subject lines or content types, maybe it's time for some adjustments there as well!&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the wrong email validation tools
&lt;/h2&gt;

&lt;p&gt;Email validation is a complex process. It's not simply a matter of checking whether an email address exists or not—it's also important to check the syntax and make sure that domains are valid, among other things.&lt;br&gt;
You should use the right tool for the job. A general web service like Google Apps will not be able to handle all of these tasks, so don't try it! You need an email validation tool specifically built for this purpose.&lt;br&gt;
If you do decide to use a web service that isn't designed specifically for email validation (such as &lt;a href="https://apps.google.com/business/"&gt;Google Apps&lt;/a&gt;), make sure it at least checks whether an address exists, whether its domain is valid, and whether the syntax is correct; otherwise, switch to a dedicated service such as &lt;a href="https://emailvalidation.io"&gt;emailvalidation.io&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not correcting typos in emails
&lt;/h2&gt;

&lt;p&gt;The most common email validation problems are simple typos. Sometimes you hit the wrong button, or maybe you're just so excited about something that your fingers fly over the keyboard without checking to see if everything is spelled correctly.&lt;br&gt;
Don't worry—all hope isn't lost! There are ways to avoid typos in emails and even fix them after they've been sent out.&lt;br&gt;
To prevent these mishaps from happening, try taking a few deep breaths before sending an email out to your coworkers or clients. This will give you time to proofread it and catch any errors before they reach their intended destination. If it's more of an emergency situation (e.g., someone else needs access urgently), try asking someone else for help! A fresh pair of eyes can often spot mistakes that we're too close to see ourselves.&lt;br&gt;
If you do end up sending an email with spelling or grammar errors, don't panic; there's still hope for redemption! Just send a short follow-up apologizing for the mistake and including the corrected information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding yourself to spam lists
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you use email validation tools that are not compatible with your email provider, it could result in your emails being filtered as spam.&lt;/li&gt;
&lt;li&gt;Adding yourself to spam lists is also a common mistake. Even if you’re sending emails through a reputable service, there’s no guarantee that everyone on your mailing list wants to get them.&lt;/li&gt;
&lt;li&gt;Finally, never send emails to people who have not consented to receive them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Email validation ensures that your emails reach the right person and can help you improve your open rates and click-through rates
&lt;/h2&gt;

&lt;p&gt;Email validation is important for improving your email marketing performance. Because it's so easy to get an incorrect or outdated email address, the benefits of email validation go beyond just ensuring that messages reach the right person. It can also help you improve your open rates and click-through rates, as well as deliverability and bounce rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Don’t let email validation slip away from you. It is an important part of any marketing strategy, and it’s easy to do: you can use a validation API or build your own rules with regular expressions. If you keep the pitfalls above in mind when validating emails, you’ll be well on your way to better open and click-through rates!&lt;/p&gt;
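&lt;p&gt;As an illustration, here is a deliberately simple, hypothetical syntax-only check with a regular expression in Python; a full validation service additionally verifies the domain and the mailbox itself:&lt;/p&gt;

```python
import re

# Syntax-only check: this accepts well-formed addresses but says
# nothing about whether the mailbox actually exists.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_valid(address):
    return bool(EMAIL_RE.match(address))

print(looks_valid("john.smith@example.com"))  # True
print(looks_valid("john.smith@example"))      # False, no top-level domain
```

&lt;p&gt;A check like this catches obvious typos before a message is ever sent, but it should only ever be the first of several validation steps.&lt;/p&gt;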

</description>
      <category>email</category>
    </item>
    <item>
      <title>Web Scraping with PHP: Building a Competitor Price Monitoring Tool</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Mon, 18 Jan 2021 13:42:32 +0000</pubDate>
      <link>https://dev.to/andreasa/web-scraping-with-php-building-a-competitor-price-monitoring-tool-1le9</link>
      <guid>https://dev.to/andreasa/web-scraping-with-php-building-a-competitor-price-monitoring-tool-1le9</guid>
      <description>&lt;p&gt;Depending on your business's niche and market, adjusting your services and prices has to come along with taking your competitors into account. &lt;br&gt;
In a lot of companies that I have seen, this is a manual task, that is completed once every quarter or at least every year. &lt;br&gt;
In this PHP web scraping tutorial, we are going to build a tiny tool, that automizes this process. Of course, the tool will need further advancements, but is always about understanding the concepts, right? :) &lt;/p&gt;

&lt;p&gt;Let's get started!&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;We will need the following set of tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web server with PHP&lt;/li&gt;
&lt;li&gt;Composer&lt;/li&gt;
&lt;li&gt;Guzzle - as HTTP client&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/paquettg/php-html-parser" rel="noopener noreferrer"&gt;PHP HTML Parser&lt;/a&gt; - as HTML Parser&lt;/li&gt;
&lt;li&gt;A currency parser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Download Composer &lt;a href="https://getcomposer.org/download/?ref=mohit" rel="noopener noreferrer"&gt;here&lt;/a&gt; and follow the installation instructions.&lt;/p&gt;

&lt;p&gt;After Composer has been installed successfully, install Guzzle via Composer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;composer require guzzlehttp/guzzle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's install our HTML parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;composer require paquettg/php-html-parser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we add the currency parser to our project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;composer require mcuadros/currency-detector dev-master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the scraper
&lt;/h2&gt;

&lt;p&gt;As we want to build a competitor price monitoring tool, let's say that this product URL is our own:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.allendalewine.com/products/11262719/diplomatico-reserva-exclusiva" rel="noopener noreferrer"&gt;https://www.allendalewine.com/products/11262719/diplomatico-reserva-exclusiva&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a competitor page, we select the following: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.winetoship.com/diplomatico-rum-reserva-exclusiva.html" rel="noopener noreferrer"&gt;https://www.winetoship.com/diplomatico-rum-reserva-exclusiva.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we have to define the CSS-Selectors that contain the price information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fzenscrape.com%2Fwp-content%2Fuploads%2F2021%2F01%2FScreenshot-2021-01-18-at-13.48.27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fzenscrape.com%2Fwp-content%2Fuploads%2F2021%2F01%2FScreenshot-2021-01-18-at-13.48.27.png" alt="Selecting Price Information"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For our "own" website, the selector is &lt;code&gt;.sale-price.currency&lt;/code&gt;. Going through the same process for the competitor, the selector is &lt;code&gt;.less-price .o_price span&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Putting the pieces together, we end up with the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php
require 'vendor/autoload.php';

use \GuzzleHttp\Client;
use \PHPHtmlParser\Dom;
use \CurrencyDetector\Detector;


$productPairs = [
    'rum' =&amp;gt; [
        'own' =&amp;gt; [
            'url' =&amp;gt; 'https://www.allendalewine.com/products/11262719/diplomatico-reserva-exclusiva',
            'selectorPath' =&amp;gt; '.sale-price.currency'
        ],
        'competitor1' =&amp;gt; [
            'url' =&amp;gt; 'https://www.winetoship.com/diplomatico-rum-reserva-exclusiva.html',
            'selectorPath' =&amp;gt; '.less-price .o_price span'
        ]
    ]
    # you can add as many product pairs as you wish
];

$detector = new Detector();

$comparison = [];

foreach ($productPairs as $productName =&amp;gt; $pair) {


    foreach($pair as $provider =&amp;gt; $product) {

        $client = new Client();
        $parser = new Dom;

        $request = $client-&amp;gt;request('GET', $product['url']);
        $response = (string) $request-&amp;gt;getBody();
        $parser-&amp;gt;loadStr($response);
        $price = $parser-&amp;gt;find($product['selectorPath'])[0];
        $priceString = $price-&amp;gt;text;

        $comparison[$productName][$provider] = [
            'currency' =&amp;gt; $detector-&amp;gt;getCurrency($priceString),
            'amount' =&amp;gt; $detector-&amp;gt;getAmount($priceString),
        ];     

    }
}

echo json_encode($comparison);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can add as many product and competitor entries as you like. The scraper loops through all products and competitors and fetches the HTML markup. Our DOM parser then extracts the relevant elements from the HTML. Finally, the currency detector parses the price string into a comparable, normalized format. &lt;/p&gt;

&lt;p&gt;I used the following &lt;a href="https://zenscrape.com/web-scraping-php/" rel="noopener noreferrer"&gt;PHP web scraping tutorial&lt;/a&gt; to create this scraper.&lt;/p&gt;

</description>
      <category>php</category>
      <category>ecommerce</category>
      <category>scraping</category>
    </item>
    <item>
      <title>My attempt to build a worldwide zip code data set</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Fri, 02 Oct 2020 15:40:35 +0000</pubDate>
      <link>https://dev.to/andreasa/my-attempt-to-build-a-worldwide-zip-code-data-set-c06</link>
      <guid>https://dev.to/andreasa/my-attempt-to-build-a-worldwide-zip-code-data-set-c06</guid>
      <description>&lt;p&gt;I always thought getting worldwide postal codes by myself was an easy task because postal codes seem to be nothing more than a simple shortcode that is publicly available. I quickly realized this was not the case, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no single source of truth&lt;/li&gt;
&lt;li&gt;Most sources were incomplete &lt;/li&gt;
&lt;li&gt;Data was very often presented in a very unstructured way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After doing some general research, I soon understood, that the reason for the problems above had their origin in the history of postal codes. Each country has a different format, area granularity, and &lt;a href="https://en.wikipedia.org/wiki/Postal_code#Presentation"&gt;way of structuring postal codes as a whole&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I first tried to scrape Wikipedia with the following code. For this post, I will use the example of &lt;a href="https://en.wikipedia.org/wiki/List_of_postal_codes_in_Austria"&gt;Austria&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this, I wrote a small Python script.&lt;br&gt;
Before running it make sure to install all dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip3 install lxml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip3 install requests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip3 install bs4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'https://en.wikipedia.org/wiki/List_of_postal_codes_in_Austria'&lt;/span&gt;

&lt;span class="c1"&gt;# fire GET request
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# parse content
&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lxml'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get postal codes
&lt;/span&gt;
&lt;span class="n"&gt;postcodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;postcode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'li'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;' - '&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# filter edge cases
&lt;/span&gt;
&lt;span class="n"&gt;postcodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;postcode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;postcodes&lt;/span&gt; 
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;
    &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# write output to file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'at_postcodes.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;postcode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;postcodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postcode&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The obtained data set and the related approach might be enough for some use cases, but since I wanted to get global postal code data, I was not satisfied.&lt;/p&gt;

&lt;p&gt;I live in Austria and realized very quickly that the data I had just scraped was incomplete (some postal codes are missing). Considering the time it took me to build the parser, and the fact that I would have to adapt it for every single data source (adaptations are needed even across Wikipedia, since every article is structured differently), I decided to abandon this approach.&lt;/p&gt;

&lt;p&gt;This was the moment I gave up on scraping and started to look for ready-to-use solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://zipcodebase.com"&gt;Zipcodebase.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://greatdata.com/"&gt;greatdata.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.unitedstateszipcodes.org/zip-code-database/"&gt;unitedstateszipcodes.org&lt;/a&gt; (US Only)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geopostcodes.com/"&gt;getpostcodes.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this article will save you some time, in case you are trying to achieve the same. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping Soccer Data with Nodejs and Puppeteer</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Fri, 11 Sep 2020 13:27:57 +0000</pubDate>
      <link>https://dev.to/andreasa/scraping-soccer-data-with-nodejs-and-puppeteer-3mh2</link>
      <guid>https://dev.to/andreasa/scraping-soccer-data-with-nodejs-and-puppeteer-3mh2</guid>
      <description>&lt;p&gt;I was recently in a situation of needing sports data - soccer data in particular. I am writing this post, as I had to overcome some initial problems. You should be able to follow my thoughts and my path that lead me to a solution. &lt;/p&gt;

&lt;p&gt;For this tutorial, I came across flashscore.com, a website that covers plenty of leagues, including fixtures and live matches. &lt;/p&gt;

&lt;p&gt;I started with the following basic script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;//performing a GET request&lt;/span&gt;
&lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.flashscore.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;//handling the success&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;//handling error&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To investigate what is being returned by the script, I forwarded all returned content into a &lt;code&gt;test.html&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node scraper.js &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After opening the HTML file in my browser, I quickly realized that all the match information shown on the original website was missing. This was no big surprise, as I expected the content to be rendered by JavaScript. &lt;/p&gt;

&lt;p&gt;As the script above is written for Node.js, I started to play around with Puppeteer, a Node library that provides a high-level API to control headless Chrome or Chromium.&lt;/p&gt;

&lt;p&gt;After some time, I ended up with the following piece of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;//initiating Puppeteer&lt;/span&gt;
&lt;span class="nx"&gt;puppeteer&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

        &lt;span class="c1"&gt;//opening a new page and navigating to Fleshscore&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;newPage&lt;/span&gt; &lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goto&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.flashscore.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;waitForSelector&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;body&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;//manipulating the page's content&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;grabMatches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;evaluate&lt;/span&gt; &lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;allLiveMatches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelectorAll&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.event__match--oneLine&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;//storing the post items in an array then selecting for retrieving content&lt;/span&gt;
        &lt;span class="nx"&gt;scrapeItems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
        &lt;span class="nx"&gt;allLiveMatches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forEach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

            &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;postDescription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;homeTeam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.event__participant--home&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;awayTeam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.event__participant--away&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentHomeScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.event__scores.fontBold span:nth-of-type(1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentAwayScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.event__scores.fontBold span:nth-of-type(2)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="nx"&gt;scrapeItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="na"&gt;homeTeam&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;homeTeam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="na"&gt;awayTeam&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;awayTeam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="na"&gt;currentHomeScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentHomeScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="na"&gt;currentAwayScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentAwayScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;});&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

            &lt;span class="p"&gt;});&lt;/span&gt;
            &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;liveMatches&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapeItems&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="c1"&gt;//outputting the scraped data&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grabMatches&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;//closing the browser&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;close&lt;/span&gt; &lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;//handling any errors&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I ran the script again with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node scraper.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FVC8nVxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/w8e9pwrt0bdvaj419p1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FVC8nVxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/w8e9pwrt0bdvaj419p1p.png" alt="Scraping Soccer Data" width="800" height="817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, I retrieved a beautiful list of JSON data.&lt;br&gt;
Of course, there is plenty of additional work that could go into sorting the data by league, country, and so on. &lt;/p&gt;

&lt;p&gt;For my use case, this snippet was enough. If you are aiming for more serious scraping, you may as well pick a general sports or soccer API (e.g. &lt;a href="https://sportdataapi.com"&gt;sportdataapi.com&lt;/a&gt;, &lt;a href="https://xmlsoccer.com"&gt;xmlsoccer.com&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Happy Scraping :-) &lt;/p&gt;

</description>
      <category>scraping</category>
      <category>sports</category>
      <category>node</category>
    </item>
    <item>
      <title>Introduction to Web Scraping With JavaScript and Node.js</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Wed, 09 Sep 2020 13:42:04 +0000</pubDate>
      <link>https://dev.to/andreasa/introduction-to-web-scraping-with-javascript-and-node-js-101b</link>
      <guid>https://dev.to/andreasa/introduction-to-web-scraping-with-javascript-and-node-js-101b</guid>
      <description>&lt;p&gt;With the massive increase in the volume of data on the Internet, this technique is becoming increasingly beneficial in retrieving information from websites and applying them for various use cases. Typically, web data extraction involves making a request to the given web page, accessing its HTML code, and parsing that code to harvest some information. Since JavaScript is excellent at manipulating the DOM (Document Object Model) inside a web browser, creating data extraction scripts in Node.js can be extremely versatile. Hence, this tutorial focuses on javascript web scraping.&lt;/p&gt;

&lt;p&gt;In this article, we’re going to illustrate how to perform web scraping with JavaScript and Node.js. &lt;/p&gt;

&lt;p&gt;We’ll start by demonstrating how to use the Axios and Cheerio packages to extract data from a simple website.&lt;/p&gt;

&lt;p&gt;Then, we’ll show how to use a headless browser, Puppeteer, to retrieve data from a dynamic website that loads content via JavaScript.&lt;/p&gt;

&lt;h3&gt;
  
  
  What you’ll need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Web browser&lt;/li&gt;
&lt;li&gt;A web page to extract data from&lt;/li&gt;
&lt;li&gt;Code editor such as Visual Studio Code
&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Axios&lt;/li&gt;
&lt;li&gt;Cheerio&lt;/li&gt;
&lt;li&gt;Puppeteer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready?&lt;/p&gt;

&lt;p&gt;Let’s begin getting our hands dirty…&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installing Node.js
&lt;/h3&gt;

&lt;p&gt;Node.js is a popular JavaScript runtime environment that comes with lots of features for automating the laborious task of gathering data from websites. &lt;/p&gt;

&lt;p&gt;To install it on your system, follow the download instructions on the official Node.js website. npm (the Node Package Manager) will be installed automatically alongside Node.js.&lt;/p&gt;

&lt;p&gt;npm is the default package management tool for Node.js. Since we’ll be using packages to simplify web scraping, npm will make the process of consuming them fast and painless.&lt;/p&gt;

&lt;p&gt;After installing Node.js, go to your project’s root directory and run the following command to create a package.json file, which will contain all the details relevant to the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;npm&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Installing Axios
&lt;/h3&gt;

&lt;p&gt;Axios is a robust promise-based HTTP client that runs both in Node.js and in the web browser. With this npm package, you can make HTTP requests from Node.js using promises, and download data from the Internet quickly and easily.&lt;/p&gt;

&lt;p&gt;Furthermore, Axios automatically transforms data into JSON format, intercepts requests and responses, and can handle multiple concurrent requests.&lt;/p&gt;

&lt;p&gt;To install it, navigate to your project’s directory folder in the terminal, and run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;npm&lt;/span&gt; &lt;span class="nx"&gt;install&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By default, npm will install Axios in a folder named node_modules, which is created automatically in your project’s directory.&lt;/p&gt;
&lt;h3&gt;
  
  
  Installing Cheerio
&lt;/h3&gt;

&lt;p&gt;Cheerio is an efficient and lean module that provides jQuery-like syntax for manipulating the content of web pages. It greatly simplifies the process of selecting, editing, and viewing DOM elements on a web page. &lt;/p&gt;

&lt;p&gt;While Cheerio allows you to parse and manipulate the DOM easily, it does not work the same way as a web browser. This means it doesn’t make network requests, execute JavaScript, load external resources, or apply CSS styling. &lt;/p&gt;

&lt;p&gt;To install it, navigate to your project’s directory folder in the terminal, and run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;npm&lt;/span&gt; &lt;span class="nx"&gt;install&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By default, just like Axios, npm will install Cheerio in a folder named node_modules, which will be automatically created in your project’s directory.&lt;/p&gt;
&lt;h3&gt;
  
  
  Installing Puppeteer
&lt;/h3&gt;

&lt;p&gt;Puppeteer is a Node.js library that lets you control a headless Chrome browser programmatically and extract data smoothly and quickly.&lt;/p&gt;

&lt;p&gt;Since some websites rely on JavaScript to load their content, using an HTTP-based tool like Axios may not yield the intended results. With Puppeteer, you can simulate the browser environment, execute JavaScript just like a browser does, and scrape dynamic content from websites.&lt;/p&gt;

&lt;p&gt;To install it, just like the other packages, navigate to your project’s directory folder in the terminal, and run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;npm&lt;/span&gt; &lt;span class="nx"&gt;install&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Scraping a simple website
&lt;/h2&gt;

&lt;p&gt;Now let’s see how we can use Axios and Cheerio to extract data from a simple website. &lt;/p&gt;

&lt;p&gt;For this tutorial, our target will be &lt;a href="https://www.forextradingbig.com/instaforex-broker-review/"&gt;this web page&lt;/a&gt;. We’ll be seeking to extract the number of comments listed on the top section of the page. &lt;/p&gt;

&lt;p&gt;To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool on our web browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OKBa_9M1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u1t5ltsfqgxt28lhrix2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OKBa_9M1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u1t5ltsfqgxt28lhrix2.png" alt="Alt Text" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see on the image above, the number of comments data is enclosed in an &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tag, which is a child of the &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; tag with a class of &lt;code&gt;comment-bubble&lt;/code&gt;. We’ll use this information when using Cheerio to select these elements on the page.&lt;/p&gt;

&lt;p&gt;Here are the steps for creating the scraping logic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the web page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then, let’s use the &lt;code&gt;require&lt;/code&gt; function, which is built-in within Node.js, to include the modules we’ll use in the project.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let’s use Axios to make a GET HTTP request to the target web page.&lt;/p&gt;

&lt;p&gt;Here is the code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.forextradingbig.com/instaforex- 
broker-review/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      
   &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Notice that when a request is sent to the web page, it returns a response. This Axios response object is made up of various components, including data that refers to the payload returned from the server. &lt;/p&gt;

&lt;p&gt;So, when the GET request succeeds, we store the &lt;code&gt;data&lt;/code&gt; property of the response, which contains the page’s HTML. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Next, let’s load the response data into a Cheerio instance. This way, we can create a Cheerio object to help us in parsing through the HTML from the target web page and finding the DOM elements for the data we want—just like when using jQuery.&lt;/p&gt;

&lt;p&gt;Following the familiar jQuery convention, we’ll name the Cheerio object &lt;code&gt;$&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Here is the code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let’s use Cheerio’s selector syntax to find the elements containing the data we want:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapedata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.comment-bubble&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scrapedata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Notice that we also used the &lt;code&gt;text()&lt;/code&gt; method to extract the combined text content of the selected elements. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, let’s log any errors experienced during the scraping process.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Here is the entire code for the scraping logic:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
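Pieced together, the scraping logic reads roughly as follows (a sketch assembled from the steps above; the target URL and the `.comment-bubble` selector are the ones shown earlier):

```javascript
// index.js — assembled from the steps above (sketch)
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://www.forextradingbig.com/instaforex-broker-review/')
  .then(response => {
    // response.data holds the raw HTML payload returned by the server
    const html = response.data;

    // load the HTML into Cheerio and query it with jQuery-like selectors
    const $ = cheerio.load(html);
    const scrapedata = $('a', '.comment-bubble').text();
    console.log(scrapedata);
  })
  .catch(error => {
    console.log(error);
  });
```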



&lt;p&gt;If we run the above code with the &lt;code&gt;node index.js&lt;/code&gt; command, it returns the information we wanted to scrape from the target web page.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---z1zt1uU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ich4isgvqjr295eg41oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---z1zt1uU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ich4isgvqjr295eg41oz.png" alt="Alt Text" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It worked!&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping a dynamic website
&lt;/h2&gt;

&lt;p&gt;Now let’s see how you can use Puppeteer to extract data from a dynamic website. &lt;/p&gt;

&lt;p&gt;For this example, we’ll use the ES2017 &lt;code&gt;async/await&lt;/code&gt; syntax to work with promises comfortably. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;async&lt;/code&gt; keyword marks a function that always returns a promise. And the &lt;code&gt;await&lt;/code&gt; expression makes JavaScript wait until that promise is resolved before executing the rest of the code. This syntax will ensure we extract the webpage’s content after it has successfully loaded. &lt;/p&gt;
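As a quick illustration of those semantics, with plain promises and no browser involved:

```javascript
// A promise that resolves with a value after a delay
function delayedValue(value, ms) {
  return new Promise(resolve => setTimeout(() => resolve(value), ms));
}

async function run() {
  // Each await pauses this function until its promise resolves,
  // so the two steps happen strictly in order
  const first = await delayedValue('page loaded', 20);
  const second = await delayedValue('content extracted', 20);
  return `${first} -> ${second}`;
}

run().then(result => console.log(result));
// prints "page loaded -> content extracted"
```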

&lt;p&gt;Our target will be &lt;a href="https://www.reddit.com/r/scraping/"&gt;this Reddit page&lt;/a&gt;, which uses JavaScript for rendering content. We’ll be seeking to extract the headlines and descriptions found on the page.&lt;/p&gt;

&lt;p&gt;To find the specific HTML elements that hold the data we are looking for, let’s use the inspector tool on our web browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qFm6SZIT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/2l2grs3gzz4hyq1dhaq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qFm6SZIT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/2l2grs3gzz4hyq1dhaq9.png" alt="Alt Text" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the image above, each post is enclosed in an element with a &lt;code&gt;Post&lt;/code&gt; class, amongst other things. On closer examination, each post title sits in an &lt;code&gt;h3&lt;/code&gt; tag and each description in a &lt;code&gt;p&lt;/code&gt; tag. We’ll use this information when selecting these elements on the page.&lt;/p&gt;

&lt;p&gt;Here are the steps for creating the scraping logic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Let’s start by creating a file called index.js that will contain the programming logic for retrieving data from the webpage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then, let’s use the &lt;code&gt;require&lt;/code&gt; function, which is built-in within Node.js, to import Puppeteer into our project.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let’s launch Puppeteer. We’re actually launching an instance of the Chrome browser to use for accessing the target webpage.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let’s create a new page in the headless browser. Since we’ve used the &lt;code&gt;await&lt;/code&gt; expression, we’ll wait for the new page to be opened before saving it to the &lt;code&gt;page&lt;/code&gt; variable. &lt;/p&gt;

&lt;p&gt;After creating the page, we’ll use it for navigating to the Reddit page. Again, since we’ve used &lt;code&gt;await&lt;/code&gt;, our code execution will pause until the page is loaded or an error is thrown. &lt;/p&gt;

&lt;p&gt;We’ll also wait for the page’s body tag to be loaded before proceeding with the rest of the execution. &lt;/p&gt;

&lt;p&gt;Here is the code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;newPage&lt;/span&gt; &lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goto&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.reddit.com/r/scraping/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;waitForSelector&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;body&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After pulling up the Reddit page in Puppeteer, we can use its &lt;code&gt;evaluate()&lt;/code&gt; function to interact with the page. &lt;/p&gt;

&lt;p&gt;With the function, we can execute arbitrary JavaScript in Chrome and use its built-in functions, such as &lt;code&gt;querySelector()&lt;/code&gt;, to manipulate the page and retrieve its contents.&lt;/p&gt;

&lt;p&gt;Here is the code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
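A sketch of that evaluate() call, assuming each post sits in an element with a Post class and uses h3/p tags as found above (Reddit's markup changes over time, so these selectors are illustrative):

```javascript
// Runs inside the headless browser's page context
let grabPosts = await page.evaluate(() => {
  let scrapeItems = [];
  // collect title and description from each post element
  document.body.querySelectorAll('.Post').forEach(item => {
    try {
      let postTitle = item.querySelector('h3').innerText;
      let postDescription = item.querySelector('p').innerText;
      scrapeItems.push({ postTitle, postDescription });
    } catch (err) {} // skip posts missing either element
  });
  return { redditPosts: scrapeItems };
});
console.log(grabPosts);
```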

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let’s close the browser.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;close&lt;/span&gt; &lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, let’s log any errors experienced during the scraping process.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Here is the entire code for the scraping logic:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
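Put together, the whole script looks roughly like this (a sketch assembled from the steps above; the `.Post`, `h3`, and `p` selectors are assumptions based on the markup inspected earlier and may need updating):

```javascript
const puppeteer = require('puppeteer');

puppeteer
  .launch()
  .then(async browser => {
    // open a page and navigate to the target subreddit
    const page = await browser.newPage();
    await page.goto('https://www.reddit.com/r/scraping/');
    await page.waitForSelector('body');

    // run DOM queries inside the page context
    let grabPosts = await page.evaluate(() => {
      let scrapeItems = [];
      document.body.querySelectorAll('.Post').forEach(item => {
        try {
          scrapeItems.push({
            postTitle: item.querySelector('h3').innerText,
            postDescription: item.querySelector('p').innerText,
          });
        } catch (err) {} // skip posts missing either element
      });
      return { redditPosts: scrapeItems };
    });

    // output the scraped data and shut down the browser
    console.log(grabPosts);
    await browser.close();
  })
  .catch(function (err) {
    console.error(err);
  });
```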

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we run the above code with the &lt;code&gt;node index.js&lt;/code&gt; command, it returns the information we wanted to scrape from the target web page.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the results (for brevity, the results have been truncated):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vfUAtmYz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/80votf7p7l0w3gjelgme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vfUAtmYz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/80votf7p7l0w3gjelgme.png" alt="Alt Text" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It worked!&lt;/p&gt;

&lt;p&gt;If you intend to use the above in production and make thousands of requests to scrape data, you will likely get banned. In this scenario, rotating your IP address after every few requests can help you stay under the radar and extract content successfully. &lt;/p&gt;

&lt;p&gt;Therefore, connecting to a proxy service can help you to make the most of your scraping efforts. Importantly, with &lt;a href="https://zenscrape.com/residential-proxies/"&gt;residential proxies&lt;/a&gt;, you can get around the scraping bottlenecks and harvest online data easily and fast. &lt;/p&gt;

&lt;p&gt;In Puppeteer, you can easily connect to a proxy by passing an extra argument when launching it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;--proxy-server=145.0.10.11:7866&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That’s how you can perform web scraping with JavaScript and Node.js. With these skills, you can harvest useful information from web pages and integrate it into your own use case. &lt;/p&gt;

&lt;p&gt;Remember that if you want to build something more advanced, the Axios, Cheerio, and Puppeteer documentation can help you get off the ground quickly. &lt;/p&gt;

&lt;p&gt;Happy scraping!&lt;/p&gt;

</description>
      <category>scraping</category>
      <category>javascript</category>
      <category>proxies</category>
    </item>
    <item>
      <title>How to Scrape Tripadvisor Reviews with Nodejs and Puppeteer</title>
      <dc:creator>Andreas</dc:creator>
      <pubDate>Mon, 04 Nov 2019 15:22:39 +0000</pubDate>
      <link>https://dev.to/andreasa/how-to-scrape-tripadvisor-reviews-with-nodejs-and-puppeteer-5gn</link>
      <guid>https://dev.to/andreasa/how-to-scrape-tripadvisor-reviews-with-nodejs-and-puppeteer-5gn</guid>
      <description>&lt;p&gt;Tripadvisor contains tons of useful local business reviews. However, the site’s API does not provide an endpoint to access them. Also, Tripadvisor uses a lot of javascript in the frontend, which makes it a little bit harder to scrape than a lot of other websites. &lt;/p&gt;

&lt;p&gt;In this quick tutorial, we will walk through all the steps needed to scrape the customer reviews from a Tripadvisor local business page. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Puppeteer?
&lt;/h2&gt;

&lt;p&gt;Let me quickly say a few words about why I decided to use Puppeteer for this project. As I mentioned above, Tripadvisor requires a full browser, because a lot of its content needs JavaScript in order to be rendered. That basically left me with two options: Selenium and Puppeteer. Over the last few months, Puppeteer has evolved into the more prominent solution, as it is noticeably faster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Information we are going to scrape
&lt;/h2&gt;

&lt;p&gt;For this tutorial, I have selected a random pizzeria in New York City. It has the following profile URL: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tripadvisor.com/Restaurant_Review-g60763-d15873406-Reviews-Ortomare_Ristorante_Pizzeria-New_York_City_New_York.html"&gt;https://www.tripadvisor.com/Restaurant_Review-g60763-d15873406-Reviews-Ortomare_Ristorante_Pizzeria-New_York_City_New_York.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the screenshot below, we are going to scrape the following pieces of information from each review: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The rating&lt;/li&gt;
&lt;li&gt;Date of the review&lt;/li&gt;
&lt;li&gt;Date of the visit&lt;/li&gt;
&lt;li&gt;The review title&lt;/li&gt;
&lt;li&gt;Review text (we will have to expand it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JDBQgeZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/information-to-scrape.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JDBQgeZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/information-to-scrape.png" alt="information to scrape" title="Information to scrape" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Before we start writing the code, we have to install Puppeteer in our local environment:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install puppeteer --save&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The full code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
/* Part 1 */

const puppeteer = require('puppeteer');

puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox', '--window-size=1920,1080'] }).then(async browser =&amp;gt; {

    const page = await browser.newPage();
    await page.goto("https://www.tripadvisor.com/Restaurant_Review-g60763-d15873406-Reviews-Ortomare_Ristorante_Pizzeria-New_York_City_New_York.html");
    await page.waitForSelector('body');

/* Part 2 */

    await page.click('.taLnk.ulBlueLinks');
    await page.waitForFunction('document.querySelector("body").innerText.includes("Show less")');

/* Part 3 */

var reviews = await page.evaluate(() =&amp;gt; {
       
        var results = [];

        var items = document.body.querySelectorAll('.review-container');
        items.forEach((item) =&amp;gt; {
            
            /* Get and format Rating */
            let ratingElement = item.querySelector('.ui_bubble_rating').getAttribute('class');
            let integer = ratingElement.replace(/[^0-9]/g,'');
            let parsedRating = parseInt(integer) / 10;

            /* Get and format date of Visit */
            let dateOfVisitElement = item.querySelector('.prw_rup.prw_reviews_stay_date_hsx').innerText;
            let parsedDateOfVisit = dateOfVisitElement.replace('Date of visit:', '').trim();

/* Part 4 */

            results.push({
                    rating: parsedRating,
                    dateOfVisit: parsedDateOfVisit,
                    ratingDate: item.querySelector('.ratingDate').getAttribute('title'),
                    title:  item.querySelector('.noQuotes').innerText,
                    content: item.querySelector('.partial_entry').innerText,
                
            });
            
        });
        return results;
    });
    console.log(reviews);        
    await browser.close();

}).catch(function(error) {
    console.error(error);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let me walk through the parts of the code:
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Part 1:
&lt;/h4&gt;

&lt;p&gt;With these first lines, we launch Puppeteer in headless mode and navigate to the profile page of the pizzeria. All subsequent actions require the document body to be fully loaded, which the last line of part 1 ensures. &lt;/p&gt;

&lt;h4&gt;
  
  
  Part 2:
&lt;/h4&gt;

&lt;p&gt;As you can see above, the review text is not shown in full by default. Hence, we have to click on “More” before scraping the actual content. This is done by the first line of part 2. The following code should only run once the click action has completed, which the last line of part 2 ensures.&lt;/p&gt;

&lt;h4&gt;
  
  
  Part 3:
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens. We access the page-DOM and extract all desired information from each review that is visible on the page.&lt;/p&gt;

&lt;h5&gt;
  
  
  Rating:
&lt;/h5&gt;

&lt;p&gt;By taking a closer look at the element, we can see that the ratings are made up of pseudo-elements. However, there is a class on the element from which we can derive the rating:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Bui1YHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/css-selector.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Bui1YHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/css-selector.png" alt="information to scrape" title="Information to scrape" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This review shows a 5/5 rating. We can calculate the given rating by extracting the number “50” from the string “bubble_50”, converting it to an integer, and dividing it by 10. &lt;/p&gt;
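&lt;p&gt;Isolated from the full code above, that calculation is a small helper:&lt;/p&gt;

```javascript
// Recover the 0-5 rating from a class attribute such as
// "ui_bubble_rating bubble_50": strip all non-digit characters
// and divide the remaining number by 10.
function parseRating(classAttribute) {
  const digits = classAttribute.replace(/[^0-9]/g, '');
  return parseInt(digits, 10) / 10;
}

console.log(parseRating('ui_bubble_rating bubble_50')); // 5
console.log(parseRating('ui_bubble_rating bubble_35')); // 3.5
```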

&lt;h5&gt;
  
  
  Date of visit:
&lt;/h5&gt;

&lt;p&gt;The date of visit can be obtained quite easily. We simply select the element that contains the date and remove the substring “Date of visit:”.&lt;/p&gt;
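&lt;p&gt;As a standalone helper, mirroring the step in part 3 of the full code, this is a one-line string manipulation:&lt;/p&gt;

```javascript
// Strip the "Date of visit:" label and surrounding whitespace.
function parseDateOfVisit(text) {
  return text.replace('Date of visit:', '').trim();
}

console.log(parseDateOfVisit('Date of visit: July 2019')); // "July 2019"
```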

&lt;h5&gt;
  
  
  Review title and content:
&lt;/h5&gt;

&lt;p&gt;These can be extracted by simply getting the text from the related elements. No manipulation needed. &lt;/p&gt;

&lt;p&gt;We have now successfully gathered all information. &lt;/p&gt;

&lt;h4&gt;
  
  
  Part 4:
&lt;/h4&gt;

&lt;p&gt;We append all gathered information to an array, which is then returned by the function.&lt;/p&gt;

&lt;p&gt;Running the script in your console should return all ten results from the first page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rTL504WG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/response.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rTL504WG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://zenscrape.com/wp-content/uploads/2019/11/response.png" alt="information to scrape" title="Information to scrape" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Possible improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The script above only returns the reviews shown on the first page. To obtain all available reviews, you have to paginate through all review pages; each page contains up to 10 reviews. This can be achieved by clicking the page links at the bottom of the page, just as we clicked the “More” links.&lt;/li&gt;
&lt;li&gt;When scraping the reviews of a bigger list of restaurants, I recommend using a Puppeteer cluster. Make sure to limit the concurrency, so you do not flood the site with requests. &lt;/li&gt;
&lt;li&gt;Also, your scraper might get blocked at some point. This is one of the major issues my current startup, a web scraping API, is trying to solve: &lt;a href="https://zenscrape.com/web-scraping-api/"&gt;zenscrape.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Zenscrape also offers a great tutorial on &lt;a href="https://zenscrape.com/web-scraping-with-javascript-and-node-js-tutorial/"&gt;getting started with Node.js and Puppeteer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
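&lt;p&gt;The pagination idea from the first bullet can be sketched as follows. Note that the &lt;code&gt;.nav.next&lt;/code&gt; selector and the disabled-class convention are assumptions based on Tripadvisor’s markup at the time of writing and may need adjusting:&lt;/p&gt;

```javascript
// Sketch: scrape reviews page by page until there is no "Next" link left.
// `page` is a Puppeteer page; `scrapeCurrentPage` is a function like the
// page.evaluate() call from part 3 that returns the reviews on one page.
async function scrapeAllPages(page, scrapeCurrentPage) {
  const allReviews = [];
  while (true) {
    allReviews.push(...(await scrapeCurrentPage(page)));
    // '.nav.next' is assumed; a "disabled" class marks the last page.
    const next = await page.$('.nav.next:not(.disabled)');
    if (!next) break;
    await next.click();
    // Crude fixed wait for the next batch of reviews to render.
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
  return allReviews;
}
```

&lt;p&gt;A fixed delay keeps the sketch short; in practice you would rather wait for the new review content to appear, as we did with the “Show less” check in part 2.&lt;/p&gt;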

</description>
      <category>scraping</category>
      <category>node</category>
      <category>headlesschrome</category>
      <category>puppeteer</category>
    </item>
  </channel>
</rss>
