<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ScrapingBee</title>
    <description>The latest articles on DEV Community by ScrapingBee (@scrapingbee).</description>
    <link>https://dev.to/scrapingbee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1488%2Fa78b15d9-2297-4a6d-9ac9-8053f725d893.png</url>
      <title>DEV Community: ScrapingBee</title>
      <link>https://dev.to/scrapingbee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/scrapingbee"/>
    <language>en</language>
    <item>
      <title>Web Scraping 101 with Javascript and NodeJS</title>
      <dc:creator>Pierre </dc:creator>
      <pubDate>Tue, 16 Jun 2020 13:26:33 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-101-with-javascript-and-nodejs-4klg</link>
      <guid>https://dev.to/scrapingbee/web-scraping-101-with-javascript-and-nodejs-4klg</guid>
      <description>&lt;p&gt;Javascript has become one of the most popular and widely used languages due to the massive improvements it has seen and the introduction of the runtime known as NodeJS. Whether it's a web or mobile application, Javascript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to scrape the web efficiently to meet most of your requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  TOC
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;I. HTTP Clients&lt;/li&gt;
&lt;li&gt;II. Regular Expressions: The hard way&lt;/li&gt;
&lt;li&gt;III. Cheerio: Core JQuery for traversing the DOM&lt;/li&gt;
&lt;li&gt;IV. JSDOM: The DOM for Node&lt;/li&gt;
&lt;li&gt;V. Puppeteer: The headless browser&lt;/li&gt;
&lt;li&gt;VI. Nightmare: An alternative to Puppeteer&lt;/li&gt;
&lt;li&gt;Ressources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;This post is primarily aimed at developers who have some level of experience with Javascript. If you have a firm understanding of Web Scraping but have no experience with Javascript, this post could still prove useful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ A background in Javascript&lt;/li&gt;
&lt;li&gt;✅ Experience using the DevTools to extract selectors of elements&lt;/li&gt;
&lt;li&gt;✅ Some experience with ES6 Javascript (Optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⭐ Make sure to check the resources topic to learn more!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Outcomes
&lt;/h3&gt;

&lt;p&gt;By reading this post will be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have a functional understanding of NodeJS&lt;/li&gt;
&lt;li&gt;Use multiple HTTP clients to assist the web scraping process&lt;/li&gt;
&lt;li&gt;Utilize multiple modern and battle-tested libraries to scrape the web &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding NodeJS: A brief introduction
&lt;/h2&gt;

&lt;p&gt;Javascript is a simple and modern language that was initially created to add dynamic behavior to websites inside the browser. When a website is loaded, Javascript is run by the browser's Javascript Engine and converted into a bunch of code that the computer can understand. For Javascript to interact with your browser, the browser provides a Runtime Environment (document, window, etc.).&lt;/p&gt;

&lt;p&gt;This means that Javascript is not the kind of programming language that can interact with or manipulate the computer or it's resources directly. In a web server, for example, the server must be capable of interacting with the file system to maybe read a file or store a record in a database. &lt;/p&gt;

&lt;p&gt;Introducing NodeJS, the crux of the idea was to make Javascript capable of running not only client-side but also server-side. To make this possible, Ryan Dahl a skilled developer literally took Google Chrome's v8 Javascript Engine and embedded it with a C++ program which was named Node. So NodeJS is a runtime environment that allows an application written in Javascript to make it possible to be run at on a server as well.&lt;/p&gt;

&lt;p&gt;As opposed to how most languages like C or C++ deal with concurrency by employing multiple threads, NodeJS makes use of a single main thread and utilizes it to perform tasks in a Non-Blocking manner with the help of the &lt;a href="https://nodejs.dev/the-nodejs-event-loop"&gt;Event Loop&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Putting up a simple web server is fairly simple as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/plain&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello World&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Server running at PORT:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have NodeJS installed and you run the above code by typing(without the &amp;lt; and &amp;gt;) in &lt;code&gt;node &amp;lt;YourFileNameHere&amp;gt;.js&lt;/code&gt; and open up your browser and navigate to &lt;code&gt;localhost:3000&lt;/code&gt;, you will see some text saying "Hello World". NodeJS is highly ideal for applications that are I/O intensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  HTTP clients: querying the web
&lt;/h2&gt;

&lt;p&gt;HTTP clients are tools capable of sending a request to a server and then receive a response from it. Almost every tool that will be discussed uses an HTTP client under the hood, to query the server of the website that you will attempt to scrape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Request
&lt;/h3&gt;

&lt;p&gt;Request is one of the most widely used HTTP clients in the Javascript ecosystem, however, though currently, the author of the Request library has officially declared that it is deprecated. This does not mean it is unusable, quite a lot of libraries still use it, and it is every bit worth using. It is fairly simple to make an HTTP request with Request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.reddit.com/r/programming.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;body&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;body:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the Request library at &lt;a href="https://github.com/request/request"&gt;Github&lt;/a&gt;, and installing it is as simple as running &lt;code&gt;npm install request&lt;/code&gt;. You can also find the deprecation notice and what this means &lt;a href="https://github.com/request/request/issues/3142"&gt;here&lt;/a&gt;. If you don't feel safe about the fact that this library is deprecated, there's more down below!&lt;/p&gt;

&lt;h3&gt;
  
  
  Axios
&lt;/h3&gt;

&lt;p&gt;Axios is a promise-based HTTP client that runs both in the browser and NodeJS. If you use Typescript, then axios has you covered with built-in types. Making an HTTP request with Axios is straight forward, it ships with promise support by default as opposed to utilizing callbacks in Request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;axios&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.reddit.com/r/programming.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you fancy the async/await syntax sugar for the Promises API, then you can do that too but since top level await is still at &lt;a href="https://github.com/tc39/proposal-top-level-await"&gt;stage 3&lt;/a&gt;, we will have to make use of an Async Function instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;getForum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.reddit.com/r/programming.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And all you have to do is call &lt;code&gt;getForum&lt;/code&gt;! You can find Axios library at &lt;a href="https://github.com/axios/axios"&gt;Github&lt;/a&gt; and installing Axios is as simple as &lt;code&gt;npm install axios&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Superagent
&lt;/h3&gt;

&lt;p&gt;Much like Axios, Superagent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but Superagent has more dependencies and is less popular.&lt;/p&gt;

&lt;p&gt;Regardless, making an HTTP request with Superagent using promises, async/await or callbacks looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;superagent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;superagent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;forumURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.reddit.com/r/programming.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;// callbacks&lt;/span&gt;
&lt;span class="nx"&gt;superagent&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;forumURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// promises&lt;/span&gt;
&lt;span class="nx"&gt;superagent&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;forumURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// promises with async/await&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;getForum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;superagent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;forumURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the Superagent library at &lt;a href="https://github.com/visionmedia/superagent"&gt;Github&lt;/a&gt; and installing Superagent is as simple as &lt;code&gt;npm install superagent&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For the upcoming few web scraping tools, Axios will be used as the HTTP client.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Regular Expressions: The hard way
&lt;/h2&gt;

&lt;p&gt;The simplest way to get started with web scraping without any dependencies is to use a bunch of regular expressions on the HTML string that you receive by querying a webpage using an HTTP client, but there is a big tradeoff. Regular Expressions aren't as flexible and quite a lot of people both professionals and amateurs struggle with writing the correct regular expression.&lt;/p&gt;

&lt;p&gt;For complex web scraping, the regular expression can also get out of hand very quickly. With that said, let's give it a go. Say there's a label with some username in it, and we want the username, this is similar to what you'd have to do if you relied on regular expressions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;htmlString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;label&amp;gt;Username: John Doe&amp;lt;/label&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;htmlString&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&amp;lt;label&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;.+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;label&amp;gt;/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;// Username: John Doe, John Doe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Javascript, &lt;code&gt;match()&lt;/code&gt; usually returns an array with everything that matches the regular expression. The 2nd element(in index 1) you will find the &lt;code&gt;textContent&lt;/code&gt; or the &lt;code&gt;innerHTML&lt;/code&gt; of the &lt;code&gt;&amp;lt;label&amp;gt;&lt;/code&gt;tag which is what we want. But this result contains some unwanted text ( "Username: ") which has to be removed.&lt;/p&gt;

&lt;p&gt;As you can see, for a very simple use case the steps and the work to be done are unnecessarily high. This is why you should rely on something like an HTML parser, which we will talk about next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cheerio: Core JQuery for traversing the DOM
&lt;/h2&gt;

&lt;p&gt;Cheerio is an efficient and light library which allows you to use the rich and powerful API of JQuery on the server-side. If you have used JQuery previously then you will feel right at home with Cheerio, it removes all the DOM inconsistencies and browser-related features and exposes an efficient API to parse and manipulate the DOM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;h2 class="title"&amp;gt;Hello world&amp;lt;/h2&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;h2.title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello there!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;h2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;addClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;welcome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;// &amp;lt;h2 class="title welcome"&amp;gt;Hello there!&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, using Cheerio is very similar to how you'd use JQuery.&lt;/p&gt;

&lt;p&gt;However, though it does not work the same way that a web browser works, which means it does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Render any of the parsed or manipulated DOM elements&lt;/li&gt;
&lt;li&gt;Apply CSS or load any external resource&lt;/li&gt;
&lt;li&gt;Execute javascript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the website or web application that you are trying to crawl is Javascript heavy (for example a Single Page Application) then Cheerio is not your best bet, you might have to rely on some of the other options that are talked about later on.&lt;/p&gt;

&lt;p&gt;To demonstrate the power of Cheerio, we will attempt to crawl the &lt;a href="https://www.reddit.com/r/programming/"&gt;r/programming&lt;/a&gt; forum in Reddit, we will attempt to get a list of post names. &lt;/p&gt;

&lt;p&gt;First, install Cheerio and axios by running the following command:&lt;br&gt;
&lt;code&gt;npm install cheerio axios&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then create a new file called &lt;code&gt;crawler.js&lt;/code&gt; and copy/paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getPostTitles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://old.reddit.com/r/programming/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postTitles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

        &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;div &amp;gt; p.title &amp;gt; a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;postTitle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nx"&gt;postTitles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postTitle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;postTitles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;getPostTitles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;postTitles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postTitles&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;getPostTitles()&lt;/code&gt; is an asynchronous function that will crawl the old reddit's r/programming forum. First the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library, then the html data is fed into Cheerio using the &lt;code&gt;cheerio.load()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Then with the help of the Dev Tools of the browser, you can obtain the selector that is capable of targetting all the postcards generally. If you've used JQuery, the &lt;code&gt;$('div &amp;gt; p.title &amp;gt; a')&lt;/code&gt; must be very familiar. This will get all the posts, since you only want the title of each post individually, you have to loop through each post which is done with the help of the &lt;code&gt;each()&lt;/code&gt; function. &lt;/p&gt;

&lt;p&gt;To extract the text out of each title, you must fetch the DOM element with the help of Cheerio (&lt;code&gt;el&lt;/code&gt; refers to the current element). Then calling &lt;code&gt;text()&lt;/code&gt; on each element will give you the text.&lt;/p&gt;

&lt;p&gt;Now you can pop open a terminal and run &lt;code&gt;node crawler.js&lt;/code&gt; and then you'll see an array of about 25 or 26 different post titles, it'll be quite long. While this is quite a simple use case, it demonstrates the simple nature of the API provided by Cheerio. &lt;/p&gt;

&lt;p&gt;If your use case requires the execution of Javascript and the loading of external sources, then the following few options will be helpful.&lt;/p&gt;

&lt;h2&gt;
  
  
  JSDOM: The DOM for Node
&lt;/h2&gt;

&lt;p&gt;JSDOM is a pure Javascript implementation of the Document Object Model to be used in NodeJS, as mentioned previously the DOM is not available to Node, so JSDOM is the closest you can get. It more or less emulates the browser. &lt;/p&gt;

&lt;p&gt;Since a DOM is created, it is possible to interact with the web application or website you want to crawl programmatically, so something like clicking on a button is possible. If you are familiar with manipulating the DOM, then using JSDOM will be quite straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;JSDOM&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jsdom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;JSDOM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;h2 class="title"&amp;gt;Hello world&amp;lt;/h2&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;heading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;heading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello there!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="nx"&gt;heading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;welcome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;heading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHTML&lt;/span&gt;
&lt;span class="c1"&gt;// &amp;lt;h2 class="title welcome"&amp;gt;Hello there!&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, JSDOM creates a DOM and then you can manipulate this DOM with the same methods and properties you would use while manipulating the browser DOM.&lt;/p&gt;

&lt;p&gt;To demonstrate how you could use JSDOM to interact with a website, we will get the first post of the Reddit r/programming forum and upvote it, then we will verify if the post has been upvoted. &lt;/p&gt;

&lt;p&gt;Start by running the following command to install jsdom and axios:&lt;br&gt;
&lt;code&gt;npm install jsdom axios&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then make a file by the name of &lt;code&gt;crawler.js&lt;/code&gt; and copy/paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;JSDOM&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;jsdom&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;upvoteFirstPost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://old.reddit.com/r/programming/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;JSDOM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;runScripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dangerously&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;usable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;firstPost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;div &amp;gt; div.midcol &amp;gt; div.arrow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;firstPost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isUpvoted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;firstPost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;upmod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;isUpvoted&lt;/span&gt;
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Post has been upvoted successfully!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The post has not been upvoted!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;upvoteFirstPost&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;upvoteFirstPost()&lt;/code&gt; is an asynchronous function that will obtain the first post in r/programming and then upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding the HTML that was fetched earlier. The JSDOM constructor accepts the HTML as the first argument and the options as the second, the 2 options that have been added perform the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;runScripts&lt;/strong&gt;: When set to "dangerously", it allows the execution of event handlers and any Javascript code. If you do not have a clear idea on the credibility of the scripts that your application will run, then it is best to set runScripts to "outside-only", which attaches all the Javascript specification provided globals to the &lt;code&gt;window&lt;/code&gt; object thus preventing any script being executed on the &lt;em&gt;inside&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;resources&lt;/strong&gt;: When set to "usable", it allows the loading of any external script declared using the &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag (ex: the JQuery library fetched from a CDN)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the DOM has been created, you would use the same DOM methods to get the first post's upvote button and then click on it. To verify if it has indeed been clicked, you could check the &lt;code&gt;classList&lt;/code&gt; for a class called &lt;code&gt;upmod&lt;/code&gt;. If this class exists in &lt;code&gt;classList&lt;/code&gt;, then a message is returned.&lt;/p&gt;

&lt;p&gt;Now you can pop open a terminal and run &lt;code&gt;node crawler.js&lt;/code&gt; and then you'll see a neat string that will tell if the post has been upvoted or not. While this example use case is trivial, you could build on top of this to create something powerful for example, a bot that goes around upvoting a particular user's posts. &lt;/p&gt;

&lt;p&gt;If you dislike the lack of expressiveness in JSDOM, and if your crawling relies heavily on many such manipulations or if there is a need to recreate a lot of different DOMs, then the following options will be a better match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Puppeteer: The headless browser
&lt;/h2&gt;

&lt;p&gt;Puppeteer, as the name implies, allows you to manipulate the browser programmatically just like how a puppet would be manipulated by its puppeteer. It achieves this by providing a developer with a high-level API to control a headless version of Chrome by default and can be configured to run non-headless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RisEdzQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/746130/40333229-5df5480c-5d0c-11e8-83cb-c3e371de7374.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RisEdzQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/746130/40333229-5df5480c-5d0c-11e8-83cb-c3e371de7374.png" alt="puppeteer-hierachy" width="880" height="861"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Taken from the Puppeter Docs (&lt;a href="https://github.com/puppeteer/puppeteer/blob/v3.0.2/docs/api.md"&gt;Source&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Puppeteer is particularly more useful than the aforementioned tools because it allows you to crawl the web as if a real person were interacting with a browser. This opens up a few possibilites that weren't there before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  You can get screenshots or generate PDFs of pages.&lt;/li&gt;
&lt;li&gt;  You could crawl a Single Page Application and generate pre-rendered content.&lt;/li&gt;
&lt;li&gt;  Automate a lot of different user interactions like keyboard inputs, form submissions, navigation, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It could also play a big role in a lot of other tasks outside the scope of web crawling like UI testing, assist performance optimization, etc. &lt;/p&gt;

&lt;p&gt;It's quite often that you would want to take screenshots of websites, perhaps to get to know about a competitor's product catalog, puppeteer can be used to do this. To start, you must install puppeteer, to do so run the following command:&lt;br&gt;
&lt;code&gt;npm install puppeteer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will download a bundled version of Chromium which takes up about 180 MB to 300 MB depending on your Operating System. If you wish to disable this and point puppeteer to an already downloaded version of chromium, you must set a few &lt;a href="https://github.com/puppeteer/puppeteer/blob/v3.0.2/docs/api.md#environment-variables"&gt;environment variables&lt;/a&gt;. This, however, is not recommended, if you truly wish to avoid downloading Chromium and puppeteer for this tutorial, you can rely on the &lt;a href="https://try-puppeteer.appspot.com/"&gt;puppeteer playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's attempt to get a screenshot and a PDF of the r/programming forum in Reddit, create a new file called &lt;code&gt;crawler.js&lt;/code&gt; and then copy/paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;getVisual&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.reddit.com/r/programming/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;screenshot.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page.pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;getVisual&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;getVisual()&lt;/code&gt; is an asynchronous function that will take a screenshot and a pdf of the value assigned to the &lt;code&gt;URL&lt;/code&gt; variable. To start, an instance of the browser is created by running &lt;code&gt;puppeteer.launch()&lt;/code&gt; then a new page is created. This page can be thought of like a tab in a regular browser. Then by calling &lt;code&gt;page.goto()&lt;/code&gt; with the &lt;code&gt;URL&lt;/code&gt; as the parameter, the page that was created earlier will be directed to the URL specified. Finally, the browser instance is destroyed along with the page.&lt;/p&gt;

&lt;p&gt;Once that is done and the page has finished loading, a screenshot and a pdf will be taken using &lt;code&gt;page.screenshot()&lt;/code&gt; and &lt;code&gt;page.pdf()&lt;/code&gt; respectively. You could listen to the javascript load event and then perform these actions too, which is highly recommended at a production level.&lt;/p&gt;

&lt;p&gt;To run the code type in &lt;code&gt;node crawler.js&lt;/code&gt; to the terminal, and after a few seconds, you will notice that 2 files by the names &lt;code&gt;screenshot.jpg&lt;/code&gt; and &lt;code&gt;page.pdf&lt;/code&gt; have been created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nightmare: An alternative to Puppeteer
&lt;/h2&gt;

&lt;p&gt;Nightmare is also a high-level browser automation library like Puppeteer, that uses Electron but is said to be roughly twice as faster as it's predecessor PhantomJS and more modern. &lt;/p&gt;

&lt;p&gt;If you dislike Puppeteer in some way or feel discouraged by the size of the Chromium bundle then Nightmare is an ideal choice. To start, installghtmare library by running the following command:&lt;br&gt;
&lt;code&gt;npm install nightmare&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then once nightmare has been downloaded, we will use it to find ScrapingBee's website through the Google Search engine. To do so, create a file called &lt;code&gt;crawler.js&lt;/code&gt; and then copy/paste the following code into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Nightmare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nightmare&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nightmare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Nightmare&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;nightmare&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.google.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input[title='Search']&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ScrapingBee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input[value='Google Search']&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#rso &amp;gt; div:nth-child(1) &amp;gt; div &amp;gt; div &amp;gt; div.r &amp;gt; a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#rso &amp;gt; div:nth-child(1) &amp;gt; div &amp;gt; div &amp;gt; div.r &amp;gt; a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping Bee Web Link&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Search failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Firstly a Nighmare instance is created, then this instance is directed to the Google Search Engine by calling &lt;code&gt;goto()&lt;/code&gt; once it has loaded, the search box is fetched using it's selector and then the value of the search box (an input tag) is changed to "ScrapingBee". Once that is done, the search form is submitted by clicking on the "Google Search" button. Then Nightmare is told to wait till the first link has loaded, and once it has, a DOM method will be used to fetch the value of the &lt;code&gt;href&lt;/code&gt; attribute of the anchor tag that contains the link.&lt;/p&gt;

&lt;p&gt;Finally, once everything is complete, the link is printed to the console. To run the code, type in &lt;code&gt;node crawler.js&lt;/code&gt; to your terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;That was a long read! But now you understand the different ways to use NodeJS and it's rich ecosystem of libraries to crawl the web any way you want. To wrap up, you learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;NodeJS&lt;/strong&gt; is a Javascript &lt;em&gt;runtime&lt;/em&gt; to allow Javascript to be run in the &lt;em&gt;server-side&lt;/em&gt;. It has a &lt;strong&gt;non-blocking&lt;/strong&gt; nature thanks to the Event Loop.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;HTTP Clients&lt;/strong&gt; such as &lt;em&gt;Axios&lt;/em&gt;, &lt;em&gt;Superagent&lt;/em&gt;, and &lt;em&gt;Request&lt;/em&gt; are used to send HTTP requests to a &lt;em&gt;server&lt;/em&gt; and receive a response.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Cheerio&lt;/strong&gt; abstracts the best out of &lt;em&gt;JQuery&lt;/em&gt; for the sole purpose of running it in the &lt;em&gt;server-side&lt;/em&gt; for web crawling but &lt;em&gt;does not execute Javascript&lt;/em&gt; code.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;JSDOM&lt;/strong&gt; creates a DOM per the standard &lt;em&gt;Javascript specification&lt;/em&gt; out of an HTML string and allows you to perform DOM manipulations on it.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Puppeteer&lt;/strong&gt; and &lt;strong&gt;Nightmare&lt;/strong&gt; are &lt;em&gt;high-level browser automation&lt;/em&gt; libraries, that allow you to &lt;em&gt;programmatically manipulate&lt;/em&gt; web applications as if a real person were interacting with it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Feel like reading more? Check these links out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://nodejs.org/en/about/"&gt;NodeJS website&lt;/a&gt; - Contains documentation and a lot of information on how to get started.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.google.com/web/tools/puppeteer"&gt;Puppeteer docs&lt;/a&gt; - Contains the API reference and getting started guides.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.scrapingbee.com/blog/"&gt;ScrapingBee's Blog&lt;/a&gt; - Contains a lot of information on Web Scraping goodies on multiple platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This blog post was originally posted on &lt;a href="https://www.scrapingbee.com/blog/web-scraping-javascript/"&gt;ScrapingBee's blog&lt;/a&gt; by Shenesh Perera&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Easy Web Scraping With Scrapy</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 18 Dec 2019 16:22:13 +0000</pubDate>
      <link>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</link>
      <guid>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</guid>
      <description>&lt;p&gt;In the previous post about &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;Web Scraping with Python&lt;/a&gt; we talked a bit about Scrapy. In this post we are going to dig a little bit deeper into it. &lt;/p&gt;

&lt;p&gt;Scrapy is a wonderful open source Python web scraping framework. It handles the most common use cases when doing web scraping at scale: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multithreading&lt;/li&gt;
&lt;li&gt;Crawling (going from link to link)&lt;/li&gt;
&lt;li&gt;Extracting the data&lt;/li&gt;
&lt;li&gt;Validating&lt;/li&gt;
&lt;li&gt;Saving to different format / databases&lt;/li&gt;
&lt;li&gt;Many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main difference between Scrapy and other commonly used librairies like Requests / BeautifulSoup is that it is opinionated. It allows you to solve the usual web scraping problems in an elegant way. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep, there is a lot to learn, but that is what we are here for :)&lt;/p&gt;

&lt;p&gt;In this tutorial we will create two different web scrapers, a simple one that will extract data from an E-commerce product page, and a more "complex" one that will scrape an entire E-commerce catalog!&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic overview
&lt;/h2&gt;

&lt;p&gt;You can install Scrapy using &lt;a href="https://pypi.org/project/pip/" rel="noopener noreferrer"&gt;pip&lt;/a&gt;. Be careful though, the Scrapy documentation strongly suggests to install it in a dedicated virtual environnement in order to avoid conflicts with your system packages. &lt;/p&gt;

&lt;p&gt;I'm using Virtualenv and Virtualenvwrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkvirtualenv scrapy_env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now create a new Scrapy project with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject product_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create all the necessary boilerplate files for the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── product_scraper
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a brief overview of these files and folders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;items.py&lt;/em&gt;&lt;/strong&gt; is a model for the extracted data. You can define custom model (like a Product) that will inherit the scrapy Item class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;middlewares.py&lt;/em&gt;&lt;/strong&gt; Middleware used to change the request / response lifecycle. For example you could create a middle ware to rotate user-agents, or to use an API like ScrapingBee instead of doing the requests yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;pipelines.py&lt;/em&gt;&lt;/strong&gt; In Scrapy, pipelines are used to process the extracted data, clean the HTML, validate the data, and export it to a custom format or saving it to a database. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;/spiders&lt;/em&gt;&lt;/strong&gt; is a folder containing Spider classes. With Scrapy, Spiders are classes that define how a website should be scraped, including what link to follow and how to extract the data for those links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;scrapy.cfg&lt;/em&gt;&lt;/strong&gt; is a configuration file to change some settings&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scraping a single product
&lt;/h2&gt;

&lt;p&gt;In this example we are going to scrape a single product from a dummy E-commerce website. Here is the first the product we are going to scrape: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are going to extract the product name, picture, price and description.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scrapy Shell
&lt;/h2&gt;

&lt;p&gt;Scrapy comes with a built-in shell that helps you try and debug your scraping code in real time. You can quickly test your XPath expressions / CSS selectors with it. It's a very cool tool to write your web scrapers and I always use it!&lt;/p&gt;

&lt;p&gt;You can configure Scrapy Shell to use another console instead of the default Python console like IPython. You will get autocompletion and other nice perks like colorized output. &lt;/p&gt;

&lt;p&gt;In order to use it in your scrapy Shell, you need to add this line to your scrapy.cfg file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shell &lt;span class="o"&gt;=&lt;/span&gt; ipython
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's configured, you can start using scrapy shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;scrapy shell &lt;span class="nt"&gt;--nolog&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Available Scrapy objects:
&lt;span class="o"&gt;[&lt;/span&gt;s]   scrapy     scrapy module &lt;span class="o"&gt;(&lt;/span&gt;contains scrapy.Request, scrapy.Selector, etc&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   crawler    &amp;lt;scrapy.crawler.Crawler object at 0x108147eb8&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   item       &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   settings   &amp;lt;scrapy.settings.Settings object at 0x108d10978&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Useful shortcuts:
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;url[, &lt;span class="nv"&gt;redirect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True]&lt;span class="o"&gt;)&lt;/span&gt; Fetch URL and update &lt;span class="nb"&gt;local &lt;/span&gt;objects &lt;span class="o"&gt;(&lt;/span&gt;by default, redirects are followed&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;req&lt;span class="o"&gt;)&lt;/span&gt;                  Fetch a scrapy.Request and update &lt;span class="nb"&gt;local &lt;/span&gt;objects
&lt;span class="o"&gt;[&lt;/span&gt;s]   shelp&lt;span class="o"&gt;()&lt;/span&gt;           Shell &lt;span class="nb"&gt;help&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;print this &lt;span class="nb"&gt;help&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;    View response &lt;span class="k"&gt;in &lt;/span&gt;a browser
In &lt;span class="o"&gt;[&lt;/span&gt;1]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can start fetching a URL by simply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fetch&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start by fetching the /robot.txt file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;404&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/robots.txt&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case there isn't any robot.txt, that's why we can see a 404 HTTP code. If there was a robot.txt, by default Scrapy will follow the rule. &lt;/p&gt;

&lt;p&gt;You can disable this behavior by changing this setting in settings.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ROBOTSTXT_OBEY = True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you should should have a log like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;200&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now see your response object, response headers, and try different XPath expression / CSS selectors to extract the data you want. &lt;/p&gt;

&lt;p&gt;You can see the response directly in your browser with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the page will render badly inside your browser, for lots of different reasons. This can be CORS issues, Javascript code that didn't execute, or relative URLs for assets that won't work locally. &lt;/p&gt;

&lt;p&gt;The scrapy shell is like a regular Python shell, so don't hesitate to load your favorite scripts/function in it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Data
&lt;/h3&gt;

&lt;p&gt;Scrapy doesn't execute any Javascript by default, so if the website you are trying to scrape is using a frontend framework like Angular / React.js, you could have trouble accessing the data you want. &lt;/p&gt;

&lt;p&gt;Now let's try some XPath expression to extract the product title and price:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to extract the price, we are going to use an &lt;a href="https://dev.to/blog/practical-xpath-for-web-scraping/"&gt;XPath expression&lt;/a&gt;, we're selecting the first span after the div with the class &lt;code&gt;my-4&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;16]: response.xpath&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"//div[@class='my-4']/span/text()"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[16]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I could also use a CSS selector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;21]: response.css&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.my-4 span::text'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[21]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;With Scrapy, Spiders are classes where you define your crawling (what links / URLs need to be scraped) and scraping (what to extract) behavior. &lt;/p&gt;

&lt;p&gt;Here are the different steps used by a spider to scrape a website:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It starts by looking at the class attribute

&lt;code&gt;start_urls&lt;/code&gt;

, and call these URLs with the start_requests() method. You could override this method if you need to change the HTTP verb, add some parameters to the request (for example, sending a POST request instead of a GET). 
* It will then generate a Request object for each URL, and send the response to the callback function parse()
* The parse() method will then extract the data (in our case, the product price, image, description, title) and return either a dictionnary,  an Item object, a Request or an iterable. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;You may wonder why the parse method can return so many different objects. It's for flexibility. Let's say you want to scrape an E-commerce website that doesn't have any sitemap. You could start by scraping the product categories, so this would be a first parse method.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This method would then yield a Request object to each product category to a new callback method parse2()&lt;/em&gt;&lt;br&gt;
&lt;em&gt;For each category you would need to handle pagination Then for each product the actual scraping that generate an Item so a third parse function.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With Scrapy you can return the scraped data as a simple Python dictionary, but it is a good idea to use the built-in Scrapy &lt;strong&gt;&lt;em&gt;Item&lt;/em&gt;&lt;/strong&gt; class. &lt;br&gt;
It's a simple container for our scraped data and Scrapy will look at this item's fields for many things like exporting the data to different format (JSON / CSV...), the item pipeline etc. &lt;/p&gt;

&lt;p&gt;So here is a basic Product class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;product_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;img_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can generate a spider, either with the command line helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider myspider mydomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can do it manually and put your Spider's code inside the /spiders directory. &lt;/p&gt;

&lt;p&gt;There are different types of Spiders in Scrapy to solve the most common web scraping use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Spider&lt;/code&gt; that we will use. It takes a start_urls list and scrape each one with a &lt;code&gt;parse&lt;/code&gt; method. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CrawlSpider&lt;/code&gt; follows links defined by a set of rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SitemapSpider&lt;/code&gt; extract URLs defined in a sitemap&lt;/li&gt;
&lt;li&gt;Many more
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- coding: utf-8 -*-
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EcomSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ecom_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-slider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]//img/@src&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this &lt;strong&gt;&lt;em&gt;EcomSpider&lt;/em&gt;&lt;/strong&gt; class, there are two required attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt; which is our Spider's name (that you can run using &lt;code&gt;scrapy runspider spider_name&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt; &lt;code&gt;start_urls&lt;/code&gt; which is the starting URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;allowed_domains&lt;/code&gt; is optionnal but important when you use a CrawlSpider that could follow links on different domains. &lt;/p&gt;

&lt;p&gt;Then I've just populated the Product fields by using XPath expressions to extract the data I wanted as we saw earlier, and we return the item. &lt;/p&gt;

&lt;p&gt;You can run this code as follow to export the result into JSON (you could also export to CSV)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy runspider ecom_spider.py &lt;span class="nt"&gt;-o&lt;/span&gt; product.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should then get a nice JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"product_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20.00$"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Taba Cream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"img_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/images/products/product-2.png"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Item loaders
&lt;/h4&gt;

&lt;p&gt;There are two common problems that you can face while extracting data from the Web: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the same website, the page layout and underlying HTML can be different. If you scrape an E-commerce website, you will often have a regular price and a discounted price, with different XPath / CSS selectors. &lt;/li&gt;
&lt;li&gt;The data can be dirty and need some kind of post processing, again for an E-commerce website it could be the way the prices are displayed for example ($1.00, $1, $1,00 )&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapy comes with a built-in solution for this, &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html" rel="noopener noreferrer"&gt;ItemLoaders&lt;/a&gt;. &lt;br&gt;
It's an interesting way to populate our Product object. &lt;/p&gt;

&lt;p&gt;You can add several XPath expression to the same Item field, and it will test it sequentially. By default if several XPath are found, it will load all of them into a list. &lt;/p&gt;

&lt;p&gt;You can find many examples of input and output processors in the &lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;It's really useful when you need to transorm/clean the data your extract. &lt;br&gt;
For example, extracting the currency from a price, transorming a unit into another one (centimers in meters, Celcius degres in Fahrenheit) ... &lt;/p&gt;

&lt;p&gt;In our webpage we can find the product title with different XPath expressions: &lt;code&gt;//title&lt;/code&gt; and &lt;code&gt;//section[1]//h2/text()&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Here is how you could use and Itemloader in this case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally you only want the first matching XPath, so you will need to add this &lt;code&gt;output_processor=TakeFirst()&lt;/code&gt; to your item's field constructor. &lt;/p&gt;

&lt;p&gt;In our case we only want the first matching XPath for each field, so a better approach would be to create our own ItemLoader and declare a default output_processor to take the first matching XPath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ItemLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader.processors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Join&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;default_output_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price_in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also added a &lt;code&gt;price_in&lt;/code&gt; which is an input processor to delete the dollar sign from the price. I'm using &lt;code&gt;MapCompose&lt;/code&gt; which is a built-in processor that takes one or several functions to be executed sequentially. You can add as many functions as you like for . The convention is to add &lt;code&gt;_in&lt;/code&gt; or &lt;code&gt;_out&lt;/code&gt; to your Item field's name to add an input or output processor to it. &lt;/p&gt;

&lt;p&gt;There are many more processors, you can learn more about this in the &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html#input-and-output-processors" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping multiple pages
&lt;/h2&gt;

&lt;p&gt;Now that we know how to scrape a single page, it's time to learn how to scrape multiple pages, like the entire product catalog. &lt;br&gt;
As we saw earlier there are different kinds of Spiders. &lt;/p&gt;

&lt;p&gt;When you want to scrape an entire product catalog the first thing you should look at is a sitemap. Sitemap are exactly built for this, to show web crawlers how the website is structured. &lt;/p&gt;

&lt;p&gt;Most of the time you can find one at &lt;code&gt;base_url/sitemap.xml&lt;/code&gt;. Parsing a sitemap can be tricky, and again, Scrapy is here to help you with this. &lt;/p&gt;

&lt;p&gt;In our case, you can find the sitemap here: &lt;a href="https://clever-lichterman-044f16.netlify.com/sitemap.xml" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look inside the sitemap there are many URLs that we are not interested by, like the home page, blog posts etc:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/blog/post-1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fortunately, we can filter the URLs to parse only those that matches some pattern, it's really easy, here we only to have URL that &lt;br&gt;
have &lt;code&gt;/products/&lt;/code&gt; in their URLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sitemap_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... scrape product ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this spider as follow to scrape all the products and export the result to a CSV file:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scrapy runspider sitemap_spider.py -o output.csv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Now what if the website didn't have any sitemap? Once again, Scrapy has a solution for this! &lt;/p&gt;

&lt;p&gt;Let me introduce you to the... &lt;code&gt;CrawlSpider&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The CrawlSpider will crawl the target website by starting with a &lt;code&gt;start_urls&lt;/code&gt; list. Then for each url, it will extract all the links based on a list of &lt;code&gt;Rule&lt;/code&gt;. &lt;br&gt;
In our case it's easy, products has the same URL pattern &lt;code&gt;/products/product_title&lt;/code&gt; so we only need filter these URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.productloader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawl_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="c1"&gt;# .. parse product 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, all these built-in Spiders are really easy to use. It would have been much more complex to do it from scratch. &lt;/p&gt;

&lt;p&gt;With Scrapy you don't have to think about the crawling logic, like adding new URLs to a queue, keeping track of already parsed URLs, multi-threading... &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we saw a general overview of how to scrape the web with Scrapy and how it can solve your most common web scraping challenges. Of course we only touched the surface and there are many more interesting things to explore, like middlewares, exporters, extensions, pipelines! &lt;/p&gt;

&lt;p&gt;If you've been doing web scraping more "manually" with tools like BeautifulSoup / Requests, it's easy to understand how Scrapy can help save time and build more maintainable scrapers. &lt;/p&gt;

&lt;p&gt;I hope you liked this Scrapy tutorial and that it will motivate you to experiment with it. &lt;/p&gt;

&lt;p&gt;For further reading don't hesitate to look at the great &lt;a href="https://docs.scrapy.org/en/latest/index.html" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;You can also check out our &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;web scraping with Python&lt;/a&gt; tutorial to learn more about web scraping. &lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Practical XPath for Web Scraping</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Thu, 07 Nov 2019 10:41:17 +0000</pubDate>
      <link>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</link>
      <guid>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</guid>
      <description>&lt;p&gt;XPath is a technology that uses path expressions to select nodes or node- sets in an XML document (or in our case an HTML document). Even if XPath is not a programming language in itself, it allows you to write expressions that can access directly to a specific HTML element without having to go through the entire HTML tree.&lt;/p&gt;

&lt;p&gt;It looks like the perfect tool for web scraping right? At &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt; we love XPath!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why learn XPath
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page.&lt;/li&gt;
&lt;li&gt;  It's more powerful than CSS selectors&lt;/li&gt;
&lt;li&gt;  It allows you to navigate the DOM in any direction&lt;/li&gt;
&lt;li&gt;  Can match text inside HTML elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Entire books have been written on XPath, and I don’t have the pretention to explain everything in-depth, this is an introduction to XPath and we will see through real examples how you can use it for your web scraping needs.&lt;/p&gt;

&lt;p&gt;But first, let's talk a little about the DOM&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Object Model
&lt;/h2&gt;

&lt;p&gt;I am going to assume you already know HTML, so this is just a small reminder.&lt;/p&gt;

&lt;p&gt;As you already know, a web page is a document containing text within tags, that add meaning to the document by describing elements like titles, paragraphs, lists, links etc.&lt;/p&gt;

&lt;p&gt;Let's see a basic HTML page, to understand what the Document Object Model is.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" alt="" width="880" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This HTML code is basically HTML content encapsulated inside other HTML content. The HTML hierarchy can be viewed as a tree. We can already see this hierarchy through the indentation in the HTML code.&lt;/p&gt;

&lt;p&gt;When your web browser parses this code, it will create a tree which is an object representation of the HTML document. It is called the Document Object Model.&lt;/p&gt;

&lt;p&gt;Below is the internal tree structure inside Google Chrome inspector :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" alt="" width="880" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left we can see the HTML tree, and on the right we have the Javascript object representing the currently selected element (in this case, the &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; tag), with all its attributes.&lt;/p&gt;

&lt;p&gt;The important thing to remember is that &lt;strong&gt;the DOM you see in your browser, when you right click + inspect can be really different from the actual HTML that was sent&lt;/strong&gt;. Maybe some Javascript code was executed and dynamically changed the DOM ! For example, when you scroll on your twitter account, a request is sent by your browser to fetch new tweets, and some Javascript code is dynamically adding those new tweets to the DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath Syntax
&lt;/h2&gt;

&lt;p&gt;First let’s look at some XPath vocabulary :&lt;/p&gt;

&lt;p&gt;• In Xpath terminology, as with HTML, there are different types of nodes : root nodes, element nodes, attribute nodes, and so called atomic values which is a synonym for text nodes in an HTML document.&lt;/p&gt;

&lt;p&gt;• Each element node has one parent. in this example, the section element is the parent of p, details and button.&lt;/p&gt;

&lt;p&gt;• Element nodes can have any number of children. In our example, li elements are all children of the ul element.&lt;/p&gt;

&lt;p&gt;• Siblings are nodes that have the same parents. p, details and button are siblings.&lt;/p&gt;

&lt;p&gt;• Ancestors a node’s parent and parent’s parent...&lt;/p&gt;

&lt;p&gt;• Descendants a node’s children and children’s children...&lt;/p&gt;

&lt;p&gt;There are different types of expressions to select a node in an HTML document, here are the most important ones :&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You can also use &lt;strong&gt;predicates&lt;/strong&gt; to find a node that contains a specific value. Predicates are always in square brackets: &lt;code&gt;[predicate]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples :&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now we will see some examples of Xpath expressions. We can test XPath expressions inside Chrome Dev tools, so it is time to fire up Chrome.&lt;/p&gt;

&lt;p&gt;To do so, right-click on the web page -&amp;gt; inspect and then &lt;code&gt;cmd + f&lt;/code&gt; on a Mac or &lt;code&gt;ctrl + f&lt;/code&gt; on other systems, then you can enter an Xpath expression, and the match will be highlighted in the Dev tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" alt="" width="544" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip
&lt;/h3&gt;

&lt;p&gt;In the dev tools, you can right-click on any DOM node, and show its full XPath expression, that you can later factorize.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath with Python
&lt;/h2&gt;

&lt;p&gt;There are many Python packages that allow you to use XPath expressions to select HTML elements like lxml, Scrapy or Selenium. In these examples, we are going to use Selenium with Chrome in headless mode. You can look at this article to set up your environment: &lt;a href="https://www.scrapingbee.com/blog/scraping-single-page-applications"&gt;Scraping Single Page Application with Python&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  E-commerce product data extraction
&lt;/h3&gt;

&lt;p&gt;In this example, we are going to see how to extract E-commerce product data from Ebay.com with XPath expressions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" alt="" width="880" height="614"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;On these three XPath expressions, we are using a &lt;code&gt;//&lt;/code&gt; as an axis, meaning we are selecting nodes anywhere in the HTML tree. Then we are using a predicate &lt;code&gt;[predicate]&lt;/code&gt; to match on specific IDs. IDs are supposed to be unique so it's not a problem do to this.&lt;/p&gt;

&lt;p&gt;But when you select an element with its class name, it's better to use a relative path, because the class name can be used anywhere in the DOM, so the more specific you are the better. Not only that, but when the website will change (and it will), your code will be much more resilient to changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automagically authenticate to a website
&lt;/h3&gt;

&lt;p&gt;When you have to perform the same action on a website or extract the same type of information we can be a little smarter with our XPath expression, in order to create generic ones, and not specific XPath for each website.&lt;/p&gt;

&lt;p&gt;In order to explain this, we're going to make a "generic" authentication function that will take a Login URL, a username and password, and try to authenticate on the target website.&lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first  tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first  before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form and click on the submit button.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most login forms will have an &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag. So we can select this password input with a simple: &lt;code&gt;//input[@type='password']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once we have this password input, we can use a &lt;strong&gt;relative path&lt;/strong&gt; to select the username/email input. It will generally be the first preceding input &lt;strong&gt;that isn't hidden:&lt;/strong&gt; &lt;code&gt;.//preceding::input[not(@type='hidden')]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It's really important to exclude hidden inputs, because most of the time you will have at least one &lt;a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery"&gt;CSRF token&lt;/a&gt; hidden input. CSRF stands for Cross Site Request Forgery. The token is generated by the server and is required in every form submissions / POST requests. Almost every website use this mechanism to prevent CSRF attacks.&lt;/p&gt;

&lt;p&gt;Now we need to select the enclosing form from one of the input:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//ancestor::form&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;And with the form, we can select the submit input/button:&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//*[@type='submit']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of such a function:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Of course it is far from perfect, it won't work everywhere but you get the idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.&lt;/p&gt;

&lt;p&gt;One of the most difficult task when writing XPath expressions is not the expression in itself, but being precise enough to be sure to select the right element when the DOM will change, but also resilient enough to resist DOM changes.&lt;/p&gt;

&lt;p&gt;At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our &lt;a href="https://www.scrapingbee.com/api-store"&gt;ready-made APIs&lt;/a&gt;. We will discuss the differences between the two in another blog post!&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this article, next time we will talk about ... CSS selectors :)&lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

&lt;p&gt;Discuss on HN: &lt;a href="https://news.ycombinator.com/item?id=21452310"&gt;https://news.ycombinator.com/item?id=21452310&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>webscraping</category>
      <category>xpath</category>
    </item>
    <item>
      <title>Serverless Web Scraping With Aws Lambda and Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 04 Sep 2019 09:36:20 +0000</pubDate>
      <link>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</link>
      <guid>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</guid>
      <description>&lt;p&gt;Serverless is a term referring to the execution of code inside ephemeral containers (Function As A Service, or FaaS). It is a hot topic in 2019, after the “micro-service” hype, here come the “nano-services”!&lt;/p&gt;

&lt;p&gt;Cloud functions can be triggered by different things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An HTTP call to a REST API&lt;/li&gt;
&lt;li&gt;A job in a message queue&lt;/li&gt;
&lt;li&gt;A log&lt;/li&gt;
&lt;li&gt;IOT event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud functions are a really good fit with web scraping tasks for many reasons. Web Scraping is I/O bound, most of the time is spent waiting for HTTP responses, so we don’t need high-end CPU servers. Cloud functions are cheap (first 1M request is free, then $0.20 per million requests) and easy to set up. Cloud functions are a good fit for parallel scraping, we can create hundreds or thousands of function at the same time for large-scale scraping.&lt;/p&gt;

&lt;p&gt;In this introduction, we are going to see how to deploy a slightly modified version of the Craigslist scraper we made on a previous &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;blogpost&lt;/a&gt; on AWS Lambda using the serverless framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;We are going to use the &lt;a href="https://serverless.com/"&gt;Serverless&lt;/a&gt; framework to build and deploy our project to AWS lambda. Serverless CLI is able to generate lots of boilerplate code in different languages and deploy the code to different cloud providers, like AWS, Google Cloud or Azure. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Node and npm&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://serverless.com/framework/docs/providers/aws/guide/quick-start/"&gt;Serverless CLI&lt;/a&gt; and Setup your &lt;a href="https://serverless.com/framework/docs/providers/aws/guide/credentials/"&gt;AWS credentials&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Java 8 &lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;We will build an API using API Gateway with a single endpoint &lt;code&gt;/items/{query}&lt;/code&gt; binded on a lambda function that will respond to us with a JSON array with all items (on the first result page) for this query.&lt;/p&gt;

&lt;p&gt;Here is a simple diagram for this architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the Maven project
&lt;/h2&gt;

&lt;p&gt;Serverless is able to generate projects in lots of different languages: Java, Python, NodeJS, Scala... &lt;br&gt;
We are going to use one of these templates to generate a maven project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;serverless create &lt;span class="nt"&gt;--template&lt;/span&gt; aws-java-maven &lt;span class="nt"&gt;--name&lt;/span&gt; items-api &lt;span class="nt"&gt;-p&lt;/span&gt; aws-java-scraper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can now open this Maven project in your favorite IDE. &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The first thing to do is to change the &lt;strong&gt;&lt;em&gt;serverless.yml&lt;/em&gt;&lt;/strong&gt; config to implement an API gateway route and bind it to the &lt;strong&gt;handleRequest&lt;/strong&gt; method in the &lt;strong&gt;&lt;em&gt;Handler.java&lt;/em&gt;&lt;/strong&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;craigslist-scraper-api&lt;/span&gt; 
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;java8&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;artifact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target/hello-dev.jar&lt;/span&gt;

&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;getCraigsListItems&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.serverless.Handler&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/items/{searchQuery}&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I also added a timeout to 30 seconds. The default timeout with the serverless framework is 6 seconds. Since we're running Java code the &lt;a href="https://serverless.com/blog/keep-your-lambdas-warm/"&gt;Lambda cold start&lt;/a&gt; can take several seconds. And then we will make an HTTP request to Craigslist website, so 30 seconds seems good. &lt;/p&gt;

&lt;h2&gt;
  
  
  Function code
&lt;/h2&gt;

&lt;p&gt;Now we can modify the &lt;strong&gt;&lt;em&gt;Handler.class&lt;/em&gt;&lt;/strong&gt;. The function logic is easy. First, we retrieve the path parameter called "searchQuery". Then we create a CraigsListScraper and call the &lt;strong&gt;&lt;em&gt;scrape()&lt;/em&gt;&lt;/strong&gt; method with this searchQuery. It will return a &lt;code&gt;List&amp;lt;Item&amp;gt;&lt;/code&gt; representing all the items on the first Craigslist's result page. &lt;/p&gt;

&lt;p&gt;We then use the &lt;code&gt;ApiGatewayResponse&lt;/code&gt; class that was generated by the Serverless framework to return a JSON array containing every item. &lt;/p&gt;

&lt;p&gt;You can find the rest of the code in &lt;a href="https://github.com/ksahin/serverless-scraping"&gt;this repository&lt;/a&gt;, with the &lt;code&gt;CraigsListScraper&lt;/code&gt; and &lt;code&gt;Item&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"received: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pathParameters"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"searchQuery"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrape&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
        &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error : "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;Response&lt;/span&gt; &lt;span class="n"&gt;responseBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error while processing URL: "&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responseBody&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; Serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can now build the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mvn clean install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And deploy it to AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serverless deploy
Serverless: Packaging service...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (13.35 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.................................
Serverless: Stack update finished...
Service Information
service: items-api
stage: dev
region: us-east-1
stack: items-api-dev
api keys:
  None
endpoints:
  GET - https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/{searchQuery}
functions:
  getCraigsListItems: items-api-dev-getCraigsListItems
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can then test your function using curl or your web browser with the URL given in the deployment logs (&lt;br&gt;
&lt;br&gt;
&lt;code&gt;serverless info&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 will also show this information.)&lt;/p&gt;

&lt;p&gt;Here is a query to look for "macBook pro" :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/macBook%20pro | json_reformat                                                            1 ↵
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19834  100 19834    0     0   7623      0  0:00:02  0:00:02 &lt;span class="nt"&gt;--&lt;/span&gt;:--:--  7622
&lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"2010 15&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Macbook pro 3.06ghz 8gb 320gb osx maverick"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 325,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-306ghz-8gb-320gb/6680853189.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro A1502 13.3&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Late 2013 2.6GHz i5 8 GB 500GB + Extras"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 875,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-alateghz-i5/6688755497.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro Charger USB-C (Latest Model) w/ Box - Like New!"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 50,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-charger-usb/6686902986.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"MacBook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; C2D 4GB memory 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 250,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-13-c2d-4gb-memory/6688682499.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 2011 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 475,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro/6675556875.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Trackpad Touchpad Mouse with Cable and Screws for Apple MacBook Pro"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 39,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/trackpad-touchpad-mouse-with/6682812027.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; i5 very clean, excellent shape! 4GB RAM, 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 359,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/sfc/sys/d/macbook-pro-13-i5-very-clean/6686879047.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that the first invocation will be slow, it took 7 seconds for me. The next invocations will be much quicker. &lt;/p&gt;

&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;This was just a little example, here are some ideas to improve this :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better error handling&lt;/li&gt;
&lt;li&gt;Protect the API with an API Key (really easy to implement with API Gateway)&lt;/li&gt;
&lt;li&gt;Save the items to a DynamoDB database&lt;/li&gt;
&lt;li&gt;Send the search query to an SQS queue, and trigger the lambda execution with the queue instead of an HTTP request&lt;/li&gt;
&lt;li&gt;Send a notification with SNS if an Item is less than a certain price point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like web scraping and are tired taking care of proxies, JS rendering and captchas, you can check our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;, the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;This is the end of this tutorial. I hope you enjoyed the post. Don't hesitate to experiment with Lambda and other cloud providers, it's really fun, easy, and can drastically reduce your infrastructure costs, especially for web-scraping or asynchronous related tasks.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web Scraping 101 in Python</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 21 Aug 2019 10:24:15 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</link>
      <guid>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</guid>
      <description>&lt;p&gt;In this post, which can be read as a follow up to our &lt;a href="https://www.daolf.com/posts/avoiding-being-blocked-while-scraping-ultimate-guide/"&gt;ultimate web scraping guide&lt;/a&gt;, we will cover almost all the tools Python offers you to web scrape. We will go from the more basic to the most advanced one and will cover the pros and cons of each. Of course, we won't be able to cover all aspect of every tool we discuss, but this post should be enough to have a good idea of which tools does what, and when to use which.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: when I talk about Python in this blog post you should assume that I talk about Python3.&lt;/em&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Content:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;0) Web Fundamentals
&lt;/li&gt;
&lt;li&gt;1) Manually opening a socket and sending the HTTP request
&lt;/li&gt;
&lt;li&gt;2) urllib3 &amp;amp; LXML
&lt;/li&gt;
&lt;li&gt;3) requests &amp;amp; BeautifulSoup
&lt;/li&gt;
&lt;li&gt;4) Scrapy
&lt;/li&gt;
&lt;li&gt;5) Selenium &amp;amp; Chrome —headless
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;span id="web-fondamentals"&gt; 0) Web Fundamentals &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;The internet is &lt;strong&gt;really complex&lt;/strong&gt;: there are many underlying technologies and concepts involved to view a simple web page in your browser. I don’t have the pretension to explain everything, but I will show you the most important things you have to understand in order to extract data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  HyperText Transfer Protocol
&lt;/h2&gt;

&lt;p&gt;HTTP uses a &lt;strong&gt;client/server&lt;/strong&gt; model, where an HTTP client (A browser, your Python program, curl, Requests...) opens a connection and sends a message (“I want to see that page: /product”)to an HTTP server (Nginx, Apache...). &lt;/p&gt;

&lt;p&gt;Then the server answers with a response (The HTML code for example) and closes the connection. HTTP is called a stateless protocol, because each transaction (request/response) is independent. FTP for example, is stateful.&lt;/p&gt;

&lt;p&gt;Basically, when you type a website address in your browser, the HTTP request looks like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In the first line of this request, you can see multiples things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the GET verb or method being used, meaning we request data from the specific path: &lt;code&gt;/product/&lt;/code&gt;.There are other HTTP verbs, you can see the full list &lt;a href="https://www.w3schools.com/tags/ref_httpmethods.asp"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The version of the HTTP protocol, in this tutorial we will focus on HTTP 1.&lt;/li&gt;
&lt;li&gt;Multiple headers fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the most important header fields :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; The domain name of the server, if no port number is given, is assumed to be 80*&lt;em&gt;.&lt;/em&gt;*&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-Agent:&lt;/strong&gt; Contains information about the client originating the request, including the OS information. In this case, it is my web-browser (Chrome), on OSX. This header is important because it is either used for statistics (How many users visit my website on Mobile vs Desktop) or to prevent any violations by bots. Because these headers are sent by the clients, it can be modified (it is called “Header Spoofing”), and that is exactly what we will do with our scrapers, to make our scrapers look like a normal web browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept:&lt;/strong&gt; The content types that are acceptable as a response. There are lots of different content types and sub-types: &lt;strong&gt;text/plain, text/html, image/jpeg, application/json&lt;/strong&gt; ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie&lt;/strong&gt; : name1=value1;name2=value2... This header field contains a list of name-value pairs. It is called session cookies, these are used to store data. Cookies are what websites use to authenticate users, and/or store data in your browser. For example, when you fill a login form, the server will check if the credentials you entered are correct, if so, it will redirect you and inject a session cookie in your browser. Your browser will then send this cookie with every subsequent request to that server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referrer&lt;/strong&gt;: The Referrer header contains the URL from which the actual URL has been requested. This header is important because websites use this header to change their behavior based on where the user came from. For example, lots of news websites have a paying subscription and let you view only 10% of a post, but if the user came from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the list goes on...you can find the full header list &lt;a href="https://en.wikipedia.org/wiki/List_of_HTTP_header_fields"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A server will respond with something like this: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;On the first line, we have a new piece of information, the HTTP code &lt;code&gt;200 OK&lt;/code&gt;. It means the request has succeeded. As for the request headers, there are lots of HTTP codes, split into four common classes, 2XX for successful requests, 3XX for redirects, 4XX for bad requests (the most famous being 404 Not found), and 5XX for server errors.&lt;/p&gt;

&lt;p&gt;Then, in case you are sending this HTTP request with your web browser, the browser will parse the HTML code, fetch all the eventual assets (Javascript files, CSS files, images...) and it will render the result into the main window.&lt;/p&gt;

&lt;p&gt;In the next parts we will see the different ways to perform HTTP requests with Python and extract the data we want from the responses. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="socket"&gt; 1) Manually opening a socket and sending the HTTP request &lt;/span&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Socket
&lt;/h2&gt;

&lt;p&gt;The most basic way to perform an HTTP request in Python is to open a &lt;a href="https://docs.python.org/3/howto/sockets.html"&gt;socket&lt;/a&gt; and manually send the HTTP request.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions. &lt;/p&gt;

&lt;h2&gt;
  
  
  Regular Expressions
&lt;/h2&gt;

&lt;p&gt;A regular expression (RE, or Regex) is a search pattern for strings. With regex, you can search for a particular character/word inside a bigger body of text.&lt;/p&gt;

&lt;p&gt;For example, you could identify all phone numbers inside a web page. You can also replace items, for example, you could replace all uppercase tag in a poorly formatted HTML by lowercase ones. You can also validate some inputs ...&lt;/p&gt;

&lt;p&gt;The pattern used by the regex is applied from left to right. Each source character is only used once. You may be wondering why it is important to know about regular expressions when doing web scraping?&lt;/p&gt;

&lt;p&gt;After all, there is all kind of different Python module to parse HTML, with XPath, CSS selectors. &lt;/p&gt;

&lt;p&gt;In an ideal &lt;a href="https://en.wikipedia.org/wiki/Semantic_Web"&gt;semantic world,&lt;/a&gt; data is easily machine-readable, the information is embedded inside relevant HTML element, with meaningful attributes.&lt;/p&gt;

&lt;p&gt;But the real world is messy, you will often find huge amounts of text inside a p element. When you want to extract a specific data inside this huge text, for example, a price, a date, a name... you will have to use regular expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Here is a great website to test your regex: &lt;a href="https://regex101.com/"&gt;https://regex101.com/&lt;/a&gt; and &lt;a href="https://www.rexegg.com/"&gt;one awesome blog&lt;/a&gt; to learn more about them, this post will only cover a small fraction of what you can do with regexp.&lt;/p&gt;

&lt;p&gt;Regular expressions can be useful when you have this kind of data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;Price : 19.99$&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We could select this text node with an Xpath expression, and then use this kind of regex to extract the price :&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^Price\s:\s(\d+\.\d{2})\$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;To extract the text inside an HTML tag, it is annoying to use a regex, but doable:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;As you can see, manually sending the HTTP request with a socket, and parsing the response with regular expression can be done, but it's complicated and there are higher-level API that can make this task easier. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="lxml"&gt; 2) urllib3 &amp;amp; LXML &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: It is easy to get lost in the urllib universe in Python. You have urllib and urllib2 that are parts of the standard lib. You can also find urllib3. urllib2 was split in multiple modules in Python 3, and urllib3 should not be a part of the standard lib anytime soon. This whole confusing thing will be the subject of a blog post by itself. In this part, I've made the choice to only talk about urllib3 as it is used widely in the Python world, by Pip and requests to name only them. &lt;/p&gt;

&lt;p&gt;Urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. It allows doing what we did above with socket with way fewer lines of code.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Much more concise than the socket version. Not only that, but the API is straightforward and you can do many things easily, like adding HTTP headers, using a proxy, POSTing forms ... &lt;/p&gt;

&lt;p&gt;For example, had we decide to set some headers and to use a proxy, we would only have to do this.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;See? Exactly the same number of line, however, there are some things that urllib3 does not handle very easily, for example, if we want to add a cookie, we have to manually create the corresponding headers and add it to the request. &lt;/p&gt;

&lt;p&gt;There are also things that urllib3 can do that requsts can't, creation and management of pool and proxy pool, control of retry strategy for example.&lt;/p&gt;

&lt;p&gt;To put in simply, urllib3 is between requests and socket in terms of abstraction, although way closer to requests than socket.&lt;/p&gt;

&lt;p&gt;This time, to parse the response, we are going to use the lxml package and XPath expressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath
&lt;/h2&gt;

&lt;p&gt;Xpath is a technology that uses path expressions to select nodes or node- sets in an XML document (or HTML document). As with the Document Object Model, Xpath is a W3C standard since 1999. Even if Xpath is not a programming language in itself, it allows you to write expression that can access directly to a specific node, or a specific node-set, without having to go through the entire HTML tree (or XML tree).&lt;/p&gt;

&lt;p&gt;Think of XPath as regexp, but specifically for XML/HMTL.&lt;/p&gt;

&lt;p&gt;To extract data from an HTML document with XPath we need 3 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an HTML document&lt;/li&gt;
&lt;li&gt;some XPath expressions&lt;/li&gt;
&lt;li&gt;an XPath engine that will run those expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To begin we will use the HTML that we got thanks to urllib3, we just want to extract all the links from the Google homepage so we will use one simple XPath expression: &lt;code&gt;//a&lt;/code&gt; and we will use LXML to run it. LXML is a fast and easy to use XML and HTML processing library that supports XPATH. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Installation&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Below is the code that comes just after the previous snippet:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;And the output should look like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have to keep in mind that this example is really really simple and doesn't really show you how powerful XPath can be (note: this XPath expression should have been changed to &lt;code&gt;//a/@href&lt;/code&gt; to avoid having to iterate on &lt;code&gt;links&lt;/code&gt; to get their &lt;code&gt;href&lt;/code&gt; ).&lt;/p&gt;

&lt;p&gt;If you want to learn more about XPath you can read &lt;a href="https://librarycarpentry.org/lc-webscraping/02-xpath/index.html"&gt;this good introduction&lt;/a&gt;. The LXML documentation is also &lt;a href="https://lxml.de/tutorial.html"&gt;well written and is a good starting point&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;XPath expresions, like regexp, are really powerful and one of the fastest way to extract information from HTML, and like regexp, XPath can quickly become messy, hard to read and hard to maintain.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="requests"&gt; 3) requests &amp;amp; BeautifulSoup &lt;span&gt;
&lt;/span&gt;&lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" alt="" width="357" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/psf/requests"&gt;Requests&lt;/a&gt; is the king of python packages, with more than 11 000 000 downloads, it is the most widly used package for Python. &lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Making a request with Requests (no comment) is really easy: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;With Requests it is easy to perform POST requests, handle cookies, query parameters... &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication to Hacker News&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we want to create a tool to automatically submit our blog post to Hacker news or any other forums, like Buffer. We would need to authenticate to those websites before posting our link. That's what we are going to do with Requests and BeautifulSoup!&lt;/p&gt;

&lt;p&gt;Here is the Hacker News login form and the associated DOM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" alt="" width="880" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; **tags on this form, the first one has a type hidden with a name "goto" and the two others are the username and password. &lt;/p&gt;

&lt;p&gt;If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect and a cookie is being set. This cookie will be sent by Chrome on each subsequent request in order for the server to know that you are authenticated. &lt;/p&gt;

&lt;p&gt;Doing this with Requests is easy, it will handle redirects automatically for us, and handling cookies can be done with the &lt;em&gt;Session&lt;/em&gt; object. &lt;/p&gt;

&lt;p&gt;The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.&lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;So all we have to do is to POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;In order to learn more about BeautifulSoup we could try to extract every links on the homepage. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;By the way, Hacker News offers a &lt;a href="https://github.com/HackerNews/API"&gt;powerful API&lt;/a&gt;, so we're doing this as an example, but you should use the API instead of scraping it!&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;The first thing we need to do is to inspect the Hacker News's home page to understand the structure and the different CSS classes that we will have to select:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that all posts are inside a &lt;code&gt;&amp;lt;tr class="athing"&amp;gt;&lt;/code&gt; **so the first thing we will need to do is to select all these tags. This can be easily done with: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;links = soup.findAll('tr', class_='athing')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then for each link, we will extract its id, title, url and rank:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;As you saw, Requests and BeautifulSoup are great libraries to extract data and automate different things by posting forms. If you want to do large-scale web scraping projects, you could still use Requests, but you would need to handle lots of things yourself. &lt;/p&gt;

&lt;p&gt;When you need to scrape a lots of webpages, there are many things you have to take care of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;finding a way of parallelizing your code to make it faster&lt;/li&gt;
&lt;li&gt;handling error&lt;/li&gt;
&lt;li&gt;storing result&lt;/li&gt;
&lt;li&gt;filtering result&lt;/li&gt;
&lt;li&gt;throttling your request so you don't over load the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately for us, tools exist that can handle those things for us.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="scrapy"&gt; 4) Scrapy &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" alt="" width="600" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is a powerful Python web scraping framework. It provides many features to download web pages asynchronously, process and save it. It handles multithreading, crawling (the process of going from links to links to find every URLs in a website), sitemap crawling and many more. &lt;/p&gt;

&lt;p&gt;Scrapy has also an interactive mode called the Scrapy Shell. With Scrapy Shell you can test your scraping code really quickly, like XPath expression or CSS selectors. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep, there is a lot to learn. &lt;/p&gt;

&lt;p&gt;To follow up on our example about Hacker news, we are going to write a Scrapy Spider that scrapes the first 15 pages of results, and saves everything in a CSV file. &lt;/p&gt;

&lt;p&gt;You can easily install Scrapy with pip: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then you can use the scrapy cli to generate the boilerplate code for our project: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject hacker_news_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Inside &lt;code&gt;hacker_news_scraper/spider&lt;/code&gt; we will create a new python file with our Spider's code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;There is a lot of convention in Scrapy, here we define an Array of starting urls. The attribute name will be used to call our Spider with the Scrapy command line. &lt;/p&gt;

&lt;p&gt;The parse method will be called on each URL in the &lt;code&gt;start_urls&lt;/code&gt; array&lt;/p&gt;

&lt;p&gt;We then need to tune Scrapy a little bit in order for our Spider to behave nicely against the target website. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You should always turn this on, it will make sure the target website is not slow down by your spiders by analyzing the response time and adapting the numbers of concurrent threads. &lt;/p&gt;

&lt;p&gt;You can run this code with the Scrapy CLI and with different output format (CSV, JSON, XML...):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl hacker-news -o links.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And that's it! You will now have all your links in a nicely formatted JSON file. &lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;span id="selenium"&gt; 5) Selenium &amp;amp; Chrome —headless &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Scrapy is really nice for large-scale web scraping tasks, but it is not enough if you need to scrape Single Page Application written with Javascript frameworks because It won't be able to render the Javascript code. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of AJAX calls and websockets connections involved. If performance is an issue, you should always try to reproduce the Javascript code, meaning manually inspecting all the network calls with your browser inspector, and replicating the AJAX calls containing the interesting data.&lt;/p&gt;

&lt;p&gt;In some cases, there are just too many asynchronous HTTP calls involved to get the data you want and it can be easier to just render the page in a headless browser. &lt;/p&gt;

&lt;p&gt;Another great use case would be to take a screenshot of a page, and this is what we are going to do with the Hacker News homepage (again !)&lt;/p&gt;

&lt;p&gt;You can install the selenium package with pip: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You will also need &lt;a href="http://chromedriver.chromium.org/"&gt;Chromedriver&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then we just have to import the Webdriver from selenium package, configure Chrome with headless=True and set a window size (otherwise it is really small):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;You should get a nice screenshot of the homepage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can do many more with the Selenium API and Chrome, like :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executing Javascript&lt;/li&gt;
&lt;li&gt;Filling forms&lt;/li&gt;
&lt;li&gt;Clicking on Elements&lt;/li&gt;
&lt;li&gt;Extracting elements with CSS selectors / XPath expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selenium and Chrome in headless mode is really the ultimate combination to scrape anything you want. You can automate anything that you could do with your regular Chrome browser. &lt;/p&gt;

&lt;p&gt;The big drawback is that Chrome needs lots of memory / CPU power. With some fine-tuning you can reduce the memory footprint to 300-400mb per Chrome instance, but you still need 1 CPU core per instance. &lt;/p&gt;

&lt;p&gt;If you want to run several Chrome instances concurrently, you will need powerful servers (the cost goes up quickly) and constant monitoring of resources. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="conclusion"&gt; Conclusion: &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Here is a quick recap table of every technology we discuss about in this about. Do not hesitate to tell us in the comment if you know some ressources that you feel have their places here.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I hope that this overview will help you best choose your Python scraping tools and that you learned things reading this post.&lt;/p&gt;

&lt;p&gt;Every tools I talked about in this post will be the subject of a specific blog post in the future where I'll go deep into the details.&lt;/p&gt;

&lt;p&gt;Everything I talked about in this post is things I used to build &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, the simplest web scraping API around there. Do not hesitate to test our solution if you don’t want to lose too much time setting everything up, the first 1k API calls are on us 😊.&lt;/p&gt;

&lt;p&gt;Do not hesitate to tell in the comments what you'd like to know about scraping, I'll talk about it in my next post.&lt;/p&gt;

&lt;p&gt;Happy Scraping &lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>A guide to Web scraping without getting blocked</title>
      <dc:creator>Pierre </dc:creator>
      <pubDate>Wed, 31 Jul 2019 16:30:25 +0000</pubDate>
      <link>https://dev.to/scrapingbee/a-guide-to-web-scraping-without-getting-blocked-5e7e</link>
      <guid>https://dev.to/scrapingbee/a-guide-to-web-scraping-without-getting-blocked-5e7e</guid>
      <description>&lt;p&gt;Web scraping or crawling is the fact of fetching data from a third party website by downloading and parsing the HTML code to extract the data you want.&lt;/p&gt;

&lt;p&gt;But you should use an API for this!&lt;/p&gt;

&lt;p&gt;Not every website offers an API, and APIs don't always expose every piece of information you need. So it's often the only solution to extract website data.&lt;/p&gt;

&lt;p&gt;There are many use cases for web scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce price monitoring&lt;/li&gt;
&lt;li&gt;News aggregation&lt;/li&gt;
&lt;li&gt;Lead generation&lt;/li&gt;
&lt;li&gt;SEO (Search engine result page monitoring)&lt;/li&gt;
&lt;li&gt;Bank account aggregation (Mint in the US, Bankin' in Europe)&lt;/li&gt;
&lt;li&gt;But also lots of individual and researchers who need to build a dataset otherwise not available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, what is the problem?&lt;/p&gt;

&lt;p&gt;The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browser (except Google, they all want to be scraped by Google).&lt;/p&gt;

&lt;p&gt;So, when you scrape, you have to be careful not being recognized as a robot by basically doing two things: using human tools &amp;amp; having a human behavior. This post will guide you through all the things you can use to cover yourself and through all the tools websites use to block you.&lt;/p&gt;

&lt;h1&gt;
  
  
  Emulate human tool i.e: Headless Chrome
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why using headless browsing?
&lt;/h2&gt;

&lt;p&gt;When you open your browser and go to a webpage it almost always means that you are you asking an HTTP server for some content. And one of the easiest ways pull content from an HTTP server is to use a classic command-line tool such as &lt;a href="https://en.wikipedia.org/wiki/CURL" rel="noopener noreferrer"&gt;cURL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thing is if you just do a: curl &lt;a href="http://www.google.com" rel="noopener noreferrer"&gt;www.google.com&lt;/a&gt;, Google has many ways to know that you are not a human, just by looking at the headers for examples. Headers are small pieces of information that goes with every HTTP request that hit the servers, and one of those pieces of information precisely describe the client making the request, I am talking about the "User-Agent" header. And just by looking at the "User-Agent" header, Google now knows that you are using cURL. If you want to learn more about headers, the &lt;a href="https://en.wikipedia.org/wiki/List_of_HTTP_header_fields" rel="noopener noreferrer"&gt;Wikipedia page&lt;/a&gt; is great, and to make some experiment, just &lt;a href="http://www.httpbin.org/headers?json" rel="noopener noreferrer"&gt;go over here&lt;/a&gt;, it's a webpage that simply displays the headers information of your request.&lt;/p&gt;

&lt;p&gt;Headers are really easy to alter with cURL, and copying the User-Agent header of a legit browser could do the trick. In the real world, you'd need to set more than just one header but more generally it is not very difficult to artificially craft an HTTP request with cURL or any library that will make this request looks exactly like a request made with a browser. Everybody knows that, and so, to know if you are using a real browser website will check one thing that cURL and library can not do: JS execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you speak JS?
&lt;/h2&gt;

&lt;p&gt;The concept is very simple, the website embeds a little snippet of JS in its webpage that, once executed, will "unlock" the webpage. If you are using a real browser, then, you won't notice the difference, but if you're not, all you'll receive is an HTML page with some obscure JS in it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9jxuds0b3pa770dej6t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9jxuds0b3pa770dej6t8.png"&gt;&lt;/a&gt;&lt;/p&gt;
 (an actual example of such a snippet) 



&lt;p&gt;But once again, this solution is not completely bulletproof, mainly because since nodeJS it is now very easy to execute JS outside of a browser. But once again, the web evolved and there are other tricks to determine if you are using a real browser or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headless Browsing
&lt;/h2&gt;

&lt;p&gt;Trying to execute snippet JS on the side with the node is really difficult and not robust at all. And more importantly, as soon as the website has a more complicated check system or is a big single-page application cURL and pseudo-JS execution with node become useless. So the best way to look like a real browser is to actually use one.&lt;/p&gt;

&lt;p&gt;Headless Browsers will behave "exactly" like a real browser except that you will easily be able to programmatically use them. The most used is Chrome Headless, a Chrome option that has the behavior of Chrome without all the UI wrapping it.&lt;/p&gt;

&lt;p&gt;The easiest way to use Headless Chrome is by calling driver that wraps all its functionality into an easy API, &lt;a href="https://github.com/SeleniumHQ/selenium" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt; and &lt;a href="https://github.com/GoogleChrome/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; are the two most famous solutions.&lt;/p&gt;

&lt;p&gt;However, it will not be enough as websites have now tools that allow them to detect a headless browser. This arms race that's been going on for a long time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fingerprinting
&lt;/h2&gt;

&lt;p&gt;Everyone, and mostly front dev, knows how every browser behaves differently. Sometimes it can be about rendering CSS, sometimes JS, sometimes just internal properties. Most of those differences are well known and it is now possible to detect if a browser is actually who it pretends to be. Meaning the website is asking itself "are all the browser properties and behaviors matched what I know about the User-Agent sent by this browser?".&lt;/p&gt;

&lt;p&gt;This is why there is an everlasting arms race between scrapers who want to pass themselves as a real browser and websites who want to distinguish headless from the rest.&lt;/p&gt;

&lt;p&gt;However, in this arms race, web scrapers tend to have a big advantage and here is why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6augn0y6fqtlqd08ratm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6augn0y6fqtlqd08ratm.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of the time, when a Javascript code tries to detect whether it's being run in headless mode is when it is a malware that is trying to evade behavioral fingerprinting. Meaning that the JS will behave nicely inside a scanning environment and badly inside real browsers. And this is why the &lt;a href="https://transparencyreport.google.com/safe-browsing/overview" rel="noopener noreferrer"&gt;team behind the Chrome headless mode&lt;/a&gt; are trying to make it indistinguishable from a real user's web browser in order to stop malware from doing that. And this is why web scrapers, in this arms race can profit from this effort.&lt;/p&gt;

&lt;p&gt;One another thing to know is that whereas running 20 cURL in parallel is trivial, Chrome Headless while relatively easy to use for small use cases, can be tricky to put at scale. Mainly because it uses lots of RAM so managing more than 20 instances of it is a challenge.&lt;/p&gt;

&lt;p&gt;If you want to learn more about browser fingerprinting I suggest you take a look at &lt;a href="https://antoinevastel.com/" rel="noopener noreferrer"&gt;Antoine Vastel blog&lt;/a&gt;, a blog entirely dedicated to this subject.&lt;/p&gt;

&lt;p&gt;That's about all you need to know to understand how to pretend like you are using a real browser. Let's now take a look at how do you behave like a real human.&lt;/p&gt;

&lt;h1&gt;
  
  
  Emulate human behaviour i.e: Proxy, Captchas solving and Request pattern
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Proxy yourself
&lt;/h2&gt;

&lt;p&gt;A human using a real browser will rarely request 20 pages per second from the same website, so if you want to request a lot of page from the same website you have to trick this website into thinking that all those requests come from a different place in the world i.e: different I.P addresses. In other words, you need to use proxies.&lt;/p&gt;

&lt;p&gt;Proxies are now not very expensive: ~1$ per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxies IPs needs to be constantly monitored in order to discard the one that is not working anymore and replace it.&lt;/p&gt;

&lt;p&gt;There are several proxy solutions in the market, here are the most used: &lt;a href="https://luminati.io/" rel="noopener noreferrer"&gt;Luminati Network&lt;/a&gt;, &lt;a href="https://blazingseollc.com/" rel="noopener noreferrer"&gt;Blazing SEO&lt;/a&gt; and &lt;a href="https://smartproxy.com/" rel="noopener noreferrer"&gt;SmartProxy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is also a lot of free proxy list and I don’t recommend using these because there are often slow, unreliable, and websites offering these lists are not always transparent about where these&lt;/p&gt;

&lt;p&gt;proxies are located. Those free proxy lists are most of the time public, and therefore, their IPs will be automatically banned by the most website. Proxy quality is very important, anti crawling services are known to maintain an internal list of proxy IP so every traffic coming from those IPs will also be blocked. Be careful to choose a good reputation Proxy. This is why I recommend using a paid proxy network or build your own&lt;/p&gt;

&lt;p&gt;To build your on you could take a look at &lt;a href="https://scrapoxy.io/" rel="noopener noreferrer"&gt;scrapoxy&lt;/a&gt;, a great open-source API, allowing you to build a proxy API on top of different cloud providers. Scrapoxy will create a proxy pool by creating instances on various cloud providers (AWS, OVH, Digital Ocean). Then you will be able to configure your client so it uses the Scrapoxy URL as the main proxy, and Scrapoxy it will automatically assign a proxy inside the proxy pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist ...) but can be a little tedious to put in place.&lt;/p&gt;

&lt;p&gt;You could also use the TOR network, aka, &lt;a href="https://www.torproject.org/" rel="noopener noreferrer"&gt;The Onion Router&lt;/a&gt;. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance/traffic analysis very difficult. There are a lot of use cases for TOR usage, such as privacy, freedom of speech, journalists in the dictatorship regime, and of course, illegal activities. In the context of web scraping, TOR can hide your IP address, and change your bot’s IP address every 10 minutes. The TOR exit nodes IP addresses are public. Some websites block TOR traffic using a simple rule: if the server receives a request from one of the TOR public exit nodes, it will block it. That’s why in many&lt;/p&gt;

&lt;p&gt;cases, TOR won’t help you, compared to classic proxies. It's worth noting that traffic through TOR is also inherently much slower because of the multiple routing thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Captchas
&lt;/h2&gt;

&lt;p&gt;But sometimes proxies will not be enough, some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time CAPTCHAs are only displayed to suspicious IP, so switching proxy will work in those cases. For the other cases, you'll need to use CAPTCHAs solving service (&lt;a href="https://2captcha.com/" rel="noopener noreferrer"&gt;2Captchas&lt;/a&gt; and &lt;a href="https://www.deathbycaptcha.com/user/login" rel="noopener noreferrer"&gt;DeathByCaptchas&lt;/a&gt; come to mind).&lt;/p&gt;

&lt;p&gt;You have to know that while some Captchas can be automatically resolved with optical character recognition (OCR), the most recent one has to be solved by hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffq70vlkx0ud047kn9us7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffq70vlkx0ud047kn9us7.png"&gt;&lt;/a&gt;&lt;/p&gt;
 Old captcha, breakable programatically 



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F68acxo4r89rla6u385qm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F68acxo4r89rla6u385qm.png"&gt;&lt;/a&gt;&lt;/p&gt;
 Google ReCaptcha V2 



&lt;p&gt;What it means is that if you use those aforementioned services, on the other side of the API call you'll have hundreds of people resolving CAPTCHAs for as low as &lt;a href="https://2captcha.com/make-money-online" rel="noopener noreferrer"&gt;20ct an hour&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But then again, even if you solve CAPCHAs or switch proxy as soon as you see one, websites can still detect your little scraping job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request Pattern
&lt;/h2&gt;

&lt;p&gt;A last advanced tool used by the website to detect scraping is pattern recognition. So if you plan to scrap every ids from 1 to 10 000 for the URL &lt;a href="http://www.example.com/product/" rel="noopener noreferrer"&gt;www.example.com/product/&lt;/a&gt;, try not to do it sequentially and with a constant rate of request. You could, for example, maintain a set of integer going from 1 to 10 000 and randomly choose one integer inside this set and then scraping your product.&lt;/p&gt;

&lt;p&gt;This one simple example, some websites also do statistic on browser fingerprint per endpoint. Which means that if you don't change some parameters in your headless browser and target a single endpoint, they might block you anyway.&lt;/p&gt;

&lt;p&gt;Websites also tend to monitor the origin of traffic, so if you want to scrape a website if Brazil, try not doing it with proxies in Vietnam for example.&lt;/p&gt;

&lt;p&gt;But from experience, what I can tell, is that rate is the most important factor in "Request Pattern Recognition", sot the slower you scrape, the less chance you have to be discovered.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I hope that this overview will help you understand better web-scraping and that you learned things reading this post.&lt;/p&gt;

&lt;p&gt;Everything I talked about in this post is things I used to build &lt;a href="https://www.scrapingbee.com" rel="noopener noreferrer"&gt;ScrapingBee&lt;/a&gt;, the simplest web scraping API around there. Do not hesitate to test our solution if you don’t want to lose too much time setting everything up, the first 1k API calls are on us :).&lt;/p&gt;

&lt;p&gt;Do not hesitate to tell in the comments what you'd like to know about scraping, I'll talk about it in my next post.&lt;br&gt;
Happy scraping 😎&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>webdev</category>
      <category>beginners</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Scraping single page applications with ease.</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 26 May 2019 17:22:14 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</link>
      <guid>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</guid>
      <description>&lt;p&gt;Dealing with a website that uses lots of Javascript to render their content can be tricky. These days, more and more sites are using frameworks like Angular, React, Vue.js for their frontend.&lt;/p&gt;

&lt;p&gt;These frontend frameworks are complicated to deal with because there are often using the newest features of the HTML5 API.&lt;/p&gt;

&lt;p&gt;So basically the problem that you will encounter is that your headless browser will download the HTML code, and the Javascript code, but will not be able to execute the full Javascript code, and the webpage will not be totally rendered.&lt;/p&gt;

&lt;p&gt;There are some solutions to these problems. The first one is to use a better headless browser. And the second one is to inspect the API calls that are made by the Javascript frontend and to reproduce them.&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of Ajax calls and Websockets connections involved. If performance is an issue, you should always try to reproduce the Javascript code, meaning manually inspecting all the network calls with your browser inspector, and replicating the AJAX calls containing interesting data.&lt;/p&gt;

&lt;p&gt;So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser, capable of interpreting and executing all the Javascript code in order to render the page, that is what the next part is about.&lt;/p&gt;

&lt;h1&gt;
  
  
  Headless Chrome with Python
&lt;/h1&gt;

&lt;p&gt;PhantomJS was the leader in this space, it was (and still is) heavy used for browser automation and testing. After hearing the news about the release of the headless mode with Chrome, the PhantomJS maintainer said that he was stepping down as maintainer, because I quote “Google Chrome is faster and more stable than PhantomJS [...]” It looks like Chrome in headless mode is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;You will need to install the selenium package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And of course, you need a Chrome browser, and Chromedriver installed on your system.&lt;/p&gt;

&lt;p&gt;On macOS, you can simply use brew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brew&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;chromedriver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Taking a screenshot
&lt;/h1&gt;

&lt;p&gt;We are going to use Chrome to take a screenshot of the Nintendo's home page which uses lots of Javascript.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium.webdriver.chrome.options&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executable_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'/usr/local/bin/chromedriver'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.nintendo.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'screenshot.png'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is really straightforward, I just added a parameter --window-size because the default size was too small.&lt;/p&gt;

&lt;p&gt;You should now have a nice screenshot of the Nintendo's home page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" alt="Nintendo Homepage Screenshot" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Waiting for the page load
&lt;/h1&gt;

&lt;p&gt;Most of the times, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to load to get the fully rendered page.&lt;/p&gt;

&lt;p&gt;A simple solution to this is to just time.sleep() en arbitrary amount of time. The problem with this method is that you are either waiting too long, or too little depending on your latency and internet connexion speed.&lt;/p&gt;

&lt;p&gt;The other solution is to use the WebDriverWait object from the Selenium API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;presence_of_element_located&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'chart'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Page is ready!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;TimeoutException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Timeout"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
`&lt;/p&gt;

&lt;p&gt;This is a great solution because it will wait the exact amount of time necessary for the element to be rendered on the page.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is to manage it in production. If you scrape lots of different websites, the resource usage will be volatile.&lt;/p&gt;

&lt;p&gt;Meaning there will be CPU spikes, memory spikes just like a regular Chrome browser. After all, your Chrome instance will execute un-trusted and un-predictable third-party Javascript code! Then there is also the zombie-processes problem&lt;/p&gt;

&lt;p&gt;This is one of the reason I started &lt;a href="https://https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, so that developers can focus on extracting the data they want, not managing Headless browsers and proxies!&lt;/p&gt;

&lt;p&gt;This was my first post on about scraping, I hope you enjoyed it!&lt;/p&gt;

&lt;p&gt;If you did please let me know, I'll write more 😊&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to know more about ScrapingBee, you can 👉 &lt;a href="https://dev.to/daolf/new-season-new-project-i-need-you-197l"&gt;here&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>scraping</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Introduction to Web Scraping With Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 13 Mar 2019 16:46:23 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</link>
      <guid>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</guid>
      <description>&lt;p&gt;Web scraping or crawling is the fact of fetching data from a third party website by downloading and parsing the HTML code to extract the data you want.&lt;/p&gt;

&lt;p&gt;Since every website does not offer a clean API, or an API at all, web scraping can be the only solution when it comes to extracting website information.&lt;br&gt;
Lots of companies use it to obtain knowledge concerning competitor prices, news aggregation, mass email collect…&lt;/p&gt;

&lt;p&gt;Almost everything can be extracted from HTML, the only information that are “difficult” to extract are inside images or other media.&lt;/p&gt;

&lt;p&gt;In this post, we are going to see basic techniques in order to fetch and parse data in Java. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic Java understanding&lt;/li&gt;
&lt;li&gt;Basic XPath&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;You will need Java 8 with &lt;a href="http://htmlunit.sourceforge.net" rel="noopener noreferrer"&gt;HtmlUnit&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using Eclipse, I suggest you configure the max length in the detail pane (when you click in the variables tab ) so that you will see the entire HTML of your current page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's scrape CraigList
&lt;/h3&gt;

&lt;p&gt;For our first example, we are going to fetch items from Craigslist since they don't seem to offer an API, to collect names, prices, and images, and export it to JSON. &lt;/p&gt;

&lt;p&gt;First, let's take a look at what happens when you search an item on Craigslist. Open Chrome Dev tools and click on the Network tab :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The search URL is :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/moa?is_paid=all&amp;amp;search_distance_type=mi&amp;amp;query=iphone+6s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query=iphone+6s  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can open your favorite IDE it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (Proxy settings, browser, redirect enabled ...)&lt;/p&gt;

&lt;p&gt;We are going to disable Javascript since it's not required for our example, and disabling Javascript makes the page load faster :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchQuery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Iphone 6s"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query="&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;URLEncoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;encode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchQuery&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HtmlPage object will contain the HTML code, you can access it with &lt;code&gt;asXml()&lt;/code&gt; method. &lt;/p&gt;

&lt;p&gt;Now we are going to fetch titles, images, and prices. We need to inspect the DOM structure for an item :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With HtmlUnit you have several options to select an html tag :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;getHtmlElementById(String id)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getFirstByXPath(String Xpath)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getByXPath(String XPath)&lt;/code&gt; which returns a List&lt;/li&gt;
&lt;li&gt;many others, rtfm !&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since there isn't any ID we could use, we have to make an &lt;a href="http://www.w3schools.com/xsl/xpath_syntax.asp" rel="noopener noreferrer"&gt;Xpath&lt;/a&gt; expression to select the tags we want. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XPath&lt;/strong&gt; is a query language to select XML nodes( HTML in our case).&lt;/p&gt;

&lt;p&gt;First, we are going to select all the &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; tags that have a class  &lt;code&gt;result-info&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then we will iterate through this list, and for each item select the name, price, and URL, and then print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//li[@class='result-row']"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No items found !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

  &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;// It is possible that an item doesn't have any price&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Name : %s Url : %s Price : %s"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then instead of just printing the results, we are going to put it in JSON, using &lt;a href="https://github.com/FasterXML/jackson" rel="noopener noreferrer"&gt;Jackson&lt;/a&gt; library, to map items in JSON format. &lt;/p&gt;

&lt;p&gt;We need a POJO (plain old java object) to represent Items&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item.java&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//getters and setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add this to your pom.xml :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.7.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, all we have to do is create an Item, set its attributes, and convert it to JSON string (or a file ...), and adapt the previous code a little bit :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
   &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

   &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
   &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;// It is possible that an item doesn't have any &lt;/span&gt;
   &lt;span class="c1"&gt;//price, we set the price to 0.0 in this case&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; 
   &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTitle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
   &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; 
   &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;

   &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
   &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;This example is not perfect, there are many things that can be improved :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-city search&lt;/li&gt;
&lt;li&gt;Handling pagination&lt;/li&gt;
&lt;li&gt;Multi-criteria search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was my first blog post I hope you enjoyed it, feel free to give me any feedback in the comments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;p&gt;I recently wrote a blog post about a &lt;a href="https://dev.to/scrapingbee/a-guide-to-web-scraping-without-getting-blocked-5e7e"&gt;Web Scraping without getting blocked&lt;/a&gt; to explain the different techniques in order how to hide your scrapers, check it out!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping E-Commerce Product Data</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 17 Feb 2019 09:24:37 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</link>
      <guid>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</guid>
      <description>&lt;p&gt;In this tutorial, we are going to see how to extract product data from any E-commerce websites with Java. There are lots of different use cases for product data extraction, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce price monitoring&lt;/li&gt;
&lt;li&gt;Price comparator&lt;/li&gt;
&lt;li&gt;Availability monitoring&lt;/li&gt;
&lt;li&gt;Extracting reviews&lt;/li&gt;
&lt;li&gt;Market research&lt;/li&gt;
&lt;li&gt;MAP violation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are going to extract these different fields: Price, Product Name, Image URL, SKU, and currency from this product page: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&gt;https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What you will need
&lt;/h2&gt;

&lt;p&gt;We will use HtmlUnit to perform the HTTP request and parse the DOM, add this dependency to your pom.xml.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We will also use the Jackson library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.9.8&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema.org
&lt;/h2&gt;

&lt;p&gt;In order to extract the fields we're interested in, we are going to parse &lt;a href="https://schema.org"&gt;https://schema.org&lt;/a&gt; metadata from the Html markup. &lt;/p&gt;

&lt;p&gt;Schema is a &lt;strong&gt;&lt;em&gt;semantic&lt;/em&gt;&lt;/strong&gt; vocabulary that can be added to any webpage. There are many benefits of implementing Schema. Most search engines use it to understand what a page is about (A Product, an Article, a Review, and &lt;a href="https://schema.org/docs/schemas.html"&gt;many more&lt;/a&gt; )&lt;/p&gt;

&lt;p&gt;According to schema.org, about 10 million websites use it worldwide. That's huge! &lt;br&gt;
There are different types of Schema, and today we're going to look at the &lt;a href="https://schema.org/Product"&gt;Product type&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's really convenient because once you wrote a scraper that extracts specific schema data, it will work on any other website using the same schema. No more specific XPath / CSS selectors to write!&lt;/p&gt;

&lt;p&gt;In my experience at PricingBot (my previous company), about 40% of E-commerce websites use schema.org metadata in their DOM. &lt;/p&gt;

&lt;p&gt;There are three main schema markups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;JSON-LD&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;script&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type=&lt;/span&gt;&lt;span class="s2"&gt;"application/ld+json"&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://schema.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ItemList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"315"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"itemListElement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brand 502"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"offers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Offer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4399 p."&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;RDF-A&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;vocab=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"ItemList"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;315&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Photo of product"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;BRAND 502&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"RUB"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;руб
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:price"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"4399.00"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;4 399,00
            &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;...
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
          ...
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the one used in our example, &lt;strong&gt;&lt;em&gt;Microdata&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"schema-org"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;


&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://images.asos-media.com/products/the-north-face-vault-backpack-28-litres-in-black/10253008-1-black"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Image 1 of The North Face Vault Backpack 28 Litres in Black"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"productID"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"sku"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"brand"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Brand"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face Vault Backpack 28 Litres in Black&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Shop The North Face Vault Backpack 28 Litres in Black at ASOS. Discover fashion online.&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"availability"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/InStock"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"GBP"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;60&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"eligibleRegion"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;GB&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"seller"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Organization"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;ASOS&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;  
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that you can have multiple offers in a single page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting the data
&lt;/h2&gt;

&lt;p&gt;The first thing is to create a basic POJO of a Product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="c1"&gt;// ...getters &amp;amp; setters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then we need to go to the target URL and create a basic microdata parser to extract the fields we are interested in. I'm using HtmlUnit for this, which is a pure Java headless browser. I could have used lots of different libraries like Jsoup or Selenium + Headless Chrome. &lt;/p&gt;

&lt;p&gt;But in most cases, HtmlUnit is a good solution because it's lighter than Selenium + Headless Chrome, but offer more features than a raw HTTP client + JSoup (which only handles Html parsing). &lt;/p&gt;

&lt;p&gt;For "Javascript-heavy" websites, relying on frontend frameworks like React / Vue.js, Headless Chrome is the way to go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;productUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//*[@itemtype='https://schema.org/Product']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;((((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./img"&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='offers']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='price']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='name']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./*[@itemprop='priceCurrency']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='sku']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On the first lines, I created the HtmlUnit HTTP client and disabled Javascript because we don't need it to get the Schema markup. &lt;/p&gt;

&lt;p&gt;Then it's just basic XPath expressions to select the interesting DOM nodes we want. &lt;/p&gt;

&lt;p&gt;This parser is far from perfect, it doesn't extract everything and it doesn't handle multiple offers. However, this will give you an idea about how to extract Schema data. &lt;/p&gt;

&lt;p&gt;We can then create the Product object, and print it as a JSON string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Avoid getting blocked
&lt;/h2&gt;

&lt;p&gt;Now that we are able to extract the product data we want, we have to be careful not to get blocked. &lt;/p&gt;

&lt;p&gt;For various reasons, there are sometimes anti-bot mechanisms implemented on websites. The most obvious reason to protect sites from bots is to prevent heavy automated traffic to impact a website’s performance (and you must be careful with concurrent requests, by adding delays...). Another reason is to stop bad behavior from bots like spam.&lt;/p&gt;

&lt;p&gt;There are various protection mechanisms. Sometime your bot will be blocked if it does too many requests per second/hour/ day. Sometimes there is a rate limit on how many requests per IP address. The most difficult protection is when there is a user behavior analysis. For example, the website could analyze the time between requests, if the same IP is making requests concurrently.&lt;/p&gt;

&lt;p&gt;The easiest solution to hide our scrapers is to use proxies. In combination with random user-agent, using a proxy is a powerful method to hide our scrapers, and scrape rate-limited web pages. Of course, it’s better not be blocked in the first place, but sometimes websites allow only a certain amount of request per day/hour.&lt;/p&gt;

&lt;p&gt;In these cases, you should use a proxy. There are lots of free proxy list, I don’t recommend using these because there are often slow, unreliable, and websites offering these lists are not always transparent about where these proxies are located. Sometimes the public proxy list is operated by a legit company, offering premium proxies, and sometimes not... &lt;/p&gt;

&lt;p&gt;What I recommend is using a paid proxy service, or you could build your own.&lt;/p&gt;

&lt;p&gt;Setting a proxy to HtmlUnit is easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ProxyConfig&lt;/span&gt; &lt;span class="n"&gt;proxyConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"host"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myPort&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyConfig&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;As you can see, thanks to Schema.org data, extracting product data is much easier now than it was ten years ago. &lt;/p&gt;

&lt;p&gt;But there are still challenges such as handling websites that haven't implemented Schema, handling IP blocking and rate limits, rendering Javascript... &lt;/p&gt;

&lt;p&gt;That is exactly why we've been working with my partner Pierre on a &lt;a href="https://www.scrapingbee.com"&gt;Web Scraping API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ScrapingBee is an API to extract any HTML from any website without having to deal with proxies, CAPTCHAs and headless browsers. A single API call, with only the product URL you to want to extract data from. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed this post, as always you can find the full code in this Github repository: &lt;a href="https://github.com/ksahin/introWebScraping"&gt;https://github.com/ksahin/introWebScraping&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Introduction to Chrome Headless</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Fri, 18 Jan 2019 09:45:11 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</link>
      <guid>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</guid>
      <description>&lt;p&gt;In the previous articles, I introduce you to two different tools to perform web scraping with Java. &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;HtmlUnit&lt;/a&gt; in the first article, and &lt;a href="https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8"&gt;PhantomJS&lt;/a&gt; in the article about handling Javascript heavy website. &lt;/p&gt;

&lt;p&gt;This time we are going to introduce a new feature from Chrome, the &lt;strong&gt;&lt;em&gt;headless&lt;/em&gt;&lt;/strong&gt; mode. There was a rumor going around, that Google used a special version of Chrome for their crawling needs. I don't know if this is true, but Google launched the headless mode for Chrome with Chrome 59 several months ago. &lt;/p&gt;

&lt;p&gt;PhantomJS was the leader in this space, it was (and still is) heavy used for browser automation and testing. After hearing the news about Headless Chrome, the PhantomJS maintainer said that he was stepping down as maintainer because of I quote &lt;em&gt;"Google Chrome is faster and more stable than PhantomJS [...]"&lt;/em&gt;&lt;br&gt;
It looks like Chrome headless is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites. &lt;/p&gt;

&lt;p&gt;HtmlUnit, PhantomJS, and the other headless browsers are very useful tools, the problem is they are not as stable as Chrome, and sometimes you will encounter Javascript errors that would not have happened with Chrome. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google Chrome &amp;gt; 59&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.google.com/a/chromium.org/chromedriver/downloads"&gt;Chromedriver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Selenium &lt;/li&gt;
&lt;li&gt;In your &lt;strong&gt;&lt;em&gt;pom.xml&lt;/em&gt;&lt;/strong&gt; add a recent version of Selenium :
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.seleniumhq.selenium&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;selenium-java&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;3.8.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If you don't have Google Chrome installed, you can download it &lt;a href="https://www.google.com/chrome/browser/desktop/index.html"&gt;here&lt;/a&gt;&lt;br&gt;
To install Chromedriver you can use brew on MacOS :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or download it using the link below. &lt;br&gt;
There are a lot of versions, I suggest you to use the last version of Chrome and chromedriver.&lt;/p&gt;
&lt;h3&gt;
  
  
  Let's log into Hacker News
&lt;/h3&gt;

&lt;p&gt;In this part, we are going to log into Hacker News, and take a screenshot once logged in. We don't need Chrome headless for this task, but the goal of this article is only to show you how to run headless Chrome with Selenium.&lt;/p&gt;

&lt;p&gt;The first thing we have to do is to create a WebDriver object, and set the chromedriver path and some arguments :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Init chromedriver&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/Path/To/Chromedriver"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 option is needed on Windows systems, according to the [documentation](https://developers.google.com/web/updates/2017/04/headless-chrome)
Chromedriver should automatically find the Google Chrome executable path, if you have a special installation, or if you want to use a different version of Chrome, you can do it with :



```java
options.setBinary("/Path/to/specific/version/of/Google Chrome");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you want to learn more about the different options, here is the &lt;a href="https://sites.google.com/a/chromium.org/chromedriver/capabilities"&gt;Chromedriver documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to perform a GET request to the Hacker News login form, select the username and password field, fill it with our credentials and click on the login button. Then we have to check for a credential error, and if we are logged in, we can take a screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have done this in a previous article, here is the full code :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChromeHeadlessTest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/your/chromedriver/path"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
       &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
       &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--silent"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Get the login page&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login?goto=news"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Search for username / password input and fill the inputs&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@name='acct']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Locate the login button and click on it&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@value='login']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCurrentUrl&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Incorrect credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Take a screenshot of the current page&lt;/span&gt;
        &lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;TakesScreenshot&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getScreenshotAs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutputType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;FileUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copyFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"screenshot.png"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;// Logout&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"logout"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should now have a nice screenshot of the Hacker News homepage while being authenticated. As you can see Chrome headless is really easy to use, it is not that different from PhantomJS since we are using Selenium to run it. &lt;/p&gt;

&lt;p&gt;If you enjoyed this do not hesitate to subscribe to our newsletter!&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired taking care of proxies, JS rendering and captchas, you can check our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;, the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;As usual, the code is available in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;Github repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Log in to Almost Any Websites</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 02 Jan 2019 09:52:27 +0000</pubDate>
      <link>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</link>
      <guid>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</guid>
      <description>&lt;p&gt;In the first &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;article about java web scraping&lt;/a&gt; I showed how to extract data from CraigList website. &lt;br&gt;
But what about the data you want or if the action you want to carry out on a website requires authentication ?&lt;/p&gt;

&lt;p&gt;In this short tutorial I will show you how to make a generic method that can handle most authentication forms. &lt;/p&gt;
&lt;h3&gt;
  
  
  Authentication mechanism
&lt;/h3&gt;

&lt;p&gt;There are many different authentication mechanisms, the most frequent being a login form , sometimes with a &lt;a href="http://https://en.wikipedia.org/wiki/Cross-site_request_forgery#Forging_login_requests"&gt;CSRF token&lt;/a&gt; as a hidden input. &lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form, and submit it. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Hacker News Authentication
&lt;/h3&gt;

&lt;p&gt;Let's say you want to create a bot that logs into hacker news (to submit a link or perform an action that requires being authenticated) :&lt;/p&gt;

&lt;p&gt;Here is the login form and the associated DOM : &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can implement the login algorithm&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="nf"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;FailingHttpStatusCodeException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MalformedURLException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;//The first preceding input that is not hidden&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputLogin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//preceding::input[not(@type='hidden')]"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;inputLogin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;//get the enclosing form&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlForm&lt;/span&gt; &lt;span class="n"&gt;loginForm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEnclosingForm&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;//submit the form&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginForm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;//returns the cookie filled client :)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then the main method, which :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;calls &lt;code&gt;autoLogin&lt;/code&gt; with the right parameters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;code&gt;https://news.ycombinator.com&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check the logout link presence to verify we're logged&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prints the cookie to the console&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://news.ycombinator.com"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login?goto=news"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"login"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting autoLogin on "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//a[@href='user?id=%s']"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;){&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="c1"&gt;// printing the cookies&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Cookie&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCookieManager&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getCookies&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
                    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Wrong credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;Github repo&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;There are many cases where this method will not work : Amazon, DropBox... and all other two-steps/captcha protected login forms. &lt;/p&gt;

&lt;p&gt;Things that can be improved with this code : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Handle the check for the logout link inside &lt;code&gt;autoLogin&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check for &lt;code&gt;null&lt;/code&gt; inputs/form and throw an appropriate exception&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a next post I will show you how to deal with captchas or virtual numeric keyboards with OCR and captchas breaking APIs !&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired taking care of proxies, JS rendering and captchas, you can check our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;, the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>An Automatic Bill Downloader in Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 12 Dec 2018 10:03:07 +0000</pubDate>
      <link>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</link>
      <guid>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</guid>
      <description>&lt;p&gt;In this article I am going to show how to download bills (or any other file ) from a website with HtmlUnit.&lt;/p&gt;

&lt;p&gt;I suggest you to read these articles first : &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;Introduction to web scraping with Java&lt;/a&gt; and &lt;a href="https://www.scrapingbee.com/blog/how-to-log-in-to-almost-any-websites/" rel="noopener noreferrer"&gt;Autologin&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I am hosting this blog on &lt;a href="https://m.do.co/c/0e940b26444e" rel="noopener noreferrer"&gt;Digital Ocean&lt;/a&gt; (10$ in credit if you sign up via this link), I will show how to write a bot to automatically download every bills you have. &lt;/p&gt;

&lt;h3&gt;
  
  
  Login
&lt;/h3&gt;

&lt;p&gt;To submit the login form without needing to inspect the dom, we will use the magic method I wrote in the previous article. &lt;/p&gt;

&lt;p&gt;Then we have to go to the bill page : &lt;code&gt;https://cloud.digitalocean.com/settings/billing&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://cloud.digitalocean.com"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"email"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Authenticator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://cloud.digitalocean.com/settings/billing"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You need to sign in for access to this page"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Exception&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error during login on %s , check your credentials"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetching the bills
&lt;/h3&gt;

&lt;p&gt;Let's create a new Class called Bill or Invoice to represent a bill : &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Bill.java&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//... getters &amp;amp; setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to inspect the dom to see how we can extract the description, amount, date and URL of each bill. Open your favorite tool :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are lucky here, it's a clean DOM, with a nice and well structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HtmlTable&lt;/code&gt; to store the table and iterate on each rows&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;getCell&lt;/code&gt; to select the cells&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, using the Jackson library we will export the Bill objects to JSON and print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//table[@class='listing Billing--history']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTableRow&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBodies&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getRows&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;

    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// We only want the invoice row, not the payment one&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Invoice"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleDateFormat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MMMM d, yyyy"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Locale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ENGLISH&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getFirstChild&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="n"&gt;bill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;bills&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's almost finished, the last thing is to download the invoice. It's pretty easy, we will use the &lt;code&gt;Page&lt;/code&gt;object to store the pdf, and call a &lt;code&gt;getContentAsStream&lt;/code&gt;on it. It's better to check if the file has the right content type when doing this (&lt;code&gt;application/pdf&lt;/code&gt; in our case)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Page&lt;/span&gt; &lt;span class="n"&gt;invoicePdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentType&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"application/pdf"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
    &lt;span class="nc"&gt;IOUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentAsStream&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DigitalOcean"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".pdf"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, here is the ouput :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for December 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1451602800000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for November 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1448924400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1446332400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for April 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1430431200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for March 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1427839200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for February 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1425164400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for January 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1422745200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2014"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1414796400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual you can find the full code on this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired taking care of proxies, JS rendering and captchas, you can check our new &lt;a href="https://www.scrapingbee.com" rel="noopener noreferrer"&gt;web scraping API&lt;/a&gt;, the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
