<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Sahin</title>
    <description>The latest articles on DEV Community by Kevin Sahin (@kevinsahin).</description>
    <link>https://dev.to/kevinsahin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F90430%2F6df895e9-2f67-4707-bbb4-f5f0eea7e321.png</url>
      <title>DEV Community: Kevin Sahin</title>
      <link>https://dev.to/kevinsahin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevinsahin"/>
    <language>en</language>
    <item>
      <title>Easy Web Scraping With Scrapy</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 18 Dec 2019 16:22:13 +0000</pubDate>
      <link>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</link>
      <guid>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</guid>
      <description>&lt;p&gt;In the previous post about &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;Web Scraping with Python&lt;/a&gt; we talked a bit about Scrapy. In this post we are going to dig a little bit deeper into it. &lt;/p&gt;

&lt;p&gt;Scrapy is a wonderful open source Python web scraping framework. It handles the most common use cases when doing web scraping at scale: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multithreading&lt;/li&gt;
&lt;li&gt;Crawling (going from link to link)&lt;/li&gt;
&lt;li&gt;Extracting the data&lt;/li&gt;
&lt;li&gt;Validating&lt;/li&gt;
&lt;li&gt;Saving to different formats / databases&lt;/li&gt;
&lt;li&gt;Many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main difference between Scrapy and other commonly used libraries like Requests / BeautifulSoup is that it is opinionated. It allows you to solve the usual web scraping problems in an elegant way. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep: there is a lot to learn, but that is what we are here for :)&lt;/p&gt;

&lt;p&gt;In this tutorial we will create two different web scrapers, a simple one that will extract data from an E-commerce product page, and a more "complex" one that will scrape an entire E-commerce catalog!&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic overview
&lt;/h2&gt;

&lt;p&gt;You can install Scrapy using &lt;a href="https://pypi.org/project/pip/" rel="noopener noreferrer"&gt;pip&lt;/a&gt;. Be careful though: the Scrapy documentation strongly suggests installing it in a dedicated virtual environment in order to avoid conflicts with your system packages. &lt;/p&gt;

&lt;p&gt;I'm using Virtualenv and Virtualenvwrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkvirtualenv scrapy_env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now create a new Scrapy project with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject product_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create all the necessary boilerplate files for the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── product_scraper
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a brief overview of these files and folders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;items.py&lt;/em&gt;&lt;/strong&gt; is a model for the extracted data. You can define a custom model (like a Product) that inherits from the Scrapy Item class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;middlewares.py&lt;/em&gt;&lt;/strong&gt; contains middlewares used to change the request / response lifecycle. For example you could create a middleware to rotate user-agents, or to use an API like ScrapingBee instead of doing the requests yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;pipelines.py&lt;/em&gt;&lt;/strong&gt; In Scrapy, pipelines are used to process the extracted data: clean the HTML, validate the data, and export it to a custom format or save it to a database. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;/spiders&lt;/em&gt;&lt;/strong&gt; is a folder containing Spider classes. With Scrapy, Spiders are classes that define how a website should be scraped, including what links to follow and how to extract the data from those links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;scrapy.cfg&lt;/em&gt;&lt;/strong&gt; is a configuration file to change some settings&lt;/li&gt;
&lt;/ul&gt;
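&lt;p&gt;To make the pipeline idea concrete, here is a minimal sketch of what could go into pipelines.py (the class name and cleaning logic are hypothetical, not part of the generated boilerplate):&lt;/p&gt;

```python
# Hypothetical pipeline sketch: normalize the scraped price string.
# A Scrapy item pipeline is simply a class exposing process_item(self, item, spider).
class PriceToFloatPipeline:
    def process_item(self, item, spider):
        price = item.get("price")
        if price is not None:
            # Turn a string like '20.00$' into the float 20.0
            item["price"] = float(price.replace("$", "").strip())
        return item
```

&lt;p&gt;You would then enable it through the ITEM_PIPELINES setting in settings.py.&lt;/p&gt;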

&lt;h2&gt;
  
  
  Scraping a single product
&lt;/h2&gt;

&lt;p&gt;In this example we are going to scrape a single product from a dummy E-commerce website. Here is the product we are going to scrape: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are going to extract the product name, picture, price and description.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scrapy Shell
&lt;/h2&gt;

&lt;p&gt;Scrapy comes with a built-in shell that helps you try and debug your scraping code in real time. You can quickly test your XPath expressions / CSS selectors with it. It's a very cool tool to write your web scrapers and I always use it!&lt;/p&gt;

&lt;p&gt;You can configure Scrapy Shell to use another console, like IPython, instead of the default Python console. You will get autocompletion and other nice perks like colorized output. &lt;/p&gt;

&lt;p&gt;In order to use IPython in your Scrapy shell, you need to add this line to your scrapy.cfg file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shell &lt;span class="o"&gt;=&lt;/span&gt; ipython
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's configured, you can start using scrapy shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;scrapy shell &lt;span class="nt"&gt;--nolog&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Available Scrapy objects:
&lt;span class="o"&gt;[&lt;/span&gt;s]   scrapy     scrapy module &lt;span class="o"&gt;(&lt;/span&gt;contains scrapy.Request, scrapy.Selector, etc&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   crawler    &amp;lt;scrapy.crawler.Crawler object at 0x108147eb8&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   item       &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   settings   &amp;lt;scrapy.settings.Settings object at 0x108d10978&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Useful shortcuts:
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;url[, &lt;span class="nv"&gt;redirect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True]&lt;span class="o"&gt;)&lt;/span&gt; Fetch URL and update &lt;span class="nb"&gt;local &lt;/span&gt;objects &lt;span class="o"&gt;(&lt;/span&gt;by default, redirects are followed&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;req&lt;span class="o"&gt;)&lt;/span&gt;                  Fetch a scrapy.Request and update &lt;span class="nb"&gt;local &lt;/span&gt;objects
&lt;span class="o"&gt;[&lt;/span&gt;s]   shelp&lt;span class="o"&gt;()&lt;/span&gt;           Shell &lt;span class="nb"&gt;help&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;print this &lt;span class="nb"&gt;help&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;    View response &lt;span class="k"&gt;in &lt;/span&gt;a browser
In &lt;span class="o"&gt;[&lt;/span&gt;1]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can fetch a URL simply by typing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fetch&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start by fetching the /robots.txt file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;404&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/robots.txt&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case there isn't any robots.txt, that's why we got a 404 HTTP code. If there was a robots.txt, Scrapy would follow its rules by default. &lt;/p&gt;

&lt;p&gt;You can disable this behavior by changing this setting in settings.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ROBOTSTXT_OBEY = True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you should have a log like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;200&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now see your response object, response headers, and try different XPath expression / CSS selectors to extract the data you want. &lt;/p&gt;

&lt;p&gt;You can see the response directly in your browser with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the page may render badly inside your browser, for lots of different reasons: CORS issues, JavaScript code that didn't execute, or relative URLs for assets that won't work locally. &lt;/p&gt;

&lt;p&gt;The Scrapy shell is like a regular Python shell, so don't hesitate to load your favorite scripts/functions into it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Data
&lt;/h3&gt;

&lt;p&gt;Scrapy doesn't execute any JavaScript by default, so if the website you are trying to scrape uses a frontend framework like Angular / React.js, you could have trouble accessing the data you want. &lt;/p&gt;

&lt;p&gt;Now let's try some XPath expressions to extract the product title and price:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to extract the price, we are going to use an &lt;a href="https://dev.to/blog/practical-xpath-for-web-scraping/"&gt;XPath expression&lt;/a&gt;: we're selecting the span inside the div with the class &lt;code&gt;my-4&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;16]: response.xpath&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"//div[@class='my-4']/span/text()"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[16]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I could also use a CSS selector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;21]: response.css&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.my-4 span::text'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[21]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;With Scrapy, Spiders are classes where you define your crawling (what links / URLs need to be scraped) and scraping (what to extract) behavior. &lt;/p&gt;

&lt;p&gt;Here are the different steps used by a spider to scrape a website:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It starts by looking at the class attribute &lt;code&gt;start_urls&lt;/code&gt;, and calls these URLs with the start_requests() method. You can override this method if you need to change the HTTP verb or add some parameters to the request (for example, sending a POST request instead of a GET).&lt;/li&gt;
&lt;li&gt;It will then generate a Request object for each URL, and send the response to the callback function parse().&lt;/li&gt;
&lt;li&gt;The parse() method will then extract the data (in our case, the product price, image, description, title) and return either a dictionary, an Item object, a Request, or an iterable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;You may wonder why the parse method can return so many different objects. It's for flexibility. Let's say you want to scrape an E-commerce website that doesn't have any sitemap. You could start by scraping the product categories, so this would be a first parse method.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This method would then yield a Request object for each product category to a new callback method, parse2().&lt;/em&gt;&lt;br&gt;
&lt;em&gt;For each category you would also need to handle pagination. Then, for each product, a third parse function would do the actual scraping and generate an Item.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With Scrapy you can return the scraped data as a simple Python dictionary, but it is a good idea to use the built-in Scrapy &lt;strong&gt;&lt;em&gt;Item&lt;/em&gt;&lt;/strong&gt; class. &lt;br&gt;
It's a simple container for our scraped data, and Scrapy will look at this item's fields for many things, like exporting the data to different formats (JSON / CSV...), the item pipeline, etc. &lt;/p&gt;

&lt;p&gt;So here is a basic Product class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;product_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;img_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can generate a spider, either with the command line helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider myspider mydomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can do it manually and put your Spider's code inside the /spiders directory. &lt;/p&gt;

&lt;p&gt;There are different types of Spiders in Scrapy to solve the most common web scraping use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Spider&lt;/code&gt; that we will use. It takes a start_urls list and scrapes each one with a &lt;code&gt;parse&lt;/code&gt; method. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CrawlSpider&lt;/code&gt; follows links defined by a set of rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SitemapSpider&lt;/code&gt; extracts URLs defined in a sitemap&lt;/li&gt;
&lt;li&gt;Many more
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- coding: utf-8 -*-
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EcomSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ecom_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-slider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]//img/@src&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this &lt;strong&gt;&lt;em&gt;EcomSpider&lt;/em&gt;&lt;/strong&gt; class, there are two required attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt; which is our Spider's name (that you can run using &lt;code&gt;scrapy crawl ecom_spider&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt; &lt;code&gt;start_urls&lt;/code&gt; which is the list of starting URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;allowed_domains&lt;/code&gt; attribute is optional but important when you use a CrawlSpider that could follow links on different domains. &lt;/p&gt;

&lt;p&gt;Then I've just populated the Product fields by using XPath expressions to extract the data I wanted, as we saw earlier, and returned the item. &lt;/p&gt;

&lt;p&gt;You can run this code as follows to export the result into JSON (you could also export to CSV):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy runspider ecom_spider.py &lt;span class="nt"&gt;-o&lt;/span&gt; product.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should then get a nice JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"product_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20.00$"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Taba Cream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"img_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/images/products/product-2.png"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Item loaders
&lt;/h4&gt;

&lt;p&gt;There are two common problems that you can face while extracting data from the Web: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the same website, the page layout and underlying HTML can be different. If you scrape an E-commerce website, you will often have a regular price and a discounted price, with different XPath / CSS selectors. &lt;/li&gt;
&lt;li&gt;The data can be dirty and need some kind of post-processing; again for an E-commerce website it could be the way the prices are displayed, for example ($1.00, $1, $1,00).&lt;/li&gt;
&lt;/ul&gt;
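&lt;p&gt;As an illustration of the second problem, a small post-processing helper could look like this (a hypothetical sketch that only handles the three formats listed above):&lt;/p&gt;

```python
def normalize_price(raw):
    # Turn '$1.00', '$1' or '$1,00' into the float 1.0
    cleaned = raw.replace("$", "").strip()
    # Treat a comma used as a decimal separator like a dot
    cleaned = cleaned.replace(",", ".")
    return float(cleaned)

print(normalize_price("$1,00"))  # 1.0
```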

&lt;p&gt;Scrapy comes with a built-in solution for this, &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html" rel="noopener noreferrer"&gt;ItemLoaders&lt;/a&gt;. &lt;br&gt;
It's an interesting way to populate our Product object. &lt;/p&gt;

&lt;p&gt;You can add several XPath expressions to the same Item field, and it will try them sequentially. By default, if several matches are found, it will load all of them into a list. &lt;/p&gt;

&lt;p&gt;You can find many examples of input and output processors in the &lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;It's really useful when you need to transform/clean the data you extract. &lt;br&gt;
For example: extracting the currency from a price, transforming a unit into another one (centimeters to meters, degrees Celsius to Fahrenheit)... &lt;/p&gt;

&lt;p&gt;In our webpage we can find the product title with different XPath expressions: &lt;code&gt;//title&lt;/code&gt; and &lt;code&gt;//section[1]//h2/text()&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Here is how you could use an ItemLoader in this case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally you only want the first matching XPath, so you will need to add &lt;code&gt;output_processor=TakeFirst()&lt;/code&gt; to your item's field constructor. &lt;/p&gt;

&lt;p&gt;In our case we only want the first matching XPath for each field, so a better approach would be to create our own ItemLoader and declare a default output_processor to take the first matching XPath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ItemLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader.processors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Join&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;default_output_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price_in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also added a &lt;code&gt;price_in&lt;/code&gt; input processor to remove the dollar sign from the price. I'm using &lt;code&gt;MapCompose&lt;/code&gt;, a built-in processor that takes one or several functions to be executed sequentially; you can chain as many functions as you like. The convention is to add &lt;code&gt;_in&lt;/code&gt; or &lt;code&gt;_out&lt;/code&gt; to your Item field's name to attach an input or output processor to it. &lt;/p&gt;
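&lt;p&gt;To make the processor idea concrete, here is a standalone sketch that mimics what &lt;code&gt;MapCompose&lt;/code&gt; does (an illustration only, not Scrapy's actual implementation): each extracted value is passed through the given functions in order, and returning &lt;code&gt;None&lt;/code&gt; drops the value:&lt;/p&gt;

```python
# Illustration of MapCompose's behavior (the real one ships with Scrapy):
# every value is piped through the functions in order; None drops the value.
def map_compose(*functions):
    def process(values):
        results = []
        for value in values:
            for fn in functions:
                value = fn(value)
                if value is None:
                    break
            else:
                results.append(value)
        return results
    return process

def remove_dollar_sign(value):
    return value.replace('$', '')

def to_float(value):
    return float(value)

# Chaining two functions, like MapCompose(remove_dollar_sign, to_float)
price_in = map_compose(remove_dollar_sign, to_float)
print(price_in(['$19.99', '$5.00']))  # [19.99, 5.0]
```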

&lt;p&gt;There are many more processors; you can learn more about them in the &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html#input-and-output-processors" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping multiple pages
&lt;/h2&gt;

&lt;p&gt;Now that we know how to scrape a single page, it's time to learn how to scrape multiple pages, like the entire product catalog. &lt;br&gt;
As we saw earlier there are different kinds of Spiders. &lt;/p&gt;

&lt;p&gt;When you want to scrape an entire product catalog, the first thing you should look for is a sitemap. Sitemaps are built exactly for this: to show web crawlers how the website is structured. &lt;/p&gt;

&lt;p&gt;Most of the time you can find one at &lt;code&gt;base_url/sitemap.xml&lt;/code&gt;. Parsing a sitemap can be tricky, and again, Scrapy is here to help you with this. &lt;/p&gt;

&lt;p&gt;In our case, you can find the sitemap here: &lt;a href="https://clever-lichterman-044f16.netlify.com/sitemap.xml" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look inside the sitemap, there are many URLs that we are not interested in, like the home page, blog posts, etc.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/blog/post-1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
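&lt;p&gt;To see what Scrapy saves us from, here is a sketch of parsing such a sitemap by hand with Python's standard library (using a small in-memory sitemap as a stand-in for the real downloaded file):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

# Build a tiny two-entry sitemap in memory, standing in for a downloaded one
urlset = ET.Element('{%s}urlset' % NS)
for page in ('https://example.com/products/item-1/',
             'https://example.com/blog/post-1/'):
    url = ET.SubElement(urlset, '{%s}url' % NS)
    ET.SubElement(url, '{%s}loc' % NS).text = page

# The tricky part when doing it by hand: every tag lives in the sitemap
# namespace, so a bare iter('loc') finds nothing; tags must be qualified.
locs = [el.text for el in urlset.iter('{%s}loc' % NS)]
product_urls = [u for u in locs if '/products/' in u]
print(product_urls)  # ['https://example.com/products/item-1/']
```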



&lt;p&gt;Fortunately, we can filter the URLs to parse only those that match a certain pattern. It's really easy: here we only want URLs that&lt;br&gt;
have &lt;code&gt;/products/&lt;/code&gt; in them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sitemap_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... scrape product ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this spider as follows to scrape all the products and export the results to a CSV file:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scrapy runspider sitemap_spider.py -o output.csv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Now what if the website didn't have any sitemap? Once again, Scrapy has a solution for this! &lt;/p&gt;

&lt;p&gt;Let me introduce you to the... &lt;code&gt;CrawlSpider&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The CrawlSpider will crawl the target website starting from a &lt;code&gt;start_urls&lt;/code&gt; list. Then, for each URL, it will extract all the links matching a list of &lt;code&gt;Rule&lt;/code&gt; objects. &lt;br&gt;
In our case it's easy: products share the same URL pattern &lt;code&gt;/products/product_title&lt;/code&gt;, so we only need to filter on these URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.productloader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawl_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="c1"&gt;# .. parse product 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, all these built-in Spiders are really easy to use. It would have been much more complex to do it from scratch. &lt;/p&gt;

&lt;p&gt;With Scrapy you don't have to think about the crawling logic, like adding new URLs to a queue, keeping track of already parsed URLs, multi-threading... &lt;/p&gt;
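&lt;p&gt;As a rough idea of the bookkeeping involved, a naive crawler has to maintain a queue of URLs to visit and a set of URLs already seen. This sketch is purely illustrative; &lt;code&gt;extract_links&lt;/code&gt; is a stand-in for downloading a page and extracting its links:&lt;/p&gt;

```python
from collections import deque

# Naive sketch of the crawl bookkeeping Scrapy handles for you: a FIFO
# queue of URLs to visit plus a set of already-seen URLs.
def crawl(start_url, extract_links, max_pages=100):
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for a real website
graph = {'/': ['/products/', '/blog/'],
         '/products/': ['/', '/products/taba-cream.1/'],
         '/blog/': [],
         '/products/taba-cream.1/': ['/products/']}
print(crawl('/', graph.get))
# ['/', '/products/', '/blog/', '/products/taba-cream.1/']
```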

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we saw a general overview of how to scrape the web with Scrapy and how it can solve your most common web scraping challenges. Of course we only touched the surface and there are many more interesting things to explore, like middlewares, exporters, extensions, pipelines! &lt;/p&gt;

&lt;p&gt;If you've been doing web scraping more "manually" with tools like BeautifulSoup / Requests, it's easy to understand how Scrapy can help save time and build more maintainable scrapers. &lt;/p&gt;

&lt;p&gt;I hope you liked this Scrapy tutorial and that it will motivate you to experiment with it. &lt;/p&gt;

&lt;p&gt;For further reading don't hesitate to look at the great &lt;a href="https://docs.scrapy.org/en/latest/index.html" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;You can also check out our &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;web scraping with Python&lt;/a&gt; tutorial to learn more about web scraping. &lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Practical XPath for Web Scraping</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Thu, 07 Nov 2019 10:41:17 +0000</pubDate>
      <link>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</link>
      <guid>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</guid>
      <description>&lt;p&gt;XPath is a technology that uses path expressions to select nodes or node- sets in an XML document (or in our case an HTML document). Even if XPath is not a programming language in itself, it allows you to write expressions that can access directly to a specific HTML element without having to go through the entire HTML tree.&lt;/p&gt;

&lt;p&gt;It looks like the perfect tool for web scraping right? At &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt; we love XPath!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why learn XPath
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page.&lt;/li&gt;
&lt;li&gt;  It's more powerful than CSS selectors&lt;/li&gt;
&lt;li&gt;  It allows you to navigate the DOM in any direction&lt;/li&gt;
&lt;li&gt;  Can match text inside HTML elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Entire books have been written on XPath, and I don't pretend to explain everything in depth; this is an introduction to XPath, and we will see through real examples how you can use it for your web scraping needs.&lt;/p&gt;

&lt;p&gt;But first, let's talk a little about the DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Object Model
&lt;/h2&gt;

&lt;p&gt;I am going to assume you already know HTML, so this is just a small reminder.&lt;/p&gt;

&lt;p&gt;As you already know, a web page is a document containing text within tags, that add meaning to the document by describing elements like titles, paragraphs, lists, links etc.&lt;/p&gt;

&lt;p&gt;Let's see a basic HTML page, to understand what the Document Object Model is.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" alt="" width="880" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This HTML code is basically HTML content encapsulated inside other HTML content. The HTML hierarchy can be viewed as a tree. We can already see this hierarchy through the indentation in the HTML code.&lt;/p&gt;

&lt;p&gt;When your web browser parses this code, it will create a tree which is an object representation of the HTML document. It is called the Document Object Model.&lt;/p&gt;

&lt;p&gt;Below is the internal tree structure inside the Google Chrome inspector:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" alt="" width="880" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left we can see the HTML tree, and on the right we have the Javascript object representing the currently selected element (in this case, the &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; tag), with all its attributes.&lt;/p&gt;

&lt;p&gt;The important thing to remember is that &lt;strong&gt;the DOM you see in your browser, when you right click + inspect, can be really different from the actual HTML that was sent&lt;/strong&gt;. Maybe some Javascript code was executed and dynamically changed the DOM! For example, when you scroll on your Twitter account, a request is sent by your browser to fetch new tweets, and some Javascript code dynamically adds those new tweets to the DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath Syntax
&lt;/h2&gt;

&lt;p&gt;First let's look at some XPath vocabulary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In XPath terminology, as with HTML, there are different types of nodes: root nodes, element nodes, attribute nodes, and so-called atomic values, which is a synonym for text nodes in an HTML document.&lt;/li&gt;
&lt;li&gt;Each element node has one parent. In this example, the section element is the parent of p, details and button.&lt;/li&gt;
&lt;li&gt;Element nodes can have any number of children. In our example, the li elements are all children of the ul element.&lt;/li&gt;
&lt;li&gt;Siblings are nodes that have the same parent. p, details and button are siblings.&lt;/li&gt;
&lt;li&gt;Ancestors: a node's parent, its parent's parent, and so on.&lt;/li&gt;
&lt;li&gt;Descendants: a node's children, their children, and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are different types of expressions to select a node in an HTML document; here are the most important ones:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
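&lt;p&gt;Python's standard library implements a useful subset of XPath, which is enough to try the core expression types (&lt;code&gt;nodename&lt;/code&gt;, &lt;code&gt;//&lt;/code&gt;, &lt;code&gt;.&lt;/code&gt; and &lt;code&gt;@attribute&lt;/code&gt;) on a small hypothetical document:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Tiny in-memory document (hypothetical markup) to exercise each expression
html = ET.Element('html')
body = ET.SubElement(html, 'body')
div = ET.SubElement(body, 'div')
link = ET.SubElement(div, 'a', href='https://example.com')
link.text = 'a link'

# nodename  -- selects matching children:        'body/div'
assert html.find('body/div') is div
# //        -- selects descendants at any depth: './/a'
assert html.find('.//a') is link
# .         -- the current node itself
assert div.find('.') is div
# @attr     -- matches/reads an attribute:       './/a[@href]'
assert html.find('.//a[@href]').get('href') == 'https://example.com'
```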


&lt;p&gt;You can also use &lt;strong&gt;predicates&lt;/strong&gt; to find a node that contains a specific value. Predicates are always in square brackets: &lt;code&gt;[predicate]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
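&lt;p&gt;Attribute and positional predicates can also be tried with the standard library (again, the markup here is hypothetical):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Tiny in-memory document to run predicates against
html = ET.Element('html')
body = ET.SubElement(html, 'body')
div = ET.SubElement(body, 'div', id='main')
p1 = ET.SubElement(div, 'p', attrib={'class': 'intro'})
p1.text = 'Hello'
p2 = ET.SubElement(div, 'p')
p2.text = 'World'

# //div[@id='main']    -- any div whose id attribute equals 'main'
assert html.find(".//div[@id='main']") is div
# //p[@class='intro']  -- p elements whose class is 'intro'
assert html.findall(".//p[@class='intro']") == [p1]
# //div/p[1]           -- the first p child of each div
assert html.find(".//div/p[1]").text == 'Hello'
```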


&lt;p&gt;Now we will see some examples of XPath expressions. We can test XPath expressions inside Chrome DevTools, so it is time to fire up Chrome.&lt;/p&gt;

&lt;p&gt;To do so, right-click on the web page -&amp;gt; inspect, then &lt;code&gt;cmd + f&lt;/code&gt; on a Mac or &lt;code&gt;ctrl + f&lt;/code&gt; on other systems; you can then enter an XPath expression, and matches will be highlighted in the DevTools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" alt="" width="544" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip
&lt;/h3&gt;

&lt;p&gt;In the dev tools, you can right-click on any DOM node and show its full XPath expression, which you can later simplify.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath with Python
&lt;/h2&gt;

&lt;p&gt;There are many Python packages that allow you to use XPath expressions to select HTML elements like lxml, Scrapy or Selenium. In these examples, we are going to use Selenium with Chrome in headless mode. You can look at this article to set up your environment: &lt;a href="https://www.scrapingbee.com/blog/scraping-single-page-applications"&gt;Scraping Single Page Application with Python&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  E-commerce product data extraction
&lt;/h3&gt;

&lt;p&gt;In this example, we are going to see how to extract E-commerce product data from Ebay.com with XPath expressions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" alt="" width="880" height="614"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In these three XPath expressions, we use &lt;code&gt;//&lt;/code&gt; as an axis, meaning we select nodes anywhere in the HTML tree. Then we use a predicate &lt;code&gt;[predicate]&lt;/code&gt; to match specific IDs. IDs are supposed to be unique, so it's not a problem to do this.&lt;/p&gt;

&lt;p&gt;But when you select an element by its class name, it's better to use a relative path, because the same class name can appear anywhere in the DOM; the more specific you are, the better. Not only that, but when the website changes (and it will), your code will be much more resilient.&lt;/p&gt;
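&lt;p&gt;The difference is easy to demonstrate with the standard library's XPath subset (the markup and IDs here are hypothetical): matching on a class name anywhere in the tree is ambiguous, while anchoring on a unique ID and then using a relative path is not:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Hypothetical page where the class 'price' appears both in a sidebar and
# inside the product card, so matching on class alone is ambiguous.
body = ET.Element('body')
sidebar = ET.SubElement(body, 'div', attrib={'class': 'sidebar'})
ET.SubElement(sidebar, 'span', attrib={'class': 'price'}).text = '$9.99'
card = ET.SubElement(body, 'div', id='product-card')
ET.SubElement(card, 'span', attrib={'class': 'price'}).text = '$49.00'

# Matching anywhere on the class name finds both prices -- fragile
assert len(body.findall(".//span[@class='price']")) == 2

# Anchoring on a unique id, then using a relative path, is unambiguous
card_node = body.find(".//div[@id='product-card']")
assert card_node.find(".//span[@class='price']").text == '$49.00'
```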

&lt;h3&gt;
  
  
  Automagically authenticate to a website
&lt;/h3&gt;

&lt;p&gt;When you have to perform the same action on many websites, or extract the same type of information, you can be a little smarter with your XPath expressions and create generic ones instead of a specific XPath for each website.&lt;/p&gt;

&lt;p&gt;In order to explain this, we're going to write a "generic" authentication function that takes a login URL, a username and a password, and tries to authenticate on the target website.&lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form and click on the submit button.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most login forms will have an &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag. So we can select this password input with a simple: &lt;code&gt;//input[@type='password']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once we have this password input, we can use a &lt;strong&gt;relative path&lt;/strong&gt; to select the username/email input. It will generally be the first preceding input &lt;strong&gt;that isn't hidden:&lt;/strong&gt; &lt;code&gt;.//preceding::input[not(@type='hidden')]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It's really important to exclude hidden inputs, because most of the time you will have at least one &lt;a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery"&gt;CSRF token&lt;/a&gt; hidden input. CSRF stands for Cross-Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks.&lt;/p&gt;

&lt;p&gt;Now we need to select the enclosing form from one of the inputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//ancestor::form&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And with the form, we can select the submit input/button:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//*[@type='submit']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of such a function:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
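&lt;p&gt;As a sketch, such a function could look like this. It assumes a Selenium-style &lt;code&gt;driver&lt;/code&gt; object (the &lt;code&gt;find_element('xpath', ...)&lt;/code&gt; call style follows Selenium 4), and simply chains the XPath expressions from the steps above:&lt;/p&gt;

```python
# Hypothetical generic login helper. `driver` is assumed to be a Selenium
# WebDriver (or any object with the same interface); selenium itself is not
# imported here, so treat this as a sketch rather than a drop-in solution.
def auto_login(driver, login_url, username, password):
    driver.get(login_url)
    # 1. The password field anchors everything: almost every login form has one
    password_input = driver.find_element('xpath', "//input[@type='password']")
    # 2. The username/email field: first preceding input that isn't hidden
    #    (skipping hidden inputs avoids CSRF token fields)
    username_input = password_input.find_element(
        'xpath', ".//preceding::input[not(@type='hidden')]")
    username_input.send_keys(username)
    password_input.send_keys(password)
    # 3. Climb to the enclosing form and click its submit control
    form = password_input.find_element('xpath', './/ancestor::form')
    form.find_element('xpath', ".//*[@type='submit']").click()
```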


&lt;p&gt;Of course it is far from perfect; it won't work everywhere, but you get the idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.&lt;/p&gt;

&lt;p&gt;One of the most difficult tasks when writing XPath expressions is not the expression itself, but making it precise enough to select the right element, while remaining resilient enough to survive DOM changes.&lt;/p&gt;

&lt;p&gt;At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our &lt;a href="https://www.scrapingbee.com/api-store"&gt;ready-made APIs&lt;/a&gt;. We will discuss the differences between the two in another blog post!&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this article, next time we will talk about ... CSS selectors :)&lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

&lt;p&gt;Discuss on HN: &lt;a href="https://news.ycombinator.com/item?id=21452310"&gt;https://news.ycombinator.com/item?id=21452310&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>webscraping</category>
      <category>xpath</category>
    </item>
    <item>
      <title>12 months, 3 products, some MRR, and one (irrigation) pivot</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Mon, 07 Oct 2019 09:35:33 +0000</pubDate>
      <link>https://dev.to/kevinsahin/12-months-3-products-some-mrr-and-one-irrigation-pivot-542f</link>
      <guid>https://dev.to/kevinsahin/12-months-3-products-some-mrr-and-one-irrigation-pivot-542f</guid>
      <description>&lt;p&gt;My partner Pierre and I have been working and talking about different side projects/startups for over 5 years. Two years ago we released our first product to the public but it was one year ago that we decided to go full time on the indie hacker road. In this post, I’m going to explain our journey, our backgrounds and how we did it after many failed attempts.&lt;/p&gt;

&lt;p&gt;This post is not about some magic product we launched in 2 days while getting 10k signups and reaching $20k MRR in one month, working 4 hours a week in Hawaii. This post is about the small wins and losses we had during our first year in the indie hacker world, and the things we wish we knew before starting.&lt;/p&gt;

&lt;p&gt;This post is about three products, one irrigation pivot, one startup pivot, and of course, some MRR.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(disclaimer: ScrapingBee was initially launched as ScrapingNinja, but due to some copyright issues we had to quickly rebrand it. We'll talk about it in a future blog post.)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;It started when we were both employed in different startups as software developers. We had lots of ideas and we loved to build side-projects for fun.&lt;/p&gt;

&lt;p&gt;Pierre and I were doing lots of web scraping in our jobs. I worked at a Fintech startup called Fiduceo, which was acquired by a big French bank, where we were doing bank account aggregation, like &lt;a href="http://mint.com"&gt;Mint.com&lt;/a&gt; in the US. I was leading a small team handling the web scraping code and infrastructure. &lt;/p&gt;

&lt;p&gt;Pierre worked in the US and then came back to France to work at the biggest French real-estate data provider as a data engineer. Part of his job was to find, gather, extract and load new data sets from the web.&lt;/p&gt;

&lt;p&gt;So we both had experience with Web Scraping and data at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our first project: ShopToList
&lt;/h3&gt;

&lt;p&gt;One of the first “mini-successes” we had was Shoptolist.com, a B2C website/browser extension: a universal wishlist that sends you alerts if it sees any price drop. It was really just a fun side project that was never meant to be more.&lt;/p&gt;

&lt;p&gt;It allowed us to try many different things and to discover that acquisition is really, really, really hard.&lt;/p&gt;

&lt;p&gt;We quickly reached 1000 users just by submitting our product to frugal/fashion subreddits. We were very happy about it because it was just an experiment. Every day, a script scraped each product in our database to update its price, and we sent emails in case of a price drop. &lt;/p&gt;

&lt;p&gt;The links in the email were affiliate links, so we took a small percentage if the user ended up buying the product.&lt;br&gt;
In theory, this model works great, but in practice here is what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out of 1000 emails sent, about 20–30% were opened&lt;/li&gt;
&lt;li&gt;2% clicked on the product links that were on sale&lt;/li&gt;
&lt;li&gt;Of that 2%, only 5–10% bought the product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The percentage we earned was very small, 0.5–5% depending on the niche, so this business model only works with millions of users.&lt;br&gt;
And this is where we hit a wall: we did not manage to create sustainable growth. We tested many things, content marketing, affiliation, some paid advertising, but nothing worked. &lt;/p&gt;

&lt;p&gt;And since it was just a little side project that only took us two weeks to build, we were ok with that.&lt;/p&gt;

&lt;p&gt;For us, this was a valuable learning experience, because this was the first project we shipped to actual users, and we learned a lot.&lt;/p&gt;

&lt;p&gt;By digging into the database we noticed that a few users had thousands of products saved inside ShopToList. It seemed strange; unless they were crazy impulsive buyers, something was off, since the majority of users had around 20 products saved on average…&lt;/p&gt;

&lt;p&gt;So after a little “investigation”, we discovered that these users were E-commerce owners who were “spying” on their competitors…&lt;/p&gt;

&lt;h2&gt;
  
  
  First pivot: PricingBot
&lt;/h2&gt;

&lt;p&gt;We assumed that those users were doing this to receive alerts when their competitors changed their products' prices. &lt;/p&gt;

&lt;p&gt;There were many solutions on the web that allowed this, but ShopToList let them monitor thousands of products for free, while other solutions were quite expensive.&lt;/p&gt;

&lt;p&gt;We did some market research and discovered that many tools offered competitor product monitoring; however, all those tools seemed either really difficult to use or really expensive.&lt;/p&gt;

&lt;p&gt;Because we felt we could do better, the PricingBot idea was born. Pierre quit his job and we both decided to commit to it full-time. The side-project era was over 😎.&lt;/p&gt;

&lt;p&gt;We made a landing page explaining our value proposition, nothing fancy but something clear and nice enough so people could trust us, and got 60 signups from different E-commerce owners in different niches.&lt;/p&gt;

&lt;p&gt;While technically challenging, extracting E-commerce product data was something we knew how to do thanks to Shoptolist, so building the MVP was pretty quick.&lt;/p&gt;

&lt;p&gt;We launched our beta on ProductHunt in November 2018 and it was a big success, followed by a big crash, the classic startup trough of sorrow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--znxdQ_Sq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3sq87catntzinqpmnca8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--znxdQ_Sq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3sq87catntzinqpmnca8.jpg" alt="Product Hunt Launch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You had to upload a CSV file with your product catalog, and for each product match it with a competitor product URL. &lt;/p&gt;

&lt;p&gt;That's fine for a few dozen products, but people often had hundreds or thousands of products in their online store... &lt;/p&gt;

&lt;p&gt;So with this feedback, we built integrations with popular E-commerce platforms like Shopify and WooCommerce to let people import their catalog in one click. &lt;/p&gt;

&lt;p&gt;Our activation &lt;strong&gt;tripled 🎉&lt;/strong&gt;. We were very happy about how things were going; however, one thing to note is that until this point the product was completely free and we had not asked people for money.&lt;/p&gt;

&lt;p&gt;At this point in time, here are a few numbers that made us happy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We managed to get around 200 signups with $0 spent&lt;/li&gt;
&lt;li&gt;20 users seemed to use the product and had their account fully set up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What could go wrong, right?&lt;/p&gt;

&lt;p&gt;We decided to close the beta and started asking our users to pay for our software, with a classic three-plan SaaS model: $29/$99/$299 per month based on volume. &lt;/p&gt;

&lt;p&gt;The first day was magical: literally seconds after we sent the email announcing the end of the beta, we got our first customer on the $29 plan 🚀&lt;/p&gt;

&lt;p&gt;We also managed to sign up a user for the $299 plan soon after, but for him we had to manually set up his account and match 1,000 products across 10 websites. It took a long time, but we felt it was worth it. &lt;/p&gt;

&lt;p&gt;We were wrong! Just before renewing, he churned, telling us PricingBot was very good but not useful enough for him. We were sad and angry, mostly at ourselves, but decided to move forward and continue.&lt;/p&gt;

&lt;p&gt;It seemed we were on a good path and just needed to go all-in on marketing. And that's what we did: content marketing, cold outreach, affiliate marketing, SEO, you name it! &lt;/p&gt;

&lt;p&gt;But before diving into this, let's talk again about our activation. &lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #1: bad metrics lead to bad conclusions, bad conclusions lead to bad decisions... (in Yoda's voice)
&lt;/h3&gt;

&lt;p&gt;When we first decided to monitor our activation rate, we assumed that a user was activated when they did two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added at least one of their own products (or linked their store with our built-in integration)&lt;/li&gt;
&lt;li&gt;Added at least one of their competitors' products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so, with that definition, around 10% of our users were "activated". Considering that at the time most of our users came from ProductHunt, and that hunters are known to sign up for products they don't plan to use just for the sake of it, we were happy with these numbers.&lt;/p&gt;

&lt;p&gt;But we were wrong.&lt;/p&gt;

&lt;p&gt;This definition meant that someone who owned a Shopify store with 4,000 products and added only one competitor's product counted as activated. It was silly: someone who adds a single competitor's product against a 4,000-product catalog won't use PricingBot for price monitoring, and surely won't pay for it. We learned this the hard way.&lt;/p&gt;

&lt;p&gt;Because soon after that first paying customer, nobody followed. Literally nobody. At first we did not understand; then it seemed obvious: out of 200 signups we had 20 active users, and out of 20 active users we had 1 paying customer. So the only solution, we thought, was to get more signups.&lt;/p&gt;

&lt;p&gt;This was another mistake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #2: Thinking our only problem was acquisition
&lt;/h3&gt;

&lt;p&gt;We thought we only needed more users and just went full marketing. Because we did not know the e-commerce community very well, we had some trouble getting started, but we eventually managed to write some pieces of content that were shared in relevant Facebook/Reddit/LinkedIn groups and brought in a few leads.&lt;/p&gt;

&lt;p&gt;We also did some paid advertising and cold outreach but it failed miserably.&lt;/p&gt;

&lt;p&gt;One month later, we had to face the obvious: we were not on the right path.&lt;/p&gt;

&lt;p&gt;Our leads used the product but did not pay, and even if every lead we brought in had paid, it would not have been sustainable.&lt;/p&gt;

&lt;p&gt;At this point we finally decided to better understand why users weren't using our product more, and through feedback requests and a lot of analytics insights we discovered two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For most of our users, PricingBot was a nice-to-have, not something worth paying for&lt;/li&gt;
&lt;li&gt;Most of our users didn't want to do the setup because it was too tedious, but they also didn't want to pay us to do it for them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next thing we knew, we had revamped our whole onboarding process and tried to automate as much as possible. But it still wasn't working.&lt;/p&gt;

&lt;p&gt;When you, as an e-commerce owner, want to monitor your competitors, you first have to link your products with theirs, and this was the hard part: it meant roughly 1 hour of work per 100 products to match. That was way too much time for a solo e-commerce owner with a 10k-product catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fear, Uncertainty and Doubt
&lt;/h2&gt;

&lt;p&gt;To help you understand how we felt at that point in time let me just recap the timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;January 2018: 📣  we launch ShopToList&lt;/li&gt;
&lt;li&gt;July 2018: 🚀  Pierre quit his job and we decide to build PricingBot&lt;/li&gt;
&lt;li&gt;October 2018: 🤖  After a busy summer and 1 month of code we launch the MVP in beta&lt;/li&gt;
&lt;li&gt;January 2019: 💵  First paying customer&lt;/li&gt;
&lt;li&gt;February-March 2019: Acquisition, product dev&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Back in May 2019 we had kind of hit a wall: nothing we did really worked, and it was hard to stay motivated. The only silver lining was that we ranked well on Google, so we got around 3 new signups every day without any acquisition effort. &lt;/p&gt;

&lt;p&gt;But we still did not manage to make them pay. And we still did not manage to make them configure their account.&lt;/p&gt;

&lt;p&gt;This period was hard because it was full of negativity. My cofounder and I both knew we were not moving forward, and while this did not degrade our working relationship, it certainly degraded our productivity.&lt;/p&gt;

&lt;p&gt;We both felt that no matter what we did, we were unable to move any needle that could meaningfully boost our business.&lt;/p&gt;

&lt;p&gt;We improved the product a lot and managed to gather some signups along the way, but it was not enough. Here is a look at our revenue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LBrB-gAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ur8gi2vcwopujo5rr1zl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LBrB-gAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ur8gi2vcwopujo5rr1zl.jpg" alt="PricingBot MRR"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  One agricultural pivot to build, one startup pivot to make
&lt;/h2&gt;

&lt;p&gt;By mid-June 2019 things were not looking good: we had only 3 months left to launch a successful business. Back in 2018 we had given ourselves 1 year to launch something that works, 1 year to reach "&lt;a href="http://www.paulgraham.com/ramenprofitable.html"&gt;ramen profitability&lt;/a&gt;" 🍲.  &lt;/p&gt;

&lt;p&gt;We had a long talk at the beginning of June and agreed that we needed to step back, and that we had 3 options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continue with PricingBot, hoping some magic happens and we cross $4k MRR in 3 months&lt;/li&gt;
&lt;li&gt;Close the company and each go our own way&lt;/li&gt;
&lt;li&gt;Build something else&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1 was hard because we were both fed up with the product; everything we did seemed useless, and it was not working. Option 2 needed to be addressed, but although PricingBot was not a success, we felt that working together worked really well (on the human side of things) and that it would be a pity to give up. So we chose option 3, both very happy with the outcome of that talk and full of energy. We only needed one thing: to choose what we would build.&lt;/p&gt;

&lt;p&gt;We also decided to do something we should have done earlier: we sold ShopToList. The whole deal was done in less than a month thanks to &lt;a href="http://1kprojects.com"&gt;1kprojects.com&lt;/a&gt;, and it brought some welcome money into our company bank account.&lt;/p&gt;

&lt;p&gt;At the same time, Pierre's father-in-law, a farmer in the south of France, called him because he needed help assembling an irrigation pivot. The June heatwave was expected to be harsh (and guess what, it was), so it was an urgent job. We decided this was a good opportunity to take a break, each think on our own about the future product, and come back full of ideas and motivation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E1TEz3-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vf7zgxg815v7vbkgujw7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E1TEz3-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vf7zgxg815v7vbkgujw7.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was kind of ironic: this pivot kind of funded our pivot.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: if you ever need to buy an irrigation pivot, Pierre strongly suggests you look into Valley pivots. (PS: this post was not sponsored by Valley in any way.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ScrapingBee
&lt;/h2&gt;

&lt;p&gt;Two weeks later, we each came back with a bullet list of product ideas: some good, some bad, some crazy, some boring, some exciting... well, you get the idea, both lists were diverse. However, we quickly agreed on one idea, because it really stood out from the others. Let me explain.&lt;/p&gt;

&lt;p&gt;While working on ShopToList and PricingBot, and also in our previous jobs, there were three things we always needed for our web scraping infrastructure: transforming websites into structured APIs, running headless browsers at scale, and managing a pool of proxies. &lt;/p&gt;

&lt;p&gt;When you extract data from lots of different websites, you always have to deal with JavaScript-heavy websites and single-page applications, and you don't really have a choice other than running headless browsers to render all this JavaScript.&lt;/p&gt;

&lt;p&gt;Running a headless browser like Chrome is really painful because the same things that happen on your desktop (high memory usage, a poorly coded single-page application eating 100% of your CPU) will happen on your servers. So doing this on your own is not only painful but very expensive when you don't know what you are doing. &lt;/p&gt;

&lt;p&gt;When doing web scraping at scale, you often have to use proxies, for different reasons. The website your bot is visiting may show different information based on your location, for example a price in euros in the eurozone and a price in dollars in the US. &lt;/p&gt;

&lt;p&gt;Dealing with proxies is painful too. There are lots of shady companies selling bad-quality proxies, so you either have to run your own proxies or test dozens of proxy companies to make sure your proxy pool is always up. &lt;/p&gt;

&lt;p&gt;We used to solve all these problems with APIs that were either inefficient or crazy expensive. These were problems we had solved multiple times across our projects, so we thought about packaging that into an API and leveraging our experience to build all kinds of web scraping APIs. &lt;/p&gt;

&lt;p&gt;This time, we decided to do things right and to try to avoid the mistakes we had made with PricingBot while creating &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake avoided #1: creating a product you won't use
&lt;/h3&gt;

&lt;p&gt;One of the biggest problems we had with PricingBot was finding where our potential users gathered online: what groups they followed, what blogs they read, what influencers they listened to. The reason was simple: having never worked with or in the e-commerce industry, except for some freelancing gigs, the whole landscape was unknown to us. &lt;/p&gt;

&lt;p&gt;With ScrapingBee we would be our own users, and that changed everything. I know this advice is not new, but it is usually framed as a way to build a better product, and sure, being one of your own users helps you build a better product.&lt;/p&gt;

&lt;p&gt;But for us, the game-changing fact was that being our own users meant we knew exactly where to find and how to reach potential leads.&lt;/p&gt;

&lt;p&gt;Pierre and I have also been running our own blogs for quite some time, and last year I wrote a book dedicated to web scraping in Java. This directly translated into 20k monthly visits we could leverage to promote ScrapingBee.&lt;/p&gt;

&lt;p&gt;And it worked. In about 2 months, we reached 150 beta signups, 4 times the amount of beta testers we had for PricingBot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMIZ6uco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p9s1otkh79ge4142dkv4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMIZ6uco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p9s1otkh79ge4142dkv4.jpg" alt="Ship By Product Hunt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SY-A7Eon--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/z9d17x9rf6to4nsw398s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SY-A7Eon--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/z9d17x9rf6to4nsw398s.png" alt="Newsletter subscribers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake avoided #2: spending too much money
&lt;/h3&gt;

&lt;p&gt;While building PricingBot, we spent a lot of money on useless infrastructure, APIs and software without reaching Product-Market Fit.  &lt;/p&gt;

&lt;p&gt;We basically managed to get our money back thanks to the ShopToList sale and Pierre's agricultural skills before we launched ScrapingBee, and this time we were way more careful about how we spent it.&lt;/p&gt;

&lt;p&gt;I know spending several thousand dollars to bootstrap a project is not a lot of money but we weren't comfortable with spending more, so we decided to be careful with how we would spend it with ScrapingBee.&lt;/p&gt;

&lt;p&gt;We reduced our costs by hunting for deals (&amp;lt;3 AWS credits) like &lt;a href="https://www.joinsecret.com/"&gt;Secret&lt;/a&gt;, which basically gives you 6 months free, or huge discounts, on lots of SaaS products. &lt;/p&gt;

&lt;p&gt;We decided to do more with what we had and so far we don't regret it.&lt;/p&gt;

&lt;p&gt;I'll talk more about the products and tools we used in a future blog post, this one is already long enough. &lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Launch 🚀 and mistake avoided #3: not asking for money from day 1
&lt;/h2&gt;

&lt;p&gt;One thing that did not work well with PricingBot is that for months we built a product that was free to use. I know this is a classic mistake, but that is not the worst part; the worst part is that we knew it was a mistake. Over the last 4 years we've read tons of books, interviews, and blog posts about startups, and everyone seems to agree that the sooner you ask for money, the better.&lt;/p&gt;

&lt;p&gt;But it was easier said than done and we did not dare ask for money while building PricingBot because we did not think anyone would pay for an unfinished product.&lt;/p&gt;

&lt;p&gt;We did ask with ScrapingBee. The pricing is again a classic three-plan SaaS model based on API call volume and features, at $9 / $29 / $99 per month, plus an Enterprise plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bFt2-Zzw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d75t54lb3dpcuahxgfg6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bFt2-Zzw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d75t54lb3dpcuahxgfg6.jpg" alt="Pricing Table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We "soft-launched" first to our mailing list and got our first few small paying customers. Again, we had the same experience with PricingBot but this time it was different. With PricingBot, every paying customer we had was really hard to get, we had sent them tons and tons of email and they took a long time to finally pay.&lt;/p&gt;

&lt;p&gt;With ScrapingBee it was different: our first 2 customers had never talked with us before. &lt;/p&gt;

&lt;p&gt;We then started to blog and got tons of leads and a few more paying customers, including a big Enterprise plan, as you can see in the MRR chart below. &lt;/p&gt;

&lt;p&gt;Then it all went quickly. Since Pierre and I had both blogged about programming before, creating insightful content about web scraping was not a problem for us, and we knew how and where to promote it.&lt;/p&gt;

&lt;p&gt;One particular piece of content we wrote, a &lt;a href="https://www.scrapingbee.com/blog/web-scraping-without-getting-blocked"&gt;web scraping guide&lt;/a&gt; completely exceeded our expectations. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CVt5YrgR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ec55uvbwtbmviucdwjne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CVt5YrgR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ec55uvbwtbmviucdwjne.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post alone brought us, in two months, three times the traffic PricingBot got in a whole year. It brought not just traffic but also customers paying real $. It also landed us our first big Enterprise plan, which allowed us to reach and cross $1,000 MRR.&lt;/p&gt;

&lt;h2&gt;
  
  
  The future
&lt;/h2&gt;

&lt;p&gt;Of course, it's really too early to say whether &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt; will be a success or not.&lt;/p&gt;

&lt;p&gt;The big Enterprise customer we got thanks to the success of our first blog post may just be an outlier that won't repeat in the future. But one thing is certain: things are looking way better with ScrapingBee.&lt;/p&gt;

&lt;p&gt;We have lots of engagement from our users and leads, with a trial-to-paid conversion rate close to 5%. &lt;/p&gt;

&lt;p&gt;We also love talking with our potential customers (❤️  Zoom), and we have the feeling that ScrapingBee really is a must-have for them, instead of a "nice-to-have". (Small tip: we quintupled the free plan for users who agreed to a short 15-minute talk with us; this has already given us 40 real conversations with real people about ScrapingBee.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XMXcWcQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cc0n6bbi5qvod3gfrvdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XMXcWcQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cc0n6bbi5qvod3gfrvdn.png" alt="CTA phone call"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the months to come, a big challenge will be finding profitable and scalable acquisition channels. We hope that content marketing will continue to work and will improve our SEO enough to bring organic traffic. But writing good content may not be enough, and we really have to discover other acquisition channels.&lt;/p&gt;

&lt;p&gt;The other big challenge is prioritizing features in the API store: figuring out what users &lt;strong&gt;need&lt;/strong&gt;, not blindly implementing what they want, and hopefully getting them to pay before a feature is implemented.&lt;/p&gt;

&lt;p&gt;We still don't know what we want to do with PricingBot. We're seriously thinking about selling it, but are a bit afraid of all the paperwork involved (it was much easier with ShopToList because it did not bring in any money, so no bank account, Stripe account, etc.).&lt;/p&gt;

&lt;p&gt;We also still have a lot to learn and a lot to prove before we can say that we've built a sustainable and profitable business, but we feel it can be done. Time will tell if we're right.&lt;/p&gt;

</description>
      <category>startup</category>
      <category>maker</category>
    </item>
    <item>
      <title>Serverless Web Scraping With Aws Lambda and Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 04 Sep 2019 09:36:20 +0000</pubDate>
      <link>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</link>
      <guid>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</guid>
      <description>&lt;p&gt;Serverless is a term referring to the execution of code inside ephemeral containers (Function As A Service, or FaaS). It is a hot topic in 2019, after the “micro-service” hype, here come the “nano-services”!&lt;/p&gt;

&lt;p&gt;Cloud functions can be triggered by different things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An HTTP call to a REST API&lt;/li&gt;
&lt;li&gt;A job in a message queue&lt;/li&gt;
&lt;li&gt;A log event&lt;/li&gt;
&lt;li&gt;An IoT event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud functions are a really good fit for web scraping tasks, for many reasons. Web scraping is I/O-bound: most of the time is spent waiting for HTTP responses, so we don’t need high-end CPU servers. Cloud functions are cheap (the first 1M requests are free, then $0.20 per million requests) and easy to set up. They are also a good fit for parallel scraping: we can spawn hundreds or thousands of functions at the same time for large-scale scraping.&lt;/p&gt;
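&lt;p&gt;To put that pricing in perspective, here is a rough back-of-the-envelope sketch of the &lt;em&gt;request&lt;/em&gt; charge for a scraping job (an illustration using only the figures above; real bills also include a duration-based compute charge, which this sketch ignores):&lt;/p&gt;

```java
// Rough estimate of the AWS Lambda request charge for a scraping job,
// using the figures above: first 1M requests free, then $0.20 per million.
// Duration-based compute charges are deliberately ignored in this sketch.
public class LambdaRequestCost {
    static final long FREE_TIER = 1_000_000L;
    static final double PRICE_PER_MILLION = 0.20;

    static double requestCost(long requests) {
        // Only requests above the free tier are billed.
        long billable = Math.max(0L, requests - FREE_TIER);
        return billable / 1_000_000.0 * PRICE_PER_MILLION;
    }

    public static void main(String[] args) {
        // Scraping 5 million pages, one page per invocation:
        System.out.println(requestCost(5_000_000L)); // 4M billable requests -> $0.80
    }
}
```

&lt;p&gt;So even a 5-million-page crawl costs under a dollar in request charges, which is why parallel scraping on cloud functions is so attractive.&lt;/p&gt;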

&lt;p&gt;In this introduction, we are going to see how to deploy a slightly modified version of the Craigslist scraper we made on a previous &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;blogpost&lt;/a&gt; on AWS Lambda using the serverless framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;We are going to use the &lt;a href="https://serverless.com/"&gt;Serverless&lt;/a&gt; framework to build and deploy our project to AWS Lambda. The Serverless CLI can generate lots of boilerplate code in different languages and deploy it to different cloud providers, like AWS, Google Cloud, or Azure. You will need: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Node and npm&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://serverless.com/framework/docs/providers/aws/guide/quick-start/"&gt;Serverless CLI&lt;/a&gt; and Setup your &lt;a href="https://serverless.com/framework/docs/providers/aws/guide/credentials/"&gt;AWS credentials&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Java 8 &lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;We will build an API using API Gateway, with a single endpoint &lt;code&gt;/items/{query}&lt;/code&gt; bound to a Lambda function that will respond with a JSON array of all the items (on the first result page) for this query.&lt;/p&gt;

&lt;p&gt;Here is a simple diagram for this architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the Maven project
&lt;/h2&gt;

&lt;p&gt;Serverless is able to generate projects in lots of different languages: Java, Python, NodeJS, Scala... &lt;br&gt;
We are going to use one of these templates to generate a Maven project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;serverless create &lt;span class="nt"&gt;--template&lt;/span&gt; aws-java-maven &lt;span class="nt"&gt;--name&lt;/span&gt; items-api &lt;span class="nt"&gt;-p&lt;/span&gt; aws-java-scraper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can now open this Maven project in your favorite IDE. &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The first thing to do is to change the &lt;strong&gt;&lt;em&gt;serverless.yml&lt;/em&gt;&lt;/strong&gt; config to declare an API Gateway route and bind it to the &lt;strong&gt;handleRequest&lt;/strong&gt; method of the &lt;strong&gt;&lt;em&gt;Handler.java&lt;/em&gt;&lt;/strong&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;craigslist-scraper-api&lt;/span&gt; 
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;java8&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;artifact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target/hello-dev.jar&lt;/span&gt;

&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;getCraigsListItems&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.serverless.Handler&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/items/{searchQuery}&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I also set the timeout to 30 seconds. The default timeout with the Serverless framework is 6 seconds, but since we're running Java code, the &lt;a href="https://serverless.com/blog/keep-your-lambdas-warm/"&gt;Lambda cold start&lt;/a&gt; can take several seconds, and then we make an HTTP request to the Craigslist website, so 30 seconds seems reasonable. &lt;/p&gt;
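&lt;p&gt;With this configuration in place, deploying and calling the endpoint looks roughly like this (a sketch: the endpoint URL below is a placeholder, the real one is printed by &lt;code&gt;serverless deploy&lt;/code&gt;; &lt;code&gt;bicycle&lt;/code&gt; is just an example search query, and &lt;code&gt;us-east-1&lt;/code&gt; / &lt;code&gt;dev&lt;/code&gt; are the framework defaults):&lt;/p&gt;

```shell
# Build the jar (the template's default artifact is target/hello-dev.jar),
# then deploy the service to AWS; the deploy output prints the API Gateway endpoint.
mvn clean package
serverless deploy

# Call the endpoint with an example search query (placeholder URL).
curl https://XXXXXXXXXX.execute-api.us-east-1.amazonaws.com/dev/items/bicycle
```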

&lt;h2&gt;
  
  
  Function code
&lt;/h2&gt;

&lt;p&gt;Now we can modify &lt;strong&gt;&lt;em&gt;Handler.java&lt;/em&gt;&lt;/strong&gt;. The function logic is simple. First, we retrieve the path parameter called "searchQuery". Then we create a CraigsListScraper and call its &lt;strong&gt;&lt;em&gt;scrape()&lt;/em&gt;&lt;/strong&gt; method with this search query. It returns a &lt;code&gt;List&amp;lt;Item&amp;gt;&lt;/code&gt; representing all the items on the first Craigslist result page. &lt;/p&gt;

&lt;p&gt;We then use the &lt;code&gt;ApiGatewayResponse&lt;/code&gt; class that was generated by the Serverless framework to return a JSON array containing every item. &lt;/p&gt;

&lt;p&gt;You can find the rest of the code in &lt;a href="https://github.com/ksahin/serverless-scraping"&gt;this repository&lt;/a&gt;, with the &lt;code&gt;CraigsListScraper&lt;/code&gt; and &lt;code&gt;Item&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"received: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pathParameters"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"searchQuery"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrape&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
        &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error : "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;Response&lt;/span&gt; &lt;span class="n"&gt;responseBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error while processing URL: "&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responseBody&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; Serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can now build the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mvn clean install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And deploy it to AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serverless deploy
Serverless: Packaging service...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (13.35 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.................................
Serverless: Stack update finished...
Service Information
service: items-api
stage: dev
region: us-east-1
stack: items-api-dev
api keys:
  None
endpoints:
  GET - https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/{searchQuery}
functions:
  getCraigsListItems: items-api-dev-getCraigsListItems
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can then test your function using curl or your web browser with the URL given in the deployment logs (&lt;code&gt;serverless info&lt;/code&gt; will also show this information).&lt;/p&gt;

&lt;p&gt;Here is a query to look for "macBook pro":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/macBook%20pro | json_reformat                                                            1 ↵
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19834  100 19834    0     0   7623      0  0:00:02  0:00:02 &lt;span class="nt"&gt;--&lt;/span&gt;:--:--  7622
&lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"2010 15&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Macbook pro 3.06ghz 8gb 320gb osx maverick"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 325,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-306ghz-8gb-320gb/6680853189.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro A1502 13.3&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Late 2013 2.6GHz i5 8 GB 500GB + Extras"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 875,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-alateghz-i5/6688755497.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro Charger USB-C (Latest Model) w/ Box - Like New!"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 50,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-charger-usb/6686902986.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"MacBook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; C2D 4GB memory 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 250,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-13-c2d-4gb-memory/6688682499.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 2011 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 475,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro/6675556875.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Trackpad Touchpad Mouse with Cable and Screws for Apple MacBook Pro"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 39,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/trackpad-touchpad-mouse-with/6682812027.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; i5 very clean, excellent shape! 4GB RAM, 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 359,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/sfc/sys/d/macbook-pro-13-i5-very-clean/6686879047.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that the first invocation will be slow (a cold start); it took 7 seconds for me. Subsequent invocations will be much quicker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;This was just a small example; here are some ideas to improve it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better error handling&lt;/li&gt;
&lt;li&gt;Protect the API with an API Key (really easy to implement with API Gateway)&lt;/li&gt;
&lt;li&gt;Save the items to a DynamoDB database&lt;/li&gt;
&lt;li&gt;Send the search query to an SQS queue, and trigger the lambda execution with the queue instead of an HTTP request&lt;/li&gt;
&lt;li&gt;Send a notification with SNS if an item is listed below a certain price point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JavaScript rendering and CAPTCHAs, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;This is the end of this tutorial. I hope you enjoyed the post. Don't hesitate to experiment with Lambda and other cloud providers; it's really fun, easy, and can drastically reduce your infrastructure costs, especially for web scraping and other asynchronous tasks.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web Scraping 101 in Python</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 21 Aug 2019 10:24:15 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</link>
      <guid>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</guid>
      <description>&lt;p&gt;In this post, which can be read as a follow-up to our &lt;a href="https://www.daolf.com/posts/avoiding-being-blocked-while-scraping-ultimate-guide/"&gt;ultimate web scraping guide&lt;/a&gt;, we will cover almost all the tools Python offers you for web scraping. We will go from the most basic to the most advanced ones, and cover the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should be enough to give you a good idea of which tool does what, and when to use each.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: whenever I talk about Python in this post, I mean Python 3.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;0) Web Fundamentals
&lt;/li&gt;
&lt;li&gt;1) Manually opening a socket and sending the HTTP request
&lt;/li&gt;
&lt;li&gt;2) urllib3 &amp;amp; LXML
&lt;/li&gt;
&lt;li&gt;3) requests &amp;amp; BeautifulSoup
&lt;/li&gt;
&lt;li&gt;4) Scrapy
&lt;/li&gt;
&lt;li&gt;5) Selenium &amp;amp; Chrome —headless
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;span id="web-fondamentals"&gt; 0) Web Fundamentals &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;The internet is &lt;strong&gt;really complex&lt;/strong&gt;: there are many underlying technologies and concepts involved to view a simple web page in your browser. I don’t have the pretension to explain everything, but I will show you the most important things you have to understand in order to extract data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  HyperText Transfer Protocol
&lt;/h2&gt;

&lt;p&gt;HTTP uses a &lt;strong&gt;client/server&lt;/strong&gt; model, where an HTTP client (a browser, your Python program, curl, Requests...) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache...). &lt;/p&gt;

&lt;p&gt;Then the server answers with a response (The HTML code for example) and closes the connection. HTTP is called a stateless protocol, because each transaction (request/response) is independent. FTP for example, is stateful.&lt;/p&gt;

&lt;p&gt;Basically, when you type a website address in your browser, the HTTP request looks like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
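For instance, a minimal GET request for a /product/ page (the exact header values vary by client and are only illustrative here) has this shape:

```
GET /product/ HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
```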


&lt;p&gt;In the first line of this request, you can see multiple things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the GET verb or method being used, meaning we request data from the specific path &lt;code&gt;/product/&lt;/code&gt;. There are other HTTP verbs; you can see the full list &lt;a href="https://www.w3schools.com/tags/ref_httpmethods.asp"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The version of the HTTP protocol; in this tutorial we will focus on HTTP 1.&lt;/li&gt;
&lt;li&gt;Multiple headers fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the most important header fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; The domain name of the server. If no port number is given, it is assumed to be 80.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-Agent:&lt;/strong&gt; Contains information about the client originating the request, including the OS. In this case, it is my web browser (Chrome) on OSX. This header is important because it is used either for statistics (how many users visit my website on mobile vs. desktop) or to block bots. Because this header is sent by the client, it can be modified (this is called “header spoofing”), and that is exactly what we will do with our scrapers to make them look like a normal web browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept:&lt;/strong&gt; The content types that are acceptable as a response. There are lots of different content types and sub-types: &lt;strong&gt;text/plain, text/html, image/jpeg, application/json&lt;/strong&gt; ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie&lt;/strong&gt;: name1=value1;name2=value2... This header field contains a list of name-value pairs called cookies, which websites use to authenticate users and/or store data in your browser. For example, when you fill in a login form, the server checks whether the credentials you entered are correct; if so, it redirects you and injects a session cookie into your browser. Your browser then sends this cookie with every subsequent request to that server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referrer&lt;/strong&gt;: The Referrer header (spelled “Referer” in the HTTP specification) contains the URL from which the current URL was requested. Websites use this header to change their behavior based on where the user came from. For example, lots of news websites have a paid subscription and only let you view 10% of a post, but if the user comes from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the list goes on... You can find the full header list &lt;a href="https://en.wikipedia.org/wiki/List_of_HTTP_header_fields"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A server will respond with something like this: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
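For reference, a typical response (status line, headers, a blank line, then the body; the header values below are illustrative) looks like this:

```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
Set-Cookie: session_id=abc123; Path=/; HttpOnly

<!DOCTYPE html>
<html>
  ...
</html>
```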


&lt;p&gt;On the first line, we have a new piece of information, the HTTP code &lt;code&gt;200 OK&lt;/code&gt;. It means the request has succeeded. As for the request headers, there are lots of HTTP codes, split into four common classes, 2XX for successful requests, 3XX for redirects, 4XX for bad requests (the most famous being 404 Not found), and 5XX for server errors.&lt;/p&gt;

&lt;p&gt;Then, in case you are sending this HTTP request with your web browser, the browser will parse the HTML code, fetch any associated assets (JavaScript files, CSS files, images...) and render the result in the main window.&lt;/p&gt;

&lt;p&gt;In the next parts we will see the different ways to perform HTTP requests with Python and extract the data we want from the responses. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="socket"&gt; 1) Manually opening a socket and sending the HTTP request &lt;/span&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Socket
&lt;/h2&gt;

&lt;p&gt;The most basic way to perform an HTTP request in Python is to open a &lt;a href="https://docs.python.org/3/howto/sockets.html"&gt;socket&lt;/a&gt; and manually send the HTTP request.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
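As a rough sketch of this approach (the target host is just an example, and real-world code would need more error handling):

```python
import socket

def build_request(host, path="/"):
    # Craft a raw HTTP/1.1 GET request by hand
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()

def fetch(host, path="/"):
    # Open a TCP socket on port 80, send the request,
    # then read until the server closes the connection
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")

# Usage (requires network access):
# print(fetch("www.google.com").split("\r\n")[0])  # the status line
```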


&lt;p&gt;Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions. &lt;/p&gt;

&lt;h2&gt;
  
  
  Regular Expressions
&lt;/h2&gt;

&lt;p&gt;A regular expression (RE, or Regex) is a search pattern for strings. With regex, you can search for a particular character/word inside a bigger body of text.&lt;/p&gt;

&lt;p&gt;For example, you could identify all the phone numbers inside a web page. You could also replace all uppercase tags in poorly formatted HTML with lowercase ones, or validate inputs...&lt;/p&gt;

&lt;p&gt;The pattern used by the regex is applied from left to right, and each source character is only used once. You may be wondering why it is important to know about regular expressions when doing web scraping.&lt;/p&gt;

&lt;p&gt;After all, there are all kinds of Python modules to parse HTML, with XPath or CSS selectors.&lt;/p&gt;

&lt;p&gt;In an ideal &lt;a href="https://en.wikipedia.org/wiki/Semantic_Web"&gt;semantic world&lt;/a&gt;, data is easily machine-readable, and the information is embedded inside relevant HTML elements with meaningful attributes.&lt;/p&gt;

&lt;p&gt;But the real world is messy; you will often find huge amounts of text inside a &lt;code&gt;p&lt;/code&gt; element. When you want to extract specific data inside this text, for example a price, a date, or a name, you will have to use regular expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Here is a great website to test your regexes: &lt;a href="https://regex101.com/"&gt;https://regex101.com/&lt;/a&gt;, and &lt;a href="https://www.rexegg.com/"&gt;an awesome blog&lt;/a&gt; to learn more about them. This post will only cover a small fraction of what you can do with regexps.&lt;/p&gt;

&lt;p&gt;Regular expressions can be useful when you have this kind of data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;Price : 19.99$&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We could select this text node with an XPath expression, and then use this kind of regex to extract the price:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^Price\s:\s(\d+\.\d{2})\$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;To extract the text inside an HTML tag, it is annoying to use a regex, but doable:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
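Putting those two patterns together with Python's re module (the sample string is the snippet shown above; the anchor is dropped because the match starts mid-string):

```python
import re

html = "<p>Price : 19.99$</p>"  # the sample markup from above

# Grab the text content of the <p> tag...
text = re.search(r"<p>(.+?)</p>", html).group(1)

# ...then extract the price itself with the pattern shown earlier
price = re.match(r"Price\s:\s(\d+\.\d{2})\$", text).group(1)
print(price)  # -> 19.99
```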



&lt;p&gt;As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that can make this task easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="lxml"&gt; 2) urllib3 &amp;amp; LXML &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: It is easy to get lost in the urllib universe in Python. You have urllib and urllib2, which are part of the standard lib. You can also find urllib3. urllib2 was split into multiple modules in Python 3, and urllib3 won't become part of the standard lib anytime soon. This whole confusing situation deserves a blog post of its own. In this part, I've chosen to only talk about urllib3, as it is widely used in the Python world, by pip and requests to name just two.&lt;/p&gt;

&lt;p&gt;urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. It lets us do what we did above with a socket, in far fewer lines of code.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
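A minimal urllib3 version of our GET request might look like this (the timeout and retry values are arbitrary choices, not requirements):

```python
import urllib3

# A PoolManager handles connection pooling and thread safety for us
http = urllib3.PoolManager()

try:
    r = http.request(
        "GET",
        "https://www.google.com",
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    print(r.status)     # 200 if everything went well
    print(len(r.data))  # size of the HTML body in bytes
except urllib3.exceptions.HTTPError as exc:
    print("request failed:", exc)
```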


&lt;p&gt;Much more concise than the socket version. Not only that, but the API is straightforward and you can do many things easily, like adding HTTP headers, using a proxy, POSTing forms ... &lt;/p&gt;

&lt;p&gt;For example, had we decided to set some headers and use a proxy, we would only have to do this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
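For instance, something along these lines; the proxy address below is a placeholder, not a real server:

```python
import urllib3

# Custom headers to send with every request; values are illustrative
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)"}

# Route requests through a proxy (placeholder address)
proxy = urllib3.ProxyManager("http://203.0.113.10:3128", headers=headers)

# Requests made with this manager go via the proxy and carry our headers:
# r = proxy.request("GET", "https://www.google.com")
print(proxy.headers["User-Agent"])
```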


&lt;p&gt;See? Exactly the same number of lines. However, there are some things urllib3 does not handle very easily; for example, if we want to add a cookie, we have to manually create the corresponding header and add it to the request.&lt;/p&gt;

&lt;p&gt;There are also things urllib3 can do that requests can't: creation and management of connection pools and proxy pools, and control of the retry strategy, for example.&lt;/p&gt;

&lt;p&gt;To put it simply, urllib3 sits between requests and socket in terms of abstraction, although it's way closer to requests than to socket.&lt;/p&gt;

&lt;p&gt;This time, to parse the response, we are going to use the lxml package and XPath expressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath
&lt;/h2&gt;

&lt;p&gt;XPath is a technology that uses path expressions to select nodes or node-sets in an XML (or HTML) document. Like the Document Object Model, XPath has been a W3C standard since 1999. Even though XPath is not a programming language in itself, it allows you to write expressions that directly access a specific node or node-set without having to traverse the entire HTML (or XML) tree.&lt;/p&gt;

&lt;p&gt;Think of XPath as a regexp, but specifically for XML/HTML.&lt;/p&gt;

&lt;p&gt;To extract data from an HTML document with XPath we need 3 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an HTML document&lt;/li&gt;
&lt;li&gt;some XPath expressions&lt;/li&gt;
&lt;li&gt;an XPath engine that will run those expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To begin, we will use the HTML we got thanks to urllib3. We just want to extract all the links from the Google homepage, so we will use one simple XPath expression, &lt;code&gt;//a&lt;/code&gt;, and we will use lxml to run it. lxml is a fast, easy-to-use XML and HTML processing library that supports XPath. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Installation&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Below is the code that comes just after the previous snippet:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
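To keep the sketch self-contained, the snippet below runs the //a expression on a tiny hand-written page instead of the live Google homepage:

```python
from lxml import html

# A tiny stand-in for the page fetched with urllib3 above
page = html.fromstring(
    '<html><body><a href="/search">Search</a>'
    '<a href="/images">Images</a></body></html>'
)

# //a selects every <a> node in the document
links = page.xpath("//a")
for link in links:
    print(link.get("href"), link.text)
```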



&lt;p&gt;And the output should look like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have to keep in mind that this example is really simple and doesn't really show you how powerful XPath can be (note: this XPath expression should have been &lt;code&gt;//a/@href&lt;/code&gt; to avoid having to iterate over &lt;code&gt;links&lt;/code&gt; to get each &lt;code&gt;href&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If you want to learn more about XPath you can read &lt;a href="https://librarycarpentry.org/lc-webscraping/02-xpath/index.html"&gt;this good introduction&lt;/a&gt;. The LXML documentation is also &lt;a href="https://lxml.de/tutorial.html"&gt;well written and is a good starting point&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;XPath expressions, like regexps, are really powerful and one of the fastest ways to extract information from HTML. But like regexps, XPath can quickly become messy, hard to read and hard to maintain.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="requests"&gt; 3) requests &amp;amp; BeautifulSoup &lt;span&gt;
&lt;/span&gt;&lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" alt="" width="357" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/psf/requests"&gt;Requests&lt;/a&gt; is the king of python packages, with more than 11 000 000 downloads, it is the most widly used package for Python. &lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Making a request with Requests (no comment) is really easy: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
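A minimal example (the timeout value is an arbitrary choice):

```python
import requests

try:
    r = requests.get("https://www.google.com", timeout=10)
    print(r.status_code)                  # 200 if everything went well
    print(r.headers.get("Content-Type"))  # e.g. a text/html content type
except requests.RequestException as exc:
    print("request failed:", exc)
```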



&lt;p&gt;With Requests it is easy to perform POST requests, handle cookies, query parameters... &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication to Hacker News&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we want to create a tool to automatically submit our blog posts to Hacker News or any other forum, like Buffer. We would need to authenticate on those websites before posting our links. That's what we are going to do with Requests and BeautifulSoup!&lt;/p&gt;

&lt;p&gt;Here is the Hacker News login form and the associated DOM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" alt="" width="880" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; tags on this form: the first one is hidden, with the name "goto", and the other two are the username and password. &lt;/p&gt;

&lt;p&gt;If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect happens and a cookie is set. This cookie will be sent by Chrome with each subsequent request so that the server knows you are authenticated. &lt;/p&gt;

&lt;p&gt;Doing this with Requests is easy; it handles redirects automatically for us, and cookies can be handled with the &lt;em&gt;Session&lt;/em&gt; object. &lt;/p&gt;

&lt;p&gt;The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.&lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;So all we have to do is to POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
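A sketch of that flow; the field names acct and pw and the logout link are assumptions about the current form, so double-check them against the live page:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com"

def login(username, password):
    # A Session persists cookies, so the auth cookie set after the
    # POST is automatically sent with every subsequent request
    session = requests.Session()
    data = {"goto": "news", "acct": username, "pw": password}
    response = session.post(BASE_URL + "/login", data=data)
    # Look for an element that only shows up once logged in
    soup = BeautifulSoup(response.text, "html.parser")
    logged_in = soup.find("a", id="logout") is not None
    return session, logged_in

# Usage (requires real credentials and network access):
# session, ok = login("my_username", "my_password")
```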



&lt;p&gt;In order to learn more about BeautifulSoup, we could try to extract every link on the homepage. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;By the way, Hacker News offers a &lt;a href="https://github.com/HackerNews/API"&gt;powerful API&lt;/a&gt;, so we're doing this as an example, but you should use the API instead of scraping it!&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;The first thing we need to do is to inspect the Hacker News's home page to understand the structure and the different CSS classes that we will have to select:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that all posts are inside a &lt;code&gt;&amp;lt;tr class="athing"&amp;gt;&lt;/code&gt; tag, so the first thing we need to do is select all these tags. This can easily be done with: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;links = soup.findAll('tr', class_='athing')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then for each link, we will extract its id, title, url and rank:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
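To keep the example self-contained, the snippet below runs the same logic on a static sample modeled on that markup (the class names are assumptions based on the screenshot, not taken from the live site):

```python
from bs4 import BeautifulSoup

# A static sample modeled on the Hacker News markup above,
# so the parsing logic runs without hitting the live site
sample = """
<table>
  <tr class="athing" id="1001">
    <td><span class="rank">1.</span></td>
    <td class="title"><a class="storylink" href="https://example.com/a">First post</a></td>
  </tr>
  <tr class="athing" id="1002">
    <td><span class="rank">2.</span></td>
    <td class="title"><a class="storylink" href="https://example.com/b">Second post</a></td>
  </tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
items = []
for row in soup.find_all("tr", class_="athing"):
    link = row.find("a", class_="storylink")
    items.append({
        "id": row["id"],
        "rank": row.find(class_="rank").text,
        "title": link.text,
        "url": link["href"],
    })
print(items)
```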



&lt;p&gt;As you saw, Requests and BeautifulSoup are great libraries to extract data and automate different things by posting forms. If you want to do large-scale web scraping projects, you could still use Requests, but you would need to handle lots of things yourself. &lt;/p&gt;

&lt;p&gt;When you need to scrape a lot of webpages, there are many things you have to take care of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;finding a way of parallelizing your code to make it faster&lt;/li&gt;
&lt;li&gt;handling errors&lt;/li&gt;
&lt;li&gt;storing results&lt;/li&gt;
&lt;li&gt;filtering results&lt;/li&gt;
&lt;li&gt;throttling your requests so you don't overload the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately for us, tools exist that can handle those things for us.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="scrapy"&gt; 4) Scrapy &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" alt="" width="600" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is a powerful Python web scraping framework. It provides many features to download web pages asynchronously and to process and save them. It handles multithreading, crawling (the process of going from link to link to find every URL on a website), sitemap crawling, and much more. &lt;/p&gt;

&lt;p&gt;Scrapy also has an interactive mode, the Scrapy shell. With the Scrapy shell you can quickly test your scraping code, such as XPath expressions or CSS selectors. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep; there is a lot to learn. &lt;/p&gt;

&lt;p&gt;To follow up on our Hacker News example, we are going to write a Scrapy spider that scrapes the first 15 pages of results and saves everything to a CSV file. &lt;/p&gt;

&lt;p&gt;You can easily install Scrapy with pip: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then you can use the Scrapy CLI to generate the boilerplate code for our project: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject hacker_news_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Inside &lt;code&gt;hacker_news_scraper/spider&lt;/code&gt; we will create a new Python file with our spider's code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;There are a lot of conventions in Scrapy. Here we define an array of starting URLs, and the &lt;code&gt;name&lt;/code&gt; attribute will be used to call our spider from the Scrapy command line. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;parse&lt;/code&gt; method will be called on each URL in the &lt;code&gt;start_urls&lt;/code&gt; array.&lt;/p&gt;

&lt;p&gt;We then need to tune Scrapy a little bit in order for our Spider to behave nicely against the target website. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You should always turn this on: by analyzing response times and adapting the number of concurrent requests, it makes sure the target website is not slowed down by your spiders. &lt;/p&gt;

&lt;p&gt;You can run this code with the Scrapy CLI and choose among different output formats (CSV, JSON, XML...):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl hacker-news -o links.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And that's it! You will now have all your links in a nicely formatted JSON file. &lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;span id="selenium"&gt; 5) Selenium &amp;amp; Chrome —headless &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Scrapy is really nice for large-scale web scraping tasks, but it is not enough if you need to scrape a Single Page Application written with a Javascript framework, because it won't be able to render the Javascript code. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code's behavior yourself, meaning manually inspecting all the network calls with your browser inspector and replicating the AJAX calls that contain the interesting data.&lt;/p&gt;
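&lt;p&gt;As a rough illustration of replicating such a call with the standard library (the endpoint and headers below are hypothetical placeholders for whatever request you spot in the Network tab):&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical JSON endpoint found in the browser's Network tab;
# the URL and headers are placeholders, not a real API.
api_url = "https://example.com/api/items?page=1"
request = urllib.request.Request(
    api_url,
    headers={
        "Accept": "application/json",
        # Many backends use this header to tell AJAX calls apart
        "X-Requested-With": "XMLHttpRequest",
    },
)

# Uncommenting the lines below would perform the actual call:
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```

&lt;p&gt;Reproducing the call directly like this is usually much faster and lighter than rendering the whole page.&lt;/p&gt;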

&lt;p&gt;In some cases, there are just too many asynchronous HTTP calls involved to get the data you want and it can be easier to just render the page in a headless browser. &lt;/p&gt;

&lt;p&gt;Another great use case is taking a screenshot of a page, and this is what we are going to do with the Hacker News homepage (again!).&lt;/p&gt;

&lt;p&gt;You can install the selenium package with pip: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You will also need &lt;a href="http://chromedriver.chromium.org/"&gt;Chromedriver&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then we just have to import the WebDriver from the selenium package, configure Chrome with &lt;code&gt;headless=True&lt;/code&gt; and set a window size (otherwise it is really small):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;You should get a nice screenshot of the homepage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can do much more with the Selenium API and Chrome, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executing Javascript&lt;/li&gt;
&lt;li&gt;Filling forms&lt;/li&gt;
&lt;li&gt;Clicking on Elements&lt;/li&gt;
&lt;li&gt;Extracting elements with CSS selectors / XPath expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selenium with Chrome in headless mode is really the ultimate combination for scraping anything you want. You can automate everything you could do with your regular Chrome browser. &lt;/p&gt;

&lt;p&gt;The big drawback is that Chrome needs lots of memory and CPU power. With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need one CPU core per instance. &lt;/p&gt;

&lt;p&gt;If you want to run several Chrome instances concurrently, you will need powerful servers (the cost goes up quickly) and constant monitoring of resources. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="conclusion"&gt; Conclusion: &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Here is a quick recap table of every technology we discussed in this post. Do not hesitate to tell us in the comments if you know of resources that you feel have their place here.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I hope this overview will help you choose the right Python scraping tools, and that you learned something reading this post.&lt;/p&gt;

&lt;p&gt;Every tool I talked about in this post will be the subject of a specific blog post in the future, where I'll go deep into the details.&lt;/p&gt;

&lt;p&gt;Everything I talked about in this post is what I used to build &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, the simplest web scraping API around. Do not hesitate to test our solution if you don’t want to lose too much time setting everything up; the first 1k API calls are on us 😊.&lt;/p&gt;

&lt;p&gt;Do not hesitate to tell me in the comments what you'd like to know about scraping; I'll talk about it in my next post.&lt;/p&gt;

&lt;p&gt;Happy Scraping! &lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Scraping single page applications with ease.</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 26 May 2019 17:22:14 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</link>
      <guid>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</guid>
      <description>&lt;p&gt;Dealing with a website that uses lots of Javascript to render their content can be tricky. These days, more and more sites are using frameworks like Angular, React, Vue.js for their frontend.&lt;/p&gt;

&lt;p&gt;These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API.&lt;/p&gt;

&lt;p&gt;So basically, the problem you will encounter is that your headless browser will download the HTML code and the Javascript code, but will not be able to execute the full Javascript code, so the webpage will not be totally rendered.&lt;/p&gt;

&lt;p&gt;There are some solutions to these problems. The first one is to use a better headless browser. The second one is to inspect the API calls made by the Javascript frontend and to reproduce them.&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code's behavior yourself, meaning manually inspecting all the network calls with your browser inspector and replicating the AJAX calls that contain the interesting data.&lt;/p&gt;

&lt;p&gt;So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser capable of interpreting and executing all the Javascript code in order to render the page; that is what the next part is about.&lt;/p&gt;

&lt;h1&gt;
  
  
  Headless Chrome with Python
&lt;/h1&gt;

&lt;p&gt;PhantomJS was the leader in this space: it was (and still is) heavily used for browser automation and testing. After hearing the news about the release of Chrome's headless mode, the PhantomJS maintainer said that he was stepping down as maintainer, because, I quote, “Google Chrome is faster and more stable than PhantomJS [...]”. It looks like Chrome in headless mode is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;You will need to install the selenium package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And of course, you need a Chrome browser, and Chromedriver installed on your system.&lt;/p&gt;

&lt;p&gt;On macOS, you can simply use brew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brew&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;chromedriver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Taking a screenshot
&lt;/h1&gt;

&lt;p&gt;We are going to use Chrome to take a screenshot of Nintendo's home page, which uses lots of Javascript.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium.webdriver.chrome.options&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executable_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'/usr/local/bin/chromedriver'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.nintendo.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'screenshot.png'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is really straightforward. I just added the &lt;code&gt;--window-size&lt;/code&gt; argument because the default window size was too small.&lt;/p&gt;

&lt;p&gt;You should now have a nice screenshot of Nintendo's home page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" alt="Nintendo Homepage Screenshot" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Waiting for the page load
&lt;/h1&gt;

&lt;p&gt;Most of the time, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to complete to get the fully rendered page.&lt;/p&gt;

&lt;p&gt;A simple solution is to &lt;code&gt;time.sleep()&lt;/code&gt; for an arbitrary amount of time. The problem with this method is that you will wait either too long or too little, depending on your latency and internet connection speed.&lt;/p&gt;

&lt;p&gt;The other solution is to use the WebDriverWait object from the Selenium API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;presence_of_element_located&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'chart'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Page is ready!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;TimeoutException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Timeout"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is a great solution because it will wait exactly as long as necessary for the element to be rendered on the page.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is managing it in production. If you scrape lots of different websites, resource usage will be volatile.&lt;/p&gt;

&lt;p&gt;Meaning there will be CPU spikes and memory spikes, just like with a regular Chrome browser. After all, your Chrome instance will execute untrusted and unpredictable third-party Javascript code! Then there is also the zombie-process problem.&lt;/p&gt;

&lt;p&gt;This is one of the reasons I started &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;: so that developers can focus on extracting the data they want, not on managing headless browsers and proxies!&lt;/p&gt;

&lt;p&gt;This was my first post about scraping, I hope you enjoyed it!&lt;/p&gt;

&lt;p&gt;If you did please let me know, I'll write more 😊&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to know more about ScrapingBee, you can 👉 &lt;a href="https://dev.to/daolf/new-season-new-project-i-need-you-197l"&gt;here&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>scraping</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Introduction to Web Scraping With Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 13 Mar 2019 16:46:23 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</link>
      <guid>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</guid>
      <description>&lt;p&gt;Web scraping or crawling is the fact of fetching data from a third party website by downloading and parsing the HTML code to extract the data you want.&lt;/p&gt;

&lt;p&gt;Since not every website offers a clean API, or an API at all, web scraping can be the only solution when it comes to extracting website information.&lt;br&gt;
Lots of companies use it to obtain knowledge about competitor prices, for news aggregation, mass email collection…&lt;/p&gt;

&lt;p&gt;Almost everything can be extracted from HTML; the only information that is “difficult” to extract is inside images or other media.&lt;/p&gt;

&lt;p&gt;In this post, we are going to see basic techniques in order to fetch and parse data in Java. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic Java understanding&lt;/li&gt;
&lt;li&gt;Basic XPath&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;You will need Java 8 with &lt;a href="http://htmlunit.sourceforge.net" rel="noopener noreferrer"&gt;HtmlUnit&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using Eclipse, I suggest you configure the max length in the detail pane (when you click on the Variables tab) so that you can see the entire HTML of your current page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's scrape CraigList
&lt;/h3&gt;

&lt;p&gt;For our first example, we are going to fetch items from Craigslist, since they don't seem to offer an API, collect names, prices, and images, and export them to JSON. &lt;/p&gt;

&lt;p&gt;First, let's take a look at what happens when you search for an item on Craigslist. Open Chrome DevTools and click on the Network tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The search URL is :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/moa?is_paid=all&amp;amp;search_distance_type=mi&amp;amp;query=iphone+6s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query=iphone+6s  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can open your favorite IDE; it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (proxy settings, browser emulation, redirect handling...).&lt;/p&gt;

&lt;p&gt;We are going to disable Javascript since it's not required for our example, and disabling Javascript makes the page load faster :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchQuery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Iphone 6s"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query="&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;URLEncoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;encode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchQuery&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HtmlPage object will contain the HTML code; you can access it with the &lt;code&gt;asXml()&lt;/code&gt; method. &lt;/p&gt;

&lt;p&gt;Now we are going to fetch titles, images, and prices. We need to inspect the DOM structure for an item :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With HtmlUnit you have several options to select an HTML tag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;getHtmlElementById(String id)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getFirstByXPath(String Xpath)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getByXPath(String XPath)&lt;/code&gt; which returns a List&lt;/li&gt;
&lt;li&gt;many others; check the documentation!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since there isn't any ID we could use, we have to write an &lt;a href="http://www.w3schools.com/xsl/xpath_syntax.asp" rel="noopener noreferrer"&gt;XPath&lt;/a&gt; expression to select the tags we want. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XPath&lt;/strong&gt; is a query language to select XML nodes (HTML in our case).&lt;/p&gt;

&lt;p&gt;First, we are going to select all the &lt;code&gt;&amp;lt;li&amp;gt;&lt;/code&gt; tags that have the class &lt;code&gt;result-row&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then we will iterate through this list, and for each item select the name, price, and URL, and then print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//li[@class='result-row']"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No items found !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

  &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;// It is possible that an item doesn't have any price&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Name : %s Url : %s Price : %s"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, instead of just printing the results, we are going to put them in JSON, using the &lt;a href="https://github.com/FasterXML/jackson" rel="noopener noreferrer"&gt;Jackson&lt;/a&gt; library to map items to JSON format. &lt;/p&gt;

&lt;p&gt;We need a POJO (plain old Java object) to represent items:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item.java&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//getters and setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add this to your pom.xml :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.7.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all we have to do is create an Item, set its attributes, and convert it to a JSON string (or write it to a file), adapting the previous code a little:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
   &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

   &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
   &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;// It is possible that an item doesn't have any &lt;/span&gt;
   &lt;span class="c1"&gt;//price, we set the price to 0.0 in this case&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; 
   &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTitle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
   &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; 
   &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;

   &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
   &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
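&lt;p&gt;If you would rather end up with one JSON file than one printed object per item, you can collect the items in a list and let Jackson write them out in a single call. A minimal sketch (the &lt;code&gt;items.json&lt;/code&gt; filename and the public-field Item stand-in are my own choices, not from the original code):&lt;/p&gt;

```java
import java.io.File;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ItemsToFile {
    // Minimal stand-in for the Item POJO above (public fields so Jackson
    // can serialize it without getters).
    public static class Item {
        public String title;
        public BigDecimal price;
        public String url;
    }

    public static void main(String[] args) throws Exception {
        List<Item> results = new ArrayList<>();
        // In the scraping loop, replace System.out.println(jsonString) with:
        // results.add(item);

        // After the loop, serialize the whole list in one call.
        new ObjectMapper().writerWithDefaultPrettyPrinter()
                          .writeValue(new File("items.json"), results);
    }
}
```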



&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;This example is not perfect; there are many things that could be improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-city search&lt;/li&gt;
&lt;li&gt;Handling pagination&lt;/li&gt;
&lt;li&gt;Multi-criteria search&lt;/li&gt;
&lt;/ul&gt;
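&lt;p&gt;To give an idea of the pagination point: Craigslist search pages can be walked with an &lt;code&gt;s&lt;/code&gt; offset parameter, 120 results per page (an assumption based on the site's URLs at the time of writing, as is the &lt;code&gt;result-row&lt;/code&gt; selector). A rough sketch with HtmlUnit:&lt;/p&gt;

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PaginationSketch {
    // Craigslist offsets results with ?s=<n>, 120 per page (assumption).
    static String pageUrl(String searchUrl, int pageIndex) {
        return searchUrl + "?s=" + (pageIndex * 120);
    }

    public static void main(String[] args) throws Exception {
        String searchUrl = "https://newyork.craigslist.org/search/moa";
        WebClient client = new WebClient();
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);

        for (int i = 0; i < 3; i++) { // first three pages only
            HtmlPage page = client.getPage(pageUrl(searchUrl, i));
            List<?> items = page.getByXPath("//li[@class='result-row']");
            if (items.isEmpty()) {
                break; // we went past the last page
            }
            for (Object o : items) {
                HtmlElement item = (HtmlElement) o;
                // ...extract title / price / URL as in the loop above...
            }
        }
    }
}
```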

&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was my first blog post, I hope you enjoyed it! Feel free to give me feedback in the comments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;p&gt;I recently wrote a blog post about &lt;a href="https://dev.to/scrapingbee/a-guide-to-web-scraping-without-getting-blocked-5e7e"&gt;web scraping without getting blocked&lt;/a&gt; that explains the different techniques you can use to hide your scrapers, check it out!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping E-Commerce Product Data</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 17 Feb 2019 09:24:37 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</link>
      <guid>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</guid>
      <description>&lt;p&gt;In this tutorial, we are going to see how to extract product data from any E-commerce websites with Java. There are lots of different use cases for product data extraction, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce price monitoring&lt;/li&gt;
&lt;li&gt;Price comparator&lt;/li&gt;
&lt;li&gt;Availability monitoring&lt;/li&gt;
&lt;li&gt;Extracting reviews&lt;/li&gt;
&lt;li&gt;Market research&lt;/li&gt;
&lt;li&gt;MAP (minimum advertised price) violation detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are going to extract the price, product name, image URL, SKU, and currency from this product page: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&gt;https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What you will need
&lt;/h2&gt;

&lt;p&gt;We will use HtmlUnit to perform the HTTP requests and parse the DOM. Add this dependency to your pom.xml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We will also use the Jackson library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.9.8&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema.org
&lt;/h2&gt;

&lt;p&gt;In order to extract the fields we're interested in, we are going to parse &lt;a href="https://schema.org"&gt;schema.org&lt;/a&gt; metadata from the HTML markup. &lt;/p&gt;

&lt;p&gt;Schema is a &lt;strong&gt;&lt;em&gt;semantic&lt;/em&gt;&lt;/strong&gt; vocabulary that can be added to any webpage. There are many benefits to implementing Schema: most search engines use it to understand what a page is about (a Product, an Article, a Review, and &lt;a href="https://schema.org/docs/schemas.html"&gt;many more&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;According to schema.org, about 10 million websites use it worldwide. That's huge! &lt;br&gt;
There are different types of Schema, and today we're going to look at the &lt;a href="https://schema.org/Product"&gt;Product type&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's really convenient because once you've written a scraper that extracts a specific schema type, it will work on any other website using the same schema. No more site-specific XPath / CSS selectors to write!&lt;/p&gt;

&lt;p&gt;In my experience at PricingBot (my previous company), about 40% of E-commerce websites use schema.org metadata in their DOM. &lt;/p&gt;

&lt;p&gt;There are three main ways of embedding Schema markup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;JSON-LD&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;script&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type=&lt;/span&gt;&lt;span class="s2"&gt;"application/ld+json"&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://schema.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ItemList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"315"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"itemListElement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brand 502"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"offers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Offer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4399 p."&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;RDFa&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;vocab=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"ItemList"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;315&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Photo of product"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;BRAND 502&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"RUB"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;руб
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:price"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"4399.00"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;4 399,00
            &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;...
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
          ...
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the one used in our example, &lt;strong&gt;&lt;em&gt;Microdata&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"schema-org"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;


&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://images.asos-media.com/products/the-north-face-vault-backpack-28-litres-in-black/10253008-1-black"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Image 1 of The North Face Vault Backpack 28 Litres in Black"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"productID"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"sku"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"brand"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Brand"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face Vault Backpack 28 Litres in Black&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Shop The North Face Vault Backpack 28 Litres in Black at ASOS. Discover fashion online.&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"availability"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/InStock"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"GBP"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;60&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"eligibleRegion"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;GB&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"seller"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Organization"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;ASOS&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;  
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that you can have multiple offers on a single page.&lt;/p&gt;
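&lt;p&gt;As an aside: JSON-LD is often the easiest flavor to consume, because the data is plain JSON. Since we already have Jackson on the classpath, here is a sketch of reading Product fields from a JSON-LD block; the string below is a trimmed-down stand-in for what you would read out of the page's &lt;code&gt;script type="application/ld+json"&lt;/code&gt; tag, not code from this tutorial:&lt;/p&gt;

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLdSketch {
    public static void main(String[] args) throws Exception {
        // In a real scraper this string would come from the DOM, e.g. the
        // text content of //script[@type='application/ld+json'].
        String jsonLd = "{\"@type\":\"Product\",\"name\":\"Brand 502\","
                + "\"offers\":{\"@type\":\"Offer\",\"price\":\"4399 p.\"}}";

        JsonNode product = new ObjectMapper().readTree(jsonLd);
        String name = product.path("name").asText();
        String price = product.path("offers").path("price").asText();
        System.out.println(name + " : " + price); // prints: Brand 502 : 4399 p.
    }
}
```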

&lt;h2&gt;
  
  
  Extracting the data
&lt;/h2&gt;

&lt;p&gt;The first step is to create a basic POJO for a Product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="c1"&gt;// ...getters &amp;amp; setters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then we need to fetch the target URL and write a basic microdata parser to extract the fields we are interested in. I'm using HtmlUnit for this, which is a pure-Java headless browser. I could have used other libraries, like Jsoup or Selenium + Headless Chrome. &lt;/p&gt;

&lt;p&gt;In most cases, HtmlUnit is a good middle ground: it's lighter than Selenium + Headless Chrome but offers more features than a raw HTTP client + Jsoup (which only handles HTML parsing). &lt;/p&gt;

&lt;p&gt;For "Javascript-heavy" websites, relying on frontend frameworks like React / Vue.js, Headless Chrome is the way to go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;productUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//*[@itemtype='https://schema.org/Product']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;((((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./img"&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='offers']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='price']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='name']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./*[@itemprop='priceCurrency']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='sku']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In the first lines, I created the HtmlUnit HTTP client and disabled JavaScript, since we don't need it to read the Schema markup. &lt;/p&gt;

&lt;p&gt;Then it's just a matter of basic XPath expressions to select the DOM nodes we are interested in. &lt;/p&gt;

&lt;p&gt;This parser is far from perfect: it doesn't extract everything, and it doesn't handle multiple offers. Still, it should give you an idea of how to extract Schema.org data. &lt;/p&gt;

&lt;p&gt;We can then create the Product object, and print it as a JSON string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Avoid getting blocked
&lt;/h2&gt;

&lt;p&gt;Now that we are able to extract the product data we want, we have to be careful not to get blocked. &lt;/p&gt;

&lt;p&gt;Websites implement anti-bot mechanisms for various reasons. The most obvious is to prevent heavy automated traffic from impacting a website’s performance (so be careful with concurrent requests, and add delays between them). Another is to stop abusive bot behavior, such as spam.&lt;/p&gt;

&lt;p&gt;There are various protection mechanisms. Sometimes your bot will be blocked if it makes too many requests per second, hour, or day. Sometimes there is a rate limit per IP address. The hardest protection to deal with is user-behavior analysis: for example, the website could analyze the time between requests, or whether the same IP is making requests concurrently.&lt;/p&gt;
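&lt;p&gt;The simplest counter-measure to the timing analysis described above is to add a randomized delay between requests. Here is a minimal sketch (the class name and delay bounds are illustrative, not from any library):&lt;/p&gt;

```java
import java.util.Random;

public class PoliteDelay {
    private static final Random RANDOM = new Random();

    // Returns a random delay between minMillis (inclusive) and maxMillis (exclusive)
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return minMillis + (long) (RANDOM.nextDouble() * (maxMillis - minMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = randomDelayMillis(500, 1500);
        System.out.println("Sleeping " + delay + " ms before the next request");
        Thread.sleep(delay);
        // ... perform the next HTTP request here ...
    }
}
```

&lt;p&gt;Note that a fixed delay is itself a recognizable pattern, which is why the sketch randomizes it.&lt;/p&gt;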

&lt;p&gt;The easiest way to hide our scrapers is to use proxies. Combined with a random User-Agent, a proxy is a powerful method to stay undetected and to scrape rate-limited web pages. Of course, it’s better not to be blocked in the first place, but sometimes a website only allows a certain number of requests per day or hour.&lt;/p&gt;

&lt;p&gt;In these cases, you should use a proxy. There are lots of free proxy lists, but I don’t recommend using them: they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Sometimes a public proxy list is operated by a legitimate company offering premium proxies, and sometimes not... &lt;/p&gt;

&lt;p&gt;What I recommend is using a paid proxy service, or building your own.&lt;/p&gt;
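&lt;p&gt;Rotating the User-Agent header, mentioned above as a companion to proxies, can be sketched as a random pick from a pool on each request. The class name and pool values below are illustrative; with HtmlUnit, the chosen value would be applied as a request header:&lt;/p&gt;

```java
import java.util.Random;

public class UserAgentPool {
    // Illustrative pool; in practice you would use full, current User-Agent strings
    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    };
    private static final Random RANDOM = new Random();

    // Pick a random User-Agent for the next request
    static String next() {
        return USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)];
    }

    public static void main(String[] args) {
        // With HtmlUnit this could be applied with:
        // client.addRequestHeader("User-Agent", UserAgentPool.next());
        System.out.println(next());
    }
}
```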

&lt;p&gt;Setting a proxy in HtmlUnit is easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ProxyConfig&lt;/span&gt; &lt;span class="n"&gt;proxyConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"host"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myPort&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyConfig&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;As you can see, thanks to Schema.org data, extracting product data is much easier now than it was ten years ago. &lt;/p&gt;

&lt;p&gt;But there are still challenges, such as handling websites that haven't implemented Schema.org, dealing with IP blocking and rate limits, and rendering JavaScript... &lt;/p&gt;

&lt;p&gt;That is exactly why we've been working with my partner Pierre on a &lt;a href="https://www.scrapingbee.com"&gt;Web Scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ScrapingBee is an API to extract the HTML of any website without having to deal with proxies, CAPTCHAs, and headless browsers. A single API call is enough, with only the URL of the product you want to extract data from. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed this post. As always, you can find the full code in this GitHub repository: &lt;a href="https://github.com/ksahin/introWebScraping"&gt;https://github.com/ksahin/introWebScraping&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Introduction to Chrome Headless</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Fri, 18 Jan 2019 09:45:11 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</link>
      <guid>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</guid>
<description>&lt;p&gt;In the previous articles, I introduced two different tools for performing web scraping with Java: &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;HtmlUnit&lt;/a&gt; in the first article, and &lt;a href="https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8"&gt;PhantomJS&lt;/a&gt; in the article about handling JavaScript-heavy websites. &lt;/p&gt;

&lt;p&gt;This time we are going to look at a newer Chrome feature: &lt;strong&gt;&lt;em&gt;headless&lt;/em&gt;&lt;/strong&gt; mode. There was a rumor going around that Google used a special version of Chrome for its crawling needs. I don't know whether that is true, but Google launched headless mode with Chrome 59 several months ago. &lt;/p&gt;

&lt;p&gt;PhantomJS was the leader in this space: it was (and still is) heavily used for browser automation and testing. After hearing the news about headless Chrome, the PhantomJS maintainer said he was stepping down, because, and I quote, &lt;em&gt;"Google Chrome is faster and more stable than PhantomJS [...]"&lt;/em&gt;.&lt;br&gt;
It looks like headless Chrome is becoming the way to go when it comes to browser automation and dealing with JavaScript-heavy websites. &lt;/p&gt;

&lt;p&gt;HtmlUnit, PhantomJS, and the other headless browsers are very useful tools; the problem is that they are not as stable as Chrome, and you will sometimes encounter JavaScript errors that would not have occurred in Chrome. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google Chrome &amp;gt; 59&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.google.com/a/chromium.org/chromedriver/downloads"&gt;Chromedriver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Selenium &lt;/li&gt;
&lt;li&gt;In your &lt;strong&gt;&lt;em&gt;pom.xml&lt;/em&gt;&lt;/strong&gt;, add a recent version of Selenium:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.seleniumhq.selenium&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;selenium-java&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;3.8.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If you don't have Google Chrome installed, you can download it &lt;a href="https://www.google.com/chrome/browser/desktop/index.html"&gt;here&lt;/a&gt;.&lt;br&gt;
To install Chromedriver, you can use brew on macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or download it using the link in the prerequisites. &lt;br&gt;
There are a lot of versions; I suggest you use the latest versions of Chrome and Chromedriver.&lt;/p&gt;
&lt;h3&gt;
  
  
  Let's log into Hacker News
&lt;/h3&gt;

&lt;p&gt;In this part, we are going to log into Hacker News, and take a screenshot once logged in. We don't need Chrome headless for this task, but the goal of this article is only to show you how to run headless Chrome with Selenium.&lt;/p&gt;

&lt;p&gt;The first thing we have to do is create a WebDriver object, set the chromedriver path, and pass some arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Init chromedriver&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/Path/To/Chromedriver"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--disable-gpu&lt;/code&gt; option is needed on Windows systems, according to the &lt;a href="https://developers.google.com/web/updates/2017/04/headless-chrome"&gt;documentation&lt;/a&gt;.&lt;br&gt;
Chromedriver should automatically find the Google Chrome executable path. If you have a special installation, or if you want to use a different version of Chrome, you can set it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;options.setBinary("/Path/to/specific/version/of/Google Chrome");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you want to learn more about the different options, here is the &lt;a href="https://sites.google.com/a/chromium.org/chromedriver/capabilities"&gt;Chromedriver documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The next step is to perform a GET request to the Hacker News login form, select the username and password fields, fill them in with our credentials, and click the login button. Then we check for a credential error, and if we are logged in, we can take a screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have done this in a previous article; here is the full code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChromeHeadlessTest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/your/chromedriver/path"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
       &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
       &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--silent"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Get the login page&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login?goto=news"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Search for username / password input and fill the inputs&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@name='acct']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Locate the login button and click on it&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@value='login']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCurrentUrl&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Incorrect credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Take a screenshot of the current page&lt;/span&gt;
        &lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;TakesScreenshot&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getScreenshotAs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutputType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;FileUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copyFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"screenshot.png"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;// Logout&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"logout"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should now have a nice screenshot of the Hacker News homepage while being authenticated. As you can see, headless Chrome is really easy to use; it is not that different from PhantomJS, since we are driving it through Selenium. &lt;/p&gt;

&lt;p&gt;If you enjoyed this do not hesitate to subscribe to our newsletter!&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering, and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;As usual, the code is available in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Log in to Almost Any Websites</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 02 Jan 2019 09:52:27 +0000</pubDate>
      <link>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</link>
      <guid>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</guid>
<description>&lt;p&gt;In the first &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;article about Java web scraping&lt;/a&gt;, I showed how to extract data from the Craigslist website. &lt;br&gt;
But what if the data you want, or the action you want to carry out, requires authentication?&lt;/p&gt;

&lt;p&gt;In this short tutorial, I will show you how to write a generic method that can handle most authentication forms. &lt;/p&gt;
&lt;h3&gt;
  
  
  Authentication mechanism
&lt;/h3&gt;

&lt;p&gt;There are many different authentication mechanisms, the most frequent being a login form, sometimes with a &lt;a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery#Forging_login_requests"&gt;CSRF token&lt;/a&gt; as a hidden input. &lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form, and submit it. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Hacker News Authentication
&lt;/h3&gt;

&lt;p&gt;Let's say you want to create a bot that logs into Hacker News (to submit a link or perform an action that requires being authenticated):&lt;/p&gt;

&lt;p&gt;Here is the login form and the associated DOM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can implement the login algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="nf"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;FailingHttpStatusCodeException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MalformedURLException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;//The first preceding input that is not hidden&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputLogin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//preceding::input[not(@type='hidden')]"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;inputLogin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;//get the enclosing form&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlForm&lt;/span&gt; &lt;span class="n"&gt;loginForm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEnclosingForm&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;//submit the form&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginForm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;//returns the cookie filled client :)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then the main method, which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calls &lt;code&gt;autoLogin&lt;/code&gt; with the right parameters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goes to &lt;code&gt;https://news.ycombinator.com&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Checks for the logout link to verify that we're logged in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prints the cookies to the console&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://news.ycombinator.com"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login?goto=news"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"login"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting autoLogin on "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//a[@href='user?id=%s']"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;){&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="c1"&gt;// printing the cookies&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Cookie&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCookieManager&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getCookies&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
                    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Wrong credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
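&lt;p&gt;Since the whole point of returning the client is keeping the session cookies, you may want to persist them between runs. The sketch below uses the JDK's &lt;code&gt;java.net.HttpCookie&lt;/code&gt; for illustration only; with HtmlUnit you would read the cookies from &lt;code&gt;client.getCookieManager().getCookies()&lt;/code&gt; instead.&lt;/p&gt;

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: saving and restoring session cookies so a login survives restarts.
// HttpCookie is a stand-in here for HtmlUnit's own Cookie class.
public class CookieStore {

    // Serialize cookies to a single header-style "name=value; name=value" line.
    public static String serialize(List<HttpCookie> cookies) {
        return cookies.stream()
                .map(c -> c.getName() + "=" + c.getValue())
                .collect(Collectors.joining("; "));
    }

    // Parse the line back into cookie objects.
    public static List<HttpCookie> parse(String line) {
        return Arrays.stream(line.split("; "))
                .map(s -> {
                    String[] kv = s.split("=", 2);
                    return new HttpCookie(kv[0], kv[1]);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<HttpCookie> cookies = List.of(
                new HttpCookie("user", "kevin"),
                new HttpCookie("auth", "abc123"));
        String saved = serialize(cookies);
        System.out.println(saved);
        System.out.println(parse(saved).size());
    }
}
```

In a real bot you would write that line to a file and feed the parsed cookies back into the client's cookie manager before the first request.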



&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;There are many cases where this method will not work: Amazon, Dropbox, and any other login form protected by two-step verification or a captcha. &lt;/p&gt;

&lt;p&gt;Things that could be improved in this code: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Handle the check for the logout link inside &lt;code&gt;autoLogin&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check for &lt;code&gt;null&lt;/code&gt; inputs/form and throw an appropriate exception&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
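&lt;p&gt;The second improvement boils down to failing fast with a clear message instead of letting a &lt;code&gt;NullPointerException&lt;/code&gt; surface later. A minimal, plain-Java sketch of such a helper (the XPath call in the comment is the one from &lt;code&gt;autoLogin&lt;/code&gt; above):&lt;/p&gt;

```java
// Sketch: a generic guard for the getFirstByXPath results in autoLogin,
// so a missing input or form produces a descriptive exception immediately.
public class Preconditions {

    public static <T> T requireFound(T element, String description) {
        if (element == null) {
            throw new IllegalStateException(
                "Could not find " + description + " on the login page");
        }
        return element;
    }

    public static void main(String[] args) {
        // In autoLogin you would write something like:
        // inputPassword = requireFound(
        //     page.getFirstByXPath("//input[@type='password']"), "password input");
        System.out.println(requireFound("dummy", "password input"));
    }
}
```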

&lt;p&gt;In a future post I will show you how to deal with captchas and virtual numeric keyboards using OCR and captcha-breaking APIs!&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>An Automatic Bill Downloader in Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 12 Dec 2018 10:03:07 +0000</pubDate>
      <link>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</link>
      <guid>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</guid>
      <description>&lt;p&gt;In this article I am going to show how to download bills (or any other file ) from a website with HtmlUnit.&lt;/p&gt;

&lt;p&gt;I suggest you read these articles first: &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;Introduction to web scraping with Java&lt;/a&gt; and &lt;a href="https://www.scrapingbee.com/blog/how-to-log-in-to-almost-any-websites/" rel="noopener noreferrer"&gt;Autologin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since I am hosting this blog on &lt;a href="https://m.do.co/c/0e940b26444e" rel="noopener noreferrer"&gt;Digital Ocean&lt;/a&gt; ($10 in credit if you sign up via this link), I will show you how to write a bot that automatically downloads every bill you have. &lt;/p&gt;

&lt;h3&gt;
  
  
  Login
&lt;/h3&gt;

&lt;p&gt;To submit the login form without needing to inspect the DOM, we will use the "magic" method I wrote in the previous article. &lt;/p&gt;

&lt;p&gt;Then we have to go to the billing page: &lt;code&gt;https://cloud.digitalocean.com/settings/billing&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://cloud.digitalocean.com"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"email"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Authenticator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://cloud.digitalocean.com/settings/billing"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You need to sign in for access to this page"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Exception&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error during login on %s , check your credentials"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetching the bills
&lt;/h3&gt;

&lt;p&gt;Let's create a new class called Bill (or Invoice) to represent a bill: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Bill.java&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//... getters &amp;amp; setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to inspect the DOM to see how we can extract the description, amount, date and URL of each bill. Open your favorite inspection tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are lucky here: it's a clean DOM, with a nice, well-structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HtmlTable&lt;/code&gt; to store the table and iterate over each row&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;getCell&lt;/code&gt; to select the cells&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, using the Jackson library, we will serialize each Bill object to JSON and print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//table[@class='listing Billing--history']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTableRow&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBodies&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getRows&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;

    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// We only want the invoice row, not the payment one&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Invoice"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleDateFormat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MMMM d, yyyy"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Locale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ENGLISH&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getFirstChild&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="n"&gt;bill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;bills&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
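&lt;p&gt;Stripped of HtmlUnit, the two trickiest cells above come down to two small parsing steps: a date cell such as &lt;code&gt;"December 1, 2015"&lt;/code&gt; and an amount cell such as &lt;code&gt;"$6.00"&lt;/code&gt; (both example values are hypothetical; in the real code they come from &lt;code&gt;row.getCell(i).asText()&lt;/code&gt;). A stdlib-only sketch:&lt;/p&gt;

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Minimal illustration of the date and amount parsing used in the loop above.
public class CellParsing {

    public static Date parseDate(String cell) throws ParseException {
        // Locale.ENGLISH matters: the month names on the page are English,
        // whatever the default locale of the JVM happens to be.
        return new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH).parse(cell);
    }

    public static BigDecimal parseAmount(String cell) {
        // BigDecimal (not double) avoids floating-point surprises with money.
        return new BigDecimal(cell.replace("$", ""));
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseDate("December 1, 2015"));
        System.out.println(parseAmount("$6.00"));
    }
}
```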



&lt;p&gt;It's almost finished; the last thing is to download the invoice. It's pretty easy: we will use the &lt;code&gt;Page&lt;/code&gt; object to store the PDF, and call &lt;code&gt;getContentAsStream&lt;/code&gt; on it. It's better to check that the file has the right content type when doing this (&lt;code&gt;application/pdf&lt;/code&gt; in our case).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Page&lt;/span&gt; &lt;span class="n"&gt;invoicePdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentType&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"application/pdf"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
    &lt;span class="nc"&gt;IOUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentAsStream&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DigitalOcean"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".pdf"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
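&lt;p&gt;The &lt;code&gt;IOUtils.copy&lt;/code&gt; call above comes from Apache Commons IO. If you prefer to avoid the extra dependency, since Java 9 the standard library can do the same with &lt;code&gt;InputStream.transferTo&lt;/code&gt;; a sketch:&lt;/p&gt;

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stdlib replacement for IOUtils.copy: stream the response body to a file.
public class StreamCopy {

    public static long copy(InputStream in, OutputStream out) throws IOException {
        // try-with-resources closes both streams even if the copy fails
        try (in; out) {
            return in.transferTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        // In the bot, `in` would be invoicePdf.getWebResponse().getContentAsStream()
        // and `out` a FileOutputStream; byte arrays keep this demo self-contained.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream("%PDF-1.4 ...".getBytes()), out);
        System.out.println(copied + " bytes copied");
    }
}
```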



&lt;p&gt;That's it, here is the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for December 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1451602800000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for November 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1448924400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1446332400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for April 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1430431200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for March 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1427839200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for February 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1425164400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for January 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1422745200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2014"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1414796400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual, you can find the full code in this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com" rel="noopener noreferrer"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web Scraping Handling Ajax Website</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sat, 01 Dec 2018 10:14:30 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8</link>
      <guid>https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8</guid>
      <description>&lt;p&gt;Today more and more websites are using Ajax for fancy user experiences, dynamic web pages, and many more good reasons. &lt;br&gt;
Crawling Ajax heavy website can be tricky and painful, we are going to see some tricks to make it easier.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;p&gt;Before starting, please read the previous articles I wrote to learn how to set up your Java environment and get a basic understanding of HtmlUnit: &lt;a href="https://ksah.in/introduction-to-web-scraping-with-java/"&gt;Introduction to Web Scraping With Java&lt;/a&gt; and &lt;a href="https://ksah.in/how-to-log-in-to-almost-any-websites/"&gt;Handling Authentication&lt;/a&gt;.&lt;br&gt;
After reading these you should be a little more familiar with web scraping.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The first way to scrape an Ajax website with Java that we are going to see is by using &lt;a href="http://phantomjs.org/"&gt;PhantomJS&lt;/a&gt; with Selenium and GhostDriver. &lt;/p&gt;

&lt;p&gt;PhantomJS is a headless web browser based on WebKit (the engine used in Safari). It is quite fast and does a great job of rendering the DOM like a normal web browser.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First you'll need to &lt;a href="http://phantomjs.org/download.html"&gt;download&lt;/a&gt; PhantomJS&lt;/li&gt;
&lt;li&gt;Then add this to your pom.xml:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.github.detro&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;phantomjsdriver&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.2.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;and this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.seleniumhq.selenium&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;selenium-java&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.53.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  PhantomJS and Selenium
&lt;/h2&gt;

&lt;p&gt;Now we're going to use Selenium and GhostDriver to "pilot" PhantomJS. &lt;/p&gt;

&lt;p&gt;The example that we are going to see is a simple "See more" button on a news site that performs an Ajax call to load more news. &lt;br&gt;
So you may think that opening PhantomJS to click on a simple button is a waste of time and overkill? Of course it is!&lt;/p&gt;

&lt;p&gt;The news site is &lt;a href="https://www.inshorts.com/en/read"&gt;Inshorts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--srJLjLtm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/buttonLoadMore.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--srJLjLtm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/buttonLoadMore.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As usual, we have to open Chrome DevTools or your favorite inspector to see how to select the "Load More" button and then click on it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0suqT7N---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/domLoadMore.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0suqT7N---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/domLoadMore.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at some code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;DesiredCapabilities&lt;/span&gt; &lt;span class="n"&gt;desiredCaps&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;


    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;initPhantomJS&lt;/span&gt;&lt;span class="o"&gt;(){&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DesiredCapabilities&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setJavascriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"takesScreenshot"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_EXECUTABLE_PATH_PROPERTY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/usr/local/bin/phantomjs"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"User-Agent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cliArgsCap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--web-security=false"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--ssl-protocol=any"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--ignore-ssl-errors=true"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--webdriver-loglevel=ERROR"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_CLI_ARGS&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PhantomJSDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;manage&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Dimension&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1920&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1080&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's a lot of code to set up PhantomJS and Selenium!&lt;br&gt;
I suggest you read the documentation to see the many arguments you can pass to PhantomJS.&lt;/p&gt;

&lt;p&gt;Note that you will have to replace &lt;code&gt;/usr/local/bin/phantomjs&lt;/code&gt; with your own PhantomJS executable path.&lt;/p&gt;

&lt;p&gt;Then, in a main method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"phantomjs.page.settings.userAgent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.inshorts.com/en/read"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;initPhantomJS&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nbArticlesBefore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//div[@class='card-stack']/div"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"load-more-btn"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// We wait for the ajax call to fire and to load the response into the page&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nbArticlesAfter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//div[@class='card-stack']/div"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Initial articles : %s Articles after clicking : %s"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbArticlesBefore&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbArticlesAfter&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here we call the &lt;code&gt;initPhantomJS()&lt;/code&gt; method to set up everything, then we select the button by its id and click on it. &lt;/p&gt;

&lt;p&gt;The other part of the code counts the number of articles on the page and prints it to show what we have loaded. &lt;/p&gt;

&lt;p&gt;We could also have printed the entire DOM with &lt;code&gt;driver.getPageSource()&lt;/code&gt; and opened it in a real browser to see the difference before and after the click.&lt;/p&gt;
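&lt;p&gt;For example, dumping that page source to a file takes only a few lines of plain Java. This is just a sketch: the &lt;code&gt;html&lt;/code&gt; argument stands for the string returned by &lt;code&gt;driver.getPageSource()&lt;/code&gt;, and the file name is arbitrary:&lt;/p&gt;

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Write a page-source string to a file so it can be opened in a real browser.
public class PageDump {
    public static Path dump(String html, String fileName) throws IOException {
        Path out = Paths.get(fileName);
        // Creates the file (or truncates it) and writes the markup as UTF-8.
        Files.writeString(out, html);
        return out;
    }
}
```

&lt;p&gt;You would call it once before and once after the click, e.g. &lt;code&gt;PageDump.dump(driver.getPageSource(), "before.html");&lt;/code&gt;, then diff or open the two files.&lt;/p&gt;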

&lt;p&gt;I suggest you look at the &lt;a href="https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html"&gt;Selenium WebDriver&lt;/a&gt; documentation; there are lots of cool methods to manipulate the DOM. &lt;/p&gt;

&lt;p&gt;I used a dirty solution with my &lt;code&gt;Thread.sleep(800)&lt;/code&gt; to wait for the Ajax call to complete. &lt;br&gt;
It's dirty because it is an arbitrary number, and the scraper could run faster if we waited only as long as the Ajax call actually takes.&lt;/p&gt;

&lt;p&gt;There are other ways of solving this problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;waitForAjax&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;WebDriverWait&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExpectedCondition&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;JavascriptExecutor&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;JavascriptExecutor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeScript&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"return jQuery.active == 0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you look at the function being executed when we click on the button, you'll see it's using jQuery:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wyl0q749--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/jqueryPng-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wyl0q749--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/jqueryPng-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code will wait until the variable &lt;code&gt;jQuery.active&lt;/code&gt; equals 0 (an internal jQuery variable that counts the number of ongoing Ajax calls).&lt;/p&gt;
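&lt;p&gt;Under the hood, &lt;code&gt;WebDriverWait&lt;/code&gt; is essentially a polling loop: it re-evaluates the condition every few hundred milliseconds until it returns true or a timeout expires. The idea can be sketched in plain Java (this helper is illustrative, not part of Selenium):&lt;/p&gt;

```java
import java.util.function.BooleanSupplier;

// A minimal polling wait: re-check a condition until it holds or we time out.
// WebDriverWait works on the same principle, with richer error reporting.
public class PollingWait {
    public static boolean waitUntil(BooleanSupplier condition,
                                    long timeoutMillis,
                                    long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition met, stop waiting
            }
            Thread.sleep(pollMillis); // back off before the next check
        }
        return false; // gave up after the timeout
    }
}
```

&lt;p&gt;With Selenium, the condition would be something like the &lt;code&gt;jQuery.active == 0&lt;/code&gt; check above, evaluated through &lt;code&gt;JavascriptExecutor&lt;/code&gt;.&lt;/p&gt;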

&lt;p&gt;If we knew which DOM elements the Ajax call is supposed to render, we could use that id/class/XPath in the WebDriverWait condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExpectedConditions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;elementToBeClickable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xpathExpression&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So we've seen a little bit of how to use PhantomJS with Java.&lt;/p&gt;

&lt;p&gt;The example I took is really simple; it would have been easy to simulate the request directly.&lt;/p&gt;

&lt;p&gt;But sometimes, when you have tens of Ajax calls and lots of JavaScript being executed to render the page properly, it can be very hard to scrape the data you want, and PhantomJS/Selenium is here to save you :)&lt;/p&gt;

&lt;p&gt;Next time we will see how to do it by analyzing the Ajax calls and making the requests ourselves.&lt;/p&gt;

&lt;p&gt;As usual, you can find all the code in my &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rendering JS at scale can be really difficult and expensive. This is exactly the reason why we built &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, a web scraping API that takes care of this for you.&lt;/p&gt;

&lt;p&gt;It also takes care of proxies and CAPTCHAs, so don't hesitate to check it out: the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
