DEV Community: ScrapingBee

Web Scraping 101 with Javascript and NodeJS

Pierre — Tue, 16 Jun 2020 13:26:33 +0000

Javascript has become one of the most popular and widely used languages due to the massive improvements it has seen and the introduction of the runtime known as NodeJS. Whether it's a web or mobile application, Javascript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to scrape the web efficiently to meet most of your requirements.

Understanding NodeJS: A brief introduction

Javascript is a simple and modern language that was initially created to add dynamic behavior to websites inside the browser. When a website is loaded, Javascript is run by the browser's Javascript Engine and converted into a bunch of code that the computer can understand. For Javascript to interact with your browser, the browser provides a Runtime Environment (document, window, etc.).

This means that Javascript, on its own, is not the kind of programming language that can interact with or manipulate the computer or its resources directly. A web server, for example, must be capable of interacting with the file system, perhaps to read a file or store a record in a database.

Enter NodeJS. The crux of the idea was to make Javascript capable of running not only client-side but also server-side. To make this possible, Ryan Dahl, a skilled developer, took Google Chrome's V8 Javascript Engine and embedded it in a C++ program, which was named Node. So NodeJS is a runtime environment that allows an application written in Javascript to run on a server as well.

As opposed to how most languages like C or C++ deal with concurrency by employing multiple threads, NodeJS makes use of a single main thread and utilizes it to perform tasks in a Non-Blocking manner with the help of the Event Loop.
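A minimal sketch of what "non-blocking" means in practice, using only built-in timers and promises (the order array is just an illustrative name). The synchronous code runs to completion first, and the deferred callbacks are picked up by the Event Loop afterwards:

```javascript
// Records the order in which each piece of code actually runs.
const order = [];

order.push('start');

// Scheduled on the Event Loop; runs only after the current script has finished.
setTimeout(() => {
  order.push('timer');
  console.log(order.join(' -> '));
}, 0);

// Promise callbacks run before timers, but still after the synchronous code.
Promise.resolve().then(() => order.push('promise'));

order.push('end');
```

Neither the timer nor the promise callback blocks the main thread: 'end' is recorded before either of them runs.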

Putting up a simple web server is fairly simple as shown below:

const http = require('http');
const PORT = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
});

server.listen(PORT, () => {
  console.log(`Server running at PORT:${PORT}/`);
});

If you have NodeJS installed, run the above code by typing node <YourFileNameHere>.js (without the < and >), then open up your browser and navigate to localhost:3000. You will see some text saying "Hello World". NodeJS is ideal for applications that are I/O intensive.

HTTP clients: querying the web

HTTP clients are tools capable of sending a request to a server and then receiving a response from it. Almost every tool discussed in this article uses an HTTP client under the hood to query the server of the website that you will attempt to scrape.
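For reference, this is roughly what such a client does, sketched with NodeJS's built-in http and https modules only. fetchText is a hypothetical helper name, not part of any library discussed here; the libraries below wrap this kind of plumbing in a friendlier API.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: sends a GET request and resolves with the status and body.
function fetchText(url) {
  // Pick the built-in module that matches the URL's protocol.
  const client = url.startsWith('https:') ? https : http;

  return new Promise((resolve, reject) => {
    client
      .get(url, (res) => {
        let body = '';
        res.setEncoding('utf8');
        // The body arrives in chunks; collect them until the response ends.
        res.on('data', (chunk) => {
          body += chunk;
        });
        res.on('end', () => resolve({ status: res.statusCode, body }));
      })
      .on('error', reject);
  });
}

// Example usage (left commented out because it needs network access):
// fetchText('https://www.reddit.com/r/programming.json')
//   .then(({ status, body }) => console.log(status, body.length))
//   .catch((error) => console.error(error));
```

This sketch does not follow redirects, set timeouts, or parse JSON, which is part of what the dedicated clients add.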

Request

Request is one of the most widely used HTTP clients in the Javascript ecosystem. However, the author of the Request library has officially declared it deprecated. This does not mean it is unusable: quite a lot of libraries still use it, and it is every bit worth using. It is fairly simple to make an HTTP request with Request:

const request = require('request')
request('https://www.reddit.com/r/programming.json', function (
  error,
  response,
  body
) {
  console.error('error:', error)
  console.log('body:', body)
})

You can find the Request library at Github, and installing it is as simple as running npm install request. You can also find the deprecation notice and what this means here. If you don't feel safe about the fact that this library is deprecated, there's more down below!

Axios

Axios is a promise-based HTTP client that runs both in the browser and NodeJS. If you use Typescript, then axios has you covered with built-in types. Making an HTTP request with Axios is straightforward. It ships with promise support by default, as opposed to the callbacks used in Request:

const axios = require('axios')

axios
    .get('https://www.reddit.com/r/programming.json')
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    });

If you fancy the async/await syntax sugar for the Promises API, you can do that too. Since top-level await was still a stage 3 proposal at the time of writing, we will make use of an async function instead:

async function getForum() {
    try {
        const response = await axios.get(
            'https://www.reddit.com/r/programming.json'
        )
        console.log(response)
    } catch (error) {
        console.error(error)
    }
}

And all you have to do is call getForum! You can find Axios library at Github and installing Axios is as simple as npm install axios.
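Because an async function always returns a promise, several requests can also be started together and awaited as a group. The sketch below illustrates this with Promise.all. Note that fakeGet is a hypothetical stand-in for a real HTTP call such as axios.get, so the snippet runs without network access, and the paths passed to it are placeholders.

```javascript
// Hypothetical stand-in for an HTTP client call: resolves after a short delay.
function fakeGet(url, delayMs = 10) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ url, status: 200 }), delayMs);
  });
}

// Starts every request at once and waits until all of them have resolved.
// Promise.all keeps the results in the same order as the input URLs.
async function getAll(urls) {
  const responses = await Promise.all(urls.map((url) => fakeGet(url)));
  return responses.map((response) => `${response.status} ${response.url}`);
}

getAll(['/r/programming.json', '/r/javascript.json'])
  .then((lines) => console.log(lines))
  .catch((error) => console.error(error));
```

If any one of the promises rejects, Promise.all rejects as a whole, so the catch handler still covers every request.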

Superagent

Much like Axios, Superagent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but Superagent has more dependencies and is less popular.

Regardless, making an HTTP request with Superagent using promises, async/await or callbacks looks like this:

const superagent = require("superagent")
const forumURL = "https://www.reddit.com/r/programming.json"

// callbacks
superagent
    .get(forumURL)
    .end((error, response) => {
        console.log(response)
    })

// promises
superagent
    .get(forumURL)
    .then((response) => {
        console.log(response)
    })
    .catch((error) => {
        console.error(error)
    })

// promises with async/await
async function getForum() {
    try {
        const response = await superagent.get(forumURL)
        console.log(response)
    } catch (error) {
        console.error(error)
    }
}

You can find the Superagent library at Github and installing Superagent is as simple as npm install superagent.

For the upcoming few web scraping tools, Axios will be used as the HTTP client.

Regular Expressions: The hard way

The simplest way to get started with web scraping without any dependencies is to use a bunch of regular expressions on the HTML string that you receive by querying a webpage using an HTTP client, but there is a big tradeoff. Regular expressions aren't as flexible, and quite a lot of people, both professionals and amateurs, struggle with writing the correct regular expression.

For complex web scraping, the regular expression can also get out of hand very quickly. With that said, let's give it a go. Say there's a label with some username in it, and we want the username. This is similar to what you'd have to do if you relied on regular expressions:

const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>(.+)<\/label>/)

console.log(result[1], result[1].split(": ")[1])
// Username: John Doe John Doe

In Javascript, when the regular expression has no g flag, match() returns an array whose first element is the full match, followed by the captured groups. The 2nd element (at index 1) holds the text inside the <label> tag, which is what we want. But this result contains some unwanted text ("Username: "), which has to be removed.
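To give a feel for how this grows, here is a minimal sketch that pulls two fields out of the same kind of markup with a single global regular expression and matchAll(). The second label and its email value are made up for illustration, and the pattern assumes labels of the exact form <label>Key: value</label> with no nested tags or attributes.

```javascript
const html =
  '<label>Username: John Doe</label><label>Email: john@example.com</label>';

// Group 1 captures the key before the colon, group 2 the value after it.
const pattern = /<label>([^:<]+):\s*([^<]+)<\/label>/g;

const fields = {};
for (const [, key, value] of html.matchAll(pattern)) {
  fields[key.trim()] = value.trim();
}

console.log(fields);
// { Username: 'John Doe', Email: 'john@example.com' }
```

Any change in the markup, such as an attribute on the label or a nested tag, would require the pattern to be revisited.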

As you can see, for a very simple use case the steps and the work to be done are unnecessarily high. This is why you should rely on something like an HTML parser, which we will talk about next.

Cheerio: Core JQuery for traversing the DOM

Cheerio is an efficient and light library which allows you to use the rich and powerful API of JQuery on the server-side. If you have used JQuery previously then you will feel right at home with Cheerio. It removes all the DOM inconsistencies and browser-related features and exposes an efficient API to parse and manipulate the DOM.

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
// <h2 class="title welcome">Hello there!</h2>

As you can see, using Cheerio is very similar to how you'd use JQuery.

However, it does not work the same way that a web browser works, which means it does not:

Render any of the parsed or manipulated DOM elements
Apply CSS or load any external resource
Execute javascript

So if the website or web application that you are trying to crawl is Javascript heavy (for example a Single Page Application) then Cheerio is not your best bet, you might have to rely on some of the other options that are talked about later on.

To demonstrate the power of Cheerio, we will attempt to crawl the r/programming forum on Reddit and get a list of post names.

First, install Cheerio and axios by running the following command:
npm install cheerio axios

Then create a new file called crawler.js and copy/paste the following code:

const axios = require('axios');
const cheerio = require('cheerio');

const getPostTitles = async () => {
    try {
        const { data } = await axios.get(
            'https://old.reddit.com/r/programming/'
        );
        const $ = cheerio.load(data);
        const postTitles = [];

        $('div > p.title > a').each((_idx, el) => {
            const postTitle = $(el).text()
            postTitles.push(postTitle)
        });

        return postTitles;
    } catch (error) {
        throw error;
    }
};

getPostTitles()
.then((postTitles) => console.log(postTitles));

getPostTitles() is an asynchronous function that will crawl the old reddit's r/programming forum. First the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library, then the html data is fed into Cheerio using the cheerio.load() function.

Then, with the help of the browser's Dev Tools, you can obtain a selector that targets all the post cards. If you've used JQuery, $('div > p.title > a') should look very familiar. This selects all the posts. Since you only want the title of each post individually, you have to loop through each post, which is done with the help of the each() function.

To extract the text out of each title, you must fetch the DOM element with the help of Cheerio (el refers to the current element). Then calling text() on each element will give you the text.

Now you can pop open a terminal and run node crawler.js and then you'll see an array of about 25 or 26 different post titles, it'll be quite long. While this is quite a simple use case, it demonstrates the simple nature of the API provided by Cheerio.

If your use case requires the execution of Javascript and the loading of external sources, then the following few options will be helpful.

JSDOM: The DOM for Node

JSDOM is a pure Javascript implementation of the Document Object Model to be used in NodeJS. As mentioned previously, the DOM is not available to Node, so JSDOM is the closest you can get. It more or less emulates the browser.

Since a DOM is created, it is possible to interact with the web application or website you want to crawl programmatically, so something like clicking on a button is possible. If you are familiar with manipulating the DOM, then using JSDOM will be quite straightforward.

const { JSDOM } = require('jsdom')
const { document } = new JSDOM(
    '<h2 class="title">Hello world</h2>'
).window
const heading = document.querySelector('.title')
heading.textContent = 'Hello there!'
heading.classList.add('welcome')

heading.outerHTML
// <h2 class="title welcome">Hello there!</h2>

As you can see, JSDOM creates a DOM and then you can manipulate this DOM with the same methods and properties you would use while manipulating the browser DOM.

To demonstrate how you could use JSDOM to interact with a website, we will get the first post of the Reddit r/programming forum and upvote it, then we will verify if the post has been upvoted.

Start by running the following command to install jsdom and axios:
npm install jsdom axios

Then make a file by the name of crawler.js and copy/paste the following code:

const { JSDOM } = require("jsdom")
const axios = require('axios')

const upvoteFirstPost = async () => {
  try {
    const { data } = await axios.get("https://old.reddit.com/r/programming/");
    const dom = new JSDOM(data, {
      runScripts: "dangerously",
      resources: "usable"
    });
    const { document } = dom.window;
    const firstPost = document.querySelector("div > div.midcol > div.arrow");
    firstPost.click();
    const isUpvoted = firstPost.classList.contains("upmod");
    const msg = isUpvoted
      ? "Post has been upvoted successfully!"
      : "The post has not been upvoted!";

    return msg;
  } catch (error) {
    throw error;
  }
};

upvoteFirstPost().then(msg => console.log(msg));

upvoteFirstPost() is an asynchronous function that will obtain the first post in r/programming and then upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding the HTML that was fetched earlier. The JSDOM constructor accepts the HTML as the first argument and the options as the second, the 2 options that have been added perform the following functions:

runScripts: When set to "dangerously", it allows the execution of event handlers and any Javascript code. If you do not have a clear idea on the credibility of the scripts that your application will run, then it is best to set runScripts to "outside-only", which attaches all the Javascript specification provided globals to the window object thus preventing any script being executed on the inside.
resources: When set to "usable", it allows the loading of any external script declared using the <script> tag (ex: the JQuery library fetched from a CDN)

Once the DOM has been created, you would use the same DOM methods to get the first post's upvote button and then click on it. To verify if it has indeed been clicked, you could check the classList for a class called upmod. If this class exists in classList, the success message is returned; otherwise, the failure message is returned.

Now you can pop open a terminal and run node crawler.js and then you'll see a neat string that will tell if the post has been upvoted or not. While this example use case is trivial, you could build on top of this to create something powerful for example, a bot that goes around upvoting a particular user's posts.

If you dislike the lack of expressiveness in JSDOM, and if your crawling relies heavily on many such manipulations or if there is a need to recreate a lot of different DOMs, then the following options will be a better match.

Puppeteer: The headless browser

Puppeteer, as the name implies, allows you to manipulate the browser programmatically just like how a puppet would be manipulated by its puppeteer. It achieves this by providing a developer with a high-level API to control a headless version of Chrome by default and can be configured to run non-headless.

Image: taken from the Puppeteer docs.

Puppeteer is particularly more useful than the aforementioned tools because it allows you to crawl the web as if a real person were interacting with a browser. This opens up a few possibilities that weren't there before:

You can get screenshots or generate PDFs of pages.
You could crawl a Single Page Application and generate pre-rendered content.
Automate a lot of different user interactions like keyboard inputs, form submissions, navigation, etc.

It could also play a big role in a lot of other tasks outside the scope of web crawling like UI testing, assist performance optimization, etc.

You will often want to take screenshots of websites, perhaps to get to know a competitor's product catalog, and Puppeteer can be used to do this. To start, install Puppeteer by running the following command:
npm install puppeteer

This will download a bundled version of Chromium which takes up about 180 MB to 300 MB depending on your Operating System. If you wish to disable this and point puppeteer to an already downloaded version of Chromium, you must set a few environment variables. This, however, is not recommended. If you truly wish to avoid downloading Chromium and puppeteer for this tutorial, you can rely on the puppeteer playground.

Let's attempt to get a screenshot and a PDF of the r/programming forum in Reddit, create a new file called crawler.js and then copy/paste the following code:

const puppeteer = require('puppeteer')

async function getVisual() {
    try {
        const URL = 'https://www.reddit.com/r/programming/'
        const browser = await puppeteer.launch()
        const page = await browser.newPage()

        await page.goto(URL)
        await page.screenshot({ path: 'screenshot.png' })
        await page.pdf({ path: 'page.pdf' })

        await browser.close()
    } catch (error) {
        console.error(error)
    }
}

getVisual()

getVisual() is an asynchronous function that takes a screenshot and a PDF of the page at the address assigned to the URL variable. To start, an instance of the browser is created by running puppeteer.launch(), then a new page is created. This page can be thought of as a tab in a regular browser. Then, by calling page.goto() with the URL as the parameter, the page that was created earlier is directed to the URL specified.

Once the page has finished loading, a screenshot and a PDF are taken using page.screenshot() and page.pdf() respectively. Finally, the browser instance is destroyed along with the page. You could also listen for the page's load event before performing these actions, which is highly recommended at a production level.

To run the code, type node crawler.js in the terminal. After a few seconds, you will notice that 2 files named screenshot.png and page.pdf have been created.

Nightmare: An alternative to Puppeteer

Nightmare is also a high-level browser automation library like Puppeteer. It uses Electron, and is said to be roughly twice as fast as its predecessor PhantomJS, and more modern.

If you dislike Puppeteer in some way or feel discouraged by the size of the Chromium bundle, then Nightmare is an ideal choice. To start, install the Nightmare library by running the following command:
npm install nightmare

Then once nightmare has been downloaded, we will use it to find ScrapingBee's website through the Google Search engine. To do so, create a file called crawler.js and then copy/paste the following code into it:

const Nightmare = require('nightmare')
const nightmare = Nightmare()

nightmare
    .goto('https://www.google.com/')
    .type("input[title='Search']", 'ScrapingBee')
    .click("input[value='Google Search']")
    .wait('#rso > div:nth-child(1) > div > div > div.r > a')
    .evaluate(
        () =>
            document.querySelector(
                '#rso > div:nth-child(1) > div > div > div.r > a'
            ).href
    )
    .end()
    .then((link) => {
        console.log('Scraping Bee Web Link:', link)
    })
    .catch((error) => {
        console.error('Search failed:', error)
    })

First, a Nightmare instance is created, then this instance is directed to the Google Search Engine by calling goto(). Once it has loaded, the search box is fetched using its selector, and the value of the search box (an input tag) is changed to "ScrapingBee". Once that is done, the search form is submitted by clicking on the "Google Search" button. Then Nightmare is told to wait until the first link has loaded, and once it has, a DOM method is used to fetch the value of the href attribute of the anchor tag that contains the link.

Finally, once everything is complete, the link is printed to the console. To run the code, type in node crawler.js to your terminal.

Summary

That was a long read! But now you understand the different ways to use NodeJS and its rich ecosystem of libraries to crawl the web any way you want. To wrap up, you learned:

✅ NodeJS is a Javascript runtime that allows Javascript to be run on the server side. It has a non-blocking nature thanks to the Event Loop.
✅ HTTP clients such as Axios, Superagent, and Request are used to send HTTP requests to a server and receive a response.
✅ Cheerio abstracts the best out of JQuery for the sole purpose of running it on the server side for web crawling, but does not execute Javascript code.
✅ JSDOM creates a DOM per the standard Javascript specification out of an HTML string and allows you to perform DOM manipulations on it.
✅ Puppeteer and Nightmare are high-level browser automation libraries that allow you to programmatically manipulate web applications as if a real person were interacting with them.

Resources

Feel like reading more? Check these links out:

NodeJS website - Contains documentation and a lot of information on how to get started.
Puppeteer docs - Contains the API reference and getting started guides.
ScrapingBee's Blog - Contains a lot of information on Web Scraping goodies on multiple platforms.

This blog post was originally posted on ScrapingBee's blog by Shenesh Perera

Easy Web Scraping With Scrapy

Kevin Sahin — Wed, 18 Dec 2019 16:22:13 +0000

In the previous post about Web Scraping with Python we talked a bit about Scrapy. In this post we are going to dig a little bit deeper into it.

Scrapy is a wonderful open source Python web scraping framework. It handles the most common use cases when doing web scraping at scale:

Concurrency (asynchronous requests)
Crawling (going from link to link)
Extracting the data
Validating
Saving to different format / databases
Many more

The main difference between Scrapy and other commonly used libraries like Requests / BeautifulSoup is that it is opinionated. It allows you to solve the usual web scraping problems in an elegant way.

The downside of Scrapy is that the learning curve is steep, there is a lot to learn, but that is what we are here for :)

In this tutorial we will create two different web scrapers, a simple one that will extract data from an E-commerce product page, and a more "complex" one that will scrape an entire E-commerce catalog!

Basic overview

You can install Scrapy using pip. Be careful though, the Scrapy documentation strongly suggests installing it in a dedicated virtual environment in order to avoid conflicts with your system packages.

I'm using Virtualenv and Virtualenvwrapper:

mkvirtualenv scrapy_env

and

pip install Scrapy

You can now create a new Scrapy project with this command:

scrapy startproject product_scraper

This will create all the necessary boilerplate files for the project.

├── product_scraper
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

Here is a brief overview of these files and folders:

items.py is a model for the extracted data. You can define a custom model (like a Product) that will inherit the scrapy Item class.
middlewares.py contains middleware used to change the request / response lifecycle. For example, you could create a middleware to rotate user-agents, or to use an API like ScrapingBee instead of doing the requests yourself.
pipelines.py In Scrapy, pipelines are used to process the extracted data, clean the HTML, validate the data, and export it to a custom format or save it to a database.
/spiders is a folder containing Spider classes. With Scrapy, Spiders are classes that define how a website should be scraped, including what link to follow and how to extract the data for those links.
scrapy.cfg is a configuration file to change some settings

Scraping a single product

In this example we are going to scrape a single product from a dummy E-commerce website. Here is the first product we are going to scrape:

https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/

We are going to extract the product name, picture, price and description.

Scrapy Shell

Scrapy comes with a built-in shell that helps you try and debug your scraping code in real time. You can quickly test your XPath expressions / CSS selectors with it. It's a very cool tool to write your web scrapers and I always use it!

You can configure Scrapy Shell to use another console instead of the default Python console like IPython. You will get autocompletion and other nice perks like colorized output.

In order to use it in your Scrapy shell, you need to add this line to your scrapy.cfg file, under the [settings] section:

shell = ipython

Once it's configured, you can start using scrapy shell:

$ scrapy shell --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x108147eb8>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x108d10978>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

We can start fetching a URL by simply:

fetch('https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/')

This will start by fetching the /robots.txt file.

[scrapy.core.engine] DEBUG: Crawled (404) <GET https://clever-lichterman-044f16.netlify.com/robots.txt> (referer: None)

In this case there isn't any robots.txt, which is why we see a 404 HTTP code. If there were a robots.txt, Scrapy would follow its rules by default.

You can disable this behavior by changing this setting in settings.py:

ROBOTSTXT_OBEY = False

Then you should have a log like this:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/> (referer: None)

You can now see your response object, response headers, and try different XPath expressions / CSS selectors to extract the data you want.

You can see the response directly in your browser with:

view(response)

Note that the page will render badly inside your browser, for lots of different reasons. This can be CORS issues, Javascript code that didn't execute, or relative URLs for assets that won't work locally.

The scrapy shell is like a regular Python shell, so don't hesitate to load your favorite scripts/function in it.

Extracting Data

Scrapy doesn't execute any Javascript by default, so if the website you are trying to scrape is using a frontend framework like Angular / React.js, you could have trouble accessing the data you want.

Now let's try some XPath expressions to extract the product title and price.

To extract the price, we select the first span inside the div with the class my-4:

In [16]: response.xpath("//div[@class='my-4']/span/text()").get()
Out[16]: '20.00$'

I could also use a CSS selector:

In [21]: response.css('.my-4 span::text').get()
Out[21]: '20.00$'

Creating a Scrapy Spider

With Scrapy, Spiders are classes where you define your crawling (what links / URLs need to be scraped) and scraping (what to extract) behavior.

Here are the different steps used by a spider to scrape a website:

It starts by looking at the class attribute start_urls, and calls these URLs with the start_requests() method. You could override this method if you need to change the HTTP verb or add some parameters to the request (for example, sending a POST request instead of a GET).
It will then generate a Request object for each URL, and send the response to the callback function parse().
The parse() method will then extract the data (in our case, the product price, image, description, title) and return either a dictionary, an Item object, a Request or an iterable.

You may wonder why the parse method can return so many different objects. It's for flexibility. Let's say you want to scrape an E-commerce website that doesn't have any sitemap. You could start by scraping the product categories, so this would be a first parse method.

This method would then yield a Request object for each product category, with a new callback method parse2().
For each category you would need to handle pagination.
Then, for each product, the actual scraping generates an Item, so that is a third parse function.

With Scrapy you can return the scraped data as a simple Python dictionary, but it is a good idea to use the built-in Scrapy Item class.
It's a simple container for our scraped data and Scrapy will look at this item's fields for many things like exporting the data to different format (JSON / CSV...), the item pipeline etc.

So here is a basic Product class:

import scrapy

class Product(scrapy.Item):
    product_url = scrapy.Field()
    price = scrapy.Field()
    title = scrapy.Field()
    img_url = scrapy.Field()

Now we can generate a spider, either with the command line helper:

scrapy genspider myspider mydomain.com

Or you can do it manually and put your Spider's code inside the /spiders directory.

There are different types of Spiders in Scrapy to solve the most common web scraping use cases:

Spider, which we will use. It takes a start_urls list and scrapes each one with a parse method.
CrawlSpider follows links defined by a set of rules
SitemapSpider extracts URLs defined in a sitemap
Many more

# -*- coding: utf-8 -*-
import scrapy

from product_scraper.items import Product

class EcomSpider(scrapy.Spider):
    name = 'ecom_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.com']
    start_urls = ['https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/']

    def parse(self, response):
        item = Product()
        item['product_url'] = response.url
        item['price'] = response.xpath("//div[@class='my-4']/span/text()").get()
        item['title'] = response.xpath('//section[1]//h2/text()').get()
        item['img_url'] = response.xpath("//div[@class='product-slider']//img/@src").get()
        return item

In this EcomSpider class, there are two required attributes:

name which is our Spider's name (that you can run using scrapy crawl spider_name)
start_urls which is the starting URL

The allowed_domains attribute is optional but important when you use a CrawlSpider that could follow links on different domains.

Then I've just populated the Product fields by using XPath expressions to extract the data I wanted as we saw earlier, and we return the item.

You can run this code as follows to export the result into JSON (you could also export to CSV):

scrapy runspider ecom_spider.py -o product.json

You should then get a nice JSON file:

[
  {
    "product_url": "https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/",
    "price": "20.00$",
    "title": "Taba Cream",
    "img_url": "https://clever-lichterman-044f16.netlify.com/images/products/product-2.png"
  }
]

Item loaders

There are two common problems that you can face while extracting data from the Web:

For the same website, the page layout and underlying HTML can be different. If you scrape an E-commerce website, you will often have a regular price and a discounted price, with different XPath / CSS selectors.
The data can be dirty and need some kind of post processing, again for an E-commerce website it could be the way the prices are displayed for example ($1.00, $1, $1,00 )

Scrapy comes with a built-in solution for this, ItemLoaders.
It's an interesting way to populate our Product object.

You can add several XPath expressions to the same Item field, and it will test them sequentially. By default, if several XPath expressions match, it will load all of the results into a list.

You can find many examples of input and output processors in the Scrapy documentation.

It's really useful when you need to transform/clean the data you extract.
For example, extracting the currency from a price, or transforming one unit into another (centimeters into meters, degrees Celsius into Fahrenheit) ...

In our webpage we can find the product title with different XPath expressions: //title and //section[1]//h2/text()

Here is how you could use an ItemLoader in this case:

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('price', "//div[@class='my-4']/span/text()")
    l.add_xpath('title', '//section[1]//h2/text()')
    l.add_xpath('title', '//title')
    l.add_value('product_url', response.url)
    return l.load_item()

Generally you only want the first matching XPath, so you will need to add output_processor=TakeFirst() to your item's field constructor.

In our case we only want the first matching XPath for each field, so a better approach would be to create our own ItemLoader and declare a default output_processor to take the first matching XPath:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

def remove_dollar_sign(value):
    return value.replace('$', '')

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(remove_dollar_sign)

I also added price_in, an input processor that removes the dollar sign from the price. I'm using MapCompose, a built-in processor that takes one or several functions and executes them sequentially. You can add as many functions as you like. The convention is to append _in or _out to your Item field's name to attach an input or output processor to it.
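To make the idea of chained processors more concrete, here is a simplified pure-Python imitation of what MapCompose and TakeFirst do. This is not Scrapy's implementation (the real MapCompose also flattens iterables returned by the functions, for instance); map_compose and take_first are hypothetical names used only for this sketch.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value in turn."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                # A function can drop a value by returning None.
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: return the first non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def remove_dollar_sign(value):
    return value.replace('$', '')


# Two functions chained: strip whitespace first, then remove the dollar sign.
price_in = map_compose(str.strip, remove_dollar_sign)
cleaned = price_in([" 20.00$ ", "15.00$"])
first_price = take_first(cleaned)
```

The input processor runs on every extracted value, while the output processor decides what finally lands in the field, here the first cleaned price.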

There are many more processors; you can learn more about them in the documentation.

Scraping multiple pages

Now that we know how to scrape a single page, it's time to learn how to scrape multiple pages, like the entire product catalog.
As we saw earlier there are different kinds of Spiders.

When you want to scrape an entire product catalog, the first thing you should look at is a sitemap. Sitemaps are built exactly for this: to show web crawlers how the website is structured.

Most of the time you can find one at base_url/sitemap.xml. Parsing a sitemap can be tricky, and again, Scrapy is here to help you with this.

In our case, you can find the sitemap here: https://clever-lichterman-044f16.netlify.com/sitemap.xml

If we look inside the sitemap, there are many URLs that we are not interested in, like the home page, blog posts, etc.:

<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/blog/post-1/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>
<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/products/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>
<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>

Fortunately, we can filter the URLs to parse only those that match a given pattern. It's really easy: here we only keep URLs that contain /products/:

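from scrapy.spiders import SitemapSpider

# Named ProductSitemapSpider so the class does not shadow the imported SitemapSpider base class
class ProductSitemapSpider(SitemapSpider):
    name = "sitemap_spider"
    sitemap_urls = ['https://clever-lichterman-044f16.netlify.com/sitemap.xml']
    # Each rule is (URL pattern, name of the callback method)
    sitemap_rules = [
        ('/products/', 'parse_product')
    ]

    def parse_product(self, response):
        # ... scrape product ...
        pass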

You can run this spider as follows to scrape all the products and export the result to a CSV file:

scrapy runspider sitemap_spider.py -o output.csv
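If you are curious about what this filtering boils down to, here is a rough standard-library sketch: parse the sitemap XML, read every loc entry and keep the URLs matching a pattern. It is not how Scrapy implements it; the sample XML is shortened from the sitemap above, and the sitemaps.org namespace on the root element is an assumption about the real file.

```python
import re
import xml.etree.ElementTree as ET

# Shortened sample; the xmlns value is the standard sitemap namespace (assumed here).
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
    https://clever-lichterman-044f16.netlify.com/blog/post-1/
    </loc>
  </url>
  <url>
    <loc>
    https://clever-lichterman-044f16.netlify.com/products/
    </loc>
  </url>
  <url>
    <loc>
    https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
    </loc>
  </url>
</urlset>"""

NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(sitemap_xml, pattern):
    """Return the sitemap URLs in which the regex pattern is found."""
    root = ET.fromstring(sitemap_xml)
    # <loc> values are surrounded by whitespace in this sitemap, so strip them.
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NAMESPACE)]
    return [url for url in urls if re.search(pattern, url)]


product_urls = filter_sitemap_urls(SITEMAP_XML, r"/products/")
```

Note that the product listing page (/products/) matches too. A stricter pattern such as r"/products/.+" would keep only the individual product pages.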

Now what if the website didn't have any sitemap? Once again, Scrapy has a solution for this!

Let me introduce you to the... CrawlSpider.

The CrawlSpider will crawl the target website starting from the start_urls list. Then, for each URL, it will extract the links to follow based on a list of Rule objects.
In our case it's easy: products all share the same URL pattern, /products/product_title, so we only need to filter for these URLs.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from product_scraper.productloader import ProductLoader
from product_scraper.items import Product

class MySpider(CrawlSpider):
    name = 'crawl_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.com']
    start_urls = ['https://clever-lichterman-044f16.netlify.com/products/']

    rules = (
        # Follow links whose URL matches 'products' and send each page to parse_product
        Rule(LinkExtractor(allow=('products', )), callback='parse_product'),
    )

    def parse_product(self, response):
        # .. parse product
        pass

As you can see, all these built-in Spiders are really easy to use. It would have been much more complex to do it from scratch.

With Scrapy you don't have to think about the crawling logic, like adding new URLs to a queue, keeping track of already parsed URLs, handling concurrency...
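To give an idea of the bookkeeping Scrapy spares you, here is a bare-bones sketch of a crawl loop with a queue and a set of already seen URLs. Everything in it is hypothetical: FAKE_SITE is an in-memory stand-in for real pages, and a real crawler would also need HTTP requests, error handling and politeness delays.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
FAKE_SITE = {
    "/products/": ["/products/taba-cream.1/", "/products/other-cream.2/", "/blog/"],
    "/products/taba-cream.1/": ["/products/", "/products/other-cream.2/"],
    "/products/other-cream.2/": ["/products/"],
    "/blog/": ["/products/"],
}


def crawl(start_url, get_links, allow="products"):
    """Breadth-first crawl that only follows links containing `allow`."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so nothing is visited twice
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if allow in link and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("/products/", lambda url: FAKE_SITE.get(url, []))
```

Even this toy version needs a queue, a seen set and a filter, and it does not fetch a single real page yet.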

Conclusion

In this post we gave a general overview of how to scrape the web with Scrapy and how it can solve your most common web scraping challenges. Of course we only scratched the surface, and there are many more interesting things to explore, like middlewares, exporters, extensions and pipelines!

If you've been doing web scraping more "manually" with tools like BeautifulSoup / Requests, it's easy to understand how Scrapy can help save time and build more maintainable scrapers.

I hope you liked this Scrapy tutorial and that it will motivate you to experiment with it.

For further reading don't hesitate to look at the great Scrapy documentation.

You can also check out our web scraping with Python tutorial to learn more about web scraping.

Happy Scraping!

Practical XPath for Web Scraping

Kevin Sahin — Thu, 07 Nov 2019 10:41:17 +0000

XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or in our case an HTML document). Even though XPath is not a programming language in itself, it allows you to write expressions that directly access a specific HTML element without having to go through the entire HTML tree.

It looks like the perfect tool for web scraping, right? At ScrapingBee we love XPath!

Why learn XPath

Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page:
It's more powerful than CSS selectors
It allows you to navigate the DOM in any direction
It can match text inside HTML elements

Entire books have been written on XPath, and I don't claim to explain everything in depth. This is an introduction to XPath, and we will see through real examples how you can use it for your web scraping needs.

But first, let's talk a little about the DOM.

Document Object Model

I am going to assume you already know HTML, so this is just a small reminder.

As you already know, a web page is a document containing text within tags, that add meaning to the document by describing elements like titles, paragraphs, lists, links etc.

Let's see a basic HTML page, to understand what the Document Object Model is.

This HTML code is basically HTML content encapsulated inside other HTML content. The HTML hierarchy can be viewed as a tree. We can already see this hierarchy through the indentation in the HTML code.

When your web browser parses this code, it will create a tree which is an object representation of the HTML document. It is called the Document Object Model.

Below is the internal tree structure as shown in the Google Chrome inspector:

On the left we can see the HTML tree, and on the right we have the Javascript object representing the currently selected element (in this case, the <p> tag), with all its attributes.

The important thing to remember is that the DOM you see in your browser, when you right-click + inspect, can be really different from the actual HTML that was sent. Maybe some Javascript code was executed and dynamically changed the DOM! For example, when you scroll on your Twitter account, a request is sent by your browser to fetch new tweets, and some Javascript code dynamically adds those new tweets to the DOM.

XPath Syntax

First let’s look at some XPath vocabulary:

• In XPath terminology, as with HTML, there are different types of nodes: root nodes, element nodes, attribute nodes, and so-called atomic values, which is a synonym for text nodes in an HTML document.

• Each element node has one parent. In this example, the section element is the parent of p, details and button.

• Element nodes can have any number of children. In our example, li elements are all children of the ul element.

• Siblings are nodes that have the same parent. p, details and button are siblings.

• Ancestors are a node’s parent, its parent’s parent, and so on.

• Descendants are a node’s children, its children’s children, and so on.

There are different types of expressions to select a node in an HTML document, here are the most important ones:

You can also use predicates to find a node that contains a specific value. Predicates are always in square brackets: [predicate]

Here are some examples:
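As a minimal sketch, predicates can be tried out with Python's built-in xml.etree.ElementTree module. Keep in mind that it only supports a limited subset of XPath and requires well-formed markup; the sample document below is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed document used only for this illustration.
DOC = """
<html>
  <body>
    <section id="main">
      <h2 class="title">Taba Cream</h2>
      <p class="price">20.00$</p>
      <ul>
        <li>First</li>
        <li>Second</li>
        <li>Third</li>
      </ul>
    </section>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Attribute predicate: the <p> whose class attribute is "price".
price = root.find(".//p[@class='price']").text

# Attribute predicate followed by a child step.
title = root.find(".//section[@id='main']/h2").text

# Position predicates are 1-based; last() selects the final matching child.
second_item = root.find(".//ul/li[2]").text
last_item = root.find(".//ul/li[last()]").text

# Child-text predicate: the <section> that has an <h2> child with this exact text.
section_id = root.find(".//section[h2='Taba Cream']").get("id")
```

For real-world HTML, which is rarely well-formed, a more complete XPath engine such as the one in lxml is a better fit.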

Now we will see some examples of XPath expressions. We can test XPath expressions inside Chrome Dev tools, so it is time to fire up Chrome.

To do so, right-click on the web page -> inspect and then cmd + f on a Mac or ctrl + f on other systems. You can then enter an XPath expression, and the match will be highlighted in the Dev tools.

Tip

In the dev tools, you can right-click on any DOM node and show its full XPath expression, which you can later refactor.

XPath with Python

There are many Python packages that allow you to use XPath expressions to select HTML elements, such as lxml, Scrapy or Selenium. In these examples, we are going to use Selenium with Chrome in headless mode. You can look at this article to set up your environment: Scraping Single Page Application with Python

E-commerce product data extraction

In this example, we are going to see how to extract e-commerce product data from eBay.com with XPath expressions.

In these three XPath expressions, we use // as an axis, meaning we select nodes anywhere in the HTML tree. Then we use a predicate [predicate] to match on specific IDs. IDs are supposed to be unique, so it's not a problem to do this.

But when you select an element by its class name, it's better to use a relative path, because the class name can be used anywhere in the DOM, so the more specific you are the better. Not only that, but when the website changes (and it will), your code will be much more resilient.

Automagically authenticate to a website

When you have to perform the same action on different websites or extract the same type of information, we can be a little smarter with our XPath expressions and write generic ones rather than a specific XPath for each website.

In order to explain this, we're going to make a "generic" authentication function that will take a login URL, a username and a password, and try to authenticate on the target website.

To auto-magically log into a website with your scrapers, the idea is:

GET /loginPage
Select the first <input type="password"> tag
Select the first <input> before it that is not hidden
Set the value attribute for both inputs
Select the enclosing form and click on the submit button.

Most login forms will have an <input type="password"> tag. So we can select this password input with a simple: //input[@type='password']

Once we have this password input, we can use a relative path to select the username/email input. It will generally be the first preceding input that isn't hidden: .//preceding::input[not(@type='hidden')]

It's really important to exclude hidden inputs, because most of the time you will have at least one hidden CSRF token input. CSRF stands for Cross Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks.

Now we need to select the enclosing form from one of the inputs:

.//ancestor::form

And with the form, we can select the submit input/button:

.//*[@type='submit']

Here is an example of such a function:

Of course it is far from perfect and it won't work everywhere, but you get the idea.
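The field-selection heuristic described above can also be tried without a browser. The sketch below applies the same idea (find the password input, then walk back to the closest preceding input that isn't hidden) using Python's built-in html.parser instead of XPath, so it is an approximation of the approach, not the Selenium function itself. The class name, function name and sample form are all made up.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the attributes of every <input> tag in document order."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))


def find_login_fields(html):
    """Return (username_attrs, password_attrs); either can be None if not found."""
    collector = InputCollector()
    collector.feed(html)
    for index, attributes in enumerate(collector.inputs):
        if attributes.get("type") == "password":
            # Walk backwards to the closest preceding input that is not hidden.
            for candidate in reversed(collector.inputs[:index]):
                if candidate.get("type") != "hidden":
                    return candidate, attributes
            return None, attributes
    return None, None


# Made-up login form with two hidden inputs, including a CSRF token.
SAMPLE_FORM = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="hidden" name="goto" value="news">
  <input type="password" name="password">
  <input type="submit" value="login">
</form>
"""

username_field, password_field = find_login_fields(SAMPLE_FORM)
```

Skipping the hidden "goto" input is what makes the username field the right pick here, which is the same reason the XPath version needs the not(@type='hidden') predicate.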

Conclusion

XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.

One of the most difficult tasks when writing XPath expressions is not the expression itself, but being precise enough to select the right element while staying resilient enough to survive DOM changes.

At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our ready-made APIs. We will discuss the differences between the two in another blog post!

I hope you enjoyed this article, next time we will talk about ... CSS selectors :)

Happy Scraping!

Discuss on HN: https://news.ycombinator.com/item?id=21452310

Serverless Web Scraping With AWS Lambda and Java

Kevin Sahin — Wed, 04 Sep 2019 09:36:20 +0000

Serverless is a term referring to the execution of code inside ephemeral containers (Function As A Service, or FaaS). It is a hot topic in 2019: after the “micro-service” hype, here come the “nano-services”!

Cloud functions can be triggered by different things such as:

An HTTP call to a REST API
A job in a message queue
A log
An IoT event

Cloud functions are a really good fit for web scraping tasks for many reasons. Web scraping is I/O bound: most of the time is spent waiting for HTTP responses, so we don’t need high-end CPU servers. Cloud functions are cheap (the first 1M requests are free, then $0.20 per million requests) and easy to set up. They are also a good fit for parallel scraping: we can run hundreds or thousands of functions at the same time for large-scale scraping.

In this introduction, we are going to see how to deploy a slightly modified version of the Craigslist scraper we made in a previous blog post to AWS Lambda using the Serverless framework.

Prerequisites

We are going to use the Serverless framework to build and deploy our project to AWS Lambda. The Serverless CLI can generate lots of boilerplate code in different languages and deploy the code to different cloud providers, like AWS, Google Cloud or Azure.

An AWS account
Node and npm
Serverless CLI, with your AWS credentials set up
Java 8
Maven

Architecture

We will build an API using API Gateway with a single endpoint, /items/{query}, bound to a Lambda function that will respond with a JSON array containing all items (on the first result page) for this query.

Here is a simple diagram for this architecture:

Create the Maven project

Serverless is able to generate projects in lots of different languages: Java, Python, NodeJS, Scala...
We are going to use one of these templates to generate a Maven project:

serverless create --template aws-java-maven --name items-api -p aws-java-scraper

You can now open this Maven project in your favorite IDE.

Configuration

The first thing to do is to change the serverless.yml config to implement an API gateway route and bind it to the handleRequest method in the Handler.java class.

service: craigslist-scraper-api 
provider:
  name: aws
  runtime: java8
  timeout: 30

package:
  artifact: target/hello-dev.jar

functions:
  getCraigsListItems:
    handler: com.serverless.Handler
    events:
    - http:
        path: /items/{searchQuery}
        method: get

I also set the timeout to 30 seconds. The default timeout with the Serverless framework is 6 seconds. Since we're running Java code, the Lambda cold start can take several seconds, and then we make an HTTP request to the Craigslist website, so 30 seconds seems reasonable.

Function code

Now we can modify the Handler class. The function logic is easy. First, we retrieve the path parameter called "searchQuery". Then we create a CraigsListScraper and call the scrape() method with this searchQuery. It will return a List<Item> representing all the items on Craigslist's first result page.

We then use the ApiGatewayResponse class that was generated by the Serverless framework to return a JSON array containing every item.

You can find the rest of the code in this repository, with the CraigsListScraper and Item class.

@Override
public ApiGatewayResponse handleRequest(Map<String, Object> input, Context context) {
    LOG.info("received: {}", input);
    try{
        Map<String,String> pathParameters = (Map<String,String>)input.get("pathParameters");
        String query = pathParameters.get("searchQuery");

        CraigsListScraper scraper = new CraigsListScraper();
        List<Item> items = scraper.scrape(query);
        return ApiGatewayResponse.builder()
            .setStatusCode(200)
            .setObjectBody(items)
            .setHeaders(Collections.singletonMap("X-Powered-By", "AWS Lambda & serverless"))
            .build();
    }catch(Exception e){
        LOG.error("Error : " + e);
        Response responseBody = new Response("Error while processing URL: ", input);
        return ApiGatewayResponse.builder()
            .setStatusCode(500)
            .setObjectBody(responseBody)
            .setHeaders(Collections.singletonMap("X-Powered-By", "AWS Lambda & Serverless"))
            .build();
    }
}

We can now build the project:

mvn clean install

And deploy it to AWS:

serverless deploy
Serverless: Packaging service...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (13.35 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.................................
Serverless: Stack update finished...
Service Information
service: items-api
stage: dev
region: us-east-1
stack: items-api-dev
api keys:
  None
endpoints:
  GET - https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/{searchQuery}
functions:
  getCraigsListItems: items-api-dev-getCraigsListItems

You can then test your function using curl or your web browser with the URL given in the deployment logs (serverless info will also show this information).

Here is a query to look for "macBook pro":

curl https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/macBook%20pro | json_reformat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19834  100 19834    0     0   7623      0  0:00:02  0:00:02 --:--:--  7622
[
    {
        "title": "2010 15\" Macbook pro 3.06ghz 8gb 320gb osx maverick",
        "price": 325,
        "url": "https://sfbay.craigslist.org/eby/sys/d/macbook-pro-306ghz-8gb-320gb/6680853189.html"
    },
    {
        "title": "Apple MacBook Pro A1502 13.3\" Late 2013 2.6GHz i5 8 GB 500GB + Extras",
        "price": 875,
        "url": "https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-alateghz-i5/6688755497.html"
    },
    {
        "title": "Apple MacBook Pro Charger USB-C (Latest Model) w/ Box - Like New!",
        "price": 50,
        "url": "https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-charger-usb/6686902986.html"
    },
    {
        "title": "MacBook Pro 13\" C2D 4GB memory 500GB HDD",
        "price": 250,
        "url": "https://sfbay.craigslist.org/eby/sys/d/macbook-pro-13-c2d-4gb-memory/6688682499.html"
    },
    {
        "title": "Macbook Pro 2011 13\"",
        "price": 475,
        "url": "https://sfbay.craigslist.org/eby/sys/d/macbook-pro/6675556875.html"
    },
    {
        "title": "Trackpad Touchpad Mouse with Cable and Screws for Apple MacBook Pro",
        "price": 39,
        "url": "https://sfbay.craigslist.org/pen/sys/d/trackpad-touchpad-mouse-with/6682812027.html"
    },
    {
        "title": "Macbook Pro 13\" i5 very clean, excellent shape! 4GB RAM, 500GB HDD",
        "price": 359,
        "url": "https://sfbay.craigslist.org/sfc/sys/d/macbook-pro-13-i5-very-clean/6686879047.html"
    },
...

Note that the first invocation will be slow; it took 7 seconds for me. The next invocations will be much quicker.

Go further

This was just a little example. Here are some ideas to improve it:

Better error handling
Protect the API with an API Key (really easy to implement with API Gateway)
Save the items to a DynamoDB database
Send the search query to an SQS queue, and trigger the lambda execution with the queue instead of an HTTP request
Send a notification with SNS if an item is priced below a certain price point.

If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check our new web scraping API, the first 1000 API calls are on us.

This is the end of this tutorial. I hope you enjoyed the post. Don't hesitate to experiment with Lambda and other cloud providers, it's really fun, easy, and can drastically reduce your infrastructure costs, especially for web-scraping or asynchronous related tasks.

Web Scraping 101 in Python

Kevin Sahin — Wed, 21 Aug 2019 10:24:15 +0000

In this post, which can be read as a follow-up to our ultimate web scraping guide, we will cover almost all the tools Python offers for web scraping. We will go from the most basic to the most advanced and cover the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should be enough to give you a good idea of which tool does what, and when to use which.

Note: when I talk about Python in this blog post, you should assume that I mean Python 3.

Table of Contents:

0) Web Fundamentals
1) Manually opening a socket and sending the HTTP request
2) urllib3 & LXML
3) requests & BeautifulSoup
4) Scrapy
5) Selenium & Chrome --headless
Conclusion

0) Web Fundamentals

The internet is really complex: there are many underlying technologies and concepts involved in viewing a simple web page in your browser. I don’t claim to explain everything, but I will show you the most important things you have to understand in order to extract data from the web.

HyperText Transfer Protocol

HTTP uses a client/server model, where an HTTP client (a browser, your Python program, curl, Requests...) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache...).

Then the server answers with a response (the HTML code, for example) and closes the connection. HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful.

Basically, when you type a website address in your browser, the HTTP request looks like this:

In this request, you can see multiple things, starting with the first line:

The GET verb or method being used, meaning we request data from the specific path: /product/. There are other HTTP verbs, you can see the full list here.
The version of the HTTP protocol. In this tutorial we will focus on HTTP 1.
Multiple header fields
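To make that structure concrete, here is a small sketch that takes the text of a request apart: the request line (method, path, version) first, then the header fields. The request itself is hypothetical; the host and header values are made up for this example.

```python
# Hypothetical request text; the host and header values are made up.
RAW_REQUEST = (
    "GET /product/ HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Accept: text/html\r\n"
    "User-Agent: my-scraper/0.1\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw HTTP/1.x request into (method, path, version, headers)."""
    # An empty line separates the head of the request from the (optional) body.
    head, _, _body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers


method, path, version, headers = parse_request(RAW_REQUEST)
```

Every line ends with \r\n, and the empty line at the end is what tells the server that the header section is over.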

Here are the most important header fields:

Host: The domain name of the server. If no port number is given, it is assumed to be 80.
User-Agent: Contains information about the client originating the request, including the OS information. In this case, it is my web browser (Chrome), on OSX. This header is important because it is either used for statistics (how many users visit my website on mobile vs desktop) or to prevent violations by bots. Because these headers are sent by the client, they can be modified (this is called “header spoofing”), and that is exactly what we will do to make our scrapers look like a normal web browser.
Accept: The content types that are acceptable as a response. There are lots of different content types and sub-types: text/plain, text/html, image/jpeg, application/json ...
Cookie: name1=value1;name2=value2... This header field contains a list of name-value pairs. These are called session cookies and are used to store data. Cookies are what websites use to authenticate users and/or store data in your browser. For example, when you fill in a login form, the server will check if the credentials you entered are correct. If so, it will redirect you and inject a session cookie in your browser. Your browser will then send this cookie with every subsequent request to that server.
Referer: The Referer header (the misspelling is part of the HTTP standard) contains the URL from which the actual URL has been requested. This header is important because websites use it to change their behavior based on where the user came from. For example, lots of news websites have a paid subscription and let you view only 10% of a post, but if the user came from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.

And the list goes on... you can find the full header list here.
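As a quick illustration of the Cookie header format, the standard library's http.cookies module can turn a header value into name-value pairs, and it is just as easy to go the other way. The cookie names and values are the placeholder ones used above.

```python
from http.cookies import SimpleCookie

cookie_header = "name1=value1; name2=value2"

# Parse the header value into individual cookies.
jar = SimpleCookie()
jar.load(cookie_header)
cookies = {name: morsel.value for name, morsel in jar.items()}

# Going the other way: build the header value a client would send.
rebuilt = "; ".join(f"{name}={value}" for name, value in cookies.items())
```

This is essentially what a scraper has to do by hand when the HTTP library it uses does not manage cookies for it.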

A server will respond with something like this:

On the first line, we have a new piece of information, the HTTP code 200 OK. It means the request has succeeded. As with request headers, there are lots of HTTP codes, split into four common classes: 2XX for successful requests, 3XX for redirects, 4XX for bad requests (the most famous being 404 Not Found), and 5XX for server errors.
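In a scraper you usually branch on the class of the status code rather than on each individual code. Here is a minimal sketch using the standard library's http.HTTPStatus; the describe_status helper and its category labels are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard phrase and the class it belongs to."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    category = classes.get(code // 100, "other")
    # HTTPStatus(code) raises ValueError for codes it does not know about.
    return f"{code} {HTTPStatus(code).phrase}: {category}"


ok = describe_status(200)
missing = describe_status(404)
```

Dividing the code by 100 is enough to tell the classes apart, which is handy when deciding whether to retry a request (5XX) or give up (4XX).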

Then, if you are sending this HTTP request with your web browser, the browser will parse the HTML code, fetch any assets it needs (Javascript files, CSS files, images...) and render the result into the main window.

In the next parts we will see the different ways to perform HTTP requests with Python and extract the data we want from the responses.

1) Manually opening a socket and sending the HTTP request

Socket

The most basic way to perform an HTTP request in Python is to open a socket and manually send the HTTP request.
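Here is a minimal sketch of what that can look like with the standard socket module. The helper names are made up, and it only handles plain HTTP on port 80 (no TLS, no redirects, no chunked bodies). Nothing is sent until you call http_get yourself.

```python
import socket


def build_request(host, path="/"):
    """Build the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close, so recv() eventually returns b""
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split raw response bytes into (status_line, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line, body


def http_get(host, path="/", port=80):
    """Send the request over a plain TCP socket and return (status_line, body)."""
    with socket.create_connection((host, port), timeout=10) as connection:
        connection.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = connection.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling http_get("www.google.com") should return a status line and the raw body bytes, which you would then still have to decode and parse yourself.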

Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions.

Regular Expressions

A regular expression (RE, or Regex) is a search pattern for strings. With regex, you can search for a particular character/word inside a bigger body of text.

For example, you could identify all phone numbers inside a web page. You can also replace items, for example, you could replace all uppercase tags in poorly formatted HTML with lowercase ones. You can also validate some inputs...

The pattern used by the regex is applied from left to right. Each source character is only used once. You may be wondering why it is important to know about regular expressions when doing web scraping?

After all, there are all kinds of Python modules to parse HTML with XPath or CSS selectors.

In an ideal semantic world, data is easily machine-readable and the information is embedded inside relevant HTML elements with meaningful attributes.

But the real world is messy. You will often find huge amounts of text inside a p element. When you want to extract a specific piece of data from this text, for example a price, a date, a name... you will have to use regular expressions.

Note: Here is a great website to test your regex: https://regex101.com/ and one awesome blog to learn more about them. This post will only cover a small fraction of what you can do with regexp.

Regular expressions can be useful when you have this kind of data:

<p>Price : 19.99$</p>

We could select this text node with an XPath expression, and then use this kind of regex to extract the price:

^Price\s:\s(\d+\.\d{2})\$

To extract the text inside an HTML tag, it is annoying to use a regex, but doable:
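Both steps can be sketched with Python's re module. The price pattern is the one shown above; the `<p>(.*?)</p>` pattern used to get the text out of the tag is an assumption made for this example, and it is fragile (it breaks on nested tags, attributes or line breaks).

```python
import re

html = "<p>Price : 19.99$</p>"

# Step 1: pull the text out of the <p> tag with a non-greedy capture group.
text_match = re.search(r"<p>(.*?)</p>", html)
text = text_match.group(1)

# Step 2: apply the price pattern to that text and capture the amount.
price_match = re.search(r"^Price\s:\s(\d+\.\d{2})\$", text)
price = price_match.group(1)
```

The parentheses define capture groups, so group(1) returns only the part we care about and not the whole match.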

As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that can make this task easier.

2) urllib3 & LXML

Disclaimer: It is easy to get lost in the urllib universe in Python. You have urllib and urllib2, which are part of the standard library. You can also find urllib3. urllib2 was split into multiple modules in Python 3, and urllib3 should not become part of the standard library anytime soon. This whole confusing thing will be the subject of a blog post by itself. In this part, I've made the choice to only talk about urllib3, as it is widely used in the Python world, by pip and requests to name only two.

Urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. It lets us do what we did above with socket in far fewer lines of code.

Much more concise than the socket version. Not only that, but the API is straightforward and you can do many things easily, like adding HTTP headers, using a proxy, POSTing forms ...

For example, had we decided to set some headers and to use a proxy, we would only have to do this.

See? Exactly the same number of lines. However, there are some things that urllib3 does not handle very easily. For example, if we want to add a cookie, we have to manually create the corresponding header and add it to the request.

There are also things that urllib3 can do that requests can't, for example creating and managing connection pools and proxy pools, or controlling the retry strategy.

To put it simply, urllib3 sits between requests and socket in terms of abstraction, although way closer to requests than to socket.

This time, to parse the response, we are going to use the lxml package and XPath expressions.

XPath

XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or HTML document). As with the Document Object Model, XPath has been a W3C standard since 1999. Even though XPath is not a programming language in itself, it allows you to write expressions that directly access a specific node, or a specific node-set, without having to go through the entire HTML tree (or XML tree).

Think of XPath as regexp, but specifically for XML/HTML.

To extract data from an HTML document with XPath we need 3 things:

an HTML document
some XPath expressions
an XPath engine that will run those expressions

To begin, we will use the HTML that we got thanks to urllib3. We just want to extract all the links from the Google homepage, so we will use one simple XPath expression, //a, and we will use LXML to run it. LXML is a fast and easy to use XML and HTML processing library that supports XPath.

Installation:

pip install lxml

Below is the code that comes just after the previous snippet:

And the output should look like this:

You have to keep in mind that this example is really really simple and doesn't really show you how powerful XPath can be (note: this XPath expression should have been changed to //a/@href to avoid having to iterate on links to get their href ).

If you want to learn more about XPath you can read this good introduction. The LXML documentation is also well written and is a good starting point.

XPath expressions, like regexp, are really powerful and one of the fastest ways to extract information from HTML. And like regexp, XPath can quickly become messy, hard to read and hard to maintain.

3) requests & BeautifulSoup

Requests is the king of Python packages. With more than 11,000,000 downloads, it is the most widely used package for Python.

Installation:

pip install requests

Making a request with Requests (no comment) is really easy:

With Requests it is easy to perform POST requests, handle cookies, query parameters...

Authentication to Hacker News

Let's say we want to create a tool to automatically submit our blog post to Hacker News or any other forums, like Buffer. We would need to authenticate to those websites before posting our link. That's what we are going to do with Requests and BeautifulSoup!

Here is the Hacker News login form and the associated DOM:

There are three <input> tags on this form. The first one has a type hidden with a name "goto", and the two others are the username and password.

If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect happens and a cookie is set. This cookie will be sent by Chrome on each subsequent request in order for the server to know that you are authenticated.

Doing this with Requests is easy, it will handle redirects automatically for us, and handling cookies can be done with the Session object.

The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.

Installation:

pip install beautifulsoup4

So all we have to do is to POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:

To learn more about BeautifulSoup, we could try to extract every link on the homepage.

By the way, Hacker News offers a powerful API, so we're doing this as an example, but you should use the API instead of scraping it!

The first thing we need to do is inspect the Hacker News home page to understand its structure and the different CSS classes that we will have to select:

We can see that all posts are inside a <tr class="athing">, so the first thing we need to do is select all these tags. This can easily be done with:

links = soup.find_all('tr', class_='athing')

Then for each link, we will extract its id, title, url and rank:

As you saw, Requests and BeautifulSoup are great libraries to extract data and automate different things by posting forms. If you want to do large-scale web scraping projects, you could still use Requests, but you would need to handle lots of things yourself.

When you need to scrape a lot of webpages, there are many things you have to take care of:

finding a way of parallelizing your code to make it faster
handling errors
storing results
filtering results
throttling your requests so you don't overload the server
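To make that list concrete, here is a minimal sketch of what "doing it yourself" could look like with only the Python standard library. The names crawl and fake_fetch are made up for this illustration, and fake_fetch stands in for a real HTTP call so the example runs without touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl(urls, fetch, max_workers=4, delay=0.0):
    """Fetch every URL in parallel and return (results, errors) dictionaries."""
    results, errors = {}, {}

    def throttled_fetch(url):
        time.sleep(delay)  # crude throttle: each request waits before it is sent
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(throttled_fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                body = future.result()
            except Exception as exc:  # handle errors per URL instead of crashing
                errors[url] = exc
                continue
            if body:  # filter out empty responses
                results[url] = body  # store the result
    return results, errors


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP call such as requests.get(url).text."""
    if "broken" in url:
        raise ValueError("could not fetch " + url)
    return "<html>" + url + "</html>"


results, errors = crawl(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
)
print(sorted(results))
print(sorted(errors))
```

Even this small version leaves out retries, persistent storage and adaptive throttling, which is where a framework starts to pay off.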

Fortunately for us, tools exist that can handle those things for us.

4) Scrapy

Scrapy is a powerful Python web scraping framework. It provides many features to download web pages asynchronously, process them and save the results. It handles concurrent requests, crawling (the process of going from link to link to find every URL on a website), sitemap crawling and much more.

Scrapy also has an interactive mode called the Scrapy Shell. With the Scrapy Shell you can test your scraping code really quickly, like XPath expressions or CSS selectors.

The downside of Scrapy is that the learning curve is steep: there is a lot to learn.

To follow up on our Hacker News example, we are going to write a Scrapy Spider that scrapes the first 15 pages of results, and saves everything in a CSV file.

You can easily install Scrapy with pip:

pip install Scrapy

Then you can use the Scrapy CLI to generate the boilerplate code for our project:

scrapy startproject hacker_news_scraper

Inside hacker_news_scraper/spiders we will create a new Python file with our Spider's code:

There are a lot of conventions in Scrapy. Here we define a list of starting URLs. The name attribute will be used to call our Spider with the Scrapy command line.

The parse method will be called on each URL in the start_urls list.

We then need to tune Scrapy a little bit in order for our Spider to behave nicely against the target website.

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5

You should always turn this on. It makes sure the target website is not slowed down by your spiders, by analyzing the response time and adapting the number of concurrent requests.

You can run this code with the Scrapy CLI and with different output formats (CSV, JSON, XML...):

scrapy crawl hacker-news -o links.json

And that's it! You will now have all your links in a nicely formatted JSON file.

5) Selenium & headless Chrome

Scrapy is really nice for large-scale web scraping tasks, but it is not enough if you need to scrape a Single Page Application written with a Javascript framework, because it won't be able to render the Javascript code.

It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code, meaning manually inspecting all the network calls with your browser inspector, and replicating the AJAX calls containing the interesting data.

In some cases, there are just too many asynchronous HTTP calls involved to get the data you want and it can be easier to just render the page in a headless browser.

Another great use case would be to take a screenshot of a page, and this is what we are going to do with the Hacker News homepage (again!).

You can install the selenium package with pip:

pip install selenium

You will also need Chromedriver:

brew install chromedriver

Then we just have to import the Webdriver from the selenium package, configure Chrome with headless=True and set a window size (otherwise it is really small):

You should get a nice screenshot of the homepage:

You can do much more with the Selenium API and Chrome, such as:

Executing Javascript
Filling forms
Clicking on Elements
Extracting elements with CSS selectors / XPath expressions

Selenium and Chrome in headless mode are really the ultimate combination to scrape anything you want. You can automate anything that you could do with your regular Chrome browser.

The big drawback is that Chrome needs lots of memory and CPU power. With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need 1 CPU core per instance.

If you want to run several Chrome instances concurrently, you will need powerful servers (the cost goes up quickly) and constant monitoring of resources.

Conclusion:

Here is a quick recap table of every technology we discussed in this post. Do not hesitate to tell us in the comments if you know of resources that you feel belong here.

I hope that this overview will help you best choose your Python scraping tools and that you learned things reading this post.

Every tool I talked about in this post will be the subject of a specific blog post in the future where I'll go deep into the details.

Everything I talked about in this post is something I used to build ScrapingBee, the simplest web scraping API out there. Do not hesitate to test our solution if you don’t want to lose too much time setting everything up, the first 1k API calls are on us 😊.

Do not hesitate to tell me in the comments what you'd like to know about scraping, I'll talk about it in my next post.

Happy Scraping

A guide to Web scraping without getting blocked

Pierre — Wed, 31 Jul 2019 16:30:25 +0000

Web scraping or crawling is the act of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.

But you should use an API for this!

Not every website offers an API, and APIs don't always expose every piece of information you need. So web scraping is often the only solution to extract website data.

There are many use cases for web scraping:

E-commerce price monitoring
News aggregation
Lead generation
SEO (Search engine result page monitoring)
Bank account aggregation (Mint in the US, Bankin' in Europe)
Dataset building, for the many individuals and researchers who need data that is not otherwise available

So, what is the problem?

The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except for Google, they all want to be scraped by Google).

So, when you scrape, you have to be careful not to be recognized as a robot, by basically doing two things: using human tools and behaving like a human. This post will guide you through all the things you can use to cover yourself and through all the tools websites use to block you.

Emulate human tools, i.e. Headless Chrome

Why use headless browsing?

When you open your browser and go to a webpage, it almost always means that you are asking an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.

The thing is, if you just run curl www.google.com, Google has many ways to know that you are not a human, just by looking at the headers for example. Headers are small pieces of information that go with every HTTP request that hits the servers. One of those pieces of information precisely describes the client making the request: the "User-Agent" header. Just by looking at the "User-Agent" header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great, and to experiment, just go over here, a webpage that simply displays the header information of your request.

Headers are really easy to alter with cURL, and copying the User-Agent header of a legit browser could do the trick. In the real world, you'd need to set more than just one header, but more generally it is not very difficult to craft an HTTP request with cURL or any library that looks exactly like a request made with a browser. Everybody knows that, so to know whether you are using a real browser, websites check one thing that cURL and HTTP libraries cannot do: JS execution.
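The same header trick works from code. Here is a minimal sketch using Python's standard urllib instead of cURL; the header values are only examples of what a desktop Chrome might send (they are assumptions, not a guaranteed-to-pass set), and nothing is sent over the network.

```python
import urllib.request

# Example values copied from a desktop browser; treat them as placeholders.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers)


request = build_request("https://www.example.com/")

# urllib normalizes header names to "Capitalized-lowercase" form internally.
print(request.get_header("User-agent"))

# To actually send it: urllib.request.urlopen(request).read()
```

Without these headers, urllib announces itself with a Python-specific User-Agent, which is just as easy to spot as cURL's.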

Do you speak JS?

The concept is very simple: the website embeds a little snippet of JS in its webpage that, once executed, will "unlock" the webpage. If you are using a real browser, you won't notice the difference, but if you're not, all you'll receive is an HTML page with some obscure JS in it.

(an actual example of such a snippet)

But this solution is not completely bulletproof either, mainly because since NodeJS it has been very easy to execute JS outside of a browser. Once again, though, the web has evolved and there are other tricks to determine whether you are using a real browser or not.

Headless Browsing

Trying to execute JS snippets on the side with Node is really difficult and not robust at all. More importantly, as soon as the website has a more complicated check system or is a big single-page application, cURL and pseudo-JS execution with Node become useless. So the best way to look like a real browser is to actually use one.

Headless browsers behave "exactly" like real browsers, except that you can easily drive them programmatically. The most used is Chrome Headless, a Chrome option that has the behavior of Chrome without all the UI wrapping it.

The easiest way to use Headless Chrome is to call a driver that wraps all its functionality into an easy API. Selenium and Puppeteer are the two most famous solutions.

However, this will not be enough, as websites now have tools that allow them to detect a headless browser. This arms race has been going on for a long time.

Fingerprinting

Everyone, and front-end developers most of all, knows that every browser behaves differently. Sometimes it is about rendering CSS, sometimes JS, sometimes just internal properties. Most of those differences are well known, and it is now possible to detect whether a browser is actually what it pretends to be. In other words, the website asks itself: "do all the browser properties and behaviors match what I know about the User-Agent sent by this browser?".

This is why there is an everlasting arms race between scrapers who want to pass themselves off as a real browser and websites that want to distinguish headless browsers from the rest.

However, in this arms race, web scrapers tend to have a big advantage and here is why.

Most of the time, when Javascript code tries to detect whether it is being run in headless mode, it is malware trying to evade behavioral fingerprinting, meaning that the JS will behave nicely inside a scanning environment and badly inside real browsers. This is why the team behind Chrome's headless mode is trying to make it indistinguishable from a real user's web browser, in order to stop malware from doing that. And this is why web scrapers can profit from this effort in the arms race.

Another thing to know is that, whereas running 20 cURL requests in parallel is trivial, Chrome Headless, while relatively easy to use for small use cases, can be tricky to run at scale, mainly because it uses lots of RAM, so managing more than 20 instances of it is a challenge.

If you want to learn more about browser fingerprinting, I suggest you take a look at Antoine Vastel's blog, a blog entirely dedicated to this subject.

That's about all you need to know to understand how to pretend you are using a real browser. Let's now take a look at how to behave like a real human.

Emulate human behaviour, i.e. proxies, CAPTCHA solving and request patterns

Proxy yourself

A human using a real browser will rarely request 20 pages per second from the same website, so if you want to request a lot of pages from the same website, you have to trick this website into thinking that all those requests come from different places in the world, i.e. different IP addresses. In other words, you need to use proxies.

Proxies are not very expensive nowadays: ~$1 per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxy IPs need to be constantly monitored in order to discard the ones that are not working anymore and replace them.
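The monitoring part is easy to underestimate, so here is a minimal sketch of a rotating pool that drops dead proxies. ProxyPool and opener_for are names invented for this example, and the proxy addresses are placeholders from a documentation-only IP range.

```python
import urllib.request


class ProxyPool:
    """Round-robin pool that forgets proxies reported as dead."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._next = 0

    def get(self):
        if not self._proxies:
            raise RuntimeError("no working proxy left")
        proxy = self._proxies[self._next % len(self._proxies)]
        self._next += 1
        return proxy

    def discard(self, proxy):
        """Remove a proxy that timed out or got banned."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)

    def __len__(self):
        return len(self._proxies)


def opener_for(proxy):
    """Build a urllib opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


pool = ProxyPool([
    "http://203.0.113.1:8080",  # placeholder addresses, not real proxies
    "http://203.0.113.2:8080",
    "http://203.0.113.3:8080",
])
first = pool.get()
second = pool.get()
pool.discard(second)  # pretend the second proxy stopped responding
remaining = {pool.get() for _ in range(4)}
print(first, len(pool), sorted(remaining))

# A real request would then look like: opener_for(pool.get()).open(url, timeout=10)
```

In a real scraper you would call discard() from the error handler of your requests and top the pool up with fresh addresses from your provider.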

There are several proxy solutions in the market, here are the most used: Luminati Network, Blazing SEO and SmartProxy.

There are also a lot of free proxy lists, but I don’t recommend using them because they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. These free proxy lists are usually public, and therefore their IPs will be automatically banned by most websites. Proxy quality is very important: anti-crawling services are known to maintain internal lists of proxy IPs, so all traffic coming from those IPs will also be blocked. Be careful to choose a proxy provider with a good reputation. This is why I recommend using a paid proxy network or building your own.

To build your own, you could take a look at scrapoxy, a great open-source API that allows you to build a proxy API on top of different cloud providers. Scrapoxy will create a proxy pool by creating instances on various cloud providers (AWS, OVH, Digital Ocean). Then you will be able to configure your client so it uses the Scrapoxy URL as the main proxy, and Scrapoxy will automatically assign a proxy from the proxy pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist...) but can be a little tedious to put in place.

You could also use the TOR network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance and traffic analysis very difficult. There are a lot of use cases for TOR, such as privacy, freedom of speech, journalists working under dictatorial regimes, and of course, illegal activities. In the context of web scraping, TOR can hide your IP address, and change your bot’s IP address every 10 minutes. However, the IP addresses of the TOR exit nodes are public. Some websites block TOR traffic using a simple rule: if the server receives a request from one of the public TOR exit nodes, it will block it. That’s why, in many cases, TOR won’t help you compared to classic proxies. It's worth noting that traffic through TOR is also inherently much slower because it is routed through multiple servers.

Captchas

But sometimes proxies will not be enough: some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the other cases, you'll need to use a CAPTCHA-solving service (2Captchas and DeathByCaptchas come to mind).

You have to know that while some CAPTCHAs can be automatically solved with optical character recognition (OCR), the most recent ones have to be solved by hand.

Old captcha, breakable programmatically

Google ReCaptcha V2

What this means is that if you use the aforementioned services, on the other side of the API call you'll have hundreds of people solving CAPTCHAs for as little as 20 cents an hour.

But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your little scraping job.

Request Pattern

One last advanced tool used by websites to detect scraping is pattern recognition. So if you plan to scrape every id from 1 to 10 000 for the URL www.example.com/product/, try not to do it sequentially and at a constant request rate. You could, for example, maintain a set of integers going from 1 to 10 000, randomly pick one integer from this set, and then scrape the corresponding product.
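Here is a minimal sketch of that idea in Python. The function names and the base and jitter values are assumptions chosen for the example; the actual request is left out so the snippet runs offline.

```python
import random


def randomized_ids(first, last, seed=None):
    """Return every id in [first, last] exactly once, in random order."""
    ids = list(range(first, last + 1))
    random.Random(seed).shuffle(ids)
    return ids


def polite_delay(base=2.0, jitter=1.0):
    """Pick a delay in seconds between base - jitter and base + jitter."""
    return random.uniform(base - jitter, base + jitter)


ids = randomized_ids(1, 10000)
for product_id in ids[:3]:
    url = "https://www.example.com/product/%d" % product_id
    # A real scraper would fetch `url` here, then time.sleep(polite_delay()).
    print(url, "then wait %.1fs" % polite_delay())
```

Shuffling a full list guarantees that every product is visited exactly once, which picking random ids over and over would not.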

This is just one simple example. Some websites also compute statistics on browser fingerprints per endpoint, which means that if you don't change some parameters in your headless browser and target a single endpoint, they might block you anyway.

Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam, for example.

But from experience, I can tell you that rate is the most important factor in "Request Pattern Recognition", so the slower you scrape, the less chance you have of being discovered.

Conclusion

I hope that this overview will help you better understand web scraping and that you learned things reading this post.

Do not hesitate to tell me in the comments what you'd like to know about scraping, I'll talk about it in my next post.
Happy scraping 😎

Scraping single page applications with ease.

Kevin Sahin — Sun, 26 May 2019 17:22:14 +0000

Dealing with a website that uses lots of Javascript to render their content can be tricky. These days, more and more sites are using frameworks like Angular, React, Vue.js for their frontend.

These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API.

So basically the problem that you will encounter is that your headless browser will download the HTML code, and the Javascript code, but will not be able to execute the full Javascript code, and the webpage will not be totally rendered.

There are some solutions to these problems. The first one is to use a better headless browser. And the second one is to inspect the API calls that are made by the Javascript frontend and to reproduce them.

It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code, meaning manually inspecting all the network calls with your browser inspector, and replicating the AJAX calls containing interesting data.

So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser, capable of interpreting and executing all the Javascript code in order to render the page. That is what the next part is about.

Headless Chrome with Python

PhantomJS was the leader in this space; it was (and still is) heavily used for browser automation and testing. After hearing the news about the release of headless mode in Chrome, the PhantomJS maintainer announced that he was stepping down, because, I quote, “Google Chrome is faster and more stable than PhantomJS [...]”. It looks like Chrome in headless mode is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.

Prerequisites

You will need to install the selenium package:

pip install selenium

And of course, you need a Chrome browser, and Chromedriver installed on your system.

On macOS, you can simply use brew:

brew install chromedriver

Taking a screenshot

We are going to use Chrome to take a screenshot of Nintendo's home page, which uses lots of Javascript.

> chrome.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=r'/usr/local/bin/chromedriver')
driver.get("https://www.nintendo.com/")
driver.save_screenshot('screenshot.png')
driver.quit()

The code is really straightforward. I just added a --window-size parameter because the default size was too small.

You should now have a nice screenshot of Nintendo's home page:

Waiting for the page load

Most of the time, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to complete to get the fully rendered page.

A simple solution to this is to just time.sleep() an arbitrary amount of time. The problem with this method is that you end up waiting either too long or not long enough, depending on your latency and internet connection speed.

The other solution is to use the WebDriverWait object from the Selenium API:

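from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

delay = 10  # maximum number of seconds to wait (example value)

try:
    # `driver` is the webdriver.Chrome instance created in chrome.py
    elem = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.NAME, 'chart'))
    )
    print("Page is ready!")
except TimeoutException:
    print("Timeout")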

This is a great solution because it only waits as long as necessary for the element to be rendered on the page, and gives up once the delay is exceeded.
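If you are curious about what happens under the hood, an explicit wait is conceptually just a polling loop with a deadline. The sketch below is my own simplified illustration of that idea in plain Python, not Selenium's actual implementation; wait_until and element_is_rendered are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met after %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Simulate an element that only shows up on the third check.
attempts = {"count": 0}


def element_is_rendered():
    attempts["count"] += 1
    return "chart" if attempts["count"] >= 3 else None


found = wait_until(element_is_rendered, timeout=2.0, poll_interval=0.01)
print(found, "after", attempts["count"], "checks")
```

Compared to a fixed time.sleep(), the loop returns as soon as the condition holds, so the wait is never longer than one polling interval past the moment the element appears.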

Conclusion

As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is to manage it in production. If you scrape lots of different websites, the resource usage will be volatile.

This means there will be CPU and memory spikes, just like with a regular Chrome browser. After all, your Chrome instance will execute untrusted and unpredictable third-party Javascript code! Then there is also the zombie-process problem.

This is one of the reasons I started ScrapingBee, so that developers can focus on extracting the data they want, not managing headless browsers and proxies!

This was my first post about scraping, I hope you enjoyed it!

If you did please let me know, I'll write more 😊

If you want to know more about ScrapingBee, you can 👉 here

Introduction to Web Scraping With Java

Kevin Sahin — Wed, 13 Mar 2019 16:46:23 +0000

Web scraping or crawling is the act of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.

Since not every website offers a clean API, or an API at all, web scraping can be the only solution when it comes to extracting website information.
Lots of companies use it for competitor price monitoring, news aggregation, mass email collection…

Almost everything can be extracted from HTML. The only information that is “difficult” to extract is inside images or other media.

In this post, we are going to see basic techniques in order to fetch and parse data in Java.

Prerequisites

Basic Java understanding
Basic XPath

Tools

You will need Java 8 with HtmlUnit

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.19</version>
</dependency>

If you are using Eclipse, I suggest you configure the max length in the detail pane (when you click in the variables tab) so that you can see the entire HTML of your current page.

Let's scrape Craigslist

For our first example, we are going to fetch items from Craigslist, since they don't seem to offer an API, collect names, prices, and images, and export them to JSON.

First, let's take a look at what happens when you search for an item on Craigslist. Open Chrome Dev tools and click on the Network tab :

The search URL is :

https://newyork.craigslist.org/search/moa?is_paid=all&search_distance_type=mi&query=iphone+6s

You can also use

https://newyork.craigslist.org/search/sss?sort=rel&query=iphone+6s

Now open your favorite IDE, it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (proxy settings, browser, redirects enabled...).

We are going to disable Javascript since it's not required for our example, and disabling Javascript makes the page load faster :

String searchQuery = "Iphone 6s" ;

WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
  String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
  HtmlPage page = client.getPage(searchUrl);
} catch (Exception e) {
  e.printStackTrace();
}

The HtmlPage object will contain the HTML code, which you can access with the asXml() method.

Now we are going to fetch titles, images, and prices. We need to inspect the DOM structure for an item :

With HtmlUnit you have several options to select an html tag :

getHtmlElementById(String id)
getFirstByXPath(String Xpath)
getByXPath(String XPath) which returns a List
many others, rtfm !

Since there isn't any ID we could use, we have to make an Xpath expression to select the tags we want.

XPath is a query language to select XML nodes (HTML in our case).
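If you have never used XPath, the expressions used in this post can be tried out on a tiny fragment first. The sketch below uses Python's built-in ElementTree purely as a scratchpad: it only supports a subset of XPath and needs well-formed markup, and the fragment itself is made up to mimic the Craigslist structure.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment imitating two search results.
markup = """
<ul>
  <li class="result-row">
    <a href="/item/1"><span class="result-price">$250</span></a>
    <p class="result-info"><a href="/item/1">iPhone 6s 64GB</a></p>
  </li>
  <li class="result-row">
    <p class="result-info"><a href="/item/2">iPhone 6s for parts</a></p>
  </li>
</ul>
""".strip()

root = ET.fromstring(markup)

rows = []
for item in root.findall(".//li[@class='result-row']"):
    anchor = item.find("./p[@class='result-info']/a")
    price = item.find("./a/span[@class='result-price']")
    # The second item has no price node, so find() returns None for it.
    rows.append((anchor.text, anchor.get("href"), None if price is None else price.text))

for row in rows:
    print(row)
```

The predicate [@class='result-row'] keeps only the nodes whose class attribute equals that exact value, which is the same mechanism the HtmlUnit calls rely on.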

First, we are going to select all the <li> tags that have the class result-row.

Then we will iterate through this list, and for each item select the name, price, and URL, and then print it.

List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']") ;
if(items.isEmpty()){
  System.out.println("No items found !");
}else{
  for(HtmlElement htmlItem : items){
    HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));

    HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']")) ;

    String itemName = itemAnchor.asText();
    String itemUrl = itemAnchor.getHrefAttribute();

    // It is possible that an item doesn't have any price
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText() ;

    // Arguments are passed in the same order as the placeholders
    System.out.println( String.format("Name : %s Url : %s Price : %s", itemName, itemUrl, itemPrice));
  }
}

Then, instead of just printing the results, we are going to put them in JSON, using the Jackson library to map items to JSON format.

We need a POJO (plain old Java object) to represent items:

Item.java

public class Item {
    private String title ; 
    private BigDecimal price ;
    private String url ;
//getters and setters
}

Then add this to your pom.xml :

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.7.0</version>
</dependency>

Now, all we have to do is adapt the previous code a little bit: create an Item, set its attributes, and convert it to a JSON string (or a file...):

for(HtmlElement htmlItem : items){
   HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));

   HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']")) ;

   // It is possible that an item doesn't have any price,
   // we set the price to 0.0 in this case
   String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText() ;

   Item item = new Item();

   item.setTitle(itemAnchor.asText());
   item.setUrl(baseUrl + itemAnchor.getHrefAttribute());

   item.setPrice(new BigDecimal(itemPrice.replace("$", "")));

   ObjectMapper mapper = new ObjectMapper();
   String jsonString = mapper.writeValueAsString(item) ;

   System.out.println(jsonString);
}
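One detail worth watching is the price cleanup: stripping only the "$" sign leaves thousands separators in place, and a value like "$1,200" would then fail to parse as a number. Here is a hedged sketch of a more forgiving cleanup, written in Python for brevity; parse_price is a name made up for this example, and the same regular expression idea carries over to Java.

```python
import re
from decimal import Decimal


def parse_price(text, default=Decimal("0.0")):
    """Turn a scraped price such as '$1,200' into a Decimal, or return `default`."""
    if not text:
        return default
    # Keep digits and the decimal point, drop currency signs, commas and spaces.
    cleaned = re.sub(r"[^\d.]", "", text)
    return Decimal(cleaned) if cleaned else default


for raw in ["$250", "$1,200", " $19.99 ", "", None, "free"]:
    print(repr(raw), "->", parse_price(raw))
```

This assumes US-style formatting (comma for thousands, dot for decimals); listings in other locales would need a different rule.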

Go further

This example is not perfect, there are many things that can be improved :

Multi-city search
Handling pagination
Multi-criteria search

You can find the code in this Github repo

This was my first blog post. I hope you enjoyed it, feel free to give me any feedback in the comments.

Scraping E-Commerce Product Data

Kevin Sahin — Sun, 17 Feb 2019 09:24:37 +0000

In this tutorial, we are going to see how to extract product data from any E-commerce website with Java. There are lots of different use cases for product data extraction, such as:

E-commerce price monitoring
Price comparator
Availability monitoring
Extracting reviews
Market research
MAP violation

We are going to extract these different fields: Price, Product Name, Image URL, SKU, and currency from this product page:

https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008

What you will need

We will use HtmlUnit to perform the HTTP request and parse the DOM. Add this dependency to your pom.xml:

<dependency>
   <groupId>net.sourceforge.htmlunit</groupId>
   <artifactId>htmlunit</artifactId>
   <version>2.19</version>
</dependency>

We will also use the Jackson library:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.9.8</version>
</dependency>

Schema.org

In order to extract the fields we're interested in, we are going to parse https://schema.org metadata from the HTML markup.

Schema is a semantic vocabulary that can be added to any webpage. There are many benefits to implementing it: most search engines use it to understand what a page is about (a Product, an Article, a Review, and many more).

According to schema.org, about 10 million websites use it worldwide. That's huge!
There are different types of Schema, and today we're going to look at the Product type.

It's really convenient because once you have written a scraper that extracts specific schema data, it will work on any other website using the same schema. No more specific XPath / CSS selectors to write!

In my experience at PricingBot (my previous company), about 40% of E-commerce websites use schema.org metadata in their DOM.

There are three main schema markups:

JSON-LD

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "ItemList",
    "url": "http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600",
    "numberOfItems": "315",
    "itemListElement": [
        {
            "@type": "Product",
            "image": "http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg",
            "url": "http://multivarki.ru/brand_502/",
            "name": "Brand 502",
            "offers": {
                "@type": "Offer",
                "price": "4399 p."
            }
        },
        {
            "@type": "Product",
            "name": "..."
        }
    ]
}
</script>
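JSON-LD is the easiest of the three to consume, because the metadata is already JSON. As a rough sketch of the idea (in Python with only the standard library, for brevity; JsonLdParser and iter_products are names invented for this example), you collect every application/ld+json script and walk the result looking for Product nodes:

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the decoded content of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


def iter_products(node):
    """Yield every dict whose @type is Product, however deeply it is nested."""
    if isinstance(node, dict):
        if node.get("@type") == "Product":
            yield node
        for value in node.values():
            yield from iter_products(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_products(value)


# Shortened version of the JSON-LD sample above, embedded in a minimal page.
html = """
<html><head><script type="application/ld+json">
{"@context": "http://schema.org", "@type": "ItemList", "itemListElement": [
  {"@type": "Product", "name": "Brand 502",
   "offers": {"@type": "Offer", "price": "4399 p."}}
]}
</script></head><body></body></html>
"""

parser = JsonLdParser()
parser.feed(html)
products = [p for doc in parser.documents for p in iter_products(doc)]
print(products[0]["name"], products[0]["offers"]["price"])
```

Note that the price still comes back as a free-form string ("4399 p."), so some cleanup is usually needed before treating it as a number.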

RDFa

<div vocab="http://schema.org/" typeof="ItemList">
    <link property="url" href="http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"><span property="numberOfItems">315</span>
    <div property="itemListElement" typeof="Product">
        <img property="image" alt="Photo of product" src="http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"> <a property="url" href="http://multivarki.ru/brand_502/"><span property="name">BRAND 502</span></a>
        <div property="offers" typeof="http://schema.org/Offer">
            <meta property="schema:priceCurrency" content="RUB">руб
            <meta property="schema:price" content="4399.00">4 399,00
            <link property="schema:itemCondition" href="http://schema.org/NewCondition">
        </div>...
        <div property="itemListElement" typeof="Product">
          ...
        </div>
    </div>
</div>

And the one used in our example, Microdata:

<div class="schema-org">


<div itemscope="" itemtype="https://schema.org/Product">
    <img itemprop="image" src="https://images.asos-media.com/products/the-north-face-vault-backpack-28-litres-in-black/10253008-1-black" alt="Image 1 of The North Face Vault Backpack 28 Litres in Black">
    <link itemprop="itemCondition" href="https://schema.org/NewCondition">
    <span itemprop="productID">10253008</span>
    <span itemprop="sku">10253008</span>
    <span itemprop="brand" itemscope="" itemtype="https://schema.org/Brand">
        <span itemprop="name">The North Face</span>
    </span>
    <span itemprop="name">The North Face Vault Backpack 28 Litres in Black</span>
    <span itemprop="description">Shop The North Face Vault Backpack 28 Litres in Black at ASOS. Discover fashion online.</span>
    <span itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
        <link itemprop="availability" href="https://schema.org/InStock">
        <meta itemprop="priceCurrency" content="GBP">
        <span itemprop="price">60</span>
        <span itemprop="eligibleRegion">GB</span>
        <span itemprop="seller" itemscope="" itemtype="https://schema.org/Organization">
            <span itemprop="name">ASOS</span>
        </span>
    </span>  
</div>

  </div>

Note that you can have multiple offers in a single page.

Extracting the data

The first thing is to create a basic POJO of a Product:

import java.math.BigDecimal;
import java.net.URL;

public class Product {

    private BigDecimal price;
    private String name;
    private String sku;
    private URL imageUrl;
    private String currency;
    // ...constructor, getters & setters
}

Then we need to go to the target URL and create a basic microdata parser to extract the fields we are interested in. I'm using HtmlUnit for this, which is a pure Java headless browser. I could have used lots of different libraries like Jsoup or Selenium + Headless Chrome.

But in most cases, HtmlUnit is a good solution because it's lighter than Selenium + Headless Chrome, but offers more features than a raw HTTP client + Jsoup (which only handles HTML parsing).

For "Javascript-heavy" websites, relying on frontend frameworks like React / Vue.js, Headless Chrome is the way to go!


WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
String productUrl = "https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008";

HtmlPage page = client.getPage(productUrl);
HtmlElement productNode = ((HtmlElement) page
                .getFirstByXPath("//*[@itemtype='https://schema.org/Product']"));
URL imageUrl = new URL((((HtmlElement) productNode.getFirstByXPath("./img")))
                .getAttribute("src"));
HtmlElement offers = ((HtmlElement) productNode.getFirstByXPath("./span[@itemprop='offers']"));

BigDecimal price = new BigDecimal(((HtmlElement) offers.getFirstByXPath("./span[@itemprop='price']")).asText());
String productName = (((HtmlElement) productNode.getFirstByXPath("./span[@itemprop='name']")).asText());
String currency = (((HtmlElement) offers.getFirstByXPath("./*[@itemprop='priceCurrency']")).getAttribute("content"));
String productSKU = (((HtmlElement) productNode.getFirstByXPath("./span[@itemprop='sku']")).asText());

On the first lines, I created the HtmlUnit HTTP client and disabled Javascript because we don't need it to get the Schema markup.

Then it's just basic XPath expressions to select the interesting DOM nodes we want.

This parser is far from perfect: it doesn't extract everything and it doesn't handle multiple offers. However, it will give you an idea of how to extract Schema data.
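To give an idea of how multiple offers could be handled, here is a minimal sketch that loops over every node marked itemprop='offers' instead of taking only the first one. It is not the article's parser: it assumes a small, well-formed XHTML snippet and uses the JDK's built-in XPath engine rather than HtmlUnit, and the class and method names (MultiOfferSketch, extractOffers) are made up for illustration. The XPath expressions themselves carry over to HtmlUnit's getByXPath.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original parser.
class MultiOfferSketch {

    // Returns one "price currency" string per Offer found under the Product node.
    static List<String> extractOffers(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Select every offer, not just the first one
            NodeList offers = (NodeList) xpath.evaluate(
                    "//*[@itemtype='https://schema.org/Product']//*[@itemprop='offers']",
                    doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < offers.getLength(); i++) {
                Node offer = offers.item(i);
                // Relative expressions are evaluated against the current offer node
                String price = xpath.evaluate("./*[@itemprop='price']", offer).trim();
                String currency = xpath.evaluate("./*[@itemprop='priceCurrency']/@content", offer);
                result.add(new BigDecimal(price) + " " + currency);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the snippet", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div itemscope='' itemtype='https://schema.org/Product'>"
                + "<span itemprop='offers' itemscope='' itemtype='https://schema.org/Offer'>"
                + "<meta itemprop='priceCurrency' content='GBP'/><span itemprop='price'>60</span></span>"
                + "<span itemprop='offers' itemscope='' itemtype='https://schema.org/Offer'>"
                + "<meta itemprop='priceCurrency' content='EUR'/><span itemprop='price'>70.50</span></span>"
                + "</div>";
        System.out.println(extractOffers(xhtml)); // prints [60 GBP, 70.50 EUR]
    }
}
```

Real product pages are rarely well-formed XML, which is one reason to keep a tolerant HTML parser such as HtmlUnit for the actual fetch.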

We can then create the Product object, and print it as a JSON string:

Product product = new Product(price, productName, productSKU, imageUrl, currency);
ObjectMapper mapper = new ObjectMapper();
String jsonString = mapper.writeValueAsString(product) ;
System.out.println(jsonString);

Avoid getting blocked

Now that we are able to extract the product data we want, we have to be careful not to get blocked.

For various reasons, websites sometimes implement anti-bot mechanisms. The most obvious reason to protect a site from bots is to prevent heavy automated traffic from impacting its performance (so you must be careful with concurrent requests, for example by adding delays). Another reason is to stop bad bot behavior such as spam.

There are various protection mechanisms. Sometimes your bot will be blocked if it makes too many requests per second, hour, or day. Sometimes there is a rate limit on the number of requests per IP address. The most difficult protection to deal with is user behavior analysis. For example, the website could analyze the time between requests, or whether the same IP is making requests concurrently.
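As a rough illustration of the "adding delays" advice, the sketch below waits a base delay plus a random jitter between requests, so that requests are neither too frequent nor perfectly evenly spaced. The class name, the delay values, and the URLs are assumptions chosen for the example (the demo delays are kept short so it runs quickly); appropriate values depend on the target site.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper for spacing out requests.
class PoliteDelaySketch {

    // Returns a delay between baseMillis and baseMillis + jitterMillis (inclusive).
    static long nextDelayMillis(long baseMillis, long jitterMillis, Random random) {
        return baseMillis + (long) (random.nextDouble() * (jitterMillis + 1));
    }

    public static void main(String[] args) throws InterruptedException {
        Random random = new Random();
        // Placeholder URLs, not taken from the article
        List<String> urls = List.of(
                "https://example.com/product/1",
                "https://example.com/product/2",
                "https://example.com/product/3");

        for (String url : urls) {
            // The actual page fetch would happen here
            System.out.println("Fetching " + url);

            long delay = nextDelayMillis(200, 100, random);
            System.out.println("Waiting " + delay + " ms before the next request");
            Thread.sleep(delay);
        }
    }
}
```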

The easiest solution to hide our scrapers is to use proxies. In combination with a random user-agent, using a proxy is a powerful method to hide our scrapers and scrape rate-limited web pages. Of course, it's better not to be blocked in the first place, but sometimes websites allow only a certain number of requests per day or hour.

In these cases, you should use a proxy. There are lots of free proxy lists, but I don't recommend using them because they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Sometimes the public proxy list is operated by a legitimate company offering premium proxies, and sometimes not...

What I recommend is using a paid proxy service, or you could build your own.

Setting a proxy to HtmlUnit is easy:

ProxyConfig proxyConfig = new ProxyConfig("host", myPort);
client.getOptions().setProxyConfig(proxyConfig);
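The "random user-agent" part is not shown above. As a minimal sketch of the idea, and assuming Java 11+ so that the JDK's own java.net.http client can be used instead of HtmlUnit, the code below picks a User-Agent at random for each request and routes the client through a proxy. The user-agent strings, the proxy host and port, and the class name are all placeholders invented for the example.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Random;

// Hypothetical helper; the values below are placeholders.
class RotatingRequestSketch {

    // In practice, use a pool of current, real browser user-agent strings.
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
            "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0");

    // Builds a client whose requests go through the given HTTP proxy.
    static HttpClient clientThroughProxy(String host, int port) {
        // createUnresolved defers the DNS lookup until a connection is made
        InetSocketAddress proxyAddress = InetSocketAddress.createUnresolved(host, port);
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(proxyAddress))
                .build();
    }

    // Builds a GET request carrying a randomly chosen User-Agent header.
    static HttpRequest requestWithRandomUserAgent(String url, Random random) {
        String userAgent = USER_AGENTS.get(random.nextInt(USER_AGENTS.size()));
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = clientThroughProxy("proxy.example.com", 8080);
        HttpRequest request = requestWithRandomUserAgent("https://example.com/", new Random());
        System.out.println("User-Agent: " + request.headers().firstValue("User-Agent").orElse("none"));
        // client.send(request, HttpResponse.BodyHandlers.ofString()) would perform the call.
    }
}
```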

Go further

As you can see, thanks to Schema.org data, extracting product data is much easier now than it was ten years ago.

But there are still challenges such as handling websites that haven't implemented Schema, handling IP blocking and rate limits, rendering Javascript...

That is exactly why my partner Pierre and I have been working on a Web Scraping API.

ScrapingBee is an API to extract any HTML from any website without having to deal with proxies, CAPTCHAs and headless browsers. A single API call, with only the product URL you want to extract data from.

I hope you enjoyed this post, as always you can find the full code in this Github repository: https://github.com/ksahin/introWebScraping

Introduction to Chrome Headless

Kevin Sahin — Fri, 18 Jan 2019 09:45:11 +0000

In the previous articles, I introduced you to two different tools to perform web scraping with Java: HtmlUnit in the first article, and PhantomJS in the article about handling Javascript-heavy websites.

This time we are going to introduce a new feature from Chrome, the headless mode. There was a rumor going around, that Google used a special version of Chrome for their crawling needs. I don't know if this is true, but Google launched the headless mode for Chrome with Chrome 59 several months ago.

PhantomJS was the leader in this space; it was (and still is) heavily used for browser automation and testing. After hearing the news about Headless Chrome, the PhantomJS maintainer said that he was stepping down as maintainer because, and I quote, "Google Chrome is faster and more stable than PhantomJS [...]"
It looks like Chrome headless is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.

HtmlUnit, PhantomJS, and the other headless browsers are very useful tools. The problem is that they are not as stable as Chrome, and sometimes you will encounter Javascript errors that would not have happened with Chrome.

Prerequisites

Google Chrome > 59
Chromedriver
Selenium
In your pom.xml add a recent version of Selenium :

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.8.1</version>
</dependency>

If you don't have Google Chrome installed, you can download it here
To install Chromedriver you can use brew on MacOS :

brew install chromedriver

Or download it using the link below.
There are a lot of versions; I suggest you use the latest versions of Chrome and chromedriver.

Let's log into Hacker News

In this part, we are going to log into Hacker News, and take a screenshot once logged in. We don't need Chrome headless for this task, but the goal of this article is only to show you how to run headless Chrome with Selenium.

The first thing we have to do is to create a WebDriver object, and set the chromedriver path and some arguments :

// Init chromedriver
String chromeDriverPath = "/Path/To/Chromedriver" ;
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200","--ignore-certificate-errors");
WebDriver driver = new ChromeDriver(options);

The --disable-gpu option is needed on Windows systems, according to the [documentation](https://developers.google.com/web/updates/2017/04/headless-chrome).
Chromedriver should automatically find the Google Chrome executable path. If you have a special installation, or if you want to use a different version of Chrome, you can do it with:

options.setBinary("/Path/to/specific/version/of/Google Chrome");

If you want to learn more about the different options, here is the Chromedriver documentation

The next step is to perform a GET request to the Hacker News login form, select the username and password field, fill it with our credentials and click on the login button. Then we have to check for a credential error, and if we are logged in, we can take a screenshot.

We have done this in a previous article, here is the full code :

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class ChromeHeadlessTest {
    private static String userName = "";
    private static String password = "";

    public static void main(String[] args) throws IOException {
        // Init chromedriver
        String chromeDriverPath = "/your/chromedriver/path";
        System.setProperty("webdriver.chrome.driver", chromeDriverPath);
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200", "--ignore-certificate-errors", "--silent");
        WebDriver driver = new ChromeDriver(options);

        // Get the login page
        driver.get("https://news.ycombinator.com/login?goto=news");

        // Search for username / password input and fill the inputs
        driver.findElement(By.xpath("//input[@name='acct']")).sendKeys(userName);
        driver.findElement(By.xpath("//input[@type='password']")).sendKeys(password);

        // Locate the login button and click on it
        driver.findElement(By.xpath("//input[@value='login']")).click();

        // Still being on the login URL means the credentials were rejected
        if (driver.getCurrentUrl().equals("https://news.ycombinator.com/login")) {
            System.out.println("Incorrect credentials");
            driver.quit();
            System.exit(1);
        } else {
            System.out.println("Successfully logged in");
        }

        // Take a screenshot of the current page
        File screenshot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        FileUtils.copyFile(screenshot, new File("screenshot.png"));

        // Logout
        driver.findElement(By.id("logout")).click();
        driver.quit();
    }
}

You should now have a nice screenshot of the Hacker News homepage while being authenticated. As you can see, Chrome headless is really easy to use. It is not that different from PhantomJS, since we are using Selenium to run it.

If you enjoyed this do not hesitate to subscribe to our newsletter!

If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check our new web scraping API, the first 1000 API calls are on us.

As usual, the code is available in this Github repository

How to Log in to Almost Any Website

Kevin Sahin — Wed, 02 Jan 2019 09:52:27 +0000

In the first article about Java web scraping, I showed how to extract data from the Craigslist website.
But what if the data you want, or the action you want to carry out on a website, requires authentication?

In this short tutorial I will show you how to make a generic method that can handle most authentication forms.

Authentication mechanism

There are many different authentication mechanisms, the most frequent being a login form, sometimes with a CSRF token as a hidden input.

To auto-magically log into a website with your scrapers, the idea is :

GET /loginPage
Select the first <input type="password"> tag
Select the first <input> before it that is not hidden
Set the value attribute for both inputs
Select the enclosing form, and submit it.
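Before moving to a real page, the selection steps above can be tried on a small snippet. The following is only a sketch under stated assumptions: it uses the JDK's built-in XPath engine on a hand-written, well-formed login page rather than HtmlUnit on a live site, and the class name, method name, and sample markup are invented for the example. It finds the first password input, the nearest preceding input that is not hidden, and the enclosing form.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper demonstrating the selectors only.
class LoginFormSketch {

    // Returns { login input name, password input name, form action }.
    static String[] locate(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 2: the first password input in document order
            Element password = (Element) xpath.evaluate(
                    "(//input[@type='password'])[1]", doc, XPathConstants.NODE);
            if (password == null) {
                throw new IllegalStateException("No password input found");
            }

            // Step 3: on the reverse 'preceding' axis, [1] is the nearest match
            Element login = (Element) xpath.evaluate(
                    "preceding::input[not(@type='hidden')][1]", password, XPathConstants.NODE);

            // Step 5: the closest enclosing form
            Element form = (Element) xpath.evaluate(
                    "ancestor::form[1]", password, XPathConstants.NODE);
            if (login == null || form == null) {
                throw new IllegalStateException("No login input or enclosing form found");
            }

            return new String[] {
                login.getAttribute("name"), password.getAttribute("name"), form.getAttribute("action")
            };
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the snippet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up page: a search form first, then a login form with a hidden field
        String page = "<body>"
                + "<form action='/search'><input type='text' name='q'/></form>"
                + "<form action='login' method='post'>"
                + "<input type='hidden' name='goto' value='news'/>"
                + "<input type='text' name='acct'/>"
                + "<input type='password' name='pw'/>"
                + "</form></body>";
        System.out.println(String.join(", ", locate(page))); // prints acct, pw, login
    }
}
```

The search box in the sample page is there on purpose: it shows why the nearest preceding input is wanted, and not simply the first visible input on the page.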

Hacker News Authentication

Let's say you want to create a bot that logs into hacker news (to submit a link or perform an action that requires being authenticated) :

Here is the login form and the associated DOM :

Now we can implement the login algorithm

    public static WebClient autoLogin(String loginUrl, String login, String password) throws FailingHttpStatusCodeException, MalformedURLException, IOException{
        WebClient client = new WebClient();
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);

        HtmlPage page = client.getPage(loginUrl);

        HtmlInput inputPassword = page.getFirstByXPath("//input[@type='password']");
        //The nearest preceding input that is not hidden ([1] on the reverse 'preceding' axis)
        HtmlInput inputLogin = inputPassword.getFirstByXPath("./preceding::input[not(@type='hidden')][1]");

        inputLogin.setValueAttribute(login);
        inputPassword.setValueAttribute(password);

        //get the enclosing form
        HtmlForm loginForm = inputPassword.getEnclosingForm() ;

        //submit the form
        page = client.getPage(loginForm.getWebRequest(null));

        //returns the cookie filled client :)
        return client;
    }

Then the main method, which :

Calls autoLogin with the right parameters
Goes to https://news.ycombinator.com
Checks for the logout link to verify we're logged in
Prints the cookies to the console

    public static void main(String[] args) {

        String baseUrl = "https://news.ycombinator.com" ;
        String loginUrl = baseUrl + "/login?goto=news" ; 
        String login = "login";
        String password = "password" ;

        try {
            System.out.println("Starting autoLogin on " + loginUrl);
            WebClient client = autoLogin(loginUrl, login, password);
            HtmlPage page = client.getPage(baseUrl) ;

            HtmlAnchor logoutLink = page.getFirstByXPath(String.format("//a[@href='user?id=%s']", login)) ;
            if(logoutLink != null ){
                System.out.println("Successfully logged in !");
                // printing the cookies
                for(Cookie cookie : client.getCookieManager().getCookies()){
                    System.out.println(cookie.toString());
                }
            }else{
                System.err.println("Wrong credentials");
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

You can find the code in this Github repo

Go further

There are many cases where this method will not work: Amazon, Dropbox... and all other two-step or captcha-protected login forms.

Things that can be improved with this code :

Handle the check for the logout link inside autoLogin
Check for null inputs/form and throw an appropriate exception
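For the second point, one possible approach is a small generic guard that fails with a descriptive message instead of a NullPointerException. This is a sketch only; the class name, method name, and exception type are arbitrary choices and are not part of the code above.

```java
// Hypothetical helper for validating XPath lookups.
class LoginChecksSketch {

    // Returns the node unchanged, or throws if the lookup found nothing.
    static <T> T requireFound(T node, String description) {
        if (node == null) {
            throw new IllegalStateException("Could not find " + description + " on the login page");
        }
        return node;
    }

    public static void main(String[] args) {
        // A lookup that succeeded passes straight through
        System.out.println(requireFound("acct", "login input")); // prints acct

        // A lookup that returned null fails with a clear message
        try {
            requireFound(null, "password input");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints Could not find password input on the login page
        }
    }
}
```

Inside autoLogin, each lookup could then be wrapped, for example requireFound(page.getFirstByXPath("//input[@type='password']"), "password input").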

In a future post I will show you how to deal with captchas or virtual numeric keyboards using OCR and captcha-breaking APIs!


An Automatic Bill Downloader in Java

Kevin Sahin — Wed, 12 Dec 2018 10:03:07 +0000

In this article I am going to show how to download bills (or any other file) from a website with HtmlUnit.

I suggest you read these articles first: Introduction to web scraping with Java and Autologin

Since I am hosting this blog on DigitalOcean ($10 in credit if you sign up via this link), I will show how to write a bot that automatically downloads all your bills.

Login

To submit the login form without needing to inspect the DOM, we will use the magic method I wrote in the previous article.

Then we have to go to the bill page : https://cloud.digitalocean.com/settings/billing

String baseUrl = "https://cloud.digitalocean.com";
String login = "email";
String password = "password" ;

try {
    WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);

    HtmlPage page = client.getPage("https://cloud.digitalocean.com/settings/billing");
    if(page.asText().contains("You need to sign in for access to this page")){
        throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
    }
}catch (Exception e) {
    e.printStackTrace();
}

Fetching the bills

Let's create a new Class called Bill or Invoice to represent a bill :

Bill.java


public class Bill {

    private String label ;
    private BigDecimal amount ; 
    private Date date;
    private String url ;
//... getters & setters
}

Now we need to inspect the DOM to see how we can extract the description, amount, date and URL of each bill. Open your favorite tool:

We are lucky here, it's a clean DOM, with a nice and well structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these :

HtmlTable to store the table and iterate over its rows
getCell to select the cells

Then, using the Jackson library we will export the Bill objects to JSON and print it.

List<Bill> bills = new ArrayList<>();
ObjectMapper mapper = new ObjectMapper();

HtmlTable billsTable = (HtmlTable) page.getFirstByXPath("//table[@class='listing Billing--history']");
for(HtmlTableRow row : billsTable.getBodies().get(0).getRows()){

    String label = row.getCell(1).asText();
    // We only want the invoice row, not the payment one
    if(!label.contains("Invoice")){
        continue ;
    }

    Date date = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH).parse(row.getCell(0).asText());
    BigDecimal amount = new BigDecimal(row.getCell(2).asText().replace("$", ""));
    String url = ((HtmlAnchor) row.getCell(3).getFirstChild()).getHrefAttribute();

    Bill bill = new Bill(label, amount, date, url);
    bills.add(bill);
    String jsonString = mapper.writeValueAsString(bill);

    System.out.println(jsonString);

    // The invoice download shown in the next snippet goes here, inside the loop
}

It's almost finished; the last thing is to download the invoice. It's pretty easy: we will use the Page object to store the PDF, and call getContentAsStream on its web response. It's better to check that the file has the right content type when doing this (application/pdf in our case).

Page invoicePdf = client.getPage(baseUrl + url);

if(invoicePdf.getWebResponse().getContentType().equals("application/pdf")){
    // try-with-resources closes the file even if the copy fails
    try (FileOutputStream out = new FileOutputStream("DigitalOcean" + label + ".pdf")) {
        IOUtils.copy(invoicePdf.getWebResponse().getContentAsStream(), out);
    }
}
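Two details of this step can be made a little more defensive. The sketch below is an assumption-laden illustration and not part of the original bot: isPdf accepts a raw Content-Type value that may carry parameters such as a charset, and fileNameFor turns a bill label into a file name without spaces or other awkward characters. Both names, and the underscore-based naming scheme, are invented for the example.

```java
import java.util.Locale;

// Hypothetical helpers for the download step.
class InvoiceFileSketch {

    // True when the media type is application/pdf, ignoring case and parameters.
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    // Replaces every run of characters outside [A-Za-z0-9._-] with an underscore.
    static String fileNameFor(String label) {
        String safe = label.trim().replaceAll("[^A-Za-z0-9._-]+", "_");
        return "DigitalOcean_" + safe + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(isPdf("application/pdf; charset=binary")); // prints true
        System.out.println(isPdf("text/html"));                       // prints false
        System.out.println(fileNameFor("Invoice for December 2015")); // prints DigitalOcean_Invoice_for_December_2015.pdf
    }
}
```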

That's it, here is the output:

{"label":"Invoice for December 2015","amount":0.35,"date":1451602800000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for November 2015","amount":6.00,"date":1448924400000,"url":"/billing/XXXX.pdf"}
{"label":"Invoice for October 2015","amount":3.05,"date":1446332400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for April 2015","amount":1.87,"date":1430431200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for March 2015","amount":5.00,"date":1427839200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for February 2015","amount":5.00,"date":1425164400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for January 2015","amount":1.30,"date":1422745200000,"url":"/billing/XXXXXX.pdf"}
{"label":"Invoice for October 2014","amount":3.85,"date":1414796400000,"url":"/billing/XXXXXX.pdf"}
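The date values in this output are epoch milliseconds, which is how Jackson writes a java.util.Date by default. To read them back as calendar dates, something like the following sketch could be used. The class and method names are made up, and the Europe/Paris zone is only an assumed example; the resulting day depends on the zone you pick.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper for reading the timestamps printed above.
class BillDateSketch {

    // Converts epoch milliseconds to a calendar date in the given time zone.
    static LocalDate toLocalDate(long epochMillis, ZoneId zone) {
        return Instant.ofEpochMilli(epochMillis).atZone(zone).toLocalDate();
    }

    public static void main(String[] args) {
        long firstBill = 1451602800000L; // "date" of the first line of output

        System.out.println(toLocalDate(firstBill, ZoneOffset.UTC));             // prints 2015-12-31
        System.out.println(toLocalDate(firstBill, ZoneId.of("Europe/Paris")));  // prints 2016-01-01
    }
}
```

The same instant falls on different days in the two zones, so it is worth fixing the zone explicitly when storing or comparing bill dates.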

As usual you can find the full code on this Github Repo


DEV Community: ScrapingBee

Web Scraping 101 with Javascript and NodeJS

TOC

Prerequisites

Outcomes

Understanding NodeJS: A brief introduction

HTTP clients: querying the web

Request

Axios

Superagent

Regular Expressions: The hard way

Cheerio: Core JQuery for traversing the DOM

JSDOM: The DOM for Node

Puppeteer: The headless browser

Nightmare: An alternative to Puppeteer

Summary

Resources

Easy Web Scraping With Scrapy

Basic overview

Scraping a single product

Scrapy Shell

Extracting Data

Creating a Scrapy Spider

Item loaders

Scraping multiple pages

Conclusion

Practical XPath for Web Scraping

Why learn XPath

Document Object Model

XPath Syntax

Tip

XPath with Python

E-commerce product data extraction

Automagically authenticate to a website

Conclusion

Serverless Web Scraping With Aws Lambda and Java

Prerequisites

Architecture

Create the Maven project

Configuration

Function code

Go further

Web Scraping 101 in Python

Table of Content:

0) Web Fundamentals

HyperText Transfer Protocol

1) Manually opening a socket and sending the HTTP request

Socket

Regular Expressions

2) urllib3 & LXML

XPath

3) requests & BeautifulSoup

4) Scrapy

5) Selenium & Chrome —headless

Conclusion:

A guide to Web scraping without getting blocked

Emulate human tool i.e: Headless Chrome

Why using headless browsing?

Do you speak JS?

Headless Browsing

Fingerprinting

Emulate human behaviour i.e: Proxy, Captchas solving and Request pattern

Proxy yourself

Captchas

Request Pattern

Conclusion

Scraping single page applications with ease.

Headless Chrome with Python

Prerequisites

Taking a screenshot

Waiting for the page load

Conclusion

Introduction to Web Scraping With Java

Prerequisites

Tools

Let's scrape CraigList

Go further

Further reading

Scraping E-Commerce Product Data

What you will need