<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Sahin</title>
    <description>The latest articles on DEV Community by Kevin Sahin (@kevinsahin).</description>
    <link>https://dev.to/kevinsahin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F90430%2F6df895e9-2f67-4707-bbb4-f5f0eea7e321.png</url>
      <title>DEV Community: Kevin Sahin</title>
      <link>https://dev.to/kevinsahin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevinsahin"/>
    <language>en</language>
    <item>
      <title>Easy Web Scraping With Scrapy</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 18 Dec 2019 16:22:13 +0000</pubDate>
      <link>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</link>
      <guid>https://dev.to/scrapingbee/easy-web-scraping-with-scrapy-6im</guid>
      <description>&lt;p&gt;In the previous post about &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;Web Scraping with Python&lt;/a&gt; we talked a bit about Scrapy. In this post we are going to dig a little bit deeper into it. &lt;/p&gt;

&lt;p&gt;Scrapy is a wonderful open source Python web scraping framework. It handles the most common use cases when doing web scraping at scale: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multithreading&lt;/li&gt;
&lt;li&gt;Crawling (going from link to link)&lt;/li&gt;
&lt;li&gt;Extracting the data&lt;/li&gt;
&lt;li&gt;Validating&lt;/li&gt;
&lt;li&gt;Saving to different formats / databases&lt;/li&gt;
&lt;li&gt;Many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main difference between Scrapy and other commonly used libraries like Requests / BeautifulSoup is that it is opinionated. It allows you to solve the usual web scraping problems in an elegant way. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep: there is a lot to learn, but that is what we are here for :)&lt;/p&gt;

&lt;p&gt;In this tutorial we will create two different web scrapers, a simple one that will extract data from an E-commerce product page, and a more "complex" one that will scrape an entire E-commerce catalog!&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic overview
&lt;/h2&gt;

&lt;p&gt;You can install Scrapy using &lt;a href="https://pypi.org/project/pip/" rel="noopener noreferrer"&gt;pip&lt;/a&gt;. Be careful though: the Scrapy documentation strongly suggests installing it in a dedicated virtual environment in order to avoid conflicts with your system packages. &lt;/p&gt;

&lt;p&gt;I'm using Virtualenv and Virtualenvwrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkvirtualenv scrapy_env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now create a new Scrapy project with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject product_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create all the necessary boilerplate files for the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── product_scraper
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a brief overview of these files and folders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;items.py&lt;/em&gt;&lt;/strong&gt; is a model for the extracted data. You can define a custom model (like a Product) that inherits from the Scrapy Item class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;middlewares.py&lt;/em&gt;&lt;/strong&gt; contains middlewares used to change the request / response lifecycle. For example you could create a middleware to rotate user-agents, or to use an API like ScrapingBee instead of doing the requests yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;pipelines.py&lt;/em&gt;&lt;/strong&gt; In Scrapy, pipelines are used to process the extracted data: clean the HTML, validate the data, and export it to a custom format or save it to a database. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;/spiders&lt;/em&gt;&lt;/strong&gt; is a folder containing Spider classes. With Scrapy, Spiders are classes that define how a website should be scraped, including what links to follow and how to extract the data from those links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;scrapy.cfg&lt;/em&gt;&lt;/strong&gt; is a configuration file to change some settings&lt;/li&gt;
&lt;/ul&gt;
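&lt;p&gt;To make the pipeline idea concrete, here is a minimal sketch of what could go into pipelines.py (the class name and cleaning logic are hypothetical, not part of the generated boilerplate):&lt;/p&gt;

```python
# Hypothetical pipeline sketch: normalize the scraped price string.
# A Scrapy item pipeline is simply a class exposing process_item(self, item, spider).
class PriceToFloatPipeline:
    def process_item(self, item, spider):
        price = item.get("price")
        if price is not None:
            # Turn a string like '20.00$' into the float 20.0
            item["price"] = float(price.replace("$", "").strip())
        return item
```

&lt;p&gt;You would then enable it through the ITEM_PIPELINES setting in settings.py.&lt;/p&gt;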

&lt;h2&gt;
  
  
  Scraping a single product
&lt;/h2&gt;

&lt;p&gt;In this example we are going to scrape a single product from a dummy E-commerce website. Here is the product we are going to scrape: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F753fd5a7477937e0ce9bb8431d28b10fbdabf858%2F0f27d%2Fimages%2Fpost%2Fpost5%2Fproduct_screenshot.jpg"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are going to extract the product name, picture, price and description.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scrapy Shell
&lt;/h2&gt;

&lt;p&gt;Scrapy comes with a built-in shell that helps you try and debug your scraping code in real time. You can quickly test your XPath expressions / CSS selectors with it. It's a very cool tool to write your web scrapers and I always use it!&lt;/p&gt;

&lt;p&gt;You can configure Scrapy Shell to use another console, like IPython, instead of the default Python console. You will get autocompletion and other nice perks like colorized output. &lt;/p&gt;

&lt;p&gt;In order to use IPython in your Scrapy shell, you need to add this line to your scrapy.cfg file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shell &lt;span class="o"&gt;=&lt;/span&gt; ipython
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's configured, you can start using scrapy shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;scrapy shell &lt;span class="nt"&gt;--nolog&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Available Scrapy objects:
&lt;span class="o"&gt;[&lt;/span&gt;s]   scrapy     scrapy module &lt;span class="o"&gt;(&lt;/span&gt;contains scrapy.Request, scrapy.Selector, etc&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   crawler    &amp;lt;scrapy.crawler.Crawler object at 0x108147eb8&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   item       &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   settings   &amp;lt;scrapy.settings.Settings object at 0x108d10978&amp;gt;
&lt;span class="o"&gt;[&lt;/span&gt;s] Useful shortcuts:
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;url[, &lt;span class="nv"&gt;redirect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True]&lt;span class="o"&gt;)&lt;/span&gt; Fetch URL and update &lt;span class="nb"&gt;local &lt;/span&gt;objects &lt;span class="o"&gt;(&lt;/span&gt;by default, redirects are followed&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   fetch&lt;span class="o"&gt;(&lt;/span&gt;req&lt;span class="o"&gt;)&lt;/span&gt;                  Fetch a scrapy.Request and update &lt;span class="nb"&gt;local &lt;/span&gt;objects
&lt;span class="o"&gt;[&lt;/span&gt;s]   shelp&lt;span class="o"&gt;()&lt;/span&gt;           Shell &lt;span class="nb"&gt;help&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;print this &lt;span class="nb"&gt;help&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;s]   view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;    View response &lt;span class="k"&gt;in &lt;/span&gt;a browser
In &lt;span class="o"&gt;[&lt;/span&gt;1]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can fetch a URL simply by typing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fetch&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start by fetching the /robots.txt file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;404&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/robots.txt&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case there isn't any robots.txt, that's why we got a 404 HTTP code. If there was a robots.txt, Scrapy would follow its rules by default. &lt;/p&gt;

&lt;p&gt;You can disable this behavior by changing this setting in settings.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ROBOTSTXT_OBEY = True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you should have a log like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;scrapy.core.engine] DEBUG: Crawled &lt;span class="o"&gt;(&lt;/span&gt;200&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;GET https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&amp;gt; &lt;span class="o"&gt;(&lt;/span&gt;referer: None&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now see your response object, response headers, and try different XPath expression / CSS selectors to extract the data you want. &lt;/p&gt;

&lt;p&gt;You can see the response directly in your browser with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;view&lt;span class="o"&gt;(&lt;/span&gt;response&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the page may render badly inside your browser, for lots of different reasons: CORS issues, JavaScript code that didn't execute, or relative URLs for assets that won't work locally. &lt;/p&gt;

&lt;p&gt;The Scrapy shell is like a regular Python shell, so don't hesitate to load your favorite scripts/functions into it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Data
&lt;/h3&gt;

&lt;p&gt;Scrapy doesn't execute any JavaScript by default, so if the website you are trying to scrape uses a frontend framework like Angular / React.js, you could have trouble accessing the data you want. &lt;/p&gt;

&lt;p&gt;Now let's try some XPath expressions to extract the product title and price:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd33wubrfki0l68.cloudfront.net%2F200a8982c21e545c85170599feec2552f6bcde5b%2F7ce12%2Fimages%2Fpost%2Fpost5%2Fproduct_dom_screenshot.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to extract the price, we are going to use an &lt;a href="https://dev.to/blog/practical-xpath-for-web-scraping/"&gt;XPath expression&lt;/a&gt;: we're selecting the span inside the div with the class &lt;code&gt;my-4&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;16]: response.xpath&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"//div[@class='my-4']/span/text()"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[16]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I could also use a CSS selector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;In &lt;span class="o"&gt;[&lt;/span&gt;21]: response.css&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.my-4 span::text'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
Out[21]: &lt;span class="s1"&gt;'20.00$'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;With Scrapy, Spiders are classes where you define your crawling (what links / URLs need to be scraped) and scraping (what to extract) behavior. &lt;/p&gt;

&lt;p&gt;Here are the different steps used by a spider to scrape a website:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It starts by looking at the class attribute &lt;code&gt;start_urls&lt;/code&gt;, and calls these URLs with the start_requests() method. You can override this method if you need to change the HTTP verb or add some parameters to the request (for example, sending a POST request instead of a GET).&lt;/li&gt;
&lt;li&gt;It will then generate a Request object for each URL, and send the response to the callback function parse().&lt;/li&gt;
&lt;li&gt;The parse() method will then extract the data (in our case, the product price, image, description, title) and return either a dictionary, an Item object, a Request, or an iterable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;You may wonder why the parse method can return so many different objects. It's for flexibility. Let's say you want to scrape an E-commerce website that doesn't have any sitemap. You could start by scraping the product categories, so this would be a first parse method.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This method would then yield a Request object for each product category to a new callback method, parse2().&lt;/em&gt;&lt;br&gt;
&lt;em&gt;For each category you would also need to handle pagination. Then, for each product, a third parse function would do the actual scraping and generate an Item.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With Scrapy you can return the scraped data as a simple Python dictionary, but it is a good idea to use the built-in Scrapy &lt;strong&gt;&lt;em&gt;Item&lt;/em&gt;&lt;/strong&gt; class. &lt;br&gt;
It's a simple container for our scraped data, and Scrapy will look at this item's fields for many things, like exporting the data to different formats (JSON / CSV...), the item pipeline, etc. &lt;/p&gt;

&lt;p&gt;So here is a basic Product class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;product_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;img_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can generate a spider, either with the command line helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider myspider mydomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can do it manually and put your Spider's code inside the /spiders directory. &lt;/p&gt;

&lt;p&gt;There are different types of Spiders in Scrapy to solve the most common web scraping use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Spider&lt;/code&gt; that we will use. It takes a start_urls list and scrapes each one with a &lt;code&gt;parse&lt;/code&gt; method. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CrawlSpider&lt;/code&gt; follows links defined by a set of rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SitemapSpider&lt;/code&gt; extracts URLs defined in a sitemap&lt;/li&gt;
&lt;li&gt;Many more
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- coding: utf-8 -*-
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EcomSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ecom_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-slider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]//img/@src&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this &lt;strong&gt;&lt;em&gt;EcomSpider&lt;/em&gt;&lt;/strong&gt; class, there are two required attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt; which is our Spider's name (that you can run using &lt;code&gt;scrapy crawl ecom_spider&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt; &lt;code&gt;start_urls&lt;/code&gt; which is the list of starting URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;allowed_domains&lt;/code&gt; attribute is optional but important when you use a CrawlSpider that could follow links on different domains. &lt;/p&gt;

&lt;p&gt;Then I've just populated the Product fields by using XPath expressions to extract the data I wanted, as we saw earlier, and returned the item. &lt;/p&gt;

&lt;p&gt;You can run this code as follows to export the result into JSON (you could also export to CSV):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy runspider ecom_spider.py &lt;span class="nt"&gt;-o&lt;/span&gt; product.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should then get a nice JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"product_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20.00$"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Taba Cream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"img_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://clever-lichterman-044f16.netlify.com/images/products/product-2.png"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Item loaders
&lt;/h4&gt;

&lt;p&gt;There are two common problems that you can face while extracting data from the Web: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the same website, the page layout and underlying HTML can be different. If you scrape an E-commerce website, you will often have a regular price and a discounted price, with different XPath / CSS selectors. &lt;/li&gt;
&lt;li&gt;The data can be dirty and need some kind of post-processing; again for an E-commerce website it could be the way the prices are displayed, for example ($1.00, $1, $1,00).&lt;/li&gt;
&lt;/ul&gt;
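&lt;p&gt;As an illustration of the second problem, a small post-processing helper could look like this (a hypothetical sketch that only handles the three formats listed above):&lt;/p&gt;

```python
def normalize_price(raw):
    # Turn '$1.00', '$1' or '$1,00' into the float 1.0
    cleaned = raw.replace("$", "").strip()
    # Treat a comma used as a decimal separator like a dot
    cleaned = cleaned.replace(",", ".")
    return float(cleaned)

print(normalize_price("$1,00"))  # 1.0
```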

&lt;p&gt;Scrapy comes with a built-in solution for this, &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html" rel="noopener noreferrer"&gt;ItemLoaders&lt;/a&gt;. &lt;br&gt;
It's an interesting way to populate our Product object. &lt;/p&gt;

&lt;p&gt;You can add several XPath expressions to the same Item field, and it will try them sequentially. By default, if several matches are found, it will load all of them into a list. &lt;/p&gt;

&lt;p&gt;You can find many examples of input and output processors in the &lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;It's really useful when you need to transform/clean the data you extract. &lt;br&gt;
For example: extracting the currency from a price, transforming a unit into another one (centimeters to meters, degrees Celsius to Fahrenheit)... &lt;/p&gt;

&lt;p&gt;In our webpage we can find the product title with different XPath expressions: &lt;code&gt;//title&lt;/code&gt; and &lt;code&gt;//section[1]//h2/text()&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Here is how you could use an ItemLoader in this case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/span/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//section[1]//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally you only want the first matching XPath, so you will need to add &lt;code&gt;output_processor=TakeFirst()&lt;/code&gt; to your item's field constructor. &lt;/p&gt;

&lt;p&gt;In our case we only want the first matching XPath for each field, so a better approach would be to create our own ItemLoader and declare a default output_processor to take the first matching XPath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ItemLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.loader.processors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Join&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ItemLoader&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;default_output_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TakeFirst&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;price_in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MapCompose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remove_dollar_sign&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also added a &lt;code&gt;price_in&lt;/code&gt; input processor to remove the dollar sign from the price. I'm using &lt;code&gt;MapCompose&lt;/code&gt;, a built-in processor that takes one or several functions to be executed sequentially; you can chain as many functions as you like. The convention is to add &lt;code&gt;_in&lt;/code&gt; or &lt;code&gt;_out&lt;/code&gt; to your Item field's name to attach an input or output processor to it. &lt;/p&gt;
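&lt;p&gt;To make the processor idea concrete, here is a standalone sketch that mimics what &lt;code&gt;MapCompose&lt;/code&gt; does (an illustration only, not Scrapy's actual implementation): each extracted value is passed through the given functions in order, and returning &lt;code&gt;None&lt;/code&gt; drops the value:&lt;/p&gt;

```python
# Illustration of MapCompose's behavior (the real one ships with Scrapy):
# every value is piped through the functions in order; None drops the value.
def map_compose(*functions):
    def process(values):
        results = []
        for value in values:
            for fn in functions:
                value = fn(value)
                if value is None:
                    break
            else:
                results.append(value)
        return results
    return process

def remove_dollar_sign(value):
    return value.replace('$', '')

def to_float(value):
    return float(value)

# Chaining two functions, like MapCompose(remove_dollar_sign, to_float)
price_in = map_compose(remove_dollar_sign, to_float)
print(price_in(['$19.99', '$5.00']))  # [19.99, 5.0]
```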

&lt;p&gt;There are many more processors; you can learn more about them in the &lt;a href="https://docs.scrapy.org/en/latest/topics/loaders.html#input-and-output-processors" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping multiple pages
&lt;/h2&gt;

&lt;p&gt;Now that we know how to scrape a single page, it's time to learn how to scrape multiple pages, like the entire product catalog. &lt;br&gt;
As we saw earlier there are different kinds of Spiders. &lt;/p&gt;

&lt;p&gt;When you want to scrape an entire product catalog, the first thing you should look for is a sitemap. Sitemaps are built exactly for this: to show web crawlers how the website is structured. &lt;/p&gt;

&lt;p&gt;Most of the time you can find one at &lt;code&gt;base_url/sitemap.xml&lt;/code&gt;. Parsing a sitemap can be tricky, and again, Scrapy is here to help you with this. &lt;/p&gt;

&lt;p&gt;In our case, you can find the sitemap here: &lt;a href="https://clever-lichterman-044f16.netlify.com/sitemap.xml" rel="noopener noreferrer"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look inside the sitemap, there are many URLs that we are not interested in, like the home page, blog posts, etc.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/blog/post-1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;loc&amp;gt;&lt;/span&gt;
  https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
  &lt;span class="nt"&gt;&amp;lt;/loc&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;lastmod&amp;gt;&lt;/span&gt;2019-10-17T11:22:16+06:00&lt;span class="nt"&gt;&amp;lt;/lastmod&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/url&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
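&lt;p&gt;To see what Scrapy saves us from, here is a sketch of parsing such a sitemap by hand with Python's standard library (using a small in-memory sitemap as a stand-in for the real downloaded file):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

# Build a tiny two-entry sitemap in memory, standing in for a downloaded one
urlset = ET.Element('{%s}urlset' % NS)
for page in ('https://example.com/products/item-1/',
             'https://example.com/blog/post-1/'):
    url = ET.SubElement(urlset, '{%s}url' % NS)
    ET.SubElement(url, '{%s}loc' % NS).text = page

# The tricky part when doing it by hand: every tag lives in the sitemap
# namespace, so a bare iter('loc') finds nothing; tags must be qualified.
locs = [el.text for el in urlset.iter('{%s}loc' % NS)]
product_urls = [u for u in locs if '/products/' in u]
print(product_urls)  # ['https://example.com/products/item-1/']
```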



&lt;p&gt;Fortunately, we can filter the URLs to parse only those that match a certain pattern. It's really easy: here we only want URLs that&lt;br&gt;
have &lt;code&gt;/products/&lt;/code&gt; in them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SitemapSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sitemap_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/sitemap.xml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sitemap_rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... scrape product ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this spider as follows to scrape all the products and export the results to a CSV file:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scrapy runspider sitemap_spider.py -o output.csv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Now what if the website didn't have any sitemap? Once again, Scrapy has a solution for this! &lt;/p&gt;

&lt;p&gt;Let me introduce you to the... &lt;code&gt;CrawlSpider&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The CrawlSpider will crawl the target website starting from a &lt;code&gt;start_urls&lt;/code&gt; list. Then, for each URL, it will extract all the links matching a list of &lt;code&gt;Rule&lt;/code&gt; objects. &lt;br&gt;
In our case it's easy: products share the same URL pattern &lt;code&gt;/products/product_title&lt;/code&gt;, so we only need to filter on these URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.productloader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;product_scraper.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawl_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clever-lichterman-044f16.netlify.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://clever-lichterman-044f16.netlify.com/products/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="c1"&gt;# .. parse product 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, all these built-in Spiders are really easy to use. It would have been much more complex to do it from scratch. &lt;/p&gt;

&lt;p&gt;With Scrapy you don't have to think about the crawling logic, like adding new URLs to a queue, keeping track of already parsed URLs, multi-threading... &lt;/p&gt;
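&lt;p&gt;As a rough idea of the bookkeeping involved, a naive crawler has to maintain a queue of URLs to visit and a set of URLs already seen. This sketch is purely illustrative; &lt;code&gt;extract_links&lt;/code&gt; is a stand-in for downloading a page and extracting its links:&lt;/p&gt;

```python
from collections import deque

# Naive sketch of the crawl bookkeeping Scrapy handles for you: a FIFO
# queue of URLs to visit plus a set of already-seen URLs.
def crawl(start_url, extract_links, max_pages=100):
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for a real website
graph = {'/': ['/products/', '/blog/'],
         '/products/': ['/', '/products/taba-cream.1/'],
         '/blog/': [],
         '/products/taba-cream.1/': ['/products/']}
print(crawl('/', graph.get))
# ['/', '/products/', '/blog/', '/products/taba-cream.1/']
```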

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we saw a general overview of how to scrape the web with Scrapy and how it can solve your most common web scraping challenges. Of course we only touched the surface and there are many more interesting things to explore, like middlewares, exporters, extensions, pipelines! &lt;/p&gt;

&lt;p&gt;If you've been doing web scraping more "manually" with tools like BeautifulSoup / Requests, it's easy to understand how Scrapy can help save time and build more maintainable scrapers. &lt;/p&gt;

&lt;p&gt;I hope you liked this Scrapy tutorial and that it will motivate you to experiment with it. &lt;/p&gt;

&lt;p&gt;For further reading don't hesitate to look at the great &lt;a href="https://docs.scrapy.org/en/latest/index.html" rel="noopener noreferrer"&gt;Scrapy documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;You can also check out our &lt;a href="https://www.scrapingbee.com/blog/web-scraping-101-with-python/" rel="noopener noreferrer"&gt;web scraping with Python&lt;/a&gt; tutorial to learn more about web scraping. &lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Practical XPath for Web Scraping</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Thu, 07 Nov 2019 10:41:17 +0000</pubDate>
      <link>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</link>
      <guid>https://dev.to/scrapingbee/practical-xpath-for-web-scraping-3d9e</guid>
      <description>&lt;p&gt;XPath is a technology that uses path expressions to select nodes or node- sets in an XML document (or in our case an HTML document). Even if XPath is not a programming language in itself, it allows you to write expressions that can access directly to a specific HTML element without having to go through the entire HTML tree.&lt;/p&gt;

&lt;p&gt;It looks like the perfect tool for web scraping right? At &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt; we love XPath!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why learn XPath
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page.&lt;/li&gt;
&lt;li&gt;  It's more powerful than CSS selectors&lt;/li&gt;
&lt;li&gt;  It allows you to navigate the DOM in any direction&lt;/li&gt;
&lt;li&gt;  Can match text inside HTML elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Entire books have been written on XPath, and I don't pretend to explain everything in depth; this is an introduction to XPath, and we will see through real examples how you can use it for your web scraping needs.&lt;/p&gt;

&lt;p&gt;But first, let's talk a little about the DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Object Model
&lt;/h2&gt;

&lt;p&gt;I am going to assume you already know HTML, so this is just a small reminder.&lt;/p&gt;

&lt;p&gt;As you already know, a web page is a document containing text within tags, that add meaning to the document by describing elements like titles, paragraphs, lists, links etc.&lt;/p&gt;

&lt;p&gt;Let's see a basic HTML page, to understand what the Document Object Model is.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ehm3m1Ms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/rMQHgyRbWDBCzFcg.png" alt="" width="880" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This HTML code is basically HTML content encapsulated inside other HTML content. The HTML hierarchy can be viewed as a tree. We can already see this hierarchy through the indentation in the HTML code.&lt;/p&gt;

&lt;p&gt;When your web browser parses this code, it will create a tree which is an object representation of the HTML document. It is called the Document Object Model.&lt;/p&gt;

&lt;p&gt;Below is the internal tree structure inside the Google Chrome inspector:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juQiVf6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/nmSRYUtpVLZDisCK.png" alt="" width="880" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left we can see the HTML tree, and on the right we have the Javascript object representing the currently selected element (in this case, the &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; tag), with all its attributes.&lt;/p&gt;

&lt;p&gt;The important thing to remember is that &lt;strong&gt;the DOM you see in your browser, when you right click + inspect, can be really different from the actual HTML that was sent&lt;/strong&gt;. Maybe some Javascript code was executed and dynamically changed the DOM! For example, when you scroll on your Twitter account, a request is sent by your browser to fetch new tweets, and some Javascript code dynamically adds those new tweets to the DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath Syntax
&lt;/h2&gt;

&lt;p&gt;First let's look at some XPath vocabulary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In XPath terminology, as with HTML, there are different types of nodes: root nodes, element nodes, attribute nodes, and so-called atomic values, which is a synonym for text nodes in an HTML document.&lt;/li&gt;
&lt;li&gt;Each element node has one parent. In this example, the section element is the parent of p, details and button.&lt;/li&gt;
&lt;li&gt;Element nodes can have any number of children. In our example, the li elements are all children of the ul element.&lt;/li&gt;
&lt;li&gt;Siblings are nodes that have the same parent. p, details and button are siblings.&lt;/li&gt;
&lt;li&gt;Ancestors: a node's parent, its parent's parent, and so on.&lt;/li&gt;
&lt;li&gt;Descendants: a node's children, their children, and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are different types of expressions to select a node in an HTML document; here are the most important ones:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
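&lt;p&gt;Python's standard library implements a useful subset of XPath, which is enough to try the core expression types (&lt;code&gt;nodename&lt;/code&gt;, &lt;code&gt;//&lt;/code&gt;, &lt;code&gt;.&lt;/code&gt; and &lt;code&gt;@attribute&lt;/code&gt;) on a small hypothetical document:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Tiny in-memory document (hypothetical markup) to exercise each expression
html = ET.Element('html')
body = ET.SubElement(html, 'body')
div = ET.SubElement(body, 'div')
link = ET.SubElement(div, 'a', href='https://example.com')
link.text = 'a link'

# nodename  -- selects matching children:        'body/div'
assert html.find('body/div') is div
# //        -- selects descendants at any depth: './/a'
assert html.find('.//a') is link
# .         -- the current node itself
assert div.find('.') is div
# @attr     -- matches/reads an attribute:       './/a[@href]'
assert html.find('.//a[@href]').get('href') == 'https://example.com'
```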


&lt;p&gt;You can also use &lt;strong&gt;predicates&lt;/strong&gt; to find a node that contains a specific value. Predicates are always in square brackets: &lt;code&gt;[predicate]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
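&lt;p&gt;Attribute and positional predicates can also be tried with the standard library (again, the markup here is hypothetical):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Tiny in-memory document to run predicates against
html = ET.Element('html')
body = ET.SubElement(html, 'body')
div = ET.SubElement(body, 'div', id='main')
p1 = ET.SubElement(div, 'p', attrib={'class': 'intro'})
p1.text = 'Hello'
p2 = ET.SubElement(div, 'p')
p2.text = 'World'

# //div[@id='main']    -- any div whose id attribute equals 'main'
assert html.find(".//div[@id='main']") is div
# //p[@class='intro']  -- p elements whose class is 'intro'
assert html.findall(".//p[@class='intro']") == [p1]
# //div/p[1]           -- the first p child of each div
assert html.find(".//div/p[1]").text == 'Hello'
```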


&lt;p&gt;Now we will see some examples of XPath expressions. We can test XPath expressions inside Chrome DevTools, so it is time to fire up Chrome.&lt;/p&gt;

&lt;p&gt;To do so, right-click on the web page -&amp;gt; inspect, then &lt;code&gt;cmd + f&lt;/code&gt; on a Mac or &lt;code&gt;ctrl + f&lt;/code&gt; on other systems; you can then enter an XPath expression, and matches will be highlighted in the DevTools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jr6Uhf9q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/HkmFzlIKSQBEBMzI.png" alt="" width="544" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip
&lt;/h3&gt;

&lt;p&gt;In the dev tools, you can right-click on any DOM node and show its full XPath expression, which you can later simplify.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath with Python
&lt;/h2&gt;

&lt;p&gt;There are many Python packages that allow you to use XPath expressions to select HTML elements like lxml, Scrapy or Selenium. In these examples, we are going to use Selenium with Chrome in headless mode. You can look at this article to set up your environment: &lt;a href="https://www.scrapingbee.com/blog/scraping-single-page-applications"&gt;Scraping Single Page Application with Python&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  E-commerce product data extraction
&lt;/h3&gt;

&lt;p&gt;In this example, we are going to see how to extract E-commerce product data from Ebay.com with XPath expressions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L92NhRf6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://landen.imgix.net/blog_pkzRugQgwaDvAtAE/assets/mXasAzIbzHhopNjw.jpg" alt="" width="880" height="614"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In these three XPath expressions, we use &lt;code&gt;//&lt;/code&gt; as an axis, meaning we select nodes anywhere in the HTML tree. Then we use a predicate &lt;code&gt;[predicate]&lt;/code&gt; to match specific IDs. IDs are supposed to be unique, so it's not a problem to do this.&lt;/p&gt;

&lt;p&gt;But when you select an element by its class name, it's better to use a relative path, because the same class name can appear anywhere in the DOM; the more specific you are, the better. Not only that, but when the website changes (and it will), your code will be much more resilient.&lt;/p&gt;
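&lt;p&gt;The difference is easy to demonstrate with the standard library's XPath subset (the markup and IDs here are hypothetical): matching on a class name anywhere in the tree is ambiguous, while anchoring on a unique ID and then using a relative path is not:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Hypothetical page where the class 'price' appears both in a sidebar and
# inside the product card, so matching on class alone is ambiguous.
body = ET.Element('body')
sidebar = ET.SubElement(body, 'div', attrib={'class': 'sidebar'})
ET.SubElement(sidebar, 'span', attrib={'class': 'price'}).text = '$9.99'
card = ET.SubElement(body, 'div', id='product-card')
ET.SubElement(card, 'span', attrib={'class': 'price'}).text = '$49.00'

# Matching anywhere on the class name finds both prices -- fragile
assert len(body.findall(".//span[@class='price']")) == 2

# Anchoring on a unique id, then using a relative path, is unambiguous
card_node = body.find(".//div[@id='product-card']")
assert card_node.find(".//span[@class='price']").text == '$49.00'
```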

&lt;h3&gt;
  
  
  Automagically authenticate to a website
&lt;/h3&gt;

&lt;p&gt;When you have to perform the same action on many websites, or extract the same type of information, you can be a little smarter with your XPath expressions and create generic ones instead of a specific XPath for each website.&lt;/p&gt;

&lt;p&gt;In order to explain this, we're going to write a "generic" authentication function that takes a login URL, a username and a password, and tries to authenticate on the target website.&lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form and click on the submit button.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most login forms will have an &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag. So we can select this password input with a simple: &lt;code&gt;//input[@type='password']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once we have this password input, we can use a &lt;strong&gt;relative path&lt;/strong&gt; to select the username/email input. It will generally be the first preceding input &lt;strong&gt;that isn't hidden:&lt;/strong&gt; &lt;code&gt;.//preceding::input[not(@type='hidden')]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It's really important to exclude hidden inputs, because most of the time you will have at least one &lt;a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery"&gt;CSRF token&lt;/a&gt; hidden input. CSRF stands for Cross-Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks.&lt;/p&gt;

&lt;p&gt;Now we need to select the enclosing form from one of the inputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//ancestor::form&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And with the form, we can select the submit input/button:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.//*[@type='submit']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of such a function:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
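&lt;p&gt;As a sketch, such a function could look like this. It assumes a Selenium-style &lt;code&gt;driver&lt;/code&gt; object (the &lt;code&gt;find_element('xpath', ...)&lt;/code&gt; call style follows Selenium 4), and simply chains the XPath expressions from the steps above:&lt;/p&gt;

```python
# Hypothetical generic login helper. `driver` is assumed to be a Selenium
# WebDriver (or any object with the same interface); selenium itself is not
# imported here, so treat this as a sketch rather than a drop-in solution.
def auto_login(driver, login_url, username, password):
    driver.get(login_url)
    # 1. The password field anchors everything: almost every login form has one
    password_input = driver.find_element('xpath', "//input[@type='password']")
    # 2. The username/email field: first preceding input that isn't hidden
    #    (skipping hidden inputs avoids CSRF token fields)
    username_input = password_input.find_element(
        'xpath', ".//preceding::input[not(@type='hidden')]")
    username_input.send_keys(username)
    password_input.send_keys(password)
    # 3. Climb to the enclosing form and click its submit control
    form = password_input.find_element('xpath', './/ancestor::form')
    form.find_element('xpath', ".//*[@type='submit']").click()
```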


&lt;p&gt;Of course it is far from perfect; it won't work everywhere, but you get the idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.&lt;/p&gt;

&lt;p&gt;One of the most difficult tasks when writing XPath expressions is not the expression itself, but making it precise enough to select the right element, while remaining resilient enough to survive DOM changes.&lt;/p&gt;

&lt;p&gt;At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our &lt;a href="https://www.scrapingbee.com/api-store"&gt;ready-made APIs&lt;/a&gt;. We will discuss the differences between the two in another blog post!&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this article, next time we will talk about ... CSS selectors :)&lt;/p&gt;

&lt;p&gt;Happy Scraping!&lt;/p&gt;

&lt;p&gt;Discuss on HN: &lt;a href="https://news.ycombinator.com/item?id=21452310"&gt;https://news.ycombinator.com/item?id=21452310&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>webscraping</category>
      <category>xpath</category>
    </item>
    <item>
      <title>12 months, 3 products, some MRR, and one (irrigation) pivot</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Mon, 07 Oct 2019 09:35:33 +0000</pubDate>
      <link>https://dev.to/kevinsahin/12-months-3-products-some-mrr-and-one-irrigation-pivot-542f</link>
      <guid>https://dev.to/kevinsahin/12-months-3-products-some-mrr-and-one-irrigation-pivot-542f</guid>
      <description>&lt;p&gt;My partner Pierre and I have been working and talking about different side projects/startups for over 5 years. Two years ago we released our first product to the public but it was one year ago that we decided to go full time on the indie hacker road. In this post, I’m going to explain our journey, our backgrounds and how we did it after many failed attempts.&lt;/p&gt;

&lt;p&gt;This post is not about some magic product we launched in 2 days while getting 10k signups and reaching $20k MRR in one month, working 4 hours a week in Hawaii. This post is about the small wins and losses we had during our first year in the indie hacker world, and the things we wish we knew before starting.&lt;/p&gt;

&lt;p&gt;This post is about three products, one irrigation pivot, one startup pivot, and of course, some MRR.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(disclaimer: ScrapingBee was initially launched as ScrapingNinja, but due to some copyright issues we had to quickly rebrand it. We'll talk about it in a future blog post.)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;It started when we were both employed in different startups as software developers. We had lots of ideas and we loved to build side-projects for fun.&lt;/p&gt;

&lt;p&gt;Pierre and I were doing lots of web scraping in our jobs. I worked at a Fintech startup called Fiduceo, which was acquired by a big French bank, where we were doing bank account aggregation, like &lt;a href="http://mint.com"&gt;Mint.com&lt;/a&gt; in the US. I was leading a small team handling the web scraping code and infrastructure. &lt;/p&gt;

&lt;p&gt;Pierre worked in the US and then came back to France to work at the biggest French real-estate data provider as a data engineer. Part of his job was to find, gather, extract and load new data sets from the web.&lt;/p&gt;

&lt;p&gt;So we both had experience with Web Scraping and data at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our first project: ShopToList
&lt;/h3&gt;

&lt;p&gt;One of the first “mini-successes” we had was Shoptolist.com, a B2C website/browser extension: a universal wishlist that sends you alerts if it sees any price drop. It was really just a fun side project that was never meant to be more.&lt;/p&gt;

&lt;p&gt;It allowed us to try many different things and to discover that acquisition is really, really, really hard.&lt;/p&gt;

&lt;p&gt;We quickly reached 1000 users just by submitting our product to frugal/fashion subreddits. We were very happy about it because it was just an experiment. Every day, a script scraped each product in our database to update its price, and we sent emails in case of a price drop. &lt;/p&gt;

&lt;p&gt;The links in the email were affiliate links, so we took a small percentage if the user ended up buying the product.&lt;br&gt;
In theory, this model works great, but in practice here is what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out of 1000 emails sent, about 20–30% were opened&lt;/li&gt;
&lt;li&gt;2% clicked on the product links that were on sale&lt;/li&gt;
&lt;li&gt;Of that 2%, only 5–10% bought the product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The percentage we earned was very small, 0.5–5% depending on the niche, so this business model only works with millions of users.&lt;br&gt;
And this is where we hit a wall: we did not manage to create sustainable growth. We tested many things, content marketing, affiliation, some paid advertising, but nothing worked. &lt;/p&gt;

&lt;p&gt;And since it was just a little side project that only took us two weeks to build, we were ok with that.&lt;/p&gt;

&lt;p&gt;For us, this was a valuable learning experience, because this was the first project we shipped to actual users, and we learned a lot.&lt;/p&gt;

&lt;p&gt;By digging into the database we noticed that a few users had thousands of products saved inside ShopToList. It seemed strange; unless they were crazy impulsive buyers, something was off, since the majority of users had around 20 products saved on average…&lt;/p&gt;

&lt;p&gt;So after a little “investigation”, we discovered that these users were E-commerce owners who were “spying” on their competitors…&lt;/p&gt;

&lt;h2&gt;
  
  
  First pivot: PricingBot
&lt;/h2&gt;

&lt;p&gt;We assumed that those users were doing this to receive alerts when their competitors changed their products' prices. &lt;/p&gt;

&lt;p&gt;There were many solutions on the web that allowed this, but ShopToList let them monitor thousands of products for free, while other solutions were quite expensive.&lt;/p&gt;

&lt;p&gt;We did some market research and discovered that many tools offered competitor product monitoring; however, all those tools seemed either really difficult to use or really expensive.&lt;/p&gt;

&lt;p&gt;Because we felt we could do better, the PricingBot idea was born. Pierre quit his job and we both decided to commit to it full-time. The side-project era was over 😎.&lt;/p&gt;

&lt;p&gt;We made a landing page explaining our value proposition, nothing fancy but something clear and nice enough so people could trust us, and got 60 signups from different E-commerce owners in different niches.&lt;/p&gt;

&lt;p&gt;While technically challenging, extracting E-commerce product data was something we knew how to do thanks to Shoptolist, so building the MVP was pretty quick.&lt;/p&gt;

&lt;p&gt;We launched our beta on ProductHunt in November 2018 and it was a big success, followed by a big crash, the classic startup trough of sorrow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--znxdQ_Sq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3sq87catntzinqpmnca8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--znxdQ_Sq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3sq87catntzinqpmnca8.jpg" alt="Product Hunt Launch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You had to upload a CSV file with your product catalog, and for each product match it with a competitor product URL. &lt;/p&gt;

&lt;p&gt;That's fine for a few dozen products, but people often had hundreds or thousands of products in their online store... &lt;/p&gt;

&lt;p&gt;So with this feedback, we built integrations with popular E-commerce platforms like Shopify and WooCommerce to let people import their catalog in one click. &lt;/p&gt;

&lt;p&gt;Our activation &lt;strong&gt;tripled 🎉&lt;/strong&gt;. We were very happy about how things were going; however, one thing to note is that until this point the product was completely free and we had not asked people for money.&lt;/p&gt;

&lt;p&gt;At this point in time, here are a few numbers that made us happy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We managed to get around 200 signups with $0 spent&lt;/li&gt;
&lt;li&gt;20 users seemed to use the product and had their account fully set up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What could go wrong, right?&lt;/p&gt;

&lt;p&gt;We decided to close the beta and started asking our users to pay for our software, with a classic three-plan SaaS model: $29/$99/$299 per month based on volume. &lt;/p&gt;

&lt;p&gt;The first day was magical: literally seconds after we sent the email announcing the end of the beta, we got our first customer on the $29 plan 🚀&lt;/p&gt;

&lt;p&gt;We also managed to sign up a user for the $299 plan soon after, but for him we had to manually set up his account and match 1,000 products across 10 websites. It took a long time, but we felt it was worth it. &lt;/p&gt;

&lt;p&gt;We were wrong! Just before renewing, he churned, telling us PricingBot was very good but not useful enough for him. We were sad and angry, mostly at ourselves, but decided to move forward and continue.&lt;/p&gt;

&lt;p&gt;It seemed we were on a good path and just needed to go all-in on marketing. And that's what we did: content marketing, cold outreach, affiliate marketing, SEO, you name it! &lt;/p&gt;

&lt;p&gt;But before diving into this, let's talk again about our activation. &lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #1: bad metrics lead to bad conclusions, bad conclusions lead to bad decisions... (in Yoda's voice)
&lt;/h3&gt;

&lt;p&gt;When we first decided to monitor our activation rate, we assumed that a user was activated when they did two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added at least one of their own products (or linked their store with our built-in integration)&lt;/li&gt;
&lt;li&gt;Added at least one of their competitors' products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so, with that definition, around 10% of our users were "activated". Considering that at the time most of our users came from ProductHunt, and that hunters are known to sign up for products they don't plan to use just for the sake of it, we were happy with these numbers.&lt;/p&gt;

&lt;p&gt;But we were wrong.&lt;/p&gt;

&lt;p&gt;This definition meant that someone who owned a Shopify store with 4,000 products and added only one competitor's product counted as activated. It was silly: someone who adds a single competitor's product against a 4,000-product catalog won't use PricingBot for price monitoring, and surely won't pay for it. We learned this the hard way.&lt;/p&gt;

&lt;p&gt;Because soon after that first paying customer, nobody followed. Literally nobody. At first we did not understand; then it seemed obvious: out of 200 signups we had 20 active users, and out of 20 active users we had 1 paying customer. So the only solution, we thought, was to get more signups.&lt;/p&gt;

&lt;p&gt;This was another mistake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #2: Thinking our only problem was acquisition
&lt;/h3&gt;

&lt;p&gt;We thought we only needed more users and just went full marketing. Because we did not know the e-commerce community very well, we had some trouble getting started, but we eventually managed to write some pieces of content that were shared in relevant Facebook/Reddit/LinkedIn groups and brought in a few leads.&lt;/p&gt;

&lt;p&gt;We also did some paid advertising and cold outreach but it failed miserably.&lt;/p&gt;

&lt;p&gt;One month later, we had to face the obvious: we were not on the right path.&lt;/p&gt;

&lt;p&gt;Our leads used the product but did not pay, and even if every lead we brought in had paid, it would not have been sustainable.&lt;/p&gt;

&lt;p&gt;At this point we finally decided to better understand why users weren't using our product more, and through feedback requests and a lot of analytics insights we discovered two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For most of our users, PricingBot was a nice-to-have, not something worth paying for&lt;/li&gt;
&lt;li&gt;Most of our users didn't want to do the setup because it was too tedious, but they also didn't want to pay us to do it for them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next thing we knew, we had revamped our whole onboarding process and tried to automate as much as possible. But it still wasn't working.&lt;/p&gt;

&lt;p&gt;When you, as an e-commerce owner, want to monitor your competitors, you first have to link your products with theirs, and this was the hard part: it meant roughly 1 hour of work per 100 products to match. That was way too much time for a solo e-commerce owner with a 10k-product catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fear, Uncertainty and Doubt
&lt;/h2&gt;

&lt;p&gt;To help you understand how we felt at that point in time let me just recap the timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;January 2018: 📣  we launch ShopToList&lt;/li&gt;
&lt;li&gt;July 2018: 🚀  Pierre quit his job and we decide to build PricingBot&lt;/li&gt;
&lt;li&gt;October 2018: 🤖  After a busy summer and 1 month of code we launch the MVP in beta&lt;/li&gt;
&lt;li&gt;January 2019: 💵  First paying customer&lt;/li&gt;
&lt;li&gt;February-March 2019: Acquisition, product dev&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Back in May 2019 we had kind of hit a wall: nothing we did really worked, and it was hard to stay motivated. The only silver lining was that we ranked well on Google, so we got around 3 new signups every day without any acquisition effort. &lt;/p&gt;

&lt;p&gt;But we still did not manage to make them pay. And we still did not manage to make them configure their account.&lt;/p&gt;

&lt;p&gt;This period was hard because it was full of negativity. My cofounder and I both knew we were not moving forward, and while this did not degrade our working relationship, it certainly degraded our productivity.&lt;/p&gt;

&lt;p&gt;We both felt that no matter what we did, we were unable to move any needle that could meaningfully boost our business.&lt;/p&gt;

&lt;p&gt;We improved the product a lot and managed to gather some signups along the way, but it was not enough. Here is a look at our revenue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LBrB-gAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ur8gi2vcwopujo5rr1zl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LBrB-gAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ur8gi2vcwopujo5rr1zl.jpg" alt="PricingBot MRR"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  One agricultural pivot to build, one startup pivot to make
&lt;/h2&gt;

&lt;p&gt;By mid-June 2019 things were not looking good: we had only 3 months left to launch a successful business. Back in 2018 we had given ourselves 1 year to launch something that works, 1 year to reach "&lt;a href="http://www.paulgraham.com/ramenprofitable.html"&gt;ramen profitability&lt;/a&gt;" 🍲.  &lt;/p&gt;

&lt;p&gt;We had a long talk at the beginning of June and agreed that we needed to step back, and that we had 3 options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continue with PricingBot, hoping some magic happens and we cross $4k MRR in 3 months&lt;/li&gt;
&lt;li&gt;Close the company and each go our own way&lt;/li&gt;
&lt;li&gt;Build something else&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1 was hard because we were both fed up with the product; everything we did seemed useless, and it was not working. Option 2 needed to be addressed, but although PricingBot was not a success, we felt that working together worked really well (on the human side of things) and that it would be a pity to give up. So we chose option 3, both very happy with the outcome of that talk and full of energy. We only needed one thing: to choose what we would build.&lt;/p&gt;

&lt;p&gt;We also decided to do something we should have done earlier: we sold ShopToList. The whole deal was done in less than a month thanks to &lt;a href="http://1kprojects.com"&gt;1kprojects.com&lt;/a&gt;, and it brought some welcome money into our company bank account.&lt;/p&gt;

&lt;p&gt;At the same time, Pierre's father-in-law, a farmer in the south of France, called him because he needed help assembling an irrigation pivot. The June heatwave was expected to be harsh (and guess what, it was), so it was an urgent job. We decided this was a good opportunity to take a break, each think on our own about the future product, and come back full of ideas and motivation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E1TEz3-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vf7zgxg815v7vbkgujw7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E1TEz3-g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vf7zgxg815v7vbkgujw7.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was kind of ironic: this pivot kind of funded our pivot.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: if you ever need to buy an irrigation pivot, Pierre strongly suggests you look into Valley pivots. (PS: this post was not sponsored by Valley in any way.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ScrapingBee
&lt;/h2&gt;

&lt;p&gt;Two weeks later, we each came back with a bullet list of product ideas: some good, some bad, some crazy, some boring, some exciting... well, you get the idea, both lists were diverse. However, we quickly agreed on one idea, because it really stood out from the others. Let me explain.&lt;/p&gt;

&lt;p&gt;While working on ShopToList and PricingBot, and also in our previous jobs, there were three things we always needed for our web scraping infrastructure: transforming websites into structured APIs, running headless browsers at scale, and managing a pool of proxies. &lt;/p&gt;

&lt;p&gt;When you extract data from lots of different websites, you always have to deal with JavaScript-heavy websites and single-page applications, and you don't really have a choice other than running headless browsers to render all this JavaScript.&lt;/p&gt;

&lt;p&gt;Running a headless browser like Chrome is really painful because the same things that happen on your desktop (high memory usage, a poorly coded single-page application eating 100% of your CPU) will happen on your servers. So doing this on your own is not only painful but very expensive when you don't know what you are doing. &lt;/p&gt;

&lt;p&gt;When doing web scraping at scale, you often have to use proxies, for different reasons. The website your bot is visiting may show different information based on your location, for example a price in euros in the eurozone and a price in dollars in the US. &lt;/p&gt;

&lt;p&gt;Dealing with proxies is painful too. There are lots of shady companies selling bad-quality proxies, so you either have to run your own proxies or test dozens of proxy companies to make sure your proxy pool is always up. &lt;/p&gt;

&lt;p&gt;We used to solve all these problems with APIs that were either inefficient or crazy expensive. These were problems we had solved multiple times across our projects, so we thought about packaging that into an API and leveraging our experience to build all kinds of web scraping APIs. &lt;/p&gt;

&lt;p&gt;This time, we decided to do things right and to try to avoid the mistakes we had made with PricingBot while creating &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake avoided #1: creating a product you won't use
&lt;/h3&gt;

&lt;p&gt;One of the biggest problems we had with PricingBot was finding where our potential users gathered online: what groups they followed, what blogs they read, what influencers they listened to. The reason was simple: having never worked with or in the e-commerce industry, except for some freelancing gigs, the whole landscape was unknown to us. &lt;/p&gt;

&lt;p&gt;With ScrapingBee we would be our own users, and that changed everything. I know this advice is not new, but it is usually framed as a way to build a better product, and sure, being one of your own users helps you build a better product.&lt;/p&gt;

&lt;p&gt;But for us, the game-changing fact was that being our own users meant we knew exactly where to find and how to reach potential leads.&lt;/p&gt;

&lt;p&gt;Pierre and I have also been running our own blogs for quite some time, and last year I wrote a book dedicated to web scraping in Java. This directly translated into 20k monthly visits we could leverage to promote ScrapingBee.&lt;/p&gt;

&lt;p&gt;And it worked. In about 2 months, we reached 150 beta signups, 4 times the amount of beta testers we had for PricingBot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMIZ6uco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p9s1otkh79ge4142dkv4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMIZ6uco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p9s1otkh79ge4142dkv4.jpg" alt="Ship By Product Hunt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SY-A7Eon--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/z9d17x9rf6to4nsw398s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SY-A7Eon--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/z9d17x9rf6to4nsw398s.png" alt="Newsletter subscribers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake avoided #2: spending too much money
&lt;/h3&gt;

&lt;p&gt;While building PricingBot, we spent a lot of money on useless infrastructure, APIs and software without reaching Product-Market Fit.  &lt;/p&gt;

&lt;p&gt;We basically managed to get our money back thanks to the ShopToList sale and Pierre's agricultural skills before we launched ScrapingBee, and this time we were way more careful about how we spent it.&lt;/p&gt;

&lt;p&gt;I know spending several thousand dollars to bootstrap a project is not a lot of money but we weren't comfortable with spending more, so we decided to be careful with how we would spend it with ScrapingBee.&lt;/p&gt;

&lt;p&gt;We reduced our costs by hunting for deals (&amp;lt;3 AWS credits) like &lt;a href="https://www.joinsecret.com/"&gt;Secret&lt;/a&gt;, which basically gives you 6 months free, or huge discounts, on lots of SaaS products. &lt;/p&gt;

&lt;p&gt;We decided to do more with what we had and so far we don't regret it.&lt;/p&gt;

&lt;p&gt;I'll talk more about the products and tools we used in a future blog post, this one is already long enough. &lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Launch 🚀 and mistake avoided #3: not asking for money from day 1
&lt;/h2&gt;

&lt;p&gt;One thing that did not work well with PricingBot is that for months we built a product that was free to use. I know this is a classic mistake, but that is not the worst part; the worst part is that we knew it was a mistake. Over the last 4 years we've read tons of books, interviews, and blog posts about startups, and everyone seems to agree that the sooner you ask for money, the better.&lt;/p&gt;

&lt;p&gt;But it was easier said than done and we did not dare ask for money while building PricingBot because we did not think anyone would pay for an unfinished product.&lt;/p&gt;

&lt;p&gt;We did ask with ScrapingBee. The pricing is again a classic three-plan SaaS model based on API call volume and features, at $9 / $29 / $99 per month, plus an Enterprise plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bFt2-Zzw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d75t54lb3dpcuahxgfg6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bFt2-Zzw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d75t54lb3dpcuahxgfg6.jpg" alt="Pricing Table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We "soft-launched" first to our mailing list and got our first few small paying customers. Again, we had the same experience with PricingBot but this time it was different. With PricingBot, every paying customer we had was really hard to get, we had sent them tons and tons of email and they took a long time to finally pay.&lt;/p&gt;

&lt;p&gt;With ScrapingBee it was different: our first 2 customers had never talked with us before. &lt;/p&gt;

&lt;p&gt;We then started to blog and got tons of leads and a few more paying customers, including a big Enterprise plan, as you can see in the MRR chart below. &lt;/p&gt;

&lt;p&gt;Then it all went quickly. Since Pierre and I had both blogged about programming before, creating insightful content about web scraping was not a problem for us, and we knew how and where to promote it.&lt;/p&gt;

&lt;p&gt;One particular piece of content we wrote, a &lt;a href="https://www.scrapingbee.com/blog/web-scraping-without-getting-blocked"&gt;web scraping guide&lt;/a&gt; completely exceeded our expectations. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CVt5YrgR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ec55uvbwtbmviucdwjne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CVt5YrgR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ec55uvbwtbmviucdwjne.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post alone brought us, in two months, three times the traffic PricingBot got in a whole year. It brought not just traffic but also customers paying real $. It also landed us our first big Enterprise plan, which allowed us to reach and cross $1,000 MRR.&lt;/p&gt;

&lt;h2&gt;
  
  
  The future
&lt;/h2&gt;

&lt;p&gt;Of course, it's really too early to say whether &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt; will be a success or not.&lt;/p&gt;

&lt;p&gt;The big Enterprise customer we got thanks to the success of our first blog post may just be an outlier that won't repeat in the future. But one thing is certain: things are looking way better with ScrapingBee.&lt;/p&gt;

&lt;p&gt;We have lots of engagement from our users and leads, with a trial-to-paid conversion rate close to 5%. &lt;/p&gt;

&lt;p&gt;We also love talking with our potential customers (❤️  Zoom), and we have the feeling that ScrapingBee really is a must-have for them, instead of a "nice-to-have". (Small tip: we quintupled the free plan for users who agreed to a short 15-minute talk with us; this has already given us 40 real conversations with real people about ScrapingBee.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XMXcWcQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cc0n6bbi5qvod3gfrvdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XMXcWcQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cc0n6bbi5qvod3gfrvdn.png" alt="CTA phone call"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the months to come, a big challenge will be finding profitable and scalable acquisition channels. We hope that content marketing will continue to work and will improve our SEO enough to bring organic traffic. But writing good content may not be enough, and we really have to discover other acquisition channels.&lt;/p&gt;

&lt;p&gt;The other big challenge is prioritizing features in the API store: figuring out what users &lt;strong&gt;need&lt;/strong&gt;, not blindly implementing what they want, and hopefully getting them to pay before a feature is implemented.&lt;/p&gt;

&lt;p&gt;We still don't know what we want to do with PricingBot. We're seriously thinking about selling it, but are a bit afraid of all the paperwork involved (it was much easier with ShopToList because it did not bring in any money, so no bank account, Stripe account, etc.).&lt;/p&gt;

&lt;p&gt;We also still have a lot to learn and a lot to prove before we can say that we've built a sustainable and profitable business, but we feel it can be done. Time will tell if we're right.&lt;/p&gt;

</description>
      <category>startup</category>
      <category>maker</category>
    </item>
    <item>
      <title>Serverless Web Scraping With Aws Lambda and Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 04 Sep 2019 09:36:20 +0000</pubDate>
      <link>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</link>
      <guid>https://dev.to/scrapingbee/serverless-web-scraping-with-aws-lambda-and-java-48lc</guid>
      <description>&lt;p&gt;Serverless is a term referring to the execution of code inside ephemeral containers (Function As A Service, or FaaS). It is a hot topic in 2019, after the “micro-service” hype, here come the “nano-services”!&lt;/p&gt;

&lt;p&gt;Cloud functions can be triggered by different things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An HTTP call to a REST API&lt;/li&gt;
&lt;li&gt;A job in a message queue&lt;/li&gt;
&lt;li&gt;A log event&lt;/li&gt;
&lt;li&gt;An IoT event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud functions are a really good fit for web scraping tasks, for many reasons. Web scraping is I/O-bound: most of the time is spent waiting for HTTP responses, so we don’t need high-end CPU servers. Cloud functions are cheap (the first 1M requests are free, then $0.20 per million requests) and easy to set up. They are also a good fit for parallel scraping: we can spawn hundreds or thousands of functions at the same time for large-scale scraping.&lt;/p&gt;
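&lt;p&gt;To put that pricing in perspective, here is a rough back-of-the-envelope sketch of the &lt;em&gt;request&lt;/em&gt; charge for a scraping job (an illustration using only the figures above; real bills also include a duration-based compute charge, which this sketch ignores):&lt;/p&gt;

```java
// Rough estimate of the AWS Lambda request charge for a scraping job,
// using the figures above: first 1M requests free, then $0.20 per million.
// Duration-based compute charges are deliberately ignored in this sketch.
public class LambdaRequestCost {
    static final long FREE_TIER = 1_000_000L;
    static final double PRICE_PER_MILLION = 0.20;

    static double requestCost(long requests) {
        // Only requests above the free tier are billed.
        long billable = Math.max(0L, requests - FREE_TIER);
        return billable / 1_000_000.0 * PRICE_PER_MILLION;
    }

    public static void main(String[] args) {
        // Scraping 5 million pages, one page per invocation:
        System.out.println(requestCost(5_000_000L)); // 4M billable requests -> $0.80
    }
}
```

&lt;p&gt;So even a 5-million-page crawl costs under a dollar in request charges, which is why parallel scraping on cloud functions is so attractive.&lt;/p&gt;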

&lt;p&gt;In this introduction, we are going to see how to deploy a slightly modified version of the Craigslist scraper we made on a previous &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;blogpost&lt;/a&gt; on AWS Lambda using the serverless framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;We are going to use the &lt;a href="https://serverless.com/"&gt;Serverless&lt;/a&gt; framework to build and deploy our project to AWS Lambda. The Serverless CLI can generate lots of boilerplate code in different languages and deploy it to different cloud providers, like AWS, Google Cloud, or Azure. You will need: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Node and npm&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://serverless.com/framework/docs/providers/aws/guide/quick-start/"&gt;Serverless CLI&lt;/a&gt; and Setup your &lt;a href="https://serverless.com/framework/docs/providers/aws/guide/credentials/"&gt;AWS credentials&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Java 8 &lt;/li&gt;
&lt;li&gt;Maven&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;We will build an API using API Gateway, with a single endpoint &lt;code&gt;/items/{query}&lt;/code&gt; bound to a Lambda function that will respond with a JSON array of all the items (on the first result page) for this query.&lt;/p&gt;

&lt;p&gt;Here is a simple diagram for this architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7GKGCJBN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-lambda/cloudcraft.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the Maven project
&lt;/h2&gt;

&lt;p&gt;Serverless is able to generate projects in lots of different languages: Java, Python, NodeJS, Scala... &lt;br&gt;
We are going to use one of these templates to generate a Maven project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;serverless create &lt;span class="nt"&gt;--template&lt;/span&gt; aws-java-maven &lt;span class="nt"&gt;--name&lt;/span&gt; items-api &lt;span class="nt"&gt;-p&lt;/span&gt; aws-java-scraper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can now open this Maven project in your favorite IDE. &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The first thing to do is to change the &lt;strong&gt;&lt;em&gt;serverless.yml&lt;/em&gt;&lt;/strong&gt; config to declare an API Gateway route and bind it to the &lt;strong&gt;handleRequest&lt;/strong&gt; method of the &lt;strong&gt;&lt;em&gt;Handler.java&lt;/em&gt;&lt;/strong&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;craigslist-scraper-api&lt;/span&gt; 
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;java8&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;artifact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;target/hello-dev.jar&lt;/span&gt;

&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;getCraigsListItems&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.serverless.Handler&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/items/{searchQuery}&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I also set the timeout to 30 seconds. The default timeout with the Serverless framework is 6 seconds, but since we're running Java code, the &lt;a href="https://serverless.com/blog/keep-your-lambdas-warm/"&gt;Lambda cold start&lt;/a&gt; can take several seconds, and then we make an HTTP request to the Craigslist website, so 30 seconds seems reasonable. &lt;/p&gt;
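&lt;p&gt;With this configuration in place, deploying and calling the endpoint looks roughly like this (a sketch: the endpoint URL below is a placeholder, the real one is printed by &lt;code&gt;serverless deploy&lt;/code&gt;; &lt;code&gt;bicycle&lt;/code&gt; is just an example search query, and &lt;code&gt;us-east-1&lt;/code&gt; / &lt;code&gt;dev&lt;/code&gt; are the framework defaults):&lt;/p&gt;

```shell
# Build the jar (the template's default artifact is target/hello-dev.jar),
# then deploy the service to AWS; the deploy output prints the API Gateway endpoint.
mvn clean package
serverless deploy

# Call the endpoint with an example search query (placeholder URL).
curl https://XXXXXXXXXX.execute-api.us-east-1.amazonaws.com/dev/items/bicycle
```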

&lt;h2&gt;
  
  
  Function code
&lt;/h2&gt;

&lt;p&gt;Now we can modify &lt;strong&gt;&lt;em&gt;Handler.java&lt;/em&gt;&lt;/strong&gt;. The function logic is simple. First, we retrieve the path parameter called "searchQuery". Then we create a CraigsListScraper and call its &lt;strong&gt;&lt;em&gt;scrape()&lt;/em&gt;&lt;/strong&gt; method with this search query. It returns a &lt;code&gt;List&amp;lt;Item&amp;gt;&lt;/code&gt; representing all the items on the first Craigslist result page. &lt;/p&gt;

&lt;p&gt;We then use the &lt;code&gt;ApiGatewayResponse&lt;/code&gt; class that was generated by the Serverless framework to return a JSON array containing every item. &lt;/p&gt;

&lt;p&gt;You can find the rest of the code in &lt;a href="https://github.com/ksahin/serverless-scraping"&gt;this repository&lt;/a&gt;, with the &lt;code&gt;CraigsListScraper&lt;/code&gt; and &lt;code&gt;Item&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"received: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pathParameters"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pathParameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"searchQuery"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CraigsListScraper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scrape&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
        &lt;span class="no"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error : "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;Response&lt;/span&gt; &lt;span class="n"&gt;responseBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error while processing URL: "&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatusCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setObjectBody&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responseBody&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeaders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Powered-By"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"AWS Lambda &amp;amp; Serverless"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can now build the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mvn clean install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And deploy it to AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serverless deploy
Serverless: Packaging service...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service .zip file to S3 (13.35 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.................................
Serverless: Stack update finished...
Service Information
service: items-api
stage: dev
region: us-east-1
stack: items-api-dev
api keys:
  None
endpoints:
  GET - https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/{searchQuery}
functions:
  getCraigsListItems: items-api-dev-getCraigsListItems
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You can then test your function using curl or your web browser with the URL given in the deployment logs (&lt;code&gt;serverless info&lt;/code&gt; will also show this information).&lt;/p&gt;

&lt;p&gt;Here is a query to look for "macBook pro":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://tmulioizdf.execute-api.us-east-1.amazonaws.com/dev/items/macBook%20pro | json_reformat                                                            1 ↵
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19834  100 19834    0     0   7623      0  0:00:02  0:00:02 &lt;span class="nt"&gt;--&lt;/span&gt;:--:--  7622
&lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"2010 15&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Macbook pro 3.06ghz 8gb 320gb osx maverick"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 325,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-306ghz-8gb-320gb/6680853189.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro A1502 13.3&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Late 2013 2.6GHz i5 8 GB 500GB + Extras"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 875,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-alateghz-i5/6688755497.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Apple MacBook Pro Charger USB-C (Latest Model) w/ Box - Like New!"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 50,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/apple-macbook-pro-charger-usb/6686902986.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"MacBook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; C2D 4GB memory 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 250,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro-13-c2d-4gb-memory/6688682499.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 2011 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 475,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/eby/sys/d/macbook-pro/6675556875.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Trackpad Touchpad Mouse with Cable and Screws for Apple MacBook Pro"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 39,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/pen/sys/d/trackpad-touchpad-mouse-with/6682812027.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Macbook Pro 13&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; i5 very clean, excellent shape! 4GB RAM, 500GB HDD"&lt;/span&gt;,
        &lt;span class="s2"&gt;"price"&lt;/span&gt;: 359,
        &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"https://sfbay.craigslist.org/sfc/sys/d/macbook-pro-13-i5-very-clean/6686879047.html"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that the first invocation will be slow (a cold start); it took 7 seconds for me. Subsequent invocations will be much quicker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;This was just a small example; here are some ideas to improve it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better error handling&lt;/li&gt;
&lt;li&gt;Protect the API with an API Key (really easy to implement with API Gateway)&lt;/li&gt;
&lt;li&gt;Save the items to a DynamoDB database&lt;/li&gt;
&lt;li&gt;Send the search query to an SQS queue, and trigger the lambda execution with the queue instead of an HTTP request&lt;/li&gt;
&lt;li&gt;Send a notification with SNS if an item is listed below a certain price point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JavaScript rendering and CAPTCHAs, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;This is the end of this tutorial. I hope you enjoyed the post. Don't hesitate to experiment with Lambda and other cloud providers; it's really fun, easy, and can drastically reduce your infrastructure costs, especially for web scraping and other asynchronous tasks.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web Scraping 101 in Python</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 21 Aug 2019 10:24:15 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</link>
      <guid>https://dev.to/scrapingbee/web-scraping-101-in-python-5aoj</guid>
      <description>&lt;p&gt;In this post, which can be read as a follow-up to our &lt;a href="https://www.daolf.com/posts/avoiding-being-blocked-while-scraping-ultimate-guide/"&gt;ultimate web scraping guide&lt;/a&gt;, we will cover almost all the tools Python offers you for web scraping. We will go from the most basic to the most advanced ones, and cover the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should be enough to give you a good idea of which tool does what, and when to use each.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: whenever I talk about Python in this post, I mean Python 3.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;0) Web Fundamentals
&lt;/li&gt;
&lt;li&gt;1) Manually opening a socket and sending the HTTP request
&lt;/li&gt;
&lt;li&gt;2) urllib3 &amp;amp; LXML
&lt;/li&gt;
&lt;li&gt;3) requests &amp;amp; BeautifulSoup
&lt;/li&gt;
&lt;li&gt;4) Scrapy
&lt;/li&gt;
&lt;li&gt;5) Selenium &amp;amp; Chrome —headless
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;span id="web-fondamentals"&gt; 0) Web Fundamentals &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;The internet is &lt;strong&gt;really complex&lt;/strong&gt;: there are many underlying technologies and concepts involved to view a simple web page in your browser. I don’t have the pretension to explain everything, but I will show you the most important things you have to understand in order to extract data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  HyperText Transfer Protocol
&lt;/h2&gt;

&lt;p&gt;HTTP uses a &lt;strong&gt;client/server&lt;/strong&gt; model, where an HTTP client (a browser, your Python program, curl, Requests...) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache...). &lt;/p&gt;

&lt;p&gt;Then the server answers with a response (The HTML code for example) and closes the connection. HTTP is called a stateless protocol, because each transaction (request/response) is independent. FTP for example, is stateful.&lt;/p&gt;

&lt;p&gt;Basically, when you type a website address in your browser, the HTTP request looks like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
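For instance, a minimal GET request for a /product/ page (the exact header values vary by client and are only illustrative here) has this shape:

```
GET /product/ HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
```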


&lt;p&gt;In the first line of this request, you can see multiple things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the GET verb or method being used, meaning we request data from the specific path &lt;code&gt;/product/&lt;/code&gt;. There are other HTTP verbs; you can see the full list &lt;a href="https://www.w3schools.com/tags/ref_httpmethods.asp"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The version of the HTTP protocol; in this tutorial we will focus on HTTP 1.&lt;/li&gt;
&lt;li&gt;Multiple headers fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the most important header fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; The domain name of the server. If no port number is given, it is assumed to be 80.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-Agent:&lt;/strong&gt; Contains information about the client originating the request, including the OS. In this case, it is my web browser (Chrome) on OSX. This header is important because it is used either for statistics (how many users visit my website on mobile vs. desktop) or to block bots. Because this header is sent by the client, it can be modified (this is called “header spoofing”), and that is exactly what we will do with our scrapers to make them look like a normal web browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept:&lt;/strong&gt; The content types that are acceptable as a response. There are lots of different content types and sub-types: &lt;strong&gt;text/plain, text/html, image/jpeg, application/json&lt;/strong&gt; ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie&lt;/strong&gt;: name1=value1;name2=value2... This header field contains a list of name-value pairs called cookies, which websites use to authenticate users and/or store data in your browser. For example, when you fill in a login form, the server checks whether the credentials you entered are correct; if so, it redirects you and injects a session cookie into your browser. Your browser then sends this cookie with every subsequent request to that server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referrer&lt;/strong&gt;: The Referrer header (spelled “Referer” in the HTTP specification) contains the URL from which the current URL was requested. Websites use this header to change their behavior based on where the user came from. For example, lots of news websites have a paid subscription and only let you view 10% of a post, but if the user comes from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the list goes on... You can find the full header list &lt;a href="https://en.wikipedia.org/wiki/List_of_HTTP_header_fields"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A server will respond with something like this: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
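For reference, a typical response (status line, headers, a blank line, then the body; the header values below are illustrative) looks like this:

```
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
Set-Cookie: session_id=abc123; Path=/; HttpOnly

<!DOCTYPE html>
<html>
  ...
</html>
```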


&lt;p&gt;On the first line, we have a new piece of information, the HTTP code &lt;code&gt;200 OK&lt;/code&gt;. It means the request has succeeded. As for the request headers, there are lots of HTTP codes, split into four common classes, 2XX for successful requests, 3XX for redirects, 4XX for bad requests (the most famous being 404 Not found), and 5XX for server errors.&lt;/p&gt;

&lt;p&gt;Then, in case you are sending this HTTP request with your web browser, the browser will parse the HTML code, fetch any associated assets (JavaScript files, CSS files, images...) and render the result in the main window.&lt;/p&gt;

&lt;p&gt;In the next parts we will see the different ways to perform HTTP requests with Python and extract the data we want from the responses. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="socket"&gt; 1) Manually opening a socket and sending the HTTP request &lt;/span&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Socket
&lt;/h2&gt;

&lt;p&gt;The most basic way to perform an HTTP request in Python is to open a &lt;a href="https://docs.python.org/3/howto/sockets.html"&gt;socket&lt;/a&gt; and manually send the HTTP request.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
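As a rough sketch of this approach (the target host is just an example, and real-world code would need more error handling):

```python
import socket

def build_request(host, path="/"):
    # Craft a raw HTTP/1.1 GET request by hand
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()

def fetch(host, path="/"):
    # Open a TCP socket on port 80, send the request,
    # then read until the server closes the connection
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")

# Usage (requires network access):
# print(fetch("www.google.com").split("\r\n")[0])  # the status line
```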


&lt;p&gt;Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions. &lt;/p&gt;

&lt;h2&gt;
  
  
  Regular Expressions
&lt;/h2&gt;

&lt;p&gt;A regular expression (RE, or Regex) is a search pattern for strings. With regex, you can search for a particular character/word inside a bigger body of text.&lt;/p&gt;

&lt;p&gt;For example, you could identify all the phone numbers inside a web page. You could also replace all uppercase tags in poorly formatted HTML with lowercase ones, or validate inputs...&lt;/p&gt;

&lt;p&gt;The pattern used by the regex is applied from left to right, and each source character is only used once. You may be wondering why it is important to know about regular expressions when doing web scraping.&lt;/p&gt;

&lt;p&gt;After all, there are all kinds of Python modules to parse HTML, with XPath or CSS selectors.&lt;/p&gt;

&lt;p&gt;In an ideal &lt;a href="https://en.wikipedia.org/wiki/Semantic_Web"&gt;semantic world&lt;/a&gt;, data is easily machine-readable, and the information is embedded inside relevant HTML elements with meaningful attributes.&lt;/p&gt;

&lt;p&gt;But the real world is messy; you will often find huge amounts of text inside a &lt;code&gt;p&lt;/code&gt; element. When you want to extract specific data inside this text, for example a price, a date, or a name, you will have to use regular expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Here is a great website to test your regexes: &lt;a href="https://regex101.com/"&gt;https://regex101.com/&lt;/a&gt;, and &lt;a href="https://www.rexegg.com/"&gt;an awesome blog&lt;/a&gt; to learn more about them. This post will only cover a small fraction of what you can do with regexps.&lt;/p&gt;

&lt;p&gt;Regular expressions can be useful when you have this kind of data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;Price : 19.99$&amp;lt;/p&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We could select this text node with an XPath expression, and then use this kind of regex to extract the price:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^Price\s:\s(\d+\.\d{2})\$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;To extract the text inside an HTML tag, it is annoying to use a regex, but doable:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
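Putting those two patterns together with Python's re module (the sample string is the snippet shown above; the anchor is dropped because the match starts mid-string):

```python
import re

html = "<p>Price : 19.99$</p>"  # the sample markup from above

# Grab the text content of the <p> tag...
text = re.search(r"<p>(.+?)</p>", html).group(1)

# ...then extract the price itself with the pattern shown earlier
price = re.match(r"Price\s:\s(\d+\.\d{2})\$", text).group(1)
print(price)  # -> 19.99
```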



&lt;p&gt;As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that can make this task easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="lxml"&gt; 2) urllib3 &amp;amp; LXML &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: It is easy to get lost in the urllib universe in Python. You have urllib and urllib2, which are part of the standard lib. You can also find urllib3. urllib2 was split into multiple modules in Python 3, and urllib3 won't become part of the standard lib anytime soon. This whole confusing situation deserves a blog post of its own. In this part, I've chosen to only talk about urllib3, as it is widely used in the Python world, by pip and requests to name just two.&lt;/p&gt;

&lt;p&gt;urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. It lets us do what we did above with a socket, in far fewer lines of code.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
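A minimal urllib3 version of our GET request might look like this (the timeout and retry values are arbitrary choices, not requirements):

```python
import urllib3

# A PoolManager handles connection pooling and thread safety for us
http = urllib3.PoolManager()

try:
    r = http.request(
        "GET",
        "https://www.google.com",
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    print(r.status)     # 200 if everything went well
    print(len(r.data))  # size of the HTML body in bytes
except urllib3.exceptions.HTTPError as exc:
    print("request failed:", exc)
```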


&lt;p&gt;Much more concise than the socket version. Not only that, but the API is straightforward and you can do many things easily, like adding HTTP headers, using a proxy, POSTing forms ... &lt;/p&gt;

&lt;p&gt;For example, had we decided to set some headers and use a proxy, we would only have to do this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
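For instance, something along these lines; the proxy address below is a placeholder, not a real server:

```python
import urllib3

# Custom headers to send with every request; values are illustrative
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)"}

# Route requests through a proxy (placeholder address)
proxy = urllib3.ProxyManager("http://203.0.113.10:3128", headers=headers)

# Requests made with this manager go via the proxy and carry our headers:
# r = proxy.request("GET", "https://www.google.com")
print(proxy.headers["User-Agent"])
```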


&lt;p&gt;See? Exactly the same number of lines. However, there are some things urllib3 does not handle very easily; for example, if we want to add a cookie, we have to manually create the corresponding header and add it to the request.&lt;/p&gt;

&lt;p&gt;There are also things urllib3 can do that requests can't: creation and management of connection pools and proxy pools, and control of the retry strategy, for example.&lt;/p&gt;

&lt;p&gt;To put it simply, urllib3 sits between requests and socket in terms of abstraction, although it's way closer to requests than to socket.&lt;/p&gt;

&lt;p&gt;This time, to parse the response, we are going to use the lxml package and XPath expressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath
&lt;/h2&gt;

&lt;p&gt;XPath is a technology that uses path expressions to select nodes or node-sets in an XML (or HTML) document. Like the Document Object Model, XPath has been a W3C standard since 1999. Even though XPath is not a programming language in itself, it allows you to write expressions that directly access a specific node or node-set without having to traverse the entire HTML (or XML) tree.&lt;/p&gt;

&lt;p&gt;Think of XPath as a regexp, but specifically for XML/HTML.&lt;/p&gt;

&lt;p&gt;To extract data from an HTML document with XPath we need 3 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an HTML document&lt;/li&gt;
&lt;li&gt;some XPath expressions&lt;/li&gt;
&lt;li&gt;an XPath engine that will run those expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To begin, we will use the HTML we got thanks to urllib3. We just want to extract all the links from the Google homepage, so we will use one simple XPath expression, &lt;code&gt;//a&lt;/code&gt;, and we will use lxml to run it. lxml is a fast, easy-to-use XML and HTML processing library that supports XPath. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Installation&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Below is the code that comes just after the previous snippet:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
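To keep the sketch self-contained, the snippet below runs the //a expression on a tiny hand-written page instead of the live Google homepage:

```python
from lxml import html

# A tiny stand-in for the page fetched with urllib3 above
page = html.fromstring(
    '<html><body><a href="/search">Search</a>'
    '<a href="/images">Images</a></body></html>'
)

# //a selects every <a> node in the document
links = page.xpath("//a")
for link in links:
    print(link.get("href"), link.text)
```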



&lt;p&gt;And the output should look like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have to keep in mind that this example is really simple and doesn't really show you how powerful XPath can be (note: this XPath expression should have been &lt;code&gt;//a/@href&lt;/code&gt; to avoid having to iterate over &lt;code&gt;links&lt;/code&gt; to get each &lt;code&gt;href&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If you want to learn more about XPath you can read &lt;a href="https://librarycarpentry.org/lc-webscraping/02-xpath/index.html"&gt;this good introduction&lt;/a&gt;. The LXML documentation is also &lt;a href="https://lxml.de/tutorial.html"&gt;well written and is a good starting point&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;XPath expressions, like regexps, are really powerful and one of the fastest ways to extract information from HTML. But like regexps, XPath can quickly become messy, hard to read and hard to maintain.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="requests"&gt; 3) requests &amp;amp; BeautifulSoup &lt;span&gt;
&lt;/span&gt;&lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HrgsYR9Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/requests/requests/master/docs/_static/requests-logo-small.png" alt="" width="357" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/psf/requests"&gt;Requests&lt;/a&gt; is the king of python packages, with more than 11 000 000 downloads, it is the most widly used package for Python. &lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Making a request with Requests (no comment) is really easy: &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
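A minimal example (the timeout value is an arbitrary choice):

```python
import requests

try:
    r = requests.get("https://www.google.com", timeout=10)
    print(r.status_code)                  # 200 if everything went well
    print(r.headers.get("Content-Type"))  # e.g. a text/html content type
except requests.RequestException as exc:
    print("request failed:", exc)
```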



&lt;p&gt;With Requests it is easy to perform POST requests, handle cookies, query parameters... &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication to Hacker News&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say we want to create a tool to automatically submit our blog posts to Hacker News or any other forum, like Buffer. We would need to authenticate on those websites before posting our links. That's what we are going to do with Requests and BeautifulSoup!&lt;/p&gt;

&lt;p&gt;Here is the Hacker News login form and the associated DOM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr2y7j7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ksah.in/content/images/2016/02/screenshot_hn_login_form.png" alt="" width="880" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; tags on this form: the first one is hidden, with the name "goto", and the other two are the username and password. &lt;/p&gt;

&lt;p&gt;If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect happens and a cookie is set. This cookie will be sent by Chrome with each subsequent request so that the server knows you are authenticated. &lt;/p&gt;

&lt;p&gt;Doing this with Requests is easy; it handles redirects automatically for us, and cookies can be handled with the &lt;em&gt;Session&lt;/em&gt; object. &lt;/p&gt;

&lt;p&gt;The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.&lt;/p&gt;

&lt;p&gt;Installation: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;So all we have to do is to POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
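A sketch of that flow; the field names acct and pw and the logout link are assumptions about the current form, so double-check them against the live page:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com"

def login(username, password):
    # A Session persists cookies, so the auth cookie set after the
    # POST is automatically sent with every subsequent request
    session = requests.Session()
    data = {"goto": "news", "acct": username, "pw": password}
    response = session.post(BASE_URL + "/login", data=data)
    # Look for an element that only shows up once logged in
    soup = BeautifulSoup(response.text, "html.parser")
    logged_in = soup.find("a", id="logout") is not None
    return session, logged_in

# Usage (requires real credentials and network access):
# session, ok = login("my_username", "my_password")
```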



&lt;p&gt;In order to learn more about BeautifulSoup, we could try to extract every link on the homepage. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;By the way, Hacker News offers a &lt;a href="https://github.com/HackerNews/API"&gt;powerful API&lt;/a&gt;, so we're doing this as an example, but you should use the API instead of scraping it!&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;The first thing we need to do is to inspect the Hacker News's home page to understand the structure and the different CSS classes that we will have to select:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGZHahfg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hacker_news_screenshot-475f78bf-c737-4a60-8c24-d0cc220d7219.jpg" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that all posts are inside a &lt;code&gt;&amp;lt;tr class="athing"&amp;gt;&lt;/code&gt; tag, so the first thing we need to do is select all these tags. This can easily be done with: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;links = soup.findAll('tr', class_='athing')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then for each link, we will extract its id, title, url and rank:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
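To keep the example self-contained, the snippet below runs the same logic on a static sample modeled on that markup (the class names are assumptions based on the screenshot, not taken from the live site):

```python
from bs4 import BeautifulSoup

# A static sample modeled on the Hacker News markup above,
# so the parsing logic runs without hitting the live site
sample = """
<table>
  <tr class="athing" id="1001">
    <td><span class="rank">1.</span></td>
    <td class="title"><a class="storylink" href="https://example.com/a">First post</a></td>
  </tr>
  <tr class="athing" id="1002">
    <td><span class="rank">2.</span></td>
    <td class="title"><a class="storylink" href="https://example.com/b">Second post</a></td>
  </tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
items = []
for row in soup.find_all("tr", class_="athing"):
    link = row.find("a", class_="storylink")
    items.append({
        "id": row["id"],
        "rank": row.find(class_="rank").text,
        "title": link.text,
        "url": link["href"],
    })
print(items)
```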



&lt;p&gt;As you saw, Requests and BeautifulSoup are great libraries to extract data and automate different things by posting forms. If you want to do large-scale web scraping projects, you could still use Requests, but you would need to handle lots of things yourself. &lt;/p&gt;

&lt;p&gt;When you need to scrape a lot of webpages, there are many things you have to take care of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;finding a way of parallelizing your code to make it faster&lt;/li&gt;
&lt;li&gt;handling errors&lt;/li&gt;
&lt;li&gt;storing results&lt;/li&gt;
&lt;li&gt;filtering results&lt;/li&gt;
&lt;li&gt;throttling your requests so you don't overload the server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately for us, tools exist that can handle those things for us.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="scrapy"&gt; 4) Scrapy &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VIvNnTuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://secure.meetupstatic.com/photos/event/1/b/6/6/600_468367014.jpeg" alt="" width="600" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is a powerful Python web scraping framework. It provides many features to download web pages asynchronously and to process and save them. It handles multithreading, crawling (the process of going from link to link to find every URL on a website), sitemap crawling, and much more. &lt;/p&gt;

&lt;p&gt;Scrapy also has an interactive mode, the Scrapy shell. With the Scrapy shell you can quickly test your scraping code, such as XPath expressions or CSS selectors. &lt;/p&gt;

&lt;p&gt;The downside of Scrapy is that the learning curve is steep; there is a lot to learn. &lt;/p&gt;

&lt;p&gt;To follow up on our Hacker News example, we are going to write a Scrapy spider that scrapes the first 15 pages of results and saves everything to a CSV file. &lt;/p&gt;

&lt;p&gt;You can easily install Scrapy with pip: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install Scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then you can use the Scrapy CLI to generate the boilerplate code for our project: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject hacker_news_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Inside &lt;code&gt;hacker_news_scraper/spider&lt;/code&gt; we will create a new Python file with our spider's code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;There are a lot of conventions in Scrapy. Here we define an array of starting URLs, and the &lt;code&gt;name&lt;/code&gt; attribute will be used to call our spider from the Scrapy command line. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;parse&lt;/code&gt; method will be called on each URL in the &lt;code&gt;start_urls&lt;/code&gt; array.&lt;/p&gt;

&lt;p&gt;We then need to tune Scrapy a little bit in order for our Spider to behave nicely against the target website. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You should always turn this on: by analyzing response times and adapting the number of concurrent requests, it makes sure the target website is not slowed down by your spiders. &lt;/p&gt;

&lt;p&gt;You can run this code with the Scrapy CLI and choose among different output formats (CSV, JSON, XML...):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl hacker-news -o links.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And that's it! You will now have all your links in a nicely formatted JSON file. &lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;span id="selenium"&gt; 5) Selenium &amp;amp; Chrome —headless &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Scrapy is really nice for large-scale web scraping tasks, but it is not enough if you need to scrape a Single Page Application written with a Javascript framework, because it won't be able to render the Javascript code. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8rnOiMh7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/SinglePageDiagram-9ae99e86-e997-4e18-9da9-d7abba599b9b.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code's behavior yourself, meaning manually inspecting all the network calls with your browser inspector and replicating the AJAX calls that contain the interesting data.&lt;/p&gt;
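&lt;p&gt;As a rough illustration of replicating such a call with the standard library (the endpoint and headers below are hypothetical placeholders for whatever request you spot in the Network tab):&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical JSON endpoint found in the browser's Network tab;
# the URL and headers are placeholders, not a real API.
api_url = "https://example.com/api/items?page=1"
request = urllib.request.Request(
    api_url,
    headers={
        "Accept": "application/json",
        # Many backends use this header to tell AJAX calls apart
        "X-Requested-With": "XMLHttpRequest",
    },
)

# Uncommenting the lines below would perform the actual call:
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```

&lt;p&gt;Reproducing the call directly like this is usually much faster and lighter than rendering the whole page.&lt;/p&gt;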

&lt;p&gt;In some cases, there are just too many asynchronous HTTP calls involved to get the data you want and it can be easier to just render the page in a headless browser. &lt;/p&gt;

&lt;p&gt;Another great use case is taking a screenshot of a page, and this is what we are going to do with the Hacker News homepage (again!).&lt;/p&gt;

&lt;p&gt;You can install the selenium package with pip: &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You will also need &lt;a href="http://chromedriver.chromium.org/"&gt;Chromedriver&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then we just have to import the WebDriver from the selenium package, configure Chrome with &lt;code&gt;headless=True&lt;/code&gt; and set a window size (otherwise it is really small):&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;You should get a nice screenshot of the homepage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ML_oEg3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/hn_homepage-bd6bd60d-8778-404b-a82c-39ba76728e14.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can do much more with the Selenium API and Chrome, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executing Javascript&lt;/li&gt;
&lt;li&gt;Filling forms&lt;/li&gt;
&lt;li&gt;Clicking on Elements&lt;/li&gt;
&lt;li&gt;Extracting elements with CSS selectors / XPath expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selenium with Chrome in headless mode is really the ultimate combination for scraping anything you want. You can automate everything you could do with your regular Chrome browser. &lt;/p&gt;

&lt;p&gt;The big drawback is that Chrome needs lots of memory and CPU power. With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need one CPU core per instance. &lt;/p&gt;

&lt;p&gt;If you want to run several Chrome instances concurrently, you will need powerful servers (the cost goes up quickly) and constant monitoring of resources. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;span id="conclusion"&gt; Conclusion: &lt;/span&gt;
&lt;/h1&gt;

&lt;p&gt;Here is a quick recap table of every technology we discussed in this post. Do not hesitate to tell us in the comments if you know of resources that you feel have their place here.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I hope this overview will help you choose the right Python scraping tools, and that you learned something reading this post.&lt;/p&gt;

&lt;p&gt;Every tool I talked about in this post will be the subject of a specific blog post in the future, where I'll go deep into the details.&lt;/p&gt;

&lt;p&gt;Everything I talked about in this post is what I used to build &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, the simplest web scraping API around. Do not hesitate to test our solution if you don’t want to lose too much time setting everything up; the first 1k API calls are on us 😊.&lt;/p&gt;

&lt;p&gt;Do not hesitate to tell me in the comments what you'd like to know about scraping; I'll talk about it in my next post.&lt;/p&gt;

&lt;p&gt;Happy Scraping! &lt;/p&gt;

</description>
      <category>python</category>
      <category>scraping</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Scraping single page applications with ease.</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 26 May 2019 17:22:14 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</link>
      <guid>https://dev.to/scrapingbee/scraping-single-page-applications-with-ease-d8o</guid>
      <description>&lt;p&gt;Dealing with a website that uses lots of Javascript to render their content can be tricky. These days, more and more sites are using frameworks like Angular, React, Vue.js for their frontend.&lt;/p&gt;

&lt;p&gt;These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API.&lt;/p&gt;

&lt;p&gt;So basically, the problem you will encounter is that your headless browser will download the HTML code and the Javascript code, but will not be able to execute the full Javascript code, so the webpage will not be totally rendered.&lt;/p&gt;

&lt;p&gt;There are some solutions to these problems. The first one is to use a better headless browser. The second one is to inspect the API calls made by the Javascript frontend and to reproduce them.&lt;/p&gt;

&lt;p&gt;It can be challenging to scrape these SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code's behavior yourself, meaning manually inspecting all the network calls with your browser inspector and replicating the AJAX calls that contain the interesting data.&lt;/p&gt;

&lt;p&gt;So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser capable of interpreting and executing all the Javascript code in order to render the page; that is what the next part is about.&lt;/p&gt;

&lt;h1&gt;
  
  
  Headless Chrome with Python
&lt;/h1&gt;

&lt;p&gt;PhantomJS was the leader in this space: it was (and still is) heavily used for browser automation and testing. After hearing the news about the release of Chrome's headless mode, the PhantomJS maintainer said that he was stepping down as maintainer, because, I quote, “Google Chrome is faster and more stable than PhantomJS [...]”. It looks like Chrome in headless mode is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;p&gt;You will need to install the selenium package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And of course, you need a Chrome browser, and Chromedriver installed on your system.&lt;/p&gt;

&lt;p&gt;On macOS, you can simply use brew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;brew&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;chromedriver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Taking a screenshot
&lt;/h1&gt;

&lt;p&gt;We are going to use Chrome to take a screenshot of Nintendo's home page, which uses lots of Javascript.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;selenium.webdriver.chrome.options&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executable_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'/usr/local/bin/chromedriver'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.nintendo.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'screenshot.png'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is really straightforward. I just added the &lt;code&gt;--window-size&lt;/code&gt; argument because the default window size was too small.&lt;/p&gt;

&lt;p&gt;You should now have a nice screenshot of Nintendo's home page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bf00LEFQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rpui3447ha74z56t6vkb.png" alt="Nintendo Homepage Screenshot" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Waiting for the page load
&lt;/h1&gt;

&lt;p&gt;Most of the time, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to complete to get the fully rendered page.&lt;/p&gt;

&lt;p&gt;A simple solution is to &lt;code&gt;time.sleep()&lt;/code&gt; for an arbitrary amount of time. The problem with this method is that you will wait either too long or too little, depending on your latency and internet connection speed.&lt;/p&gt;

&lt;p&gt;The other solution is to use the WebDriverWait object from the Selenium API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;presence_of_element_located&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'chart'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Page is ready!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;TimeoutException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Timeout"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is a great solution because it will wait exactly as long as necessary for the element to be rendered on the page.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is managing it in production. If you scrape lots of different websites, resource usage will be volatile.&lt;/p&gt;

&lt;p&gt;Meaning there will be CPU spikes and memory spikes, just like with a regular Chrome browser. After all, your Chrome instance will execute untrusted and unpredictable third-party Javascript code! Then there is also the zombie-process problem.&lt;/p&gt;

&lt;p&gt;This is one of the reasons I started &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;: so that developers can focus on extracting the data they want, not on managing headless browsers and proxies!&lt;/p&gt;

&lt;p&gt;This was my first post about scraping, I hope you enjoyed it!&lt;/p&gt;

&lt;p&gt;If you did please let me know, I'll write more 😊&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to know more about ScrapingBee, you can 👉 &lt;a href="https://dev.to/daolf/new-season-new-project-i-need-you-197l"&gt;here&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>scraping</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Introduction to Web Scraping With Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 13 Mar 2019 16:46:23 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</link>
      <guid>https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8</guid>
      <description>&lt;p&gt;Web scraping or crawling is the fact of fetching data from a third party website by downloading and parsing the HTML code to extract the data you want.&lt;/p&gt;

&lt;p&gt;Since not every website offers a clean API, or an API at all, web scraping can be the only solution when it comes to extracting website information.&lt;br&gt;
Lots of companies use it to obtain knowledge about competitor prices, for news aggregation, mass email collection…&lt;/p&gt;

&lt;p&gt;Almost everything can be extracted from HTML; the only information that is “difficult” to extract is inside images or other media.&lt;/p&gt;

&lt;p&gt;In this post, we are going to see basic techniques in order to fetch and parse data in Java. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic Java understanding&lt;/li&gt;
&lt;li&gt;Basic XPath&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;You will need Java 8 with &lt;a href="http://htmlunit.sourceforge.net" rel="noopener noreferrer"&gt;HtmlUnit&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using Eclipse, I suggest you configure the max length in the detail pane (when you click on the Variables tab) so that you can see the entire HTML of your current page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fdetail_pane.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's scrape CraigList
&lt;/h3&gt;

&lt;p&gt;For our first example, we are going to fetch items from Craigslist, since they don't seem to offer an API, collect names, prices, and images, and export them to JSON. &lt;/p&gt;

&lt;p&gt;First, let's take a look at what happens when you search for an item on Craigslist. Open Chrome DevTools and click on the Network tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist_request_search.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The search URL is :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/moa?is_paid=all&amp;amp;search_distance_type=mi&amp;amp;query=iphone+6s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query=iphone+6s  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can open your favorite IDE; it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (proxy settings, browser emulation, redirect handling...).&lt;/p&gt;

&lt;p&gt;We are going to disable Javascript since it's not required for our example, and disabling Javascript makes the page load faster :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchQuery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Iphone 6s"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;searchUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://newyork.craigslist.org/search/sss?sort=rel&amp;amp;query="&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;URLEncoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;encode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchQuery&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HtmlPage object will contain the HTML code; you can access it with the &lt;code&gt;asXml()&lt;/code&gt; method. &lt;/p&gt;

&lt;p&gt;Now we are going to fetch titles, images, and prices. We need to inspect the DOM structure for an item :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fintro-java%2Fcraiglist-dom-new-compressor.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With HtmlUnit you have several options to select an HTML tag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;getHtmlElementById(String id)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getFirstByXPath(String Xpath)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getByXPath(String XPath)&lt;/code&gt; which returns a List&lt;/li&gt;
&lt;li&gt;many others; check the documentation!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since there isn't any ID we could use, we have to write an &lt;a href="http://www.w3schools.com/xsl/xpath_syntax.asp" rel="noopener noreferrer"&gt;XPath&lt;/a&gt; expression to select the tags we want. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XPath&lt;/strong&gt; is a query language to select XML nodes (HTML in our case).&lt;/p&gt;

&lt;p&gt;First, we are going to select all the &lt;code&gt;&amp;lt;li&amp;gt;&lt;/code&gt; tags that have the class &lt;code&gt;result-row&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then we will iterate through this list, and for each item select the name, price, and URL, and then print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//li[@class='result-row']"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No items found !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
  &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

  &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;// It is possible that an item doesn't have any price&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

  &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Name : %s Url : %s Price : %s"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itemUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, instead of just printing the results, we are going to put them in JSON, using the &lt;a href="https://github.com/FasterXML/jackson" rel="noopener noreferrer"&gt;Jackson&lt;/a&gt; library to map items to JSON format. &lt;/p&gt;

&lt;p&gt;We need a POJO (plain old Java object) to represent items:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item.java&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//getters and setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add this to your pom.xml :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.7.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all we have to do is create an Item, set its attributes, and convert it to a JSON string (or write it to a file), adapting the previous code a little:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
   &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;itemAnchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//p[@class='result-info']/a"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

   &lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
   &lt;span class="n"&gt;htmlItem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//a/span[@class='result-price']"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;// It is possible that an item doesn't have any &lt;/span&gt;
   &lt;span class="c1"&gt;//price, we set the price to 0.0 in this case&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;itemPrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spanPrice&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; 
   &lt;span class="n"&gt;spanPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;Item&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTitle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
   &lt;span class="n"&gt;itemAnchor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

   &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; 
   &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itemPrice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)));&lt;/span&gt;

   &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
   &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
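&lt;p&gt;If you would rather end up with one JSON file than one printed object per item, you can collect the items in a list and let Jackson write them out in a single call. A minimal sketch (the &lt;code&gt;items.json&lt;/code&gt; filename and the public-field Item stand-in are my own choices, not from the original code):&lt;/p&gt;

```java
import java.io.File;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ItemsToFile {
    // Minimal stand-in for the Item POJO above (public fields so Jackson
    // can serialize it without getters).
    public static class Item {
        public String title;
        public BigDecimal price;
        public String url;
    }

    public static void main(String[] args) throws Exception {
        List<Item> results = new ArrayList<>();
        // In the scraping loop, replace System.out.println(jsonString) with:
        // results.add(item);

        // After the loop, serialize the whole list in one call.
        new ObjectMapper().writerWithDefaultPrettyPrinter()
                          .writeValue(new File("items.json"), results);
    }
}
```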



&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;This example is not perfect; there are many things that could be improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-city search&lt;/li&gt;
&lt;li&gt;Handling pagination&lt;/li&gt;
&lt;li&gt;Multi-criteria search&lt;/li&gt;
&lt;/ul&gt;
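&lt;p&gt;To give an idea of the pagination point: Craigslist search pages can be walked with an &lt;code&gt;s&lt;/code&gt; offset parameter, 120 results per page (an assumption based on the site's URLs at the time of writing, as is the &lt;code&gt;result-row&lt;/code&gt; selector). A rough sketch with HtmlUnit:&lt;/p&gt;

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PaginationSketch {
    // Craigslist offsets results with ?s=<n>, 120 per page (assumption).
    static String pageUrl(String searchUrl, int pageIndex) {
        return searchUrl + "?s=" + (pageIndex * 120);
    }

    public static void main(String[] args) throws Exception {
        String searchUrl = "https://newyork.craigslist.org/search/moa";
        WebClient client = new WebClient();
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);

        for (int i = 0; i < 3; i++) { // first three pages only
            HtmlPage page = client.getPage(pageUrl(searchUrl, i));
            List<?> items = page.getByXPath("//li[@class='result-row']");
            if (items.isEmpty()) {
                break; // we went past the last page
            }
            for (Object o : items) {
                HtmlElement item = (HtmlElement) o;
                // ...extract title / price / URL as in the loop above...
            }
        }
    }
}
```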

&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was my first blog post, I hope you enjoyed it! Feel free to give me feedback in the comments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;p&gt;I recently wrote a blog post about &lt;a href="https://dev.to/scrapingbee/a-guide-to-web-scraping-without-getting-blocked-5e7e"&gt;web scraping without getting blocked&lt;/a&gt; that explains the different techniques you can use to hide your scrapers, check it out!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping E-Commerce Product Data</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sun, 17 Feb 2019 09:24:37 +0000</pubDate>
      <link>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</link>
      <guid>https://dev.to/scrapingbee/scraping-e-commerce-product-data-2aif</guid>
      <description>&lt;p&gt;In this tutorial, we are going to see how to extract product data from any E-commerce websites with Java. There are lots of different use cases for product data extraction, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce price monitoring&lt;/li&gt;
&lt;li&gt;Price comparator&lt;/li&gt;
&lt;li&gt;Availability monitoring&lt;/li&gt;
&lt;li&gt;Extracting reviews&lt;/li&gt;
&lt;li&gt;Market research&lt;/li&gt;
&lt;li&gt;MAP (minimum advertised price) violation detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are going to extract the price, product name, image URL, SKU, and currency from this product page: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&gt;https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iB7kU6mc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-product/creenshot-2019-04-03-15.56.02.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What you will need
&lt;/h2&gt;

&lt;p&gt;We will use HtmlUnit to perform the HTTP requests and parse the DOM. Add this dependency to your pom.xml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;net.sourceforge.htmlunit&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;htmlunit&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.19&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We will also use the Jackson library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.fasterxml.jackson.core&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;jackson-databind&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.9.8&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema.org
&lt;/h2&gt;

&lt;p&gt;In order to extract the fields we're interested in, we are going to parse &lt;a href="https://schema.org"&gt;schema.org&lt;/a&gt; metadata from the HTML markup. &lt;/p&gt;

&lt;p&gt;Schema is a &lt;strong&gt;&lt;em&gt;semantic&lt;/em&gt;&lt;/strong&gt; vocabulary that can be added to any webpage. There are many benefits to implementing Schema: most search engines use it to understand what a page is about (a Product, an Article, a Review, and &lt;a href="https://schema.org/docs/schemas.html"&gt;many more&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;According to schema.org, about 10 million websites use it worldwide. That's huge! &lt;br&gt;
There are different types of Schema, and today we're going to look at the &lt;a href="https://schema.org/Product"&gt;Product type&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's really convenient because once you've written a scraper that extracts a specific schema type, it will work on any other website using the same schema. No more site-specific XPath / CSS selectors to write!&lt;/p&gt;

&lt;p&gt;In my experience at PricingBot (my previous company), about 40% of E-commerce websites use schema.org metadata in their DOM. &lt;/p&gt;

&lt;p&gt;There are three main ways of embedding Schema markup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;JSON-LD&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;script&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type=&lt;/span&gt;&lt;span class="s2"&gt;"application/ld+json"&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://schema.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ItemList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"315"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"itemListElement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brand 502"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"offers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Offer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4399 p."&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;RDFa&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;vocab=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"ItemList"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru?filters%5Bprice%5D%5BLTE%5D=39600"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"numberOfItems"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;315&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Photo of product"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"http://img01.multivarki.ru.ru/c9/f1/a5fe6642-18d0-47ad-b038-6fca20f1c923.jpeg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://multivarki.ru/brand_502/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;BRAND 502&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"RUB"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;руб
            &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:price"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"4399.00"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;4 399,00
            &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"schema:itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;...
        &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"itemListElement"&lt;/span&gt; &lt;span class="na"&gt;typeof=&lt;/span&gt;&lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
          ...
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the one used in our example, &lt;strong&gt;&lt;em&gt;Microdata&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"schema-org"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;


&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Product"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"image"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://images.asos-media.com/products/the-north-face-vault-backpack-28-litres-in-black/10253008-1-black"&lt;/span&gt; &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Image 1 of The North Face Vault Backpack 28 Litres in Black"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"itemCondition"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/NewCondition"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"productID"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"sku"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;10253008&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"brand"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Brand"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;The North Face Vault Backpack 28 Litres in Black&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Shop The North Face Vault Backpack 28 Litres in Black at ASOS. Discover fashion online.&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"offers"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Offer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"availability"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/InStock"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"priceCurrency"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"GBP"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;60&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"eligibleRegion"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;GB&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"seller"&lt;/span&gt; &lt;span class="na"&gt;itemscope=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;itemtype=&lt;/span&gt;&lt;span class="s"&gt;"https://schema.org/Organization"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;itemprop=&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;ASOS&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;  
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that you can have multiple offers on a single page.&lt;/p&gt;
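&lt;p&gt;As an aside: JSON-LD is often the easiest flavor to consume, because the data is plain JSON. Since we already have Jackson on the classpath, here is a sketch of reading Product fields from a JSON-LD block; the string below is a trimmed-down stand-in for what you would read out of the page's &lt;code&gt;script type="application/ld+json"&lt;/code&gt; tag, not code from this tutorial:&lt;/p&gt;

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLdSketch {
    public static void main(String[] args) throws Exception {
        // In a real scraper this string would come from the DOM, e.g. the
        // text content of //script[@type='application/ld+json'].
        String jsonLd = "{\"@type\":\"Product\",\"name\":\"Brand 502\","
                + "\"offers\":{\"@type\":\"Offer\",\"price\":\"4399 p.\"}}";

        JsonNode product = new ObjectMapper().readTree(jsonLd);
        String name = product.path("name").asText();
        String price = product.path("offers").path("price").asText();
        System.out.println(name + " : " + price); // prints: Brand 502 : 4399 p.
    }
}
```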

&lt;h2&gt;
  
  
  Extracting the data
&lt;/h2&gt;

&lt;p&gt;The first step is to create a basic POJO for a Product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="c1"&gt;// ...getters &amp;amp; setters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then we need to fetch the target URL and write a basic microdata parser to extract the fields we are interested in. I'm using HtmlUnit for this, which is a pure-Java headless browser. I could have used other libraries, like Jsoup or Selenium + Headless Chrome. &lt;/p&gt;

&lt;p&gt;In most cases, HtmlUnit is a good middle ground: it's lighter than Selenium + Headless Chrome but offers more features than a raw HTTP client + Jsoup (which only handles HTML parsing). &lt;/p&gt;

&lt;p&gt;For "Javascript-heavy" websites, relying on frontend frameworks like React / Vue.js, Headless Chrome is the way to go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.asos.com/the-north-face/the-north-face-vault-backpack-28-litres-in-black/prd/10253008"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;productUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//*[@itemtype='https://schema.org/Product']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="no"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="no"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;((((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./img"&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;HtmlElement&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='offers']"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='price']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='name']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./*[@itemprop='priceCurrency']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;getAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(((&lt;/span&gt;&lt;span class="nc"&gt;HtmlElement&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;productNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./span[@itemprop='sku']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In the first lines, I created the HtmlUnit HTTP client and disabled JavaScript, since we don't need it to read the Schema markup. &lt;/p&gt;

&lt;p&gt;Then it's just a matter of basic XPath expressions to select the DOM nodes we are interested in. &lt;/p&gt;

&lt;p&gt;This parser is far from perfect: it doesn't extract everything, and it doesn't handle multiple offers. Still, it should give you an idea of how to extract Schema.org data. &lt;/p&gt;

&lt;p&gt;We can then create the Product object, and print it as a JSON string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Product&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productSKU&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imageUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Avoid getting blocked
&lt;/h2&gt;

&lt;p&gt;Now that we are able to extract the product data we want, we have to be careful not to get blocked. &lt;/p&gt;

&lt;p&gt;Websites implement anti-bot mechanisms for various reasons. The most obvious is to prevent heavy automated traffic from impacting a website’s performance (so be careful with concurrent requests, and add delays between them). Another is to stop abusive bot behavior, such as spam.&lt;/p&gt;

&lt;p&gt;There are various protection mechanisms. Sometimes your bot will be blocked if it makes too many requests per second, hour, or day. Sometimes there is a rate limit per IP address. The hardest protection to deal with is user-behavior analysis: for example, the website could analyze the time between requests, or whether the same IP is making requests concurrently.&lt;/p&gt;
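&lt;p&gt;The simplest counter-measure to the timing analysis described above is to add a randomized delay between requests. Here is a minimal sketch (the class name and delay bounds are illustrative, not from any library):&lt;/p&gt;

```java
import java.util.Random;

public class PoliteDelay {
    private static final Random RANDOM = new Random();

    // Returns a random delay between minMillis (inclusive) and maxMillis (exclusive)
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return minMillis + (long) (RANDOM.nextDouble() * (maxMillis - minMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = randomDelayMillis(500, 1500);
        System.out.println("Sleeping " + delay + " ms before the next request");
        Thread.sleep(delay);
        // ... perform the next HTTP request here ...
    }
}
```

&lt;p&gt;Note that a fixed delay is itself a recognizable pattern, which is why the sketch randomizes it.&lt;/p&gt;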

&lt;p&gt;The easiest way to hide our scrapers is to use proxies. Combined with a random User-Agent, a proxy is a powerful method to stay undetected and to scrape rate-limited web pages. Of course, it’s better not to be blocked in the first place, but sometimes a website only allows a certain number of requests per day or hour.&lt;/p&gt;

&lt;p&gt;In these cases, you should use a proxy. There are lots of free proxy lists, but I don’t recommend using them: they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Sometimes a public proxy list is operated by a legitimate company offering premium proxies, and sometimes not... &lt;/p&gt;

&lt;p&gt;What I recommend is using a paid proxy service, or building your own.&lt;/p&gt;
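&lt;p&gt;Rotating the User-Agent header, mentioned above as a companion to proxies, can be sketched as a random pick from a pool on each request. The class name and pool values below are illustrative; with HtmlUnit, the chosen value would be applied as a request header:&lt;/p&gt;

```java
import java.util.Random;

public class UserAgentPool {
    // Illustrative pool; in practice you would use full, current User-Agent strings
    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    };
    private static final Random RANDOM = new Random();

    // Pick a random User-Agent for the next request
    static String next() {
        return USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)];
    }

    public static void main(String[] args) {
        // With HtmlUnit this could be applied with:
        // client.addRequestHeader("User-Agent", UserAgentPool.next());
        System.out.println(next());
    }
}
```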

&lt;p&gt;Setting a proxy in HtmlUnit is easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ProxyConfig&lt;/span&gt; &lt;span class="n"&gt;proxyConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"host"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;myPort&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setProxyConfig&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyConfig&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Go further
&lt;/h2&gt;

&lt;p&gt;As you can see, thanks to Schema.org data, extracting product data is much easier now than it was ten years ago. &lt;/p&gt;

&lt;p&gt;But there are still challenges, such as handling websites that haven't implemented Schema.org, dealing with IP blocking and rate limits, and rendering JavaScript... &lt;/p&gt;

&lt;p&gt;That is exactly why we've been working with my partner Pierre on a &lt;a href="https://www.scrapingbee.com"&gt;Web Scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ScrapingBee is an API to extract the HTML of any website without having to deal with proxies, CAPTCHAs, and headless browsers. A single API call is enough, with only the URL of the product you want to extract data from. &lt;/p&gt;

&lt;p&gt;I hope you enjoyed this post. As always, you can find the full code in this GitHub repository: &lt;a href="https://github.com/ksahin/introWebScraping"&gt;https://github.com/ksahin/introWebScraping&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Introduction to Chrome Headless</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Fri, 18 Jan 2019 09:45:11 +0000</pubDate>
      <link>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</link>
      <guid>https://dev.to/scrapingbee/introduction-to-chrome-headless-469b</guid>
<description>&lt;p&gt;In the previous articles, I introduced two different tools for performing web scraping with Java: &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;HtmlUnit&lt;/a&gt; in the first article, and &lt;a href="https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8"&gt;PhantomJS&lt;/a&gt; in the article about handling JavaScript-heavy websites. &lt;/p&gt;

&lt;p&gt;This time we are going to look at a newer Chrome feature: &lt;strong&gt;&lt;em&gt;headless&lt;/em&gt;&lt;/strong&gt; mode. There was a rumor going around that Google used a special version of Chrome for its crawling needs. I don't know whether that is true, but Google launched headless mode with Chrome 59 several months ago. &lt;/p&gt;

&lt;p&gt;PhantomJS was the leader in this space: it was (and still is) heavily used for browser automation and testing. After hearing the news about headless Chrome, the PhantomJS maintainer said he was stepping down, because, and I quote, &lt;em&gt;"Google Chrome is faster and more stable than PhantomJS [...]"&lt;/em&gt;.&lt;br&gt;
It looks like headless Chrome is becoming the way to go when it comes to browser automation and dealing with JavaScript-heavy websites. &lt;/p&gt;

&lt;p&gt;HtmlUnit, PhantomJS, and the other headless browsers are very useful tools; the problem is that they are not as stable as Chrome, and you will sometimes encounter JavaScript errors that would not have occurred in Chrome. &lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google Chrome &amp;gt; 59&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.google.com/a/chromium.org/chromedriver/downloads"&gt;Chromedriver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Selenium &lt;/li&gt;
&lt;li&gt;In your &lt;strong&gt;&lt;em&gt;pom.xml&lt;/em&gt;&lt;/strong&gt;, add a recent version of Selenium:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.seleniumhq.selenium&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;selenium-java&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;3.8.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If you don't have Google Chrome installed, you can download it &lt;a href="https://www.google.com/chrome/browser/desktop/index.html"&gt;here&lt;/a&gt;.&lt;br&gt;
To install Chromedriver, you can use brew on macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install chromedriver
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or download it using the link in the prerequisites. &lt;br&gt;
There are a lot of versions; I suggest you use the latest versions of Chrome and Chromedriver.&lt;/p&gt;
&lt;h3&gt;
  
  
  Let's log into Hacker News
&lt;/h3&gt;

&lt;p&gt;In this part, we are going to log into Hacker News, and take a screenshot once logged in. We don't need Chrome headless for this task, but the goal of this article is only to show you how to run headless Chrome with Selenium.&lt;/p&gt;

&lt;p&gt;The first thing we have to do is create a WebDriver object, set the chromedriver path, and pass some arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Init chromedriver&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/Path/To/Chromedriver"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--disable-gpu&lt;/code&gt; option is needed on Windows systems, according to the &lt;a href="https://developers.google.com/web/updates/2017/04/headless-chrome"&gt;documentation&lt;/a&gt;.&lt;br&gt;
Chromedriver should automatically find the Google Chrome executable path. If you have a special installation, or if you want to use a different version of Chrome, you can set it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;options.setBinary("/Path/to/specific/version/of/Google Chrome");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you want to learn more about the different options, here is the &lt;a href="https://sites.google.com/a/chromium.org/chromedriver/capabilities"&gt;Chromedriver documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The next step is to perform a GET request to the Hacker News login form, select the username and password fields, fill them in with our credentials, and click the login button. Then we check for a credential error, and if we are logged in, we can take a screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ax81Lgo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-headless/hn_screenshot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have done this in a previous article; here is the full code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChromeHeadlessTest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/your/chromedriver/path"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
       &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"webdriver.chrome.driver"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chromeDriverPath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
       &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addArguments&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--headless"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--disable-gpu"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--window-size=1920,1200"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"--ignore-certificate-errors"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--silent"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
       &lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChromeDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Get the login page&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login?goto=news"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Search for username / password input and fill the inputs&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@name='acct']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;sendKeys&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Locate the login button and click on it&lt;/span&gt;
      &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@value='login']"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCurrentUrl&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://news.ycombinator.com/login"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Incorrect credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Take a screenshot of the current page&lt;/span&gt;
        &lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;TakesScreenshot&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getScreenshotAs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutputType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;FileUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copyFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"screenshot.png"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;// Logout&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"logout"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should now have a nice screenshot of the Hacker News homepage while being authenticated. As you can see, headless Chrome is really easy to use; it is not that different from PhantomJS, since we are driving it through Selenium. &lt;/p&gt;

&lt;p&gt;If you enjoyed this do not hesitate to subscribe to our newsletter!&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering, and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

&lt;p&gt;As usual, the code is available in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Log in to Almost Any Websites</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 02 Jan 2019 09:52:27 +0000</pubDate>
      <link>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</link>
      <guid>https://dev.to/scrapingbee/how-to-log-in-to-almost-any-websites-7dn</guid>
<description>&lt;p&gt;In the first &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;article about Java web scraping&lt;/a&gt;, I showed how to extract data from the Craigslist website. &lt;br&gt;
But what if the data you want, or the action you want to carry out, requires authentication?&lt;/p&gt;

&lt;p&gt;In this short tutorial, I will show you how to write a generic method that can handle most authentication forms. &lt;/p&gt;
&lt;h3&gt;
  
  
  Authentication mechanism
&lt;/h3&gt;

&lt;p&gt;There are many different authentication mechanisms, the most frequent being a login form, sometimes with a &lt;a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery#Forging_login_requests"&gt;CSRF token&lt;/a&gt; as a hidden input. &lt;/p&gt;

&lt;p&gt;To auto-magically log into a website with your scrapers, the idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /loginPage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; tag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the first &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; before it that is not hidden&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set the value attribute for both inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the enclosing form, and submit it. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Hacker News Authentication
&lt;/h3&gt;

&lt;p&gt;Let's say you want to create a bot that logs into Hacker News (to submit a link or perform an action that requires being authenticated):&lt;/p&gt;

&lt;p&gt;Here is the login form and the associated DOM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AYcjSLil--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-login/screenshot_hn_login_form.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can implement the login algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="nf"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;FailingHttpStatusCodeException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MalformedURLException&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IOException&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setCssEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOptions&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setJavaScriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//input[@type='password']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;//The first preceding input that is not hidden&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlInput&lt;/span&gt; &lt;span class="n"&gt;inputLogin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".//preceding::input[not(@type='hidden')]"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;inputLogin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;//get the enclosing form&lt;/span&gt;
        &lt;span class="nc"&gt;HtmlForm&lt;/span&gt; &lt;span class="n"&gt;loginForm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputPassword&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEnclosingForm&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;//submit the form&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginForm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="c1"&gt;//returns the cookie filled client :)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then the main method, which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calls &lt;code&gt;autoLogin&lt;/code&gt; with the right parameters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goes to &lt;code&gt;https://news.ycombinator.com&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Checks for the logout link to verify that we're logged in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prints the cookies to the console&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://news.ycombinator.com"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login?goto=news"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"login"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting autoLogin on "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loginUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt; &lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//a[@href='user?id=%s']"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logoutLink&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;){&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Successfuly logged in !"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="c1"&gt;// printing the cookies&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Cookie&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCookieManager&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getCookies&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;
                    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Wrong credentials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
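&lt;p&gt;Since the whole point of returning the client is keeping the session cookies, you may want to persist them between runs. The sketch below uses the JDK's &lt;code&gt;java.net.HttpCookie&lt;/code&gt; for illustration only; with HtmlUnit you would read the cookies from &lt;code&gt;client.getCookieManager().getCookies()&lt;/code&gt; instead.&lt;/p&gt;

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: saving and restoring session cookies so a login survives restarts.
// HttpCookie is a stand-in here for HtmlUnit's own Cookie class.
public class CookieStore {

    // Serialize cookies to a single header-style "name=value; name=value" line.
    public static String serialize(List<HttpCookie> cookies) {
        return cookies.stream()
                .map(c -> c.getName() + "=" + c.getValue())
                .collect(Collectors.joining("; "));
    }

    // Parse the line back into cookie objects.
    public static List<HttpCookie> parse(String line) {
        return Arrays.stream(line.split("; "))
                .map(s -> {
                    String[] kv = s.split("=", 2);
                    return new HttpCookie(kv[0], kv[1]);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<HttpCookie> cookies = List.of(
                new HttpCookie("user", "kevin"),
                new HttpCookie("auth", "abc123"));
        String saved = serialize(cookies);
        System.out.println(saved);
        System.out.println(parse(saved).size());
    }
}
```

In a real bot you would write that line to a file and feed the parsed cookies back into the client's cookie manager before the first request.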



&lt;p&gt;You can find the code in this &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go further
&lt;/h3&gt;

&lt;p&gt;There are many cases where this method will not work: Amazon, Dropbox, and any other login form protected by two-step verification or a captcha. &lt;/p&gt;

&lt;p&gt;Things that could be improved in this code: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Handle the check for the logout link inside &lt;code&gt;autoLogin&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check for &lt;code&gt;null&lt;/code&gt; inputs/form and throw an appropriate exception&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
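&lt;p&gt;The second improvement boils down to failing fast with a clear message instead of letting a &lt;code&gt;NullPointerException&lt;/code&gt; surface later. A minimal, plain-Java sketch of such a helper (the XPath call in the comment is the one from &lt;code&gt;autoLogin&lt;/code&gt; above):&lt;/p&gt;

```java
// Sketch: a generic guard for the getFirstByXPath results in autoLogin,
// so a missing input or form produces a descriptive exception immediately.
public class Preconditions {

    public static <T> T requireFound(T element, String description) {
        if (element == null) {
            throw new IllegalStateException(
                "Could not find " + description + " on the login page");
        }
        return element;
    }

    public static void main(String[] args) {
        // In autoLogin you would write something like:
        // inputPassword = requireFound(
        //     page.getFirstByXPath("//input[@type='password']"), "password input");
        System.out.println(requireFound("dummy", "password input"));
    }
}
```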

&lt;p&gt;In a future post I will show you how to deal with captchas and virtual numeric keyboards using OCR and captcha-breaking APIs!&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>An Automatic Bill Downloader in Java</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Wed, 12 Dec 2018 10:03:07 +0000</pubDate>
      <link>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</link>
      <guid>https://dev.to/scrapingbee/an-automatic-bill-downloader-in-java-4277</guid>
      <description>&lt;p&gt;In this article I am going to show how to download bills (or any other file ) from a website with HtmlUnit.&lt;/p&gt;

&lt;p&gt;I suggest you read these articles first: &lt;a href="https://dev.to/scrapingbee/introduction-to-web-scraping-with-java-5i8"&gt;Introduction to web scraping with Java&lt;/a&gt; and &lt;a href="https://www.scrapingbee.com/blog/how-to-log-in-to-almost-any-websites/" rel="noopener noreferrer"&gt;Autologin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since I am hosting this blog on &lt;a href="https://m.do.co/c/0e940b26444e" rel="noopener noreferrer"&gt;Digital Ocean&lt;/a&gt; ($10 in credit if you sign up via this link), I will show you how to write a bot that automatically downloads every bill you have. &lt;/p&gt;

&lt;h3&gt;
  
  
  Login
&lt;/h3&gt;

&lt;p&gt;To submit the login form without needing to inspect the DOM, we will use the "magic" method I wrote in the previous article. &lt;/p&gt;

&lt;p&gt;Then we have to go to the billing page: &lt;code&gt;https://cloud.digitalocean.com/settings/billing&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://cloud.digitalocean.com"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"email"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;WebClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Authenticator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;autoLogin&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/login"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;HtmlPage&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://cloud.digitalocean.com/settings/billing"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You need to sign in for access to this page"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Exception&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error during login on %s , check your credentials"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetching the bills
&lt;/h3&gt;

&lt;p&gt;Let's create a new class called Bill (or Invoice) to represent a bill: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Bill.java&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;//... getters &amp;amp; setters&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to inspect the DOM to see how we can extract the description, amount, date and URL of each bill. Open your favorite inspection tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.scrapingbee.com%2Fimages%2Fpost%2Fjava-bill%2Fbills_dom.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are lucky here: it's a clean DOM, with a nice, well-structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HtmlTable&lt;/code&gt; to store the table and iterate over each row&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;getCell&lt;/code&gt; to select the cells&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, using the Jackson library, we will serialize each Bill object to JSON and print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFirstByXPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//table[@class='listing Billing--history']"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HtmlTableRow&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;billsTable&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBodies&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getRows&lt;/span&gt;&lt;span class="o"&gt;()){&lt;/span&gt;

    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// We only want the invoice row, not the payment one&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Invoice"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleDateFormat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MMMM d, yyyy"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Locale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ENGLISH&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;asText&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"$"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;HtmlAnchor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCell&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getFirstChild&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;getHrefAttribute&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="nc"&gt;Bill&lt;/span&gt; &lt;span class="n"&gt;bill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Bill&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;bills&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValueAsString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bill&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
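&lt;p&gt;Stripped of HtmlUnit, the two trickiest cells above come down to two small parsing steps: a date cell such as &lt;code&gt;"December 1, 2015"&lt;/code&gt; and an amount cell such as &lt;code&gt;"$6.00"&lt;/code&gt; (both example values are hypothetical; in the real code they come from &lt;code&gt;row.getCell(i).asText()&lt;/code&gt;). A stdlib-only sketch:&lt;/p&gt;

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Minimal illustration of the date and amount parsing used in the loop above.
public class CellParsing {

    public static Date parseDate(String cell) throws ParseException {
        // Locale.ENGLISH matters: the month names on the page are English,
        // whatever the default locale of the JVM happens to be.
        return new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH).parse(cell);
    }

    public static BigDecimal parseAmount(String cell) {
        // BigDecimal (not double) avoids floating-point surprises with money.
        return new BigDecimal(cell.replace("$", ""));
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseDate("December 1, 2015"));
        System.out.println(parseAmount("$6.00"));
    }
}
```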



&lt;p&gt;It's almost finished; the last thing is to download the invoice. It's pretty easy: we will use the &lt;code&gt;Page&lt;/code&gt; object to store the PDF, and call &lt;code&gt;getContentAsStream&lt;/code&gt; on it. It's better to check that the file has the right content type when doing this (&lt;code&gt;application/pdf&lt;/code&gt; in our case).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Page&lt;/span&gt; &lt;span class="n"&gt;invoicePdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentType&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"application/pdf"&lt;/span&gt;&lt;span class="o"&gt;)){&lt;/span&gt;
    &lt;span class="nc"&gt;IOUtils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoicePdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getWebResponse&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getContentAsStream&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DigitalOcean"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".pdf"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
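&lt;p&gt;The &lt;code&gt;IOUtils.copy&lt;/code&gt; call above comes from Apache Commons IO. If you prefer to avoid the extra dependency, since Java 9 the standard library can do the same with &lt;code&gt;InputStream.transferTo&lt;/code&gt;; a sketch:&lt;/p&gt;

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stdlib replacement for IOUtils.copy: stream the response body to a file.
public class StreamCopy {

    public static long copy(InputStream in, OutputStream out) throws IOException {
        // try-with-resources closes both streams even if the copy fails
        try (in; out) {
            return in.transferTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        // In the bot, `in` would be invoicePdf.getWebResponse().getContentAsStream()
        // and `out` a FileOutputStream; byte arrays keep this demo self-contained.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream("%PDF-1.4 ...".getBytes()), out);
        System.out.println(copied + " bytes copied");
    }
}
```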



&lt;p&gt;That's it, here is the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for December 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1451602800000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for November 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1448924400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1446332400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for April 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1430431200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for March 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1427839200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for February 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1425164400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for January 2015"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1422745200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Invoice for October 2014"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1414796400000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"/billing/XXXXXX.pdf"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual, you can find the full code in this &lt;a href="https://github.com/ksahin/introWebScraping" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check out our new &lt;a href="https://www.scrapingbee.com" rel="noopener noreferrer"&gt;web scraping API&lt;/a&gt;; the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web Scraping Handling Ajax Website</title>
      <dc:creator>Kevin Sahin</dc:creator>
      <pubDate>Sat, 01 Dec 2018 10:14:30 +0000</pubDate>
      <link>https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8</link>
      <guid>https://dev.to/scrapingbee/web-scraping-handling-ajax-website-1ip8</guid>
      <description>&lt;p&gt;Today more and more websites are using Ajax for fancy user experiences, dynamic web pages, and many more good reasons. &lt;br&gt;
Crawling Ajax heavy website can be tricky and painful, we are going to see some tricks to make it easier.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;p&gt;Before starting, please read the previous articles I wrote to learn how to set up your Java environment and get a basic understanding of HtmlUnit: &lt;a href="https://ksah.in/introduction-to-web-scraping-with-java/"&gt;Introduction to Web Scraping With Java&lt;/a&gt; and &lt;a href="https://ksah.in/how-to-log-in-to-almost-any-websites/"&gt;Handling Authentication&lt;/a&gt;.&lt;br&gt;
After reading these you should be a little more familiar with web scraping.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The first way to scrape an Ajax website with Java that we are going to see is by using &lt;a href="http://phantomjs.org/"&gt;PhantomJS&lt;/a&gt; with Selenium and GhostDriver. &lt;/p&gt;

&lt;p&gt;PhantomJS is a headless web browser based on WebKit (the engine used in Safari). It is quite fast and does a great job of rendering the DOM like a normal web browser.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First you'll need to &lt;a href="http://phantomjs.org/download.html"&gt;download&lt;/a&gt; PhantomJS&lt;/li&gt;
&lt;li&gt;Then add this to your pom.xml:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.github.detro&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;phantomjsdriver&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.2.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;and this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.seleniumhq.selenium&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;selenium-java&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.53.1&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  PhantomJS and Selenium
&lt;/h2&gt;

&lt;p&gt;Now we're going to use Selenium and GhostDriver to "pilot" PhantomJS. &lt;/p&gt;

&lt;p&gt;The example that we are going to see is a simple "See more" button on a news site that performs an Ajax call to load more news. &lt;br&gt;
So you may think that opening PhantomJS to click on a simple button is a waste of time and overkill? Of course it is!&lt;/p&gt;

&lt;p&gt;The news site is &lt;a href="https://www.inshorts.com/en/read"&gt;Inshorts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--srJLjLtm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/buttonLoadMore.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--srJLjLtm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/buttonLoadMore.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As usual, we have to open Chrome DevTools or your favorite inspector to see how to select the "Load More" button and then click on it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0suqT7N---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/domLoadMore.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0suqT7N---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/domLoadMore.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at some code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;DesiredCapabilities&lt;/span&gt; &lt;span class="n"&gt;desiredCaps&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;


    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;initPhantomJS&lt;/span&gt;&lt;span class="o"&gt;(){&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DesiredCapabilities&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setJavascriptEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"takesScreenshot"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_EXECUTABLE_PATH_PROPERTY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/usr/local/bin/phantomjs"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"User-Agent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cliArgsCap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--web-security=false"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--ssl-protocol=any"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--ignore-ssl-errors=true"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--webdriver-loglevel=ERROR"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCapability&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PhantomJSDriverService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PHANTOMJS_CLI_ARGS&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cliArgsCap&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PhantomJSDriver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desiredCaps&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;manage&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Dimension&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1920&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1080&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's a lot of code to set up PhantomJS and Selenium!&lt;br&gt;
I suggest you read the documentation to see the many arguments you can pass to PhantomJS.&lt;/p&gt;

&lt;p&gt;Note that you will have to replace &lt;code&gt;/usr/local/bin/phantomjs&lt;/code&gt; with your own PhantomJS executable path.&lt;/p&gt;

&lt;p&gt;Then, in a main method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"phantomjs.page.settings.userAgent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;USER_AGENT&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.inshorts.com/en/read"&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;initPhantomJS&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nbArticlesBefore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//div[@class='card-stack']/div"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"load-more-btn"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;click&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// We wait for the ajax call to fire and to load the response into the page&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nbArticlesAfter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"//div[@class='card-stack']/div"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Initial articles : %s Articles after clicking : %s"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbArticlesBefore&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbArticlesAfter&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here we call the &lt;code&gt;initPhantomJS()&lt;/code&gt; method to set up everything, then we select the button by its id and click on it. &lt;/p&gt;

&lt;p&gt;The other part of the code counts the number of articles on the page and prints it to show what we have loaded. &lt;/p&gt;

&lt;p&gt;We could also have printed the entire DOM with &lt;code&gt;driver.getPageSource()&lt;/code&gt; and opened it in a real browser to see the difference before and after the click.&lt;/p&gt;
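&lt;p&gt;For example, dumping that page source to a file takes only a few lines of plain Java. This is just a sketch: the &lt;code&gt;html&lt;/code&gt; argument stands for the string returned by &lt;code&gt;driver.getPageSource()&lt;/code&gt;, and the file name is arbitrary:&lt;/p&gt;

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Write a page-source string to a file so it can be opened in a real browser.
public class PageDump {
    public static Path dump(String html, String fileName) throws IOException {
        Path out = Paths.get(fileName);
        // Creates the file (or truncates it) and writes the markup as UTF-8.
        Files.writeString(out, html);
        return out;
    }
}
```

&lt;p&gt;You would call it once before and once after the click, e.g. &lt;code&gt;PageDump.dump(driver.getPageSource(), "before.html");&lt;/code&gt;, then diff or open the two files.&lt;/p&gt;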

&lt;p&gt;I suggest you look at the &lt;a href="https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html"&gt;Selenium WebDriver&lt;/a&gt; documentation; there are lots of cool methods to manipulate the DOM. &lt;/p&gt;

&lt;p&gt;I used a dirty solution with my &lt;code&gt;Thread.sleep(800)&lt;/code&gt; to wait for the Ajax call to complete. &lt;br&gt;
It's dirty because it is an arbitrary number, and the scraper could run faster if we waited only as long as the Ajax call actually takes.&lt;/p&gt;

&lt;p&gt;There are other ways of solving this problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;waitForAjax&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;WebDriverWait&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExpectedCondition&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WebDriver&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;JavascriptExecutor&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;JavascriptExecutor&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeScript&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"return jQuery.active == 0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you look at the function being executed when we click on the button, you'll see it's using jQuery:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wyl0q749--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/jqueryPng-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wyl0q749--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scrapingbee.com/images/post/java-ajax/jqueryPng-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code will wait until the variable &lt;code&gt;jQuery.active&lt;/code&gt; equals 0 (an internal jQuery variable that counts the number of ongoing Ajax calls).&lt;/p&gt;
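&lt;p&gt;Under the hood, &lt;code&gt;WebDriverWait&lt;/code&gt; is essentially a polling loop: it re-evaluates the condition every few hundred milliseconds until it returns true or a timeout expires. The idea can be sketched in plain Java (this helper is illustrative, not part of Selenium):&lt;/p&gt;

```java
import java.util.function.BooleanSupplier;

// A minimal polling wait: re-check a condition until it holds or we time out.
// WebDriverWait works on the same principle, with richer error reporting.
public class PollingWait {
    public static boolean waitUntil(BooleanSupplier condition,
                                    long timeoutMillis,
                                    long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition met, stop waiting
            }
            Thread.sleep(pollMillis); // back off before the next check
        }
        return false; // gave up after the timeout
    }
}
```

&lt;p&gt;With Selenium, the condition would be something like the &lt;code&gt;jQuery.active == 0&lt;/code&gt; check above, evaluated through &lt;code&gt;JavascriptExecutor&lt;/code&gt;.&lt;/p&gt;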

&lt;p&gt;If we knew which DOM elements the Ajax call is supposed to render, we could use that id/class/XPath in the WebDriverWait condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExpectedConditions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;elementToBeClickable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;By&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;xpath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xpathExpression&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So we've seen a little bit of how to use PhantomJS with Java.&lt;/p&gt;

&lt;p&gt;The example I took is really simple; it would have been easy to simulate the request directly.&lt;/p&gt;

&lt;p&gt;But sometimes, when you have tens of Ajax calls and lots of JavaScript being executed to render the page properly, it can be very hard to scrape the data you want, and PhantomJS/Selenium is here to save you :)&lt;/p&gt;

&lt;p&gt;Next time we will see how to do it by analyzing the Ajax calls and making the requests ourselves.&lt;/p&gt;

&lt;p&gt;As usual, you can find all the code in my &lt;a href="https://github.com/ksahin/introWebScraping"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rendering JS at scale can be really difficult and expensive. This is exactly the reason why we built &lt;a href="https://www.scrapingbee.com"&gt;ScrapingBee&lt;/a&gt;, a web scraping API that takes care of this for you.&lt;/p&gt;

&lt;p&gt;It also takes care of proxies and CAPTCHAs, so don't hesitate to check it out: the first 1000 API calls are on us.&lt;/p&gt;

</description>
      <category>java</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
