Crawl the web with ferrum and nokogiri

#ruby #webcrawler #rails #nokogiri

Wikipedia definition:

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.

I usually need to write a web crawler when some data is available on the web, but in a format that makes it hard to process.
A typical example is a company that has access to the B2B catalog of its supplier, but where the process for obtaining that data is slow and complicated.

Imagine having to open the browser, log in to the site, fill in the search form, scroll through the results and find the relevant ones, then repeat the process for two other suppliers that sell the same item but use a different search method (breadcrumb navigation instead of a search form) and display the price in a different way (excluding VAT).

Now imagine having your own web app in front of you: you fill in a form with the search terms and let the web crawler do all the work for you in the background, or even remotely. Once the crawler finishes its job, you only see the relevant results, well organized on your page.

Why not just use curl or wget to get the page source and parse it with grep and awk?

Well, it's not that easy. Most of the websites I have to interact with are not meant to be consumed this way. Some of them are single-page apps, and others require some sort of JavaScript interaction. When you have to grab data from a website that was built to be used through a browser, the best way to do that is with a browser.
Rails developers are familiar with selenium-webdriver, the tool that launches a browser session and that Capybara drives to run your system tests.
Capybara is great for testing your own application, when you know what to expect on a page. When you have to explore a foreign website, there are better tools for the job. For the past few years I have been using watir, a very solid gem (it has been developed for over 15 years) that drives a browser through selenium-webdriver but has a very Ruby-like syntax that makes writing and reading code a pleasure.

 # Sample watir code
 require 'watir'

 # Initialize the browser
 browser = Watir::Browser.new

 # Navigate to Page
 browser.goto 'a.testaddressbook.com'
 # Authenticate and Navigate to the Form
 browser.link(id: 'sign-in').click
 browser.text_field(data_test: 'email').set 'watir_example@example.com'
 browser.text_field(data_test: 'password').set 'password'
 browser.button(name: 'commit').click

Goodbye watir, welcome ferrum

The thing is that you don't need selenium anymore. There's a new gem (or at least new to me) called ferrum that aims to be a clean, lightweight Ruby API for the browser.
Ferrum is pure Ruby and talks to Chrome directly over the Chrome DevTools Protocol (CDP).
I've tried it on a couple of projects and it looks very easy and fast. The other great thing is that it feels like writing natural Ruby:

require 'ferrum'

# headless: false opens a visible browser window, which helps while developing
browser = Ferrum::Browser.new(headless: false)
browser.goto 'https://www.exampleservice.com/login'
browser.at_css('#login_username').focus.type('exampleusername')
browser.at_css('#login_password_field').focus.type('examplepassword')
browser.at_css('button[type="submit"]').click

It still lacks some of the nice things that come with watir, for example, the ability to enumerate browser elements like this:

# my old workflow using watir
browser.table(id: "Products").tbody.trs(class: "Available").each do |row|
  parse_row(row)
end

This is a very neat feature of watir but, for some reason, it is extremely slow.

Nokogiri to the rescue

The rule of thumb for my projects is to use watir or ferrum to get to the page that I want, then grab the DOM and parse it using nokogiri.
Nokogiri is solid, fast, actively maintained and can do almost everything you need when parsing HTML.

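require 'nokogiri'

@browser.goto("http://www.example.com")
# With ferrum, body returns the page HTML as a string (with watir, use @browser.html)
page = Nokogiri::HTML(@browser.body)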
page.css('.panel-body .row').each do |row|
  parse_row(row)
end

Enumerating elements with nokogiri is a very fast operation that runs in milliseconds.

Here is an example of parsing some data:

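# Extracts the fields of one result row. The element positions and the
# selectors are specific to the page being crawled.
def parse_row(row)
  # First child element: brand logo and photo link.
  # The brand is the logo file name without its extension.
  brand = row.elements[0].at_css('img')['src'].split('/').last.split('.').first
  photo = row.elements[0].at_css('a')['href']
  # Second child element: the heading holds the name and the color (prefixed with " / ")
  name = row.elements[1].at_css('h4').children[0].text
  color = row.elements[1].at_css('h4').children[1].text.sub(' / ', '')
  # The first text node of the paragraph starts with the rim and the width
  desc = row.elements[1].at_css('p > text()').text
  width = desc.split[1].sub(/[A-Z]/, '')
  rim = desc.split[0].to_i
  size = "#{width}x#{rim}"
  # Third child element: the price in bold
  cost = row.elements[2].at_css('b').text
  puts "#{brand} #{name} #{color} #{size}, #{cost}"
end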