There will be times when you want to get data from the internet for a program you are writing. APIs are a great resource, but only if one exists for the subject you need and actually exposes the data you are looking for. If you want information about upcoming events for your favorite celebrity, or the next games for the Seahawks or Mariners, APIs often don’t cut it. Scraping is a great way to get that data straight from a webpage.
Scraping is the process of extracting data from a webpage. It can refer to manually copying the data from the page, but it usually means programmatically parsing the data you need out of the webpage’s HTML. The best way to scrape with Ruby is to use the Nokogiri gem along with Open-URI: Open-URI is just used to get the HTML from the link you pass it, then Nokogiri does all of the interesting work.
The name Nokogiri comes from the Japanese word for “a fine-toothed saw,” the kind carpenters and cabinetmakers used to rip logs into boards. And that’s how the Nokogiri gem is used: it precisely saws out only the data you are looking for from a webpage.
To use Nokogiri, first install the gem, then require it and Open-URI at the top of your file:
require 'nokogiri'
require 'open-uri'
Then, to start, you need to get the HTML from whatever website you want data from, using its URL:
html = URI.open('http://www.google.com')
(Since Ruby 3.0, Open-URI no longer patches Kernel#open, so call URI.open; on older Rubies a bare open also works.)
Once you have this, Nokogiri will take the HTML returned and parse it into a document of nested nodes that we can work with (it is conventional to call this ‘doc’):
doc = Nokogiri::HTML(html)
Now doc holds all of the HTML converted into a tree of nested nodes (similar in spirit to nested hashes), so we can search the Nokogiri object for the information we want.
Now that we have the fundamentals down, we can get to the interesting stuff: pulling the data we’re looking for out of ‘doc’. Nokogiri uses CSS selectors to return data from a specific part of the HTML document. Take this HTML, for example:
<div id="example-div">
  <p class="example-paragraph">Example paragraph</p>
</div>
In this example you would refer to the div element with #example-div (# for ids) and the paragraph with .example-paragraph (. for classes). In HTML an id must be unique and refers to only one element, while a class can be applied to a whole group of similar elements.
Hunting through the Nokogiri output in the doc variable for the element you want is possible, but not very practical, since it is a very long wall of markup. The best way to find what you want is to open the original webpage in your browser, right-click the element, and choose Inspect. This will show you the HTML of the whole webpage with the item you clicked highlighted.
Back in Ruby, Nokogiri has a .css method that is called on the doc variable. It takes as an argument the CSS selector you want to find, using the classes and ids you found with the approach from the previous paragraph. So getting the paragraph from the example above would look like:
doc.css(".example-paragraph")
But this actually returns a NodeSet, an array-like collection holding one Nokogiri object with a bunch of information about the whole paragraph element. If you want only the text part of the paragraph, call .text on the end:
doc.css(".example-paragraph").text
=> "Example paragraph"
But it isn’t very interesting to grab information from only one part of a webpage, and this is where iteration comes in. If an animal shelter website had a list of div elements, one for every adoptable pet, you could find the class name shared by those divs and call pets = doc.css(".class-name") to get a collection of elements, then iterate over them with an enumerable to get the names of all of the pets. For example:
pets.map do |pet|
  pet.css(".pet-name").text
end
=> [array of pet names]
Clearly scraping is a powerful tool for getting data from the internet, but it has a number of drawbacks. One is that it can take time to find what you’re looking for in the markup. Another is that the code to scrape one website may look vastly different from the code to scrape another. But the biggest problem is that if a website changes its layout, the scraper you wrote probably won’t work anymore. Even with these stumbling blocks, scraping is still much faster and easier than entering data by hand.