Welcome to day 35 of the 49 Days of Ruby! 🎉
Now that we know a bit about HTTP and making HTTP requests in Ruby, today we'll discuss how to use that knowledge to scrape the web!
Web scraping means writing code that fetches a page from the web and extracts the content you want from it. It is an alternative to using APIs (more about that tomorrow) and is often used when no API is available.
tl;dr Today's resources come from an excellent blog post by Sylwia, a DEV community member and friend.
If you recall from yesterday, we made HTTP requests using the net/http library. Today, we will use open-uri, which is also part of Ruby's standard library:
require "open-uri"
# On Ruby 3.0 and later, plain open no longer accepts URLs, so we call URI.open
html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams").read
The example above looks a lot like our fetching of the blog post yesterday, only more condensed. The variable html now holds the HTML content of the Wikipedia page for Douglas Adams.
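Real requests can fail: the page might be missing, or the network might be down. Here is a minimal sketch of a more defensive fetch. The fetch_html helper and the timeout values are assumptions made for illustration, not something taken from Sylwia's post:

```ruby
require "open-uri"
require "timeout"

# Hypothetical helper (not from the original post): fetch a page and return
# its body as a String, or nil if the request fails for any common reason.
def fetch_html(url)
  # The timeout values are arbitrary choices for this sketch.
  URI.open(url, open_timeout: 5, read_timeout: 10, &:read)
rescue OpenURI::HTTPError, SocketError, SystemCallError, Timeout::Error => e
  # OpenURI::HTTPError covers responses such as 404 or 500;
  # the other classes cover DNS, connection, and timeout problems.
  warn "Could not fetch #{url}: #{e.message}"
  nil
end

# Usage (needs network access, so it is left commented out here):
# html = fetch_html("https://en.wikipedia.org/wiki/Douglas_Adams")
```

Returning nil on failure keeps the calling code simple, at the cost of hiding which error happened. In a larger scraper you may prefer to let the exception propagate.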
Our next step is to parse that HTML.
A popular gem for parsing HTML is Nokogiri. Unlike open-uri, it is not part of the standard library, so you need to install it first with gem install nokogiri. The gem is very powerful, and that power brings complexity that grows quickly as you build more intricate applications.
In our case, we will keep our usage of it minimal:
require "nokogiri"
response = Nokogiri::HTML(html)
The response variable now contains a Nokogiri::HTML::Document object. This is the HTML parsed into a tree of nested nodes, where each element holds its child elements and text.
We now have our HTML in a structure that we can scrape some data from.
For our example, we'll get just the main body text for Douglas Adams.
We do that by finding some kind of identifier on the Wikipedia page that we can use. HTML is the language websites are written in. Another language, which we are not covering but need to mention, is CSS, the language websites are styled in. CSS targets parts of a page with selectors, such as tag names, classes, and ids, and we can use those same selectors to identify the part we want to scrape.
In the case of the Wikipedia page, the text is inside p tags. We can call the Nokogiri #css method with p as the selector to get just the text:
text = response.css("p").text
Now, if you inspect text you will see it contains the text of every paragraph on Douglas Adams's Wikipedia page, joined into one long string. You've successfully scraped a site!
If you want to read more about this, I highly recommend Sylwia's post. She goes into a lot more detail than our format allows. Continue to share your learnings with the community using the hashtag #49daysofruby!
Come back tomorrow for the next installment of 49 Days of Ruby! You can join the conversation on Twitter with the hashtag #49daysofruby.