
Creating My First CLI Gem Part 3: Using Nokogiri to Scrape Data for My Objects

In Part 2, I set up the basic CLI interface using dummy data so I could concentrate on the interface without worrying about objects just yet. Next, I created my Recipe class and Scraper class so I could get live data and use that data to create real objects.

The Scraper

For each recipe object, I wanted to get the title, description, ingredients and steps, which would be set as instance variables in the Recipe class: attr_accessor :name, :description, :ingredients, :steps. I chose to use Cookpad’s website to get the data, as they have a trending recipes page and of course, search pages.

To actually scrape the data, I used the Nokogiri gem, which I added by inserting spec.add_dependency 'nokogiri', "~> 1.8", ">= 1.8.2" into my gemspec and requiring 'nokogiri' in my find_recipe.rb file, which acts as a kind of environment file.
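
For reference, the two additions look roughly like this (an abbreviated sketch: the rest of the gemspec fields and the other requires in find_recipe.rb are omitted, and the exact file layout in the repo may differ):

# find_recipe.gemspec (excerpt)
Gem::Specification.new do |spec|
  # ...name, version, summary, and other fields...
  spec.add_dependency 'nokogiri', "~> 1.8", ">= 1.8.2"
end

# find_recipe.rb -- the "environment" file (excerpt)
require 'nokogiri'
require 'open-uri'   # a typical companion for fetching pages with Nokogiri; an assumption, not confirmed in the post
# ...requires for the Scraper, Recipe, and CLI classes...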

The tricky part about getting the recipe data was that I first needed to get the list of recipe titles from a trending or search page and then get the rest of the data from each individual page. It couldn’t be done all at once. Also, the trending page and the search pages are quite different, so I decided to create different scraper methods for each type.

Because I needed to access each individual recipe page, I added another instance variable to the Recipe class, :url.
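
Put together, the attribute declarations in the Recipe class boil down to this (just a sketch of the accessors; the full class on GitHub contains the methods discussed below):

class FindRecipe::Recipe
  # Attributes scraped from the listing pages (name, url)
  # and from each individual recipe page (description, ingredients, steps)
  attr_accessor :name, :description, :ingredients, :steps, :url
end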

The main Scraper methods I built are as follows. You can see the entire class on GitHub, too. (Note: the Scraper class does not use instance methods or instance variables, as it does not need to store any data itself; whatever it scrapes simply gets passed to the Recipe class.)

#scrape_trending_recipes_page

def self.scrape_trending_recipes_page
  recipe_page = self.get_trending_recipes_page

  # Each trending recipe is rendered as a card; pull its title and link out of the card markup
  recipe_cards = recipe_page.css( ".card.feed__card" )
  recipe_cards.collect do |card|
    recipe = {
      name: card.css( "h2.recipe-title" ).text.strip,
      url: ROOT_URL + card.css( "a.link-unstyled" ).attribute( "href" ).value
    }
  end.uniq
end

The uniq at the end accounts for the duplicates on the trending recipes page; since hashes with the same keys and values compare as equal, the repeated entries collapse into one.
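
As a toy illustration (the values here are made up, not real scraper output):

recipes = [
  { name: "Oyakodon", url: "https://cookpad.com/us/recipes/1" },
  { name: "Oyakodon", url: "https://cookpad.com/us/recipes/1" }
]
recipes.uniq.length # => 1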

#scrape_search_page( keyword )

This method uses the user-input keyword to find the appropriate search page.

def self.scrape_search_page( keyword )
  recipe_page = self.get_recipe_page_from_keyword( keyword )

  # Each search result is rendered as a wide-card list item; pull its title and link out of it
  recipe_cards = recipe_page.css( "li.wide-card" )
  recipe_cards.collect do |card|
    recipe = {
      name: card.css( "h2.recipe-title span" ).text.strip,
      url: ROOT_URL + card.css( "a.media" ).attribute( "href" ).value
    }
  end
end

#scrape_individual_recipe_data( recipe_url )

After getting the initial recipe data (title and url) from the listing pages, this method would use the Recipe instance’s url attribute to fetch the rest of the data.

def self.scrape_individual_recipe_data( recipe_url )
  recipe_data = {}
  recipe_page = self.get_individual_recipe_page( recipe_url )
  recipe_data[:description] = recipe_page.at( "meta[itemprop='description']" )['content']

  # Gets ingredients as an array of each ingredient
  recipe_data[:ingredients] = recipe_page.css( ".ingredient__details" ).collect{ |ingredient| ingredient.text.strip }

  # Gets steps as an array of each step
  recipe_data[:steps] = recipe_page.css( ".step .prose" ).collect{ |step| step.text.strip }
  recipe_data
end

The Recipe class

The data returned from the scraper, as a hash or an array of hashes, is then processed by the Recipe class. First, the #create_recipes class method creates recipes based on the initial data passed in from the Scraper class: the title and url. Another method, #add_additional_recipe_data, then adds the rest of the data to each recipe instance using the data from its respective page. Since it operates on recipe instances, this method is an instance method. (You can see the entire class on GitHub.)

#create_recipes( keyword = nil )

This method first calls the appropriate scraper, depending on whether a search keyword was provided. It then creates a recipe for each result, storing them in an array, and finally fills in the additional data on each instance with the #add_additional_recipe_data method.

# Create recipes based off an array of recipe data hashes output by the scraper
def self.create_recipes( keyword = nil )
  if keyword
    recipe_array = FindRecipe::Scraper.scrape_search_page( keyword )
  else
    recipe_array = FindRecipe::Scraper.scrape_trending_recipes_page
  end

  recipes = recipe_array.collect do |recipe|
    self.new( recipe )
  end

  # Add remaining attributes from each individual recipe page
  recipes.each do |recipe|
    additional_recipe_data = FindRecipe::Scraper.scrape_individual_recipe_data( recipe.url )
    recipe.add_additional_recipe_data( additional_recipe_data )
  end
end

#add_additional_recipe_data( data_hash )

This uses the #send method to flexibly set each attribute and its value on the recipe. Because it simply iterates over whatever keys are in the hash, attributes that happen to be missing from the scraped data are skipped without causing an error, and any attributes added to the Recipe class later are handled without changing this method.

def add_additional_recipe_data( data_hash )
  data_hash.each do |attribute, value|
    self.send( "#{attribute}=", value )
  end
  self
end

The initialize method similarly uses the #send method to flexibly handle whatever data hash is passed to it when a Recipe instance is created:

def initialize( recipe_hash )
  recipe_hash.each do |attribute, value|
    self.send( "#{attribute}=", value )
  end
end
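
To make the flow concrete, here is roughly how the two methods work together (a toy example with made-up values, not actual scraper output):

recipe = FindRecipe::Recipe.new( { name: "Oyakodon", url: "https://cookpad.com/us/recipes/1" } )

recipe.add_additional_recipe_data( {
  description: "A quick chicken-and-egg rice bowl",
  ingredients: [ "2 eggs", "1 chicken thigh", "1 bowl of rice" ],
  steps: [ "Simmer the chicken in the sauce", "Add the beaten eggs", "Serve over rice" ]
} )

recipe.name        # => "Oyakodon"
recipe.ingredients # => ["2 eggs", "1 chicken thigh", "1 bowl of rice"]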

Finishing up the CLI class

Now that real Recipe objects could be created from real, live data scraped from Cookpad, it was time to integrate this functionality into the CLI class.

Recipes are created when the user chooses a search option. FindRecipe::Recipe.create_recipes knows whether to get trending recipes or search results by checking whether a keyword has been passed as an argument.

The created recipes are then stored in an instance variable as an array so they can be accessed in other areas of the CLI class. Another reason I did this is that the scraping takes several seconds, so instead of re-scraping the data every time the user goes back to the recipe list, the program can simply use the data already stored in the instance variable, avoiding long loading times.

def search_options
  puts ""
  puts "Welcome!".red
  puts "How do you want to get started?"
  puts ""
  puts "1.".blue + " See trending recipes"
  puts "2.".blue + " Search for a recipe"
  puts ""
  puts "Enter 1 or 2, or exit"

  input = gets.strip.downcase

  if input == "1"

    # Scrapes recipes only once to save loading time
    if !@trending_recipes
      puts "Please wait a moment for the recipes to be loaded..."
      @trending_recipes = FindRecipe::Recipe.create_recipes
    end
    trending_recipes
  elsif input == "2"

    # Scrape recipes only if a search hasn't been done yet or if user restarts
    if !@searched_recipes || @searched_recipes.length == 0
       puts "What is the dish or ingredient you want to search for?"
       input = gets.strip.downcase
       puts "Please wait a moment for the recipes to be loaded..."

       # If search keyword has spaces, it's necessary to replace them with %20 so the URL works
       @searched_recipes = FindRecipe::Recipe.create_recipes( input.gsub( " ", "%20" ) )
    end
    choose_searched_recipes
  elsif input == "exit"
    exit_program
  else
    puts "Not sure what you mean..."
    search_options
  end
end

The CLI class's #trending_recipes and #choose_searched_recipes methods then take care of actually choosing recipes from the list and viewing them, along the lines of the sketch below. (See the entire CLI class on GitHub.)
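
While the real methods live in the repo, their general shape is a numbered list plus an input prompt, much like #search_options above. Here is a rough sketch of that pattern (not the actual code; the show_recipe helper and other details are assumptions based on the snippets in this post):

def trending_recipes
  puts ""
  @trending_recipes.each_with_index do |recipe, index|
    puts "#{index + 1}.".blue + " #{recipe.name}"
  end
  puts ""
  puts "Enter a recipe number to view it, or type menu to go back"

  input = gets.strip.downcase
  if input == "menu"
    search_options
  elsif input.to_i.between?( 1, @trending_recipes.length )
    show_recipe( @trending_recipes[ input.to_i - 1 ] )  # hypothetical helper that prints one recipe
  else
    puts "Not sure what you mean..."
    trending_recipes
  end
end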

If you’d like to check out the entire project, head over to the GitHub repository. I also published the gem to [RubyGems.org](https://rubygems.org/gems/find_recipe) (exciting!).
