DEV Community

JaredHarbison

It's Scraping Seasons

Rails + React + Redux - Pt 3


This post covers some of the more challenging tasks I encountered while scraping data from Fandom and connecting that data to the schema established in the last post. It focuses on defining the get_seasons method; the get_queens method will follow in the next post. The gists are heavily commented!


Let's get started!


1. Def get_seasons in season.rb to scrape the list of season names from Fandom and .create!() an instance of each Season, concatenate each season name into an array of URLs for each Season's Wikipedia page, then iterate through the array to create the Episodes for each Season.

require "open-uri"

class Season < ApplicationRecord
  has_many :episodes
  accepts_nested_attributes_for :episodes

  def get_seasons
    I18n.enforce_available_locales = false
    #### define the location of the season index and open it with Nokogiri
    #### (URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+)
    seasons_index_url = "https://rupaulsdragrace.fandom.com/wiki/Category:Seasons"
    seasons_index_doc = Nokogiri::HTML(URI.open(seasons_index_url))
    #### define an array of season names and instantiate an object for each name
    seasons_list = seasons_index_doc.xpath('//td[1]/a[1]').map { |season| season.text }
    seasons_list.each { |season| Season.create!(season_name: season) }
    #### define an array of urls for each season, distinguishing between the original series and All Stars
    seasons_urls = seasons_list.map do |season|
      if season.starts_with?("All Stars")
        #### take the number from the end of the season name to concatenate it into the url
        all_stars_season = season[-1, 1]
        "https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_All_Stars_(season_#{all_stars_season})"
      else
        #### names longer than 8 characters (e.g. "Season 10") end in a two-digit number
        rpdr_season = season.length > 8 ? season[-2, 2] : season[-1, 1]
        "https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_#{rpdr_season})"
      end
    end
    #### iterate through the season urls to open each one with Nokogiri
    seasons_urls.each_with_index do |season_url, season_index|
      season_doc = Nokogiri::HTML(URI.open(season_url))
      #### assumes the Season ids were assigned 1..n in list order on a fresh table
      season_id = season_index + 1
      #### scrape the season's episode header row and reject any empty cells
      season_episodes = season_doc.xpath('//*[@id="mw-content-text"]/div/table[3]/tbody/tr[1]/th/text() | //*[@id="mw-content-text"]/div/table[3]/tbody/tr[1]/th/b/text()').map { |episode| episode.text }
      season_episodes = season_episodes.reject { |episode| episode.length > 3 || episode.blank? }
      #### create a unique episode identifier (i.e. S4E10) to avoid future collisions
      season_episodes_codes = season_episodes.map { |episode| "S#{season_id}E#{episode}" }
      #### scrape the contestants list for easier appearance creation
      #### (computed here, but not yet passed to Episode.create below)
      season_contestants = season_doc.xpath('//*[@id="mw-content-text"]/div/table[3]/tbody/tr/td[1]/b').map { |contestant| contestant.text.downcase }
      #### iterate through the episodes array to create each Episode
      season_episodes.each_with_index do |episode, episode_index|
        Episode.create(
          season_id: season_id,
          episode_name: episode,
          episode_code: season_episodes_codes[episode_index]
        )
      end
    end
  end
end
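
The URL-building step in the gist slices the season number off the end of the name based on the name's length. As a minimal sketch of an alternative, the helper below pulls the trailing digits out with a regular expression, so it does not depend on how long the name is. The method name season_wikipedia_url is hypothetical and not part of the gist, and it uses plain Ruby's start_with? so it runs outside Rails.

```ruby
# Hypothetical helper (not in the original gist): builds the Wikipedia URL
# for a season name such as "Season 10" or "All Stars 3".
def season_wikipedia_url(season_name)
  # Capture the trailing run of digits, however many there are.
  number = season_name[/\d+\z/]
  return nil if number.nil? # no season number, so no URL can be built

  series = if season_name.start_with?("All Stars")
             "RuPaul%27s_Drag_Race_All_Stars"
           else
             "RuPaul%27s_Drag_Race"
           end
  "https://en.wikipedia.org/wiki/#{series}_(season_#{number})"
end

puts season_wikipedia_url("Season 10")
puts season_wikipedia_url("All Stars 3")
```

Returning nil for a name without a number is a design choice for this sketch; the caller would need to skip or report those names before opening anything with Nokogiri.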
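
The episode-code step is easy to try on its own, without Rails or Nokogiri. Here is a rough sketch, assuming the header cells arrive as an array of strings; the method name episode_codes and the sample cells are made up for illustration. It uses strip and empty? in place of ActiveSupport's blank?.

```ruby
# Hypothetical helper: keeps only the short, non-empty header cells
# (the episode numbers) and turns each into a season-scoped code.
def episode_codes(season_number, header_cells)
  header_cells
    .map { |cell| cell.to_s.strip }
    .reject { |cell| cell.empty? || cell.length > 3 }
    .map { |episode| "S#{season_number}E#{episode}" }
end

# Sample header row: a label cell, three episode numbers, and a blank cell.
p episode_codes(4, ["Contestant", "1", "2", " ", "10"])
# => ["S4E1", "S4E2", "S4E10"]
```

Prefixing the season number is what keeps episode 1 of one season from colliding with episode 1 of another.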

2. Def get_queens in queen.rb to scrape the list of queens' names from Fandom, concatenate each queen's name into an array of URLs for each Queen's Fandom page, then iterate through the array to .create!() an instance of each Queen and her attributes (including associations for Quotes and Trivia).

__see the following post for the gist__
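
Until then, here is a minimal sketch of just the name-to-URL step. It is not the gist from the next post. It assumes a queen's Fandom page lives under the same /wiki/ path as the seasons index, with spaces in the name turned into underscores and other special characters percent-encoded; real page titles may differ from the scraped names. The method name queen_fandom_url and the sample names are placeholders.

```ruby
require "uri"

# Hypothetical helper: turns a scraped name into a Fandom page URL.
# Assumes the page title is the name with spaces replaced by underscores.
def queen_fandom_url(queen_name)
  slug = queen_name.strip.tr(" ", "_")
  # Percent-encode anything that is not URL-safe (e.g. apostrophes).
  "https://rupaulsdragrace.fandom.com/wiki/#{URI.encode_www_form_component(slug)}"
end

# Placeholder names, not scraped data.
queens_urls = ["Jane Doe", "Jane O'Doe"].map { |name| queen_fandom_url(name) }
puts queens_urls
```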

3. With Seasons and Queens instantiated, iterate through the Seasons and .create!() an appearance for each Queen in each episode, along with the appropriate appearance attributes.

__see the following posts for the gist__
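
In the meantime, the shape of that iteration can be sketched in plain Ruby. This is only an illustration under assumptions: the episodes are represented as hashes instead of ActiveRecord objects, and the keys episode_code, contestants, and queen_name are hypothetical stand-ins for whatever the real schema uses.

```ruby
# Hypothetical sketch: builds one attribute hash per (episode, queen) pair.
# In Rails, each hash could then be handed to a create! call on the
# appearance model.
def appearance_rows(episodes)
  episodes.flat_map do |episode|
    episode[:contestants].map do |queen_name|
      { episode_code: episode[:episode_code], queen_name: queen_name }
    end
  end
end

# Placeholder data, not scraped results.
episodes = [
  { episode_code: "S1E1", contestants: ["queen a", "queen b"] },
  { episode_code: "S1E2", contestants: ["queen a"] }
]
rows = appearance_rows(episodes)
rows.each { |row| p row }
```

Using flat_map keeps the result as a single flat list of rows instead of one nested array per episode.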

That's all folks!
