Rails + React + Redux - Pt 6
This post revisits the get_seasons and get_appearances methods to handle the most complex part of the scrape so far: the condensed and transformed ranks of each queen in each episode.
Let's get started!
1. Revisit get_seasons in season.rb to scrape, clean, and transform each queen's rank in each episode, then store the ranks on the episode for use in get_appearances.
class Season < ApplicationRecord
  has_many :episodes
  accepts_nested_attributes_for :episodes
  def get_seasons
    I18n.enforce_available_locales = false
    #### define the location of the season index and open it with Nokogiri
    #### (URI.open comes from open-uri; add require "open-uri" if it is not already loaded)
    seasons_index_url = "https://rupaulsdragrace.fandom.com/wiki/Category:Seasons"
    seasons_index_doc = Nokogiri::HTML(URI.open(seasons_index_url))
    #### define an array of season names
    seasons_list = seasons_index_doc.xpath('//td[1]/a[1]').map { |season| season.text }
    #### define an array of urls for each season, distinguishing between the two series...
    seasons_urls = seasons_list.map do |season|
      if season.starts_with?("All Stars")
        #### ... by taking the number from the end of the season name and interpolating it into the url...
        all_stars_season = season[-1, 1]
        "https://rupaulsdragrace.fandom.com/wiki/RuPaul%27s_Drag_Race_All_Stars_(Season_#{all_stars_season})"
      else
        #### ... and accommodating double-digit season numbers in the original series
        rpdr_season = season.length > 8 ? season[-2, 2] : season[-1, 1]
        "https://rupaulsdragrace.fandom.com/wiki/RuPaul%27s_Drag_Race_(Season_#{rpdr_season})"
      end
    end
    #### now create a Season object for each of the listed seasons, with the season name and fandom url
    seasons_list.each_with_index do |season, index|
      Season.create!(
        season_name: season,
        fandom_season_url: seasons_urls[index]
      )
    end
    #### iterate through the season urls to open each one with Nokogiri and predict the season ids
    #### (index + 1 assumes a freshly created database where the season ids start at 1)
    seasons_urls.each_with_index do |season, index|
      season_doc = Nokogiri::HTML(URI.open(season))
      season_id = index + 1
      #### scrape simple details for each season_doc
      season_premiere = season_doc.xpath('//*[@data-source="premiere"]/div[@class="pi-data-value pi-font"]/text()').text
      season_finale = season_doc.xpath('//*[@data-source="finale"]/div[@class="pi-data-value pi-font"]/text()').text
      season_judges = season_doc.xpath('//*[@data-source="judges"]/div[@class="pi-data-value pi-font"]/a/text()').map { |judge| judge.text }
      #### scrape the season's header row from the progress table (blank cells are rejected further down)
      table_headers = season_doc.xpath('//*[@class="wikitable"]//following-sibling::th').map { |episode| episode.text }
      #### find the contestants column, identify the index integer, then add 1 (xPath positions are 1-based) and turn it into a string
      find_contestants_column = table_headers.map do |header|
        #### pass each prefix as its own argument; (" Contest" || "Contest") would only ever test " Contest"
        header.starts_with?(" Contest", "Contest")
      end
      contestants_column_index = find_contestants_column.find_index(true)
      contestants_column_number = contestants_column_index + 1
      contestants_column_number_string = contestants_column_number.to_s
      #### concatenate the column index into the xPath string and pull the contestants from the column
      contestants_column_xpath = '//*[@id="mw-content-text"]//td[' + contestants_column_number_string + ']/b/a'
      season_contestants = season_doc.xpath(contestants_column_xpath).map { |header| header.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '').strip }
      #### find the episode title column, identify the index integer, then add 1 and turn it into a string to prepare for xPath
      episodes_table_headers = season_doc.xpath('//center/table[@class="wikitable"]//th').map { |episode| episode.text }
      find_episode_title_column = episodes_table_headers.map do |header|
        header.starts_with?("Title", " Episode Title")
      end
      title_column_index = find_episode_title_column.find_index(true)
      title_column_number = title_column_index.to_i + 1
      title_column_number_string = title_column_number.to_s
      title_column_xpath = '//center/table[@class="wikitable"]//td[' + title_column_number_string + ']'
      #### pull the episode titles column into an array and reject the blanks
      episode_titles_row = season_doc.xpath(title_column_xpath).map { |header| header.text }
      episode_titles = episode_titles_row.reject { |episode| episode.blank? }
      #### iterate through the headers to find the episode columns, pull the episode number, and leave a blank for every other column
      episode_number_headers = table_headers.map do |episode|
        if episode.starts_with?("Ep.", "Ep. ", " Ep.", " Ep.")
          episode.gsub(/[^0-9]/, '')
        else
          ""
        end
      end
      #### keep only the indices of the columns that hold an episode number
      episode_indices = episode_number_headers.map.with_index { |header, index| header.to_i >= 1 ? index : "" }
      episode_indices.reject! { |x| x.blank? }
      #### define a range of cell indices containing episode numbers, then build an xPath lookup for each contestant's row
      episode_indices_range = episode_indices.first..episode_indices.last
      row_numbers = season_contestants.map.with_index { |cont, index| index + 2 }
      row_lookups = row_numbers.map do |row|
        '//*[@class="wikitable"]//following-sibling::tr[' + row.to_s + ']/td'
      end
      #### scrape each row and clean up the cell contents, then use the strings to apply numerical ranks
      draft_lookups = row_lookups.map do |lookup|
        season_doc.xpath(lookup).map { |el| el.text.squish }
      end
      draft_final = draft_lookups.map do |lookup|
        lookup[episode_indices_range]
      end
      #### order matters: "WINNER" has to be checked before "WIN"
      #### each keyword needs its own include? call; ("RUNUP" || "LOST") would only ever test "RUNUP"
      final_ranks = draft_final.map do |ranks|
        ranks.map do |rank|
          if rank.include?("WINNER")
            10
          elsif rank.include?("RUNUP") || rank.include?("LOST")
            9
          elsif rank.include?("MISSCON")
            8
          elsif rank.include?("WIN")
            7
          elsif rank.include?("HIGH")
            6
          elsif rank.include?("TOP")
            5
          elsif rank.include?("SAFE") || rank.include?("RUNNING")
            4
          elsif rank.include?("LOW")
            3
          elsif rank.include?("BTM")
            2
          elsif ["ELIM", "OUT", "JUROR", "RTRN"].any? { |keyword| rank.include?(keyword) }
            1
          else
            0
          end
        end
      end
      #### combine contestants' names and ranks due to sqlite database limitations
      season_contestants = season_contestants.zip(final_ranks)
      #### create unique episode keys of the same length (e.g. S04E10)
      episode_numbers = episode_number_headers.reject { |episode| episode.blank? }
      #### zero-pad the season and episode numbers to two digits each
      season_episodes_codes = episode_numbers.map do |episode|
        "S" + season_id.to_s.rjust(2, "0") + "E" + episode.to_s.rjust(2, "0")
      end
      #### iterate through the episodes array to create each Episode
      episode_numbers.each_with_index do |episode, index|
        Episode.create(
          season_id: season_id,
          episode_name: episode_titles[index],
          episode_code: season_episodes_codes[index],
          contestants: season_contestants
        )
      end
    end
  end
end
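The keyword-to-rank step can also be written as an ordered lookup table. The sketch below is only an illustration, not the code used in this series: `RANK_KEYWORDS`, `rank_for`, and the sample cells are hypothetical names and made-up values. It also shows why every keyword needs its own `include?` check: in Ruby, `"RUNUP" || "LOST"` evaluates to `"RUNUP"` before `include?` ever sees it.

```ruby
# Hypothetical lookup table, checked top to bottom; the first match wins.
# "WINNER" must come before "WIN", otherwise every winner would score 7.
RANK_KEYWORDS = [
  [%w[WINNER], 10],
  [%w[RUNUP LOST], 9],
  [%w[MISSCON], 8],
  [%w[WIN], 7],
  [%w[HIGH], 6],
  [%w[TOP], 5],
  [%w[SAFE RUNNING], 4],
  [%w[LOW], 3],
  [%w[BTM], 2],
  [%w[ELIM OUT JUROR RTRN], 1]
].freeze

# Hypothetical helper: numeric rank for one cleaned-up cell, or 0 when nothing matches.
def rank_for(cell)
  match = RANK_KEYWORDS.find do |keywords, _rank|
    keywords.any? { |keyword| cell.include?(keyword) }
  end
  match ? match.last : 0
end

# The || is evaluated first, so only the first string survives.
only_first = ("RUNUP" || "LOST") # "RUNUP"

# Made-up sample row of progress-table cells.
sample_row = ["SAFE", "WIN", "BTM2", "ELIM", ""]
sample_ranks = sample_row.map { |cell| rank_for(cell) } # [4, 7, 2, 1, 0]
```

Keeping the keywords in a table means a new placement label is a one-line change, and the ordering rule is visible in one place.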
2. Revisit get_appearances in appearance.rb to instantiate every appearance with the queen's numerical rank for that episode.
class Appearance < ApplicationRecord
  belongs_to :queen
  belongs_to :episode
  def get_appearances
    #### iterate through each season, pulling episode ids and setting the index of each episode's ranks
    Season.all.each do |season|
      season.episodes.each_with_index do |episode, index|
        episode_id = episode.id
        rank_index = index + 2
        #### clean the contestants array for Appearance creation
        contestants = episode.contestants.split(", ").map do |contestant|
          contestant.gsub(/[^0-9a-z%&!\n\/(). ]/i, '').strip
        end
        contestants = contestants.in_groups_of(season.episodes.length + 1)
        #### iterate through the array of contestants to create an Appearance for each Episode
        contestants.each do |contestant|
          Appearance.create(
            episode_id: episode_id,
            #### use the contestants list as it appeared on the season's Fandom page to find the corresponding Queens
            queen_id: Queen.find_or_create_by(drag_name: contestant[0]).id,
            #### set the queen's numerical rank for her appearance in each episode
            rank: contestant[rank_index]
          )
        end
      end
    end
  end
end
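To see what the split-and-regroup step does, here is a small standalone sketch. It assumes the contestants column ends up holding the `to_s` form of the zipped names-and-ranks array, and it uses made-up names and ranks. Plain Ruby's `each_slice` stands in for ActiveSupport's `in_groups_of` so the snippet runs outside Rails; the two group evenly sized data the same way.

```ruby
# Made-up sample data: two contestants with three episode ranks each.
stored = [["Queen A", [4, 7, 10]], ["Queen B", [6, 2, 1]]].to_s
# stored is now the string: [["Queen A", [4, 7, 10]], ["Queen B", [6, 2, 1]]]

episode_count = 3 # assumed to match the number of ranks per contestant

# Same cleaning regex as get_appearances: brackets and quotes are dropped,
# letters, digits, and spaces are kept.
tokens = stored.split(", ").map do |token|
  token.gsub(/[^0-9a-z%&!\n\/(). ]/i, "").strip
end

# Regroup the flat list into one array per contestant: the name, then her ranks.
groups = tokens.each_slice(episode_count + 1).to_a

first_name = groups.first.first
```

Each group starts with the drag name, followed by that queen's ranks as strings, which is why the first element is used for the Queen lookup.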
I'd love to refactor the get_XXXX methods from the previous posts as soon as possible, but I may move on to some UI in next week's post!