Rails + React + Redux - Pt 4
This post focuses on some of the more challenging tasks I encountered while scraping data from Fandom and then connecting that data to the schema established in the last post. In retrospect I would scrape seasons before queens, so I've accounted for that order here. This post covers defining the get_queens method; get_seasons will follow in the next post. The gist is heavily commented!
Let's get started!
1. Def get_seasons in season.rb to scrape the list of season names from Fandom, concatenate each season name into an array of URLs for each Season's Fandom page, then iterate through the array to .create!() an instance of each Season (a rough sketch follows this list).
2. Def get_queens in queen.rb to scrape the list of queens' names from Fandom, concatenate each queen's name into an array of URLs for each Queen's Fandom page, then iterate through the array to .create!() an instance of each Queen and her attributes (including associations for Quotes and Trivia).
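Since the full get_seasons walkthrough comes in the next post, here is only a minimal sketch of the shape step 1 takes. The Category:Seasons URL, the .category-page__member-link and h1.page-header__title selectors, and the name column are all my assumptions, not the final implementation; the associations mirror the ones on Queen below.

require 'open-uri'

class Season < ApplicationRecord
  has_many :appearances
  has_many :queens, through: :appearances

  def self.get_seasons
    #### the category URL and both selectors below are assumptions about Fandom's markup
    season_index_url = "https://rupaulsdragrace.fandom.com/wiki/Category:Seasons"
    season_index_doc = Nokogiri::HTML(URI.open(season_index_url))
    #### scrape the season names, then concatenate a Fandom URL for each one
    season_names = season_index_doc.css('.category-page__member-link').map { |link| link.attr('title') }
    season_urls = season_names.map { |name| "https://rupaulsdragrace.fandom.com/wiki/#{name.split(' ').join('_')}" }
    #### iterate through the URLs and create an instance of each Season
    season_urls.each do |url|
      season_doc = Nokogiri::HTML(URI.open(url))
      Season.create!(name: season_doc.css('h1.page-header__title').text.strip)
    end
  end
end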
require 'open-uri' # needed so URI.open can fetch remote pages

class Queen < ApplicationRecord
  has_many :trivia, dependent: :destroy
  has_many :quotes, dependent: :destroy
  has_many :appearances
  has_many :seasons, through: :appearances
  accepts_nested_attributes_for :trivia, :quotes, :seasons, :appearances

  def self.get_queens
    #### define the url containing the index of queens, then open it with Nokogiri
    queen_index_url = "https://rupaulsdragrace.fandom.com/wiki/Category:Queens"
    queen_index_doc = Nokogiri::HTML(URI.open(queen_index_url))
    #### scrape an array of queens' names from the Nokogiri doc, then remove the
    #### duplicates (every node appears twice, so keep only the even indexes)
    queen_list = queen_index_doc.css('.tabber').last.css('.thumbimage')
    cleaned_queen_list = queen_list.select.with_index { |_, i| i.even? }
    #### iterate through the US queens in the list, concatenating a url for each queen
    cleaned_queen_list[1..185].each do |queen|
      queen_url = "https://rupaulsdragrace.fandom.com/wiki/#{queen.attr("title")}"
      #### transliterate each url, run it through a regular expression, then open it with Nokogiri
      I18n.enforce_available_locales = false
      cleaned_queen_url = I18n.transliterate(queen_url).split(' ').join('_').gsub(/\(.+/, '')
      queen_doc = Nokogiri::HTML(URI.open(cleaned_queen_url))
      #### scrape each queen's seasons into an array for association to seasons through appearances
      queen_seasons = queen_doc.xpath(
        '//*[@data-source="Season"]/div[@class="pi-data-value pi-font"]/a')
        .map { |e| e.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      #### scrape each queen's biographical information (the .pi-image-thumbnail
      #### selector for the primary image is my assumption about Fandom's
      #### portable-infobox markup)
      queen_primary_image = queen_doc.css('.pi-image-thumbnail').first&.attr('src')
      queen_drag_name = queen_doc.css(
        "#mw-content-text > aside > section:nth-child(3) > div:nth-child(2) > div")
        .text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')
      queen_real_name = queen_doc.css(
        "#mw-content-text > aside > section:nth-child(3) > div:nth-child(3) > div")
        .text.split(' ').join(' ')
      queen_date_of_birth = queen_doc.xpath(
        '//*[@data-source="birth year"]/div[@class="pi-data-value pi-font"]/text()')
        .text[0..-3]
      queen_ethnicity = queen_doc.xpath(
        '//*[@data-source="Ethnicity"]/div[@class="pi-data-value pi-font"]/text()')
        .map { |e| e.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      queen_current_city = queen_doc.xpath(
        '//*[@data-source="Current City"]/div[@class="pi-data-value pi-font"]/text()').text
      #### Fandom labels this field either "Birthplace" or "Hometown"; scrape both
      #### (one will be an empty string) and combine them into one variable
      variable_queen_birthplace = queen_doc.xpath(
        '//*[@data-source="Birthplace"]/div[@class="pi-data-value pi-font"]/text()').text
      variable_queen_hometown = queen_doc.xpath(
        '//*[@data-source="Hometown"]/div[@class="pi-data-value pi-font"]/text()').text
      queen_hometown = variable_queen_birthplace + variable_queen_hometown
      #### scrape each queen's social/web links; Fandom labels the website either
      #### "Site" or "Official Website", so scrape both and keep whichever exists
      queen_instagram = queen_doc.xpath('//a[text()="Instagram"]').attribute('href')
      queen_twitter = queen_doc.xpath('//a[text()="Twitter"]').attribute('href')
      queen_facebook = queen_doc.xpath('//a[text()="Facebook"]').attribute('href')
      queen_youtube = queen_doc.xpath('//a[text()="YouTube"]').attribute('href')
      queen_imdb = queen_doc.xpath('//a[text()="IMDB Page"]').attribute('href')
      queen_wikipedia = queen_doc.xpath('//a[text()="Wikipedia"]').attribute('href')
      variable_queen_site = queen_doc.xpath('//a[text()="Site"]').attribute('href')
      variable_queen_official_site = queen_doc.xpath('//a[text()="Official Website"]').attribute('href')
      queen_website = variable_queen_site || variable_queen_official_site
      #### scrape every list item between the element with the "Trivia" id and the
      #### next section headline, then run each trivium through a regular expression
      queen_trivia = queen_doc.xpath(
        '//*[preceding::*[@id="Trivia"] and following::*[@class="mw-headline"]]//following-sibling::li')
        .map { |e| e.text.split(' ').join(' ').gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      #### scrape each queen's quotes under the various ids Fandom gives them, then
      #### combine them into one variable: grab every list item between the element
      #### with the matching id and the next section headline, and run each quote
      #### through a regular expression
      variable_fandom_quotes = queen_doc.xpath(
        '//*[preceding::*[@id="Quotes"] and following::*[@class="mw-headline"]]//following-sibling::li')
        .map { |e| e.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      variable_memorable_quotes = queen_doc.xpath(
        '//*[preceding::*[@id="Memorable_Quotes"] and following::*[@class="mw-headline"]]//following-sibling::li')
        .map { |e| e.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      variable_memorable_quotes_C2A0 = queen_doc.xpath(
        '//*[preceding::*[@id="Memorable_Quotes.C2.A0"] and following::*[@class="mw-headline"]]//following-sibling::li')
        .map { |e| e.text.gsub(/[^0-9a-z%&!\n\/(). ]/i, '') }
      queen_quotes = variable_fandom_quotes + variable_memorable_quotes + variable_memorable_quotes_C2A0
      #### create an instance of a queen for each url, assigning all the appropriate attributes
      Queen.create!(drag_name: queen_drag_name,
                    real_name: queen_real_name,
                    primary_image: queen_primary_image,
                    date_of_birth: queen_date_of_birth,
                    hometown: queen_hometown,
                    current_city: queen_current_city,
                    ethnicity: queen_ethnicity,
                    instagram: queen_instagram,
                    twitter: queen_twitter,
                    facebook: queen_facebook,
                    youtube: queen_youtube,
                    website: queen_website,
                    imdb: queen_imdb,
                    wikipedia: queen_wikipedia,
                    #### for each instance of a queen, create instances of her trivia and quotes
                    trivia_attributes: queen_trivia.map { |trivium| { content: trivium } },
                    quotes_attributes: queen_quotes.map { |quote| { content: quote } })
      #### associations to Seasons through Appearances can be defined here using the
      #### scraped array of seasons, but I'll define those associations from within
      #### the get_seasons method using the array of contestants
    end
  end
end
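With the model defined, one way to kick off the scrape is from db/seeds.rb. This assumes the class-method version above and that get_seasons is defined; adjust the order to match how you wire up the Appearance associations:

# db/seeds.rb
Season.get_seasons
Queen.get_queens

# then, from the terminal:
# rails db:seed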
3. With Seasons and Queens instantiated, iterate through the Seasons and .create!() an Appearance for each episode per Queen, with her appropriate appearance attributes (a rough sketch follows).
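Here is a minimal sketch of that step; the full version belongs with get_seasons in the next post. The url and episode_count columns on Season, the contestants XPath, and the episode column on Appearance are all assumptions about the schema, not the final implementation:

Season.all.each do |season|
  season_doc = Nokogiri::HTML(URI.open(season.url))
  #### scrape the season's contestant names (this selector is an assumption)
  contestant_names = season_doc.xpath('//table//b/a').map(&:text)
  contestant_names.each do |name|
    queen = Queen.find_by(drag_name: name)
    next unless queen
    #### create one appearance per episode for each queen on the season
    (1..season.episode_count).each do |episode_number|
      Appearance.create!(queen: queen, season: season, episode: episode_number)
    end
  end
end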