
denvermullets

Some notes on web scraping with Ruby featuring Kimurai and Selenium

I had previously written about scraping some stats from Basketball Reference, and when I went back to do it again my links no longer worked, meaning I would have to redo some of the work. I didn't particularly love my last exploration with Nokogiri and scraping, so I took this as an opportunity to learn and expand my knowledge.

I'm not going to write this as a walkthrough, since this blog will really walk you through it. I'm really just posting this to show different ways you can get the information you need.

I feel like finding elements with CSS selectors is a more stable approach than hardcoding the XPath. As long as the class name or id stays the same, it won't matter if the site ends up nesting the element deeper, which is exactly the kind of change that breaks a hardcoded path.

One thing I spent a lot of time troubleshooting was that my variables were coming up empty even though I was seemingly doing everything correctly. In the end I realized that the site hadn't finished loading when my selectors ran. I'm assuming there's a safer way to handle this than just using sleep 2 or something.
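
One possible alternative to a fixed sleep is to poll until the element you need actually shows up, and give up after a timeout. The sketch below is plain Ruby, not Kimurai API: `wait_until`, `timeout`, and `interval` are names made up for illustration, and the block at the bottom only simulates a page that becomes ready on the third check.

```ruby
# Poll a block until it returns something truthy or the timeout expires.
# `wait_until`, `timeout`, and `interval` are hypothetical names, not Kimurai API.
def wait_until(timeout: 10, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end

    sleep interval
  end
end

# Stand-in for a page that only "finishes loading" on the third check.
attempts = 0
value = wait_until(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? 'Fortaleza Blanco' : nil
end
puts value
```

In the scraper, the block could be something like `browser.current_response.at_css('div.product-actions')`, assuming `current_response` re-reads the live page on every call. The Selenium Ruby bindings also ship a `Selenium::WebDriver::Wait` class built around the same idea.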

My Code

require 'kimurai'
require 'json'

class TequilaScraper < Kimurai::Base
  @name = 'tqdb_scrap'
  @start_urls = ['https://tequilamatchmaker.com/tequilas/2325-fortaleza-blanco']
  @engine = :selenium_chrome

  @@tequilas = []

  def scrape_page
    sleep 2
    doc = browser.current_response
    tequila = doc.css('div.product-actions')

    teq_name = tequila.css('h1[itemprop="name"]').text.gsub(/\n/, "")
    teq_type = tequila.css('div.product-type a').text.gsub(/\n/, "")
    teq_rating_p = tequila.css('ul.product-list__item__ratings li')[0].text.gsub(/\D/, '').gsub(/\n/, "")
    teq_rating_c = tequila.css('ul.product-list__item__ratings li')[1].text.gsub(/\D/, '').gsub(/\n/, "")
    teq_price_check = tequila.css('div.commerce-price-container div span')[1]

    if teq_price_check
      teq_price = tequila.css('div.commerce-price-container div span')[1].text.gsub(/\n/, "")
    else
      teq_price = 'n/a'
    end

    doc_mid = doc.css('div.container')
    teq_image = doc_mid.css('img.product-image').attr('src')

    # Look the production-details rows up once; the NodeSet holds live nodes,
    # so removing the spans below still shows up in the later reads.
    rows = doc_mid.css('div.production-details_product table tbody tr')

    teq_nom = rows[0].css('td a').text.gsub(/\n/, "")
    # drop the sr-only spans since each one is just a comma
    doc_mid.search('span.sr-only').each(&:remove)
    teq_agave = rows[1].css('td').text.gsub(/\n/, "")
    teq_agave_region = rows[2].css('td').text.gsub(/\n/, "")
    teq_region = rows[3].css('td').text.gsub(/\n/, "")
    teq_cooking = rows[4].css('td').text.gsub(/\n/, "")
    teq_extraction = rows[5].css('td').text.gsub(/\n/, "")
    teq_water = rows[6].css('td').text.gsub(/\n/, "")
    teq_fermentation = rows[7].css('td').text.gsub(/\n/, "")
    teq_distillation = rows[8].css('td').text.gsub(/\n/, "")
    teq_still = rows[9].css('td').text.gsub(/\n/, "")
    teq_aging = rows[10].css('td').text.gsub(/\n/, "")
    teq_abv = rows[11].css('td').text.gsub(/\n/, "")
    teq_other = rows[12].css('td').text.gsub(/\n/, "")

    tequila = {name: teq_name, type: teq_type, rating_p: teq_rating_p, rating_c: teq_rating_c,
      price: teq_price, image_url: teq_image, nom: teq_nom, agave: teq_agave, agave_region: teq_agave_region,
      region: teq_region, cooking: teq_cooking, extraction: teq_extraction, water: teq_water, fermentation: teq_fermentation,
      distillation: teq_distillation, still: teq_still, aging: teq_aging, abv: teq_abv, other: teq_other}

    @@tequilas << tequila #if !@@tequilas.include?(tequila)
  end

  def parse(response, url:, data: {})

    scrape_page

    File.open("tequila.json","w") do |f|
      f.write(JSON.pretty_generate(@@tequilas))
    end

    @@tequilas
  end
end

TequilaScraper.crawl!
puts 'done scraping'

Initially I was going to get mock data from a React site, but I opted for this tequila database just to test easily. Using Selenium is a bit overkill for this, but all in all it worked really well. I think next time I'm going to have to try to dig into some Python or JS scraping.
