David J. Davis

Adventures in Ruby

TL;DR

If you have been using include? on an array to check membership inside a loop, you may want to look at sets instead!
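A minimal sketch of the idea (the numbers and names here are just for illustration, not taken from the scripts below):

require 'set'

ids    = (1..500_000).to_a  # Array#include? scans elements one by one (O(n))
id_set = ids.to_set         # Set#include? is a hash lookup (O(1))

ids.include?(499_999)       # slow when called millions of times in a loop
id_set.include?(499_999)    # stays fast no matter how big the set gets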

INTERLUDE

Before we set off on our quest for knowledge, we have to know why this quest is so important. We are not going to be saving a princess from a burning building, but instead 17 million princesses from a select 500 thousand burning buildings. (Truth be told, there is no fire in this story, but I wanted to add a bit of excitement.)

I was recently tasked with parsing a CSV file that had over 17 million records and weighed in at roughly 8GB. When given the parameters I initially said, "no problem, that would be a welcome 2-hour break from my other project." I would soon regret those words when I downloaded the files and saw that I needed to combine fields from file A, which contained roughly 17.9 million records, and file B, which contained the IDs for 500 thousand records along with some other data that had already been entered manually by someone.

THE BATTLES

My first attempt was not pretty, owing to my prior experience with modifying CSVs. I had never had to modify anything larger than 50k items, which is 1/10 of the smallest dataset in this task.

require "csv"
require 'active_support/core_ext/hash'
# -----------------------------------------------------------------------------------------------
# Grab the file and open it up
# Get the hash table from CSV and use it for selects / detects
etas_title_file = './title_file.csv'
the_csv_file = CSV.parse(File.read(etas_title_file), headers: true, header_converters: :symbol)
# OUTPUTS A HARD CRASH OR A SLUGGISH EVERYTHING IN ABOUT 4 HOURS

My second attempt was about feeling out how to parse the data properly and then trying to parse the biggest file. Again it proved to be a failure, but it led me to the correct way of parsing the files, just not the correct way of dealing with that much data.

require "csv"
require 'active_support/core_ext/hash'
# -----------------------------------------------------------------------------------------------
# GET SMARTER GRAB THE FILE LINE BY LINE AND DO STUFF WITH IT, BUT I'M GOING TO NEED BOTH FILES
# SAVE THE FIRST ONE TO AN ARRAY OF HASHES BECAUSE THAT WILL BE EASY TO SEARCH USING SELECTS
etas_title_file = './title_file.csv'
the_csv_file = []
CSV.foreach(etas_title_file, 'r', headers: true, header_converters: :symbol) do |row|
  the_csv_file << row.to_h
end
# THIS WORKED, LET'S MOVE ON TO THE BIGGER FILE
# THIS ONE HAD LOTS OF PROBLEMS SO WE FIGURED OUT HOW TO PARSE IT
# (headers for the big tab-separated file -- the same list used in the later attempts)
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
other_file = './other_file.txt'
other_csv_file = []
CSV.foreach(other_file, 'r', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  other_csv_file << row.to_h
end
# NOW THAT WE HAVE OUR FILES, LET'S SEARCH OUR DATA
the_csv_file.each do |record|
  big_file_data = other_csv_file.select { |r| r[:oclc_num] == record[:oclc] }
  combine_files(big_file_data, record) # this was a special function I had for extracting the info I wanted
end
# AFTER A DAY OF LETTING THIS RUN, I STOPPED EXECUTION (I had only parsed about 10 thousand of 17 million)

My third attempt was a step in a positive direction, giving me the appearance that I was doing things in a better way; however, looking back at it, I added unneeded complexity that hurt performance rather than boosting it. The concept was simple: create a list of item IDs, make sure there are no duplicates, and finally iterate through the largest file, only performing modifications on the data needed in the final iteration.

require "csv"
require 'active_support/core_ext/hash'
# -----------------------------------------------------------------------------------------------
# TAKING TOO LONG BECAUSE EXTRA RECORDS LETS DO SOME CHECKS FIRST
etas_title_file = './title_file.csv'
the_csv_file = []
oclc_id_array = []
CSV.foreach(etas_title_file, 'r', headers: true, header_converters: :symbol) do |row|
  the_csv_file << row.to_h
  oclc_id_array << row[:oclc]
end
oclc_id_array.uniq!
# GETTING SMARTER WITH THE BIGGER FILE, MAKE IT DO LESS
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
new_hathi_csv = []
CSV.foreach('./hathi_full.txt', 'r', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h
end
# AFTER A DAY OF LETTING THIS RUN, I STOPPED EXECUTION
# I DON'T KNOW HOW MANY PARSED BUT I WAS SURE THAT IT WOULDN'T WORK
# SO I TIMED A BENCHMARK FOR 1 RECORD AND MULTIPLIED THAT TIME FOR ALL 17 MILLION
# THE RESULT WAS 156 DAYS TO COMPLETE ... YIKES!

My fourth attempt was another positive step because it de-duped the records in the largest file, only processing the ones that needed to be processed. But does that inherently improve performance?

require "csv"
require 'active_support/core_ext/hash'
# ---------------------------------------------------------
# I FEEL SOLID ABOUT THIS PART SO I HAVEN'T CHANGED THIS
# ----------------------------------------------------------
etas_title_file = './title_file.csv'
the_csv_file = []
oclc_id_array = []
CSV.foreach(etas_title_file, 'r', headers: true, header_converters: :symbol) do |row|
  the_csv_file << row.to_h
  oclc_id_array << row[:oclc]
end
oclc_id_array.uniq!
# ---------------------------------------------------------
# GOING TO GET MORE COMPLICATED WITH THINGS TO TRY TO SPEED THINGS UP
# THIS IS PROBABLY A POOR DECISION
# ----------------------------------------------------------
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
new_hathi_csv = []
processed_keys = []
CSV.foreach('./hathi_full.txt', 'r', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  next if processed_keys.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys << row[:oclc_num]
end
# THE GOAL WAS TO USE THE CSV FOREACH TO CREATE A NEW CSV TO DO ADDITIONAL WORK FROM
# AFTER A COUPLE OF HOURS I PASSED THIS OFF TO A FRIEND TO LOOK AT WHO SAID IT SHOULD BE FINE AND I RAN IT OVER THE WEEKEND.
# I CRASHED IT WHEN I GOT BACK BECAUSE IT NEVER LEFT THE LOOP....

Looking at both the third and fourth attempts, I fell into a Ruby trap: instead of thinking in terms of the data, I was thinking in terms of Ruby weaponry, reaching for what I could accomplish easily with Arrays and never considering that as a potential downfall.

My fifth attempt, well, there were many more than 5 attempts, but let us assume for this article that I was smart enough to only need 5 with the help of StackOverflow. Basically I got smarter about handling my hashes and writing the CSV, and I switched to sets for the comparisons, as recommended on StackOverflow.

require "csv"
require 'active_support/core_ext/hash'
# -------------------------------------------------------------------------------------
# STACK OVERFLOW WAS MY ONLY HOPE
# A USER ON STACKOVERFLOW was ABLE TO POINT ME TOWARDS SETS AND USING MY HASHES IN A SMARTER WAY
# -------------------------------------------------------------------------------------
etas_title_file = './titles_file.csv'
oclc_ids = Set.new
csv_hash_file = {}
CSV.foreach(etas_title_file, 'r', headers: true, header_converters: :symbol) do |row|
  oclc_ids.add row[:oclc]
  csv_hash_file[row[:oclc]] = row.to_h
end
# -------------------------------------------------------------------------------------
# SCRIPT WORK
# -------------------------------------------------------------------------------------
# SETS HEADERS FOR CSV
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
# SETS NEW HEADERS FOR APPENDED CSV
new_headers = %i[htid ht_bib_key oclc local_id item_type access rights]
# creates a set of numbers to avoid duplicates
processed_keys = Set.new
# variables for feedback
filename = './hathitext.txt'
line_count = %x{wc -l < "#{filename}"}.to_i
counter = 1
start = Time.now
# Opens a file to write to during each line so that we aren't holding that in memory.
CSV.open("./converted_#{Time.now.to_i}.csv", 'a') do |csv|
csv << new_headers
CSV.foreach(filename ,'r', {:headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row|
counter += 1
oclc_id = row[:oclc_num]
# proof the script is working
puts "--- #{(counter.to_f / line_count.to_f * 100.0).round(2)}% --------------------------"
# don't waste time
next unless oclc_ids.include? oclc_id
next if processed_keys.include? oclc_id
# combine
combine_csv = csv_hash_file[oclc_id]
csv << [row[:htid], row[:ht_bib_key], oclc_id, combine_csv[:local_id], combine_csv[:item_type], combine_csv[:access], combine_csv[:rights]]
# add processed and showcase time
processed_keys.add oclc_id
end
finish = Time.now - start
puts " ====================================================================================== "
puts "Finished Processing in - #{finish}, the number of records processed was #{line_count}"
puts " ====================================================================================== "
end

This is the attempt that saved all the princesses and put out the fires without even blinking. Before attempt 5 the longest I had let the script run was about 5 days, thanks to a long weekend. That is an enormous amount of time, but with attempt 5 the reprocessing took 7-10 minutes. It made me feel a little ill, but also very happy.

WEAPON ANALYSIS

My first mistake was unfamiliarity with large datasets, which led me down the road to being decapitated by the big O notation dragon who guarded the princesses. More importantly, it showed me that when programming I need to be more cognizant of performance and not just default to the approaches I've used in the past.

Ruby is like having a locker of ready-to-go weapons that can accomplish tasks quickly, but instead of saving the princesses with my awesome sword of justice, I cut my own arm off. It wasn't me who realized this, though; it was a user (Casper) on StackOverflow who pointed out my mistake.

My approach was to use an array to check for inclusion. For my data that meant scanning a minimum of 500,000 elements every time I looked at a single record in the 17-million-record file, and because I was doing the same check against the processed-keys array, the number of element checks per record grew from 500,000 to as many as 1 million. In other words, each of the 17 million records could trigger up to a million comparisons before any actual processing happened. Benchmark testing showed me the error of my ways: the script as I had it in iteration 4 would have finished in roughly 156 days.
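To put some rough numbers on that, here is a back-of-the-envelope sketch using only the figures above; the per-record time is simply the 156-day estimate divided out, not a separate measurement:

records   = 17_000_000   # rows in the big file
min_scans = 500_000      # worst case for oclc_id_array.include?
max_scans = 1_000_000    # plus processed_keys.include? once it fills up

puts "comparisons, best case:  #{records * min_scans}"   # => 8_500_000_000_000
puts "comparisons, worst case: #{records * max_scans}"   # => 17_000_000_000_000

# 156 days spread across 17 million records works out to roughly:
seconds_per_record = (156 * 24 * 60 * 60.0) / records
puts format('~%.2f seconds per record', seconds_per_record) # => ~0.79 seconds per record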

I can't blame Ruby; it was doing exactly what I told it to do. Instead I need to blame myself for not thinking about performance in such a simple little script. I could blame inexperience, but I think the bigger issue is the ruts of everyday programming. As a junior developer you get things working and just try to get by while figuring things out; as a senior-level developer you need to look more at performance, maintainability, and reliability. In the same respect, you need to get out of the ruts you may have developed over your working life.

TESTING NUMBERS: ARRAYS VS SETS

I wanted to see what this comparison of sets vs arrays looked like when checking simple numbers such as an ID. I wrote a simple benchmark and ran it in much the same way as the real script, just to see the relative performance of a set vs an array. For these membership checks, sets were about 14 times faster than arrays, and I find that quite amazing!

require 'set'

tmp_array = []
tmp_set = Set.new
500_000.times do |i|
  tmp_array << i
  tmp_set.add i
end

start = Time.now
10_000.times do |i|
  num = rand(i..500_000)
  tmp_array.include? num
end
finish = Time.now
complete = finish - start
puts "Array Completed: #{complete}"

start = Time.now
10_000.times do |i|
  num = rand(i..500_000)
  tmp_set.include? num
end
finish = Time.now
complete = finish - start
puts "Set Completed: #{complete}"
