TL;DR
If you have been using include? on an Array to test membership, you may want to look at Sets instead!
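To make that concrete, here is a minimal sketch of the swap (illustrative only, not code from the script discussed below): membership checks against an Array versus a Set.

require 'set'

ids_array = (1..500_000).to_a
ids_set   = ids_array.to_set

# Array#include? walks the elements one by one, so each check is O(n)
ids_array.include?(499_999)

# Set#include? is a hash lookup, so each check is roughly O(1)
ids_set.include?(499_999)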
INTERLUDE
Before we set off on our quest for knowledge, we have to know why this quest is so important. We are not going to be saving a princess from a burning building, but instead 17 million princesses from a select 500 thousand burning buildings. (Truth be told, there is no fire in this story, but I wanted to add a bit of excitement.)
I was recently tasked with parsing a CSV file that had over 17 million records and was roughly 8GB. When given the params I initially said, "no problem, that would be a welcome 2-hour break from my other project." I would soon regret those words when I downloaded the files and saw that I needed to combine fields from file A, which contained roughly 17.9 million records, with file B, which contained the IDs for 500 thousand records along with some other data that had already been manually compiled by someone.
THE BATTLES
My first attempt was not pretty because of my limited prior experience with modifying CSVs. I had never had to modify anything larger than 50k records, which is one-tenth of the smallest dataset in this task.
require "csv" | |
require 'active_support/core_ext/hash' | |
# ----------------------------------------------------------------------------------------------- | |
# Grab the file and open it up | |
# Get the hash table from CSV and use it for selects / detects | |
etas_title_file = './title_file.csv' | |
the_csv_file = CSV.parse(File.read(etas_title_file), {:headers => true, :header_converters => :symbol}) | |
# OUTPUTS A HARD CRASH OR A SLUGGISH EVERYTHING IN ABOUT 4 HOURS |
My second attempt was about feeling out how to parse the data properly and then trying to parse the biggest file. Again it proved to be a failure, but it led me to the correct way of parsing the files, just not the correct way of dealing with that much data.
require "csv" | |
require 'active_support/core_ext/hash' | |
# ----------------------------------------------------------------------------------------------- | |
# GET SMARTER GRAB THE FILE LINE BY LINE AND DO STUFF WITH IT, BUT I'M GOING TO NEED BOTH FILES | |
# SAVE THE FIRST ONE TO AN ARRAY OF HASHES BECAUSE THAT WILL BE EASY TO SEARCH USING SELECTS | |
etas_title_file = './title_file.csv' | |
the_csv_file = [] | |
CSV.foreach(etas_title_file ,'r', {:headers => true, :header_converters => :symbol}) do |row| | |
the_csv_file << row.to_h | |
end | |
# THIS WORKED LETS MOVE ON TO THE BIGGER FILE | |
# THIS ONE HAD LOTS OF PROBLEMS SO WE FIGURED OUT HOW TO PARSE IT | |
other_file = './other_file.txt' | |
other_csv_file | |
CSV.foreach(other_file ,'r', {:headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row| | |
other_csv_file << row.to_h | |
end | |
# NOW THAT WE HAVE OUR FILES, LETS SEARCH OUR DATA | |
the_csv_file.each do |record| | |
big_file_data = other_csv_file.select{ |r| r[:oclc] == record[:oclc_num] } | |
combine_files(big_file_data, record) # this was a special funciton I had for extracting the info I wanted | |
end | |
# AFTER A DAY OF LETTING THIS RUN, I STOPPED EXECUTION (I had only parsed about 1o thousand of 17 million) |
My third attempt was a step in a positive direction, giving me the appearance that I was doing things in a better way; however, looking back at it, I added unneeded complexity that hurt performance rather than boosted it. The concept was simple: create a list of item IDs, make sure there are no duplicates, and finally iterate through the largest file, only performing modifications on the data needed in the final iteration.
require "csv" | |
require 'active_support/core_ext/hash' | |
# ----------------------------------------------------------------------------------------------- | |
# TAKING TOO LONG BECAUSE EXTRA RECORDS LETS DO SOME CHECKS FIRST | |
etas_title_file = './title_file.csv' | |
the_csv_file = [] | |
oclc_id_array = [] | |
CSV.foreach(etas_title_file ,'r', {:headers => true, :header_converters => :symbol}) do |row| | |
the_csv_file << row.to_h | |
oclc_id_array << row[:oclc] | |
end | |
oclc_id_array.uniq! | |
# GETTING SMARTER WITH THE BIGGER FILE, MAKE IT DO LESS | |
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author] | |
new_hathi_csv = [] | |
CSV.foreach('./hathi_full.txt' ,'r', {:headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row| | |
next unless oclc_id_array.include? row[:oclc_num] | |
puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}" | |
new_hathi_csv << row.to_h | |
end | |
# AFTER A DAY OF LETTING THIS RUN, I STOPPED EXECUTION | |
# I DON'T KNOW HOW MANY PARSED BUT I WAS SURE THAT IT WOULDN'T WORK | |
# SO I DID A TIME AND BENCH MARK FOR 1 RECORD THEN MULTIPLIED THAT TIME FOR ALL 17 Million | |
# THE RESULT WAS 156 DAYS TO COMPLETE ... YIKES! |
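For the curious, that 156-day figure came from timing the expensive check for a single record and multiplying it across all 17 million rows. Here is a rough, self-contained sketch of that kind of estimate (the IDs are made up, and the exact number you get depends entirely on your machine):

require 'benchmark'

# Hypothetical stand-ins: ~500k IDs to scan and one ID to look up,
# mirroring the shape of attempt 3's data.
oclc_id_array = (1..500_000).map(&:to_s)
sample_id     = '499999'

# Time the Array#include? check that attempt 3 ran for every row,
# then extrapolate that cost across all 17 million rows.
single_check_time = Benchmark.realtime { oclc_id_array.include?(sample_id) }

total_records  = 17_000_000
estimated_days = (single_check_time * total_records) / 86_400.0
puts "Estimated runtime: roughly #{estimated_days.round} days"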
My fourth attempt was another positive step because it de-duplicated the records coming out of the largest file, only processing the ones that needed to be processed. But that alone does not inherently improve performance.
require "csv" | |
require 'active_support/core_ext/hash' | |
# --------------------------------------------------------- | |
# I FEEL SOLID ABOUT THIS PART SO I HAVEN'T CHANGED THIS | |
# ---------------------------------------------------------- | |
etas_title_file = './title_file.csv' | |
the_csv_file = [] | |
oclc_id_array = [] | |
CSV.foreach(etas_title_file ,'r', {:headers => true, :header_converters => :symbol}) do |row| | |
the_csv_file << row.to_h | |
oclc_id_array << row[:oclc] | |
end | |
oclc_id_array.uniq! | |
# --------------------------------------------------------- | |
# GOING TO GET MORE COMPLICATED WITH THINGS TO TRY TO SPEED THINGS UP | |
# THIS IS PROBABLY A POOR DESCISION | |
# ---------------------------------------------------------- | |
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author] | |
remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author] | |
new_hathi_csv = [] | |
processed_keys = [] | |
CSV.foreach('./hathi_full.txt' ,'r', {:headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row| | |
next unless oclc_id_array.include? row[:oclc_num] | |
next if processed_keys.include? row[:oclc_num] | |
puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}" | |
new_hathi_csv << row.to_h.except(*remove_keys) | |
processed_keys << row[:oclc_num] | |
end | |
# THE GOAL WAS TO USE THE CSV FOREACH TO CREATE A NEW CSV TO DO ADDITONAL WORK FROM | |
# AFTER A COUPLE OF HOURS I PASSED THIS OFF TO A FRIEND TO LOOK AT WHO SAID IT SHOULD BE FINE AND I RAN IT OVER THE WEEKEND. | |
# I CRASHED IT WHEN I GOT BACK BECAUSE IT NEVER LEFT THE LOOP.... |
Looking back at both the third and fourth attempts, I fell into a Ruby trap: instead of thinking in terms of the data, I was thinking in terms of Ruby weaponry. I knew what I could accomplish easily with Arrays and never considered that as a potential downfall.
My fifth attempt. Well, there were many more than five attempts, but let us assume for this article that I was smart enough to do it in only five, with the help of StackOverflow. Basically, I got smarter about handling my hashes and writing the CSV, and I switched to sets for comparisons, as recommended on StackOverflow.
require "csv" | |
require 'active_support/core_ext/hash' | |
# ------------------------------------------------------------------------------------- | |
# STACK OVERFLOW WAS MY ONLY HOPE | |
# A USER ON STACKOVERFLOW was ABLE TO POINT ME TOWARDS SETS AND USING MY HASHES IN A SMARTER WAY | |
# ------------------------------------------------------------------------------------- | |
etas_title_file = './titles_file.csv' | |
oclc_ids = Set.new | |
csv_hash_file = {} | |
CSV.foreach(etas_title_file ,'r', {:headers => true, :header_converters => :symbol}) do |row| | |
oclc_ids.add row[:oclc] | |
csv_hash_file[row[:oclc]] = row.to_h | |
end | |
# ------------------------------------------------------------------------------------- | |
# SCRIPT WORK | |
# ------------------------------------------------------------------------------------- | |
# SETS HEADERS FOR CSV | |
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author] | |
# SETS NEW HEADERS FOR APPENDED CSV | |
new_headers = %i[htid,ht_bib_key,oclc,local_id,item_type,access,rights] | |
# creates a set of numbers to avoid duplicates | |
processed_keys = Set.new | |
# variables for feedback | |
filename = './hathitext.txt' | |
line_count = %x{wc -l < "#{filename}"}.to_i | |
counter = 1 | |
# Opens a file to write to during each line so that we aren't holding that in memory. | |
CSV.open("./converted_#{Time.now.to_i}.csv", 'a') do |csv| | |
csv << new_headers | |
CSV.foreach(filename ,'r', {:headers => headers, :col_sep => "\t", quote_char: "\0" }) do |row| | |
counter += 1 | |
oclc_id = row[:oclc_num] | |
# proof the script is working | |
puts "--- #{(counter.to_f / line_count.to_f * 100.0).round(2)}% --------------------------" | |
# don't waste time | |
next unless oclc_ids.include? oclc_id | |
next if processed_keys.include? oclc_id | |
# combine | |
combine_csv = csv_hash_file[oclc_id] | |
csv << [row[:htid], row[:ht_bib_key], oclc_id, combine_csv[:local_id], combine_csv[:item_type], combine_csv[:access], combine_csv[:rights]] | |
# add processed and showcase time | |
processed_keys.add oclc_id | |
end | |
finish = Time.now - start | |
puts " ====================================================================================== " | |
puts "Finished Processing in - #{finish}, the number of records processed was #{line_count}" | |
puts " ====================================================================================== " | |
end |
This is the attempt that saved all the princesses and put out the fires without even blinking. Before attempt 5, the maximum amount of time I let the script run was about 5 days, thanks to a long weekend. That is an enormous amount of time, but after attempt 5, reprocessing took 7-10 minutes. It made me feel a little ill, but also very happy.
WEAPON ANALYSIS
My first mistake was unfamiliarity with large datasets, which led me down the road to being decapitated by the big O notation dragon who guarded the princesses. More importantly, it showed me that when programming I need to be more cognizant of performance and not just default to things I've done in the past.
Ruby is like having a locker of ready-to-go weapons that can accomplish tasks quickly, but instead of saving the princesses with my awesome sword of justice, I cut my own arm off. It wasn't me who realized this though; it was a user (Casper) on StackOverflow who pointed out my mistake.
My approach was to check each record for inclusion in an array that, for my dataset, held a minimum of 500,000 items, and I did that for every single one of the 17 million records. In fact, I was doing it twice, once against the ID list and again against the list of processed keys, so the number of element comparisons per record ranged from 500,000 at its smallest to roughly 1 million at its largest. That means each of the 17 million records could trigger up to a million checks before any actual processing ever happened. Benchmark testing showed me the error of my ways: the script as I had it in iteration 4 would have finished in about 156 days.
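To put rough numbers on that (illustrative arithmetic, not a measurement):

records        = 17_000_000 # rows in the big file
checks_per_row = 1_000_000  # worst case: ~500k ID checks plus up to ~500k processed-key checks
puts records * checks_per_row
# => 17000000000000 element comparisons before any real work happens

Swap those two Array lookups for Set (or Hash) lookups and each membership check becomes a single hash operation, so the work collapses to roughly 17 million constant-time checks.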
I can't blame Ruby; it was doing exactly what I told it to do. Instead, I need to blame myself for not thinking about performance in such a simple little script. I can blame inexperience, but I think the biggest thing to look at is the ruts of everyday programming. As a junior developer, you get things working and just try to get by while figuring things out; as a senior-level developer, you need to look more at performance, maintainability, and reliability. In the same respect, you need to get out of the ruts you may have developed over your working life.
Testing Numbers: Arrays vs Sets
I wanted to see what this comparison of sets vs arrays looked like when checking simple numbers such as IDs. I wrote a simple benchmark test and ran it in much the same way as the other script just to see the relative performance of a set versus an array. For these membership checks, sets came out around 14 times faster than arrays, and I find that quite amazing!
require 'set'

tmp_array = []
tmp_set = Set.new
500_000.times do |i|
  tmp_array << i
  tmp_set.add i
end

start = Time.now
10_000.times do |i|
  num = rand(i..500_000)
  tmp_array.include? num
end
finish = Time.now
complete = finish - start
puts "Array Completed: #{complete}"

start = Time.now
10_000.times do |i|
  num = rand(i..500_000)
  tmp_set.include? num
end
finish = Time.now
complete = finish - start
puts "Set Completed: #{complete}"