loading...

A Quick Intro to Ruby's Set, Part 2: Object Lookups and Efficiency

isalevine profile image Isa Levine Updated on ・9 min read

Continuing from the previous article, we will now explore some of Set's methods, and compare their benchmarks with equivalent operations on arrays.

To review, some key details about sets:

  • All objects in a Set are guaranteed unique (arrays can have duplicates)
  • Objects in a Set are not ordered (arrays are ordered by index)
  • Sets are built on top of Hashes for super-fast object lookup (arrays are just dynamic arrays under-the-hood)

Or, summarized by RubyGuides' article on the Set class:

A set is a class that stores items like an array…

But with some special attributes that make it 10x faster in specific situations!

On top of that:

All the items in a set are guaranteed to be unique.

As we will see, my particular benchmarks don't quite reach 10x faster efficiency--but we'll explore situations where sets are certainly more efficient than an array!

Overview

In this second article, we'll compare the following operations between Ruby sets and arrays:

Operations where sets are faster than arrays:

  • set.include?() vs. array.include?()
  • set.add() vs. array.push()
  • set.delete() vs. array.delete()
  • set.subset?() vs. (array1 & array2) == array2

Operations where sets are slower than arrays:

  • set.replace() vs. array.replace()
  • set1 == set2 vs. array1 == array2

Operations where sets are faster than arrays

Let's start with methods where sets are more efficient than arrays.

Note: If you're coding along at home, remember to include require 'benchmark' to avoid an error with the Benchmark class!

set.include?() vs. array.include?()

Since sets are built on top of Ruby's hash, the .include() method is where they really shine. It simply checks if an element is present, returning a boolean.

The RubyGuides article above reports these benchmarks:

set   include: 8381985.2 i/s
array include: 703305.5  i/s - 11.92x  slower

And goes on to explain:

The reason for this difference is that an array has to check every single element...

...if you have a 1 million element array it's going to be checking 1 million elements every time you call include?.

A set doesn't have to do that.

Here's how we'll be testing it, using the same sets and arrays from our previous article. Each operation will be performed within Benchmark.bm 5 million times, and will be reported with .report. (The large spacing is for pretty-printing in the console!)

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]

Benchmark.bm(34) do |x|
    x.report("Set   .include? (beginning)       :")     { n.times do; filter_set.include?("a")   ; end}
    x.report("Array .include? (beginning)       :")     { n.times do; filter_array.include?("a")   ; end}
    x.report("Set   .include? (middle)          :")     { n.times do; filter_set.include?("they"); end}
    x.report("Array .include? (middle)          :")     { n.times do; filter_array.include?("they"); end}
    x.report("Set   .include? (end)             :")     { n.times do; filter_set.include?("at")  ; end}
    x.report("Array .include? (end)             :")     { n.times do; filter_array.include?("at")  ; end}
end

Above, we have the array checking for an element at the beginning, middle, and end of the array. For kicks, we have the set looking up elements that are hard-coded in the same order--but, because sets are built on top of hashes, there is no object order in a set!

Here's our output:

                                         user     system      total        real
Set   .include? (beginning)       :  0.796039   0.001551   0.797590 (  0.799956)
Array .include? (beginning)       :  0.632907   0.000869   0.633776 (  0.637059)
Set   .include? (middle)          :  0.786981   0.000610   0.787591 (  0.788172)
Array .include? (middle)          :  1.294061   0.001416   1.295477 (  1.298378)
Set   .include? (end)             :  0.792309   0.000866   0.793175 (  0.795283)
Array .include? (end)             :  2.476954   0.002786   2.479740 (  2.484839)

As expected, the array takes longer to find the closer the element is to the end. Meanwhile, the set's lookup time is essentially constant.

In the worst-case scenario, set.include?() will perform over 3x faster than array.include?().

set.add() vs. array.push()

Next, let's compare the set.add() method introduced in the previous article to array.push().

The array push method will add an element to the end of an array--this is the most efficient way to add elements (assuming you don't have to increase the array's capacity under-the-hood).

set.add(), on the other hand, simply adds a key-value pair to its under-the-hood hash, with the key being the added element, and its value being a boolean true.

Note: In this and several following tests, you'll notice that these operations include a first step of duplicating the original set/array. This is to ensure that the same operation is being performed--as opposed to the set adding an element only once and then rejecting it the other 4999999 times!

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]

Benchmark.bm(34) do |x|
    x.report("Set   .add  (successful)          :") { 
        n.times do
            set = filter_set
            set.add("big")
        end
    }
    x.report("Array .push (successful)          :") { 
        n.times do
            array = filter_array
            array.push("big")                                     
        end
    }
    x.report("Set   .add  (failed)              :") { 
        n.times do
            set = filter_set
            set.add("they")
        end
    }
    x.report("Array .push (if not present)      :") { 
        n.times do
            array = filter_array
            if !array.include?("they")
                array.push("they")
            end   
        end
    }
end

Here's our output:

                                         user     system      total        real
Set   .add  (successful)          :  0.929103   0.000749   0.929852 (  0.931718)
Array .push (successful)          :  0.909708   0.094546   1.004254 (  1.005897)
Set   .add  (failed)              :  0.979433   0.029193   1.008626 (  1.011531)
Array .push (if not present)      :  1.327389   0.001286   1.328675 (  1.330673)

At best, our set adds elements approximately as fast as an array's push. In situations where we need to check an array for an element before adding it, we see sets outperforming arrays by about 36%.

set.delete() vs. array.delete()

Both set.delete() and array.delete() require a lookup for an element before removing it, so this is a great comparison of their equivalent abilities!

Here's our code:

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]

Benchmark.bm(34) do |x|
    x.report("Set   .delete (beginning)         :") { 
        n.times do 
            set = filter_set
            set.delete("a")           
        end
    }
    x.report("Array .delete (beginning)         :") { 
        n.times do
            array = filter_array
            array.delete("a")
        end
    }
    x.report("Set   .delete (middle)            :") { 
        n.times do 
            set = filter_set
            set.delete("they")        
        end
    }
    x.report("Array .delete (middle)            :") { 
        n.times do
            array = filter_array
            array.delete("they")  
        end
    }
    x.report("Set   .delete (end)               :") { 
        n.times do 
            set = filter_set
            set.delete("at")          
        end
    }
    x.report("Array .delete (end)               :") { 
        n.times do
            array = filter_array
            array.delete("at")    
        end
    }
end

Once again, we're measuring an array's ability to delete elements at their beginning, middle, and end (and illustrating that sets have no such order).

Here's our output:

                                         user     system      total        real
Set   .delete (beginning)         :  0.812703   0.000889   0.813592 (  0.816582)
Array .delete (beginning)         :  1.932468   0.001862   1.934330 (  1.938841)
Set   .delete (middle)            :  0.811337   0.000912   0.812249 (  0.814055)
Array .delete (middle)            :  2.053270   0.001698   2.054968 (  2.059913)
Set   .delete (end)               :  0.813986   0.000986   0.814972 (  0.816330)
Array .delete (end)               :  2.203437   0.002076   2.205513 (  2.210077)

As expected, sets can delete elements in constant time. Interestingly, arrays are able to delete elements at the end almost as fast as ones at their beginning!

set.subset?() vs. (array1 & array2) == array2

One of sets' most useful features is the subset? method. A subset is a set where all of its elements are also elements of its superset.

While arrays don't have an exactly equivalent method, this excellent StackOverflow answer illustrates how the array & method can be used for the same effect.

Here's our code:

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
sub_set = Set["a", "they", "at"]
sub_array = ["a", "they", "at"]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]

Benchmark.bm(34) do |x|
    x.report("Set   .subset?          (true)    :")     { n.times do; sub_set.subset?(filter_set)                       ; end}
    x.report("Array (a1 & a2) == a2   (true)    :")     { n.times do; (filter_array & sub_array) == sub_array           ; end}
    x.report("Set   .subset?          (false)   :")     { n.times do; replace_set.subset?(filter_set)                   ; end}
    x.report("Array (a1 & a2) == a2   (false)   :")     { n.times do; (filter_array & replace_array) == replace_array   ; end}    
end

Our code compares sets and arrays in situations where subset? and (a1 & a2) == a2 return both true and false.

Here's our output:

                                         user     system      total        real
Set   .subset?          (true)    :  2.076477   0.001450   2.077927 (  2.080519)
Array (a1 & a2) == a2   (true)    :  7.961141   0.180642   8.141783 (  8.158982)
Set   .subset?          (false)   :  1.554540   0.001360   1.555900 (  1.558715)
Array (a1 & a2) == a2   (false)   :  5.280627   0.006986   5.287613 (  5.297066)

Wow! In both situations, we see sets performing subset? over 3x as fast as the array's equivalent!

Operations where sets are slower than arrays

Now, let's turn our attention to some situations where sets do not perform as well as arrays.

Unfortunately, I cannot offer under-the-hood explanations for these results at this time. Please feel free to comment below and share your wisdom of why arrays outperform sets in the following operations! :)

set.replace() vs. array.replace()

In both structures, the replace method completely replaces the contents of the given structure with the elements from a supplied enumerable. In our code, we will be replacing a set with a set, and an array with an array.

Here's our code:

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]

Benchmark.bm(34) do |x|
    x.report("Set   .replace                    :")     { n.times do; filter_set.replace(replace_set)       ; end}
    x.report("Array .replace                    :")     { n.times do; filter_array.replace(replace_array)   ; end}
end

Here's our output:

                                         user     system      total        real
Set   .replace                    :  2.318287   0.001841   2.320128 (  2.327267)
Array .replace                    :  0.441079   0.000528   0.441607 (  0.442994)

Ouch! Arrays implement replace 5x faster than sets! As stated above, I'd love to have someone fill me in on why!

set1 == set2 vs. array1 == array2

Finally, let's look at ==, where each pair of structures is checked for having the same elements--and for arrays, the same order. Once again, we will benchmark both true and false situations for each.

Here's our code:

n = 5000000

filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
match_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
match_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their", 
    "them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]

Benchmark.bm(34) do |x|
    x.report("Set   ==  (true)                  :")     { n.times do; filter_set == match_set       ; end}
    x.report("Array ==  (true)                  :")     { n.times do; filter_array == match_array   ; end}
    x.report("Set   ==  (false)                 :")     { n.times do; filter_set == replace_set     ; end}
    x.report("Array ==  (false)                 :")     { n.times do; filter_array == replace_array ; end}
end

Here's our output:

                                         user     system      total        real
Set   ==  (true)                  :  7.219614   0.004376   7.223990 (  7.234709)
Array ==  (true)                  :  4.119128   0.003112   4.122240 (  4.128884)
Set   ==  (false)                 :  1.097661   0.000751   1.098412 (  1.099235)
Array ==  (false)                 :  0.391299   0.000482   0.391781 (  0.393029)

In the best case, array's perform better than sets by about 75% when checking for true cases--and at worst, for false situations, arrays check about 275% faster than sets! Equality is definitely a situation to prefer arrays over sets. (Anyone care to explain why? <3)

Conclusion

Now you're more familiar with Set's methods and cases where they may be advantageous to arrays! The two key situations to look for are:

  • Needing to check via include? in a collection of unique elements
  • Needing to check for subsets of elements

Thanks for reading!

Links and Sources

Excellent introduction to Ruby Sets by Al Scott

Ruby docs - Sets

RubyGuides intro to Sets

DotNetPerls - Ruby Set examples

Got any tips, tricks, or instances where you like to use sets? Please feel free to share in the comments below!

Discussion

pic
Editor guide