Continuing from the previous article, we will now explore some of Set's methods, and compare their benchmarks with equivalent operations on arrays.
To review, some key details about sets:
- All objects in a Set are guaranteed unique (arrays can have duplicates)
- Objects in a Set are not ordered (arrays are ordered by index)
- Sets are built on top of Hashes for super-fast object lookup (arrays are just dynamic arrays under-the-hood)
Or, summarized by RubyGuides' article on the Set class:
A set is a class that stores items like an array…
But with some special attributes that make it 10x faster in specific situations!
On top of that:
All the items in a set are guaranteed to be unique.
As we will see, my particular benchmarks don't quite reach 10x faster efficiency--but we'll explore situations where sets are certainly more efficient than an array!
Overview
In this second article, we'll compare the following operations between Ruby sets and arrays:
Operations where sets are faster than arrays:
-
set.include?()
vs.array.include?()
-
set.add()
vs.array.push()
-
set.delete()
vs.array.delete()
-
set.subset?()
vs.(array1 & array2) == array2
Operations where sets are slower than arrays:
-
set.replace()
vs.array.replace()
-
set1 == set2
vs.array1 == array2
Operations where sets are faster than arrays
Let's start with methods where sets are more efficient than arrays.
Note: If you're coding along at home, remember to include require 'benchmark'
to avoid an error with the Benchmark
class!
set.include?()
vs. array.include?()
Since sets are built on top of Ruby's hash, the .include()
method is where they really shine. It simply checks if an element is present, returning a boolean.
The RubyGuides article above reports these benchmarks:
set include: 8381985.2 i/s
array include: 703305.5 i/s - 11.92x slower
And goes on to explain:
The reason for this difference is that an array has to check every single element...
...if you have a 1 million element array it's going to be checking 1 million elements every time you call
include?
.A set doesn't have to do that.
Here's how we'll be testing it, using the same sets and arrays from our previous article. Each operation will be performed within Benchmark.bm
5 million times, and will be reported with .report
. (The large spacing is for pretty-printing in the console!)
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
Benchmark.bm(34) do |x|
x.report("Set .include? (beginning) :") { n.times do; filter_set.include?("a") ; end}
x.report("Array .include? (beginning) :") { n.times do; filter_array.include?("a") ; end}
x.report("Set .include? (middle) :") { n.times do; filter_set.include?("they"); end}
x.report("Array .include? (middle) :") { n.times do; filter_array.include?("they"); end}
x.report("Set .include? (end) :") { n.times do; filter_set.include?("at") ; end}
x.report("Array .include? (end) :") { n.times do; filter_array.include?("at") ; end}
end
Above, we have the array checking for an element at the beginning, middle, and end of the array. For kicks, we have the set looking up elements that are hard-coded in the same order--but, because sets are built on top of hashes, there is no object order in a set!
Here's our output:
user system total real
Set .include? (beginning) : 0.796039 0.001551 0.797590 ( 0.799956)
Array .include? (beginning) : 0.632907 0.000869 0.633776 ( 0.637059)
Set .include? (middle) : 0.786981 0.000610 0.787591 ( 0.788172)
Array .include? (middle) : 1.294061 0.001416 1.295477 ( 1.298378)
Set .include? (end) : 0.792309 0.000866 0.793175 ( 0.795283)
Array .include? (end) : 2.476954 0.002786 2.479740 ( 2.484839)
As expected, the array takes longer to find the closer the element is to the end. Meanwhile, the set's lookup time is essentially constant.
In the worst-case scenario, set.include?()
will perform over 3x faster than array.include?()
.
set.add()
vs. array.push()
Next, let's compare the set.add()
method introduced in the previous article to array.push()
.
The array push
method will add an element to the end of an array--this is the most efficient way to add elements (assuming you don't have to increase the array's capacity under-the-hood).
set.add()
, on the other hand, simply adds a key-value pair to its under-the-hood hash, with the key being the added element, and its value being a boolean true
.
Note: In this and several following tests, you'll notice that these operations include a first step of duplicating the original set/array. This is to ensure that the same operation is being performed--as opposed to the set adding an element only once and then rejecting it the other 4999999 times!
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
Benchmark.bm(34) do |x|
x.report("Set .add (successful) :") {
n.times do
set = filter_set
set.add("big")
end
}
x.report("Array .push (successful) :") {
n.times do
array = filter_array
array.push("big")
end
}
x.report("Set .add (failed) :") {
n.times do
set = filter_set
set.add("they")
end
}
x.report("Array .push (if not present) :") {
n.times do
array = filter_array
if !array.include?("they")
array.push("they")
end
end
}
end
Here's our output:
user system total real
Set .add (successful) : 0.929103 0.000749 0.929852 ( 0.931718)
Array .push (successful) : 0.909708 0.094546 1.004254 ( 1.005897)
Set .add (failed) : 0.979433 0.029193 1.008626 ( 1.011531)
Array .push (if not present) : 1.327389 0.001286 1.328675 ( 1.330673)
At best, our set adds elements approximately as fast as an array's push
. In situations where we need to check an array for an element before adding it, we see sets outperforming arrays by about 36%.
set.delete()
vs. array.delete()
Both set.delete()
and array.delete()
require a lookup for an element before removing it, so this is a great comparison of their equivalent abilities!
Here's our code:
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
Benchmark.bm(34) do |x|
x.report("Set .delete (beginning) :") {
n.times do
set = filter_set
set.delete("a")
end
}
x.report("Array .delete (beginning) :") {
n.times do
array = filter_array
array.delete("a")
end
}
x.report("Set .delete (middle) :") {
n.times do
set = filter_set
set.delete("they")
end
}
x.report("Array .delete (middle) :") {
n.times do
array = filter_array
array.delete("they")
end
}
x.report("Set .delete (end) :") {
n.times do
set = filter_set
set.delete("at")
end
}
x.report("Array .delete (end) :") {
n.times do
array = filter_array
array.delete("at")
end
}
end
Once again, we're measuring an array's ability to delete elements at their beginning, middle, and end (and illustrating that sets have no such order).
Here's our output:
user system total real
Set .delete (beginning) : 0.812703 0.000889 0.813592 ( 0.816582)
Array .delete (beginning) : 1.932468 0.001862 1.934330 ( 1.938841)
Set .delete (middle) : 0.811337 0.000912 0.812249 ( 0.814055)
Array .delete (middle) : 2.053270 0.001698 2.054968 ( 2.059913)
Set .delete (end) : 0.813986 0.000986 0.814972 ( 0.816330)
Array .delete (end) : 2.203437 0.002076 2.205513 ( 2.210077)
As expected, sets can delete elements in constant time. Interestingly, arrays are able to delete elements at the end almost as fast as ones at their beginning!
set.subset?()
vs. (array1 & array2) == array2
One of sets' most useful features is the subset?
method. A subset is a set where all of its elements are also elements of its superset.
While arrays don't have an exactly equivalent method, this excellent StackOverflow answer illustrates how the array &
method can be used for the same effect.
Here's our code:
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
sub_set = Set["a", "they", "at"]
sub_array = ["a", "they", "at"]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]
Benchmark.bm(34) do |x|
x.report("Set .subset? (true) :") { n.times do; sub_set.subset?(filter_set) ; end}
x.report("Array (a1 & a2) == a2 (true) :") { n.times do; (filter_array & sub_array) == sub_array ; end}
x.report("Set .subset? (false) :") { n.times do; replace_set.subset?(filter_set) ; end}
x.report("Array (a1 & a2) == a2 (false) :") { n.times do; (filter_array & replace_array) == replace_array ; end}
end
Our code compares sets and arrays in situations where subset?
and (a1 & a2) == a2
return both true and false.
Here's our output:
user system total real
Set .subset? (true) : 2.076477 0.001450 2.077927 ( 2.080519)
Array (a1 & a2) == a2 (true) : 7.961141 0.180642 8.141783 ( 8.158982)
Set .subset? (false) : 1.554540 0.001360 1.555900 ( 1.558715)
Array (a1 & a2) == a2 (false) : 5.280627 0.006986 5.287613 ( 5.297066)
Wow! In both situations, we see sets performing subset?
over 3x as fast as the array's equivalent!
Operations where sets are slower than arrays
Now, let's turn our attention to some situations where sets do not perform as well as arrays.
Unfortunately, I cannot offer under-the-hood explanations for these results at this time. Please feel free to comment below and share your wisdom of why arrays outperform sets in the following operations! :)
set.replace()
vs. array.replace()
In both structures, the replace
method completely replaces the contents of the given structure with the elements from a supplied enumerable. In our code, we will be replacing a set with a set, and an array with an array.
Here's our code:
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]
Benchmark.bm(34) do |x|
x.report("Set .replace :") { n.times do; filter_set.replace(replace_set) ; end}
x.report("Array .replace :") { n.times do; filter_array.replace(replace_array) ; end}
end
Here's our output:
user system total real
Set .replace : 2.318287 0.001841 2.320128 ( 2.327267)
Array .replace : 0.441079 0.000528 0.441607 ( 0.442994)
Ouch! Arrays implement replace
5x faster than sets! As stated above, I'd love to have someone fill me in on why!
set1 == set2
vs. array1 == array2
Finally, let's look at ==
, where each pair of structures is checked for having the same elements--and for arrays, the same order. Once again, we will benchmark both true and false situations for each.
Here's our code:
n = 5000000
filter_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
filter_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
match_set = Set["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
match_array = ["a", "an", "the", "and", "is", "of", "to", "be", "in", "they", "their",
"them", "or", "if", "this", "like", "had", "but", "what", "with", "at",
]
replace_set = Set["big", "sword", "knight"]
replace_array = ["big", "sword", "knight"]
Benchmark.bm(34) do |x|
x.report("Set == (true) :") { n.times do; filter_set == match_set ; end}
x.report("Array == (true) :") { n.times do; filter_array == match_array ; end}
x.report("Set == (false) :") { n.times do; filter_set == replace_set ; end}
x.report("Array == (false) :") { n.times do; filter_array == replace_array ; end}
end
Here's our output:
user system total real
Set == (true) : 7.219614 0.004376 7.223990 ( 7.234709)
Array == (true) : 4.119128 0.003112 4.122240 ( 4.128884)
Set == (false) : 1.097661 0.000751 1.098412 ( 1.099235)
Array == (false) : 0.391299 0.000482 0.391781 ( 0.393029)
In the best case, array's perform better than sets by about 75% when checking for true cases--and at worst, for false situations, arrays check about 275% faster than sets! Equality is definitely a situation to prefer arrays over sets. (Anyone care to explain why? <3)
Conclusion
Now you're more familiar with Set's methods and cases where they may be advantageous to arrays! The two key situations to look for are:
- Needing to check via
include?
in a collection of unique elements - Needing to check for subsets of elements
Thanks for reading!
Links and Sources
Excellent introduction to Ruby Sets by Al Scott
DotNetPerls - Ruby Set examples
Top comments (1)
This is a really great series that dives into the nuances of ruby sets for an introduction, thanks for writing this!