Benchmarking Scala with ScalaMeter, Pt. 1 (Scala DCP #004)
Andrew (he/him) ・10 min read
Daily Coding Problems in Scala (5 Part Series)
Daily Coding Problem is a website which will send a programming challenge to your inbox every day. I want to show beginners how to solve some of these problems using Scala, so this will be an ongoing series of my solutions. Feel free to pick them apart in the comments!
Problem
Given an array of integers, find the first missing positive integer in linear time and constant space. In other words, find the lowest positive integer that does not exist in the array. The array can contain duplicates and negative numbers as well.
For example, the input
[3, 4, 1, 1]
should give2
. The input[1, 2, 0]
should give3
.You can modify the input array inplace.
Strategy
See my discussion of this problem (and my Java implementation) here!
Spoilers! Don't look below unless you want to see my solution!
The Java solution I wrote for this problem is very imperative. It uses array indices and mutable variables and more and feels very unScalalike. But the "constant space" restriction on the problem seems to mandate that we use a mutable array.
Since this mutability sort of flies in the face of Scala's functional style, and since I've already solved the problem with the constant space restriction, I'd like to take a different approach to solving this problem using Scala. I'd like to solve this as many ways as I can and then benchmark the resulting algorithms. Since the arrays can contain duplicates and negative numbers, we have several dimensions over which we might want to search for the optimal algorithm:
 % of duplicate elements (
pctDup
)  % of negative elements (
pctNeg
)  length of array (
n
)  maximum integer contained (
maxVal
)  minimum integer contained (
minVal
)
These five restrictions essentially define a probability density function for an arbitrary random integer generator. Generating integers within a range is easy enough. If you have Scala 2.13 or later, you can use the Random.between()
method:
scala> import scala.util.Random
import scala.util.Random
scala> val rand = new Random()
rand: scala.util.Random = scala.util.Random@5496c165
scala> (1 to 10).map(e => rand.between(5, 5))
res1: IndexedSeq[Int] = Vector(3, 3, 3, 2, 2, 1, 4, 3, 3, 3)
scala> // or...
scala> List.fill(10)(rand.between(5,5))
res5: List[Int] = List(1, 0, 4, 2, 1, 5, 4, 1, 3, 4)
If you have an earlier version of Scala, you can implement the between()
method yourself by just copyingandpasting the source code from Scala's GitHub repo. Here is the implementation as it exists in the source (comments removed):
def between(minInclusive: Int, maxExclusive: Int): Int = {
require(minInclusive < maxExclusive, "Invalid bounds")
val difference = maxExclusive  minInclusive
if (difference >= 0) {
nextInt(difference) + minInclusive
} else {
@tailrec
def loop(): Int = {
val n = nextInt()
if (n >= minInclusive && n < maxExclusive) n
else loop()
}
loop()
}
}
Once we have a ranged random integer generator, we can simply generate n
values to get our sample array. Since we ignore nonpositive integers, we probably don't care what the smallest integer is, we only really care about the percentage of nonpositive integers. That means the dimensions over which we want to benchmark our solutions might actually be:
 % of duplicate elements (
pctDup
)  % of nonpositive elements (
pctBad
)  length of array (
n
)  maximum integer contained (
maxVal
)
...to get a particular % of nonpositive elements, we can simply roll a die when the next integer in the sequence is about to be generated, and make that integer a 0
if some predicate passes. The same thing can be done for duplicate elements.
Finally, we likely want maxVal
to somehow be dependent on the length of the array n
, and not arbitrary. So I propose that we instead use a maximum value which is relative to the length of the array:
maximum integer contained, relative to length of array (
maxPct
)
We might naively chose values for these parameters similar to the following
pctDup = 0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 75, 90, 99, 99.9
pctBad = 0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 75, 90, 99, 99.9
n = 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000
maxPct = 25, 50, 75, 100, 125, 150, 175, 200
Looking at the above, we can see that some of these values may be contradictory. For example, we can't have an array of length n = 5
with a maxPct = 25
(meaning the maximum allowable value in the array is 0.25 * 5 = 1.25
or 1
when rounded to an integer, and require that that array have a pctDup
any less than 100%, which is not one of our allowable values. Also, it might be difficult to implement a "duplicate value" generator, because that requires us to look over the alreadygenerated values and pick one at random before duplicating that one in particular.
Instead, we can inspect pctDup
as an emergent property of the generated arrays. If maxPct
is small, then pctDup
will naturally become large, since we're restricting the range of possible values relative to the length of the array. This simplifies our search space further. Let's drop pctDup
entirely and instead focus on only three tunable parameters:
 the length of the array (
n
)  the maximum allowable value in the array, as a percentage of the length of the array (
maxPct
)  the percentage of nonpositive ("bad") elements (
pctBad
)
Readjusting our test values slightly:
pctBad = 0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 75, 90, 99, 99.9
n = 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000
maxPct = 10, 25, 50, 75, 100, 125, 150, 175, 200, 500, 1000
This gives us 17 * 11 * 11 = 2057
possible combinations of parameters to iterate over, multiplied by the number of algorithms. Hopefully most of these should be very fast, because in the majority of cases, we're generating arrays with less than 100 elements. If these tests don't take too long, we can add additional tests with greater values of n
.
We need to be careful to only benchmark the algorithm defined in the prompt and not to benchmark the arraygenerating code itself. Since we're not benchmarking that code, we don't need to pick apart its efficiency (though, ideally, it will be as efficient as possible so that the tests run quickly).
A quick and dirty implementation of this random array generator could look something like:
scala> import scala.util.Random
import scala.util.Random
scala> val rand = new Random()
rand: scala.util.Random = scala.util.Random@5496c165
scala> val n = 10
n: Int = 10
scala> val pctBad = 20
pctBad: Int = 20
scala> val maxPct = 1000
maxPct: Int = 1000
scala> (1 to n).map(i => if (rand.nextDouble() < pctBad/100) 0 else rand.between(1, n*maxPct/100 + 1))
res0: IndexedSeq[Int] = Vector(52, 0, 77, 84, 0, 0, 16, 87, 8, 30)
The above values of n
, pctBad
, and maxPct
generate an array of 10
elements, where approximately 20
percent of them are "bad" (nonpositive), and the values in the array range from 1
to 1000
percent of the length of the array (i.e. 10 times the length of the array or 100
), inclusive on both ends.
Note that, because we're using percentages and not fractions, we need to divide both percentages by 100
. Instead of doing that, let's just move to fractional values instead:
pctBad = 0, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, .01, .02, .05, 0.1, 0.2, 0.5, 0.75, 0.9, 0.99, 0.999
n = 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000
maxPct = 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5, 10
And take another shot at our random array generator...
scala> def randArray (n: Int, pctBad: Double, maxPct: Double): Array[Int] =
 (1 to n).map(i => if (rand.nextDouble() < pctBad) 0
 else rand.between(1, (n*maxPct).toInt + 1)).toArray
randArray: (n: Int, pctBad: Double, maxPct: Double)Array[Int]
scala> randArray(10, 0.5, 1.0)
res9: Array[Int] = Array(6, 0, 4, 3, 0, 0, 0, 9, 7, 0)
scala> randArray(15, 0.9, 10.0)
res12: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 0, 148, 0, 0)
scala> randArray(5, 0, 100.0)
res13: Array[Int] = Array(194, 45, 113, 33, 460)
This seems okay until we realise that it fails for certain combinations of parameters. For instance, if n
is 1
(allowed) and maxPct
is 0.1
(allowed), we will try to call rand.between(1, 1)
, which will throw an IllegalArgumentException
(requirement failed: Invalid bounds
). So we can require
maxPct
to always be positive, and use
Math.ceil(n*maxPct).toInt
instead of just
ceil(n*maxPct).toInt
This ensures that the range sent to rand.between()
is always valid. One more small fix might be to reduce our number of random values generated. Since between()
and nextDouble()
both return uniformly random values over a range, we could simplify the algorithm by simply allowing the range to be negative for a percentage equivalent to pctBad
.
For instance, if n
is 20
, pctBad
is 0.0909
(9.09%), and maxPct
is 1.5
(150%), it means we want to generate an array of 20
integer values on the range [1, 30]
, where 1.8
of those values are negative, on average (0.0909 * 20
).
Instead of throwing two random numbers  one corresponding to pctBad
, and one corresponding to generating the positive value if pctBad
is greater than 0.0909
, we could instead just extend our range into the negative integers, such that 9.09% of the total range is nonpositive. In other words, we change our range from [1, 30]
to [2, 30]
and generate integers randomly uniformly over that range. In this case, our range includes 33
integers, 3
of which (2
, 1
, and 0
) are "bad".
This ensures that, on average, 9.09% of the values (3
out of 33
possible integer values) will be nonpositive ("bad"), and that the maximum allowable generated value is 150% of the length of the array. And we only have to generate one random value!
A modified randArray()
might then look like:
// random array generator
def randArray (n: Int, fracBad: Double, maxScaling: Double): Array[Int] = {
require(fracBad >= 0 && fracBad <= 1, "fraction of \"bad\" elements must be between 0 and 1")
require(maxScaling >= 0, "maxScaling must be >= 0")
require(n >= 0, "array length must be >= 0")
if (n == 0) Array()
else if (fracBad == 1.0) Array.fill(n)(0)
else {
val maxVal = n * maxScaling
val minVal = 0.5  fracBad / (1  fracBad) * maxVal
val range = f"range [$minVal%.3f, $maxVal%.3f]"
require(n == 0  minVal <= maxVal, s"invalid $range")
if (minVal == maxVal) Array.fill(n)(minVal.toInt)
else (1 to n).map(_ =>
rand.between(minVal, maxVal).round.toInt).toArray
}
}
...you could do away with the require()
s if you're sure that you're only passing valid parameters. With this implementation, we'll have some "wiggle" on the percentage of nonpositive values in an array, as well as the range of values contained in the array. We can look at all of these things statistically. In summary, we'll use the above randArray()
implementation and the problem solutions below, benchmarking them as a function of

n
, the number of elements in the array  the fraction of nonpositive elements in the array
 the maximum value present in the array, relative to the length of the array
 the average duplication of given elements in the array...?
 ...?
Let's go!
Code
There are three methods I'd like to try and benchmark. The first is my Javalike solution (rewritten in (ugly) Scala, of course). I've debugged this a bit and I think the performance is more or less optimal at this point. It's constant space because it only uses the given mutable array, and I think it performs the fewest number of value swaps and array traversals to solve the problem:
scala> :paste
// Entering paste mode (ctrlD to finish)
def findMissingMutable (array: Array[Int]): Int = {
if (array.length < 1) 1
else {
val length = array.length
var smallestMissing = 1
for (index < 1 to length) {
var value = array(index  1)
// if this value is valid, and not at the appropriate index...
if ((value != index) && (value > 0) && (value <= length)) {
breakable {
// ...we will try to swap it
while (value != index) {
// get the value at this value's index
val swapIndex = value
val swapValue = array(swapIndex  1)
// if value of target index is equal to this value, skip
if (value == swapValue) break
// swap the values at the indices
array(swapIndex  1) = value
array(index  1) = swapValue
// reapply conditions in outer if loop
if (swapValue < 1  swapValue > length) break
if (swapValue == swapIndex) break
// get newlyswapped value
value = array(index  1)
}
}
}
}
for (ii < 0 until length) {
if (array(ii) >= 1) {
if (array(ii) != smallestMissing) return smallestMissing
else smallestMissing += 1
}
}
smallestMissing
}
}
// Exiting paste mode, now interpreting.
findMissingMutable: (array: Array[Int])Int
scala> findMissingMutable(Array(3, 4, 1, 1))
res102: Int = 2
scala> findMissingMutable(Array(1, 2, 0))
res103: Int = 3
The second uses Scala's Array.sortInPlace()
method (available from 2.13 onward), which should allow us to still satisfy the prompt's constant space complexity requirement:
scala> def findMissingInPlace (arr: Array[Int]): Int = {
 val goodVals = arr.filter(_ > 0).distinct.sortInPlace
 val missing = goodVals.zipWithIndex.dropWhile({ case (x,i) => x == i+1 }).headOption
 missing match {
 case Some((_, i)) => i + 1
 case None => goodVals.length + 1
 }
 }
findMissingInPlace: (arr: Array[Int])Int
scala> findMissingInPlace(Array(3, 4, 1, 1))
res46: Int = 2
scala> findMissingInPlace(Array(1, 2, 0))
res47: Int = 3
The third method is linear in space, but might be faster because I'll be converting the Array
to a Set
, then performing only Set
operations, which are, in general, faster than operations on Scala's sequences:
scala> def findMissingSet (arr: Array[Int]): Int = {
 val goodVals = arr.filter(_ > 0).toSet
 val nVals = goodVals.size
 val missing = (1 to nVals).dropWhile(goodVals.contains).headOption
 missing match {
 case Some(x) => x
 case None => nVals + 1
 }
 }
findMissingSet: (arr: Array[Int])Int
scala> findMissingSet(Array(3, 4, 1, 1))
res71: Int = 2
scala> findMissingSet(Array(1, 2, 0))
res72: Int = 3
After testing these solutions a few hundred times and ensuring that they all arrive at the same results (and squashing some bugs), I'm happy enough to benchmark them. Note that, because findMissingMutable()
and findMissingInPlace()
both mutate the array passed to them, we must first generate an array, then copy it three times to pass to each of the three implementations above.
So, which one will be fastest? Under which conditions? Place your bets now because I'm about to reveal the results!
Results coming soon!
All the code for my Daily Coding Problems solutions is available at github.com/awwsmm/daily.
Suggestions? Let me know in the comments.
If you enjoyed this post, please consider supporting my work by buying me a coffee!
Daily Coding Problems in Scala (5 Part Series)