Sometimes, Clojure seems to miss operations that feel like they should be in clojure.core
. I'm clearly not the only one who has felt this way, as evidenced by James Reeves's medley library. These missing operations are typically simple to implement, but after rewriting them for the nth time I feel like putting them in my own library. Or maybe contributing to James's.
Today's question was how to identify duplicates in a seq. For instance, say we have a sequence of numbers with a duplicate:
(def a-seq [0 1 2 3 4 2 6])
Step 1
My first thought was that every element in the sequence needs to be compared to every other item. Indeed, if the first element is not equal to any of the sequence that follows it, then that element need not be considered again. How might that look?
I'll start with a different seq where the first element does appear again:
(def seq0 [0 1 2 3 4 2 0])
We want to avoid indexing and loops whenever possible in Clojure, as these are conceptually low-level, and can be a source of errors. Instead, we want to look in clojure.core
for applicable functions whenever possible.
As with any Lisp, the 2 main functions to consider for processing seqs are map
and reduce
. The map
operation will do something for every element in a seq, returning a new seq, while reduce
will derive a single value from processing each element in a seq. In this case, we're looking for a repeated value:
(reduce (fn [result element] (or result (= (first a-seq) element))) nil (rest a-seq))
true
OK, this found the value, but which value was it?
(reduce (fn [result element] (or result (and (= (first a-seq) element) element))) (rest a-seq))
0
This could then be applied to each subsequence, of the seq, but... wow, is it clunky. The reducing function is doing a lot here, comparing if the result is true yet, and if not, then comparing the provided first element to the element being presented by reduce
, returning that element if needed.
We could clean it up a little by putting reduced
over the return value so that it breaks as soon as the first match is found, and using a singleton set for the equality test:
(reduce (fn [result element] (or result (and (#{(first seq0)} element) (reduced element)))) (rest seq0))
But clojure.core
already has a function that does all of this, called some
. The documentation even tells us that we can use a set as an "equality" test when searching a seq:
(some #{(first a-seq)} (rest a-seq))
Step 2
But now we want to do this for every sub-seq, dropping off the head each time. One way might be to use drop
to remove the start of the seq, counting up:
(map #(drop % a-seq) (range (count a-seq)))
((0 1 2 3 4 2 6) (1 2 3 4 2 6) (2 3 4 2 6) (3 4 2 6) (4 2 6) (2 6) (6))
But counting the sequence, or even using map-indexed
(which does the counting for you) is still getting very involved in the mechanics of processing a sequence. Is there a higher-level construct?
One mechanism is using the rest
operation over and over. This might be done using reduce
in some way, but fortunately the iterate
function does exactly this. Except it returns an infinite series, so we just need to take the correct amount. Let's go past the end by 5 to see why we need to keep it constrained:
(take (+ 5 (count a-seq)) (iterate rest a-seq))
([0 1 2 3 4 2 6] (1 2 3 4 2 6) (2 3 4 2 6) (3 4 2 6) (4 2 6) (2 6) (6) () () () () ())
This looks good, but it's still counting the length of the seq, so we want to keep looking for a better approach.
The map
function is more flexible than simply applying an operation to transform each element of a seq. It can also be used to apply an operation to a group of elements across multiple seqs. The best part about this is that it will terminate as soon as the first of the seqs finishes. So let's try using map between the original seq and the iteration. We can just return the 2 values as pairs to see what we are working with:
(map (fn [a b] [a b]) a-seq (iterate rest a-seq))
This is almost exactly what we want as the arguments for the some
operation above. Each operation of the map
could execute rest
on its second argument, but if we start the iteration one step along, then can get the same effect without needing the extra rest
every time:
(map (fn [a b] [a b]) a-seq (rest (iterate rest a-seq)))
([0 (1 2 3 4 2 6)] [1 (2 3 4 2 6)] [2 (3 4 2 6)] [3 (4 2 6)] [4 (2 6)] [2 (6)] [6 ()])
These pairs are exactly what we needed for the some
operation, so let's look at that:
(map (fn [a b] (some #{a} b)) a-seq (rest (iterate rest a-seq)))
(nil nil 2 nil nil nil nil)
Step 3
This is almost done. We need to skip to the first non-nil
value, which we already know we can do using some
. However, some
needs a function to apply whereas we don't need to do anything to the value, just use it as-is. So we can use identity
here:
(some identity (map (fn [a b] (some #{a} b)) a-seq (iterate rest (rest a-seq))))
2
To look more Clojure-ey we would usually run steps like this through a threading macro, and the anonymous function can be abbreviated as well. Of course, we want this to be a function, so we can replace the use of the a-seq
value with a function argument, which I'll just call s
for "seq":
(defn any=
[s]
(->> s
rest
(iterate rest)
(map #(some #{%1} %2) s)
(some identity)))
Step 4
After all of the above, I was thinking I had something relatively nice, and maybe I should write about the process of getting to it. Hence this post. I actually included extra steps along the way, just to spell things out, though I didn't actually go through most of those steps.
While thinking of extra steps and alternatives that I could write about, I thought about comparing to Clojure's distinct
function which can remove duplicates:
(distinct a-seq)
This is similar to what I needed, but instead of telling me when it found a duplicate, it skips the duplicates instead. A similarly written function would also work, but this function is a transducer, and these are probably more work than should be necessary for something relatively simple.
I'm not saying that a transducer would be difficult. They follow a straightforward template:
(defn duplicates
([]
(fn [rf]
(let [seen (volatile! #{})]
(fn
([] (rf))
([result] (rf result))
([result input]
(if (contains? @seen input)
(rf result input)
(do (vswap! seen conj input)
result)))))))
([coll] (sequence (duplicates) coll)))
This is actually a little more flexible since it can find all of the duplicates, rather than just the first one:
(duplicates [1 2 3 4 2 4 5 1])
(2 4 1)
The transducer approach also has the benefit of only processing the seq s single time, and using memory to recall what it has seen. So instead of O(n^2) complexity, it becomes O(n.log(n)) complexity (since the hashmap insertions and lookups are logarithmic - though in Clojure they are effectively O(1)).
However, all I needed was a function that could report the first duplicate in a seq. Maybe there was some way to use distinct
for that.
Let's try using map
again, this time against the original seq and it's distinct
version. We'll emit pairs to see what is happening:
(map (fn [a b] [a b]) a-seq (distinct a-seq))
([1 1] [2 2] [3 3] [4 4] [2 5])
It terminated at the end of the distinct version, but we can see that just at the end the two sequences differed. This will happen if there is a duplicate in the first sequence that didn't appear in the second. So we can output that one when we get to it:
(map (fn [a b] (when (not= a b) a)) a-seq (distinct a-seq))
(nil nil nil nil nil 2)
This will keep emitting values after the first duplicate is found, since the two sequences are now misaligned. Consequently we only want to return the first non-nil value from the seq:
(some identity (map (fn [a b] (when (not= a b) a)) a-seq (distinct a-seq)))
2
Giving a new version for finding the first duplicate:
(defn any=
[s]
(->> s
distinct
(map #(when (not= %1 %2) %1) s)
(some identity)))
Wrap Up
This is a little shorter, has better complexity, and makes better use of clojure.core
. I'd probably use this one instead of my first attempt.
This post also reminded me to check my assumptions. This is not the first time I've tried to explain why one approach was not as effective as another, only to show that it was better. I need to remember that.
Top comments (1)
Thanks! I enjoyed your article and it got me thinking about how I might implement this. Recursion can be a very useful tool and it's the one I'd reach for here:
(defn any= [[head & tail]] (when tail (or (some #{head} tail) (recur tail))))