Paula Gearon

Posted on Mar 12

Repetition

#clojure

Sometimes, Clojure seems to miss operations that feel like they should be in clojure.core. I'm clearly not the only one who has felt this way, as evidenced by James Reeves's medley library. These missing operations are typically simple to implement, but after rewriting them for the nth time I feel like putting them in my own library. Or maybe contributing to James's.

Today's question was how to identify duplicates in a seq. For instance, say we have a sequence of numbers with a duplicate:

(def a-seq [0 1 2 3 4 2 6])

Step 1

My first thought was that every element in the sequence needs to be compared to every other item. Indeed, if the first element is not equal to any of the sequence that follows it, then that element need not be considered again. How might that look?

I'll start with a different seq where the first element does appear again:

(def seq0 [0 1 2 3 4 2 0])

We want to avoid indexing and loops whenever possible in Clojure, as these are conceptually low-level, and can be a source of errors. Instead, we want to look in clojure.core for applicable functions whenever possible.

As with any Lisp, the 2 main functions to consider for processing seqs are map and reduce. The map operation will do something for every element in a seq, returning a new seq, while reduce will derive a single value from processing each element in a seq. In this case, we're looking for a repeated value:

(reduce (fn [result element] (or result (= (first a-seq) element))) nil (rest a-seq))
true

OK, this found the value, but which value was it?

(reduce (fn [result element] (or result (and (= (first a-seq) element) element))) (rest a-seq))
0

This could then be applied to each subsequence, of the seq, but... wow, is it clunky. The reducing function is doing a lot here, comparing if the result is true yet, and if not, then comparing the provided first element to the element being presented by reduce, returning that element if needed.

We could clean it up a little by putting reduced over the return value so that it breaks as soon as the first match is found, and using a singleton set for the equality test:

(reduce (fn [result element] (or result (and (#{(first seq0)} element) (reduced element)))) (rest seq0))

But clojure.core already has a function that does all of this, called some. The documentation even tells us that we can use a set as an "equality" test when searching a seq:

(some #{(first a-seq)} (rest a-seq))

Step 2

But now we want to do this for every sub-seq, dropping off the head each time. One way might be to use drop to remove the start of the seq, counting up:

(map #(drop % a-seq) (range (count a-seq)))
((0 1 2 3 4 2 6) (1 2 3 4 2 6) (2 3 4 2 6) (3 4 2 6) (4 2 6) (2 6) (6))

But counting the sequence, or even using map-indexed (which does the counting for you) is still getting very involved in the mechanics of processing a sequence. Is there a higher-level construct?

One mechanism is using the rest operation over and over. This might be done using reduce in some way, but fortunately the iterate function does exactly this. Except it returns an infinite series, so we just need to take the correct amount. Let's go past the end by 5 to see why we need to keep it constrained:

(take (+ 5 (count a-seq)) (iterate rest a-seq))
([0 1 2 3 4 2 6] (1 2 3 4 2 6) (2 3 4 2 6) (3 4 2 6) (4 2 6) (2 6) (6) () () () () ())

This looks good, but it's still counting the length of the seq, so we want to keep looking for a better approach.

The map function is more flexible than simply applying an operation to transform each element of a seq. It can also be used to apply an operation to a group of elements across multiple seqs. The best part about this is that it will terminate as soon as the first of the seqs finishes. So let's try using map between the original seq and the iteration. We can just return the 2 values as pairs to see what we are working with:

(map (fn [a b] [a b]) a-seq (iterate rest a-seq))

This is almost exactly what we want as the arguments for the some operation above. Each operation of the map could execute rest on its second argument, but if we start the iteration one step along, then can get the same effect without needing the extra rest every time:

(map (fn [a b] [a b]) a-seq (rest (iterate rest a-seq)))
([0 (1 2 3 4 2 6)] [1 (2 3 4 2 6)] [2 (3 4 2 6)] [3 (4 2 6)] [4 (2 6)] [2 (6)] [6 ()])

These pairs are exactly what we needed for the some operation, so let's look at that:

(map (fn [a b] (some #{a} b)) a-seq (rest (iterate rest a-seq)))
(nil nil 2 nil nil nil nil)

Step 3

This is almost done. We need to skip to the first non-nil value, which we already know we can do using some. However, some needs a function to apply whereas we don't need to do anything to the value, just use it as-is. So we can use identity here:

(some identity (map (fn [a b] (some #{a} b)) a-seq (iterate rest (rest a-seq))))
2

To look more Clojure-ey we would usually run steps like this through a threading macro, and the anonymous function can be abbreviated as well. Of course, we want this to be a function, so we can replace the use of the a-seq value with a function argument, which I'll just call s for "seq":

(defn any=
 [s]
 (->> s
      rest
      (iterate rest)
      (map #(some #{%1} %2) s)
      (some identity)))

Step 4

After all of the above, I was thinking I had something relatively nice, and maybe I should write about the process of getting to it. Hence this post. I actually included extra steps along the way, just to spell things out, though I didn't actually go through most of those steps.

While thinking of extra steps and alternatives that I could write about, I thought about comparing to Clojure's distinct function which can remove duplicates:

(distinct a-seq)

This is similar to what I needed, but instead of telling me when it found a duplicate, it skips the duplicates instead. A similarly written function would also work, but this function is a transducer, and these are probably more work than should be necessary for something relatively simple.

I'm not saying that a transducer would be difficult. They follow a straightforward template:

(defn duplicates
 ([]
  (fn [rf]
    (let [seen (volatile! #{})]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
         (if (contains? @seen input)
           (rf result input)
           (do (vswap! seen conj input)
               result)))))))
 ([coll] (sequence (duplicates) coll)))

This is actually a little more flexible since it can find all of the duplicates, rather than just the first one:

(duplicates [1 2 3 4 2 4 5 1])
(2 4 1)

The transducer approach also has the benefit of only processing the seq s single time, and using memory to recall what it has seen. So instead of O(n^2) complexity, it becomes O(n.log(n)) complexity (since the hashmap insertions and lookups are logarithmic - though in Clojure they are effectively O(1)).

However, all I needed was a function that could report the first duplicate in a seq. Maybe there was some way to use distinct for that.

Let's try using map again, this time against the original seq and it's distinct version. We'll emit pairs to see what is happening:

(map (fn [a b] [a b]) a-seq (distinct a-seq))
([1 1] [2 2] [3 3] [4 4] [2 5])

It terminated at the end of the distinct version, but we can see that just at the end the two sequences differed. This will happen if there is a duplicate in the first sequence that didn't appear in the second. So we can output that one when we get to it:

(map (fn [a b] (when (not= a b) a)) a-seq (distinct a-seq))
(nil nil nil nil nil 2)

This will keep emitting values after the first duplicate is found, since the two sequences are now misaligned. Consequently we only want to return the first non-nil value from the seq:

(some identity (map (fn [a b] (when (not= a b) a)) a-seq (distinct a-seq)))
2

Giving a new version for finding the first duplicate:

(defn any=
 [s]
 (->> s
      distinct
      (map #(when (not= %1 %2) %1) s)
      (some identity)))

Wrap Up

This is a little shorter, has better complexity, and makes better use of clojure.core. I'd probably use this one instead of my first attempt.

This post also reminded me to check my assumptions. This is not the first time I've tried to explain why one approach was not as effective as another, only to show that it was better. I need to remember that.

Deploy with ease. Manage efficiently. Scale faster.

Leave the infrastructure headaches to us, while you focus on pushing boundaries, realizing your vision, and making a lasting impression on your users.

Get Started

Top comments (1)

Ray Miller • Mar 12

Thanks! I enjoyed your article and it got me thinking about how I might implement this. Recursion can be a very useful tool and it's the one I'd reach for here:

(defn any= [[head & tail]] (when tail (or (some #{head} tail) (recur tail))))

DEV Community

Repetition

Step 1

Step 2

Step 3

Step 4

Wrap Up

Deploy with ease. Manage efficiently. Scale faster.

Top comments (1)

A Workflow Copilot. Tailored to You.

Introducing RTABench

Benchmarking Databases for Real-Time Analytics Applications

James Blackwood-Sewell for Timescale ・ Mar 26