Find Duplicates in an Array

Hector Minaya on September 11, 2020

The Problem 🤔? Write a function that takes in an array of integers and returns all duplicate elements. Sample Dataset ...
Scott Yeatts

While I know it's not as performant as the Map solution, I would love to have seen a solution using the built-in array iterators:

const sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

const findDuplicatesUsingIterators = (sampleArr) => {
    return sampleArr.reduce((acc, curr) => {
        const duplicatesCheck = sampleArr.filter(
            (value) => curr === value
        );
        return duplicatesCheck.length > 1
            ? [...acc, curr]
            : acc;
    }, []);
};
console.log(findDuplicatesUsingIterators(sampleData));

Please excuse any typos/readability/bad code here... it's not a fully fleshed-out solution and I haven't run it; I just wrote it in real time while thinking about how to avoid explicit for loops (though there are two implicit loops), and I suspect it will end up closest to, if not slower than, brute force. It's a 'code golf'-ish solution that might look pretty but actually be very inefficient.

These are the kinds of things I see (and write) a lot for readability on a known-length array, but once we get into potentially large arrays (i.e., tens of thousands of elements or more), the performance impact often isn't well understood (or at least analyzed) by the developer.

Around 50k elements you will see reduce and filter (which are sometimes MORE performant than a hand-written for loop on smaller arrays) start to take longer as the function-call overhead takes its toll. github.com/dg92/Performance-Analys... is over two years old, and I haven't tested it myself in quite some time, but it matches my experience.
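
If anyone wants to eyeball that crossover themselves, here's a minimal sketch (not a rigorous benchmark; a single console.time run is noisy and engine warm-up skews it, so treat the numbers as a rough sanity check):

const big = Array.from({ length: 50000 }, () => Math.floor(Math.random() * 1000));

console.time('filter');
const evensViaFilter = big.filter((x) => x % 2 === 0);
console.timeEnd('filter');

console.time('for loop');
const evensViaLoop = [];
for (let i = 0; i < big.length; i++) {
    if (big[i] % 2 === 0) evensViaLoop.push(big[i]);
}
console.timeEnd('for loop');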

I LOVE these types of functions and use them all the time, but it's important to use the right tool from the toolbox.

Thanks for the article. I enjoy a good big-O analysis!

Hector Minaya

Yeap, for this first set of approaches I purposely stayed away as much as possible from the JavaScript built-in iterators, to try to get a raw baseline.

I was planning on a second set of tests using more JS native functions. Thanks for the code!

Jason Kleinberg

Here's another version of findMapForReduce as a one-liner that's still very readable:

const findMapForReduce = (data) => data.reduce(
  ([ seen, repeated ], element) => seen.has(element)
      ? [ seen, repeated.set(element, true) ]
      : [ seen.set(element, true), repeated ], 
  [ new Map(), new Map() ]
)[1].keys()
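
One small note: since it ends with .keys(), this returns a Map iterator rather than an array, so you'd spread it to get the values out, e.g. with the sample dataset above:

console.log([...findMapForReduce([54, 32, 5, 11, 35, 32, 17, 3, 3, 22, 4, 1, 6, 11])]); // [32, 3, 11]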

Jon Randy 🎖️ • Edited

Shorter version of #2, using reduce:

const findDupes = a=>a.reduce(([u,d],x)=>u.includes(x)?[u,[...d,x]]:[[...u,x],d],[[],[]])[1]

Or a vastly faster Set-based version:

const findDupes = a=>[...a.reduce(([u,d],x)=>u.has(x)?[u,d.add(x)]:[u.add(x),d],[new Set(),new Set()])[1]]
Vitaliy

You don't need Map, there's Set.
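
For example, a minimal sketch (the function name here is mine, not from the article):

const findDuplicatesWithSet = (sampleData) => {
  const seen = new Set();
  const duplicates = new Set();
  for (const value of sampleData) {
    // anything already seen is a duplicate; the Set also de-dupes the output
    if (seen.has(value)) {
      duplicates.add(value);
    } else {
      seen.add(value);
    }
  }
  return [...duplicates];
};

console.log(findDuplicatesWithSet([54, 32, 5, 11, 35, 32, 17, 3, 3, 22, 4, 1, 6, 11])); // [32, 3, 11]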

Rémy 🤖

Sets are one of my favorite things. So fantastic to use in Python and so crippled in JS :'(

Hector Minaya

I'm doing a follow-up with sets, custom hashes, array reduction...

kybernetikos • Edited

Kudos for posting your code so that others can see what you did. And critique it :-). I would have liked a copy of the workbench.js used in the original post.

Anyway, you should really output the results to an array and check they all give the same (correct) answer.

For example, your findObjectFor, findObjectForCached and findObjectReduce have a simple bug that means they give the wrong answer and makes them appear artificially faster than they really are (the seen[item] test never passes).

e.g.

> findObjectForCached([1, 2, 1, 2])
[ 1, 2, 1, 2 ]

You'll need to be a little careful fixing these, as some ways of fixing it would not work correctly if the sample data includes the number 0.
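
Since the workbench code isn't included here, this is only a hypothetical sketch of what a fixed object-based version might look like; the key point is testing for presence explicitly rather than relying on truthiness:

function findObjectForFixed(sampleData) {
  const seen = {}
  const duplicates = []
  for (let i = 0; i < sampleData.length; i++) {
    const value = sampleData[i]
    // explicit presence check: `if (seen[value])` would break when the stored value is 0 (or any other falsy value)
    if (seen[value] !== undefined) {
      duplicates.push(value)
    } else {
      seen[value] = value
    }
  }
  return duplicates
}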

Most of the functions don't behave as I would expect if something is repeated more than once, although the code in the original article has the same problem, e.g.

> findArrayReduce([1, 2, 1, 2, 3, 1])
[ 1, 2, 1 ]

In a few places, there would be equivalent versions where the Maps could be replaced with Sets. The performance should be more or less the same, but I think the code would reflect intent better.
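
For instance, the reduce one-liner above keeps the same shape with Sets (just a sketch, essentially an un-golfed version of the Set-based reducer posted earlier; the name is mine):

const findSetForReduce = (data) => [...data.reduce(
  ([seen, repeated], element) => seen.has(element)
      ? [seen, repeated.add(element)]
      : [seen.add(element), repeated],
  [new Set(), new Set()]
)[1]]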

In terms of approaches rather than micro-optimisations, here are a couple that haven't been mentioned. One sorts and mutates the passed array, and so creates very little garbage:

function mutatingLowGarbage(sampleData) {
  // the default sort compares numbers as strings, but equal values still end up
  // adjacent, which is all this algorithm needs
  sampleData.sort()
  let ptr = 0
  for (let i = 0, len = sampleData.length; i < len; ++i) {
    // keep only the last element of each run of identical values
    if (sampleData[i] === sampleData[i - 1] && sampleData[i] !== sampleData[i + 1]) {
      sampleData[ptr++] = sampleData[i]
    }
  }
  // truncate in place so the array now holds just the duplicates
  sampleData.length = ptr
  return sampleData
}

And here's one that uses a sparse array to keep counts:

function findDupByCount(sampleData) {
  const counts = [], seen = []
  // single pass: count occurrences of each value
  for (let i = 0, len = sampleData.length; i < len; i++) {
      const value = sampleData[i]
      if (counts[value] === undefined) {
        seen.push(value)
        counts[value] = 0
      }
      counts[value]++
  }
  // could use Object.keys(counts) here instead of keeping track of seen,
  // but that makes everything into strings!
  return seen.filter(x => counts[x] > 1)
}
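
And a quick spot check against the sample dataset above (mutatingLowGarbage sorts and truncates whatever you pass it, hence the copy):

const sampleData = [54, 32, 5, 11, 35, 32, 17, 3, 3, 22, 4, 1, 6, 11]
console.log(findDupByCount(sampleData))          // [ 32, 11, 3 ]
console.log(mutatingLowGarbage([...sampleData])) // [ 11, 3, 32 ] (default sort order)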

Neither approach has particularly amazing performance for the scenario posted, but it's interesting to consider the different ways you can solve the problem.

Patrick Elsen

This is not needed:

uniqueElements.set(sampleData[i], sampleData[i]);

You can just do

uniqueElements.set(sampleData[i], true);

(since you're using the hash map as a set)

Hector Minaya

correct.

Eury Vallejo

Great article!! Well explained.

Hector Minaya

Awesome, I liked how you implemented the Cached version!