
Find Duplicates in an Array

Hector Minaya ・4 min read

The Problem 🤔?

Write a function that will take in an array of integers and will return all duplicate elements.

Sample Dataset

let sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

Expected return

[ 32, 3, 11 ]

Approach #1 - Brute force

Let's create an Array to hold repeated elements.

    let repeatedElements = [];

Next we will loop over the array.

    // This is also known as O(n) in Big-O notation since
    // we have to iterate over all of the items in the array
    for(let i = 0; i < sampleData.length; i++) {

    }

Inside the loop we will need to loop again and compare each integer to every other integer in the array to determine if it is a duplicate.

    for(let i = 0; i < sampleData.length; i++) {
        // Added for clarity, not needed since we can access
        // sampleData[i] directly in our next loop.
        let item = sampleData[i];

        // Now we need to loop over the array again to see if
        // we find the same item again.
        // 
        // Unfortunately this adds O(n^2) complexity 😢
        for (let ii = 0; ii < sampleData.length; ii++) {

            // Is it the same integer in a different position?
            if ( (item === sampleData[ii]) && (i !== ii) ) {

                // Add to our array so we can return.
                repeatedElements.push(item)
            }
        }
    }

Here is the full code 👇

let sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

function FindDuplicatesUsingBruteForce(sampleData) {

    let repeatedElements = [];

    for(let i = 0; i < sampleData.length; i++) {

        let item = sampleData[i];    
        for (let ii = 0; ii < sampleData.length; ii++) {
            if ( (item === sampleData[ii]) && (i !== ii) ) {
                repeatedElements.push(item)
            }
        }
    }

    return repeatedElements;
}

console.log(FindDuplicatesUsingBruteForce(sampleData));

// returns: [ 32, 11, 32, 3, 3, 11 ]
// It actually returns each duplicate more than once, but
// I'll ignore that for now.

Be honest, at some point we've all written similar code 🤷‍♂️. It gets us the result we're looking for, but it's the slowest approach and it uses the most resources 🤦‍♂️.

This is mostly due to the inner loop: it turns the algorithm into O(n^2). For the 50k-element array we'll benchmark later, that's roughly 2.5 billion comparisons.

If your dataset is small you won't notice the difference, but it will quickly slow down and 💣.

Don't use this approach 🛑.

Approach #2 - Using Arrays

Now let's try a slightly different approach: we'll avoid the inner loop by using an extra array, which may or may not make this more efficient.

This extra array will keep track of the items that we have already seen.

    let uniqueElements = [];
    let repeatedElements = [];

Next up is the same outer loop as in our first approach; we'll reuse it in every approach that follows.

    for(let i = 0; i < sampleData.length; i++) {

    }

Inside our loop we need to keep track of items we have already seen 👀.

    for(let i = 0; i < sampleData.length; i++) {

        // This is where it starts to get interesting. If
        // we have already seen this number we will add it
        // to our array of duplicated elements.
        //
        // What is the Big-O complexity of Array.includes?
        // I'll come back to this.
        if (uniqueElements.includes(sampleData[i])) {
            repeatedElements.push(sampleData[i]);
        }

    }

And add any items we haven't seen before 🔍.

    for(let i = 0; i < sampleData.length; i++) {

        if (uniqueElements.includes(sampleData[i])) {
            repeatedElements.push(sampleData[i]);
        } else {
            // Add to our unique elements to track items we have 
            // already seen
            uniqueElements.push(sampleData[i]);
        }

    }

Here is the full code 👇

let sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

function FindDuplicatesUsingArrays(sampleData) {

    let uniqueElements = [];
    let repeatedElements = [];

    for(let i = 0; i < sampleData.length; i++) {

        if (uniqueElements.includes(sampleData[i])) {
            repeatedElements.push(sampleData[i]);
        } else {
            uniqueElements.push(sampleData[i]);
        }

    }

    return repeatedElements;
}

console.log(FindDuplicatesUsingArrays(sampleData));

// returns: [ 32, 3, 11 ]

This seems more efficient than our previous approach, and it may be, but it all depends on the cost of uniqueElements.includes 🤔.

Why? We are relying on the JavaScript implementation of includes, which is a linear search over the items in an Array.

If we go back to how data structures work, we'll remember that an array is very efficient, O(1), when we look up an item by its key/position, but terribly inefficient, O(n), when we look up an item by its value, since we have to traverse the array until we find it 🤦‍♂️.
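Conceptually, a lookup by value has to do something like the sketch below. (This is just an illustration of the idea, not the actual engine implementation of includes.)

    // A rough sketch of what a linear search by value looks like.
    // Array.prototype.includes is implemented natively, but the
    // amount of work is the same: scan until we hit the value.
    function linearIncludes(arr, value) {
        for (let i = 0; i < arr.length; i++) {
            if (arr[i] === value) {
                return true;
            }
        }
        return false;
    }

So inside our loop we are still scanning an array on every iteration, just the (usually shorter) array of unique elements.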

Is it more efficient than our first approach? Yes, but there are better ways to do this.

Bonus: An Array in JavaScript is not an Array 🙃.

Approach #3 - Using a Map()

What else can we try? What data structure has an O(1) lookup? A HashTable 😎.

    // As with a lot of things in JavaScript a Map isn't exactly a 
    // HashTable, but it's close enough for this problem.
    let uniqueElements = new Map();
    let repeatedElements = [];

Instead of uniqueElements.includes we will use the uniqueElements.has method of our Map.

    for(let i = 0; i < sampleData.length; i++) {

        // Since a HashTable lookup is O(1) we have greatly improved
        // our performance by just using a different data structure!!!
        if (uniqueElements.has(sampleData[i])) {
            repeatedElements.push(sampleData[i]);
        } else {
            uniqueElements.set(sampleData[i], sampleData[i]);
        }

    }

Here is the full code 👇

let sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

function FindDuplicatesUsingMap(sampleData) {

    let uniqueElements = new Map();
    let repeatedElements = [];

    for(let i = 0; i < sampleData.length; i++) {

        if (uniqueElements.has(sampleData[i])) {
            repeatedElements.push(sampleData[i]);
        } else {
            uniqueElements.set(sampleData[i], sampleData[i]);
        }

    }

    return repeatedElements;
}

console.log(FindDuplicatesUsingMap(sampleData));

// returns: [ 32, 3, 11 ]

So, how fast is this approach? Let's try and compare 👇

let sampleData = [];

// 50k array of random numbers
for (let i = 0; i < 50000; i++) {
    sampleData[i] = Math.floor((Math.random() * 50000) + 1);
}

/*
 Add here the 3 functions we just looked at
 */

// Let's run them all on the same array and time it.

console.time("FindDuplicatesUsingBruteForce");
FindDuplicatesUsingBruteForce(sampleData);
console.timeEnd("FindDuplicatesUsingBruteForce");

console.time("FindDuplicatesUsingArrays");
FindDuplicatesUsingArrays(sampleData);
console.timeEnd("FindDuplicatesUsingArrays");

console.time("FindDuplicatesUsingMap");
FindDuplicatesUsingMap(sampleData);
console.timeEnd("FindDuplicatesUsingMap");

Results 👇

[Benchmark results chart: the Map-based approach is the fastest of the three, brute force the slowest]

Edit: There are dozens of different solutions to this problem, some more efficient in terms of space or time than the ones outlined here. If you'd like to share one please go ahead in the comments 👇

Discussion

Scott Yeatts

While I know it's not as performant as the Map solution, would love to have seen a solution using the built-in array iterators:

const sampleData = [54,32,5,11,35,32,17,3,3,22,4,1,6,11];

const findDuplicatesUsingIterators = (sampleArr) => {
    return sampleArr.reduce((acc, curr) => {
        const duplicatesCheck = sampleArr.filter(
            (value) => curr === value
        );
        return duplicatesCheck.length > 1
            ? [...acc, curr]
            : acc;
    }, []);
};
console.log(findDuplicatesUsingIterators(sampleData));

Please excuse any typos/readability/bad code here... not a fully fleshed-out solution and I haven't run it, just wrote it real-time thinking about how to avoid explicit for loops (but there are 2 implicit loops), and I feel like it will probably be closest to, if not longer than brute force. It's a 'code golf'-ish solution that might look pretty, but actually be very inefficient.

These are the kinds of things I see (and write) a lot for readability on a known-length array, but once we get into potentially large (i.e. tens of thousands+) arrays, the impact on performance isn't well known (or at least analyzed) by the developer in a lot of cases.

Around 50k elements you will see reduce and filter (which are sometimes MORE performant than a hand-written for loop in smaller arrays) start to take longer once the function overhead takes its toll. github.com/dg92/Performance-Analys... is over two years old, and I haven't tested it myself in quite some time, but it matches my experience with it.

I LOVE these types of functions and use them all the time, but it's important to use the right tool from the toolbox.

Thanks for the article. I enjoy a good big-O analysis!

Hector Minaya (Author)

Yeap, for this first set of approaches I purposely stayed away as much as possible from the JavaScript built-in iterators to try and get a raw baseline.

I was planning on a second set of tests using more JS native functions. Thanks for the code!

Erick Ruiz de Chavez

This was a very interesting read indeed!

I wanted to do my own flavor of your experiment by throwing in a couple of extra tests and then running it 1000 times to get some data.

The extra tests I added were using reduce instead of a for loop, using objects instead of Maps, and finally caching the for loop values (current index, array length).

In the end, the best option was to use a for loop with cached values, and an object instead of a Map.

Edit:

Comment updated and code removed because it had some basic errors. See the comments below for the updated code and results.

Jason Kleinberg

Here's another version of findMapForReduce as a one-liner that's still very readable:

const findMapForReduce = (data) => data.reduce(
  ([ seen, repeated ], element) => seen.has(element)
      ? [ seen, repeated.set(element, true) ]
      : [ seen.set(element, true), repeated ], 
  [ new Map(), new Map() ]
)[1].keys()
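Since it returns the Map's keys iterator rather than an array, you'd spread it to materialize the result; with the article's sampleData that gives:

console.log([...findMapForReduce(sampleData)]); // [ 32, 3, 11 ]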

kybernetikos

Kudos for posting your code so that others can see what you did. And critique it :-). I would have liked a copy of the workbench.js used in the original post.

Anyway, you should really output the results to an array and check they all give the same (correct) answer.

For example, your findObjectFor, findObjectForCached and findObjectReduce have a simple bug in them that means they give the wrong answer and artificially makes them appear faster than they really are (the seen[item] test never passes).

e.g.

> findObjectForCached([1, 2, 1, 2])
[ 1, 2, 1, 2 ]

You'll need to be a little careful fixing these, as some ways of fixing it would not work correctly if the sample data includes the number 0.

Most of the functions don't behave as I would expect if something is repeated more than once, although the code in the original article has the same problem e.g.

> findArrayReduce([1, 2, 1, 2, 3, 1])
[ 1, 2, 1 ]

In a few places, there would be equivalent versions where the Maps could be replaced with Sets. The performance should be more or less the same, but I think the code would reflect intent better.
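For example, the article's Map version could look something like this with a Set (just a sketch to show the intent, not benchmarked):

// Same logic as FindDuplicatesUsingMap, but a Set makes the
// "have we seen this value before?" intent explicit.
function findDuplicatesUsingSet(sampleData) {
  const seen = new Set()
  const repeated = []
  for (let i = 0, len = sampleData.length; i < len; i++) {
    const item = sampleData[i]
    if (seen.has(item)) {
      repeated.push(item)
    } else {
      seen.add(item)
    }
  }
  return repeated
}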

In terms of approaches rather than micro-optimisations, I have some approaches that haven't been mentioned - one that sorts and mutates the passed array and so creates very little garbage

function mutatingLowGarbage(sampleData) {
  sampleData.sort()
  let ptr = 0
  for (let i = 0, len = sampleData.length; i < len; ++i) {
    if (sampleData[i] === sampleData[i - 1] && sampleData[i] !== sampleData[i + 1]) {
      sampleData[ptr++] = sampleData[i]
    }
  }
  sampleData.length = ptr
  return sampleData
}

and here's one that uses sparse arrays and creates a count.

function findDupByCount(sampleData) {
  const counts = [], seen = []
  for (let i = 0, len = sampleData.length; i < len; i++) {
      const value = sampleData[i]
      if (counts[value] === undefined) {
        seen.push(value)
        counts[value] = 0
      }
      counts[value]++
  }
  // could use Object.keys(counts) here instead of keeping track of seen,
  // but that makes everything into strings!
  return seen.filter(x => counts[x] > 1)
}

Neither approach has particularly amazing performance for the scenario posted, but it's interesting to consider the different ways you can solve the problem.

Erick Ruiz de Chavez

Hey, thanks for taking the time to review and comment the code!

I found and fixed my silly errors (I def. did not have enough coffee that morning, they were so obvious) and updated the stats in my original comment.

The slowest now is findObjectReduce, and the fastest one is findMapForCached.

Here is the new chart: live.amcharts.com/2MjY2/

And here is the updated code:

function findObjectFor(sampleData) {
  const seen = {};
  const dupes = [];

  for (let i = 0; i < sampleData.length; i++) {
    const item = sampleData[i];

    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findObjectForCached(sampleData) {
  const seen = {};
  const dupes = [];
  const length = sampleData.length;
  let i = 0;

  for (; i < length; i++) {
    const item = sampleData[i];
    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findObjectReduce(sampleData) {
  const seen = {};

  return sampleData.reduce((dupes, item) => {
    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }

    return dupes;
  }, []);
}

function findMapFor(sampleData) {
  const seen = new Map();
  const dupes = [];

  for (let i = 0; i < sampleData.length; i++) {
    const item = sampleData[i];
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findMapForCached(sampleData) {
  const seen = new Map();
  const dupes = [];
  const length = sampleData.length;

  let i = 0;

  for (; i < length; i++) {
    const item = sampleData[i];
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findMapForReduce(sampleData) {
  const seen = new Map();

  return sampleData.reduce((dupes, item) => {
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }

    return dupes;
  }, []);
}

for (let r = 0; r < 1000; r++) {
  console.log(`run: ${r}`);

  let sampleData = [];
  // 50k array of random numbers
  for (let i = 0; i < 50000; i++) {
    sampleData[i] = Math.floor(Math.random() * 50000 + 1);
  }

  console.time("findObjectFor");
  findObjectFor(sampleData);
  console.timeEnd("findObjectFor");

  console.time("findObjectForCached");
  findObjectForCached(sampleData);
  console.timeEnd("findObjectForCached");

  console.time("findObjectReduce");
  findObjectReduce(sampleData);
  console.timeEnd("findObjectReduce");

  console.time("findMapFor");
  findMapFor(sampleData);
  console.timeEnd("findMapFor");

  console.time("findMapForCached");
  findMapForCached(sampleData);
  console.timeEnd("findMapForCached");

  console.time("findMapForReduce");
  findMapForReduce(sampleData);
  console.timeEnd("findMapForReduce");

  console.log();
}
Hector Minaya (Author)

Awesome, I liked how you implemented the Cached version!

Erick Ruiz de Chavez

Thanks!

Also, forgot to mention, I removed the first results (findArrayReduce) because those were consistently above 600ms and it was basically flattening all other results. I know I could use a different axis (like a logarithmic one) but haven't had enough coffee.

Jon Randy

Shorter version of approach #2, using reduce:

const findDupes = a=>a.reduce(([u,d],x)=>u.includes(x)?[u,[...d,x]]:[[...u,x],d],[[],[]])[1]

Or a vastly faster Set based version:

const findDupes = a=>[...a.reduce(([u,d],x)=>u.has(x)?[u,d.add(x)]:[u.add(x),d],[new Set(),new Set()])[1]]
Tony B
const dupes = (a, h = []) => a.filter(x => (h[x] = (h[x] || 0) + 1) === 2);
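With the article's sample data this gives the expected result:

dupes([54,32,5,11,35,32,17,3,3,22,4,1,6,11]); // [ 32, 3, 11 ]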
Vitaliy

You don't need Map, there's Set.

Rémy 🤖

Sets are one of my favorite things. So fantastic to use in Python and so crippled in JS :'(

Hector Minaya (Author)

I'm doing a follow-up with sets, custom hashes, array reduction...

Eury

Great article!! Well explained.

Patrick Elsen

This is not needed:

uniqueElements.set(sampleData[i], sampleData[i]);

You can just do

uniqueElements.set(sampleData[i], true);

(since you're using the hash map as a set)
