
re: Find Duplicates in an Array


This was a very interesting read indeed!

I wanted to do my own flavor of your experiment by throwing in a couple of extra tests and then running it 1,000 times to get some data.

The extra tests I added were using reduce instead of a for loop, using objects instead of Maps, and finally caching the for loop values (current index, array length).

In the end, the best option was to use a for loop with cached values, and an object instead of a Map.

Edit:

Comment updated and code removed because it had some basic errors. See the comments below for the updated code and results.

 

Here's another version of findMapForReduce as a one-liner that's still very readable:

const findMapForReduce = (data) => data.reduce(
  ([ seen, repeated ], element) => seen.has(element)
      ? [ seen, repeated.set(element, true) ]
      : [ seen.set(element, true), repeated ], 
  [ new Map(), new Map() ]
)[1].keys()
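One thing worth pointing out (my addition): because the one-liner ends with .keys(), it returns a Map iterator rather than an array, so spread it if you want an array back, e.g.

[...findMapForReduce([1, 2, 1, 2, 3, 1])] // [1, 2], each duplicate reported once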

 

Kudos for posting your code so that others can see what you did, and critique it :-). I would have liked a copy of the workbench.js used in the original post.

Anyway, you should really output the results to an array and check they all give the same (correct) answer.

For example, your findObjectFor, findObjectForCached, and findObjectReduce have a simple bug in them that means they give the wrong answer and artificially makes them appear faster than they really are (the seen[item] test never passes).

e.g.

> findObjectForCached([1, 2, 1, 2])
[ 1, 2, 1, 2 ]

You'll need to be a little careful fixing these, as some ways of fixing it would not work correctly if the sample data includes the number 0.
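To illustrate that 0 pitfall (my own sketch; the original buggy code was removed from the comment above): a plain truthiness check on the lookup object silently misses 0 as a duplicate.

const seen = {};
const dupes = [];
for (const item of [0, 1, 0]) {
  if (!seen[item]) {      // after the first 0, seen[0] is 0, which is falsy,
    seen[item] = item;    // so the second 0 lands here again...
  } else {
    dupes.push(item);     // ...and never reaches this branch
  }
}
// dupes is [] here, even though 0 is repeated.
// Checking key presence instead works for any value:
// `if (!(item in seen))` or `if (!seen.hasOwnProperty(item))`,
// or store `true` rather than the value itself.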

Most of the functions don't behave as I would expect if something is repeated more than once, although the code in the original article has the same problem, e.g.

> findArrayReduce([1, 2, 1, 2, 3, 1])
[ 1, 2, 1 ]

In a few places, there would be equivalent versions where the Maps could be replaced with Sets. The performance should be more or less the same, but I think the code would reflect intent better.
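As a rough sketch of what that could look like (my own variant, mirroring findMapFor but with a Set, since only membership matters and the stored values are never read):

function findSetFor(sampleData) {
  const seen = new Set();
  const dupes = [];

  for (let i = 0; i < sampleData.length; i++) {
    const item = sampleData[i];
    if (seen.has(item)) {
      dupes.push(item);   // already seen once, so it's a duplicate
    } else {
      seen.add(item);     // first sighting
    }
  }

  return dupes;
}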

In terms of approaches rather than micro-optimisations, here are a couple that haven't been mentioned. The first sorts and mutates the passed array in place, so it creates very little garbage:

function mutatingLowGarbage(sampleData) {
  // the default sort compares as strings, but equal values still end up
  // adjacent, which is all this algorithm needs
  sampleData.sort()
  let ptr = 0
  for (let i = 0, len = sampleData.length; i < len; ++i) {
    // the last element of each run of equal values marks one duplicate
    if (sampleData[i] === sampleData[i - 1] && sampleData[i] !== sampleData[i + 1]) {
      sampleData[ptr++] = sampleData[i]
    }
  }
  // truncate in place so the duplicates are returned in the same array
  sampleData.length = ptr
  return sampleData
}
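For example (my own trace of the function above), mutatingLowGarbage([1, 2, 1, 2, 3, 1]) returns [1, 2]: each duplicated value is reported exactly once, however many times it repeats, at the cost of reordering and truncating the caller's array.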

and here's one that uses sparse arrays and creates a count.

function findDupByCount(sampleData) {
  const counts = [], seen = []
  for (let i = 0, len = sampleData.length; i < len; i++) {
    const value = sampleData[i]
    if (counts[value] === undefined) {
      seen.push(value)
      counts[value] = 0
    }
    counts[value]++
  }
  // could use Object.keys(counts) here instead of keeping track of seen,
  // but that makes everything into strings!
  return seen.filter(x => counts[x] > 1)
}
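Tracing that loop on a small input (my own check): findDupByCount([1, 2, 1, 2, 3, 1]) builds counts of 3, 2 and 1 for the values 1, 2 and 3, and returns [1, 2], so values repeated more than once are still reported only once.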

Neither approach has particularly amazing performance for the scenario posted, but it's interesting to consider the different ways you can solve the problem.

 

Hey, thanks for taking the time to review and comment on the code!

I found and fixed my silly errors (I definitely did not have enough coffee that morning; they were so obvious) and updated the stats in my original comment.

The slowest now is findObjectReduce, and the fastest one is findMapForCached.

Here is the new chart: live.amcharts.com/2MjY2/

And here is the updated code:

function findObjectFor(sampleData) {
  const seen = {};
  const dupes = [];

  for (let i = 0; i < sampleData.length; i++) {
    const item = sampleData[i];

    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findObjectForCached(sampleData) {
  const seen = {};
  const dupes = [];
  const length = sampleData.length;
  let i = 0;

  for (; i < length; i++) {
    const item = sampleData[i];
    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findObjectReduce(sampleData) {
  const seen = {};

  return sampleData.reduce((dupes, item) => {
    if (!seen.hasOwnProperty(item)) {
      seen[item] = item;
    } else {
      dupes.push(item);
    }

    return dupes;
  }, []);
}

function findMapFor(sampleData) {
  const seen = new Map();
  const dupes = [];

  for (let i = 0; i < sampleData.length; i++) {
    const item = sampleData[i];
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findMapForCached(sampleData) {
  const seen = new Map();
  const dupes = [];
  const length = sampleData.length;

  let i = 0;

  for (; i < length; i++) {
    const item = sampleData[i];
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }
  }

  return dupes;
}

function findMapForReduce(sampleData) {
  const seen = new Map();

  return sampleData.reduce((dupes, item) => {
    if (!seen.has(item)) {
      seen.set(item, item);
    } else {
      dupes.push(item);
    }

    return dupes;
  }, []);
}

for (let r = 0; r < 1000; r++) {
  console.log(`run: ${r}`);

  let sampleData = [];
  // 50k array of random numbers
  for (let i = 0; i < 50000; i++) {
    sampleData[i] = Math.floor(Math.random() * 50000 + 1);
  }

  console.time("findObjectFor");
  findObjectFor(sampleData);
  console.timeEnd("findObjectFor");

  console.time("findObjectForCached");
  findObjectForCached(sampleData);
  console.timeEnd("findObjectForCached");

  console.time("findObjectReduce");
  findObjectReduce(sampleData);
  console.timeEnd("findObjectReduce");

  console.time("findMapFor");
  findMapFor(sampleData);
  console.timeEnd("findMapFor");

  console.time("findMapForCached");
  findMapForCached(sampleData);
  console.timeEnd("findMapForCached");

  console.time("findMapForReduce");
  findMapForReduce(sampleData);
  console.timeEnd("findMapForReduce");

  console.log();
}
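Per the suggestion above about checking that all the implementations agree, here's a small sanity check that could run before the timing loop (my own sketch, not part of the benchmark as I ran it):

const reference = JSON.stringify(findObjectFor([1, 2, 1, 2, 3]));

[findObjectForCached, findObjectReduce, findMapFor, findMapForCached, findMapForReduce]
  .forEach((fn) => {
    const result = JSON.stringify(fn([1, 2, 1, 2, 3]));
    // every implementation should report the same duplicates: [1, 2]
    console.assert(result === reference, `${fn.name} disagrees: ${result} vs ${reference}`);
  });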
 

Awesome, I liked how you implemented the cached version!

 

Thanks!

Also, I forgot to mention: I removed the first results (findArrayReduce) because those were consistently above 600 ms and they were basically flattening all the other results. I know I could use a different axis (like a logarithmic one), but I haven't had enough coffee.
