Jameer Khan

Posted on • Originally published at stackblogger.com

Finding Duplicate Documents in MongoDB Collection

The original article is published here.

Looking to find duplicate documents in a MongoDB collection? This article will help you find duplicate records in a MongoDB collection by a specific field.

MongoDB is a document-oriented database that stores JSON-like documents and is designed to handle large volumes of data without compromising performance.

Find Duplicate Documents In MongoDB Collection
Sometimes you want to remove the duplicate documents from a collection but don’t know how to find them. Here we will see how to find duplicate records in MongoDB. Once you have found the duplicates, you can easily remove them from the collection.

For this example, I am using the GitHub public issues API. I fetched the issues from the GitHub API and inserted them into the database, making the same requests multiple times to create duplicates.

I will use the aggregate method along with the $group and $match pipeline stages to find duplicates. Let’s implement the query step by step.

MongoDB Group Records by Field

The first step towards implementing the duplicate search is grouping the records.

// Group the documents in the issues collection by their issue_number value
db.issues.aggregate([
    {
        $group: {
            _id: {IssueNumber: "$issue_number"}
        }
    }
]);

The code above groups the documents in the issues collection by the issue_number field and returns one result per distinct issue number.

/* 1 */
{
    "_id" : {
        "IssueNumber" : 2778
    }
}

/* 2 */
{
    "_id" : {
        "IssueNumber" : 80
    }
}

/* 3 */
{
    "_id" : {
        "IssueNumber" : 2592
    }
}

/* 4 */
{
    "_id" : {
        "IssueNumber" : 2596
    }
}

/* 5 */
{
    "_id" : {
        "IssueNumber" : 2599
    }
}

/* 6 */
{
    "_id" : {
        "IssueNumber" : 98
    }
}

/* 7 */
{
    "_id" : {
        "IssueNumber" : 2602
    }
}

/* 8 */
{
    "_id" : {
        "IssueNumber" : 99
    }
}

/* 9 */
{
    "_id" : {
        "IssueNumber" : 2607
    }
}

/* 10 */
{
    "_id" : {
        "IssueNumber" : 100
    }
}

MongoDB Get Duplicate Documents

It’s time to implement the complete query and get all the groups that contain more than one document in the MongoDB collection.

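db.issues.aggregate([
    {
        // Group by issue_number, collecting the ids and counting the documents
        $group: {
            _id: {IssueNumber: "$issue_number"},
            uniqueIds: {$addToSet: "$_id"},
            count: {$sum: 1}
        }
    },
    {
        // Keep only the groups that contain more than one document
        $match: {
            count: {"$gt": 1}
        }
    }
]);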

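$addToSet adds a value to an array only if the value is not already present; if it is already present, it does nothing. Here it collects the _id of every document in the group.

$sum calculates and returns the sum of numeric values. With {$sum: 1}, it adds 1 for each document, so count holds the number of documents in each group.

$match filters the grouped results by a condition. $gt checks whether count is greater than the given number (here, 1).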
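
To make the roles of these operators more concrete, here is a rough plain-JavaScript emulation of the grouping, counting and filtering steps. It is only a sketch for illustration, not how MongoDB executes the pipeline; the findDuplicates function and the sample documents are made up for this example.

```javascript
// Illustration only: emulates $group ($addToSet, $sum) and $match in memory.
// findDuplicates and sampleDocs are hypothetical names, not part of MongoDB.
function findDuplicates(docs, field) {
    const groups = new Map();
    for (const doc of docs) {
        const key = doc[field];
        if (!groups.has(key)) {
            groups.set(key, { _id: { [field]: key }, uniqueIds: new Set(), count: 0 });
        }
        const group = groups.get(key);
        group.uniqueIds.add(doc._id); // like $addToSet: added only if not present
        group.count += 1;             // like {$sum: 1}: adds 1 per document
    }
    // like $match with {count: {$gt: 1}}: keep groups with more than one document
    return [...groups.values()]
        .filter((group) => group.count > 1)
        .map((group) => ({ ...group, uniqueIds: [...group.uniqueIds] }));
}

// Made-up documents; strings stand in for ObjectIds
const sampleDocs = [
    { _id: "a1", issue_number: 34 },
    { _id: "b2", issue_number: 80 },
    { _id: "c3", issue_number: 34 },
];
const duplicates = findDuplicates(sampleDocs, "issue_number");
console.log(JSON.stringify(duplicates));
```

Only issue number 34 appears twice in the sample, so it is the only group that survives the filter.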

Running the complete query against the issues collection produces the following output.

/* 1 */
{
    "_id" : {
        "IssueNumber" : 34
    },
    "uniqueIds" : [ 
        ObjectId("61b993c2f8fba0f17accd7fc"), 
        ObjectId("61b9937b12b680e75f5a7fac"), 
        ObjectId("61b88043dcb3fc4c9c03ca8e")
    ],
    "count" : 3.0
}

/* 2 */
{
    "_id" : {
        "IssueNumber" : 37
    },
    "uniqueIds" : [ 
        ObjectId("61b993c1f8fba0f17accd7f6"), 
        ObjectId("61b9937b12b680e75f5a7f97"), 
        ObjectId("61b88042dcb3fc4c9c03ca82")
    ],
    "count" : 3.0
}

/* 3 */
{
    "_id" : {
        "IssueNumber" : 677
    },
    "uniqueIds" : [ 
        ObjectId("61b993baf8fba0f17accd6bd"), 
        ObjectId("61b88042dcb3fc4c9c03ca76")
    ],
    "count" : 2.0
}

/* 4 */
{
    "_id" : {
        "IssueNumber" : 695
    },
    "uniqueIds" : [ 
        ObjectId("61b993b9f8fba0f17accd667"), 
        ObjectId("61b88040dcb3fc4c9c03ca4a")
    ],
    "count" : 2.0
}

/* 5 */
{
    "_id" : {
        "IssueNumber" : 700
    },
    "uniqueIds" : [ 
        ObjectId("61b993b8f8fba0f17accd653"), 
        ObjectId("61b88040dcb3fc4c9c03ca42")
    ],
    "count" : 2.0
}

/* 6 */
{
    "_id" : {
        "IssueNumber" : 1871
    },
    "uniqueIds" : [ 
        ObjectId("61b993b8f8fba0f17accd60c"), 
        ObjectId("61b8803fdcb3fc4c9c03ca25")
    ],
    "count" : 2.0
}

Now you have the ObjectId of every document that shares a duplicated issue number. Each uniqueIds array includes all the copies, so keep one ObjectId per group and run a separate query to remove the rest.
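
As a minimal sketch of that clean-up step, the helper below keeps the first id of each group and collects the others for deletion. The idsToRemove function and the sample data are hypothetical, and which copy ends up first in uniqueIds is arbitrary, so adjust the choice if you need to keep a particular document.

```javascript
// Hypothetical helper: keep the first id of each duplicate group
// and collect the remaining ids for deletion.
function idsToRemove(duplicateGroups) {
    return duplicateGroups.flatMap((group) => group.uniqueIds.slice(1));
}

// Made-up sample shaped like the aggregation output (strings stand in for ObjectIds)
const sampleGroups = [
    { _id: { IssueNumber: 34 }, uniqueIds: ["id-1", "id-2", "id-3"], count: 3 },
    { _id: { IssueNumber: 677 }, uniqueIds: ["id-4", "id-5"], count: 2 },
];
const removable = idsToRemove(sampleGroups);
console.log(removable);

// In the mongo shell, the collected ids could then be passed to deleteMany:
// db.issues.deleteMany({ _id: { $in: removable } });
```

It is worth checking the collected ids against a copy of the data before running the delete on a real collection.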

Conclusion

This article explained how to find duplicate documents in a MongoDB collection by a specific field.

TL;DR

  • Group records by the field on which you want to find duplicates
  • Add all the ObjectIds in each group to an array
  • Count the documents in each group
  • Apply a condition to fetch only the groups that contain more than one document
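
The steps above can be sketched as a small reusable helper that builds the pipeline for any field. The buildDuplicatePipeline name and the value key inside _id are assumptions made for this illustration; the stages themselves are the same $group and $match used in the article.

```javascript
// Hypothetical helper that builds the duplicate-finding pipeline for a given field.
function buildDuplicatePipeline(fieldName) {
    return [
        {
            // Group by the field, collect the ids and count the documents
            $group: {
                _id: { value: "$" + fieldName },
                uniqueIds: { $addToSet: "$_id" },
                count: { $sum: 1 }
            }
        },
        {
            // Keep only the groups that contain more than one document
            $match: { count: { $gt: 1 } }
        }
    ];
}

const pipeline = buildDuplicatePipeline("issue_number");
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell, the pipeline could then be run against a collection:
// db.issues.aggregate(pipeline);
```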

I hope you found this article useful. If you liked it, please share it with other developers. See you in the next article 🙂
