DEV Community

Martin Licht
How to remove duplicates from a C++ container

Article based on a GitHub Gist

We discuss how to remove duplicate elements from C++ Standard Template Library (STL) containers such as std::vector. Although this is an everyday task, the STL provides no single function for it: sooner or later, every C++ programmer needs to assemble this functionality on their own. Writing this up actually held a few surprises for me.

Introduction

Let us start with the canonical C++ solution to this problem. We use a combination of std::sort() and std::unique(), followed by a resizing of the vector. We can write this up with a template for the std::vector container.

#include <algorithm>
#include <vector>

template<typename T>
void remove_duplicates( std::vector<T>& vec ) {

    std::sort( vec.begin(), vec.end() ); // sort the vector 

    auto last = std::unique( vec.begin(), vec.end() ); // remove consecutive duplicates 

    vec.erase( last, vec.end() ); // erase the remaining duplicates, which are now at the end of the vector
}

#include <iostream> // for the output

int main() {

    std::vector<int> data = { 4, 5, 2, 6, 3, 1, 5, 4, 3, 6, 1, 4, 2, 3 };

    remove_duplicates( data );

    for( int num : data ) std::cout << num << " "; // output the modified vector 

    std::cout << std::endl;

    return 0;
}

Let us briefly review the steps in remove_duplicates:

  1. We first sort the elements of the container. Note that this already requires a comparison: by default, std::sort compares elements with operator<, and we may pass a custom comparator instead.

  2. The function std::unique removes consecutive duplicates by shifting the remaining unique elements to the front of the range; whatever lies past the new logical end is left with unspecified values. The previous sorting step ensures that all duplicate elements in the container form consecutive runs. std::unique returns an iterator to the new logical end.

  3. Lastly, since everything past that new logical end is superfluous, we can simply erase that tail of the container.

The function std::unique was introduced precisely for this purpose. The biggest restriction of this general approach is that we need to sort the elements of the container, which requires a comparison. But that restriction has a good reason!

Namely, the asymptotic time and space complexity of this algorithm is dominated by the sorting step. Typical implementations of std::sort, such as introsort (a hybrid of quicksort, heapsort, and insertion sort), work in-place, and their runtime grows as N log(N) in the number N of elements. The function std::unique runs in linear time and in-place, so it is really the sorting algorithm that matters here.

Hence, we can expect that removing duplicates can be implemented as an in-place algorithm with N log(N) time complexity. However, the specific behavior depends on the algorithm used inside std::sort. The C++ standard indeed mandates O(N log N) comparisons, but the required memory is left unspecified.

Let us complement this with a minor variation of remove_duplicates: instead of calling erase, we just resize the vector. That has the same outcome for all practical purposes:

#include <algorithm>
#include <vector>

template<typename T>
void remove_duplicates( std::vector<T>& vec ) {

    std::sort( vec.begin(), vec.end() ); // sort the vector 

    auto last = std::unique( vec.begin(), vec.end() ); // remove consecutive duplicates 

    vec.resize( std::distance( vec.begin(), last ) ); // truncate the vector at the new logical end
}

Using the iterator interface for random access containers

The preceding version of remove_duplicates takes the container as the argument. This is very much the type of function interface that you encounter in most programming languages - with the exception of the C++ STL, of course! The STL algorithms use iterators all over the place.

So let us rewrite the function using iterators. We want to provide random access iterators as arguments: one for the beginning of the sequence, and one for the end.

However, there is a severe constraint: without access to the full container, we cannot erase the leftover elements in the manner that we did above. As a remedy, we return the iterator to the logical end of the sequence, just as std::unique does. Rather an anticlimax!

#include <algorithm>
#include <iterator>

template<typename RandomAccessIterator>
RandomAccessIterator remove_duplicates( RandomAccessIterator first, RandomAccessIterator last ) {

    std::sort( first, last ); // sort all elements in the range

    return std::unique( first, last ); // remove consecutive duplicates
}

#include <iostream>
#include <vector>

int main() {

    // example using std::vector 
    std::vector<int> vec = { 4, 5, 2, 6, 3, 1, 5, 4, 3, 6, 1, 4, 2, 3 };
    auto new_end = remove_duplicates( vec.begin(), vec.end() );

    vec.erase( new_end, vec.end() );

    for( auto elem : vec) std::cout << elem << ' ';
    std::cout << std::endl;

    // example using array
    int arr[] = { 4, 5, 2, 6, 3, 1, 5, 4, 3, 6, 1, 4, 2, 3 };
    auto new_end_arr = remove_duplicates( std::begin(arr), std::end(arr) );

    // Output the result; since arrays do not support erase, just use new_end_arr
    for( auto it = std::begin(arr); it != new_end_arr; ++it ) std::cout << *it << ' ';
    std::cout << std::endl;

    return 0;
}

But we are still restricted to containers that allow random access. What about, say, std::list, which is implemented as a doubly-linked list and offers merely bidirectional iterators?

Removing duplicates in a bidirectional container

The above version of remove_duplicates utilizes the algorithms std::sort and std::unique. These are only available for containers that support random access iterators, such as std::vector or classical arrays. However, a different approach is needed for other containers, such as std::list, which support only bidirectional iterators.

The background is that std::sort requires random access iterators: internally, it typically uses an algorithm such as introsort, a hybrid of quicksort, heapsort, and insertion sort, all of which rely on random access. However, a bidirectional container still admits other sorting algorithms, such as merge sort, which is why std::list ships with its own sort member function.

This is how we implement this functionality for linked lists. The complete code looks as follows:

#include <algorithm>
#include <iostream>
#include <list>

template<typename T>
void remove_duplicates( std::list<T>& lst ) {
    lst.sort();
    lst.unique();
}

int main() {

    // Example using a list
    std::list<int> my_list = { 4, 5, 2, 6, 3, 1, 5, 4, 3, 6, 1, 4, 2, 3 };

    remove_duplicates( my_list );

    for( auto elem : my_list ) std::cout << elem << ' ';
    std::cout << std::endl;

    return 0;
}

Notably, this approach relies on two member functions of std::list, namely sort and unique. While sorting works completely analogously to the case of std::vector and other random access containers, the member function std::list::unique behaves strikingly differently from the algorithm std::unique: it really erases the consecutive duplicate elements, instead of leaving them past a logical end of the container.
