DEV Community

Martin Licht

Removing duplicates from a C++ container, using forward iterators

We further discuss how to remove duplicates from a C++ container. A previous article discusses the removal of duplicates from random access containers. But not every container supports random access; more precisely, not every container supports random access iterators.

Let us discuss removing duplicates when the container interface is the most basic one: only supporting forward iterators. Moreover, we will not require that the elements support comparisons.

A very basic and naive algorithm works in-place and runs in quadratic time. That is strictly worse than the in-place algorithms that run in log-linear time, assuming the container is random access and the elements can be sorted.

However, that is not the end of the story: if we are willing to tolerate additional memory consumption, then we can achieve much better time complexity.

A classical approach

Our first approach uses the remove_if algorithm in the Standard Template Library.

#include <algorithm>
#include <iterator> // for std::next and std::iterator_traits

template <typename ForwardIterator>
ForwardIterator remove_duplicates(ForwardIterator begin, ForwardIterator end) {
    // iterator_traits also covers raw pointers, unlike ForwardIterator::value_type
    using T = typename std::iterator_traits<ForwardIterator>::value_type;
    for (; begin != end; ++begin)
    {
        auto runner = std::next(begin);
        // Shrink the logical range: drop every later element equal to *begin.
        end = std::remove_if(runner, end, [&](const T& v) -> bool { return v == *begin; });
    }
    return begin;
}

Some explanations:

  1. The function assumes as a precondition that begin will equal end after a finite number of increments (possibly none). This precondition remains true throughout each iteration of the loop body. At the end, we return begin, which then equals end.

  2. At each loop iteration, we let runner be the successor of begin. By the loop invariant, runner will equal end after a finite number of iterator increments, possibly none.

  3. The call to remove_if targets all the elements from runner up to (excluding) end that equal *begin, which is bound to v in the lambda. Internally, remove_if uses move assignments to achieve the following outcome: if there are, say, k elements within the range that equal *begin, then the last k elements before end are in a moved-from state, and none of the preceding elements equals *begin.

  4. The function remove_if returns an iterator to the new logical end of the range, which lies between runner and end. Since we assign this iterator to end, the loop invariant is maintained.

If we save the return value in a variable last, then the range [begin, last) will contain the original elements, in their original order, with all duplicates removed. All members of the range [last, end) will be in a valid but unspecified (moved-from) state.

Let us combine the above code with some example main function:

#include <iostream>
#include <vector>

int main() {

    std::vector<int> data = { 4, 4, 3, 6, 1, 4, 5, 2, 6, 3, 1, 5, 2, 3 };

    auto last = remove_duplicates( data.begin(), data.end() );

    for( auto it = data.begin(); it != last; it++ ) std::cout << *it << " "; // output the remaining items

    std::cout << std::endl;

    for( auto e : data ) std::cout << e << " "; // the removed items still exist but in unspecified state 

    std::cout << std::endl;

    return 0;
}

Now, there is a caveat to this implementation of remove_duplicates in comparison to the variant for random access iterators: the runtime generally grows quadratically in the number of elements. The worst case is realized, for example, when the original range contains N distinct elements with no duplicates at all: then the run time is quadratic in N.

Generally speaking, the run time of this algorithm is shorter the more duplicates there are. While the quadratic worst case is realized when there are no duplicates at all, the algorithm takes only a linear number of steps if the range contains nothing but copies of a single value.

Use a set or a hash table

This was pretty elementary. But is there a way to improve the run time? The underlying idea is that we run over the entire range once and remove elements that have already been encountered at earlier positions. What causes the run time to grow beyond linear is that, at each step, checking whether the element has already been encountered requires more than constant time.

In our attempt to ameliorate this situation, let us have a look at the following implementation.

#include <algorithm>
#include <iterator>
#include <set>

template <typename ForwardIterator>
ForwardIterator remove_duplicates(ForwardIterator begin, ForwardIterator end) {
    using T = typename std::iterator_traits<ForwardIterator>::value_type;
    std::set<T> visited;
    return std::remove_if(begin, end, [&](const T& v) {
        if (visited.contains(v)) { // `contains` needs C++20
            return true;
        } else {
            visited.insert(v);
            return false;
        }
    });
}

The C++ standard mandates that remove_if applies the predicate exactly std::distance(begin, end) times. That means the predicate is called on each element of a range of N elements exactly once. This matters here because our predicate has side effects: it excludes the possibility of double insertions or removals, though that would not be any serious concern in the most straightforward implementation of remove_if.

Notice that the insert and contains methods have run time logarithmic in the number of elements in the set. On a side note, contains is only available from C++20 onward; before that, one can test visited.find(v) != visited.end() or visited.count(v) != 0 instead.

How does this compare to the implementation using sort and unique, as discussed in a previous article? Both approaches have time complexity O(N log N). But sorting can be implemented in place, whereas the set requires linear additional memory; on the other hand, std::sort requires random access iterators, which we do not assume here.

Of course, std::set requires that the element type supports comparisons. But we would like to drop even that requirement. Alternatively, we can use an unordered set, which requires only hashing and equality. The container std::unordered_set has been available since C++11.

#include <algorithm>
#include <iterator>
#include <unordered_set>

template <typename ForwardIterator>
ForwardIterator remove_duplicates(ForwardIterator begin, ForwardIterator end) {
    using T = typename std::iterator_traits<ForwardIterator>::value_type;
    std::unordered_set<T> visited;
    return std::remove_if(begin, end, [&](const T& v) {
        if (visited.find(v) != visited.end()) { // works before C++20, too
            return true;
        } else {
            visited.insert(v);
            return false;
        }
    });
}

The unordered set data structure is a bit of a mixed bag in practice. The C++ standard requires that operations such as find and insert run in constant time on average (the worst case is linear in the size of the container). This is implemented using a hash table. In real life, outside of the platonic realm of the C++ standard, it has been reported that some implementations do not live up to these complexity guarantees due to sub-optimal hash functions, so the real-life performance may be considerably worse.

Hence, if we allow for additional memory of linear order in the worst case, then we can hope to achieve linear average time complexity. In fact, some might consider this even better than the much more idiomatic sort-unique combination.

