Vitaly Shchurov

Posted on Dec 25, 2020 • Edited on Dec 26, 2020

Python: using heapq module to find n largest items

#python #python3 #heap #heapq

Today, I'm going to tell about using the heapq module. As you probably know, the easiest way to find the largest element in a collection in Python is by using the max() method. Or min() to find the smallest one. But what if you need to find n largest or smallest items? The solution depends on how large this n is comparing to the overall size of a collection.

Let's start with simpler cases. Later on, I'm going to give you more complicated examples, including one that requires some out-of-the-box thinking.

RELATIVELY SMALL N

If n is not about the same size as the passed-in collection, you might want to consider using the heapq library. It's equipped with nlargest(num, iterable) and nsmallest(num, iterable) functions, which are very easy to use. Let's see how they work on a list of numbers:



import heapq

nums = [-325, 252, 100, 0, 1, 128, 1000, -500, 879, 12, -57]
print('Largest:', heapq.nlargest(2, nums))
print('Smallest:', heapq.nsmallest(2, nums))

This will result in:



Largest:  [1000, 879]
Smallest:  [-500, -325]

By the way, these functions don't look for unique values. For example, if we had 1000 twice in our list, the largest result would be [1000, 1000].

RELATIVELY LARGE N

If n is almost the same size as the iterable, it's usually faster to just sort that collection and slice it:



import random

# generate a list with 110 arbitrary numbers within the given range:
nums = [random.randint(-100, 100) for _ in range(110)]

# sort, and then slice:
print('Largest 70 numbers:', sorted(nums)[-90:])
print('Smallest 70 numbers:', sorted(nums)[:90])

`heapq`: MORE ADVANCED EXAMPLES

LARGEST CITIES

The heapq example above was rather basic, but nlargest() and nsmallest() actually allow more complicated processing. The thing is that they can also accept a third optional key argument, which is a common parameter for a number of Python functions. key expects a function to be passed in (not a function call!), and this function serves as a comparison criteria that allows you to perform the operation on your own terms.

Let's try it on a list of dictionaries. We need two extract data on two world's largest cities, so we're interested in comparing their population only. To do that, we'll use lambda to specify that dictionaries with the biggest population number should be treated as the largest:



from pprint import pprint
import heapq

cities = [
    {'city': 'Cairo', 'country': 'Egypt', 'population': 20076000},
    {'city': 'São Paulo', 'country': 'Brazil', 'population': 21650000},
    {'city': 'Mumbai', 'country': 'India', 'population': 19980000},
    {'city': 'Mexico City', 'country': 'Mexico', 'population': 21581000},
    {'city': 'Tokyo', 'country': 'Japan', 'population': 37400068},
    {'city': 'Delhi', 'country': 'India', 'population': 28514000},
    {'city': 'Beijing', 'country': 'China', 'population': 19618000},
    {'city': 'Shanghai', 'country': 'China', 'population': 25582000},
]

largest = heapq.nlargest(2, cities, 
                         key=lambda x: x['population'])  # 1
print('Data on two largest cities in the world:')
pprint(largest)

# 1: here, the nlargest() will iterate through the dictionaries contained within the cities list. For each dictionary, we'll use the population key's value to compare them. Otherwise, Python will throw an error, because it can't just compare two dictionaries.

The result will be:



Data on two largest cities in the world:
[{'city': 'Tokyo', 'country': 'Japan', 'population': 37400068},
 {'city': 'Delhi', 'country': 'India', 'population': 28514000}]

HIGHEST-PAID EMPLOYEES

Now, having the employees dictionary, let's find three members of stuff with the highest salary. The problem is that if you pass in a dictionary, you won't get the result you might expect:



import heapq

employees = {
    'Susan Smith': 56000,
    'Matthew Jones': 60000,
    'Jim Taylor': 83000,
    'Emma Wilson': 150000,
    'Mary Thompson': 200000,
    'Chris Evans': 200000,
    'Andrew Evans': 200000,
    'John Evans': 150000,
}

top_paid = heapq.nlargest(3, employees)
print(top_paid)

The output will be:



['Susan Smith', 'Matthew Jones', 'Mary Thompson']

Wrong! None of these employees is among the three highest-paid ones. This happens because nlargest() iterated through the keys of the employees dictionary, which is normal for Python (for example, for loops iterate through the keys as well, by default). So, we iterated through the names only.

In case you're wondering why we didn't get Andrew Evans, it happened because in Python 'b' is bigger than 'a'. So, we got the names that appear later in the alphabet. To get it all right, we need to specify a key function:



top_paid = heapq.nlargest(3, employees, key=employees.get)

This time, we'll get the correct result:



['Mary Thompson', 'Chris Evans', 'Andrew Evans']

If you'd like to print salaries as well, you can pass dictionaries' items and use lambda like this:



top_paid = heapq.nlargest(3, employees.items(), key=lambda x: x[1])

The result will be:



[('Mary Thompson', 200000), ('Chris Evans', 200000), ('Andrew Evans', 200000)]

CHALLENGE

Now, let's have some fun! In the example above we retrieved a few highest-paid employees. The thing is, we didn't exactly get the names in alphabetical order, which is not very good for representation purposes. It's neater if Andrew Evans comes before, say Chris Evans in the output.

So, we still have to search for the highest salaries first, but if there's a tie, we need to compare the keys to put them in alphabetical order. For example, if you've got something like {'bread': 10, 'apple': 10}, you'll need to put apple first. As was mentioned earlier, the catch is that in Python:



>>> 'b' > 'a'
True

So, if you simply use nlargest(), it'll fail to sort your keys alphabetically when there's a tie. You can use nsmallest() instead, but you're still going to need the highest salaries... What do we do about it?

This is where the key parameter provides us with an excellent solution. Using it, we can write our own sorting function that can practically redefine nlargest() or nsmallest() a bit to get us what we want.



import heapq

employees = {
    'Susan Smith': 56000,
    'Matthew Jones': 60000,
    'Jim Taylor': 83000,
    'Emma Wilson': 150000,
    'Mary Thompson': 200000,
    'Chris Evans': 200000,
    'Andrew Evans': 200000,
    'John Evans': 150000,
}

def sort_key(x: tuple) -> tuple:   # 2
    return -x[1], x[0]

top_paid = heapq.nsmallest(4, employees.items(), key=sort_key)  # 1
print(top_paid)

# 1: We use nsmallest() here, so we can overcome the 'b' > 'a' constraint. So, in our final list we'll have, say, Adam before Bill. But we still have to pick the largest salaries...

# 2: And that's why we define our own sorting function that accepts a tuple as an argument, and returns it in reverse order (salary first, name second), and change the salary number to the opposite. For example, if we pass ('Susan Smith', 56000), we'll get (-56000, 'Susan Smith'). Why? So that nsmallest() can still find the biggest salaries instead of the smallest.

Remember that key only serves as a comparison criteria, it doesn't change the output. So, eventually, we'll get:



[('Andrew Evans', 200000), ('Chris Evans', 200000), ('Mary Thompson', 200000), ('Emma Wilson', 150000)]

As you can see, we got the top four salaries first, but the names are placed in alphabetical order!

One more thing, the possible downside to this solution is that using heapq.nsmallest() is less intuitive if we're talking about searching for the largest salaries, and, hence, might be less readable. So, alternatively, you can opt for heap.nlargest(), but you'll need to update your sort_key() function:



def sort_key(x: tuple) -> tuple:
    return x[1], [-ord(i) for i in x[0]]

Here, instead of changing a salary to the opposite, you change each character's ordinal number. And ordinal numbers is how Python really compares letters:



>>> ord('b') > ord('a')
True
>>> -ord('b') < -ord('a')
True

`heapq`: UNDERLYING MECHANISM

Now, let's have a real quick look under the hood of the heapq module. Underneath, the functions we've just learned convert passed-in data into a list where items are ordered as a heap (a special data structure). Let's see how it works:



>>> import heapq
>>> nums = [-15, 123, 133, 0, 100, 222, 144, 56]
>>> heapq.heapify(nums)
>>> nums
[-15, 0, 133, 56, 100, 222, 144, 123]

heapq.heapify orders our nums list as a heap here, but it doesn't exactly sort it. The important feature of a heap as a data structure is that its first element (heap[0]) is always the smallest item. You can use this to easily access the next smallest item by popping it and heapifying the list again:



>>> heapq.heappop(nums)
-15
>>> heapq.heappop(nums)
0
>>> nums
[56, 100, 133, 123, 144, 222]
>>> heapq.heappop(nums)
56

56 wasn't the third in the queue to leave our wonderful collection, but it got to the beginning as soon as it became the smallest number. Needless to say, it wouldn't have worked with the list method .pop(), because the point is that the list is being heapified after popping an item. It requires O(log n) operations where n is the size of the heap.

Now, you know how to use the heapq library. To sum it all up, I prepared a little cheat sheet for you.

Hope you enjoyed my post! If you did, don't forget to like it. Thank you :)