## DEV Community

Vitaly Shchurov

Posted on • Updated on

# Python: using heapq module to find n largest items

Today, I'm going to tell about using the `heapq` module. As you probably know, the easiest way to find the largest element in a collection in Python is by using the `max()` method. Or `min()` to find the smallest one. But what if you need to find n largest or smallest items? The solution depends on how large this n is comparing to the overall size of a collection.

Let's start with simpler cases. Later on, I'm going to give you more complicated examples, including one that requires some out-of-the-box thinking.

#### RELATIVELY SMALL N

If n is not about the same size as the passed-in collection, you might want to consider using the `heapq` library. It's equipped with `nlargest(num, iterable)` and `nsmallest(num, iterable)` functions, which are very easy to use. Let's see how they work on a list of numbers:

``````import heapq

nums = [-325, 252, 100, 0, 1, 128, 1000, -500, 879, 12, -57]
print('Largest:', heapq.nlargest(2, nums))
print('Smallest:', heapq.nsmallest(2, nums))
``````

This will result in:

``````Largest:  [1000, 879]
Smallest:  [-500, -325]
``````

By the way, these functions don't look for unique values. For example, if we had `1000` twice in our list, the largest result would be `[1000, 1000]`.

#### RELATIVELY LARGE N

If n is almost the same size as the iterable, it's usually faster to just sort that collection and slice it:

``````import random

# generate a list with 110 arbitrary numbers within the given range:
nums = [random.randint(-100, 100) for _ in range(110)]

# sort, and then slice:
print('Largest 70 numbers:', sorted(nums)[-90:])
print('Smallest 70 numbers:', sorted(nums)[:90])
``````

#### `heapq`: MORE ADVANCED EXAMPLES

##### LARGEST CITIES

The `heapq` example above was rather basic, but `nlargest()` and `nsmallest()` actually allow more complicated processing. The thing is that they can also accept a third optional `key` argument, which is a common parameter for a number of Python functions. `key` expects a function to be passed in (not a function call!), and this function serves as a comparison criteria that allows you to perform the operation on your own terms.

Let's try it on a list of dictionaries. We need two extract data on two world's largest cities, so we're interested in comparing their population only. To do that, we'll use `lambda` to specify that dictionaries with the biggest `population` number should be treated as the largest:

``````from pprint import pprint
import heapq

cities = [
{'city': 'Cairo', 'country': 'Egypt', 'population': 20076000},
{'city': 'São Paulo', 'country': 'Brazil', 'population': 21650000},
{'city': 'Mumbai', 'country': 'India', 'population': 19980000},
{'city': 'Mexico City', 'country': 'Mexico', 'population': 21581000},
{'city': 'Tokyo', 'country': 'Japan', 'population': 37400068},
{'city': 'Delhi', 'country': 'India', 'population': 28514000},
{'city': 'Beijing', 'country': 'China', 'population': 19618000},
{'city': 'Shanghai', 'country': 'China', 'population': 25582000},
]

largest = heapq.nlargest(2, cities,
key=lambda x: x['population'])  # 1
print('Data on two largest cities in the world:')
pprint(largest)
``````

`# 1`: here, the `nlargest()` will iterate through the dictionaries contained within the `cities` list. For each dictionary, we'll use the `population` key's value to compare them. Otherwise, Python will throw an error, because it can't just compare two dictionaries.

The result will be:

``````Data on two largest cities in the world:
[{'city': 'Tokyo', 'country': 'Japan', 'population': 37400068},
{'city': 'Delhi', 'country': 'India', 'population': 28514000}]
``````
##### HIGHEST-PAID EMPLOYEES

Now, having the `employees` dictionary, let's find three members of stuff with the highest salary. The problem is that if you pass in a dictionary, you won't get the result you might expect:

``````import heapq

employees = {
'Susan Smith': 56000,
'Matthew Jones': 60000,
'Jim Taylor': 83000,
'Emma Wilson': 150000,
'Mary Thompson': 200000,
'Chris Evans': 200000,
'Andrew Evans': 200000,
'John Evans': 150000,
}

top_paid = heapq.nlargest(3, employees)
print(top_paid)
``````

The output will be:

``````['Susan Smith', 'Matthew Jones', 'Mary Thompson']
``````

Wrong! None of these employees is among the three highest-paid ones. This happens because `nlargest()` iterated through the keys of the `employees` dictionary, which is normal for Python (for example, `for` loops iterate through the keys as well, by default). So, we iterated through the names only.

In case you're wondering why we didn't get `Andrew Evans`, it happened because in Python `'b'` is bigger than `'a'`. So, we got the names that appear later in the alphabet. To get it all right, we need to specify a `key` function:

``````top_paid = heapq.nlargest(3, employees, key=employees.get)
``````

This time, we'll get the correct result:

``````['Mary Thompson', 'Chris Evans', 'Andrew Evans']
``````

If you'd like to print salaries as well, you can pass dictionaries' items and use `lambda` like this:

``````top_paid = heapq.nlargest(3, employees.items(), key=lambda x: x[1])
``````

The result will be:

``````[('Mary Thompson', 200000), ('Chris Evans', 200000), ('Andrew Evans', 200000)]
``````

#### CHALLENGE

Now, let's have some fun! In the example above we retrieved a few highest-paid employees. The thing is, we didn't exactly get the names in alphabetical order, which is not very good for representation purposes. It's neater if `Andrew Evans` comes before, say `Chris Evans` in the output.

So, we still have to search for the highest salaries first, but if there's a tie, we need to compare the keys to put them in alphabetical order. For example, if you've got something like `{'bread': 10, 'apple': 10}`, you'll need to put `apple` first. As was mentioned earlier, the catch is that in Python:

``````>>> 'b' > 'a'
True
``````

So, if you simply use `nlargest()`, it'll fail to sort your keys alphabetically when there's a tie. You can use `nsmallest()` instead, but you're still going to need the highest salaries... What do we do about it?

This is where the `key` parameter provides us with an excellent solution. Using it, we can write our own sorting function that can practically redefine `nlargest()` or `nsmallest()` a bit to get us what we want.

``````import heapq

employees = {
'Susan Smith': 56000,
'Matthew Jones': 60000,
'Jim Taylor': 83000,
'Emma Wilson': 150000,
'Mary Thompson': 200000,
'Chris Evans': 200000,
'Andrew Evans': 200000,
'John Evans': 150000,
}

def sort_key(x: tuple) -> tuple:   # 2
return -x[1], x[0]

top_paid = heapq.nsmallest(4, employees.items(), key=sort_key)  # 1
print(top_paid)
``````

`# 1`: We use `nsmallest()` here, so we can overcome the `'b' > 'a'` constraint. So, in our final list we'll have, say, `Adam` before `Bill`. But we still have to pick the largest salaries...

`# 2`: And that's why we define our own sorting function that accepts a tuple as an argument, and returns it in reverse order (salary first, name second), and change the salary number to the opposite. For example, if we pass `('Susan Smith', 56000)`, we'll get `(-56000, 'Susan Smith')`. Why? So that `nsmallest()` can still find the biggest salaries instead of the smallest.

Remember that `key` only serves as a comparison criteria, it doesn't change the output. So, eventually, we'll get:

``````[('Andrew Evans', 200000), ('Chris Evans', 200000), ('Mary Thompson', 200000), ('Emma Wilson', 150000)]
``````

As you can see, we got the top four salaries first, but the names are placed in alphabetical order!

One more thing, the possible downside to this solution is that using `heapq.nsmallest()` is less intuitive if we're talking about searching for the largest salaries, and, hence, might be less readable. So, alternatively, you can opt for `heap.nlargest()`, but you'll need to update your `sort_key()` function:

``````def sort_key(x: tuple) -> tuple:
return x[1], [-ord(i) for i in x[0]]
``````

Here, instead of changing a salary to the opposite, you change each character's ordinal number. And ordinal numbers is how Python really compares letters:

``````>>> ord('b') > ord('a')
True
>>> -ord('b') < -ord('a')
True
``````

#### `heapq`: UNDERLYING MECHANISM

Now, let's have a real quick look under the hood of the `heapq` module. Underneath, the functions we've just learned convert passed-in data into a list where items are ordered as a heap (a special data structure). Let's see how it works:

``````>>> import heapq
>>> nums = [-15, 123, 133, 0, 100, 222, 144, 56]
>>> heapq.heapify(nums)
>>> nums
[-15, 0, 133, 56, 100, 222, 144, 123]
``````

`heapq.heapify` orders our `nums` list as a heap here, but it doesn't exactly sort it. The important feature of a heap as a data structure is that its first element (`heap[0]`) is always the smallest item. You can use this to easily access the next smallest item by popping it and heapifying the list again:

``````>>> heapq.heappop(nums)
-15
>>> heapq.heappop(nums)
0
>>> nums
[56, 100, 133, 123, 144, 222]
>>> heapq.heappop(nums)
56
``````

`56` wasn't the third in the queue to leave our wonderful collection, but it got to the beginning as soon as it became the smallest number. Needless to say, it wouldn't have worked with the list method `.pop()`, because the point is that the list is being heapified after popping an item. It requires O(log n) operations where n is the size of the heap.

Now, you know how to use the `heapq` library. To sum it all up, I prepared a little cheat sheet for you.

Hope you enjoyed my post! If you did, don't forget to like it. Thank you :)