DEV Community

Cover image for Dictionaries and Sets in Python: A Comprehensive Guide to the Two Key-Value Data Structures
Albert
Albert

Posted on

Dictionaries and Sets in Python: A Comprehensive Guide to the Two Key-Value Data Structures

Greetings everyone, I'm a developer planning to publish a series of articles about computer science and programming. The content will cover everything from beginner to advanced.
I love sharing programming knowledge and hope to impart my experience to more people. If you're also interested, feel free to follow my dev.to page. I look forward to learning and growing with you!

Dictionary and Set Basics

A dictionary is a collection of elements consisting of key-value pairs. In Python 3.7+, dictionaries are considered ordered (note: in Python 3.6, dictionary ordering is an implementation detail and only officially became a language feature in Python 3.7), while in versions prior to 3.6, they were unordered. Dictionaries can vary in size, and elements can be freely added, modified, or deleted.

Compared to lists and tuples, dictionaries have better performance, especially for lookup, addition, and deletion operations, which can be done in constant time complexity.
Sets are similar to dictionaries, but they don't have key-value pairs. Instead, they are a collection of unordered and unique elements. Let's first look at how dictionaries and sets are created using the following methods:

d1 = {'name': 'jason', 'age': 20, 'gender': 'male'}
d2 = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
d3 = dict([('name', 'jason'), ('age', 20), ('gender', 'male')])
d4 = dict(name='jason', age=20, gender='male')
d1 == d2 == d3 == d4
True

s1 = {1, 2, 3}
s2 = set([1, 2, 3])
s1 == s2
True
Enter fullscreen mode Exit fullscreen mode

Note that in Python, both keys and values in dictionaries and sets can be of mixed types. For example, in the following example, I created a set with elements 1, 'hello', and 5.0:

s = {1, 'hello', 5.0}
Enter fullscreen mode Exit fullscreen mode

Now let's discuss element access. In dictionaries, you can directly index a key to access its value. If the key does not exist, an exception will be raised:

d = {'name': 'jason', 'age': 20}
d['name']
'jason'
d['location']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'location'
Enter fullscreen mode Exit fullscreen mode

Alternatively, you can use the get(key, default) function to perform indexing. If the key does not exist, calling get() will return a default value. In the following example, it returns 'null':

d = {'name': 'jason', 'age': 20}
d.get('name')
'jason'
d.get('location', 'null')
'null'
Enter fullscreen mode Exit fullscreen mode

After discussing dictionary access, let's move on to sets.

First, I want to emphasize that sets do not support indexing operations because sets are essentially hash tables and are different from lists. Therefore, the following operation is incorrect, and Python will raise an exception:

s = {1, 2, 3}
s[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing
Enter fullscreen mode Exit fullscreen mode

To check if an element exists in a dictionary or set, you can use value in dict/set:

s = {1, 2, 3}
1 in s
True
10 in s
False

d = {'name': 'jason', 'age': 20}
'name' in d
True
'location' in d
False
Enter fullscreen mode Exit fullscreen mode

Apart from creation and access, dictionaries and sets also support operations such as addition, deletion, and update.

d = {'name': 'jason', 'age': 20}
d['gender'] = 'male'  # Add the key-value pair 'gender': 'male'
d['dob'] = '1999-02-01'  # Add the key-value pair 'dob': '1999-02-01'
d
{'name': 'jason', 'age': 20, 'gender': 'male', 'dob': '1999-02-01'}
d['dob'] = '1998-01-01'  # Update the value of the key 'dob'
d.pop('dob')  # Remove the key-value pair with the key 'dob'
'1998-01-01'
d
{'name': 'jason', 'age': 20, 'gender': 'male'}

s = {1, 2, 3}
s.add(4)  # Add element 4 to the set
s
{1, 2, 3, 4}
s.remove(4)  # Remove element 4 from the set
s
{1, 2, 3}
Enter fullscreen mode Exit fullscreen mode

However, keep in mind that the pop() operation on a set removes the last element, but since sets are unordered, you cannot predict which element will be removed. Therefore, use this operation with caution.
In practical applications, we often need to sort dictionaries or sets, such as retrieving the top 50 pairs with the highest values.
For dictionaries, we usually sort them in ascending or descending order based on keys or values:

d = {'b': 1, 'a': 2, 'c': 10}
d_sorted_by_key = sorted(d.items(), key=lambda x: x[0])  # Sort by keys in ascending order
d_sorted_by_value = sorted(d.items(), key=lambda x: x[1])  # Sort by values in ascending order
d_sorted_by_key
[('a', 2), ('b', 1), ('c', 10)]
d_sorted_by_value
[('b', 1), ('a', 2), ('c', 10)]
Enter fullscreen mode Exit fullscreen mode

Here, a list is returned. Each element in the list is a tuple consisting of the original dictionary's key and value.

As for sets, sorting works similarly to lists and tuples mentioned earlier. You can simply call sorted(set), and it will return a sorted list.

s = {3, 4, 2, 1}
sorted(s) 
[1, 2, 3, 4]
Enter fullscreen mode Exit fullscreen mode

Dictionary and Set Performance

As mentioned earlier, dictionaries and sets are highly optimized data structures, especially for lookup, addition, and deletion operations. Now let's take a look at their performance in specific scenarios compared to other data structures like lists.

For example, consider an e-commerce backend that stores the ID, name, and price of each product. The requirement is to find the price of a given product ID. If we store this data in a list and perform a lookup, the corresponding code would be as follows:

def find_product_price(products, product_id):
    for id, price in products:
        if id == product_id:
            return price
    return None

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150)
]

print('The price of product 432314553 is {}'.format(find_product_price(products, 432314553)))

# Output
The price of product 432314553 is 30
Enter fullscreen mode Exit fullscreen mode

Assuming the list has n elements, the time complexity for the lookup process is O(n). Even if we sort the list and use binary search, it would still require O(logn) time complexity. Moreover, the sorting of the list itself would take O(nlogn) time.
However, if we use a dictionary to store this data, the lookup becomes very efficient, requiring only O(1) time complexity. The reason is simple: dictionaries are internally implemented as hash tables, allowing direct access to the corresponding value using the hash value of the key.

products = {
    143121312: 100,
    432314553: 30,
    32421912367: 150
}
print('The price of product 432314553 is {}'.format(products[432314553]))

# Output
The price of product 432314553 is 30
Enter fullscreen mode Exit fullscreen mode

Similarly, let's consider another requirement: finding the number of different prices among these products. We'll compare the approaches using the same methods.
If we continue to use a list, the corresponding code would be as follows, where A and B represent two nested loops. Assuming the original list has n elements, in the worst case scenario, it would require O(n^2) time complexity.

# list version
def find_unique_price_using_list(products):
    unique_price_list = []
    for _, price in products:  # A
        if price not in unique_price_list:  # B
            unique_price_list.append(price)
    return len(unique_price_list)

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('Number of unique prices is: {}'.format(find_unique_price_using_list(products)))

# Output
Number of unique prices is: 3
Enter fullscreen mode Exit fullscreen mode

However, if we choose to use the set data structure, since sets are highly optimized hash tables that do not allow duplicate elements, and their addition and lookup operations have a complexity of O(1), the overall time complexity becomes O(n).

# set version
def find_unique_price_using_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('Number of unique prices is: {}'.format(find_unique_price_using_set(products)))

# Output
Number of unique prices is: 3
Enter fullscreen mode Exit fullscreen mode

You may not have an intuitive understanding of these time complexities, so let me provide an example from an actual work scenario to help you get a feel for it.The code provided initializes a product with 100,000 elements and measures the runtime of counting product prices using both a list and a set.

import time
id = [x for x in range(0, 100000)]
price = [x for x in range(200000, 300000)]
products = list(zip(id, price))

# Calculate time using list
start_using_list = time.perf_counter()
find_unique_price_using_list(products)
end_using_list = time.perf_counter()
print("time elapse using list: {}".format(end_using_list - start_using_list))
## Output
time elapse using list: 41.61519479751587

# Calculate time using set
start_using_set = time.perf_counter()
find_unique_price_using_set(products)
end_using_set = time.perf_counter()
print("time elapse using set: {}".format(end_using_set - start_using_set))
# Output
time elapse using set: 0.008238077163696289
Enter fullscreen mode Exit fullscreen mode

As you can see, even with just 100,000 data points, there is a significant difference in speed between the two approaches. In reality, large enterprise backends often deal with data in the range of billions, and using inappropriate data structures can lead to server crashes, affecting user experience and causing substantial financial losses for the company.
How Dictionaries and Sets Work
By providing examples and comparing them with lists, we can observe the efficiency of dictionary and set operations. However, why are dictionaries and sets so efficient, especially for lookup, insertion, and deletion operations?
This efficiency is closely related to the internal data structure of dictionaries and sets, which is a hash table. Here is a simplified representation of the hash table structure in older versions of Python:

--+-------------------------------+
| Hash    Key     Value
--+-------------------------------+
0 | hash0   key0    value0
--+-------------------------------+
1 | hash1   key1    value1
--+-------------------------------+
2 | hash2   key2    value2
--+-------------------------------+
. | ...
__+_______________________________+
Enter fullscreen mode Exit fullscreen mode

It's easy to imagine that as the hash table expands, it becomes increasingly sparse. For example, if we have the following dictionary:

{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}
Enter fullscreen mode Exit fullscreen mode

It would be stored in a form similar to this:

entries = [
    ['--', '--', '--'],
    [-230273521, 'dob', '1999-01-01'],
    ['--', '--', '--'],
    ['--', '--', '--'],
    [1231236123, 'name', 'mike'],
    ['--', '--', '--'],
    [9371539127, 'gender', 'male']
]
Enter fullscreen mode Exit fullscreen mode

This design structure clearly wastes storage space. To improve storage efficiency, modern hash tables separate the indices, hash values, keys, and values from the dictionary's own structure. The new structure looks like this:

Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------

Entries
--------------------
hash0   key0  value0
---------------------
hash1   key1  value1
---------------------
hash2   key2  value2
---------------------
...
---------------------
Enter fullscreen mode Exit fullscreen mode

Thus, the previous example, using the new hash table structure, would be stored as follows:

indices = [None, 1, None, None, 0, None, 2]
entries = [
    [1231236123, 'name', 'mike'],
    [-230273521, 'dob', '1999-01-01'],
    [9371539127, 'gender', 'male']
]
Enter fullscreen mode Exit fullscreen mode

We can clearly see that storage efficiency is significantly improved.
Now that we understand the specific design structure, let's look at the workings of these operations.

Insertion Operation

When inserting an element into a dictionary or set in Python, the first step is to calculate the hash value of the key (hash(key)). Then, it is bitwise ANDed with the mask = PyDicMinSize - 1 to determine the position where the element should be inserted into the hash table: index = hash(key) & mask. If the position in the hash table is empty, the element is inserted there.
If the position is already occupied, Python compares the hash(key) and keys of the two elements.

  • If they are both equal, it means that the element already exists. In this case, if the values are different, the value is updated.
  • If one of them is not equal, it is called a hash collision, indicating that the keys of the two elements are different but the hash(key) are equal. In this case, Python continues to search for an available position in the table until it finds one. It is worth mentioning that in most cases, the simplest approach to handle hash collisions is linear probing, which means starting from the current position and sequentially searching for an empty slot. However, Python has optimized this process internally (you don't need to delve into the details, but you can refer to the source code if you're interested), making this step more efficient.

Lookup Operation

Similar to the insertion operation, Python finds the position where an element should be located based on its hash value. Then, it compares the hash(key) and key of the element in that position with the element being searched. If they are equal, the element is returned directly. If they are not equal, the search continues until an empty slot is found or an exception is raised.

Deletion Operation

For the deletion operation, Python temporarily assigns a special value to the element at that position. It will be removed when the hash table is resized.
It is not difficult to understand that hash collisions often reduce the speed of dictionary and set operations. To ensure efficiency, the hash tables in dictionaries and sets typically reserve at least 1/3 of the remaining space. As elements are continuously inserted, when the remaining space becomes less than 1/3, Python obtains a larger memory space and expands the hash table. In this case, all the elements in the table will be rearranged.
Although hash collisions and resizing of the hash table can slow down operations, these situations occur very rarely. Therefore, on average, it still guarantees constant time complexity (O(1)) for insertion, lookup, and deletion operations.

Summary

In Python 3.7+, dictionaries are ordered data structures, while sets are unordered. The internal hash table storage structure ensures the efficiency of lookup, insertion, and deletion operations. Therefore, dictionaries and sets are commonly used in scenarios that require efficient element lookup and deduplication.

Top comments (0)