In this post, we'll dissect how a dictionary is represented internally in Python, and the hashing scheme it uses to achieve constant-time lookups.
{
    "name": "Rahul Jha",
    "alias": "RJ722",
    "music": "country",
    "cheat_diet": "pasta",
}
When we first create a dictionary, it internally maintains a list. This list is used to store "data" about the items in the dictionary, and comprises tuples in the following format:
(
    hash(key),
    key,
    value,
)
Each tuple, referred to as a "row", represents a single key-value pair in the dictionary. When a new pair is added to the dictionary, it is appended to this list. This list forms the "database" of the dictionary.
For our example above, the database will look like this:
[(7409878836864405608, 'name', 'Rahul Jha'),
(2620441572017060194, 'alias', 'RJ722'),
(205288631654089410, 'music', 'country'),
(405166953458871902, 'cheat_diet', 'pasta')]
Along with creating the database, when a new dictionary is created, a chunk of memory is "blocked". This is used for addressing the keys, i.e., determining which row in the database should be returned for a given key. More on this later.
The size of this chunk determines the number of open addresses available. If all the open addresses are used up, a new memory chunk (usually double in size) is allocated, and any values stored in the old addresses are copied over. Generally, this is done once the addresses are about two-thirds full.
For the sake of this example, let's represent the "database" by a list named items, and the blocked memory chunk by a list named addresses:
items = []
n = 8 # number of open addresses available
addresses = [None] * n
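To make the growth rule above concrete, here is a minimal sketch. maybe_grow is a hypothetical helper, not CPython's actual resize code, and it re-addresses every row using the linear-probing scheme covered in the next section:

def maybe_grow():
    global n, addresses
    used = sum(slot is not None for slot in addresses)
    if used * 3 < n * 2:  # still under two-thirds full
        return
    n *= 2  # allocate a chunk double the size
    addresses = [None] * n
    # Every existing row must be re-addressed against the new chunk
    for position, (h, key, value) in enumerate(items):
        address = h % n
        while addresses[address] is not None:
            address = (address + 1) % n
        addresses[address] = position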
Addressing
To look up a key in a dictionary, one need not search through the entire database: that would make it impossible to maintain constant lookup time.
The pairs are addressed instead: the address for each pair is computed based on the hash of its key.
To perform a lookup, one can simply hash the key, use it to compute the address, and read the value at the given address.
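For instance, taking the hash of "name" from our example database and the eight open addresses defined above:

# hash('name') was 7409878836864405608 in the example run above
address = 7409878836864405608 % n  # == 0, so 'name' is addressed at slot 0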
If two or more keys map to the same address, a technique called Open Address Multihashing helps resolve a new address. Before understanding multihashing, let's first look at a simpler technique:
Linear Probing
If there's a collision at address i, check address i+1. If another value is stored there as well, check i+2, then i+3, and so on, until you find an empty slot:
def insert(key, value):
    h = hash(key)
    items.append((h, key, value))
    position = len(items) - 1  # the index of this row
    address = h % n
    while addresses[address] is not None:
        address = (address + 1) % n
    # Store the position at which the tuple can be found
    addresses[address] = position
Let's try it out:
insert("name", "Rahul Jha")
insert("alias", "RJ722")
insert("music", "country")
insert("cheat_diet", "pasta")
from pprint import pprint

pprint(items)
# [(7409878836864405608, 'name', 'Rahul Jha'),
#  (2620441572017060194, 'alias', 'RJ722'),
#  (205288631654089410, 'music', 'country'),
#  (405166953458871902, 'cheat_diet', 'pasta')]

pprint(addresses)
# [0, None, 1, 2, None, None, 3, None]
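Notice the collision here: the hashes of 'alias' (2620441572017060194) and 'music' (205288631654089410) both map to slot 2 modulo 8, so linear probing pushed 'music' to the next free slot, 3.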
Similarly, when looking up, one needs to probe adjacent addresses until the matching key is found:
def lookup(key):
    h = hash(key)
    address = h % n
    while addresses[address] is not None:
        position = addresses[address]  # index of the row in items
        if items[position][1] == key:
            return items[position][2]
        address = (address + 1) % n
    raise KeyError(key)
print(lookup("name"))  # "Rahul Jha"
However, if a large number of keys collide, linear probing can lead to a catastrophic linear pile-up: long probe chains form, slowing down lookups.
Note that the implementation above is incomplete: removing keys would create "holes" that break the probe chain, and probing would no longer find keys that had collided past the hole. The solution is to mark the slot as DUMMY, a placeholder that keeps the chain intact. This technique is known as Knuth's "Algorithm D".
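Here is a minimal sketch of that idea. The DUMMY sentinel and the delete helper below are illustrative names, not CPython's internals:

DUMMY = object()  # placeholder marking a deleted slot

def delete(key):
    h = hash(key)
    address = h % n
    while addresses[address] is not None:
        position = addresses[address]
        # Skip DUMMY placeholders; they only keep the probe chain intact
        if position is not DUMMY and items[position][1] == key:
            addresses[address] = DUMMY
            return
        address = (address + 1) % n
    raise KeyError(key)

With this in place, lookup needs the same "is not DUMMY" guard before comparing keys, and insert can reuse DUMMY slots when storing new rows.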
Open Address Multihashing
Rather than probing adjacent addresses, a new address is generated by re-hashing, based on both the prior hash value and the number of probes so far.
We add a pseudo-random number to the address probe, generated by a "simple linear-congruential random number generator": i = 5 * i + 1.
This recurrence, however, again scans the addresses in a fixed order. To avoid this, the original hash is also added to the random number, and this hash is shifted right by 5 bits on each recurrence. This way, the sequence eventually depends on every bit in the hash code, making the pseudo-scrambling more aggressive:
def insert(key, value):
    h = hash(key)
    items.append((h, key, value))
    position = len(items) - 1  # the index of this row
    perturb = h
    address = h % n
    while addresses[address] is not None:
        address = (5 * address + 1 + perturb) % n
        perturb >>= 5
    # Store the position at which the tuple can be found
    addresses[address] = position
Carrying forward the same example as above (with items and addresses reset; the hash values differ here because the example was re-run in a fresh interpreter session), we can generate the addresses:
insert("name", "Rahul Jha")
insert("alias", "RJ722")
insert("music", "country")
insert("cheat_diet", "pasta")
pprint(items)
# [(-6473745080427136004, 'name', 'Rahul Jha'),
# (-1887375983119989115, 'alias', 'RJ722'),
# (-4671665682772832904, 'music', 'country'),
# (3463812311448012107, 'cheat_diet', 'pasta')]
pprint(addresses)
# [2, None, None, 3, 0, 1, None, None]
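In this particular run, the perturbation never actually kicked in: the four hashes landed in the distinct slots 4, 5, 0, and 3 respectively, so no probing was needed. With more keys (or unluckier hashes), the recurrence above takes over.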
During lookup, we resolve collisions using the same recurrence pattern:
def lookup(key):
    perturb = h = hash(key)
    address = h % n
    while addresses[address] is not None:
        position = addresses[address]  # index of the row in items
        if items[position][1] == key:
            return items[position][2]
        address = (5 * address + 1 + perturb) % n
        perturb >>= 5
    raise KeyError(key)
print(lookup("name")) # "Rahul Jha"
Conclusion
Python uses two tables: a dense one that stores the hash/key/value rows, and a separate, sparse table of indices that points into the dense table.
It wasn't always this way. As Raymond Hettinger explains brilliantly in his talk "Modern Dictionaries" (which also provided the fodder for this post), dictionaries in Python 2.7 didn't have two tables. There was no table for addressing, only data: one big, sparse table. This led to a lot of space being wasted on empty holes, and consequently, dictionaries were giant until Python 3.6, when this two-table scheme (proposed by Hettinger) was adopted.
Fun Question: Why do you think key hashes are also stored along with data?