Hash Values
Python's built-in hash() function returns an object's hash value. Hash tables rely on hash values to index elements:
object ──▶ hash(object) ──▶ hash value ──▶ slot index in hash table
For a hash table to work correctly, key objects must satisfy two conditions:
- The hash value must not change during the object's lifetime.
- Objects that compare equal must have the same hash value.
Objects satisfying these conditions are called hashable. Only hashable objects can be used as keys in a hash table. This is why dict, set, and other hash-table-backed containers require hashable keys.
Among Python's built-in types, immutable objects are hashable, while mutable objects like list and dict are not. (One caveat: a tuple is hashable only if every element it contains is hashable.)
>>> hash([])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Unhashable objects cannot be used as dict keys:
>>> {[]: 'list is not hashable'}
TypeError: unhashable type: 'list'
>>> {{}: 'dict is not hashable either'}
TypeError: unhashable type: 'dict'
User-defined objects are hashable by default — their hash value is derived from their memory address, and any two distinct objects are considered unequal:
>>> class A:
... pass
...
>>> a = A()
>>> b = A()
>>> hash(a), hash(b)
(-9223372036573452351, -9223372036573452365)
>>> a == b
False
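Because both the default hash and default equality are identity-based, two distinct instances always count as two distinct dict keys, even if their contents are identical:

```python
class A:
    pass

a, b = A(), A()

# Default hash/equality are identity-based, so each instance
# is its own distinct dict key:
d = {a: "first", b: "second"}
assert len(d) == 2
assert d[a] == "first" and d[b] == "second"
```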
How Is the Hash Value Computed?
The answer is a hash function. As we know from the object model, object behavior is determined by the type object. Hash value computation is one such behavior — it's stored in the tp_hash function pointer of the type object. The built-in hash() function delegates to tp_hash to compute and return the value.
For str, the hash function is unicode_hash in Objects/unicodeobject.c:
PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",                        /* tp_name */
    sizeof(PyUnicodeObject),      /* tp_basicsize */
    // ...
    (hashfunc) unicode_hash,      /* tp_hash */
    // ...
};
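At the Python level, the tp_hash slot surfaces as the type's __hash__ method, and the built-in hash() simply delegates to it. A quick sanity check:

```python
s = "hello"
# hash() delegates to the type's hash slot, exposed in Python as __hash__:
assert hash(s) == str.__hash__(s) == s.__hash__()
```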
For user-defined objects, you can override the default hash behavior by implementing __hash__ and __eq__. For example, if a Tag instance is uniquely identified by its value field:
class Tag:
    def __init__(self, value, title):
        self.value = value
        self.title = title

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value
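With __hash__ and __eq__ defined this way, two Tag instances sharing the same value behave as one and the same dict key (the Tag class is repeated here so the snippet is self-contained):

```python
class Tag:
    def __init__(self, value, title):
        self.value = value
        self.title = title

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value

t1 = Tag(1, "python")
t2 = Tag(1, "cpython internals")   # same value, different title

# Equal objects with equal hashes → interchangeable as dict keys:
assert t1 == t2 and hash(t1) == hash(t2)

d = {t1: "first"}
d[t2] = "second"                   # overwrites t1's entry, no new key
assert len(d) == 1 and d[t1] == "second"
```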
Since hash values are used frequently and never change during an object's lifetime, they are cached inside the object to avoid recomputation. For str, the internal hash field stores the cached value.
A good hash function must distribute hash values uniformly across the hash space. The more similar two values are, the more different their hash values should be.
Hash Collisions
Two different objects can have the same hash value. Also, the number of slots in a hash table is far smaller than the full hash value space. This means multiple keys can map to the same slot — a hash collision.
hash(key1) % table_size = 3 ┐
├─▶ both map to slot 3 → collision!
hash(key3) % table_size = 3 ┘
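Hash collisions are not hypothetical; even CPython's built-in integer hash has a well-known one, because -1 is reserved as an error sentinel at the C level:

```python
# At the C level, a hash of -1 signals an error, so CPython maps
# hash(-1) to -2, producing a genuine built-in collision:
assert -1 != -2
assert hash(-1) == hash(-2) == -2
```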
There are two common strategies to resolve hash collisions:
- Separate chaining
- Open addressing
Separate Chaining
Each slot maintains a linked list. All keys that hash to the same slot are stored in the corresponding list:
Hash Index Linked Lists
┌───┐
│ 0 │──▶ NULL
├───┤
│ 1 │──▶ [key2] ──▶ NULL
├───┤
│ 2 │──▶ NULL
├───┤
│ 3 │──▶ [key1] ──▶ [key3] ──▶ NULL ← both hash to slot 3
├───┤
│ 4 │──▶ NULL
└───┘
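A minimal separate-chaining table can be sketched in a few lines of Python (illustrative only; this is not CPython's approach):

```python
class ChainedTable:
    """Toy hash table resolving collisions with per-slot lists."""

    def __init__(self, size=5):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # colliding keys pile up in the list

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable()
t.put(1, "one")
t.put(6, "six")    # 6 % 5 == 1 % 5 → same bucket, chained behind key 1
assert t.get(1) == "one" and t.get(6) == "six"
```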
Open Addressing (Python's Approach)
Python uses open addressing — data is stored directly in the hash slots. If a slot is occupied, try another one.
Insert key3 (hashes to slot 3, already occupied by key1):
┌───┬───────┐
│ 0 │ │
├───┼───────┤
│ 1 │ │
├───┼───────┤
│ 2 │ │
├───┼───────┤
│ 3 │ key1 │ ← slot 3 occupied
├───┼───────┤
│ 4 │ │
├───┼───────┤
│ 5 │ key3 │ ← probe finds slot 5, insert here ✅
└───┴───────┘
The i-th probe tries slot (base + dᵢ) % size. The probing strategy depends on the function dᵢ.
Linear probing — dᵢ = 2 * i:
Probe sequence: slot 3 → 5 → 7 → 9 → ...
Quadratic probing — dᵢ = i²:
Probe sequence: slot 3 → 4 → 7 → 12 → ...
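Both sequences fall out of the same (base + dᵢ) % size formula, just with different dᵢ:

```python
from itertools import islice

def probes(base, size, d):
    """Yield the probe sequence (base + d(i)) % size for i = 0, 1, 2, ..."""
    i = 0
    while True:
        yield (base + d(i)) % size
        i += 1

linear    = list(islice(probes(3, 16, lambda i: 2 * i), 4))
quadratic = list(islice(probes(3, 16, lambda i: i * i), 4))
assert linear == [3, 5, 7, 9]
assert quadratic == [3, 4, 7, 12]
```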
Both methods have a weakness: a fixed probe sequence increases collision probability when multiple keys share the same starting slot.
key1, key2, key3 all hash to slot 1:
Fixed probe sequence: 1 → 3 → 7 → 13 → ...
All three follow the SAME path → high collision rate ❌
Python's Perturbation-Based Probing
Python improves on this by incorporating the object's hash value into the probe sequence, making each key's probe path unique:
key (hash=0b...0101): probe path → 1 → 6 → 3 → ...
key2 (hash=0b...1010): probe path → 1 → 4 → 9 → ...
key3 (hash=0b...1111): probe path → 1 → 8 → 2 → ...
The implementation is in lookdict inside Objects/dictobject.c:
static Py_ssize_t _Py_HOT_FUNCTION
lookdict(PyDictObject *mp, PyObject *key,
         Py_hash_t hash, PyObject **value_addr)
{
    size_t i, mask, perturb;
    PyDictKeysObject *dk;
    PyDictKeyEntry *ep0;

top:
    dk = mp->ma_keys;
    ep0 = DK_ENTRIES(dk);
    mask = DK_MASK(dk);
    perturb = hash;
    i = (size_t)hash & mask;

    for (;;) {
        Py_ssize_t ix = dk_get_index(dk, i);
        // key comparison omitted

        // compute next slot:
        // perturb incorporates the hash value → unique probe sequence per key
        perturb >>= PERTURB_SHIFT;
        i = (i*5 + perturb + 1) & mask;
    }
    Py_UNREACHABLE();
}
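The recurrence is small enough to model in Python and watch the divergence directly (a sketch using a hypothetical 8-slot table; PERTURB_SHIFT = 5 matches CPython's constant):

```python
PERTURB_SHIFT = 5   # same constant CPython uses

def probe_path(h, mask, n=5):
    """First n slots visited for hash value h, per CPython's recurrence."""
    perturb = h
    i = h & mask
    path = [i]
    for _ in range(n - 1):
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
        path.append(i)
    return path

# Two hash values with identical low bits start at the same slot...
a = probe_path(0b0000101, mask=7)
b = probe_path(0b1111101, mask=7)
assert a[0] == b[0]
# ...but the high bits feed in through perturb, so the paths diverge:
assert a != b
```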
Hash Attacks
Before Python 3.3, hash values were computed solely from the object itself — no randomness involved. This meant the same object always produced the same hash value across any two interpreter instances:
# Python 2, process 1 (pid=2878)
>>> hash('fasion')
3629822619130952182
# Python 2, process 2 (pid=2915)
>>> hash('fasion')
3629822619130952182 # identical ← vulnerability
An attacker could craft a large number of keys that all hash to the same value, then POST them as JSON to a Python 2 web server. The server would build a dict from this data — and every single key would collide. Hash table performance degrades from O(1) to O(N), effectively taking the server down. This is a hash flooding attack.
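The quadratic blow-up is easy to reproduce by forcing every key into the same probe chain. This sketch counts equality checks instead of wall-clock time, using a deliberately bad __hash__ to stand in for attacker-crafted keys:

```python
class Colliding:
    """Every instance hashes to the same value: worst-case dict keys."""
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42                      # all keys share one probe chain

    def __eq__(self, other):
        Colliding.eq_calls += 1        # count comparisons the dict performs
        return isinstance(other, Colliding) and self.n == other.n

N = 100
d = {Colliding(i): i for i in range(N)}

# The i-th insert walks past all i earlier keys → O(N²) total work:
assert len(d) == N
assert Colliding.eq_calls >= N * (N - 1) // 2
```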
The Fix: Hash Salting (Python 3.3+)
The solution is simple — add a salt:
- When the interpreter starts, generate a random number as the salt.
- The hash function incorporates both the object itself and the random salt.
Since the attacker cannot know the interpreter's internal random salt, they cannot pre-compute colliding keys.
# Python 3.7, process 1
>>> hash('fasion')
7411353060704220518
# Python 3.7, process 2
>>> hash('fasion')
1784735826115825426 # different every time ✅
Python's hash algorithm is implemented in Python/pyhash.c.
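You can observe the salting from the outside. The PYTHONHASHSEED environment variable overrides the random salt, making str hashes reproducible across interpreters again (handy for debugging, dangerous in production):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Compute hash('fasion') in a brand-new interpreter process."""
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('fasion'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# With the salt pinned, two separate interpreters agree again:
assert hash_in_fresh_interpreter("0") == hash_in_fresh_interpreter("0")
```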
The Delete Operation
Now let's look at how dict handles deletion. Consider this scenario:
State after inserting key1, key2, key3:
Hash Index                          Entry Storage
┌───┬──────┐                        ┌───┬──────┐
│ 0 │      │                        │ 0 │ key1 │
├───┼──────┤                        ├───┼──────┤
│ 1 │  1   │──▶ key2 (entry 1)      │ 1 │ key2 │
├───┼──────┤                        ├───┼──────┤
│ 2 │      │                        │ 2 │ key3 │
├───┼──────┤                        └───┴──────┘
│ 3 │      │
├───┼──────┤
│ 4 │      │
├───┼──────┤
│ 5 │  0   │──▶ key1 (entry 0)
├───┼──────┤
│ 6 │  2   │──▶ key3 (entry 2) ← key3 collided at slot 1, probed to slot 6
└───┴──────┘
If we delete key2 by simply marking slot 1 as EMPTY:
┌───┬──────┐
│ 1 │EMPTY │ ← slot 1 cleared
├───┼──────┤
│ 6 │ 2 │ key3 hashes to slot 1, probes... finds EMPTY → thinks key3 doesn't exist ❌
└───┴──────┘
This breaks the probe chain — key3 becomes unreachable!
The Fix: DUMMY Tombstone
Instead of marking the slot as EMPTY, Python marks it as DUMMY — a tombstone that signals "keep probing, don't stop here":
┌───┬──────┐
│ 1 │DUMMY │ ← tombstone, probe chain intact
├───┼──────┤
│ 6 │ 2 │ key3 hashes to slot 1, sees DUMMY → continues probing → finds slot 6 ✅
└───┴──────┘
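A toy open-addressing table makes the tombstone logic concrete. This sketch uses simple linear probing instead of CPython's perturbed probing, and combines the index and storage into one array for brevity:

```python
EMPTY = object()   # never-written slot: terminates a probe chain
DUMMY = object()   # tombstone: deleted, but keep probing past it

class ToyDict:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)   # linear probing for the sketch

    def put(self, key, value):
        target = None
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:
                # key not present: reuse the first tombstone if we saw one
                self.slots[target if target is not None else i] = (key, value)
                return
            if entry is DUMMY:
                if target is None:
                    target = i
            elif entry[0] == key:
                self.slots[i] = (key, value)   # overwrite in place
                return

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:
                raise KeyError(key)            # EMPTY ends the chain
            if entry is not DUMMY and entry[0] == key:
                return entry[1]                # DUMMY: keep probing

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is EMPTY:
                raise KeyError(key)
            if entry is not DUMMY and entry[0] == key:
                self.slots[i] = DUMMY          # tombstone, not EMPTY
                return

d = ToyDict()
d.put(1, "a")        # lands in slot 1
d.put(9, "b")        # 9 % 8 == 1 → collides, probes to slot 2
d.delete(1)          # slot 1 becomes DUMMY, not EMPTY
assert d.get(9) == "b"   # probe walks past the tombstone and finds key 9
```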
Slot state constants are defined in Objects/dict-common.h:
#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2) /* Used internally */
#define DKIX_ERROR (-3)
Reusing Deleted Entry Storage
Python doesn't bother reusing deleted entry storage slots. When a new key is inserted, it simply uses the next available entry:
Insert key4 after deleting key2:
Entry storage:
┌───┬──────┐
│ 0 │ key1 │
├───┼──────┤
│ 1 │ key2 │ ← deleted (wasted, ignored)
├───┼──────┤
│ 2 │ key3 │
├───┼──────┤
│ 3 │ key4 │ ← new entry goes here ✅
└───┴──────┘
Yes, some storage slots may be wasted — but that's acceptable.
Resize and Rebuild
When entry storage is exhausted, Python performs a resize: allocates a brand-new hash table and migrates all live entries into it.
Before resize: After resize:
┌───┬──────┐ ┌───┬──────┐
│ 0 │ key1 │ │ 0 │ key1 │
├───┼──────┤ ──rebuild──▶ ├───┼──────┤
│ 1 │ key2 │(deleted) │ 1 │ key3 │ ← only live entries copied
├───┼──────┤ ├───┼──────┤
│ 2 │ key3 │ │ 2 │ key4 │
├───┼──────┤ ├───┼──────┤
│ 3 │ key4 │ │ │ │ ← fresh empty slots
└───┴──────┘ └───┴──────┘
The new table size is determined by the current number of live elements, so a resize can result in expansion, shrinkage, or no change in capacity. Either way, after the rebuild, fresh entry slots are available again.
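This is a CPython implementation detail you can observe from Python: deletions alone never give memory back, because deleted entries are only reclaimed by a later resize. Exact sizes vary by CPython version, so this sketch asserts only the invariant:

```python
import sys

d = {i: i for i in range(100)}
grown = sys.getsizeof(d)

for i in range(100):
    del d[i]

# All entries deleted, yet the table keeps its grown footprint:
# tombstoned entries wait for a resize/rebuild to be reclaimed.
assert len(d) == 0
assert sys.getsizeof(d) == grown
```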