James Lee

Python dict Internals: Hash Tables, Collision Resolution, and Hash Attacks

Hash Values

Python's built-in hash() function returns an object's hash value. Hash tables rely on hash values to index elements:

object ──▶ hash(object) ──▶ hash value ──▶ slot index in hash table

For a hash table to work correctly, key objects must satisfy two conditions:

  1. The hash value must not change during the object's lifetime.
  2. Objects that compare equal must have the same hash value.

Objects satisfying these conditions are called hashable. Only hashable objects can be used as keys in a hash table. This is why dict, set, and other hash-table-backed containers require hashable keys.

Among Python's built-in types, immutable objects like int, str, and tuple are hashable (a tuple only if all of its elements are), while mutable containers like list and dict are not:

>>> hash([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Unhashable objects cannot be used as dict keys:

>>> {[]: 'list is not hashable'}
TypeError: unhashable type: 'list'

>>> {{}: 'dict is not hashable either'}
TypeError: unhashable type: 'dict'

User-defined objects are hashable by default — their hash value is derived from their memory address, and any two distinct objects are considered unequal:

>>> class A:
...     pass
...
>>> a = A()
>>> b = A()
>>> hash(a), hash(b)
(-9223372036573452351, -9223372036573452365)
>>> a == b
False

How Is the Hash Value Computed?

The answer is a hash function. As we know from the object model, an object's behavior is determined by its type object. Hash computation is one such behavior: it is stored in the tp_hash function pointer of the type object, and the built-in hash() delegates to tp_hash to compute and return the value.

For str, the hash function is unicode_hash in Objects/unicodeobject.c:

PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",                        /* tp_name */
    sizeof(PyUnicodeObject),      /* tp_basicsize */
    // ...
    (hashfunc) unicode_hash,      /* tp_hash */
    // ...
};

For user-defined objects, you can override the default hash behavior by implementing __hash__ and __eq__. For example, if a Tag instance is uniquely identified by its value field:

class Tag:

    def __init__(self, value, title):
        self.value = value
        self.title = title

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value
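As a quick check of the contract, two Tag instances sharing a value behave as a single dict key. The class is repeated here so the snippet is self-contained, and the titles are made up:

```python
class Tag:
    def __init__(self, value, title):
        self.value = value
        self.title = title

    def __hash__(self):
        return hash(self.value)      # identity is defined by value alone

    def __eq__(self, other):
        return self.value == other.value

t1 = Tag("python", "Python Articles")
t2 = Tag("python", "All Things Python")  # same value, different title

assert hash(t1) == hash(t2)   # equal objects must hash equal
assert t1 == t2

d = {t1: 1}
d[t2] = 2                     # t2 is treated as the same key as t1
assert len(d) == 1 and d[t1] == 2
```

Note that `__eq__` as written assumes `other` also has a `value` attribute; a production version would return `NotImplemented` for foreign types.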

Since hash values are used frequently and never change during an object's lifetime, they are cached inside the object to avoid recomputation. For str, the internal hash field stores the cached value.

A good hash function must distribute hash values uniformly across the hash space. Even a tiny difference between two inputs should produce wildly different hash values (the avalanche effect).
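A rough way to observe this spread (`differing_bits` is an illustrative helper; since Python 3.3 string hashes are salted per process, so the exact values change between runs):

```python
def differing_bits(a: int, b: int) -> int:
    """Count how many of the low 64 bits differ between two hash values."""
    return bin((a ^ b) & (2**64 - 1)).count("1")

h1 = hash("fasion")
h2 = hash("fasioN")  # a single character changed

assert h1 != h2
# with a good hash, roughly half of the 64 bits flip
print(differing_bits(h1, h2))
```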


Hash Collisions

Two different objects can have the same hash value. Also, the number of slots in a hash table is far smaller than the full hash value space. This means multiple keys can map to the same slot — a hash collision.

hash(key1) % table_size = 3  ┐
                              ├─▶ both map to slot 3 → collision!
hash(key3) % table_size = 3  ┘
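Small integers hash to themselves in CPython, which makes collisions easy to reproduce. The 8-slot table below is hypothetical; CPython actually computes `hash & mask` with a power-of-two table size, which is equivalent to `%` here:

```python
table_size = 8           # power of two, so "& mask" equals "% table_size"
mask = table_size - 1

key1, key3 = 3, 11       # hash(3) == 3, hash(11) == 11

assert (hash(key1) & mask) == 3
assert (hash(key3) & mask) == 3   # both map to slot 3 → collision
```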

There are two common strategies to resolve hash collisions:

  • Separate chaining
  • Open addressing

Separate Chaining

Each slot maintains a linked list. All keys that hash to the same slot are stored in the corresponding list:

Hash Index        Linked Lists
┌───┐
│ 0 │──▶ NULL
├───┤
│ 1 │──▶ [key2] ──▶ NULL
├───┤
│ 2 │──▶ NULL
├───┤
│ 3 │──▶ [key1] ──▶ [key3] ──▶ NULL   ← both hash to slot 3
├───┤
│ 4 │──▶ NULL
└───┘
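A minimal separate-chaining table might look like this. `ChainedTable` is an illustrative sketch only; as the next section explains, CPython does not use chaining:

```python
class ChainedTable:
    """Toy hash table: each slot holds a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or empty: extend the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable()
t.put(3, "key1")
t.put(11, "key3")   # 3 and 11 both land in slot 3; chained, not lost
assert t.get(3) == "key1" and t.get(11) == "key3"
```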

Open Addressing (Python's Approach)

Python uses open addressing — data is stored directly in the hash slots. If a slot is occupied, try another one.

Insert key3 (hashes to slot 3, already occupied by key1):

┌───┬───────┐
│ 0 │       │
├───┼───────┤
│ 1 │       │
├───┼───────┤
│ 2 │       │
├───┼───────┤
│ 3 │ key1  │ ← slot 3 occupied
├───┼───────┤
│ 4 │       │
├───┼───────┤
│ 5 │ key3  │ ← probe finds slot 5, insert here ✅
└───┴───────┘

The i-th probe tries slot (base + dᵢ) % size. The probing strategy depends on the function dᵢ.

Linear probing, with a fixed step (here dᵢ = 2i):

Probe sequence: slot 3 → 5 → 7 → 9 → ...

Quadratic probing (dᵢ = i²):

Probe sequence: slot 3 → 4 → 7 → 12 → ...

Both methods share a weakness: the probe sequence depends only on the starting slot, so every key that hashes there follows the exact same path, and collisions compound.

key, key2, key3 all hash to slot 1:
Fixed probe sequence: 1 → 3 → 7 → 13 → ...
All three follow the SAME path → high collision rate ❌

Python's Perturbation-Based Probing

Python improves on this by incorporating the object's hash value into the probe sequence, making each key's probe path unique:

key  (hash=0b...0101): probe path → 1 → 6 → 3 → ...
key2 (hash=0b...1010): probe path → 1 → 4 → 9 → ...
key3 (hash=0b...1111): probe path → 1 → 8 → 2 → ...

The implementation is in lookdict inside Objects/dictobject.c:

static Py_ssize_t _Py_HOT_FUNCTION
lookdict(PyDictObject *mp, PyObject *key,
         Py_hash_t hash, PyObject **value_addr)
{
    size_t i, mask, perturb;
    PyDictKeysObject *dk;
    PyDictKeyEntry *ep0;

top:
    dk = mp->ma_keys;
    ep0 = DK_ENTRIES(dk);
    mask = DK_MASK(dk);
    perturb = hash;
    i = (size_t)hash & mask;

    for (;;) {
        Py_ssize_t ix = dk_get_index(dk, i);
        // (omitted) return here if ix == DKIX_EMPTY or the entry's key matches

        // compute next slot
        // perturb incorporates the hash value → unique probe sequence per key
        perturb >>= PERTURB_SHIFT;
        i = (i*5 + perturb + 1) & mask;
    }
    Py_UNREACHABLE();
}
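The recurrence can be reproduced in pure Python to watch two colliding keys diverge (`probe_sequence` is an illustrative helper; PERTURB_SHIFT is 5 in CPython):

```python
PERTURB_SHIFT = 5

def probe_sequence(h, mask, n):
    """First n slots probed for a key with hash h in a table of size mask+1."""
    perturb = h
    i = h & mask
    seq = [i]
    for _ in range(n - 1):
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
        seq.append(i)
    return seq

# Two keys landing in the same initial slot diverge quickly,
# because perturb feeds the high hash bits into the walk.
a = probe_sequence(0b0001, mask=7, n=4)
b = probe_sequence(0b1111111001, mask=7, n=4)
assert a[0] == b[0] == 1        # same starting slot...
assert a[1:] != b[1:]           # ...different probe paths
```

Once perturb decays to zero, the step i = 5i + 1 (mod 2^k) is a full-period linear congruential recurrence, so the probe is guaranteed to visit every slot eventually.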

Hash Attacks

Before Python 3.3, hash values were computed solely from the object itself — no randomness involved. This meant the same object always produced the same hash value across any two interpreter instances:

# Python 2, process 1 (pid=2878)
>>> hash('fasion')
3629822619130952182

# Python 2, process 2 (pid=2915)
>>> hash('fasion')
3629822619130952182   # identical ← vulnerability

An attacker can craft a large number of keys that all hash to the same value and POST them as JSON to a Python 2 web server. Parsing the request body builds a dict in which every single key collides, so lookups degrade from O(1) to O(N) and the server grinds to a halt. This is a hash flooding attack.
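The attack is easy to sketch with integers, whose hashes are not randomized even today (the fixed 8-slot table is hypothetical; a real dict grows, but an attacker sizes the key set accordingly, and str keys are the ones the 3.3 salt protects):

```python
# Integer hashes are not salted: hash(n) == n for small n,
# so keys that collide in an 8-slot table are trivial to enumerate.
table_size = 8
colliding = [3 + i * table_size for i in range(1000)]  # 3, 11, 19, 27, ...

# every single key lands in slot 3 of the hypothetical 8-slot table
assert all(hash(k) % table_size == 3 for k in colliding)
```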

The Fix: Hash Salting (Python 3.3+)

The fix is simple. Add a salt:

  1. When the interpreter starts, generate a random number as the salt.
  2. The hash function incorporates both the object itself and the random salt.

Since the attacker cannot know the interpreter's internal random salt, they cannot pre-compute colliding keys.

# Python 3.7, process 1
>>> hash('fasion')
7411353060704220518

# Python 3.7, process 2
>>> hash('fasion')
1784735826115825426   # different every time ✅

Python's hash algorithm is implemented in Python/pyhash.c.
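The salt is observable from the outside: with the standard PYTHONHASHSEED environment variable you can randomize or pin it per process (`hash_in_subprocess` is an illustrative helper):

```python
import os
import subprocess
import sys

def hash_in_subprocess(env):
    """Run hash('fasion') in a fresh interpreter and return the result."""
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('fasion'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

random_env = {**os.environ}
random_env.pop("PYTHONHASHSEED", None)             # random salt per process
fixed_env = {**os.environ, "PYTHONHASHSEED": "0"}  # randomization disabled

assert hash_in_subprocess(random_env) != hash_in_subprocess(random_env)
assert hash_in_subprocess(fixed_env) == hash_in_subprocess(fixed_env)
```

PYTHONHASHSEED=0 disables randomization entirely; any other fixed integer gives a reproducible salt, which is handy for debugging.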


The Delete Operation

Now let's look at how dict handles deletion. Consider this scenario:

State after inserting key1, key2, key3:

Hash Index        Entry Storage
┌───┬──────┐      ┌───┬──────┐
│ 0 │      │      │ 0 │ key1 │
├───┼──────┤      ├───┼──────┤
│ 1 │  1   │──▶   │ 1 │ key2 │
├───┼──────┤      ├───┼──────┤
│ 2 │      │      │ 2 │ key3 │
├───┼──────┤      └───┴──────┘
│ 3 │      │
├───┼──────┤
│ 4 │      │
├───┼──────┤
│ 5 │  0   │──▶ key1 (slot 5 → entry 0)
├───┼──────┤
│ 6 │  2   │──▶ key3 (slot 6 → entry 2) ← key3 collided at slot 1, probed to slot 6
└───┴──────┘

If we delete key2 by simply marking slot 1 as EMPTY:

┌───┬──────┐
│ 1 │EMPTY │  ← slot 1 cleared
├───┼──────┤
│ 6 │  2   │  key3 hashes to slot 1, probes... finds EMPTY → thinks key3 doesn't exist ❌
└───┴──────┘

This breaks the probe chain — key3 becomes unreachable!

The Fix: DUMMY Tombstone

Instead of marking the slot as EMPTY, Python marks it as DUMMY — a tombstone that signals "keep probing, don't stop here":

┌───┬──────┐
│ 1 │DUMMY │  ← tombstone, probe chain intact
├───┼──────┤
│ 6 │  2   │  key3 hashes to slot 1, sees DUMMY → continues probing → finds slot 6 ✅
└───┴──────┘
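The tombstone trick fits in a few lines of Python. `OpenTable` is an illustrative sketch using linear probing for brevity; CPython uses the perturb-based probe shown earlier:

```python
EMPTY, DUMMY = object(), object()   # sentinel slot states

class OpenTable:
    """Toy open-addressing table with DUMMY tombstones (no resize)."""

    def __init__(self, size=8):
        self.keys = [EMPTY] * size
        self.vals = [None] * size

    def _probe(self, key):
        i = hash(key) % len(self.keys)
        while True:                     # assumes the table never fills up
            yield i
            i = (i + 1) % len(self.keys)

    def put(self, key, value):
        # simplified: reuses the first free slot it finds
        for i in self._probe(key):
            if self.keys[i] in (EMPTY, DUMMY) or self.keys[i] == key:
                self.keys[i], self.vals[i] = key, value
                return

    def get(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:   # true end of the probe chain
                raise KeyError(key)
            if self.keys[i] is not DUMMY and self.keys[i] == key:
                return self.vals[i]

    def delete(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:
                raise KeyError(key)
            if self.keys[i] is not DUMMY and self.keys[i] == key:
                self.keys[i] = DUMMY    # tombstone, not EMPTY
                return

t = OpenTable()
t.put(1, "key2")      # hash(1) % 8 == 1 → slot 1
t.put(9, "key3")      # hash(9) % 8 == 1 → collides, probes to slot 2
t.delete(1)           # slot 1 becomes DUMMY, probe chain stays intact
assert t.get(9) == "key3"
```

A lookup stops only at a truly EMPTY slot; DUMMY means "deleted, keep probing".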

Slot state constants are defined in Objects/dict-common.h:

#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2)  /* Used internally */
#define DKIX_ERROR (-3)

Reusing Deleted Entry Storage

Python doesn't bother reusing deleted entry storage slots. When a new key is inserted, it simply uses the next available entry:

Insert key4 after deleting key2:

Entry storage:
┌───┬──────┐
│ 0 │ key1 │
├───┼──────┤
│ 1 │ key2 │ ← deleted (wasted, ignored)
├───┼──────┤
│ 2 │ key3 │
├───┼──────┤
│ 3 │ key4 │ ← new entry goes here ✅
└───┴──────┘

Yes, some storage slots are wasted, but that is an acceptable trade-off: the holes are reclaimed during the next resize.
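This append-only entry array is visible from Python. It is why dicts preserve insertion order (guaranteed since Python 3.7) and why a deleted key's position is not recycled:

```python
d = {"key1": 1, "key2": 2, "key3": 3}
del d["key2"]          # entry 1 becomes a wasted hole
d["key4"] = 4          # appended after key3; key2's old position is not reused

assert list(d) == ["key1", "key3", "key4"]
```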

Resize and Rebuild

When entry storage is exhausted, Python performs a resize: allocates a brand-new hash table and migrates all live entries into it.

Before resize:                After resize:
┌───┬──────┐                 ┌───┬──────┐
│ 0 │ key1 │                 │ 0 │ key1 │
├───┼──────┤   ──rebuild──▶  ├───┼──────┤
│ 1 │ key2 │(deleted)        │ 1 │ key3 │  ← only live entries copied
├───┼──────┤                 ├───┼──────┤
│ 2 │ key3 │                 │ 2 │ key4 │
├───┼──────┤                 ├───┼──────┤
│ 3 │ key4 │                 │   │      │  ← fresh empty slots
└───┴──────┘                 └───┴──────┘

The new table size is determined by the current number of live elements, so a resize can result in expansion, shrinkage, or no change in capacity. Either way, after the rebuild, fresh entry slots are available again.
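The occasional-rebuild behavior can be observed with sys.getsizeof: across many insertions the reported size jumps only a handful of times, which is why insertion stays amortized O(1):

```python
import sys

d = {}
sizes = set()
for i in range(100):
    d[i] = i
    sizes.add(sys.getsizeof(d))   # changes only when a resize happened

# far fewer distinct sizes than insertions → resizes are rare
assert len(sizes) < 10
assert min(sizes) < max(sizes)    # but the table did grow
```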
