Bhavesh Kukreja

A Deep Dive Into Python Dictionaries

my_dict = {'name': 'Guido', 'year': 1991}
print(my_dict['name'])

What you're seeing is my personal favorite and one of the most fascinating data structures in Python. It's fast, flexible, and probably the most important data structure in the language. Dictionaries are everywhere: we use them for everything from simple data storage to making your LeetCode submissions faster.

But have you ever wondered how they do it? How can my_dict['name'] be almost instantaneous, even if the dictionary has millions of keys? How does it remember the order you added items in? What happens when two different keys try to claim the same spot? That's what we're going to explore today!

Just like with lists, the Python we run is an executable program called CPython, written in C. Every time you create a dictionary or access a key, you're telling CPython to run specific C functions. To find the ground truth, we'll be looking at the dictobject.c file from CPython 3.11.

Note: A lot of this code has several layers to it. I will simplify and abstract away parts of the functions to explain them, without spending too much time on the details of the actual implementation, to avoid information overload.
Feel free to check out the source code through the hyperlinks!

Hashing and Objects

Before we dive into the dictionary struct itself, we need to understand the principle it is built on: the hash table.

A hash table is like a filing cabinet. Instead of searching through every file one by one, you use a special function called a hash function, which takes your key (like the word 'name') and instantly tells you which drawer it's in. This is why lookups are so fast. The main challenge for any hash table is what to do when the hash function sends two different keys to the same drawer. This is called a collision, and how a dictionary handles collisions is essential to its performance.
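You can get a feel for this from Python itself. Here's a toy illustration (not CPython's actual code): the key's hash, masked down to a small power-of-two table size, picks the "drawer".

# A toy illustration of the idea, not CPython's actual code.
key = 'name'
table_size = 8                       # dictionaries use power-of-two table sizes
slot = hash(key) & (table_size - 1)  # same as hash(key) % 8 for a power-of-two size
print(f"hash({key!r}) = {hash(key)} -> slot {slot}")
# Note: string hashes are randomized per interpreter run, so the slot will vary between runs.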

Of course, a dictionary is still a Python object. Just like lists, every object in Python starts with the same C struct, the PyObject.

typedef struct _object {
    Py_ssize_t ob_refcnt; // The reference counter for memory management
    PyTypeObject *ob_type; // The pointer that says "I am a dictionary"
} PyObject;
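Both of those fields are visible from Python, in a roundabout way. A quick sketch (the exact refcount you see depends on the interpreter, and it includes a temporary reference created by the call itself):

import sys

d = {}
print(type(d))             # <class 'dict'> -- this is ob_type at work
print(sys.getrefcount(d))  # ob_refcnt, plus a temporary reference from the call itself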

The Data Structure: PyDictObject

Now, what's fascinating about Python dictionaries (since Python 3.6) is that they separate the hash table from the data storage. This separation is what allows them to be fast, ordered, and memory-efficient all at once.

The structure is split into two main parts: the PyDictObject and the PyDictKeysObject.

// The dictionary object
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;        // Number of items in dictionary
    PyDictKeysObject *ma_keys; // Pointer to the keys object, where the real work happens
    PyObject **ma_values;      // Usually NULL; used in a special case explained below
} PyDictObject;

The PyDictObject itself is just a small header. The magic is in the PyDictKeysObject it points to. But before we get to that, let's talk about ma_values. Why is it usually NULL? This points to an optimization for class instances, which the source code calls a "split table".

/*
The DictObject can be in one of two forms.

Either:
    A combined table:
        ma_values == NULL, dk_refcnt == 1.
        Values are stored in the me_value field of the PyDictKeysObject.
Or:
    A split table:
        ma_values != NULL, dk_refcnt >= 1
        Values are stored in the ma_values array.
        Only string (unicode) keys are allowed.
*/

This means most dictionaries use a "combined table", where values are stored right alongside their keys. For memory efficiency, however, many objects of the same class can use a "split table". Imagine you have a Student class with name and grade attributes, and 1,000 students. Python knows the keys will be the same for every instance, so why spend extra memory on them?

They all share one common structure for their keys (like .name and .grade), while each instance stores its unique values in a separate, tiny array. This approach saves a huge amount of memory by not duplicating the key names for every single object.
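There's no direct Python API for key sharing, but sys.getsizeof gives a rough hint of it. The exact numbers vary by CPython version and platform, but on versions that use split tables the per-instance __dict__ is typically smaller than an equivalent standalone dictionary:

import sys

class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

students = [Student(f"student{i}", i % 100) for i in range(1000)]

# Each instance's __dict__ can share one keys object ('name', 'grade')...
print(sys.getsizeof(students[0].__dict__))
# ...while a standalone dict with the same keys carries its own combined table.
print(sys.getsizeof({'name': 'student0', 'grade': 0}))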

Now, let's look at the code that drives both types of dictionaries.

// The struct that holds all the important stuff
typedef struct {
    // ... metadata like refcount and size ...
    int8_t *dk_indices;  // The actual hash table!
    PyDictKeyEntry *dk_entries; // The array where data is stored
} PyDictKeysObject;

Attribute Breakdown

  1. dk_indices

    What it is: This is the real hash table. But it doesn't store your keys or values. It stores integers.

    What it does: Each integer is an index that tells you where to find the actual data in the other array, dk_entries. Think of it as a card catalog in a library: it doesn't hold the book, it just tells you which shelf the book is on.

    Why it matters: Keeping the hash table compact with just integers (which can be small, like 1 or 2 bytes each) saves a ton of memory and makes lookups very fast.

  2. dk_entries

    What it is: This is a simple array that stores the actual key, value, and the key's hash value together in a struct.

    What it does: When you insert a new item into the dictionary, it's simply appended to the end of this array.

    Why it matters: This is how dictionaries preserve insertion order. By always adding new items to the end of this list, the dictionary naturally remembers the order they were added in. Iterating over the dictionary is as simple as iterating through this dk_entries array from start to finish.

In short, dk_indices is optimized for fast, unordered lookups, while dk_entries is optimized for ordered storage.
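To make this two-array split concrete, here's a heavily simplified Python sketch of the same idea. It is not how CPython is written (no resizing, no deletions, and plain linear probing instead of the real probe sequence), but it shows how an integer index table plus an append-only entries array gives you both fast lookups and insertion order:

# A toy model of the indices/entries design. Not CPython's algorithm!
class TinyDict:
    def __init__(self):
        self.indices = [None] * 8   # the "dk_indices" hash table: slot -> entry position
        self.entries = []           # the "dk_entries" array: (hash, key, value) in insertion order

    def _probe(self, key):
        # Walk slots until we find the key or an empty slot (simple linear probing).
        i = hash(key) & (len(self.indices) - 1)
        while True:
            ix = self.indices[i]
            if ix is None or self.entries[ix][1] == key:
                return i, ix
            i = (i + 1) & (len(self.indices) - 1)

    def __setitem__(self, key, value):
        slot, ix = self._probe(key)
        if ix is None:                            # new key: append the entry, record its position
            self.indices[slot] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:                                     # existing key: update in place
            self.entries[ix] = (hash(key), key, value)

    def __getitem__(self, key):
        _, ix = self._probe(key)
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def __iter__(self):                           # iteration order == insertion order
        return (key for _, key, _ in self.entries)

d = TinyDict()
d['name'] = 'Guido'
d['year'] = 1991
print(d['name'])   # Guido
print(list(d))     # ['name', 'year'] -- insertion order preserved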

The Dictionary in Action

Now that we understand the two-part data structure, let's connect it to the Python dictionary methods we use every day. We’re going to see how CPython manipulates the dk_indices and dk_entries arrays to do its work.

Creation of a Dictionary - PyDict_New

This is the code you write to create a dictionary. Very simple, right? Its underlying C code is just as simple.

d = {}

The code above triggers a call to PyDict_New.

// Public function to create a new dictionary.
PyObject *
PyDict_New(void)
{
    // Grab a reference to the global, shared, empty keys object.
    dictkeys_incref(Py_EMPTY_KEYS);
    // Create the dict object, pointing it to the empty keys.
    return new_dict(Py_EMPTY_KEYS, NULL, 0, 0);
}

Snippet Breakdown

  • Py_EMPTY_KEYS: Instead of allocating a new PyDictKeysObject (with its indices and entries arrays) every time you create an empty dict, CPython maintains a single, global, immutable empty keys object.
  • new_dict(...): This function just allocates the shell PyDictObject and sets its ma_keys pointer to that shared Py_EMPTY_KEYS object. ma_used is set to 0.

The Takeaway

The creation of an empty dictionary is an incredibly cheap O(1) operation. Python bets that you'll create many empty dictionaries that might not even be used, so it just avoids allocating any real memory for the hash table or entries until you insert the very first item.
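You can see a hint of this from Python. The exact byte counts vary across CPython versions and platforms, but the jump after the first insertion is the real keys object being allocated:

import sys

d = {}
print(sys.getsizeof(d))  # tiny: ma_keys still points at the shared empty keys object
d['first'] = 1
print(sys.getsizeof(d))  # bigger: a real PyDictKeysObject has now been allocated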


Accessing an Item - _Py_dict_lookup

To get a value from a dictionary, you use its key inside square brackets.

value = my_dict['name']

To access an item, _Py_dict_lookup is called.

Py_ssize_t
_Py_dict_lookup(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject **value_addr)
{
    PyDictKeysObject *dk = mp->ma_keys;
    size_t mask = DK_MASK(dk);
    size_t i = (size_t)hash & mask; // The initial index calculation.
    size_t perturb = hash;

    // The main probing loop.
    for (;;) {
        Py_ssize_t ix = dictkeys_get_index(dk, i); // Read from dk_indices.

        // If the slot is empty, the key is not here.
        if (ix == DKIX_EMPTY) {
            *value_addr = NULL;
            return DKIX_EMPTY;
        }
        // If the slot is active (not empty or dummy)...
        if (ix >= 0) {
            PyDictKeyEntry *ep = &DK_ENTRIES(dk)[ix]; // Go to the entry.
            // Check if the keys are actually the same.
            if (ep->me_key == key || PyObject_RichCompareBool(ep->me_key, key, Py_EQ)) {
                *value_addr = ep->me_value; // Success! Return the value.
                return ix;
            }
        }
        // Collision! Calculate the next index to probe. 
        // The logic is explained in the following paragraphs.
        perturb >>= PERTURB_SHIFT;
        i = (i * 5 + perturb + 1) & mask;
    }
}

The most interesting part of this function is the last two lines inside the loop: the collision resolution algorithm. Sometimes the hash calculation for your key points to a slot that's already occupied by a different key, and we need to resolve that collision.

The CPython developers left a fantastic, detailed comment explaining its design.

Source Code comments:

/* Major subtleties ahead:  Most hash schemes depend on having a "good" hash
   function, in the sense of simulating randomness.  Python doesn't:  its most
   important hash functions (for ints) are very regular...
   ...
   The other half of the strategy is to get the other bits of the hash code
   into play.  This is done by initializing a (unsigned) vrbl "perturb" to the
   full hash code, and changing the recurrence to:
    perturb >>= PERTURB_SHIFT;
   j = (5*j) + 1 + perturb;
   use j % 2**i as the next table index;
*/

What this means is that CPython is designed to be robust even with "bad" hash functions. The 5*j + 1 part ensures the probing jumps around the table in a scrambled, pseudo-random way that will eventually visit every slot. The perturb variable is the secret sauce: it's initialized with the full hash value, and with each collision its higher-order bits are mixed into the calculation. This ensures that every bit of the hash value contributes to finding the next slot, effectively breaking up clusters and making the probe sequence unique for each hash value.
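If you want to see the recurrence in action, here's a small Python re-creation of the probe sequence for an 8-slot table. It's only an approximation (CPython works with the hash as an unsigned C integer, which we mimic with a 64-bit mask), but it shows how the probed slots jump around:

PERTURB_SHIFT = 5  # the value CPython uses

def probe_sequence(h, mask, steps=8):
    perturb = h & 0xFFFFFFFFFFFFFFFF    # treat the hash as an unsigned 64-bit value
    i = perturb & mask                  # the initial slot
    slots = [i]
    for _ in range(steps - 1):
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
        slots.append(i)
    return slots

print(probe_sequence(hash('name'), mask=7))
# The exact slots vary run to run (string hashes are randomized), but once perturb
# shifts down to zero, the 5*i + 1 recurrence cycles through every slot in the table.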

The Takeaway

Item access is O(1) on average because, most of the time, the first or second probe lands on the correct key. It's a direct calculation, not a search. The probing loop is a safety net that handles the inevitable cases where multiple keys hash to the same initial spot.


Inserting an Item - insertdict

Adding a new key-value pair or updating an existing one is a fundamental operation.

my_dict['year'] = 1991

Now the insertdict function takes over. It's essentially a lookup followed by a placement.

static int
insertdict(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject *value)
{
    PyObject *old_value;
    // First, do a lookup to see if the key already exists.
    Py_ssize_t ix = _Py_dict_lookup(mp, key, hash, &old_value);

    // If the key was not found...
    if (ix == DKIX_EMPTY) {
        // ...check if we need to make the table bigger.
        if (mp->ma_keys->dk_usable <= 0) {
            // This is the expensive O(n) call to resize the whole dict.
            if (insertion_resize(mp, 1) < 0)
                goto Fail;
        }

        // Find an empty or dummy slot in the hash table.
        Py_ssize_t hashpos = find_empty_slot(mp->ma_keys, hash);
        Py_ssize_t n = mp->ma_keys->dk_nentries;

        // Append the new data to the end of the entries array.
        PyDictKeyEntry *ep = &DK_ENTRIES(mp->ma_keys)[n];
        ep->me_key = key;
        ep->me_hash = hash;
        ep->me_value = value;

        // Store the new entry's index (n) in the hash table slot.
        dictkeys_set_index(mp->ma_keys, hashpos, n);

        // Update all the counters.
        mp->ma_used++;
        mp->ma_keys->dk_usable--;
        mp->ma_keys->dk_nentries++;
        return 0;
    }

    // If the key already existed, just update the value.
    if (old_value != value) {
        DK_ENTRIES(mp->ma_keys)[ix].me_value = value;
        Py_XDECREF(old_value);
    }
    return 0;
}

The Takeaway

Like lists, dictionaries have a fast path and a slow path for adding items. Most insertions are O(1) because there's already space. Occasionally, an expensive O(n) dictresize occurs where the entire table is rebuilt. The growth strategy ensures this happens infrequently, so the average or "amortized" cost of insertion remains O(1).
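You can watch this happen from Python by tracking sys.getsizeof as you insert. The exact lengths at which the size jumps depend on the CPython version (and the very first jump is the initial table allocation rather than a resize), but the pattern of long flat stretches punctuated by occasional jumps is the amortized O(1) story in action:

import sys

d = {}
last = sys.getsizeof(d)
for i in range(50):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:  # a jump in size means the table was just (re)allocated
        print(f"table grew at len={len(d)}: {last} -> {size} bytes")
        last = size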


Deleting an Item - delitem_common

To remove an item from a dictionary, you use the del keyword.

del my_dict['name']     # You can also use my_dict.pop('name')
                        # pop will also return what was deleted

Deleting an item can't simply erase the entry, as that could break the collision search chain. The delitem_common function reveals the solution.

static int
delitem_common(PyDictObject *mp, Py_hash_t hash, Py_ssize_t ix, PyObject *old_value)
{
    PyObject *old_key;
    // Find the item's position in the hash table (dk_indices).
    Py_ssize_t hashpos = lookdict_index(mp->ma_keys, hash, ix);

    // This is the important part: mark the hash table slot as a DUMMY.
    dictkeys_set_index(mp->ma_keys, hashpos, DKIX_DUMMY);

    // Get the entry from the data array (dk_entries).
    PyDictKeyEntry *ep = &DK_ENTRIES(mp->ma_keys)[ix];
    old_key = ep->me_key;

    // Null out the key and value in the data array.
    ep->me_key = NULL;
    ep->me_value = NULL;

    // Free the old key and value and update the used count.
    Py_DECREF(old_key);
    Py_DECREF(old_value);
    mp->ma_used--;
    return 0;
}

The Takeaway

Deletion is an O(1) operation on average. It's just a lookup and a couple of memory writes. It doesn't require shifting any other elements. The cost is the creation of DUMMY slots, which add a tiny bit of overhead to future lookups until a resize cleans them up.
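One nice consequence of this design: deleting an item leaves the order of the remaining items untouched, and a re-inserted key simply goes to the end of the entries array.

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
del d['b']
print(list(d))   # ['a', 'c', 'd'] -- the remaining keys keep their insertion order
d['b'] = 5
print(list(d))   # ['a', 'c', 'd', 'b'] -- re-inserted keys are appended at the end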


Bonus Fact: Why d['missing'] Crashes but d.get('missing') Doesn't

You've probably noticed that accessing a missing key with square brackets raises a KeyError, while using the .get() method simply returns None.
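In code, the difference looks like this:

my_dict = {'name': 'Guido', 'year': 1991}

print(my_dict.get('missing'))         # None
print(my_dict.get('missing', 'n/a'))  # or a fallback of your choosing

try:
    my_dict['missing']
except KeyError as exc:
    print('KeyError:', exc)           # KeyError: 'missing'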

This difference in behavior is a cool historical artifact. The .get() method is based on an older, internal C function called PyDict_GetItem, which was built for speed and simplicity in an era when exceptions were not expected for this operation. It was designed to just return NULL (which becomes None) on failure without making a fuss.

Source

/* Note that, for historical reasons, PyDict_GetItem() suppresses all errors
 * that may occur (originally dicts supported only string keys, and exceptions
 * weren't possible).  So, while the original intent was that a NULL return
 * meant the key wasn't present, in reality it can mean that, or that an error
 * (suppressed) occurred while computing the key's hash, or that some error
 * (suppressed) occurred when comparing keys in the dict's internal probe
 * sequence.  A nasty example of the latter is when a Python-coded comparison
 * function hits a stack-depth error, which can cause this to return NULL
 * even if the key is present.
 */

The square bracket access (d['key']), however, is handled by a more modern, user-facing C function. It uses the same fast lookup logic but adds an extra step: it explicitly checks for that NULL result and takes on the responsibility of raising the KeyError we're all familiar with.


The Final Takeaway: What a Dictionary Really Is

The Python dictionary is not just one simple hash table. It's a hybrid data structure. It combines an integer-only hash table for fast lookups (dk_indices) with a compact array for ordered data storage (dk_entries).

This design is its power. It gives us the O(1) average-case speed we expect from a hash table while also providing the guaranteed insertion order we've come to rely on since Python 3.7. It's always so fascinating to learn how things work under the hood!

Thank you for reading!
