Timothy's discovery of the Instant Retrieval Cabinet had revolutionized his library work, but one morning he encountered a problem that would teach him the most important lesson about hash tables: what happens when the system breaks down.
The Dreaded Duplicate Drawer
Timothy was cataloging his poetry collection when disaster struck. He fed "Ode to Joy" into the brass calculating machine and received drawer number 247. But drawer 247 already contained "Paradise Lost"! Two completely different poems, same drawer assignment.
Timothy's mastery of the chaining cabinet had served him well; both poems simply shared drawer 247's chain, and order was restored. But his curiosity about alternative systems led him to discover a completely different approach hidden in the library's eastern wing. Here, the archivists had developed a filing system that handled collisions not by sharing drawers, but by finding new ones entirely.
The Discovery in the Eastern Wing
While exploring the library's older sections, Timothy stumbled upon a cabinet that looked similar to his Instant Retrieval system but operated by completely different rules. When he fed "Romeo and Juliet" into this cabinet's brass calculator, it produced drawer number 342. But drawer 342 already contained "Hamlet"!
Instead of adding "Romeo and Juliet" to drawer 342's chain, this system did something unexpected: it automatically tried drawer 343. Finding that one occupied by "Macbeth," it moved to drawer 344, which was empty. "Romeo and Juliet" went into drawer 344.
```python
# The eastern wing's collision resolution
original_target = 342          # "Romeo and Juliet" hashes here
if drawer_342_occupied:        # "Hamlet" lives here
    try_drawer = 343           # Check next drawer
    if drawer_343_occupied:    # "Macbeth" lives here
        try_drawer = 344       # Check next drawer
        # Found empty drawer 344 - store here!
```
Timothy had discovered "open addressing"—a collision resolution strategy that found alternative locations within the same cabinet rather than storing multiple items in one drawer.
The Linear Probe Method
The eastern wing's chief archivist, Margaret, explained their "linear probe" system to Timothy. When a collision occurred, they simply checked the next drawer in sequence until finding an empty one. If they reached the end of the cabinet (drawer 999), they wrapped around to drawer 0 and continued searching.
```python
# Margaret's linear probe explanation
def find_empty_drawer(cabinet, target_drawer):
    cabinet_size = len(cabinet)
    current_drawer = target_drawer
    while cabinet[current_drawer] is not None:
        current_drawer = (current_drawer + 1) % cabinet_size
        if current_drawer == target_drawer:  # Probed full circle
            return None                      # Cabinet full!
    return current_drawer
```
"The beauty," Margaret explained, "is that every book gets exactly one drawer. No sharing, no chains, just systematic searching until we find the right spot."
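Margaret's storage routine can be sketched end to end. This is a minimal illustration with a made-up helper name (`store_book`), assuming each drawer holds either `None` or a single `(title, value)` pair:

```python
def store_book(cabinet, title, value):
    """Store (title, value) with linear probing; a sketch, not production code."""
    size = len(cabinet)
    drawer = hash(title) % size
    for _ in range(size):              # Probe at most one full circle
        if cabinet[drawer] is None:
            cabinet[drawer] = (title, value)
            return drawer
        drawer = (drawer + 1) % size
    raise RuntimeError("Cabinet full!")

cabinet = [None] * 8                   # A tiny cabinet for the demo
for title in ["Hamlet", "Macbeth", "Romeo and Juliet"]:
    store_book(cabinet, title, "a shelf location")
```

Whatever drawers the three titles hash to, each ends up in exactly one drawer of its own.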
The Retrieval Challenge
Timothy soon discovered the complexity of finding books in this system. To locate "Romeo and Juliet," he couldn't simply go to its hash drawer (342) and find it there. Instead, he had to follow the same probing sequence used during storage:
- Check drawer 342 (find "Hamlet" - not what we want)
- Check drawer 343 (find "Macbeth" - not what we want)
- Check drawer 344 (find "Romeo and Juliet" - success!)
```python
# Timothy's retrieval process for open addressing
def find_book(title, cabinet):
    target_drawer = hash(title) % 1000
    current_drawer = target_drawer
    while cabinet[current_drawer] is not None:
        if cabinet[current_drawer][0] == title:
            return cabinet[current_drawer][1]  # Found it!
        current_drawer = (current_drawer + 1) % 1000
        if current_drawer == target_drawer:  # Searched full circle
            return "Book not found"
    return "Book not found"  # Hit empty drawer - proves book was never stored
```
The search continued until Timothy either found the book or encountered an empty drawer (proving the book wasn't stored).
The Clustering Problem
Margaret showed Timothy a concerning pattern in their system. As the cabinet filled up, books began clustering together in certain sections. If drawers 342-350 all contained items, any new book hashing to that range would have to probe through the entire cluster to find storage, creating "traffic jams."
```python
# The clustering effect Timothy observed
busy_section = {
    342: "Hamlet",
    343: "Macbeth",
    344: "Romeo and Juliet",
    345: "Othello",
    346: "King Lear",
    347: "As You Like It",
}
# Any new book hashing to 342-347 must probe through all six!
```
These clusters made both storage and retrieval slower, as the system had to check multiple occupied drawers before finding the target or an empty space.
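The slowdown caused by clustering is easy to measure. The experiment below (illustrative, not part of the story) fills a 1,000-drawer cabinet to increasing load factors and counts how many drawers a linear probe inspects on average; the helper name `probe_count` is made up for the sketch:

```python
import random

def probe_count(cabinet, key):
    """How many drawers a linear probe inspects before reaching an empty one."""
    size = len(cabinet)
    drawer = hash(key) % size
    count = 1
    while cabinet[drawer] is not None:
        drawer = (drawer + 1) % size
        count += 1
    return count, drawer

random.seed(1)
size = 1000
cabinet = [None] * size
filled = 0
average_probes = {}
for load in (0.25, 0.50, 0.90):
    while filled < load * size:          # Fill the cabinet to this load factor
        _, drawer = probe_count(cabinet, random.random())
        cabinet[drawer] = True
        filled += 1
    trials = [probe_count(cabinet, random.random())[0] for _ in range(200)]
    average_probes[load] = sum(trials) / len(trials)
```

At 25% full, most probes stop after one or two drawers; at 90% full, the average climbs by an order of magnitude as probes crawl through long clusters.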
The Deletion Dilemma
Timothy discovered the eastern wing's most serious problem when Margaret tried to remove "Macbeth" from drawer 343. Simply removing the book and leaving the drawer completely empty would break the system! If someone later searched for "Romeo and Juliet," they would:
- Check drawer 342 (find "Hamlet")
- Check drawer 343 (find completely empty drawer - incorrectly conclude "Romeo and Juliet" isn't stored)
The search would stop at the empty drawer, never reaching "Romeo and Juliet" in drawer 344.
```python
# The deletion problem
# Before deletion:       342 -> Hamlet, 343 -> Macbeth, 344 -> Romeo
# After naive deletion:  342 -> Hamlet, 343 -> EMPTY,   344 -> Romeo
# A search for Romeo hits empty drawer 343 and stops - it never finds Romeo!
```
Margaret's solution was to mark deleted slots as "tombstones"—special placeholders that said "keep searching, but this spot is available for new storage."
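Margaret's tombstone scheme can be sketched with a sentinel object. `TOMBSTONE`, `delete_book`, and this tombstone-aware `find_book` are illustrative names for the sketch, not a real library API:

```python
TOMBSTONE = object()  # Sentinel: "deleted, but keep searching past this drawer"

def delete_book(title, cabinet):
    size = len(cabinet)
    drawer = hash(title) % size
    for _ in range(size):                    # At most one full circle
        entry = cabinet[drawer]
        if entry is None:                    # Truly empty: the book isn't here
            return False
        if entry is not TOMBSTONE and entry[0] == title:
            cabinet[drawer] = TOMBSTONE      # Mark, don't empty
            return True
        drawer = (drawer + 1) % size
    return False

def find_book(title, cabinet):
    size = len(cabinet)
    drawer = hash(title) % size
    for _ in range(size):
        entry = cabinet[drawer]
        if entry is None:                    # Only a truly empty drawer ends the search
            return None
        if entry is not TOMBSTONE and entry[0] == title:
            return entry[1]
        drawer = (drawer + 1) % size
    return None
```

A tombstone counts as occupied during searches (so probes continue past it) but as free during insertions (so the slot is eventually reused).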
The Quadratic Probe Alternative
In the library's newest wing, Timothy found an improved open addressing system. Instead of checking consecutive drawers (1, 2, 3, 4...), they used quadratic probing—checking drawers at increasing squared intervals (1, 4, 9, 16...).
```python
# Quadratic probing reduces clustering
def quadratic_probe(target_drawer, attempt, cabinet_size=1000):
    return (target_drawer + attempt * attempt) % cabinet_size

# For a collision at drawer 342:
# Attempt 1: check drawer (342 + 1²) % 1000 = 343
# Attempt 2: check drawer (342 + 2²) % 1000 = 346
# Attempt 3: check drawer (342 + 3²) % 1000 = 351
```
This approach spread out the collision handling, reducing the clustering problems that plagued linear probing.
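One caveat worth knowing: with the simple `target + attempt²` formula, quadratic probing is not guaranteed to visit every drawer for an arbitrary table size; the standard guarantee requires, for example, a prime table size kept less than half full. The two probe sequences themselves are easy to compare in a sketch:

```python
def linear_sequence(start, steps, size=1000):
    """Drawers a linear probe visits: consecutive neighbours."""
    return [(start + i) % size for i in range(steps)]

def quadratic_sequence(start, steps, size=1000):
    """Drawers a quadratic probe visits: gaps grow as squares."""
    return [(start + i * i) % size for i in range(steps)]

linear_sequence(342, 5)      # [342, 343, 344, 345, 346]
quadratic_sequence(342, 5)   # [342, 343, 346, 351, 358]
```

Two keys that hash to neighbouring drawers still collide on a linear probe's path, but their quadratic paths diverge after the first step.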
The Trade-off Analysis
Timothy compared the two collision resolution approaches:
Chaining (His Original Cabinet):
- Memory per drawer varies (chains can grow)
- Consistent performance regardless of load
- Easy deletion—just remove from chain
- Handles high load factors gracefully
Open Addressing (Eastern Wing):
- Fixed memory per drawer (exactly one item)
- Performance degrades as cabinet fills
- Complex deletion requiring tombstones
- More cache-friendly for the underlying machinery
```python
# Timothy's performance comparison
# Chaining search:         always one drawer, plus a short chain to scan
# Open addressing search:  may check many consecutive drawers
# Chaining memory:         base drawer array plus variable chain storage
# Open addressing memory:  a fixed drawer array only
```
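For contrast, the chaining approach from Timothy's original cabinet fits in a few lines; each drawer holds a Python list of `(title, value)` pairs, and the function names are illustrative:

```python
def chain_store(cabinet, title, value):
    drawer = hash(title) % len(cabinet)
    cabinet[drawer].append((title, value))       # Just append to the drawer's chain

def chain_find(cabinet, title):
    drawer = hash(title) % len(cabinet)
    for stored_title, value in cabinet[drawer]:  # Scan only this drawer's chain
        if stored_title == title:
            return value
    return None

def chain_delete(cabinet, title):
    drawer = hash(title) % len(cabinet)
    chain = cabinet[drawer]
    for i, (stored_title, _) in enumerate(chain):
        if stored_title == title:
            del chain[i]                         # No tombstone needed
            return True
    return False

cabinet = [[] for _ in range(8)]                 # Eight drawers, each an empty chain
chain_store(cabinet, "Hamlet", "tragedy shelf")
```

Note how deletion simply removes the pair from its chain, with no tombstones and no risk of breaking later searches.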
Why Python Actually Chose Open Addressing
Timothy's research held a surprise: CPython's dictionaries use open addressing, not chaining. They sidestep linear probing's clustering problem with a pseudo-random probe sequence that folds the unused high bits of the hash into each step. The designers chose this approach because:
- All entries live in one contiguous table, which is cache-friendly
- There is no per-entry pointer overhead for chain nodes
- The perturbed (non-linear) probe sequence breaks up clusters
- The table resizes before it gets too full, keeping the load factor below two-thirds
The result is fast, predictable behavior across Python's diverse workloads, from tiny keyword-argument dictionaries to large caches.
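The probe recurrence CPython actually uses is documented in the comments of `Objects/dictobject.c`. A simplified sketch of that pattern looks like this; the real implementation adds resizing, cached hashes, and more:

```python
from itertools import islice

def cpython_style_probe(hash_value, size):
    """Yield probe indices using CPython's published recurrence.

    size must be a power of two, as it is in CPython's dict table.
    """
    mask = size - 1
    perturb = hash_value
    j = hash_value & mask
    while True:
        yield j
        perturb >>= 5                    # PERTURB_SHIFT in dictobject.c
        j = (5 * j + 1 + perturb) & mask

# For a hash of 342 in an 8-slot table, the probes scatter across the
# table instead of walking neighbouring slots:
probes = list(islice(cpython_style_probe(342, 8), 8))
```

Once `perturb` reaches zero, `j = 5*j + 1 mod 2^k` is a full-period recurrence, so the sequence is guaranteed to eventually visit every slot.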
Timothy's Open Addressing Insights
After studying both systems, Timothy documented the key principles:
Open addressing trades one kind of simplicity for another kind of complexity: the storage model is simpler (exactly one item per drawer), but collision resolution and deletion become more involved.
Load factor is critical: Open addressing performance degrades rapidly as the cabinet fills, requiring careful load management.
Deletion is challenging: Unlike chaining, removing items requires special handling to maintain search integrity.
Cache locality matters: For the underlying computer hardware, open addressing can be faster due to better memory access patterns.
Different problems need different solutions: Neither approach is universally superior—the choice depends on specific requirements.
Timothy's exploration of the eastern wing had revealed that collision resolution wasn't a single problem with one solution, but a design space with trade-offs. His original chaining cabinet remained his preferred system, but understanding open addressing gave him deeper appreciation for the engineering decisions behind Python's dictionary implementation.
While CPython chose open addressing for its dictionaries, Timothy learned that other languages made different choices. Java's HashMap and C++'s std::unordered_map both use chaining, proving that both approaches have merit in real-world systems. The choice depends on the specific requirements and constraints of each language's design goals.
The secret life of Python dictionaries wasn't just about one collision strategy: it was about the careful consideration of multiple approaches and the wisdom to choose the right tool for the language's specific needs.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.