How Hash Tables Achieve O(1) Lookups

#webdev #tutorial

A hash table looks like magic the first time you reach for one: you store a million entries, then pull any single value back out in roughly the same time it takes to read one with ten entries. No scanning, no comparisons against every key. The trick is turning the key itself into the address where its value lives.

From key to index

The core idea is a function that converts any key into a number, then folds that number into the bounds of a backing array. Storing a value at key k means computing index = hash(k) % capacity and writing to that slot. Reading it back computes the same index and reads the same slot. Array indexing is a single memory operation — it does not get slower as the array fills — so the lookup cost is dominated by computing the hash, not by the number of stored entries.

Picture a table with 8 slots holding strings keyed by name:

hash("alice") % 8 -> 3   slots[3] = "alice's data"
hash("bob")   % 8 -> 6   slots[6] = "bob's data"
hash("carol") % 8 -> 1   slots[1] = "carol's data"

To look up "bob", you hash it again, get 6, and read slots[6] directly. You never touch slots 3 or 1. That direct jump is the whole reason lookups are constant time on average: the work does not depend on n, the number of entries.

For this to hold, the hash function must spread keys evenly across the slots. A good hash makes outputs look random, so keys scatter uniformly. A bad hash that clumps keys into a few slots destroys the property — and so does an attacker who deliberately picks keys that collide.

Collisions and how they are resolved

Because you are squeezing an enormous space of possible keys into a small array, two different keys will eventually compute the same index. This is a collision, and every hash table needs a strategy for it.

Separate chaining stores a small secondary structure — usually a linked list — at each slot. Colliding keys are appended to the list at that slot. A lookup hashes to the slot, then walks the (normally short) list comparing keys. If "dave" also hashes to 6, slot 6 holds a list of both "bob" and "dave", and a lookup checks each.

Open addressing keeps everything in the array itself. On a collision, it probes for the next free slot by a fixed rule — linear probing checks the next slot, the next, and so on. Lookups follow the same probe sequence until they find the key or hit an empty slot. This is more cache-friendly but degrades faster as the table fills.

Either way, the cost of a lookup is proportional to how many keys share its slot (or probe sequence). Keep that number small and lookups stay fast.

Load factor, rehashing, and the worst case

The knob that keeps collisions rare is the load factor: entries divided by capacity. As it climbs toward 1, slots fill up, chains lengthen, and probe sequences grow. So hash tables resize before that happens — typically when the load factor crosses a threshold like 0.75. Resizing allocates a larger array and rehashes every existing entry into it, because % capacity now yields different indices.

Rehashing is an O(n) operation, but it happens rarely and each resize roughly doubles capacity, so the cost spreads thinly across all the cheap inserts in between. Averaged over a sequence of operations, insertion is amortized O(1) — occasional expensive resizes, vastly more often free.

A hash table does not promise constant time on every single call. Any one insert can trigger a full O(n) rehash, and lookups degrade to O(n) when many keys land in the same slot. In the pathological case — a weak hash function, or an attacker who crafts keys that all collide — every entry chains into one slot and the table behaves like a linked list. This is a real denial-of-service vector (algorithmic complexity attacks), which is why many languages randomize their hash seed per process.

The worst case is therefore O(n): all n keys collide into a single slot, and a lookup must scan all of them. You get O(1) in practice because a decent hash function plus a controlled load factor make that case astronomically unlikely for ordinary data. The guarantee is statistical, not absolute — which is exactly why the choice of hash function matters so much.