Open addressing

Relja Jovicevic — Wed, 20 Oct 2021 23:38:11 +0000

If you ever wondered how collisions are handled in hash tables, chances are you've heard about open addressing.

The goal of a hash table is to construct a mapping from a set of keys to a set of values.
Hash table uses a hash function to compute an index (hash) that we store in an array.

When two keys have the same hash value hashf(k1) = hashf(k2)
collision happens. Open addressing is one of ways to avoid it.
As opposed to separate chaining where we use some sort of a list for entries with the same index, in open addressing we keep all the key-value pairs in the array itself.
That means that we have to pay attention on the size of the array in order to have enough space for all entries.

Personally, I would keep the load factor ( α ) less than 0.5 and once we reach the threshold, simply double the array size.

Now let's see how open addressing works. Firstly, when a new entry has to be inserted, we check if the corresponding slot is empty, if it is we insert the element. If not, we proceed with a probing sequence.

x = 1
keyHash = H(k)
index = keyHash

while table[index] != null:
    index = (keyHash + P(k, x)) mod N
    x = x + 1

N -> table size
H -> hash function
P -> Probing function

Be wary when choosing a probing sequence since some of them may produce cycle shorter than N and as a result you'll get stuck in an infinite loop.

Open Addressing techniques

Linear Probing

When collision occurs, we linearly probe for the next bucket.
We keep probing until an empty bucket is found.
Linear probing function could look like this

P(x) = a * x

Now if we had a table of size N = 9 and probing function P(x) = a * x, where a = 6 we wouldn't be able to reach all the buckets in the array and as a consequence, would be stuck in an infinite loop.
In order to avoid this, N and a should be relatively prime (Their global common denominator should be equal to 1).

Quadratic Probing

Similar as linear probing but instead of a linear function we use a quadratic one.

P(x) = a * x² + b * x

There are a few ways to avoid infinite loop

P(x) = x² => keep the table size a prime number and > 3 and keep α <= 1/2
P(x) = (x² + x) / 2 => keep the table size a power of two.
P(x) = (-1^x) * x² => keep the table size a prime N where N ≡ 3 mod 4

Double Hashing

Double hashing uses the idea of applying a second hash function to key when a collision occurs.
Probing method in double hashing could look like this

P(k, x) = x * secondHashFunc(k)

To avoid infinite loop when using double hashes make sure that the table size N is a prime number and calculate the delta
δ = secondHashFunc(k) mod N.
If δ happens to be 0, set it to 1.

There are many variations of open addressing such as Coalesced Hashing, Cuckoo Hashing, Robin Hood Hashing... but they are beyond the scope of this post. I recommend that you explore them on your own.

Key points to take from this post:

Make sure the load factor (α) is below a certain threshold
Avoid infinite loops
Keep learning

reach me at relja.jovicevic@gmail.com

SQL vs NoSQL

Relja Jovicevic — Fri, 15 Oct 2021 22:25:27 +0000

If you are a beginner wondering what are use cases for SQL or an experienced developer trying to make a choice between SQL or NoSQL for your next project, this article might help you steer your decision the right way.

Pros vs cons

Everything comes at a cost.

While SQL is ACID ( Atomicity, Consistency, Isolation, Durability) compliant which ensures that performed transactions are always consistent, it might be an overkill for your use case.

Atomicity - ensures that each database transaction is a single unit that adopts an "all or nothing" approach to execution. If any statement in the transaction fails, the entire transaction is rolled back
Consistency – A processed transaction will never endanger the structural integrity of the database.
Isolation - Transactions cannot compromise the integrity of other transactions by interacting with them while they are still in progress
Durability - The data related to the completed transaction will persist even in the cases of network or power outages.

Instead you might find BASE ( Basically available, Soft state, Eventual Consistency ) NoSQL databases to be more of a fit.

Basically available - The NoSQL database approach focuses on the availability of data even in the presence of multiple failures. It achieves this by using a highly distributed approach to database management. Instead of maintaining a single large data store and focusing on the fault tolerance of that store, NoSQL databases spread data across many storage systems with a high degree of replication. In the unlikely event that a failure disrupts access to a segment of data, this does not necessarily result in a complete database outage.
Soft state - Due to the lack of immediate consistency, data values may change over time. The BASE model breaks off with the concept of a database which enforces its own consistency, delegating that responsibility to developers.
Eventual Consistency - The only requirement that NoSQL databases have regarding consistency is to require that at some point in the future, data will converge to a consistent state. No guarantees are made, however, about when this will occur.

For example, if you're dealing with transaction processing where consistency and durability are must-have features then you'll be better off with an ACID compliant DB.
On the other hand, social network feeds do not require such a strict compliance so it's acceptable to trade-off ACID in favour of more flexible, easier to scale, approach.

This article does not contain the answer which is better, SQL or NoSQL because as always, it depends on your use case.

Having said that, I highly advise against following the latest trends blindly and flavour of the month technologies. Instead try to define your use case and make a decision based on your needs.

reach out > relja.jovicevic@gmail.com