Omar

Posted on

# #009 DS&A - Hashing

## Hashing

### Direct address Table

we have many data structures but they are all slow at searching.

type time complexity
unsorted array O(n)
sorted array O(log n)
linked list O(n)
Binary Tree O(n)
Binary Search Tree O(n)
Balanced Binary Search Tree O(log n)
Priority (min and max) O(n)

to minimize the searching time we use hashing which is `O(1)` , also some people use `Direct address Table` is just same as an array.

`{1,2,3,4,5,...,100}` if we need to put insert an element we put it with a key and store it let's say we need to store the number of occurrence for character `C` the number of ASCII let's assume it's 128

`arr[99]` I will store in this array the number of occurrence for `c` where `99` is the key of `c`

, and now if I need to access the number of occurrence letter `c` I can access it directly using `arr[99]` which is `O(1)`

### Introduction to hashing

the problem with this direct access is that array should have big as key sizes , to solve this problem we use hashing.

let's say we have `m` keys and `n` elements. in order to map a `key` to `element` we need an `hash function` which will take time of `n%m` which is constant `O(1)` . When we need to hash a new element to element and it have same location of an previous element it's called collision example`141%10=1` and `21%10=1` so this is collision because they have same key . The challenge with hash map is the collision.

``````  i) better hash functions
ii) chaining (using linked list)
iii) open addressing , implemented using verious ways
a) linear probing
b) quadratic probing
c) Double hashing
``````

### chaining

after computing the hash if there is a collision , the previous value will link to the new value.

``````worst case search - O(n)
worst case deletion - O(n)

average search O(1+ alpha)
average deletion O(1)
``````

where `n` is the number of elements represented in the linked list , and `m` is the number of cells in linked list.

`alpha = n/m` where alpha is the load factor.

An advantage of chained hash table over open addressing scheme is deletion is easier.

### Open addressing

open addressing AKA closed hashing

let's say we have array that can hold `m` elements it mean all the elements I need should be inside array not outside like chaining and `U` keys , where `U` is very large and `m` is limited so definitely their is a collision.

we have the load factor `alpha` which is always will be between `0` and `1` .

We resolve the collision by applying the hash function with a small modification , this is called probing second time. Probing is nothing but finding location in hash table where we need place an element.

we apply the hash function if we find that the element is already picked we make a combination between `u` and finite set `[0,m-1]`

the worst time here to search an element is `O(m)` because we need to visit all the table to find that element.

assume the `X` is mean that slot is not empty and `N` is NULL (we can put element here).

``````h(k)        = 0
``````

`1` is NULL so I can put `k` in it.

``````h(k1)       = 1
h(k1 , 1)   = 5
h(k1 , 2)   = 6
``````

`h(k1)` is `1` but `1` is not free ( their is collision ) so we run hash function again we get `5` which also is not free , so we run it again with `2` we get `6` which is NULL.

same for searching assume we have the element we need `k1` is at slot `0`

``````h(k,0) = 2
h(k,1) = 6
h(k,2) = 3
...
h(k,7) = 0
``````

so I take `O(m)` to find it in worst case

assume the `k1` is at slot `5` instead of `7` so if I reach `h(k,5)` and find it `NULL` I stop because I know that this element is not presented , because if it's not `NULL` the hash function will occupy it for `k1`.

This is the case if no elements being deleting . If it's deleted I will continue searching (probing). So deletion will cause a problem so I will assign an element that mean oh there is an element let's say `D` here but it is deleted , and in insertion if I find `D` I will treat it as `NULL` so I place replace it with my element.

In worst case if All elements are deleted and I need an element after them , so chaining is the best in this case.

### Linear probing

``````h : U --> {0 ... m-1}
``````

if I run `h(k) = a` so index is `a` , if `a` is not empty will I will take another hash function h prime`h'(k,i)=(h(k)+i) mod m` which is `h'(k,1) = ( h(k)+1) mod n = (a+1) mod m = a+1` and `h'(k,2) = ( h(k)+2) mod n = (a+2) mod m = a+2`

it's called linear probing because if the element at `a` is occupied it try `a+1` if it also occupied it will try`a+2` if it reach `m` it will start again from `0` to `a`

so if it start at `1` it will start from `1` and end at `0` `{1,2,3 ... m-1 , 0}` if it start at `k` it will end at `k-1`

the number of prob sequences is `m` . if 2 elements have the initial prob is the same it's called secondary clustering. Primary clustering is if a continuous block of memory is sequentially occupied it mean from `a` to `a+k` is filled , so the probability of inserting an element will be `k+1/m`

the worst search is `O(m)` but the average is constant `O(2.5)` their is a lot of proofs but they are hard and contain a lot of advanced probability theorems.

### Quadratic probing

`h'(k,i) = ( h(k) + c1*i + c2 i*i ) mod m`

the problem with quadratic probing is that we need to choose `c1 , c2 and m` in such a way that `h'(k,i)` will examine entire table.

### Double hashing

`h'(k) = ( h1(k) + i * h2(k) ) mod m`

if 2 elements have the same `h1` , `h2` probably will not be the same

selection of `h2(k)`

The secondary hash function `h2(k)` it should never yield an index of zero

• it should cycle through the whole table
• it should be very fast to compute
• it should be pair-wise independent of `h1(k)`
• The distribution characteristics of `h2(k)`are irrelevant. It is analogous to a random-number generator - it is only necessary that `h2(k)` be ’’relatively prime’’ to `m`.

In practice, if division hashing is used for both functions, the divisors are chosen as primes.