johnfound

Posted on

# Interesting hash tree implementation

The hash tables are great when we need to search a string in a big set of strings.

They provide O(1) complexity this way making our algorithms very fast.

Unfortunately, the hash tables have one big disadvantage, especially when we are talking about huge sets of strings.

Because it is important to keep the hash collisions as low as possible, the size of the hash table must be at least twice the count of the strings we need to put inside.

This way, on a very big set of strings, the memory waste can be really huge. And if we fill the table more, the speed of the search degrades very fast down to simple linear search with O(n).

There are different ways to counteract these problems. For one of them I am writing here.

The author of the algorithm is Tomasz Grysztar, the author of FlatAssembler.

FlatAssembler uses this algorithm in order to handle huge number of label names during compilation.

The algorithm is slower than the classic hash table, but still will keep the complexity to O(log k) where k is the size of the hash, not the size of the set.

This way, the search time will remain constant regardless of the strings count in the set and can be assumed to be O(1).

Considering the much lower number of collisions, on large sets it can be faster than a classic hash table with smaller hash size.

## Adding strings to the tree.

At first we need to compute the hash of the string. It is good idea to use relatively big hash, for example 32bit, in order to minimize the collisions.

I am usually using the FNV-1b hash, but the hash algorithm is not very important for the understanding of the algorithm.

Then we need a linear memory area, where elements of the following type will be allocated consecutively:

``````struct THashTreeNode
.bit0 dd ?
.bit1 dd ?
ends
``````

Then starting from the first array element and shifting the hash one bit right at a time, we put in the `.bit0` or `.bit1` fields (depending on the value of the bit shifted) the offset to the next element of the array where to continue the scan.

The next element is allocated at the end of the array, just after the last used element. (like in a queue or linear list).

When we reach the 32nd bit of the hash, the last index will point to the element with the payload (or the leaf of the tree if you prefer) - it can occupy random space, but it is more convenient if the size of the leaf structure is multiple of the size of the node structure.

In my implementation it occupies two consecutive elements of of the array:

``````struct THashTreeLeaf
.hString dd ?         ; a handle of the string stored.
.next    dd ?         ; the next element in the cases of collisions.
.lparam  dd ?         ; user defined parameter.
.wparam  dd ?         ; user defined parameter.
ends
``````

The field `.hString` contains a pointer or a handle to the string we put in the tree. (in the library I use, the strings are dynamic and are represented by a handles, not pointers. But the difference is not important)

The field `.next` contains an offset to the next `THashTreeLeaf` in the rare cases if we have two different strings with the same hash (a collision).

## Searching the string in the set.

The searching algorithm is almost the same as when we are adding strings to the set.

Starting from the beginning of the array, we are shifting the searched string hash one bit at a time and jumping to the offsets from the `.bit0` and `.bit1` fields. If on some step the offset is 0 this means the string is not in the tree.

If after 32 steps, we find a leaf element, a direct string comparison with the stored string is needed in order to resolve the possible collision.

Then if the strings does not match, we need to follow the offsets from the `THashTreeLeaf.next` fields until 0 (the string is not in the tree) or string match - the string is in the tree.

You can see that in the later case, the search deteriorates to a simple and slow linked list search. Fortunately, when the hash is 32 bit such collisions are really rare.

The algorithm is memory friendly.

The whole tree is in one contiguous memory area, which is good for the CPU caches and additionally improve the performance.

The algorithm is relatively simple, at least for assembly language implementation.

I am not sure how it will be implemented in higher level languages, but at least for C/C++/Pascal there should be no problems, except for some possible code readability issues...

Notice, that the same algorithm can be implemented as an usual tree with separately allocated nodes and leafs. But such implementation will be slower, because of the heap allocations of multiply small chunks of memory and because of the cache issues.

## The quiz

In order to make the things even more clear, here is the full code of the procedure that can add strings to the tree and also to search the tree for a string without adding it.

Because the searching and adding algorithms are very similar, it is possible to provide both functions in a single procedure.

As a little quiz to the readers I will not put detailed line-by-line explanations of the code. It is published in the same form it exists in the production library.

I will give only a little reference to the external procedures used at the end of the article.

So, that is all. Here is the code. Try to analyze and understand it! :)

I will explain it in details, but let give some fun time for the puzzle solvers.

``````; Searches the hash tree for the given string.
; if [.fAdd] is TRUE and if the string is not in the tree it will be added.
;
; Arguments:
;   .pHashTree - pointer to TArray with elements size sizeof.THashTreeNode (8 bytes)
;   .hString - handle/pointer to the string to be searched/added. Notice, that exactly this
;              string will be added to the THashTreeLeaf element. The caller is responsible
;              to not free this string until the hash tree is needed.
;   .fAdd - flag indicating whether to add the string to the tree if is not included.
;
; Returns:
;   edx - pointer to the array. Can be reallocated!
;   eax - index in the TArray of the THashTreeLeaf of the string.
;   CF = 1: the string is already in the tree.
;   CF = 0: the string is not in the tree.
;     [.fAdd] <> 0: eax contains a pointer to the new added leaf.
;                   eax == 0 means there is an out of memory error on
;                   the attempt to add the string.
;     [.fAdd] = 0: eax == 0

.hash dd ?
begin

mov     edx, [.pHashTree]

stdcall StrHash, [.hString]
mov     [.hash], eax

mov     ecx, 32
mov     ebx, TArray.array

cmp     [edx+TArray.count], 0
jne     .treeseek

and     dword [eax+THashTreeNode.bit0], 0
and     dword [eax+THashTreeNode.bit1], 0

.treeseek:
xor     edi, edi
ror     [.hash], 1
lea     edi, [ebx+4*edi]

cmp     dword [edx + edi], 0 ; edi is offset from the beginning
; of the array.
je      .notfound

mov     ebx, [edx + edi]
loop    .treeseek

; The hash exists in the hash tree. Process the possible collisions.
; here ebx is the offset to the THashTreeLeaf !

.cmp_loop:
stdcall StrCompCase, [.hString], [edx+ebx+THashTreeLeaf.hString]
jc      .finish      ; the string is already in the hash tree!!!

lea     edi, [ebx + THashTreeLeaf.next]

cmp     dword [edx + edi], 0

mov     ebx, [edx + edi]
jmp     .cmp_loop

je      .finish_zero

jc      .finish_zero

sub     eax, edx

.notfound:
je      .finish_zero

; edx - pointer to the tree array,  ebx - offset of the last
;       found element.
; eax - 0 or 1 depending of the last bit checked.
; ecx - remaining bits count.
; edi - the offset to the [.bitX] field of the last node.

lea     eax, [ecx+1]
jc      .finish_zero

sub     eax, edx

dec     ecx

mov     [edx + edi], eax
and     dword [edx + eax + THashTreeNode.bit0], 0
and     dword [edx + eax + THashTreeNode.bit1], 0

xor     edi, edi
ror     [.hash], 1

lea     edi, [eax + 4*edi]

mov     [edx + edi], eax
lea     ebx, [edx+eax]

mov     ecx, [.hString]
xor     eax, eax

mov     [ebx+THashTreeLeaf.hString], ecx
mov     [ebx+THashTreeLeaf.next], eax
mov     [ebx+THashTreeLeaf.lparam], eax
mov     [ebx+THashTreeLeaf.wparam], eax

clc

.finish:
mov     [esp+4*regEAX], ebx
mov     [esp+4*regEDX], edx
return

.finish_zero:
xor     ebx, ebx
clc
jmp     .finish
endp
``````

Very small cheat sheet:

`TArray` is a structure of a dynamic array that can be resized when needed. (the source)

The procedure `AddArrayElements` adds new elements at the end of TArray and returns in `EAX` a pointer to these new elements and in `EDX` the pointer to the array that can be reallocated because of the resizing.

The procedure `StrHash` computes 32bit hash of a string and returns it in EAX

The procedure `StrCompCase` compares two strings, case sensitive and returns Boolean result in CF flag. 1 if the string match and 0 if not.

Quick assembly instruction reference for the absolute beginners: here

NB: The above implementation is far from optimal. It is written in order to be readable and correct. The optimizations will be made later. The bug fixes as well.