The hash tables are great when we need to search a string in a big set of strings.
They provide O(1) complexity this way making our algorithms very fast.
Unfortunately, the hash tables have one big disadvantage, especially when we are talking about huge sets of strings.
Because it is important to keep the hash collisions as low as possible, the size of the hash table must be at least twice the count of the strings we need to put inside.
This way, on a very big set of strings, the memory waste can be really huge. And if we fill the table more, the speed of the search degrades very fast down to simple linear search with O(n).
There are different ways to counteract these problems. For one of them I am writing here.
FlatAssembler uses this algorithm in order to handle huge number of label names during compilation.
The algorithm is slower than the classic hash table, but still will keep the complexity to O(log k) where k is the size of the hash, not the size of the set.
This way, the search time will remain constant regardless of the strings count in the set and can be assumed to be O(1).
Considering the much lower number of collisions, on large sets it can be faster than a classic hash table with smaller hash size.
At first we need to compute the hash of the string. It is good idea to use relatively big hash, for example 32bit, in order to minimize the collisions.
I am usually using the FNV-1b hash, but the hash algorithm is not very important for the understanding of the algorithm.
Then we need a linear memory area, where elements of the following type will be allocated consecutively:
struct THashTreeNode .bit0 dd ? .bit1 dd ? ends
Then starting from the first array element and shifting the hash one bit right at a time, we put in the
.bit1 fields (depending on the value of the bit shifted) the offset to the next element of the array where to continue the scan.
The next element is allocated at the end of the array, just after the last used element. (like in a queue or linear list).
When we reach the 32nd bit of the hash, the last index will point to the element with the payload (or the leaf of the tree if you prefer) - it can occupy random space, but it is more convenient if the size of the leaf structure is multiple of the size of the node structure.
In my implementation it occupies two consecutive elements of of the array:
struct THashTreeLeaf .hString dd ? ; a handle of the string stored. .next dd ? ; the next element in the cases of collisions. .lparam dd ? ; user defined parameter. .wparam dd ? ; user defined parameter. ends
.hString contains a pointer or a handle to the string we put in the tree. (in the library I use, the strings are dynamic and are represented by a handles, not pointers. But the difference is not important)
.next contains an offset to the next
THashTreeLeaf in the rare cases if we have two different strings with the same hash (a collision).
The searching algorithm is almost the same as when we are adding strings to the set.
Starting from the beginning of the array, we are shifting the searched string hash one bit at a time and jumping to the offsets from the
.bit1 fields. If on some step the offset is 0 this means the string is not in the tree.
If after 32 steps, we find a leaf element, a direct string comparison with the stored string is needed in order to resolve the possible collision.
Then if the strings does not match, we need to follow the offsets from the
THashTreeLeaf.next fields until 0 (the string is not in the tree) or string match - the string is in the tree.
You can see that in the later case, the search deteriorates to a simple and slow linked list search. Fortunately, when the hash is 32 bit such collisions are really rare.
The algorithm is memory friendly.
The whole tree is in one contiguous memory area, which is good for the CPU caches and additionally improve the performance.
The algorithm is relatively simple, at least for assembly language implementation.
I am not sure how it will be implemented in higher level languages, but at least for C/C++/Pascal there should be no problems, except for some possible code readability issues...
Notice, that the same algorithm can be implemented as an usual tree with separately allocated nodes and leafs. But such implementation will be slower, because of the heap allocations of multiply small chunks of memory and because of the cache issues.
In order to make the things even more clear, here is the full code of the procedure that can add strings to the tree and also to search the tree for a string without adding it.
Because the searching and adding algorithms are very similar, it is possible to provide both functions in a single procedure.
As a little quiz to the readers I will not put detailed line-by-line explanations of the code. It is published in the same form it exists in the production library.
I will give only a little reference to the external procedures used at the end of the article.
So, that is all. Here is the code. Try to analyze and understand it! :)
If something is not clear ask in the comments.
I will explain it in details, but let give some fun time for the puzzle solvers.
; Searches the hash tree for the given string. ; if [.fAdd] is TRUE and if the string is not in the tree it will be added. ; ; Arguments: ; .pHashTree - pointer to TArray with elements size sizeof.THashTreeNode (8 bytes) ; .hString - handle/pointer to the string to be searched/added. Notice, that exactly this ; string will be added to the THashTreeLeaf element. The caller is responsible ; to not free this string until the hash tree is needed. ; .fAdd - flag indicating whether to add the string to the tree if is not included. ; ; Returns: ; edx - pointer to the array. Can be reallocated! ; eax - index in the TArray of the THashTreeLeaf of the string. ; CF = 1: the string is already in the tree. ; CF = 0: the string is not in the tree. ; [.fAdd] <> 0: eax contains a pointer to the new added leaf. ; eax == 0 means there is an out of memory error on ; the attempt to add the string. ; [.fAdd] = 0: eax == 0 proc SearchHashTree, .pHashTree, .hString, .fAdd .hash dd ? begin pushad mov edx, [.pHashTree] stdcall StrHash, [.hString] mov [.hash], eax mov ecx, 32 mov ebx, TArray.array cmp [edx+TArray.count], 0 jne .treeseek stdcall AddArrayItems, edx, 1 and dword [eax+THashTreeNode.bit0], 0 and dword [eax+THashTreeNode.bit1], 0 .treeseek: xor edi, edi ror [.hash], 1 adc edi, edi lea edi, [ebx+4*edi] cmp dword [edx + edi], 0 ; edi is offset from the beginning ; of the array. je .notfound mov ebx, [edx + edi] loop .treeseek ; The hash exists in the hash tree. Process the possible collisions. ; here ebx is the offset to the THashTreeLeaf ! .cmp_loop: stdcall StrCompCase, [.hString], [edx+ebx+THashTreeLeaf.hString] jc .finish ; the string is already in the hash tree!!! lea edi, [ebx + THashTreeLeaf.next] cmp dword [edx + edi], 0 je .add_in_list mov ebx, [edx + edi] jmp .cmp_loop .add_in_list: cmp [.fAdd], 0 je .finish_zero stdcall AddArrayItems, edx, 2 jc .finish_zero sub eax, edx jmp .do_add .notfound: cmp [.fAdd], 0 je .finish_zero ; edx - pointer to the tree array, ebx - offset of the last ; found element. ; eax - 0 or 1 depending of the last bit checked. ; ecx - remaining bits count. ; edi - the offset to the [.bitX] field of the last node. .add_remaining: lea eax, [ecx+1] stdcall AddArrayItems, edx, eax ; add all needed (THashTreeNode) + THashTreeLeaf jc .finish_zero sub eax, edx dec ecx jz .do_add .addloop: mov [edx + edi], eax and dword [edx + eax + THashTreeNode.bit0], 0 and dword [edx + eax + THashTreeNode.bit1], 0 xor edi, edi ror [.hash], 1 adc edi, edi lea edi, [eax + 4*edi] add eax, sizeof.THashTreeNode loop .addloop .do_add: mov [edx + edi], eax lea ebx, [edx+eax] mov ecx, [.hString] xor eax, eax mov [ebx+THashTreeLeaf.hString], ecx mov [ebx+THashTreeLeaf.next], eax mov [ebx+THashTreeLeaf.lparam], eax mov [ebx+THashTreeLeaf.wparam], eax clc .finish: mov [esp+4*regEAX], ebx mov [esp+4*regEDX], edx popad return .finish_zero: xor ebx, ebx clc jmp .finish endp
Very small cheat sheet:
TArray is a structure of a dynamic array that can be resized when needed. (the source)
AddArrayElements adds new elements at the end of TArray and returns in
EAX a pointer to these new elements and in
EDX the pointer to the array that can be reallocated because of the resizing.
StrHash computes 32bit hash of a string and returns it in EAX
StrCompCase compares two strings, case sensitive and returns Boolean result in CF flag. 1 if the string match and 0 if not.
Quick assembly instruction reference for the absolute beginners: here
NB: The above implementation is far from optimal. It is written in order to be readable and correct. The optimizations will be made later. The bug fixes as well.