K-d trees

#database #performance #tree

K-d trees

Multidimensional data indexing
implementation

Introduction

Motivation
You have a dateset of hotels each hotel has characteristics like its position{x,y} and its rate. if you want to reach the nearest hotel to you with a suitable rate to you, you may start manually searching for the nearest hotel by calculating the distance. Then you will filter hotels by their rate.
OK, that approach sounds that taking too much time and effort.
In 1975, Jon Louis Bentley introduced an interesting data structure by which much time and effort were reduced.
Description
A k-d tree is a binary search tree with a K-dimensional node. Each node could be represented as a split plan that divides the space into two parts, known as half-spaces. Points to the left of this split plane are represented by the left sub-tree of that node and points to the right of the split plane are represented by the right sub-tree. The direction in which we divide our space is determined by the level of the node in the tree, To choose which branch of the tree will be traversed, we only compare the i-th (modulo k) coordinates of points.

Operations on k-d trees

Construction
The power of k-d tree leverage when it is balanced. So we will construct it balanced.
A balanced tree could be obtained by choosing the node in the middle of space to divide it, space, into 2 equal numbers of nodes in each child.
That approach is obtained by sorting nodes of each space and picking the middle to be the next parent.
Search
It is similar to a binary search tree. Compare the searched point to the node and choose which branch to traverse by using the ‘I’ parameter, I=level%K.

Insert
Although it is not the most preferred operation, it may be necessary to be made. That operation’s downside is that it unbalances the tree.
To insert a node and don’t affect the balance of the tree as could as possible, we use the approach of search until we reach the closest leaf node in the tree to our point, then we extend child from it.

Nearest neighbor
The most interesting operation is provided by k-d trees. By which you could find the hotel in the first example.
k-d tree, much like a sorted array, has structural information about the relative distance and position of its elements, and we can leverage that information to perform our search more efficiently.
We start from the consideration that each tree node covers one of the rectangular regions in which the space was partitioned, as we have shown in figures 9.5 and 9.6. So first we want to find which region contains our target point P. That’s a good starting point for our search because it’s likely that the point stored in the leaf covering that region, G in this example, will be among the closest points to P.
As you can see, we check the distance of every intermediate point along the path because any of them can be closer than the leaf. Even if intermediate points cover larger regions than leaves, inside each region we don’t have any indication of where dateset points might lie. If we refer to figure 9.20, if point A had been at ( 0,0), the tree would have had the same shape, but P would have been closer to A (the root) than G (a leaf).

To make sure that region intersects our current nearest neighbor’s hyper-sphere:
each region stems from the partitioning created by split lines. When traversing a path, we go on one side of the split (the one on the same side as P), but if the distance between a split line and P is less than the distance to our current nearest neighbor, then the hyper-sphere intersects the other partition as well.
To make sure we visit all partitions still intersecting the NN hyper-sphere, we need to backtrack our traversal of the tree. In our example, we go back to node D, then check the distance between P and the vertical split line passing through D (which, in turn, is just the difference of the x coordinates of the two points). Since this is smaller than the distance to D, our current NN, then we need to visit the other branch of D as well. When we say visit, we mean traversing the tree in a path from D to the closest leaf to P. While we do so, we visit node F and discover that it’s closer than D, so we update our current NN (and its distance: you can see that we shrink the radius of our nearest neighbor perimeter, the circle marking the region where possible nearest neighbors can be found).

Region search
When we deal with spherical regions, we are in a similar situation as the nearest neighbor search: we need to include all the points within a certain distance from a given point. It’s like performing an NN-search, where we never update the distance to the nearest neighbor, and instead of keeping track of only one point (the NN), we gather all points closer than that distance.
In figure 9.25 you can see how we are going to prune branches. When we are at a split, we will certainly have to traverse the branch on the same side of the center of the search region P. For the other branch, we check the distance between P and its projection on the split line. If that distance is lower than or equal to the search region’s radius, it means that there is still an area of intersection between that branch and the search region, and so we need to traverse the branch; otherwise, we can prune it.