Manish Aradwad

Why kNN doesn't scale...

...And how Approximate Nearest Neighbors mitigates that.

k-Nearest Neighbors:

k-Nearest Neighbors (kNN) is a simple and intuitive algorithm used for classification and regression tasks. Given a new data point, the algorithm searches the training dataset for the 'k' training examples that are closest to the point and then returns the output based on these 'k' examples.
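
As a minimal sketch (in Python with NumPy, on made-up toy data), a brute-force kNN classifier looks roughly like this:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point: O(N * d) work
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy, made-up data: two clusters with labels 0 and 1
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([8.5, 8.5])))  # -> 1
```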

kNN is still used in modern AI systems, for example in the vector search behind retrieval for LLMs. But one of its major issues is scalability. There are three important points here.

Scalability Issues of kNN:

  • Brute Force Search: In its basic form, for every new query point, kNN performs a brute-force search, computing the distance to every point in the training dataset. The time complexity is therefore O(N), where N is the number of data points, assuming each distance is computed in constant time (with d-dimensional vectors it is really O(N·d)).

  • Memory Intensive: kNN must keep the entire training dataset around at query time, which is memory intensive for large datasets.

  • High Dimensionality: As the dimensionality of the data increases, the distances between most pairs of points tend to become very similar (the curse of dimensionality). This makes the distance metric less meaningful and degrades the performance of kNN.

Approximate Nearest Neighbors (ANN):

Algorithm:

Approximate Nearest Neighbors (ANN) algorithms aim to find the nearest neighbors in sub-linear time by allowing a small probability of error. The key idea is that in many applications, an approximate answer is sufficient. Here I will explain one of the most common ANN algorithms, Locality Sensitive Hashing (LSH).

LSH is a technique for hashing similar dataset elements into the same buckets. For example, vector embeddings of similar words in a vocabulary will land in the same bucket. Instead of typical hash buckets, you can think of it as grouping points by deciding whether they lie above or below a hyperplane.

These hyperplanes are randomly generated and are used to divide the vector space into small sections. If the word embeddings are two-dimensional, this hyperplane is just a line.

To decide whether a vector embedding lies above or below a hyperplane, we take the dot product of the embedding with the vector that defines the hyperplane (its normal); this is closely related to projecting the embedding onto that vector. The sign of the dot product tells us whether the embedding lies above or below the plane (a line in 2D).
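
As a tiny illustration (the embedding values and the random seed here are arbitrary), the side of the plane can be read off from the sign of a single dot product:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random hyperplane through the origin is defined by a normal vector;
# in 2D this corresponds to a line.
normal = rng.normal(size=2)

embedding = np.array([0.7, -1.2])  # hypothetical 2D word embedding

# The sign of the dot product with the normal tells us which side the point is on.
side = np.dot(embedding, normal)
print("above" if side >= 0 else "below")
```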

Here is a visualisation of a projection in 2D space:

[Image: Projection visualisation]

And here is how the sign of the projection tells us which side of the plane a point lies on:

[Image: Sign of the projection]

Now let's say that, for a query h, you are trying to find the X in the dataset that minimises the distance:

|| h - X ||

where h is the query (test) vector and X is a vector from the dataset.

In plain kNN, you have to go through every X in the dataset and compare it with h using some metric such as cosine similarity or Euclidean distance. Computing this metric for every X is very expensive for a large dataset.

So, in ANN we do the following steps (a rough sketch in code follows the list):

  1. Use LSH to find the hash values of all the X's in the dataset (which involves only simple matrix multiplications).
  2. Find the hash value of the test value h.
  3. Compare h only with those X's that got the same hash value, using the metric above.
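
Putting the three steps together, here is a rough sketch (made-up data, an arbitrary choice of 8 hyperplanes, hypothetical sizes) of how the candidate set shrinks before any exact distance is computed:

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_hash(vectors, planes):
    """One bit per hyperplane (sign of the dot product), packed into an integer bucket id."""
    bits = (vectors @ planes.T) >= 0          # shape: (n_vectors, n_planes)
    powers = 1 << np.arange(planes.shape[0])  # 1, 2, 4, ...
    return bits.astype(int) @ powers

# Hypothetical data: 10,000 embeddings in 50 dimensions, plus one query h
X = rng.normal(size=(10_000, 50))
h = rng.normal(size=50)

planes = rng.normal(size=(8, 50))  # 8 random hyperplanes -> up to 2^8 = 256 buckets

buckets = lsh_hash(X, planes)                   # step 1: hash every X
query_bucket = lsh_hash(h[None, :], planes)[0]  # step 2: hash the query

# Step 3: compute exact distances only for candidates in the same bucket
candidates = np.where(buckets == query_bucket)[0]
if len(candidates) == 0:
    candidates = np.arange(len(X))  # in practice: retry with another set of random planes

dists = np.linalg.norm(X[candidates] - h, axis=1)
approx_nn = candidates[np.argmin(dists)]
print(f"compared {len(candidates)} candidates instead of {len(X)}; approx. NN index: {approx_nn}")
```

With 8 planes the dataset is split into up to 256 buckets, so only a small fraction of the points survives to the exact comparison step.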

As you can probably guess, this method doesn't guarantee the same answer that exact kNN would give. So the procedure is repeated for multiple iterations with different sets of random planes, which increases the likelihood of finding the true nearest neighbors.

This is how LSH-based Approximate Nearest Neighbors trades accuracy for efficiency.

Apart from LSH, ANN search can also be implemented with other methods, such as:

  • Random Projection Trees: These are space partitioning data structures that divide the dataset into increasingly smaller subsets using random hyperplanes.

  • KD-Trees, Ball Trees: These are also space partitioning trees but may not perform as well in very high dimensions.
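
For completeness, here is how a tree-based index can be queried with scikit-learn (assuming scikit-learn is installed; the data here is made up). Note that KD-trees and ball trees still return exact neighbors; they speed things up by pruning the search rather than by approximating it:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))   # hypothetical dataset
h = rng.normal(size=(1, 10))       # a single query vector

# Ball tree index: exact neighbors, but the tree prunes large parts of the search space
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors(h)
print(indices[0], distances[0])
```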

To conclude, while kNN is straightforward and can be very accurate, it can be computationally expensive and infeasible for large datasets or high-dimensional data. ANN provides a trade-off between accuracy and speed, allowing for efficient querying on large datasets. Depending on the specific requirements (e.g., acceptable error rate, query speed), one might choose an appropriate ANN method over traditional kNN.

Top comments (1)

Rohit Kumar

I didn't know that scaling KNN is quite difficult. Thank you for sharing your knowledge.

I also wrote a blog on KNN; please visit it and share your valuable thoughts: KNN