Unsupervised clustering for faster pairwise distance computation

#python #unsupervisedlearning

Recently at work, I was given a list of locations(around 70k+) with their lat and long coordinates. My task was to find the nearest point of all the points in the list.

My predecessor who developed the first version used the simplest and most straightforward way for this computation -- for each point, compute Euclidean distance with all other points. This computation takes about 30 minutes to finish as it involves 70k * 70k times float point computation. The results are stored into a huge nested list, which takes a lot of RAM as well.

However, this computation efficiency is not acceptable in both production and testing environment. The following algorithm was created for enhancing computation efficiency:

Apply K-Means with k=100 cluster centres on this dataset
For each point, instead of computing pairwise distance with all other points, only compute pairwise distance with points in the same cluster and the nearest point in the cluster will be the nearest point in the dataset.

For each cluster, computation is done ~ 700 * 700 times. Therefore in total, 70k * 700 times computation is done. The computation is 100 times faster.

In fact, K-means clustering is very effective and efficient as compare to pairwise distance computation. After this algorithm implemented, it took less than two minutes.

First time using unsupervised learning in non machine learning related domain. Excited and want to record this down. However in production this methodology is not utilised as the technical lead decides to reduce the number of candidate locations and only recompute when there are new location points added into the list.

Either solve the problem, or reduce the problem. Lessons for the day

DEV Community

Unsupervised clustering for faster pairwise distance computation

Oldest comments (0)