Supervised Learning: Instance-based Learning and K-Nearest Neighbors

#machinelearning #supervisedlearning

This is the 5th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.

What is "Instance Based"?

Recall that Supervised Learning approximates a function. Then projections are made by plugging in values to the function, without any reference to the actual data.

An alternative approach just puts all the raw data ("all instances") in a database, and, when queried, looks up the corresponding output. No time is spent doing any learning, so it is fast and simple.

However, of course, we lose out on generalization (for example if we are asked something close to but not exactly what we have data for) and we are prone to overfitting (for example if our data is noisy).

So Instance Learning looks at the nearest neighbors to decide what any queried point should be. Specifically, the k nearest neighbors.

`K` Nearest Neighbors

Given:

Training Data D = {Xi, Yi}
Distance Metric d(q, x) (representing your domain knowledge - a way to quantify the similarity of one thing to another)
Number of neighbors, k (also relying on your domain knowledge as to what matters)
A query point, q

The algorithm is simply - given D, find the k nearest neighbors (K-NN) based on your distance metric d(q, x).

You can do both classification and regression this way:

Classification: based on vote of Yi's - your point is classified by whatever the most neighbors are
Regression: the average of the Yi's.

Instead of a simple vote or simple average that weighs all neighbors the same, you can also weight them by closeness so that closer neighbors count more.

Lazy vs Eager

You can loosely break up these algorithms into learning and querying stages:

The learning stage computational complexity of K-NN is O(1), while Linear Regression is O(N).
The querying stage computational complexity of K-NN is O(log N), while Linear Regression is O(1)

Thus more work is frontloaded by Linear Regression, making it an "eager" learning algorithm, whereas K-NN does more work while querying, making it "lazy.

In Python

For some light programming practice, try solving the problems pictured here:

Here's a sample Repl solution calculating nearest neighbors by Euclidian and Manhattan distance.

Note that the real answer (based on the hidden function) was 18, which K-NN doesn't get close to by any of these methods.

KNN's biases

KNN's preference bias (its beliefs about what makes a good hypotheses) are:

Locality - assumes that Near Points are "similar"
Smoothness - using averaging
All features matter equally <-- this belies the Curse...

Curse of Dimensionality

As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially!

A blended approach

Instead of a Euclidean/Manhattan weighted/unweighted average approach to estimating the final result, we can also combine a KNN with regression to create locally-weighted regression to have the best of both worlds. This isn't just limited to regression - within the defined nearest neighbors, you can use any of the methods we have covered so far, like neural networks or decision trees.

Optimal Data Structure

We can also consider more efficient data structures for kNN than a simple lookup table. Ball Trees are promising in this regard.

Next in our series

Hopefully that was a good introduction to Instance-based Learning and K-Nearest Neighbors. I am planning more primers and would love your feedback and questions on:

Overview
Supervised Learning
Unsupervised Learning
- Randomized Optimization
- Information Theory
- Clustering - week of Feb 25
- Feature Selection - week of Mar 4
- Feature Transformation - week of Mar 11
Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- "True" RL - week of Apr 1
- Game Theory - week of Apr 15

Oldest comments (3)

TheRealZeljko • Jan 27 '19 • Edited

Depending on the number of dimensions you can try an embedding (word2vec, gloVe, fastText, ...) or something like a dimensionality reduction (PCA, ISOMAP, ...). I have made mostly positive experience with PCA + kNN. When working on NLP problems word2vec and kNN did well too.

BTW: Thrilled for you VC dimension post. This topic can be confusing.

swyx • Jan 27 '19 • Edited

thanks - i'm very keen on NLP type problems and unfortunately it doesnt seem to be covered in any detail in this course. so after i end this series i may do a further deep dive on this, or maybe if you wanna write up some things i can explore (a pseudo syllabus?) i will happily do that