# Supervised Learning: Instance-based Learning and K-Nearest Neighbors

###
shawn swyx wang πΈπ¬
Jan 27
*Updated on Feb 24, 2019 *
γ»4 min read

*This is the 5th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.*

## What is "Instance Based"?

Recall that Supervised Learning approximates a function. Then projections are made by plugging in values to the function, without any reference to the actual data.

An alternative approach just puts all the raw data ("all instances") in a database, and, when queried, looks up the corresponding output. No time is spent doing any learning, so it is fast and simple.

However, of course, we lose out on **generalization** (for example if we are asked something close to but not exactly what we have data for) and we are prone to **overfitting** (for example if our data is noisy).

So Instance Learning looks at the nearest neighbors to decide what any queried point should be. Specifically, the `k`

nearest neighbors.

##
`K`

Nearest Neighbors

Given:

- Training Data
`D = {Xi, Yi}`

- Distance Metric
`d(q, x)`

(representing your domain knowledge - a way to quantify the similarity of one thing to another) - Number of neighbors,
`k`

(also relying on your domain knowledge as to what matters) - A query point,
`q`

The algorithm is simply - given `D`

, find the `k`

nearest neighbors (K-NN) based on your distance metric `d(q, x)`

.

You can do both **classification** and **regression** this way:

- Classification: based on vote of
`Yi`

's - your point is classified by whatever the most neighbors are - Regression: the average of the
`Yi`

's.

Instead of a simple vote or simple average that weighs all neighbors the same, you can also weight them by closeness so that closer neighbors count more.

## Lazy vs Eager

You can loosely break up these algorithms into **learning** and **querying** stages:

- The learning stage computational complexity of K-NN is
`O(1)`

, while Linear Regression is`O(N)`

. - The querying stage computational complexity of K-NN is
`O(log N)`

, while Linear Regression is`O(1)`

Thus more work is frontloaded by Linear Regression, making it an "eager" learning algorithm, whereas K-NN does more work while querying, making it "lazy.

## In Python

For some light programming practice, try solving the problems pictured here:

Here's a sample Repl solution calculating nearest neighbors by Euclidian and Manhattan distance.

Note that the real answer (based on the hidden function) was `18`

, which K-NN doesn't get close to by any of these methods.

## KNN's biases

KNN's preference bias (its beliefs about what makes a good hypotheses) are:

- Locality - assumes that Near Points are "similar"
- Smoothness - using averaging
- All features matter equally <-- this belies the Curse...

## Curse of Dimensionality

As the number of

featuresordimensionsgrows, the amount of data we need togeneralize accuratelygrows exponentially!

## A blended approach

Instead of a Euclidean/Manhattan weighted/unweighted average approach to estimating the final result, we can also combine a KNN with regression to create locally-weighted regression to have the best of both worlds. This isn't just limited to regression - within the defined nearest neighbors, you can use any of the methods we have covered so far, like neural networks or decision trees.

## Optimal Data Structure

We can also consider more efficient data structures for kNN than a simple lookup table. Ball Trees are promising in this regard.

## Next in our series

Hopefully that was a good introduction to Instance-based Learning and K-Nearest Neighbors. I am planning more primers and would love your feedback and questions on:

- Overview
- Supervised Learning
- Unsupervised Learning
- Randomized Optimization
- Information Theory
- Clustering - week of Feb 25
- Feature Selection - week of Mar 4
- Feature Transformation - week of Mar 11

- Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- "True" RL - week of Apr 1
- Game Theory - week of Apr 15

Depending on the number of dimensions you can try an embedding (word2vec, gloVe, fastText, ...) or something like a dimensionality reduction (PCA, ISOMAP, ...). I have made mostly positive experience with PCA + kNN. When working on NLP problems word2vec and kNN did well too.

BTW: Thrilled for you VC dimension post. This topic can be confusing.

thanks - i'm very keen on NLP type problems and unfortunately it doesnt seem to be covered in any detail in this course. so after i end this series i may do a further deep dive on this, or maybe if you wanna write up some things i can explore (a pseudo syllabus?) i will happily do that

fyi i wrote up VC Dimensions. it is confusing indeed but i learned a lot from examples. dev.to/swyx/supervised-learning-vc...