Unsupervised Learning: Feature Transformation

#machinelearning #unsupervisedlearning

This is the 16th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.

Feature Transformation

The problem of preprocessing a set of features to create a new (smaller/more compact) feature set, while retaining as much (relevant? useful?) information as possible. For example in linear algebra terms, you can create new features that are some linear combination of existing features.

Why do Feature Transformation as a standalone topic?

Any problem with a large amount of features that have plenty of false positives and negatives will want some form of Feature Transformation.

One example: The information retrieval, or ad hoc retrieval problem, like with Google needing to look up documents based on given search terms. For google, all the words in a document are features (give or take some cleaning). That's a lot of words, bringing us the curse of dimensionality. But most relevantly here, the words also aren't very good indicators (because of polysemy and synonymy) which cause a lot of false positives and negatives. Doing feature selection will not solve this problem.

Principal Components Analysis

It is an example of an Eigenproblem. PCA finds the directions of maximum variance that are mutually orthogonal. You are able to perfectly reconstruct (or describe) your data with PCA, but most commonly
if you take a subset of the principal components you will minimize the L2 error for the dimensions you pick.

Eigenvalues and Dimensionality Reduction

Eigenvalues indicate importance of each PC, and a 0 eigenvalue means that dimension is irrelevant. PCA does dimensionality reduction by sorting eigenvalues and picking the top k values. You can visualize this with a Scree Plot

You use PCA to transform features to a new space where you know how to do the filtering.

Independent Components Analysis

PCA is about finding correlation by maximizing variance.

ICA tries to maximize independence. Transforming a set of features X into new features Y with no mutual information between any Y feature, but maximum mutual info between X and Y (being able to reconstruct the data).

This is applied in the Blind Source Separation problem (aka the Cocktail Party Problem) - where microphones try to isolate sound of individual people despite picking up extra noise. Each microphone receives some linear combination of the multiple sources, and you can model it with ICA and work backwards to create sound streams of individual voices.

In contrast to PCA, ICA is local and unsorted. It just focuses on unique traits, which is a very nice feature to have for sound and natural images - they detect edges in images! When you apply this to our original problem of documents, ICA gives you topics - collections of words that select for specific topics.

Alternatives

RCA: Random Components Analysis instead of finding dimensions with highest variance, it just picks random dimensions. It works remarkably well if the next thing you do is a form of classification because you still maintain information, but have lower dimensions so you still pick up some correlations. It's less efficient but faster than PCA.

LDA: Linear Discriminant Analysis finds a projection that discriminates based on the label (like supervised learning) - separates based on label. After the 90's, Latent Dirichlet Allocation has become the more popular LDA.