Discussion on: Clustering in R

Clustering wines
K-Means
Learn R here: hackr.io/tutorials/learn-r
This first example shows how to perform cluster analysis with R. The rattle package is loaded in order to use the wine data set.

install.packages('rattle')

data(wine, package='rattle')
head(wine)

  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
  Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
1          0.28            2.29  5.64 1.04     3.92    1065
2          0.26            1.28  4.38 1.05     3.40    1050
3          0.30            2.81  5.68 1.03     3.17    1185
4          0.24            2.18  7.80 0.86     3.45    1480
5          0.39            1.82  4.32 1.04     2.93     735
6          0.34            1.97  6.75 1.05     2.85    1450

In this data set we observe the composition of different wines. Given a set of observations $(x_1, x_2, \dots, x_n)$, where each observation is a $d$-dimensional real vector, $k$-means clustering aims to partition the $n$ observations into $k \le n$ sets $S = \{S_1, S_2, \dots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

$$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where $\mu_i$ is the mean of the points in $S_i$. The clustering optimization problem is solved with the function kmeans in R.
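To make the WCSS objective concrete, here is a minimal sketch (the helper wcss is hypothetical, not part of the original example) that computes it directly for a data matrix x and a vector of cluster assignments cl:

# Within-cluster sum of squares for a given partition
wcss <- function(x, cl) {
  sum(sapply(unique(cl), function(i) {
    xi <- x[cl == i, , drop = FALSE]  # points assigned to cluster i
    mu <- colMeans(xi)                # cluster centroid (mu_i)
    sum(sweep(xi, 2, mu)^2)           # squared distances to the centroid
  }))
}

Applied to the standardized data and the fitted clusters shown below, this should agree with k.means.fit$tot.withinss.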

wine.stand <- scale(wine[-1]) # To standardize the variables

K-Means

k.means.fit <- kmeans(wine.stand, 3) # k = 3
The object k.means.fit contains all the elements of the clustering output:

attributes(k.means.fit)

$names
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"

$class
[1] "kmeans"

Centroids:

k.means.fit$centers

  Alcohol   Malic     Ash Alcalinity Magnesium  Phenols Flavanoids
1  0.1644  0.8691  0.1864     0.5229  -0.07526 -0.97658   -1.21183
2  0.8329 -0.3030  0.3637    -0.6085   0.57596  0.88275    0.97507
3 -0.9235 -0.3929 -0.4931     0.1701  -0.49033 -0.07577    0.02075
  Nonflavanoids Proanthocyanins   Color     Hue Dilution Proline
1       0.72402         -0.7775  0.9389 -1.1615  -1.2888 -0.4059
2      -0.56051          0.5787  0.1706  0.4727   0.7771  1.1220
3      -0.03344          0.0581 -0.8994  0.4605   0.2700 -0.7517

Clusters:

k.means.fit$cluster

  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3
 [71] 3 3 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
[106] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
[141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[176] 1 1 1

Cluster size:

k.means.fit$size

[1] 51 62 65

A fundamental question is how to determine the value of the parameter k. One approach is to look at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn't give much better modelling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion"; a sketch of this plot follows below.
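A minimal sketch of that elbow plot, assuming the standardized data wine.stand from above (the helper name wssplot and the choice of nc = 15 are illustrative, not part of the original example):

# Plot the total within-groups sum of squares against the number of clusters
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- numeric(nc)
  wss[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total sum of squares
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)    # WCSS for k = i
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of clusters",
       ylab = "Within-groups sum of squares")
}

wssplot(wine.stand)  # choose k where the curve bends (the "elbow")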