Clustering Validation and Selection of K
A value for k
Which set of clusters to use, after 17 randomized restarts
Take each point P
Find the centroid of P’s cluster, C
Find the distance D from C to P
Square D to get D’
Sum D’ over all points to get the distortion (sketched in code below)
Usually Euclidean distance
Distance from A to B in two dimensions
\(\sqrt{(A_{x}-B_{x})^{2}+(A_{y}-B_{y})^{2}}\)
In any number of dimensions
\(\sqrt{\sum_{i}(A_{i}-B_{i})^{2}}\)
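A minimal NumPy sketch of this computation (the function names euclidean and distortion are illustrative, not from any library):

```python
import numpy as np

def euclidean(a, b):
    # Euclidean distance between two points, in any number of dimensions
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def distortion(points, labels, centroids):
    # Sum of squared distances from each point P to the centroid of
    # P's cluster -- the quantity these slides call distortion
    total = 0.0
    for p, label in zip(points, labels):
        total += euclidean(p, centroids[label]) ** 2
    return total
```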
Works for choosing between randomized restarts
Does not work for choosing the number of clusters
Adding more clusters almost always leads to smaller distortion
Distance to nearest cluster center should almost always be smaller with more clusters
It only isn’t when you have bad luck in your randomization
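You can see this directly by comparing distortion across values of k on the same data; scikit-learn’s KMeans exposes the distortion of a fitted model as inertia_ (synthetic data here, purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy data with no real cluster structure

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = summed squared distances to the nearest center,
    # i.e. the distortion; it shrinks as k grows
    print(k, km.inertia_)
```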
A different problem than prediction modeling
You’re not trying to predict specific values
You’re determining whether any center is close to a given point
You can use the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) to do this
Assess how much fit would be spuriously expected from N random centroids (without allowing the centroids to move)
Assess how much fit you actually had
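One practical way to apply this idea in Python is to fit a Gaussian mixture model (a probabilistic relative of k-means) for each candidate k and compare BIC scores; note this sketch swaps in scikit-learn’s GaussianMixture rather than plain k-means:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three well-separated blobs in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

for k in range(2, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # lower BIC is better: fit is rewarded, extra components are penalized
    print(k, gm.bic(X))  # gm.aic(X) works the same way for AIC
```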
Silhouette analysis: an increasingly popular method for determining how many clusters to use
Silhouette values are scaled from -1 to +1
Close to +1: Data point is far from adjacent clusters
Close to 0: Data point is at boundary between clusters
Close to -1: Data point is closer to another cluster than to its own cluster
For each data point i in Cluster C
Find C* = the cluster, other than C, with the lowest average distance from i to its data points
A(i) = average distance from i to all other data points in its own cluster C
B(i) = average distance from i to all data points in cluster C*
\(S(i) = \frac{B(i)-A(i)}{\max(A(i),\,B(i))}\)
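scikit-learn implements this formula; a small sketch using silhouette_samples for the per-point S(i) values and silhouette_score for their mean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 4)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)   # S(i) for every data point
print(s.min(), s.max())             # each value lies in [-1, +1]
print(silhouette_score(X, labels))  # mean S(i) over all points
```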
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
In that example, 2 and 4 clusters are reasonable choices
3, 5, and 6 clusters are not good choices
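Putting this together, one common recipe is to pick the k with the highest mean silhouette; a sketch (the helper name best_k_by_silhouette and the candidate range are arbitrary choices):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=range(2, 7)):
    # Fit k-means for each candidate k, score it by mean silhouette,
    # and return the k with the highest average value
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```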
If your goal is just to discover qualitatively interesting patterns in the data, you may want to do something simpler than an information criterion or silhouette analysis