Clustering
A type of Structure Discovery algorithm
This family of methods is sometimes also referred to as Dimensionality Reduction, after one of its common applications
You have a large number of data points
You want to find what structure there is among the data points
You don’t know anything a priori about the structure
Clustering tries to find data points that “group together”
Let’s say your data has two variables
Number of video interactions
Unitized Time
Note: clustering works for (and is effective in) large feature spaces
We’ll start with k-means, just the simplest clustering algorithm
We’ll talk about fancier ones soon
First we decide how many clusters we want: 5
We pick starting values for the “centroids” of the clusters…
Usually chosen randomly
Sometimes there are good reasons to start with specific initial values…
We classify every point according to which centroid it’s closest to
This defines the clusters
Typically visualized as a Voronoi diagram
Then we move each centroid to the mean of the points classified with it
Repeat the process until the centroids stop moving
“Convergence”
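To make the loop concrete, here is a minimal sketch in Python with NumPy; the function name, defaults, and random initialization are illustrative assumptions, not something prescribed by the lecture:

    import numpy as np

    def kmeans(points, k=5, n_iter=100, seed=0):
        """Plain k-means on an (n_points, n_features) array."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k of the data points at random as starting centroids
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: label every point with its closest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its assigned points
            # (a centroid with no assigned points keeps its old position)
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Convergence: stop once the centroids stop moving
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels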
What happens if your starting points are in strange places?
Not trivial to avoid, considering the full span of possible data distributions
Run several times, with different starting points (see the sketch below)
cf. Conati & Amershi (2009)
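In practice, libraries handle the restarts for you; here is a sketch assuming scikit-learn, whose n_init parameter reruns k-means from several random starting points and keeps the run with the lowest inertia (the within-cluster sum of squared distances):

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic stand-in for the two features above (video interactions,
    # unitized time), just to make the example self-contained
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))

    # n_init=10: run k-means from 10 different random initializations
    # and keep the solution with the lowest inertia
    model = KMeans(n_clusters=5, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    print(model.inertia_)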
Take the following examples
And execute k-means for them
Do this by hand…
Focus on getting the concept rather than the exact right answer…
(Solutions are by hand rather than actually using code, and are not guaranteed to be perfect)
The three clusters in the same data lump might move around for a little while.
But really, what we have here is one cluster and two outliers…
K should be 3 rather than 5
The bottom-right cluster is actually empty!
There was never a point where that centroid was actually closest to any point
Kind of a weird outcome
Caused by unlucky initial positioning
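If you want to spot this outcome in code, a small sketch, assuming a labels array like the ones returned above:

    import numpy as np

    # counts[j] is how many points are assigned to centroid j; a zero means
    # that centroid was never closest to any point, i.e. an empty cluster
    counts = np.bincount(labels, minlength=5)
    print("empty clusters:", np.flatnonzero(counts == 0))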
One data lump at left became three clusters
Two clearly distinct data lumps at right became one cluster
A lot depends on initial positioning
And on the number of clusters
How do you pick which final position and number of clusters to go with?
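One widely used heuristic for both questions (a preview under assumptions, not necessarily this lecture’s answer): among runs, keep the final position with the lowest inertia, as above; among values of k, compare silhouette scores, which reward tight, well-separated clusters. A sketch assuming scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))  # synthetic stand-in for the real features

    # Silhouette scores run from -1 to 1; higher means tighter,
    # better-separated clusters. Keep the k that scores highest.
    best_k = max(
        range(2, 8),
        key=lambda k: silhouette_score(
            X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        ),
    )
    print("chosen k:", best_k)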