Clustering
A type of Structure Discovery algorithm
This family of methods is sometimes also referred to as Dimensionality Reduction, after one of its common applications
You have a large number of data points
You want to find what structure there is among the data points
You don’t know anything a priori about the structure
Clustering tries to find data points that “group together”
Let’s say your data has two variables
Number of video interactions
Unitized Time
Note: clustering works for (and is effective in) large feature spaces
We’ll start with k-means, just the simplest clustering algorithm
We’ll talk about fancier ones soon
First we decide how many clusters we want: 5
We pick starting values for the “centroids” of the clusters…
Usually chosen randomly
Sometimes there are good reasons to start with specific initial values…
We classify every point according to which centroid it’s closest to
This defines the clusters
Typically visualized as a Voronoi diagram
Then we move each centroid to the mean of the points classified with it
Repeat the process until the centroids stop moving
“Convergence”
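To make the loop concrete, here is a minimal sketch in Python with NumPy; the function name, defaults, and random initialization are illustrative assumptions, not something prescribed by the lecture:

    import numpy as np

    def kmeans(points, k=5, n_iter=100, seed=0):
        """Plain k-means on an (n_points, n_features) array."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k of the data points at random as starting centroids
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: label every point with its closest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its assigned points
            # (a centroid with no assigned points keeps its old position)
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Convergence: stop once the centroids stop moving
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels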
What happens if your starting points are in strange places?
Not trivial to avoid, considering the full span of possible data distributions
Run several times, with different starting points (see the sketch below)
cf. Conati & Amershi (2009)
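In practice, libraries handle the restarts for you; here is a sketch assuming scikit-learn, whose n_init parameter reruns k-means from several random starting points and keeps the run with the lowest inertia (the within-cluster sum of squared distances):

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic stand-in for the two features above (video interactions,
    # unitized time), just to make the example self-contained
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))

    # n_init=10: run k-means from 10 different random initializations
    # and keep the solution with the lowest inertia
    model = KMeans(n_clusters=5, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    print(model.inertia_)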
Take the following examples
And execute k-means for them
Do this by hand…
Focus on getting the concept rather than the exact right answer…
(Solutions are by hand rather than actually using code, and are not guaranteed to be perfect)
The three clusters in the same data lump might move around for a little while.
But really, what we have here is one cluster and two outliers…
K should be 3 rather than 5
The bottom-right cluster is actually empty!
There was never a point where that centroid was actually closest to any point
Kind of a weird outcome
Caused by unlucky initial positioning
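If you want to spot this outcome in code, a small sketch, assuming a labels array like the ones returned above:

    import numpy as np

    # counts[j] is how many points are assigned to centroid j; a zero means
    # that centroid was never closest to any point, i.e. an empty cluster
    counts = np.bincount(labels, minlength=5)
    print("empty clusters:", np.flatnonzero(counts == 0))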
One data lump at left became three clusters
Two clearly distinct data lumps at right became one cluster
A lot depends on initial positioning
And on the number of clusters
How do you pick which final position and number of clusters to go with?
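One widely used heuristic for both questions (a preview under assumptions, not necessarily this lecture’s answer): among runs, keep the final position with the lowest inertia, as above; among values of k, compare silhouette scores, which reward tight, well-separated clusters. A sketch assuming scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))  # synthetic stand-in for the real features

    # Silhouette scores run from -1 to 1; higher means tighter,
    # better-separated clusters. Keep the k that scores highest.
    best_k = max(
        range(2, 8),
        key=lambda k: silhouette_score(
            X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        ),
    )
    print("chosen k:", best_k)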