Structure Discovery

Clustering

Clustering

  • A type of Structure Discovery algorithm

  • Structure Discovery methods are also sometimes referred to as Dimensionality Reduction, based on a common application

Clustering

  • You have a large number of data points

  • You want to find what structure there is among the data points

  • You don’t know anything a priori about the structure 

  • Clustering tries to find data points that “group together”

Trivial Example

  • Let’s say your data has two variables

    • Number of video interactions

    • Unitized Time


Note: clustering works for (and is effective in)
large feature spaces

Example

Example

Not the only clustering algorithm

  • Just the simplest

  • We’ll talk about fancier ones soon

How did we get these clusters?

  • First we decided how many clusters we wanted: 5

    • How did we do that? More on this in just a few slides.

  • We picked starting values for the “centroids” of the clusters…

    • Usually chosen randomly

    • Sometimes there are good reasons to start with specific initial values…
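The initialization step described above can be sketched in a few lines of Python. Picking k distinct data points at random as the starting centroids (sometimes called the Forgy method) is just one common scheme; `init_centroids` is a hypothetical helper name used here for illustration.

```python
import random

def init_centroids(points, k, seed=None):
    # One common scheme (the Forgy method): pick k distinct
    # data points at random to serve as the initial centroids.
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [(1, 1), (1, 2), (8, 8), (9, 8), (5, 5)]
centroids = init_centroids(points, k=2, seed=0)
```

Passing a seed makes a run reproducible, which is handy when comparing different starting configurations.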

Example

Then…

  • We assign every point to the centroid it’s closest to

    • This defines the clusters

    • Typically visualized as a Voronoi diagram
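The assignment step can be sketched as follows; `assign_to_centroids` is a hypothetical helper name, and Euclidean distance is assumed (other distance measures are possible).

```python
import math

def assign_to_centroids(points, centroids):
    # Assignment step: label each point with the index of the
    # centroid it is closest to (Euclidean distance). Points
    # sharing a label form one cluster.
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

points = [(0, 0), (1, 0), (9, 9), (10, 9)]
centroids = [(0, 0), (10, 10)]
labels = assign_to_centroids(points, centroids)  # [0, 0, 1, 1]
```

The regions of space closest to each centroid are exactly the cells of the Voronoi diagram mentioned above.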

Example

Then…

  • We re-fit each centroid as the mean (center) of the points in its cluster
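The update step is just a per-cluster mean. A minimal sketch (two-dimensional data assumed; `update_centroids` is a hypothetical name); note that a centroid whose cluster is empty is simply left where it was:

```python
def update_centroids(points, labels, centroids):
    # Update step: move each centroid to the mean of the points
    # assigned to it; a centroid with no assigned points stays put.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, lbl in zip(points, labels) if lbl == i]
        if members:
            xs, ys = zip(*members)
            new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        else:
            new_centroids.append(old)
    return new_centroids

points = [(0, 0), (2, 0), (9, 9), (11, 9)]
labels = [0, 0, 1, 1]
new_centroids = update_centroids(points, labels, [(0, 0), (10, 10)])
# new_centroids == [(1.0, 0.0), (10.0, 9.0)]
```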

Result

Then…

  • Repeat the process until the centroids stop moving

  • “Convergence”
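Putting the steps together, the whole algorithm is a loop that alternates assignment and update until the centroids stop moving. A self-contained sketch (pure Python, Euclidean distance, random data points as initial centroids; the `max_iters` cap is a common safety guard, not part of the algorithm itself):

```python
import math
import random

def kmeans(points, k, seed=0, max_iters=100):
    # Full k-means loop: assign points, re-fit centroids, and
    # repeat until the centroids stop moving ("convergence").
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for i in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == i]
            if members:
                new_centroids.append(tuple(sum(coord) / len(members)
                                           for coord in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: stay put
        if new_centroids == centroids:  # no movement: converged
            break
        centroids = new_centroids
    return centroids, labels

points = [(0, 0), (2, 0), (9, 9), (11, 9)]
centroids, labels = kmeans(points, k=2)
```

On this tiny two-lump example, the loop settles on one centroid per lump within a few iterations.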

Result

Then…

Result

Example

Example

Example

Result

Result

What happens?

  • What happens if your starting points are in strange places?

  • Not trivial to avoid, considering the full span of possible data distributions

One Solution

  • Run the algorithm several times, with different starting points

  • cf. Conati & Amershi (2009)
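One way to sketch this restart strategy: run k-means from several random initializations and keep the run with the lowest within-cluster sum of squared distances (one standard criterion for comparing runs; the helper names here are illustrative, and a compact k-means is re-stated so the example is self-contained).

```python
import math
import random

def kmeans_once(points, k, rng):
    # A single k-means run from one random initialization.
    centroids = rng.sample(points, k)
    while True:
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        new = []
        for i in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == i]
            new.append(tuple(c / len(members) for c in map(sum, zip(*members)))
                       if members else centroids[i])
        if new == centroids:  # converged
            return centroids, labels
        centroids = new

def kmeans_restarts(points, k, n_starts=10, seed=0):
    # Run k-means from several different random starting points
    # and keep the run with the lowest within-cluster sum of
    # squared distances (SSE).
    rng = random.Random(seed)
    best_sse, best_centroids, best_labels = float("inf"), None, None
    for _ in range(n_starts):
        centroids, labels = kmeans_once(points, k, rng)
        sse = sum(math.dist(p, centroids[lbl]) ** 2
                  for p, lbl in zip(points, labels))
        if sse < best_sse:
            best_sse, best_centroids, best_labels = sse, centroids, labels
    return best_centroids, best_labels

points = [(0, 0), (2, 0), (9, 9), (11, 9)]
centroids, labels = kmeans_restarts(points, k=2)
```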

Exercises

  • Take the following examples

  • And execute k-means for them

  • Do this by hand…

  • Focus on getting the concept rather than the exact right answer…

  • (Solutions are by hand rather than actually using code, and are not guaranteed to be perfect)

Questions? Comments?

Exercise

Solution Step 1

Solution Step 2

Solution Step 3

Solution Step 4

Solution Step 5

No points switched – convergence

Notes

  • K-Means did pretty reasonably here

Exercise

Solution Step 1

Solution Step 2

Solution Step 3

Solution Step 4

Solution Step 5

Notes

  • The three clusters in the same data lump might move around for a little while.

  • But really, what we have here is one cluster and two outliers…

  • K should be 3 rather than 5

    • We will review this more in the next module.

Exercise

Solution

Notes

  • The bottom-right cluster is actually empty!

  • That centroid was never the closest centroid to any data point
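This empty-cluster situation is easy to reproduce: place one centroid far from every data point, and it never wins an assignment. A small sketch (the centroid positions are made-up for illustration):

```python
import math

points = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Hypothetical initial centroids: the third one is far from every
# point, so it is never the closest and its cluster stays empty.
centroids = [(0, 0), (1, 1), (50, 50)]

labels = [min(range(3), key=lambda i: math.dist(p, centroids[i]))
          for p in points]
counts = [labels.count(i) for i in range(3)]  # cluster 2 is empty
```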

Exercise

Solution Step 1

Solution Step 2

Solution Step 3

Solution Step 4

Solution Step 5

Solution Step 6

Solution Step 7

Approximate Solution

Notes

  • Kind of a weird outcome

  • Caused by unlucky initial positioning

    • One data lump at left became three clusters

    • Two clearly distinct data lumps at right became one cluster

Questions? Comments?

Exercise

Exercise

Notes

  • That actually kind of came out ok…

As you can see

  • A lot depends on initial positioning

  • And on the number of clusters

  • How do you pick which final position and number of clusters to go with?

Next module

  • Clustering – Validation and Selection of K

Questions? Comments?

Closing thought

  • How might you want to use clustering?