Clustering using k-Means

If you don't remember how the k-Nearest Neighbor classifier works, now would be a good time to revisit our "Classification" lesson in module 1. The algorithms have similar names and both take a user-chosen parameter k, but its meaning differs: in k-NN, k is the number of nearest neighbors used to classify a sample, while in k-Means, k is the number of clusters. Remember: k-NN is for classification, k-Means is for clustering.

k-Means is almost as simple as k-NN at its core. For 2-dimensional clustering, the logic can be written as the following steps:

  • Step 0: The user selects a value for k and launches k-Means on m samples.
  • Step 1: Randomize starting (x,y)-coordinates for all k cluster centroids.
  • Step 2: Compute the Euclidean distance from each of the m samples to each of the k centroids.
  • Step 3: Assign each sample to its closest cluster centroid.
  • Step 4: Compute the mean (x,y)-coordinate of the samples assigned to each cluster.
  • Step 5: Move the cluster centroids to these new coordinates.

Steps 2 to 5 are repeated until a stopping criterion has been met. A typical stopping criterion is that the centroid locations no longer move between iterations.
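The steps above can be sketched in Python with NumPy. This is an illustrative sketch, not a production implementation; the function and variable names are our own:

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """Minimal 2-D k-Means following the steps above."""
    rng = np.random.default_rng(rng)
    # Step 1: random starting coordinates within the data range.
    lo, hi = X.min(axis=0), X.max(axis=0)
    centroids = rng.uniform(lo, hi, size=(k, X.shape[1]))
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each sample to its closest centroid.
        labels = dists.argmin(axis=1)
        # Steps 4-5: move each centroid to the mean of its assigned samples
        # (a centroid with no assigned samples stays in place).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: centroids did not move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Because Step 1 is random, different runs can converge to different clusterings; practical implementations often restart with several initializations and keep the best result.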

The steps will be covered in detail in the Jupyter Notebook exercise.

Other algorithms

Note that this is just a brief overview of clustering, focusing on a single simple algorithm.

  • Read the Scikit Learn section about clustering for more information.

Note that only some algorithms take the number of clusters as a parameter. DBSCAN, for example, clusters samples based on their density. The number of clusters found depends on the input parameter eps, which is the "maximum distance between two samples for one to be considered as in the neighborhood of the other."
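A small sketch of this behavior using Scikit-Learn's DBSCAN (the data and parameter values here are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense groups far apart, plus one isolated point.
X = np.vstack([
    rng.normal(0, 0.2, (30, 2)),
    rng.normal(5, 0.2, (30, 2)),
    [[10.0, 10.0]],
])

# eps is the neighborhood radius; no number of clusters is given.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
# Count the clusters found (label -1 marks samples assigned to no cluster).
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Here DBSCAN discovers the two dense groups on its own; increasing eps enough would merge them into one cluster.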

Note also that some algorithms can label samples as noise. In that case, what class label will those samples be given? What kind of performance metrics can be used when no ground truth is available? Find out by reading the Scikit-Learn documentation about clustering.
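One family of metrics needs no ground-truth labels at all: internal metrics such as the silhouette coefficient score how compact and well separated the discovered clusters are. A minimal sketch using Scikit-Learn (synthetic data, parameter values chosen for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(5, 0.2, (30, 2))])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
# silhouette_score compares intra-cluster distances to distances
# to the nearest other cluster; it ranges from -1 to 1 and
# requires no ground-truth labels.
score = silhouette_score(X, labels)
```

Well-separated, compact clusters like these give a score close to 1.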

Conclusions

This lesson is just a quick overview of clustering. Remember that clustering is a part of unsupervised learning. Thus, there are no labels and the goal is to find structure in the data.

Open up the saved PDF book, A Brief Introduction to Machine Learning for Engineers (v3, 2018), which is available here. Read the beginning of Chapter 6, up to and including Section 6.2, K-Means Clustering.