If you don't remember how the k-Nearest Neighbor classifier works, now would be a good time to revisit our "Classification" lesson in module 1. The algorithms share similar names and both use a parameter k in their computations, but remember that k-NN is for classification, while k-Means is for clustering.
k-Means is almost as simple as k-NN in its core functionality. For 2-dimensional clustering, the logic can be written as the following steps (a short code sketch follows the list):
1. The user chooses the number of clusters k and launches the k-Means for m samples.
2. Set the (x, y)-coordinates for all k cluster centroids (randomly on the first round, to the means from step 5 on later rounds).
3. Compute the distance between each of the m samples and each of the k centroids.
4. Mark each of the m samples as belonging to the closest of the k cluster centroids.
5. Compute the new mean (x, y)-coordinate of the m samples in each of the k clusters.

Steps 2 to 5 are repeated until a stopping criterion has been met. A typical stopping criterion is that the locations of the centroids did not move.
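As a concrete illustration of these steps, here is a minimal NumPy sketch for 2-dimensional data. The function name, the random initialization, the iteration limit, and the toy points are illustrative choices, not part of the lesson; the Jupyter Notebook exercise is the authoritative walk-through.

```python
import numpy as np

def k_means_2d(samples, k, max_iter=100, seed=0):
    """Cluster an (m, 2) array of (x, y) samples into k clusters."""
    rng = np.random.default_rng(seed)
    m = samples.shape[0]
    # Step 2: set the centroid coordinates, here by picking k random samples.
    centroids = samples[rng.choice(m, size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distance from every sample to every centroid, shape (m, k).
        distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: mark each sample as belonging to its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 5: new mean (x, y)-coordinate of the samples in each cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            samples[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: the centroids did not move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with made-up points forming two obvious groups:
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels, centroids = k_means_2d(points, k=2)
print(labels)     # cluster index per sample, e.g. [0 0 0 1 1 1]
print(centroids)  # approximate group means
```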
The steps will be covered in detail in the Jupyter Notebook exercise.
Note that this is just a brief overview of clustering, and it only focuses on one simple algorithm.
Note that only some algorithms take the number of clusters as a parameter. For example, DBSCAN forms clusters based on the density of the samples. The number of clusters depends on the input parameter eps, which is the "maximum distance between two samples for one to be considered as in the neighborhood of the other."
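A minimal scikit-learn sketch of this, assuming a small made-up 2-D array X; the eps and min_samples values are illustrative, and the number of clusters follows from the data rather than from a parameter:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense groups plus one isolated point (illustrative values).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [9.0, 0.0]])

# eps is the maximum distance for two samples to count as neighbors;
# DBSCAN is never told how many clusters to find.
clustering = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(clustering.labels_)  # cluster index assigned to each sample
```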
Note also that some algorithms allow samples to be labeled as noise. In this case, what class label will be given to them? What kind of performance metrics can be used when no ground truth is available? Find out by reading the Scikit-Learn documentation about clustering.
This lesson is just a quick overview of clustering. Remember that clustering is a part of unsupervised learning. Thus, there are no labels and the goal is to find structure in the data.
Open up the saved PDF book, A Brief Introduction to Machine Learning for Engineers (v3, 2018), which is available here. Read the beginning of Chapter 6, up to and including Section 6.2, K-Means Clustering.