Machine Learning Interview Questions – Q3 – How is KNN different from k-means clustering?

Machine Learning Interview Questions is a series I will post to periodically.  The idea was inspired by the post 41 Essential Machine Learning Interview Questions at Springboard.  I will take each question posted there and answer it in my own words.  Whether my answer expands on theirs or simply offers another way to phrase it, I hope you come away with a better understanding of the topic at hand.

To see other posts in this series visit the Machine Learning Interview Questions category.

Q3 – How is KNN different from k-means clustering?

K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a supervised learning algorithm, most often used for classification.  It requires labeled data.  Rather than fitting a model up front, KNN stores the labeled points and classifies a new, unlabeled point by looking at its ‘k’ nearest labeled points.  The variable ‘k’ is a parameter set by the machine learning engineer.

For example, if k=5, a new unlabeled point is compared against its 5 nearest labeled points and is assigned the class held by the majority of them.  In short, KNN classifies unlabeled data by looking at the labels of the k nearest labeled points.
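The majority vote described above can be sketched in a few lines of plain Python.  This is only an illustration, not a production implementation: the function name knn_classify, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest labeled points."""
    # Euclidean distance from the query to every labeled point
    # (the distance metric is an assumption; others can be used).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest labeled points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two labeled groups in 2D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

# With k=5 the neighbors are three "red" points and two "blue" points.
prediction = knn_classify(points, labels, query=(2.0, 2.0), k=5)
print(prediction)
```

Note that there is no separate training step here: the labeled points are simply kept around and consulted at prediction time.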

K-Means Clustering

K-means clustering is an unsupervised clustering algorithm.  It works on unlabeled data.  Given the unlabeled points and some ‘k’ number of clusters, k-means iteratively groups the points by assigning each point to its nearest cluster center (centroid) and then recomputing each centroid as the mean of the points assigned to it.

The variable ‘k’ represents the number of centroids to use, which is also the number of clusters you will group your data into.  The algorithm works by moving these centroids at every iteration to reduce the error function.  The error function is essentially the sum of squared distances from each data point to its nearest centroid.
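To make the centroid updates concrete, here is a minimal sketch of the usual k-means loop (often called Lloyd's algorithm) in plain Python.  The names kmeans and squared_error, the toy points, and the hand-picked starting centroids are assumptions for illustration; in practice the starting centroids are usually chosen randomly, and the error is taken here to be the sum of squared distances from each point to its assigned centroid.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    assignments = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignments


def squared_error(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignments))


# Toy unlabeled data, made up for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# k = 2, with starting centroids picked by hand for reproducibility.
centroids, assignments = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids, assignments, squared_error(points, centroids, assignments))
```

Unlike KNN, no labels appear anywhere in this code: the cluster indices it returns are arbitrary group identifiers, not classes.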

Summary

KNN and k-means clustering are very different algorithms that solve different problems, and each has its own meaning for the variable ‘k’.  KNN is a supervised classification algorithm that labels new data points based on their ‘k’ nearest labeled points, while k-means clustering is an unsupervised algorithm that groups the data into ‘k’ clusters.