k-means clustering is an unsupervised machine learning technique that groups an unlabelled dataset into different clusters.
The clusters in the example are distinguished by color, and the current cluster seeds (centroids) are marked by a starburst. In the first round, each point is assigned to its closest seed, and a new seed is chosen for each cluster as the average of all points in that cluster. As a result, the blue cluster seed moves to the right. In the second round, both cluster seeds drift to their correct locations, giving a proper division. After two rounds, the clusters have reached a steady state and would not change with further iterations.
What similarity measures are applicable for clustering?
Clustering groups similar data objects together on the basis of a similarity measure. Similarity reflects the strength of the relationship between two data items, that is, how alike two data patterns are. In practice, similarity is commonly expressed through a distance between data points: the smaller the distance, the greater the similarity.
To group objects into clusters, this similarity measure is most commonly based on distance functions such as Euclidean distance, Manhattan distance, Minkowski distance, cosine similarity, etc. Clusters are formed such that the distance between any two data objects within a cluster is small, while the distance between data objects in different clusters is large. Clustering using distance functions, called distance-based clustering, is widely used.
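These distance functions can be sketched in a few lines of NumPy. The vectors `a` and `b` here are made-up examples, not data from the article:

```python
import numpy as np

# Made-up example vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # L2 distance
manhattan = np.sum(np.abs(a - b))                  # L1 distance
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # Minkowski distance, p=3
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity
```

Note that cosine is a similarity (larger means more alike), while the others are distances (smaller means more alike).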
What are the main objectives of the k-means clustering algorithm?
The main objectives of the k-means algorithm are:
- Iteratively determine the best positions for the k center points (centroids).
- Assign each data point to its closest k-center. The data points closer to a particular k-center (with minimum Euclidean distance) form a cluster.
- Minimize the sum of distances between the data points and their corresponding cluster centroid.
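The last objective, the within-cluster sum of squared distances (often called inertia), can be written out directly. This is an illustrative sketch; the points, labels, and centroids are made-up values:

```python
import numpy as np

def kmeans_objective(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Made-up example: two clusters of two points each.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(kmeans_objective(points, labels, centroids))  # 4.0
```

k-means tries to choose labels and centroids that make this quantity as small as possible.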
Could you share some use cases of k-means clustering?
This algorithm is used across a variety of applications such as image segmentation, image compression, cyber-profiling criminals, document classification, market/customer segmentation, clustering of IT alerts, delivery store optimization, insurance fraud detection, etc.
For example, in image compression, k-means clustering groups similar colors into 'k' clusters (say, k=64). Each cluster centroid then represents the three-dimensional RGB color vector of its respective cluster, and these 'k' cluster centroids replace all the color vectors in their respective clusters.
In the original image, each pixel takes 24 (8x3) bits to store its color vector. After the k-means algorithm is applied, each pixel takes only 6 bits, since it stores only the label of the cluster to which it belongs: k=64 different labels can be represented using just 6 bits (2^6 = 64). The final image will have 64 different colors in its RGB color space.
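This compression scheme can be sketched with scikit-learn (an assumption of this example; the random "image" below is a synthetic stand-in for real pixel data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in 'image': 100x100 pixels with random RGB values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(100, 100, 3)).astype(float)

pixels = image.reshape(-1, 3)    # one row per pixel
k = 64                           # 2^6 palette colors -> 6-bit labels
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(pixels)

labels = km.labels_              # per-pixel cluster label (6 bits each)
palette = km.cluster_centers_    # 64 RGB centroids, 24 bits each
compressed = palette[labels].reshape(image.shape)
```

Only the 64-entry palette plus one 6-bit label per pixel need to be stored, rather than 24 bits per pixel.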
Are there any variations of the k-means algorithm?
Yes. The following are some of the variations of the k-means algorithm:
- k-medians clustering - uses the median in each dimension (instead of the mean), minimizing the sum of absolute deviations rather than squared deviations.
- k-medoids (Partitioning Around Medoids) - uses medoid instead of mean, and minimizes the sum of distances for arbitrary distance functions.
- Fuzzy C-Means clustering - A soft version of k-means, where each data point has a fuzzy degree of belonging to each cluster.
- k-means++ - This is the standard k-means algorithm coupled with a smarter initialization of the centroids.
What are the steps of the k-means clustering algorithm?
The first step of the k-means clustering algorithm is to randomly choose k centroids, where k is the number of clusters to be formed. Next, each data point is assigned to its nearest centroid. Then, the mean of all points in each cluster is computed and becomes that cluster's new centroid. The last two steps are repeated iteratively until the centroid positions no longer change.
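These steps can be sketched as a minimal NumPy implementation. It is illustrative only: it assumes no cluster ever becomes empty during iteration, a case production implementations must handle:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat steps 2-3 until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
```

On this toy data the two groups end up in separate clusters regardless of which points are drawn as initial centroids.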
What are the major challenges associated with k-means clustering?
- Initialization sensitivity - Since the standard k-means algorithm initializes centroids randomly, different runs can produce different clusterings. For larger and more complex datasets, the run-time of the algorithm grows and it may not reach an optimal clustering.
- Clustering data of varying sizes and densities - k-means has trouble clustering data where the clusters are of varying sizes and densities; to cluster such data, k-means usually needs to be generalized. While the algorithm clusters uniform data sets well, it fails to form the expected clusters when the data varies in density or in the spread between clusters.
- Clustering outliers - Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. To overcome this challenge, outliers could be clipped or removed before clustering.
- Scaling with number of dimensions - As the number of dimensions (variables) increases, more data points become near-equidistant from a chosen centroid, until the choice of nearest neighbour is nearly random. In high dimensions, the distances from a data point to its nearest and farthest neighbours can become almost equal.
What are the ways to avoid the problem of initialization sensitivity in k-means algorithm?
There are two ways to avoid the problem of initialization sensitivity:
- Repeat k-means: The algorithm is executed multiple times with different random initializations, and the run that yields the smallest intra-cluster distances and largest inter-cluster distances is selected.
- k-means++: This is a smart centroid initialization technique. Only the first centroid is chosen randomly; each subsequent centroid is chosen to be far from the centroids already selected. This results in faster convergence and a lower chance of poorly initialized centroids.
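The k-means++ seeding step can be sketched as follows. This is an illustrative version under the common formulation (new centroids sampled with probability proportional to squared distance from the nearest existing centroid); `kmeans_pp_init` is a made-up helper name:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance from the nearest chosen centroid."""
    if rng is None:
        rng = np.random.default_rng(0)
    centroids = [X[rng.integers(len(X))]]   # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()               # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Made-up data: two tight groups at (0, 0) and (5, 5).
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids = kmeans_pp_init(X, k=2)
```

Because every point in the first centroid's group has zero distance to it, the second centroid is guaranteed to come from the other group here.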
How to decide the optimal value of k in the k-means algorithm?
- Elbow method - The sum of squared errors (SSE), computed from the distances between data points and their cluster centroids, is used to select an ideal k value. A value of k is chosen at the point where the SSE curve begins to flatten out, forming a visible bend. Visually, this graph looks like an elbow, hence the name.
- Average Silhouette method - This measures the quality of clustering and determines how well each object lies within its cluster. The similarity/dissimilarity score is measured between the assigned cluster and the next best/nearest cluster for each data point. The average silhouette of observations is calculated for different values of k. The optimal k is one that maximizes the average silhouette over a range of possible values for k.
- Gap statistic method - The total intra-cluster variation is compared for different k values with their expected values under null reference distribution of data (i.e. a distribution with no obvious clustering). The optimal k value is one that maximizes the gap statistic value.
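The elbow method, for instance, can be sketched with scikit-learn (an assumption of this example; the three synthetic blobs are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs; the elbow should appear near k=3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(center, 0.3, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

# SSE (sklearn's inertia_) always decreases as k grows;
# choose k where the decrease flattens out.
sse = {k: KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_
       for k in range(1, 8)}
```

Plotting `sse` against k would show a steep drop up to k=3 and only marginal gains beyond it.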
What are the possible stopping criteria in k-means algorithm?
The following are possible stopping conditions in k-means algorithm:
- Stability: The centroids of new clusters do not change.
- Convergence: The algorithm has converged to a minimum, i.e., the points stay in the same clusters.
- Maximum number of iterations has been reached: While the number of iterations limits the run-time of the algorithm, the quality of the clustering could be poor due to an insufficient number of iterations.
- When the Residual Sum of Squares (RSS), i.e. the within-cluster sum of squares, falls below a threshold: This ensures the desired clustering quality after termination. Combining this with a bound on the number of iterations guarantees termination.
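In scikit-learn, for instance, two of these criteria map directly to constructor parameters: `max_iter` bounds the number of iterations, and `tol` implements the stability criterion by stopping once the centroids move less than a relative threshold (a sketch, assuming scikit-learn; the data is random for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))

# max_iter bounds the run-time; tol stops the loop once centroid movement
# between consecutive iterations falls below a (relative) threshold.
km = KMeans(n_clusters=3, max_iter=50, tol=1e-4, n_init=1, random_state=0).fit(X)
print(km.n_iter_)   # iterations actually used, never more than max_iter
```

A small `tol` gives higher-quality clusters at the cost of more iterations; `max_iter` caps that cost.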
How to perform k-means on larger datasets to make it faster?
For large datasets, an alternative to k-means is mini-batch k-means. Unlike k-means, it does not iterate over the entire dataset. Instead, it draws small, random, fixed-size batches of data that fit in memory. In each iteration, a new random batch is sampled and used to update the clusters, with the cluster centres re-computed using a form of gradient descent. This is repeated until convergence. Mini-batch k-means is known to converge faster than k-means, at the cost of a slightly different cluster output.
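A sketch using scikit-learn's `MiniBatchKMeans` (an assumption of this example; the two-blob dataset is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Two well-separated synthetic blobs of 5000 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.5, size=(5000, 2))
               for center in ([0, 0], [8, 8])])

# Mini-batch k-means updates centres from small random batches rather than
# scanning the full dataset on every iteration.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=3, random_state=0).fit(X)
full = KMeans(n_clusters=2, n_init=3, random_state=0).fit(X)
# The two centre sets are close but typically not identical.
```

On datasets that do not fit comfortably in memory, the mini-batch variant trades a small amount of cluster quality for a large speed-up.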
The mathematician Hugo Steinhaus publishes an article titled "Sur la Division des Corps Matériels en Parties". He presents the problem of partitioning a heterogeneous solid by the adequate selection of partitions. The article's conclusion mentions that diverse questions, such as types in anthropology or industrial object normalization, require a solution based on determining n fictitious representatives of a numerous population, chosen so as to minimize as much as possible the deviations between population elements and those from the sample. The deviation is measured between every actual element and the closest fictitious element.
The concept of the k-means algorithm was first suggested by Stuart Lloyd as a technique for pulse-code modulation. In his article, he mentions that the algorithm may converge to local optima, since it is sensitive to the arbitrarily chosen initial cluster heads. The partitioning method comprises k partitions of the data, where each partition denotes a cluster with a respective cluster head, such that k ≤ n (the number of data objects). Categorizing the data into k clusters must fulfil two requirements: i) every cluster must contain at least one data object; ii) every data object must belong to exactly one cluster.
In the paper 'On Grouping for Maximum Homogeneity', Walter D Fisher discusses a practical procedure for grouping a given set of arbitrary numbers so that the variance within groups is minimized. For problems of up to 200 numbers placed in 10 groups, he considers two types of problems. First is the unrestricted problem, where no restrictions or side conditions are imposed on the partitions allowed. Second is the restricted problem, where such conditions are imposed a priori on the basis of previous knowledge, theory, or for convenience.
In Principles of Numerical Taxonomy, Sokal and Sneath propose measuring and counting as many characters as possible of the Operational Taxonomic Units (OTUs), the organisms to be compared. Based on a matrix of variation in all features across all OTUs, the OTUs are then compared by overall similarity, affinity, or phenetic distance. The results are displayed by means of a network, or the OTUs are linked to each other by a clustering algorithm to produce a phenogram. Following this work, many different ways to measure pair-wise similarity or dissimilarity, and many different clustering methods, have been devised.
James MacQueen, from the Department of Statistics of the University of California, introduces the term k-means in his article titled "Some Methods for Classification and Analysis of Multivariate Observations". He proposes an algorithm to partition a set of instances (data points) into clusters whose within-cluster variance is small. Before MacQueen coined the term k-means, the technique was known by names like dynamic clustering method, iterative minimum-distance clustering, nearest centroid sorting, and h-means, among others.
Engelman and Hartigan, in their article 'Percentage Points of a Test for Clusters', state that a set of observations may be partitioned into two clusters to maximize the ratio of the between-cluster sum of squares to the within-cluster sum of squares. This maximum ratio is the likelihood ratio test statistic for the observations coming from two normal populations with different means but the same variance. They consider the null hypothesis that the observations are a sample from a single normal population.
The book 'Cluster Analysis for Applications' covers various methods and applications of cluster analysis, the categories of problems it can solve, and the need for cluster analysis algorithms. Topics range from variables and scales to measures of association among variables and among data units. Conceptual problems in cluster analysis are discussed, along with hierarchical and non-hierarchical clustering methods.
- Agarwal, Vibhor. 2018. "Image Compression using K-means Clustering." Medium.com, May 24. Accessed 2021-12-29.
- Aldahdooh, Raed T., and Wesam Ashour. 2013. "DIMK-means: Distance-based Initialization Method for K-means Clustering Algorithm." I.J. Intelligent Systems and Applications. Accessed 2021-12-15.
- Anderberg, Michael R. 1973. "Cluster Analysis for Applications." Elsevier. Accessed 2021-12-29.
- Arthur, David, and Sergei Vassilvitskii. 2006. "k-means++: The Advantages of Careful Seeding." Stanford University. Accessed 2021-12-19.
- Engelman, L., and J.A. Hartigan. 1969. "Percentage Points of a Test for Clusters." Journal of the American Statistical Association, vol. 64, issue 328. Accessed 2021-12-28.
- Fisher, Walter D. 1958. "On Grouping For Maximum Homogeneity." American Statistical Association Journal. Accessed 2021-12-27.
- Garbade, Michael J. 2018. "Understanding K-means Clustering in Machine Learning." towards data science, September 13. Accessed 2021-12-15.
- Garg, Sanjay, and Ramesh Chandra Jain. 2006. "Variations of k-mean Algorithm: A Study for High-Dimensional Large Data Sets." Science Alert. Accessed 2021-12-29.
- Ghosh, Joydeep, and Alexander Liu. 2009. "K-Means." Chapman and Hall, 1st Edition, The Top Ten Algorithms in Data Mining. Accessed 2021-12-16.
- UC Business Analytics R Programming Guide. 2012. "K-Means Cluster Analysis." GitHub. https://uc-r.github.io/kmeans_clustering
- Hinneburg, Alexander, Charu C Aggarwal, and Daniel A Keim. 2000. "What is the nearest neighbor in high dimensional spaces?" Cairo, Egypt, Proceedings of the 26th VLDB Conference. Accessed 2021-12-16.
- Irani, Jasmine, Nitin Pise, and Madhura Phatak. 2016. "Clustering Techniques and the Similarity Measures used in Clustering: A Survey." Volume 134-No.7, IJCA, January. Accessed 2021-12-29.
- Kaushik, Manju, and Bhawana Mathur. 2014. "Comparative Study of K-Means and Hierarchical Clustering Techniques." Volume 2, Issue 6, IJSHRE, June. Accessed 2021-12-16.
- Kharwal, Aman. 2021. "Mini-batch K-means Clustering in Machine Learning." TheCleverProgrammer, September 10. Accessed 2021-12-15.
- Koddinariya, Trupti M, and Prashant R Makwana. 2018. "Review on determining number of cluster in K-Means Clustering." Volume 1, Issue 6, IJARCSMS, November. Accessed 2021-12-16.
- Lloyd, Stuart P. 1982. "Least Squares Quantization in PCM." IEEE Transactions on Information Theory, March. Accessed 2021-12-28.
- Page, Justin T, Zachary S Liechty, Mark D Huynh, and Joshua A Udall. 2014. "BamBam: genome sequence analysis tools for biologists." BMC Research Notes, November. Accessed 2021-12-29.
- Perez-Ortega, Joaquin, Nelva Nely Almanza Ortega, Andrea Vega Villalobos, Rodolfo Pazos Rangel, Crispin Zavala Diaz, and Alicia Martinez Rebollar. 2019. "The K-Means Algorithm Evolution." InTechOpen Book Chapter. Introduction to Data Science and Machine Learning, April 03. Accessed 2021-12-16.
- Raghupathi, Kaushik. 2018. "10 Interesting Use Cases for the K-Means Algorithm." DZone, March 27. Accessed 2022-01-30.
- Raykov, Yordan P, Alexis Boukouvalas, Fahd Baig, and Max A Little. 2016. "What to do when K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm." PLOS ONE, September 26. Accessed 2021-12-16.
- Sharma, Pulkit. 2019. "The Most Comprehensive Guide to k-Means Clustering you'll ever need." Analytics Vidhya, August 19.
- Singh, Seema. 2018. "K-Means Clustering." Data Driven Investor. Accessed 2021-12-15.
- Wright, Richard Irwin Vane. 2013. "Taxonomy, Methods of." Second Edition, Encyclopedia of Biodiversity. Accessed 2021-12-29.
- Wu, Junjie. 2012. "Advances in K-means Clustering." Springer Theses.
- Yadav, Ritu, and Anuradha Sharma. 2012. "Advanced Methods to Improve Performance of K-means Algorithm: A Review." Global Journal of Computer Science and Technology, vol. 12, issue 9, version 1.0.
- Piech, Chris. 2013. "K means." Stanford CS221.