kMeans Clustering
 Summary

Discussion
 What similarity measures are applicable for clustering?
 What are the main objectives of the kmeans clustering algorithm?
 Could you share some use cases of kmeans clustering?
 Are there any variations of the kmeans algorithm?
 What are the steps of the kmeans clustering algorithm?
 What are the major challenges associated with kmeans clustering?
 What are the ways to avoid the problem of initialization sensitivity in kmeans algorithm?
 How to decide the optimal number of k in kmeans algorithm?
 What are the possible stopping criteria in kmeans algorithm?
 How to perform kmeans on larger datasets to make it faster?
 Milestones
 References
 Further Reading
 Article Stats
 Cite As
The kmeans Clustering method is an unsupervised machine learning technique that groups unlabelled dataset into different clusters.
The algorithm starts with a group of randomly selected 'k' centroids as the beginning points for every cluster. Then, iterative calculations optimize the centroid positions.^{}
The clusters in the example are distinguished by color and current cluster seeds (centroids) are marked by a starburst. In the first round, each point is assigned to its closest seed, and a new seed is chosen for each cluster based on the average of all points in that cluster. As a result, the blue cluster seed moves to the right. In the second round, both cluster seeds drift to their correct locations in a proper division. After two rounds, the clusters have reached a steadystate, and wouldn't change further through an infinite number of iterations.^{}
Discussion
What similarity measures are applicable for clustering? Clustering is done on the basis of a similarity measure to group similar data objects together. Similarity is an amount to reflect the strength of relationship between two data items, to represent how similar two data patterns are. A similarity measure is the distance between various data points.^{}
To group objects in clusters, this similarity measure is most commonly based on distance functions such as Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, etc. The clusters are formed such that any two data objects in a cluster have a minimum distance value and any two data objects across different clusters have a maximum distance value. Clustering using distance functions, called distance based clustering, is popularly used to cluster objects.^{}
What are the main objectives of the kmeans clustering algorithm? The main objectives of the kmeans algorithm are:
 Determine the best value of k center points (centroids) iteratively.
 Assign each data point to its closest kcenter. The data points closer to a particular kcenter (with minimum Euclidean distance) form a cluster.
 Minimize the sum of distances between the data points and their corresponding cluster centroid.^{}
Could you share some use cases of kmeans clustering? This algorithm is used across a variety of applications such as image segmentation, image compression, document classification, market/customer segmentation, clustering of IT alerts, public transport data analysis, insurance fraud detection, etc.^{}
For example, in image compression, kmeans clustering groups similar colors together into 'k' clusters (say k=64) of different colors. Hence, each cluster centroid represents the threedimensional color vector in RGB color space of its respective cluster. Then, these 'k' cluster centroids will replace all the color vectors in their respective clusters.^{}
In the original image, each pixel takes 24 (8x3) bits to store its corresponding color vector. After kmeans algorithm is applied, each pixel takes only 6 bits to store as it stores only the label of the cluster to which it belongs. K=64 different color vectors can be represented using 6 bits only. The final image will have 64 different colors in its RGB color space.^{}
Are there any variations of the kmeans algorithm? Yes. The following are some of the variations of the kmeans algorithm:
 kmedians clustering  uses the median in each dimension (instead of the mean) for optimal clustering.
 kmedoids (Partitioning Around Medoids)  uses medoid instead of mean, and minimizes the sum of distances for arbitrary distance functions.
 Fuzzy CMeans clustering  A soft version of kmeans, where each data point has a fuzzy degree of belonging to each cluster.
 kmeans++  This is the standard kmeans algorithm coupled with a smarter initialization of the centroids.^{}
What are the steps of the kmeans clustering algorithm? The first step in kmeans clustering algorithm is to randomly choose k centroids (k is the number of clusters to be formed). Each data point is assigned to its nearest centroid. The mean of all points in each cluster is computed and a new centroid is selected. Through iterations, the last 2 steps are repeated until the centroid positions do not change.^{}
What are the major challenges associated with kmeans clustering?  Initialization sensitivity  Since the initialization process of the standard kmeans algorithm is random, it leads to the problem of initialization sensitivity. For larger and more complex datasets, the runtime of the algorithm is longer and it may not reach optimal clustering.^{}
 Clustering data of varying sizes and densities  kmeans has trouble clustering data where the clusters are of varying sizes and densities. Usually, to cluster such data, you need to generalize kmeans.^{} While the kmeans algorithm clusters the given data sets uniformly when used for uniform data sets, it fails to form clusters as per requirements when used for data sets of varying densities or distances between data sets.
 Clustering outliers  Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. To overcome this challenge, outliers could be clipped or removed before clustering.
 Scaling with number of dimensions  With increasing number of dimensions (variables), more datapoints become nearest neighbours of a selected centroid, until the choice of nearest neighbour is nearly random. The distances between a data point and its nearest and farthest neighbours could become equidistant with increases dimensions.^{}
What are the ways to avoid the problem of initialization sensitivity in kmeans algorithm? There are two ways to avoid the problem of initialization sensitivity:
 Repeat kmeans: The algorithm is executed repeatedly. The centroids are initialized and the clusters are formed to result in smallest intracluster distance and larger intercluster distance.
 kmeans++: This is a smart centroid initialization technique. Only one centroid is initialized randomly, and other centroids are chosen such that they are very far from the initial centroid/s. This results in faster convergence and lower possibility of the centroid being poorly initialized.^{}
Between the two techniques that can be implemented to avoid the problem of initialization sensitivity, K means++ provides comparatively better results.^{}
How to decide the optimal number of k in kmeans algorithm?  Elbow method  The sum of squared distance (SSE) is used to select an ideal k value. This is based on the distance between the data points and respective clusters. A value of k is chosen such that the SSE begins to flatten out and an inflection point is seen. Visually, this graph looks similar to an elbow, hence the name.^{}
 Average Silhouette method  This measures the quality of clustering and determines how well each object lies within its cluster. The similarity/dissimilarity score is measured between the assigned cluster and the next best/nearest cluster for each data point. The average silhouette of observations is calculated for different values of k. The optimal k is one that maximizes the average silhouette over a range of possible values for k.^{}
 Gap statistic method  The total intracluster variation is compared for different k values with their expected values under null reference distribution of data (i.e. a distribution with no obvious clustering). The optimal k value is one that maximizes the gap statistic value.
What are the possible stopping criteria in kmeans algorithm? The following are possible stopping conditions in kmeans algorithm:
 Stability: The centroids of new clusters do not change.
 Convergence: The algorithm has converged at the minima i.e., the points stay in the same cluster.
 Maximum number of iterations has been reached: While the number of iterations limits the runtime of the algorithm, the quality of the clustering could be poor due to an insufficient number of iterations.^{}
 When Residual Sum of Squares(RSS)/ withincluster sum of squares falls below a threshold: This ensures the desired quality of clustering after termination. The combination of this with a bound on the number of iterations guarantees convergence.
How to perform kmeans on larger datasets to make it faster? For large datasets, an alternative to kmeans is the minibatch kmeans. Unlike kmeans, this doesn’t iterate over the entire dataset. It uses small, random, fixedsize batches of data to store in memory. With each iteration, a random sample of data is obtained and used to update the clusters. The means are calculated and the cluster centres are recalculated using gradient descent. This is repeated until convergence. Minibatch kmeans has been known to provide faster convergence than kmeans, with a slightly different cluster output.^{}
Milestones
The mathematician Hugo Steinhaus publishes an article titled "Sur la Division des Corps Materials en Parties". He presents the problem of partitioning a heterogeneous solid by the adequate selection of partitions. The article's conclusion mentions that diverse questions about types in anthropology or industrial object normalization require a solution based on the determination of n fictitious representatives of a numerous population, chosen so as to minimize as much as possible the deviations between population elements and those from the sample. The deviation is measured between every actual element and the closest fictitious element.^{}
The concept of kmeans algorithm was first suggested by Stuart Lloyd as a code for pulsecode modulation. In his article, he mentions that this algorithm may result in local optima since it is critical to initial cluster heads so formed arbitrary. The partitioning method comprises of k partitions of data, where each partition denotes a cluster with respective cluster heads such that k ≤ n(data objects). Categorizing the data into k clusters must fulfil the two requirements: i) Every cluster must comprise of one or more data objects. ii) Every object must belong to exactly one cluster.^{}
In the paper 'On Grouping for Maximum Homogeneity', Walter D Fisher discusses the practical procedure for grouping a given set of arbitrary numbers, so that the variance within groups is minimized. For problems up to size of 200 numbers to be placed in 10 groups, he considers two types of problems. First is the unrestricted problem, where no restrictions or side conditions are imposed on the partitions allowed. Second is the restricted problem, where such conditions are imposed a priori on the basis of previous knowledge, theory, or for convenience.^{}
In Principles of Numerical Taxonomy, Sokal and Sneath use Operational Taxonomic Units (OTUs) to measure and count as many characters as possible of the organisms to be compared. Based on a matrix of variation in all features across all OTUs, the OTUs are then compared by overall similarity, affinity, or phonetic distance. The results are displayed by means of a network or OTUs are linked to each other by a clustering algorithm to produce a phenogram. Following this, many different ways have been proposed to measure pairwise similarity or dissimilarity, and many different clustering methods have also been devised.^{}
James MacQueen, from the Department of Statistics of the University of California, introduces the term kmeans in his article titled "Some Methods for Classification and Analysis of Multivariate Observations". He proposes an algorithm to partition an instance into a set of clusters whose variance is small for each cluster. Prior to the term kmeans being coined by him, it was known by names like dynamic clustering method, iterative minimumdistance clustering, nearest centroid sorting, and hmeans, among others.^{}
Engelman and Hartigan, in their article 'Percentage Points of a Test for Clusters', state that a set of observations may be partitioned into two clusters to maximize the ratio of between sum of squares to within sum of squares. This maximum ratio is the likelihood ratio test statistic for the observations to come from two normal populations, with different means but the same variance. They consider the null hypothesis, that the observations are a sample from a single normal population.^{}
In the book 'Cluster Analysis for Applications', various methods and applications of cluster analysis has been covered, as well as the category solving problems and the need for cluster analysis algorithms. This includes variables and scales to measures of association among variables and among data units. The conceptual problems in cluster analysis is discussed, along with hierarchical and nonhierarchical clustering methods.^{}
References
 Agarwal, Vibhor. 2018. "Image Compression using Kmeans Clustering." Medium.com, May 24. Accessed 20211229.
 Aldahdooh, Raed T, and Wesam Ashour. 2013. "DIMKmeans "Distancebased Initialization Method for Kmeans Clustering Algorithm." I.J. Intelligent Systems and Applications. Accessed 20211215.
 Anderberg, Michael R. 1973. "Cluster Analysis for Applications." Elsevier. Accessed 20211229.
 Arthur, David, and Sergei Vassilvitskii. 2006. "kmeans++: The Advantages of Careful Seeding." Stanford University. Accessed 20211219.
 Engelman, L, and J.A. Hartigan. 2012. "Percentage Points of a Test for Clusters." Volume 64,1969Issue 328, Journal of the American Statistical Association, April 10. Accessed 20211228.
 Fisher, Walter D. 1958. "On Grouping For Maximum Homogeneity." American Statistical Association Journal. Accessed 20211227.
 Garbade, Michael J. 2018. "Understanding Kmeans Clustering in Machine Learning." towards data science, September 13. Accessed 20211215.
 Garg, Sanjay, and Ramesh Chandra Jain. 2006. "Variations of kmean Algorithm: A Study for HighDimensional Large Data Sets." Science Alert. Accessed 20211229.
 Ghosh, Joydeep, and Alexander Liu. 2009. "KMeans." Chapman and Hall, 1st Edition, The Top Ten Algorithms in Data Mining. Accessed 20211216.
 Github. 2012."KMeans Cluster Analysis." UC Business Analytics R Programming Guide.https://ucr.github.io/kmeans_clustering
 Hinneburg, Alexander, Charu C Aggarwal, and Daniel A Keim. 2000. "What is the nearest neighbor in high dimensional spaces?" Cairo, Egypt, Proceedings of the 26th VLDB Conference. Accessed 20211216.
 Irani, Jasmine, Nitin Pise, and Madhura Phatak. 2016. "Clustering Techniques and the Similarity Measures used in Clustering: A Survey." Volume 134No.7, IJCA, January. Accessed 20211229.
 Kaushik, Manju, and Bhawana Mathur. 2014. "Comparative Study of KMeans and Hierarchical Clustering Techniques." Volume 2, Issue 6, IJSHRE, June. Accessed 20211216.
 [Khan, Muhammad Rizwan. 2018. "K Means Clustering Algorithm & Its Application." Data Driven Investor, October 12. Accessed 20220104.](https://medium.datadriveninvestor.com/kmeansclusteringalgorithmitsapplicationff9e97297e6e)https://www.simplilearn.com/tutorials/machinelearningtutorial/kmeansclusteringalgorithm
 Kharwal, Aman. 2021. "Minibatch Kmeans Clustering in Machine Learning." TheCleverProgrammer, September 10. Accessed 20211215.
 Koddinariya, Trupti M, and Prashant R Makwana. 2018. "Review on determining number of cluster in KMeans Clustering." Volume 1, Issue 6, IJARCSMS, November. Accessed 20211216.
 Lloyd, Stuart P. 1982. "Least Squares Quantization in PCM." IEEE Transactions on Information Theory, March. Accessed 20211228.
 Page, Justin T, Zachary S Liechty, Mark D Huynh, and Joshua A Udall. 2014. "BamBam: genome sequence analysis tools for biologists." BMC Research Notes, November. Accessed 20211229.
 PerezOrtega, Joaquin, Nelva Nely Almanza Ortega, Andrea Vega Villalobos, Rodolfo Pazos Rangel, Crispin Zavala Diaz, and Alicia Martinez Rebollar. 2019. "The KMeans Algorithm Evolution." InTechOpen Book Chapter. Introduction to Data Science and Machine Learning, April 03. Accessed 20211216.
 Raykov, Yordan P, Alexis Boukouvalas, Fahd Baig, and Max A Little. 2016. "What to do when KMeans Clustering Fails: A Simple yet Principled Alternative Algorithm." PLOS ONE, September 26. Accessed 20211216.
 Sharma,Pulkit. 2019. "The Most Comprehensive Guide to kMeans Clustering you'll ever need." Analytics Vidhya, August 19.
 Singh, Seema. 2018. "KMeans Clustering." Data Driven Investor. Accessed 20211215.
 Wright, Richard Irwin Vane. 2013. "Taxonomy, Methods of." Second Edition, Encyclopedia of Biodiversity. Accessed 20211229.
Further Reading
 Junjie Wu. 2012. "Advances in Kmeans Clustering." Springer Theses.
 Yadav,Ritu, Sharma,Anuradha.2012. "Advanced Methods to Improve Performance of Kmeans Algorithm: A Review." Global Journal of Computer Science and Technology, Vol. 12, Issue 9, Version 1.0.
 Piech,Chris. 2013. "K means." Stanford CS221.
Article Stats
Cite As
See Also
 Machine Learning Model
 ExpectationMaximization Algorithm
 Text Clustering
 Machine Learning