Text Clustering

Text clustering. Source: Kunwar 2013.

The amount of text data generated in recent years has grown exponentially. It's essential for organizations to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with textual data has never been more important.

Text clustering is the task of grouping a set of unlabelled texts in such a way that texts in the same cluster are more similar to each other than to those in other clusters. Text clustering algorithms process text and determine if natural clusters (groups) exist in the data.

Discussion

  • What's the principle behind text clustering?
    Semantically similar sentences. Source: Yang and Tar 2018.

    The big idea is that documents can be represented numerically as vectors of features. The similarity in text can be compared by measuring the distance between these feature vectors. Objects that are near each other should belong to the same cluster. Objects that are far from each other should belong to different clusters.

    Essentially, text clustering involves three aspects, illustrated in the sketch after this list:

    • Selecting a suitable distance measure to identify the proximity of two feature vectors.
    • A criterion function that tells us when we've got the best possible clusters, so that we can stop further processing.
    • An algorithm to optimize the criterion function. A greedy algorithm will start with some initial clustering and refine the clusters iteratively.
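
    As a minimal sketch of these three aspects, consider the following Python snippet. It runs a k-means-style loop over made-up 2-D feature vectors: Euclidean distance is the distance measure, within-cluster sum of squared distances (inertia) is the criterion function, and moving each centroid to its cluster mean is the greedy refinement.

        import numpy as np

        # Toy 2-D feature vectors for five "documents" and two initial centroids.
        X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]], dtype=float)
        centroids = X[[0, 3]].copy()

        for _ in range(10):
            # Distance measure: Euclidean distance from each point to each centroid.
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Criterion function: within-cluster sum of squared distances (inertia).
            inertia = (d.min(axis=1) ** 2).sum()
            # Greedy refinement: move each centroid to the mean of its cluster.
            centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

        print(labels, inertia)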
  • What are the use cases of text clustering?
    Applications of text clustering. Source: Nabi 2018.

    We note a few use cases:

    • Document Retrieval: To improve recall, start by adding other documents from the same cluster.
    • Taxonomy Generation: Automatically generate hierarchical taxonomies for browsing content.
    • Fake News Identification: Detect whether a news item is genuine or fake.
    • Language Translation: Translation of a sentence from one language to another.
    • Spam Mail Filtering: Detect unsolicited and unwanted email/messages.
    • Customer Support Issue Analysis: Identify commonly reported support issues.
  • How is text clustering different from text classification?
    Clustering is unsupervised whereas classification is supervised. Source: Valcheva 2018.

    Classification is a supervised learning approach that maps an input to an output based on example input-output pairs. Clustering is an unsupervised learning approach.

    • Classification: If the predicted value is a category, such as yes/no or positive/negative, the problem is a classification problem. The different classes are known in advance. For example, given a sentence, predict whether it's a negative or positive review.
    • Clustering: Clustering is the task of partitioning the dataset into groups called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. It discovers groupings in unlabelled data.
  • What are the types of clustering?
    Hard versus soft clustering. Source: Withanawasam 2015.

    Broadly, clustering can be divided into two groups:

    • Hard Clustering: This groups items such that each item is assigned to only one cluster. For example, we want to know if a tweet is expressing a positive or negative sentiment. k-means is a hard clustering algorithm.
    • Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items such that an item can belong to multiple clusters. Fuzzy C-Means (FCM) is a soft clustering algorithm. The sketch after this list contrasts hard and soft assignments.
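
    A rough sketch of the contrast, using made-up 2-D points and scikit-learn. Since Fuzzy C-Means isn't part of scikit-learn, a Gaussian mixture model stands in as the soft method:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.mixture import GaussianMixture

        # Two tight groups plus one ambiguous point between them.
        X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.6, 2.5]])

        # Hard clustering: each point is assigned to exactly one cluster.
        print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

        # Soft clustering: each point gets a membership probability per cluster.
        gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
        print(gmm.predict_proba(X).round(2))  # the middle point's membership may be split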
  • What are the steps involved in text clustering?

    Any text clustering approach broadly involves the following steps:

    • Text pre-processing: Text can be noisy, hiding information among stop words, inflections and sparse representations. Pre-processing makes the dataset easier to work with.
    • Feature Extraction: A commonly used technique for extracting features from textual data is to calculate the frequency of words/tokens in the document/corpus.
    • Clustering: We can then cluster the documents based on the generated features, as shown in the sketch below.
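
    A minimal end-to-end sketch of these steps, assuming scikit-learn and four made-up sentences. TfidfVectorizer covers basic pre-processing (lowercasing, stop word removal) and feature extraction; k-means then clusters the resulting vectors:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = [
            "the stock market fell sharply today",
            "investors worry as shares drop",
            "the team won the football match",
            "a thrilling match ended in victory",
        ]

        # Pre-processing + feature extraction: lowercase, drop stop words, TF-IDF weights.
        X = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(docs)

        # Clustering: group the documents into two clusters.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(labels)  # finance and sports sentences typically land in separate clusters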
  • What are the steps involved in text pre-processing?

    Below are the main components involved in pre-processing.

    • Tokenization: Tokenization is the process of splitting text into smaller units (tokens) such as words and phrases.
    • Transformation: This step converts the text to lowercase, removes all diacritics/accents, and strips HTML tags.
    • Normalization: Text normalization is the process of transforming text into a canonical (root) form. Stemming and lemmatization techniques are used for deriving the root word.
    • Filtering: Stop words are common words used in a language, such as 'the', 'a', 'on', 'is', or 'all'. These words do not carry important meaning for text clustering and are usually removed from texts. The sketch after this list walks through these steps.
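
    A short sketch of these steps, assuming NLTK with its tokenizer and stop word data packages downloaded (exact data package names can vary slightly across NLTK versions):

        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer

        nltk.download("punkt", quiet=True)      # tokenizer models
        nltk.download("stopwords", quiet=True)  # stop word lists

        text = "The Runners were running quickly through the parks!"

        # Tokenization and transformation: split into words, then lowercase.
        tokens = [t.lower() for t in nltk.word_tokenize(text)]

        # Filtering: drop punctuation and stop words.
        stops = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.isalpha() and t not in stops]

        # Normalization: reduce each token to a root form via stemming.
        stemmer = PorterStemmer()
        print([stemmer.stem(t) for t in tokens])  # ['runner', 'run', 'quickli', 'park']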
  • What are the levels of text clustering?

    Text clustering can be document level, sentence level or word level.

    • Document level: It groups documents about the same topic. Document clustering has applications in news articles, emails, search engines, etc.
    • Sentence level: It's used to cluster sentences derived from different documents. Tweet analysis is an example.
    • Word level: Word clusters are groups of words based on a common theme. The easiest way to build a cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database for the English language that groups English words into sets of synonyms called synsets (see the sketch below).
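
    For instance, assuming NLTK with the WordNet corpus downloaded, the synsets of a word can be listed as follows:

        import nltk
        nltk.download("wordnet", quiet=True)
        from nltk.corpus import wordnet

        # Word-level clusters: WordNet groups words into synonym sets (synsets).
        for syn in wordnet.synsets("buy")[:3]:
            print(syn.name(), "->", syn.lemma_names())
        # One synset pairs 'buy' with 'purchase', reflecting their shared meaning.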
  • How do I define or extract textual features for clustering?
    BOW with word as feature. Source: Hoonlor 2011, fig. 2.1.

    In general, words can be used to represent a common class of feature. Word characteristics are also features. For example, capitalization matters: US versus us, White House versus white house. Part of speech and grammatical structure also add to textual features. Semantics can be a textual feature: buy versus purchase.

    The mapping from textual data to real-valued vectors is called feature extraction. One of the simplest techniques to numerically represent text is Bag of Words (BOW). In BOW, we make a list of unique words in the text corpus called vocabulary. Then we can represent each sentence or document as a vector, with each word represented as 1 for presence and 0 for absence.

    Another representation is to count the number of times each word appears in a document. The most popular approach is using the Term Frequency-Inverse Document Frequency (TF-IDF) technique.

    More recently, word embeddings are being used to map words into feature vectors. A popular model for word embeddings is word2vec.
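
    The sketch below, assuming scikit-learn and three made-up sentences, shows binary BOW vectors alongside TF-IDF weights. Note how TF-IDF down-weights 'the', which appears in every document:

        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

        docs = ["the cat sat", "the cat sat on the mat", "the dog ran"]

        # Bag of Words: 1 if the vocabulary word occurs in the document, else 0.
        bow = CountVectorizer(binary=True)
        print(bow.fit_transform(docs).toarray())
        print(bow.get_feature_names_out())

        # TF-IDF: term frequency, discounted by how many documents contain the term.
        print(TfidfVectorizer().fit_transform(docs).toarray().round(2))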

  • How can I measure similarity in text clustering?

    Words can be similar lexically or semantically:

    • Lexical similarity: Words are similar lexically if they have a similar character sequence. Lexical similarity can be measured using string-based algorithms that operate on string sequences and character composition.
    • Semantic similarity: Words are similar semantically if they have the same meaning, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Semantic similarity can be measured using corpus-based or knowledge-based algorithms.

    Some of the metrics for computing similarity between two pieces of text are Jaccard coefficient, cosine similarity and Euclidean distance.
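
    A small sketch of these three metrics on two made-up four-word sentences, with count vectors built over the shared vocabulary:

        import numpy as np

        a, b = "the quick brown fox", "the fast brown fox"
        set_a, set_b = set(a.split()), set(b.split())

        # Jaccard coefficient on word sets: size of overlap over size of union.
        jaccard = len(set_a & set_b) / len(set_a | set_b)

        # Count vectors over the sorted shared vocabulary:
        # (brown, fast, fox, quick, the)
        va = np.array([1, 0, 1, 1, 1], dtype=float)
        vb = np.array([1, 1, 1, 0, 1], dtype=float)

        cosine = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        euclidean = np.linalg.norm(va - vb)
        print(jaccard, cosine, euclidean)  # 0.6, 0.75, ~1.41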

  • Which are some common text clustering algorithms?
    Some types of text clustering algorithms. Source: Khosla et al. 2019, fig. 4.

    Ignoring neural network models, we can identify different types:

    • Hierarchical: In the divisive approach, we start with one cluster and split that into sub-clusters. Example algorithms include DIANA and MONA. In the agglomerative approach, each document starts as its own cluster and then we merge similar ones into bigger clusters. Examples include BIRCH and CURE.
    • Partitioning: k-means is a popular algorithm but requires the right choice of k. Other examples are ISODATA and PAM.
    • Density: Instead of using a distance measure, we form clusters based on how many data points fall within a given radius. DBSCAN is the most well-known algorithm.
    • Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
    • Probabilistic: A cluster of words belongs to a topic, and the task is to identify these topics. Words also have probabilities of belonging to a topic. Topic modelling is a separate NLP task but it's similar to soft clustering. pLSA and LDA are example topic models. A sketch of the hierarchical and density approaches follows this list.
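
    A brief sketch of two of these families, assuming scikit-learn and made-up 2-D points. Agglomerative (hierarchical) clustering merges nearest clusters until the requested number remains, while DBSCAN (density) labels sparse points as noise:

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, DBSCAN

        X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [9, 0]])

        # Hierarchical (agglomerative): merge nearest clusters until two remain.
        print(AgglomerativeClustering(n_clusters=2).fit_predict(X))

        # Density-based: clusters are dense regions; the isolated point gets label -1 (noise).
        print(DBSCAN(eps=0.5, min_samples=2).fit_predict(X))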
  • How can I evaluate the efficiency of a text clustering algorithm?
    Internal quality measure: more compact clusters on the left. Source: Hassani and Seidl 2016.

    Measuring the quality of a clustering algorithm has been shown to be as important as the algorithm itself. We can evaluate it in two ways:

    • External quality measure: External knowledge is required for measuring the external quality. For example, we can conduct surveys of users of the application that includes text clustering.
    • Internal quality measure: The clustering is evaluated against the result itself, that is, the structure of the found clusters and their relations to one another. Two main concepts are compactness and separation. Compactness measures how closely data points are grouped within a cluster. Separation measures how different the found clusters are from each other. More formally, compactness is intra-cluster variance whereas separation is inter-cluster distance. The sketch after this list computes one such internal measure.
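
    The silhouette score is one widely used internal measure combining compactness and separation: values near 1 indicate tight, well-separated clusters. A sketch assuming scikit-learn, with synthetic blobs standing in for vectorized documents:

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score

        # Synthetic 2-D feature vectors with three natural groups.
        X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

        # Higher silhouette means more compact and better separated clusters.
        for k in (2, 3, 4):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            print(k, round(silhouette_score(X, labels), 3))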
  • What are the common challenges involved in text clustering?

    Document clustering has been studied for decades. It's far from trivial and by no means a solved problem. The challenges include the following:

    • Selecting appropriate features of documents that should be used for clustering.
    • Selecting an appropriate similarity measure between documents.
    • Selecting an appropriate clustering method utilising the above similarity measure.
    • Implementing the clustering algorithm in an efficient way that makes it feasible in terms of memory and CPU resources.
    • Finding ways of assessing the quality of the performed clustering.

Milestones

1971
Vector space model. Source: Perone 2013.

Text mining research in general relies on a vector space model. Salton first proposes it to model text documents as vectors. Features are considered to be the words in the document collection and feature values come from different term weighting schemes, the most popular of which is the Term Frequency-Inverse Document Frequency (TF-IDF).

1983

Massart and Kaufman, in the book The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, introduce various clustering methods, including hierarchical and non-hierarchical methods. They show how clustering can be used to interpret large quantities of analytical data. They discuss how clustering is related to other pattern recognition techniques.

1992

Cutting et al. adapt partition-based clustering algorithms to cluster documents. Two of the techniques are Buckshot and Fractionation. Buckshot selects a small sample of documents, pre-clusters them using a standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation finds k centres by initially breaking N documents into N/m buckets of a fixed size m > k. Each cluster is then treated as if it were an individual document and the whole process is repeated until only k clusters remain.

1997

Huang introduces k-modes, an extension of the well-known k-means algorithm (which clusters numerical data) to categorical data. By defining the mode notion for categorical clusters and introducing an incremental update rule for cluster modes, the algorithm preserves the scaling properties of k-means. Naturally, it also inherits its disadvantages, such as dependence on the seed clusters and the inability to automatically detect the number of clusters.

2008

Sun et al. develop a novel hierarchical algorithm for document clustering. They use the cluster overlapping phenomenon to design cluster merging criteria. The system computes the overlap rate in order to improve time efficiency.

References

  1. Allahyari, Mehdi, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques." arXiv, v2, July 28. Accessed 2019-12-06.
  2. Cutting, Douglass R., David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. "Scatter/gather: A cluster-based approach to browsing large document collections." Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329. Accessed 2019-12-01.
  3. Eduonix. 2019. "Clustering Similar Sentences Together Using Machine Learning." Blog, Eduonix Learning Solutions, April 23. Updated 2019-04-30. Accessed 2019-12-08.
  4. Ganesan, Kavita. 2015. "What is text similarity?" November 08. Updated 2019-11-09. Accessed 2019-12-06.
  5. Hassani, Marwan, and Thomas Seidl. 2016. "Using internal evaluation measures to validate the quality of diverse stream clustering algorithms." Vietnam Journal of Computer Science, vol. 4, no. 3, pp. 171-183. Accessed 2019-12-07.
  6. Hoonlor, Apirak. 2011. "Sequential patterns and temporal patterns for text mining." PhD Thesis, Rensselaer Polytechnic Institute. Accessed 2019-12-06.
  7. Hosseinimotlagh, Seyedmehdi and Evangelos E. Papalexakis. 2018. "Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles." Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with WSDM 2018, Los Angeles, California, USA. Accessed 2019-12-01.
  8. Huang, Z. 1997. "A fast clustering algorithm to cluster very large categorical data sets in data mining." Proceedings of ACM Workshop on Research Issues on Data Mining and Knowledge Discovery. Accessed 2019-12-01.
  9. IBM. 2014. "Clustering Principles." SPSS Statistics 23.0.0, IBM Knowledge Center, October 24. Accessed 2019-12-06.
  10. Jajoo, Pankaj. 2008. "Document Clustering." Thesis, Master of Technology, Indian Institute of Technology Kharagpur. Accessed 2019-12-01.
  11. Khosla, Meenakshi, Keith W Jamison, Gia H. Ngo, Amy Kuceyeski, and Mert R. Sabuncu. 2019. "Machine learning in resting-state fMRI analysis." Magnetic Resonance Imaging, vol. 64, June. Accessed 2019-12-06.
  12. Krishan. 2016. "Topic Modeling and Document Clustering; What’s the Difference?" Blog, Integrated Knowledge Solutions, May 16. Accessed 2019-12-06.
  13. Kunwar, Samir. 2013. "Text Documents Clustering using K-Means Algorithm." CodeProject, January 26. Accessed 2019-12-01.
  14. Malik, Farhad. 2019. "Machine Learning Hard Vs Soft Clustering." FinTechExplained, on Medium, June 07. Accessed 2019-12-06.
  15. Massart, Désiré Luc and Leonard Kaufman. 1983. "The interpretation of analytical chemical data by the use of cluster analysis." Wiley.
  16. Matuszek, Paul. 2012. "Text Mining Applications: Document Clustering." CSC 9010, Villanova University, on SlidePlayer. Accessed 2019-12-08.
  17. Mnasri, Maali. 2016. "Quick review on Text Clustering and Text Similarity Approaches." Blog, LumenAI, June 29. Accessed 2019-12-06.
  18. Nabi, Javaid. 2018. "Machine Learning — Text Processing." Towards Data Science, on Medium, September 13. Accessed 2019-12-06.
  19. Nandi, Manojit. 2015. "Density-Based Clustering." Blog, Domino Data Lab, September 09. Accessed 2019-12-06.
  20. Paul, Christian, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, and Pedro Szekely. 2016. "Efficient Graph-based Document Similarity." Proceedings of the 13th International Conference on The Semantic Web, Latest Advances and New Domains, vol. 9678, pp. 334-349, May 29 - June 02. Accessed 2019-12-06.
  21. Perone, Christian S. 2013. "Machine Learning :: Cosine Similarity for Vector Space Models (Part III)." Blog, September 12. Accessed 2019-12-01.
  22. Renvoisé, Paul. 2017. "Introduction To Text Clustering: Research Findings." Blog, SAP Conversational AI, January 19. Accessed 2019-12-01.
  23. Salton, Gerard. 1971. "The SMART Retrieval System—Experiments in Automatic Document Processing." Prentice-Hall. Accessed 2019-12-01.
  24. Sasaki, Minoru and H. Shinnou. 2005. "Spam detector using text clustering." International Conference on Cyberworlds. Accessed 2019-12-01.
  25. Sieg, Adrien. 2018. "Text Similarities : Estimate the degree of similarity between two texts." Medium, July 05. Accessed 2019-12-06.
  26. Sun, Haojun, Zhihui Liu, and Lingjun Kong. 2008. "A Document Clustering Method Based on Hierarchical Algorithm with Model Clustering." Proceedings of International Conference on Advanced Information Networking and Applications, March 25-28. Accessed 2019-12-01.
  27. Valcheva, Silvia. 2018. "Supervised vs Unsupervised Learning: Algorithms and Examples." Intellspot, March 11. Accessed 2019-12-01.
  28. Veress, Gabor. 2013. "Clustering." SlideShare, October 16. Accessed 2019-12-08.
  29. Vydiswaran, V. G. Vinod. 2019. "Identifying Features from Text." Applied Text Mining in Python, University of Michigan, on Coursera. Accessed 2019-12-06.
  30. Withanawasam, Jayani. 2015. "Types of clustering." In: Apache Mahout Essentials, Packt, June 18. Accessed 2019-12-08.
  31. Xu, Rui, and Donald Wunsch. 2005. "Survey of clustering algorithms." IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 645–678, May. Accessed 2019-12-06.
  32. Yang, Yinfei and Chris Tar. 2018. "Advances in Semantic Textual Similarity." Google AI Blog, May 17. Accessed 2019-12-08.
  33. hitbullseye. 2019. "Learn Words through Word Groups/Clusters." hitbullseye. Accessed 2019-12-06.

Further Reading

  1. Renvoisé, Paul. 2017. "Introduction To Text Clustering: Research Findings." Blog, SAP Conversational AI, January 19. Accessed 2019-12-01.
  2. Xu, Rui, and Donald Wunsch. 2005. "Survey of clustering algorithms." IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 645–678, May. Accessed 2019-12-06.
  3. Cutting, Douglass R., David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. "Scatter/gather: A cluster-based approach to browsing large document collections." Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329. Accessed 2019-12-01.
  4. Day, William H. E., and Herbert Edelsbrunner. 1984. "Efficient algorithms for agglomerative hierarchical clustering methods." Journal of Classification, vol. 1, no. 7, pp. 7-24, December. Accessed 2019-12-08.
  5. Sebastiani, Fabrizio. 2002. "Machine learning in automated text categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, March. Accessed 2019-12-08.
  6. Croft, W. B. 1978. "Organizing and Searching Large Files of Documents." Ph.D. Thesis, University of Cambridge.
