Text Clustering

Text clustering. Source: Kunwar 2013.

The amount of text data generated in recent years has grown exponentially. It's essential for organizations to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with textual data has never been more important.

Text clustering is the task of grouping a set of unlabelled texts in such a way that texts in the same cluster are more similar to each other than to those in other clusters. Text clustering algorithms process text and determine if natural clusters (groups) exist in the data.

Discussion

  • What's the principle behind text clustering?
    Semantically similar sentences. Source: Yang and Tar 2018.

    The big idea is that documents can be represented numerically as vectors of features. The similarity in text can be compared by measuring the distance between these feature vectors. Objects that are near each other should belong to the same cluster. Objects that are far from each other should belong to different clusters.

    Essentially, text clustering involves three aspects, illustrated in the sketch after this list:

    • Selecting a suitable distance measure to identify the proximity of two feature vectors.
    • A criterion function that tells us when we've got the best possible clusters, so that we can stop further processing.
    • An algorithm to optimize the criterion function. A greedy algorithm will start with some initial clustering and refine the clusters iteratively.
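
    As a minimal sketch of these three aspects, consider the following Python snippet. It runs a k-means-style loop over made-up 2-D feature vectors: Euclidean distance is the distance measure, within-cluster sum of squared distances (inertia) is the criterion function, and moving each centroid to its cluster mean is the greedy refinement.

        import numpy as np

        # Toy 2-D feature vectors for five "documents" and two initial centroids.
        X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]], dtype=float)
        centroids = X[[0, 3]].copy()

        for _ in range(10):
            # Distance measure: Euclidean distance from each point to each centroid.
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Criterion function: within-cluster sum of squared distances (inertia).
            inertia = (d.min(axis=1) ** 2).sum()
            # Greedy refinement: move each centroid to the mean of its cluster.
            centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

        print(labels, inertia)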
  • What are the use cases of text clustering?
    Applications of text clustering. Source: Nabi 2018.

    We note a few use cases:

    • Document Retrieval: To improve recall, start by adding other documents from the same cluster.
    • Taxonomy Generation: Automatically generate hierarchical taxonomies for browsing content.
    • Fake News Identification: Detect whether a news item is genuine or fake.
    • Language Translation: Translation of a sentence from one language to another.
    • Spam Mail Filtering: Detect unsolicited and unwanted email/messages.
    • Customer Support Issue Analysis: Identify commonly reported support issues.
  • How is text clustering different from text classification?
    Clustering is unsupervised whereas classification is supervised. Source: Valcheva 2018.

    Classification is a supervised learning approach that maps an input to an output based on example input-output pairs. Clustering is an unsupervised learning approach.

    • Classification: If the predicted value is a category, such as yes/no or positive/negative, the problem is a classification problem. The different classes are known in advance. For example, given a sentence, predict whether it's a negative or positive review.
    • Clustering: Clustering is the task of partitioning the dataset into groups called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. It discovers groupings in unlabelled data.
  • What are the types of clustering?
    Hard versus soft clustering. Source: Withanawasam 2015.

    Broadly, clustering can be divided into two groups:

    • Hard Clustering: This groups items such that each item is assigned to only one cluster. For example, we want to know if a tweet is expressing a positive or negative sentiment. k-means is a hard clustering algorithm.
    • Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items such that an item can belong to multiple clusters. Fuzzy C-Means (FCM) is a soft clustering algorithm. The sketch after this list contrasts hard and soft assignments.
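
    A rough sketch of the contrast, using made-up 2-D points and scikit-learn. Since Fuzzy C-Means isn't part of scikit-learn, a Gaussian mixture model stands in as the soft method:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.mixture import GaussianMixture

        # Two tight groups plus one ambiguous point between them.
        X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.6, 2.5]])

        # Hard clustering: each point is assigned to exactly one cluster.
        print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

        # Soft clustering: each point gets a membership probability per cluster.
        gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
        print(gmm.predict_proba(X).round(2))  # the middle point's membership may be split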
  • What are the steps involved in text clustering?

    Any text clustering approach broadly involves the following steps:

    • Text pre-processing: Text can be noisy, hiding information among stop words, inflections and sparse representations. Pre-processing makes the dataset easier to work with.
    • Feature Extraction: A commonly used technique for extracting features from textual data is to calculate the frequency of words/tokens in the document/corpus.
    • Clustering: We can then cluster the documents based on the generated features, as shown in the sketch below.
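
    A minimal end-to-end sketch of these steps, assuming scikit-learn and four made-up sentences. TfidfVectorizer covers basic pre-processing (lowercasing, stop word removal) and feature extraction; k-means then clusters the resulting vectors:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = [
            "the stock market fell sharply today",
            "investors worry as shares drop",
            "the team won the football match",
            "a thrilling match ended in victory",
        ]

        # Pre-processing + feature extraction: lowercase, drop stop words, TF-IDF weights.
        X = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(docs)

        # Clustering: group the documents into two clusters.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(labels)  # finance and sports sentences typically land in separate clusters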
  • What are the steps involved in text pre-processing?

    Below are the main components involved in pre-processing.

    • Tokenization: Tokenization is the process of splitting text into smaller units (tokens) such as words and phrases.
    • Transformation: This step converts the text to lowercase, removes all diacritics/accents, and strips HTML tags.
    • Normalization: Text normalization is the process of transforming text into a canonical (root) form. Stemming and lemmatization techniques are used for deriving the root word.
    • Filtering: Stop words are common words used in a language, such as 'the', 'a', 'on', 'is', or 'all'. These words do not carry important meaning for text clustering and are usually removed from texts. The sketch after this list walks through these steps.
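
    A short sketch of these steps, assuming NLTK with its tokenizer and stop word data packages downloaded (exact data package names can vary slightly across NLTK versions):

        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer

        nltk.download("punkt", quiet=True)      # tokenizer models
        nltk.download("stopwords", quiet=True)  # stop word lists

        text = "The Runners were running quickly through the parks!"

        # Tokenization and transformation: split into words, then lowercase.
        tokens = [t.lower() for t in nltk.word_tokenize(text)]

        # Filtering: drop punctuation and stop words.
        stops = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.isalpha() and t not in stops]

        # Normalization: reduce each token to a root form via stemming.
        stemmer = PorterStemmer()
        print([stemmer.stem(t) for t in tokens])  # ['runner', 'run', 'quickli', 'park']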
  • What are the levels of text clustering?

    Text clustering can be document level, sentence level or word level.

    • Document level: It groups documents about the same topic. Document clustering has applications in news articles, emails, search engines, etc.
    • Sentence level: It's used to cluster sentences derived from different documents. Tweet analysis is an example.
    • Word level: Word clusters are groups of words based on a common theme. The easiest way to build a cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database for the English language that groups English words into sets of synonyms called synsets (see the sketch below).
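
    For instance, assuming NLTK with the WordNet corpus downloaded, the synsets of a word can be listed as follows:

        import nltk
        nltk.download("wordnet", quiet=True)
        from nltk.corpus import wordnet

        # Word-level clusters: WordNet groups words into synonym sets (synsets).
        for syn in wordnet.synsets("buy")[:3]:
            print(syn.name(), "->", syn.lemma_names())
        # One synset pairs 'buy' with 'purchase', reflecting their shared meaning.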
  • How do I define or extract textual features for clustering?
    BOW with word as feature. Source: Hoonlor 2011, fig. 2.1.

    In general, words can be used to represent a common class of feature. Word characteristics are also features. For example, capitalization matters: US versus us, White House versus white house. Part of speech and grammatical structure also add to textual features. Semantics can be a textual feature: buy versus purchase.

    The mapping from textual data to real-valued vectors is called feature extraction. One of the simplest techniques to numerically represent text is Bag of Words (BOW). In BOW, we make a list of unique words in the text corpus called vocabulary. Then we can represent each sentence or document as a vector, with each word represented as 1 for presence and 0 for absence.

    Another representation is to count the number of times each word appears in a document. The most popular approach is using the Term Frequency-Inverse Document Frequency (TF-IDF) technique.

    More recently, word embeddings are being used to map words into feature vectors. A popular model for word embeddings is word2vec.
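
    The sketch below, assuming scikit-learn and three made-up sentences, shows binary BOW vectors alongside TF-IDF weights. Note how TF-IDF down-weights 'the', which appears in every document:

        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

        docs = ["the cat sat", "the cat sat on the mat", "the dog ran"]

        # Bag of Words: 1 if the vocabulary word occurs in the document, else 0.
        bow = CountVectorizer(binary=True)
        print(bow.fit_transform(docs).toarray())
        print(bow.get_feature_names_out())

        # TF-IDF: term frequency, discounted by how many documents contain the term.
        print(TfidfVectorizer().fit_transform(docs).toarray().round(2))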

  • How can I measure similarity in text clustering?

    Words can be similar lexically or semantically:

    • Lexical similarity: Words are similar lexically if they have a similar character sequence. Lexical similarity can be measured using string-based algorithms that operate on string sequences and character composition.
    • Semantic similarity: Words are similar semantically if they have the same meaning, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Semantic similarity can be measured using corpus-based or knowledge-based algorithms.

    Some of the metrics for computing similarity between two pieces of text are Jaccard coefficient, cosine similarity and Euclidean distance.
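
    A small sketch of these three metrics on two made-up four-word sentences, with count vectors built over the shared vocabulary:

        import numpy as np

        a, b = "the quick brown fox", "the fast brown fox"
        set_a, set_b = set(a.split()), set(b.split())

        # Jaccard coefficient on word sets: size of overlap over size of union.
        jaccard = len(set_a & set_b) / len(set_a | set_b)

        # Count vectors over the sorted shared vocabulary:
        # (brown, fast, fox, quick, the)
        va = np.array([1, 0, 1, 1, 1], dtype=float)
        vb = np.array([1, 1, 1, 0, 1], dtype=float)

        cosine = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        euclidean = np.linalg.norm(va - vb)
        print(jaccard, cosine, euclidean)  # 0.6, 0.75, ~1.41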

  • Which are some common text clustering algorithms?
    Some types of text clustering algorithms. Source: Khosla et al. 2019, fig. 4.

    Ignoring neural network models, we can identify different types:

    • Hierarchical: In the divisive approach, we start with one cluster and split that into sub-clusters. Example algorithms include DIANA and MONA. In the agglomerative approach, each document starts as its own cluster and then we merge similar ones into bigger clusters. Examples include BIRCH and CURE.
    • Partitioning: k-means is a popular algorithm but requires the right choice of k. Other examples are ISODATA and PAM.
    • Density: Instead of using a distance measure, we form clusters based on how many data points fall within a given radius. DBSCAN is the most well-known algorithm.
    • Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
    • Probabilistic: A cluster of words belongs to a topic, and the task is to identify these topics. Words also have probabilities of belonging to a topic. Topic modelling is a separate NLP task but it's similar to soft clustering. pLSA and LDA are example topic models. A sketch of the hierarchical and density approaches follows this list.
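
    A brief sketch of two of these families, assuming scikit-learn and made-up 2-D points. Agglomerative (hierarchical) clustering merges nearest clusters until the requested number remains, while DBSCAN (density) labels sparse points as noise:

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, DBSCAN

        X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 4.9], [9, 0]])

        # Hierarchical (agglomerative): merge nearest clusters until two remain.
        print(AgglomerativeClustering(n_clusters=2).fit_predict(X))

        # Density-based: clusters are dense regions; the isolated point gets label -1 (noise).
        print(DBSCAN(eps=0.5, min_samples=2).fit_predict(X))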
  • How can I evaluate the efficiency of a text clustering algorithm?
    Internal quality measure: more compact clusters on the left. Source: Hassani and Seidl 2016.

    Measuring the quality of a clustering algorithm has been shown to be as important as the algorithm itself. We can evaluate it in two ways:

    • External quality measure: External knowledge is required for measuring the external quality. For example, we can conduct surveys of users of the application that includes text clustering.
    • Internal quality measure: The clustering is evaluated against the result itself, that is, the structure of the found clusters and their relations to one another. Two main concepts are compactness and separation. Compactness measures how closely data points are grouped within a cluster. Separation measures how different the found clusters are from each other. More formally, compactness is intra-cluster variance whereas separation is inter-cluster distance. The sketch after this list computes one such internal measure.
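
    The silhouette score is one widely used internal measure combining compactness and separation: values near 1 indicate tight, well-separated clusters. A sketch assuming scikit-learn, with synthetic blobs standing in for vectorized documents:

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score

        # Synthetic 2-D feature vectors with three natural groups.
        X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

        # Higher silhouette means more compact and better separated clusters.
        for k in (2, 3, 4):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            print(k, round(silhouette_score(X, labels), 3))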
  • What are the common challenges involved in text clustering?

    Document clustering has been studied for decades. It's far from trivial and by no means a solved problem. The challenges include the following:

    • Selecting appropriate features of documents that should be used for clustering.
    • Selecting an appropriate similarity measure between documents.
    • Selecting an appropriate clustering method utilising the above similarity measure.
    • Implementing the clustering algorithm in an efficient way that makes it feasible in terms of memory and CPU resources.
    • Finding ways of assessing the quality of the performed clustering.

Milestones

1971
Vector space model. Source: Perone 2013.

Text mining research in general relies on a vector space model. Salton first proposes it to model text documents as vectors. Features are considered to be the words in the document collection and feature values come from different term weighting schemes, the most popular of which is the Term Frequency-Inverse Document Frequency (TF-IDF).

1983

Massart and Kaufman, in the book The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, introduce various clustering methods, including hierarchical and non-hierarchical methods. They show how clustering can be used to interpret large quantities of analytical data. They discuss how clustering is related to other pattern recognition techniques.

1992

Cutting et al. adapt partition-based clustering algorithms to cluster documents. Two of the techniques are Buckshot and Fractionation. Buckshot selects a small sample of documents, pre-clusters them using a standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation finds k centres by initially breaking N documents into N/m buckets of a fixed size m > k. Each cluster is then treated as if it were an individual document and the whole process is repeated until only k clusters remain.

1997

Huang introduces k-modes, an extension of the well-known k-means algorithm (which clusters numerical data) to categorical data. By defining the mode notion for categorical clusters and introducing an incremental update rule for cluster modes, the algorithm preserves the scaling properties of k-means. Naturally, it also inherits its disadvantages, such as dependence on the seed clusters and the inability to automatically detect the number of clusters.

2008

Sun et al. develop a novel hierarchical algorithm for document clustering. They use the cluster overlapping phenomenon to design cluster merging criteria. The system computes the overlap rate in order to improve time efficiency.

References

  1. Allahyari, Mehdi, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques." arXiv, v2, July 28. Accessed 2019-12-06.
  2. Cutting, Douglass R., David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. "Scatter/gather: A cluster-based approach to browsing large document collections." Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329. Accessed 2019-12-01.
  3. Eduonix. 2019. "Clustering Similar Sentences Together Using Machine Learning." Blog, Eduonix Learning Solutions, April 23. Updated 2019-04-30. Accessed 2019-12-08.
  4. Ganesan, Kavita. 2015. "What is text similarity?" November 08. Updated 2019-11-09. Accessed 2019-12-06.
  5. Hassani, Marwan, and Thomas Seidl. 2016. "Using internal evaluation measures to validate the quality of diverse stream clustering algorithms." Vietnam Journal of Computer Science, vol. 4, no. 3, pp. 171-183. Accessed 2019-12-07.
  6. Hoonlor, Apirak. 2011. "Sequential patterns and temporal patterns for text mining." PhD Thesis, Rensselaer Polytechnic Institute. Accessed 2019-12-06.
  7. Hosseinimotlagh, Seyedmehdi and Evangelos E. Papalexakis. 2018. "Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles." Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with WSDM 2018, Los Angeles, California, USA. Accessed 2019-12-01.
  8. Huang, Z. 1997. "A fast clustering algorithm to cluster very large categorical data sets in data mining." Proceedings of ACM Workshop on Research Issues on Data Mining and Knowledge Discovery. Accessed 2019-12-01.
  9. IBM. 2014. "Clustering Principles." SPSS Statistics 23.0.0, IBM Knowledge Center, October 24. Accessed 2019-12-06.
  10. Jajoo, Pankaj. 2008. "Document Clustering." Thesis, Master of Technology, Indian Institute of Technology Kharagpur. Accessed 2019-12-01.
  11. Khosla, Meenakshi, Keith W Jamison, Gia H. Ngo, Amy Kuceyeski, and Mert R. Sabuncu. 2019. "Machine learning in resting-state fMRI analysis." Magnetic Resonance Imaging, vol. 64, June. Accessed 2019-12-06.
  12. Krishan. 2016. "Topic Modeling and Document Clustering; What’s the Difference?" Blog, Integrated Knowledge Solutions, May 16. Accessed 2019-12-06.
  13. Kunwar, Samir. 2013. "Text Documents Clustering using K-Means Algorithm." CodeProject, January 26. Accessed 2019-12-01.
  14. Malik, Farhad. 2019. "Machine Learning Hard Vs Soft Clustering." FinTechExplained, on Medium, June 07. Accessed 2019-12-06.
  15. Massart, Désiré Luc and Leonard Kaufman. 1983. "The interpretation of analytical chemical data by the use of cluster analysis." Wiley.
  16. Matuszek, Paul. 2012. "Text Mining Applications: Document Clustering." CSC 9010, Villanova University, on SlidePlayer. Accessed 2019-12-08.
  17. Mnasri, Maali. 2016. "Quick review on Text Clustering and Text Similarity Approaches." Blog, LumenAI, June 29. Accessed 2019-12-06.
  18. Nabi, Javaid. 2018. "Machine Learning — Text Processing." Towards Data Science, on Medium, September 13. Accessed 2019-12-06.
  19. Nandi, Manojit. 2015. "Density-Based Clustering." Blog, Domino Data Lab, September 09. Accessed 2019-12-06.
  20. Paul, Christian, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, and Pedro Szekely. 2016. "Efficient Graph-based Document Similarity." Proceedings of the 13th International Conference on The Semantic Web, Latest Advances and New Domains, vol. 9678, pp. 334-349, May 29 - June 02. Accessed 2019-12-06.
  21. Perone, Christian S. 2013. "Machine Learning :: Cosine Similarity for Vector Space Models (Part III)." Blog, September 12. Accessed 2019-12-01.
  22. Renvoisé, Paul. 2017. "Introduction To Text Clustering: Research Findings." Blog, SAP Conversational AI, January 19. Accessed 2019-12-01.
  23. Salton, Gerard. 1971. "The SMART Retrieval System—Experiments in Automatic Document Processing." Prentice-Hall. Accessed 2019-12-01.
  24. Sasaki, Minoru and H. Shinnou. 2005. "Spam detector using text clustering." International Conference on Cyberworlds. Accessed 2019-12-01.
  25. Sieg, Adrien. 2018. "Text Similarities : Estimate the degree of similarity between two texts." Medium, July 05. Accessed 2019-12-06.
  26. Sun, Haojun, Zhihui Liu, and Lingjun Kong. 2008. "A Document Clustering Method Based on Hierarchical Algorithm with Model Clustering." Proceedings of International Conference on Advanced Information Networking and Applications, March 25-28. Accessed 2019-12-01.
  27. Valcheva, Silvia. 2018. "Supervised vs Unsupervised Learning: Algorithms and Examples." Intellspot, March 11. Accessed 2019-12-01.
  28. Veress, Gabor. 2013. "Clustering." SlideShare, October 16. Accessed 2019-12-08.
  29. Vydiswaran, V. G. Vinod. 2019. "Identifying Features from Text." Applied Text Mining in Python, University of Michigan, on Coursera. Accessed 2019-12-06.
  30. Withanawasam, Jayani. 2015. "Types of clustering." In: Apache Mahout Essentials, Packt, June 18. Accessed 2019-12-08.
  31. Xu, Rui, and Donald Wunsch. 2005. "Survey of clustering algorithms." IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 645–678, May. Accessed 2019-12-06.
  32. Yang, Yinfei and Chris Tar. 2018. "Advances in Semantic Textual Similarity." Google AI Blog, May 17. Accessed 2019-12-08.
  33. hitbullseye. 2019. "Learn Words through Word Groups/Clusters." hitbullseye. Accessed 2019-12-06.

Further Reading

  1. Renvoisé, Paul. 2017. "Introduction To Text Clustering: Research Findings." Blog, SAP Conversational AI, January 19. Accessed 2019-12-01.
  2. Xu, Rui, and Donald Wunsch. 2005. "Survey of clustering algorithms." IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 645–678, May. Accessed 2019-12-06.
  3. Cutting, Douglass R., David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. "Scatter/gather: A cluster-based approach to browsing large document collections." Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329. Accessed 2019-12-01.
  4. Day, William H. E., and Herbert Edelsbrunner. 1984. "Efficient algorithms for agglomerative hierarchical clustering methods." Journal of Classification, vol. 1, no. 7, pp. 7-24, December. Accessed 2019-12-08.
  5. Sebastiani, Fabrizio. 2002. "Machine learning in automated text categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, March. Accessed 2019-12-08.
  6. Croft, W. B. 1978. "Organizing and Searching Large Files of Documents." Ph.D. Thesis, University of Cambridge.
