Text Classification

Text classification process. Source: MonkeyLearn 2018.

Abundant textual data accumulates in any ecosystem, unstructured and in diverse formats. To extract trends or meaningful insights from it, we need to sort the data into categories. Text classification is a simple, powerful analysis technique that sorts a text repository under various tags, each representing a specific meaning. Typical examples include categorizing customer feedback as positive or negative, or news as sports or politics.

Machine Learning is used to extract features from text and classify documents into categories. Text classification can be implemented using supervised algorithms, with Naïve Bayes, SVM and Deep Learning being common choices.

Text classification finds wide application in NLP for detecting spam, sentiment analysis, subject labelling and intent analysis. Automating these mundane tasks makes search, analysis and decision making faster and easier. Text classification is very effective with historical data, and it can also analyse textual input in real time.

Discussion

  • What's the scope of text classification?
    Document feature extraction by sentence level analysis. Source: Kowsari et al. 2019, fig. 4.

    The scope of text classification is at document level, paragraph level, sentence level, or even sub-sentence level. In some applications we may assign one or more classes to an entire document. In other applications, we may assign a class to each paragraph or sentence in the document.

    An algorithm may be designed to accept syntax and semantic information at the sentence level. This is then used to classify the document. While sentence-level analysis is more granular, its limitation is that a sentence's context can often be determined only from the sentences surrounding it.

  • What are the different methods to perform text classification?

    Some methods of text classification include the following:

    • Manual classification: We could do the sorting manually, as clerks once did. Assign categories to every incoming text, then group documents by common category. This method has the best accuracy, but is feasible only for small volumes of text.
    • Hand-crafted rules for automated classification: A domain expert defines the rules for categorization. For example, text with the words "shares, profits, equity, liabilities" is classified into the Business category. Text with the words "Messi, kick, Ronaldo, goal" falls under the Football category.
    • Machine Learning Algorithms: Supervised techniques chosen based on dataset size and number of categories.
    • Hybrid techniques: Start with manual classification and assign categories to a small portion of the training dataset. Then apply ML algorithms to extend to the rest of the dataset.
  • How does text classification work?
    How text classification works. Source: MeaningCloud 2019.

    Let's work through a supervised classifier example to classify customer feedback into one of these categories: COMPLAINT, SUGGESTION, APPRECIATION. Text may not explicitly include these words, so categorisation has to happen by interpreting the words that are present.

    For the classifier to predict the correct label, it needs to ‘understand’ the essence of the text. So we train the classifier with millions of similar customer feedback texts which already have one of these labels assigned. After training, we allow the classifier to test its predictions. Once desired accuracy levels are attained, the classifier can now be used to categorise new customer feedback.

    With text classification, the algorithm doesn't care whether the user wrote standard English, an emoji, or an indirect reference (poetry, sarcasm, movie quotes). It only cares about how frequently particular text co-occurs with a category. Let's say the phrase "Out of this world" occurs very frequently in feedback labelled APPRECIATION in the training dataset. New feedback containing this phrase then has a very high chance of being classified as APPRECIATION.
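
    Below is a minimal sketch of this supervised workflow using scikit-learn. The tiny inline dataset is invented for illustration, and TF-IDF features with logistic regression are just one reasonable choice of representation and classifier.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labelled feedback (a real system would train on a large corpus)
    texts = [
        "The agent was rude and never resolved my issue",
        "Please add a dark mode to the mobile app",
        "Out of this world! Best support I have ever had",
        "I was billed twice and nobody responds to my emails",
    ]
    labels = ["COMPLAINT", "SUGGESTION", "APPRECIATION", "COMPLAINT"]

    # Vectorize text into TF-IDF features, then fit a linear classifier
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    # Predict the category of unseen feedback
    print(clf.predict(["Out of this world experience, thank you!"]))
    ```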

  • How do ML algorithms automate text classification?
    • Naïve Bayes: Its common application is in spam filtering, to classify incoming emails/SMS as spam or not-spam. We apply the conditional probability model of Bayes' theorem to answer "What is the probability that the message is spam given the occurrence of word X?". We assume that features (words) are conditionally independent of one another given the class. Though this is often untrue, prediction accuracy is reasonably good, especially for small sample sizes.
    • Support Vector Machines: For the same spam filtering task, SVM offers better accuracy than Naïve Bayes since it uses an optimization technique, though it requires more computational resources. SVM builds an optimal separating hyperplane that maximises the margin between the categories (Spam/Not Spam in our case). Unlike Naïve Bayes, SVM is a non-probabilistic algorithm. A sketch comparing the two follows this list.
    • Deep Learning: Works well when datasets are huge in size and continuously growing. For spam filtering on a large scale (huge dataset, large number of categories), RNN (LSTM in particular) would be highly effective as it gives weightage to the sequence of words appearing in the text. Convolutional Neural Networks are a good choice for hierarchical document classification to recognize patterns in the text sequence.
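
    Here's a hedged sketch of Naïve Bayes and a linear SVM on the same bag-of-words features, using scikit-learn. The toy spam/ham corpus is invented for illustration.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy spam/ham corpus (illustrative only)
    texts = ["win cash prizes now", "free lottery winner claim",
             "team meeting at noon", "see attached project report"]
    labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(texts)           # word count vectors

    nb = MultinomialNB().fit(X, labels)    # probabilistic: P(class | words)
    svm = LinearSVC().fit(X, labels)       # non-probabilistic: max-margin hyperplane

    x_new = vec.transform(["claim your free cash now"])
    print(nb.predict(x_new), nb.predict_proba(x_new))        # class + probabilities
    print(svm.predict(x_new), svm.decision_function(x_new))  # class + margin distance
    ```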
  • How should I prepare the data for text classification?

    Data preparation for ML-based text classification involves the following steps (a short sketch follows the list):

    • Tokenization: Identifying words, symbols, emojis, hyperlinks, based on known delimiters and format rules.
    • Word normalization: Reduce derived words into their root form (developmental becomes develop, encouragement becomes encourage).
    • Text and feature encoding: ML models require numeric features and labels to provide a prediction. So we create a data dictionary to map each word/feature in the document and each category label to a numerical ID. (Example category codes: COMPLAINT -> 0, SUGGESTION -> 1, APPRECIATION -> 2).
    • Feature representation: Each document is represented as a word count vector (giving the frequency count of each term) or a TF-IDF vector (Term Frequency-Inverse Document Frequency) capturing the relative importance of a term in the document. As an example, Microsoft Azure has a Feature Hashing module to convert features to integers for input into a model.
    • Word/Document embedding: Every row in the dataset is an entire document, represented as a dense vector. A word's position within the vector space is learned from the text, based on its surrounding words. Word embeddings can be trained on the input corpus, but pre-trained embeddings (GloVe, FastText and Word2Vec) are also available.
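
    As a brief sketch of the first few steps, here is one way to tokenize, stem and encode text with NLTK's stemmer and scikit-learn. The sample strings are invented, and a simple regex stands in for a full tokenizer to keep the example self-contained.

    ```python
    import re
    from nltk.stem import PorterStemmer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer

    # 1. Tokenization: split raw text into word tokens (simple rule-based)
    tokens = re.findall(r"\w+", "Encouragement drives developmental growth".lower())

    # 2. Word normalization: reduce tokens to their root form
    stems = [PorterStemmer().stem(t) for t in tokens]  # encouragement -> encourag

    # 3. Label encoding: map category names to numeric IDs
    #    (LabelEncoder assigns IDs in alphabetical order)
    y = LabelEncoder().fit_transform(["COMPLAINT", "SUGGESTION", "APPRECIATION"])

    # 4. Feature representation: TF-IDF vectors over a small corpus
    X = TfidfVectorizer().fit_transform(["great product thank you",
                                         "please fix the billing bug"])
    print(stems, y, X.shape)  # X.shape = (2 documents, vocabulary size)
    ```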
  • How should I select features for text classification?

    Feature selection involves picking a small but optimal subset of terms from the training dataset to use as features during text classification. Since the vocabulary size reduces, computational efficiency during training improves, which is critical for algorithms other than Naïve Bayes. Feature selection also helps minimise noise features, which increase classification error when included.

    Feature selection is essentially a process of dimensionality reduction. Contrary to the belief that more features mean better classification, weaker models (fewer but well-differentiated features) deliver better accuracy when training data is limited. Feature selection methods are classified into four types (a filter-method sketch follows the list):

    • Filter: A pre-processing step suitable for any ML algorithm. Statistical tests measure the correlation between each feature and the outcome variable, and features are chosen based on test scores. For example, Pearson's correlation applies when both feature and response are continuous.
    • Wrapper: Based on greedy search algorithms, these choose optimal features for a specific ML algorithm. Example: step forward/backward feature selection (add/remove features one at a time from the feature set and evaluate performance at each step).
    • Embedded: Feature selection is part of ML algorithm training phase (LASSO, Ridge Regression to reduce overfitting).
    • Hybrid: Mix of filter and wrapper methods.
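
    A small sketch of filter-style selection using a chi-squared test, which is common for text since term counts are non-negative. The toy corpus and the value of k are arbitrary.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    texts = ["goal scored by Messi", "shares and profits rose",
             "Ronaldo missed a kick", "equity and liabilities grew"]
    labels = ["football", "business", "football", "business"]

    vec = CountVectorizer()
    X = vec.fit_transform(texts)

    # Keep only the k terms most associated with the class labels
    selector = SelectKBest(chi2, k=4)
    X_reduced = selector.fit_transform(X, labels)

    kept = selector.get_support(indices=True)
    print([vec.get_feature_names_out()[i] for i in kept])
    ```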
  • What are the methods to evaluate the effectiveness of text classification?

    Cross-validation and hold-out tests are common evaluation methods to deduce how often a prediction was right (true positives, true negatives) and when it made a mistake (false positives, false negatives).

    • N-fold cross-validation: Split the dataset into N folds and run the test N times. In each run, use one fold as the test set and the remaining N-1 folds as the training set. Classification accuracy is the average of the results over the N runs.
    • Hold-out test: Divide the dataset into training and test subsets. Different splits will produce different results and accuracy, especially for small datasets. A paired t-test can be used to measure the significance of accuracy differences.

    Well-known performance metrics used in assessment include accuracy, precision, recall and F1 score. Classification error (1 - accuracy) is a sufficient metric if the percentage of documents in the class is high (10-20% or higher). However, for small classes, always saying 'NO' will achieve high accuracy but make the classifier irrelevant. So precision, recall and F1 are better measures (see the sketch below).

    Other evaluation metrics include Matthews Correlation Coefficient (MCC), ROC and AUC.
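
    Here's a hedged sketch of both evaluation styles with scikit-learn, on an invented two-class corpus. cv=2 is used only because the toy dataset is tiny; 5 or 10 folds are more typical.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import classification_report

    texts = ["great service thanks", "love this product",
             "amazing support team", "excellent quality",
             "terrible experience", "worst purchase ever",
             "awful customer service", "very disappointed"]
    labels = ["pos"] * 4 + ["neg"] * 4

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

    # N-fold cross-validation: accuracy averaged over N train/test runs
    print(cross_val_score(clf, texts, labels, cv=2).mean())

    # Hold-out test: one fixed split, then per-class precision/recall/F1
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    ```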

  • What are the factors behind the choice of an ML algorithm?
    A flowchart for selecting a text classification algorithm. Source: Google Developers 2019, fig. 5.

    There are some common pointers for choosing between a supervised and an unsupervised algorithm. The choice depends on the dataset size, the accuracy required, and the training time or resources available.

    For smaller datasets with humanly identifiable or limited categories, choose a supervised algorithm. For impossibly huge, ever-growing datasets, or too many categories, choose an unsupervised algorithm or an ANN model. While text classification is usually considered a supervised task, the term is sometimes used for unsupervised tasks. Text clustering is an example: it identifies the categories before assigning them to documents.

    Calculate the ratio of the number of samples to the number of words per sample. If this ratio is less than 1500, tokenize the text as n-grams and use a simple multi-layer perceptron (MLP) model to classify them. If the ratio is greater than 1500, tokenize the text as sequences and use a CNN model to classify them. This heuristic is sketched below.
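
    The heuristic reduces to a few lines of code. The function name and sample figures below are hypothetical; the 1500 threshold comes from the cited Google Developers guide.

    ```python
    def choose_model(num_samples: int, words_per_sample: float) -> str:
        """Suggest a model family from the samples/words-per-sample ratio."""
        ratio = num_samples / words_per_sample
        if ratio < 1500:
            return "n-gram features + MLP"
        return "sequence tokens + CNN"

    print(choose_model(5000, 100))    # ratio 50   -> n-gram MLP
    print(choose_model(100000, 20))   # ratio 5000 -> sequence CNN
    ```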

  • How to adapt binary text classification techniques to multi-class problems?

    More classes mean higher model complexity. For a fixed dataset, binary classification gives better accuracy than multi-class classification because of clearer decision boundaries. It's common practice to reduce multi-class problems to binary classifiers before running ML algorithms.

    Let's take a training set with 5 classes: "Class-A", "Class-B", "Class-C", "Class-D", "Others". We relabel all data carrying the labels "Class-A", "Class-B", "Class-C" and "Class-D" as "Class-ABCD". Now it's a binary classification problem with just two classes: "Class-ABCD" and "Others".

    A popular problem reduction technique called OAA (One-Against-All) is employed for this:

    • For each of the 5 classes in our dataset, we create 5 new datasets D1…D5 (one per class)
    • For each dataset Di, we mark training data of the corresponding class i as positive and all others as negatives (D2 : Class-B -> Positive, All others -> Negative)
    • We train 5 binary classifiers, each on their own dataset Di
    • We get predictions from all binary classifiers, then choose the class whose classifier gives a positive output, breaking ties randomly.

    An undesirable consequence is class imbalance: since negative examples far outnumber positives, learning skews towards the negative class. This is countered by introducing relative weightages for each dataset, as in the sketch below.
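
    Scikit-learn's OneVsRestClassifier automates this reduction, fitting one binary classifier per class and picking the class with the strongest positive output. The toy data below is invented, and class_weight="balanced" is one way to apply the per-class weighting mentioned above.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["refund my money now", "the app keeps crashing",
             "add export to csv please", "support a dark theme",
             "fantastic work, love it", "brilliant update, thanks"]
    labels = ["COMPLAINT", "COMPLAINT", "SUGGESTION", "SUGGESTION",
              "APPRECIATION", "APPRECIATION"]

    # One binary max-margin classifier per class; balanced weights offset
    # the positive/negative skew each binary sub-problem creates
    ovr = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LinearSVC(class_weight="balanced")),
    )
    ovr.fit(texts, labels)
    print(ovr.predict(["please add a new theme"]))
    ```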

  • What support is available in various programming languages and libraries?

    Briefly, we list some available resources:

    • Python: Scikit-Learn: sample datasets, feature encoding, prediction and evaluation
    • TensorFlow: TF-Hub: training, prediction, evaluators
    • Java: Stanford CoreNLP, MALLET, WEKA, Lingpipe libraries
    • R: OpenNLP, RWeka

    RandolphVI maintains a list of useful neural network Python implementations for Multi-Label Text Classification. This includes CNNs, RNNs, and attention networks.

Milestones

1960

Since the 1960s, the concept of classifying documents into categories exists as a part of library science. Several automated library management tools include this feature.

1990

In the 1990s, text classification gains recognition as an important NLP function. With the availability of larger datasets and more categories, researchers apply ML-based classification. Support Vector Machines (SVM) and Maximum Entropy Models (MEMs) are applied to text classification in the late 1990s. By the end of the decade, text classification is seen as a combination of information retrieval and machine learning. It's also seen as an application of text mining.

2000

In the early part of 2000s, researchers look at multi-label text classification. Unlike topic modelling where topic candidates are ranked, text classification requires a definite set of topics. Approaches include EM algorithm, Parametric Mixture Models (PMM) and multi-labelled MEMs.

2004

Pang and Lee publish a paper on sentiment classification based on Machine Learning techniques. They propose a 'subjectivity detector' that extracts the subjective sentences of a document, found via minimum cuts in a graph, and then applies text categorization techniques to the extract. Using Naive Bayes and SVM classifiers, they claim an accuracy of 86.4% for the NB polarity classifier.

Oct 2014
CNN for text showing two channels. Source: Kim 2014, fig. 1.

CNNs are common for image processing but not in NLP. Yoon Kim presents the idea of using a CNN to classify text in a paper titled Convolutional Neural Networks for Sentence Classification.

References

  1. Agarwal, Rahul. 2014. "Attention, CNN and what not for Text Classification." Towards Data Science, on Medium, March 09. Accessed 2019-12-04.
  2. AzureML. 2014. "Multiclass Classification: News categorization." Azure AI Gallery, September 02. Accessed 2019-12-25.
  3. Geitgey, Adam. 2018. "Text Classification is Your New Secret Weapon." Medium, August 15. Accessed 2019-12-04.
  4. Google Developers. 2019. "Step 2.5: Choose a Model." Text Classification, Machine Learning. Accessed 2019-12-04.
  5. Irandoust, Kiarash. 2017. "Most used Java libraries, frameworks, and APIs in big data projects — part 2." ITNEXT, on Medium, March 03. Accessed 2019-12-04.
  6. Kaushik, Saurav. 2016. "Introduction to Feature Selection methods with an example (or how to select the right variables?)." Analytics Vidhya, December 01. Accessed 2019-12-04.
  7. Kim, Yoon. 2014. "Convolutional Neural Networks for Sentence Classification." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, pp. 1746-1751, October. Accessed 2019-12-25.
  8. Klein, Bernd. 2019. "Text Categorization and Classification." Python Machine Learning Tutorial, Python Course. Accessed 2019-12-04.
  9. Kowsari, Kamran, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. "Text Classification Algorithms: A Survey." Information, vol. 10, no. 4, 150, April 23. Accessed 2019-12-04.
  10. Malik, Usman. 2018. "Applying Wrapper Methods in Python for Feature Selection." Stack Abuse, November 06. Accessed 2019-12-04.
  11. Malik, Usman. 2018b. "Text Classification with Python and Scikit-Learn." Stack Abuse, August 27. Accessed 2019-12-04.
  12. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. "Evaluation of text classification." Introduction to Information Retrieval, Cambridge University Press. Accessed 2019-12-04.
  13. MeaningCloud. 2019. "How does text classification work?" MeaningCloud. Accessed 2019-12-04.
  14. Mestiri, Sara. 2017. "Applied Text-Classification on Email Spam Filtering [Part 1]." Towards Data Science, on Medium, September 01. Accessed 2019-12-04.
  15. Mittal, Swayam. 2019. "Deep Learning Techniques for Text Classification." Data Driven Investor, on Medium, August 17. Accessed 2019-12-04.
  16. MonkeyLearn. 2018. "Text Classification." MonkeyLearn Inc., October 04. Accessed 2019-12-04.
  17. Pagels, Max. 2018. "Machine Learning Reductions & Mother Algorithms, Part II: Multiclass to Binary Classification." Fourkind, on Medium, November 16. Accessed 2019-12-04.
  18. Pang, Bo, and Lillian Lee. 2004. "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts." Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 271-278, July. Accessed 2019-12-04.
  19. RandolphVI. 2019. "RandolphVI/Multi-Label-Text-Classification." GitHub, April 16. Accessed 2019-12-25.
  20. Raschka, Sebastian. 2014. "Naive Bayes and Text Classification Introduction and Theory." October 04. Accessed 2019-12-04.
  21. Rogati, Monica, and Yiming Yang. 2002. "High-Performing Feature Selection for Text Classification." Accessed 2019-12-04.
  22. Sasaki, Yutaka. 2009. "Introduction to Text Classification." Toyota Technological Institute, December 15. Accessed 2019-12-22.
  23. Sebastiani, Fabrizio. 2002. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March. Accessed 2019-12-24.
  24. Taspinar, Ahmet. 2015. "Text Classification and Sentiment Analysis." November 16. Accessed 2019-12-04.
  25. TensorFlow. 2019. "How to build a simple text classifier with TF-Hub." Tutorial, TensorFlow, October 31. Accessed 2019-12-04.
  26. Varangaonkar, Amey. 2017. "9 Useful R Packages for NLP & Text Mining." Packt Publishing Ltd, December 18. Accessed 2019-12-04.
  27. Wikipedia. 2019. "Multiclass classification." Wikipedia, December 10. Accessed 2019-12-04.
  28. Wong, James. 2016. "Text classification." SlideShare, LinkedIn Corporation, April 25. Accessed 2019-12-04.
  29. Yu, Bei. 2008. "An Evaluation of Text Classification Methods for Literary Study." Literary and Linguistic Computing, vol. 23, no. 3. Accessed 2019-12-04.
  30. Zafra, Miguel Fernández. 2019. "Text Classification in Python." Towards Data Science, on Medium, June 16. Accessed 2019-12-04.

Further Reading

  1. Kowsari, Kamran, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. "Text Classification Algorithms: A Survey." Information, vol. 10, no. 4, 150, April 23. Accessed 2019-12-04.
  2. Google Developers. 2019. "Introduction." Text Classification, Machine Learning. Accessed 2019-12-04.
  3. Sebastiani, Fabrizio. 2002. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March. Accessed 2019-12-24.
