Text Classification

Text classification process. Source: MonkeyLearn 2018.

Abundant textual data accumulates in any ecosystem, unstructured and in diverse formats. To extract trends or meaningful insights from it, we need to sort the data into categories. Text classification is a simple, powerful analysis technique that sorts a text repository under various tags, each representing a specific meaning. Typical examples include categorizing customer feedback as positive or negative, or news as sports or politics.

Machine Learning is used to extract features from text and classify documents into categories. Text classification can be implemented using supervised algorithms, with Naïve Bayes, SVM and Deep Learning being common choices.

Text classification finds wide application in NLP for detecting spam, sentiment analysis, subject labelling and intent analysis. Automating these mundane tasks makes search, analysis and decision making faster and easier. Text classification is very effective with historical data, and it can also analyse textual input in real time.

Discussion

  • What's the scope of text classification?
    Document feature extraction by sentence level analysis. Source: Kowsari et al. 2019, fig. 4.

    The scope of text classification is at document level, paragraph level, sentence level, or even sub-sentence level. In some applications we may assign one or more classes to an entire document. In other applications, we may assign a class to each paragraph or sentence in the document.

    An algorithm may be designed to accept syntax and semantic information at the sentence level. This is then used to classify the document. While sentence-level analysis is more granular, its limitation is that a sentence's context can often be determined only from the sentences surrounding it.

  • What are the different methods to perform text classification?

    Some methods of text classification include the following:

    • Manual classification: We could do the sorting manually, as clerks once did. Assign categories to every incoming text, then group documents by common category. This method has the best accuracy, but is feasible only for small volumes of text.
    • Hand-crafted rules for automated classification: A domain expert defines the rules for categorization. For example, text with the words "shares, profits, equity, liabilities" is classified into the Business category. Text with the words "Messi, kick, Ronaldo, goal" falls under the Football category.
    • Machine Learning Algorithms: Supervised techniques chosen based on dataset size and number of categories.
    • Hybrid techniques: Start with manual classification and assign categories to a small portion of the training dataset. Then apply ML algorithms to extend to the rest of the dataset.
  • How does text classification work?
    How text classification works. Source: MeaningCloud 2019.

    Let's work through a supervised classifier example to classify customer feedback into one of these categories: COMPLAINT, SUGGESTION, APPRECIATION. Text may not explicitly include these words, so categorisation has to happen by interpreting the words that are present.

    For the classifier to predict the correct label, it needs to ‘understand’ the essence of the text. So we train the classifier with millions of similar customer feedback texts which already have one of these labels assigned. After training, we allow the classifier to test its predictions. Once desired accuracy levels are attained, the classifier can now be used to categorise new customer feedback.

    With text classification, the algorithm doesn't care whether the user wrote standard English, an emoji, or an indirect reference (poetry, sarcasm, movie quotes). It only cares about how frequently particular text co-occurs with a category. Let's say the phrase "Out of this world" occurs very frequently in feedback labelled APPRECIATION in the training dataset. New feedback containing this phrase then has a very high chance of being classified as APPRECIATION.
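
    Below is a minimal sketch of this supervised workflow using scikit-learn. The tiny inline dataset is invented for illustration, and TF-IDF features with logistic regression are just one reasonable choice of representation and classifier.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labelled feedback (a real system would train on a large corpus)
    texts = [
        "The agent was rude and never resolved my issue",
        "Please add a dark mode to the mobile app",
        "Out of this world! Best support I have ever had",
        "I was billed twice and nobody responds to my emails",
    ]
    labels = ["COMPLAINT", "SUGGESTION", "APPRECIATION", "COMPLAINT"]

    # Vectorize text into TF-IDF features, then fit a linear classifier
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    # Predict the category of unseen feedback
    print(clf.predict(["Out of this world experience, thank you!"]))
    ```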

  • How do ML algorithms automate text classification?
    • Naïve Bayes: Its common application is in spam filtering, to classify incoming emails/SMS as spam or not-spam. We apply the conditional probability model of Bayes' theorem to answer "What is the probability that the message is spam given the occurrence of word X?". We assume that features (words) are conditionally independent of one another given the class. Though this is often untrue, prediction accuracy is reasonably good, especially for small sample sizes.
    • Support Vector Machines: For the same spam filtering task, SVM offers better accuracy than Naïve Bayes since it uses an optimization technique, though it requires more computational resources. SVM builds an optimal separating hyperplane that maximises the margin between the categories (Spam/Not Spam in our case). Unlike Naïve Bayes, SVM is a non-probabilistic algorithm. A sketch comparing the two follows this list.
    • Deep Learning: Works well when datasets are huge in size and continuously growing. For spam filtering on a large scale (huge dataset, large number of categories), RNN (LSTM in particular) would be highly effective as it gives weightage to the sequence of words appearing in the text. Convolutional Neural Networks are a good choice for hierarchical document classification to recognize patterns in the text sequence.
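
    Here's a hedged sketch of Naïve Bayes and a linear SVM on the same bag-of-words features, using scikit-learn. The toy spam/ham corpus is invented for illustration.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy spam/ham corpus (illustrative only)
    texts = ["win cash prizes now", "free lottery winner claim",
             "team meeting at noon", "see attached project report"]
    labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(texts)           # word count vectors

    nb = MultinomialNB().fit(X, labels)    # probabilistic: P(class | words)
    svm = LinearSVC().fit(X, labels)       # non-probabilistic: max-margin hyperplane

    x_new = vec.transform(["claim your free cash now"])
    print(nb.predict(x_new), nb.predict_proba(x_new))        # class + probabilities
    print(svm.predict(x_new), svm.decision_function(x_new))  # class + margin distance
    ```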
  • How should I prepare the data for text classification?

    Data preparation for ML-based text classification involves the following steps (a short sketch follows the list):

    • Tokenization: Identifying words, symbols, emojis, hyperlinks, based on known delimiters and format rules.
    • Word normalization: Reduce derived words into their root form (developmental becomes develop, encouragement becomes encourage).
    • Text and feature encoding: ML models require numeric features and labels to provide a prediction. So we create a data dictionary to map each word/feature in the document and each category label to a numerical ID. (Example category codes: COMPLAINT -> 0, SUGGESTION -> 1, APPRECIATION -> 2).
    • Feature representation: Each document is represented as a word count vector (giving the frequency count of each term) or a TF-IDF vector (Term Frequency-Inverse Document Frequency) capturing the relative importance of a term in the document. As an example, Microsoft Azure has a Feature Hashing module to convert features to integers for input into a model.
    • Word/Document embedding: Every row in the dataset is an entire document, represented as a dense vector. A word's position within the vector space is learned from the text, based on its surrounding words. Word embeddings can be trained on the input corpus, but pre-trained embeddings (GloVe, FastText and Word2Vec) are also available.
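
    As a brief sketch of the first few steps, here is one way to tokenize, stem and encode text with NLTK's stemmer and scikit-learn. The sample strings are invented, and a simple regex stands in for a full tokenizer to keep the example self-contained.

    ```python
    import re
    from nltk.stem import PorterStemmer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer

    # 1. Tokenization: split raw text into word tokens (simple rule-based)
    tokens = re.findall(r"\w+", "Encouragement drives developmental growth".lower())

    # 2. Word normalization: reduce tokens to their root form
    stems = [PorterStemmer().stem(t) for t in tokens]  # encouragement -> encourag

    # 3. Label encoding: map category names to numeric IDs
    #    (LabelEncoder assigns IDs in alphabetical order)
    y = LabelEncoder().fit_transform(["COMPLAINT", "SUGGESTION", "APPRECIATION"])

    # 4. Feature representation: TF-IDF vectors over a small corpus
    X = TfidfVectorizer().fit_transform(["great product thank you",
                                         "please fix the billing bug"])
    print(stems, y, X.shape)  # X.shape = (2 documents, vocabulary size)
    ```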
  • How should I select features for text classification?

    Feature selection involves picking a small but optimal subset of terms from the training dataset to use as features during text classification. Since the vocabulary size reduces, computational efficiency during training improves, which is critical for algorithms other than Naïve Bayes. Feature selection also helps minimise noise features, which increase classification error when included.

    Feature selection is essentially a process of dimensionality reduction. Contrary to the belief that more features mean better classification, weaker models (fewer but well-differentiated features) deliver better accuracy when training data is limited. Feature selection methods are classified into four types (a filter-method sketch follows the list):

    • Filter: A pre-processing step suitable for any ML algorithm. Statistical tests measure the correlation between each feature and the outcome variable, and features are chosen based on test scores. For example, Pearson's correlation applies when both feature and response are continuous.
    • Wrapper: Based on greedy search algorithms, these choose optimal features for a specific ML algorithm. Example: step forward/backward feature selection (add/remove features one at a time from the feature set and evaluate performance at each step).
    • Embedded: Feature selection is part of ML algorithm training phase (LASSO, Ridge Regression to reduce overfitting).
    • Hybrid: Mix of filter and wrapper methods.
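
    A small sketch of filter-style selection using a chi-squared test, which is common for text since term counts are non-negative. The toy corpus and the value of k are arbitrary.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    texts = ["goal scored by Messi", "shares and profits rose",
             "Ronaldo missed a kick", "equity and liabilities grew"]
    labels = ["football", "business", "football", "business"]

    vec = CountVectorizer()
    X = vec.fit_transform(texts)

    # Keep only the k terms most associated with the class labels
    selector = SelectKBest(chi2, k=4)
    X_reduced = selector.fit_transform(X, labels)

    kept = selector.get_support(indices=True)
    print([vec.get_feature_names_out()[i] for i in kept])
    ```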
  • What are the methods to evaluate the effectiveness of text classification?

    Cross-validation and hold-out tests are common evaluation methods to deduce how often a prediction was right (true positives, true negatives) and when it made a mistake (false positives, false negatives).

    • N-fold cross-validation: Split the dataset into N folds and run the test N times. In each run, use one fold as the test set and the remaining N-1 folds as the training set. Classification accuracy is the average of the results over the N runs.
    • Hold-out test: Divide the dataset into training and test subsets. Different splits will produce different results and accuracy, especially for small datasets. A paired t-test can be used to measure the significance of accuracy differences.

    Well-known performance metrics used in assessment include accuracy, precision, recall and F1 score. Classification error (1 - accuracy) is a sufficient metric if the percentage of documents in the class is high (10-20% or higher). However, for small classes, always saying 'NO' will achieve high accuracy but make the classifier irrelevant. So precision, recall and F1 are better measures (see the sketch below).

    Other evaluation metrics include Matthews Correlation Coefficient (MCC), ROC and AUC.
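
    Here's a hedged sketch of both evaluation styles with scikit-learn, on an invented two-class corpus. cv=2 is used only because the toy dataset is tiny; 5 or 10 folds are more typical.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import classification_report

    texts = ["great service thanks", "love this product",
             "amazing support team", "excellent quality",
             "terrible experience", "worst purchase ever",
             "awful customer service", "very disappointed"]
    labels = ["pos"] * 4 + ["neg"] * 4

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

    # N-fold cross-validation: accuracy averaged over N train/test runs
    print(cross_val_score(clf, texts, labels, cv=2).mean())

    # Hold-out test: one fixed split, then per-class precision/recall/F1
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    ```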

  • What are the factors behind the choice of an ML algorithm?
    A flowchart for selecting a text classification algorithm. Source: Google Developers 2019, fig. 5.

    There are some common pointers for choosing between a supervised and an unsupervised algorithm. The choice depends on the dataset size, the accuracy required, and the training time or resources available.

    For smaller datasets with humanly identifiable or limited categories, choose a supervised algorithm. For impossibly huge, ever-growing datasets, or too many categories, choose an unsupervised algorithm or an ANN model. While text classification is usually considered a supervised task, the term is sometimes used for unsupervised tasks. Text clustering is an example: it identifies the categories before assigning them to documents.

    Calculate the ratio of the number of samples to the number of words per sample. If this ratio is less than 1500, tokenize the text as n-grams and use a simple multi-layer perceptron (MLP) model to classify them. If the ratio is greater than 1500, tokenize the text as sequences and use a CNN model to classify them. This heuristic is sketched below.
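
    The heuristic reduces to a few lines of code. The function name and sample figures below are hypothetical; the 1500 threshold comes from the cited Google Developers guide.

    ```python
    def choose_model(num_samples: int, words_per_sample: float) -> str:
        """Suggest a model family from the samples/words-per-sample ratio."""
        ratio = num_samples / words_per_sample
        if ratio < 1500:
            return "n-gram features + MLP"
        return "sequence tokens + CNN"

    print(choose_model(5000, 100))    # ratio 50   -> n-gram MLP
    print(choose_model(100000, 20))   # ratio 5000 -> sequence CNN
    ```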

  • How to adapt binary text classification techniques to multi-class problems?

    More classes mean higher model complexity. For a fixed dataset, binary classification gives better accuracy than multi-class classification because of clearer decision boundaries. It's common practice to reduce multi-class problems to binary classifiers before running ML algorithms.

    Let's take a training set with 5 classes: "Class-A", "Class-B", "Class-C", "Class-D", "Others". We relabel all data carrying the labels "Class-A", "Class-B", "Class-C" and "Class-D" as "Class-ABCD". Now it's a binary classification problem with just two classes: "Class-ABCD" and "Others".

    A popular problem reduction technique called OAA (One-Against-All) is employed for this:

    • For each of the 5 classes in our dataset, we create 5 new datasets D1…D5 (one per class)
    • For each dataset Di, we mark training data of the corresponding class i as positive and all others as negatives (D2 : Class-B -> Positive, All others -> Negative)
    • We train 5 binary classifiers, each on their own dataset Di
    • We get predictions from all binary classifiers, then choose the class whose classifier gives a positive output, breaking ties randomly.

    An undesirable consequence is class imbalance: since negative examples far outnumber positives, learning skews towards the negative class. This is countered by introducing relative weightages for each dataset, as in the sketch below.
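
    Scikit-learn's OneVsRestClassifier automates this reduction, fitting one binary classifier per class and picking the class with the strongest positive output. The toy data below is invented, and class_weight="balanced" is one way to apply the per-class weighting mentioned above.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["refund my money now", "the app keeps crashing",
             "add export to csv please", "support a dark theme",
             "fantastic work, love it", "brilliant update, thanks"]
    labels = ["COMPLAINT", "COMPLAINT", "SUGGESTION", "SUGGESTION",
              "APPRECIATION", "APPRECIATION"]

    # One binary max-margin classifier per class; balanced weights offset
    # the positive/negative skew each binary sub-problem creates
    ovr = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LinearSVC(class_weight="balanced")),
    )
    ovr.fit(texts, labels)
    print(ovr.predict(["please add a new theme"]))
    ```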

  • What support is available in various programming languages and libraries?

    Briefly, we list some available resources:

    • Python: Scikit-Learn: sample datasets, feature encoding, prediction and evaluation
    • TensorFlow: TF-Hub: training, prediction, evaluators
    • Java: Stanford CoreNLP, MALLET, WEKA, Lingpipe libraries
    • R: OpenNLP, RWeka

    RandolphVI maintains a list of useful neural network Python implementations for Multi-Label Text Classification. This includes CNNs, RNNs, and attention networks.

Milestones

1960

Since the 1960s, the concept of classifying documents into categories exists as a part of library science. Several automated library management tools include this feature.

1990

In the 1990s, text classification gains recognition as an important NLP function. With the availability of larger datasets and more categories, researchers apply ML-based classification. Support Vector Machines (SVM) and Maximum Entropy Models (MEMs) are applied to text classification in the late 1990s. By the end of the decade, text classification is seen as a combination of information retrieval and machine learning. It's also seen as an application of text mining.

2000

In the early part of 2000s, researchers look at multi-label text classification. Unlike topic modelling where topic candidates are ranked, text classification requires a definite set of topics. Approaches include EM algorithm, Parametric Mixture Models (PMM) and multi-labelled MEMs.

2004

Pang and Lee publish a paper on sentiment classification based on Machine Learning techniques. They propose a 'subjectivity detector' that extracts the subjective sentences of a document, found via minimum cuts in a graph, and then applies text categorization techniques to the extract. Using Naive Bayes and SVM classifiers, they claim an accuracy of 86.4% for the NB polarity classifier.

Oct 2014
CNN for text showing two channels. Source: Kim 2014, fig. 1.

CNNs are common for image processing but not in NLP. Yoon Kim presents the idea of using a CNN to classify text in a paper titled Convolutional Neural Networks for Sentence Classification.

References

  1. Agarwal, Rahul. 2014. "Attention, CNN and what not for Text Classification." Towards Data Science, on Medium, March 09. Accessed 2019-12-04.
  2. AzureML. 2014. "Multiclass Classification: News categorization." Azure AI Gallery, September 02. Accessed 2019-12-25.
  3. Geitgey, Adam. 2018. "Text Classification is Your New Secret Weapon." Medium, August 15. Accessed 2019-12-04.
  4. Google Developers. 2019. "Step 2.5: Choose a Model." Text Classification, Machine Learning. Accessed 2019-12-04.
  5. Irandoust, Kiarash. 2017. "Most used Java libraries, frameworks, and APIs in big data projects — part 2." ITNEXT, on Medium, March 03. Accessed 2019-12-04.
  6. Kaushik, Saurav. 2016. "Introduction to Feature Selection methods with an example (or how to select the right variables?)." Analytics Vidhya, December 01. Accessed 2019-12-04.
  7. Kim, Yoon. 2014. "Convolutional Neural Networks for Sentence Classification." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, pp. 1746-1751, October. Accessed 2019-12-25.
  8. Klein, Bernd. 2019. "Text Categorization and Classification." Python Machine Learning Tutorial, Python Course. Accessed 2019-12-04.
  9. Kowsari, Kamran, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. "Text Classification Algorithms: A Survey." Information, vol. 10, no. 4, 150, April 23. Accessed 2019-12-04.
  10. Malik, Usman. 2018. "Applying Wrapper Methods in Python for Feature Selection." Stack Abuse, November 06. Accessed 2019-12-04.
  11. Malik, Usman. 2018b. "Text Classification with Python and Scikit-Learn." Stack Abuse, August 27. Accessed 2019-12-04.
  12. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. "Evaluation of text classification." Introduction to Information Retrieval, Cambridge University Press. Accessed 2019-12-04.
  13. MeaningCloud. 2019. "How does text classification work?" MeaningCloud. Accessed 2019-12-04.
  14. Mestiri, Sara. 2017. "Applied Text-Classification on Email Spam Filtering [Part 1]." Towards Data Science, on Medium, September 01. Accessed 2019-12-04.
  15. Mittal, Swayam. 2019. "Deep Learning Techniques for Text Classification." Data Driven Investor, on Medium, August 17. Accessed 2019-12-04.
  16. MonkeyLearn. 2018. "Text Classification." MonkeyLearn Inc., October 04. Accessed 2019-12-04.
  17. Pagels, Max. 2018. "Machine Learning Reductions & Mother Algorithms, Part II: Multiclass to Binary Classification." Fourkind, on Medium, November 16. Accessed 2019-12-04.
  18. Pang, Bo, and Lillian Lee. 2004. "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts." Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 271-278, July. Accessed 2019-12-04.
  19. RandolphVI. 2019. "RandolphVI/Multi-Label-Text-Classification." GitHub, April 16. Accessed 2019-12-25.
  20. Raschka, Sebastian. 2014. "Naive Bayes and Text Classification Introduction and Theory." October 04. Accessed 2019-12-04.
  21. Rogati, Monica, and Yiming Yang. 2002. "High-Performing Feature Selection for Text Classification." Accessed 2019-12-04.
  22. Sasaki, Yutaka. 2009. "Introduction to Text Classification." Toyota Technological Institute, December 15. Accessed 2019-12-22.
  23. Sebastiani, Fabrizio. 2002. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March. Accessed 2019-12-24.
  24. Taspinar, Ahmet. 2015. "Text Classification and Sentiment Analysis." November 16. Accessed 2019-12-04.
  25. TensorFlow. 2019. "How to build a simple text classifier with TF-Hub." Tutorial, TensorFlow, October 31. Accessed 2019-12-04.
  26. Varangaonkar, Amey. 2017. "9 Useful R Packages for NLP & Text Mining." Packt Publishing Ltd, December 18. Accessed 2019-12-04.
  27. Wikipedia. 2019. "Multiclass classification." Wikipedia, December 10. Accessed 2019-12-04.
  28. Wong, James. 2016. "Text classification." SlideShare, LinkedIn Corporation, April 25. Accessed 2019-12-04.
  29. Yu, Bei. 2008. "An Evaluation of Text Classification Methods for Literary Study." Literary and Linguistic Computing, vol. 23, no. 3. Accessed 2019-12-04.
  30. Zafra, Miguel Fernández. 2019. "Text Classification in Python." Towards Data Science, on Medium, June 16. Accessed 2019-12-04.

Further Reading

  1. Kowsari, Kamran, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. "Text Classification Algorithms: A Survey." Information, vol. 10, no. 4, 150, April 23. Accessed 2019-12-04.
  2. Google Developers. 2019. "Introduction." Text Classification, Machine Learning. Accessed 2019-12-04.
  3. Sebastiani, Fabrizio. 2002. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March. Accessed 2019-12-24.
