# Topic Modelling

## Summary

The growth of the web since the early 1990s has resulted in an explosion of online data. In an effort to organize all this unstructured data, topic models were invented as a text mining tool. Topic modelling uncovers underlying themes or topics in documents.

Consider a document in which the words 'dog' and 'bone' occur often. We can say that this document belongs to the topic of *Dogs*. Another document with the words 'cat' and 'meow' occurring frequently is of topic *Cats*. In another example, based on words in the content, emails can be topically labelled as *Personal*, *Project* or *Financial*. An email can also belong to multiple topics. Topic modelling can be seen as a form of tagging.

Topic modelling is an unsupervised task. Latent Dirichlet Allocation (LDA) and its many variants have been popular. LDA is a probabilistic generative model.

## Milestones

1998

1999

2005

2007

2012

2012

2012

2013

2018

## Discussion

What are some typical applications of topic modelling? Early invention and application of topic models was in the field of text mining and information retrieval.

Since then, topic modelling has been used in various applications including classification, categorization, summarization, and segmentation of documents. More unusual applications include computer vision, population genetics and social networks. In information retrieval, topic modelling helps in query expansion. It also personalizes search results or makes recommendations by mapping user preferences to topics.

When analyzing scientific literature, it's been noted that topics often correspond to scientific disciplines. We can also track how topics evolve over time. For example, the topic 'string' in Physics (for string theory) becomes more common from the 1970s.

In social sciences, topic modelling enables qualitative analysis. Sentiment analysis and social network analysis are two examples.

In software engineering, topic modelling has been used to analyze source code, change logs, bug databases, and execution traces.

In bioinformatics, compared to traditional data reduction techniques, topic modelling is seen as more promising since it's more easily interpretable.

How is topic modelling different from text classification or clustering? **Text classification** is a supervised task that learns a classifier from training data. Topic modelling is an unsupervised task where topics are not specified in advance: they are induced from the data itself. **Text clustering** and topic modelling are similar in the sense that both are unsupervised tasks. Both attempt to organize documents for better information retrieval and browsing. However, there's a difference. Text clustering looks at the similarity among documents and attempts to group similar documents into clusters. The similarity measures could be based on TF-IDF weighting.
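To illustrate the clustering view, here's a minimal pure-Python sketch of cosine similarity over term-count vectors (TF-IDF weights could be substituted for the raw counts); the documents are made up for illustration:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = "the dog chased the ball".split()
d2 = "the dog fetched the ball".split()
d3 = "cats meow at night".split()
# The two dog documents are closer to each other than to the cat document.
print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

A clustering algorithm would then group documents whose pairwise similarity is high.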

In topic modelling, we don't look at document similarity. Instead, we treat a document as a mixture of topics, where a topic is a probability distribution over words. Soft clustering (where a document can belong to multiple clusters) can be viewed as similar to topic modelling, though the approaches still differ. Thus, the clusters from text clustering are not quite the same as topics in topic modelling. They're however seen as complementary. Research from the mid-2000s explores combining both techniques into a single model.
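The 'mixture of topics' view can be sketched in a few lines of Python. The topics, words and mixture weights below are invented for illustration; a real model would learn these distributions from data:

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "Dogs": {"dog": 0.5, "bone": 0.3, "bark": 0.2},
    "Cats": {"cat": 0.5, "meow": 0.3, "purr": 0.2},
}
# A document is a mixture of topics, e.g. 70% Dogs and 30% Cats.
doc_mixture = {"Dogs": 0.7, "Cats": 0.3}

def generate_document(mixture, topics, length, seed=0):
    """Generate a document: pick a topic per word, then draw the word."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words

print(generate_document(doc_mixture, topics, 10))
```

Topic modelling runs this generative story in reverse: given only the documents, it infers the topics and mixtures.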

What's the typical pipeline for topic modelling? Topic models perform a statistical analysis of the words present in each document of a collection. The model is expected to output three things: (a) clusters of co-occurring words, each of which represents a topic; (b) the distribution of topics for each document; (c) a histogram of words for each topic.

To build a model, we must balance different aspects: fidelity (how well the model reflects the real world), performance, tractability (discrete models are preferred), and interpretability.

In the **bag-of-words model**, the ordering of words in each document is ignored. Such a model is simple but it ignores phrase-level co-occurrences. An alternative is the **unigram model**, in which words are randomly drawn from a categorical distribution. A mixture of such unigram models is also possible: each topic has its own distribution of words, and we randomly draw words conditioned on the topic. Essentially, words are generated by latent variables of the model. Thus, a topic modelling algorithm such as LDA is a **generative model**.

Could you explain how documents, words and topics are related? The basic approach towards topic modelling is to prepare a document-term matrix. For each document, we count how many times a particular term occurs. In practice, not all terms are equally important. For this reason, **TF-IDF weighting** is used instead of raw counts. TF-IDF effectively gives more weight to terms that are frequent in a document but rare in the rest of the corpus. The next step is to decompose this matrix into document-topic and term-topic matrices. We don't in fact identify the names of these topics. This is something the analyst can do by looking at the main terms of each topic.
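As a sketch, here's one common TF-IDF variant (term frequency times log inverse document frequency) in plain Python; libraries such as sklearn use slightly different smoothing and normalization, and the toy corpus is invented:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Build a TF-IDF weighted document-term matrix (one dict per document)."""
    n = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        # tf normalized by document length; idf = log(N / df)
        weighted.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weighted

corpus = [
    "lebron scored again lebron".split(),
    "the celtics won the game".split(),
    "markets fell as rates rose".split(),
]
matrix = tf_idf(corpus)
print(matrix[0])  # 'lebron' gets the highest weight in the first document
```

A term appearing in every document would get idf = log(1) = 0, so ubiquitous terms are suppressed.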

In the figure, we can see that *T1* is probably about sports because of the terms Lebron, Celtics and sprain. Since the number of topics is far fewer than the vocabulary size, we can view topic modelling as a *dimensionality reduction technique*. To determine exactly how many topics we should look for, the **Kullback-Leibler Divergence score** is a useful measure.

Could you describe the main algorithms for topic modelling? We mention three main algorithms:

- **Latent Semantic Analysis (LSA)**: Also called *LSI*, this algorithm constructs a semantic space in which related words and documents are placed near one another. It uses SVD as its core technique.
- **Probabilistic LSA (pLSA)**: Also called the *aspect model*, this is a probabilistic generative model. It doesn't use SVD. It models the probability of a topic given a document and the probability of a word given a topic. These are multinomial distributions that can be trained with the EM algorithm.
- **Latent Dirichlet Allocation (LDA)**: This is a Bayesian approach. A document is modelled as a finite mixture of topics. Each topic is modelled as an infinite mixture of topic probabilities. Topic probabilities make up a document's representation. The topic mixture is drawn from a Dirichlet distribution.
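As an illustration of LDA's machinery, here's a toy collapsed Gibbs sampler in plain Python. It's a simplified sketch with made-up documents and hyperparameters, not a production implementation (real tools such as MALLET or gensim are far more efficient and careful):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    ndk = [[0] * k for _ in docs]                  # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]     # topic-word counts
    nk = [0] * k                                   # topic totals
    # Random initial topic assignment for every word position.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Full conditional for this token's topic assignment.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Smoothed document-topic distributions.
    theta = [[(ndk[d][j] + alpha) / (len(docs[d]) + k * alpha) for j in range(k)]
             for d in range(len(docs))]
    return theta, nkw

docs = [
    "dog bone dog bark bone".split(),
    "cat meow cat purr meow".split(),
    "dog bark bone dog".split(),
]
theta, topic_words = lda_gibbs(docs, k=2)
print(theta)  # per-document topic mixtures
```

On this tiny corpus, the dog documents and the cat document should end up dominated by different topics.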

What are some common challenges with topic modelling? Topic modelling doesn't provide a method to select the optimum number of topics. LDA has many free parameters that can cause overfitting. LDA uses Bayesian priors without suitable justification. Statistical properties of the data (such as Zipf's Law) may also differ from the assumptions.

LSA is a linear model. It might not fit non-linearities in the dataset. It assumes Gaussian distribution of terms. SVD is also computationally expensive.

LSA uses less efficient representations. Its results are not easily interpretable. LDA can't directly represent co-occurrence information since words are sampled independently. A model based on Generalized Pólya urns or Bayesian regularization can solve this.

High-frequency non-specific words will result in topics that users may find too general and not useful. For example, documents on artificial intelligence might have the words 'algorithm', 'model' or 'estimation' occurring frequently. Low-frequency specific words are equally problematic.

We could end up with topics that all look nearly the same. This can be solved by conditioning the prior topic distribution with data.

What techniques have been used to improve the performance of topic modelling? When pre-processing text, it's common to do lemmatization and remove punctuation and stop words. In addition, we can remove low-frequency terms since they represent weak features. We could also make use of POS tags and remove terms that are not contextually important. For example, all prepositions could be removed.
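A minimal pre-processing sketch in plain Python (the stop word list is a tiny illustrative sample; lemmatization and POS tagging would need a library such as nltk or spaCy):

```python
import string
from collections import Counter

# Tiny illustrative stop word list; real lists have a few hundred entries.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "to", "and"}

def preprocess(text, min_freq=1, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stop words and rare terms."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in stop_words]
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_freq]

print(preprocess("The model of the topic, the model wins!", min_freq=2))
# → ['model', 'model']
```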

Another technique is to do batch-wise LDA. Each batch will provide a different set of topics and an intersection of these might give the best topic terms.
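The intersection idea can be sketched as follows, with hypothetical top-term sets from three batch runs:

```python
# Hypothetical top terms for the "same" topic from three batch-wise LDA runs.
batch_topics = [
    {"dog", "bone", "bark", "leash"},
    {"dog", "bone", "walk", "bark"},
    {"bone", "dog", "bark", "puppy"},
]
# Terms that survive every batch are likely the most stable topic terms.
stable_terms = set.intersection(*batch_topics)
print(sorted(stable_terms))  # → ['bark', 'bone', 'dog']
```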

Through user interactions, we could add constraints, or merge or separate topics. For example, for two highly correlated words we can add the constraint that they should appear in the same topic.

Some approaches to improve upon LDA include integrating topics with syntax; looking at correlation between topics, as in the Pachinko Allocation Model (PAM) or Correlated Topic Model (CTM); using metadata such as authors; and accounting for burstiness of words (a word once used in a document is more likely to appear again). Non-parametric models such as Pitman-Yor or negative binomial processes have tried to address Zipf's Law.

What are some useful resources for research into topic modelling? Developers can use the *MALLET* Java package for topic modelling. A wrapper for this in R is available via the *mallet* package. Another R package is *topicmodels*. The latter package can do LDA (VEM or Gibbs estimation) or CTM (VEM estimation).

Python developers can use *nltk* for text pre-processing and *gensim* for topic modelling. Package *gensim* has functions to create a bag of words from a document, do TF-IDF weighting and apply LDA. If the intent is to do LSA, the *sklearn* package has functions for TF-IDF and SVD. The MALLET package is also available in Python via *gensim*. pyLDAvis is a Python library for topic model visualization. An analyst can use it to look at the terms of a topic and decide the topic name.

For automatic topic labelling, Wikipedia can be a useful data source. Those who wish to use a cloud service can look at **Amazon Comprehend**. It runs LDA on a document collection and returns term-topic and document-topic associations. A job should include at least 1000 documents.

## References

- AWS Docs. 2019. "Topic Modeling." Developer Guide, Amazon Comprehend, AWS Docs, February 2. Accessed 2020-01-13.
- Allahyari, Mehdi, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques." arXiv, v2, July 28. Accessed 2020-01-12.
- Arora, Sanjeev, Rong Ge, and Ankur Moitra. 2012. "Learning Topic Models - Going beyond SVD." arXiv, v2, April 10. Accessed 2020-01-10.
- Bansal, Shivam. 2016. "Beginners Guide to Topic Modeling in Python." Analytics Vidhya, August 24. Accessed 2020-01-10.
- Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-13.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
- Boyd-Graber, Jordan, Yuening Hu, and David Mimno. 2017. "Applications of Topic Models." Foundations and Trends in Information Retrieval, vol. 11, no. 2-3, pp. 143-296, July. Accessed 2020-01-10.
- Boyd-Graber, Jordan, David Mimno, and David Newman. 2019. "Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements." Chapter 12 in: Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds), Handbook of Mixed Membership Models and Their Applications, CRC Press. Accessed 2020-01-10.
- Chaney, Allison, and David M. Blei. 2012. "Visualizing topic models." Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 419-422, June 4-7. Accessed 2020-01-12.
- Chen, Tse-Hsun, Stephen W. Thomas, and Ahmed E. Hassan. 2015. "A Survey on the Use of Topic Models when Mining Software Repositories." Empirical Software Engineering, vol. 21, pp. 1843–1919, September 10. Issue date October 2016. Accessed 2020-01-10.
- Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. 2012. "Termite: Visualization Techniques for Assessing Textual Topic Models." Advanced Visual Interfaces (AVI), ACM, May 21-25, 2012. Accessed 2020-01-12.
- Contreras-Piña, Constanza, and Sebastián A. Ríos. 2016. "An empirical comparison of latent sematic models for applications in industry." Neurocomputing, Elsevier, vol. 179, pp. 176-185. Accessed 2020-01-13.
- Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
- Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. "A network approach to topic models." Science Advances, AAAS, vol. 4, no. 7, eaaq1360. Accessed 2020-01-10.
- Griffiths, Thomas L. and Mark Steyvers. 2004. "Finding scientific topics." Proceedings of the National Academy of Sciences, vol. 101, Suppl 1, pp. 5228–5235. Accessed 2020-01-13.
- Grün, Bettina, and Kurt Hornik. 2011. "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software, vol. 40, no. 13, May. Accessed 2020-01-10.
- Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57, August. https://doi.org/10.1145/312624.312649. Accessed 2020-01-14.
- Ivanov, George-Bogdan. 2018. "Complete Guide to Topic Modeling." NLP For Hackers, January 3. Accessed 2020-01-10.
- Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-10.
- Joshi, Prateek. 2018. "Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)." Analytics Vidhya, October 1. Accessed 2020-01-10.
- Krishan. 2016. "Topic Modeling and Document Clustering; What’s the Difference?" Integrated Knowledge Solutions, May 16. Accessed 2020-01-12.
- Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.
- Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-10.
- Liu, Lin, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. 2016. "An overview of topic modeling and its current applications in bioinformatics." SpringerPlus, vol. 5, article no. 1608, September 20. Accessed 2020-01-10.
- McCallum, Andrew, Xuerui Wang, and Andres Corrada-Emmanuel. 2007. "Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email." Journal of Artificial Intelligence Research, AI Access Foundation, vol. 30, pp. 249-272, October. Accessed 2020-01-10.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2020-01-14.
- Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-10.
- Newman, David, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. "Automatic Evaluation of Topic Coherence." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 100–108, June. Accessed 2020-01-14.
- Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 1998. "Latent Semantic Indexing: A Probabilistic Analysis." Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 159–168, May. Accessed 2020-01-14.
- Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 2000. "Latent Semantic Indexing: A Probabilistic Analysis." Journal of Computer and System Sciences, Elsevier, vol. 61, no. 2, pp. 217-235, October. Accessed 2020-01-14.
- Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
- Prabhakaran, Selva. 2018. "Topic Modeling with Gensim (Python)." Machine Learning Plus, March 26. Accessed 2020-01-10.
- Raju, L Venkata Rama. 2019. "Topic Modeling – Latent Semantic Analysis (LSA) and Singular Value Decomposition (SVD)." Data Jang, June 18. Accessed 2020-01-10.
- Robinson, David and Julia Silge. 2017. "Topic Modeling." Chapter 6 in: Text Mining with R, O'Reilly Media, Inc. Accessed 2020-01-10.
- Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-10.
- Silge, Julia and David Robinson. 2019. "Text Mining with R." November 24. Accessed 2020-01-10.
- Vorontsov, Konstantin, and Anna Potapenko. 2014. "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization." In: Ignatov D., Khachay M., Panchenko A., Konstantinova N., Yavorsky R. (eds), Analysis of Images, Social Networks and Texts, AIST 2014, Communications in Computer and Information Science, vol. 436, Springer, Cham. Accessed 2020-01-10.
- Wikipedia. 2019. "Topic model." Wikipedia, December 13. Accessed 2020-01-10.
- Wu, Hu, Yongji Wang, and Xiang Cheng. 2008. "Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation." RecSys’08, ACM, pp. 99-106, October 23–25. Accessed 2020-01-10.
- Xie, Pengtao, and Eric P. Xing. 2013. "Integrating Document Clustering and Topic Modeling." arXiv, v1, September 26. Accessed 2020-01-12.
- Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
- Zhao, Alice. 2019. "Natural Language Processing (Part 5): Topic Modeling with Latent Dirichlet Allocation in Python." YouTube, January 5. Accessed 2020-01-10.


## See Also

- Singular Value Decomposition
- Latent Dirichlet Allocation
- Correlated Topic Model
- Text Clustering
- Expectation Maximization Algorithm
- Factor Analysis

## Further Reading

- Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
- Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
- Fatma, Fatma. 2019. "Industrial applications of topic model." Medium, April 5. Accessed 2020-01-10.
- Agrawal, Amritanshu, Wei Fu, and Tim Menzies. 2018. "What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)." arXiv, v4, February 20. Accessed 2020-01-10.
- Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.