Topic Modelling
 Summary

The growth of the web since the early 1990s has resulted in an explosion of online data. In an effort to organize all this unstructured data, topic models were invented as a text mining tool. Topic modelling uncovers underlying themes or topics in documents.
Consider a document in which the words 'dog' and 'bone' occur often. We can say that this document belongs to the topic of Dogs. Another document with words 'cat' and 'meow' occurring frequently is of topic Cats. In another example, based on words in the content, emails can be topically labelled as Personal, Project or Financial. An email can also belong to multiple topics. Topic modelling can be seen as a form of tagging.
Topic modelling is an unsupervised task. Latent Dirichlet Allocation (LDA) and its many variants have been popular. LDA is a probabilistic generative model.
Discussion
What are some typical applications of topic modelling?
Topic models were first invented and applied in the fields of text mining and information retrieval. Since then, topic modelling has been used in various applications including classification, categorization, summarization, and segmentation of documents. Less conventional applications include computer vision, population genetics and social networks.
In information retrieval, topic modelling helps in query expansion. It also personalizes search results or makes recommendations by mapping user preferences to topics.
When analyzing scientific literature, it's been noted that topics often correspond with scientific disciplines. We can also track how topics evolve over time. For example, the topic 'string' in Physics (for string theory) becomes more common from the 1970s.
In the social sciences, topic modelling enables qualitative analysis. Sentiment analysis and social network analysis are two examples.
In software engineering, topic modelling has been used to analyze source code, change logs, bug databases, and execution traces.
In bioinformatics, topic modelling is seen as more promising than traditional data reduction techniques since its results are more easily interpretable.
How is topic modelling different from text classification or clustering?
Text classification is a supervised task that learns a classifier from labelled training data. Topic modelling is an unsupervised task where topics are not defined in advance. Topics are induced from the actual data.
Text clustering and topic modelling are similar in the sense that both are unsupervised tasks. Both attempt to organize documents for better information retrieval and browsing. However, there's a difference.
Text clustering looks at the similarity among documents and attempts to group similar documents into clusters. These similarity measures could be based on TF-IDF weighting. In topic modelling, we don't look at document similarity. Instead, we treat a document as a mixture of topics, in which a topic is a probability distribution over words. Soft clustering (where a document can belong to multiple clusters) can be viewed as being similar to topic modelling, though the approaches still differ.
Thus, the clusters from text clustering are not quite the same as topics in topic modelling. They are, however, seen as complementary. Research from the mid-2000s explores combining both techniques into a single model.
What's the typical pipeline for topic modelling?
Topic models perform a statistical analysis of words present in each document from a collection of documents. The model is expected to output three things: (a) clusters of co-occurring words each of which represents a topic; (b) the distribution of topics for each document; (c) a histogram of words for each topic.
To build a model, we must balance different aspects: fidelity (how well the model reflects the real world), performance, tractability (discrete models are preferred), and interpretability.
In the bag-of-words model the ordering of words in each document is ignored. Such a model is simple but it ignores phrase-level co-occurrences. An alternative is the unigram model in which words are randomly drawn from a categorical distribution. A mixture of such unigram models is also possible. For example, each topic has a distribution of words. We randomly draw words conditioned on the topic.
Essentially, words are being generated by latent variables of the model. Thus, a topic modelling algorithm such as LDA is a generative model.
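To make the idea of a generative model concrete, here is a minimal sketch of a mixture of unigrams using only the Python standard library. The two topics, their word probabilities and the topic prior are made-up values for illustration, and the function names are hypothetical.

```python
import random

# Hypothetical topics: each is a categorical distribution over a tiny vocabulary.
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.3, "walk": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "nap": 0.2},
}
# Hypothetical prior probability of each topic.
TOPIC_PRIOR = {"dogs": 0.6, "cats": 0.4}


def draw(distribution, rng):
    """Draw one key from a {key: probability} categorical distribution."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(n_words, rng):
    """Mixture of unigrams: pick one latent topic, then draw every word from it."""
    topic = draw(TOPIC_PRIOR, rng)
    words = [draw(TOPICS[topic], rng) for _ in range(n_words)]
    return topic, words


rng = random.Random(0)
topic, words = generate_document(8, rng)
print(topic, words)
```

The topic is the latent variable here: an observer sees only the words. Fitting a topic model amounts to running this process in reverse, inferring the hidden topics from the observed words.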
Could you explain how documents, words and topics are related?
The basic approach towards topic modelling is to prepare a document-term matrix. For each document, we count how many times a particular term occurs. In practice, not all terms are equally important. For this reason, TF-IDF weighting is used instead of raw counts. TF-IDF effectively gives more weight to terms that are frequent in a document but rare in the rest of the corpus.
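As an illustration, the following sketch builds a document-term matrix of raw counts and then applies TF-IDF weighting. It uses one common variant, tf × log(N / df); libraries often apply smoothing or normalization on top of this, so exact values will differ. The three documents are toy data.

```python
import math
from collections import Counter

docs = [
    "the dog chewed the bone",
    "the cat chased the dog",
    "the cat said meow",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Raw counts: one row per document, one column per vocabulary term.
term_counts = [Counter(doc) for doc in tokenized]
counts = [[tc[term] for term in vocab] for tc in term_counts]

# Document frequency: number of documents that contain each term.
n_docs = len(tokenized)
df = {term: sum(1 for tc in term_counts if term in tc) for term in vocab}

# TF-IDF (unsmoothed variant, an assumption): tf * log(N / df).
tfidf = [
    [row[j] * math.log(n_docs / df[term]) for j, term in enumerate(vocab)]
    for row in counts
]

for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:>7}: {weight:.3f}")
```

In this toy corpus, 'the' occurs in every document and so receives zero weight, while 'bone' occurs in only one document and is weighted above 'dog', which occurs in two.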
The next step is to decompose this matrix into document-topic and term-topic matrices. We don't in fact identify the names of these topics. This is something the analyst can do by looking at the main terms of the topic. In the figure, we can see that T1 is probably about sports because of the terms Lebron, Celtics and sprain.
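One way to perform such a decomposition is a truncated SVD, as used in LSA. The sketch below assumes NumPy is available and uses a made-up 4×5 count matrix; the choice of two topics is also an assumption for this toy data. Other algorithms, such as LDA, arrive at their document-topic and term-topic matrices differently.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents x 5 terms.
terms = ["lebron", "celtics", "sprain", "election", "senate"]
X = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

k = 2  # number of topics to keep (an assumption for this toy data)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :k] * s[:k]  # documents x topics
term_topic = Vt[:k, :].T      # terms x topics


def top_terms(topic_index, n=2):
    """Return the n terms with the largest absolute weight in a topic."""
    weights = np.abs(term_topic[:, topic_index])
    return {terms[i] for i in np.argsort(-weights)[:n]}


for j in range(k):
    print(f"T{j + 1}:", sorted(top_terms(j)))
```

The topics come out as unnamed columns. It's only by reading the top terms that an analyst would label one of them as sports and the other as politics.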
Since the number of topics is far smaller than the size of the vocabulary, we can view topic modelling as a dimensionality reduction technique. To determine how many topics we should look for, the Kullback-Leibler divergence score is a useful measure.
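For reference, Kullback-Leibler divergence between two discrete distributions can be computed as in the sketch below. How the divergence is turned into a score for choosing the number of topics depends on the specific method and is not shown here; the function name and the example distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))
```

The divergence is zero only when the two distributions are identical, and it isn't symmetric: swapping the arguments generally gives a different value.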
Could you describe the main algorithms for topic modelling?
We mention three main algorithms:
 Latent Semantic Analysis (LSA): Also called Latent Semantic Indexing (LSI), this algorithm constructs a semantic space in which related words and documents are placed near one another. It uses Singular Value Decomposition (SVD) as the technique.
 Probabilistic LSA (pLSA): Also called the aspect model, this is a probabilistic generative model. It doesn't use SVD. It looks at the probability of a topic given a document and the probability of a word given a topic. These are multinomial distributions that can be trained with the Expectation Maximization (EM) algorithm.
 Latent Dirichlet Allocation (LDA): This is a Bayesian approach. A document is modelled as a finite mixture of topics. Each topic is in turn modelled as an infinite mixture over an underlying set of topic probabilities. Topic probabilities make up a document's representation. The topic mixture is drawn from a Dirichlet distribution.
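The generative story behind LDA can be sketched in a few lines using only the standard library. The topic-word probabilities and the Dirichlet parameter `alpha` below are made-up values, and the function names are hypothetical. This shows only how documents are assumed to be generated, not how an LDA model is fitted.

```python
import random


def sample_dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_lda_document(topic_word, alpha, n_words, rng):
    """LDA-style generation: one topic mixture per document, one topic per word."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic proportions
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta, k=1)[0]  # latent topic of this word
        vocab = list(topic_word[z])
        weights = [topic_word[z][v] for v in vocab]
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words


# Hypothetical topics over a tiny vocabulary.
topic_word = [
    {"dog": 0.6, "bone": 0.4},
    {"cat": 0.6, "meow": 0.4},
]
rng = random.Random(42)
theta, words = generate_lda_document(topic_word, alpha=[0.5, 0.5], n_words=20, rng=rng)
print([round(t, 2) for t in theta], words)
```

Unlike a mixture of unigrams, where a whole document is drawn from a single topic, each word here gets its own topic drawn from the document's mixture. This is how a document can belong to several topics at once.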
What are some common challenges with topic modelling?
Topic modelling doesn't provide a method to select the optimum number of topics. LDA has many free parameters that can cause overfitting. LDA uses Bayesian priors without suitable justification. Statistical properties of the data (such as Zipf's Law) may also differ from the assumptions.
LSA is a linear model. It might not fit nonlinearities in the dataset. It assumes a Gaussian distribution of terms. SVD is also computationally expensive. LSA uses less efficient representations. Its results are not easily interpretable.
LDA can't directly represent co-occurrence information since words are sampled independently. A model based on Generalized Pólya urns or Bayesian regularization can solve this.
High-frequency non-specific words will result in topics that users may find too general and not useful. For example, documents on artificial intelligence might have the words 'algorithm', 'model' or 'estimation' occurring frequently. Low-frequency specific words are equally problematic.
We could end up with topics that all look nearly the same. This can be solved by conditioning the prior topic distribution on data.
What techniques have been used to improve the performance of topic modelling?
When preprocessing text, it's common to do lemmatization, remove punctuation and stop words. In addition, we can remove low frequency terms since they represent weak features. We could also make use of POS tags and remove terms that are not contextually important. For example, all prepositions could be removed.
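A minimal sketch of some of these preprocessing steps, using only the standard library, might look like this. The stop word list is a tiny illustrative one and the `min_doc_freq` threshold is an assumption. Lemmatization and POS-based filtering are left out since they need an NLP library.

```python
import string
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "on", "and", "is", "to"}


def preprocess(docs, min_doc_freq=2):
    """Lowercase, strip punctuation, drop stop words and low-frequency terms."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = []
    for doc in docs:
        tokens = doc.lower().translate(strip_punct).split()
        tokenized.append([t for t in tokens if t not in STOP_WORDS])

    # Count the number of documents in which each term appears.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

    # Keep only terms that occur in at least min_doc_freq documents.
    return [[t for t in tokens if doc_freq[t] >= min_doc_freq] for tokens in tokenized]


docs = [
    "The dog chewed the bone.",
    "A dog is in the park!",
    "The park, the bone and the cat.",
]
cleaned = preprocess(docs)
print(cleaned)
```

With `min_doc_freq=2`, the terms 'chewed' and 'cat' are dropped because each occurs in only one document of this toy corpus.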
Another technique is to do batch-wise LDA. Each batch will provide a different set of topics and an intersection of these might give the best topic terms.
Through user interactions, we could add constraints, or merge or separate topics. For example, for two highly correlated words we can add the constraint that they should appear in the same topic.
Some approaches to improve upon LDA include integrating topics with syntax; looking at correlation between topics, such as in the Pachinko Allocation Model (PAM) or the Correlated Topic Model (CTM); using metadata such as authors; and accounting for burstiness of words (a word once used in a document is more likely to appear again). Nonparametric models such as Pitman-Yor or negative binomial processes have tried to address Zipf's Law.
What are some useful resources for research into topic modelling?
Developers can use the MALLET Java package for topic modelling. A wrapper for this in R is available via the mallet package. Another R package is topicmodels. The latter package can do LDA (VEM or Gibbs estimation) or CTM (VEM estimation).
Python developers can use nltk for text preprocessing and gensim for topic modelling. Package gensim has functions to create a bag of words from a document, do TF-IDF weighting and apply LDA. If the intent is to do LSA, then sklearn package has functions for TF-IDF and SVD. MALLET package is also available in Python via gensim.
pyLDAvis is a Python library for topic model visualization. An analyst can use this to look at terms of a topic and decide the topic name. For automatic topic labelling, Wikipedia can be a useful data source.
Those who wish to use a cloud service can look at Amazon Comprehend. It uses LDA on a collection. It returns terms-topics and documents-topics associations. A job should include at least 1000 documents.
Milestones
1990
Deerwester et al. apply Singular Value Decomposition (SVD) to the problem of automatic indexing and information retrieval. They note that users want to retrieve documents not by words but rather by concept. By applying SVD on a document-term matrix they bring together terms and documents that are closely related in the "semantic space". Their idea of semantics is nothing more than a topic or concept. They call their method Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA). The name LSI reflects its origins in information retrieval.
1998
Papadimitriou et al. present the first mathematical analysis to rigorously explain why LSI works so well for information retrieval. Since LSI uses statistical properties of the corpus, they start with a probabilistic model of the corpus.
1999
Hofmann presents a statistical analysis of LSA, perhaps independently of the work of Papadimitriou et al. He coins the term Probabilistic Latent Semantic Analysis (pLSA). It's based on the aspect model, which is a latent variable model. It associates unobserved class variables (topics) with each observation (words). Unlike LSA, this is a proper generative model.
2003
Having first presented it at the NIPS 2001 conference, Blei et al. describe in detail a probabilistic generative model that they name Latent Dirichlet Allocation (LDA). They note that pLSA lacks a probabilistic model at the document level. LDA overcomes this.
2005
Blei and Lafferty present Correlated Topic Model (CTM) to overcome a limitation of LDA. LDA is unable to capture correlations among topics. For example, a document about genetics is more likely to be about disease than X-ray astronomy. CTM captures these correlations via the logistic normal distribution. One of their results shows that CTM can handle as many as 90 topics whereas LDA peaks at only 30 topics.
2007
Researchers at the University of Massachusetts apply topic modelling to social network analysis. They note that previous analysis looked at only links between network nodes. Their work also looks at topics on those links. They call their model Author-Recipient-Topic (ART). It's based on LDA.
2010
Newman et al. study a number of methods to automatically measure topic coherence. High coherence implies better interpretability. As an upper bound for comparison, they use inter-annotator agreement (IAA) as the gold standard. They find that Pointwise Mutual Information (PMI) performs best and comes close to IAA. PMI is calculated between word pairs using a 10-word sliding window. PMI measures how far the co-occurrence of two words in close proximity departs from statistical independence.
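To illustrate the idea, the sketch below estimates PMI for word pairs from sliding windows over a token sequence, using PMI(a, b) = log(p(a, b) / (p(a) p(b))) with probabilities taken as fractions of windows. It's a simplified illustration rather than the exact procedure of Newman et al.; the function name and the toy text are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def sliding_window_pmi(tokens, window=10):
    """PMI of word pairs, with probabilities estimated over sliding windows."""
    n_windows = max(len(tokens) - window + 1, 1)
    windows = [set(tokens[i:i + window]) for i in range(n_windows)]

    word_count = Counter()  # number of windows containing each word
    pair_count = Counter()  # number of windows containing both words of a pair
    for w in windows:
        word_count.update(w)
        pair_count.update(combinations(sorted(w), 2))

    # log( p(a, b) / (p(a) * p(b)) ) with p estimated as count / n_windows.
    return {
        (a, b): math.log(joint * n_windows / (word_count[a] * word_count[b]))
        for (a, b), joint in pair_count.items()
    }


tokens = "dog bone dog bone cat meow cat meow".split()
pmi = sliding_window_pmi(tokens, window=2)
for pair, score in sorted(pmi.items()):
    print(pair, round(score, 3))
```

Pairs are keyed in alphabetical order. Words that tend to share a window, such as 'bone' and 'dog' here, get a positive score, while pairs that rarely co-occur get a negative one. Pairs that never share a window don't appear in the result at all.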
2012
Arora et al. note the limitations of SVD-based approaches: they either assume only one topic per document or recover only the span of the topic vectors rather than the vectors themselves. Instead, they propose the use of Non-negative Matrix Factorization (NMF). They assume that every topic has an anchor word that separates it from other topics. This aspect of separability using NMF was studied by the machine learning community at least a decade earlier.
2012
Chuang et al. present Termite, a tool for analyzing the performance of topic modelling. It's a term vs topic visualization on a grid, with size of circles depicting importance. In a technique called seriation, terms are ordered to show how they cluster for a topic. The visualization also helps us see if topics use lots of words or only a handful of them. We can also see what words are shared across topics.
2012
Chaney and Blei present a method to visualize topic models. Given a topic (such as defined by three most prominent words), their system displays associated words, most relevant documents matching this topic and a list of related topics. This is more useful than showing just a word cloud. Word clouds typically show only the topics and they make visual search difficult.
2013
Xie and Xing propose a model that integrates document clustering and topic modelling into a single unified framework. Performing these two tasks separately fails to exploit the correlations between them. They note that a flat set of topics is not useful across domains. For example, topics applicable for computer science will be different from topics for economics. Their model also includes global topics that cut across domains. Their model has N documents, J groups, K group-specific topics per group, and R global topics.
2016
Word2vec came out in 2013. It's a word embedding that's constructed by predicting neighbouring words given a word. LDA on the other hand looks at words at the document level. Moody proposes lda2vec as an approach to capture both local and global information. This combines the power of word2vec and the interpretability of LDA. Word vectors are dense but document vectors are sparse.
2018
Gerlach et al. note that topic modelling and community detection have evolved independently. Because they are conceptually similar, they can be combined into a single unified model. Community detection is about identifying groups of nodes with similar connectivity patterns. When applied to topic modelling, we're inferring topics by way of inferring communities of words and documents. Stochastic Block Model (SBM) is popular for community detection. This is adapted as hierarchical SBM (hSBM) for topic modelling.
References
 Allahyari, Mehdi, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques." arXiv, v2, July 28. Accessed 2020-01-12.
 Arora, Sanjeev, Rong Ge, and Ankur Moitra. 2012. "Learning Topic Models - Going beyond SVD." arXiv, v2, April 10. Accessed 2020-01-10.
 AWS Docs. 2019. "Topic Modeling." Developer Guide, Amazon Comprehend, AWS Docs, February 2. Accessed 2020-01-13.
 Bansal, Shivam. 2016. "Beginners Guide to Topic Modeling in Python." Analytics Vidhya, August 24. Accessed 2020-01-10.
 Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-13.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
 Boyd-Graber, Jordan, Yuening Hu, and David Mimno. 2017. "Applications of Topic Models." Foundations and Trends in Information Retrieval, vol. 11, no. 2-3, pp. 143-296, July. Accessed 2020-01-10.
 Boyd-Graber, Jordan, David Mimno, and David Newman. 2019. "Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements." Chapter 12 in: Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds), Handbook of Mixed Membership Models and Their Applications, CRC Press. Accessed 2020-01-10.
 Chaney, Allison, and David M. Blei. 2012. "Visualizing topic models." Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 419-422, June 4-7. Accessed 2020-01-12.
 Chen, Tse-Hsun, Stephen W. Thomas, and Ahmed E. Hassan. 2015. "A Survey on the Use of Topic Models when Mining Software Repositories." Empirical Software Engineering, vol. 21, pp. 1843-1919, September 10. Issue date October 2016. Accessed 2020-01-10.
 Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. 2012. "Termite: Visualization Techniques for Assessing Textual Topic Models." Advanced Visual Interfaces (AVI), ACM, May 21-25, 2012. Accessed 2020-01-12.
 Contreras-Piña, Constanza, and Sebastián A. Ríos. 2016. "An empirical comparison of latent sematic models for applications in industry." Neurocomputing, Elsevier, vol. 179, pp. 176-185. Accessed 2020-01-13.
 Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
 Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. "A network approach to topic models." Science Advances, AAAS, vol. 4, no. 7, eaaq1360. Accessed 2020-01-10.
 Griffiths, Thomas L. and Mark Steyvers. 2004. "Finding scientific topics." Proceedings of the National Academy of Sciences, vol. 101, Suppl 1, pp. 5228-5235. Accessed 2020-01-13.
 Grün, Bettina, and Kurt Hornik. 2011. "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software, vol. 40, no. 13, May. Accessed 2020-01-10.
 Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57, August. https://doi.org/10.1145/312624.312649. Accessed 2020-01-14.
 Ivanov, George-Bogdan. 2018. "Complete Guide to Topic Modeling." NLP For Hackers, January 3. Accessed 2020-01-10.
 Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-10.
 Joshi, Prateek. 2018. "Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)." Analytics Vidhya, October 1. Accessed 2020-01-10.
 Krishan. 2016. "Topic Modeling and Document Clustering; What’s the Difference?" Integrated Knowledge Solutions, May 16. Accessed 2020-01-12.
 Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.
 Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-10.
 Liu, Lin, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. 2016. "An overview of topic modeling and its current applications in bioinformatics." SpringerPlus, vol. 5, article no. 1608, September 20. Accessed 2020-01-10.
 McCallum, Andrew, Xuerui Wang, and Andres Corrada-Emmanuel. 2007. "Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email." Journal of Artificial Intelligence Research, AI Access Foundation, vol. 30, pp. 249-272, October. Accessed 2020-01-10.
 Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2020-01-14.
 Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-10.
 Newman, David, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. "Automatic Evaluation of Topic Coherence." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 100-108, June. Accessed 2020-01-14.
 Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 1998. "Latent Semantic Indexing: A Probabilistic Analysis." Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 159-168, May. Accessed 2020-01-14.
 Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 2000. "Latent Semantic Indexing: A Probabilistic Analysis." Journal of Computer and System Sciences, Elsevier, vol. 61, no. 2, pp. 217-235, October. Accessed 2020-01-14.
 Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
 Prabhakaran, Selva. 2018. "Topic Modeling with Gensim (Python)." Machine Learning Plus, March 26. Accessed 2020-01-10.
 Raju, L Venkata Rama. 2019. "Topic Modeling – Latent Semantic Analysis (LSA) and Singular Value Decomposition (SVD)." Data Jang, June 18. Accessed 2020-01-10.
 Robinson, David and Julia Silge. 2017. "Topic Modeling." Chapter 6 in: Text Mining with R, O'Reilly Media, Inc. Accessed 2020-01-10.
 Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-10.
 Silge, Julia and David Robinson. 2019. "Text Mining with R." November 24. Accessed 2020-01-10.
 Vorontsov, Konstantin, and Anna Potapenko. 2014. "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization." In: Ignatov D., Khachay M., Panchenko A., Konstantinova N., Yavorsky R. (eds), Analysis of Images, Social Networks and Texts, AIST 2014, Communications in Computer and Information Science, vol. 436, Springer, Cham. Accessed 2020-01-10.
 Wikipedia. 2019. "Topic model." Wikipedia, December 13. Accessed 2020-01-10.
 Wu, Hu, Yongji Wang, and Xiang Cheng. 2008. "Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation." RecSys’08, ACM, pp. 99-106, October 23-25. Accessed 2020-01-10.
 Xie, Pengtao, and Eric P. Xing. 2013. "Integrating Document Clustering and Topic Modeling." arXiv, v1, September 26. Accessed 2020-01-12.
 Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
 Zhao, Alice. 2019. "Natural Language Processing (Part 5): Topic Modeling with Latent Dirichlet Allocation in Python." YouTube, January 5. Accessed 2020-01-10.
Further Reading
 Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
 Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
 Fatma, Fatma. 2019. "Industrial applications of topic model." Medium, April 5. Accessed 2020-01-10.
 Agrawal, Amritanshu, Wei Fu, and Tim Menzies. 2018. "What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)." arXiv, v4, February 20. Accessed 2020-01-10.
 Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.
See Also
 Singular Value Decomposition
 Latent Dirichlet Allocation
 Correlated Topic Model
 Text Clustering
 Expectation Maximization Algorithm
 Factor Analysis