Topic Modelling

One or more topics attached to emails. Source: Zhao 2019, 1:00.

The growth of the web since the early 1990s has resulted in an explosion of online data. In an effort to organize all this unstructured data, topic models were invented as a text mining tool. Topic modelling uncovers underlying themes or topics in documents.

Consider a document in which the words 'dog' and 'bone' occur often. We can say that this document belongs to the topic of Dogs. Another document with words 'cat' and 'meow' occurring frequently is of topic Cats. In another example, based on words in the content, emails can be topically labelled as Personal, Project or Financial. An email can also belong to multiple topics. Topic modelling can be seen as a form of tagging.

Topic modelling is an unsupervised task. Latent Dirichlet Allocation (LDA) and its many variants have been the most popular approach. LDA is a probabilistic generative model.


  • What are some typical applications of topic modelling?
    Manually defined scientific fields of study (y-axis) correlated with uncovered topics (x-axis). Source: Griffiths and Steyvers 2004, fig. 4.

    Topic models were first invented and applied in text mining and information retrieval. Since then, topic modelling has been used in various applications including classification, categorization, summarization, and segmentation of documents. More unusual applications include computer vision, population genetics and social networks.

    In information retrieval, topic modelling helps in query expansion. It also personalizes search results or makes recommendations by mapping user preferences to topics.

    When analyzing scientific literature, it's been noted that topics often correspond with scientific disciplines. We can also track how topics evolve over time. For example, the topic 'string' in Physics (for string theory) would be more common from the 1970s.

    In social sciences, topic modelling enables qualitative analysis. Sentiment analysis and social network analysis are two examples.

    In software engineering, topic modelling has been used to analyze source code, change logs, bug databases, and execution traces.

    In bioinformatics, compared to traditional data reduction techniques, topic modelling is seen to be more promising since it's more easily interpretable.

  • How is topic modelling different from text classification or clustering?
    Topic modelling vs document clustering. Source: Krishan 2016.

    Text classification is a supervised task that learns a classifier from training data. Topic modelling is an unsupervised task where topics are not learned in advance. Topics are induced from the actual data.

    Text clustering and topic modelling are similar in the sense that both are unsupervised tasks. Both attempt to organize documents for better information retrieval and browsing. However, there's a difference.

    Text clustering looks at the similarity among documents and attempts to group similar documents into clusters. The similarity measures could be based on TF-IDF weighting. In topic modelling, we don't look at document similarity. Instead, we treat a document as a mixture of topics, where a topic is a probability distribution over words. Soft clustering (where a document can belong to multiple clusters) can be viewed as similar to topic modelling, though the approaches still differ.

    Thus, the clusters from text clustering are not quite the same as topics in topic modelling. They're however seen as complementary. Research from the mid-2000s explores combining both techniques into a single model.
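    As an illustration of the document-similarity view taken by clustering, here's a minimal sketch (plain Python, invented toy documents) that compares term-count vectors with cosine similarity:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-count vectors of two documents."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)   # Counter returns 0 for absent terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

sim = cosine_similarity("dog bone dog play", "dog bone fetch play")
```

    A clustering algorithm would group documents whose pairwise similarity is high; topic modelling instead decomposes each document into a mixture of topics.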

  • What's the typical pipeline for topic modelling?
    Topic Modelling identifies topics and their distributions. Source: Joshi 2018.

    Topic models perform a statistical analysis of words present in each document from a collection of documents. The model is expected to output three things: (a) clusters of co-occurring words each of which represents a topic; (b) the distribution of topics for each document; (c) a histogram of words for each topic.

    To build a model, we must balance different aspects: fidelity (how well the model reflects the real world), performance, tractability (discrete models are preferred), and interpretability.

    In the bag-of-words model the ordering of words in each document is ignored. Such a model is simple but it ignores phrase-level co-occurrences. An alternative is the unigram model in which words are randomly drawn from a categorical distribution. A mixture of such unigram models is also possible. For example, each topic has a distribution of words. We randomly draw words conditioned on the topic.

    Essentially, words are being generated by latent variables of the model. Thus, a topic modelling algorithm such as LDA is a generative model.
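    The mixture-of-unigrams generative story can be sketched as follows. The topic names and word distributions here are invented toy values; a real model would learn them from data:

```python
import random

# Hypothetical topics: each is a categorical distribution over words.
topics = {
    "Dogs": (["dog", "bone", "bark"], [0.5, 0.3, 0.2]),
    "Cats": (["cat", "meow", "purr"], [0.5, 0.3, 0.2]),
}
topic_prior = (["Dogs", "Cats"], [0.6, 0.4])

def generate_document(n_words, rng=random):
    """Mixture of unigrams: draw one topic, then draw all words from it."""
    topic = rng.choices(*topic_prior)[0]
    words, probs = topics[topic]
    return topic, rng.choices(words, probs, k=n_words)

topic, doc = generate_document(5)
```

    Note that every word in a document comes from the same topic here; LDA relaxes exactly this restriction by drawing a topic per word.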

  • Could you explain how documents, words and topics are related?
    Document-Term matrix is decomposed into two other matrices. Source: Pascual 2019.

    The basic approach towards topic modelling is to prepare a document-term matrix. For each document, we count how many times a particular term occurs. In practice, not all terms are equally important. For this reason, TF-IDF weighting is used instead of raw counts. TF-IDF effectively gives more weight to terms that are frequent within a document but rare in the rest of the corpus.

    The next step is to decompose this matrix into document-topic and term-topic matrices. We don't in fact identify the names of these topics. This is something the analyst can do by looking at the main terms of the topic. In the figure, we can see that T1 is probably about sports because of the terms Lebron, Celtics and sprain.

    Since the number of topics is far smaller than the vocabulary, we can view topic modelling as a dimensionality reduction technique. To determine how many topics we should look for, the Kullback-Leibler divergence score is a useful measure.
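A minimal sketch of building a TF-IDF-weighted document-term matrix (plain Python with a toy corpus; real pipelines would use a library such as sklearn or gensim):

```python
import math
from collections import Counter

docs = [["dog", "bone", "dog"], ["cat", "meow"], ["dog", "cat"]]

def tfidf_matrix(docs):
    """Document-term matrix with TF-IDF weights (tf * log(N/df))."""
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(w in d for d in docs) for w in vocab}  # document frequency
    n = len(docs)
    matrix = []
    for d in docs:
        tf = Counter(d)  # raw term counts for this document
        matrix.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, matrix

vocab, dtm = tfidf_matrix(docs)
```

A term appearing in every document gets weight zero (log of 1), which is the "not all terms are equally important" intuition made concrete. This matrix is what gets decomposed into document-topic and term-topic matrices.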

  • Could you describe the main algorithms for topic modelling?
    Comparing common topic modelling algorithms. Source: Lee et al. 2010, table 2.

    We mention three main algorithms:

    • Latent Semantic Analysis (LSA): Also called LSI, this algorithm constructs a semantic space in which related words and documents are placed near one another. It uses SVD as the technique.
    • Probabilistic LSA (pLSA): Also called the aspect model, this is a probabilistic generative model. It doesn't use SVD. It looks at the probability of a topic given a document and the probability of a word given a topic. These are multinomial distributions that can be trained with the EM algorithm.
    • Latent Dirichlet Allocation (LDA): This is a Bayesian approach. A document is modelled as a finite mixture of topics. Each topic is, in turn, modelled as an infinite mixture of topic probabilities. Topic probabilities make up a document's representation. The topic mixture is drawn from a Dirichlet distribution.
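The LDA generative story can be sketched in a few lines of plain Python. The topic-word probabilities below are toy values, and the Dirichlet draw is built from normalized gamma samples:

```python
import random

def dirichlet(alpha, rng=random):
    """Sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Toy topic-word distributions: two topics over a four-word vocabulary.
vocab = ["dog", "bone", "cat", "meow"]
topic_word = [[0.45, 0.45, 0.05, 0.05],   # topic 0: Dogs
              [0.05, 0.05, 0.45, 0.45]]   # topic 1: Cats

def generate_document(n_words, alpha=(0.5, 0.5), rng=random):
    """LDA generative process: draw a per-document topic mixture theta,
    then for each word draw a topic assignment z, then the word itself."""
    theta = dirichlet(alpha, rng)
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), theta)[0]
        doc.append(rng.choices(vocab, topic_word[z])[0])
    return theta, doc

theta, doc = generate_document(6)
```

Inference in LDA runs this story backwards: given only the documents, estimate theta and the topic-word distributions.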
  • What are some common challenges with topic modelling?

    Topic modelling doesn't provide a method to select the optimum number of topics. LDA has many free parameters that can cause overfitting. LDA uses Bayesian priors without suitable justification. Statistical properties of the data (such as Zipf's Law) may also differ from the assumptions.

    LSA is a linear model. It might not fit non-linearities in the dataset. It assumes Gaussian distribution of terms. SVD is also computationally expensive. LSA uses less efficient representations. Its results are not easily interpretable.

    LDA can't directly represent co-occurrence information since words are sampled independently. A model based on Generalized Pólya urns or Bayesian regularization can solve this.

    High-frequency non-specific words will result in topics that users may find too general and not useful. For example, documents on artificial intelligence might have the words 'algorithm', 'model' or 'estimation' occurring frequently. Low-frequency specific words are equally problematic.

    We could end up with topics that all look nearly the same. This can be solved by conditioning the prior topic distribution with data.

  • What techniques have been used to improve the performance of topic modelling?
    Topics can be inferred from the keywords. Source: Prabhakaran 2018.

    When pre-processing text, it's common to do lemmatization, remove punctuation and stop words. In addition, we can remove low frequency terms since they represent weak features. We could also make use of POS tags and remove terms that are not contextually important. For example, all prepositions could be removed.
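    A minimal pre-processing sketch using only the standard library (lemmatization and POS filtering would need a library such as nltk or spaCy; the stop-word list here is a small sample):

```python
import string
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def preprocess(docs, min_freq=2):
    """Lowercase, strip punctuation, drop stop words and rare terms."""
    tokenized = []
    for doc in docs:
        tokens = doc.lower().translate(
            str.maketrans("", "", string.punctuation)).split()
        tokenized.append([t for t in tokens if t not in STOP_WORDS])
    # Drop low-frequency terms: they represent weak features.
    freq = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if freq[t] >= min_freq] for doc in tokenized]

clean = preprocess(["The dog chased a bone.", "A dog buried the bone!"])
```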

    Another technique is to do batch-wise LDA. Each batch will provide a different set of topics and an intersection of these might give the best topic terms.

    Through user interactions, we could add constraints, or merge or separate topics. For example, for two highly correlated words we can add the constraint that they should appear in the same topic.

    Some approaches to improve upon LDA include integrating topics with syntax; modelling correlations between topics, as in the Pachinko Allocation Model (PAM) or the Correlated Topic Model (CTM); using metadata such as authors; and accounting for burstiness of words (a word once used in a document is more likely to appear again). Non-parametric models such as Pitman-Yor or negative binomial processes have tried to address Zipf's Law.

  • What are some useful resources for research into topic modelling?
    R topicmodels package used within a text analysis workflow. Source: Robinson and Silge 2017, fig. 6-1.

    Developers can use the MALLET Java package for topic modelling. A wrapper for this in R is available via the mallet package. Another R package is topicmodels. The latter package can do LDA (VEM or Gibbs estimation) or CTM (VEM estimation).

    Python developers can use nltk for text pre-processing and gensim for topic modelling. Package gensim has functions to create a bag of words from a document, do TF-IDF weighting and apply LDA. If the intent is to do LSA, then sklearn package has functions for TF-IDF and SVD. MALLET package is also available in Python via gensim.

    pyLDAvis is a Python library for topic model visualization. An analyst can use this to look at terms of a topic and decide the topic name. For automatic topic labelling, Wikipedia can be a useful data source.

    Those who wish to use a cloud service can look at Amazon Comprehend. It runs LDA on a document collection. It returns term-topic and document-topic associations. A job should include at least 1000 documents.


SVD is used in LSA for topic modelling. Source: Raju 2019.

Deerwester et al. apply Singular Value Decomposition (SVD) to the problem of automatic indexing and information retrieval. They note that users want to retrieve documents not by words but rather by concept. By applying SVD on a document-term matrix they bring together terms and documents that are closely related in the "semantic space". Their idea of semantics is nothing more than a topic or concept. They call their method Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA). The name LSI reflects its origins in information retrieval.


Papadimitriou et al. present the first mathematical analysis to rigorously explain why LSI works so well for information retrieval. Since LSI uses statistical properties of the corpus, they start with a probabilistic model of the corpus.

Graphical representation of pLSA in two equivalent forms: asymmetric and symmetric parameterization. Source: Hofmann 1999, fig. 1.

Hofmann presents a statistical analysis of LSA, perhaps independently of the work of Papadimitriou et al. He coins the term Probabilistic Latent Semantic Analysis (pLSA). It's based on the aspect model, which is a latent variable model. It associates an unobserved class variable (a topic) with each observation (a word). Unlike LSA, this is a proper generative model.


Blei et al. describe in detail a probabilistic generative model that they name Latent Dirichlet Allocation (LDA), first presented at the NIPS 2001 conference. They note that pLSA lacks a probabilistic model at the document level. LDA overcomes this.

CTM model (top) and example correlations (bottom) of diagonal covariance, negative correlation and positive correlation. Source: Blei and Lafferty 2005, fig. 1.

Blei and Lafferty present Correlated Topic Model (CTM) to overcome a limitation of LDA. LDA is unable to capture correlations among topics. For example, a document about genetics is more likely to be about disease than x-ray astronomy. CTM captures these correlations via the logistic normal distribution. One of their results shows that CTM can handle as many as 90 topics whereas LDA peaks at only 30 topics.

ART model and three related models for social network analysis. Source: McCallum et al. 2007, fig. 1.

Researchers at the University of Massachusetts apply topic modelling to social network analysis. They note that previous analysis looked at only links between network nodes. Their work also looks at topics on those links. They call their model Author-Recipient-Topic (ART). It's based on LDA.


Newman et al. study a number of methods to automatically measure topic coherence. High coherence implies better interpretability. As an upper bound for comparison, they use inter-annotator agreement (IAA) as the gold standard. They find that Pointwise Mutual Information (PMI) performs best and comes close to IAA. PMI is calculated between word pairs within a 10-word sliding window over the document. PMI measures the statistical dependence of two words occurring in close proximity.
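The PMI measure can be sketched as follows (plain Python with invented toy documents; the window size is a parameter):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs, window=10):
    """PMI over word pairs co-occurring within a sliding window:
    PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )."""
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for doc in docs:
        for i in range(max(1, len(doc) - window + 1)):
            win = set(doc[i:i + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(frozenset(p)
                               for p in combinations(sorted(win), 2))
    def pmi(w1, w2):
        joint = pair_counts[frozenset((w1, w2))] / n_windows
        if joint == 0:
            return float("-inf")  # never co-occur within a window
        p1 = word_counts[w1] / n_windows
        p2 = word_counts[w2] / n_windows
        return math.log(joint / (p1 * p2))
    return pmi

pmi = pmi_scores([["dog", "bone", "dog", "park"],
                  ["cat", "meow", "cat"]], window=3)
```

A coherent topic's top words should yield high pairwise PMI, since they tend to appear near one another in real text.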


Arora et al. note the limitations of SVD: it either allows only one topic per document or recovers only the span of the topic vectors rather than the vectors themselves. Instead, they propose the use of Non-negative Matrix Factorization (NMF). They assume that every topic has an anchor word that separates it from other topics. This aspect of separability using NMF was studied by the machine learning community at least a decade earlier.

Illustrating selection of Topic 17 in Termite tool. Source: Chuang et al. 2012, fig. 3.

Chuang et al. present Termite, a tool for analyzing the performance of topic modelling. It's a term vs topic visualization on a grid, with size of circles depicting importance. In a technique called seriation, terms are ordered to show how they cluster for a topic. The visualization also helps us see if topics use lots of words or only a handful of them. We can also see what words are shared across topics.

Navigating Wikipedia via topic models. Source: Chaney and Blei 2012, fig. 1.

Chaney and Blei present a method to visualize topic models. Given a topic (such as one defined by its three most prominent words), their system displays associated words, the most relevant documents matching this topic, and a list of related topics. This is more useful than showing just a word cloud. Word clouds typically show only the topics and make visual search difficult.

Multi-Grain Clustering Topic Model (MGCTM). Source: Xie and Xing 2013, fig. 1.

Xie and Xing propose a model that integrates document clustering and topic modelling into a single unified framework. Performing these two tasks separately fails to exploit the correlations between them. They note that a flat set of topics is not useful across domains. For example, topics applicable for computer science will be different from topics for economics. Their model also includes global topics that cut across domains. Their model has N documents, J groups, K group-specific topics per group, and R global topics.


Word2vec came out in 2013. It's a word embedding that's constructed by predicting neighbouring words given a word. LDA on the other hand looks at words at the document level. Moody proposes lda2vec as an approach to capture both local and global information. This combines the power of word2vec and the interpretability of LDA. Word vectors are dense but document vectors are sparse.

Topic modelling and community detection are mathematically similar. Source: Gerlach et al. 2018, fig. 2.

Gerlach et al. note that topic modelling and community detection have evolved independently. Because they are conceptually similar, they can be combined into a single unified model. Community detection is about identifying groups of nodes with similar connectivity patterns. When applied to topic modelling, we're inferring topics by way of inferring communities of words and documents. Stochastic Block Model (SBM) is popular for community detection. This is adapted as hierarchical SBM (hSBM) for topic modelling.


  1. Allahyari, Mehdi, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques." arXiv, v2, July 28. Accessed 2020-01-12.
  2. Arora, Sanjeev, Rong Ge, and Ankur Moitra. 2012. "Learning Topic Models - Going beyond SVD." arXiv, v2, April 10. Accessed 2020-01-10.
  3. AWS Docs. 2019. "Topic Modeling." Developer Guide, Amazon Comprehend, AWS Docs, February 2. Accessed 2020-01-13.
  4. Bansal, Shivam. 2016. "Beginners Guide to Topic Modeling in Python." Analytics Vidhya, August 24. Accessed 2020-01-10.
  5. Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-13.
  6. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
  7. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
  8. Boyd-Graber, Jordan, Yuening Hu, and David Mimno. 2017. "Applications of Topic Models." Foundations and Trends in Information Retrieval, vol. 11, no. 2-3, pp. 143-296, July. Accessed 2020-01-10.
  9. Boyd-Graber, Jordan, David Mimno, and David Newman. 2019. "Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements." Chapter 12 in: Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds), Handbook of Mixed Membership Models and Their Applications, CRC Press. Accessed 2020-01-10.
  10. Chaney, Allison, and David M. Blei. 2012. "Visualizing topic models." Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 419-422, June 4-7. Accessed 2020-01-12.
  11. Chen, Tse-Hsun, Stephen W. Thomas, and Ahmed E. Hassan. 2015. "A Survey on the Use of Topic Models when Mining Software Repositories." Empirical Software Engineering, vol. 21, pp. 1843–1919, September 10. Issue date October 2016. Accessed 2020-01-10.
  12. Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. 2012. "Termite: Visualization Techniques for Assessing Textual Topic Models." Advanced Visual Interfaces (AVI), ACM, May 21-25, 2012. Accessed 2020-01-12.
  13. Contreras-Piña, Constanza, and Sebastián A. Ríos. 2016. "An empirical comparison of latent semantic models for applications in industry." Neurocomputing, Elsevier, vol. 179, pp. 176-185. Accessed 2020-01-13.
  14. Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
  15. Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. "A network approach to topic models." Science Advances, AAAS, vol. 4, no. 7, eaaq1360. Accessed 2020-01-10.
  16. Griffiths, Thomas L. and Mark Steyvers. 2004. "Finding scientific topics." Proceedings of the National Academy of Sciences, vol. 101, Suppl 1, pp. 5228–5235. Accessed 2020-01-13.
  17. Grün, Bettina, and Kurt Hornik. 2011. "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software, vol. 40, no. 13, May. Accessed 2020-01-10.
  18. Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57, August. Accessed 2020-01-14.
  19. Ivanov, George-Bogdan. 2018. "Complete Guide to Topic Modeling." NLP For Hackers, January 3. Accessed 2020-01-10.
  20. Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-10.
  21. Joshi, Prateek. 2018. "Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)." Analytics Vidhya, October 1. Accessed 2020-01-10.
  22. Krishan. 2016. "Topic Modeling and Document Clustering; What’s the Difference?" Integrated Knowledge Solutions, May 16. Accessed 2020-01-12.
  23. Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.
  24. Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-10.
  25. Liu, Lin, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. 2016. "An overview of topic modeling and its current applications in bioinformatics." SpringerPlus, vol. 5, article no. 1608, September 20. Accessed 2020-01-10.
  26. McCallum, Andrew, Xuerui Wang, and Andres Corrada-Emmanuel. 2007. "Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email." Journal of Artificial Intelligence Research, AI Access Foundation, vol. 30, pp. 249-272, October. Accessed 2020-01-10.
  27. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2020-01-14.
  28. Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-10.
  29. Newman, David, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. "Automatic Evaluation of Topic Coherence." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 100–108, June. Accessed 2020-01-14.
  30. Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 1998. "Latent Semantic Indexing: A Probabilistic Analysis." Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 159–168, May. Accessed 2020-01-14.
  31. Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. 2000. "Latent Semantic Indexing: A Probabilistic Analysis." Journal of Computer and System Sciences, Elsevier, vol. 61, no. 2, pp. 217-235, October. Accessed 2020-01-14.
  32. Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
  33. Prabhakaran, Selva. 2018. "Topic Modeling with Gensim (Python)." Machine Learning Plus, March 26. Accessed 2020-01-10.
  34. Raju, L Venkata Rama. 2019. "Topic Modeling – Latent Semantic Analysis (LSA) and Singular Value Decomposition (SVD)." Data Jang, June 18. Accessed 2020-01-10.
  35. Robinson, David and Julia Silge. 2017. "Topic Modeling." Chapter 6 in: Text Mining with R, O'Reilly Media, Inc. Accessed 2020-01-10.
  36. Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-10.
  37. Silge, Julia and David Robinson. 2019. "Text Mining with R." November 24. Accessed 2020-01-10.
  38. Vorontsov, Konstantin, and Anna Potapenko. 2014. "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization." In: Ignatov D., Khachay M., Panchenko A., Konstantinova N., Yavorsky R. (eds), Analysis of Images, Social Networks and Texts, AIST 2014, Communications in Computer and Information Science, vol. 436, Springer, Cham. Accessed 2020-01-10.
  39. Wikipedia. 2019. "Topic model." Wikipedia, December 13. Accessed 2020-01-10.
  40. Wu, Hu, Yongji Wang, and Xiang Cheng. 2008. "Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation." RecSys’08, ACM, pp. 99-106, October 23–25. Accessed 2020-01-10.
  41. Xie, Pengtao, and Eric P. Xing. 2013. "Integrating Document Clustering and Topic Modeling." arXiv, v1, September 26. Accessed 2020-01-12.
  42. Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
  43. Zhao, Alice. 2019. "Natural Language Processing (Part 5): Topic Modeling with Latent Dirichlet Allocation in Python." YouTube, January 5. Accessed 2020-01-10.

Further Reading

  1. Pascual, Federico. 2019. "Introduction to Topic Modeling." Blog, MonkeyLearn, September 26. Accessed 2020-01-10.
  2. Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-10.
  3. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022. Accessed 2020-01-13.
  4. Fatma, Fatma. 2019. "Industrial applications of topic model." Medium, April 5. Accessed 2020-01-10.
  5. Agrawal, Amritanshu, Wei Fu, and Tim Menzies. 2018. "What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)." arXiv, v4, February 20. Accessed 2020-01-10.
  6. Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-10.

Cite As

Devopedia. 2020. "Topic Modelling." Version 7, January 14. Accessed 2020-11-24.
Contributed by
2 authors

Last updated on
2020-01-14 09:07:34