Latent Dirichlet Allocation
- Summary
- Discussion
- What are the shortcomings of earlier topic models that LDA aims to solve?
- Could you describe some example applications where LDA has been applied?
- Why is LDA called a generative model?
- What's the significance of Dirichlet priors in LDA?
- What are the methods of inference and parameter estimation under LDA?
- What's the typical pipeline for doing LDA?
- Could you describe some variants of the basic LDA?
- What are some resources for working with LDA?
- Milestones
- Sample Code
- References
- Further Reading
- Article Stats
- Cite As
Given a document, topic modelling is a task that aims to uncover the most suitable topics or themes that the document is about. It does this by looking at words that most often occur together. For example, a document with high co-occurrence of the words 'cats' and 'dogs' is probably about the topic 'Animals', whereas a document with 'horses' and 'equestrian' is partly about 'Animals' but more about 'Sports'. Latent Dirichlet Allocation (LDA) is a popular technique for topic modelling.
LDA is based on probability distributions. For each document, it considers a distribution of topics. For each topic, it considers a distribution of words. This information helps LDA discover the topics in a document.
LDA and its many variants support diverse applications. LDA is well-supported in a few programming languages and software packages.
Discussion
What are the shortcomings of earlier topic models that LDA aims to solve?
Before LDA, there were the LSA and pLSA models. LSA was simply a dimensionality reduction technique and lacked a strong probabilistic approach. pLSA remedied this by being a probabilistic generative model. It picked a topic with probability P(z). Then it selected the document and the word with probabilities P(d|z) and P(w|z) respectively.
While LSA could model synonymy well, it failed with polysemy. In other words, a word with multiple meanings needed to appear in multiple topics but didn't. pLSA partially handled polysemy. However, pLSA ignored P(d). Each document was a mixture of topics but there was no model to generate this mixture. This made the number of pLSA parameters grow linearly with corpus size, leading to overfitting. Also, pLSA was unable to assign topic probabilities to new documents.
LDA is inspired by pLSA. Like pLSA, it's also a probabilistic generative model. Unlike pLSA, LDA also considers the generation of documents. This is where the Dirichlet distribution becomes useful. It determines the topic distribution for each document.
Could you describe some example applications where LDA has been applied?
LDA is an algorithm or method for topic modelling, which has been applied in information retrieval, text mining, social media analysis, and more. In general, topic modelling uncovers hidden structures or topics in documents.
LDA has been applied in diverse tasks: automatic essay grading, anti-phishing, automatic labelling, emotion detection, expert identification, role discovery, sentiment summarization, word sense disambiguation, and more.
For analysing political texts, an LDA-based model was used to find opinions from different viewpoints. In software engineering, LDA was used to find similar code in software repositories and suggest code refactoring. Another study made use of geographic data and GPS-based documents to discover topics.
LDA has been used on online or social media data. By applying it on public tweets or chat data, we can detect and track how topics change over time. We can identify users who follow a similar distribution of topics. On Yelp restaurant reviews, LDA was used to do aspect-based opinion mining. LDA was used on a school blog to uncover main topics and who's talking about them.
Why is LDA called a generative model?
In a generative model, observations are generated by latent variables. Given the words of a document, LDA figures out the latent topics. But as a generative model, we can think of LDA as generating the topics and then the words of that document.
In LDA's graphical model, \(\alpha\) fixes a particular distribution of topics \(\theta\). There's one such distribution for each document. For a document, when we pick a topic from this distribution, we're faced with the word distribution \(\beta_i\) for that topic. These word distributions are determined by \(\eta\). From our topic's word distribution, we pick a word. We do this as many times as the document's word count. Thus, the model is generative.
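To make the generative story concrete, here's a minimal sketch in Python with numpy. The number of topics, vocabulary size and prior values are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 5, 1000            # assumed number of topics and vocabulary size
alpha = np.full(k, 0.1)   # Dirichlet prior over topics per document
eta = np.full(V, 0.01)    # Dirichlet prior over words per topic

# Word distribution beta_i for each topic, drawn once for the corpus
beta = rng.dirichlet(eta, size=k)          # shape (k, V)

def generate_document(num_words):
    """Generate one document as a list of word ids."""
    theta = rng.dirichlet(alpha)           # topic distribution for this document
    words = []
    for _ in range(num_words):
        z = rng.choice(k, p=theta)         # pick a topic from theta
        w = rng.choice(V, p=beta[z])       # pick a word from that topic
        words.append(w)
    return words

doc = generate_document(50)
```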
What's the significance of Dirichlet priors in LDA?
Topic distribution \(\theta\) and word distribution \(\beta\) are created from \(\alpha\) and \(\eta\) respectively. The latter are called Dirichlet priors. A low \(\alpha\) implies a document has only a few dominant topics. A large \(\alpha\) implies many dominant topics. Similarly, a low (or high) \(\eta\) means a topic has a few (or many) dominant words.
Given the Dirichlet distribution \(Dir(\alpha)\), we sample a topic distribution for a specific document. Likewise, from \(Dir(\eta)\) we sample a word distribution for a specific topic. In other words, the Dirichlet distribution generates another distribution. For this reason, it's called a distribution over distributions.
Suppose we have \(k\) topics and a vocabulary of size \(V\). \(\alpha\) will be a vector of length \(k\). \(\eta\) will be a vector of length \(V\). If all elements of the vector have the same value, we call this a symmetric Dirichlet distribution. It simply means we have no prior knowledge and assume all topics or words are equally likely. In this case, the prior may be expressed as a scalar, called the concentration parameter.
In an alternative notation, the \(\eta\) symbol is not used: \(Dir(\beta)\) generates the word distribution \(\phi\).
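To see the effect of the concentration parameter, here's a small sketch (with arbitrary values) that draws per-document topic distributions from symmetric Dirichlet priors with low and high \(\alpha\).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5  # assumed number of topics

# Symmetric priors: the same concentration value repeated k times
print(rng.dirichlet(np.full(k, 0.1)).round(2))   # low alpha: a few dominant topics
print(rng.dirichlet(np.full(k, 10.0)).round(2))  # high alpha: close to uniform
```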
What are the methods of inference and parameter estimation under LDA?
In LDA, words are observed, topic and word distributions are hidden, and \(\alpha\) and \(\eta\) are the hyperparameters. Thus, we need to infer the distributions and estimate the hyperparameters. In general, this problem is intractable because the two distributions are coupled in the latent topics.
We note three common techniques for inference and estimation:
- Gibbs Sampling: A method for sampling from a joint distribution when only the conditional distributions of topics and words can be efficiently computed. A bare-bones sketch is shown after this list.
- Expectation-Maximization (EM): Useful for parameter estimation via maximum likelihood.
- Variational Inference: Coupling between the distributions is removed to yield a simplified graphical model with free variational parameters. Now we have an optimization problem to find the best variational parameters. The Kullback-Leibler (KL) divergence between the variational distribution and the true posterior serves as the objective to minimize.
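As referenced above, the following is a bare-bones sketch of collapsed Gibbs sampling for LDA in Python. It's meant only to show the counting and resampling steps; the symmetric scalar priors, the function interface and the toy corpus are assumptions for illustration, not a production implementation.

```python
import numpy as np

def gibbs_lda(docs, k, vocab_size, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    alpha, eta: symmetric scalar Dirichlet priors.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))    # topic counts per document
    n_kw = np.zeros((k, vocab_size))   # word counts per topic
    n_k = np.zeros(k)                  # total word count per topic
    z = []                             # current topic assignment of every token

    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        z_d = rng.integers(k, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            n_dk[d, t] += 1
            n_kw[t, w] += 1
            n_k[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # Conditional probability of each topic given all other assignments
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + vocab_size * eta)
                t = rng.choice(k, p=p / p.sum())
                # Record the new assignment and restore the counts
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

    # Normalized counts (plus priors) estimate the two distributions
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    beta = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, beta

# Toy usage: 2 topics, vocabulary of 6 word ids
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 5, 0, 2]]
theta, beta = gibbs_lda(docs, k=2, vocab_size=6)
```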
What's the typical pipeline for doing LDA?
The actual working of LDA is iterative. It starts by randomly assigning a topic to each word in each document. Then the topic and word distributions are calculated. These distributions are used in the next iteration to reassign topics. This is repeated until the algorithm converges. Once the distributions are worked out during training, the dominant topics of a test document can be identified by its location in the topic space.
LDA prefers that a document has only a few dominant topics and that a topic has only a few highly likely words. These two goals are at odds with each other. By trading off these two goals, LDA uncovers tightly co-occurring words.
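A minimal sketch of such a pipeline using gensim is shown below. The toy documents, tokenization, number of topics and number of passes are illustrative assumptions.

```python
from gensim import corpora, models

docs = [
    "cats dogs pets animal shelter adoption",
    "horses equestrian racing sports",
    "football sports league season",
]
texts = [doc.lower().split() for doc in docs]   # toy tokenization

dictionary = corpora.Dictionary(texts)                  # map words to integer ids
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=2, passes=20, random_state=0)

print(lda.print_topics(num_words=4))            # top words per topic

# Dominant topics of an unseen document
new_bow = dictionary.doc2bow("dogs and cats at the shelter".split())
print(lda.get_document_topics(new_bow))
```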
Could you describe some variants of the basic LDA?
While LDA looks at co-occurrences of words, some LDA variants include metadata such as research paper authors or citations. Another approach is to look at word sequences with Markov dependencies. For social network analysis, the Author-Recipient-Topic model conditions the distribution of topics on the sender and one recipient.
For applications such as automatic image annotation or text-based image retrieval, Correspondence LDA models the joint distribution of images and text, plus the conditional distribution of the annotation given the image.
Correlated Topic Model captures correlations among topics.
Word co-occurrence patterns are rarely static. For example, "dynamic systems" has more recently co-occurred with "graphical models" than with "neural networks". Topics over Time models time jointly with word co-occurrences. It uses a continuous distribution over time.
LDA doesn't differentiate between topic words and opinion words. Opinions can also come from different perspectives. Cross-Perspective Topic model extends LDA by separating opinion generation from topic generation. Nouns form topics. Adjectives, verbs and adverbs form opinions.
What are some resources for working with LDA?
In Python, nltk is useful for general text processing while gensim enables LDA. In R, quanteda is for quantitative text analysis while topicmodels is more specifically for topic modelling. In Java, there's Mallet, TMT and Mr.LDA.
Gensim has a useful feature to automatically calculate the optimal asymmetric prior for \(\alpha\) by accounting for how often words co-occur.
LDA is built into Spark MLlib. This can be used via Scala, Java, Python or R. For example, in Python, LDA is available in the pyspark.ml.clustering module.
There are plenty of datasets for research into topic modelling. Those labelled with categories or topics may be more useful. Some examples are Reuters-21578, Wiki10+, the DBLP Dataset, NIPS Conference Papers 1987-2015, and 20 Newsgroups.
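Here's a minimal sketch of LDA on Spark MLlib via PySpark. The toy data, number of topics and iteration count are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-example").getOrCreate()

df = spark.createDataFrame([(0, "cats dogs pets animals shelter"),
                            (1, "horses equestrian sports racing")], ["id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
cv_model = CountVectorizer(inputCol="tokens", outputCol="features",
                           vocabSize=1000).fit(tokens)
vectors = cv_model.transform(tokens)

model = LDA(k=2, maxIter=10).fit(vectors)       # uses the "features" column

model.describeTopics(3).show()                  # top term indices per topic
model.transform(vectors).select("id", "topicDistribution").show(truncate=False)

spark.stop()
```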
Milestones
1990
Deerwester et al. apply Singular Value Decomposition (SVD) to the problem of automatic indexing and information retrieval. SVD brings together terms and documents that are closely related in the "semantic space". Their idea of semantics is nothing more than a topic or concept. They call their method Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA).
1999
Hofmann presents a statistical analysis of LSA. He coins the term Probabilistic Latent Semantic Analysis (pLSA). It's based on the aspect model, a latent variable model that associates an unobserved class variable (topic) with each observation (word). Unlike LSA, this is a proper generative model.
2003
Blei et al. describe in detail a probabilistic generative model that they name Latent Dirichlet Allocation (LDA), first presented at the NIPS 2001 conference. They note that pLSA lacks a probabilistic model at the document level. LDA overcomes this. Their work uses the bag-of-words model but they note that LDA can be applied to larger units such as n-grams or paragraphs.
2009
Typically, symmetric Dirichlet priors are used in LDA. Wallach et al. study the effect of structured priors for topic modelling. They find that asymmetric Dirichlet priors over document-topic distributions are much better than symmetric priors. Using asymmetric priors over topic-word distributions has little benefit. The resulting model is less sensitive to the number of topics. With hyperparameter optimization, the computation can be made practical. Related research with similar results is reported in 2018 by Syed and Spruit.
2016
Word2vec came out in 2013. It's a word embedding that's constructed by predicting neighbouring words given a word. LDA, on the other hand, looks at words at the document level. Moody proposes lda2vec as an approach to capture both local and global information. This combines the power of word2vec and the interpretability of LDA. Word vectors are dense but document vectors are sparse.
Sample Code
References
- Apache Spark Docs. 2019. "Clustering." Apache Spark 2.4.4, April 13. Accessed 2020-01-18.
- Blei, David M. 2013. "Probabilistic Topic Models: Origins and Challenges." Department of Computer Science, Princeton University, December 9. Accessed 2020-01-14.
- Blei, David M. and Michael I. Jordan. 2003. "Modeling Annotated Data." SIGIR'03, July 28–August 1. Accessed 2020-01-18.
- Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-18.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
- Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
- Fang, Yi, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. "Mining Contrastive Opinions on Political Texts using Cross-Perspective Topic Model." Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM'12, ACM, February 8–12. Accessed 2020-01-18.
- Ganegedara, Thushan. 2018. "Intuitive Guide to Latent Dirichlet Allocation." Towards Data Science, on Medium, August 23. Accessed 2020-01-18.
- Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57, August. https://doi.org/10.1145/312624.312649. Accessed 2020-01-14.
- Hong, Soojung. 2018. "LDA and Topic Modeling." TextMining Wiki, on GitHub, July 4. Accessed 2020-01-14.
- Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
- Khazaei, Tarneh. 2017. "LDA Topic Modeling in Spark MLlib." Blog, Zero Gravity Labs, July 14. Updated 2017-09-06. Accessed 2020-01-14.
- Kuang, Xiaoting. 2017. "Topic Modeling with LDA in NLP: data mining in Pressible." Blog, EdLab, Teachers College Columbia University, April 7. Accessed 2020-01-14.
- Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-18.
- Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-14.
- McCallum, Andrew, Andrés Corrada-Emmanuel, and Xuerui Wang. 2005. "Topic and Role Discovery in Social Networks." International Joint Conferences on Artificial Intelligence, pp. 786-791. Accessed 2020-01-18.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2020-01-14.
- Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-14.
- R on Methods Bites. 2019. "Advancing Text Mining with R and quanteda." R-bloggers, October 16. Accessed 2020-01-18.
- Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-14.
- Syed, Shaheen and Marco Spruit. 2018. "Selecting Priors for Latent Dirichlet Allocation." IEEE 12th International Conference on Semantic Computing (ICSC), pp. 194-202, January 31 - February 2. Accessed 2020-01-18.
- Tanna, Vineet. 2018. "vineettanna / Aspect-Based-Opinion-Mining-Using-Spark." GitHub, February 8. Accessed 2020-01-18.
- Tim. 2016. "What exactly is the alpha in the Dirichlet distribution?" CrossValidated, StackExchange, November 8. Updated 2017-04-13. Accessed 2020-01-14.
- Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
- Wang, Xuerui, and Andrew McCallum. 2006. "Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends." KDD'06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424–433, August. Accessed 2020-01-18.
- Wikipedia. 2020. "Dirichlet distribution." Wikipedia, January 10. Accessed 2020-01-18.
- Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
- Řehůřek, Radim. 2013. "Asymmetric LDA Priors, Christmas Edition." Rare Technologies, December 21. Accessed 2020-01-18.
Further Reading
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
- Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
- Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
- Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
- Liu, Sue. 2019. "Dirichlet distribution." Towards Data Science, on Medium, January 7. Accessed 2020-01-14.
- Boyd-Graber, Jordan. 2018. "Continuous Distributions: Beta and Dirichlet Distributions." YouTube, February 24. Accessed 2020-01-14.
Article Stats
Cite As
See Also
- Topic Modelling
- Structural Topic Model
- Latent Semantic Analysis
- Text Clustering
- Expectation Maximization Algorithm
- Factor Analysis