Latent Dirichlet Allocation

By arvindpdmn. Created 2020-01-14 13:38:46. Last updated 2020-01-20 09:08:27.

Summary

Graphical representation of LDA with annotations. Source: Hong 2018.

Given a document, topic modelling is a task that aims to uncover the most suitable topics or themes that the document is about. It does this by looking at words that most often occur together. For example, a document with high co-occurrence of the words 'cats' and 'dogs' is probably about the topic 'Animals', whereas one with the words 'horses' and 'equestrian' is partly about 'Animals' but more about 'Sports'. Latent Dirichlet Allocation (LDA) is a popular technique for topic modelling.

LDA is based on probability distributions. For each document, it considers a distribution of topics. For each topic, it considers a distribution of words. This information helps LDA discover the topics in a document.

LDA and its many variants support diverse applications. LDA is well-supported in a few programming languages and software packages.

Milestones

1990

Deerwester et al. apply Singular Value Decomposition (SVD) to the problem of automatic indexing and information retrieval. SVD brings together terms and documents that are closely related in the "semantic space". Their idea of semantics is nothing more than a topic or concept. They call their method Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA).

Aug 1999
The pLSA model. Source: Blei 2013, slide 10.

Hofmann presents a statistical analysis of LSA. He coins the term Probabilistic Latent Semantic Analysis (pLSA). It's based on the aspect model, which is a latent variable model. It associates an unobserved class variable (a topic) with each observation (a word). Unlike LSA, this is a proper generative model.

Jan 2003
The LDA model. Source: Ruozzi 2019, slide 12.

Blei et al. describe in detail a probabilistic generative model that they name Latent Dirichlet Allocation (LDA). They first presented it at the NIPS 2001 conference. They note that pLSA lacks a probabilistic model at the document level. LDA overcomes this. Their work uses the bag-of-words model but they note that LDA can be applied to larger units such as n-grams or paragraphs.

Jul 2003
Comparison of three LDA-based models for automated image captioning. Source: Blei and Jordan 2003, fig. 6.

Blei and Jordan consider the problems of automated image captioning and text-based image retrieval. They study three hierarchical probabilistic mixture models. They arrive at Correspondence LDA (CorrLDA), which gives the best performance.

Dec 2009

Typically, symmetric Dirichlet priors are used in LDA. Wallach et al. study the effect of structured priors for topic modelling. They find that asymmetric Dirichlet priors over document-topic distributions are much better than symmetric priors, whereas asymmetric priors over topic-word distributions bring little benefit. The resulting model is less sensitive to the number of topics. With hyperparameter optimization, computation can be made practical. Related research with similar results is reported in 2018 by Syed and Spruit.

2016

Word2vec came out in 2013. It's a word embedding that's constructed by predicting neighbouring words given a word. LDA, on the other hand, looks at words at the document level. Moody proposes lda2vec as an approach to capture both local and global information. This combines the power of word2vec and the interpretability of LDA. In lda2vec, word vectors are dense but document vectors are sparse.

Discussion

  • What are the shortcomings of earlier topic models that LDA aims to solve?

    Before LDA, there were LSA and pLSA models. LSA was simply a dimensionality reduction technique and lacked a strong probabilistic approach. pLSA remedied this by being a probabilistic generative model. It picked a topic with probability P(z). Then it selected the document and the word with probabilities P(d|z) and P(w|z) respectively.
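
    Written out, the pLSA generative story above corresponds to the following joint probability of a document \(d\) and a word \(w\). The topic \(z\) is unobserved, so it's summed out. The symbols are the ones used in the paragraph above; the equation is offered as a reading aid rather than a quotation from the cited sources.

    \[ P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z) \]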

    While LSA could model synonymy well, it handled polysemy poorly. In other words, a word with multiple meanings needed to appear in multiple topics but didn't. pLSA partially handled polysemy. However, pLSA ignored P(d). Each document was a mixture of topics but there was no model to generate this mixture. As a result, the number of pLSA parameters grew linearly with the corpus size, leading to overfitting. Also, pLSA was unable to assign topic probabilities to new documents.

    LDA is inspired by pLSA. Like pLSA, it's also a probabilistic generative model. Unlike pLSA, LDA also considers the generation of documents. This is where the Dirichlet distribution becomes useful. It determines the topic distribution for each document.

  • Could you describe some example applications where LDA has been applied?
    LDA uncovers four main topics connected to user IDs on a school blog. Source: Kuang 2017.

    LDA is an algorithm or method for topic modelling, which has been applied in information retrieval, text mining, social media analysis, and more. In general, topic modelling uncovers hidden structures or topics in documents.

    LDA has been applied in diverse tasks: automatic essay grading, anti-phishing, automatic labelling, emotion detection, expert identification, role discovery, sentiment summarization, word sense disambiguation, and more.

    For analysing political texts, an LDA-based model was used to find opinions from different viewpoints. In software engineering, LDA was used to find similar code in software repositories and suggest code refactoring. Another study made use of geographic data and GPS-based documents to discover topics.

    LDA has been used on online or social media data. By applying it to public tweets or chat data, we can detect and track how topics change over time. We can identify users who follow a similar distribution of topics. On Yelp restaurant reviews, LDA was used to do aspect-based opinion mining. LDA was used on a school blog to uncover main topics and who's talking about them.

  • Why is LDA called a generative model?
    LDA generates the document one word at a time. Source: Ganegedara 2018.

    In a generative model, observations are generated by latent variables. Given the words of a document, LDA figures out the latent topics. But as a generative model, we can think of LDA as generating the topics and then the words for that document.

    In the figure, we note that \(\alpha\) fixes a particular distribution of topics \(\theta\). There's one such distribution for each document. For a document, when we pick a topic from this distribution, we're faced with word distribution \(\beta_i\) for that topic. These word distributions are determined by \(\eta\). From our topic's word distribution, we pick a word. We do this as many times as the document's word count. Thus, the model is generative.
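
    The generative story above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from any of the cited sources. The function names, the toy vocabulary and the values chosen for \(\alpha\) and \(\eta\) are all assumptions made for the example.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one probability vector from Dir(concentration) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probabilities, rng):
    """Pick index i with probability probabilities[i]."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(alpha, beta, vocabulary, num_words, rng):
    """Generate one document, given the per-topic word distributions beta."""
    theta = sample_dirichlet(alpha, rng)   # topic distribution of this document
    topics, words = [], []
    for _ in range(num_words):
        z = sample_index(theta, rng)       # pick a topic for this word position
        w = sample_index(beta[z], rng)     # pick a word from that topic's distribution
        topics.append(z)
        words.append(vocabulary[w])
    return theta, topics, words


# Illustrative (assumed) setup: 2 topics, 6 words, symmetric priors.
rng = random.Random(0)
vocabulary = ["cat", "dog", "horse", "rider", "race", "saddle"]
alpha = [0.5, 0.5]
eta = [0.1] * len(vocabulary)
beta = [sample_dirichlet(eta, rng) for _ in alpha]   # one word distribution per topic

theta, topics, words = generate_document(alpha, beta, vocabulary, 8, rng)
print("topic distribution:", [round(p, 2) for p in theta])
print("topics per word:  ", topics)
print("generated words:  ", words)
```

    In practice we run this story in reverse: the words are given and the distributions \(\theta\) and \(\beta\) are what we want to recover.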

  • What's the significance of Dirichlet priors in LDA?
    Effect of α on topic distribution of three topics. Source: Ganegedara 2018.

    Topic distribution \(\theta\) and word distribution \(\beta\) are created from \(\alpha\) and \(\eta\) respectively. The latter are called Dirichlet priors. A low \(\alpha\) implies a document might have fewer dominant topics. A large \(\alpha\) implies many dominant topics. Similarly, a low (or high) \(\eta\) means a topic has a few (or many) dominant words.

    Given the Dirichlet distribution \(Dir(\alpha)\) we sample a topic distribution for a specific document. Likewise, from \(Dir(\eta)\) we sample a word distribution for a specific topic. In other words, the Dirichlet distribution generates another distribution. For this reason, it's called distribution over distributions.
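
    To get a feel for how the prior shapes the sampled distributions, the sketch below draws many topic distributions over three topics from a symmetric Dirichlet and reports the average weight of the most dominant topic. It uses only Python's standard library. The helper names and the two values of \(\alpha\) (0.1 and 10) are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one probability vector from Dir(concentration) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha_value, num_topics=3, num_samples=2000, seed=0):
    """Average weight of the most dominant topic under a symmetric Dir(alpha_value)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = sample_dirichlet([alpha_value] * num_topics, rng)
        total += max(theta)
    return total / num_samples


# Assumed example values: a low and a high concentration parameter.
low = mean_largest_share(0.1)
high = mean_largest_share(10.0)
print("alpha = 0.1, mean weight of top topic:", round(low, 2))
print("alpha = 10,  mean weight of top topic:", round(high, 2))
```

    With the low value, most of the probability mass lands on a single topic. With the high value, the mass is spread more evenly across the three topics.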

    Suppose we have k topics and a vocabulary of V words. \(\alpha\) will be a vector of length k. \(\eta\) will be a vector of length V. If all elements of the vector have the same value, we call this a symmetric Dirichlet distribution. It simply means we have no prior knowledge that favours one topic or word over another and assume all are equally likely. In this case, the prior may be expressed as a scalar, called the concentration parameter.

    In an alternative notation, the symbol \(\eta\) is not used. Instead, \(\beta\) denotes the prior and \(Dir(\beta)\) generates the word distribution \(\phi\).

  • What are the methods of inference and parameter estimation under LDA?

    In LDA, words are observed, topic and word distributions are hidden, and \(\alpha\) and \(\eta\) are the hyperparameters. Thus, we need to infer the distributions and estimate the hyperparameters. In general, exact inference is intractable because the two distributions are coupled through the latent topics.

    We note three common techniques for inference and estimation:

    • Gibbs Sampling: A method for sampling from a joint distribution when only conditional distributions of topics and words can be efficiently computed.
    • Expectation-Maximization (EM): Useful for parameter estimation via maximum likelihood.
    • Variational Inference: Coupling between distributions is removed to yield a simplified graphical model with free variational parameters. Now we have an optimization problem to find the best variational parameters. The Kullback-Leibler (KL) divergence between the variational distribution and the true posterior can be used as the quantity to minimize.
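
    For the variational approach, the role of the KL divergence can be seen from a standard identity. Here \(q\) is the variational distribution over the hidden variables \(\theta\), \(\beta\) and the topic assignments \(z\), and \(w\) denotes the observed words. The symbols \(q\), \(z\), \(w\) and \(\mathcal{L}\) are notation assumed for this sketch.

    \[ \log p(w \mid \alpha, \eta) = \mathcal{L}(q) + \mathrm{KL}\big(q(\theta, \beta, z) \,\|\, p(\theta, \beta, z \mid w, \alpha, \eta)\big) \]

    \[ \mathcal{L}(q) = \mathbb{E}_q\big[\log p(\theta, \beta, z, w \mid \alpha, \eta)\big] - \mathbb{E}_q\big[\log q(\theta, \beta, z)\big] \]

    Since the KL term is never negative, \(\mathcal{L}(q)\) is a lower bound on the log likelihood. The left-hand side doesn't depend on \(q\), so maximizing this bound over the variational parameters is the same as minimizing the KL divergence to the true posterior.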

  • What's the typical pipeline for doing LDA?
    Text preprocessing with NLTK and aspect extraction using LDA via Spark MLlib. Source: Tanna 2018.

    The actual working of LDA is iterative. It starts by randomly assigning a topic to each word in each document. Then the topic and word distributions are calculated. These distributions are used in the next iteration to reassign topics. This is repeated until the algorithm converges. Once the distribution is worked out during training, the dominant topics of a test document can be identified by its location in the topic space.
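
    One common way to realize this iterative loop is collapsed Gibbs sampling. The article doesn't say which algorithm a given library uses, so the sketch below is only one possible instance, written with Python's standard library. The function name `gibbs_lda`, the toy corpus and the default values of `alpha`, `eta` and `iterations` are assumptions made for the example.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01,
              iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # count of topic k in doc d
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                               # words assigned to topic k
    assignments = []

    # Step 1: randomly assign a topic to each word in each document.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Step 2: repeatedly reassign each word's topic using the current counts.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1          # remove the current assignment
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta) / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = z         # record the new assignment
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Step 3: convert counts into smoothed topic and word distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    beta = [[(c + eta) / (topic_total[k] + vocab_size * eta) for c in topic_word[k]]
            for k in range(num_topics)]
    return theta, beta


# Toy corpus (assumed): two documents about pets, two about horse riding.
vocabulary = ["cat", "dog", "pet", "horse", "rider", "race"]
word_id = {w: i for i, w in enumerate(vocabulary)}
texts = ["cat dog pet cat dog", "dog pet cat pet",
         "horse rider race horse", "race rider horse race rider"]
docs = [[word_id[w] for w in text.split()] for text in texts]

theta, beta = gibbs_lda(docs, num_topics=2, vocab_size=len(vocabulary))
for d, row in enumerate(theta):
    print("document", d, "topic distribution:", [round(p, 2) for p in row])
```

    The sketch runs for a fixed number of iterations rather than testing for convergence. A real implementation would monitor a quantity such as the log likelihood.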

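    LDA works towards two goals: each document should have only a few topics, and each topic should have only a few highly likely words. These two goals are at odds with each other. Restricting a document to a few topics forces those topics to cover all of its words, while restricting a topic to a few words forces a document to draw on many topics. By trading off these two goals, LDA uncovers tightly co-occurring words.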

  • Could you describe some variants of the basic LDA?
    A summary of some LDA variants for the period 2003-2016. Source: Jelodar et al. 2018, fig. 1.

    While LDA looks at co-occurrences of words, some LDA variants include metadata such as research paper authors or citations. Another approach is to look at word sequences with Markov dependencies. For social network analysis, the Author-Recipient-Topic model conditions the distribution of topics on the sender and one recipient.

    For applications such as automatic image annotation or text-based image retrieval, Correspondence LDA models the joint distribution of images and text, plus the conditional distribution of the annotation given the image.

    Correlated Topic Model captures correlations among topics.

    Word co-occurrence patterns are rarely static. For example, "dynamic systems" has more recently co-occurred with "graphical models" more often than with "neural networks". Topics over Time models time jointly with word co-occurrences. It uses a continuous distribution over time.

    LDA doesn't differentiate between topic words and opinion words. Opinions can also come from different perspectives. Cross-Perspective Topic model extends LDA by separating opinion generation from topic generation. Nouns form topics. Adjectives, verbs and adverbs form opinions.

    Jelodar et al. (2018) note many more variants.

  • What are some resources for working with LDA?

    In Python, nltk is useful for general text processing while gensim enables LDA. In R, quanteda is for quantitative text analysis while topicmodels is more specifically for topic modelling. In Java, there's Mallet, TMT and Mr.LDA.

    Gensim has a useful feature to automatically calculate the optimal asymmetric prior for \(\alpha\) by accounting for how often words co-occur.

    LDA is built into Spark MLlib. This can be used via Scala, Java, Python or R. For example, in Python, LDA is available in module pyspark.ml.clustering.

    There are plenty of datasets for research into topic modelling. Those labelled with categories or topics may be more useful. Some examples are Reuters-21578, Wiki10+, DBLP Dataset, NIPS Conference Papers 1987-2015, and 20 Newsgroups.

Sample Code

  • // Source: https://spark.apache.org/docs/latest/ml-clustering.html
    // Accessed 2020-01-19
     
    import org.apache.spark.ml.clustering.LDA
     
    // Loads data. Assumes `spark` is an active SparkSession, as in spark-shell.
    val dataset = spark.read.format("libsvm")
      .load("data/mllib/sample_lda_libsvm_data.txt")
     
    // Trains an LDA model with 10 topics and at most 10 iterations.
    val lda = new LDA().setK(10).setMaxIter(10)
    val model = lda.fit(dataset)
     
    val ll = model.logLikelihood(dataset)
    val lp = model.logPerplexity(dataset)
    println(s"The lower bound on the log likelihood of the entire corpus: $ll")
    println(s"The upper bound on perplexity: $lp")
     
    // Describe topics.
    val topics = model.describeTopics(3)
    println("The topics described by their top-weighted terms:")
    topics.show(false)
     
    // Shows the result.
    val transformed = model.transform(dataset)
    transformed.show(false)

References

  1. Apache Spark Docs. 2019. "Clustering." Apache Spark 2.4.4, April 13. Accessed 2020-01-18.
  2. Blei, David M. 2013. "Probabilistic Topic Models: Origins and Challenges." Department of Computer Science, Princeton University, December 9. Accessed 2020-01-14.
  3. Blei, David M. and Michael I. Jordan. 2003. "Modeling Annotated Data." SIGIR'03, July 28–August 1. Accessed 2020-01-18.
  4. Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-18.
  5. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
  6. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
  7. Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
  8. Fang, Yi, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. "Mining Contrastive Opinions on Political Texts using Cross-Perspective Topic Model." Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM'12, ACM, February 8–12. Accessed 2020-01-18.
  9. Ganegedara, Thushan. 2018. "Intuitive Guide to Latent Dirichlet Allocation." Towards Data Science, on Medium, August 23. Accessed 2020-01-18.
  10. Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57, August. https://doi.org/10.1145/312624.312649. Accessed 2020-01-14.
  11. Hong, Soojung. 2018. "LDA and Topic Modeling." TextMining Wiki, on GitHub, July 4. Accessed 2020-01-14.
  12. Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
  13. Khazaei, Tarneh. 2017. "LDA Topic Modeling in Spark MLlib." Blog, Zero Gravity Labs, July 14. Updated 2017-09-06. Accessed 2020-01-14.
  14. Kuang, Xiaoting. 2017. "Topic Modeling with LDA in NLP: data mining in Pressible." Blog, EdLab, Teachers College Columbia University, April 7. Accessed 2020-01-14.
  15. Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-18.
  16. Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-14.
  17. McCallum, Andrew, Andrés Corrada-Emmanuel, and Xuerui Wang. 2005. "Topic and Role Discovery in Social Networks." International Joint Conferences on Artificial Intelligence, pp. 786-791. Accessed 2020-01-18.
  18. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2020-01-14.
  19. Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-14.
  20. R on Methods Bites. 2019. "Advancing Text Mining with R and quanteda." R-bloggers, October 16. Accessed 2020-01-18.
  21. Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-14.
  22. Syed, Shaheen and Marco Spruit. 2018. "Selecting Priors for Latent Dirichlet Allocation." IEEE 12th International Conference on Semantic Computing (ICSC), pp. 194-202, January 31 - February 2. Accessed 2020-01-18.
  23. Tanna, Vineet. 2018. "vineettanna / Aspect-Based-Opinion-Mining-Using-Spark." GitHub, February 8. Accessed 2020-01-18.
  24. Tim. 2016. "What exactly is the alpha in the Dirichlet distribution?" CrossValidated, StackExchange, November 8. Updated 2017-04-13. Accessed 2020-01-14.
  25. Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
  26. Wang, Xuerui, and Andrew McCallum. 2006. "Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends." KDD'06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424–433, August. Accessed 2020-01-18.
  27. Wikipedia. 2020. "Dirichlet distribution." Wikipedia, January 10. Accessed 2020-01-18.
  28. Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
  29. Řehůřek, Radim. 2013. "Asymmetric LDA Priors, Christmas Edition." Rare Technologies, December 21. Accessed 2020-01-18.

Further Reading

  1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
  2. Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
  3. Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
  4. Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
  5. Liu, Sue. 2019. "Dirichlet distribution." Towards Data Science, on Medium, January 7. Accessed 2020-01-14.
  6. Boyd-Graber, Jordan. 2018. "Continuous Distributions: Beta and Dirichlet Distributions." YouTube, February 24. Accessed 2020-01-14.

Cite As

Devopedia. 2020. "Latent Dirichlet Allocation." Version 2, January 20. Accessed 2020-03-29. https://devopedia.org/latent-dirichlet-allocation