Latent Dirichlet Allocation
Summary
Given a document, topic modelling aims to uncover the most suitable topics or themes that the document is about. It does this by looking at words that often occur together. For example, a document with high co-occurrence of the words 'cats' and 'dogs' is probably about the topic 'Animals', whereas one containing 'horses' and 'equestrian' is partly about 'Animals' but more about 'Sports'. Latent Dirichlet Allocation (LDA) is a popular technique for topic modelling.
LDA is based on probability distributions. For each document, it considers a distribution of topics. For each topic, it considers a distribution of words. This information helps LDA discover the topics in a document.
LDA and its many variants support diverse applications. LDA is well supported in a number of programming languages and software packages.
Discussion
What are the shortcomings of earlier topic models that LDA aims to solve? Before LDA, there were the LSA and pLSA models. LSA was simply a dimensionality reduction technique and lacked a strong probabilistic foundation. pLSA remedied this by being a probabilistic generative model. It picked a topic with probability P(z). Then it selected the document and the word with probabilities P(d|z) and P(w|z) respectively.
While LSA could model synonymy well, it failed at polysemy. In other words, a word with multiple meanings needed to appear in multiple topics but didn't. pLSA partially handled polysemy. However, pLSA ignored P(d). Each document was a mixture of topics but there was no model to generate this mixture. This made the number of pLSA parameters grow linearly with corpus size, leading to overfitting. Also, pLSA was unable to assign topic probabilities to new documents.
LDA is inspired by pLSA. Like pLSA, it's a probabilistic generative model. Unlike pLSA, LDA also models how documents are generated. This is where the Dirichlet distribution becomes useful. It determines the topic distribution for each document.
Could you describe some example applications where LDA has been applied? LDA is a method for topic modelling, which has been applied in information retrieval, text mining, social media analysis, and more. In general, topic modelling uncovers hidden structures or topics in documents.
LDA has been applied to diverse tasks: automatic essay grading, anti-phishing, automatic labelling, emotion detection, expert identification, role discovery, sentiment summarization, word sense disambiguation, and more.
For analysing political texts, an LDA-based model was used to find opinions from different viewpoints. In software engineering, LDA was used to find similar code in software repositories and suggest code refactoring. Another study made use of geographic data and GPS-based documents to discover topics.
LDA has been used on online or social media data. By applying it to public tweets or chat data, we can detect and track how topics change over time. We can identify users who follow a similar distribution of topics. On Yelp restaurant reviews, LDA was used for aspect-based opinion mining. LDA was used on a school blog to uncover main topics and who's talking about them.
Why is LDA called a generative model? In a generative model, observations are generated by latent variables. Given the words of a document, LDA figures out the latent topics. But as a generative model, we can think of LDA as generating the topics and then the words for that document.
In the figure, we note that \(\alpha\) fixes a particular distribution of topics \(\theta\). There's one such distribution for each document. For a document, when we pick a topic from this distribution, we're faced with word distribution \(\beta_i\) for that topic. These word distributions are determined by \(\eta\). From our topic's word distribution, we pick a word. We do this as many times as the document's word count. Thus, the model is generative.
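The generative story above can be sketched with NumPy's Dirichlet sampler. The topic count, vocabulary size, and document length below are illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 topics, 8-word vocabulary, 20 words per document.
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.full(n_topics, 0.5)   # Dirichlet prior over topics per document
eta = np.full(vocab_size, 0.1)   # Dirichlet prior over words per topic

# One word distribution beta_i per topic, drawn from Dir(eta)
beta = rng.dirichlet(eta, size=n_topics)

# Per-document topic distribution theta, drawn from Dir(alpha)
theta = rng.dirichlet(alpha)

# Generate the document: pick a topic, then a word from that topic's
# word distribution, repeated once per word position.
doc = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)       # topic assignment
    w = rng.choice(vocab_size, p=beta[z])   # word from topic's distribution
    doc.append(w)
print(doc)
```

In a real corpus the word ids would map to vocabulary terms; the point is only the two-stage sampling: topic from \(\theta\), then word from \(\beta_i\).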
What's the significance of Dirichlet priors in LDA? Topic distribution \(\theta\) and word distribution \(\beta\) are generated from \(\alpha\) and \(\eta\) respectively. The latter are called Dirichlet priors. A low \(\alpha\) implies a document has only a few dominant topics. A large \(\alpha\) implies many dominant topics. Similarly, a low (or high) \(\eta\) means a topic has a few (or many) dominant words.
Given the Dirichlet distribution \(Dir(\alpha)\), we sample a topic distribution for a specific document. Likewise, from \(Dir(\eta)\) we sample a word distribution for a specific topic. In other words, the Dirichlet distribution generates another distribution. For this reason, it's called a distribution over distributions.
Suppose we have k topics and a vocabulary of size V. \(\alpha\) will be a vector of length k. \(\eta\) will be a vector of length V. If all elements of a vector have the same value, we call this a symmetric Dirichlet distribution. It simply means we have no prior knowledge and assume all topics or words are equally likely. In this case, the prior may be expressed as a scalar, called the concentration parameter.
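The effect of the concentration parameter can be seen by sampling symmetric Dirichlet distributions. The topic count and concentration values below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5  # number of topics

# Symmetric priors: a single concentration value repeated k times.
low = rng.dirichlet(np.full(k, 0.1), size=2000)    # sparse samples
high = rng.dirichlet(np.full(k, 10.0), size=2000)  # near-uniform samples

# With a low concentration, each sampled distribution puts most of its
# mass on a few topics, so its largest component is much bigger on average.
print(low.max(axis=1).mean(), high.max(axis=1).mean())
```

This mirrors the prose: low \(\alpha\) gives documents with few dominant topics, high \(\alpha\) gives documents where many topics share the mass.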
In an alternative notation, the \(\eta\) symbol is not used. Instead, \(Dir(\beta)\) generates \(\phi\).
What are the methods of inference and parameter estimation under LDA? In LDA, words are observed; topic and word distributions are hidden; \(\alpha\) and \(\eta\) are the hyperparameters. Thus, we need to infer the distributions and estimate the hyperparameters. In general, this problem is intractable because the two distributions are coupled in the latent topics.
We note three common techniques for inference and estimation:
 Gibbs Sampling: A method for sampling from a joint distribution when only conditional distributions of topics and words can be efficiently computed.
 Expectation-Maximization (EM): Useful for parameter estimation via maximum likelihood.
 Variational Inference: The coupling between distributions is removed to yield a simplified graphical model with free variational parameters. Now we have an optimization problem to find the best variational parameters. The Kullback-Leibler (KL) divergence between the variational distribution and the true posterior can be minimized.
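As a rough illustration of the first technique, here is a minimal collapsed Gibbs sampler. The toy corpus of word ids and all hyperparameter values are assumptions made for this sketch; real implementations are far more optimized:

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
              iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA: repeatedly resample each word's
    topic from its conditional distribution given all other assignments."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # total words per topic
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(topic | rest): document's topic preference times
                # how well the topic explains this word
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + vocab_size * eta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Toy corpus: word ids 0-2 cluster in two documents, ids 3-5 in the others.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 0], [3, 4, 5, 3], [4, 5, 3, 4, 5]]
ndk, nkw = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(ndk)  # document-topic counts after sampling
```

Note how the counts, not the distributions themselves, are tracked: \(\theta\) and \(\beta\) are integrated out, which is why this variant is called "collapsed".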
What's the typical pipeline for doing LDA? The actual working of LDA is iterative. It starts by randomly assigning a topic to each word in each document. Then the topic and word distributions are calculated. These distributions are used in the next iteration to reassign topics. This is repeated until the algorithm converges. Once the distribution is worked out during training, the dominant topics of a test document can be identified by its location in the topic space.
LDA balances two competing goals: a document should contain only a few topics, and a topic should contain only a few highly likely words. These goals are at odds with each other. By trading them off, LDA uncovers tightly co-occurring words.
Could you describe some variants of the basic LDA? While LDA looks at co-occurrences of words, some LDA variants include metadata such as research paper authors or citations. Another approach is to look at word sequences with Markov dependencies. For social network analysis, the Author-Recipient-Topic model conditions the distribution of topics on the sender and one recipient.
For applications such as automatic image annotation or text-based image retrieval, Correspondence LDA models the joint distribution of images and text, plus the conditional distribution of the annotation given the image.
The Correlated Topic Model captures correlations among topics.
Word co-occurrence patterns are rarely static. For example, "dynamic systems" now co-occurs with "graphical models" more than with "neural networks". Topics over Time models time jointly with word co-occurrences. It uses a continuous distribution over time.
LDA doesn't differentiate between topic words and opinion words. Opinions can also come from different perspectives. The Cross-Perspective Topic model extends LDA by separating opinion generation from topic generation. Nouns form topics. Adjectives, verbs and adverbs form opinions.
Jelodar et al. (2018) note many more variants.
What are some resources for working with LDA? In Python, nltk is useful for general text processing while gensim enables LDA. In R, quanteda is for quantitative text analysis while topicmodels is more specifically for topic modelling. In Java, there's Mallet, TMT and Mr.LDA.
Gensim has a useful feature to automatically calculate the optimal asymmetric prior for \(\alpha\) by accounting for how often words co-occur.
LDA is built into Spark MLlib. This can be used via Scala, Java, Python or R. For example, in Python, LDA is available in the module pyspark.ml.clustering.
There are plenty of datasets for research into topic modelling. Those labelled with categories or topics may be more useful. Some examples are Reuters-21578, Wiki10+, the DBLP dataset, NIPS Conference Papers 1987-2015, and 20 Newsgroups.
Milestones
1990
Deerwester et al. apply Singular Value Decomposition (SVD) to the problem of automatic indexing and information retrieval. SVD brings together terms and documents that are closely related in the "semantic space". Their idea of semantics is nothing more than a topic or concept. They call their method Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA).
1999
Hofmann presents a statistical view of LSA. He coins the term Probabilistic Latent Semantic Analysis (pLSA). It's based on the aspect model, a latent variable model that associates an unobserved class variable (topic) with each observation (word). Unlike LSA, this is a proper generative model.
2003
First presented at the NIPS 2001 conference, Blei et al. describe in detail a probabilistic generative model that they name Latent Dirichlet Allocation (LDA). They note that pLSA lacks a probabilistic model at the document level. LDA overcomes this. Their work uses the bag-of-words model but they note that LDA can be applied to larger units such as n-grams or paragraphs.
2003
Blei and Jordan consider the problems of automated image captioning and text-based image retrieval. They study three hierarchical probabilistic mixture models. They arrive at Correspondence LDA (Corr-LDA), which gives the best performance.
2009
Typically, symmetric Dirichlet priors are used in LDA. Wallach et al. study the effect of structured priors for topic modelling. They find that an asymmetric Dirichlet prior over document-topic distributions is much better than a symmetric prior, while an asymmetric prior over topic-word distributions has little benefit. The resulting model is less sensitive to the number of topics. With hyperparameter optimization, the computation can be made practical. Related research with similar results is reported in 2018 by Syed and Spruit.
2016
Word2vec came out in 2013. It's a word embedding constructed by predicting neighbouring words given a word. LDA, on the other hand, looks at words at the document level. Moody proposes lda2vec as an approach to capture both local and global information. This combines the power of word2vec with the interpretability of LDA. Word vectors are dense but document vectors are sparse.
Sample Code
References
 Apache Spark Docs. 2019. "Clustering." Apache Spark 2.4.4, April 13. Accessed 2020-01-18.
 Blei, David M. 2013. "Probabilistic Topic Models: Origins and Challenges." Department of Computer Science, Princeton University, December 9. Accessed 2020-01-14.
 Blei, David M. and Michael I. Jordan. 2003. "Modeling Annotated Data." SIGIR'03, July 28-August 1. Accessed 2020-01-18.
 Blei, David M., and John D. Lafferty. 2005. "Correlated topic models." NIPS'05: Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 147-154, December. Accessed 2020-01-18.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. "Latent Dirichlet Allocation." Dietterich, T. G., S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, pp. 601-608. Accessed 2020-01-13.
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
 Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September. Accessed 2020-01-12.
 Fang, Yi, Luo Si, Naveen Somasundaram, and Zhengtao Yu. 2012. "Mining Contrastive Opinions on Political Texts using Cross-Perspective Topic Model." Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM'12, ACM, February 8-12. Accessed 2020-01-18.
 Ganegedara, Thushan. 2018. "Intuitive Guide to Latent Dirichlet Allocation." Towards Data Science, on Medium, August 23. Accessed 2020-01-18.
 Hofmann, Thomas. 1999. "Probabilistic latent semantic indexing." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57, August. https://doi.org/10.1145/312624.312649. Accessed 2020-01-14.
 Hong, Soojung. 2018. "LDA and Topic Modeling." Text-Mining Wiki, on GitHub, July 4. Accessed 2020-01-14.
 Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
 Khazaei, Tarneh. 2017. "LDA Topic Modeling in Spark MLlib." Blog, Zero Gravity Labs, July 14. Updated 2017-09-06. Accessed 2020-01-14.
 Kuang, Xiaoting. 2017. "Topic Modeling with LDA in NLP: data mining in Pressible." Blog, EdLab, Teachers College Columbia University, April 7. Accessed 2020-01-14.
 Lee, Sangno, Jaeki Song, and Yongjin Kim. 2010. "An Empirical Comparison of Four Text Mining Methods." Journal of Computer Information Systems, Fall. Accessed 2020-01-18.
 Li, Susan. 2018. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science, on Medium, May 31. Accessed 2020-01-14.
 McCallum, Andrew, Andrés Corrada-Emmanuel, and Xuerui Wang. 2005. "Topic and Role Discovery in Social Networks." International Joint Conferences on Artificial Intelligence, pp. 786-791. Accessed 2020-01-18.
 Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 7. Accessed 2020-01-14.
 Moody, Chris. 2016. "Introducing our Hybrid lda2vec Algorithm." MultiThreaded Blog, Stitch Fix, Inc., May 27. Accessed 2020-01-14.
 R on Methods Bites. 2019. "Advancing Text Mining with R and quanteda." R-bloggers, October 16. Accessed 2020-01-18.
 Ruozzi, Nicholas. 2019. "Topic Models and LDA." Lecture 18 in: CS 6347, Statistical Methods in AI and ML, UT Dallas. Accessed 2020-01-14.
 Syed, Shaheen and Marco Spruit. 2018. "Selecting Priors for Latent Dirichlet Allocation." IEEE 12th International Conference on Semantic Computing (ICSC), pp. 194-202, January 31-February 2. Accessed 2020-01-18.
 Tanna, Vineet. 2018. "vineettanna / AspectBasedOpinionMiningUsingSpark." GitHub, February 8. Accessed 2020-01-18.
 Tim. 2016. "What exactly is the alpha in the Dirichlet distribution?" Cross Validated, StackExchange, November 8. Updated 2017-04-13. Accessed 2020-01-14.
 Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
 Wang, Xuerui, and Andrew McCallum. 2006. "Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends." KDD'06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424-433, August. Accessed 2020-01-18.
 Wikipedia. 2020. "Dirichlet distribution." Wikipedia, January 10. Accessed 2020-01-18.
 Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
 Řehůřek, Radim. 2013. "Asymmetric LDA Priors, Christmas Edition." Rare Technologies, December 21. Accessed 2020-01-18.
Further Reading
 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, pp. 993-1022, January. Accessed 2020-01-13.
 Xu, Joyce. 2018. "Topic Modeling with LSA, PLSA, LDA & lda2Vec." NanoNets, on Medium, May 25. Accessed 2020-01-14.
 Jelodar, Hamed, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2018. "Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey." arXiv, v2, December 6. Accessed 2020-01-14.
 Wallach, Hanna M., David Mimno, and Andrew McCallum. 2009. "Rethinking LDA: Why Priors Matter." Advances in Neural Information Processing Systems 22, pp. 1973-1981, December. Accessed 2020-01-18.
 Liu, Sue. 2019. "Dirichlet distribution." Towards Data Science, on Medium, January 7. Accessed 2020-01-14.
 Boyd-Graber, Jordan. 2018. "Continuous Distributions: Beta and Dirichlet Distributions." YouTube, February 24. Accessed 2020-01-14.
See Also
 Topic Modelling
 Structural Topic Model
 Latent Semantic Analysis
 Text Clustering
 Expectation Maximization Algorithm
 Factor Analysis