Maximum-Entropy Markov Model

Article Info

Contributed by
3 authors

Last updated on
2022-02-15 11:54:31

Article Versions

7 2022-02-15 11:54:31
3372,2246 7,3372

By devbot5S

Migrating blockquotes to markdown syntax
6 2020-09-21 17:07:35
2246,2088 6,2246

By arvindpdmn

Update to See Also and Further Reading.
5 2020-05-30 08:28:57
2088,2003 5,2088

By arpittrainer

There are many systems where there's a time or state dependency. These systems evolve in time through a sequence of states and current state is influenced by past states. For example, there's a maxi chance of rain today if it had rained yesterday. Other e
4 2020-03-25 11:37:39
2003,1673 4,2003

By arvindpdmn

Remove a misleading acronym.
3 2019-10-09 10:07:54
1673,1671 3,1673

By arvindpdmn

Grammar correction.

Chat Room

Submitting ...

You are editing an existing chat message.
2020-09-21 17:08:09
-

By devbot5S

[URL Check] The following URLs in this article are outdated. Please update.

Redirected URLs:
References: https://pdfs.semanticscholar.org/8edf/4ad46385ecbed2bd1c8106c89d9fc62a4c90.pdf → https://www.semanticscholar.org/paper/Application-of-Maximum-Entropy-Markov-Models-on-the-Kim/8edf4ad46385ecbed2bd1c8106c89d9fc62a4c90?p2df
References: https://www.statisticshowto.datasciencecentral.com/multinomial-logistic-regression/ → https://www.statisticshowto.com/multinomial-logistic-regression/
Further Reading: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a → https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a?gi=4dac04a2551f
2020-05-30 08:36:09
-

By arvindpdmn

@arpittrainer has removed a valid link in Further Reading and replaced with a generic Machine Learning link. I think the former link to MEMM is more relevant to this article. The change description to the edit is also a copy of the Summary. Ideally, it should describe the nature of the edit.

There are many systems where there's a time or state dependency. These systems evolve in time through a sequence of states and current state is influenced by past states. For example, there's a high chance of rain today if it had rained yesterday. Other examples include stock prices, DNA sequencing, human speech or words in a sentence.

One problem is that we may have observations but not the states. For example, we know the sequence of words but not the corresponding part-of-speech tags. We therefore model the tags as states and use the observed words to predict the most probable sequence of tags. This is exactly what Maximum-Entropy Markov Model (MEMM) can do.

MEMM is a model that makes use of state-time dependencies. It uses predictions of the past and the current observation to make current prediction.

Discussion

Which are the building blocks of Maximum-Entropy Markov Model?
Let's consider the task of email spam detection. Using logistic regression, we can obtain the probability that an email is spam. It's essentially binary classification. We can extend this to a multiclass problem. For example, in image analysis, we're required to classify the object into one of many classes. We estimate the probability for each class. Rather than take a hard decision on one of the outcomes, it's better to output probabilities, which will benefit downstream tasks.
Multinomial logistic regression is also called softmax regression or Maximum Entropy (MaxEnt) classifier. Entropy's related to "disorder". Higher the disorder, less predictable the outcomes, and hence more information. For example, an unbiased coin has more information (and entropy) than a coin that mostly lands up heads. MaxEnt is therefore about picking a probability distribution that maximizes entropy.
Then, there's Markov Chain. It models a system as a set of states with probabilities assigned to state transitions. While MaxEnt computes probabilities for each input independently, Markov chain recognizes that there's dependency from one state to the next.
Thus, MEMM is about maximizing entropy plus using state dependencies (Markov Model).
Could you explain MEMM with an example?
Example of POS tagging with MEMM. Source: Gui et al. 2019, slide 46.
Consider the problem of POS tagging. Given a sequence of words, we wish to find the most probable sequence of tags. MEMM predicts the tag sequence by modelling tags as states of the Markov chain. To predict a tag, MEMM uses the current word and the tag assigned to the previous word.
In reality, MEMM doesn't take a hard decision on what tag to select. Using MaxEnt, it calculates the probability of each tag. Probabilities of past tags are used to take a soft decision so that the word sequence as a whole get the best possible tag sequence. The algorithm that can do this efficiently is the Viterbi algorithm.
The formula in the figure assumes a sequence of L words. Z is a normalizing term so that probabilities are in range [0, 1] and they all add up to 1.
What are some applications where MEMM is used?
Use of MEMM for facial expression recognition. Source: Siddiqi et al. 2016, fig. 1.
MEMM has been used in many NLP tasks including POS tagging and semantic role modelling. It's been used in computational biology for protein secondary structure prediction.
One study used MEMM for recognizing human facial expressions. Human expressions are modelled as states. Observations are from video sensors.
How is MEMM different from Hidden Markov Model (HMM)?
HMM vs. MEMM for POS tagging. Source: Adapted from Jurafsky and Martin 2009, fig. 6.20.
HMM is a generative model because words as modelled as observations generated from hidden states. Even the formulation of probabilities uses likelihood P(W|T) and prior P(T). On the other hand, MEMM is a discriminative model. This is because it directly uses posterior probability P(T|W); that is, probability of a tag sequence given a word sequence. Thus, it discriminates among the possible tag sequences.
It can also be said that HMM uses joint probability for maximizing the probability of the word sequence. MEMM uses conditional probability, conditioned on previous tag and current word.
In HMM, for the tag sequence decoding problem, probabilities are obtained by training on a text corpus. In MEMM, we build a distribution by adding features, which can be hand crafted or picked out by training. The idea is to select the maximum entropy distribution given the constraints specified by the features. MEMM is more flexible because we can add features such as capitalization, hyphens or word endings, which are hard to consider in HMM. MEMM allows for diverse non-independent features.
What are some shortcomings of MEMM?
Illustrating the label bias problem. Source: Lafferty et al. 2001, fig. 1.
MEMM suffers from what's called the label bias problem. Once we're in a state or label, the next observation will select one of many transitions leaving that state. However, the model as a whole would have many more transitions. If a state has only one outgoing transition, the observation has no influence. Simply put, transition scores are normalized on a per-state basis.
Consider a character-level input sequence "rib". We have an MEMM to select between "rib" and "rob". When we receive 'r', both paths 0-1 and 0-4 get equal probability mass. When 'i' arrives, we have nowhere to go from state-4 except to state-5 and pass on the entire probability mass along this path. When 'b' arrives, at state-3, both paths are equally likely.
The label bias problem was first noted by L. Bottou in 1991. Conditional Random Fields (CRFs) is an alternative model that solves this.

Milestones

1996

Berger et al. at the IBM Watson Research Center note that the concept of maximum entropy is not new but were not common in real-world applications due to high computational requirements. This is changing due to more powerful computers. They apply MaxEnt to some NLP tasks such as sentence segmentation and word reordering. They show that MaxEnt and maximum likelihood give the same result,

The model with the greatest entropy consistent with the constraints is the same as the exponential model which best predicts the sample of data.

1996

Adwait Ratnaparkhi at the University of Pennsylvania applies MaxEnt model along with Markov model to the task of part-of-speech tagging. He simply calls it Maximum Entropy Model. The model is able to use rich contextual features. It achieves state-of-the-art accuracy of 96.6%. This work leads to his PhD in 1998. The model considers past two POS tags, and features of past two and next two words. It combines them into a single exponential model.

1998

Saul and Rahim make use of Markov Processes on Curves (MPCs) for automatic speech recognition. Acoustic feature trajectory is a continuous variable that evolves jointly with phonetic transcriptions modelled as discrete states. This work later inspires MEMM by McCallum et al. who apply it to discrete time.

2000

McCallum et al. coin the term Maximum-Entropy Markov Model (MEMM). They apply it to information extraction and segmentation. They note that the three well-known problems solved by HMM can also be solved by MEMM. Exponential models of MEMM can be training using Generalized Iterative Scaling (GIS), which is similar to Expectation-Maximization (EM) algorithm used in HMM.

Jun
2001

Lafferty et al. propose an alternative discriminative model called Conditional Random Fields (CRFs) to overcome the label bias problem of MEMM. MEMM can be biased towards states with fewer successor states. While MEMM uses per-state exponential models for the conditional probabilities, CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence. In other words, CRF uses global normalization instead of per-state normalization. However, CRF is slower to train that MEMM.