# Maximum-Entropy Markov Model

There are many systems with a time or state dependency. These systems evolve through a sequence of states, and the current state is influenced by past states. For example, there's a high chance of rain today if it rained yesterday. Other examples include stock prices, DNA sequences, human speech, and words in a sentence.

One problem is that we may have observations but not the states. For example, we know the sequence of words but not the corresponding part-of-speech tags. We therefore model the tags as states and use the observed words to predict the most probable sequence of tags. This is exactly what the Maximum-Entropy Markov Model (MEMM) can do.

MEMM is a model that makes use of state-time dependencies. It uses predictions from the past along with the current observation to make the current prediction.

## Discussion

What are the building blocks of the Maximum-Entropy Markov Model? Consider the task of email spam detection. Using **logistic regression**, we can obtain the probability that an email is spam. This is essentially binary classification. We can extend it to a multiclass problem: in image analysis, for example, we're required to classify an object into one of many classes, and we estimate a probability for each class. Rather than take a hard decision on one of the outcomes, it's better to output probabilities, which benefits downstream tasks. Multinomial logistic regression is also called softmax regression or the **Maximum Entropy (MaxEnt)** classifier.

Entropy is related to "disorder": the higher the disorder, the less predictable the outcomes, and hence the more information. For example, an unbiased coin carries more information (and entropy) than a coin that mostly lands heads. MaxEnt is therefore about picking the probability distribution that maximizes entropy.

Then there's the **Markov Chain**. It models a system as a set of states with probabilities assigned to state transitions. While MaxEnt computes probabilities for each input independently, a Markov chain recognizes that there's a dependency from one state to the next. Thus, MEMM is about maximizing entropy plus using state dependencies (Markov Model).
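As a quick sketch of the first two building blocks, the snippet below (plain Python; the function names are our own) computes a softmax distribution and compares the entropy of an unbiased coin against a biased one:

```python
import math

def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)  # normalizing term
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in bits: higher means less predictable."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An unbiased coin has maximum entropy (1 bit) ...
print(entropy([0.5, 0.5]))   # 1.0
# ... while a coin that mostly lands heads is more predictable.
print(entropy([0.9, 0.1]))   # ~0.47

# Softmax over three class scores, as in multiclass logistic regression.
print(softmax([2.0, 1.0, 0.1]))  # probabilities summing to 1
```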

Could you explain MEMM with an example? Consider the problem of POS tagging. Given a sequence of words, we wish to find the most probable sequence of tags. MEMM predicts the tag sequence by modelling tags as states of the Markov chain. To predict a tag, MEMM uses the current word and the tag assigned to the previous word.

In reality, MEMM doesn't take a hard decision on what tag to select. Using MaxEnt, it calculates the probability of each tag. Probabilities of past tags are used to take a soft decision so that the word sequence as a whole gets the best possible tag sequence. The algorithm that does this efficiently is the **Viterbi algorithm**.

The formula in the figure assumes a sequence of L words. Z is a normalizing term so that probabilities are in the range [0, 1] and add up to 1.
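From this description, the MEMM probability can be written out in its standard form (the notation here is ours, not taken verbatim from the original figure):

```latex
P(T \mid W) = \prod_{i=1}^{L} P(t_i \mid t_{i-1}, w_i),
\qquad
P(t_i \mid t_{i-1}, w_i)
  = \frac{1}{Z(w_i, t_{i-1})}
    \exp\Big( \sum_j \lambda_j f_j(w_i, t_i) \Big)
```

where the \(f_j\) are the features, \(\lambda_j\) their learned weights, and \(Z(w_i, t_{i-1})\) the normalizing term computed per state.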

What are some applications where MEMM is used? MEMM has been used in many NLP tasks including POS tagging and semantic role labelling. It's been used in computational biology for protein secondary structure prediction. One study used MEMM for recognizing human facial expressions, where expressions are modelled as states and observations come from video sensors.

How is MEMM different from Hidden Markov Model (HMM)? HMM is a **generative model** because words are modelled as observations generated from hidden states. Even the formulation of probabilities uses the likelihood P(W|T) and the prior P(T). On the other hand, MEMM is a **discriminative model** because it directly uses the posterior probability P(T|W), that is, the probability of a tag sequence given a word sequence. Thus, it discriminates among the possible tag sequences.

It can also be said that HMM uses **joint probability** for maximizing the probability of the word sequence, whereas MEMM uses **conditional probability**, conditioned on the previous tag and the current word. In HMM, for the tag sequence decoding problem, probabilities are obtained by training on a text corpus. In MEMM, we build a distribution by adding features, which can be hand-crafted or picked out by training. The idea is to select the maximum entropy distribution given the constraints specified by the features. MEMM is more flexible because we can add features such as capitalization, hyphens or word endings, which are hard to incorporate in HMM.
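The contrast can be written out in the standard textbook forms (notation ours):

```latex
\text{HMM (joint):} \qquad
\hat{T} = \arg\max_T \prod_{i=1}^{L} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

\text{MEMM (conditional):} \qquad
\hat{T} = \arg\max_T \prod_{i=1}^{L} P(t_i \mid t_{i-1}, w_i)
```

HMM multiplies an observation likelihood by a transition prior; MEMM replaces both with a single conditional that can look at the word directly, which is why arbitrary word features are easy to add.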

MEMM allows for diverse, non-independent features.

What are some shortcomings of MEMM? MEMM suffers from what's called the **label bias problem**. Once we're in a state or label, the next observation selects one of the transitions leaving that state. However, the model as a whole has many more transitions. If a state has only one outgoing transition, the observation has no influence. Simply put, transition scores are normalized on a per-state basis.

Consider a character-level input sequence "rib". We have an MEMM to select between "rib" and "rob". When we receive 'r', both paths 0-1 and 0-4 get equal probability mass. When 'i' arrives, we have nowhere to go from state-4 except to state-5, and the entire probability mass is passed along this path. When 'b' arrives at state-3, both paths are equally likely.
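The effect of per-state normalization can be sketched numerically. The toy model below uses made-up raw scores and the state numbering of the example above; even when the observed sequence is "rob", both paths tie, because states 1 and 4 each have a single successor that must absorb all the probability mass:

```python
# Hypothetical raw transition scores for a toy MEMM.
# States: 0 = start; "rib" path: 0->1->2->3; "rob" path: 0->4->5->6.
# scores[state][observation] maps next_state -> raw score.
scores = {
    0: {'r': {1: 1.0, 4: 1.0}},          # 'r' is ambiguous: two successors
    1: {'i': {2: 1.0}, 'o': {2: 0.1}},   # state 1 has only one successor
    4: {'i': {5: 0.1}, 'o': {5: 1.0}},   # state 4 has only one successor
    2: {'b': {3: 1.0}},
    5: {'b': {6: 1.0}},
}

def local_prob(state, obs):
    """Per-state (local) normalization: successor probs sum to 1."""
    raw = scores[state][obs]
    z = sum(raw.values())
    return {s: v / z for s, v in raw.items()}

def path_prob(path, obs_seq):
    """Product of locally normalized transition probabilities."""
    p = 1.0
    for (s, nxt), o in zip(zip(path, path[1:]), obs_seq):
        p *= local_prob(s, o)[nxt]
    return p

# Observations "rob" clearly favour the 0-4-5-6 path, yet both paths tie:
print(path_prob([0, 1, 2, 3], "rob"))  # 0.5
print(path_prob([0, 4, 5, 6], "rob"))  # 0.5
```

Note that the low raw score 0.1 for leaving state 1 on 'o' is erased by normalization. Under global normalization, as in CRFs, that evidence would survive and the "rob" path would win.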

The label bias problem was first noted by L. Bottou in 1991. **Conditional Random Fields (CRFs)** are an alternative model that solves this.

## Milestones

In 1996, Berger et al. at the IBM Watson Research Center note that the concept of maximum entropy is not new but hasn't been common in real-world applications due to its high computational requirements. This is changing with more powerful computers. They apply MaxEnt to NLP tasks such as sentence segmentation and word reordering. They show that MaxEnt and maximum likelihood give the same result:

> The model with the greatest entropy consistent with the constraints is the same as the exponential model which best predicts the sample of data.

Also in 1996, Adwait Ratnaparkhi at the University of Pennsylvania applies a MaxEnt model along with a Markov model to the task of part-of-speech tagging. He simply calls it *Maximum Entropy Model*. The model is able to use rich contextual features and achieves a state-of-the-art accuracy of 96.6%. This work leads to his PhD in 1998. The model considers the past two POS tags, and features of the past two and next two words, combining them into a single exponential model.

In 1999, Saul and Rahim make use of **Markov Processes on Curves (MPCs)** for automatic speech recognition. The acoustic feature trajectory is a continuous variable that evolves jointly with phonetic transcriptions modelled as discrete states. This work later inspires the MEMM of McCallum et al., who apply the idea to discrete time.

In 2000, McCallum et al. coin the term **Maximum-Entropy Markov Model (MEMM)**. They apply it to information extraction and segmentation. They note that the three well-known problems solved by HMM can also be solved by MEMM. The exponential models of MEMM can be trained using **Generalized Iterative Scaling (GIS)**, which is similar to the Expectation-Maximization (EM) algorithm used in HMM.

In 2001, Lafferty et al. propose an alternative discriminative model called **Conditional Random Fields (CRFs)** to overcome the label bias problem of MEMM. MEMM can be biased towards states with fewer successor states. While MEMM uses per-state exponential models for the conditional probabilities, CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence. In other words, CRF uses global normalization instead of per-state normalization. However, CRF is slower to train than MEMM.

## References

- Alibaba Cloud. 2018. "HMM, MEMM, and CRF: A Comparative Analysis of Statistical Modeling Methods." Medium, May 18. Accessed 2019-10-08.
- Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics, vol. 22, no. 1. Accessed 2019-10-08.
- Blunsom, Phil. 2004. "Maximum Entropy Markov Models for Semantic Role Labelling." Proceedings of the Australasian Language Technology Workshop, pp. 109-116, December. Accessed 2019-10-08.
- Brilliant. 2019. "Markov Chains." Accessed 2019-10-08.
- Glen, Stephanie. 2017. "Multinomial Logistic Regression: Definition and Examples." Statistics How To, January 26. Accessed 2019-10-08.
- Google Developers. 2019. "Multi-Class Neural Networks: Softmax." Machine Learning Crash Course, Google. Accessed 2019-10-08.
- Gui, Weiwei, Huiji Gao, Jun Shi, and Bo Long. 2019. "Deep Natural Language Processing for Search Systems." Tutorial, 42nd International ACM SIGIR Conference, July 21-25. Accessed 2019-10-08.
- Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
- Kim, Yohan. 2001. "Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions." Technical report, University of California. Accessed 2019-10-08.
- Lafferty, John, Andrew McCallum, and Fernando C.N. Pereira. 2001. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." University of Pennsylvania, June. Accessed 2019-10-08.
- De Martino, Andrea, and Daniele De Martino. 2018. "An introduction to the maximum entropy approach and its application to inference problems in biology." Heliyon, vol. 4, no. 4, e00596, April. Accessed 2019-10-08.
- McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
- Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
- Saul, Lawrence K. and Mazin G. Rahim. 1999. "Markov Processes on Curves for Automatic Speech Recognition." Advances in Neural Information Processing Systems 11 (NIPS 1998), M. J. Kearns and S. A. Solla and D. A. Cohn, eds., pp. 751-760, MIT Press. Accessed 2019-10-08.
- Siddiqi, MH, MGR Alam, CS Hong, AM Khan, and H Choo. 2016. "A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition." PLoS ONE, vol. 11, no. 9, e0162702. Accessed 2019-10-08.

## Further Reading

- Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
- McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
- Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
- Raschka, Sebastian. 2019. "What is Softmax regression and how is it related to Logistic regression?" Machine Learning FAQ, September 22. Accessed 2019-10-08.
- Godoy, Daniel. 2018. "Understanding binary cross-entropy / log loss: a visual explanation." Towards Data Science, via Medium, November 21. Accessed 2019-10-08.
- Batista, David S. 2017. "Maximum Entropy Markov Models and Logistic Regression." Blog, November 12. Accessed 2019-10-08.

## See Also

- Logistic Regression
- Viterbi Algorithm
- Hidden Markov Model
- Conditional Random Field
- Time Series Analysis
- Natural Language Processing