# Maximum-Entropy Markov Model

## Summary

There are many systems where there's a time or state dependency. These systems evolve in time through a sequence of states and current state is influenced by past states. For example, there's a high chance of rain today if it had rained yesterday. Other examples include stock prices, DNA sequencing, human speech or words in a sentence.

One problem is that we may have observations but not the states. For example, we know the sequence of words but not the corresponding part-of-speech tags. We therefore model the tags as states and use the observed words to predict the most probable sequence of tags. This is exactly what Maximum-Entropy Markov Model (MEMM) can do.^{}

MEMM is a model that makes use of state-time dependencies. It uses predictions of the past and the current observation to make current prediction.

## Milestones

2001

## Discussion

Which are the building blocks of Maximum-Entropy Markov Model? Let's consider the task of email spam detection. Using

**logistic regression**, we can obtain the probability that an email is spam. It's essentially binary classification. We can extend this to a multiclass problem. For example, in image analysis, we're required to classify the object into one of many classes. We estimate the probability for each class.^{}Rather than take a hard decision on one of the outcomes, it's better to output probabilities, which will benefit downstream tasks.^{}Multinomial logistic regression is also called softmax regression or

**Maximum Entropy (MaxEnt)**classifier.^{}Entropy's related to "disorder". Higher the disorder, less predictable the outcomes, and hence more information. For example, an unbiased coin has more information (and entropy) than a coin that mostly lands up heads. MaxEnt is therefore about picking a probability distribution that maximizes entropy.^{}Then, there's

**Markov Chain**. It models a system as a set of states with probabilities assigned to state transitions.^{}While MaxEnt computes probabilities for each input independently, Markov chain recognizes that there's dependency from one state to the next.Thus, MEMM is about maximizing entropy plus using state dependencies (Markov Model).

^{}Could you explain MEMM with an example? Consider the problem of POS tagging. Given a sequence of words, we wish to find the most probable sequence of tags. MEMM predicts the tag sequence by modelling tags as states of the Markov chain. To predict a tag, MEMM uses the current word and the tag assigned to the previous word.

^{}In reality, MEMM doesn't take a hard decision on what tag to select. Using MaxEnt, it calculates the probability of each tag. Probabilities of past tags are used to take a soft decision so that the word sequence as a whole get the best possible tag sequence. The algorithm that can do this efficiently is the

**Viterbi algorithm**.^{}The formula in the figure assumes a sequence of L words. Z is a normalizing term so that probabilities are in range [0, 1] and they all add up to 1.

^{}What are some applications where MEMM is used? MEMM has been used in many NLP tasks including POS tagging

^{}and semantic role modelling.^{}It's been used in computational biology for protein secondary structure prediction.^{}One study used MEMM for recognizing human facial expressions. Human expressions are modelled as states. Observations are from video sensors.

^{}How is MEMM different from Hidden Markov Model (HMM)? HMM is a

**generative model**because words as modelled as observations generated from hidden states. Even the formulation of probabilities uses likelihood P(W|T) and prior P(T). On the other hand, MEMM is a**discriminative model**. This is because it directly uses posterior probability P(T|W); that is, probability of a tag sequence given a word sequence. Thus, it discriminates among the possible tag sequences.^{}It can also be said that HMM uses

**joint probability**for maximizing the probability of the word sequence. MEMM uses**conditional probability**, conditioned on previous tag and current word.^{}In HMM, for the tag sequence decoding problem, probabilities are obtained by training on a text corpus. In MEMM, we build a distribution by adding features, which can be hand crafted or picked out by training. The idea is to select the maximum entropy distribution given the constraints specified by the features. MEMM is more flexible because we can add features such as capitalization, hyphens or word endings, which are hard to consider in HMM.

^{}MEMM allows for diverse non-independent features.^{}What are some shortcomings of MEMM? MEMM suffers from what's called the

**label bias problem**. Once we're in a state or label, the next observation will select one of many transitions leaving that state. However, the model as a whole would have many more transitions. If a state has only one outgoing transition, the observation has no influence. Simply put, transition scores are normalized on a per-state basis.^{}Consider a character-level input sequence "rib". We have an MEMM to select between "rib" and "rob". When we receive 'r', both paths 0-1 and 0-4 get equal probability mass. When 'i' arrives, we have nowhere to go from state-4 except to state-5 and pass on the entire probability mass along this path. When 'b' arrives, at state-3, both paths are equally likely.

^{}The label bias problem was first noted by L. Bottou in 1991.

**Conditional Random Fields (CRFs)**is an alternative model that solves this.^{}

## References

- Alibaba Cloud. 2018. "HMM, MEMM, and CRF: A Comparative Analysis of Statistical Modeling Methods." Medium, May 18. Accessed 2019-10-08.
- Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics, vol. 22, no. 1. Accessed 2019-10-08.
- Blunsom, Phil. 2004. "Maximum Entropy Markov Models for Semantic Role Labelling." Proceedings of the Australasian Language Technology Workshop, pp. 109-116, December. Accessed 2019-10-08.
- Brilliant. 2019. "Markov Chains." Accessed 2019-10-08.
- Glen, Stephanie. 2017. "Multinomial Logistic Regression: Definition and Examples." Statistics How To, January 26. Accessed 2019-10-08.
- Google Developers. 2019. "Multi-Class Neural Networks: Softmax." Machine Learning Crash Course, Google. Accessed 2019-10-08.
- Gui, Weiwei, Huiji Gao, Jun Shi, and Bo Long. 2019. "Deep Natural Language Processing for Search Systems." Tutorial, 42nd International ACM SIGIR Conference, July 21-25. Accessed 2019-10-08.
- Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
- Kim, Yohan. 2001. "Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions." Technical report, University of California. Accessed 2019-10-08.
- Lafferty, John, Andrew McCallum, and Fernando C.N. Pereira. 2001. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." University of Pennsylvania, June. Accessed 2019-10-08.
- Martino, De Andrea and Daniele De Martino. 2018. "An introduction to the maximum entropy approach and its application to inference problems in biology." Heliyon, vol. 4, no. 4, e00596, April. Accessed 2019-10-08.
- McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
- Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
- Saul, Lawrence K. and Mazin G. Rahim. 1999. "Markov Processes on Curves for Automatic Speech Recognition." Advances in Neural Information Processing Systems 11 (NIPS 1998), M. J. Kearns and S. A. Solla and D. A. Cohn, eds., pp. 751-760, MIT Press. Accessed 2019-10-08.
- Siddiqi, MH, MGR Alam, CS Hong, AM Khan, and H Choo. 2016. "A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition." PLoS ONE, vol. 11, no. 9, e0162702. Accessed 2019-10-08.

## Milestones

2001

## Tags

## See Also

- Logistic Regression
- Viterbi Algorithm
- Hidden Markov Model
- Conditional Random Field
- Time Series Analysis
- Natural Language Processing

## Further Reading

- Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
- McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
- Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
- Rashka, Sebastian. 2019. "What is Softmax regression and how is it related to Logistic regression?" Machine Learning FAQ, September 22. Accessed 2019-10-08.
- Godoy, Daniel. 2018. "Understanding binary cross-entropy / log loss: a visual explanation." Towards Data Science, via Medium, November 21. Accessed 2019-10-08.
- Batista, David S. 2017. "Maximum Entropy Markov Models and Logistic Regression." Blog, November 12. Accessed 2019-10-08.