Maximum-Entropy Markov Model
There are many systems that exhibit a time or state dependency. These systems evolve through a sequence of states, and the current state is influenced by past states. For example, there's a high chance of rain today if it rained yesterday. Other examples include stock prices, DNA sequences, human speech, and words in a sentence.
One problem is that we may have observations but not the states. For example, we know the sequence of words but not the corresponding part-of-speech tags. We therefore model the tags as states and use the observed words to predict the most probable sequence of tags. This is exactly what the Maximum-Entropy Markov Model (MEMM) does.^{}
MEMM is a model that makes use of state-time dependencies. It uses predictions from the past and the current observation to make the current prediction.
Discussion

What are the building blocks of the Maximum-Entropy Markov Model? Let's consider the task of email spam detection. Using logistic regression, we can obtain the probability that an email is spam. It's essentially binary classification. We can extend this to a multi-class problem. For example, in image analysis we're required to classify an object into one of many classes. We estimate the probability for each class.^{} Rather than taking a hard decision on one of the outcomes, it's better to output probabilities, which benefits downstream tasks.^{}
Multinomial logistic regression is also called softmax regression or the Maximum Entropy (MaxEnt) classifier.^{} Entropy is related to "disorder": the higher the disorder, the less predictable the outcomes, and hence the more information. For example, an unbiased coin has more information (and entropy) than a coin that mostly lands heads. MaxEnt is therefore about picking the probability distribution that maximizes entropy.^{}
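A small sketch makes this concrete; the coin probabilities below are illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# An unbiased coin has maximal entropy (1 bit); a biased coin has less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.47
```

The softmax function here is exactly what a MaxEnt classifier uses to turn per-class scores into probabilities that sum to 1.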
Then there's the Markov chain. It models a system as a set of states with probabilities assigned to state transitions.^{} While MaxEnt computes probabilities for each input independently, a Markov chain recognizes that there's a dependency from one state to the next.
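The idea can be sketched with a toy weather chain; the transition probabilities are made up for illustration:

```python
import random

# A toy weather Markov chain: tomorrow's state depends only on today's.
# Transition probabilities are illustrative, not estimated from data.
transitions = {
    'rain':  {'rain': 0.7, 'sunny': 0.3},
    'sunny': {'rain': 0.2, 'sunny': 0.8},
}

def next_state(current, rng=random):
    """Sample the next state from the current state's transition row."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return rng.choices(states, weights=weights)[0]

# Simulate a short sequence of states.
state = 'rain'
sequence = [state]
for _ in range(5):
    state = next_state(state)
    sequence.append(state)
print(sequence)
```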
Thus, MEMM is about maximizing entropy plus using state dependencies (Markov Model).^{}

Could you explain MEMM with an example? Consider the problem of POS tagging. Given a sequence of words, we wish to find the most probable sequence of tags. MEMM predicts the tag sequence by modelling tags as states of the Markov chain. To predict a tag, MEMM uses the current word and the tag assigned to the previous word.^{}
In reality, MEMM doesn't take a hard decision on what tag to select. Using MaxEnt, it calculates the probability of each tag. Probabilities of past tags are used to take a soft decision so that the word sequence as a whole gets the best possible tag sequence. The algorithm that can do this efficiently is the Viterbi algorithm.^{}
The formula in the figure assumes a sequence of L words. Z is a normalizing term so that probabilities are in range [0, 1] and they all add up to 1.^{}
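A minimal Viterbi sketch for an MEMM follows, assuming locally normalized conditionals P(tag | previous tag, word) are already available; the tag set and probability table are hypothetical toy values:

```python
# Viterbi decoding sketch for an MEMM. The conditional probability
# table below is a hypothetical stand-in for a trained MaxEnt model.
TAGS = ['NOUN', 'VERB']

def cond_prob(tag, prev_tag, word):
    """Stand-in for the MaxEnt model's P(tag | prev_tag, word).
    A real model would also use features of `word`."""
    table = {
        ('NOUN', '<s>'):  0.8, ('VERB', '<s>'):  0.2,
        ('NOUN', 'NOUN'): 0.3, ('VERB', 'NOUN'): 0.7,
        ('NOUN', 'VERB'): 0.6, ('VERB', 'VERB'): 0.4,
    }
    return table[(tag, prev_tag)]

def viterbi(words):
    """Return the most probable tag sequence under the MEMM."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (cond_prob(t, '<s>', words[0]), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            # Extend every previous best path and keep the maximum.
            p, path = max(
                (best[pt][0] * cond_prob(t, pt, word), best[pt][1])
                for pt in TAGS
            )
            new_best[t] = (p, path + [t])
        best = new_best
    return max(best.values())[1]

print(viterbi(['dogs', 'run']))  # ['NOUN', 'VERB']
```

Because each step keeps only the best path into each state, decoding runs in time linear in the sentence length rather than exponential.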

What are some applications where MEMM is used? MEMM has been used in many NLP tasks including POS tagging^{} and semantic role labelling.^{} It's been used in computational biology for protein secondary structure prediction.^{}
One study used MEMM for recognizing human facial expressions. Human expressions are modelled as states. Observations are from video sensors.^{}

How is MEMM different from Hidden Markov Model (HMM)? HMM is a generative model because words are modelled as observations generated from hidden states. Even the formulation of probabilities uses the likelihood P(W|T) and the prior P(T). On the other hand, MEMM is a discriminative model: it directly uses the posterior probability P(T|W), that is, the probability of a tag sequence given a word sequence. Thus, it discriminates among the possible tag sequences.^{}
It can also be said that HMM uses joint probability to maximize the probability of the word sequence, whereas MEMM uses conditional probability, conditioned on the previous tag and the current word.^{}
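Written out, following the standard textbook formulation, the two decoders differ as follows for a sequence of L words:

```latex
% HMM: generative, decomposes the joint via Bayes' rule
\hat{T} = \arg\max_T P(T \mid W)
        = \arg\max_T P(W \mid T)\, P(T)
        = \arg\max_T \prod_{i=1}^{L} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

% MEMM: discriminative, models the posterior directly
\hat{T} = \arg\max_T P(T \mid W)
        = \arg\max_T \prod_{i=1}^{L} P(t_i \mid t_{i-1}, w_i)
```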
In HMM, for the tag sequence decoding problem, probabilities are obtained by training on a text corpus. In MEMM, we build a distribution by adding features, which can be hand-crafted or picked out by training. The idea is to select the maximum entropy distribution given the constraints specified by the features. MEMM is more flexible because we can add features such as capitalization, hyphens or word endings, which are hard to consider in HMM.^{} MEMM allows for diverse non-independent features.^{}
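For instance, a feature function of the kind described might look like this; the feature names are illustrative:

```python
def word_features(word, prev_tag):
    """Hand-crafted features of the kind MEMM can use but HMM cannot.
    Feature names here are illustrative, not a standard set."""
    return {
        'word.lower': word.lower(),
        'is_capitalized': word[:1].isupper(),
        'has_hyphen': '-' in word,
        'suffix3': word[-3:],        # word ending, e.g. '-ing', '-own'
        'prev_tag': prev_tag,        # the Markov state dependency
    }

print(word_features('Well-known', 'DET'))
```

Note that these features overlap and are clearly not independent (capitalization and the lowercased word are correlated); MEMM handles this naturally, whereas HMM's generative factorization cannot.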

What are some shortcomings of MEMM? MEMM suffers from what's called the label bias problem. Once we're in a state or label, the next observation selects one of the transitions leaving that state. However, the model as a whole has many more transitions. If a state has only one outgoing transition, the observation has no influence. Simply put, transition scores are normalized on a per-state basis.^{}
Consider a character-level input sequence "rib". Suppose an MEMM must select between "rib" and "rob". When we receive 'r', paths 0-1 and 0-4 get equal probability mass. When 'i' arrives, there is nowhere to go from state 4 except state 5, so the entire probability mass passes along this path. When 'b' arrives at state 3, both paths remain equally likely.^{}
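The effect can be reproduced numerically: with per-state normalization, a state with a single outgoing transition passes on its full mass no matter how poorly the observation matches. A toy sketch:

```python
def normalize(scores):
    """Per-state normalization: scale outgoing scores to sum to 1."""
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}

# From state 4 the only transition is to state 5. Even a tiny raw score
# (observation 'i' barely matches) normalizes to probability 1.0.
print(normalize({'state5': 0.001}))              # {'state5': 1.0}
# From state 0, observation 'r' splits mass between the two paths.
print(normalize({'state1': 0.5, 'state4': 0.5}))
```

This is exactly the label bias problem: normalizing per state means the observation cannot down-weight a path once that path has no competing transitions.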
The label bias problem was first noted by L. Bottou in 1991. Conditional Random Fields (CRFs) is an alternative model that solves this.^{}
Milestones
Berger et al. at the IBM Watson Research Center note that the concept of maximum entropy is not new but has not been common in real-world applications due to high computational requirements. This is changing due to more powerful computers. They apply MaxEnt to NLP tasks such as sentence segmentation and word reordering. They show that MaxEnt and maximum likelihood give the same result:^{}
The model with the greatest entropy consistent with the constraints is the same as the exponential model which best predicts the sample of data.
Adwait Ratnaparkhi at the University of Pennsylvania applies a MaxEnt model along with a Markov model to the task of part-of-speech tagging. He simply calls it a Maximum Entropy Model. The model is able to use rich contextual features. It achieves state-of-the-art accuracy of 96.6%.^{} This work leads to his PhD in 1998. The model considers the past two POS tags, and features of the past two and next two words. It combines them into a single exponential model.^{}
Saul and Rahim make use of Markov Processes on Curves (MPCs) for automatic speech recognition. The acoustic feature trajectory is a continuous variable that evolves jointly with the phonetic transcription, modelled as discrete states.^{} This work later inspires the MEMM of McCallum et al., who apply it to discrete time.^{}
McCallum et al. coin the term Maximum-Entropy Markov Model (MEMM). They apply it to information extraction and segmentation. They note that the three well-known problems solved by HMM can also be solved by MEMM. Exponential models of MEMM can be trained using Generalized Iterative Scaling (GIS), which is similar to the Expectation-Maximization (EM) algorithm used in HMM.^{}
2001
Lafferty et al. propose an alternative discriminative model called Conditional Random Fields (CRFs) to overcome the label bias problem of MEMM. MEMM can be biased towards states with fewer successor states. While MEMM uses per-state exponential models for the conditional probabilities, CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence.^{} In other words, CRF uses global normalization instead of per-state normalization.^{} However, CRF is slower to train than MEMM.^{}
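The difference in normalization can be written explicitly, where the f_k are feature functions with learned weights λ_k:

```latex
% MEMM: one normalizer Z per state and observation (local)
P(T \mid W) = \prod_{i=1}^{L}
  \frac{\exp\!\big(\sum_k \lambda_k f_k(t_{i-1}, t_i, W, i)\big)}
       {Z(t_{i-1}, w_i)}

% CRF: one normalizer Z over all label sequences (global)
P(T \mid W) =
  \frac{\exp\!\big(\sum_{i=1}^{L} \sum_k \lambda_k f_k(t_{i-1}, t_i, W, i)\big)}
       {Z(W)}
```

Computing the global Z(W) requires summing over all label sequences, which is why CRF training is slower.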
References
 Alibaba Cloud. 2018. "HMM, MEMM, and CRF: A Comparative Analysis of Statistical Modeling Methods." Medium, May 18. Accessed 2019-10-08.
 Berger, Adam L., Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics, vol. 22, no. 1. Accessed 2019-10-08.
 Blunsom, Phil. 2004. "Maximum Entropy Markov Models for Semantic Role Labelling." Proceedings of the Australasian Language Technology Workshop, pp. 109-116, December. Accessed 2019-10-08.
 Brilliant. 2019. "Markov Chains." Accessed 2019-10-08.
 Glen, Stephanie. 2017. "Multinomial Logistic Regression: Definition and Examples." Statistics How To, January 26. Accessed 2019-10-08.
 Google Developers. 2019. "Multi-Class Neural Networks: Softmax." Machine Learning Crash Course, Google. Accessed 2019-10-08.
 Gui, Weiwei, Huiji Gao, Jun Shi, and Bo Long. 2019. "Deep Natural Language Processing for Search Systems." Tutorial, 42nd International ACM SIGIR Conference, July 21-25. Accessed 2019-10-08.
 Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
 Kim, Yohan. 2001. "Application of Maximum Entropy Markov Models on the Protein Secondary Structure Predictions." Technical report, University of California. Accessed 2019-10-08.
 Lafferty, John, Andrew McCallum, and Fernando C.N. Pereira. 2001. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." University of Pennsylvania, June. Accessed 2019-10-08.
 De Martino, Andrea, and Daniele De Martino. 2018. "An introduction to the maximum entropy approach and its application to inference problems in biology." Heliyon, vol. 4, no. 4, e00596, April. Accessed 2019-10-08.
 McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
 Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
 Saul, Lawrence K. and Mazin G. Rahim. 1999. "Markov Processes on Curves for Automatic Speech Recognition." Advances in Neural Information Processing Systems 11 (NIPS 1998), M. J. Kearns, S. A. Solla, and D. A. Cohn, eds., pp. 751-760, MIT Press. Accessed 2019-10-08.
 Siddiqi, M. H., M. G. R. Alam, C. S. Hong, A. M. Khan, and H. Choo. 2016. "A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition." PLoS ONE, vol. 11, no. 9, e0162702. Accessed 2019-10-08.
Further Reading
 Jurafsky, Daniel and James H. Martin. 2009. "Hidden Markov and Maximum Entropy Models." Chapter 6 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-10-08.
 McCallum, Andrew, Dayne Freitag, and Fernando Pereira. 2000. "Maximum Entropy Markov Models for Information Extraction and Segmentation." Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 591-598, June 29 - July 02. Accessed 2019-10-08.
 Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Conference on Empirical Methods in Natural Language Processing, pp. 133-142. Accessed 2019-10-08.
 Raschka, Sebastian. 2019. "What is Softmax regression and how is it related to Logistic regression?" Machine Learning FAQ, September 22. Accessed 2019-10-08.
 Godoy, Daniel. 2018. "Understanding binary cross-entropy / log loss: a visual explanation." Towards Data Science, via Medium, November 21. Accessed 2019-10-08.
 Batista, David S. 2017. "Maximum Entropy Markov Models and Logistic Regression." Blog, November 12. Accessed 2019-10-08.
See Also
 Logistic Regression
 Viterbi Algorithm
 Hidden Markov Model
 Conditional Random Field
 Time Series Analysis
 Natural Language Processing