Hidden Markov Model
 Summary

Consider weather, stock prices, DNA sequences, human speech or words in a sentence. In all these cases, the current state is influenced by one or more previous states. Moreover, often we can observe the effect but not the underlying cause, which remains hidden from the observer. The Hidden Markov Model (HMM) helps us figure out the most probable hidden state given an observation.
In practice, we use a sequence of observations to estimate the sequence of hidden states. In an HMM, the next state depends only on the current state. As such, it's good for modelling time series data.
We can classify HMM as a generative probabilistic model since a sequence of observed variables is generated by a sequence of hidden states. HMM can also be seen as a specific kind of Bayesian network.
Discussion

Could you explain HMM with an example?
Suppose Bob tells his friend Alice what he did earlier today. Based on this information, Alice guesses today's weather at Bob's location. In HMM, we model weather as states and Bob's activity as observations.
To solve this problem, Alice needs to know three things:
 Transition Probabilities: Probability of moving from one state to another. For example, "If today was sunny, what's the probability that it will rain tomorrow?" If there are N states, this is an N×N matrix.
 Emission Probabilities: Probability of a particular output given a particular state. For example, "What's the chance that Bob is walking if it's raining?" Given a choice of M possible observation symbols, this is an N×M matrix. These are also called output or observation probabilities.
 Initial Probabilities: Probability of being in a state at the start, say, yesterday or ten days ago.
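As a sketch, the three pieces of Alice's knowledge could be written down as follows. The numbers and symbol names are illustrative assumptions, not values from the text:

```python
# Hypothetical parameters for the Alice/Bob weather example.
states = ["Rainy", "Sunny"]               # N = 2 hidden states
symbols = ["walk", "shop", "clean"]       # M = 3 observation symbols

initial = {"Rainy": 0.6, "Sunny": 0.4}    # pi: state distribution at the start

transition = {                            # A: N x N, one row per current state
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

emission = {                              # B: N x M, one row per hidden state
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

# Every row of A and B, and pi itself, is a probability distribution.
assert abs(sum(initial.values()) - 1.0) < 1e-9
for s in states:
    assert abs(sum(transition[s].values()) - 1.0) < 1e-9
    assert abs(sum(emission[s].values()) - 1.0) < 1e-9
```

With these hypothetical numbers, `transition["Sunny"]["Rainy"]` answers the question above: the probability that it rains tomorrow given that today was sunny.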
Unlike a typical Markov chain, we can't see the states in HMM. However, we can observe the output and then predict the state. Thus, the states are hidden, giving rise to the term "hidden" in the name HMM.

What types of problems can be solved by HMM?
Let A, B and π denote the transition matrix, observation matrix and initial state distribution respectively. An HMM can be represented as λ = (A, B, π). Let the observation sequence be O and the state sequence be Q.
HMM can be used to solve three types of problems:
 Likelihood Problem: Given O and λ, find the likelihood P(O|λ). How likely is a particular sequence of observations? The forward algorithm solves this problem.
 Decoding Problem: Given O and λ, find the best possible Q that explains O. Given the observation sequence, what's the best possible state sequence? The Viterbi algorithm solves this problem.
 Learning Problem: Given O and the set of possible states, learn λ, perhaps by maximizing P(O|λ). What model best maps states to observations? The Baum-Welch algorithm, also called the forward-backward algorithm, solves this problem. In the language of machine learning, we can say that O is training data and the number of states N is the model's hyperparameter.

What are some applications where HMM is useful?
HMM has been applied in many areas including automatic speech recognition, handwriting recognition, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
In speech recognition, a spectral analysis of speech gives us suitable observations for HMM. States are modelled after phonemes or syllables, or after the average number of observations in a spoken word. Each word gets its own model.
To tag words with their parts of speech, the tags are modelled as hidden states and the words are the observations.
In computer networking, HMMs are used in intrusion detection systems. This has two flavours: anomaly detection, in which normal behaviour is modelled, or misuse detection, in which a predefined set of attacks is modelled.
In computer vision, HMM has been used to label human activities from skeleton output. Each activity is modelled with an HMM. By linking multiple HMMs on common states, a compound HMM is formed. The purpose is to allow robots to be aware of human activity.

What are the different types of Hidden Markov Models?
In the typical model, called the ergodic HMM, the states of the HMM are fully connected so that we can transition to a state from any other state. The left-right HMM is a more constrained model in which state transitions are allowed only from lower-indexed states to higher-indexed ones. Variations and combinations of these two types are possible, such as having two parallel left-to-right state paths.
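The difference shows up in the structure of the transition matrix A. In a left-right HMM the lower triangle of A is zero, so the state index never decreases, whereas in an ergodic HMM every entry may be non-zero. A minimal sketch with hypothetical numbers:

```python
# Hypothetical 4-state left-right HMM: transitions go only from
# lower-indexed states to higher-indexed ones (or stay in place).
left_right_A = [
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],   # the final state absorbs
]

for i, row in enumerate(left_right_A):
    assert abs(sum(row) - 1.0) < 1e-9       # each row is a distribution
    assert all(p == 0.0 for p in row[:i])   # no transitions to earlier states
```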
HMM started with observations of discrete symbols governed by discrete probabilities. If observations are continuous signals, then we would use a continuous observation density.
There are also domain-specific variations of HMM. For example, in biological sequence analysis, there are at least three types including profile-HMMs, pair-HMMs, and context-sensitive HMMs.

Could you explain forward algorithm and backward algorithm?
Every state sequence has a probability that it will lead to a given sequence of observations. Given T observations and N states, there are \(N^T\) possible state sequences. Thus, the complexity of calculating the probability of a given sequence of observations is \(O(N^{T}T)\). Both forward and backward algorithms bring down the complexity to \(O(N^{2}T)\) through dynamic programming.
In the forward algorithm, we consider the probability of being in a state at the current time step. Then we consider the transition probabilities to calculate the state probabilities for the next step. Thus, at each time step we have considered all state sequences preceding it. The algorithm is more efficient since it reuses calculations from earlier steps. Instead of keeping all path sequences, paths are folded into a forward trellis.
The backward algorithm is similar, except that we start from the last time step and calculate in reverse. We're finding the probability that from a given state, the model will generate the output sequence that follows.
A combination of both algorithms, called the forward-backward algorithm, is used to solve the learning problem.
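Both recursions can be sketched on a toy weather model. All numbers here are illustrative assumptions; the point is that the forward and backward passes arrive at the same likelihood P(O|λ), each in \(O(N^{2}T)\) time:

```python
# Toy weather HMM; the numbers are illustrative assumptions.
states = ["Rainy", "Sunny"]
init = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward(obs):
    """P(O|lambda): alpha[s] is the probability of having generated the
    observations so far and being in state s now (the folded trellis)."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

def backward(obs):
    """Same likelihood from the other end: beta[s] is the probability of
    generating the remaining observations starting from state s."""
    beta = {s: 1.0 for s in states}
    for o in reversed(obs[1:]):
        beta = {s: sum(trans[s][n] * emit[n][o] * beta[n] for n in states)
                for s in states}
    return sum(init[s] * emit[s][obs[0]] * beta[s] for s in states)

obs = ["walk", "shop", "clean"]
assert abs(forward(obs) - backward(obs)) < 1e-12   # both agree: ~0.0336
```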

What's the algorithm for solving HMM's decoding problem?
The Viterbi algorithm solves HMM's decoding problem. It's similar to the forward algorithm except that instead of summing the probabilities of all paths leading to a state, we retain only the one path that gives maximum probability. Thus, at every time step or iteration, given that we have N states, we retain only N paths, the most likely path for each state. For the next iteration, we use the most likely paths of the current iteration and repeat the process.
When we reach the end of the sequence, we'll have N most likely paths, each ending in a unique state. We then select the most likely end state. Once this selection is made, we backtrack to read the state sequence, that is, how we got to the end state. This state sequence is now the most likely sequence given our sequence of observations.
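The max-and-backtrack procedure can be sketched on the same toy weather model (all numbers are illustrative assumptions):

```python
# Toy weather HMM; the numbers are illustrative assumptions.
states = ["Rainy", "Sunny"]
init = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    """Most likely state sequence for the given observations."""
    # delta[s]: probability of the single best path ending in state s.
    delta = {s: init[s] * emit[s][obs[0]] for s in states}
    backpointers = []              # which predecessor each best path used
    for o in obs[1:]:
        step, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: delta[p] * trans[p][s])
            ptr[s] = best
            step[s] = delta[best] * trans[best][s] * emit[s][o]
        delta = step
        backpointers.append(ptr)
    # Select the most likely end state, then backtrack to read the path.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["walk", "shop", "clean"]))   # ['Sunny', 'Rainy', 'Rainy']
```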

How can we solve the learning problem of HMM?
In HMM's learning problem, we are required to learn the transition (A) and observation (B) probabilities when given a sequence of observations and the vocabulary of hidden states. The forward-backward algorithm solves this problem. It's an iterative algorithm: it starts with an initial estimate of the probabilities and improves these estimates with each iteration.
The algorithm consists of two steps:
 Expectation or E-step: We compute the expected state occupancy count and the expected state transition count based on the current probabilities A and B.
 Maximization or M-step: We use the expected counts from the E-step to recompute A and B.
While this algorithm is unsupervised, in practice, initial conditions are very important. For this reason, often extra information is given to the algorithm. For example, in speech recognition, the HMM structure is set manually and the model is trained to set the initial probabilities.
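The two steps can be sketched in code. This is a minimal illustration with hypothetical states, symbols, observations and starting numbers; each call runs one E-step and one M-step and returns the likelihood under the parameters it was given, which EM guarantees will not decrease from one iteration to the next:

```python
states = ["Rainy", "Sunny"]
symbols = ["walk", "shop", "clean"]
obs = ["walk", "shop", "clean", "walk", "walk", "shop"]

def em_step(init, A, B):
    """One forward-backward (Baum-Welch) iteration: returns re-estimated
    parameters and the likelihood P(O|lambda) before the update."""
    T = len(obs)
    # Forward and backward trellises under the current parameters.
    alpha = [{s: init[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: sum(alpha[-1][p] * A[p][s] for p in states) * B[s][o]
                      for s in states})
    beta = [dict.fromkeys(states, 1.0)]
    for o in reversed(obs[1:]):
        beta.insert(0, {s: sum(A[s][n] * B[n][o] * beta[0][n] for n in states)
                        for s in states})
    ll = sum(alpha[-1][s] for s in states)
    # E-step: expected state occupancy (gamma) and transition (xi) counts.
    gamma = [{s: alpha[t][s] * beta[t][s] / ll for s in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / ll
           for i in states for j in states} for t in range(T - 1)]
    # M-step: recompute pi, A and B from the expected counts.
    new_init = {s: gamma[0][s] for s in states}
    new_A = {i: {j: sum(x[i, j] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states} for i in states}
    new_B = {i: {k: sum(g[i] for t, g in enumerate(gamma) if obs[t] == k)
                 / sum(g[i] for g in gamma) for k in symbols} for i in states}
    return new_init, new_A, new_B, ll

# Deliberately rough starting estimates; each iteration improves them.
init = {"Rainy": 0.5, "Sunny": 0.5}
A = {s: dict.fromkeys(states, 0.5) for s in states}
B = {"Rainy": {"walk": 0.2, "shop": 0.4, "clean": 0.4},
     "Sunny": {"walk": 0.5, "shop": 0.3, "clean": 0.2}}

init, A, B, ll1 = em_step(init, A, B)
init, A, B, ll2 = em_step(init, A, B)
assert ll2 >= ll1    # EM never decreases the likelihood
```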

Could you describe some tools for doing HMM?
In Python, the hmmlearn package implements HMM. Three models are available: hmm.GaussianHMM, hmm.GMMHMM and hmm.MultinomialHMM. This package was also part of Scikit-learn but was removed in v0.17. Stephen Marsland has shared Python code in NumPy and Pandas that implements many essential algorithms for HMM.
In R, the HMM package implements HMM. It has functions for forward, backward, Viterbi and Baum-Welch algorithms. Another package, depmixS4, implements dependent mixture models that can be used to fit HMM to observed data. R-bloggers has an example use of depmixS4.
Milestones
1906. Russian mathematician A. A. Markov recognizes that in a sequence of random variables, one variable may not be independent of the previous variable. For example, two successive coin tosses are independent, but today's weather might depend on yesterday's weather. He models this as a chain of linked events with a probability assigned to each link. This technique is later named the Markov chain.
1966. Baum and Petrie at the Institute for Defense Analyses, Princeton, introduce the Hidden Markov Model (HMM), though this name is not used. They state the problem of estimating transition and emission probabilities from observations. They use maximum likelihood estimation.
1967. Andrew Viterbi publishes an algorithm to decode information at the receiver in a communication system. Later named the Viterbi algorithm, it's directly applicable to the decoding problem in HMM. Vintsyuk first applies this algorithm to speech and language processing in 1968.
1970. The Baum-Welch algorithm is proposed to solve the learning problem in HMM. This algorithm is a special case of the Expectation-Maximization (EM) algorithm. However, the name HMM is not used in the paper and mathematicians refer to HMMs as "probabilistic functions of Markov chains".
1975. James Baker at CMU applies HMM to speech recognition in the DRAGON speech understanding system. This is one of the earliest engineering applications of HMM. HMM is further applied to speech recognition through the 1970s and 1980s by Jelinek, Bahl and Mercer at IBM.
1989. Lawrence Rabiner publishes a tutorial on HMM covering theory, practice and applications. He notes that HMM originated in mathematics and was not widely read by engineers. Even when it was applied to speech processing in the 1970s, there were no tutorials to help translate theory into practice.
2003. HMM is typically used when the number of states is small, but one research team applies it to large-scale web traffic analysis. This involves hundreds of states and tens of millions of observations.
Sample Code
References
 Baker, J. 1975. "The DRAGON System - An Overview." IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 24-29, February. Accessed 2019-09-04.
 Baum, Leonard E. and Ted Petrie. 1966. "Statistical Inference for Probabilistic Functions of Finite State Markov Chains." Ann. Math. Statist., vol. 37, no. 6, pp. 1554-1563. Accessed 2019-09-04.
 Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. 1970. "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains." The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164-171, February. Accessed 2019-09-04.
 Bioinformatics Algorithms. 2015. "The Viterbi Algorithm." Bioinformatics Algorithms: An Active Learning Approach, via YouTube, July 26. Accessed 2019-09-06.
 Chugg, Keith. 2017. "Viterbi Algorithm." University of Southern California, via YouTube, March. Accessed 2019-09-06.
 Felzenszwalb, Pedro F., Daniel P. Huttenlocher, and Jon M. Kleinberg. 2003. "Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis." Advances in Neural Information Processing Systems (NIPS) 16. Accessed 2019-09-04.
 Figueroa-Angulo, Jose Israel, Jesus Savage, Ernesto Bribiesca, Boris Escalante, and Luis Enrique Sucar. 2015. "Compound Hidden Markov Model for Activity Labelling." Int. J. of Intelligence Science, Scientific Research, vol. 5, no. 5, pp. 177-195, October. Accessed 2019-09-04.
 Gales, Mark and Steve Young. 2007. "The Application of Hidden Markov Models in Speech Recognition." Foundations and Trends in Signal Processing, Now, vol. 1, no. 3, pp. 195-304. Accessed 2019-09-04.
 Ghahramani, Z. 2001. "An Introduction to Hidden Markov Models and Bayesian Networks." Int. J. of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9-42. Accessed 2019-09-04.
 Hayes, Brian. 2017. "First Links in the Markov Chain." American Scientist, February 06. Updated 2018-09-06. Accessed 2019-09-04.
 Himmelmann, Lin. 2015. "HMM - Hidden Markov Models." CRAN, February 19. Accessed 2019-09-04.
 Jana, Abhisek. 2019a. "Introduction to Hidden Markov Model." A Developer Diary, February 13. Accessed 2019-09-04.
 Jana, Abhisek. 2019b. "Forward and Backward Algorithm in Hidden Markov Model." A Developer Diary, February 17. Accessed 2019-09-04.
 Jurafsky, Daniel and James H. Martin. 2018. "Speech and Language Processing." Third Edition draft, September 23. Accessed 2019-09-04.
 Kang, Eugine. 2017. "Hidden Markov Model." Medium, September 01. Accessed 2019-09-04.
 Katahira, Kentaro, Kenta Suzuki, Kazuo Okanoya, and Masato Okada. 2011. "Complex Sequencing Rules of Birdsong Can be Explained by Simple Hidden Markov Processes." PLoS ONE, vol. 6, no. 9, e24516, September 07. Accessed 2019-09-04.
 Khreich, Wael, Eric Granger, Ali Miri, and Robert Sabourin. 2012. "A survey of techniques for incremental learning of HMM parameters." Information Sciences, Elsevier, vol. 197, pp. 105-130. Accessed 2019-09-04.
 Kouemou, Guy Leonard. 2011. "History and Theoretical Basics of Hidden Markov Models." IntechOpen. Accessed 2019-09-04.
 Rabiner, L.R. 1989. "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February. Accessed 2019-09-04.
 Scikit-learn. 2015. "Hidden Markov Models." v0.16.1, April 14. Accessed 2019-09-04.
 Smith, Adam. 2017. "The Viterbi Algorithm at 50." Viterbi School of Engineering, USC, March 16. Accessed 2019-09-04.
 Stamp, Mark. 2018. "A Revealing Introduction to Hidden Markov Models." San Jose State University, October 17. Accessed 2019-09-04.
 Visser, Ingmar. 2019. "depmixS4: Dependent Mixture Models - Hidden Markov Models of GLMs and Other Distributions in S4." Package depmixS4, v1.4-0, CRAN, July 10. Accessed 2019-09-04.
 Wikipedia. 2019. "Hidden Markov model." Wikipedia, August 27. Accessed 2019-09-04.
 Yoon, Byung-Jun. 2009. "Hidden Markov Models and their Applications in Biological Sequence Analysis." Current Genomics, vol. 10, no. 6, pp. 402-415, September. Accessed 2019-09-05.
 hmmlearn. 2019. "Tutorial." hmmlearn, June 24. Accessed 2019-09-04.
Further Reading
 Batista, David S. 2017. "Hidden Markov Model and Naive Bayes relationship." Blog, November 11. Accessed 2019-09-04.
 Kang, Eugine. 2017. "Hidden Markov Model." Medium, September 01. Accessed 2019-09-04.
 Rabiner, L.R. 1989. "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February. Accessed 2019-09-04.
 Bilmes, Jeff A. 1998. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models." International Computer Science Institute, TR-97-021, April. Accessed 2019-09-04.
See Also
 Viterbi Algorithm
 Expectation-Maximization Algorithm
 Speech Recognition
 Part-of-Speech Tagging
 Bayesian Network
 Markov Chain