• HMM states (X), observations (O) and probabilities (A, B). Source: Stamp 2018, fig. 1.
• An example of a Hidden Markov Model. Source: Wikipedia 2019.
• Typical notation used in HMM. Source: Kang 2017.
• Complex birdsong analyzed using HMM. Source: Adapted from Katahira et al. 2011, fig. 4.
• Some types of HMMs. Source: Rabiner 1989, fig. 7.
• Trellis diagrams showing forward and backward algorithms. Source: Adapted from Jana 2019b.

# Hidden Markov Model

Created by arvindpdmn on 2019-09-05. Last updated 2019-09-06.

## Summary

Consider weather, stock prices, DNA sequences, human speech or words in a sentence. In all these cases, the current state is influenced by one or more previous states. Moreover, we can often observe the effect but not the underlying cause, which remains hidden from the observer. A Hidden Markov Model (HMM) helps us figure out the most probable hidden state given an observation.

In practice, we use a sequence of observations to estimate the sequence of hidden states. In HMM, the next state depends only on the current state. As such, it's good for modelling time series data.

We can classify HMM as a generative probabilistic model since a sequence of observed variables is generated by a sequence of hidden states. HMM is also seen as a specific kind of Bayesian network.

## Milestones

1913

Russian mathematician A. A. Markov recognizes that in a sequence of random variables, one variable may not be independent of the previous variable. For example, two successive coin tosses are independent but today's weather might depend on yesterday's weather. He models this as a chain of linked events with a probability assigned to each link. This technique is later named the Markov Chain.

1966

Baum and Petrie at the Institute for Defense Analyses, Princeton, introduce the Hidden Markov Model (HMM), though this name is not used. They state the problem of estimating transition and emission probabilities from observations. They use maximum likelihood estimation.

1967

Andrew Viterbi publishes an algorithm to decode information at the receiver in a communication system. Later named Viterbi algorithm, it's directly applicable to the decoding problem in HMM. Vintsyuk first applies this algorithm to speech and language processing in 1968.

1970

The Baum-Welch algorithm is proposed to solve the learning problem in HMM. This algorithm is a special case of the Expectation-Maximization (EM) algorithm. However, the name HMM is not used in the paper and mathematicians refer to HMM as "probabilistic functions of Markov chains".

1975

James Baker at CMU applies HMM to speech recognition in the DRAGON speech understanding system. This is one of the earliest engineering applications of HMM. HMM is further applied to speech recognition through the 1970s and 1980s by Jelinek, Bahl and Mercer at IBM.

1989

Lawrence Rabiner publishes a tutorial on HMM covering theory, practice and applications. He notes that HMM originated in mathematics and its literature was not widely read by engineers. Even when it was applied to speech processing in the 1970s, there were no tutorials to help translate theory into practice.

2003

HMM is typically used when the number of states is small but one research team applies it to large scale web traffic analysis. This involves hundreds of states and tens of millions of observations.

## Discussion

• Could you explain HMM with an example?

Suppose Bob tells his friend Alice what he did earlier today. Based on this information, Alice guesses today's weather at Bob's location. In HMM, we model weather as states and Bob's activity as observations.

To solve this problem, Alice needs to know three things:

• Transition Probabilities: Probability of moving from one state to another. For example, "If today was sunny, what's the probability that it will rain tomorrow?" If there are N states, this is an NxN matrix.
• Emission Probabilities: Probability of a particular output given a particular state. For example, "What's the chance that Bob is walking if it's raining?" Given a choice of M possible observation symbols, this is an NxM matrix. These are also called output or observation probabilities.
• Initial Probabilities: Probability of being in a state at the start, say, yesterday or ten days ago.

Unlike a typical Markov chain, we can't see the states in HMM. However, we can observe the output and then predict the state. Thus, the states are hidden, giving rise to the term "hidden" in the name HMM.
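To make this concrete, the three probability sets for the weather example can be written down in a few lines of Python with NumPy. The numbers are illustrative (they match the R sample code at the end of this article):

```python
import numpy as np

# Hidden states and observation symbols (illustrative weather example)
states = ["Rainy", "Sunny"]
symbols = ["Walk", "Shop", "Clean"]

# Transition probabilities: A[i][j] = P(next state j | current state i), an NxN matrix
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission probabilities: B[i][k] = P(observing symbol k | state i), an NxM matrix
B = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])

# Initial probabilities: pi[i] = P(starting in state i)
pi = np.array([0.6, 0.4])

# Each row is a probability distribution, so it must sum to 1
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

Reading the matrices: if today is Rainy (row 0 of A), tomorrow is Rainy with probability 0.7; if it's Raining (row 0 of B), Bob walks with probability 0.1.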

• What types of problems can be solved by HMM?

Let A, B and π denote the transition matrix, observation matrix and initial state distribution respectively. HMM can be represented as λ = (A, B, π). Let observation sequence be O and state sequence be Q.

HMM can be used to solve three types of problems:

• Likelihood Problem: Given O and λ, find the likelihood P(O|λ). How likely is a particular sequence of observations? Forward algorithm solves this problem.
• Decoding Problem: Given O and λ, find the best possible Q that explains O. Given the observation sequence, what's the best possible state sequence? Viterbi algorithm solves this problem.
• Learning Problem: Given O and Q, learn λ, perhaps by maximizing P(O|λ). What model best maps states to observations? Baum-Welch algorithm, also called forward-backward algorithm, solves this problem. In the language of machine learning, we can say that O is training data and the number of states N is the model's hyperparameter.

• What are some applications where HMM is useful?

HMM has been applied in many areas including automatic speech recognition, handwriting recognition, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

In speech recognition, a spectral analysis of speech gives us suitable observations for HMM. States are modelled after phonemes or syllables, or after the average number of observations in a spoken word. Each word gets its own model.

To tag words with their parts of speech, the tags are modelled as hidden states and the words are the observations.

In computer networking, HMMs are used in intrusion detection systems. This has two flavours: anomaly detection in which normal behaviour is modelled; or misuse detection in which a predefined set of attacks is modelled.

In computer vision, HMM has been used to label human activities from skeleton output. Each activity is modelled with an HMM. By linking multiple HMMs on common states, a compound HMM is formed. The purpose is to allow robots to be aware of human activity.

• What are the different types of Hidden Markov Models?

In the typical model, called the ergodic HMM, the states of the HMM are fully connected so that we can transition to a state from any other state. Left-right HMM is a more constrained model in which state transitions are allowed only from lower indexed states to higher indexed ones. Variations and combinations of these two types are possible, such as having two parallel left-to-right state paths.

HMM started with observations of discrete symbols governed by discrete probabilities. If observations are continuous signals, then we would use a continuous observation density.

There are also domain-specific variations of HMM. For example, in biological sequence analysis, there are at least three types: profile-HMMs, pair-HMMs, and context-sensitive HMMs.

• Could you explain forward algorithm and backward algorithm?

Every state sequence has a probability that it will lead to a given sequence of observations. Given T observations and N states, there are $$N^T$$ possible state sequences. Thus, the complexity of calculating the probability of a given observation sequence by brute-force enumeration is $$O(N^{T}T)$$. Both forward and backward algorithms bring down the complexity to $$O(N^{2}T)$$ through dynamic programming.

In the forward algorithm, we consider the probability of being in a state at the current time step. Then we consider the transition probabilities to calculate the state probabilities for the next step. Thus, at each time step we have considered all state sequences preceding it. The algorithm is more efficient since it reuses calculations from earlier steps. Instead of keeping all path sequences, paths are folded into a forward trellis.
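The forward pass can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the transition, emission and initial probabilities of the weather example; the brute-force function exists only to show what the trellis replaces:

```python
import numpy as np
import itertools

def forward(obs, A, B, pi):
    """P(O | model) via the forward trellis, in O(N^2 T) time.
    alpha[t, i] = P(o_1..o_t and being in state i at time t)."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization
    for t in range(1, T):
        # fold all paths reaching each state into one trellis column
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()

def brute_force(obs, A, B, pi):
    """Same quantity by enumerating all N^T state sequences: O(N^T T)."""
    T, N = len(obs), A.shape[0]
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

# Illustrative weather model: states (Rainy, Sunny), symbols (Walk, Shop, Clean)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])

obs = [0, 1, 0, 2]  # Walk, Shop, Walk, Clean
assert np.isclose(forward(obs, A, B, pi), brute_force(obs, A, B, pi))
```

Both functions return the same likelihood, but the trellis touches each of the N states once per step instead of revisiting every path.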

Backward algorithm is similar except that we start from the last time step and calculate in reverse. We're finding the probability that from a given state, the model will generate the output sequence that follows.

A combination of both algorithms, called forward-backward algorithm, is used to solve the learning problem.

• What's the algorithm for solving HMM's decoding problem?

Viterbi algorithm solves HMM's decoding problem. It's similar to the forward algorithm except that instead of summing the probabilities of all paths leading to a state, we retain only one path that gives maximum probability. Thus, at every time step or iteration, given that we have N states, we retain only N paths, the most likely path for each state. For the next iteration, we use the most likely paths of current iteration and repeat the process.

When we reach the end of the sequence, we'll have N most likely paths, each ending in a unique state. We then select the most likely end state. Once this selection is made, we backtrack to read the state sequence, that is, how we got to the end state. This state sequence is now the most likely sequence given our sequence of observations.
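The two steps above, keeping one best path per state and then backtracking, can be sketched in NumPy. A, B and pi below are the illustrative weather matrices, not part of the algorithm itself:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state sequence for the given observations."""
    T, N = len(obs), A.shape[0]
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointer: previous state on that best path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: extend best path at i to j
        psi[t] = scores.argmax(axis=0)       # retain only the best incoming path per state
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Select the most likely end state, then backtrack through the pointers
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Illustrative weather model: states (Rainy, Sunny), symbols (Walk, Shop, Clean)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])

states = ["Rainy", "Sunny"]
obs = [0, 1, 0, 2]  # Walk, Shop, Walk, Clean
print([states[i] for i in viterbi(obs, A, B, pi)])
# → ['Sunny', 'Sunny', 'Sunny', 'Rainy']
```

At every step only N paths survive, so the cost is O(N²T) rather than exponential in T.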

• How can we solve the learning problem of HMM?

In HMM's learning problem, we are required to learn the transition (A) and observation (B) probabilities when given a sequence of observations and the vocabulary of hidden states. The forward-backward algorithm solves this problem. It's an iterative algorithm. It starts with an initial estimate of the probabilities and improves these estimates with each iteration.

The algorithm consists of two steps:

• Expectation or E-step: We compute the expected state occupancy count and the expected state transition count based on current probabilities A and B.
• Maximization or M-step: We use the expected counts from the E-step to recompute A and B.

While this algorithm is unsupervised, in practice, initial conditions are very important. For this reason, often extra information is given to the algorithm. For example, in speech recognition, the HMM structure is set manually and the model is trained to set the initial probabilities.
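A single E-step plus M-step can be sketched in NumPy. This is an illustrative re-estimation for one observation sequence (no scaling for long sequences, so not production code); the weather matrices are assumed as the starting estimate:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM iteration of Baum-Welch for a single observation sequence."""
    T, N = len(obs), A.shape[0]

    # E-step: forward and backward passes
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    # Expected state occupancy and expected transition counts
    gamma = alpha * beta / likelihood  # gamma[t, i] = P(q_t = i | O)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood

    # M-step: re-estimate pi, A and B from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    counts = np.zeros_like(B)
    for t, o in enumerate(obs):
        counts[:, o] += gamma[t]
    new_B = counts / gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, likelihood

# Illustrative starting estimates (weather example) and observations
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0, 2]  # Walk, Shop, Walk, Clean

A2, B2, pi2, p1 = baum_welch_step(obs, A, B, pi)
_, _, _, p2 = baum_welch_step(obs, A2, B2, pi2)
assert p2 >= p1  # an EM iteration never decreases the likelihood
```

Iterating until the likelihood stops improving gives the final estimates; as noted above, the result depends heavily on the initial values of A, B and pi.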

• Could you describe some tools for doing HMM?

In Python, the hmmlearn package implements HMM. Three models are available: hmm.GaussianHMM, hmm.GMMHMM and hmm.MultinomialHMM. This functionality was once part of Scikit-learn but was removed in v0.17 in favour of the standalone hmmlearn package. Stephen Marsland has shared Python code in NumPy and Pandas that implements many essential algorithms for HMM.

In R, the HMM package implements HMM. It has functions for the forward, backward, Viterbi and Baum-Welch algorithms. Another package, depmixS4, implements dependent mixture models that can be used to fit an HMM to observed data. R-bloggers has an example use of depmixS4.

## Sample Code

# Source: https://stats.stackexchange.com/questions/31746/what-is-the-difference-between-the-forward-backward-and-viterbi-algorithms
# Accessed: 2019-09-05

library(HMM)

# Hidden states: Rainy and Sunny
# Observations: Walk, Shop, Clean

# State transition probabilities
P <- as.matrix(rbind(c(0.7, 0.3),
                     c(0.4, 0.6)))

# Emission probabilities
R <- as.matrix(rbind(c(0.1, 0.4, 0.5),
                     c(0.6, 0.3, 0.1)))

hmm <- initHMM(States=c("Rainy", "Sunny"),
               Symbols=c("Walk", "Shop", "Clean"),
               startProbs=c(0.6, 0.4),
               transProbs=P,
               emissionProbs=R)
hmm

obs <- c("Walk", "Shop", "Walk", "Clean")
print(posterior(hmm, obs))  # state probabilities for each observation
print(viterbi(hmm, obs))    # most likely sequence of states


## Cite As

Devopedia. 2019. "Hidden Markov Model." Version 4, September 6. Accessed 2020-09-22. https://devopedia.org/hidden-markov-model