
# Speech Recognition

## Summary

Speech Recognition is the process by which a computer maps an acoustic speech signal to text.

Speech Recognition is also known as Automatic Speech Recognition (ASR) or Speech To Text (STT).

Speech Recognition crossed over to the 'Plateau of Productivity' in the Gartner Hype Cycle as of July 2013, indicating its widespread use and maturity.

In the longer term, researchers are focusing on teaching computers not just to transcribe the acoustic signals that come out of people's mouths but also to understand the words they are saying. Automatic speech understanding is when a computer maps an acoustic speech signal to an abstract meaning.

Speech recognition is a sub-field of computational linguistics (an interdisciplinary field concerned with the statistical or rule-based modeling of natural language) that develops methodologies and technologies enabling computers to recognise and translate spoken language into text.

## Milestones

1952

Bell Labs researchers - Davis, Biddulph and Balashek - build a system for single-speaker recognition of the ten digits. Their system works by locating the formants in the power spectrum of each utterance. It builds on earlier analyses by Harvey Fletcher and Homer Dudley, both of AT&T Bell Laboratories, that established relationships between sound classes and the signal spectrum. Speech recognition research at Bell Labs was later defunded after an open letter by John Robinson Pierce that was critical of speech recognition research.

1962

IBM demonstrates its 'Shoebox' machine at the 1962 World's Fair; it can understand 16 words spoken in English.

1968

Dabbala Rajagopal "Raj" Reddy demonstrates voice control of a robot, large-vocabulary connected speech recognition, speaker-independent speech recognition and unrestricted vocabulary dictation. His Hearsay I system (CMU, 1973) is one of the first capable of continuous speech recognition. Reddy lays the foundation for more than three decades of research at Carnegie Mellon University with his work on continuous speech recognition based on dynamic tracking of phonemes; he wins the ACM Turing Award in 1994. Funding from DARPA's Speech Understanding Research (SUR) program later leads to Carnegie Mellon's "Harpy" speech understanding system, which can understand 1011 words.

1968

In the Soviet Union, Professor Taras Vintsyuk proposes the use of dynamic programming methods for time-aligning a pair of speech utterances, generally known as Dynamic Time Warping (DTW). Velichko and Zagoruyko use Vintsyuk's work to advance the use of pattern recognition ideas in speech recognition, building a 200-word recogniser. In 1980, Professor Victorov develops a system that recognises 1000 words.

1974

Threshold Technology becomes the first commercial speech recognition company.

1975

James Baker starts working on HMM-based speech recognition systems. In 1982 James and Janet Baker (students of Raj Reddy) cofound Dragon Systems, one of the first companies to use Hidden Markov Models in speech recognition.

1991

Tony Robinson publishes work on neural networks for ASR. By 1994, his neural network system is among the top 10 in the world in the DARPA Continuous Speech Evaluation trial, while the other nine systems are HMM-based. In 2012 he founds Speechmatics, offering cloud-based speech recognition services; in 2017 the company announces a breakthrough in rapidly building models for new languages.

2007

Google begins its first effort at speech recognition after hiring some researchers from Nuance. Around the same time, LSTMs trained with Connectionist Temporal Classification (CTC) start to outperform traditional speech recognition in certain applications.

Sep
2015

Google's speech recognition experiences a performance jump of 49% through CTC-trained LSTM.

Oct
2017

Baidu Research releases Deep Speech 3, which enables end-to-end training using a pre-trained language model. Deep Speech 1 was Baidu's proof of concept, followed by Deep Speech 2, which demonstrated how such models generalise well to different languages.

Nov
2017

Mozilla open sources its speech recognition model, DeepSpeech, and its voice dataset, Common Voice.

## Discussion

• What are the steps involved in the process of speech recognition?
• Analog-to-digital conversion - Speech is usually recorded or available in analog format. Standard sampling and quantization techniques and devices convert analog speech to digital. The digital speech is usually a one-dimensional vector of speech samples, each of which is an integer.
• Speech pre-processing/noise removal - Recorded speech usually comes with background noise and long stretches of silence. Pre-processing involves identifying and removing silence frames and applying signal processing techniques to reduce or eliminate noise. After pre-processing, the speech is broken into frames of about 20 ms each for the subsequent feature extraction step (a minimal framing sketch follows this list).
• Feature extraction - The process of converting speech frames into feature vectors that indicate which phoneme or syllable is being spoken.
• Word selection - Based on a language model or probability model, the sequence of phonemes/features is converted into the words being spoken.
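To make the framing step concrete, here is a minimal sketch (not from the source) that splits a digitised signal into non-overlapping 20 ms frames; production systems typically use overlapping, windowed frames instead.

```python
# Minimal framing sketch: split a 1-D array of speech samples into 20 ms frames.
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=20):
    """Return non-overlapping frames of frame_ms milliseconds each."""
    frame_len = int(sample_rate * frame_ms / 1000)      # 320 samples at 16 kHz
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: one second of synthetic audio yields 50 frames of 320 samples
signal = np.random.randn(16000)
print(frame_signal(signal).shape)   # (50, 320)
```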
• Which are the popular feature extraction methods?
• RASTA-PLP: Relative Spectral Transform - Perceptual Linear Prediction. PLP is a way of warping spectra to minimize differences between speakers while preserving important speech information. RASTA applies a band-pass filter to the energy in each frequency subband in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, e.g. from a telephone line.
• Linear Predictive Cepstral Coefficients (LPCC) - A cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. The power cepstrum is used in the analysis of human speech.
• Mel-Frequency Cepstral Coefficients (MFCC) - These are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between LPCC and the mel-frequency cepstrum is that in MFCC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping allows for a better representation of sound (see the MFCC sketch below).
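As a hedged illustration of MFCC extraction, the sketch below uses the librosa library (an assumption; librosa and the file name are not from the source) to compute 13 MFCCs per frame of an audio file.

```python
# MFCC extraction sketch using librosa (library choice is an assumption).
import librosa

# Load and resample the audio to 16 kHz, a common rate for ASR ("speech.wav" is a placeholder)
y, sr = librosa.load("speech.wav", sr=16000)

# 13 mel-frequency cepstral coefficients per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)   # (13, number_of_frames)
```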
• What are the traditional probability mapping and selection methods?

A Hidden Markov Model is a type of graphical model often used to model temporal data. Hidden Markov Models (HMMs) assume that the data observed is not the actual state of the model, but is instead generated by the underlying hidden (the H in HMM) states. While this would normally make inference difficult, the Markov Property (the first M in HMM) of HMMs makes inference efficient.

The hidden Markov model can be represented as the simplest dynamic Bayesian network. The mathematics behind the HMM were developed by L. E. Baum and coworkers.

Because of their flexibility and computational efficiency, Hidden Markov Models have found a wide application in many different fields like speech recognition, handwriting recognition, and speech synthesis.
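The following is a minimal sketch, assuming the hmmlearn library (not mentioned in the source), of fitting a Gaussian HMM to a sequence of feature vectors and decoding the most likely hidden state for each frame.

```python
# Gaussian HMM sketch with hmmlearn (library choice and toy data are assumptions).
import numpy as np
from hmmlearn import hmm

# Toy observation sequence: 100 frames of 13-dimensional features (e.g. MFCCs)
X = np.random.randn(100, 13)

# Three hidden states with diagonal-covariance Gaussian emissions
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(X)

# Viterbi decoding: most likely hidden state per frame
states = model.predict(X)
print(states[:10])
```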

• How is the accuracy of a speech recognition program validated?

Word Error Rate (WER) is a common metric of the performance of a speech recognition or machine translation system. It is generally measured on Switchboard, a recorded corpus of conversations between humans discussing day-to-day topics that has been used for over two decades to benchmark speech recognition systems. There are other corpora such as LibriSpeech (based on public domain audio books) and Mozilla's Common Voice project.

For some languages, like Mandarin, the metric is often CER - Character Error Rate. There is also Utterance Error Rate.

An IEEE paper that focused on the interaction of ASR and machine translation (MT) in a speech translation system showed that BLEU-oriented global optimisation of ASR system parameters improves translation quality by an absolute 1.5% BLEU score, while sacrificing WER compared with a conventional WER-optimised ASR system. The choice of metrics for ASR optimisation is therefore context and application dependent.

• How has speech recognition evolved over the years?
• Starting from the 1960s, pattern-recognition-based approaches began making speech recognition practical for applications with a limited vocabulary, using LPC (Linear Predictive Coefficients) and LPCC (Linear Predictive Cepstral Coefficient) based techniques.

The advantage of this technique was that it required few resources to build the model, and it could be used for applications requiring up to about 300 words.

• In the late 1970s, Paul Mermelstein introduced a new feature called MFCC (Mel-Frequency Cepstral Coefficients). This soon became the de facto approach for feature extraction and helped tackle multi-speaker as well as multi-language speech recognition applications.
• In the 1990s, H. Hermansky came up with the RASTA-PLP approach (Relative Spectral Transform - Perceptual Linear Prediction) to feature extraction, which could be used for applications requiring a very large vocabulary with multiple speakers and multiple languages with good accuracy.
• What are the AI-based approaches for speech recognition?

In the 1990s and early 2000s, deep learning techniques involving Recurrent Neural Networks (RNNs) were applied to speech recognition. In the 2000s, the LSTM (Long Short-Term Memory) variant of RNNs, which adds long-term memory to the model, helped minimise or avoid the vanishing and exploding gradient problems encountered when training RNNs.

Baidu Research released Deep Speech in 2014, achieving a WER of 11.85% using RNNs. They leveraged the Connectionist Temporal Classification (CTC) loss function.

With Deep Speech 2 in 2015 they achieved a 7x increase in speed using GRUs (Gated Recurrent Units).

Deep Speech 3 was released in 2017. The researchers perform an empirical comparison of three models for end-to-end speech recognition: CTC, which powered Deep Speech 2; attention-based Seq2Seq models, which powered Listen, Attend and Spell among others; and the RNN-Transducer. The RNN-Transducer can be thought of as an encoder-decoder model that assumes the alignment between input and output tokens is local and monotonic. This makes the RNN-Transducer loss a better fit for speech recognition (especially online recognition) than attention-based Seq2Seq models, removing the extra hacks applied to attentional models to encourage monotonicity.
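To illustrate the CTC loss mentioned above, here is a minimal sketch using PyTorch's nn.CTCLoss; PyTorch, and all shapes and sizes here, are assumptions rather than details from the Deep Speech papers.

```python
# CTC loss sketch with PyTorch (illustrative shapes, not the Deep Speech implementation).
import torch
import torch.nn as nn

T, N, C = 50, 4, 28      # time steps, batch size, output classes (incl. blank)
S = 10                   # target transcript length

log_probs = torch.randn(T, N, C).log_softmax(dim=2)        # stand-in for network outputs
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # label indices; 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```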

• How is the speed of a speech recognition system measured?

Real-Time Factor (RTF) is a natural measure of speech decoding speed: it expresses how much slower the recogniser decodes than the user speaks. Latency measures the time between the end of the user's speech and the moment the decoder returns its hypothesis, and is the most important speed measure for ASR.

RTF is the ratio of the speech recognition response time to the utterance duration. Usually both the mean RTF (averaged over all utterances) and the 90th percentile RTF are examined in efficiency analysis.
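A minimal sketch of the RTF computation, using made-up utterance durations and decoding times, is shown below.

```python
# RTF sketch: per-utterance RTF, mean RTF, and 90th percentile RTF (toy numbers).
import numpy as np

audio_durations = np.array([3.2, 5.0, 4.1])   # seconds of speech per utterance
decode_times = np.array([1.6, 3.0, 4.5])      # seconds the recogniser took per utterance

rtf = decode_times / audio_durations
print("mean RTF:", rtf.mean())
print("90th percentile RTF:", np.percentile(rtf, 90))
```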

• How is Word Error Rate calculated?

In October 2016, Microsoft announced that its ASR had a WER of 5.9% on the industry-standard Switchboard speech recognition task. This was surpassed by IBM Watson in March 2017 with a WER of 5.5%. In May 2017, Google announced it had reached a WER of 4.9%; however, Google does not benchmark against Switchboard.
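WER itself is computed as WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference transcript. The sketch below (illustrative, not from the source) computes it with a word-level edit distance.

```python
# WER sketch: word-level edit distance divided by the reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))   # one deletion / six words, about 0.167
```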

ASR systems have seen big improvements in recent years due to more efficient acoustic models that use Deep Neural Networks (DNNs) to determine how well HMM states fit the extracted acoustic features rather than statistical techniques such as Gaussian Mixture Models, which were the preferred method for several years.

• Which are the popular APIs that one can use to incorporate automatic speech recognition in an application?
• Bing Speech API
• Nuance Speech Kit
• AWS Transcribe
• IBM Watson Speech to Text
• Speechmatics
• Vocapia Speech to Text API
• LBC Listen By Code API
• Kaldi
• CMU Sphinx

• What are the applications for speech recognition?
• Aerospace (e.g. space exploration, spacecraft): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander
• Automatic subtitling with speech recognition
• Automatic translation
• Court reporting (Realtime Speech Writing)
• eDiscovery (Legal discovery)
• Education (assisting in learning a second language)
• Hands-free computing: Speech recognition computer user interface
• Home automation (Alexa, Google Home etc.)
• Interactive voice response
• Medical transcription
• Mobile telephony, including mobile email
• Multimodal interaction
• People with disabilities
• Pronunciation evaluation in computer-aided language learning applications
• Robotics
• Speech-to-text reporter (transcription of speech into text, video captioning, court reporting)
• Telematics (e.g. vehicle Navigation Systems)
• User interface in telephony
• Transcription (digital speech-to-text)
• Video games, with Tom Clancy's EndWar and Lifeline as working examples
• Virtual assistant (e.g. Apple's Siri)
• Which are the available open source ASR toolkits?
• Microsoft Cognitive Toolkit, the deep learning system that Microsoft used for its ASR system, is available on GitHub under an open source license.
• The Machine Learning team at Mozilla Research has been working on an open source automatic speech recognition engine modelled after the Deep Speech papers (1, 2) published by Baidu. It has a WER of 6.5 percent on LibriSpeech's test-clean set.
• Kaldi is a popular open-source speech recognition toolkit that is integrated with TensorFlow. Its code is now hosted on GitHub with 121 contributors. It originated at a 2009 workshop at Johns Hopkins University. It is designed for local installation.
• CMU Sphinx is from Carnegie Mellon University. Java and C versions exist on GitHub.
• HTK began at Cambridge University in 1989.
• Julius has been in development since 1997 and had its last major release in September 2016.
• ISIP originated at Mississippi State. It was developed mostly from 1996 to 1999, with its last release in 2011.
• VoxForge is a crowdsourced repository of speech recognition data and trained models.

Python wrappers exist for most of these options.
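As a hedged example of such a wrapper, the sketch below uses the SpeechRecognition Python package with the CMU Sphinx backend; the package choice and file name are assumptions, not from the source.

```python
# Offline transcription sketch with the SpeechRecognition package and pocketsphinx.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:     # "speech.wav" is a placeholder
    audio = recognizer.record(source)          # read the whole file into memory

try:
    print(recognizer.recognize_sphinx(audio))  # decode locally with CMU Sphinx
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
```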

## Sample Code

# Script that uses the AWS SDK for Python (Boto3) to transcribe speech into text using the Amazon Transcribe API
# Source: https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-python.html
from __future__ import print_function
import time
import boto3

transcribe = boto3.client('transcribe')
job_name = "job name"
# Placeholder: S3 URI of the audio file to transcribe (not defined in the original listing)
job_uri = "https://<bucket>.s3.amazonaws.com/<file>.wav"

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US'
)

# Poll until the transcription job completes or fails
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    time.sleep(5)

print(status)

## Tags

• Deep Neural Networks
• Hidden Markov Models
• Recurrent Neural Networks
• Long Short-Term Memory
• Gaussian Mixture Models



## Cite As

Devopedia. 2018. "Speech Recognition." Version 39, July 28. Accessed 2018-10-18. https://devopedia.org/speech-recognition