Speech Recognition is the process by which a computer maps an acoustic speech signal to text.
Speech Recognition is also known as Automatic Speech Recognition (ASR) or Speech To Text (STT).
Speech Recognition crossed into the 'Plateau of Productivity' in the Gartner Hype Cycle as of July 2013, indicating its maturity and widespread use today.
In the longer term, researchers are focusing on teaching computers not just to transcribe acoustic signals but also to understand the words. Automatic speech understanding is when a computer maps an acoustic speech signal to an abstract meaning.
Speech recognition is a sub-field of computational linguistics (an interdisciplinary field concerned with the statistical or rule-based modelling of natural language) that develops methodologies and technologies that enable computers to recognise spoken language and translate it into text.
What are the steps involved in the process of speech recognition?
We can identify the following main steps:
- Analog-to-Digital Conversion: Speech is usually recorded/available in analog format. Standard sampling techniques/devices are available to convert analog speech to digital using techniques of sampling and quantization. The digital speech is usually a one-dimension vector of speech samples, each of which is an integer.
- Speech Pre-processing: Recorded speech usually comes with background noise and long sequences of silence. Speech pre-processing involves identification and removal of silence frames and signal processing techniques to reduce/eliminate noise. After pre-processing, the speech is broken down into frames of 20ms each for further steps of feature extraction.
- Feature Extraction: This is the process of converting each speech frame into a feature vector that indicates which phoneme/syllable is being spoken.
- Word Selection: Based on a language model/probability model, the sequence of phonemes/features is converted into the words being spoken.
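The first two steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes a 16 kHz sample rate and uses a simple energy threshold (an invented constant) for silence detection.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=20, silence_threshold=1e-4):
    """Split a 1-D speech signal into 20 ms frames and drop silent ones."""
    frame_len = int(sample_rate * frame_ms / 1000)     # samples per frame
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).mean(axis=1)  # mean energy per frame
    return frames[energy > silence_threshold]          # keep voiced frames only

# Synthetic example: 0.1 s of silence followed by 0.1 s of a 440 Hz tone
t = np.arange(1600) / 16000
speech = np.concatenate([np.zeros(1600), 0.5 * np.sin(2 * np.pi * 440 * t)])
frames = preprocess(speech)
print(frames.shape)   # (5, 320): only the 5 voiced frames survive
```

Each surviving 20 ms frame would then be passed to feature extraction.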
Which are the popular feature extraction methods?
While there are many feature extraction methods, we note three of them:
- Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP): PLP is a way of warping spectra to minimize differences between speakers while preserving important speech information. RASTA applies a band-pass filter to the energy in each frequency subband in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, e.g. from a telephone line.
- Linear Predictive Cepstral Coefficients (LPCCs): A cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. The power cepstrum is used in the analysis of human speech.
- Mel Frequency Cepstral Coefficients (MFCCs): These are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between LPCCs and MFCCs is that in the MFCC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping allows for a better representation of sound.
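The MFCC idea described above can be illustrated with a simplified sketch: power spectrum, triangular filters equally spaced on the mel scale, log, and a DCT. Real implementations add pre-emphasis, liftering and other refinements; the filter counts and FFT size here are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=13):
    """Compute MFCCs for a single speech frame (simplified)."""
    n_fft = 512
    # Power spectrum of the windowed frame
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energy = np.log(fbank @ power + 1e-10)   # log mel-filterbank energies
    # DCT-II decorrelates the log energies: the "spectrum of a spectrum"
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energy

frame = np.sin(2 * np.pi * 440 * np.arange(320) / 16000)  # one 20 ms frame
print(mfcc(frame).shape)   # (13,)
```

A typical system keeps the first 12-13 coefficients per frame, often augmented with their deltas.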
Which are the traditional Probability Mapping and Selection methods?
A Hidden Markov Model is a type of graphical model often used to model temporal data. Hidden Markov Models (HMMs) assume that the data observed is not the actual state of the model, but is instead generated by the underlying hidden (the H in HMM) states. While this would normally make inference difficult, the Markov Property (the first M in HMM) of HMMs makes inference efficient.
The hidden Markov model can be represented as the simplest dynamic Bayesian network. The mathematics behind the HMM were developed by L. E. Baum and coworkers.
Because of their flexibility and computational efficiency, Hidden Markov Models have found a wide application in many different fields like speech recognition, handwriting recognition, and speech synthesis.
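Inference in an HMM is typically done with the Viterbi algorithm, which exploits the Markov property to find the most likely hidden-state sequence efficiently. Here is a minimal sketch on a toy two-state model; the states, observations and probabilities are invented for illustration.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence (log domain)."""
    n_states, T = len(start_p), len(obs)
    logp = np.full((T, n_states), -np.inf)    # best log-prob ending in each state
    back = np.zeros((T, n_states), dtype=int) # backpointers for path recovery
    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # Trace back the best path from the final frame
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: 2 hidden states, 2 observation symbols
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], start, trans, emit))   # [0, 0, 1, 1]
```

In ASR, the hidden states correspond to phones (or sub-phone states) and the observations to acoustic feature vectors.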
How is the accuracy of a speech recognition program validated?
Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system. Generally it is measured on Switchboard, a recorded corpus of conversations between humans discussing day-to-day topics. Switchboard has been used for over two decades to benchmark speech recognition systems. There are other corpora, such as LibriSpeech (based on public domain audiobooks) and Mozilla's Common Voice project.
For some languages, like Mandarin, the metric is often Character Error Rate (CER). There is also Utterance Error Rate.
An IEEE paper that focussed on ASR and machine translation interactions in a speech translation system showed that BLEU-oriented global optimisation of ASR system parameters improves translation quality by an absolute 1.5% in BLEU score, while sacrificing WER compared to the conventional WER-optimised ASR system. The choice of metrics for ASR optimisation is therefore context and application dependent.
How has speech recognition evolved over the years?
Starting in the 1960s, pattern recognition approaches based on LPC (Linear Predictive Coding) coefficients and LPCCs (Linear Predictive Cepstral Coefficients) made speech recognition practical for applications with limited vocabulary. These techniques required few computational resources and could handle vocabularies of up to about 300 words.
In the late 1970s, Paul Mermelstein introduced a new feature, the MFCC. This soon became the de facto approach for feature extraction and helped tackle multi-speaker as well as multi-language speech recognition.
In the 1990s, H. Hermansky came up with the RASTA-PLP approach of feature extraction which could be used for applications requiring very large vocabulary with multiple speakers and multiple languages with good accuracy.
What are the AI based approaches for speech recognition?
In the 1990s and early 2000s, deep learning techniques involving Recurrent Neural Networks (RNNs) were applied to speech recognition. In the 2000s, a variant of RNNs using Long Short-Term Memory (LSTM) units brought long-term memory into the model and helped minimise or avoid the vanishing and exploding gradient problems encountered when training RNNs.
Baidu Research released Deep Speech in 2014, achieving a WER of 11.85% using RNNs. They leveraged the Connectionist Temporal Classification (CTC) loss function.
With Deep Speech 2 in 2015 they achieved a 7x increase in speed using GRUs (Gated Recurrent Units).
Deep Speech 3 was released in 2017. It performs an empirical comparison between three models: CTC, which powered Deep Speech 2; attention-based Seq2Seq models, which powered Listen, Attend and Spell among others; and the RNN-Transducer for end-to-end speech recognition. The RNN-Transducer can be thought of as an encoder-decoder model that assumes the alignment between input and output tokens is local and monotonic. This makes the RNN-Transducer loss a better fit for speech recognition (especially online recognition) than attention-based Seq2Seq models, since it removes the extra tricks applied to attentional models to encourage monotonicity.
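To see why CTC needs no explicit frame-to-character alignment, consider its decoding rule: take the per-frame best labels, merge consecutive repeats, then drop the blank symbol. A minimal sketch of this greedy decoding step (the frame labellings below are invented examples):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a per-frame best-path labelling into an output string:
    merge consecutive repeated labels, then drop the blank symbol."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Two different per-frame labellings that both decode to the word "cat"
# ('-' denotes the CTC blank symbol)
print(ctc_greedy_decode(list("cc-aa-t-")))   # cat
print(ctc_greedy_decode(list("c-aatt--")))   # cat
```

Many alignments collapse to the same output; the CTC loss sums the probability over all of them during training.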
How is the speed of a speech recognition system measured?
Real Time Factor (RTF) is a natural measure of decoding speed: the ratio of the speech recognition response time to the utterance duration, i.e. how much slower the recogniser decodes than the user speaks. Usually both the mean RTF (averaged over all utterances) and the 90th percentile RTF are examined in efficiency analysis.
Latency measures the time between the end of the user's speech and the time when the decoder returns its hypothesis; for interactive applications it is the most important speed measure.
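Given per-utterance decode times and audio durations, both RTF summary statistics can be computed directly. A small sketch with hypothetical timings:

```python
import numpy as np

def real_time_factor(decode_times, audio_durations):
    """Per-utterance RTF, summarised as (mean, 90th percentile)."""
    rtf = np.array(decode_times) / np.array(audio_durations)
    return rtf.mean(), np.percentile(rtf, 90)

# Hypothetical decode times (s) for four 4-second utterances
decode = [1.2, 0.8, 3.0, 0.5]
audio = [4.0, 4.0, 4.0, 4.0]
mean_rtf, p90_rtf = real_time_factor(decode, audio)
print(round(mean_rtf, 4), round(p90_rtf, 4))
```

An RTF below 1.0 means the recogniser keeps up with the speaker; the 90th percentile exposes slow outliers that the mean hides.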
How is Word Error Rate calculated?
In October 2016, Microsoft announced its ASR had a WER of 5.9% on the industry-standard Switchboard speech recognition task. This was surpassed by IBM Watson in March 2017 with a WER of 5.5%. In May 2017 Google announced it had reached a WER of 4.9%; however, Google does not benchmark against Switchboard.
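These scores are computed as the word-level Levenshtein (edit) distance between the reference transcript and the recogniser's hypothesis, normalised by the number of reference words. A minimal sketch of the calculation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level edit distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

# One deleted word over 6 reference words: WER = 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions.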
ASR systems have seen big improvements in recent years due to more efficient acoustic models that use Deep Neural Networks (DNNs) to determine how well HMM states fit the extracted acoustic features rather than statistical techniques such as Gaussian Mixture Models, which were the preferred method for several years.
Which are the popular APIs that one can use to incorporate automatic speech recognition in an application?
Here's a selection of popular APIs: Bing Speech API, Nuance Speech Kit, Google Cloud Speech API, AWS Transcribe, IBM Watson Speech to Text, Speechmatics, Vocapia Speech to Text API, LBC Listen By Code API, Kaldi, CMU Sphinx.
What are the applications for speech recognition?
Applications of speech recognition are diverse and we note a few:
- Aerospace (e.g. space exploration, spacecraft): NASA's Mars Polar Lander carried speech recognition technology from Sensory, Inc. in the Mars Microphone.
- Automatic subtitling with speech recognition
- Automatic translation
- Court reporting (Realtime Speech Writing)
- eDiscovery (Legal discovery)
- Education (assisting in learning a second language)
- Hands-free computing: Speech recognition computer user interface
- Home automation (Alexa, Google Home etc.)
- Interactive voice response
- Medical transcription
- Mobile telephony, including mobile email
- Multimodal interaction
- People with disabilities
- Pronunciation evaluation in computer-aided language learning applications
- Speech-to-text reporter (transcription of speech into text, video captioning, court reporting)
- Telematics (e.g. vehicle Navigation Systems)
- User interface in telephony
- Transcription (digital speech-to-text)
- Video games, with Tom Clancy's EndWar and Lifeline as working examples
- Virtual assistant (e.g. Apple's Siri)
Which are the available open source ASR toolkits?
Here's a selection of open source ASR toolkits:
- Microsoft Cognitive Toolkit, the deep learning system Microsoft used for its ASR system, is available on GitHub under an open source licence.
- The Machine Learning team at Mozilla Research has been working on an open source Automatic Speech Recognition engine modelled after the Deep Speech papers published by Baidu. It has a WER of 6.5 percent on LibriSpeech’s test-clean set.
- Kaldi is integrated with TensorFlow. Its code is hosted on GitHub with 121 contributors. It originated at a 2009 workshop at Johns Hopkins University. It is designed for local installation.
- CMU Sphinx is from Carnegie Mellon University. Java and C versions exist on GitHub.
- HTK began in Cambridge University in 1989.
- Julius has been in development since 1997 and had its last major release in September 2016.
- ISIP originated from Mississippi State. It was developed mostly from 1996 to 1999, with its last release in 2011.
- VoxForge - crowdsourced repository of speech recognition data and trained models.
In 1952, Bell Labs researchers Davis, Biddulph and Balashek built a system for single-speaker recognition of the ten digits. Their system worked by locating the formants in the power spectrum of each utterance. They were essentially building on earlier analysis establishing relationships between sound classes and the signal spectrum by Harvey Fletcher and Homer Dudley, both from AT&T Bell Laboratories. Speech recognition research at Bell Labs was later defunded after a 1969 open letter by John Robinson Pierce that was critical of speech recognition research.
Dabbala Rajagopal "Raj" Reddy conducted demonstrations of voice control of a robot, large vocabulary connected speech recognition, speaker-independent speech recognition and unrestricted vocabulary dictation. Hearsay-I was one of the first systems capable of continuous speech recognition. Reddy's work on continuous speech recognition, based on dynamic tracking of phonemes, laid the foundation for more than three decades of research at Carnegie Mellon University. Funding from DARPA's Speech Understanding Research (SUR) program produced Carnegie Mellon's "Harpy" speech understanding system, which could understand 1011 words.
In Russia, Professor Taras Vintsyuk proposed the use of dynamic programming methods for time-aligning a pair of speech utterances, an approach now generally known as Dynamic Time Warping (DTW). Velichko and Zagoruyko used Vintsyuk's work to advance pattern recognition ideas in speech recognition, building a 200-word recogniser. In 1980, Professor Victorov developed a system that recognised 1000 words.
In 1991, Tony Robinson published work on neural networks in ASR. By 1994, Robinson's neural network system was among the top 10 in the world in the DARPA Continuous Speech Evaluation trial, while the other nine systems were HMM-based. In 2012 he founded Speechmatics, offering cloud-based speech recognition services. In 2017 the company announced a breakthrough in accelerated new-language modelling.
- Alim, Sabur Ajibola and Nahrul Khair Alang Rashid. 2018. "Some Commonly Used Speech Feature Extraction Algorithms." IntechOpen, December 12. Accessed 2020-07-23.
- Alvarez, Raziel, and Yishay Carmiel. 2017. "Kaldi now offers TensorFlow integration." Google Developers, August 28. Accessed 2020-07-23.
- Amodei, Dario. 2015. "Deep Speech 2 : End to end speech recognition in English and Mandarin". Baidu Research. December 8. Accessed 2018-07-28.
- Baidu. 2017. "Deep Speech 3: Even more end-to-end speech recognition". Baidu Research. October 31. Accessed 2018-07-28.
- Baker, James. 1975. "STOCHASTIC MODELING AS A MEANS OF AUTOMATIC SPEECH RECOGNITION". ProQuest Dissertations Publishing. Accessed 2018-07-28.
- Baum, LE. 1966. "Statistical Inference for Probabilistic Functions of Finite State Markov Chains". The Annals of Mathematical Statistics. 37 (6): 1554–1563. Accessed 2018-07-28.
- Benesty, Jacob. 2008. "Speech Recognition". Springer Handbook of Speech Processing. Pages 524-526. Accessed 2018-07-24.
- Bogert, BP. 1963. "The Quefrency Alanysis [sic] of Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum and Saphe Cracking". Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed) Chapter 15, 209-243. Accessed 2018-07-28.
- Branscombe, Mary. 2017. "Beyond the Switchboard: The Current State of the Art in Speech Recognition". The New Stack. November 10. Accessed 2018-07-27.
- CMU. 1996. "Q6.1 What is Speech Recognition". comp.speech FAQ, June 18. Accessed 2018-07-27.
- Dominguez, Javier Gonzalez. 2015. "A Real-Time End-to-End Multilingual Speech Recognition Architecture". IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL 9, NO 4. June. Accessed 2018-07-28.
- Dragon. 2012. "History of Speech & Voice Recognition and Transcription Software". Focus Medical Software. Accessed 2018-07-28.
- Fernandez, Santiago. 2007. "An application of recurrent neural networks to discriminative keyword spotting". Proceedings of ICANN (2), pp 220–229. Accessed 2018-07-28.
- Gartner. 2013. "Gartner's 2013 Hype Cycle for Emerging Technologies Maps Out Evolving Relationship Between Humans and Machines". Gartner Newsroom. August 19. Accessed 2018-07-24.
- GetSmarter. 2019. "Applications of Speech Recognition." Blog, GetSmarter, March 28. Accessed 2020-07-23.
- Graves, Alex. 2013. "Speech Recognition with Deep Recurrent Neural Networks". ICASSP. January. Accessed 2018-07-28.
- HMB431. 2016. "Application of Speech Recognition Technology in Speech-Related Disabilities: An Analysis and Forecast." Medium, March 23. Accessed 2020-07-23.
- Hannun, Awni. 2014. "Deep Speech: Scaling up end-to-end speech recognition". Cornell University Library. December 17. Accessed 2018-07-28.
- Hasegawa, H., and M. Inazumi. 1993. “Speech Recognition by Dynamic Recurrent Neural Networks”. Proceedings of 1993 International Joint Conference on Neural Networks. Accessed 2018-07-28.
- He, Xiaodong. 2011. "Why word error rate is not a good metric for speech recognizer training for the speech translation task". 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). May 22. Accessed 2018-07-24.
- Hermansky, Hynek. 1990. "Perceptual linear predictive (PLP) analysis of speech". Journal of the Acoustical Society of America vol 87, no 4, pp 1738-1752. Accessed 2018-07-28.
- Hermansky, H. 1994. "RASTA processing of speech". IEEE Transactions on Speech and Audio Processing Volume 2, Issue 4. October. Accessed 2018-07-28.
- Juang. 2014. "Automatic speech recognition–a brief history of the technology development". August 17. Accessed 2018-07-24.
- Jurafsky, Daniel and James H. Martin. 2019. "Chapter 9: Automatic Speech Recognition." In: Speech and Language Processing, Third Edition draft, October 16. Accessed 2020-07-23.
- Kincaid, Jason. 2011. "The Power Of Voice: A Conversation With The Head Of Google's Speech Technology". Tech Crunch. February 13. Accessed 2018-07-28.
- Linn, Allison. 2016. "Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition". Microsoft Research Blog. October 18. Accessed 2018-07-24.
- McLellan, Charles. 2016. "How we learned to talk to computers, and how they learned to answer back". Tech Republic. Accessed 2018-07-28.
- Mermelstein, Paul. 1976. "Distance measures for speech recognition, psychological and instrumental". Pattern Recognition and Artificial Intelligence pp 374–388. Accessed 2018-07-28.
- Morais, Reuben. 2017. "A Journey to <10% Word Error Rate". Mozilla Hacks. November 29. Accessed 2018-07-24.
- Orlowski, Andrew. 2017. "Brit neural net pioneer just revolutionised speech recognition all over again". The Register. July 17. Accessed 2018-07-26.
- Peterson, Casey. 2015. "A Guide to Speech Recognition Algorithms (Part 1)." YouTube, December 8. Accessed 2020-07-23.
- Pierce, John R. 1969. "Whither speech recognition?". Journal of the Acoustical Society of America. Vol 46 No 4 Part 2 Pages 1049-1051. October. Accessed 2018-07-24.
- Pinola, Melanie. 2011. "Speech Recognition Through the Decades: How We Ended Up With Siri". PC World. November 2. Accessed 2018-07-24.
- PlanetarySociety. 2012. "Project : Planetary Microphones - The Mars Microphone". Planetary Society. Accessed 2018-07-28.
- Platek, Ondrej. 2014. "Automatic speech recognition using Kaldi". Institute of Formal and Applied Linguistics, Charles University in Prague. Accessed 2018-07-28.
- Restresco. 2017. "An overview of speech recognition APIs". RESTRESCO. February 21. Accessed 2018-07-28.
- Robinson, Tony. 1991. "A recurrent error propagation network speech recognition system". Computer Speech and Language. 5 (3): 259–274. July. Accessed 2018-07-26.
- Robinson, Tony. 1996. "The Use of Recurrent Neural Networks in Continuous Speech Recognition". The Kluwer International Series in Engineering and Computer Science. The Kluwer International Series in Engineering and Computer Science. 355: 233–258. Accessed 2018-07-26.
- Ronzhin, Andrey L. 2006. "Survey of Russian Speech Recognition Systems". SPECOM2006, St Petersburg. June 25. Accessed 2018-07-28.
- Rudnicky, Alexander I. 2014. "What are the performance measures in Speech recognition?". Research Gate. Accessed 2018-07-27.
- SVDS. 2017. "Open Source Toolkits for Speech Recognition". Silicon Valley Data Science. February 23. Accessed 2018-07-28.
- Saini, Preeti. 2013. "Automatic Speech Recognition: A Review". International Journal of Engineering Trends and Technology- Volume4Issue2. Accessed 2018-07-28.
- Saon, George. 2017. "Reaching new records in speech recognition". IBM Watson Blog. March 7. Accessed 2018-07-24.
- Senior, Andrew. 2015. "Google voice search: faster and more accurate". Google AI Blog. September 24. Accessed 2018-07-28.
- Stanford. 1968. "Here Hear: Speech Recognition 1968 Stanford". Stanford Artificial Intelligence Project. Accessed 2018-07-26.
- White, Sean. 2017. "Announcing the Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Dataset". The Mozilla Blog. November 29. Accessed 2018-07-28.
- Wikipedia. 2020. "Speech recognition." Wikipedia, July 21. Accessed 2020-07-23.
- Deep Neural Network
- Hidden Markov Model
- Recurrent Neural Network
- Long Short-Term Memory
- Gaussian Mixture Model