Audio Feature Extraction

The application of machine learning and deep learning to audio analysis is growing rapidly. Examples include automatic speech recognition, digital signal processing, and audio classification, tagging and generation. Virtual assistants such as Alexa, Siri and Google Home are largely built atop models that can perform artificial cognition from audio data.

To train any statistical or ML model, we first need to extract useful features from the audio signal. Audio feature extraction is a necessary step in audio signal processing, a subfield of signal processing that deals with the processing and manipulation of audio signals: removing unwanted noise, balancing time-frequency ranges, and converting between digital and analog representations.

This article introduces the most commonly used audio features that serve as inputs to models.

Discussion

  • What is an audio signal?
    Sampling and digitization of an analog signal, and later reconstructing the analog signal. Source: Buur 2016.

    An audio signal is a representation of sound. It encodes all the necessary information required to reproduce sound. Audio signals come in two basic types: analog and digital.

    Analog refers to audio recorded using methods that replicate the original sound waves. Examples include vinyl records and cassette tapes. Digital audio is recorded by taking samples of the original sound wave at a specified rate, called the sampling rate. CDs and MP3 files are examples of digital formats.

    In the real world, conversions between digital and analog waveforms are common and necessary. The ADC (Analog-to-Digital Converter) and the DAC (Digital-to-Analog Converter) are standard components of audio signal processing chains and perform these conversions.
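The sampling step can be sketched in a few lines of NumPy. The tone frequency, amplitude and sampling rate below are illustrative choices, not fixed values:

```python
import numpy as np

# Digitize a 440 Hz tone: sample the continuous sine at discrete instants.
sr = 22050                                   # sampling rate in Hz
duration = 1.0                               # seconds of audio
t = np.arange(int(sr * duration)) / sr       # sample instants in seconds
signal = 0.5 * np.sin(2 * np.pi * 440 * t)   # the digital audio signal

print(signal.shape)   # one second of audio yields 22050 samples
```

A DAC would later reconstruct an analog waveform from exactly such a sample array.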

  • What are the common audio features useful for modeling?

    Audio features are descriptions of sound or of an audio signal that can be fed into statistical or ML models to build intelligent audio systems. Audio applications that use such features include audio classification, speech recognition, automatic music tagging, audio segmentation and source separation, audio fingerprinting, audio denoising, music information retrieval, and more.

    Different features capture different aspects of sound. Generally, audio features are categorised with regard to the following aspects:

    • Level of Abstraction: High-level, mid-level and low-level features of musical signals.
    • Temporal Scope: Features that may be instantaneous, segment-level or global.
    • Musical Aspect: Acoustic properties that include beat, rhythm, timbre (colour of sound), pitch, harmony, melody, etc.
    • Signal Domain: Features in the time domain, frequency domain or both.
    • ML Approach: Hand-picked features for traditional ML modeling or automatic feature extraction for deep learning modeling.
  • How do we categorize audio features at various levels of abstraction?

    These broad categories cover mainly musical signals rather than audio in general:

    • High-level: These are the abstract features that are understood and enjoyed by humans. These include instrumentation, key, chords, melody, harmony, rhythm, genre, mood, etc.
    • Mid-level: These are features we can perceive. These include pitch, beat-related descriptors, note onsets, fluctuation patterns, MFCCs, etc. We may say that these are aggregations of low-level features.
    • Low-level: These are statistical features that are extracted from the audio. These make sense to the machine, but not to humans. Examples include amplitude envelope, energy, spectral centroid, spectral flux, zero-crossing rate, etc.
  • Could you briefly explain the temporal scope for audio features?

    This type of categorisation applies to audio in general, that is, both musical and non-musical:

    • Instantaneous: As the name suggests, these features give us instantaneous information about the audio signal. These consider tiny chunks of the audio signal, in the range of milliseconds. The minimum temporal resolution that humans are capable of appreciating is around 10ms.
    • Segment-level: These features can be calculated from segments of the audio signal in the range of seconds.
    • Global: These are aggregate features that provide information and describe the whole sound.
  • Could you explain the signal-domain features for audio?

    Signal-domain features are among the most important and descriptive features for audio in general:

    • Time domain: These are extracted from waveforms of the raw audio. Zero crossing rate, amplitude envelope, and RMS energy are examples.
    • Frequency domain: These focus on the frequency components of the audio signal. Signals are generally converted from the time domain to the frequency domain using the Fourier Transform. Band energy ratio, spectral centroid, and spectral flux are examples.
    • Time-frequency representation: These features combine both the time and frequency components of the audio signal. The time-frequency representation is obtained by applying the Short-Time Fourier Transform (STFT) on the time domain waveform. Spectrogram, mel-spectrogram, and constant-Q transform are examples.
  • Could you describe some time-domain audio features?
    Maximum amplitudes per frame shown in the waveform. Source: Velardo 2020b, 18:52.

    Amplitude Envelope of a signal consists of the maximum amplitude value among all samples in each frame. This feature gives a rough idea of loudness. It is, however, sensitive to outliers. This feature has been extensively used for onset detection and music genre classification.

    Root Mean Square Energy is based on all samples in a frame. It acts as an indicator of loudness, since the higher the energy, the louder the sound. It is, however, less sensitive to outliers than the Amplitude Envelope. This feature has been useful in audio segmentation and music genre classification tasks.

    Zero-Crossing Rate is simply the number of times a waveform crosses the horizontal time axis. This feature has been primarily used in recognition of percussive vs pitched sounds, monophonic pitch estimation, voiced/unvoiced decision for speech signals, etc.
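The three features above can be computed directly from the raw waveform. A minimal NumPy sketch, assuming a frame size of 1024 samples with no overlap (frame size and hop length are free parameters in practice):

```python
import numpy as np

FRAME = 1024  # samples per frame (a common, but arbitrary, choice)

def frames(x, frame=FRAME):
    # Split the signal into non-overlapping frames, dropping the remainder.
    n = len(x) // frame
    return x[:n * frame].reshape(n, frame)

def amplitude_envelope(x):
    # Maximum absolute amplitude among all samples in each frame.
    return np.abs(frames(x)).max(axis=1)

def rms_energy(x):
    # Root mean square over all samples in each frame.
    return np.sqrt((frames(x) ** 2).mean(axis=1))

def zero_crossing_rate(x):
    # Fraction of adjacent sample pairs whose signs differ, per frame.
    s = np.sign(frames(x))
    return (np.abs(np.diff(s, axis=1)) > 0).mean(axis=1)

# A 440 Hz sine at 22,050 Hz: RMS of a sine with amplitude A is A / sqrt(2).
sr = 22050
t = np.arange(sr) / sr
x = 0.8 * np.sin(2 * np.pi * 440 * t)
print(rms_energy(x)[0])   # close to 0.8 / sqrt(2) ≈ 0.566
```

For the pure tone, the envelope sits near 0.8 and the ZCR near 2 × 440 / 22050 ≈ 0.04 crossings per sample pair, matching the expected behaviour of each feature.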

  • What are the audio features under the ML approach?

    The traditional machine learning approach considers all or most of the features from both the time and frequency domains as inputs to the model. Features need to be hand-picked based on their effect on model performance. Some widely used features include Amplitude Envelope, Zero-Crossing Rate (ZCR), Root Mean Square (RMS) Energy, Spectral Centroid, Band Energy Ratio, and Spectral Bandwidth.

    The deep learning approach consumes unstructured audio representations such as spectrograms or MFCCs and extracts patterns on its own. By the late 2010s, this became the preferred approach since feature extraction is automatic. It's also supported by the abundance of data and computational power.

    Commonly used features or representations that are directly fed into neural network architectures are spectrograms, mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCCs).

  • What are spectrograms?
    Spectrogram of a male voice saying 'nineteenth century'. Source: Aquegg 2008.

    A spectrogram is a visual depiction of the spectrum of frequencies of an audio signal as it varies with time. Hence it includes both time and frequency aspects of the signal. It is obtained by applying the Short-Time Fourier Transform (STFT) on the signal. In the simplest of terms, the STFT of a signal is calculated by applying the Fast Fourier Transform (FFT) locally on small time segments of the signal.
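SciPy's STFT makes this concrete; the segment length below is an illustrative choice. The spectrogram is simply the squared magnitude of the STFT:

```python
import numpy as np
from scipy.signal import stft

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone

# STFT: the FFT applied locally on windowed segments of the signal.
freqs, times, Zxx = stft(x, fs=sr, nperseg=1024)
spectrogram = np.abs(Zxx) ** 2    # power at each (frequency, time) cell

# For a pure tone, energy concentrates in the bin nearest 440 Hz.
peak_bin = spectrogram.mean(axis=1).argmax()
print(freqs[peak_bin])
```

With `nperseg=1024`, frequency resolution is sr/1024 ≈ 21.5 Hz, so the peak bin lands within one bin of 440 Hz. Plotting `spectrogram` against `times` and `freqs` reproduces images like the one above.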

  • What is a mel-spectrogram?
    A mel-spectrogram. Source: Roberts 2020.

    We humans perceive sound logarithmically. We are better at detecting differences at lower frequencies than at higher frequencies. For example, we can easily tell the difference between 500 and 1,000 Hz, but we can hardly tell the difference between 10,000 and 10,500 Hz, even though the distance between the two pairs is the same. Hence, the mel scale was introduced. It is a logarithmic scale based on the principle that equal distances on the scale correspond to equal perceptual distances.

    Conversion from frequency (f) to mel scale (m) is given by

    $$ m = 2595 \cdot \log_{10}\left(1+\frac{f}{700}\right) $$

    A mel-spectrogram is therefore a spectrogram where the frequencies are converted to the mel scale.
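The conversion formula and its inverse are straightforward to implement. Note that 1,000 Hz maps to roughly 1,000 mel by construction of the scale:

```python
import numpy as np

def hz_to_mel(f):
    # m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of the above.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000))            # ≈ 1000 mel, by construction
print(mel_to_hz(hz_to_mel(440)))  # round-trips back to 440 Hz
```

The article's perceptual example checks out numerically: the 500→1,000 Hz pair spans far more mels than the 10,000→10,500 Hz pair, even though both are 500 Hz apart.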

  • What are MFCCs?
    Steps to extract MFCCs from an audio signal. Source: Mahanta et al. 2021, fig. 5.

    The rate of change in the spectral bands of a signal is given by its cepstrum. A cepstrum is the spectrum of the log of the spectrum of a time signal. The resulting domain is neither the frequency domain nor the time domain, and hence it was named the quefrency domain ('quefrency' being an anagram of 'frequency'). The Mel-Frequency Cepstral Coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum.

    The cepstrum conveys the values that construct the formants (characteristic resonances that define the quality of a speech sound) and the timbre of a sound. MFCCs are thus useful inputs for deep learning models.
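The pipeline in the figure (spectrogram → mel filterbank → log → discrete cosine transform) can be sketched from scratch. Filter count, FFT size, and the number of coefficients kept below are illustrative choices, not fixed values:

```python
import numpy as np
from scipy.signal import stft
from scipy.fft import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def mfcc(x, sr, n_fft=1024, n_filters=26, n_mfcc=13):
    _, _, Zxx = stft(x, fs=sr, nperseg=n_fft)       # time-frequency grid
    power = np.abs(Zxx) ** 2                        # power spectrogram
    mel_energies = mel_filterbank(n_filters, n_fft, sr) @ power
    log_mel = np.log(mel_energies + 1e-10)          # log compresses dynamics
    return dct(log_mel, type=2, axis=0, norm='ortho')[:n_mfcc]

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
print(mfcc(x, sr).shape)   # (13 coefficients, number of frames)
```

In practice one would reach for a library implementation, but the sketch shows where each stage of the figure enters the computation.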

  • What is Band Energy Ratio?
    The low and high frequency regions in a spectrogram. Source: Velardo 2020c, 5:18.

    The Band Energy Ratio (BER) provides the relation between the lower and higher frequency bands. It can be thought of as the measure of how dominant low frequencies are. This feature has been extensively used in music/speech discrimination, music classification etc.
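There is no single standard split frequency for BER; the sketch below assumes a 2,000 Hz split, a value often used for speech/music discrimination:

```python
import numpy as np
from scipy.signal import stft

def band_energy_ratio(x, sr, split_hz=2000, n_fft=1024):
    freqs, _, Zxx = stft(x, fs=sr, nperseg=n_fft)
    power = np.abs(Zxx) ** 2
    low = power[freqs < split_hz].sum(axis=0)    # energy below the split
    high = power[freqs >= split_hz].sum(axis=0)  # energy above the split
    return low / (high + 1e-10)                  # per-frame ratio

sr = 22050
t = np.arange(sr) / sr
bass = np.sin(2 * np.pi * 220 * t)   # energy well below 2000 Hz
print(band_energy_ratio(bass, sr).mean())   # large: low band dominates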

  • Could you explain the Spectral Centroid and Spectral Bandwidth features?
    Spectral Centroid plotted using a Librosa function. Source: Librosa Docs 2020.

    The Spectral Centroid provides the center of gravity of the magnitude spectrum. In other words, it gives the frequency band where most of the energy is concentrated. It maps into a very prominent timbral feature called "brightness of sound" (energetic, open, dull). Mathematically, the spectral centroid is the weighted mean of the frequency bins.

    The spectral bandwidth or spectral spread is derived from the spectral centroid. It is the spectral range of interest around the centroid, that is, the variance from the spectral centroid. It has a direct correlation with the perceived timbre. The bandwidth is directly proportional to the energy spread across frequency bands. Mathematically, it is the weighted mean of the distances of frequency bands from the Spectral Centroid.
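Both features can be computed from the magnitude spectrogram. A NumPy sketch, treating bandwidth as the magnitude-weighted standard deviation around the centroid (one common definition):

```python
import numpy as np
from scipy.signal import stft

def spectral_centroid_bandwidth(x, sr, n_fft=1024):
    freqs, _, Zxx = stft(x, fs=sr, nperseg=n_fft)
    mag = np.abs(Zxx)
    weights = mag / (mag.sum(axis=0) + 1e-10)          # per-frame weights
    centroid = (freqs[:, None] * weights).sum(axis=0)  # weighted mean freq
    spread = ((freqs[:, None] - centroid) ** 2 * weights).sum(axis=0)
    return centroid, np.sqrt(spread)                   # bandwidth as std dev

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
c, bw = spectral_centroid_bandwidth(x, sr)
print(c.mean())   # near 440 Hz for a pure tone
```

For a pure tone the centroid sits at the tone frequency and the bandwidth is small; for broadband noise the centroid moves toward the middle of the spectrum and the bandwidth grows.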

  • Which libraries provide the essential tools for audio data processing?

    Librosa and TorchAudio (PyTorch) are two widely used Python packages for audio data pre-processing.

Milestones

1928
Sampling of an analog signal. Source: Rbj 2006.

Harry Nyquist shows that up to 2B independent pulse samples per second can be sent through a system of bandwidth B. This result later underpins the Nyquist-Shannon sampling theorem.

1951

The Kay Electric Co. produces the first commercially available machine for audio spectrographic analysis, which they market under the trademark Sona-Graph. The graphs produced by a Sona-Graph come to be called Sonagrams. For decades, all spectrograms are called Sonagrams.

1957

Max Mathews becomes the first person to synthesize audio from a computer, giving birth to computer music.

1963

The concept of the cepstrum is introduced by B. P. Bogert, M. J. Healy, and J. W. Tukey. After publication of the FFT in 1965, the cepstrum is redefined so as to be reversible to the log spectrum. Shortly afterwards, Oppenheim and Schafer define the complex cepstrum, which is reversible to the time domain.

1965
The Fast Fourier Transform algorithm. Source: OhArthits 2010.

The Fast Fourier Transform (FFT) algorithm is developed by Cooley and Tukey. It reduces the computational complexity of the Discrete Fourier Transform (DFT) significantly, from \(O(N^2)\) to \(O(N \log_{2} N)\).

1988

Lewis and Todd propose the use of neural networks for automatic music composition. Lewis uses a multi-layer perceptron for his algorithmic approach to composition called "creation by refinement". On the other hand, Todd uses a Jordan auto-regressive neural network (RNN) to generate music sequentially — a principle that stays relevant in decades to come.

2002

Marolt et al. use a multi-layer perceptron operating on top of spectrograms for the task of note onset detection. This is the first time that someone processes music in a format that is not symbolic. This starts a new research era: learning a mapping system (or function) able to solve a task directly from raw audio, as opposed to solving it using engineered features (like spectrograms) or from symbolic music representations (like MIDI scores).

2009

Following Hinton's approach based on pre-training deep neural networks with deep belief networks, Lee et al. build the first deep convolutional neural network for music genre classification. This is the foundational work that establishes the basis for a generation of deep learning researchers designing better models to recognize high-level (semantic) concepts from music spectrograms.

2014

Dieleman and Schrauwen build the first end-to-end music classifier. They explore the idea of directly processing waveforms for the task of music audio tagging. They achieve some degree of success, though spectrogram-based models are still superior to waveform-based ones.

2016
The WaveNet layout. Source: Dufresne 2018.

DeepMind introduces WaveNet, a deep generative model of raw audio waveforms. It is able to generate relatively realistic-sounding human-like voices by directly modeling waveforms using a neural network trained on recordings of real speech.

Apr
2020
The t-SNE shows how the model learns to cluster similar artists and genres close together, and also makes some surprising associations. Source: OpenAI 2020.

OpenAI introduces Jukebox, a model that generates music with singing in the raw audio domain. They use VQ-VAE and the power of transformers to show that the combined model at scale can generate high-fidelity and diverse songs with coherence lasting multiple minutes.

References

  1. Aquegg. 2008. "File:Spectrogram-19thC.png." Wikimedia Commons, December 21. Accessed 2021-05-23.
  2. Buur, Michael Hansen. 2016. "Is the quality of a DAC related to software implementation?" Sound Design, StackExchange, September 14. Asked on 2016-09-13. Accessed 2021-05-23.
  3. Center Point Audio. 2021. "Understanding the difference between Analog and Digital Audio." Center Point Audio. Accessed 2021-05-23.
  4. Chauhan, Nagesh Singh. 2020. "Audio Data Analysis Using Deep Learning with Python (Part 1)." KDNuggets, February. Accessed 2021-05-23.
  5. Dhariwal, Prafulla, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. 2020. "Jukebox: A Generative Model for Music." arXiv, v1, April 30. Accessed 2021-05-23.
  6. Doshi, Ketan. 2021. "Audio Deep Learning Made Simple (Part 1): State-of-the-Art Techniques." Towards Data Science, on Medium, February 12. Accessed 2021-05-23.
  7. Dufresne, Steven. 2018. "Facebook's Universal Music Translator." Hackaday, June 2. Accessed 2021-05-23.
  8. Knees, Peter, and Markus Schedl. 2013. "Music Similarity and Retrieval." Tutorial, SIGIR, July 28. Accessed 2021-05-23.
  9. Knees, Peter, and Markus Schedl. 2016. "Music Similarity and Retrieval: An Introduction to Audio- and Web-based Strategies." The Information Retrieval Series, vol. 36., Springer-Verlag Berlin Heidelberg. doi: 10.1007/978-3-662-49722-7. Accessed 2021-05-23.
  10. Lee, Honglak, Peter Pham, Yan Largman, and Andrew Y. Ng. 2009. "Unsupervised feature learning for audio classification using convolutional deep belief networks." Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 1096-1104. Accessed 2021-05-23.
  11. Librosa Docs. 2020. "librosa.feature.spectral_centroid." Librosa Docs, v0.8.0, July 22. Accessed 2021-05-23.
  12. Mahanta, Saranga Kingkor, Abdullah Faiz Ur Rahman Khilji, and Partha Pakray. 2021. "Deep Neural Network for Musical Instrument Recognition Using MFCCs." Computación y Sistemas, vol. 25, no. 2. Accessed 2021-05-23.
  13. Marolt, Matija, Alenka Kavcic, and Marko Privosnik. 2002. "Neural Networks for Note Onset Detection in Piano Music." Proc. Int. Computer Music Conference, Gothenburg. Accessed 2021-05-23.
  14. McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. "librosa: Audio and music signal analysis in python." Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18-25. Accessed 2021-05-23.
  15. Nair, Prateeksha. 2018. "The dummy's guide to MFCC." On Medium, July 25. Accessed 2021-05-23.
  16. OhArthits. 2010. "File:DIT-FFT-butterfly.png." Wikimedia Commons, January 4. Accessed 2021-05-23.
  17. OpenAI. 2020. "Jukebox." Blog, OpenAI, April 30. Accessed 2021-05-23.
  18. Oppenheim, Alan V., and Ronald W. Schafer. 2004. "From frequency to quefrency: A history of the cepstrum." IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95-106. doi: 10.1109/MSP.2004.1328092. Accessed 2021-05-23.
  19. Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. "Pytorch: An imperative style, high-performance deep learning library." arXiv, v1, December 3. Accessed 2021-05-23.
  20. Pieplow, Nathan. 2009. "A Brief History of Spectrograms." Blog, Earbirding, December 7. Accessed 2021-05-23.
  21. Pons, Jordi. 2018. "Neural Networks For Music: A Journey Through Its History." Towards Data Science, on Medium, October 30. Accessed 2021-05-23.
  22. Pons, Jordi, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, and Xavier Serra. 2018. "End-to-end learning for music audio tagging at scale." arXiv, v4, June 15. Accessed 2021-05-23.
  23. Randall, Robert B. 2017. "A history of cepstrum analysis and its application to mechanical problems." Mechanical Systems and Signal Processing, vol. 97, pp. 3-19. doi: 10.1016/j.ymssp.2016.12.026. Accessed 2021-05-23.
  24. Rbj. 2006. "File:ReconstructFilter.png." Wikimedia Commons, August 18. Accessed 2021-05-23.
  25. Roberts, Leland. 2020. "Understanding the Mel Spectrogram." Analytics Vidhya, on Medium, March 6. Accessed 2021-05-23.
  26. Schutz, Michael, and Jonathan M. Vaisberg. 2012. "Surveying the temporal structure of sounds used in Music Perception." Music Perception: An Interdisciplinary Journal, vol. 31, no. 3, pp. 288-296. doi: 10.1525/mp.2014.31.3.288. Accessed 2021-05-23.
  27. Singh, Jyotika. 2019. "An introduction to audio processing and machine learning using Python." Opensource.com, Red Hat, Inc., September 19. Accessed 2021-05-23.
  28. van den Oord, Aaron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. "Wavenet: A generative model for raw audio." arXiv, v2, September 19. Accessed 2021-05-23.
  29. Velardo, Valerio. 2020a. "Audio Signal Processing for Machine Learning." Playlist on Youtube, The Sound of AI, October 19. Accessed 2021-05-23.
  30. Velardo, Valerio. 2020b. "How to Extract Audio Features." The Sound of AI, on YouTube, July 16. Accessed 2021-05-23.
  31. Velardo, Valerio. 2020c. "Frequency-Domain Audio Features." The Sound of AI, on YouTube, October 12. Accessed 2021-05-23.
  32. Wikipedia. 2021a. "Nyquist–Shannon sampling theorem." Wikipedia, March 23. Accessed 2021-05-23.
  33. Wikipedia. 2021b. "Audio Signal Processing." Wikipedia, May 7. Accessed 2021-05-23.
  34. Wikipedia. 2021c. "Fast Fourier transform." Wikipedia, May 19. Accessed 2021-05-23.

Further Reading

  1. Velardo, Valerio. 2020. "Audio Signal Processing for Machine Learning." Playlist on Youtube, The Sound of AI, October 19. Accessed 2021-05-23.
  2. Oppenheim, Alan V., and Ronald W. Schafer. 2004. "From frequency to quefrency: A history of the cepstrum." IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95-106. doi: 10.1109/MSP.2004.1328092. Accessed 2021-05-23.
  3. Roberts, Leland. 2020. "Understanding the Mel Spectrogram." Analytics Vidhya, on Medium, March 6. Accessed 2021-05-23.
  4. Nair, Prateeksha. 2018. "The dummy's guide to MFCC." On Medium, July 25. Accessed 2021-05-23.
  5. Schutz, Michael, and Jonathan M. Vaisberg. 2012. "Surveying the temporal structure of sounds used in Music Perception." Music Perception: An Interdisciplinary Journal, vol. 31, no. 3, pp. 288-296. doi: 10.1525/mp.2014.31.3.288. Accessed 2021-05-23.
  6. Chauhan, Nagesh Singh. 2020. "Audio Data Analysis Using Deep Learning with Python (Part 1)." KDNuggets, February. Accessed 2021-05-23.

Cite As

Devopedia. 2021. "Audio Feature Extraction." Version 8, May 23. Accessed 2021-09-09. https://devopedia.org/audio-feature-extraction