# Audio Feature Extraction

The application of machine intelligence and deep learning to audio analysis is growing rapidly. Examples include automatic speech recognition, digital signal processing, and audio classification, tagging and generation. Virtual assistants such as Alexa, Siri and Google Home are largely built atop models that perform artificial cognition on audio data.

To train any statistical or ML model, we must first extract useful features from the audio signal. Audio feature extraction is a necessary step in audio signal processing, a subfield of signal processing that deals with the processing or manipulation of audio signals: removing unwanted noise, balancing time-frequency ranges, converting between digital and analog representations, and other computational methods for altering sound.

This article introduces the most commonly used audio features that serve as inputs to such models.

## Discussion

• What is an audio signal?

An audio signal is a representation of sound. It encodes all the necessary information required to reproduce sound. Audio signals come in two basic types: analog and digital.

Analog refers to audio recorded using methods that replicate the original sound waves. Examples include vinyl records and cassette tapes. Digital audio is recorded by taking samples of the original sound wave at a specified rate, called the sampling rate. CDs and MP3 files are examples of digital formats.

In the real world, conversions between digital and analog waveforms are common and necessary. ADC (Analog-to-Digital Converter) and the DAC (Digital-to-Analog Converter) are part of audio signal processing and they achieve these conversions.

• What are the common audio features useful for modeling?

Audio features are descriptions of sound or of an audio signal that can be fed into statistical or ML models to build intelligent audio systems. Audio applications that use such features include audio classification, speech recognition, automatic music tagging, audio segmentation and source separation, audio fingerprinting, audio denoising, music information retrieval, and more.

Different features capture different aspects of sound. Generally, audio features are categorised with regard to the following aspects:

• Level of Abstraction: High-level, mid-level and low-level features of musical signals.
• Temporal Scope: Time-domain features that could be instantaneous, segment-level and global.
• Musical Aspect: Acoustic properties that include beat, rhythm, timbre (colour of sound), pitch, harmony, melody, etc.
• Signal Domain: Features in the time domain, frequency domain or both.
• ML Approach: Hand-picked features for traditional ML modeling or automatic feature extraction for deep learning modeling.

• How do we categorize audio features at various levels of abstraction?

These broad categories cover mainly musical signals rather than audio in general:

• High-level: These are the abstract features that are understood and enjoyed by humans. These include instrumentation, key, chords, melody, harmony, rhythm, genre, mood, etc.
• Mid-level: These are features we can perceive. These include pitch, beat-related descriptors, note onsets, fluctuation patterns, MFCCs, etc. We may say that these are aggregation of low-level features.
• Low-level: These are statistical features that are extracted from the audio. These make sense to the machine, but not to humans. Examples include amplitude envelope, energy, spectral centroid, spectral flux, zero-crossing rate, etc.

• Could you briefly explain the temporal scope for audio features?

This type of categorisation applies to audio in general, that is, both musical and non-musical:

• Instantaneous: As the name suggests, these features give us instantaneous information about the audio signal. These consider tiny chunks of the audio signal, in the range of milliseconds. The minimum temporal resolution that humans are capable of appreciating is around 10ms.
• Segment-level: These features can be calculated from segments of the audio signal in the range of seconds.
• Global: These are aggregate features that provide information and describe the whole sound.

• Could you explain the signal domain features for audio?

Signal domain features include some of the most important and descriptive features for audio in general:

• Time domain: These are extracted from waveforms of the raw audio. Zero crossing rate, amplitude envelope, and RMS energy are examples.
• Frequency domain: These focus on the frequency components of the audio signal. Signals are generally converted from the time domain to the frequency domain using the Fourier Transform. Band energy ratio, spectral centroid, and spectral flux are examples.
• Time-frequency representation: These features combine both the time and frequency components of the audio signal. The time-frequency representation is obtained by applying the Short-Time Fourier Transform (STFT) on the time domain waveform. Spectrogram, mel-spectrogram, and constant-Q transform are examples.

• Could you describe some time-domain audio features?

The Amplitude Envelope of a signal consists of the maximum amplitude value among all samples in each frame. This feature gives a rough idea of loudness. It is, however, sensitive to outliers. It has been used extensively for onset detection and music genre classification.

Root Mean Square Energy is based on all samples in a frame. It acts as an indicator of loudness: the higher the energy, the louder the sound. It is, however, less sensitive to outliers than the Amplitude Envelope. This feature has been useful in audio segmentation and music genre classification tasks.

Zero-Crossing Rate is simply the number of times a waveform crosses the horizontal time axis within a frame. This feature has been used primarily in recognising percussive vs pitched sounds, monophonic pitch estimation, voiced/unvoiced decisions for speech signals, etc.
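
As a sketch, all three time-domain features above can be computed from raw samples with NumPy alone. The frame size, hop length and synthetic test tone below are illustrative choices, not fixed conventions:

```python
import numpy as np

# Synthetic test signal: 1 s at 22,050 Hz, a 440 Hz tone
# whose second half is much quieter than the first.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)
y[sr // 2:] *= 0.1

frame_size, hop = 1024, 512

def split_frames(signal, frame_size, hop):
    n = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i*hop : i*hop + frame_size] for i in range(n)])

fr = split_frames(y, frame_size, hop)

# Amplitude envelope: maximum absolute amplitude in each frame.
amplitude_envelope = np.max(np.abs(fr), axis=1)

# RMS energy: root of the mean squared amplitude per frame.
rms = np.sqrt(np.mean(fr ** 2, axis=1))

# Zero-crossing rate: fraction of successive sample pairs that change sign.
zcr = np.mean(np.abs(np.diff(np.sign(fr), axis=1)) > 0, axis=1)
```

The quieter second half shows up as a drop in both the amplitude envelope and the RMS energy, while the ZCR stays roughly constant since the pitch does not change.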

• What are the audio features under the ML approach?

The traditional Machine Learning approach considers all or most of the features from both the time and frequency domains as inputs to the model. Features need to be hand-picked based on their effect on model performance. Some widely used features include Amplitude Envelope, Zero-Crossing Rate (ZCR), Root Mean Square (RMS) Energy, Spectral Centroid, Band Energy Ratio, and Spectral Bandwidth.

The Deep Learning approach considers unstructured audio representations such as the spectrogram or MFCCs, and extracts the patterns on its own. By the late 2010s, this became the preferred approach since feature extraction is automatic. It's also supported by the abundance of data and computational power.

Commonly used features or representations that are directly fed into neural network architectures are spectrograms, mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCCs).

• What are spectrograms?

A spectrogram is a visual depiction of the spectrum of frequencies of an audio signal as it varies with time. Hence it includes both time and frequency aspects of the signal. It is obtained by applying the Short-Time Fourier Transform (STFT) on the signal. In the simplest of terms, the STFT of a signal is calculated by applying the Fast Fourier Transform (FFT) locally on small time segments of the signal.
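A minimal spectrogram can be sketched in NumPy exactly as described: window each short segment and apply the FFT to it. The Hann window, frame size and 1 kHz test tone below are illustrative assumptions:

```python
import numpy as np

# A minimal STFT-based spectrogram: Hann window, 50% overlap.
def spectrogram(y, n_fft=1024, hop=512):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    S = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = y[i * hop : i * hop + n_fft] * window
        # The FFT of each windowed frame gives one time slice (column).
        S[:, i] = np.abs(np.fft.rfft(frame)) ** 2
    return S

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone
S = spectrogram(y)

# The energy concentrates in the frequency bin nearest 1000 Hz.
peak_bin = np.argmax(S.mean(axis=1))
peak_hz = peak_bin * sr / 1024
```

Each column of `S` is the power spectrum of one short segment, so plotting `S` with time on the horizontal axis and frequency on the vertical axis yields the familiar spectrogram image.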

• What is a mel-spectrogram?

Humans perceive sound logarithmically: we are better at detecting differences at lower frequencies than at higher ones. For example, we can easily tell the difference between 500 and 1,000 Hz, but we can hardly tell the difference between 10,000 and 10,500 Hz, even though the distance between the two pairs is the same. Hence, the mel scale was introduced. It is a logarithmic scale based on the principle that equal distances on the scale correspond to equal perceptual distances.

Conversion from frequency (f) to mel scale (m) is given by

$$m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$$

A mel-spectrogram is therefore a spectrogram where the frequencies are converted to the mel scale.
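
As a sketch, the conversion in both directions (using the common 2595/700 formulation) can be written and checked against the perceptual example of 500 vs 1,000 Hz and 10,000 vs 10,500 Hz:

```python
import numpy as np

# Frequency (Hz) <-> mel conversion.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Equal frequency gaps shrink on the mel scale as frequency grows:
low_gap = hz_to_mel(1000) - hz_to_mel(500)      # clearly audible difference
high_gap = hz_to_mel(10500) - hz_to_mel(10000)  # barely audible difference
```

The 500 Hz gap near the bottom of the spectrum spans far more mels than the same gap near 10 kHz, which mirrors how much easier the first pair is to tell apart.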

• What are MFCCs?

The rate of change in the spectral bands of a signal is given by its cepstrum. A cepstrum is basically the spectrum of the log of the spectrum of the time signal. The resulting spectrum is neither in the frequency domain nor in the time domain, and hence it was named the quefrency (an anagram of frequency) domain. The Mel-Frequency Cepstral Coefficients (MFCCs) are simply the coefficients that make up the mel-frequency cepstrum.

The cepstrum conveys the different values that construct the formants (a characteristic component of the quality of a speech sound) and timbre of a sound. MFCCs thus are useful for deep learning models.
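The MFCC pipeline for a single frame (power spectrum, triangular mel filterbank, log, DCT) can be sketched as follows. The filter and coefficient counts are conventional but adjustable choices, and this is an illustration rather than a production implementation:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Filter centre frequencies are equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, centre):          # rising slope
            fb[i, j] = (j - left) / max(centre - left, 1)
        for j in range(centre, right):         # falling slope
            fb[i, j] = (right - j) / max(right - centre, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_mfcc=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    mel_energies = mel_filterbank(n_filters, n_fft, sr) @ power
    log_mel = np.log(mel_energies + 1e-10)  # avoid log(0)
    # DCT of the log mel energies gives the cepstral coefficients.
    return dct(log_mel, type=2, norm='ortho')[:n_mfcc]

sr = 22050
t = np.arange(1024) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
```

Keeping only the first 13 or so coefficients retains the smooth spectral envelope (formants, timbre) while discarding fine harmonic detail.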

• What is Band Energy Ratio?

The Band Energy Ratio (BER) provides the relation between the lower and higher frequency bands. It can be thought of as a measure of how dominant the low frequencies are. This feature has been used extensively in music/speech discrimination, music classification, etc.
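
As a sketch, the BER can be computed per frame by summing spectral power on either side of a split frequency. The 2,000 Hz split used below is an illustrative choice, not a standard:

```python
import numpy as np

# Band Energy Ratio per frame: energy below a split frequency
# divided by energy above it.
def band_energy_ratio(y, sr, split_hz=2000, n_fft=1024, hop=512):
    window = np.hanning(n_fft)
    split_bin = int(split_hz * n_fft / sr)
    n_frames = 1 + (len(y) - n_fft) // hop
    ber = np.empty(n_frames)
    for i in range(n_frames):
        power = np.abs(np.fft.rfft(y[i*hop : i*hop + n_fft] * window)) ** 2
        ber[i] = power[:split_bin].sum() / (power[split_bin:].sum() + 1e-10)
    return ber

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
low = np.sin(2 * np.pi * 500 * t)    # energy below the split
high = np.sin(2 * np.pi * 5000 * t)  # energy above the split
```

A low-frequency tone yields a large BER, a high-frequency tone a small one, which is exactly the low-frequency dominance the feature is meant to capture.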

• Could you explain the Spectral Centroid and Spectral Bandwidth features?

The Spectral Centroid provides the center of gravity of the magnitude spectrum. In other words, it gives the frequency band where most of the energy is concentrated. It maps to a very prominent timbral attribute called "brightness of sound" (energetic, open, dull). Mathematically, the spectral centroid is the magnitude-weighted mean of the frequency bins.

The spectral bandwidth or spectral spread is derived from the spectral centroid. It is the spectral range of interest around the centroid, that is, the spread of the spectrum around the spectral centroid. It correlates directly with perceived timbre: the wider the energy spread across frequency bands, the larger the bandwidth. Mathematically, it is the weighted mean of the distances of frequency bands from the spectral centroid.
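
Both features reduce to weighted statistics of the magnitude spectrum, as the following single-frame sketch shows (the Hann window and test signals are illustrative assumptions):

```python
import numpy as np

# Spectral centroid: magnitude-weighted mean of frequencies.
# Spectral bandwidth: magnitude-weighted spread around the centroid.
def centroid_and_bandwidth(frame, sr):
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    weights = magnitude / magnitude.sum()
    centroid = np.sum(freqs * weights)
    bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * weights))
    return centroid, bandwidth

sr = 22050
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 2000 * t)          # narrowband signal
rng = np.random.default_rng(0)
noise = rng.standard_normal(2048)            # broadband signal

tone_c, tone_bw = centroid_and_bandwidth(tone, sr)
noise_c, noise_bw = centroid_and_bandwidth(noise, sr)
```

A pure tone has its centroid at the tone's frequency and a small bandwidth, while white noise spreads energy across all bands and so has a much larger bandwidth.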

• Which libraries provide the essential tools for audio data processing?

Librosa and TorchAudio (PyTorch) are two Python packages commonly used for audio data pre-processing.

## Milestones

1928

Harry Nyquist lays the groundwork for the Nyquist-Shannon sampling theorem by showing that up to 2B independent pulse samples per second can be sent through a system of bandwidth B.

1951

The Kay Electric Co. produces the first commercially available machine for audio spectrographic analysis, which they market under the trademark Sona-Graph. The graphs produced by a Sona-Graph come to be called Sonagrams. For decades, all spectrograms are called Sonagrams.

1957

Max Mathews becomes the first person to synthesize audio from a computer, giving birth to computer music.

1963

The concept of the cepstrum is introduced by B. P. Bogert, M. J. Healy, and J. W. Tukey. After publication of the FFT in 1965, the cepstrum is redefined so as to be reversible to the log spectrum. Shortly afterwards, Oppenheim and Schafer define the complex cepstrum, which is reversible to the time domain.

1965

The Fast Fourier Transform (FFT) algorithm is developed by Cooley and Tukey. It reduces the computational complexity of the Discrete Fourier Transform (DFT) significantly, from $$O(N^2)$$ to $$O(N \cdot \log_{2}N)$$.

1988

Lewis and Todd propose the use of neural networks for automatic music composition. Lewis uses a multi-layer perceptron for his algorithmic approach to composition called "creation by refinement". Todd, on the other hand, uses a Jordan auto-regressive neural network (RNN) to generate music sequentially, a principle that stays relevant in the decades to come.

2002

Marolt et al. use a multi-layer perceptron operating on top of spectrograms for the task of note onset detection. This is the first time that someone processes music in a format that is not symbolic. This starts a new research era: learning a mapping system (or function) able to solve a task directly from raw audio, as opposed to solving it using engineered features (like spectrograms) or from symbolic music representations (like MIDI scores).

2009

Following Hinton's approach based on pre-training deep neural networks with deep belief networks, Lee et al. build the first deep convolutional neural network for music genre classification. This is the foundational work that establishes the basis for a generation of deep learning researchers designing better models to recognize high-level (semantic) concepts from music spectrograms.

2014

Dieleman and Schrauwen build the first end-to-end music classifier. They explore the idea of directly processing waveforms for the task of music audio tagging. They achieve some degree of success, though spectrogram-based models are still superior to waveform-based ones.

2016

Deepmind introduces WaveNet, a deep generative model of raw audio waveforms. It is able to generate relatively realistic-sounding human-like voices by directly modeling waveforms using a neural network method trained with recordings of real speech.

Apr
2020

OpenAI introduces Jukebox, a model that generates music with singing in the raw audio domain. They use VQ-VAE and the power of transformers to show that the combined model at scale can generate high-fidelity and diverse songs with coherence lasting multiple minutes.


## Cite As

Devopedia. 2021. "Audio Feature Extraction." Version 8, May 23. Accessed 2022-09-22. https://devopedia.org/audio-feature-extraction