Audio Classification with Deep Learning in Python
Problem Statement: Audio Classification with Domain Shift
Approaching Audio Classification as an Image Classification Problem with Deep Learning
Preparations: Getting Acquainted with Audio Data
Step 1: Convert the Audio Classification Problem to an Image Classification Problem
Step 2: Apply Augmentations to Audio Data
Step 3: Fine-tune a Pretrained Image Classification Model for Few-Shot Learning
Summary
Enjoyed This Story?
References

Fine-tuning image models with PyTorch and torchaudio to tackle domain shift and class imbalance in audio data

Classifying bird calls in soundscapes with Machine Learning (Image by the author)

Welcome to another edition of "The Kaggle Blueprints", where we analyze winning solutions from Kaggle competitions for lessons we can apply to our own data science projects.

This edition will review the techniques and approaches from the "BirdCLEF 2022" competition, which ended in May 2022.

The goal of the "BirdCLEF 2022" competition was to identify Hawaiian bird species by sound. Competitors were given short audio files of single bird calls and were asked to predict whether a particular bird was present in a longer recording.

In contrast to a vanilla audio classification problem, this competition added the following challenges:

  • Domain shift: The training data consisted of clean audio recordings of a single bird call, isolated from any additional sounds (a few seconds long, of varying lengths). However, the test data consisted of "unclean", longer (1-minute) recordings taken "in the wild" that contained sounds other than bird calls (e.g., wind, rain, other animals, etc.).
Domain shift in audio data
  • Class imbalance: As some birds are less common than others, we are dealing with a long-tailed class distribution in which some birds have only one sample.
Long-tailed class distribution

Insert your data here! — To follow along in this article, your dataset should look something like this:

Insert your data here: How your audio dataset dataframe should be formatted

A popular approach among competitors to this audio classification problem was to:

  1. Convert the audio classification problem to an image classification problem by converting the audio from waveform to a Mel spectrogram and applying a Deep Learning model
  2. Apply data augmentations to the audio data in waveform and in spectrograms to tackle the domain shift and class imbalance
  3. Fine-tune a pre-trained image classification model to tackle the class imbalance

This article uses PyTorch (version 1.13.0) as the Deep Learning framework, and torchaudio (version 0.13.0) and librosa (version 0.10.0) for audio processing. Additionally, we will be using timm (version 0.6.12) for fine-tuning with pre-trained image models.

# Deep Learning framework
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Audio processing
import torchaudio
import torchaudio.transforms as T
import librosa

# Pre-trained image models
import timm
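If you want to double-check that your environment matches these versions, a quick sanity check (a minimal sketch, nothing specific to this problem) looks like this:

# Print the installed versions of the libraries used in this article
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("librosa:", librosa.__version__)
print("timm:", timm.__version__)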

Before getting started with solving an audio classification problem, let's first get familiar with working with audio data. You can load the audio and its sampling rate from different file formats (e.g., .wav, .ogg, etc.) with the .load() method from either the torchaudio library or the librosa library.

PATH = "audio_example.wav"

# Load a sample audio file with torchaudio
original_audio, sample_rate = torchaudio.load(PATH)

# Load a sample audio file with librosa
original_audio, sample_rate = librosa.load(PATH,
                                           sr = None) # Gotcha: Set sr to None to get the original sampling rate. Otherwise, the default is 22050

If you want to listen to the loaded audio directly in a Jupyter notebook for exploration, the following code will give you an audio player.

# Play the audio in Jupyter notebook
from IPython.display import Audio

Audio(data = original_audio, rate = sample_rate)

Displaying audio player for loaded data in Jupyter notebook

The librosa library also provides various methods to quickly display the audio data for exploration purposes. If you used torchaudio to load the audio file, make sure to convert the tensors to NumPy arrays.

import librosa.display as dsp
dsp.waveshow(original_audio, sr = sample_rate);
Original audio data of the word "stop" in waveform from the "Speech Commands" dataset [0]
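If you loaded the audio with torchaudio, the waveform is a 2D tensor of shape (channels, samples); a minimal sketch of the conversion before plotting (reusing the variables from above) could look like this:

# Convert the torchaudio tensor (channels x samples) to a 1D NumPy array for plotting
dsp.waveshow(original_audio.squeeze().numpy(), sr = sample_rate);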

A popular approach to modeling audio data with a Deep Learning model is to convert the "computer hearing" problem into a computer vision problem [2]. Specifically, the waveform audio is converted to a Mel spectrogram (which is a type of image), as shown below.

Converting an audio file from waveform (time domain) to Mel spectrogram (frequency domain)

Usually, you would use a Fast Fourier Transform (FFT) to computationally convert an audio signal from the time domain (waveform) to the frequency domain (spectrogram).

However, the FFT gives you the overall frequency components for the entire time series of the audio signal as a whole. Thus, you lose the time information when converting the audio data from the time domain to the frequency domain.

Instead of the FFT, you can use the Short-Time Fourier Transform (STFT) to preserve the time information. The STFT is a variant of the FFT that breaks up the audio signal into smaller sections by using a sliding time window. It takes the FFT of each section and then combines them. Its two most important parameters are:

  • n_fft — length of the sliding window (default: 2048)
  • hop_length — number of samples by which to slide the window (default: 512). The hop_length will directly impact the resulting image size. If your audio data has a fixed length and you want to convert the waveform to a fixed image size, you can set hop_length = audio_length // (image_size[1] - 1), as shown in the worked example below.
Short-Time Fourier Transform (STFT)
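As a small worked example of the hop_length formula above (the clip length, sampling rate, and target image width are hypothetical):

# Hypothetical example: 5 seconds of audio at a 32 kHz sampling rate
audio_length = 5 * 32000                        # 160,000 samples
image_width = 313                               # desired spectrogram width in pixels

hop_length = audio_length // (image_width - 1)  # 160000 // 312 = 512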

Next, you convert the amplitude to decibels and bin the frequencies according to the Mel scale. For this purpose, n_mels is the number of frequency bands (Mel bins), which will be the height of the resulting spectrogram.

Convert amplitude to decibels and apply Mel binning to the spectrum
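Putting the pieces together, a minimal torchaudio sketch of this conversion (the parameter values match the Dataset example below; applying AmplitudeToDB there is optional) could look like this:

# Waveform (time domain) -> Mel spectrogram (frequency domain) -> decibel scale
audio_tensor = torch.as_tensor(original_audio)  # no-op for torchaudio tensors, converts librosa's NumPy output

mel_transform = T.MelSpectrogram(sample_rate = sample_rate,
                                 n_mels = 128,
                                 n_fft = 2048,
                                 hop_length = 512)
to_db = T.AmplitudeToDB()

melspec = to_db(mel_transform(audio_tensor))    # shape: (..., n_mels, time_frames)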

For an in-depth explanation of the Mel spectrogram, I recommend this article:

Below, you can see an example PyTorch Dataset that loads an audio file and converts the waveform to a Mel spectrogram after some preprocessing steps.

class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000,
                 wave_transforms = None,
                 spec_transforms = None):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length
        self.wave_transforms = wave_transforms
        self.spec_transforms = spec_transforms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Convert to mono
        audio = torch.mean(audio, axis = 0)

        # Resample
        if sample_rate != self.target_sample_rate:
            resample = T.Resample(sample_rate, self.target_sample_rate)
            audio = resample(audio)

        # Adjust number of samples
        if audio.shape[0] > self.num_samples:
            # Crop
            audio = audio[:self.num_samples]
        elif audio.shape[0] < self.num_samples:
            # Pad
            audio = F.pad(audio, (0, self.num_samples - audio.shape[0]))

        # Add any preprocessing you want here
        # (e.g., noise removal, etc.)
        ...

        # Add any data augmentations for waveform you want here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        ...

        # Convert to Mel spectrogram
        melspectrogram = T.MelSpectrogram(sample_rate = self.target_sample_rate,
                                          n_mels = 128,
                                          n_fft = 2048,
                                          hop_length = 512)
        melspec = melspectrogram(audio)

        # Add any data augmentations for spectrogram you want here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        ...

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}

Your resulting dataset should produce samples that look something like this before we feed them to the neural network:

Sample structure from the Audio Dataset
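A minimal sketch of wiring this Dataset into a DataLoader (the dataframe df, the clip length, and the batch size are assumptions) could look like this:

# Hypothetical setup: 5-second clips, batches of 32 samples
train_dataset = AudioDataset(df, audio_length = 5)
train_loader = DataLoader(train_dataset, batch_size = 32, shuffle = True)

batch = next(iter(train_loader))
print(batch["image"].shape)   # e.g., torch.Size([32, 1, 128, 313])
print(batch["label"].shape)   # e.g., torch.Size([32, num_classes])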

One way to tackle this competition's challenges of domain shift and class imbalance was to apply data augmentations to the training data [5, 8, 10, 11]. You can apply data augmentations to audio data both in the waveform and in the spectrogram. The torchaudio library already provides many different data augmentations for audio data.

Popular data augmentation techniques for audio data in the waveform are (see the sketch after this overview):

  • Noise injection like white noise, colored noise, or background noise (AddNoise)
  • Shifting time
  • Changing speed (Speed; alternatively use TimeStretch in the frequency domain)
  • Changing pitch (PitchShift)
Overview of different data augmentation techniques for audio in waveform: noise injection (white noise, colored noise, background noise), shifting time, changing speed and pitch
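Since not every torchaudio version ships all of these transforms, here is a minimal sketch of two custom waveform augmentations in plain PyTorch (white noise injection and a random time shift; the noise level and shift fraction are arbitrary choices):

# White noise injection: add Gaussian noise scaled relative to the signal's peak amplitude
def add_white_noise(audio, noise_level = 0.05):
    noise = torch.randn_like(audio) * noise_level * audio.abs().max()
    return audio + noise

# Time shift: roll the waveform by a random number of samples
def shift_time(audio, max_shift_fraction = 0.2):
    max_shift = int(audio.shape[-1] * max_shift_fraction)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(audio, shifts = shift, dims = -1)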

Popular data augmentation techniques for audio data in the spectrogram are (see the sketch after this overview):

  • Popular image augmentation techniques like Mixup [13] or Cutmix [12]
  • SpecAugment [7] with time masking and frequency masking
Data Augmentation for Spectrogram: Mixup [13]
Data Augmentation for Spectrogram: SpecAugment [7]
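For SpecAugment-style masking [7], torchaudio provides TimeMasking and FrequencyMasking out of the box; a minimal sketch (the mask sizes are arbitrary choices, applied to the melspec from the sketch above) could look like this:

# SpecAugment-style masking on a Mel spectrogram
spec_augment = nn.Sequential(
    T.FrequencyMasking(freq_mask_param = 24),  # mask up to 24 Mel bins
    T.TimeMasking(time_mask_param = 48),       # mask up to 48 time frames
)

augmented_melspec = spec_augment(melspec)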

As you can see, while torchaudio provides many audio augmentations, it does not cover all of the proposed data augmentation techniques.

Thus, if you want to inject a particular type of noise, shift the time, or apply Mixup [13] or Cutmix [12] augmentations, you need to write a custom data augmentation in PyTorch. You can reference this collection of audio data augmentation techniques for their implementations:

In the example PyTorch Dataset class from before, you can apply the data augmentations as follows:

class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Add any preprocessing you want here
        # (e.g., converting to mono, resampling, adjusting size, noise removal, etc.)
        ...

        # Add any data augmentations for waveform you want here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        wave_transforms = T.PitchShift(sample_rate, 4)
        audio = wave_transforms(audio)

        # Convert to Mel spectrogram
        melspec = ...

        # Add any data augmentations for spectrogram you want here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        spec_transforms = T.FrequencyMasking(freq_mask_param = 80)
        melspec = spec_transforms(melspec)

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}

In this competition, we are dealing with class imbalance. As some classes have only one sample, we are also dealing with a few-shot learning problem. Nakamura and Harada [6] showed in 2019 that fine-tuning can be an effective approach to few-shot learning.

Many competitors [2, 5, 8, 10, 11] fine-tuned common pre-trained image classification models such as:

  • EfficientNet (e.g., tf_efficientnet_b3_ns) [9],
  • SE-ResNext (e.g., se_resnext50_32x4d) [3],
  • NFNet (e.g., eca_nfnet_l0) [1]

You can load any pre-trained image classification model with the timm library for fine-tuning. Make sure to set in_chans = 1, as we are not working with 3-channel images but with 1-channel Mel spectrograms.

class AudioModel(nn.Module):
    def __init__(self,
                 num_classes,
                 model_name = 'tf_efficientnet_b3_ns',
                 pretrained = True):
        super(AudioModel, self).__init__()

        self.model = timm.create_model(model_name,
                                       pretrained = pretrained,
                                       in_chans = 1)
        self.in_features = self.model.classifier.in_features
        self.model.classifier = nn.Sequential(
            nn.Linear(self.in_features, num_classes)
        )

    def forward(self, images):
        logits = self.model(images)
        return logits
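A quick sanity check of the model with a dummy batch of 1-channel Mel spectrograms (the number of classes and the input size are hypothetical) could look like this:

# Forward a dummy batch of Mel spectrograms through the model
model = AudioModel(num_classes = 21)       # hypothetical number of bird species
dummy_batch = torch.randn(4, 1, 128, 313)  # (batch, channels, n_mels, time_frames)
logits = model(dummy_batch)
print(logits.shape)                        # torch.Size([4, 21])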

Other competitors reported successes from fine-tuning models pre-trained on similar audio classification problems [4, 10].

Fine-tuning is done with a cosine annealing learning rate scheduler (CosineAnnealingLR) for a few epochs [2, 8].

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max = ...,    # Maximum number of iterations
                                                       eta_min = ...)  # Minimum learning rate
PyTorch Cosine Annealing / Decay Learning Rate Scheduler (Image by the author, originally published in "A Visual Guide to Learning Rate Schedulers in PyTorch")
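A minimal sketch of wiring the scheduler into a training loop (the optimizer, learning rates, loss function, and number of epochs are all assumptions, reusing the model and train_loader from the sketches above) could look like this:

# Hypothetical fine-tuning setup for multi-label targets
num_epochs = 5
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr = 1e-4)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer,
                                           T_max = num_epochs * len(train_loader),
                                           eta_min = 1e-6)

for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch["image"]), batch["label"])
        loss.backward()
        optimizer.step()
        scheduler.step()   # step once per batch so the schedule spans all iterations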

You can find more tips and best practices in this guide for fine-tuning Deep Learning models:

Subscribe for free to get notified when I publish a new story.

Become a Medium member to read more stories from other writers and me. You can support me by using my referral link when you sign up. I'll receive a commission at no extra cost to you.

Find me on LinkedIn, Twitter, and Kaggle!

Dataset

As the original competition data does not allow commercial use, the examples are shown with the following dataset.

[0] Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

License: CC-BY-4.0

Image References

If not otherwise stated, all images are created by the author.

Web & Literature

[1] Brock, A., De, S., Smith, S. L., & Simonyan, K. (2021, July). High-performance large-scale image recognition without normalization. In International Conference on Machine Learning (pp. 1059–1071). PMLR.

[2] Chai Time Data Science (2022). BirdCLEF 2022: 11th Pos Gold Solution | Gilles Vandewiele (accessed March 13th, 2023)

[3] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).

[4] Kramarenko Vladislav (2022). 4th place in Kaggle Discussions (accessed March 13th, 2023)

[5] LeonShangguan (2022). [Public #1 Private #2] + [Private #7/8 (potential)] solutions. The host wins. in Kaggle Discussions (accessed March 13th, 2023)

[6] Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216.

[7] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

[8] slime (2022). 3rd place solution in Kaggle Discussions (accessed March 13th, 2023)

[9] Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR.

[10] Volodymyr (2022). 1st place solution models (it's not all BirdNet) in Kaggle Discussions (accessed March 13th, 2023)

[11] yokuyama (2022). 5th place solution in Kaggle Discussions (accessed March 13th, 2023)

[12] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6023–6032).

[13] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017) mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
