
Spoken language recognition on Mozilla Common Voice — Audio Transformations.


Photo by Kelly Sikkema on Unsplash

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. A common way to get extra data is to synthesize it by applying various transformations to the available dataset.

In this article, we’ll consider five popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset, which contains the sentence The burning fire had been extinguished.

import librosa as lr
import numpy as np
import IPython

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') #load signal

IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.
Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR) — the ratio between the maximal signal amplitude and the standard deviation of the noise. We’ll generate several noise levels, defined via the SNR, and see how they alter the signal.

SNRs = (5, 10, 100, 1000) #signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}

for snr in SNRs:
    noise_std = max(abs(signal))/snr #get noise std for this SNR
    noise = noise_std*np.random.randn(len(signal),) #generate Gaussian noise with the given std
    noisy_signal[snr] = signal + noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))

Signals obtained by superimposing noise with SNR=5 and SNR=1000 on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform for several noise levels (image by the author)

So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
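
In an augmentation pipeline, this transformation is conveniently wrapped into a small helper that draws a random SNR on every call. The function below is only an illustrative sketch (the name add_noise and the SNR range are not part of the original code):

def add_noise(signal, snr_range=(5, 1000)):
    snr = np.random.uniform(*snr_range) #draw a random SNR for this call
    noise_std = max(abs(signal))/snr #noise std from max amplitude and SNR
    return signal + noise_std*np.random.randn(len(signal),) #superimpose Gaussian noise

noisy = add_noise(signal)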

The simplest way to change the speed is just to pretend that the signal has a different sample rate. However, this will also change the pitch (how low or high the audio sounds). Increasing the sampling rate makes the voice sound higher. To illustrate this, we will “increase” the sampling rate of our example by a factor of 1.5:

IPython.display.Audio(signal, rate=sr*1.5)
Signal obtained by using a false sampling rate for the original MCV sample common_voice_en_100040 (generated by the author).
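
Note that passing rate=sr*1.5 only tells the player to interpret the samples at a higher rate. To bake the change into an array at the original sampling rate (e.g. to feed it to a classifier), one can resample the signal. A sketch using librosa.resample:

#treat the signal as if it had been recorded at 1.5*sr and resample it back to sr,
#which makes it 1.5 times shorter (faster and higher-pitched when played at sr)
signal_fast = lr.resample(signal, orig_sr=int(sr*1.5), target_sr=sr)

IPython.display.Audio(signal_fast, rate=sr)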

Changing the speed without affecting the pitch is more difficult. One needs to use the Phase Vocoder (PV) algorithm. Briefly, the input signal is first split into overlapping frames. Then, the spectrum of each frame is computed by applying the Fast Fourier Transform (FFT). The playing speed is then modified by resynthesizing the frames at a different rate. Since the frequency content of each frame is not affected, the pitch stays the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.
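
For reference, librosa ships its own phase vocoder that operates on a precomputed STFT. A minimal sketch of the frame/FFT/resynthesis pipeline described above (using librosa’s default STFT parameters) could look like this:

D = lr.stft(signal) #split into overlapping frames and apply the FFT
D_fast = lr.phase_vocoder(D, rate=1.3) #resynthesize the frames at a 1.3x faster rate
signal_fast = lr.istft(D_fast) #back to the time domain

IPython.display.Audio(signal_fast, rate=sr)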

For our experiments, we’ll use the stretch_wo_loop time stretching function from this PV implementation.

stretching_factor = 1.3

signal_stretched = stretch_wo_loop(signal, stretching_factor) #stretch the signal without changing the pitch
IPython.display.Audio(signal_stretched, rate=sr)

Signal obtained by varying the speed of the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after speed increase (image by the author)

So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is large, the phase interpolation between frames does not work well. As a result, echo artefacts may appear in the transformed audio.

To change the pitch without affecting the speed, we can apply the same PV time stretch but pretend that the signal has a different sampling rate, such that the total duration of the signal stays the same:

IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Signal obtained by varying the pitch of the original MCV sample common_voice_en_100040 (generated by the author).

Why do we even bother with this PV when librosa already has the time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. If you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function so that it yields the Fourier output without taking the inverse transform. One could probably also try to dig into the librosa code to achieve similar results.
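
With librosa’s phase vocoder, the same shortcut amounts to keeping the stretched STFT and computing features directly from it, without going back to the time domain. A sketch (the mel spectrogram here is just an example of a downstream feature):

D_fast = lr.phase_vocoder(lr.stft(signal), rate=1.3) #stretched STFT, no inverse transform
mel = lr.feature.melspectrogram(S=np.abs(D_fast)**2, sr=sr) #features computed straight from the spectrum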

The following two transformations (time masking and cut & splice) were originally proposed in the frequency domain (Park et al. 2019). The idea was to save time on the FFT by using precomputed spectra for audio augmentation. For simplicity, we’ll demonstrate how these transformations work in the time domain. The listed operations can easily be transferred to the frequency domain by replacing the time axis with frame indices.

Time masking

The idea of time masking is to cover up a random region of the signal. The neural network then has fewer chances to learn signal-specific temporal variations that are not generalizable.

max_mask_length = 0.3 #maximum mask duration, proportion of signal length

L = len(signal)

mask_length = int(L*np.random.rand()*max_mask_length) #randomly select mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly select mask position

masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)

Signal obtained by applying the time mask transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after time masking (the masked region is indicated with orange) (image by the author)
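
As mentioned above, the same operation is easily transferred to the frequency domain: instead of zeroing a range of samples, one zeroes a range of frame indices in a precomputed (mel) spectrogram, as in SpecAugment (Park et al. 2019). A sketch with illustrative parameters:

mel = lr.feature.melspectrogram(y=signal, sr=sr) #precomputed mel spectrogram, shape (n_mels, n_frames)

n_frames = mel.shape[1]
mask_frames = int(n_frames*np.random.rand()*max_mask_length) #randomly select mask length in frames
mask_start = int((n_frames-mask_frames)*np.random.rand()) #randomly select mask position

masked_mel = mel.copy()
masked_mel[:, mask_start:mask_start+mask_frames] = 0 #zero out all frequency bins in the chosen frames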

Cut & splice

The idea is to replace a randomly selected region of the signal with a random fragment from another signal that has the same label. The implementation is almost identical to that of time masking, except that a piece of another signal is inserted in place of the mask.

other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') #load second signal

max_fragment_length = 0.3 #maximum fragment duration, proportion of signal length

L = min(len(signal), len(other_signal))

mask_length = int(L*np.random.rand()*max_fragment_length) #randomly select mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly select mask position

synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)

Synthetic signal obtained by applying the cut & splice transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after the cut & splice transformation (the inserted fragment from the other signal is indicated with orange) (image by the author)
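
In practice, such transformations are usually applied on the fly during training, each with some probability, so that the network sees a differently perturbed version of every sample at each epoch. A minimal sketch of such a random augmentation step (the choice of transformations and all parameters are purely illustrative):

def augment(signal, sr):
    choice = np.random.randint(3) #pick one transformation at random
    if choice == 0: #add noise
        noise_std = max(abs(signal))/np.random.uniform(5, 1000)
        return signal + noise_std*np.random.randn(len(signal),)
    if choice == 1: #time masking
        L = len(signal)
        mask_length = int(L*np.random.rand()*0.3)
        mask_start = int((L-mask_length)*np.random.rand())
        out = signal.copy()
        out[mask_start:mask_start+mask_length] = 0
        return out
    #speed/pitch change via resampling
    factor = np.random.uniform(0.8, 1.2)
    return lr.resample(signal, orig_sr=int(sr*factor), target_sr=sr)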
