Audio augmentation by random speed changes and random filtering

2020-11-10

This example expands on common-voice-small, in which we discussed different ways of augmenting the dataset. Instead of simply copying samples, we can resample them randomly to make them slightly faster or slower. In addition, by applying random finite impulse response (FIR) filters to the signals, we can try to simulate differences between microphones. We'll apply these two augmentation techniques in this example and see if we can improve on our previous results.

tf.data.Dataset makes it easy to cache all raw audio samples into a single file, from which the whole dataset can be reloaded at each epoch. This means we can reapply both random augmentation techniques every epoch, hopefully producing different output each time.
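In tf.data terms, this corresponds to placing `Dataset.cache(filename)` before the randomized `map` steps, so the cache stores raw audio while the augmentation reruns every epoch. Here is a framework-free sketch of the same pattern; the function and file names are made up for illustration, and drawing a random speed ratio stands in for the actual augmentation:

```python
import os
import pickle
import random
import tempfile

def cache_samples(samples, path):
    # Write all raw samples into one binary cache file (done only once).
    with open(path, "wb") as f:
        pickle.dump(samples, f)

def load_epoch(path, speed_range=(0.9, 1.1)):
    # Reload the cached raw samples and draw a fresh random speed ratio
    # per sample, so every epoch sees differently augmented data built
    # from the same cached raw audio.
    with open(path, "rb") as f:
        samples = pickle.load(f)
    return [(s, random.uniform(*speed_range)) for s in samples]

cache_path = os.path.join(tempfile.mkdtemp(), "audio.cache")
cache_samples(["clip-0", "clip-1"], cache_path)
epoch_1 = load_epoch(cache_path)
epoch_2 = load_epoch(cache_path)
```

The raw samples are identical across epochs; only the random augmentation parameters differ.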

Data

This example uses the same data as in the common-voice-small example.

Loading the metadata

Checking the metadata is valid

Balancing the language distribution

We'll repeat the same random oversampling by audio sample length procedure as we did in common-voice-small. This time, we add a flag is_copy = True to each oversampled copy, which lets us easily select all copies when applying random speed changes to the audio signals.
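A minimal sketch of this flagging scheme, assuming a metadata layout with a "length" key per sample and length-weighted copy selection (the key names and weighting are assumptions, not the exact original code):

```python
import random

def oversample(samples_by_lang, seed=42):
    """Oversample every language up to the size of the largest one.

    Originals keep is_copy=False, while randomly drawn copies (weighted
    here by audio length) are flagged with is_copy=True, so that later
    augmentation steps can target only the copies.
    """
    rng = random.Random(seed)
    target = max(len(v) for v in samples_by_lang.values())
    out = []
    for lang, samples in samples_by_lang.items():
        for s in samples:
            out.append({**s, "lang": lang, "is_copy": False})
        deficit = target - len(samples)
        if deficit > 0:
            weights = [s["length"] for s in samples]
            for s in rng.choices(samples, weights=weights, k=deficit):
                out.append({**s, "lang": lang, "is_copy": True})
    return out
```

After this step every language has equally many samples, and the copies can be recovered by filtering on the flag.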

Inspecting the audio

Random filtering
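One way to implement the random FIR filtering is to convolve each signal with a short, randomly perturbed impulse response. The tap count and coefficient scale below are illustrative guesses, not values from the original example:

```python
import numpy as np

def random_fir_filter(signal, num_taps=10, scale=0.1, rng=None):
    """Convolve the signal with a short random FIR filter to roughly
    simulate unknown microphone/channel responses."""
    rng = np.random.default_rng() if rng is None else rng
    # Start from an identity filter (unit impulse) and perturb it with
    # small random coefficients, so the filtered signal stays close to
    # the original but has a randomly colored spectrum.
    taps = rng.normal(0.0, scale, size=num_taps)
    taps[0] += 1.0
    return np.convolve(signal, taps, mode="same")
```

Since the taps are redrawn on every call, each epoch filters the same clip through a different simulated channel.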

Random speed change
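A random speed change can be sketched as resampling the signal by a random ratio; linear interpolation stands in for a proper resampler here, and the ratio range is an assumption:

```python
import numpy as np

def random_speed_change(signal, low=0.9, high=1.1, rng=None):
    """Resample the signal by a random ratio drawn from [low, high],
    making it slightly faster (shorter) or slower (longer)."""
    rng = np.random.default_rng() if rng is None else rng
    ratio = rng.uniform(low, high)  # > 1 means faster, i.e. fewer samples
    new_len = max(1, int(round(len(signal) / ratio)))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, new_len)
    return np.interp(new_idx, old_idx, signal)
```

Note that this also shifts the pitch slightly, which is usually acceptable for augmentation purposes.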

Loading all data

Exhaust iterators to collect all audio into binary files
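With a file-backed cache such as tf.data's `Dataset.cache(filename)`, the cache file is only complete after one full pass over the dataset, so we iterate everything once before training. A trivial sketch of such a pass:

```python
def exhaust(iterable):
    """Iterate fully, discarding elements; with a file-backed cache,
    this first complete pass is what writes the cache to disk."""
    count = 0
    for _ in iterable:
        count += 1
    return count
```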

NOTE that this creates 7.2 GiB of additional data on disk.

Inspect dataset contents in TensorBoard

Train a supervised, neural network language classifier

Evaluate the classifier

Conclusions

Comparing to our previous example with the same dataset of 4 different languages (common-voice-small), the $\text{C}_\text{avg}$ value improved from 0.112 to 0.091 and accuracy from 0.803 to 0.846.

Even though it is tempting to conclude that our augmentation approach caused this improvement, we would probably need to run hundreds of experiments with carefully chosen configuration settings to get a reliable answer as to whether augmentation is useful.