Common Voice spoken language identification with a neural network

2020-11-08

This example is a thorough but simple walk-through of everything from loading mp3-files containing speech to preprocessing and transforming the speech data into something we can feed to a neural network classifier. Deep learning based speech analysis is a vast research topic and there are countless techniques that could be applied to improve the results of this example. This example avoids going into too much detail on these techniques and instead focuses on getting an end-to-end classification pipeline up and running with a small dataset.

Data

This example uses open speech data downloaded from the Mozilla Common Voice project. See the readme file for downloading the data. In addition to the space needed for the downloaded data, you will need at least 10 GiB of free disk space for caching (can be disabled).

Loading the metadata

We start by preprocessing the Common Voice metadata files.

Update datadir and workdir to match your setup. All output will be written to workdir.

Common Voice metadata is distributed as tsv files and all audio samples are mp3-files under clips.

There's plenty of metadata, but it seems that the train-dev-test split has been predefined, so let's use that.

pandas makes it easy to read, filter, and manipulate metadata in tables. Let's try to preprocess all metadata here so we don't have to worry about it later.
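
Here is one way the metadata loading might look, assuming the Common Voice directory layout datadir/<language>/{train,dev,test}.tsv with audio under <language>/clips. The concrete paths, the language list, and the id column are assumptions of this sketch, not something lidbox requires.

    import os
    import pandas as pd

    datadir = "/data/cv4"        # downloaded Common Voice data (assumption)
    workdir = "/data/exp/cv4"    # all output goes here
    languages = ["et", "mn", "ta", "tr"]

    dfs = []
    for lang in languages:
        for split in ("train", "dev", "test"):
            df = pd.read_csv(os.path.join(datadir, lang, split + ".tsv"), sep="\t")
            df["label"] = lang
            df["split"] = split
            # Expand the mp3 filenames in the "path" column into full paths
            df["path"] = df["path"].map(lambda p: os.path.join(datadir, lang, "clips", p))
            dfs.append(df)

    meta = pd.concat(dfs, ignore_index=True)
    # Use the mp3 filename (without extension) as a unique utterance id
    meta["id"] = meta["path"].map(lambda p: os.path.splitext(os.path.basename(p))[0])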

Checking that all splits are disjoint by speaker

To ensure our neural network will learn what language is being spoken and not who is speaking, we want to test it on data that does not have any voices present in the training data. The client_id should correspond to a unique, pseudonymized identifier for every speaker.

Let's check that all splits are disjoint by speaker id.
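
A quick check with set intersections over the client_id column might look like this:

    # Check that no client_id appears in more than one split.
    split2speakers = {
        split: set(meta[meta["split"] == split]["client_id"])
        for split in ("train", "dev", "test")}

    print("train & dev :", len(split2speakers["train"] & split2speakers["dev"]))
    print("train & test:", len(split2speakers["train"] & split2speakers["test"]))
    print("dev & test  :", len(split2speakers["dev"] & split2speakers["test"]))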

We can see that none of the speakers are in two or more dataset splits. We also see that the test set has a lot of unique speakers who are not in the training set. This is good because we want to test that our neural network classifier knows how to classify input from unknown speakers.

Checking that all audio files exist
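
As a sanity check, we can also verify that every path in the metadata points to an existing file; a minimal sketch:

    # Verify that every mp3 file referenced in the metadata exists on disk.
    num_missing = (~meta["path"].map(os.path.exists)).sum()
    assert num_missing == 0, "{} files are missing".format(num_missing)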

Balancing the language distribution

Let's see how many samples we have per language.
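
With the metadata table from above, this is a one-liner:

    # Number of audio files per language and split.
    print(meta.groupby(["label", "split"]).size().unstack(fill_value=0))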

We can see that the numbers of samples with Mongolian, Tamil, and Turkish speech are quite balanced, but we have a significantly larger amount of Estonian speech. More data is of course always better, but if there is too much of one label compared to the others, our neural network might overfit on this label.

But these are only the counts of audio files; how much speech do we have in total per language? We need to read every file to get a reliable answer. See also SoX for a good command line tool.
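
One way to get the duration of every mp3 in Python is the audioread package (the same decoder fallback librosa uses). Reading every file takes a while, so the result is worth storing in the metadata table; a sketch:

    import audioread

    def mp3_duration(path):
        # audioread exposes the duration of the decoded file in seconds.
        with audioread.audio_open(path) as f:
            return f.duration

    meta["duration"] = meta["path"].map(mp3_duration)
    print(meta.groupby("label")["duration"].describe())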

The median length of Estonian samples is approx. 2.5 seconds greater than that of Turkish samples, which have the shortest median length. We can also see that the total amount of Estonian speech is much larger than that of the other languages in our datasets. Notice also the significant number of outliers with long durations in the Tamil and Turkish datasets.

Let's do simple random oversampling for the training split using the following approach (sketched in code after the list):

  1. Select as the target language the one with the maximum total amount of speech in seconds (Estonian).
  2. Compute differences in total durations between the target language and the three other languages.
  3. Compute median signal length by language.
  4. Compute sample sizes by dividing the duration deltas by the median signal lengths, separately for each language.
  5. Draw samples with replacement from the metadata separately for each language.
  6. Merge samples with rest of the metadata and verify there are no duplicate ids.
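
A pandas sketch of these steps, using the duration column computed earlier; assigning fresh ids to the copies is omitted here:

    train = meta[meta["split"] == "train"]

    total_dur = train.groupby("label")["duration"].sum()
    median_dur = train.groupby("label")["duration"].median()

    # Steps 1-2: target language and duration deltas to it
    target_label = total_dur.idxmax()
    dur_delta = total_dur[target_label] - total_dur.drop(target_label)

    # Steps 3-5: sample sizes from deltas and median lengths, then draw with replacement
    samples = []
    for label, delta in dur_delta.items():
        sample_size = int(delta / median_dur[label])
        samples.append(train[train["label"] == label].sample(n=sample_size, replace=True))

    # Step 6: merge with the rest of the metadata; in practice the copies would also
    # need new unique ids, which this sketch skips
    train_oversampled = pd.concat([train] + samples, ignore_index=True)
    print(train_oversampled.groupby("label")["duration"].sum())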

Speech data augmentation is a common research topic. There are better ways to augment data than the simple duplication of metadata rows we did here. One easy-to-implement approach that might work well (although we won't be doing it here) is to take copies of signals and make them randomly a bit faster or slower. For example, draw speed ratios at random from [0.9, 1.1] and resample each copy by multiplying its sample rate with the random ratio.
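
For reference, such a speed perturbation might look roughly like this with librosa (not used in the rest of this example):

    import numpy as np
    import librosa

    def random_speed_change(signal, sample_rate, lo=0.9, hi=1.1):
        # Resample to a randomly scaled rate; playing the result back at the
        # original rate makes the speech slightly faster or slower.
        ratio = np.random.uniform(lo, hi)
        return librosa.resample(signal, orig_sr=sample_rate, target_sr=int(ratio * sample_rate))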

Inspecting the audio

Let's take a look at the speech data and listen to a few randomly picked samples from each label. We pick 2 random samples for each language from the training set.

Then let's read the mp3-files from disk, plot the signals, and listen to the audio.
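
A sketch of the sampling, reading, and listening steps. The read_mp3 helper here is one possible implementation based on librosa (everything is resampled to 16 kHz mono, an assumption the later sketches rely on); plotting is left out.

    import librosa
    from IPython.display import Audio, display

    # Pick 2 random clips per language from the training set.
    sample = (meta[meta["split"] == "train"]
              .groupby("label")
              .sample(n=2)
              .reset_index(drop=True))

    def read_mp3(path, resample_rate=16000):
        # Decode the mp3, mix down to mono, and resample to a fixed rate.
        signal, rate = librosa.load(path, sr=resample_rate, mono=True)
        return signal, rate

    for row in sample.itertuples():
        signal, rate = read_mp3(row.path)
        print(row.label, row.path)
        display(Audio(signal, rate=rate))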

One of the most challenging aspects of the Mozilla Common Voice dataset is that the audio quality varies greatly: different microphones, background noise, the user speaking close to the device or far away, etc. It is difficult to ensure that a neural network will learn to classify different languages as opposed to classifying distinct acoustic artefacts from specific microphones. There's a vast amount of research being done on developing techniques for solving these kinds of problems. However, they are well out of scope for this simple example and we won't be studying them here.

Spectral representations

It is usually not possible (at least not yet in 2020) to detect languages directly from the waveform. Instead, the fast Fourier transform (FFT) is applied on small, overlapping windows of the signal to get a 2-dimensional representation of energies in different frequency bands. See this for further details.

However, output from the FFT is usually not usable directly and must be refined. Let's begin by selecting the first signal from our random sample and extracting its power spectrogram.
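
A sketch with tf.signal; the 25 ms frame length, 10 ms step, and 512-point FFT are typical values and an assumption here:

    import tensorflow as tf

    def power_spectrogram(signal, frame_length=400, frame_step=160, fft_length=512):
        # Short-time Fourier transform over 25 ms frames with a 10 ms step,
        # followed by squaring the magnitudes to get power.
        stft = tf.signal.stft(signal, frame_length, frame_step, fft_length=fft_length)
        return tf.math.square(tf.math.abs(stft))

    # First signal of the random sample
    signal, rate = read_mp3(sample["path"].iloc[0])
    powspec = power_spectrogram(tf.constant(signal, tf.float32))
    print(powspec.shape)  # (num_frames, fft_length // 2 + 1)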

Power spectrogram

This representation is very sparse, with zeros everywhere except in the lowest frequency bands. The main problem here is that relative differences between energy values are very large, making it difficult to compare changes in energy. These differences can be reduced by mapping the values onto a logarithmic scale.

The decibel-scale is a common choice. We will use the maximum value of powspec as the reference power ($\text{P}_0$).
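
A minimal conversion, computing $10 \log_{10}(\text{P} / \text{P}_0)$ with the maximum of powspec as $\text{P}_0$:

    # Decibel scale: 10 * log10(P / P_0), using the maximum power as reference P_0.
    log10 = lambda x: tf.math.log(x) / tf.math.log(10.0)
    powspec_db = 10.0 * log10(powspec / tf.math.reduce_max(powspec) + 1e-10)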

Decibel-scale spectrogram

This is an improvement, but the representation is still rather sparse. We also see that most speech information is in the lower bands, with a bit of energy in the higher frequencies. A common approach is to "squeeze together" the frequency axis (y-axis) by using a different scale, such as the Mel-scale. Let's "squeeze" the current 256 frequency bins into 40 Mel-bins.
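
With tf.signal this could look like the following; the 80 Hz and 7600 Hz filterbank edges are assumptions:

    # Triangular filterbank that warps the linear frequency bins onto 40 Mel bins.
    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40,
        num_spectrogram_bins=powspec.shape[-1],
        sample_rate=16000,
        lower_edge_hertz=80.0,
        upper_edge_hertz=7600.0)

    melspec = tf.matmul(powspec, mel_weights)
    logmelspec = tf.math.log(melspec + 1e-6)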

Log-scale Mel-spectrogram

Note that we are scaling different things here. The Mel-scale warps the frequency bins (y-axis), while the logarithm is used to reduce relative differences between individual spectrogram values (pixels).

One common normalization technique is frequency channel standardization, i.e. normalization of rows to zero mean and unit variance.

Or apply only mean-normalization if you think the variances contain important information.
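
Both normalizations are simple reductions over the time axis of the (time, mel) matrix:

    # Standardize each frequency channel to zero mean and unit variance over time.
    mean = tf.math.reduce_mean(logmelspec, axis=0, keepdims=True)
    std = tf.math.reduce_std(logmelspec, axis=0, keepdims=True)
    logmelspec_standardized = (logmelspec - mean) / (std + 1e-10)

    # Mean-normalization only, keeping the variances.
    logmelspec_meannorm = logmelspec - mean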

Cepstral representations

Another common representation is the Mel-frequency cepstral coefficients (MFCCs), which are obtained by applying the discrete cosine transform to the log-scale Mel-spectrogram.
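
TensorFlow provides this directly:

    # MFCCs: discrete cosine transform of the log-scale Mel-spectrogram.
    mfcc = tf.signal.mfccs_from_log_mel_spectrograms(logmelspec)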

MFCC

Most of the information is concentrated in the lower coefficients. It is common to drop the 0th coefficient and select a subset starting at 1, e.g. 1 to 20. See this post for more details.

Now we have a very compact representation, but most of the variance is still in the lower coefficients and overshadows the smaller changes in higher coefficients. We can normalize the MFCC matrix row-wise by standardizing each row to zero mean and unit variance. This is commonly called cepstral mean and variance normalization (CMVN).
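
A sketch of the coefficient selection and CMVN:

    # Keep coefficients 1..20 and standardize each coefficient over time (CMVN).
    mfcc_subset = mfcc[..., 1:21]
    cmvn = ((mfcc_subset - tf.math.reduce_mean(mfcc_subset, axis=0, keepdims=True))
            / (tf.math.reduce_std(mfcc_subset, axis=0, keepdims=True) + 1e-10))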

MFCC + CMVN

Which one is best?

Speech feature extraction is a large, active research topic and it is impossible to choose one representation that would work well in all situations. Common choices in state-of-the-art spoken language identification are log-scale Mel-spectrograms and MFCCs, with different normalization approaches. For example, here is an experiment in Arabic dialect identification, where log-scale Mel-spectra (referred to as FBANK) produced slightly better results compared to MFCCs.

It is not obvious when to choose which representation, or if we should even use the FFT at all. You can read this post for a more detailed discussion.

Voice activity detection

It is common for speech datasets to contain audio samples with short segments of silence or sounds that are not speech. Since these are usually irrelevant for making a language classification decision, we would prefer to discard such segments. This is called voice activity detection (VAD) and it is another large, active research area. Here is a brief overview of VAD.

Non-speech segments can be either noise or silence. Separating non-speech noise from speech is non-trivial but possible, for example with neural networks. Silence, on the other hand, contains much lower energy values than segments with speech, so it shows up as near-zero values in our speech representations. Such segments are therefore easy to detect and discard, for example by comparing the energy of a segment to the average energy of the whole sample.

If the samples in our example do not contain much background noise, a simple energy-based VAD technique should be enough to drop all silent segments. We'll use the root mean square (RMS) energy to detect short silence segments. lidbox has a simple energy-based VAD function, which we will use as follows (a rough sketch of the idea is shown after the list):

  1. Divide the signal into non-overlapping 10 ms long windows.
  2. Compute RMS of each window.
  3. Reduce all window RMS values by averaging to get a single mean RMS value.
  4. Set a decision threshold at 0.1 for marking silence windows. In other words, if the window RMS is less than 0.1 of the mean RMS, mark the window as silence.
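
A rough NumPy sketch of the same idea (not the lidbox implementation):

    import numpy as np

    def energy_vad_decisions(signal, sample_rate=16000, window_ms=10, threshold=0.1):
        # True for windows considered speech, False for silence.
        window_len = sample_rate * window_ms // 1000
        num_windows = len(signal) // window_len
        windows = np.reshape(signal[:num_windows * window_len], (num_windows, window_len))
        rms = np.sqrt(np.mean(np.square(windows), axis=1))
        return rms >= threshold * np.mean(rms)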

The filtered signal has less silence, but some of the pauses between words sound too short and unnatural. We would prefer not to remove small pauses that normally occur between words, so let's say all pauses shorter than 300 ms should not be filtered out. Let's also move all VAD code into a function.
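
One way to write such a function, building on the sketch above and again only approximating what lidbox does:

    def remove_silence(signal, sample_rate=16000, window_ms=10,
                       threshold=0.1, min_pause_ms=300):
        window_len = sample_rate * window_ms // 1000
        speech = energy_vad_decisions(signal, sample_rate, window_ms, threshold)
        # Flip silence runs shorter than min_pause_ms back to "speech" so that
        # natural, short pauses between words are kept.
        min_run, i = min_pause_ms // window_ms, 0
        while i < len(speech):
            if not speech[i]:
                j = i
                while j < len(speech) and not speech[j]:
                    j += 1
                if j - i < min_run:
                    speech[i:j] = True
                i = j
            else:
                i += 1
        kept = [signal[k * window_len:(k + 1) * window_len] for k in np.flatnonzero(speech)]
        return np.concatenate(kept) if kept else signal[:0]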

We dropped some silence segments but left most of the speech intact; perhaps this is enough for our example.

Although this VAD approach is simple and works ok for our data, it will not work for speech data with non-speech sounds in the background like music or noise. For such data we might need more powerful VAD filters such as neural networks that have been trained on a speech vs non-speech classification task with large amounts of different noise.

But let's not add more complexity to our example. We'll use the RMS-based filter for all other signals too.

Comparison of representations

Let's extract these features for all signals in our random sample.

Loading the samples to a tf.data.Dataset iterator

Our dataset is relatively small (2.5 GiB) and we might be able to read all files into signals and keep them in main memory. However, most speech datasets are much larger due to the amount of data needed for training neural network models that would be of any practical use. We need some kind of lazy iteration or streaming solution that views only one part of the dataset at a time. One such solution is to represent the dataset as a TensorFlow iterator, which evaluates its contents only when they are needed, similar to the MapReduce programming model for big data.

The downside with lazy iteration or streaming is that we lose the capability of doing random access by row id. However, this shouldn't be a problem since we can always keep the whole metadata table in memory and do random access on its rows whenever needed.

Another benefit of TensorFlow dataset iterators is that we can map arbitrary tf.functions over the dataset and TensorFlow will automatically parallelize the computations and place them on different devices, such as the GPU. The core architecture of lidbox has been organized around the tf.data.Dataset API, leaving all the heavy lifting for TensorFlow to handle.

But before we load all our speech data, let's warm up with our small random sample of 8 rows.

Let's load it into a tf.data.Dataset.
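
For example, assuming the sample table and language list from earlier; label2target is a helper mapping introduced here for the integer targets:

    # Map language labels to integer targets and build a Dataset of metadata dicts.
    label2target = {label: target for target, label in enumerate(sorted(languages))}

    sample_ds = tf.data.Dataset.from_tensor_slices({
        "id": sample["id"].tolist(),
        "path": sample["path"].tolist(),
        "label": sample["label"].tolist(),
        "target": [label2target[label] for label in sample["label"]],
    })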

All elements produced by the Dataset iterator are dicts of (string, Tensor) pairs, where the string denotes the metadata type.

Although the Dataset object is primarily for automating large-scale data processing pipelines, it is easy to extract all elements as numpy-values:
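
For example:

    for element in sample_ds.as_numpy_iterator():
        print(element["id"].decode(), element["label"].decode(), element["target"])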

Reading audio files

Let's load the signals by mapping a file-reading function over every element of the dataset. We'll add a tf.data.Dataset function wrapper on top of read_mp3, which we defined earlier. TensorFlow will infer the input and output values of the wrapper as tensors from the type signature of the dataset elements. We must use tf.numpy_function if we want to allow calling the non-TensorFlow function read_mp3 also from inside the graph environment. It might not be as efficient as using TensorFlow ops, but reading a file involves a lot of latency anyway, so this is not a big performance hit. Besides, we can always hide the latency by reading several files in parallel.
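
A sketch of such a wrapper; the shape of the decoded signal must be declared by hand, since tf.numpy_function cannot infer it:

    def read_mp3_wrapper(x):
        # tf.numpy_function lets us call the plain Python read_mp3 inside the graph;
        # the path arrives as a bytes object and the decoded signal comes back as float32.
        signal = tf.numpy_function(
            lambda path: read_mp3(path.decode("utf-8"))[0],
            [x["path"]],
            tf.float32)
        # Declare the signal as a 1D tensor of unknown length.
        signal.set_shape([None])
        return dict(x, signal=signal)

    sample_ds = sample_ds.map(read_mp3_wrapper)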

Removing silence and extracting features

Organizing all preprocessing steps as functions that can be mapped over the Dataset object allows us to represent complex transformations easily.
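
For example, silence removal and log-scale Mel-spectrogram extraction can be expressed as two more mapped functions, reusing the helpers sketched earlier:

    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40, num_spectrogram_bins=257, sample_rate=16000,
        lower_edge_hertz=80.0, upper_edge_hertz=7600.0)

    def remove_silence_wrapper(x):
        signal = tf.numpy_function(remove_silence, [x["signal"]], tf.float32)
        signal.set_shape([None])
        return dict(x, signal=signal)

    def logmelspec_wrapper(x):
        powspec = power_spectrogram(x["signal"])
        logmelspec = tf.math.log(tf.matmul(powspec, mel_weights) + 1e-6)
        return dict(x, logmelspec=logmelspec)

    sample_ds = (sample_ds
                 .map(remove_silence_wrapper)
                 .map(logmelspec_wrapper))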

Inspecting dataset contents in TensorBoard

lidbox has a helper function for dumping element information into TensorBoard summaries. This converts all 2D features into images, writes signals as audio summaries, and extracts utterance ids.

Open a terminal and launch TensorBoard to view the summaries written to $workdir/cache/tensorboard/dataset/sample:

tensorboard --logdir /data/exp/cv4/cache/tensorboard

Then open the url in a browser and inspect the contents. You can leave the server running, since we'll log the training progress to the same directory.

Loading all data

We'll now begin loading everything from disk and preparing a pipeline from mp3-filepaths to neural network input. We'll use the autotune feature of tf.data to allow TensorFlow to figure out automatically how much of the pipeline should be split up into parallel calls.
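
Putting the earlier pieces together, the full pipeline might look like this (assuming meta now also contains the oversampled training rows):

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    def pipeline_from_metadata(df, shuffle=False):
        # Full pipeline from metadata rows to log-scale Mel-spectrogram features.
        ds = tf.data.Dataset.from_tensor_slices({
            "id": df["id"].tolist(),
            "path": df["path"].tolist(),
            "label": df["label"].tolist(),
            "target": [label2target[label] for label in df["label"]],
        })
        if shuffle:
            ds = ds.shuffle(len(df))
        return (ds.map(read_mp3_wrapper, num_parallel_calls=AUTOTUNE)
                  .map(remove_silence_wrapper, num_parallel_calls=AUTOTUNE)
                  .map(logmelspec_wrapper, num_parallel_calls=AUTOTUNE))

    split2ds = {
        split: pipeline_from_metadata(meta[meta["split"] == split], shuffle=(split == "train"))
        for split in ("train", "dev", "test")}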

Testing pipeline performance

Note that so far we have only constructed the pipeline of all the steps we want to compute. All TensorFlow ops are computed only when elements are requested from the iterator.

Let's iterate over the training dataset from first to last element to ensure the pipeline will not be a performance bottleneck during training.
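
A simple timing loop is enough for this:

    import time

    start = time.perf_counter()
    num_elements = sum(1 for _ in split2ds["train"])
    print("iterated over", num_elements, "elements in",
          round(time.perf_counter() - start, 2), "seconds")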

Caching pipeline state

We can cache the iterator state as a single binary file at arbitrary stages. This allows us to automatically skip all steps that precede the call to tf.data.Dataset.cache.

Let's cache the training dataset and iterate again over all elements to fill the cache. Note that you will still be storing all data on the disk (4.6 GiB of new data), so this optimization is a space-time tradeoff.
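
For example, with the cache files placed under workdir:

    cachedir = os.path.join(workdir, "cache")
    os.makedirs(os.path.join(cachedir, "data"), exist_ok=True)

    # All steps before this call are skipped once the cache file has been filled.
    split2ds["train"] = split2ds["train"].cache(os.path.join(cachedir, "data", "train"))

    # Iterate once over all elements to fill the cache.
    for _ in split2ds["train"]:
        pass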

If we iterate over the dataset again, TensorFlow should read all elements from the cache file.

As a side note, if your training environment has fast read-write access to a file system configured for reading and writing very large files, this optimization can be a very significant performance improvement.

Note also that all usual problems related to cache invalidation apply. When caching extracted features and metadata to disk, be extra careful in your experiments to ensure you are not interpreting results computed on data from some outdated cache.

Dumping a few batches to TensorBoard

Let's extract the first 100 elements of every split to TensorBoard.

Training a supervised, neural network language classifier

We have now configured an efficient data pipeline and extracted some data samples to summary files for TensorBoard. It is time to train a classifier on the data.

Drop metadata from dataset

During training, we only need a tuple of model input and targets. We can therefore drop everything else from the dataset elements just before training starts. This is also a good place to decide if we want to train on MFCCs or Mel-spectra.
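
For example, training on the log-scale Mel-spectrograms:

    def as_model_input(x):
        # Keep only the model input and the integer target; swap in MFCCs here if preferred.
        return x["logmelspec"], x["target"]

    train_ds = split2ds["train"].map(as_model_input)
    dev_ds = split2ds["dev"].map(as_model_input)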

Asserting all input is valid

Since the training dataset is cached, we can quickly iterate over all elements and check that we don't have any NaNs or negative targets.

It is also easy to compute statistics on the dataset elements, for example the global minimum and maximum values of the inputs.
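
For example:

    global_min, global_max = float("inf"), float("-inf")

    for inputs, target in train_ds.as_numpy_iterator():
        assert not np.isnan(inputs).any(), "NaN in model input"
        assert target >= 0, "negative target"
        global_min = min(global_min, float(inputs.min()))
        global_max = max(global_max, float(inputs.max()))

    print("global min {:.3f}, global max {:.3f}".format(global_min, global_max))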

Selecting a model architecture

lidbox provides a small set of neural network model architectures out of the box. Many of these architectures have good results in the literature for different datasets. These models have been implemented in Keras, so you could replace the model we are using here with anything you want.

The "x-vector" architecture has worked well in speaker and language identification, so let's create an untrained Keras x-vector model. One of its core features is learning fixed-length vector representations (x-vectors) for input of arbitrary length. These vectors are extracted from the first fully connected layer (segment1), without activation. This opens up opportunities for doing all kinds of statistical analysis on these vectors, but that's out of scope for our example.
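
lidbox ships its own implementation, so the following is only a condensed Keras sketch of the x-vector idea; the layer sizes follow the x-vector literature, but details such as padding and the output activation are simplifications:

    import tensorflow as tf
    from tensorflow.keras import layers

    def create_xvector(num_freq_bins, num_labels):
        # Frame-level layers: 1D convolutions over time with increasing dilation.
        inputs = layers.Input(shape=(None, num_freq_bins))
        x = inputs
        for filters, kernel, dilation in [(512, 5, 1), (512, 3, 2), (512, 3, 3), (512, 1, 1), (1500, 1, 1)]:
            x = layers.Conv1D(filters, kernel, dilation_rate=dilation,
                              activation="relu", padding="same")(x)
            x = layers.BatchNormalization()(x)
        # Statistics pooling: concatenate mean and standard deviation over time,
        # turning an arbitrary-length input into a fixed-length vector.
        mean = layers.GlobalAveragePooling1D()(x)
        std = layers.Lambda(lambda t: tf.math.reduce_std(t, axis=1))(x)
        stats = layers.Concatenate()([mean, std])
        # Segment-level layers; x-vectors are read from "segment1" before activation.
        x = layers.Dense(512, name="segment1")(stats)
        x = layers.Activation("relu")(x)
        x = layers.Dense(512, activation="relu")(x)
        outputs = layers.Dense(num_labels, activation="softmax")(x)
        return tf.keras.Model(inputs=inputs, outputs=outputs, name="xvector")

    model = create_xvector(num_freq_bins=40, num_labels=len(languages))
    model.summary()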

We'll try to regularize the network by adding frequency channel dropout with probability 0.8. In other words, during training we randomly set input rows to zeros with probability 0.8. This might help avoid overfitting the network on frequency channels containing noise that is irrelevant for deciding the language.

Channel dropout demo

Here's what happens to the input during training.
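
One way to demonstrate the effect is Keras' SpatialDropout1D, which zeroes out entire feature channels instead of single values; in the model itself such a layer would sit right after the input:

    # Zero out whole frequency channels of a (batch, time, frequency) input
    # with probability 0.8, as during training.
    channel_dropout = tf.keras.layers.SpatialDropout1D(0.8)
    dropped = channel_dropout(logmelspec[tf.newaxis], training=True)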

Training the classifier

The validation set is needed after every epoch, so we might as well cache it. Note that this writes 2.5 GiB of additional data to disk the first time the validation set is iterated over, i.e. at the end of epoch 1. Also, we can't batch the input since the signals have different lengths (this could perhaps be solved with ragged tensors).
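
A sketch of the training setup; the optimizer, learning rate, number of epochs, and checkpoint paths are assumptions:

    dev_ds = dev_ds.cache(os.path.join(cachedir, "data", "dev"))

    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        metrics=["sparse_categorical_accuracy"])

    checkpoint_path = os.path.join(cachedir, "model", "weights.hdf5")
    os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)

    history = model.fit(
        train_ds.batch(1),
        validation_data=dev_ds.batch(1),
        epochs=20,
        callbacks=[
            tf.keras.callbacks.TensorBoard(
                log_dir=os.path.join(cachedir, "tensorboard", "model")),
            tf.keras.callbacks.ModelCheckpoint(
                checkpoint_path, monitor="val_loss",
                save_weights_only=True, save_best_only=True),
        ])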

Evaluating the classifier

Let's run all test set samples through our trained model by loading the best weights from the cache.
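
For example, reusing the checkpoint path from the training sketch:

    # Load the checkpoint with the best validation loss and predict on the test set.
    model.load_weights(checkpoint_path)

    test_ds = split2ds["test"].map(as_model_input).batch(1)
    predictions = model.predict(test_ds)
    pred_targets = predictions.argmax(axis=-1)
    true_targets = np.array([target for _, target in test_ds.unbatch().as_numpy_iterator()])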

Average detection cost ($\text{C}_\text{avg}$)

The de facto standard metric for evaluating spoken language classifiers might be the average detection cost ($\text{C}_\text{avg}$), which has been refined to its current form during past language recognition competitions. lidbox provides this metric as a tf.keras.Metric subclass. Scikit-learn provides other commonly used metrics so there is no need to manually compute those.
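
We won't reimplement $\text{C}_\text{avg}$ here, but computing the usual scikit-learn metrics on the predictions might look like this:

    from sklearn.metrics import classification_report, confusion_matrix

    target2label = {t: l for l, t in label2target.items()}
    print(classification_report(
        true_targets, pred_targets,
        target_names=[target2label[t] for t in sorted(target2label)]))
    print(confusion_matrix(true_targets, pred_targets))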

Conclusions

This was a simple example of deep learning based spoken language identification for 4 different languages from the Mozilla Common Voice free speech datasets. We managed to train a model that adequately recognizes the languages spoken by the test set speakers.

However, there is clearly room for improvement. We did simple random oversampling to balance the language distribution in the training set, but perhaps there are better ways to do this. We also did not tune optimization hyperparameters or try different neural network architectures or layer combinations. It might also be possible to increase robustness by audio feature engineering, such as random FIR filtering to simulate microphone differences.