Language vectors, recurrent neural networks, and an angular proximity loss function

2020-11-21

In this example, we take a different approach to training language vectors (embeddings) than in common-voice-embeddings. Previously, we trained a neural network on a classification task and used one of its layers as the representation for the different classes. This time, we train a neural network directly on the language vector task by maximizing the angular distance between vectors of different classes. We'll be using the approach described by G. Gelly and J.L. Gauvain.

Data

We will continue with the same 4-language Common Voice dataset as in the previous examples.

Loading and preparing the metadata

Preparing the feature extraction pipeline

Most of the preprocessing will be the same as in common-voice-embeddings, but this time we will not train on samples of varying length.

The main change is that every sample is divided into fixed-length, 3.2-second chunks; a rough sketch of this step is shown below.
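As a rough illustration (not the actual lidbox pipeline), chunking a signal could look like the sketch below. The 16 kHz sample rate and the "<sample_id>-<chunk_index>" chunk naming are assumptions of this sketch.

```python
import numpy as np

def chunk_signal(signal, sample_id, sample_rate=16000, chunk_seconds=3.2):
    """Divide a signal into non-overlapping, fixed-length chunks.

    Chunk ids are formed as "<sample_id>-<chunk_index>" so that chunks can
    be merged back per sample later (this naming scheme is an assumption).
    """
    chunk_len = int(chunk_seconds * sample_rate)
    chunks = []
    for i, begin in enumerate(range(0, len(signal) - chunk_len + 1, chunk_len)):
        chunk_id = "{}-{:03d}".format(sample_id, i)
        chunks.append((chunk_id, signal[begin:begin + chunk_len]))
    return chunks
```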

Filling the caches
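As a sketch under assumed names (train_ds, dev_ds and test_ds as tf.data.Dataset pipelines that end with .cache()), iterating over each split once writes the extracted features into the cache files, so later epochs can skip feature extraction.

```python
# Assumption: train_ds, dev_ds and test_ds are tf.data.Dataset pipelines that
# use .cache(); exhausting each iterator once fills the cache files.
for split_ds in (train_ds, dev_ds, test_ds):
    for _ in split_ds:
        pass
```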

Training the LSTM model with angular proximity loss

lidbox implements both the model and the angular proximity loss function used in the reference paper. The loss function maximizes the cosine distance between language vectors of different languages while minimizing the distance between vectors of the same language. One reference vector is generated for each class such that all reference vectors are mutually orthogonal.
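lidbox ships its own implementation; the following is only a minimal sketch of the idea, and one plausible formulation rather than the exact loss from the paper. It assumes integer targets and a NumPy matrix of mutually orthogonal, L2-normalized reference vectors; all names are illustrative.

```python
import tensorflow as tf

def make_angular_proximity_loss(references):
    """Minimal sketch of an angular proximity style loss (not lidbox's code).

    references: (num_languages, dim) NumPy matrix of mutually orthogonal,
    L2-normalized reference vectors, one per language.
    """
    references = tf.math.l2_normalize(
        tf.convert_to_tensor(references, tf.float32), axis=1)
    num_languages = tf.shape(references)[0]

    def loss(y_true, y_pred):
        # y_true: integer language indices, y_pred: predicted language vectors
        v = tf.math.l2_normalize(y_pred, axis=1)
        # Cosine similarity of every predicted vector to every reference vector
        cos_sim = tf.matmul(v, references, transpose_b=True)
        target = tf.one_hot(tf.cast(tf.reshape(y_true, [-1]), tf.int32), num_languages)
        # Pull each vector towards the reference of its own language and
        # penalize positive similarity to the references of other languages
        pull = tf.reduce_sum(target * (1.0 - cos_sim), axis=1)
        push = tf.reduce_sum((1.0 - target) * tf.nn.relu(cos_sim), axis=1)
        return tf.reduce_mean(pull + push)

    return loss
```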

In addition, we'll add random channel dropout to avoid overfitting on noise, as in the common-voice-small example.
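One simple way to get channel dropout with stock Keras layers is SpatialDropout1D, which zeroes entire feature channels during training. The dropout rate and the 40 Mel channels below are only placeholders, not the settings used in the actual example.

```python
import tensorflow as tf

# SpatialDropout1D drops whole feature channels of a (batch, time, channels)
# input during training; the rate and the 40 Mel channels are placeholders.
inputs = tf.keras.Input(shape=(None, 40))
x = tf.keras.layers.SpatialDropout1D(rate=0.2)(inputs)
```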

Evaluating as an end-to-end classifier

The angular proximity loss function uses one reference direction per language, and these directions are mutually orthogonal. By selecting the closest reference direction for every predicted language vector, the model can be used as an end-to-end classifier.
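For illustration, classifying by nearest reference direction could look like the NumPy sketch below; the variable names are assumptions.

```python
import numpy as np

def classify_by_nearest_reference(language_vectors, references):
    """Pick the reference direction with the largest cosine similarity."""
    v = language_vectors / np.linalg.norm(language_vectors, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    # (num_vectors, num_languages) matrix of cosine similarities
    cos_sim = v @ r.T
    return np.argmax(cos_sim, axis=1)
```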

Merging chunk predictions

We divided all samples into 3.2-second chunks, so the predictions so far are per chunk. Let's merge the chunk predictions by averaging them over all chunks of each sample.
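As a sketch, assuming each chunk id embeds its parent sample id as "<sample_id>-<chunk_index>", the chunk predictions can be merged with pandas:

```python
import pandas as pd

def merge_chunk_predictions(chunk_ids, chunk_predictions):
    """Average per-chunk prediction scores over the chunks of each sample."""
    df = pd.DataFrame(chunk_predictions)
    # The "<sample_id>-<chunk_index>" naming is an assumption of this sketch
    df["sample_id"] = [cid.rsplit("-", 1)[0] for cid in chunk_ids]
    return df.groupby("sample_id").mean()
```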

Evaluate test set predictions
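A quick way to evaluate the merged predictions is scikit-learn's classification report. This continues from the merging sketch above and assumes one score per language (e.g. cosine similarity to each reference direction); sample_targets is an assumed name for the per-sample integer targets, ordered like merged.index.

```python
from sklearn.metrics import classification_report

# merged: one prediction row per sample, one column per language
# sample_targets: integer language index per sample, ordered like merged.index
predicted = merged.to_numpy().argmax(axis=1)
print(classification_report(sample_targets, predicted))
```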

Extracting all data as language vectors

Constructing a language vector extractor pipeline

We'll now extend the existing feature extraction pipeline by adding a step that extracts language vectors with the trained model. In addition, we merge all chunks of each sample by summing its chunk vectors element-wise and L2-normalizing the result.
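A rough NumPy sketch of the merging step follows; the chunk id naming is the same assumption as before, and the chunk vectors are whatever the trained model predicts for each chunk.

```python
import numpy as np

def merge_chunk_vectors(chunk_ids, chunk_vectors):
    """Sum the chunk vectors of each sample and L2-normalize the result."""
    by_sample = {}
    for cid, vec in zip(chunk_ids, chunk_vectors):
        # "<sample_id>-<chunk_index>" chunk id naming is an assumption
        by_sample.setdefault(cid.rsplit("-", 1)[0], []).append(vec)
    merged = {}
    for sample_id, vecs in by_sample.items():
        summed = np.sum(vecs, axis=0)
        merged[sample_id] = summed / np.linalg.norm(summed)
    return merged
```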

Preprocessing the language vectors for back-end training

Now, let's extract all embeddings and integer targets into NumPy arrays and preprocess them with scikit-learn.
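For example (all names here are illustrative), collecting the merged vectors into NumPy arrays and standardizing them could look like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed names: train_vectors/test_vectors are dicts of merged language
# vectors per sample id, train_targets/test_targets are dicts of integer
# language indices per sample id.
train_X = np.stack([train_vectors[k] for k in sorted(train_vectors)])
train_y = np.array([train_targets[k] for k in sorted(train_vectors)])
test_X = np.stack([test_vectors[k] for k in sorted(test_vectors)])
test_y = np.array([test_targets[k] for k in sorted(test_vectors)])

# Standardize using statistics computed from the training set only
scaler = StandardScaler().fit(train_X)
train_X = scaler.transform(train_X)
test_X = scaler.transform(test_X)
```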

Fit classifier on training set vectors and evaluate on test set vectors
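As an illustrative back-end (the actual classifier choice may differ), a Gaussian naive Bayes classifier can be fit on the training vectors prepared above and evaluated on the test vectors:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

clf = GaussianNB().fit(train_X, train_y)
print(classification_report(test_y, clf.predict(test_X)))
```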

Conclusions

We were unable to improve on the results of our previous examples by training an RNN-based model with the angular proximity loss function. However, the PCA scatter plots suggest that language vectors of the same class lie much closer to each other than the vectors we extracted from the x-vector model.

In any case, we might need much larger datasets before we can reliably compare the x-vector model and the LSTM model we used here.