Representation learning and back-end classification

2020-11-21

This example expands common-voice-augmenting by implementing language vector classification. So far, we have used the x-vector neural network as an end-to-end classifier, making classification decisions based on its log-softmax outputs. However, it can also be used for representation learning by adding a second step after training. Once we have found reasonably good weights for the network, we extract all speech data as fixed-length vectors and train a separate back-end classifier on these vectors. These vectors are also called embeddings. As explained in the original x-vector paper, one benefit of this approach is that we could first train a single neural network on vast amounts of data in hundreds of languages, which can then be used as a feature extractor for producing training data for arbitrary back-end classifiers. These back-end classifiers could be trained on any subset of languages from the larger training set.

Data

This example uses the same data as in the common-voice-small example.

Loading and preparing the metadata

Preparing the feature extraction pipeline

Filling the caches

Loading a trained x-vector model

We already have a trained instance of the x-vector model from common-voice-augmenting, so we can skip training the model.
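As a minimal sketch, loading the trained model could look like the following; the path and the use of a Keras SavedModel are assumptions, since the earlier example may have stored the weights in a different format or location.

```python
import tensorflow as tf

# Hypothetical path to the model trained in common-voice-augmenting;
# adjust to wherever the trained weights were actually saved.
xvector_model = tf.keras.models.load_model("./models/xvector-common-voice-augmenting")
```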

Evaluating as an end-to-end classifier

Using the classifier as a feature extractor

In previous examples we stopped here, but this time we'll make use of the internal representation our neural network has learned. As described in the x-vector paper, the language vectors should be extracted from the first fully connected layer, without activations. Let's create a new feature extractor model that uses the same inputs as the trained x-vector model, but uses the segment1 layer as its output layer. We also freeze the model by converting it into a tf.function.
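A rough sketch of this step, assuming the trained model is available as a Keras model named xvector_model and that segment1 is a Dense layer with no built-in activation, so its output is the pre-activation language vector:

```python
import tensorflow as tf

# Build a new model that shares the trained weights but stops at the
# segment1 layer, whose output we use as the language vector (embedding).
extractor = tf.keras.Model(
    inputs=xvector_model.inputs,
    outputs=xvector_model.get_layer("segment1").output)

# Freeze the extractor by wrapping it in a tf.function; inference only.
@tf.function
def extract_embeddings(features):
    return extractor(features, training=False)
```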

Extracting a few embeddings

Constructing a language vector extractor pipeline

Let's extend our existing tf.data.Dataset feature extraction pipelines by appending a step that extracts language vectors (embeddings) with the trained model. We can add all embeddings to our metadata table under a column called embedding, keeping everything neatly in one place.
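One way this might look, assuming each pipeline element is a batch of (features, utterance_id) pairs and the metadata is a pandas DataFrame indexed by utterance id; the names dataset and meta are placeholders:

```python
import tensorflow as tf

def append_embeddings(dataset, meta):
    # Map the frozen extractor over the feature batches.
    embedding_ds = dataset.map(
        lambda features, utt_ids: (extract_embeddings(features), utt_ids),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Store each embedding in the metadata table under a new column.
    meta["embedding"] = None
    for embeddings, utt_ids in embedding_ds.as_numpy_iterator():
        for emb, utt_id in zip(embeddings, utt_ids):
            meta.at[utt_id.decode("utf-8"), "embedding"] = emb
    return meta
```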

Preprocessing the language vectors for back-end training

Now, let's extract all embeddings and integer targets into NumPy arrays and preprocess them with scikit-learn.
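For instance, assuming the metadata table from the previous step has embedding and target columns plus a split column separating train and test rows (these column names are assumptions), the preprocessing could be a simple standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Collect embeddings and integer targets into NumPy arrays per split.
train_rows = meta[meta["split"] == "train"]
test_rows = meta[meta["split"] == "test"]

X_train = np.stack(train_rows["embedding"].to_numpy())
y_train = train_rows["target"].to_numpy()
X_test = np.stack(test_rows["embedding"].to_numpy())
y_test = test_rows["target"].to_numpy()

# Fit the scaler only on training vectors to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```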

Fit classifier on training set vectors and evaluate on test set vectors

Finally, we train a classifier on the training set vectors and predict language scores on the test set vectors, from which we compute the same metrics as before.
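A sketch of one possible back-end, a Gaussian naive Bayes classifier from scikit-learn (any other classifier could be substituted), with language scores taken from the class log-probabilities:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Fit the back-end classifier on the standardized training vectors.
backend = GaussianNB().fit(X_train, y_train)

# Class log-probabilities serve as language scores; hard predictions
# are used for accuracy and the other per-class metrics.
scores = backend.predict_log_proba(X_test)
predictions = scores.argmax(axis=1)
print(classification_report(y_test, predictions))
```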

Conclusions

We were unable to improve our classification results by training a separate back-end classifier on the internal representation of the x-vector neural network. However, this technique can be useful if you have a pre-trained neural network and want to train a classifier on new data.