# Automatically reload imported modules that are changed outside this notebook
%load_ext autoreload
%autoreload 2
# More pixels in figures
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.dpi"] = 200
# Init PRNG with fixed seed for reproducibility
import numpy as np
np_rng = np.random.default_rng(1)
import tensorflow as tf
tf.random.set_seed(np_rng.integers(0, tf.int64.max))
2020-11-08
This example is a thorough but simple walk-through on how to do everything from loading mp3-files containing speech to preprocessing and transforming the speech data into something we can feed to a neural network classifier. Deep learning based speech analysis is a vast research topic and there are countless techniques that could possibly be applied to improve the results of this example. This example tries to avoid going into too much detail on these techniques and instead focuses on getting an end-to-end classification pipeline up and running with a small dataset.
This example uses open speech data downloaded from the Mozilla Common Voice project. See the readme file for downloading the data. In addition to the space needed for the downloaded data, you will need at least 10 GiB of free disk space for caching (can be disabled).
import urllib.parse
from IPython.display import display, Markdown
languages = """
et
mn
ta
tr
""".split()
languages = sorted(l.strip() for l in languages)
display(Markdown("### Languages"))
display(Markdown('\n'.join("* `{}`".format(l) for l in languages)))
bcp47_validator_url = 'https://schneegans.de/lv/?tags='
display(Markdown("See [this tool]({}) for a description of the BCP-47 language codes."
.format(bcp47_validator_url + urllib.parse.quote('\n'.join(languages)))))
We start by preprocessing the Common Voice metadata files. Update `datadir` and `workdir` to match your setup. All output will be written to `workdir`.
import os
workdir = "/data/exp/cv4"
datadir = "/mnt/data/speech/common-voice/downloads/2020/cv-corpus"
print("work dir:", workdir)
print("data source dir:", datadir)
os.makedirs(workdir, exist_ok=True)
assert os.path.isdir(datadir), datadir + " does not exist"
work dir: /data/exp/cv4
data source dir: /mnt/data/speech/common-voice/downloads/2020/cv-corpus
Common Voice metadata is distributed as `tsv` files and all audio samples are mp3-files under `clips`.
dirs = sorted((f for f in os.scandir(datadir) if f.is_dir()), key=lambda f: f.name)
print(datadir)
for d in dirs:
if d.name in languages:
print(' ', d.name)
for f in os.scandir(d):
print(' ', f.name)
missing_languages = set(languages) - set(d.name for d in dirs)
assert missing_languages == set(), "missing languages: {}".format(missing_languages)
/mnt/data/speech/common-voice/downloads/2020/cv-corpus
  et
    validated.tsv  invalidated.tsv  other.tsv  dev.tsv  train.tsv  clips  test.tsv  reported.tsv
  mn
    validated.tsv  invalidated.tsv  other.tsv  dev.tsv  train.tsv  clips  test.tsv  reported.tsv
  ta
    validated.tsv  invalidated.tsv  other.tsv  dev.tsv  train.tsv  clips  test.tsv  reported.tsv
  tr
    validated.tsv  invalidated.tsv  other.tsv  dev.tsv  train.tsv  clips  test.tsv  reported.tsv
There's plenty of metadata, but it seems the train-dev-test split has been predefined, so let's use that.
pandas makes it easy to read, filter, and manipulate metadata in tables. Let's preprocess all metadata here so we don't have to worry about it later.
import pandas as pd
from IPython.display import display, Markdown
# Lexicographic order of labels as a fixed index target to label mapping
target2lang = tuple(sorted(languages))
lang2target = {lang: target for target, lang in enumerate(target2lang)}
print("lang2target:", lang2target)
print("target2lang:", target2lang)
def expand_metadata(row):
"""
Update dataframe row by generating a unique utterance id,
expanding the absolute path to the mp3 file,
and adding an integer target for the label.
"""
row.id = "{:s}_{:s}".format(
row.path.split(".mp3", 1)[0].split("common_voice_", 1)[1],
row.split)
row.path = os.path.join(datadir, row.lang, "clips", row.path)
row.target = lang2target[row.lang]
return row
def tsv_to_lang_dataframe(lang, split):
"""
Given a language and dataset split (train, dev, test),
load the Common Voice metadata tsv-file from disk into a pandas.DataFrame.
Preprocess all rows by dropping unneeded columns and adding new metadata.
"""
df = pd.read_csv(
os.path.join(datadir, lang, split + ".tsv"),
sep='\t',
# We only need these columns from the metadata
usecols=("client_id", "path", "sentence"))
# Add language label as column
df.insert(len(df.columns), "lang", lang)
# Add split name to every row for easier filtering
df.insert(len(df.columns), "split", split)
# Add placeholders for integer targets and utterance ids generated row-wise
df.insert(len(df.columns), "target", -1)
df.insert(len(df.columns), "id", "")
# Create new metadata columns
df = df.transform(expand_metadata, axis=1)
return df
split_names = ("train", "dev", "test")
# Concatenate metadata for all 4 languages into a single table for each split
splits = [pd.concat([tsv_to_lang_dataframe(lang, split) for lang in target2lang])
for split in split_names]
# Concatenate split metadata into a single table, indexed by utterance ids
meta = (pd.concat(splits)
.set_index("id", drop=True, verify_integrity=True)
.sort_index())
del splits
for split in split_names:
display(Markdown("### " + split))
display(meta[meta["split"]==split])
lang2target: {'et': 0, 'mn': 1, 'ta': 2, 'tr': 3}
target2lang: ('et', 'mn', 'ta', 'tr')
client_id | path | sentence | lang | split | target | |
---|---|---|---|---|---|---|
id | ||||||
et_18039906_train | fa7f67d93b2f3a6e685275897b5b67653df98a2880d1a8... | /mnt/data/speech/common-voice/downloads/2020/c... | Kusjuures selle nimel Mägi riskis isiklikult j... | et | train | 0 |
et_18039907_train | fa7f67d93b2f3a6e685275897b5b67653df98a2880d1a8... | /mnt/data/speech/common-voice/downloads/2020/c... | Väidetavalt oli sel hetkel ka Nordica lennujaa... | et | train | 0 |
et_18039908_train | fa7f67d93b2f3a6e685275897b5b67653df98a2880d1a8... | /mnt/data/speech/common-voice/downloads/2020/c... | Remo arvates võiks vaadata ka Peipsi äärde, nä... | et | train | 0 |
et_18039909_train | fa7f67d93b2f3a6e685275897b5b67653df98a2880d1a8... | /mnt/data/speech/common-voice/downloads/2020/c... | Peaaegu kõikides kirikutes ja konfessioonides ... | et | train | 0 |
et_18135494_train | 29a3279b66344d333c6ce542c44280d36128d716416c93... | /mnt/data/speech/common-voice/downloads/2020/c... | Ta tunnistas, et masintõlge neurovõrkudega on ... | et | train | 0 |
... | ... | ... | ... | ... | ... | ... |
tr_22024145_train | 8e630ccc7f89386948fdd4c882accc0f3f32c148bc8164... | /mnt/data/speech/common-voice/downloads/2020/c... | Dördüncü şahsın menşei belirlenemedi. | tr | train | 3 |
tr_22024149_train | 8e630ccc7f89386948fdd4c882accc0f3f32c148bc8164... | /mnt/data/speech/common-voice/downloads/2020/c... | Bunu nasıl iyileştirmeye çalışıyorsunuz? | tr | train | 3 |
tr_22024334_train | 8e630ccc7f89386948fdd4c882accc0f3f32c148bc8164... | /mnt/data/speech/common-voice/downloads/2020/c... | Bir köy, bu konuda ortalamanın üstünde. | tr | train | 3 |
tr_22024387_train | 8e630ccc7f89386948fdd4c882accc0f3f32c148bc8164... | /mnt/data/speech/common-voice/downloads/2020/c... | Parti, kararı temyize götürdü. | tr | train | 3 |
tr_22024395_train | 8e630ccc7f89386948fdd4c882accc0f3f32c148bc8164... | /mnt/data/speech/common-voice/downloads/2020/c... | Fuar Pazar günü sona eriyor. | tr | train | 3 |
8822 rows × 6 columns
client_id | path | sentence | lang | split | target | |
---|---|---|---|---|---|---|
id | ||||||
et_18135665_dev | 53766c5456ef60e9656bf8d8676576cb3644e8aa7eb917... | /mnt/data/speech/common-voice/downloads/2020/c... | Ning mõelda millelegi sellisele, mis tekitab h... | et | dev | 0 |
et_18135667_dev | 53766c5456ef60e9656bf8d8676576cb3644e8aa7eb917... | /mnt/data/speech/common-voice/downloads/2020/c... | Mõlemad kiituste grupid on olulised, kuid mõis... | et | dev | 0 |
et_18135685_dev | 53766c5456ef60e9656bf8d8676576cb3644e8aa7eb917... | /mnt/data/speech/common-voice/downloads/2020/c... | Aasta hiljem tutvustati üldsusele analüüsi tul... | et | dev | 0 |
et_18135686_dev | 53766c5456ef60e9656bf8d8676576cb3644e8aa7eb917... | /mnt/data/speech/common-voice/downloads/2020/c... | Eduseis vaikselt küll kahanes, aga reaalset ša... | et | dev | 0 |
et_18151474_dev | 3ad734a9b3b939b5f62bddf6344cf30d7f367c0bb8dc1c... | /mnt/data/speech/common-voice/downloads/2020/c... | Lift peatub viiel korrusel ja sõidab ka lava a... | et | dev | 0 |
... | ... | ... | ... | ... | ... | ... |
tr_22313441_dev | 114819780185e9471c3a3a635ad38135c83e01a7dc54a4... | /mnt/data/speech/common-voice/downloads/2020/c... | Fakat yine sekiz çocuğumuzu öldürdüler. | tr | dev | 3 |
tr_22313447_dev | 114819780185e9471c3a3a635ad38135c83e01a7dc54a4... | /mnt/data/speech/common-voice/downloads/2020/c... | Sınır ötesi harekât için meclis onayı gerekiyor. | tr | dev | 3 |
tr_22313449_dev | 114819780185e9471c3a3a635ad38135c83e01a7dc54a4... | /mnt/data/speech/common-voice/downloads/2020/c... | Para biriminin sayısal kodu ise dokuz yüz kırk... | tr | dev | 3 |
tr_22313450_dev | 114819780185e9471c3a3a635ad38135c83e01a7dc54a4... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak bu iş kolay olmayacak. | tr | dev | 3 |
tr_22313451_dev | 114819780185e9471c3a3a635ad38135c83e01a7dc54a4... | /mnt/data/speech/common-voice/downloads/2020/c... | Buraya fare düşse zehirlenir. | tr | dev | 3 |
7451 rows × 6 columns
client_id | path | sentence | lang | split | target | |
---|---|---|---|---|---|---|
id | ||||||
et_18031888_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 |
et_18031889_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 |
et_18031891_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 |
et_18038135_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 |
et_18038136_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 |
... | ... | ... | ... | ... | ... | ... |
tr_22462713_test | f58bab150fb6d452f028697b97e9032d372452c9e60022... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 |
tr_22474271_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 |
tr_22474274_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 |
tr_22477339_test | 25e40b1938d0956ccae093f3a4d160fb3759eafa9e162b... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 |
tr_22498670_test | b925da8c206e5269e2cdfe67e201e7d120ed03d1cae5df... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 |
7569 rows × 6 columns
To ensure our neural network will learn what language is being spoken and not who is speaking, we want to test it on data that does not contain any voices present in the training data.
The `client_id` should correspond to a unique, pseudonymized identifier for every speaker.
Let's check that all splits are disjoint by speaker id.
def assert_splits_disjoint_by_speaker(meta):
split2spk = {split: set(meta[meta["split"]==split].client_id.to_numpy())
for split in split_names}
for split, spk in split2spk.items():
print("split {} has {} speakers".format(split, len(spk)))
print()
print("asserting all are disjoint")
assert split2spk["train"] & split2spk["test"] == set(), "train and test, mutual speakers"
assert split2spk["train"] & split2spk["dev"] == set(), "train and dev, mutual speakers"
assert split2spk["dev"] & split2spk["test"] == set(), "dev and test, mutual speakers"
print("ok")
assert_splits_disjoint_by_speaker(meta)
split train has 162 speakers
split dev has 257 speakers
split test has 1057 speakers

asserting all are disjoint
ok
We can see that none of the speakers are in two or more dataset splits. We also see that the test set has a lot of unique speakers who are not in the training set. This is good because we want to test that our neural network classifier knows how to classify input from unknown speakers. Let's also verify that every mp3 file referenced in the metadata exists on disk.
for uttid, row in meta.iterrows():
assert os.path.exists(row["path"]), row["path"] + " does not exist"
print("ok")
ok
Let's see how many samples we have per language.
import seaborn as sns
sns.set(rc={'figure.figsize': (8, 6)})
ax = sns.countplot(
x="split",
order=split_names,
hue="lang",
hue_order=target2lang,
data=meta)
ax.set_title("Total amount of audio samples")
plt.show()
We can see that the numbers of samples with Mongolian, Tamil, and Turkish speech are quite balanced, but we have a significantly larger amount of Estonian speech. More data is of course always better, but if there is too much of one label compared to the others, our neural network might overfit on that label.
But these are only the counts of audio files; how much speech do we have in total per language? We need to read every file to get a reliable answer. See also SoX for a good command line tool.
import miniaudio
meta["duration"] = np.array([
miniaudio.mp3_get_file_info(path).duration for path in meta.path], np.float32)
meta
client_id | path | sentence | lang | split | target | duration | |
---|---|---|---|---|---|---|---|
id | |||||||
et_18031888_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 | 5.952 |
et_18031889_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 | 8.928 |
et_18031891_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 | 3.336 |
et_18038135_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 | 9.816 |
et_18038136_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 | 5.904 |
... | ... | ... | ... | ... | ... | ... | ... |
tr_22462713_test | f58bab150fb6d452f028697b97e9032d372452c9e60022... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 | 2.208 |
tr_22474271_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 | 4.176 |
tr_22474274_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 | 2.424 |
tr_22477339_test | 25e40b1938d0956ccae093f3a4d160fb3759eafa9e162b... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 | 2.424 |
tr_22498670_test | b925da8c206e5269e2cdfe67e201e7d120ed03d1cae5df... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 | 2.616 |
23842 rows × 7 columns
def plot_duration_distribution(data):
sns.set(rc={'figure.figsize': (8, 6)})
ax = sns.boxplot(
x="split",
order=split_names,
y="duration",
hue="lang",
hue_order=target2lang,
data=data)
ax.set_title("Median audio file duration in seconds")
plt.show()
ax = sns.barplot(
x="split",
order=split_names,
y="duration",
hue="lang",
hue_order=target2lang,
data=data,
ci=None,
estimator=np.sum)
ax.set_title("Total amount of audio in seconds")
plt.show()
plot_duration_distribution(meta)
The median length of Estonian samples is approximately 2.5 seconds greater than that of Turkish samples, which have the shortest median length. We can also see that the total amount of Estonian speech is much larger compared to the other languages in our datasets. Notice also the significant number of outliers with long durations in the Tamil and Turkish datasets.
Let's do simple random oversampling of the training split using the following approach: for each language, compute how much total speech duration it is missing compared to the language with the most speech, divide that deficit by the language's median sample duration to get a sample count, then randomly draw that many rows with replacement and append them as copies with new utterance ids.
def random_oversampling(meta):
groupby_lang = meta[["lang", "duration"]].groupby("lang")
total_dur = groupby_lang.sum()
target_lang = total_dur.idxmax()[0]
print("target lang:", target_lang)
print("total durations:")
display(total_dur)
total_dur_delta = total_dur.loc[target_lang] - total_dur
print("total duration delta to target lang:")
display(total_dur_delta)
median_dur = groupby_lang.median()
print("median durations:")
display(median_dur)
sample_sizes = (total_dur_delta / median_dur).astype(np.int32)
print("median duration weighted sample sizes based on total duration differences:")
display(sample_sizes)
samples = []
for lang in groupby_lang.groups:
sample_size = sample_sizes.loc[lang][0]
sample = (meta[meta["lang"]==lang]
.sample(n=sample_size, replace=True, random_state=np_rng.bit_generator)
.reset_index()
.transform(update_sample_id, axis=1))
samples.append(sample)
return pd.concat(samples).set_index("id", drop=True, verify_integrity=True)
def update_sample_id(row):
row["id"] = "{}_copy_{}".format(row["id"], row.name)
return row
# Augment training set metadata
meta = pd.concat([random_oversampling(meta[meta["split"]=="train"]), meta]).sort_index()
assert not meta.isna().any(axis=None), "NaNs in metadata after augmentation"
plot_duration_distribution(meta)
assert_splits_disjoint_by_speaker(meta)
meta
target lang: et
total durations:
duration | |
---|---|
lang | |
et | 19753.007812 |
mn | 11101.583984 |
ta | 8085.552246 |
tr | 7110.624023 |
total duration delta to target lang:
duration | |
---|---|
lang | |
et | 0.000000 |
mn | 8651.423828 |
ta | 11667.455078 |
tr | 12642.383789 |
median durations:
duration | |
---|---|
lang | |
et | 6.624 |
mn | 4.920 |
ta | 4.176 |
tr | 3.768 |
median duration weighted sample sizes based on total duration differences:
duration | |
---|---|
lang | |
et | 0 |
mn | 1758 |
ta | 2793 |
tr | 3355 |
split train has 162 speakers
split dev has 257 speakers
split test has 1057 speakers

asserting all are disjoint
ok
client_id | path | sentence | lang | split | target | duration | |
---|---|---|---|---|---|---|---|
id | |||||||
et_18031888_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 | 5.952 |
et_18031889_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 | 8.928 |
et_18031891_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 | 3.336 |
et_18038135_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 | 9.816 |
et_18038136_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 | 5.904 |
... | ... | ... | ... | ... | ... | ... | ... |
tr_22462713_test | f58bab150fb6d452f028697b97e9032d372452c9e60022... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 | 2.208 |
tr_22474271_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 | 4.176 |
tr_22474274_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 | 2.424 |
tr_22477339_test | 25e40b1938d0956ccae093f3a4d160fb3759eafa9e162b... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 | 2.424 |
tr_22498670_test | b925da8c206e5269e2cdfe67e201e7d120ed03d1cae5df... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 | 2.616 |
31748 rows × 7 columns
Speech data augmentation is a common research topic, and there are better ways to augment data than the simple duplication of metadata rows we did here.
One approach (which we won't be doing here) that is easy to implement and might work well is to take copies of signals and make them randomly a bit faster or slower.
For example, draw random speed ratios from `[0.9, 1.1]` and resample each signal copy by multiplying its sample rate with the random ratio.
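A minimal sketch of such a speed perturbation, assuming the `read_mp3` function and `np_rng` defined earlier; `random_speed_copy` is a hypothetical helper, not something provided by `lidbox`:

```python
import scipy.signal

def random_speed_copy(signal, rng, low=0.9, high=1.1):
    # Draw a random speed ratio; ratio > 1 makes the copy faster, ratio < 1 slower
    ratio = rng.uniform(low, high)
    # Fewer samples played back at the original rate means a shorter, faster copy
    new_len = round(len(signal) / ratio)
    return scipy.signal.resample(signal, new_len)

# signal, rate = read_mp3(clip_path)
# augmented_copies = [random_speed_copy(signal, np_rng) for _ in range(2)]
```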
Let's take a look at the speech data and listen to a few randomly picked samples from each label. We pick 2 random samples for each language from the training set.
samples = (meta[meta["split"]=="train"]
.groupby("lang")
.sample(n=2, random_state=np_rng.bit_generator))
samples
client_id | path | sentence | lang | split | target | duration | |
---|---|---|---|---|---|---|---|
id | |||||||
et_18309293_train | a1fe9d415a381158a7fb89978304161183e0795c65d0b3... | /mnt/data/speech/common-voice/downloads/2020/c... | Meresmaa ütleb, et hoolimata sellest, kas puid... | et | train | 0 | 8.736 |
et_20816668_train | 723cd1a56681e4c3dbeb36ceac204f435fa517dd8a94d4... | /mnt/data/speech/common-voice/downloads/2020/c... | Keegi ei arva ka, et need ei peaks olema kalli... | et | train | 0 | 4.584 |
mn_19023260_train | 74c6df0d177aacb734c2ea4052772610dcfc860656bd8b... | /mnt/data/speech/common-voice/downloads/2020/c... | Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин ... | mn | train | 1 | 6.336 |
mn_18598365_train_copy_695 | be1b9005c04889bbf9759a71dbe046be839ee068a668f4... | /mnt/data/speech/common-voice/downloads/2020/c... | Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв. | mn | train | 1 | 3.864 |
ta_19093638_train | 6622032a09c9f7e0fbb3bddc0a33304509ca3f33ec79fe... | /mnt/data/speech/common-voice/downloads/2020/c... | மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்! | ta | train | 2 | 5.304 |
ta_20435594_train | 7d61a7238caeb62624af2b9c202edbfc534e7955658646... | /mnt/data/speech/common-voice/downloads/2020/c... | தெருவார் வந்து சேர்ந்தார் உள்ளே. | ta | train | 2 | 3.888 |
tr_19847090_train | 7af2e0f706baed314ca0f96efe612ea592bf57791a348b... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak daha yapılacak çok iş var. | tr | train | 3 | 3.744 |
tr_21324796_train | 7b735c8f538c3bae9b0d2a63492fb70a49d214173903d3... | /mnt/data/speech/common-voice/downloads/2020/c... | Bundan sonra bir şeylerin değişmesi gerekecek. | tr | train | 3 | 4.584 |
Then let's read the mp3-files from disk, plot the signals, and listen to the audio.
from IPython.display import display, Audio, HTML
import scipy.signal
def read_mp3(path, resample_rate=16000):
if isinstance(path, bytes):
# If path is a tf.string tensor, it will be in bytes
path = path.decode("utf-8")
f = miniaudio.mp3_read_file_f32(path)
# Downsample to target rate, 16 kHz is commonly used for speech data
new_len = round(len(f.samples) * float(resample_rate) / f.sample_rate)
signal = scipy.signal.resample(f.samples, new_len)
# Normalize to [-1, 1]
signal /= np.abs(signal).max()
return signal, resample_rate
def embed_audio(signal, rate):
display(Audio(data=signal, rate=rate, embed=True, normalize=False))
def plot_signal(data, figsize=(6, 0.5), **kwargs):
ax = sns.lineplot(data=data, lw=0.1, **kwargs)
ax.set_axis_off()
ax.margins(0)
plt.gcf().set_size_inches(*figsize)
plt.show()
def plot_separator():
display(HTML(data="<hr style='border: 2px solid'>"))
for sentence, lang, clip_path in samples[["sentence", "lang", "path"]].to_numpy():
signal, rate = read_mp3(clip_path)
plot_signal(signal)
print("length: {} sec".format(signal.size / rate))
print("lang:", lang)
print("sentence:", sentence)
embed_audio(signal, rate)
plot_separator()
length: 8.736 sec
lang: et
sentence: Meresmaa ütleb, et hoolimata sellest, kas puidu all on kivipõrand või ei, jaotatakse kaabel mööda põrandat ühtlaste loogetena laiali.

length: 4.584 sec
lang: et
sentence: Keegi ei arva ka, et need ei peaks olema kallimad kui tavaravimid.

length: 6.336 sec
lang: mn
sentence: Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин хувь заяагаа хайж олохоор Бостоныг орхин одлоо.

length: 3.864 sec
lang: mn
sentence: Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв.

length: 5.304 sec
lang: ta
sentence: மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்!

length: 3.888 sec
lang: ta
sentence: தெருவார் வந்து சேர்ந்தார் உள்ளே.

length: 3.744 sec
lang: tr
sentence: Ancak daha yapılacak çok iş var.

length: 4.584 sec
lang: tr
sentence: Bundan sonra bir şeylerin değişmesi gerekecek.
One of the most challenging aspects of the Mozilla Common Voice dataset is that the audio quality varies greatly: different microphones, background noise, the user speaking close to the device or far away, and so on. It is difficult to ensure that a neural network will learn to classify different languages as opposed to classifying distinct acoustic artefacts from specific microphones. There's a vast amount of research being done on developing techniques for solving these kinds of problems. However, these are well out of scope for this simple example and we won't be studying them here.
It is usually not possible (at least not yet in 2020) to detect languages directly from the waveform. Instead, the fast Fourier transform (FFT) is applied on small, overlapping windows of the signal to get a 2-dimensional representation of energies in different frequency bands. See this for further details.
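As a rough sketch of this windowed-FFT computation, assuming a 25 ms window and 10 ms hop (common choices used here as assumptions, not necessarily the exact parameters used by the feature extraction below):

```python
import tensorflow as tf

def powspec_sketch(signal, rate, frame_ms=25, hop_ms=10):
    # Frame the signal into short, overlapping windows and apply the FFT to each
    frame_length = int(rate * frame_ms / 1000)
    frame_step = int(rate * hop_ms / 1000)
    stft = tf.signal.stft(signal, frame_length=frame_length, frame_step=frame_step)
    # Power spectrogram: squared magnitude of the complex FFT output
    return tf.math.square(tf.math.abs(stft))
```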
However, the output of the FFT is usually not directly usable and must be refined. Let's begin by selecting the first signal from our random sample and extracting the power spectrogram.
from lidbox.features.audio import spectrograms
def plot_spectrogram(S, cmap="viridis", figsize=None, **kwargs):
if figsize is None:
figsize = S.shape[0]/50, S.shape[1]/50
ax = sns.heatmap(S.T, cbar=False, cmap=cmap, **kwargs)
ax.invert_yaxis()
ax.set_axis_off()
ax.margins(0)
plt.gcf().set_size_inches(*figsize)
plt.show()
sample = samples[["sentence", "lang", "path"]].to_numpy()[0]
sentence, lang, clip_path = sample
signal, rate = read_mp3(clip_path)
plot_signal(signal)
powspec = spectrograms([signal], rate)[0]
plot_spectrogram(powspec.numpy())
This representation is very sparse, with zeros everywhere except in the lowest frequency bands. The main problem here is that relative differences between energy values are very large, making it difficult to compare large changes in energy. These differences can be reduced by mapping the values onto a logarithmic scale.
The decibel scale is a common choice. We will use the maximum value of `powspec` as the reference power ($\text{P}_0$).
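In plain NumPy terms, this decibel conversion amounts roughly to the following sketch; the small epsilon guarding against a log of zero is an assumption:

```python
import numpy as np

def power_to_db_sketch(powspec, eps=1e-10):
    # Decibel scale relative to the maximum power, so the largest value maps to 0 dB
    ref = np.max(powspec)
    return 10.0 * np.log10(np.maximum(powspec, eps) / ref)

# dbspec_approx = power_to_db_sketch(powspec.numpy())
```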
from lidbox.features.audio import power_to_db
dbspec = power_to_db([powspec])[0]
plot_spectrogram(dbspec.numpy())
This is an improvement, but the representation is still rather sparse. We also see that most speech information is in the lower bands, with a bit of energy in the higher frequencies. A common approach is to "squeeze together" the y-axis of all frequency bands by using a different scale, such as the Mel-scale. Let's "squeeze" the current 256 frequency bins into 40 Mel-bins.
Note that we are scaling different things here. The Mel-scale warps the frequency bins (y-axis), while the logarithm is used to reduce relative differences between individual spectrogram values (pixels).
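Roughly, the Mel-warping step amounts to multiplying the power spectrogram by a Mel filterbank matrix. Here is a sketch using TensorFlow's built-in helper; the frequency band edges are illustrative assumptions, not necessarily what `lidbox` uses:

```python
import tensorflow as tf

def linear_to_mel_sketch(powspec, rate, num_mel_bins=40):
    # Triangular filterbank matrix mapping linear frequency bins to Mel bins
    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=int(powspec.shape[-1]),
        sample_rate=rate,
        lower_edge_hertz=20.0,
        upper_edge_hertz=rate / 2)
    # Each Mel bin is a weighted sum over neighboring linear frequency bins
    return tf.matmul(powspec, mel_weights)
```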
from lidbox.features.audio import linear_to_mel
def logmelspectrograms(signals, rate):
powspecs = spectrograms(signals, rate)
melspecs = linear_to_mel(powspecs, rate, num_mel_bins=40)
return tf.math.log(melspecs + 1e-6)
logmelspec = logmelspectrograms([signal], rate)[0]
plot_spectrogram(logmelspec.numpy())
One common normalization technique is frequency channel standardization, i.e. normalization of rows to zero mean and unit variance.
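A minimal NumPy sketch of this kind of per-channel standardization; the epsilon avoiding division by zero is an assumption, and `lidbox`'s `cmvn` below does the actual work for us:

```python
import numpy as np

def standardize_channels(X, eps=1e-6):
    # X has shape (time_frames, frequency_channels);
    # standardize each frequency channel over the time axis
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    return (X - mean) / (std + eps)

# logmelspec_mv_approx = standardize_channels(logmelspec.numpy())
```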
from lidbox.features import cmvn
logmelspec_mv = cmvn([logmelspec])[0]
plot_spectrogram(logmelspec_mv.numpy())
Or only mean-normalization if you think the variances contain important information.
logmelspec_m = cmvn([logmelspec], normalize_variance=False)[0]
plot_spectrogram(logmelspec_m.numpy())
Another common representation is the set of Mel-frequency cepstral coefficients (MFCCs), which are obtained by applying the discrete cosine transform to the log-scale Mel-spectrogram.
def plot_cepstra(X, figsize=None):
if not figsize:
figsize = (X.shape[0]/50, X.shape[1]/20)
plot_spectrogram(X, cmap="RdBu_r", figsize=figsize)
mfcc = tf.signal.mfccs_from_log_mel_spectrograms([logmelspec])[0]
plot_cepstra(mfcc.numpy())
Most of the information is concentrated in the lower coefficients. It is common to drop the 0th coefficient and select a subset starting at 1, e.g. 1 to 20. See this post for more details.
mfcc = mfcc[:,1:21]
plot_cepstra(mfcc.numpy())
Now we have a very compact representation, but most of the variance is still in the lower coefficients and overshadows the smaller changes in higher coefficients. We can normalize the MFCC matrix row-wise by standardizing each row to zero mean and unit variance. This is commonly called cepstral mean and variance normalization (CMVN).
mfcc_cmvn = cmvn([mfcc])[0]
plot_cepstra(mfcc_cmvn.numpy())
Speech feature extraction is a large, active research topic and it is impossible to choose one representation that would work well in all situations. Common choices in state-of-the-art spoken language identification are log-scale Mel-spectrograms and MFCCs, with different normalization approaches. For example, here is an experiment in Arabic dialect identification, where log-scale Mel-spectra (referred to as FBANK) produced slightly better results compared to MFCCs.
It is not obvious when to choose which representation, or if we should even use the FFT at all. You can read this post for a more detailed discussion.
It is common for speech datasets to contain audio samples with short segments of silence or sounds that are not speech. Since these are usually irrelevant for making a language classification decision, we would prefer to discard such segments. This is called voice activity detection (VAD) and it is another large, active research area. Here is a brief overview of VAD.
Non-speech segments can be either noise or silence. Separating non-speech noise from speech is non-trivial but possible, for example with neural networks. Silence, on the other hand, shows up as zeros in our speech representations, since these segments contain lower energy values compared to segments with speech. Such non-speech segments are therefore easy to detect and discard, for example by comparing the energy of the segment to the average energy of the whole sample.
If the samples in our example do not contain much background noise, a simple energy-based VAD technique should be enough to drop all silent segments.
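The basic idea, comparing each short window's energy against the average energy of the whole sample, can be sketched in a few lines of NumPy. This is only an illustration; the decision rule and the meaning of the `strength` parameter in `lidbox` may differ:

```python
import numpy as np

def rms_vad_sketch(signal, rate, window_ms=10, strength=0.1):
    # Split the signal into non-overlapping windows
    window_len = (window_ms * rate) // 1000
    num_windows = len(signal) // window_len
    windows = signal[:num_windows * window_len].reshape(num_windows, window_len)
    # RMS energy of each window
    rms = np.sqrt(np.mean(np.square(windows), axis=1))
    # Mark a window as speech if its energy exceeds a fraction of the mean energy
    return rms > strength * np.mean(rms)
```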
We'll use the root mean square (RMS) energy to detect short silence segments. `lidbox` has a simple energy-based VAD function, which we will use as follows:
from lidbox.features.audio import framewise_rms_energy_vad_decisions
import matplotlib.patches as patches
sentence, lang, clip_path = sample
signal, rate = read_mp3(clip_path)
window_ms = tf.constant(10, tf.int32)
window_frame_length = (window_ms * rate) // 1000
# Get binary VAD decisions for each 10 ms window
vad_1 = framewise_rms_energy_vad_decisions(
signal=signal,
sample_rate=rate,
frame_step_ms=window_ms,
strength=0.1)
# Plot unfiltered signal
sns.set(rc={'figure.figsize': (6, 0.5)})
ax = sns.lineplot(data=signal, lw=0.1, legend=None)
ax.set_axis_off()
ax.margins(0)
# Plot shaded area over samples marked as not speech (VAD == 0)
for x, is_speech in enumerate(vad_1.numpy()):
if not is_speech:
rect = patches.Rectangle(
(x*window_frame_length, -1),
window_frame_length,
2,
linewidth=0,
color='gray',
alpha=0.2)
ax.add_patch(rect)
plt.show()
print("lang:", lang)
print("sentence: '{}'".format(sentence))
embed_audio(signal, rate)
# Partition the signal into 10 ms windows to match the VAD decisions
windows = tf.signal.frame(signal, window_frame_length, window_frame_length)
# Filter signal with VAD decision == 1 (remove gray areas)
filtered_signal = tf.reshape(windows[vad_1], [-1])
plot_signal(filtered_signal)
print("dropped {:d} out of {:d} frames, leaving {:.3f} of the original signal".format(
signal.shape[0] - filtered_signal.shape[0],
signal.shape[0],
filtered_signal.shape[0]/signal.shape[0]))
embed_audio(filtered_signal, rate)
lang: et
sentence: 'Meresmaa ütleb, et hoolimata sellest, kas puidu all on kivipõrand või ei, jaotatakse kaabel mööda põrandat ühtlaste loogetena laiali.'
dropped 43936 out of 139776 frames, leaving 0.686 of the original signal
The filtered signal has less silence, but some of the pauses between words sound too short and unnatural. We would prefer not to remove the small pauses that normally occur between words, so let's say all pauses shorter than 300 ms should not be filtered out. Let's also move all the VAD code into a function.
def remove_silence(signal, rate):
window_ms = tf.constant(10, tf.int32)
window_frames = (window_ms * rate) // 1000
# Get binary VAD decisions for each 10 ms window
vad_1 = framewise_rms_energy_vad_decisions(
signal=signal,
sample_rate=rate,
frame_step_ms=window_ms,
# Do not return VAD = 0 decisions for sequences shorter than 300 ms
min_non_speech_ms=300,
strength=0.1)
# Partition the signal into 10 ms windows to match the VAD decisions
windows = tf.signal.frame(signal, window_frames, window_frames)
# Filter signal with VAD decision == 1
return tf.reshape(windows[vad_1], [-1])
sentence, lang, clip_path = sample
signal, rate = read_mp3(clip_path)
filtered_signal = remove_silence(signal, rate)
plot_signal(filtered_signal)
print("dropped {:d} out of {:d} frames, leaving {:.3f} of the original signal".format(
signal.shape[0] - filtered_signal.shape[0],
signal.shape[0],
filtered_signal.shape[0]/signal.shape[0]))
print("lang:", lang)
print("sentence: '{}'".format(sentence))
embed_audio(filtered_signal, rate)
dropped 14656 out of 139776 frames, leaving 0.895 of the original signal
lang: et
sentence: 'Meresmaa ütleb, et hoolimata sellest, kas puidu all on kivipõrand või ei, jaotatakse kaabel mööda põrandat ühtlaste loogetena laiali.'
We dropped some silence segments but left most of the speech intact, perhaps this is enough for our example.
Although this VAD approach is simple and works ok for our data, it will not work for speech data with non-speech sounds in the background like music or noise. For such data we might need more powerful VAD filters such as neural networks that have been trained on a speech vs non-speech classification task with large amounts of different noise.
But let's not add more complexity to our example; we'll use the RMS-based filter for all other signals too.
Let's extract these features for all signals in our random sample.
for sentence, lang, clip_path in samples[["sentence", "lang", "path"]].to_numpy():
signal_before_vad, rate = read_mp3(clip_path)
signal = remove_silence(signal_before_vad, rate)
logmelspec = logmelspectrograms([signal], rate)[0]
logmelspec_mvn = cmvn([logmelspec], normalize_variance=False)[0]
mfcc = tf.signal.mfccs_from_log_mel_spectrograms([logmelspec])[0]
mfcc = mfcc[:,1:21]
mfcc_cmvn = cmvn([mfcc])[0]
plot_width = logmelspec.shape[0]/50
plot_signal(signal.numpy(), figsize=(plot_width, .6))
print("VAD: {} -> {} sec".format(
signal_before_vad.size / rate,
signal.numpy().size / rate))
print("lang:", lang)
print("sentence:", sentence)
embed_audio(signal.numpy(), rate)
plot_spectrogram(logmelspec_mvn.numpy(), figsize=(plot_width, 1.2))
plot_cepstra(mfcc_cmvn.numpy(), figsize=(plot_width, .6))
plot_separator()
VAD: 8.736 -> 7.82 sec
lang: et
sentence: Meresmaa ütleb, et hoolimata sellest, kas puidu all on kivipõrand või ei, jaotatakse kaabel mööda põrandat ühtlaste loogetena laiali.

VAD: 4.584 -> 3.74 sec
lang: et
sentence: Keegi ei arva ka, et need ei peaks olema kallimad kui tavaravimid.

VAD: 6.336 -> 4.57 sec
lang: mn
sentence: Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин хувь заяагаа хайж олохоор Бостоныг орхин одлоо.

VAD: 3.864 -> 2.35 sec
lang: mn
sentence: Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв.

VAD: 5.304 -> 3.5 sec
lang: ta
sentence: மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்!

VAD: 3.888 -> 2.13 sec
lang: ta
sentence: தெருவார் வந்து சேர்ந்தார் உள்ளே.

VAD: 3.744 -> 1.92 sec
lang: tr
sentence: Ancak daha yapılacak çok iş var.

VAD: 4.584 -> 2.7 sec
lang: tr
sentence: Bundan sonra bir şeylerin değişmesi gerekecek.
## `tf.data.Dataset` iterator

Our dataset is relatively small (2.5 GiB) and we might be able to read all files into signals and keep them in main memory. However, most speech datasets are much larger due to the amount of data needed for training neural network models that would be of any practical use. We need some kind of lazy iteration or streaming solution that views only one part of the dataset at a time. One such solution is to represent the dataset as a TensorFlow iterator, which evaluates its contents only when they are needed, similar to the MapReduce programming model for big data.
The downside with lazy iteration or streaming is that we lose the capability of doing random access by row id. However, this shouldn't be a problem since we can always keep the whole metadata table in memory and do random access on its rows whenever needed.
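For example, the metadata for any utterance can still be fetched directly from the in-memory table by its id:

```python
# Random access by utterance id stays cheap even when the audio is streamed lazily
row = meta.loc["et_18031888_test"]
print(row["lang"], row["duration"], row["path"])
```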
Another benefit of TensorFlow dataset iterators is that we can map arbitrary `tf.function`s over the dataset and TensorFlow will automatically parallelize the computations and place them on different devices, such as the GPU.
The core architecture of `lidbox` has been organized around the `tf.data.Dataset` API, leaving all the heavy lifting for TensorFlow to handle.
But before we load all our speech data, let's warm up with our small random sample of 8 rows.
samples
client_id | path | sentence | lang | split | target | duration | |
---|---|---|---|---|---|---|---|
id | |||||||
et_18309293_train | a1fe9d415a381158a7fb89978304161183e0795c65d0b3... | /mnt/data/speech/common-voice/downloads/2020/c... | Meresmaa ütleb, et hoolimata sellest, kas puid... | et | train | 0 | 8.736 |
et_20816668_train | 723cd1a56681e4c3dbeb36ceac204f435fa517dd8a94d4... | /mnt/data/speech/common-voice/downloads/2020/c... | Keegi ei arva ka, et need ei peaks olema kalli... | et | train | 0 | 4.584 |
mn_19023260_train | 74c6df0d177aacb734c2ea4052772610dcfc860656bd8b... | /mnt/data/speech/common-voice/downloads/2020/c... | Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин ... | mn | train | 1 | 6.336 |
mn_18598365_train_copy_695 | be1b9005c04889bbf9759a71dbe046be839ee068a668f4... | /mnt/data/speech/common-voice/downloads/2020/c... | Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв. | mn | train | 1 | 3.864 |
ta_19093638_train | 6622032a09c9f7e0fbb3bddc0a33304509ca3f33ec79fe... | /mnt/data/speech/common-voice/downloads/2020/c... | மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்! | ta | train | 2 | 5.304 |
ta_20435594_train | 7d61a7238caeb62624af2b9c202edbfc534e7955658646... | /mnt/data/speech/common-voice/downloads/2020/c... | தெருவார் வந்து சேர்ந்தார் உள்ளே. | ta | train | 2 | 3.888 |
tr_19847090_train | 7af2e0f706baed314ca0f96efe612ea592bf57791a348b... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak daha yapılacak çok iş var. | tr | train | 3 | 3.744 |
tr_21324796_train | 7b735c8f538c3bae9b0d2a63492fb70a49d214173903d3... | /mnt/data/speech/common-voice/downloads/2020/c... | Bundan sonra bir şeylerin değişmesi gerekecek. | tr | train | 3 | 4.584 |
Let's load it into a `tf.data.Dataset`.
def metadata_to_dataset_input(meta):
# Create a mapping from column names to all values under the column as tensors
return {
"id": tf.constant(meta.index, tf.string),
"path": tf.constant(meta.path, tf.string),
"lang": tf.constant(meta.lang, tf.string),
"target": tf.constant(meta.target, tf.int32),
"split": tf.constant(meta.split, tf.string),
}
sample_ds = tf.data.Dataset.from_tensor_slices(metadata_to_dataset_input(samples))
sample_ds
<TensorSliceDataset shapes: {id: (), path: (), lang: (), target: (), split: ()}, types: {id: tf.string, path: tf.string, lang: tf.string, target: tf.int32, split: tf.string}>
All elements produced by the `Dataset` iterator are `dict`s of (string, Tensor) pairs, where the string denotes the metadata type. Although the `Dataset` object is primarily for automating large-scale data processing pipelines, it is easy to extract all elements as `numpy` values:
for x in sample_ds.as_numpy_iterator():
display(x)
{'id': b'et_18309293_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/et/clips/common_voice_et_18309293.mp3', 'lang': b'et', 'target': 0, 'split': b'train'}
{'id': b'et_20816668_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/et/clips/common_voice_et_20816668.mp3', 'lang': b'et', 'target': 0, 'split': b'train'}
{'id': b'mn_19023260_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/mn/clips/common_voice_mn_19023260.mp3', 'lang': b'mn', 'target': 1, 'split': b'train'}
{'id': b'mn_18598365_train_copy_695', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/mn/clips/common_voice_mn_18598365.mp3', 'lang': b'mn', 'target': 1, 'split': b'train'}
{'id': b'ta_19093638_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/ta/clips/common_voice_ta_19093638.mp3', 'lang': b'ta', 'target': 2, 'split': b'train'}
{'id': b'ta_20435594_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/ta/clips/common_voice_ta_20435594.mp3', 'lang': b'ta', 'target': 2, 'split': b'train'}
{'id': b'tr_19847090_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/tr/clips/common_voice_tr_19847090.mp3', 'lang': b'tr', 'target': 3, 'split': b'train'}
{'id': b'tr_21324796_train', 'path': b'/mnt/data/speech/common-voice/downloads/2020/cv-corpus/tr/clips/common_voice_tr_21324796.mp3', 'lang': b'tr', 'target': 3, 'split': b'train'}
Let's load the signals by mapping a file reading function over every element of the dataset.
We'll add a `tf.data.Dataset` function wrapper on top of `read_mp3`, which we defined earlier.
TensorFlow will infer the input and output values of the wrapper as tensors from the type signature of the dataset elements.
We must use `tf.numpy_function` if we want to allow calling the non-TensorFlow function `read_mp3` also from inside the graph environment.
This might not be as efficient as using TensorFlow ops, but reading a file involves a lot of latency anyway, so it is not a big performance hit.
Besides, we can always hide the latency by reading several files in parallel.
def read_mp3_wrapper(x):
signal, sample_rate = tf.numpy_function(
# Function
read_mp3,
# Argument list
[x["path"]],
# Return value types
[tf.float32, tf.int64])
return dict(x, signal=signal, sample_rate=tf.cast(sample_rate, tf.int32))
for x in sample_ds.map(read_mp3_wrapper).as_numpy_iterator():
print("id: {}".format(x["id"].decode("utf-8")))
print("signal.shape: {}, sample rate: {}".format(x["signal"].shape, x["sample_rate"]))
print()
id: et_18309293_train
signal.shape: (139776,), sample rate: 16000

id: et_20816668_train
signal.shape: (73344,), sample rate: 16000

id: mn_19023260_train
signal.shape: (101376,), sample rate: 16000

id: mn_18598365_train_copy_695
signal.shape: (61824,), sample rate: 16000

id: ta_19093638_train
signal.shape: (84864,), sample rate: 16000

id: ta_20435594_train
signal.shape: (62208,), sample rate: 16000

id: tr_19847090_train
signal.shape: (59904,), sample rate: 16000

id: tr_21324796_train
signal.shape: (73344,), sample rate: 16000
Organizing all preprocessing steps as functions that can be mapped over the `Dataset` object allows us to represent complex transformations easily.
def remove_silence_wrapper(x):
return dict(x, signal=remove_silence(x["signal"], x["sample_rate"]))
def batch_extract_features(x):
with tf.device("GPU"):
signals, rates = x["signal"], x["sample_rate"]
logmelspecs = logmelspectrograms(signals, rates[0])
logmelspecs_smn = cmvn(logmelspecs, normalize_variance=False)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(logmelspecs)
mfccs = mfccs[...,1:21]
mfccs_cmvn = cmvn(mfccs)
return dict(x, logmelspec=logmelspecs_smn, mfcc=mfccs_cmvn)
features_ds = (sample_ds.map(read_mp3_wrapper)
.map(remove_silence_wrapper)
.batch(1)
.map(batch_extract_features)
.unbatch())
for x in features_ds.as_numpy_iterator():
print(x["id"])
for k in ("signal", "logmelspec", "mfcc"):
print("{}.shape: {}".format(k, x[k].shape))
print()
b'et_18309293_train'
signal.shape: (125120,)
logmelspec.shape: (780, 40)
mfcc.shape: (780, 20)

b'et_20816668_train'
signal.shape: (59840,)
logmelspec.shape: (372, 40)
mfcc.shape: (372, 20)

b'mn_19023260_train'
signal.shape: (73120,)
logmelspec.shape: (455, 40)
mfcc.shape: (455, 20)

b'mn_18598365_train_copy_695'
signal.shape: (37600,)
logmelspec.shape: (233, 40)
mfcc.shape: (233, 20)

b'ta_19093638_train'
signal.shape: (56000,)
logmelspec.shape: (348, 40)
mfcc.shape: (348, 20)

b'ta_20435594_train'
signal.shape: (34080,)
logmelspec.shape: (211, 40)
mfcc.shape: (211, 20)

b'tr_19847090_train'
signal.shape: (30720,)
logmelspec.shape: (190, 40)
mfcc.shape: (190, 20)

b'tr_21324796_train'
signal.shape: (43200,)
logmelspec.shape: (268, 40)
mfcc.shape: (268, 20)
`lidbox` has a helper function for dumping element information into TensorBoard summaries. It converts all 2D features into images, writes signals as audio summaries, and extracts utterance ids.
import lidbox.data.steps as ds_steps
cachedir = os.path.join(workdir, "cache")
_ = ds_steps.consume_to_tensorboard(
# Rename logmelspec as 'input', these will be plotted as images
ds=features_ds.map(lambda x: dict(x, input=x["logmelspec"])),
summary_dir=os.path.join(cachedir, "tensorboard", "data", "sample"),
config={"batch_size": 1, "image_size_multiplier": 4})
2020-11-08 13:05:38.515 I lidbox.data.steps: Writing 1 first elements of -1 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4/cache/tensorboard/data/sample'
WARNING:tensorflow:From /usr/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-11-08 13:05:39.248 W tensorflow: From /usr/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-11-08 13:05:39.525 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-08 13:05:41.489 I lidbox.data.steps: 8 done, 4.075 elements per second.
Open a terminal and launch TensorBoard to view the summaries written to `$workdir/cache/tensorboard/data/sample`:
tensorboard --logdir /data/exp/cv4/cache/tensorboard
Then open the url in a browser and inspect the contents. You can leave the server running, since we'll log the training progress to the same directory.
We'll now begin loading everything from disk and preparing a pipeline from mp3-filepaths to neural network input.
We'll use the autotune feature of `tf.data` to let TensorFlow automatically figure out how much of the pipeline should be split into parallel calls.
import lidbox.data.steps as ds_steps
TF_AUTOTUNE = tf.data.experimental.AUTOTUNE
def signal_is_not_empty(x):
return tf.size(x["signal"]) > 0
def pipeline_from_metadata(data, shuffle=False):
if shuffle:
# Shuffle metadata to get an even distribution of labels
data = data.sample(frac=1, random_state=np_rng.bit_generator)
ds = (
# Initialize dataset from metadata
tf.data.Dataset.from_tensor_slices(metadata_to_dataset_input(data))
# Read mp3 files from disk in parallel
.map(read_mp3_wrapper, num_parallel_calls=TF_AUTOTUNE)
# Apply RMS VAD to drop silence from all signals
.map(remove_silence_wrapper, num_parallel_calls=TF_AUTOTUNE)
# Drop signals that VAD removed completely
.filter(signal_is_not_empty)
# Extract features in parallel
.batch(1)
.map(batch_extract_features, num_parallel_calls=TF_AUTOTUNE)
.unbatch()
)
return ds
# Mapping from dataset split names to tf.data.Dataset objects
split2ds = {
split: pipeline_from_metadata(meta[meta["split"]==split], shuffle=split=="train")
for split in split_names
}
Note that we have only constructed the pipeline with all the steps we want to compute; the TensorFlow ops are evaluated only when elements are requested from the iterator.
Let's iterate over the training dataset from first to last element to ensure the pipeline will not be a performance bottleneck during training.
_ = ds_steps.consume(split2ds["train"], log_interval=2000)
2020-11-08 13:05:42.080 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-08 13:05:53.129 I lidbox.data.steps: 2000 done, 181.031 elements per second.
2020-11-08 13:06:03.133 I lidbox.data.steps: 4000 done, 199.943 elements per second.
2020-11-08 13:06:13.380 I lidbox.data.steps: 6000 done, 195.181 elements per second.
2020-11-08 13:06:23.385 I lidbox.data.steps: 8000 done, 199.932 elements per second.
2020-11-08 13:06:33.603 I lidbox.data.steps: 10000 done, 195.741 elements per second.
2020-11-08 13:06:43.735 I lidbox.data.steps: 12000 done, 197.407 elements per second.
2020-11-08 13:06:53.827 I lidbox.data.steps: 14000 done, 198.191 elements per second.
2020-11-08 13:07:04.035 I lidbox.data.steps: 16000 done, 195.950 elements per second.
2020-11-08 13:07:07.552 I lidbox.data.steps: 16728 done, 207.036 elements per second.
We can cache the iterator state as a single binary file at arbitrary stages. This allows us to automatically skip all steps that precede the call to `tf.data.Dataset.cache`.
Let's cache the training dataset and iterate again over all elements to fill the cache. Note that you will still be storing all data on disk (4.6 GiB of new data), so this optimization is a space-time trade-off.
os.makedirs(os.path.join(cachedir, "data"))
split2ds["train"] = split2ds["train"].cache(os.path.join(cachedir, "data", "train"))
_ = ds_steps.consume(split2ds["train"], log_interval=2000)
2020-11-08 13:07:07.580 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-08 13:07:18.833 I lidbox.data.steps: 2000 done, 177.741 elements per second.
2020-11-08 13:07:28.764 I lidbox.data.steps: 4000 done, 201.413 elements per second.
2020-11-08 13:07:39.481 I lidbox.data.steps: 6000 done, 186.630 elements per second.
2020-11-08 13:07:49.762 I lidbox.data.steps: 8000 done, 194.542 elements per second.
2020-11-08 13:07:59.732 I lidbox.data.steps: 10000 done, 200.615 elements per second.
2020-11-08 13:08:13.171 I lidbox.data.steps: 12000 done, 148.833 elements per second.
2020-11-08 13:08:23.181 I lidbox.data.steps: 14000 done, 199.815 elements per second.
2020-11-08 13:08:33.245 I lidbox.data.steps: 16000 done, 198.739 elements per second.
2020-11-08 13:08:37.075 I lidbox.data.steps: 16728 done, 190.145 elements per second.
If we iterate over the dataset again, TensorFlow should read all elements from the cache file.
_ = ds_steps.consume(split2ds["train"], log_interval=2000)
2020-11-08 13:08:37.101 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-08 13:08:37.580 I lidbox.data.steps: 2000 done, 4190.413 elements per second.
2020-11-08 13:08:37.988 I lidbox.data.steps: 4000 done, 4908.743 elements per second.
2020-11-08 13:08:38.411 I lidbox.data.steps: 6000 done, 4730.102 elements per second.
2020-11-08 13:08:38.831 I lidbox.data.steps: 8000 done, 4766.785 elements per second.
2020-11-08 13:08:39.251 I lidbox.data.steps: 10000 done, 4766.216 elements per second.
2020-11-08 13:08:39.671 I lidbox.data.steps: 12000 done, 4772.698 elements per second.
2020-11-08 13:08:40.092 I lidbox.data.steps: 14000 done, 4758.473 elements per second.
2020-11-08 13:08:40.555 I lidbox.data.steps: 16000 done, 4324.156 elements per second.
2020-11-08 13:08:40.715 I lidbox.data.steps: 16728 done, 4558.081 elements per second.
As a side note, if your training environment has fast read-write access to a file system configured for reading and writing very large files, this optimization can be a very significant performance improvement.
Note also that all the usual problems related to cache invalidation apply. When caching extracted features and metadata to disk, be extra careful in your experiments to ensure you are not interpreting results computed on data from some outdated cache.
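One simple precaution, sketched here as a hypothetical helper rather than anything provided by `lidbox`, is to wipe the cache directory whenever the feature extraction settings change:

```python
import os
import shutil

def clear_data_cache(cachedir):
    # Remove cached dataset files so the full pipeline is recomputed on the next run
    data_cache = os.path.join(cachedir, "data")
    if os.path.isdir(data_cache):
        shutil.rmtree(data_cache)

# clear_data_cache(cachedir)  # e.g. after changing the feature extraction parameters
```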
Let's extract the first 100 elements of every split to TensorBoard.
for split, ds in split2ds.items():
_ = ds_steps.consume_to_tensorboard(
ds.map(lambda x: dict(x, input=x["logmelspec"])),
os.path.join(cachedir, "tensorboard", "data", split),
{"batch_size": 1,
"image_size_multiplier": 2,
"num_batches": 100},
exist_ok=True)
2020-11-08 13:08:40.750 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4/cache/tensorboard/data/train'
2020-11-08 13:08:40.940 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-08 13:08:44.802 I lidbox.data.steps: 100 done, 25.899 elements per second.
2020-11-08 13:08:44.814 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4/cache/tensorboard/data/dev'
2020-11-08 13:08:45.025 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-08 13:08:48.989 I lidbox.data.steps: 100 done, 25.231 elements per second.
2020-11-08 13:08:49.000 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4/cache/tensorboard/data/test'
2020-11-08 13:08:49.212 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-08 13:08:53.556 I lidbox.data.steps: 100 done, 23.021 elements per second.
We have now configured an efficient data pipeline and extracted some data samples to summary files for TensorBoard. It is time to train a classifier on the data.
During training, we only need a tuple of model input and targets. We can therefore drop everything else from the dataset elements just before training starts. This is also a good place to decide if we want to train on MFCCs or Mel-spectra.
model_input_type = "logmelspec"
def as_model_input(x):
return x[model_input_type], x["target"]
train_ds_demo = list(split2ds["train"]
.map(as_model_input)
.shuffle(100)
.take(6)
.as_numpy_iterator())
for input, target in train_ds_demo:
print(input.shape, target2lang[target])
if model_input_type == "mfcc":
plot_cepstra(input)
else:
plot_spectrogram(input)
plot_separator()
(297, 40) mn
(626, 40) et
(229, 40) ta
(311, 40) tr
(366, 40) ta
(245, 40) tr
Since the training dataset is cached, we can quickly iterate over all elements and check that we don't have any NaNs or negative targets.
def assert_finite(x, y):
tf.debugging.assert_all_finite(x, "non-finite input")
tf.debugging.assert_non_negative(y, "negative target")
return x, y
_ = ds_steps.consume(split2ds["train"].map(as_model_input).map(assert_finite), log_interval=5000)
2020-11-08 13:08:54.396 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 5000
2020-11-08 13:08:55.681 I lidbox.data.steps: 5000 done, 3892.823 elements per second.
2020-11-08 13:08:56.826 I lidbox.data.steps: 10000 done, 4367.472 elements per second.
2020-11-08 13:08:57.965 I lidbox.data.steps: 15000 done, 4394.803 elements per second.
2020-11-08 13:08:58.376 I lidbox.data.steps: 16728 done, 4212.310 elements per second.
It is also easy to compute statistics on the dataset elements, for example the global minimum and maximum values of the inputs.
x_min = split2ds["train"].map(as_model_input).reduce(
tf.float32.max,
lambda acc, elem: tf.math.minimum(acc, tf.math.reduce_min(elem[0])))
x_max = split2ds["train"].map(as_model_input).reduce(
tf.float32.min,
lambda acc, elem: tf.math.maximum(acc, tf.math.reduce_max(elem[0])))
print("input tensor global minimum: {}, maximum: {}".format(x_min.numpy(), x_max.numpy()))
input tensor global minimum: -19.79332733154297, maximum: 15.17326831817627
`lidbox` provides a small set of neural network model architectures out of the box. Many of these architectures have shown good results in the literature on different datasets. The models are implemented in Keras, so you could replace the model used here with anything you want. The "x-vector" architecture has worked well in speaker and language identification, so let's create an untrained Keras x-vector model. One of its core features is learning fixed-length vector representations (x-vectors) for input of arbitrary length. These vectors are extracted from the first fully connected layer (`segment1`), without activation. This opens up opportunities for all kinds of statistical analysis on these vectors, but that is out of scope for this example.
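For completeness, here is a minimal sketch of how such x-vectors could be extracted once the model defined below has been trained. The layer name `segment1` comes from the model summary; note that, depending on the implementation, the layer output may include its activation, whereas x-vectors are usually taken before it.
# Sketch only: extract utterance-level embeddings from the "segment1" layer of a trained model.
# Depending on the model implementation, this output may include the layer's activation.
def extract_xvectors(trained_model, ds):
    xvector_model = tf.keras.Model(
        inputs=trained_model.inputs,
        outputs=trained_model.get_layer("segment1").output)
    # One fixed-length vector per utterance, regardless of input length
    return xvector_model.predict(ds)
# e.g. extract_xvectors(model, split2ds["dev"].map(as_model_input).batch(1))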
We'll try to regularize the network by adding frequency channel dropout with probability 0.8. In other words, during training each frequency channel is randomly zeroed out across all time steps with probability 0.8. This might help avoid overfitting to frequency channels containing noise that is irrelevant for deciding the language.
import lidbox.models.xvector as xvector
def create_model(num_freq_bins, num_labels):
model = xvector.create([None, num_freq_bins], num_labels, channel_dropout_rate=0.8)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
return model
model = create_model(
num_freq_bins=20 if model_input_type == "mfcc" else 40,
num_labels=len(target2lang))
model.summary()
Model: "x-vector" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input (InputLayer) [(None, None, 40)] 0 _________________________________________________________________ channel_dropout (SpatialDrop (None, None, 40) 0 _________________________________________________________________ frame1 (Conv1D) (None, None, 512) 102912 _________________________________________________________________ frame2 (Conv1D) (None, None, 512) 786944 _________________________________________________________________ frame3 (Conv1D) (None, None, 512) 786944 _________________________________________________________________ frame4 (Conv1D) (None, None, 512) 262656 _________________________________________________________________ frame5 (Conv1D) (None, None, 1500) 769500 _________________________________________________________________ stats_pooling (GlobalMeanStd (None, 3000) 0 _________________________________________________________________ segment1 (Dense) (None, 512) 1536512 _________________________________________________________________ segment2 (Dense) (None, 512) 262656 _________________________________________________________________ outputs (Dense) (None, 4) 2052 _________________________________________________________________ log_softmax (Activation) (None, 4) 0 ================================================================= Total params: 4,510,176 Trainable params: 4,510,176 Non-trainable params: 0 _________________________________________________________________
Here's what happens to the input during training.
channel_dropout = tf.keras.layers.SpatialDropout1D(model.get_layer("channel_dropout").rate)
for input, target in train_ds_demo:
print(input.shape, target2lang[target])
input = channel_dropout(tf.expand_dims(input, 0), training=True)[0].numpy()
if model_input_type == "mfcc":
plot_cepstra(input)
else:
plot_spectrogram(input)
plot_separator()
(297, 40) mn
(626, 40) et
(229, 40) ta
(311, 40) tr
(366, 40) ta
(245, 40) tr
The validation set is needed after every epoch, so we might as well cache it. Note that this writes 2.5 GiB of additional data to disk the first time the validation set is iterated over, i.e. at the end of epoch 1. Also, we train with a batch size of 1, since our inputs have different lengths and we are not padding them or using ragged tensors.
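If we wanted larger batches, one option would be to pad every batch to the length of its longest sample. We don't do that here, but a sketch using `tf.data.Dataset.padded_batch` could look like this (note that zero-padding would also affect the statistics pooling layer unless the padding is masked out):
# Sketch only: batch variable-length (input, target) pairs by zero-padding the time dimension.
def as_padded_batches(pairs_ds, batch_size=32, num_freq_bins=40):
    return pairs_ds.padded_batch(
        batch_size,
        padded_shapes=([None, num_freq_bins], []))
# e.g. as_padded_batches(split2ds["train"].map(as_model_input))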
callbacks = [
# Write scalar metrics and network weights to TensorBoard
tf.keras.callbacks.TensorBoard(
log_dir=os.path.join(cachedir, "tensorboard", model.name),
update_freq="epoch",
write_images=True,
profile_batch=0,
),
# Stop training if validation loss has not improved from the global minimum in 10 epochs
tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
),
# Write model weights to the cache every time we reach a new minimum validation loss
tf.keras.callbacks.ModelCheckpoint(
os.path.join(cachedir, "model", model.name),
monitor='val_loss',
save_weights_only=True,
save_best_only=True,
verbose=1,
),
]
train_ds = split2ds["train"].map(as_model_input).shuffle(1000)
dev_ds = split2ds["dev"].cache(os.path.join(cachedir, "data", "dev")).map(as_model_input)
history = model.fit(
train_ds.batch(1),
validation_data=dev_ds.batch(1),
callbacks=callbacks,
verbose=2,
epochs=100)
Epoch 1/100
Epoch 00001: val_loss improved from inf to 0.83537, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 118s - loss: 1.1293 - val_loss: 0.8354
Epoch 2/100
Epoch 00002: val_loss improved from 0.83537 to 0.70583, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 64s - loss: 0.8079 - val_loss: 0.7058
Epoch 3/100
Epoch 00003: val_loss improved from 0.70583 to 0.64955, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 64s - loss: 0.6489 - val_loss: 0.6495
Epoch 4/100
Epoch 00004: val_loss improved from 0.64955 to 0.63552, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 64s - loss: 0.5719 - val_loss: 0.6355
Epoch 5/100
Epoch 00005: val_loss improved from 0.63552 to 0.49460, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 64s - loss: 0.5035 - val_loss: 0.4946
Epoch 6/100
Epoch 00006: val_loss did not improve from 0.49460
16728/16728 - 64s - loss: 0.4745 - val_loss: 0.5106
Epoch 7/100
Epoch 00007: val_loss did not improve from 0.49460
16728/16728 - 63s - loss: 0.4404 - val_loss: 0.5211
Epoch 8/100
Epoch 00008: val_loss did not improve from 0.49460
16728/16728 - 64s - loss: 0.4091 - val_loss: 0.5852
Epoch 9/100
Epoch 00009: val_loss improved from 0.49460 to 0.46267, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 64s - loss: 0.3935 - val_loss: 0.4627
Epoch 10/100
Epoch 00010: val_loss did not improve from 0.46267
16728/16728 - 64s - loss: 0.3742 - val_loss: 0.4762
Epoch 11/100
Epoch 00011: val_loss did not improve from 0.46267
16728/16728 - 64s - loss: 0.3593 - val_loss: 0.5452
Epoch 12/100
Epoch 00012: val_loss did not improve from 0.46267
16728/16728 - 64s - loss: 0.3411 - val_loss: 0.5779
Epoch 13/100
Epoch 00013: val_loss improved from 0.46267 to 0.43999, saving model to /data/exp/cv4/cache/model/x-vector
16728/16728 - 65s - loss: 0.3352 - val_loss: 0.4400
Epoch 14/100
Epoch 00014: val_loss did not improve from 0.43999
16728/16728 - 64s - loss: 0.3182 - val_loss: 0.5564
Epoch 15/100
Epoch 00015: val_loss did not improve from 0.43999
16728/16728 - 63s - loss: 0.3064 - val_loss: 0.4535
Epoch 16/100
Epoch 00016: val_loss did not improve from 0.43999
16728/16728 - 63s - loss: 0.2948 - val_loss: 0.5177
Epoch 17/100
Epoch 00017: val_loss did not improve from 0.43999
16728/16728 - 63s - loss: 0.2873 - val_loss: 0.5505
Epoch 18/100
Epoch 00018: val_loss did not improve from 0.43999
16728/16728 - 63s - loss: 0.2747 - val_loss: 0.4923
Epoch 19/100
Epoch 00019: val_loss did not improve from 0.43999
16728/16728 - 64s - loss: 0.2804 - val_loss: 0.4426
Epoch 20/100
Epoch 00020: val_loss did not improve from 0.43999
16728/16728 - 64s - loss: 0.2734 - val_loss: 0.4851
Epoch 21/100
Epoch 00021: val_loss did not improve from 0.43999
16728/16728 - 63s - loss: 0.2602 - val_loss: 0.5883
Epoch 22/100
Epoch 00022: val_loss did not improve from 0.43999
16728/16728 - 64s - loss: 0.2605 - val_loss: 0.5300
Epoch 23/100
Epoch 00023: val_loss did not improve from 0.43999
16728/16728 - 64s - loss: 0.2483 - val_loss: 0.6036
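The history object returned by model.fit contains the per-epoch loss values, so we can also plot the loss curves directly. This is a quick matplotlib sketch, not part of the lidbox API.
# Plot training and validation loss per epoch from the Keras History object
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(history.history["loss"], label="training loss")
ax.plot(history.history["val_loss"], label="validation loss")
ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.legend()
plt.show()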
Let's run all test set samples through the trained model after loading the best weights from the cache.
from lidbox.util import predict_with_model
test_ds = split2ds["test"].map(lambda x: dict(x, input=x["logmelspec"])).batch(1)
_ = model.load_weights(os.path.join(cachedir, "model", model.name))
utt2pred = predict_with_model(model, test_ds)
test_meta = meta[meta["split"]=="test"]
assert not test_meta.join(utt2pred).isna().any(axis=None), "missing predictions"
test_meta = test_meta.join(utt2pred)
test_meta
| id | client_id | path | sentence | lang | split | target | duration | prediction |
|---|---|---|---|---|---|---|---|---|
| et_18031888_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 | 5.952 | [-0.40346453, -4.6794834, -4.2704687, -1.1752582] |
| et_18031889_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 | 8.928 | [-0.37804937, -3.5340228, -1.6545405, -2.359831] |
| et_18031891_test | e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2cfc... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 | 3.336 | [-1.1492252, -2.4992998, -1.2034616, -1.2012371] |
| et_18038135_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 | 9.816 | [-0.9674505, -0.4790977, -15.160749, -7.406921] |
| et_18038136_test | b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477671... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 | 5.904 | [-0.03923787, -3.743445, -9.1418495, -4.220002] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| tr_22462713_test | f58bab150fb6d452f028697b97e9032d372452c9e60022... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 | 2.208 | [-5.754553, -4.1547365, -5.051416, -0.025583064] |
| tr_22474271_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 | 4.176 | [-9.614679, -12.940073, -0.0007821838, -7.246398] |
| tr_22474274_test | 110ef1bc367a63b877f98d637e4df8e7425c7b75a2d480... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 | 2.424 | [-5.688516, -8.901997, -0.09620939, -2.4280872] |
| tr_22477339_test | 25e40b1938d0956ccae093f3a4d160fb3759eafa9e162b... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 | 2.424 | [-3.755436, -2.1214423, -1.7926536, -0.37072548] |
| tr_22498670_test | b925da8c206e5269e2cdfe67e201e7d120ed03d1cae5df... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 | 2.616 | [-7.4401236, -8.947252, -0.051250182, -3.011015] |

7569 rows × 8 columns
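Each prediction is a vector of log-softmax scores, one per language, in the same order as `target2lang`. As a quick sanity check before computing proper metrics, we could map every score vector to its highest scoring language and look at a few misclassified utterances. This is a small pandas sketch and not required for the evaluation below.
# Sketch only: pick the highest scoring language per utterance and list a few misclassifications
predicted_lang = test_meta.prediction.map(lambda p: target2lang[int(np.argmax(p))])
misclassified = test_meta[predicted_lang != test_meta.lang]
print("misclassified {} of {} test utterances".format(len(misclassified), len(test_meta)))
display(misclassified.assign(predicted=predicted_lang)[["lang", "predicted", "duration", "sentence"]].head())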
The de facto standard metric for evaluating spoken language classifiers might be the average detection cost ($\text{C}_\text{avg}$), which has been refined to its current form during past language recognition competitions. `lidbox` provides this metric as a `tf.keras.metrics.Metric` subclass. Scikit-learn provides other commonly used metrics, so there is no need to compute those manually.
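To make the metric more concrete, here is a simplified sketch of the average detection cost computed from hard decisions, assuming unit miss and false alarm costs and a target prior of 0.5. The lidbox implementation works on the score vectors and is the one we actually use below.
# Sketch only: simplified average detection cost from hard decisions.
# For each target language, combine its miss rate with its average false alarm rate
# against the non-target languages, then average over all target languages.
def simple_avg_detection_cost(y_true, y_pred, num_langs, p_target=0.5):
    cost = 0.0
    for target in range(num_langs):
        p_miss = np.mean(y_pred[y_true == target] != target)
        p_fa = np.mean([np.mean(y_pred[y_true == non_target] == target)
                        for non_target in range(num_langs) if non_target != target])
        cost += p_target * p_miss + (1.0 - p_target) * p_fa
    return cost / num_langs
# e.g. simple_avg_detection_cost(true_sparse, pred_sparse, len(target2lang)),
# using the arrays computed in the next cell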
from lidbox.util import classification_report
from lidbox.visualize import draw_confusion_matrix
true_sparse = test_meta.target.to_numpy(np.int32)
pred_dense = np.stack(test_meta.prediction)
pred_sparse = pred_dense.argmax(axis=1).astype(np.int32)
report = classification_report(true_sparse, pred_dense, lang2target)
for m in ("avg_detection_cost", "avg_equal_error_rate", "accuracy"):
print("{}: {:.3f}".format(m, report[m]))
lang_metrics = pd.DataFrame.from_dict({k: v for k, v in report.items() if k in lang2target})
lang_metrics["mean"] = lang_metrics.mean(axis=1)
display(lang_metrics.T)
fig, ax = draw_confusion_matrix(report["confusion_matrix"], lang2target)
avg_detection_cost: 0.112
avg_equal_error_rate: 0.110
accuracy: 0.803
|  | precision | recall | f1-score | support | equal_error_rate |
|---|---|---|---|---|---|
| et | 0.944851 | 0.779702 | 0.854369 | 2483.00 | 0.093787 |
| mn | 0.832143 | 0.772376 | 0.801146 | 1810.00 | 0.121375 |
| ta | 0.743289 | 0.929792 | 0.826146 | 1638.00 | 0.077053 |
| tr | 0.680067 | 0.743590 | 0.710411 | 1638.00 | 0.149216 |
| mean | 0.800088 | 0.806365 | 0.798018 | 1892.25 | 0.110358 |
This was a simple example of deep learning based spoken language identification for 4 languages, using the Mozilla Common Voice free speech datasets. We managed to train a model that recognizes the languages spoken by the test set speakers reasonably well.
However, there is clearly room for improvement. We used simple random oversampling to balance the language distribution in the training set, but perhaps there are better ways to do this. We also did not tune the optimization hyperparameters or try different neural network architectures or layer combinations. It might also be possible to increase robustness with audio feature engineering, such as random FIR filtering to simulate microphone differences, as sketched below.
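As an illustration of that last idea, random FIR filtering of the waveforms could look roughly like this. This is a hypothetical augmentation sketch, not something used in this example.
# Hypothetical augmentation sketch: convolve a waveform with a short random FIR filter
# to crudely simulate differences in recording equipment.
def random_fir_filter(signal, num_taps=10):
    taps = np_rng.normal(size=num_taps)
    # Normalize the taps so the overall gain stays roughly unchanged
    taps /= np.sum(np.abs(taps)) + 1e-8
    return np.convolve(signal, taps, mode="same")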