# Automatically reload imported modules that are changed outside this notebook
%load_ext autoreload
%autoreload 2
# More pixels in figures
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.dpi"] = 200
# Init PRNG with fixed seed for reproducibility
import numpy as np
np_rng = np.random.default_rng(1)
import tensorflow as tf
tf.random.set_seed(np_rng.integers(0, tf.int64.max))
2020-11-10
This example expands on common-voice-small, in which we talked about different ways of augmenting the dataset.
Instead of simply copying samples, we can resample them randomly to make them a bit faster or slower.
In addition, by applying random finite impulse response (FIR) filters to the signals, we can try to simulate microphone differences.
We'll apply these two augmentation techniques in this example and see if it is possible to improve on our previous results.
tf.data.Dataset makes it easy to cache all raw audio samples into a single file, from which we can reload the whole dataset at each epoch.
This means that we can reapply both random augmentation techniques at every epoch, hopefully with different output at each epoch.
This example uses the same data as in the common-voice-small example.
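The important detail here is the order of the pipeline steps: the decoded signals go into the cache, and the random augmentation maps are applied after the cache, so they are re-executed with fresh random values on every pass over the dataset. Below is a minimal, self-contained sketch of this idea with hypothetical stand-in functions; the actual pipeline is defined later in this notebook.
import tensorflow as tf

# Two dummy "signals" standing in for decoded audio clips
dummy_signals = tf.random.normal([2, 16000])

def expensive_decode(s):
    # Stand-in for costly, deterministic work (MP3 decoding, silence removal, ...)
    tf.print("decoding")
    return s

def random_augment(s):
    # Stand-in for cheap, random augmentation that should differ between epochs
    return s * tf.random.uniform([], 0.9, 1.1)

ds = (tf.data.Dataset.from_tensor_slices(dummy_signals)
      .map(expensive_decode)
      # cache() keeps results in memory; cache(filename) would write them to disk
      .cache()
      .map(random_augment))

for epoch in range(2):
    for _ in ds:
        pass
    # "decoding" is printed only during the first pass, while the augmentation runs on both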
import urllib.parse
from IPython.display import display, Markdown
languages = """
et
mn
ta
tr
""".split()
languages = sorted(l.strip() for l in languages)
display(Markdown("### Languages"))
display(Markdown('\n'.join("* `{}`".format(l) for l in languages)))
bcp47_validator_url = 'https://schneegans.de/lv/?tags='
display(Markdown("See [this tool]({}) for a description of the BCP-47 language codes."
.format(bcp47_validator_url + urllib.parse.quote('\n'.join(languages)))))
import os
workdir = "/data/exp/cv4-augment"
datadir = "/mnt/data/speech/common-voice/downloads/2020/cv-corpus"
print("work dir:", workdir)
print("data source dir:", datadir)
print()
os.makedirs(workdir, exist_ok=True)
assert os.path.isdir(datadir), datadir + " does not exist"
dirs = sorted((f for f in os.scandir(datadir) if f.is_dir()), key=lambda f: f.name)
print(datadir)
for d in dirs:
if d.name in languages:
print(' ', d.name)
for f in os.scandir(d):
print(' ', f.name)
missing_languages = set(languages) - set(d.name for d in dirs)
assert missing_languages == set(), "missing languages: {}".format(missing_languages)
work dir: /data/exp/cv4-augment
data source dir: /mnt/data/speech/common-voice/downloads/2020/cv-corpus

/mnt/data/speech/common-voice/downloads/2020/cv-corpus
  et
    validated.tsv
    invalidated.tsv
    other.tsv
    dev.tsv
    train.tsv
    clips
    test.tsv
    reported.tsv
  mn
    validated.tsv
    invalidated.tsv
    other.tsv
    dev.tsv
    train.tsv
    clips
    test.tsv
    reported.tsv
  ta
    validated.tsv
    invalidated.tsv
    other.tsv
    dev.tsv
    train.tsv
    clips
    test.tsv
    reported.tsv
  tr
    validated.tsv
    invalidated.tsv
    other.tsv
    dev.tsv
    train.tsv
    clips
    test.tsv
    reported.tsv
from lidbox.meta import common_voice, generate_label2target
meta = common_voice.load_all(datadir, languages)
meta, lang2target = generate_label2target(meta)
print("lang2target")
for l, t in lang2target.items():
print(" {}: {}".format(l, t))
for split in meta.split.unique():
display(Markdown("### " + split))
display(meta[meta["split"]==split])
lang2target
  et: 0
  mn: 1
  ta: 2
  tr: 3
id | client_id | path | sentence | label | split | target
---|---|---|---|---|---|---
common_voice_et_18031888 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 |
common_voice_et_18031889 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 |
common_voice_et_18031891 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 |
common_voice_et_18038135 | et_b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 |
common_voice_et_18038136 | et_b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 |
... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22462713 | tr_f58bab150fb6d452f028697b97e9032d372452c9e60... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 |
common_voice_tr_22474271 | tr_110ef1bc367a63b877f98d637e4df8e7425c7b75a2d... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 |
common_voice_tr_22474274 | tr_110ef1bc367a63b877f98d637e4df8e7425c7b75a2d... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 |
common_voice_tr_22477339 | tr_25e40b1938d0956ccae093f3a4d160fb3759eafa9e1... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 |
common_voice_tr_22498670 | tr_b925da8c206e5269e2cdfe67e201e7d120ed03d1cae... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 |
7569 rows × 6 columns
id | client_id | path | sentence | label | split | target
---|---|---|---|---|---|---
common_voice_et_18039906 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Kusjuures selle nimel Mägi riskis isiklikult j... | et | train | 0 |
common_voice_et_18039907 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Väidetavalt oli sel hetkel ka Nordica lennujaa... | et | train | 0 |
common_voice_et_18039908 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Remo arvates võiks vaadata ka Peipsi äärde, nä... | et | train | 0 |
common_voice_et_18039909 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Peaaegu kõikides kirikutes ja konfessioonides ... | et | train | 0 |
common_voice_et_18135494 | et_29a3279b66344d333c6ce542c44280d36128d716416... | /mnt/data/speech/common-voice/downloads/2020/c... | Ta tunnistas, et masintõlge neurovõrkudega on ... | et | train | 0 |
... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22024145 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Dördüncü şahsın menşei belirlenemedi. | tr | train | 3 |
common_voice_tr_22024149 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Bunu nasıl iyileştirmeye çalışıyorsunuz? | tr | train | 3 |
common_voice_tr_22024334 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Bir köy, bu konuda ortalamanın üstünde. | tr | train | 3 |
common_voice_tr_22024387 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Parti, kararı temyize götürdü. | tr | train | 3 |
common_voice_tr_22024395 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Fuar Pazar günü sona eriyor. | tr | train | 3 |
8822 rows × 6 columns
id | client_id | path | sentence | label | split | target
---|---|---|---|---|---|---
common_voice_et_18135665 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Ning mõelda millelegi sellisele, mis tekitab h... | et | dev | 0 |
common_voice_et_18135667 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Mõlemad kiituste grupid on olulised, kuid mõis... | et | dev | 0 |
common_voice_et_18135685 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Aasta hiljem tutvustati üldsusele analüüsi tul... | et | dev | 0 |
common_voice_et_18135686 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Eduseis vaikselt küll kahanes, aga reaalset ša... | et | dev | 0 |
common_voice_et_18151474 | et_3ad734a9b3b939b5f62bddf6344cf30d7f367c0bb8d... | /mnt/data/speech/common-voice/downloads/2020/c... | Lift peatub viiel korrusel ja sõidab ka lava a... | et | dev | 0 |
... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22313441 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Fakat yine sekiz çocuğumuzu öldürdüler. | tr | dev | 3 |
common_voice_tr_22313447 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Sınır ötesi harekât için meclis onayı gerekiyor. | tr | dev | 3 |
common_voice_tr_22313449 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Para biriminin sayısal kodu ise dokuz yüz kırk... | tr | dev | 3 |
common_voice_tr_22313450 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak bu iş kolay olmayacak. | tr | dev | 3 |
common_voice_tr_22313451 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Buraya fare düşse zehirlenir. | tr | dev | 3 |
7451 rows × 6 columns
from lidbox.meta import verify_integrity
print("size of all metadata", meta.shape)
meta = meta.dropna()
print("after dropping NaN rows", meta.shape)
print("verifying integrity")
verify_integrity(meta)
print("ok")
size of all metadata (23842, 6)
after dropping NaN rows (23842, 6)
verifying integrity
ok
We'll repeat the same random oversampling by audio sample length procedure as we did in common-voice-small.
This time, we add a flag is_copy == True to each oversampled copy, which allows us to easily pick out all copies when we apply random speed changes to the audio signals.
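Conceptually, the oversampling draws random training clips with replacement until every language has roughly as much audio as the language with the most audio, giving each copy a new id and is_copy = True. The following is only a rough sketch of that idea with a hypothetical helper (it assumes duration and is_copy columns like in the cell below), not lidbox's actual random_oversampling implementation.
import numpy as np
import pandas as pd

def naive_oversample_by_duration(train: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    # Total amount of speech (in seconds) for each language
    totals = train.groupby("label")["duration"].sum()
    target_total = totals.max()
    copies = []
    for lang, total in totals.items():
        pool = train[train["label"] == lang]
        # Draw random clips of this language with replacement until it has
        # roughly as much audio as the language with the most audio
        while total < target_total:
            row = pool.loc[rng.choice(pool.index)].copy()
            row.name = "{}_copy_{}".format(row.name, len(copies))
            row["is_copy"] = True
            copies.append(row)
            total += row["duration"]
    return pd.concat([train, pd.DataFrame(copies)]).sort_index()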
import pandas as pd
import seaborn as sns
from lidbox.meta import read_audio_durations, random_oversampling
from lidbox.visualize import plot_duration_distribution
meta["duration"] = read_audio_durations(meta)
# Flag for distinguishing original rows from copies produced by oversampling
# This is also used later for random resampling of signals
meta = meta.assign(is_copy=False)
train, rest = meta[meta["split"]=="train"], meta[meta["split"]!="train"]
augmented_train = random_oversampling(train, copy_flag="is_copy", random_state=np_rng.bit_generator)
meta = pd.concat([augmented_train, rest], verify_integrity=True).sort_index()
verify_integrity(meta)
sns.set(rc={})
plot_duration_distribution(meta)
for split in meta.split.unique():
display(Markdown("### " + split))
display(meta[meta["split"]==split])
id | client_id | path | sentence | label | split | target | duration | is_copy
---|---|---|---|---|---|---|---|---
common_voice_et_18031888 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Aleksejevi sõnul on ka selle osa laevast disai... | et | test | 0 | 5.952 | False |
common_voice_et_18031889 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Nende kategooriate alla mahuvad nii seinamaali... | et | test | 0 | 8.928 | False |
common_voice_et_18031891 | et_e570aa634f53f3496f29b20b54b7fc501e1b5b9e6d2... | /mnt/data/speech/common-voice/downloads/2020/c... | Ära keeda liiga püdelaks massiks. | et | test | 0 | 3.336 | False |
common_voice_et_18038135 | et_b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477... | /mnt/data/speech/common-voice/downloads/2020/c... | Mitmed lasteaiad ja ka omavalitsused on oma in... | et | test | 0 | 9.816 | False |
common_voice_et_18038136 | et_b6fc7a62e442937e5e60891e8a1bc49df76c2bd0477... | /mnt/data/speech/common-voice/downloads/2020/c... | Maastikuarhitektide liidu aastapreemiate nomin... | et | test | 0 | 5.904 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22462713 | tr_f58bab150fb6d452f028697b97e9032d372452c9e60... | /mnt/data/speech/common-voice/downloads/2020/c... | üç | tr | test | 3 | 2.208 | False |
common_voice_tr_22474271 | tr_110ef1bc367a63b877f98d637e4df8e7425c7b75a2d... | /mnt/data/speech/common-voice/downloads/2020/c... | evet | tr | test | 3 | 4.176 | False |
common_voice_tr_22474274 | tr_110ef1bc367a63b877f98d637e4df8e7425c7b75a2d... | /mnt/data/speech/common-voice/downloads/2020/c... | Hey | tr | test | 3 | 2.424 | False |
common_voice_tr_22477339 | tr_25e40b1938d0956ccae093f3a4d160fb3759eafa9e1... | /mnt/data/speech/common-voice/downloads/2020/c... | dokuz | tr | test | 3 | 2.424 | False |
common_voice_tr_22498670 | tr_b925da8c206e5269e2cdfe67e201e7d120ed03d1cae... | /mnt/data/speech/common-voice/downloads/2020/c... | hayır | tr | test | 3 | 2.616 | False |
7569 rows × 8 columns
id | client_id | path | sentence | label | split | target | duration | is_copy
---|---|---|---|---|---|---|---|---
common_voice_et_18039906 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Kusjuures selle nimel Mägi riskis isiklikult j... | et | train | 0 | 5.256 | False |
common_voice_et_18039907 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Väidetavalt oli sel hetkel ka Nordica lennujaa... | et | train | 0 | 6.864 | False |
common_voice_et_18039908 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Remo arvates võiks vaadata ka Peipsi äärde, nä... | et | train | 0 | 9.384 | False |
common_voice_et_18039909 | et_fa7f67d93b2f3a6e685275897b5b67653df98a2880d... | /mnt/data/speech/common-voice/downloads/2020/c... | Peaaegu kõikides kirikutes ja konfessioonides ... | et | train | 0 | 8.544 | False |
common_voice_et_18135494 | et_29a3279b66344d333c6ce542c44280d36128d716416... | /mnt/data/speech/common-voice/downloads/2020/c... | Ta tunnistas, et masintõlge neurovõrkudega on ... | et | train | 0 | 7.824 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22024387 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Parti, kararı temyize götürdü. | tr | train | 3 | 3.984 | False |
common_voice_tr_22024387_copy_3103 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Parti, kararı temyize götürdü. | tr | train | 3 | 3.984 | True |
common_voice_tr_22024395 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Fuar Pazar günü sona eriyor. | tr | train | 3 | 3.336 | False |
common_voice_tr_22024395_copy_122 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Fuar Pazar günü sona eriyor. | tr | train | 3 | 3.336 | True |
common_voice_tr_22024395_copy_1574 | tr_8e630ccc7f89386948fdd4c882accc0f3f32c148bc8... | /mnt/data/speech/common-voice/downloads/2020/c... | Fuar Pazar günü sona eriyor. | tr | train | 3 | 3.336 | True |
16728 rows × 8 columns
id | client_id | path | sentence | label | split | target | duration | is_copy
---|---|---|---|---|---|---|---|---
common_voice_et_18135665 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Ning mõelda millelegi sellisele, mis tekitab h... | et | dev | 0 | 5.544 | False |
common_voice_et_18135667 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Mõlemad kiituste grupid on olulised, kuid mõis... | et | dev | 0 | 7.656 | False |
common_voice_et_18135685 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Aasta hiljem tutvustati üldsusele analüüsi tul... | et | dev | 0 | 8.304 | False |
common_voice_et_18135686 | et_53766c5456ef60e9656bf8d8676576cb3644e8aa7eb... | /mnt/data/speech/common-voice/downloads/2020/c... | Eduseis vaikselt küll kahanes, aga reaalset ša... | et | dev | 0 | 6.504 | False |
common_voice_et_18151474 | et_3ad734a9b3b939b5f62bddf6344cf30d7f367c0bb8d... | /mnt/data/speech/common-voice/downloads/2020/c... | Lift peatub viiel korrusel ja sõidab ka lava a... | et | dev | 0 | 5.184 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
common_voice_tr_22313441 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Fakat yine sekiz çocuğumuzu öldürdüler. | tr | dev | 3 | 4.728 | False |
common_voice_tr_22313447 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Sınır ötesi harekât için meclis onayı gerekiyor. | tr | dev | 3 | 5.232 | False |
common_voice_tr_22313449 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Para biriminin sayısal kodu ise dokuz yüz kırk... | tr | dev | 3 | 6.096 | False |
common_voice_tr_22313450 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak bu iş kolay olmayacak. | tr | dev | 3 | 3.984 | False |
common_voice_tr_22313451 | tr_114819780185e9471c3a3a635ad38135c83e01a7dc5... | /mnt/data/speech/common-voice/downloads/2020/c... | Buraya fare düşse zehirlenir. | tr | dev | 3 | 4.44 | False |
7451 rows × 8 columns
samples = (meta[meta["split"]=="train"]
.groupby("label")
.sample(n=2, random_state=np_rng.bit_generator))
samples
id | client_id | path | sentence | label | split | target | duration | is_copy
---|---|---|---|---|---|---|---|---
common_voice_et_18309293 | et_a1fe9d415a381158a7fb89978304161183e0795c65d... | /mnt/data/speech/common-voice/downloads/2020/c... | Meresmaa ütleb, et hoolimata sellest, kas puid... | et | train | 0 | 8.736 | False |
common_voice_et_20816668 | et_723cd1a56681e4c3dbeb36ceac204f435fa517dd8a9... | /mnt/data/speech/common-voice/downloads/2020/c... | Keegi ei arva ka, et need ei peaks olema kalli... | et | train | 0 | 4.584 | False |
common_voice_mn_19023260 | mn_74c6df0d177aacb734c2ea4052772610dcfc860656b... | /mnt/data/speech/common-voice/downloads/2020/c... | Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин ... | mn | train | 1 | 6.336 | False |
common_voice_mn_18598365_copy_695 | mn_be1b9005c04889bbf9759a71dbe046be839ee068a66... | /mnt/data/speech/common-voice/downloads/2020/c... | Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв. | mn | train | 1 | 3.864 | True |
common_voice_ta_19093638 | ta_6622032a09c9f7e0fbb3bddc0a33304509ca3f33ec7... | /mnt/data/speech/common-voice/downloads/2020/c... | மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்! | ta | train | 2 | 5.304 | False |
common_voice_ta_20435594 | ta_7d61a7238caeb62624af2b9c202edbfc534e7955658... | /mnt/data/speech/common-voice/downloads/2020/c... | தெருவார் வந்து சேர்ந்தார் உள்ளே. | ta | train | 2 | 3.888 | False |
common_voice_tr_19847090 | tr_7af2e0f706baed314ca0f96efe612ea592bf57791a3... | /mnt/data/speech/common-voice/downloads/2020/c... | Ancak daha yapılacak çok iş var. | tr | train | 3 | 3.744 | False |
common_voice_tr_21324796 | tr_7b735c8f538c3bae9b0d2a63492fb70a49d21417390... | /mnt/data/speech/common-voice/downloads/2020/c... | Bundan sonra bir şeylerin değişmesi gerekecek. | tr | train | 3 | 4.584 | False |
from lidbox.features import audio
from lidbox.visualize import plot_signal
from IPython.display import display, Audio, HTML
def read_mp3(path):
s, rate = audio.read_mp3(path)
out_rate = 16000
s = audio.resample(s, rate, out_rate)
s = audio.peak_normalize(s, dBFS=-3.0)
s = audio.remove_silence(s, out_rate)
return s, out_rate
def embed_audio(signal, rate):
display(Audio(data=signal, rate=rate, embed=True, normalize=False))
def plot_separator():
display(HTML(data="<hr style='border: 2px solid'>"))
for sentence, lang, clip_path in samples[["sentence", "label", "path"]].to_numpy():
signal, rate = read_mp3(clip_path)
signal = signal.numpy()
plot_signal(signal)
print("length: {} sec".format(signal.size / rate))
print("lang:", lang)
print("sentence:", sentence)
embed_audio(signal, rate)
plot_separator()
length: 7.82 sec lang: et sentence: Meresmaa ütleb, et hoolimata sellest, kas puidu all on kivipõrand või ei, jaotatakse kaabel mööda põrandat ühtlaste loogetena laiali.
length: 3.74 sec lang: et sentence: Keegi ei arva ka, et need ei peaks olema kallimad kui tavaravimid.
length: 4.57 sec lang: mn sentence: Жэймстэй ширүүхэн маргалдсаны улмаас Бенжамин хувь заяагаа хайж олохоор Бостоныг орхин одлоо.
length: 2.35 sec lang: mn sentence: Болж өгвөл сүүдрээсээ хүртэл болгоомжилж яв.
length: 3.5 sec lang: ta sentence: மிஞ்சுகின்ற காதலின்மேல் ஆணையிட்டு விள்ளுகின்றேன்!
length: 2.13 sec lang: ta sentence: தெருவார் வந்து சேர்ந்தார் உள்ளே.
length: 1.92 sec lang: tr sentence: Ancak daha yapılacak çok iş var.
length: 2.7 sec lang: tr sentence: Bundan sonra bir şeylerin değişmesi gerekecek.
import scipy.signal
def random_filter(s, N=10):
    # Filter the signal with an FIR filter built from N random coefficients
    b = np_rng.normal(0, 1, N)
    return scipy.signal.lfilter(b, 1.0, s).astype(np.float32), b
def display_signal(s, r, l):
plot_signal(s)
print("length: {} sec".format(s.size / r))
print("lang:", l)
embed_audio(s, r)
plot_separator()
sentence, lang, path = samples[["sentence", "label", "path"]].to_numpy()[2]
signal, rate = read_mp3(path)
signal = audio.remove_silence(signal, rate).numpy()
print("original")
display_signal(signal, rate, lang)
np.set_printoptions(precision=1)
for _ in range(5):
s, b = random_filter(signal)
print("filter:", b)
s = audio.peak_normalize(s, dBFS=-3.0).numpy()
display_signal(s, rate, lang)
original
length: 4.57 sec lang: mn
filter: [ 0.9 -1. -1.6 -0.1 0.3 0.5 -0. -1. 0.5 0.2]
length: 4.57 sec lang: mn
filter: [ 0.4 -0.8 0.8 -0.8 0.8 -0.2 0.2 -0.6 0.5 1. ]
length: 4.57 sec lang: mn
filter: [-1.4 0.4 -1.1 0.4 -0.5 -1.1 -0.4 -0. 0.6 -1.3]
length: 4.57 sec lang: mn
filter: [-0.8 1.6 -0.2 0.6 -1.5 1.8 -1. -0.7 0.9 1.3]
length: 4.57 sec lang: mn
filter: [-0.3 0.9 -0.6 0.3 0.2 -0.6 -0.4 0.1 -1.1 -1.3]
length: 4.57 sec lang: mn
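The audio embedded above lets us hear the effect, but we can also look at what kind of spectral shaping each random filter applies. This optional sketch (not part of the pipeline) plots the magnitude responses of a few random FIR filters with scipy.signal.freqz, using the same np_rng as above.
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal

fig, ax = plt.subplots()
for _ in range(5):
    # Same kind of random FIR coefficients as in random_filter above
    b = np_rng.normal(0, 1, 10)
    # Frequency response of the filter, on a 0-8000 Hz axis for 16 kHz audio
    freqs, response = scipy.signal.freqz(b, 1.0, worN=512, fs=16000)
    ax.plot(freqs, 20 * np.log10(np.abs(response) + 1e-9))
ax.set_xlabel("frequency (Hz)")
ax.set_ylabel("magnitude (dB)")
ax.set_title("Magnitude responses of 5 random FIR filters")
plt.show()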
def random_speed_change(s, r, lo=0.9, hi=1.1):
    ratio = np_rng.uniform(lo, hi)
    # Resample to len(s)/ratio samples; playing the result back at the original
    # rate r makes the speech 'ratio' times faster (ratio > 1) or slower (ratio < 1)
    new_len = int(len(s) * r / (ratio * r))
    return scipy.signal.resample(s, new_len).astype(np.float32), ratio
print("original")
display_signal(signal, rate, lang)
for ratio in [0.9, 0.95, 1, 1.05, 1.1]:
s, ratio = random_speed_change(signal, rate, lo=ratio, hi=ratio)
print("speed ratio: {:.3f}".format(ratio))
display_signal(s, rate, lang)
original
length: 4.57 sec lang: mn
speed ratio: 0.900
length: 5.07775 sec lang: mn
speed ratio: 0.950
length: 4.8105 sec lang: mn
speed ratio: 1.000
length: 4.57 sec lang: mn
speed ratio: 1.050
length: 4.352375 sec lang: mn
speed ratio: 1.100
length: 4.1545 sec lang: mn
from lidbox.features import audio, cmvn
TF_AUTOTUNE = tf.data.experimental.AUTOTUNE
def metadata_to_dataset_input(meta):
return {
"id": tf.constant(meta.index, tf.string),
"path": tf.constant(meta.path, tf.string),
"label": tf.constant(meta.label, tf.string),
"target": tf.constant(meta.target, tf.int32),
"split": tf.constant(meta.split, tf.string),
"is_copy": tf.constant(meta.is_copy, tf.bool),
}
def read_mp3(x):
s, r = audio.read_mp3(x["path"])
out_rate = 16000
s = audio.resample(s, r, out_rate)
s = audio.peak_normalize(s, dBFS=-3.0)
s = audio.remove_silence(s, out_rate)
return dict(x, signal=s, sample_rate=out_rate)
def random_speed_change_wrapper(x):
if not x["is_copy"]:
return x
s, _ = tf.numpy_function(
random_speed_change,
[x["signal"], x["sample_rate"]],
[tf.float32, tf.float64],
name="np_random_speed_change")
return dict(x, signal=s)
def random_filter_wrapper(x):
s, _ = tf.numpy_function(
random_filter,
[x["signal"]],
[tf.float32, tf.float64],
name="np_random_filter")
s = tf.cast(s, tf.float32)
s = audio.peak_normalize(s, dBFS=-3.0)
return dict(x, signal=s)
def batch_extract_features(x):
with tf.device("GPU"):
signals, rates = x["signal"], x["sample_rate"]
S = audio.spectrograms(signals, rates[0])
S = audio.linear_to_mel(S, rates[0])
S = tf.math.log(S + 1e-6)
S = cmvn(S, normalize_variance=False)
return dict(x, logmelspec=S)
def signal_is_not_empty(x):
return tf.size(x["signal"]) > 0
def pipeline_from_metadata(data, split):
if split == "train":
data = data.sample(frac=1)
ds = (
tf.data.Dataset.from_tensor_slices(metadata_to_dataset_input(data))
.map(read_mp3, num_parallel_calls=TF_AUTOTUNE)
.filter(signal_is_not_empty)
# Try to keep 1000 signals prefetched in an in-memory buffer to reduce downstream latency
.prefetch(1000)
# Cache signals to a single file
.cache(os.path.join(cachedir, "data", split))
# In-memory buffer when reading from the cache
.prefetch(1000))
if split == "train":
ds = (ds
# Randomly change speed of all oversampled copies
.map(random_speed_change_wrapper, num_parallel_calls=TF_AUTOTUNE)
# Apply random filter for every training sample
.map(random_filter_wrapper, num_parallel_calls=TF_AUTOTUNE))
return (ds
.batch(1)
.map(batch_extract_features, num_parallel_calls=TF_AUTOTUNE)
.unbatch())
cachedir = os.path.join(workdir, "cache")
os.makedirs(os.path.join(cachedir, "data"))
split2ds = {
split: pipeline_from_metadata(meta[meta["split"]==split], split)
for split in meta.split.unique()
}
NOTE that this creates 7.2 GiB of additional data on disk.
import lidbox.data.steps as ds_steps
for split, ds in split2ds.items():
print("filling", split, "cache")
_ = ds_steps.consume(ds, log_interval=2000)
filling test cache
2020-11-10 20:46:29.419 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-10 20:46:43.789 I lidbox.data.steps: 2000 done, 139.182 elements per second.
2020-11-10 20:46:56.283 I lidbox.data.steps: 4000 done, 160.090 elements per second.
2020-11-10 20:47:06.763 I lidbox.data.steps: 6000 done, 190.858 elements per second.
2020-11-10 20:47:14.500 I lidbox.data.steps: 7569 done, 202.798 elements per second.
filling train cache
2020-11-10 20:47:14.503 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-10 20:47:33.569 I lidbox.data.steps: 2000 done, 104.898 elements per second.
2020-11-10 20:47:48.868 I lidbox.data.steps: 4000 done, 130.744 elements per second.
2020-11-10 20:48:01.529 I lidbox.data.steps: 6000 done, 157.971 elements per second.
2020-11-10 20:48:14.077 I lidbox.data.steps: 8000 done, 159.398 elements per second.
2020-11-10 20:48:26.839 I lidbox.data.steps: 10000 done, 156.725 elements per second.
2020-11-10 20:48:39.412 I lidbox.data.steps: 12000 done, 159.091 elements per second.
2020-11-10 20:48:52.238 I lidbox.data.steps: 14000 done, 155.936 elements per second.
2020-11-10 20:48:59.453 I lidbox.data.steps: 16000 done, 277.268 elements per second.
2020-11-10 20:49:00.957 I lidbox.data.steps: 16728 done, 484.173 elements per second.
filling dev cache
2020-11-10 20:49:00.959 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = 2000
2020-11-10 20:49:16.472 I lidbox.data.steps: 2000 done, 128.921 elements per second.
2020-11-10 20:49:28.660 I lidbox.data.steps: 4000 done, 164.106 elements per second.
2020-11-10 20:49:38.433 I lidbox.data.steps: 6000 done, 204.663 elements per second.
2020-11-10 20:49:45.254 I lidbox.data.steps: 7451 done, 212.774 elements per second.
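If you want to check how much space the cache files actually take, you can sum the file sizes under the cache directory, e.g. with a small helper like this (hypothetical, not part of lidbox):
import os

def dir_size_gib(path):
    # Sum the sizes of all files under path, in GiB
    total_bytes = 0
    for root, _, files in os.walk(path):
        total_bytes += sum(os.path.getsize(os.path.join(root, name)) for name in files)
    return total_bytes / 2**30

print("cache size: {:.1f} GiB".format(dir_size_gib(os.path.join(cachedir, "data"))))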
for split, ds in split2ds.items():
_ = ds_steps.consume_to_tensorboard(
ds.map(lambda x: dict(x, input=x["logmelspec"])),
os.path.join(cachedir, "tensorboard", "data", split),
{"batch_size": 1,
"image_size_multiplier": 2,
"num_batches": 100})
2020-11-10 20:49:45.317 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4-augment/cache/tensorboard/data/test'
WARNING:tensorflow:From /usr/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-11-10 20:49:46.078 W tensorflow: From /usr/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-11-10 20:49:46.377 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-10 20:49:49.419 I lidbox.data.steps: 100 done, 32.885 elements per second.
2020-11-10 20:49:49.431 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4-augment/cache/tensorboard/data/train'
2020-11-10 20:49:49.648 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-10 20:49:52.302 I lidbox.data.steps: 100 done, 37.689 elements per second.
2020-11-10 20:49:52.313 I lidbox.data.steps: Writing 1 first elements of 100 batches, each of size 1, into Tensorboard summaries in '/data/exp/cv4-augment/cache/tensorboard/data/dev'
2020-11-10 20:49:52.524 I lidbox.data.steps: Exhausting the dataset iterator by iterating over all elements, log_interval = -1
2020-11-10 20:49:55.502 I lidbox.data.steps: 100 done, 33.588 elements per second.
import lidbox.models.xvector as xvector
def create_model(num_freq_bins, num_labels):
model = xvector.create([None, num_freq_bins], num_labels, channel_dropout_rate=0.8)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
return model
model = create_model(
num_freq_bins=40,
num_labels=len(lang2target))
model.summary()
Model: "x-vector" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input (InputLayer) [(None, None, 40)] 0 _________________________________________________________________ channel_dropout (SpatialDrop (None, None, 40) 0 _________________________________________________________________ frame1 (Conv1D) (None, None, 512) 102912 _________________________________________________________________ frame2 (Conv1D) (None, None, 512) 786944 _________________________________________________________________ frame3 (Conv1D) (None, None, 512) 786944 _________________________________________________________________ frame4 (Conv1D) (None, None, 512) 262656 _________________________________________________________________ frame5 (Conv1D) (None, None, 1500) 769500 _________________________________________________________________ stats_pooling (GlobalMeanStd (None, 3000) 0 _________________________________________________________________ segment1 (Dense) (None, 512) 1536512 _________________________________________________________________ segment2 (Dense) (None, 512) 262656 _________________________________________________________________ outputs (Dense) (None, 4) 2052 _________________________________________________________________ log_softmax (Activation) (None, 4) 0 ================================================================= Total params: 4,510,176 Trainable params: 4,510,176 Non-trainable params: 0 _________________________________________________________________
def as_model_input(x):
return x["logmelspec"], x["target"]
callbacks = [
# Write scalar metrics and network weights to TensorBoard
tf.keras.callbacks.TensorBoard(
log_dir=os.path.join(cachedir, "tensorboard", model.name),
update_freq="epoch",
write_images=True,
profile_batch=0,
),
# Stop training if validation loss has not improved from the global minimum in 10 epochs
tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
),
# Write model weights to cache everytime we get a new global minimum loss value
tf.keras.callbacks.ModelCheckpoint(
os.path.join(cachedir, "model", model.name),
monitor='val_loss',
save_weights_only=True,
save_best_only=True,
verbose=1,
),
]
train_ds = split2ds["train"].map(as_model_input).shuffle(1000)
dev_ds = split2ds["dev"].map(as_model_input)
history = model.fit(
train_ds.batch(1),
validation_data=dev_ds.batch(1),
callbacks=callbacks,
verbose=2,
epochs=100)
Epoch 1/100
Epoch 00001: val_loss improved from inf to 0.83661, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 96s - loss: 1.1405 - val_loss: 0.8366
Epoch 2/100
Epoch 00002: val_loss improved from 0.83661 to 0.72218, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.8152 - val_loss: 0.7222
Epoch 3/100
Epoch 00003: val_loss improved from 0.72218 to 0.60692, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.6693 - val_loss: 0.6069
Epoch 4/100
Epoch 00004: val_loss improved from 0.60692 to 0.57515, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.5820 - val_loss: 0.5752
Epoch 5/100
Epoch 00005: val_loss improved from 0.57515 to 0.56978, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.5358 - val_loss: 0.5698
Epoch 6/100
Epoch 00006: val_loss did not improve from 0.56978
16728/16728 - 79s - loss: 0.4925 - val_loss: 0.5875
Epoch 7/100
Epoch 00007: val_loss improved from 0.56978 to 0.49638, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.4589 - val_loss: 0.4964
Epoch 8/100
Epoch 00008: val_loss improved from 0.49638 to 0.48430, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.4346 - val_loss: 0.4843
Epoch 9/100
Epoch 00009: val_loss improved from 0.48430 to 0.47015, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.4167 - val_loss: 0.4702
Epoch 10/100
Epoch 00010: val_loss did not improve from 0.47015
16728/16728 - 79s - loss: 0.3936 - val_loss: 0.5242
Epoch 11/100
Epoch 00011: val_loss improved from 0.47015 to 0.42853, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 78s - loss: 0.3853 - val_loss: 0.4285
Epoch 12/100
Epoch 00012: val_loss did not improve from 0.42853
16728/16728 - 79s - loss: 0.3759 - val_loss: 0.4437
Epoch 13/100
Epoch 00013: val_loss did not improve from 0.42853
16728/16728 - 79s - loss: 0.3623 - val_loss: 0.4382
Epoch 14/100
Epoch 00014: val_loss did not improve from 0.42853
16728/16728 - 79s - loss: 0.3374 - val_loss: 0.6402
Epoch 15/100
Epoch 00015: val_loss did not improve from 0.42853
16728/16728 - 78s - loss: 0.3339 - val_loss: 0.4962
Epoch 16/100
Epoch 00016: val_loss did not improve from 0.42853
16728/16728 - 79s - loss: 0.3202 - val_loss: 0.4727
Epoch 17/100
Epoch 00017: val_loss did not improve from 0.42853
16728/16728 - 79s - loss: 0.3255 - val_loss: 0.5850
Epoch 18/100
Epoch 00018: val_loss did not improve from 0.42853
16728/16728 - 78s - loss: 0.3119 - val_loss: 0.5058
Epoch 19/100
Epoch 00019: val_loss improved from 0.42853 to 0.40879, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.3093 - val_loss: 0.4088
Epoch 20/100
Epoch 00020: val_loss did not improve from 0.40879
16728/16728 - 78s - loss: 0.3082 - val_loss: 0.4397
Epoch 21/100
Epoch 00021: val_loss improved from 0.40879 to 0.39705, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 80s - loss: 0.2982 - val_loss: 0.3970
Epoch 22/100
Epoch 00022: val_loss did not improve from 0.39705
16728/16728 - 79s - loss: 0.2922 - val_loss: 0.4870
Epoch 23/100
Epoch 00023: val_loss did not improve from 0.39705
16728/16728 - 80s - loss: 0.2880 - val_loss: 0.4826
Epoch 24/100
Epoch 00024: val_loss did not improve from 0.39705
16728/16728 - 80s - loss: 0.2909 - val_loss: 0.4611
Epoch 25/100
Epoch 00025: val_loss did not improve from 0.39705
16728/16728 - 78s - loss: 0.2789 - val_loss: 0.4422
Epoch 26/100
Epoch 00026: val_loss did not improve from 0.39705
16728/16728 - 79s - loss: 0.2784 - val_loss: 0.4725
Epoch 27/100
Epoch 00027: val_loss improved from 0.39705 to 0.37631, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.2784 - val_loss: 0.3763
Epoch 28/100
Epoch 00028: val_loss improved from 0.37631 to 0.34670, saving model to /data/exp/cv4-augment/cache/model/x-vector
16728/16728 - 79s - loss: 0.2712 - val_loss: 0.3467
Epoch 29/100
Epoch 00029: val_loss did not improve from 0.34670
16728/16728 - 79s - loss: 0.2672 - val_loss: 0.4142
Epoch 30/100
Epoch 00030: val_loss did not improve from 0.34670
16728/16728 - 78s - loss: 0.2734 - val_loss: 0.4046
Epoch 31/100
Epoch 00031: val_loss did not improve from 0.34670
16728/16728 - 79s - loss: 0.2658 - val_loss: 0.3773
Epoch 32/100
Epoch 00032: val_loss did not improve from 0.34670
16728/16728 - 79s - loss: 0.2642 - val_loss: 0.3854
Epoch 33/100
Epoch 00033: val_loss did not improve from 0.34670
16728/16728 - 78s - loss: 0.2707 - val_loss: 0.4242
Epoch 34/100
Epoch 00034: val_loss did not improve from 0.34670
16728/16728 - 79s - loss: 0.2620 - val_loss: 0.4204
Epoch 35/100
Epoch 00035: val_loss did not improve from 0.34670
16728/16728 - 79s - loss: 0.2614 - val_loss: 0.3496
Epoch 36/100
Epoch 00036: val_loss did not improve from 0.34670
16728/16728 - 78s - loss: 0.2576 - val_loss: 0.4227
Epoch 37/100
Epoch 00037: val_loss did not improve from 0.34670
16728/16728 - 78s - loss: 0.2736 - val_loss: 0.4296
Epoch 38/100
Epoch 00038: val_loss did not improve from 0.34670
16728/16728 - 78s - loss: 0.2575 - val_loss: 0.4984
from lidbox.util import evaluate_testset_with_model
from lidbox.visualize import draw_confusion_matrix
_ = model.load_weights(os.path.join(cachedir, "model", model.name))
report = evaluate_testset_with_model(
model=model,
test_ds=split2ds["test"].map(lambda x: dict(x, input=x["logmelspec"])).batch(1),
test_meta=meta[meta["split"]=="test"],
lang2target=lang2target)
for m in ("avg_detection_cost", "avg_equal_error_rate", "accuracy"):
print("{}: {:.3f}".format(m, report[m]))
lang_metrics = pd.DataFrame.from_dict({k: v for k, v in report.items() if k in lang2target})
lang_metrics["mean"] = lang_metrics.mean(axis=1)
display(lang_metrics.T)
fig, ax = draw_confusion_matrix(report["confusion_matrix"], lang2target)
avg_detection_cost: 0.091
avg_equal_error_rate: 0.088
accuracy: 0.846
 | precision | recall | f1-score | support | equal_error_rate
---|---|---|---|---|---
et | 0.957596 | 0.827628 | 0.887881 | 2483.00 | 0.078057 |
mn | 0.881306 | 0.820442 | 0.849785 | 1810.00 | 0.088557 |
ta | 0.858068 | 0.889499 | 0.873501 | 1638.00 | 0.063059 |
tr | 0.689706 | 0.858974 | 0.765090 | 1638.00 | 0.123251 |
mean | 0.846669 | 0.849136 | 0.844064 | 1892.25 | 0.088231 |
Compared to our previous example on the same dataset of 4 languages (common-voice-small), the $\text{C}_\text{avg}$ value improved from 0.112 to 0.091 and accuracy from 0.803 to 0.846.
Even though it is tempting to conclude that our augmentation approach caused this improvement, we would probably need to run hundreds of experiments with carefully chosen configuration settings to get a reliable answer to whether the augmentation is useful or not.
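For reference, assuming lidbox's avg_detection_cost follows the standard NIST LRE style average detection cost with unit miss and false alarm costs and $P_\text{target} = 0.5$, the reported $\text{C}_\text{avg}$ is roughly

$$
\text{C}_\text{avg} = \frac{1}{N} \sum_{L} \left[ P_\text{target}\, P_\text{miss}(L) + \frac{1 - P_\text{target}}{N - 1} \sum_{L' \neq L} P_\text{fa}(L, L') \right],
$$

where $N$ is the number of target languages, $P_\text{miss}(L)$ is the miss rate for target language $L$, and $P_\text{fa}(L, L')$ is the false alarm rate for target $L$ on trials whose true language is $L'$. Lower values are better.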