VoxCeleb Speaker Recognition Challenge

VoxSRC news:

The 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC) will be held jointly with the Interspeech conference in Shanghai on October 30, 2020.

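Abstract

“Speaker recognition in the wild” is a very challenging task that has to cope with many kinds of uncertainty in speech, such as complex noise, varying levels of background sound, and short bursts of laughter. Using the corpus provided by VoxSRC and the experimental results of the models evaluated on it, several directions are potential ways to improve recognition performance: finding a suitable utterance-level encoder, designing a sound metric learning model, and analysing the data factors that degrade performance. This post summarises the experimental results provided by VoxSRC and the related papers, and closes with an outlook.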

VoxSRC

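The 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC) aims to study how well existing speaker recognition methods perform on speech data collected “in the wild”. The challenge provides speech taken from YouTube interview videos of celebrities. Compared with conventional telephone or microphone speech, this kind of data contains more interference and uncertainty.

The challenge consists of 3 tasks: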

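Supervised speaker verification with fixed training data (Fixed-Full): the VoxCeleb2 dev set is the training data;
Supervised speaker verification with unrestricted training data (Open-Full): any dataset other than the VoxSRC test data may be used for training;
Self-supervised speaker verification with fixed training data (Fixed-Self): the VoxCeleb2 dev set is the training data, but speaker labels may not be used. Other labels, such as cross-modal visual frames, are allowed, whereas pretrained models of any modality are not.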

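The organisers provide a supervised speaker verification baseline for tasks 1 and 2, and a self-supervised speaker verification baseline for task 3.

The 3 task settings make the organisers' intent fairly clear. With the evaluation data held fixed: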

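Task 1: the training set is fixed, so the task is about designing the best learning algorithm;
Task 2: the training set is open, so besides designing a sound learning algorithm, participants must also choose training data that improves performance on the evaluation data. The task is therefore about cross-domain knowledge transfer;
Task 3: the training set is fixed and has no speaker labels, but cross-domain labels are available, so the task is about cross-task knowledge transfer.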

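From this analysis, the three tasks build on one another and grow progressively more complex. Work on learning algorithm design, transfer learning, and cross-domain or cross-task methods should all help with these problems.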

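Metric learning and encoders

The paper Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System examines how several (utterance-level) encoders and several loss functions affect speaker recognition performance. The encoders are temporal average pooling (TAP), self-attentive pooling (SAP) and learnable dictionary encoding (LDE); the loss functions are Softmax, Center and angular softmax (ASoftmax). These encoders and loss functions are integrated into an end-to-end model and evaluated on the VoxCeleb1 dataset. With cosine similarity as the scoring function, the configurations with an EER of at most 4.90% rank as follows (EER in %, lower is better):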

LDE-ASoftmax (4.56) > TAP-Center (4.75) > SAP-ASoftmax (4.90)。
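To make the difference between two of these encoders concrete, here is a minimal NumPy sketch of TAP and SAP. It is an illustration under stated assumptions, not the paper's implementation: the function names, the attention parameters `W`, `b`, `mu` and all shapes are hypothetical, and the random inputs stand in for frame-level features that a real front-end network would produce.

```python
import numpy as np

def temporal_average_pooling(frames):
    """TAP: average the frame-level features over time. frames has shape (T, D)."""
    return frames.mean(axis=0)

def self_attentive_pooling(frames, W, b, mu):
    """SAP: weighted average over time, with weights from a small attention layer.

    W (D, H), b (H,) and mu (H,) are hypothetical attention parameters that a
    real model would learn; here they are simply passed in.
    """
    hidden = np.tanh(frames @ W + b)      # (T, H) hidden representation per frame
    scores = hidden @ mu                  # (T,) one relevance score per frame
    scores = scores - scores.max()        # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames, weights      # (D,) utterance vector, (T,) weights

# Random stand-ins for frame-level features and attention parameters.
rng = np.random.default_rng(0)
T, D, H = 50, 8, 4
frames = rng.normal(size=(T, D))
W, b, mu = rng.normal(size=(D, H)), np.zeros(H), rng.normal(size=H)

tap_vec = temporal_average_pooling(frames)
sap_vec, sap_weights = self_attentive_pooling(frames, W, b, mu)
print(tap_vec.shape, sap_vec.shape)
```

When all attention scores are equal (for example `mu` set to zeros), the SAP weights become uniform and the result coincides with TAP, which is one way to see SAP as a generalisation of plain averaging.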

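The paper In defence of metric learning for speaker recognition examines how a range of loss functions (both classification losses and metric learning losses) affect CNN-based learning, and evaluates the VGG-M-40 and Thin ResNet-34 models on the VoxCeleb dataset. The evaluation protocol matches VoxSRC task 1 (Fixed-Full). The loss functions are:

Classification objectives: Softmax, AM-Softmax (CosFace) and AAM-Softmax (ArcFace);
Metric learning objectives: Triplet, Prototypical, Generalised end-to-end (GE2E) and Angular Prototypical.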
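The two margin-based classification objectives differ only in where the margin is applied to the target class: AM-Softmax subtracts it from the cosine, AAM-Softmax adds it to the angle. The sketch below shows that difference in NumPy. The function names, the `scale` and `margin` values and the random inputs are illustrative assumptions rather than the settings used in the paper, and the sketch skips the special handling that practical implementations add when the angle plus margin exceeds π.

```python
import numpy as np

def margin_logits(embeddings, class_weights, labels, scale=30.0, margin=0.2, kind="am"):
    """Scaled cosine logits with a margin applied to each sample's target class.

    kind="am":  target logit is scale * (cos(theta) - margin)   (AM-Softmax / CosFace)
    kind="aam": target logit is scale * cos(theta + margin)     (AAM-Softmax / ArcFace)
    scale and margin are illustrative defaults, not values from the paper.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)            # (N, C) cosine to every class
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if kind == "am":
        target = target - margin
    elif kind == "aam":
        target = np.cos(np.arccos(target) + margin)
    else:
        raise ValueError("kind must be 'am' or 'aam'")
    logits = cos.copy()
    logits[rows, labels] = target                # only the target class is penalised
    return scale * logits

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Random stand-ins: 6 embeddings, 4 speaker classes, 16 dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))
class_w = rng.normal(size=(4, 16))
labels = np.array([0, 1, 2, 3, 0, 1])
for kind in ("am", "aam"):
    print(kind, cross_entropy(margin_logits(emb, class_w, labels, kind=kind), labels))
```

With `margin=0.0` both variants reduce to a plain softmax over scaled cosines, so the margin is the only thing separating them from each other.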

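With the mean of $\Vert\cdot\Vert$ over $10 \times 10$ pairs as the scoring function, the loss functions rank as follows (Thin ResNet-34 only, since VGG-M-40 performs worse here; EER in %, lower is better):

Classification objectives: AAM-Softmax (2.36) > AM-Softmax (2.40) > Softmax (5.82)

Metric learning objectives: Angular Prototypical (2.21) > Prototypical (2.34) > GE2E (2.52) > Triplet (2.53)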
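The scoring rule used for these rankings, the mean distance over $10 \times 10$ pairs, can be sketched as follows. The sketch assumes that each utterance is represented by 10 crop-level embeddings and that the distance is Euclidean; the function name and the synthetic embeddings are hypothetical, and a real system might L2-normalise the embeddings first.

```python
import numpy as np

def mean_pairwise_distance(crops_a, crops_b):
    """Mean Euclidean distance over all pairs of crop embeddings.

    crops_a has shape (Na, D) and crops_b has shape (Nb, D); with Na = Nb = 10
    this averages over 10 x 10 = 100 pairs. A lower value suggests the same speaker.
    """
    diffs = crops_a[:, None, :] - crops_b[None, :, :]   # (Na, Nb, D)
    return np.linalg.norm(diffs, axis=-1).mean()

# Synthetic stand-ins: two speakers, each crop is a noisy copy of a speaker centre.
rng = np.random.default_rng(0)
centre_a, centre_b = rng.normal(size=16), rng.normal(size=16)
utt_a1 = centre_a + 0.1 * rng.normal(size=(10, 16))
utt_a2 = centre_a + 0.1 * rng.normal(size=(10, 16))
utt_b = centre_b + 0.1 * rng.normal(size=(10, 16))

same_speaker = mean_pairwise_distance(utt_a1, utt_a2)
different_speaker = mean_pairwise_distance(utt_a1, utt_b)
print(same_speaker, different_speaker)
```

Averaging over many crop pairs makes the score less dependent on which part of each utterance happens to be sampled.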

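Among the classification objectives, AAM-Softmax is more sensitive to hyperparameters than AM-Softmax, with its EER fluctuating between 2.36% and 10.55%. Compared with the classification losses, metric learning achieves better performance.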
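As an illustration of the best-ranked metric learning objective, the following is a minimal NumPy sketch of an angular prototypical loss: for each speaker in a batch, the last utterance is the query, the centroid of the remaining utterances is the prototype, and a softmax over scaled cosine similarities asks each query to pick its own speaker's prototype. The function name, the batch layout and the constants `w` and `b` are assumptions made for this sketch; in a trained model `w` and `b` would be learnable parameters.

```python
import numpy as np

def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch has shape (N, M, D): N speakers, M utterances each, D dimensions.

    w and b are fixed here for illustration; they are assumed to be learnable
    scale and bias parameters in an actual training setup.
    """
    queries = batch[:, -1, :]                    # (N, D) last utterance per speaker
    centroids = batch[:, :-1, :].mean(axis=1)    # (N, D) prototype from the rest
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = w * (q @ c.T) + b                     # (N, N) scaled cosine similarities
    shifted = sims - sims.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # query j should match centroid j

# Synthetic batch: 4 well-separated speakers with 3 utterances each.
rng = np.random.default_rng(0)
centres = 5.0 * rng.normal(size=(4, 16))
batch = centres[:, None, :] + 0.1 * rng.normal(size=(4, 3, 16))
loss_matched = angular_prototypical_loss(batch)

# Same batch, but every query now comes from a different speaker.
mismatched = batch.copy()
mismatched[:, -1, :] = batch[::-1, -1, :]
loss_mismatched = angular_prototypical_loss(mismatched)
print(loss_matched, loss_mismatched)
```

The loss is small when each query lies closest to its own prototype and grows when queries are assigned to the wrong speaker, which is the behaviour a verification model is trained to exploit.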

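On the data side, using VoxCeleb2 as training data improves results on VoxCeleb1 markedly: the EER drops from 4.56% to 2.21%, a relative reduction of about 50%. A reasonable conjecture is that adding more data benefits the learning algorithm.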
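Two quantities in this comparison are easy to check numerically: the relative reduction between the two EER figures, and the EER itself. The sketch below is a simplified illustration; the function names are hypothetical, the threshold search only tries the observed score values, and the toy scores are made up (higher score is assumed to mean "more likely the same speaker").

```python
import numpy as np

def relative_reduction(before, after):
    """Relative reduction of an error rate, as a fraction of the starting value."""
    return (before - after) / before

def equal_error_rate(scores, labels):
    """Approximate EER: the error rate at the threshold where the false accept
    rate and the false reject rate are closest. labels: 1 = same speaker, 0 = not.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, None
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = accepted[labels == 0].mean()       # impostor trials accepted
        frr = (~accepted[labels == 1]).mean()    # target trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

reduction = relative_reduction(4.56, 2.21)
print(f"relative EER reduction: {100 * reduction:.1f}%")   # about 51.5%

# Toy trial list: one target scores below one impostor, so the EER is 0.5.
toy_eer = equal_error_rate([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])
print(toy_eer)
```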

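High-dimensional data visualisation with t-SNE

The interpretability of speaker representations remains a major obstacle: it is often hard to tell what the learned speaker features actually look like. t-SNE, a visualisation method published in 2008, maps high-dimensional data onto a low-dimensional manifold and offers a workable way to visualise speaker representations.

t-SNE projects distances between high-dimensional features to distances between low-dimensional features, using a probabilistic model to describe the distances between data points. Its learning procedure resembles an iterative data synthesis method. One might boldly speculate that bringing this kind of method directly into speaker modelling could improve the interpretability of speaker features.

With practicality in mind, I looked up the sklearn implementation of t-SNE, which comes with a handwritten digits example: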
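For reference, the probabilistic model behind t-SNE is usually written as below, following the t-SNE paper listed in the references. The notation is introduced here and is not part of the original post: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional images, $\sigma_i$ a per-point bandwidth set from the perplexity, and $N$ the number of points.

```latex
% Similarity of x_j to x_i in the high-dimensional space (Gaussian kernel)
p_{j|i} = \frac{\exp\left(-\Vert x_i - x_j \Vert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\Vert x_i - x_k \Vert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarity of y_i and y_j in the low-dimensional space (Student-t kernel)
q_{ij} = \frac{\left(1 + \Vert y_i - y_j \Vert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \Vert y_k - y_l \Vert^2\right)^{-1}}

% Cost minimised by gradient descent over the y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The low-dimensional points $y_i$ are updated iteratively so that the $q_{ij}$ match the $p_{ij}$, which is the iterative procedure the paragraph above refers to.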

from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold, datasets, discriminant_analysis

# Prepare digits dataset
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30

# Scale and visualize the embedding vectors
def plot_embedding(X, title=None, sub_num=111):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    # plt.figure()
    ax = plt.subplot(sub_num)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

# Plot images of the digits
print("Showing selected digits")
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.figure(figsize=(12, 10))
plt.subplot(2,2,1)
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0), sub_num=222)

# Projection on to the first 2 linear discriminant components
print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2
                                                         ).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0), sub_num=223)

# Isomap projection of the digits dataset
print("Computing Isomap projection")
t0 = time()
X_iso = manifold.Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0), sub_num=224)

print("0 and 1 are Red.\n2 is Blue.\n3 is Green.\n4 is Purple.\n5 is Orange.")
plt.tight_layout()
plt.savefig('t-SNE.png')
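
The same sklearn call can be pointed at speaker embeddings instead of digit images. The sketch below is only an illustration under stated assumptions: the embeddings are synthetic (random clusters, one per hypothetical speaker) standing in for vectors a trained speaker model would produce, and the speaker count, embedding size and `perplexity` value are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for speaker embeddings: 5 speakers, 20 utterances each,
# 64 dimensions. Real embeddings would come from a trained speaker model.
rng = np.random.default_rng(0)
n_speakers, utts_per_speaker, dim = 5, 20, 64
centres = 5.0 * rng.normal(size=(n_speakers, dim))
embeddings = np.concatenate(
    [centre + rng.normal(size=(utts_per_speaker, dim)) for centre in centres]
)
speaker_ids = np.repeat(np.arange(n_speakers), utts_per_speaker)

# Project to 2 dimensions; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, init="pca", perplexity=15, random_state=0)
points = tsne.fit_transform(embeddings)
print(points.shape)
```

The resulting `points` array can be drawn with a scatter plot coloured by `speaker_ids` to check whether utterances from the same speaker end up close together.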

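Deep learning platform NSML

VoxSRC uses the Korean NSML platform, which gives researchers many automated features so that developers can concentrate on model design. This matches what is needed from a deep learning platform very well. In China, many deep learning competitions also have platforms of this kind, for example Alibaba Cloud, Tencent Cloud, Baidu Cloud, JD Cloud, Huawei Cloud and UCloud.

Although I have experimented with a single-machine deep learning platform, the high barrier to entry has been the main difficulty in building one, both in technology and in design approach. I would very much welcome readers who want to join my team and work on this together.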

References:

VoxCeleb Speaker Recognition Challenge (VoxSRC): Chung, J.S., Huh, J., Mun, S., Lee, M., Heo, H.S., Choe, S., Ham, C., Jung, S., Lee, B.-J., Han, I., 2020. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982.
Encoders and loss functions for speaker and language recognition: Cai, W., Chen, J., Li, M., 2018. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System, in: Odyssey 2018 The Speaker and Language Recognition Workshop. ISCA, Les Sables d’Olonne, France, pp. 74–81. https://doi.org/10.21437/odyssey.2018-11
High-dimensional data visualisation with t-SNE: van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Deep learning platform: Sung, N., Kim, M., Jo, H., Yang, Y., Kim, J., Lausen, L., Kim, Y., Lee, G., Kwak, D.-H., Ha, J.-W., Kim, S., 2017. NSML: A Machine Learning Platform That Enables You to Focus on Your Models. arXiv preprint.

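Author: Wang Rui (王瑞), PhD student, Department of Computer Science, Tongji University

Email: [email protected]

CSDN: https://blog.csdn.net/i_love_home

GitHub: https://github.com/mechanicalsea

If you are interested in taking part in the 2020 VoxSRC competition, feel free to get in touch.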
