Provisional notes

Background knowledge

  • style token

Established findings

  • End-to-end models need at least 10 hours of data.
  • According to [1], around 10 hours of speech-transcript pairs from one speaker are needed to reach high quality with a neural end-to-end TTS model such as Tacotron.
  • For multi-speaker TTS, tens of minutes per speaker are needed.
    • To support multiple speakers, we usually need tens of minutes of training data for every speaker, which makes collecting high-quality data laborious.
  • For multi-speaker TTS, the key is learning the speaker's latent structure; most of these methods rely on a speaker embedding. The two few-shot strategies are sketched in code after this list.
    • Speaker adaptation method: fine-tuning a pre-trained multi-speaker model, either entirely or only its speaker embedding.
    • Speaker encoding method: training a separate model to predict a new speaker's embedding from a small amount of data.
  • The speaker embedding learned by the speaker encoder can represent the relation between pronunciations across the two languages.
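A minimal PyTorch sketch of the two strategies above. The module and attribute names (speaker_embedding, speaker_encoder, the checkpoint file) are assumptions for illustration, not the paper's code:

```python
import torch

# Hypothetical pre-trained multi-speaker TTS checkpoint (name is an assumption).
model = torch.load("pretrained_multispeaker_tts.pt")

# Speaker adaptation (embedding-only variant): freeze everything except the
# speaker embedding table, then fine-tune on the new speaker's few minutes of data.
for p in model.parameters():
    p.requires_grad = False
model.speaker_embedding.weight.requires_grad = True  # assumes an nn.Embedding
optimizer = torch.optim.Adam([model.speaker_embedding.weight], lr=1e-4)

# Speaker encoding: no gradient steps for the new speaker; a separately trained
# encoder predicts the embedding directly from reference audio.
# spk_emb = speaker_encoder(reference_mel)
```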

Questions

  • What does "initialize the decoder with the speaker embedding" mean?

Paper

Contributions

  • Multi-speaker, cross-lingual (cross-lingual TTS aims to build a system that can synthesize speech in a language not spoken by the target speaker), end-to-end.
  • Implements a separately trained neural speaker embedding network that represents different speakers as well as the latent structure of pronunciations.
  • Bilingual synthesis: a Chinese speaker speaking English, and vice versa.
  • Training new speakers with a small amount of data.

Work

  • First, we further discuss how to use a limited amount of data to achieve multi-speaker TTS.
    • To be answered.
  • Second, we analyze end-to-end models in a cross-lingual setting.
    • To be answered.

System architecture

Consists of a speaker encoder, Tacotron 2 (T2), and a vocoder.

[Figure: overall system architecture (speaker encoder, Tacotron 2, vocoder)]

speaker encoder

  • Follows the ResCNN of [2]. Of the figures shown here, Fig. 1 is from this paper; Figs. 2 and 3 are from [2].
  • The numbers of filters in the convolution layers are 64, 128, 256, and 512, respectively.
  • Trained separately, then fine-tuned: we first train the speaker embedding network on a speaker verification task with a softmax loss, and then fine-tune the whole model with a triplet loss, which maps speech into a feature space where distances correspond to speaker similarity (sketched below).
  • First, the speaker encoder is pre-trained for 10 epochs with the softmax loss and a minibatch size of 32, by which point it converges to an approximate local optimum.
  • Then the model is fine-tuned with the triplet loss for 10 epochs using a minibatch size of 64.
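A minimal sketch of the triplet fine-tuning step, assuming a PyTorch encoder standing in for the ResCNN of [2]; the margin value is an assumption:

```python
import torch
import torch.nn as nn

# Triplet loss pulls embeddings of the same speaker together and pushes
# different speakers apart, so distances reflect speaker similarity.
triplet_loss = nn.TripletMarginLoss(margin=0.2)  # margin is an assumed value

def triplet_step(encoder, anchor_mel, positive_mel, negative_mel, optimizer):
    """One fine-tuning step: anchor and positive share a speaker, negative does not."""
    a = encoder(anchor_mel)    # (batch, embedding_dim)
    p = encoder(positive_mel)  # same speakers as the anchors
    n = encoder(negative_mel)  # different speakers
    loss = triplet_loss(a, p, n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```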

[Figures 1-3: speaker encoder (ResCNN) architecture]

Mel-Spectrogram Generation Network

  • The framework is Tacotron 2.
  • The encoder input is a phoneme sequence.
  • Four ways of using the speaker embedding are tried:
  • (1). Concatenate the speaker embedding to each time step of the encoder;
  • (2). Apply an affine transformation to the speaker embedding, then concatenate it to each time step of the encoder;
  • (3). Initialize the encoder with the speaker embedding;
  • (4). Initialize the decoder with the speaker embedding.
  • Findings:
    • (1) performs poorly; the timbre is inconsistent.
    • (3) produces audible noise.
    • (2) is more stable and fluent.
    • (4) works best.
    • The final system combines (2) + (4): the embedding first passes through an affine layer and is concatenated to the encoder, and it also initializes the decoder (see the sketch after this list).
  • We normalize the mel-spectrograms to [-4, 4] during preprocessing to reduce blurriness in the synthesized audio.
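A minimal sketch of the chosen combination (2) + (4). The paper does not publish code, so the layer names and dimensions below are illustrative assumptions:

```python
import torch
import torch.nn as nn

ENC_DIM, SPK_DIM, DEC_DIM = 512, 256, 1024   # assumed sizes

affine = nn.Linear(SPK_DIM, SPK_DIM)      # (2): affine transform of the embedding
init_proj = nn.Linear(SPK_DIM, DEC_DIM)   # (4): project to the decoder state size

def condition(encoder_outputs, spk_emb):
    """encoder_outputs: (B, T, ENC_DIM); spk_emb: (B, SPK_DIM)."""
    e = affine(spk_emb)                                       # (B, SPK_DIM)
    e_tiled = e.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
    conditioned = torch.cat([encoder_outputs, e_tiled], -1)   # concat per time step
    decoder_h0 = torch.tanh(init_proj(spk_emb))               # initial decoder state
    return conditioned, decoder_h0
```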

vocoder

  • Griffin-Lim with 60 iterations (sketched below).
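A sketch of the vocoding step with librosa, assuming the mel-spectrogram has already been mapped back to a linear magnitude spectrogram S; the STFT parameters are assumptions:

```python
import librosa

def griffin_lim_vocoder(S, n_iter=60, hop_length=256, win_length=1024):
    # Reconstruct a waveform from the magnitude spectrogram via 60 iterations
    # of phase estimation, matching the iteration count used in the paper.
    return librosa.griffinlim(S, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```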

Experiments

Data

Corpora

The model is trained on two monolingual corpora:

Language   Corpus          Duration   Speakers   Sentences   Various accents   Usage
EN         VCTK corpus     44 h       109        400         yes               train
ZH         subset of [4]   -          855        120         -                 train
EN         [5]             -          7          -           yes               test
ZH         internal        -          7          120         -                 test

Data parameters

  • 16 kHz sampling rate.
  • Trim leading and trailing silence (see the sketch below).
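A minimal preprocessing sketch with librosa; the top_db threshold is an assumption:

```python
import librosa

def preprocess(path, sr=16000, top_db=30):
    y, _ = librosa.load(path, sr=sr)                # resample to 16 kHz
    y, _ = librosa.effects.trim(y, top_db=top_db)   # trim leading/trailing silence
    return y
```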

Data usage

Usage                              Data
training the multi-speaker model   337 Chinese speakers and 109 English speakers
validation                         8 Chinese speakers and 8 English speakers
testing (Seen)                     2 Chinese and 2 English speakers
new speaker adaptation (Unseen)    3 Chinese and 3 English speakers
  • All splits have a distribution similar to the training set in terms of gender and accent.
  • IPA (International Phonetic Alphabet) is used to convert Chinese and English into the same representation.
    • It improves pronunciation accuracy.
    • It unifies the phonetic transcriptions of the different languages.
  • Both languages are converted into phoneme sequences, using a grapheme-to-phoneme (G2P) library (see the sketch below).
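For the English side, a conversion like the following is plausible, assuming the open-source g2p_en package; the notes only say a G2P library is used, and the further mapping of phonemes to IPA is not shown:

```python
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("Text to speech")  # ARPAbet phoneme sequence
# e.g. ['T', 'EH1', 'K', 'S', 'T', ' ', 'T', 'UW1', ' ', 'S', 'P', 'IY1', 'CH']
```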

Experimental settings

  • batch size = 32
  • one Nvidia V100 GPU
  • L2 regularization (sketched below)
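In PyTorch, L2 regularization is typically realized through the optimizer's weight_decay term; a minimal sketch with assumed values, since the notes give no learning rate or decay strength:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 80)  # placeholder for the full Tacotron 2 network
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # assumed learning rate
                             weight_decay=1e-6)  # assumed L2 strength
```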

Method

  • Pre-train the decoder.

Conclusions

  • Different training sets have a significant impact on the speaker embedding.
  • The learned speaker embedding can represent the relation between pronunciations across the two languages.
  • For the bilingual voices generated by the model, the speaker's mother tongue colors his/her speech in the other language.
  • We observe that phonemes with similar pronunciations tend to stay closer to each other than other phonemes across the two languages.
  • The results show that the multi-speaker TTS model can extract speaker characteristics as well as language pronunciations from the latent space via the speaker embedding.
  • Judging from the MOS scores, the quality is not yet satisfactory.
  • Once the MOS drops below 3.6, the output no longer meets practical needs.

Ideas

  • Phoneme input.
  • Influenced by the speaker's mother tongue.

  • The speaker ID should be fed to the decoder (as initialization).

References

  • [1] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," arXiv preprint arXiv:1707.06588, 2017.
  • [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
  • [3] "Understanding convolutional neural networks in one article" (Chinese-language blog post).
  • [4] surfing.ai, "ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus," 2017.
  • [5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.