Table of Contents
Background
style token
Key conclusions from prior work
- End-to-end TTS needs at least about 10 hours of data.
- According to [1], around 10 hours of speech-transcript pairs from one speaker are needed for a neural end-to-end TTS model such as Tacotron to reach high quality.
- For multi-speaker TTS, tens of minutes of data per speaker are needed.
- In order to support multiple speakers, we usually have to use tens of minutes of training data for every speaker, which makes collecting high-quality data laborious.
- Multi-speaker TTS mainly depends on learning the speakers' latent structure: most of these methods rely on a speaker embedding.
- Speaker adaptation method: fine-tuning a pre-trained multi-speaker model, either entirely or only its speaker embedding.
- Speaker encoding method: training a separate model to predict the new speaker's embedding from a small amount of data.
- The speaker embedding learned by the speaker encoder can represent the pronunciation relations between the two languages.
Questions
- What does "initialize the decoder with the speaker embedding" mean?
Paper
Contributions
- Multi-speaker, cross-lingual (cross-lingual TTS aims to build a system which can synthesize speech in a specific language not spoken by the target speaker), and end-to-end.
- A separately trained neural speaker embedding network that represents different speakers as well as the latent structure of pronunciation.
- Bilingual training: Chinese speakers speaking English, and the reverse.
- Training new speakers with a small amount of data.
Work
- One is that we further discuss how to use a limited amount of data to achieve multi-speaker TTS.
- To be answered.
- Secondly, we analyze end-to-end models in a cross-lingual setting.
- To be answered.
System architecture
Consists of a speaker encoder, Tacotron 2, and a vocoder.
speaker encoder
- Follows the ResCNN in [2]. Illustrated in Figure 1 of this paper, and in Figures 2 and 3 of [2].
- The filter sizes of the convolution layers are 64, 128, 256, and 512, respectively.
- Trained separately, then fine-tuned: we first train the speaker embedding network separately on a speaker verification task with a softmax loss, and then fine-tune the whole model with a triplet loss, which maps speech into a feature space where distances correspond to speaker similarity.
- First, we pre-train the speaker encoder for 10 epochs with the softmax loss and a minibatch size of 32, as it converges to an approximate local optimum.
- Then the model is fine-tuned with the triplet loss for 10 epochs using a minibatch size of 64.
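The two training stages can be sketched with the two losses below; this is a minimal illustration using generic embedding vectors in place of the actual ResCNN outputs, and the margin and shapes are assumptions, not the paper's settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere before comparing distances.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax_ce_loss(logits, labels):
    # Stage 1: speaker-classification (softmax) loss for pre-training.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Stage 2: pull same-speaker embeddings together, push different
    # speakers apart by at least `margin` in squared cosine distance.
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_ap = ((a - p) ** 2).sum(axis=1)
    d_an = ((a - n) ** 2).sum(axis=1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

In practice the two losses are applied to the same encoder: the softmax head is discarded after pre-training and the triplet loss shapes the embedding space directly.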
Mel-Spectrogram Generation Network
- The framework is Tacotron 2.
- The encoder input is a phoneme sequence.
- Four ways of using the speaker embedding are tried:
- (1) Concatenate the speaker embedding to each time step of the encoder;
- (2) Apply an affine transformation to the speaker embedding, then concatenate it to each time step of the encoder;
- (3) Initialize the encoder with the speaker embedding;
- (4) Initialize the decoder with the speaker embedding.
- Conclusions:
- (1) performs poorly; the timbre is inconsistent.
- (3) produces noticeable noise.
- (2) is more stable and fluent.
- (4) performs best.
- The final system combines (2) and (4): the speaker embedding first passes through an affine layer and is concatenated to the encoder outputs, and it also initializes the decoder.
- We normalize the mel-spectrograms to [-4, 4] during preprocessing in order to reduce blurriness in the synthesized audio.
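The winning combination of (2) and (4), plus the mel normalization, can be sketched in plain numpy; every dimension below (encoder width 256, embedding size 128, decoder state 512, 80 mel bins) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, spk_dim, dec_dim = 50, 256, 128, 512  # illustrative sizes

encoder_out = rng.normal(size=(T, enc_dim))  # phoneme encoder outputs
spk_emb = rng.normal(size=(spk_dim,))        # speaker-encoder embedding

# Method (2): affine transform, then concatenate to every encoder step.
W_aff = rng.normal(size=(spk_dim, spk_dim)) * 0.01
b_aff = np.zeros(spk_dim)
spk_proj = spk_emb @ W_aff + b_aff
conditioned = np.concatenate(
    [encoder_out, np.tile(spk_proj, (T, 1))], axis=1)  # (T, enc+spk)

# Method (4): initialize the decoder recurrent state from the embedding.
W_init = rng.normal(size=(spk_dim, dec_dim)) * 0.01
decoder_h0 = np.tanh(spk_emb @ W_init)  # (dec_dim,)

# Mel targets, assumed here to be pre-scaled to [0, 1], are mapped to
# the [-4, 4] range described in the notes.
mel = rng.uniform(0.0, 1.0, size=(T, 80))
mel_norm = mel * 8.0 - 4.0
```

The attention module then reads the speaker-conditioned encoder memory while the decoder starts from a speaker-dependent state, which is one plausible reading of why (2) and (4) complement each other.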
vocoder
- Griffin-Lim with 60 iterations.
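The iteration loop can be sketched as a minimal numpy Griffin-Lim; the FFT size, hop, and Hann window below are assumptions for illustration, not the paper's actual analysis settings.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # Windowed short-time Fourier transform, frames stacked along axis 0.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=512, hop=256):
    # Weighted overlap-add inverse (least-squares reconstruction).
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=60, n_fft=512, hop=256):
    # Start from random phase; each pass re-imposes the target magnitude
    # and keeps only the phase of the re-analyzed signal.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

Each iteration cannot increase the distance between the target magnitude and the magnitude of the reconstructed signal, which is why running more iterations (60 in the paper) trades compute for audio quality.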
Experiments
Data
Corpora
The model is trained with two monolingual corpora:
Language | Corpus | Duration | Speakers | Sentences | Various accents | Used for |
---|---|---|---|---|---|---|
EN | VCTK corpus | 44 h | 109 | 400 | YES | train |
ZH | subset of [4] | - | 855 | 120 | - | train |
EN | [5] | - | 7 | - | YES | test |
ZH | internal | - | 7 | 120 | - | test |
Data preprocessing
- 16 kHz sampling rate.
- Trim leading and trailing silence.
Data usage
Purpose | Data |
---|---|
training multi-speaker model | 337 Chinese speakers and 109 English speakers |
validation | 8 Chinese speakers and 8 English speakers |
testing (Seen) | 2 Chinese, 2 English speakers |
new speaker adaptation (Unseen) | 3 Chinese, 3 English speakers |
- All have a distribution similar to the training dataset in terms of gender and accent.
- Chinese and English are converted into the same representation using the IPA (International Phonetic Alphabet), which:
- improves pronunciation accuracy;
- unifies the phonetic transcriptions of the different languages.
- Both languages are converted into phoneme sequences (using a grapheme-to-phoneme (G2P) library).
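The unified IPA front end can be illustrated with a toy lookup table; the entries below are hand-written examples, not the output of the actual G2P library, and real transcriptions are richer (tones, stress, vowel length).

```python
# Toy illustration of a shared IPA phoneme inventory: words from both
# languages map into one input vocabulary for the TTS encoder.
IPA_LEXICON = {
    ("EN", "mama"): ["m", "ɑ", "m", "ə"],   # hand-picked example entry
    ("ZH", "妈妈"): ["m", "a", "m", "a"],   # hand-picked example entry
}

def to_phonemes(lang, word):
    # Stand-in for a real G2P lookup.
    return IPA_LEXICON[(lang, word)]

# Symbols shared across the two languages collapse onto the same token,
# which is what lets the model relate pronunciations cross-lingually.
shared = set(to_phonemes("EN", "mama")) & set(to_phonemes("ZH", "妈妈"))
```

A single shared symbol such as "m" is enough to show the idea: the encoder sees one token regardless of the source language.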
Training setup
- batch size = 32
- one Nvidia V100 GPU
- L2 regularization
Method
- Pre-train the decoder.
Conclusions
- Different training sets have a significant impact on the speaker embedding.
- The learned speaker embedding can represent the relation between pronunciations across the two languages.
- For the bilingual speaker voice generated by our model, the mother tongue imposes its effect while the speaker speaks the other language.
- We observe that phonemes with similar pronunciation are inclined to stay closer than the others across the two languages.
- Our result shows that the multi-speaker TTS model can extract the speaker characteristics as well as language pronunciations with the speaker embedding from the latent space.
- Judging by the MOS scores, the results are not yet ideal.
- When the MOS falls below 3.6, the quality no longer satisfies practical requirements.
Ideas
- Phoneme input.
- The timbre is influenced by the speaker's mother tongue.
- The speaker ID should go to the decoder (as initialization).
References
- [1] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," arXiv preprint arXiv:1707.06588, 2017.
- [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
- [3] "Understanding Convolutional Neural Networks in One Article" (Chinese-language blog post).
- [4] surfing.ai, "ST-CMDS-20170001 1, free ST Chinese Mandarin corpus," 2017.
- [5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.