Table of Contents
Background
style token
Key conclusions from prior work
- End-to-end TTS needs at least about 10 hours of data.
- According to [1], around 10 hours of speech-transcript pairs from one speaker are needed for a neural end-to-end TTS model such as Tacotron to reach high quality.
- For multi-speaker TTS, tens of minutes of data per speaker are needed.
- In order to support multiple speakers, we usually have to use tens of minutes of training data for every speaker, which makes collecting high-quality data laborious.
- Multi-speaker TTS mainly depends on learning the speakers' latent structure: most of these methods rely on a speaker embedding.
- Speaker adaptation method: fine-tuning a pre-trained multi-speaker model, either entirely or only its speaker embedding.
- Speaker encoding method: training a separate model to predict the new speaker's embedding from a small amount of data.
- The speaker embedding learned by the speaker encoder can represent the pronunciation relations between the two languages.
Questions
- What does "initialize the decoder with the speaker embedding" mean?
Paper
Contributions
- Multi-speaker, cross-lingual (cross-lingual TTS aims to build a system which can synthesize speech in a specific language not spoken by the target speaker), and end-to-end.
- A separately trained neural speaker embedding network that represents different speakers as well as the latent structure of pronunciation.
- Bilingual training: Chinese speakers speaking English, and the reverse.
- Training new speakers with a small amount of data.
Work
- One is that we further discuss how to use a limited amount of data to achieve multi-speaker TTS.
- To be answered.
- Secondly, we analyze end-to-end models in a cross-lingual setting.
- To be answered.
System architecture
Consists of a speaker encoder, Tacotron 2, and a vocoder.
speaker encoder
- Follows the ResCNN in [2]. Illustrated in Figure 1 of this paper, and in Figures 2 and 3 of [2].
- The filter sizes of the convolution layers are 64, 128, 256, and 512, respectively.
- Trained separately, then fine-tuned: we first train the speaker embedding network separately on a speaker verification task with a softmax loss, and then fine-tune the whole model with a triplet loss, which maps speech into a feature space where distances correspond to speaker similarity.
- First, we pre-train the speaker encoder for 10 epochs with the softmax loss and a minibatch size of 32, as it converges to an approximate local optimum.
- Then the model is fine-tuned with the triplet loss for 10 epochs using a minibatch size of 64.
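The two training stages can be sketched with the two losses below; this is a minimal illustration using generic embedding vectors in place of the actual ResCNN outputs, and the margin and shapes are assumptions, not the paper's settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere before comparing distances.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax_ce_loss(logits, labels):
    # Stage 1: speaker-classification (softmax) loss for pre-training.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Stage 2: pull same-speaker embeddings together, push different
    # speakers apart by at least `margin` in squared cosine distance.
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_ap = ((a - p) ** 2).sum(axis=1)
    d_an = ((a - n) ** 2).sum(axis=1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

In practice the two losses are applied to the same encoder: the softmax head is discarded after pre-training and the triplet loss shapes the embedding space directly.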
Mel-Spectrogram Generation Network
- The framework is Tacotron 2.
- The encoder input is a phoneme sequence.
- Four ways of using the speaker embedding are tried:
- (1) Concatenate the speaker embedding to each time step of the encoder;
- (2) Apply an affine transformation to the speaker embedding, then concatenate it to each time step of the encoder;
- (3) Initialize the encoder with the speaker embedding;
- (4) Initialize the decoder with the speaker embedding.
- Conclusions:
- (1) performs poorly; the timbre is inconsistent.
- (3) produces noticeable noise.
- (2) is more stable and fluent.
- (4) performs best.
- The final system combines (2) and (4): the speaker embedding first passes through an affine layer and is concatenated to the encoder outputs, and it also initializes the decoder.
- We normalize the mel-spectrograms to [-4, 4] during preprocessing in order to reduce blurriness in the synthesized audio.
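The winning combination of (2) and (4), plus the mel normalization, can be sketched in plain numpy; every dimension below (encoder width 256, embedding size 128, decoder state 512, 80 mel bins) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, spk_dim, dec_dim = 50, 256, 128, 512  # illustrative sizes

encoder_out = rng.normal(size=(T, enc_dim))  # phoneme encoder outputs
spk_emb = rng.normal(size=(spk_dim,))        # speaker-encoder embedding

# Method (2): affine transform, then concatenate to every encoder step.
W_aff = rng.normal(size=(spk_dim, spk_dim)) * 0.01
b_aff = np.zeros(spk_dim)
spk_proj = spk_emb @ W_aff + b_aff
conditioned = np.concatenate(
    [encoder_out, np.tile(spk_proj, (T, 1))], axis=1)  # (T, enc+spk)

# Method (4): initialize the decoder recurrent state from the embedding.
W_init = rng.normal(size=(spk_dim, dec_dim)) * 0.01
decoder_h0 = np.tanh(spk_emb @ W_init)  # (dec_dim,)

# Mel targets, assumed here to be pre-scaled to [0, 1], are mapped to
# the [-4, 4] range described in the notes.
mel = rng.uniform(0.0, 1.0, size=(T, 80))
mel_norm = mel * 8.0 - 4.0
```

The attention module then reads the speaker-conditioned encoder memory while the decoder starts from a speaker-dependent state, which is one plausible reading of why (2) and (4) complement each other.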
vocoder
- Griffin-Lim with 60 iterations.
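The iteration loop can be sketched as a minimal numpy Griffin-Lim; the FFT size, hop, and Hann window below are assumptions for illustration, not the paper's actual analysis settings.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # Windowed short-time Fourier transform, frames stacked along axis 0.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=512, hop=256):
    # Weighted overlap-add inverse (least-squares reconstruction).
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=60, n_fft=512, hop=256):
    # Start from random phase; each pass re-imposes the target magnitude
    # and keeps only the phase of the re-analyzed signal.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

Each iteration cannot increase the distance between the target magnitude and the magnitude of the reconstructed signal, which is why running more iterations (60 in the paper) trades compute for audio quality.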
Experiments
Data
Corpora
The model is trained with two monolingual corpora:
Language | Corpus | Duration | Speakers | Sentences | Various accents | Used for |
---|---|---|---|---|---|---|
EN | VCTK corpus | 44 h | 109 | 400 | YES | train |
ZH | subset of [4] | - | 855 | 120 | - | train |
EN | [5] | - | 7 | - | YES | test |
ZH | internal | - | 7 | 120 | - | test |
Data preprocessing
- 16 kHz sampling rate.
- Trim leading and trailing silence.
Data usage
Purpose | Data |
---|---|
training multi-speaker model | 337 Chinese speakers and 109 English speakers |
validation | 8 Chinese speakers and 8 English speakers |
testing (Seen) | 2 Chinese, 2 English speakers |
new speaker adaptation (Unseen) | 3 Chinese, 3 English speakers |
- All have a distribution similar to the training dataset in terms of gender and accent.
- Chinese and English are converted into the same representation using the IPA (International Phonetic Alphabet), which:
- improves pronunciation accuracy;
- unifies the phonetic transcriptions of the different languages.
- Both languages are converted into phoneme sequences (using a grapheme-to-phoneme (G2P) library).
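The unified IPA front end can be illustrated with a toy lookup table; the entries below are hand-written examples, not the output of the actual G2P library, and real transcriptions are richer (tones, stress, vowel length).

```python
# Toy illustration of a shared IPA phoneme inventory: words from both
# languages map into one input vocabulary for the TTS encoder.
IPA_LEXICON = {
    ("EN", "mama"): ["m", "ɑ", "m", "ə"],   # hand-picked example entry
    ("ZH", "妈妈"): ["m", "a", "m", "a"],   # hand-picked example entry
}

def to_phonemes(lang, word):
    # Stand-in for a real G2P lookup.
    return IPA_LEXICON[(lang, word)]

# Symbols shared across the two languages collapse onto the same token,
# which is what lets the model relate pronunciations cross-lingually.
shared = set(to_phonemes("EN", "mama")) & set(to_phonemes("ZH", "妈妈"))
```

A single shared symbol such as "m" is enough to show the idea: the encoder sees one token regardless of the source language.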
Training setup
- batch size = 32
- one Nvidia V100 GPU
- L2 regularization
Method
- Pre-train the decoder.
Conclusions
- Different training sets have a significant impact on the speaker embedding.
- The learned speaker embedding can represent the relation between pronunciations across the two languages.
- For the bilingual speaker voice generated by our model, the mother tongue imposes its effect while the speaker speaks the other language.
- We observe that phonemes with similar pronunciation are inclined to stay closer than the others across the two languages.
- Our result shows that the multi-speaker TTS model can extract the speaker characteristics as well as language pronunciations with the speaker embedding from the latent space.
- Judging by the MOS scores, the results are not yet ideal.
- When the MOS falls below 3.6, the quality no longer satisfies practical requirements.
Ideas
- Phoneme input.
- The timbre is influenced by the speaker's mother tongue.
- The speaker ID should go to the decoder (as initialization).
References
- [1] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," arXiv preprint arXiv:1707.06588, 2017.
- [2] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
- [3] "Understanding Convolutional Neural Networks in One Article" (Chinese-language blog post).
- [4] surfing.ai, "ST-CMDS-20170001 1, free ST Chinese Mandarin corpus," 2017.
- [5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.