VAE-Tacotron-2/1 and VQ-VAE: Principles and Implementation.

Tacotron (yanggeng1995

An implementation of VAE Tacotron speech synthesis in TensorFlow. (




Blizzard2013 368K training results. Will vae-tacotron2 get better results?

I trained the model for 368k steps on Blizzard2013, and here is the result (parallel transfer).
You can hear that 118 and 119 sound good, but 120 has less prosody.
These audio clips have a metallic tone because of Griffin-Lim. I will use a vocoder such as WaveNet or WaveRNN to improve audio quality.
I think the model cannot learn prosody in fewer than 200k training steps. During the first 200k steps its job is to learn a good alignment. Beyond 200k steps it begins to learn prosody from the reference.
368k steps isn't enough to train a good prosody model. I think more training steps are needed. I will keep training.
By the way, if I change the model to Tacotron 2, will it produce better results, or results similar to Tacotron 1? Has anyone trained a model as good as the one in the paper?

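4. In any case the results are not good; the best are the ones in 3. If a ready-made VAE-Tacotron-2/1 is hard to find, a well-implemented VAE would also do: pretrain it and transfer it over. GST is already quite mature (see Kyubyong/expressive_tacotron), so consider doing GST first and VAE afterwards, keeping the two steps separate and not rushing into sub-condition information. Alternatively, use the 赤鞘巨人 ("Red Scabbard Giant") architecture I designed.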

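It just occurred to me that the prosody part is not the main thing to consider! There is no need to use a VAE at that spot, or even to include the prosody structure at all. What needs thought is how to make the speaker id transfer across different phonemes. The intuitive option for now is a reversal (adversarial) loss, with some correlation-based terms added later. I need to ask others about this. I should also think about the explanation Google's papers give for the prosody structure: it is nice to have but fine to leave out, and it is unrelated to voice cloning. My goal is not voice cloning anyway, but code-switching.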


5. It includes code for processing the BC data.

VAE Tacotron-2 (

TensorFlow implementation of "Learning latent representations for style control and transfer in end-to-end speech synthesis"

1. In my testing, I haven't gotten good results so far on the style transfer side.

2. The authors of the paper used 105 hours of the Blizzard Challenge 2013 dataset.


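Starting to write code: first use yanggeng1995's implementation directly, then study the classic VAE implementations and port one over, ideally with pretraining, and then combine it with syang1993's.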

Also, follow the architecture in the paper strictly: For each encoder, a mel spectrogram is first passed through two convolutional layers, which contains 512 filters with shape 3 × 1. The output of these convolutional layers is then fed to a stack of two bidirectional LSTM layers with 256 cells at each direction. A mean pooling layer is used to summarize the LSTM outputs across time, followed by a linear projection layer to predict the posterior mean and log variance.
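
The last two stages of that description (mean pooling over time, then a linear projection to the posterior mean and log variance) can be sketched in plain Python to check the shapes and the math before porting to TensorFlow. This is only a minimal sketch under assumptions: the convolutional and BiLSTM layers are skipped, so `lstm_outputs` is random stand-in data; the toy sizes `T`, `D`, `Z` and the helper names are hypothetical; and the sampling step and KL term are the standard VAE formulation, not code taken from either repository.

```python
import math
import random


def mean_pool(frames):
    """Average a list of T frame vectors (each of length D) across time."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


def linear(x, weight, bias):
    """Compute y = W x + b, with weight given as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def reparameterize(mean, log_var, rng):
    """Sample z = mean + exp(0.5 * log_var) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mean, log_var))


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)

# Toy sizes (assumptions). In the paper's setup the BiLSTM output would have
# 2 * 256 = 512 features per frame; 4 is used here to keep the demo small.
T, D, Z = 5, 4, 2

# Stand-in for the BiLSTM outputs: T frames of D features each.
lstm_outputs = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(T)]

pooled = mean_pool(lstm_outputs)                  # shape [D]
mean = linear(pooled, init_matrix(Z, D, rng), [0.0] * Z)     # shape [Z]
log_var = linear(pooled, init_matrix(Z, D, rng), [0.0] * Z)  # shape [Z]

z = reparameterize(mean, log_var, rng)            # latent style vector
kl = kl_to_standard_normal(mean, log_var)         # KL term of the VAE loss

print("z:", z)
print("KL:", kl)
```

Predicting the log variance rather than the variance keeps the projection unconstrained, since `exp(log_var)` is always positive.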


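Accent and prosody both have similar aspects. If "speaker" refers only to the voice, then what "voice" means in voice cloning needs to be defined carefully. Cantonese and English are quite similar, so one possible path is Mandarin => Cantonese => English.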

"Read ten thousand books, travel ten thousand miles, meet countless people, and let a master point the way."


Rewriting Jin Yong's novels in the style of Yu Qiuyu (add) and then reading them again would be great.

When there is time, review the pattern recognition I studied as an undergraduate, and learn or catch up on signal processing.

No need for poetry collections; song lyrics will do.






VAE-GAN and one other.



[8] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.



But how can the latent be kept from carrying too much information? Combine it with an adversarial (ad) loss!

"Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder": another paper that is almost the same.

