VAE-Tacotron-2/1 and VQ-VAE: principles and implementation.

Original post

ruclion

2020-07-08 07:08

Tacotron (yanggeng1995)

An implementation of VAE Tacotron speech synthesis in TensorFlow. (https://arxiv.org/abs/1812.04342)

1. https://github.com/yanggeng1995/vae_tacotron

2. Everything in requirement.txt is satisfied.

3. Blizzard2013 368K training results. Will vae-tacotron2 get better results?

I trained the model for 368k steps on Blizzard2013, and here is the result (parallel transfer):
https://drive.google.com/drive/folders/12dBWg883S1VXQ0jEzJ7Lcz1bxI3lI_2t
You can hear that 118 and 119 sound good, but 120 has less prosody.
These audio clips have a metallic tone because of Griffin-Lim. I will use a vocoder such as WaveNet or WaveRNN to improve audio quality.
I think that with fewer than 200k training steps the model cannot learn prosody. Within the first 200k steps, the model's job is to learn a good alignment. Beyond 200k it begins to learn prosody from the reference.
368k steps isn't enough to train a good prosody model. I think more training is needed, so I will keep training.
By the way, if I change the model to Tacotron 2, will it produce better results, or results similar to Tacotron 1? Has anyone trained a model as good as the one in the paper?

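4. In any case the results are not good; the best ones are those in item 3. If a ready-made VAE-Tacotron-2/1 is hard to find, a well-implemented standalone VAE would also do: pretrain it and transfer it over. GST is already quite mature (https://github.com/syang1993/gst-tacotron, Kyubyong/expressive_tacotron), so consider doing GST first and the VAE afterwards, keeping the two steps separate and not rushing to add sub-condition information. Alternatively, use the structure I designed, the "赤鞘巨人".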

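It just occurred to me that the prosody part is not the main concern! There is no need to use a VAE at that point, or even to include the prosody structure at all. What needs thought is how to make the speaker ID transfer across different phonemes. The intuitive option for now is a reversed (adversarial) loss, with some correlation-based terms added later. I need to ask other people about this. I should also think about how the Google papers explain the prosody structure: it is nice to have, but fine to leave out. It has nothing to do with voice cloning, and my goal is not voice cloning anyway but code-switching.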

I do not understand Kyubyong/vq-vae yet.
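As a starting point for reading that repository, here is a minimal sketch of the quantization step at the core of VQ-VAE: each encoder output is replaced by its nearest codebook vector. It is written in plain Python so it runs without TensorFlow, and it is not taken from Kyubyong/vq-vae. The names `quantize`, `vq_losses`, and the toy codebook are hypothetical. The commitment weight 0.25 is the value I recall from the original VQ-VAE paper; treat it as an assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e."""
    k = min(range(len(codebook)),
            key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


def vq_losses(z_e, z_q, beta=0.25):
    """Codebook and commitment terms of the VQ-VAE objective.

    Numerically both are ||z_e - z_q||^2. In a real framework they differ
    only in where the stop-gradient sits: the codebook term updates the
    codebook, the commitment term (scaled by beta) updates the encoder.
    Plain Python has no autodiff, so that distinction is only noted here.
    """
    d = squared_distance(z_e, z_q)
    return d, beta * d


# Toy codebook with three 2-dimensional codes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.9, 1.2]                      # one encoder output vector
k, z_q = quantize(z_e, codebook)      # nearest code replaces z_e
codebook_loss, commit_loss = vq_losses(z_e, z_q)
```

In training, the decoder receives `z_q`, and the gradient of the reconstruction loss is copied straight through from `z_q` to `z_e`, because the nearest-neighbour lookup itself is not differentiable.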

5. There is code for processing the BC (Blizzard Challenge) data.

VAE Tacotron-2 (https://github.com/rishikksh20/vae_tacotron2)

TensorFlow implementation of "Learning latent representations for style control and transfer in end-to-end speech synthesis".

1. In my testing, I haven't gotten good results so far on the style transfer side.

2. The authors of the paper used 105 hours of the Blizzard Challenge 2013 dataset.

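Starting to write code: first use yanggeng1995's implementation as is, then study a classic VAE implementation and port it over, ideally with pretraining, and finally combine it with syang1993's.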

Also follow the structure in the paper strictly: For each encoder, a mel spectrogram is first passed through two convolutional layers, which contains 512 filters with shape 3 × 1. The output of these convolutional layers is then fed to a stack of two bidirectional LSTM layers with 256 cells at each direction. A mean pooling layer is used to summarize the LSTM outputs across time, followed by a linear projection layer to predict the posterior mean and log variance.
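The tail of the encoder described above (mean pooling over time, then a linear projection to the posterior mean and log variance) can be sketched as follows. This is a dependency-free illustration, not the paper's or any repository's code. The convolutional and BiLSTM layers are omitted and replaced by random stand-in values, the sizes are toy values, and all function names are hypothetical. The reparameterization and KL term are standard VAE practice that the quoted passage does not spell out, so treat them as assumptions.

```python
import math
import random


def mean_pool(frames):
    """Average a [T][D] sequence over time into one D-dimensional vector."""
    steps = len(frames)
    return [sum(column) / steps for column in zip(*frames)]


def linear(x, weights, bias):
    """Dense layer; weights has shape [out][in], bias has shape [out]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def random_matrix(rows, cols, rng, scale=0.1):
    """Random [rows][cols] matrix standing in for learned values."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
T, D, Z = 5, 4, 2                      # toy sizes: frames, features, latent dims
lstm_out = random_matrix(T, D, rng, scale=1.0)   # stand-in for BiLSTM outputs
w_mu = random_matrix(Z, D, rng)
w_log_var = random_matrix(Z, D, rng)

pooled = mean_pool(lstm_out)                     # summarize across time
mu = linear(pooled, w_mu, [0.0] * Z)             # posterior mean
log_var = linear(pooled, w_log_var, [0.0] * Z)   # posterior log variance
z = reparameterize(mu, log_var, rng)             # latent passed on to the decoder
kl = kl_to_standard_normal(mu, log_var)          # regularizer added to the loss
```

Predicting the log variance rather than the variance keeps the projection unconstrained, since any real output maps to a positive variance through the exponential.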

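Accent and prosody both have counterparts across languages. If "speaker" refers only to the voice, then the "voice" in voice cloning needs a careful definition. Cantonese and English are quite similar, so one could go Mandarin => Cantonese => English.

"Read ten thousand books, travel ten thousand miles, meet countless people, and have a master point the way."

Rewrite Jin Yong's novels in the style of Yu Qiuyu (add), then read them again, and that works well.

When there is time, review the pattern recognition I studied as an undergraduate, and learn or catch up on signal processing.

No need for poetry collections; song lyrics will do.

Very important!!!

Use separate experts!!!

VAE GAN and one other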

Google is very used to, and fond of, concatenating right before the decoder.
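To make that operation concrete, here is a minimal sketch of concatenating a latent vector to the text encoder outputs before they reach the decoder. The single latent vector is tiled across time and appended to every encoder frame. The function name and the toy values are hypothetical, and this is not taken from any specific Google implementation.

```python
def concat_latent(encoder_outputs, z):
    """Tile z across time and append it to each frame: [T][D] + [Z] -> [T][D+Z]."""
    return [list(frame) + list(z) for frame in encoder_outputs]


encoder_outputs = [[0.1, 0.2, 0.3],
                   [0.4, 0.5, 0.6]]    # T=2 text frames, D=3 features
z = [9.0, 8.0]                         # Z=2 latent (style or speaker) vector
decoder_inputs = concat_latent(encoder_outputs, z)
```

The attention mechanism then sees the same latent at every time step. The alternative is to project `z` to `D` dimensions and add it to each frame, which keeps the decoder input size unchanged.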

[8] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.

But how do we keep the latent from carrying too much information? Combine it with an adversarial (ad) loss!
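One possible reading of this idea, written out as an objective: attach a speaker classifier to the latent and train the encoder to fool it, so that the latent keeps style information but drops speaker identity. This formulation is an assumption on my part and is not given in the notes. The symbols are hypothetical: $x$ is the reference mel spectrogram, $z$ the latent, $s$ the speaker label, $C_\psi$ the classifier, and $\beta$, $\lambda$ are weights.

```latex
\min_{\theta,\phi}\;\max_{\psi}\;
  \mathcal{L}_{\text{rec}}(\theta,\phi)
  + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\middle\|\,p(z)\right)
  - \lambda\, \mathcal{L}_{\text{cls}}(\psi,\phi),
\qquad
\mathcal{L}_{\text{cls}}(\psi,\phi)
  = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log C_\psi(s \mid z)\right]
```

The classifier $\psi$ minimizes $\mathcal{L}_{\text{cls}}$ while the encoder $\phi$ maximizes it. In practice this is often implemented with a gradient reversal layer between the latent and the classifier.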

"Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder" is another, almost identical paper.
