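1. Overview
Tacotron2 is a speech synthesis framework proposed by Google Brain in 2017.
Tacotron2 is a fully neural approach to speech synthesis. The model has three main parts:
- Spectrogram prediction network: a recurrent Seq2seq feature prediction network with attention, which predicts a sequence of mel spectrogram frames from an input character sequence.
- Vocoder: a modified version of WaveNet, which generates time-domain waveform samples from the predicted mel spectrogram frames.
- Intermediate representation: a low-level acoustic representation, the mel-frequency spectrogram, bridges the two parts of the system.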
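To make the intermediate representation a little more concrete, here is a minimal sketch of how mel band edges can be laid out. It uses the common HTK-style Hz-to-mel formula; the band count (80) and frequency range (125 Hz to 7600 Hz) are assumptions taken from the Tacotron 2 paper rather than from this article, and the function names are made up for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Return n_mels + 2 band-edge frequencies (Hz), equally spaced in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Assumed values for illustration: 80 bands between 125 Hz and 7600 Hz.
edges = mel_band_edges(n_mels=80, f_min=125.0, f_max=7600.0)
print(len(edges))             # 82 edge frequencies
print(edges[1] - edges[0])    # narrow band at the low end
print(edges[-1] - edges[-2])  # much wider band at the high end
```

The bands are narrow at low frequencies and wide at high frequencies, which is why a mel spectrogram is a much more compact target than a linear-frequency spectrogram.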
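2. Spectrogram prediction network
The spectrogram prediction network consists of an encoder and a decoder with attention. The encoder converts the character sequence into a hidden representation, and the decoder consumes that representation to predict the spectrogram.
Encoder:
The encoder contains a character embedding layer, a stack of 3 convolutional layers, and a bidirectional LSTM layer.
- Input characters are encoded as 512-dimensional character vectors;
- These pass through three convolutional layers, each with 512 kernels of shape 5x1, so each kernel spans 5 characters. The convolutions model longer-range context in the input character sequence (similar to N-grams). Convolutions are used for context here mainly because RNNs have difficulty capturing long-term dependencies in practice. Each convolutional layer is followed by batch normalization and a ReLU activation;
- The output of the last convolutional layer is passed to a bidirectional LSTM layer that produces the encoded features. This LSTM has 512 units (256 in each direction).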
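The encoder can be written as:
$$f_e = \mathrm{ReLU}(F_3 * \mathrm{ReLU}(F_2 * \mathrm{ReLU}(F_1 * E(X))))$$
$$H = \mathrm{EncoderRecurrency}(f_e)$$
where $F_1$, $F_2$, $F_3$ are the three convolution kernels, ReLU is the non-linear activation on each convolutional layer, $E$ denotes the embedding of the character sequence $X$, and EncoderRecurrency denotes the bidirectional LSTM.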
import torch
from torch import nn
from torch.nn import functional as F
from layers import ConvNorm, LinearNorm  # helper layers from the NVIDIA/tacotron2 repository


class Encoder(nn.Module):
    """Encoder module:
        - Three 1-d convolution banks 5x1
        - Bidirectional LSTM
    """
    def __init__(self, hparams):
        super(Encoder, self).__init__()

        convolutions = []
        for _ in range(hparams.encoder_n_convolutions):
            conv_layer = nn.Sequential(
                ConvNorm(hparams.encoder_embedding_dim,
                         hparams.encoder_embedding_dim,
                         kernel_size=hparams.encoder_kernel_size, stride=1,
                         # Pad so that input and output have the same length.
                         padding=int((hparams.encoder_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(hparams.encoder_embedding_dim))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        # Half the embedding size per direction, so the concatenated
        # forward/backward output has encoder_embedding_dim features.
        self.lstm = nn.LSTM(hparams.encoder_embedding_dim,
                            int(hparams.encoder_embedding_dim / 2), 1,
                            batch_first=True, bidirectional=True)
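The padding expression `int((kernel_size - 1) / 2)` used in the encoder is what keeps the sequence length unchanged through each convolution. A small sketch, using the standard 1-D convolution output-length formula (the helper name `conv1d_output_length` is introduced here only for illustration), makes the arithmetic explicit:

```python
def conv1d_output_length(length: int, kernel_size: int, padding: int,
                         stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


kernel_size = 5                       # encoder kernel size stated in the text
padding = int((kernel_size - 1) / 2)  # same expression as in the Encoder code
for n_chars in (1, 7, 50):
    assert conv1d_output_length(n_chars, kernel_size, padding) == n_chars

# Without padding, each layer would shorten the sequence by kernel_size - 1.
print(conv1d_output_length(50, kernel_size, padding=0))  # 46
```

Because the length is preserved, every input character still has exactly one encoder output for the attention mechanism to attend to.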
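Attention network:
Tacotron2 uses location-sensitive attention (Attention-Based Models for Speech Recognition), an extension of the earlier additive attention mechanism (Neural machine translation by jointly learning to align and translate). It uses the attention weights accumulated over previous decoder steps as an additional feature, which encourages the model to move forward consistently along the input sequence and reduces the repeated or skipped subsequences that can otherwise occur during decoding. The location features are computed with 32 one-dimensional convolution kernels of length 31; the inputs and the location features are then projected to a 128-dimensional hidden representation, from which the attention weights are computed. The attention computation is covered in more detail in a separate blog post.
Tacotron2 therefore uses a hybrid attention mechanism: location features are added to the alignment computation.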
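$$e_{i,j} = \mathrm{score}(s_i, c\alpha_{i-1}, h_j) = v_a^\top \tanh(W s_i + V h_j + U f_{i,j} + b)$$
where $v_a$, $W$, $V$, $U$ and $b$ are trainable parameters, $s_i$ is the current decoder hidden state, $h_j$ is the current encoder hidden state, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights, $f_i = F * c\alpha_{i-1}$. Hybrid attention can take both the content and the position of the input elements into account.
The implementation follows: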
class Attention(nn.Module):
    def __init__(self, attention_rnn_dim, embedding_dim, attention_dim,
                 attention_location_n_filters, attention_location_kernel_size):
        super(Attention, self).__init__()
        # Projects the decoder (attention RNN) hidden state.
        self.query_layer = LinearNorm(attention_rnn_dim, attention_dim,
                                      bias=False, w_init_gain='tanh')
        # Projects the encoder hidden states.
        self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False,
                                       w_init_gain='tanh')
        self.v = LinearNorm(attention_dim, 1, bias=False)
        # Computes the location features.
        self.location_layer = LocationLayer(attention_location_n_filters,
                                            attention_location_kernel_size,
                                            attention_dim)
        self.score_mask_value = -float("inf")

    def get_alignment_energies(self, query, processed_memory,
                               attention_weights_cat):
        """
        PARAMS
        ------
        query: attention RNN hidden state (batch, attention_rnn_dim),
            i.e. the current decoder hidden state
        processed_memory: processed encoder outputs (B, T_in, attention_dim),
            i.e. the projected encoder hidden states
        attention_weights_cat: previous and cumulative attention weights
            (B, 2, max_time)

        RETURNS
        -------
        alignment (batch, max_time)
        """
        # Projected decoder hidden state.
        processed_query = self.query_layer(query.unsqueeze(1))
        # Location features from the previous and cumulative attention weights.
        processed_attention_weights = self.location_layer(attention_weights_cat)
        # Alignment energies.
        energies = self.v(torch.tanh(
            processed_query + processed_attention_weights + processed_memory))

        energies = energies.squeeze(-1)
        return energies

    def forward(self, attention_hidden_state, memory, processed_memory,
                attention_weights_cat, mask):
        """
        PARAMS
        ------
        attention_hidden_state: attention rnn last output
        memory: encoder outputs (encoder hidden states)
        processed_memory: processed encoder outputs (projected encoder states)
        attention_weights_cat: previous and cumulative attention weights,
            used to compute the location features
        mask: binary mask for padded data
        """
        alignment = self.get_alignment_energies(
            attention_hidden_state, processed_memory, attention_weights_cat)

        # Padded positions get -inf so that softmax assigns them zero weight.
        if mask is not None:
            alignment.data.masked_fill_(mask, self.score_mask_value)

        # Normalized attention weights.
        attention_weights = F.softmax(alignment, dim=1)
        # Context vector: attention-weighted sum of the encoder outputs.
        attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
        attention_context = attention_context.squeeze(1)

        return attention_context, attention_weights
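The last three steps of `forward` (masking, softmax, weighted sum) can be traced by hand on a toy example. The following is a dependency-free sketch, not the repository's code; the helper names and the toy numbers are made up for illustration, and it assumes at least one position is left unmasked.

```python
import math


def masked_softmax(energies: list[float], mask: list[bool]) -> list[float]:
    """Softmax over energies; positions where mask is True get weight 0."""
    filled = [float("-inf") if m else e for e, m in zip(energies, mask)]
    peak = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in filled]  # exp(-inf) == 0.0
    total = sum(exps)
    return [x / total for x in exps]


def context_vector(weights: list[float], memory: list[list[float]]) -> list[float]:
    """Weighted sum of encoder outputs (one row of memory per input position)."""
    dim = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]


# Toy values: 4 encoder positions, the last one is padding.
energies = [0.5, 2.0, 0.1, 3.0]
mask = [False, False, False, True]
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]

weights = masked_softmax(energies, mask)
context = context_vector(weights, memory)
print(weights)
print(context)
```

Although the padded position has the largest raw energy, it receives zero weight, so its encoder output cannot leak into the context vector.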
Computing the location features: the location features are obtained by convolving the previous and accumulated attention weights.
class LocationLayer(nn.Module):
    def __init__(self, attention_n_filters, attention_kernel_size,
                 attention_dim):
        """
        PARAMS
        ------
        attention_n_filters: number of convolution filters (32)
        attention_kernel_size: convolution kernel size (31)
        attention_dim: size of the attention hidden representation (128)
        """
        super(LocationLayer, self).__init__()
        padding = int((attention_kernel_size - 1) / 2)
        # Two input channels: the previous and the cumulative attention weights.
        self.location_conv = ConvNorm(2, attention_n_filters,
                                      kernel_size=attention_kernel_size,
                                      padding=padding, bias=False, stride=1,
                                      dilation=1)
        self.location_dense = LinearNorm(attention_n_filters, attention_dim,
                                         bias=False, w_init_gain='tanh')

    # The location features are a convolution over the attention weights.
    def forward(self, attention_weights_cat):
        processed_attention = self.location_conv(attention_weights_cat)
        processed_attention = processed_attention.transpose(1, 2)
        processed_attention = self.location_dense(processed_attention)
        return processed_attention
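To see why a convolution over past attention weights helps the alignment move forward, consider a hand-built single filter. This is only an illustrative sketch: the 3-tap kernels below are hypothetical and hand-picked, whereas the real layer learns 32 filters of length 31 and then projects them with `location_dense`.

```python
def conv1d_same(signal: list[float], kernel: list[float]) -> list[float]:
    """1-D cross-correlation with zero padding of (len(kernel) - 1) // 2 per side."""
    pad = (len(kernel) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(signal))
    ]


# Toy attention state over 6 encoder positions.
prev_weights = [0.0, 0.1, 0.8, 0.1, 0.0, 0.0]  # weights from the last step
cum_weights = [1.0, 1.0, 0.9, 0.1, 0.0, 0.0]   # sum of weights over all steps

# One hypothetical filter with a 3-tap kernel per input channel.
kernel_prev = [1.0, 0.0, 0.0]   # copies the weight of the position to the left
kernel_cum = [0.0, -1.0, 0.0]   # penalises positions already attended to

feature = [
    a + b
    for a, b in zip(conv1d_same(prev_weights, kernel_prev),
                    conv1d_same(cum_weights, kernel_cum))
]
print(feature)
```

With these kernels the feature peaks at position 3, one step to the right of the previous attention peak at position 2, which is the kind of forward-moving hint the location features can provide to the energy computation.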
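Decoder:
The decoder is an autoregressive recurrent neural network that predicts the output spectrogram from the encoded input sequence, one frame at a time.
- The spectrogram frame predicted at the previous step is first passed through a "pre-net": two fully connected layers of 256 hidden ReLU units each. The pre-net acts as an information bottleneck and is essential for learning attention.
- The pre-net output and the attention context vector are concatenated and passed to a stack of two unidirectional LSTM layers with 1024 units. The LSTM output is again concatenated with the attention context vector and passed through a linear projection to predict the target spectrogram frame.
- Finally, the predicted frame passes through a 5-layer convolutional "post-net", which predicts a residual that is added to the pre-convolution frame to improve the overall reconstruction. Each post-net layer has 512 kernels of shape 5x1 followed by batch normalization, with a tanh activation on every layer except the last.
- In parallel with the frame prediction, the decoder LSTM output is concatenated with the attention context vector, projected to a scalar, and passed through a sigmoid to predict the probability that the output sequence has finished.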
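The stop-token rule in the last bullet can be sketched as a simple threshold test at inference time. The default values below (threshold 0.5, at most 1000 steps) and the function names are assumptions for illustration; the article's code only shows that `gate_threshold` and `max_decoder_steps` exist as hyperparameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def frames_until_stop(gate_logits: list[float], gate_threshold: float = 0.5,
                      max_decoder_steps: int = 1000) -> int:
    """Number of frames emitted before decoding stops.

    gate_logits[t] stands in for the scalar projection at decoder step t.
    """
    for step, logit in enumerate(gate_logits[:max_decoder_steps], start=1):
        if sigmoid(logit) > gate_threshold:
            return step  # the frame of the stopping step is still kept
    return min(len(gate_logits), max_decoder_steps)


# Hypothetical logits: the model becomes confident at the fifth step.
logits = [-6.0, -5.5, -4.0, -0.3, 2.5, 4.0]
print(frames_until_stop(logits))  # 5
```

The step limit matters in practice: if the stop probability never crosses the threshold, the limit is the only thing that ends generation.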
The code follows.
Pre-net: two fully connected layers with a dropout of 0.5.
class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout active at inference time as well.
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
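Because the pre-net calls dropout with `training=True`, a fresh random mask is drawn on every forward pass, including at inference. The sketch below imitates that behaviour with inverted dropout in plain Python; the `dropout` helper here is a stand-in written for illustration, not the PyTorch function.

```python
import random


def dropout(values: list[float], p: float, rng: random.Random) -> list[float]:
    """Inverted dropout: zero each value with probability p, scale the rest by 1/(1-p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
hidden = [1.0] * 8      # stand-in for one pre-net layer's ReLU output

first = dropout(hidden, p=0.5, rng=rng)
second = dropout(hidden, p=0.5, rng=rng)
print(first)
print(second)  # a fresh mask is drawn on every call
```

With p = 0.5 each surviving unit is doubled, so the expected value of every unit stays the same while the previous frame is only partially visible to the decoder.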
Decoder body: the constructor shows which layers make up the decoder: prenet (the pre-processing layer), attention_rnn and attention_layer (the attention network), decoder_rnn (the decoder LSTM), linear_projection (the linear projection layer), and gate_layer (which predicts whether decoding has finished).
class Decoder(nn.Module):
    def __init__(self, hparams):
        super(Decoder, self).__init__()
        self.n_mel_channels = hparams.n_mel_channels
        self.n_frames_per_step = hparams.n_frames_per_step
        self.encoder_embedding_dim = hparams.encoder_embedding_dim
        self.attention_rnn_dim = hparams.attention_rnn_dim
        self.decoder_rnn_dim = hparams.decoder_rnn_dim
        self.prenet_dim = hparams.prenet_dim
        self.max_decoder_steps = hparams.max_decoder_steps
        self.gate_threshold = hparams.gate_threshold
        self.p_attention_dropout = hparams.p_attention_dropout
        self.p_decoder_dropout = hparams.p_decoder_dropout

        # Pre-net: bottleneck applied to the previous mel frame.
        self.prenet = Prenet(
            hparams.n_mel_channels * hparams.n_frames_per_step,
            [hparams.prenet_dim, hparams.prenet_dim])

        # Attention LSTM: input is pre-net output + attention context.
        self.attention_rnn = nn.LSTMCell(
            hparams.prenet_dim + hparams.encoder_embedding_dim,
            hparams.attention_rnn_dim)

        self.attention_layer = Attention(
            hparams.attention_rnn_dim, hparams.encoder_embedding_dim,
            hparams.attention_dim, hparams.attention_location_n_filters,
            hparams.attention_location_kernel_size)

        # Decoder LSTM: input is attention LSTM output + attention context.
        self.decoder_rnn = nn.LSTMCell(
            hparams.attention_rnn_dim + hparams.encoder_embedding_dim,
            hparams.decoder_rnn_dim, 1)

        # Projects decoder LSTM output + attention context to a mel frame.
        self.linear_projection = LinearNorm(
            hparams.decoder_rnn_dim + hparams.encoder_embedding_dim,
            hparams.n_mel_channels * hparams.n_frames_per_step)

        # Projects the same vector to a scalar stop-token logit.
        self.gate_layer = LinearNorm(
            hparams.decoder_rnn_dim + hparams.encoder_embedding_dim, 1,
            bias=True, w_init_gain='sigmoid')
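The concatenations in the constructor are easier to follow with concrete numbers. The sketch below recomputes the input and output width of each decoder layer from a set of hyperparameter values; these values are assumptions (the article does not list them), chosen to match the layer sizes quoted in the text where the text gives one, plus 80 mel channels.

```python
# Assumed hyperparameter values (not listed in the article).
hparams = {
    "n_mel_channels": 80,
    "n_frames_per_step": 1,
    "encoder_embedding_dim": 512,
    "prenet_dim": 256,
    "attention_rnn_dim": 1024,
    "decoder_rnn_dim": 1024,
}


def decoder_layer_sizes(hp: dict) -> dict:
    """(input size, output size) of each decoder layer, as wired in __init__."""
    frame = hp["n_mel_channels"] * hp["n_frames_per_step"]
    context = hp["encoder_embedding_dim"]  # the attention context has encoder width
    return {
        "prenet": (frame, hp["prenet_dim"]),
        "attention_rnn": (hp["prenet_dim"] + context, hp["attention_rnn_dim"]),
        "decoder_rnn": (hp["attention_rnn_dim"] + context, hp["decoder_rnn_dim"]),
        "linear_projection": (hp["decoder_rnn_dim"] + context, frame),
        "gate_layer": (hp["decoder_rnn_dim"] + context, 1),
    }


for name, (n_in, n_out) in decoder_layer_sizes(hparams).items():
    print(f"{name:18s} {n_in:5d} -> {n_out}")
```

Under these assumptions the attention context (512 values) is concatenated three times: into the attention LSTM, into the decoder LSTM, and into the two output projections.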
The decoding steps (methods of the Decoder class; helper methods such as get_go_frame, parse_decoder_inputs, initialize_decoder_states and parse_decoder_outputs, and the get_mask_from_lengths utility, are defined in the repository and not shown here):
    def decode(self, decoder_input):
        """ Decoder step using stored states, attention and memory
        PARAMS
        ------
        decoder_input: previous mel output

        RETURNS
        -------
        mel_output:
        gate_output: gate output energies
        attention_weights:
        """
        # Attention LSTM: pre-net output concatenated with the last context.
        cell_input = torch.cat((decoder_input, self.attention_context), -1)
        self.attention_hidden, self.attention_cell = self.attention_rnn(
            cell_input, (self.attention_hidden, self.attention_cell))
        self.attention_hidden = F.dropout(
            self.attention_hidden, self.p_attention_dropout, self.training)

        # Stack previous and cumulative attention weights as two channels.
        attention_weights_cat = torch.cat(
            (self.attention_weights.unsqueeze(1),
             self.attention_weights_cum.unsqueeze(1)), dim=1)
        self.attention_context, self.attention_weights = self.attention_layer(
            self.attention_hidden, self.memory, self.processed_memory,
            attention_weights_cat, self.mask)

        # Accumulate the attention weights.
        self.attention_weights_cum += self.attention_weights

        # Decoder LSTM: attention LSTM output concatenated with the new context.
        decoder_input = torch.cat(
            (self.attention_hidden, self.attention_context), -1)
        self.decoder_hidden, self.decoder_cell = self.decoder_rnn(
            decoder_input, (self.decoder_hidden, self.decoder_cell))
        self.decoder_hidden = F.dropout(
            self.decoder_hidden, self.p_decoder_dropout, self.training)

        # Project to the mel frame and to the stop-token logit.
        decoder_hidden_attention_context = torch.cat(
            (self.decoder_hidden, self.attention_context), dim=1)
        decoder_output = self.linear_projection(
            decoder_hidden_attention_context)

        gate_prediction = self.gate_layer(decoder_hidden_attention_context)
        return decoder_output, gate_prediction, self.attention_weights

    def forward(self, memory, decoder_inputs, memory_lengths):
        """ Decoder forward pass for training
        PARAMS
        ------
        memory: Encoder outputs
        decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs
        memory_lengths: Encoder output lengths for attention masking.

        RETURNS
        -------
        mel_outputs: mel outputs from the decoder
        gate_outputs: gate outputs from the decoder
        alignments: sequence of attention weights from the decoder
        """
        # Decoder inputs: a go frame followed by the ground-truth mel frames.
        decoder_input = self.get_go_frame(memory).unsqueeze(0)
        decoder_inputs = self.parse_decoder_inputs(decoder_inputs)
        decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0)
        decoder_inputs = self.prenet(decoder_inputs)

        # Initialize the decoder states.
        self.initialize_decoder_states(
            memory, mask=~get_mask_from_lengths(memory_lengths))

        mel_outputs, gate_outputs, alignments = [], [], []
        while len(mel_outputs) < decoder_inputs.size(0) - 1:
            decoder_input = decoder_inputs[len(mel_outputs)]
            mel_output, gate_output, attention_weights = self.decode(
                decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            # squeeze(1) rather than squeeze(): keeps the batch dimension
            # when the batch size is 1.
            gate_outputs += [gate_output.squeeze(1)]
            alignments += [attention_weights]

        # Post-process the decoder outputs.
        mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
            mel_outputs, gate_outputs, alignments)

        return mel_outputs, gate_outputs, alignments
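The training loop in `forward` uses teacher forcing: the decoder is fed the ground-truth frame from the previous step rather than its own prediction. The sketch below shows only the index bookkeeping, with string labels standing in for mel frames; the function name and the `<go>` label are made up for illustration.

```python
def teacher_forcing_pairs(target_frames: list, go_frame) -> list[tuple]:
    """Pair each decoder input with the frame it is trained to predict."""
    inputs = [go_frame] + list(target_frames)  # go frame is prepended
    # The loop in forward() runs len(inputs) - 1 times, so the final target
    # frame is never fed back as an input.
    return [(inputs[t], target_frames[t]) for t in range(len(inputs) - 1)]


# Frames are represented by labels here instead of mel vectors.
pairs = teacher_forcing_pairs(["y1", "y2", "y3"], go_frame="<go>")
print(pairs)  # [('<go>', 'y1'), ('y1', 'y2'), ('y2', 'y3')]
```

This explains the `decoder_inputs.size(0) - 1` bound in the while loop: there is one more input than there are frames to predict.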
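Post-processing network (post-net): five 1-D convolutional layers, the last of which does not use tanh. The output is computed as a residual:
$$y_{final} = y + y_r, \qquad y_r = \mathrm{PostNet}(y)$$
where $y$ is the original input.
Each convolutional layer inside the post-net is applied to $x$, where $x$ is the output of the previous convolutional layer or, for the first layer, the decoder output.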
class Postnet(nn.Module):
    """Postnet
        - Five 1-d convolution with 512 channels and kernel size 5
    """

    def __init__(self, hparams):
        super(Postnet, self).__init__()
        self.convolutions = nn.ModuleList()

        # First layer: mel channels -> post-net channels.
        self.convolutions.append(
            nn.Sequential(
                ConvNorm(hparams.n_mel_channels, hparams.postnet_embedding_dim,
                         kernel_size=hparams.postnet_kernel_size, stride=1,
                         padding=int((hparams.postnet_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='tanh'),
                nn.BatchNorm1d(hparams.postnet_embedding_dim))
        )

        # Middle layers keep the post-net channel width.
        for i in range(1, hparams.postnet_n_convolutions - 1):
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(hparams.postnet_embedding_dim,
                             hparams.postnet_embedding_dim,
                             kernel_size=hparams.postnet_kernel_size, stride=1,
                             padding=int((hparams.postnet_kernel_size - 1) / 2),
                             dilation=1, w_init_gain='tanh'),
                    nn.BatchNorm1d(hparams.postnet_embedding_dim))
            )

        # Last layer: back to mel channels, without tanh.
        self.convolutions.append(
            nn.Sequential(
                ConvNorm(hparams.postnet_embedding_dim, hparams.n_mel_channels,
                         kernel_size=hparams.postnet_kernel_size, stride=1,
                         padding=int((hparams.postnet_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='linear'),
                nn.BatchNorm1d(hparams.n_mel_channels))
        )

    def forward(self, x):
        for i in range(len(self.convolutions) - 1):
            x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
        x = F.dropout(self.convolutions[-1](x), 0.5, self.training)

        return x
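Note that `Postnet.forward` returns only the residual; the addition to the decoder output happens outside this class. The sketch below shows that addition on toy numbers, together with a loss that sums the mean squared error before and after the post-net. The summed loss follows the description in the Tacotron 2 paper and is an assumption here, since the article does not show the loss; the residual values are hand-picked rather than produced by a network.

```python
def mse(pred: list[list[float]], target: list[list[float]]) -> float:
    """Mean squared error over all frames and channels."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def add_residual(mel, residual):
    """Element-wise y + y_r."""
    return [[m + r for m, r in zip(mr, rr)] for mr, rr in zip(mel, residual)]


# Toy data: 2 frames x 3 mel channels.
target = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
mel = [[0.5, 2.0, 3.5], [2.0, 2.5, 4.0]]        # decoder output y
residual = [[0.5, 0.0, -0.5], [0.0, 0.5, 0.0]]  # hypothetical post-net output y_r

mel_postnet = add_residual(mel, residual)
loss = mse(mel, target) + mse(mel_postnet, target)
print(mel_postnet == target)  # True
print(loss)                   # 0.125
```

Keeping the pre-post-net term in the loss means the decoder itself is still pushed towards a good spectrogram, and the post-net only has to learn the correction.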
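Comparison with Tacotron:
- Tacotron 2 uses simpler building blocks: plain LSTM and convolutional layers in the encoder and decoder, where Tacotron uses "CBHG" stacks and GRU recurrent layers;
- Tacotron 2 does not use a "reduction factor" on the decoder output, so each decoder step produces a single spectrogram frame.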
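The effect of a reduction factor can be illustrated with simple bookkeeping. In the sketch below, a decoder step that emits r frames produces one flat vector of r * n_mel_channels values, which is then split into frames; the function names and toy sizes are chosen for illustration only.

```python
import math


def split_step_output(flat: list[float], n_mel_channels: int) -> list[list[float]]:
    """Split one decoder-step output into its individual spectrogram frames."""
    assert len(flat) % n_mel_channels == 0
    return [flat[i:i + n_mel_channels] for i in range(0, len(flat), n_mel_channels)]


def decoder_steps(n_frames: int, frames_per_step: int) -> int:
    """Decoder steps needed to emit n_frames frames."""
    return math.ceil(n_frames / frames_per_step)


# Toy sizes: 4 mel channels instead of a realistic channel count.
step_output = [float(i) for i in range(8)]
print(split_step_output(step_output, n_mel_channels=4))  # 2 frames (r = 2)
print(decoder_steps(800, frames_per_step=1))  # 800, one frame per step
print(decoder_steps(800, frames_per_step=2))  # 400 with a reduction factor of 2
```

This is the role of `n_frames_per_step` in the decoder code above: with a value of 1, each step's projection is exactly one frame.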
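3. Vocoder
The original Tacotron2 paper uses a modified WaveNet to invert the mel spectrogram features into time-domain waveform samples. The WaveGlow vocoder is also used nowadays. The vocoder can be treated as a self-contained component; interested readers can consult the relevant papers.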
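References:
This article only walks through part of the code. The complete code, which also includes a good deal of audio processing, is available at https://github.com/NVIDIA/tacotron2.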
https://www.cnblogs.com/mengnan/p/9527797.html
https://blog.csdn.net/yunnangf/article/details/79585089
http://blog.sina.com.cn/s/blog_8af106960102xj6j.html