Constructing seq2seq Input Data for Speech Recognition

Many hands-on seq2seq examples deal with translation problems, such as English to French. The features fed to the model are built by first creating a character- or word-level dictionary and then turning each sentence into vectors. The final input is a three-dimensional array of 0s and 1s, shaped (samples, time steps, dictionary size), i.e. one-hot encodings.
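Before reading the official code, here is a minimal sketch (not the official implementation; the sentence, dictionary, and max_len below are made up for illustration) of how one sentence becomes such a matrix of 0s and 1s:

import numpy as np

sentence = 'Go.'                       # hypothetical input sentence
chars = sorted(set(sentence) | {' '})  # tiny character dictionary: [' ', '.', 'G', 'o']
char_index = {c: i for i, c in enumerate(chars)}

max_len = 5                            # assume the longest sentence has 5 characters
one_hot = np.zeros((max_len, len(chars)), dtype='float32')
for t, c in enumerate(sentence):
    one_hot[t, char_index[c]] = 1.     # a single 1 per time step, at the character's index
one_hot[len(sentence):, char_index[' ']] = 1.  # pad remaining steps with the space character

print(one_hot.shape)  # (5, 4): (max sentence length, dictionary size)
# Stacking one such matrix per sentence gives the 3D (samples, time steps, tokens) array.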

Constructing the input for machine translation: a walkthrough of the official Keras example code

The downloaded data file fra.txt looks roughly like this:

Go.		Va !
Run!	Cours !
Run!	Courez !
Wow!	Ça alors !
Fire!	Au feu !
Help!	À l'aide !
Jump.	Saute.
Stop!	Ça suffit !
Stop!	Stop !
Stop!	Arrête-toi !

The official preprocessing code is as follows:

import numpy as np

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    # Split each line into the input sentence and its translation
    input_text, target_text, _ = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Build the input character dictionary
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    # Build the output character dictionary
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# Sentences are not all the same length; the longest one determines the time dimension of the feature arrays
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

# Bind each character in the dictionary to an index
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

# Each sentence is vectorized into an n x m matrix; with l sentences, pre-allocate a 3D array (l, n, m)
# The first dimension is the number of samples, the second the sentence length, the third the dictionary size
# The third dimension is the dictionary size rather than 1 because the characters are one-hot encoded
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# One-hot encode every sample
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        # Shift the target output one time step earlier
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
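
With the arrays built, a quick sanity check can help (the printed shapes and the decoded sentence depend on the downloaded fra.txt, so the values are only illustrative):

print(encoder_input_data.shape)  # (num_samples, max_encoder_seq_length, num_encoder_tokens)
print(decoder_input_data.shape)  # (num_samples, max_decoder_seq_length, num_decoder_tokens)

# Recover the first input sentence from its one-hot encoding
reverse_input_index = {i: char for char, i in input_token_index.items()}
decoded = ''.join(reverse_input_index[np.argmax(vec)] for vec in encoder_input_data[0])
print(decoded)  # the first English sentence, padded with spaces to max_encoder_seq_length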

Constructing the input for speech recognition: a walkthrough of seq2seq-for-speech-recognition code

For an audio sequence, in theory we could do the same thing: collect all the distinct values appearing across the sequences, sort them, and build a vector for each sequence. Why only in theory? Suppose the sequences are normalized to the range 0-1. Every sample in an audio signal can be treated as distinct, since the signal changes continuously, so it can take essentially infinitely many values in that range. Unlike a word dictionary, an audio-value dictionary could easily grow to millions or tens of millions of entries. So a word-embedding-style approach is possible in principle but not in practice, because it would require enormous memory and processing time. (We do not consider sparse coding here.)
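To make the scale concrete, here is a rough sketch (the sample rate and the random signal are made up; real speech behaves similarly in this respect) of what a "dictionary of sample values" would look like for a single second of audio:

import numpy as np

fs = 22050                             # hypothetical sample rate (librosa's default)
audio = np.random.uniform(0., 1., fs)  # one second of a made-up, normalized signal
vocab = np.unique(audio)               # the "dictionary" of distinct sample values
print(len(vocab))                      # ~22050: essentially every sample is unique

# One-hot encoding this second of audio would need a (fs, len(vocab)) array,
# i.e. hundreds of millions of cells, which is why MFCC features are used instead.
print(fs * len(vocab))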

The main feature-construction code is as follows:

import numpy as np
import librosa
from python_speech_features import mfcc


def audioToInputVector(audio_filename, numcep, numcontext):
    """
    Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
    at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
    context frames to the left and right of each time step, and returns this data
    in a numpy array.
    Borrowed from Mozilla's Deep Speech and slightly modified.
    https://github.com/mozilla/DeepSpeech
    """
    # Load the audio file with librosa
    audio, fs = librosa.load(audio_filename)

    # Get MFCC coefficients with python_speech_features' mfcc function
    features = mfcc(audio, samplerate=fs, numcep=numcep, nfft=551)
    # features = librosa.feature.mfcc(y=audio,
    #                                 sr=fs,
    #                                 n_fft=551,
    #                                 n_mfcc=numcep).T

    # We only keep every second feature (BiRNN stride = 2)
    features = features[::2]

    # One stride per time step in the input
    num_strides = len(features)
	

    # Add empty initial and final contexts.
    # numcontext is the number of frames of context on each side of the current frame
    empty_context = np.zeros((numcontext, numcep), dtype=features.dtype)
    # Pad on both sides so that the first and last frames also have full context
    features = np.concatenate((empty_context, features, empty_context))
    
    # numcontext (past) + 1 (present) + numcontext (future)
    # With context, the first dimension of each feature window is window_size
    window_size = 2 * numcontext + 1

    # Use np.lib.stride_tricks.as_strided() to carve features into overlapping
    # windows of shape (num_strides, window_size, numcep),
    # with strides (features.strides[0], features.strides[0], features.strides[1])
    train_inputs = np.lib.stride_tricks.as_strided(
        features,
        (num_strides, window_size, numcep),
        (features.strides[0], features.strides[0], features.strides[1]),
        writeable=False)

    # Flatten the second and third dimensions
    train_inputs = np.reshape(train_inputs, [num_strides, -1])
    
    # Copy the strided array so that we can write to it safely
    train_inputs = np.copy(train_inputs)
    # Normalize to zero mean and unit variance
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)

    # Return results
    return train_inputs
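
A hedged usage sketch (the WAV path is a placeholder; numcep=26 and numcontext=9 are the values traditionally used in Mozilla's Deep Speech, assumed here rather than taken from this post):

numcep, numcontext = 26, 9
train_inputs = audioToInputVector('sample.wav', numcep, numcontext)  # hypothetical file

# Each row is one (strided) time step: the numcep MFCCs of the current frame plus
# numcontext frames of left and right context, flattened into a single vector.
print(train_inputs.shape)  # (num_strides, (2 * numcontext + 1) * numcep)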