Speech Recognition: Implementing a Deep-Learning-Based Chinese Speech Recognition System (Code Walkthrough)

Building a speech recognition system using THCHS-30 as the example dataset

I previously wrote an article on the overall framework of a speech recognition system that many readers liked, but I could not answer every question individually, and I generally prefer to share methods and ideas. So I wrote this hands-on follow-up, which walks through how to implement the speech recognition system step by step; I hope it serves as a useful reference.
(I probably will not keep training the acoustic model in the near term, and I may put together a language-model tutorial later so that everyone can experiment with it.)
(This post mainly covers data processing and model building/training; for testing and inference, please refer to the original code.)

Notebook: https://github.com/audier/my_ch_speech_recognition/blob/master/CTC_tutorial.ipynb

DFCNN paper: http://www.infocomm-journal.com/dxkx/CN/article/downloadArticleFile.do?attachType=PDF&id=166970

  • Feature extraction
  • Building the model
    • DFCNN
  • Data processing
  • Model training

1. Feature Extraction

The input is raw audio data, which is converted into spectrogram data; recognition then leverages a CNN's ability to process images.

1. Read the audio file

import scipy.io.wavfile as wav
import matplotlib.pyplot as plt
import os

# pick any audio file for the experiment
filepath = 'test.wav'

fs, wavsignal = wav.read(filepath)
plt.plot(wavsignal)
plt.show()

2. Construct a Hamming window

import numpy as np

x=np.linspace(0, 400 - 1, 400, dtype = np.int64)
w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))
plt.plot(w)
plt.show()

(Figure: the Hamming window)

3. Frame the signal

  • Frame length: 25 ms
  • Frame shift: 10 ms

Samples per second = fs
Samples per millisecond = fs / 1000
Samples per frame = fs / 1000 * frame length (in ms)

time_window = 25
window_length = fs // 1000 * time_window
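
As a quick sanity check, here is a small sketch (assuming the 16 kHz sampling rate used by THCHS-30) that works out the frame length and frame shift in samples; these are the 400 and 160 that appear hard-coded in compute_fbank further below:

fs = 16000                                  # THCHS-30 audio is sampled at 16 kHz
time_window = 25                            # frame length in ms
time_shift = 10                             # frame shift in ms
window_length = fs // 1000 * time_window    # 16 samples/ms * 25 ms = 400 samples
hop_length = fs // 1000 * time_shift        # 16 samples/ms * 10 ms = 160 samples
print(window_length, hop_length)            # 400 160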

4. Framing and windowing

# take one frame
p_begin = 0
p_end = p_begin + window_length
frame = wavsignal[p_begin:p_end]
plt.plot(frame)
plt.show()
# apply the window
frame = frame * w
plt.plot(frame)
plt.show()

(Figures: the frame before and after applying the Hamming window)

5. Fourier transform

A spectrogram simply maps time-domain information into the frequency domain; look up the details elsewhere if you are interested in the theory. The human ear perceives sound largely through its frequency content, which is why the frequency-domain representation is a natural input for recognition.

from scipy.fftpack import fft

# fast Fourier transform
frame_fft = np.abs(fft(frame))[:200]
plt.plot(frame_fft)
plt.show()

# take the log to get a dB-like scale
frame_log = np.log(frame_fft)
plt.plot(frame_log)
plt.show()

(Figures: the magnitude spectrum of the frame and its logarithm)

  • Framing
  • Windowing
  • Fourier transform
import numpy as np
import scipy.io.wavfile as wav
from scipy.fftpack import fft


# compute the spectrogram (time-frequency features) of a signal
def compute_fbank(file):
	x=np.linspace(0, 400 - 1, 400, dtype = np.int64)
	w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1) ) # Hamming window
	fs, wavsignal = wav.read(file)
	# slide a 25 ms window over the waveform with a 10 ms shift
	time_window = 25 # in ms
	window_length = fs / 1000 * time_window # window length in samples; fixed at 400 here (16 kHz audio)
	wav_arr = np.array(wavsignal)
	wav_length = len(wavsignal)
	range0_end = int(len(wavsignal)/fs*1000 - time_window) // 10 # number of windows, i.e. where the loop stops
	data_input = np.zeros((range0_end, 200), dtype = float) # holds the final frequency features
	data_line = np.zeros((1, 400), dtype = float)
	for i in range(0, range0_end):
		p_start = i * 160
		p_end = p_start + 400
		data_line = wav_arr[p_start:p_end]
		data_line = data_line * w # apply the window
		data_line = np.abs(fft(data_line)) / wav_length
		data_input[i]=data_line[0:200] # keep only 200 of the 400 points, since the spectrum is symmetric
	data_input = np.log(data_input + 1)
	#data_input = data_input[::]
	return data_input
  • This function extracts the spectrogram of an audio file
import matplotlib.pyplot as plt
filepath = 'test.wav'

a = compute_fbank(filepath)
plt.imshow(a.T, origin = 'lower')
plt.show()

(Figure: spectrogram of test.wav)

2. Model Building

The training input is the spectrogram, and the label is the corresponding pinyin sequence, as shown below:

The speech recognition model is built with a CNN + CTC structure.
(Figure: example of the training input and its pinyin label)

import keras
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D
from keras.layers import Reshape, Dense, Lambda
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model

Using TensorFlow backend.
  • Define a 3×3 convolution layer
def conv2d(size):
    return Conv2D(size, (3,3), use_bias=True, activation='relu',
        padding='same', kernel_initializer='he_normal')
  • Define a batch normalization layer
def norm(x):
    return BatchNormalization(axis=-1)(x)
  • Define a max-pooling layer; the time and frequency dimensions of the data are both halved
def maxpool(x):
    return MaxPooling2D(pool_size=(2,2), strides=None, padding="valid")(x)
  • Dense layer
def dense(units, activation="relu"):
    return Dense(units, activation=activation, use_bias=True,
        kernel_initializer='he_normal')
  • A block composed of conv + conv + maxpool
# input x: (batch, time, freq, channels)
# with pool=True the time and freq dimensions of the output are halved
def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x
  • Add the CTC loss, taken from the Keras backend

Note: the inputs to K.ctc_batch_cost are:

  • labels, the targets: [batch_size, max_label_len]
  • y_pred, the output of the CNN: [batch_size, time_steps, vocab_size]
  • input_length, the length of each network output: [batch_size, 1]
  • label_length, the length of each label: [batch_size, 1]
def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    y_pred = y_pred[:, :, :]  # the full network output is used as-is; a slice such as [:, 2:, :] could drop the first frames
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Build the CNN + DNN + CTC acoustic model

class Amodel():
    """docstring for Amodel."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        self._model_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        self.h5 = cnn_cell(128, self.h4, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h5)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)

    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')\
            ([self.labels, self.outputs, self.input_length, self.label_length])
        self.ctc_model = Model(inputs=[self.labels, self.inputs,
            self.input_length, self.label_length], outputs=self.loss_out)

    def opt_init(self):
        opt = Adam(lr = 0.01, beta_1 = 0.9, beta_2 = 0.999, decay = 0.0, epsilon = 10e-8)
        self.ctc_model.compile(loss={'ctc': lambda y_true, output: output}, optimizer=opt)

am = Amodel(500)
am.ctc_model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
the_inputs (InputLayer)         (None, None, 200, 1) 0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, None, 200, 32 320         the_inputs[0][0]                 
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, None, 200, 32 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, None, 200, 32 9248        batch_normalization_1[0][0]      
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, None, 200, 32 128         conv2d_2[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, None, 100, 32 0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, None, 100, 64 18496       max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, None, 100, 64 256         conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, None, 100, 64 36928       batch_normalization_3[0][0]      
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, None, 100, 64 256         conv2d_4[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)  (None, None, 50, 64) 0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, None, 50, 128 73856       max_pooling2d_2[0][0]            
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, None, 50, 128 512         conv2d_5[0][0]                   
__________________________________________________________________________________________________
conv2d_6 (Conv2D)               (None, None, 50, 128 147584      batch_normalization_5[0][0]      
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, None, 50, 128 512         conv2d_6[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)  (None, None, 25, 128 0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
conv2d_7 (Conv2D)               (None, None, 25, 128 147584      max_pooling2d_3[0][0]            
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, None, 25, 128 512         conv2d_7[0][0]                   
__________________________________________________________________________________________________
conv2d_8 (Conv2D)               (None, None, 25, 128 147584      batch_normalization_7[0][0]      
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, None, 25, 128 512         conv2d_8[0][0]                   
__________________________________________________________________________________________________
conv2d_9 (Conv2D)               (None, None, 25, 128 147584      batch_normalization_8[0][0]      
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, None, 25, 128 512         conv2d_9[0][0]                   
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, None, 25, 128 147584      batch_normalization_9[0][0]      
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, None, 25, 128 512         conv2d_10[0][0]                  
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, None, 3200)   0           batch_normalization_10[0][0]     
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 256)    819456      reshape_1[0][0]                  
__________________________________________________________________________________________________
the_labels (InputLayer)         (None, None)         0                                            
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, None, 500)    128500      dense_1[0][0]                    
__________________________________________________________________________________________________
input_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
label_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
ctc (Lambda)                    (None, 1)            0           the_labels[0][0]                 
                                                                 dense_2[0][0]                    
                                                                 input_length[0][0]               
                                                                 label_length[0][0]               
==================================================================================================
Total params: 1,828,564
Trainable params: 1,826,644
Non-trainable params: 1,920
__________________________________________________________________________________________________

3. Training Preparation

Download the data

thchs30: http://www.openslr.org/18/

3.1 Generate the lists of audio files and label files

Consider the inputs and outputs the network receives during training. First, all samples within a batch must share the same shape.

The format is: [batch_size, time_step, feature_dim]

However, every sample we read has a different length along the time axis, so the time axis has to be handled: each sample is padded to the longest one in its batch. Once every sample in a batch has the same shape, the batch can be trained in parallel. A toy padding sketch follows; the full helpers appear in section 3.4.
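
As a toy illustration only (a rough sketch; the real wav_padding and label_padding helpers are defined later in section 3.4), padding two variable-length feature arrays to the longest one in a batch could look like this:

import numpy as np

# two fake samples with different numbers of frames, feature_dim = 200
a = np.random.rand(300, 200)
b = np.random.rand(280, 200)
batch = [a, b]

max_len = max(x.shape[0] for x in batch)        # longest time axis in the batch
padded = np.zeros((len(batch), max_len, 200))   # [batch_size, time_step, feature_dim]
for i, x in enumerate(batch):
    padded[i, :x.shape[0], :] = x               # zero-pad shorter samples at the end
print(padded.shape)                             # (2, 300, 200)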

source_file = 'E:\\Data\\thchs30\\data_thchs30'

Define the function source_get to collect the lists of audio files and label files.

For example:

E:\Data\thchs30\data_thchs30\data\A11_0.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_1.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_10.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_100.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_102.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_103.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_104.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_105.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_106.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_107.wav.trn
def source_get(source_file):
    train_file = source_file + '\\train'
    label_lst = []
    wav_lst = []
    for root, dirs, files in os.walk(train_file):
        for file in files:
            if file.endswith('.wav') or file.endswith('.WAV'):
                wav_file = os.sep.join([root, file])
                wav_lst.append(wav_file)
            elif file.endswith('.trn'):
                # the .trn files in train/ only point to the real transcripts in data/
                label_file = os.sep.join([source_file, 'data', file])
                label_lst.append(label_file)
    return label_lst, wav_lst

label_lst, wav_lst = source_get(source_file)

Verify that the audio file and the label file at the same index correspond to the same ID

for i in range(10000):
    wavname = (wav_lst[i].split('\\')[-1]).split('.')[0]
    labelname = (label_lst[i].split('\\')[-1]).split('.')[0]
    if wavname != labelname:
        print('error')

3.2 Label processing

Define the function read_label to read the pinyin label corresponding to an audio file

def read_label(label_file):
    with open(label_file, 'r', encoding='utf8') as f:
        data = f.readlines()
        return data[1]  # the second line of a .trn file is the pinyin transcription

print(read_label(label_lst[0]))

def gen_label_data(label_lst):
    label_data = []
    for label_file in label_lst:
        pny = read_label(label_file)
        label_data.append(pny.strip('\n'))
    return label_data

label_data = gen_label_data(label_lst)
print(len(label_data))
lv4 shi4 yang2 chun1 yan1 jing3 da4 kuai4 wen2 zhang1 de5 di3 se4 si4 yue4 de5 lin2 luan2 geng4 shi4 lv4 de5 xian1 huo2 xiu4 mei4 shi1 yi4 ang4 ran2

10000

Build a pinyin-to-id mapping (i.e. a vocabulary) for the labels

def mk_vocab(label_data):
    vocab = ['<PAD>']
    for line in label_data:
        line = line.split(' ')
        for pny in line:
            if pny not in vocab:
                vocab.append(pny)
    return vocab

vocab = mk_vocab(label_data)
print(len(vocab))
1176

With the vocabulary, each label we read can be mapped to its id sequence

def word2id(line, vocab):
    return [vocab.index(pny) for pny in line.split(' ')]

label_id = word2id(label_data[0], vocab)
print(label_data[0])
print(label_id)
lv4 shi4 yang2 chun1 yan1 jing3 da4 kuai4 wen2 zhang1 de5 di3 se4 si4 yue4 de5 lin2 luan2 geng4 shi4 lv4 de5 xian1 huo2 xiu4 mei4 shi1 yi4 ang4 ran2
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 11, 16, 17, 18, 2, 1, 11, 19, 20, 21, 22, 23, 24, 25, 26]

Summary:

We extracted the pinyin label for each audio file into label_data, so a label can be looked up by its index.

We also built the corresponding pinyin vocabulary. With this vocabulary, a pinyin label can be mapped to an id sequence.

Outputs:

  • vocab
  • label_data
print(vocab[:15])
print(label_data[10])
print(word2id(label_data[10], vocab))
['<PAD>', 'lv4', 'shi4', 'yang2', 'chun1', 'yan1', 'jing3', 'da4', 'kuai4', 'wen2', 'zhang1', 'de5', 'di3', 'se4', 'si4']
ru2 yan1 ban4 huang2 gua1 jiang1 huang2 gua1 xi3 jing4 qie1 pian4 hou4 yong4 yan2 yan1 ban4 xiao3 shi2 zuo3 you4 shi2 yong4 shi2 jia1 tang2 cu4 ma2 you2 ji2 ke3
[45, 5, 232, 233, 234, 235, 233, 234, 133, 85, 236, 237, 83, 190, 89, 5, 232, 146, 238, 88, 92, 238, 190, 238, 129, 205, 239, 240, 228, 112, 79]

3.3 Audio processing

For the audio side, we only need the corresponding file name and can then extract the required spectrogram.

The spectrogram conversion function compute_fbank was already defined above.

fbank = compute_fbank(wav_lst[0])
print(fbank.shape)
(777, 200)
plt.imshow(fbank.T, origin = 'lower')
plt.show()

(Figure: spectrogram of the first training utterance)

Because of the acoustic model's structure (three max-pooling layers), the time dimension of the audio features must be divisible by 8.

fbank = fbank[:fbank.shape[0]//8*8, :]
print(fbank.shape)
(776, 200)

Summary:

  • Convert the audio into time-frequency (spectrogram) features
  • Truncate the time dimension of the converted data so that it is divisible by 8

3.4 Data generator

Set batch_size and batch_num

total_nums = 10000
batch_size = 4
batch_num = total_nums // batch_size

shuffle

Shuffle the data: we draw shuffled index values to determine the order of the training samples.

from random import shuffle
shuffle_list = [i for i in range(10000)]
shuffle(shuffle_list)

generator

Collect batch_size spectrograms and their label sequences into two lists

def get_batch(batch_size, shuffle_list, wav_lst, label_data, vocab):
    for i in range(10000//batch_size):
        wav_data_lst = []
        label_data_lst = []
        begin = i * batch_size
        end = begin + batch_size
        sub_list = shuffle_list[begin:end]
        for index in sub_list:
            fbank = compute_fbank(wav_lst[index])
            fbank = fbank[:fbank.shape[0] // 8 * 8, :]
            label = word2id(label_data[index], vocab)
            wav_data_lst.append(fbank)
            label_data_lst.append(label)
        yield wav_data_lst, label_data_lst

batch = get_batch(4, shuffle_list, wav_lst, label_data, vocab)

wav_data_lst, label_data_lst = next(batch)
for wav_data in wav_data_lst:
    print(wav_data.shape)
for label in label_data_lst:
    print(label)
(808, 200)
(928, 200)
(768, 200)
(880, 200)
[146, 224, 99, 367, 961, 89, 487, 95, 24, 305, 183, 1120, 185, 104, 129, 208, 175, 104, 133, 70, 296, 640, 904, 59, 680, 1121]
[116, 36, 15, 451, 283, 95, 267, 680, 914, 889, 365, 282, 138, 138, 76, 76, 588, 171, 365, 283, 293, 557, 247, 735, 125, 14, 439, 866, 593, 197, 51, 273, 95, 763, 120]
[34, 387, 451, 24, 223, 24, 117, 262, 49, 444, 135, 24, 534, 30, 278, 153, 297, 34, 355, 126, 362, 82, 425, 482, 734, 100, 116, 192, 24, 583, 579]
[299, 146, 126, 296, 337, 11, 166, 338, 95, 178, 339, 340, 337, 341, 342, 265, 310, 9, 269, 337, 343, 126, 337, 79, 175, 261, 344, 345, 346]
lens = [len(wav) for wav in wav_data_lst]
print(max(lens))
print(lens)
928
[808, 928, 768, 880]

padding

However, the samples within a batch must form a single tensor, which requires every sample to have the same shape.
In addition, CTC also needs the length of each input sequence.
Since the input sequence is shortened by a factor of 8 after the convolutional network, the length actually fed to CTC during training is wav_len // 8.

  • padding wav data
  • wav len // 8 (dictated by the network structure)
def wav_padding(wav_data_lst):
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    wav_lens = np.array([leng//8 for leng in wav_lens])
    new_wav_data_lst = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i in range(len(wav_data_lst)):
        new_wav_data_lst[i, :wav_data_lst[i].shape[0], :, 0] = wav_data_lst[i]
    return new_wav_data_lst, wav_lens

pad_wav_data_lst, wav_lens = wav_padding(wav_data_lst)
print(pad_wav_data_lst.shape)
print(wav_lens)
(4, 928, 200, 1)
[101 116  96 110]

The labels likewise need padding and length extraction; the differences are the data dimensionality, and that the label length is passed to CTC directly, with no extra processing.

  • label padding
  • label len
def label_padding(label_data_lst):
    label_lens = np.array([len(label) for label in label_data_lst])
    max_label_len = max(label_lens)
    new_label_data_lst = np.zeros((len(label_data_lst), max_label_len))
    for i in range(len(label_data_lst)):
        new_label_data_lst[i][:len(label_data_lst[i])] = label_data_lst[i]
    return new_label_data_lst, label_lens

pad_label_data_lst, label_lens = label_padding(label_data_lst)
print(pad_label_data_lst.shape)
print(label_lens)
(4, 35)
[26 35 31 29]

4. Training

The data needed for training is now ready, so we can start training. We train with the following parameters:

  • batch_size = 4
  • batch_num = 10000 // 4
  • epochs = 1
total_nums = 10000
batch_size = 4
batch_num = total_nums // batch_size
epochs = 1
  • Prepare the training data; shuffle randomizes the order of the training samples
source_file = 'E:\\Data\\thchs30\\data_thchs30'
label_lst, wav_lst = source_get(source_file)
label_data = gen_label_data(label_lst)
vocab = mk_vocab(label_data)
vocab_size = len(vocab)

shuffle_list = [i for i in range(10000)]
shuffle(shuffle_list)
  • Instantiate the model and start training, checking how the loss is decreasing every 50 batches
am = Amodel(vocab_size)

for k in range(epochs):
    print('this is the', k+1, 'th epoch of training !!!')
    batch = get_batch(batch_size, shuffle_list, wav_lst, label_data, vocab)
    for i in range(batch_num):
        wav_data_lst, label_data_lst = next(batch)
        pad_wav_data, input_length = wav_padding(wav_data_lst)
        pad_label_data, label_length = label_padding(label_data_lst)
        inputs = {'the_inputs': pad_wav_data,
                  'the_labels': pad_label_data,
                  'input_length': input_length,
                  'label_length': label_length,
                 }
        outputs = {'ctc': np.zeros(pad_wav_data.shape[0])} 
        am.ctc_model.fit(inputs, outputs, verbose=0)
        if (i+1) % 50 == 0:
            print('the ', i, 'th steps, cost:', am.ctc_model.evaluate(inputs, outputs, verbose=0) )
this is the 1 th epoch of training !!!
the  0 th steps, cost: 1292.562255859375
.
.
.
.
.
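
Testing and inference are not covered in detail in this post (see the original repository), but as a rough sketch, decoding one batch of predictions back to pinyin with Keras' built-in greedy CTC decoder could look like the following. This is an illustrative assumption, not the original inference code; pad_wav_data and vocab come from the cells above:

# Hedged inference sketch: greedy CTC decoding of one padded batch
result = am.model.predict(pad_wav_data)                   # shape: (batch, time//8, vocab_size)
in_len = np.array([result.shape[1]] * result.shape[0])    # per-sample output length
decoded, _ = K.ctc_decode(result, in_len, greedy=True)
pred_ids = K.get_value(decoded[0])                        # (batch, max_decoded_len), padded with -1
for ids in pred_ids:
    print(' '.join(vocab[i] for i in ids if i != -1))     # drop the -1 padding, map ids back to pinyin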