Speech Recognition: Implementing a Chinese Speech Recognition System with Deep Learning (Code Walkthrough)

Building a speech recognition system using thchs30 as an example

I previously wrote an article on the overall framework of a speech recognition system that many readers liked, but I couldn't reply to every question individually, and I generally prefer to share methods and ideas anyway.
So I wrote this hands-on follow-up, which explains in detail how to implement the speech recognition system step by step; I hope it serves as a useful reference.
(On the model side, I probably won't continue training the acoustic model in the near term; as for the language model, I may put together a tutorial later so everyone can experiment with it.)
(For now this mainly covers data processing and model training; for testing and inference, please refer to the original code.)

Notebook: https://github.com/audier/my_ch_speech_recognition/blob/master/CTC_tutorial.ipynb

DFCNN paper: http://www.infocomm-journal.com/dxkx/CN/article/downloadArticleFile.do?attachType=PDF&id=166970

  • Feature extraction
  • Model building
    • DFCNN
  • Data processing
  • Model training

1. Feature Extraction

The input is raw audio data, which first needs to be converted into a time-frequency spectrogram so that a CNN's ability to process images can be used for recognition.

1. Read the audio file

import scipy.io.wavfile as wav
import matplotlib.pyplot as plt
import os

# use any audio file for this experiment
filepath = 'test.wav'

fs, wavsignal = wav.read(filepath)
plt.plot(wavsignal)
plt.show()

2. Construct a Hamming window

import numpy as np

x=np.linspace(0, 400 - 1, 400, dtype = np.int64)
w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))
plt.plot(w)
plt.show()

[Figure: the Hamming window]

3. Frame the data

  • Frame length: 25 ms
  • Frame shift: 10 ms
samples per second = fs
samples per millisecond = fs / 1000
samples per frame = fs / 1000 * frame length (ms)
time_window = 25
window_length = fs // 1000 * time_window
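To make the numbers concrete (assuming thchs30's 16 kHz sample rate, which is what the fixed values 400 and 160 used in compute_fbank later correspond to):

fs = 16000                        # assumed sample rate of the thchs30 recordings
time_window = 25                  # frame length in ms
frame_shift = 10                  # frame shift in ms
print(fs // 1000 * time_window)   # 400 samples per frame
print(fs // 1000 * frame_shift)   # 160 samples per frame shift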

4. Framing and windowing

# take one frame
p_begin = 0
p_end = p_begin + window_length
frame = wavsignal[p_begin:p_end]
plt.plot(frame)
plt.show()
# apply the Hamming window
frame = frame * w
plt.plot(frame)
plt.show()

[Figure: waveform of a single frame before windowing]

[Figure: the same frame after applying the Hamming window]

5. Fourier transform

A time-frequency spectrogram converts the time-domain signal into the frequency domain (the details are easy to look up). The human ear perceives sound largely through its frequency content, which is why this frequency-domain representation is well suited to recognition.

from scipy.fftpack import fft

# fast Fourier transform
frame_fft = np.abs(fft(frame))[:200]
plt.plot(frame_fft)
plt.show()

# take the log to get a dB-like magnitude
frame_log = np.log(frame_fft)
plt.plot(frame_log)
plt.show()

[Figure: FFT magnitude spectrum of the frame]

[Figure: log-magnitude spectrum of the frame]

The steps above are combined into a single function:

  • Framing
  • Windowing
  • Fourier transform
import numpy as np
import scipy.io.wavfile as wav
from scipy.fftpack import fft


# compute the time-frequency spectrogram of a signal
def compute_fbank(file):
    x = np.linspace(0, 400 - 1, 400, dtype=np.int64)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))  # Hamming window
    fs, wavsignal = wav.read(file)
    # slide a 25 ms window over the waveform with a 10 ms shift
    time_window = 25  # in ms
    window_length = fs // 1000 * time_window  # window length in samples; 400 for 16 kHz audio
    wav_arr = np.array(wavsignal)
    wav_length = len(wavsignal)
    range0_end = int(len(wavsignal) / fs * 1000 - time_window) // 10  # number of frames to generate
    data_input = np.zeros((range0_end, 200), dtype=float)  # holds the final frequency features
    data_line = np.zeros((1, 400), dtype=float)
    for i in range(0, range0_end):
        p_start = i * 160
        p_end = p_start + 400
        data_line = wav_arr[p_start:p_end]
        data_line = data_line * w  # apply the window
        data_line = np.abs(fft(data_line)) / wav_length
        data_input[i] = data_line[0:200]  # keep the first 200 of the 400 points; the spectrum is symmetric
    data_input = np.log(data_input + 1)
    return data_input
  • This function extracts the time-frequency spectrogram of an audio file
import matplotlib.pyplot as plt
filepath = 'test.wav'

a = compute_fbank(filepath)
plt.imshow(a.T, origin = 'lower')
plt.show()

[Figure: time-frequency spectrogram of test.wav]

2. Model Building

The training input is the time-frequency spectrogram and the label is the corresponding pinyin sequence, as shown below.

The speech recognition model uses a CNN + CTC structure.
[Figure: model input and CNN + CTC structure]

import keras
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D
from keras.layers import Reshape, Dense, Lambda
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model

Using TensorFlow backend.
  • Define a 3×3 convolution layer
def conv2d(size):
    return Conv2D(size, (3,3), use_bias=True, activation='relu',
        padding='same', kernel_initializer='he_normal')
  • Define a batch normalization layer
def norm(x):
    return BatchNormalization(axis=-1)(x)
  • Define a max-pooling layer; both spatial dimensions are halved
def maxpool(x):
    return MaxPooling2D(pool_size=(2,2), strides=None, padding="valid")(x)
  • Dense layer
def dense(units, activation="relu"):
    return Dense(units, activation=activation, use_bias=True,
        kernel_initializer='he_normal')
  • A block composed of cnn + cnn + maxpool
# input x: (batch, time, freq, channels)
# with pool=True, both the time and frequency dimensions are halved
def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x
  • Add the CTC loss function, provided by the Keras backend

Note: the inputs to K.ctc_batch_cost are:

  • labels, the target sequences: [batch_size, max_label_length]
  • y_pred, the output of the CNN network: [batch_size, time_steps, vocab_size]
  • input_length, the length of each network output: [batch_size, 1]
  • label_length, the length of each label: [batch_size, 1]
def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Build the CNN + DNN + CTC acoustic model

class Amodel():
    """docstring for Amodel."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        self._model_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        self.h5 = cnn_cell(128, self.h4, pool=False)
        # frequency axis 200 // 8 = 25, times 128 channels = 3200
        self.h6 = Reshape((-1, 3200))(self.h5)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)

    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')\
            ([self.labels, self.outputs, self.input_length, self.label_length])
        self.ctc_model = Model(inputs=[self.labels, self.inputs,
            self.input_length, self.label_length], outputs=self.loss_out)

    def opt_init(self):
        opt = Adam(lr = 0.01, beta_1 = 0.9, beta_2 = 0.999, decay = 0.0, epsilon = 10e-8)
        self.ctc_model.compile(loss={'ctc': lambda y_true, output: output}, optimizer=opt)

am = Amodel(500)
am.ctc_model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
the_inputs (InputLayer)         (None, None, 200, 1) 0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, None, 200, 32 320         the_inputs[0][0]                 
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, None, 200, 32 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, None, 200, 32 9248        batch_normalization_1[0][0]      
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, None, 200, 32 128         conv2d_2[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, None, 100, 32 0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, None, 100, 64 18496       max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, None, 100, 64 256         conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, None, 100, 64 36928       batch_normalization_3[0][0]      
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, None, 100, 64 256         conv2d_4[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)  (None, None, 50, 64) 0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, None, 50, 128 73856       max_pooling2d_2[0][0]            
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, None, 50, 128 512         conv2d_5[0][0]                   
__________________________________________________________________________________________________
conv2d_6 (Conv2D)               (None, None, 50, 128 147584      batch_normalization_5[0][0]      
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, None, 50, 128 512         conv2d_6[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)  (None, None, 25, 128 0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
conv2d_7 (Conv2D)               (None, None, 25, 128 147584      max_pooling2d_3[0][0]            
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, None, 25, 128 512         conv2d_7[0][0]                   
__________________________________________________________________________________________________
conv2d_8 (Conv2D)               (None, None, 25, 128 147584      batch_normalization_7[0][0]      
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, None, 25, 128 512         conv2d_8[0][0]                   
__________________________________________________________________________________________________
conv2d_9 (Conv2D)               (None, None, 25, 128 147584      batch_normalization_8[0][0]      
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, None, 25, 128 512         conv2d_9[0][0]                   
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, None, 25, 128 147584      batch_normalization_9[0][0]      
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, None, 25, 128 512         conv2d_10[0][0]                  
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, None, 3200)   0           batch_normalization_10[0][0]     
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 256)    819456      reshape_1[0][0]                  
__________________________________________________________________________________________________
the_labels (InputLayer)         (None, None)         0                                            
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, None, 500)    128500      dense_1[0][0]                    
__________________________________________________________________________________________________
input_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
label_length (InputLayer)       (None, 1)            0                                            
__________________________________________________________________________________________________
ctc (Lambda)                    (None, 1)            0           the_labels[0][0]                 
                                                                 dense_2[0][0]                    
                                                                 input_length[0][0]               
                                                                 label_length[0][0]               
==================================================================================================
Total params: 1,828,564
Trainable params: 1,826,644
Non-trainable params: 1,920
__________________________________________________________________________________________________

3. Training Preparation

Download the data:

thchs30: http://www.openslr.org/18/

3.1 Generate the lists of audio files and label files

Consider the inputs and outputs the network receives during training. First, all samples within a batch must share the same shape.

The format is [batch_size, time_step, feature_dim].

However, each sample has a different length along the time axis, so we pad every sample in a batch to the length of the longest one. Once all samples in a batch share the same shape, they can be trained in parallel.
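As a minimal sketch of this padding step (the actual helper, wav_padding, is defined in section 3.4; the three lengths below are just illustrative):

import numpy as np

# three hypothetical spectrograms with different lengths along the time axis
batch = [np.ones((808, 200)), np.ones((928, 200)), np.ones((768, 200))]
max_len = max(sample.shape[0] for sample in batch)
padded = np.zeros((len(batch), max_len, 200))
for i, sample in enumerate(batch):
    padded[i, :sample.shape[0], :] = sample  # zero-pad each sample up to the batch maximum
print(padded.shape)  # (3, 928, 200)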

source_file = 'E:\\Data\\thchs30\\data_thchs30'

Define a function source_get to collect the lists of audio files and label (transcription) files.

The label file list looks like:

E:\Data\thchs30\data_thchs30\data\A11_0.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_1.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_10.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_100.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_102.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_103.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_104.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_105.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_106.wav.trn
E:\Data\thchs30\data_thchs30\data\A11_107.wav.trn
def source_get(source_file):
    train_file = source_file + '\\train'
    label_lst = []
    wav_lst = []
    for root, dirs, files in os.walk(train_file):
        for file in files:
            if file.endswith('.wav') or file.endswith('.WAV'):
                wav_file = os.sep.join([root, file])
                wav_lst.append(wav_file)
            elif file.endswith('.trn'):
                label_file = os.sep.join([source_file, 'data', file])
                label_lst.append(label_file)
    return label_lst, wav_lst

label_lst, wav_lst = source_get(source_file)

Confirm that the audio file and label file at the same index refer to the same utterance:

for i in range(10000):
    wavname = (wav_lst[i].split('\\')[-1]).split('.')[0]
    labelname = (label_lst[i].split('\\')[-1]).split('.')[0]
    if wavname != labelname:
        print('error')

3.2 Label processing

Define a function read_label that reads the pinyin label corresponding to an audio file:

def read_label(label_file):
    with open(label_file, 'r', encoding='utf8') as f:
        data = f.readlines()
        return data[1]

print(read_label(label_lst[0]))

def gen_label_data(label_lst):
    label_data = []
    for label_file in label_lst:
        pny = read_label(label_file)
        label_data.append(pny.strip('\n'))
    return label_data

label_data = gen_label_data(label_lst)
print(len(label_data))
lv4 shi4 yang2 chun1 yan1 jing3 da4 kuai4 wen2 zhang1 de5 di3 se4 si4 yue4 de5 lin2 luan2 geng4 shi4 lv4 de5 xian1 huo2 xiu4 mei4 shi1 yi4 ang4 ran2

10000

Build a mapping from pinyin to id for the labels, i.e. the vocabulary:

def mk_vocab(label_data):
    vocab = ['<PAD>']
    for line in label_data:
        line = line.split(' ')
        for pny in line:
            if pny not in vocab:
                vocab.append(pny)
    return vocab

vocab = mk_vocab(label_data)
print(len(vocab))
1176

With the vocabulary, the labels we read can be mapped to their corresponding id sequences:

def word2id(line, vocab):
    return [vocab.index(pny) for pny in line.split(' ')]

label_id = word2id(label_data[0], vocab)
print(label_data[0])
print(label_id)
lv4 shi4 yang2 chun1 yan1 jing3 da4 kuai4 wen2 zhang1 de5 di3 se4 si4 yue4 de5 lin2 luan2 geng4 shi4 lv4 de5 xian1 huo2 xiu4 mei4 shi1 yi4 ang4 ran2
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 11, 16, 17, 18, 2, 1, 11, 19, 20, 21, 22, 23, 24, 25, 26]

Summary:

We have extracted the pinyin label for every audio file into label_data; a label can be retrieved by its index.

We have also generated the corresponding pinyin vocabulary, with which pinyin labels can be mapped to id sequences.

Outputs:

  • vocab
  • label_data
print(vocab[:15])
print(label_data[10])
print(word2id(label_data[10], vocab))
['<PAD>', 'lv4', 'shi4', 'yang2', 'chun1', 'yan1', 'jing3', 'da4', 'kuai4', 'wen2', 'zhang1', 'de5', 'di3', 'se4', 'si4']
ru2 yan1 ban4 huang2 gua1 jiang1 huang2 gua1 xi3 jing4 qie1 pian4 hou4 yong4 yan2 yan1 ban4 xiao3 shi2 zuo3 you4 shi2 yong4 shi2 jia1 tang2 cu4 ma2 you2 ji2 ke3
[45, 5, 232, 233, 234, 235, 233, 234, 133, 85, 236, 237, 83, 190, 89, 5, 232, 146, 238, 88, 92, 238, 190, 238, 129, 205, 239, 240, 228, 112, 79]

3.3 Audio processing

For the audio data, we only need the file name of each sample and then extract its time-frequency spectrogram.

The spectrogram function compute_fbank was already defined above.

fbank = compute_fbank(wav_lst[0])
print(fbank.shape)
(777, 200)
plt.imshow(fbank.T, origin = 'lower')
plt.show()

[Figure: time-frequency spectrogram of the first training utterance]

Because of the acoustic model's structure (three max-pooling layers), every dimension of the spectrogram must be divisible by 8; the frequency axis is already 200, so only the time axis needs truncating.

fbank = fbank[:fbank.shape[0]//8*8, :]
print(fbank.shape)
(776, 200)

Summary:

  • Convert the audio data to a time-frequency spectrogram
  • Make sure every dimension of the converted data is divisible by 8

3.4 Data generator

Set batch_size and batch_num:

total_nums = 10000
batch_size = 4
batch_num = total_nums // batch_size

shuffle

Shuffle the data order: we look up shuffled index values to determine the order in which training samples are drawn.

from random import shuffle
shuffle_list = [i for i in range(10000)]
shuffle(shuffle_list)

generator

Collect batch_size spectrograms and their label sequences into two lists:

def get_batch(batch_size, shuffle_list, wav_lst, label_data, vocab):
    for i in range(10000//batch_size):
        wav_data_lst = []
        label_data_lst = []
        begin = i * batch_size
        end = begin + batch_size
        sub_list = shuffle_list[begin:end]
        for index in sub_list:
            fbank = compute_fbank(wav_lst[index])
            fbank = fbank[:fbank.shape[0] // 8 * 8, :]
            label = word2id(label_data[index], vocab)
            wav_data_lst.append(fbank)
            label_data_lst.append(label)
        yield wav_data_lst, label_data_lst

batch = get_batch(4, shuffle_list, wav_lst, label_data, vocab)

wav_data_lst, label_data_lst = next(batch)
for wav_data in wav_data_lst:
    print(wav_data.shape)
for label_data in label_data_lst:
    print(label_data)
(808, 200)
(928, 200)
(768, 200)
(880, 200)
[146, 224, 99, 367, 961, 89, 487, 95, 24, 305, 183, 1120, 185, 104, 129, 208, 175, 104, 133, 70, 296, 640, 904, 59, 680, 1121]
[116, 36, 15, 451, 283, 95, 267, 680, 914, 889, 365, 282, 138, 138, 76, 76, 588, 171, 365, 283, 293, 557, 247, 735, 125, 14, 439, 866, 593, 197, 51, 273, 95, 763, 120]
[34, 387, 451, 24, 223, 24, 117, 262, 49, 444, 135, 24, 534, 30, 278, 153, 297, 34, 355, 126, 362, 82, 425, 482, 734, 100, 116, 192, 24, 583, 579]
[299, 146, 126, 296, 337, 11, 166, 338, 95, 178, 339, 340, 337, 341, 342, 265, 310, 9, 269, 337, 343, 126, 337, 79, 175, 261, 344, 345, 346]
lens = [len(wav) for wav in wav_data_lst]
print(max(lens))
print(lens)
928
[808, 928, 768, 880]

padding

However, the samples within a batch must be packed into a single tensor, which requires every sample to have the same shape.
Besides that, CTC also needs to know the length of each input sequence.
Since the input sequence is shortened by a factor of 8 after the convolutional network, the length we actually feed in during training is wav_len // 8.

  • pad the wav data
  • wav len // 8 (dictated by the network structure)
def wav_padding(wav_data_lst):
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    wav_lens = np.array([leng//8 for leng in wav_lens])
    new_wav_data_lst = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i in range(len(wav_data_lst)):
        new_wav_data_lst[i, :wav_data_lst[i].shape[0], :, 0] = wav_data_lst[i]
    return new_wav_data_lst, wav_lens

pad_wav_data_lst, wav_lens = wav_padding(wav_data_lst)
print(pad_wav_data_lst.shape)
print(wav_lens)
(4, 928, 200, 1)
[101 116  96 110]

The labels likewise need padding and length extraction; the differences are that the label data has a different dimensionality, and the label length passed to CTC is simply the label's own length, so no extra processing is needed.

  • label padding
  • label len
def label_padding(label_data_lst):
    label_lens = np.array([len(label) for label in label_data_lst])
    max_label_len = max(label_lens)
    new_label_data_lst = np.zeros((len(label_data_lst), max_label_len))
    for i in range(len(label_data_lst)):
        new_label_data_lst[i][:len(label_data_lst[i])] = label_data_lst[i]
    return new_label_data_lst, label_lens

pad_label_data_lst, label_lens = label_padding(label_data_lst)
print(pad_label_data_lst.shape)
print(label_lens)
(4, 35)
[26 35 31 29]

4. Start Training

The training data is now ready, and we can start training with the following parameters:

  • batch_size = 4
  • batch_num = 10000 // 4
  • epochs = 1
total_nums = 10000
batch_size = 4
batch_num = total_nums // batch_size
epochs = 1
  • Prepare the training data; shuffle randomizes the order of the training samples
source_file = 'E:\\Data\\thchs30\\data_thchs30'
label_lst, wav_lst = source_get(source_file)
label_data = gen_label_data(label_lst)
vocab = mk_vocab(label_data)
vocab_size = len(vocab)

shuffle_list = [i for i in range(10000)]
shuffle(shuffle_list)
  • Instantiate the model and start training, checking how the loss is decreasing every 50 batches
am = Amodel(vocab_size)

for k in range(epochs):
    print('this is the', k+1, 'th epochs training !!!')
    batch = get_batch(batch_size, shuffle_list, wav_lst, label_data, vocab)
    for i in range(batch_num):
        wav_data_lst, label_data_lst = next(batch)
        pad_wav_data, input_length = wav_padding(wav_data_lst)
        pad_label_data, label_length = label_padding(label_data_lst)
        inputs = {'the_inputs': pad_wav_data,
                  'the_labels': pad_label_data,
                  'input_length': input_length,
                  'label_length': label_length,
                 }
        outputs = {'ctc': np.zeros(pad_wav_data.shape[0])} 
        am.ctc_model.fit(inputs, outputs, verbose=0)
        if i % 50 == 0:
            print('the ', i, 'th steps, cost:', am.ctc_model.evaluate(inputs, outputs, verbose=0) )
this is the 1 th epochs training !!!
the  0 th steps, cost: 1292.562255859375
...
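Testing and inference are deferred to the original repository, but as a rough sketch (not the repository's exact procedure), greedy CTC decoding with the trained acoustic model could look like this; wav_lst[0] simply stands in for a test file:

# rough inference sketch: greedy CTC decoding of a single utterance (illustrative only)
fbank = compute_fbank(wav_lst[0])
fbank = fbank[:fbank.shape[0] // 8 * 8, :]
x = fbank.reshape(1, fbank.shape[0], fbank.shape[1], 1)

pred = am.model.predict(x)                         # shape: (1, input_frames // 8, vocab_size)
decoded, _ = K.ctc_decode(pred, np.array([pred.shape[1]]), greedy=True)
ids = K.get_value(decoded[0])[0]
print(' '.join(vocab[i] for i in ids if i >= 0))   # -1 entries are padding in the decoded output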