A Top-Down Analysis of a Simple Speech Recognition System (Part 8)

Last time we got as far as the get_audio_and_transcript, pad_sequences, and sparse_tuple_from functions. In this post we analyze what each of these three functions does.

1. The get_audio_and_transcript function

This function takes the txt_files and wav_files lists obtained earlier and loads the audio and transcript label data. The code is as follows:

def get_audio_and_transcript(txt_files, wav_files, n_input, n_context):
    '''
    Loads audio files and text transcriptions from ordered lists of filenames.
    Converts audio to MFCC arrays and text to numerical arrays.
    Returns lists of arrays. The returned audio array list can be padded with
    the pad_sequences function in this same module.
    '''
    audio = []
    audio_len = []
    transcript = []
    transcript_len = []

    for txt_file, wav_file in zip(txt_files, wav_files):
        # load audio and convert to features
        audio_data = audiofile_to_input_vector(wav_file, n_input, n_context)
        audio_data = audio_data.astype('float32')

        audio.append(audio_data)
        audio_len.append(np.int32(len(audio_data)))

        # load text transcription and convert to numerical array
        target = normalize_txt_file(txt_file)
        target = text_to_char_array(target)
        transcript.append(target)
        transcript_len.append(len(target))

    audio = np.asarray(audio)
    audio_len = np.asarray(audio_len)
    transcript = np.asarray(transcript)
    transcript_len = np.asarray(transcript_len)
    return audio, audio_len, transcript, transcript_len
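
A minimal usage sketch is given below; the file paths are hypothetical, and n_input=26 / n_context=9 match the values used later in this post:

txt_files = ['LibriSpeech/dev-clean/777/126732/777-126732-0068.txt']  # hypothetical path
wav_files = ['LibriSpeech/dev-clean/777/126732/777-126732-0068.wav']  # hypothetical path

audio, audio_len, transcript, transcript_len = get_audio_and_transcript(
    txt_files, wav_files, n_input=26, n_context=9)

print(audio[0].shape)       # (time_steps, 494), since 26 + 2*9*26 = 494
print(transcript[0][:10])   # first 10 character indices of the label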

As the code above shows, the audio is converted into training vectors that can be fed into the network by the audiofile_to_input_vector function. This involves some speech-processing background, so let us first look at the standard signal-processing steps applied to the raw audio.

2. Reading the wav file

The following code reads the wav file and plots its waveform and spectrum:

import wave
import numpy as np
import struct
import pylab as pl

# Open the wav file.
# wave.open returns a Wave_read instance; its methods read the WAV format info and data.
f = wave.open(r"777-126732-0068.wav", "rb")

# Read the format information.
# getparams returns all of the WAV format information at once as a tuple:
# (nchannels, sampwidth in bytes, framerate, nframes, comptype, compname).
# The wave module only supports uncompressed data, so the last two can be ignored.
params = f.getparams()
nchannels, sampwidth, framerate, nframes = params[:4]
print("channel", nchannels)
print("sample_width", sampwidth)
print("framerate", framerate)
print("numframes", nframes)

# Read the waveform data.
# readframes takes the number of frames to read (in samples).
# Note: the unpacking below assumes a mono, 16-bit file.
str_data = f.readframes(nframes)
wave_data = struct.unpack('{n}h'.format(n=nframes), str_data)
wave_data = np.array(wave_data)
f.close()

time = np.arange(0, nframes) * (1.0 / framerate)

# Plot the waveform.
pl.subplot(211)
pl.plot(time, wave_data)
pl.xlabel("time (seconds)")

# Number of samples; change N and start to analyze different positions and lengths.
N = nframes
start = 0                  # starting sample position
df = framerate / (N - 1)   # frequency resolution
freq = [df * n for n in range(0, N)]  # N elements
wave_data2 = wave_data[start:start + N]
c = np.fft.fft(wave_data2) * 2 / N
# As usual, display only the spectrum up to half the sampling frequency.
d = int(len(c) / 2)
pl.subplot(212)
pl.plot(freq[:d - 1], abs(c[:d - 1]), 'r')
pl.xlabel("Hz")
pl.show()

The resulting waveform and spectrum look like this:
[Figure: time-domain waveform (top) and one-sided amplitude spectrum (bottom)]
Speech processing is usually carried out in the frequency domain. Taking the characteristics of human hearing into account, we do not need to feed all of the frequency-domain information into training; it is enough to compute the MFCC coefficients.

3. MFCC coefficients

The cochlea essentially acts as a filter bank whose filtering operates on a logarithmic frequency scale. Below 1000 Hz, human perception is roughly linear in frequency; above 1000 Hz it is no longer linear and is closer to logarithmic, which makes the ear more sensitive to low-frequency signals than to high-frequency ones. MFCC features model, to some extent, how the human ear processes speech; they build on research into auditory perception, and using them gives speech recognition systems a measurable performance improvement.
The figures below show in more detail how the MFCC coefficients are obtained.
[Figure: spectrum of a speech frame with its spectral envelope (formants) marked]
Research shows that the useful part of human speech lies in the formants of the spectrum above, i.e. in the spectral envelope. Once the envelope is removed, most of what remains relates to environmental noise and is called the spectral detail. How, then, do we separate these two components?
We observe that the spectral envelope varies slowly along the frequency axis while the spectral detail varies quickly, so we can apply another FFT to the spectrum we obtained. Taking a Fourier transform of the spectrum is equivalent to an inverse FFT (IFFT), as illustrated below:
[Figure: separating the spectral envelope (low quefrency) from the spectral detail (high quefrency) via an inverse FFT of the log spectrum]
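To make the idea concrete, here is a minimal sketch of this cepstral computation, reusing the wave_data2 array from the snippet in section 2 (the 50-bin cutoff is just an illustrative value):

import numpy as np

# Log-magnitude spectrum of the frame, then an inverse FFT ("the spectrum of the
# spectrum"): the low "quefrency" bins describe the slowly varying envelope,
# the high bins describe the rapidly varying spectral detail.
spectrum = np.abs(np.fft.rfft(wave_data2))
log_spectrum = np.log(spectrum + 1e-10)   # small offset avoids log(0)
cepstrum = np.fft.irfft(log_spectrum)

envelope_part = cepstrum[:50]   # low quefrency ~ spectral envelope
detail_part = cepstrum[50:]     # high quefrency ~ spectral detail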
Finally, the Mel filter bank is obtained from the following formulas (the Mel-scale mapping is commonly written as m = 2595 · log10(1 + f/700)):
[Formula images: Mel-scale mapping and Mel filter bank definition]
The resulting filter bank looks like this:
[Figure: the Mel filter bank]
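
In practice we do not need to implement the filter bank by hand. A minimal sketch, assuming the python_speech_features package that audiofile_to_input_vector relies on in the next section:

import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Read the same file as above and compute 26 cepstral coefficients per frame;
# the package defaults to a 25 ms analysis window with a 10 ms hop.
fs, audio = wav.read("777-126732-0068.wav")
feat = mfcc(audio, samplerate=fs, numcep=26)
print(feat.shape)   # (number_of_frames, 26)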

4. The audiofile_to_input_vector function

With that background covered, let us return to the main thread and analyze the audiofile_to_input_vector function. The code is as follows:

import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc


def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    # Load wav files
    fs, audio = wav.read(audio_filename)

    # Get mfcc coefficients
    orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)
    # fs = 16 kHz and numcep = 26 here; this calls mfcc from the
    # python_speech_features package to compute the MFCC coefficients

    # We only keep every second feature (BiRNN stride = 2)
    orig_inputs = orig_inputs[::2]

    # For each time slice of the training set, we need to copy the context this makes
    # the numcep dimensions vector into a numcep + 2*numcep*numcontext dimensions
    # because of:
    #  - numcep dimensions for the current mfcc feature set
    #  - numcontext*numcep dimensions for each of the past and future (x2) mfcc feature set
    # => so numcep + 2*numcontext*numcep
    train_inputs = np.array([], np.float32)
    train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))

    # Prepare pre-fix post fix context
    empty_mfcc = np.array([])
    empty_mfcc.resize((numcep))

    # Prepare train_inputs with past and future contexts
    time_slices = range(train_inputs.shape[0])
    context_past_min = time_slices[0] + numcontext
    context_future_max = time_slices[-1] - numcontext
    for time_slice in time_slices:
        # Reminder: array[start:stop:step]
        # slices from indice |start| up to |stop| (not included), every |step|

        # Add empty context data of the correct size to the start and end
        # of the MFCC feature matrix

        # Pick up to numcontext time slices in the past, and complete with empty
        # mfcc features
        need_empty_past = max(0, (context_past_min - time_slice))
        empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
        data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
        assert(len(empty_source_past) + len(data_source_past) == numcontext)

        # Pick up to numcontext time slices in the future, and complete with empty
        # mfcc features
        need_empty_future = max(0, (time_slice - context_future_max))
        empty_source_future = list(empty_mfcc for empty_slots in range(need_empty_future))
        data_source_future = orig_inputs[time_slice + 1:time_slice + numcontext + 1]
        assert(len(empty_source_future) + len(data_source_future) == numcontext)

        if need_empty_past:
            past = np.concatenate((empty_source_past, data_source_past))
        else:
            past = data_source_past

        if need_empty_future:
            future = np.concatenate((data_source_future, empty_source_future))
        else:
            future = data_source_future

        past = np.reshape(past, numcontext * numcep)
        now = orig_inputs[time_slice]
        future = np.reshape(future, numcontext * numcep)

        train_inputs[time_slice] = np.concatenate((past, now, future))
        assert(len(train_inputs[time_slice]) == numcep + 2 * numcep * numcontext)

    # Scale/standardize the inputs
    # This can be done more efficiently in the TensorFlow graph
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)
    return train_inputs

Each 25 ms speech frame is represented by 26 MFCC cepstral features (and orig_inputs[::2] keeps only every second frame, matching the BiRNN stride of 2). The loop over time_slices then concatenates the current frame with its 9 preceding and 9 following frames into a single train_inputs row of 26 + 2 × 9 × 26 = 494 coefficients, padding with zeros where neighbouring frames do not exist.
With that we have the audio features needed for training. Next let us see how the training labels are obtained; this part is implemented mainly in text.py.

5. The normalize_txt_file function

As the get_audio_and_transcript code shows, right after calling audiofile_to_input_vector to get the cepstral data it calls normalize_txt_file. What does this function do? Let us look at the code:

import codecs


def normalize_txt_file(txt_file, remove_apostrophe=True):
    with codecs.open(txt_file, encoding="utf-8") as open_txt_file:
        return normalize_text(open_txt_file.read(), remove_apostrophe=remove_apostrophe)

As we can see, this function simply calls normalize_text, so let us look at that code as well:

import re
import unicodedata


def normalize_text(original, remove_apostrophe=True):
    # convert any unicode characters to ASCII equivalent
    # then ignore anything else and decode to a string
    result = unicodedata.normalize("NFKD", original).encode("ascii", "ignore").decode()
    if remove_apostrophe:
        # remove apostrophes to keep contractions together
        result = result.replace("'", "")
    # keep only alphabetic characters and apostrophes (if still present),
    # collapse everything else to spaces, and return the lowercased result
    return re.sub("[^a-zA-Z']+", ' ', result).strip().lower()

This code removes characters the model does not support from the transcript: it normalizes Unicode to its ASCII equivalent, optionally strips apostrophes, replaces everything outside a-z/A-Z with spaces, and lowercases the result.
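
A quick example on toy strings (not from the dataset) shows what the normalization does:

print(normalize_text("Don't stop me now!"))      # -> "dont stop me now"
print(normalize_text("Café, s'il vous plaît"))   # -> "cafe sil vous plait"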

6. The text_to_char_array function

normalize_txt_file strips unsupported characters from the transcript file; now let us analyze text_to_char_array, which is called right after it. The code is as follows:

# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # 0 is reserved to space

def text_to_char_array(original):
    # Create list of sentence's words w/spaces replaced by ''
    result = original.replace(' ', '  ')
    result = result.split(' ')

    # Tokenize words into letters adding in SPACE_TOKEN where required
    result = np.hstack([SPACE_TOKEN if xt == '' else list(xt) for xt in result])

    # Return characters mapped into indices
    return np.asarray([SPACE_INDEX if xt == SPACE_TOKEN else ord(xt) - FIRST_INDEX for xt in result])

As this code shows, text_to_char_array turns the transcript string into a numeric array. The values are not raw ASCII codes: a space is mapped to SPACE_INDEX = 0, and each letter is mapped to ord(letter) - ord('a') + 1, so 'a' becomes 1, 'b' becomes 2, and so on up to 'z' = 26.
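A small worked example on a toy string makes the mapping clear:

print(text_to_char_array("hi there"))
# -> [ 8  9  0 20  8  5 18  5]
#    'h' = 8, 'i' = 9, space = 0, 't' = 20, 'h' = 8, 'e' = 5, 'r' = 18, 'e' = 5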
At this point we have all the vectors needed for the training inputs and targets. Returning once more to the next_batch function, we still have the pad_sequences and sparse_tuple_from functions to analyze.

7. The pad_sequences function

This function brings every audio input sequence up to the length of the longest sequence in the current batch, padding with zeros at the beginning or the end of each sequence (controlled by the padding argument, which defaults to 'post'):

def pad_sequences(sequences, maxlen=None, dtype=np.float32,
                  padding='post', truncating='post', value=0.):
    '''
    Pads each sequence to the same length of the longest sequence.

        If maxlen is provided, any sequence longer than maxlen is truncated to
        maxlen. Truncation happens off either the beginning or the end
        (default) of the sequence. Supports post-padding (default) and
        pre-padding.

        Args:
            sequences: list of lists where each element is a sequence
            maxlen: int, maximum length
            dtype: type to cast the resulting sequence.
            padding: 'pre' or 'post', pad either before or after each sequence.
            truncating: 'pre' or 'post', remove values from sequences larger
            than maxlen either in the beginning or in the end of the sequence
            value: float, value used for padding.

        Returns:
            numpy.ndarray: Padded sequences shape = (number_of_sequences, maxlen)
            numpy.ndarray: original sequence lengths
    '''
    lengths = np.asarray([len(s) for s in sequences], dtype=np.int64)

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x, lengths
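
A minimal usage sketch with toy data (not real MFCC features):

import numpy as np

seq_a = np.ones((5, 3), dtype=np.float32)   # 5 time steps, 3 features
seq_b = np.ones((8, 3), dtype=np.float32)   # 8 time steps, 3 features

padded, lengths = pad_sequences([seq_a, seq_b])
print(padded.shape)   # (2, 8, 3): both sequences post-padded with zeros to length 8
print(lengths)        # [5 8]: the original lengths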

8. The sparse_tuple_from function

This function produces a sparse representation of the label vectors. The code is as follows:

def sparse_tuple_from(sequences, dtype=np.int32):
    """
    Create a sparse representation of ``sequences``.

    Args:
        sequences: a list of lists of type dtype where each element is a sequence
    Returns:
        A tuple with (indices, values, shape)
    """

    indices = []
    values = []

    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)

    # return tf.SparseTensor(indices=indices, values=values, shape=shape)
    return indices, values, shape

For example, suppose sequences contains two sequences, [1 3 4 9 2] and [8 5 7 2]. Then indices = [[0 0] [0 1] [0 2] [0 3] [0 4] [1 0] [1 1] [1 2] [1 3]], values = [1 3 4 9 2 8 5 7 2], and shape = [2 5] (two sequences, and the longest one contributes column indices 0 through 4).
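This can be checked directly:

indices, values, shape = sparse_tuple_from([[1, 3, 4, 9, 2], [8, 5, 7, 2]])
print(indices.tolist())   # [[0, 0], [0, 1], [0, 2], [0, 3], [0, 4], [1, 0], [1, 1], [1, 2], [1, 3]]
print(values)             # [1 3 4 9 2 8 5 7 2]
print(shape)              # [2 5]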
With that we have the training inputs and targets; next we move on to the model's training code proper.
