Speech Signal Processing 1: Introduction

Reference: An introduction to signal processing for speech, by Dan Ellis @ Columbia University, Chapter 22 in Handbook of Phonetic Science. It is an excellent introductory guide; what follows is excerpts plus my own notes.

This chapter aims to give a transparent and intuitive introduction to the basic ideas of the Fourier domain and filtering, and connects them to some of the common representations used in speech science, including the spectrogram and cepstral coefficients.

Keywords: Fourier analysis, filtering, spectrogram, cepstral coefficients

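Before introducing Fourier analysis, we first need one important concept: linearity. Roughly speaking, it means that a system's output scales in step with its input.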

Very roughly, linearity is the idea that scaling the input to a system will result in scaling the output by the same amount.

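Sinusoids are eigenfunctions of linear systems (strictly, of linear time-invariant systems): feed a sinusoid in, and a sinusoid of the same frequency comes out.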

Now we can learn one more very important property of sinusoids: they are the eigenfunctions of linear systems. What this means is that if a linear system is fed a sinusoid (with or without an exponential envelope), the output will also be a sinusoid, with the same frequency and the same rate of exponential decay, merely scaled in amplitude and possibly shifted in phase.
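The eigenfunction property in the quote can be checked numerically. This is a minimal sketch, assuming a short FIR filter as the linear system; the impulse response `h` and the frequency `omega` are arbitrary illustrative values, not taken from the chapter. A complex sinusoid is used so that gain and phase shift collapse into one complex multiplier.

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2, -0.1])   # hypothetical FIR impulse response
omega = 2 * np.pi * 0.05              # frequency in radians per sample
n = np.arange(200)
x = np.exp(1j * omega * n)            # complex sinusoid input

# 'valid' keeps only outputs where the filter fully overlaps the input,
# i.e. the steady-state part with no start-up transient.
y = np.convolve(x, h, mode="valid")

# Frequency response of the filter at omega: H = sum_m h[m] * exp(-j*omega*m).
H = np.sum(h * np.exp(-1j * omega * np.arange(len(h))))

# The output is the same sinusoid multiplied by the single complex number H.
x_aligned = x[len(h) - 1:]
is_eigenfunction = np.allclose(y, H * x_aligned)
gain, phase_shift = np.abs(H), np.angle(H)
print(is_eigenfunction, gain, phase_shift)
```

The magnitude of `H` is the amplitude scaling and its angle is the phase shift; the frequency itself is untouched.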

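This needs a little explanation; see "What is a characteristic function?" (什麼是特徵函數?).

A function given without an explicit formula is called an abstract function. A characteristic function (特徵函數) is defined relative to an abstract function: it is a concrete function that satisfies the given (characteristic) conditions.
For example, f(xy)=f(x)+f(y) gives no explicit expression, only a property, so it is an abstract function. A concrete function with that property is called a characteristic function; f(x)=ln(x) is one, and from the property we can see that the set of such functions is the logarithms.
Note that this is a looser usage than the one in the quote above. There, "eigenfunction" has the operator sense: an input that the system returns with the same shape, only multiplied by a constant (for a sinusoid, an amplitude scaling and a phase shift).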

One of the important and subtle properties that linearity brings is superposition.
Sinusoid in, sinusoid out, plus superposition, is the key to the value of the Fourier transform.
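Scaling and superposition can both be checked in a few lines. This is a minimal sketch, assuming a 3-tap moving-average filter stands in for "a linear system"; the function `system` and the test inputs are illustrative, not from the chapter.

```python
import numpy as np

def system(x):
    """A hypothetical linear system: a 3-tap moving-average filter."""
    h = np.array([1.0, 1.0, 1.0]) / 3.0
    return np.convolve(x, h)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
x2 = rng.standard_normal(50)

# Scaling: multiplying the input by 2.5 multiplies the output by 2.5.
scaling_ok = np.allclose(system(2.5 * x1), 2.5 * system(x1))

# Superposition: the response to a sum is the sum of the responses.
superposition_ok = np.allclose(system(x1 + x2), system(x1) + system(x2))

print(scaling_ok, superposition_ok)
```

Together with "sinusoid in, sinusoid out", this is why a signal can be split into sinusoids, each one handled separately, and the results added back up.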

Now we formally introduce Fourier analysis, starting with the case of periodically-repeating waveforms:

The core of Fourier analysis is a simple but somewhat surprising fact: Any periodically-repeating waveform can be expressed as a sum of sinusoids, each scaled and shifted in time by appropriate constants. Moreover, the only sinusoids required are those whose frequency is an integer multiple of the fundamental frequency of the periodic sequence.

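So how do we find this sum of sinusoids? This is the problem of computing the Fourier series. (The passage below roughly explains the inner-product principle. I recommend the series 【學渣告訴你】到底神馬是傅里葉級數!, which is plain and easy to follow. The key point is that "representing a function with a set of basis functions" and "representing a vector with a set of basis vectors" are analogous.)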

It turns out that finding the Fourier series coefficients – the optimal scale constants and phase shifts for each harmonic – is very straightforward: All you have to do is multiply the waveform, point-for-point, with a candidate harmonic, and sum up (i.e. integrate) over a complete cycle; this is known as taking the inner product between the waveform and the harmonic, and gives the required scale constant for that harmonic. This works because the harmonics are orthogonal, meaning that the inner product between different harmonics is exactly zero, so if we assume that the original waveform is a sum of scaled harmonics, only the term involving the candidate harmonic appears in the result of the inner product. Finding the phase requires taking the inner product twice, once with a cosine-phase harmonic and once with the sine-phase harmonic, giving two scaled harmonics that can sum together to give a sinusoid of the corresponding frequency at any amplitude and any phase.
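The inner-product recipe in the quote translates almost directly into code. This is a minimal sketch, assuming one period sampled at `N` points and a test waveform whose content is known in advance (harmonics 2 and 5); the helper names `harmonic_coefficients` and `amplitude_phase` are made up for illustration.

```python
import numpy as np

N = 1000                                   # samples in one period
t = np.arange(N) / N                       # one full cycle, period = 1
# Test waveform with known content: harmonic 2 and harmonic 5.
x = 1.5 * np.cos(2 * np.pi * 2 * t + 0.3) + 0.5 * np.sin(2 * np.pi * 5 * t)

def harmonic_coefficients(x, k):
    """Inner products of x with the cosine- and sine-phase k-th harmonic."""
    n = len(x)
    t = np.arange(n) / n
    a = 2.0 / n * np.sum(x * np.cos(2 * np.pi * k * t))
    b = 2.0 / n * np.sum(x * np.sin(2 * np.pi * k * t))
    return a, b

def amplitude_phase(a, b):
    """Convert (a, b) to A and phi in A*cos(2*pi*k*t + phi)."""
    return np.hypot(a, b), np.arctan2(-b, a)

a2, b2 = harmonic_coefficients(x, 2)
amp2, phase2 = amplitude_phase(a2, b2)     # recovers 1.5 and 0.3
a3, b3 = harmonic_coefficients(x, 3)       # harmonic 3 is absent: both near 0

# Fourier synthesis: rebuild the waveform from harmonics 1..9.
x_rebuilt = np.zeros(N)
for k in range(1, 10):
    a, b = harmonic_coefficients(x, k)
    x_rebuilt += a * np.cos(2 * np.pi * k * t) + b * np.sin(2 * np.pi * k * t)
```

Orthogonality is what makes harmonic 3 come out as zero: the inner product with a harmonic that is not in the waveform cancels over a full cycle. The two inner products per harmonic (cosine and sine) are what recover the phase.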

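Computing the Fourier series is Fourier analysis; conversely, Fourier synthesis turns a Fourier series back into a waveform.
But if Fourier analysis only worked for periodically-repeating waveforms, it would not be of much use, because a purely periodic signal (one that repeats over infinite time) is only a mathematical abstraction and does not exist in the real world. What can we do?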

Consider, however, stretching the period of repetition to be longer and longer. Fourier analysis states that within this very long period we can have any arbitrary and unique waveform, and we will still be able to represent it as accurately as we wish. All that happens is that the ‘harmonics’ of our very long period become more and more closely spaced in frequency.

The key point is stretching the period of repetition; see "From Fourier series to Fourier transform" (從傅立葉級數到傅立葉變換).

Now by letting the fundamental period go to infinity, we end up with a signal that is no longer periodic, since there is only space for a single repetition in the entire real time axis; at the same time, the spacing between our harmonics goes to zero, meaning that the Fourier series now becomes a continuous function of frequency, not a series of discrete values. However, nothing essentially changes – and, in particular, we can still find the value of the Fourier transform function simply by calculating the inner product integral. Now we have the most general form of the Fourier transform, pairing a continuous, non-repeating (aperiodic) waveform in time, with a continuous function of frequency.
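The effect of stretching the period can be seen with a discrete-time stand-in. This is a rough sketch, assuming a short sampled pulse and using the DFT of one period as the "harmonics"; the pulse values and the helper `harmonic_spectrum` are illustrative only.

```python
import numpy as np

pulse = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # one non-repeating waveform

def harmonic_spectrum(pulse, period):
    """DFT of one period: the pulse followed by zeros up to `period` samples."""
    return np.fft.fft(pulse, n=period)

X8 = harmonic_spectrum(pulse, 8)         # harmonics spaced 1/8 cycle per sample
X16 = harmonic_spectrum(pulse, 16)       # harmonics spaced 1/16
X1024 = harmonic_spectrum(pulse, 1024)   # close to a continuous curve

# Harmonics that share a frequency have identical values ...
same_envelope = np.allclose(X16[::2], X8) and np.allclose(X1024[::128], X8)
# ... and the spacing between harmonics shrinks as 1/period.
spacings = [1 / p for p in (8, 16, 1024)]
print(same_envelope, spacings)
```

Lengthening the period does not change the underlying curve; it only samples that curve more densely, which is the sense in which the series turns into a continuous function of frequency.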

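The human auditory system in effect performs something like a Fourier transform: the cochlea (which can be viewed as a bank of filters) converts time-domain sound pressure into separate components at different frequencies. More precisely, though, it is closer to a short-time Fourier transform (STFT). On the STFT, see "Can someone explain the relationship between Fourier analysis and wavelet analysis in plain terms?" (能不能通俗的講解下傅立葉分析和小波分析之間的關係?).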

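The Fourier transform has an inherent limitation when handling non-stationary signals (signals whose frequency content changes over time). It can only tell which frequency components a signal contains overall, and says nothing about when each component occurs. As a result, two signals that differ greatly in the time domain can have the same magnitude spectrum.
For such non-stationary signals, knowing which frequency components are present is not enough; we also want to know when each component occurs. Knowing how the signal's frequency changes over time, that is, the instantaneous frequency and its amplitude at each moment, is time-frequency analysis.
One simple and workable method is windowing. To borrow 方沁園's description once more: "Split the whole time-domain process into many short segments of equal length, each approximately stationary, then Fourier transform each one, and you know which frequencies appear at which point in time." This is the short-time Fourier transform.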
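The claim that two very different signals can share a spectrum is easy to reproduce. This is a minimal sketch, assuming an 8 kHz sample rate and two tones (100 Hz and 300 Hz) chosen only for illustration: one signal plays the low tone first, the other is its time reversal.

```python
import numpy as np

fs = 8000                                  # assumed sample rate in Hz
t = np.arange(fs // 2) / fs                # half a second
low = np.sin(2 * np.pi * 100 * t)
high = np.sin(2 * np.pi * 300 * t)

x_low_first = np.concatenate([low, high])  # 100 Hz, then 300 Hz
x_high_first = x_low_first[::-1]           # time-reversed: 300 Hz, then 100 Hz

mag_a = np.abs(np.fft.rfft(x_low_first))
mag_b = np.abs(np.fft.rfft(x_high_first))

signals_differ = not np.allclose(x_low_first, x_high_first)
spectra_match = np.allclose(mag_a, mag_b)
print(signals_differ, spectra_match)
```

The ordering information is not lost; it sits in the phase of the transform. It is only the magnitude spectrum, which is what one usually plots, that cannot tell the two apart.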

What we see on a spectrogram (not a spectrum, which carries no time-domain information) is in fact the magnitude of the STFT.

Calculation of the spectrogram. Input signal (1) is converted into a sequence of short excerpts by applying a sliding tapered window (2). Each short excerpt is converted to the frequency domain via the Fourier transform (3), then these individual spectra become columns in the spectrogram image (4), with each pixel’s color reflecting the log-magnitude at the corresponding frequency value in the Fourier transform.
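The four steps in the caption can be sketched as follows. This is not the chapter's code: the Hann window, the frame length of 256, the hop of 128 and the two-tone test signal are all assumptions made for illustration.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Log-magnitude STFT: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)                   # tapered window (step 2)
    n_frames = 1 + (len(x) - frame_len) // hop
    columns = []
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)                # Fourier transform (step 3)
        columns.append(20 * np.log10(np.abs(spectrum) + 1e-10))
    return np.array(columns).T                       # spectra as columns (step 4)

fs = 8000                                            # assumed sample rate in Hz
t = np.arange(fs) / fs
# Input signal (step 1): one second at 500 Hz, then one second at 2000 Hz.
x = np.concatenate([np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t)])

S = spectrogram(x)
freqs = np.fft.rfftfreq(256, d=1 / fs)
first_peak = freqs[np.argmax(S[:, 0])]               # strongest bin, first frame
last_peak = freqs[np.argmax(S[:, -1])]               # strongest bin, last frame
print(S.shape, first_peak, last_peak)
```

Unlike a single spectrum of the whole signal, the columns of `S` show the 500 Hz tone at the start and the 2000 Hz tone at the end.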

(Linear Prediction is skipped for now.)

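However, the spectrum and the spectrogram contain too much information. Practical applications do not need that much detail; what matters more is extracting the key features.
The most commonly used features in speech recognition are Mel-frequency cepstral coefficients (MFCC).
MFCCs can be understood from two angles: (1) What is the Mel-frequency scale? (2) What are cepstral coefficients?
(1) What is the Mel-frequency scale?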

The Mel-frequency scale is a nonlinear mapping of the audible frequency range. The scale is approximately linear below 1000 Hz and approximately logarithmic above 1000 Hz.
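The quote does not give a formula. One widely used analytic approximation is m = 2595 log10(1 + f/700); treating that as the mapping is an assumption here, and other variants exist. A minimal sketch:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Hz to mel, using the common 2595*log10(1 + f/700) approximation."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0])
mels = hz_to_mel(freqs)
for f, m in zip(freqs, mels):
    print(f"{f:7.0f} Hz -> {m:7.1f} mel")
```

With this formula 1000 Hz maps to roughly 1000 mel, and each doubling of frequency above that adds a roughly similar number of mels, which is the "approximately logarithmic" region.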

(2) What are cepstral coefficients?

Cepstra amounts to taking a second Fourier transform on the logarithm of the magnitude of the original spectrum (Fourier transform of the time waveform). Because of the symmetry between time and frequency in the basic Fourier mathematics, without the intervening log-magnitude step, taking the Fourier transform of a Fourier transform almost gets you back to the original signal. But taking the magnitude removes any phase (relative timing) information between different frequencies, and applying a logarithm drastically alters the balance between intense and weak components, leading to a very different signal.
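A small experiment shows what the log-magnitude step buys. This is a minimal sketch, assuming a noise signal with an added echo as test input; the function name `real_cepstrum`, the echo delay of 100 samples and the circular shift used to create the echo are illustrative choices. The second transform is taken as an inverse FFT, which for a real signal's symmetric log-magnitude differs from a forward FFT only by a constant scale.

```python
import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log-magnitude spectrum of x."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)   # phase is discarded here
    return np.fft.ifft(log_mag).real

rng = np.random.default_rng(0)
N, delay = 4096, 100
s = rng.standard_normal(N)
x = s + 0.5 * np.roll(s, delay)     # signal plus a (circular) echo

c = real_cepstrum(x)
# Look for the strongest component, skipping index 0 (overall level).
peak = 1 + np.argmax(c[1 : N // 2])
print(peak)
```

The echo multiplies the spectrum by a periodic ripple; the logarithm turns that product into a sum, and the second transform concentrates the ripple into a single peak at the echo delay.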

So MFCCs are simply cepstra computed on the Mel-warped spectrum.
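Putting the two pieces together for a single frame might look like the sketch below. Several details are assumptions rather than anything stated in the excerpts: the 2595*log10(1 + f/700) mel formula, triangular filters, 26 filters and 13 coefficients, and a DCT-II as the second transform (a common convention in MFCC implementations). All function names are made up for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(v, n_out):
    """Unnormalized DCT-II of v, keeping the first n_out coefficients."""
    n = len(v)
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n) @ v

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """Window -> power spectrum -> mel filterbank -> log -> DCT."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    mel_energies = mel_filterbank(n_filters, len(frame), fs) @ power
    return dct_ii(np.log(mel_energies + 1e-10), n_ceps)

fs = 16000                                  # assumed sample rate in Hz
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 440 * t)         # one 512-sample test frame
coeffs = mfcc_frame(frame, fs)
print(coeffs.shape)
```

Keeping only the first few coefficients is what discards fine spectral detail and leaves a compact description of the overall spectral shape.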

There are of course other features as well, for example:

Perceptual Linear Prediction (PLP)

PLP features often perform comparably to MFCCs, although which feature is superior tends to vary from task to task. PLP features use the Bark auditory scale, and trapezoidal (flat-topped) rather than triangular windows, to create the initial auditory spectrum. Then, rather than smoothing the auditory spectrum by keeping only the low-order cepstral coefficients, linear prediction is used to find a smooth spectrum consisting of only a few resonant peaks (typically 4 to 6) that matches the Bark-spectrum. Finally, this smoothed PLP spectrum is again converted to the compact, decorrelated cepstral coefficients via another neat mathematical trick that finds cepstra directly from an LP model.

delta coefficients

an estimate of the local slope, along the direction of the time axis, for each frequency or cepstral coefficient
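As a rough illustration of "local slope along the time axis", the sketch below uses simple finite differences via `np.gradient`. That is an assumption made for brevity; practical front ends often fit the slope over a window of several frames instead. The feature matrix is laid out as frames by dimensions, and the function name `delta` is illustrative.

```python
import numpy as np

def delta(features):
    """Slope along time (axis 0) for each feature dimension (axis 1)."""
    return np.gradient(features, axis=0)

# Two made-up feature tracks over 10 frames with known slopes 2.0 and -0.5.
frames = np.arange(10, dtype=float)[:, None]
features = np.hstack([2.0 * frames, -0.5 * frames + 3.0])

d = delta(features)
print(d[0], d[-1])
```

The deltas are usually appended to the static features so that each frame also carries information about how the spectrum is changing.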

Cepstral Mean Normalization (CMN)

the average value of each cepstral dimension over an entire segment or utterance is subtracted from that dimension at every time step
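In code this is a one-line subtraction. A minimal sketch, assuming the cepstra for one utterance are stored as a frames-by-dimensions array; the random data and the constant `offset` are made up to show the effect.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the per-dimension mean over the whole utterance (axis 0)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.standard_normal((200, 13))       # 200 frames, 13 coefficients
offset = np.linspace(-1.0, 1.0, 13)            # a fixed shift in every frame

clean = cepstral_mean_normalize(cepstra)
shifted = cepstral_mean_normalize(cepstra + offset)
offset_removed = np.allclose(clean, shifted)
print(offset_removed)
```

Because a fixed linear channel multiplies the spectrum, it shows up as a constant added to the log spectrum and hence to the cepstrum, so subtracting the mean removes it.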

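Some other good resources:
zouxy09's column
小腹黑zju's blog
What is the relationship between the Fourier series and the Fourier transform?
How should one understand the Fourier transform formula?
CMU Speech Processing
University of Edinburgh Automatic Speech Recognition
Columbia University Speech and Audio Processing and Recognition
TAMU Speech processing
MIT Linguistic Phonetics
Spectrograms and filter banks (Filter banks, MFCC), which covers how MFCCs are actually computed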
