Drawing a spectrogram by hand in Python

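A spectrogram is obtained from the short-time Fourier transform (STFT) and is very useful for analyzing the time-frequency characteristics of a signal. In speech signal processing it is often called a "voice spectrogram".

Python has ready-made modules that convert a time-domain signal directly into a spectrogram, but using them does not help much in understanding the underlying principle, and converting the horizontal and vertical axes is not very convenient either. In this post we try to draw a spectrogram by hand using only basic Python operations.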

Generate synthetic data

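Every key on the keypad of an analog telephone produces two sine-wave signals. For example, pressing the digit 1 produces a signal containing sine waves at 697 Hz and 1209 Hz. A frequency of 697 Hz means the sine wave completes 697 full cycles in one second, and having two frequencies means the signal is the sum of the two sine waves.

Assume a sample rate of 4000 Hz, meaning 4000 samples are taken per second. The first 3 s correspond to the digit 1, the middle 2 s are silence, and the last 3 s correspond to the digit 2. The data can then be generated as follows: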


import numpy as np
import matplotlib.pyplot as plt
import warnings
## log10 of the silent (all-zero) windows raises RuntimeWarning later on
warnings.filterwarnings("ignore", category=RuntimeWarning)

def get_signal_Hz(Hz,sample_rate,length_ts_sec):
    ## 1 sec length time series with sampling rate;
    ## endpoint=False so the phase 2*pi*Hz is not sampled twice when repeating
    ts1sec = list(np.linspace(0,np.pi*2*Hz,sample_rate,endpoint=False))
    ## repeat the 1 sec series to get length_ts_sec seconds
    ts = ts1sec*length_ts_sec
    return(list(np.sin(ts)))

sample_rate   = 4000
length_ts_sec = 3
## --------------------------------- ##
## 3 seconds of "digit 1" sound
## Pressing digit 1 button generates
## the sine waves at frequency
## 697Hz and 1209Hz.
## --------------------------------- ##
ts1  = np.array(get_signal_Hz(697, sample_rate,length_ts_sec))
ts1 += np.array(get_signal_Hz(1209,sample_rate,length_ts_sec))
ts1  = list(ts1)

## -------------------- ##
## 2 seconds of silence
## -------------------- ##
ts_silence = [0]*sample_rate*2

## --------------------------------- ##
## 3 seconds of "digit 2" sounds
## Pressing digit 2 button generates
## the sine waves at frequency
## 697Hz and 1336Hz.
## --------------------------------- ##
ts2  = np.array(get_signal_Hz(697, sample_rate,length_ts_sec))
ts2 += np.array(get_signal_Hz(1336,sample_rate,length_ts_sec))
ts2  = list(ts2)

## -------------------- ##
## Add up to 8 seconds
## ------------------- ##
ts = ts1 + ts_silence + ts2
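
For comparison, here is a minimal vectorized sketch of the same idea. It assumes the layout described in the text (3 s of digit 1, 2 s of silence, 3 s of digit 2 at 4000 Hz); the helper name `dtmf_tone` and the variable names are hypothetical and not part of the original code.

```python
import numpy as np

sample_rate = 4000  # samples per second, as in the text

def dtmf_tone(freqs_hz, duration_sec, sample_rate):
    """Sum of unit-amplitude sine waves at the given frequencies (hypothetical helper)."""
    # sample instants 0, 1/fs, 2/fs, ... covering duration_sec seconds
    t = np.arange(int(duration_sec * sample_rate)) / sample_rate
    return sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)

digit1 = dtmf_tone([697, 1209], 3, sample_rate)   # 3 s of "digit 1"
silence = np.zeros(2 * sample_rate)               # 2 s of silence
digit2 = dtmf_tone([697, 1336], 3, sample_rate)   # 3 s of "digit 2"

signal = np.concatenate([digit1, silence, digit2])
print(len(signal) / sample_rate)  # total duration in seconds
```

Working with NumPy arrays throughout avoids the list conversions, at the cost of departing from the list-based style used in the rest of this post.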

Plot the generated sound signal in frequency domain

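We use the discrete Fourier transform (DFT) to plot the spectrum of the signal in the frequency domain. The code is shown below.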

def get_xn(Xs,n):
    '''
    calculate the Fourier coefficient X_n of 
    Discrete Fourier Transform (DFT)
    '''
    L  = len(Xs)
    ks = np.arange(0,L,1)
    xn = np.sum(Xs*np.exp(((-1)*1j*2*np.pi*ks*n)/L))
    return(xn)

def get_xns(ts):
    '''
    Compute Fourier coefficients only up to the Nyquist limit Xn, n=0,...,L/2-1
    and multiply the absolute value of the Fourier coefficients by 2, 
    to account for the symmetry of the Fourier coefficients above the Nyquist limit. 
    '''
    mag = []
    L = len(ts)
    for n in range(int(L/2)): # Nyquist limit
        mag.append(np.abs(get_xn(ts,n))*2)
    return(mag)
mag = get_xns(ts)

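The "get_xns" function computes the Fourier coefficients only up to the Nyquist limit. Because the Fourier coefficients of a real-valued signal are symmetric about that limit, the absolute value of each coefficient is doubled.

Note: the original post computes the coefficient in "get_xn" as: xn = np.sum(Xs*np.exp((1j*2*np.pi*ks*n)/L))/L
I personally think this is wrong as a forward DFT, although it does not affect the analysis that follows. For a real-valued signal the positive exponent only conjugates each coefficient and the division by L only rescales it, so the spectral peaks stay at the same positions.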
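
The point about the sign of the exponent can be checked numerically. The sketch below is only an illustration: `dft_coefficient` is a hypothetical stand-in for `get_xn` with a switchable sign, and the random test signal is an assumption of mine. It compares the negative-exponent version against `np.fft.fft` and confirms that flipping the sign leaves the magnitudes of a real signal unchanged.

```python
import numpy as np

def dft_coefficient(xs, n, sign=-1):
    """n-th DFT coefficient; sign=-1 is the forward-transform convention."""
    xs = np.asarray(xs, dtype=float)
    L = len(xs)
    ks = np.arange(L)
    return np.sum(xs * np.exp(sign * 1j * 2 * np.pi * ks * n / L))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)  # arbitrary real-valued test signal

forward = np.array([dft_coefficient(x, n, sign=-1) for n in range(len(x))])
flipped = np.array([dft_coefficient(x, n, sign=+1) for n in range(len(x))])

# The negative-exponent version agrees with NumPy's FFT.
matches_fft = np.allclose(forward, np.fft.fft(x))
# Flipping the sign conjugates each coefficient ...
is_conjugate = np.allclose(flipped, np.conj(forward))
# ... so the magnitudes are identical.
same_magnitude = np.allclose(np.abs(flipped), np.abs(forward))

print(matches_fft, is_conjugate, same_magnitude)
```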

The corresponding waveform:

DFT on entire dataset to visualize the signals at frequency domain for all k=1,…L/2.

Visualizing the spectrum of the signal:

# the number of points to label along xaxis
Nxlim = 10

plt.figure(figsize=(20,3))
plt.plot(mag)
plt.xlabel("Frequency (k)")
plt.title("One-sided frequency plot")
plt.ylabel("|Fourier Coefficient|")
plt.show()

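The corresponding spectrum:

As discussed in an earlier post of mine, the Fourier coefficient $X_k$ at the k-th frequency bin corresponds to the frequency:

$\frac{\text{SampleRate} \cdot k}{N}$ Hz, where $N$ is the number of sample points.

Using this, we convert the x-axis of the spectrum to Hz. The spectrum then shows peaks at 697 Hz, 1209 Hz and 1336 Hz.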
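
The bin-to-Hz conversion can be tried on a single tone. This is a small sketch under my own assumptions (a 2-second, 697 Hz test tone, and `np.fft.rfft` in place of the hand-written DFT): it finds the strongest bin and converts its index to Hz with the formula above, then cross-checks the result against `np.fft.rfftfreq`.

```python
import numpy as np

sample_rate = 4000                   # Hz, as in the text
n_points = 2 * sample_rate           # assumed: a 2-second test tone
t = np.arange(n_points) / sample_rate
tone = np.sin(2 * np.pi * 697 * t)   # single 697 Hz sine wave

tone_mag = np.abs(np.fft.rfft(tone))  # magnitudes for k = 0 .. N/2
k_peak = int(np.argmax(tone_mag))     # index of the strongest bin
f_peak = k_peak * sample_rate / n_points

# NumPy's own bin-to-frequency table, for comparison
f_check = np.fft.rfftfreq(n_points, d=1 / sample_rate)[k_peak]

print(k_peak, f_peak, f_check)
```

Because the tone lasts a whole number of cycles, all of its energy lands in one bin, which makes the index-to-Hz mapping easy to see.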

def get_Hz_scale_vec(ks,sample_rate,Npoints):
    freq_Hz = ks*sample_rate/Npoints
    freq_Hz  = [int(i) for i in freq_Hz ]
    return(freq_Hz )

ks   = np.linspace(0,len(mag),Nxlim)
ksHz = get_Hz_scale_vec(ks,sample_rate,len(ts))

plt.figure(figsize=(20,3))
plt.plot(mag)
plt.xticks(ks,ksHz)
plt.title("Frequency Domain")
plt.xlabel("Frequency (Hz)")
plt.ylabel("|Fourier Coefficient|")
plt.show()

The resulting plot:

Create Spectrogram

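Now we finally get to the main topic of this post.

The sections above covered the waveform and the spectrum of the signal, which show its time-domain and frequency-domain characteristics respectively. To analyze how the frequency content changes over time, we use a windowed DFT, that is, the STFT.

Applying the STFT to a signal yields its spectrogram. Python has a ready-made function, "matplotlib.pyplot.specgram", for computing a spectrogram; here we spell out the STFT computation explicitly: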

def create_spectrogram(ts,NFFT,noverlap = None):
    '''
          ts: original time series
        NFFT: The number of data points used in each block for the DFT.
    noverlap: The number of points of overlap between blocks. The default value is NFFT/2. 
    '''
    if noverlap is None:
        noverlap = NFFT/2
    noverlap = int(noverlap)
    starts  = np.arange(0,len(ts),NFFT-noverlap,dtype=int)
    # remove any window with less than NFFT sample size
    starts  = starts[starts + NFFT <= len(ts)]
    xns = []
    for start in starts:
        # short term discrete fourier transform
        ts_window = get_xns(ts[start:start + NFFT]) 
        xns.append(ts_window)
    specX = np.array(xns).T
    # convert the magnitudes to a log (decibel-like) scale, as is standard;
    # all-zero (silent) windows give -inf here
    spec = 10*np.log10(specX)
    assert spec.shape[1] == len(starts) 
    return(starts,spec)

L = 256
noverlap = 84
starts, spec = create_spectrogram(ts,L,noverlap = noverlap )
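
As a cross-check on the hand-written loop, the same windowing can be sketched with `np.fft.rfft`. This is an assumption-laden illustration, not the method used in this post: `stft_magnitude` is a hypothetical name, the window is rectangular as above, the log scaling is left out, and the test signal is just 3 s of the "digit 1" tone.

```python
import numpy as np

def stft_magnitude(x, nfft, noverlap):
    """Magnitude STFT with a rectangular window; rows = frequency, columns = time."""
    x = np.asarray(x, dtype=float)
    hop = nfft - noverlap
    win_starts = np.arange(0, len(x) - nfft + 1, hop)   # full windows only
    frames = np.stack([x[s:s + nfft] for s in win_starts])
    # rfft returns bins 0..nfft/2; keep the first nfft//2 to mirror get_xns
    mags = 2 * np.abs(np.fft.rfft(frames, axis=1))[:, :nfft // 2]
    return win_starts, mags.T

sample_rate = 4000
nfft = 256
t = np.arange(3 * sample_rate) / sample_rate
x = np.sin(2 * np.pi * 697 * t) + np.sin(2 * np.pi * 1209 * t)

win_starts, mags = stft_magnitude(x, nfft=nfft, noverlap=84)
print(mags.shape)

# strongest bin in the first window, converted to Hz with k * sample_rate / nfft
k_max = int(np.argmax(mags[:, 0]))
f_max = k_max * sample_rate / nfft
print(f_max)
```

The strongest bin should sit within one bin width (about 15.6 Hz) of 697 Hz or 1209 Hz; it cannot land exactly on either, because neither frequency is a multiple of the bin spacing.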

Plot the hand-made spectrogram

With the STFT computed, the spectrogram can be drawn by hand:

def plot_spectrogram(spec,ks,sample_rate, L, starts, mappable = None):
    plt.figure(figsize=(20,8))
    plt_spec = plt.imshow(spec,origin='lower')

    ## create ylim
    Nyticks = 10
    ks      = np.linspace(0,spec.shape[0],Nyticks)
    ## bin k of an L-point window corresponds to k*sample_rate/L Hz
    ksHz    = get_Hz_scale_vec(ks,sample_rate,L)
    plt.yticks(ks,ksHz)
    plt.ylabel("Frequency (Hz)")

    ## create xlim
    Nxticks = 10
    ts_spec = np.linspace(0,spec.shape[1],Nxticks)
    ## window start times in seconds, from 0 to the start of the last window
    ts_spec_sec  = ["{:4.2f}".format(i) for i in np.linspace(0,starts[-1]/sample_rate,Nxticks)]
    plt.xticks(ts_spec,ts_spec_sec)
    plt.xlabel("Time (sec)")

    plt.title("Spectrogram L={} Spectrogram.shape={}".format(L,spec.shape))
    plt.colorbar(mappable,use_gridspec=True)
    plt.show()
    return(plt_spec)
plot_spectrogram(spec,ks,sample_rate,L, starts)

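The resulting spectrogram is shown below. It clearly shows the 697 Hz and 1209 Hz components in the first 3 s, followed by 2 s of silence, and then the 697 Hz and 1336 Hz components in the last 3 s.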

Frequency resolution vs time resolution

Finally, I want to discuss the "uncertainty principle" that applies to spectrograms.

Uncertainty principle: We cannot arbitrarily narrow our focus both in time and in frequency. If we want higher time resolution, we need to give up frequency resolution, and vice versa.

In the spectrogram above, the window size is 256 and the sample rate is 4000, so each window spans:

time resolution : $\frac{WindowSize}{SampleRate} = \frac{256}{4000} = 0.064$ seconds

The frequency resolution is the reciprocal of the time resolution:

frequency resolution : $\frac{SampleRate}{WindowSize} = \frac{4000}{256} = 15.625$ Hz
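
To make the reciprocal relationship concrete, the sketch below tabulates both resolutions for a few window sizes at the 4000 Hz sample rate used in this post. The particular window sizes and the name `resolutions` are my own choices for illustration.

```python
sample_rate = 4000  # Hz, as in the text

resolutions = []
for window_size in (150, 200, 256, 400):
    time_res = window_size / sample_rate   # seconds covered by one window
    freq_res = sample_rate / window_size   # spacing between DFT bins in Hz
    resolutions.append((window_size, time_res, freq_res))
    print(f"window={window_size:4d}  time={time_res:.4f} s  freq={freq_res:.3f} Hz")
```

In every row the product of the two resolutions is 1, so shrinking one necessarily enlarges the other.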

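The figures below show the trade-off between frequency resolution and time resolution. With a larger window the frequency content of the spectrogram is sharper; conversely, with a smaller window (a wider band) the timing is sharper.

Note: the original post titles this section "Wideband spectrogram vs narrowband spectrogram". Since signals themselves can be wideband or narrowband, that title is easy to misread, so I changed it to "Frequency resolution vs time resolution".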

plt_spec1 = None
for iL, (L, bandnm) in enumerate(zip([150, 200, 400],["wideband","middleband","narrowband"])):
    print("{:20} time resolution={:4.2f}sec, frequency resolution={:4.2f}Hz".format(bandnm,L/sample_rate,sample_rate/L))
    starts, spec = create_spectrogram(ts,L,noverlap = 1 )
    plt_spec = plot_spectrogram(spec,ks,sample_rate, L, starts,
                                 mappable = plt_spec1)
    if iL == 0:
        plt_spec1 = plt_spec

wideband :

middleband :

narrowband:

References:
[1]: Implement the Spectrogram from scratch in python
