Comparing how torch.stft and librosa.stft expose the magnitude and phase of a speech signal
torch.stft
stft(self, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)
Parameters:
----------
input (Tensor) – the input tensor
n_fft (int) – size of Fourier transform
hop_length (int, optional) – the distance between neighboring sliding window frames. Default: None (treated as equal to floor(n_fft / 4))
win_length (int, optional) – the size of window frame and STFT filter. Default: None (treated as equal to n_fft)
window (Tensor, optional) – the optional window function. Default: None (treated as window of all 1s)
center (bool, optional) – whether to pad input on both sides so that the t-th frame is centered at time t × hop_length. Default: True
pad_mode (string, optional) – controls the padding method used when center is True. Default: "reflect"
normalized (bool, optional) – controls whether to return the normalized STFT results. Default: False
onesided (bool, optional) – controls whether to return half of results to avoid redundancy. Default: True
Returns the real and the imaginary parts together as one tensor of size :math:`(* \times N \times T \times 2)`, where :math:`*` is the optional batch size of :attr:`input`, :math:`N` is the number of frequencies where STFT is applied, :math:`T` is the total number of frames used, and each pair in the last dimension represents a complex number as the real part and the imaginary part.
----------
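The input is a one- or two-dimensional time series (a single waveform or a batch of waveforms).
The return value is a tensor. Its first dimension is the batch size of the input (present only for batched input), the second is the number of frequency bins at which the STFT is evaluated, the third is the total number of frames, and the last holds the real and imaginary parts of each complex value.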
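The size of the frequency and frame dimensions can be predicted from the parameters. The following is a minimal sketch, assuming onesided=True, center=True and an even n_fft, for which the number of frequency bins is n_fft // 2 + 1 and the number of frames is 1 + L // hop_length for an input of L samples. The helper name stft_output_shape is hypothetical and not part of either library.

```python
def stft_output_shape(num_samples, n_fft, hop_length=None):
    """Predict (N, T) for a one-sided, centered STFT (assumed settings)."""
    if hop_length is None:
        hop_length = n_fft // 4  # documented default: floor(n_fft / 4)
    num_freqs = n_fft // 2 + 1                  # one-sided spectrum
    num_frames = 1 + num_samples // hop_length  # center=True padding
    return num_freqs, num_frames


# One second of 16 kHz audio, 512-sample frames, 128-sample hop.
shape = stft_output_shape(16000, n_fft=512, hop_length=128)
print(shape)
```

With a batched input, the batch size would simply be prepended to this shape.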
Magnitude and phase are obtained as follows:
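import torch
# mono: 1-D waveform tensor; len_frame, len_hop: frame and hop sizes in samples.
# This is the legacy real-valued output layout; recent PyTorch releases require an
# explicit return_complex argument and return a complex tensor when it is True.
spec = torch.stft(mono, n_fft=len_frame, hop_length=len_hop)  # shape (N, T, 2)
rea = spec[..., 0]  # real part
imag = spec[..., 1]  # imaginary part
mag = torch.sqrt(torch.pow(rea, 2) + torch.pow(imag, 2))  # magnitude
pha = torch.atan2(imag, rea)  # phase in radians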
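The square-root and atan2 formulas applied to a real/imaginary layout give the same numbers as taking the absolute value and angle of a complex matrix. The sketch below checks this in NumPy only, using made-up random data as a stand-in for an STFT result; the trailing dimension of size 2 is emulated with np.stack and is an assumption for illustration, not output from torch.stft itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an STFT result: N=5 frequency bins, T=4 frames (made-up data).
D = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))

# Real/imaginary layout with a trailing dimension of size 2.
spec = np.stack([D.real, D.imag], axis=-1)  # shape (5, 4, 2)
rea, imag = spec[..., 0], spec[..., 1]

mag = np.sqrt(rea ** 2 + imag ** 2)  # magnitude from the two parts
pha = np.arctan2(imag, rea)          # phase angle in radians

# Compare against the complex-valued route.
same_mag = np.allclose(mag, np.abs(D))
same_pha = np.allclose(pha, np.angle(D))
print(same_mag, same_pha)
```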
librosa.stft
stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=np.complex64, pad_mode='reflect')
Parameters
----------
y : np.ndarray [shape=(n,)], real-valued
    input signal
n_fft : int > 0 [scalar]
    length of the windowed signal after padding with zeros.
    The number of rows in the STFT matrix `D` is (1 + n_fft/2).
    The default value, n_fft=2048 samples, corresponds to a physical
    duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the
    default sample rate in librosa. This value is well adapted for music
    signals. However, in speech processing, the recommended value is 512,
    corresponding to 23 milliseconds at a sample rate of 22050 Hz.
    In any case, we recommend setting `n_fft` to a power of two for
    optimizing the speed of the fast Fourier transform (FFT) algorithm.
hop_length : int > 0 [scalar]
    number of audio samples between adjacent STFT columns.
    Smaller values increase the number of columns in `D` without
    affecting the frequency resolution of the STFT.
    If unspecified, defaults to `win_length / 4` (see below).
win_length : int <= n_fft [scalar]
    Each frame of audio is windowed by `window()` of length `win_length`
    and then padded with zeros to match `n_fft`.
    Smaller values improve the temporal resolution of the STFT (i.e. the
    ability to discriminate impulses that are closely spaced in time)
    at the expense of frequency resolution (i.e. the ability to discriminate
    pure tones that are closely spaced in frequency). This effect is known
    as the time-frequency localization tradeoff and needs to be adjusted
    according to the properties of the input signal `y`.
    If unspecified, defaults to ``win_length = n_fft``.
window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]
    Either:
    - a window specification (string, tuple, or number);
      see `scipy.signal.get_window`
    - a window function, such as `scipy.signal.hanning`
    - a vector or array of length `n_fft`
    Defaults to a raised cosine window ("hann"), which is adequate for
    most applications in audio signal processing.
    .. see also:: `filters.get_window`
center : boolean
    If `True`, the signal `y` is padded so that frame
    `D[:, t]` is centered at `y[t * hop_length]`.
    If `False`, then `D[:, t]` begins at `y[t * hop_length]`.
    Defaults to `True`, which simplifies the alignment of `D` onto a
    time grid by means of `librosa.core.frames_to_samples`.
    Note, however, that `center` must be set to `False` when analyzing
    signals with `librosa.stream`.
    .. see also:: `stream`
dtype : numeric type
    Complex numeric type for `D`. Default is single-precision
    floating-point complex (`np.complex64`).
pad_mode : string or function
    If `center=True`, this argument is passed to `np.pad` for padding
    the edges of the signal `y`. By default (`pad_mode="reflect"`),
    `y` is padded on both sides with its own reflection, mirrored around
    its first and last sample respectively.
    If `center=False`, this argument is ignored.
    .. see also:: `np.pad`
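The frame durations and row counts quoted for n_fft in the parameter list follow from simple arithmetic: duration = n_fft / sample_rate and rows = 1 + n_fft / 2. A small check, using hypothetical helper names that are not part of librosa:

```python
def frame_duration_ms(n_fft, sample_rate=22050):
    """Physical duration of one analysis frame, in milliseconds."""
    return 1000.0 * n_fft / sample_rate


def num_stft_rows(n_fft):
    """Number of rows (frequency bins) in the one-sided STFT matrix D."""
    return 1 + n_fft // 2


for n_fft in (2048, 512):
    print(n_fft, round(frame_duration_ms(n_fft)), num_stft_rows(n_fft))
```

At the 16 kHz sample rate common for speech corpora, the same 512-sample frame would last 32 ms, so the choice of n_fft is best made together with the sample rate.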
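The function represents a signal in the time-frequency domain by computing the discrete Fourier transform (DFT) over short overlapping windows. It returns a complex-valued matrix D, where np.abs(D) is the magnitude and np.angle(D) is the phase.
Magnitude and phase are obtained as follows: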
import librosa
import numpy as np
spec = librosa.stft(mono, n_fft=len_frame, hop_length=len_hop)
mag = np.abs(spec)    # magnitude
pha = np.angle(spec)  # phase angle in radians
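To see why np.abs and np.angle yield magnitude and phase, it may help to build a small STFT by hand. The sketch below is NumPy-only. naive_stft is a hypothetical helper that assumes center=False and a periodic Hann window; it is not librosa's implementation, and the 1 kHz tone is made-up test input.

```python
import numpy as np


def naive_stft(y, n_fft, hop_length):
    """Uncentered STFT with a periodic Hann window (illustrative sketch)."""
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    num_frames = 1 + (len(y) - n_fft) // hop_length
    # Slice overlapping frames, window each one, stack them as columns.
    frames = np.stack(
        [y[t * hop_length : t * hop_length + n_fft] * window
         for t in range(num_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)  # shape (1 + n_fft // 2, num_frames)


sr = 8000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz tone

D = naive_stft(y, n_fft=256, hop_length=64)
mag, pha = np.abs(D), np.angle(D)

# The strongest bin of the first frame should sit at the tone frequency.
peak_bin = int(mag[:, 0].argmax())
print(D.shape, peak_bin * sr / 256)
```

Multiplying mag by np.exp(1j * pha) recovers D, which is what makes the magnitude/phase pair a complete description of the complex spectrum.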
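Alternatively, use the helper that librosa.core provides. Note that magphase returns the phase as a unit-modulus complex matrix (so that spec = mag * pha), not as angles; apply np.angle to it if phase angles are needed:
spec = librosa.stft(mono, n_fft=len_frame, hop_length=len_hop)
mag, pha = librosa.core.magphase(spec)  # pha is complex, exp(1j * angle)
pha_angle = np.angle(pha)  # phase angle in radians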
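The relationship between the two return values of magphase can be sketched without librosa. The NumPy-only example below assumes the documented contract that the input equals magnitude times a unit-modulus complex phase factor; it uses made-up random data and is not librosa's own code.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up complex matrix standing in for an STFT result.
D = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

mag = np.abs(D)                   # magnitude
phase = np.exp(1j * np.angle(D))  # unit-modulus complex phase factor

angles = np.angle(phase)          # phase angles in radians
roundtrip_ok = np.allclose(mag * phase, D)
unit_ok = np.allclose(np.abs(phase), 1.0)
print(roundtrip_ok, unit_ok)
```

Keeping the phase as a complex factor is convenient when the magnitude is modified (for example by a mask) and the spectrum is then rebuilt as new_mag * phase before inversion.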