Reposted from: https://blog.csdn.net/qq_18124075/article/details/78867536
Speaker Recognition
Here, the author works through two speaker-recognition baseline models using the MSR Identity Toolbox for MATLAB.
1. GMM-UBM Speaker Recognition
```matlab
nSpeakers = 20;     % Number of speakers
nDims = 13;         % dimensionality of feature vectors
nMixtures = 32;     % How many mixtures used to generate data
nChannels = 10;     % Number of channels (sessions) per speaker
nFrames = 100;      % Frames per speaker (1 second at 100 frames/s)
nWorkers = 1;       % Number of parfor workers, if available
```
For convenience, instead of a standard speech corpus such as TIMIT, random multi-channel data is generated directly (10 channels per speaker). trainSpeakerData and testSpeakerData are 20×10 cell arrays: 20 speakers by 10 channels, with the training and test sets matched speaker for speaker. Each cell holds a 13×100 matrix, where 13 is the per-frame feature dimensionality and 100 is the number of frames; in a real system the framed speech would first pass through MFCC feature extraction.
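A quick sanity check of that layout (the sizes follow directly from the configuration above; this snippet is illustrative and not part of the original demo):

```matlab
% Inspect the synthetic data layout described above.
size(trainSpeakerData)        % -> [20 10]  (nSpeakers x nChannels)
size(trainSpeakerData{1,1})   % -> [13 100] (nDims x nFrames)
```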
```matlab
% Pick random centers for all the mixtures.
mixtureVariance = .10;
channelVariance = .05;   % (set in the original demo but unused below; channelCenters is scaled by 0.1 directly)
mixtureCenters = randn(nDims, nMixtures, nSpeakers);
channelCenters = randn(nDims, nMixtures, nSpeakers, nChannels)*.1;
trainSpeakerData = cell(nSpeakers, nChannels);
testSpeakerData = cell(nSpeakers, nChannels);
speakerID = zeros(nSpeakers, nChannels);

% Create the random data. Both training and testing data have the same
% layout.
disp('Create the random data');
for s=1:nSpeakers
    trainSpeechData = zeros(nDims, nFrames);   % nDims x nFrames (the original demo used nMixtures here and relied on MATLAB growing the matrix)
    testSpeechData = zeros(nDims, nFrames);
    for c=1:nChannels
        for m=1:nMixtures
            % Create data from mixture m for speaker s
            frameIndices = m:nMixtures:nFrames;
            nMixFrames = length(frameIndices);
            trainSpeechData(:,frameIndices) = ...
                randn(nDims, nMixFrames)*sqrt(mixtureVariance) + ...
                repmat(mixtureCenters(:,m,s),1,nMixFrames) + ...
                repmat(channelCenters(:,m,s,c),1,nMixFrames);
            testSpeechData(:,frameIndices) = ...
                randn(nDims, nMixFrames)*sqrt(mixtureVariance) + ...
                repmat(mixtureCenters(:,m,s),1,nMixFrames) + ...
                repmat(channelCenters(:,m,s,c),1,nMixFrames);
        end
        trainSpeakerData{s, c} = trainSpeechData;
        testSpeakerData{s, c} = testSpeechData;
        speakerID(s,c) = s; % Keep track of who this is
    end
end
```
```matlab
% Step1: Create the universal background model from all the training speaker data
disp('Create the universal background model');
nmix = nMixtures;       % In this case, we know the # of mixtures needed
final_niter = 10;
ds_factor = 1;
ubm = gmm_em(trainSpeakerData(:), nmix, final_niter, ds_factor, nWorkers);
```
Step 2 builds each speaker's acoustic model from the UBM by maximum a posteriori (MAP) adaptation. The strategy is to measure how well the target speaker's training vectors in trainSpeakerData fit the UBM obtained in Step 1, and to shift each Gaussian component of the UBM toward those vectors, producing the target speaker's model. EM-style re-estimation formulas then yield the optimal parameters of each adapted model; the standard mean update is sketched below.
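For reference, the classic MAP mean update for this kind of adaptation (the standard GMM-UBM formulation; the symbols are mine, not toolbox identifiers): for mixture $k$ with occupation count $n_k$ and posterior-weighted data mean $E_k(x)$ over the speaker's frames,

$$
\hat{\mu}_k = \alpha_k\, E_k(x) + (1-\alpha_k)\,\mu_k, \qquad \alpha_k = \frac{n_k}{n_k + \tau},
$$

where $\mu_k$ is the UBM mean and $\tau$ is the relevance factor, the `map_tau = 10.0` in the code below.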
```matlab
% Step2: Now adapt the UBM to each speaker to create GMM speaker model.
disp('Adapt the UBM to each speaker');
map_tau = 10.0;     % relevance factor
config = 'mwv';     % adapt means (m), weights (w), and variances (v)
gmm = cell(nSpeakers, 1);
for s=1:nSpeakers
    disp(['for the ',num2str(s),' speaker...']);
    gmm{s} = mapAdapt(trainSpeakerData(s, :), ubm, map_tau, config);
end
```
Step 3 computes a score for every speaker model. Unlike speaker identification, speaker verification asks whether a given test utterance (here drawn from testSpeakerData) was spoken by a claimed target speaker; this experiment enrolls 20 speakers. If the test utterance and the target model come from the same speaker, the trial is a target test; if they come from different speakers, it is a non-target test. Each trial is scored with the likelihood ratio of the adapted speaker model against the UBM.
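Concretely, for test frames $X = \{x_1,\dots,x_T\}$ the standard GMM-UBM verification score is the average log-likelihood ratio (notation mine, not from the code):

$$
\Lambda(X) = \frac{1}{T}\sum_{t=1}^{T}\Big[\log p(x_t \mid \lambda_{\text{spk}}) - \log p(x_t \mid \lambda_{\text{UBM}})\Big].
$$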
```matlab
% Step3: Now calculate the score for each model versus each speaker's data.
% Generate a list that tests each model (first column) against all the
% testSpeakerData.
trials = zeros(nSpeakers*nChannels*nSpeakers, 2);
answers = zeros(nSpeakers*nChannels*nSpeakers, 1);
for ix = 1 : nSpeakers
    b = (ix-1)*nSpeakers*nChannels + 1;
    e = b + nSpeakers*nChannels - 1;
    trials(b:e, :) = [ix * ones(nSpeakers*nChannels, 1), (1:nSpeakers*nChannels)'];
    % Mark the nChannels target trials (test utterances from speaker ix).
    answers((ix-1)*nChannels+b : (ix-1)*nChannels+b+nChannels-1) = 1;
end
disp('Calculate the score for each model vs test speaker');
gmmScores = score_gmm_trials(gmm, reshape(testSpeakerData', nSpeakers*nChannels,1), trials, ubm);
```
Finally, compute the AUC and EER metrics. In an open-set speaker identification system, the test utterance's score must be compared against a threshold to decide whether the speaker is outside the enrolled set. In a speaker verification system, the test score is likewise judged against a threshold: if it exceeds the threshold, the claimed target speaker is accepted; otherwise the utterance is attributed to an impostor. The choice of threshold therefore directly affects system performance, and in practical systems it has drawn wide attention, with many effective threshold-selection methods proposed; among the most common is the equal error rate (EER) operating point, where the false-acceptance and false-rejection rates are equal. The author also adds AUC, which makes comparison with deep-learning methods convenient.
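For intuition, here is a minimal, self-contained way to compute the EER directly from scores and labels. It is an illustrative re-implementation, not the toolbox's compute_eer, and the function name is hypothetical:

```matlab
function eer = simple_eer(scores, labels)
% SIMPLE_EER  Equal error rate from trial scores.
%   scores: N x 1 trial scores; labels: N x 1 (1 = target, 0 = non-target).
%   Sweeps a threshold over the observed scores and returns the point
%   where false-acceptance and false-rejection rates (nearly) cross.
thresholds = sort(scores);
far = zeros(size(thresholds));   % false-acceptance rate per threshold
frr = zeros(size(thresholds));   % false-rejection rate per threshold
for i = 1:numel(thresholds)
    accept = scores >= thresholds(i);
    far(i) = sum(accept  & labels==0) / sum(labels==0);
    frr(i) = sum(~accept & labels==1) / sum(labels==1);
end
[~, idx] = min(abs(far - frr));
eer = (far(idx) + frr(idx)) / 2;
end
```

Called as `simple_eer(gmmScores, answers)`, it should land close to the EER reported by `compute_eer` below.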
```matlab
% Step4: Now compute the EER and plot the DET curve and confusion matrix
imagesc(reshape(gmmScores, nSpeakers*nChannels, nSpeakers))
title('Speaker Verification Likelihood (GMM Model)');
ylabel('Test # (Channel x Speaker)'); xlabel('Model #');
colorbar; drawnow; axis xy
figure
disp('Compute the EER');
[eer, auc] = compute_eer(gmmScores, answers, true);
```
2. i-vector-based GMM-UBM Speaker Recognition

The same synthetic data and UBM recipe are reused; on top of them, this part computes Baum-Welch statistics, trains a total-variability space, extracts i-vectors, and applies LDA followed by Gaussian PLDA.
```matlab
% Step1: Create the universal background model from all the training speaker data
nmix = nMixtures;       % In this case, we know the # of mixtures needed
final_niter = 10;
ds_factor = 1;
ubm = gmm_em(trainSpeakerData(:), nmix, final_niter, ds_factor, nWorkers);
%%
% Step2.1: Calculate the statistics needed for the iVector model.
stats = cell(nSpeakers, nChannels);
for s=1:nSpeakers
    for c=1:nChannels
        [N, F] = compute_bw_stats(trainSpeakerData{s,c}, ubm);
        stats{s,c} = [N; F];   % zeroth- and first-order Baum-Welch statistics
    end
end
```
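Steps 2.2 onward rest on the total-variability model (the standard i-vector formulation, stated here for context): each utterance's GMM mean supervector is modeled as

$$
M = m + T\,w,
$$

where $m$ is the UBM mean supervector, $T$ is a low-rank total-variability matrix (`tvDim = 100` columns below), and $w$ is the utterance's i-vector, given a standard-normal prior.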
```matlab
% Step2.2: Learn the total variability subspace from all the speaker data.
tvDim = 100;
niter = 5;
T = train_tv_space(stats(:), ubm, tvDim, niter, nWorkers);
%
% Now compute the ivectors for each speaker and channel. The result is size
% tvDim x nSpeakers x nChannels.
devIVs = zeros(tvDim, nSpeakers, nChannels);
for s=1:nSpeakers
    for c=1:nChannels
        devIVs(:, s, c) = extract_ivector(stats{s, c}, ubm, T);
    end
end
```
```matlab
%%
% Step3.1: Now do LDA on the iVectors to find the dimensions that matter.
ldaDim = min(100, nSpeakers-1);   % LDA yields at most nSpeakers-1 dimensions
devIVbySpeaker = reshape(devIVs, tvDim, nSpeakers*nChannels);
[V, D] = lda(devIVbySpeaker, speakerID(:));
finalDevIVs = V(:, 1:ldaDim)' * devIVbySpeaker;
% Step3.2: Now train a Gaussian PLDA model with development i-vectors.
nphi = ldaDim;      % should be <= ldaDim
niter = 10;
pLDA = gplda_em(finalDevIVs, speakerID(:), nphi, niter);
```
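For context, the Gaussian PLDA model fitted by `gplda_em` is conventionally written (symbols mine, not toolbox identifiers) as

$$
w = \mu + \Phi\,\beta + \varepsilon, \qquad \beta \sim \mathcal{N}(0, I),\quad \varepsilon \sim \mathcal{N}(0, \Sigma),
$$

with $\Phi$ an `nphi`-column speaker subspace, $\beta$ a latent speaker factor shared by all of a speaker's i-vectors, and $\varepsilon$ a residual term covering channel variability.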
```matlab
%%
% Step4.1: OK now we have the channel and LDA models. Let's build actual speaker
% models. Normally we do that with new enrollment data, but now we'll just
% reuse the development set.
averageIVs = mean(devIVs, 3);           % Average i-vectors across channels.
modelIVs = V(:, 1:ldaDim)' * averageIVs;
% Step4.2: Now compute the ivectors for the test set,
% and score the utterances against the models.
testIVs = zeros(tvDim, nSpeakers, nChannels);
for s=1:nSpeakers
    for c=1:nChannels
        [N, F] = compute_bw_stats(testSpeakerData{s, c}, ubm);
        testIVs(:, s, c) = extract_ivector([N; F], ubm, T);
    end
end
testIVbySpeaker = reshape(permute(testIVs, [1 3 2]), ...
    tvDim, nSpeakers*nChannels);
finalTestIVs = V(:, 1:ldaDim)' * testIVbySpeaker;
```
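The excerpt ends before the scoring step. Following the structure of Step 4 in Part 1, a plausible completion scores every model against every test i-vector with the toolbox's Gaussian PLDA scorer; a sketch under that assumption:

```matlab
% Step5 (sketch): score all model/test i-vector pairs with Gaussian PLDA,
% then evaluate as in Part 1. Assumes score_gplda_trials from the same
% toolbox; 'answers' is the trial-label vector built in Part 1.
ivScores = score_gplda_trials(pLDA, modelIVs, finalTestIVs);  % nSpeakers x (nSpeakers*nChannels)
imagesc(ivScores')
title('Speaker Verification Likelihood (iVector Model)');
ylabel('Test # (Channel x Speaker)'); xlabel('Model #');
% Stack tests for model 1, then model 2, ... to match the ordering of 'answers'.
ivScores = reshape(ivScores', nSpeakers*nChannels*nSpeakers, 1);
[eer, auc] = compute_eer(ivScores, answers, true);
```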