Gai Chaoxu (蓋晁旭), Liang Longkai (梁隆愷), He Yongjun (何勇軍)



Abstract: Over the past few decades, speaker recognition has been studied broadly and in depth, and many effective methods have been proposed. Current mainstream methods, such as the Gaussian mixture model-universal background model (GMM-UBM) and the Gaussian mixture model-support vector machine (GMM-SVM), achieve good recognition performance but require sufficient training and test data. This requirement is often hard to meet in practical applications, and recognition rates then drop sharply. To address this problem, a speaker recognition method based on sparse coding is proposed. In the training stage, the method learns a speech dictionary for each speaker; in the recognition stage, it sparsely represents the test speech over each dictionary and scores each speaker by the reconstruction error. Experiments show that, with small amounts of clean training and test speech, the proposed method achieves better recognition results than GMM-UBM and GMM-SVM.
Keywords: speaker recognition; Gaussian mixture model; support vector machine; sparse coding
DOI: 10.15938/j.jhust.2017.03.003
CLC number: TN912.3
Document code: A
Article ID: 1007-2683(2017)03-0013-06
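The DET curves obtained in the experiments show directly that, under both conditions (a training duration of 8 s with test durations of 2 s, 3 s, 4 s, and 5 s; and training durations of 4 s, 5 s, 6 s, and 7 s with a test duration of 2 s), the recognition results of the compared methods improve to varying degrees as the duration of the speech data increases. In every experiment, the sparse-coding-based speaker recognition method performs clearly better than GMM-UBM and GMM-SVM. This is because sparse coding exploits the inherent sparsity of speech signals and, when relatively little speech data is available, represents speech features better than a Gaussian mixture model.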
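For readers unfamiliar with DET curves, the following minimal sketch shows how the points of such a curve can be computed from verification scores. It is only an illustration: the score lists are made-up values, not data from the experiments, and the function names `det_points` and `equal_error_rate` are hypothetical. A real DET plot would also draw both axes on a normal-deviate scale, which is omitted here.

```python
def det_points(target_scores, impostor_scores):
    """Return one (false_alarm_rate, miss_rate) pair per candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    points = []
    for t in thresholds:
        # Miss: a genuine (target) trial is rejected.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: an impostor trial is accepted.
        false_alarm = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((false_alarm, miss))
    return points


def equal_error_rate(points):
    """Approximate the EER at the operating point where the two rates are closest."""
    false_alarm, miss = min(points, key=lambda p: abs(p[0] - p[1]))
    return (false_alarm + miss) / 2


# Hypothetical scores for illustration only.
target_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]

points = det_points(target_scores, impostor_scores)
eer = equal_error_rate(points)
print(points)
print("approximate EER:", eer)
```

A curve that lies closer to the origin corresponds to a better system, which is how the comparison between methods is read off the plots.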
3 Conclusion
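In current practical applications, the recognition rates of mainstream speaker recognition methods based on Gaussian mixture models, such as GMM-UBM and GMM-SVM, drop sharply as the training and test data decrease. Reducing a method's data requirements while preserving recognition performance is therefore important. This paper proposes a speaker recognition method based on sparse coding. The dictionaries are obtained by training rather than by collecting exemplars, which further ensures that speech is sparse over the dictionary. The test speech is then scored on each trained dictionary, and the final recognition result is given according to the scores. The experimental results show that, in a noise-free environment with little training and test speech, the sparse-coding-based method achieves better recognition results than GMM-UBM and GMM-SVM.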
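The scoring step summarized above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used in the paper: the dictionaries are hand-written orthonormal atoms rather than trained ones, the sparse solver is plain greedy matching pursuit chosen for brevity, the frames are invented 4-dimensional vectors rather than real speech features, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def mp_residual_energy(frame, atoms, n_atoms):
    """Greedy matching pursuit over unit-norm atoms.

    Returns the squared reconstruction error left after n_atoms selections.
    """
    residual = list(frame)
    for _ in range(n_atoms):
        corrs = [dot(atom, residual) for atom in atoms]
        # Pick the atom most correlated with the current residual.
        best = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        residual = [r - corrs[best] * a for r, a in zip(residual, atoms[best])]
    return dot(residual, residual)


def identify(frames, dictionaries, n_atoms=2):
    """Score each speaker by the negative total reconstruction error.

    Returns (best_speaker, scores); a higher score means a better fit.
    """
    scores = {
        speaker: -sum(mp_residual_energy(f, atoms, n_atoms) for f in frames)
        for speaker, atoms in dictionaries.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical per-speaker dictionaries (assumed already trained).
dictionaries = {
    "speaker_a": [unit([1, 1, 0, 0]), unit([1, -1, 0, 0])],
    "speaker_b": [unit([0, 0, 1, 1]), unit([0, 0, 1, -1])],
}
# Hypothetical test frames that lie mostly in speaker_a's subspace.
test_frames = [[2.0, 1.0, 0.1, 0.0], [1.0, 3.0, 0.0, 0.2]]

best, scores = identify(test_frames, dictionaries)
print("identified speaker:", best)
```

The test frames are reconstructed almost exactly by the first dictionary and poorly by the second, so the first speaker receives the higher score. In practice the dictionaries would come from a dictionary-learning procedure and the frames from a feature extractor.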
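The method performs well when speech data is scarce and noise-free, so it has broader practical value and can be used for speaker recognition tasks in which the acoustic environment is relatively ideal. In future work, we will reduce the computational cost of the method and improve its noise robustness, to strengthen its real-time performance and its robustness to environmental noise.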
(Editor: Wen Zeyu, 溫澤宇)