Multi-Scale Kernels for Short Utterance Speaker Recognition

Wei-Qiang Zhang¹, Junhong Zhao²³, Wen-Lin Zhang⁴, Jia Liu¹

¹Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
²State Key Laboratory of Transducer Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
³University of Chinese Academy of Sciences, Beijing 100190, China
⁴Zhengzhou Information Science and Technology Institute, Zhengzhou 450002, China
[email protected], [email protected], [email protected]
Abstract

Short utterances are a great challenge for speaker recognition, because very limited data are available for training and testing. To give a robust estimation, the number of model parameters for a short utterance should be smaller than that for a long utterance; however, this may impede the model's descriptive capability. In this paper, we propose a multi-scale kernel (MSK) approach to solve this problem. We construct a series of kernels with different scales and combine them through multiple kernel learning (MKL) optimization. In this way, both the robustness and the scalability of the model are enhanced. Experimental results on the NIST SRE 2010 10sec-10sec dataset show that the proposed MSK method outperforms the traditional Gaussian mixture model supervector (GSV) followed by support vector machine (SVM) method.

Index Terms: speaker recognition, short utterance, multi-scale kernel
1. Introduction

Although speaker recognition has been developed for decades, short utterance speaker recognition remains a great challenge. Since the amount of data for training and/or testing is very limited, it is not easy to extract sufficient features and produce an effective model to describe the voiceprint information in short utterances.

There have been many attempts to solve the short utterance problem in speaker recognition. For example, Li et al. proposed multi-resolution time-frequency features and combined them at different levels to make better use of multi-resolution temporal-frequency information [1]. Kanagasundaram et al. proposed a source- and utterance-duration normalized linear discriminant analysis approach to compensate session variability in short utterance i-vector systems [2]. Meanwhile, there are alternative approaches which resort to text-dependent information, such as vowel-category information [3, 4], multi-layer acoustic and temporal structure information [5, 6], and so on.

For speaker recognition, the optimal number of model parameters (the model scale) should depend on the utterance duration. Usually, long utterances prefer a big model which can provide sufficient descriptive capability, while short utterances prefer a small model which can be more robustly estimated. In practice, the model scale is usually determined through empirical and experimental methods. In order to give a more elegant solution for model scale selection, in this paper we make
an attempt at text-independent short utterance speaker recognition in the framework of Gaussian mixture model supervector (GSV) modeling followed by a support vector machine (SVM) [7] (GSV for short). Our proposed method constructs a series of kernels with different scales and then combines them through a multiple kernel learning (MKL) optimization method. In this way, the suitable-scale kernels are retained and assigned proper weights, and thus both the robustness and the scalability of the model are enhanced. We refer to this method as the multi-scale kernel (MSK) approach.

The rest of the paper is organized as follows. In Section 2 we summarize the relevant existing approaches. We present our formulation in Section 3. Section 4 demonstrates the effectiveness of the proposed method through detailed experiments. Finally, conclusions are given in Section 5.
2. Related work

The basic idea of this paper is similar to the pyramid match kernel (PMK) [8] method, in which a set of feature vectors is mapped onto a multi-resolution histogram pyramid. In our method the multi-scale property comes from the different numbers of mixtures of the universal background model (UBM), while in PMK it is based on the construction of a pyramid of histograms. In addition, in our method the weight of each kernel is optimized through MKL, while in PMK the weights are determined by the pyramid structure.

Our method is also similar to MKL [9, 10, 11], in which the model is based on a combination of kernels of multiple types, such as Gaussian radial basis function (RBF) kernels and polynomial kernels. Strictly speaking, our method is also a type of MKL, but we focus on a series of same-type kernels with different scales, which are expected to improve the model's multi-scale modeling ability.
3. Multi-scale kernel

In the GMM supervector (GSV) method, the kernel is defined as [7]:

k(x^{(i)}, x^{(j)}) = \sum_{m=1}^{M} \left\langle \sqrt{w_m}\, \Sigma_m^{-1/2} \mu_m^{(i)},\ \sqrt{w_m}\, \Sigma_m^{-1/2} \mu_m^{(j)} \right\rangle,   (1)

where \{w_m, \mu_m, \Sigma_m\}_{m=1}^{M} are the weights, mean vectors and covariance matrices of the adapted GMM, and M is the number of mixtures. Note that in speaker recognition usually only the mean vectors are adapted, so \{w_m, \Sigma_m\}_{m=1}^{M} are borrowed from the UBM and shared by all the speakers.

According to Mercer's theorem, the GSV kernel can be expressed as k(x^{(i)}, x^{(j)}) = \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle, where the feature mapping, which maps the data into the feature space, is defined as

\phi(x) = \left[ \sqrt{w_1}\, \Sigma_1^{-1/2} \mu_1;\ \sqrt{w_2}\, \Sigma_2^{-1/2} \mu_2;\ \ldots;\ \sqrt{w_M}\, \Sigma_M^{-1/2} \mu_M \right].   (2)

The number of mixtures M of the GMM (or UBM) is an important control parameter in the GSV method. We can use a series of UBMs with different numbers of mixtures to construct a set of kernels \{k_n(x^{(i)}, x^{(j)})\}_{n=1}^{N}. Note that different M means different model scale and resolution; for short utterances, this provides the scalability to capture the suitable resolution information for kernel modeling. For this reason, our method is named the multi-scale kernel.

The set of kernels can be combined using the MKL approach [12]:

k_\eta(x^{(i)}, x^{(j)}) = f_\eta\left(\{k_n(x^{(i)}, x^{(j)})\}_{n=1}^{N} \mid \eta\right),   (3)

where \eta is the vector of combination parameters. For a linear combination, this simplifies to

k_\eta(x^{(i)}, x^{(j)}) = \sum_{n=1}^{N} \eta_n k_n(x^{(i)}, x^{(j)}),   (4)

where \eta = [\eta_1, \eta_2, \ldots, \eta_N]^T are the combination weights. The linear combination is equivalent to constructing a new feature mapping:

\phi_\eta(x) = \left[ \sqrt{\eta_1}\, \phi_1(x);\ \sqrt{\eta_2}\, \phi_2(x);\ \ldots;\ \sqrt{\eta_N}\, \phi_N(x) \right].   (5)

The objective of MKL is to learn both the weights \eta and the parameters of the SVM. Several optimization methods exist, such as those proposed in [13, 14]. In the generalized MKL (GMKL) framework [15], the optimization problem can be expressed as

\min_{a, b, \eta}\ \frac{1}{2} a^T a + C \sum_i l(y_i, a^T \phi_\eta(x_i) + b) + r(\eta),
subject to \eta \succeq 0,   (6)

where \{(x_i, y_i)\} are the training samples, C is a predefined positive trade-off parameter, a and b are the parameters of the SVM, l(\cdot) is the loss function and r(\cdot) is the regularizer. The commonly used loss function is \max\{0, 1 - y_i(a^T \phi_\eta(x_i) + b)\}, the same as in the standard SVM. Commonly used regularizers include the l1 norm, l2 norm and lp norm. The l1-norm regularizer, sometimes referred to as the lasso, drives an element-wise shrinkage of \eta towards zero and thus leads to a sparse solution [16]. The l2-norm regularizer, sometimes referred to as ridge regression, forces the weights towards small values and thus prevents overfitting. The lp-norm regularizer with p \geq 1 [17] is more general and may achieve a better trade-off between sparseness and non-sparseness. As a preliminary study, we use only the l2-norm regularizer in this paper. After \eta is optimized, the rest of the process is straightforward. The schematic diagram of our proposed method is illustrated in Figure 1.

Figure 1: Schematic diagram of the multi-scale kernel method: the feature vector series x^{(i)} and x^{(j)} are MAP-adapted against UBM_1, ..., UBM_N, the per-scale mappings \phi_1, ..., \phi_N are stacked, and the combination weights are learned by MKL.
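As a concrete illustration, Eqs. (1), (2) and (5) can be sketched in a few lines of NumPy. All function and variable names below are illustrative (not from any toolkit), diagonal covariances are assumed, and the per-scale UBM parameters and MAP-adapted means are taken as given:

```python
import numpy as np

def gsv_supervector(weights, covars, means):
    """Eq. (2): stack sqrt(w_m) * Sigma_m^(-1/2) * mu_m over all M mixtures.

    weights: (M,) UBM mixture weights
    covars:  (M, D) diagonal covariances, borrowed from the UBM
    means:   (M, D) MAP-adapted mean vectors for one utterance
    """
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(covars)
    return scaled.ravel()  # supervector of length M * D

def multi_scale_supervector(ubms, adapted_means, eta):
    """Eq. (5): concatenate per-scale supervectors weighted by sqrt(eta_n)."""
    parts = [np.sqrt(e) * gsv_supervector(u["w"], u["cov"], m)
             for u, m, e in zip(ubms, adapted_means, eta)]
    return np.concatenate(parts)
```

By construction, the inner product of two multi-scale supervectors equals the weighted kernel sum of Eq. (4), i.e. \sum_n \eta_n k_n(x^{(i)}, x^{(j)}), so the combined kernel can be evaluated as an ordinary dot product.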
4. Experimental results

4.1. Experimental setup

Our experiments are performed on NIST 2010 Speaker Recognition Evaluation (SRE10) [18] telephone data. All experiments were carried out in the 10sec-10sec short utterance condition. To simplify the experimental setup, we did not use channel compensation or score normalization techniques, although they would make our baseline more state-of-the-art.
Our system is gender-dependent. The female dataset contains 266 speakers and 308 test segments, and the male dataset contains 264 speakers and 290 test segments. There are 12948 trials for female and 10858 trials for male. We use SRE04 1side data (368 segments for female and 248 segments for male) for both UBM training and the negative pool modeling.

The PLP features were extracted using a 25 ms Hamming window and a 10 ms frame shift. Twelve cepstral coefficients together with C0 were calculated, and delta and acceleration coefficients were appended to produce a final 39-dimension feature vector. After that, feature warping was applied and 25% of low-energy frames were discarded using a dynamic threshold.

The performance measures are the same as in NIST SRE: equal error rate (EER) and minimum detection cost function (min DCF) [18]. Note that for the 10sec-10sec test condition, the parameters for computing the DCF are CMiss = 10, CFalseAlarm = 1, and PTarget = 0.01 [18].
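Both metrics can be computed from raw target and non-target scores by a simple threshold sweep. The sketch below is a minimal, unoptimized implementation (the function name is ours, not from any toolkit); it evaluates the unnormalized DCF with the SRE10 10sec-10sec cost parameters, so absolute values may differ from NIST's official scoring tool:

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep the decision threshold over all observed scores and
    return (EER, min DCF), where
    DCF(t) = c_miss * P_miss(t) * p_target + c_fa * P_fa(t) * (1 - p_target).
    """
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss: target score below threshold; false alarm: non-target at/above it.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return eer, dcf.min()
```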
4.2. Results and discussions

We first evaluate the GSV method with different model scales, setting the number of mixtures to 128, 256, 512, 1024 and 2048. The EERs and min DCFs are illustrated in Figure 2 and Figure 3. From the results, we can see that the EERs first decrease and then increase with the number of mixtures; the minimum EER occurs at 256 mixtures for female and at 512 mixtures for male. This shows that a large-scale model is not always good for speaker recognition, especially for short utterances; we should select suitable-scale models to give robust modeling. With the optimal model scales, we obtain an EER of 22.60% for female and 20.29% for male.

Figure 2: Performance of GSV on SRE10 10sec-10sec female dataset: EER (%) and min DCF versus the number of mixtures (128 to 2048).

Figure 3: Performance of GSV on SRE10 10sec-10sec male dataset: EER (%) and min DCF versus the number of mixtures (128 to 2048).

We then evaluate the proposed MSK method. We use kernels at different scales with 128, 256, 512 and 1024 mixtures, and optimize the multi-scale kernels with the spectral projected gradient (SPG) descent method [15]. The DET curves of the GSV and MSK methods are plotted in Figure 4 and Figure 5. We can observe that the MSK method outperforms the GSV method, with the EER decreasing from 22.60% to 22.37% for female and from 20.29% to 19.85% for male, and the min DCF decreasing from 1.0690 to 0.9626 for female and from 0.8620 to 0.8047 for male. This shows the effectiveness of the MSK method.

Figure 4: The DET curves of GSV (mix = 256) and the proposed MSK method on SRE10 10sec-10sec female dataset. The crosses denote the EER operating points and the circles denote the min DCF operating points.
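The SPG method of [15] solves the full GMKL problem; as a minimal illustration of its core step, the sketch below runs plain projected gradient descent on a toy quadratic objective with an l2 regularizer and the \eta \geq 0 constraint. The objective and all names are illustrative assumptions, not the actual GMKL objective or solver:

```python
import numpy as np

def projected_gradient(A, b, lam=0.1, lr=0.01, iters=500):
    """Minimize 0.5 * eta^T A eta - b^T eta + lam * ||eta||^2
    subject to eta >= 0, by gradient descent with projection
    (clipping) onto the nonnegative orthant after each step."""
    eta = np.zeros(len(b))
    for _ in range(iters):
        grad = A @ eta - b + 2.0 * lam * eta
        eta = np.maximum(eta - lr * grad, 0.0)  # project onto eta >= 0
    return eta
```

The projection step is what keeps every kernel weight nonnegative, mirroring the \eta \succeq 0 constraint of Eq. (6); kernels whose gradient pushes them negative are simply clipped to zero.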
5. Conclusion and future work

In this paper, we proposed a multi-scale kernel (MSK) method for short utterance speaker recognition. In this method, we construct a series of kernels with different scales and then combine them through multiple kernel learning optimization. In this way, both the robustness and the scalability of the model are enhanced. The preliminary experimental results show that the proposed MSK method outperforms the traditional GSV method for short utterance speaker recognition.
[4] ——, “Short utterance speaker recognition: a research agenda,” in Proc. International Conference on Systems and Informatics (ICSAI), May 2012, pp. 1746–1750.
[5] A. Larcher, K. A. Lee, B. Ma, and H. Li, “The RSR2015: Database for text-dependent speaker verification using multiple pass-phrases,” in Proc. InterSpeech, Portland, Sept. 2012, pp. 1578–1581.
[6] K. A. Lee, A. Larcher, H. Thai, B. Ma, and H. Li, “Joint application of speech and speaker recognition for automation and security in smart home,” in Proc. InterSpeech, Aug. 2011, pp. 3317–3318.
[7] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. ICASSP, Toulouse, May 2006, pp. 97–100.
[8] A. D. Dileep and C. C. Sekhar, “Speaker recognition using pyramid match kernel based support vector machines,” International Journal of Speech Technology, vol. 15, no. 3, pp. 365–379, Sept. 2012.
[9] C. Longworth and M. J. F. Gales, “Multiple kernel learning for speaker verification,” in Proc. ICASSP, Mar. 2008, pp. 1581–1584.
Figure 5: The DET curves of GSV (mix = 512) and the proposed MSK method on SRE10 10sec-10sec male dataset. The crosses denote the EER operating points and the circles denote the min DCF operating points.
[10] T. Ogawa, H. Hino, N. Reyhani, N. Murata, and T. Kobayashi, “Speaker recognition using multiple kernel learning based on conditional entropy minimization,” in Proc. ICASSP, May 2011, pp. 2204–2207.

[11] L. Lin, H. Chen, J. Chen, and H.-M. Jin, “Speaker recognition with short utterances based on multiple kernel SVM-GMM,” Journal of Jilin University, vol. 43, pp. 504–509, Mar. 2013.
As future work, we plan to perform more experiments with channel compensation and score normalization techniques, and to incorporate the MSK idea into state-of-the-art speaker recognition methods, such as i-vector.
[12] M. Gonen and E. Alpaydin, “Multiple kernel learning algorithms,” Journal of Machine Learning Research, vol. 12, pp. 2211– 2268, Jul. 2011.
6. Acknowledgements
[13] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, pp. 2491–2521, Nov. 2008.
This work was supported by the National Natural Science Foundation of China under Grants 61370034 and 61273268.
[14] M. Varma and B. R. Babu, “More generality in efficient multiple kernel learning,” in Proc. International Conference on Machine Learning (ICML), Montreal, Jun. 2009, pp. 1065–1072.
7. References

[1] Z.-Y. Li, W.-Q. Zhang, and J. Liu, “Multi-resolution time frequency feature and complementary combination for short utterance speaker recognition,” Multimedia Tools and Applications, Oct. 2013. [Online]. Available: http://dx.doi.org/10.1007/s11042-013-1705-4.

[2] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez, J. Gonzalez-Rodriguez, and D. Ramos, “Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques,” Speech Communication, vol. 59, pp. 69–82, 2014.

[3] N. Fatima and T. F. Zheng, “Vowel-category based short utterance speaker recognition,” in Proc. International Conference on Systems and Informatics (ICSAI), May 2012, pp. 1774–1778.
[15] A. Jain, S. V. N. Vishwanathan, and M. Varma, “SPG-GMKL: Generalized multiple kernel learning with a million kernels,” in Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug. 2012.

[16] W.-L. Zhang, W.-Q. Zhang, D. Qu, and B.-C. Li, “Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, p. 11, 2014. [Online]. Available: http://asmp.eurasipjournals.com/content/2014/1/11.

[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “lp-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, Mar. 2011.

[18] National Institute of Standards and Technology, “The NIST Year 2010 Speaker Recognition Evaluation Plan,” http://www.itl.nist.gov/iad/mig/tests/sre/2010/index.html.