Multi-Scale Kernels for Short Utterance Speaker Recognition

Wei-Qiang Zhang¹, Junhong Zhao²,³, Wen-Lin Zhang⁴, Jia Liu¹

¹Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
²State Key Laboratory of Transducer Technology, Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
³University of Chinese Academy of Sciences, Beijing 100190, China
⁴Zhengzhou Information Science and Technology Institute, Zhengzhou 450002, China
[email protected], [email protected], zwlin [email protected]

Abstract

Short utterances pose a great challenge for speaker recognition, because very limited data are available for training and testing. To obtain robust estimates, the number of model parameters for a short utterance should be smaller than that for a long utterance; however, this may impede the model's descriptive capability. In this paper, we propose a multi-scale kernel (MSK) approach to solve this problem. We construct a series of kernels with different scales and combine them through multiple kernel learning (MKL) optimization. In this way, both the robustness and the scalability of the model are enhanced. Experimental results on the NIST SRE 2010 10sec-10sec dataset show that the proposed MSK method outperforms the traditional method of Gaussian mixture model supervectors (GSV) followed by a support vector machine (SVM).

Index Terms: speaker recognition, short utterance, multi-scale kernel

1. Introduction

Although speaker recognition has been developed for decades, short utterance speaker recognition remains a great challenge. Since the amount of data available for training and/or testing is very limited, it is not easy to extract sufficient features and build an effective model of the voiceprint information in a short utterance.

There have been many attempts to solve the short utterance problem in speaker recognition. For example, Li et al. proposed multi-resolution time-frequency features and combined them at different levels to make better use of multi-resolution temporal-frequency information [1]. Kanagasundaram et al. proposed a source- and utterance-duration-normalized linear discriminant analysis approach to compensate for session variability in short utterance i-vector systems [2]. There are also alternative approaches that resort to text-dependent information, such as vowel-category information [3, 4] and multi-layer acoustic and temporal structure information [5, 6].

For speaker recognition, the optimal number of model parameters (the model scale) should depend on the utterance duration. Usually, a long utterance favors a big model, which can provide sufficient descriptive capability, while a short utterance favors a small model, which can be estimated more robustly. In practice, the model scale is usually determined through empirical and experimental methods. To give a more elegant solution to model scale selection, in this paper we make an attempt at text-independent short utterance speaker recognition in the framework of Gaussian mixture model supervector (GSV) modeling followed by a support vector machine (SVM) [7] (GSV for short). Our proposed method constructs a series of kernels with different scales and then combines them through multiple kernel learning (MKL) optimization. In this way, kernels of suitable scales are retained and assigned proper weights, so both the robustness and the scalability of the model are enhanced. We refer to this method as the multi-scale kernel (MSK) approach.

The rest of the paper is organized as follows. In Section 2 we summarize the relevant existing approaches. We then present our formulation in Section 3. Section 4 demonstrates the effectiveness of the proposed method through detailed experiments. Finally, conclusions are given in Section 5.

2. Related work

The basic idea of this paper is similar to the pyramid match kernel (PMK) [8] method, in which a set of feature vectors is mapped onto a multi-resolution histogram pyramid. In our method, the multiple scales come from different numbers of mixtures of the universal background model (UBM), while in PMK they are based on the construction of a pyramid of histograms. In addition, in our method the weight of each kernel is optimized through MKL, while in PMK the weights are determined by the pyramid structure.

Our method is also related to MKL [9, 10, 11], in which the model is based on a combination of kernels of multiple types, such as Gaussian radial basis function (RBF) kernels and polynomial kernels. Strictly speaking, our method is a type of MKL, but we focus on a series of same-type kernels with different scales, which are expected to improve the model's multi-scale modeling ability.

3. Multi-scale kernel

In the GMM supervector (GSV) method, the kernel is defined as [7]:

k(x^{(i)}, x^{(j)}) = \sum_{m=1}^{M} \left\langle \sqrt{w_m}\, \Sigma_m^{-1/2} \mu_m^{(i)},\ \sqrt{w_m}\, \Sigma_m^{-1/2} \mu_m^{(j)} \right\rangle,   (1)

where $\{w_m, \mu_m, \Sigma_m\}_{m=1}^{M}$ are the weights, mean vectors and covariance matrices of the adapted GMM, and $M$ is the number of mixtures. Note that in speaker recognition, usually only the mean vectors are adapted, so $\{w_m, \Sigma_m\}_{m=1}^{M}$ are borrowed from the UBM and shared by all the speakers.
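To make Eq. (1) concrete, the following minimal numpy sketch (our illustration, not the authors' code; all function and variable names are assumptions) maps the MAP-adapted means of one utterance to a GSV supervector whose inner product reproduces the kernel, assuming diagonal covariances:

```python
import numpy as np

def gsv_supervector(adapted_means, ubm_weights, ubm_covars):
    """Stack MAP-adapted GMM means into a GSV supervector.

    adapted_means : (M, D) mean vectors adapted to this utterance
    ubm_weights   : (M,)   mixture weights, shared with the UBM
    ubm_covars    : (M, D) diagonal covariances, shared with the UBM
    """
    # Scale mean m by sqrt(w_m) * Sigma_m^{-1/2} (diagonal case), so
    # the plain dot product of two supervectors equals Eq. (1).
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_covars)
    return scaled.reshape(-1)  # a single vector of length M * D

def gsv_kernel(sv_i, sv_j):
    # Eq. (1) as an inner product in the mapped feature space
    return float(np.dot(sv_i, sv_j))
```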

Figure 1: Schematic diagram of the multi-scale kernel method. (The feature vector series $x^{(i)}$ and $x^{(j)}$ are MAP-adapted against UBMs of $N$ different scales; for each scale, the adapted means $\{\mu_m\}$ together with the shared $\{w_m, \Sigma_m\}$ are stacked into supervectors, and the resulting kernels $K_1, \ldots, K_N$ are combined through MKL.)

According to the Mercer theorem, the GSV kernel can be expressed as $k(x^{(i)}, x^{(j)}) = \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$, where the feature mapping that takes the data into the feature space is

\phi(x) = \left[ \sqrt{w_1}\, \Sigma_1^{-1/2} \mu_1^T,\ \sqrt{w_2}\, \Sigma_2^{-1/2} \mu_2^T,\ \ldots,\ \sqrt{w_M}\, \Sigma_M^{-1/2} \mu_M^T \right]^T.   (2)

The number of mixtures $M$ of the GMM (or UBM) is an important control parameter in the GSV method. We can use a series of UBMs with different numbers of mixtures to construct a set of kernels $\{k_n(x^{(i)}, x^{(j)})\}_{n=1}^{N}$. Note that a different $M$ means a different model scale and resolution. For short utterances, this provides the scalability to capture information at the suitable resolution for kernel modeling. For this reason, we name our method the multi-scale kernel. The set of kernels can be combined using the MKL approach [12]:

k_\eta(x^{(i)}, x^{(j)}) = f_\eta\left( \{k_n(x^{(i)}, x^{(j)})\}_{n=1}^{N} \mid \eta \right),   (3)

where $\eta$ is the vector of combination parameters. For a linear combination, this simplifies to

k_\eta(x^{(i)}, x^{(j)}) = \sum_{n=1}^{N} \eta_n k_n(x^{(i)}, x^{(j)}),   (4)

where $\eta = [\eta_1, \eta_2, \ldots, \eta_N]^T$ are the combination weights. The linear combination is equivalent to constructing a new feature mapping:

\phi_\eta(x) = \left[ \sqrt{\eta_1}\, \phi_1(x)^T,\ \sqrt{\eta_2}\, \phi_2(x)^T,\ \ldots,\ \sqrt{\eta_N}\, \phi_N(x)^T \right]^T.   (5)
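As a sketch of Eqs. (4) and (5), assuming each utterance has already been mapped to one supervector per UBM scale (the array names below are illustrative assumptions), the combined kernel matrix is simply a weighted sum of per-scale linear kernels:

```python
import numpy as np

def multi_scale_kernel(supervectors, eta):
    """Combine per-scale linear GSV kernels as in Eq. (4).

    supervectors : list of N arrays, each (num_utts, dim_n), holding the
                   supervectors computed with the n-th UBM scale
                   (e.g. 128, 256, 512, 1024 mixtures)
    eta          : (N,) non-negative combination weights
    """
    num_utts = supervectors[0].shape[0]
    K = np.zeros((num_utts, num_utts))
    for eta_n, X_n in zip(eta, supervectors):
        K += eta_n * (X_n @ X_n.T)  # eta_n * k_n(x_i, x_j)
    return K
```

Equivalently, following Eq. (5), one could scale each per-scale supervector matrix by sqrt(eta_n), concatenate along the feature dimension, and train an ordinary linear SVM on the stacked features.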

The objective of MKL is to learn both the weights $\eta$ and the parameters of the SVM. Several optimization methods exist, such as those proposed in [13, 14]. In the generalized MKL (GMKL) framework [15], the optimization problem can be expressed as

\min_{a, b, \eta} \; \frac{1}{2} a^T a + C \sum_i l\left(y_i, a^T \phi_\eta(x_i) + b\right) + r(\eta) \quad \text{subject to} \quad \eta \succeq 0,   (6)

where $\{(x_i, y_i)\}$ are the training samples, $C$ is a predefined positive trade-off parameter, $a$ and $b$ are the parameters of the SVM, $l(\cdot)$ is the loss function, and $r(\cdot)$ is the regularizer. The commonly used loss function is the hinge loss $\max\{0, 1 - y_i (a^T \phi_\eta(x_i) + b)\}$, the same as in the standard SVM. Commonly used regularizers include the $l_1$ norm, the $l_2$ norm and the $l_p$ norm. The $l_1$-norm regularizer, sometimes referred to as the lasso, drives element-wise shrinkage of $\eta$ towards zero and thus leads to a sparse solution [16]. The $l_2$-norm regularizer, sometimes referred to as ridge regression, forces the weights towards small values and thus prevents overfitting. The $l_p$-norm regularizer with $p \geq 1$ [17] is more general and may achieve a better trade-off between sparseness and non-sparseness. As a preliminary study, we use only the $l_2$-norm regularizer in this paper. Once $\eta$ is optimized, the rest of the process is straightforward. The schematic diagram of our proposed method is illustrated in Figure 1.
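The paper solves Eq. (6) with the SPG-GMKL solver of [15]; purely for illustration, the sketch below uses a much simpler alternating scheme (a hypothetical substitute, not the authors' procedure): with $\eta$ fixed, train a standard SVM on the combined kernel, then take a projected gradient step on $\eta$ for the $l_2$ regularizer $r(\eta) = \lambda \|\eta\|^2$, using the standard MKL gradient $-\frac{1}{2} \delta^T K_n \delta$ where $\delta$ collects the SVM dual coefficients. It assumes scikit-learn's SVC with a precomputed kernel.

```python
import numpy as np
from sklearn.svm import SVC

def train_msk_l2(kernels, y, lam=1.0, lr=0.1, n_iter=50):
    """Alternating optimization sketch for Eq. (6) with r(eta) = lam * ||eta||^2.

    kernels : (N, T, T) stack of per-scale kernel matrices on T training utts
    y       : (T,) labels in {-1, +1} (target speaker vs. negative pool)
    """
    N = kernels.shape[0]
    eta = np.full(N, 1.0 / N)                     # uniform initialization
    for _ in range(n_iter):
        K = np.tensordot(eta, kernels, axes=1)    # combined kernel, Eq. (4)
        svm = SVC(C=1.0, kernel="precomputed").fit(K, y)
        delta = np.zeros(len(y))                  # delta_i = alpha_i * y_i
        delta[svm.support_] = svm.dual_coef_.ravel()
        # Gradient of the objective w.r.t. eta_n: margin term + regularizer
        grad = np.array([-0.5 * delta @ K_n @ delta for K_n in kernels])
        grad += 2.0 * lam * eta
        eta = np.maximum(eta - lr * grad, 0.0)    # projected step, eta >= 0
    return eta, svm
```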

4. Experimental results

4.1. Experimental setup

Our experiments are performed on NIST 2010 Speaker Recognition Evaluation (SRE10) [18] telephone data. All experiments were carried out in the 10sec-10sec short utterance condition. To simplify the experimental setup, we did not use channel compensation or score normalization techniques, although they would make our baseline more state-of-the-art.

Our system is gender-dependent. The female dataset contains 266 speakers and 308 test segments, and the male dataset contains 264 speakers and 290 test segments, giving 12948 trials for female and 10858 trials for male. We use SRE04 1side data (368 segments for female and 248 segments for male) for both UBM training and the negative pool.

The PLP features were extracted using a 25 ms Hamming window and a 10 ms frame shift. Twelve cepstral coefficients together with C0 were calculated, and delta and acceleration coefficients were appended to produce a final 39-dimensional feature vector. After that, feature warping was applied, and 25% of low-energy frames were discarded using a dynamic threshold.

The performance measures are the same as in the NIST SRE: equal error rate (EER) and minimum detection cost function (min DCF) [18]. Note that for the 10sec-10sec test condition, the parameters for computing the DCF are C_Miss = 10, C_FalseAlarm = 1, and P_Target = 0.01 [18].
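For reference, a minimal sketch of how EER and min DCF can be computed from trial scores with these parameters (our own helper, not part of the NIST tooling; the normalization by the cost of a no-decision system follows the evaluation plan):

```python
import numpy as np

def eer_and_min_dcf(scores, labels, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """EER and normalized min DCF from detection scores.

    scores : (n_trials,) scores, higher means more likely target
    labels : (n_trials,) 1 for target trials, 0 for non-target trials
    """
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = labels.sum(), (1.0 - labels).sum()
    # Sweep the decision threshold over the sorted scores
    p_miss = np.cumsum(labels) / n_tar             # targets rejected so far
    p_fa = 1.0 - np.cumsum(1.0 - labels) / n_non   # non-targets still accepted
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[eer_idx] + p_fa[eer_idx])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer, dcf.min() / c_default              # normalized min DCF
```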

4.2. Results and discussions

We first evaluate the GSV method with different model scales, setting the number of mixtures to 128, 256, 512, 1024 and 2048. The EERs and min DCFs are illustrated in Figure 2 and Figure 3. From the results, we can see that the EERs first decrease and then increase as the number of mixtures grows, with the minimum EER occurring at 256 mixtures for female and at 512 mixtures for male. This shows that a large-scale model is not always good for speaker recognition, especially for short utterances; suitable model scales must be selected for robust modeling. According to these results, selecting the optimal model scales yields an EER of 22.60% for female and 20.29% for male.

Figure 2: Performance of GSV on the SRE10 10sec-10sec female dataset (EER (%) and min DCF versus the number of mixtures, 128 to 2048).

Figure 3: Performance of GSV on the SRE10 10sec-10sec male dataset (EER (%) and min DCF versus the number of mixtures, 128 to 2048).

We then evaluate the proposed MSK method. We use kernels with 128, 256, 512 and 1024 mixtures, and optimize the multi-scale kernels with the spectral projected gradient (SPG) descent method [15]. The DET curves of the GSV and MSK methods are plotted in Figure 4 and Figure 5. We observe that the MSK method outperforms the GSV method, with the EER decreasing from 22.60% to 22.37% for female and from 20.29% to 19.85% for male, and the min DCF decreasing from 1.0690 to 0.9626 for female and from 0.8620 to 0.8047 for male. This shows the effectiveness of the MSK method.

Figure 4: The DET curves of GSV (mix = 256) and the proposed MSK method on the SRE10 10sec-10sec female dataset. The crosses denote the EER operating points and the circles denote the min DCF operating points.

Figure 5: The DET curves of GSV (mix = 512) and the proposed MSK method on the SRE10 10sec-10sec male dataset. The crosses denote the EER operating points and the circles denote the min DCF operating points.

5. Conclusion and future work

In this paper, we proposed a multi-scale kernel (MSK) method for short utterance speaker recognition. The method constructs a series of kernels with different scales and combines them through multiple kernel learning optimization, so that both the robustness and the scalability of the model are enhanced. Preliminary experimental results show that the proposed MSK method outperforms the traditional GSV method for short utterance speaker recognition.

As future work, we plan to perform more experiments with channel compensation and score normalization techniques, and to incorporate the MSK idea into state-of-the-art speaker recognition methods such as i-vector.

6. Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61370034 and 61273268.

7. References

[1] Z.-Y. Li, W.-Q. Zhang, and J. Liu, "Multi-resolution time frequency feature and complementary combination for short utterance speaker recognition," Multimedia Tools and Applications, Oct. 2013. [Online]. Available: http://dx.doi.org/10.1007/s11042-013-1705-4
[2] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez, J. Gonzalez-Rodriguez, and D. Ramos, "Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques," Speech Communication, vol. 59, pp. 69–82, 2014.
[3] N. Fatima and T. F. Zheng, "Vowel-category based short utterance speaker recognition," in Proc. International Conference on Systems and Informatics (ICSAI), May 2012, pp. 1774–1778.
[4] ——, "Short utterance speaker recognition: a research agenda," in Proc. International Conference on Systems and Informatics (ICSAI), May 2012, pp. 1746–1750.
[5] A. Larcher, K. A. Lee, B. Ma, and H. Li, "RSR2015: Database for text-dependent speaker verification using multiple pass-phrases," in Proc. InterSpeech, Portland, Sept. 2012, pp. 1578–1581.
[6] K. A. Lee, A. Larcher, H. Thai, B. Ma, and H. Li, "Joint application of speech and speaker recognition for automation and security in smart home," in Proc. InterSpeech, Aug. 2011, pp. 3317–3318.
[7] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, Toulouse, May 2006, pp. 97–100.
[8] A. D. Dileep and C. C. Sekhar, "Speaker recognition using pyramid match kernel based support vector machines," International Journal of Speech Technology, vol. 15, no. 3, pp. 365–379, Sept. 2012.
[9] C. Longworth and M. J. F. Gales, "Multiple kernel learning for speaker verification," in Proc. ICASSP, Mar. 2008, pp. 1581–1584.
[10] T. Ogawa, H. Hino, N. Reyhani, N. Murata, and T. Kobayashi, "Speaker recognition using multiple kernel learning based on conditional entropy minimization," in Proc. ICASSP, May 2011, pp. 2204–2207.
[11] L. Lin, H. Chen, J. Chen, and H.-M. Jin, "Speaker recognition with short utterances based on multiple kernel SVM-GMM," Journal of Jilin University, vol. 43, pp. 504–509, Mar. 2013.
[12] M. Gonen and E. Alpaydin, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211–2268, Jul. 2011.
[13] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," Journal of Machine Learning Research, vol. 9, pp. 2491–2521, Nov. 2008.
[14] M. Varma and B. R. Babu, "More generality in efficient multiple kernel learning," in Proc. International Conference on Machine Learning (ICML), Montreal, Jun. 2009, pp. 1065–1072.
[15] A. Jain, S. V. N. Vishwanathan, and M. Varma, "SPG-GMKL: Generalized multiple kernel learning with a million kernels," in Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug. 2012.
[16] W.-L. Zhang, W.-Q. Zhang, D. Qu, and B.-C. Li, "Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, p. 11, 2014. [Online]. Available: http://asmp.eurasipjournals.com/content/2014/1/11
[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "lp-norm multiple kernel learning," Journal of Machine Learning Research, vol. 12, pp. 953–997, Mar. 2011.
[18] National Institute of Standards and Technology, "The NIST Year 2010 Speaker Recognition Evaluation Plan," http://www.itl.nist.gov/iad/mig/tests/sre/2010/index.html
