THU-EE System Fusion for the NIST 2012 Speaker Recognition Evaluation

Wei-Qiang Zhang, Zhi-Yi Li, Weiwei Liu, Jia Liu

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

[email protected]

Abstract

This paper introduces the system fusion of THU-EE for the NIST 2012 Speaker Recognition Evaluation (SRE12). In our approach, mean Z-norm, pseudo post probability T-norm, and bi-criterion optimization are used. These methods address the changed evaluation rules and the limitations of our development data, both of which are new problems in SRE12. Post evaluation on the SRE12 core test validates the effectiveness of our approach.

Index Terms: speaker recognition, system fusion, bi-criterion optimization

1. Introduction

With the advances in speaker recognition, many systems with different modeling methods have emerged, such as the classical Gaussian mixture model - universal background model (GMM-UBM) [1], the support vector machine based on GMM supervectors (GSV-SVM) [2], joint factor analysis (JFA) [3], and the more popular i-vector [4]. Researchers tend to employ system fusion to take advantage of multiple systems and to gain robustness against the impact of noise, channel, and other factors. In the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluations (SREs), this trend is even more evident: in order to obtain better performance, almost all sites submit results based on many heterogeneous systems [5]. During previous evaluations, many sites adopted Niko Brummer's well-known toolkits, such as FoCal [6] and BOSARIS [7], for score normalization, calibration, and fusion. This has become common practice in the SRE community. In the year 2012, however, the evaluation rules [8] changed in several ways: 1) the knowledge of all target speakers can be used; 2) only the log-likelihood-ratio (LLR) is valid as an output score; and 3) the primary performance measure is based on two operating points instead of one. In order to deal with these changes, the standard system fusion procedures have room for improvement. In this paper, we introduce the system fusion workflow of our site, namely THU-EE, for the NIST 2012 Speaker Recognition Evaluation (SRE12). We mainly focus on the differences between our method and the standard methods through post evaluation. The previously used standard methods could not make full use of the knowledge of all target speakers, so some of them had to rely on additional development data. Moreover, standard score fusion is optimized for a single operating point. The present study addresses the recent SRE12: it conforms to the new evaluation rules and builds on limited development data, which was not considered in earlier studies. The rest of this paper is organized as follows. First we give the framework of the fusion method in section 2 and

then describe the details. In section 3, we introduce our Z-norm and T-norm methods. Section 4 describes the bi-criterion optimization strategy. Detailed experiments and performance comparison results are presented in section 5. Finally, section 6 gives conclusions.

2. System Fusion Framework

The commonly used procedure for system fusion includes zero normalization (Z-norm) [9], test normalization (T-norm) [9], pool adjacent violators (PAV) calibration [7], and score fusion [7]. Z-norm and T-norm both need a cohort data set, and calibration and fusion both need development data sets. For SRE12, we tried to use as much data as possible for training speaker models, so the remaining data available for development were limited. In addition, the evaluation rules permit operations across multiple target speakers. We therefore modified the standard system fusion procedure; our flowchart is illustrated in Fig. 1. First, the scores of each system are normalized. Unlike the standard normalization methods, our Z-norm only subtracts the mean, and our T-norm formally computes a posterior probability. This alleviates the system's dependence on the development set. After that, the scores from different systems are fused without calibration. In the score fusion, a bi-criterion optimization scheme is used, which is designed to deal with the two-operating-point performance measure. These methods are described in detail in the following sections. After score fusion, we convert the simple LLR to the compound LLR; this is implemented using Niko Brummer's program [10]. The overall system fusion is performed gender- and channel-dependently: we partition the data into female/male gender and telephone/interview channel conditions.

Figure 1: Flowchart of the system fusion procedure: raw scores → mean Z-norm → posterior probability T-norm → bi-criterion optimization → simple-to-compound LLR conversion → compound LLR.

3. Mean Z-Norm and Pseudo Post Probability T-Norm

The standard form of Z- or T-norm is [9]

$$s' = \frac{s - \mu}{\sigma}, \qquad (1)$$

where s and s′ are the input and output scores, and µ and σ are the mean and standard deviation obtained from the cohort set. We should note that both Z-norm and T-norm are impostor-based normalizations; that is, all the trials used for estimating the normalization parameters should be non-target trials. In our case, however, the development data have been seen in training, so some of the trials are target trials and their scores tend to be greater than normal values. With these scores, the estimated means and standard deviations will not be reliable, so we try to use these parameters as little as possible. In our Z-norm, we only subtract the mean without dividing by the standard deviation, i.e.,


$$s_Z(\mathrm{target}\ i, \mathrm{test}\ j) = s(\mathrm{target}\ i, \mathrm{test}\ j) - \mu(\mathrm{target}\ i), \qquad (2)$$

where µ(target i) is obtained from the development set. We call this the mean Z-norm (MZ-norm), and we believe this simplified version is more reliable than the standard one.

In SRE12, knowledge of all targets is allowed in computing each trial's detection score [8]. In addition, although not explicitly declared, all the target speakers are disjoint. These two points differ from the previous SREs, in which we could not use the knowledge of other target speakers and several different target models often belonged to the same speaker. If we assume that all the target speakers constitute a closed set, then we can calculate the posterior probability for each test segment:

$$s_{ZT}(\mathrm{spk}\ i, \mathrm{seg}\ j) = \frac{\exp\{s_Z(\mathrm{spk}\ i, \mathrm{seg}\ j)\}}{\sum_{\forall i'} \exp\{s_Z(\mathrm{spk}\ i', \mathrm{seg}\ j)\}}. \qquad (3)$$

This is an operation across all the target speakers, and it can be seen as an alternative to the standard T-norm. The idea is borrowed from language recognition, and we call it pseudo post probability T-norm (PT-norm) in this paper. PT-norm has two important properties:

P1) For any specific segment j, if the input scores satisfy s_Z(spk i1, seg j) > s_Z(spk i2, seg j), then the output scores satisfy s_ZT(spk i1, seg j) > s_ZT(spk i2, seg j). This property preserves the order of the trials within the same test segment.

P2) For any segment j, the output scores satisfy $\sum_{\forall i} s_{ZT}(\mathrm{spk}\ i, \mathrm{seg}\ j) = 1$. This property eliminates the score variability across different test segments.

Let us see a simple example, as illustrated in Fig. 2. There are five target speakers and three segments. Before PT-norm, the scores of seg2 are lower than those of the other test segments as a whole, which leads the score of the target trial (spk2, seg2) to be less than those of some non-target trials. After PT-norm, the scores become well-ordered.

One may argue that speaker recognition is not a closed-set problem. In fact, this does not matter. Because the target speakers are disjoint, there is at most one target trial for any test segment. And because there are thousands of target speakers (the number of known speakers in SRE12 is 1918), the denominator in (3) is strictly or approximately based on impostors and its estimation is reliable.


Figure 2: Illustration of scores (a) before and (b) after PT-norm. The circle size indicates the score value. The red circles denote target trials and the blue circles denote non-target trials.
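To make eqs. (2)-(3) and properties P1-P2 concrete, here is a minimal numpy sketch. The function names mz_norm and pt_norm are our own illustrative choices (not from any toolkit), and the toy score matrix loosely mimics Fig. 2, with five speakers, three segments, and seg2 shifted low as a whole:

```python
import numpy as np

def mz_norm(raw, dev_means):
    """Mean Z-norm, eq. (2): subtract each target speaker's mean score,
    estimated on the development set; no division by sigma."""
    return raw - dev_means[:, None]

def pt_norm(s_z):
    """Pseudo post probability T-norm, eq. (3): a softmax over all
    target speakers, computed separately for each test segment."""
    e = np.exp(s_z - s_z.max(axis=0, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=0, keepdims=True)

# Toy example: rows are 5 target speakers, columns are 3 test segments;
# the middle column (seg2) is shifted low, as in Fig. 2.
raw = np.array([[2.0, -3.0, 1.0],
                [0.5, -2.0, 0.8],   # (spk2, seg2) is the target trial
                [0.3, -4.5, 2.5],
                [0.9, -4.0, 0.2],
                [1.1, -3.8, 0.6]])
s_zt = pt_norm(mz_norm(raw, dev_means=np.zeros(5)))
print(s_zt.sum(axis=0))     # property P2: each column sums to 1
print(s_zt.argmax(axis=0))  # property P1: within-segment order preserved
```

After normalization, the target trial (spk2, seg2) has the highest posterior within its segment, and its value is directly comparable in scale to the scores of the other segments.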

There are two advantages of the PT-norm. One is that it does not need an additional cohort set, because the operation can be completed on the scores of each test segment against all the target speakers. The other is that it may need no further score calibration. We know that the aim of calibration is to transform the score into an LLR. If we regard s_ZT as a posterior probability, then in theory the LLR is

$$\ell = \log \frac{s_{ZT}}{1 - s_{ZT}}. \qquad (4)$$

For the non-target trials, s_ZT approaches zero, so we can use

$$\ell \approx \log s_{ZT} \qquad (5)$$

as an estimate.
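A short self-contained snippet comparing the exact conversion (4) with the approximation (5); the example posterior values here are made up for illustration:

```python
import numpy as np

s_zt = np.array([1e-4, 1e-3, 0.6])       # example PT-norm posteriors
llr_exact = np.log(s_zt / (1.0 - s_zt))  # eq. (4)
llr_approx = np.log(s_zt)                # eq. (5); accurate when s_zt << 1
print(llr_exact - llr_approx)            # negligible except for the large posterior
```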

4. Bi-Criterion Optimization

In the BOSARIS toolkit, the standard objective function for score fusion is the empirical evaluation criterion [11]:

$$E_{\log}(L \mid \pi) = \frac{\pi}{|T|} \sum_{t \in T} \log\left(1 + e^{-\ell_t}\right) + \frac{1-\pi}{|N|} \sum_{t \in N} \log\left(1 + e^{\ell_t}\right), \qquad (6)$$

where π is the target prior, T and N are the sets of indices of target and non-target trials, and ℓ_t is the LLR of the t-th trial. It can be seen that E_log(L|π) is related to the prior π, or more precisely, to the so-called effective prior [7], which depends on the detection cost function (DCF) parametrization:

$$\tilde{\pi} = \frac{\pi C_{\mathrm{miss}}}{\pi C_{\mathrm{miss}} + (1-\pi) C_{\mathrm{fa}}}, \qquad (7)$$

where C_miss and C_fa are the costs of miss and false alarm, respectively. In SRE12, however, the primary performance measure is based on two operating points [8], i.e., C_miss = C_fa = 1, π1 = 0.01, π2 = 0.001. From these parameters, we get π̃1 = 0.01 and π̃2 = 0.001. In the score fusion optimization, we can set the effective prior to either π̃1 or π̃2, or even to some value between them, but no single setting is optimal according to the evaluation criterion. In order to match the evaluation criterion, we use the following bi-criterion:

$$E_{\log}(L) = \frac{E_{\log}(L \mid \tilde{\pi}_1) + E_{\log}(L \mid \tilde{\pi}_2)}{2}. \qquad (8)$$

This can be easily implemented by a simple modification of the BOSARIS toolkit. It should be noted that

$$E_{\log}(L) \neq E_{\log}(L \mid (\tilde{\pi}_1 + \tilde{\pi}_2)/2). \qquad (9)$$

This is because, in (6),

$$\ell_t = \alpha + \sum_{i=1}^{N} \beta_i s_{it} + \operatorname{logit} \tilde{\pi}, \qquad (10)$$

where N is the number of subsystems to be fused, s_it is the score of subsystem i for trial t, and α and {β_i} are the parameters to be optimized. In addition, in the implementation, (6) is normalized by the prior entropy [7], defined as C*(π̃) = −π̃ log(π̃) − (1 − π̃) log(1 − π̃). As a result, E_log(L|π̃) is not a simple linear function of π̃.

We validated the bi-criterion optimization with synthetic data generated by demo_make_data_for_fusion of BOSARIS [7]. We performed system fusion with the single criteria π̃1 = 0.01 and π̃2 = 0.001, and also with bi-criterion optimization. The normalized DCF curves are plotted in Fig. 3. We can observe that criterion π̃1 is better in the higher-prior region while criterion π̃2 is better in the lower-prior region, and the bi-criterion curve lies between those of π̃1 and π̃2. This is consistent with our expectations.

Figure 3: Normalized DCF curves of different fusion priors.
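As a concrete illustration, the following Python sketch (our own re-implementation for exposition, not the BOSARIS code itself) computes the prior-entropy-normalized criterion of (6) and the bi-criterion of (8) for a linear fusion of the form (10), and trains α and {β_i} on synthetic scores; the helper names and the use of scipy.optimize are our own choices:

```python
import numpy as np
from scipy.optimize import minimize

def e_log(fused, labels, prior):
    """Prior-entropy-normalized criterion (6); the logit-prior offset of
    (10) is added here, so 'fused' carries only the linear part."""
    llr = fused + np.log(prior / (1.0 - prior))
    tar, non = llr[labels == 1], llr[labels == 0]
    obj = (prior * np.logaddexp(0.0, -tar).mean()          # log(1 + e^-l)
           + (1.0 - prior) * np.logaddexp(0.0, non).mean())  # log(1 + e^l)
    c_star = -prior * np.log(prior) - (1.0 - prior) * np.log(1.0 - prior)
    return obj / c_star

def bi_criterion(params, scores, labels, priors=(0.01, 0.001)):
    """Bi-criterion objective (8): average of E_log at the two effective
    priors. scores has shape (n_subsystems, n_trials)."""
    alpha, betas = params[0], params[1:]
    fused = alpha + betas @ scores          # linear fusion of (10)
    return 0.5 * sum(e_log(fused, labels, p) for p in priors)

# Synthetic two-subsystem example (a stand-in for the BOSARIS demo data).
rng = np.random.default_rng(0)
labels = (rng.random(5000) < 0.05).astype(int)
scores = np.vstack([rng.normal(4.0 * labels, 2.0),
                    rng.normal(3.0 * labels, 1.5)])
res = minimize(bi_criterion, np.zeros(1 + len(scores)),
               args=(scores, labels), method="BFGS")
print("alpha =", res.x[0], "betas =", res.x[1:])
```

Training with a single prior corresponds to passing priors=(0.01,) or priors=(0.001,); the averaged objective trades off the two operating points instead of committing to either one.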


5. Experiments

5.1. Experimental Data

The training (enrollment) data were organized based on the lists released by NIST, NIST_SRE12_evaluation_release_target_speaker_2_files_map.v3.txt and NIST_SRE12_target_speaker_2_single_file_per_ldcid_map.v2.txt.v2.1.txt; the rest of the data came from the SRE08 Followup. We tried to use all of these data, excluding repeated speech, for training. There were in total 86,454 segments for training.

The development data were based on NIST_SRE12_target_speaker_2_files_w_sre08_followup_map.v3.txt. We only used the files that did not overlap with the training set. However, these contained no new speech, but rather repeated speech over different channels or with different durations. We also cut the long segments into 30, 60, 90, 120, and 150 sec versions to provide different durations. There were in total 56,535 segments for development purposes.

The test data were based on NIST_SRE12_core.ndx, and the key was based on NIST_SRE12_core_trial_key.v0.csv. There were in total 73,106 segments and 1,741,311 trials for score testing.

Note that we partitioned the data into female/male gender and telephone/interview channel conditions before the release of the NIST keys, and we follow this partition in this paper. These conditions are not the same as the five common conditions provided by NIST. Roughly speaking, the telephone condition includes common conditions 2, 4, and 5, and the interview condition includes common conditions 1 and 3.

5.2. Score Normalization

We first selected the best system (according to the post evaluation) among all our systems for the ZT-norm experiments. This system is based on an English-decoder VAD, PLP features, and an i-vector + PLDA model [12]. Because we did not prepare a ZT-norm cohort set, we could only perform standard Z-norm and MZ-norm on the development set and PT-norm on the test set. The results are listed in Table 1. From the table, we can see that MZ- and PT-norm give quite good results. (Our other systems show similar results, but we omit them here.) At this stage, we cannot perform further experiments comparing our method with the standard ZT-norm, but in this data-limited situation, our approach may be a good choice.

Table 1: ZT-norm experimental results

norm             cond.   EER%   DCF08  DCF10  DCF12
raw              M tel    4.55  0.187  0.528  0.430
                 M int   16.40  0.604  0.816  0.799
                 F tel    3.75  0.175  0.527  0.430
                 F int   12.00  0.455  0.733  0.693
Z- and PT-norm   M tel    5.16  0.215  0.593  0.484
                 M int   16.92  0.563  0.817  0.774
                 F tel    4.74  0.222  0.611  0.509
                 F int   12.51  0.432  0.758  0.694
MZ- and PT-norm  M tel    4.84  0.196  0.457  0.394
                 M int   17.92  0.608  0.807  0.776
                 F tel    4.44  0.192  0.478  0.402
                 F int   12.73  0.445  0.719  0.658

F: female, M: male, tel: telephone, int: interview; DCF: min DCF

5.3. Score Fusion

For the score fusion experiments, we selected our two other best systems; their performance after MZ- and PT-norm is listed in Table 2.

Table 2: Individual system performance

system  cond.   EER%   DCF08  DCF10  DCF12
sys2    M tel    6.05  0.244  0.623  0.521
        M int   17.19  0.569  0.812  0.766
        F tel    5.92  0.269  0.666  0.561
        F int   13.82  0.499  0.745  0.706
sys3    M tel    6.98  0.346  0.876  0.775
        M int   26.00  0.679  0.865  0.824
        F tel    8.16  0.405  0.903  0.843
        F int   22.50  0.581  0.756  0.724

As mentioned previously, our development data have been seen in the training stage. This is undesirable and may give misleading results, so we split the test set into two equal parts and performed 2-fold cross-validation. The results with the single criteria π̃1 = 0.01 and π̃2 = 0.001, and also with bi-criterion optimization, are listed in Table 3. From the results, we can see that the performances with the two single criteria are similar. Generally speaking, π̃1 is slightly better than π̃2 for DCF08, and π̃2 is slightly better than π̃1 for DCF10. This is easy to understand, because the effective prior for DCF10 is exactly π̃ = 0.001, while for DCF08 it is π̃ ≈ 0.0917; matching the effective prior leads to better results. From the last part of Table 3, we can observe that bi-criterion optimization outperforms both π̃1 and π̃2 for DCF12. This shows the effectiveness of our method.

Table 3: Score fusion experimental results (based on 2-fold cross-validation)

fusion   cond.   EER%   DCF08  DCF10  DCF12
π̃1       M tel    3.12  0.134  0.389  0.317
         M int   14.84  0.507  0.775  0.723
         F tel    3.12  0.140  0.441  0.350
         F int   10.13  0.383  0.662  0.612
π̃2       M tel    3.16  0.137  0.391  0.318
         M int   14.84  0.507  0.772  0.720
         F tel    3.14  0.141  0.435  0.347
         F int   10.09  0.383  0.661  0.612
bi-crit  M tel    3.15  0.132  0.368  0.307
         M int   15.02  0.489  0.745  0.697
         F tel    3.14  0.140  0.422  0.339
         F int   10.05  0.371  0.642  0.590

bi-crit: bi-criterion optimization
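A sketch of the 2-fold cross-validation used here, under the assumption that trials are split into halves by test segment (our reading of "two equal parts"); train_fusion and apply_fusion are placeholders for the bi-criterion training and linear fusion of section 4:

```python
import numpy as np

def two_fold_fusion_eval(scores, labels, segment_ids, train_fusion, apply_fusion):
    """Split trials into two halves by test segment, train the fusion on
    one half, score the other half, then swap; returns pooled fused LLRs."""
    segs = np.unique(segment_ids)
    rng = np.random.default_rng(0)
    rng.shuffle(segs)
    fold1 = np.isin(segment_ids, segs[: len(segs) // 2])
    fused = np.empty(len(labels))
    for train_mask in (fold1, ~fold1):
        params = train_fusion(scores[:, train_mask], labels[train_mask])
        fused[~train_mask] = apply_fusion(params, scores[:, ~train_mask])
    return fused  # each trial is scored by the model of the other fold
```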

5.4. PAV Calibration

In SRE12, we did not perform PAV calibration, because we considered the score after PT-norm (in its logarithmic version) to approximate the LLR, and we were not confident in the development data. We post-test PAV calibration in this section. The experiment is also based on 2-fold cross-validation on the test set. In this procedure, we add an additional PAV calibration step after T-norm. The results are listed in Table 4. Comparing them with Table 3, we notice that the results with PAV calibration are worse than those without it. This shows that our PT-norm approach is satisfactory.

Table 4: Score fusion with PAV calibration results (based on 2-fold cross-validation)

fusion               cond.   EER%   DCF08  DCF10  DCF12
pav cal and bi-crit  M tel    3.18  0.136  0.373  0.313
                     M int   15.16  0.502  0.755  0.714
                     F tel    3.17  0.141  0.431  0.348
                     F int   10.66  0.390  0.670  0.613

pav cal: PAV calibration
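PAV is equivalent to isotonic regression, so a minimal calibration sketch can be built with scikit-learn (our substitution for the BOSARIS implementation, which additionally handles ties and LLR conversion in its own way); the function name and the eps clipping are our own choices:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_calibrate(dev_scores, dev_labels, test_scores, eps=1e-6):
    """PAV calibration sketch: isotonic regression (the pool-adjacent-
    violators algorithm) maps raw scores to posterior estimates; the
    empirical target prior is then divided out to obtain LLRs."""
    iso = IsotonicRegression(y_min=eps, y_max=1.0 - eps,
                             out_of_bounds="clip")
    iso.fit(dev_scores, dev_labels)              # dev_labels in {0, 1}
    post = iso.predict(test_scores)
    prior = dev_labels.mean()
    return (np.log(post / (1.0 - post))
            - np.log(prior / (1.0 - prior)))     # posterior odds -> LLR
```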

5.5. Development Set

The training data for score fusion should come from the development set. Although our development data were not good, we still list the results to give the whole picture of our system fusion. The results in Table 5 were trained on the development set and tested on the development set and the test set, respectively. We can observe that there is a certain gap between the performances on the development set and the test set. However, our system fusion strategy relies as little as possible on the development set, so the final results on the test set are acceptable.

Table 5: Score fusion results trained with the development set

data set  cond.   EER%   DCF08  DCF10  DCF12
dev       M tel    2.00  0.054  0.135  0.108
          M int    1.18  0.030  0.053  0.045
          F tel    2.17  0.054  0.132  0.104
          F int    1.88  0.048  0.095  0.079
test      M tel    3.41  0.150  0.422  0.340
          M int   17.02  0.561  0.786  0.747
          F tel    3.66  0.148  0.462  0.371
          F int   12.57  0.437  0.683  0.629

6. Conclusion

In this paper, we introduced our system fusion for NIST SRE12. For score normalization, we proposed the mean Z-norm, which requires only mean values, and the pseudo post probability T-norm, which requires no cohort set and no PAV calibration. For score fusion, we presented a bi-criterion optimization method that matches the evaluation performance measure. The post evaluation experiments on the SRE12 core test show the effectiveness of our approach.

7. Acknowledgements This work was supported by the National Natural Science Foundation of China under Grant No. 61005019, No. 61273268 and No. 90920302.

8. References

[1] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, Jan. 2000.

[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, Toulouse, France, May 2006, pp. I-97-I-100.

[3] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1435-1447, May 2007.

[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788-798, May 2011.

[5] N. Brummer, L. Burget, J. H. Cernocky, et al., "Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7, pp. 2072-2084, Sept. 2007.

[6] N. Brummer, "FoCal: Tools for fusion and calibration of automatic speaker detection systems," http://sites.google.com/site/nikobrummer/focal, July 2005.

[7] N. Brummer and E. de Villiers, "The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing," http://sites.google.com/site/bosaristoolkit/, Dec. 2011.

[8] The National Institute of Standards and Technology, "The NIST year 2012 speaker recognition evaluation plan," http://www.nist.gov/itl/iad/mig/upload/NIST_SRE12_eval_plan-v17-r1.pdf, May 2012.

[9] C. Barras and J.-L. Gauvain, "Feature and score normalization for speaker verification of cellular data," in Proc. ICASSP, Hong Kong, Apr. 2003, vol. 2, pp. 49-52.

[10] N. Brummer, "LLR transformation for SRE'12," http://sites.google.com/site/bosaristoolkit/sre12/llrTrans_simple2compound.m, Sept. 2012.

[11] N. Brummer, Measuring, Refining and Calibrating Speaker and Language Information Extracted from Speech, Ph.D. thesis, University of Stellenbosch, Stellenbosch, South Africa, Dec. 2010.

[12] J. Liu, L. He, Z. Li, et al., "The THU-EE system description for NIST SRE 2012," in Proc. NIST SRE 2012 Workshop, Orlando, Dec. 2012.
