D. Ververidis and C. Kotropoulos, "Accurate estimate of the cross-validated prediction error variance in Bayes classifiers," in Proc. Machine Learning for Signal Processing (MLSP), Thessaloniki, 2007.

ACCURATE ESTIMATE OF THE CROSS-VALIDATED PREDICTION ERROR VARIANCE IN BAYES CLASSIFIERS Dimitrios Ververidis and Constantine Kotropoulos Dept. of Informatics, Aristotle Univ. of Thessaloniki, Box 451, Thessaloniki 54124, Greece. E-mails:{jimver,costas}@aiia.csd.auth.gr

ABSTRACT

A relationship between the variance of the prediction error committed by the Bayes classifier and the mean prediction error was established experimentally in previous work on emotional speech classification within a cross-validation framework. This paper theoretically justifies the validity of that relationship. Furthermore, it proves that the new estimate of the variance of the prediction error, treated as a random variable itself, exhibits a much smaller variance than the usual estimate obtained by cross-validation, even for a small number of repetitions. Accordingly, we claim that the proposed estimate is more accurate than the usual, straightforward estimate of the variance of the prediction error obtained by applying cross-validation.

1. INTRODUCTION

Two popular methods for estimating the prediction error of a classifier are the bootstrap and cross-validation. In these methods, the available dataset is repeatedly divided into a set used for designing the classifier (the training set) and a set used for testing it (the test set). By averaging the prediction error over all repetitions, a more accurate estimate of the prediction error is hopefully obtained than the prediction error of a single repetition of the experiment. Both cross-validation [1] and the bootstrap [2] stem from the jackknife method. The jackknife was introduced by M. Quenouille for finding unbiased estimates of statistics, such as the sample mean and the sample variance [3, 4]. Originally, the jackknife divided the dataset into two equal sets, derived the target statistic over the two sets independently, and then averaged the two estimates in order to obtain an unbiased statistic. Later, the jackknife was extended to split the dataset into many sets of equal cardinality. In another version, the statistic is estimated on the whole dataset except one sample, the procedure is repeated in a cyclic fashion, and the average estimate is finally computed. The latter "leave-one-out" version dominates in practice. A review of jackknife variants can be found in [5]. Ordinary cross-validation is the extension of the jackknife method that derives an unbiased estimate of the prediction error in the "leave-one-out" sense [1]. The computational demands of ordinary cross-validation rise in proportion to the number of samples. A variant of cross-validation with a smaller number of repetitions is s-fold cross-validation. In s-fold cross-validation the dataset is divided into s roughly equal subsets; the samples in s-1 subsets are used for training the classifier, and the samples in the remaining subset are used for testing. The procedure is repeated for each of the s subsets in a cyclic fashion, and the prediction error is estimated by averaging the prediction errors measured in the test phase of the s repetitions. Burman proposed repeated s-fold cross-validation for model selection, which is simply s-fold cross-validation repeated many times [6]. The prediction error measured during cross-validation repetitions is a random variable that follows the Gaussian distribution. Therefore, according to the central limit theorem (CLT), the more repetitions there are, the less the average prediction error varies. Throughout this paper, repeated s-fold cross-validation is simply referred to as cross-validation for short.

The outline of the paper is as follows. Section 2 deals with the prediction error committed by the Bayes classifier. A theoretical analysis of the factors that affect the variance of the prediction error is made in Section 3. A comparison of the proposed method, which predicts the variance of the cross-validated prediction error from a small number of repetitions, against the usual estimate of the variance of the prediction error is presented in Section 4. Finally, Section 5 concludes the paper by indicating future research directions.

(This work has been supported by the FP6 European Union Network of Excellence MUSCLE "Multimedia Understanding through Semantics, Computation and LEarning" (FP6-507752).)
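For concreteness, the following Python sketch (ours, not part of the original paper) implements repeated s-fold cross-validation as just described: in each of B rounds the data are shuffled and split into s folds, each fold serves once as the test set, and the per-fold errors are averaged. The `fit`/`predict` callables and all variable names are placeholders.

```python
import numpy as np

def repeated_s_fold_cv(X, y, fit, predict, s=5, B=10, seed=None):
    """Repeated s-fold cross-validation.

    In each of the B rounds the data are shuffled and split into s roughly
    equal folds; the classifier is designed on s-1 folds and tested on the
    remaining one. Returns the mean and the variance of all recorded
    per-fold prediction errors.
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    errors = []
    for _ in range(B):
        perm = rng.permutation(N)
        folds = np.array_split(perm, s)
        for k in range(s):
            test_idx = folds[k]
            design_idx = np.concatenate([folds[j] for j in range(s) if j != k])
            model = fit(X[design_idx], y[design_idx])
            errors.append(np.mean(predict(model, X[test_idx]) != y[test_idx]))
    errors = np.asarray(errors)
    return errors.mean(), errors.var()
```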

2. CLASSIFIER DESIGN

Let $u^W = \{u_i^W\}_{i=1}^N$ be a set of $N$ samples, where $W = \{w_k\}_{k=1}^K$ is the feature set comprising $K$ features $w_k$. The samples can be considered as independent and identically distributed (i.i.d.) random variables (r.vs) distributed according to the multivariate distribution $F$ of the feature set $W$. Each sample $u_i^W = (y_i^W, l_i)$ is treated as a pattern consisting of a measurement vector $y_i^W$ and a label $l_i \in \{1, 2, \ldots, C\}$, where $C$ is the total number of classes. Let us predict the label of a sample by processing the feature vectors with a classifier.

Cross-validation (CV) calculates the mean over $b = 1, 2, \ldots, B$ prediction error estimates as follows. Let $s \in \{2, 3, \ldots, N/2\}$ be the number of folds the data should be divided into. To find the $b$th prediction error estimate, $N_D = \frac{s-1}{s} N$ samples are randomly selected without re-substitution from $u^W$ to build the design set $u_{Db}^W$, while the remaining $N/s$ samples form the test set $u_{Tb}^W$. The prediction error in CV repetition $b$ is the error committed by the Bayes classifier. For a sample $u_i^W = (y_i^W, l_i) \in u_{Tb}^W$, the class label $\eta$ predicted by the Bayes classifier is given by

  $\eta(y_i^W) = \arg\max_{c=1}^{C} \{ p_b(y_i^W | \Omega_c) \, P_b(\Omega_c) \}$,   (1)

where $P_b(\Omega_c) = N_{cb}/N$ is the a priori class probability, $N_{cb}$ is the number of samples that belong to class $\Omega_c$, $c = 1, 2, \ldots, C$, in the $b$th cross-validation repetition, and $p_b(y_i^W | \Omega_c)$ is the class conditional probability density function (pdf) of the sample $u_i^W$ given $\Omega_c$. The class conditional pdf is modeled as a single Gaussian. Two parameters are required for the Gaussian pdf of each class $\Omega_c$, namely the mean vector $\mu_c$ and the covariance matrix $\Sigma_c$. If $u_{Dbc}^W = \{u_{Db}^W \cap \Omega_c\}$ and $N_{Dbc}$ is the number of samples in $u_{Dbc}^W$, then the class sample mean vector and the class sample dispersion matrix can be used as estimates of $\mu_c$ and $\Sigma_c$, i.e.

  $\hat{\mu}_{bc}^W = \frac{1}{N_{Dbc}} \sum_{u_i^W \in u_{Dbc}^W} y_i^W$,   (2)

  $\hat{\Sigma}_{bc}^W = \frac{1}{N_{Dbc}} \sum_{u_i^W \in u_{Dbc}^W} (y_i^W - \hat{\mu}_{bc}^W)(y_i^W - \hat{\mu}_{bc}^W)^T$.   (3)

If $|\Gamma|$ denotes the determinant of matrix $\Gamma$ and $G(\cdot)$ denotes the Gaussian pdf, then the class conditional pdf is given by

  $p_b(y_i^W | \Omega_c) = G(y_i^W; \hat{\mu}_{bc}^W, \hat{\Sigma}_{bc}^W) = \frac{1}{(2\pi)^{K/2} |\hat{\Sigma}_{bc}^W|^{1/2}} \exp\!\left[-\tfrac{1}{2}(y_i^W - \hat{\mu}_{bc}^W)^T (\hat{\Sigma}_{bc}^W)^{-1} (y_i^W - \hat{\mu}_{bc}^W)\right]$.   (4)

Let $L[l_i, \eta(y_i^W)]$ denote the zero-one loss function between the label $l_i$ and the predicted class label $\eta(y_i^W)$ for $u_i^W$, i.e.

  $L[l_i, \eta(y_i^W)] = \begin{cases} 0 & \text{if } l_i = \eta(y_i^W) \\ 1 & \text{if } l_i \neq \eta(y_i^W) \end{cases}$.   (5)

If $\mathrm{err}(\hat{F}(u_{Db}^W), u_{Tb}^W)$ is the error committed by the Bayes classifier $\hat{F}$, designed using the set $u_{Db}^W$, when it is applied to the set $u_{Tb}^W$ for classification, then the CV estimate of the prediction error in a single repetition $b$ is

  $CV_e^b(u^W) = \mathrm{err}(\hat{F}(u_{Db}^W), u_{Tb}^W) = \frac{1}{N_T} \sum_{u_i^W \in u_{Tb}^W} L[l_i, \eta(y_i^W)]$,   (6)

where $N_T = \mathrm{card}(u_{Tb}^W)$ is the cardinality of the test set $u_{Tb}^W$. The CV estimate of the prediction error over $B$ repetitions is given by

  $MCV_e^B(u^W) = \frac{1}{B} \sum_{b=1}^{B} CV_e^b(u^W)$,   (7)

and its variance is

  $VCV_e^B(u^W) = \frac{1}{B} \sum_{b=1}^{B} \left[CV_e^b(u^W) - MCV_e^B(u^W)\right]^2$.   (8)

In [7], it was experimentally found by linear regression that

  $VCV_e^\infty(u^W) = \frac{s^2}{(s + \sqrt{2})\, N} \, MCV_e^\infty(u^W)\left[1 - MCV_e^\infty(u^W)\right]$.   (9)

In the next section, theoretical evidence about (9) is provided and discussed.
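To make the estimators of this section concrete, here is a Python sketch (ours, not from the paper) of the single-Gaussian Bayes classifier of (1)-(4) together with the per-repetition error (6), its mean (7), and its variance (8). Each repetition uses the random $(s-1)/s$ versus $1/s$ split described above; all function and variable names are ours.

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate per-class priors, mean vectors (2), and dispersion matrices (3)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)      # eq. (3)
        params[c] = (len(Xc) / len(y), mu, Sigma)      # prior, mean, dispersion
    return params

def log_gaussian(x, mu, Sigma):
    """Log of the Gaussian pdf in eq. (4)."""
    K = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (K * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def predict_bayes(params, X):
    """Bayes rule (1): pick the class maximizing p(y | c) P(c)."""
    labels = list(params)
    scores = np.array([[log_gaussian(x, mu, Sigma) + np.log(prior)
                        for prior, mu, Sigma in params.values()]
                       for x in X])
    return np.array(labels)[scores.argmax(axis=1)]

def cv_error_estimates(X, y, s=5, B=10, seed=None):
    """Per-repetition errors (6), their mean (7) and variance (8)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    N_T = N // s
    errs = np.empty(B)
    for b in range(B):
        perm = rng.permutation(N)
        test, design = perm[:N_T], perm[N_T:]
        params = fit_gaussian_bayes(X[design], y[design])
        errs[b] = np.mean(predict_bayes(params, X[test]) != y[test])
    return errs.mean(), errs.var(), errs
```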

3. THEORETICAL ANALYSIS

Lemma 1  For a two-class pattern recognition problem, when each class pdf is modeled by a single Gaussian, the prediction error in one CV repetition $CV_e^b(u^W)$ can be approximated as a function of the difference of the class means.

Proof  Let us assume that the set $u^W = \{u_i^W\}_{i=1}^N$ consists of infinitely many samples ($N = \infty$) that belong to two classes $\Omega_c$, $c = 1, 2$. Each class conditional pdf $p(y | \Omega_c)$ is a Gaussian pdf $G(y; \mu_c, \sigma_c^2)$, where $\mu_c$ and $\sigma_c^2$ are the sample mean and the sample variance of class $\Omega_c$, $c = 1, 2$, respectively. Without any loss of generality we assume $\mu_2 > \mu_1$. Let $P(\Omega_c) = N_c/N$ be the a priori probability of each class. In Figure 1, $P(\Omega_c)\,p(y | \Omega_c)$ is plotted for each class as a function of the measurement value $y$.

[Figure 1: the weighted class pdfs $P(\Omega_1)\,p(y|\Omega_1)$ and $P(\Omega_2)\,p(y|\Omega_2)$ versus $y$, with the thresholds $t'$, $t$, the class means $\mu_1$, $\mu_2$, and the error areas $S_1'$, $S_2$, $S_1$ marked.]
Fig. 1. Prediction error of a Bayes classifier based on a single measurement for a two-class problem.

Let $t$ and $t'$ denote the measurement values where $P(\Omega_c)\,p(y | \Omega_c)$ for both classes are equal. The points $t$, $t'$ can be found by solving the equation $P(\Omega_1)\,p(y | \Omega_1) = P(\Omega_2)\,p(y | \Omega_2)$. The exact solutions are derived in the Appendix. The prediction error for $\Omega_1$ is the sum of the areas $S_1$ and $S_1'$, while the prediction error for class $\Omega_2$ is the area $S_2$. The total prediction error is $P_e = (S_1 + S_1') + S_2$. Because $S_1' \ll S_1$, the term $S_1'$ will be ignored. Then

  $P_e = S_1 + S_2 = P(\Omega_1) \int_t^{+\infty} p(y | \Omega_1)\, dy + P(\Omega_2) \int_{-\infty}^{t} p(y | \Omega_2)\, dy$
    $= \frac{1}{2} - P(\Omega_1)\, \mathrm{sgn}\!\left(\frac{t - \mu_1}{\sigma_1}\right) \mathrm{erf}\!\left(\left|\frac{t - \mu_1}{\sigma_1}\right|\right) + P(\Omega_2)\, \mathrm{sgn}\!\left(\frac{t - \mu_2}{\sigma_2}\right) \mathrm{erf}\!\left(\left|\frac{t - \mu_2}{\sigma_2}\right|\right)$,   (10)

where $\mathrm{sgn}(x)$ is the sign function and $\mathrm{erf}(x)$ is the error function defined as

  $\mathrm{erf}(x) = \int_0^x \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\xi^2}{2}\right) d\xi$.   (11)

The proof of (10) can be found in [8]. It is clearly seen that $P_e$ is a function of the variables $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $t$. In the Appendix, it is shown that the ratio $(t - \mu_c)/\sigma_c$, $c = 1, 2$, is always a function of $\mu_2 - \mu_1$. Let $\varrho = \mu_2 - \mu_1$. Then, $P_e$ can be rewritten as

  $P_e(t, \mu_1, \mu_2, \sigma_1, \sigma_2) = P_e(\varrho, \sigma_1, \sigma_2)$.   (12)

The maximum prediction error $P_e$ is 0.5, attained when $\varrho = 0$ and $\sigma_1 = \sigma_2$. In Figure 2, $P_e(\varrho, \sigma_1, \sigma_2)$ is plotted for $\varrho \in (-\infty, +\infty)$ when $\sigma_1 = \sigma_2$, using (10).

[Figure 2: $P_e(\varrho)$ versus $\varrho$, peaking at 0.5 for $\varrho = 0$ and decaying towards zero as $|\varrho| \to \infty$.]
Fig. 2. Prediction error for a two-class problem, $P_e$, as a function of the difference of class means $\varrho = \mu_2 - \mu_1$.
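As an illustration of (10), the following Python sketch (ours; it assumes NumPy and SciPy are available) computes the decision threshold $t$ from the quadratic solved in the Appendix and evaluates the two-class error as the two tail areas $S_1$ and $S_2$. The standard normal cdf is used instead of the paper's erf convention; with the definition (11) the two forms are equivalent.

```python
import numpy as np
from scipy.stats import norm

def bayes_threshold(mu1, mu2, s1, s2, p1, p2):
    """Threshold(s) where p1*G(y; mu1, s1^2) = p2*G(y; mu2, s2^2).

    Solves the quadratic of the Appendix; for equal variances the single
    root of eq. (50) is returned."""
    Lam = np.log((s2 * p1) / (s1 * p2))
    if np.isclose(s1, s2):
        return np.array([(mu1 + mu2) / 2 + s1**2 / (mu2 - mu1) * Lam])
    a = s2**2 - s1**2                                            # eq. (45)
    b = 2 * mu2 * s1**2 - 2 * mu1 * s2**2                        # eq. (46)
    c = (mu1 * s2)**2 - (mu2 * s1)**2 - 2 * (s1 * s2)**2 * Lam   # eq. (47)
    disc = np.sqrt(b**2 - 4 * a * c)
    return np.array([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])

def prediction_error(mu1, mu2, s1, s2, p1=0.5, p2=0.5):
    """Two-class error of eq. (10): tail of class 1 above t plus tail of
    class 2 below t (the small area S1' is ignored)."""
    t = bayes_threshold(mu1, mu2, s1, s2, p1, p2)
    # keep the threshold lying between the class means (the point t of Fig. 1)
    t = t[np.argmin(np.abs(t - (mu1 + mu2) / 2))]
    S1 = p1 * norm.sf(t, loc=mu1, scale=s1)    # P(Omega_1) * area above t
    S2 = p2 * norm.cdf(t, loc=mu2, scale=s2)   # P(Omega_2) * area below t
    return S1 + S2

# P_e depends on the class means only through rho = mu2 - mu1 (eq. (12));
# it approaches 0.5 as rho -> 0 and decays towards 0 as rho grows.
for rho in [0.5, 1.0, 2.0, 4.0]:
    print(rho, prediction_error(0.0, rho, 1.0, 1.0))
```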

Let us now assume a finite number of samples. In each cross-validation repetition $b$, in order to estimate the parameters $\mu_{cb}$, $\sigma_{cb}$, $c = 1, 2$, on the design set, $N_{1b}$ and $N_{2b}$ samples are selected from the available dataset. According to the CLT [9], the sample means of the measurements $y$ are distributed as

  $\mu_{1b} \sim G\!\left(\mu_1, \frac{\sigma_1^2}{N_{1b}}\right) = G\!\left(\mu_1, \frac{\sigma_1^2}{P_b(\Omega_1)\,N}\right)$,   (13)

  $\mu_{2b} \sim G\!\left(\mu_2, \frac{\sigma_2^2}{P_b(\Omega_2)\,N}\right)$.   (14)

If $\varrho_b = \mu_{2b} - \mu_{1b}$, then it can be shown that

  $\varrho_b \sim G(\varrho, \sigma^2)$,   (15)

where

  $\varrho = \mu_2 - \mu_1$,   (16)

  $\sigma^2 = \frac{\sigma_1^2}{P_b(\Omega_1)\,N} + \frac{\sigma_2^2}{P_b(\Omega_2)\,N}$.   (17)

Let $VCV_e^B(u^W)$ be an estimate of the variance of the r.v. $P_e(\varrho_b)$ and $MCV_e^B(u^W)$ be an estimate of the mean of the r.v. $P_e(\varrho_b)$, when the following assumptions are made to simplify the analysis:

• the design set is used for testing;

• $\sigma_{1b}^2$, $\sigma_{2b}^2$, $P_b(\Omega_1)$, and $P_b(\Omega_2)$ are invariant through the cross-validation repetitions and equal to $\sigma_1^2$, $\sigma_2^2$, $P(\Omega_1)$, and $P(\Omega_2)$, respectively.

Accordingly, $P_e$ can be expressed as a function of a single r.v., namely $\varrho_b$, and (17) reduces to

  $\sigma^2 = \frac{\sigma_1^2}{P(\Omega_1)\,N} + \frac{\sigma_2^2}{P(\Omega_2)\,N}$,   (18)

which concludes the proof of Lemma 1.
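A quick Monte Carlo check (ours; all numeric values are illustrative) of (15)-(18): the difference of the class sample means is drawn repeatedly and its empirical variance is compared with the value predicted by (18).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p1, p2 = 2000, 0.4, 0.6           # sample size and class priors (illustrative)
mu1, mu2, s1, s2 = 0.0, 1.0, 1.0, 1.5
B = 5000                             # Monte Carlo repetitions

rho_b = np.empty(B)
for b in range(B):
    y1 = rng.normal(mu1, s1, size=int(p1 * N))   # class 1 measurements
    y2 = rng.normal(mu2, s2, size=int(p2 * N))   # class 2 measurements
    rho_b[b] = y2.mean() - y1.mean()             # difference of class sample means

# empirical variance of rho_b versus the value predicted by eq. (18)
print(rho_b.var(), s1**2 / (p1 * N) + s2**2 / (p2 * N))
```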

Henceforth, $P_e(\varrho)$ is treated as a function of the r.v. $\varrho$.

Theorem 1  For a singleton feature set (i.e. one that contains only one feature), $VCV_e^B(u^W)$ depends on $MCV_e^B(u^W)$ for two classes (i.e. $C = 2$).

Proof  A qualitative proof is given through an example that demonstrates the dependence of $P_e(\varrho)$ on $\varrho$. Let us derive the distributions of $P_e(\varrho_b^A)$, $P_e(\varrho_b^B)$, and $P_e(\varrho_b^C)$ when $\varrho_b^A$, $\varrho_b^B$, and $\varrho_b^C$ are Gaussian r.vs with means $\varrho^A = 0$, $\varrho^B$, and $\varrho^C$, respectively, and equal standard deviations, as shown in Figure 3. At the bottom of Figure 3, the pdfs of $\varrho_b^A$, $\varrho_b^B$, $\varrho_b^C$ are plotted downwards to maintain readability. On the right side, the pdfs of $P_e(\varrho_b^A)$, $P_e(\varrho_b^B)$, and $P_e(\varrho_b^C)$ are obtained by projecting $\varrho_b^A$, $\varrho_b^B$, $\varrho_b^C$ onto the curve $P_e(\varrho)$, when $P_e(\varrho)$ is approximated by a straight line in a small neighbourhood of $\varrho^A$, $\varrho^B$, and $\varrho^C$. From Figure 3, one can deduce that the variance of $P_e(\varrho_b^A)$ is smaller than the variance of $P_e(\varrho_b^B)$, and moreover, that the variance of $P_e(\varrho_b^B)$ is greater than the variance of $P_e(\varrho_b^C)$. Let $P_e(\varrho_b^C) = \alpha_C \varrho_b + \beta$, where $\alpha_C = \tan(\phi_C)$. Then, the variance of $P_e(\varrho_b^C)$, according to the identity $\mathrm{Var}(\alpha x + \beta) = \alpha^2 \mathrm{Var}(x)$ and (15), is

  $\mathrm{Var}\!\left(P_e(\varrho_b^C)\right) = \alpha_C^2 \sigma^2$.   (19)

It can be seen in Figure 3 that as $\varrho^C \to \infty$, $\alpha_C \to 0$, which combined with (19) yields

  $\lim_{\varrho^C \to \infty} \mathrm{Var}\!\left(P_e(\varrho_b^C)\right) = 0$.   (20)

So, it can be deduced from (19) and (20) that $\mathrm{Var}\!\left(P_e(\varrho_b^C)\right)$ depends on $\varrho^C$.

[Figure 3: the curve $P_e(\varrho)$ with the pdfs of $\varrho_b^A$, $\varrho_b^B$, $\varrho_b^C$ plotted below the $\varrho$ axis and their projections $P_e(\varrho_b^A)$, $P_e(\varrho_b^B)$, $P_e(\varrho_b^C)$, through the slopes $\phi_A$, $\phi_B$, $\phi_C$, shown on the right.]
Fig. 3. Prediction error $P_e$ in a two-class problem as a function of $\varrho$ for three cases A, B, C.

Theorem 1, just described, is extended to the following.

Theorem 2  $VCV_e^B(u^W)$ is: (i) proportional to $s$; (ii) inversely proportional to $N$; and (iii) proportional to $MCV_e^B(u^W)\left[1 - MCV_e^B(u^W)\right]$. On the contrary, $VCV_e^B(u^W)$ depends on neither the cardinality of the feature set $W$, nor the number of classes $C$, nor the prior probabilities $P(\Omega_c)$.

Proof  Let $\Upsilon_{bT}^e$ be the number of samples of the test set that are misclassified in one CV repetition. From (6) we simply have

  $\Upsilon_{bT}^e = \sum_{u_i^W \in u_{Tb}^W} L[l_i, \eta(y_i^W)]$.   (21)

Let also $\mathrm{Prob}(\Upsilon_{bT}^e = k)$ denote the probability that the r.v. $\Upsilon_{bT}^e$ takes the integer value $k$ at the $b$th CV repetition. If $P_{eD}(\varrho_b)$ is the prediction error estimated from the design set at the $b$th repetition, then it can be inferred that $\Upsilon_{bT}^e$ follows the binomial distribution

  $\mathrm{Prob}(\Upsilon_{bT}^e = k) = \binom{N_T}{k} \left[P_{eD}(\varrho_b)\right]^k \left[1 - P_{eD}(\varrho_b)\right]^{N_T - k}$,   (22)

and therefore

  $\mathrm{Var}(\Upsilon_{bT}^e) = N_T \, P_{eD}(\varrho_b)\left[1 - P_{eD}(\varrho_b)\right]$.   (23)

If we assume that

  $MCV_e^B(u^W) = \frac{1}{B} \sum_{b=1}^{B} P_{eD}(\varrho_b)$   (24)

is a better estimate of the prediction error than $P_{eD}(\varrho_b)$, from (23) it can be inferred that

  $\mathrm{Var}(\Upsilon_T^e) = N_T \, MCV_e^B(u^W)\left[1 - MCV_e^B(u^W)\right]$.   (25)

Given (6), (8), and (21),

  $VCV_e^B(u^W) = \mathrm{Var}\!\left(\frac{\Upsilon_T^e}{N_T}\right) = \frac{1}{N_T^2}\,\mathrm{Var}(\Upsilon_T^e)$.   (26)

Given that $N_T = N/s$, (25), and (26), it is concluded that

  $VCV_e^B(u^W) = \frac{s}{N}\, MCV_e^B(u^W)\left[1 - MCV_e^B(u^W)\right]$.   (27)

The result (27) confirms that $VCV_e^B(u^W)$ depends on $MCV_e^B(u^W)$, as Theorem 1 asserts. From the comparison of (27), derived theoretically, and (9), obtained experimentally, it becomes evident that (27) should be multiplied by the factor $\frac{s}{s + \sqrt{2}}$ to match (9) for $B = \infty$. However, when $s \gg 1$, $\frac{s}{s + \sqrt{2}} \to 1$, and the difference between (9) and (27) becomes negligible. We believe that the factor $\frac{s}{s + \sqrt{2}}$ reflects a relationship between the design and test sets. An accurate estimate of $VCV_e^\infty(u^W)$ can be obtained by employing just an estimate $MCV_e^{10}(u^W)$ of $MCV_e^\infty(u^W)$ with $B = 10$ cross-validation repetitions, i.e.

  $\widehat{VCV}_e^\infty(u^W) \simeq \frac{s^2}{(s + \sqrt{2})\, N}\, MCV_e^{10}(u^W)\left[1 - MCV_e^{10}(u^W)\right]$.   (28)

The gains obtained by using (28) in order to estimate $VCV_e^\infty(u^W)$ are theoretically derived in Section 4.
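The practical content of (28) is that the variance estimate inherits the small variability of the mean error rather than the large variability of the sample variance (8). The sketch below (ours; the binomial error model and all numeric values are illustrative) repeats many independent B = 10 runs and compares the spread of the direct estimate (8) with the spread of the proposed estimate (28); the ratio of the two spreads is the gain factor studied in Section 4.

```python
import numpy as np

def proposed_variance(mcv, s, N):
    """Proposed estimate (28) of VCV_e^inf from the mean error of a few repetitions."""
    return s**2 / ((s + np.sqrt(2)) * N) * mcv * (1.0 - mcv)

# Toy model behind eqs. (22)-(23): in each repetition the classifier
# misclassifies a Binomial(N_T, p) share of the N_T = N/s test samples.
# p, N, s, and B are illustrative values, not taken from the paper.
rng = np.random.default_rng(1)
N, s, B, p = 1000, 5, 10, 0.3
N_T = N // s

direct, proposed = [], []
for _ in range(2000):                      # independent 10-repetition CV runs
    errs = rng.binomial(N_T, p, size=B) / N_T
    direct.append(errs.var())                                    # estimate (8)
    proposed.append(proposed_variance(errs.mean(), s, N))        # estimate (28)

# The spread of the proposed estimate across runs is far smaller than the
# spread of the direct estimate; their ratio plays the role of delta in (29).
print(np.var(direct), np.var(proposed), np.var(direct) / np.var(proposed))
```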

4. GAINS OBTAINED BY THE PROPOSED METHOD

Let $VCV_{e;dir}^B(u^W)$ be the variance of the prediction error directly calculated using (8) for $B$ repetitions, and let $VCV_{e;prop}^B(u^W)$ be the variance of the prediction error estimated by (28), for $B$ repetitions in general instead of 10. In order to show that $VCV_{e;prop}^B(u^W)$ is more accurate than $VCV_{e;dir}^B(u^W)$, the following gain factor $\delta$ is defined:

  $\delta \triangleq \frac{\mathrm{Var}\!\left(VCV_{e;dir}^B(u^W)\right)}{\mathrm{Var}\!\left(VCV_{e;prop}^B(u^W)\right)}$.   (29)

If $\delta > 1$, the proposed method yields an estimate of $VCV_e^\infty(u^W)$ that varies much less than the variance estimate delivered by cross-validation, and therefore the proposed method is better than straightforward cross-validation. The numerator and the denominator of (29) are derived separately.

Numerator: Since $CV_e^b(u^W)$ is a binomial r.v., it can be approximated by a Gaussian r.v. [10] if

  $N \, MCV_e^\infty(u^W)\left[1 - MCV_e^\infty(u^W)\right] > 25$.   (30)

According to the CLT, its variance for $B$ repetitions, $VCV_{e;dir}^B(u^W)$, is a r.v. that follows the $\chi^2_{B-1}$ distribution, i.e.

  $\frac{B - 1}{VCV_e^\infty(u^W)}\, VCV_{e;dir}^B(u^W) \sim \chi^2_{B-1}$.   (31)

Given that $\mathrm{Var}(a x) = a^2 \mathrm{Var}(x)$ and the fact that $\mathrm{Var}(\chi^2_n) = 2n$, where $a$, $n$ are constants, from (31) we obtain

  $\mathrm{Var}\!\left(VCV_{e;dir}^B(u^W)\right) = \frac{2}{B - 1}\left[VCV_e^\infty(u^W)\right]^2$.   (32)

From (9) and (32), it is inferred that

  $\mathrm{Var}\!\left(VCV_{e;dir}^B(u^W)\right) = \frac{2 s^4}{(B - 1)(s + \sqrt{2})^2 N^2}\left[MCV_e^\infty(u^W)\right]^2\left[1 - MCV_e^\infty(u^W)\right]^2$.   (33)
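A small simulation (ours; the values of mcv, vcv_inf, and B are illustrative) confirming (32): treating the per-repetition errors as Gaussian, the sample variance over B repetitions has variance $2[VCV_e^\infty]^2/(B-1)$. The unbiased (ddof=1) sample variance is used here so that the chi-square statement (31) holds exactly.

```python
import numpy as np

# Monte Carlo check of eq. (32): the sample variance of B Gaussian
# CV errors has variance 2 * (VCV_inf)^2 / (B - 1).
rng = np.random.default_rng(2)
mcv, vcv_inf, B = 0.3, 4e-4, 10

sample_vars = np.array([
    rng.normal(mcv, np.sqrt(vcv_inf), size=B).var(ddof=1)   # unbiased variance
    for _ in range(100_000)
])
print(sample_vars.var(), 2 * vcv_inf**2 / (B - 1))
```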

Denominator: The variance of $CV_e^b(u^W)$ estimated from $B$ repetitions with the proposed method is the function

  $VCV_{e;prop}^B(u^W) = \frac{s^2}{(s + \sqrt{2})\, N}\, MCV_e^B(u^W)\left[1 - MCV_e^B(u^W)\right]$   (34)

of the r.v. $MCV_e^B(u^W)$, where, according to the CLT,

  $MCV_e^B(u^W) \sim G\!\left(MCV_e^\infty(u^W), \frac{VCV_e^\infty(u^W)}{B}\right)$.   (35)

To derive $\mathrm{Var}\!\left(VCV_{e;prop}^B(u^W)\right)$ approximately, the fundamental theorem governing the transformation of one r.v. [9] is applied. In Figure 4, the function

  $R\!\left(MCV_e^\infty(u^W)\right) = \frac{s^2}{(s + \sqrt{2})\, N}\, MCV_e^\infty(u^W)\left[1 - MCV_e^\infty(u^W)\right]$   (36)

is plotted. The pdf of $MCV_e^B(u^W)$ is plotted on the $x$ axis and the pdf of $VCV_e^B(u^W)$ on the $y$ axis. It can be seen that the pdf of $VCV_{e;prop}^B(u^W)$ is the projection of the pdf of $MCV_e^B(u^W)$ onto the curve $R\!\left(MCV_e^\infty(u^W)\right)$. The curve $R\!\left(MCV_e^\infty(u^W)\right)$ can be approximated by a straight line $y = \tan(\varphi)\, x + \beta$ over the area above the pdf of $MCV_e^B(u^W)$, where

  $\tan(\varphi) = \frac{dR\!\left(MCV_e^\infty(u^W)\right)}{dMCV_e^\infty(u^W)} = \frac{s^2}{(s + \sqrt{2})\, N}\left[1 - 2\, MCV_e^\infty(u^W)\right]$.   (37)

[Figure 4: the curve $R\!\left(MCV_e^\infty(u^W)\right)$ in the $(MCV_e(u^W), VCV_e(u^W))$ plane, with the pdf $f\!\left(MCV_e^B(u^W)\right)$ on the horizontal axis projected through the slope $\varphi$ onto the pdf $f\!\left(VCV_e^B(u^W)\right)$ on the vertical axis.]
Fig. 4. Approximation of $f\!\left(VCV_e^B(u^W)\right)$ by using the derivative of the curve $R$.

Given that $\mathrm{Var}(\tan(\varphi)\, x + \beta) = \tan^2(\varphi)\, \mathrm{Var}(x)$, from (34) and (37) we obtain

  $\mathrm{Var}\!\left(VCV_{e;prop}^B(u^W)\right) = \frac{s^4}{(s + \sqrt{2})^2 N^2}\left[1 - 2\, MCV_e^\infty(u^W)\right]^2 \mathrm{Var}\!\left(MCV_e^B(u^W)\right)$.   (38)

Then, from (9), (35), and (38), we find

  $\mathrm{Var}\!\left(VCV_{e;prop}^B(u^W)\right) = \frac{s^6}{(s + \sqrt{2})^3 N^3 B}\, MCV_e^\infty(u^W)\left[1 - MCV_e^\infty(u^W)\right]\left[1 - 2\, MCV_e^\infty(u^W)\right]^2$.   (39)

By using (29), (33), and (39), $\delta$ is found to be

  $\delta = 2 N\, \frac{B\,(s + \sqrt{2})\, MCV_e^\infty(u^W)\left[1 - MCV_e^\infty(u^W)\right]}{(B - 1)\, s^2\left[1 - 2\, MCV_e^\infty(u^W)\right]^2}$.   (40)

From (40), it is inferred that the gain factor $\delta$ is:

• proportional to the total number of samples $N$;

• not affected dramatically by the number of cross-validation repetitions $B$;

• almost inversely proportional to the number of folds $s$;

• maximized when $MCV_e^\infty(u^W) \to 0.5$, whereas it is minimized when $MCV_e^\infty(u^W) \to 0$ or $MCV_e^\infty(u^W) \to 1$.

These dependencies are illustrated numerically in the sketch below.
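The following short numeric check (ours; the values of s, N, B, and the error levels are illustrative) evaluates (40) directly:

```python
import numpy as np

def gain_delta(M, s, N, B):
    """Gain factor delta of eq. (40)."""
    return (2 * N * B * (s + np.sqrt(2)) * M * (1 - M)
            / ((B - 1) * s**2 * (1 - 2 * M)**2))

# delta grows with N and with M approaching 0.5, and shrinks for small M
for M in [0.01, 0.1, 0.3, 0.45, 0.49]:
    print(M, gain_delta(M, s=2, N=1000, B=10))
```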

By ignoring $B$ in (40), the gain $\delta$ is greater than 1 when

  $0.5 - 0.5\sqrt{\frac{N(s + \sqrt{2})}{N(s + \sqrt{2}) + 2 s^2}} < MCV_e^\infty(u^W) < 0.5 + 0.5\sqrt{\frac{N(s + \sqrt{2})}{N(s + \sqrt{2}) + 2 s^2}}$.   (41)

For example, when the number of folds $s$ equals 2 and the total number of samples is 1000, it can be inferred from (41) that the gain is greater than 1 if $0.001 < MCV_e^\infty(u^W) < 0.999$. The gain is smaller than 1 when $MCV_e^\infty(u^W)$ approaches 0, i.e. when the classes are well separated, or when $MCV_e^\infty(u^W) \to 1$, which corresponds to random classification with a great number of classes. Gain values higher than 900 are obtained as $MCV_e^\infty(u^W)$ tends to 0.5, for any number of classes $C$. In such cases, $\mathrm{Var}\!\left(VCV_{e;prop}^B(u^W)\right) \to 0$, and therefore $VCV_{e;prop}^B(u^W) \to VCV_e^\infty(u^W)$, i.e. $VCV_{e;prop}^B(u^W)$ can be a very accurate estimator of $VCV_e^\infty(u^W)$, even for $B = 10$ repetitions.

5. CONCLUSIONS

In this paper, we studied the cross-validation method when it is applied to obtain an unbiased estimator of the prediction error. On the grounds of experimental findings [7] and the presented theoretical analysis, we derived Eq. (9), which relates the variance of the prediction error to the mean value of the prediction error when an infinite number of cross-validation repetitions is employed. The theoretical analysis began with the variance of the prediction error committed by the Bayes classifier using univariate Gaussian class pdfs (Theorem 1) and was extended to any dimensionality and any number of classes (Theorem 2). The main result of this theoretical analysis, i.e. Eq. (27), indicates that up to the multiplicative factor $\frac{s}{s + \sqrt{2}}$ the experimentally derived relationship (9) conforms with the theoretically derived one (27). Although the proposed equation (9) is valid for an infinite number of cross-validation repetitions, it is proved that the variance of the prediction error, treated as an r.v. itself, exhibits a variance that can be 900 times smaller than that delivered by cross-validation even for a finite number of repetitions, say 10. By exploiting the main result of the paper in (28), we succeeded in speeding up the floating forward feature selection algorithm [11] within the framework of emotional speech classification [7]. The relationships between the sample estimates of the variance and the mean of the prediction error can be extended to other estimates, such as the bootstrap estimates.

Appendix

Given that $p(y | \Omega_1)$ and $p(y | \Omega_2)$ are Gaussians, we prove that the ratios $\frac{t - \mu_i}{\sigma_i}$, $i = 1, 2$, are always functions of $\mu_2 - \mu_1$ for all solutions $t$ of

  $\frac{P(\Omega_1)}{\sigma_1} \exp\!\left\{-\frac{1}{2}\left(\frac{t - \mu_1}{\sigma_1}\right)^2\right\} = \frac{P(\Omega_2)}{\sigma_2} \exp\!\left\{-\frac{1}{2}\left(\frac{t - \mu_2}{\sigma_2}\right)^2\right\}$.   (42)

Proof  Equation (42) leads to

  $t^2(\sigma_2^2 - \sigma_1^2) + t\,(2\mu_2\sigma_1^2 - 2\mu_1\sigma_2^2) + \mu_1^2\sigma_2^2 - \mu_2^2\sigma_1^2 - 2\sigma_1^2\sigma_2^2\Lambda = 0$,   (43)

where $\Lambda = \ln\frac{\sigma_2 P(\Omega_1)}{\sigma_1 P(\Omega_2)}$.

• If $\sigma_1 \neq \sigma_2$ and $P(\Omega_1) \neq P(\Omega_2)$, there are two solutions

  $t_{1,2} = \frac{-\beta \pm \sqrt{\beta^2 - 4\alpha\gamma}}{2\alpha}$, where   (44)

  $\alpha = \sigma_2^2 - \sigma_1^2$,   (45)
  $\beta = 2\mu_2\sigma_1^2 - 2\mu_1\sigma_2^2$,   (46)
  $\gamma = (\mu_1\sigma_2)^2 - (\mu_2\sigma_1)^2 - 2(\sigma_1\sigma_2)^2\Lambda$.   (47)

If

  $\beta^2 - 4\alpha\gamma = (2\sigma_1\sigma_2)^2\left[(\mu_2 - \mu_1)^2 + 2(\sigma_2^2 - \sigma_1^2)\Lambda\right] > 0$,

then (44) leads to

  $t_{1,2} = \frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2}{\sigma_2^2 - \sigma_1^2} \pm \sigma_1\sigma_2\sqrt{\frac{(\mu_2 - \mu_1)^2}{(\sigma_2^2 - \sigma_1^2)^2} + \frac{2\Lambda}{\sigma_2^2 - \sigma_1^2}}$.

Then

  $\frac{t_{1,2} - \mu_1}{\sigma_1} = -(\mu_2 - \mu_1)\frac{\sigma_1}{\sigma_2^2 - \sigma_1^2} \pm \sigma_2\sqrt{\frac{(\mu_2 - \mu_1)^2}{(\sigma_2^2 - \sigma_1^2)^2} + \frac{2\Lambda}{\sigma_2^2 - \sigma_1^2}}$,   (48)

and

  $\frac{t_{1,2} - \mu_2}{\sigma_2} = -(\mu_2 - \mu_1)\frac{\sigma_2}{\sigma_2^2 - \sigma_1^2} \pm \sigma_1\sqrt{\frac{(\mu_2 - \mu_1)^2}{(\sigma_2^2 - \sigma_1^2)^2} + \frac{2\Lambda}{\sigma_2^2 - \sigma_1^2}}$,   (49)

which are indeed functions of $\mu_2 - \mu_1$.

• If $\sigma_1 = \sigma_2 = \sigma$, there is a single solution

  $t = \frac{-\gamma}{\beta} = \frac{\mu_2 + \mu_1}{2} + \frac{\sigma^2}{\mu_2 - \mu_1}\ln\frac{P(\Omega_1)}{P(\Omega_2)}$,   (50)

and therefore

  $\frac{t - \mu_1}{\sigma} = \frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\ln\frac{P(\Omega_1)}{P(\Omega_2)}$,   (51)
  $\frac{t - \mu_2}{\sigma} = -\frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\ln\frac{P(\Omega_1)}{P(\Omega_2)}$,   (52)

which are functions of $\mu_2 - \mu_1$.

• If $\sigma_1 = \sigma_2 = \sigma$ and $P(\Omega_1) = P(\Omega_2)$, then

  $t = \frac{-\gamma}{\beta} = \frac{-\sigma^2(\mu_1^2 - \mu_2^2)}{2\sigma^2(\mu_2 - \mu_1)} = \frac{1}{2}(\mu_2 + \mu_1)$,   (53)

and the ratios are

  $\frac{t - \mu_1}{\sigma} = \frac{\mu_2 - \mu_1}{2\sigma}$,   (54)
  $\frac{t - \mu_2}{\sigma} = -\frac{\mu_2 - \mu_1}{2\sigma}$.   (55)

From (48), (49), (51), (52), (54), and (55), it can be inferred that the ratios $\frac{t - \mu_i}{\sigma_i}$, $i = 1, 2$, are always functions of $\mu_2 - \mu_1$.

6. REFERENCES

[1] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. R. Statist. Soc. (Series B), vol. 36, no. 2, pp. 111–147, 1974.
[2] B. Efron, "Bootstrap methods: another look at the jackknife," Ann. Statist., vol. 7, pp. 1–26, 1979.
[3] M. H. Quenouille, "Approximate tests of correlation in time-series," J. R. Statist. Soc. (Series B), vol. 11, pp. 68–84, 1949.
[4] M. H. Quenouille, "Notes on bias in estimation," Biometrika, vol. 43, no. 3/4, pp. 353–360, 1956.
[5] R. G. Miller, "The jackknife - a review," Biometrika, vol. 61, no. 1, pp. 1–15, 1974.
[6] P. Burman, "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods," Biometrika, vol. 76, no. 3, p. 503, 1989.
[7] D. Ververidis and C. Kotropoulos, "Fast sequential floating forward selection applied to emotional speech features estimated on DES and SUSAS data collections," in Proc. European Signal Processing Conf. (EUSIPCO), 2006.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., N.Y.: Academic Press, 1990.
[9] A. Papoulis and S. U. Pillai, Probability, Random Variables, and Stochastic Processes, N.Y.: McGraw-Hill, 2002.
[10] M. Evans, N. Hastings, and J. B. Peacock, Statistical Distributions, N.Y.: Wiley, 2000.
[11] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, pp. 1119–1125, 1994.
