Revisiting the Effect of Topic Set Size on Retrieval Error

Wei-Hao Lin

Alexander Hauptmann∗

Language Technologies Institute School of Computer Science 5000 Forbes Ave Pittsburgh PA 15213 U.S.A.

Language Technologies Institute School of Computer Science 5000 Forbes Ave Pittsburgh PA 15213 U.S.A.

[email protected]

[email protected]

Categories and Subject Descriptors H.3.4 [Information Storage and Retrieval]: Systems and Software—performance evaluation

General Terms Experimentation, Measurement

Keywords Test Collections, Measurement Error

1. INTRODUCTION

Evaluating retrieval systems in a controlled environment with a large set of topics has been the core paradigm in the information retrieval community. Voorhees and Buckley proposed to estimate the reliability of retrieval experiments by calculating the probability of making contradictory effectiveness judgments between two retrieval systems over two retrieval experiments [2], which we call the Retrieval Experiment Error Rate (REER) in this paper. They showed how the topic set size affects the reliability of retrieval experiments. However, the REER model in that work was justified empirically, without a derivation based on statistical principles. We fill this gap and show that REER can indeed be derived from statistical principles. Based on the derived model, we can explain why a successful experiment design depends on a sufficient number of topics, a large enough difference in measurement scores between systems, and a homogeneous distribution of retrieval scores across topics and systems, which reduces the variance of the score differences.

∗ This work was supported in part by the Advanced Research and Development Activity (ARDA) under contract numbers H98230-04-C-0406 and NBCHC040037.

Copyright is held by the author/owner. SIGIR’05, August 15–19, 2005, Salvador, Brazil. ACM 1-59593-034-5/05/0008.

2. RETRIEVAL EXPERIMENT ERROR RATES

In a TREC-like retrieval experiment, a test document collection and a topic set $T$ of size $|T|$ are given to participants, who run their retrieval systems over the test collection and return a ranked list for each topic. Human assessors manually identify the relevant documents for each topic, and an evaluation metric, for example Mean Average Precision (MAP)¹, is calculated to objectively compare the effectiveness of the retrieval systems.

Now consider two retrieval systems A and B. Denote by $A_1, A_2, \ldots, A_{|T|}$ and $B_1, B_2, \ldots, B_{|T|}$ the average precisions of System A and System B, respectively, for each topic in the topic set $T$. Each $A_i$ is assumed to be sampled independently from an unknown but identical distribution $F_A$ with mean $\mu_A$ and variance $\sigma_A^2$, and each $B_i$ is sampled independently from another unknown but identical distribution $F_B$ with mean $\mu_B$ and variance $\sigma_B^2$.² By definition, the MAP of System A is the average of $A_1, A_2, \ldots, A_{|T|}$, denoted $\bar{A}$, and similarly the MAP of System B is $\bar{B}$. By the central limit theorem, $\bar{A}$ and $\bar{B}$ are approximately normally distributed,

$$\bar{A} \sim N\left(\mu_A, \frac{\sigma_A^2}{|T|}\right) \tag{1}$$

$$\bar{B} \sim N\left(\mu_B, \frac{\sigma_B^2}{|T|}\right) \tag{2}$$

The MAP difference between the two systems is a random variable, denoted $D = \bar{A} - \bar{B}$. Since $\bar{A}$ and $\bar{B}$ are independent, it follows that $D$ is normally distributed,

$$D \sim N\left(\mu_A - \mu_B, \frac{\sigma_A^2 + \sigma_B^2}{|T|}\right) \tag{3}$$
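The distribution of the MAP difference in (3) can be checked by simulation. The following is a minimal sketch, not part of the original study: it assumes per-topic average precisions drawn from two hypothetical Beta distributions (a deliberately non-normal choice on [0, 1]), and the names `BETA_A`, `BETA_B`, `simulate_map_differences`, and `predicted_mean_and_variance` are illustrative.

```python
import random
import statistics

# Hypothetical Beta parameters for the per-topic average precision of two
# systems; they are illustrative assumptions, not values from the paper.
BETA_A = (2.0, 5.0)
BETA_B = (2.0, 6.0)


def beta_mean_and_variance(alpha, beta):
    """Return the mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha / total, alpha * beta / (total ** 2 * (total + 1.0))


def predicted_mean_and_variance(num_topics):
    """Mean and variance of D predicted by (3) for the Beta parameters above."""
    mean_a, var_a = beta_mean_and_variance(*BETA_A)
    mean_b, var_b = beta_mean_and_variance(*BETA_B)
    return mean_a - mean_b, (var_a + var_b) / num_topics


def simulate_map_differences(num_topics, num_trials, seed=0):
    """Simulate D = MAP(A) - MAP(B) over many independent experiments."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(num_trials):
        # Scores of the two systems are drawn independently, as assumed in the text.
        a_scores = [rng.betavariate(*BETA_A) for _ in range(num_topics)]
        b_scores = [rng.betavariate(*BETA_B) for _ in range(num_topics)]
        diffs.append(statistics.fmean(a_scores) - statistics.fmean(b_scores))
    return diffs


if __name__ == "__main__":
    topics = 50
    diffs = simulate_map_differences(topics, num_trials=2000)
    mean_pred, var_pred = predicted_mean_and_variance(topics)
    print("mean of D: simulated", statistics.fmean(diffs), "predicted", mean_pred)
    print("variance of D: simulated", statistics.variance(diffs), "predicted", var_pred)
```

Under these assumptions the simulated mean and variance of the MAP difference should fall close to the values predicted by (3), even though the per-topic scores themselves are not normally distributed.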

Now we can formalize the REER proposed in [2] as the probability that the results of two retrieval experiments are contradictory, i.e. that the sign of the MAP difference in the first experiment, $D_1$, differs from the sign of the MAP difference in the second experiment, $D_2$,

\begin{align}
\mathrm{REER} &= \Pr(D_1 \times D_2 < 0) \tag{4} \\
&= \Pr(D_1 > 0, D_2 < 0) + \Pr(D_1 < 0, D_2 > 0) \tag{5}
\end{align}
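The event in (4) can also be estimated directly by simulation. The sketch below is an illustration under stated assumptions rather than the procedure used in [2]: it draws $D_1$ and $D_2$ independently from the normal distribution in (3) and counts how often their signs disagree. The function name `estimate_reer` and its parameters (`mean_diff` for the difference of the means, `var_sum` for the sum of the two score variances) are hypothetical.

```python
import random


def estimate_reer(mean_diff, var_sum, num_topics, num_trials=100_000, seed=0):
    """Monte Carlo estimate of Pr(D1 * D2 < 0) for two independent experiments."""
    rng = random.Random(seed)
    # Standard deviation of the MAP difference D, taken from (3).
    std_d = (var_sum / num_topics) ** 0.5
    disagreements = 0
    for _ in range(num_trials):
        d1 = rng.gauss(mean_diff, std_d)  # MAP difference in the first experiment
        d2 = rng.gauss(mean_diff, std_d)  # MAP difference in the second experiment
        if d1 * d2 < 0:
            disagreements += 1
    return disagreements / num_trials


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    for topics in (25, 50, 100):
        print(topics, estimate_reer(mean_diff=0.02, var_sum=0.05, num_topics=topics))
```

When the two systems have the same expected score (`mean_diff=0.0`), the two signs are independent coin flips and the estimate should be close to 0.5.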

Since the two retrieval experiments are conducted independently, the joint probability of $D_1$ and $D_2$ is the product of the individual event probabilities,

\begin{align}
\mathrm{REER} &= \Pr(D_1 > 0) \times \Pr(D_2 < 0) + \Pr(D_1 < 0) \times \Pr(D_2 > 0) \notag \\
&= (1 - \Pr(D_1 \le 0)) \times \Pr(D_2 < 0) + \Pr(D_1 < 0) \times (1 - \Pr(D_2 \le 0)) \tag{6}
\end{align}

$\Pr(D \le 0)$ is the cumulative distribution function of $D$ evaluated at zero. From (3) we know $D$ is normally distributed, and hence we can express $\Pr(D \le 0)$ in terms of the standard normal cumulative distribution function $\Phi$,

$$\Pr(D \le 0) = \Phi\left(\frac{-(\mu_A - \mu_B)}{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{|T|}}}\right) \tag{7}$$

Because $D_1$ and $D_2$ follow the same continuous distribution (3), $\Pr(D_i < 0) = \Pr(D_i \le 0) = \Pr(D \le 0)$ for $i = 1, 2$. Plugging (7) back into (6), we finally obtain REER as follows,

$$\mathrm{REER} = 2\,\Phi\left(\frac{-(\mu_A - \mu_B)}{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{|T|}}}\right)\left(1 - \Phi\left(\frac{-(\mu_A - \mu_B)}{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{|T|}}}\right)\right) \tag{8}$$

From (8) it can easily be shown that REER lies between 0 and 0.5, because $\Phi$ ranges between 0 and 1.

¹ Note that the derivation in this section is not restricted to MAP; it applies to other metrics such as Precision at 100.

² If the further assumption is made that $F_A$ and $F_B$ are normal, one can carry out the usual two-sample t-test procedure to test whether the MAPs of the two retrieval systems indeed differ. However, neither [2] nor we make this assumption.

2.1 Approximation

Voorhees and Buckley empirically fitted REER with the following model [2],

$$\mathrm{REER} = b_1 \exp(-b_2 |T|) \tag{9}$$

where $b_1$ and $b_2$ are two parameters. At first sight the empirical model in (9) and our theoretically derived model in (8) bear no resemblance, but we will show that the empirical model is in fact an approximation of the theoretical model.³ The theoretical REER (8) is not in closed form because of the integral in the standard normal cumulative distribution function $\Phi$,

\begin{align}
\Phi(z) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp(-x^2/2)\,dx \tag{10} \\
&= \frac{1}{2}\left(1 + \mathrm{Erf}\left(\frac{z}{\sqrt{2}}\right)\right) \tag{11}
\end{align}

where $z$ is a standard normal random variable, and Erf is the so-called error function. There have been efforts to approximate the Erf function in closed form, one of which was proposed by Williams [3]. Written for the argument that appears in (11), it reads

$$\mathrm{Erf}\left(\frac{x}{\sqrt{2}}\right) \approx \sqrt{1 - \exp\left(\frac{-2x^2}{\pi}\right)} \tag{12}$$

Substituting (11) into (8) gives $\mathrm{REER} = \frac{1}{2}\left(1 - \mathrm{Erf}(z/\sqrt{2})^2\right)$, where $z$ is the argument of $\Phi$ in (8). By replacing the Erf function in (11) with the approximation in (12), the theoretical REER model in (8) can be approximated as follows,

$$\mathrm{REER} \approx \frac{1}{2} \exp\left(-\frac{2}{\pi}\,\frac{(\mu_A - \mu_B)^2}{\sigma_A^2 + \sigma_B^2}\,|T|\right) \tag{13}$$

If we compare the approximation in (13) with the empirical model in (9), they are clearly of exactly the same form, with $b_1 = \frac{1}{2}$ and $b_2 = \frac{2(\mu_A - \mu_B)^2}{\pi(\sigma_A^2 + \sigma_B^2)}$. Therefore, the empirical REER model proposed in [2] is indeed an approximation of the theoretical REER.

³ Note that our goal here is not to approximate REER but to show the connection between the exponential form of the empirical REER model in (9) and the theoretical REER model in (8).

3. DISCUSSIONS

The theoretical REER derivation in (8) not only explains why the empirical REER formula in (9) fitted the TREC evaluation results so well in [2], but also points out three important factors in designing successful retrieval experiments:

Sufficient number of topics. When the topic set size $|T|$ increases, the theoretical REER model in (8) predicts that REER will decrease accordingly, that is, we can be more confident about the effectiveness judgments. This is consistent with the empirical findings in [2].

Large score differences. If the MAPs of two systems differ much, i.e. $\mu_A - \mu_B$ is large, the theoretical REER model in (8) predicts that REER will be smaller. This explains why “a large enough difference between two effectiveness scores” is a general rule of thumb for acceptable experiment design [1].

Small score variances. The last factor is the score variances, i.e. $\sigma_A^2 + \sigma_B^2$. The smaller the score variances, the lower the REER. Consequently, when topic difficulties vary widely, the performance of a retrieval system fluctuates greatly, resulting in a larger variance of the MAP difference. We estimate the variances of the MAP differences for TREC-3 and TREC-6 at selected MAP difference levels⁴, as shown in Table 1. The variance in TREC-6 is larger than that in TREC-3, so according to (8) REER in TREC-6 will be higher than that in TREC-3 at the same MAP difference level. This is consistent with the TREC participants’ impression that TREC-3 is easier than TREC-6 [2].

MAP Difference Level              Variances in TREC-3    Variances in TREC-6
0.01 ≤ µA − µB < 0.02             0.000464               0.000783
0.02 ≤ µA − µB < 0.03             0.000294               0.000814
0.03 ≤ µA − µB < 0.04             0.000434               0.000684

Table 1: The variances at different MAP difference levels in TREC-3 and TREC-6.
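The behaviour described by the three factors can be inspected numerically by evaluating the theoretical model in (8) next to its exponential approximation in (13). The following is a minimal sketch under stated assumptions: the function names and the parameter values (`mean_diff=0.02`, `var_sum=0.05`) are hypothetical and are not estimates from TREC data.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function, as in (11)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reer_exact(mean_diff, var_sum, num_topics):
    """Theoretical REER from (8): 2 * Phi(z) * (1 - Phi(z))."""
    z = -mean_diff / math.sqrt(var_sum / num_topics)
    phi = std_normal_cdf(z)
    return 2.0 * phi * (1.0 - phi)


def reer_approx(mean_diff, var_sum, num_topics):
    """Exponential approximation from (13): b1 * exp(-b2 * |T|) with b1 = 1/2."""
    b2 = (2.0 / math.pi) * mean_diff ** 2 / var_sum
    return 0.5 * math.exp(-b2 * num_topics)


if __name__ == "__main__":
    # mean_diff stands for mu_A - mu_B and var_sum for sigma_A^2 + sigma_B^2;
    # both values are made up for illustration.
    for topics in (5, 25, 50, 100, 200):
        exact = reer_exact(0.02, 0.05, topics)
        approx = reer_approx(0.02, 0.05, topics)
        print(f"|T|={topics:4d}  exact={exact:.4f}  approx={approx:.4f}")
```

Both functions decrease as the topic set grows, as the score difference grows, or as the variance shrinks, and neither exceeds 0.5, which matches the qualitative conclusions above.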

4. REFERENCES

[1] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40. ACM Press, 2000.

[2] E. M. Voorhees and C. Buckley. The effect of topic set size on retrieval experiment error. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 316–323. ACM Press, 2002.

[3] J. D. Williams. An approximation to the probability integral. The Annals of Mathematical Statistics, 17(3):363–365, September 1946.

⁴ Past TREC evaluation results can be found at http://trec.nist.gov/results.html.
