A Comparison of On-Demand Workforce with Trained Judges for Web Search Relevance Evaluation Maria Stone

Omar Alonso

Kylee Kim

Suvda Myagmar

[email protected]

[email protected]

[email protected]

[email protected]

Microsoft 1065 La Avenida Mountain View, CA 94044

ABSTRACT

Evaluation of search engine effectiveness is frequently delegated to expert judges, on the belief that detecting small differences in search relevance requires specialized training and that sensitivity to such differences can be developed through training. Meanwhile, crowdsourcing offers a cheap and attractive alternative, and crowdsourcing platforms make it extremely easy to set up and carry out relevance experiments. The traditional approach to understanding how well crowdsourcing works relative to expert judges is to generate a set of labels pre-judged by the experts and then determine how closely crowdsourced data approximate these expert labels. This approach inherently assumes that expert judges are better able to assign appropriate relevance labels.

Rather than relying on this approach, we conduct an independent test of how well the two groups perform, without using a set pre-judged by expert judges. Instead, we rely on the ranker of one of the major search engines to generate the data. If the ranker is doing its job, then removing the top result should damage most search results pages for most queries. In this paper, we study how well expert judges and Mechanical Turk workers are able to label full, unaltered pages and pages with the top result removed.

Our findings so far show that both groups give slightly higher scores to full pages on average, but the expert judges are not much better than the Mechanical Turk workers. In fact, both groups performed very poorly on this task.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and software — performance evaluation; J.4 [Computer Applications]: Social and Behavioral Sciences.

General Terms
Measurement, Experimentation, Human Factors.

Keywords
Crowdsourcing, relevance assessment, user studies

Proceedings of the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval. Copyright is held by the author(s).

1. INTRODUCTION

Evaluation of search engine effectiveness involves, at least in part, collecting relevance judgments provided by humans. This is normally a very costly and time-consuming process. Most evaluations are done by experts, who require extensive in-house training. For example, all major search engines maintain a pool of either in-house editors or contractors who are trained to perform such evaluations. These in-house systems are not cheap, and they do not scale easily given the effort and time it takes to train expert judges and the cost of employing them for such tasks. Meanwhile, crowdsourcing offers an alternative that may or may not be cheaper, and may or may not be "good enough" for such evaluations. Platforms such as Amazon Mechanical Turk (MTurk) or CrowdFlower make it easy to set up and carry out simple relevance evaluations [1][2][9]. In principle, they offer nearly infinite and instant scalability at an extremely low price. Quinn and Bederson [16] present a taxonomy of crowdsourcing tools, and of the distributed human computation area in general, which classifies and compares various systems.

However, crowdsourcing comes with its own baggage of problems, mainly around quality control. Many of these issues are due to the lack of quality control features available in current crowdsourcing platforms.

There is a lot of disagreement with respect to the quality of crowdsourced data. Some researchers go so far as to say that 90% of data generated in this fashion are worthless [15]. Others claim success with crowdsourcing in the areas of IR, NLP, and machine translation [1][5]. We define success in this context as the ability of workers to reproduce the annotations (or labels) of an existing standard.

Mostly, these papers assume that expert judges are better able to detect minute differences in relevance, and worker performance on a platform (e.g., Mechanical Turk) is evaluated against a gold standard provided by expert judges.

Rather than assuming that expert judges are inherently better and more sensitive to minute differences in relevance, we chose a different approach. We deliberately degrade search results from a well-established search engine by removing the top result or the second result. We presume that the search engine is doing its job, so that on a large sample of queries such a removal should negatively impact relevance. We then ask: which group (Mechanical Turk workers or expert judges) is better able to detect the removal, and under what circumstances?


2. RELATED WORK

Crowdsourcing is a relatively new approach for gathering labels and conducting different kinds of evaluations. Contradictory claims have been made with respect to the quality of crowdsourced tasks in different domains.

On the positive side, Alonso and Mizzaro compared MTurk to TREC on assessing a single topic and found that workers were as good as the original assessors [1]. Ipeirotis points out that once spammers are removed, crowdsourced labeling results are indistinguishable from those of more advanced methods, albeit with some noise and bias that can be corrected [10][11]. Bernstein et al. [4] demonstrate the feasibility of using crowdsourcing for word-processing features such as spelling correction. The NLP community has also been using crowdsourcing with some degree of success. The work by Snow et al. [18] examines the quality of workers on four different NLP tasks: affect recognition, word similarity, textual entailment, and event temporal ordering. Callison-Burch shows how Mechanical Turk can be used for evaluating machine translation [5]. The translation case is particularly interesting because domain experts can be very expensive.

On the negative side, there are many skeptics. Marsden [15] claimed that 90% of crowdsourcing contributions are unusable. Another recent paper simply states that crowd evaluations are "risky" because it was impossible to reproduce the labels provided by experts via a crowdsourced solution [8].

A more philosophical and balanced approach is offered by Bailey et al., who state that it all depends on the goals of the evaluation: the data collected via crowdsourcing are not better or worse, but merely different from the data collected from experts [3].

Very recently, quite a bit of effort has been dedicated to understanding how to use crowdsourcing effectively. Many researchers focused early on factors that impact the quality of crowdsourced tasks, such as the importance of experiment and HIT design [16][19]. In the context of book search, Kazai [12] addressed the three most obvious factors that contribute to the quality of crowdsourced relevance data: worker qualification, pay, and task difficulty. The study concluded that all three factors are important, but pointed out some complexities. For example, better pay increased the quality of data initially, but eventually attracted sophisticated malicious cheaters to the task. While filtering out low-qualification workers improved judgment quality, it also filtered out some honest workers trying to build up their reputation in the system.

A number of tools have been developed to address some of the limitations of MTurk, with TurKit being one of the most popular [14]. TurKontrol, a planner for controlling crowdsourced workflows, is presented in [6]. Platforms such as CrowdFlower offer custom-designed solutions for quality control, for example by including an initial training period and subsequent sporadic insertion of predefined gold standard data [13]. Ipeirotis et al. [11] propose an algorithm that separates a worker's bias from error and generates a scalar score representing the inherent quality of each worker. The two most popular approaches for improving the quality of work have been qualification tests and "honey pots". In a qualification test, the worker has to pass a test correctly before starting to work. The drawbacks of this approach are that it takes more time to complete and that workers can share the answers among themselves, making the test expensive to maintain in the long run. The use of honey pots involves interjecting questions whose answers are known in advance, to make sure that the workers' answers match.

Most quality control systems rely heavily on the existence of a "gold standard" set, which is manually generated by expert judges. Rather than manually generating a gold standard or having a set of HITs with pre-assigned labels, we degrade all the HITs substantially by removing result #1 from the standard list of search results of one of the major search engines. We then offer both unaltered and degraded HITs for judgment to both workers and expert judges. This makes our type of quality control independent of either group's notion of quality and makes the comparison between expert judges and workers more objective.

3. EXPERIMENT SETUP

To generate a query sample, we use a large collection of log data collected via a toolbar of one of the major search engines over a period of 26 weeks in 2009. We randomly select 500 queries from this collection. For each query, we extract the top 10 search results using the publicly available Bing API (http://www.bing.com/developers). We retrieve the Title, URL, and snippet for each result.

Once scraped, we generate the following HITs for each query:

1. Full search results page (10 search results presented unaltered)
2. Degraded search results page (top search result removed from the list of 10 search results)

The HIT preparation process involves retrieving the results for the query set and preparing the data for both platforms (MTurk and an internal MS tool); a small sketch of this step follows Figure 1. The data sets were uploaded at around the same time on both platforms, and both groups evaluate exactly the same content.

To make sure that there are no issues in terms of branding, we present these items to the user only in a plain look and feel, as shown in Figures 1 and 2. We call this a SERP (Search Engine Results Page).

Figure 1. Top portion of the Judge UI, with brief instructions on top.
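To make the preparation step concrete, here is a minimal sketch of how the two SERP variants could be built from a scraped top-10 list. This is illustrative only and not the authors' pipeline: fetch_top10 is a hypothetical stub standing in for the Bing API call, and the data shapes are assumptions.

```python
# Illustrative sketch (not the authors' pipeline): build the "full" and
# "degraded" SERP variants for a query from its scraped top-10 results.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    title: str
    url: str
    snippet: str


def fetch_top10(query: str) -> List[Result]:
    """Hypothetical stub standing in for the Bing API call that returns the
    top 10 results (Title, URL, snippet) for a query."""
    raise NotImplementedError("replace with a real search API call")


def build_hits(query: str, results: List[Result]) -> Dict[str, List[Result]]:
    """Create the two SERP variants judged in the experiment:
    'full' keeps the 10 results unaltered; 'degraded' drops result #1."""
    top10 = results[:10]
    return {"full": top10, "degraded": top10[1:]}
```

Each variant would then be rendered with the plain SERP template shown in Figure 1 and uploaded to both platforms.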


doesn’t naturally have clearly labeled ordered categories, respondents frequently pay more attention to numbers than to verbal labels [7][17]. We believe that relevance represents such a dimension. Figure 2. Bottom portion of Judge UI with scale on the bottom and comments box on the bottom.

We also include the option to justify an answer by expanding more in an open answer format. For the case of MTurk, we pay $0.02 cents per answer with the option of bonus if the feedback was good. There were no honey pots or qualification tests for workers. Only a 97% approval rate was required. Correspondingly, expert judges didn’t receive any special guidelines or any special training. The only benefit they had was their prolonged exposure to and training on other search relevance tasks, but not this specific one.

4. ANALYSIS Seven expert judges contributed their judgments to this task, and 122 Mechanical Turk workers did. On average, individual Mechanical Turk worker contributed approximately 24 judgments each, with some contributing as few as 1, and some contributing as many as 881 judgments. This is expected as work distribution in crowdsourcing follows a power-law. Individual expert judges contributed 479 judgments on average, with as few as 120 and as many as 870 judgments. Given that there was no gold standard of any kind and that different judges were exposed to different subsets of queries, calibration isn’t possible. All of the analyses are based on the raw scores.

The instructions given to the two groups were identical and very simple:

To analyze the data, we first compute means and standard deviations for full and degraded SERP version for each query for each group, and then compute summary statistics for all queries. There were a total of 398 queries for which judgments were available. Of those queries, 350 queries had expert judge data and 389 queries had Mechanical Turk data. Finally, 305 queries had data from both groups. These summary statistics are presented in Table I.

A user of a search engine issued this query and saw the following search results page. Please look at this query and try to understand what this user wanted. How relevant is this Search Results Page to the query? You can click on as many links as you need to determine the relevance of this search results page. Here are some of the characteristics of a highly relevant results page: 

Most relevant pages contain most useful links arranged in the best possible order



It is easy to determine that links are relevant by scanning the text on the search results page before clicking

Group

Mean Degraded

Difference

In some cases, the information contained on search results page is sufficient to answer the query partially or completely

Mean Full

Difference SD

Expert Judges (350 queries)

8.64

8.56

0.12

0.07

Mechanical Turk workers

7.73

7.52

0.21

0.08



Table 1. Means, Differences and Standard Deviations of Differences for Expert and Mechanical Turk Groups

Once prepared, each HIT was evaluated by 5 workers in MTurk and 5 expert judges via an internal system on a 11 point scale. We used the same experiment template (same UI) for both cases. The scale was labeled 0 (irrelevant) to 10 (extremely relevant). The choice of scale is unusual, and we justify it as follows. First, we wanted the scale that is sensitive enough to allow judges and workers to express small, minute differences in relevance they detect. It may very well be that two groups are equally sensitive on a coarse scale (such as standard 5 point scale) but expert judges are able to express sensitivity to finer differences in relevance on a finer scale. Thus, we chose a relatively fine scale. Second, we chose to have numbers rather than verbal labels. While there are some circumstances when verbal labels are preferable, they are not always best. Usually, verbal labels are good when they unambiguously define distinct categories that are very familiar to respondents. In other circumstances, numbers can be better. In fact, when asked to quantify a dimension that

(389 queries) In Figure 3 we present the means and confidence intervals reported in Table I.
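As a rough illustration of the computation behind Table 1, the per-query means, the full-minus-degraded differences, and their summary statistics with normal-approximation 95% confidence intervals could be obtained along the following lines. This is a minimal sketch, not the authors' code; the input layout (a table of group, query, condition, score records) and all column names are assumptions.

```python
# Minimal sketch of the summary statistics behind Table 1. Assumes a DataFrame
# of individual judgments with hypothetical columns:
#   group ("expert"/"turk"), query, condition ("full"/"degraded"), score (0-10).
import numpy as np
import pandas as pd


def summarize(judgments: pd.DataFrame) -> pd.DataFrame:
    # Mean score per group x query x condition, then one row per (group, query).
    per_query = (judgments
                 .groupby(["group", "query", "condition"])["score"]
                 .mean()
                 .unstack("condition"))
    per_query["difference"] = per_query["full"] - per_query["degraded"]

    rows = []
    for group, g in per_query.groupby(level="group"):
        diff = g["difference"].dropna()
        se = diff.std(ddof=1) / np.sqrt(len(diff))  # standard error of the mean difference
        rows.append({
            "group": group,
            "n_queries": len(diff),
            "mean_full": g["full"].mean(),
            "mean_degraded": g["degraded"].mean(),
            "mean_difference": diff.mean(),
            # normal-approximation 95% confidence interval for the mean difference
            "ci95_low": diff.mean() - 1.96 * se,
            "ci95_high": diff.mean() + 1.96 * se,
        })
    return pd.DataFrame(rows)
```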


Figure 3. Means and confidence intervals for the full page, the degraded page, and the differences, for Experts and Mechanical Turk workers.

While this is not the main interest of this paper, Mechanical Turk workers gave much lower ratings to both full and degraded SERPs on average; statistical tests confirm that they gave lower scores to both types of pages. As is clear from Table 1, both groups gave slightly higher scores to full than to degraded pages. The difference was statistically significant for Mechanical Turk workers (t(388)=2.43, p<.02) and marginally significant for expert judges (t(349)=1.79, p<.08). When only queries with data from both groups are included in the analysis, the picture remains much the same. There was no statistical difference in the magnitude of the perceived difference between the two groups (t(304)<1). That is to say, on average, workers and expert judges were equally sensitive (or insensitive) to the manipulation. To compare the within-query level of agreement, we compared standard deviations and range values. Expert judges had both statistically lower standard deviations and narrower ranges, on average. These values are presented in Table 2.

Table 2. Average standard deviation and range for the expert judge and Mechanical Turk groups.

Group           | SD Full | SD Degraded | Range Full | Range Degraded
Expert          | 1.31    | 1.84        | 2.66       | 2.64
Mechanical Turk | 1.33    | 1.86        | 3.59       | 3.59

Experts had significantly lower standard deviations and a significantly narrower range for both full and degraded pages. Experts stayed closer to the upper bound of the scale in their responses, but at the same time used a wider range of scores across different queries.

We also evaluated agreement between the groups: on the scores assigned to full pages, on the scores assigned to degraded pages, and on the difference scores. There was a moderate, statistically significant correlation for both types of pages (r=.38 for full pages and r=.39 for degraded pages). That is, within reason, full and degraded pages that received a high score from one group were more likely to receive a high score from the other group. However, our main interest was the correlation between differences: if one group saw a large difference between the full and degraded page for a query, did the other group agree, either in magnitude or at least in direction? The correlation between the pairwise differences detected by each group for each query was non-significant and very near 0. To determine whether the two groups agreed at least directionally, we classified all differences for all queries for each group as negative (the degraded page received the higher score), neutral (no difference in scores), or positive (the full page received the higher score), and carried out a chi-squared test. The test confirmed what the correlation coefficient also told us: the two groups' responses are completely independent of each other. There was no agreement between the two groups about which specific queries had negative differences between full and degraded pages and which had positive differences.
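The agreement analysis above (between-group correlations and the chi-squared test on the direction of the per-query differences) could be sketched as follows. This is not the authors' code; the per-query data layout and column names are assumptions, and the one-sample t-tests are included only to mirror the significance tests reported earlier.

```python
# Illustrative sketch of the between-group agreement analysis: Pearson
# correlations on per-query mean scores and a chi-squared test of independence
# on the direction (sign) of the full-minus-degraded difference.
# Assumes a hypothetical per-query DataFrame with columns:
#   expert_full, expert_degraded, turk_full, turk_degraded.
import numpy as np
import pandas as pd
from scipy import stats


def agreement(per_query: pd.DataFrame) -> dict:
    per_query = per_query.dropna()
    expert_diff = per_query["expert_full"] - per_query["expert_degraded"]
    turk_diff = per_query["turk_full"] - per_query["turk_degraded"]

    # One-sample t-tests of each group's mean difference against zero,
    # mirroring the significance tests reported above.
    t_expert, p_expert = stats.ttest_1samp(expert_diff, 0.0)
    t_turk, p_turk = stats.ttest_1samp(turk_diff, 0.0)

    # Between-group correlations on full pages, degraded pages, and differences.
    r_full, _ = stats.pearsonr(per_query["expert_full"], per_query["turk_full"])
    r_degraded, _ = stats.pearsonr(per_query["expert_degraded"], per_query["turk_degraded"])
    r_diff, p_diff = stats.pearsonr(expert_diff, turk_diff)

    # Directional agreement: classify each difference as negative/neutral/positive
    # and test whether the two groups' classifications are independent.
    table = pd.crosstab(np.sign(expert_diff), np.sign(turk_diff))
    chi2, p_chi2, dof, _ = stats.chi2_contingency(table)

    return {"t_expert": t_expert, "t_turk": t_turk,
            "r_full": r_full, "r_degraded": r_degraded, "r_diff": r_diff,
            "chi2": chi2, "p_chi2": p_chi2, "dof": dof}
```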

To understand whether different classes of queries were easier or harder in this task, we used a user-generated query classification to see whether judges were better able to detect more obviously degraded pages. For example, one expectation was that navigational queries would be more obviously degraded by the removal of the top result. Another was that queries with high classification agreement are less ambiguous and easier to deliver results for, and would therefore be more impacted by the removal of the top result. Since multiple judges provided classification data, we first computed the probability that a query was classified as navigational, informational, or transactional: we simply divided the number of judges that picked a given classification for the query by the total number of judges that classified that query. Second, we computed the difference between the mean ratings for the full and degraded page for each group. Finally, we looked at the correlation between the difference variable and the probability of classification as a navigational, transactional, or informational query. We observed a slight but statistically significant positive correlation between the difference variable and the probability that a query was navigational, for both groups. There were also two additional negative correlations: one for the expert judge group with the probability that a query was informational, and one for the Mechanical Turk group with the probability that a query was transactional. It is hard to know whether any of these correlations matter, but the slight positive, statistically significant correlation with the probability that a query was navigational makes intuitive sense: navigational queries are much more likely to be damaged by the removal of the top result.
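A compact sketch of this query-class analysis is given below, under the assumption that the classification votes are available as one row per (judge, query) with a chosen label; the input layout and names are ours, not the authors'.

```python
# Illustrative sketch: P(class | query) from classification votes, correlated
# with the per-query full-minus-degraded difference for one group.
# Hypothetical inputs: `votes` has one row per (judge, query) with columns
#   query, label ("navigational"/"informational"/"transactional");
# `diffs` is a Series indexed by query: mean(full) - mean(degraded).
import pandas as pd
from scipy import stats


def class_probability_correlations(votes: pd.DataFrame, diffs: pd.Series) -> pd.DataFrame:
    # Probability = judges choosing the class / judges who classified the query.
    counts = votes.groupby(["query", "label"]).size().unstack("label", fill_value=0)
    probs = counts.div(counts.sum(axis=1), axis=0)

    rows = []
    for label in probs.columns:
        joined = pd.concat([probs[label], diffs], axis=1, join="inner").dropna()
        r, p = stats.pearsonr(joined.iloc[:, 0], joined.iloc[:, 1])
        rows.append({"label": label, "r": r, "p": p})
    return pd.DataFrame(rows)
```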

Finally, we examined a few specific navigational queries, such as facebook, google, yahoo, and ebay. Both groups detected the absence of the navigational result for www.google.com and for facebook. However, neither group detected any difference for the query ebay, even though the main navigational result was removed and the next result was from Wikipedia.

To conclude the analysis, we present two additional graphs as general descriptions of the raw data. In Figure 4, we present histograms of the overall mean score distributions. As we can see, expert judges gave higher scores in general. Figure 5 presents another set of graphs showing how the standard deviation increases as the mean score decreases, for both expert judges and workers.

Figure 4. Histograms of the overall mean score distributions.

Figure 5. Standard deviation increases as the mean score decreases, for both expert judges and workers.


5. RESULTS AND CONCLUSIONS

In this paper we explore the ability of expert judges and Mechanical Turk workers to detect obvious damage to search results pages of a major search engine on a relatively small random sample of queries. This was a very hard task, and both groups performed relatively poorly on it.

Both groups just barely detected the degradation over the entire query sample, and there was rather little agreement, both within each group and between groups, about the magnitude and the direction of the difference for individual queries.

There is a widespread belief that lengthy guidelines and long training make expert judges "better". This may be true for the relevance task for which the guidelines are written, but it does not seem to translate into a more general skill of assigning relevance labels when a new relevance task is proposed.

This is an interesting finding, with implications for how to think about expertise in trained judges. Their expertise is not a transferable skill: every new task requires a precise definition of what needs to be labeled and how. When this definition is absent, they are no better than Mechanical Turk workers.

6. FUTURE WORK

Comparing experts with workers on different crowdsourcing platforms will remain an area of work, as different tasks require different levels of expertise. We plan to continue with similar experiments on different data sets so that we can quantify differences.

7. REFERENCES

[1] O. Alonso and S. Mizzaro. "Can we get rid of TREC Assessors? Using Mechanical Turk for Relevance Assessment". ACM SIGIR Workshop on the Future of IR Evaluation (2009).

[2] O. Alonso, D. Rose, and B. Stewart. "Crowdsourcing relevance evaluation". SIGIR Forum, 42(2), 9-15 (2008).

[3] P. Bailey, N. Craswell, I. Soboroff, P. Thomas, A. P. de Vries, and E. Yilmaz. "Relevance assessment: Are judges exchangeable and does it matter". In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 667-674 (2008).

[4] M. Bernstein et al. "Soylent: A Word Processor with a Crowd Inside". UIST (2010).

[5] C. Callison-Burch. "Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk". EMNLP (2009).

[6] P. Dai, Mausam, and D. Weld. "Decision-Theoretic Control of Crowd-Sourced Workflows". AAAI (2010).

[7] H. H. Friedman and T. Amoo. "Rating the rating scales". Journal of Marketing Management, 9(3), pp. 114-123 (1999).

[8] D. Gillick and Y. Liu. "Non-expert evaluation of summarization is risky". Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010).

[9] C. Grady and M. Lease. "Crowdsourcing Document Relevance Assessment with Mechanical Turk". NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010).

[10] P. Ipeirotis. "Crowdsourcing using Mechanical Turk: Quality Management and Scalability". Invited talk at the Workshop on Crowdsourcing for Search and Data Mining, CSDM 2011.


[11] P. Ipeirotis, F. Provost, and J. Wang. "Quality Management on Amazon Mechanical Turk". KDD-HCOMP (2010).

[12] G. Kazai. "In search of quality in crowdsourcing for search engine evaluation". In: P. Clough et al. (Eds.): ECIR 2011, LNCS 6611, pp. 165-176 (2011).

[13] J. Le, A. Edmonds, V. Hester, and L. Biewald. "Ensuring quality in crowdsourced search relevance evaluations". In: SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17-20 (2010).

[14] G. Little et al. "TurKit: Tools for Iterative Tasks on Mechanical Turk". KDD HCOMP (2009).

[15] P. Marsden. "Crowdsourcing". Contagious Magazine 18, 24-28 (2009).

[16] A. J. Quinn and B. B. Bederson. "A taxonomy of distributed human computation". Technical Report HCIL-2009-23, University of Maryland (2009).

[17] N. Schwarz, B. Knauper, H. J. Hippler, E. Noelle-Neumann, and F. Clark. "Rating scales: Numeric values may change the meaning of scale labels". Public Opinion Quarterly, 55, pp. 570-582 (1991).

[18] R. Snow et al. "Cheap and Fast But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks". EMNLP (2008).

[19] L. von Ahn and L. Dabbish. "Labeling images with a computer game". In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2004, pp. 319-326 (2004).

