Is it Time to Abandon Abandonment?

Abhimanyu Lad
LinkedIn Corp., 2029 Stierlin Court, Mountain View, CA
[email protected]

Daniel Tunkelang
LinkedIn Corp., 2029 Stierlin Court, Mountain View, CA
[email protected]

ABSTRACT

Commonly used click-based measures like abandonment and mean reciprocal rank (MRR) present an incomplete, and often misleading, picture of search performance, especially in rich user interfaces that support a wide range of search behaviors. We propose a search utility framework that is based on a holistic view of the information-seeking process. First, we go beyond the use of clicks as indicators of relevance, taking into account actions performed on search results as more reliable indicators of completion of the user’s underlying task. Second, instead of looking only at individual queries, we consider the entire search session comprising multiple queries that are meant to address a single information need. We argue that an evaluation metric that combines these two features more accurately reflects the effectiveness of the system as perceived by the user. Finally, we propose future experiments to operationalize as well as validate this framework.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—information filtering; H.1.2 [Models and Principles]: User/Machine Systems—human factors, human information processing

General Terms

Information Retrieval, Evaluation, Human Factors

Keywords

Evaluation, search utility, abandonment

1. INTRODUCTION

An important part of research in information retrieval is the design of evaluation metrics that allow accurate evaluation and comparison of retrieval systems. A good evaluation methodology can provide valuable insights into various aspects of the performance of a system and thus guide the development of new search algorithms as well as interfaces that more effectively serve their users’ needs.

Retrieval system evaluation can be broadly divided into two main paradigms: offline and online evaluation. Offline evaluation is generally based on the Cranfield methodology [3], which uses a pre-defined set of queries and a test collection of documents with relevance judgments to evaluate and compare specific aspects of retrieval algorithms in a controlled setting. However, such an approach is often prohibitive in terms of the time and effort required, especially in the case of personalized search. Moreover, it fails to capture the dynamic nature of the information-seeking process; the behavior of real users is affected by their situational constraints and tasks, the search interface, as well as other factors like the novelty and timeliness of the search results that are difficult to capture using a static test collection.

The second alternative – and the focus of this position paper – is online evaluation, which directly assesses performance in terms of users’ behavior as they interact with a live retrieval system. Such a user-in-the-loop evaluation strategy is more likely to reflect the real utility of the search interface. This approach, however, confronts the challenge of determining which signals are useful for capturing users’ behavior, as well as how to correctly interpret these signals. The most obvious and commonly used signal is clickthrough on search results, which has given rise to several evaluation metrics, including abandonment, mean reciprocal rank (MRR) of the first click, and clicks per query.

Unfortunately, interpreting “raw clicks” as indicators of usefulness is problematic for several reasons. First, users’ click behavior is subject to trust bias [8]: users expect the top-ranked results to be more relevant and are therefore more likely to click on them irrespective of their relevance. Second, users may click on a result by mistake, possibly due to a misleading snippet. Third, the correct interpretation of a click depends on the nature of the user’s information-seeking task. For exploratory queries that can have many relevant results, a higher number of clicks generally indicates higher satisfaction. The opposite is true, however, for navigational queries that have only one correct result; for such queries, more clicks indicate an inability to find the single desired result.

Our proposed approach goes beyond clicks and considers dwell time, as well as other actions performed on the search results, as more robust indicators of usefulness and task accomplishment.

Such actions are highly domain-dependent. For instance, in the context of a professional networking site like LinkedIn, the user can perform a variety of actions on a search result, including sending connection requests, sending messages, and bookmarking profiles. These actions can be assigned different weights based on the value of those actions to the user.

Another major shortcoming of most contemporary evaluation measures is that they are defined at the individual query level, ignoring the relationships among queries issued by the user within a single search session. In reality, users often engage in search sessions with multiple interrelated queries to satisfy their information needs. The number of queries, as well as the types of query reformulations required to satisfy an information need, play a crucial role in determining the effectiveness of a system. For example, a system that automatically corrects obvious misspellings in the query is more useful than a system that suggests a few spelling corrections for the user to choose from – which in turn is better than a system that provides no such assistance and instead requires the user to recognize his or her spelling mistake and retype the query from scratch. In evaluating the effectiveness of a search system, we should consider the total amount of effort expended by the user to satisfy his or her information need. It is also important to note that the user’s experience across a session can affect the reality and perception of the system’s effectiveness far more than small differences in the ranking of results tracked by measures like mean reciprocal rank, which have been found to have at best a small effect on task performance [12]. In other words, current click-based measures of search quality do not necessarily favor retrieval system behaviors that intuitively seem useful to the end user.

Our goal in this paper is to address these issues and define a framework for measuring search utility that accurately reflects the perceived effectiveness of search interfaces. To operationalize our ideas and make them more concrete, we will use LinkedIn (http://www.linkedin.com) as the running example, but the concepts discussed in this paper can be easily extended and applied to other rich user interfaces with similar properties.

2. CURRENT EVALUATION MEASURES

Let us review the most commonly used click-based evaluation measures and their shortcomings.

Abandonment Rate is the percentage of queries that do not receive any clicks. Higher values indicate that users are less likely to find useful information in response to their query; lower values indicate better overall performance. Abandonment is a coarse measure that ignores the rank positions at which clicks were observed. On the plus side, abandonment rate is less subject to trust bias than measures that favor clicks early in the result ranking.

Mean Reciprocal Rank (MRR) is the reciprocal of the rank of the first click, averaged over all queries. Higher values indicate that users are more likely to click on the top-ranked results and thus indicate better overall ranking.

Clicks per Query (CPQ) is the average number of clicks received on each query. This measure can have two conflicting interpretations. For exploratory queries with many relevant results, a higher number of clicks indicates higher user satisfaction. For navigational queries with a single correct result, a higher number of clicks is a negative signal, implying the user’s inability to find the single correct result.

Probability of Skip (pSkip) is the ratio between the number of results skipped and the total number of results seen for each query [13]. Equivalently, pSkip can be defined in terms of Precision at Last Click (PLC) [2], which is equal to 1 − pSkip. pSkip is sensitive to the ranking quality of the search engine; more skips mean that the user has to expend more effort to find each relevant result, assuming that he or she scans the ranked list in a top-down manner. Hence, lower values of pSkip (and higher values of PLC) indicate better performance. Since pSkip is undefined for abandoned queries, its measurement does not take such queries into account.

At best, these measures provide an incomplete picture of search effectiveness; at worst, the picture can be misleading. All of these measures depend on interpreting raw clicks as indicators of usefulness, which can be problematic in some cases (e.g., “bad” clicks that do not lead users to satisfy their information needs). Moreover, these measures are defined at the level of individual queries, whereas users often engage in search sessions comprising multiple interrelated queries to satisfy their information need. For example, abandonment rate conflates hard abandonment, where the user gives up hope after a single unsuccessful query, and soft abandonment, in which the user reformulates the initial query to refine the result set. Similarly, rank-based measures like MRR and pSkip also ignore the number of query reformulations required to retrieve the relevant results.

At best, a combination of the above metrics is necessary for a complete picture of search effectiveness. At worst, these metrics can move in contradictory directions. For example, a small change in a ranking algorithm may result in an increased (i.e., worse) abandonment rate, but a decreased (i.e., improved) pSkip, because the newly abandoned queries, which were earlier contributing negatively to pSkip, are removed from its calculation.

Our goal is to address these shortcomings and propose a unified measure of performance that offers an intuitive interpretation and an accurate reflection of the utility of the retrieval system as experienced and perceived by the user.
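To make the definitions above concrete, the sketch below computes the four measures from a simplified per-query click log. The QueryLog record and its fields are our own illustrative assumption, not a format described in this paper, and the handling of abandoned queries in MRR (counted as zero) is one common convention rather than a prescription.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryLog:
    """Hypothetical per-query record: 1-based ranks of the results that were clicked."""
    click_ranks: List[int]

def abandonment_rate(queries: List[QueryLog]) -> float:
    """Fraction of queries that received no clicks at all."""
    return sum(1 for q in queries if not q.click_ranks) / len(queries)

def mean_reciprocal_rank(queries: List[QueryLog]) -> float:
    """Mean of 1/rank of the first click; abandoned queries contribute 0 (one convention)."""
    return sum(1.0 / min(q.click_ranks) if q.click_ranks else 0.0
               for q in queries) / len(queries)

def clicks_per_query(queries: List[QueryLog]) -> float:
    """Average number of clicks per query."""
    return sum(len(q.click_ranks) for q in queries) / len(queries)

def p_skip(queries: List[QueryLog]) -> float:
    """Per-query ratio of skipped to seen results, averaged over clicked queries only.
    'Seen' is inferred from the last click, assuming a top-down scan; PLC = 1 - pSkip."""
    per_query = []
    for q in queries:
        if not q.click_ranks:            # pSkip is undefined for abandoned queries
            continue
        seen = max(q.click_ranks)
        per_query.append((seen - len(q.click_ranks)) / seen)
    return sum(per_query) / len(per_query) if per_query else 0.0

# Example: one abandoned query and two clicked queries.
log = [QueryLog([]), QueryLog([1, 3]), QueryLog([2])]
print(abandonment_rate(log), mean_reciprocal_rank(log),
      clicks_per_query(log), p_skip(log))
```

Note how the abandoned query lowers the abandonment-based and MRR scores but is simply dropped from the pSkip average, which is exactly the contradictory behavior discussed above.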

3. SEARCH UTILITY

The user’s queries are merely means to an end: for LinkedIn users, the motivating information need may be to learn more about a person, establish a connection, recruit candidates, etc. Our goal is to establish a framework that measures the search engine’s ability to support the user in accomplishing such tasks relative to the effort expended by the user. Hence, we define search utility as the ratio of the “gain” obtained by the user to the amount of effort expended by the user during a search session in order to obtain this gain:

search-utility = gain / user effort

We define utility at the session level, which corresponds to a sequence of one or more queries directed towards a single information need. Given a collection of search sessions, we can establish the mean utility provided by the search engine to its users.

It is difficult to know precisely how much value the search engine provides to a user, or how much effort a user exerts in obtaining that value. Hence, we resort to modeling gain and effort in terms of signals we can observe and measure, relying on our understanding of user goals and behavior.
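As a minimal sketch of how this ratio could be operationalized (the function names and the idea of passing gain and effort as pluggable callables are our own illustration, not part of the paper), mean search utility over a collection of sessions is just the average of the per-session ratio; the following subsections suggest what the gain and effort models should measure.

```python
from typing import Callable, List

def mean_search_utility(sessions: List[object],
                        gain: Callable[[object], float],
                        effort: Callable[[object], float]) -> float:
    """Mean over sessions of gain / effort; effort is assumed to be strictly positive."""
    return sum(gain(s) / effort(s) for s in sessions) / len(sessions)
```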

3.1 Gain

We identify two main indicators of gain: actions performed on search results, and time spent on those results.

Actions. Search is a means to an end, and that end often involves taking an action on a search result. For example, LinkedIn users send messages or connection requests to the people they find through search, bookmark profiles for future use, etc. Such actions provide much stronger and less ambiguous signals of utility than clicks. Not all actions signify equal value to the user. For example, sending a message is a stronger signal of value to the user than bookmarking a profile. The relative importance of an action is highly dependent on the domain and application, but it can be determined through user studies that correlate these implicit signals with explicit feedback given by the user.

Dwell time. In some cases, a user may not be interested in performing an action on a search result but may simply be gathering information. For example, a LinkedIn user may want to learn about a person without any immediate intention of communicating with that person. In order to measure utility in the absence of an action, we use the dwell time – that is, the amount of time spent on an individual result page – as a surrogate for its usefulness. Dwell time has been shown to be a strong indicator of a user’s interest in several studies [5, 1]. Also, when dwell time is less than a certain threshold, it serves to indicate a bad click, i.e., the user realizing his or her mistake and going back to the search results page. Inferring bad clicks based on dwell time can correct for spurious clicks in click-based measures.
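A gain model along these lines might look like the following sketch. The specific action weights, the 30-second bad-click threshold, and the partial credit for a long dwell with no action are placeholder assumptions to be calibrated against explicit user feedback, not values proposed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical action weights; in practice these would be fit from user studies
# that correlate implicit signals with explicit feedback.
ACTION_WEIGHTS: Dict[str, float] = {
    "send_message": 1.0,
    "connection_request": 0.8,
    "bookmark_profile": 0.4,
}
BAD_CLICK_DWELL_SECONDS = 30.0   # below this, treat the visit as a "bad" click
DWELL_CREDIT = 0.2               # gain credited for an informative visit with no action

@dataclass
class ResultVisit:
    dwell_seconds: float
    actions: List[str] = field(default_factory=list)

def visit_gain(visit: ResultVisit) -> float:
    """Weighted actions dominate; otherwise a sufficiently long dwell earns partial credit."""
    if visit.actions:
        return sum(ACTION_WEIGHTS.get(a, 0.0) for a in visit.actions)
    if visit.dwell_seconds >= BAD_CLICK_DWELL_SECONDS:
        return DWELL_CREDIT
    return 0.0                   # short dwell and no action: a spurious click, no gain

def session_gain(visits: List[ResultVisit]) -> float:
    """Total gain accumulated over all result visits in a session."""
    return sum(visit_gain(v) for v in visits)
```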

3.2 Effort

We measure effort in terms of sequential scanning of results, query reformulation, and total session time.

Sequential Scan. Previous studies have shown that most people employ a linear strategy in which they evaluate search results sequentially, deciding whether to click on each scanned result item before proceeding to the next one [9]. We capture this browsing effort in terms of the number of results examined by the user (relevant as well as irrelevant), as indicated by the position of the last click, with an extra cost associated with the act of pagination.

Query (Re-)formulation. Users express their information needs by making an active effort to formulate queries. When a query fails to retrieve sufficiently useful (or sufficiently many) results, users may invest additional effort to reformulate their queries. Given that query reformulation is a much larger investment than the incremental cost of scanning an additional search result, a search engine should minimize the number of times that a user needs to reformulate a query to complete a task. Not all query reformulations incur the same level of investment on the user’s part. In particular, it is important to distinguish between system-provided refinements and manual query reformulations. The former include system-suggested spelling corrections, related queries, and facet-based narrowing of the results. The latter typically involve the user manually adding, removing, or replacing a keyword – but could go as far as a complete rewriting of the query. The costs of each of these actions depend on the precise details of the search interface; these can be measured in aggregate through log analysis.

Time. Finally, we can measure the total amount of time required for the user to satisfy his or her information need. This includes the time to browse the results as well as the time taken to reformulate the query. While this measure is partially redundant with the previous ones, it captures the variability in the costs of activities, as well as behavior too complex to be captured by a simple model of scanning and query reformulation.
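A matching effort model is sketched below. The unit costs for scanning, pagination, and each reformulation type are hypothetical assumptions; as noted above, they would have to be estimated from log analysis for the interface at hand, and total session time could be used alongside or instead of this structural model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical unit costs, to be estimated per interface via log analysis.
SCAN_COST = 1.0            # cost of examining one result
PAGINATION_COST = 2.0      # extra cost each time the user moves to another page
REFORMULATION_COST: Dict[str, float] = {   # manual rewrites cost more than assisted ones
    "spelling_suggestion": 1.0,
    "facet_refinement": 1.5,
    "manual_edit": 4.0,
}

@dataclass
class QueryEffort:
    last_click_position: int                 # results examined, inferred from the last click
    pages_viewed: int                        # 1 if the user stayed on the first page
    reformulation_type: Optional[str] = None # how this query was produced from the previous one

def query_effort(q: QueryEffort) -> float:
    """Scan cost plus pagination cost plus the cost of producing the query itself."""
    effort = SCAN_COST * q.last_click_position
    effort += PAGINATION_COST * max(q.pages_viewed - 1, 0)
    if q.reformulation_type is not None:
        effort += REFORMULATION_COST[q.reformulation_type]
    return effort

def session_effort(queries: List[QueryEffort]) -> float:
    """Total effort expended over all queries in a session."""
    return sum(query_effort(q) for q in queries)
```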

3.3 Proposed Measures

Given these definitions of gain and effort, we can derive variations of some commonly used evaluation measures in terms of our framework for search utility.

Session-Level Abandonment. Here the gain gives us a binary numerator for each search session: 1 if the user clicks through to a result at any point during the search session, 0 otherwise. The effort here is constant – we are evaluating gain per session. Thus the mean search utility is equivalent to 1 − session-level abandonment, i.e., the percentage of sessions that are not abandoned.

Time To First Click. Time To First Click is a generalization of Session-Level Abandonment, except that now we make the numerator constant and switch from a binary cost to a cost reflecting the time until the first click. Since this time is infinite for abandoned searches, we define the search utility of an abandoned session as zero.

pSkip and PLC. We can reinterpret these measures by defining effort as the number of results scanned (inferred from the rank of the last result clicked) and gain as the number of clicks, possibly discarding spurious clicks with short dwell times. We obtain a search utility that is equivalent to Precision at Last Click (PLC), or 1 − pSkip. As noted earlier, this measure ignores queries with no clicks, so it cannot be used in isolation without risking that we “reward” abandonment.
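For illustration, the first two of these measures fall out of the framework by plugging particular gain and effort functions into the per-session ratio sketched in Section 3. The Session fields below are again our own simplification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Session:
    clicked: bool                              # did the user click through at any point?
    seconds_to_first_click: Optional[float]    # None if the session was abandoned

def utility_session_abandonment(s: Session) -> float:
    # Gain is 1 if anything was clicked, 0 otherwise; effort is a constant 1 per session,
    # so the mean utility equals 1 minus the session-level abandonment rate.
    return (1.0 if s.clicked else 0.0) / 1.0

def utility_time_to_first_click(s: Session) -> float:
    # Gain is constant; effort is the time until the first click. Abandoned sessions
    # have effectively infinite effort, so their utility is defined as zero.
    if s.seconds_to_first_click is None:
        return 0.0
    return 1.0 / s.seconds_to_first_click
```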

4. FUTURE EXPERIMENTS

We intend to perform a variety of experiments using the framework proposed in this position paper. The planned experiments include:

• Comparing actions rather than clicks for evaluating exploratory search cases, such as candidate searches by hiring managers.

• Testing whether improved snippet presentation reduces bad clicks, i.e., clicks with low dwell time.

• Determining how different precision/recall trade-offs affect the various gain and cost measures we have proposed.

• Exploring a combination of features to measure gain as well as user effort. This requires a deeper understanding of the relative value of user actions and dwell time on the one hand, and the relative cost of scanning results and reformulating the query on the other.

This experimentation is work in progress, and we hope to present preliminary results at the workshop.

5. RELATED WORK

The use of implicit signals including clicks as well as actions (dwell time, scrolling, bookmarking, etc.) is not new. For example, Fox et al. developed a Bayesian model to understand the relationship between various implicit and explicit measures of user satisfaction [5]. However, the authors sought to understand the predictive power of various implicit signals, and did not use them to devise a readily interpretable measure of search engine performance, which is the focus of this paper. Radlinski et al. investigated how well various implicit features like clickthrough and query reformulation relate to retrieval quality [11]. The authors found that none of these features reliably predicted retrieval quality; however, this is not surprising, since they considered each feature individually and did not differentiate between good and bad clicks by using additional features like dwell time on landing pages. For this reason we propose combining these features, and using dwell time as a way to differentiate good clicks from bad ones.

There have been some attempts to measure search performance in the context of rich user interfaces. Järvelin proposed an extension of discounted cumulated gain for the evaluation of multiple-query sessions [7]. Fuhr proposed a generic framework that models interactive search as a sequence of user choices and maximizes the users’ overall expected benefit in terms of tradeoffs at each choice [6]. Cole et al. suggested an evaluation model and methodology centered on the notion of “usefulness”, but left the operationalization of this notion as specific to user tasks and goals [4]. Finally, Wilson proposed a usability evaluation method to assess both search designs and fully implemented systems [14]. Drawing an analogy between abandoning a search result and system failure in reliability analysis, Liu et al. proposed a model of dwell time using the Weibull distribution and empirically validated this model using log data from web searchers [10].

The goal of our proposed approach is to capture the insights of these works in a framework that can be quantified and applied operationally in the context of online evaluation of search engines with rich user interfaces.

6. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–26. ACM, 2006.

[2] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 621–630. ACM, 2009.

[3] C. Cleverdon, J. Mills, and E. Keen. An inquiry in testing of information retrieval systems (2 vols.). Cranfield, UK: Aslib Cranfield Research Project, College of Aeronautics, 1966.

[4] M. Cole, J. Liu, N. Belkin, R. Bierig, J. Gwizdka, C. Liu, J. Zhang, and X. Zhang. Usefulness as the criterion for evaluation of interactive information retrieval. Proc. HCIR, pages 1–4, 2009.

[5] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating implicit measures to improve web search. ACM Transactions on Information Systems (TOIS), 23(2):147–168, 2005.

[6] N. Fuhr. A probability ranking principle for interactive information retrieval. Information Retrieval, 11(3):251–265, 2008.

[7] K. Järvelin, S. Price, L. Delcambre, and M. Nielsen. Discounted cumulated gain based evaluation of multiple-query IR sessions. Advances in Information Retrieval, pages 4–15, 2008.

[8] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 154–161. ACM, 2005.

[9] K. Klöckner, N. Wirschum, and A. Jameson. Depth- and breadth-first processing of search result lists. In CHI ’04 extended abstracts on Human factors in computing systems, pages 1539–1539. ACM, 2004.

[10] C. Liu, R. White, and S. Dumais. Understanding web browsing behaviors through Weibull analysis of dwell time. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 379–386. ACM, 2010.

[11] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM conference on Information and knowledge management, pages 43–52. ACM, 2008.

[12] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 11–18. ACM, 2006.

[13] K. Wang, T. Walker, and Z. Zheng. PSkip: estimating relevance ranking quality from web search clickthrough data. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1355–1364. ACM, 2009.

[14] M. Wilson. An analytical inspection framework for evaluating the search tactics and user profiles supported by information seeking interfaces. PhD thesis, University of Southampton, 2009.
