CrowdTiles: Presenting Crowd-based Information for Event-driven Information Needs ∗

Stewart Whiting, Ke Zhou & Joemon M. Jose School of Computing Science, University of Glasgow, UK.

Omar Alonso Microsoft Corp. Mountain View, California, USA.

Teerapong Leelanupab King Mongkut’s Institute of Technology Ladkrabang Bangkok, 10520, Thailand.


ABSTRACT

Time plays a central role in many web search information needs relating to recent events. For recency queries where fresh information is most desirable, there is likely to be a great deal of highly-relevant information created very recently by crowds of people across the world, particularly on platforms such as Wikipedia and Twitter. With so many users, mainstream events are often very quickly reflected in these sources. The English Wikipedia encyclopedia consists of a vast collection of user-edited articles covering a range of topics. During events, users collaboratively create and edit existing articles in near real-time. Simultaneously, users on Twitter disseminate and discuss event details, with a small number of users becoming influential for the topic. In this demo, we propose a novel approach to presenting a summary of new information and users related to recent or ongoing events associated with the user’s search topic, therefore aiding most recent information discovery. We outline methods to detect search topics which are driven by events, identify and extract changing Wikipedia article passages and find influential Twitter users. Using these, we provide a system which displays familiar tiles in search results to present recent changes in the event-related Wikipedia articles, as well as Twitter users who have tweeted recent relevant information about the event topics.

Categories and Subject Descriptors: H.4 [Information Systems Applications] - Miscellaneous

Keywords: Wikipedia, Twitter, Time, Events

1. INTRODUCTION

A large volume of web searches are initiated by users searching for information related to recent or ongoing events [5]. Highly relevant news articles are typically provided for these event-driven information needs using aggregated search results, incorporating a news article vertical (as seen in Figure 1). A vast amount of especially fresh and informative recent information is being generated in near real-time by different crowds of users on social and collaborative platforms such as Twitter¹ and Wikipedia². However, these highly temporal sources are typically ignored in web search.

Since its creation in early 2001, the founding user-edited English language encyclopaedia has grown to almost 4 million articles covering a wide range of topics, each with an average of 19.64 user revisions³. Wikipedia encyclopaedia projects covering numerous languages have also become popular. Query log analysis suggests Wikipedia pages feature prominently in over 90% of searches. Meanwhile, Twitter has over 140 million active users providing over 340 million tweets a day⁴. By presenting the latest event-related information using ‘CrowdTiles’ (i.e. tiles (1) and (2) in Figure 1), we provide a summary of what is happening right now in the search results, thereby assisting the user in further discovery and exploration of the event and related topic information.

In this demo we propose a novel approach to integrating event-driven information in search results, provided by two crowds: anonymous users, and well-known people or organisations. Firstly, passages recently changed by anonymous users in Wikipedia articles relevant to the event are shown (i.e., tile (1) in Figure 1). Secondly, influential Twitter users for event topics can be found and included along with their most recent on-topic tweets (i.e., tile (2) in Figure 1). Presenting influential users rather than simply tweets provides valuable background and trust for limited tweet informational content.

Real-time engineering and information reliability issues pose challenges for efficient and effective use of real-time data sources. As such, we are motivated to develop a lightweight but robust architecture (shown in Figure 2), combining algorithmic approaches for both finding influential users for event topics on Twitter, and extracting recently changing passages from Wikipedia articles.

Figure 1: Demonstration screenshot for the search query ‘euro 2012’ in June 2012. Tile (1) displays event digests (i.e. recently changed passages) from Wikipedia articles related to the event. Tile (2) shows Twitter users most relevant to the search topic.

Figure 2: System architecture overview for CrowdTiles.

∗ This research is partially supported by the EU-funded project: LiMoSINe (288024).
¹ http://www.twitter.com
² http://www.wikipedia.org
³ http://en.wikipedia.org/wiki/Size_of_Wikipedia (April 23rd, 2012)
⁴ http://blog.twitter.com/2012/03/twitter-turns-six.html

Copyright is held by the author/owner(s). CIKM’12, October 29–November 2, 2012, Maui, HI, USA. ACM 978-1-4503-1156-4/12/10.

changes, providing a succinct summary of event details. We extract this temporally ‘hot’ text from the article to provide to users. In this early work we outline an approach for extracting event digests from a Wikipedia article during periods of event-based article change activity. Each digest is essentially a passage in the relevant Wikipedia article which has seen significant change over the last period of N article edits.

2. EVENT-BASED INFORMATION

Wikipedia article authoring and editing activity is regularly prompted by users reporting ongoing real-world events, often very soon after they occur [6]. To develop the algorithms we outline in this section, we had to use past data mined from historic Wikipedia archives. For this reason, we illustrate examples using data taken over the period covering Whitney Houston’s death. Houston’s death was first unofficially reported on Twitter at 00:15 UTC on the 12th February 2012 [1]. Although the news immediately spread through Twitter, it was not until 00:57 UTC that the Associated Press verified the news on their Twitter feed (prior to releasing a proper news story). The first update to the Whitney Houston Wikipedia article referencing the circumstances of her death was at 01:01 UTC, 4 minutes after press confirmation. During the period immediately after her death, article changes went from an average of fewer than 8 per day to over 120 per day.

Topic Detection and Tracking (TDT) automatically detects and organizes event-based topics in streaming collections, such as broadcast news [2]. Wikipedia has unique characteristics compared to conventional TDT document collections, as each topic is a single evolving document. In this section we outline an approach to performing the TDT story tracking task for a known topic as it develops (i.e. tracking a particular Wikipedia article).

Table 1: Nine digests for the Whitney Houston article at intervals of 20 edits following her death. References to external sources and other ‘Wiki’ markup have been removed for readability.

To provide the digest, we extract the most-edited passages from the Nth revision in the N-edit period (i.e., the 20th revision in a 20-edit period). A lower N will result in more sensitive and up-to-date digests, at the risk of lower precision if a user makes a number of edits that are not event related. Conversely, a higher N will generally be more robust and take into account longer term and more persistent event-related changes. As the article is evolving, it is impossible to simply track the passage which has received the most changes, as passages are constantly being added, changed and removed between edits. We therefore treat digest extraction as an intra-document retrieval problem. The query is essentially the edit terms over the last N-edit period and the retrievable documents are passages contained within the article’s text. To illustrate the temporal nature of edit terms, the most common edit terms for the Whitney Houston article after her death are shown in Figure 3.

A weighted boolean retrieval model appeared most robust for identifying intuitively relevant passages to use as digests. We discarded term frequency as it was adversely affected by non-uniform passage lengths. A weighted approach was necessary to increase the importance of previously unseen terms being added to the article (e.g. ‘death’ or ‘hilton’ as opposed to ‘whitney’ or ‘houston’). Term weighting uses an analogue of the traditional inverse document frequency (IDF) statistic [4]. We compute the specificity of an edit term $t$ to provide the boolean weight

$$t_{weight} = \frac{t_{editFrequency}}{t_{articleFrequency}}$$

where $t_{editFrequency}$ is the frequency of the edit term in the N-edit period, and $t_{articleFrequency}$ is the edit term frequency in the complete article text (i.e. collection frequency). Each passage score is computed as $\sum t_{weight}$ over the query terms present in the passage.

The outcome of running this digest extraction algorithm is illustrated in Table 1 for the Whitney Houston article in the period

2.1 Event Detector

CrowdTiles for Wikipedia and Twitter are only shown when the user’s information need is event-driven. To determine this, a relevant Wikipedia article is found by detecting a Wikipedia-based result in the top 10 web search results. At this early stage, we simply consider a Wikipedia article to be affected by an event, and therefore show the CrowdTiles, if it has received more than 10 text changes in the last 12 hours. Investigation of more reliable burst detection techniques for this task is left for future work.
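The trigger rule described for the Event Detector can be sketched in a few lines of Python. This is a minimal illustration rather than the system's implementation: the function names, the treatment of each revision timestamp as one "text change", and the substring test used to spot a Wikipedia result are all assumptions made for the example. Only the two thresholds (more than 10 changes, last 12 hours, top 10 results) come from the text.

```python
from datetime import datetime, timedelta, timezone

# Thresholds stated in the text; the constant names are illustrative.
MIN_TEXT_CHANGES = 10
WINDOW = timedelta(hours=12)


def first_wikipedia_result(result_urls, top_k=10):
    """Return the first Wikipedia article URL among the top-k results, or None.

    The substring check is an assumption; a real system would parse the URL.
    """
    for url in result_urls[:top_k]:
        if "wikipedia.org/wiki/" in url:
            return url
    return None


def is_event_driven(revision_times, now, min_changes=MIN_TEXT_CHANGES, window=WINDOW):
    """True if strictly more than `min_changes` revisions fall within `window` of `now`.

    Each timestamp in `revision_times` is assumed to represent one text change.
    """
    cutoff = now - window
    recent = sum(1 for t in revision_times if cutoff <= t <= now)
    return recent > min_changes
```

For example, eleven revisions in the last two hours would trigger the tiles, while exactly ten would not, since the rule requires more than 10 changes.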

2.2 Event Digests from Wikipedia

Article editors distill event information from disparate sources such as Twitter and traditional news reports into referenced article

Period | Digest
1 | {{death}} (Refers to the article ‘infobox’ with birth and death dates.)
2 | Houston died on February 11, 2012. Publicist Kristen Foster said Saturday that the singer had died, but the cause of her death was unknown. She died in [[Ottawa]], [[Canada]].
3 | [Similar to previous.]
4 | On February 11, 2012, publicist Kristen Foster revealed Houston had died aged 48. A cause of death was not immediately given. She died in her Beverly Hills home.
5 | [Similar to previous.]
6 | [Similar to previous.]
7 | On February 11, 2012, publicist Kristen Foster revealed Houston had died from unspecified causes at the age of 48, with unconfirmed reports suggesting her death occurred in her room at the [[Beverly Hilton Hotel]].
8 | Houston released her new album, ”[[I Look to You]]”, on August 2009. The album’s first two singles are "I Look to You" and "Million Dollar Bill". The album entered the [[Billboard 200]] at No. 1...
9 | Local police said said there were "no obvious signs of criminal intent." Two days prior to her death, witnesses reported seeing Houston behave erratically. They were rumored that she died of drug overdose.


series, established time-series analysis methods can be used to measure the temporal characteristics. Stable terms are discovered by aggregating the appearance of each term over a series of time windows (e.g. the term appears on average 2.4 times every 6 hours). Similarly, temporally important terms are extracted by computing a recency score for each new term to appear in the article edits. This score is defined as a simple measure: the frequency of appearances of the term in a recent time window (e.g. the last 24 hours) minus the frequency of older appearances of the term prior to the time window, thus penalizing terms which are less new to the article.

After extracting the most stable and temporally important terms from the Wikipedia article edit terms, a query containing a selection of the terms is constructed to retrieve recently relevant tweets for the event. Stable terms ensure retrieved tweets are on-topic for the event, while temporally important terms ensure the tweets are about recent developments. From this set of retrieved tweets, a set of candidate influential Twitter users is then extracted.

Ranking Influential Users. We rank Twitter users by the simple ratio of their friends to followers, with lower ratios ranked higher, thereby promoting influential accounts with considerably more followers than friends. Further work will explore additional features for ranking users based on tweet features, including retweet counts and event coverage in their previous tweets.
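The term measures and the user ranking described above can be illustrated with a short Python sketch. It assumes edit terms arrive as (term, timestamp) pairs and that each candidate user is a dictionary with `friends` and `followers` counts; these data shapes and all function names are hypothetical. Treating a user with zero followers as having an infinite ratio is likewise an assumption, since the text does not say how that case is handled.

```python
from collections import Counter
from datetime import datetime, timedelta


def recency_scores(term_events, now, window=timedelta(hours=24)):
    """Recency score per term: appearances inside the recent window minus older ones.

    `term_events` is an iterable of (term, timestamp) pairs taken from article edits.
    """
    cutoff = now - window
    recent, older = Counter(), Counter()
    for term, ts in term_events:
        (recent if ts >= cutoff else older)[term] += 1
    return {t: recent[t] - older[t] for t in set(recent) | set(older)}


def mean_per_window(term_events, term, start, end, window=timedelta(hours=6)):
    """Average appearances of `term` per time window between `start` and `end`."""
    n_windows = max(1, int((end - start) / window))
    count = sum(1 for t, ts in term_events if t == term and start <= ts < end)
    return count / n_windows


def rank_users(users):
    """Order users by friends/followers ratio, lowest ratio first."""
    def ratio(user):
        if user["followers"] == 0:
            return float("inf")  # assumption: no followers ranks last
        return user["friends"] / user["followers"]
    return sorted(users, key=ratio)
```

A term such as ‘hilton’ that appears only in the last day receives a positive score, whereas a term that appeared mostly before the window receives a negative one. Terms with a high per-window average are candidates for the stable part of the tweet query.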

Figure 3: Most frequent terms appearing in edits made to the Whitney Houston article in the 3 days after her death. [Plot not reproduced: x-axis ‘Period’ (0 to 35), y-axis ‘Term Frequency in Period’ (0 to 12); terms plotted: alcohol, bathtub, death, drugs, grammy, hilton, hotel.]

shortly after her death, as details are unfolding. All digests with the exception of period 8 are on the topic of her death. Closer inspection indicates that the period 8 digest was caused by a burst of edits made to her background discography information at that time.

Related Article Identifier. In the ‘euro 2012’ example shown in Figure 1, while the central Wikipedia article ‘UEFA Euro 2012’ is most relevant, further articles may become temporally associated with the event. For example, during an England game, not only may the team’s article get regular changes, but so too may the articles of prominent players such as ‘Wayne Rooney’. Displaying recently changed information from these highly temporal articles is likely to help the user better understand the event details unfolding. Changes made to strongly associated Wikipedia articles can be included by exploiting the rich Wikipedia link structure. Previous work by Ciglan and Nørvåg [3] used a graph-based spreading activation approach to propagate article viewing popularity over the link structure. As the intention of this work is purely to extract event-related passages, we instead apply the same passage extraction algorithm to articles that are directly linked to from the initially retrieved Wikipedia article, and which are also receiving a large number of recent changes.
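The digest extraction approach (edit terms weighted by edit frequency over article frequency, passages scored by the summed weights of the edit terms they contain) and the related-article filter can be sketched as follows. Several details are assumptions made for this example because the text leaves them open: edit terms are taken to be the terms added between consecutive revisions, passages are split on blank lines, the tokenizer is a simple regular expression, and the related-article threshold of 10 recent changes is borrowed from the Event Detector rather than stated for this step. All function names are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def edit_terms(revisions):
    """Count terms added between consecutive revisions in the N-edit period."""
    added = Counter()
    for before, after in zip(revisions, revisions[1:]):
        # Counter subtraction keeps only terms whose count increased.
        added.update(Counter(tokens(after)) - Counter(tokens(before)))
    return added


def term_weights(edit_freq, article_text):
    """t_weight = t_editFrequency / t_articleFrequency for each edit term."""
    article_freq = Counter(tokens(article_text))
    # Terms no longer present in the article are skipped to avoid dividing by zero.
    return {t: f / article_freq[t] for t, f in edit_freq.items() if article_freq[t] > 0}


def split_passages(text):
    """Split article text into passages on blank lines (an assumption)."""
    return [p for p in text.split("\n\n") if p.strip()]


def rank_passages(revisions):
    """Return (score, passage) pairs for the latest revision, best first."""
    latest = revisions[-1]
    weights = term_weights(edit_terms(revisions), latest)
    scored = []
    for passage in split_passages(latest):
        present = set(tokens(passage))  # boolean model: presence only
        scored.append((sum(w for t, w in weights.items() if t in present), passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def related_candidates(linked_titles, recent_change_counts, min_changes=10):
    """Directly linked articles that also have many recent changes."""
    return [t for t in linked_titles if recent_change_counts.get(t, 0) > min_changes]
```

Because the model is boolean, a long passage gains nothing from repeating an edit term, which matches the stated reason for discarding term frequency. The same `rank_passages` call could then be applied to each title returned by `related_candidates`.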

2.3 Influential Twitter Users

Rather than tweets alone, we instead chose to present users with tweets, as the tweet author’s background may often express the trustworthiness of the limited tweet content. Existing approaches for topic-specific influencer detection typically rely on network-based approaches which offer effective solutions to the problem, but are infeasible in a vast and highly amorphous network such as Twitter’s full social graph. Furthermore, relevant influential users may change relatively quickly over time as the event focus and discussion change. To present relevant Twitter users and their tweets in CrowdTiles, the Twitter search API⁵ is used to retrieve recently relevant tweets for the event topic, with further ranking performed on user features.

Both stable and temporally important (i.e., most recently distinctive) terms for the event topic are expressed in the most frequent terms appearing in Wikipedia article edits, as shown in Figure 3. The temporal distribution of each term is a strong indicator of the term’s stability in the event topic, and therefore its usefulness for retrieving descriptive, high-quality tweets that are both relevant and recently topical for the event. In Figure 3, the distribution of the term ‘death’ is relatively stable throughout the 3 day duration represented. In contrast, terms such as ‘alcohol’ and ‘hilton’ are much more temporal as they burst with new event developments. Considering that the distribution of terms is essentially a time-

3. DEMONSTRATION SCENARIOS

A live demonstration is available at: http://www.stewartwhiting.com/CrowdTiles.htm. Figure 2 shows the current architecture for the experimental CrowdTiles system. Wikipedia publishes a stream of all real-time activity (e.g. article creation and editing) to both an internet relay chat (IRC) channel and a syndication feed⁶. An application programming interface (API) provides access to current article revision metadata and text. Finding influential Twitter users relies on publicly accessible Twitter REST APIs. The CrowdTiles interface is developed in C# and built into a prototype front-end of the Microsoft Bing⁷ web search engine.

Two demonstration scenarios are planned. For non-event topic searches, CrowdTiles will not display. For any search query related to a recent event, CrowdTiles will be triggered and show a number of the most changed and most relevant passages extracted from the directly relevant search topic/event-related Wikipedia article. If there are other closely related articles which have also been frequently changed recently, passages from them will be displayed in later tiles. Multiple digests will be presented in a transitioning series of tiles. At most 5 influential Twitter users will be selected for the topic and presented in a series of transitioning tiles, including the user’s bio and recent related tweets.

4. REFERENCES

[1] 2.5 million tweets an hour as news of whitney houston’s death spreads. http://www.topsylabs.com/2012/02/12/2-5-million-tweets-an-hour-as-news-of-whitney/houstons-death-spreads/.
[2] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Research and Development in Information Retrieval, pages 37–45, 1998.
[3] M. Ciglan and K. Nørvåg. Wikipop: personalized event detection system based on wikipedia page view statistics. In CIKM, pages 1931–1932, 2010.
[4] K. Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[5] A. Kulkarni, J. Teevan, K. M. Svore, and S. T. Dumais. Understanding temporal query dynamics. In ACM WSDM ’11, pages 167–176, New York, NY, USA, 2011. ACM.
[6] F. Vis. Wikinews reporting of hurricane katrina. In Citizen Journalism: Global Perspectives, Global Crises and the Media. Peter Lang, 2009.

⁵ https://dev.twitter.com/docs/api
⁶ http://en.wikipedia.org/wiki/Wikipedia:Syndication
⁷ http://www.bing.com
