Ranking Very Many Typed Entities on Wikipedia

Hugo Zaragoza (Yahoo! Research Barcelona, Spain)
Henning Rode (University of Twente, The Netherlands)
Peter Mika (Yahoo! Research Barcelona, Spain)
Jordi Atserias (Yahoo! Research Barcelona, Spain)
Massimiliano Ciaramita (Yahoo! Research Barcelona, Spain)
Giuseppe Attardi (Università di Pisa, Italy; on sabbatical at Yahoo! Research Barcelona)

ABSTRACT

We discuss the problem of ranking very many entities of different types. In particular we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.




1. MOTIVATION

We are interested in the problem of ranking entities of different types as a response to an open (ad-hoc) query. In particular, we are interested in collections with many entities and many types. This is the typical case when we deal with collections which have been analyzed using NLP techniques such as named entity recognition or semantic tagging. Let us give an example of the task we are interested in. Imagine that a user types an informational query such as “Life of Pablo Picasso” or “Egyptian Pyramids” into a search engine. Besides relevant documents, we wish to rank relevant entities such as people, countries, dates, etc. so that they can be presented to the user for browsing. We believe this task is novel and interesting in its own right. In some sense the task is similar to the expert finding task in TREC [2]. However, this task will lead to very different models, for two reasons. First, we must deal with a heterogeneous set of entities; some of them are very general (like “school”, “mother”) whereas others are very specific (“Pablo Picasso”). Second, the entities are simply too many to build entity-specific models as is done for experts in TREC [2]. Instead, we must develop models on the fly, at query time. In this sense the models developed are more similar to those described in [1]. However, here we wish to rank the entities themselves, and not sentences. In this paper we explore two types of algorithms for entity ranking: i) algorithms that use the entity containment graph to compute the importance of entities based on the top ranked passages, and ii) algorithms that use correlation on web search results.

2. PROBLEM SETTING

To study this task we followed these steps:

• we used a statistical entity recognition algorithm to identify many entities (and their corresponding types) on a copy of the English Wikipedia,
• we asked users to issue queries to our baseline entity ranking system and to evaluate the results,
• we compared the performance of several algorithms on these queries.

In order to extract entities from Wikipedia, we first trained a statistical entity extractor on the BBN Pronoun Coreference and Entity Type Corpus, which includes annotation of named entity types (Person, Facility, Organization, GPE, Location, Nationality, Product, Event, Work of Art, Law, Language, and Contact-Info), nominal entity types (Person, Facility, Organization, GPE, Product, Plant, Animal, Substance, Disease and Game), and numeric types (Date, Time, Percent, Money, Quantity, Ordinal and Cardinal). We note that some types are dedicated to identifying common nouns that refer to or describe named entities; for example, father and artist could be tagged with the Person-Description type. We applied this entity extractor to an XMLised Wikipedia collection constructed by the 2006 INEX XML retrieval evaluation initiative [3] (625,405 Wikipedia entries). This identified 28 million occurrences of 5.5 million unique entities. A special retrieval index was then created containing both the text and the identified entities. The overall processing time was approximately one week on a single PC. This tagged collection has been made available [7]; more detailed information about its construction and content can be found at this reference.

Table 1: Example queries and entity judgments (see text for discussion).

Query: “Yahoo! Search Engine”
Most Important Entities: Yahoo, Google, MSN, Inktomi, Yahoo.com.
Important Entities: Web, crawler, 2004, AltaVista, 2002, Amazon.com, Jeeves, TrustRank, WebCrawler, Search Engine Placement, more than 20 billion Web, eBay, World Wide Web, BT OpenWorld, between 1997 and 1999, Stanford University and Yahoo, AOL, Kelkoo, Konfabulator, AlltheWeb, Excite.
Related Entities: users, Firefox, Teoma, LookSmart, Widget, companies, company, Dogpile, user, Searchen Networks, MetaCrawler, Fitzmas, Hotbot, ...

Query: “Budapest”
Most Important Entities: Budapest, Hungary, Hungarian, city, Greater Budapest, capital, Danube, Budapesti Közgazdaságtudományi és Államigazgatási Egyetem, M3 Line, Pest county.
Important Entities: University of Budapest, Austria, town, Budapest Metro, Soviet, 1956, Ferenc Joachim, Karl Marx University of Economic Sciences, Budapest University of Economic Sciences, Eötvös Loránd University of Budapest, Technical University of Budapest, 1895, February 13, Budapest Stock Exchange, Kispest, ...
Related Entities: Paris, Vienna, German, Prague, London, Munich, Collegium Budapest, government, Jewish, Nazi, 1950, Debrecen, 1977, M3, center, Tokyo, World War II, New York, Zagreb, Leipzig, population, residences, state, cemetery, Serbian, Novi Sad, 1949, Szeged, Turin, Graz, 6-3, Medgyessy, ...

Query: “Tutankhamun curse”
Most Important Entities: Tutankhamun, Carnarvon, mummies, Boy Pharaoh, The Curse, archaeologist, Howard Carter, 1922.
Important Entities: Pharaohs, King Tutankhamun.
Related Entities: Valley, KV62, Curse of Tutankhamun, Curse, King, Mummy’s Curse, ...

The evaluation framework for the task was set up as follows. First, the user chose a query on a topic that the user knew well and that was covered in Wikipedia. Then the system ran this query against a standard passage retrieval algorithm, which retrieved the 500 most relevant passages, and collected all the entities appearing in them. This is the candidate entity set which needs to be ranked by the different algorithms. Finally, the entities were ranked using our baseline entity ranking algorithm (discussed later) and given to the user to evaluate. The possible judge assessments were: Most Important, Important, Related, Unrelated, or Don’t know. The user was asked to rank all entries if possible, and at least the first fifty. Besides the judgment labels, users were not given any specific training, nor were they given example queries or judgments. Ten judges were recruited and each judged from 3 to 10 queries, coming to a total of 50 judged queries. Some resulting queries and judgments are given in Table 1; with these examples we want to stress the difficulty and subjectivity of the evaluation task. Indeed, we realize that our task evaluation is quite naïve and may suffer from a number of problems which we wish to address in future evaluations. However, this initial evaluation allowed us to start studying some of the properties of this task, and to compare (however roughly) several ideas and approaches.

3. ENTITY RANKING METHODS

First, we will introduce some notation. Let a retrieved passage be the tuple (pID, s), where pID is the unique ID of the passage and s is the retrieval score of the passage for the given query. Call Pq the set formed by the K highest scored retrieved passages with respect to the query q (in our case K = 500). Let an entity be the tuple (v, t), where v is its string value (e.g. ’Einstein’) and t is its type (e.g. Person). Call C the set of all entities in the collection and Cq the set of all entities occurring in Pq.

The baseline model we consider is to use a passage retrieval algorithm and score an entity by the maximum score s of the passages in Pq in which the entity appears. This is referred to as MaxScore in Table 2.

We report a number of evaluation measures. P@K, MAP and MRR denote precision at K, mean average precision and mean reciprocal rank, respectively; these measures were computed by binarising the judgments into relevant (for the Most Important and Important labels) and irrelevant (for the rest). DCG is the discounted cumulative gain; we used gains 10, 3, 1 and 0, respectively, for the Most Important to Unrelated labels, and a discount of log(r + 1) at rank r. NDCG is the normalized DCG.
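To make the baseline and the evaluation measures concrete, here is a minimal sketch (Python; the data structures, label strings and function names are our own assumptions, not the authors' implementation) of the MaxScore ranking and of DCG/NDCG with the gains stated above. P@K, MAP and MRR follow from the same binarised labels in the usual way.

```python
import math
from collections import defaultdict

# Gains for DCG as described in the text; the exact label strings are assumptions.
GAINS = {"Most Important": 10, "Important": 3, "Related": 1, "Unrelated": 0}

def max_score_ranking(passages, entities_in_passage):
    """Baseline (MaxScore): score each entity by the maximum retrieval score s
    of the passages in Pq in which it appears.

    passages: list of (passage_id, score) pairs, i.e. the top-K retrieved set Pq.
    entities_in_passage: dict mapping passage_id to an iterable of (value, type) entities.
    """
    scores = defaultdict(float)
    for pid, s in passages:
        for entity in entities_in_passage.get(pid, ()):
            scores[entity] = max(scores[entity], s)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def dcg(labels):
    """DCG of a ranked list of judgment labels, with gains 10/3/1/0 and a
    log(r + 1) rank discount (r starts at 1)."""
    return sum(GAINS.get(lab, 0) / math.log(r + 1)
               for r, lab in enumerate(labels, start=1))

def ndcg(labels):
    """Normalize by the DCG of the ideal (gain-sorted) reordering of the same labels."""
    ideal = dcg(sorted(labels, key=lambda lab: GAINS.get(lab, 0), reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```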

3.1 Entity Containment Graph Methods

Figure 1: Entity containment graphs for the query “Life of Pablo Picasso”: (a) small graph detail (3 relevant sentences only); (b) full entity containment graph.

The first set of algorithms is based on the “entity containment” graph. This graph is constructed by connecting every passage in Pq to every entity in Cq which is contained in the passage. This forms a bipartite graph in which the degree of an entity equals its passage frequency in Pq. Figure 1 shows two views of the entity containment graph obtained for the query “Life of Pablo Picasso”. Once this graph is constructed we can use different graph centrality measures to rank entities in the graph. The most basic one is the degree. This method (noted Degree in Table 2) alone yields a 47% relative increase in MAP and 22% in NDCG. This is a clear indication that the entity containment graph can be a useful representation of entity relevance. We experimented with higher order centrality measures such as finite stochastic walks or PageRank, but the performance was similar to or worse than that of degree.

We observed that degree-dependent methods are biased by very general entities (such as descriptions, country names, etc.) which are not interesting but have high frequency. To improve on this, we experimented with two different methods. An ad-hoc method consists in removing the description types, which are the most generic and would seem to be a priori the least informative. However, doing this did not lead to improved results (models noted F- in Table 2). Furthermore, this solution would not be applicable in practice, since we may not always know which are the less informative types of a corpus. An alternative method was to weight the degree of an entity by its inverse entity frequency

    ief := log(N / n_e),

where N is the total number of sentences and n_e the number containing the entity e. This improved the results further, leading to a 76% relative increase over the baseline in MAP and 31% in NDCG. We also tried to improve results by weighting the entity degree computation with the sentence relevance scores. This approach (noted W- in Table 2) did not improve the results, despite trying several forms of score normalization.
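For illustration, here is a sketch of the Degree and Degree·ief rankings over the entity containment graph (Python; the passage and entity representations, and the use of collection-wide sentence counts for n_e, are assumptions rather than the authors' code).

```python
import math
from collections import Counter

def degree_ranking(pq_ids, entities_in_passage):
    """Degree of an entity in the containment graph = its passage frequency,
    i.e. the number of passages in Pq that contain it."""
    degree = Counter()
    for pid in pq_ids:
        for entity in set(entities_in_passage.get(pid, ())):
            degree[entity] += 1
    return degree

def degree_ief_ranking(degree, entity_sentence_freq, n_sentences):
    """Weight each entity's degree by ief = log(N / n_e); entity_sentence_freq
    maps an entity to n_e, the number of sentences containing it."""
    scored = {}
    for entity, deg in degree.items():
        n_e = max(entity_sentence_freq.get(entity, 1), 1)  # guard against zero counts
        scored[entity] = deg * math.log(n_sentences / n_e)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```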

3.2 Web Search based Methods

For computing the relevance of entities to a given query, we do not need to be constrained to the text of Wikipedia: we can rely on the Web as a noisier, but much larger scale corpus. Based on this observation, we have experimented with ranking entities by computing their correlation to the query phrase on the Web using correlation measures well known in text mining [6]. This technique has been successfully applied in the past, for example to the problem of social network mining from the Web [4, 5]. The difference here is that we are only interested in the correlations between the query and the set of related entities, while in co-occurrence analysis one typically computes the correlations between all possible pairs of entities to create a co-occurrence graph of entities.

Query-to-entity correlation measures can be easily computed using search engines by observing the page counts returned for the entity, the query, and their conjunction. We found that of the common measures we tested (Jaccard coefficient, Simpson coefficient and Google distance), the Jaccard coefficient clearly produced the best results (see Web Jaccard in Table 2). In practice we obtained the best results from the search engine when quoting the query string, but not the entity label. This can be explained by the fact that queries are typically correct expressions, while the entity tagger often makes mistakes in determining the boundaries of entities; enforcing these incorrect boundaries results in a decrease in performance. The improvement obtained over the baseline (32% relative in MAP and 6% in NDCG) is however not as good as that obtained from the entity containment graph. One of the reasons may be that, for some queries, the quality of the results obtained from searching the Web may be inferior to that obtained retrieving Wikipedia passages. For such queries, results obtained after a certain rank are not relevant and therefore bias the correlation measures.

To alleviate this, we experimented with a novel measure based on the idea of discounting the importance of documents as their rank increases. Simple versions of this did not lead to an increase in performance. One of the main problems is that different queries and entities result in result sets of varying quality and size. This led us to try slightly more sophisticated methods. In order not to penalize result lists containing many relevant documents, instead of using the ranks directly we used a notion of average precision where the relevant documents are those returned both for the query and for the entity. The method is illustrated in Figure 2. We compare the set of top K documents returned for the query (thought of as relevant) with the ranked list of results returned for a particular entity. Next, we determine which of the documents returned for the entity are in the relevant set and compute their average precision. Computing such an average precision has the advantage of almost eliminating the effect of K, which should depend on the query. Indeed, this method greatly improves the result over the Jaccard and baseline methods (see Web RankDiscounted in Table 2). This method achieved a performance that is on par with degree-based methods that take ief into account. Nevertheless, we still require the entity extraction and passage retrieval steps to produce the set of candidate entities.
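A minimal sketch of the Web Jaccard correlation follows, assuming only a hit_count(phrase) callable that returns a search engine's page count for a phrase (no particular engine API is implied). As noted above, the query is quoted but the entity label is not.

```python
def web_jaccard(query, entity_label, hit_count):
    """Jaccard coefficient between query and entity from Web page counts:
    |q AND e| / (|q| + |e| - |q AND e|).

    hit_count: callable mapping a query string to a page count (assumed given).
    The query phrase is quoted; the entity label is left unquoted because the
    tagger's entity boundaries are often noisy.
    """
    n_q = hit_count('"%s"' % query)
    n_e = hit_count(entity_label)
    n_qe = hit_count('"%s" %s' % (query, entity_label))
    denominator = n_q + n_e - n_qe
    return n_qe / denominator if denominator > 0 else 0.0
```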

4. DISCUSSION

We have taken the first steps towards studying the problem of ad-hoc entity ranking in the presence of a large set of heterogeneous entities.

Table 2: Performance of the different models (best two in bold).

MODEL                 MAP   P@10  P@30   MRR   DCG    NDCG  Relative ∆NDCG
MaxScore              0.34  0.37  0.28   0.64  67.91  0.64  –
MaxScore(1 + ief)     0.40  0.39  0.29   0.66  69.92  0.68  6%
Degree                0.50  0.54  0.37   0.96  79.69  0.78  22%
F-Degree              0.50  0.52  0.39   0.95  79.23  0.79  22%
Degree · ief          0.60  0.63  0.451  0.98  83.89  0.84  31%
F-Degree · ief        0.57  0.60  0.44   0.98  82.59  0.82  28%
W-Degree              0.48  0.51  0.38   0.92  79.11  0.76  21%
W-Degree · ief        0.54  0.63  0.42   0.94  82.68  0.81  28%
Web RankDiscounted    0.62  0.65  0.50   0.95  86.34  0.83  30%
Web Jaccard           0.45  0.50  0.340  0.75  78.27  0.71  10%

We have constructed a realistic testbed to carry out evaluation of entity ranking models, and we have provided some initial directions of research. With respect to entity containment graphs, our results show that it is important to take into account the notion of inverse entity frequency to discount general types. With respect to Web methods, we showed that taking into account the rank of the documents in the computation of correlations can yield significant improvements in performance. Web methods are complementary to graph methods and could be combined in a number of ways. For example, correlation measures can be used to compute correlation graphs among the entities; these graphs could replace the entity containment graphs discussed above. Furthermore, ief could be combined with Web measures, or we could define an ief that depends on Web information. It may also be possible to select the candidate set of entities directly from the search results (or even just the snippets) obtained from a Web search engine; this would eliminate the need for offline pre-processing of collections. We plan to explore these issues in the future. Nevertheless, it is necessary to increase the quality of the evaluation in order to further quantify the benefits of the different methods. To this end, we have released the corpus used in this study to the public, and we plan to design and carry out more thorough evaluations.

Figure 2: Computing the average precision of the results returned for the entity “Gertrude Stein” with respect to the set of pages relevant to the query “Life of Pablo Picasso”.
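The computation of Figure 2 can be sketched as follows (Python; the ranked result lists are assumed to be lists of page identifiers such as URLs, and the cut-off k is an assumption): the top-k pages returned for the query are treated as the relevant set, and the entity is scored by the average precision of its own result list against that set.

```python
def rank_discounted_score(query_results, entity_results, k=50):
    """Average precision of the entity's Web results with respect to the top-k
    results of the query (treated as the 'relevant' set).

    query_results, entity_results: ranked lists of page identifiers.
    """
    relevant = set(query_results[:k])
    hits, precisions = 0, []
    for rank, page in enumerate(entity_results, start=1):
        if page in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0
```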

5. ACKNOWLEDGEMENTS

For entity extraction, we used the open source SuperSense Tagger (http://sourceforge.net/projects/supersensetag/). For indexing and retrieval, we used the IXE retrieval library (http://www.ideare.com/products.shtml), kindly made available to us by Tiscali.

6. REFERENCES

[1] S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In WWW ’06, pages 717–726, New York, NY, USA, 2006. ACM Press.
[2] H. Chen, H. Shen, J. Xiong, S. Tan, and X. Cheng. Social network structure behind the mailing lists: ICT-IIIS at TREC 2006 expert finding track. In Text REtrieval Conference (TREC), 2006.
[3] L. Denoyer and P. Gallinari. The Wikipedia XML corpus. SIGIR Forum, 40(1):64–69, June 2006.
[4] Y. Matsuo, M. Hamasaki, H. Takeda, J. Mori, D. Bollegala, Y. Nakamura, T. Nishimura, K. Hasida, and M. Ishizuka. Spinning multiple social networks for Semantic Web. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI 2006), 2006.
[5] P. Mika. Flink: Semantic Web technology for the extraction and analysis of social networks. Journal of Web Semantics, 3(2), 2005.
[6] G. Salton. Automatic Text Processing. Addison-Wesley, Reading, MA, 1989.
[7] H. Zaragoza, J. Atserias, M. Ciaramita, and G. Attardi. Semantically annotated snapshot of the English Wikipedia v.0 (SW0). http://www.yr-bcn.es/semanticWikipedia, 2007.
