Result Merging in a Peer-to-Peer Web Search Engine

Sergey Chernov

UNIVERSITÄT DES SAARLANDES
February 2005

Result Merging in a Peer-to-Peer Web Search Engine

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Submitted by

Sergey Chernov, under the guidance of Prof. Dr.-Ing. Gerhard Weikum and Christian Zimmer

UNIVERSITÄT DES SAARLANDES
February 2005

Abstract

The tremendous amount of information on the Internet calls for powerful search engines. Currently, only commercial centralized search engines like Google can process terabytes of Web documents. Such approaches fail to index the "Hidden Web" located in intranets and local databases, and with the exponential growth of information volume the situation is becoming even worse. Peer-to-Peer (P2P) systems can be pursued to extend current search capabilities. The Minerva project is a Web search engine based on a P2P architecture. In this thesis, we investigate the effectiveness of different result merging methods for the Minerva system. Each peer provides an efficient search engine for its own focused Web crawl. Each peer can pose a query against a number of selected peers; the selection is based on a database ranking algorithm. The top-k results from several highly ranked peers are collected by the query initiator and merged into a single list. We address this result merging problem. We select several merging methods that are feasible in a heterogeneous, dynamic, distributed environment. An experimental framework for these methods was implemented, and the effectiveness of the merging techniques was studied with TREC Web data. The ranking method based on language modeling produced the most robust and accurate results under different conditions. We also propose a new merging method that incorporates a preference-based language model. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking. In every tested setup, the new method was at least as effective as the baseline or slightly better.


I hereby declare that this thesis is entirely my own work and that I have not used any media other than those mentioned in the thesis.

Saarbrücken, 9th February 2005

Sergey Chernov


Acknowledgements

I would like to thank my academic advisor, Professor Gerhard Weikum, for his guidance and encouragement throughout my master's thesis project. I wish to express my sincere gratitude to my supervisors Christian Zimmer, Sebastian Michel, and Matthias Bender for their invaluable assistance and feedback. I would like to thank Kerstin Meyer-Ross for her continuous support in everything. I am very grateful to the members of the Databases and Information Systems group AG5, to my fellow students from the IMPRS program, and to all my friends from the Max Planck Institute who provided me with a friendly and stimulating environment. I would like to extend special thanks to Pavel Serdyukov and Natalie Kozlova for the numerous discussions and helpful ideas. I want to thank Alexandra and Oleg Schultz for proofreading the thesis for me. It is difficult to explain how grateful I am to my mother, Galina Nikolaevna, and my father, Alevtin Petrovich; their wisdom and care made it possible for me to study. Finally, I want to thank the one person who was very supportive and patient during this process, my dear wife Olga. I would never have accomplished this work without her love.


Contents

1 Introduction
  1.1 Motivation
  1.2 Our contribution
  1.3 Description of the remaining chapters

2 Web search and Peer-to-Peer Systems
  2.1 Information retrieval basics
  2.2 Web search engines
  2.3 Peer-to-Peer architecture
  2.4 P2P Web search engines
  2.5 Minerva project
  2.6 Summary

3 Result merging in distributed information retrieval
  3.1 Distributed information retrieval in general
  3.2 Result merging problem
  3.3 Prior work on collection fusion
      3.3.1 Collection fusion properties
      3.3.2 Cooperative environment
      3.3.3 Uncooperative environment
      3.3.4 Learning methods
      3.3.5 Probabilistic methods
  3.4 Prior work on the data fusion
      3.4.1 Data fusion properties
      3.4.2 Basic methods
      3.4.3 Mixture methods
      3.4.4 Metasearch approved methods
  3.5 Summary

4 Selected result merging strategies
  4.1 Target properties for result merging methods
  4.2 Score normalization with global IDF
  4.3 Score normalization with ICF
  4.4 Score normalization with CORI
  4.5 Score normalization with language modeling
  4.6 Score normalization with raw TF scores
  4.7 Summary

5 Our approach
  5.1 Result merging with the preference-based language model
  5.2 Discussion
  5.3 Summary

6 Implementation
  6.1 Global statistics classes
  6.2 Testing components

7 Experiments
  7.1 Experimental setup
      7.1.1 Collections and queries
      7.1.2 Database selection algorithm
      7.1.3 Evaluation metrics
  7.2 Experiments with selected result merging methods
      7.2.1 Result merging methods
      7.2.2 Merging results
      7.2.3 Effect of limited statistics on the result merging
  7.3 Experiments with our approach
      7.3.1 Optimal size of the top-n
      7.3.2 Optimal smoothing parameter β
  7.4 Summary

8 Conclusions and future work
  8.1 Conclusions
  8.2 Future work

Bibliography

A Test queries

List of Figures

2.1 The Minerva system architecture
3.1 Simple metasearch architecture
3.2 A query processing scheme in the distributed search system
3.3 Collection fusion vs. data fusion
3.4 An overlapping in the collection fusion problem
3.5 Statistics propagation for the collection fusion
3.6 Data fusion on a single search engine
6.1 Main classes involved in merging
6.2 A general view on the experiments implementation
7.1 The macro-average precision with the database ranking RANDOM
7.2 The macro-average recall with the database ranking RANDOM
7.3 The macro-average precision with the database ranking CORI
7.4 The macro-average recall with the database ranking CORI
7.5 The macro-average precision with the database ranking IDEAL
7.6 The macro-average recall with the database ranking IDEAL
7.7 The macro-average precision of the LM04 result merging method with the different database rankings
7.8 The macro-average precision with the database ranking CORI with the global statistics collected over the 10 selected databases
7.9 The macro-average precision with the database ranking IDEAL with the global statistics collected over the 10 selected databases
7.10 The macro-average precision with the database ranking CORI with the different size of top-n for the preference-based model estimation
7.11 The macro-average precision with the database ranking IDEAL with the different size of top-n for the preference-based model estimation
7.12 The macro-average precision with the database ranking CORI with the top-10 documents for the preference-based model estimation with β = 0.6 and the LM04 result merging method
7.13 The macro-average precision with the database ranking IDEAL with the top-10 documents for the preference-based model estimation with β = 0.6 and the LM04 result merging method
A.1 Relevant documents distribution for the TF method with the IDEAL ranking
A.2 Relevant documents distribution for the LM04PB06 method with the IDEAL ranking
A.3 Residual between the number of relevant documents of the CE06LM04 and TF methods with the IDEAL database ranking

List of Tables

4.1 The target properties of the result merging methods
6.1 Classes description
7.1 Topic-oriented experimental collections
7.2 The difference in percent of the average precision between the result merging strategies and corresponding baselines with the RANDOM ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
7.3 The difference in percent of the average precision between the result merging strategies and corresponding baselines with the CORI ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
7.4 The difference in percent of the average precision between the result merging strategies and corresponding baselines with the IDEAL ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
A.1 The topic-oriented set of the 25 experimental queries (topics are coded as "HM" for Health and Medicine, and "NE" for Nature and Ecology)
A.2 The number of relevant documents for the TF and LM04PB06 methods with the IDEAL database ranking (LM04PB06 is shortened to LMPB for convenience)

Chapter 1

Introduction

1.1 Motivation

Millions of new documents are created every day across the Internet, and even more are changed. This huge amount of information grows exponentially, and a search engine is the only hope for finding the documents relevant to a particular user's need. Routine access to information is now based on full-text information retrieval instead of controlled vocabulary indices. Currently, a huge number of people use the Web for text and image search on a regular basis. In this thesis, we consider the problems of search on text data. The need for effective search tools becomes more important every day, but currently only a few centralized search engines like Google (www.google.com) are capable of coping with this task, and even these engines are only partially effective. The so-called Hidden Web consists of all intranets and local databases behind portal pages. According to the estimation in [SP01], it is 2 to 50 times larger than the Visible Web, which can be crawled by search robots. Taking into account that even the largest Google crawl of more than 8 billion pages encompasses only a part of the Visible Web, we can imagine how many potentially relevant pages a centralized search engine omits during a search. This problem results from the technical limitations of a single search engine. The desire to overcome these limitations established a new scientific direction: distributed information retrieval, or metasearch; we will use both terms as synonyms. The main technique developed in this

field is an intermediate broker called a metasearch engine. It has access to the query interfaces of the individual search engines and text databases. Briefly, when a metasearch engine receives a user's query, it passes the query to a set of appropriate individual search engines and databases. It then collects the partial results and combines them to improve the overall result. Numerous examples of metasearch engines are available on the Web (www.search.com, www.profusion.com, www.inquirus.com). The metasearch approach contains several significant sub-problems, which arise in the query execution process. The database selection problem occurs when a query is routed from a metasearch engine to the individual search engines. A naive routing approach is to propagate a query to all available engines. The scalability of such a strategy is unsatisfactory, since it is inefficient to ask more than a few dozen servers. The database selection process helps to discover a small number of the most useful databases for the current query and to ask only this limited subset without a significant loss in recall. Many database selection methods have been developed to tackle this issue [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02]. The result merging problem is another important sub-problem of the metasearch technique. In information retrieval, the output is a ranked list of documents sorted by their similarity score values. In distributed information retrieval, this list is obtained from several result lists, which are merged into one. The result merging problem is not trivial, and numerous merging techniques have been studied in the literature [CLC95, TVGJL95, Bau99, Cra01, SJCO02]. The issue of automatic database discovery has not been fully addressed, so adding new data sources to a metasearch engine remains mainly a manual task. The major drawback of metasearch is that large search engines are not interested in cooperation. A search result is a commercial product which they want to "sell" themselves. For example, the STARTS proposal [GCGMP97] is a quite effective communication protocol designed especially for metasearch, but it is not widely used for this reason. The new Peer-to-Peer (P2P) technologies can help us remove the limitations caused by the uncooperativeness of search engine vendors. The computation power of processors increases every year, and so does the network bandwidth.

Millions of personal computers offer sufficient storage and computational resources to index their own documents and perform small crawls of interesting fragments of the Web. They can provide search on their local index, but do not have to uncover the data itself unless they want to. This is a way to incorporate Hidden Web pages into a global search mechanism. Collaborative crawling can span a larger portion of the Web, since every peer can contribute its own focused crawl to the system. This method is cheap and provides us with topic-oriented search opportunities; we can also use the intellectual input of other users to improve our own search. Such considerations launched the Minerva project, a new P2P Web search engine. The metasearch field has many properties in common with search in a P2P system, but some important distinctions should be taken into account. A P2P environment is much more dynamic than traditional metasearch:

• Queries are processed on millions of small indices instead of dozens of large indices;

• Global query execution might require resource sharing and collaboration among the different peers and cannot be fully performed on one peer;

• Limited statistics are a necessary requirement for a scalable P2P system, while in distributed information retrieval rich statistics can be provided by a centralized source;

• Cooperativeness of peers in a P2P system, in contrast to a metasearch setting, helps to reduce heterogeneity in such parameters as representative statistics or index update time.

Distributed information retrieval accommodates features from two research areas: information retrieval and distributed systems. The goal of effectiveness is inherent in the former; it aims at high relevance of the returned documents, and the collaboration of users in a P2P setting gives us additional opportunities to refine the search results. The main goals of the Minerva project include traditional metasearch goals and some new issues:


1. Increased search coverage of the Web;

2. Retrieval effectiveness comparable with a centralized search engine;

3. A scalable architecture to combine millions of small search engines.

For this purpose, we want to exploit existing solutions from distributed information retrieval and adapt them to our new setup, with the aforementioned distinctive properties in mind. We also want to find new methods which are suitable for a P2P architecture and can improve our system. The practical goal is to create a prototype of a highly scalable, effective, and efficient distributed P2P Web search engine.

1.2 Our contribution

The main purpose of this thesis is to develop an effective result merging method for the Minerva system. We analyze the major sub-problems of result merging and review several existing techniques. The selected methods have been implemented and evaluated in the Minerva prototype. In addition, a new preference-based language model method for result merging is proposed. Our approach combines the preference-based and the result merging rankings. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking. We address the issue of effectiveness, which is determined by the underlying result merging scheme. As in distributed information retrieval, in a P2P system the similarity scores for each document are computed on the basis of local database statistics. This makes the scores incomparable, due to the differences in statistics across databases. A score computation based on global statistics is the most accurate solution in our case. For cooperative data sources, as we have in the Minerva system, we can collect the local database-dependent statistics and replace them with globally estimated ones, which are fair for all databases. We elaborated on this issue by testing several global score computation techniques and discovering the most effective scoring function.


We also exploited additional information about the user's preferences in order to improve the quality of the final ranking. Our method combines two rankings. The first ranking is the language modeling result merging scheme. The second one is based on a language model obtained from pseudo-relevance feedback. The user preferences are inferred using pseudo-relevance feedback: the top-k results from the best ranked database are assumed to be relevant. The novelty of our method is that the pseudo-relevance feedback is obtained on the top-ranked peer before the global query execution.

1.3 Description of the remaining chapters

Background information about information retrieval and P2P systems is presented in Chapter 2. An overview of distributed information retrieval and recent work on result merging is given in Chapter 3. Chapter 4 presents details of the merging techniques that we select for our experimental studies. Chapter 5 describes our new approach based on the preference-based language model. In Chapter 6, we present implementation details. The experimental setup, evaluation methodology, and our results are presented in Chapter 7. Chapter 8 concludes the thesis with a summary and suggestions for future work.


Chapter 2

Web search and Peer-to-Peer Systems

In this chapter, we present a short description of Web search, P2P systems, and their potential as a platform for distributed information retrieval. Section 2.1 contains introductory information about information retrieval. In Section 2.2, we review Web search engines. In Section 2.3, some general properties of P2P systems are discussed. Section 2.4 presents recent approaches for combining search mechanisms with a P2P architecture. Section 2.5 describes our approach, the Minerva project.

2.1 Information retrieval basics

Information retrieval deals with search engine architectures, algorithms, and methods used for information search on the Internet, in digital libraries, and in text databases. The main goal is to find the documents relevant to a query within a collection of documents. The documents are preprocessed and placed into an index, which provides the basis for retrieval. A typical search engine is based on the single database model of text retrieval. In this model, the documents from the Web and local databases are collected into a centralized repository and indexed. The model is effective if the index is large enough to satisfy most of the users' information needs and the search engine uses an appropriate retrieval system. A retrieval system is a set of retrieval algorithms for different purposes:

ranking, stemming, index processing, relevance feedback, and so on. The widely used "bag-of-words" model assumes that every document can be represented by the words contained in it. The most frequent words like "the", "and", or "is" do not have rich semantics. They are called stopwords, and we remove them from the document representation. The full set of stopwords is stored in a stopword list. Word variations with the same stem, like "run", "runner", and "running", are mapped onto one term corresponding to the particular stem; a stemming algorithm performs this mapping. In the current example the term is "run". An important characteristic of a retrieval system is its underlying model of the retrieval process. This model specifies how to estimate the probability that a document will be judged relevant. The final document ranking is based on this estimation and is presented to the user after query execution. The simple retrieval process models include the probabilistic model and the vector space model; the latter is most widely used in search engines. In the vector space model the document D is represented by the vector d = (w_1, w_2, ..., w_m), where w_i is the weight indicating the importance of term t_i in representing the semantics of the document and m is the number of distinct terms. For all terms that do not occur in the document, the corresponding entries are equal to zero, so the full document vector is very sparse. When a term occurs in the document, two factors are of importance for the weight assignment. The first factor is the term frequency (TF), the number of the term's occurrences in the document. The weight of the term in the document's vector is proportional to TF: the more often the term occurs, the greater its importance in representing the document's semantics. The second factor, which affects the weight, is the document frequency (DF), the number of documents containing the particular term. The term weight is multiplied by the inverse document frequency (IDF): the more frequently the term appears across documents, the smaller its importance in discriminating the documents having the term from the documents not having it. The de facto standard for term weighting is the TF · IDF product and its modifications. A simple query Q is a set of keywords.

It is also transformed into an m-dimensional vector q = (w_1, w_2, ..., w_m) using all the preprocessing steps like stopword elimination, stemming, and term weighting. After the creation of q, the similarity between the document's vector d and the query's vector q is estimated. This estimation is based on a similarity function, which can be a distance or an angle measure. The most popular similarity function is the cosine measure, which is computed as a scalar product between q and d. Another popular approach, which tries to overcome the heuristic nature of term weight estimation, comes from the probabilistic model. The language modeling approach [PC98] to information retrieval attempts to predict the probability of a query being generated by a document. Although details may differ, the main idea is as follows: every document is viewed as a sample generated from a special language. A language model for each document can be estimated during indexing. The relevance of a document for a particular query is formulated as how likely it is that the query was generated from the language model of that document. The likelihood for the query Q to be generated from the language model of the document D is computed as follows [SJCO02]:

    P(Q|D) = ∏_{i=1}^{|Q|} ( λ · P(t_i|D) + (1 − λ) · P(t_i|G) )        (2.1)

where t_i is a query term of the query Q; P(t_i|D) is the probability for t_i to appear in the document D; P(t_i|G) is the probability for the term t_i to be used in the common language model, e.g. in English; and λ is a smoothing parameter between zero and one. The role of P(t_i|G) is to smooth the probability of the document D generating the query term t_i, particularly when P(t_i|D) is equal to zero. The usual measures for evaluating retrieval effectiveness are recall and precision; they are defined as follows [MYL02]:

    recall = NumberOfRetrievedRelevantDocuments / NumberOfRelevantDocuments        (2.2)

    precision = NumberOfRetrievedRelevantDocuments / NumberOfRetrievedDocuments        (2.3)
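As an illustration, the following minimal sketch shows the query-likelihood scoring of Equation 2.1 with linear smoothing; the function and variable names are only illustrative and are not part of the Minerva implementation.

```python
from collections import Counter

def lm_score(query_terms, doc_terms, global_counts, global_total, lam=0.5):
    """Query likelihood P(Q|D) of Equation 2.1 with linear smoothing."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 1.0
    for t in query_terms:
        p_t_d = doc_counts[t] / doc_len if doc_len else 0.0       # P(t_i | D)
        p_t_g = global_counts.get(t, 0) / global_total            # P(t_i | G)
        score *= lam * p_t_d + (1 - lam) * p_t_g
    return score
```

In practice the product would be accumulated in log space to avoid numerical underflow for long queries.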

The effectiveness of a text retrieval system is evaluated using a set of test queries. The relevant document set is identified beforehand. For every test query a precision value is computed at different levels of recall; these values are averaged over the whole query set and an average recall-precision curve is produced. In the ideal case, when a system retrieves exactly the full set of relevant results every time, the recall and precision values are both equal to one. In practice, we cannot achieve such effectiveness due to query ambiguity, a specific user's understanding of the notion of relevance, and other factors. Incorporating explicit user feedback and user preferences implicitly inferred from previous search sessions can improve the retrieval quality.
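The following small sketch shows how recall, precision, and a macro-averaged precision over a set of test queries could be computed; it is a simplified illustration of Equations 2.2 and 2.3, not the evaluation code used in the experiments.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision (Equations 2.2 and 2.3) for a single query."""
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

def macro_average_precision(runs):
    """Average the precision over (retrieved, relevant) pairs, one pair per test query."""
    values = [recall_precision(ret, rel)[1] for ret, rel in runs]
    return sum(values) / len(values) if values else 0.0
```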

2.2 Web search engines

An information retrieval system for Web pages is called a Web search engine. The capabilities of these systems are very broad; modern techniques allow queries on text, image, and sound files. In our work, we consider the problem of text data retrieval. Web search engines are also differentiated by their application area. General-purpose search engines can search across the whole Web, while special-purpose engines concentrate on specific information sources or specific subjects. We are interested in general-purpose Web search engines. Web search engines inherited many properties from traditional information retrieval. Every Web search engine has a text database or, equivalently, a document collection that consists of all documents searchable by this engine. An index for these documents is created before query time; every term in it represents a single keyword or phrase. For each term one inverted index list is constructed; this list contains document identifiers for every document containing the term, along with the corresponding similarity values. During query execution, a search engine joins the inverted index lists corresponding to the query terms. Then the search engine sorts all found documents in descending order of their similarity score and presents the resulting ranking to the user.
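A minimal sketch of this query execution over an inverted index is shown below; the index layout and the pluggable weighting function are simplifying assumptions, not the data structures of an actual engine.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> list of terms. Returns term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    return index

def execute_query(query_terms, index, weight):
    """Join the inverted lists of the query terms and rank documents by summed weights."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, tf in index.get(t, {}).items():
            scores[doc_id] += weight(t, tf)     # e.g. a TF*IDF-style weight
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```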

There are also distinctive features of Web retrieval that were not used in traditional information retrieval. The most prominent examples are additional hyperlink relationships between documents and intensive document tagging. These differences can serve as sources of additional information for search refinement, and they are exploited in different retrieval algorithms. Web developers created a significant portion of the hyperlinks on the Web manually, and this is an implicit intellectual input. The linkage structure can be used as expert evidence that two pages connected by a hyperlink are also semantically related. It can also be an indication that the Web designer, who placed the hyperlinks on some pages, assesses their content as valuable. Several algorithms are based on these considerations. The PageRank algorithm computes the global importance of a Web page in a large Web graph, which is inferred from the set of crawled pages [BP98]. The advantage of this algorithm is that the global importance of a page can be precomputed before query execution time. The HITS algorithm [Kle99] uses only a small subset of the Web graph; this subset is constructed at query time. Such on-line computation is inconvenient for search engines with a high query workload, but it allows a topic-oriented computation of page authority. The HTML tagging of documents can also be useful in Web search engines. Rich information about the importance of terms can be inferred from their position in a document. Terms in the title section are more important than those in the body of the document. Emphasis with a font size and style also indicates an additional importance of the term. Sophisticated term weighting schemes, which are based on these observations, improve the retrieval quality.
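The following compact sketch shows the basic PageRank power iteration on a small link graph; it ignores refinements such as the handling of dangling pages, and the damping factor value is only the commonly used one, not a parameter of the Minerva system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: mapping page -> list of outgoing-link targets. Returns page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in links.items():
            if not targets:
                continue                       # dangling pages are ignored here
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank
```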

There are several important limitations of the existing Web search engines. The first restriction is imposed by the size of the searchable index. According to the Google statistics (www.google.com), this search engine has the largest crawled index on the Web; its current size is about 8 billion pages. At the same time the Hidden Web or Deep Web, which embraces the pages excluded from crawling for commercial or technical reasons, is about 2 to 50 times larger than the Visible Web [SP01]. Even now it is unrealistic for a single search engine to maintain an index of this size, and the information volume increases even faster than the computational power of the centralized Web search engines. The second problem is the outdating of the crawled information. News pages change daily, and it is impossible to update the whole index at this rate. Some updating strategies help track changes on the most popular sites on the Internet, but many index entries are completely outdated. The novel opportunities provided by peer-to-peer systems help to solve these problems.

2.3 Peer-to-Peer architecture

A distributed system is a collection of autonomous computers that cooperate in order to achieve a common goal [Cra01]. In the ideal case a user of such a system does not explicitly notice other computers, their location, storage replication, load balancing, reliability, or functionality. A P2P system is an instance of a distributed system; it is a decentralized, self-organized, highly dynamic, loose coupling of many autonomous computers. P2P systems became famous several years ago with the Napster (www.napster.com) and Gnutella (www.gnutella.com) file-sharing systems. In the file-sharing P2P communities, every computer can join as a peer using a client program. Other peers can access all resources shared by the peers in this environment. The main feature of such systems is that the peer who is looking for a file can directly contact the peer that is sharing this file. The only information that has to be propagated is the peer's address and a short description of the shared data. The first systems like Napster used a centralized server with all peer addresses and the names of the shared files. Later approaches avoided a single point of failure and used the Gnutella-style flooding protocol, which consecutively broadcasts a request for a particular file through a small number of closest neighbors until the message expires. Modern P2P applications like eDonkey (http://www.edonkey2000.com/) are extremely popular now; they have numerous improvements over their predecessors. Thus, we can harness the power of thousands of autonomous personal computers all over the world to create a temporary community for collaborative work. P2P technology tries to make systems scalable, self-organized, fault-tolerant, publicly available, and load-balanced. This list of desirable

P2P properties is not exhaustive, and there are also issues like anonymity, security, etc., but the selected properties are fundamental for our task. For example, modern P2P systems are often based on a mixed topology where some "super-peers" establish different levels of hierarchy, but we are interested in a pure, flat P2P structure. It gives equal rights to all peers and makes the system more scalable. The limitation of search capabilities is a considerable drawback of most P2P systems. Sometimes you have to know the exact filename of the data of interest or you will miss the relevant results. The combination of search engine mechanisms for effective retrieval with the powerful paradigm of a P2P community is a promising research direction.

2.4 P2P Web search engines

The idea of a Peer-to-Peer Web search engine is being extensively investigated nowadays. Interesting combinations of search services with P2P platforms are described in the following approaches. ODISSEA [SMW+03] is different from many other P2P search approaches. It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system. Under a global index organization, in contrast to a local one, a single node holds the entire inverted index for a particular term. A distributed version of Fagin's threshold algorithm is used for result aggregation over the inverted lists. It is efficient only for very short queries of about 2-3 words. For the distributed hash table (DHT) implementation, this system incorporates the Pastry protocol. PlanetP [CAPMN03] is another content search infrastructure. Each node maintains an index of its content and summarizes the set of terms in its index using a Bloom filter. The global index is the set of all summaries. Summaries are propagated and kept synchronized using a gossiping algorithm. This approach is effective for several thousand peers, but it is not scalable. Its retrieval quality is rather low for top-k queries with a small k. GALANX [WGDW03] is a P2P system implemented on top of BerkeleyDB. Similar to the Minerva system, it maintains a local peer index on every node and distributes information about term presence on a

peer with a DHT. Different query routing strategies are evaluated in a simulation. Most of them are based on the Chord protocol, and the proposed strategies improve the basic effectiveness by enlarging the index size. The presented query routing approaches are not highly scalable, since the index volume continuously increases with the number of peers in the system.

2.5 Minerva project

The Minerva project [BMWZ04] is another Web search engine based on a P2P architecture; see Figure 2.1. In this system, every peer P_i provides an efficient search engine for its own focused Web crawl C_i. The documents D_ij are indexed locally, and the result is posted into a global directory as a set of index statistics S_i. The posting process and all other communications between the peers are based on the Chord protocol [SMK+01]. Every peer maintains a set of peerlists L_i for a disjoint subset of terms T_i, where the union of all the T_i over the |P| peers is the complete term set T. A peerlist is a mapping t → P', where t is a particular term and P' is the subset of peers that contain at least one document with this term. The terms are hashed, and their corresponding peerlists are distributed fairly across the peers by the Chord protocol. During query execution all necessary peerlists, one for each query keyword, are obtained and merged into one.

Figure 2.1: The Minerva system architecture

Every peer can pose a query against a number of selected peers that are most likely to contain the relevant documents. The selection is based on a query routing strategy; this issue is known in the literature as the database selection problem. A search engine on every selected peer processes its inverted index until it obtains the top-k highest-ranked documents for the current query. Then the best top-k results from these peers are collected by the query initiator and merged into one top-k list; this task is known as the result merging problem. The quality of the final top-k list depends heavily on the term weighting scheme on the peers and on the merging algorithm, whereas speed depends mostly on the local index processing scheme.
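A schematic sketch of this query flow is given below, assuming the peerlists have already been retrieved from the global directory. The peer selection heuristic and all names are illustrative, and the merge step simply assumes comparable scores, which is exactly the result merging problem studied in this thesis.

```python
import heapq

def select_peers(query_terms, peerlists, max_peers=10):
    """peerlists: term -> set of peer ids (from the global directory). Prefer peers
    that cover many query terms; a stand-in for a real database selection algorithm."""
    coverage = {}
    for t in query_terms:
        for peer in peerlists.get(t, set()):
            coverage[peer] = coverage.get(peer, 0) + 1
    return sorted(coverage, key=coverage.get, reverse=True)[:max_peers]

def merge_top_k(result_lists, k):
    """result_lists: per-peer lists of (score, doc_id), each sorted descending.
    Merge into a single top-k list; assumes the scores are already comparable."""
    top_k, seen = [], set()
    for score, doc_id in heapq.merge(*result_lists, reverse=True):
        if doc_id not in seen:                 # drop duplicates from overlapping crawls
            seen.add(doc_id)
            top_k.append((score, doc_id))
        if len(top_k) == k:
            break
    return top_k
```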

2.6 Summary

In this chapter, we introduced several basic concepts from information retrieval and Web search. We described some key ideas of P2P systems and reviewed several combinations of Web search engines with a P2P platform. A short description of the new P2P Web search engine Minerva was also provided. The scalability issue was recognized as being extremely important. The P2P architecture seems valuable in terms of effective and efficient retrieval.


Chapter 3

Result merging in distributed information retrieval

In this chapter, we review recent work on distributed information retrieval. In Section 3.1, we give a short overview of the general metasearch issues. Section 3.2 contains a comprehensive description of the result merging task. In Section 3.3, we elaborate on the collection fusion task. In Section 3.4, we address the problems of the data fusion task.

3.1 Distributed information retrieval in general

During the past ten years, a new research direction has emerged: distributed information retrieval, or metasearch. Metasearch is the task of collecting and combining the search results from a set of different sources. A typical scenario includes several search engines that execute a query and one metasearch engine that merges the results and creates a single ranked document list. Several interesting surveys of distributed information retrieval problems and solutions are presented in [Cal00, Cra01, Cro00, MYL02]. The distributed information retrieval task appears when the documents of interest are spread across many sources. In such a situation, it might be possible to collect all documents on one server or to establish multiple search engines, one for each collection of documents. The search process is performed across the

15

network with communications between many servers. This is a distinctive feature of distributed information retrieval. Searching in the distributed environment has several attractive properties, which make it preferable to the single engine search. Several of these important features are listed in [MYL02]: • The increased coverage of the Web, the indices from many sources are used in one search; • The solution for the problem of the search scalability, a combination is cheaper than a centralized solution; • The automation of the result preprocessing and combining, a user does not have to compare and combine the results from different sources manually; • The improved retrieval effectiveness, the combination of different search engines can produce a better ranking than any single ranking algorithm. The metasearch is based on the multi-database model, where several text databases are modeled explicitly. The multi-database model for information retrieval has many tasks in common with the single-database model but also has some additional problems [Cro00]: • The resource description task; • The database selection task; • The result merging task. These issues are essentially the core of distributed information retrieval research, we briefly describe them below. The main unit in the metasearch is an intermediate broker that is called a metasearch engine. It obtains and stores a limited summary about every database participating in a search process and decides which databases are most appropriate for a query. A metasearch engine also propagates a query to the selected single search engines, collects and reorganizes results. Simple metasearch architecture is presented on Figure 3.1. A user poses a query Q against a metasearch engine, which in turn propagates it to several search 16

Figure 3.1: Simple metasearch architecture engines. Then the result rankings Ri are retrieved by the broker, merged, and presented to the user as a single document ranking Rm . A summary statistics from a search engine is called resource description or database representative. A full-text database provides information about its contents in a set of statistics. It may include data about the number of specific term occurrences in the particular documents, in a whole collection, or the number of indexed documents etc. Information for building resource description is obtained during the index creation step. The richness of the database representatives depends on the level of cooperation in the system. For example, the STARTS standard [GCGMP97] is the good choice for a cooperative environment, where all search engines present their results in the unified informative format. On the other hand, when they are unwilling to cooperate we can infer their statistics from query-based sampling [SC03]. The collected resource descriptions are used for the database selection or query routing task. In practice, we are not interested in the databases, which are unlikely to contain relevant documents. Therefore, we can select from all data sources only those, which are probably relevant to our query according to their resource descriptions. For each database, we calculate the usefulness measure that is usually based on the vector space model. Creating the effective and robust usefulness measure for the database ranking is the most prominent task of database selection. Several attempts to address this 17

problem are described in [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02]. The result merging problem arises when a query is executed on several selected databases and we want to create one single ranking out of these results.

This problem is not trivial since the computation of similarity

score between documents and query uses local collection statistics. Therefore, the scores are not directly comparable. The most accurate solution could be obtained by a global score normalization and requires a cooperation from sources. We are especially interested in this latter problem. The carefully designed result merging algorithm can provide us with high quality results and give us an opportunity to speed-up a local index processing. More information about the result merging methods can be found in [CLC95, TVGJL95, Bau99, Cra01, SJCO02, SC03].

Figure 3.2: A query processing scheme in the distributed search system More precisely the query processing scheme is presented on Figure 3.2 [Cra01]. A query Q is posed on the set of search engines that are represented by their resource descriptions Si . A metasearch engine selects a subset of servers S’, which are most probable to contain the relevant documents. The size of this subset usually does not exceed 10 databases. The broker routes Q to these selected search engines Si ’ and obtains a set of document rankings R from the selected servers. In practice, a user is interested only in the top-k best results where k can vary from 5 to 30. All rankings Ri are merged into 18

one rank Rm and the top-k results from it are presented to the user. Text retrieval aims at the high relevance of the results at the minimum response time. These two components are translated into the general issues of effectiveness or quality and efficiency or speed of the query processing. This thesis is concerned with the effectiveness of the result merging problem.

3.2

Result merging problem

A common issue in the metasearch is how to combine several ranked lists of the relevant documents from the different search engines into one ranked list. It is the so-called result merging problem. The following section reviews some modern merging methods. Result merging is divided into two main sub-problems. The first one is collection fusion, where the results are merged from the disjoint or nearly disjoint document sets. The second sub-problem is data fusion, which arises when we merge the different rankings obtained on the identical document sets. The main difference between the collection fusion and data fusion is that in the first case we want to approximate the result of a single search system on which the document set consists of all document’s sub-sets involved in the merging. Therefore, the optimal solution is to obtain the same retrieval effectiveness as the search engine on the top of united database has. However, in the data fusion problem the task is to merge the different rankings in such a way that the final ranking is better than each participating ranking. The maximum quality of the result here is undefined but it should be no less than the quality of the best single ranking. Simple intuition for these two problems is presented on Figure 3.3. A comprehensive description of the differences between collection fusion and data fusion can be found in [VC99b, Mon02]. In metasearch, we often do not know beforehand what kind of a merging problem we have because it depends on the level of overlap between the documents of combined databases. If the overlap is very high the situation is closer to the data fusion, otherwise it is the collection fusion task. The metasearch on the Web was addressed mainly as the collection fusion problem. In fact, the overlap of search results on the different search engines is 19

Figure 3.3: Collection fusion vs. data fusion surprisingly low. However, some approaches also take into account the data fusion methods, sometimes both types are evaluated in the mixture setups. Another important property is the level of search engine cooperation. We divide all merging methods by the environment type into two categories: • Cooperative (integrated) environment; • Uncooperative (isolated) environment. The uncooperative or isolated merging methods have no other access to the individual databases than a ranked list of documents in the response to a query [Voo95]. The cooperative or integrated merging techniques assume an access to the database statistics values like T F , DF etc. In general, both types of merging methods can produce more effective results than the single collection with the full set of documents, if the data fusion strategy is used [TVGJL95]. In practice, the merged results produced by the uncooperative strategies have been less effective than the single collection run. Our primary goal is to find a subset of the effective merging methods, which we can apply and evaluate in the P2P Web search engine Minerva. We assume here that all peers in the Minerva system are cooperative and provide all necessary statistics.

3.3

Prior work on collection fusion

A formal definition of the collection fusion problem was stated in [TVGJL95]. 20

It is mixed with the data fusion definition; therefore, we modified it. Assume a set of document collections C associated with the search engines. With respect to the query Q, each collection C_i contains a number of relevant documents. After the query Q is posed against the collection C_i, the search engine returns a ranked list R_i of documents D_ij in decreasing order of their similarity S_ij to the query. The top-k result is the merged ranked list of length k containing the documents D_ij with the highest similarity values S_ij in decreasing order. Consider the united document collection C_g, the union of all the C_i, and its top-k result R_g, which contains the documents D_gj with similarity values S_gj. The collection fusion task is: given Q, C, and k, find from the R_j the top-k result R_c of documents D_cj such that S_cj = S_gj.

3.3.1 Collection fusion properties

An ideal collection fusion method combines the documents from the local search results into one ranked list in descending order of their global similarity scores. The global similarity scores are those produced by a single global search system over the united database containing all local documents. In a cooperative environment, where all search engines provide the necessary statistics, we can achieve consistent merging as produced by a non-distributed system; this is also known as perfect merging or merging with normalized scores [Cra01]. In practice, no efficient collection fusion technique can guarantee exactly the same ranking as that of the centralized database with all documents from all databases involved. Three main factors affect the collection fusion:

1. Only the documents returned by the selected servers can participate in the merging. Some relevant documents will be missed after the database selection step.

2. Different statistics and retrieval algorithms cause the separate problem of incomparable scores. We may exclude relevant documents from the search when the top-k results are merged and a necessary document is locally ranked (k+1)th or lower. This problem can be solved by global statistics normalization methods in a cooperative environment.

3. Overlap between the databases (see Figure 3.4). The pure collection fusion approaches [VF95, Kir97, CLC95] do not consider overlap. It is quite difficult to accurately estimate the actual level of document overlap between datasets. Our assumption is that the degradation of the result quality due to overlap is small, while the effort for correcting the statistics would be significant.

Figure 3.4: An overlapping in the collection fusion problem

3.3.2 Cooperative environment

In [SP99, SR00] it was claimed that simple raw-score merging can show good retrieval performance. It seems that the raw-score approach might be a valid first attempt for merging result lists that are produced by the same retrieval model. In [CLC95] it was suggested that collection fusion based on raw TF values is a valuable approach when the involved databases are more or less homogeneous, with the retrieval quality degrading by only 10%. However, we assume topically organized collections, and these have highly skewed statistics. The most effective collection fusion methods are the score normalization techniques, which are based on consistent global collection statistics. All search engines must produce the document relevance scores using the same retrieval algorithms, including the document ranking algorithm, the stemming method, and the stopword list. A metasearch engine collects all required local statistics from the selected databases before or during query time. Notice that under the common TF · IDF scheme the TF component is document-dependent and fair across all databases.

In contrast, the IDF component is collection-dependent, so we should normalize it globally. Analogously, in language modeling the P(t_i|D) component remains unchanged and P(t_i|G) should be recomputed. The communications for such an aggregation are presented in Figure 3.5 [Cra01]. In scheme A from [VF95], the search engines exchange their DF statistics among themselves before query time. During query execution we compute the comparable similarity scores. Under scheme B [CLC95], the databases also return comparable scores, but the document frequency statistics are collated at the metasearch engine and sent with the query. In case C [Kir97], instead of communication before query time, all search engines return the TF and DF statistics together with the document rankings, and the full information is used for a fair fusion. The distinction between the first two schemes and the last one is that A and B are based on the DF statistics of all search engines, while scheme C is based only on the statistics from the selected search engines. All three schemes return the statistics needed for comparable scores, and the metasearch engine performs the fusion by sorting the result documents in descending order of their similarity scores. In [LC04b] a method for merging results in a hierarchical peer-to-peer network was proposed. SESS is a cooperative algorithm that requires the neighbors to provide summary statistics for each of their top-ranked documents. It is an extended version of Kirsch's algorithm [Kir97]. It allows very accurate normalized document scores to be determined before any document is downloaded. However, the limitation here is that a hierarchical system is assumed.
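As an illustration of such global normalization, the sketch below recomputes a global IDF from the document frequencies exported by the selected databases, roughly in the spirit of scheme C, and rescores the returned documents. It assumes essentially disjoint collections and is not the exact algorithm of any of the cited papers.

```python
import math

def global_idf(per_db_stats):
    """per_db_stats: list of (num_docs, {term: df}) exported by the selected databases.
    Returns term -> IDF over the union of the databases; assumes negligible overlap,
    otherwise duplicate documents would be double-counted."""
    total_docs = sum(n for n, _ in per_db_stats)
    df = {}
    for _, local_df in per_db_stats:
        for term, count in local_df.items():
            df[term] = df.get(term, 0) + count
    return {term: math.log(total_docs / count) for term, count in df.items()}

def rescore(results, idf):
    """results: list of (doc_id, {term: tf}); recompute comparable TF*IDF scores."""
    rescored = [(sum(tf * idf.get(t, 0.0) for t, tf in term_tf.items()), doc_id)
                for doc_id, term_tf in results]
    return sorted(rescored, reverse=True)
```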

3.3.3 Uncooperative environment

When the environment is uncooperative, or the search engines intend to cheat, it is still possible to obtain a good approximation of the globally computed scores. In the approach from [CLC95, Cal00, SJCO02] a merging strategy was proposed which is based on both resource and document scores. The database selection algorithm CORI assigns a score to each database. This database score reflects its ability to provide relevant documents to a query.


Figure 3.5: Statistics propagation for the collection fusion


Then the local document score is weighted with the database score and some heuristically set constants. This method can work in a semi-cooperative environment, when the document scores are available, or in an uncooperative setup with slightly degraded accuracy. A similar merging strategy was proposed in [RAS01] with different formulas for the database rank estimation. The final score is again the product of the database score and the local score, computed in a heuristic way. In [PFC+00] it was claimed that when database selection is employed, it is not necessary to maintain collection-wide information, e.g. a global IDF. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. In contrast, in [LCC00] it was found that it is better to organize the collections topically, and that for result merging the topically organized collections require the global IDF for the best performance. The normalized scores are not as good as the global IDF for merging when the collections are topically organized. This ongoing polemic is a good indicator that a comprehensive experimental evaluation of the collection fusion methods is still an open research question.
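A sketch of this kind of heuristic merging is shown below; the min-max normalization of the database score and the constants 0.4 and 1.4 follow the commonly cited CORI merging heuristic, and the code is only an illustration of the idea, not the exact published formula.

```python
def cori_merge(per_db_results, db_scores):
    """per_db_results: db_id -> list of (doc_id, local_score)
    db_scores: db_id -> database selection score assigned by the selection algorithm.
    The local document score is weighted by the min-max normalized database score."""
    c_min, c_max = min(db_scores.values()), max(db_scores.values())
    merged = []
    for db, results in per_db_results.items():
        c_norm = (db_scores[db] - c_min) / (c_max - c_min) if c_max > c_min else 1.0
        for doc_id, d in results:
            merged.append(((d + 0.4 * d * c_norm) / 1.4, doc_id, db))
    return sorted(merged, reverse=True)
```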

3.3.4 Learning methods

Another merging strategy uses logistic regression for the score transformation [CS00]. This method requires training queries for learning the model. The presented experiments show that the logistic regression approach is significantly better than the Round-Robin, raw-score, and normalized raw-score approaches. In [SC03] a linear regression model was used for collection fusion with a small overlap between collections. It is assumed that there is a centralized sample database, which stores a certain number of documents from each resource. The metasearch engine runs the query on the sample database at the same time as it is propagated to the search engines. Then the central broker finds the duplicate documents in the results for the sample database and for each resource. The document scores in all result lists are normalized by a linear regression analysis, with the document scores from the sample database taken as a baseline.


The experimental results showed that this method performs slightly better than CORI. However, all learning-based approaches assume some kind of training set. This is unaffordable in a highly dynamic environment: it is hard to maintain such information for thousands of databases when they join and leave the system frequently.
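The following sketch illustrates the regression idea: documents that appear both in a database's result list and in the sample database's result provide (local score, sample score) pairs, a linear mapping is fitted to them, and the database's scores are mapped onto the common scale. It is a simplified reconstruction, not the exact procedure of [SC03].

```python
def fit_linear_map(pairs):
    """pairs: list of (local_score, sample_db_score) for duplicate documents.
    Least-squares fit of sample_score = a * local_score + b."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    denom = n * sxx - sx * sx
    if denom == 0:                       # degenerate case: keep scores unchanged
        return lambda score: score
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return lambda score: a * score + b

def normalize_db_results(results, sample_scores):
    """results: list of (doc_id, local_score); sample_scores: doc_id -> score on the
    sample database. Returns the results mapped onto the sample database's scale."""
    pairs = [(s, sample_scores[d]) for d, s in results if d in sample_scores]
    to_global = fit_linear_map(pairs) if len(pairs) >= 2 else (lambda s: s)
    return [(d, to_global(s)) for d, s in results]
```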

3.3.5 Probabilistic methods

Several collection fusion methods were developed for the probabilistic retrieval model. The approach in [Bau99] was designed for the cooperative environment: a probabilistic ranking algorithm is based on the statistics exported by the search engines, and consistent merging at the metasearch engine is achieved with respect to the probability ranking principle. Another approach built on a probabilistic principle is investigated in [NF03]. This paper explores a family of linear and logistic mapping functions for different retrieval methods; the retrieval quality of distributed retrieval is only slightly improved by using the logistic function. Language modeling for collection fusion [SJCO02] is another probabilistic method, based on the same assumptions as Equation 2.1. The merging of results from the different text databases is performed under a single probabilistic retrieval model. This approach is designed primarily for intranet environments, where it is reasonable to assume that the resource providers are relatively homogeneous and can adopt the same kind of search engine. A language-model-based merging approach is then used to integrate the results. Compared with heuristic methods like the CORI algorithm, this framework tends to be better justified by probability theory.

3.4

Prior work on the data fusion

A formal definition of data fusion for distributed retrieval is given here. Assume a set of identical document collections C associated with different search engines and retrieval algorithms. With respect to a query Q, each collection C_i contains the same number of relevant documents. After Q is posed against C_i, the search engine returns a ranked list R_i of documents D_ij in decreasing order of their similarity S_ij to the query. The top-k result is a merged ranked list of length k containing the documents D_ij with the highest similarity values S_ij in decreasing order. The data fusion task is: given Q, C, and k, find from \(\bigcup_j R_j\) the top-k result list R_d of documents D_dj such that \(\sum_{j=1}^{k} S_{dj}\) is maximized. In our setup the rankings from the different local search engines should be combined so that they collect the most relevant documents from all rankings and put them into the merged top-k result. See Figure 3.6.

Figure 3.6: Data fusion on a single search engine

3.4.1

Data fusion properties

Data fusion attempts to make use of the three effects described by Diamond [Dia96]. They can occur during a combination of different rankings over a single document collection [VC99b]:
• The skimming effect happens when the retrieval approaches represent the documents differently and thus retrieve different relevant documents. A combination model that takes the highly ranked items from every retrieval approach could outperform any single one of the combined rankings.
• The chorus effect occurs when several retrieval approaches suggest that an item is relevant to a query. It is used as evidence of the higher relevance of that document.

• The dark horse effect occurs because a retrieval approach may produce unusually accurate estimates of relevance for some documents in comparison with the other retrieval approaches. A combination model may exploit this fact by using the most accurate document score.

All three effects are inversely correlated: for example, if we pay more attention to the chorus effect, we decrease our chances of benefiting from the dark horse effect. The optimal tradeoff between these three situations is essentially the data fusion, or retrieval expert combination, task. In some sense, the data fusion problem may be defined as a voting procedure in which a set of ranking algorithms selects the best k documents. The most effective data fusion schemes are linear combinations of the similarity scores produced by the different search engines, and the problem is to find the optimal weights for such a combination. Two factors influence the performance of any data fusion approach [Mon02]:
• Effective algorithms: each system participating in the fusion should have an acceptable effectiveness, comparable with the others.
• Uncorrelated rankings: the rankings produced by the different algorithms should be independent of each other.
Previous experiments confirmed that rankings which do not satisfy the aforementioned requirements reduce the quality of the fused ranking.

3.4.2

Basic methods

In [SF94], a number of combination techniques including operators like Min, Max, CombSum, and CombMNZ were proposed. CombSum sets the score of each document in the combination to the sum of the scores obtained from the individual resources, while in CombMNZ the score of each document is obtained by multiplying this sum by the number of resources that returned non-zero scores. CombSum is equivalent to averaging, while CombMNZ is equivalent to a weighted averaging. In [Lee97] these methods were studied further with six different search servers. The main contribution was to normalize each information retrieval algorithm on a per-query basis, which substantially improves the results. It was shown that the CombMNZ algorithm performs best, followed by CombSum, while the operators Min and Max were the worst. Three newer modifications of these algorithms, which differ in their weight estimation mechanisms, can be found in [WC02]. Another method [VC99a] is based on a linear combination of scores: the relevance of a document to a query is computed by combining a score that captures the quality of each resource with a score that captures the quality of the document with respect to the query.
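To make the two operators concrete, the following sketch (with hypothetical helper names, not code from any of the cited systems) fuses per-engine score maps keyed by document identifier; CombSum adds the scores, and CombMNZ additionally multiplies the sum by the number of engines that returned a non-zero score. Per [Lee97], each input list would typically be normalized per query, for example min-max normalized, before fusion.

import java.util.*;

public class DataFusionSketch {

    // CombSum: the fused score of a document is the sum of the scores it received
    // from the individual result lists.
    static Map<String, Double> combSum(List<Map<String, Double>> resultLists) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> list : resultLists) {
            for (Map.Entry<String, Double> e : list.entrySet()) {
                fused.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return fused;
    }

    // CombMNZ: CombSum multiplied by the number of lists that gave the document a non-zero score.
    static Map<String, Double> combMNZ(List<Map<String, Double>> resultLists) {
        Map<String, Double> sum = combSum(resultLists);
        Map<String, Integer> hits = new HashMap<>();
        for (Map<String, Double> list : resultLists) {
            for (Map.Entry<String, Double> e : list.entrySet()) {
                if (e.getValue() != 0.0) {
                    hits.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        Map<String, Double> fused = new HashMap<>();
        for (Map.Entry<String, Double> e : sum.entrySet()) {
            fused.put(e.getKey(), e.getValue() * hits.getOrDefault(e.getKey(), 0));
        }
        return fused;
    }
}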

3.4.3

Mixture methods

Some methods were developed for the mixture of the collection fusion and data fusion problems. A simple but ineffective merging method is Round-Robin, which takes one document in turn from each of the available result sets. The quality of such a method depends on the performance of the component search engines: Round-Robin performs well only if all the result lists have similar effectiveness, and if some result lists are irrelevant, the entire merged result deteriorates. In [TVGJL95, VGJL94] a way of improving the Round-Robin method was demonstrated. A probabilistic mechanism, based on the lengths of the returned document lists or the estimated usefulness of a database, is used to determine each document in the merged list. In particular, using a random experiment, one of the contributing ranked lists is selected and the top available element from that list is placed at the next position in the result list; this procedure repeats until all contributing lists are depleted. Later, a deterministic version of this method was proposed in [YR98]. Two new techniques for merging search results are introduced in [CHT99, Cra01]: the feature distance ranking algorithm and the reference statistics method. They are reasonably effective in the isolated environment, and it was shown that the feature distance algorithm is also effective in the integrated environment. In [WCG03] the problem of merging results by exploiting document overlaps was addressed. This case lies somewhere between the disjoint and identical database settings: the task is how to merge the documents that appear in only one result list with those that appear in several different result lists. New algorithms for result merging are proposed that take advantage of duplicate documents in two ways: one correlates the scores from different result lists; the other regards duplicates as increased evidence of relevance to the query.
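For illustration, a minimal sketch of plain Round-Robin merging follows (hypothetical names; the probabilistic variant of [TVGJL95, VGJL94] would pick the next source list at random, weighted by its length or estimated usefulness, instead of cycling deterministically).

import java.util.*;

public class RoundRobinMergeSketch {

    // Takes one document in turn from each available ranked list until k documents
    // are collected or all lists are exhausted; duplicates are kept only once.
    static List<String> merge(List<List<String>> rankedLists, int k) {
        List<String> merged = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        int position = 0;
        boolean anyAdded = true;
        while (merged.size() < k && anyAdded) {
            anyAdded = false;
            for (List<String> list : rankedLists) {
                if (position < list.size()) {
                    String doc = list.get(position);
                    if (seen.add(doc)) {
                        merged.add(doc);
                        if (merged.size() == k) return merged;
                    }
                    anyAdded = true;
                }
            }
            position++;
        }
        return merged;
    }
}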

3.4.4

Metasearch approved methods

Result merging policies are also used by metasearch engines. They often cannot obtain additional statistics from the individual search engines and therefore use less effective fusion strategies; most of their merging schemes are based on data fusion methods. The Metacrawler [SE97] is one of the first metasearch engines developed for the Web. It uses the CombSum data fusion method after eliminating duplicate URLs. In the Profusion [GWG96], another metasearch engine, all duplicate URLs are merged using the Max function. The Inquirus metasearch engine [LG98] downloads the documents and analyzes their content; a combination of similarity and proximity matches is included in the ranking formula. In the Mearf [OKK02] several merging methods were introduced that are based on the similarity of results to result clusters. The clusters are obtained from the top-k document summaries provided by a search engine. As implicit relevance judgements for the evaluation, they used whether the user clicked on the presented link or not. The re-ranking is based on the analysis of both the contents and the links of the returned Web page summaries.

3.5

Summary

In this chapter the common metasearch issues were described; in particular, we elaborated on the result merging task. We provided details on the two main subproblems of result merging: collection fusion and data fusion. Following our taxonomy, we reviewed related work in both fields. This chapter establishes the basis for the subsequent selection and evaluation of the result merging methods.


Chapter 4

Selected result merging strategies

This chapter contains the detailed descriptions of the result merging strategies selected for the evaluation in the Minerva system. In Section 4.1 we investigate the properties of the available result merging methods and define their target values. We describe the score normalization techniques: with the global IDF values in Section 4.2, with the ICF values in Section 4.3, the CORI merging in Section 4.4, the language modeling-based merging in Section 4.5, and with the TF values in Section 4.6.

4.1

Target properties for result merging methods

A subset of properties specific to the Minerva system imposes restrictions on the result merging methods. We identified the most distinctive environment properties and summarized them in Table 4.1. All properties are coded as desirable ("++"), acceptable ("+−") or undesirable ("−−").

1. Document overlap. Options: Overlapping (++), Disjoint (+−). The effectiveness of the methods on disjoint and overlapping collections of documents is different. In a P2P environment we assume some overlap between the collections on the peers.

2. Inputs. Options: Scores (++), Ranks (−−). Some methods use only the ranks of the documents to perform merging, while others use the similarity scores and additional information about the collections. In general, the score-based result merging methods work better than the rank-based methods when additional information about the collection is available.

3. Database selection. Options: Used (++), Not used (+−). The result merging methods are often effectively combined with the database selection step. They include information about the database and cannot be performed efficiently without a particular database selection method. Some methods are also sensitive to differences between the information used in the database selection step and in the result merging step.

4. Training data. Options: Used (−−), Not used (++). The most effective results can be achieved with models learned from the current data, but learning methods imply a relatively static environment with a limited number of nodes.

5. Scalability. Options: High (++), Low (−−). It was discovered that some particularly good methods perform poorly with a large number of queried databases or an increasing number of top-k results.

6. Content distribution. Options: Skewed (++), Uniform (+−). Another feature of a merging method is its ability to deal with different types of document distributions across the databases. A uniform distribution assumes that all collections have equal proportions of documents relevant to a query. The collections in Minerva are topic-oriented, which is traditionally a difficult testbed for merging algorithms.

7. Integration. Options: Cooperative (++), Noncooperative (+−). The result merging techniques are designed with a certain degree of search engine cooperation in mind. A search engine may provide us with all the necessary statistics about its database, or we have to obtain them with additional effort or discard some parameters.

Table 4.1: The target properties of the result merging methods

4.2

Score normalization with global IDF

In [CLC95, VF95, Kir97] several methods were proposed that are based on globally computed IDF values; they differ only in the particular algorithms for collecting the necessary statistics. The rationale is that, in the case of a disjoint partitioning of the documents, this method is expected to be the most effective and equal to a centralized search engine approach using a TF·IDF score function. By using globally computed IDF values, we eliminate the differences in statistics estimation between the scores from different databases. Since our environment is cooperative, we can collect the required statistics by posting the local DF_i values and the number of documents in the collection |C_i| into a global directory; the global IDF (GIDF) values and scores are then computed as follows:

\[ GIDF_k = \log\left( \frac{\sum_{i=1}^{|C|} |C_i|}{\sum_{i=1}^{|C|} DF_{ik}} \right) \qquad (4.1) \]

\[ s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF_k \qquad (4.2) \]

Where:
s is the similarity score for the query and the document under the vector space model.

If the GIDF values are computed over every disjoint collection, the score in the distributed environment should be exactly the same as that of a single database holding all documents D_ij. However, in practice, an overlap between the documents in different collections will affect the GIDF values: if a document was crawled and indexed by several databases, then all terms in this document will have higher DF values than they should. In theory, we could account for this by providing a mechanism for finding duplicate documents among all peers. In practice, however, such an effort to eliminate the skew in scores is unaffordable in a distributed system: a score correction with respect to overlap would cost too much and would probably give a negligible improvement. We assume that on a very large collection the overlap will affect all terms to approximately the same degree. Another assumption is that such skewed term GIDF values may reflect some latent tendencies. For example, the terms in highly replicated documents may be deemed less important than terms that are not so popular, because the replicated documents can easily be found due to their wide dissemination. Problems may also occur if the GIDF' score is computed only over the subset of collections C' that were chosen for a query by the database selection algorithm. It corresponds to the non-distributed case where only the documents from the selected databases were placed into the single collection:

\[ GIDF'_k = \log\left( \frac{\sum_{i=1}^{|C'|} |C_i|}{\sum_{i=1}^{|C'|} DF_{ik}} \right) \qquad (4.3) \]

\[ s = \sum_{k=1}^{|q|} TF_{ijk} \cdot GIDF'_k \qquad (4.4) \]

The effectiveness of such GIDF' values will differ from the fully computed GIDF, but they may still perform well. We want to investigate how the database selection and the GIDF' estimation influence retrieval effectiveness.
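A minimal sketch of this idea, assuming each cooperative peer has posted its per-term DF_i and its collection size |C_i| to the global directory (the types and names below are hypothetical and are not the Minerva classes of Chapter 6):

import java.util.*;

public class GidfScoringSketch {

    // Per-peer statistics posted to the global directory for one term.
    record PeerTermStats(long collectionSize, long documentFrequency) {}

    // Equation 4.1: GIDF_k = log( sum_i |C_i| / sum_i DF_ik ).
    static double gidf(List<PeerTermStats> postings) {
        long totalDocs = 0, totalDf = 0;
        for (PeerTermStats p : postings) {
            totalDocs += p.collectionSize();
            totalDf += p.documentFrequency();
        }
        return Math.log((double) totalDocs / Math.max(1, totalDf));
    }

    // Equation 4.2: s = sum over query terms of TF_ijk * GIDF_k, so every peer
    // scores its documents with the same, globally valid IDF part.
    static double score(Map<String, Double> docTermFrequencies, Map<String, Double> gidfByTerm) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : gidfByTerm.entrySet()) {
            s += docTermFrequencies.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }
}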

4.3

Score normalization with ICF

A new measure has emerged in distributed information retrieval, the inverted collection frequency or ICF:

\[ ICF_k = \log\left( \frac{|C|}{CF_k} \right) \qquad (4.5) \]

Where:
CF_k is the collection frequency, equal to the number of collections in which the term occurs at least once;
|C| is the overall number of collections.

ICF is analogous to the IDF measure but one level higher: instead of the notion of a document, we use the notion of a collection. It can replace the IDF part in the score computation, since a term that occurs in many collections is deemed less important than a rare one. The ICF measure is fair for all collections and can be used in a scoring function:

\[ s = \sum_{k=1}^{|q|} TF_{ijk} \cdot ICF_k \qquad (4.6) \]

The advantage of this measure is that it is easy to compute: only the information on whether the term occurs in a collection or not and the number of nodes in the system are needed. However, this approximation may perform worse than GIDF because it is a more "averaged" view of term importance.
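A correspondingly small sketch (hypothetical names), assuming the directory can report in how many of the |C| collections a term occurs:

import java.util.*;

public class IcfScoringSketch {

    // Equation 4.5: ICF_k = log(|C| / CF_k), where CF_k is the number of
    // collections containing the term at least once.
    static double icf(int numberOfCollections, int collectionFrequency) {
        return Math.log((double) numberOfCollections / Math.max(1, collectionFrequency));
    }

    // Equation 4.6: s = sum over query terms of TF_ijk * ICF_k.
    static double score(Map<String, Double> docTermFrequencies,
                        Map<String, Integer> collectionFrequencies,
                        int numberOfCollections) {
        double s = 0.0;
        for (Map.Entry<String, Integer> e : collectionFrequencies.entrySet()) {
            double tf = docTermFrequencies.getOrDefault(e.getKey(), 0.0);
            s += tf * icf(numberOfCollections, e.getValue());
        }
        return s;
    }
}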

4.4

Score normalization with CORI

The CORI method for result merging was proposed in [CLC95] and extensively tested and improved in [Cal00, LCC00, SJCO02, LC04b, SC03]. It is a de-facto standard for the result merging problem. The approach heuristically combines the local document scores with the server rank obtained during the server selection step. In our analysis it represents all other heuristic approaches of this kind, being the most effective among them. The normalized score suitable for merging is calculated in several steps. Assume a query q of m terms is posed against database C_i. First, the database selection step is performed and the rank r_ik of each database for one query term is computed as follows [Cal00]:

\[ r_{ik} = b + (1-b) \cdot T \cdot I \qquad (4.7) \]

\[ T = \frac{DF_{ik}}{DF_{ik} + 50 + 150 \cdot cw_i / cw} \qquad (4.8) \]

\[ I = \frac{\log\left( \frac{|C| + 0.5}{CF_k} \right)}{\log(|C| + 1.0)} \qquad (4.9) \]

Where:
cw_i is the vocabulary size of C_i;
cw is the average vocabulary size over all C_i;
b, the "default belief", is a heuristically set constant, usually 0.4.

The final database rank R_i is computed as follows:

\[ R_i = \sum_{k=1}^{|q|} r_{ik} \qquad (4.10) \]

After all databases are ranked and a number of them are selected for query execution, the local document scores on every database are computed and preliminarily normalized:

\[ s^{norm}_{ijk} = \frac{s^{local}_{ijk} - s^{min}_{ik}}{s^{max}_{ik} - s^{min}_{ik}} \qquad (4.11) \]

\[ s^{min}_{ik} = \min_j(TF_{ijk}) \cdot IDF_{ik} \qquad (4.12) \]

\[ s^{max}_{ik} = \max_j(TF_{ijk}) \cdot IDF_{ik} \qquad (4.13) \]

Where:
s^norm_ijk is the preliminarily normalized local score;
s^local_ijk is the locally computed TF·IDF score for the k-th term in D_ij;
s^min_ik is the minimum possible term score among all D_ij in database C_i;
s^max_ik is the maximum possible term score among all D_ij in database C_i.

The preliminarily normalized scores s^norm_ijk should reduce the statistics differences caused by the different local IDF values. However, for effective merging the database rank is also normalized, so that low-ranked databases still have an opportunity to contribute documents to the final ranking. With respect to the maximum and minimum values that the algorithm can potentially assign to a database, the rank is normalized as follows:

\[ R^{norm}_i = \frac{R_i - R_{min}}{R_{max} - R_{min}} \qquad (4.14) \]

Where:
R_min is the database rank estimated with component T set to zero;
R_max is the database rank estimated with component T set to one.

The globally normalized score s is composed from the locally normalized score s^norm_ijk and the normalized database rank R^norm_i in a heuristic way:

\[ s_{ijk} = \frac{s^{norm}_{ijk} + 0.4 \cdot s^{norm}_{ijk} \cdot R^{norm}_i}{1.4} \qquad (4.15) \]

\[ s = \sum_{k=1}^{|q|} s_{ijk} \qquad (4.16) \]

The first version of the CORI method did not use the intermediate normalization steps for rank and score; they were added later for better accuracy. This method is superior for the uncooperative environment, but it is also competitive in the cooperative collections case.
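The following sketch condenses Equations 4.7-4.16 into plain helper functions (hypothetical names; the constants 50, 150, 0.4, and 1.4 are the usual CORI defaults quoted above):

public class CoriMergingSketch {

    // Equations 4.7-4.9: per-term database belief r_ik.
    static double termBelief(double df, double cf, double cwI, double avgCw,
                             double numCollections, double defaultBelief) {
        double t = df / (df + 50.0 + 150.0 * cwI / avgCw);
        double i = Math.log((numCollections + 0.5) / cf) / Math.log(numCollections + 1.0);
        return defaultBelief + (1.0 - defaultBelief) * t * i;
    }

    // Equation 4.11: min-max normalization of the local document score for one term.
    static double normalizeLocalScore(double localScore, double minScore, double maxScore) {
        if (maxScore == minScore) return 0.0;
        return (localScore - minScore) / (maxScore - minScore);
    }

    // Equation 4.14: normalization of the database rank.
    static double normalizeRank(double rank, double minRank, double maxRank) {
        if (maxRank == minRank) return 0.0;
        return (rank - minRank) / (maxRank - minRank);
    }

    // Equation 4.15: heuristic combination of the normalized score and the database rank.
    static double mergedTermScore(double normalizedScore, double normalizedRank) {
        return (normalizedScore + 0.4 * normalizedScore * normalizedRank) / 1.4;
    }
}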

4.5

Score normalization with language modeling

The language modeling approach to information retrieval [PC98] estimates the probability that a query was generated from the language model from which a particular document was generated. We suppose that each document represents a sample generated from a particular language. The relevance of a document to a particular query is estimated by how probable it is that the query was generated from the language model of that document. The language modeling local scoring function on every peer is similar to Equation 2.1 and is defined as the likelihood of a query q being generated from a document D_ij [SJCO02]:

\[ P(q|D_{ij}) = \prod_{k=1}^{|q|} \big( \lambda \cdot P(t_k|D_{ij}) + (1-\lambda) \cdot P(t_k|C_i) \big) \qquad (4.17) \]

Where:
t_k is a query term;
P(t_k|D_ij) is the probability of the query term t_k appearing in the document D_ij;
P(t_k|C_i) is the probability of the term t_k appearing in the collection C_i to which document D_ij belongs;
λ is a weighting parameter between zero and one.

The role of the term P(t_k|C_i) is to smooth the probability of the document D_ij generating the query term t_k, especially when it is zero. The idea of smoothing is similar to the TF·IDF term weighting scheme used in the vector space model, where "popular" words are discouraged and "rare" words are emphasized by the IDF component. Just as the IDF value is collection-dependent in TF·IDF scoring, the P(t_k|C_i) component is collection-dependent in language modeling. We replace C_i with a global collection model G estimated over all available peers:

\[ P(q|D_{ij}) = \prod_{k=1}^{|q|} \big( \lambda \cdot P(t_k|D_{ij}) + (1-\lambda) \cdot P(t_k|G) \big) \qquad (4.18) \]

\[ P(t_k|D_{ij}) = \frac{TF_{ijk}}{|D_{ij}|} \qquad (4.19) \]

\[ P(t_k|G) = \frac{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|C|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|cw_i|} TF_{ijl}} \qquad (4.20) \]

This language modeling based score is collection-independent and fair for all documents on all peers. The necessary information, such as the sums of TF values over the document collections, is posted into a distributed directory. A sum of term probabilities is more convenient than a product; that is why an order-preserving logarithmic transformation is applied:

\[ s = \sum_{k=1}^{|q|} \log\big( \lambda \cdot P(t_k|D_{ij}) + (1-\lambda) \cdot P(t_k|G) \big) \qquad (4.21) \]

We also investigated the effect of partially available information, when only the selected databases contribute to G' and the scores are computed as:

\[ s = \sum_{k=1}^{|q|} \log\big( \lambda \cdot P(t_k|D_{ij}) + (1-\lambda) \cdot P(t_k|G') \big) \qquad (4.22) \]

Both the full and the partial score types are used for result merging and are expected to be reasonably effective.
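A sketch of the resulting scoring function (Equation 4.21), assuming the global model P(t_k|G) has already been aggregated from the TF sums posted to the directory; the names are hypothetical:

import java.util.*;

public class LanguageModelMergingSketch {

    // Equation 4.19: maximum likelihood document model P(t_k | D_ij) = TF_ijk / |D_ij|.
    static double docModel(Map<String, Integer> docTf, int docLength, String term) {
        return docTf.getOrDefault(term, 0) / (double) docLength;
    }

    // Equation 4.21: s = sum_k log( lambda * P(t_k|D) + (1 - lambda) * P(t_k|G) ).
    // With a partial directory, globalModel would hold P(t_k|G') instead (Equation 4.22).
    static double score(List<String> queryTerms,
                        Map<String, Integer> docTf, int docLength,
                        Map<String, Double> globalModel, double lambda) {
        double s = 0.0;
        for (String term : queryTerms) {
            double pDoc = docModel(docTf, docLength, term);
            double pGlobal = globalModel.getOrDefault(term, 1e-9); // tiny floor for unseen terms (assumption)
            s += Math.log(lambda * pDoc + (1.0 - lambda) * pGlobal);
        }
        return s;
    }
}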

4.6

Score normalization with raw T F scores

For short popular queries, TF-based scoring is a simple but still reasonably good solution:

\[ s = \sum_{k=1}^{|q|} TF_{ijk} \qquad (4.23) \]

Result merging experiments often include fusion by raw TF values as an indicator of the importance of the IDF component for the current query. If all the importance values associated with the different query terms are more or less the same, this method shows a very competitive result. We have also evaluated it in the Minerva system.

4.7

Summary

In this chapter, we provided descriptions of the result merging strategies selected for the evaluation in the Minerva system. In Section 4.1 we investigated the properties of the available result merging methods and defined their target values. We described the score normalization techniques: with the global IDF values in Section 4.2, with the ICF values in Section 4.3, the CORI merging in Section 4.4, the language modeling-based merging in Section 4.5, and with the TF values in Section 4.6.

Chapter 5

Our approach

In this chapter we present our approach for combining the result merging with the preference-based language model. The latter is obtained with a pseudo-relevance feedback on the best peer in the database ranking.

5.1

Result merging with the preference-based language model

An important source of user-specific information is the user's collection of documents. In the Minerva system, the Web pages are crawled with respect to the user's bookmarks and are therefore assumed to reflect some of his specific interests. We can exploit this fact by using pseudo-relevance feedback to derive a preference-based language model from the most relevant database. The description of our approach is presented below. When the user poses a topic-oriented query Q, we first collect the necessary statistics and build the peer ranking P^Q. The probability distribution for the whole set of documents G from the peers in P^Q is estimated. Then Q is executed on the best database in the ranking, P_1^Q. According to our query routing strategy, this database should have more relevant documents than any other database; therefore, it is the best choice for estimating our preference-based language model. The concatenation of the top-n results from the best peer represents a user-specific preference set U. From U we estimate the preference-based language model. The language model of U is a mixture of the general language model and the preference-based language model:

\[ P(t_k|U) = \lambda \cdot P_{ML}(t_k|U) + (1-\lambda) \cdot P_{ML}(t_k|G) \qquad (5.1) \]

Where:
P_ML(t_k|U) is the maximum likelihood estimate of term t_k in the top-n results of the best peer;
P_ML(t_k|G) is the maximum likelihood estimate of term t_k across all selected peers P^Q;
λ is the empirically set smoothing parameter.

Equation 5.1 is based on Jelinek-Mercer smoothing. P_ML(t_k|G) and P_ML(t_k|U) are defined as:

\[ P_{ML}(t_k|G) = \frac{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} TF_{ijk}}{\sum_{i=1}^{|P^Q|} \sum_{j=1}^{|D_i|} \sum_{l=1}^{|pw_i|} TF_{ijl}} \qquad (5.2) \]

\[ P_{ML}(t_k|U) = \frac{\sum_{j=1}^{n} TF_{ijk}}{\sum_{j=1}^{n} |D_{ij}|} \qquad (5.3) \]

Where:
D_ij is a document j on peer P_i^Q;
TF_ijk is the term frequency of the term t_k in document D_ij;
pw_i is the vocabulary size on P_i^Q;
n is the number of top results forming the preference set U.

When both the P_ML(t_k|U) and P_ML(t_k|G) components are obtained, we apply an adapted version of the EM algorithm from [TZ04] to compute P(t_k|U). Pavel Serdyukov implemented this algorithm for the Minerva project [Ser05]. The probabilities P(Q|G) and P(Q|U) and the query Q are sent to every peer P_i^Q in the ranking. We compute the similarity scores for result merging in three steps. First, the globally normalized similarity score s^LMgn is computed with Equation 5.4. Then, the preference-based similarity scores are computed with the cross-entropy function, see Equation 5.5; documents with a higher dissimilarity between the preference-based and document language models receive a lower score. Both the s^LMgn and s^LMpb scores are combined as in Equation 5.6 with the empirically set parameter β, which lies in the interval from zero to one.


Algorithm for our approach:

1. The query Q is posed.
2. Statistics S^Q for the terms in Q are collected from the peers.
3. The peer ranking P^Q is created for Q.
4. The probability P(Q|G) is estimated for the whole set of documents G on the peers in P^Q.
5. Q is executed on P_1^Q; the top-n result documents are concatenated into a set U and the probability P(Q|U) is estimated.
6. Q together with P(Q|G) and P(Q|U) is propagated to every P_i^Q.
7. For each document D_ij on each P_i^Q:
   7.1 The globally normalized similarity score s_k^LMgn is computed:
       \[ s_k^{LMgn} = \log\big( \lambda \cdot P(t_k|D_{ij}) + (1-\lambda) \cdot P(t_k|G) \big) \qquad (5.4) \]
   7.2 The preference-based similarity score s_k^LMpb is computed:
       \[ s_k^{LMpb} = -P(t_k|U) \cdot \log\big( P(t_k|D_{ij}) \big) \qquad (5.5) \]
   7.3 Both scores are combined into the result merging score s^LMrm:
       \[ s^{LMrm} = \sum_{k=1}^{|Q|} \big( \beta \cdot s_k^{LMgn} + (1-\beta) \cdot s_k^{LMpb} \big) \qquad (5.6) \]
8. The top-k URLs with the highest s^LMrm scores are returned from each P_i^Q.
9. The returned results are sorted in descending order of s^LMrm and the best top-k URLs are presented to the user.
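A sketch of steps 7.1 to 7.3 as one peer would execute them (hypothetical names; P(t_k|G) and P(t_k|U) are assumed to have arrived together with the query):

import java.util.*;

public class PreferenceBasedMergingSketch {

    // Combines the globally normalized LM score (Eq. 5.4) with the preference-based
    // cross-entropy score (Eq. 5.5) using the weight beta (Eq. 5.6).
    static double mergingScore(List<String> queryTerms,
                               Map<String, Double> pDoc,        // P(t_k | D_ij)
                               Map<String, Double> pGlobal,     // P(t_k | G)
                               Map<String, Double> pPreference, // P(t_k | U)
                               double lambda, double beta) {
        double score = 0.0;
        double floor = 1e-9; // tiny floor for unseen terms (assumption, not from the thesis)
        for (String t : queryTerms) {
            double pd = pDoc.getOrDefault(t, floor);
            double pg = pGlobal.getOrDefault(t, floor);
            double pu = pPreference.getOrDefault(t, floor);
            double sGn = Math.log(lambda * pd + (1.0 - lambda) * pg); // Eq. 5.4
            double sPb = -pu * Math.log(pd);                          // Eq. 5.5
            score += beta * sGn + (1.0 - beta) * sPb;                 // Eq. 5.6
        }
        return score;
    }
}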

5.2

Discussion

The merging with the preference-based language model is close to the query re-weighting technique described in [ZL01]. That approach tries to refine the initial estimation of the query language model with additional pseudo-relevant documents. Our approach is designed for the distributed setup: by executing the query on one peer, we select the top-n results for our preference-based model from the peer that was ranked best by the database selection algorithm. One user with a highly specialized collection of documents implicitly helps another user to refine the final document ranking. The estimation of the preference-based set U is performed by analogy with the cluster-based retrieval approach from [LC04a]: the preference set is treated as a cluster of relevant documents. The simple intuition for using the preference-based language model is as follows. Assume the user is mostly interested in documents written with the same specific subset of the general language in mind as our best top-n results. Suppose we are looking for specific medical information and some peer has many documents from a medical Internet domain. After executing the query "symptoms of diabetes" on this peer, we infer the preference-based set of documents U from its top-n results. This set has a term distribution that is typical for medical articles. Now we want to find the documents that were generated with the language model of U in mind; we treat it like a "relevance model". The proposed scheme for combining the result merging ranking with the preference-based model ranking has limited performance gains: both fused rankings are correlated, since they are term-frequency based. This constraint is typical for information retrieval in general; the most prominent gains would come from additional independent features like PageRank or the explicit structure of the text. The important question of the appropriate size of the top-n should be answered empirically.

5.3

Summary

In this chapter we presented our method for combining result merging with the preference-based language model. This approach exploits pseudo-relevance feedback on the best peer in the database ranking to build a preference document set. We provided the method details and discussed them.


Chapter 6

Implementation

In this chapter, we provide a brief description of the implementation of the merging methods in the Minerva system. We include a short summary of several essential classes from the result merging package.

6.1

Global statistics classes

The Minerva system and all merging methods in it are implemented with Java 1.4.2. The document databases associated with the peers are maintained on Oracle 9.2.0. In Figure 6.1, we present the diagram of the main implemented classes. For collecting statistics across many peers, three main classes are used:
• RMICFscoring;
• RMGIDFscoring;
• RMGLMscoring.
The RMICFscoring class constructor takes SimpleGlobalQueryProcessor and RMQuery input objects and computes the ICF value for each query term. The calculated quantities are put into the termICF hash and are accessible with the getTermICF(String term) method. In the same manner, the RMGIDFscoring and RMGLMscoring classes produce the global GIDF values and the global language model, respectively.


Figure 6.1: Main classes involved in merging


Three objects of the classes described above are wrapped into a single RMGlobalStatistics object. The RMQuery class represents the query, and the RMGlobalStatistics object is placed inside it. Therefore, all global statistics required for query execution are propagated with the query. When the RMTopKLocalQueryProcessor executes the query, the RMSwitcherTFIDF class is invoked to re-weight the result scores. From RMQuery the switcher takes the name of the fusion method and the necessary global statistics and returns the new scores. The experiments with limited global statistics are performed with the RMGIDF10peersScoring and RMGLM10peersScoring classes. For the experiments with the top-k language model, we reused the RMGLMscoring class.
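This design can be illustrated with a small self-contained sketch (simplified, hypothetical classes that only mirror the structure described here; this is not code from the Minerva sources): the global statistics are wrapped into the query object, travel with it to the peer, and a switcher selects the scoring function by the requested score type.

import java.util.*;

public class StatisticsPropagationSketch {

    static class GlobalStatistics {
        Map<String, Double> termGidf = new HashMap<>();
        Map<String, Double> termIcf = new HashMap<>();
        Map<String, Double> globalLanguageModel = new HashMap<>();
    }

    static class Query {
        List<String> terms;
        String scoreType;              // e.g. "TFGIDF", "TFICF", "LM04"
        GlobalStatistics statistics;   // propagated together with the query
        Query(List<String> terms, String scoreType, GlobalStatistics statistics) {
            this.terms = terms; this.scoreType = scoreType; this.statistics = statistics;
        }
    }

    // The "switcher": re-weights a document's normalized term frequencies according to the score type.
    static double rescore(Query q, Map<String, Double> normalizedTf) {
        double s = 0.0;
        for (String t : q.terms) {
            double tf = normalizedTf.getOrDefault(t, 0.0);
            switch (q.scoreType) {
                case "TFGIDF" -> s += tf * q.statistics.termGidf.getOrDefault(t, 0.0);
                case "TFICF"  -> s += tf * q.statistics.termIcf.getOrDefault(t, 0.0);
                case "LM04"   -> s += Math.log(0.4 * tf
                        + 0.6 * q.statistics.globalLanguageModel.getOrDefault(t, 1e-9));
                default       -> s += tf; // plain TF fallback
            }
        }
        return s;
    }
}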

6.2

Testing components

A simplified view of the general components of the experiment implementation is shown in Figure 6.2. We skip the description of many native classes of the Minerva project as unimportant for the merging problem and mention only several specific classes that are helpful for a general overview of our experiments. The selected classes of the three main components are highlighted with colored borders. The order of execution is: Start peers (Green) → Test methods (Blue) → Evaluate results (Red).

The "Green box" is executed from the RMDotGov50peersStart class. It starts 50 peers; each of them is represented by a Peer object and associated with an instance of the RMTopKLocalQueryProcessor class, which is connected to the Oracle database server. The SocketBasedPeerDescriptor object allows communicating with the peer through the network. Every query received by the peer is executed with the local query processor, and the returned results are wrapped into a QueryResult object.

Next, the "Blue box" is executed from the RMDotGovExperiments50peers class. The RMPeersRanking class is intended to build the database rankings RANDOM, CORI and IDEAL.

Figure 6.2: A general view of the experiment implementation

The first ranking is constructed inside the RMPeersRanking object, while the two others are obtained with the RMCORI2 and RMIdealRankingReader objects. The queries are wrapped into RMQuery objects: they are read from files with RMQueries, and the necessary global statistics are added with the classes from Figure 6.1. The SocketBasedPeerDescriptor objects are created for the communication with the already running peers from the "Green box". The query results from the different peers are merged into a QueryResultList object.

The last component is the "Red box". It is started from the RMDotGovRelevanceEvaluation10of50peers class. Its goal is to compute the precision and recall measures for the merged lists. The query results are taken from the text files produced by the QueryResultList class in the "Blue" component. A comprehensive description of the classes is presented in Table 6.1.

RMCORI2: creates the CORI peer ranking for a query.
  Fields: uniqueIdentifierToRank : HashMap (mapping between peer descriptor and CORI rank); AND : boolean (if set to "true", only peers with all query keywords are ranked, otherwise one keyword is enough for a peer to be ranked); db : double ("default belief" of the CORI algorithm); k : int (maximum size of the final ranking); ringSize : long (CHORD ring size).
  Methods: RMCORI2(int, long, boolean) creates the CORI peer ranking for a query; getPeerListMergingAlgorithmName() returns the string "CORI"; getPeerRank(PeerDescriptorInterface) returns the rank of the peer; mergePeerLists(Map) merges peer lists with respect to the variable AND.

RMDotGov50peersStart: runs 50 peers and connects each peer to its respective Oracle database.
  Methods: main(String[]) runs the 50 peers and connects each peer to its Oracle database.

RMDotGovExperiments50peersGS10: runs experiments with limited statistics; only 10 databases are used for computing the GIDF values and the global language model.
  Fields: scoreType : String (current merging score type).
  Methods: main(String[]) runs the experiments with limited statistics.

RMDotGovExperiments50peersTOPKLMCrossEnt: runs experiments with our approach, which is based on the cross-entropy between the language models of the top-k pseudo-relevant documents and each scored document.
  Fields: scoreType : String (current merging score type).
  Methods: main(String[]) runs the experiments with our approach.

RMDotGovExperimentsSingleDatabase: computes the TF·IDF document ranking for the single "united" database.
  Fields: scoreType : String (current merging score type).
  Methods: main(String[]) computes the TF·IDF document ranking for the single "united" database.

RMDotGovExperimentsSingleDatabaseGLM: computes the language modeling document ranking for the single "united" database.
  Fields: scoreType : String (current merging score type).
  Methods: main(String[]) computes the language modeling document ranking for the single "united" database.

RMDotGovRelevanceEvaluation10of50peers: sets the parameters and paths to the results of the experiments with 50 peers and computes the average recall and precision metrics for every experiment.
  Methods: evaluate(String, String[], String, int, RMQuery[], int) computes the average recall and precision metrics for every experiment; main(String[]) sets the parameters and paths and invokes the "evaluate" method.

RMDotGovRelevanceEvaluationSingleDB: sets the parameters and paths to the results of the experiments with the single database and computes the average recall and precision metrics for every experiment.
  Methods: evaluate(String, String[], RMQuery[], int) computes the average recall and precision metrics for every experiment; main(String[]) sets the parameters and paths and invokes the "evaluate" method.

RMGIDF10peersScoring: GIDF container for the limited statistics case.
  Fields: termGIDF : HashMap (keys are the terms, values are the GIDF values).
  Methods: RMGIDF10peersScoring(SimpleGlobalQueryProcessor, RMQuery, LinkedList) constructor, invokes the "computeGIDF" method; computeGIDF(SimpleGlobalQueryProcessor, RMQuery, LinkedList) calculates the GIDF values over 10 databases; getTermGIDF(String) access to every GIDF value.

RMGIDFscoring: GIDF container when all databases are available.
  Fields: termGIDF : HashMap (keys are the terms, values are the GIDF values).
  Methods: RMGIDFscoring(SimpleGlobalQueryProcessor, RMQuery) constructor, invokes the "computeGIDF" method; computeGIDF(SimpleGlobalQueryProcessor, RMQuery) calculates the GIDF values over all databases; getTermGIDF(String) access to every GIDF value.

RMGLM10peersScoring: GLM container for the limited statistics case.
  Fields: termGLM : HashMap (keys are the terms, values are the global language model values).
  Methods: RMGLM10peersScoring(SimpleGlobalQueryProcessor, RMQuery, LinkedList) constructor, invokes the "computeGLM" method; computeGLM(SimpleGlobalQueryProcessor, RMQuery, LinkedList) calculates the GLM values over 10 databases; getTermGLM(String) access to every GLM value.

RMGLMscoring: GLM container when all databases are available.
  Fields: termGLM : HashMap (keys are the terms, values are the global language model values).
  Methods: RMGLMscoring(QueryResultList, RMQuery, double) constructor for building a language model from the top-k results, invokes the "computeGLM" method; RMGLMscoring(SimpleGlobalQueryProcessor, RMQuery) constructor for the global language model, invokes the "computeGLM" method; computeGLM(SimpleGlobalQueryProcessor, RMQuery) calculates the GLM values over all databases; computeTopKLM(QueryResultList, RMQuery, double) calculates the GLM values over the top-k results; getTermGLM(String) access to every GLM value; setGLM(HashMap) termGLM can also be set externally.

RMICFscoring: ICF container.
  Fields: termICF : HashMap (keys are the terms, values are the ICF values).
  Methods: RMICFscoring(SimpleGlobalQueryProcessor, RMQuery) constructor, invokes the "computeICF" method; computeICF(SimpleGlobalQueryProcessor, RMQuery) calculates the ICF values over all databases; getTermICF(String) access to every ICF value.

RMGlobalStatistics: container for all necessary statistics for every merging method.
  Fields: coriRank : double (rank of the peer where the query is executing); gidfObject : RMGIDFscoring (GIDF container); glmObject : RMGLMscoring (GLM container); icfObject : RMICFscoring (ICF container); topklmObject : RMGLMscoring (container for the top-k preference-based language model).
  Methods: getCORIRank(), getGidfObject(), getGlmObject(), getIcfObject(), getTopklmObject() accessors for the respective containers.

RMIdealRankingReader: extracts the manually created IDEAL database ranking and the relevant documents from the Web Track topic distillation task for TREC 2002, 2003.
  Methods: getRelevantDocuments(RMQuery, String) extracts the relevant documents from the Web Track topic distillation task for TREC 2002, 2003; getTop10Databases(RMQuery, String, int, boolean) extracts the manually created IDEAL database ranking.

RMPeersRanking: creates any of the three database rankings, according to the invocation parameters.
  Methods: getRanking(String, int, int, String, long, boolean, Map, RMQuery) creates the "CORI", "IDEAL", or "RANDOM" ranking according to the invocation parameters.

RMQueries: queries reader.
  Methods: readQueries(String, int) reads preprocessed queries from an external file; stringToQuery(String, int) converts a string into query terms.

RMQuery: query class, extends the query with additional statistics.
  Fields: globalStatistics : RMGlobalStatistics (container for all necessary statistics); keywordObjects : KeywordObject[] (query terms wrapped into the "KeywordObject" class); numberOfResults : int (requested size of the final result list); scoreType : String (current merging score type).
  Methods: getGlobalStatistics() access to the "GlobalStatistics" container; getScoreType() access to the merging score type; readObject(ObjectInputStream) unpacker after receiving the object from the network; setGlobalStatistics(RMGlobalStatistics) assigns new global statistics; setScoreType(String) assigns a new score type; writeObject(ObjectOutputStream) wrapper for sending the object over the network.

RMSwitcherTFIDF: switcher between the different merging score types.
  Fields: CurrentRank : double (CORI rank for this peer); gidfObject : RMGIDFscoring (GIDF container); glmObject : RMGLMscoring (GLM container); icfObject : RMICFscoring (ICF container); topklmObject : RMGLMscoring (container for the top-k preference-based language model).
  Methods: computeCORIwithMAXRTF(KeywordStatistics, CollectionStatistics, CollectionTermStatistics) CORI scoring function; computeGLM(KeywordStatistics, CollectionTermStatistics, double) LM04 scoring function; computeGLMSingleDB(KeywordStatistics, CollectionTermStatistics) LM04 scoring function on the single database; computeTF(KeywordStatistics, CollectionTermStatistics) TF scoring function; computeTFGIDF(KeywordStatistics, CollectionTermStatistics) GIDF scoring function; computeTFICF(KeywordStatistics, CollectionTermStatistics) ICF scoring function; computeTFIDF(KeywordStatistics, CollectionStatistics, CollectionTermStatistics) TFIDF scoring function; computeTOPKLMCROSSENTROPY(KeywordStatistics, CollectionTermStatistics, double) scoring function for our approach.

RMTopKLocalQueryProcessor: each local query processor executes the query on its own peer.
  Fields: globalStatistics : RMGlobalStatistics (global statistics, comes with the query); scoreType : String (current merging score type); collectionStatistics : CollectionStatistics (collection statistics container, contains the number of documents on the peer, DF values, etc.); dbAccess : DBAccess (access to the database corresponding to the peer); qrlmap : HashMap (final list of result documents, which is returned to the query initiator).
  Methods: execute(RMQuery) retrieves the local results for the query.

RMTREC2LOCALconvertor: converter of TREC data into Minerva.
  Methods: main(String[]) extracts information about the relevant documents from the TREC Web Track topic distillation task 2002, 2003.

TaoAndZhaiModeling: language model estimation method.
  Methods: GetTermModels(Query, int) obtains the preference-based model from the top-k result documents.

Table 6.1: Classes description


Chapter 7

Experiments

This chapter contains a detailed description of our experiments with the result merging strategies selected for the evaluation in the Minerva system. In Section 7.1, the system configuration and the dataset parameters are described. Section 7.2 contains the experiments with the existing result merging methods and a discussion of the results. In Section 7.3, we present the results of the experiments with our approach with a preference-based language model.

7.1

Experimental setup

7.1.1

Collections and queries

The previous experiments provided different arguments for and against the result merging algorithms. We want to check the existing methods once again, because the following combination of features was not tested in the known experiments:
• Minerva works with real Web data;
• there is an overlap between the documents on different peers;
• the collections are topically organized;
• a database selection algorithm is executed before the result merging step;
• the queries are topically oriented.

We conducted new experiments with 50 databases, which were created from the TREC-2002, 2003 and 2004 Web Track datasets from the ".GOV" domain. For these three volumes, four topics were selected. The relevant documents from each topic were taken as a training set for the classification algorithm, and 50 collections were created. The non-classified documents were randomly distributed among all databases. Each classified document was assigned to two collections from the same topic; for example, for the topic "American music" we have a subset of 15 small collections in which all classified documents are replicated twice. The topics with the numbers of corresponding collections are summarized in Table 7.1; each collection is placed on one peer.

N   Topic                   Number of collections
1   Health and medicine     15
2   Nature and ecology      10
3   Historic preservation   10
4   American music          15

Table 7.1: Topic-oriented experimental collections

Assuming that the search is topic-oriented, we selected a set of 25 out of the 100 title queries from the topic distillation task of the TREC 2002 and 2003 Web Track. We used the relevance judgements available on the NIST site (http://trec.nist.gov). The queries were selected with respect to two requirements:
• at least 10 relevant documents exist;
• the query is related to the "Health and Medicine" or "Nature and Ecology" topics.
The full table of the selected queries is presented in Appendix A. The database selection algorithm chooses 10 peers out of 50 for each query. To simulate the merging retrieval algorithm, we obtain 500 documents on every peer with local scores computed as:

\[ s^{local} = \sum_{k=1}^{|Q|} \frac{TF_k}{|D_{ij}|} \cdot \log\frac{|C_i|}{DF_{ik}} \qquad (7.1) \]

Then we recompute the scores for these documents with the current merging method. The top-30 documents with the best merging scores are returned to the peer that issued the query. All 10 sets of 30 documents are then combined into one ranking in descending order of their similarity scores, and the top-30 documents from this ranking are evaluated.
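A sketch of this evaluation pipeline under the stated parameters (hypothetical helper types; the merging score function stands for whichever method from Chapter 4 is being tested):

import java.util.*;
import java.util.function.ToDoubleFunction;

public class MergingExperimentSketch {

    record ScoredDoc(String docId, double score) {}

    // One peer: rank its 500 candidate documents by the current merging score and return its top 30.
    static List<ScoredDoc> topKOnPeer(List<String> candidates,
                                      ToDoubleFunction<String> mergingScore, int k) {
        List<ScoredDoc> scored = new ArrayList<>();
        for (String doc : candidates) {
            scored.add(new ScoredDoc(doc, mergingScore.applyAsDouble(doc)));
        }
        scored.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return scored.subList(0, Math.min(k, scored.size()));
    }

    // Query initiator: combine the 10 per-peer top-30 lists into one ranking
    // in descending score order and keep its top 30 for evaluation.
    static List<ScoredDoc> mergeAndCut(List<List<ScoredDoc>> perPeerResults, int k) {
        List<ScoredDoc> all = new ArrayList<>();
        for (List<ScoredDoc> peerList : perPeerResults) all.addAll(peerList);
        all.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }
}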

7.1.2

Database selection algorithm

The database selection step adds an important dimension to the result merging experiments. Only the selected 10 databases participate in query execution; therefore, the effectiveness of the query routing algorithm influences the quality of the result. CORI merging explicitly uses the results of the database selection step for merging. We evaluated the result merging methods under the following database rankings:
• RANDOM;
• CORI;
• IDEAL.
RANDOM ranking is the "weakest" algorithm; it simply selects 10 peers at random without using any information about their collection statistics. It is the lower bound for the effectiveness of the database selection algorithm. The CORI database selection algorithm [Cal00] is described in Chapter 4. The IDEAL ranking is a manually created ranking in which the collections are sorted in descending order of the number of relevant documents for the query.

7.1.3

Evaluation metrics

For the evaluation, we utilized the framework from [SJCO02]. For all tested algorithms, the average precision is computed over the 25 queries at the levels of the top-5, 10, 15, 20, 25, and 30 documents. For example, using the relevance judgements of the topic distillation task for the TREC 2002 and 2003 Web Track, we compute the precision at the level of the top-5 documents separately for each query and then average the precision values over all 25 queries. Most users of Web search engines do not look beyond the first 10 or 20 results; therefore, the difference in the effectiveness of the algorithms after the top-30 results is not significant. When we compute the precision at these fixed levels, the micro- and macro-average precision measures are equal. We also exploited another baseline: the effectiveness on a single database. The single database contains all the documents from the 50 peers and uses two retrieval algorithms:
• TF·IDF;
• language modeling.
We included the two term weighting functions because merging with language modeling should be compared with a language modeling baseline. This is fair since, in general, language modeling is more effective than TF·IDF-based term weighting schemes. For both baselines, the collection-dependent components IDF and P(t_k|C) are computed on the single database. We will use the notion of the single database with the meaning of the "united database" in the rest of the thesis.
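A sketch of the metric, assuming the relevance judgements are given as a set of relevant document identifiers per query (hypothetical names):

import java.util.*;

public class PrecisionEvaluationSketch {

    // Precision at cutoff k for one query: relevant documents among the first k results, divided by k.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return hits / (double) k;
    }

    // Macro-average over all queries: compute the metric per query, then average the values.
    static double macroAveragePrecisionAtK(Map<String, List<String>> rankedByQuery,
                                           Map<String, Set<String>> relevantByQuery, int k) {
        double sum = 0.0;
        for (Map.Entry<String, List<String>> e : rankedByQuery.entrySet()) {
            Set<String> relevant = relevantByQuery.getOrDefault(e.getKey(), Collections.emptySet());
            sum += precisionAtK(e.getValue(), relevant, k);
        }
        return rankedByQuery.isEmpty() ? 0.0 : sum / rankedByQuery.size();
    }
}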

7.2

Experiments with selected result merging methods

7.2.1

Result merging methods

For the experiments, we used the six methods described in Chapter 4. As a lower bound, we took merging by local TF·IDF values, which is expected to be the most ineffective merging algorithm. We also used the two single database retrieval algorithms as an upper bound. The language modeling scores for result merging were tested with different values of the λ parameter. It was found that the value 0.4 gives the most stable results; however, all other values are almost equally effective. Instead of the simple term frequency component TF, in all methods we used the normalized term frequency TF^norm:

\[ TF^{norm}_{ijk} = \frac{TF_{ijk}}{|D_{ij}|} \qquad (7.2) \]

Where:
TF_ijk is the term frequency, the number of occurrences of a term in the document;
|D_ij| is the document length in terms.

In order to keep the notation simple, we continue using TF in the text instead of TF^norm. The CORI method was tested in two variations, with an additional normalization by the maximum TF value and without it. The second variant was consistently better, and we kept only this method. Finally, the result merging methods and baselines were coded as follows:
• TF: merging by raw TF scores;
• TFIDF: lower bound, merging by TF·IDF scores with local IDF;
• TFGIDF: merging by TF·GIDF scores with global GIDF;
• TFICF: merging by TF·ICF scores;
• CORI: merging by the CORI method;
• LM04: merging with the global language model and λ = 0.4;
• SingleTFIDF: single database baseline with TF·IDF scores;
• SingleLM: single database baseline with the single collection language model.

7.2.2

Merging results

In Figures 7.1-7.6 we summarize the results of the result merging experiments with six methods and two baselines over the three ranking algorithms. For each ranking algorithm, we provide the average precision and recall plots. Both measures are macro-averaged.


In Figure 7.1 and Figure 7.2 the results with the RANDOM ranking algorithm are presented. The performance of all algorithms is similar and significantly worse than that of the single database algorithms. The degradation in performance in comparison with the baseline is expected, since the databases are chosen randomly and many relevant documents are excluded from the merging after the database selection step. The LM04 method is significantly better than the other result merging methods in the top-5...15 interval. Surprisingly, TFIDF shows a very competitive result; it is slightly better than the other merging strategies in the top-15...30 interval. The explanation is that the database statistics are not as highly skewed as assumed. The next experiment, with the CORI database ranking, is summarized in Figure 7.3 and Figure 7.4. This ranking is more realistic, and comparable performance is expected from the final database selection algorithm of the Minerva system. All result merging methods do better than with the RANDOM algorithm. The TFICF strategy has the least effectiveness. From the fact that the query terms are quite popular, we infer that many terms are encountered on every peer, which indicates that the ICF measure of term importance is too rough. Another possible reason is that such an approximation does not work with relatively small peer lists, since we have only 50 peers in the system. The TFGIDF method works worse than we expected and does not even outperform the local TFIDF scores. It seems that the fair GIDF values, which are "averaged" over all databases, are more influenced by noise, while the local IDF values are more topic-specific. For example, the TFIDF scheme works even better than the single database on the very small top of results. The CORI merging method shows a mediocre result. The database selection algorithm plays the role of a variance reduction technique: it eliminates from the search a large portion of non-relevant documents from the databases that were not selected. That is why some methods perform better than the two single-collection baselines. Another important observation is that TF is one of the two best strategies. The TF value normalized by the document length is equal to the maximum likelihood estimate of P(t_k|D); in other words, ranking by the TF scores is equal to language modeling with λ equal to one. It is evidence that, for our testbed with the CORI database ranking, the smoothing by the general language model is only slightly effective.

Figure 7.1: The macro-average precision with the database ranking RANDOM

Figure 7.2: The macro-average recall with the database ranking RANDOM

The LM04 method is the second best strategy. It has almost the same absolute effectiveness as the TF method, but since we compare LM04 with the better SingleLM baseline, it has a lower relative effectiveness in Table 7.3. Comparing TF and LM04 further, we observe that the benefit of smoothing is decreased by the database selection step. We suggest that λ should be tuned separately for every database ranking algorithm. The third experiment is carried out with the manually created IDEAL database ranking. The results are presented in Figure 7.5 and Figure 7.6. It is hard to achieve such an accurate automatic database ranking, which would rank all databases in decreasing order of the number of relevant documents. We used the information about the number of relevant documents in the databases from the TREC 2002 and 2003 Web Track topic distillation relevance judgements and built the IDEAL ranking in a semi-automatic manner for every query. There is no absolute winner here; both the TF and TFGIDF methods perform worse than the TFIDF and CORI methods. For TF this can be explained by the fact that, when all databases have a comparable number of relevant documents, the difference in the IDF values becomes more important. The GIDF values are "smoothed" too much and reflect term importance only in a very general sense. On the one hand, the local IDF values are computed on a reasonably large number of documents inside a collection, so they are not too "overfitted". On the other hand, they correspond to the specific situation where the collection is topically oriented. When all 10 selected collections are close to each other under the IDEAL ranking, the local IDF values are both comparable and topic-specific. Therefore, the TFIDF method appears to be a good one with the IDEAL ranking; the result merging by the CORI method shows almost the same effectiveness. The LM04 method is again the best in the top-5...10 interval and quite good in the remaining categories.

Figure 7.3: The macro-average precision with the database ranking CORI

Figure 7.4: The macro-average recall with the database ranking CORI

Figure 7.5: The macro-average precision with the database ranking IDEAL

Figure 7.6: The macro-average recall with the database ranking IDEAL

Figure 7.7: The macro-average precision of the LM04 result merging method with the different database rankings

The main observations so far:

• All result merging methods are quite close to each other;
• The LM04 method shows the best performance and is robust under every ranking;
• The TFICF method does not work well;
• Surprisingly, the TFIDF method is more effective than the TFGIDF technique;
• The database selection step has a significant influence on merging; a good database ranking allows us to outperform the single-database baseline (see Figure 7.7).

Tables 7.2-7.4 contain the differences in average precision in percent. Each value is computed as the relative difference between a particular method and the corresponding single-database baseline: for the LM04 technique the baseline is the SingleLM method, and for all other methods the baseline is the SingleTFIDF method. It is not clear why the local-IDF-based methods are relatively good in the different setups while the GIDF-based result merging methods are not so effective. The only observation we made is that the simplest IDF computation without additional tuning gives a very unreliable model; it can be enhanced with additional heuristics, but the plain version shows very unstable behaviour. Another observation is that the language models are both effective and robust: they outperform all other tested algorithms.
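For clarity, here is a minimal sketch of the relative-difference computation behind Tables 7.2-7.4 (our own illustration, not the thesis's evaluation code). For example, under the CORI ranking at top-5, LM04 has macro-average precision 0.272 and SingleLM has 0.224 (Figure 7.3), which gives +21.43%, the value reported in Table 7.3.

// Minimal sketch: relative difference in average precision between a merging
// method and its single-database baseline, as reported in Tables 7.2-7.4.
public class RelativeDifference {
    static double relDiff(double methodAvgPrecision, double baselineAvgPrecision) {
        return 100.0 * (methodAvgPrecision - baselineAvgPrecision) / baselineAvgPrecision;
    }

    public static void main(String[] args) {
        // Example values from Figure 7.3 (CORI ranking, top-5):
        // LM04 = 0.272, SingleLM = 0.224  ->  prints "+21.43%", as in Table 7.3.
        System.out.printf("%+.2f%%%n", relDiff(0.272, 0.224));
    }
}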


TOP    TF         TFIDF      TFGIDF     TFICF      CORI       LM04
5      -7.69%     -23.08%    -19.23%    -19.23%    -23.08%      0.00%
10     -25.00%    -20.45%    -31.82%    -31.82%    -22.73%    -20.00%
15     -21.67%    -18.33%    -18.33%    -23.33%    -20.00%    -23.53%
20     -29.11%    -22.78%    -25.32%    -30.38%    -22.78%    -27.85%
25     -29.21%    -22.47%    -24.72%    -25.84%    -24.72%    -29.03%
30     -30.69%    -26.73%    -26.73%    -28.71%    -27.72%    -30.19%

Table 7.2: The difference in percent of the average precision between the result merging strategies and the corresponding baselines with the RANDOM ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach

TOP    TF         TFIDF      TFGIDF     TFICF      CORI       LM04
5      +26.92%    +19.23%    +7.69%      0.00%     +15.38%    +21.43%
10     +11.36%    +4.55%      0.00%     -2.27%      0.00%     -2.00%
15     +8.33%     -1.67%     -5.00%     -5.00%     -1.67%     -4.41%
20     +1.27%     -7.59%     -8.86%     -7.59%     -7.59%      0.00%
25     +5.62%     -3.37%     -4.49%     -3.37%     -3.37%      0.00%
30     +0.99%     -3.96%     -0.99%     -0.99%     -3.96%     -4.72%

Table 7.3: The difference in percent of the average precision between the result merging strategies and the corresponding baselines with the CORI ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach

TOP    TF         TFIDF      TFGIDF     TFICF      CORI       LM04
5      +26.92%    +19.23%    +15.38%    +7.69%     +15.38%    +17.86%
10     +4.55%     +15.91%    +15.91%    +6.82%     +15.91%    +10.00%
15     -3.33%     +20.00%    +1.67%     +5.00%     +18.33%    +1.47%
20     -10.13%    +13.92%    +3.80%     +5.06%     +12.66%    +6.33%
25     -12.36%    +15.73%    +5.62%     +5.62%     +14.61%    +5.38%
30     -10.89%    +4.95%     +4.95%     +6.93%     +6.93%     +2.83%

Table 7.4: The difference in percent of the average precision between the result merging strategies and the corresponding baselines with the IDEAL ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach

7.2.3 Effect of limited statistics on the result merging

In practice, it is inefficient to collect the full statistics for the global language model or the GIDF values from thousands of peers. Instead, we can use the limited statistics from the 10 selected databases that participate in merging. In Figure 7.8 and Figure 7.9 we present the results of the experiments with limited statistics. We tested the LM04 method as the most effective one; 10LM04 is the variation of LM04 that uses only the statistics from the 10 merged databases. We did not use the RANDOM ranking in the remaining experiments, since these experiments make sense only with a reasonably good database selection algorithm. With the CORI database selection algorithm, LM04 performs better when the general language model is estimated over all peers. However, with the IDEAL database ranking the results of the 10LM04 technique are almost equal to the results of the LM04 method. Our conclusion is that, given an effective database ranking algorithm, we can merge results using only the statistics from the databases involved in merging. This gives us a merging method that is both scalable and effective.
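A minimal sketch of the idea behind 10LM04 follows (our own illustration, not the Minerva code): the general model P(t|G) is estimated only from the statistics of the selected peers, assuming each peer reports per-term collection frequencies and its collection size; the maximum-likelihood form of the estimate is an assumption.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: estimate the "general" language model P(t|G) from the term
// statistics of the selected peers only (as in 10LM04), instead of all peers.
// PeerStats is a hypothetical container: term -> collection term frequency,
// plus the total number of tokens in that peer's collection.
class PeerStats {
    Map<String, Long> termFrequency = new HashMap<>();
    long totalTokens;
}

public class LimitedGlobalModel {
    // P(t|G) ~ (sum of term counts over the selected peers) / (sum of their collection sizes)
    static double probabilityInG(String term, List<PeerStats> selectedPeers) {
        long tf = 0, tokens = 0;
        for (PeerStats p : selectedPeers) {
            tf += p.termFrequency.getOrDefault(term, 0L);
            tokens += p.totalTokens;
        }
        return tokens == 0 ? 0.0 : (double) tf / tokens;
    }
}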

Figure 7.8: The macro-average precision with the database ranking CORI with the global statistics collected over the 10 selected databases

Figure 7.9: The macro-average precision with the database ranking IDEAL with the global statistics collected over the 10 selected databases

7.3 Experiments with our approach

In the second series of experiments, we evaluated our own technique. We tested the best result merging method, LM04, in combination with the ranking from the preference-based language model. The detailed description of our approach is provided in Chapter 5; here we only repeat the main equations for the merging score computation. The globally normalized similarity score $s_k^{LM_{gn}}$ in the method LM04 is computed as:

$$s_k^{LM_{gn}} = \log\bigl(\lambda \cdot P(t_k \mid D_{ij}) + (1-\lambda) \cdot P(t_k \mid G)\bigr) \qquad (7.3)$$

The preference-based similarity score $s_k^{LM_{pb}}$ is computed as:

$$s_k^{LM_{pb}} = -P(t_k \mid U) \cdot \log P(t_k \mid D_{ij}) \qquad (7.4)$$

Finally, both scores are combined into the result merging score $s^{LM_{rm}}$:

$$s^{LM_{rm}} = \sum_{k=1}^{|Q|} \beta \cdot s_k^{LM_{gn}} + (1-\beta) \cdot s_k^{LM_{pb}} \qquad (7.5)$$

The influence of two main parameters was investigated:

• n — the number of top documents from which the preference model U is composed;
• β — the smoothing parameter between the $s_k^{LM_{gn}}$ and $s_k^{LM_{pb}}$ scores.

The value of β is explicitly encoded in the last two digits of the method's name. For example, LM04PB02 is the combination of the LM04 score and the preference-based score with β = 0.2. Notice that the codes 04 and 02 in LM04PB02 refer to different parameters: the first is λ, the smoothing parameter between the document and global language models in the language modeling result merging method LM04; the second is β, which defines the trade-off between the combined scores in our approach. If we substitute both combined scores into the formula for the final merging score, we obtain:

$$s^{LM_{rm}} = \sum_{k=1}^{|Q|} \beta \cdot \log\bigl(\lambda \cdot P(t_k \mid D_{ij}) + (1-\lambda) \cdot P(t_k \mid G)\bigr) - (1-\beta) \cdot P(t_k \mid U) \cdot \log P(t_k \mid D_{ij}) \qquad (7.6)$$

When β = 0, the method reduces to ranking by the cross-entropy between the preference-based and document language models; we call it PB. When β = 1, we obtain a pure LM04 ranking. For the retrieval of the pseudo-relevant top-n documents, we used the TFIDF retrieval algorithm.
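For concreteness, here is a small sketch of how the combined score of Equations 7.3-7.5 can be computed for one document; it is our own illustration rather than the Minerva implementation, and the map-based representation of the language models P(t|D), P(t|G), and P(t|U) is an assumption.

import java.util.List;
import java.util.Map;

// Sketch of the combined merging score s^{LM_rm} from Equations 7.3-7.5.
// Each language model is assumed to be a term -> probability map that
// already includes whatever smoothing keeps the probabilities positive.
public class CombinedMergingScore {
    static double score(List<String> queryTerms,
                        Map<String, Double> docModel,     // P(t | D_ij)
                        Map<String, Double> globalModel,  // P(t | G)
                        Map<String, Double> prefModel,    // P(t | U)
                        double lambda, double beta) {
        double s = 0.0;
        for (String t : queryTerms) {
            double pDoc = docModel.getOrDefault(t, 1e-10);
            double pGlobal = globalModel.getOrDefault(t, 1e-10);
            double pPref = prefModel.getOrDefault(t, 0.0);
            double sGn = Math.log(lambda * pDoc + (1 - lambda) * pGlobal); // Eq. 7.3
            double sPb = -pPref * Math.log(pDoc);                          // Eq. 7.4
            s += beta * sGn + (1 - beta) * sPb;                            // Eq. 7.5
        }
        return s;
    }
}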

7.3.1 Optimal size of the top-n

First, we conducted experiments with the separate PB ranking in order to find the optimal n for estimating our preference-based model. A reasonable assumption is that n should lie in the interval [5...30]: the lower bound was set to avoid overfitting, and the upper bound was set with respect to the average number of relevant documents in the databases. In Figure 7.10 and Figure 7.11 we present the results of the experiments with different n for the preference-based language model estimation. The large variance in the results is in accord with our expectations: so far, there is no method that guarantees an accurate estimate of the best n for every database ranking, and the number of relevant documents in the top of the result list is crucial for tuning n. The best database under the CORI ranking gives the best average precision for the model with n = 30, while for the IDEAL ranking the best choice is n = 10. We concluded that in the Minerva system the appropriate choice for the preference-based model estimation is n = 10: it shows the best performance with the IDEAL ranking and reasonably good performance with the CORI ranking as well. For some queries we have databases with more than 10 relevant documents, but for others we have fewer than three relevant documents in the best database. Therefore, it is dangerous to take a large n, since we can introduce many irrelevant documents into the preference-based language model estimation.
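The exact construction of the preference-based model U is given in Chapter 5; as a rough sketch of one simple variant (a maximum-likelihood estimate over the top-n pseudo-relevant documents, which is an assumption on our part), it could look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (an assumption, not the thesis's exact estimator from Chapter 5):
// build a preference-based language model P(t|U) by maximum likelihood over
// the top-n pseudo-relevant documents retrieved from the best-ranked peer,
// each document given as the list of its terms.
public class PreferenceModel {
    static Map<String, Double> estimate(List<List<String>> topNDocs) {
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        for (List<String> doc : topNDocs) {
            for (String term : doc) {
                counts.merge(term, 1L, Long::sum);
                total++;
            }
        }
        Map<String, Double> model = new HashMap<>();
        if (total == 0) return model;                 // no feedback documents
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            model.put(e.getKey(), e.getValue() / (double) total);
        }
        return model;
    }
}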

7.3.2 Optimal smoothing parameter β

After we fixed the n parameter for the preference-based language model estimation, we conducted experiments with different values of the β parameter. We carried out experiments for β = 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99 and obtained the best combination with β = 0.6.
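The tuning itself is a plain grid search; a minimal sketch follows (our own illustration, where the evaluation callback stands in for running all test queries and computing the macro-average precision of the merged lists for a given β, which is an assumed helper).

// Sketch of the beta grid search described above.
public class BetaSweep {
    interface Evaluator {
        double macroAvgPrecision(double beta);
    }

    static double bestBeta(Evaluator eval) {
        double[] candidates = {0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99};
        double best = candidates[0], bestScore = Double.NEGATIVE_INFINITY;
        for (double beta : candidates) {
            double score = eval.macroAvgPrecision(beta);
            if (score > bestScore) { bestScore = score; best = beta; }
        }
        return best;
    }
}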

Figure 7.10: The macro-average precision with the database ranking CORI with the different size of top-n for the preference-based model estimation

Figure 7.11: The macro-average precision with the database ranking IDEAL with the different size of top-n for the preference-based model estimation

Figure 7.12: The macro-average precision with the database ranking CORI with the top-10 documents for the preference-based model estimation with β = 0.6 and the LM04 result merging method

Figure 7.13: The macro-average precision with the database ranking IDEAL with the top-10 documents for the preference-based model estimation with β = 0.6 and the LM04 result merging method

In Figure 7.12 and Figure 7.13 we present the results for the LM04PB06 method and show the separate performance of the combined methods for comparison. The single PB ranking, which is based purely on pseudo-relevance feedback, shows unstable performance under the different database rankings: it is effective with the CORI ranking and poor with the IDEAL ranking. This is an inherent property of pseudo-relevance feedback: a "lucky" choice of the top-n documents for the model estimation increases the performance, while an "unlucky" choice decreases it. The performance of the PB method depends entirely on the first database in the ranking, from which we obtain the preference-based language model. The average precision of LM04PB06 is slightly better than that of LM04 in the top-5 and top-15 categories; in the other cases, it shows the same performance. We conclude that LM04PB06, the combination of the cross-entropy ranking PB with the LM04 language modeling ranking with β = 0.6, is slightly more effective than the single LM04 method.

7.4 Summary

In this chapter, we presented a detailed description of our experiments with the result merging strategies selected for evaluation in the Minerva system. In Section 7.1, we provided the testbed description. The results of testing several known result merging techniques are presented in Section 7.2; we found that the LM04 result merging method is the most effective and robust one. In Section 7.3, we described the experimental results with the proposed approach. We found that the combination of the pseudo-relevance-feedback-based PB method with the best result merging method LM04 gives a small improvement on some intervals and is at least as effective as the best of the combined methods.

Chapter 8

Conclusions and future work

8.1 Conclusions

In this thesis, we investigated the effectiveness of different result merging methods for the P2P Web search engine Minerva. We selected several merging methods that are feasible to use in a heterogeneous, dynamic, distributed environment. The experimental framework for these methods was implemented with Java 1.4.2 and Oracle 9.2.0. We carried out experiments with different database rankings and studied the effectiveness of the result merging methods for different sizes of the top-k. The language modeling ranking method LM04 produced the most robust and accurate results under the different conditions. We proposed a new result merging method that combines two types of similarity scores. The first score type is computed with the language modeling merging method LM04. The second score type is computed as a cross-entropy value between the preference-based language model and the document language model. The novelty of our approach is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the peer ranking. The combination is tuned with a heuristically set parameter. In every tested setup, the new method was at least as effective as the best of the individual merging methods or slightly better. The main observations are the following:

• All merging algorithms are very close in absolute retrieval effectiveness;
• Language modeling methods are more effective than TF·IDF-based methods;
• The effectiveness of the database selection step influences the quality of result merging;
• The pseudo-relevance feedback information from topically organized collections improves the retrieval quality.

8.2 Future work

There are several ways to enhance result merging in Minerva; effectiveness and efficiency are the two important dimensions for improvement. The effectiveness of the various text-statistics-based methods is similar, and switching among them does not significantly improve the final ranking. We can exploit other sources of evidence and incorporate them into the fused document score; for example, linkage-based algorithms such as PageRank can be added to the retrieval algorithm. The open problem here is how to compute such scores in a completely distributed environment. Efficiency mainly depends on a smart top-k result computation algorithm; we can improve it by introducing additional communication between the peers during query processing.
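As a rough sketch of the kind of score fusion this paragraph envisions (our own illustration; the linear combination and the weight α are assumptions, not a design adopted in the thesis):

// Sketch: fusing a content-based merging score with a link-based authority
// score (e.g., PageRank) via a simple weighted linear combination. Both
// inputs are assumed to be normalized to comparable ranges beforehand.
public class ScoreFusion {
    static double fused(double contentScore, double pageRank, double alpha) {
        return alpha * contentScore + (1 - alpha) * pageRank;
    }
}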


Appendix A

Test queries

In Table A.1 we summarize the test queries from the topic distillation task of the TREC 2002 and 2003 Web Track datasets (http://trec.nist.gov). The queries were selected with respect to two requirements: each has at least 10 relevant documents and is related to the "Health and Medicine" or "Nature and Ecology" topic. In Table A.2 we present the distributions of relevant documents for the TF and LM04PB06 merging methods. The pattern is typical for the tested merging methods: results for some specific queries are improved while results for others get worse. Figures A.1 and A.2 give a graphical interpretation of the same data. The residual in performance between the two methods is presented in Figure A.3.

N    Query number in TREC    Stemmed query                      Topic    Total number of relevant documents
1    552                     food cancer patient                HM       26
2    557                     clean air clean water              NE       48
3    560                     symptom diabet                     HM       28
4    561                     erad boll weevil                   NE       17
5    563                     smoke drink pregnanc               HM       23
6    564                     mother infant nutrit               HM       24
7    569                     invas anim plant                   NE       18
8    574                     whale dolphin protect              NE       27
9    575                     nuclear wast storag transport      NE       46
10   576                     chesapeak bay ecolog               NE       14
11   578                     regul zoo                          NE       20
12   584                     birth defect                       HM       50
13   583                     florida endang speci               NE       29
14   586                     women health cancer                HM       38
15   589                     mental ill adolesc                 HM       112
16   594                     food prevent high cholesterol      HM       27
17   TD5                     pest control safeti                NE       13
18   TD14                    agricultur biotechnolog            NE       12
19   TD31                    deaf children                      HM       13
20   TD32                    wildlif conserv                    NE       86
21   TD33                    food safeti                        HM       28
22   TD35                    arctic explor                      NE       45
23   TD36                    global warm                        NE       12
24   TD43                    forest fire                        NE       25
25   TD44                    ozon layer                         NE       12

Table A.1: The topic-oriented set of the 25 experimental queries (topics are coded as "HM" for Health and Medicine and "NE" for Nature and Ecology)

QueryN   top-5       top-10      top-15      top-20      top-25      top-30
         TF | LMPB   TF | LMPB   TF | LMPB   TF | LMPB   TF | LMPB   TF | LMPB
1        1 | 5       1 | 5       2 | 5       3 | 6       3 | 6       5 | 6
2        0 | 0       0 | 0       0 | 0       0 | 0       0 | 0       0 | 0
3        5 | 2       5 | 2       5 | 4       8 | 5       10 | 6      10 | 8
4        4 | 4       7 | 7       8 | 9       9 | 9       10 | 10     10 | 10
5        1 | 1       1 | 2       1 | 2       2 | 2       2 | 2       2 | 2
6        1 | 1       2 | 3       2 | 6       4 | 6       6 | 6       7 | 7
7        1 | 0       1 | 1       1 | 2       1 | 2       3 | 2       3 | 2
8        0 | 0       0 | 1       0 | 1       1 | 1       1 | 1       2 | 1
9        2 | 2       3 | 3       4 | 3       4 | 4       4 | 5       6 | 7
10       0 | 0       1 | 0       1 | 1       1 | 2       2 | 3       2 | 3
11       2 | 3       3 | 3       4 | 4       5 | 4       6 | 4       8 | 5
12       1 | 3       2 | 4       5 | 4       8 | 6       8 | 9       9 | 9
13       0 | 0       0 | 1       0 | 1       1 | 3       1 | 3       1 | 3
14       3 | 3       5 | 5       8 | 8       10 | 11     12 | 12     12 | 12
15       4 | 3       4 | 5       5 | 6       5 | 6       5 | 8       5 | 8
16       1 | 1       1 | 1       1 | 1       1 | 1       1 | 1       1 | 1
17       0 | 2       2 | 3       3 | 3       3 | 3       3 | 3       3 | 3
18       2 | 0       2 | 2       3 | 3       3 | 3       3 | 3       3 | 3
19       0 | 0       0 | 1       1 | 1       1 | 2       1 | 3       1 | 3
20       0 | 0       0 | 0       0 | 0       0 | 0       0 | 0       0 | 0
21       0 | 0       0 | 0       0 | 0       0 | 1       0 | 1       0 | 2
22       0 | 0       1 | 0       1 | 0       1 | 0       1 | 1       2 | 2
23       0 | 1       1 | 1       1 | 1       1 | 1       1 | 1       1 | 1
24       1 | 1       1 | 1       1 | 1       1 | 1       1 | 1       1 | 1
25       2 | 2       4 | 4       4 | 4       6 | 6       8 | 7       9 | 9

Table A.2: The number of relevant documents for the TF and LM04PB06 methods with the IDEAL database ranking (the LM04PB06 name is shortened to LMPB for convenience)

Figure A.1: Relevant documents distribution for the TF method with the IDEAL ranking

Figure A.2: Relevant documents distribution for the LM04PB06 method with the IDEAL ranking

Figure A.3: Residual between the number of relevant documents of the LM04PB06 and TF methods with the IDEAL database ranking
