Search Engines: Information Retrieval in Practice

W. BRUCE CROFT University of Massachusetts, Amherst

DONALD METZLER Yahoo! Research

TREVOR STROHMAN Google Inc.

Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

Contents

1 Search Engines and Information Retrieval
  1.1 What Is Information Retrieval?
  1.2 The Big Issues
  1.3 Search Engines
  1.4 Search Engineers

2 Architecture of a Search Engine
  2.1 What Is an Architecture?
  2.2 Basic Building Blocks
  2.3 Breaking It Down
    2.3.1 Text Acquisition
    2.3.2 Text Transformation
    2.3.3 Index Creation
    2.3.4 User Interaction
    2.3.5 Ranking
    2.3.6 Evaluation
  2.4 How Does It Really Work?

3 Crawls and Feeds
  3.1 Deciding What to Search
  3.2 Crawling the Web
    3.2.1 Retrieving Web Pages
    3.2.2 The Web Crawler
    3.2.3 Freshness
    3.2.4 Focused Crawling
    3.2.5 Deep Web
    3.2.6 Sitemaps
    3.2.7 Distributed Crawling
  3.3 Crawling Documents and Email
  3.4 Document Feeds
  3.5 The Conversion Problem
    3.5.1 Character Encodings
  3.6 Storing the Documents
    3.6.1 Using a Database System
    3.6.2 Random Access
    3.6.3 Compression and Large Files
    3.6.4 Update
    3.6.5 BigTable
  3.7 Detecting Duplicates
  3.8 Removing Noise

4 Processing Text
  4.1 From Words to Terms
  4.2 Text Statistics
    4.2.1 Vocabulary Growth
    4.2.2 Estimating Collection and Result Set Sizes
  4.3 Document Parsing
    4.3.1 Overview
    4.3.2 Tokenizing
    4.3.3 Stopping
    4.3.4 Stemming
    4.3.5 Phrases and N-grams
  4.4 Document Structure and Markup
  4.5 Link Analysis
    4.5.1 Anchor Text
    4.5.2 PageRank
    4.5.3 Link Quality
  4.6 Information Extraction
    4.6.1 Hidden Markov Models for Extraction
  4.7 Internationalization

5 Ranking with Indexes
  5.1 Overview
  5.2 Abstract Model of Ranking
  5.3 Inverted Indexes
    5.3.1 Documents
    5.3.2 Counts
    5.3.3 Positions
    5.3.4 Fields and Extents
    5.3.5 Scores
    5.3.6 Ordering
  5.4 Compression
    5.4.1 Entropy and Ambiguity
    5.4.2 Delta Encoding
    5.4.3 Bit-Aligned Codes
    5.4.4 Byte-Aligned Codes
    5.4.5 Compression in Practice
    5.4.6 Looking Ahead
    5.4.7 Skipping and Skip Pointers
  5.5 Auxiliary Structures
  5.6 Index Construction
    5.6.1 Simple Construction
    5.6.2 Merging
    5.6.3 Parallelism and Distribution
    5.6.4 Update
  5.7 Query Processing
    5.7.1 Document-at-a-time Evaluation
    5.7.2 Term-at-a-time Evaluation
    5.7.3 Optimization Techniques
    5.7.4 Structured Queries
    5.7.5 Distributed Evaluation
    5.7.6 Caching

6 Queries and Interfaces
  6.1 Information Needs and Queries
  6.2 Query Transformation and Refinement
    6.2.1 Stopping and Stemming Revisited
    6.2.2 Spell Checking and Suggestions
    6.2.3 Query Expansion
    6.2.4 Relevance Feedback
    6.2.5 Context and Personalization
  6.3 Showing the Results
    6.3.1 Result Pages and Snippets
    6.3.2 Advertising and Search
    6.3.3 Clustering the Results
  6.4 Cross-Language Search

7 Retrieval Models
  7.1 Overview of Retrieval Models
    7.1.1 Boolean Retrieval
    7.1.2 The Vector Space Model
  7.2 Probabilistic Models
    7.2.1 Information Retrieval as Classification
    7.2.2 The BM25 Ranking Algorithm
  7.3 Ranking Based on Language Models
    7.3.1 Query Likelihood Ranking
    7.3.2 Relevance Models and Pseudo-Relevance Feedback
  7.4 Complex Queries and Combining Evidence
    7.4.1 The Inference Network Model
    7.4.2 The Galago Query Language
  7.5 Web Search
  7.6 Machine Learning and Information Retrieval
    7.6.1 Learning to Rank
    7.6.2 Topic Models and Vocabulary Mismatch
  7.7 Application-Based Models

8 Evaluating Search Engines
  8.1 Why Evaluate?
  8.2 The Evaluation Corpus
  8.3 Logging
  8.4 Effectiveness Metrics
    8.4.1 Recall and Precision
    8.4.2 Averaging and Interpolation
    8.4.3 Focusing on the Top Documents
    8.4.4 Using Preferences
  8.5 Efficiency Metrics
  8.6 Training, Testing, and Statistics
    8.6.1 Significance Tests
    8.6.2 Setting Parameter Values
    8.6.3 Online Testing
  8.7 The Bottom Line

9 Classification and Clustering
  9.1 Classification and Categorization
    9.1.1 Naive Bayes
    9.1.2 Support Vector Machines
    9.1.3 Evaluation
    9.1.4 Classifier and Feature Selection
    9.1.5 Spam, Sentiment, and Online Advertising
  9.2 Clustering
    9.2.1 Hierarchical and K-Means Clustering
    9.2.2 K Nearest Neighbor Clustering
    9.2.3 Evaluation
    9.2.4 How to Choose k
    9.2.5 Clustering and Search

10 Social Search
  10.1 What Is Social Search?
  10.2 User Tags and Manual Indexing
    10.2.1 Searching Tags
    10.2.2 Inferring Missing Tags
    10.2.3 Browsing and Tag Clouds
  10.3 Searching with Communities
    10.3.1 What Is a Community?
    10.3.2 Finding Communities
    10.3.3 Community-Based Question Answering
    10.3.4 Collaborative Searching
  10.4 Filtering and Recommending
    10.4.1 Document Filtering
    10.4.2 Collaborative Filtering
  10.5 Peer-to-Peer and Metasearch
    10.5.1 Distributed Search
    10.5.2 P2P Networks

11 Beyond Bag of Words
  11.1 Overview
  11.2 Feature-Based Retrieval Models
  11.3 Term Dependence Models
  11.4 Structure Revisited
    11.4.1 XML Retrieval
    11.4.2 Entity Search
  11.5 Longer Questions, Better Answers
  11.6 Words, Pictures, and Music
  11.7 One Search Fits All?

References

Index

