Random Sampling from a Search Engine's Index
Ziv Bar-Yossef, Maxim Gurevich
Department of Electrical Engineering, Technion

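Search Engine Samplers

[Diagram: the search engine indexes a set D of documents from the Web. The sampler submits queries to the search engine's public interface, receives the top k results for each query, and outputs a random document x ∈ D.]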

Motivation

Useful tool for search engine evaluation:
• Freshness: fraction of up-to-date pages in the index
• Topical bias: identification of overrepresented/underrepresented topics
• Spam: fraction of spam pages in the index
• Security: fraction of pages in the index infected by viruses/worms/trojans
• Relative size: number of documents indexed compared with other search engines

Size Wars

August 2005: "We index 20 billion documents."

September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."

So, who's right?

Why Does Size Matter, Anyway?

• Comprehensiveness: a good crawler covers the most documents possible
• Narrow-topic queries: e.g., get the homepage of John Doe
• Prestige: a marketing advantage

Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorni05]

Sample pages uniformly at random from the search engine's index. Two alternatives:
• Absolute size estimation
  - Sample until collision
  - Collision expected after $k \approx N^{1/2}$ random samples (birthday paradox)
  - Return $k^2$
• Relative size estimation
  - Check how many samples from search engine A are present in search engine B and vice versa
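The absolute size estimate can be sketched in a few lines. This is a minimal illustration, assuming a perfectly uniform sampler simulated with Python's `random` module; the function names and the `max_draws` guard are hypothetical and not part of the talk.

```python
import random

def samples_until_collision(sample, max_draws=10**7):
    """Draw from sample() until some value repeats; return the number of draws k."""
    seen = set()
    for k in range(1, max_draws + 1):
        x = sample()
        if x in seen:
            return k
        seen.add(x)
    raise RuntimeError("no collision within max_draws")

def estimate_size(sample):
    """Birthday-paradox estimate: a collision after k draws suggests N is about k^2."""
    k = samples_until_collision(sample)
    return k * k

def mean_estimate(n, trials, seed=0):
    """Average the estimate over many runs against a simulated index of n documents."""
    rng = random.Random(seed)  # stand-in for a uniform sampler (assumption)
    total = 0
    for _ in range(trials):
        total += estimate_size(lambda: rng.randrange(n))
    return total / trials
```

A single run gives only an order-of-magnitude estimate: under uniform sampling the average of k² is roughly 2N, and individual runs vary widely around it.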

Other Approaches

• Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
• Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
• Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

The Bharat-Broder Sampler: Preprocessing Step

[Diagram: a large corpus C is processed into a lexicon L that lists each term together with its frequency in C: (t1, freq(t1,C)), (t2, freq(t2,C)), …]

The Bharat-Broder Sampler

[Diagram: the BB sampler picks two random terms t1, t2 from the lexicon L, submits the query "t1 AND t2" to the search engine, receives the top k results, and outputs a random document from the top k results.]

Samples are uniform only if:
• all queries return the same number of results ≤ k
• all documents are of the same length

The Bharat-Broder Sampler: Drawbacks

• Documents have varying lengths
  - Bias towards long documents
• Some queries have more than k matches
  - Bias towards documents with high static rank

Our Contributions

• A pool-based sampler (focus of this talk)
  - Guaranteed to produce near-uniform samples
• A random walk sampler
  - After sufficiently many steps, guaranteed to produce near-uniform samples
  - Does not need an explicit lexicon/pool at all!

Search Engines as Hypergraphs

[Figure: a hypergraph whose hyperedges are the queries "news", "google", "bbc", and "maps", drawn over the documents www.cnn.com, news.google.com, news.bbc.co.uk, www.google.com, www.foxnews.com, maps.google.com, www.mapquest.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk, and maps.yahoot.com.]

• results(q) = { documents returned on query q }
• queries(x) = { queries that return x as a result }
• P = query pool = a set of queries
• Query pool hypergraph:
  - Vertices: indexed documents
  - Hyperedges: { results(q) | q ∈ P }

Query Cardinalities and Document Degrees

[Figure: the query pool hypergraph over the queries "news", "google", "bbc", and "maps".]

• Query cardinality: card(q) = |results(q)|
  - card("news") = 4, card("bbc") = 3
• Document degree: deg(x) = |queries(x)|
  - deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
• Cardinality and degree are easily computable
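The definitions of card(q) and deg(x) can be checked on the example hypergraph. In the sketch below, the assignment of documents to queries is an assumption: it is inferred from the URLs and from the cardinalities and degrees quoted on the slides, since the figure itself does not survive in text form. The `RESULTS` table and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical reconstruction of the example: query -> results(q).
# Memberships are inferred, not listed explicitly in the slides.
RESULTS = {
    "news": ["www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "bbc": ["news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}

def card(q):
    """Query cardinality: card(q) = |results(q)|."""
    return len(RESULTS[q])

def queries(x):
    """queries(x) = set of pool queries that return x."""
    return {q for q, docs in RESULTS.items() if x in docs}

def deg(x):
    """Document degree: deg(x) = |queries(x)|."""
    return len(queries(x))

def documents():
    """All documents covered by the pool."""
    return sorted({x for docs in RESULTS.values() for x in docs})

def degree_distribution():
    """p(x) = deg(x) / sum of all degrees, as exact fractions."""
    total = sum(deg(x) for x in documents())
    return {x: Fraction(deg(x), total) for x in documents()}
```

With this reconstruction the degrees sum to 13, which agrees with the probabilities 2/13 and 1/13 quoted for news.bbc.co.uk and www.cnn.com.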

Sampling documents uniformly

• Sampling documents from D uniformly: hard
• Sampling documents from D non-uniformly: easier
• Will show later: can sample documents proportionally to their degrees:

  $p(x) = \frac{\deg(x)}{\sum_{x'} \deg(x')}$

Sampling documents by degree

[Figure: the query pool hypergraph over the queries "news", "google", "bbc", and "maps".]

• p(news.bbc.co.uk) = 2/13
• p(www.cnn.com) = 1/13

Monte Carlo Simulation

• We need: samples from the uniform distribution
• We have: samples from the degree distribution
• Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
• Yes! Monte Carlo simulation methods:
  - Rejection Sampling
  - Importance Sampling
  - Metropolis-Hastings
  - Maximum-Degree

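Monte Carlo Simulation

• π: target distribution
  - In our case: π = uniform on D
• p: trial distribution
  - In our case: p = degree distribution
• Bias weight of p(x) relative to π(x): $w(x) \propto \frac{\pi(x)}{p(x)}$
  - In our case: $w(x) \propto \frac{1}{\deg(x)}$

[Diagram: the p-sampler produces samples from p together with their bias weights, (x1, w(x1)), (x2, w(x2)), …. The Monte Carlo simulator consumes them and outputs a sample x from π. Together they form the π-sampler.]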

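Bias Weights

• Unnormalized forms of π and p:
  $\pi(x) = \frac{\hat{\pi}(x)}{Z_\pi}, \quad p(x) = \frac{\hat{p}(x)}{Z_p}$
  - $Z_\pi$, $Z_p$: (unknown) normalization constants
• Examples:
  - π = uniform: $\hat{\pi}(x) = 1$
  - p = degree distribution: $\hat{p}(x) = \deg(x)$
• Bias weight: $w(x) = \frac{\hat{\pi}(x)}{\hat{p}(x)}$ (in the example, $w(x) = \frac{1}{\deg(x)}$)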

Rejection Sampling [von Neumann]

• C: envelope constant
  - C ≥ w(x) for all x
• The algorithm:
  - accept := false
  - while (not accept)
      generate a sample x from p
      toss a coin whose heads probability is w(x)/C
      if coin comes up heads, accept := true
  - return x
• In our case: C = 1 and acceptance prob = 1/deg(x)
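The rejection sampling loop above translates almost line for line into code. The sketch below is illustrative only: the three-item degree table, the function names, and the use of `random.Random.choices` as the trial sampler are assumptions made for the demonstration.

```python
import random
from collections import Counter

def rejection_sample(draw_trial, weight, envelope, rng):
    """Return one sample from the target, given samples from the trial distribution.

    draw_trial() yields x ~ p, weight(x) is the bias weight w(x),
    and envelope is a constant C with C >= w(x) for all x.
    """
    while True:
        x = draw_trial()
        # Toss a coin whose heads probability is w(x) / C.
        if rng.random() < weight(x) / envelope:
            return x

def empirical_frequencies(n_samples, seed=0):
    """Toy check: the trial is a degree distribution, the target is uniform."""
    rng = random.Random(seed)
    degrees = {"a": 1, "b": 2, "c": 3}  # hypothetical degrees
    items = list(degrees)
    weights = [degrees[x] for x in items]

    def draw():
        return rng.choices(items, weights=weights)[0]

    counts = Counter(
        rejection_sample(draw, lambda x: 1 / degrees[x], 1.0, rng)
        for _ in range(n_samples)
    )
    return {x: counts[x] / n_samples for x in items}
```

Because w(x) = 1/deg(x) never exceeds 1, the envelope constant C = 1 is valid here, and the accepted samples should be spread roughly evenly over the three items.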

Pool-Based Sampler

• Degree distribution: $p(x) = \frac{\deg(x)}{\sum_{x'} \deg(x')}$

[Diagram: the degree distribution sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …. It outputs documents sampled from the degree distribution with corresponding weights, (x1, 1/deg(x1)), (x2, 1/deg(x2)), …. Rejection sampling turns these into a uniform sample x.]

Sampling documents by degree

[Figure: the query pool hypergraph over the queries "news", "google", "bbc", and "maps".]

• Select a random query q
• Select a random x ∈ results(q)
• Documents with high degree are more likely to be sampled
• If we sample q uniformly: "oversample" documents that belong to narrow queries
• We need to sample q proportionally to its cardinality

Sampling documents by degree (2)

[Figure: the query pool hypergraph over the queries "news", "google", "bbc", and "maps".]

• Select a query q proportionally to its cardinality
• Select a random x ∈ results(q)
• Analysis:

  $\Pr(x) = \sum_{q \in \mathrm{queries}(x)} \frac{\mathrm{card}(q)}{\sum_{q' \in P} \mathrm{card}(q')} \cdot \frac{1}{\mathrm{card}(q)} = \frac{\deg(x)}{\sum_{q' \in P} \mathrm{card}(q')} = \frac{\deg(x)}{\sum_{x'} \deg(x')}$
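The effect of the query distribution can be verified exactly on the example. The sketch below computes the distribution of x for two ways of choosing q, using exact fractions. The `RESULTS` table is a hypothetical reconstruction of the figure (memberships inferred from the URLs and the quoted cardinalities), and the function names are illustrative.

```python
from fractions import Fraction

# Hypothetical reconstruction of the example hypergraph: query -> results(q).
RESULTS = {
    "news": ["www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "bbc": ["news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}

def document_distribution(query_probs):
    """Exact distribution of x when q ~ query_probs and x is uniform in results(q)."""
    dist = {}
    for q, pq in query_probs.items():
        for x in RESULTS[q]:
            dist[x] = dist.get(x, Fraction(0)) + pq * Fraction(1, len(RESULTS[q]))
    return dist

def uniform_queries():
    """Each query in the pool is equally likely."""
    return {q: Fraction(1, len(RESULTS)) for q in RESULTS}

def cardinality_queries():
    """Each query is chosen with probability proportional to card(q)."""
    total = sum(len(docs) for docs in RESULTS.values())
    return {q: Fraction(len(docs), total) for q, docs in RESULTS.items()}
```

With cardinality-proportional queries every document gets probability deg(x)/13. With uniform queries, www.google.com (in a 3-result query) gets 1/12 while www.cnn.com (in a 4-result query) gets only 1/16, although both have degree 1.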

Degree Distribution Sampler

[Diagram: the cardinality distribution sampler supplies a query q sampled from the cardinality distribution. The degree distribution sampler sends q to the search engine, receives results(q), and samples x uniformly from results(q). The output x is a document sampled from the degree distribution.]

Sampling queries by cardinality

• Sampling queries from pool uniformly: easy
• Sampling queries from pool by cardinality: hard
  - Requires knowing cardinalities of all queries in the search engine
• Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  - Target distribution: the cardinality distribution
  - Trial distribution: uniform distribution on the query pool

Sampling queries by cardinality

• Bias weight of cardinality distribution relative to the uniform distribution: $w(q) = \mathrm{card}(q)$
  - Can be computed using a single search engine query
• Use rejection sampling:
  - Envelope constant for rejection sampling: $C = k$ (since card(q) ≤ k for every query that does not overflow)
  - Queries are sampled uniformly from the pool
  - Each query q is accepted with probability $\frac{w(q)}{C} = \frac{\mathrm{card}(q)}{k}$
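The query-level rejection step can be sketched as follows. It assumes that no query in the pool overflows, so that card(q)/k is a valid acceptance probability with k as the envelope constant. The cardinality table, the value k = 4, and the function names are hypothetical.

```python
import random
from collections import Counter

def sample_query_by_cardinality(pool, card, k, rng):
    """Sample a query with probability proportional to card(q).

    Trial distribution: uniform on the pool. Acceptance probability: card(q) / k.
    Assumes card(q) <= k for every q in the pool (no overflowing queries).
    """
    while True:
        q = rng.choice(pool)
        # card(q) stands for one search engine request returning the result count.
        if rng.random() < card(q) / k:
            return q

def query_frequencies(n_samples, seed=0):
    """Toy check against a hypothetical pool of four queries with k = 4."""
    cards = {"news": 4, "google": 3, "bbc": 3, "maps": 3}
    rng = random.Random(seed)
    pool = list(cards)
    counts = Counter(
        sample_query_by_cardinality(pool, cards.get, 4, rng)
        for _ in range(n_samples)
    )
    return {q: counts[q] / n_samples for q in pool}
```

In this toy pool the cardinalities sum to 13, so "news" should be returned about 4/13 of the time and each other query about 3/13 of the time.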

Complete Pool-Based Sampler

[Diagram: the uniform query sampler draws uniform query samples from the pool and obtains (q, card(q)), … from the search engine. Rejection sampling turns these into queries sampled from the cardinality distribution. The degree distribution sampler takes (q, results(q)), … and outputs documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …. A second rejection sampling step outputs x, a uniform document sample.]
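The whole pipeline can be simulated end to end against a toy search engine. This is a sketch under stated assumptions, not the authors' implementation: `RESULTS` is a hypothetical reconstruction of the example hypergraph, `K = 4` is an assumed result limit that no query exceeds, and deg(x) is read off the table rather than computed from document contents.

```python
import random
from collections import Counter

# Hypothetical toy index: query -> results(q).
RESULTS = {
    "news": ["www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "bbc": ["news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
K = 4  # assumed result limit; no query in this toy pool overflows

def card(q):
    return len(RESULTS[q])

def deg(x):
    return sum(1 for docs in RESULTS.values() if x in docs)

def sample_query(rng):
    """Stage 1: uniform trial query, accepted with probability card(q) / K."""
    pool = list(RESULTS)
    while True:
        q = rng.choice(pool)
        if rng.random() < card(q) / K:
            return q

def sample_document(rng):
    """Stage 2: degree-distributed trial document, accepted with probability 1 / deg(x)."""
    while True:
        q = sample_query(rng)
        x = rng.choice(RESULTS[q])  # uniform in results(q)
        if rng.random() < 1 / deg(x):
            return x

def empirical_distribution(n_samples, seed=0):
    rng = random.Random(seed)
    counts = Counter(sample_document(rng) for _ in range(n_samples))
    return {x: c / n_samples for x, c in counts.items()}
```

Over many samples each of the ten documents should appear with frequency close to 1/10, including the three documents of degree 2.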

Dealing with Overflowing Queries

• Problem: some queries may overflow (card(q) > k)
  - Bias towards highly ranked documents
• Solutions:
  - Select a pool P in which overflowing queries are rare (e.g., phrase queries)
  - Skip overflowing queries
  - Adapt rejection sampling to deal with approximate weights
• Theorem: samples of the PB sampler are at most β-away from uniform (β = overflow probability of P).

Creating the query pool

[Diagram: a large corpus C is processed into the query pool P = {q1, q2, …}.]

• Example: P = all 3-word phrases that occur in C
  - If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
• Choose P that "covers" most documents in D
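Building such a pool amounts to collecting word n-grams from the corpus. The sketch below assumes whitespace tokenization and lowercasing, which the slides do not specify; the function name `phrase_pool` is hypothetical.

```python
def phrase_pool(corpus, n=3):
    """Return the set of all n-word phrases occurring in the given texts.

    corpus is an iterable of strings. Tokenization is plain whitespace
    splitting after lowercasing (an assumption for this sketch).
    """
    pool = set()
    for text in corpus:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool
```

For the text "to be or not to be" this yields the four phrases listed above.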

A random walk sampler

• Define a graph G over the indexed documents
  - (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
• Run a random walk on G
  - Limit distribution = degree distribution
  - Use MCMC methods to make limit distribution uniform:
      Metropolis-Hastings
      Maximum-Degree
• Does not need a preprocessing step
• Less efficient than the pool-based sampler
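A Metropolis-Hastings correction of a simple random walk on G can be sketched on the toy example. Several choices here are assumptions rather than details from the talk: the walk moves to a uniformly chosen neighbor, "degree" means the number of neighbors in G with self-loops excluded, and `RESULTS` is a hypothetical reconstruction of the example hypergraph.

```python
import random
from fractions import Fraction

# Hypothetical toy index: query -> results(q).
RESULTS = {
    "news": ["www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "bbc": ["news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
DOCS = sorted({x for docs in RESULTS.values() for x in docs})

def neighbors(x):
    """Documents sharing at least one query with x (x itself excluded)."""
    out = set()
    for docs in RESULTS.values():
        if x in docs:
            out.update(docs)
    out.discard(x)
    return sorted(out)

def mh_transition(x):
    """Exact one-step distribution {y: probability} of the corrected walk from x."""
    nx = neighbors(x)
    probs = {}
    stay = Fraction(1)
    for y in nx:
        # Propose y uniformly, accept with probability min(1, d(x) / d(y)).
        accept = min(Fraction(1), Fraction(len(nx), len(neighbors(y))))
        probs[y] = Fraction(1, len(nx)) * accept
        stay -= probs[y]
    probs[x] = stay  # rejected proposals keep the walk at x
    return probs

def one_step_from_uniform():
    """Apply one step of the corrected walk to the uniform distribution."""
    out = {d: Fraction(0) for d in DOCS}
    for x in DOCS:
        for y, p in mh_transition(x).items():
            out[y] += Fraction(1, len(DOCS)) * p
    return out

def walk(start, steps, rng):
    """Simulate the corrected walk and return the final document."""
    x = start
    for _ in range(steps):
        y = rng.choice(neighbors(x))
        if rng.random() < min(1.0, len(neighbors(x)) / len(neighbors(y))):
            x = y
    return x
```

The uniform distribution is unchanged by one step of this chain, which is the property the correction is meant to provide. How many steps are needed before the walk is close to uniform is a separate question that this sketch does not address.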

Bias towards Long Documents

[Bar chart: percent of documents from sample (0% to 60%) in each decile (1 to 10) of documents ordered by size, for the Pool Based, Random Walk, and Bharat-Broder samplers.]

Relative Sizes of Google, MSN and Yahoo!

• Google = 1
• Yahoo! = 1.28
• MSN Search = 0.73

Top-Level Domains in Google, MSN and Yahoo!

[Bar chart: percent of documents from sample (0% to 60%) per top-level domain name (com, org, net, uk, edu, de, au, gov, ca, us, it, no, es, ie, info), for Google, MSN, and Yahoo!.]

Conclusions

• Two new search engine samplers
  - Pool-based sampler
  - Random walk sampler
• Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
• Samplers show little or no bias in experiments.

Thank You
