A Systematic Study of Parameter Correlations in Large-Scale Duplicate Document Detection

Shaozhi Ye (1), Ji-Rong Wen (2), Wei-Ying Ma (2)
(1) Department of Computer Science, University of California, Davis
(2) Microsoft Research Asia
The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 9-12, Singapore
S. Ye, J. Wen and W. Ma (UCD & MSRA)
PAKDD 2006
1 / 25
Outline
1. Motivation
2. Prior Work: Shingle Based Algorithms; Term Based Algorithms
3. Experiments: Data Description; Implementation Issues; Results; Parameter Correlations Summary
4. Adaptive Sampling Strategy: Adaptive Sampling; Experimental Results
1. Motivation
Duplicate pages and mirrored web sites are pervasive on the Web:
- More than 250 sites mirror the Linux Documentation Project.
- 10% of hosts are mirrored to various extents [Bharat & Broder, 1999].
- 5.5% of result entries for popular queries are duplicated across major search engines [Ye et al., 2004].
Detecting duplicated and nearly duplicated documents is important for crawling, ranking, clustering, archiving, caching, and more.
1. Motivation
DDD: Duplicate Document Detection
The tremendous volume of web pages challenges DDD algorithms.
Much work has been done on both DDD algorithms and their applications.
Little has been explored about the factors affecting DDD performance and scalability.
2.1 Shingle Based Algorithms

Definition: A shingle is a sequence of contiguous terms in a document.

1. Each document is divided into multiple (overlapping) shingles.
   "The ones we don't know we don't know" → {the ones we}, {ones we don't}, {we don't know}, {don't know we}, {know we don't}
2. A hash value is assigned to each shingle.
   {the ones we} → 1cb888794a0ed3d9e9989093d0e353b4
   {ones we don't} → 20dc9a35cd0cbbf895272744b100278a
   ...
3. The resemblance of two documents is then calculated from the number of shingles they share.
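Steps 1 and 2 can be sketched in a few lines of Python. This is an illustration, not the paper's implementation: it uses 3-term shingles to match the example above (the surveyed systems use 5- or 10-grams), and MD5 as the shingle hash.

```python
import hashlib

def shingles(text, k=3):
    """Split a document into overlapping k-term shingles (step 1)."""
    terms = text.lower().split()
    return [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]

def shingle_hashes(text, k=3):
    """Map each distinct shingle to its MD5 hash (step 2)."""
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles(text, k)}

doc = "The ones we don't know we don't know"
print(set(shingles(doc)))  # five distinct shingles: the last trigram repeats
```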
Document Similarity Metric [Broder et al., 1997]

Definition: The resemblance r of two documents A and B is defined as:

    r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|    (1)

where S(A) is the shingle set of A and |S| denotes the number of elements in the set S.
Pairwise comparison → O(N^2) for N documents.
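Equation (1) is the Jaccard coefficient of the two shingle sets; a minimal sketch:

```python
def resemblance(shingles_a, shingles_b):
    """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|  (Eq. 1)."""
    a, b = set(shingles_a), set(shingles_b)
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    return len(a & b) / len(a | b)

# 2 shared shingles out of 4 distinct shingles overall:
print(resemblance({"x y z", "y z w"}, {"x y z", "y z w", "z w v", "w v u"}))  # 0.5
```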
Clustering Algorithm: Beating N^2 Complexity

Merge sort over data larger than memory M → N log(N/M):
1. Get all the shingles for each document → kN
2. Sort the <shingle, docID> pairs → kN log(kN/M)
3. Get the list of documents sharing each shingle: expand, divide, sort, merge → kN log(kN/M)
4. Scan the list → ???
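An in-memory sketch of this sort-based clustering, under the assumption that a dict stands in for the external merge sort (the real systems sort on disk): group <shingle, doc> pairs by shingle, expand each group into document pairs, count shared shingles, and derive the resemblance via |A ∪ B| = |A| + |B| - shared.

```python
from collections import Counter
from itertools import combinations

def duplicate_pairs(doc_shingles, threshold=0.5):
    """doc_shingles: dict mapping docID -> set of shingles.
    Returns (docA, docB, resemblance) for pairs above the threshold."""
    # Steps 1-2: build and group <shingle, doc> pairs.
    by_shingle = {}
    for doc, sh in doc_shingles.items():
        for s in sh:
            by_shingle.setdefault(s, []).append(doc)
    # Step 3: expand each shingle's document list into document pairs.
    shared = Counter()
    for docs in by_shingle.values():
        for pair in combinations(sorted(docs), 2):
            shared[pair] += 1
    # Step 4: scan the counts and keep pairs above the threshold.
    result = []
    for (a, b), n in shared.items():
        r = n / (len(doc_shingles[a]) + len(doc_shingles[b]) - n)
        if r >= threshold:
            result.append((a, b, r))
    return result
```

Step 4 is the unbounded part: a popular shingle with d documents expands into d(d-1)/2 pairs, which is why its cost is left as "???" above.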
Parameters Used in Prior Work

Work                 Documents  Shingling Strategy                  Hash     Similarity Threshold
Broder97             30M        10-gram                             40-bit   resemblance 0.5
Shivakumar98, Cho00  24M, 25M   entire document, two or four lines  32-bit   25 or 15 shingles in common
Fetterly03           150M       5-gram                              64-bit   two supershingles in common

Sampling Ratio/Strategy:
- Broder97: 1/25, and at most 400 shingles per document
- Shivakumar98 & Cho00: line based shingling
- Fetterly03: 14 shingles per supershingle, six supershingles per document
No formal evaluation is provided for their parameter or tradeoff choices.
2.2 Term Based Algorithms
Use individual terms/words as the basic unit instead of contiguous k-gram shingles.
Many IR techniques apply, especially feature selection.
They work well for small-scale IR systems and online DDD, but are too complex for large-scale DDD.
We focus on shingle based, offline DDD algorithms in this paper.
3.1 Data Description

TREC .GOV Collection:
- HTML Documents: 1,053,034
- Total Size: 12.9 GB
- Average Document Size: 13.2 KB
- Average Words per Document: 699

Divided into 11 groups based on size:

Group  Words in Document  Number of Documents  Shingles in Group
0      0-500              651,983              118,247,397
1      500-1000           153,741              105,876,410
2      1000-2000          78,590               107,785,579
3      2000-3000          28,917               69,980,491
4      3000-4000          14,669               50,329,605
5      4000-5000          8,808                39,165,329
6      5000-6000          5,636                30,760,394
7      6000-7000          3,833                24,750,365
8      7000-8000          2,790                20,796,424
9      8000-9000          1,983                16,770,544
10     >9000              7,775                93,564,410
3.1 Data Description

[Figure: Document size distribution - long tailed.]
3.2 Implementation Issues
Hash Function
Previous work: 30-bit and 40-bit Rabin hashes → 0.5 probability of a collision within 2^20 (about one million) shingles. Our work: MD5 → 0.5 probability of a collision within 2^64 shingles.
No-sampling baseline → very time consuming:
- Remove exact duplicates first.
- External sort via BerkeleyDB.
- Two weeks to run 400 trials on two 3 GHz Xeon CPUs, 4 GB memory, and SCSI disks.
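The collision figures above follow from the birthday bound: for a b-bit hash, the probability of at least one collision reaches 0.5 at roughly sqrt(2 ln 2 * 2^b) ≈ 1.18 * 2^(b/2) random items. A quick check of both claims:

```python
import math

def collision_half_point(bits):
    """Approximate item count at which P(at least one collision) = 0.5
    for a uniform random hash of the given width (birthday bound)."""
    return math.sqrt(2 * math.log(2) * 2 ** bits)

print(f"40-bit hash:  {collision_half_point(40):.3e}")   # ~1.2 million, i.e. about 2^20
print(f"128-bit MD5: {collision_half_point(128):.3e}")   # about 2^64
```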
3.3 Results

[Figure: Precision vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/4.]
3.3 Results

[Figure: Precision vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/16.]
3.4 Parameter Correlations Summary
- Similarity Threshold: precision drops as the similarity threshold increases, especially beyond 0.9.
- Sampling Ratio: precision drops as the sampling ratio decreases, especially for small documents containing fewer than 500 words.
- Document Size: small documents are more sensitive to similarity threshold and sampling ratio than large documents.
- Recall: sampling does not hurt recall, because sampling only generates false positives.
4.1 Adaptive Sampling Strategy
Key Idea: apply a small sampling ratio to large documents and a large sampling ratio to small documents.

Long Tailed Distribution: in our data set, 68% of the documents have fewer than 500 words, yet contribute only 17% of the shingles. Applying a small sampling ratio to large documents greatly reduces the total number of shingles.

Experimental Result: with a required precision of 0.8 and similarity threshold 0.6, only 8% of the total shingles have to be processed.
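The key idea above can be sketched as a size-dependent sampling ratio. The breakpoints and ratios here are illustrative assumptions, not the values tuned in the paper, and the every-k-th selection stands in for whatever deterministic sampling the system uses:

```python
def adaptive_ratio(word_count):
    """Pick a per-document sampling ratio from its size: keep every
    shingle for small documents, sample large documents aggressively.
    Breakpoints and ratios are hypothetical, for illustration only."""
    if word_count < 500:
        return 1.0       # small documents: no sampling
    if word_count < 3000:
        return 1 / 4
    return 1 / 16        # large documents: aggressive sampling

def sample_shingles(shingles, word_count):
    """Keep roughly ratio * len(shingles) shingles by taking every
    (1/ratio)-th one."""
    step = round(1 / adaptive_ratio(word_count))
    return shingles[::step]
```

Because small documents keep all their shingles, their precision is protected, while the long tail of large documents (which contributes most of the shingle volume) is where the savings come from.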
4.2 Experimental Results

[Figure: Percentage of shingles processed vs. similarity threshold (0.5-1.0), for precision thresholds 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, and 0.99.]
Summary

- A large sampling ratio is required for high precision, especially when the required precision is higher than 0.9.
- A small sampling ratio hurts the precision of DDD.
- Small documents make up a major fraction of the whole Web.
- The adaptive sampling strategy greatly reduces the number of shingles to process, making DDD faster and more scalable on large document sets.
Questions?
Thank you!
Backup Slides
Recall with Different Similarity Thresholds

[Figure: Recall vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/16.]