A Systematic Study of Parameter Correlations in Large-Scale Duplicate Document Detection

Shaozhi Ye (1), Ji-Rong Wen (2), Wei-Ying Ma (2)
(1) Department of Computer Science, University of California, Davis
(2) Microsoft Research Asia
The 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 9-12, Singapore
S. Ye, J. Wen and W. Ma (UCD & MSRA)
PAKDD 2006
1 / 25
Outline
1. Motivation
2. Prior Work: Shingle Based Algorithms; Term Based Algorithms
3. Experiments: Data Description; Implementation Issues; Results; Parameter Correlations Summary
4. Adaptive Sampling Strategy: Adaptive Sampling; Experimental Results
1. Motivation
Duplicate pages and mirrored web sites are pervasive on the Web:
- More than 250 sites mirror the Linux Documentation Project.
- 10% of hosts are mirrored to various extents [Bharat & Broder, 1999].
- 5.5% of result entries for popular queries are duplicated across major search engines [Ye et al., 2004].
Detecting duplicated and nearly duplicated documents is important for crawling, ranking, clustering, archiving, caching, and more.
1. Motivation
DDD: Duplicate Document Detection
The tremendous volume of web pages challenges DDD algorithms.
Much work has been done on both DDD algorithms and their applications.
Little has been explored about the factors affecting DDD performance and scalability.
2.1 Shingle Based Algorithms

Definition: A shingle is a sequence of contiguous terms in a document.

1. Each document is divided into multiple (overlapping) shingles.
   "The ones we don't know we don't know" → {the ones we}, {ones we don't}, {we don't know}, {don't know we}, {know we don't}
2. A hash value is assigned to each shingle.
   {the ones we} → 1cb888794a0ed3d9e9989093d0e353b4
   {ones we don't} → 20dc9a35cd0cbbf895272744b100278a
   ...
3. The resemblance of two documents is then calculated from the number of shingles they share.
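Steps 1 and 2 can be sketched in a few lines of Python. This is an illustration, not the paper's implementation: it uses 3-term shingles to match the example above (the surveyed systems use 5- or 10-grams), and MD5 as the shingle hash.

```python
import hashlib

def shingles(text, k=3):
    """Split a document into overlapping k-term shingles (step 1)."""
    terms = text.lower().split()
    return [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]

def shingle_hashes(text, k=3):
    """Map each distinct shingle to its MD5 hash (step 2)."""
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles(text, k)}

doc = "The ones we don't know we don't know"
print(set(shingles(doc)))  # five distinct shingles: the last trigram repeats
```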
Document Similarity Metric [Broder et al., 1997]

Definition: The resemblance r of two documents A and B is defined as:

    r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|    (1)

where S(A) is the shingle set of A and |S| denotes the number of elements in the set S.
Pairwise comparison → O(N^2) for N documents.
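Equation (1) is the Jaccard coefficient of the two shingle sets; a minimal sketch:

```python
def resemblance(shingles_a, shingles_b):
    """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|  (Eq. 1)."""
    a, b = set(shingles_a), set(shingles_b)
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    return len(a & b) / len(a | b)

# 2 shared shingles out of 4 distinct shingles overall:
print(resemblance({"x y z", "y z w"}, {"x y z", "y z w", "z w v", "w v u"}))  # 0.5
```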
Clustering Algorithm: Beating N^2 Complexity

Merge sort over data larger than memory M → N log(N/M):
1. Get all the shingles for each document → kN
2. Sort the <shingle, docID> pairs → kN log(kN/M)
3. Get the list of documents sharing each shingle: expand, divide, sort, merge → kN log(kN/M)
4. Scan the list → ???
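An in-memory sketch of this sort-based clustering, under the assumption that a dict stands in for the external merge sort (the real systems sort on disk): group <shingle, doc> pairs by shingle, expand each group into document pairs, count shared shingles, and derive the resemblance via |A ∪ B| = |A| + |B| - shared.

```python
from collections import Counter
from itertools import combinations

def duplicate_pairs(doc_shingles, threshold=0.5):
    """doc_shingles: dict mapping docID -> set of shingles.
    Returns (docA, docB, resemblance) for pairs above the threshold."""
    # Steps 1-2: build and group <shingle, doc> pairs.
    by_shingle = {}
    for doc, sh in doc_shingles.items():
        for s in sh:
            by_shingle.setdefault(s, []).append(doc)
    # Step 3: expand each shingle's document list into document pairs.
    shared = Counter()
    for docs in by_shingle.values():
        for pair in combinations(sorted(docs), 2):
            shared[pair] += 1
    # Step 4: scan the counts and keep pairs above the threshold.
    result = []
    for (a, b), n in shared.items():
        r = n / (len(doc_shingles[a]) + len(doc_shingles[b]) - n)
        if r >= threshold:
            result.append((a, b, r))
    return result
```

Step 4 is the unbounded part: a popular shingle with d documents expands into d(d-1)/2 pairs, which is why its cost is left as "???" above.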
Parameters Used in Prior Work

Work                 Documents  Shingling Strategy                  Hash     Similarity Threshold
Broder97             30M        10-gram                             40-bit   resemblance 0.5
Shivakumar98, Cho00  24M, 25M   entire document, two or four lines  32-bit   25 or 15 shingles in common
Fetterly03           150M       5-gram                              64-bit   two supershingles in common

Sampling Ratio/Strategy:
- Broder97: 1/25, and at most 400 shingles per document
- Shivakumar98 & Cho00: line based shingling
- Fetterly03: 14 shingles per supershingle, six supershingles per document
No formal evaluation is provided for their parameter or tradeoff choices.
2.2 Term Based Algorithms
Use individual terms/words as the basic unit instead of contiguous k-gram shingles.
Many IR techniques apply, especially feature selection.
They work well for small-scale IR systems and online DDD, but are too complex for large-scale DDD.
We focus on shingle based, offline DDD algorithms in this paper.
3.1 Data Description

TREC .GOV Collection:
- HTML Documents: 1,053,034
- Total Size: 12.9 GB
- Average Document Size: 13.2 KB
- Average Words per Document: 699

Divided into 11 groups based on size:

Group  Words in Document  Number of Documents  Shingles in Group
0      0-500              651,983              118,247,397
1      500-1000           153,741              105,876,410
2      1000-2000          78,590               107,785,579
3      2000-3000          28,917               69,980,491
4      3000-4000          14,669               50,329,605
5      4000-5000          8,808                39,165,329
6      5000-6000          5,636                30,760,394
7      6000-7000          3,833                24,750,365
8      7000-8000          2,790                20,796,424
9      8000-9000          1,983                16,770,544
10     >9000              7,775                93,564,410
3.1 Data Description

[Figure: Document size distribution - long tailed.]
3.2 Implementation Issues
Hash Function
Previous work: 30-bit and 40-bit Rabin hashes → 0.5 probability of a collision within 2^20 (about one million) shingles. Our work: MD5 → 0.5 probability of a collision within 2^64 shingles.
No-sampling baseline → very time consuming:
- Remove exact duplicates first.
- External sort via BerkeleyDB.
- Two weeks to run 400 trials on two 3 GHz Xeon CPUs, 4 GB memory, and SCSI disks.
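The collision figures above follow from the birthday bound: for a b-bit hash, the probability of at least one collision reaches 0.5 at roughly sqrt(2 ln 2 * 2^b) ≈ 1.18 * 2^(b/2) random items. A quick check of both claims:

```python
import math

def collision_half_point(bits):
    """Approximate item count at which P(at least one collision) = 0.5
    for a uniform random hash of the given width (birthday bound)."""
    return math.sqrt(2 * math.log(2) * 2 ** bits)

print(f"40-bit hash:  {collision_half_point(40):.3e}")   # ~1.2 million, i.e. about 2^20
print(f"128-bit MD5: {collision_half_point(128):.3e}")   # about 2^64
```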
3.3 Results

[Figure: Precision vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/4.]
3.3 Results

[Figure: Precision vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/16.]
3.4 Parameter Correlations Summary
- Similarity Threshold: precision drops as the similarity threshold increases, especially beyond 0.9.
- Sampling Ratio: precision drops as the sampling ratio decreases, especially for small documents containing fewer than 500 words.
- Document Size: small documents are more sensitive to similarity threshold and sampling ratio than large documents.
- Recall: sampling does not hurt recall, because sampling only generates false positives.
4.1 Adaptive Sampling Strategy
Key Idea: apply a small sampling ratio to large documents and a large sampling ratio to small documents.

Long Tailed Distribution: in our data set, 68% of the documents have fewer than 500 words, yet contribute only 17% of the shingles. Applying a small sampling ratio to large documents greatly reduces the total number of shingles.

Experimental Result: with a required precision of 0.8 and similarity threshold 0.6, only 8% of the total shingles have to be processed.
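The key idea above can be sketched as a size-dependent sampling ratio. The breakpoints and ratios here are illustrative assumptions, not the values tuned in the paper, and the every-k-th selection stands in for whatever deterministic sampling the system uses:

```python
def adaptive_ratio(word_count):
    """Pick a per-document sampling ratio from its size: keep every
    shingle for small documents, sample large documents aggressively.
    Breakpoints and ratios are hypothetical, for illustration only."""
    if word_count < 500:
        return 1.0       # small documents: no sampling
    if word_count < 3000:
        return 1 / 4
    return 1 / 16        # large documents: aggressive sampling

def sample_shingles(shingles, word_count):
    """Keep roughly ratio * len(shingles) shingles by taking every
    (1/ratio)-th one."""
    step = round(1 / adaptive_ratio(word_count))
    return shingles[::step]
```

Because small documents keep all their shingles, their precision is protected, while the long tail of large documents (which contributes most of the shingle volume) is where the savings come from.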
4.2 Experimental Results

[Figure: Percentage of shingles processed vs. similarity threshold (0.5-1.0), for precision thresholds 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, and 0.99.]
Summary

- A large sampling ratio is required for high precision, especially when the required precision is higher than 0.9.
- A small sampling ratio hurts the precision of DDD.
- Small documents make up a major fraction of the whole Web.
- The adaptive sampling strategy greatly reduces the number of shingles to process, making DDD faster and more scalable on large document sets.
Questions?
Thank you!
Backup Slides
Recall with Different Similarity Thresholds

[Figure: Recall vs. similarity threshold (0.5-1.0) for each document-size group, at sampling ratio 1/16.]