A Large-scale N-gram Search System Using Inverted Files
Presented by Susumu Yata*, Kazuhiro Morita, Masao Fuketa and Jun-ichi Aoe
University of Tokushima
*JSPS Research Fellow
Introduction
Target
Search System for a Large-scale N-gram Corpus
Simpler and more lightweight than Full Trie Index
Methods
Inverted Index
No use…
Inverted Database
Customized inverted index for n-gram searches
Result
Usable system (on 64-bit Linux)
http://code.google.com/p/ssgnc/
28 July 2009
Large-scale N-gram Corpus
Available corpora
Word n-grams on public web pages
Web 1T 5-gram Version 1 [Brants and Franz 2006]
Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]
Contributors
Google Inc., somebodies and nobodies
Scales
Around 100GB from tens of billions of sentences
How large?
It takes 17 minutes to scan 100GB at 100MB/s.
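The 17-minute figure follows from dividing the corpus size by the sequential read rate. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and ignoring seek overhead; the function name is illustrative only:

```python
def scan_seconds(corpus_bytes: int, bytes_per_second: int) -> float:
    """Time for one sequential pass over the corpus, ignoring seek overhead."""
    return corpus_bytes / bytes_per_second


GB = 10**9  # assumption: decimal gigabyte
MB = 10**6  # assumption: decimal megabyte

seconds = scan_seconds(100 * GB, 100 * MB)
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.1f} min")  # 1000 s = 16.7 min
```

Every query answered by a full scan pays this cost, which is why an index is needed.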
Example of N-grams
Word n-grams with frequencies
English 3-grams
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
English 4-grams
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
Applications of N-gram Corpus
Research
Statistical machine translation
[Finch et al. 2007, Federico and Cettolo 2007, Zheng et al. 2008]
Speech recognition
[Choueiter et al. 2007]
Spelling correction
[Carlson and Fette 2007]
Text mining
[Bethard and Martin 2008]
Estimation of document frequencies
[Klein and Nelson 2009]
Obvious Problem
Too large…
Processing time of “grep” is linear in the corpus size.
We cannot wait for a response…
[Figure: processing time [s] (0.1 to 1000) versus corpus size (100MB to 100GB)]
Trie Index for N-gram Corpus
We need an index.
Full trie index [Sekine 2008]
It supports a query containing wildcards “*”.
Query “serve as the *” returns “serve as the incoming 92”, “serve as the incubator 99”
It responds to a reasonable query in a second.
It works on a modern PC.
4GB RAM and 490GB disk space for Japanese 7-grams
It takes a lot of resources to build an index.
Several months using 4GB RAM
Unacceptable
How about Inverted Index?
Well-known index data structure
Advantages
Easy to build: Less than 12 hours
Compact: Less than 200GB
Disadvantages
Bad search performance: Random access to disk
Uh…
We tested an inverted index on an n-gram corpus.
It works with some techniques.
INVERTED INDEX INVERTED DATABASE EVALUATION
INVERTED INDEX
Inverted Index
Mapping Terms to Documents
The two parts of an inverted index [Manning et al. 2008]
Dictionary: List of terms to be indexed.
Posting: List of documents containing a specific term.
Dictionary    Postings (a list of documents)
Boss          4 6 8 9 10 11 13 15 …
Wally         12 17 29 31 45 46 49 …
Alice         11 13 24 55 61 70 75
In practice, differences are stored as variable byte codes.
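Storing differences as variable byte codes can be sketched as follows, using the Alice posting list from the slide. This is a minimal illustration, not the system's on-disk format: the byte layout (7 payload bits per byte, high bit set on the last byte of each value) is an assumption, and all function names are hypothetical.

```python
def to_gaps(sorted_ids):
    """Replace each sorted ID by its difference from the previous ID."""
    gaps, prev = [], 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the differences."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_encode(values):
    """Encode non-negative integers with 7 payload bits per byte."""
    out = bytearray()
    for value in values:
        chunk = [value & 0x7F]
        value >>= 7
        while value:
            chunk.append(value & 0x7F)
            value >>= 7
        chunk.reverse()      # most significant 7-bit group first
        chunk[-1] |= 0x80    # assumption: high bit marks the last byte of a value
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:
            values.append(current)
            current = 0
    return values


alice = [11, 13, 24, 55, 61, 70, 75]
gaps = to_gaps(alice)
encoded = vbyte_encode(gaps)
print(gaps)          # [11, 2, 11, 31, 6, 9, 5]
print(len(encoded))  # 7
assert from_gaps(vbyte_decode(encoded)) == alice
```

Because the differences are small, every entry in this list fits in a single byte.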
Document Search System
Examples
“Wally”
Documents listed in a posting list.
Wally         12 17 29 31 45 46 49 …
“Boss AND Alice”
Documents listed in both posting lists.
Boss          4 6 8 9 10 11 13 15 …
Alice         11 13 24 55 61 70 75
To be merged…
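The merge for “Boss AND Alice” can be sketched as a linear walk over two sorted posting lists. This is a minimal illustration with a hypothetical function name; the Boss list is truncated on the slide, so only its visible prefix is used here.

```python
def intersect(a, b):
    """Merge two sorted posting lists, keeping the IDs present in both."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


boss = [4, 6, 8, 9, 10, 11, 13, 15]    # visible prefix only
alice = [11, 13, 24, 55, 61, 70, 75]
print(intersect(boss, alice))  # [11, 13]
```

The walk touches each list once, so its cost grows with the combined length of the lists being merged.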
N-gram Search System
Mapping Unigrams to N-grams
The two parts of an inverted index
Dictionary: List of unigrams to be indexed.
Posting: List of n-grams containing a specific unigram.
Dictionary    Postings (a list of n-grams)
Boss          4 6 8 9 10 11 13 15 …
Wally         12 17 29 31 45 46 49 …
Alice         11 13 24 55 61 70 75
Unigrams and n-grams replace terms and documents.
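Replacing terms and documents by unigrams and n-grams can be sketched as below, using the English 3-grams shown earlier in the slides. This is a minimal in-memory illustration: the n-gram IDs are simply list positions, which is an assumption, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

# (n-gram text, frequency) records; the list position serves as the n-gram ID.
ngrams = [
    ("ceramics collectables collectibles", 55),
    ("ceramics collectables fine", 130),
    ("ceramics collected by", 52),
    ("ceramics collectible pottery", 50),
]


def build_index(records):
    """Map each unigram to the sorted IDs of the n-grams that contain it."""
    index = defaultdict(list)
    for ngram_id, (text, _freq) in enumerate(records):
        for unigram in set(text.split()):
            index[unigram].append(ngram_id)
    return dict(index)


index = build_index(ngrams)
print(index["collectables"])  # [0, 1]
print(index["ceramics"])      # [0, 1, 2, 3]

# Resolving the IDs back to n-grams is a separate lookup step.
print([ngrams[i] for i in index["collectables"]])
```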
System Component (A)
There are 3 steps and 3 databases.
1. Dictionary lookup (Unigram dictionary): Query to Unigram IDs
2. Index lookup (Inverted index): Unigram IDs to N-gram IDs
3. N-gram lookup (N-gram database): N-gram IDs to N-grams
Drawback of Inverted Index
N-grams are distributed on disk.
[Figure: N-gram lookup reads scattered “Read me” entries from the N-gram database]
Features of disk access
Seek time
We have to wait for a disk to seek.
Block access
We have to read a block even for a bit.
INVERTED DATABASE
Overview
N-gram compression
Reduction of disk access
N-gram                       there   are   any   issues   18724
Using unigram IDs            126     35    96    759      18724
Using variable byte codes    7E      23    60    5 77     1 12 24
Inverted database
Random Access to Sequential Access
Inverted index: there, 1131 1262 1873 (IDs to n-grams)
Inverted database: there, “There is a 15975737”, “There are no 11399502”, …, “There is no 14837037”, …
N-gram Compression
Compressing n-grams to reduce disk access
Unigram IDs
Allocate IDs to unigrams in frequency order.
Replace unigrams in n-grams by IDs.
ID   Unigram    ID   Unigram    ID   Unigram    ID   Unigram
0               2    ,          4               6    -
1               3    .          5    the        7    of
Variable byte codes
Allocate shorter codes to frequent unigrams.
Allocate shorter codes to low frequencies.
1 byte: 2^0 to 2^7
2 byte: 2^7 to 2^14
3 byte: 2^14 to 2^21
4 byte: 2^21 to 2^28
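The two ideas on this slide, frequency-ordered IDs and shorter codes for smaller numbers, can be sketched together. This is a minimal illustration: the unigram counts are made up, both function names are hypothetical, and the code length assumes 7 payload bits per byte. The values 126, 759 and 18724 are the ones shown on the Overview slide.

```python
def assign_ids(unigram_counts):
    """Give ID 0 to the most frequent unigram, ID 1 to the next, and so on."""
    ranked = sorted(unigram_counts, key=lambda u: (-unigram_counts[u], u))
    return {unigram: i for i, unigram in enumerate(ranked)}


def vbyte_length(value):
    """Number of bytes needed when each byte carries 7 payload bits."""
    length = 1
    while value >= 1 << 7:
        value >>= 7
        length += 1
    return length


# Hypothetical counts, for illustration only.
counts = {"the": 900, "of": 500, "ceramics": 3, "incubator": 1}
ids = assign_ids(counts)
print(ids)  # {'the': 0, 'of': 1, 'ceramics': 2, 'incubator': 3}

for value in (126, 759, 18724):
    print(value, vbyte_length(value))  # 1, 2 and 3 bytes respectively
```

Frequent unigrams receive small IDs, and small IDs need fewer bytes, so the most common words cost one byte each inside an encoded n-gram.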
System Component (B)
There are 4 steps and 3 databases.
1. Dictionary lookup (Unigram dictionary): Query to Unigram IDs
2. Index lookup (Inverted index): Unigram IDs to N-gram IDs
3. N-gram lookup (N-gram database): N-gram IDs to Encoded n-grams
4. N-gram decode: Encoded n-grams to N-grams
Inverted Database
Copying n-grams for sequential disk access
N-gram lists
Use n-gram lists instead of postings.
At most n copies for each n-gram
Before: Unigram, ID ID ID …
After: Unigram, N-gram N-gram N-gram …
Compress n-grams.
On the whole, Inverted Database is larger than the total of n-grams and Inverted Index.
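Copying n-grams into per-unigram lists can be sketched as below, using the English 4-grams shown earlier in the slides. This is a minimal in-memory illustration with a hypothetical function name; records are kept as plain tuples rather than compressed, which is a simplification.

```python
from collections import defaultdict


def build_inverted_database(records):
    """Copy every (n-gram, frequency) record into the list of each distinct unigram in it."""
    database = defaultdict(list)
    for text, freq in records:
        for unigram in set(text.split()):
            database[unigram].append((text, freq))
    return dict(database)


ngrams = [
    ("serve as the incoming", 92),
    ("serve as the incubator", 99),
    ("serve as the independent", 794),
    ("serve as the index", 223),
]

db = build_inverted_database(ngrams)
copies = sum(len(records) for records in db.values())
print(len(db["serve"]), len(db["index"]), copies)  # 4 1 16
```

Each 4-gram here has four distinct unigrams, so it is stored four times: 16 records for 4 n-grams, which is the "at most n copies" cost stated on the slide.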
Features of Inverted Database
Advantages
Disk access
N-grams are sequentially accessed.
No intersection
There is no need for merging posting lists.
1. Choose the shortest n-gram list (the rarest unigram in the query).
2. Filter the n-gram list by using a given query.
3. Return remaining n-grams.
Disadvantages
Disk usage
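The three search steps can be sketched as follows. This is a minimal illustration under stated assumptions: the query is treated as an unordered set of unigrams that must all appear (word positions and wildcards are not modelled), the database is a small hand-written fragment, and `search` is a hypothetical name.

```python
def search(database, query_unigrams):
    """Return the records that contain every query unigram, scanning one list only."""
    if not query_unigrams:
        return []
    lists = [database.get(unigram, []) for unigram in query_unigrams]
    shortest = min(lists, key=len)             # step 1: shortest n-gram list
    wanted = set(query_unigrams)
    return [(text, freq)                       # step 3: return what remains
            for text, freq in shortest
            if wanted <= set(text.split())]    # step 2: filter by the query


# Fragment of an inverted database: unigram -> list of (n-gram, frequency).
database = {
    "serve": [
        ("serve as the incoming", 92),
        ("serve as the incubator", 99),
        ("serve as the independent", 794),
        ("serve as the index", 223),
    ],
    "incubator": [("serve as the incubator", 99)],
    "index": [("serve as the index", 223)],
}

print(search(database, ["serve", "incubator"]))  # [('serve as the incubator', 99)]
```

Only the list for "incubator" is read; the longer list for "serve" is never touched, which is why no intersection is needed.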
System Component (C)
There are 3 steps and 2 databases.
1. Dictionary lookup (Unigram dictionary): Query to Unigram IDs
2. Database lookup (Inverted database, sequential): Unigram IDs to Encoded n-grams
3. N-gram decode: Encoded n-grams to N-grams
EVALUATION
Experimental Setup
Environment
Hardware
CPU: Core 2 Quad 2.83 GHz
RAM: 8 GB (1 GB required)
Disks: 1.5 TB HDD x 2 (RAID 0) and 256 GB SSD
Software
OS: 64-bit Ubuntu 9.04 Desktop
Search systems: Implemented in C++
http://code.google.com/p/ssgnc/
Corpus
107 GB (plain text)
Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]
Experimental Results (Disk)
Methods
(A) Uncompressed n-grams + inverted index
(B) Compressed n-grams + inverted index (both on HDD and on SSD)
(C) Inverted database
Less than 12 hours
Table 1. Disk usage of n-gram search systems (MB)
       Term Dic   Inv. Idx   N-gram DB   Inv. DB     Total
(A)          77     29,758     109,221         0   139,056
(B)         148     26,969      29,034         0    56,151
(C)         148          0           0   150,720   150,720
Dic = Dictionary, Inv. = Inverted, DB = Database
Experimental Results (Time)
Table 2. Average and median of search time (s)
Query        1-gram          3-gram          5-gram
             Ave     Med     Ave     Med     Ave     Med
(A)          5.016   0.268   3.855   0.983   2.394   0.830
(B) HDD      2.467   0.179   3.272   0.868   2.156   0.761
(B′) SSD     0.547   0.012   0.699   0.148   0.577   0.162
(C)          0.040   0.029   0.167   0.094   0.219   0.097
Performance evaluation
(B) N-gram compression shows a slight improvement.
(B′) SSD shows good performance.
(C) Inverted database shows the best performance.
Experimental Results (Details)
[Figure: search time (s), 0.01 to 100, versus number of matches (N), 1 to 1,000, for (A), (B), (B′) and (C)]
Figure 1. Search time for 3-gram queries
Performance evaluation
For a small N, (B′) is better than (C).
For a large N, (C) is better than (B′).
CONCLUSION
Conclusion
Search systems for large-scale n-grams
Full Trie Index
Requires too many resources for indexing.
Inverted Index
Requires inefficient disk access in searching n-grams.
Inverted Database
Provides a well-balanced search system.
Future study
Character n-grams.
Queries containing classes and attributes.