A Large-scale N-gram Search System Using Inverted Files

Presented by Susumu Yata*, Kazuhiro Morita, Masao Fuketa and Jun-ichi Aoe
University of Tokushima
*JSPS Research Fellow

Introduction

Target
  Search System for a Large-scale N-gram Corpus
    Simpler and more lightweight than Full Trie Index

Methods
  Inverted Index
    No use…
  Inverted Database
    Customized inverted index for n-gram searches

Result
  Usable system (on 64-bit Linux)
    http://code.google.com/p/ssgnc/

28 July 2009


Large-scale N-gram Corpus

Available corpora
  Word n-grams on public web pages
    Web 1T 5-gram Version 1 [Brants and Franz 2006]
    Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]
  Contributors
    Google Inc., somebodies and nobodies
  Scales
    Around 100GB from tens of billions of sentences

How large?
  It takes 17 minutes to scan 100GB at 100MB/s.
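The 17-minute figure is corpus size divided by read throughput. A quick check, assuming decimal units (1 GB = 1000 MB); the function name is only illustrative:

```python
def scan_minutes(corpus_gb: float, throughput_mb_per_s: float) -> float:
    """Minutes needed to read the whole corpus once at a constant rate."""
    corpus_mb = corpus_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return corpus_mb / throughput_mb_per_s / 60


print(round(scan_minutes(100, 100), 1))  # 16.7, i.e. about 17 minutes
```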


Example of N-grams

Word n-grams with frequencies
  English 3-grams
    ceramics collectables collectibles 55
    ceramics collectables fine 130
    ceramics collected by 52
    ceramics collectible pottery 50
  English 4-grams
    serve as the incoming 92
    serve as the incubator 99
    serve as the independent 794
    serve as the index 223


Applications of N-gram Corpus

Research
  Statistical machine translation
    [Finch et al. 2007, Federico and Cettolo 2007, Zheng et al. 2008]
  Speech recognition
    [Choueiter et al. 2007]
  Spelling correction
    [Carlson and Fette 2007]
  Text mining
    [Bethard and Martin 2008]
  Estimation of document frequencies
    [Klein and Nelson 2009]


Obvious Problem

Too large…
  Processing time of "grep" is linear in the corpus size.
  We cannot wait for a response…

[Figure: processing time [s] (0.1 to 1000, log scale) versus corpus size (100MB to 100GB)]


Trie Index for N-gram Corpus

We need an index.
  Full trie index [Sekine 2008]
    It supports a query containing wildcards "*".
      "serve as the *" returns "serve as the incoming 92", "serve as the incubator 99"
    It responds to a reasonable query in a second.
    It works on a modern PC.
      4GB RAM and 490GB disk space for Japanese 7-grams
    It takes a lot of resources to build an index.
      Several months using 4GB RAM
      Unacceptable


How about Inverted Index?

Well-known index data structure
  Advantages
    Easy to build
      Less than 12 hours
    Compact
      Less than 200GB
  Disadvantages
    Bad search performance
      Random access to disk

Uh…
  We tested an inverted index on an n-gram corpus.
    It works with some techniques.


INVERTED INDEX INVERTED DATABASE EVALUATION

INVERTED INDEX


Inverted Index

Mapping Terms to Documents
  The two parts of an inverted index [Manning et al. 2008]
    Dictionary: List of terms to be indexed.
    Posting: List of documents containing a specific term.

  Dictionary    Postings (a list of documents)
  Boss          4 6 8 9 10 11 13 15 …
  Wally         12 17 29 31 45 46 49 …
  Alice         11 13 24 55 61 70 75

  In practice, the differences between consecutive document IDs are stored as variable byte codes.
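Storing differences as variable byte codes can be sketched as follows. This is a minimal illustration and not the ssgnc format: the byte layout (7 payload bits per byte, most significant group first, high bit set on every byte except the last) is an assumption, and the function names are made up for this example.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # most significant group first
    # Assumption: the high bit flags "more bytes follow".
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def encode_postings(doc_ids):
    """Store each ID as the difference from the previous ID in the sorted list."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data: bytes):
    """Rebuild the ID list by summing the decoded differences."""
    ids, value, prev = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # last byte of this number
            prev += value
            ids.append(prev)
            value = 0
    return ids


wally = [12, 17, 29, 31, 45, 46, 49]  # visible part of the "Wally" posting list
encoded = encode_postings(wally)
print(len(encoded))                   # 7: every difference fits in one byte
print(decode_postings(encoded) == wally)
```

Because posting lists are sorted, the differences are small even when the IDs themselves are large, which is what makes the short codes pay off.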


Document Search System

Examples
  "Wally"
    Documents listed in a posting list.
    Wally         12 17 29 31 45 46 49 …
  "Boss AND Alice"
    Documents listed in both posting lists.
    Boss          4 6 8 9 10 11 13 15 …
    Alice         11 13 24 55 61 70 75
    To be merged…
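Merging two sorted posting lists for an AND query can be sketched as a linear two-pointer walk. The lists below contain only the IDs visible on the slide (the "Boss" list continues beyond 15), and `intersect` is an illustrative name rather than anything taken from the system itself.

```python
def intersect(a, b):
    """Return the IDs present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return out


boss = [4, 6, 8, 9, 10, 11, 13, 15]
alice = [11, 13, 24, 55, 61, 70, 75]
print(intersect(boss, alice))  # [11, 13]
```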


N-gram Search System

Mapping Unigrams to N-grams
  The two parts of an inverted index
    Dictionary: List of unigrams to be indexed.
    Posting: List of n-grams containing a specific unigram.

  Dictionary    Postings (a list of n-grams)
  Boss          4 6 8 9 10 11 13 15 …
  Wally         12 17 29 31 45 46 49 …
  Alice         11 13 24 55 61 70 75

  Unigrams and n-grams replace terms and documents.
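With unigrams as terms and n-grams as documents, building the index might look like the sketch below. It assumes an n-gram's ID is simply its position in the input list and keeps everything in memory; neither is claimed to match the real system.

```python
from collections import defaultdict


def build_index(ngrams):
    """Map each unigram to the sorted IDs of the n-grams that contain it."""
    index = defaultdict(list)
    for ngram_id, ngram in enumerate(ngrams):
        for unigram in set(ngram.split()):  # a repeated word is indexed once
            index[unigram].append(ngram_id)
    return dict(index)


ngrams = [
    "serve as the incoming",
    "serve as the incubator",
    "serve as the independent",
    "serve as the index",
]
index = build_index(ngrams)
print(index["incubator"])  # [1]
print(index["serve"])      # [0, 1, 2, 3]
```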


System Component (A)

There are 3 steps and 3 databases.
  Query
    -> Dictionary lookup (Unigram dictionary) -> Unigram IDs
    -> Index lookup (Inverted index) -> N-gram IDs
    -> N-gram lookup (N-gram database) -> N-grams


Drawback of Inverted Index

N-grams are distributed on disk.
  [Figure: N-gram lookup reads four separate "Read me" locations scattered across the N-gram database]

Features of disk access
  Seek time
    We have to wait for a disk to seek.
  Block access
    We have to read a block even for a bit.
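A rough cost model shows why seek time and block access dominate. Apart from the 100 MB/s scan rate quoted earlier in the talk, every number here (10 ms seek, 4 KiB block, 32-byte record) is an assumption chosen only for illustration, not a measurement.

```python
import math

SEEK_S = 0.010        # assumption: about 10 ms per random seek on an HDD
BLOCK_BYTES = 4096    # assumption: 4 KiB disk block
THROUGHPUT = 100e6    # 100 MB/s, the scan rate quoted earlier in the talk


def random_read_seconds(n_records: int) -> float:
    """Each record costs one seek plus one whole block, however small it is."""
    return n_records * (SEEK_S + BLOCK_BYTES / THROUGHPUT)


def sequential_read_seconds(n_records: int, record_bytes: int) -> float:
    """One seek, then the records are streamed block by block."""
    blocks = math.ceil(n_records * record_bytes / BLOCK_BYTES)
    return SEEK_S + blocks * BLOCK_BYTES / THROUGHPUT


print(random_read_seconds(1000))          # roughly 10 s
print(sequential_read_seconds(1000, 32))  # roughly 0.01 s
```

Under these assumptions, fetching 1,000 scattered records is about a thousand times slower than reading the same records from one contiguous run.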


INVERTED DATABASE


Overview

N-gram compression
  Reduction of disk access

  there are any issues 759
    -> Using unigram IDs
  126 35 96 18724 759
    -> Using variable byte codes
  7E 23 60 | 1 12 24 | 5 77

Inverted database
  Random Access to Sequential Access

  there -> 1131 1262 1873
    -> IDs to n-grams
  there -> There is a 15975737
           There are no 11399502
           There is no 14837037
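The hexadecimal values in the compression example are the 7-bit groups of each number. The sketch below reproduces them for the n-gram "there are any issues" with frequency 759, assuming the unigram IDs 126, 35, 96 and 18724 read off the slide; the helper name is illustrative, and the continuation flag a real byte code needs is left out because the slide does not show it.

```python
def seven_bit_groups(value: int) -> list[int]:
    """Split a non-negative integer into 7-bit groups, most significant first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    return groups[::-1]


# Unigram IDs for "there are any issues", followed by the frequency 759.
record = [126, 35, 96, 18724, 759]
for value in record:
    print(value, " ".join(f"{g:X}" for g in seven_bit_groups(value)))
# 126 7E
# 35 23
# 96 60
# 18724 1 12 24
# 759 5 77

total_bytes = sum(len(seven_bit_groups(v)) for v in record)
print(total_bytes)  # 8
```

The record takes 8 bytes, against 24 characters for the plain-text form with a one-character separator before the frequency.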




N-gram Compression

Compressing n-grams for reducing disk access
  Unigram IDs
    Allocate IDs to unigrams in frequency order.
    Replace unigrams in n-grams by IDs.

    ID  Unigram
    0
    1
    2   ,
    3   .
    4
    5   the
    6   -
    7   of

  Variable byte codes
    Allocate shorter codes to frequent unigrams.
    Allocate shorter codes to low frequencies.

    1 byte:  2^0 to 2^7
    2 bytes: 2^7 to 2^14
    3 bytes: 2^14 to 2^21
    4 bytes: 2^21 to 2^28
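Frequency-ordered IDs and the resulting code lengths can be sketched as below. The counts are made up for illustration, ties are broken alphabetically as an arbitrary choice, and the length formula assumes 7 payload bits per byte.

```python
def assign_ids(unigram_counts):
    """Give ID 0 to the most frequent unigram, ID 1 to the next, and so on."""
    ranked = sorted(unigram_counts, key=lambda u: (-unigram_counts[u], u))
    return {unigram: i for i, unigram in enumerate(ranked)}


def vbyte_length(value: int) -> int:
    """Bytes needed when each byte carries 7 bits of payload."""
    return max(1, (value.bit_length() + 6) // 7)


counts = {"the": 900, "of": 500, "serve": 20, "incubator": 1}  # made-up counts
ids = assign_ids(counts)
print(ids)  # {'the': 0, 'of': 1, 'serve': 2, 'incubator': 3}

samples = (0, 127, 128, 2**14 - 1, 2**14, 2**21, 2**28 - 1)
print([vbyte_length(v) for v in samples])  # [1, 1, 2, 2, 3, 4, 4]
```

Frequent unigrams get small IDs and so take one byte each, while only rare unigrams and very large frequencies need three or four bytes.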

System Component (B)

There are 4 steps and 3 databases.
  Query
    -> Dictionary lookup (Unigram dictionary) -> Unigram IDs
    -> Index lookup (Inverted index) -> N-gram IDs
    -> N-gram lookup (N-gram database) -> Encoded n-grams
    -> N-gram decode -> N-grams


Inverted Database

Copying n-grams for sequential disk access
  N-gram lists
    Use n-gram lists instead of postings.
      At most n copies for each n-gram
    Compress n-grams.

  Before: Unigram -> ID, ID, ID, …
  After:  Unigram -> N-gram, N-gram, N-gram, …

  On the whole, Inverted Database is larger than the total of n-grams and Inverted Index.
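Replacing postings with n-gram lists means copying each n-gram into the list of every unigram it contains. A minimal in-memory sketch follows; the real system stores compressed records on disk, and the names and tuple layout here are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_database(ngrams):
    """ngrams: list of (tuple_of_unigrams, frequency) pairs.

    Each n-gram is copied into the list of every distinct unigram it
    contains, so an n-gram is stored at most n times.
    """
    lists = defaultdict(list)
    for words, freq in ngrams:
        for unigram in set(words):
            lists[unigram].append((words, freq))
    return dict(lists)


corpus = [
    (("ceramics", "collected", "by"), 52),
    (("ceramics", "collectible", "pottery"), 50),
]
db = build_inverted_database(corpus)
print(len(db["ceramics"]))  # 2
print(db["pottery"])        # [(('ceramics', 'collectible', 'pottery'), 50)]

copies = sum(len(entries) for entries in db.values())
print(copies)               # 6 copies for two 3-grams
```

The copy count is what drives the disk usage: two 3-grams become six stored records.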


Features of Inverted Database

Advantages
  Disk access
    N-grams are sequentially accessed.
  No intersection
    There is no need for merging posting lists.
      1. Choose the shortest n-gram list (the rarest unigram in the query).
      2. Filter the n-gram list by using a given query.
      3. Return remaining n-grams.

Disadvantages
  Disk usage
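The three search steps can be sketched over a small in-memory database. The wildcard handling ("*" matches exactly one word at its position) is an assumption made for this sketch, and a query made only of wildcards returns nothing here because no n-gram list can be chosen.

```python
def search(db, query):
    """query: list of unigrams, where "*" stands for any single word."""
    terms = [t for t in query if t != "*"]
    if not terms or any(t not in db for t in terms):
        return []
    # Step 1: choose the shortest n-gram list among the query unigrams.
    shortest = min((db[t] for t in terms), key=len)

    # Step 2: filter that list against the whole query.
    def matches(words):
        return len(words) == len(query) and all(
            q == "*" or q == w for q, w in zip(query, words)
        )

    # Step 3: return the n-grams that remain.
    return [(words, freq) for words, freq in shortest if matches(words)]


ngrams = [
    (("serve", "as", "the", "incoming"), 92),
    (("serve", "as", "the", "incubator"), 99),
]
db = {}
for words, freq in ngrams:
    for unigram in set(words):
        db.setdefault(unigram, []).append((words, freq))

print(search(db, ["serve", "as", "the", "*"]))          # both n-grams
print(search(db, ["serve", "as", "the", "incubator"]))  # only the second
```

Only one list is read per query, which is why no posting lists need to be merged.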


System Component (C)

There are 3 steps and 2 databases.
  Query
    -> Dictionary lookup (Unigram dictionary) -> Unigram IDs
    -> Database lookup (Inverted database, sequential access) -> Encoded n-grams
    -> N-gram decode -> N-grams


EVALUATION


Experimental Setup

Environment
  Hardware
    CPU: Core 2 Quad 2.83 GHz
    RAM: 8 GB (1 GB required)
    Disks: 1.5 TB HDD x 2 (RAID 0) and 256 GB SSD
  Software
    OS: 64-bit Ubuntu 9.04 Desktop
    Search systems: Implemented in C++
      http://code.google.com/p/ssgnc/

Corpus
  Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]
    107 GB (plain text)


Experimental Results (Disk)

Methods
  (A) Uncompressed n-grams + inverted index
  (B) Compressed n-grams + inverted index
    Both on HDD and on SSD.
  (C) Inverted database
    Less than 12 hours

Table 1. Disk usage of n-gram search systems (MB)

         Term Dic   Inv. Idx   N-gram DB   Inv. DB     Total
  (A)          77     29,758     109,221         0   139,056
  (B)         148     26,969      29,034         0    56,151
  (C)         148          0           0   150,720   150,720

  Dic = Dictionary, Inv. = Inverted, DB = Database


Experimental Results (Time)

Table 2. Average and median of search time (s)

  Query       1-gram          3-gram          5-gram
              Ave     Med     Ave     Med     Ave     Med
  (A)         5.016   0.268   3.855   0.983   2.394   0.830
  (B) HDD     2.467   0.179   3.272   0.868   2.156   0.761
  (B′) SSD    0.547   0.012   0.699   0.148   0.577   0.162
  (C)         0.040   0.029   0.167   0.094   0.219   0.097

Performance evaluation
  (B) N-gram compression shows a little improvement.
  (B′) SSD shows a good performance.
  (C) Inverted database shows the best performance.


Experimental Results (Details)

[Figure 1. Search time for 3-gram queries: search time (s, 0.01 to 100, log scale) versus number of matches N (1 to 1,000, log scale), with curves for (A), (B), (B′) and (C)]

Performance evaluation
  For a small N, (B′) is better than (C).
  For a large N, (C) is better than (B′).


CONCLUSION


Conclusion

Search systems for large-scale n-grams
  Full Trie Index
    Requires too many resources for indexing.
  Inverted Index
    Requires inefficient disk access in searching n-grams.
  Inverted Database
    Provides a well-balanced search system.

Future study
  Character n-grams.
  Queries containing classes and attributes.
