A Large-scale N-gram Search System Using Inverted Files Presented by

Susumu Yata*, Kazuhiro Morita, Masao Fuketa and Jun-ichi Aoe University of Tokushima *JSPS Research Fellow

Introduction

Target

Search System for a Large-scale N-gram Corpus 

Simpler and more lightweight than a full trie index

Methods 

Inverted Index 



No use…

Inverted Database 

Customized inverted index for n-gram searches

Result 

Usable system (on 64-bit Linux) 

http://code.google.com/p/ssgnc/

28 July 2009


Large-scale N-gram Corpus

Available corpora



Word n-grams on public web pages 

Web 1T 5-gram Version 1 [Brants and Franz 2006]



Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]

Contributors

Google Inc., somebodies and nobodies

Scales

Around 100GB from tens of billions of sentences

How large? 

It takes 17 minutes to scan 100GB at 100MB/s.
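The scan-time figure above is simple arithmetic. The sketch below reproduces it, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the function name `scan_minutes` is only illustrative.

```python
def scan_minutes(corpus_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time needed to read the whole corpus once, in minutes."""
    return corpus_bytes / throughput_bytes_per_s / 60


# 100 GB read at 100 MB/s takes 1000 seconds.
minutes = scan_minutes(100e9, 100e6)
print(f"{minutes:.1f} minutes")  # 16.7 minutes, i.e. about 17
```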


Example of N-grams

Word n-grams with frequencies



English 3-grams 

ceramics collectables collectibles 55



ceramics collectables fine 130



ceramics collected by 52



ceramics collectible pottery 50

English 4-grams 

serve as the incoming 92



serve as the incubator 99



serve as the independent 794



serve as the index 223
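A record like the ones above can be read into a small structure. This is a minimal sketch, assuming the frequency is the last whitespace-separated token on the line; the names `NgramEntry` and `parse_entry` are hypothetical and the actual corpus files may use a different separator.

```python
from typing import NamedTuple, Tuple


class NgramEntry(NamedTuple):
    words: Tuple[str, ...]
    freq: int


def parse_entry(line: str) -> NgramEntry:
    # Assumption: every token but the last is a word; the last is the count.
    *words, freq = line.split()
    return NgramEntry(tuple(words), int(freq))


entry = parse_entry("serve as the independent 794")
print(len(entry.words), entry.words, entry.freq)
```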


Applications of N-gram Corpus

Research

Statistical machine translation [Finch et al. 2007, Federico and Cettolo 2007, Zheng et al. 2008]

Speech recognition [Choueiter et al. 2007]

Text mining [Bethard and Martin 2008]

Spelling correction [Carlson and Fette 2007]

Estimation of document frequencies [Klein and Nelson 2009]


Obvious Problem

Too large…

Processing time of “grep” is linear in the corpus size.

[Figure: processing time of grep in seconds (log scale, 0.1 to 1000) versus corpus size (100MB to 100GB)]

We cannot wait for a response…

Trie Index for N-gram Corpus

We need an index.

Full trie index [Sekine 2008]

It supports a query containing wildcards '*'.

“serve as the *” -> “serve as the incoming 92”, “serve as the incubator 99”

It responds to a reasonable query in a second.

It works on a modern PC.

4GB RAM and 490GB disk space for Japanese 7-grams

It takes a lot of resources to build an index.

Several months using 4GB RAM: unacceptable

How about Inverted Index?

Well-known index data structure

Advantages

Easy to build: less than 12 hours

Compact: less than 200GB

Disadvantages 

Bad search performance 

Random access to disk

Uh…

We tested Inverted Index for n-gram corpus. 

It works with some techniques.


INVERTED INDEX INVERTED DATABASE EVALUATION

INVERTED INDEX


Inverted Index

Mapping Terms to Documents

The two parts of an inverted index [Manning et al. 2008] 

Dictionary: List of terms to be indexed.



Posting: List of documents containing a specific term.

Dictionary    Postings (a list of documents)
Boss          4 6 8 9 10 11 13 15 …
Wally         12 17 29 31 45 46 49 …
Alice         11 13 24 55 61 70 75

In practice, differences are stored as variable byte codes.
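Storing differences as variable byte codes can be sketched as follows. The convention used here (7 payload bits per byte, most significant group first, high bit set on the last byte of each value) is the one described in Manning et al. 2008; whether the presented system uses the same flag convention is an assumption. All function names are illustrative.

```python
def to_gaps(doc_ids):
    """Replace each ID by its difference from the previous ID."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the IDs."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_encode(value):
    """Encode a non-negative int in 7-bit groups, most significant first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    groups[-1] |= 0x80  # assumed convention: high bit marks the last byte
    return bytes(groups)


def vbyte_decode(data):
    """Decode a concatenation of variable byte codes."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:
            values.append(current)
            current = 0
    return values


wally = [12, 17, 29, 31, 45, 46, 49]
gaps = to_gaps(wally)  # [12, 5, 12, 2, 14, 1, 3]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(gaps, len(encoded), "bytes")
print(from_gaps(vbyte_decode(encoded)) == wally)
```

Small gaps fit in one byte each, which is why sorted posting lists compress well under this scheme.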


Document Search System

Examples

“Wally” 

Documents listed in a posting list.

Wally 

12 17 29 31 45 46 49 …

“Boss AND Alice” 

Documents listed in both posting lists.

Boss          4 6 8 9 10 11 13 15 …
Alice         11 13 24 55 61 70 75

To be merged…
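The merge for “Boss AND Alice” is the standard linear intersection of two sorted posting lists. The sketch below uses only the visible part of each list from the slide (the Boss list is truncated there), and the name `intersect` is illustrative.

```python
def intersect(a, b):
    """Merge two sorted posting lists, keeping the IDs present in both."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


boss = [4, 6, 8, 9, 10, 11, 13, 15]
alice = [11, 13, 24, 55, 61, 70, 75]
print(intersect(boss, alice))  # [11, 13]
```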


N-gram Search System

Mapping Unigrams to N-grams

The two parts of an inverted index 

Dictionary: List of unigrams to be indexed.



Posting: List of n-grams containing a specific unigram.

Dictionary    Postings (a list of n-grams)
Boss          4 6 8 9 10 11 13 15 …
Wally         12 17 29 31 45 46 49 …
Alice         11 13 24 55 61 70 75

Unigrams and n-grams replace terms and documents.


System Component (A)

There are 3 steps and 3 databases.

[Figure: Query -> Dictionary lookup (Unigram dictionary) -> Unigram IDs -> Index lookup (Inverted index) -> N-gram IDs -> N-gram lookup (N-gram database) -> N-grams]
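The three steps of system (A) can be sketched in memory as below. Python dictionaries and lists stand in for the three on-disk databases, the query is treated as an AND of unigrams that ignores word order, and the names `build` and `search` are hypothetical; none of this is taken from the actual implementation.

```python
from collections import defaultdict

# N-gram database: position in the list is the n-gram ID (sample from the slides).
NGRAMS = [
    (("serve", "as", "the", "incoming"), 92),
    (("serve", "as", "the", "incubator"), 99),
    (("serve", "as", "the", "independent"), 794),
    (("serve", "as", "the", "index"), 223),
    (("ceramics", "collected", "by"), 52),
]


def build(ngrams):
    dictionary = {}               # unigram -> unigram ID
    index = defaultdict(list)     # unigram ID -> sorted list of n-gram IDs
    for ngram_id, (words, _freq) in enumerate(ngrams):
        for word in dict.fromkeys(words):  # each distinct unigram once
            uid = dictionary.setdefault(word, len(dictionary))
            index[uid].append(ngram_id)
    return dictionary, index


def search(query_words, dictionary, index, ngrams):
    if not query_words:
        return []
    # Step 1: dictionary lookup (unigrams -> unigram IDs).
    try:
        uids = [dictionary[w] for w in query_words]
    except KeyError:
        return []
    # Step 2: index lookup (intersect posting lists -> n-gram IDs).
    ids = set(index[uids[0]])
    for uid in uids[1:]:
        ids &= set(index[uid])
    # Step 3: n-gram lookup (n-gram IDs -> n-grams).
    return [ngrams[i] for i in sorted(ids)]


dictionary, index = build(NGRAMS)
for words, freq in search(["the", "index"], dictionary, index, NGRAMS):
    print(" ".join(words), freq)
```

Step 3 is where each matching ID turns into a separate read from the n-gram database.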

Drawback of Inverted Index

N-grams are distributed on disk.

[Figure: N-gram lookup reads scattered n-grams (“Read me”) from the N-gram database]

Features of disk access

Seek time

We have to wait for a disk to seek.

Block access

We have to read a block even for a bit.
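A rough cost model shows why scattered reads hurt. The seek time, block size, and record size below are assumed round numbers for illustration only, not measurements from the talk; the 100 MB/s throughput is the figure quoted earlier for scanning the corpus.

```python
SEEK_S = 0.010        # assumed average seek plus rotational delay per random read
BLOCK_BYTES = 4096    # assumed block size
THROUGHPUT = 100e6    # bytes per second, as quoted for scanning the corpus


def random_read_s(n_records):
    """Each record costs one seek plus the transfer of one whole block."""
    return n_records * (SEEK_S + BLOCK_BYTES / THROUGHPUT)


def sequential_read_s(n_records, record_bytes):
    """One seek, then a single contiguous read of all records."""
    return SEEK_S + n_records * record_bytes / THROUGHPUT


for n in (1, 100, 1000):
    # 32 bytes per record is an assumed size for an encoded n-gram.
    print(n, round(random_read_s(n), 3), round(sequential_read_s(n, 32), 3))
```

Under these assumptions the random-access cost grows with the number of matches, while the sequential cost stays close to a single seek.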


INVERTED INDEX INVERTED DATABASE EVALUATION

INVERTED DATABASE


Overview

N-gram compression

Reduction of disk access

N-gram                      there   are   any   issues   18724
Using unigram IDs           126     35    96    759      18724
Using variable byte codes   7E      23    60    5 77     1 12 24

Inverted database

Random Access to Sequential Access

there: 1131 1262 1873

IDs to n-grams

there: There is a 15975737, There are no 11399502, There is no 14837037

N-gram Compression

Compressing n-grams for reducing disk access

Unigram IDs

Allocate IDs to unigrams in frequency order.

Replace unigrams in n-grams by IDs.

ID  Unigram
0
1
2   ,
3   .
4
5   the
6   -
7   of

Variable byte codes

Allocate shorter codes to frequent unigrams.

Allocate shorter codes to low frequencies.

[Figure: code length by value range: 1 byte from 2^0 up to 2^7, 2 bytes up to 2^14, 3 bytes up to 2^21, 4 bytes up to 2^28]
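The two ideas on this slide can be sketched together: IDs assigned in descending frequency order, and a code length that grows by one byte for every 7 bits of value. The unigram counts are hypothetical; the values 759 and 18724 are the ones from the Overview slide, whose codes appear there as 5 77 and 1 12 24 in hexadecimal. Flag bits are left out because the slides do not show them, and all function names are illustrative.

```python
def assign_ids(unigram_freqs):
    """Give ID 0 to the most frequent unigram, 1 to the next, and so on."""
    ranked = sorted(unigram_freqs, key=lambda w: (-unigram_freqs[w], w))
    return {word: i for i, word in enumerate(ranked)}


def code_length(value):
    """Bytes needed when each byte carries 7 bits of payload."""
    length = 1
    while value >= 1 << (7 * length):
        length += 1
    return length


def payload_groups(value):
    """7-bit groups of a value, most significant first (flag bits omitted)."""
    groups = []
    for shift in range(7 * (code_length(value) - 1), -1, -7):
        groups.append((value >> shift) & 0x7F)
    return groups


# Hypothetical counts, for illustration only.
ids = assign_ids({"the": 900, "of": 500, "ceramics": 3})
print(ids)
print([code_length(v) for v in (126, 128, 759, 18724, 1 << 21)])
print([f"{g:02X}" for g in payload_groups(759)])
print([f"{g:02X}" for g in payload_groups(18724)])
```

Because frequent unigrams receive small IDs, most words in an encoded n-gram need only one or two bytes.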

System Component (B)

There are 4 steps and 3 databases.

[Figure: Query -> Dictionary lookup (Unigram dictionary) -> Unigram IDs -> Index lookup (Inverted index) -> N-gram IDs -> N-gram lookup (N-gram database) -> Encoded n-grams -> N-gram decode -> N-grams]

Inverted Database

Copying n-grams for sequential disk access

N-gram lists

Use n-gram lists instead of postings.

Before: Unigram -> ID, ID, ID, …

After: Unigram -> N-gram, N-gram, N-gram, …

At most n copies for each n-gram

Compress n-grams.

On the whole, Inverted Database is larger than the total of n-grams and Inverted Index.

Features of Inverted Database

Advantages

Disk access

N-grams are sequentially accessed.

No intersection

There is no need for merging posting lists.

1. Choose the shortest n-gram list (the rarest unigram).

2. Filter the n-gram list by using a given query.

3. Return remaining n-grams.

Disadvantages 

Disk usage
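The three search steps, and the disk-usage cost of copying n-grams, can be sketched in memory as follows. The query semantics (same length as the n-gram, with '*' matching any single word) and the names `build_inverted_db`, `matches`, and `search` are assumptions for illustration; the real system stores compressed n-gram lists on disk.

```python
from collections import defaultdict


def build_inverted_db(ngrams):
    """Store a full copy of each n-gram under every distinct unigram it contains."""
    db = defaultdict(list)
    for words, freq in ngrams:
        for word in dict.fromkeys(words):  # at most n copies per n-gram
            db[word].append((words, freq))
    return db


def matches(words, query):
    """Assumed semantics: equal length, '*' matches any single word."""
    return len(words) == len(query) and all(
        q == "*" or q == w for q, w in zip(query, words))


def search(db, query):
    terms = [q for q in query if q != "*"]
    if not terms or any(t not in db for t in terms):
        return []
    # 1. Choose the shortest n-gram list.
    shortest = min((db[t] for t in terms), key=len)
    # 2. Filter the list with the query. 3. Return what remains.
    return [(w, f) for w, f in shortest if matches(w, query)]


sample = [
    (("serve", "as", "the", "incoming"), 92),
    (("serve", "as", "the", "incubator"), 99),
    (("serve", "as", "the", "independent"), 794),
    (("serve", "as", "the", "index"), 223),
    (("ceramics", "collected", "by"), 52),
]
db = build_inverted_db(sample)
print(search(db, ["serve", "as", "the", "index"]))
print(len(search(db, ["serve", "as", "the", "*"])), "matches")
print(sum(len(v) for v in db.values()), "stored copies for", len(sample), "n-grams")
```

Only one list is read per query, so no posting lists are merged, at the price of storing each n-gram several times.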


System Component (C)

There are 3 steps and 2 databases.

[Figure: Query -> Dictionary lookup (Unigram dictionary) -> Unigram IDs -> Database lookup (Inverted database, sequential) -> Encoded n-grams -> N-gram decode -> N-grams]

INVERTED INDEX INVERTED DATABASE EVALUATION

EVALUATION


Experimental Setup

Environment



Hardware 

CPU: Core 2 Quad 2.83 GHz



RAM: 8 GB (1 GB required)



Disks: 1.5 TB HDD x 2 (RAID 0) and 256 GB SSD

Software 

OS: 64-bit Ubuntu 9.04 Desktop



Search systems: Implemented with C++ 

http://code.google.com/p/ssgnc/

Corpus 

107 GB (plain text)

Web Japanese N-gram Version 1 [Kudo and Kazawa 2007]


Experimental Results (Disk)

Methods

(A) Uncompressed n-grams + inverted index



(B) Compressed n-grams + inverted index 



Both on HDD and on SSD.

(C) Inverted database

Less than 12 hours

Table 1. Disk usage of n-gram search systems (MB)

       Term Dic    Inv. Idx    N-gram DB    Inv. DB      Total
(A)          77      29,758      109,221          0    139,056
(B)         148      26,969       29,034          0     56,151
(C)         148           0            0    150,720    150,720

Dic = Dictionary, Inv. = Inverted, DB = Database

Experimental Results (Time)

Table 2. Average and median of search time (s)

Query        1-gram           3-gram           5-gram
             Ave     Med      Ave     Med      Ave     Med
(A)          5.016   0.268    3.855   0.983    2.394   0.830
(B) HDD      2.467   0.179    3.272   0.868    2.156   0.761
(B′) SSD     0.547   0.012    0.699   0.148    0.577   0.162
(C)          0.040   0.029    0.167   0.094    0.219   0.097

Performance evaluation

(B) N-gram compression shows a little improvement.

(B′) SSD shows a good performance.

(C) Inverted database shows the best performance.
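The size of these differences is easier to see as ratios. The sketch below divides the average search times reported in Table 2; only the helper name `speedup` and the ASCII key "(B')" are introduced here.

```python
# Average search time in seconds, copied from Table 2.
mean_search_s = {
    "(A)":  {"1-gram": 5.016, "3-gram": 3.855, "5-gram": 2.394},
    "(B)":  {"1-gram": 2.467, "3-gram": 3.272, "5-gram": 2.156},
    "(B')": {"1-gram": 0.547, "3-gram": 0.699, "5-gram": 0.577},
    "(C)":  {"1-gram": 0.040, "3-gram": 0.167, "5-gram": 0.219},
}


def speedup(baseline, method, query):
    """How many times faster `method` is than `baseline` on average."""
    return mean_search_s[baseline][query] / mean_search_s[method][query]


for query in ("1-gram", "3-gram", "5-gram"):
    print(query,
          "(C) vs (A):", round(speedup("(A)", "(C)", query), 1),
          "(C) vs (B'):", round(speedup("(B')", "(C)", query), 1))
```

On average times, (C) is roughly 125 times faster than (A) for 1-gram queries and roughly 11 times faster for 5-gram queries.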


Experimental Results (Details)

[Figure 1. Search time for 3-gram queries: search time in seconds (log scale, 0.01 to 100) versus number of matches N (1 to 1,000) for (A), (B), (B′), and (C)]

Performance evaluation

For a small N, (B′) is better than (C).

For a large N, (C) is better than (B′).


CONCLUSION


Conclusion

Search systems for large-scale n-grams

Full Trie Index

Requires too many resources for indexing.

Inverted Index

Requires inefficient disk access in searching n-grams.

Inverted Database

Provides a well-balanced search system.

Future study

Character n-grams.

Queries containing classes and attributes.
