Fast Construction of a Word↔Number Index for Large Data

Miloš Jakubíček, Pavel Rychlý, Pavel Šmerk

Natural Language Processing Centre, Faculty of Informatics, Masaryk University

7. 12. 2013

Šmerk et al. (NLPC FI MU)

Construction of a Word↔Number Index


Introduction

• Inspiration: Aleš Horák @ 1st NLP Centre seminar :-)
  • (but we still have not compared Manatee with an SQL database)
• Problem: indexes for large text corpora (billions of tokens)
• Current solution: .lex, .lex.idx and .lex.srt files
  • .lex: null-terminated strings, in the order of appearance in the corpus
  • .lex.idx: 4B offsets of words in .lex
  • .lex.srt: 4B indices (positions in .lex.idx) sorted alphabetically
  • id2str: 2 memory accesses
  • str2id: 3 * log2 |lexicon| memory accesses (binary search over .lex.srt, 3 accesses per step)
• New solution: HAT-trie + (reimplemented) Daciuk’s fsa tools
  • HAT-trie: cache-conscious, combines trie + hash, allows sorted access
  • for indexing natural language strings, it is among the best solutions regarding both time and space
  • Daciuk: minimal DAFSA for perfect hashing
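The current three-file layout and its two lookups can be sketched as follows. This is only an illustration, not Manatee's code: it assumes little-endian unsigned 4-byte integers and byte-wise ordering of UTF-8 strings for the "alphabetical" sort, neither of which is stated on the slide, and the function bodies are hypothetical.

```python
import struct

def build_lexicon(tokens):
    """Build in-memory images of .lex, .lex.idx and .lex.srt (assumed layout)."""
    lex = bytearray()   # null-terminated strings, in order of first appearance
    offsets = []        # offsets[word_id] = byte offset of that word in lex
    seen = set()
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            offsets.append(len(lex))
            lex += tok.encode("utf-8") + b"\0"
    raw = [bytes(lex[o:lex.index(0, o)]) for o in offsets]
    order = sorted(range(len(offsets)), key=raw.__getitem__)  # ids sorted by string
    idx = struct.pack(f"<{len(offsets)}I", *offsets)
    srt = struct.pack(f"<{len(order)}I", *order)
    return bytes(lex), idx, srt

def id2str(lex, idx, word_id):
    (off,) = struct.unpack_from("<I", idx, 4 * word_id)   # access 1: .lex.idx
    return lex[off:lex.index(0, off)].decode("utf-8")     # access 2: .lex

def str2id(lex, idx, srt, word):
    """Binary search over .lex.srt; each step touches all three files."""
    key = word.encode("utf-8")
    lo, hi = 0, len(srt) // 4
    while lo < hi:
        mid = (lo + hi) // 2
        (wid,) = struct.unpack_from("<I", srt, 4 * mid)   # access 1: .lex.srt
        (off,) = struct.unpack_from("<I", idx, 4 * wid)   # access 2: .lex.idx
        cand = lex[off:lex.index(0, off)]                 # access 3: .lex
        if cand == key:
            return wid
        if cand < key:
            lo = mid + 1
        else:
            hi = mid
    return -1   # word is not in the lexicon

lex, idx, srt = build_lexicon("the cat saw the other cat".split())
```

Here ids follow the order of first appearance ("the" is 0, "cat" is 1, and so on), which is why a separate sorted file is needed for the string-to-id direction.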


Data sets used in the experiments

data set   size        words      unique     size (unique)   language
100M        1148 MB      110 M     1660 k     31 MB          Tajik
1000M       5161 MB      957 M     1366 k     14 MB          French
10000M     69010 MB    12967 M    27892 k    384 MB          English

• three sets of corpus data: they differ not only in size
• Tajik uses Cyrillic ⇒ words are two times longer only due to encoding
• French corpus (OPUS project): mostly legal texts ⇒ limited vocabulary
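The encoding effect mentioned for Tajik can be checked directly, assuming the corpus is stored in UTF-8 (the slides mention UTF-8 only under future work, so this is an assumption). The sample word is a hypothetical illustration, not taken from the corpus.

```python
def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

word_cyrillic = "китоб"   # hypothetical sample word in Cyrillic script
word_latin = "kitob"      # the same word transliterated to Latin script

cyr_bytes = utf8_len(word_cyrillic)   # every Cyrillic letter takes 2 bytes
lat_bytes = utf8_len(word_latin)      # every ASCII letter takes 1 byte
```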


Comparison of encodevert and hat-trie

            encodevert            hat-trie
data set    time       memory     time       memory     size
100M        3:11 m     0.44 GB    26.5 s     0.06 GB    44 MB
1000M       23:01 m    0.40 GB    2:21 m     0.04 GB    25 MB
10000M      7:38 h     0.98 GB    44:37 m    0.78 GB    607 MB

            encodevert            hat-trie
data set    local      fair       fair
100M        3:27 m     1:25 m     32.6 s
1000M       26:10 m    6:26 m     3:09 m
10000M      9:21 h     4:02 h     1:02 h

• the table from the paper turned out to be unfair to encodevert
• local: data on a local hdd, but one that was probably more heavily used
• fair: both applications produce the same set of files
• in fact, this is still unfair, but now to hat-trie
• ⇒ whole applications have to be tested


Reduction of the size of data

            encodevert            hat-trie
data set    time       memory     time       memory     size
100M        3:11 m     0.44 GB    26.5 s     0.06 GB    44 MB
1000M       23:01 m    0.40 GB    2:21 m     0.04 GB    25 MB
10000M      7:38 h     0.98 GB    44:37 m    0.78 GB    607 MB

            fsa_ubuild            hat + new fsa
data set    time       memory     time       memory     size
100M        failed                31.7 s     0.09 GB    15 MB
1000M       15:48 m    0.11 GB    2:34 m     0.06 GB    11 MB
10000M      7:44 h     31.01 GB   1:08 h     1.47 GB    363 MB

• for very large corpora the files can consume a lot of memory
• with Daciuk’s fsa tools we have built automata for perfect hashing
• fsa_ubuild is Daciuk’s original implementation (for unsorted data)
• hat + new fsa is a reimplementation with HAT-trie as a presort
• (experiments from the two tables were run on different hardware)
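Perfect hashing with an automaton maps each word to its rank among the accepted words and back, using per-state counts of how many words can be completed from that state. The sketch below illustrates the idea on a plain, unminimized trie for brevity; it is not Daciuk's fsa code or the reimplementation, and all names are hypothetical. Ranks follow Python's code-point string order, which is an assumption.

```python
class Node:
    __slots__ = ("arcs", "final", "count")

    def __init__(self):
        self.arcs = {}       # label -> Node
        self.final = False   # True if a word ends here
        self.count = 0       # number of words accepted starting from this node

def _fill_counts(node):
    node.count = int(node.final) + sum(_fill_counts(n) for n in node.arcs.values())
    return node.count

def build_trie(words):
    root = Node()
    for w in set(words):
        node = root
        for ch in w:
            node = node.arcs.setdefault(ch, Node())
        node.final = True
    _fill_counts(root)
    return root

def word2num(root, word):
    """Return the 0-based rank of `word` in sorted order, or -1 if absent."""
    node, num = root, 0
    for ch in word:
        if ch not in node.arcs:
            return -1
        if node.final:                    # a proper prefix sorts before the word
            num += 1
        for label in sorted(node.arcs):   # skip all subtrees with smaller labels
            if label == ch:
                break
            num += node.arcs[label].count
        node = node.arcs[ch]
    return num if node.final else -1

def num2word(root, num):
    """Inverse of word2num; returns None if `num` is out of range."""
    if not 0 <= num < root.count:
        return None
    node, out = root, []
    while True:
        if node.final:
            if num == 0:
                return "".join(out)
            num -= 1
        for label in sorted(node.arcs):
            child = node.arcs[label]
            if num < child.count:
                out.append(label)
                node = child
                break
            num -= child.count

root = build_trie(["car", "cat", "do", "dog", "a"])
```

Unlike the .lex ids, the numbers produced this way reflect sorted order rather than order of appearance in the corpus. In a minimal DAFSA the same counting scheme applies, but equivalent states are shared, which is where the size reduction comes from.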


HAT-trie based sort + fsa outperforms fsa_ubuild

            fsa_ubuild            hat + new fsa
data set    time       memory     time       memory     size
100M        failed                31.7 s     0.09 GB    15 MB
1000M       15:48 m    0.11 GB    2:34 m     0.06 GB    11 MB
10000M      7:44 h     31.01 GB   1:08 h     1.47 GB    363 MB

            hat-trie sort         fsa_build             new fsa
data set    time       memory     time       memory     time      memory
100M        28.4 s     0.06 GB    12.4 s     0.21 GB    4.2 s     0.03 GB
1000M       2:51 m     0.04 GB    5.6 s      0.11 GB    1.8 s     0.03 GB
10000M      59:16 m    0.77 GB    35:15 m    27.07 GB   9:36 m    0.71 GB

• the second table compares fsa construction from sorted data
• ⇒ with such an efficient sort algorithm, sorting the data first and then using the algorithm for sorted data is always better than fsa_ubuild
• ⇒ to reduce the memory used, it would be better to flush the sorted data to hard disk before fsa construction, as the time penalty is minimal
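Why sorted input helps can be seen from the well-known incremental construction of a minimal acyclic automaton from sorted words: once the next word diverges from the previous one, the previous word's suffix states can never change again, so they are merged immediately with equivalent states found in a register. The sketch below follows that textbook scheme under the assumption of sorted, duplicate-free input; it is not the code of fsa_build or of the reimplementation, and all names are hypothetical.

```python
class State:
    def __init__(self):
        self.arcs = {}       # label -> State
        self.final = False

    def signature(self):
        # States are equivalent if finality and all (label, target) pairs match.
        # Targets are already canonical and kept alive by the register,
        # so comparing them by id() is safe here.
        return (self.final, tuple(sorted((c, id(s)) for c, s in self.arcs.items())))

def _minimize(path, register, down_to):
    """Replace states on `path` below depth `down_to` by registered equivalents."""
    while len(path) > down_to:
        parent, ch, child = path.pop()
        canonical = register.setdefault(child.signature(), child)
        if canonical is not child:
            parent.arcs[ch] = canonical

def build_dafsa(sorted_words):
    root, register, path, prev = State(), {}, [], None
    for word in sorted_words:
        if prev is not None and word <= prev:
            raise ValueError("input must be sorted and free of duplicates")
        common = 0   # length of the common prefix with the previous word
        if prev is not None:
            while (common < min(len(word), len(prev))
                   and word[common] == prev[common]):
                common += 1
        _minimize(path, register, common)   # freeze the previous word's suffix
        node = path[-1][2] if path else root
        for ch in word[common:]:            # append the new suffix
            child = State()
            node.arcs[ch] = child
            path.append((node, ch, child))
            node = child
        node.final = True
        prev = word
    _minimize(path, register, 0)
    return root

def accepts(root, word):
    node = root
    for ch in word:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.arcs.values())
    return len(seen)

words = sorted(["tap", "taps", "top", "tops"])
dafsa = build_dafsa(words)
```

For these four words a plain trie needs 8 states, while the states for "ap(s)" and "op(s)" are shared in the minimized automaton. Only the register and the path of the last word need to stay mutable, which is consistent with the much lower memory reported for the sorted-input tools above.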


Future Work

• this is a work in progress; even the measured times are biased
• we want to
  • fine-tune hat-trie (we have used the default settings)
  • further reduce
    • compile space: fsa can be built directly in memory
    • compile time: hash for “registered” nodes
    • run space: VLEncoded information, relative addresses, UTF-8, ...
    • run time: smaller run space, numbers in arcs
  • run experiments on a hdd not shared with other processes

