Fast Construction of a Word↔Number Index for Large Data
Miloš Jakubíček, Pavel Rychlý, Pavel Šmerk
Natural Language Processing Centre
Faculty of Informatics
Masaryk University
7. 12. 2013
Šmerk et al. (NLPC FI MU)
Construction of a Word↔Number Index
7. 12. 2013
1/7
Introduction
• Inspiration: Aleš Horák @ 1st NLP Centre seminar :-)
  • (but we still have not compared Manatee with an SQL DB)
• Problem: indexes for large text corpora (billions of tokens)
• Current solution: .lex, .lex.idx and .lex.srt files
  • .lex: null-terminated strings, in the order of appearance in the corpus
  • .lex.idx: 4B offsets of words in .lex
  • .lex.srt: 4B indices (positions in .lex.idx) sorted alphabetically
  • id2str: 2 memory accesses
  • str2id: 3 * log2 |lexicon| memory accesses
• New solution: HAT-trie + (reimplemented) Daciuk's fsa tools
  • HAT-trie: cache-conscious, combines trie + hash, allows sorted access
  • for indexing natural language strings, it is among the best solutions regarding both time and space
  • Daciuk: minimal DAFSA for perfect hashing
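The current solution can be illustrated with a small in-memory sketch. It is only an approximation of the three files described above: the little-endian byte order, the byte-wise sort order and the function names build_lexicon, id2str and str2id as written here are assumptions made for illustration, not the actual Manatee code.

```python
import struct

def build_lexicon(tokens):
    """Return (lex, idx, srt) byte strings mimicking .lex, .lex.idx, .lex.srt."""
    ids = {}            # word -> id, in order of first appearance
    words = []          # encoded words, indexed by id
    offsets = []        # offset of each word inside lex
    lex = bytearray()
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(words)
            raw = tok.encode("utf-8")
            offsets.append(len(lex))
            words.append(raw)
            lex += raw + b"\0"          # null-terminated string
    # ids sorted by the bytes of their words (assumed "alphabetical" order)
    order = sorted(range(len(words)), key=lambda i: words[i])
    idx = struct.pack(f"<{len(offsets)}I", *offsets)   # 4B offsets
    srt = struct.pack(f"<{len(order)}I", *order)       # 4B ids
    return bytes(lex), idx, srt

def id2str(lex, idx, i):
    """Two accesses: one into idx, one into lex."""
    (off,) = struct.unpack_from("<I", idx, 4 * i)
    return lex[off:lex.index(b"\0", off)].decode("utf-8")

def str2id(lex, idx, srt, word):
    """Binary search over srt; each step touches srt, idx and lex."""
    key = word.encode("utf-8")
    lo, hi = 0, len(srt) // 4
    while lo < hi:
        mid = (lo + hi) // 2
        (i,) = struct.unpack_from("<I", srt, 4 * mid)    # access 1: srt
        (off,) = struct.unpack_from("<I", idx, 4 * i)    # access 2: idx
        cand = lex[off:lex.index(b"\0", off)]            # access 3: lex
        if cand == key:
            return i
        if cand < key:
            lo = mid + 1
        else:
            hi = mid
    return None   # word is not in the lexicon
```

Each step of the binary search follows three pointers, which is where the 3 * log2 |lexicon| estimate comes from.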
Data sets used in the experiments
data set   size       words     unique    size     language
100M       1148 MB    110 M     1660 k    31 MB    Tajik
1000M      5161 MB    957 M     1366 k    14 MB    French
10000M     69010 MB   12967 M   27892 k   384 MB   English
• three sets of corpus data: they differ not only in size
• Tajik uses Cyrillic ⇒ words take twice as many bytes purely because of the encoding
• French corpus (OPUS project): mostly legal texts ⇒ limited vocabulary
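The encoding effect is easy to check, assuming the corpus is stored in UTF-8 (the slides do not name the encoding). The word below is just an illustrative example, and utf8_stats is a hypothetical helper.

```python
def utf8_stats(word):
    """Return (number of characters, number of bytes in UTF-8)."""
    return len(word), len(word.encode("utf-8"))

print(utf8_stats("китоб"))   # Cyrillic spelling: (5, 10)
print(utf8_stats("kitob"))   # Latin transliteration: (5, 5)
```

Every basic Cyrillic letter occupies two bytes in UTF-8, so the same word costs twice the space of its Latin transliteration.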
Comparison of encodevert and hat-trie
            encodevert           hat-trie
data set    time      memory     time      memory     size
100M        3:11 m    0.44 GB    26.5 s    0.06 GB    44 MB
1000M       23:01 m   0.40 GB    2:21 m    0.04 GB    25 MB
10000M      7:38 h    0.98 GB    44:37 m   0.78 GB    607 MB

            encodevert           hat-trie
data set    local     fair       fair
100M        3:27 m    1:25 m     32.6 s
1000M       26:10 m   6:26 m     3:09 m
10000M      9:21 h    4:02 h     1:02 h
• the table from the paper turned out to be unfair to encodevert
• local: data on a local hdd, though that disk was probably under heavier use
• fair times: both apps produce the same set of files
• in fact, this is still unfair, but now to hat-trie
• ⇒ whole applications have to be tested
Reduction of the size of data
            encodevert           hat-trie
data set    time      memory     time      memory     size
100M        3:11 m    0.44 GB    26.5 s    0.06 GB    44 MB
1000M       23:01 m   0.40 GB    2:21 m    0.04 GB    25 MB
10000M      7:38 h    0.98 GB    44:37 m   0.78 GB    607 MB

            fsa_ubuild           hat + new fsa
data set    time      memory     time      memory     size
100M        failed               31.7 s    0.09 GB    15 MB
1000M       15:48 m   0.11 GB    2:34 m    0.06 GB    11 MB
10000M      7:44 h    31.01 GB   1:08 h    1.47 GB    363 MB
• for very large corpora the files can consume a lot of memory
• with Daciuk's fsa tools we have built automata for perfect hashing
• fsa_ubuild is Daciuk's original implementation (for unsorted data)
• hat + new fsa is a reimplementation with HAT-trie as a presort
• (experiments from the two tables were run on different hardware)
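Perfect hashing on an automaton can be sketched as follows. Each state stores how many words are accepted from it, and a word's number is the count of words that sort before it. For brevity this sketch uses a plain, non-minimized trie instead of a minimal DAFSA (the counting works the same way on both, since the counts depend only on what can follow a state). The class and function names are hypothetical, and this is not the fsa tools' code.

```python
class Node:
    __slots__ = ("final", "arcs", "count")

    def __init__(self):
        self.final = False   # True if a word ends in this state
        self.arcs = {}       # label -> Node
        self.count = 0       # number of words accepted from this state

def _count(node):
    node.count = int(node.final) + sum(_count(c) for c in node.arcs.values())
    return node.count

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.arcs.setdefault(ch, Node())
        node.final = True
    _count(root)
    return root

def word_to_number(root, word):
    """Return the 0-based rank of word in sorted order, or None."""
    node, n = root, 0
    for ch in word:
        if ch not in node.arcs:
            return None
        if node.final:
            n += 1                       # a proper prefix sorts before word
        for label in sorted(node.arcs):
            if label >= ch:
                break
            n += node.arcs[label].count  # skip whole subtrees to the left
        node = node.arcs[ch]
    return n if node.final else None

def number_to_word(root, n):
    """Inverse mapping: return the word with rank n, or None."""
    if not 0 <= n < root.count:
        return None
    node, out = root, []
    while True:
        if node.final:
            if n == 0:
                return "".join(out)
            n -= 1
        for label in sorted(node.arcs):
            child = node.arcs[label]
            if n < child.count:          # the target lies below this arc
                out.append(label)
                node = child
                break
            n -= child.count
```

Both directions need only one walk along the word, so no separate offset or sort files are required. Note that the numbers here follow alphabetical order, not the order of appearance in the corpus.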
HAT-trie based sort + fsa outperforms fsa_ubuild
            fsa_ubuild           hat + new fsa
data set    time      memory     time      memory     size
100M        failed               31.7 s    0.09 GB    15 MB
1000M       15:48 m   0.11 GB    2:34 m    0.06 GB    11 MB
10000M      7:44 h    31.01 GB   1:08 h    1.47 GB    363 MB

            hat-trie sort        fsa_build            new fsa
data set    time      memory     time      memory     time     memory
100M        28.4 s    0.06 GB    12.4 s    0.21 GB    4.2 s    0.03 GB
1000M       2:51 m    0.04 GB    5.6 s     0.11 GB    1.8 s    0.03 GB
10000M      59:16 m   0.77 GB    35:15 m   27.07 GB   9:36 m   0.71 GB
• the second table compares fsa construction from sorted data
• ⇒ with such an efficient sort algorithm, sorting the data and then using the algorithm for sorted data is always better than fsa_ubuild
• ⇒ to reduce memory use it would be better to flush the sorted data to hard disk before fsa construction, as the time penalty is minimal
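The "sort first, then build" pipeline can be sketched with the well-known incremental construction of a minimal automaton from sorted input. This is a textbook-style sketch, not the authors' reimplementation: Python's sorted() stands in for the HAT-trie sort, a dict plays the role of the register of already minimized states, and all names are hypothetical.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}   # label -> State

    def signature(self):
        # Children are already canonical when this is called,
        # so their identity is enough to compare right languages.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.arcs.items())))

def _minimize(unchecked, register, down_to):
    # Replace states on the tail of the previous word by registered equivalents.
    while len(unchecked) > down_to:
        parent, label, child = unchecked.pop()
        parent.arcs[label] = register.setdefault(child.signature(), child)

def build_dafsa(sorted_words):
    root, register, unchecked, prev = State(), {}, [], None
    for word in sorted_words:
        if prev is not None and word <= prev:
            raise ValueError("input must be sorted and free of duplicates")
        common = 0   # length of the prefix shared with the previous word
        if prev is not None:
            while (common < min(len(word), len(prev))
                   and word[common] == prev[common]):
                common += 1
        _minimize(unchecked, register, common)
        node = unchecked[-1][2] if unchecked else root
        for ch in word[common:]:           # append the new suffix
            nxt = State()
            node.arcs[ch] = nxt
            unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        prev = word
    _minimize(unchecked, register, 0)
    return root

def accepts(root, word):
    node = root
    for ch in word:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.arcs.values())
    return len(seen)

dafsa = build_dafsa(sorted(["tops", "tap", "top", "taps"]))
print(count_states(dafsa))   # 5 states; a plain trie would need 8
```

Because the input is sorted, only the path of the previous word is ever unfinished, so the builder keeps just that path plus the register in memory. The sorted words could equally be streamed from a file on disk.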
Future Work
• it is a work in progress, even the measured times are biased
• we want to
  • fine-tune hat-trie (we have used default settings)
  • further reduce
    • compile space: fsa can be built directly in memory
    • compile time: hash for "registered" nodes
    • run space: VLEncoded information, relative addresses, UTF-8, ...
    • run time: smaller run space, numbers in arcs
  • run experiments on a hdd not shared with other processes