Variable-Length Codes for Space-Efficient Grammar-Based Compression

Yoshimasa Takabatake (1), Yasuo Tabei (2), and Hiroshi Sakamoto (1,3)

(1) Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka 820-8502, Japan
(2) ERATO Minato Project, Japan Science and Technology Agency, Sapporo, Japan
(3) PRESTO, JST, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
[email protected], [email protected], [email protected]

Abstract. A dictionary is a crucial data structure for implementing grammar-based compression algorithms. Such a dictionary should support access to any code in O(1) time for efficient compression. A standard dictionary consisting of fixed-length codes consumes a large amount of memory: 2n log n bits for n variables. We present novel dictionaries consisting of variable-length codes for offline and online grammar-based compression algorithms. In the offline setting, we present a dictionary of at most min{n log n + 2n + o(n), 3n log σ(1 + o(1))} bits of space, where σ < 2√n. In the online setting, we present a dictionary of at most (7/4) n log n + 4n + o(n) bits of space for a constant alphabet and unknown n. Experiments revealed that the memory usage of our dictionary was much smaller than that of state-of-the-art dictionaries.

1 Introduction

Grammar-based compression is an active research area with a wide variety of applications, including compressed pattern matching [22,23], q-gram mining [8], and edit distance computation [10]. The task is to find a small context-free grammar (CFG) that uniquely generates a given string. Grammar-based compression has two kinds of settings: offline and online. While all texts are given beforehand in the offline setting, a streaming model is assumed in the online setting. Many grammar-based compression algorithms have been proposed thus far [3,19,20,21,15]. Among them, the offline and online LCA algorithms, respectively proposed by Sakamoto et al. [20,21] and Maruyama et al. [15], are fast, memory-efficient, and achieve good compression ratios. Although the compression ratio of the online LCA is slightly worse than that of the offline LCA, it has a significant advantage: the online LCA does not need to keep all of the input text in memory to build a CFG, and it is applicable to streaming texts. The dictionary and the hashtable are crucial data structures for practical grammar-based compression algorithms. These algorithms output a set of production rules of the form

Partially supported by KAKENHI(23680016, 20589824) and JST PRESTO program.

L. Calderón-Benavides et al. (Eds.): SPIRE 2012, LNCS 7608, pp. 398–410, 2012. © Springer-Verlag Berlin Heidelberg 2012


Xk → Xi Xj, where Xk is called a variable. We represent the production rules as a sequence X1, X2, ..., Xn where each Xk (1 ≤ k ≤ n) is associated with exactly one digram Xi Xj. We call a data structure storing this sequence a dictionary. In a dictionary, variables are represented by codes, i.e., bit strings of possibly different lengths. An array is a dictionary using fixed-length codes, which we call a fixed-length dictionary (FLD); an FLD requires 2n log n bits of space. We call a dictionary using variable-length codes a variable-length dictionary (VLD).

A hashtable stores reverse production rules Xi Xj → Xk, where Xi Xj is a key and Xk is a value. The hashtable is used to check whether or not a production rule has already been generated during execution. In general, both keys and values need to be stored in a hashtable to avoid collisions. For grammar-based compression, however, the hashtable does not need to store keys: even if several values are returned for a key, we can identify the value corresponding to the key by referring to the dictionary. Thus, while the hashtables used in grammar-based compression algorithms are relatively small, the dictionaries present a serious memory bottleneck.

A difficulty in designing a VLD is how to organize small codes while keeping O(1)-time access to any code for fast compression. Moreover, in the online setting, a new incoming variable should be pushed onto the current tail of the dictionary. Brisaboa et al. [2] overcame these difficulties by applying γ-coding to variable-length codes. Several related works (see, e.g., [1,7]) and a C++ library [17] have been proposed as extensions of their method. However, since their codes employ γ-coding, they are not memory-efficient for storing the large variables arising in grammar-based compression, resulting in limited memory scalability. Since the amount of available data is ever increasing, developing VLDs using smaller amounts of memory remains a challenge.
In this paper, we present novel VLDs for offline and online grammar-based compression algorithms. The common idea behind our offline and online VLDs is to extract an increasing sequence of variables from a dictionary, compute the difference between every pair of consecutive variables in the sequence, and define compact codes for the results. Thus, a long increasing sequence is preferable in our VLDs. We present efficient methods to extract a long increasing sequence in the offline and online settings, which enable us to extract an increasing sequence of at least half and a quarter of the length of a dictionary, respectively. The memory usage of our VLDs is at most n log n + 2n + o(n) bits in the offline setting and (7/4) n log n + 4n + o(n) bits in the online setting. With the help of a rank/select dictionary [18], our VLDs enable O(1)-time access to any element. In the offline setting, we present another VLD of at most 3n log σ(1 + o(1)) bits of space for a parameter σ < 2√n, where n is the number of variables, a bound we prove using the Erdős–Szekeres theorem [6]. Thus, we can choose the smaller of the two VLDs in the offline setting. In experiments, we applied the online LCA with our VLD to the whole English Wikipedia and the human genome, and demonstrated significantly better memory efficiency than with Brisaboa et al.'s VLD and an FLD while performing fast compression.

2 Preliminaries

2.1 Grammar-Based Compression

We assume a finite alphabet Σ and a recursively enumerable set X of variables with Σ ∩ X = ∅. A member of Σ is called an alphabet symbol, and X ∈ X is called a variable. A sequence of symbols from Σ ∪ X is called a string. The set of all strings over Σ is denoted by Σ∗. The empty string is denoted by ε. For a sequence S, |S|, S[i], and S[i, j] denote the length of S, the i-th symbol of S, and the substring of S from S[i] to S[j], respectively. Let [S] be the set of symbols appearing in S. A string of length two is called a digram. For a finite set C, |C| denotes its cardinality; e.g., |[S]| denotes the number of distinct symbols appearing in S.

A context-free grammar (CFG) is represented by G = (Σ, V, P, Xs), where V is a finite subset of X, P is a finite subset of V × (Σ ∪ V)∗, and Xs ∈ V. A member of P is called a production rule and Xs is called the start symbol. The set of strings in Σ∗ derived from Xs by G is denoted by L(G). A CFG G is called admissible if, for each X ∈ V, there exists exactly one X → α ∈ P, and |L(G)| = 1. An admissible G deriving S is called a grammar-based compression of S. We consider only the case |α| = 2 for any production rule X → α, because any grammar-based compression with n variables can be transformed into such a restricted grammar with at most 2n variables. Moreover, this restriction is useful for practical applications, for example, LZ78 [24], SLP [5], REPAIR [13], ESP [14], and LCA [15]. A derivation tree of G is then represented by an ordered binary tree whose internal nodes are labeled by variables in V and whose sequence of leaves is equal to S.

A data structure D is called a dictionary for P if we can directly access the digram Xi Xj for a given Xk with Xk → Xi Xj ∈ P. The production rule Xk → Xi Xj can be represented by the triple (k, i, j) of nonnegative integers. Thus, the set of n production rules is represented by an array D[1, 2n] such that k indicates the production rule (k, D[2k − 1], D[2k]).
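As a small illustration (hypothetical code, not from the paper), the triple representation of production rules maps directly onto a flat array:

```python
# Toy sketch of the FLD layout described above: n production rules
# X_k -> X_i X_j are stored in a flat array D of length 2n; rule k
# (1-based) is the pair (D[2k-1], D[2k]) in the text's notation.
# Python lists are 0-based, so the indices are shifted by one.

def build_fld(rules):
    """rules[k-1] = (i, j) encodes X_k -> X_i X_j; returns the array D."""
    D = []
    for i, j in rules:
        D.extend([i, j])
    return D

def rhs(D, k):
    """O(1) access: the digram (X_i, X_j) associated with variable X_k."""
    return D[2 * k - 2], D[2 * k - 1]

# Purely illustrative rule numbers:
D = build_fld([(1, 2), (3, 3), (4, 1)])
assert rhs(D, 2) == (3, 3)
```

This is exactly the 2n log n-bit FLD that the VLDs of Sections 3 and 4 aim to shrink.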

2.2 Rank/Select Dictionary

Rank/select dictionaries are data structures for a string S ∈ Σ∗ of length n [11,9]. They support the rank and select queries as follows: rankσ(S, k) returns the number of occurrences of σ ∈ Σ in S[1, k], and selectσ(S, k) returns the position of the k-th occurrence of σ ∈ Σ in S. For example, if S = 10110100111, then rank1(S, 7) = 4, because the number of 1s in S[1, 7] is 4, and select1(S, 5) = 9, because the position of the fifth 1 in S is 9. When S is a binary string, i.e., Σ = {0, 1}, the computational time for the rank/select queries is O(1) [4,11,16]. Rank/select dictionaries for strings S ∈ Σ∗ with |Σ| ≥ 3 are also called wavelet trees [9], and their rank/select queries take O(log |Σ|) time. The memory usage of rank/select dictionaries is |S| + o(|S|) bits.
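A naive reference implementation of these queries may make the semantics concrete (real succinct dictionaries answer them in O(1) time with o(n) extra bits; this sketch simply scans the string):

```python
# Naive rank/select for a string, matching the 1-based definitions in the
# text. O(n) per query; only meant to pin down the semantics.

def rank(S, c, k):
    """Number of occurrences of character c in S[1, k] (1-based, inclusive)."""
    return S[:k].count(c)

def select(S, c, k):
    """Position (1-based) of the k-th occurrence of c in S, or -1 if absent."""
    count = 0
    for pos, ch in enumerate(S, start=1):
        if ch == c:
            count += 1
            if count == k:
                return pos
    return -1

S = "10110100111"
assert rank(S, "1", 7) == 4     # as in the text's example
assert select(S, "1", 5) == 9
```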

3 Offline Variable-Length Dictionaries

We present two VLDs for the offline problem. They achieve n log n + 2n + o(n) and 3n log σ(1 + o(1)) bits of space, respectively, for a static dictionary D[1, 2n] and a parameter σ < 2√n, both smaller than the 2n log n bits of space of an FLD. After building the two VLDs, we choose the smaller of them. The access time of our VLD is O(log σ). Our basic idea for building the two VLDs is to divide a given static dictionary D into a weakly increasing subsequence D1 and the remaining subsequence D2, and to build small codes for D1 and D2, respectively. We present space-efficient codes for D1 and two types of codes for D2; thus, the only difference between our two VLDs is the codes for D2. Since our codes for D1 are, basically, more space-efficient than those for D2, a long weakly increasing subsequence for D1 is preferable. We present a spanning tree decomposition to extract a long weakly increasing subsequence D1 from D.

3.1 Spanning Tree Decomposition

In the offline setting, a grammar-based compression D for the string S is given. Regarding the directed edges (Z, X) and (Z, Y) of each production rule Z → XY in D, D is transformed into a directed acyclic graph (DAG) with a single source and k sinks for k = |[S]|. In such a DAG, any internal node has exactly two (left/right) outgoing edges. Introducing a super-sink s and adding left/right edges from every sink to s, the DAG is renovated into an equivalent DAG G with a single source and a single sink. For this G, we recall the following fact remarked in [14].

Fact 1. For any in-branching spanning tree of G, the graph defined by the remaining edges is also an in-branching spanning tree of G.

Lemma 1. A static dictionary D[1, 2n] is decomposable into a weakly increasing subsequence D1 and another subsequence D2, each of length n.

Proof. When computing the in-branching spanning trees TL and TR from D, TL has n internal nodes. Assigning a new label to each internal node in breadth-first order, we obtain new labels for all internal nodes in TL. These labels are mapped to the internal nodes in TR by the original correspondence, and D is renovated into one that satisfies the following condition: for each 1 ≤ i ≤ n − 1, D1[i] ≤ D1[i + 1]. □

3.2 Variable-Length Dictionary of at Most n log n + 2n + o(n) Bits of Space

We decompose a static dictionary D[1, 2n] into a weakly increasing subsequence D1[1, n] and the other subsequence D2[1, n] by the spanning tree decomposition. We then encode D1[1, n] as a bit string Inc as follows: the i-th substring of Inc is D1[1] 0s followed by a 1 for i = 1, and (D1[i] − D1[i − 1]) 0s followed by a 1 for i > 1. For example, D1 = (1, 1, 2, 3, 5) is encoded into Inc = 0110101001. Inc


is indexed by the rank/select dictionary for bit strings. D1[i] is recovered as rank0(Inc, select1(Inc, i)): p = select1(Inc, i) returns the position p of the i-th occurrence of 1, and rank0(Inc, p) returns the number of 0s in Inc[1, p], which equals D1[i]. We encode D2[1, n] into an FLD.

Theorem 1. A static dictionary D can be transformed into a VLD of at most n log n + 2n + o(n) bits that accesses any position in D in O(1) time.

Proof. Inc for D1 contains at most 2n 0s and 1s in total. Since Inc is indexed by the rank/select dictionary, it occupies at most 2n + o(n) bits of space. D2 is encoded into an FLD of n log n bits. □
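The encoding of D1 and its recovery can be sketched as follows (a toy version, with naive O(n) scans standing in for the succinct O(1)-time rank/select structures):

```python
# Sketch of the Inc encoding of a weakly increasing sequence D1 and its
# recovery via rank0/select1, following Section 3.2. Not the authors'
# implementation; the helpers below are linear scans for clarity.

def encode_inc(D1):
    """The i-th piece of Inc is (D1[i] - D1[i-1]) 0s followed by a 1."""
    bits, prev = [], 0
    for v in D1:
        bits.append("0" * (v - prev) + "1")
        prev = v
    return "".join(bits)

def access(inc, i):
    """Recover D1[i] (1-based) as rank0(Inc, select1(Inc, i))."""
    ones, p = 0, 0
    for pos, b in enumerate(inc, start=1):   # naive select1(Inc, i)
        if b == "1":
            ones += 1
            if ones == i:
                p = pos
                break
    return inc[:p].count("0")                # naive rank0(Inc, p)

inc = encode_inc([1, 1, 2, 3, 5])
assert inc == "0110101001"                   # the text's example
assert all(access(inc, i) == v for i, v in enumerate([1, 1, 2, 3, 5], 1))
```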

3.3 Variable-Length Dictionary of at Most 3n log σ(1 + o(1)) Bits of Space

As in the previous subsection, a static dictionary D[1, 2n] is decomposed into a weakly increasing subsequence D1[1, n] and the other subsequence D2[1, n] by the spanning tree decomposition, and D1 is encoded into Inc, indexed by the rank/select dictionary. Our encoding scheme for D2 extracts weakly increasing and weakly decreasing subsequences and builds small codes for each of them. We divide the indices [1, n] into σ sets d1, ..., dσ satisfying two conditions: (i) disjointness: di and dj are disjoint if i ≠ j; (ii) weak monotonicity: D2[p1] ≤ · · · ≤ D2[p|dj|] or D2[p1] ≥ · · · ≥ D2[p|dj|] for p1, ..., p|dj| ∈ dj (1 ≤ j ≤ σ) s.t. p1 < · · · < p|dj|. Thus, d1, ..., dσ are disjoint sets of indices of weakly monotonic subsequences of D2[1, n].

Our VLD consists of two strings and a bit string: D̂, π and b (Fig. 1). We present compact codes over the alphabet Σ = {1, 2, ..., σ} for small σ ≪ n. We sort the indices k (1 ≤ k ≤ n) of D2 according to D2[k], and denote the array of sorted indices by DI = [k1, ..., kn] with 1 ≤ ki ≤ n. Using DI, the array DL is built as follows: DL[k] = j iff DI[k] ∈ dj (1 ≤ k ≤ n). We then construct the string D̂ ∈ (Σ ∪ {τ})∗ by introducing a new symbol τ ∉ Σ as follows:

D̂ = τ^{ℓ1} DL[1] τ^{ℓ2} DL[2] · · · τ^{ℓn} DL[n],

where ℓi = D2[DI[i]] − D2[DI[i − 1]] (with D2[DI[0]] = 0). We prepare π ∈ Σ∗ to indicate which set dj the value D2[i] belongs to: π[i] = j iff i ∈ dj for i ∈ [1, n]. In addition, we prepare a bit string b of length σ to indicate whether dj is a weakly increasing or a weakly decreasing subsequence: if b[j] = 0 for j = π[i], D2[i] belongs to a weakly increasing subsequence; otherwise, D2[i] belongs to a weakly decreasing subsequence. D̂ and π are indexed by wavelet trees.

D2[i] is recovered in O(log σ) time by using the rank/select operations as follows.
We check whether D2[i] is a member of a weakly increasing subsequence by accessing b[j] for j = π[i]. If so, k = rankj(π, i) means that π[i] is the k-th occurrence of j ∈ {1, 2, ..., σ} in π. We then compute


Fig. 1. Offline variable-length code: DI is divided into a set of monotonic subsequences, e.g., {d1, d2, d3} with d1 = [4, 6, 7, 8, 9], d2 = [2, 3], and d3 = [5, 1]. Here d1, d2 are increasing and d3 is decreasing; this information is represented by the bit string b. The array D[1, 9] is encoded by (π, D̂, b). We build the wavelet trees for π and D̂ over the alphabet Σ = {1, 2, 3, ∗}. To obtain the variable D[7] = 3, we first check π[7] = 1 and b[1] = 0, i.e., D[7] belongs to the increasing sequence d1. We then obtain D[7] = 3 by rank1(π, 7) = 3, select1(D̂, 3) = 6, and rank∗(D̂, 6) = 3.

D2[i] as rankτ(D̂, selectj(D̂, k)). If D2[i] belongs to a weakly decreasing subsequence, we compute D2[i] as rankτ(D̂, selectj(D̂, ℓ + 1 − k)) for ℓ = rankj(π, n) and the computed k. The access time for each D2[i] is O(log σ).

The size of the VLD is at most 3n log σ(1 + o(1)) bits, because b, the codes for D1, and the codes for D2 occupy σ bits, at most 2n + o(n) bits, and at most 3n log σ bits, respectively. Thus, the size depends on σ. The following lemma lets us estimate σ and thereby bound the size of the VLD.

Lemma 2 (Erdős–Szekeres [6]). Any permutation of [1, n] contains at least one monotonic subsequence longer than √n.

Theorem 2. Dictionary D[1, 2n] can be transformed into a VLD of at most 3n log σ(1 + o(1)) bits. The access time for any member of D is O(log σ) for σ < 2√n.

Proof. By Lemma 1, it suffices to prove that D′ = D[n + 1, 2n] (1 ≤ D′[k] ≤ n) can be divided into at most σ weakly monotonic subsequences d1, ..., dσ such that σ < 2√n. We sort the indexes k of D′ by the value D′[k]. The array of sorted indexes is denoted by DI = [k1, ..., kn] with 1 ≤ ki ≤ n. Because DI is a permutation of [1, n], there exists a monotonic subsequence d1 of DI longer than √n by Lemma 2. Removing d1 from DI, we can take a monotonic subsequence d2 longer than √(n − √n) from the reformed DI. Using the inequality √(α − β) ≥ √α − √β, for the j-th iteration of this process we can take a monotonic subsequence dj longer than αj = √(n − (j − 1)√n). The number k of iterations satisfies k ≤ max{j | n − (j − 1)√n > 0} ≤ √n + 1. Thus, the average length of the subsequences is estimated by

(1/k) Σ_{j=1}^{k} αj ≥ √n − (1/2)(k − 1) ≥ (1/2)√n.


Fig. 2. Online variable-length code: D is divided into the weakly increasing subsequence encoded by Incbit and the other subsequence Other. If Flagbit[i] = 0, D[i] belongs to Incbit; otherwise, it belongs to Other.

Thus, the number of iterations is at most 2√n; that is, σ < 2√n. For each monotonic subsequence k1, ..., kℓ of DI, we have D[ki] ≤ D[kj] provided i < j. Therefore, we can obtain weakly monotonic subsequences d1, ..., dσ such that σ < 2√n. This estimation derives the bound of this theorem. □

If σ is small enough, the size of D is significantly reduced. This method, however, is not useful when the number of variables in D is unknown. Such a situation arises in compressing stream data. We next focus on this problem and present variable-length codes for the online dictionary.
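The iterative extraction in the proof of Theorem 2 can be simulated directly (a sketch with an O(n²) longest-monotonic-subsequence routine; function and variable names are ours): repeatedly remove a longest weakly monotonic subsequence and count the parts, which by the argument above stays below 2√n.

```python
# Greedy decomposition of a permutation into weakly monotonic subsequences,
# mirroring the proof of Theorem 2. Illustrative sketch only.
import math
import random

def longest_monotonic(seq):
    """Indices of a longest weakly monotonic subsequence of seq,
    where seq is a list of (position, value) pairs; O(n^2) DP."""
    def lss(cmp):
        n = len(seq)
        best, prev = [1] * n, [-1] * n
        for i in range(n):
            for j in range(i):
                if cmp(seq[j][1], seq[i][1]) and best[j] + 1 > best[i]:
                    best[i], prev[i] = best[j] + 1, j
        k = max(range(n), key=lambda i: best[i])
        out = []
        while k != -1:
            out.append(k)
            k = prev[k]
        return out[::-1]
    inc = lss(lambda a, b: a <= b)   # weakly increasing
    dec = lss(lambda a, b: a >= b)   # weakly decreasing
    return inc if len(inc) >= len(dec) else dec

def decompose(values):
    """Repeatedly remove a longest weakly monotonic subsequence."""
    items, parts = list(enumerate(values)), []
    while items:
        picked = set(longest_monotonic(items))
        parts.append([items[i] for i in sorted(picked)])
        items = [x for i, x in enumerate(items) if i not in picked]
    return parts

random.seed(0)
perm = random.sample(range(1, 101), 100)       # a permutation of [1, 100]
parts = decompose(perm)
assert len(parts) < 2 * math.sqrt(len(perm))   # sigma < 2 * sqrt(n)
```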

4 Online Variable-Length Dictionary

We present a VLD in the online setting. A text is transformed into a sequence X1, X2, ..., Xn, ... of variables associated with production rules of a straight-line program (SLP).

Definition 1 (Karpinski-Rytter-Shinohara [12]). An SLP is a grammar-based compression (V = {X1, X2, ..., Xn}, Σ) that defines the string over Σ by the following two types of production rules: Xi → a and Xk → Xi Xj, where a ∈ Σ and k > i, j.

An SLP of n variables is represented by a dictionary D. Once a VLD for D is obtained, it must be updatable for the next variable associated with a production rule in O(1) time. Our online VLD is built on the same idea as the VLD in Section 3.2. For a dictionary D, we construct the following two bit strings and an array: Flagbit, Incbit, Other.
− Flagbit is a bit string of length n. If Flagbit[i] = 0, D[i] belongs to the increasing subsequence encoded by Incbit. Otherwise, D[i] belongs to the other subsequence Other.
− Incbit is a bit string. If Flagbit[i] = 0, the i-th substring of Incbit is (D[i] − D[k]) 0s followed by a 1, where k is the largest index such that Flagbit[k] = 0 and k < i. Otherwise, the i-th substring of Incbit is the empty string ε.
− Other is an array. Other[k] is D[i] for the position i such that Flagbit[i] is the k-th 1.
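A toy sketch of this layout (our own simplification, not the authors' implementation): Flagbit, Incbit and Other are plain Python lists, queries scan in O(n) instead of using succinct rank/select structures, and a variable is routed to the increasing side by a simple greedy rule rather than the balanced-tree-based extraction analyzed below.

```python
class OnlineVLD:
    """Append-only dictionary split into Flagbit / Incbit / Other."""

    def __init__(self):
        self.flag = []       # Flagbit
        self.incbit = []     # Incbit (list of 0s and 1s)
        self.other = []      # Other
        self.last_inc = 0    # last value stored on the increasing side

    def append(self, v):
        """Push the next dictionary entry D[i]."""
        if v >= self.last_inc:                 # greedy: extend increasing run
            self.flag.append(0)
            self.incbit.extend([0] * (v - self.last_inc) + [1])
            self.last_inc = v
        else:
            self.flag.append(1)
            self.other.append(v)

    def access(self, i):
        """Recover D[i] (1-based) via rank/select, as described in the text."""
        if self.flag[i - 1] == 1:
            k = self.flag[:i].count(1)         # rank1(Flagbit, i)
            return self.other[k - 1]
        k = self.flag[:i].count(0)             # rank0(Flagbit, i)
        ones = j = 0                           # j = select1(Incbit, k)
        for pos, b in enumerate(self.incbit, start=1):
            if b == 1:
                ones += 1
                if ones == k:
                    j = pos
                    break
        return self.incbit[:j].count(0)        # rank0(Incbit, j)

d = OnlineVLD()
for v in [2, 1, 3, 3, 2, 5]:
    d.append(v)
assert [d.access(i) for i in range(1, 7)] == [2, 1, 3, 3, 2, 5]
```

Each append touches only the tails of the three structures, which is what makes the O(1)-time update of Theorem 3 possible with succinct bit vectors.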


Fig. 3. 2-balanced derivation tree: L0 is the string derived by T = T0, and Li+1 is the concatenation of the variables L(vj) (j = 1, ..., m) for Ti+1 = Ti[v1, ..., vm]

Flagbit and Incbit are indexed by rank/select dictionaries. Grammar-based compression algorithms have to update the current VLD Dn for n variables to Dn+1 in O(1) time for efficiency. Our variable-length codes perform O(1)-time updates by appending a new variable at the current tail of Flagbit, Incbit and Other.

We recover D[i] by using Flagbit, Incbit and Other. If Flagbit[i] = 1, D[i] is computed as Other[rank1(Flagbit, i)]: k = rank1(Flagbit, i) returns the number of 1s in Flagbit[1, i], and this k is exactly the position of D[i] in Other. If Flagbit[i] = 0, we first compute k = rank0(Flagbit, i), the number of 0s in Flagbit[1, i]. Then, D[i] is computed as rank0(Incbit, select1(Incbit, k)), because j = select1(Incbit, k) returns the position j of the k-th 1 in Incbit, and rank0(Incbit, j) returns the number of 0s in Incbit[1, j], which equals D[i].

We consider k-balanced SLPs for deriving an upper bound on the space of our VLD. Let T be the derivation tree of an SLP. Let T[v] denote the tree obtained by replacing the whole subtree rooted at a node v by the single node v. Similarly, we define T[v1, ..., vm] for nodes v1, ..., vm such that no two vi, vj (i ≠ j) lie on a same path. Let yield(v) denote the concatenation of all leaves of the subtree rooted at v, and yield(v1 · · · vm) = yield(v1) · · · yield(vm). If the subtrees rooted at vj, vj+1 are adjacent in this order and yield(v1 · · · vm) = yield(r) for the root r of T, the sequence v1, ..., vm of nodes is called a decomposition of T. The decomposition is called proper if every vj is an internal node of T. If there exists a proper decomposition v1, ..., vm of Ti such that the height of every vj (j = 1, ..., m) is at most k, Ti is called k-balanced. If so, we denote Ti+1 = Ti[v1, ..., vm] for such a longest decomposition v1, ..., vm.
When each Ti (0 ≤ i ≤ d) is k-balanced for T0 = T and Td = r, T is called k-balanced. An example is shown in Fig. 3 for k = 2. We assume that any derivation tree of an SLP is 2-balanced. This assumption is reasonable since the condition is satisfied by several efficient grammar-based compression algorithms [14,15]. Indeed, we improve the memory consumption of such an algorithm by our method in the next section. We show that the size of the proposed VLD is smaller than that of an FLD requiring 2n log n bits of space. For any grammar-based compression G and permutation π : X → X, the renamed G is equivalent to G. Thus, without loss of generality, we can assume the following labeling procedure. If Ti+1 = Ti[v1, ..., vm]


for some i, we assume max{L(v′) | v′ is a descendant of vj} < min{L(v′) | v′ is a descendant of vj+1}, where L(v) denotes the variable of node v such that L(v) ∈ X. This means that the descendants of vj+1 are labeled after labeling all descendants of vj.

Theorem 3. Assuming a 2-balanced derivation tree of an SLP over a constant alphabet Σ, the size of the proposed VLD is at most (7/4) n log n + 4n + o(n) bits of space, where n is the number of variables. Moreover, this VLD is updated in O(1) time for a new production rule.

Proof. Our VLD consists of Flagbit, Incbit and Other. Clearly, the sizes of Flagbit and Incbit are both at most 2n bits. By the definition of SLP and the labeling procedure, any leaf of T in Σ∗ is replaced by a variable such that the occurrence of the leftmost Xi is smaller than that of the leftmost Xj if i < j. Let L1 be the resulting string. The size of this correspondence is O(1) space. We estimate the size of Other for L1. For the production rules Zi → αi (1 ≤ i ≤ n, |αi| = 2), if D[1, 2n] = α1 · · · αn contains an increasing sequence of variables of length k, the remaining variables are encoded by Other, whose size is at most (2n − k) log n bits. We show that k ≥ n/4 and that such a sequence is found by an online algorithm in O(1) update time.

Let T be the 2-balanced derivation tree of L1 and Li+1 = L(v1) · · · L(vm) for some Ti+1 = Ti[v1, ..., vm]. Let x, y be members of the decompositions of Ti+1, Ti, respectively. If z is a child of x and the parent of y, z is called an (i + 1)-th intermediate node. By the labeling procedure, max[Li], i.e., the maximum variable in Li, is smaller than min[Li+1] for any i. It follows that any two increasing sequences in D[1, 2n] indicated by variables in Li and Li+1, respectively, do not overlap. Thus, to obtain k ≥ n/4, it is sufficient to evaluate the length of an increasing sequence for each Li independently. When there is no (i + 1)-th intermediate node, an increasing sequence of length |[Li]| exists in D[1, 2n]. Hence, we assume as the worst case that every member x of the decomposition of Ti+1 has an (i + 1)-th intermediate node y.
We note that x never has two intermediate nodes as its children, since otherwise there would be a proper decomposition longer than the current one. Let x1, ..., xm be the decomposition of Ti+1, yℓ be the intermediate node of xℓ, and z1, ..., z3m be the decomposition of Ti. For these nodes, let min = L(y1) and maxℓ = max{L(x1), ..., L(xℓ), L(y1), ..., L(yℓ)}. Note that min ≤ maxℓ for any ℓ. We show that D[min, maxℓ] contains an increasing sequence of length at least |[Li[1, 3ℓ]]|/2 by induction on ℓ = 1, ..., m. Since only z1, z2, z3 are descendants of x1, this statement is true for ℓ = 1. Suppose the induction hypothesis for some ℓ ≥ 1. Let z3ℓ+1, z3ℓ+2, z3ℓ+3 be the descendants of xℓ+1. Let newℓ = |{L(z3ℓ+1), L(z3ℓ+2), L(z3ℓ+3)} − [Li[1, ..., 3ℓ]]|; that is, newℓ is the number of new variables not appearing in [Li[1, ..., 3ℓ]]. If newℓ = 0, the hypothesis is clearly true for ℓ + 1. Otherwise, there are three cases: 1 ≤ newℓ ≤ 3. In case newℓ = 1, the node z ∈ {z3ℓ+1, z3ℓ+2, z3ℓ+3} having the new variable L(z) is a child of xℓ+1 or yℓ+1. Since L(xℓ+1) > L(yℓ+1), D[min, maxℓ+1] = D[min, L(xℓ+1)] and it has an increasing sequence containing L(z) as its tail. In case newℓ = 2, there exist zi, zj ∈ {z3ℓ+1, z3ℓ+2, z3ℓ+3} having new variables such


that zi, zj are children of xℓ+1 or yℓ+1 and L(zi) < L(zj) for i < j. In this case, D[min, maxℓ+1] = D[min, L(xℓ+1)] has an increasing sequence containing at least half of the new variables. The case newℓ = 3 is estimated analogously. Thus, D[min, maxm] has an increasing sequence L′i of length at least |[Li]|/2. By the labeling procedure, if L(xℓ) < L(xℓ+1), it holds that L(xℓ) + 1 ≤ L(xℓ+1) ≤ L(xℓ) + 2. Since any member of a decomposition of Ti has at most one intermediate node, at least half of the variables are contained in L′i for all i. It follows that k ≥ n/4. We can easily design an online algorithm for obtaining such an increasing sequence. □

5 Experiments

This section evaluates our VLD in the online setting, compared with a previously proposed VLD and an FLD. We used dag_vector, downloadable from [17], as the state-of-the-art VLD, which is an extension of [2], and an STL vector as the FLD. Because the alphabet size cannot be estimated beforehand in the online setting, the code length in the STL vector is fixed at 32 bits. We combined the online LCA [15] with those dictionaries. Most of the memory in the algorithm is consumed by a dictionary and a hashtable. The hashtable is implemented as a standard chain hashtable where values having the same hash key are kept in the same linked list.

We used two real-world text datasets (Table 1). One is the wikipedia (en) data of 5.5 GB in size, 5,442,222,932 characters in length, and 209 distinct characters; we downloaded all currently available Wikipedia text in XML format from http://dumps.wikimedia.org/enwikisource/20120412/. The other is the genome data of 3.1 GB in size, 3,137,162,547 characters in length, and 38 distinct characters, consisting of all 23 human chromosomes downloadable from http://genome.ucsc.edu/. We used dictionary size, total memory size, and compression time as evaluation measures. All experiments were performed on a Linux machine with an 8-core Intel(R) Xeon(R) CPU E7-8837 2.67 GHz with 1 TB memory.

Figure 4 shows the memory usage of our VLD and the memory overhead of the hashtable for increasing text length. Both increased linearly with the length of the text. The memory overhead of the hashtable was about one-third of the memory of our VLD on both wikipedia (en) and genome. This means that minimizing dictionary size is meaningful for scaling up grammar-based compression algorithms. Figure 5 shows the memory usage of the dictionary without the hashtable for increasing text length. The memory usage of our VLD is much smaller than those of dag_vector and the STL vector, and all of them increased linearly with the text length (Figure 5).
Table 2 shows the results on wikipedia (en) and genome. The memory usage of dag_vector was comparable to that of the STL vector on both datasets, consuming 10 GB and 5.1 GB for wikipedia (en) and genome, respectively. This is because dag_vector is based on γ-coding, which is effective only for compressing small values, while in grammar-based compression the variables of the CFG are not always small. In fact, the maximum sizes of variables to compress wikipedia (en)


Table 1. The size (MB), the length, and the number of distinct characters of each dataset

text            size (MB)  length         #alphabet
wikipedia (en)  5,533      5,442,222,932  209
genome          3,199      3,137,162,547  38

[Fig. 4 here: two panels, wikipedia (en) and genome; x-axis: length of text; y-axis: size (MB); curves: dictionary and hashtable overhead]
Fig. 4. Memory usage of our VLD and overhead of the hashtable for increasing text length

and genome were 391,117,827 and 241,134,656, represented in 29 bits and 28 bits, respectively. These were further translated into 56 bits and 54 bits in γ-coding. The memory usage of our VLD was about half that of dag_vector and the STL vector: while the memory usage of our VLD was 3.4 GB and 1.8 GB on wikipedia (en) and genome, respectively, the memory overheads were 5.7 GB and 3.1 GB. The building time of the STL vector was the fastest among the three methods, and that of dag_vector was the slowest. The building time of our VLD was slightly slower than that of the STL vector; our VLD finished building the dictionaries from wikipedia (en) and genome in 4,903 seconds and 2,417 seconds, respectively.

[Fig. 5 here: two panels, wikipedia (en) and genome; x-axis: length of text; y-axis: memory (MB); curves: proposed, dag_vector, STL vector]

Fig. 5. Memory usage of dictionary for increasing the text length


Table 2. Results on wikipedia (en) data (top) and genome data (bottom)

wikipedia (en)
method      dictionary size (MB)  overhead (MB)  time (sec)
proposed    3,367                 5,748          4,903
dag_vector  10,014                12,511         16,960
STL vector  9,401                 11,898         3,125

genome
method      dictionary size (MB)  overhead (MB)  time (sec)
proposed    1,819                 3,114          2,417
dag_vector  5,104                 6,461          6,359
STL vector  5,109                 6,467          1,576

References

1. Barbay, J., Navarro, G.: Compressed Representations of Permutations, and Applications. In: STACS, pp. 111–122 (2009)
2. Brisaboa, N.R., Ladra, S., Navarro, G.: Directly Addressable Variable-Length Codes. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 122–130. Springer, Heidelberg (2009)
3. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theory 51, 2554–2576 (2005)
4. Clark, D.: Compact Pat Trees. PhD thesis, University of Waterloo (1996)
5. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011)
6. Erdős, P., Szekeres, G.: A combinatorial problem in geometry. Compositio Mathematica 2, 463–470 (1935)
7. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. In: SODA, pp. 690–696 (2007)
8. Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Fast q-gram Mining on SLP Compressed Strings. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 278–289. Springer, Heidelberg (2011)
9. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: SODA, pp. 636–645 (2003)
10. Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A Unified Algorithm for Accelerating Edit-Distance Computation via Text-Compression. In: STACS, pp. 26–28 (2009)
11. Jacobson, G.: Space-efficient static trees and graphs. In: FOCS, pp. 549–554 (1989)
12. Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic J. Comp. 4(2), 172–186 (1997)
13. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
14. Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-Index: A Compressed Index Based on Edit-Sensitive Parsing. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 398–409. Springer, Heidelberg (2011)

410

Y. Takabatake, Y. Tabei, and H. Sakamoto

15. Maruyama, S., Sakamoto, H., Takeda, M.: An online algorithm for lightweight grammar-based compression. Algorithms 5(2), 213–235 (2012) 16. Munro, J.I.: Tables. In: Chandru, V., Vinay, V. (eds.) FSTTCS 1996. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996) 17. Okanohara, D.: dag vector, https://github.com/pfi/dag_vector 18. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: SODA, pp. 233–242 (2002) 19. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302, 211–222 (2003) 20. Sakamoto, H., Kida, T., Shimozono, S.: A Space-Saving Linear-Time Algorithm for Grammar-Based Compression. In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 218–229. Springer, Heidelberg (2004) 21. Sakamoto, H., Maruyama, S., Kida, T., Shimozono, S.: A space-saving approximation algorithm for grammar-based compression. IEICE Trans. Inf. Syst. 92(2), 158–165 (2009) 22. Tiskin, A.: Towards Approximate Matching in Compressed Strings: Local Subsequence Recognition. In: Kulikov, A., Vereshchagin, N. (eds.) CSR 2011. LNCS, vol. 6651, pp. 401–414. Springer, Heidelberg (2011) 23. Yamamoto, T., Bannai, H., Inenaga, S., Takeda, M.: Faster Subsequence and Don’tCare Pattern Matching on Compressed Texts. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 309–322. Springer, Heidelberg (2011) 24. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)
