Simpler and efficient LZW-compressed multiple pattern matching

Paweł Gawrychowski

July 4, 2012

We consider the standard pattern matching problem.

Pattern matching: Given a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?
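
As a point of reference, the uncompressed problem can be sketched naively as follows. This is only an illustration: the function name `leftmost_occurrence` is made up here, and positions are 1-based to match the t[1..N] convention.

```python
def leftmost_occurrence(text, pattern):
    """Return the 1-based position of the leftmost occurrence, or None."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare the window of text starting at `start` with the pattern.
        if text[start:start + len(pattern)] == pattern:
            return start + 1
    return None


print(leftmost_occurrence("rokjfdkjncbvd", "kjfd"))
```

Python's built-in `str.find` answers the same question with 0-based positions and -1 for "no occurrence".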

Find kjfdkasl in rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv epofikflskdjflskjvnlmnapodierpereporipojdpdaja kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp oieorisdlkweoidssdlkweoidscxmnosdwoioweiwoiwoi eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed ojreoijdkmndkjnfekreopreojkslkdjsapowi2poqwiqp

And move to its natural generalization.

Compressed pattern matching: Given a compressed representation of a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?

As the title suggests, we will actually consider the multiple pattern version.

Compressed multiple pattern matching: Given a compressed representation of a text t and a collection of patterns p1, p2, ..., pℓ, does any pi occur in t?

Lempel-Ziv-Welch-like (or LZ78-like) compression methods

Text t[1..N] is split into disjoint blocks b1 b2 ... bn. Each block is either a single letter or a previously defined block concatenated with a single letter. Used in compress, GIF, TIFF, PDF!

ababbababababababababaabbbaa

You can see that n ∈ Ω(√N), so the best possible compression ratio is limited. On the other hand, this method allows very fast (and simple) compression and decompression.
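
The block structure described above can be illustrated with a minimal LZ78-style parser. This is a sketch under assumptions, not the exact LZW variant used by any particular tool: the names `lz78_parse` and `expand` are made up, blocks are stored as (parent, letter) pairs with parent 0 standing for the empty block, and a leftover match at the end of the text is emitted as a repeat of an earlier block.

```python
def lz78_parse(text):
    """Split text into LZ78-style blocks, returned as (parent, letter) pairs.

    parent is the 1-based index of an earlier block, or 0 for the empty block.
    """
    trie = {}      # (block index, letter) -> index of the block it defines
    blocks = []    # blocks[i - 1] describes block i
    current = 0    # index of the block matched so far (0 = nothing matched)
    for letter in text:
        nxt = trie.get((current, letter))
        if nxt is not None:
            current = nxt                      # keep extending the match
        else:
            blocks.append((current, letter))   # new block = old block + letter
            trie[(current, letter)] = len(blocks)
            current = 0
    if current:                                # leftover: repeat an old block
        blocks.append(blocks[current - 1])
    return blocks


def expand(blocks):
    """Decode every block back into the string it stands for."""
    decoded = []
    for parent, letter in blocks:
        prefix = decoded[parent - 1] if parent else ""
        decoded.append(prefix + letter)
    return decoded


text = "ababbababababababababaabbbaa"
blocks = lz78_parse(text)
print(expand(blocks))
```

Since every block is at most one letter longer than some earlier block, the i-th block has length at most i, so N ≤ n(n + 1)/2. That is where the Ω(√N) lower bound on n comes from.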

t[1..N]: text, which after compression consists of n blocks
p1, p2, ..., pℓ: patterns of total length M

LZW-compressed multiple pattern matching
Input: p1, p2, ..., pℓ and a sequence of n blocks defining the text t
Output: does any pi occur in t?

The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities O(n log M + M) and O(n + M^2).

A year later the second algorithm was improved by Kosaraju, who developed an O(n + M^(1+ε)) time solution.

Gawrychowski, SODA 2011: The single pattern version can be solved in O(n + M) time.

If we consider more than one pattern, the situation seems significantly more challenging.

Kida, Takeda, Shinohara, Miyazaki, Arikawa, DCC 1998: The multiple pattern version can be solved in O(n + M^2) time.

Is it possible to narrow the gap between single and multiple pattern versions?

This paper: The multiple pattern version can be solved in O(n log M + M) or O(n + M^(1+ε)) time.

1. Matches the bounds of Amir et al. and Kosaraju.
2. DOES NOT use any combinatorics on words; reduces the question to simple-to-state data structure problems.
3. Uses the same high-level idea in both algorithms. So, in a certain sense, it is more uniform than the previously known solutions for a single pattern.

Snippets: A snippet is simply a substring of (any) pattern. Assume that each block is a snippet. The simplest way to detect an occurrence is to process the blocks from left to right as follows, where the red segment is the current longest prefix of (any) pattern.

Algorithm 1 MULTIPLE-PATTERN-MATCHING(s1, s2, ..., sn')
    c ← s1
    for k = 2, 3, ..., n' do
        add (c, sk) to P
        c ← prefixer(c, sk)
    end for
    for all (s, s') ∈ P do
        detector(s, s')
    end for

detector(s1, s2): Given two snippets, check if any pattern occurs in their concatenation.

prefixer(s1, s2): Find the longest suffix of the concatenation which is a prefix of some pattern.
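
To make the control flow of the algorithm above concrete, here is a brute-force sketch in which snippets are plain Python strings and both primitives are implemented naively. This is not the data-structure-based implementation the talk is about; the extra `patterns` argument and the special case for a single snippet are assumptions added only so the example runs on its own.

```python
def prefixer(patterns, s1, s2):
    """Longest suffix of s1 + s2 that is a prefix of some pattern (naive)."""
    joined = s1 + s2
    for start in range(len(joined) + 1):       # try the longest suffix first
        suffix = joined[start:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""


def detector(patterns, s1, s2):
    """True if some pattern occurs in the concatenation s1 + s2 (naive)."""
    joined = s1 + s2
    return any(p in joined for p in patterns)


def multiple_pattern_matching(patterns, snippets):
    """Process the snippets left to right, collecting pairs for detector."""
    if len(snippets) == 1:                     # assumption: not in the slides
        return detector(patterns, snippets[0], "")
    pairs = []
    c = snippets[0]
    for s in snippets[1:]:
        pairs.append((c, s))
        c = prefixer(patterns, c, s)           # current longest pattern prefix
    return any(detector(patterns, a, b) for a, b in pairs)


patterns = ["abcd", "cab"]
print(multiple_pattern_matching(patterns, ["bc", "a", "b", "cd"]))
print(multiple_pattern_matching(patterns, ["bc", "d", "a"]))
```

Collecting the pairs first and answering them afterwards mirrors the two loops of the pseudocode, which is what later allows all detector queries to be processed at once.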

Consider detector(s1, s2). Let P = p1$p2$...$pℓ.

[Figure: the snippets s1 and s2 shown against a pattern pi split as pi[1..j] and pi[j+1..|pi|], with $ separators.]

Consider the situation in the prefix tree T^r and the suffix tree T.

[Figure: the trees T^r and T, with labels s1, s2, pi[1..j], pi[j+1..|pi|] and $.]

By computing the pre- and post-order numbers, this reduces to preprocessing a collection of M rectilinear rectangles so that given a point we can quickly retrieve (any) rectangle containing it.
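
The step from trees to rectangles rests on a standard fact: after numbering the vertices in DFS order, every subtree corresponds to a contiguous range of numbers. The sketch below illustrates this on a toy tree. It is an assumption-laden illustration: it uses the preorder number together with the largest preorder number inside the subtree (a common variant of pre- and post-order numbering), and the names `subtree_ranges` and `in_subtree` are made up.

```python
def subtree_ranges(children, root):
    """Return {v: (first, last)}: the preorder numbers used inside v's subtree."""
    ranges = {}
    clock = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants are numbered; the last number used closes the range.
            ranges[node] = (ranges[node][0], clock - 1)
            continue
        ranges[node] = (clock, None)
        clock += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return ranges


def in_subtree(ranges, u, v):
    """True if v lies in the subtree rooted at u."""
    first, last = ranges[u]
    return first <= ranges[v][0] <= last


tree = {0: [1, 2], 1: [3, 4]}
ranges = subtree_ranges(tree, 0)
print(ranges)
```

With one such numbering per tree, a pair of vertices (one in each tree) becomes a point in the plane, and a pair of subtrees becomes an axis-parallel rectangle, so "both vertices lie in the given subtrees" turns into "the point lies in the rectangle".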

Similarly, for prefixer(s1 , s2 ) we need to preprocess a collection of weighted horizontal segments so that given a vertical segment we can quickly retrieve the heaviest segment it intersects.
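
To pin down what the query means, here is a brute-force reference for the segment problem. It is only a specification-style sketch: the tuple layout (x_left, x_right, y, weight) and the name `heaviest_crossed` are assumptions, and segments are treated as closed.

```python
def heaviest_crossed(segments, x, y_low, y_high):
    """Heaviest horizontal segment crossed by the vertical segment at x.

    segments: list of (x_left, x_right, y, weight).
    Returns the whole tuple of the heaviest crossed segment, or None.
    """
    crossed = [s for s in segments
               if s[0] <= x <= s[1] and y_low <= s[2] <= y_high]
    return max(crossed, key=lambda s: s[3], default=None)


segments = [(0, 5, 1, 10), (2, 8, 3, 7), (6, 9, 2, 99)]
print(heaviest_crossed(segments, 4, 0, 4))
```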

Now we can completely forget about the words and focus on those simple data structure problems! We use the usual technique of sweeping from left to right. Whenever a new rectangle appears, we insert an interval into the structure S, and whenever a rectangle ends, we remove its corresponding interval from S. So, S stores a collection of intervals so that we can quickly retrieve (any) interval a given point belongs to.

Trivial solution: Implement S as any balanced search tree to get O(n log M + M log M) total time.
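
The sweep described above can be sketched as follows for the offline point-in-rectangle problem. To keep the example short, the structure S is just a dictionary scanned linearly, which is an assumption made for readability and does not give the stated running time; a balanced search tree would take its place. The name `stab_points` and the tuple layouts are also made up.

```python
def stab_points(rectangles, points):
    """For each point, return the index of some rectangle containing it, or None.

    rectangles: list of (x1, x2, y1, y2), closed on all sides.
    points:     list of (x, y).
    """
    events = []
    for idx, (x1, x2, _y1, _y2) in enumerate(rectangles):
        events.append((x1, 0, idx))    # kind 0: rectangle appears, insert
        events.append((x2, 2, idx))    # kind 2: rectangle ends, remove
    for idx, (x, _y) in enumerate(points):
        events.append((x, 1, idx))     # kind 1: query, after inserts at the same x
    events.sort()

    active = {}                        # the structure S: index -> (y1, y2)
    answers = [None] * len(points)
    for _x, kind, idx in events:
        if kind == 0:
            active[idx] = rectangles[idx][2:]
        elif kind == 2:
            del active[idx]
        else:
            y = points[idx][1]
            for rect, (y1, y2) in active.items():   # naive scan of S
                if y1 <= y <= y2:
                    answers[idx] = rect
                    break
    return answers


rects = [(0, 4, 0, 4), (5, 9, 2, 3)]
print(stab_points(rects, [(1, 1), (5, 2), (4, 5), (9, 3), (10, 2)]))
```

Ordering the event kinds as insert, query, remove at equal x keeps the rectangles closed on both vertical sides.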

We want better bounds, though. More precisely, we would like to be linear in either n or M.

O(n log M + M): The intervals do not cross each other, and we can use this property to replace the balanced search tree by a perfect binary tree, where each update touches just one vertex. Additionally, we exploit the fact that we can process all queries at once.

O(n + M^(1+ε)): We increase the out-degree of the tree to M^ε. Then the updates become more expensive, but the depth (and so the query time) becomes constant.
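
One possible reading of the binary tree idea is sketched below; it is not taken from the paper, and the class name `NestedIntervalTree` and its methods are hypothetical. The assumption is that the stored intervals never cross (any two are nested or disjoint). Each interval is stored at the lowest vertex whose range contains it, so an update modifies a single vertex, and intervals stored at the same vertex are nested, so checking the outermost one is enough.

```python
import bisect


class NestedIntervalTree:
    """Perfect binary tree over positions 0 .. size-1 (size a power of two)."""

    def __init__(self, size):
        self.size = size
        self.at = {}    # vertex (lo, hi) -> sorted list of (a, -b)

    def _vertex(self, a, b):
        """Lowest vertex whose range [lo, hi] contains [a, b]."""
        lo, hi = 0, self.size - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if b <= mid:
                hi = mid
            elif a > mid:
                lo = mid + 1
            else:
                break   # [a, b] straddles the split point of this vertex
        return lo, hi

    def insert(self, a, b):
        # Sorting by (a, -b) puts the outermost nested interval first.
        bisect.insort(self.at.setdefault(self._vertex(a, b), []), (a, -b))

    def remove(self, a, b):
        self.at[self._vertex(a, b)].remove((a, -b))

    def stab(self, y):
        """Return some stored interval containing y, or None."""
        lo, hi = 0, self.size - 1
        while True:
            stored = self.at.get((lo, hi))
            if stored:
                a, neg_b = stored[0]    # outermost interval at this vertex
                if a <= y <= -neg_b:
                    return a, -neg_b
            if lo == hi:
                return None
            mid = (lo + hi) // 2
            if y <= mid:
                hi = mid
            else:
                lo = mid + 1


tree = NestedIntervalTree(8)
for interval in [(0, 7), (2, 5), (3, 4), (6, 6)]:
    tree.insert(*interval)
print(tree.stab(3))
```

A query walks one root-to-leaf path, so its cost is proportional to the depth; raising the out-degree shortens that path at the price of more work per update, which is the trade-off the second bound refers to.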

Similar ideas work for the second problem, too. To get the whole solution we must fill in some details (for example, we need an efficient way of retrieving the vertices corresponding to the snippets, and, if we do not assume a constant alphabet, a fast implementation of the Aho-Corasick automaton). Nevertheless, all those details boil down to the same ideas as above.

1. Is it possible to achieve O(n + M) time for multiple patterns?
2. What about approximate pattern matching? For example, given k, can we detect an occurrence with at most k mismatches faster than in O(nmk)?

Questions?
