Simpler and efficient LZW-compressed multiple pattern matching Paweł Gawrychowski
July 4, 2012
Paweł Gawrychowski
LZW-compressed multiple pattern matching
July 4, 2012
1 / 20
We consider the standard pattern matching problem.
Pattern matching
Given a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?
Find kjfdkasl in rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv epofikflskdjflskjvnlmnapodierpereporipojdpdaja kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp oieorisdlkweoidssdlkweoidscxmnosdwoioweiwoiwoi eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed ojreoijdkmndkjnfekreopreojkslkdjsapowi2poqwiqp
And move to its natural generalization.
Compressed pattern matching
Given a compressed representation of a text t and a pattern p, does p occur in t? If it does, where is the leftmost occurrence?
As the title suggests, we will actually consider the multiple pattern version.
Compressed multiple pattern matching
Given a compressed representation of a text t and a collection of patterns $p_1, p_2, \ldots, p_\ell$, does any $p_i$ occur in t?
Lempel-Ziv-Welch-like (or LZ78-like) compression methods
Text $t[1..N]$ is split into disjoint blocks $b_1 b_2 \ldots b_n$. Each block is either a single letter or a previously defined block concatenated with a single letter. Used in compress, GIF, TIFF, PDF!
Example text: ababbababababababababaabbbaa
You can see that $n \in \Omega(\sqrt{N})$, so the best possible compression ratio is limited. On the other hand, this method allows very fast (and simple) compression and decompression.
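The block decomposition can be illustrated with a minimal sketch of an LZ78-style parse. The function names `lz78_parse` and `lz78_expand` are hypothetical, and the handling of a text that ends in the middle of a known block (the last block is simply repeated) is an assumption made for this example, not something stated on the slide.

```python
def lz78_parse(text):
    """Split text into blocks. Each block is a pair (parent, letter):
    parent is the 1-based index of an earlier block (0 = empty block)."""
    trie = {}      # (parent block id, letter) -> block id
    blocks = []
    cur = 0        # id of the longest known block matched so far
    for ch in text:
        nxt = trie.get((cur, ch))
        if nxt is not None:
            cur = nxt                      # keep extending the current match
        else:
            blocks.append((cur, ch))       # new block = known block + one letter
            trie[(cur, ch)] = len(blocks)
            cur = 0
    if cur != 0:
        # Assumption: the text ended inside a known block, so repeat that block.
        blocks.append(blocks[cur - 1])
    return blocks


def lz78_expand(blocks):
    """Rebuild the text from the block sequence."""
    words = [""]
    for parent, ch in blocks:
        words.append(words[parent] + ch)
    return "".join(words[1:])


text = "ababbababababababababaabbbaa"
blocks = lz78_parse(text)
print(len(text), len(blocks), lz78_expand(blocks) == text)
```

Since the k-th block has length at most k, a text of length N needs at least about $\sqrt{2N}$ blocks, which is where the $\Omega(\sqrt{N})$ bound comes from.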
$t[1..N]$: the text, which after compression consists of $n$ blocks
$p_1, p_2, \ldots, p_\ell$: the patterns, of total length $M$
LZW-compressed multiple pattern matching
Input: $p_1, p_2, \ldots, p_\ell$ and a sequence of $n$ blocks defining the text t
Output: does any $p_i$ occur in t?
The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities $O(n \log M + M)$ and $O(n + M^2)$.
A year later the second algorithm was improved by Kosaraju, who developed an $O(n + M^{1+\epsilon})$ time solution.
Gawrychowski, SODA 2011
The single pattern version can be solved in $O(n + M)$ time.
If we consider more than one pattern, the situation seems significantly more challenging.
Kida, Takeda, Shinohara, Miyazaki, Arikawa, DCC 1998
The multiple pattern version can be solved in $O(n + M^2)$ time.
Is it possible to narrow the gap between the single and multiple pattern versions?
This paper
The multiple pattern version can be solved in $O(n \log M + M)$ or $O(n + M^{1+\epsilon})$ time.
1. Matches the bounds of Amir et al. and Kosaraju.
2. DOES NOT use any combinatorics on words; reduces the question to simple-to-state data structure problems.
3. The same high-level idea in both algorithms. So, in a certain sense, more uniform than the previously known solutions for a single pattern.
Snippets
A snippet is simply a substring of (any) pattern.
Assume that each block is a snippet. The simplest way to detect an occurrence is to process the blocks from left to right as follows, where the red segment is the current longest prefix of (any) pattern.
Algorithm 1 MULTIPLE-PATTERN-MATCHING($s_1, s_2, \ldots, s_{n'}$)
c ← $s_1$
for k = 2, 3, ..., n' do
    add (c, $s_k$) to P
    c ← prefixer(c, $s_k$)
end for
for all (s, s') ∈ P do
    detector(s, s')
end for
detector($s_1, s_2$)
Given two snippets, check if any pattern occurs in their concatenation.
prefixer($s_1, s_2$)
Find the longest suffix of the concatenation which is a prefix of some pattern.
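The control flow of the algorithm can be made concrete with a naive sketch. This is only an illustration under stated assumptions: snippets are plain Python strings, `prefixer` and `detector` are brute-force stand-ins (not the efficient data structures of the talk), the pattern list is passed explicitly, and `c` starts as the empty string rather than $s_1$ so that a single-block input is also checked.

```python
def prefixer(s1, s2, patterns):
    """Longest suffix of s1 + s2 that is a prefix of some pattern (brute force)."""
    w = s1 + s2
    for length in range(len(w), 0, -1):
        suffix = w[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""


def detector(s1, s2, patterns):
    """Does any pattern occur in the concatenation s1 + s2? (brute force)"""
    w = s1 + s2
    return any(p in w for p in patterns)


def multiple_pattern_matching(snippets, patterns):
    """Process the snippets left to right, collecting the pairs to test."""
    pairs = []
    c = ""  # assumption: start from the empty string instead of s_1
    for s in snippets:
        pairs.append((c, s))
        c = prefixer(c, s, patterns)
    # The pairs are collected first so that all detector queries
    # could be answered together, as in the pseudocode.
    return any(detector(a, b, patterns) for a, b in pairs)


print(multiple_pattern_matching(["a", "b", "c"], ["abc", "bd"]))
```

The invariant is that `c` is always the longest suffix of the text read so far that is a prefix of some pattern, so any occurrence ending in the current snippet lies inside `c` followed by that snippet.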
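Consider detector($s_1, s_2$). Let $P = p_1 \$ p_2 \$ \ldots \$ p_\ell$.
[Figure: a pattern $p_i$ occurring across the boundary between $s_1$ and $s_2$, split into $p_i[1..j]$ and $p_i[j+1..|p_i|]$.]
Consider the situation in the prefix tree $T^r$ and the suffix tree $T$.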
[Figure: $s_1$ and $p_i[1..j]$ located in the prefix tree $T^r$; $s_2$ and $p_i[j+1..|p_i|]$ located in the suffix tree $T$.]
By computing the pre- and post-order numbers, this reduces to preprocessing a collection of M rectilinear rectangles so that given a point we can quickly retrieve (any) rectangle containing it.
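A small sketch may help to see why tree numbering turns subtree conditions into rectangles. It uses a common variant of the idea: each vertex gets its pre-order number and the largest pre-order number in its subtree, so "v lies in the subtree of u" becomes an interval test. The function names and the dictionary-based tree encoding are assumptions made for this example.

```python
def subtree_intervals(children, root):
    """Return (first, last): first[v] is the pre-order number of v and
    last[v] is the largest pre-order number inside the subtree of v."""
    first, last = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        v, finished = stack.pop()
        if finished:
            last[v] = counter - 1
            continue
        first[v] = counter
        counter += 1
        stack.append((v, True))
        for c in reversed(children.get(v, [])):
            stack.append((c, False))
    return first, last


def in_subtree(v, u, first, last):
    """True if v lies in the subtree rooted at u."""
    return first[u] <= first[v] <= last[u]


def rectangle(u, w, numbering_a, numbering_b):
    """Rectangle of all points (x, y) with x below u in tree A and y below w in tree B."""
    first_a, last_a = numbering_a
    first_b, last_b = numbering_b
    return (first_a[u], last_a[u], first_b[w], last_b[w])


def contains(rect, point):
    x1, x2, y1, y2 = rect
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2


tree = {0: [1, 2], 1: [3]}
numbering = subtree_intervals(tree, 0)
print(rectangle(1, 2, numbering, numbering))
```

A pair of vertices, one in each tree, is then a point, and a condition of the form "below u in one tree and below w in the other" is a rectangle containing that point.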
Similarly, for prefixer(s1 , s2 ) we need to preprocess a collection of weighted horizontal segments so that given a vertical segment we can quickly retrieve the heaviest segment it intersects.
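To pin down what the query asks for, here is a brute-force statement of it. The tuple layouts and the name `heaviest_crossed` are assumptions for illustration only; an efficient solution would preprocess the segments rather than scan them.

```python
def heaviest_crossed(horizontal, vertical):
    """horizontal: list of (x1, x2, y, weight) segments.
    vertical: a query segment (x, y1, y2).
    Returns the heaviest horizontal segment the query intersects, or None."""
    x, y1, y2 = vertical
    best = None
    for seg in horizontal:
        hx1, hx2, hy, weight = seg
        if hx1 <= x <= hx2 and y1 <= hy <= y2:
            if best is None or weight > best[3]:
                best = seg
    return best


segments = [(0, 4, 1, 3), (2, 6, 2, 5), (5, 9, 3, 7)]
print(heaviest_crossed(segments, (3, 0, 3)))
```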
Now we can completely forget about the words and focus on those simple data structure problems! We use the usual technique of sweeping from left to right. Whenever a new rectangle appears, we insert an interval into the structure S, and whenever a rectangle ends, we remove its corresponding interval from S. So, S stores a collection of intervals so that we can quickly retrieve (any) interval a given point belongs to.
Trivial solution
Implement S as any balanced search tree to get $O(n \log M + M \log M)$ total time.
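The sweep itself can be sketched as follows. In this illustration S is a plain Python set scanned linearly, standing in for the balanced search tree, so it shows the order of events but not the $O(\log M)$ query time. The function name, the closed-rectangle convention, and the tuple layouts are assumptions.

```python
def stab_rectangles(rectangles, points):
    """rectangles: list of closed rectangles (x1, x2, y1, y2).
    points: list of (x, y).
    Returns, for each point, the index of some rectangle containing it, or None."""
    INSERT, QUERY, DELETE = 0, 1, 2   # order of events sharing the same x
    events = []
    for i, (x1, x2, _, _) in enumerate(rectangles):
        events.append((x1, INSERT, i))
        events.append((x2, DELETE, i))
    for j, (x, _) in enumerate(points):
        events.append((x, QUERY, j))
    events.sort()

    active = set()                    # plays the role of the structure S
    answer = [None] * len(points)
    for _, kind, idx in events:
        if kind == INSERT:
            active.add(idx)
        elif kind == DELETE:
            active.discard(idx)
        else:
            y = points[idx][1]
            for i in active:          # naive scan instead of a tree search
                if rectangles[i][2] <= y <= rectangles[i][3]:
                    answer[idx] = i
                    break
    return answer


rects = [(0, 2, 0, 2), (3, 5, 1, 4)]
print(stab_rectangles(rects, [(1, 1), (4, 0), (5, 4), (6, 2)]))
```

All queries are known in advance here, which is what allows sorting them together with the rectangle endpoints into one event list.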
We want better bounds, though. More precisely, we would like to be linear in either n or M.
$O(n \log M + M)$
The intervals do not cross each other, and we can use this property to replace the balanced search tree by a perfect binary tree, where each update touches just one vertex. Additionally, we exploit the fact that we can process all queries at once.
$O(n + M^{1+\epsilon})$
We increase the out-degree of the tree to $M^\epsilon$. Then the updates become more expensive, but the depth (and so the query time) becomes constant.
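A short back-of-the-envelope calculation shows where the second bound comes from. It assumes a tree with $M$ leaves, $O(M)$ updates in total, and an update cost proportional to the out-degree; these assumptions are not stated explicitly on the slide.

\[
\text{depth} = \log_{M^{\epsilon}} M = \frac{\log M}{\epsilon \log M} = \frac{1}{\epsilon} = O(1),
\qquad
\underbrace{n \cdot O(1/\epsilon)}_{\text{queries}} + \underbrace{M \cdot O(M^{\epsilon})}_{\text{updates}} = O(n + M^{1+\epsilon}).
\]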
Similar ideas work for the second problem, too. To get the whole solution we must fill in some details (for example, we need an efficient way of retrieving the vertices corresponding to the snippets, and, if we do not assume a constant alphabet, a fast implementation of the Aho-Corasick automaton). Nevertheless, all those details boil down to the same ideas as above.
1. Is it possible to achieve $O(n + M)$ time for multiple patterns?
2. What about approximate pattern matching? For example, given $k$, can we detect an occurrence with at most $k$ mismatches faster than in $O(nmk)$?
Questions?