Faster algorithm for computing the edit distance between SLP-compressed strings
Paweł Gawrychowski
Max-Planck-Institut für Informatik
October 24, 2012
We consider the question of computing the edit distance between two strings.
Edit distance
Given two strings a and b, what is the minimum number of insert/delete/replace operations required to transform a into b?
Observation
Computing the edit distance between a and b is equivalent to finding the length of the longest common subsequence of a and b.
a = ATGCCGAC
b = CAGACTAGA
The question is: how quickly can we compute this number?
Use dynamic programming!
For each i and j compute the length of the longest common subsequence of a[1..i] and b[1..j]. This results in a very simple quadratic-time algorithm; more precisely, the running time is O(|a||b|).
Is there anything faster?
The bound can be improved either by using bit parallelism, or by assuming that some parameter (edit distance, number of matches, ...) is small.
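For reference, a minimal sketch of that dynamic program (the function and variable names are mine, not from the slides): cell (i, j) of the table holds the LCS length of a[1..i] and b[1..j], and only one row of the table is kept in memory.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b in O(|a||b|) time."""
    # dp[j] = LCS(a[1..i], b[1..j]) for the row i processed so far.
    dp = [0] * (len(b) + 1)
    for ch_a in a:
        prev_diag = 0                      # LCS(a[1..i-1], b[1..j-1])
        for j, ch_b in enumerate(b, start=1):
            prev_row = dp[j]               # LCS(a[1..i-1], b[1..j])
            dp[j] = prev_diag + 1 if ch_a == ch_b else max(dp[j], dp[j - 1])
            prev_diag = prev_row
    return dp[len(b)]

print(lcs_length("ATGCCGAC", "CAGACTAGA"))  # the example strings from the slides
```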
A natural direction is to consider the case when the input strings a and b are highly compressible. The question is: what method of compression should we assume?
Straight-line programs, or grammar compression
A context-free grammar in Chomsky normal form with exactly one production for each nonterminal, hence generating exactly one string.
Fibonacci words
F0 = 1, F1 = 0, Fn+2 = Fn+1 Fn for all n ≥ 0
F2 = 01
F3 = 010
F4 = 01001
F5 = 01001010
F6 = 0100101001001
F7 = 010010100100101001010
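To make the definition concrete, here is a small sketch (my own encoding, not taken from the paper) that stores an SLP as a dictionary mapping each nonterminal to either a terminal symbol or a pair of nonterminals, and decompresses it; the grammar built below is the Fibonacci-word SLP from the slide.

```python
# Fibonacci-word SLP: F0 -> "1", F1 -> "0", F_{n+2} -> F_{n+1} F_n.
rules = {"F0": "1", "F1": "0"}
for i in range(2, 8):
    rules[f"F{i}"] = (f"F{i-1}", f"F{i-2}")

def expand(symbol: str, rules: dict) -> str:
    """Decompress the string derived by `symbol` (its length may be exponential
    in the size of the grammar, so this is only for small examples)."""
    rule = rules[symbol]
    if isinstance(rule, str):       # terminal rule X -> c
        return rule
    left, right = rule              # binary rule X -> Y Z
    return expand(left, rules) + expand(right, rules)

print(expand("F7", rules))  # 010010100100101001010
```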
Why this compression method?
1. as powerful as all Lempel-Ziv schemes (i.e., LZ77 can be converted into an SLP with only a small increase in size),
2. easy to work with!
Problem
We are given two strings a and b described by two SLPs.
n is the total number of rules in both SLPs.
N is the total length of both words.
How quickly can we compute LCS(a, b)?
Whenever we are working with compressed data, the goal is to achieve a running time that depends only on the size of the compressed representation. Unfortunately, in our case this is not possible.
Lifshits and Lohrey MFCS 2006
Checking whether a is a subsequence of b is NP-hard.
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin, Journal of Mathematical Sciences 158(5), 759–769 (2009)
O(nN^1.5) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009
O(n^1.4 N^1.2) is possible!
Tiskin SODA 2010
Well, actually O(nN log N) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009 (full version)
...finally, O(nN log(N/n)) is possible.
This paper
An O(nN·√(log(N/n))) time algorithm.
The aforementioned papers actually solve a more general weighted version of the problem in the same complexity, for any rational scoring function. The same is true for the improvement, but today we will focus just on the unweighted case.
High-level idea
Same as in the previous solutions: look at the alignment dag...
[Figure: the alignment dag of two example strings, with the characters of one string labelling the rows and the characters of the other labelling the columns.]
... and notice that if the strings are highly repetitive, many fragments of the dag look somehow similar, and we can avoid duplicating work.
[Figure: two fragments of the dag that look the same, spanned by nonterminals X1, ..., X6 and X′1, ..., X′6 over the same substring x.]
To make life simple, we want to partition the whole dag into blocks corresponding to pairs of nonterminals.
Theorem
Given an SLP and a parameter x, we can (quickly) construct a new SLP of roughly the same size with the following two properties:
1. each nonterminal describes a string of length at most x,
2. the original string can be represented as a concatenation of roughly N/x new nonterminals.
x will be chosen later.
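A naive sketch of the cutting step only (my own code, assuming the grammar is encoded as in the earlier sketch): it walks the parse tree and cuts at the highest nonterminals deriving strings of length at most x, so every piece has length at most x and their concatenation is the original string. The stronger guarantee of roughly N/x pieces needs the restructuring claimed by the theorem, which is not reproduced here.

```python
def derived_lengths(rules: dict) -> dict:
    """Length of the string derived by each nonterminal, computed with memoization."""
    memo = {}
    def length(symbol):
        if symbol not in memo:
            rule = rules[symbol]
            memo[symbol] = 1 if isinstance(rule, str) else length(rule[0]) + length(rule[1])
        return memo[symbol]
    for symbol in rules:
        length(symbol)
    return memo

def decompose(start: str, rules: dict, x: int) -> list:
    """Cover the derived string, left to right, by maximal nonterminals of length <= x."""
    size = derived_lengths(rules)
    pieces, stack = [], [start]
    while stack:
        symbol = stack.pop()
        if size[symbol] <= x:
            pieces.append(symbol)           # cut here: short enough to be a block boundary
        else:
            left, right = rules[symbol]     # descend; push right first so left comes out first
            stack.append(right)
            stack.append(left)
    return pieces

# e.g. decompose("F7", rules, x=5) with the Fibonacci grammar from the earlier sketch
```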
We want to compute only the values on the boundaries between blocks. Consider the situation around a single block.
[Figure: a single block with inputs I−6, ..., I0, ..., I6 along its left and top boundaries and outputs O0, ..., O12 along its bottom and right boundaries.]
We need to take a closer look at how the values on the right and bottom boundary (outputs) depend on the values on the left and top boundary (inputs).
Let H_{α,β}(i, j) be the highest score of a path from the i-th input to the j-th output of the block corresponding to words α and β. Then the matrix H_{α,β} has a very simple structure:
H_{α,β}(i, j) = j − i − P^Σ(i, j),
where P^Σ is the distribution matrix of a permutation matrix P.
[Figure: the nonzeros of the permutation matrix P, marked by ones, together with a position (i, j).]
Hence each H_{α,β} can be represented in a succinct form by simply storing the nonzeros of the permutation matrix. The question is how to compute this representation efficiently.
Theorem (Tiskin SODA 2010)
Given the representations of H_{α′,β} and H_{α″,β}, we can compute the representation of H_{α′α″,β} in O(x log x) time, where |α′|, |α″|, |β| ≤ x.
Note that H_{α′α″,β} is simply the max-product of H_{α′,β} and H_{α″,β}.
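For intuition only, here is what the max-product means when written out directly on explicit matrices (a hypothetical helper of my own, not the O(x log x) composition from the theorem, which works on the succinct permutation representation instead):

```python
def max_plus_product(A, B):
    """Naive (max,+) product: C[i][k] = max_j (A[i][j] + B[j][k]).

    This spells out the operation the theorem performs implicitly when it
    composes H_{alpha',beta} with H_{alpha'',beta}; it takes cubic time and is
    meant only as a reference for checking the fast algorithm on small inputs.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    assert all(len(row) == inner for row in A), "inner dimensions must agree"
    return [[max(A[i][j] + B[j][k] for j in range(inner)) for k in range(cols)]
            for i in range(rows)]
```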
Theorem
Given the representation of H_{α,β} and the vector u storing the input values, we can compute the vector v storing the output values in O(x) time, where |α|, |β| ≤ x.
Note that v is simply the max-product of H_{α,β} and u. This assumes the word RAM model with w ∈ Ω(log n).
Fast matrix-vector multiplication
We reduce the question to a very simple data structure problem, which is to maintain an array of length x under two operations:
1. decreasing all values in some prefix of the array by one,
2. computing the maximum of the whole array.
This can be reduced further to an even simpler problem, which is to maintain a partition of [1, x] into a number of disjoint segments under the following two operations:
1. locating the segment a given element belongs to,
2. merging two adjacent segments.
This is known as the interval-union-find problem, and a (not very complicated) amortized constant-time solution is known.
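A minimal sketch of such a structure (my own code): the partition of [1, x] is kept in a union-find forest with path compression, each root remembers the right endpoint of its segment, and merging always joins a segment with the one immediately to its right. Plain path compression gives amortized O(log x) per operation rather than the amortized constant time mentioned above, which needs the specialized interval-union-find structure.

```python
class IntervalUnionFind:
    """Partition of {1, ..., x} into segments of consecutive integers."""

    def __init__(self, x: int):
        self.parent = list(range(x + 1))   # union-find forest over 1..x (index 0 unused)
        self.right = list(range(x + 1))    # right[root] = right endpoint of that segment

    def _find(self, i: int) -> int:
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[i] != root:      # path compression
            self.parent[i], i = root, self.parent[i]
        return root

    def segment_end(self, i: int) -> int:
        """Identify the segment containing i by its right endpoint."""
        return self.right[self._find(i)]

    def merge_with_next(self, i: int) -> None:
        """Merge the segment containing i with the adjacent segment to its right."""
        end = self.segment_end(i)
        if end + 1 < len(self.parent):
            # The right segment's root becomes the root of the merged segment,
            # so its stored right endpoint is already correct.
            self.parent[self._find(i)] = self._find(end + 1)
```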
Now we only have to combine those two theorems and compute the total running time, which turns out to be proportional to
n²·x·log x + N²/x.
Choosing x = f/√(log f), where f = N/n, gives log x ≤ log f, and hence we get the following bound on the running time:
n²·f·√(log f) + (N²/f)·√(log f) = 2nN·√(log f) = 2nN·√(log(N/n)).
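As a sanity check on this choice of x (my own restatement, ignoring constant factors): the two terms balance when
\[
  n^2 x \log x = \frac{N^2}{x}
  \iff x^2 \log x = \Bigl(\frac{N}{n}\Bigr)^{2} = f^2
  \iff x = \frac{f}{\sqrt{\log x}} \approx \frac{f}{\sqrt{\log f}},
\]
and log x = Θ(log f) for this value of x, so rounding the choice to x = f/√(log f) changes the bound only by a constant factor.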
1. Can we avoid bit manipulation?
2. Can we get rid of the annoying √(log(N/n)) factor?
Questions?