The 24th Workshop on Combinatorial Mathematics and Computation Theory

The Application of Convolution to Suffix to Prefix Rule for the Exact String Matching Problem

Zhong He Chen and R. C. T. Lee
Department of Computer Science and Information Engineering
National Chi Nan University, Puli, Nantou Hsien, 545, Taiwan ROC
[email protected] [email protected]

Abstract

In this paper, we consider the exact string matching problem. We first point out a rule, called the suffix to prefix rule, which can be used to avoid the brute-force sliding window approach. The Backward Nondeterministic Matching algorithm, Backward Oracle algorithm and Reverse Factor algorithm all use this rule. To implement this rule, we have to find the longest suffix of text T which is equal to a prefix of pattern P. In this paper, we point out that convolution can be used to do this. As can be seen, the convolution technique is easy to understand and easy to program.

1 The Exact String Matching Problem

1.1 Background

String matching is a classical problem in computer science and can be applied to encryption, compression, DNA sequence analysis, image processing, and so on. Search engines such as Yahoo and Google, and sites such as Wikipedia, also apply the technique. There are many algorithms for solving the exact string matching problem, such as the MP algorithm [9], the KMP algorithm [6], the Boyer-Moore algorithm [11], the Smith algorithm [13], the Backward Nondeterministic Matching algorithm [10], the Horspool algorithm [5], and the Shift-And algorithm [15].

1.2 Terminology

Definition 1.1: Match
Given a text T = t1t2t3…tn and a pattern P = p1p2p3…pm, if there exists a location i in T such that titi+1…ti+m-1 is equal to p1p2p3…pm, we say that P has a match at location i in T.

For example, suppose that we are given a text T = ATTGCCAATGCCACCA and a pattern P = TGCCA.

T = ATTGCCAATGCCACCA
P =   TGCCA

As the example above shows, P has a match at location 3 in T.

Definition 1.2: Exact String Matching Problem
Given a text T = t1t2t3…tn and a pattern P = p1p2p3…pm, find all locations i in T such that titi+1…ti+m-1 is equal to p1p2p3…pm.

For example, consider again the text T = ATTGCCAATGCCACCA and the pattern P = TGCCA.

T = ATTGCCAATGCCACCA
P =   TGCCA

T = ATTGCCAATGCCACCA
P =         TGCCA

In this case, P has two matches, at locations 3 and 9.

1.3 A Brute-Force Algorithm

It is easy to design a brute-force algorithm to solve the exact string matching problem. Suppose that we are given a text of length n and a pattern of length m. A brute-force algorithm checks all n-m+1 locations in text T at which P could possibly match. A simple brute-force algorithm is as follows.

Algorithm Brute-Force Algorithm for Exact String Matching
Input: A text T = t1t2…tn and a pattern P = p1p2p3…pm.
Output: All locations i in T such that titi+1…ti+m-1 is equal to p1p2p3…pm.

for i = 1 to n - m + 1 do
  if P1,m is equal to Ti,i+m-1 then
    Report that P appears at position i in T;
  endif
endfor
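The brute-force procedure above can be sketched in Python as follows. This is only an illustrative sketch: the function name `brute_force_match` is hypothetical, and positions are reported 1-based to agree with the definitions in the text.

```python
def brute_force_match(text: str, pattern: str) -> list[int]:
    """Return every 1-based location at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try each of the n - m + 1 candidate locations in turn.
    for i in range(n - m + 1):
        # Compare the window text[i .. i+m-1] with the whole pattern.
        if text[i:i + m] == pattern:
            matches.append(i + 1)  # convert to a 1-based location
    return matches


print(brute_force_match("ATTGCCAATGCCACCA", "TGCCA"))  # [3, 9]
```

The window advances by exactly one position per iteration, which is the behavior that the rule introduced in the next section tries to improve on.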


The brute-force algorithm essentially slides the pattern to the right, one step at a time. In other words, this straightforward approach compares the pattern P with every substring of T of length m. Almost all exact string matching algorithms avoid comparing P with all of these substrings. In the following, we shall introduce a rule which can be used to move P more than one step to the right.

2 The Suffix to Prefix Rule

In this section, we introduce the suffix to prefix rule. First we give two definitions.

Definition 2.1 Given a string X = x1x2…xn, S is called a suffix of X if S = xixi+1…xn for some i ∈ {1, …, n}. For example, let X = abc. Then “abc”, “bc” and “c” are all suffixes of X.

Definition 2.2 Given a string X = x1x2…xn, S is called a prefix of X if S = x1x2…xi for some i ∈ {1, …, n}. For example, let X = abc. Then “abc”, “ab” and “a” are all prefixes of X.

In almost every exact string matching algorithm, we open a window in T of the same size as the pattern P, as shown in Fig. 2.1. If the window exactly matches the pattern, a match is found; otherwise, we have to move the pattern to the right. It is desirable to slide the pattern to the right by as many steps as possible.

[Fig. 2.1]

There are many methods to determine the number of steps to slide the pattern. In this paper, we shall use a rule, which may be called the suffix to prefix rule. Consider Fig. 2.2. Let U be the longest suffix of the window of T which is also a prefix of P. If U is the whole window, an exact match is found; otherwise, we may slide the pattern P such that the prefix of P coincides with U, as shown in Fig. 2.3.

[Fig. 2.2]

[Fig. 2.3]

One may ask why we may slide the pattern in such a way. Suppose that P slides by a smaller number of steps than suggested by the suffix to prefix rule and matches there, as shown in Fig. 2.4. Then we would have a suffix U’ of the window of T which is also a prefix of P and which is longer than U. This is impossible, because U is the longest such suffix.

[Fig. 2.4]

Now, we give an example. Suppose that we are given a text T = acbabbaccbaac and a pattern P = bacc. In Fig. 2.5, the window is “acba” and U is “ba”, so we can shift the window to the right by m - |U| = 4 - 2 = 2 steps.
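A minimal Python sketch of sliding a window with the suffix to prefix rule is given below. It finds the longest suffix by direct comparison rather than by any of the specialised methods discussed in this paper. The function names are hypothetical, and the treatment of a full match (falling back to the longest overlap shorter than m so that the window still advances) is an assumption added for the sketch.

```python
def longest_suffix_prefix(window: str, pattern: str, limit: int) -> int:
    """Length of the longest suffix of window, at most `limit` long,
    that is also a prefix of pattern (0 if there is none)."""
    for length in range(min(limit, len(window), len(pattern)), 0, -1):
        if window[-length:] == pattern[:length]:
            return length
    return 0


def suffix_to_prefix_search(text: str, pattern: str) -> list[int]:
    """Return all 1-based match locations, sliding by m - |U| each time."""
    n, m = len(text), len(pattern)
    matches = []
    start = 0  # 0-based start of the current window
    while m > 0 and start <= n - m:
        window = text[start:start + m]
        u = longest_suffix_prefix(window, pattern, m)
        if u == m:
            matches.append(start + 1)
            # Assumption: after a match, use the longest overlap shorter
            # than m, otherwise the shift m - u would be zero.
            u = longest_suffix_prefix(window, pattern, m - 1)
        start += m - u
    return matches


print(suffix_to_prefix_search("acbabbaccbaac", "bacc"))  # [6]
```

On the example text, the first window “acba” yields an overlap of length 2, so the window moves by two positions, as in the example above.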


[Fig. 2.5]

An algorithm based on the suffix to prefix rule is as follows:

Algorithm SuffixToPrefixRuleExactMatching
Input: A text T of length n and a pattern P of length m
Output: All occurrences of P in T
Step 1. Let the start position of the window of length m be 1.
Step 2. If the start position of the window is at most n-m+1, go to Step 3; otherwise, end this algorithm.
Step 3. Find the longest suffix X of the window which is also a prefix of P, and go to Step 4.
Step 4. If the length of X is m, report a match and, so that the window still advances, let X instead be the longest suffix of the window of length less than m which is also a prefix of P. Go to Step 5.
Step 5. Let the start position of the window = the start position of the window + m - length of X. Go to Step 2.

This suffix to prefix rule has been used by several researchers to design exact string matching algorithms such as the Reverse Factor algorithm [8], the Backward Nondeterministic Matching algorithm [10] and the Backward Oracle Matching algorithm [1]. The problem is how to find the longest suffix U which is equal to a prefix of P. The Reverse Factor algorithm uses an automaton to find it. The Backward Nondeterministic Matching algorithm uses bit-parallelism, and the Backward Oracle Matching algorithm uses a data structure called an “oracle”. In this paper, we shall show that convolution can easily be used to find such a longest suffix U.

3 Convolution

Convolution is a technique widely used in communication [2], [4], [7], [12]. For the application of convolution to string matching, consult [14].

Definition 3.1 Convolution in the Discrete Case
Let X = x1x2…xm and Y = y1y2…yn, and let ⊗ and ⊕ be two functions. The convolution of X and Y is Z = z0, z1, …, zn+m, where

$$z_k = \bigoplus_{i+j=k} x_i \otimes y_j, \qquad k = 0, 1, \ldots, m+n.$$

To find the convolution of two strings X and Y, we define ⊗ as follows:

$$x_i \otimes y_j = \begin{cases} 1 & \text{if } x_i = y_j, \\ 0 & \text{if } x_i \neq y_j, \end{cases}$$

and ⊕ is just the addition function. Besides, we always compute the convolution of X and the reverse of Y, as shown below:

[Fig. 3.1]

Then we get the result of the convolution of T and P as shown in Fig. 3.2.
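The string convolution of Definition 3.1 can be sketched in Python as follows, where the product of two characters is 1 when they are equal and 0 otherwise, the products are summed, and the second string is reversed first. The function name `match_convolution` is hypothetical, and the sketch uses 0-based indices, so it returns m + n - 1 values rather than following the index range of the definition literally. The strings in the usage line are chosen purely for illustration.

```python
def match_convolution(x: str, y: str) -> list[int]:
    """Convolve x with the reverse of y, counting equal characters.

    z[k] is the number of index pairs (i, j) with i + j == k such that
    x[i] equals the j-th character of the reversed y.
    """
    y_reversed = y[::-1]
    z = [0] * max(len(x) + len(y) - 1, 0)
    for i, x_char in enumerate(x):
        for j, y_char in enumerate(y_reversed):
            if x_char == y_char:  # the "product" is 1 for equal characters
                z[i + j] += 1     # the "sum" is ordinary addition
    return z


print(match_convolution("cbac", "bacc"))  # [1, 1, 0, 1, 3, 0, 0]
```

Each entry of the result corresponds to one relative alignment of the two strings and counts the positions at which they agree in that alignment. This direct double loop takes time proportional to m times n.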


[Fig. 3.2]

We show the meaning of the values in Fig. 3.3.

[Fig. 3.3]

4 The Suffix to Prefix Rule Implemented by Convolution

In this section, we give an example to explain how to use convolution to find the longest suffix of the window of T which is also a prefix of P.

4.1 Using Convolution to Find the Longest Suffix U Which Is a Prefix of P

Here, we give an example of convolution. Suppose that we are given a text T = cbac and a pattern P = bacc. The convolution of T and P is shown in Fig. 4.1.

[Fig. 4.1]
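As a rough sketch of what is computed for this example, the Python code below counts, for each overlap length i, the positions at which the last i characters of T agree with the first i characters of P. A suffix of length i is a prefix of P exactly when that count equals i. The function names are hypothetical, and the counts are computed directly rather than through a general convolution routine.

```python
def overlap_match_counts(text: str, pattern: str) -> list[int]:
    """counts[i - 1] is the number of agreeing positions when the last
    i characters of text are aligned with the first i of pattern."""
    limit = min(len(text), len(pattern))
    counts = []
    for i in range(1, limit + 1):
        suffix = text[len(text) - i:]
        prefix = pattern[:i]
        counts.append(sum(a == b for a, b in zip(suffix, prefix)))
    return counts


def longest_suffix_via_counts(text: str, pattern: str) -> str:
    """Longest suffix of text that is also a prefix of pattern."""
    best = 0
    for i, count in enumerate(overlap_match_counts(text, pattern), start=1):
        if count == i:  # all i aligned positions agree
            best = i
    return text[len(text) - best:]


print(overlap_match_counts("cbac", "bacc"))       # [0, 0, 3, 1]
print(longest_suffix_via_counts("cbac", "bacc"))  # bac
```

For T = cbac and P = bacc, only the overlap of length three has a count equal to its length, so the longest suffix is “bac”.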


We call the lowest column the “F-vector”, as indicated in Fig. 4.2.

[Fig. 4.2]

If the i-th value of the F-vector, counted from right to left, is exactly equal to i, we know that the suffix of text T of length i is also a prefix of pattern P. In the example above, the third value of the F-vector is three, as shown in Fig. 4.3.

[Fig. 4.3]

So, we know that “bac” is a suffix of T of length three which is also a prefix of pattern P. In this way, we can find all the suffixes of T which are also prefixes of pattern P. Thus, we can also find the longest suffix of T which is also a prefix of pattern P by using convolution.

Algorithm FindTheLongestSuffixU
Input: A string X = x1x2x3…xm and a string Y = y1y2y3…ym.
Output: The longest suffix U of X which is also a prefix of Y.
Step 1. Do a convolution of X and Y and get the F-vector. Go to Step 2.
Step 2. Set i = 1 and let |U|, the length of U, be 0.
Step 3. If i ≤ m, go to Step 4. Otherwise, go to Step 5.
Step 4. If the i-th value of the F-vector, counted from right to left, is exactly equal to i, set |U| = i. Let i = i+1 and go to Step 3.
Step 5. Report U = xm-|U|+1 xm-|U|+2 xm-|U|+3 … xm-1 xm.

4.2 The Bit-Pattern Approach

The Bit-Pattern Approach can be seen as a modified convolution. It can be used to find the longest suffix of text T which is also a prefix of pattern P. Given a string X = x1x2x3…xn and a character σ, the σ-bit pattern of X is defined as b1b2b3…bn where bi = 1 if xi = σ and bi = 0 otherwise. For example, let X = bacc.
The a-bit pattern of X is 0100.
The b-bit pattern of X is 1000.
The c-bit pattern of X is 0011.

We give an example first. Suppose that we are given a text T = cbac and a pattern P = bacc. The bit pattern of T for every character of its alphabet is shown below:
The a-bit pattern of T is 0010.
The b-bit pattern of T is 0100.
The c-bit pattern of T is 1001.

The convolution of T and P is shown in Fig. 4.4. We can observe that bit patterns can be used to carry out the convolution.

[Fig. 4.4]

Now, we use the AND operation instead of the ADD operation above, as shown in Fig. 4.5.


[Fig. 4.5]
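One possible way to realise the bit-pattern idea with AND operations is sketched below in Python. The shifting scheme, the function name `longest_suffix_via_and`, and the use of integers whose lowest bit corresponds to the first character of P are all assumptions made for illustration. They are not necessarily the layout used in Fig. 4.5.

```python
def longest_suffix_via_and(text: str, pattern: str) -> str:
    """Longest suffix of text that is also a prefix of pattern,
    found with one AND per examined text character."""
    limit = min(len(text), len(pattern))
    full = (1 << limit) - 1

    # pattern_bits[c] has bit q set when pattern[q] == c (0-based q).
    pattern_bits: dict[str, int] = {}
    for q, c in enumerate(pattern):
        pattern_bits[c] = pattern_bits.get(c, 0) | (1 << q)

    # Bit i - 1 of `valid` stays set while overlap length i is possible.
    valid = full
    for d in range(limit):
        c = text[len(text) - 1 - d]  # character at distance d from the end
        # Overlaps of length <= d do not involve this character (low bits
        # set to 1); longer overlaps need pattern[i - 1 - d] == c.
        mask = (pattern_bits.get(c, 0) << d) | ((1 << d) - 1)
        valid &= mask & full

    best = valid.bit_length()  # highest surviving overlap length
    return text[len(text) - best:]


print(longest_suffix_via_and("cbac", "bacc"))  # bac
```

Every character of the text contributes a single shifted bit pattern, and the AND over all of them leaves exactly the overlap lengths at which no character disagrees.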

We give another example where T = abcbc and P = bcbcd, as shown in Fig. 4.6.

[Fig. 4.6]

So, we can use the AND operation, which is faster than the ADD operation in computers, to find the longest suffix of text T which is equal to a prefix of pattern P.

5 Concluding Remarks

In this paper, we propose using convolution to find the longest suffix of the text which is also a prefix of the pattern, as required by the suffix to prefix rule. Since suffix trees and automata are not easy to construct and program, convolution is much easier and simpler.

References

[1] C. Allauzen, M. Crochemore, and M. Raffinot, Efficient experimental string matching by weak factor recognition, in Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, number 289 in Lecture Notes in Computer Science, pp. 212-223, 2001.
[2] A. Bruce Carlson, Paul B. Crilly, and Janet Rutledge, Communication Systems: An Introduction to Signals and Noise in Electrical Engineering, McGraw-Hill, 1986.
[3] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter, Speeding up two string matching algorithms, Algorithmica, Vol. 12(4/5), 1994, pp. 247-267.
[4] M. M. Fischer and M. S. Paterson, String-Matching and other products, SIAM-AMS Proceedings, Vol. 7, 1974, pp. 113-125.
[5] R. N. Horspool, Practical fast searching in strings, Software Practice and Experience, Vol. 10(6), 1980, pp. 501-506.


[6] D. E. Knuth, J. H. Morris Jr., and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput., Vol. 6(1), 1977, pp. 323-350.
[7] H. Kwakernaak and R. Sivan, Modern Signals and Systems, Prentice-Hall, 1991.
[8] T. Lecroq, A variation on the Boyer-Moore algorithm, Theoretical Computer Science, Vol. 92(1), 1992, pp. 119-144.
[9] J. H. Morris Jr. and V. R. Pratt, A Linear Pattern-Matching Algorithm, Report 40, University of California, Berkeley, 1970.
[10] G. Navarro and M. Raffinot, Fast and flexible string matching by combining bit-parallelism and suffix automata, ACM Journal of Experimental Algorithmics, Vol. 5(4), 2000.
[11] R. S. Boyer and J. S. Moore, A Fast String Searching Algorithm, Communications of the ACM, Vol. 20, 1977, pp. 762-772.
[12] M. Selik and R. Baraniuk, Properties of Convolution, The Connexions Project and licensed under the Creative Commons Attribution License, 2006.
[13] P. D. Smith, Experiments with a very fast substring search algorithm, Software: Practice and Experience, Vol. 21(10), 1991, pp. 1065-1074.
[14] B. H. Wu, Convolution and Its Applications to Sequence Analysis, M.D., National Chi-Nan University, 2004.
[15] S. Wu and U. Manber, Fast text searching allowing errors, Communications of the ACM, Vol. 35(10), 1992, pp. 83-91.
