Learning Context Free Grammar Rules from a Set of Programs

Alpana Dubey, Pankaj Jalote, and Sanjeev Kumar Aggarwal
Dept. of Computer Science and Engineering, Indian Institute of Technology, Kanpur - 208016, India
{alpanad,jalote,ska}@iitk.ac.in

Abstract. We present a technique for learning context free grammar rules from positive samples (a set of syntactically valid programs). The work focuses on grammar learning in the programming language domain. The grammar of a programming language is important for developing program analysis and modification tools. Sometimes programs are written in dialects of standard languages, i.e. minor variations of standard languages. Grammars of standard languages are normally available, but grammars of dialects may not be. We propose a technique which infers grammar rules from a given set of programs and an approximate grammar (viz. the grammar of the standard language). Our approach is an iterative approach with backtracking, where each iteration consists of three phases. The first phase learns a set of possible left hand sides of rules from the LR parser. The second phase learns a set of possible right hand sides of rules using the CYK parser. The third phase builds a set of possible rules and adds one of them to the grammar. After a fixed number of iterations, the grammar is checked to see whether it parses all the programs; if all the programs are parsed successfully, the set of rules added to the grammar is returned as a correct set of rules, else the algorithm backtracks and selects another rule. Correctness of the algorithm is proved. A set of optimizations is proposed to make the approach more efficient. The approach and the optimizations are experimentally verified on a set of programming languages.

Keywords: Grammar inference, Context free grammar, LR-parser, LR state machine, CYK-parser.

1 Introduction

The problem of grammar extraction or inference has attracted researchers in NLP and the theoretical domain for many decades, but its importance in the software engineering field has been felt quite recently [19, 20, 18]. The problem of grammar inference is important in the programming language (PL) domain because grammars are used for developing many software engineering tools. Sometimes programs are written in languages which are minor variations of standard languages. This

is due to non-coverage, in the grammar of the standard language, of some features which are frequently needed for a particular software implementation. For generating software engineering tools for those programs we need the grammar of the dialect. Grammars of standard languages are generally available; however, grammars of dialects may not be. For example, C* is a data parallel dialect of C developed by Thinking Machines Corp. for Connection Machine (CM2 and CM5) SIMD processors. Since Thinking Machines Corp. no longer exists, it is very difficult to find the source code or the reference manual for C*. Sometimes the available reference manuals are incomplete; this problem has been discussed by [19]. One way of obtaining the grammar of a dialect is to infer it from the source code of the software. Since programming language grammars can be represented with context free grammars (CFG), this paper focuses on CFG inference.

Most of the results concerning CFG learning are negative. For example, Gold's theory states that no language in the Chomsky hierarchy can be learned from positive samples (a set of valid sentences) alone [10]. A CFG can be learned from positive and negative samples [11], but whether it can be learned in polynomial time is still an open question. Despite these limitations, a great deal of work on grammatical inference has been done, mainly in natural language processing applications and in the theoretical domain [29, 9, 21, 24, 1]. In the theoretical domain, researchers have addressed the problem of inferring small subclasses of context free languages (CFL). Only some efforts have been made in the field of programming languages [26, 20, 14, 16, 23, 4, 5, 7]. A detailed survey of grammar inference attempts can be found in [22, 6]. Attempts in the field of programming languages use some heuristic to guess the grammar rules, with or without user input, and then check the correctness of the rule selection by parsing the given programs with the grammar which includes the proposed rule. Also, some of the above techniques are not grammatical inference techniques, because they do not infer the grammar from input programs; rather, they extract the grammar from other artifacts (such as reference manuals, parser source code etc.) or use a knowledge base.

This paper proposes a technique for inferring grammar rules from positive samples (a set of correct programs). It addresses the problem of syntactic growth in programming languages. Syntactic growth of a PL is the extension of the context free grammar of the base language. A PL grammar can be divided into different parts expressing various parts of a program; for example, in an imperative programming language grammar, one set of nonterminals can express different statements, another set can express different kinds of expressions, etc. The most commonly observed extensions in PL dialects are (these observations may not cover all kinds of growth, but they are the most perceived dimensions along which minor growth happens):

1. New declaration specifiers to support new data types, new scopes etc. For example, RC [25] is a dialect of C which adds safe, region-based memory management to C with three additional type qualifiers, viz. sameregion, parentptr and traditional.
2. New expressions to support new kinds of expressions. For example, C* [3] has max, min, minto, maxto, dimof, rankof etc. as new operators.
3. New keywords to support new statements. For example, Java 1.5 [15] has a statement with keyword enum to support enumerated types, while lower versions of Java do not have this statement.

The above extensions require modifications in the lexical analyzer, to identify tokens corresponding to new terminals (i.e. new types, operators, keywords), and modifications in the parser, to support new grammar rules. Our technique is an iterative technique with backtracking. A set of possible rules is built in each iteration and one among them is added to the grammar. After a certain number of iterations, the modified grammar is checked to see whether it parses all the programs. If the grammar does not parse all the programs, then the algorithm backtracks to a previous iteration and selects another rule. The technique builds upon the previous work of [7], which employed a heuristic based approach and hence does not always guarantee to infer a grammar which parses all the programs. The approach discussed in this paper is deterministic. We also describe ways to reduce the search space of possible rules.

The organization of the paper is as follows. The problem is formalized and notations are discussed in section 2. Section 3 describes our approach for a single missing rule and discusses the correctness of the approach. Section 4 describes the extension of the proposed approach for inferring multiple missing rules. Section 5 proposes some optimizations which are applied in our approach. Section 6 discusses the implementation of the approach. Section 7 discusses our experiences with the approach and section 8 concludes the article.

2 Problem Definition and Notations

The problem of grammar inference under PL growth can be formalized as follows: Given a set of programs P = {w1, w2, ...} and an incomplete grammar G = (N, T, P, S), find a grammar G′ = (N′, T′, P′, S) such that G′ is complete w.r.t. P.

Definition 1. A grammar G′ is complete w.r.t. input programs P if P ⊆ L(G′).

By Gold's theorem [10], no language in the Chomsky hierarchy can be exactly identified using a set of positive samples alone; hence our goal is not to find the "actual" grammar (i.e. GT) of the entire language, as there can be an infinite number of grammars which accept a given set of programs and it is impossible to determine whether L(GT) = L(G′) [13]. We make the following assumptions on G′, based on the observed growth in programming languages:

1. The relationship between G′ and G is as follows: N = N′, T ⊆ T′, P ⊆ P′. G and G′ are epsilon free grammars. The set (T′ − T) is known beforehand and is denoted as Tnew.
2. The types of rules missing in the grammar are constructs which involve new keywords, operators and declaration specifiers. We call these the new terminals of the grammar, which fall in the set Tnew.

3. One rule is sufficient to express the syntactic construct corresponding to each new terminal.
4. Additional rules in G′ are of the form A → α anew β, where A ∈ N, anew is a new terminal (i.e. anew ∈ Tnew) and α, β ∈ (N ∪ T)∗.

Assumption 1 says that the new terminals of the dialect are known beforehand; hence the lexical analyzer does not recognize them as identifiers or as unknown tokens. By the condition N = N′ we assume that the left hand sides (LHSs) of the additional rules in G′ (i.e. the rules in (P′ − P)) are from the set N. If the LHS of an additional rule were a new nonterminal, then it would have to be reachable from the start symbol; this would cause the addition of another rule containing the new nonterminal in its RHS. Assumption 2 enforces the language growth model we discussed earlier. Assumption 3 is based on observations of programming language grammars. For example, consider the grammars of C and C* and the with construct of C*; only one grammar rule is sufficient to express the syntax of the with construct. Assumption 4 is also based on observation of PL grammars: rules corresponding to keywords and declaration specifiers are usually of the form A → anew β, where anew is a keyword or a declaration specifier, and rules corresponding to operators are usually of the form A → α anew β, where anew is an operator.

Since one rule is sufficient to express the syntax corresponding to each new terminal, there exists at least one complete grammar G′ = (N′, T′, P′, S′) such that |P′ − P| = |Tnew|. Our goal is to get a complete grammar G′ which is not more than |Tnew| rules away from G. A parser generated from a complete grammar is called a complete parser, and a parser generated from the approximate grammar (G) is called an approximate parser. In the case of a single missing rule, there can be many rules corresponding to a new terminal which make the grammar complete; we call the set of all such rules the set of correct rules corresponding to the new terminal. Similarly, in the case of multiple missing rules, there can be many sets of rules such that each set makes the grammar complete (each set contains one rule for each new terminal); we call such a set a correct set of rules (note the difference between the terms set of correct rules and a correct set of rules).

An input program is represented as a1,n, where n is the number of tokens in the program; ai (1 ≤ i ≤ n) denotes the ith token of the program and ai,j (1 ≤ i ≤ j ≤ n) denotes the substring which starts at the ith token and ends at the jth token. An LR parser's configuration is a pair whose first component is a stack and whose second component is the remaining input string (to be parsed):

(s0 X1 s1 X2 s2 . . . Xm sm, ai ai+1 . . . an)

This configuration represents the right sentential form X1 X2 . . . Xm ai ai+1 . . . an, where sm denotes the top of the stack. We will use the terms active state and top of the stack interchangeably in our discussion. Other terms and definitions

related to LR parsers, if not defined here, are taken from [2]. We ignore the lookahead part of LR items at places where it is not needed. Since each state corresponds to an itemset, we use a state and its corresponding itemset interchangeably. A CYK parser [17, 30] maintains an upper triangular matrix C of size n × n, where n is the length of the input a1,n. Each entry C[i, j] is called a CYK-cell; it contains all the symbols which can derive the substring ai,j, i.e. A ∈ C[i, j] iff A ⇒∗ ai,j.
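For concreteness, the core CYK computation can be sketched as follows (a minimal Python sketch of a recognizer for a grammar in Chomsky normal form, not the paper's implementation; the paper's table additionally records terminals and handles general rules):

    from collections import defaultdict

    def cyk_table(tokens, unary, binary):
        """Build the upper-triangular CYK table C, where C[(i, j)] is the
        set of nonterminals deriving tokens[i..j] (0-indexed, inclusive).
        `unary` maps a terminal t to {A : A -> t};
        `binary` maps a pair (X, Y) to {A : A -> X Y}."""
        n = len(tokens)
        C = defaultdict(set)
        for i, t in enumerate(tokens):                  # length-1 substrings
            C[(i, i)] |= unary.get(t, set())
        for length in range(2, n + 1):                  # longer substrings
            for i in range(n - length + 1):
                j = i + length - 1
                for k in range(i, j):                   # split point
                    for X in C[(i, k)]:
                        for Y in C[(k + 1, j)]:
                            C[(i, j)] |= binary.get((X, Y), set())
        return C

    # Example: with A -> A B, A -> a, B -> b, the cell for "ab" contains A.
    unary = {"a": {"A"}, "b": {"B"}}
    binary = {("A", "B"): {"A"}}
    print(cyk_table(list("ab"), unary, binary)[(0, 1)])  # {'A'}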

3 Grammar Completion with Single Rule

We first present our basic approach for the case when only a single rule is sufficient to make the grammar complete, that is, when there exists at least one complete grammar which is just one rule away from the initial grammar. We present an approach for inferring a single correct rule from a set of programs; the approach is later extended to multiple rules.

3.1 Overall Approach

First, the input programs are parsed with the parser generated from the approximate grammar. If each program is parsed successfully then the grammar is considered complete and returned intact. Otherwise a correct rule is computed. For determining a correct rule we work with one program at a time. We start with the approximate grammar G and a program a1,n. The basic approach consists of four phases (figure 1): (1) the LHSs gathering phase, (2) the RHSs generation phase, (3) the rule building phase and (4) the rule checking phase. The LHSs gathering phase determines the set (L) of possible left hand sides (LHSs) of the correct rule from the context of the error point. The RHSs generation phase builds a set (R) of possible right hand sides (RHSs) of correct rules using the CYK-parser table. The rule building phase builds the set of possible rules (PR) using L and R. Each rule in PR is checked in the rule checking phase, i.e. whether the grammar, after including the rule, is able to parse all the programs. If all the programs are parsed successfully after including some rule in G, then that rule is returned as a correct rule.

We describe the phases in more detail. For now, we restrict our discussion to determining a correct rule from one program. We assume that the program a1,n cannot be parsed by the approximate grammar G and that G must be augmented. In the next few sections we describe the approach which infers rules of the form A → anew α and later generalize it to rules of the form A → α anew β (A ∈ N, α, β ∈ (N ∪ T)∗ and anew ∈ Tnew).

3.2 LHSs Gathering Phase

The LHSs gathering phase returns a set of nonterminals (L) as possible left hand sides (LHSs) of correct rules. The algorithm for this phase is shown in figure 2. The input to this phase is a program a1,n and the approximate grammar G. First, a1,n is parsed with the LR parser generated from G.

Function INFER_SINGLE_RULE(a1,n, G)
    L ← GATHER_POSSIBLE_LHSs(a1,n, G)
    R ← BUILD_RHSs(a1,n, G, max_length)
    PR ← BUILD_RULES(L, R)
    for all rule ∈ PR do
        Add rule to G
        if modified G parses a1,n then
            return the rule
        else
            Roll back the change in G

Fig. 1. Overall single rule inference approach

Since G is incomplete, the parser will stop at some point. If ai is the first occurrence of the new terminal, the parser will stop after shifting the token ai−1, because G does not have a production containing ai in its RHS. This state of the parser is called the error state. Possible LHSs are collected from the LR-parser stack as follows. The algorithm starts with a set of stack configurations (it is a small modification of the LR parser: instead of a single stack it maintains multiple stacks; the rest of the operations are the same as in the LR parser). Initially this set contains a single stack, i.e. the LR parser stack in the error state. The algorithm iterates until this set becomes empty: it removes a stack from the set and looks at the action table entries corresponding to the top of the stack. If some of the entries corresponding to the top of the stack are reduce entries, then it performs the reductions possible on that state without looking at the next token. We call this operation "forced reduction". In the case of multiple possible reductions, each reduction is performed on a separate copy of the stack. The new stack obtained after performing each reduction is added to the stack set. If the top of the stack has some shift entries, then for all items of type [A → α • B β] in the itemset corresponding to the top of the stack, the nonterminal which occurs after the dot (B) is collected in L. The algorithm stops when the stack set becomes empty, which happens when no more reductions are possible on any of the stacks in the stack set.

If a1,n were parsed with a complete parser, the parser would make a sequence of reductions between the shifts of tokens ai−1 and ai (in some cases there are no reductions between the shifts) and then expand a nonterminal to derive the substring covered by a correct rule. Algorithm GATHER_POSSIBLE_LHSs simulates the behavior of a complete parser by performing all possible reductions. By including all nonterminals that come after the dot in some item of a top itemset, it guesses which nonterminal a complete parser would expand while parsing the remaining string.

Function GATHER_POSSIBLE_LHSs(a1,n, G)
    Parse a1,n with the parser generated from G
    err_stack ← stack corresponding to the error state of the parser
    STACK_SET ← {err_stack}
    L ← {}
    while STACK_SET is nonempty do
        for all stack ∈ STACK_SET do
            STACK_SET ← STACK_SET − {stack}
            I ← itemset corresponding to the top of the stack
            if the top state has some shift entries in the action table then
                for all items of form [A → α • B β] in I do
                    L ← L ∪ {B}
            if the top state has some reduce entries in the action table then
                for all possible reductions r on the top of the stack do
                    s′ ← copy of the stack
                    Apply reduction r on s′
                    STACK_SET ← STACK_SET ∪ {s′}
    return L

Fig. 2. Algorithm for gathering possible LHSs
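Operationally, the stack-set exploration can be sketched as follows (a Python sketch under assumed encodings of the LR tables, not the paper's Java implementation; the `seen` set is a small addition to avoid re-exploring identical stacks, and the uppercase-string convention for nonterminals is ours):

    def gather_possible_lhss(error_stack, action, goto, itemset_of):
        """Sketch of GATHER_POSSIBLE_LHSs over hypothetical LR tables.
        `error_stack`: tuple of states at the error point.
        `action[s]`: entries ('shift', state) or ('reduce', lhs, rhs_len).
        `goto[s][A]`: state reached from s on nonterminal A.
        `itemset_of[s]`: LR items as (lhs, rhs_tuple, dot_position)."""
        lhss, seen = set(), set()
        worklist = [tuple(error_stack)]
        while worklist:
            stack = worklist.pop()
            if stack in seen:
                continue
            seen.add(stack)
            top = stack[-1]
            if any(e[0] == 'shift' for e in action[top]):
                for lhs, rhs, dot in itemset_of[top]:
                    if dot < len(rhs) and rhs[dot].isupper():  # [A -> a . B b]
                        lhss.add(rhs[dot])
            for entry in action[top]:
                if entry[0] == 'reduce':                       # forced reduction
                    _, lhs, rhs_len = entry
                    rest = stack[:len(stack) - rhs_len]
                    worklist.append(rest + (goto[rest[-1]][lhs],))
        return lhss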

3.3 RHSs Generation Phase

The RHSs generation phase determines the possible RHSs of the missing rule. The algorithm for this is shown in figure 3. The algorithm is called with a program, the approximate grammar and a parameter max_length; it builds a set of possible RHSs of correct rules of length no more than max_length. First, the input program is parsed with G to get the (incomplete) CYK-table for the program. The algorithm then uses this CYK-table to build a set of possible RHSs. For building the set of possible RHSs, first the index of the last occurrence of the new terminal is determined; suppose it is i. For each k (i < k ≤ n, where n is the number of tokens in the program) the algorithm computes the set of symbol strings that can derive the substring which starts at token i and ends at token k (ai,k, where a1,n is the input program). Since a correct rule starts with a new terminal, the last occurrence of the new terminal in the program is the point from where the substring derived by the last occurrence of a correct rule starts. Since each index is considered as a possible end point of the substring derived by the last occurrence of a correct rule, the RHSs of correct rules will be built and added to the set of possible RHSs.

The set of symbol strings which can derive a substring ai,j is computed by the BUILD_STRINGS algorithm, which works similarly to the CYK parser (figure 3). First, each cell C[p, p] (i ≤ p ≤ j) is filled with the symbol strings which derive the single token ap,p. Symbol strings which can derive larger substrings are built in a bottom up manner. For computing the symbol strings of length l

which can derive am,m+q, the algorithm iterates over each index k (m ≤ k < m + q): a symbol string of length r (0 ≤ r ≤ l) is picked from the cell C[m, k] and a symbol string of length l − r is picked from the cell C[k + 1, m + q]; these strings are concatenated to get a symbol string of length l (since the algorithm is bottom up, the symbol strings for the substrings am,k and ak+1,m+q have already been computed).

Function BUILD_RHSs(a1,n, G, max_length)
    RHSs ← {}
    i ← index of the last occurrence of the new terminal in a1,n
    for i < k ≤ n do
        RHSs ← RHSs ∪ BUILD_STRINGS(i, k, max_length)
    return RHSs

Function BUILD_STRINGS(i, j, max_length)
    B ← {}
    for i ≤ p ≤ j initialize C[p, p] ← symbols that derive ap,p
    for 1 ≤ q ≤ j − i    ⊲ build symbol strings that derive substrings of larger lengths
        for i ≤ m ≤ j − q
            for m ≤ k < m + q
                for 0 ≤ l ≤ max_length
                    concatenate each symbol string of length r (0 ≤ r ≤ l) from C[m, k]
                    with each symbol string of length l − r from C[k + 1, m + q] and put the result in B
    return B

Fig. 3. Algorithm for RHSs building

For example, to build the symbol strings of length 2 which can derive the substring ai,k, the algorithm considers a break point i1; suppose cell C[i, i1] has symbol strings X1 and X2 and C[i1 + 1, k] has symbol strings Y1 and Y2, then the symbol strings constructed from these two cells will be X1 Y1, X1 Y2, X2 Y1 and X2 Y2.
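A memoized top-down sketch of the same computation (illustrative Python; `cell_symbols` stands in for the CYK-table lookup and is an assumption of this sketch):

    from functools import lru_cache

    def build_strings(i, j, max_length, cell_symbols):
        """Sketch of BUILD_STRINGS: all symbol strings (tuples) of length
        at most max_length deriving tokens i..j.  `cell_symbols(p, q)`
        returns the symbols recorded in CYK cell C[p, q] (each counts as
        a length-1 string); for p == q it should also include the
        terminal a_p itself.  Memoized top-down form of the paper's
        bottom-up computation."""
        @lru_cache(maxsize=None)
        def strings(p, q):
            result = {(s,) for s in cell_symbols(p, q)}
            for k in range(p, q):                 # split point, p <= k < q
                for left in strings(p, k):
                    for right in strings(k + 1, q):
                        if len(left) + len(right) <= max_length:
                            result.add(left + right)
            return frozenset(result)
        return strings(i, j)

    # e.g. with cells recorded by a prior CYK pass:
    cells = {(0, 0): {"a"}, (1, 1): {"b"}, (0, 1): {"A"}}
    print(build_strings(0, 1, 2, lambda p, q: cells.get((p, q), set())))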

3.4 Rule Building and Rule Checking Phase

In the rule building phase, the sets of possible RHSs and LHSs constructed in the previous phases are used to build a set of possible rules (PR); PR is the cross-product of the sets L and R. Each rule in PR is checked in the rule checking phase: a rule is added to the approximate grammar and the modified grammar is checked to see whether it parses all the programs. If the modified grammar parses all the programs after adding some rule, then that rule is returned as a correct rule. The above checking is done using an LR parser. The approach returns as soon as it finds the first correct rule. The set PR may also include some rules which are non-LR: a rule is non-LR if the grammar becomes non-LR after including it; otherwise it is called an LR preserving rule.

The parameter max_length is usually kept equal to the length of the longest RHS of the rules in G. We assume that the length of the correct rule's RHS is not more than max_length. If there does not exist any correct rule whose RHS length is less than or equal to max_length, the algorithm will incrementally build possible rules with larger RHS lengths and check their correctness.

3.5 A Realistic Example

Suppose the rule corresponding to the keyword while is missing in the ANSI C grammar and the program in figure 4(a) is given as input. For gathering possible LHSs, the program is parsed with the parser generated from the incomplete ANSI C grammar. Since while is a new terminal, the parser gets stuck when it reaches the first while. Now we look at the itemset corresponding to the top of the stack (figure 4(b)); say it is:

{[statement → expression SEMICOL • ]}

Since an LR item in the above itemset is of the form [A → β • ], a forced reduction will be performed. After performing the forced reduction we reach the state where the itemset corresponding to the top of the stack contains an item of the form [statement_list → statement_list • statement]. The nonterminal statement will be added to the set of possible LHSs because it occurs after the dot. For building possible RHSs we start from the last occurrence of while and build the set of symbol strings which can derive substrings starting from the last occurrence of while. The resulting symbol strings are shown in figure 4(b). Using the sets of possible LHSs and RHSs, the set of possible rules is built as shown in figure 4(b) (we do not show all possible RHSs, LHSs and itemsets, as these were too many for the real grammar). Each rule in this set is then checked to see whether the grammar is able to parse all the programs after including it. After adding the rule statement → while ( expression ) or the rule statement → while ( expression ) statement, the modified grammar parses all the programs; these two rules are returned as correct rules.

3.6 Proof of Correctness

In this section we prove the correctness of the algorithm by showing that the algorithm always finds a grammar complete w.r.t. a given set of programs. For this we show that a correct rule will always fall in the set of possible rules (PR) built by the algorithm. Since there may exist many correct rules, we pick one correct rule (say r = D → η) and show that D falls in L and η falls in R; therefore, D → η falls in PR. The grammar obtained after adding D → η to G is denoted as G′; G′ is a complete grammar w.r.t. P. The parser generated from a complete grammar (G′) is called a complete parser (or modified parser) and is denoted as ℘G′. The parser generated from G is denoted as ℘G. The state machine generated from G is denoted as

(a) Input program:

    main() {
        int x, y, z, i;
        x=1000; y=90; z=800;
        while(x>500)          ⇐ error point
            x--;
        for(i=0; i < 50; i++)
        {
            while(y > 200)    ⇐ last occurrence of keyword while
                y=y/2;
        }
    }

(b) Set of possible LHSs, RHSs and correct rules:

Itemset corresponding to the top of stack = {[statement → expression SEMICOL • ]}
Itemset after forced reduction = {[statement_list → statement_list • statement]}

Possible LHSs for keyword while = {statement}

Possible RHSs =
(1) while (
(2) while ( id
(3) while ( expression )
(4) while ( expression ) statement
(5) while ( expression ) statement }
(6) while ( expression ) statement } }

Possible rules built =
(1) statement → while (
(2) statement → while ( id
(3) statement → while ( expression )
(4) statement → while ( expression ) statement
(5) statement → while ( expression ) statement }
(6) statement → while ( expression ) statement } }

Correct rules =
(1) statement → while ( expression )
(2) statement → while ( expression ) statement

Fig. 4. Example of rule inference process

M and the state machine generated from G′ is denoted as M′. Let a1,n be the input program, and suppose the substring derived by the first occurrence of rule r starts at index i.

For proving that L always contains D, we first discuss the relationship between ℘G and ℘G′. Suppose ℘G′, while parsing a1,n, makes the sequence of reductions [p1, p2, . . . , pl] (where each pi denotes a production) between the shifts of tokens ai−1 and ai. Since algorithm GATHER_POSSIBLE_LHSs explores a set of reduce sequences with forced reductions, we show that the reduce sequence [p1, p2, . . . , pl] will be explored by GATHER_POSSIBLE_LHSs. It is evident that ℘G, while parsing a1,n, will get stuck after shifting the token ai−1, because ai is a new terminal and does not fall in the terminal set T of G. Suppose the sentential form corresponding to the parser configuration after shifting the token ai−1 is α ai . . . an (each parser configuration (s0 X1 s1 . . . sm, ai ai+1 . . . an) represents the sentential form X1 . . . Xm ai ai+1 . . . an). The derivation α ⇒∗ a1,i−1 contains productions of G alone. Therefore, when a1,n is parsed with ℘G′, the sentential forms through which the parsers ℘G′ and ℘G pass will be the same until they hit the token ai. In other words, the states through which ℘G and ℘G′ pass while parsing a1,n will be equivalent.

Definition 2. A state s in M and a state s′ in M′ are equivalent if there is at least one viable prefix for which both states are valid. (A state s is valid for a viable prefix β if there exists a configuration (s0 X1 . . . Xm s, ai . . . an) where X1 . . . Xm = β.) If both states are valid for prefix β, we say that the states are β-equivalent and denote this as s ≅β s′.

Shift and reduce operations on two equivalent states result in another pair of equivalent states, as stated in the following lemma:

Lemma 1. If s ≅β s′, where s and s′ are states in M and M′ respectively, then the following relationships hold:

1. if action[s, t] = shift si and action′[s′, t] = shift s′i then si ≅βt s′i, ∀t ∈ T; and action[s, t] = error if t = anew.
2. if action[s, t] = ri and action′[s′, t] = r′i then ri = r′i, ∀t ∈ T.
3. goto[s, A] ≅βA goto′[s′, A], ∀A ∈ N.

Proof. 1. Case t ∈ T: since s is valid for the viable prefix β, si is valid for the viable prefix βt. Similarly, s′i is valid for the viable prefix βt. Hence si ≅βt s′i. Case t = anew: since no production in G contains anew in its RHS, action[s, anew] is error.
2. Suppose action′[s′, t] = reduce with C → β2. This implies that there exists a derivation S ⇒∗ β1 C γ ⇒ β1 β2 γ, where β1 β2 = β (as s is valid for the viable prefix β) and t ∈ FIRST(γ), where FIRST(γ) = {a | γ ⇒∗ a δ, for some δ ∈ (N ∪ T)∗}. Since FIRSTG(γ) ⊆ FIRSTG′(γ) and FIRSTG′(γ) contains the additional terminal anew, all reduce entries except the entry corresponding to anew will be the same.
3. The argument is the same as in the first case.

Lemma 2. The reduce sequence made by ℘G′ between the shifts of tokens ai−1 and ai will be covered by the GATHER_POSSIBLE_LHSs algorithm.

Proof. From lemma 1, the sequence of shifts and reduces made by ℘G and ℘G′ will be the same until they reach the token ai, because both parsers will either shift the same token or reduce with the same production (as they pass through equivalent states). Suppose the configuration of ℘G after shifting ai−1 is:

(s0 X1 s1 X2 s2 . . . Xm sm, ai ai+1 . . . an)

and the configuration of ℘G′ after shifting the token ai−1 is:

(s′0 X1 s′1 X2 s′2 . . . Xm s′m, ai ai+1 . . . an)

The parser stacks of ℘G and ℘G′ contain pairwise equivalent states. The top of the stack in ℘G is sm and the itemset corresponding to that state is Im. We prove

by induction on the number of steps of the algorithm that each step guarantees that one of the reduce sequences explored by the algorithm is the same as the reduce sequence made by ℘G′ between the shifts of tokens ai−1 and ai.

1. Basis, k = 1: If Im has items of type [A → α • ], then the algorithm performs a forced reduction for each possible reduction in Im. Since sm and s′m are equivalent states, both have the same set of possible reductions in the first iteration (from lemma 1). Therefore one reduction made by the algorithm will be the same as p1. If Im also has items of type [A → α • D β], then I′m will contain the additional item [D → • η] (this is because of the way itemsets are computed [2]); hence ℘G′ may have a shift action at this configuration. For considering the possibility of a shift operation, the algorithm collects D in the possible LHSs set.

2. Induction, k = n + 1: Suppose one of the reduce sequences explored by the algorithm in the first n iterations is [r1 r2 . . . rn] and it is the same as the first n reductions made by ℘G′ between the shifts of ai−1 and ai (where n < l), i.e. p1 = r1, . . . , pn = rn. We show that one of the reductions performed in the next iteration of the algorithm will be the same as pn+1. Suppose kn is the configuration of ℘G after performing the reductions [r1 r2 . . . rn] and k′n is the corresponding configuration of ℘G′. The top states in kn and k′n are equivalent, because as long as ℘G and ℘G′ perform the same reductions they pass through equivalent states. The possible reductions in kn and k′n (if there are any) are therefore the same, so one of the reductions rn+1 performed in the next step will be the same as pn+1. If the top of the stack in kn has some items of type [A → α • D β], then ℘G′ may have a shift action after the reductions [r1 r2 . . . rn], as the LHS of the correct rule is D. In this case the sequence of reductions made by ℘G′ between the shifts of ai−1 and ai is [r1 r2 . . . rn]; this sequence is already explored by the algorithm, and D is added to the possible LHSs set for considering the possibility that the next action is a shift.

Lemma 3. The LHS of the correct rule (i.e. D) will fall in the set L.

Proof. L is the set of nonterminals collected from the itemsets corresponding to the tops of the stacks during forced reductions. From lemma 2, the reduce sequence [p1 . . . pl] will be explored by the algorithm. Suppose that after making the reductions [p1 . . . pl] the parser ℘G reaches the configuration

(s0 X1 s1 . . . Xl sl, ai ai+1 . . . an)

We show that one of the nonterminals collected from the top itemset Il is D. If none of the nonterminals collected from this configuration were D (the LHS of the correct rule), then the LHS would be determined from some state sj inside the stack. Suppose the itemset corresponding to the state sj has an item [A → γ • D δ]; then the string of symbols Xj+1 Xj+2 . . . Xl (which includes all symbols from the state sj+1 up to the top of the stack) would be appended

before the new terminal in the RHS of the correct rule; i.e. the rule would be of the form D → Xj+1 Xj+2 . . . Xl anew θ (where θ is a symbol string). This contradicts our assumption that η starts with a new terminal. Hence D will be collected from the top itemset and will fall in L.

Lemma 4. The RHS of the correct rule (i.e. η) will fall in R.

Proof. For building the set of possible RHSs, algorithm BUILD_RHSs starts from the index of the last occurrence of the new terminal (say j) and adds to the set R all symbol strings which can derive a substring aj,k (j < k ≤ n). Suppose the substring aj,k is derived by the last occurrence of D → η. No substring derived from another instance of a correct rule can be nested within the substring aj,k, because aj,k is the last and innermost occurrence of a substring derived by D → η. Since the algorithm considers each index k (j < k ≤ n) as a possible end point of the substring derived by a correct rule, it will consider the index k in some iteration. Since the BUILD_STRINGS algorithm constructs all symbol strings that derive a particular substring, η will be constructed while constructing the set of symbol strings that derive aj,k. Hence η will fall in R.

Lemma 5. The algorithm always returns a correct rule.

Proof. From lemmas 3 and 4, the LHS and RHS of a correct rule D → η will be in L and R respectively. Therefore PR will contain D → η. Since the algorithm iteratively checks the correctness of rules, it will select D → η in some iteration and return it as a correct rule.

3.7 Time Complexity

The size of an LR(1) parser for a grammar of size m can be O(2^{c√m}) in the worst case (c > 0 is a constant) [28]; here the size of a grammar is the sum of the lengths of all its productions, where the length of a production B → β is 1 + length(β) and length(β) is the number of symbols in β. Hence the worst case time taken in building an LR(1) parser for a grammar of size m is O(2^{c√m}), and the LHSs gathering phase takes O(n) + O(2^{c√m}) time. The maximum number of possible symbol strings of length at most max_length is v^{max_length}, where v = |N ∪ T|. The time taken in computing the set of symbol strings for one substring is O(n³), because the computations done for larger substrings reuse the computations already done for smaller substrings, in the same way as the CYK parser works; hence the time taken in building all symbol strings is O(n³ × v^{max_length}). The upper bound on the number of possible rules is O(v^{max_length} × |N|) = O(v^{max_length+1}). The correctness of each rule can be checked in O(n + 2^{c√m}) time. Hence the total time taken by the algorithm is O(n) + O(2^{c√m}) + O(n³) × v^{max_length} + O(v^{max_length+1}) × O(n + 2^{c√m}), which is O(v^{max_length+1} × (n³ + n + 2^{c√m})).

In practice, the time taken in building possible RHSs and LHSs is not high, because the worst case occurs only when each substring of the program is derived by each nonterminal, that is, when the grammar is highly ambiguous. The major time spent by the approach is in checking the correctness of a large set of possible rules, i.e. the component v^{max_length+1} in the above equation. We propose some optimizations to reduce this set in a later section.

3.8 Using Multiple Programs

Since the syntax of each new terminal can be expressed with a single grammar rule, we can use the information obtained from multiple programs to reduce the number of possible rules to be checked. The main grammar inference algorithm (figure 1) is changed as follows: the main algorithm calls GATHER_POSSIBLE_LHSs for each program to get a set of possible LHSs from each program, and then computes their intersection to get a reduced set of possible LHSs. Similarly, a set of possible RHSs is computed from each program, and their intersection is taken as the set of possible RHSs. The rule building phase uses the reduced sets of possible LHSs and RHSs to build the set of possible rules, as sketched below.
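A minimal sketch of this intersection step (Python; the set encodings and the tuple representation of RHSs are illustrative):

    def reduced_candidates(lhs_sets, rhs_sets):
        """Intersect per-program LHS and RHS candidate sets (one set per
        program) and build PR = L x R."""
        L = set.intersection(*lhs_sets)
        R = set.intersection(*rhs_sets)
        return {(A, rhs) for A in L for rhs in R}

    # e.g. two programs voting on LHSs and RHSs:
    print(reduced_candidates(
        [{"stmt", "expr"}, {"stmt"}],
        [{("while", "(", "expr", ")")}, {("while", "(", "expr", ")")}]))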

3.9 Inferring a Rule of the Form A → α anew β

The approach discussed previously works if the correct rule is of the form A → anew α (i.e. the RHS of the rule starts with a new terminal). In this section we weaken this restriction through some modifications of the previous approach, presented as modifications of the LHSs and RHSs gathering phases. We discuss the approach with one input program; it can be extended to multiple programs in the same fashion as discussed in the earlier section. This extension has appeared in [8].

LHSs gathering phase. In the modified approach the input program a1,n is parsed with the LR parser generated from the approximate grammar. Unlike the earlier approach, where LHSs were gathered only after the parser arrives at the error state (at a new terminal), here possible LHSs are gathered from each configuration the approximate parser passes through: at each step the approach checks the top itemset of the parser stack, and if it contains an item of type [A → α • B β], then B is added to the set of possible LHSs. Once the parser reaches the error state, it performs forced reductions in the same manner as discussed previously and collects further possible LHSs.

The idea behind collecting possible LHSs from each configuration the parser passes through is as follows. Suppose af is the first occurrence of the new terminal in a1,n; the index f denotes the first and outermost occurrence of the missing rule in the program. Suppose the substring covered by the first occurrence of the missing rule starts at index m (m ≤ f). If the input program were parsed by a complete parser, the parser would start recognizing the substring covered by the missing rule from the mth token. Since the value of m is not known, the modified

approach considers each index i (0 ≤ i < f) as a possible starting point of the substring covered by the missing rule. Therefore, possible LHSs are gathered from each configuration the approximate parser passes through while parsing the input program. In the previous approach the value of m was always equal to f (i.e. the index of the first new terminal), as missing rules always started with a new terminal.

RHSs generation phase. Since the RHS of a correct rule is of the form α anew β (where α, β ∈ (N ∪ T)∗ and anew ∈ Tnew), we can divide a possible RHS into two parts: (1) the part which occurs to the left of the new terminal, i.e. α, and (2) the part which occurs to the right of the new terminal, i.e. β. We build the sets of possible α and possible β separately and then use these two sets to build the set of possible RHSs. The set of all possible symbol strings which occur to the left of the new terminal is denoted as RL; similarly, the set of all possible symbol strings which occur to the right of the new terminal is denoted as RR. For building the sets RL and RR, the input program a1,n is first parsed with the approximate grammar using the CYK parser. Suppose f and l are the indices of the first and last occurrences of the new terminal. For building RL, we consider each index i (1 ≤ i < f) as a possible starting point of the substring derived by the missing rule: for each i, the set of symbol strings that can derive the substring ai,f−1 is computed and added to RL. Similarly, for building RR, each index j (l < j ≤ n) is considered as a possible end point of the substring derived by the missing rule, and all symbol strings that can derive the substring al+1,j are added to RR. The set of possible RHSs R is built by concatenating symbol strings taken from RL and RR as follows:

R = {α anew β | α ∈ RL and β ∈ RR}

Note: the sets RL and RR both additionally contain the empty string ε; this covers the cases when the RHS is of the form anew α or α anew (i.e. the new terminal occurs at the first or last position). The rule building and rule checking phases are the same as in the previous approach.
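The construction of R from RL and RR is a direct cross-concatenation; a minimal sketch (symbol strings encoded as tuples; names are ours):

    def combine_rhss(RL, RR, a_new):
        """R = { alpha a_new beta : alpha in RL, beta in RR }.
        Both sets contain the empty tuple, which yields the forms
        a_new beta and alpha a_new as special cases."""
        return {alpha + (a_new,) + beta for alpha in RL for beta in RR}

    # e.g. RL = RR = {(), ('expr',)} for a new operator token:
    print(combine_rhss({(), ("expr",)}, {(), ("expr",)}, "MAX"))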

4 Grammar Completion with Multiple Rules

In this section we address the situation when more than one rule is required to make the initial grammar complete. A simple extension of the algorithm discussed in the previous section is used: an iterative algorithm with backtracking, where in each iteration a set of possible rules corresponding to a new terminal is built. We first describe the approach which infers rules of the form A → anew α and later extend it to rules of the form A → α anew β.

For systematically building a set of possible rules corresponding to a new terminal, the programs are grouped according to the layout of new terminals in the programs. For each new terminal K two groups of programs, PK and P′K, are made: PK is the set of all programs in which K is the first new terminal and

P′K is the set of all programs in which K is the last new terminal. This grouping is done because the same approach as in the previous section is used for building the set of possible rules. For building a set of possible RHSs of rules corresponding to a new terminal K, we use those programs where K is the last new terminal; hence such programs are grouped together in the set P′K. For building a set of possible LHSs of rules corresponding to K, we use those programs in which K is the first new terminal; hence such programs are grouped together in the set PK. These two groups, PK and P′K, are used to build the set of possible rules (PRK) corresponding to K.

Now we describe the approach for inferring multiple missing rules. Suppose the initial grammar is G = (N, T, P, S), the set of new terminals is Tnew = (T′ − T) = {K1, . . . , Km} and the set of programs is P. The approach iteratively builds a set of possible rules corresponding to each new terminal Ki ∈ Tnew. The iterative approach is described in figure 5. In each iteration, first, a pair of groups (PKi, P′Ki) is made for each new terminal Ki ∈ Tnew. Next, a terminal Ki ∈ Tnew is selected such that the sets PKi and P′Ki are nonempty, and the set of possible rules corresponding to Ki is built using PKi and P′Ki. The set of possible LHSs (LKi) is gathered from the programs in PKi using the algorithm GATHER_POSSIBLE_LHSs. The set of possible RHSs (RKi) is gathered from the programs in P′Ki using the algorithm BUILD_RHSs. The set of possible rules PRKi corresponding to Ki is built by taking the cross product of LKi and RKi. One rule from PRKi is selected and added to G, and Ki is removed from the set of new terminals (Tnew). Since Ki is no longer a new terminal, the input programs are grouped again according to the layout of the modified set of new terminals in the next iteration. The above steps are repeated until a rule corresponding to each new terminal has been added to G. The modified grammar is then checked for completeness: if the grammar is complete w.r.t. P, then the rules added to G in the different iterations are collectively returned as a correct set of rules; otherwise we backtrack to one of the previous iterations and select another rule.

Note that in some iteration the algorithm may not find a new terminal Ki ∈ Tnew such that the sets PKi and P′Ki are both nonempty. In such a case it builds possible rules corresponding to those new terminals Kj for which at least the set P′Kj is nonempty: the set of possible RHSs corresponding to Kj is computed using P′Kj, and the set of possible LHSs is taken to be N (the set of all nonterminals of G).

4.1 Proof of Correctness

The correctness of the INFER_RULES algorithm can be proved by showing that in each iteration the algorithm has enough information for building a set of possible rules corresponding to at least one new terminal, and then using the results of the single missing rule case, discussed earlier, to show that the algorithm always returns a set of grammar rules which makes the grammar complete. Hence we omit the detailed proof.

Function INFER_RULES(P, G, Tnew)    ⊲ Tnew is the set of new terminals
    while Tnew is not empty do
        For each Ki ∈ Tnew make groups PKi and P′Ki
        Select a Ki from Tnew such that the set P′Ki is nonempty
        if PKi is nonempty then
            LKi ← ∩ over a1,n ∈ PKi of GATHER_POSSIBLE_LHSs(a1,n, G)
        else
            LKi ← N
        RKi ← ∩ over a1,n ∈ P′Ki of BUILD_RHSs(a1,n, G, max_length)
        PRKi ← build rules using the sets LKi and RKi
        Select a rule r from PRKi and add it to G
        Remove Ki from Tnew
    if G parses all programs in P then
        return the set of new rules added to G
    else
        Backtrack to a previous iteration and try another rule

Fig. 5. Algorithm for inferring multiple missing rules
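Read operationally, INFER_RULES is a depth-first search over candidate rules with backtracking; a Python sketch under assumed helper callables (all names are illustrative):

    def infer_rules(new_terminals, grammar, candidates_for, add_rule, complete):
        """Sketch of INFER_RULES.  `candidates_for(K, grammar)` builds
        PR_K from the groups P_K and P'_K; `add_rule(grammar, rule)`
        returns the extended grammar; `complete(grammar)` checks the
        grammar against all programs.  The paper checks completeness
        after |Tnew| rules have been added, which is exactly what the
        leaves of this recursion do."""
        if not new_terminals:
            return [] if complete(grammar) else None
        K = next(iter(new_terminals))        # a K whose groups are nonempty
        for rule in candidates_for(K, grammar):
            rest = infer_rules(new_terminals - {K}, add_rule(grammar, rule),
                               candidates_for, add_rule, complete)
            if rest is not None:
                return [rule] + rest         # rule survived; keep it
        return None                          # exhausted: backtrack to caller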


4.2 Time Complexity

Each iteration of the algorithm involves building a set of possible rules and adding one of them to the grammar; this takes O(n + 2^{c√m} + n³ × v^{max_length}) time. The number of possible rules for each new terminal is bounded by O(v^{max_length+1}) (v = |N ∪ T|). Since the algorithm checks the completeness of only those grammars which are not more than |Tnew| rules away from the initial grammar, the maximum number of iterations of the algorithm equals the number of nodes in a complete tree where the degree of each node is at most O(v^{max_length+1}); such a tree has (O(v^{max_length+1}))^{|Tnew|} = O(v^{(max_length+1)×|Tnew|}) nodes. Hence the maximum number of iterations of the algorithm is O(v^{(max_length+1)×|Tnew|}). The maximum number of possible combinations of rules for all new terminals is p^{|Tnew|} (p is the upper bound on the number of possible rules for each new terminal, i.e. p = v^{max_length+1}). Hence the worst case time taken by the algorithm is O(n + 2^{c√m} + n³ × v^{max_length}) × O(v^{(max_length+1)×|Tnew|}) + O(v^{(max_length+1)×|Tnew|}) × O(n + 2^{c√m}) = O(v^{(max_length+1)×|Tnew|} × (2^{c√m} + n³)).

4.3 Inferring Rules of the Form A → α anew β

We can extend the approach used for inferring multiple rules so that it infers rules of the form A → α anew β. The overall approach of the algorithm INFER_RULES remains the same, and the steps required for building possible LHSs and RHSs are similar to those discussed in the single missing rule case.

This extension of the algorithm works if in each iteration there exists at least one new terminal K such that there is at least one program where K is the first new terminal and another program where K is the last new terminal.

5 Optimizations

The approach discussed above usually results in a very large set of possible rules to be checked, because grammars of programming languages are normally large (typically 200-400 productions). For example, consider the example discussed in section 3.5, where we worked with an ANSI C grammar in which the rule corresponding to the keyword while is missing; we deliberately did not show all the possible RHSs in that example, as in the real experiment their number was too high. In this section we discuss some optimizations to reduce the number of possible rules.

5.1 Utilizing Unit Productions

Table 1 compares the number of unit productions with the total number of productions in a few programming language grammars taken from [12]. We use this property to reduce the number of possible RHSs to be checked: only the most general symbol strings are added to the set of possible RHSs. To achieve this, the algorithm BUILD_STRINGS is modified as follows. Suppose cells containing {X1, X2} and {Y1, Y2} are used for building symbol strings, and X1 ⇒∗ X2 and Y1 ⇒∗ Y2. Rather than adding all the symbol strings built from these cells, i.e. X1 Y1, X1 Y2, X2 Y1 and X2 Y2, we add only X1 Y1 to the set of possible RHSs. A rule whose RHS is X1 Y1, i.e. A → X1 Y1 (for some A ∈ L), is sufficient for checking the incorrectness of the rules whose RHSs are X1 Y2, X2 Y1 or X2 Y2 (i.e. A → X1 Y2, A → X2 Y1, A → X2 Y2), because A → X1 Y1 is more general than the other rules. Hence the number of possible RHSs to be checked can be reduced without compromising the correctness of the approach. This optimization significantly reduces the number of possible rules; a sketch follows the table.

Language | No. of productions | No. of unit productions
Algol    | 170 |  78
ADA      | 576 | 218
COBOL    | 519 | 193
CPP      | 785 | 237
CSTAR    | 312 | 169
C        | 227 | 106
Delphi   | 385 | 177
grail    | 122 |  20
Java     | 265 | 124
Matlab   |  92 |  40
Pascal   | 188 |  84

Table 1. Summary of unit productions in different programming languages' grammars
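A sketch of the idea (Python; unit productions are assumed to be given as (LHS, RHS) pairs, and the strict-coverage test keeps one representative when two strings are equally general):

    def unit_closure(unit_prods, symbols):
        """derives[X] = symbols reachable from X via unit productions
        X -> Y (reflexive-transitive closure), iterated to a fixpoint."""
        derives = {s: {s} for s in symbols}
        changed = True
        while changed:
            changed = False
            for X, Y in unit_prods:
                new = derives[Y] - derives[X]
                if new:
                    derives[X] |= new
                    changed = True
        return derives

    def most_general(strings, derives):
        """Keep a candidate RHS only if no other candidate is strictly
        more general, position by position, under unit derivation."""
        def covers(g, s):
            return len(g) == len(s) and all(b in derives.get(a, {a})
                                            for a, b in zip(g, s))
        return {s for s in strings
                if not any(covers(g, s) and not covers(s, g) for g in strings)}

    derives = unit_closure([("expr", "term"), ("term", "id")],
                           {"expr", "term", "id"})
    print(most_general({("expr",), ("term",), ("id",)}, derives))  # {('expr',)}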

5.2 Optimization in Rule Checking Process

In this section we propose an optimization of the rule checking process: a modified CYK parsing algorithm in which the correctness of an RHS w.r.t. a set of possible LHSs can be checked in a single CYK parse. That is, for a symbol string α ∈ R and a set of possible LHSs L, we check whether a rule with α as its RHS and some nonterminal A ∈ L as its LHS is a correct rule. This is done by incorporating some additional steps in the CYK-parser. The worst case number of invocations of the rule checking step in the earlier approach was |L| × |R|, because each rule in the set PR (built from the sets L and R) was checked individually, whereas it is |R| after this optimization.

The correctness of an RHS α w.r.t. a set of possible LHSs L is checked as follows. First, the set of rules B → α (∀B ∈ L) is added to the approximate grammar, and the input program is parsed with the modified grammar using the CYK parser. Consider the approximate grammar G = (V, T, P, S), where T = {a, b, d}, V = {A, B, C, S} and P has the following productions:

r1: S → a A B C
r2: S → a C d
r3: A → a
r4: A → A B
r5: B → b
r6: C → A B

Suppose the input program is aabeabaeabbeab and we check the correctness of the RHS eab w.r.t. the possible LHSs set {A, B, C}. The rules A → eab, B → eab and C → eab are added to the grammar and the input program aabeabaeabbeab is parsed with the modified grammar. The parse tree is shown in figure 6; it is slightly different from a normal parse tree. The root of each subtree contains a set of pairs: the first part of a pair is a nonterminal which derives the substring covered by the subtree; the second part is a set of nonterminals, discussed below. For example, node 6 in figure 6 contains the pairs (A, {B}) and (C, {B}); the first parts of the pairs, A and C, show that A and C derive the substring a7,10 = aeab. The set {B} is called a set of unfiltered nonterminals. Dashed edges are not part of the parse tree.

The program is parsed with the modified CYK parser, which filters out incorrect rules (among the newly added ones) while parsing the program. The idea behind checking the correctness of an RHS (α ∈ R) w.r.t. a set of possible LHSs is as follows. Initially each rule with RHS α (i.e. A → α, for all A ∈ L) is considered a correct rule and the parse tree is built bottom up with the CYK parser. The CYK parser filters out those new rules which are not used in building subtrees of larger substrings. To support the filtering operation, a set of nonterminals (called unfiltered nonterminals) is associated with each nonterminal of each CYK-cell; this set is shown as the second part of each pair in figure 6. For example, node 6 contains the pair (A, {B}), where A is the root of the subtree covering aeab and {B} is its set of unfiltered nonterminals. The set of unfiltered nonterminals associated with a nonterminal A ∈ C[i, j] is denoted as UF_A(i, j). B ∈ UF_A(i, j) implies that the rule B → α is used in the derivation A ⇒∗ ai,j, i.e. A ⇒∗ δ B γ ⇒ (via B → α) δ α γ ⇒∗ ai,j. For example, consider the substring a7,10 = aeab in figure 6: since B → eab is used in the derivation of a7,10 (A ⇒ A B ⇒ a B ⇒ a eab, using A → A B, A → a and B → eab), we have UF_A(7, 10) = {B}. This set is

associated with nonterminal A at node 6 in figure 6.

UF sets are maintained as follows. First, ∀A ∈ C[i, i] (1 ≤ i ≤ n), UF_A(i, i) is initialized to the empty set; next the program is parsed with the CYK parser. Since the modified grammar has the rules B → α (∀B ∈ L), whenever a substring derived by α is encountered (say ai,j), all nonterminals of L are added to C[i, j]. Initially each new rule B → α (B ∈ L) is considered correct; therefore the set UF_B(i, j) = {B} is associated with each such nonterminal B ∈ C[i, j]. For example, consider nodes 3, 8 and 9 in figure 6. The nonterminals A, B and C are roots of the subtrees covering the substring eab (the substrings a4,6, a8,10 and a12,14); the figure shows the roots of the subtrees and their corresponding UF sets. For example, the roots of the subtree covering a4,6 (node 3) are A, B and C, and the associated UF sets are UF_A(4, 6) = {A}, UF_B(4, 6) = {B} and UF_C(4, 6) = {C} respectively. The CYK parser builds the parse tree bottom up; suppose that while building an entry for cell C[p, q] the production A → X Y is used, where X ∈ C[p, k] and Y ∈ C[k + 1, q]. The set UF_A(p, q) is updated by the following rules:

1. UF_A(p, q) = UF_X(p, k) ∪ UF_Y(k + 1, q), if at least one of the sets UF_X(p, k) and UF_Y(k + 1, q) is empty. For example, consider node 2 in figure 6: the production A → A B is used while building the cell entry C[2, 3], where both UF_A(2, 2) and UF_B(3, 3) are empty; hence UF_A(2, 3) is empty. Consider node 6 in figure 6: the production A → A B is used, where UF_A(7, 7) is empty but UF_B(8, 10) is nonempty; hence UF_A(7, 10) = {B}.
2. UF_A(p, q) = UF_X(p, k) ∩ UF_Y(k + 1, q), if both UF_X(p, k) and UF_Y(k + 1, q) are nonempty. If UF_A(p, q) comes out empty from this computation, then the nonterminal A is dropped from the cell C[p, q]. Consider node 4 in figure 6: the production A → A B is used for building the cell entry C[7, 14]; here the sets UF_A(7, 11) and UF_B(12, 14) are both nonempty, therefore UF_A(7, 14) = UF_A(7, 11) ∩ UF_B(12, 14) = {B}. Consider node 14, where the UF sets of the subtrees are UF_A(8, 11) = {A} and UF_B(12, 14) = {B}; hence UF_A(8, 14) = UF_A(8, 11) ∩ UF_B(12, 14) = {}. Since UF_A(8, 14) is empty, the nonterminal A is dropped from the cell C[8, 14].

Nonterminals which are correct LHSs w.r.t. a given RHS (here α = eab) get added to the unfiltered sets of the subtrees of larger substrings and climb upward in the parse tree (shown by dashed arrows in figure 6). The nonterminals which are finally added to the set UF_S(1, n) (in figure 6 it is {B}) are the correct LHSs for the given RHS; incorrect nonterminals are filtered out during parsing. If the set UF_S(1, n) is empty after parsing, then the RHS is incorrect, i.e. for no nonterminal B ∈ L is the rule B → α a correct rule. For example, for the given program and the possible RHS eab, B is a correct LHS, whereas the nonterminals A and C are not correct, as they are filtered out. The two UF-combination rules are sketched below.
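The two update rules can be stated compactly (a Python sketch; the UF sets themselves come from the modified CYK table):

    def combine_uf(uf_x, uf_y):
        """Combine unfiltered-nonterminal (UF) sets when A -> X Y is
        applied to cells carrying uf_x and uf_y.  Returns None when the
        intersection is empty, meaning A must be dropped from the cell."""
        if not uf_x or not uf_y:
            return uf_x | uf_y      # rule 1: union when either set is empty
        merged = uf_x & uf_y        # rule 2: intersection otherwise
        return merged if merged else None

    # Node 6 of figure 6: UF_A(7,7) = {} and UF_B(8,10) = {"B"} -> {"B"};
    # node 14: {"A"} and {"B"} intersect to empty -> A is dropped (None).
    print(combine_uf(set(), {"B"}), combine_uf({"A"}, {"B"}))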

[Figure 6 shows the parse tree, annotated with UF sets, for the input aabeabaeabbeab (tokens numbered 1-14). Each node carries pairs (nonterminal, UF set): e.g. node 3, covering a4,6 = eab, carries (A, {A}), (B, {B}), (C, {C}); node 6, covering a7,10, carries (A, {B}), (C, {B}); at node 14 the UF sets {A} and {B} mismatch and A is dropped. Dashed arrows show the correct LHS set {B} climbing to the root pair (S, {B}).]

Fig. 6. Example of LHSs filtering

6 Implementation

We have implemented the complete approach for finding a single missing rule as well as multiple missing rules. The implementation is done in Java and incorporates the various optimizations. The schematic diagram of the whole grammar inference process is shown in figure 7. The input to the system is a grammar written in the yacc readable format and a set of input programs; new terminals are tagged as %newkeyword in the grammar specification file. The grammar and the set of programs are fed to the group programs module (figure 7), where programs are grouped according to the layout of new terminals. After grouping the programs, a group of programs is fed to the LHSs generator module and the RHSs generator module, as discussed in the earlier sections. The main component of the LHSs generator module is a modified LR parser generator: since our approach uses the additional operation called "forced reduction", which is not present in conventional LR parsers, we use a modified LR(1) parser generator which generates an LR(1) parser with support for the forced reduce operation. For supporting the forced reduce operation, we use a special data structure called a graph structured stack (GSS), which can represent multiple stacks compactly [27].

[Figure 7 shows the schematic diagram: the grammar G and the programs P enter the group programs module; groups of programs are fed to the LHSs generator (an LR parser generator plus LR parser, producing possible LHSs) and to the RHSs generator (a CYK parser, producing possible RHSs); the rule building module combines these into a set of rules, one of which is added to G; after |Tnew| iterations the check grammar module tests the modified G against P and either outputs G (complete) or backtracks (not complete).]

Fig. 7. The Schematic Diagram of the Grammar Inference Module

The RHSs generator module consists of a modified CYK parser, which works in several modes used by the different optimizations. The input program is parsed with the CYK parser and the possible RHSs are generated. The rule building module takes a set of possible LHSs and a set of possible RHSs as input and builds a set of possible rules. The modified grammar is checked for completeness in the grammar checking module (shown as the check grammar module in figure 7).

7 Experiments

We performed experiments on four programming languages, viz. Java, C, Matlab and Cobol; the grammars of the languages were obtained from [12]. We also performed experiments on a real dialect of C, i.e. C*. Since the grammars were not truly LR, we used precedence and associativity to remove the non-LR-ness of the grammars (this is the most commonly used method for resolving conflicts in LR parsers [2]). The parser generator generates an LR parser only if there are fewer than fifteen conflicts; therefore, in the following discussion we call a grammar LR if there are fewer than fifteen shift reduce conflicts. Experiments were conducted on a machine with an Intel Pentium 4, 2.4 GHz processor and 512 MB RAM.

For conducting the various experiments we removed rules corresponding to different keywords and operators. To validate the approach for inferring a single missing rule, we first removed rules corresponding to a single keyword. Most of the experiments were done in this setting; however, there are also a few experiments on multiple rules inference. We also performed experiments on cases where the missing rule was of the form A → α anew β (i.e. the new terminal occurs at an arbitrary position in the RHS of the rule). Since such rules are mostly used for representing

expressions (e.g. expressions involving additive, multiplicative operators etc.), we removed rules corresponding to different expressions for performing these experiments. The only parameter in our approach is the maximum length of the RHSs. After a study of different programming language grammars we found that the average RHS length of productions in most of the PL grammars were close to six. Hence, we chose this parameter as six in our experiments. In all the experiments we found that there were at-least few correct rules with length less than or equal to six, i.e. the software never failed due to this parameter value. Since there is no test set suit available for checking the performance of a grammar inference approach in a programming language domain, we have downloaded programs and grammars from different sites to make a small test suit11 . The summary of all the experiments are given in the table 2. The size of grammar is expressed as the sum of the lengths of the RHSs of all the productions in the grammar. Language Constructs
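As a rough, illustrative estimate of the size of this search space: the C grammar has 67 candidate LHS nonterminals (the value of |L| per new terminal in table 7), so the number of nonterminal strings of length six alone is on the order of 67^6 ≈ 9 × 10^10. This is why the RHS length cap, together with the pruning optimizations described below, is essential to keep the candidate space manageable.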

Language | Constructs | # Single missing rule experiments | # Multiple missing rules experiments | Size of programs | Size of grammar
C | case, for, while, switch, break | 25 | 5 | 100 | 465
Java | switch, case, try, enum, while, for, if | 160 | 8 | 75 | 568
Matlab | for, case, switch, otherwise, while, &&, /., ∗., .ˆ, /, ==, −−, ∗ | 122 | 3 | 40 | 239
Cobol | move, display, read, perform | 21 | 1 | 64 | 674

Table 2. Summary of experiments

7.1 Inferring Single Rule without Optimization

In this section we discuss experiments done to validate our approach on different grammars. In each experiment, we removed a rule corresponding to one construct and then fed the grammar and programs to our software; the software returns a rule which makes the grammar complete w.r.t. the given set of programs. A correct rule returned by the software need not be the same as the removed rule, as there can be many possible correct rules. Some of the experiments in which only a single program was used for inferring correct rules are shown in table 3; the table also gives examples of the correct rules returned by the software. The numbers written inside parentheses in the last column of the table show the number of rules the system checked before arriving at a correct rule. We observe that in some of the cases the time taken by the system is only a few seconds, but in some cases it is several hours. The time spent by the inference process involves (1) the time in generating the possible rules and (2) the time in checking the possible rules. Since the time spent in the first process is more or less constant if the grammar and input program size are the same, the time taken by these experiments mainly depends on how early the approach arrives at a correct rule (as the number of possible rules is very large). For example, in the case of the Java grammar and the while construct, the time taken by the system is 102263 seconds (i.e. around 28 hours); this is because the system checked 5922 rules before arriving at a correct rule12, whereas in the case of the enum construct of the Java grammar the time taken was 83.7 sec., because the very first rule checked by the system was a correct rule.
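Dividing the reported totals gives a rough per-rule checking cost of our own estimation: about 102263.69 / 5923 ≈ 17 seconds per rule checked for the Java grammar. In the 83.7-second enum run, by contrast, the time is dominated by the one-time generation of the possible rules rather than by checking.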

Language | Missing construct | Size of program (LOC) | Number of possible rules | Rule returned | Execution time (Sec.)
C | break | 103 | 6.3 × 10^5 | labeled stmt → BREAK stmt list | 1190.7 (79)
C | case | 103 | 7.3 × 10^6 | if stmt → CASE cond expr COLON stmt list | 418.1 (4)
C | for | 89 | 9.0 × 10^5 | expr stmt → FOR ( stmt list cond expr ) stmt list | 236.57 (3)
C | switch | 78 | 1.9 × 10^5 | select stmt → SWITCH translation unit | 2324.79 (184)
Java | enum | 31 | 1.3 × 10^6 | Modifiers → ENUM IDENTIFIER | 83.7 (1)
Java | if | 26 | 3.5 × 10^8 | CondExpr → IF VarInit | 1019.18 (56)
Java | switch | 39 | 4.4 × 10^5 | Stmt → SWITCH ComplexPrimary Block | 21254.91 (1224)
Java | while | 47 | 1.8 × 10^7 | GuardingStmt → WHILE ComplexPrimary FieldDecl | 102263.69 (5923)
Matlab | case | 20 | 5.7 × 10^6 | equality expr → CASE CONSTANT | 38.99 (1)
Matlab | otherwise | 29 | 6.8 × 10^5 | select stmt → OTHERWISE IDENTIFIER | 44.07 (4)
Matlab | switch | 20 | 1.0 × 10^7 | otherwise stmt → SWITCH stmt list END | 16641.19 (3626)
Matlab | while | 12 | 4.9 × 10^6 | unary expr → WHILE stmt list END | 16914.48 (3745)
Cobol | display | 26 | 2.5 × 10^4 | stmt list → DISPLAY ident or string id stmt stmt list | 18 (1)
Cobol | perform | 85 | 3.9 × 10^5 | ident → PERFORM loop cond part2 stmt list END PERFORM | 49.5 (1)
Cobol | read | 43 | 8850 | clause → READ ident or string opt at end clause | 25 (1)

Table 3. Summary of the unoptimized approach on single rule inference

12 This time can be reduced by using a faster LR-parser generator. The parser generator used by us is not as efficient as yacc or bison.

As mentioned earlier, with multiple programs we can reduce the number of possible rules to be checked, which can otherwise be very large (table 3). Therefore, we discuss some experiments which use multiple programs for inferring a single missing rule. We removed keyword based rules from different PL grammars, built a set of possible RHSs and LHSs from each program given as input, and then took their intersection to get a reduced set of possible RHSs and LHSs. We study the reduction achieved in the number of possible rules by this optimization. Table 4 shows the results. Each row represents the language and the construct removed. The numbers of possible rules obtained from each program are shown in the third column, separated by commas. The second column shows the size of the intersection of the possible rules obtained from the programs. We can observe that in many cases the size of the intersection of possible rules is 10–100 times smaller than in the unoptimized approach. In some experiments it reduces drastically; for example, in the case of the Java grammar and the while construct, the numbers of possible rules obtained from the two programs are 1.85 × 10^7 and 2.50 × 10^7, whereas their intersection is 2.93 × 10^4. By reducing the number of possible rules, the approach checks fewer possible rules for correctness and hence arrives at a correct rule much earlier than the unoptimized version of the approach.

Language / Construct | Size of intersection of possible rules | No of rules obtained from different programs
C / for | 6.4 × 10^5 | 9.0 × 10^5, 1.3 × 10^6, 8.1 × 10^5, 9.0 × 10^5, 9.0 × 10^5, 1.2 × 10^6, 8.1 × 10^5
C / switch | 1.2 × 10^5 | 1.0 × 10^6, 1.9 × 10^5, 1.5 × 10^7, 1.5 × 10^6, 1.9 × 10^5, 4.6 × 10^6
C / case | 1.4 × 10^4 | 5.3 × 10^6, 7.3 × 10^6, 5.9 × 10^4
C / break | 1.3 × 10^2 | 8.8 × 10^3, 6.3 × 10^5, 5.2 × 10^2, 1.1 × 10^5
Java / case | 4.7 × 10^4 | 4.0 × 10^5, 2.2 × 10^6
Java / for | 2.4 × 10^6 | 2.8 × 10^6, 2.5 × 10^6
Java / switch | 2.7 × 10^5 | 4.9 × 10^5, 4.4 × 10^5
Java / while | 2.9 × 10^4 | 1.8 × 10^7, 2.5 × 10^7
Matlab / case | 5.4 × 10^6 | 1.1 × 10^7, 5.5 × 10^6
Matlab / for | 4.4 × 10^5 | 3.2 × 10^6, 6.7 × 10^6, 6.0 × 10^6, 1.4 × 10^6, 4.6 × 10^6
Matlab / otherwise | 6.8 × 10^5 | 6.8 × 10^5, 6.8 × 10^5, 6.8 × 10^5
Matlab / while | 2.1 × 10^6 | 1.1 × 10^7, 4.9 × 10^6

Table 4. Number of possible rules generated from the programs and their intersection in different PL grammars
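The intersection step itself is straightforward; a minimal sketch in Java (hypothetical names of ours, assuming each candidate rule is rendered canonically as a string) is:

    import java.util.*;

    class CandidateIntersection {
        // Keep only the candidate rules proposed by every program.
        // Assumes perProgram is non-empty.
        static Set<String> intersect(List<Set<String>> perProgram) {
            Set<String> common = new HashSet<>(perProgram.get(0));
            for (Set<String> candidates : perProgram.subList(1, perProgram.size()))
                common.retainAll(candidates);
            return common;
        }
    }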

7.2 Unit Production Optimization

In this section we study the effect of the unit production optimization on the grammar inference process. We study the reduction achieved in the number of possible rules for real programming language grammars. In each experiment, a single rule corresponding to a keyword is removed, and both a set of all possible rules and a set of rules with most general RHSs are built. The sizes of the two sets are then compared. Table 5 shows the outcome of the experiments. The first column shows the language and the construct removed from the grammar. The second column compares the number of all possible rules and the number of rules with the most general RHSs (shown as All/MG in the table) obtained from different programs. The last column shows the overall reduction achieved after taking the intersection of the sets of rules obtained from each program in both cases; that is, the last column shows the reduction achieved by the combination of the unit production optimization and the use of multiple programs.

Language / Construct | Number of all possible rules (All) / most general rules (MG) obtained from different programs | Intersection (All / MG)
C / for | 9.0 × 10^5 / 1.4 × 10^4, 1.3 × 10^6 / 1.8 × 10^4, 9.0 × 10^5 / 1.4 × 10^4, 9.0 × 10^5 / 1.4 × 10^3, 1.2 × 10^6 / 1.8 × 10^4, 8.1 × 10^5 / 1.4 × 10^4, 8.1 × 10^5 / 1.4 × 10^4 | 6.4 × 10^5 / 8.0 × 10^3
C / switch | 1.0 × 10^6 / 2.7 × 10^4, 1.9 × 10^5 / 9.6 × 10^3, 1.5 × 10^7 / 6.2 × 10^4, 1.5 × 10^6 / 3.0 × 10^4, 1.9 × 10^5 / 9.6 × 10^3, 4.6 × 10^6 / 5.1 × 10^4 | 1.2 × 10^5 / 8.3 × 10^3
C / case | 5.3 × 10^6 / 3.4 × 10^4, 7.3 × 10^6 / 5.1 × 10^4, 5.9 × 10^4 / 2.9 × 10^3 | 1.5 × 10^4 / 8.6 × 10^2
Java / case | 4.0 × 10^5 / 1.3 × 10^4, 2.2 × 10^6 / 2.3 × 10^4 | 4.7 × 10^4 / 7.7 × 10^2
Java / for | 2.8 × 10^6 / 4.9 × 10^4, 2.5 × 10^6 / 4.1 × 10^4 | 2.4 × 10^6 / 2.2 × 10^4
Java / switch | 4.9 × 10^5 / 1.4 × 10^4, 4.4 × 10^5 / 3.5 × 10^5 | 2.7 × 10^5 / 9.1 × 10^3
Java / while | 1.8 × 10^7 / 1.4 × 10^5, 2.5 × 10^7 / 1.4 × 10^5 | 2.9 × 10^4 / 3.0 × 10^3
Matlab / case | 1.1 × 10^7 / 4.1 × 10^4, 4.9 × 10^6 / 6.5 × 10^4 | 5.4 × 10^6 / 9.7 × 10^3
Matlab / for | 3.2 × 10^6 / 3.4 × 10^4, 6.7 × 10^6 / 7.3 × 10^4, 6.0 × 10^6 / 6.3 × 10^4, 1.4 × 10^6 / 1.7 × 10^4, 4.0 × 10^6 / 4.7 × 10^4 | 4.4 × 10^5 / 4.8 × 10^3
Matlab / otherwise | 6.8 × 10^5 / 5.4 × 10^3, 6.8 × 10^5 / 5.4 × 10^3, 6.8 × 10^5 / 5.4 × 10^3 | 6.8 × 10^5 / 5.4 × 10^3

Table 5. Comparison of number of all possible RHSs and number of most general RHSs
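The collapsing of candidates via unit productions can be sketched as follows; this is our own illustrative sketch with hypothetical names, not the implementation's code. Following unit productions A → B upward from B yields the more general nonterminals, so candidate RHSs that differ only along such chains can be represented by one most general form.

    import java.util.*;

    class UnitClosure {
        // unitParents.get(B) = all A such that A -> B is a unit production
        final Map<String, Set<String>> unitParents;

        UnitClosure(Map<String, Set<String>> unitParents) { this.unitParents = unitParents; }

        // All nonterminals reachable from nt through chains of unit productions.
        Set<String> generalize(String nt) {
            Set<String> seen = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(List.of(nt));
            while (!work.isEmpty()) {
                String cur = work.pop();
                if (seen.add(cur))
                    work.addAll(unitParents.getOrDefault(cur, Set.of()));
            }
            return seen;
        }
    }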

We can observe that the above optimization reduces the search space of possible rules to a great extent (by a factor of 100–10000). Since the reduction achieved here is due to the abundance of unit productions, this optimization holds for most PL grammars: our study of the number of unit productions in PL grammars shows that a large fraction of the productions in language grammars are unit productions (table 1). Since some of the rules obtained from the unit production optimization may cause non-LR-ness in the grammar, the use of an LR parser for checking the correctness of the rules may sometimes fail. To study how often the set of rules obtained from the unit production optimization contains a correct as well as LR-preserving rule, we checked the correctness of the rules in the reduced set using the LR parser. We also compare the rules returned by the LR-parser checker (where the LR parser is used for checking correctness) and the rules returned by the CYK-parser checker. Table 6 shows the results of the experiments. As evident, the LR-parser checker fails in none of the cases, so one can use the LR parser for checking the correctness of the rules. The times taken by the two versions of the checker are compared in the bar charts shown in figure 8. Since the LR-checker accepts only LR-retaining rules, in some cases it takes more time because it has to check more rules than the CYK-checker. The time taken by the LR-checker is the sum of the time taken in generating the parser and the time taken in parsing the programs (i.e. O(n), where n is the length of the input program).


Language / Construct | Rule returned by the LR-parser checker | Rule returned by the CYK-parser checker | No of progs | Avg size of progs (LOC)
C / break | stmt → BREAK stmt list | stmt → BREAK stmt list | 4 | 114
C / case | stmt → CASE arg expr list COLON stmt list | stmt → CASE arg expr expr COLON stmt list | 6 | 131
C / for | stmt → FOR ( stmt list assign expr ) stmt list | stmt → FOR ( stmt list stmt list arg expr list ) | 7 | 66
C / switch* | stmt → SWITCH declarator stmt list | stmt → SWITCH init decl list stmt list | 6 | 131
C / while | expr stmt → WHILE ( arg expr list ) stmt list | stmt → WHILE arg expr list stmt list | 7 | 101
Java / case | LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt | LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt | 4 | 84
Java / if | LocVarDecOrStmt → IF ( ConstExpr ) LocVarDecAndStmt | LocVarDecOrStmt → IF ConstExpr LocVarDecAndStmt | 5 | 97*
Java / switch* | LocVarDecOrStmt → SWITCH ( ClassNameList ) LocVarDecAndStmt | LocVarDecOrStmt → SWITCH ForIncr LocVarDecAndStmt | 4 | 84
Java / while | Block → WHILE ( ArrayInitializers ) | LocVarDecOrStmt → WHILE ArrayInitializers array list | 6 | 78
Matlab / case | assign expr → CASE assign expr | unary expr → CASE translation unit | 4 | 38
Matlab / for | primary expr → FOR translation unit END | expr → FOR translation unit END | 4 | 54
Matlab / switch | primary expr → SWITCH translation unit END | primary expr → SWITCH translation unit END | 4 | 38.25
Matlab / otherwise | stmt → OTHERWISE array list | stmt → OTHERWISE array list | 3 | 68
Cobol / move | if clause → MOVE file name string loop cond part2 | if clause → MOVE file name string loop cond part2 | 4 | 27
Cobol / read | if clause → READ file name string opt at end clause | if clause → READ file name string opt at end clause | 4 | 68
Cobol / perform | if clause → PERFORM loop cond part2 stmt list END PERFORM | if clause → PERFORM loop cond part2 stmt list END PERFORM | 3 | 56

Table 6. Comparison of LR-parser and CYK-parser as a grammar completeness checker

The time in generating the parser depends only on the size of the grammar; therefore, for large programs the LR-parser will always outperform the CYK-parser, as the CYK-parser is an O(n^3) algorithm. In our experiments, small to medium sized programs are used, which do not make a significant difference between the LR-checker and the CYK-checker in terms of computation time. We can see in figure 8 that except in a few cases the LR-checker performs better than the CYK-checker. For example, in the experiment where the possible rule corresponding to while in the C grammar is checked, the CYK-checker performs better than the LR-checker, as the LR-parser checks more rules than the CYK-parser to find a correct as well as LR-preserving rule.
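To put the asymptotics in perspective with a rough estimate of our own: for a 1000-token program, the O(n^3) CYK parse costs on the order of 10^9 table operations, while the LR parse itself takes about 10^3 steps plus a parser-generation cost that is independent of the program size; the balance therefore shifts quickly in favour of the LR-checker as programs grow.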

7.3 LHSs Filtering Optimization in Rule Checking Phase

In this section we discuss the effect of the LHSs filtering optimization used in the rule checking phase. In this optimization the correctness of an RHS is checked w.r.t. a set of LHSs. The improvement gained from this optimization is measured by comparing the time taken in checking all possible rules with the simple CYK parser against the time taken in checking all possible RHSs with the modified CYK parser. If L is the set of possible LHSs, then checking the correctness of a possible RHS α requires |L| invocations of the rule checking process (i.e. invocations of the parser), whereas with the LHSs filtering optimization it requires only one. If t_r is the time taken by the simple CYK parsing algorithm and t_rhs is the time taken by the modified CYK parsing algorithm, then we compare the quantities t_r × |L| and t_rhs.

The quantity t_rhs will always be higher than t_r because t_rhs involves the additional overhead of LHSs filtering; the bigger the set L, the larger the overhead. Hence, in order to study the effect of the maximum overhead, we take the set L equal to the set of all nonterminals N while measuring t_rhs. We conducted a few experiments on the C and Cobol grammars to obtain first-hand experience of this optimization. The results are shown in table 7. The second column shows the rules removed from the grammar; the third column shows the number of possible combinations of LHSs. In the multiple missing rules case, it is equal to the product of the numbers of possible LHSs corresponding to each new terminal. For example, suppose the set of possible LHSs for the keyword while is {stmt, cond stmt} and for the keyword for is {stmt, cond stmt}, and the possible RHSs corresponding to them are WHILE ( expr ) stmt and FOR ( stmt list expr ) stmt respectively; the sets of possible rules to be checked are then

1. {stmt → WHILE ( expr ) stmt, stmt → FOR ( stmt list expr ) stmt}
2. {stmt → WHILE ( expr ) stmt, cond stmt → FOR ( stmt list expr ) stmt}
3. {cond stmt → WHILE ( expr ) stmt, stmt → FOR ( stmt list expr ) stmt}
4. {cond stmt → WHILE ( expr ) stmt, cond stmt → FOR ( stmt list expr ) stmt}
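The two checking regimes can be contrasted with a small sketch; the interfaces below are hypothetical names of ours, not the implementation's API.

    import java.util.*;

    class LhssFiltering {
        interface RuleChecker {
            // true iff the grammar extended with (lhs -> rhs) parses every program
            boolean complete(String lhs, List<String> rhs);
        }

        interface FilteringCykParser {
            // one modified-CYK pass returning every lhs in L that makes the grammar complete
            Set<String> viableLhss(Set<String> L, List<String> rhs);
        }

        // Naive regime: |L| parser invocations per candidate RHS, cost about t_r * |L|.
        static Set<String> naive(RuleChecker checker, Set<String> L, List<String> rhs) {
            Set<String> ok = new HashSet<>();
            for (String lhs : L)
                if (checker.complete(lhs, rhs)) ok.add(lhs);
            return ok;
        }

        // Filtered regime: a single invocation of the modified parser, cost about t_rhs.
        static Set<String> filtered(FilteringCykParser parser, Set<String> L, List<String> rhs) {
            return parser.viableLhss(L, rhs);
        }
    }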

Using the modified CYK-parser, the correctness of the above four sets is checked in one pass. Hence, although t_r is less than t_rhs, the total time spent in checking the correctness of all possible rules is much higher than the total time spent in checking the correctness of all possible RHSs with the LHSs filtering optimization. Therefore, this optimization improves the process of grammar inference. The comparison shown in table 7 assumes that all possible rules are checked. In practice the rule checker does not check all possible rules. Therefore, to compare the actual times taken by the simple CYK parser and the modified CYK parser in the rule checking phase, we conducted experiments on the C, Java and Matlab grammars. In each experiment we removed a rule corresponding to a keyword and built the set of possible rules. We compared the times taken by the simple CYK parser and the modified CYK parser in arriving at a correct rule. The bar charts shown in figure 9 compare the times. We can see that the modified parser is either comparable to or better than the simple CYK parser.

Language | Statements removed | |L| | t_rhs | t_r | t_r × |L| | Avg size of progs
C | switch, case | 67^2 = 4.5 × 10^3 | 386 | 164 | 7.3 × 10^5 | 16
C | switch, case, break | 67^3 = 3.0 × 10^5 | 816 | 143 | 4.4 × 10^6 | 16
C | switch, case, break, default | 67^4 = 2.0 × 10^7 | 1.1 × 10^4 | 6.0 × 10^4 | 1.2 × 10^12 | 131
C | switch, case, break, default, for, while | 36 × 67^5 = 4.8 × 10^10 | 1.4 × 10^4 | 518 | 2.5 × 10^13 | 16
Cobol | perform | 5^1 = 5 | 3811 | 3268 | 16340 | 43
Cobol | display | 5^1 = 5 | 1237 | 680 | 3400 | 43
Cobol | read | 5^1 = 5 | 46172 | 43520 | 217600 | 43
Cobol | read, perform, move | 5^2 × 181 = 1.6 × 10^5 | 4092 | 3297 | 5.4 × 10^8 | 56

Table 7. Experiments of LHSs filtering optimization (times are in milliseconds)
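Taking the Cobol read row as a worked example of the table: the unoptimized cost is t_r × |L| = 43520 ms × 5 = 217600 ms, against t_rhs = 46172 ms for the single filtered pass, i.e. roughly a 4.7× saving even with the smallest |L| in the table; for the C rows the factor runs into the hundreds or more.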

7.4 Experiments on Multiple Rules Inference

This section presents some experiments on multiple rule inference, i.e. when more than one rule is needed to make the grammar complete. In each experiment multiple keyword based rules were removed from a PL grammar, and then the grammar was completed by inferring rules from a set of programs. These experiments were done with both optimizations enabled, i.e. the unit production optimization and the use of the modified CYK parser. Table 8 shows the results of the experiments. Our approach successfully inferred a set of correct rules in each of these experiments. The time taken in most experiments is 2–15 minutes, but in some cases it took hours to infer the rules.

Language | Constructs | No of Programs | Avg size of a prog | Time
Matlab | for, while | 2 | 33 | 1.5 min
Matlab | for, switch, case, otherwise, while | 5 | 29 | 19.5 min
Matlab | switch, case, otherwise | 3 | 27 | 2.2 min
Java | try, catch | 4 | 98 | 14.9 min
Java | if, while | 3 | 37 | 6.7 min
Java | switch, case, enum | 7 | 61 | 17.6 min
Java | switch, case, try, catch, enum | 7 | 61 | 24.23 min
C | switch, case | 1 | 16 | 3 sec
C | switch, case, break | 1 | 16 | 7.3 sec
C | switch, case, break, default, while, for | 1 | 16 | 14.1 sec
Cobol | read, perform, move | 3 | 56 | 1.3 min

Table 8. Experiments on multiple rules inference
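The search over rule combinations (such as the four sets listed in section 7.3) is a plain backtracking search; a minimal sketch of ours (hypothetical names, assuming the per-terminal candidate rule sets have already been built) is:

    import java.util.*;

    class RuleSearch {
        interface Oracle {
            boolean complete(List<String> chosenRules);  // do all programs parse?
        }

        // candidates.get(i) holds the candidate rules for the i-th new terminal;
        // returns one complete combination, or null if none exists.
        static List<String> search(List<List<String>> candidates, Oracle oracle,
                                   List<String> chosen) {
            if (chosen.size() == candidates.size())
                return oracle.complete(chosen) ? new ArrayList<>(chosen) : null;
            for (String rule : candidates.get(chosen.size())) {
                chosen.add(rule);
                List<String> found = search(candidates, oracle, chosen);
                if (found != null) return found;
                chosen.remove(chosen.size() - 1);        // backtrack
            }
            return null;
        }
    }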

7.5 Experiments to infer C* specific grammar rules

We discuss here a few experiments in which we inferred rules corresponding to different keywords, operators and declaration specifiers of the C* grammar, when a C grammar and programs written in C* are given as input. We wrote small programs in C* which used constructs specific to C* (those which are not part of a standard C grammar). Table 9 summarizes the additional constructs (which contain new terminals) of the C* grammar. This summary is not complete, as the resources for C* are very scarce13. We found that, except in a few cases, the additional rules in the C* grammar follow the assumptions we made in section 2. One such exception is the declaration of a parallel data type. A parallel data type describes a structured collection of objects with a particular member type. For example, consider the following C* declaration:

shape [10]S;
int:S parallel_int;

shape is a new declaration specifier which is used for defining the template of parallel objects. A parallel object is an object of a parallel data type. In the above example, the variable parallel_int represents a parallel object of type int whose structure is defined by shape S. That is, parallel_int represents an array of ten integers, and the operation on each element of parallel_int can be done in parallel on different processors. The declaration statement "int:S parallel_int;" declares parallel_int to be of type "int:S". The grammar rule corresponding to the statement "int:S parallel_int;" does not involve a new terminal, hence we do not discuss this case. However, the declaration corresponding to the shape keyword follows our assumption; therefore we consider input programs containing the shape keyword. The experiments were conducted on input programs containing different combinations of new terminals whose corresponding grammar rules follow the assumptions given in section 2. Experiments were done on single input programs with all the optimizations enabled. Table 10 shows the results of the experiments. We can observe that the system correctly inferred a grammar complete w.r.t. the input program in each of the experiments. Examples of the inferred rules are also shown in table 10.

13 After an inquiry on forums like comp.compilers, we could get an incomplete manual and very few example programs.

New terminal | Type | Description
shape | Declaration specifier | Used for expressing the template of a parallel data type
shapeof | Keyword | Returns a pointer to a shape object
with | Keyword | A control flow statement which operates on all the elements of a parallel data type
where | Keyword | A control flow conditional statement for parallel objects; much like the "if statement", but the operations are performed on parallel objects
elemental | Declaration specifier | Used for facilitating parallel and non parallel operations
everywhere | Keyword | Used for making each element of a parallel variable accessible within a function
positionof | Keyword | NA
rankof | Keyword | NA
alignof | Keyword | NA
extension | Keyword | NA
attribute | Keyword | NA
current | Keyword | NA
dimof | Keyword | NA
%% | Operator | Real modulus operator
>? | Operator | Maximum operator
>?= | Operator | Maximum assignment operator

Table 9. Summary of new terminals in C*

New terminals | Inferred Rules | Time | Size of prog (LOC)
shape, with | statement list → WITH program statement list, declaration specifiers → SHAPE statement list program | 3.2 Sec. | 25
shape, where | statement → SHAPE statement list program statement list, statement → WHERE return expression statement list | 4.2 Sec. | 28
with, where | statement → WHERE return expression, statement → WITH program | 3.9 Sec. | 16
shape, elemental, with | external declaration → SHAPE statement list program, statement list → WITH program statement list, declaration specifiers → ELEMENTAL statement list | 11.7 Sec. | 28
>?= | expression → MAX ASSIGN identifier list | 207.5 Sec. | 23
— | — | 220.6 Sec. | 23

Table 10. Results of the experiments on C* programs

8 Conclusions

In this paper we have presented a technique for inferring grammar rules of a programming language dialect when a set of programs written in the dialect and the grammar of the standard language (an approximate grammar) are given as input. The input programs are parsed by the parser generated from the approximate grammar. If the parser gets stuck, then the possible left hand sides of rules are determined from the LR(1) parser stack and the possible right hand sides from the CYK parser. The approach is theoretically as well as experimentally verified. Furthermore, ways to improve the process of grammar inference are proposed. Although the approach assumes that rules follow certain properties, a study of PL dialects has shown that the assumptions cover most of the syntactic extensions in programming languages fairly well. We also observed in the experiments that there can be several correct sets of rules; therefore the inferred rules are sometimes not close to the actual missing rule. This opens a possibility for future work on defining the goodness of a rule.

References

1. Pieter W. Adriaans. Language Learning for Categorial Perspective. PhD thesis, University of Amsterdam, Amsterdam, Netherlands, November 1992.
2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Education (Singapore) Pte. Ltd., 2002.
3. C*. UNH C* - a data parallel dialect of C, 1998. URL: http://www.cs.unh.edu/~pjh/cstar/.
4. Matej Crepinsek, Marjan Mernik, Faizan Javed, Barrett R. Bryant, and Alan Sprague. Extracting grammar from programs: evolutionary approach. SIGPLAN Not., 40(4):39–46, 2005.
5. Matej Crepinsek, Marjan Mernik, and Viljem Zumer. Extracting grammar from programs: brute force approach. SIGPLAN Not., 40(4):29–38, 2005.
6. Colin de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348, 2005.
7. Alpana Dubey, Sanjeev K. Aggarwal, and Pankaj Jalote. A technique for extracting keyword based rules from a set of programs. In CSMR '05: Proceedings of the Ninth European Conference on Software Maintenance and Reengineering, pages 217–225, Manchester, UK, 2005. IEEE Computer Society.
8. Alpana Dubey, Pankaj Jalote, and Sanjeev K. Aggarwal. Inferring grammar rules of programming language dialects, Sept 2006. To appear in the proceedings of the 8th International Colloquium on Grammatical Inference, Tokyo, Japan, Springer-Verlag LNCS.
9. Jeroen Geertzen and Menno van Zaanen. Grammatical inference using suffix trees. In Proceedings of the International Colloquium on Grammatical Inference (ICGI), Athens, Greece, pages 163–174, October 2004.
10. E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
11. E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
12. Grammars. Compilers & interpreters, July 2000. URL: http://www.angelfire.com/ar/CompiladoresUCSE/COMPILERS.html.
13. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
14. Rahul Jain, Sanjeev Kumar Aggarwal, Pankaj Jalote, and Shiladitya Biswas. An interactive method for extracting grammar from programs. Softw. Pract. Exper., 34(5):433–447, 2004.
15. Java1.5. JDK 5.0 documentation, 2004. URL: http://java.sun.com/j2se/1.5.0/docs/index.html.
16. F. Javed, B. R. Bryant, Matej Crepinsek, M. Mernik, and A. Sprague. Context-free grammar induction using genetic programming. In Proceedings of the 42nd Annual Southeast Regional Conference, pages 404–405. ACM Press, 2004.
17. T. Kasami. An efficient recognition and syntax analysis algorithm for context free languages. Technical report AFCRL-65758, Air Force Cambridge Research Laboratory, Bedford, MA, 1965.
18. Paul Klint, Ralf Lämmel, and Chris Verhoef. Toward an engineering discipline for grammarware. ACM Trans. Softw. Eng. Methodol., 14(3):331–380, 2005.
19. R. Lämmel and C. Verhoef. Cracking the 500-Language Problem. IEEE Software, pages 78–88, November/December 2001.
20. R. Lämmel and C. Verhoef. Semi-automatic Grammar Recovery. Software—Practice & Experience, 31(15):1395–1438, December 2001.
21. Steve Lawrence, C. Lee Giles, and Sandiway Fong. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12(1):126–140, 2000.
22. Lillian Lee. Learning of context-free languages: A survey of the literature. Technical Report TR-12-96, Harvard University, 1996. URL: ftp://deasftp.harvard.edu/techreports/tr-12-96.ps.gz.
23. Marjan Mernik, Goran Gerlic, Viljem Zumer, and Barrett Bryant. Can a parser be generated from examples? In Proceedings of the 18th ACM Symposium on Applied Computing, pages 1063–1067. ACM Press, 2003.
24. Rajesh Parekh and Vasant Honavar. Grammar Inference, Automata Induction, and Language Acquisition, invited chapter. In Dale, Moisl and Somers (eds.). New York: Marcel Dekker, 2000.
25. RC. RC - safe, region-based memory-management for C, 2001. URL: http://berkeley.intel-research.net/dgay/rc/index.html.
26. M.P.A. Sellink and C. Verhoef. Development, Assessment, and Reengineering of Language Descriptions. In J. Ebert and C. Verhoef, editors, Proceedings of the Fourth European Conference on Software Maintenance and Reengineering, pages 151–160. IEEE Computer Society, March 2000.
27. Masaru Tomita. Graph-structured stack and natural language parsing. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pages 249–257, Morristown, NJ, USA, 1988. Association for Computational Linguistics.
28. E. Ukkonen. Lower bounds on the size of deterministic parsers. Journal of Computer and System Sciences, 26(2):153–170, 1983.
29. Menno van Zaanen. ABL: Alignment-based learning. In COLING 2000 - Proceedings of the 18th International Conference on Computational Linguistics, pages 961–967, Saarbrücken, Germany, Aug 2000.
30. D. H. Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, Feb 1967.

[Bar charts omitted in text extraction; panels: (a) C (break, case, for, switch, while), (b) Java (case, if, switch, while), (c) Matlab (for, switch, otherwise), (d) Cobol (move, read); each panel compares LR parser and CYK parser times.]

Fig. 8. Comparison of times taken by LR parser and CYK parser in rule checking module (Times are in seconds on Y axis)

[Bar charts omitted in text extraction; panels: (a) C (break, case, switch, while), (b) Java (case, for, if), (c) Matlab (case, otherwise, switch); each panel compares the modified CYK parser and the simple CYK parser times.]

Fig. 9. Comparison of times taken by Modified CYK parser and simple CYK parser in rule checking module (Times are in seconds on Y axis)
