Inferring Grammar Rules from Programs

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Alpana Dubey

to the DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY KANPUR June 2006

CERTIFICATE

Certified that the work contained in the thesis entitled “Inferring Grammar Rules from Programs”, by “Alpana Dubey”, has been carried out under our supervision and that this work has not been submitted elsewhere for a degree.

(Dr. Sanjeev Kumar Aggarwal) Professor, Department of Computer Science & Engineering, Indian Institute of Technology, Kanpur.

(Dr. Pankaj Jalote) Professor, Department of Computer Science & Engineering, Indian Institute of Technology, Kanpur.

June 2006


Synopsis

Software maintenance is the major activity performed once software is released. It involves enhancing or optimizing the software and removing its defects. The problem of software maintenance becomes more difficult in legacy systems. Legacy systems are large software systems that remain in continuous use and that organizations are often reluctant to replace, since the cost of replacing them is very high and a lot of business logic, accumulated over a period of time, is hidden in them. Due to a lack of coding discipline, old legacy systems are not well documented; knowledge sources for these systems are therefore very scarce. Sometimes we do not even know the grammar of the underlying programming language, especially when the software is written in a dialect of a standard language. The grammar of a programming language is an important asset because it is used in automatically generating software analysis and modification tools. Hence, without a proper grammar, our ability to develop tools that help in maintaining legacy systems is reduced.

In this thesis, we address the problem of grammar extraction when a language undergoes growth. An example of growth in a programming language is the creation of new dialects. Standard languages sometimes lack features needed for a particular software implementation, which leads to the creation of dialects. Sometimes the grammar of a dialect is not available (in the form of a reference manual or compiler source code) and has to be extracted from programs written in the dialect. For example, C* is a data parallel dialect of C which was developed by Thinking Machines Corp for Connection Machine (CM2 and CM5) SIMD processors. Since Thinking Machines Corp no longer exists, it is very hard to find the compiler source code or the reference manual for C*. The only things available are programs written in C* and the executable of the C* compiler. In this thesis, we develop an approach which extracts the grammar of a dialect when the grammar of the standard language and a set of programs written in the dialect are given as input.

A programming language grammar can be represented by a context free grammar; therefore this problem can be posed as one of learning/extracting a context free grammar from a set of valid sentences (programs). It is known that the exact grammar cannot be identified from positive samples (a set of valid programs) alone, since for a given finite set of valid programs there exist infinitely many grammars which can generate it (one can always generalize the grammar, and there are no negative samples to restrict the generalization). Therefore, the exact grammar implemented in the compiler of the dialect cannot be extracted. We sacrifice the criterion of extracting the exact grammar and address the problem of extracting a grammar which is complete with respect to the given set of programs: a grammar is complete with respect to a set of programs if it can parse all the programs successfully.

A critical study of programming languages and their dialects shows that the main extensions in programming language dialects are: (1) new statements containing new keywords, e.g. C* is a dialect of C which contains new statements with keywords with, where, etc.; (2) new expressions containing new operators, e.g. C* has new operators max, min, maxto, minto, etc.; and (3) new declaration specifiers to support new data types, e.g. RC is a dialect of C which supports new declaration specifiers, viz. sameregion, parentptr and traditional, to support safe region based memory. This thesis focuses on these main extensions.

In this thesis, we first propose an approach which extracts keyword based grammar rules from a set of programs; it assumes a hypothetical dialect in which the additional rules are keyword based rules whose right hand sides start with a new terminal (a generic term for new keywords, new operators, and new declaration specifiers). We later extend the approach and relax this assumption to allow the new terminal to occur at any place in the right hand side of the missing rule, i.e. a rule of the form A → α a_new β, where A is a nonterminal, α and β are symbol strings, and a_new is the new terminal. The extended approach also extracts grammar rules which contain new operators and new declaration specifiers. The approach is iterative: in each iteration a set of possible rules corresponding to a new terminal is built and one of them is added to the incomplete grammar. Once a rule corresponding to each new terminal has been added to the initial grammar, the modified grammar is checked for completeness with respect to the given set of input programs. If the modified grammar is incomplete, the algorithm backtracks and selects another set of rules; otherwise the rules added in the iterations are returned as correct. A set of rules is called correct if the grammar, after adding those rules, parses all the programs successfully. We also prove the correctness of the approach.

The above solution results in a large set of possible rules for real programming language grammars; hence we propose a set of optimizations to make the approach more efficient. One optimization uses multiple programs for building the set of possible rules: it builds a set of possible rules from each of several programs and then takes their intersection to get a reduced set. Another optimization exploits the abundance of unit productions in programming language grammars: it checks the correctness of the most general rules only, which reduces the number of possible rules to be checked.

A prototype of the proposed approach has been implemented. We implemented a CYK parser which, besides parsing, supports other functionality needed in our approach, such as building a set of possible right hand sides from the input programs. We reuse some of the code of the CUP parser generator developed at Princeton University for implementing the LR parser generator; CUP generates an LALR parser whereas our implementation generates an LR parser. We have also implemented the optimizations. Experiments were conducted to test our approach and to study the impact of the optimizations, using real programming language grammars such as C and Matlab. The experiments show that the approach works correctly and that the optimizations make it feasible for real programming language grammars.

Since there exist many sets of correct grammar rules and exact learning of the grammar is not possible, we experimented with different selection criteria for selecting the "best" rule from the set of correct rules corresponding to each new terminal. We first examined the effectiveness of grammar based metrics in rule selection; such metrics are used for assessing the complexity of grammar dependent software such as compilers, editors, etc. Experiments show that these metrics are not sufficient for rule selection, as there can be many rules with the same metric value. Hence, we propose two more rule selection criteria: one closely related to the principle of minimum description length, and another based on common patterns found in programming language grammars. Each selection criterion, and its combination with the other criteria, is assessed experimentally. For the experiments, we removed a rule corresponding to a keyword, generated a set of possible correct rules, and then applied the selection criteria to select one rule; the selected rule was then evaluated for its closeness to the rule actually removed. Experiments show that the proposed criteria, when coupled with grammar metrics, select reasonably good grammar rules.

Dedicated to My Parents

Acknowledgements

My stay at IITK has given me the opportunity for many lifetime experiences. One among them is working with Dr. Pankaj Jalote and Dr. Sanjeev Kumar Aggarwal. Under their supervision I learned the process of organized and focused work, and they helped me a great deal in improving the clarity of my technical writing. Dr. Jalote first explained to me why the unavailability of programming language grammars is a big problem for software companies. I would like to thank Dr. Jalote for his insight, for his belief in my capabilities, and for making my first research experience a lifetime experience. His constructive comments on different aspects of my attitude have helped me improve and become a better researcher. His research insight has contributed substantially to giving this thesis its complete shape. In spite of my several shortcomings, he never pointed them out explicitly. I would like to thank Dr. Sanjeev Kumar Aggarwal for his very useful suggestions and discussions throughout my thesis work. Dr. Aggarwal's command of compiler techniques was very helpful while implementing the prototype, and discussions with him on a range of topics were also very useful. I thank Dr. Aggarwal for providing me good facilities throughout my Ph.D. tenure.

I thank Dr. Ralf Lämmel for useful discussions on my thesis work. His enthusiastic and prompt responses and his prior experience in grammar recovery helped me a lot. I thank Dr. Barrett Bryant and Dr. Marjan Mernik for useful discussions during my visit to ACM SAC'06 in France. I extend my thanks to Dr. Somenath Biswas for his technical comments during the initial phase of my thesis work; his sound theoretical insight helped me identify inaccuracies in my work and improve them. I thank my other course instructors Dr. Sanjeev Saxena, Dr. Manindra Aggarwal, and Dr. R. K. Ghosh. I thank Dr. Phalguni Gupta for his occasional inquiries about my thesis progress. I thank the other faculty members in the department for their cooperation. I thank Mr. Rao Remala for his generous grant which helped me attend very useful international conferences. I thank the CSE lab staff, R. P. Tiwari and B. M. Shukla, for their help. I also thank the CSE office members, especially Mishra ji and Santosh, for their help with all the paperwork.

I would like to thank past and present Ph.D. students of the department. I thank Prof. Manoj Gore, who inspired me to choose research as a career. I thank Dr. P. Suresh for his useful suggestions during the initial years of my program. I also thank Dr. Atul Kumar for his useful suggestions on several fronts, not only during his stay at IITK but also after he left. I would like to thank all my friends in the department with whom I shared memorable moments. I thank Vijaya Saradhi and Karmveer for useful discussions on a range of topics; their balanced suggestions were very useful at several moments. I was lucky to spend time with very brilliant colleagues; I thank Karmveer Arya, Vijaya Saradhi, Nitin, Neeraj, Vibhu, Piyush, Atul Gupta, Sagarmoy and Deepanjan for giving me good company during most of my Ph.D. tenure. I would like to thank Vartika and Nupur for their company in the department during the first two years; their constant encouragement and enthusiasm for research helped me adjust to the academic environment of IITK. I thank my present wingmates Asha and Dipti and past wingmates Shraddha, Pooja, Bharti, Deepti, Vartika and Nupur for a pleasant hostel life. I will always cherish the post-dinner discussions with them on a range of topics.

Lastly, I want to express my thanks to my family members, who have always given me the freedom to choose research as a career. Without their support this thesis would not have been possible. The diligence of my father, Mr. Jai Narayan Dubey, and the management skills of my mother, Ms. Shanti Dubey, have always inspired me to attain even a fraction of these qualities. I thank my brothers Mr. Prabhu Narayan Dubey, Mr. Dharm Narayan Dubey, Mr. Deo Narayan Dubey and Mr. Shiv Narayan Dubey for arranging my professional tours and providing all kinds of other support. I thank my sisters Ms. Ranjana Pandey, Ms. Kalpana Tiwari and Ms. Shantvana Tiwari for their constant moral support. I am very lucky to be part of such a family.

Alpana Dubey

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Historical Perspective
  1.2 Grammar Extraction in Programming Language Domain
  1.3 Goals of Our Work
  1.4 Outline of the Thesis

2 Related Work
  2.1 Terminology
  2.2 Theoretical Results
  2.3 Learning Regular Languages and its subclasses
  2.4 Learning Subclasses of Context Free Language
  2.5 Learning from Structured Inputs
  2.6 Genetic Algorithmic Based Approaches
  2.7 Grammar Extraction from Other Knowledge Sources
  2.8 Other Attempts
  2.9 Summary

3 System Model
  3.1 Programming Language and Grammar
    3.1.1 LR parser
    3.1.2 CYK Parser
  3.2 Programming Language Syntax
  3.3 Syntactic Growth of Programming Languages
  3.4 Correctness and Completeness of a Grammar
  3.5 Grammar Extraction under Language Growth
  3.6 Summary

4 Grammar Completion with Single Rule
  4.1 Problem Definition and Assumptions
  4.2 Rule Extraction
    4.2.1 LHSs Gathering Phase
    4.2.2 RHSs Generation Phase
    4.2.3 Rule Building and Rule Checking Phase
    4.2.4 A Realistic Example
  4.3 Proof of Correctness
  4.4 Time Complexity
  4.5 Using Multiple Programs
  4.6 Extracting a Rule of Form A → α a_new β
    4.6.1 LHSs Gathering Phase
    4.6.2 RHSs Generation Phase
  4.7 Summary

5 Grammar Completion with Multiple Rules
  5.1 Notation Recap
  5.2 Overview of the Approach
  5.3 An Example
  5.4 Discussion
  5.5 Proof of Correctness
  5.6 Time Complexity
  5.7 Extracting Rules of Form A → α a_new β
  5.8 Summary

6 Optimizations
  6.1 Utilizing Unit Productions
  6.2 Optimization in Rule Checking Process
  6.3 Rule Evaluation Order
    6.3.1 A Criterion for Rule Evaluation Order
    6.3.2 Experiments
    6.3.3 Discussion
  6.4 Summary

7 Implementation and Experiments
  7.1 Implementation
    7.1.1 Overview
    7.1.2 Modified LR(1) Parser Generator
    7.1.3 CYK Parser
  7.2 Experiments
    7.2.1 Extracting Single Rule without Optimization
    7.2.2 Unit Production Optimization
    7.2.3 LHSs Filtering Optimization in Rule Checking Phase
    7.2.4 Performance of Rule Evaluation Order
    7.2.5 Experiments of Multiple Rules Extraction
    7.2.6 Experiments to Recover C* Specific Grammar Rules
  7.3 Summary

8 Rule Selection Criteria
  8.1 Problem Definition and Terminology
  8.2 Grammar Based Metrics
  8.3 Using Grammar Based Metrics for Rule Selection
  8.4 A Criterion Based on Weight of Rules
  8.5 Usage Count Based Rule Selection Criterion
  8.6 Discussion
  8.7 Summary

9 Conclusions
  9.1 Contributions
  9.2 Future Work

References

List of Figures

2.1 Execution of Angluin's algorithm
2.2 Steps in regular language inference from positive and negative sample
2.3 PTA corresponding to example set {λ, 00, 11, 0000, 0101, 0110, 1010}
2.4 Steps in the inference of zero reversible language
2.5 Application of a transition function in tree automata
3.1 Moves of LR parser on input aabab
3.2 State machine corresponding to original grammar
3.3 CYK table built for input aabab
3.4 A small program with keyword while
4.1 Overall single rule extraction approach
4.2 Algorithm for gathering possible LHSs
4.3 Partial parse trees in each step
4.4 Algorithm for RHSs building
4.5 Possible RHSs gathered from the program and correct rules
4.6 Example of rule extraction process
4.7 State machine corresponding to modified grammar
4.8 Example of building possible RHSs
5.1 Set of input programs
5.2 Program where while is the last keyword
5.3 Algorithm for extracting multiple missing rules
5.4 Possible RHSs constructed from each program and their intersection
5.5 Set of possible rules for keywords for and while and set of rule pairs needed to make the grammar complete
5.6 A set of input programs
6.1 Example of LHSs filtering
6.2 Input program and set of possible rules corresponding to terminal while
7.1 The schematic diagram of grammar extraction module
7.2 GSS before reduction
7.3 GSS after reduction
7.4 GSS after shifting common state
7.5 GSS before and after local ambiguity packing
7.6 Comparison of times taken by LR parser and CYK parser in rule checking module (times are in seconds on Y axis)
7.7 Comparison of times taken by modified CYK parser and simple CYK parser in rule checking module (times are in seconds on Y axis)
8.1 A small program with keyword while and a set of correct rules

List of Tables

3.1 LR table generated from the grammar given in example 3.1
6.1 Summary of unit productions in different programming language grammars
6.2 Weight and coverage of possible rules given in figure 6.2
7.1 A summary of different modules
7.2 Summary of experiments
7.3 Summary of the unoptimized approach on single rule extraction
7.4 Number of possible rules generated from the programs and their intersection in different PL grammars
7.5 Comparison of number of all possible RHSs and number of most general RHSs
7.6 Comparison of LR-parser and CYK-parser as a grammar completeness checker
7.7 Experiments of LHSs filtering optimization (times are in milliseconds)
7.8 Experiments for MDL based weight metric
7.9 Experiments on multiple rules extraction
7.10 Summary of new terminals in C*
7.11 Rules extracted from programs written in C* grammar
8.1 The number of possible rules and correct rules
8.2 Comparison of rules having the lowest TIMP with other rules
8.3 Comparison of rules having the highest weight with other rules
8.4 Comparison of rules having the highest usage with other rules
8.5 Comparison of weight and usage as a selection criterion
8.6 Rules selected by highest coverage and highest usage criteria

Chapter 1

Introduction

Grammar is an important part of a programming language specification; it is used in developing software maintenance tools such as program analysis and modification tools. Software maintenance is the major activity performed once software is released; it involves enhancing or optimizing the software and removing its defects. For example, suppose we want to develop a program modification tool which replaces the IF construct in a COBOL program with another IF construct that also has an ENDIF terminator. One possible way is to develop a tool which lexically searches for IF constructs and replaces them with the IF construct terminated by ENDIF. This is not a good approach, as it is not general enough. A generic solution to this problem is to first develop a modification tool generator which takes the grammar of any programming language as input, along with a specification of the construct to be replaced (a set of translation rules), and automatically generates a modification tool. When a COBOL grammar, along with the translation rule corresponding to the IF construct, is given as input to this tool generator, it will generate the intended modification tool (a small sketch of this idea appears below). YACC [73] and BISON [74] are two examples of such tool generators, used for generating compilers, language translators, etc. It is much easier to write a syntax directed translation corresponding to a given construct than to develop a tool which lexically searches for the construct and replaces it. Hence the grammar is important for automating the tool development process.

A practical example of the importance of grammar in the programming language domain is the problem faced during the Y2K era. It was estimated that around 225 billion lines of COBOL code were to be modified [31]; these were written in various dialects of COBOL. About 40% of the installed software, written in less common languages, was also affected by the Y2K problem. Capers Jones [28] estimated that automated tool support for around 500 languages was needed in order to solve the Y2K problem, whereas Y2K repair engines for only 10 languages were available at that time. Although the Y2K problem no longer exists, grammars remain important in the software engineering and programming language domains, as the problems of code migration and code analysis are still prevalent in the software industry. For example, Goldstone [77] is a company involved in the migration of Forté systems to the J2EE platform [76].

Grammar is present wherever we find structure. A book, for example, follows a structure: it is divided into chapters, chapters are divided into sections, and sections into paragraphs. This structure can be represented by a grammar. Genes follow patterns which, researchers believe [54], can be represented by a grammar. Hence, the problem of grammar extraction is important not only in the programming language domain but also in other domains where structure is learned from a given set of data, such as natural language processing (NLP), document processing applications, bioinformatics, programming languages and software engineering. The main focus of this thesis is grammar extraction in the programming language domain.
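As a concrete illustration of the tool-generator idea mentioned above, the following fragment (a minimal sketch with hypothetical node shapes; it is not the thesis's tooling, and real generators such as YACC work from grammar specifications rather than hand-built trees) rewrites IF constructs by transforming a parse tree instead of searching text:

```python
# A minimal sketch (hypothetical node shapes, not the thesis's tooling):
# a parse-tree transformation that rewrites COBOL-style IF statements
# into the terminated form ending in END-IF.

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "IF", "STATEMENT"
    text: str = ""                 # leaf text for simple statements
    children: list = field(default_factory=list)

def emit(node: Node) -> str:
    """Regenerate source text, adding END-IF to every IF node."""
    if node.kind == "IF":
        cond, then_part = node.children
        body = " ".join(emit(c) for c in then_part.children)
        return f"IF {cond.text} THEN {body} END-IF"
    if node.kind == "STATEMENT":
        return node.text
    return " ".join(emit(c) for c in node.children)

# One IF nested inside another: the tree transform handles both levels,
# where a lexical search-and-replace would have to track nesting itself.
inner = Node("IF", children=[Node("COND", "Y > 0"),
                             Node("THEN", children=[Node("STATEMENT", "MOVE A TO B")])])
outer = Node("IF", children=[Node("COND", "X > 0"),
                             Node("THEN", children=[inner])])
print(emit(outer))
# IF X > 0 THEN IF Y > 0 THEN MOVE A TO B END-IF END-IF
```

Because the rewrite walks the tree, nested IF constructs are handled for free, which is exactly why a grammar-driven tool is preferable to lexical search-and-replace.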

1.1 Historical Perspective

Grammar extraction is closely related to language learning. Learning is a process in which a model of a system is obtained from a set of evidence. Language learning is a process in which a language is learned from sentences of the language; the evidence is the sentences, and the model of the system is the set of grammar rules used in generating those sentences. The problem of language learning in children has long attracted researchers in natural language processing (NLP). The debate over whether language learning is innate in humans or whether machines can learn languages is decades old; Chomsky [79], for example, has strongly supported the position that language learning is innate in humans. This belief was not widely accepted [58, 59], as the learnability of different classes of languages, such as regular languages and context free languages, is still being investigated by many researchers. A language is learnable if there exists an algorithm which can learn the language from input sentences. Since a language can be represented by a grammar, language learning is also termed grammar learning, commonly referred to as grammatical inference or grammar extraction.

A large body of work on grammatical inference exists in the NLP and theoretical domains [1, 12, 37, 39, 47, 68]; some attempts study the learnability of different classes of grammars. Unfortunately, most results concerning the learnability of context free grammars are negative. There are two main models under which the learnability of grammars is discussed: (1) language identification in the limit, and (2) exact identification of the grammar with a polynomial number of queries. One of the important results, given by Gold, states that no language class in the Chomsky hierarchy [10] can be identified in the limit from positive samples alone [19]. A positive sample is a valid sentence of a language and a negative sample is an invalid sentence of a language. An intuitive reason behind this result is that the learner tends to overgeneralize, because there are no negative samples to alter the learner's guess. The second result, given by Angluin [7], is based on the second model of learning, i.e. learning by queries to an oracle. Her result states that the problem of grammar learning with a polynomial number of queries is NP-hard [24], and it was later proved NP-complete [24]. Deterministic finite automata (DFAs) have been shown to be identifiable in the limit from polynomial time and data [45, 46], but this is not true for nondeterministic finite automata (NFAs) or context free grammars. Most of the learnability questions for CFGs are still open: the class can be identified in the limit from positive and negative samples, but whether it can be identified with a polynomial number of queries is an open problem. Despite these negative results, researchers have tried to address the problem of grammar inference either empirically or by solving the problem for subclasses of context free languages (CFLs).

Despite a plethora of activity in the NLP domain, very few attempts at grammar extraction exist in the programming language (PL) domain. There are a few differences due to which attempts made in the NLP domain cannot be applied directly in the PL domain. First, in the NLP domain grammar is inferred from sentences which are not more than 20-30 words long, whereas in the PL domain each sentence (that is, one program) is very large; even a very small program consists of 10-20 lines. Second, recursive constructs are not prevalent in natural language sentences, whereas in a program we may find several occurrences of recursive constructs nested within each other; an example in a programming language is the nesting of while loop, for loop and if-then-else blocks within each other. Nesting occurs in natural language sentences only in a very limited form; e.g. we do not find a series of noun phrases and verb phrases nested within each other. Also, most of the attempts in the NLP domain are supervised learning attempts in which a grammar is first learned from training data consisting of a set of sentences and their corresponding parse trees. These attempts learn a probabilistic grammar from this data; for a sentence whose parse tree is unknown, the system assigns the most probable parse tree.

1.2 Grammar Extraction in Programming Language Domain

Lämmel et al. [32] have given a general framework in which a grammar is obtained from various artifacts such as compiler source code, grammar specification files and language reference manuals. Although these artifacts are the first choice for recovering a grammar, sometimes these resources are not available, particularly when programs are written in some dialect of a standard programming language. For example, C* [83] is a data parallel dialect of C which was developed by Thinking Machines Corp for Connection Machine (CM2 and CM5) SIMD processors. Since Thinking Machines Corp no longer exists, it is very hard to find the compiler source code or reference manual for C*; the only things available are programs written in C* and the executable of the C* compiler. Sometimes resources are available but incomplete. For example, Lämmel et al. [32] have discussed the problem of incomplete reference manuals they faced while recovering a COBOL grammar from the reference manual of IBM COBOL. In such cases, the only information available is the source code and an incomplete grammar. Mernik et al. [41, 70, 71] have proposed a genetic programming approach based on grammar specific heuristic operators; their approach extracts grammars of small domain specific languages. None of the existing attempts infers a grammar from a set of programs written in a real programming language. In summary, a survey of existing grammatical inference techniques shows that:

• Existing grammatical inference techniques target small subclasses of context free languages, which are not applicable in the PL domain.

• Attempts in the PL domain are either heuristic in approach or extract the grammar from sources like compiler source code, language reference manuals, etc.

• A grammar cannot be learned from positive samples alone, as there exist infinitely many grammars which can generate a given set of programs. Hence, we face the problem of selecting a grammar from a set of grammars where each grammar accepts the given set of programs (see the sketch after this list).

So there is a pressing need for an approach which can learn programming language grammars. From this motivation we identify the main goals of this thesis, which are discussed in the next section.
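The overgeneralization problem noted in the third bullet can be seen on a toy example. The sketch below (a brute-force enumerator with grammars and sample invented for illustration, not the thesis's method) generates all sentences of bounded length from two different grammars: both cover the positive sample, yet their languages differ, so the sample alone cannot pick out the "right" grammar.

```python
# Illustrative sketch: two different grammars that both cover the same
# positive sample, showing why positive samples alone cannot identify
# the exact grammar. Grammars and sample are made up for illustration.

def sentences(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    def expand(form):
        for i, sym in enumerate(form):
            if sym in grammar:                    # leftmost nonterminal
                for rhs in grammar[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    # prune forms whose terminals already exceed the bound
                    if len([s for s in new if s not in grammar]) <= max_len:
                        yield from expand(new)
                return
        if len(form) <= max_len:                  # all-terminal form
            yield " ".join(form)
    return set(expand((start,)))

# G1: exactly one statement between begin/end.
g1 = {"S": [("begin", "stmt", "end")]}
# G2: one or more statements between begin/end (a generalization of G1).
g2 = {"S": [("begin", "L", "end")], "L": [("stmt",), ("stmt", "L")]}

sample = "begin stmt end"
print(sample in sentences(g1, "S", 5))                  # True
print(sample in sentences(g2, "S", 5))                  # True
print(sentences(g1, "S", 5) == sentences(g2, "S", 5))   # False: languages differ
```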

1.3 Goals of Our Work

The primary goal of this thesis is to develop a feasible approach which extracts the grammar of a programming language from a set of valid programs. Extracting a programming language grammar, which is normally large, from scratch is too ambitious, and that problem rarely occurs in practice. In practice, when programs in a language are given, some approximate grammar for the language is either available or easily inferred. Therefore, we address the problem of completing a grammar when an incomplete grammar is given along with a set of programs; we call the incomplete grammar an approximate grammar. In other words, we wish to learn a set of grammar rules from a set of programs which, when included in the approximate grammar, gives a grammar that accepts the given set of input programs. We assume that the input programs are syntactically correct; that is, we address the problem of grammar rule extraction from positive samples.

We address the problem of extracting a "good" (not the exact) grammar since, by Gold's result, "exact" identification of a grammar is impossible in the limit from positive samples alone. A good grammar is a member of the set of grammars each of which can generate the given set of programs. The problem of extracting the grammar of a dialect poses several questions, such as: What kinds of extensions are found in the syntax of programming language dialects? What should be the properties of the grammar given initially (i.e. the grammar of the standard language) for successfully extracting the grammar of the dialect? When should we say that a grammar is a good grammar? In this thesis we address the above questions. We first investigate language dialects and the extensions found in their syntax, and use this domain specific information to formulate the problem of grammar extraction under programming language growth. Later we propose a feasible approach to grammar extraction. A major portion of this thesis has appeared in articles [13, 14, 15, 16]. In summary, the main goals of the thesis are the following:

• Develop an approach which extracts a "good" grammar from a set of correct programs and an incomplete grammar. We wish to arrive at a deterministic and feasible solution to this problem.

• Investigate the proposed approach theoretically as well as experimentally, by prototyping and using different programming language grammars.

• Investigate different criteria for grammar rule selection.

1.4 Outline of the Thesis

The rest of this thesis is organized as follows:

Chapter 2 discusses related work in the field of grammar extraction. We first discuss a set of important results related to the learnability of grammars, and then the attempts made at grammar extraction.

Chapter 3 provides the background theory and important terms used in the thesis. We discuss the extensions commonly observed in programming language dialects and formulate the problem of grammar extraction based on these observations.

Chapter 4 discusses the basic approach to grammar extraction, for the case when only one rule is missing from the approximate grammar. We also give a proof of correctness of the approach.

Chapter 5 discusses the extension of the basic approach to the case when multiple rules are missing from the approximate grammar, and its correctness proof.

Chapter 6 discusses a set of optimizations to improve the grammar extraction process. We discuss three optimizations, which are experimentally verified in chapter 7.

Chapter 7 discusses the implementation of the approach. We have implemented the approach to verify it and the optimizations on real programming language grammars; we study their feasibility on four programming languages, viz. Java, C, Matlab and Cobol.

Chapter 8 examines different criteria for rule selection. Since we address the problem of grammar extraction when only correct programs (i.e. positive samples) are available, exact learning is not possible; therefore we explore different rule selection criteria. We critically examine the grammars of different programming languages, hypothesize a notion of goodness of grammar rules, and verify our hypothesis on a set of programs.

Chapter 9 concludes the thesis with a summary of our contributions. We also discuss possibilities for future work.

Chapter 2

Related Work

The problem of language learning in children has attracted researchers for many years. It poses an important question: does there exist a formal method of language learning in humans, or is language learning innate? Chomsky strongly believes that language learning is innate in human beings [79], a position that has been criticized by many researchers [59]. The importance of grammar learning is now being felt in other domains as well, such as formal language theory, syntactic pattern recognition, structural pattern recognition, computational biology, speech recognition and software engineering. In programming languages, though important for reverse engineering, this problem has not yet achieved significant maturity.

In this chapter we present a survey of context free grammar (CFG) extraction approaches. We have divided such attempts into two broad categories: (1) those which obtain the grammar from various artifacts containing a specification of the grammar in some form, and (2) those which extract/infer the grammar from a set of input sentences (input programs), with or without the help of a teacher (called supervised or unsupervised learning, respectively). Approaches in the first category are termed grammar recovery attempts, as they do not infer the grammar from sentences; rather, they recover it from available artifacts. This term was used by Lämmel et al. in [32], where they proposed a framework for obtaining a grammar from sources other than programs. In the programming language domain, grammar extraction work mostly falls into the first category (i.e. grammar recovery). Since the goal of this thesis is grammar extraction/inference in the programming language domain, we also discuss grammatical inference attempts made in other domains. We use the terms inference, extraction and learning interchangeably in our discussion.

Section 2.1 discusses terminology related to context free languages and regular languages. Section 2.2 provides some important theoretical results on the learnability of different language classes. Section 2.3 discusses the main work done in the field of regular language inference; we discuss those attempts which have formed a basis for many other works on context free languages. Sections 2.4 through 2.8 discuss different approaches proposed for inferring context free languages and their subclasses.

2.1 Terminology

There are four main grammar classes defined by Chomsky [24], viz. (1) regular grammars, (2) context free grammars, (3) context sensitive grammars, and (4) unrestricted grammars. The most restricted class is the regular grammar. Programming language syntax is mostly expressed with context free grammars (a few constructs cannot be expressed with a context free grammar; syntax directed translation is used to represent those); therefore our focus is on the inference of context free grammars. In this section we discuss terminology related to regular grammars and context free grammars, as these are used in further discussions.

A regular language is a language which can be recognized by a finite automaton (FA). A finite automaton is a five-tuple (Q, Σ, δ, q0, F) [24], where Q is a set of states, Σ is a set of terminal symbols, q0 is the start state, F (F ⊆ Q) is a set of final states, and δ : Q × Σ → 2^Q is the transition function. The extended transition function δ̂(q, u), where u = wa (a ∈ Σ, u, w ∈ Σ*) is a string of terminals, is defined as follows, with δ̂(q, λ) = {q}:

    δ̂(q, u) = {b : b ∈ δ(p, a) and p ∈ δ̂(q, w)}

A string w is said to be accepted by a finite automaton (Q, Σ, δ, q0, F) if δ̂(q0, w) ∩ F is a nonempty set. A finite automaton is called a deterministic finite automaton (DFA) if every δ(q, a) (q ∈ Q and a ∈ Σ) contains only one state. A regular language is also referred to as a regular set. The symbols u, v, w are used to represent terminal strings, and λ represents the empty string. A finite automaton can be represented by a directed graph where states are the nodes and each transition δ(qi, a) = {qj} is an edge labeled a from qi to qj.

A context free grammar is a 4-tuple G = (N, T, P, S), where N is the set of nonterminals, T is the set of terminals, P is the set of productions for generating valid sentences of the language, and S ∈ N is a special symbol called the start symbol. Productions are of the form A → α, where α is a sequence of terminals and nonterminals; productions are also termed rules or grammar rules. Sequences of terminals and nonterminals, which we call "symbol strings", are represented by the lower-case Greek letters α, β, γ, etc. The starting capital letters A, B, C denote nonterminals, and the ending capital letters X, Y, Z denote symbols that may be either terminal or nonterminal. The lower case letters a, b, c, ... represent terminal symbols, and strings of terminals only are denoted by w, x, y, z. A language accepted by a CFG is called a context free language (CFL). The terms sentence and program are used interchangeably.

Suppose A → γ is a production in the grammar G = (N, T, P, S) and αAβ is a string of terminals and nonterminals; then αAβ ⇒_G αγβ denotes that αAβ derives the symbol string αγβ in G. A sequence of derivations α1 ⇒_G α2 ⇒_G ... ⇒_G αn is denoted α1 ⇒*_G αn. A symbol string β ∈ (N ∪ T)* is called a sentential form of G if S ⇒*_G β. The language generated by G is defined as follows:

    L(G) = {w | S ⇒*_G w}

The FIRST set of a symbol string α in G is defined as follows:

    FIRST_G(α) = {a | α ⇒*_G aw}

If G is understood, we drop the subscript in the above definitions; i.e. we denote α1 ⇒*_G αn as α1 ⇒* αn and FIRST_G(α) as FIRST(α).

Theoretical Results

In this section we discuss some theoretical results on the learnability of different language classes; i.e. whether it is theoretically possible to learn a class of language from a set of sentences. Unfortunately, most of the results concerning the learnability of context free languages are negative. Before discussing those results, first we will discuss the models on which the learnability of languages are defined. There are two main models which are used for addressing the learnability question of languages. The first model, called “exact identification in the limit”, is given by Gold in 1967 [19]. In this model, a learner receives an information It at each time unit (t here) and outputs a hypothesis or model of the system, H(I1 , I2 , . . . , It ). A learning algorithm is successful, if after a finite amount of time the hypothesis does not change. The second popular learning model is “exact identification using queries in polynomial time”, given by Angluin in 1988 [6]. In this model, a learner can ask questions to an oracle and must halt in polynomial time with the correct description of the language. Other learning models are a variation of these two models. Most of the results concerning the learnability of languages are negative. Gold’s seminal paper [20] states that it is impossible to identify any of the four classes of grammars of the Chomsky hierarchy in the limit, if input data consists of only positive samples (strings that are in the language). Since Gold’s theorem is related with the exact identification in the limit, it is obvious that no language can be learned from positive samples alone in the limit, because in all cases the learner tends to produce a hypothesis/grammar which is a superset of the target and there are no negative samples to alter the learner’s guess. Gold developed his results further, which states that if a set of positive and negative examples are given then at some point the learner will return the correct hypothesis. He showed that the deterministic finite automata are identifiable in the limit from polynomial time and data. A class R (suppose R is a set of all DFAs) is said to be identifiable in polynomial time and data if: (1) for a given set of positive and negative samples, (S+, S−) of size m, an algorithm A exists which returns a representation R in R (i.e. a DFA) consistent with (S+, S−) in O(p(m)) time, and (2) for every representation R of size n, there exists a characteristic sample (CS+, CS−) of size less than q(n) for which, if S+ ⊇ CS+ and S− ⊇ CS−, A returns a representation R′ equivalent to R.

12

Here p() and q() are two polynomials. A characteristic sample for a representation R is the smallest set of sentences which is essential for correctly identifying R. A characteristic sample must be monotone, i.e. if a few more correctly labeled samples are added in the characteristic set, then the algorithm must infer an equivalent representation. The above result does not talk about the size of the DFA learned. Gold has further showed that finding the smallest DFA consistent with a given set of positive and negative sentences is NP-hard [24]. Another result, given by Angluin [7], is based on the framework of learning by questions to an MAT (minimally adequate teacher). A learner can ask two types of queries to an MAT: (1) Membership query in which he proposes a string to the MAT and MAT gives answer as yes or no depending upon whether the string falls in the language or not, and (2) Equivalence query in which the learner proposes a representation (i.e. a grammar); if representation is equivalent to the actual representation it returns answer yes else it returns a counterexample string which falls in the symmetric difference of the set proposed by the learner and the set being learned. It was shown that DFAs can be learned in time polynomial in the number of states of the minimum DFA and maximum length of any counterexample [5]. Angluin also showed that DFAs can not be learned if the learner is allowed to ask only a polynomial number of membership queries or polynomial number of equivalence queries; he further showed that this problem is NP hard even if the target automata consists of only two states [7, 49]. Another result is given by Higuera [22] which is based on the model of identification in the limit from polynomial time and data. This states that DFAs are learnable in this model but this is not true for the NFA (nondeterministic finite automata) and CFG. The entire class of CFGs seems intractable in all the learning models. Most of the questions regarding the learnability of CFGs are still open. The class can be identified in the limit (with positive and negative samples) but whether this can be identified with a polynomial number of queries is an open problem. Despite these negative results, there are some positive results for DFA learning; researchers have tried to extend those results for language classes bigger than the DFA. Since DFA represents regular languages, in coming sections we will discuss some of the key attempts done in the regular language inference and then discuss attempts done in inference of context free languages (CFLs) and its subclasses.

13

2.3

Learning Regular Languages and its subclasses

Angluin [5] has proposed a method which learns regular languages using membership and equivalence queries. Suppose S + is a set of valid sentences given initially. His approach works on an observation table T in which each row corresponds to a string; each string is either of form w or of form wa where w is a prefix of a sentence falling in S + and a is a terminal symbol. For example, if S + = {λ, b, aa}, then rows of the table will correspond to strings λ, b, a, aa, ab, ba, bb, aaa, aab. Table entries are built by asking membership queries as follows: For building a table entry T (u, v), corresponding to the column representing string v, a membership query uv is posed to the teacher. If the answer is yes then entry T (u, v) is assigned as T (u, w) ∪ {v} (T (u, w) is the table entry of the previous column) else it is retained as T (u, w). A string x ∈ T (u, v) implies that string ux is a valid sentence of the target language. An observation table for the set of valid sentences S + = {λ, b, aa} is shown in the figure 2.1(a). Initially the table has three columns (0, 1 and 2) as these entries can easily be determined from S + . Now, inconsistencies in the table are determined. A table is inconsistent if rows corresponding to strings u and v (for some u and v ) are the same but rows corresponding to ua and va (a is some terminal) are not the same. For example, in the figure 2.1(a) rows corresponding to b and aa are the same (see entries of columns 0, 1 and 2) but rows corresponding to aaa and ba are not the same, therefore a separate column is added to differentiate these two strings (aa and b). In the figure 2.1(a) column aa is added and the entries corresponding to this column is filled by asking membership queries. String aa is added in the entry T (aa, aa) as aaaa is a valid sentence whereas it is not added in the entry T (b, aa) as string baa is not a valid sentence. Entries for other rows are modified in the same manner. This process is repeated until a consistent table is obtained. Finally, information from the table is used in building its corresponding DFA as shown in the figure 2.1(b). Each state of the DFA corresponds to a prefix of a string in S + . Transitions are obtained from the observation table. For example, rows corresponding to a and aaa are the same in the table of figure 2.1(a) (i.e. these are equivalent states), therefore we make a transition from state aa back to state a. This DFA is proposed to the teacher. If the DFA is equivalent to the actual DFA, then the teacher returns the answer of yes else it replies with a counter example. The counter example and all its prefixes are added in the observation table. And previous

14

steps are repeated to make the observation table consistent and closed. This process is performed until the teacher returns an answer of yes to an equivalence query.

i string

0

1

λ

b

a

2

3 aa b

same

Not same

λ

φ

{b}

{b}

{b, aa}

a

φ

φ

{a}

{a}

b

{λ}

{λ}

{λ}

{λ}

aa

{λ}

{λ}

{λ}

{λ, aa}

ab

φ

φ

φ

φ

ba

φ

φ

φ

φ

bb

φ

φ

φ

φ

aaa

φ

φ

{a}

{a}

aab

φ

φ

φ

φ

b a λ

a

(a) Observation table

a

a

aa

(b) DFA obtained

Figure 2.1: Execution of Angluin’s algorithm

Oncina et al. [46, 45] have proposed a method for inferring regular languages from positive and negative samples. Their approach first builds a prefix tree automata (PTA) from positive samples. Prefix tree automata is an automata in which each state represents an unique prefix of the sentences falling in the positive samples. For example, PTA for the set of valid sentences S + = {011, 101} is shown in figure 2.2(a) where each state is labeled with the prefix it represents; e.g. state λ represents empty prefix, state 01 represents prefix 01 etc. Prefix tree automata accept exactly the set of sentences given in the positive sample. Now states in the PTA are merged to get a more general automata. Negative sentences are used for validating a merge operation. A merge is valid if the automaton obtained after the merging operation does not accept any negative sample. Suppose the negative sample is S − = {1, 01}. First, states λ and 0 are merged; the resulting automaton is shown in the figure 2.2(b). This is a valid merge as the automaton obtained after this merge accepts neither 1 nor 01. If a merge is not valid then it is discarded and other states are considered for merging. The process of state merging is applied repeatedly until no more merge is possible. The whole process of automaton induction is shown in figure 2.2. Oncina et. al. have shown that this algorithm identifies the regular grammar correctly in the polynomial time if positive and negative samples follow certain characteristics such as each of

15

the transition function is used in generating at least one valid sentence given as input. 1

0

[Figure 2.2 here: the PTA for S+ = {011, 101} in panel (a), and the automata (b)-(e) obtained by successive valid state merges.]

Figure 2.2: Steps in regular language inference from positive and negative sample

For 0-reversible automata the condition is that automata and its reverse (obtained by reversing the transition edges) are both deterministic FA.

16

First all final states (i.e. A, C, E, H, J, N and O) are merged together to produce the automaton of figure 2.4(a). Next we see that state A has two nondeterministic branches δ(A, 0) = B and δ(A, 0) = D, therefore B and D are merged to produce the automata of figure 2.4(b). More than one state can reach to the state A on 1 (i.e. δ(G, 1) = δ(K, 1) = A) in figure 2.4(b), hence states G and K are merged together. This process of merging goes as long as no merging is possible and we reach the automaton of figure 2.4(d). This automaton is the smallest 0-reversible automaton accepting the given set of sentences. A

0

[Figure 2.3 here: the prefix tree automaton, with states labeled A through O.]

Figure 2.3: PTA corresponding to example set {λ, 00, 11, 0000, 0101, 0110, 1010}

Learning Subclasses of Context Free Language

Since DFAs are learnable in polynomial time and data [22], researchers tried to extend these results to language classes larger than DFAs. The main hurdle in extending these results to larger classes is of non-determinism and non-linearity. A language is linear if it is produced by a context free grammar in which the right hand side of productions contain at most one nonterminal. A language is deterministic if a deterministic parser can be generated for them. A deterministic parser is one which at each step can unambiguously determine the next action to perform. The class of linear grammar is studied the most for achieving the positive learnability results. Some approaches have addressed the problem of learning even linear grammars (ELG). An even linear grammar, proposed by Amar and Putzolu [3], is one whose productions are balanced, i.e. right hand side (RHS) of productions either contains terminal symbols only, or they are of the form u T v, where T is a nonterminal and u and v are terminal strings of equal length.


[Figure: four automata, panels (a)-(d), showing the intermediate automata produced by the successive state merges.]

Figure 2.4: Steps in the inference of a zero reversible language

The first work on the learnability of ELG is by Radhakrishnan et al. [51]. Their approach first builds a skeleton corresponding to each positive sample. A skeleton is a tree structure imposed on a sentence. Initially, the nodes of each skeleton are labeled with temporary names. The construction of a skeleton is straightforward as productions in an ELG are balanced. Once skeletons for all the sentences are built, the nodes of the skeletons are assigned new nonterminal names to get a derivation tree for each sentence. While assigning a name to a node, the algorithm ensures that nodes with the same child nodes are labeled with the same name. Next, the productions of the grammar are built from these derivation trees. Another method for inferring even linear languages is proposed by Takada [63]. He has shown that the problem of learning even linear languages can be reduced to the problem of learning a regular control set. A control set for a grammar G = (N, T, P, S) is defined as follows: suppose the productions of G are labeled as π1 , π2 , . . . , πn . A control set of a language L(G) is a set of strings over the alphabet {π1 , π2 , . . . , πn }.


Since a sequence of productions represents a derivation, a language which is generated by applying the production sequences falling in a control set is said to be generated by that control set. Takada has shown that the class of even linear languages can be generated by control sets which are regular. Therefore, the class of even linear languages can be learned by learning the corresponding control sets using the techniques available for regular sets. Makinen [40] discusses the learnability of even linear languages and their subclasses from positive samples alone. He has shown that a restricted class of even linear languages, called terminal fixed even linear languages, can be learned in the limit from positive samples alone. An even linear language is terminal fixed if productions A → aBb and C → aDb in the grammar imply A = C and B = D. The method of learning a grammar by learning its control set was further extended by Koshiba et al. [30], who proposed a method for learning deterministic even linear languages. They have shown that the class of LR(k) even linear languages3 is not learnable from positive samples alone, but a restricted subclass called LRS(k) even linear grammars (LRS(k) ELG) is learnable from positive samples. A grammar G = (N, T, P, S) is an LRS(k) ELG if G is in the normal form4 and

(i) S ⇒α uAv ⇒π uβv,

(ii) S ⇒α′ u′A′v′ ⇒π′ u′βv′,

(iii) pref (v, k) = pref (v′ , k), and

(iv) suff (u, k) = suff (u′ , k)

together imply π = π′ ; where u, u′ , v, v′ ∈ T∗ ; β ∈ (T ∪ N)∗ ; A, A′ ∈ N ; α, α′ , π, π′ ∈ P∗ ; S ⇒π uAv denotes that S derives uAv by applying the sequence of productions π (all derivations are in G); pref (v, k) denotes the first k symbols of v; and suff (u, k) denotes the last k symbols of u. It was shown that the class of LRS(k) ELGs is generated by control sets which can be represented by k-reversible finite automata. Hence, by learning the control set one can learn the LRS(k) ELG. A front end processing algorithm first converts the input into an instance of k-reversible automaton inference and uses the learning algorithm for k-reversible finite automata. Once a k-reversible control set is learned, a set of productions is constructed from the control set to get the context free grammar.

An LR(k) language which is also even linear is called an LR(k) even linear language. Normal form here means that each symbol of G appears in some sentential derivation.


The work of Higuera et al. [23] learns a larger class called deterministic linear languages. They have shown that this class can be learned in polynomial time. A deterministic linear grammar G = (V, T, P, S) is a linear grammar where all rules are of the form A → a A′ u or A → ǫ, and where A → a α and A → a β imply α = β. Their approach learns a canonical form of the grammar (which is not minimal but is polynomially larger than the equivalent minimal grammar) from positive and negative samples in polynomial time and data. In order to learn the grammar correctly, their approach requires an input data set which obeys certain characteristics. This work is the best known theoretical result in the field of context free grammar inference.

Starkie [62] has proposed a method for learning left aligned grammars from positive samples alone. Left aligned grammars are those for which a non-backtracking top down parser (known as the remainder parser) and a non-backtracking bottom up parser (known as the left simplex parser) can be built. A remainder parser is a variant of an LL(1) parser. A left simplex parser is a variant of an LR(1) parser in which the parser table entries corresponding to each state are either all shifts or all reduces by the same production (i.e. a state does not have both shift and reduce actions, and all reduce entries are the same). Hence, whenever the parser encounters an RHS of a production it performs the reduction irrespective of what the next token is. Starkie's approach starts with a prefix tree automaton of the given set of positive sentences. First, a set of trivial rules is obtained from the prefix tree automaton. Next, a set of steps is performed which includes merging rules with the same right hand sides into one, deleting repeated rules and ensuring that the right hand side of a rule is not contained within the right hand side of another rule. These steps are performed to ensure the properties of a left aligned grammar, and are repeated until no more merging is possible.

Laxminarayana et al. [38] have given an approach for learning terminal distinguishable context free languages (TDCFL) from positive samples in polynomial time. TDCFL is a class of languages which obey the backward determinism, terminal completeness and terminal dissimilarity properties defined as follows:

Definition 2.2 A CFG G = (N, T, P, S) in GNF5 possesses the backward determinism property iff B ⇒ w and C ⇒ w imply B = C.

GNF (Greibach Normal Form) is a normal form of context free grammar in which there are no epsilon productions and all productions are of the form A → aα (A ∈ N, a ∈ T, α ∈ (T ∪ N )∗ ).


Definition 2.3 A CFG G = (N, T, P, S) in GNF possesses the terminal completeness property iff ∀x, y ∈ L(A) and A ∈ N , T er(x) = T er(y), where the function T er(x) returns the set of terminals in the string x and L(A) is the set of all strings derived by A.

Definition 2.4 A CFG G = (N, T, P, S) in GNF possesses the terminal dissimilarity property if A, B ∈ N − {S}, γ1 , γ2 ∈ N + , a ∈ T and B → aγ1 , A → aγ2 ∈ P , then ∀x ∈ L(A) and ∀y ∈ L(B), T er(x) ≠ T er(y).

Their approach first builds a canonical grammar from each program. The canonical grammar is obtained as follows: for each input sentence x = a1 a2 . . . an , the set of productions S → Ax1 , Ax1 → a1 Ax2 Ax3 . . . Axn , Ax2 → a2 , . . . , Axn → an is added to the grammar. The canonical grammar is generalized by determining the equivalence between nonterminals; nonterminals which are equivalent are renamed with the same name. The approach proposed by Yokomori [84] learns a very simple grammar from positive data. A very simple grammar is a grammar G = (N, T, P, S) where productions are of the form A → aα (α ∈ N ∗ ) and there exists exactly one rule of the form A → aα for each a. His approach learns an interpretation I from the input programs; the interpretation is later used to get the underlying very simple grammar. An interpretation I is an ordered pair (fn , fp ) of mappings, where fn is a way to encode the nonterminals and fp is a homomorphism which is used to get the correct labeling (or decoding) of the nonterminals and productions, i.e. to get the underlying grammar from the interpretation.
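Returning to the canonical grammar construction of Laxminarayana et al., it can be sketched as follows. This is our illustrative reading of the description above: the nonterminal naming scheme and the assumption that the first production consumes a1 are ours, and the generalization step that merges equivalent nonterminals is omitted.

    def canonical_productions(x):
        # Build the canonical productions for one sentence x = a1 a2 ... an
        # (illustrative sketch; nonterminal names follow our own convention).
        n = len(x)
        A = ["A_%s_%d" % (x, i) for i in range(1, n + 1)]
        prods = [("S", [A[0]])]
        prods.append((A[0], [x[0]] + A[1:]))   # A_x_1 -> a1 A_x_2 ... A_x_n
        for i in range(1, n):
            prods.append((A[i], [x[i]]))       # A_x_i -> ai
        return prods

    # canonical_productions("ab") yields
    #   S -> A_ab_1 ; A_ab_1 -> a A_ab_2 ; A_ab_2 -> b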

2.5 Learning from Structured Inputs

Some grammar inference attempts use structural input data. For example, the input to the learning system may be parse tree structures of programs, and the learner finds the correct labeling of the parse tree nodes. The structured inputs are termed skeletons, and skeletons are exactly the set of trees accepted by a tree automaton (TA). A tree automaton is a type of finite state machine that deals with structures rather than strings. It can be top down or bottom up, depending upon how it recognizes the tree structure.


It is represented as (Q, Γ, δ, F ), where Q is a set of states, Γ is a set of symbols, F is a set of final states and δ is a transition function. Each symbol is associated with an arity. For each symbol f of arity n, transition functions δ(f ) : Qn → 2Q are defined. Suppose a tree automaton encounters the subtree of figure 2.5(a) at some step and there is a transition q ∈ δ(f, q1 , . . . , qn ); then the root of the subtree will be labeled q as a result of this transition (as shown in figure 2.5(b)). A TA is said to accept a tree if it reaches the root of the tree and the state at the root falls in the set of final states F .
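A bottom up run of such an automaton can be sketched as follows. This is our simplified, deterministic rendering: trees are pairs of a symbol and a list of subtrees, the transition function is assumed to be a dictionary, and all names are ours.

    def run(tree, delta):
        # Label the root of 'tree' by applying transitions bottom up.
        symbol, children = tree
        child_states = tuple(run(c, delta) for c in children)
        return delta[(symbol, child_states)]

    def accepts(tree, delta, finals):
        try:
            return run(tree, delta) in finals
        except KeyError:          # no applicable transition
            return False

    # Example (ours): accept the tree f(a, a).
    delta = {("a", ()): "q1", ("f", ("q1", "q1")): "qf"}
    print(accepts(("f", [("a", []), ("a", [])]), delta, {"qf"}))   # True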

[Figure: (a) a subtree with root f and child states q1 , . . . , qn ; (b) the same subtree with its root labeled q after the transition.]

Figure 2.5: Application of a transition function in tree automata

The problem of learning from structures can be reduced to the problem of learning the underlying tree automaton. Following this line, Sakakibara has proposed a method for learning CFGs [52, 53]. His approach takes a structured input set and learns a TA. For learning the TA, it uses the method proposed by Angluin [5] for learning finite automata. Sakakibara's approach requires a minimally adequate teacher (MAT) which can answer structural equivalence and membership queries. He has further worked on the same problem and produced several extensions of his approach. In [57], he has proposed a method which learns a CFG from partially structured examples. In his recent work [55, 56], he has proposed a genetic algorithm based method for learning CFGs from finite sets of positive and negative examples. He has shown that the hypothesis for each sentence (i.e. a parse tree imposed on the sentence) can be represented in a tabular structure (e.g. by a CYK table in this case). Initially, the table entries are filled with nonterminals having temporary names. The nonterminals in the CYK tables are later partitioned; nonterminals which fall in the same partition are renamed with the same name. Production rules are obtained from the renamed nonterminals. His approach uses genetic algorithms for partitioning the nonterminals.


He has shown through some practical experiments that partially structured examples improve the process of learning.

2.6 Genetic Algorithm Based Approaches

Lankhorst [36] has proposed a genetic algorithm (GA) based approach for learning CFGs. His approach starts with a random population of grammars. Genetic operators such as mutation and crossover are applied to individual grammars to create a new set of grammars. A fitness function is applied to each grammar in the set; those grammars which are above a fitness threshold are considered for the next generation. The fitness criterion used for a grammar G is as follows:

Fitness(G) = ((# +ve sentences accepted by G) + (# −ve sentences rejected by G)) / (Total number of sentences)

Lankhorst has experimentally shown on 7 small languages that after a number of generations his approach finds a grammar consistent with the given set of positive and negative sentences. Another GA based approach is proposed by Mernik et al. [41, 70] for extracting grammars of domain specific languages (DSLs). Apart from the usual crossover and mutation operators, they have provided a few grammar specific heuristic operators. Their approach starts with a non-random population of grammars: it generates some valid derivation trees from the positive samples, and the grammars obtained from these trees are used as the initial population. The following heuristic operators are proposed by them (a sketch of these operators, in our own notation, follows the list):

1. Option operator: A production of the type A → αBβ (α and β can also be empty) is replaced with productions A → αBβ and B → ǫ. This makes B optional in A → αBβ.

2. Iteration* operator: A production of the type A → α is replaced with productions A → αA and A → ǫ. This makes A derive strings in α∗ .

3. Iteration+ operator: A production of the type A → α is replaced with productions A → α and A → αA. This makes A derive strings in α+ .
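These operators can be sketched as grammar transformations; the representation below (productions as pairs of an LHS and a list of RHS symbols) and all names are our own assumptions, not Mernik et al.'s implementation.

    EPS = []   # epsilon

    def option(prods, b):
        # Option operator: make nonterminal b optional by adding b -> epsilon.
        return prods + [(b, EPS)]

    def iteration_star(prods, i):
        # Iteration* operator: replace A -> alpha with A -> alpha A and
        # A -> epsilon, so that A derives alpha*.
        lhs, rhs = prods[i]
        return prods[:i] + [(lhs, rhs + [lhs]), (lhs, EPS)] + prods[i + 1:]

    def iteration_plus(prods, i):
        # Iteration+ operator: keep A -> alpha and add A -> alpha A,
        # so that A derives alpha+.
        lhs, rhs = prods[i]
        return prods + [(lhs, rhs + [lhs])]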


Unlike Lankhorst's scheme, the fitness value of a grammar in this approach depends on the length of the portions of the program recognized by some nonterminal of the grammar. They have shown experimental results on some small domain specific languages. Their experiments validate that the non-random initial population and the heuristic operators improve the process of grammar extraction. Although the aim of this work was to arrive at a practical approach for inferring programming language grammars, it has not yet produced results for real programming languages.

2.7 Grammar Extraction from Other Knowledge Sources

Attempts in the programming language domain mostly rely on knowledge sources other than the input programs. In this section we discuss such approaches. Lämmel et al. [32] have proposed a semiautomatic approach in which a grammar is systematically obtained from several artifacts such as language reference manuals, compiler source code, etc. First, a raw grammar is obtained from the language artifacts. They have shown that grammars obtained from such sources are not always complete. For example, there may be some nonterminals which are not defined (i.e. do not occur as the left hand side of any production) and some nonterminals which are not used (i.e. do not occur in any sentential derivation). Such incompletenesses in the grammar are resolved in the next step: a test driven correction of the grammar is done in which the grammar is tested on a set of correct programs. Once a grammar is obtained, steps such as beautification, modularization and disambiguation of the grammar are optionally performed to make the grammar usable and understandable. Using the above approach, Lämmel et al. were able to recover an industry usable COBOL grammar [34]. They have also mentioned the difficulties they faced while gathering the grammar from such sources; for example, in many cases the reference manuals were buggy and incomplete. Such problems were resolved manually. Jain et al. [9, 27] have proposed an approach for extracting grammar rules from a set of programs.


Their approach handles those cases in which the grammar of a standard language is given and a grammar of the dialect has to be extracted from programs written in the dialect. Their approach uses a knowledge-base for extracting the grammar. The knowledge-base is a global repository of all kinds of grammar rules found in different languages. Their approach assumes that new constructs in dialects are usually borrowed from the grammars of other programming languages; for example, a C dialect may borrow a repeat until statement from Pascal. The knowledge-base contains all kinds of keyword based rules found in different languages; examples of keyword based rules are the if statement, the while statement, etc. Their approach first parses the programs with the initially given incomplete grammar. Since the grammar is incomplete, the parser will get stuck at some point. They use the information available in the parser to build an error string, which is a string of terminals and nonterminals. The error string is used for searching for an appropriate rule in the knowledge-base; those rules which contain the same keyword as the error string are returned from the knowledge-base.

2.8 Other Attempts

In this section we discuss other attempts at CFG inference. Since CFGs are not learnable from positive samples, these approaches do not ensure correctness. There is a plethora of such approaches in the NLP domain, but most of them are supervised learning attempts. In supervised learning, a grammar is learned from a tagged corpus (a set of sentences whose corresponding parse trees are known). After learning the grammar, the system imposes a parse tree structure on untagged sentences (whose parse tree structure is not known). Supervised learning attempts are useful in the NLP domain because there we are interested in the most probable parse tree of a sentence. In this section we do not discuss supervised learning attempts and touch only a few representative unsupervised learning attempts. First we discuss the approach by Nakamura, which addresses the general problem of CFG inference; later we cover a few attempts from the NLP domain. Nakamura has proposed the Synapse system for learning CFGs from positive and negative samples [43]. His approach is incremental and uses a CYK parser. It generates a set of rules incrementally from the programs and adds those rules to the grammar. The generated grammar is then checked against all the programs.


They have experimentally verified their work on a set of grammars borrowed from the book of Aho, Sethi and Ullman [2]. The ABL (Alignment Based Learning) system was proposed by Menno van Zaanen in his Ph.D. thesis [18, 68, 69]. His technique works by aligning sentences some of whose parts share the same sequence of terminals. The information obtained from the aligned sentences is used to find the parts which are interchangeable. For example, the sentences Ram is going and Seeta is going will be aligned on the phrase is going, which shows that Ram and Seeta are interchangeable. Since such parts are considered substitutable for each other, they are hypothesized to be possible constituents (i.e. derived by the same nonterminal); e.g. Ram and Seeta are derived by a nonterminal representing personal nouns. The ABL algorithm consists of two phases: the first phase is the constituent generator, which generates a set of possible constituents, and the second phase restricts this set by selecting the best constituents (as there can be several alignments possible among a set of sentences). Iteratively, a grammar structure is imposed on the sentences. The EMILE system, proposed by Adriaans [1], is also based on the similarity between sentences (like the ABL system). First, a set of contexts and subexpressions is built from each input. A context is the text surrounding a portion of a string in an input sentence. For example, in the sentence this is a beautiful palace, the phrases this is a and palace are the left and the right context of beautiful respectively; the word beautiful is called the expression. This approach forms clusters of equivalent contexts. Expressions which have the same contexts are considered to be of the same type; hence, they are hypothesized to be derived from the same nonterminal. A group led by Solan et al. [25, 60, 61, 72] has proposed the ADIOS system to learn a grammar from positive samples. The ADIOS algorithm starts by loading the input sentences onto a graph; each input sentence defines a separate path over the graph. Next, significant patterns are determined from this graph. A significant pattern is a portion of a path which is common to most of the paths (paths corresponding to sentences). Each significant pattern is then considered a constituent (i.e. derived by the same nonterminal). This process is repeated and a grammar structure is imposed on the sentences. E-GRIDS, proposed by Petasis et al. [48], starts initially with a flat grammar that contains only rules of the form S → X . . . Y (one for each sentence) and X → W for each word.


E-GRIDS starts with this specific grammar and generalizes it with merge operations: given a rule X → Y . . . Z, merging X and Z produces the rule X → Y . . . X. The learning alternates between two modes, merge and create. In the create mode, new nonterminals are created; e.g. the replacement of a rule A → XY Z with the rules A → XB and B → Y Z creates a new nonterminal B. First, a set of successor grammars is created with merge operations. An evaluation function is applied to each successor grammar to find and select the best grammar. If none of the grammars scores better than the previous grammar, the algorithm switches to the create mode, generates a set of grammars with the create operation, and selects the best among them for the next iteration. The algorithm continues in this manner until neither of the operations results in a better grammar, at which point it halts. Since the number of successor grammars is large, E-GRIDS uses a simplicity bias to direct the search [35]. This bias measures the simplicity of each grammar in terms of the description length of G plus the description length of the training sentences encoded as derivations in G.

2.9 Summary

In this chapter we presented a survey of different grammar inference techniques. We first discussed the theoretical limitations of the problem of grammar inference and some representative approaches proposed for the inference of regular languages. We then discussed attempts which infer CFGs from unlabeled parse trees, attempts which rely on other knowledge sources, attempts based on genetic algorithms and attempts made in the NLP domain. We observe that the attempts in the programming language domain frequently rely on other sources such as reference manuals, compiler source code, etc. None of the approaches in the programming language domain extracts a grammar from programs alone. This observation provides further motivation for this thesis.

Chapter 3

System Model

In this chapter we present a critical study of the grammars of standard languages and their dialects and discuss the extensions commonly observed in programming language dialects. We formulate the problem of grammar extraction under those extensions. We also discuss the notions of correctness and completeness of the extracted grammar. We use the terms programming language growth and extension interchangeably in our discussion.

3.1 Programming Language and Grammar

A programming language (PL) is specified with a set of syntactic and semantic rules. A grammar is used for representing the syntax of a programming language. Programming language grammars are generally restricted to a subclass of context free grammars (CFG) called LR grammars. An LR language is a language for which a deterministic bottom up shift reduce parser can be built; that is, a parser which can unambiguously determine the action (i.e. shift or reduce) to perform in a left to right scan of the input sentence. The operations shift and reduce are discussed in the next section. We represent an input program by a1,n in our discussion, where n is the number of tokens in the program, ai (1 ≤ i ≤ n) denotes the ith token of the program and ai,j (1 ≤ i ≤ j ≤ n) denotes the substring of a1,n which starts at the ith token and ends at the j th token.


3.1.1 LR parser

An LR parser works on a parser stack and a parser table [2]. The parser stack contains a string of the form s0 X1 s1 X2 s2 . . . Xm sm , where si is an LR parser state and Xi is a grammar symbol. An LR parser’s configuration is a pair whose first component is the stack and whose second component is the remaining input string:

(s0 X1 s1 X2 s2 . . . Xm sm , ai ai+1 . . . an )

This configuration represents the right sentential form X1 X2 . . . Xm ai ai+1 . . . an . An LR parser table consists of two parts, action and goto. The LR parser works as follows: it determines the state at the top of the stack, sm , and consults action[sm , ai ], where ai is the next token. We use the terms active state and top of the stack interchangeably in our discussion. The state at the top of the stack also represents the state of the parser, because the parser’s behavior is determined by the top state. The parser’s new configuration depends on the entry action[sm , ai ] as follows:

1. If action[sm , ai ] = shift s, the parser executes a shift move and the new configuration becomes (s0 X1 s1 X2 s2 . . . Xm sm ai s, ai+1 . . . an ).

2. If action[sm , ai ] = reduce A → β, the parser executes a reduce move and enters the configuration (s0 X1 s1 X2 s2 . . . Xm−r sm−r A s, ai ai+1 . . . an ). The first 2r entries are popped from the parser stack, exposing state sm−r , where r = |β| is the length of the RHS of the rule A → β. The parser then pushes the symbol A and the state s given by the entry goto[sm−r , A]. The sequence of the top r grammar symbols on the stack (i.e. Xm−r+1 . . . Xm ) and β are the same here.

3. If action[sm , ai ] = accept, the parser has recognized the input sentence as a valid sentence.


Table 3.1 LR table generated from the grammar given in example 3.1

                        action                        goto
    State     a     b     d     e     $       A     B     C     S
    s0        s1                                                12
    s1        s8                              2           10
    s2              s3                              4
    s3        r5    r5    r5
    s4        s8    r4    r6                  5           9
    s5              s6                              7
    s6              r5                r5
    s7              r4                r6
    s8              r3
    s9                                r1
    s10                   s11
    s11                               r2
    s12                               acc

4. An empty entry action[sm , ai ] means that the parser is in an error state.

Example 3.1: Table 3.1 shows the LR parser table, containing the action and goto functions, for the grammar G = (N, T, P, S), where T = {a, b, d}, N = {A, B, C, S} and P has the following productions:

1. r1 : S → a A B C
2. r2 : S → a C d
3. r3 : A → a
4. r4 : A → A B
5. r5 : B → b
6. r6 : C → A B

The notations in the table are as follows: si means shift state si and ri means reduce by the production labeled ri . A number i in the entry goto[sm , A] shows that state si is to be pushed after shifting the nonterminal A. Suppose the string aabab is given as input; the initial parser configuration will be (s0 , aabab$), where $ is a special symbol used as the end of input marker. The sequence of stack and input contents is shown in figure 3.1. For example, at line (1) in figure 3.1 the LR parser is in state s0 and a is the first input token. The entry action[s0 , a] in the action table (table 3.1) is s1 ; therefore a shift move is performed: the symbol a followed by state s1 is shifted onto the parser stack, and the parser configuration becomes (s0 as1 , abab$).


Since action[s1 , a] = s8 , another shift occurs and the parser moves to the configuration (s0 as1 as8 , bab$) (line (3) in figure 3.1). Since action[s8 , b] = r3 , a reduction is performed by the rule A → a: two entries are popped from the stack, exposing state s1 , and the entry goto[s1 , A] in the parse table is consulted. It is state 2, hence the parser pushes A followed by state s2 (line (4) in figure 3.1). The whole process of parsing is shown in figure 3.1. Finally, the parser reaches state s12 and the entry action[s12 , $] is acc, which means that the input is a valid sentence of L(G).
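The moves above can be generated by a small driver loop of the following shape. This is our simplified sketch, not the thesis' implementation: the stack holds only states (the grammar symbols shown in figure 3.1 are implicit), and the action and goto tables and the rules map are assumed to be plain dictionaries.

    def lr_parse(tokens, action, goto, rules):
        # tokens ends with '$'; action maps (state, token) to ('shift', s),
        # ('reduce', r) or ('accept', None); rules maps r to (lhs, |rhs|).
        stack = [0]
        i = 0
        while True:
            entry = action.get((stack[-1], tokens[i]))
            if entry is None:
                return "error"                    # empty entry: error state
            kind, arg = entry
            if kind == "shift":
                stack.append(arg)
                i += 1
            elif kind == "reduce":
                lhs, rhs_len = rules[arg]
                del stack[len(stack) - rhs_len:]  # pop the handle
                stack.append(goto[(stack[-1], lhs)])
            else:
                return "accept"

With the tables transcribed from table 3.1 and rules = {"r1": ("S", 4), . . . , "r6": ("C", 2)}, the call lr_parse(list("aabab") + ["$"], action, goto, rules) follows the moves of figure 3.1.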

       Stack                            Input      Action
 (1)   s0                               aabab$     Shift s1
 (2)   s0 a s1                          abab$      Shift s8
 (3)   s0 a s1 a s8                     bab$       Red r3 (A → a)
 (4)   s0 a s1 A s2                     bab$       Shift s3
 (5)   s0 a s1 A s2 b s3                ab$        Red r5 (B → b)
 (6)   s0 a s1 A s2 B s4                ab$        Shift s8
 (7)   s0 a s1 A s2 B s4 a s8           b$         Red r3 (A → a)
 (8)   s0 a s1 A s2 B s4 A s5           b$         Shift s6
 (9)   s0 a s1 A s2 B s4 A s5 b s6      $          Red r5 (B → b)
 (10)  s0 a s1 A s2 B s4 A s5 B s7      $          Red r6 (C → AB)
 (11)  s0 a s1 A s2 B s4 C s9           $          Red r1 (S → aABC)
 (12)  s0 S s12                         $          accept

Figure 3.1: Moves of LR parser on input aabab

An LR parser configuration (s0 X1 s1 X2 s2 . . . Xm sm , ai ai+1 . . . an ) is called an error configuration if the action table entry action[sm , ai ] is empty. A grammar is LR if every action table entry corresponding to the grammar contains a single operation. An LR(k) parser is one which scans a string of length k in order to determine the next action (i.e. shift or reduce).


The parser shown in the previous example is an LR(1) parser. Since we use LR(1) parsing in this thesis, the terms LR and LR(1) are used interchangeably. LR State Machine: LR parser table entries are built from the LR state machine. An LR state machine can be seen as a deterministic finite automaton (DFA) in which each state corresponds to an LR itemset and edges are labeled with symbols of the grammar. For example, figure 3.2 shows the LR state machine corresponding to the grammar given in example 3.1. Each rectangular box is an LR itemset which corresponds to a state of the LR parser. We represent the itemset corresponding to state si as Ii in our discussions. In figure 3.2 the state number corresponding to each itemset is put at the top right corner of the box. An LR itemset is a set of LR items and is represented as {it1 , it2 , . . . , itn }, where iti (1 ≤ i ≤ n) is an LR item. An LR item has the following form:

[A → α • β, {a, . . .}]

A → αβ is a production of the grammar. We denote an LR item by it in our discussion. The second part of an LR item, {a, . . .}, is a set of terminals called the lookahead. For a parser configuration (s0 X1 s1 . . . Xm sm , ai ai+1 . . . an ), an item of type [A → α • β, U ] in Im (the itemset corresponding to the top state sm ) denotes that a substring derived from α has already been recognized in the input string and is inside the stack, while β has yet to be recognized from the input. Hence, if Im contains an item of type [A → α • Bβ, U ], then the parser first recognizes a substring derived from B and later recognizes a substring derived by β. An item of type [A → α • , U ] in Im denotes that a substring derived from α has already been recognized in the input program and is inside the stack; this can be reduced to A if the next token, ai , falls in the lookahead set U . Parser table entries are built based upon the above interpretations.

If k1 and k2 are two parser configurations, then k1 ⊢s k2 and k1 ⊢ri k2 denote that k2 results from k1 by shifting the next symbol and by reducing with rule ri , respectively. We ignore the lookahead part of LR items at places where it is not needed. Since each state corresponds to an itemset, we refer to a state and its corresponding itemset interchangeably.


[Figure: LR state machine for the grammar of example 3.1; each box lists the itemset Ii of state si (i = 0, . . . , 12) and each edge is labeled with the grammar symbol shifted on that transition.]

Figure 3.2: State machine corresponding to the original grammar

3.1.2 CYK Parser

A CYK parser [29, 85] maintains an upper triangular matrix C of size n × n, called the CYK table, where n is the length of the input program a1,n . Each entry C[i, j] is called a CYK cell and contains all symbols which can derive the substring ai,j ; i.e. A ∈ C[i, j] implies A ⇒⋆ ai,j , that is, A is the root of a subtree whose yield is the substring ai,j . We use the term T [i, j] to represent the subtree whose yield is ai,j .

Example 3.2: Suppose we parse the program aabab with the grammar shown in example 3.1 using a CYK parser. The upper triangular matrix built by the parser is shown in figure 3.3. For example, the entry C[2, 3] contains two entries, A and C, which means A ⇒⋆ ab and C ⇒⋆ ab. Since the entry C[1, 5] contains the start symbol S, the program is correct w.r.t. the given grammar.
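For a grammar in Chomsky normal form, the table filling can be sketched as follows. This is our simplified illustration (function and table names are ours); the CYK variant used in this thesis also handles longer right hand sides, as the rule S → aABC in example 3.2 shows.

    def cyk(tokens, unit, binary, start):
        # unit maps a terminal to the nonterminals deriving it; binary maps a
        # pair (B, D) to the nonterminals A with a production A -> B D.
        n = len(tokens)
        C = [[set() for _ in range(n)] for _ in range(n)]
        for i, a in enumerate(tokens):
            C[i][i] = set(unit.get(a, set())) | {a}   # cells also keep the token
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):                 # split point
                    for B in C[i][k]:
                        for D in C[k + 1][j]:
                            C[i][j] |= binary.get((B, D), set())
        return start in C[0][n - 1]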


[Figure: CYK table for the input aabab. The nonempty cells are C[1, 1] = {a, A}, C[2, 2] = {a, A}, C[3, 3] = {b, B}, C[4, 4] = {a, A}, C[5, 5] = {b, B}, C[2, 3] = {A, C}, C[4, 5] = {A, C} and C[1, 5] = {S}.]

Figure 3.3: CYK table built for input aabab

3.2 Programming Language Syntax

A programming language’s syntactic specification is divided into two parts: (1) the lexical specification, which specifies the terminals of the grammar, and (2) the grammar specification, which specifies the underlying grammar rules of the programming language. Terminals are also referred to as tokens. The terminals of a programming language can be divided into major categories representing the keywords, operators, statement terminators, types, etc. of the language. For example, in a C grammar we can find a set corresponding to the different data types of C (such as int, char), a set corresponding to the different operators of C (such as +, −, ∗, etc.) and a set corresponding to the different keywords of C (such as while, switch, case, etc.). The set of rules which impose structure on programs is specified by the underlying CFG of the programming language. These rules are encoded in the productions of the underlying CFG. For example, the production statement → while expression statement represents the syntax of the while statement.

3.3 Syntactic Growth of Programming Languages

The main reason for programming language growth is to make the language stronger and more expressive. A language can undergo major changes, like paradigm shifts (e.g. the transition of an imperative language to an object oriented language), or minor changes, like the creation of new dialects which have a few more functionalities. A version change in a programming language can also be considered a minor growth. Our focus is on minor extensions of a programming language's syntax.


A language dialect is created to add some new functionalities needed for a class of software implementations. For example, Business Basic was developed for minicomputer systems in the early 1970s. It was derived from Dartmouth Basic by introducing file indexing methods much like the file access facilities available to COBOL programmers; the Business Basic interpreter also offered much richer diagnostic capabilities.

A program can be divided into many logical parts. For example, the main components of a program written in an imperative programming language are the various declarations, statements and expressions. Hence, given a program written in a dialect, we can expect these parts to have new syntactic constructs (i.e. new kinds of statements, operators etc.). Syntactic growth is nothing but an extension of the context free grammar of the base language. The nonterminals of a programming language grammar can be divided into different groups which generate the various logical parts of a program as discussed above. For example, we can find a group in which nonterminals generate the different kinds of statements, another group in which nonterminals generate the different kinds of expressions, etc. In order to ease the process of grammar extraction, we can exploit the above observations and focus on the growth in each part separately. With this motive, we studied the syntax of some programming languages and the extensions found in their dialects. Based on this study, we divide syntactic growth along the following dimensions1 :

1. New Declaration Specifiers: This kind of growth happens when there is a need for new data types, new scopes etc. For example, RC [82] is a dialect of C which adds safe, region based memory management to C. Regions enforce a memory structure on the application's data. RC has three new type qualifiers, viz. sameregion, parentptr and traditional. A pointer declared as sameregion can point only to objects which are in the same region. Similarly, the other type qualifiers are used to specify different scopes. C* [83], a data parallel dialect of C, has shape as a new declaration specifier.

2. New Expressions: A dialect can have new operators to support new kinds of expressions.

Note that the observations discussed may not completely cover all kinds of growth, but these are the most perceived dimensions along which minor growth in programming languages happens.


New operators may operate on a data type that already exists in the base language or may operate on new data types. For example, C* has max, min, minto, maxto, dimof, rankof, etc. as new operators.

3. New Statements: A language can be extended to support new kinds of statements. For example, Java 1.5 [80] has a statement with the keyword enum to support enumerated types, while lower versions of Java do not have this statement. C* has statements with the keywords where, with and everywhere, which are not in the C language. Locomotive Basic is a proprietary dialect of Basic written by Locomotive Software. It has dedicated commands for handling graphics, with keywords DRAW, PLOT, PAPER, CIRCLE etc. It also has special commands for memory allocation and handling, such as MEMORY and parametric LOAD.

The above kinds of growth require modifications in the lexical analyzer of the base language to identify the tokens corresponding to new types, new operators and new keywords. They also require modifications in the parser of the base language to support the new grammar rules. In the framework of syntactic growth of programming languages, the problem of grammar extraction can be posed as: given a set of programs P (written in a dialect of a base language) and a grammar G of the base language, extract a grammar G′ such that G′ accepts all the programs in P.

3.4 Correctness and Completeness of a Grammar

In this section we discuss the correctness and the completeness of a grammar w.r.t. a given set of input programs. Suppose the set of input programs is P and the grammar implemented in the compiler of the dialect (called the target grammar) is GT . The compiler of the dialect is referred to as the target compiler and the language L(GT ) is referred to as the target language.

Definition 3.1 A grammar G′ is correct w.r.t. the target language L(GT ) if L(GT ) ⊇ L(G′ ), and complete if L(GT ) ⊆ L(G′ ).

The ideal goal of any grammar extraction process is to get a grammar which is correct as well as complete w.r.t. GT .


Obtaining a correct and complete grammar (i.e. an exact model of the system) from positive samples alone is not possible, by Gold's result. Even if we knew GT we could not achieve the above goal, as for two context free grammars G1 and G2 , determining L(G1 ) ⊆ L(G2 ), L(G1 ) = L(G2 ), L(G1 ) ≠ L(G2 ), etc. is undecidable [24]. In the grammar extraction problem, GT is not known. Therefore, we address the problem of extracting a grammar G′ which is complete w.r.t. the set of input programs P.

Definition 3.2 A grammar G′ is complete w.r.t. input programs P if P ⊆ L(G′ ).

Because of the undecidability results and the non-availability of GT , absolute correctness of a grammar G′ (i.e. L(GT ) ⊇ L(G′ )) cannot be ensured. However, correctness w.r.t. some coverage criterion can be ensured. Such criteria additionally assume that the parser for GT is available; this parser is called the target parser and is denoted ℘T . Examples of different coverage criteria are rule coverage [50], context dependent rule coverage (CDRC) [33], context dependent branch coverage (CDBC) [33], etc. For ensuring the correctness of an extracted grammar w.r.t. some coverage criterion, a test set T (a set of programs) is generated which achieves the given coverage criterion. Next, all programs in T are parsed with ℘T . If all programs in T are parsed successfully, then the extracted grammar is correct w.r.t. the given coverage criterion. For example, to check the correctness of an extracted grammar G′ w.r.t. rule coverage, we generate a test set which ensures that each rule of G′ is used in the generation of at least one program. The problem of grammar extraction under language growth poses the following important question: given an initial grammar G, can we define the extent of incompleteness of G w.r.t. a set of programs P? Intuitively, the extent of incompleteness should be based on some distance metric between grammars: if G′ is a complete grammar w.r.t. P, the distance between G and G′ tells us the extent of incompleteness. There are two main problems with such a distance notion. First, there may exist an infinite number of grammars which are complete w.r.t. P. Second, a distance metric between two grammars based on the languages accepted by them cannot really be defined, as an important property of a distance metric is that a distance of 0 between two points implies that the two points are equal. If we consider the space of context free languages, then a distance of 0 between two context free languages L(G1 ) and L(G2 ) implies L(G1 ) = L(G2 ). Since the problem L(G1 ) = L(G2 ) is undecidable, we cannot create a distance metric for context free languages.


Due to these limitations, we consider a simple set theoretic notion to define the extent of incompleteness of grammars as follows:

Definition 3.3 Given two grammars G1 = (N1 , T1 , P1 , S1 ) and G2 = (N2 , T2 , P2 , S2 ), where N1 ⊇ N2 , T1 ⊇ T2 , S1 = S2 and P1 ⊇ P2 , G2 is said to be |P1 − P2 | rules away from G1 .

We will use these definitions for further problem formulation.
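For example (our illustration): if P1 = P2 ∪ {A → aB, C → d}, with the other components satisfying the above containments, then G2 is 2 rules away from G1 .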

3.5 Grammar Extraction under Language Growth

The general problem of grammar extraction can be posed as: given a set of programs, extract the underlying grammar of the programs. However, we are interested in grammar extraction when programming language growth occurs. That is, the grammar of a language is available, but due to language growth we may not have the grammar of the new version. How to extract the grammar of the new version from a set of programs is the problem we address in this thesis. The problem is formalized as follows: given a set of programs P = {w1 , w2 , . . .} and an incomplete grammar G = (N, T, P, S), find a grammar G′ = (N ′ , T ′ , P ′ , S) such that G′ is complete w.r.t. P. Note that our goal is not to find the “actual” grammar (i.e. GT ) of the entire language, as there can be an infinite number of grammars which accept a given set of programs and there is no way of ensuring the correctness and completeness of a grammar w.r.t. GT . However, based on the observed growth in programming languages, we make the following assumptions about G′ :

1. The relationship between G′ and G is as follows: N = N ′ , T ⊆ T ′ , P ⊆ P ′ . G and G′ are epsilon free grammars. The set T ′ − T is known beforehand and is denoted Tnew .

2. The types of rules missing from the grammar are constructs which involve new keywords, new operators or new types. We collectively call new keywords, new operators and new types the new terminals of the grammar; they fall in the set Tnew .


3. One grammar rule is sufficient to express the syntactic construct corresponding to each new terminal.

Assumption 1 says that the new terminals of the dialect (i.e., new keywords, new operators, new types etc.) are known beforehand, i.e., the lexical analyzer does not recognize them as identifiers or as unknown tokens. Assumption 2 enforces the language growth model we discussed earlier. Assumption 3 is based on observations of the rules corresponding to already existing constructs in many programming language grammars; we found that most language constructs can be expressed by a single grammar rule. For example, consider the grammars of C and C* and the with construct of C*: only one grammar rule is sufficient to express the syntax of the with construct. However, assumption 3 is limiting in some sense because in a few instances it may not hold. Since only one rule is sufficient to express the syntax corresponding to each new terminal, there exists at least one complete grammar which is |Tnew | rules away from G. Therefore, our goal is to get a complete grammar Gj which is not more than |Tnew | rules away from G. A parser generated from a complete grammar is called a complete parser, and the parser generated from G is called an approximate parser. For a given set of programs there can be many sets of rules which make the grammar complete. For example, suppose we are given an incomplete C grammar in which the grammar rule corresponding to the keyword “while” is missing and a complete grammar is being extracted using the program shown in figure 3.4. Here we extract a grammar rule such that the initial grammar, after including that rule, becomes complete w.r.t. the input program. We call such a rule a correct rule. There are many options for a single correct rule, e.g. statement → W HILE ( expression ) statement, statement → W HILE ( conditional expression ) statement, statement → W HILE ( expression ), statement → W HILE ( conditional expression ), statement → ( id > N U M ) statement, to name a few2 . It is clear that in the case of a single missing rule, there can be many rules corresponding to a new terminal which make the grammar complete; we call the set of all such rules the set of correct rules corresponding to the new terminal. Similarly, in the case of multiple missing rules, there can be many sets of rules such that each set makes the grammar complete (each set contains one rule for each new terminal).

Example grammar is taken from http://www.cse.iitk.ac.in/users/alpanad/grammars/.


We call such a set a correct set of rules (note the difference between the terms a set of correct rules and a correct set of rules).

main()
{
  int a = 100;
  while (a > 0) {
    printf("Counter value %d", a);
    ....
    ....
  }
}

Figure 3.4: A small program with keyword while

3.6 Summary

In this chapter, we discussed the commonly observed syntactic growth in programming languages. Based on those observations, the problem of incremental grammar extraction was formulated. We also discussed the completeness and correctness of an extracted grammar and showed why absolute completeness and correctness cannot be achieved.

Chapter 4

Grammar Completion with Single Rule

As discussed in chapter 2, existing approaches for grammar extraction in the programming language domain are heuristic based, and the approaches proposed in the NLP domain cannot be applied directly. In this chapter, we present an approach for extracting the grammar of a dialect when the dialect's grammar contains only one rule more than the initial grammar; that is, there exists at least one complete grammar which is just one rule away from the initial grammar. A rule which, after being included in the initial grammar, makes the grammar complete (that is, the modified grammar parses all the programs successfully) is called a correct rule. We address the problem of extracting a correct rule. We first discuss the approach using a single program for extracting a correct rule and then extend it to multiple programs. Later we show the correctness of our approach. By the term a set of correct rules, we mean that each rule in the set makes the grammar complete. The preliminary idea of our approach appeared in [13] and was later completed in [14].

4.1 Problem Definition and Assumptions

The problem is stated as follows: given a set of programs P = {w1 , w2 , . . .} and an approximate grammar G = (N, T, P, S), find a grammar G′ = (N ′ , T ′ , P ′ , S) such that G′ accepts all programs in P.


Note that our goal is not to find the “actual” grammar (the grammar implemented in the dialect's compiler), GT , of the entire language, as there can be an infinite number of grammars which accept a given set of programs and there is no way of knowing which grammar is the actual one. Our goal is to find a grammar that will parse all the programs in P. We make the following assumptions:

1. The relationship between G′ and G is as follows: N = N ′ , T ⊆ T ′ , P ⊆ P ′ . G and G′ are epsilon free grammars.

2. |P ′ − P | = 1 (i.e. there is only one additional rule in G′ ).

3. The correct rule (i.e. the rule missing from G) is of the form A → anew α, where A ∈ N , α ∈ (T ∪ N )∗ and anew ∈ Tnew .

From P we extract a correct rule to make G complete. We use the terms missing rule and correct rule interchangeably. The second assumption is relaxed in the next chapter, where we show how the approach can be extended to more than one missing rule. Assumption 3, which states that the correct rule starts with a new terminal, is also relaxed later in this chapter. A parser which is generated from a complete grammar is called a complete parser. An input program is represented as a1,n , where n is the number of tokens in the program; ai represents the ith token of the program and ai,j represents the substring which starts at the ith token and ends at the j th token.

4.2 Rule Extraction

First, the input programs are parsed with the parser generated from the approximate grammar. If each program is parsed successfully, then the grammar is considered complete and returned intact; otherwise a correct rule is computed. For determining a correct rule we work with one program at a time. We start with the approximate grammar G and a program a1,n . The basic approach (figure 4.1) consists of four phases: (1) the LHSs gathering phase, (2) the RHSs generation phase, (3) the rule building phase and (4) the rule checking phase. The LHSs gathering phase determines the set L of possible left hand sides (LHSs) of the correct rule from the context of the error point. The RHSs generation phase builds a set R of possible right hand sides (RHSs) of correct rules using the CYK parser table.


The rule building phase builds the set of possible rules P R from L and R. Each rule in P R is checked in the rule checking phase, i.e. whether the grammar, after including the rule, is able to parse all the programs. If all programs are parsed successfully after including some rule in G, then that rule is returned as the correct rule.

Function EXTRACT SINGLE RULE(a1,n , G)
  L ← GATHER POSSIBLE LHSs(a1,n , G)
  R ← BUILD RHSs(a1,n , G, max length)
  P R ← BUILD RULES(L, R)
  for all rule ∈ P R do
    Add rule to G
    if the modified G parses a1,n then
      return the rule
    else
      Roll back the change in G

Figure 4.1: Overall single rule extraction approach

We describe all the phases in more detail in the next sections. For now, we restrict our discussion to the computation of a correct rule from one program. We assume that the program a1,n cannot be parsed with the approximate grammar G and that G must be augmented.

4.2.1 LHSs Gathering Phase

The LHSs gathering phase returns a set of nonterminals L as the possible left hand sides (LHSs) of correct rules. The algorithm for this phase is shown in figure 4.2. The input to this phase is a program a1,n and the approximate grammar G. First, a1,n is parsed with the LR parser generated from G. Since G is incomplete, the parser will stop at some point. If ai is the first occurrence of the new terminal, the parser will stop after shifting the token ai−1 , because G does not have a production containing ai in its RHS. This state of the parser is called the error state. Possible LHSs are collected from the LR parser stack as follows.


The algorithm starts with a set of stack configurations1 . Initially this set contains a single stack, i.e. the LR parser stack in the error state. The algorithm iterates until this set becomes empty: it removes a stack from the set and looks at the action table entries corresponding to the top of the stack. If some of the entries corresponding to the top of the stack are reduce entries, it performs the reductions possible in that state without looking at the next token. We call this operation a “forced reduction”. In the case of multiple possible reductions, each reduction is performed on a separate copy of the stack. The new stack obtained after performing each reduction is added to the stack set. If the top of the stack has some shift entries, then for all items of type [A → α • B β] in the itemset corresponding to the top of the stack, the nonterminal which occurs after the dot (B) is collected in L.

Function GATHER POSSIBLE LHSs(a1,n , G)
  Parse a1,n with the parser generated from G
  err stack ← stack corresponding to the error state of the parser
  STACK SET ← {err stack}
  L ← {}
  while STACK SET is nonempty do
    for all stack ∈ STACK SET do
      STACK SET ← STACK SET − {stack}
      I ← itemset corresponding to the top of the stack
      if the top state has some shift entries in the action table then
        for all items of form [A → α • B β] in I do
          L ← L ∪ {B}
      if the top state has some reduce entries in the action table then
        for all possible reductions r on the top of the stack do
          s′ ← copy of the stack
          Apply reduction r on s′
          STACK SET ← STACK SET ∪ {s′ }
  return L

Figure 4.2: Algorithm for gathering possible LHSs

The algorithm stops when the stack set becomes empty.

Our algorithm is a small modification of the LR parser: in the LR parser there is only one stack, while in our algorithm there are multiple stacks. The rest of the operations are the same as in the LR parser.


The stack set becomes empty when no more reductions are possible on any of the stacks in the stack set. If a1,n were parsed with a complete parser, the parser would make a sequence of reductions between the shifts of tokens ai−1 and ai and then expand a nonterminal to derive the substring covered by the correct rule. Algorithm GATHER POSSIBLE LHSs simulates the behavior of all possible complete parsers by performing all possible reductions. By including all nonterminals that come after the dot in some item of a top itemset, it guesses which nonterminal a complete parser would expand.

Example 4.1: Suppose the approximate grammar is G = (N, T, P, S), where T = {a, b, d}, N = {A, B, C, S} and P has the following productions:

1. r1 : S → a A B C
2. r2 : S → a C d
3. r3 : A → a
4. r4 : A → A B
5. r5 : B → b
6. r6 : C → A B

Suppose the input sentence is aabeabaeabbeab. Terminal e is a new terminal, which is known beforehand. First, the LR parser is generated from G; the LR table corresponding to G is shown in chapter 3, in table 3.1. The input sentence is parsed with the parser generated from G (the approximate LR parser), denoted ℘G . Since e is a new token, the parser stops when it reaches the first occurrence of e. The parser configuration at this point is (s0 as1 As2 bs3 , eabaeabbeab$). The incomplete parse tree of the program corresponding to this configuration is shown in figure 4.3 (configuration (1)). Figure 4.3 shows the partial parse trees built by ℘G as it moves from one configuration to another in the algorithm. Figure 4.3 also shows the different configurations of ℘G as the algorithm proceeds (we have removed the state symbols from the configurations given in the figure). The top itemset corresponding to each configuration is also shown. Fat arrows show the parser moving from one configuration to the next; the label associated with a fat arrow is the action performed on the configuration.

There can be no reductions between the shifts in some cases. Although terminal e is not part of G, we add one extra empty column corresponding to e in the table so that the approximate LR state machine detects the error point.


[Figure: the partial parse trees and parser configurations (1)-(4) of example 4.1, each shown with the itemset of its top state. A forced reduction by B → b takes configuration (1) to configuration (2), where the possible LHSs A and C are collected; forced reductions by A → AB and C → AB take configuration (2) to configurations (3) and (4), where B and no nonterminal, respectively, are collected.]

Figure 4.3: Partial parse trees in each step

In the first configuration, only one reduction is possible in the top state s3 (i.e. r5 : B → b). Hence, a forced reduction is performed in this state and the configuration after the forced reduction is (s0 as1 As2 Bs4 , eabaeabbeab$) (figure 4.3, configuration (2)). Since state s4 has one shift entry in the action table, we look at the LR items corresponding to state s4 (figure 4.3, itemset I4 ). Nonterminals A and C occur after the dot in I4 , hence they are collected in the possible LHSs set. The above action explores the reduce sequence [r5 ] as one of the reduce sequences between the shifts of tokens b and e. If a complete parser starts shifting the next token (e) after the reduction r5 , then it will expand either A or C to derive the substring covered by a correct rule, hence A and C are possible candidates for LHSs. State s4 (the top state in configuration (2) in figure 4.3) also has two possible reductions, i.e. r4 (A → AB) and r6 (C → AB).


Therefore, two separate stacks are made and each reduction is performed separately on these stacks. The resulting set of configurations (figure 4.3, configurations (3) and (4)) after the two forced reductions is {(s0 as1 As2 , eabaeabbeab$), (s0 as1 Cs10 , eabaeabbeab$)}. Nonterminal B occurs after the dot in an LR item corresponding to the top state of configuration (3) (i.e. I2 ), hence B is collected in the possible LHSs set. In configuration (4), no nonterminal occurs after the dot in any item of the top itemset (I10 ), hence no nonterminal is collected in the possible LHSs set. Since no more reductions are possible in configurations (3) and (4), the algorithm stops and returns the set {A, B, C} as the set of possible LHSs.

4.2.2 RHSs Generation Phase

The RHSs generation phase determines the possible RHSs of the missing rule. The algorithm for this is shown in figure 4.4. The algorithm is called with a program, the approximate grammar and a parameter max length. It builds a set of possible RHSs, corresponding to a new terminal, of length no more than max length. First, the input program is parsed with G to get the incomplete CYK table for the program. The algorithm then uses this CYK table to build a set of possible RHSs. For building the set of RHSs, first the index of the last occurrence of the new terminal is determined; suppose it is i. For each k (i < k ≤ n, where n is the number of tokens in the program), the algorithm computes the set of symbol strings that can derive the substring ai,k (where a1,n is the input program). Since the correct rule starts with a new terminal, the last occurrence of the new terminal in the program is the point where the substring derived by the last occurrence of the correct rule starts. Since we consider each index as a possible end point of the substring derived by the last occurrence of the correct rule, the RHSs of correct rules will be built in some iteration and added to the set of possible RHSs. The set of symbol strings which can derive a substring ai,j is computed by the BUILD STRINGS algorithm, which works similarly to the CYK parser (figure 4.4). First, each cell C[p, p] (i ≤ p ≤ j) is filled with the symbols which derive the single token ap,p . Symbol strings which can derive larger substrings are built in a bottom up manner. For computing the symbol strings of length l which can derive am,m+q , the algorithm does the following iteration for each index k (m ≤ k < m + q):

47

C[m, k] and a symbol string of length of l − r is picked from the cell C[k + 1, m + q]; these strings are concatenated to get a symbol string of length l4 .

Function BUILD RHSs(a1,n, G, max length)
    RHSs ← {}
    i ← index of the last occurrence of the new terminal in a1,n
    for i < k ≤ n do
        RHSs ← RHSs ∪ BUILD STRINGS(i, k, max length)
    return RHSs

Function BUILD STRINGS(i, j, max length)
    for i ≤ p ≤ j initialize C[p, p] ← symbols that derive ap,p
    for 1 ≤ q ≤ j − i        ⊲ Build symbol strings that derive substrings of larger lengths
        for i ≤ m ≤ j − q
            for m ≤ k < m + q
                for 0 ≤ l ≤ max length
                    Concatenate each symbol string of length r (0 ≤ r ≤ l) from C[m, k]
                    with each symbol string of length l − r from C[k + 1, m + q]
                    and put the result in C[m, m + q]
    return C[i, j]

Figure 4.4: Algorithm for RHSs building
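A direct rendering of figure 4.4 in code may make the cell structure clearer. This is a sketch under assumptions: tokens are 0-indexed, and seed(p) is a hypothetical callback (standing in for the CYK table of the approximate grammar) that returns the length-1 symbol strings deriving token p, e.g. both ('a',) and ('A',) when A → a is in the grammar.

    from collections import defaultdict
    from itertools import product

    def build_strings(i, j, seed, max_length):
        """Return all symbol strings (tuples) of length <= max_length deriving tokens i..j."""
        C = defaultdict(set)
        for p in range(i, j + 1):
            C[(p, p)] = set(seed(p))             # symbols deriving the single token a_p
        for span in range(1, j - i + 1):         # larger substrings, bottom-up
            for m in range(i, j - span + 1):
                for k in range(m, m + span):     # break point between the two halves
                    for left, right in product(C[(m, k)], C[(k + 1, m + span)]):
                        if len(left) + len(right) <= max_length:
                            C[(m, m + span)].add(left + right)
        return C[(i, j)]

    def build_rhss(tokens, seed, max_length, new_terminal):
        i = max(p for p, t in enumerate(tokens) if t == new_terminal)
        rhss = set()
        for k in range(i + 1, len(tokens)):      # each possible end point of the substring
            rhss |= build_strings(i, k, seed, max_length)
        return rhss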

For example, to build symbol strings of length 2 which can derive the substring ai,k, we consider a break point i1. Suppose cell C[i, i1] has symbol strings X1 and X2 and C[i1 + 1, k] has symbol strings Y1 and Y2; the symbol strings constructed from these two cells will be X1Y1, X1Y2, X2Y1 and X2Y2. Example 4.2: Consider the same grammar and input program as in example 4.1. The last occurrence of the new terminal e is shown in figure 4.5. Starting from the last occurrence of e, each succeeding token is considered as a possible end point of the substring derived by a correct rule. Figure 4.5 shows the possible RHSs built from the input program.

4. Since the algorithm works bottom-up, symbol strings for the substrings ai,k and ak+1,j will already have been computed at this point.


Input program (last occurrence of e marked):  a a b e a b a e a b b [e] a b

Possible RHSs = {eab, eAb, eaB, eAB, eA, eC}
Possible LHSs = {A, B, C}
Possible rules = A → eab, A → eAb, A → eaB, A → eAB, A → eA, A → eC,
                 B → eab, B → eAb, B → eaB, B → eAB, B → eA, B → eC,
                 C → eab, C → eAb, C → eaB, C → eAB, C → eA, C → eC
Correct rules = B → eab, B → eAb, B → eaB, B → eAB, B → eA, B → eC

Figure 4.5: Possible RHSs gathered from the program and correct rules

4.2.3 Rule Building and Rule Checking Phase

In the rule building phase, the sets of possible RHSs and LHSs constructed in the previous phases are used to build a set of possible rules (PR). PR is the cross product of the sets L and R. Each rule in PR is checked in the rule checking phase: a rule is added to the approximate grammar, and the modified grammar is checked to see whether it parses all the programs. If the modified grammar parses all the programs after adding some rule, then that rule is returned as a correct rule. The above checking is done using an LR parser. Figure 4.5 shows the set of possible rules for the example discussed earlier. The approach returns as soon as it finds the first correct rule; figure 4.5 shows all possible correct rules, and this set also includes some rules which are non-LR (B → eaB and B → eab are the only correct and LR-preserving rules). A rule is non-LR if the grammar becomes non-LR after including it; otherwise it is called an LR-preserving rule. The parameter max length is usually kept equal to the length of the largest RHS of rules in G. We assume that the length of the correct rule's RHS is not more than max length. If there does not exist any correct rule whose RHS length is less than or equal to max length, the algorithm incrementally builds possible rules with larger RHS lengths and checks their correctness.
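The rule building and checking phases then amount to a nested loop over the cross product. A sketch, assuming a grammar represented as a list of (LHS, RHS) rules and a hypothetical make_parser callback that returns None when the extended grammar is not LR; neither name comes from the thesis.

    def find_correct_rule(grammar, lhss, rhss, programs, make_parser):
        """Try each rule in PR = L x R; return the first correct, LR-preserving rule."""
        for lhs in lhss:
            for rhs in rhss:
                candidate = grammar + [(lhs, rhs)]
                parser = make_parser(candidate)
                if parser is None:               # non-LR rule: skip it
                    continue
                if all(parser.accepts(p) for p in programs):
                    return (lhs, rhs)            # grammar is complete w.r.t. the programs
        return None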


4.2.4 A Realistic Example

Suppose the rule corresponding to keyword while is missing in the ANSI C grammar and the program in figure 4.6(a) is given as input. For gathering possible LHSs, the program is parsed with the parser generated from the incomplete ANSI C grammar. Since while is a new terminal, the parser will get stuck when it reaches the first while. Now we look at the itemset corresponding to the top of the stack (figure 4.6(b)); suppose it is: {[statement → expression SEMICOL • ]}. Since an LR item in the above itemset is of the form [A → β • ], a forced reduction will be performed. After performing the forced reduction we reach the state whose itemset, corresponding to the top of the stack, contains an item of the form [statement list → statement list • statement]. Nonterminal statement will be added to the set of possible LHSs because it occurs after the dot. For building possible RHSs we start from the last occurrence of while and build the set of symbol strings which can derive the substrings starting from that occurrence. The set of symbol strings is shown in figure 4.6(b). Using the sets of possible LHSs and RHSs, the set of possible rules is built as shown in figure 4.6(b)5. Now each rule in this set is checked to determine whether the grammar, after including that rule, is able to parse all the programs. After adding either of the following rules the modified grammar parses all the programs: statement → while ( expression ) and statement → while ( expression ) statement. One of these two rules is returned as a correct rule.

4.3 Proof of Correctness

In this section we prove the correctness of the algorithm. For proving the correctness we show that the algorithm will always find a grammar complete w.r.t. a given set of programs.

5. We are not showing all possible RHSs, LHSs and itemsets, as these were too many for the real grammar.


Itemset corresponding to the top of stack = {[statement → expression SEMICOL • ]}
Itemset after forced reduction = {[statement list → statement list • statement]}
Possible LHSs for keyword while = {statement}

(a) Input program:

main() {
    int x, y, z, i;
    x=1000; y=90; z=800;
    while(x>500) x--;            /* error point: first occurrence of while */
    for(i=0; i < 50; i++) {
        while(y > 200)           /* last occurrence of keyword while */
            y=y/2;
    }
}

Possible RHSs =
(1) while (
(2) while ( id
(3) while ( expression )
(4) while ( expression ) statement
(5) while ( expression ) statement }
(6) while ( expression ) statement } }

Possible rules built =
(1) statement → while (
(2) statement → while ( id
(3) statement → while ( expression )
(4) statement → while ( expression ) statement
(5) statement → while ( expression ) statement }
(6) statement → while ( expression ) statement } }

Correct rules =
(1) statement → while ( expression )
(2) statement → while ( expression ) statement

(b) Set of possible LHSs, RHSs and correct rules

Figure 4.6: Example of rule extraction process

For this we show that a correct rule will always fall in the set of possible rules (PR) built by the algorithm. Since there exist many correct rules, we pick one correct rule (say r = D → η) and show that D will fall in L and η will fall in R; therefore D → η will fall in PR. The grammar obtained after adding D → η to G is denoted G′; G′ is a complete grammar w.r.t. P. The parser generated from a complete grammar (G′) is called a complete parser (or modified parser) and is denoted ℘G′. The parser generated from G is denoted ℘G. The state machine generated from G is denoted M and the state machine generated from G′ is denoted M′. a1,n is the input program, and the substring derived by the first occurrence of rule r starts at index i. For proving that L always contains D, we first discuss the relationship between ℘G and ℘G′. Suppose ℘G′, while parsing a1,n, makes the sequence of reductions [p1, p2, . . . , pl] (where each pj denotes a production) between the shifts of tokens ai−1 and ai. Since the algorithm GATHER POSSIBLE LHSs explores a set of reduce sequences with forced reductions, we show that the reduce sequence [p1, p2, . . . , pl] will be explored by GATHER POSSIBLE LHSs.


It is evident that ℘G, while parsing a1,n, will get stuck after shifting the token ai−1, because ai is a new terminal and does not fall in the terminal set T of G. Suppose the sentential form corresponding to the parser configuration after shifting the token ai−1 is αai . . . an6. The derivation α ⇒⋆ a1,i−1 contains productions of G alone. Therefore, when a1,n is parsed with ℘G′, the sentential forms through which the parsers ℘G′ and ℘G pass will be the same until they hit the token ai. In other words, the states through which ℘G and ℘G′ pass while parsing a1,n will be equivalent.

Definition 4.1 A state s in M and a state s′ in M′ are equivalent if there is at least one viable prefix for which both the states are valid. If both the states are valid for prefix β, i.e. a path from the start state to s (in M) and to s′ (in M′) has label β, we say that the states are β-equivalent. We denote such equivalence as s ≅β s′.

Example 4.3: Suppose a rule B → eab, which is correct w.r.t. the input program given in example 4.1, is added to the grammar given in the same example, i.e. G′ = (N, T ∪ {e}, P ∪ {B → eab}, S). The LR state machines corresponding to G and G′ are given in figures 3.2 and 4.7. State 4 in figure 3.2 and state 7 in figure 4.7 are equivalent for the viable prefix aAB, i.e. s4 ≅aAB s7. Shift and reduce operations on two equivalent states result in another pair of equivalent states, as stated in the following lemma:

Lemma 4.1 If s ≅β s′, where s and s′ are states in M and M′ respectively, then the following relationships hold:

1. if action[s, t] = shift si and action′[s′, t] = shift s′i then si ≅βt s′i, ∀t ∈ T; action[s, t] = error if t = anew.

2. if action[s, t] = ri and action′[s′, t] = r′i then ri = r′i, ∀t ∈ T.

3. goto[s, A] ≅βA goto′[s′, A], ∀A ∈ N.

6. Each parser configuration (s0X1s1 . . . sm, ai ai+1 . . . an) represents the sentential form X1 . . . Xm ai ai+1 . . . an.


[Figure 4.7 shows the LR(1) state machine for the modified grammar G′ = (N, T ∪ {e}, P ∪ {B → eab}, S), with states 0 to 18, their LR(1) itemsets (e.g. state 3: [B → b • , {a, b, d, e}]; state 6: [B → e a b • , {a, b, d, e}]) and the shift and goto transitions between them.]

Figure 4.7: State machine corresponding to modified grammar

Proof:

1. Case t ∈ T: Since s is valid for the viable prefix β, si will be valid for the viable prefix βt. Similarly, s′i will be valid for the viable prefix βt. Hence si ≅βt s′i. Case t = anew: Since no production in G contains anew in its RHS, action[s, anew] will be error.

2. Suppose action′[s′, t] = reduce with C → β2. This implies there exists a sentential derivation S ⇒⋆ β1Cγ ⇒ β1β2γ, where β1β2 = β (as s is valid for the viable prefix β) and t ∈ FIRST(γ)7. Since FIRSTG(γ) ⊆ FIRSTG′(γ) and FIRSTG′(γ) may contain only the additional terminal anew, all reduce entries except the one corresponding to anew will be the same.

3. The argument is the same as in the first case.

7. FIRST(γ) = {a | γ ⇒⋆ aβ, for some β ∈ (N ∪ T)⋆}.


Lemma 4.2 The reduce sequence made by ℘G′ between the shifts of tokens ai−1 and ai will be covered by the GATHER POSSIBLE LHSs algorithm.

Proof: From lemma 4.1, the sequence of shifts and reduces made by ℘G and ℘G′ will be the same until they reach the token ai, because both parsers either shift the same token or reduce with the same production (as they pass through equivalent states). Suppose the configuration of ℘G after shifting ai−1 is:

(s0 X1 s1 X2 s2 . . . Xm sm, ai ai+1 . . . an)

and the configuration of ℘G′ after shifting the token ai−1 is:

(s′0 X1 s′1 X2 s′2 . . . Xm s′m, ai ai+1 . . . an)

The parser stacks of ℘G and ℘G′ contain pairwise equivalent states. The top of the stack in ℘G is sm and the itemset corresponding to that state is Im. We prove by induction on the number of steps of the algorithm that one of the reduce sequences explored by the algorithm is the same as the reduce sequence made by ℘G′ between the shifts of tokens ai−1 and ai.

1. Basis, k = 1: If Im has items of type [A → α • ], then the algorithm performs forced reductions for all possible reductions in Im. Since sm and s′m are equivalent states, both will have the same set of possible reductions in the first iteration (from lemma 4.1). Therefore one of the reductions performed by the algorithm will be the same as p1. If Im also has items of type [A → α • D β], then I′m will contain the additional item [D → • η]8. Hence ℘G′ may have a shift action at this configuration. To account for the possibility of a shift, the algorithm collects D in the possible LHSs set.

2. Induction, k = n + 1: Suppose one of the reduce sequences explored by the algorithm in the first n iterations is [r1 r2 . . . rn] and it is the same as the first n reductions made by ℘G′ between the shifts of ai−1 and ai (where n < l), i.e. p1 = r1, . . . , pn = rn. We show that one of the reductions performed in the next iteration of the algorithm will be the same as pn+1. Suppose kn is the configuration of ℘G after performing the reductions [r1 r2 . . . rn] and that of ℘G′ is k′n.

8. This is because of the way itemsets are computed [2].


The top states in kn and k′n will be equivalent because, as long as ℘G and ℘G′ perform the same reductions, they pass through equivalent states. The possible reductions in kn and k′n (if there are any) will be the same because the top states are equivalent. Therefore, one of the reductions, rn+1, performed in the next step will be the same as pn+1. If the top of the stack in kn has some item of type [A → α • Dβ], then ℘G′ may have a shift action after the reductions [r1 r2 . . . rn], as the LHS of the correct rule is D. In this case the sequence of reductions made by ℘G′ between the shifts of ai−1 and ai is [r1 r2 . . . rn]. This sequence is already explored by the algorithm, and D is added to the possible LHSs set to account for the possibility that the next action is a shift.

Lemma 4.3 The LHS of the correct rule (i.e. D) will fall in the set L.

Proof: L is the set of nonterminals collected from the itemsets corresponding to the tops of the stacks during forced reductions. From lemma 4.2, the reduce sequence [p1 . . . pl] will be explored by the algorithm. Suppose that after making the reductions [p1 . . . pl] the parser ℘G reaches the configuration

(s0 X1 s1 . . . Xl sl, ai ai+1 . . . an)

We show that one of the nonterminals collected from the top itemset Il will be D. Suppose none of the nonterminals collected from this configuration is D (the LHS of the correct rule). Then the LHS would have to be determined from some state sj inside the stack. Suppose the itemset corresponding to the state sj has an item [A → γ • D δ]; then the string of symbols Xj+1 Xj+2 . . . Xl (which includes all symbols from the state sj+1 up to the top of the stack) would have to be appended before the new terminal in the RHS of the correct rule, i.e. the rule would be of the form D → Xj+1 Xj+2 . . . Xl anew θ (θ is some symbol string). This contradicts our assumption that η starts with a new terminal. Hence D will be collected from the top itemset and will fall in L.


Lemma 4.4 The RHS of the correct rule (i.e. η) will fall in R.

Proof: For building the set of possible RHSs, the algorithm BUILD RHSs starts from the index of the last occurrence of the new terminal (suppose j) and adds all symbol strings which can derive a substring aj,k (j < k ≤ n) to the set R. Suppose the substring aj,k is derived by the last occurrence of D → η. No substring derived from another instance of the correct rule is nested within aj,k, because aj,k is the last (and hence innermost) occurrence of a substring derived by D → η. Since the algorithm considers each index k (j < k ≤ n) as a possible end point of the substring derived by a correct rule, it will consider this particular index k in some iteration. Since the BUILD STRINGS algorithm constructs all symbol strings that derive a particular substring, η will be constructed while constructing the set of symbol strings that derive aj,k. Hence η will fall in R.

Lemma 4.5 The algorithm will always return a correct rule.

Proof: From lemmas 4.3 and 4.4, the LHS and RHS of a correct rule D → η will be in L and R respectively. Therefore PR will contain D → η. Since the algorithm iteratively checks the correctness of the rules, it will select D → η in some iteration and return it as a correct rule.

4.4 Time Complexity

The size of an LR(1) parser for a grammar of size m9 can be O(2^{c√m}) [67] in the worst case (c > 0 is a constant); hence the worst-case time taken in building an LR(1) parser for a grammar of size m is O(2^{c√m}). Therefore the LHSs gathering phase takes O(n) + O(2^{c√m}) time. The maximum number of possible symbol strings of length less than or equal to max length is v^{max length}, where v = |N ∪ T|. The time taken in computing the set of symbol strings is O(n³), because the computations done for larger substrings reuse the computations already done for smaller substrings, in the same way as the CYK parser works; hence the time taken in building all symbol strings is O(n³ × v^{max length}).

9. The size of a grammar is expressed as the sum of the lengths of all the productions, where the length of a production B → β is equal to 1 + length(β); length(β) is the number of symbols in β.


The upper bound on the number of possible rules is O(v^{max length} × |N|) = O(v^{max length+1}).

The correctness of each rule can be checked in O(n + 2^{c√m}) time. Hence the total time taken by the algorithm is O(n) + O(2^{c√m}) + O(n³) × v^{max length} + O(v^{max length+1}) × O(n + 2^{c√m}), which is O(v^{max length+1} × (n³ + n + 2^{c√m})). In practice, the time taken in building possible RHSs and LHSs is not high, because the worst case occurs only when each substring of the program is derived by each nonterminal, i.e. when the grammar is highly ambiguous. The major time spent by the approach is in checking the correctness of a large set of possible rules, i.e. the factor v^{max length+1} in the above expression. We propose some optimizations to reduce this set in later chapters.

4.5 Using Multiple Programs

Since adding a single rule is sufficient for the grammar to become complete, we can use the information obtained from multiple programs to reduce the number of possible RHSs and LHSs. The main grammar extraction algorithm changes as follows: it calls GATHER POSSIBLE LHSs for each program to gather a set of possible LHSs from each program, and then computes their intersection to get a reduced set of possible LHSs. Similarly, possible RHSs are computed from each program and their intersection is taken as the set of possible RHSs. The rule building phase uses the reduced sets of possible LHSs and RHSs to build the set of possible rules.
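A sketch of this combination, assuming per-program helpers with the shapes used in the earlier sketches (each returning a set); the helper names are illustrative:

    from functools import reduce

    def possible_rules_from_programs(programs, gather_lhss, gather_rhss):
        """Intersect per-program LHS/RHS sets, then take their cross product."""
        lhss = reduce(set.intersection, (gather_lhss(p) for p in programs))
        rhss = reduce(set.intersection, (gather_rhss(p) for p in programs))
        return {(A, alpha) for A in lhss for alpha in rhss}

Because a single correct rule must work for every program, intersecting before taking the cross product can shrink the candidate set considerably.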

4.6 Extracting a Rule of form A → α anew β

The approach discussed previously works if the correct rule is of the form A → anew α (i.e. the RHS of the rule starts with a new terminal). In this section we weaken this restriction by making some modifications to the previous approach; we present the modified LHSs and RHSs gathering phases. We discuss the approach with one input program; it can be extended to multiple programs in the same fashion as discussed in the earlier section. This extension has appeared in [16].


4.6.1 LHSs Gathering Phase

In the modified approach the input program a1,n is parsed with the LR parser generated from the approximate grammar. Unlike the earlier approach, where LHSs were gathered only after the parser arrives at the error state (at a new terminal), here possible LHSs are gathered from every configuration the approximate parser passes through. That is, at each step the approach checks the top itemset of the parser stack; if it contains an item of type [A → α • B β], then B is added to the set of possible LHSs. Once the parser reaches the error state, it performs forced reductions in the same manner as discussed previously and collects further possible LHSs. The idea behind collecting possible LHSs from every configuration the parser passes through is as follows. Suppose af is the first occurrence of the new terminal in a1,n; the index f marks the first (and hence outermost) occurrence of the missing rule in the program. Suppose the substring covered by the first occurrence of the missing rule starts at index m (m ≤ f). If the input program is parsed by a complete parser, the parser will start recognizing the substring covered by the missing rule from the mth token. Since the value of m is not known, the modified approach considers each index i (1 ≤ i ≤ f) as a possible starting point of the substring covered by the missing rule. Therefore, possible LHSs are gathered from every configuration the approximate parser passes through while parsing the input program. In the previous approach the value of m was always equal to f (i.e. the index of the first new terminal), as missing rules always started with a new terminal. Example 4.5: Consider the approximate grammar G = (N, T, P, S) given in example 4.1. Suppose the input sentence is aabaebaaebbaeb. The initial configuration of the parser is (s0, aabaebaaebbaeb$); the itemset corresponding to state 0 (I0) contains an item with S after the dot, hence S is included in the possible LHSs set. In the next step the parser reaches the configuration (s0as1, abaebaaebbaeb$); itemset I1 contains nonterminals A and C after the dot, hence these are added to the possible LHSs set. The parser will pass through states 2, 3 and 4 and finally get stuck at state 8. The configuration of the parser at the error state is (s0as1As2Bs4as8, ebaaebbaeb$). Forced reductions are performed here in the same fashion as in the earlier approach. The set of possible LHSs collected after the forced reductions is {A, B, C, S}.
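The modified phase can be sketched as a small loop over the parser's configurations; parser here is a hypothetical step-wise LR parser object, and all of the method names are illustrative rather than taken from the thesis:

    def gather_lhss_all_configurations(parser, nonterminals):
        """Collect B from every item [A -> alpha . B beta] seen on a top state."""
        lhss = set()
        while not parser.at_error():
            for lhs, rhs, dot in parser.top_items():
                if dot < len(rhs) and rhs[dot] in nonterminals:
                    lhss.add(rhs[dot])      # every prefix is a possible start point
            parser.step()                   # one shift or reduce of the LR parser
        # At the error point, continue with forced reductions exactly as in the
        # earlier sketch of GATHER_POSSIBLE_LHSs.
        return lhss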


4.6.2 RHSs Generation Phase

Since the RHS of the correct rule is of the form α anew β (where α, β ∈ (T ∪ N)⋆ and anew ∈ Tnew), we can divide a possible RHS into two parts: (1) the part which occurs to the left of the new terminal, i.e. α, and (2) the part which occurs to the right of the new terminal, i.e. β. We build the sets of possible α and possible β separately and then use these two sets to build the set of possible RHSs. The set of all possible symbol strings which occur to the left of the new terminal is denoted RL, and the set of all possible symbol strings which occur to the right of the new terminal is denoted RR. For building the sets RL and RR, the input program a1,n is first parsed with the approximate grammar using the CYK parser. Suppose f and l are the indices of the first and the last occurrences of new terminals. For building RL, we consider each index i (1 ≤ i < f) as a possible starting point of the substring derived by the missing rule. For each i, the set of symbol strings that can derive the substring ai,f−1 is computed and added to RL. Similarly, for building RR, each index j (l < j ≤ n) is considered as a possible end point of the substring derived by the missing rule, and all symbol strings that can derive the substring al+1,j are added to RR. The set of possible RHSs R is built by concatenating symbol strings taken from RL and RR as follows:

R = {α anew β | α ∈ RL, β ∈ RR}

Note: Both RL and RR additionally contain the empty string ε. This covers the case when the RHS is of the form anew α or α anew (i.e. the new terminal occurs at the first or the last position of the RHS). Example 4.6: Consider the same input program and approximate grammar as in example 4.5. First the program is parsed with the approximate grammar. The sets RL and RR are computed from the shaded regions of the input program shown in figure 4.8, and the set of possible RHSs is built by cross-concatenating RL and RR; it is also shown in the figure.


Input program (first and last occurrences of keyword e marked):

a a b a [e] b a a e b b a [e] b

RL = {AABA, AAA, ACA, ABA, BA, AA, AC, A, ε}
RR = {B, ε}
Possible RHSs = {AABAeB, AAAeB, ACAeB, ABAeB, BAeB, AAeB, ACeB, AeB, eB,
                 AABAe, AAAe, ACAe, ABAe, BAe, AAe, ACe, Ae, e}

Figure 4.8: Example of building possible RHSs
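The two-sided construction can reuse the build_strings sketch from earlier in this chapter (again with 0-indexed tokens and the hypothetical seed callback); the empty string is seeded into both sets so that RHSs of the forms anew β and α anew are covered:

    def build_rhss_two_sided(tokens, seed, max_length, new_terminal):
        occ = [p for p, t in enumerate(tokens) if t == new_terminal]
        f, l = occ[0], occ[-1]                   # first and last occurrences
        r_left, r_right = {()}, {()}             # both sets contain epsilon
        for i in range(0, f):                    # possible start points of alpha
            r_left |= build_strings(i, f - 1, seed, max_length)
        for j in range(l + 1, len(tokens)):      # possible end points of beta
            r_right |= build_strings(l + 1, j, seed, max_length)
        return {a + (new_terminal,) + b
                for a in r_left for b in r_right
                if len(a) + 1 + len(b) <= max_length}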

The rule building and rule checking phases are the same as in the previous approach.

4.7 Summary

In this chapter we presented an approach for extracting a complete grammar when a complete grammar is just one rule away from the initial grammar. We first presented the approach for extracting a rule whose RHS starts with a new terminal; later we removed this assumption. Our approach first builds a set of possible rules from the input programs and then checks whether the grammar becomes complete w.r.t. the input programs by including one of the rules. We also proved the correctness of the approach. We use an LR parser (instead of an LALR parser) in the LHSs gathering phase because the LR parser stops exactly at the error point, whereas the LALR parser reports the error much later, i.e. it may perform a few wrong shifts and reductions. The LHSs gathering phase explores all possible sequences of reductions a complete parser can make while parsing the input program; with an LALR parser, all possible actions of a complete parser might not be explored, because the parser may have performed some wrong shifts and reductions before reporting the actual error.


Chapter 5

Grammar Completion with Multiple Rules

In this chapter we address the situation when more than one rule is needed to complete the initial grammar. We extend the algorithm discussed in the previous chapter to address the problem of extracting multiple rules. We first discuss the approach when the right hand side of each missing rule starts with a new terminal and later relax this assumption. We also prove the correctness of the algorithm. The ideas discussed in this chapter are parts of the articles [14, 16].

5.1 Notation Recap

The problem of multiple rule extraction is stated as follows: given a set of programs P = {w1, w2, . . . , wn} and an incomplete grammar G = (N, T, P, S), where P ⊄ L(G), find a grammar G′ = (N′, T′, P′, S) such that G′ generates all programs in P, i.e. P ⊆ L(G′). Our approach is based on the following assumptions:

1. The relationship between G′ and G is: N = N′, T ⊆ T′, P ⊆ P′. The set (T′ − T) is the set of new terminals, which is known beforehand. We denote this set as Tnew. Note that |Tnew| and |P′ − P| can be more than one.


2. The rules missing in G are rules which contain a new terminal in their right hand side (RHS), and the RHS of each missing rule starts with a new terminal. That is, each missing rule is of the form A → anew α, for some A ∈ N, α ∈ (T ∪ N)⋆ and new terminal anew ∈ Tnew.

3. A single rule is sufficient to represent the syntax corresponding to each new terminal.

Since a single rule is sufficient to express the syntax corresponding to each new terminal, there exists at least one complete grammar which is |Tnew| rules away from G. Our goal is to extract a complete grammar G′ which is not more than |Tnew| rules away from G. A correct set of rules is a set of rules which collectively makes the grammar complete (note the difference with the term set of correct rules as used in the earlier chapter).

5.2 Overview of the Approach

The overall approach is iterative, with backtracking. In each iteration, a set of possible rules corresponding to a new terminal is built and one among them is added to G to make G more complete. After |Tnew| iterations, the modified grammar is checked for completeness, i.e. whether the modified grammar is complete w.r.t. P. If it is, then the rules which were added to G are returned as the set of rules needed to complete G (i.e. a correct set of rules); otherwise the algorithm backtracks to a previous iteration and selects another rule. For systematically building a set of possible rules corresponding to each new terminal, the programs are grouped according to the layout of new terminals in the programs. For each new terminal K, two groups of programs, PK and P′K, are made, where PK is the set of programs in which K is the first new terminal and P′K is the set of programs in which K is the last new terminal. This grouping is done because we use the same approach as in the previous chapter for building the set of possible rules. The approach for single rule extraction starts from the last occurrence of the new terminal for building a set of possible RHSs; in the same way, for building a set of possible RHSs of a missing rule corresponding to a new terminal K, we use those programs where K is the last new terminal. Hence we make a group of programs (P′K) where K is the last new terminal. Similarly, for


building a set of possible LHSs of a missing rule corresponding to K, we use those programs where K is the first new terminal. Therefore, a group of programs PK, where K is the first new terminal, is made. The two groups PK and P′K are used to build a set of possible rules (PRK) corresponding to K. Example 5.1: Consider an incomplete ANSI C grammar in which the rules corresponding to keywords while and for are missing, and the three input programs shown in figure 5.1. The set of new terminals is Tnew = {for, while}. Terminal while is the first new terminal in program 1, therefore Pwhile = {Program 1}. Terminal while is the last new terminal in all three programs, therefore P′while = {Program 1, Program 2, Program 3}. The set of possible LHSs corresponding to while will be gathered from program 1 and the set of possible RHSs will be gathered from all three programs.

Program 1:
main() {
    int x, y, z, i;
    x=1000; y=900; z=800;
    while (x > 500) x--;
    for (i=0; i < 50; i++) {
        while (y > 200)
            y=y/2;
    }
    while (z > 20) z--;
}

Program 2:
main() {
    int x, i;
    x=200;
    for (i=100; i > 0; i--) {
        while (x > 20) {
            x=x-4;
        }
    }
}

Program 3:
main() {
    int x, i, j;
    x=200;
    for (i=0; i < 300; i++) {
        for (j=500; j > 50; j--) ;
    }
    while (x < 300) x=x+2;
}

Figure 5.1: Set of input programs

Such grouping helps in building a set of possible rules corresponding to a new terminal for the following reason. Suppose a set of possible rules corresponding to keyword K is being built. Since each rule starts with a new terminal, a program in which K is the last new terminal implies that the substring derived by the last occurrence of the missing rule corresponding to K does not embed a substring derived by another missing rule. For example, figure 5.2 shows a program where while is the last new terminal. The substring covered by a rule corresponding to terminal while cannot embed a substring covered by another missing rule, as otherwise while would not be the last new terminal. Programs where K is the first new terminal indicate that the substring derived by the first occurrence of the missing rule corresponding to K is not embedded within the substring derived by another missing rule. Therefore, if such a program is parsed with an LR parser generated from a complete grammar, the parser will first recognize the substring covered by the missing rule corresponding to K; that is, the parser will first expand the rule corresponding to the new terminal K. Hence, possible LHSs of the rule corresponding to K can be gathered from such programs.

main() {
    int i, j;
    for(i=0; i < 100; i++) {
        a = a+1;
        while(j < 0) {
            printf("Value of j %d ", j);
            j++;
        }
    }
}

Figure 5.2: Program where while is the last keyword

Now we describe the approach for extracting multiple missing rules. The approach iteratively builds a set of possible rules corresponding to each new terminal Ki ∈ Tnew, as shown in figure 5.3. In each iteration, first, a pair of groups (PKi, P′Ki) is made for each new terminal Ki ∈ Tnew. Next, a terminal Ki ∈ Tnew is selected such that the set P′Ki is nonempty; we prefer a new terminal Ki for which the set PKi is also nonempty. The set of possible LHSs, LKi, of rules corresponding to Ki is gathered from the programs in PKi using the algorithm GATHER POSSIBLE LHSs. The set of possible RHSs, RKi, of rules corresponding to Ki is gathered from the programs in P′Ki using the algorithm BUILD RHSs. The set of possible rules, PRKi, corresponding to Ki is built by taking the cross product of LKi and RKi. One rule from PRKi is selected and added to G, and Ki is removed from the set of new terminals (Tnew). Since Ki is no longer a new terminal, the input programs are grouped again, according to the layout of the modified set of new terminals, in the next iteration. The above steps are repeated until a rule corresponding to each new terminal has been added to G. The modified grammar is then checked for completeness. If the modified grammar is complete w.r.t. P, then the rules added to G in the different iterations are returned as a correct set of rules; otherwise we backtrack to one of the previous iterations and select another rule. Note that in some iterations the algorithm may not find a new terminal Ki ∈ Tnew such that the sets PKi and P′Ki are both nonempty. In such a case it builds possible rules corresponding to those new terminals Kj for which at least the set P′Kj is nonempty. The set of possible RHSs corresponding to Kj is computed using P′Kj, and the set of possible LHSs is set equal to N (the set of all nonterminals of G).

Function EXTRACT RULES(P, G, Tnew)        ⊲ Tnew is the set of new terminals
    while Tnew is not empty do
        For each Ki ∈ Tnew make groups PKi and P′Ki
        Select a Ki from Tnew such that the set P′Ki is non-empty
        if PKi is non-empty then
            LKi ← ∩ (over a1,n ∈ PKi) GATHER POSSIBLE LHSs(a1,n, G)
        else
            LKi ← N
        RKi ← ∩ (over a1,n ∈ P′Ki) BUILD RHSs(a1,n, G, max length)
        PRKi ← build rules using sets LKi and RKi
        Select a rule r from PRKi and add it in G
        Remove Ki from Tnew
    if G parses all programs in P then
        return the set of new rules added in G
    else
        Backtrack to a previous iteration and try another rule

Figure 5.3: Algorithm for extracting multiple missing rules
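The iteration with backtracking in figure 5.3 is naturally written as a recursion. The following sketch assumes helpers pick_terminal, possible_rules and parses_all that stand in for the grouping, LHS/RHS and checking phases described above; none of these names come from the thesis, and the grammar is represented as a list of rules.

    def extract_rules(programs, grammar, new_terminals):
        """Return a correct set of rules (one per new terminal) or None."""
        if not new_terminals:
            return [] if parses_all(grammar, programs) else None
        K = pick_terminal(programs, new_terminals)   # some K with nonempty P'_K
        for rule in possible_rules(K, programs, grammar):
            result = extract_rules(programs, grammar + [rule], new_terminals - {K})
            if result is not None:
                return [rule] + result
        return None                                  # backtrack: caller tries another rule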


5.3 An Example

Suppose we have an incomplete ANSI C grammar in which the rules corresponding to keywords for and while are missing, and the three input programs shown in figure 5.1 are given as input; Tnew = {while, for}. First the programs are grouped according to the layout of the new terminals for and while: Pwhile = {Program 1}, P′while = {Program 1, Program 2, Program 3}, Pfor = {Program 2, Program 3} and P′for = {}. Since Pwhile and P′while are nonempty, the set of rules corresponding to while is built first. The set of possible RHSs is built from program 1, program 2 and program 3 and then their intersection is taken, as shown in figure 5.4. The set of possible LHSs corresponding to while is gathered from program 1. Using these two sets, the set of possible rules corresponding to terminal while is built; the set of possible rules, PRwhile, is shown in figure 5.51. A rule from PRwhile is selected and added to the grammar; suppose stmt → while( is added. Now the programs are again grouped according to the layout of new terminals. Since Tnew = {for}, Pfor = {Program 1, Program 2, Program 3} and P′for = {Program 1, Program 2, Program 3}. All three programs will be used for gathering the possible RHSs and LHSs corresponding to terminal for. After computing the possible rules corresponding to for (shown in figure 5.5), one of them is selected and added. After adding a rule corresponding to keyword for, the modified grammar contains a rule corresponding to each new terminal (for and while here), and it is tested for completeness. Suppose the rule stmt → for( is selected from the set of possible rules corresponding to terminal for. Since this choice of rules does not parse all the programs, we roll back the changes made to the grammar and select another rule. After some iterations and backtracking steps we find that the grammar obtained after adding the rules stmt → while ( expr ) stmt and stmt → for ( stmt list expr ) stmt parses all three programs. Therefore, these two rules are returned as a correct set of rules (i.e. the set of rules needed to make G complete). Figure 5.5 shows the different possible correct sets of rules corresponding to keywords for and while.

1. We are not showing the complete sets as obtained in the real experiments, as these were too large.


Set of possible RHSs computed from the programs:

From Program 1            From Program 2            From Program 3
while (                   while (                   while (
while ( id                while ( id                while ( id
while ( id >              while ( id >              while ( id <
while ( id > NUM          while ( id > NUM          while ( id < NUM
while ( expr )            while ( expr )            while ( expr )
while ( expr ) stmt       while ( expr ) {          while ( expr ) stmt
                          while ( expr ) stmt

Intersection:
while (
while ( id
while ( expr )
while ( expr ) stmt

Figure 5.4: Possible RHSs constructed from each program and their intersection

5.4 Discussion

An important aspect of the algorithm EXTRACT RULES is the criterion for selecting a new terminal in each iteration (line 4 of the algorithm). In each iteration, the algorithm picks a new terminal K for which the sets PK and P′K are both nonempty, or at least the set P′K is nonempty. What if there is more than one new terminal for which the sets P and P′ are nonempty, or more than one new terminal for which at least the set P′ is nonempty? In such cases, we select a new terminal (say M) for which the sizes of the relevant sets (i.e. either PM and P′M, or only P′M) are the largest, and build the set of possible rules corresponding to it. The idea behind this selection is that possible LHSs and RHSs are computed from multiple programs and then their intersection is taken; the more programs used in building the possible LHSs and RHSs, the smaller the number of possible rules obtained after the intersection. Therefore, we first build possible rules for those terminals whose corresponding groups P and P′ have the maximum number of programs.
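A sketch of this selection heuristic, assuming hypothetical helpers first_new and last_new that return the first and last new terminal occurring in a program (or None); these helper names are assumptions:

    def pick_terminal(programs, new_terminals):
        def key(K):
            p_first = sum(1 for w in programs if first_new(w, new_terminals) == K)
            p_last = sum(1 for w in programs if last_new(w, new_terminals) == K)
            # Require a nonempty P'_K, prefer a nonempty P_K, then the group sizes.
            return (p_last > 0, p_first > 0, p_first + p_last)
        return max(new_terminals, key=key)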

5.5 Proof of Correctness

This section discusses the correctness of the EXTRACT RULES algorithm. Suppose P is the set of input programs, G is the initial grammar, and Tnew = {K1, K2, . . . , K|Tnew|} is the set of new terminals.

Possible LHSs for keyword while = {stmt}
Possible rules =
(1) stmt → while(
(2) stmt → while(id
(3) stmt → while(expr)
(4) stmt → while(expr)stmt

Possible LHSs for keyword for = {stmt}
Possible rules =
(1) stmt → for(
(2) stmt → for(id =
(3) stmt → for(id = NUM
(4) stmt → for(expr stmt list)
(5) stmt → for(expr stmt list)stmt

Rule pairs which are correct =
(1) [stmt → while (expr), stmt → for ( stmt list expr)]
(2) [stmt → while (expr), stmt → for ( stmt list expr) stmt]
(3) [stmt → while (expr) stmt, stmt → for ( stmt list expr)]
(4) [stmt → while (expr) stmt, stmt → for ( stmt list expr) stmt]

Figure 5.5: Set of possible rules for keywords for and while and set of rule pairs needed to make the grammar complete

There can be many complete grammars. The algorithm extracts a complete grammar which is not more than |Tnew| rules away. It returns a set of rules {rK1, rK2, . . . , rK|Tnew|}, where rKi is the rule corresponding to the new terminal Ki. Since there exist many sets of rules that make the grammar complete, we prove that the algorithm will return one such set. We first prove that in each iteration the algorithm has enough information for building a set of possible rules corresponding to at least one new terminal. We then prove the correctness of the algorithm by showing that it will always return a set of grammar rules which makes the grammar complete.

Lemma 5.1 In each iteration, the algorithm has enough information for building a set of possible rules corresponding to at least one new terminal.

Proof: For building a set of possible rules corresponding to a new terminal (say K), at least the set P′K (the set of programs where K is the last new terminal) should be nonempty.


We prove by induction on the number of iterations n (n ≤ |Tnew|) that in each iteration there exists at least one new terminal K such that P′K is nonempty.

1. Iteration 1: If a program contains new terminals, one of them will obviously be the last new terminal. Suppose the set P′ were empty for every new terminal; this would imply that none of the programs contains a new terminal, i.e. that although G is incomplete w.r.t. P, the rules needed to complete G do not contain a new terminal. This contradicts the assumption that each missing rule in G contains a new terminal. Therefore, the set P′ will be nonempty for some new terminal in the first iteration.

2. Iteration n + 1: Suppose the algorithm builds a set of possible rules corresponding to terminal Kn in the nth iteration (n < |Tnew|). This implies that at least the set P′Kn is nonempty in the nth iteration. After adding a rule corresponding to Kn, the algorithm removes Kn from the set of new terminals. Hence, the programs in P′Kn (where Kn was the last new terminal) will now have other new terminals (the terminals which precede the last occurrence of Kn) as their last new terminal. Therefore the set P′ will be nonempty for some new terminal in the (n + 1)th iteration as well.

Theorem 5.1 The algorithm will return a set of rules which makes G complete.

Proof: We prove this by induction on the number of iterations. Suppose the set of rules {rK1, rK2, . . . , rK|Tnew|} makes G complete w.r.t. P, and the set of rules added to G till the ith iteration is {rK1, rK2, . . . , rKi}. We show that the rule rKi+1 will fall in the set of possible rules built in the (i + 1)th iteration for the new terminal Ki+1 (i.e. PRKi+1).

1. Iteration 1: A set of possible rules corresponding to keyword K1 is built in this iteration. Since the set of possible RHSs (RK1) corresponding to K1 is built, using the BUILD RHSs algorithm, from those programs where K1 is the last new terminal, the set RK1 will contain the RHS of rK1 by lemma 4.4. Similarly, the set of possible LHSs (LK1) is either constructed from those programs where K1 is the first new terminal using the algorithm GATHER POSSIBLE LHSs, or made equal to the set of nonterminals of G (i.e. N). Therefore, the LHS of rK1 will fall in the set LK1 by lemma 4.3. Hence PRK1 will contain the rule rK1. Since the algorithm tries each rule in PRK1, it will select rK1 for checking its correctness.

2. Iteration n + 1: Suppose the algorithm has selected the rules {rK1, rK2, . . . , rKn} and added them to G before the (n + 1)th iteration. Since this set of rules is a subset of a correct set of rules {rK1, rK2, . . . , rK|Tnew|} (which makes the grammar complete), the modified grammar in the (n + 1)th iteration is (|Tnew| − n) rules away from a complete grammar (where the complete grammar is (N, T ∪ {K1, . . . , K|Tnew|}, P ∪ {rK1, . . . , rK|Tnew|}, S)). Therefore, the (n + 1)th iteration can be considered as the first iteration of the algorithm in which the initial grammar is (N, T ∪ {K1, . . . , Kn}, P ∪ {rK1, . . . , rKn}, S), the set of new terminals is {Kn+1, . . . , K|Tnew|} and the set of input programs is P. While solving this problem, the algorithm in its first iteration will construct a set of possible rules, PRKn+1, for keyword Kn+1. By the argument above, the rule rKn+1 will fall in the set PRKn+1. Since the algorithm iteratively checks the correctness of each rule in PRKn+1, rKn+1 will be selected in some iteration.

5.6 Time Complexity

Each iteration of the algorithm involves building a set of possible rules and adding one of them to the grammar; this takes O(n + 2^{c√m} + n³v^{max length}) time. The number of possible rules for each keyword is bounded by O(v^{max length+1}) (v = |N ∪ T|). Since the algorithm checks the completeness of only those grammars which are not more than |Tnew| rules away from the initial grammar, the maximum number of iterations in the algorithm is equal to the number of nodes in a complete tree where the degree of each tree node is at most O(v^{max length+1}); such a tree will have O((v^{max length+1})^{|Tnew|}) = O(v^{(max length+1)×|Tnew|}) nodes. Hence, the maximum number of iterations in the algorithm will be O(v^{(max length+1)×|Tnew|}). The maximum number of possible combinations of rules for all new terminals is p^{|Tnew|} (p is the upper bound on the number of possible rules for each keyword, i.e. p = v^{max length+1}). Hence, the worst-case time taken by the algorithm is O(n + 2^{c√m} + n³v^{max length}) × O(v^{(max length+1)×|Tnew|}) + O(v^{(max length+1)×|Tnew|}) × O(n + 2^{c√m}) = O(v^{(max length+1)×|Tnew|} × (2^{c√m} + n³)).

5.7 Extracting Rules of form A → α anew β

In this section we relax the assumption that missing rules start with a new terminal. The overall approach of the algorithm EXTRACT RULES remains the same except for the steps which build the possible LHSs and RHSs; the changes made in these steps are discussed below. Suppose K is a new terminal. Since a missing rule corresponding to K can be of the form A → αKβ, the set of possible RHSs corresponding to K is built as follows: (1) A set of all possible symbol strings that can occur before K in the RHS (for A → αKβ, this is α) is computed from those programs where K is the first new terminal (i.e. the set PK); this set is denoted RLK. (2) A set of all possible symbol strings that can occur after K in the RHS (for A → αKβ, this is β) is computed from those programs where K is the last new terminal (i.e. the set P′K); this set is denoted RRK. (3) The sets RLK and RRK are built in the same fashion as discussed in the previous chapter and used for building the set of possible RHSs, RK, corresponding to K:

RK = {αKβ | α ∈ RLK, β ∈ RRK}

The set of possible LHSs (LK) corresponding to K is computed from those programs where K is the first new terminal (PK), as discussed in the previous chapter. Using the sets RK and LK, the set of possible rules corresponding to K is built as follows:

PRK = {A → α | A ∈ LK, α ∈ RK}

This extension of the algorithm works if, in each iteration, there exists at least one new terminal K such that the sets PK and P′K are both nonempty. This condition may not hold in some cases. For example, consider the input programs given in figure 5.6 and suppose that the set of new terminals is {for, while}. The approach will not work in this situation because Pwhile = {}, P′while = {Program 1, Program 2}, Pfor = {Program 1, Program 2} and P′for = {}. However, for the input programs shown in figure 5.1 it will work, because in the first iteration the sets Pwhile and P′while are nonempty (see example 5.1) and in the next iteration the sets Pfor and P′for are nonempty.

Program 1:
main() {
    int i;
    int x=100;
    for (i=0; i < 100; i++) {
        while (x > 20) {
            printf("%d %d", x, i);
            x--;
        }
    }
}

Program 2:
main() {
    int i;
    int x=0;
    for (i=100; i > 0; i--) {
        while (x < 80) {
            printf("%d %d", x, i);
            x++;
        }
    }
}

Figure 5.6: A set of input programs

5.8 Summary

In this chapter we presented an approach for extracting multiple missing rules. The approach is an extension of the approach discussed in the previous chapter: it is iterative, with backtracking. We also relaxed the assumption that the RHSs of missing rules start with a new terminal, and we proved the correctness of the algorithm. The approach has poor time complexity because of the large number of possible rules for each new terminal. To overcome this problem we propose a set of optimizations in the next chapter.

Chapter 6

Optimizations

The algorithm discussed in the earlier chapters suffers from poor performance, as the number of possible rules built by the algorithm is very large. This is due to the large size of programming language grammars and the way the approximate grammar is organized. In this chapter we discuss techniques to reduce the number of possible rules for each keyword. The first technique exploits the abundance of unit productions in programming language grammars to reduce the number of possible rules. A modified CYK parsing algorithm is then presented to optimize the rule checking process. We also propose an ordering criterion in which grammar rules should be evaluated for correctness. The proposed optimizations have appeared in [15, 16].

6.1 Utilizing Unit Productions

The number of unit productions in a programming language grammar is generally very high. Table 6.1 compares the number of unit productions with the total number of productions in some programming language grammars [75]. It is evident that a large fraction of the productions are unit productions in all the grammars. The presence of unit productions increases the number of possible rules to be checked. For example, consider the grammar and the input program given in example 4.1. Since there exist unit productions A → a and B → b, the possible symbol strings that can derive the string eab are eab, eaB, eAb and eAB. In the absence of these unit productions, the only possible symbol string built by the algorithm would be eab.


Table 6.1: Summary of unit productions in different programming language grammars

Language   No. of productions   No. of unit productions
Algol      170                  78
ADA        576                  218
COBOL      519                  193
CPP        785                  237
CSTAR      312                  169
C          227                  106
Delphi     385                  177
grail      122                  20
Java       265                  124
Matlab     92                   40
Pascal     188                  84

We exploit the abundance of unit productions in programming language grammars to reduce the number of possible rules to be checked. For reducing the number of possible rules, we add only the most general symbol strings to the set of possible RHSs. The algorithm BUILD STRINGS is modified as follows. Suppose cells containing {X1, X2} and {Y1, Y2} are used for building symbol strings, where X1 ⇒⋆ X2 and Y1 ⇒⋆ Y2. Rather than adding all the symbol strings built from these cells, i.e. X1Y1, X1Y2, X2Y1 and X2Y2, we add only X1Y1 to the set of possible RHSs. A rule whose RHS is X1Y1, i.e. A → X1Y1 (for some A ∈ L), is sufficient for checking the incorrectness of the rules whose RHSs are X1Y2, X2Y1 or X2Y2 (i.e. A → X1Y2, A → X2Y1, A → X2Y2), because A → X1Y1 is more general than the other rules. Hence, the number of possible RHSs to be checked can be reduced without compromising the correctness of the approach. This optimization significantly reduces the number of possible rules. Consider the grammar and the input program given in example 4.1. The set of possible RHSs without the unit production optimization is {eab, eAb, eaB, eAB, eA, eC}, and with the unit production optimization it is {eAB, eA, eC}. Here the number of possible RHSs reduces by half (from six to three), which also reduces the number of possible rules by half.

Note: Since we use an LR parser for checking the completeness of grammars, the above optimization does not guarantee returning a complete grammar, because in a few cases none of the rules in the reduced set of possible rules will be both correct and LR preserving. A rule is LR preserving if the grammar remains LR after including that rule. For example, consider the set of all correct rules shown in figure 4.5, obtained from the input program and the grammar given in example 4.1; the rule B → eAB is a correct rule and falls in the set of most general rules, but it is a non-LR rule. However, while experimenting with real programming language grammars we always found a correct as well as LR-preserving rule in the reduced set of possible rules. These experiments are discussed in the next chapter. Since the unit production optimization does not work in some cases, one can alternatively use a CYK parser for checking the completeness of the grammar. Correct rules obtained by this method can later be re-factored to obtain an LR rule using the ad-hoc techniques discussed in [78]. Checking the completeness of grammars with a CYK parser is expensive because it is an O(n³) algorithm; we propose an optimization in the next section which improves the grammar checking process.
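One way to realize the most-general filter is to precompute, for every symbol, the set of symbols it derives through chains of unit productions, and then drop from each cell any symbol derivable from another symbol in the same cell. A sketch under assumptions: the grammar is a list of (lhs, rhs-tuple) pairs, and cyclic unit chains are not treated specially.

    def unit_closure(rules):
        """derives[X] = set of symbols reachable from X via unit productions (incl. X)."""
        derives = {}
        def d(sym):
            return derives.setdefault(sym, {sym})
        changed = True
        while changed:                    # transitive closure over unit rules
            changed = False
            for lhs, rhs in rules:
                if len(rhs) == 1:         # unit production lhs -> rhs[0]
                    before = len(d(lhs))
                    d(lhs).update(d(rhs[0]))
                    changed = changed or len(d(lhs)) != before
        return derives

    def most_general(cell, derives):
        """Keep X only if no other symbol Y in the cell derives X."""
        return {X for X in cell
                if not any(Y != X and X in derives.get(Y, {Y}) for Y in cell)}

With the unit productions A → a and B → b of example 4.1, a cell containing {a, A} keeps only A, which is how the six RHSs above collapse into {eAB, eA, eC}.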

6.2 Optimization in Rule Checking Process

In this section we propose a modification of the CYK parser which reduces the number of invocations of the rule checking phase by checking the correctness of a group of rules in one invocation. For the sake of clarity, we discuss here the case of a single missing rule. Unlike the previous approach, where the correctness of each possible rule is checked individually1, this optimization checks the correctness of a set of rules. The CYK parsing algorithm is modified so that, for each α ∈ R, it can check the correctness of all rules B → α (for all B ∈ L) in one parse. This reduces the number of invocations of the rule checking phase from |L| to 1. The correctness of an RHS α w.r.t. a set of possible LHSs L is checked as follows. First, the set of rules B → α (∀B ∈ L) is added to the approximate grammar. The input program is then parsed with the modified grammar using the CYK parser. Suppose the input program and the approximate grammar given in example 4.1 are used, and the correctness of the RHS eab w.r.t. the possible LHSs set {A, B, C} is being checked. The rules A → eab, B → eab and C → eab are added to the grammar and the input program aabeabaeabbeab is parsed with the modified grammar. The parse tree is shown in figure 6.1; it is slightly different from a normal parse tree. The root of each subtree contains a set of pairs; the first part of a pair is a nonterminal which derives the substring covered by the subtree and the second part is a set of nonterminals.

1. By a correct rule (corresponding to a new terminal) we mean a rule which makes the grammar complete (assuming there exists a complete grammar one rule away from the initial grammar).

For example, node 6 in figure 6.1 contains the pairs (A, {B}) and (C, {B}); the first parts, i.e. A and C, derive the substring a7,10 = aeab, and the second part, i.e. {B}, is called a set of unfiltered nonterminals. Dashed edges are not part of the parse tree. The program is parsed with the modified CYK parser, which filters out incorrect rules (among the newly added rules) while parsing the program. The idea behind filtering incorrect rules (whose RHS is α) is the following: initially each rule with RHS α (i.e. B → α, for all B ∈ L) is considered a correct rule, and the parse tree is built bottom-up with the CYK parser. The parser filters out those rules2 which are not used in building subtrees of larger substrings. To support this operation, a set of nonterminals (called unfiltered nonterminals) is associated with each nonterminal of each CYK cell; this set is shown as the second part of each pair in figure 6.1. For example, node 6 contains a pair (A, {B}), where {B} is the set of unfiltered nonterminals. The set of unfiltered nonterminals associated with a nonterminal A in the cell C[i, j] is denoted UFA(i, j). B ∈ UFA(i, j) implies that the rule B → α is used in the derivation A ⇒⋆ ai,j; that is, A ⇒⋆ δBγ ⇒ δαγ ⇒⋆ ai,j, where the middle step uses B → α. For example, the substring a7,10 = aeab in figure 6.1 is derived by A, and the rule B → eab is used in this derivation (A ⇒ AB ⇒ aB ⇒ aeab, using A → AB, A → a and B → eab); hence UFA(7, 10) = {B}. This set is associated with nonterminal A at node 6 in figure 6.1. The UF sets are maintained during parsing as follows. First, for all A ∈ C[i, i] (1 ≤ i ≤ n), UFA(i, i) is initialized to the empty set; then the program is parsed with the CYK parser. Since the modified grammar has the rules B → α (∀B ∈ L), whenever a substring derived by α is encountered (say ai,j), all nonterminals of L are added to C[i, j]. Since initially each new rule B → α (B ∈ L) is considered correct, the set UFB(i, j) (∀B ∈ C[i, j] ∩ L) is assigned {B}. For example, the roots of node 3 in figure 6.1 are A, B and C, which derive the substring eab, and their UF sets are UFA(4, 6) = {A}, UFB(4, 6) = {B} and UFC(4, 6) = {C} respectively. Suppose the production A → X Y is used while building an entry for the cell C[p, q], where X ∈ C[p, k] and Y ∈ C[k + 1, q]. The set UFA(p, q) is updated with the following rules:

1. UFA(p, q) = UFX(p, k) ∪ UFY(k + 1, q) if at least one of the sets UFX(p, k) and UFY(k + 1, q) is empty. For example, while building the cell entry C[2, 3] (node 2 in figure 6.1) the production A → A B is used and the sets UFA(2, 2) and UFB(3, 3) are both empty; hence UFA(2, 3) is empty. The production A → A B is also used at node 6, where UFA(7, 7) is empty but UFB(8, 10) is nonempty; hence UFA(7, 10) = {B}.

2. UFA(p, q) = UFX(p, k) ∩ UFY(k + 1, q) if both UFX(p, k) and UFY(k + 1, q) are nonempty. If UFA(p, q) comes out to be empty from this computation, then the nonterminal A is dropped from the cell C[p, q]. For example, the production A → A B is used for building the cell entry C[7, 14] (node 4 in figure 6.1) and the sets UFA(7, 11) and UFB(12, 14) are both nonempty; therefore UFA(7, 14) = UFA(7, 11) ∩ UFB(12, 14) = {B}. The UF set at node 14 (UFA(8, 14)) is empty because UFA(8, 14) = UFA(8, 11) ∩ UFB(12, 14) = {A} ∩ {B} = {}; hence the nonterminal A is dropped from the cell C[8, 14].

The LHSs of the correct rules (among the newly added rules B → α, ∀B ∈ L) get added to the UF sets of the nonterminals of subtrees of larger substrings and climb upward in the parse tree (shown by dashed arrows in figure 6.1). The nonterminals which end up in the set UFS(1, n) (in figure 6.1 it is {B}) are the LHSs of correct rules for the given RHS; incorrect nonterminals are filtered out during parsing. If the set UFS(1, n) is empty after parsing, then the RHS is incorrect, i.e. for none of the nonterminals B ∈ L is the rule B → α a correct rule. In figure 6.1, B is the correct LHS for eab, while A and C are not correct because they are filtered out.

2. That is, among the rules added: B → α, ∀B ∈ L.
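The two update rules translate directly into code. A sketch, assuming each CYK cell is a dict mapping a nonterminal to its UF set (the data layout and function names are assumptions):

    def combine_uf(uf_x, uf_y):
        """UF set for A when A -> X Y is applied; None means A must be dropped."""
        if not uf_x or not uf_y:
            return uf_x | uf_y        # rule 1: union if at least one side is empty
        uf = uf_x & uf_y              # rule 2: intersection if both are nonempty
        return uf if uf else None     # empty intersection: drop the nonterminal

    def apply_production(A, X, Y, cell_left, cell_right, cell_out):
        if X in cell_left and Y in cell_right:
            uf = combine_uf(cell_left[X], cell_right[Y])
            if uf is not None:
                cell_out.setdefault(A, set()).update(uf)

After the whole table is filled, the UF set attached to S in the cell covering a1,n gives the correct LHSs for the RHS being tested.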

6.3 Rule Evaluation Order

The goal of the grammar extraction problem is to extract a complete grammar; hence the order in which the rules are checked for correctness (line 10 of the algorithm EXTRACT RULES in figure 5.3) is important for good performance, as the number of possible rules built by the algorithm is very high. For example, while experimenting with the C, Java, Matlab and Cobol grammars we found that the number of possible rules corresponding to different C keywords was of the order of 10^5 to 10^6. Even after reducing the number of possible rules by the unit production optimization, it was of the order of 400. Therefore, an incorrect choice of a grammar rule may severely hamper the performance of the algorithm, because the algorithm has to backtrack and select another rule.


Figure 6.1: Example of LHSs filtering. [Figure: CYK-style parse of the 14-token input a a b e a b a e a b b e a b, with each node labelled by (nonterminal, UF set) pairs, e.g. node 3 carries (A,{A}) (B,{B}) (C,{C}) and the root carries (S,{B}). Dashed arrows show correct LHSs climbing upward; a node whose UF sets mismatch is dropped, and the surviving set {B} at the root identifies B as the correct LHS for eab.]

In this section we investigate a rule evaluation order (i.e., the order in which a rule is selected from the set of possible rules and added to the grammar) which is closely based on the principle of minimum description length (MDL). The MDL principle is a well established concept in the machine learning literature; it says that the best hypothesis for a given set of data is the one which represents the data most compactly. In our problem scenario we check the completeness of those grammars first which compactly represent P. Langley et al. [21, 35] have studied the effect of the MDL based approach on the grammar inference process, but their results are based on artificial subsets of English grammar (i.e. in the NLP domain). Their approach starts with a set of very specific grammars and uses the MDL criterion to generalize them; hence they explore for the best grammar within the space of complete grammars, whereas our focus here is to arrive at a complete grammar.


6.3.1 A Criterion for Rule Evaluation Order

We associate a weight with each rule which represents the rule's ability to compactly represent the input programs. A rule is compact if it covers/derives large substrings in each input program and has a small number of symbols on its RHS.

Definition 6.1 The weight of a rule A → β w.r.t. a set of programs P is:

    weight_P(A → β) = coverage_P(A → β) / |β|                                      (6.1)

where coverage_P(A → β) is the coverage of the rule w.r.t. P. The coverage depends on the largest substring A → β covers in each program of P.

Definition 6.2 The coverage of a rule w.r.t. P is:

    coverage_P(A → β) = Σ_{w ∈ P} coverage_w(A → β)                                (6.2)

where

    coverage_w(A → β) = (1/|w|) max{ k − i + 1 | β ⇒* w_{i,k}, 1 ≤ i ≤ k ≤ |w| }   (6.3)

w_{i,j} denotes the substring of program w which starts at token i and ends at token j, and |w| is the total number of tokens in w. The weight of a rule is used for ordering the rules while evaluating their correctness. Our hypothesis is that rules with higher weights are generally correct rules; hence we evaluate the correctness of the rules in non-increasing order of weight to improve the grammar extraction process. This rule ordering criterion is closely based on the principle of minimum description length (MDL). The MDL principle says that the best hypothesis for a given set of data is the one which describes the data most compactly; a hypothesis represents the data compactly if the size of the hypothesis and the description of the data under that hypothesis are both small. Rules with higher weights are those which have a smaller number of symbols and derive larger substrings in each program; therefore the criterion closely (though not exactly) follows the MDL principle. Since we do not consider the representation of whole programs using the grammar, the weight does not exactly follow the MDL principle.

Example 6.1: Suppose the program shown in figure 6.2(a) is given as input and the initial grammar is an incomplete ANSI C grammar in which a rule corresponding to the keyword while is missing, i.e. T_new = {while}. The set of possible rules corresponding to while is shown in figure 6.2(b). Consider the possible rule statement → while ( expression ) shown in figure 6.2(b). The substrings derived by this rule are while (x > 500) and while (y > 200). The total number of tokens in the program is 62; hence the coverage of the rule statement → while ( expression ) is 6/62 ≈ 0.09. The coverage and weight of some possible rules, taken from figure 6.2(b), are shown in table 6.2. The rules statement → while ( expression ) statement and statement → while ( conditional expression ) statement have the highest weights; hence their correctness is checked first. Since both rules are correct, the grammar extraction process will return one of these rules.
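To make definitions 6.1-6.3 concrete, the following is a minimal Java sketch that computes the weight of a candidate rule. The helper derives is hypothetical and stands for a lookup into the CYK table already built for w (checking β ⇒* w_{i,k}); a real implementation would read this off the CYK cells instead of re-parsing.

    import java.util.*;

    class RuleWeight {
        // Weight of a rule A -> beta w.r.t. programs P (equations 6.1-6.3).
        static double weight(List<String[]> programs, String[] beta) {
            double coverage = 0.0;
            for (String[] w : programs) {
                int longest = 0;                         // longest w[i..k] derived by beta
                for (int i = 0; i < w.length; i++)
                    for (int k = i; k < w.length; k++)
                        if (derives(beta, w, i, k))
                            longest = Math.max(longest, k - i + 1);
                coverage += (double) longest / w.length; // coverage_w, eq. (6.3)
            }
            return coverage / beta.length;               // coverage / |beta|, eq. (6.1)
        }

        // Hypothetical helper: stands for a lookup in the already-built CYK table.
        static boolean derives(String[] beta, String[] w, int i, int k) {
            throw new UnsupportedOperationException("look up beta =>* w[i..k] in the CYK table");
        }
    }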

    main() {
        int x, y, z, i;
        x=1000; y=90; z=800;
        while(x>500)            /* error point */
            x--;
        for(i=0; i < 50; i++) {
            while(y > 200)      /* last occurrence of keyword */
                y=y/2;
        }
    }

(a) Input Program

Possible rules built:
(1) statement → while (
(2) statement → while ( id
(3) statement → while ( id >
(4) statement → while ( id > NUM
(5) statement → while ( id > NUM )
(6) statement → while ( expression )
(7) statement → while ( conditional expression )
(8) statement → while ( expression ) statement
(9) statement → while ( conditional expression ) statement
(10) statement → while ( expression ) statement }
(11) statement → while ( conditional expression ) statement }
(12) statement → while ( expression ) statement } }
(13) statement → while ( conditional expression ) statement } }

Correct rules:
(1) statement → while ( id > NUM )
(2) statement → while ( expression )
(3) statement → while ( conditional expression )
(4) statement → while ( expression ) statement
(5) statement → while ( conditional expression ) statement

(b) Set of possible rules and correct rules

Figure 6.2: Input program and set of possible rules corresponding to terminal while


Table 6.2 Weight and coverage of possible rules given in figure 6.2

Possible Rule                                              Coverage   Weight
statement → while (                                        0.03       0.01
statement → while ( id                                     0.05       0.01
statement → while ( id >                                   0.06       0.01
statement → while ( id > NUM                               0.08       0.01
statement → while ( id > NUM )                             0.09       0.01
statement → while ( expression )                           0.09       0.02
statement → while ( conditional expression )               0.09       0.02
statement → while ( expression ) statement                 0.19       0.038
statement → while ( conditional expression ) statement     0.19       0.038
statement → while ( expression ) statement }               0.21       0.035

6.3.2 Experiments

To evaluate our hypothesis that rules with higher weights are usually correct rules, a set of experiments was performed on different programming language grammars, viz. C, Java, Matlab and Cobol. We removed different rules corresponding to different keywords from their grammars and built a set of possible rules from input programs using the technique proposed in the previous chapter. The possible rules were checked for correctness in order of decreasing weight, and we recorded how many times the highest weight rule was the correct rule. The details of the experiments and results are discussed in chapter 7. The experiments show that out of 355 test runs, 67.89% of the time the highest weight rule is the correct rule and 28.17% of the time the second highest weight rule is correct. Only in 3.94% of the cases are rules with weight lower than the second highest weight correct.

6.3.3 Discussion

We observe from the experiments that the proposed weight criterion helps in directing the search for correct rules. Only in 3.94% of the cases does the extraction process have to check rules whose weights are lower than the second highest weight. Therefore, using the proposed rule evaluation order we can extract a complete grammar quickly. Although the proposed weight metric ensures getting a complete grammar quickly, it says nothing about the goodness of the extracted grammar (or grammar rules). For example, the rules statement → while (expression) statement and statement → while (conditional expression) statement in figure 6.2 both have the same weight and are correct; therefore we face the problem of selecting a good rule from these two rules.

6.4 Summary

The algorithms discussed in earlier chapters result in a large number of possible rules to be checked; the number is high due to the large size of programming language grammars and their structure. In this chapter we have discussed a set of optimizations to improve the performance of the grammar extraction process. We have proposed an optimization which exploits the abundance of unit productions in programming language grammars to reduce the number of possible rules to be checked. A modified CYK parsing algorithm, used in optimizing the rule checking phase, is also presented. Lastly, we investigated a rule evaluation order which assigns a weight to each rule; rules are added to the grammar in non-increasing order of weight in order to arrive at a correct grammar quickly (based on our hypothesis that rules with higher weights are usually correct rules). The results of experiments studying the impact of these optimizations are discussed in the next chapter.

Chapter 7

Implementation and Experiments

In this chapter we briefly describe the implementation of our rule extraction approach and the experiments performed with it. The main goal of the implementation is to study and verify the proposed approach and optimizations on real programming language grammars. We have built a prototype in Java on the Linux platform. We also present and discuss the results of the experiments we conducted. The experiments were performed on four programming languages, viz. Java, C, Matlab and Cobol. The goals of the experiments are as follows:

1. Verifying the basic approach.

2. Studying the performance of different optimizations.

3. Studying the performance of the LR parser and the CYK parser in the rule checking phase.

4. Verifying the approach for multiple missing rules.

In the first section we give an overview of the implementation, and in the second section we discuss its details. Later sections cover the experiments.

7.1 Implementation

Since the goal of the implementation is to verify our approach and study the effect of different optimizations, the implementation is done in such a way that different optimizations can be chosen optionally.


For example, one can choose to build a set of all possible RHSs or a set of only the most general RHSs. In this section we briefly describe our implementation.

7.1.1 Overview

The schematic diagram of the overall grammar extraction process is shown in figure 7.1. Arrows show the flow of data among the different blocks; the dotted box is the optimization box which orders the possible rules based on their weights. The inputs to the system are a grammar written in YACC readable format and a set of input programs. New terminals are tagged as %newkeyword in the grammar specification. The grammar and the set of programs are fed to the group programs module (figure 7.1), where programs are grouped according to the layout of new terminals. After grouping, a group of programs is fed to the LHSs generator module and the RHSs generator module, as discussed in earlier chapters. The main component of the LHSs generator module is a modified LR parser generator. Since our approach uses an additional operation called "forced reduction" which is not in conventional LR parsers, we use a modified LR(1) parser generator which generates an LR(1) parser with support for forced reduce operations. The LR parser generated from the grammar is used to gather possible LHSs. The RHSs generator module consists of a modified CYK parser. The modified CYK parser works in several modes which are used in different optimizations; the details of these modes are discussed later in the chapter. First the input program is parsed with the CYK parser and an incomplete CYK table is generated, which is used in generating the set of possible RHSs. The rule building module takes a set of possible LHSs and a set of possible RHSs as input and builds a set of possible rules. One of the rules is then added to the grammar. Rules may be selected randomly or based on their weights; therefore rules may pass through the optional optimizing module which orders them by weight. The above three modules, LHSs generator, RHSs generator and rule addition, are executed repeatedly. After a fixed number of iterations, the modified grammar is checked for completeness in the grammar checking module (shown as the check grammar module in figure 7.1). The grammar checking module consists of an LR parser which is generated from the modified grammar.
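The control flow just described can be summarized as a backtracking search. The sketch below is illustrative only: Grammar, Rule, Program, isComplete and g.with(r) are stand-ins for the corresponding modules of figure 7.1, not the names of the prototype's actual classes.

    // Illustrative backtracking search: one candidate rule per new terminal is
    // added to G, then the modified grammar is checked for completeness; on
    // failure the search backtracks and tries the next candidate.
    static Grammar search(Grammar g, List<Program> progs,
                          List<List<Rule>> candidates, int idx) {
        if (idx == candidates.size())                 // all |Tnew| rules added
            return isComplete(g, progs) ? g : null;   // check grammar module
        for (Rule r : candidates.get(idx)) {          // optionally weight-ordered
            Grammar extended = g.with(r);             // add rule in G
            Grammar result = search(extended, progs, candidates, idx + 1);
            if (result != null) return result;        // complete grammar found
        }
        return null;                                  // backtrack
    }

The candidate lists are the rules built by combining the possible LHSs and RHSs produced by the generator modules for each new terminal.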


Figure 7.1: The schematic diagram of the grammar extraction module. [Figure: the grammar G feeds an LR parser generator whose LR parser produces the possible LHSs; the grouped programs P feed a CYK parser which produces the possible RHSs; the rules building module combines both sets, optionally ordered by weight, and a rule is added to G; after |Tnew| iterations the modified G is checked against P by the check grammar module, G is output if complete, and otherwise the process backtracks.]

It optionally contains a modified CYK parser; the LHSs filtering optimization, discussed earlier, can be chosen to optimize the rule checking process through the modified CYK parser. If the grammar is complete w.r.t. the input programs, it is returned to the user; otherwise the process backtracks and another rule is selected. Rules can optionally be ordered by weight before being queued for the grammar checking module. Since the weight of a rule depends only upon its RHS, this ordering can be coupled with the rule checking optimization where the correctness of RHSs is checked by the LHSs filtering operation.


Table 7.1 A summary of different modules

Module: LR-parser generator
  Description: Generates a modified LR parser which has extra functionality for forced reductions; for this it uses a forest structured stack (FSS).
  Original source code (LOC): CUP parser generator, which generates an LALR parser: 6058
  Size of added code (LOC): modified LR-parser = 800, FSS = 435, code for building LR itemsets and tables = 7865

Module: CYK-parser
  Description: Parses programs and provides extra functionality for building symbol strings, building most general symbol strings and checking the correctness of rules.
  Original source code (LOC): NA
  Size of added code (LOC): 4065

Module: Preprocessor, driver and other supporting functions
  Description: The driver provides an interface for generating the LR parser, opening and closing files, and parsing grammar specifications; the preprocessor supports ordering and grouping of programs according to the layout of new terminals in them.
  Original source code (LOC): NA
  Size of added code (LOC): 2148

Module: Grammar graph
  Description: Creates a graph representation of CFGs [42] which is used in finding the most general nonterminals, unit chains (sequences of unit productions), etc. in the CFG.
  Original source code (LOC): NA
  Size of added code (LOC): 3012

Table 7.1 gives the main modules, the major functions of each module, and their sizes. The key modules are the LR(1) parser generator and the CYK parser; these are discussed below. For tokenizing we used JLex [8].

7.1.2 Modified LR(1) parser generator

Since our approach uses an LR(1) parser, an LR(1) parser generator was implemented. Commonly available parser generators are LALR parser generators, and they do not provide sufficient flexibility for accessing the LR itemsets corresponding to different states while parsing; our implementation was done to meet these needs. In order to reduce the implementation effort we reused some parts of the CUP parser generator [26]. CUP is an LALR parser generator written in Java at Princeton University and currently maintained at the Technical University of Munich [81]. Unlike CUP, which has its own input format, our parser generator takes input in YACC readable format. We implemented a module for generating the LR state machine and changed the emitter code.


The emitter is the module of a parser generator which generates the source code of the parser. The emitter code is changed so that it generates parser code supporting the "forced reduction" operation in addition to the usual shift and reduce operations. During forced reductions the algorithm maintains multiple stacks. For maintaining multiple stacks with low space overhead, the parser in our implementation uses a special data structure called a graph structured stack (GSS) in place of the simple parser stack. The GSS was proposed by Tomita and was first used in the generalized LR parser [64].

Graph Structured Stack: A graph structured stack (GSS) [65] is a data structure which represents multiple stacks compactly. It is used in the generalized LR (GLR) parser for parsing general context free languages. The GLR parser uses a parse table similar to that of an LR parser; the only difference is that a parse table entry in the GLR parser can contain multiple shift/reduce operations. Multiple operations are performed on separate stacks, and the multiple stacks are represented compactly using the GSS. There are three key operations to support this: splitting, combining and local ambiguity packing.

1. Splitting: When there are multiple reduce entries in the parse table, the top of the stack is split. Consider the parser configuration

    (s_0 X_1 s_1 X_2 s_2 X_3 s_3, a_i a_{i+1} ...)

The stack corresponding to this configuration is shown in figure 7.2; we have clubbed together a state and its preceding symbol in one box. The rightmost box represents the top of the stack and the leftmost the bottom; state s_3 is the top of the stack.

Figure 7.2: GSS before reduction. [Figure: a single stack drawn as linked boxes s_0, (X_1, s_1), (X_2, s_2), (X_3, s_3), with (X_3, s_3) on top.]

Suppose there are three reduce entries in the action table, i.e. action[s_3, a_i] = {reduce A → X_2 X_3, reduce B → X_3, reduce C → X_3}.


The GSS after the forced reduction is shown in figure 7.3. The figure shows a compact representation of three stacks in which the common part, i.e. (s_0 X_1 s_1), is shared. There are now three top states, shown as bold boxes. Since there are three stack tops, the parser will consult all three entries, i.e. action[s_a, a_i], action[s_b, a_i] and action[s_c, a_i], in the next step.

Figure 7.3: GSS after reduction. [Figure: the common tail s_0, (X_1, s_1) is shared; (X_1, s_1) links to the new top (A, s_a), while (X_2, s_2) links to the new tops (B, s_b) and (C, s_c); the three tops are drawn bold.]

2. Combining: When the same state has to be shifted onto more than one stack, it is done only once by combining the tops of the stacks. For example, consider the GSS of figure 7.3 and suppose action[s_a, a_i] = action[s_b, a_i] = action[s_c, a_i] = shift s_i; the GSS after the combining operation is shown in figure 7.4. State s_i is the top of the stack.

Figure 7.4: GSS after shifting common state. [Figure: the three former tops (A, s_a), (B, s_b) and (C, s_c) all link to a single shared new top (a_i, s_i).]

3. Local ambiguity packing: If two or more branches turn out to be identical after an operation, they represent a local ambiguity and are merged into a single branch.


For example, consider the GSS of figure 7.4; in this state the parser will consult action[s_i, a_{i+1}] in the next step. Suppose action[s_i, a_{i+1}] = {reduce D → C a_i, reduce D → X_2 B a_i, reduce D → A a_i}. The GSS after applying the above three reductions is shown in figure 7.5(a). Since all three branches are the same (i.e., X_1 s_1 D s_d), they are merged into one branch and represented as shown in figure 7.5(b).

Figure 7.5: GSS before and after local ambiguity packing. [Figure: (a) after the three reductions, three identical branches (X_1, s_1) to (D, s_d) hang off s_0; (b) after packing, they are merged into a single (D, s_d) branch.]

We have not implemented all the functionalities of the GSS described above; we have implemented the splitting operation, which is used for implementing the forced reductions (in a forced reduction we have to handle multiple reduce operations on separate stacks).
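For illustration, a GSS can be represented as nodes carrying a (symbol, state) pair and a list of predecessor links, so that splitting only creates new top nodes over a shared tail. The sketch below shows this representation and the splitting operation in the single-path case used for forced reductions; the Reduction class and all names are illustrative, not the prototype's code.

    import java.util.*;

    // A GSS node: one (symbol, state) box of figures 7.2-7.5; preds are the
    // links toward the bottom of the stack, shared between stacks.
    class GssNode {
        final String symbol;
        final int state;
        final List<GssNode> preds = new ArrayList<>();
        GssNode(String symbol, int state) { this.symbol = symbol; this.state = state; }
    }

    // Illustrative description of one reduction: A -> X1 ... Xk plus goto state.
    class Reduction {
        final String lhs; final int rhsLength; final int gotoState;
        Reduction(String lhs, int rhsLength, int gotoState) {
            this.lhs = lhs; this.rhsLength = rhsLength; this.gotoState = gotoState;
        }
    }

    class Gss {
        // Splitting: applying several reductions to one top node yields several
        // new top nodes which all share the unreduced tail of the stack.
        // Single-path case only (each node has exactly one predecessor here).
        static List<GssNode> split(GssNode top, List<Reduction> reductions) {
            List<GssNode> newTops = new ArrayList<>();
            for (Reduction r : reductions) {
                GssNode base = top;
                for (int i = 0; i < r.rhsLength - 1; i++)
                    base = base.preds.get(0);         // walk down |rhs|-1 boxes
                GssNode nt = new GssNode(r.lhs, r.gotoState);
                nt.preds.add(base.preds.get(0));      // pop last box, share the tail
                newTops.add(nt);
            }
            return newTops;
        }
    }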

7.1.3 CYK parser

The CYK parser is used in our implementation to support two operations: (1) building a set of possible RHSs and (2) checking the completeness of the grammar. Both operations are implemented as extra functions provided by the CYK parser module. Since our approach uses various optimizations in the RHSs building process, the RHSs building phase supports the following operations: (1) building all possible RHSs and (2) building only the most general RHSs. The CYK parser can check the completeness of the grammar in two ways: (1) by parsing the programs with the simple CYK parser, and (2) by parsing the programs with the LHSs filtering optimization enabled.


Our implementation provides switches so that the user can choose the desired optimization. In the CYK parser, the grammar is first converted to Chomsky normal form (CNF). CNF is a form of grammar in which productions are either of the form A → BC or of the form A → a, where A, B, C are nonterminals and a is a terminal. Our CYK implementation works with a slightly modified form of CNF: a grammar in which productions are either of the form A → X_1 X_2 or of the form A → X_3, where X_1, X_2, X_3 are grammar symbols and A is a nonterminal. In our implementation, unit productions are not removed; the main reason for not removing them is to preserve the structure of the grammar. For converting a grammar to the desired form, we use a modified version of the standard algorithm for converting a grammar to CNF [24]. In this conversion, many temporary nonterminals are added to the grammar. While building possible RHSs we make sure that the possible RHSs do not contain any temporary nonterminals (as temporary nonterminals do not belong to the original grammar).
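The conversion itself is a standard binarization: every production with more than two RHS symbols is chained through fresh temporary nonterminals, while unit and binary productions are kept as they are. A minimal sketch, with productions encoded as string arrays (p[0] is the LHS), is shown below; it illustrates the transformation and is not the prototype's converter.

    import java.util.*;

    class Binarizer {
        // Binarize A -> X1 X2 ... Xk (k > 2) into A -> X1 _T0, _T0 -> X2 _T1, ...
        // Unit (A -> X) and binary (A -> X1 X2) productions are kept unchanged.
        static List<String[]> binarize(List<String[]> prods) {
            List<String[]> out = new ArrayList<>();
            int fresh = 0;
            for (String[] p : prods) {
                if (p.length <= 3) { out.add(p); continue; }   // |RHS| <= 2: keep
                String lhs = p[0];
                for (int i = 1; i < p.length - 2; i++) {
                    String tmp = "_T" + (fresh++);             // temporary nonterminal
                    out.add(new String[]{lhs, p[i], tmp});
                    lhs = tmp;
                }
                out.add(new String[]{lhs, p[p.length - 2], p[p.length - 1]});
            }
            return out;
        }
    }

In this sketch the reserved prefix lets the RHSs building phase recognize temporaries and expand them away, since they are not part of the original grammar.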

7.2 Experiments

We performed experiments on four programming languages, viz. Java, C, Matlab and Cobol. The grammars of the languages were obtained from [75]. Since the grammars were not truly LR, we used precedence and associativity (the most commonly used method for resolving conflicts in LR parsers [2]) to remove the non-LR-ness of the grammars. The parser generator generates an LR parser only if there are fewer than fifteen conflicts; therefore, in further discussions we treat a grammar as LR if there are fewer than fifteen shift-reduce conflicts. Experiments were conducted on a machine with an Intel Pentium(R) 4, 2.4 GHz processor and 512 MB RAM. For the various experiments we removed rules corresponding to different keywords and operators. To validate the approach for extracting a single missing rule, we first removed rules corresponding to a single keyword; most of the experiments are done in this setting, but there are also a few experiments on multiple rule extraction. We also performed experiments on cases where the missing rule was of the form A → α a_new β (i.e., the new terminal occurs at an arbitrary position in the RHS of the rule). Since such rules are mostly used for representing expressions (e.g., expressions involving additive and multiplicative operators, etc.), we removed rules corresponding to different expressions for these experiments.


Table 7.2 Summary of experiments

Language  Constructs                                                          # Single missing    # Multiple missing   Size of     Size of
                                                                              rule experiments    rules experiments    programs    grammar
C         for, while, switch, case, break                                     25                  5                    100         465
Java      switch, case, try, enum, while, for, if                             160                 8                    75          568
Matlab    for, case, switch, otherwise, while, &&, /., *., .^, /, ==, --, *   122                 3                    40          239
Cobol     move, display, read, perform                                        21                  1                    64          674

The only parameter in our approach is the maximum length of the RHSs. After a study of different programming language grammars we found that the average RHS length of productions in most PL grammars was close to six; hence we set this parameter to six in our experiments. In all the experiments we found that there were at least a few correct rules of length less than or equal to six, i.e. the software never failed due to this parameter value. Since there is no test suite available for checking the performance of a grammatical inference approach in the programming language domain, we downloaded programs and grammars from different sites to make a small test suite (the suite can be obtained from http://www.cse.iitk.ac.in/users/alpanad/grammars/). A summary of all the experiments is given in table 7.2. The size of a grammar is expressed as the sum of the lengths of the RHSs of all the productions in the grammar.

7.2.1 Extracting Single Rule without Optimization

In this section we discuss experiments done to validate our approach on different grammars. In each experiment we removed a rule corresponding to one construct and fed the grammar to our software; the software returns a rule which makes the grammar complete w.r.t. the given set of programs. A correct rule returned by the software need not be the same as the removed rule, as there can be many possible correct rules. Some of the experiments in which only a single program was used for extracting correct rules are shown in table 7.3; the table also gives examples of the correct rules returned by the software.


Numbers written inside parentheses in the last column of the table show the number of rules the system checked before arriving at a correct rule. We observe that in some cases the time taken by the system is only a few seconds, while in others it is several hours. The time spent by the grammar extraction process involves (1) time in generating the possible rules and (2) time in checking the possible rules. Since the time spent in the first part is more or less constant for a given grammar and input program size, the overall time of these experiments depends mostly on how early the approach arrives at a correct rule (as the number of possible rules is very large). For example, in the case of the while construct in the Java grammar, the time taken by the system is 102263 seconds (i.e. around 26 hours) because the system checked 5922 rules before arriving at a correct rule (this time could be reduced by using a faster LR parser generator; the one used by us is not as efficient as yacc or bison), whereas in the case of the enum construct of the Java grammar the time taken was 83.7 seconds because the very first rule checked by the system was correct. As mentioned earlier, with multiple programs we can reduce the number of possible rules to be checked, which can otherwise be very large (table 7.3). Therefore, we discuss some of the experiments which use multiple programs for extracting a single missing rule. We removed keyword based rules from different PL grammars, built a set of possible RHSs and LHSs from each program given as input, and then took their intersection to get a reduced set of possible RHSs and LHSs. We study the reduction achieved in the number of possible rules by this optimization. Table 7.4 demonstrates the results. Each row represents the language and the construct removed. The number of possible rules obtained from each program is shown in the third column, separated by commas; the second column shows the size of the intersection of the possible rules obtained from the programs. We can observe that in many cases the size of the intersection is 10-100 times smaller than in the unoptimized version. In some experiments it reduces drastically; for example, in the case of the Java grammar while construct the numbers of possible rules obtained from the two programs are 1.85 × 10^7 and 2.50 × 10^7, whereas the intersection is 2.93 × 10^4.


Table 7.3 Summary of the unoptimized approach on single rule extraction (sizes in LOC; execution times in seconds, with the number of rules checked before arriving at a correct rule in parentheses)

C / break: program size 103, 6.3 × 10^5 possible rules; rule returned: labeled stmt → BREAK stmt list; time 1190.7 (79)
C / case: program size 103, 7.3 × 10^6 possible rules; rule returned: if stmt → CASE cond expr COLON stmt list; time 418.1 (4)
C / for: program size 89, 9.0 × 10^5 possible rules; rule returned: expr stmt → FOR ( stmt list cond expr ) stmt list; time 236.57 (3)
C / switch: program size 78, 1.9 × 10^5 possible rules; rule returned: select stmt → SWITCH translation unit; time 2324.79 (184)
Java / enum: program size 31, 1.3 × 10^6 possible rules; rule returned: Modifiers → ENUM IDENTIFIER; time 83.7 (1)
Java / if: program size 26, 3.5 × 10^8 possible rules; rule returned: CondExpr → IF VarInit; time 1019.18 (56)
Java / switch: program size 39, 4.4 × 10^5 possible rules; rule returned: Stmt → SWITCH ComplexPrimary Block; time 21254.91 (1224)
Java / while: program size 47, 1.8 × 10^7 possible rules; rule returned: GuardingStmt → WHILE ComplexPrimary FieldDecl; time 102263.69 (5923)
Matlab / case: program size 20, 5.7 × 10^6 possible rules; rule returned: equality expr → CASE CONSTANT; time 38.99 (1)
Matlab / otherwise: program size 29, 6.8 × 10^5 possible rules; rule returned: select stmt → OTHERWISE IDENTIFIER ]; time 44.07 (4)
Matlab / switch: program size 20, 1.0 × 10^7 possible rules; rule returned: otherwise stmt → SWITCH stmnt list END; time 16641.19 (3626)
Matlab / while: program size 12, 4.9 × 10^6 possible rules; rule returned: unary expr → WHILE stmt list END; time 16914.48 (3745)
Cobol / display: program size 26, 2.5 × 10^4 possible rules; rule returned: stmt list → DISPLAY ident or string id stmt stmt list; time 18 (1)
Cobol / perform: program size 85, 3.9 × 10^5 possible rules; rule returned: ident → PERFORM loop cond part2 stmt list END PERFORM; time 49.5 (1)
Cobol / read: program size 43, 8850 possible rules; rule returned: clause → READ ident or string opt at end clause; time 25 (1)


Table 7.4 Number of possible rules generated from the programs and their intersection in different PL grammars

Language / Construct   Size of intersection   No of rules obtained from different programs
C / for                6.4 × 10^5             9.0 × 10^5, 1.3 × 10^6, 8.1 × 10^5, 9.0 × 10^5, 9.0 × 10^5, 1.2 × 10^6, 8.1 × 10^5
C / switch             1.2 × 10^5             1.0 × 10^6, 1.9 × 10^5, 1.5 × 10^7, 1.5 × 10^6, 1.9 × 10^5, 4.6 × 10^6
C / case               1.4 × 10^4             5.3 × 10^6, 7.3 × 10^6, 5.9 × 10^4
C / break              1.3 × 10^2             8.8 × 10^3, 6.3 × 10^5, 5.2 × 10^2, 1.1 × 10^5
Java / case            4.7 × 10^4             4.0 × 10^5, 2.2 × 10^6
Java / for             2.4 × 10^6             2.8 × 10^6, 2.5 × 10^6
Java / switch          2.7 × 10^5             4.9 × 10^5, 4.4 × 10^5
Java / while           2.9 × 10^4             1.8 × 10^7, 2.5 × 10^7
Matlab / case          5.4 × 10^6             1.1 × 10^7, 5.5 × 10^6
Matlab / for           4.4 × 10^5             3.2 × 10^6, 6.7 × 10^6, 6.0 × 10^6, 1.4 × 10^6, 4.6 × 10^6
Matlab / otherwise     6.8 × 10^5             6.8 × 10^5, 6.8 × 10^5, 6.8 × 10^5
Matlab / while         2.1 × 10^6             1.1 × 10^7, 4.9 × 10^6

By reducing the number of possible rules in this way, the approach checks fewer possible rules for correctness and hence arrives at a correct rule much earlier than the unoptimized version of the approach.
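The intersection itself is a plain set intersection over the candidate rules produced per program. A generic sketch (assuming a candidate-rule type with value-based equals and hashCode) is:

    import java.util.*;

    class CandidateIntersection {
        // Keep only the candidate rules that every input program supports.
        static <R> Set<R> intersect(List<Set<R>> perProgram) {
            Iterator<Set<R>> it = perProgram.iterator();
            Set<R> common = new HashSet<>(it.next());
            while (it.hasNext())
                common.retainAll(it.next());   // drop rules unsupported by a program
            return common;
        }
    }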

7.2.2 Unit Production Optimization

In this section we study the effect of the unit production optimization on the grammar extraction process, i.e. the reduction achieved in the number of possible rules for real programming language grammars. In each experiment, a single rule corresponding to a keyword is removed, and both a set of all possible rules and a set of rules with most general RHSs are built; the sizes of the two sets are then compared. Table 7.5 shows the outcome of our experiments. The first column shows the language and the construct removed from the grammar. The second column compares the number of all possible rules and the number of rules with most general RHSs (shown as All/MG in the table) obtained from different programs. The last column shows the overall reduction achieved after taking the intersection of the sets of rules obtained from each program in both cases; that is, it shows the reduction achieved by the combination of the unit production optimization and the use of multiple programs.


Table 7.5 Comparison of number of all possible RHSs and number of most general RHSs

Language / construct   All possible rules (All) / most general rules (MG) per program; Intersection (All / MG)
C / for                9.0 × 10^5 / 1.4 × 10^4, 1.3 × 10^6 / 1.8 × 10^4, 9.0 × 10^5 / 1.4 × 10^4, 9.0 × 10^5 / 1.4 × 10^3, 1.2 × 10^6 / 1.8 × 10^4, 8.1 × 10^5 / 1.4 × 10^4, 8.1 × 10^5 / 1.4 × 10^4; intersection 6.4 × 10^5 / 8.0 × 10^3
C / switch             1.0 × 10^6 / 2.7 × 10^4, 1.9 × 10^5 / 9.6 × 10^3, 1.5 × 10^7 / 6.2 × 10^4, 1.5 × 10^6 / 3.0 × 10^4, 1.9 × 10^5 / 9.6 × 10^3, 4.6 × 10^6 / 5.1 × 10^4; intersection 1.2 × 10^5 / 8.3 × 10^3
C / case               5.3 × 10^6 / 3.4 × 10^4, 7.3 × 10^6 / 5.1 × 10^4, 5.9 × 10^4 / 2.9 × 10^3; intersection 1.5 × 10^4 / 8.6 × 10^2
Java / case            4.0 × 10^5 / 1.3 × 10^4, 2.2 × 10^6 / 2.3 × 10^4; intersection 4.7 × 10^4 / 7.7 × 10^2
Java / for             2.8 × 10^6 / 4.9 × 10^4, 2.5 × 10^6 / 4.1 × 10^4; intersection 2.4 × 10^6 / 2.2 × 10^4
Java / switch          4.9 × 10^5 / 1.4 × 10^4, 4.4 × 10^5 / 3.5 × 10^5; intersection 2.7 × 10^5 / 9.1 × 10^3
Java / while           1.8 × 10^7 / 1.4 × 10^5, 2.5 × 10^7 / 1.4 × 10^5; intersection 2.9 × 10^4 / 3.0 × 10^3
Matlab / case          1.1 × 10^7 / 4.1 × 10^4, 4.9 × 10^6 / 6.5 × 10^4; intersection 5.4 × 10^6 / 9.7 × 10^3
Matlab / for           3.2 × 10^6 / 3.4 × 10^4, 6.7 × 10^6 / 7.3 × 10^4, 6.0 × 10^6 / 6.3 × 10^4, 1.4 × 10^6 / 1.7 × 10^4, 4.0 × 10^6 / 4.7 × 10^4; intersection 4.4 × 10^5 / 4.8 × 10^3
Matlab / otherwise     6.8 × 10^5 / 5.4 × 10^3, 6.8 × 10^5 / 5.4 × 10^3, 6.8 × 10^5 / 5.4 × 10^3; intersection 6.8 × 10^5 / 5.4 × 10^3

We can observe that the above optimization reduces the search space of possible rules to a great extent (by a factor of 100-10000). Since the reduction is due to the abundance of unit productions, this optimization holds for most PL grammars: our study of the number of unit productions in PL grammars shows that a large fraction of productions are unit productions (table 6.1, chapter 6). Since some of the rules obtained from the unit production optimization may cause non-LR-ness in the grammar, the use of an LR parser for checking the correctness of rules may sometimes fail. To study how often the set of rules obtained from the unit production optimization contains a correct as well as LR preserving rule, we checked the correctness of the rules in the reduced set using an LR parser. We also compare the rules returned by the LR-parser checker (where an LR parser is used for checking correctness) with the rules returned by the CYK-parser checker. Table 7.6 shows the results of these experiments. As is evident, in none of the cases does the LR-parser checker fail, so the LR parser can be used for checking the correctness of rules. The times taken by the two versions of the checker are compared in the bar charts of figure 7.6. Since the LR-checker accepts only LR retaining rules, in some cases it takes more time because it has to check more rules than the CYK-checker.


Figure 7.6: Comparison of times taken by LR parser and CYK parser in the rule checking module (times in seconds on the Y axis). [Figure: four bar charts comparing the LR parser and the CYK parser, (a) C: break, case, for, switch, while; (b) Java: case, if, switch, while; (c) Matlab: for, switch, otherwise; (d) Cobol: move, read.]

The time taken by the LR-checker is the sum of the time taken in generating the parser and the time taken in parsing the programs (which is O(n), where n is the length of the input program). The time in generating the parser depends only on the size of the grammar; therefore, for large programs the LR parser will always outperform the CYK parser, as CYK parsing is an O(n^3) algorithm. In our experiments, small to medium sized programs are used, which does not make a significant difference between the LR-checker and the CYK-checker in terms of computation time. We can see in figure 7.6 that, except for a few cases, the LR-checker performs better than the CYK-checker. For example, in the experiment where a possible rule corresponding to while in the C grammar is checked, the CYK-checker performs better than the LR-checker because the LR parser checks a larger number of rules than the CYK parser to find a correct as well as LR preserving rule.


Table 7.6 Comparison of LR-parser and CYK-parser as a grammar completeness checker

C / break: LR: stmt → BREAK stmt list; CYK: stmt → BREAK stmt list; 4 programs, avg 114 LOC
C / case: LR: stmt → CASE arg expr list COLON stmt list; CYK: stmt → CASE arg expr expr COLON stmt list; 6 programs, avg 131 LOC
C / for: LR: stmt → FOR ( stmt list assign expr ) stmt list; CYK: stmt → FOR ( stmt list stmt list arg expr list ); 7 programs, avg 66 LOC
C / switch*: LR: stmt → SWITCH declarator stmt list; CYK: stmt → SWITCH init decl list stmt list; 6 programs, avg 131 LOC
C / while: LR: expr stmt → WHILE ( arg expr list ) stmt list; CYK: stmt → WHILE arg expr list stmt list; 7 programs, avg 101 LOC
Java / case: LR: LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt; CYK: LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt; 4 programs, avg 84 LOC
Java / if: LR: LocVarDecOrStmt → IF ( ConstExpr ) LocVarDecAndStmt; CYK: LocVarDecOrStmt → IF ConstExpr LocVarDecAndStmt; 5 programs, avg 97* LOC
Java / switch*: LR: LocVarDecOrStmt → SWITCH ( ClassNameList ) LocVarDecAndStmt; CYK: LocVarDecOrStmt → SWITCH ForIncr LocVarDecAndStmt; 4 programs, avg 84 LOC
Java / while: LR: Block → WHILE ( ArrayInitializers ); CYK: LocVarDecOrStmt → WHILE ArrayInitializers; 6 programs, avg 78 LOC
Matlab / case: LR: assign expr → CASE array list; CYK: unary expr → CASE translation unit; 4 programs, avg 38 LOC
Matlab / for: LR: primary expr → FOR translation unit END; CYK: expr → FOR translation unit END; 4 programs, avg 54 LOC
Matlab / switch: LR: primary expr → SWITCH translation unit END; CYK: primary expr → SWITCH translation unit END; 4 programs, avg 38.25 LOC
Matlab / otherwise: LR: stmt → OTHERWISE array list; CYK: stmt → OTHERWISE array list; 3 programs, avg 27 LOC
Cobol / move: LR: if clause → MOVE file name string loop cond part2; CYK: if clause → MOVE file name string loop cond part2; 4 programs, avg 68 LOC
Cobol / read: LR: if clause → READ file name string opt at end clause; CYK: if clause → READ file name string opt at end clause; 4 programs, avg 68 LOC
Cobol / perform: LR: if clause → PERFORM loop cond part2 stmt list END PERFORM; CYK: if clause → PERFORM loop cond part2 stmt list END PERFORM; 3 programs, avg 56 LOC


7.2.3 LHSs Filtering Optimization in Rule Checking Phase

In this section we discuss the effect of the LHSs filtering optimization used in the rule checking phase. In this optimization, the correctness of an RHS is checked w.r.t. a set of LHSs. The improvement gained from this optimization is measured by comparing the time taken in checking all possible rules with the simple CYK parser against the time taken in checking all possible RHSs with the modified CYK parser.


If L is the set of possible LHSs, then for checking the correctness of a possible RHS α there will be |L| invocations of the rule checking process (i.e., of the parser), whereas with the LHSs filtering optimization there is only one. If t_r is the time taken by the simple CYK parsing algorithm and t_rhs is the time taken by the modified CYK parsing algorithm, then we compare the quantities t_r × |L| and t_rhs. The quantity t_rhs will always be higher than t_r because t_rhs involves the additional overhead of LHSs filtering; the bigger the set L, the larger the overhead. Hence, in order to study the effect of the maximum overhead, we take the set L equal to the set of all nonterminals N while measuring t_rhs. We conducted a few experiments on the C and Cobol grammars to obtain first hand experience of this optimization; the results are shown in table 7.7. The second column shows the rules removed from the grammar, and the third column shows the number of possible combinations of LHSs. In the multiple missing rules case, this is the product of the numbers of possible LHSs corresponding to each new terminal. For example, suppose the set of possible LHSs for the keyword while is {stmt, cond stmt} and for the keyword for it is {stmt, cond stmt}, and the possible RHSs corresponding to them are WHILE ( expr ) stmt and FOR ( stmt list expr ) stmt respectively; the sets of possible rules to be checked are then:

1. {stmt → WHILE ( expr ) stmt, stmt → FOR ( stmt list expr ) stmt}

2. {stmt → WHILE ( expr ) stmt, cond stmt → FOR ( stmt list expr ) stmt}

3. {cond stmt → WHILE ( expr ) stmt, stmt → FOR ( stmt list expr ) stmt}

4. {cond stmt → WHILE ( expr ) stmt, cond stmt → FOR ( stmt list expr ) stmt}

Using the modified CYK parser, the correctness of the above four sets is checked in one pass. Hence, although t_r is less than t_rhs, the total time spent in checking the correctness of all possible rules is much higher than the total time spent in checking the correctness of all possible RHSs with the LHSs filtering optimization; therefore this optimization improves the grammar extraction process. The comparison shown in table 7.7 assumes that all possible rules are checked. In practice the rule checker does not check all possible rules; therefore, to compare the actual time taken by the simple CYK parser and the modified CYK parser in the rule checking phase, we conducted experiments on the C, Java and Matlab grammars.


In each experiment we removed a rule corresponding to a keyword and built a set of possible rules. We then compared the times taken by the simple CYK parser and the modified CYK parser in arriving at a correct rule; the bar charts in figure 7.7 compare these times. We can see that the modified parser is either comparable to or better than the simple CYK parser.

Figure 7.7: Comparison of times taken by the modified CYK parser and the simple CYK parser in the rule checking module (times in seconds on the Y axis). [Figure: three bar charts, (a) C: break, case, switch, while; (b) Java: case, for, if; (c) Matlab: case, otherwise, switch.]

7.2.4 Performance of Rule Evaluation Order

This section explores the effect of a rule evaluation order, based on the rule's weight, on the grammar extraction process; we validate the hypothesis that rules with higher weights are more likely to be correct rules.


Table 7.7 Experiments of LHSs filtering optimization (times are in milliseconds)

Language  Statements removed                          |L|                       t_rhs       t_r         t_r × |L|     Avg size of progs
C         switch, case                                67^2 = 4.5 × 10^3         386         164         7.3 × 10^5    16
C         switch, case, break                         67^3 = 3.0 × 10^5         816         143         4.4 × 10^6    16
C         switch, case, break, default                67^4 = 2.0 × 10^7         1.1 × 10^5  6.0 × 10^4  1.2 × 10^12   131
C         switch, case, break, default, for, while    36 × 67^5 = 4.8 × 10^10   1.4 × 10^4  518         2.5 × 10^13   16
Cobol     perform                                     5^1 = 5                   3811        3268        16340         43
Cobol     display                                     5^1 = 5                   1237        680         3400          43
Cobol     read                                        5^1 = 5                   46172       43520       217600        43
Cobol     read, perform, move                         5 × 181^2 = 1.6 × 10^5    4092        3297        5.4 × 10^8    56

Experiments were done on different PL grammars by removing rules corresponding to different keywords (e.g. while, switch, for, etc.) or different operators (e.g. +, %, etc.). We removed one rule at a time and used the weight of rules, as proposed in chapter 6, to impose a rule evaluation order: rules with higher weights are checked for correctness first. Since the weight of a rule depends only upon its RHS, this optimization can be coupled with the LHSs filtering optimization proposed earlier. We checked how many times the highest weight rule is the correct rule. In order to increase the number of test cases, experiments were conducted on one, two and more input programs. Tables 7.8(a) to 7.8(d) show the results for the keyword based rules of the Java, C, Matlab and Cobol grammars. For operator based rules (i.e. rules which contain operators) we worked with the Matlab grammar: rules corresponding to the operators &&, /., *., /, .^, ==, -- and * were removed, and the correctness of the possible rules was checked with the proposed rule evaluation order. The results of this experiment are shown in table 7.8(e). We observe from the experiments that the proposed weight criterion helps in directing the search for correct rules. Only in 3.94% of the cases does the extraction process have to check the correctness of rules with weights lower than the second highest weight; therefore, using the proposed rule evaluation order we can extract correct grammar rules quickly.


Table 7.8 Experiments for MDL based weight metric

(a) Experiment of MDL based weight metric on the Java grammar

Construct  No of test cases  #times highest rank rule is correct  #times second highest rule is correct  Others
Case       10                9                                    0                                       1
Enum       6                 6                                    0                                       0
For        15                15                                   0                                       0
If         15                13                                   0                                       2
Switch     10                10                                   0                                       0
Try        10                3                                    7                                       0
While      16                13                                   2                                       1
Total      82                69                                   9                                       4

(b) Experiment of MDL based weight metric on the C grammar

Construct  Test cases  Highest rank correct  Second highest correct  Others
Break      6           4                     0                       2
Case       21          19                    1                       1
For        28          28                    0                       0
Switch     21          12                    9                       0
While      28          28                    0                       0
Total      104         91                    10                      3

(c) Experiment of MDL based weight metric on the Matlab grammar

Construct   Test cases  Highest rank correct  Second highest correct  Others
Otherwise   6           6                     0                       0
Case        6           3                     1                       2
Switch      10          2                     8                       0
For         36          2                     34                      0
While       50          16                    34                      0
Total       108         27                    77                      4

(d) Experiment of MDL based weight metric on the Cobol grammar

Construct  Test cases  Highest rank correct  Second highest correct  Others
Display    6           5                     0                       1
Move       4           3                     0                       1
Perform    3           1                     2                       0
Total      13          9                     2                       2

(e) Experiment of MDL based weight metric on operator based rules of the Matlab grammar

Operator  Test cases  Highest rank correct  Second highest correct  Others
&&        4           4                     0                       0
/.        2           2                     0                       0
*.        4           4                     0                       0
.^        3           2                     0                       1
/         8           8                     0                       0
==        8           8                     0                       0
--        2           2                     0                       0
*         10          10                    0                       0
Total     41          40                    0                       1

(f) Summary of MDL based weight metric on different grammars

Language  Test cases  Highest rank correct  Second highest correct  Others
Java      82          69                    9                       4
C         104         91                    10                      3
Matlab    108         27                    77                      4
Cobol     20          14                    4                       2
Total     355         241                   100                     14


7.2.5 Experiments of Multiple Rules Extraction

This section presents some experiments on multiple rule extraction, i.e. when more than one rule is needed to make the grammar complete. In each experiment, more than one keyword based rule was removed from a PL grammar, and the grammar was then completed by extracting rules from a set of programs. These experiments were done with all the optimizations enabled, i.e. the unit production optimization, the weight based rule evaluation order, and the use of the modified CYK parser. Table 7.9 shows the results. Our approach successfully extracted a set of correct rules in each of these experiments. The time taken in most experiments is 2-15 minutes, but in some cases it took hours to extract the rules.

Table 7.9 Experiments on multiple rules extraction

Language  Constructs                                   No of programs  Avg size of a prog  Time
Matlab    for, while                                   2               33                  1.5 min
Matlab    for, switch, case, otherwise, while          5               29                  19.5 min
Matlab    switch, case, otherwise                      3               27                  2.2 min
Java      try, catch                                   4               98                  14.9 min
Java      if, while                                    3               37                  6.7 min
Java      switch, case, enum                           7               61                  17.6 min
Java      switch, case, try, catch, enum               7               61                  24.23 min
C         switch, case                                 1               16                  3 sec
C         switch, case, break                          1               16                  7.3 sec
C         switch, case, break, default, while, for     1               16                  14.1 sec
Cobol     read, perform, move                          3               56                  1.3 min

7.2.6 Experiments to recover C* specific grammar rules

We discuss here a few experiments in which we extracted rules corresponding to different keywords, operators and declaration specifiers of the C* grammar, given a C grammar and programs written in C* as input. We wrote small programs in C* which used constructs specific to C* (those which are not part of a standard C grammar). Table 7.10 summarizes the additional constructs (which contain new terminals) of the C* grammar. This summary is not complete, as resources for C* are very scarce (after an inquiry on forums like comp.compilers, we could obtain only an incomplete manual and very few example programs). We found that, except in a few cases, the additional rules in the C* grammar follow the assumptions we made in earlier chapters. One such exception is the declaration of a parallel data type.


A parallel data type describes a structured collection of objects with a particular member type. For example, consider the following C* declaration:

    shape [10]S;
    int:S parallel_int;

Table 7.10 Summary of new terminals in C*

New terminal   Type                    Description
shape          Declaration specifier   Used for expressing the template of a parallel data type
shapeof        Keyword                 Returns a pointer to a shape object
with           Keyword                 A control flow statement which operates on all the elements of a parallel data type
where          Keyword                 A control flow conditional statement for parallel objects; much like the "if statement", but the operations are performed on parallel objects
elemental      Declaration specifier   Used for facilitating parallel and non parallel operations
everywhere     Keyword                 Used for making each element of a parallel variable accessible within a function
positionof     Keyword                 NA
rankof         Keyword                 NA
alignof        Keyword                 NA
extension      Keyword                 NA
attribute      Keyword                 NA
current        Keyword                 NA
dimof          Keyword                 NA
%%             Operator                Real modulus operator
<?             Operator                Minimum operator
>?             Operator                Maximum operator
<?=            Operator                Minimum assignment operator
>?=            Operator                Maximum assignment operator

shape is a new declaration specifier which is used for defining the template of parallel objects. A parallel object is an object of a parallel data type. In the above example, the variable parallel_int represents a parallel object of type int whose structure is defined by shape S; that is, parallel_int represents an array of ten integers, and the operation on each element of parallel_int can be done in parallel on different processors. The declaration statement "int:S parallel_int;" declares parallel_int to be of type "int:S". The grammar rule corresponding to the statement "int:S parallel_int;" does not involve a new terminal, hence we do not discuss this case.


Table 7.11 Rules extracted from programs written in the C* dialect

New terminals: shape, with
  Extracted rules: statement list → WITH program statement list; declaration specifiers → SHAPE statement list program
  Time: 3.2 sec; size of program: 25 LOC

New terminals: shape, where
  Extracted rules: statement → SHAPE statement list program statement list; statement → WHERE return expression statement list
  Time: 4.2 sec; size of program: 28 LOC

New terminals: with, where
  Extracted rules: statement → WHERE return expression; statement → WITH program
  Time: 3.9 sec; size of program: 16 LOC

New terminals: shape, elemental, with
  Extracted rules: external declaration → SHAPE statement list program; statement list → WITH program statement list; declaration specifiers → ELEMENTAL statement list
  Time: 11.7 sec; size of program: 28 LOC

New terminal: >?=
  Extracted rule: expression → MAX ASSIGN identifier list
  Time: 207.5 sec; size of program: 23 LOC

New terminal: <?=
  Extracted rule: expression → MIN ASSIGN enumerator list
  Time: 220.6 sec; size of program: 23 LOC
However, the declaration corresponding to the shape keyword follows our assumption; therefore we consider input programs containing the shape keyword. The experiments were conducted on input programs containing different combinations of new terminals whose corresponding grammar rules follow the assumptions given in chapters 4 and 5. The experiments were done on single input programs with all the optimizations enabled. Table 7.11 shows the results; we can observe that in each experiment the system correctly extracted a grammar complete w.r.t. the input programs. Examples of the extracted rules are also shown in table 7.11.

7.3 Summary

In this chapter we have discussed the implementation of our approach. Since the primary goal of the implementation was to verify the feasibility of the approach on real programming language grammars under various optimizations, we have kept the implementation flexible enough to choose different optimizations at different levels. The major components of the implementation are the modified LR(1) parser and the modified CYK parser. The modified LR(1) parser is used for supporting the forced reductions and the modified CYK parser is used for checking the correctness of rules and for incorporating various optimizations.


We also presented experimental results verifying the approach and the optimizations proposed in the rule building and rule checking phases. The approach was first validated on a single program with a single missing construct, and then on the combination of the unit production optimization and the use of multiple programs. The modified CYK parser used for rule checking and the use of a weight based rule evaluation order were studied next. We have also shown a few experiments on constructs of the C* grammar (a real dialect). The experiments show that with the proposed optimizations we arrive at a feasible grammar extraction algorithm.

Chapter 8

Rule Selection Criteria

A grammar that is complete with respect to a set of programs can be written in many different ways; hence we face a difficulty in selecting a good grammar from a set of complete grammars. For example, suppose we are given an incomplete ANSI C grammar [75] in which the grammar rule corresponding to the keyword "while" is missing, and a correct grammar rule is extracted using the program shown in figure 8.1(a). There are many options for the correct rule, as shown in figure 8.1(b), and clearly we would like to choose the "best" rule (from some perspective) among them. In practice the number of correct grammar rules can be of the order of a thousand, as we will show in a later section; hence manual selection of a good rule is not feasible.

    main() {
        int a=100;
        while (a > 0 ) {
            printf("Counter value %d", a);
            ....
            ....
        }
    }

(a) Input Program

statement → WHILE (a > 0)
statement → WHILE (expression)
statement → WHILE (conditional expression)
statement → WHILE (expression) statement
statement → WHILE (conditional expression) statement

(b) Set of possible correct rules

Figure 8.1: A small program with keyword while and a set of correct rules

The approach we have discussed in earlier chapters stops whenever it finds a complete grammar and returns the set of rules added to the grammar as a correct set of rules. If, however, we want to determine the correct set of rules in which each rule is the best rule, then one approach is to extract all correct sets of rules and select the one in which each rule is the best.


Another way is to add rules to the grammar in non-increasing order of goodness (based on some criterion) during the extraction process itself. Both solutions require the ability to prefer one rule over another. In this chapter, we study different criteria for selecting a rule from a set of correct rules. One approach for rule selection is to select a rule such that the modified grammar obtained after including it is correct with respect to a coverage criterion, such as rule coverage [50], context dependent rule coverage [33] or context dependent branch coverage [33] (chapter 3). This approach assumes that a reference implementation of the target parser is available: we first generate a set of test programs which achieve the given coverage criterion, then check whether the reference implementation accepts all the test programs, and select a correct rule whose addition makes the grammar correct with respect to the coverage criterion. We take a different approach for rule selection, because the above method involves generating a set of test programs for each correct rule, which becomes time consuming when the number of correct rules is very large. We first study grammar based metrics as a rule selection criterion. Our experiments show that these metrics are not sufficient in many cases, as there can be many correct grammar rules with the same metric value. We then propose two rule selection criteria: one based on the weight of the rule, as discussed in chapter 6, and the other based on the usage of nonterminals. Experiments show that the usage criterion, when coupled with the grammar based metrics, selects a rule which is reasonably close to the removed rule.

8.1 Problem Definition and Terminology

We work in the grammar extraction scenario where a single rule is sufficient to make the grammar complete. We address the following problem: given an incomplete grammar G = (N, T, P, S), a set of programs P and a set of correct grammar rules R_c = {r_1, r_2, ..., r_n}, where the addition of each r_i to G makes G complete, select a good rule from R_c. The actual unknown grammar is denoted G_T. Given the set R_c, the ideal goal of rule selection would be to select a rule r_i such that G becomes equivalent to G_T after including r_i. This goal, however, is not achievable due to the seminal result by Gold, which states that any grammar in the Chomsky hierarchy cannot be learned exactly from positive samples alone [19].


This result also follows from the observation that there exist many grammars complete with respect to a set of programs, and determining the equivalence between a complete grammar and the actual grammar, i.e. L(G') = L(G_T), is impossible [24]. We need some terminology for defining the grammar metrics. A nonterminal B is called an immediate successor of a nonterminal A if A ⇒ β and β contains B; this relationship is written A ⊲ B. Two nonterminals A and B are equivalent if A ⊲* B and B ⊲* A; this equivalence relationship partitions the set of nonterminals, and each partition is called a grammatical level. For any two nonterminals A ∈ L_1 and B ∈ L_2 (where L_1 and L_2 are grammatical levels), if A ⊲* B then we write L_1 ≻ L_2. A program is denoted w; w_i denotes the i-th token of w, and w_{i,j} denotes the substring of w which starts at the i-th token and ends at the j-th token. The dependency graph of a grammar is the graph built from the successor relationship between nonterminals: each node represents a nonterminal, and an edge between two nodes A and B denotes the successor relationship between the two nonterminals (i.e. A ⊲* B).
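Since the equivalence classes of the ⊲* relation are exactly the strongly connected components of the dependency graph, grammatical levels can be computed with any SCC algorithm. The following is a minimal Java sketch using Kosaraju's two-pass algorithm over an adjacency-list dependency graph; it assumes every nonterminal appears as a key of the map (possibly with an empty successor list), and the names are illustrative.

    import java.util.*;

    class GrammaticalLevels {
        // Grammatical levels = strongly connected components of the dependency graph.
        static List<List<String>> levels(Map<String, List<String>> succ) {
            Deque<String> order = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            for (String nt : succ.keySet()) dfs(nt, succ, seen, order);
            Map<String, List<String>> rev = reverse(succ);
            seen.clear();
            List<List<String>> levels = new ArrayList<>();
            for (String nt : order) {                  // reverse finish order
                if (seen.contains(nt)) continue;
                List<String> level = new ArrayList<>();
                collect(nt, rev, seen, level);
                levels.add(level);                     // one level per component
            }
            return levels;
        }

        static void dfs(String u, Map<String, List<String>> g,
                        Set<String> seen, Deque<String> order) {
            if (!seen.add(u)) return;
            for (String v : g.getOrDefault(u, Collections.emptyList()))
                dfs(v, g, seen, order);
            order.push(u);                             // push on finish
        }

        static void collect(String u, Map<String, List<String>> g,
                            Set<String> seen, List<String> comp) {
            if (!seen.add(u)) return;
            comp.add(u);
            for (String v : g.getOrDefault(u, Collections.emptyList()))
                collect(v, g, seen, comp);
        }

        static Map<String, List<String>> reverse(Map<String, List<String>> g) {
            Map<String, List<String>> r = new HashMap<>();
            for (Map.Entry<String, List<String>> e : g.entrySet())
                for (String v : e.getValue())
                    r.computeIfAbsent(v, k -> new ArrayList<>()).add(e.getKey());
            return r;
        }
    }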

8.2

Grammar based Metrics

Programming languages, like any software, undergo changes such as the creation of new versions and dialects. In this process, a set of new terminals, nonterminals and productions may get added, deleted or modified in the grammar. The behavior of grammar based applications (such as compilers, editors, etc.) depends on the productions of the grammar; hence debugging such applications also depends on an analysis of the underlying grammar. Grammar metrics are used to measure the complexity of grammar based software. Several grammar based metrics have been discussed by researchers [11, 17]. Grammar based metrics are either size based, defined in terms of the number of terminals, nonterminals and productions of the grammar, or structure based, defined in terms of relationships between the nonterminals (such as the successor relationship discussed in the previous section).

The metrics TERM and VAR represent the number of terminals and nonterminals in the grammar respectively [17]. MCC (McCabe Cyclomatic Complexity) counts the number of alternative operators in the grammar [17]. Examples of alternative operators are '?', which represents zero or one instance of the operand, '⋆', which represents zero or more repetitions of the operand, and '+', which represents one or more repetitions of the operand. AVS represents the average length (number of symbols) of the right hand sides of the productions in the grammar. HAL (Halstead effort) is defined in terms of the number of operators (?, ∗, +, etc.) and operands (terminals and nonterminals) of the grammar [17].

The Tree Impurity metric (TIMP) is a structural metric which depends on the dependency graph of the grammar and is defined as follows [17]:

    TIMP = [2(e − |N| + 1) / ((|N| − 1)(|N| − 2))] × 100

where e is the number of edges in the dependency graph and |N| is the number of nonterminals in the grammar. A high TIMP shows that refactoring and comprehending the grammar is difficult, as nonterminals depend on each other in a more intertwined way. CLEV is a structural metric which depends on the number of grammatical levels and is defined as follows:

    CLEV = (#(N≡) / |N|) × 100

where N≡ is the partition induced by the equivalence relationship defined in section 8.1 and #(N≡) is the number of its classes. Every grammatical level can be considered a module of the grammar, as the nonterminals in a grammatical level depend on each other. For example, in programming language grammars we often find a grammatical level corresponding to expressions, where each nonterminal represents a different kind of expression (primary expressions, logical expressions, etc.); another commonly found grammatical level corresponds to statements, where each nonterminal represents a kind of statement (iteration statements, expression statements, etc.) [17]. A higher CLEV value indicates that the nonterminals are evenly spread and there are more opportunities to modularize the grammar. The metric NSLEV indicates the number of grammatical levels with more than one nonterminal [17]. DEP measures the number of nonterminals in the largest grammatical level, and HEI measures the height of the tree induced by the grammatical levels, where each node represents a grammatical level and edges represent the ≻ relationship discussed in section 8.1 [17]. We use only TIMP and CLEV in our study.
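Continuing the previous sketch, TIMP and CLEV fall out of the dependency graph and the grammatical levels; scoring a candidate rule amounts to recomputing both metrics on the grammar extended with that rule. The helper names below are ours, and the functions dependency_graph and grammatical_levels are taken from the sketch in section 8.1.

    def timp(graph, nonterminals):
        """Tree impurity: 2(e - |N| + 1) / ((|N| - 1)(|N| - 2)) * 100,
        where e is the number of edges in the dependency graph."""
        n = len(nonterminals)
        e = sum(len(succ) for succ in graph.values())
        if n < 3:
            return 0.0  # guard of ours: the formula divides by (n-1)(n-2)
        return 2.0 * (e - n + 1) / ((n - 1) * (n - 2)) * 100.0

    def clev(levels, nonterminals):
        """CLEV: number of grammatical levels as a percentage of |N|."""
        return 100.0 * len(levels) / len(nonterminals)

    def metric_scores(productions, nonterminals, candidate_rule):
        """TIMP/CLEV of the grammar obtained by adding candidate_rule, as in
        the experiments of the next section (lower TIMP / higher CLEV preferred)."""
        g = dependency_graph(productions + [candidate_rule], nonterminals)
        lv = grammatical_levels(g, nonterminals)
        return timp(g, nonterminals), clev(lv, nonterminals)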


8.3

Using Grammar based Metrics for Rule Selection

In this section we discuss experiments conducted to study the effectiveness of the available grammar based metrics and show that these metrics are not sufficient in many cases. Experiments are performed on four programming language grammars, viz. Java, C, Cobol and Matlab. In each experiment a rule corresponding to a keyword (e.g. while, case, switch, etc.) was removed from a grammar, and a correct grammar rule, which makes the corresponding grammar complete with respect to the input programs, was extracted from a set of programs using the approach given in earlier chapters¹. The relationship between the initial grammar G = (N, T, P, S) and the complete grammar G′ = (N′, T′, P′, S′) is as follows: N′ = N, T′ ⊇ T, P′ = P ∪ {r}, S′ = S; r is a correct rule of the form B → β where B ∈ N and β ∈ (N ∪ T)∗. The approach is slightly modified so that it also explores other possible correct rules rather than stopping at the first correct rule. Since the search space of possible rules is very large (of the order of 10⁵–10⁶), we checked the correctness of around 1.5 × 10⁵ rules in all the experiments. Table 8.1 shows the number of possible correct rules obtained in each experiment. We observe that the number of possible correct rules is of the order of thousands even in a constrained experiment; therefore rule selection criteria are essential.

The set of extracted correct rules is input to the rule selection system, where rules are selected by different selection criteria. We first study TIMP and CLEV as rule selection criteria: a correct rule is added to the incomplete grammar, the TIMP and CLEV values of the modified grammar are computed, and these values are assigned as the rule's TIMP and CLEV values. Size metrics such as TERM, VAR, MCC and HAL are not used, as the experiments are done under a controlled assumption in which MCC and HAL always increase by a constant value, because the number of alternatives for a nonterminal increases by exactly one in each experiment (e.g. the number of alternatives for a nonterminal B increases if the correct rule is B → β). Structural metrics NSLEV, DEP and HEI are not used, as they indirectly depend on CLEV and TIMP.

¹ We consider only LR preserving correct rules.

Table 8.1 The number of possible rules and correct rules

Language / Construct    Possible rules    Correct rules    Correct rules with the least TIMP
Matlab / while          2.2 × 10⁶         8442             1428
Matlab / case           5.4 × 10⁶         1156             528
Matlab / for            4.4 × 10⁵         9198             897
Matlab / otherwise      6.8 × 10⁵         4706             1928
C / case                1.4 × 10⁴         1584             720
C / switch              1.2 × 10⁵         2093             1803
C / for                 6.4 × 10⁵         128              110
Java / case             4.7 × 10⁴         625              528
Java / while            2.9 × 10⁴         974              897
Java / enum             1.3 × 10⁶         2256             2256
Java / switch           2.6 × 10⁵         2617             2346
Java / for              2.4 × 10⁶         4049             3505
Cobol / move            7680              2826             1826

A good grammar should have a lower TIMP value; hence the lowest TIMP value is studied as a selection criterion. Table 8.2 compares rules having the lowest TIMP with other rules; only a few relevant rules (drawn from a few experiments) are shown, as there were many rules with the lowest TIMP. The TIMP criterion rejects some very awkward rules. For example, one of the filtered-out rules corresponding to the keyword case in the C grammar is²:

    primary_expr → CASE initializer_list COLON stmt_list BREAK

The nonterminal initializer_list represents a list of initializer statements, where each initializer statement is composed of a set of assignment expressions. That is, assignment_expr is a successor of initializer_list (initializer_list ⊲⋆ assignment_expr) but not vice versa, as no nonterminal corresponding to an expression is ever expressed in terms of a list of initializer statements in the C grammar. By adding the above rule to the C grammar, a new successor relationship

² A set of programs and grammars on which the experiments are done can be found at http://www.cse.iitk.ac.in/users/alpanad/grammars. Nonterminal names are written in a short form for readability, e.g. the nonterminal LocalVariableDeclarationAndStatement in the Java grammar is written as LocalVarDeclAndStmt.

Table 8.2 Comparison of rules having the lowest TIMP with other rules

Matlab / case
  lowest TIMP: stmt → CASE expr eostmt stmt_list; stmt → CASE expr stmt_list
  other:       assignment_stmt → CASE expr stmt stmt_list; expr_stmt → CASE expr stmt stmt_list

Matlab / while
  lowest TIMP: stmt → WHILE expr eostmt stmt_list END
  other:       assignment_stmt → WHILE expr eostmt translation_unit END; primary_expr → WHILE translation_unit END

Matlab / for
  lowest TIMP: stmt → FOR expr EQU stmt_list END
  other:       assignment_expr → FOR translation_unit END; assignment_expr → FOR ID EQU translation_unit END

Matlab / otherwise
  lowest TIMP: stmt → OTHERWISE stmt
  other:       assignment_stmt → OTHERWISE function_ident_list expr

C / case
  lowest TIMP: stmt → CASE const_expr COLON stmt
  other:       primary_expr → CASE initializer_list COLON stmt_list BREAK

Java / while
  lowest TIMP: Stmt → WHILE LPARAN CondiExpr RPARAN LocalVarDeclAndStmt
  other:       EmptyStmt → WHILE LPARAN ForIncr RPARAN NonStaticInitializer

primary_expr ⊲⋆ initializer_list is introduced, which violates the hierarchical structure of programming language grammars³; hence it is not a good rule. The TIMP value increases and the number of grammatical levels is reduced by the above successor relationship, as the grammatical levels containing primary_expr and initializer_list are merged by the addition of this rule. This merger is not desirable, because the two grammatical levels represent separate parts of a program: one represents expressions and the other represents statements. A higher TIMP and a lower CLEV value reflect the effect of this merger.

We also studied CLEV as a rule selection criterion. A grammar should have a higher CLEV; hence we compared rules having the highest CLEV value with other rules. Since CLEV is roughly inversely proportional to TIMP, we found that the sets of rules having the highest CLEV were the same as the sets of rules having the least TIMP; we therefore do not tabulate the results for CLEV as a selection criterion. The above experiments show that the lowest-TIMP and highest-CLEV criteria cut down some very awkward rules, but these criteria are not sufficient for the rule selection problem, as the size of the reduced set is still large (table 8.1).

³ By hierarchical structure we mean a program arrangement in which a program consists of a set of statements, a set of statements consists of different expressions, etc.

8.4

A Criterion based on Weight of Rules

In this section we propose the use of the weight of a rule, as defined in chapter 6, as a rule selection criterion in the PL domain. The weight of a rule closely follows the principle of minimum description length (MDL), which states that the best hypothesis for a given set of data is the one which represents the data most compactly; we therefore study its effectiveness in rule selection. We believe this can be a good criterion: a rule with higher weight covers a larger substring and has fewer symbols in its right hand side (definition 6.1). Hence, the weight of a rule represents the rule's ability to compactly represent the substring derived by the rule. For example, in figure 8.1 the rule statement → WHILE (expression) statement has a higher weight than the rule statement → WHILE (expression), as the substring derived by the first rule, i.e. while (a > 0) { printf("Counter value %d", a); ... }, is larger than the substring derived by the second, i.e. while (a > 0), while the RHS of the first rule is only one symbol longer than that of the second.

To study the above criterion, we order all the correct rules in non-increasing order of weight and select the highest weight rules. Table 8.3 shows some of the rules with the highest weight. In some cases the highest weight rules do not follow the hierarchical structure of a PL grammar. For example, in the case of the Matlab grammar and the keyword for in table 8.3, a highest weight rule is

    primary_expr → FOR translation_unit END

This rule does not follow the hierarchical structure, because it expresses an expression (primary_expr) in terms of translation_unit.
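Definition 6.1 is given in chapter 6 and is not reproduced here. As a hedged illustration we assume a weight that grows with the length of the substring the rule covers and shrinks with the length of its right hand side; the token counts below are illustrative, and this form is enough to reproduce the while example above.

    def rule_weight(covered_len, rhs_len):
        """Stand-in for the weight of definition 6.1 (assumed form): reward a
        longer covered substring, penalize a longer right hand side."""
        return covered_len / rhs_len

    # statement -> WHILE ( expression ) statement derives the whole loop,
    # e.g. 14 tokens with an RHS of 5 symbols, whereas
    # statement -> WHILE ( expression ) derives only the loop header,
    # e.g. 6 tokens with an RHS of 4 symbols.
    assert rule_weight(14, 5) > rule_weight(6, 4)

    def by_weight(candidates):
        """candidates: iterable of (rule, covered_len, rhs_len) triples;
        returns the candidates in non-increasing order of weight."""
        return sorted(candidates, key=lambda c: rule_weight(c[1], c[2]),
                      reverse=True)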

Table 8.3 Comparison of rules having the highest weight with other rules

Matlab / case
  highest weight: stmt → CASE stmt_list
  other:          stmt → CASE expr stmt_list

Matlab / while
  highest weight: stmt → WHILE translation_unit END stmt_list; stmt → WHILE stmt_list END stmt_list
  other:          stmt → WHILE expr stmt_list END

Matlab / for
  highest weight: primary_expr → FOR translation_unit END; stmt → FOR statement_list END
  other:          stmt → FOR expr EQU stmt_list END eostmt

Matlab / otherwise
  highest weight: stmt → OTHERWISE expr stmt
  other:          stmt → OTHERWISE ID expr

Java / case
  highest weight: LocalVarDeclOrStmt → CASE CondiExpr COLON LocalVarDeclAndStmt; LocalVarDeclOrStmt → CASE Expr COLON LocalVarDeclAndStmt
  other:          Block → CASE Expr COLON

Java / enum
  highest weight: StaticInitilizer → ENUM QualifiedName VarInit Modifier FieldDecl
  other:          Block → ENUM QualifiedName ArrayInit Modifier TypeSpec FieldDecl

Java / for
  highest weight: Stmt → FOR (LocalVarDeclAndStmt CondiExpr) LocalVarDeclAndStmt
  other:          Block → FOR ( INT LocalVarDeclAndStmt Expr )

Java / switch
  highest weight: Block → SWITCH ComplexPrimary Block
  other:          Stmt → SWITCH ComplexPrimary { LabeledStmt LocalVarDeclAndStmt }

Java / while
  highest weight: Block → WHILE ComplexPrimary Block
  other:          Block → WHILE ( Expr ) Block

Cobol / move
  highest weight: if_clause → MOVE ident_or_string loop_cond_part2
  other:          clause → MOVE ident TO ident

C / case
  highest weight: select_stmt → CASE cond_expr COLON stmt_list; stmt → CASE cond_expr COLON stmt_list
  other:          select_stmt → CASE cond_expr COLON; stmt → CASE cond_expr COLON

translation_unit is a nonterminal which represents the largest unit of the program: it is composed of function bodies, function bodies contain sets of statements, and statements are composed of expressions; i.e., there is a successor relationship translation_unit ⊲⋆ function_body ⊲⋆ statement ⊲⋆ expression in the grammar. Addition of the above rule introduces a new relationship primary_expr ⊲⋆ translation_unit, which further increases TIMP and decreases CLEV. Filtering out such rules is not possible by the weight criterion alone. Hence, a good ordering of the rule selection criteria is to first filter rules based on the lowest TIMP and the highest CLEV and then apply the weight criterion.

The weight criterion often returns overly compact rules, i.e., rules whose RHS is very small. For example, a rule with the highest weight corresponding to the keyword for in the Matlab grammar (table 8.3) is

    stmt → FOR stmt_list END

whereas the actually removed rule is

    stmt → FOR ID EQU expr stmt_list END

Nevertheless, we found that refactoring (i.e., expanding the right hand side of) the rules selected by the weight criterion results in the actually removed rule. For instance, in the above example

    FOR stmt_list END ⇒⋆ FOR ID EQU expr stmt_list END

so a refactoring yields the desired rule. Since overly compact rules are also more general than the actually removed rule, refactoring is needed to strike a balance between a very general and a very specific rule. There can also be many rules with the same highest weight; consider the following rules from figure 8.1(b):

    statement → WHILE ( expression ) statement
    statement → WHILE ( conditional_expression ) statement

These rules have the same weight, as they have the same number of symbols in the right hand side and cover the same string, i.e. while (x > 0) { printf("Counter Value %d", a); ... }. Therefore, the weight criterion alone does not solve the problem of rule selection.

8.5

Usage Count based Rule Selection Criterion

In this section we discuss a rule selection criterion based on common patterns found in programming language grammars. We first discuss the patterns and then the rule selection criterion. A pattern in a programming language grammar is a commonly occurring structure which grammar writers frequently adopt.


An example of a pattern in the context of grammars is the one used for representing lists, i.e., a list of statements, a list of declarations, etc. An example of a list pattern is the following set of rules:

    { statement_list → statement_list statement,  statement_list → statement }

Here statement_list represents a list of statements; the same pair of productions is used for representing other kinds of lists. Another common pattern in programming language grammars is the expression pattern, which uses a set of productions to represent different expressions. For imposing precedence and associativity among different operators, grammar writers use solutions such as those suggested in the compiler book of Aho et al. [2]. Programming language grammar specifications have an abundance of unit productions; these productions are included to improve the clarity of the grammar and also arise from the use of patterns. Note the unit production (statement_list → statement) in the list pattern. Similarly, in the case of an expression pattern, the grammar has as many unit productions as there are operators at different precedence levels. For example, consider a subset of productions representing expressions built from the + and ⋆ operators:

    E → E + T | T
    T → T ⋆ F | F
    F → ( E ) | ID

We note a unit chain [E → T → F → ID] in this toy grammar. Nonterminal E represents the main concept (i.e., expression), whereas the other nonterminals, T and F, are used to enforce precedence. Hence, E will be used in all the contexts where an expression is needed; we rarely expect a context where only a multiplicative expression (T) is allowed. Therefore E is a heavily used nonterminal whereas F and T are not. A similar pattern is found in real programming language grammars. This observation is used in choosing a rule from many correct rules: for example, a rule having E in its RHS is a better rule than one having T or F in its RHS. From the above observations, we hypothesize that rules which contain nonterminals with high usage are good rules. We therefore define the average usage of nonterminals in a rule as follows:

Definition 8.1 The average usage of the nonterminals occurring in a rule A → X1 ... Xm is

    average_usage(A → X1 ... Xm) = (1 / (l + 1)) × ( Σ_{i=1..l} usage(Yi) + usage(A) )

where Y1, ..., Yl are the nonterminals occurring in the RHS of the rule A → X1 ... Xm, and the usage of a nonterminal B is

    usage(B) = #rules of the form A → αBβ

We study the average usage of a rule as a rule selection criterion. As discussed, rules with higher usage⁴ are preferred; table 8.4 compares rules having the highest usage with other rules. In most cases this criterion selects a rule close to the removed rule; e.g. a rule with the highest usage in the case of the Matlab grammar and keyword case is

    stmt → CASE expr eostmt stmt_list

But in some instances this criterion also returns awkward rules; e.g. another rule with the highest usage corresponding to keyword case in the Matlab grammar is

    expr_stmt → CASE expr eostmt stmt_list

Addition of the above rule introduces a new successor relationship expr_stmt ⊲⋆ stmt_list in the grammar, which is not natural, as expr_stmt represents a statement containing expressions; it is not composed from a set of statements in the original Matlab grammar. Therefore, the average usage criterion is effective when it is applied to a reduced set of rules obtained after applying the TIMP and CLEV criteria.
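A minimal sketch of the usage criterion follows, reusing the production format of the earlier sketches. The extra statement rules below are our own toy additions, introduced to make E the heavily used expression nonterminal as the discussion above anticipates for real grammars.

    from collections import Counter

    def usage_counts(productions, nonterminals):
        """usage(B) = number of rules of the form A -> alpha B beta."""
        usage = Counter()
        for _lhs, rhs in productions:
            for b in set(rhs) & nonterminals:   # a rule counts once per nonterminal
                usage[b] += 1
        return usage

    def average_usage(rule, usage, nonterminals):
        """Definition 8.1: mean usage over the RHS nonterminals and the LHS."""
        lhs, rhs = rule
        ys = [s for s in rhs if s in nonterminals]
        return (sum(usage[y] for y in ys) + usage[lhs]) / (len(ys) + 1)

    # Toy grammar from this section plus a few statement rules using E.
    prods = [("E", ["E", "+", "T"]), ("E", ["T"]),
             ("T", ["T", "*", "F"]), ("T", ["F"]),
             ("F", ["(", "E", ")"]), ("F", ["ID"]),
             ("stmt", ["ID", "=", "E", ";"]),
             ("stmt", ["IF", "(", "E", ")", "stmt"]),
             ("stmt", ["RETURN", "E", ";"])]
    nts = {"E", "T", "F", "stmt"}
    u = usage_counts(prods, nts)                  # E: 5, T: 3, F: 2, stmt: 1
    with_E = ("stmt", ["WHILE", "(", "E", ")", "stmt"])
    with_F = ("stmt", ["WHILE", "(", "F", ")", "stmt"])
    assert average_usage(with_E, u, nts) > average_usage(with_F, u, nts)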

8.6

Discussion

The experiments conducted for studying different rule selection criteria show that TIMP and CLEV alone are not sufficient, but they should be used as an initial level of filtering because they remove some very awkward rules.

⁴ We use the terms usage and average usage interchangeably.

Table 8.4 Comparison of rules having the highest usage with other rules

Java / case
  highest usage: Block → CASE Expr COLON
  other:         Stmt → CASE Expr COLON Stmt

Java / enum
  highest usage: Block → ENUM QualifiedName ArrayInit Modifier TypeSpec FieldDecl
  other:         Block → ENUM QualifiedName ArrayInit Modifier TypeSpec FieldDecl

Java / for
  highest usage: Block → FOR ( INT LocalVarDeclStmt Expr ); Block → FOR ( LocalVarDeclAndStmt Expr )
  other:         Stmt → FOR ( LocalVarDeclAndStmt Expr ) Stmt; Stmt → FOR ( LocalVarDeclOrStmt LocalVarDeclAndStmt Expr )

Java / switch
  highest usage: Block → SWITCH ( Expr ) Block
  other:         LocalVarDeclStmt → SWITCH ( PostfixExpr ) Stmt

Java / while
  highest usage: Block → WHILE ( Expr ) Block
  other:         GuardingStmt → WHILE ( ForIncr ) Block

Cobol / move
  highest usage: clause → MOVE ID TO ID
  other:         clause → MOVE display_args TO using_id_list

Matlab / case
  highest usage: stmt → CASE expr eostmt stmt_list; expr_stmt → CASE expr eostmt stmt_list
  other:         global_stmt → CASE index_expr array_element stmt_list; graphical_stmt → CASE postfix_expr array_element stmt stmt_list

Matlab / for
  highest usage: expr → FOR expr EQU expr stmt_list END
  other:         stmt → FOR expr EQU stmt_list END eostmt

Matlab / otherwise
  highest usage: expr → OTHERWISE IDENT expr; expr → OTHERWISE expr
  other:         stmt → OTHERWISE stmt_list

Matlab / while
  highest usage: expr → WHILE expr eostmt stmt_list END; expr → WHILE stmt_list END
  other:         stmt → WHILE expr stmt_list END eostmt

C / for
  highest usage: stmt → FOR ( stmt_list expr ) stmt
  other:         if_stmt → FOR ( stmt_list initializer ) stmt_list

C / case
  highest usage: stmt → CASE expr COLON
  other:         stmt → CASE additive_expr COLON stmt_list

The weight criterion is also not sufficient alone. In this section we compare the weight and usage based rule selection criteria; table 8.5 compares the two. The first column shows the language and the construct removed, the second column shows the actual rule removed, the third column lists rules having the highest weight, and the fourth column shows rules having the highest usage. We observe that the usage based selection criterion outperforms in most of the cases (such cases are marked with a star in the table). Due to the initial assumption N = N′, the left hand sides of the selected rules do not match in some cases. This is not an important issue, because the removed rule's LHS is derivable from the selected rule's LHS; e.g. in the case of the Matlab grammar and the keyword while, the removed rule is

    iteration_stmt → WHILE expr stmt_list END eostmt

and the rule selected by the usage criterion is

    stmt → WHILE expr stmt_list END eostmt

where stmt ⇒⋆ iteration_stmt in the original grammar. Hence, the usage of a rule is a good criterion. Since the weight criterion sometimes returns overly compact rules, applying the usage based criterion to the rules having the highest weight would also return a compact rule; hence applying the weight criterion followed by the usage based criterion is not a good option.

In a few instances the usage criterion selects rules which are not close to the removed rule. For example, in the case of the Java grammar and keyword case, the highest usage rule is

    Block → CASE Expr COLON

whereas the removed rule is

    LabeledStmt → CASE ConstExpr COLON LocalVarDeclOrStmt

Table 8.5 Comparison of weight and usage as selection criteria

Matlab / case*
  removed:        case_statement → CASE expression eostmt statement_list
  highest weight: statement → CASE statement_list
  highest usage:  statement → CASE expression eostmt statement_list; iteration_statement → CASE expression eostmt statement_list; graphical_statement → CASE expression eostmt statement_list

Matlab / while*
  removed:        iteration_statement → WHILE expression statement_list END eostmt
  highest weight: statement → WHILE statement_list END statement_list
  highest usage:  statement → WHILE expression eostmt statement_list END eostmt; iteration_statement → WHILE statement_list END statement_list

Matlab / for*
  removed:        iteration_statement → FOR IDENT EQU expression statement_list END eostmt
  highest weight: statement → FOR statement_list END; assignment_statement → FOR statement_list END
  highest usage:  statement → FOR expression EQU statement_list END eostmt

Matlab / otherwise
  removed:        otherwise_statement → OTHERWISE statement_list
  highest weight: selection_statement → OTHERWISE expression statement; selection_statement → OTHERWISE statement_list
  highest usage:  expression → OTHERWISE IDENT expression

C / case*
  removed:        labeled_statement → CASE constant_expression COLON statement
  highest weight: statement → CASE conditional_expression COLON statement_list; selection_statement → CASE conditional_expression COLON statement_list
  highest usage:  statement → CASE expression COLON statement

C / switch*
  removed:        selection_statement → SWITCH ( expression ) statement
  highest weight: statement → SWITCH external_declarator; selection_statement → SWITCH external_declarator; jump_statement → SWITCH external_declarator
  highest usage:  statement → SWITCH ( expression ) statement; statement → SWITCH ( expression ); statement → SWITCH declarator statement

Java / case
  removed:        LabeledStmt → CASE ConstExpr COLON LocalVarDeclOrStmt
  highest weight: LocalVarDeclOrStmt → CASE CondiExpr COLON LocalVarDeclAndStmt
  highest usage:  Block → CASE Expr COLON

Java / enum
  removed:        NA
  highest weight: StaticInitilizer → ENUM QualifiedName VarInit Modifier FieldDecl
  highest usage:  Block → ENUM QualifiedName ArrayInit Modifier TypeSpec FieldDecl

Java / for
  removed:        IterationStmt → FOR LPARAN ForInit ForExpr ForIncr RPARAN Stmt
  highest weight: Stmt → FOR ( LocalVarDeclAndStmt CondExpr ) LocalVarDeclAndStmt
  highest usage:  Block → FOR ( INT LocalVarDeclStmt Expr )

Java / switch*
  removed:        SelectionStmt → SWITCH ( Expr ) Block
  highest weight: Block → SWITCH ComplexPrimary Block
  highest usage:  Block → SWITCH ( Expr ) Block

Java / while*
  removed:        IterationStmt → WHILE ( Expr ) Stmt
  highest weight: Block → WHILE ComplexPrimary Block
  highest usage:  Block → WHILE ( Expr ) Block

Cobol / move*
  removed:        clause → MOVE expr TO id_list
  highest weight: if_clause → MOVE ident_or_string loop_cond_part2
  highest usage:  clause → MOVE ID TO ID

In such cases, we found that the coverage of the highest usage rule is lower than the coverage of the removed rule. For example, the rule LabeledStmt → CASE ConstExpr COLON LocalVarDeclOrStmt (in the Java grammar) covers a substring of the form⁵ case (x > 0) : a = a + 100; whereas the rule Block → CASE Expr COLON covers only the substring case (x > 0) :. To evaluate the highest coverage as a selection criterion, we conducted a few experiments in which we first selected the rules having the highest coverage and then applied the TIMP and highest usage criteria for rule selection. Some of the rules selected by this combination of criteria are shown in table 8.6; the selected rules are reasonably good in most cases. Hence a higher coverage is more important than a higher weight, and a good rule selection ordering is to first filter the rules based on the highest coverage and then select the highest usage rule from the reduced set; the sketch after the table outlines this pipeline.

⁵ Familiarity with Java is assumed; x and a are variables.

Table 8.6 Rules selected by the highest coverage and highest usage criteria

C / case              statement → CASE expression COLON statement statement_list
C / for               statement → FOR ( statement_list expression ) statement_list
C / switch            statement → SWITCH ( expression ) statement_list
Java / case           Stmt → CASE Expression COLON LocalVarDeclAndStmt
Java / for            Stmt → FOR ( LocalVarDeclAndStmt Expr ) LocalVarDeclAndStmt
Java / switch         Block → SWITCH ( Expr ) Block
Java / while          Stmt → WHILE ( Expr ) LocalVarDeclAndStmt
Matlab / for          statement → FOR expression EQUAL statement_list END statement_list
Matlab / case         statement → CASE expression eostmt statement_list
Matlab / otherwise    expression_statement → OTHERWISE IDENTIFIER expression eostmt
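The following sketch summarizes the ordering that worked best in these experiments: filter the candidates by the grammar metrics (lowest TIMP), then by highest coverage, and finally pick the rule with the highest average usage. The three scoring functions are assumed to be supplied by the caller, for instance the sketches earlier in this chapter.

    def select_rule(candidates, timp_of, coverage_of, avg_usage_of):
        """Rule selection pipeline suggested by tables 8.1-8.6.
        candidates is a list of rules; timp_of(r) is the TIMP of the grammar
        extended with r; coverage_of(r) is the length of the substring r
        covers; avg_usage_of(r) is the average usage of definition 8.1."""
        best_timp = min(timp_of(r) for r in candidates)
        stage1 = [r for r in candidates if timp_of(r) == best_timp]

        best_cov = max(coverage_of(r) for r in stage1)
        stage2 = [r for r in stage1 if coverage_of(r) == best_cov]

        return max(stage2, key=avg_usage_of)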

8.7

Summary

In this chapter we examined the effectiveness of grammar based metrics in rule selection, where rules are extracted from a set of valid programs. The experiments show that grammar metrics are not sufficient even in controlled experiments. Two rule selection criteria are proposed and assessed in the same experimental setting: one closely follows the principle of minimum description length (MDL) and the other is based on common patterns found in programming languages. Experiments, as discussed in section 8.6, show that the proposed criteria, when coupled with grammar metrics, select reasonably good grammar rules.

Chapter 9

Conclusions

The goal of this thesis is to address the problem of grammar extraction/inference in the programming language domain, specifically when programming languages undergo growth (for example, when a dialect is created or a new version is released). We sought to achieve this goal by proposing an approach which extracts a set of rules when an incomplete grammar along with a set of programs is given as input; the addition of the extracted rules makes the initial grammar complete. Our approach assumes that the rules missing in the initial grammar follow certain properties; these properties have been derived from a study of grammars of programming languages and their dialects. This thesis does not target cases of a complete paradigm shift in programming languages, such as the extension of C to C++ or C#, because some of the additional rules in such scenarios do not follow the properties our approach assumes. For example, one of the additional rules in the C++ and C# grammars is statement → declaration; our approach may not extract this rule, as it does not contain any new terminal.

Through this work we have developed an understanding of the problem of context free grammar (CFG) inference in different domains and its theoretical limitations. Since the results related to CFG extraction are mostly negative, we have formulated a problem which considers a subset of the actual growth observed in programming language syntax. Since there are very few works which study the problem of grammar extraction in the programming language domain, this thesis has improved our understanding of the different issues in programming language grammar extraction. For example, the number of possible rules to be checked is very large even in a very restricted setting; we observed that this is mainly due to the abundance of unit productions in programming language grammars. Since exact learning of a grammar is not possible from positive samples (a set of valid programs) alone, we also address the problem of selecting a good rule from a set of correct rules. We have examined grammar based metrics as rule selection criteria and also proposed two other rule selection criteria; these criteria and their combinations are investigated critically on a few programming language grammars. In the next section we summarize the contributions of each chapter in more detail, and in section 9.2 we conclude the thesis with a set of future directions.

9.1

Contributions

Since the problem of grammar extraction has strong negative results, as shown in chapter 2, we carried out a critical study of programming languages and their dialects in order to formulate a reasonable problem definition under programming language growth. We have addressed the problem of extracting a complete grammar G′ = (N′, T′, P′, S′) from a set of programs 𝒫 and an incomplete grammar G = (N, T, P, S), where the relationship between G and G′ is as follows: N = N′, T ⊆ T′, P ⊆ P′ and S = S′, and the rules in (P′ − P) are of the form A → αaβ (A ∈ N, α, β ∈ (N ∪ T)∗, a ∈ (T′ − T)). This assumption is shown to cover the most commonly observed syntactic growth in programming language grammars.

A basic approach has been proposed for those cases where a single rule is sufficient to complete the initial grammar. We first proposed an approach which extracts a missing rule of the form A → aα and later extended it to rules of the form A → αaβ (A ∈ N, a ∈ (T′ − T), α, β ∈ (N ∪ T)∗). The approach builds a set of possible rules from the input program and then checks the correctness of each rule, returning the first correct rule it encounters. We verified the correctness of the approach and analyzed its time complexity. We then discussed the extension of the basic approach to multiple rule extraction. The approach for extracting multiple missing rules is iterative and involves backtracking: in each iteration a set of possible rules corresponding to a new terminal is built and one of them is added to the grammar. The modified grammar is checked for completeness after a fixed number of iterations; if the grammar is complete, the rules added in the different iterations are collectively returned as a set of correct rules, else the approach backtracks and selects another rule. We have also verified the correctness of this approach and analyzed its time complexity.

Experiments with the above approach show that the search space of possible rules is very large for real programming language grammars; hence we proposed a set of optimizations to reduce the number of possible rules. One optimization uses multiple programs to reduce the set of possible rules; another, called the unit production optimization, adds to the set of possible rules only those rules which have the most general right hand sides. Experiments show that these two optimizations significantly reduce the number of possible rules to be checked: the number of possible rules was reduced by a factor of about 100 in the C grammar and by a factor of 100–1000 in the Java and Matlab grammars. We have also proposed a few modifications to the CYK parsing algorithm to improve the performance of the rule checking phase; the modified algorithm checks the correctness of an RHS with respect to a set of possible LHSs. Experiments show that the proposed modification is either comparable to the unmodified CYK parser or takes less time to arrive at a complete grammar. We have also examined the order in which rules are evaluated for correctness. The proposed rule evaluation order closely follows the principle of minimum description length (MDL) and is shown experimentally to improve the process of grammar extraction on a set of programming languages.

A prototype of the approach has been implemented and made flexible enough that users can choose different combinations of optimizations. The implementation is shown to work on four programming language grammars, viz. C, Java, Matlab and Cobol, in both the single and the multiple missing rule cases. Besides, we conducted experiments to study the effect of the optimizations on real programming language grammars; in summary, the optimizations make the proposed approach feasible on real programming language grammars. Since the set of possible rules obtained after the unit production optimization may cause non-LR-ness in the grammar, experiments were conducted to study how often an LR parser fails to find a correct rule. We also conducted experiments to compare an LR parser based checker with a CYK parser based checker; from the results we observe that an LR parser always finds a correct and LR preserving rule in the reduced set of possible rules.

We also examined different criteria for selecting a grammar rule from a set of correct grammar rules. We first studied the use of grammar based metrics in rule selection; experiments showed that these metrics are not sufficient, as there can be many rules with the same metric value. We proposed two rule selection criteria: one is based on the length of the strings covered by rules and the other is based on the usage of nonterminals in the rules. Experiments were performed to assess the different rule selection criteria and their combinations. The results show that the first criterion sometimes returns an overly general rule which needs further refactoring, while the criterion based on nonterminal usage, when coupled with the grammar metrics, selects reasonably good rules.

9.2

Future Work

Although a large number of techniques exist for grammar inference, many of them remain unexplored for programming language grammars. This thesis is a step toward programming language grammar inference, and the results and observations drawn from the experiments in this work suggest several directions; we discuss some of them here.

The focus of our work is the extraction of a complete grammar from a set of correct programs (positive samples) and an incomplete grammar. We did not explore the learnability of the different kinds of growth observed in programming language grammars when both positive and negative samples are available. Existing works in grammatical inference discuss the learnability of subclasses of CFLs as a whole; they do not address learnability under different kinds of growth or extension, i.e., when the grammar is incrementally modified by adding a few rules. A possible direction is to study the learnability of different classes of syntactic growth in programming languages.

The proposed approach extracts rules which obey certain properties and covers a subset of the actual growth observed in PL grammars. Another future direction is to generalize the approach; a solution to a generalized approach will be much more involved due to the inherent theoretical limitations of CFG inference. A possible direction is to explore statistical learning techniques. There are a few statistical learning approaches for CFG inference [66], but most of them have shown their results either on artificial grammars or on natural language grammars. These techniques are guided by training data, which is not available in the problem scenario discussed in this thesis. It would be interesting to study the performance of existing unsupervised learning techniques, such as clustering, in the inference of programming language grammars.

Our approach assumes that the system knows the new terminals beforehand, i.e., the lexical analyzer does not misrecognize new terminals as identifiers or as existing terminals. One should explore techniques for automatically inferring this information from a set of programs¹. The current approach cannot handle cases where a missing rule contains more than one new terminal, such as the rule corresponding to a for-in-do-done construct. However, it can be extended to handle such situations with additional input from the user; for example, in the case of the for-in-do-done construct, the user would label only the first terminal (i.e. for) as a new terminal. The approach also cannot handle cases where there is more than one missing rule corresponding to a new terminal, or where the missing rules correspond to a construct with a matched pair of keywords, such as a begin-end construct. These situations are possible directions for future work.

Another possible area for future work is to explore the use of semantics in grammar extraction. For example, if we know that the semantics of the construct corresponding to keyword while is to execute a set of statements based on a conditional expression, we can use this information in extracting the grammar rule corresponding to while from a set of C programs. Oates et al. [44] have discussed an approach which uses semantics for inferring the grammar; investigating the use of semantics in programming language grammar inference is another direction. Semantics can also be useful in the selection of good rules.

¹ This information is given by the user in a YACC file.

References

[1] Pieter W. Adriaans. Language Learning for Categorial Perspective. PhD thesis, University of Amsterdam, Amsterdam, Netherlands, November 1992.
[2] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Education (Singapore) Pte. Ltd., 2007.
[3] V. Amar and G. Putzolu. On a family of linear grammars. Inform. and Control, 7(3):283–291, 1964.
[4] Dana Angluin. Inference of reversible languages. J. ACM, 29(3):741–765, 1982.
[5] Dana Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987.
[6] Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, 1988.
[7] Dana Angluin. Negative results for equivalence queries. Mach. Learn., 5(2):121–150, 1990.
[8] Elliot Berk. JLex: A lexical analyzer generator for Java™. URL: http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html.
[9] Shiladitya Biswas and S. K. Aggarwal. A technique for extracting grammar from legacy programs. In 22nd International Conference on Applied Informatics, pages 652–657, Innsbruck, Austria, Feb 2004.
[10] Noam Chomsky. Three models for the description of language. IRE Trans. on Infor. Theory, 2(12):113–124, 1956.
[11] Erzsébet Csuhaj-Varjú and Alica Kelemenová. Descriptional complexity of context-free grammar forms. Theor. Comput. Sci., 112(2):277–289, 1993.
[12] Colin de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348, 2005.
[13] Alpana Dubey, Sanjeev K. Aggarwal, and Pankaj Jalote. A technique for extracting keyword based rules from a set of programs. In CSMR '05: Proceedings of the Ninth European Conference on Software Maintenance and Reengineering, pages 217–225, Manchester, UK, 2005. IEEE Computer Society.
[14] Alpana Dubey, Pankaj Jalote, and Sanjeev K. Aggarwal. Learning context free grammar rules from a set of programs. Technical Report TRCS-2005-258, Indian Institute of Technology Kanpur, India, 2005. URL: http://www.cse.iitk.ac.in/report-repository/2005/report5.pdf.
[15] Alpana Dubey, Pankaj Jalote, and Sanjeev K. Aggarwal. A deterministic technique for extracting keyword based grammar rules from programs. In Proceedings of the 21st Annual ACM Symposium on Applied Computing, PL track, pages 1631–1632, Dijon, France, April 2006. ACM SIGAPP.
[16] Alpana Dubey, Pankaj Jalote, and Sanjeev Kumar Aggarwal. Inferring grammar rules of programming language dialects. In ICGI, pages 201–213, Tokyo, Japan, Sept 2006. Springer-Verlag LNCS.
[17] James F. Power and Brian A. Malloy. A metrics suite for grammar-based software. Journal of Software Maintenance and Evolution: Research and Practice (Special Issue: Analyzing the Evolution of Large-Scale Software), 16:405–426, 2004.
[18] Jeroen Geertzen and Menno van Zaanen. Grammatical inference using suffix trees. In Proceedings of the International Colloquium on Grammatical Inference (ICGI), Athens, Greece, pages 163–174, October 2004.
[19] E. Mark Gold. Language identification in the limit. Inform. and Control, 10(5):447–474, 1967.
[20] E. Mark Gold. Complexity of automaton identification from given data. Inform. and Control, 37(3):302–320, 1978.
[21] Peter Grünwald. A minimum description length approach to grammar inference. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 203–216, London, UK, 1996. Springer-Verlag.
[22] Colin de la Higuera. Characteristic sets for polynomial grammatical inference. Mach. Learn., 27(2):125–138, 1997.
[23] Colin de la Higuera and José Oncina. Inferring deterministic linear languages. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 185–200, London, UK, July 2002. Springer-Verlag.
[24] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
[25] David Horn, Zach Solan, Eytan Ruppin, and Shimon Edelman. Unsupervised language acquisition: syntax from plain corpus. In Newcastle Workshop on Human Language, Feb 2004.
[26] Scott Hudson. LALR parser generator for Java. URL: http://www.cs.princeton.edu/~appel/modern/java/CUP/.
[27] Rahul Jain, Sanjeev Kumar Aggarwal, Pankaj Jalote, and Shiladitya Biswas. An interactive method for extracting grammar from programs. Softw. Pract. Exper., 34(5):433–447, 2004.
[28] Capers Jones. Estimating Software Costs. McGraw-Hill, New York, 1998.
[29] T. Kasami. An efficient recognition and syntax analysis algorithm for context free languages. Technical Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, MA, 1965.
[30] Takeshi Koshiba, Erkki Mäkinen, and Yuji Takada. Learning deterministic even linear languages from positive examples. Theor. Comput. Sci., 185(1):63–79, 1997.
[31] R. Lämmel and C. Verhoef. Cracking the 500-language problem. IEEE Software, pages 78–88, November/December 2001.
[32] R. Lämmel and C. Verhoef. Semi-automatic grammar recovery. Software: Practice & Experience, 31(15):1395–1438, December 2001.
[33] Ralf Lämmel. Grammar testing. In Proceedings of Fundamental Approaches to Software Engineering (FASE) 2001, volume 2029 of LNCS, pages 201–216. Springer-Verlag, 2001.
[34] Ralf Lämmel and Chris Verhoef. VS COBOL II grammar version 1.0.4. URL: http://www.cs.vu.nl/grammars/vs-cobol-ii/.
[35] Pat Langley and Sean Stromsten. Learning context-free grammars with a simplicity bias. In ECML '00: Proceedings of the 11th European Conference on Machine Learning, pages 220–228, London, UK, 2000. Springer-Verlag.
[36] Marc M. Lankhorst. Breeding grammars: Grammatical inference with a genetic algorithm. Technical Report CS-R9401, University of Groningen, Netherlands, 1994.
[37] Steve Lawrence, C. Lee Giles, and Sandiway Fong. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12(1):126–140, 2000.
[38] J. A. Laxminarayana and G. Nagaraja. Inference of a subclass of context free grammars using positive samples. In ECML/PKDD 2003 Workshop and Tutorial on Learning Context-Free Grammars, pages 29–40, Cavtat-Dubrovnik, Croatia, Sept 2003.
[39] Lillian Lee. Learning of context-free languages: A survey of the literature. Technical Report TR-12-96, Harvard University, 1996. URL: ftp://deasftp.harvard.edu/techreports/tr-12-96.ps.gz.
[40] E. Mäkinen. A note on the grammatical inference problem for even linear languages. Fundamenta Informaticae, 25(2):175–182, 1996.
[41] Marjan Mernik, Goran Gerlič, Viljem Žumer, and Barrett Bryant. Can a parser be generated from examples? In Proceedings of the 18th ACM Symposium on Applied Computing, pages 1063–1067. ACM Press, 2003.
[42] U. Möncke and R. Wilhelm. Iterative algorithms on grammar graphs. In Eighth Conference on Graph-Theoretic Concepts in Computer Science, Cavtat, Croatia, 1982.
[43] K. Nakamura. Incremental learning of context free grammars by extended inductive CYK algorithm. In ECML/PKDD 2003 Workshop and Tutorial on Learning Context-Free Grammars, Cavtat, Croatia, Sept 2003.
[44] T. Oates, T. Armstrong, J. Harris, and M. Nejman. Leveraging lexical semantics to infer context-free grammars. In ECML/PKDD 2003 Workshop and Tutorial on Learning Context-Free Grammars, pages 65–76, 2003.
[45] J. Oncina and P. Garcia. Identifying regular languages in polynomial time. Advances in Structural and Syntactic Pattern Recognition, pages 99–108, 1992.
[46] J. Oncina and P. Garcia. Inferring regular languages in polynomial update time. Pattern Recognition and Image Analysis, pages 46–61, 1992.
[47] Rajesh Parekh and Vasant Honavar. Grammar inference, automata induction, and language acquisition. Invited chapter in Dale, Moisl and Somers (Eds.). New York: Marcel Dekker, 2000.
[48] G. Petasis, G. Paliouras, V. Karkaletsis, C. Halatsis, and C. D. Spyropoulos. e-GRIDS: Computationally efficient grammatical inference from positive examples. GRAMMARS, 7:69–110, 2004.
[49] Leonard Pitt and Manfred K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. J. ACM, 40(1):95–142, 1993.
[50] P. Purdom. A sentence generator for testing parsers. BIT, 12(3):366–375, 1972.
[51] V. Radhakrishnan and G. Nagaraja. Inference of even linear grammars and its application to picture description languages. Pattern Recognition, 21(1):55–62, 1988.
[52] Yasubumi Sakakibara. Learning context-free grammars from structural data in polynomial time. In COLT '88: Proceedings of the First Annual Workshop on Computational Learning Theory, pages 330–344, San Francisco, CA, USA, 1988. Morgan Kaufmann Publishers Inc.
[53] Yasubumi Sakakibara. Learning context-free grammars from structural data in polynomial time. Theor. Comput. Sci., 76(2-3):223–242, 1990.
[54] Yasubumi Sakakibara. Grammatical inference in bioinformatics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1051–1062, July 2005.
[55] Yasubumi Sakakibara. Learning context-free grammars using tabular representations. Pattern Recognition, 38:1372–1383, 2005.
[56] Yasubumi Sakakibara and Mitsuhiro Kondo. GA-based learning of context-free grammars using tabular representations. In ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning, pages 354–360, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[57] Yasubumi Sakakibara and Hidenori Muramatsu. Learning context-free grammars from partially structured examples. In ICGI '00: Proceedings of the 5th International Colloquium on Grammatical Inference, pages 229–240, London, UK, 2000. Springer-Verlag.
[58] Geoffrey Sampson. Exploring the richness of the stimulus. Special issue of The Linguistic Review, 19:73–104, 2002.
[59] Geoffrey Sampson. The "Language Instinct" Debate. Continuum International, 2005.
[60] Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised context sensitive language acquisition from a large corpus. In Proceedings of NIPS-2003, Whistler, British Columbia, Canada, Dec 2003.
[61] Z. Solan, E. Ruppin, D. Horn, and S. Edelman. Automatic acquisition and efficient representation of syntactic structures. In Proceedings of NIPS-2002, Vancouver, British Columbia, Canada, Dec 2002.
[62] Brad Starkie. Left aligned grammars: Identifying a class of context-free grammar in the limit from positive data. In ECML/PKDD 2003 Workshop and Tutorial on Learning Context-Free Grammars, pages 89–100, Cavtat-Dubrovnik, Croatia, 2003.
[63] Yuji Takada. Grammatical inference for even linear languages based on control sets. Inf. Process. Lett., 28(4):193–199, 1988.
[64] Masaru Tomita. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Norwell, MA, USA, 1985.
[65] Masaru Tomita. Graph-structured stack and natural language parsing. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pages 249–257, Morristown, NJ, USA, 1988. Association for Computational Linguistics.
[66] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[67] E. Ukkonen. Lower bounds on the size of deterministic parsers. Journal of Computer and System Sciences, 26(2):153–170, 1983.
[68] Menno van Zaanen. ABL: Alignment-based learning. In COLING 2000: Proceedings of the 18th International Conference on Computational Linguistics, pages 961–967, Saarbrücken, Germany, Aug 2000.
[69] Menno van Zaanen. Bootstrapping Structure into Language: Alignment-Based Learning. PhD thesis, University of Leeds, Leeds, UK, January 2002.
[70] Matej Črepinšek, Marjan Mernik, Faizan Javed, Barrett R. Bryant, and Alan Sprague. Extracting grammar from programs: evolutionary approach. SIGPLAN Not., 40(4):39–46, 2005.
[71] Matej Črepinšek, Marjan Mernik, and Viljem Žumer. Extracting grammar from programs: brute force approach. SIGPLAN Not., 40(4):29–38, 2005.
[72] X. Automatic distillation of structures. URL: http://adios.tau.ac.il/index.html.
[73] X. Berkeley YACC. URL: http://sourceforge.net/projects/byacc/.
[74] X. Bison parser generator. URL: http://www.gnu.org/software/bison/manual/.
[75] X. Compilers & interpreters. URL: http://www.angelfire.com/ar/CompiladoresUCSE/COMPILERS.html.
[76] X. Forté migration services. Project page at http://www.goldstonetech.com/services/forte_migration.htm.
[77] X. Goldstone Technologies. Company web page: http://www.goldstonetech.com/.
[78] X. Grammar-manipulation heuristics. URL: http://www.cs.man.ac.uk/~pjj/complang/heuristics.html.
[79] X. Human Nature: Justice versus Power. Noam Chomsky debates with Michel Foucault. URL: http://www.chomsky.info/debates/1971xxxx.htm.
[80] X. JDK™ 5.0 documentation. URL: http://java.sun.com/j2se/1.5.0/docs/index.html.
[81] X. LALR parser generator for Java. URL: http://www2.cs.tum.edu/projects/cup/.
[82] X. RC - safe, region-based memory-management for C. URL: http://berkeley.intel-research.net/dgay/rc/index.html.
[83] X. UNH C* - a data parallel dialect of C. URL: http://www.cs.unh.edu/~pjh/cstar/.
[84] Takashi Yokomori. Polynomial-time identification of very simple grammars from positive data. Theor. Comput. Sci., 298(1):179–206, 2003.
[85] D. H. Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, Feb 1967.
