University of Rome, Tor Vergata
Faculty of Engineering
Academic Year 2004/2005

A Master's Degree Thesis in Computer Science Engineering on

Engineering Tree Kernels for Semantic Role Labelling Systems

by Daniele Pighin
Thesis Advisor: Roberto Basili
Co-Advisor: Alessandro Moschitti
Acknowledgements
This thesis is the result of more than one year of work and research. During this period I've been working with many people who have made this experience instructive, fun and definitely worth living. The greatest thanks go to my Advisor and Co-Advisor, Prof. Roberto Basili and Dr. Alessandro Moschitti: they taught me everything I know about the topics I discuss herein, and spent a lot of their time guiding me, supporting my activities and correcting my mistakes. I'd also like to express my gratitude to Bonaventura Coppola and Ana Maria Giuglea for their precious contribution to the development of the CoNLL 2005 system and for the nice time spent together. I'd also like to thank Prof. Maria Teresa Pazienza and all the people of the AI-NLP group here at Tor Vergata: they are all very nice people to have around and I'm glad I had the opportunity to work side by side with them and learn from their knowledge and expertise. A very special mention goes to Marco Cammisa, who in the last few months has been taking over part of my duties here at the lab, granting me more time to work on this thesis. I definitely owe him one!
And then there are my friends, all of them. We had a great time together, making jokes, taking the longest coffee breaks, playing pool, going shopping at the most inappropriate moments, eating over-sized dinners, talking about ourselves and chattering about each other. They are all very good people, and I'm lucky I can call them friends. Finally, I wouldn't be here without the constant, absolute and sincere support of my family. Some of them are not easy people to deal with, some like arguing, some behave quite strangely and some are somewhat capricious, yet they are all great people, they love me and I love them unconditionally, all of them.
In memory of my Granny, your pride would be my strength.
Abstract
Semantic Role Labelling (SRL) is a complex Natural Language Processing (NLP) task that has received a lot of attention in recent years. An accurate shallow semantic parser, one that recognizes predicate-argument structures in a sentence and assigns each argument a semantic (or thematic) role, could be a key component of larger NLP architectures, human-machine interaction (e. g. high level, semantic-oriented browsing and search systems), information extraction and dialogue based systems, just to name a few. The recognition of semantic structures within a sentence relies on lexical and syntactic information provided by earlier stages of an NLP process, such as lexical analysis, POS (part of speech) tagging and syntactic parsing. The complexity of the SRL task mostly lies in two facts: (a) this information is generally noisy, i. e. in a real-world scenario the accuracy and reliability of NLP subsystems are generally not very high; (b) the lack of a sound and complete linguistic or cognitive theory about the links between syntax and semantics does not allow an informed, deductive approach to the problem. Still, the large amount of lexical and syntactic information available allows for an inductive approach to the SRL task, which indeed is generally
treated as a combination of statistical classification problems. This solution poses two main problems: (a) feature selection, i. e. given a lot of structural information, how to select the features that are relevant for the learning task; (b) feature engineering, i. e. how to represent the features so as to preserve their relevance while making the learning task as easy as possible. Features are typically represented as attribute-value pairs, e. g. X : Y , which associate a value Y with a property X of a training or test example. Tree kernels provide a viable alternative to the manual design of linguistic features, as they evaluate the similarity between two syntactic structures without requiring the explicit design and extraction of their attribute-value representations. In 2004 and 2005, the CoNLL (Computational Natural Language Learning) conference, the annual meeting organized by the Association for Computational Linguistics (ACL) Special Interest Group on Natural Language Learning (SIGNLL), ran a shared task on an SRL linguistic model inspired by Levin's verb classes. Two hand-annotated linguistic resources, the Penn TreeBank and the PropBank projects, provide a statistically representative data set covering both the syntactic and the semantic layers that can be used to train and evaluate supervised statistical machine learning algorithms on the SRL task. A research team from our University took part in the 2005 competition with a system architecture that used traditional features for the boundary detection (i. e. recognizing which nodes of a syntactic parse tree are likely to dominate all the words of a predicate's argument) and argument classification (i. e. assigning the proper label to each identified candidate argument) sub-tasks, while employing tree kernels to resolve the problem of overlapping nodes (i. e. pairs of nodes, one dominating the other, that are both assigned a semantic role, which is not consistent with the underlying linguistic model). The work on this thesis began with a software contribution to the
CoNLL 2005 system and continued for the last year and a half with several extensions to the SRL model, mostly regarding the introduction of tree kernels into many SRL sub-tasks and the addition of new stages of processing. We ran many experiments that confirmed the general validity of our approach. Position papers describing the employed models and the outcome of the experiments have been accepted at several machine learning, computational linguistics and natural language processing conferences and workshops, namely ACL 2005 (Workshop on Feature Engineering for Machine Learning in Natural Language Processing), EACL 2006 (Workshop on Learning Structured Information in Natural Language Applications), CoNLL 2006 and ECAI 2006. Our latest model employs a tree kernel based re-ranking mechanism that recognizes the most likely predicate argument structures among a set of alternatives proposed by a joint probabilistic model. With this model, our University's system has improved its rank by five positions with respect to the CoNLL 2005 competition, in which it originally placed 8th. In the near future, we plan to further investigate this approach, which grants us an upper-bound performance far above current state of the art systems. We believe that finer tuning of the model and the employed learning algorithms could give us another performance boost and result in a new state of the art performance.
Contents

1 Introduction
  1.1 A Matter of Information
  1.2 Natural Language Processing
  1.3 Shallow Semantic Parsing
  1.4 Statistical Natural Language Processing and Machine Learning
  1.5 Summary

2 Semantic Role Labelling
  2.1 Linguistic Resources
    2.1.1 FrameNet
    2.1.2 PropBank and VerbNet
  2.2 The CoNLL Shared Task on Semantic Role Labelling

3 Machine Learning for Semantic Role Labelling
  3.1 Statistical Classifiers
  3.2 Features and Feature Spaces
    3.2.1 Linear Features
    3.2.2 Kernel Methods and the Kernel Trick
    3.2.3 Kernel Functions and Feature Space Separability
  3.3 Support Vector Machines

4 Semantic Role Labelling Models
  4.1 Anatomy of SRL Systems
  4.2 Features for the SRL task
    4.2.1 Linear Features
    4.2.2 Structured Features: Tree Kernels
  4.3 Literature Systems

5 A Kernelized Semantic Role Labelling System
  5.1 Feature Engineering using Tree Kernels
    5.1.1 Overlap Resolution
    5.1.2 Boundary Detection and Argument Classification
  5.2 Re-ranking Propositions
    5.2.1 A Probabilistic Interpretation of SVM Output
    5.2.2 Identifying the Best Candidate Propositions
  5.3 Features for the Re-ranking Task
    5.3.1 Linear Features
    5.3.2 Structural Features

6 System Evaluation
  6.1 Evaluation of CoNLL 2005 Systems
  6.2 Experiments
    6.2.1 Overlap Resolution
    6.2.2 Boundary Detection and Argument Classification
    6.2.3 Proposition Re-ranking

7 Conclusions

Bibliography
List of Algorithms

1  Description of functions and procedures used in the definition of other algorithms.
2  An iteration of the Viterbi algorithm.
3  Verb voice identification algorithm.
List of Figures

1.1  General architecture of a natural language processing system.
2.1  A simple frame-and-slot template for a train reservation system.
2.2  Frame-based inference mechanism for a QA system.
2.3  A PropBank annotation on a syntactic parse tree.
3.1  A (binary) decision tree for the problem of choosing whether to buy a car or not.
3.2  Life cycle of a classifier.
3.3  Representation of a 2-dimensional classification problem.
3.4  Different functions can successfully separate the same training set.
3.5  Linear separability in R^2.
3.6  A mapping φ that makes a set of training points linearly separable.
3.7  Graphical representation of a neuron.
3.8  Graphical representation of a perceptron.
3.9  Geometric margin of a linear classifier in R^2.
3.10 Hard (a) versus soft (b) margin hyperplanes.
4.1  A highest-level view of an SRL system.
4.2  Predicate Extractor.
4.3  Candidate Extractor.
4.4  Feature Extractor.
4.5  Boundary Classifier.
4.6  Argument Classifier.
4.7  Overlap resolution example.
4.8  Overlap Resolver.
4.9  Flowgraph of a typical SRL system.
4.10 The Path feature.
4.11 A syntactic parse tree and the corresponding grammar production rules.
4.12 Examples of SubTree (b), SubSet Tree (c) and Partial Tree (d) structures for a same parse tree (a).
5.1  A sentence parse tree with two AST_N structures.
5.2  An overlap situation and the different marking strategies adopted for its resolution.
5.3  AST_1 relative to the argument Arg1 of the predicate delivers.
5.4  AST_1 (a) and AST_1^m (b) structures extracted for the same target argument, with their respective common fragment spaces (c, d).
5.5  AST_N^cmt relative to the argument Arg1 of the predicate delivers.
5.6  A Probabilistic SVM classifier.
5.7  Traversal strategy adopted by the Viterbi algorithm.
5.8  Flowgraph of a re-ranking SRL system.
5.9  Parse tree of the example sentence. The target predicate "plunge" and its arguments are highlighted.
5.10 AST_N^cm representation of the example proposition.
5.11 AST_N^fl representation of the example proposition.
5.12 PAS representation of the example proposition.
5.13 PAS^fl representation of the example proposition.
5.14 PAS^t representation of the example proposition.
6.1  Learning curve of the AST_1 and AST_1^m split classifiers.
List of Tables

2.1  List of PropBank adjunct roles.
4.1  Referenced CoNLL 2005 systems.
4.2  Main properties of the SRL strategies implemented by the top-scoring participant teams at the CoNLL 2005 shared task.
4.3  Main features used by the top-scoring participant teams at the CoNLL 2005 shared task.
6.1  Performance of the top scoring systems on the CoNLL shared task.
6.2  Two-step boundary classification performance using the traditional boundary classifier (TBC), the random selection of non-overlapping structures (RND), the heuristic to select the most suitable non-overlapping node set (HEU) and the AST_N^ord structures classifier.
6.3  Classifiers' performance for the overlap resolution task on automatic parse trees.
6.4  Semantic role labelling performance on automatic parse trees using different AST_N structures.
6.5  Tree nodes of the sentences from sections 2, 3 and 24 of the PropBank.
6.6  Boundary detection performance of the monolithic and split models using MAST and CAST structural features.
6.7  Accuracy produced by different tree kernels on argument classification.
6.8  Variation of the upper and lower bound of the re-ranking SRL system on section 23 of the PropBank when modulating the number N of options output by the Viterbi algorithm.
6.9  Performance comparison on the SRL task (Section 23) between different tree kernels and kernel combinations.
6.10 Results of the combination of the PAS^tl tree kernel with predicate argument linear features (PF) and with the addition of each proposition argument's role-labelling features (AF).
Chapter 1
Introduction
1.1 A Matter of Information

In 1985, the diplomat, journalist and editor Wilson Dizard coined in one of his books the expression Information Age to describe the last quarter of the 20th century. By that time, the movement of information had become faster than physical movement, and sending a representation of the informative contents of some document was a far more efficient and convenient solution than sending the document itself [Dizard, 1985]. This age, which is still in full bloom, is marked by the increased production, transmission, consumption of and reliance on information. Newspapers, television channels, the Internet with its e-mail, blogs, personal pages, news sites, portals and e-shops, mobile phones and portable devices that can store and replay contents of the most diverse type and size, from simple text messages to whole e-books, video games and movies - all of this has changed our life in so many aspects that, sometimes, we have a
hard time even recalling how it was before, not to mention how difficult it is to fall back on our old habits when we are forced to. As an example, think of the sense of isolation and limitation that most of us feel when our LAN or Internet connection goes down; or of the incredible ease with which a programmer can, in most cases, find a ready-to-use solution to some algorithmic problem or exotic, undocumented error, thanks to some unknown benefactor who faced the same problem before and chose to share his experience with the community. Still, this ever increasing amount of information carries along a problem of accessibility. That is, all the information has to be organized somehow, and there must be ergonomic systems that allow the addressees of these acts of communication (us, the users) to retrieve interesting documents or fragments of information in a reasonable time and through proper interfaces. Though this huge galaxy of documents grows larger and larger and new file types and data exchange formats take over the old ones, it is important to notice that the kind of information that is represented, what we need to store, catalogue and retrieve, is ultimately always the same: a linguistic act, either verbal or textual, in some form of natural language. This is generally true, since any form of communicative act that carries no linguistic information per se can be (and indeed usually is) associated with some comment, or meta-data, that we users and our technology can use for our collecting purposes. For example, a spreadsheet of data makes little or no sense if it is not accompanied by an expressive naming of rows and columns or some contextual information; silent movies can be associated with their title, script or reviews; pictures in our digital cameras are often given a digital comment, while plain old paper pictures usually have some note scribbled on their back, or right next to them in a photo album; DVDs, compact discs, old LPs and any other form of audio/video support usually come with an informative booklet; digital multimedia files use ID3 tags or similar solutions to store relevant meta-data, and the list of examples could go on.
So, there is a lot of non-linguistic communication that can somehow be represented as textual or verbal information. Nonetheless, most of the data that we use and need to access on an everyday basis is already expressed as linguistic communication: newspapers, TV debates and news, our mail, the documents that we have to produce and read to do our job, the greatest part of the Internet, the instructions to install a new appliance, the manual of a new application program, the book on the chair next to our bed - just to name a few. And if we want to be at ease in this Information Age, if we want to exploit all this potentially useful information without being overwhelmed by it, then our computer systems must be able to handle and perform some reasoning upon natural language documents.
1.2 Natural Language Processing

The field of computer science devoted to the development of models and technologies that allow computers to deal with natural language is known as Natural Language Processing (NLP). In [Lee, 2004], Lillian Lee provides an effective and synthetic definition of it:

Natural language processing, or NLP, is the field of computer science devoted to [. . .] enabling computers to use human languages both as input and as output. The area is quite broad, encompassing problems ranging from simultaneous multi-language translation to advanced search engine development to the design of computer interfaces capable of combining speech, diagrams, and other modalities simultaneously. A natural consequence of this wide range of inquiry is the integration of ideas from computer science with work from many other fields, including linguistics, which provides models of language; psychology, which provides models of cognitive processes; information theory, which provides models of communication; and mathematics and
statistics, which provide tools for analyzing and acquiring such models.

The very notion that natural language could be handled by computers grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in the work of Frege, Russell, Wittgenstein, Tarski, Lambek and Carnap. As a consequence, the notion of language as a formal system that could be processed by machines began to spread among the scientific community. But three major achievements in the field of logic were fundamental for NLP to flourish:

1. formal language theory - mostly due to the work of Noam Chomsky [Chomsky, 1956], [Chomsky, 1959] - defined a language as a set of strings accepted by a class of automata, such as context-free languages and pushdown automata, and laid the foundations for computational syntax;

2. symbolic logic provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and first-order logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal calculus;

3. the principle of compositionality - the notion that the meaning of a complex expression is comprised of the meaning of its parts and their mode of combination - provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively. Today, this approach is most
clearly manifested in a family of grammar formalisms known as "unification based" grammars, and in NLP applications implemented in the Prolog programming language [Gazdar and Mellish, 1989].

The general architecture of an NLP system is represented in Figure 1.1: each level of the analysis is associated with a separate task and enriches the input data with a layer of linguistic information. The first stage is the lexical analysis of the input text, the process by which an input string representing a natural language text is transformed into a sequence of symbols called lexical tokens, or just tokens. The tokens are elements of a lexicon, a collection of the words comprising the language along with their morphological variations and their grammatical function (part-of-speech, POS), i. e. their morphological features. The second stage is syntactic analysis, consisting in the identification of the rules that govern the way the words in a sentence are arranged, resulting in the grouping of words into phrases, clauses and sentences. The rules that can be applied to transform the elements of the lexicon define the grammar of the target language and determine the structure of the resulting sentences. This kind of analysis is generally referred to as syntactic parsing, in that the input data is broken into distinct chunks of information and given an internal structure so that it can be more easily interpreted and acted upon. Whereas syntactic analysis determines how a text is structured, semantic analysis (or semantic parsing) is the process of assigning a meaning to tokens, phrases, clauses and sentences, i. e. of representing them in terms of language-independent concepts and the relations among them. A world model provides the representation of the domain and is used to reason on the objects and their interactions. The output representation is called a logic form, as it is typically based on first order logic. Pragmatic analysis, the last stage of the reasoning process, uses an application model to evaluate the costs, aftermaths and utility of the possible
decisions or courses of action with respect to the objectives of the application.

Figure 1.1: General architecture of a natural language processing system.

Though NLP has come a long way since its beginnings, NLP systems are generally quite complex and difficult to build. This is mostly due to the complexity of language itself and the consequent difficulty in designing linguistic models that can account for all the aspects of morphology, syntax and semantics of natural languages. In fact, almost every word has multiple meanings, e. g. arm as a part of the body or as a weapon, and it is quite common for a word to play different roles in terms of part of speech, e. g. face can be either a noun (as in look at your face) or a verb (you have to face the truth). Ambiguity affects all the layers of a natural language text. While in some cases a careful analysis of one layer can help the disambiguation of the others, e. g. a correct morphological interpretation can help in recognizing the correct syntactic structure of a sentence, the converse situation is also possible and indeed quite common. On this issue, there are cases in which some feedback between the modules of the architecture outlined in Figure
1.1 could improve the accuracy of the system. As an example, a highly confident recognition of a verb's sense and expected argument structure can help correct a wrong syntactic analysis. Still, there are many cases in which ambiguity simply cannot be resolved, as in the sentence "Paul saw her duck": without further information it is not possible (even for us) to say whether Paul saw the duck belonging to some girl or rather some girl in the act of ducking. A second problem with handling natural language computationally is the fact that language is not static. Phonological, morphological and syntactic systems change over time, as sounds, words and concepts flow in and out of the current use of a language. Languages change not only over time, but also across different social classes, professional contexts and invisible geographic borders. This is one of the reasons why NLP systems built for limited and very specific domains have proven to be very successful: in restricted contexts the ambiguity is reduced, and the amount of real world knowledge that needs to be incorporated in the system becomes manageable.
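Before moving on, the staged architecture of Figure 1.1 can be made more concrete with a minimal sketch, in Python, of how such a pipeline could be wired together. The Document container and the stage functions below are hypothetical placeholders introduced only for illustration; they are not components of any actual system discussed in this thesis:

    # A toy NLP pipeline: each stage enriches the document with a new layer
    # of linguistic information, mirroring the architecture of Figure 1.1.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        text: str
        tokens: list = field(default_factory=list)   # lexical analysis
        parse_tree: object = None                    # syntactic analysis
        logic_form: object = None                    # semantic analysis
        action: object = None                        # pragmatic analysis

    def lexical_analysis(doc, lexicon):
        # split the input string into tokens and attach a POS tag from the lexicon
        doc.tokens = [(w, lexicon.get(w, "UNK")) for w in doc.text.split()]
        return doc

    def run_pipeline(text, lexicon, parser, interpreter, planner):
        doc = lexical_analysis(Document(text), lexicon)
        doc.parse_tree = parser(doc.tokens)            # grammar-driven parsing
        doc.logic_form = interpreter(doc.parse_tree)   # world-model interpretation
        doc.action = planner(doc.logic_form)           # application-model decision
        return doc

Here parser, interpreter and planner stand for the syntactic, semantic and pragmatic modules of Figure 1.1, each consuming the layer produced by the previous stage.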
1.3 Shallow Semantic Parsing

Semantic analysis of natural languages can be performed at different levels of granularity: single words can be associated with the concepts they refer to; word sequences, phrases and clauses can be related to each other; the meaning of a discourse or a dialogue as a whole can be investigated. The latter is called a Natural Language Understanding (NLU) problem, and is sometimes referred to as an AI-complete problem by analogy to NP-completeness in complexity theory [Mallery, 1988]. In fact, some problems that should be addressed by an NLU system, such as anaphora resolution or the handling of quantifiers in logic inference, require so much linguistic, ontological, pragmatic and contextual information that it seems unlikely that
they can be successfully addressed and resolved in an open domain fashion. A softer approach to semantic analysis consists in restricting the scope of the problem, limiting the study to the development of models that establish semantic relations between words within a sentence, or between groups of words in a sentence and their linguistic context. This kind of analysis is called Shallow Semantic Parsing (SSP), the adjective shallow stressing the fact that the models at study do not pretend to capture the whole semantics of a sentence, or document, or discourse as a whole; rather, they focus on some very precise aspects of the semantics of natural language texts, assuming that such aspects:

• can be represented with appropriate (and measurable) accuracy;

• are interesting for the specific application domain.

SSP is quite a vast topic in itself and encompasses many different research fields, from lexical semantics, i. e. assigning a meaning to words in a text, to verb classification, to the recognition of the semantic relations that can be associated with a syntactic structure. Most of these approaches aim to provide some clues about the mechanisms of human semantic knowledge, for example by combining purely inductive and psychologically motivated models for a general purpose verb classification task, as discussed in [Basili et al., 1996b]. Shallow semantic analysis is a critical task in most NLP applications. It can be used to enrich a syntactic analysis with semantic information, as well as to provide richer input to the first stages of NLP systems and cut down a problem's ambiguity sources along with the space of its possible solutions. Many semantic analysis approaches base their reasoning on some linguistic resource and the underlying linguistic model. In this field, a great deal of research has been inspired by psycholinguistics, the branch of cognitive psychology that studies the psychological basis of linguistic competence and performance. WordNet [Miller, 1995, Miller, 1993], a semantic
lexicon grounded on this theory's models, defines a concept in terms of the set of synonyms that can be used to represent it, and provides a rich set of lexical and semantic links. The resulting network is at the core of many inference models that, for example, have been used to define conceptual similarity metrics [Agirre and Rigau, 1996] and to perform semantic generalization and word sense disambiguation [Basili and Cammisa, 2004] tasks. An example of a completely different, empirical approach to semantic analysis is represented by Latent Semantic Analysis (LSA) [Landauer and Dumais, 1997, Landauer et al., 1998], a theory and method for extracting and representing the meaning of words with respect to their context by statistical computations applied to a large corpus of text. It is based on the idea that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. These examples should give a feel for the variety of models, methods and applications in the field of semantic analysis. Indeed, shallow semantic parsing techniques are used in many NLP systems, and in some cases are key components in the economy of the target applications. This is the case of question answering [Voorhees, 2001] and automatic text summarization [Radev et al., 2002] systems, just to name a few. SSP models and methodologies are not only useful components of larger NLP architectures: in fact, a correct and reliable semantic analysis is a mandatory pre-requisite for any ergonomic, possibly dialogue and speech based, human-machine interaction. The semantic layer of natural languages has some major characteristics that make it especially interesting in this perspective:

• it provides a very high level of abstraction, as it does not represent how some information is expressed but what it actually means. For example,
the quasi logical form together(John, X, Y ), person(X) can represent all the instances of a person X who was together with John at some location Y. Such instances will eventually be represented in many different ways in the collection of searched documents, e. g. :

– Paul met John at the hotel;

– Yesterday Steven and John had dinner together;

– John has been playing backgammon with his brother all night long;

– John and I spent some time together last month.

• it makes it easy to tune a search by relaxing or reinforcing semantic restrictions: if the query were together(John, X, Y ) then X would not be required to be a person, e. g. it could be a pet, or a feeling as in John was alone with his despair, whereas if it were together(John, X, Y ), person(X), relatives(John, X) then X should be a person and a relative of John's (a minimal sketch of this kind of query evaluation is given after this list);

• it allows cross-lingual reasoning, as the same quasi logical form provides a language independent representation of situations and events, provided there is a mapping between the world models describing the different corpora.
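The following Python fragment is a minimal, illustrative sketch of the kind of reasoning described above: quasi logical forms are represented as predicate tuples, and a query is answered by matching its predicates against extracted facts while enforcing (or relaxing) restrictions on the variables. The tiny fact base and the helper names are hypothetical and only serve to make the idea concrete:

    # Facts extracted from text, encoded as predicate tuples.
    facts = [
        ("together", "John", "Paul", "hotel"),
        ("together", "John", "Steven", "restaurant"),
        ("together", "John", "despair", None),
        ("person", "Paul"), ("person", "Steven"),
        ("relatives", "John", "Steven"),
    ]

    def holds(pred, *args):
        # True if some fact matches the predicate; None acts as a wildcard.
        for f in facts:
            if f[0] == pred and all(a is None or a == v for a, v in zip(args, f[1:])):
                return True
        return False

    def query_together(restrictions=()):
        # Return the values of X such that together(John, X, Y) holds,
        # subject to additional restrictions expressed on X.
        results = []
        for f in facts:
            if f[0] == "together" and f[1] == "John":
                x = f[2]
                if all(holds(p, *(x if a == "X" else a for a in args))
                       for p, *args in restrictions):
                    results.append(x)
        return results

    # Relaxed query: anything together with John (includes the feeling "despair").
    print(query_together())
    # Reinforced query: X must be a person and a relative of John's.
    print(query_together([("person", "X"), ("relatives", "John", "X")]))

Relaxing the query simply means dropping restriction predicates, while reinforcing it means adding them, exactly as in the together(John, X, Y) examples above.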
1.4 Statistical Natural Language Processing and Machine Learning

The development of an NLP system is traditionally centered around a team of linguists and domain experts. These knowledge experts (or knowledge engineers) and their expertise are the key factors in building a model that properly captures the linguistic phenomena of interest, i. e. the lexicon, the grammar, the world and the application models. This kind of approach is limited in many respects. Most notably, the explicit description of a knowledge domain:

• is generally time consuming;

• is hardly ever complete and, where it is not, can in very few cases capture the most relevant aspects of the domain without including noisy or irrelevant information;

• relies almost completely on the judgement of the knowledge engineers and their previous experiences - these are subjective parameters whose effect on the modelling task cannot be evaluated a priori.

Statistical Natural Language Processing (SNLP) differs from traditional NLP in that, instead of having a linguist manually encode domain knowledge and world models, it uses automatic or semi-automatic techniques to construct the models that capture the interesting linguistic phenomena. These models are the result of the analysis of large corpora of annotated documents, in which the interesting aspects the model is expected to capture are foregrounded so that computer programs can take them into account [Church and Mercer, 1993]. Furthermore, provided that the corpus is large enough to represent the linguistic phenomena of the target application domain, being grounded in real text examples SNLP approaches are likely to produce usable and reliable results [Charniak, 1994].
SNLP systems are useful for many reasons, and are especially interesting as far as industry is concerned, as they [Callison-Burch and Osborne, 2003]:

• afford rapid prototyping - the task of annotating linguistic material is rather fast and inexpensive compared with domain knowledge engineering. This allows many different approaches to be attempted and evaluated in relatively little time;

• are robust - as long as the input fed to the system is syntactically compatible with the training data, the system will produce some output and will not behave in an unpredictable manner;

• are cheaper - their construction being largely automated, they usually require smaller teams to be built, and as they are learned from data they require less specific knowledge of the target domain and of the language to be analyzed.

Machine Learning (ML) algorithms and models allow a program to learn, i. e. to change its internal state so that the same set of operations on the same set of data can be performed with increasing accuracy. A great contribution to the development of ML has been provided by Computational Learning Theory [Angluin, 1992], a new and rapidly expanding area of research that examines formal models of induction with the goals of discovering the common methods underlying efficient learning algorithms and identifying the computational impediments to learning. A coarse taxonomy of machine learning algorithms is as follows:

• supervised learning - where the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem, i. e. learning a function that is capable of establishing a correspondence between a set of input vectors (examples) and a set of target categories. Such a function is approximated by looking at a set of examples labelled with their expected classification;
• unsupervised learning - in which the system is provided with unlabelled examples and a descriptive model of the expected learning function;

• semi-supervised learning - which combines aspects of the two previous approaches;

• reinforcement learning - in which every time a system performs some action it observes its consequences on the world (i. e. the representation of the world which is accessible to the system). Some utility function, defined in terms of states of the world, is used to evaluate the quality of the action with respect to the most desirable states.

Statistical Machine Learning (SML) and Statistical Learning Theory [Vapnik, 1998] enrich the ML approach by integrating methodologies and models from the world of statistics. Statistical models are better suited to cope with uncertainty and noisy phenomena, such as the observations the learning algorithm relies on and the consequences of the resulting (noisy) learning process. The idea behind SML is that machines should be able to reason about uncertainty explicitly in order to deal properly with complex real world scenarios. SML algorithms adopt an inductive approach to the learning task, which is generally example based, and use large amounts of data to learn the target phenomena and figure out a statistical representation of their characteristic noise patterns. Such empirical methods certainly outperform rationalist, or symbolic, methods in this regard. However, empirical methods provide a probabilistic, not conceptual, explanation of the analyzed linguistic phenomena. While they actually work in real applications, they are intrinsically unable to provide insight into the mechanisms of human communication, and a human analyst is eventually required in order to find a linguistic justification of the data. Many efforts have been devoted to defining methods for lexical knowledge acquisition that are both scalable and amenable to a theoretically
founded analysis of language, for example using a combination of probabilistic and knowledge-based methods for the acquisition of selectional restrictions of words in sublanguages [Basili et al., 1996a]. On the other hand, some evidence is arising in support of the notion that perception and neural computation are akin to a Bayesian inference process [Pouget et al., 2003]. Similarly, the close similarity between the human neuron and a simple binary classifier such as the perceptron (see Section 3.3) seems to suggest a strong bond between statistical learning and the workings of the human brain. Further investigation of statistical learning problems and methodologies will hopefully help provide some insights into these interesting problems that sit on the edge between modern statistics and learning theories.
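As a concrete illustration of the supervised classification setting introduced above, the following is a minimal sketch of the perceptron learning rule discussed in Section 3.3, written in plain Python; the toy training set is invented for the example and has no relation to the data used in this thesis:

    # Perceptron: a linear binary classifier trained from labelled examples.
    # Each example is a feature vector x with a label y in {-1, +1}.

    def train_perceptron(examples, epochs=10, lr=1.0):
        dim = len(examples[0][0])
        w = [0.0] * dim      # weight vector
        b = 0.0              # bias
        for _ in range(epochs):
            for x, y in examples:
                activation = sum(wi * xi for wi, xi in zip(w, x)) + b
                if y * activation <= 0:          # misclassified example: update
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                    b += lr * y
        return w, b

    def classify(w, b, x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

    # Toy, linearly separable training set (invented for illustration).
    data = [([2.0, 1.0], +1), ([1.5, 2.0], +1),
            ([-1.0, -0.5], -1), ([-2.0, 1.0], -1)]
    w, b = train_perceptron(data)
    print(classify(w, b, [1.0, 1.0]))   # expected output: 1

The classifiers actually used in this work (Support Vector Machines, Section 3.3) refine this idea by selecting, among all separating hyperplanes, the one with the largest margin.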
1.5 Summary

The work in this thesis is about Semantic Role Labelling (SRL), a particular topic within SSP (as a result of its growing popularity within the NLP community, many authors tend to use the terms SRL and SSP interchangeably). Semantic role labelling is about identifying the predicates within a sentence, along with the arguments these predicates require in order to be properly characterized. Over the years, the linguistic community has proposed many different models of the organization of semantic knowledge, which resulted in many different roleset definitions. Section 2 discusses these issues, and presents two linguistic resources inspired by different linguistic theories: FrameNet, which is based on frame semantics, and PropBank, based on Levin's verb classes. Of these two, the latter is of great interest for the scope of this Thesis, as it is the corpus used by the CoNLL shared task on semantic role labelling (see Section 2.2), the international competition that in recent years has been among the major driving forces in this research field.
Section 3 presents some Statistical Machine Learning models and techniques on which most SRL systems are based, especially binary classifiers and their underlying statistical models, in some cases detailing their mathematical aspects. The attention is focused on Support Vector Machines (SVMs), as they are at the core of the SRL system that this Thesis documents. Section 4 outlines the typical architecture of an SRL system and describes it in terms of the functional modules and data structures it relies on. It also presents the linguistic features that are considered most relevant to the task, introduces Tree Kernels, and explains how they can be employed to implicitly represent a huge space of syntactic features. Section 5 deals with more complex SRL models that use sophisticated approaches for the feature engineering task and for the re-ranking of candidate propositions or syntactic parse trees, such as semantically enriched structures to be used with tree kernels. These systems are generally more accurate than basic SRL systems, but pose new problems for their modelling and implementation, as they require higher volumes of more structured data to be dealt with. A working SRL system has been designed and implemented after these latest models. Our implementation is described in Section 6, along with its evaluation against the CoNLL data set and a comparison with other literature solutions. A thorough study on engineered syntactic structures for different sub-tasks within SRL is documented, and different approaches are evaluated and compared. Finally, Section 7 summarizes the discussion carried out in the previous sections and proposes some directions of research that we would like to investigate in the near future.
Chapter 2
Semantic Role Labelling
In many common applications, such as information extraction and dialogue understanding systems, it is often necessary to collect various pieces of information that, put together, convey a sound and complete description of some phenomenon or entity which is relevant for the purposes of the application. For example, a train reservation system requires the user to specify the origin and destination cities, the date and time of departure and the number of seats to reserve. That is, a sound reservation is described by the instantiation of the variables of the problem, i. e. the assignment of a specific value to each of them. This kind of representation is generally called a frame-and-slot template: a set of slots (the variables) is required to describe the target situation or frame (see Section 2.1 for a more sound definition of frame). The values assigned to each slot are called slot-fillers, and are generally subject to constraints of various kinds. In our example, the origin city and destination city slots require names of cities as fillers, whereas the number of seats slot should be an integer between 1 and the number of available seats N.
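As an aside, the following is a minimal sketch of how such a frame-and-slot template could be encoded in Python; the slot names follow the train reservation example of Figure 2.1, while the constraint helpers (is_city, the toy gazetteer, MAX_SEATS) are hypothetical and introduced only for illustration:

    # A frame is a set of slots; each slot carries a constraint on its filler.
    MAX_SEATS = 100                                 # assumed number of available seats N

    def is_city(x):
        return x in {"Rome", "Milan", "Naples"}     # toy gazetteer

    train_reservation_frame = {
        "origin city":             is_city,
        "destination city":        is_city,
        "number of seats":         lambda x: isinstance(x, int) and 1 <= x <= MAX_SEATS,
        "departure date and time": lambda x: isinstance(x, str),  # e.g. "2006-06-19 @ 10:00pm"
    }

    def valid_instantiation(frame, fillers):
        # A sound reservation: every slot is filled and satisfies its constraint.
        return all(slot in fillers and constraint(fillers[slot])
                   for slot, constraint in frame.items())

    fillers = {"origin city": "Rome", "destination city": "Milan",
               "number of seats": 2, "departure date and time": "2006-06-19 @ 10:00pm"}
    print(valid_instantiation(train_reservation_frame, fillers))   # True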
Frame: Train Reservation

Slot                        Constraints            Filler
origin city                 city(x)                Rome
destination city            city(x)                Milan
number of seats             int(x): 1 ≤ x ≤ N      2
departure date and time                            2006-06-19 @ 10:00pm
Figure 2.1: A simple frame-and-slot template for a train reservation system.

This example frame-and-slot template is represented in Figure 2.1. A limit of this kind of approach is that it is heavily based on domain specific knowledge, with all the limitations that this implies (see Section 1.4). In order to make similar approaches scalable and applicable beyond the boundaries of a single application domain, the very specific slots must be generalized so that the same frame can be used to describe a wider range of situations. Semantic roles are less domain-specific entities and are defined at the level of semantic frames like those introduced by [Fillmore, 1976], which describe abstract actions or relationships along with their participants. For example, the Judgement frame contains roles like Judge, Evaluee and Reason, which are respectively associated with the entities emitting, undergoing and causing the judgement, as in the following examples:

• [Judge Most of my friends] blame [Evaluee the Government] [Reason for joining this war].

• [Judge You] think [Evaluee I] am wrong just [Reason because you don't know the truth].

These shallow semantic roles could play an important role in information extraction. For example, an NLP system could recognize that in the two following sentences there is something, a Theme, that undergoes some changes:

• The [Theme ruling] was revised because of the protests.
• The board changed their [Theme decision] several days ago.

The conceptual link between the roles played by "ruling" and "decision" stands although the predicates, i. e. "revised" and "changed", are different, as is the whole structure of the two sentences. This is possible because semantic roles are defined at the frame level and can, therefore, be shared across different syntactic and lexical contexts.

Frame: Personal Relationship

Slot        Constraints    Predicate 0: married, verb    Predicate 1: wife, noun
Partner1    person(x)      Tom Cruise                    Tom Cruise
Partner2    person(x)      Nicole Kidman                 Who

Figure 2.2: Frame-based inference mechanism for a QA system.

A question answering system, for example, could infer that the sentence

Tom Cruise and Nicole Kidman married on December 24, 1990.

may represent an answer to the question

Who's Tom Cruise's wife?

as they both refer to the Personal Relationship frame. The roles participating in the two predicates are related although the first predicate is a verb and the latter a noun. This can happen because both predicates trigger the proper frame with the corresponding semantic roles. This inference mechanism is described in Figure 2.2. This shallow semantic level of interpretation has additional uses beyond generalizing information extraction, question answering and semantic dialogue systems [Gildea and Jurafsky, 2002]. One such application is word-sense disambiguation, in which the roles associated with a predicate can be
used as clues to infer its sense. For example, the argument structure of a verbal predicate can be used to guess its subcategorization frame, providing precious contextual information to disambiguate the meaning of the verb and of all its arguments. Semantic roles could also act as an important intermediate representation in statistical machine translation or automatic text summarization, as well as in many data mining tasks. Their incorporation into probabilistic language models could eventually lead to more accurate natural language parsers and more reliable speech recognition models. Semantic roles are among the oldest classes of constructs in linguistic theory, dating back thousands of years [Misra, 1966]. Over the centuries, a great variety of semantic role definitions and sets have been proposed and described. These sets of roles vary from the very specific to the very general, and many different proposals have been adopted as the target linguistic model in computational implementations. At the specific end of the spectrum are domain-specific roles, such as those adopted in the frame-and-slot notation, or verb-specific roles such as Drinker, Beverage and Volume for the verb "to drink". The opposite end of the spectrum consists of theories with only two macro-roles (or "proto-roles"), such as the Proto-Agent and Proto-Patient suggested by [Dowty, 1991] and [Valin, 1993]. In between, many different theories that define small and flexible role sets have been proposed. An example is [Fillmore, 1971], in which the 9 defined roles are Agent, Experiencer, Instrument, Object, Source, Goal, Location, Time and Path. From a linguistic point of view, semantic roles are the instrument by means of which linking theories between syntax and semantics are proposed. In this context, it is common to define rather abstract roles that should support a model of the generalization mechanism that establishes a correspondence between the meaning of a sentence and the syntactic realization of its parts. On the other hand, computer scientists are more
interested in more specific role sets, such as those describing the argument structure of a verb, as they provide a concrete framework for the development of natural language understanding systems.
2.1 Linguistic Resources

A linguistic resource is a collection (or corpus) of linguistic evidence which is gathered, edited and organized in order to provide an extensive account of some linguistic phenomenon. Indeed, such resources play a fundamental role both in linguistics and in natural language processing. For linguists, compiling a corpus is a complex and time consuming task, in which many people are generally involved. During the process, the underlying linguistic model is continuously challenged, and eventually corrected, revised and extended to avoid inconsistencies or to incorporate new phenomena. On the computational side, such resources provide both concrete realizations of abstract models and a common ground on which different approaches and systems can be evaluated and compared. Furthermore, as statistical methods gain more and more popularity, large corpora of hand-crafted linguistic data are mandatory resources for training machine learning algorithms on a data set whose characteristics, limits and advantages are discussed within and acknowledged by the scientific community. In the remainder of this section two such resources are presented: FrameNet and PropBank. The first is a project driven by linguists, aimed at providing as wide a coverage as possible of the syntactic contexts that can be used to instantiate certain conceptual situations, or frames. The latter aims at the development of a large corpus of semantic annotations to integrate predicate-argument structures into statistical NLP systems, and was mainly developed for computational purposes. As detailed in Section 2.2, the PropBank is at the center of a great deal of research on SRL systems, and is one of the resources that
have been most widely employed in the development of SRL systems.
2.1.1 FrameNet

The FrameNet project [Baker et al., 1998] is rooted in Fillmore's work on Frame Semantics [Fillmore, 1982], which in turn is an extension of his previous studies on Case Grammars and thematic roles [Fillmore, 1968]. Based on more than thirty years of research in the field of natural language semantics, it proposes roles that are neither as general as Dowty's Proto-Roles or those proposed by Fillmore himself in [Fillmore, 1971], nor as specialized as the thousands of verb-specific roles that could be imagined. In fact, roles are defined for each semantic frame. A frame is a schematic representation of a situation involving various participants, properties and other conceptual roles [Fillmore, 1976]. A frame is described in terms of the kind of interactions established among the involved roles, and is activated by several lexical units that, in at least one of their possible meanings, can be used to describe the target situation. In the FrameNet jargon, a lexical unit that triggers a frame is called a Target Word (TW), whereas the semantic roles are called Frame Elements (FEs). A target word can be any kind of predicate, i. e. not only a verb but also a noun, an adverb or an adjective. This aspect marks one of the main differences between frame elements and thematic roles as commonly described in the literature, as the latter are generally meant to be arguments of some verbal predicate. For example, the frame Cure is defined as follows:

This frame deals with a Healer treating and curing an Affliction (the injuries, disease, or pain) of the Patient, sometimes also mentioning the use of a particular Treatment or Medication;

it is activated by some sense of the following target words:
nouns: alleviation, curative, cure, healer, palliation, palliative, rehabilitation, remedy, therapist, therapy, treatment;

adjectives: curable, curative, incurable, palliative, rehabilitative, therapeutic;

verbs: alleviate, cure, ease, heal, palliate, rehabilitate, resuscitate, treat;

and describes the interactions of the following frame elements:

Affliction: frequently incorporating the Patient as a possessor;

Body Part: the specific area of the Patient's body which is treated;

Healer: anyone who treats or cures the Patient;

Medication: the ingested, applied, injected, etc. substance designed to cure the Patient;

Patient: the sufferer of the injury, disease or pain;

Treatment: treatments as well as their means.

For each frame, a set of hand-annotated sentences is included. Some examples from the Cure frame are:

• You can't rely on [Healer a human being] to cure [Patient you] [Affliction of evil] and give you peace;

• [Patient Eight patients] were studied in detail and were given supervised [Treatment exercise] therapy;

• He says that the [Treatment animals] are therapeutic [Patient for children and adults alike];

• Mr Sommerville said [Affliction [Patient Ninham]'s condition] could be treated [Treatment by drugs];
• Most importantly, there will be an empathy between [Patient yourself] and your [Healer therapist].

The last two examples are particularly interesting: the first shows how frame elements are allowed to nest, i. e. the Affliction is defined as the Patient's condition, while in the second the lexical unit that triggers the frame is part of a role itself, i. e. the therapist is the Healer. Defining semantic roles at this intermediate level has the advantage of allowing the degree of generalization which is needed for many different words, most of which also have different grammatical functions, to be used as predicates within the same frame. Furthermore, it also allows a wider generalization to be attempted across different frames, establishing semantic links between frame elements belonging to different frames. On the other hand, these role definitions do not allow the establishment of an immediate correspondence between a FE and its syntactic representation, making it hard to automatically recognize and identify frame elements. The annotation methodology has proceeded on a frame by frame basis, and consists of:

1. choosing a semantic frame, i. e. defining a frame and its frame elements;

2. listing the target words that can invoke the frame;

3. searching for examples of the frame's instantiation in a corpus (the FrameNet project uses the British National Corpus, BNC [Geoffrey Leech and Bryant, 1994]) and annotating target words and frame elements.

Apart from the semantic annotation, the project also provides a shallow level of morpho-syntactic analysis, as each frame element is assigned a morphological feature, i. e. the grammatical role of the frame element with respect to the predicate, such as external argument, and a syntactic chunk type, such as NP (Noun Phrase) or PP (Prepositional Phrase).
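The outcome of this annotation process can be pictured as a simple record attached to each example sentence. The Python sketch below shows one possible encoding of the last Cure example above; it is purely illustrative and does not reproduce FrameNet's actual data format, and the grammatical-function label "complement" is invented for the example (the text above only mentions "external argument" as a possible value):

    # One annotated FrameNet-style example, encoded as a plain dictionary.
    example = {
        "sentence": "Most importantly, there will be an empathy between "
                    "yourself and your therapist.",
        "frame": "Cure",
        "target_word": {"lemma": "therapist", "pos": "noun"},
        "frame_elements": [
            {"role": "Patient", "text": "yourself",
             "grammatical_function": "external argument", "chunk": "NP"},
            {"role": "Healer", "text": "your therapist",
             "grammatical_function": "complement", "chunk": "NP"},
        ],
    }

    # List the roles together with the text spans that realize them.
    for fe in example["frame_elements"]:
        print(fe["role"], "->", fe["text"])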
The annotators are more interested in finding varied and exotic examples of a frame's instantiation than in collecting many occurrences of the same phenomenon. This is in line with the scope of the project, which is not focused on the development of a statistically representative corpus. Indeed, it attempts to be as complete as possible on the lexicographic side, providing as many examples as possible of the different linguistic contexts in which a same semantic frame can be applied.
2.1.2 PropBank and VerbNet

A different, more computational point of view is suggested by the Proposition Bank (PropBank) Project [Kingsbury and Palmer, 2002]. In this case, the attention is focused on the argument structure of verbs and on the alternation patterns that describe the movement of verbal arguments within a predicate structure [Palmer et al., 2005]. For example, in the two sentences

• John broke the window.

• The window broke.

there is an entity, the window, that plays the same semantic role, i. e. it is the entity which gets broken. Still, with respect to the verb break, in the first case it is the object of its transitive form, while in the second it is the subject of its intransitive form. Alternation in the syntactic realization of semantic arguments is very common, affecting in some way most English verbs, and the alternation patterns that each verb exhibits are quite different from one another. Such alternations have been thoroughly investigated by many linguists, such as Beth Levin in her work on English verb classes [Levin, 1993]. Still, the huge number and variability of linguistic phenomena make it impractical to compile a complete list of alternation patterns along with the contexts in which they occur.
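To make the idea of role-preserving alternations concrete, the sketch below encodes the two sentences as predicate-argument records written in the PropBank style of numbered arguments described in the remainder of this section; the specific labels are used here only for illustration:

    # Two syntactic realizations of the verb "break": the same underlying
    # role (the thing that gets broken) surfaces as object or as subject.
    propositions = [
        {"predicate": "break",
         "sentence": "John broke the window.",
         "args": {"Arg0": "John",           # the breaker
                  "Arg1": "the window"}},   # the thing broken (syntactic object)
        {"predicate": "break",
         "sentence": "The window broke.",
         "args": {"Arg1": "the window"}},   # same role, now the syntactic subject
    ]

    # Despite the alternation, the filler of Arg1 is identical in both cases.
    assert all(p["args"]["Arg1"] == "the window" for p in propositions)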
The hand-annotated PropBank corpus is meant to collect a reliable, broad-covering collection of such phenomena, so that statistical machine learning algorithms (see Section 3.1) can be used to incorporate probabilistic models of the syntactic alternation of semantic roles into statistical parsers and NLP systems. The PropBank corpus consists in the addition of a shallow level of predicate-argument information to the syntactic structures of the Penn Treebank [Marcus et al., 1993], a collection of hand-corrected syntactic parse trees of the Wall Street Journal (WSJ) corpus [Paul and Baker, 1992]. For each verb, a set of underlying semantic roles has been defined and numbered progressively. Each instance of the verb has then been annotated within the Penn Treebank. The following examples are (flattened) annotations for the verb offer:
• . . . [Arg0 the company] to . . . offer [Arg1 a 15% to 20% stake] [Arg2 to the public].
• . . . [Arg0 Sotheby’s] . . . offered [Arg2 the Dorrance heirs] [Arg1 a money-back guarantee].
• . . . [Arg1 an amendment] offered [Arg0 by Rep. Peter DeFazio] . . .
• . . . [Arg2 Subcontractors] will be offered [Arg1 a settlement] . . .
The numbering of the arguments, e. g. Arg0, Arg1 and so on, is defined on a verb-by-verb basis. For each verb, Arg0 is generally the argument playing the role of a Proto-Agent, while Arg1 is the Proto-Patient or Theme [Dowty, 1991]. This is the only generalization that can be made consistently across different verbs, since no such correspondence exists for higher-numbered arguments. The other labels are generally assigned based on the grammatical obliqueness of the arguments with respect to the predicate [Dowty, 1982], i. e. their contribution to the specification of its sense, which is generally
reflected by their relative distance, though an effort is made to define the semantics of roles consistently across different verb senses or uses. The framework that has been used to establish a link between different verb annotations is VerbNet [Kipper et al., 2000, K. Kipper and Rambow, 2002], which provides an extension of Levin's classes by adding an abstract representation of the syntactic frames for each class. A syntactic frame is the description of a mapping between a set of thematic labels and a set of deep syntactic arguments, and establishes explicit correspondences between the syntactic position of an argument and the semantic role it expresses. A set of roles corresponding to a distinct usage of a verb is called a roleset, and can be associated with a set of syntactic frames indicating allowable syntactic variations in the expression of that set of roles. A roleset along with its associated frames is called a frameset, and a polysemous verb is allowed to have more than one frameset if its different meanings are such as to require a distinct roleset each. Syntactic frames that are shared among verb classes are grouped into frame files, where each frame is described along with an example (ex), its syntactic realization (sym) and its semantic interpretation (sem) in terms of first order logic predicates. As an example, the verb jolt can activate the syntactic frames described in the frame files amuse-31.1 and force-59, and the two rolesets jolt.01 and jolt.02.

Amuse-31.1
• Roles
– Experiencer [+animate]
– Cause
• Frames
– Basic Transitive
ex: "The clown amused the children"
sym: Cause V Experiencer
sem: cause(Cause, E) emotional_state(result(E), Emotion, Experiencer)
– Middle Construction
ex: "Little children amuse easily"
sym: Experiencer V
sem: property(Experiencer, Prop) Adv(Prop)
– PRO-Arb Object Alternation
ex: "That joke never fails to amuse"
sym: Cause V
sem: cause(Cause, E) emotional_state(result(E), Emotion, ?Experiencer)
– NP-PP with-PP
ex: "The clown amused the children with his antics"
sym: Cause V Experiencer with Oblique
sem: cause(Cause, E) emotional_state(result(E), Emotion, Experiencer)
– NP Attribute Subject
ex: "The clown’s antics amused the children"
sym: Cause[+genitive] ’s Oblique V Experiencer
sem: cause(Cause, E) emotional_state(during(E), Emotion, Experiencer)
– NP-ADJP Resultative
ex: "That movie bored me silly"
sym: Cause V Experiencer
sem: cause(Cause, E) emotional_state(result(E), Emotion, Experiencer) Pred(result(E), Experiencer)

Force-59
• Roles
– Agent [+animate] [+organization]
– Patient [+animate] [+organization]
– Proposition
• Frames
– Basic Transitive
ex: "I forced him"
sym: Agent V Patient
sem: force(during(E), Agent, Patient, ?Proposition)
– NP-P-ING-OC into-PP
ex: "I forced him into coming"
sym: Agent V Patient into Proposition[+oc_ing]
sem: force(during(E), Agent, Patient, Proposition)
– NP-PP into-PP
ex: "I forced John into the chairmanship"
sym: Agent V Patient into Proposition[-sentential]
sem: force(during(E), Agent, Patient, Proposition)

Roleset Jolt.01 (surprise, shock)
• Roles:
– Arg0: jolter
– Arg1: person jolted
– Arg2: instrument, if separate from agent
• Examples:
– [Arg0 The decline] surprised analysts and jolted [Arg1 HomeFed’s stock, which lost 8.6% of its value, closing at $38.50 on the New York Stock Exchange, down $3.625].
– [Arg0 John] jolted [Arg1 Mary] [Arg2 with his announcement that he was leaving for Zambia].

Roleset Jolt.02 (impelled action)
• Roles:
– Arg0: jolter
– Arg1: impelled agent
– Arg2: impelled action
• Examples:
– The stock has fallen $87.25, or 31%, in the three trading days since [Arg0 announcement of the collapse of the $300-a-share takeover] jolted [Arg1 the entire stock market] [Arg2 into its second-worst plunge ever].

In addition to verb-specific numbered roles, which are generally referred to as core roles, PropBank defines several more general roles that can apply to any verb: they are called adjunct roles (or adjuncts) and are labelled ArgM-X, X being assigned one of the values described in Table 2.1. These labels are used to annotate verbal arguments that do not appear in the predicate’s roleset description and yet provide some useful information about it, as in the following example:
Frameset edge.01 "move slightly"
Arg0: causer of motion
Arg1: thing in motion
Arg2: distance moved
Arg3: start point
Arg4: end point
Arg5: direction

[Arg0 Revenue] edged [Arg5 up] [Arg2 3.4%] [Arg4 to $904 million] [Arg3 from $874 million] [ArgM-TMP in the last year’s third quarter].

Label   Description
LOC     location
EXT     extent
DIS     discourse connectives
ADV     general purpose
NEG     negation marker
MOD     modal verb
CAU     cause
TMP     temporal marker
PNC     purpose
MNR     manner
DIR     direction

Table 2.1: List of PropBank adjunct roles.

Apart from the 6 core arguments and the 11 adjuncts, PropBank also defines:
• a special argument, labelled ArgA, that is used to mark an entity inducing an agent to perform an action, i. e. a sort of external agent, such as in “[ArgA John] had [Arg0 Mr. Benson] drinking [Arg1 a full glass of wine]”;
• continuation arguments, in the form C-ArgX or C-ArgM-X, used to represent split constituents, such as in “[Arg1 John], [Arg0 Mary] said, [C-Arg1 would join us after dinner]”;
• reference arguments, in the form R-ArgX or R-ArgM-X, that provide a shallow level of anaphora and co-reference description, such as in “I wasn’t [R-Arg4 there] [R-ArgM-TMP when] [Arg0 you] arrived”;
• a special verb-continuation argument, C-V, which is used to mark the continuation of a phrasal verb when its parts are separated by some words that do not belong to it, such as in “[Arg0 He] got [Arg1 his money] [C-V back]”.
As the corpus is defined as an annotation of the Penn TreeBank, the labelling is actually performed on the nodes of the TreeBank syntactic parse trees, i. e. the node that exactly covers the words comprised by an argument is assigned the corresponding label; e. g. the last example sentence’s parse tree would be annotated as in Figure 2.3.

Figure 2.3: A PropBank annotation on a syntactic parse tree.

In order to enforce the separation between the syntactic and semantic levels, the annotations are not encoded within the trees, but in separate resources which are called index files. The Penn TreeBank is divided in sections and subsections, and each parse tree is assigned a numeric identifier which is unique within each section. PropBank annotations are organized in the same fashion, and consist of:
• a reference to the corresponding Penn TreeBank section and subsection;
• a reference to the target parse tree, i. e. its numeric identifier;
• the list of annotated arguments.
Each argument is in the form O:Z-L, where:
• O is the offset, starting at 0, of the first word spanned by the argument;
• Z is the number of nodes that must be climbed in the syntactic tree, starting from the POS tag of that word, in order to reach the argument node;
• L is the argument type.
Hence, the arguments of the annotation [Arg0 He] got [Arg1 his money] [C-V back] with respect to the parse tree in Figure 2.3 would be represented as 0:1-Arg0 1:0-rel 2:1-Arg1 4:1-C-V, rel being the label assigned to the annotation’s predicate, i. e. the verb get.
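To make the O:Z-L encoding concrete, the following sketch parses such an annotation string into (offset, height, label) triples. It is only an illustration of the format described above: the function name and the tuple representation are assumptions made for the example, not part of the PropBank distribution.

def parse_propbank_args(annotation):
    # Parse a PropBank-style argument string, e. g.
    # "0:1-Arg0 1:0-rel 2:1-Arg1 4:1-C-V", into (offset, height, label) triples.
    #   offset: index (from 0) of the first word spanned by the argument;
    #   height: number of nodes to climb from that word's POS tag to reach
    #           the argument node;
    #   label:  the argument type ('rel' marks the predicate itself).
    parsed = []
    for item in annotation.split():
        position, label = item.split("-", 1)
        offset, height = position.split(":")
        parsed.append((int(offset), int(height), label))
    return parsed

# For the sentence "He got his money back":
print(parse_propbank_args("0:1-Arg0 1:0-rel 2:1-Arg1 4:1-C-V"))
# [(0, 1, 'Arg0'), (1, 0, 'rel'), (2, 1, 'Arg1'), (4, 1, 'C-V')]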
2.2 The CoNLL Shared Task on Semantic Role Labelling

A major contribution to the development of working SRL systems has been offered by the CoNLL (Computational Natural Language Learning) conference, the annual meeting organized by the Association for Computational Linguistics (ACL) Special Interest Group on Natural Language Learning (SIGNLL). Since its 2004 edition, the conference has been running a shared task on semantic role labelling, which provides a common ground for SRL systems to be trained and tested against a standardized data set, and to be compared in order to identify the most interesting and promising approaches to the problem.
Given a sentence, the task consists of analyzing the propositions expressed by some target verbs of the sentence. In particular, for each target verb all the constituents in the sentence which fill a semantic role of the verb have to be extracted. The challenge for the CoNLL-2004 shared task was to address the SRL problem on the basis of only partial syntactic information, i. e. avoiding the use of full syntactic parse trees and external lexical-semantic knowledge bases. The annotations provided for the development of systems included the argument boundaries and role labels, words with their POS tags, base chunks, clauses, and named entities [Carreras and Màrquez, 2004]. In the 2005 edition, some novelties were introduced [Carreras and Màrquez, 2005]:
• the training corpus was substantially enlarged. This allows testing the scalability of learning-based SRL systems to big data sets and computing learning curves to see how much data is necessary for training;
• aiming at evaluating the contribution of full parsing to SRL, the complete syntactic trees given by several alternative parsers were provided as input information for the task;
• in order to test the robustness of the presented systems, a cross-corpora evaluation was performed using fresh test sets from corpora other than the one used for training.
The participating systems would be trained and tested both on gold, i. e. hand-crafted, and automatic syntactic data, to provide respectively an upper bound and a more realistic evaluation of the systems’ performance.
Chapter 3
Machine Learning for Semantic Role Labelling
Semantic role labelling is about identifying groups of words, the arguments, that are in some semantic relation with a predicate within the sentence. As there is no clear and sound theory describing the underlying linguistic phenomenon, it is not yet possible to design models that allow for unsupervised approaches to the problem. As a result, SRL systems are generally built around supervised learning models based on statistical classification: a statistical procedure in which individual items are grouped based on quantitative information about one or more characteristics inherent in the items (generally referred to as features, see Section 3.2) and on a training set of previously labelled items. The source items are generally referred to as examples, and the target groups as classes (or labels, or categories). In the remainder of this chapter, Section 3.1 provides a formal definition
of statistical classifiers and describes the life cycle of a classifier, i. e. how a classifier learns to separate the examples and how its output is evaluated using different measures. Section 3.2 explains more thoroughly the concepts of feature and feature space, i. e. the Euclidean space in which a feature set is represented, and provides some examples of feature representations. It also introduces kernel functions, i. e. functions that manipulate the feature space in order to ease the computational burden on the classifier. Finally, Section 3.3 describes Support Vector Machines, the learning algorithm which is at the core of our SRL model.
3.1 Statistical Classifiers

Statistical classifiers are among the key tools of SNLP, and are also widely used in the broader field of artificial intelligence. For example, a very common predictive model, mostly used in decision theory and data mining, is the Decision Tree (DT), a graph of decisions and their possible consequences (possibly taking into account costs, risks and benefits) used to create a plan or to reach a goal. The non-terminal nodes represent aspects of the problem that should be taken into account, and are generally listed top-down with respect to their relevance, whereas the leaves are the actual output of the algorithm and correspond to the decisions of the classifier. Figure 3.1 represents a very simple decision tree for the problem of choosing whether to buy a car or not. In this case, the algorithm has learnt not to buy a car, i. e. to label it “don’t buy”, if its cost exceeds a given amount or if it consumes too much; otherwise, the car is bought provided that it has 5 doors or an automatic gear.

Figure 3.1: A (binary) decision tree for the problem of choosing whether to buy a car or not.

For any input example belonging to a domain X, a classifier outputs one or more labels out of a discrete set of target categories C. Classifiers can be organized in a taxonomy with respect to the cardinality |C| of the set C and the number of labels that can be attached to each example:
a classification problem is said to be multi-class if the number of target categories is greater than 2; it is multi-label if it is possible to attach more than one label to the same input example. A single-label, 2-class classifier is said to be binary, as any example either does or does not belong to one of the target categories. A statistical classifier actually consists of two distinct entities:
1. a learning algorithm, and
2. a classification function.
The learning algorithm is meant to find a suitable approximation of the target function f : X → C, which establishes a correspondence between a set of examples X (the data set) and a set of categories (or labels) C. The target function f is of course unknown, and the approximation is to be drawn by observing a set T_R ⊆ X of training examples (the training set).
The learnt function is called the hypothesis of the algorithm, and is usually labelled h. The induced hypothesis is described in a model, a data structure that encodes the hypothesis in terms of the relevant features and of their meaningfulness in deciding what label to assign to a new input example. The logical and physical structure of a model clearly depends on the internals of the learning algorithm and on the specific implementation in use. The model is a key factor in the quality of a statistical classifier, as the outcome of the classification really depends on how properly the model describes the set of input examples, along with the differences among them which can help determine their belonging to a distinct category. The definition of the model is a typical generalization problem: the learning algorithm evaluates a great number of local features describing the training examples and infers general rules that should be valid also for those examples that have never been observed. The responsibility of using the model to classify new examples lies with the classification function: each new input is evaluated against the model and labelled accordingly. The outputs of the classification function are referred to as the predictions of the classifier, and mathematically correspond to the results of the application of the hypothesis function h to each element within the set of examples. As opposed to the induction that leads to the generation of the model, classification is a deductive task, i. e. it is a matter of specialization: the general descriptions the model consists of are projected onto the specific instances that are to be labelled, in search of similarities and correspondences. The evaluation of this generalization/specialization mechanism, which is common to any (partially) supervised ML model, requires that the examples labelled by the classification function are not part of the training set. That is why the experiments are conducted on a data set which is split into two non-overlapping sets, the training and the test set, of which the first is
used only to train the learning algorithm, and the latter only to evaluate the predictions of the classifier, i. e. the quality of the learning model. The predictions are evaluated against an oracle, a resource describing the expected sequence of classification outputs for the examples of the test set. The evaluation can be carried out using different measures that depend on the task and on the focus of the on-going research. These measures generally refer to the outcome for each target category c_i ∈ C, and are based on the following quantities:
• true positives (A_i): the number of examples classified in c_i that actually belong to the category, i. e. labelled accordingly by the oracle;
• false positives (B_i): the number of examples mistakenly labelled c_i;
• false negatives (C_i): the number of non-retrieved examples, i. e. those that should have been labelled c_i.
If N is the total number of examples, then the following measures can be used to evaluate the output of the classifier:
• accuracy (Acc_i), defined as the number of correct outputs out of the number of classified examples:
$$\mathrm{Acc}_i = \frac{N - B_i - C_i}{N};$$
• precision (p_i), the ratio between the number of true positives and the sum of true positives and false positives:
$$p_i = \frac{A_i}{A_i + B_i};$$
• recall (r_i), the ratio between the number of true positives and the sum of true positives and false negatives:
$$r_i = \frac{A_i}{A_i + C_i}.$$
The accuracy is strictly tied to the error rate ε_i of the classifier on the category, i. e. the number of classification errors over the number of classified instances; in fact:
$$\epsilon_i = \frac{\text{number of errors}}{\text{number of instances}} = \frac{B_i + C_i}{N} \;\Rightarrow\; \mathrm{Acc}_i = 1 - \epsilon_i.$$
A drawback of the accuracy is that it is possible to obtain a very good measurement even if none of the examples belonging to a category is labelled correctly. This is true for all those categories which are represented by few instances in the test set, so that the quantity (B_i + C_i) is very small with respect to N, no matter how many errors are actually made in assigning the c_i label. On the other hand, precision and recall are not affected by this limitation and provide a good indication of the classifier's behaviour on the given class. Using precision and recall it is also possible to derive another measure, the F_1-measure, the harmonic mean of the two values:
$$F_1 = \frac{2pr}{p + r},$$
which provides a reliable and realistic indication of the classifier's performance. In a multi-class scenario, the values of precision, recall and F_1-measure for each class are combined into a global measure to evaluate the performance of the whole pool of classifiers. This aggregate measure, which is usually indicated with µ, is called microaverage, and can be calculated for any of them:
$$\mu(p) = \frac{\sum_i A_i}{\sum_i (A_i + B_i)}, \qquad \mu(r) = \frac{\sum_i A_i}{\sum_i (A_i + C_i)}, \qquad \mu(F_1) = \frac{2\mu(p)\mu(r)}{\mu(p) + \mu(r)}.$$
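As a concrete illustration of the measures defined above, the following sketch computes per-class precision, recall and F_1 from the quantities A_i, B_i and C_i, and pools them into the micro-averaged values; the function names and the dictionary layout are assumptions made for the example.

def prf(tp, fp, fn):
    # Precision, recall and F1 from true positives, false positives and false negatives.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def microaverage(counts):
    # counts maps each class to its (A_i, B_i, C_i) triple.
    tp = sum(a for a, _, _ in counts.values())
    fp = sum(b for _, b, _ in counts.values())
    fn = sum(c for _, _, c in counts.values())
    return prf(tp, fp, fn)

counts = {"Arg0": (90, 10, 5), "Arg1": (70, 20, 15)}
print(prf(*counts["Arg0"]))   # per-class precision, recall and F1 for Arg0
print(microaverage(counts))   # micro-averaged precision, recall and F1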
It is quite common that the output of the classifier is not as accurate, reliable or efficient as expected, requiring the learning process to be refined. Apart from the intrinsic difficulties inherent in the problem under study, most issues generally arise from the composition of the data set and from the learning algorithm. Concerning the first point, two major factors are the number of available examples, i. e. the training examples may be too few to support a statistical learning process, and the partition of the data set, i. e. the training and test sets may not be homogeneous, so that a generalization of the first does not subsume the elements of the latter. While the first question requires the addition of new training examples, which is not always possible, the second calls for a truly random definition and a proper dimensioning of the two sets. Concerning the learning algorithm, selecting a different statistical model or fine-tuning the parameters of the learning process can considerably improve the performance of the classifier. For these reasons, the production of an adequate classifier is generally an iterative process in which each cycle ends with the evaluation of the model in terms of the measures previously described. If the measurements are not satisfactory, the tunable aspects of the classifier are reworked and a new evaluation is carried out. This “life cycle” of a classifier is depicted in Figure 3.2.
3.2 Features and Feature Spaces

In a real-world scenario, classification examples are actually represented in terms of features. A feature is exactly what the name suggests: some aspect, attribute or part of an example x ∈ X which is considered to be informative with respect to the learning task.
Figure 3.2: Life cycle of a classifier.

What an actual feature represents, and how, really depends on the kind of learning task and on the type of objects the ML algorithm is supposed to work with. A program that analyzes the input examples and represents them in terms of features is called a Feature Extractor (FE), and is usually the very core of any classification system. For very simple problems, identifying the set of features that capture the relevant aspects of the examples is quite straightforward. For example, the features “weight” and “height” can be used to learn the concept “medium-built people” and to classify subsequent examples on the basis of the learnt concept’s approximation. Generally, though, the concepts to learn are not so easy to describe, and two main problems arise:
1. the significant features must be identified and extracted;
2. the features must be represented properly, i. e. their representation must be adequate to capture the salient aspects of the source examples and allow for an efficient computation.
These considerations also apply to the previous example, which has indeed been over-simplified. Weight and height alone can only provide a very coarse definition of the target function, and in fact we are actually modelling an easier concept which doesn't take into account, for example, the sex or the constitution of the target subject. Whether this lack of accuracy is acceptable or not depends on the scope and the uses of the results of the system. Whereas it would suffice for, say, a statistical analysis, it would definitely be inadequate for a nutritionist deciding on a patient's diet plan. The activity of designing and evaluating the most appropriate features to model a learning problem is referred to as Feature Engineering, and it is generally a complex and time consuming task. A feature engineer must in fact face two major problems:
• the features to represent can be too many: in this case, it is necessary to select just a subset of them in order for the learning algorithm to converge in a reasonable amount of time;
• the features that can help discriminate the concept from its negation can be unknown: it is very common to deal with very complex phenomena for which a sound and widely accepted interpretation is not available, and the feature engineer may have to identify the relevant aspects by trial and error.
3.2.1 Linear Features

The most common representation of an example's features is geometric, and consists in projecting each feature onto one dimension of a Euclidean space. This is called a linear representation of the features of the example. In this context, an example is actually a vector x ∈ R^n described by its
components in an n-dimensional Euclidean space:
$$\vec{x} = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{pmatrix},$$
n being the number of distinct features identified. Figure 3.3 represents the feature space of the simplified weight/height problem previously described, which is 2-dimensional and real-valued.

Figure 3.3: Representation of a 2-dimensional classification problem.

Still, imagine a huge number of vectors to classify, or the design of a system whose requirements are very strict in terms of memory and processor usage. In this case, the feature engineer may choose a different, more convenient representation of the features, for example using integer rather than real values for the components of each vector. This choice would affect the algorithm in many ways:
1. the number of distinct examples would be drastically reduced, as many
vectors which were formerly distinct would now overlap;
2. as a consequence of the overlaps, there would be a number of conflicting examples in the training set, whenever similar weight/height values had been assigned different labels.
As a result, the output of the classification would definitely be more blurry than in the previous case. Yet, the memory and processor usage would be drastically reduced, as integer values are more compact and computationally easier to deal with than real ones. The feature engineer may also decide to re-invest the system resources that he managed to save, for example introducing a third feature, i. e. the sex. A more realistic example is provided by the ways commonly used to represent text documents within a text categorization system. Let's say we have an object o_1, which is actually a text document d_1. It's a very short one, and its text is “This document is about its contents”. In a very basic bag-of-words model, it would be represented in terms of the words it contains (in a full bag-of-words model the frequency of each word should be taken into account as well, but it is omitted here for simplicity):
$$\vec{d}_1 = [this, document, is, about, its, contents].$$
Since the vector d_1 must be represented in a Euclidean space, each component must be mapped onto a numerical value. For the sake of simplicity, each word is associated with a dimension of the space, and it is assigned a binary value: 1 if the word is present in the document, 0 otherwise. Therefore, given the feature mapping:
this → 0    document → 1    is → 2    about → 3    its → 4    contents → 5
the object o_1 can be represented as
$$\vec{d}_1 = \phi(o_1) = [1, 1, 1, 1, 1, 1],$$
where φ(·) is the function responsible for producing the feature representation of any given target object. Since the document contains six distinct words, it requires a six-dimensional space to be represented. Now we add two new examples to our collection, a document d_2 (“This is a second document”) and a document d_3 (“Yet another text document”). The first contains 2 words (“a” and “second”) which haven't been represented in the feature space yet, and the latter 3 of them (“yet”, “another” and “text”). These features are assigned the following mappings:
where φ(·) is the feature responsible for producing the feature representation of any given target object. Since the document contains six distinct words, it requires a six dimensional space to be represented. Now we add two new examples to our collection, a document d2 (“This is a second document”) and d3 (“Yet another text document”). The first contains 2 words (“a” and “second”) which haven’t been represented in the feature space yet, and the latter 3 of them (“yet”, “another” and “text”). These features are assigned the following mappings: a
second
yet another
text
↓
↓
↓
↓
↓
6
7
8
9
10
.
The current feature space now consists of eleven dimensions, and the three documents are represented as follows:
$$\vec{d}_1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]$$
$$\vec{d}_2 = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]$$
$$\vec{d}_3 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]$$
Given these vectors, it is possible to evaluate their similarity in terms of common features, i. e. calculating the scalar (or dot) product between each pair of vectors:
$$\mathrm{sim}(\vec{a}, \vec{b}) = \langle \vec{a}, \vec{b} \rangle = \sum_{i=0}^{F} a_i b_i, \qquad (3.1)$$
F being the number of dimensions of the feature space. In our example, the values of the similarities between documents would be:
$$\mathrm{sim}(\vec{d}_1, \vec{d}_2) = 3, \qquad \mathrm{sim}(\vec{d}_1, \vec{d}_3) = 1, \qquad \mathrm{sim}(\vec{d}_2, \vec{d}_3) = 1.$$
It is clear that increasing the number of documents has two main effects on the feature space:
• the number of features gets larger and larger, i. e. the number of dimensions grows very fast;
• the resulting feature space gets more and more sparse.
As a result, there is an increase in computational time, as the terms of the sum in Equation 3.1 grow in number even though most of the products are null and do not contribute to the result. In this very case, there are many aspects of the feature selection algorithm that can be improved and engineered. Two widely adopted strategies are stemming and stop-words removal. Stemming refers to the use of the stem (or root) to describe a word instead of its surface form, i. e. in its base form (for verbs) or without affixes (for nouns). This means that, for example, “drink”, “drank”, “drunk” and “drinking” would be mapped to just one feature instead of four. Stop-words removal consists in removing those words that are common to every document and that carry little or no information per se: articles, pronouns, prepositions and auxiliary verbs are among these, and their removal is meant to reduce the number of features as well as the classification noise.
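The following sketch reproduces the three-document example above with a binary bag-of-words representation and the dot-product similarity of Equation 3.1; the function names are assumptions made for the illustration.

def bag_of_words(documents):
    # Map each distinct word to a dimension and build binary vectors.
    index = {}
    for doc in documents:
        for word in doc.lower().split():
            index.setdefault(word, len(index))
    vectors = []
    for doc in documents:
        v = [0] * len(index)
        for word in doc.lower().split():
            v[index[word]] = 1
        vectors.append(v)
    return index, vectors

def sim(a, b):
    # Similarity as the number of shared active features (Equation 3.1).
    return sum(x * y for x, y in zip(a, b))

docs = ["This document is about its contents",
        "This is a second document",
        "Yet another text document"]
index, (d1, d2, d3) = bag_of_words(docs)
print(sim(d1, d2), sim(d1, d3), sim(d2, d3))   # 3 1 1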
3.2.2 Kernel Methods and the Kernel Trick

It may be noted that an explicit representation of a huge feature space, as defined by the unique features derived from the analysis of hundreds or thousands of target examples, is not always necessary for the application of a classification algorithm. Indeed, when comparing two object representations x and y we only need to consider their common features (see Equation 3.1), which generally account for a very small part of the whole feature set. A kernel function is exactly this: a function k(o_i, o_j) that evaluates the similarity between example pairs without the need for an explicit representation of the feature space.

Definition 3.1 (Kernel). A kernel is a function k such that ∀a, b ∈ X, with x_a = φ(a) and x_b = φ(b),
$$k(a, b) = k(\vec{x}_a, \vec{x}_b) = \langle \vec{x}_a, \vec{x}_b \rangle = \langle \phi(a), \phi(b) \rangle,$$
where φ(·) is a mapping from X to an (inner product) feature space. It is interesting to notice that, once a kernel function that is effective for a given learning problem has been defined, it is not necessary to know what the mapping φ(·) is. This technique of using kernels to evaluate similarities instead of explicit feature representations is known as the kernel trick, described below. The existence of such a function is guaranteed by the following definition and proposition:

Definition 3.2 (Eigenvalues and Eigenvectors). If a scalar value t satisfies the relation
$$A\vec{x} = t\vec{x}$$
for some vector x ≠ 0, then t is an eigenvalue of the matrix A and x is an eigenvector.
Proposition 3.1 (Mercer's Conditions). Let X be a finite input space and K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix
$$K(\vec{x}, \vec{z}) = \langle \phi(\vec{x}), \phi(\vec{z}) \rangle$$
is positive semi-definite, i. e. has non-negative eigenvalues.

The kernel trick [Aizerman et al., 1964] consists in transforming any algorithm that solely depends on the dot product between two vectors: wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm which corresponds to the same linear algorithm operating in the range space of φ. For example, a polynomial kernel maps the original features into a space that also accounts for all their possible conjunctions, e. g. if the original space contains the components x_1 and x_2, the transformed one will also contain x_1 x_2. This comes in very handy, for example in text categorization applications or for disambiguation purposes, as conjunctions of terms can enforce a stricter constraint than the terms alone.
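The following sketch illustrates the idea on a degree-2 polynomial kernel: the kernel value computed on the original vectors coincides with an ordinary dot product in an expanded space containing all pairwise feature conjunctions, without that space ever being built explicitly. The explicit map is shown only for comparison; its weighting of the mixed terms is one standard choice, not the only possible one.

import itertools
import math

def poly_kernel(x, y, degree=2):
    # Homogeneous polynomial kernel: k(x, y) = (<x, y>)^degree.
    return sum(a * b for a, b in zip(x, y)) ** degree

def explicit_degree2_map(x):
    # Explicit feature map for the homogeneous degree-2 kernel:
    # squared features plus sqrt(2)-weighted pairwise conjunctions.
    mapped = [xi * xi for xi in x]
    mapped += [math.sqrt(2) * x[i] * x[j]
               for i, j in itertools.combinations(range(len(x)), 2)]
    return mapped

x, y = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
implicit = poly_kernel(x, y)
explicit = sum(a * b for a, b in zip(explicit_degree2_map(x), explicit_degree2_map(y)))
print(implicit, explicit)   # both equal (x . y)^2 = 20.25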
3.2.3 Kernel Functions and Feature Space Separability

A statistical classifier learns how to classify a set of training examples into distinct groups. This result is achieved by comparing pairs of training examples, in terms of their feature representations, and identifying a function h that separates the examples belonging to a class from those belonging to the others, i. e. in the case of a binary classifier the examples that belong to the target set from those that don't. The hypothesis function h is a member of the family (or class) of functions H that can be derived from the set of available training examples, i. e. from the same training set it is possible to derive many different hypotheses
that successfully separate the training examples, as shown in Figure 3.4.

Figure 3.4: Different functions can successfully separate the same training set.

However, the members of the function class H are generally heterogeneous, and the choice of the most appropriate function does have an impact on the classifier's performance. The Vapnik–Chervonenkis (VC) dimension aims to characterize functions from a learning point of view in terms of the maximum number of points that a function class can shatter. The definition of shattering is as follows:

Definition 3.3 (Shattered Sets). Let F be a class of binary classification functions, f ∈ F, f : X → {0, 1}. The set S ⊆ X is shattered by the function class F if ∀S′ ⊆ S, ∃f ∈ F such that
$$f(x) = \begin{cases} 0 & \text{if } x \in S' \\ 1 & \text{if } x \in S \setminus S' \end{cases}$$
It means that a set is shattered by a function class if, for any binary partition of the points in the set, there is at least one function in the class that can separate them accordingly. Hence, the members of a class of functions having VC dimension equal to 10 can always separate 10 points, regardless
of their spatial distribution. The VC dimension of the hypothesis class has an impact on the generalization ability of the learning algorithm and therefore on the classification output. In fact:
• a function having a high VC dimension is expected to easily separate the training points, as it has the capacity to adapt to any training set. This means that the learned function tends to be very specific to such training data, as it tends to outline the partition rather than generalize the training data. As a consequence, it may perform poorly when applied to other examples, i. e. on the test set;
• a function with a low VC dimension can separate a lower number of data configurations. Therefore, if such a function can separate a large training set, it very likely represents a good generalization of the training instances, resulting in a proper classification output and in good performance on a heterogeneous test set.
For these reasons, given an n-dimensional feature space, the search for the hypothesis function is generally restricted to the class of linear functions, i. e. oriented hyperplanes in R^n, which are guaranteed by mathematical proof to have a VC dimension of (n + 1) if, for any chosen point within the set, the remaining n points are linearly independent. As an example, the separation capabilities of hyperplanes in the 2-dimensional space, i. e. lines in R^2, are shown in Figure 3.5:
• frame (a) demonstrates how a linear function can separate any set of 3 points, provided that not all of them are aligned;
• frame (b) presents examples of sets of 4 or more points which cannot be separated by a line;
• finally, frame (c) presents a good generalization example: the points are far more than 3, yet it is possible to separate them and find an appropriate generalization for the classification problem.

Figure 3.5: Linear separability in R^2.

As shown in Figure 3.5(b), it is not always possible to learn a linear separation of the training set. Fortunately, this limitation can be overcome in several ways:
• it is possible to model the learning problem more effectively, i. e. adding, removing or re-engineering features, so that the problem becomes linearly separable. For example, a configuration of 3 linearly dependent points cannot be shattered by a line in the two-dimensional space, but adding an appropriate feature would turn it into a linearly separable problem in the three-dimensional space;
• it is possible to use a cascade of linear functions. The resulting function can be very expressive and can approximate any non-linear function, depending on the number of cascaded levels.
Another solution is to use a kernel function to remap the initial data points into a separable space, as shown in Figure 3.6. For example, a target separation function of the form
$$f(x, y, z) = C\,\frac{xy}{z^2}$$
is clearly not linear, and therefore could not be approximated by a hyperplane in R^3.
Figure 3.6: A mapping φ that makes a set of training points linearly separable.

But applying the mapping φ(·) = ln(·), the learning algorithm should find an approximation of the function
$$g(x, y, z) = \phi(f(x, y, z)) = \ln(C) + \ln(x) + \ln(y) - 2\ln(z),$$
which is clearly linear in terms of the remapped features ln(x), ln(y) and ln(z) (plus the constant term ln(C)).
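A quick numeric check of this remapping, with an arbitrary constant and input point chosen only for the illustration:

import math

C = 3.0

def f(x, y, z):
    # The non-linear target function C * x * y / z^2.
    return C * x * y / (z * z)

def g(lx, ly, lz):
    # Its image under the ln(.) mapping: a linear form in the ln-features.
    return math.log(C) + lx + ly - 2 * lz

x, y, z = 2.0, 5.0, 4.0
print(math.log(f(x, y, z)))                        # ln of the original function
print(g(math.log(x), math.log(y), math.log(z)))    # same value, computed linearly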
3.3 Support Vector Machines

In Section 3.2.3 it was stated that a combination of linear functions is sufficient to approximate any non-linear function. This statement is surely true from a mathematical standpoint, as infinitesimal calculus has shown that any function can be approximated with any desired precision, provided a sufficient number of its geometrical tangents at the proper points along its path.
Figure 3.7: Graphical representation of a neuron.

This statement also seems reasonable from a learning theory perspective, as the most complex learning apparatus that we are aware of, i. e. the human brain, is made up of linear devices, the neurons. The cell body of a neuron is called the soma, from which many thin branching arbors extend, called dendrites. A special dendrite, the axon, is much longer and thicker than the others, and terminates in a forest of filaments. A neuron receives input signals through synapses, chemical links between the filaments and the dendrites of adjacent neurons. Each synapse can either amplify or attenuate the input signal, which is in turn conveyed to the soma. If the sum of the input signals overcomes a certain threshold, it propagates to the axon and then to other neurons. The structure of a human neuron is sketched in Figure 3.7. A perceptron is a binary classifier shaped on the model of the animal neuron: it receives a vector x of input signals from the environment and weights them against the sensitivity factors w. If the sum of the weighted values, i. e. the scalar product w · x, overcomes the threshold, then the value 1 is output, 0 otherwise. The scheme of a perceptron is represented in Figure 3.8.
Figure 3.8: Graphical representation of a perceptron.

The perceptron actually models the hyperplane whose equation is
$$y = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \vec{w} \cdot \vec{x} + b,$$
and the output of the classification function is
$$f(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x} + b).$$
The signum function simply partitions the data points into two sets, those above and those below the hyperplane. The output of the learning algorithm consists of a vector of scaling factors w and a threshold b defining a hyperplane that can partition the training data, provided that at least one such hyperplane exists. The vector w is the gradient of the hyperplane, whereas the scalar −b/‖w‖ represents the distance of the hyperplane from the
origin. With respect to a hyperplane w · x + b = 0 and an example x_i, the following two metrics can be defined:

Definition 3.4 (Functional Margin). The functional margin γ_i of an example x_i
with respect to a hyperplane w · x + b = 0 is the product
$$\gamma_i = y_i (\vec{w} \cdot \vec{x}_i + b).$$

Definition 3.5 (Geometric Margin). The geometric margin γ′_i of an example x_i with respect to a hyperplane w · x + b = 0 is the product
$$\gamma'_i = y_i \left( \frac{\vec{w}}{\|\vec{w}\|} \cdot \vec{x}_i + \frac{b}{\|\vec{w}\|} \right).$$

Since w/‖w‖ · x is the projection of x on the normal to the hyperplane, it is immediate to notice that the geometric margin γ′_i is the distance of the point x_i from the hyperplane. These geometric concepts are represented in Figure 3.9.

Figure 3.9: Geometric margin of a linear classifier in R^2.

Depending on the initialization of the learning algorithm of the perceptron, the output hyperplane will be differently oriented and located, i. e. if a learning problem is linearly separable there will generally be many (infinite) combinations of w and b that can partition the training examples. Different hyperplanes may lead to different error probabilities, and hence to different performance. One of the more interesting results of statistical learning theory is that, in order to reduce such a probability, the hyperplane that maximizes the distance from both negative and positive examples should be selected. This is called the Maximal Margin Hyperplane (MMH) of the training set, and a classifier that learns and uses such a hyperplane is called a Maximal
Margin Classifier (MMC). The learning problem of an MMC is a constrained optimization problem. Assuming that the features are rescaled so that the maximal margin equals 1, its solution can be reduced to the study of the following minimization problem:
$$\min \|\vec{w}\| \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 \;\; \forall \vec{x}_i \in T_R,$$
which obviously depends on the composition of the training set T_R. It is interesting to notice how, for any choice of the hyperplane, the only vectors that affect the learning problem are those closest to the hyperplane, i. e. those whose distance from the hyperplane corresponds to the margin (=1). In other words, the hyperplane can be described in terms of the vectors which are located at a distance of 1 from it. These vectors are called support vectors, since they support the decisions of the learning algorithm, and the corresponding classifier is called a Support Vector Machine (SVM). The two major drawbacks of this learning model are that:
• the learned function corresponds to the lower bound of the error rate provided by the VC dimension and the related theory: since it is not possible to determine a lowest bound, there is no proof that this approach produces the best linear classifier;
• it requires a linearly separable data set, otherwise the constraints on the distribution of the data points can never be satisfied and the algorithm cannot converge.
This second aspect, which requires the satisfaction of a set of hard (i. e. “not flexible”) conditions, suggested an alternative name for this learning model, which is Hard Margin Support Vector Machine (HMSVM).
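A minimal sketch of this idea, assuming scikit-learn is available (it is not part of the system described in this thesis): a linear SVM is trained on a toy, linearly separable set, with a very large C approximating the hard margin formulation, and the resulting support vectors are inspected.

from sklearn.svm import SVC

# A tiny linearly separable training set in R^2.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],      # negative examples
     [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]      # positive examples
y = [-1, -1, -1, 1, 1, 1]

# A very large C leaves (almost) no room for constraint violations,
# approximating the hard margin SVM described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)        # only the points lying on the margin
print(clf.coef_, clf.intercept_)   # the learned w and b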
Figure 3.10: Hard (a) versus soft (b) margin hyperplanes.

In many real-world scenarios these constraints are likely not to be satisfied, because of the intrinsic inseparability of the feature space or because of some noisy data which isn't labelled correctly. In order to address these critical aspects, Soft Margin Support Vector Machines (SMSVMs) have been designed. Essentially, the learning algorithm is allowed to violate a certain number of constraints. This number is meant to be as small as possible, so that the consistency of the learnt hyperplane with the training data is preserved. In Figure 3.10, the difference between a soft and a hard margin for the same learning problem is sketched. The soft margin learning algorithm has violated two constraints, but the wider margin that results should grant a higher classification accuracy.

Reducing Multi-class to Binary

Many classification problems are formulated as multi-class, but there are many common and largely agreed approaches to reduce a multi-class problem to a combination of binary classifications [Allwein et al., 2000]. These approaches clearly depend on the internals of the learning algorithm and on the mathematical interpretation of the output of the classifier. For margin-based classification algorithms, such as SVMs, the two most common approaches are Pairwise and OVA (One vs All). In Pairwise, a separate binary classifier is trained to distinguish between each pair of classes, and their outputs are combined to predict the classes. For an N-class problem, this requires the training of N(N−1)/2 binary classifiers.
In OVA, instead, N classifiers are trained, each one discriminating between the examples of one class and those belonging to all the other classes combined. While Pairwise requires a greater number of classifiers, OVA classifiers are trained on larger sets of examples and as such are generally slower. Since the two approaches achieve different performance, and there is no clear agreement about one being better than the other (see for example [Pradhan et al., 2005a] and [Kreßel, 1999]), choosing one over the other really depends on the specific task.
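As an illustration of the OVA scheme, the sketch below wraps a generic binary margin-based learner; it is a simplified example assuming scikit-learn's LinearSVC, not the actual multi-class strategy implementation used in the CoNLL systems.

from sklearn.svm import LinearSVC

def train_ova(X, y, labels):
    # Train one binary classifier per label: its examples vs. all the others.
    return {label: LinearSVC().fit(X, [1 if t == label else -1 for t in y])
            for label in labels}

def classify_ova(classifiers, x):
    # Assign the label whose classifier returns the highest margin score.
    return max(classifiers,
               key=lambda label: classifiers[label].decision_function([x])[0])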
Chapter 4
Semantic Role Labelling Models
This chapter describes the typical anatomy and data flow of a semantic role labelling system. Section 4.1 introduces the modules that generally constitute such software architectures, along with the data they require and output. With a top-down approach, the architecture will be analyzed at increasingly deeper levels, down to its functional modules; Section 4.2 describes the most relevant linear (4.2.1) and structured (4.2.2) features that are currently used in state-of-the-art models and systems; finally, Section 4.3 presents the relevant characteristics of the most accurate SRL systems that took part in the CoNLL 2005 shared task.
4.1 Anatomy of SRL Systems

At the highest level of abstraction, a semantic role labelling system is a black box that receives some free text as input and outputs a semantic annotation
of it. The adjective free in “free text” means that it doesn't contain any meta-information or special formatting; it is just a string of text as a user would write in a web form. The output annotation is a representation of the input sentence in which the semantic structures that have been identified are somehow marked. The task is accomplished using some external resources, such as knowledge bases, linguistic tools and statistical models. As an SRL system is typically a module of a complex NLP architecture, its output will likely be a machine-readable set of data that later modules of the larger architecture will use for the application's needs. Hereinafter we will assume that an SRL system is fed with free text input sentences and outputs an in-line, bracketed annotation of PropBank-like predicate argument structures, such as:
[Arg0 He] got [Arg1 his money] [C-V back]
in response to the input text He got his money back. As SRL systems make large use of lexical and syntactic information, the first step in the identification of predicate-argument structures is the syntactic parse of the input sentence. As there are many robust and highly accurate syntactic parsers available to the scientific community (e. g. [Charniak, 2000] and [Collins, 1999]), full syntactic parsing is generally performed outside of the labelling system, as shown in Figure 4.1 (the CoNLL 2005 shared task has shown that deep syntactic parsing increases SRL performance, hence a full-parse based model is assumed).

Figure 4.1: A highest-level view of an SRL system.

In order for predicate-argument structures to be identified, the first step is to recognize the predicates themselves. The module responsible for this is called a Predicate Extractor (PredEx): it scans a parse tree searching for words or syntactic structures that suggest the presence of a predicate. This sub-task can be done both using statistical methods, i. e. training a classifier to recognize the predicates, and hand-crafting some lexical and
syntactic rules. In the case of verbal predicates, it is easy to write simple rules that recognize predicates by matching some regular expression against the POS tags assigned by the syntactic parser, hence this task can be successfully fulfilled with a deterministic approach. The output of this stage is a set of predicates P whose arguments should be identified within the parse tree t, as shown in Figure 4.2.
Figure 4.2: Predicate Extractor.

For the sake of simplicity, let's assume that the set P contains only one element: P = {p}. The second stage consists in identifying the groups of words that could be arguments of the predicate p, i. e. a set of candidate (or potential) arguments. As an argument can be any combination of adjacent words, the number of potential arguments is very high, i. e. as many as the distinct combinations of adjacent words that do not include the predicate p.
The first advantage of using full syntactic parses can be exploited here: as the syntactic parse provides a hierarchical grouping of words, i. e. each node of the parse tree corresponds to a self-contained linguistic constituent, the set of candidate arguments can be built as the set of tree nodes which are in the proper syntactic relationship with the predicate. The other tree nodes can be ignored, i. e. pruned from the tree. Most models follow the pruning approach described in [Xue and Palmer, 2004]:
1. set the current node to the predicate node; collect all its siblings unless they are coordinated with the current node and, for any sibling which is a prepositional phrase (PP), also collect its immediate children;
2. reset the current node to its parent, and repeat step 1 until the current node reaches the root of the parse tree.
The module that performs this task is called a Candidate Extractor (CandEx): given a predicate p and a parse tree t, it outputs a set Ac of candidate arguments for the predicate, as shown in Figure 4.3. The elements of Ac are nodes of the parse tree which, due to their syntactic relationship with the verb, could exactly span the boundaries of any of its arguments.
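A minimal sketch of this pruning procedure on a generic constituency tree is given below; the Node class and its fields are assumptions made for the illustration, not the API of any particular parser.

class Node:
    # A minimal constituency-tree node.
    def __init__(self, label, children=None, coordinated=False):
        self.label = label
        self.children = children or []
        self.parent = None
        self.coordinated = coordinated   # True if the node is part of a coordination
        for child in self.children:
            child.parent = self

def xue_palmer_candidates(predicate_node):
    # Collect candidate argument nodes for a predicate following the
    # pruning strategy of [Xue and Palmer, 2004] described above.
    candidates = []
    current = predicate_node
    while current.parent is not None:
        for sibling in current.parent.children:
            if sibling is current or sibling.coordinated:
                continue
            candidates.append(sibling)
            # for prepositional phrases, also collect their immediate children
            if sibling.label == "PP":
                candidates.extend(sibling.children)
        current = current.parent
    return candidates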
Figure 4.3: Candidate Extractor.

Once the potential arguments have been identified, they must either be assigned an argument label or discarded as non-arguments. This task is generally divided into two sub-tasks:
1. boundary detection, in which the set Ac is reduced so that it only contains nodes of the parse tree which are very likely arguments of the predicate;
2. argument classification, in which every remaining candidate is assigned the proper label.
These sub-tasks are typical statistical classification problems, and many different learning algorithms have been used to perform them. As an example, for CoNLL 2005 the Maximum Entropy (ME) statistical framework [Haghighi et al., 2005], Support Vector Machines [Moschitti et al., 2005b] and Decision Trees (DTs) [Ponzetto and Strube, 2005] were used, among others. As discussed in Section 3.2, the input data to both the learning algorithm and the classification function must be represented in terms of their relevant features. The module responsible for extracting this feature representation of the input data is called a Feature Extractor (FeatEx): it realizes a mapping φ that associates to each input value its proper representation in the multi-dimensional feature space. As detailed in Section 4.2, most features depend on a set of syntactic relations that hold between the predicate and each of the candidate arguments. As these relations are realized in the parse tree, the FeatEx module takes as input the parse tree t, the target predicate p and the set of candidate argument nodes Ac. For each c_i ∈ Ac, the module outputs its feature representation c_i ∈ R^n:
$$\vec{c}_i = \phi(c_i, p, t),$$
which depends on the potential argument, the predicate and the parse tree. As a whole, the output of the module is the set of feature representations
$$F_c = \bigcup_i \vec{c}_i,$$
as shown in Figure 4.4.
Figure 4.4: Feature Extractor.

The boundary classification task consists in identifying the subset F_c^+ of nodes (i. e. of their representations) in F_c that actually are arguments of the target predicate. This is essentially a refinement of the pruning strategy previously described, one that takes into account the more complex, descriptive and computationally expensive features output by the feature extractor. The Boundary Classifier module (BndClass) consists of a binary classifier that takes as input the set F_c and outputs the set F_c^+ of potential argument representations identified as positive boundaries, i. e. nodes that, being arguments of the predicate, should be assigned a role label, as shown in Figure 4.5.
Figure 4.5: Boundary Classifier.

The argument classification task is a multi-class problem, as any element of F_c^+ must be assigned one of the possible argument types, i. e. a label
l_j ∈ L. Depending on the learning algorithm adopted, this step can require a different number of classifiers. In the case of binary classification algorithms, some strategies exist to combine the output of multiple binary classifiers into the equivalent of a multi-class classifier, as discussed in Section 3.3 for the specific case of SVMs. The Argument Classifier module (ArgClass) associates to each positive boundary in F_c^+ the best possible labelling, i. e. the label that maximizes its classification score. The output of this stage is a set of labellings F_c^l whose elements are pairs ⟨c_i, l_j⟩, each argument node being assigned a labelling. The behaviour of this module is represented in Figure 4.6.
Figure 4.6: Argument Classifier.

This two-stage approach to the argument classification task is inspired by the architecture proposed in [Pradhan et al., 2005a] and partially revised in [Moschitti et al., 2005b]. It is an interesting solution as it requires that only one classifier, i. e. the boundary classifier, is trained on the whole data set, whereas the multi-class classifier (generally consisting of a collection of binary one-label classifiers) can be trained on the positive boundaries only, which amount to about 20% of the data set. This solution drastically reduces both training and classification time, at a cost in recall which is generally very small, i. e. few actual argument nodes are pruned during the boundary classification task. Depending on the precision and recall of the classifiers employed, it is possible that the set F_c^l contains overlapping nodes, i. e. argument nodes
that dominate a subtree containing other argument nodes. Figure 4.7(a) represents a typical overlap situation, in which 3 nodes dominating each other are all marked as arguments of the predicate read. An Overlap Resolution module (OverRes) removes from the set F_c^l those nodes that are causing conflicts. As each conflict involves at least two nodes, i. e. an overlapping and an overlapped node, the module must be able to choose the best configuration out of the alternatives that result from removing each involved node in turn. For example, Figure 4.7(b) and (c) are two possible, non-overlapping solutions for the same problem.
Figure 4.7: Overlap resolution example.

This issue can be addressed using some heuristics that try to resolve the problem locally, for example:
• removing the nodes that cause the greatest number of conflicts;
• preferring core-labelled nodes to adjuncts;
• preferring higher or lower level nodes, and so on (a sketch of the first of these heuristics is given below).
Another solution is to train a statistical classifier to recognize the most likely configuration of argument nodes and labels. This latter approach generally involves a joint model that evaluates all the possible labelling configurations and chooses the best one. This is typically done by means of some proposition re-ranking mechanism, and is further investigated in Section 5.2.
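Going back to the local heuristics listed above, the following is a minimal Python sketch of the first one (dropping the node involved in the greatest number of conflicts). The dictionary-based node representation and the function names are hypothetical; actual systems may combine several such criteria.

def conflicts(node, others):
    # Two labelled nodes conflict when one dominates the other, i.e. when
    # the word span of one is contained in the word span of the other.
    return [o for o in others
            if o is not node
            and (node["span"] <= o["span"] or o["span"] <= node["span"])]

def resolve_overlaps(labelled_nodes):
    # labelled_nodes: dicts such as {"span": frozenset({3, 4, 5}), "label": "A1"}.
    nodes = list(labelled_nodes)
    while len(nodes) > 1:
        worst, index = max((len(conflicts(n, nodes)), i)
                           for i, n in enumerate(nodes))
        if worst == 0:
            break                     # no overlapping pair is left
        del nodes[index]              # drop the most conflicting node
    return nodes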
Of course, it would be possible to perform overlap resolution prior to argument classification, but the labels assigned to the argument nodes provide useful features for the task, both for rule-based and statistical approaches. The overlap resolution module outputs a set of labelled argument nodes A which is a consistent argument structure for the target predicate, as shown in Figure 4.8.

Figure 4.8: Overlap Resolver.

With a consistent labelling scheme defined in A, the semantic role labelling task is completed. A simple, procedural module would receive both the parse tree and the argument structure and output a human or machine readable annotation of the sentence, depending on the application context. The overall flowgraph of a traditional SRL system is represented in Figure 4.9.
4.2 Features for the SRL task

This section provides an overview of the features (see Section 3.2) which are traditionally employed for the SRL task. Section 4.2.1 describes the most widely used linear features, i. e. features that are explicitly represented in the Euclidean feature space of the learning algorithm, whereas Section 4.2.2 describes a special kernel function, the Tree Kernel (TK), which can exploit structured information to implicitly represent a very large and expressive feature space.
Figure 4.9: Flowgraph of a typical SRL system.
4.2.1 Linear Features

Linear features are typically represented as attribute-value pairs, i. e. associations between a dimension of the target feature space and the value of the feature's projection on the corresponding axis. As discussed in Section 3.2, feature engineering is a long and complex task, as the relevant features for a given problem are not always obvious, and choosing the best representation for a feature is not always simple or straightforward [Jackendoff, 1990]. For SRL as addressed by the CoNLL 2005 shared task, i. e. identifying predicate-argument structures making use of full syntactic information, a fairly established set of features has been defined [Gildea and Jurafsky, 2002, Pradhan et al., 2004]. These features are employed, possibly with some extensions, by the vast majority of SRL systems, and are described in the remainder of this section.

Predicate  The lemmatization of the predicate is of course a relevant feature. Lemmatization, or stemming, consists in removing the affixes of a word, preserving only its base form, e. g.:
• forgiven→forgive;
• got→get;
• brought→bring, and so on.
Lemmatization is useful as it allows all the different inflections of a word to be collapsed into a single representation, thus easing the generalization task for the learning algorithm.

Figure 4.10: The Path feature.

Path  The path feature represents the syntactic path that links the predicate and an argument, i. e. the sequence of nodes that must be traversed in order to reach the node covering the predicate from the node covering the argument, along with the traversal direction, i. e. upwards (↑) or downwards (↓). As an example, the path feature for the candidate argument Arg0 in Figure 4.10 would be NP ↑ S ↓ VP ↓ VBD.
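The following is a minimal sketch of how the Path feature can be computed from parent pointers; the Node class and its fields are illustrative assumptions rather than the representation used by any specific system.

class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def ancestors(node):
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain                          # node, its parent, ..., the root

def path_feature(arg_node, pred_node):
    up = ancestors(arg_node)              # from the argument node upwards
    down = ancestors(pred_node)           # from the predicate node upwards
    common = next(n for n in up if n in down)   # lowest common ancestor
    rising = up[:up.index(common) + 1]          # argument node .. common ancestor
    falling = reversed(down[:down.index(common)])  # nodes below the ancestor
    return " ↑ ".join(n.label for n in rising) + \
           "".join(" ↓ " + n.label for n in falling)

For the tree of Figure 4.10 this sketch would return the string "NP ↑ S ↓ VP ↓ VBD".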
Phrase Type  The Phrase Type represents the syntactic category of the argument node, such as NP (Noun Phrase), PP (Prepositional Phrase) and so on.

Position  A binary feature is used to represent the position of the argument with respect to the predicate, i. e. whether the argument is to the left or to the right of the predicate.
Voice  The voice of the predicate, i. e. whether it is realized as an active or passive construction, is represented with a binary feature. This is generally extracted by matching some regular expressions on the nodes of the parse tree, e. g. searching for forms of the verb to be in a close position to the left of a past participle predicate.

Head Word  This feature represents the syntactic head of the argument phrase. It is commonly extracted with a rule-based approach, such as the head word table described in [Collins, 1999].

Verb Sub-categorization  This feature represents the production rule of the parser's grammar that was used to expand the predicate's parent node. For the example represented in Figure 4.10 this feature's value would be VBD-NP-ADVP.

Named Entities in Constituents  Named entities are grouped into a set of categories, and each category is assigned a binary feature that is set to 1 if at least one of the named entities in that category is contained in the argument phrase.

Head Word POS  The POS tag of the head word is meant to reduce the ambiguity of the Head Word feature.

Verb Clustering  Verbs are clustered with respect to the type of verb-direct-object relation [Lin, 1998].

Partial Path  The Path feature is very relevant, yet it is one of the most sparse and therefore tends to generalize poorly. In order to overcome this problem, it is possible to consider only the partial path instead of the full path, i. e. the path from the argument node to the closest common ancestor
with the predicate. With respect to the example of Figure 4.10 the partial path would be NP ↑ S.

Verb Sense Information  As each verb sense activates a different argument structure (see Section 2.1.2), the verb sense could be a most relevant feature to recognize predicate arguments properly. Still, it is difficult to recognize the verb sense accurately in a real-world scenario, without knowing the actual argument structure of the predicate.

Head-word of Prepositional Phrases  To reduce the sparseness of the phrase type feature, the head word of a PP can be used to specialize the syntactic label. This is especially useful for adjunct arguments, which are generally dominated by PPs characterized by a specific set of head words. The specialization effect can be achieved by attaching the preposition to the constituent label, e. g. PP-in or PP-for.

First and Last Word and POS in Constituent  Some argument types are characterized by co-occurring patterns of first and last words. These words, along with their POS tags, are included in the feature set.

Ordinal Constituent Position  This feature provides a statistical account for the presence and position of an argument with respect to the verb.

Constituent Tree Distance  This is a finer grained definition of the position feature, which takes into account the distance of the argument from the predicate.

Constituent Relative Features  Encoding some features of the constituents surrounding the argument node, i. e. its parent and immediate neighbours, can provide contextual information which is useful for proper generalization.
Temporal Cue Words  Some temporal markers are very peculiar to certain argument types, e. g. ArgM-TMP, and therefore their presence in the argument's constituent is represented as a binary feature.

Dynamic Class Context  It is possible to dynamically add some features to account for the classification of the other nodes which lie within the same subtree as the candidate argument node.
4.2.2 Structured Features: Tree Kernels

As shown in the previous section, there are many features that are considered to be relevant for the SRL task. To handle so many features, an SRL system's feature extractor must implement many different algorithms, most of which are hard to debug due to the sparseness and extent of the resulting feature space. Tree Kernels (TK) are a viable way to represent a huge feature space derived from structural data. In particular, it has been shown that they are very well suited to model syntactic aspects in NLP applications, such as the extraction of predicative semantic structures [Moschitti, 2004]. In fact, they can implicitly represent a huge feature space by extracting many types of tree fragments from a syntactic parse tree. Each fragment is mapped to a dimension of the feature space, and the set of shared substructures between two trees is used as a measure of their similarity. The kernel approach, which does not require any noticeable feature design effort, can provide the same accuracy as manually designed features, and sometimes it can suggest to the designer new solutions to improve the model of the target linguistic phenomenon. As the similarity d(t1, t2) between two trees t1 and t2 is defined in terms of their common fragments, it is important to decide which fragments must be taken into account, i. e. the pruning rules that result in valid substructures.
Figure 4.11: A syntactic parse tree and the corresponding grammar production rules.

Formally, a tree is a directed, connected and acyclic graph with a special node called the root, which is the only node without incoming edges. A node can have one or more direct descendants, its children, all of which are root nodes with respect to the sub-trees they dominate. A node without children is said to be a leaf, whereas if it has at least one child it is called an internal node. In the case of syntactic parse trees, each node with its children is associated with a grammar production rule, where the left-hand side of the rule corresponds to the parent and the right-hand side to the children. The terminal symbols of the grammar are always associated with the leaves of the tree, which is why leaf nodes are also called terminal nodes, or simply terminals. A representation of a syntactic parse tree and the corresponding grammar production rules is provided in Figure 4.11. Given a parse tree and the grammar that generated it, it is possible to define different substructure types. A SubTree (ST) is a substructure that comprises any node of a tree along with all its descendants. A SubSet Tree (SST) is a more general structure in which not all the descendants of a node are necessarily included. The only restriction is that the production rules of the grammar cannot be broken, i. e. the SST must be generated by applying a subset of the grammatical rules that generated the original tree [Collins and Duffy, 2002]. Hence, a major difference between STs and SSTs is that the former always comprise the
leaves underneath the subtree root, whereas the latter do not always do so. A yet more relaxed fragment definition, allowing the production rules to be broken, leads to the substructures called Partial Trees (PTs), which can be generated out of any subset of connected nodes and edges of the original parse tree. Examples of ST, SST and PT fragments extracted from the same syntactic parse tree are represented in Figure 4.12.
[Moschitti, 2004] presents a slightly modified version of the kernel function described in [Collins and Duffy, 2002] for ST and SST evaluation. Given a tree fragment space F = {f1, f2, . . .}, an indicator function Ii(n) is defined to be equal to 1 if the target fi is rooted at node n and 0 otherwise. The kernel function k(t1, t2) that evaluates the similarity between two trees is then defined as:

k(t1, t2) = Σ_{n1 ∈ Nt1} Σ_{n2 ∈ Nt2} ∆(n1, n2)     (4.1)

where Nt1 and Nt2 are the sets of the nodes in t1 and t2, respectively, and

∆(n1, n2) = Σ_{i=1}^{|F|} Ii(n1) Ii(n2) .

This latter function evaluates the number of common fragments rooted at the nodes n1 and n2, and can be calculated as follows:
• if the productions at n1 and n2 are different then ∆(n1, n2) = 0;
• if the productions at n1 and n2 are the same, and n1 and n2 only have leaf children (i. e. they are pre-terminal symbols) then ∆(n1, n2) = 1;
• if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals then

∆(n1, n2) = Π_{j=1}^{C(n1)} (σ + ∆(c_{n1}^j, c_{n2}^j))     (4.2)
Figure 4.12: Examples of SubTree (b), SubSet Tree (c) and Partial Tree (d) structures for the same parse tree (a).
where σ ∈ {0, 1}, C(n1) is the number of children of n1 and c_{n}^j is the j-th child of the node n. Note that, as the productions are the same, C(n1) = C(n2). When σ = 0, ∆(n1, n2) is equal to 1 only if ∆(c_{n1}^j, c_{n2}^j) = 1 ∀j, i. e. all the productions associated with the children are identical. By recursively applying this property, it follows that the SubTrees in n1 and n2 are identical. Thus, Equation 4.1 evaluates the ST kernel. When σ = 1, ∆(n1, n2) evaluates the number of SSTs common to n1 and n2, as proved in [Collins and Duffy, 2002].
The evaluation of PTs is more complex, since two nodes n1 and n2 with different child sets (i. e. associated with different productions) can share one or more children and hence have one or more common substructures. An evaluation of their number can be performed by:
• selecting a children subset from both trees,
• extracting the portion of the syntactic rule that contains such a subset,
• applying Equation 4.2 to the extracted partial productions and
• summing the contributions of all the children subsets.
Such subsets correspond to all possible common (non-continuous) node subsequences and can be computed efficiently by means of sequence kernels [Lodhi et al., 2000].
As concerns ST and SST structures, the worst case computation time is quadratic in the number of nodes of the parse trees, i. e. in the number of substructure pairs that must be evaluated. Nevertheless, as already observed in [Collins and Duffy, 2002], many pairs refer to nodes that are associated with different production rules, resulting in a null ∆ function value. Based on this observation, an approach to reduce the complexity of ST and SST kernel computation has been proposed in [Moschitti, 2006]: evalu-
ating the similarity between two trees t1 and t2, only the pair set Np = {⟨n1, n2⟩ ∈ Nt1 × Nt2 : p(n1) = p(n2)} can be considered, p(n) being the production rule that expanded the node n. The node set Np can be built efficiently by:
1. extracting the lists, L1 and L2, of the production rules from t1 and t2,
2. sorting them in alphanumeric order and
3. scanning them to find the node pairs ⟨n1, n2⟩ such that p(n1) = p(n2) ∈ L1 ∩ L2.
The third step may require only O(|Nt1| + |Nt2|) time, but, if p(n1) appears r1 times in t1 and p(n2) is repeated r2 times in t2, then r1 × r2 pairs must be considered. Note that:
• the list sorting can be carried out in O(|Nt1| × log(|Nt1|)) during the data preparation steps;
• the algorithm's worst case occurs when both parse trees are generated using a single production rule, i. e. the two nested while loops carry out |Nt1| × |Nt2| iterations;
• on the contrary, the kernel time complexity for two identical parse trees is still linear, provided that few productions are used many times in their generation.
A similar algorithm can also be used to improve the speed of the PT evaluation, taking into account nodes instead of production rules.
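As a sketch of the ideas above, the following Python fragment implements the ∆ recursion of Equations 4.1-4.2 (σ = 0 yields the ST kernel, σ = 1 the SST kernel) and mimics the Np optimization by grouping nodes by production, so that ∆ is only evaluated on pairs sharing the same production. The Node class and the way productions are encoded are illustrative simplifications, not the implementation of [Moschitti, 2006].

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def production(n):
    # A node's production: its label plus the ordered labels of its children.
    return (n.label, tuple(c.label for c in n.children))

def all_nodes(t):
    out, stack = [], [t]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def is_preterminal(n):
    return bool(n.children) and all(not c.children for c in n.children)

def delta(n1, n2, sigma):
    # Delta of Equation 4.2: common fragments rooted at n1 and n2.
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1.children, n2.children):
        result *= sigma + delta(c1, c2, sigma)
    return result

def tree_kernel(t1, t2, sigma=1):
    # Equation 4.1, restricted to node pairs that share the same production.
    by_production = {}
    for n in all_nodes(t1):
        if n.children:
            by_production.setdefault(production(n), []).append(n)
    total = 0
    for n2 in all_nodes(t2):
        if not n2.children:
            continue
        for n1 in by_production.get(production(n2), []):
            total += delta(n1, n2, sigma)
    return total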
4.3 Literature Systems

This section lists the characteristics of the top-scoring systems that have been presented at the CoNLL 2005 shared task (see Section 2.2). All of this section's data has been excerpted from [Carreras and Màrquez, 2005]. Table 4.2 reports the main properties of these systems, whereas Table 4.3 presents a synthetic view of the features they employed. The systems are sorted by their F1 measure on the Wall Street Journal (WSJ) test set, i. e. the automatic classification of the roles of Section 23 of the PropBank. An account of their performance on the development and test sets is provided in Section 6.1. Table 4.1 defines a mapping between a system's label and the reference to the corresponding bibliographic entry.

punyakanok   [Punyakanok et al., 2005]
haghighi     [Haghighi et al., 2005]
marquez      [Màrquez et al., 2005]
pradhan      [Pradhan et al., 2005b]
surdeanu     [Surdeanu and Turmo, 2005]
che          [Liu et al., 2005]
tsai         [Tsai et al., 2005]
moschitti    [Moschitti et al., 2005b]

Table 4.1: Referenced CoNLL 2005 systems.
            ML-method  Synt           Pre   Label  Glob  Post
punyakanok  SNoW       n-cha,col      x&p   i+c    yes   no
haghighi    ME         n-cha          ?     i+c    yes   no
marquez     AB         cha,upc        seq   bio    no    no
pradhan     SVM        cha,col/chunk  ?     c/bio  no    no
surdeanu    AB         cha            prun  c      no    yes
che         ME         cha            no    c      no    yes
tsai        ME,SVM     cha            x&p   c      yes   no
moschitti   SVM        cha            prun  i+c    no    no

Table 4.2: Main properties of the SRL strategies implemented by the top-scoring participant teams at the CoNLL 2005 shared task. ML-method: learning algorithm adopted, i. e. AB (AdaBoost), ME (Maximum Entropy), SNoW (Winnow-based network of linear separators), SVM (Support Vector Machines). Synt: syntactic structures explored, i. e. col, cha, upc (Collins, Charniak and UPC shallow parse trees provided by the organization), n-cha (the best n parses generated by the Charniak parser), chunk (chunking-based sequential tokenization). Pre: pre-processing strategies, i. e. x&p (pruning strategy described in [Xue and Palmer, 2004]), seq (sequentialization of hierarchical syntactic structures), prun (some other pruning strategy), no (no pre-processing), ? (unknown, not described). Label: labelling strategy, i. e. i+c (two stage boundary detection and argument classification), c (single stage, multi-class comprising the null category), bio (BIO - Beginning, Inside, Outside - tagging scheme used by chunk-based systems). Glob: usage of a global, generally probabilistic overlap-resolution strategy. Post: usage of heuristic post-processing to correct systematic errors.
Table 4.3: Main features used by the top-scoring participant teams at the CoNLL 2005 shared task. Sources - synt: use of parsers, namely Charniak (cha), Collins (col), UPC partial parsers (upc), Bikel's Collins model (bik) and/or argument-enriched parsers (an, am); ne: use of named entities. On the argument - at: argument type; aw: argument words, namely the head (h) and/or content words (c); ab: argument boundaries, i. e. form and PoS of first and/or last argument words; ac: argument context, capturing features of the parent (p) and/or left/right siblings (s), or the tokens surrounding the argument (t); ai: indicators of the structure of the argument (e. g. on internal constituents, surrounding/boundary punctuation, governing category, etc.); pp: specific features for prepositional phrases; sd: semantic dictionaries. On the verb - v: standard verb features (voice, word/lemma, POS); sc: subcategorization. On the argument-verb relation - rp: relative position; di: distance, based on words (w), chunks (c) or the syntactic tree (t); ps: standard path; pv: path variations; pi: scalar indicator variables on the path (of chunks, clauses, or other phrase types), common ancestor, etc.; sf: syntactic frame [Xue and Palmer, 2004]. On the complete proposition - as: sequence of arguments of a proposition.
Chapter 5
A Kernelized Semantic Role Labelling System
This section concentrates on the original contribution of this thesis to the SRL task. The models and the results presented in this section and in Section 6.2 have already undergone the judgement of the scientific community, having been presented as full papers in several natural language and machine learning conferences [Moschitti et al., 2005a, Moschitti et al., 2006a] and workshops [Moschitti et al., 2006b]. The re-ranking mechanism described in Section 5.2 has been accepted at the CoNLL-X conference, to be held in June 2006. Our attention has been focused on tree kernels and especially on how to engineer structured features for specific SRL subtasks, such as overlap resolution, boundary detection, argument classification and proposition re-ranking. To employ tree kernels for the first three subtasks, a standard SRL system as described in Section 4 doesn't need to be reworked, as the only component which is affected by the use of TKs along with (or instead of) linear features is the Feature Extractor module. Our kernelized approach to the solution of these problems is discussed in Section 5.1. On the other hand, introducing a proposition re-ranking module implies that many existing modules are altered, whereas new ones are added. These aspects are discussed in Section 5.2.
5.1 Feature Engineering using Tree Kernels

This section documents our attempts to engineer the best tree kernel functions for some subtasks within semantic role labelling, namely overlap resolution, boundary detection and argument classification (see Section 4.1). In [Moschitti, 2004], two main drawbacks of using tree kernels have been pointed out:
• highly accurate boundary detection cannot be carried out by a tree kernel model, since correct and incorrect arguments may share a large portion of the encoding trees, i. e. they may share many substructures;
• manually derived features (possibly extended with a polynomial kernel) have been shown to be superior to tree kernel approaches.
Nevertheless, we believe that modeling a completely kernelized SRL system is useful for the following reasons:
• we can implement it very quickly, as the feature extractor module only requires the writing of the subtree extraction procedure. Traditional SRL systems are, in contrast, based on the extraction of more than thirty features [Pradhan et al., 2005a], which require the writing of at least thirty different procedures;
• combining it with a traditional attribute-value SRL system allows us to obtain a more accurate system. Usually the combination of two traditional systems (based on the same machine learning model) does not
result in an improvement, as their features are more or less equivalent, as shown in [Carreras and Màrquez, 2005];
• the study of the effective structural features can inspire the design of novel linear features which can be used with a more efficient model, i. e. linear SVMs.
Our engineering approach consists in marking the nodes of the encoding subtrees in order to generate substructures more strictly correlated with a particular argument, boundary or predicate. For example, marking the node that exactly covers the target argument helps tree kernels to generate different substructures for correct and incorrect argument boundaries. Note that each markup strategy impacts on the output of a kernel function in terms of the number of structures common to two trees. The same output could be obtained using unmarked trees and consistently redefining the kernel function, i. e. the algorithm described in Section 4.2.2. Another technique that we applied to engineer different kernels is the subdivision of internal and pre-terminal nodes. Designing different classifiers for these two different node types slightly increases the accuracy and remarkably decreases the learning and classification time.
5.1.1 Overlap Resolution

Given a parse tree t and any subset Nt = {n1, . . . , nk} of the nodes of t, we call r the lowest common ancestor of n1, . . . , nk. Then, from the set of all the descendants of r, we remove all the nodes nj that:
• do not belong to Nt;
• are neither ancestors nor descendants of any node belonging to Nt.
The resulting tree rooted in r is called a Nodeset Spanning Tree (NST), i. e. an NST r(Nt) is a partial tree of t rooted in r that contains only the nodes ni ∈ Nt along with their ancestors and descendants.
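The construction of an NST can be sketched as follows; the Node class with parent pointers is an illustrative assumption, and the function returns a pruned copy of the original tree rooted in the lowest common ancestor r.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def ancestors(n):
    out = []
    while n is not None:
        out.append(n)
        n = n.parent
    return out

def spanning_tree(selected):
    # Lowest common ancestor r: the deepest node shared by all ancestor chains.
    chains = [ancestors(n) for n in selected]
    root = next(n for n in chains[0] if all(n in c for c in chains[1:]))
    keep = set()
    for n in selected:
        keep.update(ancestors(n))          # the selected nodes and their ancestors
        stack = [n]
        while stack:                       # ... and all their descendants
            m = stack.pop()
            keep.add(m)
            stack.extend(m.children)

    def copy(n):
        return Node(n.label, [copy(c) for c in n.children if c in keep])

    return copy(root)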
Since predicate arguments are associated with tree nodes, we can define the Predicate Argument Spanning Tree (ASTN) of a predicate's argument set Np = {a1, . . . , an} as the NST over such nodes and the predicate node, i. e. the node np exactly covering the predicate p. The ASTN of a predicate p and its argument nodes {a1, . . . , an} will also be referred to as p{a1,...,an}. An ASTN corresponds to the minimal sub parse tree whose leaves are all and only the word sequences composing the arguments and the predicate. For example, Figure 5.1 shows the parse tree of the sentence: John took the book and read its title. took{ARG0,ARG1} and read{ARG0,ARG1} are the two ASTN structures associated with the two predicates took and read, respectively. For each predicate, only one NST is a valid ASTN.

Figure 5.1: A sentence parse tree with two ASTNs.

An automatic classifier which recognizes the spanning trees could potentially be used to detect the predicate argument boundaries. Unfortunately, the application of such a classifier to all possible sentence subtrees would require an exponential execution time. As a consequence, we can use it only to decide on a reduced set of subtrees associated with a corresponding set of candidate boundaries. An ASTN is sensitive to the whole predicate argument structure. As such, feature design for the ASTN representation is not simple. Tree kernels are a viable alternative that allows the learning algorithm to measure the similarity between two ASTNs in terms of all possible tree substructures.
This section describes a boundary classifier for predicate argument labeling based on two phases:
1. a first annotation of potential arguments by using a high recall traditional boundary classifier (TBC);
2. an ASTN classification step aiming to select the substructures that do not contain overlaps and are more likely to encode the correct argument set.
The resulting architecture is slightly different from that described in Section 4.1, as the overlap resolution stage is scheduled before argument classification. The set Fc+ of argument nodes recognized by the TBC can be associated with the corresponding sentence subtrees, which in turn can be classified using tree kernel functions. These measure whether a subtree is compatible or not with the subtree of a correct predicate argument structure. In order to have a very efficient procedure, we apply the ASTN classifier only to the candidate ASTNs associated with overlapping nodes, i. e. we look for node pairs ⟨n1, n2⟩ ∈ Fc+ × Fc+ where n1 is an ancestor of n2 or vice versa, Fc+ being the set of candidate arguments for a predicate p that are classified as positive boundaries by the TBC. After we have detected such nodes, we create the two node sets F1+ = Fc+ − {n1} and F2+ = Fc+ − {n2} and classify the corresponding ASTNs pF1+ and pF2+ with the ASTN classifier to select the most correct set of argument boundaries. This procedure can be generalized to a set of overlapping nodes O with more than 2 elements, as all we need to do is to generate all and only the permutations of Fc+'s nodes that do not contain overlapping pairs. Figure 5.2 shows a working example of such a multi-stage classifier. In (a), the TBC labels as potential arguments four nodes (circled), three of
which are overlapping (in bold). The overlap resolution algorithm proposes two solutions (b), of which only one is correct. In fact, according to the second solution, the prepositional phrase of the book would incorrectly be attached to the verbal predicate, i. e. in contrast with the parse tree. The ASTN classifier, applied to the two NSTs, should detect this inconsistency and provide the correct output.
Figure 5.2 also highlights a critical problem the ASTN classifier has to deal with: as the two NSTs are perfectly identical, it is not possible to discern between them using only their parse-tree fragments. The solution is to engineer novel features by simply adding the boundary information provided by the TBC to the NSTs. We mark with a progressive number the phrase type corresponding to an argument node, starting from the leftmost argument. We call the resulting structure an Ordinal Predicate Argument Spanning Tree (ASTord_N). For example, in the first NST of Figure 5.2(c), we mark as NP-0 and NP-1 the first and second argument nodes, whereas in the second NST we have a hypothesis of three arguments on three nodes, which we transform into NP-0, NP-1 and PP-2. This simple modification enables the tree kernel to generate features useful to distinguish between two identical parse trees associated with different argument structures. For example, for the first NST the fragments [NP-1 [NP][PP]], [NP [DT][NN]] and [PP [IN][NP]] are generated. They no longer match the fragments [NP-0 [NP][PP]], [NP-1 [DT][NN]] and [PP-2 [IN][NP]] generated from the second NST.

Figure 5.2: An overlap situation and the different marking strategies adopted for its resolution.
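A sketch of the marking strategies is given below: mark_ordinal produces the ordinal labels of an ASTord_N, while mark_roles attaches the role labels proposed by a role classifier, as done by the marked structure described next. The Node fields and the left-to-right ordering criterion are illustrative assumptions.

class Node:
    def __init__(self, label, children=None, first_word=0):
        self.label = label
        self.children = children or []
        self.first_word = first_word   # index of the leftmost word covered

def mark_ordinal(argument_nodes):
    # Append a progressive index, left to right: NP-0, NP-1, PP-2, ...
    for i, node in enumerate(sorted(argument_nodes, key=lambda n: n.first_word)):
        node.label = "%s-%d" % (node.label, i)

def mark_roles(argument_nodes, roles):
    # Append the role proposed for each node: NP-A0, NP-A1, ...
    for node, role in zip(argument_nodes, roles):
        node.label = "%s-%s" % (node.label, role)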
We also experimented with another structure, the Marked Predicate Argument Spanning Tree (ASTm_N), in which each argument node is marked with a role label assigned by a traditional role multi-classifier (TRM). Of course, this model requires a TRM to classify all the nodes recognized by the TBC first. An example ASTm_N is shown in Figure 5.2(d). The evaluation of the proposed models is carried out in Section 6.2.1.

Figure 5.3: AST1 relative to the argument Arg1 of the predicate delivers.
5.1.2 Boundary Detection and Argument Classification

In [Moschitti, 2004], Predicate Argument Features (PAFs) have been shown to be very effective for argument classification. A PAF is an NST defined over a pair of nodes {na, np}, where np is the node covering the predicate p and na is a candidate argument node for p, i. e. p{na} using the NST formalism previously adopted. In an attempt to unify the notation, hereinafter PAFs will be referred to as Argument Spanning Trees (AST1s, the subscript 1 stressing the fact that the structure only encompasses 1 of the N arguments of the predicate). An example AST1 is shown in Figure 5.3. As already said, AST1s have been shown to be very effective for argument classification, but not for boundary detection. The reason is that two nodes that encode correct and incorrect boundaries may generate very similar AST1s and, consequently, have many fragments in common. For example, Figure 5.4(a) shows two AST1s corresponding to a correct (AST1+) and an incorrect (AST1-) choice of the boundary for the argument Arg1. They have
fourteen fragments in common, as shown in (c). This prevents the algorithm from making different decisions in such cases. To solve this problem, we specify which node exactly covers the argument (i. e. the argument node) by simply marking it with the label B, denoting the boundary property. Figure 5.4(b) shows the two new marked AST1s, which we refer to as ASTm_1s. The features generated from the two subtrees are now very different, so that there are only three substructures in common, as shown in (d).

Figure 5.4: AST1s (a) and ASTm_1s (b) extracted for the same target argument with their respective common fragment spaces (c, d).

Yet, since the type of a target argument strongly depends on the type and number of the predicate's arguments (this is true at least for core arguments) [Toutanova et al., 2005, Punyakanok et al., 2005], to correctly label an argument we should extract features from the complete predicate argument structure it belongs to. In contrast, AST1s completely neglect the information (i. e. the tree portions) related to non-target arguments. One way to use this further information with tree kernels is to use the minimum subtree that spans the whole predicate argument structure, i. e. an ASTN. However, ASTNs pose some problems:
• we cannot use them for the boundary detection task, since we do not know the predicate's argument structure yet. However, we can derive the ASTN (or rather its approximation) from the nodes selected by a traditional boundary classifier, i. e. the nodes that correspond to potential arguments. Such approximated ASTNs can be easily used in the argument classification stage;
• obviously, an ASTN is the same for all the arguments it includes, thus we need a way to differentiate it for each target argument. Again, we can mark the target argument node as shown in the previous section. We refer to this subtree as a Marked-Target ASTN (ASTmt_N). However, for large arguments (i. e. spread over a large part of the sentence tree)
the substructures' likelihood of being part of different arguments is quite high. To address this problem, we can mark all the nodes that descend from the target argument node. Figure 5.5 shows an ASTN in which the subtree associated with the target argument Arg1 has all its non-terminal nodes marked. We refer to this structure as a Completely-Marked-Target ASTN (ASTcmt_N). ASTcmt_Ns may be seen as AST1s enriched with new information coming from the other arguments (i. e. the non-marked subtrees). Note that we obtain a differently marked subtree if we consider only the AST1 subtree of an ASTcmt_N, which we refer to as a Completely Marked AST1 (ASTcm_1).

Figure 5.5: ASTcmt_N relative to the argument Arg1 of the predicate delivers.

Studying the SRL task, we also noted that many argument types are found mostly on pre-terminal nodes, e. g. modifier or negation arguments, and do not necessitate training data extracted from internal nodes. We then decided to try splitting each classifier in two:
• a classifier for internal nodes;
• a classifier for pre-terminal nodes.
This methodology doesn't require a major rework of the architecture described in Section 4.1, as all we need to do is to filter the output of the tree kernel extractor (i. e. the kernelized version of the feature extractor), run two different classifiers on the two sets and then simply merge the results. We
refer to such a model as a split classifier, as opposed to a standard, or monolithic, classifier that works on both internal and pre-terminal nodes. This approach is expected to be (at least slightly) more accurate, as each classifier's training and test sets are more homogeneous, as well as more efficient, due to the smaller size of the data sets. These models were presented at the EACL 2006 Workshop on Learning Structured Information in Natural Language Applications; their evaluation is reported in Section 6.2.2.
5.2 Re-ranking Propositions

A more sophisticated approach to SRL involves less deterministic processing, consisting in the identification of a set of candidate propositions for each target predicate and a voting mechanism to choose the best solution. Many novel systems that apply re-ranking techniques were presented for the CoNLL 2005 shared task [Carreras and Màrquez, 2005]. Some of them used voting strategies to select the best syntactic interpretation, i. e. the best parse tree, for a given sentence (e. g. [Punyakanok et al., 2005]); others produced both multiple syntactic views of each sentence and multiple annotations for each target predicate (e. g. [Sutton and McCallum, 2005, Haghighi et al., 2005]). Architecturally, systems of the former type can be roughly seen as a combination of traditional SRL systems (see Section 4) working on different parse trees, with the addition of a final re-ranking module to choose the most likely semantic annotation. On the other hand, systems of the latter kind rely on a slightly altered basic approach, requiring the ability to produce non-deterministic labelling schemes out of the same syntactic data and target predicate. Our system is similar to that described in [Toutanova et al., 2005], in which the outputs of the boundary and argument classifiers are used to evaluate the posterior probability of a node being given a certain labelling;
the probabilities of every labelling of every candidate node are then combined, resulting in a set of candidate annotations characterized by their respective probabilities. Among these annotations, the best one has to be identified and chosen. The proposed joint model combines the local outputs of a set of probabilistic classifiers, in order to identify
• the most likely consistent labellings for the whole argument structure, and among these
• the best labelling among the suggested ones.
This model poses some problems:
1. being the distance of a vector from a hyperplane in the feature space, the output of SVMs cannot be interpreted as a posterior probability; thus, we first need to find a probabilistic interpretation of the classifiers' output;
2. combining all the possible labels of all the candidate arguments would result in a huge number of possible configurations, almost impossible to explore. A means to drastically reduce the number of configurations to be explored is necessary in order to render this solution computationally feasible;
3. as the most probable labelling is not necessarily the right (or the best) one, we need some way to compare the suggested labellings and choose the most suitable. That's where the re-ranking mechanism comes in.
In the remainder of this section, we further discuss these problems and propose our solutions.
5.2.1 A Probabilistic Interpretation of SVM Output

Unlike other classification methods, such as Maximum Entropy [Berger et al., 1996], Support Vector Machines do not produce probabilistic output. Indeed, this would be very useful, especially in those situations where each classifier is making a small part of an overall decision, and the output of many classifiers must be combined in order to produce a general decision. [Platt, 1999] describes a model to transform the decision of a classifier into a posterior probability conditioned on it. The proposed methodology uses a parametric model to fit the posterior probability distribution P(y = 1|f), f being the output of the classifier; the parameters are adapted dynamically to give the best probability outputs. Under the assumption that the output of the SVM is proportional to the log odds of a positive example, a parametrized sigmoid is chosen as the target function of the distribution fitting:

P(y = 1|f) = 1 / (1 + exp(Af + B)) .

The parameters A and B are fit using maximum likelihood estimation from a training set (fi, ti), where the target probabilities ti are defined as

ti = (yi + 1) / 2 .

The values of the parameters are determined by minimizing the negative log likelihood of the training data, which is a cross-entropy error function:

err(A, B) = − Σ_i [ ti log(pi) + (1 − ti) log(1 − pi) ] ,

where

pi = 1 / (1 + exp(A fi + B)) .     (5.1)
This optimization module, responsible for finding the best values for the two parameters, has been implemented following the pseudo code proposed by Hsuan-Tien Lin and his colleagues in [Lin et al., 2003]. In this paper, Platt's proposed implementation is revised and improved in many respects. Most notably, Lin's solution is considerably more efficient and is guaranteed to converge by a theoretical proof. Using this algorithm we estimated the best values of the parameters for each of the employed SVMs. The algorithm's training examples are pairs ⟨yi, f(xi)⟩, where yi is the value of the target classification function and f(xi) is the actual output of the SVM. A Probabilistic SVM classifier PClass is thus described by a triple ⟨C, A, B⟩, where
• C is the corresponding traditional SVM classifier;
• A and B are the sigmoid parameters, trained on C's output.
Given a test example xi, the output pi = p(f(xi)) of the probabilistic classifier is the result of applying the mapping function described by (5.1) to the output f(xi) of the classifier C:

pi = p(f(xi)) = 1 / (1 + exp(A f(xi) + B)) .

Architecturally, a probabilistic SVM classifier can be seen as the cascade of a standard SVM classifier and a scaling module that fits the SVM output onto the sigmoid function characterized by the parameters A and B, as shown in Figure 5.6.

Figure 5.6: A Probabilistic SVM classifier.
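A minimal sketch of such a cascade is given below; the decision_function interface of the underlying SVM and the way the parameters A and B are obtained (they are assumed to have been fit beforehand, e. g. with the procedure of [Lin et al., 2003]) are illustrative assumptions.

import math

class ProbabilisticClassifier:
    def __init__(self, svm, A, B):
        self.svm = svm          # the underlying binary SVM classifier
        self.A = A              # sigmoid slope, fit on held-out SVM outputs
        self.B = B              # sigmoid offset

    def probability(self, x):
        # Map the signed distance from the hyperplane onto a posterior
        # probability through the sigmoid of Equation (5.1).
        f = self.svm.decision_function(x)
        return 1.0 / (1.0 + math.exp(self.A * f + self.B))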
5.2.2 Identifying the Best Candidate Propositions

Given the independent probabilities of each candidate node being assigned a given label, we want to find a convenient way to combine this information so that we can answer the question: what are the most probable combinations of labellings? Or, in other words: what are the combinations of
labellings that most likely correspond to the correct annotation of a given proposition? Given a parse tree t and a target predicate p, the corresponding set of candidate predicate argument nodes Ac comprises on average k ≃ 50 elements, and each node can be assigned any of m ≃ 50 different labels. The overall number of possible states, i. e. unique label assignments, is

S = k^m = 50^50 ≃ 9 × 10^84 .     (5.2)
After identifying the S possible states we should also evaluate them and find the best one. Let:
• li = l(ai) be the label associated with the i-th candidate argument node ai ∈ Ac at a given time. A state s can thus be represented by the set of labels assigned to each candidate: s = ∪_{i=1}^{k} li;
• p(li) = p(f_{li}(ai)) be the probabilistic output of the classifier of type li for the candidate ai.
The probability p(s) of a state s is defined as the product of the probabilities of each label assignment:

p(s) = Π_{i=1}^{k} p(li) .     (5.3)
In terms of real products, the cost of evaluating a state's probability is γ = (k − 1). This means that we would need S × γ ≃ 4 × 10^86 real products just to calculate the scores of all the states. Still, the task of finding the best solutions wouldn't be accomplished, as the S states would need to be sorted first: even using an O(n log n) sorting algorithm (such as Merge, Heap or Quick sort), the temporal and spatial requirements of the solution would be unsustainable.
A first way to drastically reduce complexity is to cut the overall number of states. Since all the candidates have to be taken into account, the only quantity that can be changed in (5.2) is the number of labels m: backed by our experiments (see Section 6.2), we noticed that choosing the best m′ = 5 labellings for each node has little or no impact on the overall annotation task. This first trick causes a noticeable reduction of complexity, as the number of states decreases to

S′ = k^{m′} = 50^5 ≃ 3 × 10^9 .

The next direction of improvement is to avoid traversing all the states, only taking into account those that are likely to be the most interesting. This kind of traversal policy is quite common in many computationally expensive scenarios and is known as a Viterbi algorithm [Viterbi, 1967, G. D. Forney, 1973], after the name of Andrew J. Viterbi, who first introduced this technique for decoding convolutional codes. It is a dynamic programming algorithm to find the most likely sequence of hidden states (the Viterbi Path) that results in a sequence of observed events, especially in the context of Hidden Markov Models (HMM). The Viterbi algorithm has been applied to many different fields, such as speech recognition, keyword spotting, computational linguistics, and bioinformatics.
For example, in speech-to-text recognition, the acoustic signal is treated as the observed sequence of events, and a string of text is considered to be the "hidden cause" of the acoustic signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal. The algorithm operates on a state machine assumption, i. e. there is a finite number of states, however large, that can be listed. Multiple sequences or paths can lead to each state (or node, if you prefer), but only one of them is the most likely; this path is called the survivor path to that state. This is a fundamental assumption of the algorithm, as it will examine all possible paths leading to a state and only keep the one which is most likely. This way the algorithm does not have to keep track of multiple paths, but only of one per state. A second key assumption is that a transition from a previous state to a new state is marked by an incremental metric, usually a number; this transition is computed from the event. The third key assumption is that the events are cumulative over a path in some sense, usually additive. Then, when an event happens, the algorithm considers moving forward to a new set of states by combining the metric of a possible previous state with the incremental metric of the transition due to the event, and chooses the best. In our case, we begin with a state s0 in which all the candidates' labels are set to NARG.
Let d be a variable keeping track of the current depth, and D the maximum depth of the tree. Let N be the maximum number of states that we want the algorithm to output, i. e. we are only interested in the N most likely labelling schemes. We begin traversing the parse tree starting with the nodes at the deepest level, i. e. furthest from the root, so d ← D. When all the nodes at depth d = D have been traversed we pass to those at the next level, d ← (d − 1), and so on. Figure 5.7 illustrates the described traversal policy.

Figure 5.7: Traversal strategy adopted by the Viterbi algorithm.

The traversal of each node is associated with an iteration (or step) of the algorithm. The i-th iteration works on the set Si−1 of states generated at the previous step, and outputs a new set Si to be used by the next iteration. Each step is designed to pass at most N states to the next one, thus reducing complexity and allowing for faster execution. The behaviour of each iteration is described in Algorithm 2. Although each step outputs at most N states, it can traverse up to a maximum of N × M states, M being the number of most likely labels that are taken into account for each potential argument. Therefore, in the worst case scenario the number of traversed states will be

V = N × M × k .     (5.4)
For N = 10 and M = 5 we would need to evaluate at most 10 × 5 × 50 ≃ 3 × 10^3 states, which is 6 orders of magnitude less than the straightforward approach. Apart from remarkably boosting computational efficiency, this approach also allows us to enforce policies that prevent overlapping solutions from being output, as any time a node is assigned a non-NARG label all its ancestors and descendants in the parse tree are forced to be labelled as NARG. This behaviour is guaranteed by the function PermutateNodeLabels, which is responsible for the generation of the M states that result from the permuta-
tions of the labels of a given node in a given state. In the actual model, in order to enforce a dependency of the chosen labelling schemes on the predictions of the boundary classifier, we modified Equation 5.3 to look like this:

pJ(s) = Π_{i=1}^{k} J(li) ,     (5.5)
where J(li) = J(p(li)) is a function that rescales the probability of a labelling for the node ai by the probability of the node actually being an argument, represented by pi(NARG):

J(li) = { p(li) · pi(NARG)        if li ≠ NARG
        { (1 − pi(NARG))^2        otherwise .     (5.6)
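The search over labelling states can be sketched as a simple beam search, as follows. The candidate representation (a boundary probability plus the M most likely label/probability pairs), the field names and the omission of the overlap constraint enforced by PermutateNodeLabels are illustrative simplifications of Algorithm 2, not its actual pseudo-code.

import heapq

def rescale(label, p_label, p_boundary):
    # J of Equation 5.6, with p_boundary playing the role of p_i(NARG).
    if label != "NARG":
        return p_label * p_boundary
    return (1.0 - p_boundary) ** 2

def best_labellings(candidates, n_best=10, m_labels=5):
    # candidates: one dict per node, e.g.
    #   {"p_narg": 0.8, "labels": [("A0", 0.6), ("NARG", 0.3), ("A1", 0.05)]}
    # States are (score, labels chosen so far); only the n_best highest-scoring
    # partial states survive each step (the beam).
    states = [(1.0, [])]
    for cand in candidates:
        expanded = []
        for score, labels in states:
            for label, p in cand["labels"][:m_labels]:
                new_score = score * rescale(label, p, cand["p_narg"])
                expanded.append((new_score, labels + [label]))
        states = heapq.nlargest(n_best, expanded, key=lambda s: s[0])
    return states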
As a whole, the application of the Viterbi algorithm can be associated with a functional module in the revised SRL architecture. The Viterbi Evaluator (ViterbiEval) module receives as input the probabilistic output of all the classifiers, i. e. the probabilistic boundary classifier, PBndClass, and the probabilistic role multi-classifier (RMC), PArgClass. Both classifiers output a set, respectively Fc+ and Fcl, whose size equals that of the set of input candidate node representations Fc, i. e. |Fc| = |Fc+| = |Fcl|. Each element of Fc+ is the boundary probability of one of the input candidates, whereas each element of Fcl is a tuple ⟨l0, l1, . . . , lM−1⟩ that associates to one of the input candidates its M most likely labellings along with their probabilities. The ViterbiEval module determines the N most likely labelling schemes for the target predicate p and generates the set Ap of candidate annotations for it. The flowgraph of the re-ranking SRL system is represented in Figure 5.8.
Figure 5.8: Flowgraph of a re-ranking SRL system.
5.3 Features for the Re-ranking Task

For each predicate p within a sentence, the Viterbi algorithm outputs a set of candidate annotations Ap. These annotations must be re-ranked in order to identify the best among them, i. e. the one most akin to the corresponding annotation in the oracle. A typical re-ranking system is trained using pairs of the form ⟨i, e~i⟩, e~i being the features extracted from the i-th ranked representation of some target object e. For example, a very basic training set TR built from the representations of a single object would look like this:
TR = { ⟨1, e~1⟩, ⟨2, e~2⟩, ⟨3, e~3⟩, . . . , ⟨n, e~n⟩ } .
Internally, the re-ranking engine would expand this compact notation, generating all the possible pairs of vectors along with the label assigned to each pair. A pair is labelled +1 if the first element is ranked higher
than the second, −1 otherwise [Shen et al., 2003]. This explicit representation TR′ of the training set allows the problem to be reduced to a binary classification:
TR′ = { ⟨+1, e~1, e~2⟩, ⟨−1, e~2, e~1⟩, ⟨+1, e~1, e~3⟩, ⟨−1, e~3, e~1⟩, . . . , ⟨+1, e~n−1, e~n⟩, ⟨−1, e~n, e~n−1⟩ } .
The classifier's examples are the 2(n−1)! binary comparisons between the n candidate representations of the object e. The factor 2 is due to the necessity of providing both the positive and the negative pairs to the classifier, e. g. ⟨+1, e~1, e~2⟩ and ⟨−1, e~2, e~1⟩, so that it can learn to classify them independently of the order of their elements. This is crucial for the proper classification of the test set, as in that case the similarity of each candidate e~j to the target object e, which is unknown, cannot be evaluated. For our experiments we use a modified version of the SVM-light toolkit [Joachims, 1999] which encodes tree kernels. Since it does not provide an integrated re-ranking engine, we had to mimic the internals of the re-ranking engine at the data level, feeding the classifier with sets of training and test examples shaped like TR′ in our example. We then collect the classifier's predictions and choose the best alternative by counting how many times each candidate has been preferred over the others. We used both linear features and ad hoc engineered tree kernels to feed our re-ranking classifier, and conducted several experiments to identify the most promising kernel combinations, which are described in Section 6.2.3. The remainder of this section is dedicated to the discussion of the linear
and structured features used for the re-ranking task.
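Before turning to the individual features, the following sketch illustrates how a ranked list of candidate annotations can be expanded into the pairwise binary examples of TR′, and how the test-time predictions can be turned back into a single choice by counting preferences. Generating all ordered pairs and the simple win count are illustrative assumptions, not the exact policy of the system.

def make_pairs(ranked):
    """ranked: candidate representations, best first. Returns (label, a, b)
    triples: +1 when `a` outranks `b`, -1 for the swapped pair."""
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            pairs.append((+1, ranked[i], ranked[j]))
            pairs.append((-1, ranked[j], ranked[i]))
    return pairs

def choose_best(candidates, compare):
    """compare(a, b) is the classifier's prediction (+1 if `a` is preferred).
    The candidate preferred most often over the others is returned."""
    wins = [0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j and compare(a, b) > 0:
                wins[i] += 1
    return candidates[wins.index(max(wins))]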
5.3.1 Linear Features

Viterbi Rank of the Proposition  Assuming that at most the best 20 candidates are taken into account, 20 dimensions of the feature space are reserved to represent this feature. The projections are binary values: the one corresponding to the Viterbi rank of the proposition is set to 1, all the others are set to 0. For example, a proposition ranked at the 6th place by the Viterbi algorithm would be represented as {0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}.

Viterbi Score  A real-valued feature is used to represent the value of the score calculated for the state associated with the proposition, i. e. the joint probability of the labelled structure calculated using Equation 5.5.

Voice of the Predicate  This feature spans 3 binary dimensions, corresponding to the voice: active, passive and unknown. The latter is used for those cases in which either the predicate is not verbal, or the parse tree is badly malformed, i. e. when the predicate's POS is neither verbal nor adjectival (considering that past participle verbs are often tagged as adjectives). If the predicate's POS is neither VBN (past participle verb) nor JJ (adjective), then the voice is assumed to be active; otherwise the three terminals to the left of the predicate are scanned: if any of them is some form of the verb to be, then the voice is assumed to be passive, otherwise not. The pseudo-code of this voice identification algorithm is listed in Algorithm 3.

Number of Arguments  Assuming a maximum of ten arguments for a single proposition, 10 dimensions are used to represent this feature. The projection
on the i-th dimension is set to 1 if the proposition consists of i arguments, otherwise it is set to 0. For example, if the proposition had 6 arguments, this feature's representation would be {0, 0, 0, 0, 0, 1, 0, 0, 0, 0}.

Arguments' Position  As each argument can be to the left of, to the right of or coincident with the predicate (this latter condition being required to represent the predicate itself), we need 3N dimensions to represent this feature, N being the maximum number of arguments in a proposition. In other words, each argument is represented by a triple of binary values ⟨b, a, c⟩, of which at most one can be non-null. As in the previous case, we are assuming N = 10, hence 30 dimensions of the target feature space are used to describe this feature. For example, if the order of the arguments in the annotation was AM-TMP A0 rel A1 AM-MOD, the position feature would be represented by the concatenation of the following 10 triples:
{1, 0, 0} for the first arg, AM-TMP
{1, 0, 0} for the second arg, A0
{0, 0, 1} for the predicate
{0, 1, 0} for the fourth arg, A1
{0, 1, 0} for the fifth arg, AM-MOD
{0, 0, 0} for the sixth arg, which doesn't exist
· · ·
{0, 0, 0} for the tenth arg, which doesn't exist

Argument Structure  This feature is meant to represent the argument structure in a linear fashion. In a sense, it extends the position feature by
encoding the label of each argument in the proposition with respect to its absolute position. Unlike the position feature, which uses three dimensions to encode the information relative to each argument slot, this feature requires each argument slot to span 59 dimensions, which is the number of distinct argument labels that we take into account. Of these 59 components, at most one is assigned a value of 1, while the others are set to 0. The non-null component is the one corresponding to the ordinal of the argument label within the sorted set of the 59 different roles considered. As we take into account 10 argument slots, a total of 590 dimensions are required.
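As an illustration only (the encoding details and helper names below are assumptions consistent with the dimensions just described), the full linear feature vector of one candidate proposition can be assembled as follows.

MAX_RANK, MAX_ARGS, N_LABELS = 20, 10, 59
VOICES = {"active": 0, "passive": 1, "unknown": 2}

def one_hot(size, index):
    v = [0.0] * size
    if 0 <= index < size:
        v[index] = 1.0
    return v

def encode(rank, score, voice, args, label_index):
    """args: list of (label, position) with position in {'left', 'right', 'pred'};
    label_index: dict mapping the 59 role labels to 0..58 (assumed)."""
    vec = one_hot(MAX_RANK, rank - 1)           # Viterbi rank (1-based), 20 dims
    vec += [score]                               # Viterbi joint score, 1 dim
    vec += one_hot(3, VOICES[voice])             # predicate voice, 3 dims
    vec += one_hot(MAX_ARGS, len(args) - 1)      # number of arguments, 10 dims
    pos_index = {"left": 0, "right": 1, "pred": 2}
    for slot in range(MAX_ARGS):                 # positions, 3 dims per slot
        if slot < len(args):
            vec += one_hot(3, pos_index[args[slot][1]])
        else:
            vec += [0.0, 0.0, 0.0]
    for slot in range(MAX_ARGS):                 # argument structure, 59 dims per slot
        if slot < len(args):
            vec += one_hot(N_LABELS, label_index[args[slot][0]])
        else:
            vec += [0.0] * N_LABELS
    return vec                                   # 20 + 1 + 3 + 10 + 30 + 590 = 654 dims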
5.3.2 Structural Features

This section describes the different structural features that have been used for the re-ranking task. To point out the differences between these representations, the following sentence will be used as an example:

"But while the New York Stock Exchange didn't fall apart Friday, as the Dow Jones Industrial Average plunged 190.58 points – most of it in the final hour – it barely managed to stay this side of chaos."

with respect to the following argument structure:

AM-TMP ← Friday
A1 ← The Dow Jones Industrial Average
rel ← plunged
A2 ← 190.58 points

The whole parse tree of the sentence, as output by the Charniak parser, along with the annotation of the proposition, is shown in Figure 5.9.
Figure 5.9: Parse tree of the example sentence. The target predicate “plunge” and its arguments are highlighted.
Completely-Marked Argument Structure Spanning Tree (AST^cm_N)  The Completely-Marked Argument Structure Spanning Tree is a structure similar to the AST^cmt_N used for the boundary detection and classification tasks in the basic version of the SRL system (see Section 5.1), with the difference that not only one argument (i. e. the target argument) is marked. It consists of the node spanning tree embracing the whole argument structure (i. e. the AST_N), in which each argument node's label is enriched with the role type. The labels of the descendants of each argument node are modified accordingly, down to the pre-terminal nodes. This representation is meant to provide rich syntactic and lexical information about the proposition. The nodes' descendants are marked so that substructures are forced to match only among homogeneous argument types. The AST^cm_N corresponding to the example proposition is shown in Figure 5.10.
Figure 5.10: AST^cm_N representation of the example proposition.
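A minimal sketch of the marking step just described, assuming a simple Tree class (label plus children; the leaves are the words and are left untouched). This is an illustration, not the system's actual implementation.

class Tree:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def mark_completely(node, role):
    """Append '-ROLE' to the node and to every descendant down to the pre-terminals."""
    if not node.children:          # leaf (word): keep the surface form unchanged
        return
    node.label = f"{node.label}-{role}"
    for child in node.children:
        mark_completely(child, role)

def build_ast_cm(spanning_tree, argument_nodes):
    """argument_nodes: list of (node, role) pairs inside the spanning tree."""
    for node, role in argument_nodes:
        mark_completely(node, role)
    return spanning_tree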
First-Leaf Argument Structure Spanning Tree (AST^fl_N)  The First-Leaf Argument Structure Spanning Tree is a reduced version of the AST^cm_N. In order to produce a more compact (and therefore faster) representation, all the descendants of the argument nodes are removed from the tree. For each argument node, only the path down to the first terminal is preserved, so that the tree kernel is provided with a minimum of lexical anchors. The argument nodes' labels are marked with the corresponding argument type, while their descendants' are not. The AST^fl_N of the example sentence is represented in Figure 5.11. While it is evident that the inner structure of an argument is relevant for the boundary detection and argument classification tasks [Gildea and Jurafsky, 2002], it is not clear whether this information is also critical when evaluating a proposition as a whole, i. e. the features used for a local model may be irrelevant (or even noisy) for the joint one. The AST^fl_N, which doesn't provide much information about the internals of the argument nodes, is meant to focus the classifier on the syntactic structure that glues the arguments together, and to evaluate its impact on the re-ranking task.
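For illustration, the first-leaf pruning can be sketched as follows, reusing the simple Tree class from the previous sketch (again, an assumption-level illustration rather than the actual implementation).

def prune_to_first_leaf(argument_node, role):
    """Keep only the leftmost path below the argument node and mark the node only."""
    argument_node.label = f"{argument_node.label}-{role}"   # role on the argument node
    current = argument_node
    while current.children:                                  # follow the leftmost path down
        current.children = current.children[:1]
        current = current.children[0]
    return argument_node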
Figure 5.11: AST^fl_N representation of the example proposition.
Predicate Argument Structure (PAS)  The Predicate Argument Structure can be seen as the dual structure of the AST^fl_N with respect to what is pruned from the corresponding AST^cm_N. While the AST^fl_N doesn't consider the syntactic structures within each argument node, the PAS represents the syntactic links between the argument nodes as a fake 1-level tree, which is shared by every PAS and therefore doesn't influence the similarity evaluation between couples of structures. The fake tree looks like this:

(TREE ARG0 ARG1 ARG2 ... ARG6)

and consists of a root node labelled TREE and 7 children ARG0, ARG1, ..., ARG6 which are used to accommodate sequentially the arguments of an annotation. The first argument's label is attached as a child of the ARG0 node, and the actual sub-tree corresponding to the argument node is in turn attached to the label node. Then, the second argument's label is attached
under the node ARG1 along with the corresponding sub-tree, and so on until all the arguments are consumed. In general, a proposition consists of m arguments, with m < 7. In this case, all the nodes ARGi with m ≤ i ≤ 6 are assigned a dummy descendant marked null. The PAS of the example sentence is represented in Figure 5.12: the nodes ARG0, ..., ARG3 actually host an argument (the three arguments plus the predicate), while the nodes ARG4, ARG5 and ARG6 are assigned the null node. The PAS is engineered so that:

1. the fake tree structure is only needed for the application of the tree kernel and doesn't influence the evaluation of similarity between couples of propositions, i. e. the production of the root node is the same for every proposition and is hence irrelevant;

2. the respective productions of the same ARGi nodes are equal for two propositions if and only if the position and the label of the argument are the same in both propositions;

3. within an argument, i. e. within the descendants of an argument label node, the productions no longer depend on the position or the label of the argument.

The first two points are meant to enforce the matching between argument structures, whereas the latter accounts for local (lexical and syntactic) analogies between (possibly) different arguments in different propositions.
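A short sketch of this construction (illustrative only; the Node class and helper names are assumptions) makes the slot-filling explicit.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def build_pas(arguments):
    """arguments: list of (role_label, argument_subtree) pairs in sentence order,
    at most 7 items (the predicate counts as one of them)."""
    slots = []
    for i in range(7):
        if i < len(arguments):
            role, subtree = arguments[i]
            # ARGi -> role label -> actual argument sub-tree
            slots.append(Node(f"ARG{i}", [Node(role, [subtree])]))
        else:
            # unused slots get the dummy null descendant
            slots.append(Node(f"ARG{i}", [Node("null")]))
    return Node("TREE", slots)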
Figure 5.12: PAS representation of the example proposition.

First-Leaf Predicate Argument Structure (PAS^fl)  The First-Leaf Predicate Argument Structure is a pruned version of the PAS, in which all the descendants of the argument label node are removed except the first pre-terminal of the argument node. This structure is used to evaluate whether local lexical and syntactic information is useful for the proposition re-ranking task, and to understand to what extent the argument sequence is a relevant feature per se. The PAS^fl of the example sentence is shown in Figure 5.13.
Figure 5.13: PAS^fl representation of the example proposition.

Type-Only Predicate Argument Structure (PAS^t, PAS^tl)  The Type-Only Predicate Argument Structure is a yet more pruned version of the PAS^fl. Below each argument type node, only a terminal consisting of the syntactic type of the argument node is attached. The only exception is the predicate node, for which the node's syntactic type is replaced by the predicate's surface form. This is essentially the tree-kernel correspondent of the argument structure linear feature (see Section 4.2), but it also encodes the syntactic type of each argument node and the predicate voice. The example PAS^t is shown in Figure 5.14.
Figure 5.14: PAS^t representation of the example proposition.
In order to improve the generalization power of the learning algorithm, we also experimented with a structure in which the predicate word is lemmatized, called the Type-Only Lemmatized Predicate Argument Structure (PAS^tl).
Algorithm 1 Description of functions and procedures used in the definition of other algorithms.

function Set DescendantsOf(Node aNode)
  comment: returns the candidate nodes that are descendants of aNode
function Set AncestorsOf(Node aNode)
  comment: returns the candidate nodes that are ancestors of aNode
function Set LabelsFor(Node aNode)
  comment: returns the labels that can be applied to aNode
function String GetLabel(Node aNode, State aState)
  comment: returns the label associated with aNode in aState
function boolean Matches(String aString, RegularExpression aRegEx)
  comment: returns true if aString matches aRegEx, false otherwise
function POS GetPOSForTerminal(int tOff, ParseTree aTree)
  comment: returns the POS (a string) associated with the tOff-th terminal of aTree
function String GetTerminal(int tOff, ParseTree aTree)
  comment: returns the surface form of the tOff-th terminal of aTree
procedure SetLabel(Node aNode, State aState, String label)
  comment: sets the label associated with aNode in aState to label
procedure Add(Set aSet, X elem)
  comment: adds elem to the set aSet
procedure Sort(Set aSet)
  comment: sorts the set according to some policy defined in the algorithm
procedure Splice(Set aSet, int off)
  comment: removes from aSet all elements after the off-th position
Algorithm 2 An iteration of the Viterbi algorithm.

function Set ViterbiIteration(Set InStates, Node node, int N)
  local Set OutStates, State t
  for each State s ∈ InStates do
    for each t ∈ PermutateNodeLabels(node, s) do
      if not Contains(OutStates, t)
        then Add(OutStates, t)
  comment: sort the states in OutStates by descending score
  Sort(OutStates)
  comment: only keep the best N states
  Splice(OutStates, N)
  return (OutStates)

function Set PermutateNodeLabels(Node aNode, State aState)
  local Set states
  local State s
  local Node tn
  for each lab ∈ LabelsFor(aNode) do
    s ← aState
    comment: assign the candidate label to the current node
    SetLabel(aNode, s, lab)
    if lab ≠ NARG
      then
        comment: ensure all descendants are labelled NARG
        for each tn ∈ DescendantsOf(aNode) do
          if GetLabel(tn, s) ≠ NARG
            then SetLabel(tn, s, NARG)
        comment: ensure all ancestors are labelled NARG
        for each tn ∈ AncestorsOf(aNode) do
          if GetLabel(tn, s) ≠ NARG
            then SetLabel(tn, s, NARG)
    if not (s ∈ states)
      then Add(states, s)
  return (states)
Algorithm 3 Verb voice identification algorithm.

comment: tree is the parse tree of the sentence to be annotated
comment: predOff is the offset of the predicate word within the sentence

function Voice VerbVoice(ParseTree tree, int predOff)
  local POS predPos, POS tempPOS, Word tempWord
  predPos ← GetPOSForTerminal(predOff, tree)
  if not Matches(predPos, "(VB.*)|(JJ)")
    then
      comment: the predicate is neither a verb nor an adjective
      return (Voice: unknown)
  if not Matches(predPos, "(VBN)|(JJ)")
    then
      comment: the predicate is not a past participle or an adjective
      return (Voice: active)
  for off ← (predOff − 1) downto (predOff − 3) do
    tempWord ← GetTerminal(off, tree)
    tempPOS ← GetPOSForTerminal(off, tree)
    if Matches(tempPOS, "AUX") and IsToBeVoice(tempWord)
      then return (Voice: passive)
  return (Voice: active)
Chapter 6
System Evaluation
This chapter summarizes the results of many experiments that we have run during the last couple of years. Section 6.1 presents the results of the top-scoring systems taking part in the CoNLL 2005 shared task on semantic role labelling (see Section 2.2). Section 6.2 details the setup of the experiments and the results obtained on each specific task. It discusses the performance of the evaluated models and compares it to CoNLL 2005 state of the art systems.
6.1 Evaluation of CoNLL 2005 Systems

Table 6.1 lists the results of the top-scoring systems that took part in the CoNLL 2005 shared task on semantic role labelling. The systems are ranked in descending order with respect to their F1 measure. The system of our University that participated in the competition [Moschitti et al., 2005b] is labelled moschitti, and represents the starting point of all the extensions to the model discussed in this thesis. An overview of these systems' characteristics has been provided in Section 4.3.
System        Development (24)            Test (23)
              P      R      F1            P      R      F1
punyakanok    80.05  74.83  77.35         82.28  76.78  79.44
haghighi      77.66  75.72  76.68         79.54  77.39  78.45
marquez       78.39  75.53  76.93         79.55  76.45  77.97
pradhan       80.90  75.38  78.04         81.97  73.27  77.37
surdeanu      79.14  71.57  75.17         80.32  72.95  76.46
che           79.65  71.34  75.27         80.48  72.79  76.44
tsai          81.13  72.42  76.53         82.77  70.90  76.38
moschitti     74.95  73.10  74.01         76.55  75.24  75.89
Table 6.1: Performance of the top-scoring systems on the CoNLL 2005 shared task.
6.2 Experiments

The experiments were carried out with Alessandro Moschitti's SVM-light-TK software¹, which encodes tree kernels into Thorsten Joachims' SVM-light tool [Joachims, 1999]. As reference data set, we used the PropBank corpus² along with the Penn TreeBank 2³. This corpus contains about 53,700 sentences, split into 24 sections. For the CoNLL shared task, Sections 02-21 are used for training, Section 24 for development and Section 23 for testing.

For the overlap resolution experiments (see Section 6.2.1) we used the architecture described in Section 5.1.1. All the modules are written in Java, the extractor of flat, i. e. non-structural, features being developed and maintained by Ana Maria Giuglea.

1. Available at http://ai-nlp.info.uniroma2.it/moschitti/.
2. Available at http://www.cis.upenn.edu/~ace.
3. Available at http://www.cis.upenn.edu/~treebank.

The boundary detection and argument classification experiments (see
Section 6.2.2) are based on a traditional SRL architecture, as described in Section 4.1. No linear feature extractor was used for these experiments, as they only rely on tree kernels and full syntactic parse trees. The proposition re-ranking experiments (see Section 6.2.3) are based on the extended architecture described in Section 5.2. A novel linear feature extractor was implemented in order to extract features from the whole predicate argument structure, as explained in Section 4.2.1, and its output was combined with that of the structured features extractor.
6.2.1 Overlap Resolution

Gold Parse Trees  A first evaluation was conducted over gold, i. e. hand-crafted, trees and argument boundaries. For the TBC, we used the linear kernel with a regularization parameter (option -c) equal to 1 and a cost-factor (option -j) of 10 to obtain a higher recall. For the AST^ord_N classifier we used λ = 0.4 (see [Moschitti, 2004]).

We used Sections 02 to 07 (54,443 argument nodes and 1,343,046 non-argument nodes) to train the TBC. Then, we applied it to classify Sections 08 to 21 (125,443 argument nodes vs. 3,010,673 non-argument nodes). As a result we obtained 2,988 NSTs containing at least one overlapping node pair out of the total 65,212 predicate structures (according to the TBC decisions). From the 2,988 overlapping structures we extracted 3,624 positive and 4,461 negative NSTs, which we used to train the AST^ord_N classifier. The F1 measure was evaluated on Section 23 of the PropBank, which contains 10,406 argument nodes out of 249,879 parse tree nodes. By applying the TBC classifier we derived 235 overlapping NSTs, from which we extracted 204 AST_Ns and 385 incorrect predicate argument structures. On such test data, the performance of the AST^ord_N classifier was very high, i.e. 87.08% in Precision and 89.22% in Recall.

Using the AST^ord_N classifier we removed from the TBC output the candidate argument(s) causing overlaps.
                 All                      Overlapping
                 P      R      F1         P      R      F1
TBC              92.21  98.76  95.37      98.29  65.80  78.83
TBC+RND          93.55  97.31  95.39      74.00  72.27  73.13
TBC+HEU          92.96  97.32  95.10      68.12  75.23  71.50
TBC+AST^ord_N    94.40  98.42  96.36      89.61  92.68  91.11
Table 6.2: Two-step boundary classification performance using the traditional boundary classifier (TBC), the random selection of non-overlapping structures (RND), the heuristic selection of the most suitable non-overlapping node set (HEU) and the AST^ord_N structure classifier.

To measure the impact on the boundary identification performance, we compared it with three different boundary classification baselines:

• TBC: overlaps are ignored and no decision is taken. This provides an upper bound for the recall, as no potential argument is rejected for later labelling. Notice that, in the presence of overlapping nodes, the sentence cannot be annotated correctly.

• RND: random, i. e. one among the non-overlapping structures with the maximal number of arguments is randomly selected.

• HEU: heuristic, i. e. one of the NSTs containing the nodes with the lowest overlapping score is chosen by applying a set of heuristics that filter argument nodes incrementally until the overlaps are resolved. Namely:

  1. the node causing the largest number of overlaps is removed first, i. e. an argument node dominating two other argument nodes is involved in two overlap situations and is therefore discarded;
  2. core arguments are preferred over adjuncts;
  3. deeper nodes are discarded if they conflict with shallower ones (a sketch of this filtering is given after the list).
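The following sketch is one possible reading of the three heuristic rules above; the data structures (node id, depth, label, set of descendant ids) and the exact tie-breaking order are assumptions, not the thesis implementation.

def overlaps(a, b):
    return b[0] in a[3] or a[0] in b[3]          # one node dominates the other

def count_overlaps(node, nodes):
    return sum(1 for other in nodes if other is not node and overlaps(node, other))

def is_core(label):
    return label.startswith("A") and not label.startswith("AM")

def resolve_heu(nodes):
    """nodes: list of (node_id, depth, label, descendant_ids) candidate arguments."""
    nodes = list(nodes)
    while True:
        conflicting = [n for n in nodes if count_overlaps(n, nodes) > 0]
        if not conflicting:
            return nodes
        # drop the node with the most overlaps, preferring to keep core arguments
        # over adjuncts, and dropping deeper nodes before shallower ones
        victim = max(conflicting,
                     key=lambda n: (count_overlaps(n, nodes),
                                    0 if is_core(n[2]) else 1,
                                    n[1]))
        nodes.remove(victim)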
The first column group of Table 6.2 (labelled All) shows the results that we obtained with the different strategies. We note that:

• the TBC F1 is slightly higher than the result obtained in [Pradhan et al., 2004], i. e. 95.37% vs. 93.8%, under the same training/testing conditions (same PropBank version, same training and testing split and same machine learning algorithm). This is explained by the fact that we did not include the continuations and the co-referring arguments, which are more difficult to detect;

• neither RND nor HEU improves the TBC result. This can be explained by observing that in about 50% of the cases a correct node is removed;

• using the AST^ord_N classifier to select the best node configurations increases F1 by 1.49%, i. e. 96.86 vs. 95.37. This is a very good result, considering that improving on the already very high TBC baseline is hard.

In order to evaluate our approach on the specific task it was designed for, i. e. overlap resolution, we tested the above classifiers on problematic structures only, i. e. we measured the AST^ord_N classifier improvement on all and only the structures that presented some overlap situation. Such a reduced test set contains 642 argument nodes and 15,408 non-argument nodes. The second column group of Table 6.2 reports the classifiers' performance on this task. We note that the AST^ord_N classifier improves over the other heuristics by about 20%.

Automatic Parse Trees
We then evaluated the different AST_N approaches, i. e. AST_N, AST^ord_N and AST^m_N, on automatic parse trees. We used the trees generated by the Charniak parser and the predicate argument annotations defined in the CoNLL 2005 shared task. We trained the TBC on Sections 02-08, whereas, to achieve a very accurate argument classifier, we trained a traditional role multi-classifier (TRM) on Sections 02-21. Then, we trained the AST_N, AST^ord_N and AST^m_N classifiers on the output of the TBC and TRM over Sections 09-20, for a total of 183,642 arguments, 30,220 correct and 28,143 incorrect predicate argument spanning trees.
              Section 21              Section 23
              P     R     F1          P     R     F1
AST_N         69.8  77.9  73.7        62.2  77.1  68.9
AST^ord_N     73.7  81.2  77.3        63.7  80.6  71.2
AST^m_N       73.6  84.7  78.7        64.2  82.3  72.1
Table 6.3: Classifiers' performance for the overlap resolution task on automatic parse trees.

First, to test the TBC, TRM and the tree kernel classifiers we used Section 23 (17,429 arguments, 2,159 correct and 3,461 incorrect argument structures) and Section 21 (12,495 arguments, 1,975 correct and 2,220 incorrect argument structures). The performance derived on Section 21 corresponds to an upper bound for our classifiers, i. e. the results using an ideal syntactic parser (the Charniak parser was trained also on this data) and an ideal role classifier. They provide the AST_N family of classifiers with accurate syntactic and semantic information.

Table 6.3 shows the Precision, Recall and F1 measure of the AST_N classifiers over the NSTs of Sections 21 and 23; its rows report the performance of the AST_N, AST^ord_N and AST^m_N classifiers, respectively. Several points should be remarked:

• the general performance is lower than the one achieved on gold trees with AST^ord_N, i. e. 88.1% (see Table 6.2). The impact of parsing accuracy is also confirmed by the gap of about 6 points between Sections 21 and 23;

• the ordinal numbering of arguments (AST^ord_N) and the role type information (AST^m_N) provide the tree kernel with more meaningful fragments, since they improve the basic model by about 4%;

• the deeper semantic information conveyed by the argument labels provides useful clues to select correct predicate argument structures, since the AST^m_N model improves the AST^ord_N performance on both sections.
bnd
              Section 21              Section 23
              P     R     F1          P     R     F1
AST_N         87.5  87.3  87.4        78.6  78.1  78.3
AST^ord_N     88.3  88.1  88.2        79.0  78.4  78.7
AST^m_N       88.3  88.3  88.3        79.3  78.7  79.0
RND           86.9  87.1  87.0        77.8  77.9  77.9

bnd+class
              Section 21              Section 23
              P     R     F1          P     R     F1
AST_N         85.5  85.7  85.6        73.1  73.8  73.4
AST^ord_N     86.3  86.5  86.4        73.5  74.1  73.8
AST^m_N       86.4  86.8  86.6        73.4  74.4  73.9
RND           85.0  85.6  85.3        72.3  73.6  72.9
Table 6.4: Semantic role labelling performance on automatic parse trees using different AST_N structures.

Second, we measured the impact of the spanning tree classifiers on both phases of semantic role labelling. Table 6.4 reports the results on Sections 21 and 23. For each of them, the Precision, Recall and F1 of the different approaches to boundary identification (bnd) and to the complete SRL task, i. e. boundary and role classification (bnd+class), are shown. Such approaches are based on different strategies to remove the overlaps, i. e. AST_N, AST^ord_N, AST^m_N and the baseline (RND), which uses a random selection of non-overlapping structures. We needed to remove the overlaps from the baseline in order to apply the CoNLL evaluator. We note that:
• for any model, the boundary detection F1 on Section 21 is about 10 points higher than the F1 on Section 23 (e.g. 87.0% vs. 77.9% for RND). As expected, the parse tree quality is very important to detect argument boundaries;

• on the real test (Section 23) the classification step introduces labelling errors which decrease the accuracy by about 5% (77.9 vs. 72.9 for RND);

• the AST^ord_N and AST^m_N approaches consistently improve the baseline F1 by about 1%. This result is not surprising, as it is similar to the one obtained on gold trees: the overlapping structures are a small percentage of the test set, thus the overall impact cannot be very high.

A comparison with the CoNLL 2005 results can only be carried out with respect to the whole SRL task (bnd+class), since separate boundary detection and role classification figures are generally not provided for CoNLL 2005. Moreover, our best global result, i. e. 73.9%, was obtained under two severe experimental constraints:

1. the use of just 1/3 of the available training set;
2. the use of the linear SVM model for the TBC classifier, which is much faster than the polynomial SVMs but also less accurate.

However, we note the promising results of the argument structure meta-classifiers, which can be used with any of the best-scoring CoNLL systems. Finally, the outcome of the tree-kernel based classifiers suggests that:

• the approach is robust to parse tree errors, since it preserves the same improvement across trees derived with different accuracy, i. e. the gold trees of the Penn TreeBank and the automatic trees of Sections 21 and 23;

• correct and incorrect predicate argument structures can be classified with high accuracy.
               Section 2                    Section 3                    Section 24
               pos      neg      tot        pos     neg      tot         pos     neg      tot
Internal       11,847   71,126   82,973     6,403   53,591   59,994      7,525   50,123   57,648
Pre-terminal   894      114,052  114,946    620     86,232   86,852      709     80,366   81,075
Both           12,741   185,178  197,919    7,023   139,823  146,846     8,234   130,489  138,723
Table 6.5: Tree nodes of the sentences from sections 2, 3 and 24 of the PropBank. For each section, the number of nodes that exactly cover an argument (pos), of those that do not (neg) and their total number (tot) is reported.
6.2.2 Boundary Detection and Argument Classification

We used Sections 2, 3 and 24 from the Penn TreeBank in most of the experiments. Their characteristics are shown in Table 6.5. The columns indicate the number of nodes corresponding (pos) or not (neg) to a correct argument boundary, and their totals (tot); the rows report these numbers for internal and pre-terminal nodes separately. We note that the latter are much fewer than the former, resulting in a very fast pre-terminal classifier.

As the automatic parse trees contain errors, some arguments cannot be associated with any covering node. This prevents us from extracting a tree representation for them; consequently, we do not consider them in our evaluation. In Sections 2, 3 and 24 there are 454, 347 and 731 such cases, respectively. We used a regularization parameter (option -c) equal to 1 and λ = 0.4.

Boundary Detection Results
In these experiments, we used Section 2 for training and Section 24 for testing. The results using the AST_1 and AST^m_1 based kernels are reported in Table 6.6. The first column group shows the CPU testing time (in seconds) and the F1 of the monolithic boundary classifier; the last group reports the overall performance of the split classifier, whereas the middle groups refer to internal and pre-terminal nodes, respectively. The overall F1 measure has been computed by summing correct, incorrect and not-retrieved examples of the two distinct classifiers.
           Monolithic BC       Split BC
                               Internal          Pre-terminal      Overall
           CPU     F1          CPU     F1        CPU    F1         CPU     F1
AST_1      5,180   75.24       1,795   79.93     57     79.39      1,852   79.89
AST^m_1    3,132   82.07       1,410   82.20     61     79.14      1,471   81.96
Table 6.6: Boundary detection performance of the monolithic and split models using the AST_1 and AST^m_1 structural features.

We note that:

• the monolithic classifier applied to AST^m_1 improves both the efficiency of AST_1, i.e. about 3,131 seconds vs. 5,179, and its F1, i.e. 82.07 vs. 75.24. This suggests that marking the argument node simplifies the generalization process;

• by dividing the boundary classification into two tasks, internal and pre-terminal nodes, we further improve the classification time for both the AST_1 and AST^m_1 kernels, i.e. 5,179 vs. 1,851 (AST_1) and 3,131 vs. 1,471 (AST^m_1). The individual classifiers are much faster, especially the pre-terminal one (about 61 seconds to classify 81,075 nodes);

• the split classifier approach seems quite feasible, as its F1 is almost equal to the monolithic one (81.96 vs. 82.07) in the case of AST^m_1 and even superior when using AST_1 (79.89 vs. 75.24). This result confirms the observations provided in Section 5.1 about the importance of reducing the number of substructures common to syntactic derivations associated with correct and incorrect boundaries.

We trained the split boundary classifiers with sets of increasing size to derive the learning curves of the AST_1 and AST^m_1 models, incrementally adding data to the training set from Sections 03 to 07. Figure 6.1 shows that the AST^m_1 approach is constantly above the AST_1. Consider also that the marking strategy has a less noticeable impact on the split classifier.
Figure 6.1: Learning curves of the AST_1 and AST^m_1 split classifiers.

Argument Classification Results
In these experiments we tested different kernels on the argument classification task. As some arguments have a very small number of training instances in a single section, we also used Section 3 for training while continuing to test on Section 24. The results of the RMC on 59 argument types⁴ (i. e. 59 binary classifiers in the monolithic approach) are reported in Table 6.7. Its first three rows report the accuracy obtained using the AST_1, AST^m_1 and AST^cm_1 structures; the last three rows show the accuracy of the approaches based on plain complete argument structures (AST_N) and on complete argument structures in which the target argument is marked (AST^mt_N) or completely marked (AST^cmt_N). More in detail, the first column shows the accuracy of the monolithic RMC, whereas the following columns report the accuracy of the internal, pre-terminal and split RMC classifiers, respectively. We can note that:

• the split classifier approach does not improve the accuracy of the monolithic approach. Indeed, the subtrees describing different argument types are quite different, and this property holds for pre-terminal nodes as well. However, we still measured a remarkable improvement in efficiency;
4. 7 for the core arguments (A0...AA), 13 for the adjunct arguments (AM-*), 19 for the argument references (R-*) and 20 for the continuations (C-*).
              Monolithic      Split
                              Internal    Pre-terminal    Overall
AST_1         74.16           75.06       85.61           75.15
AST^m_1       76.25           77.17       85.76           77.07
AST^cm_1      75.68           76.79       85.76           76.54
AST_N         36.52           34.80       78.14           40.10
AST^mt_N      71.59           72.55       86.32           72.86
AST^cmt_N     71.93           73.21       86.32           73.17
Table 6.7: Accuracy produced by different tree kernels on argument classification.

• AST^m_1 is the best kernel. This confirms the outcome of the boundary detection experiments. The fact that it is more accurate than AST^cm_1 reveals that characterizing the target node is more relevant than characterizing the dominated argument structure. To explain this, suppose that two argument nodes, NP1 and NP2, dominate the following structures:

  (NP (NP DT NN) PP)   and   (NP DT NN).

If we mark only the argument node we obtain

  (NP-B (NP DT NN) PP)   and   (NP-B DT NN),

which have no structure in common. In contrast, if we mark them completely, i. e.

  (NP-B (NP-B DT-B NN-B) PP-B)   and   (NP-B DT-B NN-B),

they will share the subtree (NP-B DT-B NN-B). Thus, although it may seem counter-intuitive, by marking only one node we obtain more specific substructures. Of course, using different labels for an argument node and its descendants would provide the same specialization effect;

• if we do not mark the target argument, as in the AST_N structures, we obtain an expectedly very low result, i. e. 40.10%. When we mark the covering node or the complete argument subtree we obtain an acceptable accuracy. Unfortunately, such accuracy is lower than the one produced by AST^m_1, e. g. 73.17% vs. 77.07%, thus it may seem that the additional information provided by the whole argument structure is not effective.
6.2.3 Proposition Re-ranking

This section discusses the results of our novel attempt at using tree kernels to re-rank predicate argument structures. Our architecture was largely based on the system presented at the CoNLL 2005 shared task, consisting of a traditional boundary classifier and a role multi-classifier. The boundary classifier was trained using an SVM with a polynomial kernel of degree 3, regularization parameter c set to 1 and cost factor j set to 7 (to obtain a slightly higher recall). Only the 992,819 nodes that resulted from the pruning of Sections 02-08 were used. The classifier took about two and a half days to converge on a 64-bit machine (2.4 GHz, 4 GB RAM).

The RMC results from the One-vs-All (see Section 3.3) combination of 52
binary argument classifiers. Their training on all the positive boundaries output by the TBC on Sections 02-21 (i.e. 242,957 nodes) required about half a day on a machine with 8 processors (32-bit, 1.7 GHz, 4 GB RAM overall). Each of the 53 (52+1) classifiers was used to classify all the 138,723 nodes of Section 24, and its predictions were used to learn the best parameters A and B to fit the classifier's output onto a sigmoid distribution. We also extracted all the 218,641 argument nodes of Section 23 and classified them with each binary classifier. By using these parameters and applying Equation 5.1 we transformed each prediction of each classifier for the 218,641 argument nodes of Section 23 into a posterior probability (see Section 5.2.1). We evaluated each node's five most promising labellings, according to the scaling technique described by Equation 5.6. This set of probabilities was used as input to the Viterbi algorithm, which outputs the set of N most likely labelling schemes.

Table 6.8 shows the upper and lower bound of our model when changing N, i. e. the number of candidate propositions output by the Viterbi algorithm. The Precision, Recall, F1 measure and percentage of perfect annotations (%prf) are tabulated. The lower bound is calculated by selecting, for each target proposition, only the best candidate output by the Viterbi algorithm (i. e. the first, N = 1), and corresponds to a re-ranker that would learn to rank exactly according to the algorithm's order. In contrast, the upper bound for each value of N is calculated by selecting, within each set of N candidates, the one that maximizes the F1 measure when compared to the oracle proposition, i. e. the alternative whose arguments best match those of the oracle. It can be seen that halving the value of N produces a drop of about 2% of F1 measure and 3% of perfect annotations on the upper bound. This observation suggests that the probabilistic model is quite effective, since in most cases the best candidate among the first 20 output options can also be found within the first 5. These results are encouraging, as they are far above (i. e. by about 8 F1 points) current state of the art systems.
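For illustration, the probability mapping mentioned above can be sketched as a Platt-style sigmoid fit; the exact form of Equation 5.1 is not reproduced here, so the sketch assumes P(correct | score) = 1 / (1 + exp(A·score + B)) and a naive gradient-descent estimate of A and B on held-out (score, label) pairs, whereas the thesis presumably uses a more careful optimiser.

import math

def sigmoid_probability(score, A, B):
    return 1.0 / (1.0 + math.exp(A * score + B))

def fit_sigmoid(scores, labels, lr=0.01, epochs=5000):
    """scores: raw SVM outputs; labels: 1 if the node is a correct argument, else 0."""
    A, B = 0.0, 0.0
    for _ in range(epochs):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid_probability(s, A, B)
            gA += (y - p) * s        # gradient of the negative log-likelihood w.r.t. A
            gB += (y - p)            # gradient w.r.t. B
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B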
               N     P      R      F1     %prf
Lower Bound    1     81.58  70.97  75.91  44.83
Upper Bound    5     90.23  79.92  84.76  60.81
               10    92.07  82.33  86.93  64.59
               20    93.26  84.29  88.59  67.70
Table 6.8: Variation of the upper and lower bound of the re-ranking SRL system on Section 23 of the PropBank when modulating the number N of options output by the Viterbi algorithm.

Different Tree Kernels and Kernel Combinations
We ran a set of experiments in order to evaluate the best structured features (see Section 4.2.2) and their combination with linear features (Section 4.2.1) for the proposition re-ranking task. The methodology employed to build training and test data for the re-ranker is described in Section 5.3. Each classifier example e_i is described by a tuple ⟨t^1_i, t^2_i, v^1_i, v^2_i⟩, where t^1_i and t^2_i are two structural features, and v^1_i and v^2_i two vectors of linear features. t^1_i and v^1_i describe the left member of the annotation pair, whereas t^2_i and v^2_i are relative to the right one. We define the following kernels:

K_tr(e_1, e_2) = K_t(t^1_1, t^1_2) + K_t(t^2_1, t^2_2) − K_t(t^1_1, t^2_2) − K_t(t^2_1, t^1_2)

K_pr(e_1, e_2) = K_p(v^1_1, v^1_2) + K_p(v^2_1, v^2_2) − K_p(v^1_1, v^2_2) − K_p(v^2_1, v^1_2)

where K_t is the kernel function defined in Section 4.2.2 and K_p is a polynomial kernel applied to the feature vectors. The final kernel that we use for re-ranking is the following:
K(e_1, e_2) = K_tr(e_1, e_2) / |K_tr(e_1, e_2)| + K_pr(e_1, e_2) / |K_pr(e_1, e_2)|        (6.1)
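The sketch below illustrates the pairwise kernel combination under assumed interfaces: tree_kernel(t1, t2) stands for the tree kernel of Section 4.2.2, poly_kernel(v1, v2) for the polynomial kernel over the linear feature vectors, and each example is a dictionary holding its left/right trees and vectors. The normalisation follows a literal reading of Equation 6.1.

def K_tr(e1, e2, tree_kernel):
    t11, t21 = e1["trees"]          # left and right structural features of e1
    t12, t22 = e2["trees"]
    return (tree_kernel(t11, t12) + tree_kernel(t21, t22)
            - tree_kernel(t11, t22) - tree_kernel(t21, t12))

def K_pr(e1, e2, poly_kernel):
    v11, v21 = e1["vectors"]        # left and right linear feature vectors of e1
    v12, v22 = e2["vectors"]
    return (poly_kernel(v11, v12) + poly_kernel(v21, v22)
            - poly_kernel(v11, v22) - poly_kernel(v21, v12))

def K(e1, e2, tree_kernel, poly_kernel):
    ktr = K_tr(e1, e2, tree_kernel)
    kpr = K_pr(e1, e2, poly_kernel)
    # each contribution is divided by its own magnitude; zero terms stay zero
    ktr = ktr / abs(ktr) if ktr else 0.0
    kpr = kpr / abs(kpr) if kpr else 0.0
    return ktr + kpr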
All the models were learnt on the best 5 propositions output by the Viterbi algorithm for Section 24, i. e. 16,240 candidate annotations and 48,582 comparisons between annotation pairs. We used the models to re-rank the best 5 candidate propositions of Section 23, comprising 26,325 annotations and 81,162 comparisons. The outcome of these experiments is shown in Table 6.9, where the results are sorted by F1 measure. The columns labelled TK list the performance (in terms of Precision, Recall and F1 measure) of the tree kernels alone, whereas those labelled TK+PF account for the combination of a tree kernel with the linear predicate features. The row labelled no TK shows the results of using the linear features alone, i. e. without structured features. We note that:

• linear features alone perform about 0.6% above the baseline, i. e. 76.55% vs. 75.91%. This small improvement suggests that proposition re-ranking is a complex task, and that the features we provided are not enough to support the learning process;

• tree kernels alone always perform worse than their combination with linear features. This makes sense, as the linear features consider some aspects of the re-ranking problem, namely the Viterbi score and rank, which are crucial to its solution and are not encoded in the syntactic structures;

• heavily syntactic tree kernels, i. e. AST^cm_N and AST^fl_N, degrade the performance of the linear features alone. This datum provides an interesting hint on the re-ranking task, which seems to be in contrast with what many authors say, e. g. [Toutanova et al., 2005], as it suggests that intra- and
inter-node syntactic structures, as well as too much lexical information, have a negative effect on re-ranking accuracy;

• on the other hand, all the structures that we engineered in order to represent the predicate argument structure, i. e. all those derived from the PAS, improve the performance over the no TK model. In accordance with the previous observations, this fact suggests that what is actually crucial for proposition re-ranking is the recognition of the argument structure, which indeed should depend on a verb and its sense rather than on the (possibly wrong) syntactic parse of the sentence or the lexical realization of the arguments;

• the best kernels are those that strip away almost all of the syntactic parse information, i. e. the PAS^t and PAS^tl. In this latter case, the lemmatization of the predicate produces a 0.11% F1 measure improvement that is clearly due to the increased generality of the resulting structure.
             TK                       TK+PF
             P      R      F1         P      R      F1
AST^cm_N     76.57  69.05  72.61      81.71  71.86  76.47
AST^fl_N     76.77  69.94  73.20      81.65  72.05  76.55
no TK        -      -      -          81.42  72.22  76.55
PAS          76.32  71.46  73.81      81.84  72.30  76.77
PAS^fl       76.26  72.16  74.15      81.64  72.49  76.80
PAS^t        76.21  72.22  74.16      81.65  72.57  76.84
PAS^tl       76.11  72.36  74.19      81.69  72.75  76.96
Table 6.9: Performance comparison on the SRL task (Section 23) between different tree kernels and kernel combinations.

These experiments allowed us to identify the most promising structured feature, i. e. the PAS^tl, which we then used to improve our re-ranking mechanism's accuracy. First, we performed some tuning of the learning algorithm and observed that preventing the SVMs from normalizing the input
examples resulted in better performance. Our next experiments were run under this condition.

We then extended our kernel by introducing the feature vectors used for the role classification task into the re-ranking model, as proposed by [Toutanova et al., 2005]. We only added core argument features, as the adjuncts aren't supposed to provide useful information for the task. The resulting kernel is an extension of the one defined in Equation 6.1, in which all the polynomial kernels K_p(v^j_1, v^k_2) with j = k provide a positive contribution, whereas those having j ≠ k have a negative weight. We trained on two different sets, i. e. Section 24 and Section 12, in order to measure the impact of syntax errors⁵ on our kernel's performance.

The outcome of these experiments is shown in Table 6.10. For each training section (Sec) we present the results of the PAS^tl tree kernel combined with the predicate argument features alone (PAS^tl+PF) and with the addition of each argument node's role-classification features (PAS^tl+PF+AF). The columns labelled Reranker report the classification accuracy on the re-ranking pairs (i. e. the SVM-light output, see Section 5.3), whereas those labelled SRL account for the full SRL task performance on Section 23 (i. e. the evaluation is carried out using the CoNLL oracle). For the latter, the percentage of perfect propositions (%prf) is also shown. We note that:

• the inclusion of each argument's feature set in the model results in a lower performance. This observation corroborates the outcome of our previous experiments, which had already shown that using local lexical and syntactic information for proposition re-ranking introduces more noise than useful features. From a system point of view this is good news, as a reduced number of feature vectors results in faster learning and classification phases;
5. As the Charniak parser was trained on Section 12 of the WSJ, its output on this data set is much better than on Section 24.
                          Reranker                  SRL (Section 23)
Sec   Kernel              P      R      F1          P      R      F1     %prf
24    PAS^tl+PF           80.45  79.99  80.22       81.00  74.28  78.15  50.29
      PAS^tl+PF+AF        79.36  79.79  79.57       81.91  73.14  77.77  48.83
12    PAS^tl+PF           79.28  80.87  80.10       81.90  74.94  78.27  50.69
      PAS^tl+PF+AF        79.90  81.49  80.69       82.36  73.38  77.61  48.51
Table 6.10: Results of the combination of the PAS^tl tree kernel with the predicate argument linear features (PF) and with the addition of each proposition argument's role-labelling features (AF).

• the small degradation of performance introduced by bad parse trees, i. e. when training on Section 24, is an interesting consequence of our loosely parse-tree-based approach, and confirms that semantic structures are much more relevant for proposition re-ranking than localized lexical and syntactic information.

Considerations on the Re-ranking Models

It must be noted that the employed architecture has been largely inherited from the system that took part in the 2005 edition of the CoNLL shared task. This includes the candidate selector module and the models used for the boundary and argument classifiers. During the experiments we observed that the candidate selector fails to consider some of the nodes of the parse trees; in some cases, such nodes actually dominate predicate arguments. Though we didn't carry out a thorough estimation of the number of skipped nodes, it seems that at least 1% of the argument nodes are not included in the training and test sets. The impact of this issue on the overall system's performance is not known. Still, as the candidate selector is an early stage of the architecture, the repercussions of a flawed implementation are likely to bias the whole labelling process.

Also, the role multi-classifier was trained only on the set of positive
boundaries output by the TBC. This is compliant with the model adopted for the CoNLL task, whereas for the joint model we need to classify every tree node with the TRM in order to evaluate (using the Viterbi algorithm) the most likely labelling configurations. Since training on the same huge data set (i. e. Sections 02-21) would require several weeks on our hardware configuration, we were almost forced to reuse the models learnt for CoNLL. As a result, many classification scores output by the role multi-classifier are totally biased and introduce some degree of error in the training of the sigmoid fitting algorithm and, consequently, in the probabilistic evaluation of the role classifiers' output. Still, the probabilistic boundary classifier is quite accurate (its model being learnt from all the tree nodes) and its relevance in the evaluation of a labelling scheme's likelihood should (at least partially) mitigate the impact of this flaw of the model.

Using the best 20 configurations output by the Viterbi algorithm, our re-ranking model has an interesting upper bound F1 measure of 88.59%. With respect to the baseline, our best model produces an F1 improvement of 2.36 points (i. e. 78.27% vs. 75.91%), which only accounts for 18.6% of our theoretical margin. It is enough to climb five positions (from the 8th to the 3rd) in the final CoNLL ranking, but not to reach the state of the art system [Punyakanok et al., 2005], which declares an F1 measure of 79.44% on Section 23. Nevertheless, this improvement was realized using a very small portion of the available training set (1 section out of 20), and yet it was enough to place our system ahead of all the systems that do not perform syntactic parse re-ranking.
Chapter 7
Conclusions
Semantic Role Labelling is a difficult task consisting in the identification of predicate-argument structures within free-text sentences. Given a sentence and a predicate, an SRL system should recognize all the word sequences that correspond to the predicate's arguments and label each of them according to its role. The role labelling process relies on lexical, morphological and syntactic data relative to the target sentence, which in real-world scenarios are rather noisy and inaccurate due to the inherent complexity of natural languages. The lack of a sound and complete theory establishing a link between syntax and semantics doesn't allow the problem to be addressed in a deterministic fashion. Indeed, most SRL models are based on inductive approaches and use supervised ML algorithms to learn how to perform the task based on the observation of a large collection of annotated data. In 2004 and 2005, the CoNLL shared task has been focusing the attention of a large part of the research community on an SRL model which is based
on Levin's verb classes, using the Penn TreeBank and its layer of semantic-role annotation, the PropBank, as a common data set. This thesis discussed a set of extensions to an existing SRL architecture that was presented at the CoNLL 2005 shared task. These extensions regard the introduction of tree kernels in many stages of the processing, as well as the definition of new functional modules and a series of architectural reworks. Tree kernels are interesting as they allow an implicit definition of the large set of lexical and syntactic features that the statistical ML algorithms have to cope with. We have run many experiments regarding the impact of tree kernels in different stages of the SRL process, namely boundary detection, argument classification, overlap resolution and proposition re-ranking. We demonstrated that tree kernels are a viable alternative to the explicit selection and engineering of features for the SRL task, as carefully tailored structured features can be effectively employed for the realization of accurate and reliable SRL systems. Our latest model, which is based on a joint probabilistic evaluation of a set of candidate annotations and on a tree kernel based re-ranking mechanism, has highlighted many interesting aspects and produced encouraging results. From a ML point of view, the recognition of a correct or incorrect proposition is only slightly correlated with local lexical and syntactic information, i. e. a correct predicate-argument structure can be recognized with little or no information about the sentence's syntactic parse tree. A joint probabilistic model considering each argument's role likelihood can be used to evaluate the most likely labelling schemes of a sentence, whereas a re-ranking mechanism that uses properly encoded representations of such candidate predicate-argument structures can successfully be employed to select the best alternative. What is mostly relevant for the proposition re-ranking task is the syntax of the predicate-argument structure, i. e. the po-
sition and type of the arguments with respect to the predicate, whereas any attempt to add local syntactic or lexical information caused a degradation of our system's performance. This last point is of great interest, as it has two main implications:

• our re-ranking mechanism is robust with respect to local errors of the parser, as a good argument node with a wrong attachment within the parse tree will still be recognized. The small performance degradation measured by our experiments when using parse trees of different accuracy seems to confirm this aspect;

• many state of the art SRL systems, e. g. [Toutanova et al., 2005], obtain very good results by including a great deal of local lexical and syntactic information in their proposition re-ranking models. This seems to be in contrast with our observations. Still, our re-ranking experiments were carried out on very small training sets, i. e. 1 section out of 24, and with such a small amount of data it is clear that local information is an obstacle to proper generalization. Further experiments on larger data sets should help clarify this matter.

Noticeably, the application of the re-ranking mechanism improved our system's CoNLL performance by 2.36 F1 points. This is a very good result considering the complexity of the task and the difficulty of improving a performance that is already generally high. Indeed, such an improvement is enough to let our system climb 5 positions with respect to the CoNLL results and reach the 3rd place. It's important to notice that the two best CoNLL systems [Punyakanok et al., 2005, Haghighi et al., 2005] perform syntactic parse re-ranking and can therefore count on more (and generally more accurate) lexical and syntactic data in the earlier stages of the process. However, we have only been able to tap a small percentage of our theoretical margin, i. e. we have an actual F1 measure of 78.27% while our system's theoretical upper bound is 88.59%. This is surely also due to some
imperfections of the system, but for the most part to the fact that we trained our re-ranker on a very small subset of the available training data. Anyhow, our system has the potential to compete with state of the art solutions, and we plan to exploit it in order to provide further contributions to the research in this field and to develop accurate and reliable SRL-based advanced NLP applications.
Future Work

Our first objective for the near future is to reach state of the art performance. We are very confident that a further investigation of the possible flaws of the system and a fine tuning of the learning algorithm can help us exploit the potential of our model and reach this objective, which is not too far off. There are many aspects of both the model and the system that can be revised in order to hit this goal:

• addressing the issue of the models for the role multi-classifier should increase our lower bound and hence decrease the relevance of the re-ranking mechanism. At the same time, correcting some errors in earlier processing stages should result in a more accurate probabilistic interpretation of the possible labelling schemes and consequently in less noisy training and test data for the proposition re-ranker. With the highest priority, we should revise the old candidate and feature extractor modules in order to be sure that we are not skipping argument nodes, and retrain all the classifiers on non-boundary nodes as well;

• in order to fairly compare our system with the state of the art we need to train our models on all the available data. It is very likely that, without requiring changes to the model, such a large training set would grant us the performance boost that we need;
• syntactic parse re-ranking is an important component of an open-domain SRL architecture, as it can dampen the effect of bad parses by selecting the least noisy solution among a set of available alternatives. Extending our model to include multiple syntactic parse trees would surely increase our theoretical upper bound, and would very likely improve the accuracy and reliability of the earlier processing stages;

• the predicate in a predicate-argument structure is the key factor on which the number, position and role of the arguments depend. Thus, the process of separating good and not-so-good propositions could benefit from the introduction of some linguistic information about the argument structures that comply with the possible syntactic realizations of a verb's senses. We would like to use features accounting for the (probabilistic) compatibility of a candidate argument structure with the observed realizations of its target verb;

• not all the arguments in a predicate-argument structure have the same relevance for the re-ranking task. Indeed, those that mostly characterize the proposition are the core arguments, whereas the others provide a kind of information that is very oblique with respect to the target predicate. What we'd like to do is to study re-ranking models that either do not include adjunct arguments or explicitly assign different weights to adjunct and core arguments. The resulting model should be more akin to the linguistic description of these verb-centered semantic structures;

• we would like to experiment with new tree kernels, investigating the most appropriate combinations of lexical, syntactic and semantic information for the re-ranking task.

Another interesting research direction is that of building a completely kernelized SRL system. We have already taken some steps in that direction and started deploying a very basic SRL architecture solely based on
tree kernels. The experiments that we have been running in these years have provided us with much evidence about which structured features are most effective for each stage. Our probabilistic joint model, together with the re-ranking mechanism, should compensate for the possible loss of accuracy in the earlier stages of the process. Nevertheless, many issues have to be addressed before such a solution can be successfully attempted. Most notably:

• we should find a proper way to encode features that are relevant for the proposition re-ranking task, such as the Viterbi score and rank, into our structured features. This is not a trivial point, as the rank can assume many discrete values and the score is a real value. Using them to label some nodes, as our current tree kernel engineering methodologies do, would prevent many fragments from matching, which would largely hamper the generalization capability of the learning algorithm;

• we should drastically improve our model on the computational side, as tree kernels are required to evaluate huge fragment spaces and are therefore generally less efficient than attribute-value feature representation schemes. The performance issue is also very relevant from a research point of view, as any time we have to learn some models or classify large data sets we experience very long waiting periods between the launch and the evaluation of an experiment. Therefore we are looking forward to adopting a faster (yet possibly less accurate) algorithm than SVMs, such as a perceptron. Once we have encoded tree kernels into a perceptron learning algorithm, we could employ this solution to run our experiments and be able to evaluate many different configurations in less time. Having derived an optimal model or system configuration, we could finally launch large-scale experiments using SVMs in order to achieve the highest level of accuracy.
Improving computational efficiency is also critical with respect to the possibility of using an SRL system for on-line applications, such as web-based information extraction engines. Currently, the semantic annotation of a sentence is a task that requires several seconds of processing on an average workstation. The bottleneck of the system is the classifiers: the more accurate the learning, the more computationally expensive the classification phase. To make SRL practical for on-line contexts, the system's response time should definitely be kept below one second, possibly much less. Using a different classifier, e. g. the perceptron, might help but would hardly be a complete solution. Another improvement could be the employment of distinct classifiers for internal and pre-terminal nodes, as our experiments have shown that this approach can halve classification time without reducing accuracy. Finally, we'd really like to incorporate our SRL system within some useful application. In its current shape, it is an interesting tool for linguists and NLP researchers, as it can provide quite accurate semantic annotations of any well-formed English sentence. Still, our objective would be to integrate our technology into some real-world application that the average user could profitably use, such as a semantic-level layer between the end user and a traditional information retrieval engine like Google or Yahoo!.