Leveraging Existing Resources using Generalized Expectation Criteria

Gregory Druck Dept. of Computer Science University of Massachusetts Amherst, MA 01003 [email protected]

Gideon Mann Google, Inc. 76 9th Avenue New York, NY 10011 [email protected]

Andrew McCallum Dept. of Computer Science University of Massachusetts Amherst, MA 01003 [email protected]

Abstract It is difficult to apply machine learning to many real-world tasks because there are no existing labeled instances. In one solution to this problem, a human expert provides instance labels that are used in traditional supervised or semi-supervised training. Instead, we want a solution that allows us to leverage existing resources other than complete labeled instances. We propose the use of generalized expectation (GE) criteria [8] to achieve this goal. A GE criterion is a term in a training objective function that assigns a score to values of a model expectation. In this paper, the expectations are model-predicted class distributions conditioned on the presence of selected features, and the score function is the Kullback-Leibler divergence from reference distributions that are estimated using existing resources. We apply this method to the problem of named-entity recognition, leveraging available lexicons. Using no conventionally labeled instances, we learn a sliding-window multinomial logistic regression model that obtains an F1 score of 0.692 on the CoNLL 2003 data. To attain the same accuracy, a supervised classifier requires 4,000 labeled instances.

1

Introduction

Generalized expectation (GE) criteria [8] are terms in a training objective function that assign scores to values of a model expectation. GE resembles the method of moments, but allows us to express arbitrary scalar preferences on expectations of arbitrary functions, rather than requiring equality between sample and model moments. We also note three important differences from traditional training objective functions for factor graphs. First, there need not be a one-to-one relationship between GE terms and model factors. For example, GE allows expectations on sets of variables that form a subset of model factors, or on sets of variables larger than model factors. Next, model expectations for different GE terms can be conditioned on different data sets. Finally, the reference expectation (or more generally, score function) can come from any source, including other tasks or human prior knowledge. GE provides a method for incorporating prior knowledge into model training. We argue that this approach, in which we communicate with the model by specifying preferences on model expectations, is more intuitive and potentially more robust than incorporating prior knowledge with prior distributions over parameters. In this paper, we leverage known associations between features and classes. The expectations are model predicted class distributions conditioned on the presence of selected features, and the score function is the Kullback-Leibler divergence from reference distributions that are

estimated using existing resources. Combining these GE terms with a prior on parameters encourages the use of co-occurrence patterns in unlabeled data to learn parameters for features for which we lack prior information. We apply this method to the task of named-entity-recognition (NER), leveraging existing lexicons, for example lists of universities, organizations, and people. Assuming we have a mapping between lexicons and associated labels, we can estimate reference probability distributions of NER label conditioned on the presence of lexicon features. We use these distributions in a GE objective to train a sliding window multinomial logistic regression classifier. The training procedure requires no labeled sequences. We provide a discussion of related work (including [10, 1, 3, 9, 4, 5]) elsewhere [2, 6, 8].

2

Generalized Expectation Criterion

A generalized expectation (GE) criterion objective function term assigns scores to values of a model expectation [8]. In many cases this score function is some measure of distance between a model expectation and a reference expectation. Specifically, given some distance function ∆(·, ·), a reference expectation f̂, an empirical distribution p̃, a function f, and a conditional model distribution p, the objective function is:

∆(f̂, E_p̃(X)[E_p(Y|X;θ)[f(X, Y)]]).

Here, we explore a special case of GE, similar to the approach of Mann and McCallum [6] and Druck, Mann, and McCallum [2]. More specifically, we use GE in conjunction with multinomial logistic regression models; ∆(·, ·) is the KL divergence; p̃ is unlabeled data; and the expectations are distributions over class conditioned on a specific binary feature, p̃(y|x_k = 1; θ), where y is a class label and the feature is indexed by k. We define p̃(y|x_k = 1; θ) as:

p̃(y|x_k = 1; θ) = (1 / |C_k|) Σ_{x ∈ C_k} p(y|x; θ),

where C_k = {x : p̃(x) > 0, x_k = 1}, the set of all unlabeled instances that contain the feature indexed by k. We discuss the estimation of reference distributions p̂(y|x_k = 1) in the next section. A single GE objective function term is then:

D_KL(p̂(y|x_k = 1) || p̃(y|x_k = 1; θ)) = Σ_y p̂(y|x_k = 1) log [ p̂(y|x_k = 1) / p̃(y|x_k = 1; θ) ].
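As an illustrative sketch (not the authors' implementation), the computation of a single GE term for a multinomial logistic regression model can be written as follows; the data layout (a binary feature matrix) and function names are our own assumptions:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ge_kl_term(theta, X, k, p_ref):
    """One GE term: KL(p_ref || model class distribution averaged
    over the unlabeled instances that contain feature k).

    theta : (n_features, n_classes) parameter matrix (hypothetical layout)
    X     : (n_instances, n_features) binary feature matrix of unlabeled data
    k     : index of the conditioning feature
    p_ref : (n_classes,) reference distribution p̂(y | x_k = 1)
    """
    C_k = X[X[:, k] == 1]                        # instances with feature k
    p_model = softmax(C_k @ theta).mean(axis=0)  # p̃(y | x_k = 1; θ)
    return np.sum(p_ref * np.log(p_ref / p_model))
```

With zero parameters the model distribution is uniform, so a uniform reference yields a KL of zero; any other reference yields a positive value.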

We combine multiple GE terms to obtain the complete objective function:

O = − Σ_{k ∈ K} D_KL(p̂(y|x_k = 1) || p̃(y|x_k = 1; θ)) − Σ_j θ_j² / (2σ²),    (1)

where K is the subset of features with a GE term. Because there are more parameters in the model than corresponding GE terms in the objective function, the optimization problem is under-constrained. Therefore, we expect there will be many optimal parameter settings. The Gaussian prior addresses this problem by preferring parameter settings with many small values over settings with a few large values. This encourages non-zero parameter values for features that do not have corresponding GE terms, but often co-occur with features with corresponding GE terms. The optimization of Equation 1 is discussed elsewhere [6, 2].
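A minimal sketch of the complete objective in Equation 1, combining the GE terms with the Gaussian prior, might look like the following; the data layout and the default σ² are our own assumptions, not values from the paper:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ge_objective(theta, X, ref_dists, sigma2=10.0):
    """Equation 1: negated sum of per-feature GE KL terms plus a
    zero-mean Gaussian prior on the parameters (to be maximized).

    theta     : (n_features, n_classes) parameters
    X         : (n_instances, n_features) binary unlabeled data
    ref_dists : dict mapping feature index k -> p̂(y | x_k = 1),
                one entry per GE term (the set K in Equation 1)
    """
    p_all = softmax(X @ theta)                 # p(y | x; θ) for every instance
    obj = -np.sum(theta ** 2) / (2.0 * sigma2)  # Gaussian prior term
    for k, p_ref in ref_dists.items():
        mask = X[:, k] == 1
        p_model = p_all[mask].mean(axis=0)     # p̃(y | x_k = 1; θ)
        obj -= np.sum(p_ref * np.log(p_ref / p_model))
    return obj
```

In practice this objective would be maximized with a gradient-based optimizer; the optimization details are given in [6, 2].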

3

Leveraging Existing Resources with GE

To train a model with Equation 1, we need reference distributions pˆ(y|xk = 1). In some scenarios, there exist resources other than conventionally labeled instances that can be used to estimate these distributions. In natural language processing tasks like named-entity recognition, for example, it is common to use readily-available lexicons, or word lists, as features. We often hypothesize associations between these lexicons and labels in the target

task, independent of any labeled data. For example, we expect a list of cities to be a good indicator of the location label. We suggest a simple estimation method for converting such lexicon-label associations into distributions over labels conditioned on the presence of lexicon features. Each lexicon feature x_k in an unlabeled instance x contributes a vote for each of its associated labels, and the instance is assigned the label with the most votes. This results in a labeling of the unlabeled data, from which we can compute the distributions p̂(y|x_k = 1) directly.
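The voting heuristic above can be sketched as follows. The data layout (instances as sets of feature names, a lexicon-to-labels mapping) is a hypothetical one of our choosing; the paper does not specify a representation:

```python
from collections import Counter, defaultdict

def estimate_reference_dists(instances, lexicon_labels, labels):
    """Estimate p̂(y | x_k = 1) for lexicon features by majority voting.

    instances      : list of sets of feature names (unlabeled data)
    lexicon_labels : dict mapping lexicon feature name -> list of
                     associated labels (each match votes for all of them)
    labels         : ordered list of all labels
    """
    counts = defaultdict(Counter)  # lexicon feature -> winning-label counts
    for feats in instances:
        votes = Counter(lab for f in feats if f in lexicon_labels
                        for lab in lexicon_labels[f])
        if not votes:
            continue  # no lexicon match; instance gets no heuristic label
        winner = votes.most_common(1)[0][0]
        for f in feats:
            if f in lexicon_labels:
                counts[f][winner] += 1
    # Normalize the counts into distributions over the label set.
    return {f: [c[y] / sum(c.values()) for y in labels]
            for f, c in counts.items()}
```

Note that `Counter.most_common` breaks ties arbitrarily; a real implementation would want a deliberate tie-breaking rule.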

4

Named-Entity Recognition Experiments

Table 1: Named-entity recognition results.

                            precision   recall    f1
Association Voting          0.697       0.500     0.582
Logistic Regression w/ GE   0.725       0.662     0.692

LABEL=ORG            LABEL=LOC
POS-TAG=NNPS         WORD=europe
WORD=qantas          WORD=london
WORD=nasdaq          WORD=america
WORD=barcelona       WORD=in@-1
WORD=university      WORD=asia
WORD=ford            WORD=africa
WORD=perth           WORD=america@1
WORD=sydney          WORD=uk
WORD=university@1    WORD=paris
WORD=commonwealth    WORD=south
WORD=rugby           WORD=united
WORD=airways@1       WORD=states@1
WORD=league@1        WORD=bank@-2
POS-TAG=)@-1         WORD=[DATE]@1
WORD=)@-1            POS-TAG=POS@1

Table 2: The most predictive non-lexicon features for the organization and location labels, according to the parameters of a multinomial logistic regression classifier trained with GE.

We evaluate GE training with lexicons on the CoNLL 2003 named-entity recognition data. We formulate the NER problem as a sliding-window classification task: each timestep is an instance, and for each timestep we also include features in a window of ±3 timesteps. In addition to the word tokens themselves, we include features indicating part-of-speech tags, capitalization, and lexicon matches. More detail on the feature representation is available elsewhere [7]. Note that because we use a classifier that does not model sequential dependencies, the accuracy cannot be directly compared with that of linear-chain CRFs (to which we are also working on applying GE). We maximize the objective function in Equation 1 with reference distributions estimated using the heuristic described in Section 3. In total, we use 31 lexicons gathered from the web. The model expectations are conditioned on 50,000 unlabeled sentences from the CoNLL 2003 unlabeled data. The results are in Table 1. The model trained with GE achieves an F1 score of 0.692 on the test set (testb). To attain the same accuracy, supervised training requires 800 labeled instances of each label type (for a total of 4,000 instances). We also compare against a baseline that directly uses the association voting method described in Section 3; GE gives a 26% reduction in F1 error. This comparison illustrates that GE leverages the co-occurrence patterns in unlabeled data during training to learn parameters for non-lexicon features. In Table 2 we show some high-weight parameters for non-lexicon features. Inspection of these features shows that we would indeed expect them to be strong indicators of the label.
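The sliding-window feature extraction described above can be sketched as follows. The feature-name format (e.g. `WORD=states@1`, mirroring Table 2) is illustrative; the full representation, including POS tags, capitalization, and lexicon matches, is described by McCallum and Li [7]:

```python
def window_features(tokens, t, width=3):
    """Token-identity features for timestep t of a sliding-window
    classifier: the token itself plus its neighbors at offsets
    -width..+width, with the offset appended to the feature name.
    """
    feats = []
    for off in range(-width, width + 1):
        i = t + off
        if 0 <= i < len(tokens):  # skip offsets that fall off the sentence
            suffix = "" if off == 0 else "@%d" % off
            feats.append("WORD=%s%s" % (tokens[i].lower(), suffix))
    return feats
```

For example, at the first timestep of "United States of America", the features include `WORD=united` and `WORD=states@1`.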

5

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by DoD contract #HM1582-06-1-2013, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.

References

[1] Ming-Wei Chang, Lev Ratinov, and Dan Roth. Guiding semi-supervision with constraint-driven learning. In ACL, 2007.
[2] Gregory Druck, Gideon Mann, and Andrew McCallum. Reducing annotation effort using generalized expectation criteria. Technical Report 2007-62, University of Massachusetts, Amherst, 2007.
[3] João Graça, Kuzman Ganchev, and Ben Taskar. Expectation maximization and posterior constraints. In NIPS, 2008.
[4] Aria Haghighi and Dan Klein. Prototype-driven learning for sequence models. In HLT-NAACL, 2006.
[5] Rong Jin and Yi Liu. A framework for incorporating class priors into discriminative classification. In PAKDD, 2005.
[6] Gideon Mann and Andrew McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In ICML, 2007.
[7] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CoNLL, 2003.
[8] Andrew McCallum, Gideon Mann, and Gregory Druck. Generalized expectation criteria. Technical Report 2007-60, University of Massachusetts, Amherst, 2007.
[9] Robert E. Schapire, Marie Rochery, Mazin Rahim, and Narendra Gupta. Incorporating prior knowledge into boosting. In ICML, 2002.
[10] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin–Madison, 2006.
