Engineering Applications of Artificial Intelligence 41 (2015) 166–174


Unsupervised word sense induction using rival penalized competitive learning

Yanzhou Huang a,b, Xiaodong Shi a,b,*, Jinsong Su c, Yidong Chen a,b, Guimin Huang d

a Fujian Key Lab of the Brain-like Intelligent Systems, Xiamen University, Xiamen 361005, PR China
b Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, PR China
c School of Software, Xiamen University, Xiamen 361005, PR China
d Research Center on Data Science and Social Computing, Guilin University of Electronic Technology, Guilin 541004, PR China

Article history: Received 2 April 2014. Received in revised form 21 December 2014. Accepted 3 February 2015. Available online 7 March 2015.

Abstract

Word sense induction (WSI) aims to automatically identify the different senses of an ambiguous word from its contexts. It is a nontrivial task in natural language processing because word sense ambiguity is pervasive in linguistic expressions. In this paper, we construct multi-granularity semantic spaces to learn the representations of ambiguous instances, in order to capture richer semantic knowledge during context modeling. In particular, we consider not only the semantic space of words but also the semantic spaces of word clusters and topics. Moreover, to circumvent the difficulty of selecting the number of word senses, we adapt a rival penalized competitive learning method to determine the number of word senses automatically by gradually repelling redundant sense clusters. We validate the effectiveness of our method on several public WSI datasets, and the results show that our method improves the quality of WSI over several competitive baselines. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Natural language processing; Word sense induction; Multi-granularity semantic representation; Competitive learning

1. Introduction

Word sense induction (WSI) is crucial for many natural language processing (NLP) tasks because word sense ambiguity is prevalent in all natural languages. WSI and word sense disambiguation (WSD) are two related techniques for lexical semantic computation. The main distinction between them is that the former discriminates different senses without relying on a predefined sense inventory, while the latter assumes access to an already known sense list. For discriminating different word senses, each occurrence of an ambiguous word is regarded as an ambiguous instance. WSI conducts unsupervised sense clustering over these ambiguous instances, and the number of resulting clusters is interpreted as the number of induced word senses. We show an example of WSI for the ambiguous word “ball” in Fig. 1.
We believe that WSI methods face two major challenges. First, contextual semantics are not explored sufficiently during context modeling. In general, shallow lexical features (e.g. unigrams or bigrams of words) surrounding the ambiguous instances constitute an important ingredient in sense induction. However, such fine-grained semantic features inevitably suffer from the data sparsity problem. More advanced Bayesian methods use topic models such as

* Correspondence to: 422 South Siming Road, Department of Cognitive Science, Xiamen University, Xiamen 361005, China. Tel.: +86 18959288068. E-mail address: [email protected] (X. Shi).

http://dx.doi.org/10.1016/j.engappai.2015.02.004 0952-1976/© 2015 Elsevier Ltd. All rights reserved.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to learn topic distributions of ambiguous instances. Compared with shallow features, topic features can capture latent topic structure and have better generalization ability in semantic representation. Topic models are able to exploit abstract conceptual structures; however, using topic models alone may lose a certain amount of unique lexical semantics during context modeling. Based on this, we believe that contextual features derived from multi-granularity semantic spaces can reflect various aspects of the semantic knowledge of the contexts.
Second, the sense number of ambiguous words cannot be determined appropriately. Many popular clustering methods, such as the k-means algorithm, require the cluster number to be pre-assigned precisely. However, in many practical applications it is impossible to know the exact cluster number in advance, so these clustering algorithms often yield poor performance (Dehkordi et al., 2009). More recently, a non-parametric Bayesian method (Lau et al., 2012) uses Hierarchical Dirichlet Processes (HDP) (Teh et al., 2006) to learn the number of word senses automatically. However, it tends to induce a larger number of word senses per ambiguous word than the gold standard on the SemEval-2010 WSI dataset (Lau et al., 2012). Hence, exploring a word sense clustering algorithm that learns appropriate sense numbers for ambiguous words is also crucial for the WSI task.
In this paper, we aim to overcome the two challenges of WSI mentioned above. We propose a novel WSI framework that automatically induces word senses for ambiguous words over multi-granularity semantic spaces without relying on a pre-assigned


S1. He plays the ball. S2. I would like to join into the ball tonight. S3. The ball is running very fast. Fig. 1. An example of word sense induction of the ambiguous word “ball”. Each occurrence of “ball” is underlined and regarded as an ambiguous instance. In this example, the senses of the instances in S1 and S3 are the same (highlighted in blue) and totally different from the one in S2 (highlighted in green). We conduct word sense induction to identify the sense number of the ambiguous word “ball” and assign the three instances to their corresponding senses, i.e. ideally {S1, S3} and {S2}. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 2. The architecture of our proposed method in word sense induction. (Flowchart: training and testing data feed the semantic space construction — word space, word cluster space, and topic space; these are used to learn multi-granularity semantic representations of the ambiguous instances, followed by word sense clustering, which outputs the sense labels of the test instances.)

number of word senses. In particular, our WSI framework runs in two steps: (1) learning multi-granularity semantic representations for ambiguous instances, and (2) context-based word sense clustering for ambiguous words. For the first step, our main idea is that discriminating different word senses entails integrating diverse semantic granularities from the contexts. To be specific, we use the Vector Space Model (Salton and Buckley, 1988) to learn the semantic representations of ambiguous instances under the semantic spaces of words, word clusters and topics. Semantic distances among the different semantic granularities are integrated by means of a concatenation and a linear interpolation strategy (Section 3). For the second step, we adapt a rival penalized competitive learning (RPCL) method to determine the number of word senses automatically by gradually repelling the redundant sense clusters (Section 4). Once our algorithm meets a stopping condition, the centroids of the remaining clusters are considered the representations of the different word senses, and the number of remaining clusters is considered the sense number induced for the ambiguous word. Fig. 2 summarizes the architecture of our proposed method for WSI. Our method is able to improve the quality of WSI over several competitive baselines, and the induced sense number is close to the gold standard. The main contributions of our work lie in two aspects: (1) we integrate multi-granularity semantic spaces to represent the ambiguous instances without resorting to any external resources, and (2) instead of being pre-assigned a fixed number of word senses, our framework can automatically determine the sense number of ambiguous words.
The remainder of this paper is organized as follows: Section 2 summarizes and compares related work. Section 3 presents our method for learning a multi-granularity semantic space representation for each ambiguous instance.
Section 4 elaborates the context-based word


sense clustering for ambiguous words. Section 5 describes our experiments and presents results with discussion. Finally, Section 6 concludes and outlines future directions.

2. Related work

In this section, we give an overview of previous methods and of the participating systems in the WSI task.
Overview of previous methods in WSI: In general, most research in WSI is based on the Distributional Hypothesis (Harris, 1954), which states that words surrounded by similar contexts tend to have similar meanings. Previous methods have exploited various linguistic features, such as first- and second-order context vectors (Purandare and Pedersen, 2004), bigrams and triplets of words (Purandare and Pedersen, 2004; Udani et al., 2005; Bordag, 2006), collocations (Klapaftis and Manandhar, 2008), and syntactic relations (Chen et al., 2009; Van de Cruys and Apidianaki, 2011), to conduct context modeling. To improve the usability of limited, narrow-domain corpora, Pinto et al. (2007) use pointwise mutual information to construct a co-occurrence list for self-term expansion. Based on these contextual features, vector-based (Salton and Buckley, 1988; Purandare and Pedersen, 2004; Pedersen, 2007, 2010; Niu et al., 2007; Pinto et al., 2007) and graph-based (Agirre and Soroa, 2007b; Klapaftis and Manandhar, 2008; Korkontzelos and Manandhar, 2010; Klapaftis and Manandhar, 2010) models have been applied to WSI.
More advanced Bayesian methods have been explored in recent years, as they can discover latent topic structures from contexts without involving feature engineering. Brody and Lapata (2009) apply parametric LDA (Blei et al., 2003) to the WSI task. The contexts of ambiguous instances are regarded as pseudo documents, and their induced topic distributions are considered the sense distributions. Yao and Van Durme (2011) further use nonparametric HDP (Teh et al., 2006) to learn the sense distributions. The advantage of this method is that it can automatically learn the number of word senses for each ambiguous word, whereas LDA needs to be assigned a topic number in advance.
Experimental results show that the HDP model is superior to the standard LDA model. Lau et al. (2012) also show an improvement in supervised F-score after incorporating position features into the HDP model. Charniak (2013) extends the naive Bayes model based on the idea that the closer a word is to the target word, the more relevant it will be in WSI.
Overview of participating systems in WSI: Evaluation campaigns for WSI have been conducted in SemEval-2007 (Agirre and Soroa, 2007a), SemEval-2010 (Manandhar et al., 2010) and SemEval-2013 (Navigli and Vannella, 2013). The participating systems in SemEval-2007 mainly use vector-based and graph-based models to conduct WSI of target words. I2R (Niu et al., 2007) is the best induction system (vector-based) in the supervised evaluation; it uses the part-of-speech of neighboring words, unordered words, and local collocations to capture contextual information. Among the systems in SemEval-2010, the highest ranked system in the supervised evaluation is UoY (graph-based) (Korkontzelos and Manandhar, 2010), in which single nouns and noun pairs are included as vertices in the graph. Each cluster is taken to represent one of the senses of the target word. Note that KSU KDD (Elshamy et al., 2010) introduces the topic model LDA (Blei et al., 2003) to infer the topic distribution of each test instance, and the k-means algorithm is applied to sense clustering. In SemEval-2013, participating systems are applied to web search result clustering. The best performing systems are those of the HDP teams (Navigli and Vannella, 2013), which take advantage of the topic model HDP (Teh et al., 2006) to



learn the sense distributions. In addition, following Di Marco and Navigli (2013), new evaluation metrics are applied to evaluate the quality of snippet clustering.
Overall, the participating systems treat the WSI task as an unsupervised clustering problem, and certain linguistic features have proved effective in WSI. In addition, various topic models, especially HDP (Teh et al., 2006), are able to discover latent topic structures from the contexts and show promising results in WSI tasks. Different from previous methods, we model the contexts with multi-granularity semantic spaces, namely word, word cluster and topic spaces. By doing so, our method can capture various aspects of the semantic knowledge in the contexts in a fully unsupervised fashion. Furthermore, under our WSI framework, we adapt the RPCL method to conduct sense clustering, in order to learn the number of word senses automatically and appropriately. Combined with the use of multi-granularity semantic spaces, our framework improves the quality of word sense clustering over several competitive baselines.

3. Learning multi-granularity semantic space representation

We construct three types of semantic space: word, word cluster, and topic. Without loss of generality, let $A = \{a_i \mid i: 1 \ldots n\}$ be a set of ambiguous words and $I_{ij}$ be an ambiguous instance of ambiguous word $a_i$. If ambiguous word $a_i$ has $m_i$ ambiguous instances, we denote them as $\Psi_{a_i}^{m_i} = \bigcup_{j=1}^{m_i} I_{ij}$. Based on this, we further denote the whole training set of $A$ as $\Psi_A^n = \bigcup_{i=1}^{n} \Psi_{a_i}^{m_i}$. Consequently, $I_{ij} \in \Psi_{a_i}^{m_i} \subseteq \Psi_A^n$.
To construct the semantic space, a natural first attempt is to apply our proposed method to $\Psi_A^n$. However, according to our observation, some context words present totally different semantic orientations with respect to different ambiguous words. That is to say, for a given ambiguous word $a_i \in A$, the semantic representations learned from $\Psi_A^n$ are inapplicable to those learned from $\Psi_{a_i}^{m_i}$. As a result, we train a semantic model separately for each ambiguous word.

3.1. Learning word and word cluster space representation

We construct the word space $W = \{w_k \mid k: 1 \ldots K\}$ from the training set. Each word in $W$ is lemmatized; stop words and low-frequency words are removed. Based on the word space $W$, we then learn the semantic representation of a context word $\eta$. In particular, each dimension corresponds to a separate word $w_k$, and its value, also known as the term weight, is estimated according to the semantic correlation between $\eta$ and $w_k$. Intuitively, the more often two words co-occur, the more closely related they will be. In our implementation, the semantic representation and the computation of term weights are given in Eq. (1) and Eq. (2), respectively:

$V^K(\eta) = \{v_k(\eta, w_k) \mid k: 1 \ldots K\}$ (1)

$v_k(\eta, w_k) = \mathrm{dice}(\eta, w_k) \cdot s(\eta, w_k)$ (2)

with

$\mathrm{dice}(x, y) = \dfrac{2 f(x, y)}{f(x) + f(y)}$ (3)

$s(x, y) = \dfrac{f(x, y)}{f(x, y) + \alpha} \cdot \dfrac{\min(f(x), f(y))}{\min(f(x), f(y)) + \alpha}$ (4)

In Eq. (3), $\mathrm{dice}(\cdot)$ represents the Dice coefficient, which determines the point-wise semantic correlation, and $f(\cdot)$ is the frequency counting function. The smoothing function $s(\cdot)$ (Eq. (4)) is designed
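As an illustration of Eqs. (2)–(4), the following is a minimal sketch of the smoothed Dice term weight. The counters, the toy counts, and the value of α are illustrative assumptions, and the factorization of $s(\cdot)$ follows one plausible reading of the smoothing function:

```python
from collections import Counter

def term_weight(x, y, f, f_pair, alpha=2.0):
    """Smoothed Dice term weight, a sketch of Eqs. (2)-(4).

    f: unigram frequency Counter; f_pair: co-occurrence Counter over
    word pairs; alpha: pseudo count for smoothing (Pantel, 2003).
    """
    fx, fy = f[x], f[y]
    fxy = f_pair[(x, y)] + f_pair[(y, x)]      # symmetric co-occurrence
    if fxy == 0 or fx == 0 or fy == 0:
        return 0.0
    dice = 2.0 * fxy / (fx + fy)               # Eq. (3)
    m = min(fx, fy)
    s = (fxy / (fxy + alpha)) * (m / (m + alpha))  # Eq. (4), as read here
    return dice * s                            # Eq. (2)

# Toy counts (hypothetical): "ball" co-occurs often with "game".
f = Counter({"ball": 10, "game": 8, "dance": 3})
f_pair = Counter({("ball", "game"): 6, ("ball", "dance"): 1})
```

Frequent co-occurrences get a weight close to the raw Dice score, while rare pairs are discounted by both factors of $s(\cdot)$.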

to reduce the overemphasis on associations of low-frequency words. $\alpha$ is the pseudo count for smoothing estimation (Pantel, 2003).
Regarding the word cluster space, its advantage lies in its generalization ability for modeling contexts, which is capable of alleviating the data sparsity problem. For example, “Monday” and “Friday” are regarded as two distinct dimensions in the word space, whereas in the word cluster space they are both contained in the same cluster and thus indexed by the same dimension. We employ Brown clustering (Brown et al., 1992), a bottom-up agglomerative word clustering algorithm, to generate a hierarchical clustering in a fully unsupervised fashion. The algorithm's input is a sequence of words without any labeled data. Its output (i.e. the resulting clusters) is evaluated with a class-based bigram language model. In practice, we use an available software package1 to train the model. After training, the output is a hierarchical binary tree; words that share the same prefix in the tree are considered to be in the same cluster. Based on this, given a word cluster space $C = \{c_k \mid k: 1 \ldots T\}$ with $T \le K$, the semantic representation of $\eta$ is defined as

$V^T(\eta) = \{v_k(\eta, c_k) \mid k: 1 \ldots T\}$ (5)

where $T$ is the cluster number. The computation of $v_k(\eta, c_k)$ is similar to that of $v_k(\eta, w_k) \in V^K(\eta)$. The frequency of cluster $c_k$ is the sum of the frequencies of the words it contains, namely $f(c_k) = \sum_{w \in c_k} f(w)$.

3.2. Learning topic space representation

The topic model is a probabilistic graphical model of text generation. Each document is modeled as a multidimensional distribution, where each dimension represents a specific topic characterized by a distribution over the word space. The topic model emphasizes abstract conceptual structure more than the word cluster model does. Previous work (Lau et al., 2012) directly induces the topic distribution from the sentences in which the ambiguous instance occurs.
The topic index with the highest aggregated probability is then labeled as the induced sense. Differently, we form a pseudo document for each word in the training set by combining the contexts in which it occurs. Given these pseudo documents, we learn their topic distributions via topic model training, and these distributions are regarded as the semantic representations of the corresponding words. Formally, the semantic representation of word $\eta$ in the topic space is defined as

$V^R(\eta) = \{p(z_k \mid doc_\eta) \mid k: 1 \ldots R\}$ (6)

where $R$ ($R \le T$) is the topic number and $p(z_k \mid doc_\eta)$ represents the probability of the $k$-th topic $z_k$ given the pseudo document $doc_\eta$ of word $\eta$. The probability is estimated by simply normalizing the count of words assigned topic $z_k$ in the document. Because the topic number in HDP is learned automatically, we select the HDP2 model rather than the LDA model to induce topic distributions. HDP draws $G_0 \sim \mathrm{DP}(\gamma, H)$ as a base distribution according to the Dirichlet Process (DP), where parameter $\gamma$ controls the sharing proportion of the induced topics across different documents. The generative process is as follows: (1) for each pseudo document $doc_\eta$, generate a nonparametric prior $G_\eta \sim \mathrm{DP}(\alpha_0, G_0)$, where $\alpha_0$ controls the degree of intersection of the topics in the same document; (2) sample a latent topic $z_k \sim G_\eta$, in a similar manner to the LDA model; and (3) sample a word $w \sim z_k$. To obtain intuitive insight into the learned topic structure, Table 1 sketches the top-5 semantic words of each induced topic learned from the collected pseudo documents. Based upon HDP modeling, each word in the training set is vectorized in terms of its topic distribution.
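A sketch of the pseudo-document construction and of the normalization behind Eq. (6). The window size is a hypothetical choice not fixed by the paper, and the topic assignments are assumed to come from an HDP (or LDA) sampler trained separately:

```python
from collections import Counter, defaultdict

def pseudo_documents(corpus, window=5):
    """Build one pseudo document per word by pooling its context windows.

    corpus: list of tokenized sentences; window: hypothetical context size.
    """
    docs = defaultdict(list)
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            docs[w].extend(ctx)
    return docs

def topic_representation(assignments, num_topics):
    """Eq. (6): normalize counts of topic assignments in a pseudo document.

    assignments: topic ids sampled for the document's words (assumed given).
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    return [counts[k] / total for k in range(num_topics)]
```

The resulting vector is the word's representation $V^R(\eta)$ in the topic space.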

1 https://github.com/percyliang/brown-cluster.git.
2 http://www.cs.princeton.edu/~blei/topicmodeling.html.


Table 1. A sample of the top-5 semantic words of topics induced from the pseudo documents.

Topic index | Top-5 semantic words
0 | car, speed, driver, stop, road
1 | lap, race, car, track, start
2 | water, rain, condition, plant, warm
3 | slow, soil, plant, flow, runoff
4 | think, feel, run, thing, look
5 | light, speed, earth, space, move
6 | fish, slow, water, bite, river
7 | week, vote, company, cost, early
8 | rate, warm, global, climate, growth

Based on the constructed multi-granularity semantic spaces, the semantic representation of a given ambiguous instance $I_{ij}$ is learned via context-based semantic composition. To be specific, we use a simple additive model (Mitchell and Lapata, 2008) to perform context composition in the word, word cluster and topic spaces, respectively. Hence, each ambiguous instance has multi-granularity semantic representations. More formally, $V^K(I_{ij})$, $V^T(I_{ij})$ and $V^R(I_{ij})$ denote the semantic representations of $I_{ij}$ derived from the word, word cluster and topic spaces, respectively.
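The additive composition can be sketched as follows; the function name and the dictionary-of-vectors layout are illustrative, not the authors' code:

```python
def compose_instance(context_words, word_vectors, dim):
    """Additive composition (Mitchell and Lapata, 2008): the instance
    vector is the sum of its context-word vectors in one semantic space.

    word_vectors: dict mapping a word to its representation in that
    space (e.g. V^K, V^T or V^R); words without a vector are skipped.
    """
    v = [0.0] * dim
    for w in context_words:
        if w in word_vectors:
            for d, x in enumerate(word_vectors[w]):
                v[d] += x
    return v
```

Running it once per space yields the three representations $V^K(I_{ij})$, $V^T(I_{ij})$ and $V^R(I_{ij})$ of an instance.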


4. Word sense clustering without knowing the number of word senses

Our method for word sense clustering of ambiguous instances is inspired by the RPCL algorithm proposed by Xu et al. (1993). RPCL is able to perform sense clustering by driving redundant sense clusters far away from the input instances, such that the redundant clusters are eliminated automatically. Without loss of generality, we assume that the ambiguous instances $\{I_{ij}\}_{j=1}^{m_i}$ of ambiguous word $a_i$ come from sense clusters $\{O_{ik}\}_{k=1}^{q_i}$, where $q_i$ is the number of gold standard senses. We define the centroid of the $k$-th cluster $O_{ik}$ as the average of all its member instances, namely $V(O_{ik}) = (1/|O_{ik}|) \sum_{I_{ij} \in O_{ik}} V(I_{ij})$. Accordingly, we can obtain $V^K(O_{ik})$, $V^T(O_{ik})$ and $V^R(O_{ik})$ under the different semantic spaces.

4.1. Semantic distance computation between cluster and instance

We present two strategies to compute the semantic distance between $I_{ij}$ and $O_{ik}$, namely concatenation and linear interpolation. Concatenation joins the different granularities of semantic representation into a single longer vector. Formally, $V^E(I_{ij}) = (V^K(I_{ij}); V^T(I_{ij}); V^R(I_{ij})) \in \mathbb{R}^{1 \times E}$, where $E = K + T + R$. Given $V^E(I_{ij})$ and $V^E(O_{ik})$, we then use a Euclidean distance function $D(\cdot)$ to compute their semantic distance, namely $D^E(O_{ik}, I_{ij}) = \lVert V^E(O_{ik}) - V^E(I_{ij}) \rVert$. For linear interpolation, the semantic distances of the different semantic spaces are combined as follows:

$D^{KTR} = \lambda_1 \lambda_2 D^K + \lambda_1 (1 - \lambda_2) D^T + (1 - \lambda_1) D^R$ (7)

where $D$ is short for $D(O_{ik}, I_{ij})$, and the superscripts $KTR$, $K$, $T$ and $R$ denote the semantic distance of the linear interpolation, word, word cluster and topic spaces, respectively. $\lambda_1$ and $\lambda_2$ are the interpolation parameters. If we want to ignore the topic distance $D^R$, we can set $\lambda_1$ to one.

4.2. Automatic word sense clustering for ambiguous words

In this section, we present the details of how to automatically determine the sense number of ambiguous words. We initialize the cluster number of ambiguous word $a_i$ to $q^*$ ($q_i \le q^* \le m_i$) and randomly assign one instance to each cluster. As an input instance $I_{ij}$ arrives, each sense cluster $O_{ik}$ evaluates its suitability for representing $I_{ij}$ according to an individual criterion (i.e. the Euclidean distance $D(O_{ik}, I_{ij})$ over the multi-granularity semantic spaces) and competes to be allocated to represent $I_{ij}$. In our method, not only is the winner (nearest) sense cluster rewarded, but its rival (2nd nearest) sense cluster is also penalized. The rewarding mechanism moves the position of the winner sense cluster towards $I_{ij}$, while the penalization mechanism repels the rival sense cluster away from the input instance $I_{ij}$ by a given step size. As the number of iterations increases, the redundant sense clusters are gradually repelled, and they are eliminated when they contain no instance. At the convergence stage, each remaining cluster is regarded as representing the centroid of a cluster of input instances.
In particular, given the input instance $I_{ij}$, the indexes of the winner sense cluster $k^*$ and the rival sense cluster $k'$ are defined as follows (Xu et al., 1993; Cheung, 2002; Cheung and Jia, 2013):

$k^* = \arg\min_{1 \le k \le q^*} \gamma_k D(O_{ik}, I_{ij}), \qquad k' = \arg\min_{k \ne k^*} \gamma_k D(O_{ik}, I_{ij})$ (8)

In Eq. (8), $\gamma_k = y_k / \sum_{k=1}^{q^*} y_k$ defines the relative winning rate of cluster $O_{ik}$, where $y_k$ is the number of times $O_{ik}$ has been the winning cluster in past iterations. $\gamma_k$ is used to solve the “dead cluster” problem encountered in competitive learning: clusters that have won the competition in past iterations have a reduced chance of winning again, providing an equal chance to the other clusters during the learning process. Subsequently, the centroids of the winner cluster $O_{ik^*}$ and the rival cluster $O_{ik'}$ are updated according to the following equation:

$V(O_{ik})^{(new)} = V(O_{ik})^{(old)} + \Delta_k, \qquad k = k^*, k'$ (9)

with

$\Delta_{k^*} = \varepsilon^* (V(I_{ij}) - V(O_{ik^*})), \quad \varepsilon^* = e^{-\beta} \rho(I_{ij}) D(O_{ik^*}, I_{ij}), \quad \rho(I_{ij}) = \dfrac{\tau}{\tau + \sum_{k=1}^{m_i} D(I_{ij}, I_{ik})}$ (10)

$\Delta_{k'} = -\varepsilon' (V(I_{ij}) - V(O_{ik'}))$ (11)

In Eq. (10), parameter $\varepsilon^*$ controls the learning rate of the winner cluster. Specifically, the factor $e^{-\beta}$ decays the update activity of the winner cluster, since the sense clustering tends to become stable as the number of iterations increases; $\beta$ is updated in each iteration. To reduce the interference of singular points (isolated noise points) in the data set, the function $\rho(I_{ij})$ denotes the density of the given instance $I_{ij}$, where $\tau$ is a constant. Note that the conventional RPCL method defines the learning rate $\varepsilon^*$ as a constant, while our method is instance sensitive, so that the winner cluster is relocated dynamically according to the different instances. In Eq. (11), $\varepsilon'$ is a small positive constant which controls the de-learning rate of the rival cluster.
Fig. 3 shows an example of automatic word sense clustering. Initially, nineteen instances positioned at the coordinates are shown in Fig. 3(a). Then, among the nineteen instances, we select seven at random and label them word sense cluster centers (blue dots). Next we apply the algorithm. In the first iteration (Fig. 3(b)), the algorithm identifies and eliminates three cluster centers (red dots) as redundant, leaving four valid cluster centers (blue dots). In the fourth iteration (Fig. 3(c)), two additional cluster centers (red dots) are identified as redundant and eliminated,
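Under stated assumptions (toy vectors, win counts already initialized to one), the distance interpolation of Eq. (7) and the winner/rival selection of Eq. (8) might look like the following sketch; all names are illustrative:

```python
import math

def euclidean(u, v):
    """Euclidean distance D(.) between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def interpolated_distance(dK, dT, dR, lam1, lam2):
    """Eq. (7): linear interpolation of the word, word-cluster and
    topic distances; lam1 = 1 discards the topic distance."""
    return lam1 * lam2 * dK + lam1 * (1 - lam2) * dT + (1 - lam1) * dR

def winner_and_rival(distances, wins):
    """Eq. (8): pick winner k* and rival k' by the win-rate-weighted
    distance gamma_k * D_k (the conscience term against dead clusters)."""
    total = sum(wins) or 1
    scored = sorted((wins[k] / total * d, k) for k, d in enumerate(distances))
    return scored[0][1], scored[1][1]
```

A cluster that has won often carries a larger $\gamma_k$, so a slightly farther but rarely winning cluster can still take the competition.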



Fig. 3. The illustration of learning process of automatic word sense clustering. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

leaving only two valid cluster centers (blue dots). Finally, after multiple iterations, the valid cluster centers (blue dots) converge to only two, redistributed as shown in Fig. 3(d). Therefore, we determine that, in our model, the appropriate sense number is two. Following the above analysis, the word sense clustering procedure is summarized in Algorithm 1.

Algorithm 1. Automatic word sense clustering for ambiguous instances.

Input: (1) Semantic representations of the ambiguous instances $\{V(I_{ij})\}_{j=1}^{m_i}$; (2) initialized sense cluster number $q^*$ ($q_i \le q^* \le m_i$).
Output: Sense label of each $I_{ij}$ and the appropriate sense number.
1: Select $q^*$ instances randomly, one for each sense cluster;
2: Set $\kappa = 0$; $\tau = 20$; $y_k = 0$ for $k = 1$ to $q^*$; complete = false;
3: while complete is false do
4:   $\kappa = \kappa + 0.01$;
5:   for $j = 1$ to $m_i$ do
6:     Retrieve the winner cluster $k^*$ and the rival cluster $k'$ based on Eq. (8);
7:     Let $y_{k^*}^{(new)} = y_{k^*}^{(old)} + 1$, and update $V(O_{ik^*})$ and $V(O_{ik'})$ based on Eq. (9);
8:   end for
9:   Redistribute the instances $\{I_{ij}\}_{j=1}^{m_i}$ to their nearest clusters and eliminate a redundant cluster $O_{ik}$ if $|O_{ik}| = 0$;
10:  if $\{O_{ik}\}^{(new)} = \{O_{ik}\}^{(old)}$ then complete = true; end if
11: end while
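A compact, runnable sketch of Algorithm 1. The instance-sensitive learning rate of Eq. (10) is simplified to small constants here, the stopping test is replaced by a fixed iteration budget, and all names are illustrative rather than the authors' implementation:

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rpcl_cluster(instances, q_init, lr=0.1, de_lr=0.02, max_iter=30, seed=0):
    """RPCL-style clustering that prunes empty clusters (cf. Algorithm 1).

    lr / de_lr stand in for eps* / eps' of Eqs. (10)-(11) as constants.
    Returns (labels, surviving cluster centers).
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(instances, q_init)]
    wins = [1] * len(centers)
    labels = [0] * len(instances)
    for _ in range(max_iter):
        for x in instances:
            total = sum(wins)
            scored = sorted((wins[k] / total * euclidean(c, x), k)
                            for k, c in enumerate(centers))
            kw = scored[0][1]                       # winner, Eq. (8)
            wins[kw] += 1
            for d in range(len(x)):                 # reward winner, Eq. (9)
                centers[kw][d] += lr * (x[d] - centers[kw][d])
            if len(scored) > 1:
                kr = scored[1][1]                   # rival
                for d in range(len(x)):             # repel rival, Eq. (11)
                    centers[kr][d] -= de_lr * (x[d] - centers[kr][d])
        # redistribute instances and drop empty clusters (step 9)
        labels = [min(range(len(centers)),
                      key=lambda k: euclidean(centers[k], x))
                  for x in instances]
        keep = sorted(set(labels))
        centers = [centers[k] for k in keep]
        wins = [wins[k] for k in keep]
        labels = [keep.index(l) for l in labels]
    return labels, centers
```

On well-separated data the repelled redundant centers lose all their instances and are dropped, so the number of surviving centers is the induced sense number.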

5. Experiments

We conducted a series of experiments on several public WSI datasets, derived from the SemEval-2010 (Manandhar et al., 2010), SemEval-

2007 (Agirre and Soroa, 2007a) and SemEval-2013 (Navigli and Vannella, 2013) tasks, to evaluate the effectiveness of our proposed WSI framework. We tested our proposed method using the evaluation scripts provided by the organizers.

5.1. Experimental preparation

Dataset: Our primary WSI evaluations are based on the dataset of the SemEval-2010 WSI task (Manandhar et al., 2010), as both the training and testing data are included explicitly, allowing the WSI models to be evaluated more realistically. It contains 100 target words, viz. 50 nouns and 50 verbs. For each target word, the training set contains a set of instances, each of which is a plain sentence without sense annotation. Detailed descriptions of the dataset are given in Table 2. The trial data (development set) contain 4 verbs, each of which consists of a training and a testing portion. These 4 verbs differ from the target words in the training set. There are about 138 instances on average for each verb in the training portion of the trial data.
Since the training data of the SemEval-2007 and SemEval-2013 tasks are not included as part of their original data, we constructed the training data for each target word by extracting its training instances from a corpus. For the SemEval-2007 task, we followed previous work (Brody and Lapata, 2009; Yao and Van Durme, 2011; Lau et al., 2012) in extracting the training instances from the British National Corpus (BNC).3 The test dataset contains 65 target nouns and 35 target verbs. The test instances were extracted from the Wall Street Journal (WSJ) corpus and hand-annotated with OntoNotes senses (Hovy et al., 2006). Following the HDP and UKP systems (Lau et al., 2013; Zorn and Gurevych, 2013) in the SemEval-2013 task, we also used Wikipedia4 as raw text to extract the training instances. The test data consist of 100 target words (all nouns), each with a list of the 64 top-ranking documents. The target words were selected according to the list of

3 http://www.natcorp.ox.ac.uk/.
4 The Wikipedia dump was retrieved from http://dumps.wikimedia.org/enwiki/.


Table 2. Data description of the SemEval-2010 WSI task.

Part-of-speech | Training set | Test set
Noun | 716,945 | 5285
Verb | 162,862 | 3630
All | 879,807 | 8915

ambiguous Wikipedia entries, whose lengths range between 1 and 4 words.
Evaluation metrics: The SemEval-2010 WSI task provides a supervised evaluation and an unsupervised evaluation. In the supervised evaluation, the gold standard dataset is split into two parts: a mapping set and an evaluation set. In particular, we use the supervised recall (SR) (test set split: 80% mapping, 20% evaluation) (Manandhar et al., 2010) in our supervised evaluation. For the unsupervised evaluation, we use the V-Measure (VM) (Rosenberg and Hirschberg, 2007), which combines two important indicators of cluster quality: homogeneity (H) and completeness (C). Homogeneity evaluates the degree to which the instances assigned to a resulting cluster actually belong to a single gold standard sense, whereas completeness evaluates the degree to which the instances of a gold standard sense are assigned to the same resulting cluster. The SemEval-2007 WSI task also includes supervised and unsupervised evaluations. Its supervised evaluation is similar to that of SemEval-2010, so we omit it for simplicity. The unsupervised evaluation compares the system output with the gold standard using standard clustering evaluation metrics (e.g. entropy) (Brody and Lapata, 2009). For the SemEval-2013 WSI task, four measures are used to evaluate clustering quality: Rand Index (Rand, 1971), Adjusted Rand Index (Hubert and Arabie, 1985), Jaccard Index (Jaccard, 1901), and F1 measure (Van Rijsbergen, 1979).
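The V-Measure used in the unsupervised evaluation can be computed from entropies, as in the following sketch of Rosenberg and Hirschberg's definition (function names are illustrative):

```python
import math
from collections import Counter

def v_measure(gold, pred):
    """V-Measure (Rosenberg and Hirschberg, 2007): harmonic mean of
    homogeneity h and completeness c, from conditional entropies."""
    n = len(gold)
    def entropy(labels):
        return -sum(cnt / n * math.log(cnt / n)
                    for cnt in Counter(labels).values())
    def cond_entropy(a, b):                       # H(a | b)
        joint = Counter(zip(a, b))
        b_counts = Counter(b)
        return -sum(cnt / n * math.log(cnt / b_counts[bl])
                    for (al, bl), cnt in joint.items())
    h_gold, h_pred = entropy(gold), entropy(pred)
    h = 1.0 if h_gold == 0 else 1 - cond_entropy(gold, pred) / h_gold
    c = 1.0 if h_pred == 0 else 1 - cond_entropy(pred, gold) / h_pred
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)
```

Splitting every instance into its own cluster yields perfect homogeneity but poor completeness, which is why over-clustering systems can still score non-trivially on VM.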

5.2. Experiment results

5.2.1. Evaluation of semantic representations
As the word, word cluster5 and topic model6 capture different aspects of the contexts of the ambiguous instances, we conducted comparisons among these three semantic models, each combined directly with our adapted rival penalized competitive learning, in order to examine which individual feature categories are the most informative. We further investigate the concatenation model and the linear interpolation model for integrating multi-granularity semantic information. Here, W, WC and T are short for word, word cluster and topic, respectively. The symbol “-” and Linear(·) denote concatenation and linear interpolation of different semantic representations, respectively. The results of the SR evaluation7 are given in Table 3, where bold numbers indicate the best performance. From these results, we find that the semantic representation derived from the word space is more favorable for noun sense induction, while the topic granularity tends to perform well for verb sense induction. Overall, the SR score of the word granularity outperforms the other two granularities in the “All” evaluation. Therefore, we believe that contextual word features are indispensable in WSI, and features with better generalization ability should be regarded as auxiliary semantic components in sense induction.

[5] We set the cluster number to 150.
[6] We use the default parameters of the HDP model to train the topic distribution.
[7] The values are the average of the results from mapping.1.key to mapping.5.key.


Table 3
SR evaluation on the SemEval-2010 dataset.

Representation        SR (%)
                      All      Noun     Verb
Word                  67.70    65.80    70.40
Word cluster          65.68    63.08    69.46
Topic                 64.04    58.44    72.18
W-WC-T                67.70    64.70    72.08
Linear(W,WC,T)        68.00    65.52    71.60
Linear(W,WC-T)        68.44    65.92    72.18

Furthermore, we find that the linear interpolation methods [8] perform better than concatenation in the "All" evaluation. Intuitively, different granularities encode different semantic attributes, and concatenating them into a single vector to compute semantic distances may not be as reasonable as the interpolation method. Compared with word alone, W-WC-T (67.70) and Linear(W,WC,T) (68.00) do not perform as well as expected. We hypothesize that the abstract semantic information is not exploited appropriately. Therefore, we first concatenate the word cluster features and topic features into a vector denoted WC-T, and then use linear interpolation to combine W and WC-T when computing the semantic distances among ambiguous instances. By doing so, the SR scores of Linear(W,WC-T) in "All", "Noun" and "Verb" all achieve the best performance. In addition, we investigate our proposed method under the unsupervised VM evaluation; the results are given in Table 4. The combination W-WC-T reaches the best performance of 16.8, followed closely by Linear(W,WC-T). As Pedersen (2010) points out, the VM evaluation tends to reward systems that induce more clusters than the gold standard. We hypothesize that W-WC-T scores 0.2 higher than Linear(W,WC-T) because of its larger induced cluster number. In practice, the measures in unsupervised evaluation tend to degrade in a way that correlates strongly with the number of senses a method induces (Manandhar et al., 2010; Lau et al., 2012). For the purposes of this work, we therefore restrict our main attention to the supervised evaluation in the following comparisons.
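As an illustration of how Linear(W,WC-T) can be applied, the distance between two ambiguous instances may be taken as a weighted sum of cosine distances in the word space and in the concatenated word-cluster/topic space, using the weights 0.8 and 0.2 reported in footnote 8. The vector layout here is our own simplification:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def linear_w_wct_distance(w1, w2, wct1, wct2, lam=0.8):
    # Linear(W, WC-T): interpolate the word-space distance with the distance
    # over the concatenated word-cluster + topic representation.
    return lam * cosine_distance(w1, w2) + (1 - lam) * cosine_distance(wct1, wct2)
```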

5.2.2. Evaluation of sense clustering

To evaluate the effectiveness of our adapted rival penalized competitive learning (ARPCL), we introduce four related clustering methods as baselines: standard k-means, classical competitive learning (CCL), frequency-sensitive competitive learning (FSCL) (Ahalt et al., 1990) and standard rival penalized competitive learning (RPCL) (Xu et al., 1993). To conduct a fair comparison, all compared clustering methods use the Linear(W,WC-T) model to compute the semantic distances among ambiguous instances over multi-granularity semantic spaces. Since the cluster number of k-means needs to be pre-assigned, we follow the same settings as the LDA-based method (Lau et al., 2012). Specifically, we set the number of word senses to 7 for each noun and 3 for each verb, based on the average number of senses in the test set. In Table 5, the performance of k-means in the SR evaluation is inferior to that of the competitive learning methods, which are able to determine the sense number automatically. In general, different words have different numbers of senses, so it is inappropriate for k-means to pre-assign a fixed sense number to each target word. Compared with the baselines CCL, FSCL and RPCL in competitive learning, our method

[8] The parameters are tuned on the trial data and the tuned values are reported here. Linear(W,WC,T): λ1 = 0.9, λ2 = 0.9; Linear(W,WC-T): the weights of W and WC-T are 0.8 and 0.2, respectively.



shows a significant improvement in the SR evaluation. We attribute this improvement to three aspects: (1) instead of pre-assigning a fixed sense number to each target word, our method learns the sense number automatically; (2) introducing the density function reduces the interference of isolated noise points; and (3) the learning

Table 4
VM evaluation on the SemEval-2010 dataset.

Representation        VM (%)
                      All     Noun    Verb    Cl
Word                  15.8    21.0    8.1     3.85
Word Cluster          14.8    19.3    8.2     5.91
Topic                 14.3    16.6    10.9    6.03
W-WC-T                16.8    21.4    10.1    4.66
Linear(W,WC,T)        16.2    21.0    9.1     4.07
Linear(W,WC-T)        16.6    21.6    9.4     3.92

rate of the winner cluster in our method is instance-sensitive, allowing the winner cluster to be located dynamically. We further investigate the ratio of epoch spans over different semantic spaces (Table 6) and the average number of induced senses for nouns and verbs (Fig. 4). As Table 6 shows, the maximum converged epoch of our proposed methods is less than 150, which is acceptable in practice. Considering word, word cluster and topic, we observe that the more abstract the semantic level, the fewer learning epochs it needs. Meanwhile, the interpolation methods need more learning epochs than the concatenation method. In Fig. 4, "GS" represents the gold standard of the test set. The models in pink tend to be more acceptable than those labeled in yellow. The average sense numbers of word cluster and topic are too large for both Noun and Verb. Viewed from this perspective, it

Table 5
Word sense clustering evaluation on the SemEval-2010 dataset.

Clustering      SR (%)
                All      Noun     Verb
k-means         64.46    62.54    67.28
CCL             65.88    63.56    69.28
FSCL            66.54    64.50    69.60
RPCL            67.88    65.10    71.92
ARPCL           68.44    65.92    72.18

Table 7
Comparison with previous works on the SemEval-2010 dataset.

WSI system                                      SR (%)
                                                All      Noun     Verb
Random                                          57.30    51.50    65.70
Most Frequent                                   58.70    53.20    66.60
Duluth-WSI (Pedersen, 2010)                     60.50    54.70    68.90
UoY (Korkontzelos and Manandhar, 2010)          62.40    59.40    66.80
NMFlib (Van de Cruys and Apidianaki, 2011)      62.60    57.30    70.20
NB (Charniak, 2013)                             65.40    62.60    69.50
Our method                                      68.44    65.92    72.18
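For reference, the core update of standard RPCL (Xu et al., 1993), the strongest competitive-learning baseline in Table 5, moves the winning center toward the input while pushing the second-place rival away with a much smaller de-learning rate, so that redundant centers drift out of the data region. This is a bare sketch of the standard rule only (frequency-sensitive conscience factors and our ARPCL adaptations are omitted); the learning rates are illustrative values of ours:

```python
def rpcl_step(x, centers, lr=0.05, de_lr=0.005):
    # Squared Euclidean distance from the input to every cluster center.
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    order = sorted(range(len(centers)), key=dists.__getitem__)
    winner, rival = order[0], order[1]
    # Winner moves toward the input; rival is repelled (de-learned).
    centers[winner] = [ci + lr * (xi - ci) for xi, ci in zip(x, centers[winner])]
    centers[rival] = [ci - de_lr * (xi - ci) for xi, ci in zip(x, centers[rival])]
    return centers
```

Iterating this step over the instances of a target word gradually drives superfluous centers away, which is the mechanism our method adapts to prune redundant sense clusters.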

Table 6
Ratio of epoch spans of our proposed models.

Representation        Ratio of epoch span (%)
                      [1, 30)   [30, 60)   [60, 90)   [90, 150)
Word                  23.68     36.47      23.63      16.23
Word Cluster          40.01     24.45      20.32      15.21
Topic                 82.01     13.45      4.54       0.00
W-WC-T                63.40     17.02      7.34       12.23
Linear(W,WC,T)        27.95     31.88      19.06      21.11
Linear(W,WC-T)        31.78     26.67      25.16      16.40

Table 8
Comparison with previous works on the SemEval-2007 dataset.

WSI system                                              F-score (%)
Most Frequent                                           80.9
UMND2 (Niu et al., 2007)                                84.5
I2R (Niu et al., 2007)                                  86.8
10w, 5w (BNC) (Brody and Lapata, 2009)                  85.5
HDP (Yao and Van Durme, 2011)                           85.7
HDP+position (tuned parameters) (Lau et al., 2012)      86.9
Our method                                              87.1

Fig. 4. The average number of induced senses of our word sense clustering method. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)



is not suitable to apply abstract semantic features independently in our sense clustering method. As for Linear(W,WC-T) and Linear(W,WC,T), they produce sense numbers similar to the GS. For example, their average sense numbers for Noun are 5.04 and 5.1, respectively, close to the GS number of 4.46.

Table 9
Comparison with previous works on the SemEval-2013 dataset.

WSI system                                 RI       ARI      JI       F1
Most Frequent                              39.90    0.00     39.90    54.42
UKP-WP-LLR2 (Zorn and Gurevych, 2013)      51.09    3.77     31.77    58.64
UKP-WP-PMI (Zorn and Gurevych, 2013)       50.50    3.63     29.32    60.48
UKP-WACKY-LLR (Zorn and Gurevych, 2013)    50.02    2.53     33.94    58.26
Duluth (Pedersen, 2013)                    52.18    5.75     31.79    46.90
HDP-Lemma (Lau et al., 2013)               65.22    21.31    33.02    68.30
HDP-NoLemma (Lau et al., 2013)             64.86    21.49    33.75    68.03
Our method                                 66.37    23.34    33.57    70.73

5.2.3. Comparison with previous works

In this section, we compare our best system against previous related methods on the SemEval-2010, SemEval-2007 and SemEval-2013 WSI tasks. To follow the setups of previous works (Brody and Lapata, 2009; Yao and Van Durme, 2011; Lau et al., 2012) as closely as possible, we used the supervised evaluation (80% mapping, 20% evaluation) on SemEval-2010 (Table 7) and the noun data of SemEval-2007 (Table 8). In Table 7, Random randomly assigns each test instance to one of four clusters; this baseline was run five times and the results were averaged. Most Frequent assigns the most frequent sense to the test instances. In addition, Duluth-WSI (Pedersen, 2010) and UoY (Korkontzelos and Manandhar, 2010) are the most competitive participating systems in the supervised evaluation of SemEval-2010. Considering these results, our method yields the best performance. It is worth mentioning that Lau et al. (2012) only report the WSD F-score (harmonic mean of precision and recall) of their method (HDP+position) on SemEval-2010, i.e. 68.00 in the "All" evaluation. To facilitate comparison with HDP+position, we also use the WSD F-score, and the results show our system obtains a score 0.44 higher.

Table 8 compares our method against previous works on the SemEval-2007 WSI task. UMND2 (Niu et al., 2007) and I2R (Niu et al., 2007) are the two best performing participating systems in SemEval-2007. To compare models under identical training settings, we construct our training data from the BNC. Based on these results, our method is significantly better than UMND2 and marginally outperforms the I2R system and HDP+position.

Finally, we compare our method with the participating systems UKP (Zorn and Gurevych, 2013), Duluth (Pedersen, 2013) and HDP (Lau et al., 2013) on the SemEval-2013 WSI task; the results are shown in Table 9, where RI, ARI, JI and F1 abbreviate Rand Index, Adjusted Rand Index, Jaccard Index and the F1 measure, respectively. According to these results, our method is significantly better than the participating systems in RI, ARI and F1. However, in JI, UKP-WACKY-LLR (33.94) outperforms our method (33.57). Note that UKP-WACKY-LLR uses extra resources in its training data, namely WaCky (Baroni et al., 2009) and a distributional thesaurus built from the Leipzig corpora (Biemann et al., 2007); we hypothesize that these resources contribute to its performance. In addition, the JI value of HDP-NoLemma is slightly better than ours, but the difference is not significant. Since the training set of our method is lemmatized, we mainly focus on the comparison with HDP-Lemma. Overall, the experiments show the effectiveness of our proposed framework on the WSI task.
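All four SemEval-2013 measures in Table 9 can be derived from pairwise co-assignment counts between the induced clustering and the gold standard. The following pair-counting sketch is our own illustration, not the official task scorer; for brevity it ignores degenerate partitions that would make a denominator zero:

```python
from itertools import combinations

def clustering_scores(gold, pred):
    # Count instance pairs: a = together in both partitions, b = together only
    # in gold, c = together only in pred, d = apart in both.
    a = b = c = d = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold, same_pred = gold[i] == gold[j], pred[i] == pred[j]
        if same_gold and same_pred:
            a += 1
        elif same_gold:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    ri = (a + d) / n                      # Rand Index
    ji = a / (a + b + c)                  # Jaccard Index
    p, r = a / (a + c), a / (a + b)
    f1 = 2 * p * r / (p + r)              # pairwise F1
    expected = (a + b) * (a + c) / n      # chance correction for ARI
    maximum = ((a + b) + (a + c)) / 2
    ari = (a - expected) / (maximum - expected)
    return ri, ari, ji, f1
```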

6. Conclusion and future work

We have presented a novel WSI framework that automatically induces word senses for ambiguous words over multi-granularity semantic spaces in an unsupervised fashion. Our method exploits word, word cluster and topic representations to integrate multi-granularity semantic information during context modeling. Instead of being pre-assigned a fixed sense number, our method induces the sense number automatically for each target word by gradually repelling redundant sense clusters. Our experiments were conducted on public datasets derived from the SemEval-2007, SemEval-2010 and SemEval-2013 WSI tasks. In these experiments, different semantic granularities and combination strategies were compared, and several related sense clustering methods were taken into account. The experimental results demonstrate the superiority and effectiveness of our proposed method. For future work, we plan to extend our algorithm by taking advantage of recursive neural networks (a type of deep learning architecture) to capture phrase-level vector representations of the given context, with the aim of improving sense clustering in the WSI task.

Acknowledgments

We would like to thank all the referees for their constructive and helpful suggestions on this paper. This work is supported by the Natural Science Foundation of China (Grant nos. 61005052, 61075058 and 61303082), the Key Technologies R&D Program of China (Grant no. 2012BAH14F03), the Fundamental Research Funds for the Central Universities (Grant no. 2010121068), the Natural Science Foundation of Fujian Province, China (Grant no. 2010J01351), the Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20120121120046) and the Ph.D. Programs Foundation of Ministry of Education of China (Grant no. 20130121110040).

References

Agirre, E., Soroa, A., 2007a. Semeval-2007 task 02: evaluating word sense induction and discrimination systems. In: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, pp. 7–12.
Agirre, E., Soroa, A., 2007b. Ubc-as: a graph based unsupervised system for induction and classification. In: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, pp. 346–349.
Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E., 1990. Competitive learning algorithms for vector quantization. Neural Netw. 3, 277–290.
Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E., 2009. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Lang. Resour. Eval. 43, 209–226.
Biemann, C., Heyer, G., Quasthoff, U., Richter, M., 2007. The Leipzig corpora collection: monolingual corpora of standard size. In: Proceedings of Corpus Linguistics 2007.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
Bordag, S., 2006. Word sense induction: triplet-based clustering and automatic evaluation. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics.



Brody, S., Lapata, M., 2009. Bayesian word sense induction. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 103–111.
Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C., 1992. Class-based n-gram models of natural language. Comput. Linguist. 18, 467–479.
Charniak, E., 2013. Naive Bayes word sense induction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1433–1437.
Chen, P., Ding, W., Bowes, C., Brown, D., 2009. A fully unsupervised word sense disambiguation method using dependency knowledge. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 28–36.
Cheung, Y.M., 2002. Rival penalization controlled competitive learning for data clustering with unknown cluster number. In: Proceedings of the 9th International Conference on Neural Information Processing, ICONIP'02, IEEE, pp. 467–471.
Cheung, Y.M., Jia, H., 2013. Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number. Pattern Recognit. 46, 2228–2238.
Van de Cruys, T., Apidianaki, M., 2011. Latent semantic word sense induction and disambiguation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1476–1485.
Dehkordi, M.Y., Boostani, R., Tahmasebi, M., 2009. A novel hybrid structure for clustering. Adv. Comput. Sci. Eng. 6, 888–891.
Di Marco, A., Navigli, R., 2013. Clustering and diversifying web search results with graph-based word sense induction. Comput. Linguist. 39, 709–754.
Elshamy, W., Caragea, D., Hsu, W.H., 2010. Ksu kdd: word sense induction by clustering in topic space. In: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 367–370.
Harris, Z.S., 1954. Distributional structure. Word 10, 146–162.
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R., 2006. Ontonotes: the 90% solution. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, Association for Computational Linguistics, pp. 57–60.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218.
Jaccard, P., 1901. Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz.
Klapaftis, I.P., Manandhar, S., 2008. Word sense induction using graphs of collocations. In: Proceedings of the 18th European Conference on Artificial Intelligence, pp. 298–302.
Klapaftis, I.P., Manandhar, S., 2010. Word sense induction and disambiguation using hierarchical random graphs. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 745–755.
Korkontzelos, I., Manandhar, S., 2010. Uoy: graphs of unambiguous vertices for word sense induction and disambiguation. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 355–358.
Lau, J.H., Cook, P., Baldwin, T., 2013. unimelb: topic modelling-based word sense induction for web snippet clustering. In: Proceedings of SemEval, pp. 217–221.
Lau, J.H., Cook, P., McCarthy, D., Newman, D., Baldwin, T., 2012. Word sense induction for novel sense detection. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 591–601.
Manandhar, S., Klapaftis, I.P., Dligach, D., Pradhan, S.S., 2010. Semeval-2010 task 14: word sense induction and disambiguation. In: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 63–68.
Mitchell, J., Lapata, M., 2008. Vector-based models of semantic composition. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pp. 236–244.
Navigli, R., Vannella, D., 2013. Semeval-2013 task 11: word sense induction and disambiguation within an end-user application. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), pp. 193–201.
Niu, Z.Y., Ji, D.H., Tan, C.L., 2007. I2r: three systems for word sense discrimination, Chinese word sense disambiguation, and English word sense disambiguation. In: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, pp. 177–182.
Pantel, P., 2003. Clustering by Committee (Ph.D. thesis). Department of Computing Science, University of Alberta, Canada.
Pedersen, T., 2007. Umnd2: Senseclusters applied to the sense induction task of senseval-4. In: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, pp. 394–397.
Pedersen, T., 2010. Duluth-wsi: Senseclusters applied to the sense induction task of semeval-2. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 363–366.
Pedersen, T., 2013. Duluth: word sense induction applied to web page clustering. Atlanta, Georgia, USA, p. 202.
Pinto, D., Rosso, P., Jimenez-Salazar, H., 2007. Upv-si: word sense induction using self term expansion. In: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, pp. 430–433.
Purandare, A., Pedersen, T., 2004. Word sense discrimination by clustering contexts in vector and similarity spaces. In: Proceedings of the Conference on Computational Natural Language Learning.
Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850.
Van Rijsbergen, C.J., 1979. Information Retrieval, second ed. Butterworth-Heinemann, Newton, MA, USA.
Rosenberg, A., Hirschberg, J., 2007. V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410–420.
Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523.
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M., 2006. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581.
Udani, G., Dave, S., Davis, A., Sibley, T., 2005. Noun sense induction using web search results. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 657–658.
Xu, L., Krzyzak, A., Oja, E., 1993. Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. Neural Netw. 4, 636–649.
Yao, X., Van Durme, B., 2011. Nonparametric Bayesian word sense induction. In: Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pp. 10–14.
Zorn, H.P., Gurevych, I., 2013. Ukp-wsi: UKP lab semeval-2013 task 11 system description. Atlanta, Georgia, USA, p. 212.
