Neurocomputing 171 (2016) 1475–1485


Adapted competitive learning on continuous semantic space for word sense induction

Yanzhou Huang a,b, Deyi Xiong c, Xiaodong Shi a,b,*, Yidong Chen a,b, ChangXing Wu a,b, Guimin Huang d

a Fujian Key Lab of the Brain-like Intelligent Systems, Xiamen University, Xiamen 361005, China
b Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China
c School of Computer Science and Technology, Soochow University, Suzhou 215006, China
d Research Center on Data Science and Social Computing, Guilin University of Electronic Technology, Guilin 541004, China

Article info

Abstract

Article history: Received 2 November 2014 Received in revised form 15 June 2015 Accepted 29 July 2015 Communicated by Y. Chang Available online 8 August 2015

Word sense induction (WSI) is important to many natural language processing tasks because word sense ambiguity is pervasive in linguistic expressions. The majority of existing WSI algorithms are not able to capture both lexical semantics and syntactic relations without involving excessive task-specific feature engineering. Moreover, it remains a challenge to design a sense clustering method that can determine the number of word senses for polysemous words automatically and properly. In this paper, we learn continuous semantic space representations for ambiguous instances via recursive context composition, allowing us to capture lexical semantics and syntactic relations simultaneously. Using the learned representations of ambiguous instances, we further adapt rival penalization competitive learning to conduct instance based word sense clustering, allowing us to determine the number of word senses automatically. We validate the effectiveness of our method on the SEMEVAL-2010 WSI dataset. Experimental results show that our method improves the quality of word sense clustering over several competitive baselines. © 2015 Elsevier B.V. All rights reserved.

Keywords: Natural language processing; Word sense induction; Continuous semantic space representation; Competitive learning

1. Introduction

Word sense induction (WSI) discriminates different word senses of a polysemous word without relying on a predefined sense inventory. In contrast, word sense disambiguation (WSD) is assumed to have access to an already known sense list [16]. From this perspective, WSI can be treated as a clustering problem while WSD is a classification one. WSI is crucial for many natural language processing (NLP) tasks, as word sense ambiguity is prevalent in all natural languages. In this paper, we focus on WSI. Before we introduce the objective of WSI, we differentiate between a polysemous word and its ambiguous instances. A polysemous word refers to a unique word with multiple senses, while an ambiguous instance is an occurrence of a polysemous word. Normally, the sense of an ambiguous instance can be determined by its contexts. The underlying assumption is that words occurring in similar contexts tend to have the same senses [15].

* Corresponding author at: Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, Fujian, China. Tel.: +86 18959288068. E-mail address: [email protected] (X. Shi).

http://dx.doi.org/10.1016/j.neucom.2015.07.090
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Formally, we denote¹ the dataset of a polysemous word $w_i$ as $\{s_{ij}\}_{j=1}^{m_i}$, which forms gold standard clusters $\{O_{ik}\}_{k=1}^{q_i}$ ($1 \le q_i \le m_i$), where $s_{ij}$ is a context of several sentences that contains a particular ambiguous instance $I_{ij}$. Therefore, given $\{s_{ij}\}_{j=1}^{m_i}$, there exists a set of ambiguous instances $\{I_{ij}\}_{j=1}^{m_i}$. Under these definitions, the objective of WSI is to automatically obtain $q_i$ from $\{s_{ij}\}_{j=1}^{m_i}$ and further assign $\{I_{ij}\}_{j=1}^{m_i}$ to their corresponding sense clusters $\{O_{ik}\}_{k=1}^{q_i}$, such that $O_{ix} \cap O_{iy} = \emptyset$ for $x \ne y$.

For the task of WSI, one issue that needs to be considered is how to effectively learn semantic representations of ambiguous instances via context modeling. Conventional methods [34,4,24,17] use discrete linguistic features to learn vector representations and further apply coarse-grained vector addition or multiplication to conduct context composition. These linguistic features are manually defined and heavily dependent on various labeled resources (e.g., part-of-speech taggers, dependency parsers), which constrains the application of these methods to under-resourced languages. In addition, simple vector addition or component-wise multiplication in context composition is not sufficient to capture complex linguistic knowledge and phenomena. More advanced Bayesian methods such as [5] use

¹ See all mathematical notations of this paper in Appendix A.


Latent Dirichlet Allocation (LDA) [3] to induce the sense distributions of ambiguous instances without resorting to labeled resources. However, LDA is trained under the bag-of-words (BoW) assumption, which ignores the syntactic structures of the contexts. Moreover, to the best of our knowledge, it is still an open problem to determine the number of word senses for polysemous words automatically and properly. Many typical clustering methods such as k-means require the number of clusters to be pre-assigned precisely, which is hard or even impossible for many practical problems. Otherwise, these clustering methods will in practice lead to poor performance [12]. More recently, the non-parametric Bayesian method of [19] uses Hierarchical Dirichlet Processes (HDP) [33] to induce the number of word senses automatically. However, it tends to generate almost twice the number of gold standard senses per polysemous word on the SEMEVAL-2010 WSI dataset [19]. Hence, it is desirable to explore a word sense clustering method which is able to determine the number of word senses for polysemous words automatically and properly.

In this paper, we address the two issues of WSI mentioned above: (1) learning better representations for ambiguous instances, and (2) automatically determining the number of word senses. To achieve this goal, we propose a novel WSI framework (Fig. 1) that automatically induces the word senses of ambiguous instances based on continuous semantic space representations. Specifically, our proposed WSI framework runs in two steps: (1) learning a continuous semantic space representation for each ambiguous instance, and (2) conducting word sense induction on the ambiguous instances without specifying the number of word senses. In the first step, we learn the representations of ambiguous instances via a recursive autoencoder (RAE) based method [31] in a bottom-up

and unsupervised manner (Section 3). In the second step, we identify different word sense clusters via a rival penalization competitive learning (RPCL) based method [9], which is capable of gradually eliminating redundant sense clusters (Section 4). Hence, for a polysemous word, the number of remaining sense clusters is considered the number of induced word senses, and the centroids of the remaining sense clusters are considered the representations of the induced word senses. The main contributions of our work lie in three aspects:

• We learn continuous semantic space representations for ambiguous instances without resorting to any external resources.
• In contrast to previous Bayesian methods which learn representations of ambiguous instances under the BoW assumption, we learn these representations via recursive context composition, allowing our method to capture lexical semantics and syntactic relations within the contexts simultaneously.
• Instead of being pre-assigned a fixed number of word senses, our framework can determine the number of word senses for polysemous words automatically.

The remainder of this paper is organized as follows: Section 2 summarizes and compares related work. Section 3 presents our method on how to learn a continuous semantic space representation for each ambiguous instance. Section 4 elaborates our method on how to conduct word sense clustering for polysemous words. Section 5 describes our experiments and shows results with discussions. Finally, Section 6 concludes and outlines future directions.

2. Related work

[Fig. 1 appears here.] Fig. 1. An illustration of the pipeline of our word sense induction framework. In step (1), the inputs of our framework contain four ambiguous instances of the polysemous word "ball", each of which is underlined accordingly. In step (2), to project the ambiguous instances into continuous semantic space representations, we separately use the recursive autoencoder (RAE) based method to conduct context composition for each ambiguous instance, represented by a two-dimensional vector. In step (3), to induce the word senses automatically, we use a rival penalization competitive learning (RPCL) based method to conduct word sense clustering. Doing so is able to learn that the polysemous word "ball" contains two different senses among the ambiguous instances, and these instances are assigned to their corresponding sense clusters, represented by blue and red, respectively.

In this section, we give an overview of previous work on WSI and highlight the differences from our framework. In addition, we briefly introduce recursive deep learning, as we use RAE in our WSI framework.

Word sense induction: We roughly divide WSI methods into two categories: linguistic feature based methods and Bayesian methods. Linguistic feature based methods exploit various linguistic knowledge in context modeling, such as first and second order context vectors [27], bigrams and triplets of words [27,34,4], collocations [17], and syntactic relations [8,11]. Based on these linguistic features, both local and global clustering algorithms are employed to conduct sense induction for polysemous words [11]. In general, developing a variety of linguistic features for WSI is important but labor-intensive. Bayesian methods have been explored in recent years as they can discover latent topic distributions from contexts without involving excessive feature engineering. Ref. [5] applies parametric LDA [3] to the WSI task. The contexts of ambiguous instances are regarded as pseudo documents, and the induced topic distributions are considered the sense distributions. Ref. [35] further uses non-parametric HDP [33] to learn the sense distributions. The advantage of this method is that it can obtain the number of word senses automatically for each polysemous word, in contrast to parametric LDA methods, which require the number of word senses to be assigned in advance. Experimental results reveal that the HDP model is superior to the LDA model on the SEMEVAL-2010 WSI dataset. Ref. [19] also shows an improvement in supervised F-score after incorporating position features into the HDP model. Ref. [7] extends the naive Bayes model based on the idea that the closer a word is to the target word, the more relevant it will be for WSI. According to the experimental results, the extended naive Bayes model is simple yet effective on the SEMEVAL-2010 WSI dataset.


Compared to the linguistic feature based methods, we learn the semantic representations of ambiguous instances without involving manually developed features. Compared to the Bayesian methods, our proposed method is not trained under the BoW assumption. Instead, we learn semantic features relating to both lexical semantics and syntactic relations. In addition, as discussed in the previous section, there is no need for our method to know the exact cluster number in advance, which is significantly different from traditional clustering algorithms and the parametric LDA methods for the WSI task.

Recursive deep learning: In recent years, many researchers have attempted to model words, phrases and even whole sentences with continuous semantic representations. Early work on using neural networks to learn phrase representations can be found in [13], which takes advantage of a recurrent neural network to develop compact forms of phrases. More recently, RAE based methods have been applied successfully in many NLP tasks [29,31,30,20,32]. Recursive deep learning can jointly learn continuous semantic space representations of words and hierarchical syntactic structures within the contexts in an unsupervised manner, and is capable of capturing complex linguistic phenomena during context modeling. Our recursive context composition is similar to the method of [31]. However, as our dataset is unlabeled, we only use the unsupervised RAE (minimizing the reconstruction errors of nonterminal nodes) to optimize the model parameters. Furthermore, instead of only using word representations to initialize the RAE model, we use the representations of words and word clusters simultaneously, which gives our RAE based method better generalization ability on the test data.

3. Learning semantic representations via recursive autoencoder based context composition

This section describes how we derive semantic representations of ambiguous instances with the aid of RAE based context composition. We divide our algorithm into two parts: (1) learning semantic representations for words and word clusters, and (2) learning semantic representations for the ambiguous instances via recursive context composition. Our method has better generalization ability in semantic representation and is able to capture both lexical semantics and syntactic relations in a continuous semantic space.

3.1. Learning semantic representations for words and word clusters

Learning semantic representations for words: Recently, various deep neural networks (DNN) [2,10,23] have been proposed to project a word into a continuous semantic space representation. The continuous semantic space representation of a word is a dense, low-dimensional and real-valued vector, called a word embedding. After learning with a DNN, a word embedding matrix $M \in \mathbb{R}^{n \times |V|}$ can be obtained, so that each word in the vocabulary $V$ corresponds to a column of $M$. In practice, we use the DNN toolkit Word2Vec [23] to learn the word embedding matrix $M$. Word2Vec [23] is an unsupervised algorithm for learning the meaning behind words under a continuous semantic space representation. Given this embedding matrix, the semantic representation of a word, assigned index $i$ in vocabulary $V$, can be retrieved by simply extracting the $i$th column of $M$.

Learning semantic representations for word clusters: To alleviate data sparsity issues, we propose a retraining-based method to learn semantic representations of word clusters. Concretely, we first use cosine similarity to construct a word similarity matrix $D \in \mathbb{R}^{|V| \times |V|}$ for all words in vocabulary $V$. For example, the value at position $(i, j)$ in matrix $D$ indicates the semantic similarity score between the $i$th word and the $j$th word in vocabulary $V$.
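As an illustration of this step, the following minimal sketch trains word embeddings and assembles a cosine similarity matrix D. The gensim package is used here purely as a stand-in for the Word2Vec toolkit of [23]; the settings mirror those reported in Section 5.1.2 (window 8, embedding size 100, CBOW, 25 negative examples).

import numpy as np
from gensim.models import Word2Vec

# Toy corpus of lemmatized sentences; in our setting these are the
# preprocessed contexts of a target word (see Section 5.1.2).
sentences = [["the", "bike", "lose", "long", "before"],
             ["her", "mother", "help", "her", "dress", "for", "the", "ball"]]

# gensim stands in for the Word2Vec toolkit of [23].
model = Word2Vec(sentences, vector_size=100, window=8, min_count=1,
                 sg=0, negative=25, sample=1e-4, epochs=5)

vocab = list(model.wv.index_to_key)                  # vocabulary V
M = np.stack([model.wv[w] for w in vocab], axis=1)   # embedding matrix, n x |V|

# Cosine similarity matrix D in R^{|V| x |V|}; D[i, j] is the similarity
# between the i-th and the j-th word of V.
M_unit = M / np.clip(np.linalg.norm(M, axis=0, keepdims=True), 1e-12, None)
D = M_unit.T @ M_unit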


Using the learned similarity matrix $D$, we then traverse all word pairs and retain only highly similar words, i.e. words that participate in at least one pair whose similarity score is larger than a given threshold. Given these selected similar words, an affinity propagation (AP) algorithm [14] is employed to learn their clusters. The AP algorithm [14] initially takes similarities of pairs of data points as inputs; real-valued messages are then propagated between data points until a set of representative exemplars and corresponding clusters are identified. The reason that we constrain the clustering to the highly similar words lies in two aspects. First, identifying word clusters for all words in the dataset is time-consuming in practice. Second, the AP algorithm can ensure high quality clusters for those highly similar words, as they form regions of high density and thus can be identified more easily. Intuitively, if the word clusters are not reliable, their semantic representations will be inaccurate and will further undermine the semantic composition of the contexts. Table 1 shows examples of the word clusters on the SEMEVAL-2010 WSI dataset, where WC is short for word cluster.

When learning the semantic representations for these identified word clusters, a straightforward way is to average the vectors of the words within their corresponding clusters; however, we consider a more advanced method, allowing our model to obtain the semantic representations of different word clusters automatically. First, we substitute words in clusters with their corresponding cluster labels in the dataset. For example, given the sentence "the material is a semiconductor called gallium manganese arsenide", we obtain the new sentence "the material is a semiconductor called WC1 manganese WC1" after cluster label substitution, as "gallium" and "arsenide" are both contained in WC1. Then we retrain the Word2Vec [23] toolkit on this new dataset with cluster labels, so that the semantic representations of word clusters are learned automatically. After this operation, we substitute the representations of those highly similar words by those of their corresponding word clusters in the subsequent recursive context composition.
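Continuing the sketch above, the clustering and retraining steps can be outlined as follows. scikit-learn's AffinityPropagation is used here as a stand-in for the AP implementation of [14], and the helper names (cluster_and_relabel, word2cluster) are illustrative only.

import numpy as np
from sklearn.cluster import AffinityPropagation
from gensim.models import Word2Vec

def cluster_and_relabel(D, vocab, sentences, threshold=0.9):
    """Cluster highly similar words with affinity propagation and retrain
    Word2Vec on a corpus in which those words are replaced by cluster labels.
    D, vocab and sentences are as in the previous sketch; threshold is the
    word similarity filtering threshold (0.9 in Section 5.1.2)."""
    off_diag = D - np.eye(len(vocab))
    candidates = [i for i in range(len(vocab)) if (off_diag[i] >= threshold).any()]
    if not candidates:
        return sentences, None
    # Affinity propagation on the restricted similarity matrix; the preference
    # is set to the median similarity, mirroring the "median" setting of the paper.
    sub = D[np.ix_(candidates, candidates)]
    ap = AffinityPropagation(affinity="precomputed", preference=np.median(sub))
    labels = ap.fit_predict(sub)
    word2cluster = {vocab[i]: "WC%d" % labels[k] for k, i in enumerate(candidates)}
    # Substitute words by their cluster labels and retrain on the new corpus,
    # so that cluster representations are learned directly.
    relabeled = [[word2cluster.get(w, w) for w in sent] for sent in sentences]
    retrained = Word2Vec(relabeled, vector_size=100, window=8, min_count=1)
    return relabeled, retrained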

3.2. Learning semantic representations for ambiguous instances

In this section, we describe how we use RAE based context composition to learn the semantic representations of ambiguous instances. Our method is inspired by [31]; however, as our dataset is fully unsupervised (i.e. it has no sense annotation), our objective function only uses the reconstruction errors of the nonterminal nodes in the recursive syntax tree to estimate the parameters. Furthermore, the leaf nodes of our recursive syntax tree contain two types of nodes, namely words and word clusters. Fig. 2 illustrates an example of our recursive syntax tree, where the root node models the lexical semantics and syntactic contexts simultaneously. A conventional BoW model only focuses on lexical semantics and suffers from intrinsic drawbacks. For example, the phrases "long before" and "before long" are believed to have the same meaning under a BoW model, as it disregards syntactic relations and even word order. In contrast, the context composition in our recursive syntax tree can differentiate the semantics of "long before" and "before long", as their syntactic relations are different. Furthermore, we use the word cluster "WC(bike)" to initialize the corresponding leaf node in order to alleviate data sparsity and generalize similar expressions (e.g., bike, bicycle).

Table 1. An example of the generated word clusters on the SEMEVAL-2010 WSI dataset.

WC1: {gallium, arsenide}
WC2: {ml/min, crcl, ml/s}
WC3: {butane, ethane, pentane}
WC4: {digi-key, vesa, heihachi, mishima, kazuya, zaibatsu}
WC5: {embodiment, cathode, anode, micropump, alu}


[Fig. 2 appears here.] Fig. 2. An example of our recursive context composition for the word sequence "The bike lost long before". The learning procedure of the context composition is conducted in a bottom-up and unsupervised manner. Parent nodes are computed iteratively until the root node is obtained. We denote the leaf nodes and nonterminal nodes with green and yellow dots, respectively. Here, we substitute the representation of "bike" (indicated with gray dots) by its corresponding cluster "WC(bike)". The root node learns the semantic representation of the word sequence, capable of capturing both lexical semantics and syntactic relations.

Recall that polysemous word $w_i$ consists of a dataset $\{s_{ij}\}_{j=1}^{m_i}$, and $s_{ij}$ has an associated ambiguous instance $I_{ij}$. We model the semantic representation of $I_{ij}$ according to its surrounding contexts within $s_{ij}$. For those highly similar words within the context window, we substitute them by their corresponding cluster labels. Hence, using the contexts as the input of our recursive syntax tree, the leaf nodes can be words and word clusters, each of which is an $n \times 1$ vector. In particular, for a parent (nonterminal) node $p$ in the tree and its two children nodes $c_1$ and $c_2$, the standard RAE calculates the vector of parent $p$ as follows [31]:

$p = f(W_1 [c_1; c_2] + b_1)$   (1)

where $f(\cdot)$ is a nonlinear activation function, for which we use the sigmoid function in our model. $[c_1; c_2]$ is a $2n \times 1$ vector, denoting the concatenation of $c_1$ and $c_2$. $W_1$ is an $n \times 2n$ matrix and $b_1$ is an $n \times 1$ bias vector. In order to evaluate how well parent $p$ maintains the semantic information of its two children nodes, we use $p$ to reconstruct $c_1$ and $c_2$ as follows:

$[c'_1; c'_2] = f(W_2 p + b_2)$   (2)

where $c'_1$ and $c'_2$ are the reconstructed nodes of $c_1$ and $c_2$, respectively. $W_2$ is a reconstruction matrix and $b_2$ is the bias vector for reconstruction. RAE computes parent nodes iteratively until the root node is obtained.

In particular, the parameters $\theta$ of our RAE based model are subdivided into the following two sets:

1. $\theta_{wc}$: embedding features for words and word clusters.
2. $\theta_{rec}$: feature matrices $W_1$ and $W_2$; bias vectors $b_1$ and $b_2$.

Then our objective function $J$ for learning the parameters is defined as follows:

$J = \underbrace{\frac{1}{m_i}\sum_{j} E_{rec}(s_{ij};\theta)}_{(a)} + \underbrace{\frac{\lambda_{wc}}{2}\|\theta_{wc}\|^2 + \frac{\lambda_{rec}}{2}\|\theta_{rec}\|^2}_{(b)}$   (3)

with

$E_{rec}(s_{ij};\theta) = \sum_{p \in non(s_{ij})} \frac{1}{2}\,\big\|[c_1; c_2] - [c'_1; c'_2]\big\|_{p}^{2}$   (4)

In Eq. (3), the first part (a) denotes the average of the reconstruction errors over $\{s_{ij}\}_{j=1}^{m_i}$. Given parameter $\theta$, the reconstruction error of $s_{ij}$ (Eq. (4)) is measured by the Euclidean distance over the set of nonterminal nodes $non(s_{ij})$. The second part (b) is the regularizer for preventing over-fitting in parameter estimation. Hyperparameters $\lambda_{wc}$ and $\lambda_{rec}$ govern the relative importance of the regularization terms $\|\theta_{wc}\|$ and $\|\theta_{rec}\|$, respectively.
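To make the composition procedure concrete, the following sketch applies Eqs. (1), (2) and (4) to a sequence of leaf vectors, merging adjacent nodes greedily by reconstruction error as in [31]. It assumes the parameters W1, b1, W2, b2 have already been estimated; it illustrates the forward composition only, not the training itself.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose(c1, c2, W1, b1, W2, b2):
    """Return the parent vector p and the reconstruction error at this node."""
    children = np.concatenate([c1, c2])          # [c1; c2], a 2n-dimensional vector
    p = sigmoid(W1 @ children + b1)              # Eq. (1)
    rec = sigmoid(W2 @ p + b2)                   # Eq. (2): [c1'; c2']
    err = 0.5 * np.sum((children - rec) ** 2)    # one term of Eq. (4)
    return p, err

def compose_context(leaves, W1, b1, W2, b2):
    """Greedily merge adjacent nodes until a single root vector remains."""
    nodes = list(leaves)                         # word / word-cluster vectors
    total_err = 0.0
    while len(nodes) > 1:
        # pick the adjacent pair whose parent reconstructs its children best
        cands = [compose(nodes[i], nodes[i + 1], W1, b1, W2, b2)
                 for i in range(len(nodes) - 1)]
        best = min(range(len(cands)), key=lambda i: cands[i][1])
        p, err = cands[best]
        nodes[best:best + 2] = [p]               # replace the pair by its parent
        total_err += err
    return nodes[0], total_err                   # root vector, E_rec contribution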

In practice, we use the already learned word and word cluster representations to initialize the inputs of RAE and employ an implementation of L-BFGS² to learn the optimal parameters over the dataset $\{s_{ij}\}_{j=1}^{m_i}$. With the RAE model, we can obtain a hierarchical syntax tree for a sentence. The learned syntax tree contains $len$ leaf nodes and $(len-1)$ nonterminal nodes, where $len$ denotes the length of the sentence. We extract the vectors of these $(2\,len-1)$ nodes to construct a node matrix $T \in \mathbb{R}^{n \times (2\,len-1)}$. In particular, the first $len$ vectors are representations of the leaf nodes, and the remaining $(len-1)$ vectors are representations of the nonterminal nodes, arranged in the merge order of the syntax tree. By doing so, we can quickly extract different nodes of the learned tree by a simple multiplication with a binary vector $e$. For example, if we want to retrieve the root node only, we can set all dimensions of the binary vector $e$ to zero except the last one.

Following the implementation discussed above, we are able to project each ambiguous instance into a dense and real-valued vector which models both lexical semantics and syntactic relations within the contexts simultaneously. In the next section, we use these vectors as inputs to conduct instance based sense clustering.

² http://nlp.stanford.edu/

4. Automatic word sense clustering for ambiguous instances

In this section, we elaborate how we adapt the RPCL based method [9] to perform instance based sense clustering. RPCL is an iterative clustering algorithm which consists of a competition mechanism and a penalization mechanism. It is able to determine the number of clusters automatically during the learning process. Motivated by this strength, we adapt the RPCL based method to our sense clustering task using continuous semantic space representations.

Recall that polysemous word $w_i$ contains a set of ambiguous instances $\{I_{ij}\}_{j=1}^{m_i}$ derived from gold standard sense clusters $\{O_{ik}\}_{k=1}^{q_i}$. In our WSI framework, we initialize the cluster number of $w_i$ to $q_n$ ($q_i \le q_n \le m_i$), and our main task is to eliminate the redundant sense clusters. The basic idea of our algorithm is that, given an instance $I_{ij}$, not only is the weight of its winner (nearest) sense cluster rewarded (competition mechanism), but the weight of its rival (second nearest) sense cluster is also penalized (penalization mechanism). The weights measure the validity of their corresponding clusters and are initialized equally. During the learning process, if the weight of a cluster is low, the number of instances assigned to it decreases, so that this cluster is eventually eliminated. Specifically, the index $k^*$ of the winner sense cluster and the index $k'$ of the rival sense cluster are defined as follows [9]:

$k^* = \arg\min_{1 \le k \le q_n} \gamma_k \big(1 - \lambda_k\, sim(O_{ik}, I_{ij})\big), \qquad k' = \arg\min_{k \ne k^*} \gamma_k \big(1 - \lambda_k\, sim(O_{ik}, I_{ij})\big)$   (5)

with

$sim(O_{ik}, I_{ij}) = \frac{\exp(-0.5\,\Delta_{kj})}{\sum_{k} \exp(-0.5\,\Delta_{kj})}, \qquad \Delta_{kj} = \big\|vec(O_{ik}) - vec(I_{ij})\big\|^2, \qquad vec(O_{ik}) = \frac{1}{|O_{ik}|}\sum_{I_{ij} \in O_{ik}} vec(I_{ij})$   (6)

In Eq. (5), $\lambda_k$ is the weight of cluster $O_{ik}$, and $\gamma_k = \frac{y_k}{\sum_k y_k}$ is the relative winning rate of cluster $O_{ik}$, where $y_k$ is the number of times that $O_{ik}$ has been the winner cluster in past iterations. $\gamma_k$ is used to solve the "dead unit" problem encountered by competitive learning: clusters that have won the competition in past iterations have a reduced chance of winning again, which gives the other clusters an equal chance during the learning process. In Eq. (6), $sim(O_{ik}, I_{ij})$ calculates the semantic similarity between cluster $O_{ik}$ and instance $I_{ij}$, where $vec(O_{ik})$ and $vec(I_{ij})$ are the centroid of cluster $O_{ik}$ and the vector representation of instance $I_{ij}$, respectively. Different from the method of [9], our semantic similarity is computed in the continuous semantic space. Here, $vec(I_{ij})$ is obtained via the recursive context composition discussed in Section 3, and $vec(O_{ik})$ is obtained by averaging the vectors of all instances within cluster $O_{ik}$.

Based on the selected indices $k^*$ and $k'$, the weight $\lambda_{k^*}$ of winner sense cluster $O_{ik^*}$ and the weight $\lambda_{k'}$ of rival sense cluster $O_{ik'}$ are updated as follows:

$\lambda_{k^*}^{(new)} = \lambda_{k^*}^{(old)} + \eta\, sim(O_{ik^*}, I_{ij}), \qquad \lambda_{k'}^{(new)} = \max\big(0,\ \lambda_{k'}^{(old)} - \eta \exp(-0.5\,\Delta_{k'j})\big)$   (7)

where the parameter $\eta$ ($0 < \eta \le 1$) is the learning rate of the cluster weights. In Eq. (7), the winner-rewarding strength for weight $\lambda_{k^*}$ increases with the value of $sim(O_{ik^*}, I_{ij})$: the more similar $O_{ik^*}$ and $I_{ij}$ are, the more likely it is that $I_{ij}$ belongs to $O_{ik^*}$. It is worth mentioning that the winner-rewarding mechanism in our method differs from the one in [9], which rewards the weight of the winner cluster by adding a small positive constant. Our winner-rewarding mechanism updates weight $\lambda_{k^*}$ according to the similarity function $sim(O_{ik}, I_{ij})$. In other words, our updating mechanism is similarity sensitive and more adaptive with respect to different instances. As for weight $\lambda_{k'}$, we increase the rival-penalizing strength if the rival sense cluster $O_{ik'}$ is too close to instance $I_{ij}$, so that $O_{ik'}$ is eliminated quickly, because $I_{ij}$ has already chosen the winner cluster $O_{ik^*}$. The function $\max(\cdot)$ ensures that all cluster weights remain nonnegative.

Algorithm 1. Context-based sense induction via rival penalization competitive learning.

Input: 1. Semantic representations of ambiguous instances $\{vec(I_{ij})\}_{j=1}^{m_i}$; 2. Initialized sense cluster number $q_n$ ($q_n = 20$ for nouns, $q_n = 10$ for verbs);
Output: Sense label of each $I_{ij}$ and the induced sense number;
1: Select $q_n$ initial instances, one for each sense cluster; set $\eta = 0.01$, complete = false, $y_k = 0$ and $\lambda_k = 1$, for $k = 1$ to $q_n$;
2: while complete = false do
3:   for $j = 1$ to $m_i$ do
4:     Obtain the index of the winner cluster $k^*$ and the rival cluster $k'$ based on Eq. (5);
5:     Let $y_{k^*}^{(new)} = y_{k^*}^{(old)} + 1$, and update $\lambda_{k^*}$ and $\lambda_{k'}$ based on Eq. (7);
6:     Assign instance $I_{ij}$ to sense cluster $O_{ik^*}$;
7:   end for
8:   for $k = 1$ to $q_n$ do
9:     if size($O_{ik}$) ≠ 0 then
10:      Update $vec(O_{ik})$ based on Eq. (6);
11:    end if
12:  end for
13:  if $\{O_{ik}\}^{(new)} = \{O_{ik}\}^{(old)}$ then
14:    complete = true;
15:  end if
16: end while

The main steps of our instance based sense clustering are summarized in Algorithm 1. This enables us to obtain the sense number of a polysemous word automatically, and the ambiguous instances assigned to the same cluster are considered to have the same sense.
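A compact sketch of Algorithm 1, written directly from Eqs. (5)-(7), is given below. It is an illustration rather than the exact implementation used in our experiments; it assumes the instance vectors have already been produced by the RAE composition of Section 3. Clusters whose weights are driven towards zero attract no instances and are thereby eliminated, so the number of distinct surviving clusters is the induced sense number.

import numpy as np

def arpcl(instances, q_n, eta=0.01, max_iter=100):
    """instances: (m_i x n) array of vec(I_ij); q_n: initial cluster number
    (20 for nouns, 10 for verbs in the paper)."""
    m = len(instances)
    centroids = instances[np.random.choice(m, q_n, replace=False)].copy()
    lam = np.ones(q_n)                 # cluster weights lambda_k
    y = np.zeros(q_n)                  # win counts y_k
    assign = -np.ones(m, dtype=int)

    for _ in range(max_iter):
        old_assign = assign.copy()
        for j, x in enumerate(instances):
            delta = np.sum((centroids - x) ** 2, axis=1)              # Delta_kj
            sim = np.exp(-0.5 * delta) / np.exp(-0.5 * delta).sum()
            gamma = y / y.sum() if y.sum() > 0 else np.full(q_n, 1.0 / q_n)
            score = gamma * (1.0 - lam * sim)
            k_star = int(np.argmin(score))                            # winner, Eq. (5)
            score[k_star] = np.inf
            k_rival = int(np.argmin(score))                           # rival, Eq. (5)
            y[k_star] += 1
            lam[k_star] += eta * sim[k_star]                          # Eq. (7)
            lam[k_rival] = max(0.0, lam[k_rival] - eta * np.exp(-0.5 * delta[k_rival]))
            assign[j] = k_star
        for k in range(q_n):                                          # Eq. (6)
            members = instances[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
        if np.array_equal(assign, old_assign):
            break

    return assign, len(np.unique(assign))   # labels, induced sense number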

5. Experiments

We conducted a series of experiments on the dataset of the SEMEVAL-2010 WSI task [22] to evaluate the effectiveness of our proposed WSI framework. We tested our proposed method via the evaluation scripts provided by the organizers. From these evaluations, we would like to investigate the following issues.

1. Is the RAE based context composition superior to a simple vector average of the contexts in our WSI framework?
2. Does our proposed retraining-based method for highly similar words play a positive role in our WSI framework? If so, what are the reasons for the improvements?
3. Can the adapted word sense clustering algorithm in our WSI framework induce appropriate sense numbers for the polysemous words?

5.1. Experimental setup

5.1.1. Dataset

We evaluated our method on the dataset of the SEMEVAL-2010 WSI task [22], which contains 100 target words, viz. 50 nouns and 50 verbs. The training sets of the target words are provided separately, and for each target word its training set contains a set of instances, each of which is associated with plain sentences without sense annotation. Detailed descriptions of the dataset are given in Table 2. As for the trial data (development set), it contains 4 verbs, each of which consists of a training and a testing portion. These 4 verbs differ from the target words in the training set. There are about 138 instances on average for each verb in the training portion of the trial data.

5.1.2. Detailed implementation

We performed lemmatization for all words in the sentences to reduce data sparsity. For example, the inflected forms "coming" and "came" are derived from the same lemma "come"; thus we substituted these two words by "come" in practice. Moreover, we substituted all numbers (e.g. 12, 15) in the sentences by a unique label "[num]", and words occurring fewer than five times were not considered. To conduct a fair comparison with the previous implementation [19], we trained the models for each target word independently.

Table 2. Data description of the SEMEVAL-2010 WSI task.

Part-of-speech   Training set   Test set
Noun             716,945        5285
Verb             162,862        3630
All              879,807        8915


In practice, we use the DNN toolkit Word2Vec [23] to train the word embeddings. We set the context window size to 8, the low frequency cut-off threshold to 5, the number of training iterations to 5, the size of the word embeddings to 100 (a trade-off between computational cost and expressive power), the sub-sampling threshold to $10^{-4}$, and the number of negative examples to 25, and we use the continuous bag-of-words model. Given the word embeddings, we further apply the AP algorithm [14] to obtain word clusters. We conservatively set the word similarity filtering threshold to 0.9 in order to obtain word clusters of high quality. The impact of this similarity threshold is further discussed in Section 5.2. We set the preference parameter of the AP algorithm to "median", which forces the algorithm to treat all inputs equally without prior biases. Using the semantic representations of words and word clusters as inputs, we can learn the semantic representations of the ambiguous instances via our RAE based context composition. In our RAE based model, we set the hyperparameters $\lambda_{wc}$ and $\lambda_{rec}$ to $10^{-4}$ and $10^{-5}$, respectively. Tokens in the context window $c = (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2})$ are used during context modeling, where $x_i$ is the current target word. Different from previous methods, our method does not tune the number of word senses, as it is determined automatically by gradually eliminating the redundant sense clusters.

It is worth mentioning that certain word semantic correlations are not stable because the word representations do not fully converge on a small dataset. Word2Vec [23] updates the word representations within the context window until it has traversed the whole dataset a pre-assigned number of times (we used the default value). Given a dataset of small size, the word representations will not fully converge because of the insufficient number of updates. To obtain better convergence, one option is to include an extra monolingual dataset to train the word representations; however, this may raise the concern that our improvement is due to using extra data rather than to the effectiveness of our proposed model when comparing with previous systems. Instead, we used a random function with a discrete uniform distribution to sample the sentences in the dataset with replacement. The sampling was repeated until the new dataset was almost five times the size of the original one. The sampling does not change the underlying distributions of the word representations because no new context is added at training time; however, it increases the number of updates, so that the word representations can converge better to the underlying distributions.

5.1.3. Baselines

To investigate the three issues mentioned above, we introduce the following baselines. For clarity, AVE denotes a simple average of the semantic vectors of the words within the context window to represent the test instances, and RAE0 denotes using only word vectors as inputs to conduct RAE based context composition. If we use both the representations of words and word clusters to initialize the RAE based model, then we have RAE1. In addition, ARPCL denotes our adapted RPCL algorithm, and KM denotes the standard k-means algorithm.

Random: Randomly assigned each test instance to one of four clusters. This baseline was run five times and the results were averaged.
Most frequent: Selected the most frequent sense for all test instances.
Baseline1: AVE-ARPCL: This baseline used the AVE model (under the BoW assumption) to conduct context composition, and then applied our ARPCL algorithm to conduct word sense clustering.
Baseline2: RAE0-ARPCL: Compared with AVE-ARPCL, this baseline used the RAE0 model to conduct recursive context composition.

Baseline3: RAE1-RPCL: Based on the RAE1 model, this baseline performed sense induction via the RPCL method, which uses a constant rewarding mechanism.
Baseline4: RAE1-KM: Different from RAE1-RPCL, this baseline uses the KM algorithm to conduct sense induction. As the cluster number of KM has to be assigned in advance, we set it to 7 for each noun and 3 for each verb, based on the average numbers of gold standard senses in the test set.

5.1.4. Evaluation metrics

In this section, we give a brief introduction to the evaluation metrics. The SemEval-2010 WSI task provides a supervised evaluation and an unsupervised evaluation. In the supervised evaluation, the gold standard dataset is split into two parts: a mapping set and an evaluation set. There are two different split schemes (80% mapping, 20% evaluation and 60% mapping, 40% evaluation). Based on the mapping and evaluation sets, the performance of the systems is evaluated by Supervised Recall (SR). The unsupervised evaluation mainly compares the outputs of the systems by using the V-Measure (VM) [28] and the Paired F-score (Paired-FS) [1]. There are two important indicators in VM that determine the effectiveness of the resulting clusters: homogeneity (H) and completeness (C). Homogeneity evaluates the degree to which the instances assigned to a resulting cluster actually belong to a single gold standard sense, whereas completeness evaluates the degree to which the instances of a gold standard sense are assigned to the same resulting cluster. The definition of VM is given in the following equation [28]:

$VM = \frac{2 \cdot H \cdot C}{H + C}$   (8)

In the Paired-FS [1] evaluation, the clustering problem is transformed into a classification problem [22]. The comparison is conducted between the pairs of the gold standard clusters and the pairs of the resulting clusters. Let F(O) and F(GS) be the instance pairs derived from the resulting clusters and the gold standard clusters, respectively. We define precision (P) as the ratio of the intersection between F(O) and F(GS) to the number of instance pairs in F(O) (Eq. (9)), and recall (R) as the ratio of the intersection between F(O) and F(GS) to the number of instance pairs in F(GS) (Eq. (10)). Finally, Paired-FS is the aggregate indicator of precision and recall, defined in Eq. (11) [22]:

$P = \frac{|F(O) \cap F(GS)|}{|F(O)|}$   (9)

$R = \frac{|F(O) \cap F(GS)|}{|F(GS)|}$   (10)

$Paired\text{-}FS = \frac{2 \cdot P \cdot R}{P + R}$   (11)
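These definitions translate directly into code. The sketch below computes the V-Measure with scikit-learn (which derives Eq. (8) from homogeneity and completeness) and the Paired-FS of Eqs. (9)-(11) from the instance pairs of two clusterings; the toy labels at the end are illustrative only.

from itertools import combinations
from sklearn.metrics import v_measure_score

def pairs(labels):
    """Set of unordered instance-index pairs that share a cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def paired_fscore(gold, pred):
    f_gs, f_o = pairs(gold), pairs(pred)
    if not f_o or not f_gs:
        return 0.0
    p = len(f_o & f_gs) / len(f_o)       # Eq. (9)
    r = len(f_o & f_gs) / len(f_gs)      # Eq. (10)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0   # Eq. (11)

gold = [0, 0, 1, 1]                      # gold senses of four instances (toy example)
pred = [0, 0, 1, 0]                      # induced senses
print(v_measure_score(gold, pred), paired_fscore(gold, pred))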

5.2. Experimental results

We conducted both the supervised and the unsupervised evaluation of the SEMEVAL-2010 WSI task and compared our RAE1-ARPCL method to the baselines mentioned above. Bold numbers indicate the best performance.

5.2.1. Supervised evaluation: supervised recall

In Table 3, we can see that baseline RAE0-ARPCL is superior to baseline AVE-ARPCL in the ALL evaluation, both in the 80–20 split (67.92 vs. 67.14) and the 60–40 split (66.82 vs. 65.74). These improvements suggest that the syntactic structures within the contexts contribute to our semantic representations during context modeling. We conclude that the RAE0 model is more favorable than the simple vector average method, as the RAE0 model can capture more complex linguistic knowledge and phenomena from the contexts.


Furthermore, our proposed method RAE1-ARPCL outperforms the RAE0-ARPCL method in both the NOUN and VERB evaluations. This shows that our retraining-based method is effective when the similarity threshold equals 0.9. We attribute the improvements to two aspects: (1) using the representations of the word clusters enables our model to generalize better to the test data; (2) the new dataset with cluster labels suffers less from data sparsity than the original one, so that the unsubstituted words can learn more accurate semantic representations during model training. To better understand the learned representations of ambiguous instances, we select the target words "bow (verb)" and "entry (noun)" from the test data and show their visualization in Fig. 3, where "num" represents the number of word senses induced by our ARPCL method, and $2D(I_{ij})$ represents the two-dimensional vector of ambiguous instance $I_{ij}$ obtained with t-SNE [21]. Typically, clustering aims at between-cluster separation and within-cluster homogeneity (low distances). We can see that ambiguous instances belonging to different clusters are separated more clearly by RAE1 than by AVE, especially for the verb "bow". Furthermore, the ambiguous instances within one cluster tend to have lower distances under RAE1 than under AVE. The number of word senses induced for "bow" and "entry" by our RAE1-ARPCL method is 5 and 6, respectively, which is consistent with the number of gold standard senses.

Table 3. Supervised Recall on the SemEval-2010 WSI dataset.

WSI system                   80–20 split (%)            60–40 split (%)
                             NOUN   VERB   ALL          NOUN   VERB   ALL
Random                       51.50  65.70  57.30        50.20  65.70  56.50
Most frequent                53.20  66.60  58.70        52.50  66.70  58.30
Baseline1: AVE-ARPCL         64.06  71.68  67.14        62.42  70.52  65.74
Baseline2: RAE0-ARPCL        64.98  72.16  67.92        63.94  71.00  66.82
RAE1-ARPCL (our method)      66.18  72.20  68.62        64.16  72.10  67.40
Baseline3: RAE1-RPCL         64.42  71.86  67.48        62.94  71.76  66.54
Baseline4: RAE1-KM           65.06  72.06  67.92        62.68  71.94  66.42

Note that the improvement for NOUN is more obvious than that for VERB when comparing RAE1-ARPCL with AVE-ARPCL. Considering the test instances, the average number of gold standard senses for NOUN (4.46) is larger than that for VERB (3.12) [7]. Besides, the total number of test instances for NOUN (5285) is also larger than that for VERB (3630). In this respect, discriminating ambiguous instances for NOUN is on average more difficult than for VERB. In other words, discriminating ambiguous instances for NOUN requires integrating more linguistic knowledge. Hence, compared with VERB, the extra syntactic relations in our RAE1 model provide a greater contribution to the improvement for NOUN.

We further conduct a comparison among the RAE1-ARPCL, RAE1-RPCL and RAE1-KM methods. We can see that our RAE1-ARPCL method outperforms the two baselines. RPCL uses a constant to control the winner-rewarding mechanism, so that the semantic similarity between an instance and its winner cluster is ignored when updating the weights. In this respect, our similarity-sensitive rewarding mechanism is more favorable than the constant rewarding mechanism. Compared with KM, the advantage of our ARPCL lies in its ability to induce the sense number of the target words automatically. In addition, we can see that KM is slightly better than RPCL in the 80–20 split scheme. In fact, the pre-assigned sense number of KM in our experiment was set on the prerequisite that we know the gold standard, which is not available in a realistic setting. Without this prerequisite, KM will inevitably lead to poor performance.

5.2.2. Unsupervised evaluation: V-measure and paired F-score

We now move on to the unsupervised evaluation; the results are shown in Table 4. To our surprise, the VM performance of baseline AVE-ARPCL is better than that of our method RAE1-ARPCL in the ALL evaluation (18.6 vs. 18.2). As Pedersen [26] points out, the VM evaluation tends to reward systems which produce a larger cluster number than the gold standard. This is reflected in our experimental results as well. The induced sense numbers of AVE-ARPCL (5.12) and our method RAE1-ARPCL (4.30) are both larger than the gold standard (3.79). Although our method obtains a value closer to the gold standard, the VM evaluation still rewards the AVE-ARPCL method, as its induced number is larger than ours.

[Fig. 3 appears here.] Fig. 3. Learned embeddings of ambiguous instances of the verb "bow" and the noun "entry" as two-dimensional vectors.


Compared to the VM evaluation, it is possible to obtain a high Paired-FS by clustering all ambiguous instances into one sense, as shown by Most Frequent [7]. Most Frequent performs poorly in the VM evaluation, while in the Paired-FS evaluation it achieves the best performance in our experiments. Using VM and Paired-FS to evaluate word sense clustering correlates strongly with the number of induced word senses [19]. Hence, we mainly focus on the supervised evaluation in our experimental results.

5.2.3. Comparison with previous works in supervised evaluation

Many previous works tested their models mainly on the SEMEVAL-2010 WSI dataset, and thus we followed their split scheme (80% mapping, 20% evaluation) in the supervised evaluation. In Table 5, UoY [18] is the best performing system in the SEMEVAL-2010 WSI competition. NMFlib [11] uses nonnegative matrix factorization to factor a matrix and then conducts word sense clustering on the test dataset. HDP+position [19] achieves a competitive result in WSI by introducing position features into the HDP model. NB extends the naive Bayes model and re-estimates the conditional probability of a context word given the sense based on its distance to the target word. Notice that HDP+position [19] only reports the F-score in its paper, and thus we use the same evaluation for that comparison. From the experimental results, we can see that our method outperforms all compared systems.

5.2.4. Comparison of the number of induced word senses

Regarding the number of induced word senses, we conduct comprehensive comparisons among KM, RPCL and ARPCL in Table 6. Notation "0" denotes that the induced sense number perfectly matches the gold standard. Notation "±1" denotes that the induced number is one sense larger or smaller than the gold standard, and analogously for "±2". From Table 6, we see that 25% of the target words are predicted correctly by our ARPCL model, more than with KM (19%) and RPCL (17%). Generally, the gold standard for a target word often varies across different annotators [25]. Hence, for a sense clustering algorithm, if the deviation of its induced number from the gold standard is relatively small, we consider the result a reasonably acceptable solution. If we consider the deviation interval [-1, +1], 59% of the target words satisfy this

Table 4. Unsupervised evaluation on the SemEval-2010 WSI dataset.

WSI system                   #C     VM (%)                  Paired-FS (%)
                                    NOUN   VERB   ALL       NOUN   VERB   ALL
Random                       4.0    4.2    4.6    4.4       30.4   34.1   31.9
Most Frequent                1.0    0.0    0.0    0.0       57.0   72.7   63.5
Baseline1: AVE-ARPCL         5.12   22.6   12.7   18.6      34.40  49.30  40.50
Baseline2: RAE0-ARPCL        4.48   22.5   12.5   18.4      38.23  47.59  42.04
RAE1-ARPCL (our method)      4.30   22.9   11.5   18.2      38.40  49.30  42.80
Baseline3: RAE1-RPCL         4.02   22.5   8.9    16.9      40.01  62.49  49.16
Baseline4: RAE1-KM           5.0    23.2   11.1   18.3      34.56  48.78  40.34

Table 5. Comparison with previous works in supervised evaluation over the SemEval-2010 WSI dataset.

WSI system                       80–20 split (%)
                                 NOUN   VERB   ALL
SR
UoY (2010)                       59.40  66.80  62.40
NMFlib (2011)                    57.30  70.20  62.60
NB (2013)                        62.60  69.50  65.40
RAE1-ARPCL (our method)          66.18  72.20  68.62
F-score
HDP+position (2012)              65.00  72.00  68.00
RAE1-ARPCL (our method)          66.34  72.20  68.75

Table 6. Differences between the induced sense numbers and the gold standard for KM, RPCL and ARPCL.

Difference to gold standard   Match rate
                              KM (%)   RPCL (%)   ARPCL (%)
0                             19       17         25
±1                            49       35         34
±2                            17       18         21
Others                        15       30         20

[Fig. 4 appears here.] Fig. 4. Performance of RAE0-ARPCL and RAE1-ARPCL in the supervised recall evaluation.

requirement in our model. Note that 68% of the target words under KM fall into the interval [-1, +1], which is more than for our method. As pointed out above, the sense number of KM was set assuming access to the gold standard, which contributes enormously to the boost of KM. Without this assumption, we believe that the performance of KM would degrade dramatically. If we loosen the constraint to [-2, +2], then 80% of the target words fall into this interval. There are two possible reasons for our inaccurate estimation of the sense number. First, the semantic representations of ambiguous instances contain noise from model training, which undermines the quality of our word sense clustering. Second, our ARPCL may merge two valid clusters if their centroids get too close.

5.2.5. Impact of the similarity threshold for word clustering

In order to determine how sensitive our method is to the similarity threshold, we experimented with different similarity thresholds ranging from 0.1 to 0.9. The results are shown in Fig. 4. Compared with RAE0-ARPCL, our word clusters have a negative impact on performance if the similarity threshold is set below 0.6. Generally, if the constraint imposed by the similarity threshold is too weak, words without strong semantic correlations take part in the word clustering. In doing so, we lose a certain amount of unique lexical semantics because of excessive generalization. Once the similarity threshold is set above 0.6, our method provides better results than RAE0-ARPCL, and the best performance is obtained when the similarity threshold equals 0.8. We hypothesize that a similarity threshold of 0.8 may be the best value for balancing the semantics of words and word clusters during context modeling. When the similarity threshold equals 0.9, our performance decreases because the threshold is too restrictive, so that the word clusters are not exploited sufficiently.

5.2.6. The Kappa statistic

We use the Kappa statistic [6] to investigate whether the sampling step stabilizes our model. The Kappa statistic [6] provides a reasonable way to evaluate the reproducibility of our sense clustering results. Concretely, for each target word, we separately run Word2Vec [23] twice on the original data and twice on the newly sampled data, and then obtain two sense clustering results for the original data and two for the newly sampled data. We then compute the Kappa statistic over the 100 target words by averaging their individual Kappa statistics. Specifically, the Kappa statistic on the original data is 0.77 and the one on the newly sampled data is 0.86. Accordingly, the newly sampled data stabilizes the model on our tested dataset.
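For reproducibility, one way to compute the agreement between two clustering runs over the same instances is sketched below. Because the cluster labels of independent runs are arbitrary, the second run is first aligned to the first with a Hungarian matching before Cohen's kappa is computed; this alignment step is our assumption, as the paper does not detail how the two labelings are compared.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import cohen_kappa_score

def clustering_kappa(run1, run2):
    """Cohen's kappa between two cluster assignments of the same instances,
    after aligning the label space of run2 to that of run1."""
    run1, run2 = np.asarray(run1), np.asarray(run2)
    k = max(run1.max(), run2.max()) + 1
    overlap = np.zeros((k, k))
    for a, b in zip(run1, run2):
        overlap[a, b] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = {c: r for r, c in zip(rows, cols)}
    remapped = np.array([mapping.get(b, b) for b in run2])
    return cohen_kappa_score(run1, remapped)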

6. Conclusion and future work

We have applied a recursive autoencoder (RAE) based method to learn the semantic representations of ambiguous instances, capable of capturing both lexical semantics and syntactic relations without involving excessive task-specific feature engineering. Using our proposed retraining-based method, we can learn better word embeddings and obtain better generalization ability during context modeling. When performing word sense clustering, our adapted RPCL algorithm is able to induce the number of word senses automatically by gradually eliminating redundant sense clusters. We have conducted a supervised and an unsupervised evaluation on the SEMEVAL-2010 WSI dataset to thoroughly test the effectiveness of our proposed method. Based on our experimental results, we can conclude the following:

1. The RAE based method is superior to a simple vector average method, as the former is capable of capturing both lexical semantics and syntactic relations simultaneously.
2. Given a proper similarity threshold (larger than 0.6 in our experiments), our retraining-based method is capable of improving the semantic representations of ambiguous instances because of its better generalization ability.
3. In word sense clustering, our adapted rival penalization competitive learning achieves the best performance in the supervised evaluation. When comparing with the gold standard, 25% of the target words are predicted correctly, and 80% of the target words fall within a deviation interval of [-2, +2].


Acknowledgements

We would like to thank all the referees for their constructive and helpful suggestions on this paper. This work is supported by the Natural Science Foundation of China (Grant nos. 61005052, 61075058 and 61303082), the Key Technologies R&D Program of China (Grant no. 2012BAH14F03), the Fundamental Research Funds for the Central Universities (Grant no. 2010121068), the Natural Science Foundation of Fujian Province, China (Grant no. 2010J01351), the Science and Technology Planning Project of Tibet Autonomous Region (Grant no. Z2014A18G2-13), the Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20120121120046) and the Ph.D. Programs Foundation of Ministry of Education of China (Grant no. 20130121110040).

Appendix A. Table of notation

$w_i$ — polysemous word
$s_{ij}$ — jth context that contains ambiguous instance $I_{ij}$
$I_{ij}$ — jth ambiguous instance of $w_i$
$\{s_{ij}\}_{j=1}^{m_i}$ — a set of $m_i$ contexts of $w_i$
$\{I_{ij}\}_{j=1}^{m_i}$ — a set of $m_i$ ambiguous instances of $w_i$
$M$ — word embedding matrix
$D$ — word similarity matrix
$T$ — node matrix of the hierarchical syntax tree
$V$ — vocabulary
$|V|$ — size of the vocabulary
$n$ — size of the embedding
$p$ — parent node of the hierarchical syntax tree
$c_1, c_2$ — two children nodes of the hierarchical syntax tree
$c'_1, c'_2$ — reconstructed nodes of $c_1$ and $c_2$
$[c_1; c_2]$ — vector concatenation of $c_1$ and $c_2$
$[c'_1; c'_2]$ — vector concatenation of $c'_1$ and $c'_2$
$len$ — number of leaf nodes of the hierarchical syntax tree
$\theta_{wc}$ — embedding features for words and word clusters
$\theta_{rec}$ — feature matrices $W_1$ and $W_2$; bias vectors $b_1$ and $b_2$
$\theta$ — parameters of our RAE based model, comprising $\theta_{wc}$ and $\theta_{rec}$
$E_{rec}(s_{ij};\theta)$ — given $\theta$, the reconstruction error of $s_{ij}$
$\|\theta_{wc}\|$ — L2-norm of $\theta_{wc}$
$\|\theta_{rec}\|$ — L2-norm of $\theta_{rec}$
$\lambda_{wc}$ — hyperparameter for $\|\theta_{wc}\|$
$\lambda_{rec}$ — hyperparameter for $\|\theta_{rec}\|$
$y_k$ — number of times that $O_{ik}$ has been the winner cluster
$\gamma_k$ — relative winning rate of cluster $O_{ik}$
$q_n$ — initialized cluster number of $w_i$
$O_{ik}$ — kth sense cluster of $w_i$
$|O_{ik}|$ — number of instances in $O_{ik}$
$\{O_{ik}\}_{k=1}^{q_i}$ — a set of $q_i$ gold standard sense clusters of $w_i$
$O_{ik^*}$ — winner sense cluster of $w_i$
$O_{ik'}$ — rival sense cluster of $w_i$
$k^*$ — index of the winner sense cluster
$k'$ — index of the rival sense cluster
$\lambda_{k^*}$ — weight of winner sense cluster $O_{ik^*}$
$\lambda_{k'}$ — weight of rival sense cluster $O_{ik'}$
$\eta$ — learning rate of the cluster weights
$\Delta_{kj}$ — semantic distance between $O_{ik}$ and $I_{ij}$
$\Delta_{k'j}$ — semantic distance between $O_{ik'}$ and $I_{ij}$

References

[1] J. Artiles, E. Amigó, J. Gonzalo, The role of named entities in web people search, In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2009, pp. 534–542. [2] Y. Bengio, H. Schwenk, J.S. Senécal, F. Morin, J.L. Gauvain, Neural Probabilistic Language Models, Innovations Mach. Learn. 194 (2006) 137-186.. [3] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022. [4] S. Bordag, Word sense induction: triplet-based clustering and automatic evaluation, In Proceedings of the 11st Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2006, pp. 137–144. [5] S. Brody, M. Lapata, Bayesian word sense induction, In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009, pp. 103–111. [6] J. Carletta, Assessing agreement on classification tasks: the kappa statistic, Comput. Linguist. 22 (1996) 249–254. [7] E. Charniak, Naive Bayes Word Sense Induction, In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2013, pp. 1433–1437. [8] P. Chen, W. Ding, C. Bowes, D. Brown, A fully unsupervised word sense disambiguation method using dependency knowledge, In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 28–36.


[9] Y.M. Cheung, H. Jia, Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number, Pattern Recognit. 46 (2013) 2228–2238.
[10] R. Collobert, J. Weston, A unified architecture for natural language processing: deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, Helsinki, 2008, pp. 160–167.
[11] T. Van de Cruys, M. Apidianaki, Latent semantic word sense induction and disambiguation, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 1476–1485.
[12] M.Y. Dehkordi, R. Boostani, M. Tahmasebi, A novel hybrid structure for clustering, Adv. Comput. Sci. Eng. 6 (2009) 888–891.
[13] J.L. Elman, Distributed representations, simple recurrent networks, and grammatical structure, Mach. Learn. 7 (1991) 195–225.
[14] I.E. Givoni, B.J. Frey, A binary variable model for affinity propagation, Neural Comput. 21 (2009) 1589–1600.
[15] Z.S. Harris, Distributional structure, Word 10 (1954) 146–162.
[16] Y. Huang, X. Shi, J. Su, Y. Chen, G. Huang, Unsupervised word sense induction using rival penalized competitive learning, Eng. Appl. Artif. Intell. 41 (2015) 166–174.
[17] I.P. Klapaftis, S. Manandhar, Word sense induction using graphs of collocations, in: Proceedings of the 18th European Conference on Artificial Intelligence, 2008, pp. 298–302.
[18] I. Korkontzelos, S. Manandhar, UoY: graphs of unambiguous vertices for word sense induction and disambiguation, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 355–358.
[19] J.H. Lau, P. Cook, D. McCarthy, D. Newman, T. Baldwin, Word sense induction for novel sense detection, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 591–601.
[20] P. Li, Y. Liu, M. Sun, Recursive autoencoders for ITG-based translation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 567–577.
[21] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[22] S. Manandhar, I.P. Klapaftis, D. Dligach, S.S. Pradhan, SemEval-2010 task 14: word sense induction & disambiguation, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 63–68.
[23] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[24] J. Mitchell, M. Lapata, Vector-based models of semantic composition, in: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, 2008, pp. 236–244.
[25] R.J. Passonneau, A. Salleb-Aouissi, V. Bhardwaj, N. Ide, Word sense annotation of polysemous words by multiple annotators, in: Proceedings of the Conference on LREC, 2010, pp. 3244–3249.
[26] T. Pedersen, Duluth-WSI: SenseClusters applied to the sense induction task of SemEval-2, in: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, 2010, pp. 363–366.
[27] A. Purandare, T. Pedersen, Word sense discrimination by clustering contexts in vector and similarity spaces, in: Proceedings of the Conference on Computational Natural Language Learning, 2004.
[28] A. Rosenberg, J. Hirschberg, V-measure: a conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007, pp. 410–420.
[29] R. Socher, E.H. Huang, J. Pennin, C.D. Manning, A.Y. Ng, Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, in: Advances in Neural Information Processing Systems, 2011, pp. 801–809.
[30] R. Socher, B. Huval, C.D. Manning, A.Y. Ng, Semantic compositionality through recursive matrix–vector spaces, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, Jeju Island, 2012, pp. 1201–1211.
[31] R. Socher, J. Pennington, E.H. Huang, A.Y. Ng, C.D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 151–161.
[32] R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Seattle, 2013, pp. 1631–1642.
[33] Y.W. Teh, M.I. Jordan, M.J. Beal, D.M. Blei, Hierarchical Dirichlet processes, J. Am. Stat. Assoc. 101 (2006) 1566–1581.
[34] G. Udani, S. Dave, A. Davis, T. Sibley, Noun sense induction using web search results, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 657–658.
[35] X. Yao, B. Van Durme, Nonparametric Bayesian word sense induction, in: Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, 2011, pp. 10–14.

Yanzhou Huang is currently a Ph.D. candidate at the School of Information Science and Engineering, Xiamen University. His research interests include natural language processing and statistical machine translation.

Deyi Xiong received the Ph.D. degree from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in 2007. He is now a professor in the School of Computer Science and Technology, Soochow University. His research interests include natural language processing and statistical machine translation, with a particular focus on semantics-based machine translation, document-level machine translation and synchronous grammar induction.

Xiaodong Shi received the Ph.D. degree in computer software from National University of Defense Technology, Changsha, China, in 1994. He is now a professor in the Cognitive Science Department of Xiamen University. His research interests include natural language processing and artificial intelligence.

Yidong Chen received his Ph.D. degree in mathematics from Xiamen University, Xiamen, China, in 2008. He is now an associate professor in the Cognitive Science Department of Xiamen University. His research interests include statistical machine translation and semantic analysis.

ChangXing Wu is currently a Ph.D. candidate at the School of Information Science and Engineering, Xiamen University. His research interests include natural language processing and machine learning.

Guimin Huang is a full professor at Guilin University of Electronic Technology in China. He previously worked as an assistant professor at Curtin University in Australia. He has published more than 80 academic papers in international journals and conference proceedings, and has been awarded an invention patent as well as five software copyright registration certificates. His research interests include natural language processing and text mining.
