2011 11th IEEE International Conference on Data Mining Workshops

A Novel Co-clustering Method with Intra-Similarities

Jian-Sheng Wu, Jian-Huang Lai, Chang-Dong Wang
School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China
Email: [email protected], [email protected], [email protected]


Abstract—Recently, co-clustering has become a topic of much interest because of its applications to many problems, and it has been shown to be more effective than one-way clustering methods. However, existing co-clustering approaches treat a document simply as a collection of words, disregarding word sequences: they consider only the co-occurrence counts of words and documents, and take into account neither the similarities between words nor the similarities between documents. Such similarity information can help improve the co-clustering. In this paper, we incorporate word similarities and document similarities into the co-clustering process and propose a new co-clustering method. We also provide a theoretical analysis showing that our algorithm converges to a local minimum. Empirical evaluation on publicly available data sets shows that our algorithm is effective.

Keywords: co-clustering; word similarities; document similarities

I. INTRODUCTION

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects. It is a fundamental, exploratory tool commonly used to discover structure in data [1]. Up to now, many clustering methods have been proposed, and many of them are one-way clustering methods, such as K-means, hierarchical clustering, and so on. Though some of them are very effective, co-clustering has been shown to be more effective in many applications [2]–[5], due to the benefit of exploiting the duality between rows and columns [6]. For example, a document can contain many different words, while a word can occur in many documents and can appear many times in one document. Hence, a document holds some information about words, and a word also holds some information about documents. Although this paper focuses on document-word co-clustering, the proposed method can be applied to other co-clustering applications. Here, we use the "word" to denote the element and the "document" to denote the container that consists of such elements. To date, co-clustering has been used in many applications, such as document clustering [3], [4], [7], gene expression data clustering [2], [5], and image processing and scene modeling [8]–[10]. Many variants have also been developed, such as Bayesian co-clustering models [11] and constrained or semi-supervised co-clustering [12], [13].



For a given co-clustering problem, we can use the co-occurrence matrix to model the relationships between the samples of the two sets [3], [4], so the co-clustering problem can be treated as a bipartite graph partitioning problem; this, however, is an NP-complete problem [3]. Dhillon [3] first employed the spectral graph partitioning method and the singular value decomposition (SVD) to partition the bipartite graph in the co-clustering field. Although it is an effective method, it still has drawbacks due to the shortcomings of spectral graph partitioning and eigenvector-based decomposition methods. To overcome the drawbacks of SVD and eigenvector-based decomposition methods, Long et al. [6] propose a new co-clustering framework, namely block value decomposition (BVD); compared with SVD or eigenvector-based decomposition, the decomposition produced by BVD has an intuitive interpretation. To overcome the drawbacks of spectral graph partitioning, Rege et al. [7] propose an isoperimetric co-clustering algorithm for partitioning the bipartite graph by minimizing the isoperimetric ratio. The co-occurrence matrix can also be viewed as an empirical joint probability distribution of two discrete random variables, and the co-clustering problem can then be posed as an optimization problem in information theory [4]. Slonim and Tishby [14] introduce mutual information into the document clustering problem by taking the document set and the word set as the value sets of two random variables: they first find word clusters that preserve most of the mutual information about the document set, and then find document clusters that maximize the mutual information about the word cluster set. Different from [14], Dhillon et al. [4] simultaneously cluster the documents and words in each iteration until the preserved mutual information between the two cluster sets is maximized. A generalized co-clustering framework is presented in [15], in which any Bregman divergence can be used in the objective function and various conditional expectations need to be preserved.

Previous work has focused on the inter-relationships between samples belonging to different sets, but has not taken into account the intra-relationships between samples of the same set. For the document-word co-clustering problem, previous approaches treat a document simply as a collection of words and disregard word sequences, so two documents may share few key words (words selected as features for clustering) yet have similar meanings, or share some key words yet represent different meanings. For the first case, suppose we have two documents, one containing the sentence "Michael Jackson is an American musician and entertainer" and the other containing the sentence "The Hillbilly Cat is the most popular singer of rock", and the key words are musician, entertainer, singer, and rock. Although the two documents share few key words, they share a similar topic, entertainment, since all of their key words relate to entertainment. If we could measure the similarities between the key words musician and singer, entertainer and singer, musician and rock, and entertainer and rock, then we could judge whether the two sentences have similar meanings. The latter case arises because existing approaches discard the information provided by terms and phrases; computing the similarity between documents using terms and phrases helps judge whether they should belong to the same cluster. Based on these observations, in our work we focus on incorporating the intra-relationships into the information-theoretic co-clustering algorithm (ITCC) [4]. For the document-word co-clustering problem, we take the similarities between words and the similarities between documents as the intra-relationships, and propose the information-theoretic co-clustering algorithm incorporating word similarities and document similarities (ITCCWDS).

The rest of the paper is organized as follows. In Section II, the problem formulation is presented. Section III details the proposed ITCCWDS algorithm and provides a theoretical analysis. Experimental results are reported in Section IV. Finally, we conclude the paper in Section V.


II. PROBLEM FORMULATION

Let X = {x1, x2, . . . , xm} and Y = {y1, y2, . . . , yn} denote the word set and the document set, respectively. Here, we call X the row sample set and Y the column sample set. We can then estimate the joint probability p(xi, yj) based on the co-occurrence matrix of X and Y. For the hard co-clustering problem studied by Dhillon et al. [4], we are interested in simultaneously clustering the row samples into k disjoint clusters and the column samples into l disjoint clusters. Denote the row cluster set and the column cluster set as X̂ = {x̂1, x̂2, . . . , x̂k} and Ŷ = {ŷ1, ŷ2, . . . , ŷl}, respectively. For a row cluster x̂ and a column cluster ŷ, we denote μ_x̂ and ν_ŷ as their exemplars, and we write X̂(x) for the cluster label of x and Ŷ(y) for that of y.

Dhillon et al. propose the ITCC algorithm in [4]. ITCC simultaneously clusters documents and words by minimizing the loss of mutual information, defined as

    I(X; Y) − I(X̂; Ŷ)    (1)

where I(X; Y) is the mutual information between the sets X and Y, defined as

    I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x) p(y|x) log ( p(y|x) / p(y) ).    (2)

Since the distribution p is fixed for a given problem, I(X; Y) is fixed according to (2), so minimizing (1) is equivalent to maximizing the preserved mutual information I(X̂; Ŷ). However, (1) only takes into account the joint occurrences of xi and yr; it discards the information that xi holds about xj and that yr holds about yt, such as their similarity. An operational definition of clustering can be stated as follows: given n representations of n objects, find K groups based on a measure of similarity such that the similarities between samples in the same group are high while the similarities between samples in different groups are low [1]. We can therefore cluster the word set by maximizing WSS and cluster the document set by maximizing DSS, defined as follows [16]:

    WSS = Σ_{x̂} Σ_{x∈x̂} γ sim(x, μ_x̂)    (3)

    DSS = Σ_{ŷ} Σ_{y∈ŷ} θ sim(y, ν_ŷ)    (4)

where γ and θ are the weights of samples x and y, and sim(·, ·) computes the similarity between two vectors. Since a word may occur in many documents and may occur many times in one document, while a document can contain many words, we take the marginal probabilities p(x) and p(y) as the weights of word x and document y, i.e., γ = p(x) and θ = p(y). Incorporating the similarity information, we rewrite (1) as

    I(X; Y) − I(X̂; Ŷ) − αWSS − βDSS    (5)

where α and β are the weights that trade off the loss of mutual information against WSS and DSS.

Definition 1. An optimal co-clustering minimizes (5), subject to the constraints on the numbers of row and column clusters.
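To make the formulation concrete, the following sketch (illustrative only; NumPy is assumed, and all helper names are ours rather than from the paper) evaluates (2)-(5) for a joint distribution p(X, Y), given cluster assignments and the precomputed similarity of each sample to the exemplar of its cluster, taking γ = p(x) and θ = p(y) as discussed above.

import numpy as np

def mutual_information(p):
    # I(X; Y) = sum_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ), cf. eq. (2)
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def coclustered_joint(p, row_labels, col_labels, k, l):
    # q(x_hat, y_hat): total probability mass falling into each co-cluster
    q = np.zeros((k, l))
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            q[row_labels[i], col_labels[j]] += p[i, j]
    return q

def objective(p, row_labels, col_labels, k, l, row_sim, col_sim, alpha, beta):
    # Eq. (5): loss of mutual information minus alpha * WSS minus beta * DSS,
    # with gamma = p(x) and theta = p(y) used as the sample weights.
    loss_mi = mutual_information(p) - mutual_information(
        coclustered_joint(p, row_labels, col_labels, k, l))
    px = p.sum(axis=1)                 # marginals p(x) over words (rows)
    py = p.sum(axis=0)                 # marginals p(y) over documents (columns)
    wss = np.sum(px * row_sim)         # row_sim[i] = sim(x_i, mu of its cluster)
    dss = np.sum(py * col_sim)         # col_sim[j] = sim(y_j, nu of its cluster)
    return loss_mi - alpha * wss - beta * dss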



III. THE ITCCWDS ALGORITHM

In [4], Dhillon et al. express (1) as the weighted sum of relative entropies between the row distributions p(Y|x) and the row-cluster prototype distributions q(Y|x̂), or as the weighted sum of relative entropies between the column distributions p(X|y) and the column-cluster prototype distributions q(X|ŷ), as follows:

    I(X; Y) − I(X̂; Ŷ) = Σ_{x̂} Σ_{x∈x̂} p(x) D(p(Y|x) || q(Y|x̂))    (6)

    I(X; Y) − I(X̂; Ŷ) = Σ_{ŷ} Σ_{y∈ŷ} p(y) D(p(X|y) || q(X|ŷ))    (7)

where D(·||·) denotes the Kullback-Leibler (KL) divergence. According to (3), (4) and the definitions of γ and θ, (5) can be expressed in the same way as (1):

    I(X; Y) − I(X̂; Ŷ) − αWSS − βDSS
      = Σ_{x̂} Σ_{x∈x̂} p(x) ( D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂) ) − β Σ_{ŷ} Σ_{y∈ŷ} p(y) sim(y, ν_ŷ)    (8)
      = Σ_{ŷ} Σ_{y∈ŷ} p(y) ( D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ) ) − α Σ_{x̂} Σ_{x∈x̂} p(x) sim(x, μ_x̂)    (9)

Note that the co-clustering problem is NP-hard, and a local minimum does not guarantee a global minimum [4]. Therefore, extending the ITCC algorithm, ITCCWDS minimizes the objective function (5) based on the EM algorithm. It starts from an initial partition X̂^(0) and Ŷ^(0). It then fixes the column clusters Ŷ and minimizes the objective function in the form (8), and next fixes the row clusters X̂ and minimizes the objective function in the form (9); these two steps are repeated until convergence. When Ŷ is fixed, Σ_{ŷ} Σ_{y∈ŷ} p(y) sim(y, ν_ŷ) is fixed, so minimizing (8) is equivalent to minimizing (10); when X̂ is fixed, Σ_{x̂} Σ_{x∈x̂} p(x) sim(x, μ_x̂) is fixed, so minimizing (9) is equivalent to minimizing (11):

    Σ_{x̂} Σ_{x∈x̂} p(x) ( D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂) )    (10)

    Σ_{ŷ} Σ_{y∈ŷ} p(y) ( D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ) )    (11)

Thus we can define the distance from a row x to a row cluster x̂, and the distance from a column y to a column cluster ŷ, as follows:

    d_{x→x̂} = D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂)    (12)

    d_{y→ŷ} = D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ)    (13)
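For illustration, the two distances can be computed as in the following sketch (argument names are ours; the prototype distributions q(Y|x̂) and q(X|ŷ) and the sample-to-exemplar similarities are assumed to be supplied):

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D(p || q) = sum_i p_i log(p_i / q_i); eps guards against division by zero
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def row_distance(p_Y_given_x, q_Y_given_xhat, sim_x_to_mu, alpha):
    # Eq. (12): d_{x -> x_hat} = D(p(Y|x) || q(Y|x_hat)) - alpha * sim(x, mu_{x_hat})
    return kl_divergence(p_Y_given_x, q_Y_given_xhat) - alpha * sim_x_to_mu

def col_distance(p_X_given_y, q_X_given_yhat, sim_y_to_nu, beta):
    # Eq. (13): d_{y -> y_hat} = D(p(X|y) || q(X|y_hat)) - beta * sim(y, nu_{y_hat})
    return kl_divergence(p_X_given_y, q_X_given_yhat) - beta * sim_y_to_nu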

In the E-steps, we reassign the cluster label of each sample based on the current cluster sets. First, in the E-row step, we keep the column clusters fixed and, for each row sample x, find the cluster it is closest to according to (12). Then, in the E-column step, we keep the row clusters fixed and, for each column sample y, find the cluster it is closest to according to (13). In the M-steps, we update the cluster prototype distributions q(Y|x̂) and q(X|ŷ) and the row and column exemplar sets U and V. First, in the M-row step, we update q(Y|x̂) for each row cluster as done in [4] and update U by maximizing (3). Then, in the M-column step, we update q(X|ŷ) for each column cluster as done in [4] and update V by maximizing (4). The details are shown in Algorithm 1.

Algorithm 1 ITCCWDS
 1: Input: the joint probability distribution p(X, Y), the word similarity matrix WordSim(X), the document similarity matrix DocSim(Y), the number of row clusters k, the number of column clusters l, and the parameters α and β.
 2: Output: the cluster sets X̂ and Ŷ.
 3: Initialization: Set t = 0. Initialize the word cluster set X̂^(0) and the document cluster set Ŷ^(0). Compute q^(0)(X̂, Ŷ), q^(0)(X|X̂), q^(0)(Y|Ŷ), and q^(0)(Y|x̂) as done in [4], and obtain U^(0) and V^(0) by maximizing (3) and (4).
 4: repeat
 5:   E-row step: for each row sample x, find its new cluster as
        X̂^(t+1)(x) = argmin_x̂ ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ).
 6:   Keep the column clusters unchanged, i.e., Ŷ^(t+1) = Ŷ^(t).
 7:   M-row step: compute q^(t+1)(X̂, Ŷ), q^(t+1)(X|X̂), q^(t+1)(Y|Ŷ), and q^(t+1)(X|ŷ) as done in [4], and obtain U^(t+1) by maximizing (3).
 8:   E-column step: for each column sample y, find its new cluster as
        Ŷ^(t+2)(y) = argmin_ŷ ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ).
 9:   Keep the row clusters unchanged, i.e., X̂^(t+2) = X̂^(t+1).
10:   M-column step: compute q^(t+2)(X̂, Ŷ), q^(t+2)(X|X̂), q^(t+2)(Y|Ŷ), and q^(t+2)(Y|x̂) as done in [4], and obtain V^(t+2) by maximizing (4).
11:   Compute the objective function value objValue^(t+2) using (5), and the change of the objective function value, δ = objValue^(t) − objValue^(t+2).
12:   t = t + 2.
13: until δ < 10^−6
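For illustration only, the sketch below strings the steps together, building on the objective(), row_distance(), and col_distance() helpers sketched in the previous sections. The functions update_prototypes and update_exemplars are simplified stand-ins: ITCC's actual prototype distributions [4] have a specific product form, and the exemplar update shown is only one simple way of increasing (3) and (4).

import numpy as np

def update_prototypes(p, row_labels, col_labels, k, l):
    # Simplified stand-in: cluster-averaged conditional distributions
    # (ITCC [4] defines q(Y|x_hat) and q(X|y_hat) through a product form).
    qY = np.vstack([p[row_labels == c].sum(axis=0) for c in range(k)])
    qY = qY / np.maximum(qY.sum(axis=1, keepdims=True), 1e-12)
    qX = np.vstack([p[:, col_labels == c].sum(axis=1) for c in range(l)])
    qX = qX / np.maximum(qX.sum(axis=1, keepdims=True), 1e-12)
    return qY, qX        # qY[c] ~ q(Y | x_hat = c), qX[c] ~ q(X | y_hat = c)

def update_exemplars(word_sim, doc_sim, row_labels, col_labels, k, l):
    # Exemplar of a cluster: the member with the largest total similarity to
    # the other members (one simple way of approximately maximizing (3)/(4)).
    def pick(sim, labels, num):
        exemplars = np.zeros(num, dtype=int)
        for c in range(num):
            members = np.where(labels == c)[0]
            if len(members) > 0:
                scores = sim[np.ix_(members, members)].sum(axis=1)
                exemplars[c] = members[int(np.argmax(scores))]
        return exemplars
    return pick(word_sim, row_labels, k), pick(doc_sim, col_labels, l)

def itccwds(p, k, l, word_sim, doc_sim, alpha, beta, tol=1e-6, max_iter=100):
    # Alternating minimization of objective (5), following the structure of Algorithm 1.
    m, n = p.shape
    rng = np.random.default_rng(0)
    row_labels = rng.integers(k, size=m)          # X_hat^(0) (word clusters)
    col_labels = rng.integers(l, size=n)          # Y_hat^(0) (document clusters)
    qY, qX = update_prototypes(p, row_labels, col_labels, k, l)
    mu, nu = update_exemplars(word_sim, doc_sim, row_labels, col_labels, k, l)
    prev = np.inf
    for _ in range(max_iter):
        # E-row step: reassign each row (word) x by (12), column clusters fixed.
        for i in range(m):
            p_Y_given_x = p[i] / max(p[i].sum(), 1e-12)
            row_labels[i] = min(range(k), key=lambda c: row_distance(
                p_Y_given_x, qY[c], word_sim[i, mu[c]], alpha))
        # M-row step: refresh prototypes and exemplars (maximize (3)).
        qY, qX = update_prototypes(p, row_labels, col_labels, k, l)
        mu, nu = update_exemplars(word_sim, doc_sim, row_labels, col_labels, k, l)
        # E-column step: reassign each column (document) y by (13), row clusters fixed.
        for j in range(n):
            p_X_given_y = p[:, j] / max(p[:, j].sum(), 1e-12)
            col_labels[j] = min(range(l), key=lambda c: col_distance(
                p_X_given_y, qX[c], doc_sim[j, nu[c]], beta))
        # M-column step: refresh prototypes and exemplars (maximize (4)).
        qY, qX = update_prototypes(p, row_labels, col_labels, k, l)
        mu, nu = update_exemplars(word_sim, doc_sim, row_labels, col_labels, k, l)
        # Stop when the decrease of objective (5) falls below tol.
        cur = objective(p, row_labels, col_labels, k, l,
                        word_sim[np.arange(m), mu[row_labels]],
                        doc_sim[np.arange(n), nu[col_labels]], alpha, beta)
        if prev - cur < tol:
            break
        prev = cur
    return row_labels, col_labels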

Lemma 1. Algorithm ITCCWDS monotonically decreases the objective function (5) to a local minimum.

Proof:

    Σ_{x̂} Σ_{x: X̂^(t)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ) − β Σ_{ŷ} Σ_{y: Ŷ^(t)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))

    ≥ Σ_{x̂} Σ_{x: X̂^(t)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|X̂^(t+1)(x))) − α sim(x, μ^(t)_{X̂^(t+1)(x)}) ) − β Σ_{ŷ} Σ_{y: Ŷ^(t)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))    (14)

    = Σ_{x̂} Σ_{x: X̂^(t+1)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ) − β Σ_{ŷ} Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))    (15)

    ≥ Σ_{x̂} Σ_{x: X̂^(t+1)(x)=x̂} p(x) D(p(Y|x) || q^(t+1)(Y|x̂)) − α Σ_{x̂} Σ_{x: X̂^(t+1)(x)=x̂} p(x) sim(x, μ_x̂^(t+1)) − β Σ_{ŷ} Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) sim(y, ν_ŷ^(t+1))    (16)

    = Σ_{ŷ} Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ) − α Σ_{x̂} Σ_{x: X̂^(t+1)(x)=x̂} p(x) sim(x, μ_x̂^(t+1))    (17)

In the above chain, (14) follows from the E-row step; (15) follows since the column cluster set is fixed; (16) follows from [4], together with the maximization of (3) and the fact that the column cluster set is fixed; and (17) is due to (8) and (9). In the same way we can get

    Σ_{ŷ} Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ) − α Σ_{x̂} Σ_{x: X̂^(t+1)(x)=x̂} p(x) sim(x, μ_x̂^(t+1))

    ≥ Σ_{x̂} Σ_{x: X̂^(t+2)(x)=x̂} p(x) ( D(p(Y|x) || q^(t+2)(Y|x̂)) − α sim(x, μ_x̂^(t+2)) ) − β Σ_{ŷ} Σ_{y: Ŷ^(t+2)(y)=ŷ} p(y) sim(y, ν_ŷ^(t+2))    (18)

By combining (17) and (18), it follows that ITCCWDS monotonically decreases the objective function. Since the Kullback-Leibler divergence is non-negative and (3) and (4) are bounded, the objective function is lower bounded. Therefore, algorithm ITCCWDS converges to a local minimum in a finite number of steps.

Remark 1. The time complexity of Algorithm ITCCWDS is O((nz(k + l) + km^2 + ln^2)τ), where nz is the number of non-zeros in p(X, Y) and τ is the number of iterations.

IV. EXPERIMENTAL RESULTS

A. Data Sets and Parameter Settings

For our performance evaluation, we use various subsets of the 20-Newsgroups (NG20) data set [17], the CLASSIC3 data set [3], and the Yahoo data set [18]. The NG20 data set consists of approximately 20,000 newsgroup articles collected from 20 different Usenet newsgroups. Many of the newsgroups have similar topics, and about 4.5% of the articles are present in more than one group, making the boundaries between some newsgroups rather fuzzy. To make our comparison consistent with existing work, we constructed various subsets of NG20: Binary, Binary subject, Multi5, Multi5 subject, Multi10, Multi10 subject, NG10, and NG20, and preprocessed all the subsets as in [4], [19], i.e., we removed stop words, ignored file headers, converted characters to lower case, and selected the top 2000 words by mutual information. The CLASSIC3 data set consists of 3893 abstracts from the MEDLINE, CISI, and CRANFIELD subsets, and the Yahoo data set consists of 2340 articles from 6 categories. For CLASSIC3 and Yahoo, after ignoring HTML tags (only for Yahoo), removing stop words, and converting characters to lower case, we selected the top 2000 words by mutual information as the preprocessing. The details of these subsets are given in Table I.

Table I. SUMMARY OF DATASETS USED FOR EXPERIMENTS

    Dataset                      # of clusters   # of docs   # of words
    Binary & Binary subject      2               500         15582 & 15657
    Multi5 & Multi5 subject      5               500         14274 & 14397
    Multi10 & Multi10 subject    10              500         15336 & 15480
    NG10                         10              20,000      143714
    NG20                         20              20,000      143714
    CLASSIC3                     3               3893        20168
    Yahoo                        6               2340        37482

B. Word Similarity

Due to lexical ambiguity, Reisinger and Mooney [20], [21] introduce two multi-prototype approaches to vector-space lexical semantics, in which individual words are represented as collections of "prototype" vectors. Here, we use the approach of [21] to compute the pairwise similarities between words, as it can represent the common metaphor structure found in highly polysemous words. We therefore define the similarity sim(x, μ) between a word x and the exemplar μ as

    sim(x, μ) = (1 / (K_x K_μ)) Σ_{i=1}^{K_x} Σ_{j=1}^{K_μ} d(π_i(x), π_j(μ))    (19)

where K_x and K_μ are the numbers of prototype clusters for x and μ, π_i(x) and π_j(μ) are the cluster centroids, and d(·, ·) is a standard distributional similarity measure; here we use the cosine measure.
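A minimal sketch of (19), assuming each word has already been represented as a set of prototype vectors (e.g., with the multi-prototype approach of [21]) and using cosine similarity for d(·, ·):

import numpy as np

def cosine(a, b, eps=1e-12):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def word_similarity(prototypes_x, prototypes_mu):
    # Eq. (19): average pairwise similarity between the K_x prototype vectors
    # of word x and the K_mu prototype vectors of the exemplar word mu.
    total = 0.0
    for pi in prototypes_x:
        for pj in prototypes_mu:
            total += cosine(pi, pj)
    return total / (len(prototypes_x) * len(prototypes_mu))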

C. Document Similarity

In [22], Chim and Deng propose a phrase-based algorithm that computes the pairwise similarities between documents by combining the Suffix Tree Document (STD) model and the Vector Space Document (VSD) model, and it has been proved effective. They represent each document y as a vector

    d_y = {w(1, y), w(2, y), . . . , w(M, y)}    (20)


where M is the number of terms, w(i, y) = (1 + log tf(i, y)) log(1 + N/df(i)), tf(i, y) is the frequency of the i-th term in document y, N is the total number of documents, and df(i) denotes the number of documents containing the i-th term. They then compute the pairwise similarity between documents y_i and y_j using the cosine measure, as is common in the VSD model. Thus we obtain the similarity between a document y and the exemplar ν as

    sim(y, ν) = <d_y, d_ν> / (||d_y|| × ||d_ν||)    (21)

where <·, ·> denotes the inner product and ||·|| denotes the L2 norm.
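A sketch of (20) and (21), covering only the vector-space part of [22] (the suffix-tree phrase terms would simply contribute additional dimensions of d_y); setting w(i, y) = 0 for absent terms is our assumption:

import numpy as np

def document_vector(term_freqs, doc_freqs, num_docs):
    # Eq. (20): w(i, y) = (1 + log tf(i, y)) * log(1 + N / df(i));
    # w(i, y) is taken as 0 when the i-th term does not occur in document y.
    w = np.zeros(len(doc_freqs))
    for i, tf in enumerate(term_freqs):
        if tf > 0 and doc_freqs[i] > 0:
            w[i] = (1.0 + np.log(tf)) * np.log(1.0 + num_docs / doc_freqs[i])
    return w

def document_similarity(d_y, d_nu, eps=1e-12):
    # Eq. (21): cosine similarity between document vector d_y and exemplar vector d_nu
    return float(np.dot(d_y, d_nu) / (np.linalg.norm(d_y) * np.linalg.norm(d_nu) + eps))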

D. Results and Discussion

In this section, we provide empirical evidence to demonstrate the effectiveness of the ITCCWDS algorithm, in comparison with ITCC [4] and ITCC with local search (ITCCLS). In the experiments, the initial clusters are generated in the same way as for ITCC, with different strategies for initializing the word clusters and the document clusters. For the word clusters, we choose the initial word cluster "centroids" to be "maximally" far apart from each other: we first take the word farthest from the centroid of the whole data set as the first word cluster "centroid", and then repeatedly take the word farthest from all the previously picked word cluster "centroids" until all the word cluster "centroids" are picked. For the document clusters, we use a random perturbation of the "mean" document (see the sketch below). Since there is a random component in the initialization step, all our results are averages of five trials. To validate the clustering results, micro-averaged precision [4] is used as the evaluation metric. By analyzing the distance from each sample to its cluster prototype and the similarity between each sample and its cluster exemplar, we set α and β in the range from 0.1 to 10.0.
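A sketch of this initialization; it reads "farthest to all the previous centroids" as a farthest-first traversal (one possible interpretation) over one representation vector per word (e.g., its co-occurrence profile), and perturbs the mean document with small Gaussian noise (the perturbation scale is our choice):

import numpy as np

def farthest_first_word_centroids(word_vectors, k):
    # First centroid: the word farthest from the centroid of the whole data set;
    # then repeatedly add the word farthest from all centroids picked so far.
    global_centroid = word_vectors.mean(axis=0)
    dist_to_global = np.linalg.norm(word_vectors - global_centroid, axis=1)
    chosen = [int(np.argmax(dist_to_global))]
    while len(chosen) < k:
        dists = np.stack([np.linalg.norm(word_vectors - word_vectors[c], axis=1)
                          for c in chosen])
        min_dist = dists.min(axis=0)          # distance to the nearest chosen centroid
        min_dist[chosen] = -np.inf            # never pick the same word twice
        chosen.append(int(np.argmax(min_dist)))
    return chosen

def perturbed_mean_document_centroids(doc_vectors, l, scale=0.05, seed=None):
    # Document clusters: l random perturbations of the "mean" document.
    rng = np.random.default_rng(seed)
    mean_doc = doc_vectors.mean(axis=0)
    return mean_doc + scale * rng.standard_normal((l, doc_vectors.shape[1]))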


Table II. THE PARAMETER SETTINGS (α\β) OF ITCCWDS FOR ALL THE SUBSETS OF NG20 WITH DIFFERENT NUMBERS OF WORD CLUSTERS

    Dataset            Number of word clusters
                       2         4         8         16        32        64        128
    Binary             0.5\0.1   0.3\0.2   0.8\0.2   0.2\0.1   0.2\0.8   1\0.2     0.5\0.2
    Binary subject     0.1\1     0.2\0.3   0.9\0.9   1\0.1     0.2\0.4   0.8\0.7   1\0.4
    Multi5             0.1\0.6   0.2\1     1\9       0.1\0.1   0.4\0.3   6\6       0.9\0.1
    Multi5 subject     2\6       0.1\0.3   0.1\0.1   0.1\0.9   0.5\0.5   8\8       6\4
    Multi10            1\10      3\10      0.2\1     1\9       1\9       0.8\1     0.9\0.8
    Multi10 subject    2\10      1\6       0.6\1     7\9       6\7       9\4       0.2\0.1

Figure 1. Micro-averaged-precision values with a varied number of word clusters on the NG20 subsets: (a) Binary, (b) Multi5, (c) Multi10, (d) Binary subject, (e) Multi5 subject, (f) Multi10 subject. Each panel plots micro-averaged precision against the number of word clusters (log scale, 2 to 128) for ITCCWDS, ITCC, and ITCCLS.

Figure 2. Micro-averaged-precision values on the large data sets (NG10, NG20, Yahoo, and CLASSIC3) for ITCCWDS, ITCC, and ITCCLS.

Figure 1 shows the performance of ITCCWDS, ITCC, and ITCCLS on the subsets of NG20, varying with the number of word clusters. The values of the parameters α and β are given in Table II; the two numbers in each cell are the values of α and β, written as α\β. From the performance results reported in Figure 1, it is clear that ITCCWDS improves the document clustering precision substantially over ITCC and ITCCLS: on all test sets except Binary subject, ITCCWDS obtains on average almost 15% higher precision than its counterparts, and on Binary subject the proposed method achieves comparable results. Figure 1 also demonstrates that ITCCWDS is less sensitive to the number of word clusters. Figure 2 reports the precision values of the three algorithms on the large data sets NG10, NG20, Yahoo, and CLASSIC3. For these four data sets, we set the numbers of word clusters to 128, 128, 64, and 200, respectively, and the parameter pairs (α, β) to (0.1, 0.9), (0.8, 0.1), (10, 2), and (6, 5). The results show that our algorithm is still effective on large data sets. The CLASSIC3 data set is easy to cluster into groups: on CLASSIC3, all three algorithms extract the original clusters almost correctly, with micro-averaged-precision values above 0.985. On the other three more challenging data sets, however, our algorithm achieves much better clustering performance; on NG10, it obtains almost 7% higher precision than ITCC and 5% higher than ITCCLS. These comparative results demonstrate the improvement of ITCCWDS in clustering performance.
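For reference, one common way of computing micro-averaged precision for a clustering (which may differ in detail from the definition used in [4]) maps each cluster to its majority true class and reports the fraction of correctly placed documents:

import numpy as np

def micro_averaged_precision(cluster_labels, true_labels):
    # Assumes integer class labels; each cluster is credited with its majority class.
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        correct += np.bincount(members).max()
    return correct / len(true_labels)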

V. CONCLUSION

In this paper, we have proposed a novel co-clustering method with intra-similarities. Unlike existing co-clustering algorithms, the proposed algorithm incorporates the similarities between samples belonging to the same set. We have also reported empirical evaluations demonstrating the advantages of our approach in terms of clustering quality on the document-word co-clustering problem.

ACKNOWLEDGMENT

This project was supported by the NSFC-GuangDong (U0835005) and the NSFC (61173084).

REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, June 2010.
[2] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proc. of Int. Conf. on Intelligent Systems for Molecular Biology, 2000, pp. 93-103.
[3] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. of the 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, pp. 269-274.
[4] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003, pp. 89-98.
[5] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proc. of the 4th SIAM Int. Conf. on Data Mining, 2004, pp. 114-125.
[6] B. Long, Z. Zhang, and P. S. Yu, "Co-clustering by block value decomposition," in Proc. of the 11th ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining, 2005, pp. 635-640.
[7] M. Rege, M. Dong, and F. Fotouhi, "Co-clustering documents and words using bipartite isoperimetric graph partitioning," in Proc. of the 6th Int. Conf. on Data Mining, 2006, pp. 532-541.
[8] G. Qiu, "Image and feature co-clustering," in Proc. of the 17th Int. Conf. on Pattern Recognition, 2004, pp. 991-994.
[9] J. Liu and M. Shah, "Scene modeling using co-clustering," in Proc. of the 11th IEEE Int. Conf. on Computer Vision, 2007, pp. 1-7.
[10] S. N. Vitaladevuni and R. Basri, "Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction," in Proc. of the 2010 IEEE Conf. on Computer Vision and Pattern Recognition, 2010, pp. 2203-2210.
[11] H. Shan and A. Banerjee, "Bayesian co-clustering," in Proc. of the 8th IEEE Int. Conf. on Data Mining, 2008, pp. 530-539.
[12] R. G. Pensa and J. F. Boulicaut, "Constrained co-clustering of gene expression data," in SDM, 2008, pp. 25-36.
[13] X. Shi, W. Fan, and P. S. Yu, "Efficient semi-supervised spectral co-clustering with constraints," in Proc. of the 10th IEEE Int. Conf. on Data Mining, 2010, pp. 1043-1048.
[14] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," in Proc. of the 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000, pp. 208-215.
[15] A. Banerjee, I. Dhillon, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2004, pp. 509-514.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, December 2005.
[17] K. Lang, "Newsweeder: Learning to filter netnews," in Proc. of the 12th Int. Conf. on Machine Learning, 1995, pp. 331-339.
[18] D. Boley, "Hierarchical taxonomies using divisive partitioning," University of Minnesota, Tech. Rep. TR-98-012, 1998.
[19] I. S. Dhillon and Y. Guan, "Information theoretic clustering of sparse co-occurrence data," Dept. of Computer Sciences, University of Texas, Tech. Rep. TR-03-39, September 2003.
[20] J. Reisinger and R. J. Mooney, "Multi-prototype vector-space models of word meaning," in Proc. of the 2010 Annual Conf. of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 109-117.
[21] J. Reisinger and R. Mooney, "A mixture model with sharing for lexical semantics," in Proc. of the 2010 Conf. on Empirical Methods in Natural Language Processing, 2010, pp. 1173-1182.
[22] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Trans. on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217-1229, September 2008.
