Pattern Recognition 46 (2013) 215–229


Self-taught dimensionality reduction on the high-dimensional small-sized data

Xiaofeng Zhu a, Zi Huang a, Yang Yang a, Heng Tao Shen a,*, Changsheng Xu b, Jiebo Luo c

a School of Information Technology and Electrical Engineering, The University of Queensland, Australia
b Institute of Automation, Chinese Academy of Sciences, China
c Department of Computer Science, University of Rochester, United States

Article history: Received 16 February 2012; received in revised form 28 June 2012; accepted 21 July 2012; available online 4 August 2012.

Abstract

To build an effective dimensionality reduction model usually requires sufficient data; otherwise, traditional dimensionality reduction methods may be less effective. However, sufficient data cannot always be guaranteed in real applications. In this paper we focus on performing unsupervised dimensionality reduction on high-dimensional and small-sized data, in which the dimensionality of the target data is high and the number of target data points is small. To handle the problem, we propose a novel Self-taught Dimensionality Reduction (STDR) approach, which is able to transfer knowledge (or information) from freely available external (or auxiliary) data to the high-dimensional and small-sized target data. The proposed STDR consists of three steps. First, bases are learnt from sufficient external data, which may come from the same "type" or "modality" as the target data; the bases are the common part between external data and target data, i.e., the external knowledge. Second, target data are reconstructed from the learnt bases via a novel joint graph sparse coding model, which not only provides robust reconstruction ability but also preserves the local structures amongst target data in the original space. This process transfers the external knowledge (i.e., the learnt bases) to the target data. Moreover, the proposed solver is theoretically guaranteed to make the objective function of the proposed model converge to its global optimum. After this, target data are mapped into the learnt basis space and are sparsely represented by the bases, i.e., represented by a subset of the bases. Third, the sparse features (that is, the rows with zero or small values) of the new representations of target data are deleted to achieve both effectiveness and efficiency; that is, this step performs feature selection on the new representations of target data. Finally, experimental results on various types of datasets show that the proposed STDR outperforms state-of-the-art algorithms in terms of k-means clustering performance. © 2012 Elsevier Ltd. All rights reserved.

Keywords: Dimensionality reduction; Self-taught learning; Joint sparse coding; Manifold learning; Unsupervised learning

1. Introduction

Many real applications (such as text categorization, computer vision, image retrieval, microarray technology and visual recognition) involve high-dimensional data [49,50,37,21]. In practice, although high-dimensional data can be analyzed with high-performance contemporary computers, dealing with high-dimensional data often leads to problems such as the explosion in execution time, the curse of dimensionality, and the impact of noise and redundancy. However, it has been proven that the "intrinsic" dimensionality of high-dimensional data is typically small [42]. Therefore, it is necessary to develop

* Corresponding author. Tel.: +61 7 33658359.
E-mail addresses: [email protected] (X. Zhu), [email protected] (Z. Huang), [email protected] (Y. Yang), [email protected] (H.T. Shen), [email protected] (C. Xu), [email protected] (J. Luo).
0031-3203/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2012.07.018

efficient and effective approaches to search for such "intrinsic" dimensionality. Dimensionality reduction techniques have been broadly used to address this by reducing the number of features. Moreover, dimensionality reduction techniques help to decrease time and space complexity as well as to make data more comprehensible [49]. Traditional dimensionality reduction methods have been demonstrated to be effective when sufficient data are provided. However, in real applications these models often lack sufficient data, which makes it difficult to build effective models. For instance, a personal album on Flickr1 usually contains only a small number of photos. In such a case, if we want to perform automatic tagging on these personal photos via visual content analysis, the learning models will probably not work effectively.

1 http://www.flickr.com/


Fig. 1. Examples of the learning models of transfer learning and self-taught learning, adapted from [11,25,35]. (a) Transfer learning. (b) Self-taught learning.

To handle this problem, a straightforward solution is to convey external knowledge (or information) to enhance the effectiveness of the model. Transfer learning [32] is capable of transferring external data (that is, data from a different but related task) to strengthen learning tasks. As illustrated in Fig. 1(a), the learning process of transfer learning requires external data to be relevant to target data (e.g., target data: "car" and "motorcycles" vs. external data: "bus", "tractor" and "aircraft"). Such an assumption makes transfer learning less flexible and effective when not enough relevant external data can be obtained. Recently, self-taught learning [35] was proposed to relax this strong assumption and allow us to leverage more possible external data in the learning task. That is, self-taught learning only requires that external data have a similar "modality" or "type" to target data, which can be obtained easily in real applications. For example, as shown in Fig. 1(b), natural scene images can be used to help categorize motor vehicle images via self-taught learning. Because it places significantly fewer restrictions on external data, self-taught learning is much easier to apply than transfer learning, and has been used in many practical applications [11,25], such as image, audio and text classification.

The above analysis motivates us to develop a new unsupervised dimensionality reduction approach for dealing with high-dimensional and small-sized data via self-taught learning. In this paper we formulate dimensionality reduction as a reconstruction process, proposing a Self-Taught Dimensionality Reduction (STDR) approach. The proposed STDR consists of three steps: (1) A set of bases is learnt from external data by employing existing dictionary learning models, such as online dictionary learning [29]. The learnt bases are the common part of both external data and target data. In real applications, external data can easily be obtained; thus sufficient external data ensure that effective bases can be obtained. (2) Target data are represented (or reconstructed) by the learnt bases through a robust joint sparse coding model. The proposed model not only provides robust reconstruction ability but also preserves the local structures amongst target data. After the reconstruction process, target data are sparsely represented by the learnt bases, and the unimportant (or noisy) features are represented with small values or zeros. (3) We delete these unimportant features, which performs feature selection on the new representations of target data.

The contributions of the proposed solution for performing dimensionality reduction on the high-dimensional and small-sized data are as follows:

• The STDR is devised to handle the high-dimensional and small-sized data by employing sufficient external data, which only need to have a similar "type" or "modality" to the target data. To the best of our knowledge, this is the first work that employs such external data in a reconstruction framework to perform dimensionality reduction on the high-dimensional and small-sized data.
• The proposed objective function of the STDR is a robust linear reconstruction model. That is, the $\ell_{2,1}$-norm loss function (a.k.a. the robust loss function) is designed to eliminate the impact of outliers as well as to achieve the minimal reconstruction error.
• The Laplacian prior is used to preserve local structures of target data. The Laplacian prior considers the correlations among target data and thus makes the learning process more effective [4].
• The $\ell_{2,1}$-norm regularization is designed to avoid over-fitting and to sparsely select a subset of the learnt bases to represent target data. Moreover, we design a novel solution to the resulting optimization problem via a simple algorithm which is theoretically guaranteed to converge to the global optimum.
• Extensive experiments are conducted on various types of datasets to illustrate the effectiveness of the proposed STDR method. The results show that the proposed STDR outperforms several state-of-the-art and baseline dimensionality reduction methods in terms of k-means clustering performance.

The remainder of the paper is organized as follows: related work is briefly reviewed in Section 2, followed by the proposed STDR approach in Section 3. The experimental results are reported and analyzed in Section 4, and Section 5 concludes the paper.

2. Related work

Dealing with high-dimensional data is very challenging for all kinds of learning tasks (or models) in data mining and machine learning. With many features, the hypothesis feature space becomes huge, which makes learning models prone to over-fitting and degrades their performance. Moreover, the built models become computationally inefficient and difficult to interpret. Dimensionality reduction methods are designed to address these problems. Dimensionality reduction methods usually fall into two categories, i.e., feature selection and feature extraction. Given a dataset with a large number of features, some of which are irrelevant, feature selection performs dimensionality reduction by removing the irrelevant features and outputting the relevant original features. Feature extraction reduces the dimensionality of the original data by combining the original features. The key difference between feature selection


and feature extraction is that feature selection keeps relevant original features, while feature extraction generates new features by combining the original features. Different applications require different dimensionality reduction methods. For example, feature selection is usually applied when the original features need to be kept for interpretation, whereas feature extraction is preferable when users expect better performance rather than interpretation of the derived results.

Feature selection algorithms [20,26,41,23] are broadly categorized into three groups: the filter model, the wrapper model and the embedded model. The filter model first analyzes the general characteristics of the data and then evaluates the features without involving a learning algorithm. In existing filter methods, feature selection is decided by pre-defined criteria, such as mutual information [15] and variable ranking [8]. For example, the Laplacian score method [18] ranks features by evaluating the locality preserving power of each feature. In real applications, the filter model is (relatively) robust against over-fitting, but may fail to select the most "useful" features. The wrapper model "wraps" the selection process to identify relevant features while requiring a predetermined learning algorithm [17]. For example, the method in [30] finds a subset of all the features by maximizing the performance of an SVM classifier. The wrapper model can in principle find the most "useful" features, so it often outperforms the filter model; however, it has a high computational cost and is prone to over-fitting. The embedded model (e.g., [6,9,44]) performs feature selection during model building. Thus the embedded model regards feature selection as part of the training process, in which useful features are obtained by optimizing the objective function of the learning model. The embedded model has recently received increasing interest due to its superior performance. For example, the method in [43] added an $\ell_0$-norm constraint to its objective function to achieve a sparse solution for effectively and efficiently performing feature selection. Both the method in [31], employing an $\ell_1$-norm regularization, and the method in [27], employing an $\ell_{2,1}$-norm regularization, were designed to achieve similar objectives. The embedded model is similar to the wrapper model, but has a lower computational cost and is less prone to over-fitting.

Feature extraction reduces the dimensionality of original data by combining original features under pre-set constraints. Feature extraction usually falls into two categories [22,47], i.e., projective methods and manifold methods. Projective methods attempt to find low-dimensional projections containing most of the information of the original data by maximizing pre-defined objective functions. Moreover, projective methods find an explicit transformation between the original data matrix and a low-dimensional space, i.e., a transformation matrix $\mathbf{V}$ such that the reduced data $\mathbf{Y}$ of the original data $\mathbf{X}$ are expressed as $\mathbf{Y} = \mathbf{V}^T\mathbf{X}$. For example, principal component analysis (PCA) [5] searches for a new feature space (that is, a subspace) by maximizing the variance of the data. Independent component analysis (ICA) [34] finds the projections by incorporating the probability distributions of the original data into the objective function. Manifold methods first assume that the original data lie on a low-dimensional manifold, then attempt to recover it by optimizing suitable objective functions; however, manifold methods do not form explicit projections [47]. For example, multidimensional scaling (MDS) [12] finds the top-ranked projections that preserve the inter-point distances (that is, dissimilarities); ISOMAP [40] preserves the isometric properties of the original space in the subspace; both locally linear embedding (LLE) [36] and the Laplacian eigenmap [3] focus on preserving the


local neighbor structures of the original data; the method in [10] preserves the separability of the data by using weighted displacement vectors; and the method in [45] was designed to preserve global coordinates through a compatible mapping. It has been shown (e.g., [22]) that projective methods and manifold methods can be brought together when kernels are employed.

3. Self-taught dimensionality reduction

In this section, we first describe the notations used in this paper and then give an overall description of the proposed STDR. Finally, we present a detailed elaboration of each component of the STDR: we first depict how to learn the bases from external data using dictionary learning methods; second, a robust sparse coding model is proposed to reconstruct target data based on the learnt bases; third, feature selection is conducted on the new representations of target data; fourth, we give the pseudocode of the STDR; finally, we give a theoretical analysis of the proposed robust joint graph sparse coding model.

3.1. Notation

For clarity, we summarize the notations used in this paper in Table 1. Besides, the $\ell_p$-norm of a vector $\mathbf{v} \in \mathbb{R}^n$ is defined as $\|\mathbf{v}\|_p = \big(\sum_{i=1}^{n} |v_i|^p\big)^{1/p}$, where $v_i$ is the $i$th element of $\mathbf{v}$. If $p = 0$, we get the "pseudo norm" (a.k.a. $\ell_0$-norm), defined as the number of non-zero elements in $\mathbf{v}$. The $\ell_{r,p}$-norm over a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$ is defined as $\|\mathbf{M}\|_{r,p} = \big(\sum_{i=1}^{n} \big(\sum_{j=1}^{m} |m_{ij}|^r\big)^{p/r}\big)^{1/p}$, where $m_{ij}$ is the element in the $i$th row and $j$th column.

Table 1
Notations used in this paper.

Notation             Description
Uppercase italic     Feature space
Lowercase italic     Scalar
Lowercase bold       Column vector
Uppercase bold       Matrix
Superscript T        Transpose
Superscript -1       Inverse of a matrix

3.2. Framework

Before elaborating the technical details of the proposed STDR, we briefly introduce its framework. The framework, presented in Fig. 2, includes three steps. In the first step, the STDR learns the common parts (that is, the bases) from a large amount of external data. External data can have a different data distribution but the same "type" or "modality" as (or be mildly related to) target data [25]. According to the example on image analysis in [11], different types of objects (such as diamond, ring, platinum and titanium) may share common features. For example, diamond and ring perhaps share similar features about "diamond"; ring and platinum share the same "modality" about "platinum"; platinum and titanium share the same "type" about "metal". In such a situation, external (or auxiliary) data can easily be obtained and help to obtain a better data representation for limited target data. Moreover, the fact that different datasets with different distributions share common features has been observed in many kinds of real applications, such as audio recognition and text classification.


Fig. 2. The framework of the proposed STDR approach. (The figure shows the three steps: 1. Learn bases from the external data, e.g., natural scene images; 2. Reconstruct the target data, e.g., car and motorcycle images, with the learnt bases; 3. Perform feature selection on the new representations.)

Thus the proposed STDR can utilize more external data than transfer learning, which requires that external data be related to target data (see, e.g., [38]). With sufficient external data, the model for learning the bases can be built effectively. The learnt bases are regarded as a "bridge" between external data and target data. In this way, useful knowledge (i.e., the learnt bases) in external data is transferred to target data. This is reasonable because: First, the bases are the common part between external data and target data, so it is feasible to use the learnt bases to represent target data; this makes the learning task on the limited target data more effective. Second, representing target data by the bases (i.e., high-level representations) has been shown to be more effective than representing target data with traditional methods, such as raw pixel features [35]. Third, the learnt bases can be reused for various learning tasks, such as for the limited target data, for the current external data, and others. Last but not least, the dimensionality of the bases can be larger or smaller than that of target data; we give a further discussion on this in Section 4.6.1.

In the second step, the STDR reconstructs target data in the new feature space (that is, the basis space) via the proposed Algorithm 2. The outputs of this step are the new representations of target data. Meanwhile, the redundant features of the new representations are assigned small values or shrunk to zero. In the third step, the STDR performs feature selection on the derived new representations of target data. According to the above analysis, the first step of the STDR is designed to deal with the issue of small-sized data by employing external data, while the second and third steps are designed to deal with the issue of high-dimensional data by performing feature selection on the new representations of target data. The objective is to search for the "intrinsic" dimensionality of high-dimensional data. Hence, the STDR can simultaneously handle the high-dimensional and small-sized data. Finally, after performing dimensionality reduction on the high-dimensional and small-sized data, the reduced data are fed into the k-means algorithm, and the clustering performance is used to evaluate the effect of the dimensionality reduction algorithm.

Comparing the proposed STDR method with feature selection and feature extraction, the STDR first generates the new representations of target data from external data, and then performs feature selection on the derived representations of target data. This is similar to the embedded model of feature selection, such as [31,46]. Therefore, the STDR can be categorized as feature selection rather than feature extraction. We also compare the STDR with existing embedded feature selection methods, such as the UDFS algorithm [46] and the RFS algorithm [31]. The UDFS performs feature selection by combining the discriminative ability of

the data and the local structures of target data. The UDFS was designed for unsupervised learning and does not employ external data. The RFS conducts feature selection under the assumption of supervised learning and does not employ external data either. Comparing the STDR with self-taught learning [11,25,35], both belong to sparse learning models, but they differ in several respects. First, the objective of self-taught learning in [11,25,35] is to utilize freely available unlabeled data to improve the performance of a supervised learning task, even if the unlabeled data cannot be associated with the labels of the task. The STDR learns the bases from external data (including unlabeled and labeled data) to reconstruct target data; its objective is to use a large amount of external data to achieve better dimensionality reduction in unsupervised learning. Second, to improve the performance of the learning task, the STDR takes more constraints into account than self-taught learning [11,25,35] does during the reconstruction process. For example, the robust loss function is designed to more effectively avoid the high impact of outliers in target data; the graph Laplacian regularization ensures that data points that are similar in the original space remain similar in the basis space; and the $\ell_{2,1}$-norm regularization is designed to generate row sparsity, i.e., sparsity across a whole feature (or row). In contrast, self-taught learning [11,25,35] only employs the standard sparse coding model (such as in [14,24], consisting of a least square loss function and an $\ell_1$-norm regularization) to reconstruct the labeled data. The literature [31] shows that the $\ell_{2,1}$-norm regularization performs feature selection better than the $\ell_1$-norm regularization. Third, the objective function of the STDR is solved by a novel solver, which is theoretically guaranteed to make the objective function of the STDR converge to the global optimum. The proposed solver is efficient since it optimizes the objective matrix as a whole; thus it can generate the sparse codes (i.e., the new representations) of target data efficiently. In contrast, the objective function of self-taught learning is solved by generating the new representation of one data point (or sample) at a time; this requires solving the Lasso problem [14] d times, where d is the dimensionality of target data.

3.3. Learning bases from external data

In order to transfer knowledge from external data to target data, we first need to find a proper bridge to connect them. The literature [35] has shown that diverse types of images may contain common basic visual patterns. This motivates us to build such a "bridge" by first extracting certain common elements (i.e., the bases) from sufficient external data, and then using the extracted bases to reconstruct target data. Dictionary learning [29] has been proven to be quite effective for learning bases in various applications, such as image processing and audio recognition. In this paper, we employ existing dictionary learning methods to search for such a "bridge". Given the external data matrix $\mathbf{X}^o = [\mathbf{x}^o_1, \mathbf{x}^o_2, \ldots, \mathbf{x}^o_{n_o}]$, dictionary learning can usually be formulated as follows:

$$\min_{\{\mathbf{B}, \mathbf{S}^o\}} \|\mathbf{X}^o - \mathbf{B}\mathbf{S}^o\|_F^2 + \lambda \sum_{i=1}^{n_o} \|\mathbf{s}^o_i\|_1 \quad \text{s.t. } \|\mathbf{B}_j\|_2 \le 1,\ j = 1, 2, \ldots, m, \tag{1}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\mathbf{B} = [\mathbf{B}_1, \mathbf{B}_2, \ldots, \mathbf{B}_m]$ denotes the $m$ bases learnt from $\mathbf{X}^o$, and $\mathbf{S}^o = [\mathbf{s}^o_1, \mathbf{s}^o_2, \ldots, \mathbf{s}^o_{n_o}]$ is the matrix of sparse codes. The constraint on each basis, $\|\mathbf{B}_j\|_2 \le 1$ ($j = 1, 2, \ldots, m$), prevents $\mathbf{B}$ from having arbitrarily large values, which might lead to very small values of $\mathbf{S}^o$. A series of algorithms have been designed to optimize the above objective function, such as online dictionary learning [29] and gradient descent with iterative projection [24,48]. In this paper, we employ the online dictionary learning method [29] because it efficiently handles large-scale data. In the implementation, we use the SPAMS [29] toolbox to solve the objective function in Eq. (1).
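As a small illustration (not part of the original paper), the basis-learning step of Eq. (1) can be sketched with scikit-learn's MiniBatchDictionaryLearning, which implements online dictionary learning in the spirit of [29]; the paper itself uses the SPAMS toolbox, and the data, the number of bases m and the sparsity weight below are placeholders.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# External data: one sample per row, shape (n_o, d). The paper stores samples
# as columns; scikit-learn expects samples as rows, hence the transposes.
X_ext = np.random.randn(3000, 256)     # placeholder external data

m = 400                                 # number of bases to learn (assumption)
dl = MiniBatchDictionaryLearning(n_components=m,
                                 alpha=0.1,      # sparsity weight, plays the role of lambda in Eq. (1)
                                 random_state=0)
dl.fit(X_ext)

# B has shape (d, m): each column is one learnt basis B_j (atoms are normalized,
# which satisfies the constraint ||B_j||_2 <= 1 of Eq. (1)).
B = dl.components_.T
```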

3.4. Reconstructing target data

In this part we focus on devising a robust and effective model to reconstruct target data from the learnt bases. To this end, we take three aspects into account: robust reconstruction, local structure preservation and elimination of irrelevant features. Given a set of $n$ target data points $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and the bases $\mathbf{B} \in \mathbb{R}^{d \times m}$ learnt from external data, the reconstruction of $\mathbf{X}$ using $\mathbf{B}$ can be achieved with various loss functions, including the least square loss, the logistic loss, the squared hinge loss, and so on. The literature (e.g., [13]) has shown that the $\ell_{2,1}$-norm loss function (i.e., the robust loss function) is robust against the adverse impact of outliers. In this paper, we employ the robust loss function in the proposed objective function to robustly reconstruct target data. The robust loss function is defined as

$$\min_{\mathbf{S}} \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{B}\mathbf{s}_i\|_2, \tag{2}$$

where $\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n]$ are the sparse codes corresponding to $\mathbf{X}$.

Given a loss function, it is common to add a regularization term to the optimization problem, since regularization can be designed to avoid over-fitting or to meet particular requirements, such as sparsity. In this paper, to perform feature selection, we should choose a regularization that distinguishes the features, i.e., separates the important features from the unimportant ones for the learning task. Motivated by the sparsity of sparse coding, we expect the important features to take non-zero values and the unimportant features to take zero values after the reconstruction process. We then discard the unimportant features (i.e., the features with zero values) and use the important features for the learning process. In practice, the $\ell_1$-norm regularization is able to achieve sparsity for each individual data point; it was proposed as a practical approximation to the $\ell_0$-norm regularization [14,28]. The $\ell_1$-norm regularization is often used in separable sparse coding models, such as graph sparse coding [16,48] and the Lasso [14]. However, the $\ell_1$-norm regularization does not guarantee that all data points are sparse in the same features.


Fortunately, the $\ell_{2,1}$-norm regularization satisfies our expectation. It is defined as

$$\|\mathbf{S}\|_{2,1} = \sum_{j=1}^{m} \|(\mathbf{s})^j\|_2, \tag{3}$$

where $(\mathbf{s})^j$ is the $j$th row of $\mathbf{S}$, which indicates the effect of the $j$th feature on all the data points. The $\ell_{2,1}$-norm regularization penalizes all coefficients in a row together, since it regards each single feature as a whole. Therefore, it achieves row-level sparsity for all data points at the same time. The $\ell_{2,1}$-norm is also rotationally invariant for rows: given any rotation matrix $\mathbf{R}$, $\|\mathbf{M}\mathbf{R}\|_{2,1} = \|\mathbf{M}\|_{2,1}$.
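As a quick illustration (not from the paper), the row-wise $\ell_{2,1}$-norm of Eq. (3) can be computed in a few lines of NumPy; the matrix S below is a hypothetical code matrix with features as rows and samples as columns.

```python
import numpy as np

S = np.random.randn(400, 500)   # hypothetical code matrix: m rows (features) x n columns (samples)

# l2,1-norm of Eq. (3): sum over rows of the l2-norm of each row.
l21 = np.sum(np.linalg.norm(S, axis=1))

# For comparison, the element-wise l1-norm penalizes entries independently,
# so it does not force whole rows (i.e., whole features) to zero.
l1 = np.sum(np.abs(S))
```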

Therefore, we choose the $\ell_{2,1}$-norm as the regularization term to achieve feature selection:

$$\min_{\mathbf{S}} \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{B}\mathbf{s}_i\|_2 + \lambda \sum_{j=1}^{m} \|(\mathbf{s})^j\|_2, \tag{4}$$

where $\lambda > 0$ is a tuning parameter.

Furthermore, one might expect that the local structures of target data in the original space are well preserved in the new feature space [4]; that is, data points that are close in the original space should also be close in the intrinsic geometry (i.e., the basis space). To achieve this, the STDR builds a k-nearest-neighbor graph over the target data. More specifically, following the idea in [4,19], we use a heat kernel $w_{ij} = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2/\sigma}$ ($\sigma$ is a tuning parameter; we set $\sigma = 1$ in this paper) to build a weight matrix $\mathbf{W}$. The value of $w_{ij}$ measures the closeness of two points $\mathbf{x}_i$ and $\mathbf{x}_j$, and we set $w_{ii} = 0$ to avoid the problem of scale. Given the weight matrix $\mathbf{W}$, we use the Euclidean distance to measure the smoothness between $\mathbf{s}_i$ and $\mathbf{s}_j$ (where $\mathbf{s}_i$ and $\mathbf{s}_j$ are the projections of $\mathbf{x}_i$ and $\mathbf{x}_j$ in the basis space):

$$\frac{1}{2}\sum_{i,j} \|\mathbf{s}_i - \mathbf{s}_j\|^2 w_{ij} = \sum_{i} \mathbf{s}_i^T D_{ii}\, \mathbf{s}_i - \sum_{i,j} w_{ij}\, \mathbf{s}_i^T \mathbf{s}_j = \operatorname{tr}(\mathbf{S}\mathbf{D}\mathbf{S}^T) - \operatorname{tr}(\mathbf{S}\mathbf{W}\mathbf{S}^T) = \operatorname{tr}(\mathbf{S}\mathbf{L}\mathbf{S}^T), \tag{5}$$

where $\mathbf{D}$ is a diagonal matrix whose $i$th diagonal element is the sum of the $i$th column of $\mathbf{W}$, that is, $D_{ii} = \sum_j w_{ij}$. Obviously, $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is a Laplacian matrix.
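A minimal sketch (not from the paper) of the heat-kernel weight matrix and graph Laplacian construction follows. It assumes target samples are stored as columns of X, as in the text; the way the k-nearest-neighbor graph is sparsified and symmetrized here is a simplifying assumption, since the paper does not spell out those details.

```python
import numpy as np

def build_laplacian(X, k=5, sigma=1.0):
    """X: d x n target data (columns are samples). Returns L = D - W."""
    n = X.shape[1]
    # Pairwise squared Euclidean distances between columns.
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    W = np.exp(-dist2 / sigma)            # heat kernel weights w_ij
    np.fill_diagonal(W, 0.0)              # w_ii = 0, as in the paper

    # Keep only the k nearest neighbors of each point, then symmetrize (assumption).
    for i in range(n):
        far = np.argsort(dist2[i])[k + 1:]    # indices beyond self + k nearest
        W[i, far] = 0.0
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=0))            # D_ii = sum of the i-th column of W
    return D - W
```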

By integrating the Laplacian prior in Eq. (5) into Eq. (4), we obtain the final objective function for reconstructing target data:

$$\min_{\mathbf{S}} \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{B}\mathbf{s}_i\|_2 + \alpha\, \operatorname{tr}(\mathbf{S}\mathbf{L}\mathbf{S}^T) + \lambda \|\mathbf{S}\|_{2,1}, \tag{6}$$

where $\alpha \ge 0$ and $\lambda > 0$ are tuning parameters. With $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ and $\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_n]$, Eq. (6) can be rewritten as

$$\min_{\mathbf{S}} \|\mathbf{X} - \mathbf{B}\mathbf{S}\|_{2,1} + \alpha\, \operatorname{tr}(\mathbf{S}\mathbf{L}\mathbf{S}^T) + \lambda \|\mathbf{S}\|_{2,1}, \tag{7}$$

where $\alpha \ge 0$ and $\lambda > 0$ are the tuning parameters.

3.5. Feature selection

After solving Eq. (6) (or Eq. (7)), whose details can be found in Section 3.7 and Algorithm 2, we obtain the new representations $\mathbf{S}$ ($\mathbf{S} \in \mathbb{R}^{m \times n}$) of the original target data $\mathbf{X}$ ($\mathbf{X} \in \mathbb{R}^{d \times n}$). Due to the $\ell_{2,1}$-norm regularization, most of the rows in $\mathbf{S}$ shrink to zero, which implies that the features corresponding to these zero rows are not important in the new representations. To achieve efficiency in subsequent learning tasks, we may remove them, i.e., we are able to perform dimensionality reduction on the original target data. More specifically, we first rank the rows of $\mathbf{S}$ in descending order according to the $\ell_2$-norm of each row, $\|(\mathbf{s})^j\|_2$, $j = 1, 2, \ldots, m$, and then select the top-ranked rows as the result of dimensionality reduction.
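The feature-selection step of Section 3.5 amounts to ranking rows of S by their $\ell_2$-norms and keeping the largest ones. A sketch with illustrative names follows.

```python
import numpy as np

def select_features(S, n_keep):
    """Keep the n_keep rows of the m x n code matrix S with the largest l2-norms."""
    row_norms = np.linalg.norm(S, axis=1)            # ||(s)^j||_2 for each row j
    top = np.argsort(row_norms)[::-1][:n_keep]       # indices of the top-ranked rows
    return S[top, :], top

# Example: reduce the new representations to 200 dimensions.
# S_reduced, kept_rows = select_features(S, n_keep=200)
```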



Fig. 3. A toy dataset example for the proposed STDR. For better viewing, see the original color pdf file. Note that blue dots represent the points in the first cluster and red stars represent the points in the second cluster. (a) Original toy data. (b) Reduced data derived by the STDR. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Actually, the bases learnt from external data can be applied to various kinds of data, so they may contain redundancy and noise for a particular dataset. Fortunately, the regularization terms (i.e., the Laplacian prior term and the $\ell_{2,1}$-norm regularization term) in Eq. (6) can detect them by forcing the corresponding rows to have small values. Hence, the proposed model helps to achieve effective and efficient learning by preserving locality and removing redundancy and noise.

3.6. Pseudocode of the STDR approach

In summary, the proposed STDR method is given as follows:

Algorithm 1. Pseudocode of the proposed STDR approach.

Input: $\mathbf{X} \in \mathbb{R}^{d \times n}$: target data; $\mathbf{X}^o \in \mathbb{R}^{d \times n_o}$: external data; $m$: the number of selected features;
Output: Reduced representation $\mathbf{S}'$;
1 Learn bases $\mathbf{B}$ from external data $\mathbf{X}^o$; /* Section 3.3 */
2 Generate new representations $\mathbf{S}$ of $\mathbf{X}$ by performing Algorithm 2; /* Section 3.4 */
3 Generate the reduced feature representation $\mathbf{S}'$ by performing feature selection on $\mathbf{S}$; /* Section 3.5 */

Now we give a toy example to describe the process of the proposed STDR. We generate a toy dataset (200 data points, each represented by 4 dimensions) following a Gaussian distribution. This toy dataset includes two clusters (or classes): one cluster contains 101 data points and the other contains 99. We plot the first two dimensions of the toy data in the left sub-figure of Fig. 3, and the last two dimensions in the middle sub-figure of Fig. 3. We perform the k-means clustering algorithm on the toy data and obtain a clustering accuracy (ACC, defined in Section 4.1.4) of 80.5%. We also perform k-means on six sub-datasets (each consisting of two dimensions of the original toy data), and obtain a best ACC of 73.5% and a worst of 50.5% among the six results. In our STDR, we first generate external data (3000 data points, each represented by 4 dimensions) following an exponential distribution. Second, we obtain the bases (the matrix $\mathbf{B} \in \mathbb{R}^{4 \times 4}$) by conducting the first step of Algorithm 1 on the external data. Third, with the same parameter settings as in Section 4, we obtain the new representations $\mathbf{S}$ ($\mathbf{S} \in \mathbb{R}^{4 \times 200}$) by conducting Algorithm 2. Fourth, we implement the third step of Algorithm 1 to obtain the reduced dataset $\mathbf{S}' \in \mathbb{R}^{2 \times 200}$. Fifth, we perform the k-means clustering algorithm on $\mathbf{S}'$ and obtain an ACC of 94.5%. Moreover, we plot all data points of $\mathbf{S}'$ in the right sub-figure of Fig. 3.
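A Python sketch of Algorithm 1 (not from the paper) is given below. It reuses the helper sketches introduced earlier (`build_laplacian`, `select_features`) and assumes two further illustrative helpers: `learn_bases`, wrapping the dictionary-learning snippet of Section 3.3, and `solve_eq6`, implementing Algorithm 2 as sketched in Section 3.7.

```python
def stdr(X_target, X_external, m, n_keep, alpha=1.0, lam=10.0):
    """Sketch of Algorithm 1: self-taught dimensionality reduction.

    X_target:   d x n target data (columns are samples).
    X_external: d x n_o external data of the same "type" or "modality".
    m:          number of bases; n_keep: number of selected features.
    learn_bases, solve_eq6, build_laplacian and select_features are the
    illustrative helpers described in the lead-in; they are not part of the paper.
    """
    B = learn_bases(X_external, m)                    # step 1 (Section 3.3)
    L = build_laplacian(X_target, k=5, sigma=1.0)     # graph Laplacian of Eq. (5)
    S = solve_eq6(X_target, B, L, alpha, lam)         # step 2 (Algorithm 2)
    S_reduced, kept = select_features(S, n_keep)      # step 3 (Section 3.5)
    return S_reduced
```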

3.7. Optimization

3.7.1. The proposed solver

The objective function in Eq. (6) (or Eq. (7)) is convex, so it has a global optimum. However, the optimization is challenging since the $\ell_{2,1}$-norm regularization is non-smooth. To efficiently minimize the objective function in Eq. (6), we proceed as follows. Setting the derivative of the objective function in Eq. (6) (or Eq. (7)) with respect to $\mathbf{S}$ to zero, we obtain

$$(\mathbf{B}^T \mathbf{D}_l \mathbf{B} + \lambda \mathbf{D}_r)\,\mathbf{S} + \mathbf{S}\,(\alpha \mathbf{L}) = \mathbf{B}^T \mathbf{D}_l \mathbf{X}, \tag{8}$$

where the derivative of the first term in Eq. (6) gives $\mathbf{B}^T \mathbf{D}_l \mathbf{B}\mathbf{S} - \mathbf{B}^T \mathbf{D}_l \mathbf{X}$ and the derivative of the third term gives $\lambda \mathbf{D}_r \mathbf{S}$. Here $\mathbf{D}_r$ and $\mathbf{D}_l$ are diagonal matrices with $i$th diagonal elements $d^r_{i,i} = 1/(2\|(\mathbf{S})^i\|_2)$ and $d^l_{i,i} = 1/(2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2)$, respectively, where $(\mathbf{M})^i$ denotes the $i$th row of a matrix $\mathbf{M}$. Since both $\mathbf{D}_r$ and $\mathbf{D}_l$ depend on the value of $\mathbf{S}$, it is impractical to compute $\mathbf{S}$ directly. We therefore design a novel iterative algorithm to optimize Eq. (6) by alternately computing $\mathbf{S}$, $\mathbf{D}_r$ and $\mathbf{D}_l$. We first summarize the details in Algorithm 2 and then prove that in each iteration the updated $\mathbf{S}$, $\mathbf{D}_r$ and $\mathbf{D}_l$ decrease the value of the objective function in Eq. (6). As seen in Algorithm 2, in each iteration, given fixed $\mathbf{D}_r$ and $\mathbf{D}_l$, $\mathbf{S}$ is first obtained from Eq. (8) via the solution of the Sylvester equation.2 Then $\mathbf{D}_r$ and $\mathbf{D}_l$ are updated by $d^r_{i,i} = 1/(2\|(\mathbf{S})^i\|_2)$ and $d^l_{i,i} = 1/(2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2)$, respectively. The iteration is repeated until there is no change in the value of the objective function.

2 Given known $\mathbf{D}_r$ and $\mathbf{D}_l$, Eq. (8) becomes the standard form of the Sylvester equation [2], which can be solved by the Matlab function lyap or the software LAPACK [1]. More specifically, given known matrices A, B and C, the Sylvester equation determines the unknown X via the equation AX + XB = C. For example, in Eq. (8), given the known matrices $(\mathbf{B}^T \mathbf{D}_l \mathbf{B} + \lambda \mathbf{D}_r)$, $\alpha\mathbf{L}$ and $\mathbf{B}^T \mathbf{D}_l \mathbf{X}$, the Matlab function lyap is used to compute the unknown $\mathbf{S}$.
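A minimal Python sketch of this alternating procedure (Algorithm 2 below) follows, using scipy.linalg.solve_sylvester in place of Matlab's lyap. The stopping tolerance and the small epsilon guarding the divisions are my own additions, not from the paper.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_eq6(X, B, L, alpha, lam, max_iter=50, tol=1e-6, eps=1e-12):
    """Alternating solver sketch for Eq. (6)/(7)."""
    d, n = X.shape
    m = B.shape[1]
    Dr = np.eye(m)                        # D_r^0: m x m identity
    Dl = np.eye(d)                        # D_l^0: d x d identity
    prev = np.inf
    for _ in range(max_iter):
        # Eq. (8): (B^T Dl B + lam*Dr) S + S (alpha*L) = B^T Dl X.
        A = B.T @ Dl @ B + lam * Dr
        S = solve_sylvester(A, alpha * L, B.T @ Dl @ X)

        # Re-weighting matrices from the current S (row-wise l2-norms).
        Dr = np.diag(1.0 / (2.0 * np.linalg.norm(S, axis=1) + eps))
        R = X - B @ S
        Dl = np.diag(1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps))

        # Objective of Eq. (6), used here as a stopping criterion (assumption).
        obj = (np.linalg.norm(R, axis=0).sum()
               + alpha * np.trace(S @ L @ S.T)
               + lam * np.linalg.norm(S, axis=1).sum())
        if abs(prev - obj) < tol:
            break
        prev = obj
    return S
```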


Algorithm 2. An iterative algorithm for solving Eq. (6).

Input: $\mathbf{X} \in \mathbb{R}^{d \times n}$, $\mathbf{B} \in \mathbb{R}^{d \times m}$, $\mathbf{L} \in \mathbb{R}^{n \times n}$, $\alpha$, $\lambda$;
Output: $\mathbf{S} \in \mathbb{R}^{m \times n}$;
1 Initialize $t = 0$;
2 Initialize $\mathbf{D}_r^0$ as an $m \times m$ identity matrix;
3 Initialize $\mathbf{D}_l^0$ as a $d \times d$ identity matrix;
4 repeat
5   Update $\mathbf{S}_{t+1}$ by solving the Sylvester equation in Eq. (8);
6   Update $\mathbf{D}_r^{t+1}$ by computing its $i$th diagonal element $d^r_{i,i} = \frac{1}{2\|(\mathbf{S}_{t+1})^i\|_2}$;
7   Update $\mathbf{D}_l^{t+1}$ by computing its $i$th diagonal element $d^l_{i,i} = \frac{1}{2\|(\mathbf{X}-\mathbf{B}\mathbf{S}_{t+1})^i\|_2}$;
8   $t = t + 1$;
9 until there is no change to the value of the objective function in Eq. (6);

3.7.2. Convergence

In this part we introduce Theorem 1, which guarantees that the value of the objective function in Eq. (6) monotonically decreases in each iteration of Algorithm 2. First we state the following lemma, similar to that in [31].

Lemma 1. For any positive values $a_i$ and $b_i$, $i = 1, \ldots, m$, the following always holds:

$$2\sqrt{a_i b_i} \le a_i + b_i \;\Rightarrow\; 2\sqrt{a_i b_i} - a_i \le 2b_i - b_i \;\Rightarrow\; \sqrt{a_i} - \frac{a_i}{2\sqrt{b_i}} \le \sqrt{b_i} - \frac{b_i}{2\sqrt{b_i}}. \tag{9}$$

Theorem 1. In each iteration, Algorithm 2 monotonically decreases the value of the objective function in Eq. (6).

Proof. In order to prove Theorem 1, we first introduce an auxiliary optimization problem with respect to $\mathbf{S}$:

$$\min_{\mathbf{S}_{t+1}} \operatorname{tr}\big((\mathbf{X}-\mathbf{B}\mathbf{S}_{t+1})^T \mathbf{D}_l (\mathbf{X}-\mathbf{B}\mathbf{S}_{t+1})\big) + \lambda\, \operatorname{tr}(\mathbf{S}_{t+1}^T \mathbf{D}_r \mathbf{S}_{t+1}) + \alpha\, \operatorname{tr}(\mathbf{S}_{t+1} \mathbf{L} \mathbf{S}_{t+1}^T), \tag{10}$$

where $\mathbf{D}_r$ and $\mathbf{D}_l$ are calculated according to $\mathbf{S}_t$ in the $t$th iteration. Denoting $\mathbf{S}_{t+1}$ as the variable of the objective function in Eq. (10), and $\mathbf{S}$, $\mathbf{D}_r$, $\mathbf{D}_l$ as the optimal results obtained in the $t$th iteration, we have

$$\mathbf{S}^*_{t+1} = \arg\min_{\mathbf{S}_{t+1}} \operatorname{tr}\big((\mathbf{X}-\mathbf{B}\mathbf{S}_{t+1})^T \mathbf{D}_l (\mathbf{X}-\mathbf{B}\mathbf{S}_{t+1})\big) + \lambda\, \operatorname{tr}(\mathbf{S}_{t+1}^T \mathbf{D}_r \mathbf{S}_{t+1}) + \alpha\, \operatorname{tr}(\mathbf{S}_{t+1} \mathbf{L} \mathbf{S}_{t+1}^T), \tag{11}$$

which indicates that

$$\operatorname{tr}\big((\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^T \mathbf{D}_l (\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})\big) + \lambda\, \operatorname{tr}(\mathbf{S}^{*T}_{t+1} \mathbf{D}_r \mathbf{S}^*_{t+1}) + \alpha\, \operatorname{tr}(\mathbf{S}^*_{t+1} \mathbf{L} \mathbf{S}^{*T}_{t+1}) \le \operatorname{tr}\big((\mathbf{X}-\mathbf{B}\mathbf{S})^T \mathbf{D}_l (\mathbf{X}-\mathbf{B}\mathbf{S})\big) + \lambda\, \operatorname{tr}(\mathbf{S}^T \mathbf{D}_r \mathbf{S}) + \alpha\, \operatorname{tr}(\mathbf{S} \mathbf{L} \mathbf{S}^T). \tag{12}$$

By changing the trace form into the form of summation, we have

$$\sum_{i=1}^{m} \frac{\|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2^2}{2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2} + \lambda \sum_{i=1}^{m} \frac{\|(\mathbf{S}^*_{t+1})^i\|_2^2}{2\|(\mathbf{S})^i\|_2} + \alpha\, \operatorname{tr}(\mathbf{S}^*_{t+1} \mathbf{L} \mathbf{S}^{*T}_{t+1}) \le \sum_{i=1}^{m} \frac{\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2^2}{2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2} + \lambda \sum_{i=1}^{m} \frac{\|(\mathbf{S})^i\|_2^2}{2\|(\mathbf{S})^i\|_2} + \alpha\, \operatorname{tr}(\mathbf{S} \mathbf{L} \mathbf{S}^T), \tag{13}$$

where $(\mathbf{M})^i$ denotes the $i$th row of a matrix $\mathbf{M}$. After performing a simple mathematical transformation, we have

$$\begin{aligned}
&\sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2 - \left(\sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2^2}{2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2}\right) \\
&\quad + \lambda\left[\sum_{i=1}^{m} \|(\mathbf{S}^*_{t+1})^i\|_2 - \left(\sum_{i=1}^{m} \|(\mathbf{S}^*_{t+1})^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{S}^*_{t+1})^i\|_2^2}{2\|(\mathbf{S})^i\|_2}\right)\right] + \alpha\, \operatorname{tr}(\mathbf{S}^*_{t+1} \mathbf{L} \mathbf{S}^{*T}_{t+1}) \\
&\le \sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2 - \left(\sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2^2}{2\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2}\right) \\
&\quad + \lambda\left[\sum_{i=1}^{m} \|(\mathbf{S})^i\|_2 - \left(\sum_{i=1}^{m} \|(\mathbf{S})^i\|_2 - \sum_{i=1}^{m} \frac{\|(\mathbf{S})^i\|_2^2}{2\|(\mathbf{S})^i\|_2}\right)\right] + \alpha\, \operatorname{tr}(\mathbf{S} \mathbf{L} \mathbf{S}^T). \tag{14}
\end{aligned}$$

By substituting $a_i$ and $b_i$ in Eq. (9) with $\|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2^2$ (or $\|(\mathbf{S}^*_{t+1})^i\|_2^2$) and $\|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2^2$ (or $\|(\mathbf{S})^i\|_2^2$), respectively, and then summing the results over $i = 1, 2, \ldots, m$, we have

$$\sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S}^*_{t+1})^i\|_2 + \lambda \sum_{i=1}^{m} \|(\mathbf{S}^*_{t+1})^i\|_2 + \alpha\, \operatorname{tr}(\mathbf{S}^*_{t+1} \mathbf{L} \mathbf{S}^{*T}_{t+1}) \le \sum_{i=1}^{m} \|(\mathbf{X}-\mathbf{B}\mathbf{S})^i\|_2 + \lambda \sum_{i=1}^{m} \|(\mathbf{S})^i\|_2 + \alpha\, \operatorname{tr}(\mathbf{S} \mathbf{L} \mathbf{S}^T). \tag{15}$$

This indicates that the value of the objective function in Eq. (6) monotonically decreases in each iteration of Algorithm 2. Therefore, due to the convexity of Eq. (6), Algorithm 2 enables the objective function in Eq. (6) to converge to the global optimum. □

4. Experimental results

In order to evaluate the effectiveness of the proposed STDR method, we apply it to two real applications, i.e., image analysis and document analysis. For image analysis, we use three image datasets: USPS,3 Letter4 and MNIST.5 For document analysis, we use three textual datasets: Reuters21578,6 20Newsgroups7 and TDT2.8

4.1. Experiments setup

4.1.1. Dataset

Details of the datasets used in our experiments are presented in Table 2. We describe how the self-taught learning tasks are built according to the existing literature [11,25,35]. Both USPS and MNIST are handwritten digit databases. USPS contains 9298 images in 10 categories. MNIST contains 10,000 samples in 10 categories. The Letter dataset has 20,000 unique stimuli representing the 26 capital letters of the English alphabet. In the first two experiments of Section 4.2, we take USPS and MNIST as target data, respectively, while the Letter dataset is the external data. In the third experiment of Section 4.2, we take the Letter dataset as target data while regarding USPS as external data. The Reuters21578 corpus is a set of documents that appeared on the Reuters newswire in 1987. In our experiments, we use 7285 documents in total to learn the bases for the target data, that is, TDT2. The original TDT2 corpus contains news stories collected from 6 sources, including 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). In our fourth experiment of Section 4.2, we delete the newswire sources (that is, APW and NYT), which may be relevant to Reuters21578, from

3 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html/usps
4 http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
5 http://www.uk.research.att.com/facedatabase.html
6 http://www.daviddlewis.com/resources/testcollections/reuters21578/
7 http://people.csail.mit.edu/jrennie/20Newsgroups/
8 http://www.itl.nist.gov/iad/mig/tests/tdt/1998/


Table 2
Details of self-taught dimensionality reduction applications evaluated in the experiments.

Domain              External data              Target data                  #(class)   Raw features of target data
Image analysis      Letter                     USPS                         10         Intensities in 28 x 28 pixel digit image
Image analysis      Letter                     MNIST                        10         Intensities in 28 x 28 pixel digit image
Image analysis      USPS                       Letter                       26         Intensities in 28 x 28 pixel character image
Document analysis   Reuters newswire           TDT2                         94         Bag-of-words with 500 words
Document analysis   964 newsgroup documents    17,810 newsgroup documents   19         Bag-of-words with 500 words

dataset TDT2. The remaining data, which contain 10,733 stories in 94 categories, are regarded as target data. The 20Newsgroups dataset is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other, while others are highly unrelated (e.g., misc.forsale vs. the others). In our fifth experiment of Section 4.2, we use one group (that is, the 964 documents of the group misc.forsale) to learn the bases for the target data, that is, the remaining 19 groups with 17,810 documents in total. In this paper, both target data and external data are represented in a standard way, i.e., raw pixel intensities for images and the bag-of-words (with vocabulary size 500) representation for text.
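The paper does not specify the exact text-preprocessing toolchain. A minimal sketch of the 500-word bag-of-words representation using scikit-learn's CountVectorizer follows; the document lists are placeholders, and fitting the vocabulary on the target corpus is an assumption of this sketch rather than a statement from the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpora: lists of raw document strings.
docs_target = ["story about politics ...", "story about sports ..."]
docs_external = ["reuters newswire item ...", "another newswire item ..."]

# Fit a 500-word vocabulary; both corpora are mapped into the same feature space.
vectorizer = CountVectorizer(max_features=500)
X_target = vectorizer.fit_transform(docs_target).toarray().T     # d x n, columns are documents
X_external = vectorizer.transform(docs_external).toarray().T     # d x n_o
```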

• Original: all original features are used to conduct k-means clustering. We want to know whether the dimensionality reduction algorithms can improve the clustering performance on the high-dimensional and small-sized data.
• LScore: Laplacian Score [18] belongs to the filter model of feature selection. The Laplacian score of a feature is evaluated by its locality preserving power; a feature receives a high score if data points in the same topic are close to each other.
• MCFS: Multi-Cluster Feature Selection [7] selects features using spectral regression with the $\ell_1$-norm regularization. That is, the MCFS sequentially performs the LPP [4,19] and least square regression.
• UDFS: Unsupervised Discriminative Feature Selection [46] conducts feature selection by combining discriminative analysis with the local structures of target data.
• LPP: Locality Preserving Projections (LPP) [18], as a feature extraction method, does not take least square regression into account.
• JFS: Joint Feature Selection (JFS) considers the minimal reconstruction error but not the local structures of target data. We adapt the supervised feature selection algorithm RFS in [31] to obtain the JFS in our experiments. The proposed STDR combines the LPP with the JFS, so both the JFS and the LPP are baselines of the proposed STDR.
• JGFS: Joint Graph Feature Selection can be regarded as an extension of the proposed STDR approach. Different from the STDR, the JGFS builds the dimensionality reduction model without using external data.

4.1.3. Experimental setting

In our experiments, we set the parameters of the comparison algorithms by following the instructions in their papers. For example, the parameter $\lambda$ that controls the weight between the discriminative ability and the similarity matrix in the UDFS is set to 1000 according to the setting in [46]. The reduced dimensionality of the MCFS is set to the number of classes of the original target data, following the setting in [7]. In both the JGFS and the STDR, the parameter $\alpha$ is chosen from {0.01, 0.1, 1, 10} and the parameter $\lambda$ from {0.001, 0.1, 10, 1000}. In the JFS, the parameter $\lambda$ is chosen from {0.001, 0.1, 10, 1000}. For the algorithms that need to build a k-nearest-neighbor graph (the LScore, the LPP, the MCFS, the UDFS, the JGFS and the STDR), we use the Euclidean distance metric, set k to 5, and use a heat kernel of width 1 as the weight mode. Since the algorithms that cannot use external data (the "Original", the LScore, the MCFS, the UDFS, the LPP and the JGFS) work on target data only, we first perform dimensionality reduction with these algorithms on target data and then run the k-means algorithm on the reduced data. In both the JFS and the STDR, we first use all the external data to learn the bases, then reconstruct target data in the learnt basis space, perform feature selection on the new representations of target data to generate the reduced data, and finally run the k-means clustering algorithm on the reduced data.

For all algorithms, first, to generate small-sized target data, we randomly sample (with an even ratio for each class) the original target data. The sample sizes are set to 500, 1000 and 2000, and for each sample size we generate 10 target datasets. For example, in the first two experiments of Section 4.2, for the original target data USPS, we generate 30 target datasets, i.e., ten datasets for each of the sizes 500, 1000 and 2000. Second, the remaining dimensionality after performing dimensionality reduction on our target datasets is kept at {200, 400, 600} for the USPS, Letter and MNIST datasets, and {100, 300, 400} for the document datasets. That is, three reduced datasets are generated from each target dataset by each dimensionality reduction method. Third, we perform the k-means clustering algorithm on each reduced dataset and use the clustering performance to evaluate the effectiveness of all dimensionality reduction algorithms. In the k-means clustering algorithm, the number of clusters is set to the number of classes of the original target dataset. We run k-means 20 times with random initializations on each reduced dataset and take the average over the 20 runs as the result for that reduced dataset. The best result among the three reduced datasets is regarded as the final result for one target dataset, and we report the average over the 10 target datasets as the final result for one sample size.

In Section 4.2, we compare the clustering performance on the reduced data derived by all dimensionality reduction algorithms to evaluate the effectiveness of the proposed STDR. In Section 4.3, we test the parameter sensitivity of the proposed method with respect to the variation of the parameters $\alpha$ and $\lambda$ in Eq. (6) (or Eq. (7)), aiming at achieving the best performance of the proposed STDR. In Section 4.4, we use different external data to learn the bases for the same target data, aiming at analyzing the effect of different external data on the proposed STDR. In Section 4.5, we evaluate the convergence rate of the proposed Algorithm 2 on all five datasets, in terms of the objective function value in each iteration, to evaluate the efficiency of our optimization algorithm. In Section 4.6, we give a brief discussion on setting the number of bases as well as the complexity of solving the Sylvester equation in the proposed STDR.
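The "20 runs of k-means with random initializations, then average" part of this protocol can be sketched as follows (not from the paper); `metric` is a placeholder for either ACC (Eq. (16)) or NMI (Eq. (17)).

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_score(Z, y_true, n_clusters, metric, n_runs=20):
    """Average clustering score over n_runs random k-means initializations.

    Z: reduced data with samples as columns; metric(y_true, y_pred) is a
    placeholder for ACC or NMI as defined in Section 4.1.4.
    """
    scores = []
    for r in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=r)
        y_pred = km.fit_predict(Z.T)      # k-means expects samples as rows
        scores.append(metric(y_true, y_pred))
    return float(np.mean(scores))
```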


4.1.4. Evaluation metrics

We evaluate the clustering results by comparing the derived cluster labels (via the dimensionality reduction algorithms) with the ground-truth labels (provided by the original datasets). Two standard clustering metrics, accuracy (ACC) and normalized mutual information (NMI), are used to measure the clustering performance. Given a data point $\mathbf{x}_i$, denote $y_i$ and $y^*_i$ as its cluster label and its ground-truth label, respectively. ACC is defined as

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta(y_i, \mathrm{map}(y^*_i))}{n}, \tag{16}$$

where $n$ is the number of samples and $\delta(x, y)$ is the delta function: $\delta(x, y) = 1$ if $x = y$, and $\delta(x, y) = 0$ otherwise. A clustering algorithm can tell us which data points are in the same cluster, but cannot assign the exact cluster labels of the ground truth. Hence, $\mathrm{map}(y^*_i)$ is the optimal mapping function that permutes cluster labels to match the ground-truth labels. In our experiments, we employ the Kuhn-Munkres algorithm [33] to find the optimal mapping. According to Eq. (16), the maximal value of ACC is 1, and the minimal value is $1/k$ when there are $k$ clusters and every cluster contains an equal number of data points. Following [39], NMI is defined as

$$\mathrm{NMI} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{i,j}\log\dfrac{n\, n_{i,j}}{\tilde{n}_i\, n_j}}{\sqrt{\Big(\sum_{i=1}^{k} \tilde{n}_i \log\dfrac{\tilde{n}_i}{n}\Big)\Big(\sum_{j=1}^{k} n_j \log\dfrac{n_j}{n}\Big)}}, \tag{17}$$

where $n$ is the number of all data points, $\tilde{n}_i$ is the number of data points in the $i$th cluster of the cluster labels, $n_j$ is the number of data points in the $j$th cluster of the ground-truth labels, and $n_{i,j}$ is the number of data points in the intersection of the two. The value of NMI equals 1 if the cluster labels are identical to the ground-truth labels, and 0 if they are independent. Note that ACC and NMI are two independent indicators of clustering performance: it is possible for ACC to be larger while NMI is smaller, and vice versa.
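These metrics can be computed as in the sketch below (not from the paper), using scipy's linear_sum_assignment for the Kuhn-Munkres mapping and scikit-learn's NMI; note that the normalization convention of the library NMI may differ slightly from Eq. (17) depending on the version.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def acc(y_true, y_pred):
    """Clustering accuracy (Eq. (16)): best permutation of predicted cluster labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
    # Contingency table between predicted clusters and ground-truth classes.
    counts = np.zeros((len(labels_pred), len(labels_true)), dtype=int)
    for i, lp in enumerate(labels_pred):
        for j, lt in enumerate(labels_true):
            counts[i, j] = np.sum((y_pred == lp) & (y_true == lt))
    # Kuhn-Munkres assignment that maximizes the matched counts.
    row, col = linear_sum_assignment(-counts)
    return counts[row, col].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Normalized mutual information (Eq. (17), up to the normalization convention)."""
    return normalized_mutual_info_score(y_true, y_pred)
```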

4.2. Results of ACC and NMI for all algorithms

The clustering results (that is, ACC and NMI) of all algorithms are presented in Tables 3–8.9 The experimental results show that the STDR achieves the best performance and the JGFS is the second best. From the experimental results, we can draw the following conclusions: 1. The results of the "Original" are better than those of the JFS, and worse than those of the others. However, the JFS is more efficient because it significantly reduces the dimensionality of target data. 2. The dimensionality reduction methods (except the JFS) take the local structures of target data into account and outperform the "Original". Therefore, preserving the local structures in the

223

dimensionality reduction model is crucial. This is consistent with the conclusion in [18,46]. 3. According to the experimental results, it is necessary to conduct dimensionality reduction, even with the high-dimensional and small-sized data. We understand that the ‘2,1 -norm can be used to remove redundancy and noise. However, our results showed that the JFS performs worse than the ‘‘Original’’ on the clustering results. One possible reason is that some useful features could be also removed, depending on the degree of dimensionality reduction. The main advantage of using the ‘2,1 -norm in our experiments lies in the improvement of efficiency. Therefore, we have to also combine it with the Laplacian prior (i.e., the LPP), to further improve the effectiveness. By doing so, both efficiency and effectiveness can be achieved. Both the STDR and the JGFS build the dimensionality reduction model by simultaneously taking the Laplacian prior and the minimal reconstruction error into account. However, the algorithms (such as the JFS, the LScore and the LPP) only consider one of constraints. The more constraints enable to build more effective models of dimensionality reduction, so the clustering performance of the algorithms (such as the JFS, the LScore and the LPP) are worse than those of the STDR (or the JGFS). Both the MCFS and the UDFS consider two constraints, but they do not outperform either the STDR or the JGFS. This can be explained as follows. First, the MCFS separately performs the process of the Laplacian prior and the regression process. Thus it does not consider the correlations among target data. Second, the UDFS outperforms the other comparison algorithms (such as the LScore, the JFS, the LPP and the MCFS) since it simultaneously takes the discriminative ability and the Laplacian prior into account. This is same as the conclusion in [46] on the case with sufficient target data. However,according to the illustrated experimental results, the UDFS is a data-driven method, i.e., very sensitive to the choice of the datasets on handling the highdimensional and small-sized data. For example, the UDFS does not obtain good clustering performance on the first experiment, i.e., ‘‘Letter vs. MNIST’’, from Tables 3–8. Both the STDR and the JGFS build the dimensionality reduction model by simultaneously taking the Laplacian prior and the minimal reconstruction error into account. The key difference between them is that the STDR employs external data for learning the bases, but the JGFS learns the bases with the given target data. As can be seen from Table 9, the difference on the STDR over the JFGS is the maximum, i.e., 2.3% for ACC and 1.68% for NMI while the sample size is 500, and the difference is the minimum, i.e., 1.32% for ACC and 0.18% for NMI while the sample size is 2000. According to the results in Table 9, we can make the conclusions as follows: 1. With more information (i.e., sufficient external data) in the process for learning the bases, the STDR can build a more effective model to learn the bases than the JGFS, which learns the bases by using limited target data. However, with the increase of the size of target data, the JGFS can also build efficient models to learn the bases with sufficient target data, the function of external data will decrease. For example, when the size of target data is 2000, the results of the STDR are a little better than those of then JGFS. 2. 
In the process of building learning models, to introduce external data into the learning model can increase the useful information, or add the probability of introducing noise, or even degenerate the performance of the learning model. As can be seen from the experimental results, when the size of target data is small, such as 500, to introduce external data for

224

X. Zhu et al. / Pattern Recognition 46 (2013) 215–229

Table 3 Clustering results (ACC% (std%)) of different algorithms with sample size 500. External vs. objective

Original

LScore

MCFS

UDFS

LPP

JFS

JGFS

STDR

Letter vs. MNIST Letter vs. USPS USPS vs. Letter Reuters vs. TDT2 20News vs. 20News

41.9(3.7) 47.8(2.2) 32.0(1.4) 14.8(0.7) 17.4(0.6)

44.6(2.9) 44.6(1.5) 31.2(1.5) 15.0(0.7) 18.0(0.5)

43.7(3.2) 48.2(1.5) 30.9(1.7) 15.1(1.0) 17.2(0.5)

37.6(2.0) 49.4(1.8) 32.1(1.4) 17.1(1.3) 17.7(0.7)

46.3(2.4) 52.3(1.6) 31.3(1.7) 17.0(0.6) 18.4(0.8)

41.7(1.3) 49.2(2.2) 30.9(1.5) 13.6(0.6) 17.1(0.6)

46.7(2.7) 52.4(2.2) 32.3(1.6) 17.1(0.3) 18.6(0.8)

52.5(2.1) 55.6(1.9) 32.2(1.8) 17.3(0.1) 21.0(1.0)

Table 4 Clustering results (NMI% (std%)) of different algorithms with sample size 500. External vs. objective

Original

LScore

MCFS

UDFS

LPP

JFS

JGFS

STDR

Letter vs. MNIST Letter vs. USPS USPS vs. Letter Reuters vs. TDT2 20News vs. 20News

45.2(2.3) 44.7(2.1) 44.8(1.3) 34.2(1.0) 4.9(0.2)

46.8(2.4) 39.4(1.8) 44.1(1.2) 34.2(0.9) 7.0(0.8)

45.2(2.2) 45.1(2.0) 43.6(1.2) 34.2(1.1) 4.7(0.2)

33.9(2.5) 46.3(2.1) 45.0(1.3) 35.1(0.5) 7.2(0.6)

52.1(1.8) 48.5(2.1) 43.6(1.2) 36.9(1.7) 6.0(0.5)

40.4(2.0) 44.8(1.8) 43.2(1.3) 33.3(1.2) 4.1(0.4)

52.3(1.9) 48.8(2.0) 45.4(1.2) 36.9(0.6) 7.5(0.6)

54.4(1.9) 50.9(2.1) 45.5(1.3) 37.9(0.9) 10.6(1.0)

Table 5
Clustering results (ACC% (std%)) of different algorithms with sample size 1000.

External vs. objective   Original    LScore      MCFS        UDFS        LPP         JFS         JGFS        STDR
Letter vs. MNIST         42.4(2.6)   45.5(1.6)   43.6(3.0)   37.9(2.3)   41.3(4.1)   42.5(1.2)   49.6(2.1)   55.6(1.5)
Letter vs. USPS          49.8(1.7)   45.1(1.1)   49.9(1.5)   50.8(2.2)   53.0(1.5)   49.6(1.9)   57.8(1.7)   57.9(1.8)
USPS vs. Letter          30.1(0.8)   28.9(0.7)   28.3(0.9)   30.3(1.0)   29.0(0.7)   28.6(0.6)   30.5(1.0)   30.7(1.1)
Reuters vs. TDT2         15.9(0.4)   15.4(0.8)   15.7(0.7)   17.0(1.1)   16.8(0.6)   14.2(0.7)   17.6(0.9)   17.4(1.0)
20News vs. 20News        18.8(0.9)   19.7(1.2)   18.5(0.9)   19.1(0.9)   20.7(2.1)   18.2(0.9)   20.5(1.1)   21.4(1.3)

Table 6
Clustering results (NMI% (std%)) of different algorithms with sample size 1000.

External vs. objective   Original    LScore      MCFS        UDFS        LPP         JFS         JGFS        STDR
Letter vs. MNIST         44.5(2.0)   46.2(1.7)   45.2(2.6)   32.5(2.2)   43.3(4.9)   40.9(1.0)   54.9(2.3)   56.2(1.3)
Letter vs. USPS          44.6(1.7)   39.3(1.1)   44.9(1.7)   45.2(1.4)   49.1(1.7)   45.8(2.4)   51.6(1.8)   51.6(1.8)
USPS vs. Letter          40.5(0.6)   39.1(0.9)   38.6(0.9)   40.3(0.7)   39.1(0.7)   38.5(0.9)   41.2(1.0)   41.2(0.9)
Reuters vs. TDT2         36.0(0.8)   35.3(1.2)   34.7(1.0)   36.3(0.7)   38.1(1.0)   34.5(0.7)   38.5(0.8)   38.3(0.6)
20News vs. 20News        6.2(0.7)    7.8(1.3)    5.7(0.5)    7.8(0.6)    7.6(1.8)    5.6(0.5)    8.6(1.1)    11.1(1.3)

Table 7
Clustering results (ACC% (std%)) of different algorithms with sample size 2000.

External vs. objective   Original    LScore      MCFS        UDFS        LPP         JFS         JGFS        STDR
Letter vs. MNIST         43.0(1.4)   46.3(1.9)   43.5(1.9)   36.7(1.2)   41.0(3.0)   43.3(0.9)   49.7(2.0)   56.9(2.4)
Letter vs. USPS          49.7(1.1)   45.6(1.1)   49.5(1.9)   50.5(0.7)   52.6(1.5)   49.9(1.9)   59.0(1.3)   58.9(1.1)
USPS vs. Letter          29.3(0.5)   28.5(0.6)   28.0(0.6)   29.7(0.7)   28.4(0.4)   28.5(1.5)   30.3(0.6)   30.3(0.6)
Reuters vs. TDT2         16.1(1.0)   15.7(0.8)   16.0(1.0)   16.4(0.5)   16.6(0.5)   14.8(0.4)   18.7(1.5)   17.7(0.5)
20News vs. 20News        19.2(0.5)   20.1(0.8)   19.4(0.9)   19.6(0.6)   21.0(1.4)   19.6(0.7)   21.0(0.8)   21.5(0.9)

Table 8
Clustering results (NMI% (std%)) of different algorithms with sample size 2000.

External vs. objective   Original    LScore      MCFS        UDFS        LPP         JFS         JGFS        STDR
Letter vs. MNIST         45.8(0.9)   47.3(1.3)   45.7(1.2)   30.9(1.3)   42.4(4.6)   41.9(1.0)   56.5(1.2)   57.5(1.6)
Letter vs. USPS          43.2(0.8)   38.5(0.8)   43.6(1.0)   43.7(1.1)   47.7(0.9)   46.7(1.2)   51.8(0.9)   51.9(1.0)
USPS vs. Letter          38.4(0.6)   37.5(0.8)   36.8(0.7)   38.2(0.5)   37.0(0.5)   36.6(1.2)   39.8(0.5)   39.9(0.5)
Reuters vs. TDT2         37.1(1.2)   36.9(0.7)   36.0(1.8)   37.5(0.7)   38.2(0.9)   34.6(0.9)   39.8(0.8)   39.3(0.7)
20News vs. 20News        8.2(0.7)    9.3(0.5)    8.0(0.7)    10.5(0.8)   9.5(1.2)    7.8(0.6)    11.8(0.9)   12.0(1.0)
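The ACC and NMI values in Tables 3–8 are standard k-means clustering scores. Purely as an illustration of how such numbers can be produced (not the authors' evaluation code), the following sketch clusters a reduced representation and reports mean/std of NMI and Hungarian-matched accuracy, assuming scikit-learn and SciPy are available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """Best one-to-one match between cluster labels and class labels (Hungarian method)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes.
    w = np.zeros((clusters.size, classes.size), dtype=np.int64)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == k))
    row_ind, col_ind = linear_sum_assignment(-w)  # maximise the matched counts
    return w[row_ind, col_ind].sum() / y_true.size


def evaluate(features, y_true, n_clusters, n_runs=10, seed=0):
    """Run k-means several times and report mean/std of ACC and NMI, as in Tables 3-8."""
    accs, nmis = [], []
    for r in range(n_runs):
        y_pred = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + r).fit_predict(features)
        accs.append(clustering_accuracy(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return (np.mean(accs), np.std(accs)), (np.mean(nmis), np.std(nmis))
```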

However, even with enough target data, the STDR still outperforms the comparison algorithms. According to the experimental results, we can conclude that, as the number of target data increases, the contribution of external data to target data decreases but does not become harmful. Hence, external data are always beneficial to target data in the proposed STDR approach, even with sufficient target data. In practice, however, we would not employ external data when sufficient target data are available, since sufficient target data already allow effective learning models to be built.


Table 9
Average difference of the clustering performance of the STDR over the JGFS in the five experiments with different sample sizes.

Sample size   ACC (%)   NMI (%)
500           2.3       1.68
1000          1.4       0.72
2000          1.32      0.18

Table 10
Effect on different external data (ACC% (std%), or NMI% (std%)) of STDR on target data Letter.

External data   500 (ACC)   1000 (ACC)   2000 (ACC)   500 (NMI)   1000 (NMI)   2000 (NMI)
JGFS            32.3(1.6)   30.5(1.0)    30.3(0.6)    45.4(1.2)   41.2(1.0)    39.8(0.5)
USPS            32.2(1.8)   30.7(1.1)    30.3(0.6)    45.5(1.3)   41.2(0.9)    39.9(0.5)
MNIST           32.9(1.8)   31.6(0.7)    30.9(1.2)    45.7(1.2)   41.7(0.9)    40.1(0.9)

Table 11
Effect on different external data (ACC% (std%), or NMI% (std%)) of STDR on target data TDT2.

External data   500 (ACC)   1000 (ACC)   2000 (ACC)   500 (NMI)   1000 (NMI)   2000 (NMI)
JGFS            17.1(0.3)   17.6(0.9)    18.7(1.5)    36.9(0.6)   38.5(0.8)    39.8(0.8)
Reuters         17.3(0.1)   17.4(1.0)    17.5(0.5)    37.9(0.9)   38.3(0.6)    39.3(0.7)
TDT2            17.2(0.3)   17.4(0.4)    17.5(0.7)    37.7(0.8)   37.8(0.6)    38.9(0.4)

In such a case, employing external data may instead increase the chance of introducing the noise and redundancy of external data into the learning model. 3. According to the above analysis, it is feasible to perform dimensionality reduction with the proposed STDR to handle the high-dimensional and small-sized data. Finally, we also find that the clustering performance of the STDR with less external data (e.g., the fifth experiment, which uses part of the 20Newsgroups dataset to learn the bases for the remaining data) is similar to that of the JGFS with sufficient external data. This is consistent with the conclusion in [35], in which only 10 outdoor images are used to effectively perform self-taught classification on the Caltech 101 dataset. Hence, even a small amount of external data can improve the dimensionality reduction performance of the STDR, although sufficient external data are preferable. This is practical in real applications because a large amount of external data is freely available.
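The basis-learning step referred to here (learning the bases from external data via sparse coding, cf. [24,29]) is specified earlier in the paper; purely as an illustration of that kind of step, an off-the-shelf dictionary learner could be used as follows. This is a sketch under the assumption that scikit-learn is acceptable, not the authors' actual implementation:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# X_external: external data (n_samples x d), e.g. USPS digits when the target data are Letter.
rng = np.random.RandomState(0)
X_external = rng.randn(2000, 256)  # placeholder for real external data

# Learn m bases (atoms) from the external data by sparse coding.
m = 256  # number of bases; the paper simply sets m = d in its experiments (Section 4.6.1)
dico = MiniBatchDictionaryLearning(n_components=m, alpha=1.0, random_state=0)
dico.fit(X_external)
B = dico.components_  # learnt bases, shape (m, d); each row is one atom
```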

4.3. Effect of different external data

In this section, we evaluate the effect of different external data on target data in the dimensionality reduction model. In our experiments, we fix the target data and utilize different external data to learn the bases for the target data. More specifically, we set the external data as MNIST and USPS, respectively, for the target data Letter, and present the results in Table 10. We also set the external data as the revised TDT2 dataset (which contains only two newswires, namely APW and NYT) and Reuters, respectively, for the target data TDT2 (without those two sources); the results are presented in Table 11. In Tables 10 and 11, the results of the JGFS are presented in the first data row, and those of the STDR are listed in the last two rows. The ACC results for different sample sizes are listed in the second to fourth columns, and the NMI results in the last three columns.


As can be seen, the results of the STDR are still better than those of the JGFS for different external data. Moreover, the maximal difference between the STDR and the JGFS occurs when the sample size is 500, for both kinds of external data in our experiments. Hence, different external data lead to similar dimensionality reduction performance in the STDR. In practice, we would like to know which kind of external data improves the performance of self-taught learning the most, or which kinds of data share common visual patterns. These interesting issues, similar to the analysis in both transfer learning [32] and self-taught learning [35], are left for future research.

4.4. Parameters' sensitivity

In this section, we study the clustering performance of the STDR with respect to different parameter settings, that is, λ and α in Eq. (6). Since the results are similar for different sample sizes, we only report one of them: the ACC and NMI results for the case with 500 target data, where the remaining dimensionality after dimensionality reduction was kept at 600 for the image datasets and 400 for the document datasets, respectively. The experimental results are presented in Figs. 4 and 5. As can be seen, the STDR achieves better clustering performance with larger sparsity (i.e., a larger value of λ) and a smaller weight on the Laplacian prior regularization (i.e., α between 0 and 1). Larger sparsity also leads to lower running cost. However, according to the experimental results, the best value of λ varies across datasets. The conclusion that a small weight α works best is consistent with previous work on graph sparse coding [16,48].

4.5. Convergence rate

We solve Eq. (7) by the proposed Algorithm 2. In this experiment, we examine the convergence rate of Algorithm 2 and report some of the results in Figs. 6 and 7 due to lack of space. Fig. 6 shows the objective function value while fixing α (i.e., α = 1) and varying λ; Fig. 7 shows the objective function value while fixing λ (i.e., λ = 0.1) and varying α. In both Figs. 6 and 7, the x-axis and y-axis denote the number of iterations and the objective function value, respectively. We can observe from both figures that: (1) the objective function value decreases rapidly in the first few iterations; and (2) it becomes stable after about 20 iterations (or even fewer in many cases) on all datasets. This confirms the fast convergence of Algorithm 2 for solving the proposed optimization problem in Eq. (7). Similar results are observed for other values of α and λ.
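Algorithm 2 itself is given earlier in the paper; the stopping behaviour described above can be monitored with a generic loop of the following form. This is a sketch that wraps a hypothetical per-iteration update `update_S` and the objective of Eq. (7), not the authors' code:

```python
import numpy as np


def run_until_stable(objective, update_S, S0, max_iter=50, tol=1e-5):
    """Iterate an update rule and stop once the relative decrease of the
    objective falls below `tol`, mirroring the behaviour seen in Figs. 6 and 7
    (the objective drops quickly and stabilises within roughly 20 iterations)."""
    S = S0
    history = [objective(S)]
    for _ in range(max_iter):
        S = update_S(S)                       # one iteration of the solver
        history.append(objective(S))
        rel_change = (history[-2] - history[-1]) / max(abs(history[-2]), 1e-12)
        if rel_change < tol:
            break
    return S, history  # `history` can be plotted against the iteration index
```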

Fig. 4. The results of ACC on the different datasets at different parameter settings (parameters γ and α; vertical axis: ACC). (a) Letter_USPS. (b) Letter_MNIST. (c) USPS_Letter. (d) Reuters_TDT2. (e) 20News_20News.

Fig. 5. The results of NMI on the different datasets at different parameter settings (parameters γ and α; vertical axis: NMI). (a) Letter_USPS. (b) Letter_MNIST. (c) USPS_Letter. (d) Reuters_TDT2. (e) 20News_20News.

Fig. 6. An illustration of the convergence rate of Algorithm 2 for solving the proposed objective function with fixed α, i.e., α = 1 (curves for λ = 0.001, 0.1, 10, 100; x-axis: iteration, y-axis: objective function value). (a) Letter_USPS. (b) Letter_MNIST. (c) USPS_Letter. (d) Reuters_TDT2. (e) 20News_20News.

Fig. 7. An illustration of the convergence rate of Algorithm 2 for solving the proposed objective function with fixed λ, i.e., λ = 0.1 (curves for α = 0.01, 0.1, 1, 10; x-axis: iteration, y-axis: objective function value). (a) Letter_USPS. (b) Letter_MNIST. (c) USPS_Letter. (d) Reuters_TDT2. (e) 20News_20News.

4.6. Discussion

4.6.1. The dimensionality of the basis

In the proposed STDR approach, we also need to tune the value of m, that is, the number of bases. The proposed STDR is a sparse coding model, so the value of m can be larger or smaller than the dimensionality d of the original data, depending on the application [24,29]. The case m ≥ d means that we first map the original data into a higher-dimensional space, in which we reconstruct the target data and perform feature selection. This is similar to methods (such as kernel methods in machine learning) for dealing with high-dimensional data; that is, the learning model can sometimes obtain better performance in a higher-dimensional feature space than in the original feature space. The case m ≤ d indicates that the STDR first performs dimensionality reduction and then reconstructs the target data to form their new representations, on which we perform dimensionality reduction again by deleting the redundancy in the new representations. For simplicity, we always set m = d in our experiments, as in the literature [16,24,29,48].
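The final feature-selection step described above (deleting the rows of the new representation whose values are zero or small) can be illustrated by the following sketch, which assumes a row-sparse code matrix S is already available; the matrices and the number of kept features here are placeholders, not the paper's actual data:

```python
import numpy as np


def select_features(S, keep_dim):
    """Keep the `keep_dim` rows of the new representation S (m x n) with the
    largest l2 norms and drop the (near-)zero rows, i.e. the redundant features."""
    row_norms = np.linalg.norm(S, axis=1)          # l2 norm of each row of S
    kept = np.sort(np.argsort(row_norms)[::-1][:keep_dim])  # strongest rows, original order
    return S[kept, :], kept


# Example: reduce a 784 x 500 representation to 600 features,
# matching the setting used for the image datasets in Section 4.4.
rng = np.random.RandomState(0)
S = rng.randn(784, 500) * (rng.rand(784, 1) > 0.3)  # placeholder row-sparse codes
S_reduced, kept_rows = select_features(S, keep_dim=600)
```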


4.6.2. The solution of the Sylvester equation

Solving the Sylvester equation is time consuming, with a cost of about d × n × max(d, n), where d is the dimensionality of the bases and n is the size of the target data in this paper. However, it is reasonable to solve the Sylvester equation via the MATLAB function lyap, since the proposed STDR is designed to handle the high-dimensional and small-sized data. First, the number of instances in such data is usually small, e.g., less than 2000 in this paper. Second, the dimensionality of the bases can be set large or small as needed. As discussed in Section 4.6.1, when the dimensionality of the original data is high (e.g., 10,000), we can learn bases with a small dimensionality (e.g., 500; the maximal dimensionality is 784 in this paper). In our implementation, the efficiency of solving the Sylvester equation via the MATLAB function lyap is acceptable in all our cases. For example, in our experiments, the maximal size of the matrix in the Sylvester equation is 2000 × 2000; solving such a Sylvester equation (i.e., the fifth step of Algorithm 2) takes about 10 s on a modern PC. Due to the fast convergence of the STDR shown in Section 4.5 (usually fewer than 30 iterations), Algorithm 2 finishes in less than 5 min in our experiments. Moreover, as a preprocessing phase, feature selection is often performed off-line, so its efficiency is not critical. Finally, the LAPACK package [1] is designed to solve large-scale Sylvester equations.
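For reference, the same kind of Sylvester solve can be reproduced outside MATLAB; the sketch below uses SciPy's Bartels–Stewart-based solver [2] on a small random instance (the matrices here are placeholders, not the actual matrices produced by Algorithm 2):

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Solve the generic Sylvester equation A X + X B = Q; the specific matrices
# used in Algorithm 2 are defined in the paper (here they are random placeholders,
# with sizes well below the paper's bounds d <= 784 and n <= 2000).
rng = np.random.RandomState(0)
d, n = 200, 500
A = rng.randn(d, d)
B = rng.randn(n, n)
Q = rng.randn(d, n)

X = solve_sylvester(A, B, Q)          # Bartels-Stewart solver, analogous to MATLAB's lyap
print(np.allclose(A @ X + X @ B, Q))  # residual check -> True
```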

5. Conclusion

In this paper, we propose a novel Self-Taught Dimensionality Reduction (STDR) approach for dealing with the high-dimensional and small-sized data. The proposed STDR approach first learns the bases from external data. Then the STDR reconstructs the limited target data via the learnt bases with a novel objective function, which uses an ℓ2,1-norm loss function to alleviate the adverse impact of outliers, an ℓ2,1-norm regularization term to detect the redundancy in the new representations of the target data, and a Laplacian prior regularization term to preserve the local structures of the target data. After this, feature selection is performed on the new representations of the target data by deleting their redundancy. The experimental results show that the proposed STDR can effectively utilize external data to overcome the drawback of limited target data. In the future, we will extend the proposed method to a kernel version for performing dimensionality reduction on the high-dimensional and small-sized data. We will also focus on topics such as which kind of external data improves self-taught learning the most, and which kinds of data share common visual patterns.

References

[1] E. Anderson, Z. Bai, C.H. Bischof, J.W. Demmel, J.J. Dongarra, J.J.D. Croz, A. Greenbaum, S. Hammarling, S. Ostrouchov, A. McKenney, D.C. Sorensen, LAPACK Users' Guide, third ed., 1999.
[2] R.H. Bartels, G.W. Stewart, Solution of the matrix equation AX + XB = C [F4], Communications of the ACM 15 (September) (1972) 820–826.
[3] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373–1396.
[4] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006) 2399–2434.
[5] C. Boutsidis, M.W. Mahoney, P. Drineas, Unsupervised feature selection for principal components analysis, in: ACM Special Interest Group on Knowledge Discovery and Data Mining, 2008, pp. 61–69.
[6] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: ACM Special Interest Group on Knowledge Discovery and Data Mining, 2010, pp. 333–342.
[7] H. Cai, K. Mikolajczyk, J. Matas, Learning linear discriminant projections for dimensionality reduction of image descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 99 (2010) 338–352.
[8] R. Caruana, V.R. de Sa, Benefitting from the variables that variable selection discards, Journal of Machine Learning Research 3 (2003) 1245–1264.
[9] G.C. Cawley, N.L.C. Talbot, M. Girolami, Sparse multinomial logistic regression via Bayesian L1 regularisation, in: Neural Information Processing Systems, 2007, pp. 209–216.
[10] H. Cevikalp, B. Triggs, F. Jurie, R. Polikar, Margin-based discriminant dimensionality reduction for visual recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[11] W. Dai, Q. Yang, G.-R. Xue, Y. Yu, Self-taught clustering, in: International Conference on Machine Learning, 2008, pp. 200–207.
[12] J. de Leeuw, P. Mair, Multidimensional scaling using majorization: SMACOF in R, Journal of Statistical Software 31 (3) (2009) 1–30.
[13] C.H.Q. Ding, D. Zhou, X. He, H. Zha, R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization, in: International Conference on Machine Learning, 2006, pp. 281–288.
[14] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics 32 (2004) 407–499.
[15] S. Foitong, P. Rojanavasu, B. Attachoo, O. Pinngern, Estimating optimal feature subsets using mutual information feature selector and rough sets, in: Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 973–980.
[16] S. Gao, I.W.-H. Tsang, L.-T. Chia, P. Zhao, Local features are not lonely—Laplacian sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3555–3561.
[17] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002) 389–422.
[18] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Neural Information Processing Systems, 2005, pp. 1–8.
[19] X. He, P. Niyogi, Locality preserving projections, in: Neural Information Processing Systems, 2003, pp. 197–204.
[20] C. Hou, C. Zhang, Y. Wu, Y. Jiao, Stable local dimensionality reduction approaches, Pattern Recognition 42 (9) (2009) 2054–2066.
[21] Z. Huang, H.T. Shen, J. Shao, S.M. Rüger, X. Zhou, Locality condensation: a new dimensionality reduction method for image retrieval, in: ACM Multimedia, 2008, pp. 219–228.
[22] E. Kokiopoulou, J. Chen, Y. Saad, Trace optimization and eigenproblems in dimension reduction methods, Numerical Linear Algebra with Applications 18 (3) (2011) 565–602.
[23] S. Kotsiantis, Feature selection for machine learning classification problems: a recent overview, Artificial Intelligence Review (2011) 1–20.
[24] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Neural Information Processing Systems, 2007, pp. 801–808.
[25] H. Lee, R. Raina, A. Teichman, A.Y. Ng, Exponential family sparse coding with applications to self-taught learning, in: International Joint Conference on Artificial Intelligence, 2009, pp. 1113–1119.
[26] X. Lian, L. Chen, General cost models for evaluating dimensionality reduction in high-dimensional spaces, IEEE Transactions on Knowledge and Data Engineering 21 (10) (2009) 1447–1460.
[27] J. Liu, S. Ji, J. Ye, Multi-task feature learning via efficient l2,1-norm minimization, in: International Conference on Uncertainty in Artificial Intelligence, 2009, pp. 1–8.
[28] D. Luo, C. Ding, H. Huang, Towards structural sparsity: an explicit l2/l0 approach, in: International Conference on Data Mining, 2010, pp. 344–353.
[29] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, Journal of Machine Learning Research 11 (2010) 19–60.
[30] S. Maldonado, R. Weber, A wrapper method for feature selection using support vector machines, Information Sciences 179 (June) (2009) 2208–2217.
[31] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, in: Neural Information Processing Systems, 2010, pp. 1–8.
[32] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (10) (2010) 1345–1359.
[33] C.H. Papadimitriou, K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Publications, 1998.
[34] M.D. Plumbley, E. Oja, A nonnegative PCA algorithm for independent component analysis, IEEE Transactions on Neural Networks 15 (2004) 66–76.
[35] R. Raina, A. Battle, H. Lee, B. Packer, A.Y. Ng, Self-taught learning: transfer learning from unlabeled data, in: International Conference on Machine Learning, 2007, pp. 759–766.
[36] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[37] H.T. Shen, X. Zhou, A. Zhou, An adaptive and dynamic dimensionality reduction method for high-dimensional indexing, VLDB Journal 16 (2) (2007) 219–234.
[38] S. Si, D. Tao, B. Geng, Bregman divergence-based regularization for transfer subspace learning, IEEE Transactions on Knowledge and Data Engineering 22 (7) (2010) 929–942.
[39] A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (2003) 583–617.
[40] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.


[41] L. van der Maaten, E. Postma, J. van den Herik, Dimensionality Reduction: A Comparative Review, TiCC TR 2009-005, Tilburg University, 2009.
[42] K.Q. Weinberger, F. Sha, L.K. Saul, Learning a kernel matrix for nonlinear dimensionality reduction, in: International Conference on Machine Learning, 2004, pp. 17–24.
[43] J. Weston, A. Elisseeff, B. Schölkopf, M. Tipping, Use of the zero norm with linear models and kernel methods, Journal of Machine Learning Research 3 (March) (2003) 1439–1461.
[44] L. Wolf, A. Shashua, Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach, Journal of Machine Learning Research 6 (2005) 1855–1887.
[45] S. Xiang, F. Nie, C. Zhang, C. Zhang, Nonlinear dimensionality reduction with local spline embedding, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1285–1298.


[46] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, L21-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
[47] Y. Yang, Y. Zhuang, F. Wu, Y. Pan, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Transactions on Multimedia 10 (3) (2008) 437–446.
[48] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, D. Cai, Graph regularized sparse coding for image representation, IEEE Transactions on Image Processing 20 (5) (2011) 1327–1336.
[49] T. Zhou, D. Tao, X. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Mining and Knowledge Discovery 22 (3) (2011) 340–371.
[50] X. Zhu, Z. Huang, H.T. Shen, J. Cheng, C. Xu, Dimensionality reduction by mixed kernel canonical correlation analysis, Pattern Recognition 45 (8) (2012) 3003–3016.

Xiaofeng Zhu is a Ph.D. candidate in the School of Information Technology & Electrical Engineering, The University of Queensland. His research topics include feature selection and analysis, pattern recognition and data mining.

Zi Huang is a Lecturer and an Australian Postdoctoral Fellow in the School of Information Technology & Electrical Engineering, The University of Queensland. She received her B.Sc. degree from the Department of Computer Science, Tsinghua University, China, and her Ph.D. in Computer Science from the School of ITEE, The University of Queensland. Dr. Huang's research interests include multimedia search and knowledge discovery.

Yang Yang is a Ph.D. candidate in the School of Information Technology & Electrical Engineering, The University of Queensland. His research topics include content understanding and pattern recognition.

Heng Tao Shen is a Professor of Computer Science in the School of Information Technology & Electrical Engineering, The University of Queensland. He obtained his B.Sc. (with 1st class Honours) and Ph.D. from the Department of Computer Science, National University of Singapore in 2000 and 2004, respectively. His research interests include Multimedia/Mobile/Web Search, Database Management, Dimensionality Reduction, etc. He has published extensively and served on program committees in the most prestigious international publication venues of the Multimedia and Database communities. He is also the winner of the Chris Wallace Award for Outstanding Research Contribution in 2010 from CORE Australasia. He is a Senior Member of IEEE and Member of ACM.

Changsheng Xu is a Professor in the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He is also Executive Director of the China–Singapore Institute of Digital Media. He received the Ph.D. degree from Tsinghua University, China in 1996. He was with the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences as a Post-Doctoral Fellow and Associate Professor from 1996 to 1998. He was with the Institute for Infocomm Research, Singapore from 1998 to 2008. His research interests include Multimedia Content Analysis, Image Processing, Pattern Recognition, Computer Vision, and Digital Watermarking. He has published over 150 refereed book chapters, journal and conference papers in these areas. He is a Senior Member of IEEE and Member of ACM.

Jiebo Luo is a Professor in the Department of Computer Science at the University of Rochester. His research spans image processing, computer vision, machine learning, data mining, medical imaging, and ubiquitous computing. He has been an advocate for contextual inference in semantic understanding of visual data, and continues to push the frontiers in this area by incorporating geo-location context and social context. He has published extensively with over 180 papers and 60 US patents. He has been involved in numerous technical conferences, including serving as the program co-chair of ACM Multimedia 2010 and IEEE CVPR 2012. He is the Editor-in-Chief of the Journal of Multimedia, and has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, Pattern Recognition, Machine Vision and Applications, and Journal of Electronic Imaging. He is a Fellow of the SPIE, IEEE, and IAPR.
