Graph-Based Semi-Supervised Learning with Redundant Views

Yun-Chao Gong¹, Chuan-Liang Chen² and Yin-Jie Tian³
¹Software Institute, Nanjing University, China
²Department of Computer Science, Beijing Normal University, China
³Research Centre on Fictitious Economy & Data Science, Chinese Academy of Sciences, China
Email: [email protected], [email protected], [email protected]

Abstract — In this paper, we propose a novel semi-supervised learning algorithm that works under a two-view setting. Our algorithm, named Kernel Canonical Correlation Analysis Graph (KC-GRAPH), can effectively enhance both the performance and the parameter stability of traditional graph-based semi-supervised algorithms by exploiting the two views with Kernel Canonical Correlation Analysis (KCCA). Experiments on semi-supervised classification tasks show that KC-GRAPH maintains high classification accuracy and is much more stable with respect to its parameters than previous algorithms.

1. Introduction

In real-world applications of machine learning, traditional supervised methods use only labeled data to train a classifier. However, labeled examples are often difficult to obtain, while unlabeled data are abundant [10]. Semi-supervised learning (SSL) addresses this problem by using both the unlabeled and the labeled data for training, and many SSL approaches have been proposed [11]. An important research direction in SSL is graph-based semi-supervised learning. Graph-based methods first define a graph whose nodes are the examples in the data set and whose edges encode the similarity between examples; many such methods have been proposed [2, 12]. As indicated in [11], the graph is at the heart of these algorithms, and several works focus on constructing it. [1] build a good graph for semi-supervised classification in video surveillance using domain knowledge about time and person. [4] use

an ensemble method that uses perturbation and edge removal to build robust graphs for graph-based SSL. [9] propose a method similar in spirit to LLE [8], which assumes that each data point can be reconstructed from its nearby points, to construct a parameter-stable graph. [11] notes that "although the graph is very important, its construction has not been studied extensively". A further difficulty is choosing the number of nearest neighbors k used to construct the graph. Building a better graph is therefore important and can effectively enhance performance; in this paper, we focus on these problems.

Another important paradigm in SSL is co-training [3], which assumes that the features of each example can naturally be split into two views, both sufficient and redundant for training. [3] describe an algorithm that first trains two independent classifiers on the two views and then augments each training set with the confidently labeled examples from the other classifier, repeating the procedure. [10] give a co-training-style algorithm that uses Kernel Canonical Correlation Analysis (KCCA) to identify the ground truth; in the binary case it needs only one labeled point for classification. Many previous works have shown that when two sufficient and redundant views exist, exploiting both can effectively enhance performance. In this paper, we consider graph-based semi-supervised learning under a multi-view condition and investigate whether using two views can enhance the quality of the graph and improve the performance and parameter stability of traditional graph-based semi-supervised methods. We present a method that focuses on constructing the graph W. It is based on the following intuition, first proposed in [10]: "when there exist two sufficient views, maybe the correlation between these two views can provide some helpful information, which can be exploited

by KCCA. If the two views are conditionally independent given the class label, the most strongly correlated pair of projections computed by KCCA should be in accordance with the ground truth". We therefore expect that the embedded data produced by KCCA, being in accordance with the ground truth, can also be used to enhance the quality of the graph and improve parameter stability.

The rest of this paper is organized as follows: Section 2 introduces the algorithm, Section 3 describes the experimental setup and reports the results, and Section 4 concludes.

2. The KC-GRAPH Algorithm

Let X and Y denote the two views, and let D = {⟨x_k, y_k⟩} (k = 1, 2, ..., l + u) denote the labeled and unlabeled examples, where x ∈ X and y ∈ Y. To simplify the discussion, we assume that the class label c ∈ {0, 1}, where 0 and 1 denote the negative and positive classes. The labeled examples are L = {⟨x_i, y_i⟩, c_i} (i = 1, 2, ..., l), where c_i denotes the class label. We follow some of the definitions in [10].

When there exist two sufficient views, Canonical Correlation Analysis (CCA) [6] can be used to identify correlated projections between them. Let X = (x_1, ..., x_{l+u}) and Y = (y_1, ..., y_{l+u}), and let w_x and w_y be the projection directions for X and Y. CCA [6] maximizes the correlation coefficient between the projected data w_x^T X and w_y^T Y:

$$\arg\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \; w_y^T C_{yy} w_y}} \quad (1)$$

subject to

$$w_x^T C_{xx} w_x = 1, \qquad w_y^T C_{yy} w_y = 1,$$

where C_{xy} is the between-sets covariance matrix of X and Y, and C_{xx} and C_{yy} are the within-sets covariance matrices of X and Y. The corresponding Lagrangian is

$$L(\lambda_x, \lambda_y, w_x, w_y) = w_x^T C_{xy} w_y - \frac{\lambda_x}{2}\left(w_x^T C_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y^T C_{yy} w_y - 1\right).$$

Solving Eq. 1, we get [6]:

$$w_y = \frac{1}{\lambda} C_{yy}^{-1} C_{yx} w_x \quad (2)$$

$$C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x \quad (3)$$

so w_x is obtained by solving the eigenvalue problem of Eq. 3, and w_y then follows from Eq. 2.
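To make Eqs. 1–3 concrete, here is a minimal numpy sketch of the linear CCA solution; the matrix layout (samples in rows) and the small ridge term `reg` for numerical stability are our own illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """Solve Eqs. 2-3 for the top correlated projection pair (w_x, w_y).

    X: (n, dx) view-one samples; Y: (n, dy) view-two samples.
    The ridge term `reg` keeps the covariance matrices invertible.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    # Eq. 3: Cxy Cyy^{-1} Cyx w_x = lambda^2 Cxx w_x,
    # rewritten as an ordinary eigenproblem for Cxx^{-1} Cxy Cyy^{-1} Cyx.
    M = np.linalg.solve(Cxx, Cxy @ np.linalg.solve(Cyy, Cxy.T))
    vals, vecs = np.linalg.eig(M)
    top = np.argmax(vals.real)
    w_x = vecs[:, top].real
    lam = np.sqrt(max(vals[top].real, 1e-12))

    # Eq. 2: w_y = (1/lambda) Cyy^{-1} Cyx w_x
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x) / lam
    return w_x, w_y
```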

As indicated in [10], real-world data may not lie in a linear space, so we need to extract information from a non-linear space. Kernel CCA (KCCA) [5] projects the data into a higher-dimensional space and can extract such non-linear information. KCCA maps each sample pair x_i, y_i into a Hilbert space as φ(x_i), φ(y_i) and uses the mapped data as instances: S_x = (φ(x_1), ..., φ(x_{l+u})) and S_y = (φ(y_1), ..., φ(y_{l+u})). The projection directions in the feature space can then be written as w_x^φ = S_x^T α and w_y^φ = S_y^T β. Define the kernel functions K_x(a, b) = φ_x(a) φ_x(b)^T and K_y(a, b) = φ_y(a) φ_y(b)^T on the two views, with corresponding kernel matrices K_x = S_x S_x^T and K_y = S_y S_y^T. The objective function then becomes

$$\arg\max_{\alpha, \beta} \frac{\alpha^T S_x S_x^T S_y S_y^T \beta}{\sqrt{\alpha^T S_x S_x^T S_x S_x^T \alpha \; \beta^T S_y S_y^T S_y S_y^T \beta}}$$

With this objective function, [5] obtain α and β by solving:

$$(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha \quad (4)$$

$$\beta = \frac{1}{\lambda} (K_y + \kappa I)^{-1} K_x \alpha \quad (5)$$

where I is the identity matrix and κ is used for regularization.
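The following is a minimal sketch of solving Eqs. 4–5; the Gaussian kernel, its bandwidth `gamma`, and the value of the regularizer `kappa` are illustrative choices of ours, since the paper does not fix them here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kcca(X, Y, kappa=0.1, gamma=1.0):
    """Solve Eq. 4 for alpha, then Eq. 5 for beta (top projection pair)."""
    n = X.shape[0]
    Kx = rbf_kernel(X, X, gamma)
    Ky = rbf_kernel(Y, Y, gamma)
    I = np.eye(n)

    # Eq. 4: (Kx + kI)^{-1} Ky (Ky + kI)^{-1} Kx alpha = lambda^2 alpha
    M = np.linalg.solve(Kx + kappa * I,
                        Ky @ np.linalg.solve(Ky + kappa * I, Kx))
    vals, vecs = np.linalg.eig(M)
    top = np.argmax(vals.real)
    alpha = vecs[:, top].real
    lam = np.sqrt(max(vals[top].real, 1e-12))

    # Eq. 5: beta = (1/lambda) (Ky + kI)^{-1} Kx alpha
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha) / lam
    return alpha, beta, Kx, Ky
```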

Then [5] embed the data according to the solved projections:

$$P(x) = \phi_x(x) S_x^T \alpha \quad (6)$$

$$P(y) = \phi_y(y) S_y^T \beta \quad (7)$$

We can now use the embedded data of the two views to construct a graph W = (V, E) for graph-based semi-supervised methods. The nodes V are the embedded labeled and unlabeled examples P(⟨x_i, y_i⟩) (i = 1, ..., l + u), and the edges E reflect the similarity between two embedded examples; the weight of an edge reflects the strength of the connection between samples i and j. We construct a k-nearest-neighbor graph, adding an edge between nodes i and j whenever i (j) is among the k nearest neighbors of j (i), with weights

$$w_{ij} = \exp\!\left(\frac{-(P(x_i) - P(x_j))^T (P(x_i) - P(x_j))}{\sigma^2}\right) + \exp\!\left(\frac{-(P(y_i) - P(y_j))^T (P(y_i) - P(y_j))}{\sigma^2}\right)$$

The matrix W can then be viewed as follows: the first l columns correspond to labeled examples and the remaining n − l columns to unlabeled examples:

$$W = \begin{pmatrix} W_{ll} & W_{lu} \\ W_{ul} & W_{uu} \end{pmatrix}$$
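A sketch of steps 2–3 of the procedure follows: on the training points, Eq. 6 reduces to P(x_i) = (K_x α)_i (and analogously for Eq. 7), so the embeddings and the combined two-view weights can be computed directly from the kernel matrices. Using the sum of the two embedded distances for the neighbor search is our own simplification; the paper specifies the edge weights but not the neighbor metric.

```python
import numpy as np

def kc_graph_weights(Kx, Ky, alpha, beta, k=5, sigma=1.0):
    """Build the weight matrix W from the KCCA embeddings.

    Eqs. 6-7 on the training points: P(x_i) = (Kx @ alpha)[i],
    P(y_i) = (Ky @ beta)[i]. Edges connect symmetric k-NN pairs.
    """
    Px = (Kx @ alpha).reshape(-1, 1)   # embedded view one
    Py = (Ky @ beta).reshape(-1, 1)    # embedded view two
    n = Px.shape[0]

    # squared distances in each embedded view
    dx = ((Px[:, None, :] - Px[None, :, :]) ** 2).sum(-1)
    dy = ((Py[:, None, :] - Py[None, :, :]) ** 2).sum(-1)

    # combined two-view Gaussian similarity (the w_ij above)
    S = np.exp(-dx / sigma**2) + np.exp(-dy / sigma**2)

    # keep edge i-j when i is among the k nearest neighbors of j, or vice versa
    dist = dx + dy                      # assumed combined neighbor metric
    np.fill_diagonal(dist, np.inf)
    idx = np.argsort(dist, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, idx.ravel()] = True
    mask = mask | mask.T                # symmetric connection

    W = np.where(mask, S, 0.0)
    np.fill_diagonal(W, 0.0)
    return W
```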

With the graph W constructed above, the graph-based semi-supervised method of [12] estimates a function f : V → R on the graph and uses f as class labels, minimizing

$$\frac{1}{2}(f - c)^T (f - c) + f^T L f \quad (8)$$

where the matrix L, called the graph Laplacian, is defined as L = D − W with D = diag(d_i) and d_i = Σ_j w_{ij}. [12] then express the function f as

$$f = P f \quad (9)$$

where P = D^{-1} W. Writing f = (f_l; f_u), where f_l and f_u denote the labels of the labeled and unlabeled examples respectively, [12] obtain the solution

$$f_u = (I - P_{uu})^{-1} P_{ul} f_l \quad (10)$$

where I is the identity matrix. The KC-GRAPH algorithm is summarized in Table 1.
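A minimal sketch of the harmonic solution of Eq. 10, assuming the rows and columns of W are ordered with the l labeled points first.

```python
import numpy as np

def harmonic_labels(W, f_l):
    """Eq. 10: f_u = (I - P_uu)^{-1} P_ul f_l, with P = D^{-1} W."""
    n, l = W.shape[0], len(f_l)
    d = W.sum(axis=1)                  # degrees d_i = sum_j w_ij
    P = W / d[:, None]                 # P = D^{-1} W
    P_uu = P[l:, l:]
    P_ul = P[l:, :l]
    I = np.eye(n - l)
    f_u = np.linalg.solve(I - P_uu, P_ul @ np.asarray(f_l, dtype=float))
    return f_u                          # threshold at 0.5 for binary labels
```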

3. Experiments

To evaluate the proposed algorithm, experiments were performed on two real-world data sets: the course data set [3] and the ads data set [7]. The course data set has two views (fulltext and inlinks) and contains 1051 examples, each corresponding to a web page; the task is to predict whether the page is a course page or not. In the ads data set, each example corresponds to an image on the web, and the task is to predict whether the image is an advertisement. The ads data set has five views: url, originalurl, ancurl, caption and alt. In our experiments we use two of them: the url view and the originalurl view.

We perform two types of experiments. In the first, in each fold, 10% to 90% of the examples per class are randomly picked out to serve as labeled training examples, while the remaining data are used as unlabeled training examples and for testing. We report results for our KC-GRAPH algorithm and, as a baseline, for the graph-based semi-supervised Gaussian Fields approach (GF).

Table 1. The KC-GRAPH algorithm

Input:
  L = {⟨x_k, y_k⟩, c_k} (k = 1, 2, ..., l)
  U = {⟨x_i, y_i⟩} (i = 1, 2, ..., u)
Parameters:
  k: the number of nearest neighbors in the graph W
  σ: bandwidth of the Gaussian function
Procedure:
  1. Identify all pairs of correlated projections, obtaining α and β by solving Eqs. 4-5 on U ∪ L.
  2. Embed U ∪ L with the solved projections α, β according to Eqs. 6-7.
  3. Use the embedded data P(x), P(y) to construct a k-nearest-neighbor graph W.
  4. Learn f according to Eq. 10 on U ∪ L.
Output: the class labels c of the unlabeled data U.
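Expressed in code, the whole procedure of Table 1 simply chains the sketches above; the helper names (`kcca`, `kc_graph_weights`, `harmonic_labels`) come from those sketches, not from the paper.

```python
import numpy as np

def kc_graph_classify(X_l, Y_l, c_l, X_u, Y_u,
                      k=5, sigma=1.0, kappa=0.1, gamma=1.0):
    """End-to-end KC-GRAPH (Table 1), chaining the sketches above."""
    X = np.vstack([X_l, X_u])          # labeled rows first
    Y = np.vstack([Y_l, Y_u])

    # Step 1: KCCA on U ∪ L (Eqs. 4-5)
    alpha, beta, Kx, Ky = kcca(X, Y, kappa=kappa, gamma=gamma)

    # Steps 2-3: embed (Eqs. 6-7) and build the graph W
    W = kc_graph_weights(Kx, Ky, alpha, beta, k=k, sigma=sigma)

    # Step 4: harmonic solution (Eq. 10)
    f_u = harmonic_labels(W, c_l)
    return (f_u > 0.5).astype(int)     # binary class labels for U
```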

For each data set, an initial undirected baseline graph W was constructed by symmetrically connecting each point to its k nearest neighbors, as measured by Euclidean distance in the input space. Weights were then set for each edge according to w_ij = exp(−s²_ij/σ²), where s_ij is the edge length and σ was set to 10. On all data sets, the results were averaged over 50 independent runs.

In the second experiment, we randomly choose 90% of the examples for training and use the rest as unlabeled training examples and for testing. In each fold, we increase the number of nearest neighbors k to compare the parameter stability of the two algorithms. The rest of the setup is the same as in the first experiment, and the results are again averaged over 50 independent runs.
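For reference, a sketch of the first experimental protocol under stated simplifications: labeled subsets are drawn uniformly rather than per class, and `kc_graph_classify` is the hypothetical driver sketched after Table 1.

```python
import numpy as np

def run_protocol(X1, X2, labels, fractions=(0.1, 0.2, 0.4, 0.6, 0.8),
                 runs=50, seed=0, **kw):
    """Average KC-GRAPH accuracy over random labeled subsets, per fraction."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    results = {}
    for frac in fractions:
        accs = []
        for _ in range(runs):
            perm = rng.permutation(n)          # note: not stratified per class
            l = max(1, int(frac * n))
            lab, unlab = perm[:l], perm[l:]
            pred = kc_graph_classify(X1[lab], X2[lab], labels[lab],
                                     X1[unlab], X2[unlab], **kw)
            accs.append((pred == labels[unlab]).mean())
        results[frac] = float(np.mean(accs))
    return results
```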

3.1. Results and Discussion

The results of the first experiment are summarized in Fig. 1, where we report the average classification accuracy as the number of labeled training examples increases. The horizontal axis represents the percentage of randomly labeled data in the training set and the vertical axis represents the classification accuracy. KC-GRAPH achieves higher accuracy than the Gaussian Fields approach throughout; by taking advantage of the redundant views, our algorithm outperforms the earlier method and is more stable.

Fig. 2 reports the results on parameter stability. The horizontal axis represents the number of nearest neighbors k in the graph W and the vertical axis represents the average classification accuracy. The results show clearly that our algorithm is very stable with respect to the parameter k.



[Figure 1. Classification accuracy results. Panels: (a) fulltext, (b) inlinks, (c) url, (d) original url; each panel plots accuracy against the fraction of labeled training data for GF and KC-GRAPH.]

[Figure 2. Parameter stability results. Panels: (a) fulltext, (b) inlinks, (c) url, (d) original url; each panel plots accuracy against the number of nearest neighbors k for GF and KC-GRAPH.]

Whether k is small or large, KC-GRAPH maintains high classification accuracy, but when k is small the Gaussian Fields approach fails to keep a good result: its results are unstable and sometimes extremely bad¹. The experiments thus show that our algorithm is stable with respect to the parameter k and nicely addresses the problem of tuning k in graph construction when two sufficient and redundant views exist.

4. Conclusions

In this paper we proposed a novel semi-supervised learning algorithm called Kernel Canonical Correlation Analysis Graph (KC-GRAPH). It uses Kernel Canonical Correlation Analysis (KCCA) to discover useful information from two sufficient, redundant and independent views, and then uses this information to enhance the quality of the graph for graph-based semi-supervised learning. Our algorithm achieves higher classification accuracy and much better parameter stability than earlier methods. The experiments demonstrate the effectiveness of our method and show that KC-GRAPH is very stable with respect to the number of neighbors in the graph. For future work, extending the graph proposed here to other graph-based algorithms, such as graph-based clustering and manifold learning, is an interesting direction.

¹ We noticed that the extremely bad results occur because about 86% of the non-ads examples are wrongly classified into the ads class.


References

[1] M.-F. Balcan, A. Blum, P. P. Choi, J. Lafferty, B. Pantano, M. R. Rwebangira, and X. Zhu. Person identification in webcam images: An application of semi-supervised learning. ICML, 2005.
[2] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. ICML, 2001.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.
[4] M. A. Carreira-Perpinan and R. S. Zemel. Proximity graphs for clustering and manifold learning. NIPS, 2005.
[5] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 2004.
[6] H. Hotelling. Relations between two sets of variates. Biometrika, 1936.
[7] N. Kushmerick. Learning to remove internet advertisements. Agents, 1999.
[8] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
[9] F. Wang and C. Zhang. Label propagation through linear neighborhoods. ICML, 2006.
[10] Z.-H. Zhou, D.-C. Zhan, and Q. Yang. Semi-supervised learning with very few labeled training examples. AAAI, 2007.
[11] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[12] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML, 2003.
