DCPE Co-Training: Co-Training Based on Diversity of Class Probability Estimation

Jin Xu, Haibo He, and Hong Man

Abstract- Co-training is a semi-supervised learning technique used to recover the unlabeled data based on two base learners. The normal co-training approaches use the most confidently recovered unlabeled data to augment the training data. In this paper, we investigate the co-training approaches with a focus on the diversity issue and propose the diversity of class probability estimation (DCPE) co-training approach. The key idea of the DCPE co-training method is to use DCPE between two base learners to choose the recovered unlabeled data. The results are compared with classic co-training, tri-training and self-training methods. Our experimental study based on the UCI benchmark data sets shows that DCPE co-training is robust and efficient in the classification.

Jin Xu and Hong Man are with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA (email: [email protected]; [email protected]). Haibo He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA (email: [email protected]).

I. INTRODUCTION

Co-training is a semi-supervised learning method which uses two base learners to recover the unlabeled data and thereby facilitate the learning process. In the co-training approach, one learner chooses the most confidently labeled data (from the unlabeled data) for the other classifier. In its earlier form [1], co-training separately establishes two classifiers by learning on two sufficient and redundant views (subsets) of the data sets. The diversity between the two views makes co-training work well. However, it is often not practical or feasible to divide the data into two sufficient and redundant views. Recent research proves the applicability of co-training style algorithms without requiring two views [2], [3]. Co-training can instead apply two different underlying learning algorithms to data that have a single view; the diversity between the two learning algorithms can make co-training work well [4]. How to choose the unlabeled data to label is the key issue for co-training. Semi-supervised learning is a kind of label propagation. In the development of co-training, the labels were propagated from the most confident unlabeled data in the two views in the work of Blum and Mitchell [1], which was inspired by classic self-training [5]. Statistical Co-Learning [2] and Democratic Co-Learning [3] then use statistical methods to obtain the newly labeled data, which is time-consuming. Tri-training [6] uses voting among three classifiers to obtain the newly labeled data. In our work, the labeling process does not rely on a single learner; it is decided by the difference of the class probability estimates of the two learners. The class probability estimate as well as the predicted label is important in classification [7].

Fig. 1. The illustration of KNN classification (neighborhoods for K = 3 and K = 5).

Class probability estimation has many applications, such as cost-sensitive classification [8]. Different learning methods have different definitions of the class probability estimate [8], [9]. The motivation of our work is to utilize the diversity of class probability estimation (DCPE) for co-training. In DCPE co-training, the unlabeled data with the largest diversity of class probability estimation between the two classifiers are chosen to recover the label. An empirical study on binary UCI data shows that DCPE co-training is able to achieve enhanced classification accuracy compared with classic semi-supervised learning models.

II. CLASS PROBABILITY ESTIMATION

Diverse learners can predict the label with different probabilities depending on the underlying machine learning algorithm, so the class probability estimate is informative as well as the predicted label. In this section, the class probability estimation of two well known machine learning algorithms (k-nearest neighbor and neural networks [10]) is presented.

The key idea of k-nearest neighbor (KNN) classification [10] is that similar observations belong to similar categories. The performance of a KNN classifier is primarily decided by the number of neighbors K and the choice of distance metric. Fig. 1 shows a simple illustration of KNN classification; with a different number of neighbors K, the classification result might differ. For a training sample (x, label_x), label_x is given by a target function f(x), so the label of a query point x_q is f(x_q), which the KNN learning algorithm [11] estimates by equation (1).

Fig. 2. Neural networks for classification (input training data x, hidden layer units H1(x), H2(x), H3(x), network outputs O1(x) and O2(x), and the classification output F(x)).

f(x_q) = \arg\max_{v ∈ V} \sum_{i=1}^{k} δ(v, f(x_i))     (1)

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise. Research on class probability estimation for KNN is still developing [9], [12]. In our binary classification, the Euclidean distance is used to measure similarity. The nearest labeled neighbors which have the same label as the query point are defined as the targeted data, and the class probability estimate of the query point is given by the ratio between the sum of distances to the targeted data and the sum of distances to all labeled data [9].

As far as neural networks are concerned, once the network is established, the output f(x) is compared with a threshold t for classification. The structure of a typical multilayer feedforward network is shown in Fig. 2. In our study, the class probability estimate is given by the distance to the threshold, p(x) = |f(x) - t|.
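To make the two estimators concrete, the following Python sketch computes a predicted label and a class probability estimate under both schemes. It is our own illustrative reading of the descriptions above, not the authors' implementation; the function names, the guard constant eps, and the default threshold t = 0.5 are assumptions.

```python
import numpy as np

def knn_class_probability(query, X_train, y_train, k=5):
    """KNN prediction with a distance-ratio class probability estimate.

    Following the verbal description above: the neighbors sharing the
    predicted label are the 'targeted data', and the estimate is the ratio
    of the distance sum to the targeted data over the distance sum to all
    k labeled neighbors.
    """
    d = np.linalg.norm(X_train - query, axis=1)       # Euclidean distances
    idx = np.argsort(d)[:k]                           # k nearest neighbors
    labels, counts = np.unique(y_train[idx], return_counts=True)
    pred = labels[np.argmax(counts)]                  # majority vote, eq. (1)
    targeted = idx[y_train[idx] == pred]
    eps = 1e-12                                       # guard against zero distances
    return pred, d[targeted].sum() / (d[idx].sum() + eps)

def nn_class_probability(f_x, t=0.5):
    """Network prediction with the distance-to-threshold estimate p(x) = |f(x) - t|."""
    return (1 if f_x >= t else 0), abs(f_x - t)
```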

III. DCPE CO-TRAINING

The DCPE co-training algorithm for binary (positive and negative) classification is shown in Algorithm 1. The algorithm first establishes the unlabeled data pool U' by randomly selecting u unlabeled data from U. It then uses two base learners (learn1 is a neural network and learn2 is KNN) to train on the labeled data L and obtain the first two classifiers, hA and hB. Next, hA and hB predict the unlabeled data in U'; each classifier outputs a predicted label and a class probability estimate (normalized probability) for each unlabeled point. Based on the predictions of the two classifiers, the algorithm chooses the unlabeled data with the same predicted label and the biggest DCPE to recover the label. The newly labeled data should match the ratio of positive to negative data in the original labeled data distribution [1]. When all the iterations are finished, the final two hypotheses hA and hB predict the testing data; the label is decided by the hypothesis with the higher class probability estimate.

In this algorithm, DCPE co-training chooses some unlabeled data and labels them to become new labeled data.

Algorithm 1: Pseudocode describing the DCPE co-training algorithm

DCPE co-training(L, U, baselearner1, baselearner2, k)
input: L: labeled data
       U: unlabeled data
       baselearner1: base learning algorithm 1
       baselearner2: base learning algorithm 2
       k: number of co-training iterations
% create the pool U' by randomly choosing u data from U
{U', Urest} <- randomsample(U)
% do k iterations of label propagation
for i <- 1 to k do
    % run learn1 and learn2 on L to compute the hypotheses hA and hB
    {hA, peA} = baselearner1(L)
    {hB, peB} = baselearner2(L)
    Li = {}
    for each possible label j <- 1 to 2 do
        Lja = {}
        Ljb = {}
        foreach x in U' do
            % use hA to label data for hB
            if hA(x) = hB(x) = j then
                dpeA(x) = peA(x) - peB(x)
                Lja <- Lja U {(x, j, dpeA(x))}
            end
            % use hB to label data for hA
            if hA(x) = hB(x) = j then
                dpeB(x) = peB(x) - peA(x)
                Ljb <- Ljb U {(x, j, dpeB(x))}
            end
        end
        % keep the pj data with the biggest dpeA(x) and the pj data with the biggest dpeB(x)
        Lja <- subsample(Lja)
        Ljb <- subsample(Ljb)
        % the data in U' have changed after sampling;
        % randomly choose 2*pj unlabeled data from Urest to replenish the pool
        {Ui, Urest} <- randomsample(Urest)
        Li = Li U Lja U Ljb
    end
    % update the labeled data
    L = L U Li
    % update the unlabeled pool
    U' = U' U Ui
end
output: for a test point x, compute hA(x) with peA(x) and hB(x) with peB(x);
        if peA(x) > peB(x) then h(x) = hA(x) else h(x) = hB(x)
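The selection step at the heart of Algorithm 1 can be sketched in a few lines of Python. The snippet below is our own illustrative reading of the pseudocode, not the authors' code: pred_a/pred_b are the predicted labels of the two classifiers on the pool U', prob_a/prob_b their normalized class probability estimates, and n_per_class plays the role of p_j.

```python
import numpy as np

def dcpe_select(pred_a, prob_a, pred_b, prob_b, n_per_class):
    """Choose pool indices to pseudo-label in one DCPE co-training round (sketch).

    For every class, among the pool points where both classifiers agree,
    keep the n_per_class points where hA is most confident relative to hB
    (largest peA - peB, hA "teaching" hB) and the n_per_class points where
    hB is most confident relative to hA, and attach the agreed label.
    """
    selected = []                                  # list of (pool_index, pseudo_label)
    agree = pred_a == pred_b
    for label in np.unique(pred_a):
        candidates = np.where(agree & (pred_a == label))[0]
        if candidates.size == 0:
            continue
        dpe = prob_a[candidates] - prob_b[candidates]    # signed diversity
        order = np.argsort(dpe)
        top_a = candidates[order[-n_per_class:]]         # largest peA - peB
        top_b = candidates[order[:n_per_class]]          # largest peB - peA
        for i in np.union1d(top_a, top_b):
            selected.append((int(i), int(label)))
    return selected
```

In a full run, the selected points would be moved from the pool to L (keeping the original class ratio), the pool replenished from the remaining unlabeled data, and both learners retrained, as in Algorithm 1.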

There is a trade-off between adding noisy data and adding correct data in the process of co-training. Inspired by recent research [2], [6], we provide a brief discussion of DCPE co-training. In the training process, the base learning algorithm trains on the labeled data L. For a sequence of m samples in the training data, the size satisfies (2):

m = \frac{c}{ε^2 (1 - 2η)^2}     (2)

where c is a constant determined by the learning hypothesis, ε comes from the hypothesis worst-case accuracy (1 - ε), and η is the classification noise rate in the training data. The purpose of semi-supervised learning is to reduce the hypothesis error rate ε by using unlabeled data.
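To see how the bound in (2) behaves, the short computation below solves it for the worst-case error; the constant c = 1 and the particular m and η values are arbitrary, hypothetical choices for illustration.

```python
import math

def worst_case_error(m, eta, c=1.0):
    """Solve equation (2) for eps: eps = sqrt(c / m) / (1 - 2 * eta)."""
    return math.sqrt(c / m) / (1.0 - 2.0 * eta)

# For a fixed training size m, a lower noise rate eta in the (pseudo-)labels
# gives a lower hypothesis error rate eps (hypothetical values).
print(worst_case_error(m=400, eta=0.2))   # ~0.083
print(worst_case_error(m=400, eta=0.1))   # ~0.063
```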

TABLE I
UCI EXPERIMENT DATA SETS

Name         Feature   Total size   Test data size   '+/-' rate
australian      14        690            138            56/44
bupa-liver       6        345             69            42/58
pima             8        768            154            65/35
german          20       1000            200            70/30
ionosphere      34        351             70            36/64
tictactoe        9        958            192            65/35
wdbc            30        569            114            37/63

When the number of samples m is stable and c is a constant, a small η brings a small hypothesis error rate ε according to the above equation. This means that reducing the noise in the newly labeled data will improve the final hypothesis. In our algorithm, the new labeled data are classified based on the agreement of the two base algorithms. The diversity of the two base algorithms together with their agreement keeps the error rate small. More importantly, the difference of the class probability estimates helps to establish and keep the diversity between the two base algorithms.

Fig. 3. Comparison of learning models under "nn-knn" for data "australian" in the different learning iterations

IV. EMPIRICAL STUDY ON UCI DATA SETS

In order to study how DCPE co-training works on classification, an empirical study is performed on the UCI data sets [13]. Seven binary UCI data sets are used in the experiment. The detailed information is described in Table I. The test data size is the number of samples used for testing, and the '+/-' rate is the ratio of positive data to negative data.

A. Configuration

The accuracy of the algorithms is estimated based on a 5-fold validation test. For example, for a data set of 1000 samples, each training set contains 800 samples and each test set 200 samples (a 4:1 ratio). For our DCPE co-training algorithm, the 800 training data are split randomly into two groups, L (labeled data) and U (unlabeled data), under different labeled rates of 20%, 40%, 60% and 80%. Take 20% for instance: there are 800 x 20% = 160 labeled data and 800 x 80% = 640 unlabeled data in the training set. The learning then follows Algorithm 1. The parameter of the algorithm is set so that almost all the unlabeled data are put into the learning pool for replenishment within eight iterations.
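A minimal sketch of the data preparation described above is given below, assuming NumPy arrays X and y; the function name, the random seed, and the use of a single random split (standing in for one fold of the 5-fold protocol) are our own choices, not the paper's code.

```python
import numpy as np

def split_for_cotraining(X, y, labeled_rate=0.2, test_fraction=0.2, seed=0):
    """Split a data set into test, labeled (L) and unlabeled (U) parts (sketch).

    Mirrors the configuration above: a 4:1 train/test split, then the training
    part is divided into L and U according to the labeled rate.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_fraction * len(X))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    n_labeled = int(labeled_rate * len(train_idx))
    labeled_idx, unlabeled_idx = train_idx[:n_labeled], train_idx[n_labeled:]
    return (X[labeled_idx], y[labeled_idx],      # L: labeled training data
            X[unlabeled_idx],                    # U: unlabeled training data
            X[test_idx], y[test_idx])            # held-out test data

# Example: a 1000-sample data set at a 20% labeled rate gives 160 labeled,
# 640 unlabeled and 200 test samples, as in the description above.
```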


B. Result for each learning iteration

The detailed comparisons of the testing errors of the learning models on the data "australian" and "pima" are shown in Fig. 3 and Fig. 4, respectively. DCPE co-training (abbreviated dcpe-co) is compared with standard co-training [1], tri-training [6] and self-training [5]. There are two models of co-training: two-view co-training and single-view co-training. Two-view co-training is classic co-training, which uses one learning algorithm on two views of the data set; the features are randomly separated into two views for the experiment.


Fig. 4. Comparison of learning models under "nn-knn" for data "pima" in the different learning iterations

When two base learning algorithms are used, there are two learning results, called co-2view1 and co-2view2: co-2view1 shows the result for the neural network and co-2view2 shows the result for KNN. Single-view co-training establishes two different learners by applying two learning algorithms to the single view of the data; co-1view is the name for single-view co-training [4]. Tri-training [6] uses bootstrap sampling to obtain three base learners that vote on the label when choosing the unlabeled data. Tri-training is based on a single view of the data and a single classifier algorithm, and there are two results for tri-training, one for each base classifier; they are called tri1 (for the neural network) and tri2 (for KNN) in this paper. Self-training [5] applies a single learning algorithm to the single view of the data and chooses the most confidently labeled unlabeled data to recover as new labeled data for learning. Similarly, the results of self-training are called self1 (for the neural network) and self2 (for KNN). Consequently, for co-1view, tri-training and self-training there are two curves (one per learning algorithm) for each model; with the curves of DCPE co-training and co-2view, eight curves are shown in Fig. 3 and Fig. 4. All learning methods are trained with the same number of iterations and add the same number of newly recovered data in each iteration. The points in Fig. 3 and Fig. 4 show the error results in each learning round; "nnknn" stands for co-training between the neural network and KNN. In Fig. 3, dcpe-co shows the best accuracy at the labeled rates of 20% and 40%, co-1view is the best at the labeled rate of 60%, and tri-training is the best at the labeled rate of 80%; dcpe-co gives more effective classification results at small labeled rates. In Fig. 4, dcpe-co shows a lower error rate at the labeled rates of 20%, 40% and 80%. All the learning curves tend to increase from the 4th to the 8th iteration at the labeled rates of 40% and 80%, which shows that noisy data can reduce the classification accuracy. Ideally, as more unlabeled data are recovered, the error rate should decrease from one iteration to the next; this trend will be discussed in the following section. For these two data sets, the DCPE co-training learning curves always lie at the bottom of the error groups, so the DCPE co-training model shows better classification accuracy than the classic models.

C. Result for the improvement rate

In this section, the enhancement of accuracy relative to the initial iteration is compared among DCPE co-training, co-1view, co-2view, tri-training and self-training. As mentioned above, DCPE co-training and co-1view use two learning algorithms. For a fair comparison, the results of co-2view, tri-training and self-training are averaged over the two learning algorithms, so there are five learning models for which the improvement rates are shown. Table II and Table III present the test error rates of the hypothesis at the initial iteration (trained on the labeled data L), the best iteration (where the hypothesis achieves its best result) and the final iteration. The learning error curve of

the hypothesis is not always monotonic (Fig. 3 and Fig. 4), so the best iteration is recorded as well as the final iteration. The improvement of the error rate is:

improve = \frac{initial - best(final)}{initial}     (3)
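As a quick numeric check of (3), the snippet below plugs in hypothetical error rates; the values 0.30 and 0.27 are arbitrary, chosen only for illustration.

```python
def improvement(initial, later):
    """Improvement rate of equation (3), relative to the initial-iteration error."""
    return (initial - later) / initial

# Hypothetical error rates: 0.30 at the initial iteration, 0.27 at the best
# (or final) iteration -> an improvement of 0.10, i.e. 10%.
print(improvement(0.30, 0.27))
```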

In Table II and Table III, the improvement rates for each data set are shown and compared among DCPE co-training, co-1view, co-2view, tri-training and self-training. The biggest improvement is shown in bold, and the best performance is counted (No. of wins) over all the data sets. The results show that DCPE co-training can improve the hypotheses of the base learners when the labeled rate is small. The results in Table II show that DCPE co-training wins in 4 out of 7 data sets in the best iteration and 3 out of 7 data sets in the final iteration under the "nnknn 20%" learning model, and it dominates the best-performance count (No. of wins). When the labeled rate is changed to 40%, DCPE co-training ranks first in both the best-iteration and final-iteration improvement. Under the 60% labeled rate (Table III) with the "nnknn" learning model, DCPE co-training wins 2 of 7 in the final-iteration improvement, while co-1view dominates the best-iteration improvement and DCPE co-training ranks second. Finally, at the 80% labeled rate, self-training ranks first in the final-iteration improvement with 3 of 7, and co-1view wins in the best-iteration results. With the lower labeled rates, 20% and 40%, DCPE co-training shows better performance.

D. Result with different unlabeled data rate

The experiment is done on L and U under the different labeled rates of 20%, 40%, 60% and 80%, and it shows how the classification accuracy changes with the labeled rate. Fig. 5 shows the final-iteration error rate for each learning model. There are five learning curves in the figures: DCPE co-training, the two kinds of standard co-training, tri-training and self-training. For the data "australian", dcpe-co and co-1view show the best results, while co-2view performs unacceptably. For the data "bupa-liver", dcpe-co shows the best classification performance; however, its error rate goes up when the labeled rate is 80%, which may be caused by the noise added with the new labeled data. In the subfigure for the data "pima" under the "nnknn" learning model, co-2view shows the worst performance and the other curves show a similar tendency. The result for the data "german" shows that the dcpe-co and tri curves have the best tendency and the self curve gives the worst result. For the data "ionosphere", tri and dcpe-co are better for classification across the different labeled rates. In the subfigure for the data "wdbc", co-1view and dcpe-co are the best in error performance; the tri curve shows an ideal result when the labeled rate is small, but its error rate goes up when the labeled rate is high. For the data "tictactoe", dcpe-co and tri show the lowest error rates. On most data sets, all five curves show a decreasing tendency.

Fig. 5. Comparison of learning models on different data under "nnknn" with the different labeled rate

TABLE II
CLASSIFICATION IMPROVEMENT RESULT 1
Test error rates at the initial, best and final iterations and the corresponding improvement rates (%) of dcpe-co, co-1view, co-2view, tri and self on the seven UCI data sets, for the neural networks and KNN ("nnknn") learning batches with 20% and 40% labeled rates, together with the number of wins per model (the biggest improvement per data set is shown in bold).

V. CONCLUSION

Semi-supervised learning has been developed comprehensively. It can be combined with active learning [14] to choose and request unlabeled data to label [15]. It can also be applied to regression problems [16] and can learn from just very few labeled data [17]. Much theoretical support has been established and many real-world applications have been demonstrated. Co-training, as a developing branch that utilizes ensemble learning and diversity learning, has made


important contributions to semi-supervised learning. In this paper, DCPE co-training is proposed. It utilizes the classification diversity of different learning algorithms. Our experiments have covered the results over the learning iterations, the improvements achieved by semi-supervised learning, and the results under different unlabeled data rates and different learning parameters. The comparison experiments on the UCI data show that DCPE co-training is robust and effective compared with co-training, self-training and tri-training. There are many interesting future directions for this work.

TABLE III
CLASSIFICATION IMPROVEMENT RESULT 2
Test error rates at the initial, best and final iterations and the corresponding improvement rates (%) of dcpe-co, co-1view, co-2view, tri and self on the seven UCI data sets, for the neural networks and KNN ("nnknn") learning batches with 60% and 80% labeled rates, together with the number of wins per model (the biggest improvement per data set is shown in bold).

For instance, similar to [18], it would be interesting to analyze the theoretical aspects of the proposed approach in terms of learning performance and efficiency. Furthermore, large-scale experiments across different types of benchmark data sets are necessary to fully justify the effectiveness of the approach. Meanwhile, the integration of DCPE co-training with other types of base classifiers would also be of interest for different application domains. We hope the initial results presented in this paper will not only provide useful foundations and techniques for semi-supervised learning, but


it will also motivate future research opportunities in this field.

REFERENCES

[1] A. Blum and T. Mitchell, Combining Labeled and Unlabeled Data with Co-Training, Proc. 11th Ann. Conf. Computational Learning Theory, pp. 92-100, 1998.
[2] S. Goldman and Y. Zhou, Enhancing Supervised Learning with Unlabeled Data, Proc. 17th Int'l Conf. Machine Learning, pp. 327-334, 2000.
[3] Y. Zhou and S. Goldman, Democratic Co-Learning, Proc. 16th IEEE Int'l Conf. Tools with Artificial Intelligence, pp. 594-602, 2004.


[4] W. Wang and Z.H. Zhou, Analyzing Co-Training Style Algorithms, Proc. 18th European Conf. Machine Learning, Warsaw, Poland, pp. 454-465, 2007.
[5] D. Yarowsky, Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, Proc. 33rd Ann. Meeting of the Assoc. Computational Linguistics, pp. 189-196, 1995.
[6] Z.H. Zhou and M. Li, Tri-Training: Exploiting Unlabeled Data Using Three Classifiers, IEEE Trans. Knowl. Data Eng., Vol. 17, No. 11, pp. 1529-1541, 2005.
[7] M. Saar-Tsechansky and F. Provost, Active Sampling for Class Probability Estimation and Ranking, Machine Learning, Vol. 54, pp. 153-178, 2004.
[8] D. Margineantu, Class Probability Estimation and Cost-Sensitive Classification Decisions, Proc. 13th European Conf. Machine Learning, Helsinki, Finland, pp. 270-281, 2002.
[9] A. Atiya, Estimating the Posterior Probabilities Using the K-Nearest Neighbor Rule, Neural Computation, Vol. 17, pp. 731-740, 2005.
[10] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 1999.
[11] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[12] K. Fukunaga and L. Hostetler, K-Nearest-Neighbor Bayes-Risk Estimation, IEEE Trans. Information Theory, Vol. 21, No. 3, pp. 285-293, 1975.
[13] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, School of Information and Computer Sciences, 1998.
[14] H. Seung, M. Opper, and H. Sompolinsky, Query by Committee, Proc. Fifth Ann. ACM Conf. Computational Learning Theory, pp. 287-294, 1992.
[15] I. Muslea, S. Minton, and C. Knoblock, Active + Semi-Supervised Learning = Robust Multi-View Learning, Proc. 19th Int'l Conf. Machine Learning, pp. 435-442, 2002.

[16] Z.H. Zhou and M. Li, Semi-Supervised Regression with Co-Training Style Algorithms, IEEE Trans. Knowl. Data Eng., Vol. 19, No. 3, pp. 1479-1493, 2009.
[17] Z.H. Zhou, D.C. Zhan, and Q. Yang, Semi-Supervised Learning with Very Few Labeled Training Examples, Proc. 22nd AAAI Conf. Artificial Intelligence (AAAI'07), pp. 675-680, 2007.
[18] T. Li and M. Ogihara, Semisupervised Learning from Different Information Sources, Knowl. Inf. Syst., Vol. 7, No. 3, pp. 289-309, 2005.
