Multi-Label Boosting via Hypothesis Reuse

Sheng-Jun Huang, Yang Yu, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
{huangsj, yuy, zhouzh}@lamda.nju.edu.cn

Abstract

Multi-label learning arises in many real-world tasks in which an object is naturally associated with multiple targets. The binary relevance approach, which learns each label separately, is the simplest way to handle multi-label tasks, but it usually does not work well because it ignores the interactions between labels. In this paper, we propose a new boosting method, AdaBoost.HR, for multi-label learning. Our basic idea is that if two labels are related, the hypotheses generated for one label can be transferred to help the other label. We implement this idea with a hypothesis reuse mechanism and estimate the label relationship from the amount of hypothesis reuse throughout the learning process. Experimental results show that AdaBoost.HR achieves superior performance and discloses reasonable label relationships.

1 Introduction

In traditional supervised classification, one instance is associated with one target variable, while in many real-world applications one instance is naturally associated with multiple targets; this setting is formalized as multi-label learning. For example, in scene classification (e.g., [1]) a natural scene picture could be annotated as sky, trees, mountains, lakes and water simultaneously; in text categorization (e.g., [12]) a piece of news on the global warming issue could be categorized as global warming, environment, economics and politics; in email classification (e.g., [3]) an email could be sorted into folders of business, task 1, meeting and need-reply. While it is possible to address multi-label learning by decomposing the problem into a series of binary classification problems, one for each label [1], previous studies [5, 7] have revealed that the relationship among labels is quite helpful for multi-label learning.

In this paper, we propose a novel multi-label boosting approach named AdaBoost.HR. Our basic idea is that if two labels are related, the hypotheses generated for one label can be transferred to help the other label. We implement this idea as a boosting procedure with a hypothesis reuse mechanism. AdaBoost.HR trains multiple boosting learners simultaneously, one for each label. In each boosting round, the algorithm not only generates a base learner from each label's own hypothesis space, but also tries to reuse the hypotheses generated for the other labels. The reuse process employs a weighted linear combination that takes into account all the base hypotheses trained for the other labels, and the helpful hypotheses are identified by optimizing the combination weights to minimize the loss on the current label. A reuse score can then be calculated from the weights of cross-label hypotheses accumulated throughout the boosting process, providing an estimate of the label relationship.

The rest of the paper is organized as follows. Section 2 introduces related work, Section 3 proposes AdaBoost.HR, Section 4 reports on experiments, and Section 5 concludes.

2 Related Work

We review boosting techniques for multi-label learning; for other multi-label approaches, one can refer to surveys such as [4, 11]. Boosting is a family of learning algorithms with solid theoretical foundations and wide applications. In [8], two boosting approaches for multi-label learning, AdaBoost.MH and AdaBoost.MR, were proposed. Both train additive models to directly optimize multi-label losses: Hamming loss for AdaBoost.MH and ranking loss for AdaBoost.MR. AdtBoost.MH [2] is an extension of AdaBoost.MH; by incorporating alternating decision trees [6], it produces a set of readable rules and thus mitigates the usual lack of interpretability of boosting methods. In [13], a large-scale multi-label boosting approach, MSSBoost, was proposed. It maintains a shared pool of base classifiers to reduce the redundancy among labels, where each base classifier is trained on a random feature subspace and a random sample of the data. At each boosting round, it selects the best classifier from the pool and takes it as the base classifier for all the labels.

These boosting approaches share a common idea for multi-label tasks: the classifiers learned for different labels share the same base hypotheses. However, they do not verify whether the labels are strongly related. When the labels are weakly related or independent, a single base hypothesis can hardly fit all the labels well, and the shared-hypothesis strategy is unlikely to lead to good performance.

3 AdaBoost.HR

We denote by S = {(x_1, Y_1), (x_2, Y_2), ..., (x_m, Y_m)} a multi-label data set of m instances and L possible labels, where x_i is the i-th instance and the corresponding label Y_i is an L-dimensional vector with Y_i(l) = 1 if x_i has the l-th label, and −1 otherwise. Our motivation is that if two labels are closely related and we already have a good classifier for one of them, we can construct a good classifier for the other label by making use of the existing classifier instead of starting from scratch.

AdaBoost.HR, shown in Algorithm 1, maintains the general outline of boosting. It generates base hypotheses in an iterative manner (the loop from line 3 to line 12). In each round t, it generates a base hypothesis for each label (the loop from line 4 to line 11). For each label l, it first trains a hypothesis ĥ_{t,l} from its own hypothesis space (line 5), and then invokes a reuse function R (line 7) to generate another hypothesis h_{t,l} from ĥ_{t,l} and the hypotheses in the candidate set Q_{t,l}. After that, h_{t,l} is treated as the base hypothesis for label l in round t, and its edge γ_{t,l} and combination weight α_{t,l} are calculated (line 8). The weight distribution over the training set is then updated for the next round (line 9). Note that in the first round, the candidate set Q_{1,l} is empty for every l.

The reuse function R is used to combine multiple base hypotheses. We implement it as a weighted linear combination and optimize the combination weights by minimizing the loss on the current label. Finally, we calculate the reuse score by summing the weights assigned to cross-label hypotheses throughout the boosting process, and take it as an estimate of the label relationship.

After training, the output models are used to predict test instances. Note that the learned model for one label may involve hypotheses from the models for other labels, so it is better to follow the training order of the hypotheses: for a test instance, we calculate the predictions of the base hypotheses for all the labels round by round until the final prediction is obtained.
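The paper states only that R is a weighted linear combination whose weights are optimized to minimize the loss on the current label. The Python sketch below fills in the unspecified details with a weighted exponential loss, simple projected gradient steps, and a convex-combination constraint (which keeps the output in [−1, 1] when all inputs are in [−1, 1]); it should be read as one plausible instantiation under these assumptions, not as the authors' implementation, and the function name reuse_combine and all parameter names are ours.

    import numpy as np

    def reuse_combine(h_hat, C, y, D, n_steps=200, lr=0.1):
        """Hypothetical sketch of the reuse function R.

        h_hat : (m,) predictions in [-1, 1] of the newly trained hypothesis for the current label.
        C     : (m, K) predictions in [-1, 1] of the K candidate hypotheses reused from other labels.
        y     : (m,) labels in {-1, +1} of the current label.
        D     : (m,) current boosting distribution over the training instances.

        Returns the combined predictions (in [-1, 1]) and the combination weights.
        """
        P = np.column_stack([h_hat, C])          # (m, K+1): column 0 is the label's own hypothesis
        w = np.zeros(P.shape[1])
        w[0] = 1.0                               # start from the label's own hypothesis

        for _ in range(n_steps):
            f = P @ w                            # current combined predictions
            # gradient of the weighted exponential loss  sum_i D_i * exp(-y_i * f_i)
            g = -P.T @ (D * y * np.exp(-y * f))
            w = w - lr * g
            # crude projection onto the simplex (clip negatives, renormalise);
            # this keeps the combined output inside [-1, 1]
            w = np.maximum(w, 0.0)
            s = w.sum()
            w = w / s if s > 0 else np.eye(P.shape[1])[0]
        return P @ w, w

Since the candidate set Q_{t,l} in Algorithm 1 already contains both H_{t−1,k} and −H_{t−1,k}, a nonnegativity constraint on the weights still allows a "negative" use of another label's model. The reuse scores reported in Table 2 can then be read as accumulating, over all rounds, the weight that R places on candidates coming from each other label, with the weight on a negated candidate counted negatively; this is our reading of the description above rather than a verbatim account of the paper's optimizer.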

4 Experiments

AdaBoost.HR is compared with its counterpart AdaBoost.ID, which is the same as AdaBoost.HR except that it does not reuse hypotheses. AdaBoost.HR is also compared with several other multi-label boosting approaches, including AdaBoost.MH [8], AdaBoost.MR [8], AdtBoost.MH [2] and MSSBoost [13]. The number of base hypotheses for AdaBoost.MH and AdaBoost.MR is set to 500 according to [9]. For AdaBoost.HR and AdaBoost.ID, the base learning algorithm is the decision stump. Based on preliminary experiments, we set the default number of base hypotheses on all the data sets to 2 × #features × L for AdaBoost.HR, and to 500 for AdaBoost.ID. The approaches are evaluated on 5 data sets: Image [14], Image2 [15], Scene [1], Reuters [10] and Yeast [5].

Algorithm 1 The AdaBoost.HR algorithm
 1: Input: training set S = {(x_i, Y_i)}_{i=1}^m, base learning algorithm L, reuse function R, number of rounds T
 2: D_{1,l}(i) ← 1/m  (i = 1, ..., m; l = 1, ..., L)
 3: for t ← 1 to T do
 4:   for l ← 1 to L do
 5:     ĥ_{t,l} ← L(S, D_{t,l})
 6:     Q_{t,l} ← {H_{t−1,k} | k ≠ l} ∪ {−H_{t−1,k} | k ≠ l}
 7:     h_{t,l} ← R(ĥ_{t,l}, Q_{t,l})  such that  h_{t,l}(·) ∈ [−1, 1]
 8:     γ_{t,l} ← ∑_i D_{t,l}(i) · h_{t,l}(x_i) · Y_i(l);   α_{t,l} ← (1/2) ln((1 + γ_{t,l}) / (1 − γ_{t,l}))
 9:     D_{t+1,l}(i) ← ( 1 + exp( Y_i(l) · ∑_{j=1}^{t} α_{j,l} · h_{j,l}(x_i) ) )^{−1}
10:     H_{t,l} ← ∑_{j=1}^{t} α_{j,l} · h_{j,l}
11:   end for
12: end for
13: Output: H(x, l) ← ∑_{t=1}^{T} α_{t,l} · h_{t,l}(x)
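To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the training loop. It assumes decision stumps (scikit-learn's DecisionTreeClassifier with max_depth=1) as the base learner L, a reuse callable with the interface sketched in Section 3 (returning the combined predictions and the weights), and a tanh squashing of the other labels' ensembles so that the candidates lie in [−1, 1]; all of these are our choices, not prescribed by the paper. For brevity the sketch keeps only training-set predictions rather than the fitted models that a full implementation would store for the round-by-round prediction of test instances described above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_adaboost_hr(X, Y, T, reuse):
        """Sketch of Algorithm 1. X: (m, d) features; Y: (m, L) labels in {-1, +1};
        T: number of boosting rounds; reuse: a callable implementing R."""
        m, L = Y.shape
        D = np.full((L, m), 1.0 / m)        # line 2: uniform initial distributions
        F = np.zeros((m, L))                # running ensembles H_{t,l} on the training set
        hyps = [[] for _ in range(L)]       # base hypotheses, stored as training predictions
        alphas = [[] for _ in range(L)]

        for t in range(T):                                      # line 3
            F_prev = F.copy()                                   # H_{t-1,k}, frozen for this round
            for l in range(L):                                  # line 4
                # line 5: train a decision stump on the weighted sample
                stump = DecisionTreeClassifier(max_depth=1)
                stump.fit(X, Y[:, l], sample_weight=D[l])
                h_hat = stump.predict(X).astype(float)

                # line 6: candidates are the other labels' ensembles and their negations,
                # squashed into [-1, 1] (our assumption)
                cols = []
                for k in range(L):
                    if k != l:
                        Hk = np.tanh(F_prev[:, k])
                        cols.extend([Hk, -Hk])
                Q = np.column_stack(cols) if cols else np.zeros((m, 0))

                # line 7: combine via the reuse function
                h, _ = reuse(h_hat, Q, Y[:, l], D[l])

                # line 8: edge and combination weight (clipped to avoid log of zero)
                gamma = np.clip(np.sum(D[l] * h * Y[:, l]), -0.999, 0.999)
                alpha = 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))
                hyps[l].append(h)
                alphas[l].append(alpha)

                # line 10: update the ensemble; line 9: reweight the instances
                F[:, l] += alpha * h
                D[l] = 1.0 / (1.0 + np.exp(Y[:, l] * F[:, l]))
                D[l] /= D[l].sum()          # renormalise (left implicit in the pseudo-code)
        return hyps, alphas, F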

Note that Image2 is another version of Image, processed according to [15]. For all the data sets, we randomly select 1500 instances as the training set and use the rest of the data as the test set. The data partition is repeated randomly thirty times, and the average results as well as standard deviations over the thirty repetitions are reported.

The performance results on Hamming loss are shown in Table 1. Compared with AdaBoost.ID, AdaBoost.HR achieves significantly better performance on all the data sets, which verifies the effectiveness of the hypothesis reuse mechanism. Compared with the other boosting approaches, AdaBoost.HR achieves the best performance in most cases.
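For reference, Hamming loss is the standard multi-label measure used here: the fraction of instance–label pairs that are predicted incorrectly. Assuming, as is common, that the sign of H(x, l) is taken as the predicted value of the l-th label, it reads

    HammingLoss = (1 / (mL)) · ∑_{i=1}^{m} ∑_{l=1}^{L} [[ sign(H(x_i, l)) ≠ Y_i(l) ]],

where [[·]] equals 1 when its argument holds and 0 otherwise; smaller values are better.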

Table 1: Comparison of AdaBoost.HR with some multi-label boosting approaches on Hamming loss. The best performance and the performances comparable to it (paired t-tests at 95% significance level) are highlighted in boldface.

Methods        Image        Image2       Scene        Reuters      Yeast
AdaBoost.HR    .169±.011    .191±.008    .084±.004    .030±.003    .204±.004
AdaBoost.ID    .190±.015    .229±.014    .108±.019    .046±.006    .222±.007
AdaBoost.MH    .176±.007    .200±.008    .091±.003    .031±.003    .228±.004
AdaBoost.MR    .754±.004    .752±.003    .821±.001    .835±.002    .697±.002
AdtBoost.MH    .190±.007    .210±.006    .111±.003    .055±.005    .210±.003
MSSBoost       .210±.009    .230±.007    .134±.004    .096±.008    .227±.005

To investigate whether the reuse score reflects reasonable label relationships, we examine the reuse scores on the Image data set. Image contains 5 labels: desert, mountains, sea, sunset, and trees; Figure 1 shows some example images, one column per label. The reuse scores are reported in Table 2.

All the diagonal entries are much larger than the other entries, which means that each boosting learner mainly relies on hypotheses trained for its own label. Apart from the diagonal entries, most elements are negative, suggesting that the mutually exclusive relationship among labels is important in multi-label scene classification tasks. The entry (trees, mountains) is relatively large, which can be explained by the fact that when there are mountains in an image, the image is likely to contain trees as well. In contrast, the entry (mountains, trees) is negative, since trees do not imply mountains, as shown in Figure 1. This illustrates the benefit of the asymmetry of the reuse score. Strong negative entries imply relationships as well: for example, when desert occurs in an image, sea is unlikely to appear, which explains the strong negative entry (sea, desert).

Figure 1: Example images from the Image data set, one column per label: desert, mountains, sea, sunset, and trees.

Table 2: Reuse scores on Image (rows: target label; columns: reuse label).

Target label    desert    mountains    sea      sunset    trees
desert          11.85     -0.99       -0.60     -1.29     -0.36
mountains       -0.63     10.25       -1.04     -0.74     -0.57
sea             -1.22     -0.75        7.44      0.07     -1.05
sunset          -0.61     -0.96       -0.07     14.17     -0.92
trees           -0.18      0.55       -0.83     -1.06      9.51

5 Conclusion

In this paper, we present a novel multi-label learning approach, AdaBoost.HR, built on a hypothesis reuse mechanism. By transferring the hypotheses learned for other labels to help the learning of each label, AdaBoost.HR discovers reasonable relationships among labels and achieves good performance.

References

[1] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[2] F. De Comité, R. Gilleron, and M. Tommasi. Learning multi-label alternating decision trees from texts and data. In Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 35–49, Leipzig, Germany, 2003.
[3] A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and contact information from email and the web. In Proceedings of the 1st Conference on Email and Anti-Spam, Mountain View, CA, 2004.
[4] A. de Carvalho and A. Freitas. A tutorial on multi-label classification techniques. In A. Abraham, A.-E. Hassanien, and V. Snášel, editors, Foundations of Computational Intelligence, pages 177–195. Springer, 2009.
[5] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, pages 681–687, Vancouver, Canada, 2002.
[6] Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Proceedings of the 16th International Conference on Machine Learning, pages 124–133, Bled, Slovenia, 1999.
[7] A. McCallum. Multi-label text classification with a mixture model trained by EM. In Working Notes of the AAAI'99 Workshop on Text Learning, Orlando, FL, 1999.
[8] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.


[9] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[10] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[11] G. Tsoumakas, M.-L. Zhang, and Z.-H. Zhou. Tutorial on learning from multi-label data. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Bled, Slovenia, 2009.
[12] N. Ueda and K. Saito. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 15, pages 721–728, Vancouver, Canada, 2003.
[13] R. Yan, J. Tešić, and J. R. Smith. Model-shared subspace boosting for multi-label classification. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 834–843, San Jose, CA, 2007.
[14] M.-L. Zhang and Z.-H. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[15] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li. Multi-instance multi-label learning. Artificial Intelligence, in press.

