Multi-Label Boosting via Hypothesis Reuse
Sheng-Jun Huang Yang Yu Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {huangsj, yuy, zhouzh}@lamda.nju.edu.cn
Abstract
Multi-label learning arises in many real-world tasks, in which an object is naturally associated with multiple targets. The binary relevance approach, which solves each label separately, is the simplest way to handle multi-label tasks, but it usually does not work well since it ignores the interactions between labels. In this paper, we propose a new boosting method, AdaBoost.HR, for multi-label learning. Our basic idea is that, if two labels are related, the hypothesis generated for one label can be transferred to help the other label. We implement this idea with a hypothesis reuse mechanism, and estimate label relationships from the amount of reused hypotheses throughout the learning process. Experimental results show that AdaBoost.HR achieves superior performance and discloses reasonable label relationships.
1 Introduction
In traditional supervised classification, an instance is associated with a single target variable, while in many real-world applications an instance is naturally associated with multiple targets; this setting is formalized as multi-label learning. For example, in scene classification (e.g., [1]) a natural scene picture could be annotated as sky, trees, mountains, lakes and water simultaneously; in text categorization (e.g., [12]) a piece of news on the global warming issue could be categorized as global warming, environment, economics and politics; in email classification (e.g., [3]) an email could be sorted into the folders business, task 1, meeting and need-reply. While it is possible to address multi-label learning by decomposing the problem into a series of binary classification problems, one for each label [1], previous studies [5, 7] have revealed that the relationship among labels is quite helpful for multi-label learning. In this paper, we propose a novel multi-label approach named AdaBoost.HR. Our basic idea is that, if two labels are related, the hypothesis generated for one label can be transferred to help the other label. We implement this idea as a boosting approach with a hypothesis reuse mechanism. AdaBoost.HR trains multiple boosting learners simultaneously, one for each label. In each boosting round, the algorithm not only generates a base learner from each label's own hypothesis space, but also tries to reuse the hypotheses generated for the other labels. The reuse process employs a weighted linear combination that takes into account all the trained base hypotheses from all the other labels, and the helpful hypotheses are identified by optimizing the combination weights to minimize the loss on the current label. Afterwards, a reuse score can be calculated from the weights of cross-label hypotheses throughout the boosting process, providing an estimate of label relationships. The rest of the paper is organized as follows.
Section 2 introduces related work, Section 3 proposes AdaBoost.HR, Section 4 reports on experiments, and Section 5 concludes.
2 Related Work
We review boosting techniques for multi-label learning; for other multi-label approaches, one can refer to surveys such as [4, 11]. Boosting is a family of learning algorithms with solid theoretical foundations and wide applications. In [8], two boosting approaches for multi-label learning, AdaBoost.MH and AdaBoost.MR, were proposed. Both train additive models to directly optimize multi-label losses: Hamming loss for AdaBoost.MH and ranking loss for AdaBoost.MR. AdtBoost.MH [2] is an extension of AdaBoost.MH. By incorporating alternating decision trees [6], AdtBoost.MH produces a set of readable rules, and thus overcomes the lack of interpretability of boosting methods. In [13], a large-scale multi-label approach, MSSBoost, utilizing boosting techniques was proposed. It maintains a shared pool of base classifiers to reduce redundancy among labels, and each base classifier is trained in a random subspace on a random sample. At each boosting round, it selects the best classifier from the pool and takes it as the base classifier for all the labels. These boosting approaches share a common idea for multi-label tasks, i.e., the learned classifiers for different labels share the same hypotheses. However, they do not verify whether the labels are strongly related. When the labels are weakly related or independent, a single base classifier can hardly fit all the labels, and good performance is therefore hard to achieve.
3 AdaBoost.HR
We denote by S = {(x_1, Y_1), (x_2, Y_2), ..., (x_m, Y_m)} a multi-label data set of m instances and L possible labels, where x_i is the i-th instance and the corresponding label Y_i is a vector of L dimensions, with Y_i(l) = 1 if x_i has the l-th label and -1 otherwise. Our motivation is that, if two labels are closely related and we already have a good classifier for one of them, we can construct a good classifier for the other label by making use of the existing classifier, instead of starting from scratch. AdaBoost.HR, shown in Algorithm 1, maintains the general outline of boosting. It generates base hypotheses in an iterative manner (the loop from line 3 to line 12). In each round t, it generates a base hypothesis for each label (the loop from line 4 to line 11). For each label l, it first trains a hypothesis ĥ_{t,l} from its own hypothesis space (line 5), then invokes a reuse function R (line 7) to generate another hypothesis h_{t,l} from ĥ_{t,l} and the hypotheses in the candidate set Q_{t,l}. After that, h_{t,l} is treated as the base hypothesis for label l in round t, and the edge of its training error and its combination weight are calculated (line 8). Then the training distribution is updated for the next round (line 9). Note that in the first round, the candidate set Q_{1,l} is empty for all l. The reuse function R combines multiple base hypotheses. We implement it as a weighted linear combination, and optimize the combination weights by minimizing the loss on the current label. Finally, we calculate the reuse score by summing the weights assigned to cross-label hypotheses throughout the boosting process, and take it as an estimate of label relationship. After training, the output models are used to predict test instances. Note that the learned model for one label may involve hypotheses from the models for other labels; therefore it is better to follow the training order of the hypotheses.
For a test instance, we calculate the predictions of base hypotheses for all the labels round-by-round until the final prediction is obtained.
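To make the reuse step concrete, the weighted linear combination described above might be sketched as follows. This is a minimal illustration, not the paper's exact procedure: we assume the weights are kept nonnegative and normalized to sum to 1 (so the combination stays in [-1, 1], with negated candidates covering negative relationships), and we optimize them by projected gradient descent on the weighted exponential loss; the function and variable names are ours.

```python
import numpy as np

def reuse(h_hat_preds, cand_preds, y, D, n_iter=200, lr=0.1):
    """Sketch of a reuse function R: combine the newly trained hypothesis
    with candidate hypotheses (the other labels' ensembles and their
    negations) by a weighted linear combination.  Weights are kept
    nonnegative and normalized to sum to 1, so the combination stays in
    [-1, 1]; they are tuned by projected gradient descent on the weighted
    exponential loss of the current label.

    h_hat_preds : (m,)   predictions of h_hat, in [-1, 1]
    cand_preds  : (q, m) predictions of the candidates, in [-1, 1]
    y           : (m,)   labels of the current task, in {-1, +1}
    D           : (m,)   current distribution over instances
    Returns the combination weights, of length 1 + q."""
    P = np.vstack([h_hat_preds, cand_preds])      # (1+q, m) prediction matrix
    w = np.full(len(P), 1.0 / len(P))             # start from the uniform combination
    for _ in range(n_iter):
        f = w @ P                                 # combined prediction on all instances
        # gradient of sum_i D_i * exp(-y_i * f_i) with respect to w
        grad = -(P * (D * y * np.exp(-y * f))).sum(axis=1)
        w = np.maximum(w - lr * grad, 0.0)        # gradient step, clipped to nonnegative
        s = w.sum()
        w = w / s if s > 0 else np.full(len(P), 1.0 / len(P))
    return w
```

Under this sketch, a candidate that fits the current label well receives a large weight, and summing the weights assigned to another label's hypotheses over all rounds yields that label's reuse score.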
4 Experiments
AdaBoost.HR is compared with its counterpart AdaBoost.ID, which is the same as AdaBoost.HR except that it does not reuse hypotheses. AdaBoost.HR is also compared with some other multi-label boosting approaches, including AdaBoost.MH [8], AdaBoost.MR [8], AdtBoost.MH [2] and MSSBoost [13]. The number of base hypotheses for AdaBoost.MH and AdaBoost.MR is set to 500 according to [9]. For AdaBoost.HR and AdaBoost.ID, the base learning algorithm is the decision stump. Based on preliminary experiments, we set the default number of base hypotheses for all the data sets to 2 × #features × L for AdaBoost.HR, and 500 for AdaBoost.ID. The approaches are evaluated on 5 data sets: Image [14], Image2 [15], Scene [1], Reuters [10] and Yeast [5].
Algorithm 1 The AdaBoost.HR algorithm
 1: Input: training set S = {(x_i, Y_i)}_{i=1}^m, base learning algorithm L, reuse function R, number of rounds T
 2: D_{1,l}(i) ← 1/m  (i = 1, ..., m; l = 1, ..., L)
 3: for t ← 1 to T do
 4:   for l ← 1 to L do
 5:     ĥ_{t,l} ← L(S, D_{t,l})
 6:     Q_{t,l} ← {H_{t-1,k} | k ≠ l} ∪ {-H_{t-1,k} | k ≠ l}
 7:     h_{t,l} ← R(ĥ_{t,l}, Q_{t,l}) such that h_{t,l}(·) ∈ [-1, 1]
 8:     γ_{t,l} ← Σ_i D_{t,l}(i) · h_{t,l}(x_i) · Y_i(l);   α_{t,l} ← (1/2) ln((1 + γ_{t,l}) / (1 - γ_{t,l}))
 9:     D_{t+1,l}(i) ← (1 + exp(Y_i(l) · Σ_{j=1}^t α_{j,l} · h_{j,l}(x_i)))^{-1}
10:     H_{t,l} ← Σ_{j=1}^t α_{j,l} · h_{j,l}
11:   end for
12: end for
13: Output: H(x, l) ← Σ_{t=1}^T α_{t,l} · h_{t,l}(x)
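For concreteness, the training loop of Algorithm 1 can be sketched in Python with decision stumps as base learners. This is a simplified, hypothetical rendering: here the reuse step merely keeps whichever of ĥ and the candidates has the largest weighted edge (rather than optimizing the full weighted linear combination), and the distribution is renormalized after the line-9 update so that the edge is well defined; all names are ours.

```python
import numpy as np

def train_stump(X, y, D):
    """Weighted decision stump: threshold one feature with +/- polarity,
    chosen to maximize the weighted edge sum_i D_i * h(x_i) * y_i."""
    best, best_edge = None, -np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                edge = np.sum(D * pred * y)
                if edge > best_edge:
                    best_edge, best = edge, (j, thr, s)
    j, thr, s = best
    return lambda Z: s * np.where(Z[:, j] <= thr, 1.0, -1.0)

def adaboost_hr(X, Y, T=10):
    """Simplified sketch of Algorithm 1.  X: (m, d) features; Y: (m, L)
    labels in {-1, +1}.  Returns the training-set margins H(x_i, l)."""
    m, L = Y.shape
    D = np.full((L, m), 1.0 / m)          # D_{1,l}(i) = 1/m
    F = np.zeros((L, m))                  # running ensembles H_{t,l}
    for t in range(T):
        H_prev = F.copy()                 # round t-1 ensembles, for Q_{t,l}
        for l in range(L):
            y = Y[:, l]
            h_hat = train_stump(X, y, D[l])(X)          # line 5
            cands = [h_hat]
            if t > 0:                                    # line 6: Q_{t,l}
                for k in range(L):
                    if k != l:
                        c = np.clip(H_prev[k], -1.0, 1.0)
                        cands += [c, -c]
            # line 7 (simplified reuse): keep the best-edge candidate
            h = max(cands, key=lambda p: np.sum(D[l] * p * y))
            gamma = np.clip(np.sum(D[l] * h * y), -0.999, 0.999)  # line 8
            alpha = 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))
            F[l] += alpha * h                                     # line 10
            D[l] = 1.0 / (1.0 + np.exp(y * F[l]))                 # line 9
            D[l] /= D[l].sum()            # renormalize (our addition)
    return F
```

On a toy task whose labels share the same underlying concept, the cross-label candidates are picked frequently after the first round, which is exactly the behavior the reuse score is designed to measure.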
Note that Image2 is another version of Image, processed according to [15]. For all the data sets, we randomly select 1500 instances as the training set and use the rest as the test set. The data partition is repeated randomly thirty times, and we report the average results and standard deviations over the thirty repetitions. The results on Hamming loss are shown in Table 1. Compared with AdaBoost.ID, AdaBoost.HR achieves significantly better performance on all the data sets, which verifies the effectiveness of the hypothesis reuse mechanism. Compared with the other boosting approaches, AdaBoost.HR achieves the best performance in most cases.
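The Hamming loss reported in Table 1 is the fraction of instance-label pairs on which the prediction disagrees with the ground truth; a minimal sketch (the function name is ours, with labels in {-1, +1} as defined in Section 3):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Hamming loss for multi-label prediction: the fraction of
    instance-label pairs that are misclassified (labels in {-1, +1})."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))
```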
Table 1: Comparison of AdaBoost.HR with some multi-label boosting approaches on Hamming loss. The best performance and its comparable performances based on paired t-tests at 95% significance level are highlighted in boldface.

Methods        Image       Image2      Scene       Reuters     Yeast
AdaBoost.HR    .169±.011   .191±.008   .084±.004   .030±.003   .204±.004
AdaBoost.ID    .190±.015   .229±.014   .108±.019   .046±.006   .222±.007
AdaBoost.MH    .176±.007   .200±.008   .091±.003   .031±.003   .228±.004
AdaBoost.MR    .754±.004   .752±.003   .821±.001   .835±.002   .697±.002
AdtBoost.MH    .190±.007   .210±.006   .111±.003   .055±.005   .210±.003
MSSBoost       .210±.009   .230±.007   .134±.004   .096±.008   .227±.005
To investigate whether the reuse score reflects reasonable label relationships, we examine the reuse scores on the Image data set. Image contains 5 labels: desert, mountains, sea, sunset, and trees. Figure 1 shows some example images, one column per label. The reuse scores are reported in Table 2. All the diagonal entries are much larger than the other entries, which means each boosting learner mainly relies on its own label. Apart from the diagonal, most entries are negative, which suggests that the mutually exclusive relationship among labels is important in multi-label scene classification. The entry (trees, mountains) is relatively large, which can be explained by the fact that when there are mountains in an image, the image is likely to contain trees as well. On the contrary, the entry (mountains, trees) is negative, since trees do not imply mountains, as shown in Figure 1. This result shows the benefit of the asymmetry of the reuse score. Note that strong negative entries imply relationships as well: for example, when desert occurs in an image, sea is unlikely to appear in it, which explains the strong negative entry (sea, desert).
[Figure 1: Example images of the Image data set, one column per label: desert, mountains, sea, sunset, trees.]

Table 2: Reuse scores on Image (rows: target label; columns: reuse label)

Target \ Reuse   desert   mountains   sea     sunset   trees
desert           11.85    -0.63      -1.22   -0.61    -0.18
mountains        -0.99    10.25      -0.75   -0.96     0.55
sea              -0.60    -1.04       7.44   -0.07    -0.83
sunset           -1.29    -0.74       0.07   14.17    -1.06
trees            -0.36    -0.57      -1.05   -0.92     9.51
5 Conclusion
In this paper, we presented a novel multi-label learning approach, AdaBoost.HR, built on a hypothesis reuse mechanism. By transferring the hypotheses learned on other labels to help the learning of each label, AdaBoost.HR discovers reasonable relationships among labels and achieves decent performance.
References
[1] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[2] F. De Comité, R. Gilleron, and M. Tommasi. Learning multi-label alternating decision trees from texts and data. In Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 35–49, Leipzig, Germany, 2003.
[3] A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and contact information from email and the web. In Proceedings of the 1st Conference on Email and Anti-Spam, Mountain View, CA, 2004.
[4] A. de Carvalho and A. Freitas. A tutorial on multi-label classification techniques. In A. Abraham, A.-E. Hassanien, and V. Snasel, editors, Foundations of Computational Intelligence, pages 177–195. Springer, 2009.
[5] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, pages 681–687, Vancouver, Canada, 2002.
[6] Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Proceedings of the 16th International Conference on Machine Learning, pages 124–133, Bled, Slovenia, 1999.
[7] A. McCallum. Multi-label text classification with a mixture model trained by EM. In Working Notes of the AAAI'99 Workshop on Text Learning, Orlando, FL, 1999.
[8] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[9] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[10] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[11] G. Tsoumakas, M.-L. Zhang, and Z.-H. Zhou. Tutorial on learning from multi-label data. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Bled, Slovenia, 2009.
[12] N. Ueda and K. Saito. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 15, pages 721–728, Vancouver, Canada, 2003.
[13] R. Yan, J. Tešić, and J. R. Smith. Model-shared subspace boosting for multi-label classification. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 834–843, San Jose, CA, 2007.
[14] M.-L. Zhang and Z.-H. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[15] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li. Multi-instance multi-label learning. Artificial Intelligence, in press.