Pattern Recognition 43 (2010) 470 -- 477
Semi-automatic dynamic auxiliary-tag-aided image annotation☆
Shile Zhang∗, Bin Li, Xiangyang Xue
School of Computer Science, Fudan University, 220 Handan Road, Shanghai, China
ARTICLE INFO
Article history: Received 1 June 2008; received in revised form 28 December 2008; accepted 2 March 2009
Keywords: Semi-automatic image annotation; Multi-label learning; Normalized mutual information; User feedback
ABSTRACT
Image annotation is the foundation for many real-world applications. In the age of Web 2.0, image search and browsing are largely based on the tags of images. In this paper, we formulate image annotation as a multi-label learning problem and develop a semi-automatic image annotation system. The presented system chooses proper words from a vocabulary as tags for a given image and refines the tags with the help of the user's feedback. The refinement amounts to a novel multi-label learning framework, named Semi-Automatic Dynamic Auxiliary-Tag-Aided (SADATA), in which the classification result for a given tag (the target tag) can be boosted by the classification results of a subset of the other tags (the auxiliary tags). The auxiliary tags, which have strong correlations with the target tag, are determined in terms of the normalized mutual information. Only tags whose correlations exceed a threshold are selected as auxiliary tags, so the auxiliary set is sparse. How much an auxiliary tag can contribute depends on the image, so we also build a probabilistic model conditioned on the auxiliary tag and the input image to adjust the weight of the auxiliary tag dynamically. For a given image, the user feedback on the tags corrects the outputs of the auxiliary classifiers, and SADATA recommends more suitable tags in the next round. SADATA is evaluated on a large collection of Corel images. The experimental results validate the effectiveness of our dynamic auxiliary-tag-aided method. Furthermore, the performance also benefits from user feedback, so that the annotation procedure can be significantly sped up. © 2009 Elsevier Ltd. All rights reserved.
☆ This work was supported in part by the Natural Science Foundation of China under grant No. 60873178, the Ministry of Education under grant No. 104075, and the Science and Technology Commission of Shanghai Municipality under grant No. 08DZ15001.
∗ Corresponding author. Tel.: +86 21 6564 3720; fax: +86 21 6564 2820.
doi:10.1016/j.patcog.2009.03.009

1. Introduction
With the popularity of Web 2.0, people around the world can easily share their blogs, photos, and videos. When a user uploads photos or videos, several tags describing the content have to be added for the sake of browsing and searching. Although many algorithms [1–5] have been proposed during the past decade to solve the problem of content-based image retrieval (CBIR), most of them still cannot bridge the semantic gap [6] well or meet the needs of queries that contain semantics, so annotation remains necessary. To reduce the cost of manual annotation, many automatic image annotation systems that analyze the image content [7–15] have been proposed alongside the fast development of machine learning research. It is still a great challenge to make the computer play a major role in the accurate annotation of images all over the world, because of limited accuracy and limited data sets. Nevertheless, image annotation systems based on various machine learning algorithms can be made much more practical with the power of user feedback.

Relevance feedback, used in traditional text-based information retrieval systems, was first adopted in the CBIR field in the 1990s [16]. During the interaction between a computer and a user, the query is adjusted by the user's feedback on the relevance of previously retrieved documents or images, so that it approximates the user's need. Many interactive image retrieval systems [17–20], which try to optimize the similarity measurement of low-level features or to find the user's interest, achieve good performance within a few rounds of interaction. Several semi-automatic or interactive image annotation systems label images in a way similar to interactive image retrieval [21–23]: the relevance feedback on the retrieved images adds positive or negative training samples for annotating the unlabeled image. However, the relationship between tags has not been well mined. Although a semantic hierarchy was built with the aid of WordNet [24] in [23], there are many correlated words whose relationship is hard to express in a semantic hierarchy, e.g., car and road.

In image annotation, tags are not mutually exclusive, so the task can be formulated as a multi-label learning problem, which is
unlike the traditional classification problem. A probabilistic model built for the traditional classification problem has to satisfy the condition $\sum_y P(y \mid x) = 1$, where $x$ is the input sample and $y$ is the label or category of $x$. In the multi-label problem, a sample can be associated with multiple labels, e.g., a photo can be annotated with sunset, beach and Africa when it is taken at the Gold Coast at dusk. Besides image annotation, text categorization [25,26] and gene functional classification [27] can both be formulated as multi-label learning problems. In multi-label learning problems, the correlation between labels is important, as stated above, and is taken into account by many works [28–32].

In this paper, we propose a multi-label learning framework for the semi-automatic image annotation system, named Semi-Automatic Dynamic Auxiliary-Tag-Aided (SADATA). First, we train a binary statistical model for each tag independently, which means that the multi-label learning problem is transformed into several independent binary classification problems; this is a common way to deal with multi-label learning problems [33–35]. Second, for a given tag (the target tag), its correlated tags (the auxiliary tags) are found by the normalized mutual information, and the classifiers of these auxiliary tags, i.e. the auxiliary classifiers, aid the classification of the target tag, i.e. the target classifier. Third, a probabilistic model conditioned on the auxiliary tag and the input image is built to dynamically estimate the weight of each auxiliary tag in the classification. Finally, because the annotation procedure is interactive, the relevance feedback on the tags recommended by SADATA corrects the outputs of the auxiliary classifiers, and SADATA provides more precise and suitable tags in the next round, which speeds up the annotation procedure.
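The four steps above can be made concrete with a minimal sketch. It assumes the combination rule given later as Eq. (1) and reads every probability as the probability that a tag is present. The function name `fused_score`, the dictionaries and all numbers are hypothetical illustrations, not the authors' implementation.

```python
def fused_score(target, base, aux_sets, weights, feedback=None):
    """Score a target tag from its own classifier and its auxiliary classifiers."""
    feedback = feedback or {}
    aux = aux_sets.get(target, [])
    total = base[target]
    for tag in aux:
        # A tag already judged by the user contributes the user's answer
        # (1.0 = relevant, 0.0 = irrelevant) instead of the classifier output.
        p_aux = feedback.get(tag, base[tag])
        total += p_aux * weights[(target, tag)]
    # Average over the target classifier and its auxiliary classifiers.
    return total / (1 + len(aux))


# Hypothetical values for a single image.
base = {"boat": 0.30, "water": 0.60}   # independent classifier outputs P(Y | X)
aux_sets = {"boat": ["water"]}         # auxiliary tags of each target tag
weights = {("boat", "water"): 0.50}    # dynamic weights P(Y_t | Y_i, X)

before = fused_score("boat", base, aux_sets, weights)
after = fused_score("boat", base, aux_sets, weights, feedback={"water": 1.0})
```

In this toy setting, confirming water raises the score of boat, which is the mechanism by which feedback on one tag can change the ranking of tags that have not been shown yet.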
There are several advantages in SADATA:
• The framework is flexible because the algorithms for training the auxiliary classifiers, finding the auxiliary tags, and estimating the weights of the auxiliary classifiers are separable modules. Different algorithms can thus be plugged into this framework, especially for training the auxiliary classifiers, since many excellent binary learning algorithms have been proposed in the past few decades, such as the support vector machine (SVM) [36] and logistic regression (LR).
• The number of auxiliary tags for a target tag is limited by the definition of the correlation, which is in terms of the normalized mutual information and a given threshold. The number of auxiliary tags affects the complexity of the framework as well as its accuracy, because the auxiliary classifiers are not fully reliable and too many irrelevant auxiliary classifiers degrade the system performance. We solve this problem by optimizing the threshold with model selection methods.
• Unlike some traditional mixture models, where the weights of the components are constants, the weight model in SADATA is conditioned on the input image as well as on the auxiliary tag. It is therefore dynamic and makes the whole framework behave like a mixture model. Besides, the estimation of the weight model needs no extra training data, which also differs from some regular supervised fusion methods. This dynamic behavior is shown to be important in our experiments.
• The aid of auxiliary classifiers can easily incorporate relevance feedback on the tags, because the feedback corrects the outputs of the auxiliary classifiers. This makes the final results more accurate, so for an unlabeled image the top w tags given by SADATA approach the ground truth faster.

In the next section, we briefly review some related work. An overview of SADATA is given in Section 3, and details of each part of SADATA are given in Sections 4 and 5, respectively. We evaluate our algorithm and make some comparisons on a large collection of Corel images in Section 6. We conclude our work in the last section.
2. Related work
Many existing multi-label learning algorithms transform the problem into traditional learning problems, because many excellent learning theories have been proposed and proved to perform well in many different applications. One way is to treat every combination of labels as a class, which leads to a multi-class problem [33,37]. The disadvantage of this method is that it may produce too many classes, each with only a few samples. Another common way is to split the multi-label problem into several independent two-class problems [33–35]. It is simple but does not consider the correlation of labels, which is important for learning.

Many works take the correlation of labels into account. A proper label set for a testing sample is obtained from its neighboring training samples by explicitly exploiting high-order correlation between labels [28]. Similarly, the KNN approach can be adopted to select a subset of the training set to gain the multi-label information, and the maximum a posteriori principle is utilized to determine the label set for the testing sample [29]. To classify multi-label documents, a maximum entropy method is proposed that significantly outperforms the simple combination of single-label methods [30]. For multi-label video annotation, an input feature vector and its label information are mapped to a sparse high-dimensional feature vector that encodes the models of individual labels together with their correlations [31]. Alternatively, the input feature vector is transformed into a bag of instances, each of which reflects the vector's relationship with one of the possible labels [32]. In these works, the mining of the correlation between labels is combined with a specific learning algorithm, which results in less flexibility.

Another kind of multi-label learning algorithm boosts independent binary classifiers by the correlation between labels.
In [38], a factor graph multinet is proposed to discover the relationships and interactions between the labels. Because their problem is based on video data, the temporal dependencies among the labels of frames in a shot are also accounted for by the multinet. A pre-defined ontology, which models the label correlations, improves the accuracy of the individual classifiers [39]. Methods of this kind are intuitive. Moreover, the algorithms for training the binary classifiers are almost isolated from the algorithms that use the correlation information to improve the classifiers, so different methodologies can be tried flexibly under this framework. However, it is better to obtain the correlation information by an automatic procedure with no extra training data, in order to reduce the learning cost.

3. Overview of SADATA
3.1. Notations
$X \subseteq \mathbb{R}^d$ denotes the input space and $Y = \{Y_1, Y_2, \ldots, Y_L\}$ denotes the set of all labels. The training set is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in X$ and $y_i \in \{+1, -1\}^L$. $y_i^{(j)}$ takes the value $+1$ or $-1$ to denote the presence or the absence of the $j$-th label $Y_j$ in $x_i$, respectively.

3.2. Learning framework
Traditional multi-label learning methods that treat the multi-label problem as several independent binary learning problems neglect the correlation information among labels, which helps multi-label learning considerably. Improving such methods with the correlation information yields an intuitive and flexible framework for multi-label learning. The frequent co-occurrence or mutual exclusion of two labels indicates a strong correlation between them, because the probability of the presence of one label $Y_i$ may increase or decrease conditioned on the presence of the other label $Y_t$. If $Y_t$ is the target tag and its classifier $P(Y_t \mid X)$ is the target classifier, the classifier of $Y_i$, i.e. $P(Y_i \mid X)$, can help to refine $P(Y_t \mid X)$ through the correlation between $Y_t$ and $Y_i$. Such a tag is called an auxiliary tag of
$Y_t$, and $P(Y_i \mid X)$ is an auxiliary classifier of the target classifier $P(Y_t \mid X)$. If the method for mining the correlation information does not depend on the learning algorithms of the classifiers, different mature learning algorithms for the two-class problem can be plugged into this framework, which keeps it flexible. Furthermore, such a mining procedure does not depend on an extra labeled data set.

The SADATA framework is based on independent binary classifiers, so for each tag $Y_i$ we first build an independent probabilistic model $P(Y_i \mid X)$ with a well-established learning algorithm such as SVM or LR. Then, for a target tag $Y_t$, the target classifier $P(Y_t \mid X)$ is combined with its auxiliary classifiers to get a final classifier $P_t(Y_t \mid X)$. The proposed framework is defined as

$$P_t(Y_t \mid X) = \frac{1}{1 + |\Omega(Y_t)|} \left( P(Y_t \mid X) + \sum_{Y_i \in \Omega(Y_t)} P(Y_i \mid X)\, P(Y_t \mid Y_i, X) \right) \tag{1}$$
where $\Omega(Y_t)$ is the auxiliary tag set, each element of which is an auxiliary tag of $Y_t$, and $|\Omega(Y_t)|$ is its size. $P(Y_t \mid Y_i, X)$ is the weight of the auxiliary classifier $P(Y_i \mid X)$, which indicates how much it can contribute to classifying $Y_t$, conditioned on the input vector $X$.

Whether a tag $Y_j$ belongs to the auxiliary tag set of $Y_t$, i.e. $\Omega(Y_t)$, is determined by whether the normalized mutual information between $Y_t$ and $Y_j$ is larger than a given threshold $t_c$. The details are described in the next section.

To estimate the weight of each auxiliary classifier $P(Y_i \mid X)$, we build a probabilistic model $P(Y_t \mid Y_i, X)$, which is conditioned on the input vector $X$. This dynamic mixture is unlike traditional static mixture models, in which the weight of each component is usually a constant obtained by unsupervised or supervised learning. The procedure for building $P(Y_t \mid Y_i, X)$ is given in detail in Section 5.

The proposed framework in Eq. (1) can also be interpreted in a Bayesian way. $P(Y_i \mid X)$ is the prior probability of $Y_i$ conditioned on $X$, and $P(Y_t \mid Y_i, X)$ is the conditional probability of $Y_t$ given $Y_i$ and $X$. Thus the proposed framework can also be viewed as a mixture model. Some works on mixtures of experts [40,41] also combine several classifiers. However, a mixture of experts combines several classifiers trained for the same task, while each classifier in SADATA is trained for a distinct task.

3.3. Relevance feedback
There are few real applications of fully automatic image annotation because of its limited accuracy. With the power of relevance feedback, however, fully automatic annotation becomes semi-automatic but much more practical, because of the trade-off between the cost and the accuracy of the annotation. An example of semi-automatic image annotation is given in Fig. 1.
After an image is submitted to the annotation system, the system picks the several most suitable tags from the vocabulary for the image. The user decides whether the tags are relevant and returns the judgement to the system. The relevance feedback on tags affects which tags the system recommends in the next round. A smart system makes maximum use of this feedback, so that the tags cover the ground truth in fewer interactive rounds.

In the proposed learning framework, relevance feedback can be easily incorporated, since the relevance feedback on a tag $Y_i$ is actually the ground-truth output of the corresponding classifier $P(Y_i \mid X)$. If $Y_i$ is an auxiliary tag of $Y_t$ and $Y_t$ has not yet been recommended by the system, the change of $P(Y_i \mid X)$ affects $P_t(Y_t \mid X)$, increasing or decreasing the probability of recommending $Y_t$ in the next round. Because $P(Y_i \mid X)$ is set to the ground truth by the user feedback, $P_t(Y_t \mid X)$ becomes more reliable. Thus an annotation system with SADATA benefits from the feedback, while annotation systems that do not consider the correlation between tags do not benefit at all, because all the tags are dealt with as independent tasks. This is also demonstrated by the experiments in Section 6.

[Fig. 1: an example image, its ground-truth tags, and the tags recommended in the 1st round (initial), the 2nd round (after the 1st FB) and the 3rd round (after the 2nd FB).]
Fig. 1. A demo of semi-automatic image annotation. The tags in bold are judged as relevant by the user. After two rounds of relevance feedback (FB), the tags recommended by the system have covered the ground truth.

4. Auxiliary-tag-aided scheme
4.1. Tag correlation construction
For a given target tag $Y_t$, we want to obtain $\Omega(Y_t)$, which is composed of the auxiliary tags of $Y_t$, so we need rules for judging whether a tag $Y_i$ is an auxiliary tag of $Y_t$. Such rules are sometimes obvious from a human ontology, e.g. a child is a person, so an image annotated with child must also have the tag person, barring carelessness of the assessor. However, rules such as boat and water, or building and urban, cannot be directly expressed by the ontology, because there is no necessary relation within these pairs. General knowledge nevertheless tells us that an image or video clip showing a boat or ship usually contains a waterscape, and that an urban scene often contains buildings. Besides, manually defining these rules is time-consuming as well as expensive, and the defined rules are probably incomplete. Moreover, manually quantifying these rules is hard. So we adopt a statistical method that judges whether a tag is an auxiliary tag of another tag by a quantitative measure of the correlation between them. The correlation of the tag pair $(Y_i, Y_j)$ can be measured by the mutual information $I(Y_i, Y_j)$, which measures the mutual dependence of two random variables. The equation is
$$I(Y_i, Y_j) = \sum_{(Y_i, Y_j) \in \{\pm 1\}^2} P(Y_i, Y_j) \log \frac{P(Y_i, Y_j)}{P(Y_i)\, P(Y_j)} \tag{2}$$
The mutual information is an absolute magnitude that may be affected by the inherent information content, i.e. the entropy, of the random variables. To eliminate this effect and compare the mutual dependence of different pairs of random variables on the same scale, we use the normalized mutual information [42], defined by

$$\mathrm{NormI}(Y_i, Y_j) = \frac{I(Y_i, Y_j)}{\min\{H(Y_i), H(Y_j)\}} \tag{3}$$

where $H(Y_i)$ is the entropy of $Y_i$, defined by

$$H(Y_i) = \sum_{Y_i \in \{\pm 1\}} P(Y_i) \log \frac{1}{P(Y_i)} \tag{4}$$
It is assumed that $H(Y_i) \neq 0$, which is easily satisfied because a tag that is always present or always absent makes little sense in real-world applications. The normalized mutual information $\mathrm{NormI}(Y_i, Y_j)$ lies in the interval [0, 1]. $\mathrm{NormI}(Y_i, Y_j) = 0$ when $Y_i$ is independent of $Y_j$, while $\mathrm{NormI}(Y_i, Y_j) = 1$ indicates a strong relationship between $Y_i$ and $Y_j$, such as child and person.
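As an illustration of Eqs. (2)–(4), the sketch below estimates the normalized mutual information from two ±1 annotation columns by frequency counting and then keeps the tags whose value exceeds the threshold $t_c$. The function names and the toy columns are hypothetical. Empty joint cells are skipped, which follows the usual convention $0 \log 0 = 0$; this convention is an assumption that the text does not state.

```python
import math
from collections import Counter


def entropy(labels):
    """H(Y) estimated from label frequencies (natural logarithm)."""
    n = len(labels)
    return sum((c / n) * math.log(n / c) for c in Counter(labels).values())


def mutual_information(a, b):
    """I(Y_i, Y_j) estimated from joint frequencies; empty cells contribute 0."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())


def norm_mi(a, b):
    """Eq. (3); assumes both entropies are non-zero, as the text requires."""
    return mutual_information(a, b) / min(entropy(a), entropy(b))


def auxiliary_sets(columns, t_c):
    """For each tag, keep the other tags whose NormI with it exceeds t_c."""
    return {t: [a for a in columns
                if a != t and norm_mi(columns[t], columns[a]) > t_c]
            for t in columns}


# Hypothetical annotation columns over four training images.
columns = {
    "sky":   [1, 1, -1, -1],
    "cloud": [1, 1, -1, -1],   # always co-occurs with sky
    "car":   [1, -1, 1, -1],   # statistically independent of sky
}
aux = auxiliary_sets(columns, t_c=0.5)
```

Because the same logarithm base is used in the numerator and the denominator of Eq. (3), the choice of base does not affect the normalized value.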
4.2. Selection of auxiliary tags
According to Eq. (3), a larger $\mathrm{NormI}(Y_i, Y_j)$ means that $Y_i$ and $Y_j$ can provide each other with a larger share of their inherent information, so $Y_i$ and $Y_j$ are more likely to be correlated. A threshold $t_c \in [0, 1)$ is then used for a hard decision on whether $Y_i$ and $Y_j$ are considered correlated, i.e. whether they are auxiliary tags of each other. Hence for any $Y_i \in \Omega(Y_t)$, we have $\mathrm{NormI}(Y_t, Y_i) > t_c$. Clearly, $t_c$ controls the complexity of the framework, since a smaller $t_c$ results in more auxiliary tags and vice versa. As stated before, the auxiliary classifiers are not guaranteed to be reliable. Thus, as more auxiliary classifiers are used, more errors may be propagated to the final result and decrease the performance. Determining a proper $t_c$ is essentially a model selection problem, which can be solved in several ways. Since the proposed framework can be viewed as a mixture model, as described at the end of Section 3.2, we can use criteria such as the Bayesian information criterion (BIC) [43] or the minimum description length (MDL) [44] to obtain a proper $t_c$. General methods, such as cross-validation, can also solve this problem. In this paper, cross-validation is adopted because of its simplicity and because the validation set can be shared for searching the optimal parameters of the auxiliary classifiers. Specifically, the number of folds in cross-validation is set to 5.

5. Dynamic auxiliary mixture
Given an auxiliary tag $Y_i \in \Omega(Y_t)$, the proposed framework needs the weight of its corresponding classifier $P(Y_i \mid X)$ to complete the auxiliary procedure. Traditional methods are usually static, i.e. the weight of each classifier in the mixture is a constant, which can be learned in an unsupervised or supervised way [45,46]. They may be acceptable for the traditional learning problem, where all the classifiers are trained for the same tag, but in this paper the auxiliary classifiers are trained for different tags. We therefore propose the dynamic auxiliary mixture, and we show that it reduces to a static one by changing a single parameter. However, the experiments in Section 6 show that the dynamic mixture achieves better performance.

5.1. Nonparametric estimation for the weight
For each $Y_i \in \Omega(Y_t)$, a probabilistic model $P(Y_t \mid Y_i, X)$ is built to represent the weight of the auxiliary classifier $P(Y_i \mid X)$. The condition on the input vector $X$ expresses the variation of the weight across input vectors. It is not straightforward to model $P(Y_t \mid Y_i, X)$ directly, so Bayes' rule is used to transform $P(Y_t \mid Y_i, X)$ to

$$P(Y_t \mid Y_i, X) = \frac{P(X \mid Y_t, Y_i)\, P(Y_t, Y_i)}{P(X \mid Y_i)\, P(Y_i)} = P(Y_t \mid Y_i)\, \frac{P(X \mid Y_t, Y_i)}{P(X \mid Y_i)} \tag{5}$$

where the conditional probability $P(Y_t \mid Y_i)$ can be obtained easily by counting frequencies in the annotations of the training set. The probability density functions $P(X \mid Y_t, Y_i)$ and $P(X \mid Y_i)$ are estimated by a nonparametric method (Parzen window) as follows:

$$P(X \mid Y_i) = \frac{1}{|S(Y_i)|} \sum_{x_l \in S(Y_i)} \kappa(X, x_l) \tag{6}$$

where $S(Y_i)$ is a subset of the training set, defined as

$$S(Y_i) = \{x_j \mid y_j^{(i)} = Y_i\} \tag{7}$$

$\kappa(X, x_l)$ is the kernel function, which is usually implemented by the Gaussian function defined as

$$\kappa(x_i, x_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{8}$$

where $\sigma^2$ is the variance of the Gaussian kernel. $P(X \mid Y_t, Y_i)$ can be estimated in a similar way. The time complexity of this nonparametric estimation is $O(Nd)$, where $N$ is the size of the training set and $d$ is the dimension of the feature. It is much larger than that of finding the auxiliary tags, whose time complexity is $O(N)$. Therefore, as stated in the last section, setting a proper $t_c$ keeps the complexity of the fusion at an acceptable scale. In fact, if the training process for the auxiliary classifiers adopts the kernel trick with the RBF kernel, some calculation results can be shared to reduce the complexity.

5.2. Linkage to the static model
The parameter $\sigma$ controls the smoothness of the estimated density function. A larger $\sigma$ yields a smoother estimated density function, as shown in Fig. 2. In this example, we draw 100 points from a normal distribution N(0, 1) (the black solid line). Then we estimate the probability density function from these 100 points with different values of $\sigma$. The relationship between $\sigma$ and the smoothness of the estimated density function is easy to see. When $\sigma$ approaches infinity, the estimated density function becomes a horizontal line or hyperplane, which means that the value of this function is a constant. Hence our framework is approximately static when $\sigma$ is large enough.

[Fig. 2: estimated density p(x) plotted against x for the reference density and for σ = 1, σ = 0.3 and σ = 0.1.]
Fig. 2. Different variances for the estimated density functions.

6. Evaluation
6.1. Preliminaries
6.1.1. The data set
We choose 6346 images from the Corel image collection and annotate them with 61 words. The vocabulary covers some common natural scene objects (sky, tree, mountain, etc.), some common man-made objects (building, table, car, etc.), and some animals (tiger, horse, fox, etc.). There are 23,261 tags in total over all images, about 3.7 tags per image. A histogram of the number of tags per image is shown in Fig. 3. The colored pattern appearance model (CPAM) [47] is used to capture both color and texture characteristics of these images. The
CPAM is a codebook of common appearance prototypes trained from thousands of image patches. Thus the CPAM feature of an image is a histogram of the codewords composing the image, and the number of bins in our experiments is 128. The size of the training set is 3168 and that of the test set is 3178.

[Fig. 3: histogram; x-axis: Number of Tags (1 to 11), y-axis: Number of Images.]
Fig. 3. The histogram on the number of labels for each image.

6.1.2. Binary classifiers
We use SVM and LR to train the binary classifiers for all the tags because of their outstanding performance on binary learning problems. A linear kernel is used, so we only have to search for the optimal regularization parameters by 5-fold cross-validation. Because SVM models the distance from the sample to the decision hyperplane rather than the probability directly, we apply a sigmoid function to the distance to transform it into a probability-like value.

6.1.3. Evaluation metrics
Several metrics are proposed in [25] to evaluate the performance of a multi-label learning algorithm. Hamming loss is the average error rate over all tags, measuring how often a sample-tag pair is misclassified on average. The other metrics (coverage, ranking loss, one-error, etc.) are related to the rank of tags. For image annotation, relevant tags are expected to have higher ranks than irrelevant ones so as to be recommended earlier. Thus we choose coverage, ranking loss and one-error as the evaluation metrics.

(a) Coverage measures, on average, the minimum rank needed to cover all proper tags of a sample, and it is defined as

$$\text{coverage} = \frac{1}{p} \sum_{i=1}^{p} \left( \max_{j \in L} \operatorname{rank}(x_i, y_i^{(j)}) - 1 \right) \tag{9}$$

where $p$ is the size of the test set. The smaller the coverage, the better the performance of the algorithm.

(b) Ranking loss measures the average fraction of tag pairs that are in reversed order for each sample. We use the symbol rank-loss for short, which is defined as follows:

$$\text{rank-loss} = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|y_i^+|\,|y_i^-|} \left| \{ (Y_j, Y_k) \mid P(Y_j \mid x_i) \le P(Y_k \mid x_i),\ (Y_j, Y_k) \in y_i^+ \times y_i^- \} \right| \tag{10}$$

where $y_i^+ = \{Y_j \mid y_i^{(j)} = 1,\ 1 \le j \le L\}$ and $y_i^- = \{Y_k \mid y_i^{(k)} = -1,\ 1 \le k \le L\}$ contain the tags that are present in and absent from the $i$-th sample, respectively. As with coverage, a small rank-loss indicates good performance.

(c) One-error is the frequency with which the top-ranked tag is wrong. It is defined as

$$\text{one-error} = \frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\left( \arg\max_{Y_j \in Y} P(Y_j = 1 \mid x_i) \notin y_i^+ \right) \tag{11}$$

where $\mathbb{1}(\pi) = 1$ when $\pi$ is true and 0 otherwise. Thus one-error is the same as the classification error rate in the single-label learning problem, and it decreases as performance improves.

6.2. Average size of auxiliary tag sets
To evaluate the effect of different average sizes of auxiliary tag sets, we first define the auxiliary rate (AR) as follows:

$$\mathrm{AR} = \frac{1}{L} \sum_{i=1}^{L} \frac{|\Omega(Y_i)|}{L - 1} \tag{12}$$
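The quantities in Eqs. (9)–(12) can be computed with a few lines of code. The sketch below is only illustrative: it assumes that each sample has at least one relevant and one irrelevant tag, that rank 1 is the highest score, and that the maximum in Eq. (9) runs over the tags present in the sample, as the phrase "cover all proper tags" suggests. The data layout (score lists and sets of relevant tag indices) and all values are hypothetical.

```python
def coverage(scores, relevant):
    """Eq. (9): average depth in the ranked list needed to cover all relevant tags."""
    total = 0
    for s, rel in zip(scores, relevant):
        order = sorted(range(len(s)), key=lambda j: -s[j])   # rank 1 = highest score
        rank = {j: r + 1 for r, j in enumerate(order)}
        total += max(rank[j] for j in rel) - 1
    return total / len(scores)


def rank_loss(scores, relevant):
    """Eq. (10): average fraction of (relevant, irrelevant) pairs ranked in reverse."""
    total = 0.0
    for s, rel in zip(scores, relevant):
        irr = [k for k in range(len(s)) if k not in rel]
        bad = sum(1 for j in rel for k in irr if s[j] <= s[k])
        total += bad / (len(rel) * len(irr))
    return total / len(scores)


def one_error(scores, relevant):
    """Eq. (11): fraction of samples whose top-ranked tag is not relevant."""
    wrong = sum(1 for s, rel in zip(scores, relevant)
                if max(range(len(s)), key=lambda j: s[j]) not in rel)
    return wrong / len(scores)


def auxiliary_rate(aux_sets, num_tags):
    """Eq. (12): mean auxiliary-set size divided by the maximum possible size L - 1."""
    return sum(len(a) for a in aux_sets.values()) / (num_tags * (num_tags - 1))


# Two hypothetical test images scored over four tags.
scores = [[0.9, 0.2, 0.6, 0.1], [0.3, 0.8, 0.5, 0.4]]
relevant = [{0, 2}, {0}]
```

On this toy input the first image is ranked perfectly and the second image has its only relevant tag ranked last, so each of the three ranking metrics averages one perfect and one worst-case sample.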
$t_c$ controls the complexity of SADATA, as described in Section 4, and this complexity can be expressed by the AR. The least complexity is achieved by setting a very large $t_c$, which leads to AR = 0 and means that there is no auxiliary classifier for classifying tag $Y_i$ other than the target classifier $P(Y_i \mid X)$ itself. Conversely, if $t_c$ is small enough, AR = 1 and all classifiers are auxiliary classifiers. Fig. 4 shows that the auxiliary classifiers boost the performance of the target classifiers regardless of the value of $\sigma$. For a given $\sigma$, a proper AR can be found by the model selection methods stated in Section 4. However, a proper $\sigma$ leads to better performance at a lower AR, i.e. a lower complexity, as shown in Fig. 4. The proper value of $\sigma$ is discussed later. Comparing the sub-figures in Fig. 4, it can also be seen that although SVM and LR have different optimization objectives, SADATA boosts the classifiers trained by both. Furthermore, the trends of the curves are similar, which indicates that the proposed framework is independent of the learning algorithms of the classifiers.

6.3. Dynamic auxiliary mixture
The weights of the auxiliary classifiers are modeled with respect to the input vector and are therefore dynamic. As shown in Fig. 2, a larger $\sigma$ makes the model more static. On the contrary, if $\sigma$ is set to a very small value, the estimated density function is very sharp at a few points and nearly 0 at most points, so the values of $P(X \mid Y_t, Y_i)$ and $P(X \mid Y_i)$ in Eq. (5) are dominated by the minimum distances between the input vector and the vectors in $S(Y_t, Y_i)$ and $S(Y_i)$, respectively. Such an estimate is similar to the 1-NN method and generalizes poorly. Besides, numerical problems arise because the denominator in Eq. (5) is usually almost 0. Therefore $\sigma$ should be chosen carefully, which can be done by model selection methods such as cross-validation.
Under the Gaussian assumption, we set $\sigma$ to the standard deviation of the distances between vectors in the training set, which corresponds to the blue solid curve in Fig. 4. Such a value of $\sigma$ is suitable for the probability density estimation and leads to better performance at a lower complexity. Even if $\sigma$ is not chosen carefully, it should empirically take a relatively small value to make the model less static. From Fig. 4 we can see that the most static model ($\sigma^2 = 10$) has the worst optimal performance. That is to say, it is important to keep the mixture model dynamic.
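The effect of $\sigma$ on the weight in Eq. (5) can be seen in a small sketch that follows Eqs. (5)–(8) with plain Python lists. The function names and the one-dimensional toy data are hypothetical, and the fallback to $P(Y_t \mid Y_i)$ when the denominator underflows is an assumption of this sketch rather than something the text prescribes.

```python
import math


def parzen(x, samples, sigma):
    """Eq. (6) with the Gaussian kernel of Eq. (8)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        total += norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(samples)


def dynamic_weight(x, train_x, y_t, y_i, sigma):
    """Eq. (5): P(Y_t | Y_i, X) = P(Y_t | Y_i) * P(X | Y_t, Y_i) / P(X | Y_i)."""
    s_i = [v for v, b in zip(train_x, y_i) if b == 1]
    s_ti = [v for v, a, b in zip(train_x, y_t, y_i) if a == 1 and b == 1]
    if not s_i or not s_ti:
        return 0.0                       # no training evidence for this pair
    prior = len(s_ti) / len(s_i)         # P(Y_t | Y_i) by frequency counting
    denom = parzen(x, s_i, sigma)
    if denom == 0.0:
        return prior                     # assumed fallback when sigma is too small
    return prior * parzen(x, s_ti, sigma) / denom


# Hypothetical 1-D features: the target tag appears only near 0.
train_x = [[0.0], [0.1], [1.0], [1.1]]
y_i = [1, 1, 1, 1]
y_t = [1, 1, -1, -1]

w_near = dynamic_weight([0.0], train_x, y_t, y_i, sigma=0.2)
w_far = dynamic_weight([1.0], train_x, y_t, y_i, sigma=0.2)
w_static = dynamic_weight([0.0], train_x, y_t, y_i, sigma=1e3)
```

With a small $\sigma$ the weight depends strongly on where the input lies, while with a very large $\sigma$ it approaches the constant $P(Y_t \mid Y_i)$, which matches the static limit described in Section 5.2.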
S. Zhang et al. / Pattern Recognition 43 (2010) 470 -- 477 Table 1 Performance comparison with other multi-label learning (MLL) algorithms.
25
MLL algorithm
σ2 = 10
24
σ2 = 0.1 23 Coverage
475
SADATA-SVM SADATA-LR ADTBOOST.MH ML-KNN
σ2 = 0.001
22
Evaluation metrics Coverage
Rank-loss
One-error
18.0098 19.2058 19.5120 19.4352
0.1254 0.1343 0.1325 0.1276
0.3996 0.4251 0.4434 0.3965
SADATA-SVM and SADATA-LR are SADATA frameworks with SVM and LR as classifiers, respectively. The numbers in bold indicate the best results.
21 20
Table 2 Performance comparison between SADATA and the general classifiers.
19
Algorithm
18 0
0.2
0.4 0.6 Auxiliary Rate
0.8
1
Evaluation metrics Coverage
Rank-loss
One-error
SADATA-SVM SVM
18.0098 21.3622
0.1254 0.1480
0.3996 0.3571
SADATA-LR LR
19.2058 26.3458
0.1343 0.1793
0.4251 0.3037
Fig. 4. The coverage of SADATA with different model complexities, i.e. the auxiliary rate, and different model dynamics, i.e. σ. The classifiers in (a) are trained by SVM and those in (b) by LR. [Plots not reproduced: coverage versus auxiliary rate, with curves for σ² = 10, σ² = 0.1 and σ² = 0.001.]

6.4. Comparison with state-of-the-art methods

We compare SADATA with two general multi-label learning algorithms: the multi-label decision tree ADTBOOST.MH [48] and the multi-label lazy learning approach ML-KNN [29]. For ADTBOOST.MH, the number of boosting rounds is set to 10 because the performance has almost converged after 10 rounds. For ML-KNN, K is selected by 5-fold cross-validation and is finally set to 30. In SADATA, we set σ² = 0.1, which is the variance of the training set, and AR = 0.393, which is obtained by 5-fold cross-validation. Table 1 shows that SADATA with SVM classifiers performs best on coverage and rank-loss, while SADATA with LR classifiers is comparable to the other algorithms.

The contribution of SADATA has been shown in Fig. 4; Table 2 makes it clearer. In Table 2, the obvious improvements in coverage and rank-loss confirm the advantage of our learning framework. Unfortunately, one-error increases noticeably after applying SADATA, which means that more samples have wrong top-ranked tags. However, in semi-automatic image annotation the quality of the overall ranking is much more important than the correctness of the top-ranked tag, because several tags for an image are presented to the user at one time and thus the precision of the top-ranked tag does not affect the performance of the system much.

6.5. Relevance feedback
First, we evaluate how much SADATA benefits from relevance feedback in a semi-automatic image annotation system. For an unlabeled image, the system first suggests w tags. The user judges whether each tag is relevant, and the system suggests w more tags after receiving the feedback. This procedure repeats until all tags have been judged. A smarter system covers all the relevant tags in fewer rounds and thus reduces the annotation cost, so we record the recall after each round to evaluate the interactive system. The parameters of SADATA are the same as in the previous experiment, i.e., AR = 0.393 and σ² = 0.1, and w is set to 5, which means that five tags are suggested in each round.

Fig. 5 shows that the SADATA classifiers return fewer relevant tags than the general classifiers at the beginning; that is, their top-five tags are less precise. This agrees with the one-error results in Table 2, which indicate that the general classifiers assign the top-ranked label to each sample more precisely. However, the SADATA classifiers produce a better overall ranking, as Table 2 also shows. Unlike the general classifiers, which are independent of each other, the SADATA classifiers benefit from the relevance feedback: the remaining tags are re-ranked in a better order after each round of feedback, so the recall increases faster. For both the SVM and LR classifiers of SADATA, the recall reaches 0.9 after about four rounds, which is much higher than that of the general classifiers.

Second, we compare our algorithm with the other multi-label learning algorithms in the application of semi-automatic image annotation. Similar to Fig. 5, Fig. 6 shows that the SADATA classifiers perform worse at first, but after three rounds of interaction they clearly outperform the others, because our framework learns from the feedback and improves the ranking of the remaining tags.

7. Conclusion

In this paper, we propose a multi-label learning framework, named SADATA, for the semi-automatic image annotation problem. For a given target tag, we use the outputs of a set of auxiliary classifiers corresponding to the auxiliary tags to boost the prediction performance. The set of auxiliary tags for a target tag is determined by the correlation between tags, and thresholding this correlation limits the average size of the auxiliary tag sets, which reduces the time complexity. The dynamic auxiliary mixture also differs from the traditional static one. Relevance feedback on tags changes the values of the components in the mixture, which yields a more accurate probability for the target tag and thus speeds up the image annotation procedure. SADATA is evaluated on a collection of Corel images. The experiments show that SADATA improves the performance and that the dynamic mixture outperforms the traditional static one. With relevance feedback on the tags, the annotation results cover about 90% of the relevant tags within a few rounds. Comparison with other state-of-the-art multi-label learning algorithms also shows that the proposed framework performs better, especially in the application of semi-automatic image annotation.
Fig. 5. The recall of the tags after different numbers of feedback rounds. The classifiers in (a) are trained by SVM and those in (b) by LR. [Plots not reproduced: recall versus interactive rounds; (a) SADATA-SVM and SVM, (b) SADATA-LR and LR.]
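The recall curves in Fig. 5 come from the interactive protocol of Section 6.5: w unjudged tags are shown per round, the user marks each as relevant or not, and recall is recorded after every round. The loop below is a minimal sketch of that protocol, not the authors' code. The `rerank` callback is a hypothetical stand-in for the step in which SADATA uses the feedback to update its scores; passing `None` models the general classifiers, whose scores never change. All names and the toy data are assumptions.

```python
def recall_per_round(scores, relevant, w=5, rerank=None):
    """Simulate tag suggestion with user feedback; return recall after each round."""
    current = list(scores)
    remaining = set(range(len(scores)))  # tags not yet judged by the user
    feedback = {}                        # tag index -> True (relevant) / False
    found = 0
    recalls = []
    while remaining:
        # Suggest the w highest-scoring unjudged tags (ties broken by index).
        batch = sorted(remaining, key=lambda j: (-current[j], j))[:w]
        for tag in batch:
            feedback[tag] = tag in relevant
            found += feedback[tag]
            remaining.discard(tag)
        recalls.append(found / len(relevant))
        if rerank is not None:
            # Hypothetical hook: let the model revise scores from the feedback.
            current = rerank(current, feedback)
    return recalls


def toy_rerank(current, feedback):
    """Toy rule: once tag 0 is confirmed, promote its correlated tag 3."""
    updated = list(current)
    if feedback.get(0):
        updated[3] = 1.0
    return updated


scores = [0.9, 0.8, 0.7, 0.1, 0.6, 0.5]
relevant = {0, 3}

static_curve = recall_per_round(scores, relevant, w=2)
dynamic_curve = recall_per_round(scores, relevant, w=2, rerank=toy_rerank)
```

In this toy run both variants suggest the same first batch, but the re-ranked variant reaches full recall one round earlier, which mirrors the qualitative effect the text attributes to feedback.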
Fig. 6. The comparison between several multi-label learning algorithms on the recall of the tags after different numbers of feedback rounds. [Plot not reproduced: recall versus interactive rounds for SADATA-SVM, SADATA-LR, ADTBOOST.MH and ML-KNN.]
References

[1] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video content: the QBIC system, Computer 28 (9) (1995) 23–32.
[2] T. Gevers, A. Smeulders, Pictoseek: combining color and shape invariant features for image retrieval, IEEE Transactions on Image Processing 9 (1) (2000) 102–119.
[3] A. Gupta, R. Jain, Visual information retrieval, Communications of the ACM 40 (5) (1997) 70–79.
[4] W. Ma, B. Manjunath, Netra: a toolbox for navigating large image databases, in: International Conference on Image Processing, vol. 1, 1997, p. 568.
[5] J.R. Smith, S.-F. Chang, Visualseek: a fully automated content-based image query system, in: MULTIMEDIA '96: Proceedings of the Fourth ACM International Conference on Multimedia, ACM, New York, USA, 1996, pp. 87–98.
[6] A.W. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1349–1380.
[7] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (6) (2008) 985–1002.
[8] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, M.I. Jordan, Matching words and pictures, Journal of Machine Learning Research 3 (2003) 1107–1135.
[9] K. Tieu, P. Viola, Boosting image retrieval, International Journal of Computer Vision 56 (1–2) (2004) 17–36.
[10] S.-F. Cheng, W. Chen, H. Sundaram, Semantic visual templates: linking visual features to semantics, in: Proceedings of the International Conference on Image Processing, ICIP 98, vol. 3, 4–7 October 1998, pp. 531–535.
[11] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: MULTIMEDIA '01: Proceedings of the Ninth ACM International Conference on Multimedia, ACM, New York, USA, 2001, pp. 107–118.
[12] C. Zhang, T. Chen, An active learning framework for content-based information retrieval, IEEE Transactions on Multimedia 4 (2) (2002) 260–268.
[13] F. Monay, D. Gatica-Perez, On image auto-annotation with latent space models, in: MULTIMEDIA '03: Proceedings of the Eleventh ACM International Conference on Multimedia, ACM, New York, USA, 2003, pp. 275–278.
[14] A. Singhal, J. Luo, W. Zhu, Probabilistic spatial context models for scene content understanding, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 18–20 June 2003, pp. I-235–I-241.
[15] X. He, W.-Y. Ma, H.-J. Zhang, Learning an image manifold for retrieval, in: MULTIMEDIA '04: Proceedings of the 12th Annual ACM International Conference on Multimedia, ACM, New York, USA, 2004, pp. 17–23.
[16] Y. Rui, T. Huang, M. Ortega, S. Mehrotra, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology 8 (5) (1998) 644–655.
[17] Y. Rui, T. Huang, Optimizing learning in image retrieval, in: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2000, pp. 236–243.
[18] A. Kushki, P. Androutsos, K. Plataniotis, A. Venetsanopoulos, Query feedback for interactive image retrieval, IEEE Transactions on Circuits and Systems for Video Technology 14 (5) (2004) 644–655.
[19] J. Guan, G. Qiu, Learning user intention in relevance feedback using optimization, in: MIR '07: Proceedings of the International Workshop on Workshop on Multimedia Information Retrieval, ACM, New York, USA, 2007, pp. 41–50.
[20] J. Liu, Z. Li, M. Li, H. Lu, S. Ma, Human behaviour consistent relevance feedback model for image retrieval, in: MULTIMEDIA '07: Proceedings of the 15th International Conference on Multimedia, ACM, New York, USA, 2007, pp. 269–272.
[21] L. Wenyin, S. Dumais, Y. Sun, H. Zhang, M. Czerwinski, B. Field, Semi-automatic image annotation, in: Proceedings of Conference on HCI (INTERACT), IOS Press, 2001, pp. 326–333.
[22] A. Dorado, E. Izquierdo, Semi-automatic image annotation using frequent keyword mining, in: Proceedings of the Seventh International Conference on Information Visualization, vol. IV, 16–18 July 2003, pp. 532–535.
[23] C. Yang, M. Dong, F. Fotouhi, I2A: an interactive image annotation system, 6–8 July 2005.
[24] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[25] R.E. Schapire, Y. Singer, BoosTexter: a boosting-based system for text categorization, Machine Learning 39 (2/3) (2000) 135–168.
[26] J. Rousu, C. Saunders, S. Szedmak, J. Shawe-Taylor, Kernel-based learning of hierarchical multilabel classification models, Journal of Machine Learning Research 7 (2006) 1601–1626.
[27] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Neural Information Processing Systems, 2001, pp. 681–687.
[28] F. Kang, R. Jin, R. Sukthankar, Correlated label propagation with application to multi-label learning, in: CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2006, pp. 1719–1726.
[29] M.-L. Zhang, Z.-H. Zhou, ML-KNN: a lazy learning approach to multi-label learning, Pattern Recognition 40 (7) (2007) 2038–2048.
[30] S. Zhu, X. Ji, W. Xu, Y. Gong, Multi-labelled classification using maximum entropy method, in: SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, USA, 2005, pp. 274–281.
[31] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, H.-J. Zhang, Correlative multi-label video annotation, in: MULTIMEDIA '07: Proceedings of the 15th International Conference on Multimedia, ACM, New York, USA, 2007, pp. 17–26.
[32] M.-L. Zhang, Z.-H. Zhou, Multi-label learning by instance differentiation, in: Proceedings of the 22nd Conference on Artificial Intelligence, Vancouver, Canada, 2007, pp. 669–674.
[33] M.R. Boutell, X.S.J. Luo, C.M. Brown, Learning multi-label scene classification, Pattern Recognition 37 (9) (2004) 1757–1771.
[34] T. Gonçalves, P. Quaresma, A preliminary approach to the multilabel classification problem of Portuguese juridical documents, in: EPIA, 2003, pp. 435–444.
[35] T. Li, M. Ogihara, Detecting emotion in music, in: ISMIR, 2003.
[36] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[37] S. Diplaris, G. Tsoumakas, P.A. Mitkas, I.P. Vlahavas, Protein classification with multiple algorithms, in: Panhellenic Conference on Informatics, 2005, pp. 448–456.
[38] T.M.R.N. Kozintsev, I.V. Huang, Factor graph framework for semantic video indexing, IEEE Transactions on Circuits and Systems for Video Technology 12 (1) (2002) 40–52.
[39] B.S.J. Wu, Y. Tseng, Ontology-based multi-classification learning for video concept detection, in: IEEE International Conference on Multimedia and Expo 2004, vol. 2, 2004, pp. 1003–1006.
[40] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Computation 3 (1) (1991) 79–87.
[41] C. Bishop, M. Svensén, Bayesian Hierarchical Mixtures of Experts, Morgan Kaufmann, San Francisco, CA, 2003.
[42] J. Karmeshu (Ed.), Entropy Measures, Maximum Entropy Principle and Emerging Applications, Springer, Berlin, 2003.
[43] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
[44] J.J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.
[45] P.D. Gader, M.A. Mohamed, J.M. Keller, Fusion of handwritten word classifiers, Pattern Recognition Letters 17 (6) (1996) 577–584.
[46] T.K. Ho, J. Hull, S. Srihari, Decision combination in multiple classifier systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1) (1994) 66–75.
[47] G. Qiu, Indexing chromatic and achromatic patterns for content-based colour image retrieval, Pattern Recognition 35 (8) (2002) 1675–1686.
[48] F. De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, Machine Learning and Data Mining in Pattern Recognition (2003) 251–274.
About the Author: SHILE ZHANG received his B.S. degree in Computer Science and Technology from Fudan University, Shanghai, China, in 2004. He is now a Ph.D. candidate in the School of Computer Science, Fudan University, Shanghai, China. His current research interests include video retrieval, machine learning, and computer vision.

About the Author: BIN LI received his B.Eng. degree in Software Engineering from Southeast University, Nanjing, China, in 2004. He is now a Ph.D. candidate with the School of Computer Science, Fudan University, Shanghai, China. His current research interests include machine learning and data mining methods as well as their applications to large-scale cross-media information retrieval, social network mining, and computer vision.

About the Author: XIANGYANG XUE received the B.S., M.S., and Ph.D. degrees in communication engineering from Xidian University, Xi'an, China, in 1989, 1992 and 1995, respectively. Since 1995, he has been with the School of Computer Science, Fudan University, Shanghai, China, where he is currently a Professor. His research interests include multimedia information processing and retrieval, pattern recognition, and machine learning.