Active Learning in Very Large Databases

Navneet Panda, Kingshy Goh, and Edward Y. Chang
University of California, Santa Barbara

Abstract. Query-by-example and query-by-keyword both suffer from the problem of "aliasing," meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take.

Keywords: active learning, image retrieval, query concept, relevance feedback, support vector machines

1. Introduction

To query a database—whether structured, semi-structured, or unstructured—end-users must explicitly specify criteria via a query language or keywords. When querying multimedia data such as imagery and video sequences, however, it is unrealistic to expect an end-user to specify a query concept fully. Consider the difficulty of articulating an image concept such as "landscape." Articulating even such a simple concept in terms of low-level perceptual features (color and spatial relationships) is virtually impossible. The mapping of perceptual features to high-level semantics is very complicated, often involving thousands of predicates (Chang and Li, 2003). An end-user cannot be expected to specify such queries in a comprehensive manner. Without a complete understanding of what a query seeks, no database system can return satisfactory results. Query-by-example was proposed in (Flickner et al., 1995) to alleviate such difficulties in query-concept articulation. Another popular query-concept specification method is query-by-keyword. Both methods suffer from the problem of "aliasing," meaning that example-images or keywords potentially have multiple semantics. To discern which semantic is appropriate for a particular query and user, the Computer Vision and Database communities have adopted relevance-feedback methods developed in the IR community. Unfortunately, since most perceptual concepts do not form a cluster in the input space formed by the perceptual features, and since the number of available labeled instances (the number of images that a user is willing to label in the feedback process) is small relative to the data dimensionality, traditional IR techniques are not effective for addressing the aliasing problem (Chang et al., 2003b).


It has recently been established that a combination of active learning and kernel methods can effectively produce a non-linear binary classifier that separates the images matching the query concept from the irrelevant ones (Tong and Chang, 2001). While kernel methods provide the model complexity required to formulate virtually any complex class boundary, active learning selects the most useful training instances for learning such a boundary quickly. The main issue in active learning is finding the most useful training instances in the unlabeled pool U both effectively and efficiently. In this paper, we first address the effectiveness issue by examining sample-selection strategies. We then study one key database design issue—scalability. We address the scalability of active learning in two respects: concept complexity and dataset size. For concept complexity, we first examine the effect of concept scarcity, diversity, and isolation on learning performance. We propose using negative transduction (which performs transductive learning to rapidly expand the coverage of irrelevant data so as to distill the data relevant to the query concept) and co-training (which provides different views on the data so as to grow both the relevant and the irrelevant pools of data) (Blum and Mitchell, 1998) to enhance active learning when a concept is rare or commingled with other concepts in the input space. For scalability in dataset size, we show that by using a clustering scheme to index the data and using cluster prototypes to perform hierarchical sample selection, the amount of data that active learning must process to identify samples can be drastically reduced. In addition to the clustering scheme, we also propose a kernel indexer, for which preliminary results demonstrate promising performance on hyperplane queries, i.e., queries in which the concept is represented by a hyperplane (rather than a point) in a vector space. Finally, we discuss future directions that research might take.

2. Active Learning Strategies

Before we discuss learning challenges and suggest solutions to overcome those challenges, we examine three active-learning strategies. Our goal here is to determine which is the optimal active-learning strategy when the concept's learnability is not an issue. We also explain why the strategy works. It is important to establish a baseline active-learning algorithm for two reasons: to ensure that the subpar learning performance for some concepts is due to the concepts' complexity, and not the algorithm; and to understand better the limitations of active learning. We employ Support Vector Machines (SVMs) as our base learning algorithm because of their effectiveness in many learning tasks. We consider SVMs in the binary classification setting. We are given training data {x_1, ..., x_n} that are vectors in some space X ⊆ R^d, together with their labels {y_1, ..., y_n}, where y_i ∈ {−1, 1}. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin. All vectors lying on one side of the hyperplane are labeled −1, and all vectors lying on the other side are labeled +1. The training instances that lie closest to the hyperplane are called support vectors. More generally, SVMs allow us to project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel


operator K. In other words, we consider the set of classifiers of the form f(x) = Σ_{i=1}^{n} α_i K(x_i, x). When f(x) ≥ 0 we classify x as +1; otherwise we classify x as −1. When K satisfies Mercer's condition we can write K(u, v) = Φ(u) · Φ(v), where Φ : X → F and "·" denotes an inner product. We can then rewrite f as f(x) = w · Φ(x), where w = Σ_{i=1}^{n} α_i Φ(x_i). The SVM computes the α_i that correspond to the maximal-margin hyperplane in F. By choosing various kernel functions we can implicitly project the training data from X into various feature spaces. (A hyperplane in F maps to a more complex decision boundary in the original space X.)

In employing active learning with SVMs, the idea is to select unlabeled instances for user feedback so as to achieve maximal refinement of the hyperplane between the two classes. Clearly, selecting unlabeled instances far away from the hyperplane is unhelpful, since their class memberships are already certain. The most useful instances for refining the hyperplane are the unlabeled instances near the hyperplane, in the margin. Combining active learning with SVMs, the SVMActive scheme (Tong and Chang, 2001) learns the query concept and returns the k images most relevant to the concept in three steps:

1. SVMActive regards the task of learning a target concept as one of learning an SVM binary classifier. Points on one side of the hyperplane are considered relevant to the query concept, and those on the other side irrelevant.

2. SVMActive learns the classifier quickly via active learning. The active component of SVMActive selects the most informative instances with which to train the SVM classifier. This step ensures fast convergence to the query concept in a small number of feedback rounds.

3. Once the classifier is trained, SVMActive returns the top-k most relevant images: the k images farthest from the hyperplane on the query-concept side.

The main difference between an active learner and a regular passive learner is the querying component, which raises the issue of how to choose the next unlabeled instance in the pool to query. In (Tong and Koller, 2000), Tong and Koller show in theory how to select one unlabeled instance to achieve the maximal refinement. In practice, however, we need to perform multiple pool-queries at the same time. It is impractical to present one image at a time for users to label, because they are likely to lose patience after a few rounds. To prevent this, we present the user with multiple images (say, 20) in each round of pool-querying. Thus, the active learner has more choices for selecting the next batch of unlabeled instances to present to the user. Next, we present and compare three methods for choosing unlabeled instances: speculative, simple, and angle-diversity.
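The three-step scheme above can be summarized in a short sketch. This is a minimal illustration, not the original SVMActive implementation: it assumes scikit-learn's SVC (with an RBF kernel standing in for the Laplacian kernel used later in this paper), and `get_user_labels` is a hypothetical stand-in for the relevance-feedback interface.

```python
# Minimal sketch of the SVMActive loop (illustrative, not the authors' code).
import numpy as np
from sklearn.svm import SVC

def svm_active(X_labeled, y_labeled, X_pool, get_user_labels, rounds=8, h=20, k=20):
    for _ in range(rounds):
        clf = SVC(kernel="rbf", gamma=0.001).fit(X_labeled, y_labeled)
        # Instances closest to the hyperplane are the most ambiguous ones.
        dist = np.abs(clf.decision_function(X_pool))
        pick = np.argsort(dist)[:h]                # h pool instances in the margin
        y_new = get_user_labels(X_pool[pick])      # solicit user feedback (+1 / -1)
        X_labeled = np.vstack([X_labeled, X_pool[pick]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, pick, axis=0)
    # Top-k retrieval: the k instances farthest from the hyperplane on the relevant side.
    scores = clf.decision_function(X_pool)
    return X_pool[np.argsort(-scores)[:k]]
```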


2.1. Speculative Algorithm

We present a speculative procedure that recursively generates samples by speculating about user feedback. The algorithm starts by finding one highly informative sample (the closest unlabeled instance to the hyperplane). It then speculates upon the two possible labels for that sample and generates two more samples, one based on the positive speculation and one on the negative. The algorithm speculates recursively, generating a binary tree of samples. The speculative procedure is a near-optimal sampling strategy: it selects each sample by speculating on the outcome of the previous sample's label. Although the single-sample selection step is optimal in theory (Tong and Chang, 2001), speculation may not always be accurate. Moreover, the algorithm must train a hyperplane for selecting each sample, and hence it is computationally intensive. We use it as a yardstick to measure how well the other active-learning strategies perform.

2.2. Simple Active Algorithm

Tong and Chang proposed the simple algorithm (Tong and Chang, 2001), which chooses the h unlabeled instances closest to the separating hyperplane (between the relevant and the irrelevant instances in the feature space) to solicit user feedback. Based on the labeled pool L, the algorithm first trains a binary classifier f. The classifier f is then applied to the unlabeled pool U to compute each unlabeled instance's distance to the separating hyperplane. The h unlabeled instances closest to the hyperplane, and relatively far apart from one another, are chosen as the next batch of samples for conducting pool-queries. The main idea of simple is that the h instances closest to the hyperplane are the most ambiguous ones with respect to the f trained on L. Algorithm simple attempts to achieve maximal disambiguation by selecting the most informative samples in U to add to L to refine f.

2.3. Angle-Diversity Algorithm

Previous work has pointed out that the selected samples should be diversified (Tong and Chang, 2001). The work of (Brinker, 2003) incorporates a diversity metric into sample selection. The main idea is to select a collection of samples close to the classification hyperplane while at the same time maintaining their diversity, which is measured by the angles between the samples. Suppose instance x_i induces a hyperplane h_i with normal vector Φ(x_i). The angle between two hyperplanes h_i and h_j, corresponding to instances x_i and x_j, can be written in terms of the kernel operator K:

|cos(∠(h_i, h_j))| = |Φ(x_i) · Φ(x_j)| / (‖Φ(x_i)‖ ‖Φ(x_j)‖) = |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j)).    (1)

The angle-diversity algorithm (Brinker, 2003) starts with an initial hyperplane trained on the given labeled set L. Then, for each unlabeled instance x_i, it computes the distance to the classification hyperplane. The angle between an unlabeled instance x_i and the current sample set S is defined as the maximal angle from instance x_i to any instance x_j in set S. This angle measures how diverse the resulting sample set S would be if instance x_i were chosen as a sample.


Algorithm angle-diversity introduces a parameter λ to balance two components: the distance from samples to the classification hyperplane and the diversity of angles among samples. Incorporating the trade-off factor, the final score for the unlabeled instance x_i can be written as

λ · |f(x_i)| + (1 − λ) · max_{x_j ∈ S} ( |K(x_i, x_j)| / √(K(x_i, x_i) K(x_j, x_j)) ),    (2)

where f computes the distance to the hyperplane, K is the kernel operator, and S is the set of samples selected so far. The algorithm then selects as the next sample the unlabeled instance with the smallest score in U, and repeats these steps h times to select h samples. In practice, the work of (Brinker, 2003) shows that the algorithm achieves good performance with the trade-off parameter λ set at 0.5. Goh and Chang (Goh et al., 2004) show that the λ value should be set in a query-concept-dependent way; please consult that paper for details.
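To make the selection rule concrete, here is a greedy sketch of Equation (2). It is illustrative rather than the authors' implementation; `f_dist` (the trained classifier's distance to the hyperplane) and `kernel` are assumed to be supplied by the caller.

```python
# Greedy angle-diversity selection (Eq. 2); an illustrative sketch.
import numpy as np

def angle_diversity_select(X_pool, f_dist, kernel, h=20, lam=0.5):
    """f_dist(x): distance to the hyperplane; kernel(a, b): Mercer kernel value."""
    selected = []
    candidates = list(range(len(X_pool)))
    for _ in range(h):
        best_i, best_score = None, np.inf
        for i in candidates:
            xi = X_pool[i]
            # Diversity term: maximal |cos angle| between x_i and selected samples.
            div = max((abs(kernel(xi, X_pool[j])) /
                       np.sqrt(kernel(xi, xi) * kernel(X_pool[j], X_pool[j]))
                       for j in selected), default=0.0)
            score = lam * abs(f_dist(xi)) + (1 - lam) * div
            if score < best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        candidates.remove(best_i)
    return selected
```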

2.4. Baseline Evaluation

This experiment examines the performance of four sampling algorithms: random, simple active, speculative, and angle-diversity. The first strategy selects random images from the dataset as samples for user feedback; the others have been described in this section. We conducted experiments to compare these sampling strategies in terms of top-k retrieval accuracy.

Figure 1. Top-20 precision vs. iteration on the 50K dataset (strategies: random, simple, angle, speculative, and estimation).

For the 107-category dataset, Figure 1 shows the top-20 precision versus iteration for five sampling strategies. The figure shows that the angle-diversity algorithm performs the best among all active-learning algorithms: it performs as well as, and in some iterations even better than, the speculative algorithm, which is supposed to achieve nearly optimal performance. This result confirms that samples should be diverse as well as semantically uncertain (near the hyperplane). We therefore use angle-diversity as our baseline algorithm.


3. Scalability Issues

When dealing with a large dataset containing a large number of potential query concepts, active learning faces scalability challenges. We study two factors affecting active learning's performance: concept complexity and dataset size.

The first scalability issue that active learning needs to resolve is concept complexity. To characterize concept complexity, we use three measures: diversity, scarcity, and isolation. We only define these measures here, without detailed discussion, due to space limitations. (For a detailed treatment of concept complexity, please refer to (Goh et al., 2004).)

− Diversity. Diversity characterizes the distribution of a concept in the input space (the space formed by the perceptual features). A diverse concept tends to spread out in the input space.

− Scarcity. To characterize scarcity, we use hit-rate, defined as the percentage of images in the dataset matching a query concept.

− Isolation. Isolation measures how a concept is affected by the others in the input space. When more than one concept is co-located in the same subspace, isolation is low.

The second scalability issue is an obvious one. When the size of the unlabeled pool U is large, it is computationally prohibitive to scan the entire unlabeled pool to select samples. We need an effective sampling method that selects samples without involving the entire U. We also need an effective indexing method for retrieving the top-k results with respect to the final learned concept, which is represented by a hyperplane. (Traditional indexing methods deal only with point queries.) To tackle these challenges, we develop the following three remedies:

1. Co-training, for concept-complexity scalability;

2. Hierarchical sampling, for sampling scalability; and

3. A kernel indexer (KDX), for retrieval scalability.

3.1. Co-training

Here, we deal only with the learnability problems caused by a low hit-rate. When a concept is scarce (the hit-rate is low), learning the concept is difficult. As counter-intuitive as this may sound, the scarcity of negative examples is more problematic than the scarcity of positive examples for our learning algorithms. This is because describing a concept such as "lake" may be challenging, but describing a negated concept (in this case, the "non-lake" concept) can require potentially infinite information. (The number of images needed to adequately portray the lake concept is finite, but the number of non-lake images is infinite.) Negative examples are significantly easier to come by,


but at the same time, the number of negative samples needed to depict a negated concept can be very large. Thus, we need a substantial number of negative examples to characterize the class boundary accurately. Put another way, if we can find negative instances productively and remove them, we indirectly improve the hit-rate of the target concept. The learning process is analogous to mining gold in a stream: the most productive way to harvest gold (i.e., relevant instances) is to quickly filter out water, mud, and sand (i.e., negative-labeled instances).

Transductive learning has been proposed to increase the labeled pool. One problem with negative transduction is that the additional inferred negative instances are likely to have the same semantics, so they are not always useful (Chang et al., 2003b). We wish to weed out negative instances not only in quantity but also in quality; that is, we would like to weed out images that carry other negative semantics. The problem of using a large unlabeled sample pool to boost the performance of a learning algorithm is considered under the framework of co-training (Blum and Mitchell, 1998). The presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately, one in each view, and each algorithm's predictions for new unlabeled examples are then used to enlarge the training set of the other. We propose the following co-training steps (a schematic sketch follows the list):

1. Use SVMActive to select samples and present them to the user for feedback.

2. Use another learner, such as MEGA (Chang and Li, 2003), together with the negative-labeled instances, to refine its hypothesis.

3. As shown in (Chang and Li, 2003), concepts more general than the k-DNF hypothesis learned by MEGA do not satisfy the target concept. All training instances contradicting the k-DNF can thus be labeled negative and fed to SVMActive for refining the hyperplane.

4. The co-training process can be conducted recursively until no more negative instances can be inferred.
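The control flow of these steps can be expressed as the following skeleton. Every interface here (`select_samples`, `add_labeled`, `refine`, `infer_negatives`) is an assumed placeholder—MEGA's actual hypothesis refinement is specified in (Chang and Li, 2003)—so this is a sketch of the control flow only, not an implementation.

```python
# Schematic co-training skeleton (Section 3.1); all method names are assumed
# placeholder interfaces, not the MEGA or SVMActive implementations.
def co_train(svm, mega, X_pool, get_user_labels, rounds=8):
    for _ in range(rounds):
        picks = svm.select_samples(X_pool)       # step 1: active sample selection
        labels = get_user_labels(picks)          # user relevance feedback
        svm.add_labeled(picks, labels)
        negatives = [x for x, y in zip(picks, labels) if y == -1]
        mega.refine(negatives)                   # step 2: refine MEGA's k-DNF hypothesis
        inferred = mega.infer_negatives(X_pool)  # step 3: instances outside the k-DNF,
        while inferred:                          #   assumed to return only new instances
            svm.add_labeled(inferred, [-1] * len(inferred))
            mega.refine(inferred)                # step 4: recurse until nothing new
            inferred = mega.infer_negatives(X_pool)
        svm.retrain()
    return svm
```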

3.2. Hierarchical Sampling

An indexer is essential to make both query-concept learning and image retrieval scalable with dataset size. To deal with the "dimensionality curse," we employ an indexing scheme that uses clustering and classification methods to support sample selection. Our indexing method is a statistical approach that works in two steps. First, it performs unsupervised clustering using Clindex (Li et al., 2002) to group similar objects together. To maximize I/O efficiency, each cluster is stored in a sequential file. A search is then treated as a classification problem. Our hypothesis is that if a query object's class prediction yields C probable classes, then the probability is high that


its nearest neighbors can be found in these C classes. This hypothesis is analogous to looking for books in a library: if we want a calculus book and we know calculus belongs in the math category, by visiting the math section we can find many calculus books. Similarly, by searching the most probable clusters into which the query object might be classified, we can harvest most of the similar objects. Once we form our indexer through clustering, we cache in main memory a prototype for each cluster, choosing the image closest to the cluster centroid as the cluster's prototype. When an active-learning algorithm needs to probe the unlabeled pool to select samples, it first probes the cached prototypes to select the top-C most uncertain and diversified cluster prototypes with respect to the learned classifier. The learning algorithm then scans the C corresponding clusters to select samples. Our experimental results in Section 4.2 show that, by caching one prototype per cluster and probing just a small fraction of the clusters, this hierarchical sampling method can select samples as good as those obtained by scanning the entire dataset. In other words, the classifier trained on samples obtained through the indexer captures the target query concept comparably to one trained on the entire unlabeled pool. A sketch of the two-level probe appears below.
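The sketch simplifies the method of Section 3.2 in two ways: clusters are in-memory arrays rather than Clindex's sequential files, and prototype ranking uses only the distance-to-hyperplane criterion (the diversity weighting is omitted for brevity).

```python
# Two-level hierarchical sampling (Section 3.2); a simplified sketch, not Clindex.
import numpy as np

def hierarchical_select(clusters, prototypes, clf, C=20, h=20):
    """clusters: list of np.ndarray; prototypes: one cached vector per cluster."""
    # Level 1: rank cached prototypes by ambiguity w.r.t. the current hyperplane.
    proto_uncertainty = np.abs(clf.decision_function(np.asarray(prototypes)))
    top_clusters = np.argsort(proto_uncertainty)[:C]
    # Level 2: scan only the C selected clusters for the h most ambiguous instances.
    candidates = np.vstack([clusters[c] for c in top_clusters])
    dist = np.abs(clf.decision_function(candidates))
    return candidates[np.argsort(dist)[:h]]
```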


3.3. Kernel Indexer

As we will show in Section 4, the hierarchical sampling method works effectively for active learning in its sample-selection phase. However, recall in the retrieval phase suffers from performance degradation due to the approximate nature of the indexing scheme. (Since the sampling process is itself approximate, the performance of the sample-selection phase is not affected.) To improve performance in the retrieval phase, we recently developed a kernel indexing method. Our kernel indexer (KDX) addresses two fundamental issues that traditional indexers do not. First, KDX indexes data in the kernel space F onto which the SVM kernel projects the data. Second, KDX can effectively deal with hyperplane queries. Traditional top-k query scenarios use a point in a vector space to depict the query, and the top-k matches are the k instances nearest the query point. A top-k query with SVMs differs in two respects: the query concept learned by SVMs is represented by a hyperplane, not a point; and the query requests the instances farthest from the hyperplane, not the traditional nearest neighbors.

KDX relies on a couple of important geometric properties of the feature space. First, the angular distance between any two points is bounded by π/2. For a normalized kernel, the inner product of an instance with itself, K_n(a, a), equals 1, which means that after projection all instances lie on the surface of a hypersphere. Further, since kernel values are inner products constrained to be greater than or equal to 0 (and arccos(0) = π/2), the angle in feature space between any two instances is bounded above by π/2. The second property is that data instances exist on both sides of a query hyperplane: the hyperplane must pass through the region of the hypersphere populated by the projected instances, for otherwise it would be impossible to separate the positive from the negative training instances. This property is easily ensured, since we have at least one instance from the positive class and one from the negative class.

Intuitively, KDX works as follows. Given a kernel function and an unlabeled pool, KDX first finds the approximate center instance of the pool in the feature space. (The center instance is the one with the minimal total distance to the rest of the instances in the dataset.) It then divides the feature space onto which the kernel function projects the unlabeled instances into concentric hyper-rings, each containing about the same number of instances; each hyper-ring is populated by instances according to their distance from the center instance in the kernel space. Given a query concept, represented by a hyperplane, KDX limits the number of hyper-rings examined and intelligently prunes unfit instances within each hyper-ring; finally, it returns the top-k results. Both the inter-ring and the intra-ring pruning exploit the geometric properties of the feature space. Due to space limitations, we only briefly present how the indexer is created and how a search is conducted with it; for the full specification of KDX, please refer to (Panda and Chang, 2005).

The indexer is created in four steps:

1. Finding the instance φ(x_c) that is approximately centrally located in the feature space F,

2. Separating the instances into rings based on their angular distance from the central instance φ(x_c),

3. Constructing a local indexing structure (intra-ring indexer) for each ring, and

4. Creating an inter-ring index.

To divide the instances into rings, we divide the sorted list equally: if the number of instances per ring is g, then the first g elements in the array (sorted by distance to the center) form the first ring, and so on. Adjacent rings are laid out contiguously on disk to take advantage of sequential access.

Once the indexer has been constructed, KDX performs top-k retrieval by finding the k instances farthest from the hyperplane representing the query concept. KDX performs inter-ring and intra-ring pruning to find the approximate top-k instances as follows (see also Figure 2):

1. Shifting the hyperplane to the origin parallel to itself, and then computing θ_c, the angular distance between the normal to the hyperplane and the central instance φ(x_c).

2. Identifying the ring farthest from the hyperplane, and selecting a starting instance φ(x) in that ring.


3. Computing the angular separation between φ(x) and the coordinate in the ring farthest from the hyperplane, denoted φ(x*).

4. Iteratively replacing φ(x) with an instance closer to φ(x*) and updating the top-k list, until no "better" φ(x) in the ring can be found.

5. Identifying a good starting instance φ(x) for the next ring, and then repeating steps 3 to 5 until the termination criterion is satisfied.

KDX achieves speedup over the naive linear-scan method in two respects: it examines only a subset of the rings for a query, and, in the fourth step, it examines only a small fraction of the instances in each ring. The remainder of this section describes these steps and explains how KDX effectively approximates the top-k results to achieve significant speedup. KDX terminates its search for the top-k when the constituents of the top-k set do not change over the evaluation of some additional hyper-rings, or when the query time expires.

3.3.1. Computing θ_c

The parameter θ_c (introduced in the first step) is important for KDX to identify the coordinate of a ring farthest from the hyperplane. To compute θ_c, we first shift the hyperplane so that it passes through the origin of the feature space. The SVM training phase learns the distance of the hyperplane from the origin in terms of the variables b and w (Vapnik, 1995): the distance is −b/‖w‖. We shift the hyperplane to pass through the origin without changing its orientation by setting b = 0. This shift does not affect which instances are farthest from the hyperplane, because it has the same effect as adding a constant value to all distances. Next, we compute the angular distance θ_c of the central instance x_c from the normal to the hyperplane. Given training instances x_1, ..., x_m and their labels y_1, ..., y_m, SVM training solves for the weight α_i of each x_i. The (unit-length) normal of the hyperplane can be written as

w = ( Σ_{i=1}^{m} α_i y_i φ(x_i) ) / √( Σ_{i,j=1}^{m} α_i α_j y_i y_j φ(x_i) · φ(x_j) ).    (3)

(Training instances with zero weights are not support vectors and do not affect this computation.)

The cosine of the angular distance between the central instance and w is then given by cos θ_c = w · φ(x_c).

3.3.2. Identifying the Starting Ring

The most logical ring in which to start looking for the farthest instance is the ring containing the coordinate on the hypersphere farthest from the hyperplane. Let φ(x♦) denote this farthest coordinate. Note that there may not exist a data instance at φ(x♦); however, finding an instance close to the farthest coordinate can help us find the farthest instance with high probability.



The farthest point φ(x♦) on the surface of the hypersphere from the hyperplane is at the intersection of the hypersphere and the normal to the hyperplane passing through the origin. We use Figure 2 to illustrate.

Figure 2. Start ring: the ring of interest contains the farthest coordinate, at angular separation θ_c from the central instance φ(x_c); the figure also shows the normal w and the hyperplane.

The figure shows that φ(x♦) is at the intersection of the hypersphere and the tangent hyperplane parallel to the query hyperplane, with θ_c angular separation from φ(x_c). Given x_c and the normal of the hyperplane, we can compute θ_c to locate the ring containing the coordinate on the hypersphere farthest from the hyperplane.

3.3.3. Intra-ring Pruning

When processing a ring, we do not want to scan the entire ring to look for top-k candidates. Intuitively, we can conduct a binary search within the ring based on the distances between the ring's instances and φ(x♦). KDX looks for the instances closest to φ(x♦) and updates the current top-k list. Finally, the top loop of the KDX algorithm determines how many rings to visit; the lookup terminates when the top-k results do not improve or the query time expires.

Remark: Because of space limitations, we cannot discuss KDX in detail here. KDX supports a couple of important properties. First, it can effectively support insertion and deletion operations. Second, given a kernel function, the indexer works independently of the settings of the kernel parameters (e.g., γ and σ). This parameter-invariance is especially crucial, since different query concepts are best learned with different parameter settings. Please see (Panda and Chang, 2005) for the full specification of KDX. A sketch of the ring construction appears below.
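The ring construction and the starting-ring computation can be sketched as follows, assuming a normalized kernel (so K(a, a) = 1 and all kernel values lie in [0, 1]); the function names and the precomputed kernel matrix are illustrative conveniences, not part of KDX's specification.

```python
# Ring construction for KDX (Section 3.3); an illustrative sketch under a
# normalized kernel, not the full KDX implementation.
import numpy as np

def build_rings(X, kernel_matrix, ring_size):
    """kernel_matrix[i, j] = K(x_i, x_j); returns center index and list of rings."""
    # Approximate center: instance with minimal total angular distance to all others.
    angles = np.arccos(np.clip(kernel_matrix, 0.0, 1.0))
    c = int(np.argmin(angles.sum(axis=1)))
    # Sort instances by angular distance to the center and cut into equal rings.
    order = np.argsort(angles[c])
    rings = [order[i:i + ring_size] for i in range(0, len(X), ring_size)]
    return c, rings

def starting_ring(rings, angles_to_center, cos_theta_c):
    # The farthest coordinate from the origin-shifted hyperplane lies at angular
    # distance theta_c = arccos(w . phi(x_c)) from the center (Section 3.3.1);
    # pick the first ring whose angular span covers theta_c.
    theta_c = np.arccos(np.clip(cos_theta_c, -1.0, 1.0))
    for idx, ring in enumerate(rings):
        if angles_to_center[ring].max() >= theta_c:
            return idx
    return len(rings) - 1
```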

4. Experiments and Discussion

We conducted experiments to answer three questions:

1. Can active learning scale well with concept complexity? If not, what are the primary obstacles, and what are some promising remedies?

2. Can active learning scale well with dataset size? Will pre-clustering the data and using cluster prototypes to guide active learning compromise the quality of learning?


3. Can KDX find top-k results effectively and efficiently?

For the empirical evaluation of our learning methods, we used two real-world image datasets:

− a 50K-image dataset with 107 categories (Chang et al., 2003a), and

− a 350K-image dataset with images from a stock-photo company.

We use 144 color, texture, and shape features to characterize each image; the perceptual features are documented in (Tong and Chang, 2001). Query accuracy is computed as the portion of the k returned results belonging to the target query concept, which is equivalent to the precision of the top-k images. This measure of performance appears to be the most appropriate for the image-retrieval task—particularly since, in most cases, not all of the relevant images can be displayed to the user on one screen. As with web search, we typically wish the first screen(s) of returned images to contain a high proportion of relevant images; we are less concerned that every single instance satisfying the query concept be displayed. In the following experiments, we report the average precision of the top-k retrievals after running each experiment twenty times. For SVMActive, we used a Laplacian RBF kernel, K(u, v) = exp(−γ Σ_i |u_i − v_i|), with γ set at 0.001.
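For reference, this kernel is a one-line function (scikit-learn's sklearn.metrics.pairwise.laplacian_kernel provides an equivalent vectorized form):

```python
# Laplacian RBF kernel used for SVMActive: K(u, v) = exp(-gamma * sum_i |u_i - v_i|).
import numpy as np

def laplacian_kernel(u, v, gamma=0.001):
    return np.exp(-gamma * np.abs(u - v).sum())
```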

4.1. Concept Complexity Study

As discussed in Section 3, the complexity of a concept is characterized mainly by its scarcity and isolation, and we explained in Section 3 their effect on learnability. Here, we evaluate the effectiveness of our proposed remedies.

Figure 3. Transductive precision vs. number of transductive points (average top-50 precision over 0 to 50 points; DPF vs. L1 distance).

We first examine the effect of transductive learning on the 50K-image dataset. We assume that the nearest neighbors of a negative-labeled instance are also negative, a heuristic based on the observation that the nearest neighbors of a negative-labeled image are likely to be negative. However,


the same assumption cannot be made for positive instances, because they are scarce. In this experiment we applied the angle-diversity algorithm to the 107-category dataset, and we compared the effectiveness of two distance functions for finding nearest neighbors: the Dynamic Partial Function (DPF), a feature-reduction method that provides better similarity measurement for images (Li and Chang, 2003), and L1. Figure 3 shows the precision curves for the two distance functions versus the number of transductive points, after eight relevance-feedback iterations. When the number of transductive points exceeds 10 (for each negative-labeled instance), precision starts to deteriorate, because some of the inferred negative instances may actually be positive. For this dataset, we can infer around 10 times more negative instances from the user-labeled negative instances, achieving an average precision improvement of about 5%. We also note that transductive learning with DPF performs consistently better than with L1, because the negative instances inferred by DPF through different subspaces achieve higher diversity than those inferred by the L1 function. A sketch of the transduction heuristic follows.
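The heuristic amounts to the following sketch; L1 distance is used here for simplicity, and DPF would be substituted for it in our setting.

```python
# Negative transduction sketch: label the nearest neighbors of each user-labeled
# negative instance as negative. L1 distance stands in for DPF.
import numpy as np

def transduce_negatives(X_pool, X_negative, n_neighbors=10):
    inferred = set()
    for xn in X_negative:
        d = np.abs(X_pool - xn).sum(axis=1)        # L1 distance to every pool instance
        inferred.update(np.argsort(d)[:n_neighbors].tolist())
    return sorted(inferred)                         # pool indices to label negative
```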

Figure 4. Recall improvement by remedies (top-20 precision vs. iteration for angle, angle + trans, and angle + trans + co-train).

Next, we investigated the effectiveness of the co-training algorithm. In addition to the negative examples collected by SVMActive, we collected 100 additional negative instances inferred by MEGA (i.e., 100 additional unlabeled points outside of its k-DNF sampling boundary) at the end of each user iteration, and fed all these negative instances to SVMActive to refine the class boundary. Figure 4 compares the top-20 precision of three schemes that use angle-diversity: 1) without any enhancement, 2) with transduction, and 3) with transduction and co-training. On average, transduction improves precision by about 5%, and co-training improves it by another 5% to 7%. However, when we further increase the number of negative objects inferred by MEGA (beyond 100), precision does not improve. We believe that when co-training is performed too aggressively, the noise may increase and eventually hurt precision.



4.2. Dataset Scalability Study

We studied the effectiveness and efficiency of our hierarchical sampling method. Notice that the same indexing structure is used for two purposes: to select samples for active learning, and to retrieve matching images given a learned query concept. To study the indexer's effect on learning and retrieval separately, we configured this experiment in four settings:

1. Sampling and retrieval on the entire dataset.

2. Sampling via the indexer and retrieval on the entire dataset.

3. Sampling on the entire dataset and retrieval via the indexer.

4. Sampling and retrieval via the indexer.

Figure 5. Top-20 precision vs. iteration on the 50K-image dataset (Configurations #1 through #4).

Figure 5 shows the top-20 precision on query concepts with hit-rates greater than 1%, as a function of the number of iterations, for the four configurations on the 50K dataset. Though not efficient, the first configuration, which uses the entire dataset for both sample selection and retrieval, ought to achieve the best search precision; it thus serves as a yardstick for measuring the performance of the other three configurations. The second configuration conducts sample selection (for active learning) through the indexer. For this experiment, we first read in C = 20 of the 375 clusters, or about 5% of the unlabeled pool, to perform sample selection. (We show the effect of choosing a different C shortly.) The figure shows that selecting samples through the indexer does not degrade search performance. However, when the indexer is used in the retrieval phase (configurations 3 and 4), search accuracy degrades.

Now, let us focus on configuration 2 to evaluate the usefulness of our indexer. In Figure 6, we plot the top-20 precision curves for various percentages of top clusters (parameter C of Section 3.2), ranging from 1% to 10%. Even when we read data from only three clusters (1%), the query performance is almost as good as when we read in more clusters. The key reason is that, because the sample-selection process is itself an approximation, the noise introduced by the indexer does not significantly affect the quality of the samples for learning the target concept. However, in the retrieval phase, the


approximate nature of the indexer may cause the precision rate to degrade. In short, using our hierarchical sampling method, we believe that active learning can scale well with dataset size.

Figure 6. Learnability vs. parameter C (top-20 precision vs. iteration for C = 1%, 3%, 5%, 7%, and 10%).

We continue our empirical studies using the larger and more diverse 350K-image dataset. Each image is annotated with about 20 keywords; a total of 40K distinct keywords are used across the dataset. We first studied the usefulness of our indexer for this dataset, using the annotated keywords to indicate an image's category membership; each image can therefore belong to more than one category. Figure 7 shows the top-20 precision for categories with a hit-rate greater than 1%. The solid line depicts the case in which sampling and retrieval are performed on the entire dataset (configuration 1), whereas the dashed line represents the case in which the indexer is used for both sampling and retrieval (configuration 4).

Figure 7. Top-20 precision vs. iteration on the 350K-image dataset (#1: sequential scan; #4: using the indexer).

We observe that using the indexer for sampling and retrieval gives precision comparable to that obtained by a sequential scan of the entire dataset: after four iterations of active learning, the indexer's precision lags that of the sequential scan by only 2%. Hence, our results show that the hierarchical sampling remedy scales well with respect to dataset size. Though we have shown that active learning can scale well with dataset size, it does not scale so well when concept complexity is high: the precisions reported in Figure 7 suffer a substantial drop compared to those in Figure 6.


Table I. Qualitative and quantitative comparison

Dataset       % Precision   % Discrepancy   % till Precision
350k-image    90.0          0.03607813      2.94255

4.3. KDX Study

We evaluated KDX on four UCI datasets and the 350k-image dataset. We first evaluated its retrieval quality, and then the discrepancy between the top-k results obtained by KDX and those obtained by a linear scan. Results of the qualitative evaluation are presented in the second column of Table I; the results are averaged over three classes for all datasets except Seg. The average precision values for all datasets are above 80%, which is reasonably high throughout; for the 350k-image dataset, which has the largest number of instances, the average precision is 90%. The third column of the table shows that the top-20 results found by KDX are very close to those found by a linear scan: an average discrepancy of less than 0.1% indicates that the quality of the results is acceptable. Finally, the last column shows the percentage of the data read to reach the precision in the second column. For the UCI datasets, whose sizes are small to begin with, the percentage of evaluated samples is slightly high. For the large image dataset, the results are impressive, with less than 3% of the data evaluated to reach 90% precision.

4.4. Multimodal Remedy

A huge advantage of using an annotated dataset, even when the annotation is imperfect, is that we can use keywords to seed a query and then continue with active learning to refine the query concept. We performed manual queries for a set of concepts on the 350K-image dataset. In Figure 8, we plot the top-20 precision results for five concepts:

Coffee. The target concept is a cup filled with coffee. After seeding the query, the precision rate is 0.05, as many images of people holding a coffee cup are also annotated with the keyword "coffee." Using active learning, we achieve perfect precision in the ninth iteration.

Cocktail. Here, we are interested in images showing a glass of cocktail, not of people drinking. After keyword seeding, the precision rate is 0.1. With active learning, perfect precision is attained in the fourth iteration.

Lion. We want to retrieve head shots of a male lion. In the first iteration, keyword seeding gives a precision rate of 0.14. After six iterations of active learning, we arrive at perfect precision.

Mountain with Lake. The target concept is an image of a mountain in the background and a lake in the foreground. Using two keywords, "mountain" and "lake," to seed this query, we obtain a precision rate of 0.3 in the first




iteration. We require only three rounds of active learning to reach perfect precision.

Paris. For this query, we look for images showing the Eiffel Tower rather than typical Parisian scenes. Using the keyword "Paris" to seed the query, we do not retrieve any desired images in the first iteration. However, after one round of active learning, we obtain a precision rate of 0.35, and in the seventh iteration we reach a precision rate of 1.0.

Figure 8. Multimodal top-20 precision vs. iteration on the 350K-image dataset (Coffee, Cocktail, Lion, Mountain Lake, Paris).

It is worth mentioning that, for such a large dataset, if keywords are not used to seed the query, it typically takes at least 50 iterations of relevance feedback before a positive sample is displayed. Our results show that using keywords to seed a query immediately raises the hit-rate to a reasonable range and simultaneously improves the isolation factor. With a few iterations of active learning, the system is then able to learn the target concept and produce good precision. Thus, employing keywords and perceptual features in a complementary fashion can address the scalability issue effectively.

5. Conclusions and Future Work

We have shown in this work the effectiveness of active learning for learning complex, subjective query concepts despite scarce information. We showed that the best sample-selection strategy is to select objects that are most uncertain and most diversified with respect to the current concept boundary. We also studied two important scalability issues for active learning: scalability in concept complexity, and scalability in dataset size.

To further improve the effectiveness of active learning, one of our future tasks will be to devise multimodal techniques. In this paper, we briefly demonstrated that using annotated keywords can improve retrieval precision substantially, as expected. However, some subtle but important challenges remain: 1) how can keywords and perceptual features be used synergistically to improve hit-rate and isolation? and 2) how can active learning be performed in a concept-dependent way? In addition, we consider kernel indexing a critical method for making SVMs scalable in a transductive setting, where unlabeled data are available in advance for indexing. Preliminary results showed


KDX to be very effective (Panda and Chang, 2005). We are currently investigating extensions of KDX to select samples and to perform farthest-instance as well as nearest-instance queries.

A major shortcoming of a discriminative approach like SVMs is its lack of interpretability: when an RBF kernel is employed, one cannot tell what combination of features forms the hyperplane (the target concept). Another shortcoming of SVMs is that they cannot take prior knowledge into consideration. To attain interpretability, one must employ a generative model such as MEGA (Chang and Li, 2003) or a Bayesian model. Unfortunately, these models have been shown to require a large number of training instances to learn a concept, and hence are not suitable for online query-concept learning. For incorporating prior knowledge, we believe that some hybrid approach can be devised, and this is an active research subject in the machine-learning community.

References

Blum, A. and T. Mitchell: 1998, 'Combining Labeled and Unlabeled Data with Co-Training'. Proceedings of the Workshop on Computational Learning Theory.

Brinker, K.: 2003, 'Incorporating Diversity in Active Learning with Support Vector Machines'. ICML, pp. 59–66.

Chang, E., K. Goh, G. Sychay, and G. Wu: 2003a, 'CBSA: Content-based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines'. IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Conceptual and Dynamic Aspects of Multimedia Content Description.

Chang, E. and B. Li: 2003, 'MEGA — The Maximizing Expected Generalization Algorithm for Learning Complex Query Concepts'. ACM Transactions on Information Systems.

Chang, E., B. Li, G. Wu, and K.-S. Goh: 2003b, 'Statistical Learning for Effective Visual Information Retrieval (invited paper)'. IEEE Conference on Image Processing.

Flickner, M., H. Sawhney, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker: 1995, 'Query by Image and Video Content: the QBIC System'. IEEE Computer 28(9), 23–32.

Goh, K., E. Y. Chang, and W.-C. Lai: 2004, 'Concept-Dependent Multimodal Active Learning for Image Retrieval'. ACM International Conference on Multimedia, pp. 564–571.

Li, B. and E. Chang: 2003, 'Discovery of A Perceptual Distance Function for Measuring Image Similarity'. ACM Multimedia Journal, Special Issue on Content-Based Image Retrieval 8(6).

Li, C., E. Chang, H. Garcia-Molina, and G. Wiederhold: 2002, 'Clustering for Approximate Similarity Queries in High-Dimensional Spaces'. IEEE Transactions on Knowledge and Data Engineering.

Panda, N. and E. Chang: 2005, 'Exploiting Geometry for Support Vector Machine Indexing'.

Tong, S. and E. Chang: 2001, 'Support Vector Machine Active Learning for Image Retrieval'. Proceedings of ACM International Conference on Multimedia, pp. 107–118.

Tong, S. and D. Koller: 2000, 'Support Vector Machine Active Learning with Applications to Text Classification'. Proceedings of the 17th International Conference on Machine Learning, pp. 401–412.

Vapnik, V.: 1995, The Nature of Statistical Learning Theory. Springer Verlag.

