Approximate Reduction from AUC Maximization to 1-norm Soft Margin Optimization∗

Daiki Suehiro, Kohei Hatano, Eiji Takimoto
Department of Informatics, Kyushu University
744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
{daiki.suehiro, hatano, eiji}@inf.kyushu-u.ac.jp

Abstract

Finding linear classifiers that maximize AUC scores is important in ranking research. This is naturally formulated as a 1-norm hard/soft margin optimization problem over pn pairs of p positive and n negative instances. However, directly solving the optimization problems is impractical since the problem size (pn) is quadratically larger than the given sample size (p + n). In this paper, we give (approximate) reductions from the problems to hard/soft margin optimization problems of linear size. First, for the hard margin case, we show that the problem is reduced to a hard margin optimization problem over p + n instances in which the bias constant term is to be optimized. Then, for the soft margin case, we show that the problem is approximately reduced to a soft margin optimization problem over p + n instances for which the resulting linear classifier is guaranteed to have a certain margin over pairs.

1 Introduction

Among the problems related to learning to rank, bipartite ranking is a fundamental ranking problem, which involves learning to obtain rankings over positive and negative instances. More precisely, for a given sample consisting of positive and negative instances, the goal of the bipartite ranking problem is to find a real-valued function h, referred to as a ranking function, with the following property: for a randomly chosen test pair of a positive instance x+ and a negative instance x−, the ranking function h maps x+ to a higher value than x− with high probability. The bipartite ranking problem can be reduced to the binary classification problem over a new instance space consisting of all pairs (x+, x−) of positive and negative instances. More precisely, the problem of maximizing the AUC is equivalent to finding a binary classifier f of the form f(x+, x−) = h(x+) − h(x−) such that the probability that f(x+, x−) > 0 is maximized for a randomly chosen instance pair. Several studies, including RankSVMs [2], have taken this approach with linear classifiers as the ranking functions. RankSVMs are justified by generalization bounds [4] which say that a large margin over pairs of positive and negative instances in the sample implies a high AUC score, under the standard assumption that instances are drawn i.i.d. from the underlying distribution. The reduction approach, however, has the drawback that the sample constructed through the reduction is of size pn when the original sample consists of p positive and n negative instances. This is a quadratic blowup in size. In this paper, we formulate AUC maximization as 1-norm hard/soft margin optimization problems¹ over pn pairs of p positive and n negative instances. We show reduction schemes to 1-norm hard (or soft) margin optimization over p + n instances that approximate the original problem over pairs.

∗ The full version of this paper appeared in ALT 2011 [5].
¹ In this paper, we refer to 1-norm soft margin optimization as soft margin optimization with the 1-norm of the weight vector regularized. Note that the soft margin optimization of SVMs in which the 1-norm of the slack variables is optimized is also sometimes called 1-norm soft margin optimization.


First, for the hard margin case, where the resulting linear classifier is required to classify all pairs correctly with some positive margin, we show that the original problem over pairs is equivalent to the 1-norm hard margin problem over p + n instances with a bias term. Second, for the soft margin case, in which the resulting classifier is allowed to misclassify some pairs, we show reduction methods to 1-norm soft margin optimization over instances whose solutions are guaranteed to have a certain amount of margin over pairs of instances. When we solve the original problem over pairs, it can be shown that, for any ε such that 0 < ε < 1, the solution has margin at least ρ∗ ≥ γ∗ over at least (1 − ε)pn pairs, where ρ∗ and γ∗ are the optimal values of the primal and dual of the original problem (both depending on ε). On the other hand, for an appropriate setting of parameters, one of our reduction methods guarantees that the resulting classifier has margin at least γ∗ for (1 − √ε)² pn pairs (for example, ε = 0.04 guarantees margin γ∗ on at least 64% of the pairs). Note that this guarantee might be rather weak, since the guaranteed margin γ∗ is lower than the optimal margin ρ∗ in general. However, if ρ∗ ≈ γ∗, say, when the pairs are close to being linearly separable, our theoretical guarantee becomes sharper. Also, theoretically guaranteed reduction methods from AUC maximization to classification are quite meaningful, since typical methods lack such properties. In our experiments using artificial and real data, our practical heuristics derived from our analysis (omitted here; see [5]) achieve AUCs that are almost as high as the original soft margin formulation over pairs while keeping the sample size linear. In addition, our methods also outperform previous methods including RankBoost [1] and SoftRankBoost [3].

2 1-norm soft margin over pairs of positive and negative instances

Let X+ and X− be the sets of positive instances and negative instances, respectively, and let X = X+ ∪ X− be the instance space. In this paper, we assume a finite set H = {h1, h2, . . . , hN} of ranking functions, i.e., functions from X to [−1, +1]. Our hypothesis class F is the set of convex combinations of ranking functions in H, that is,

$$F = \Bigl\{ f \;\Bigm|\; f(x) = \sum_{k=1}^{N} \alpha_k h_k(x),\ h_k \in H,\ \sum_{k=1}^{N} \alpha_k = 1,\ \alpha_k \ge 0 \Bigr\}.$$

Now, given a sample consisting of a set S+ of p positive instances and a set S− of n negative instances, our goal is to find a linear combination of ranking functions f ∈ F which has a large margin ρ over pairs of instances in S+ and S−. More formally, we formulate our problem as optimizing the soft margin over pairs of positive and negative instances. For convenience, for any q ≥ 1, let P^q be the q-dimensional probability simplex, i.e., P^q = {p ∈ [0, 1]^q | Σ_i p_i = 1}. Then, for the positive and negative sets of instances S+ and S−, the set H of ranking functions, and any fixed ν ∈ {1, . . . , pn}, the 1-norm soft margin optimization problem is given in (1), and its dual in (2):

$$(\rho^*, \alpha^*, \xi^*) = \max_{\rho,\, \alpha,\, \xi}\ \rho - \frac{1}{\nu} \sum_{i=1}^{p} \sum_{j=1}^{n} \xi_{ij} \qquad\qquad (1)$$
$$\text{s.t.}\quad \sum_{k=1}^{N} \alpha_k \bigl(h_k(x_i^+) - h_k(x_j^-)\bigr)/2 \ \ge\ \rho - \xi_{ij} \quad (i = 1, \dots, p,\ j = 1, \dots, n), \qquad \alpha \in P^N,\ \xi \ge 0.$$

$$(\gamma^*, d^*) = \min_{\gamma,\, d}\ \gamma \qquad\qquad (2)$$
$$\text{s.t.}\quad \sum_{i,j} d_{ij} \bigl(h_k(x_i^+) - h_k(x_j^-)\bigr)/2 \ \le\ \gamma \quad (k = 1, \dots, N), \qquad d \le \frac{1}{\nu}\mathbf{1},\ \ d \in P^{pn}.$$





In this problem, the goal is to maximize the margin ρ of the linear combination α of ranking functions over pairs of instances, as well as to minimize the sum of the "losses" ξij, the quantities by which the target margin ρ is violated. Here ν ∈ {1, . . . , pn} controls the tradeoff between the two objectives. Using Lagrange multipliers, the dual problem is obtained as (2). Since the problem is a linear program, by duality we have ρ∗ − (1/ν) Σ_{i,j} ξ∗_{ij} = γ∗. Furthermore, by using the KKT conditions, it can be shown that the optimal solution guarantees that the number of pairs (x_i^+, x_j^-) for which Σ_k α_k (h_k(x_i^+) − h_k(x_j^-))/2 ≤ ρ∗ is at most ν.
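To make the size of this formulation concrete, the following is a minimal sketch (not the authors' implementation) of solving problem (1) directly as a linear program with scipy.optimize.linprog. It assumes the ranking functions are supplied as precomputed score matrices Hpos[i, k] = h_k(x_i^+) and Hneg[j, k] = h_k(x_j^-); these names and the dense constraint matrix are our own choices. The LP has pn inequality constraints and 1 + N + pn variables, which is exactly the quadratic blow-up discussed in the introduction.

```python
# A sketch of problem (1): 1-norm soft margin optimization over all pn pairs.
# Hpos[i, k] = h_k(x_i^+), Hneg[j, k] = h_k(x_j^-)  (assumed precomputed, values in [-1, 1]).
import numpy as np
from scipy.optimize import linprog

def soft_margin_over_pairs(Hpos, Hneg, nu):
    p, N = Hpos.shape
    n = Hneg.shape[0]
    # Pairwise margins: one row per pair (i, j), A[(i, j), k] = (h_k(x_i^+) - h_k(x_j^-)) / 2.
    A = (Hpos[:, None, :] - Hneg[None, :, :]).reshape(p * n, N) / 2.0
    # Variable order: z = (rho, alpha_1..alpha_N, xi_11..xi_pn).
    n_var = 1 + N + p * n
    c = np.zeros(n_var)
    c[0] = -1.0              # maximize rho  ->  minimize -rho
    c[1 + N:] = 1.0 / nu     # penalize (1/nu) * sum_ij xi_ij
    # Pair constraints: rho - sum_k alpha_k A[(i,j),k] - xi_ij <= 0.
    A_ub = np.hstack([np.ones((p * n, 1)), -A, -np.eye(p * n)])
    b_ub = np.zeros(p * n)
    # alpha lies on the probability simplex: sum_k alpha_k = 1, alpha_k >= 0.
    A_eq = np.zeros((1, n_var))
    A_eq[0, 1:1 + N] = 1.0
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0, 1)] * N + [(0, None)] * (p * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    rho, alpha = res.x[0], res.x[1:1 + N]
    return rho, alpha
```

Even for moderate p and n, the pn × (1 + N + pn) constraint matrix quickly becomes prohibitive, which motivates the reductions in Sections 3 and 4.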

3 The 1-norm hard margin optimization over pairs

In this section, we show the equivalence between two hard margin optimization problems: the 1-norm hard margin problem over pairs and the 1-norm hard margin problem with a bias term. The hard margin optimization problem is a special case of the soft margin one in which the resulting classifier or ranking function is required to predict all the instances or pairs correctly with some positive margin. The first problem we consider is the 1-norm hard margin optimization (3) over pairs of positive and negative instances.

$$\max_{\rho,\ \alpha \in P^N}\ \rho \qquad\qquad (3)$$
$$\text{s.t.}\quad \sum_{k=1}^{N} \alpha_k \bigl(h_k(x_i^+) - h_k(x_j^-)\bigr)/2 \ \ge\ \rho \quad (i = 1, \dots, p,\ j = 1, \dots, n).$$

$$\max_{\rho,\ \alpha \in P^N,\ b}\ \rho \qquad\qquad (4)$$
$$\text{s.t.}\quad \sum_{k=1}^{N} \alpha_k h_k(x_i^+) + b \ \ge\ \rho \quad (i = 1, \dots, p), \qquad \sum_{k=1}^{N} \alpha_k h_k(x_j^-) + b \ \le\ -\rho \quad (j = 1, \dots, n).$$

The second hard margin problem is the 1-norm hard margin optimization with bias (4). In the following, we show that both of these problems are equivalent to each other, in the sense that we can construct an optimal solution of one problem from an optimal solution of the other problem.

Theorem 1 Let (ρ_b, α_b, b_b) be an optimal solution of the 1-norm hard margin optimization with bias (4). Then (ρ_b, α_b) is also an optimal solution of the 1-norm hard margin optimization over pairs (3).
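As an illustration of Theorem 1, the reduced problem (4) can be solved as a linear program with only p + n inequality constraints. The sketch below reuses the assumed Hpos/Hneg score matrices from the earlier sketch (our own representation, not the authors' code); by Theorem 1, the returned (ρ, α) is also optimal for the pairwise problem (3).

```python
# A sketch of problem (4): 1-norm hard margin optimization with a bias term,
# over p + n instances instead of pn pairs (cf. Theorem 1).
import numpy as np
from scipy.optimize import linprog

def hard_margin_with_bias(Hpos, Hneg):
    p, N = Hpos.shape
    n = Hneg.shape[0]
    # Variable order: z = (rho, b, alpha_1..alpha_N); maximize rho.
    c = np.zeros(2 + N)
    c[0] = -1.0
    # Positives: sum_k alpha_k h_k(x_i^+) + b >= rho   ->  rho - b - sum_k alpha_k h_k(x_i^+) <= 0.
    A_pos = np.hstack([np.ones((p, 1)), -np.ones((p, 1)), -Hpos])
    # Negatives: sum_k alpha_k h_k(x_j^-) + b <= -rho  ->  rho + b + sum_k alpha_k h_k(x_j^-) <= 0.
    A_neg = np.hstack([np.ones((n, 1)), np.ones((n, 1)), Hneg])
    A_ub = np.vstack([A_pos, A_neg])
    b_ub = np.zeros(p + n)
    A_eq = np.zeros((1, 2 + N))
    A_eq[0, 2:] = 1.0        # sum_k alpha_k = 1
    b_eq = np.array([1.0])
    bounds = [(None, None), (None, None)] + [(0, 1)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    rho, b, alpha = res.x[0], res.x[1], res.x[2:]
    return rho, b, alpha
```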

4 Reduction methods from 1-norm soft margin optimization over pairs

In this section, we propose reduction methods from the 1-norm soft margin optimization over pairs to that over instances. We would like to approximate the dual problem (2) of the 1-norm soft margin optimization over pairs. The dual problem is concerned with finding a distribution over the pn pairs of positive and negative instances satisfying linear constraints. Our key idea is to replace the distribution d_{ij} with a product distribution d+_i d−_j, where d+ and d− are distributions over positive and negative instances, respectively. Letting d_{ij} = d+_i d−_j and rearranging, we obtain the resulting problem (shown in [5]). Since we restrict distributions to products of two distributions, the optimal solution yields a feasible solution of the original problem over pairs. The resulting problem has p + n + 1 variables, whereas the original one has pn + 1 variables, so it would be easier to solve. Unfortunately, however, this problem is not convex, since the constraints d+_i d−_j ≤ 1/ν (i = 1, . . . , p, j = 1, . . . , n) are not convex. In [5], we propose a method to find a local minimum of the non-convex problem. First, however, we show a restricted problem whose solution has a certain amount of margin over pairs. In order to avoid the non-convex constraints, we fix ν+ and ν− such that ν = ν+ν− and enforce d+_i ≤ 1/ν+ and d−_j ≤ 1/ν−; equivalently, we fix ν− = ν/ν+. As a result, we obtain the following problem (5):

$$\hat{\gamma}(\nu^{+}) = \min_{d^{+},\, d^{-},\, \gamma}\ \gamma \qquad\qquad (5)$$
$$\text{s.t.}\quad \sum_{i} d_i^{+} h_k(x_i^{+})/2 - \sum_{j} d_j^{-} h_k(x_j^{-})/2 \ \le\ \gamma \quad (k = 1, \dots, N),$$
$$d^{+} \in P^{p},\ \ d^{-} \in P^{n}, \qquad d_i^{+} \le 1/\nu^{+},\ \ d_j^{-} \le 1/\nu^{-} = \nu^{+}/\nu.$$
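For a fixed choice of ν+ and ν−, the restricted problem (5) is itself a linear program over only 1 + p + n variables with N inequality constraints. Below is a minimal sketch under the same assumed Hpos/Hneg representation as before (our own naming, not the authors' code); the combination weights α of the ranking functions can then be recovered from the LP dual multipliers of the N constraints.

```python
# A sketch of the restricted problem (5) for fixed nu_plus and nu_minus
# (with nu = nu_plus * nu_minus), using Hpos[i, k] = h_k(x_i^+), Hneg[j, k] = h_k(x_j^-).
import numpy as np
from scipy.optimize import linprog

def reduced_soft_margin(Hpos, Hneg, nu_plus, nu_minus):
    p, N = Hpos.shape
    n = Hneg.shape[0]
    # Variable order: z = (gamma, d^+_1..d^+_p, d^-_1..d^-_n); minimize gamma.
    c = np.zeros(1 + p + n)
    c[0] = 1.0
    # For each k: sum_i d^+_i h_k(x_i^+)/2 - sum_j d^-_j h_k(x_j^-)/2 - gamma <= 0.
    A_ub = np.hstack([-np.ones((N, 1)), Hpos.T / 2.0, -Hneg.T / 2.0])
    b_ub = np.zeros(N)
    # d^+ and d^- are probability distributions over positives and negatives.
    A_eq = np.zeros((2, 1 + p + n))
    A_eq[0, 1:1 + p] = 1.0
    A_eq[1, 1 + p:] = 1.0
    b_eq = np.array([1.0, 1.0])
    # Box constraints d^+_i <= 1/nu^+ and d^-_j <= 1/nu^-.
    bounds = [(None, None)] + [(0, 1.0 / nu_plus)] * p + [(0, 1.0 / nu_minus)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    gamma = res.x[0]
    d_plus, d_minus = res.x[1:1 + p], res.x[1 + p:]
    return gamma, d_plus, d_minus
```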

Note that it is not straightforward to optimize ν+, since problem (5) is not convex with respect to ν+. On the other hand, for any fixed choice of ν+ and ν−, we can guarantee that the solution of problem (5) has a certain amount of margin for many of the pairs.

Theorem 2 Given ν+ and ν−, the solution of problem (5) has margin at least γ∗ for at least pn − ν+n − ν−p + ν+ν− pairs.
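In particular, the count in Theorem 2 factors as a product, which gives the corollary below: choosing ν+ = √ε·p and ν− = √ε·n (so that ν+ν− = εpn = ν) yields

$$pn - \nu^{+} n - \nu^{-} p + \nu^{+}\nu^{-} \;=\; (p - \nu^{+})(n - \nu^{-}) \;=\; (1 - \sqrt{\varepsilon})^{2}\, pn.$$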

Corollary 3 For ν = εpn, ν+ = √ε·p, and ν− = √ε·n, a solution of problem (5) has margin at least γ∗ for (1 − √ε)² pn pairs.

5 Experiments

Table 2: AUCs for UCI data sets, where N, p, and n stand for the dimension and the numbers of positive and negative instances of each data set, respectively.

Data           N    p     n     RankBoost  SoftRankBoost  LP-Pair  our method
hypothyroid    43   151   3012  0.9488     0.96           0.9511   1.0
ionosphere     34   225   126   0.9327     0.9917         0.9768   0.9865
kr-vs-kp       73   1669  1527  0.8712     0.9085         1.0      0.9276
sick-euthroid  43   293   2870  0.7727     0.7847         1.0      1.0
spambase       57   1813  2788  0.8721     0.9359         1.0      1.0

In this section, we show preliminary experimental results. The data sets are drawn from the UCI Machine Learning Repository. We compare RankBoost [1], SoftRankBoost [3], the 1-norm soft margin over pairs (LP-Pair), which naively solves problem (1), and our method. For RankBoost, we set the number of iterations to T = 1000. For the other methods, we set the parameter ν = εpn, where ε ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}. We evaluate each method by 5-fold cross validation. As can be seen in Table 2, our method achieves high AUCs for all data sets and is competitive with LP-Pair.

We last examine the computation time of LP-Pair and our method. We use a machine with 4 cores of Intel Xeon 5570 (2.93 GHz) and 32 GB of memory. We use artificial data sets with N = 100 and m = 100, 500, 1000, 1500. We set ε = 0.2 for both LP-Pair and our method and evaluate each execution time by 5-fold cross validation. As shown in Table 1, our method is clearly faster than LP-Pair.

Table 1: Computation time (sec.).

m      LP-Pair   our method
100    0.102     0.11
500    24.51     0.514
1000   256.78    0.86
1500   1353      1.76
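For reference, the empirical AUC used to score each method is simply the fraction of positive/negative test pairs that the learned ranking function orders correctly, with ties counted as one half. A minimal sketch (not the authors' evaluation code):

```python
# Empirical AUC of a ranking function from its scores on held-out positives/negatives.
import numpy as np

def auc(scores_pos, scores_neg):
    """scores_pos, scores_neg: 1-d arrays of f(x) on positive / negative test instances."""
    diff = np.asarray(scores_pos)[:, None] - np.asarray(scores_neg)[None, :]  # one entry per pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```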

References

[1] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[2] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[3] J. Moribe, K. Hatano, E. Takimoto, and M. Takeda. Smooth boosting for margin-based ranking. In Proceedings of the 19th International Conference on Algorithmic Learning Theory (ALT 2008), pages 227–239, 2008.
[4] C. Rudin and R. E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10:2193–2232, 2009.
[5] D. Suehiro, K. Hatano, and E. Takimoto. Approximate reduction from AUC maximization to 1-norm soft margin optimization. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT 2011), pages 324–337, 2011.

