
Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

S. Sathiya Keerthi

Abstract—This paper discusses implementation issues related to the tuning of the hyperparameters of a support vector machine (SVM) with $L_2$ soft margin, for which the radius/margin bound is taken as the index to be minimized, and iterative techniques are employed for computing the radius and margin. The implementation is shown to be feasible and efficient, even for large problems having more than 10 000 support vectors.

Index Terms—Hyperparameter tuning, support vector machines (SVMs).

I. INTRODUCTION

The basic problem addressed in this paper is the two-category classification problem. Let $\{(x_i, y_i)\}_{i=1}^{m}$ be a given set of training examples, where $x_i$ is the $i$th input vector and $y_i \in \{1, -1\}$ is the target value. $y_i = 1$ denotes that $x_i$ is in class 1 and $y_i = -1$ denotes that $x_i$ is in class 2. In this paper, we consider the support vector machine (SVM) problem formulation that uses the $L_2$ soft margin, given by

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i \xi_i^2 \quad \text{s.t.} \quad y_i (w \cdot z_i + b) \ge 1 - \xi_i \ \ \forall i$$

where $z_i = \phi(x_i)$ denotes the transformed input vector in the feature space. The above problem is usually converted (see [5] for details) to the SVM problem with hard margin given by

$$\min_{\tilde{w}, b} \ \frac{1}{2}\|\tilde{w}\|^2 \quad \text{s.t.} \quad y_i (\tilde{w} \cdot \tilde{z}_i + b) \ge 1 \ \ \forall i \qquad (1)$$

where $\tilde{z}_i$ denotes the transformed vector in the (modified) feature space whose kernel function is

$$\tilde{k}(x_i, x_j) = k(x_i, x_j) + \frac{\delta_{ij}}{C}, \qquad \delta_{ij} = 1 \ \text{if} \ i = j, \ \delta_{ij} = 0 \ \text{otherwise} \qquad (2)$$

and $k$ is the kernel function. Popular choices for $k$ are the

Gaussian kernel: $\ k(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$ \qquad (3a)

Polynomial kernel: $\ k(x_i, x_j) = (1 + x_i \cdot x_j)^d$ \qquad (3b)

The solution of (1) is obtained by solving the dual problem

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \tilde{k}(x_i, x_j) \quad \text{s.t.} \quad \alpha_i \ge 0 \ \forall i \ \text{and} \ \sum_i \alpha_i y_i = 0 \qquad (4)$$

At optimality, the objective functions in (1) and (4) are equal. Let $\theta$ denote the vector of hyperparameters (such as $C$ and $\sigma^2$) in a given SVM formulation. Tuning of $\theta$ is usually done by minimizing an estimate of generalization error such as the leave-one-out (LOO) error or the $k$-fold cross-validation error. It was shown by Vapnik and Chapelle [14] that the following bound holds:

$$\text{LOO Error} \le \frac{R^2 \|\tilde{w}\|^2}{m} \qquad (5)$$

where $\tilde{w}$ is the solution of (1), $R$ is the radius of the smallest sphere that contains all the $\tilde{z}_i$ vectors, and $m$ is the number of training examples. $R^2$ can be obtained as the optimal objective function value of the following problem (see [10] and [13] for details):

$$\max_{\beta} \ \sum_i \beta_i \tilde{k}(x_i, x_i) - \sum_{i,j} \beta_i \beta_j \tilde{k}(x_i, x_j) \quad \text{s.t.} \quad \sum_i \beta_i = 1 \ \text{and} \ \beta_i \ge 0 \ \forall i \qquad (6)$$

Manuscript received March 14, 2001; revised December 21, 2001 and January 10, 2002. The author is with the Department of Mechanical Engineering, National University of Singapore, Singapore 119260, Singapore (e-mail: mpessk@guppy.mpe.nus.edu.sg). Publisher Item Identifier S 1045-9227(02)05563-7.

The right-hand side of the bound in (5), which we denote by $f = R^2\|\tilde{w}\|^2/m$, is usually referred to as the radius/margin bound. Note that both $R^2$ and $\|\tilde{w}\|^2$ depend on $\theta$ and, hence, $f$ is also a function of $\theta$. The first experiments on using the radius/margin bound for model selection were done by Schölkopf et al. [10]; see also [1]. Recently, Chapelle et al. [2] used matrix-based quadratic programming solvers for (1) and (6) to successfully demonstrate the usefulness of $f$ for tuning hyperparameters. Since it is difficult, even for medium size problems with a few thousand examples, to load the entire matrix of kernel values into computer memory and do matrix operations on it, conventional finitely terminating quadratic programming solvers are not very suitable for solving (4) and (6). Hence, specially designed iterative algorithms [5], [6], [8], [11] that are asymptotically converging are popular for solving (4) and (6). The use of these algorithms allows the easy tuning of hyperparameters in large-scale problems. The main aim of this paper is to discuss implementation issues associated with this, and to use the resulting implementation to study the usefulness of the radius/margin bound on several benchmark problems.
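To make the quantities in (2) and (4)–(6) concrete, the following sketch (Python/NumPy; not the paper's code) forms the modified kernel matrix of (2) and reads off $\|\tilde{w}\|^2$, $R^2$, and the bound $f = R^2\|\tilde{w}\|^2/m$ once approximate dual solutions $\alpha$ of (4) and $\beta$ of (6) are available from any iterative solver. The function names and array conventions are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, sigma2):
    """Gaussian kernel matrix, (3a): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma2))

def modified_kernel(K, C):
    """Modified kernel of (2): add delta_ij / C to the diagonal."""
    return K + np.eye(K.shape[0]) / C

def w_norm_sq(alpha, y, Ktilde):
    """||w~||^2 from a solution alpha of the dual (4).

    At optimality the dual objective equals (1/2)||w~||^2, so
    ||w~||^2 = 2 * [sum(alpha) - 0.5 * (alpha*y)' K~ (alpha*y)].
    """
    ya = alpha * y
    return 2.0 * (alpha.sum() - 0.5 * ya @ Ktilde @ ya)

def radius_sq(beta, Ktilde):
    """R^2 from a solution beta of (6): sum_i beta_i k~_ii - beta' K~ beta."""
    return beta @ np.diag(Ktilde) - beta @ Ktilde @ beta

def radius_margin_bound(alpha, beta, y, Ktilde):
    """The radius/margin bound f = R^2 ||w~||^2 / m of (5)."""
    m = len(y)
    return radius_sq(beta, Ktilde) * w_norm_sq(alpha, y, Ktilde) / m
```

With an exact solver, twice the dual objective of (4) and $\sum_i \alpha_i$ both recover $\|\tilde{w}\|^2$ at optimality; with asymptotically converging solvers they agree only up to the termination tolerance (see Remark 2 below).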


It should be mentioned here that Cristianini et al. [4] carried out the first set of experiments using the radius/margin bound together with iterative SVM methods. However, their experiments were done on the hard margin problem without the parameter $C$ and the threshold $b$. To solve (4), they employed the kernel adatron algorithm, which is extremely easy to implement, but very slow. Further, they made no mention of the ease with which the gradient of the radius/margin bound with respect to the hyperparameters can be computed.

II. IMPLEMENTATION ISSUES

We will assume that $f$ is differentiable with respect to $C$ and $\sigma^2$.¹ To speed up the tuning, it is appropriate to use a gradient-based technique, such as a quasi-Newton algorithm or a conjugate-gradient method, to minimize $f$. Quasi-Newton algorithms are particularly suitable because they work well even when the function and gradient are not computed exactly. On the other hand, conjugate-gradient methods are known to be sensitive to such errors.

A. Evaluation of $f$

We have employed the nearest point algorithm given in [5] for solving (4) and evaluating $\|\tilde{w}\|^2$. The numerical experiments of that paper show that this algorithm is very efficient for solving the hard margin problem in (1) and (4). The sequential minimal optimization (SMO) algorithm [7], [6] is an excellent alternative. To determine $R^2$ via (6), the SMO algorithm discussed in [11, Sec. 4] is very suitable. This algorithm was modified along the lines outlined in [6] so that it runs very fast.

B. Evaluation of the Gradient of $f$

The computation of the gradient of $f$ requires knowledge of the gradients of $\|\tilde{w}\|^2$ and $R^2$. Recently, Chapelle et al. gave a very useful result (see [2, Lemma 2]) which makes these gradient computations extremely easy once (4) and (6) are solved. It is important to appreciate the usefulness of their result, particularly from the viewpoint of this paper, in which iterative nonmatrix-based techniques are used for solving (4) and (6). Clearly, $\|\tilde{w}\|^2$ depends on $\alpha$, which in turn depends on $C$ and $\sigma^2$. Yet, because $\|\tilde{w}\|^2$ itself is computed via an optimization problem [i.e., (4)], it turns out that the gradient of $\alpha$ with respect to the hyperparameters does not enter into the computation of the gradient of $\|\tilde{w}\|^2$. Since $R^2$ is also computed via an optimization problem [i.e., (6)], a similar result holds for $R^2$ and $\beta$.

Remark 1: The easiest way to appreciate the above result is to consider the function $\bar{f}(\theta)$ given by $\bar{f}(\theta) = \min_x g(x, \theta)$. Let $x(\theta)$ denote the solution of the minimization problem; then, $\partial g/\partial x = 0$ at $x = x(\theta)$. Now, $\bar{f}(\theta) = g(x(\theta), \theta)$. Hence

$$\frac{d\bar{f}}{d\theta} = \frac{\partial g}{\partial x}\bigg|_{x = x(\theta)} \frac{dx(\theta)}{d\theta} + \frac{\partial g}{\partial \theta}\bigg|_{x = x(\theta)} = \frac{\partial g}{\partial \theta}\bigg|_{x = x(\theta)} \qquad (7)$$

¹The contour plots given later in Figs. 1 and 2 seem to indicate that this is a reasonable assumption.

Thus, the gradient of $\bar{f}$ with respect to $\theta$ can be obtained simply by differentiating $g$ with respect to $\theta$, as if the minimizer $x(\theta)$ had no dependence on $\theta$. The corresponding arguments for the constrained optimization problems in (4) and (6) are a bit more complicated (see [2] for details). Nevertheless, the above arguments, together with (4), should easily help one to appreciate the fact that the determination of the gradient of $\|\tilde{w}\|^2$ with respect to $\theta$ does not require $\partial\alpha/\partial\theta$. In a similar way, by (6), the determination of the gradient of $R^2$ with respect to $\theta$ does not require $\partial\beta/\partial\theta$. It is important to note that the determination of $\partial\alpha/\partial\theta$ and $\partial\beta/\partial\theta$ requires expensive matrix operations involving the kernel matrix. Hence, Chapelle et al.'s result concerning the avoidance of these gradients in the evaluation of the gradients of $\|\tilde{w}\|^2$ and $R^2$ gives excellent support for the radius/margin criterion when iterative techniques are employed for solving (4) and (6). For other criteria, such as the LOO error, the $k$-fold CV error, or other approximate measures, such an easy evaluation of the gradient of the performance function with respect to the hyperparameters is ruled out. This issue is particularly important when a large number of hyperparameters, other than $C$ and $\sigma^2$ (such as input weighting parameters), are also considered for tuning, because when the number of optimization variables is large, gradient-based optimization methods are many times faster than methods which use function values only.

Remark 2: Since iterative algorithms for (4) and (6) converge only asymptotically, a termination criterion is usually employed to terminate them finitely. This termination criterion has to be chosen with care for the following reason. Take, for example, the function $\bar{f}$ mentioned in Remark 1. Suppose $\bar{f}$ is to be evaluated at some given $\theta$. During the solution of $\min_x g(x, \theta)$, we use a termination criterion and only obtain $\hat{x}$, which is an approximation of $x(\theta)$. Since $\partial g/\partial x$ is not exactly zero at $\hat{x}$, the last equality in (7) does not hold and, hence, the term $(\partial g/\partial x)(dx/d\theta)$ is needed to compute $d\bar{f}/d\theta$. If the effect of this term has to be ignored, then it is important to ensure that the termination criterion used in the solution of $\min_x g(x, \theta)$ is stringent enough to ensure that $\partial g/\partial x$ at $\hat{x}$ is sufficiently small. Unfortunately, it is not easy to come up with precise values of tolerance to do this. A simple approach that works well is to use reasonably small tolerances and, if gradient methods face failure, then to decrease these tolerances further.

In the rest of this paper, we consider only the Gaussian kernel given by (3a) and take $\theta = (C, \sigma^2)$. Application of Chapelle et al.'s [2] gradient calculations, using (7), yields the following expressions:

$$\frac{\partial f}{\partial C} = \frac{1}{m}\left[\|\tilde{w}\|^2 \frac{\partial R^2}{\partial C} + R^2 \frac{\partial \|\tilde{w}\|^2}{\partial C}\right], \qquad \frac{\partial f}{\partial \sigma^2} = \frac{1}{m}\left[\|\tilde{w}\|^2 \frac{\partial R^2}{\partial \sigma^2} + R^2 \frac{\partial \|\tilde{w}\|^2}{\partial \sigma^2}\right] \qquad (8)$$

The derivatives of $\|\tilde{w}\|^2$ are given by

$$\frac{\partial \|\tilde{w}\|^2}{\partial C} = \frac{1}{C^2}\sum_i \alpha_i^2, \qquad \frac{\partial \|\tilde{w}\|^2}{\partial \sigma^2} = -\sum_{i,j} \alpha_i \alpha_j y_i y_j \frac{\partial k(x_i, x_j)}{\partial \sigma^2} \qquad (9)$$


The derivatives of $R^2$ are given by

$$\frac{\partial R^2}{\partial C} = \frac{1}{C^2}\left(\sum_i \beta_i^2 - 1\right), \qquad \frac{\partial R^2}{\partial \sigma^2} = -\sum_{i,j} \beta_i \beta_j \frac{\partial k(x_i, x_j)}{\partial \sigma^2} \qquad (10)$$

Also

$$\frac{\partial k(x_i, x_j)}{\partial \sigma^2} = \frac{\|x_i - x_j\|^2}{2\sigma^4}\, k(x_i, x_j) \qquad (11)$$

Thus, the gradient of $f$ is cheaply computed once $f$ has been computed (since $\alpha$, $\beta$, $R^2$, and $\|\tilde{w}\|^2$ are all available).
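As a concrete illustration of (8)–(11) as reconstructed above, the following NumPy sketch (not the paper's code) evaluates the bound and its gradient with respect to $C$ and $\sigma^2$ from given dual solutions $\alpha$ and $\beta$; the function and variable names are assumptions.

```python
import numpy as np

def bound_and_gradient(alpha, beta, y, X, C, sigma2):
    """Radius/margin bound f = R^2 ||w~||^2 / m and its gradient w.r.t. (C, sigma2),
    following (8)-(11). alpha and beta are (approximate) solutions of (4) and (6);
    their own derivatives w.r.t. the hyperparameters are not needed ([2, Lemma 2])."""
    m = len(y)
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T            # ||x_i - x_j||^2
    K = np.exp(-np.maximum(D2, 0.0) / (2.0 * sigma2))         # Gaussian kernel (3a)
    Ktilde = K + np.eye(m) / C                                # modified kernel (2)

    ya = alpha * y
    w2 = 2.0 * (alpha.sum() - 0.5 * ya @ Ktilde @ ya)         # ||w~||^2 from (4)
    R2 = beta @ np.diag(Ktilde) - beta @ Ktilde @ beta        # R^2 from (6)
    f = R2 * w2 / m

    dK_ds2 = K * D2 / (2.0 * sigma2**2)                       # dk/d(sigma^2), (11)

    dw2_dC = np.sum(alpha**2) / C**2                          # (9)
    dw2_ds2 = -(ya @ dK_ds2 @ ya)
    dR2_dC = (np.sum(beta**2) - 1.0) / C**2                   # (10)
    dR2_ds2 = beta @ np.diag(dK_ds2) - beta @ dK_ds2 @ beta   # diagonal term vanishes here

    df_dC = (w2 * dR2_dC + R2 * dw2_dC) / m                   # (8)
    df_ds2 = (w2 * dR2_ds2 + R2 * dw2_ds2) / m
    return f, df_dC, df_ds2
```

If the optimization is carried out in transformed variables (Section II-C), the chain rule converts these derivatives accordingly (e.g., $\partial f/\partial \log C = C\,\partial f/\partial C$ under a log transformation).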

C. Variable Transformation

As suggested by Chapelle et al. [2], we use

(12)

as the variables for optimization instead of $C$ and $\sigma^2$. This is a common transformation usually suggested elsewhere in the literature too.

D. Choosing Initial Values of $C$ and $\sigma^2$

Unless we have some good knowledge about the problem, it is not easy to choose good initial values for $C$ and $\sigma^2$. We have experimented with two different pairs of initial conditions. The first one is

(13)

Let $R_x$ denote the radius of the smallest sphere in the input space that contains all the examples, i.e., the $x_i$'s. The second pair of initial conditions is

(14)

In all the datasets tried in this paper, each component of the $x_i$'s is normalized to lie between $-1$ and $1$. Hence, for all the numerical experiments, we have simply used $R_x^2 = n$, where $n$ is the dimension of $x$.² Detailed testing shows that (14) gives better results than (13). There was one dataset (Splice) for which (13) actually failed. See Fig. 2 for details.

E. Issues Associated With the Gradient Descent Algorithm

To minimize $f$ there are many choices of optimization methods. In this work, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton algorithm [12] has been used. A conjugate-gradient method was also tried, but it required many more $f$ evaluations than the BFGS algorithm.³ Since each $f$ evaluation is expensive [it requires the solution of (4) and (6)], the BFGS method was preferred.
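Before turning to the line search details, here is a small illustrative fragment of the setup described in Sections II-C and II-D: a log-scale parameterization of the optimization variables and an initial $\sigma^2$ seeded from the input-space radius estimate $R_x^2 = n$. The log transformation and the constant $C_0$ are assumptions made for illustration; the exact forms of (12)–(14) are not reproduced above.

```python
import numpy as np

def initial_point(X, C0=1.0):
    """Illustrative starting point in log-transformed variables (cf. Sections II-C, II-D).

    sigma2 is seeded from the input-space radius estimate R_x^2 = n (inputs
    normalized to [-1, 1]); C0 is an arbitrary illustrative constant, not the
    paper's setting in (13)/(14)."""
    n = X.shape[1]                       # input dimension
    Rx2 = float(n)                       # R_x^2 = n for inputs in [-1, 1]^n
    sigma2_0 = Rx2                       # placeholder heuristic based on R_x^2
    return np.array([np.log(C0), np.log(sigma2_0)])

def to_hyperparameters(theta_log):
    """Map the optimization variables back to (C, sigma2)."""
    C, sigma2 = np.exp(theta_log)
    return C, sigma2
```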

Each optimization iteration involves the determination of a search direction $d$ using the BFGS method. Then a line search is performed along that direction to look for a point that satisfies certain approximate conditions associated with the following problem:

$$\min_{\eta \ge 0} \ f(\theta + \eta d) \qquad (15)$$

Since the gradient of $f$ is easily computed once $f$ is obtained, it is effective to use a line search technique that uses both function values and gradients. The code in [12] employs such a technique. For the BFGS algorithm, $\eta = 1$ is a natural choice to try as the first step size in each line search. This choice is so good that the line search usually attempts only one or two values of $\eta$ before successfully terminating an optimization iteration. Usually, the goodness of the choice $\eta = 1$ is expected to hold strongly as the minimizer of $f$ is approached, since the BFGS step approaches a Newton step. However, this does not happen in our case for the following reason. As the minimizer is approached, the gradient values are small, and the effect of the errors associated with the solution of (4) and (6) on the gradient evaluation becomes more important. Thus, the line search sometimes requires many evaluations of $f$ in the end steps. In numerical experiments, it was observed that reaching the minimizer of $f$ too closely is not important⁴ for arriving at good values of the hyperparameters. Hence, it is a good idea to terminate the line search (as well as the optimization process) if more than ten values of $\eta$ have been attempted in that line search.

The optimization process generates a sequence of points in the space of hyperparameters. Successive points attempted by the process are usually located "not-so-far-off" from each other. It is important to turn this factor to advantage in the solution of (4) and (6). Thus, if $\alpha$ and $\beta$ denote the solutions of (4) and (6) at some $\theta$, and the optimization process next tries a new point $\bar\theta$, then $\alpha$ and $\beta$ are used to obtain good starting points for the solution of (4) and (6) at $\bar\theta$. This gives significant gains in computational time. Since the constraints in (6) do not depend on the hyperparameters, $\beta$ can be directly carried over for the solution of (6) at $\bar\theta$. For (4), we already said that the nearest point formulation in [5] is employed. Since the constraints in the nearest point formulation are also independent of the hyperparameters, carrying over the variables for the solution at $\bar\theta$ is easy for the nearest point algorithm too.

The choice of the criterion for terminating the optimization process is also very important. As already mentioned, reaching the minimizer of $f$ too closely is not crucial. Hence, the criterion used can be loose. The following choice has worked quite well. Suppose BFGS starts an optimization iteration at $\theta_k$, then successfully completes a line search and reaches the next point $\theta_{k+1}$. Optimization is terminated if the following holds:

(16)

³As discussed at the beginning of Section II, this could be due to the sensitivity of the conjugate-gradient method to errors in the evaluation of $f$ and its gradient.

⁴This should not be confused with our stress, in Remark 2, on the accurate determination of $f$ and its gradient by solving (4) and (6) accurately.

²In the case of the Adult-7 dataset, each $x$ has only 15 nonzero entries. Hence, $n$ is set to 15 for that example.
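Putting Sections II-A through II-E together, the outer tuning loop might look like the following sketch. The iterative dual solvers are passed in as callables (stand-ins for the nearest point algorithm of [5] and the SMO procedure of [11, Sec. 4]; they are not reproduced here), the previous $\alpha$ and $\beta$ are cached to warm-start the next $f$ evaluation, and SciPy's BFGS is used as a stand-in for the BFGS code of [12], with a loose tolerance in the spirit of the termination discussion above. All names and settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def tune_hyperparameters(X, y, solve_svm_dual, solve_ball_dual, f_and_grad,
                         theta0_log, gtol=1e-2, maxiter=40):
    """Sketch of the outer loop: minimize the radius/margin bound with BFGS.

    solve_svm_dual(Ktilde, y, alpha0) -> alpha   # iterative solver for (4), e.g., [5] or SMO
    solve_ball_dual(Ktilde, beta0)    -> beta    # iterative solver for (6), e.g., [11, Sec. 4]
    f_and_grad(alpha, beta, y, X, C, sigma2) -> (f, df_dC, df_dsigma2)
    All three callables are user-supplied placeholders, not the paper's code
    (the bound_and_gradient sketch above fits the f_and_grad signature).
    """
    m = len(y)
    cache = {"alpha": np.zeros(m), "beta": np.full(m, 1.0 / m)}

    def objective(theta_log):
        C, sigma2 = np.exp(theta_log)
        sq = np.sum(X**2, axis=1)
        D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        Ktilde = np.exp(-np.maximum(D2, 0.0) / (2.0 * sigma2)) + np.eye(m) / C
        # Warm-start both duals from the previous solution (Section II-E).
        alpha = solve_svm_dual(Ktilde, y, cache["alpha"])
        beta = solve_ball_dual(Ktilde, cache["beta"])
        cache["alpha"], cache["beta"] = alpha, beta
        f, df_dC, df_ds2 = f_and_grad(alpha, beta, y, X, C, sigma2)
        # Chain rule for the log-transformed variables (Section II-C).
        return f, np.array([C * df_dC, sigma2 * df_ds2])

    res = minimize(objective, theta0_log, jac=True, method="BFGS",
                   options={"gtol": gtol, "maxiter": maxiter})
    C_opt, sigma2_opt = np.exp(res.x)
    return (C_opt, sigma2_opt), res
```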


TABLE I
PERFORMANCE OF THE CODE ON THE DATA SETS CONSIDERED. HERE: $n$ = NUMBER OF INPUT VARIABLES; $m$ = NUMBER OF TRAINING EXAMPLES; $m_{test}$ = NUMBER OF TEST EXAMPLES; $n_f$ = NUMBER OF $f$ EVALUATIONS USED BY THE RADIUS/MARGIN METHOD (RM) (THE NUMBER FOR THE 5-FOLD METHOD IS ALWAYS 221); TESTERR = PERCENTAGE ERROR ON THE TEST SET; AND $m_{sv}$ = FINAL NUMBER OF SUPPORT VECTORS FOR THE RADIUS/MARGIN METHOD

Typically, the complete optimization process uses only ten to 40 $f$ evaluations.

III. COMPUTATIONAL EXPERIMENTS

We have numerically tested the ideas⁵ on several benchmark datasets given in [9]. To test the usefulness of the code for solving large-scale problems, we have also tested it on the Adult-7 dataset in [7]. All computations were done on a Pentium 4 1.5-GHz machine running Windows. The Gaussian kernel was employed. Thus, $C$ and $\sigma^2$ formed the hyperparameters; (14) was used for initializing them. For comparison, we also tuned $C$ and $\sigma^2$ by five-fold cross validation. The search was done on a two-dimensional 11 × 11 grid in the $(C, \sigma^2)$ space. To use previous solutions effectively, the search on the grid was done along a spiral outward from the central grid values of $C$ and $\sigma^2$. Some important quantities associated with the datasets and the performance are given in Table I. While the generalization performance of the five-fold and radius/margin methods is comparable, the radius/margin method is much faster. The speed-up achieved is expected to be much more when there are more hyperparameters to be tuned. For a few datasets, Fig. 1 and the left-hand side of Fig. 2 show the sequence of points generated by the BFGS optimization method on plots in which contours of equal $f$ values are drawn for various values of $f$. In the case of the Splice and Banana datasets, for which the sizes of the test sets are large, the right-hand side plots of Fig. 2 show contours of test set error. These are given to point out how good the radius/margin criterion is.
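The grid traversal used for the five-fold comparison can be organized as in the following sketch: the 11 × 11 grid of $(C, \sigma^2)$ values is visited ring by ring outward from the central point, so that the SVM solution at one grid point can seed the next. The ring-by-ring ordering and the evaluate_cv_error callable are illustrative assumptions; the paper does not spell out the exact traversal.

```python
import numpy as np
from math import atan2

def spiral_order(n_grid=11):
    """Visit the cells of an n_grid x n_grid grid ring by ring, outward from the center."""
    c = n_grid // 2
    cells = [(i, j) for i in range(n_grid) for j in range(n_grid)]
    return sorted(cells, key=lambda ij: (max(abs(ij[0] - c), abs(ij[1] - c)),
                                         atan2(ij[1] - c, ij[0] - c)))

def grid_search_cv(C_values, sigma2_values, evaluate_cv_error):
    """evaluate_cv_error(C, sigma2) -> five-fold CV error (user-supplied callable)."""
    best_err, best_pair = np.inf, None
    for i, j in spiral_order(len(C_values)):
        C, sigma2 = C_values[i], sigma2_values[j]
        err = evaluate_cv_error(C, sigma2)   # previous SVM solutions can seed this call
        if err < best_err:
            best_err, best_pair = err, (C, sigma2)
    return best_pair, best_err
```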

A. Using the $\|\tilde{w}\|^2$ Approximation

When the Gaussian kernel function is used, to simplify computations, the approximation of using $\|\tilde{w}\|^2$ alone in place of $R^2\|\tilde{w}\|^2$ is sometimes tried. We did some experiments to check the usefulness of this approximation.

⁵An experimental version of the code, running on a Matlab interface through the mex facility, is available from the author.

Fig. 1. Contour plots of equal $f$ values for the Adult-7, Breast Cancer, Diabetis, and Flare-Solar datasets. The two marker symbols, respectively, denote points generated by the BFGS algorithm starting from the initial conditions in (13) and (14).

Fig. 2. The two figures on the left-hand side give radius/margin contour plots for the Splice and Banana datasets. The two marker symbols, respectively, denote points generated by the BFGS algorithm using the initial conditions in (13) and (14). In the case of the Splice dataset, for initial condition (13), optimization was terminated after two $f$ evaluations since a very large $C$ value was attempted at the third $f$ evaluation point and so the computing time required for that $C$ became too huge. The two figures on the right-hand side give contour plots of test set error. In these two plots, M denotes the location of the point of least test set error.

For four datasets, Fig. 3 shows the variation of $R^2\|\tilde{w}\|^2$, $\|\tilde{w}\|^2$, and the test set error with respect to $\sigma^2$ for fixed values of $C$. It is clear that all three functions are quite well correlated and hence, as far as the tuning of $\sigma^2$ is concerned, using $\|\tilde{w}\|^2$ seems to be a good approximation to make. This agrees with the observation made by Cristianini et al. [4]. However, using $\|\tilde{w}\|^2$ for tuning $C$ is dangerous. Note, using (9), that $\|\tilde{w}\|^2$ is always increasing with $C$. Clearly, $\|\tilde{w}\|^2$ alone is inadequate for the determination of $C$.


Fig. 3. Variation of $R^2\|\tilde{w}\|^2$, $\|\tilde{w}\|^2$, and TestErr with respect to $\sigma^2$ for fixed $C$ values. In each graph, the vertical axis is normalized differently for $R^2\|\tilde{w}\|^2$, $\|\tilde{w}\|^2$, and TestErr. This was done because, for tuning $\sigma^2$, the point of minimum of the function is important and not the actual value of the function.

IV. CONCLUSION

In this paper, we have discussed various implementation issues associated with the tuning of hyperparameters for the SVM $L_2$ soft margin problem, by minimizing the radius/margin criterion and employing iterative techniques for obtaining the radius and margin. The experiments indicate the usefulness of the radius/margin criterion and the associated implementation. The extension of the implementation to the simultaneous tuning of many other hyperparameters, such as those associated with feature selection, different cost values, etc., looks very possible. Our current research is focused on this direction.

REFERENCES

[1] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, 1998.
[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing kernel parameters for support vector machines," Machine Learning, pp. 131–159, 2002. [Online]. Available: http://www-connex.lip6.fr/~chapelle/
[3] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[4] N. Cristianini, C. Campbell, and J. Shawe-Taylor, "Dynamically adapting kernels in support vector machines," Advances in Neural Information Processing Systems, 1999. [Online]. Available: http://lara.enm.bris.ac.uk/cig/pubs/1999/nips98.ps.gz
[5] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "A fast iterative nearest point algorithm for support vector machine classifier design," IEEE Trans. Neural Networks, vol. 11, pp. 124–136, Jan. 2000.
[6] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM design," Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
[7] J. Platt, Sequential Minimal Optimization, 1998. [Online]. Available: http://www.research.microsoft.com/~jplatt/smo.html
[8] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.
[9] G. Rätsch, Benchmark Datasets, 1999. [Online]. Available: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm
[10] B. Schölkopf, C. Burges, and V. Vapnik, "Extracting support data for a given task," presented at the 1st Int. Conf. on Knowledge Discovery and Data Mining, U. M. Fayyad and R. Uthurusamy, Eds., Menlo Park, CA, 1995.
[11] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, and A. J. Smola, "Estimating the support of a high dimensional distribution," Microsoft Research, Redmond, WA, 1999. [Online]. Available: http://www.kernel-machines.org/papers/oneclass-tr.ps.gz
[12] D. F. Shanno and K. H. Phua, "Minimization of unconstrained multivariate functions," ACM Trans. Math. Software, vol. 6, pp. 618–622, 1980.
[13] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[14] V. Vapnik and O. Chapelle, "Bounds on error expectation for support vector machines," Neural Computation, vol. 12, no. 9, 2000.
