An ARD Procedure Based on AIC for Linear Regression

Dmitry Kropotov [email protected]
Dorodnicyn Computing Center of the Russian Academy of Sciences, 119234, Russia, Moscow, Vavilova str., 40

Dmitry Vetrov [email protected]
Moscow State University, 119992, Russia, Moscow, Leninskie gory, 1, 2nd educational building, CMC department

Abstract

In this paper we introduce a method, based on an information criterion, for the automatic adjustment of regularization coefficients in generalized linear regression. We develop an RVM-like procedure that finds irrelevant weights and leads to very sparse decision rules. Unlike RVM, our method sets some regularization coefficients exactly to zero. We hope that this helps to avoid the type-II overfitting (underfitting) that has been reported for RVM.

1. Introduction

Sparse methods are currently widely used in machine learning. The relevance vector machine (RVM), originally proposed by Tipping (Tipping, 2000), is an elegant example of their application to linear regression problems. In RVM, L2-regularization coefficients are assigned individually to each weight and are adjusted automatically by optimizing the marginal likelihood (evidence). This procedure, also known as automatic relevance determination (ARD) (MacKay, 1992), drives the regularization coefficients of most basis functions to infinity, thus setting the corresponding weights to zero and effectively removing those basis functions from the model. The well-known Bayesian information criterion (BIC) (Schwarz, 1978) can be considered a rough approximation of the log-evidence exploited in RVM (Bishop, 2006). The Akaike information criterion (AIC) (Akaike, 1974) is another well-known criterion, suggesting an alternative approach to model selection based on information theory. Although AIC was originally proposed for selection among a finite number of models, it can be extended relatively straightforwardly to a continuum of models. In this paper we suggest using such a generalized version of AIC to perform model selection in linear regression. In particular, we are interested in checking whether the resulting regression retains the property of being extremely sparse, like RVM.

The rest of the paper is organized as follows. In section 2 we establish notation for the generalized linear regression problem. In section 3 we briefly describe the continuous version of AIC (CAIC) and its application to relevance determination. In section 4 we present experimental results and a comparative evaluation of CAIC RVM against the original RVM.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

2. Problem formulation

Consider the classical generalized linear regression problem. Let (X, ~t) = {(~x_1, t_1), ..., (~x_n, t_n)} be a training set, where ~x_i ∈ R^d is a vector of observable features and t_i ∈ R is the regression value, known only for the points of the training set. A set of predefined basis functions {φ_1(~x), ..., φ_m(~x)} is fixed in advance. The problem is to find a vector of weights ~w such that the linear function

y(\vec{x}) = \vec{w}^T \vec{\phi}(\vec{x}) = \sum_{j=1}^{m} w_j \phi_j(\vec{x})

is close to the regression values for new points ~x. Let Φ = (φ_ij) = (φ_j(~x_i)) be the matrix of basis functions evaluated at each training point. The classical approach is to optimize the regularized likelihood

\vec{w}_{MP} = \arg\max_{\vec{w}} p(\vec{t} \mid X, \vec{w})\, p(\vec{w} \mid \alpha),


where

p(\vec{t} \mid X, \vec{w}) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left(-\frac{1}{2\sigma^2} \|\Phi\vec{w} - \vec{t}\|^2\right)

is the likelihood function and

p(\vec{w} \mid \alpha) = \left(\sqrt{\frac{\alpha}{2\pi}}\right)^m \exp\left(-\frac{\alpha}{2} \|\vec{w}\|^2\right)

is the prior distribution over the weights, which can also be regarded as a regularizer penalizing large values of ~w. A more general case is considered in RVM, where each weight has its own regularization coefficient and the prior distribution has the form

p(\vec{w} \mid \vec{\alpha}) = \prod_{j=1}^{m} \sqrt{\frac{\alpha_j}{2\pi}} \exp\left(-\frac{\alpha_j}{2} w_j^2\right) = \frac{\det(A)^{1/2}}{(2\pi)^{m/2}} \exp\left(-\frac{1}{2} \vec{w}^T A \vec{w}\right),

Table 1. Sparsity of different algorithms (number of relevant weights)

Data        CAIC RVR         EvRVR            No. of weights
Auto-mpg    7.10 ± 4.94      8.60 ± 2.99      199
Boston      34.00 ± 5.24     24.50 ± 3.02     253
Heat-1      1.20 ± 0.45      5.60 ± 5.47      45
Heat-2      13.10 ± 10.17    10.10 ± 4.39     45
Heat-3      3.90 ± 2.86      2.60 ± 0.22      45
Pyrimid.    10.10 ± 2.97     9.80 ± 5.53      37
Servo       22.00 ± 11.80    15.80 ± 3.47     83
Triazines   8.90 ± 5.48      31.30 ± 19.84    93
WDBC        2.60 ± 1.39      8.30 ± 4.96      23
Autos       36.30 ± 6.23     23.80 ± 3.27     100
CPU         23.90 ± 0.89     26.70 ± 3.03     104

where the regularization matrix A = diag(α_1, ..., α_m), α_j ≥ 0.
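For a fixed diagonal A, the MAP weights above have a familiar closed form: maximizing the product of the Gaussian likelihood and the Gaussian prior gives \vec{w}_{MP} = (\Phi^T\Phi + \sigma^2 A)^{-1}\Phi^T\vec{t}. A minimal NumPy sketch of this computation (the toy data, polynomial basis, and function name are our own illustration, not part of the paper):

```python
import numpy as np

def map_weights(Phi, t, A, sigma2):
    # w_MP = argmax_w p(t | X, w) p(w | alpha): for the Gaussian likelihood
    # and per-weight Gaussian prior this solves
    # (Phi^T Phi + sigma^2 A) w = Phi^T t.
    return np.linalg.solve(Phi.T @ Phi + sigma2 * A, Phi.T @ t)

# Toy illustration: noisy linear data with a three-function polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = 2.0 * x + 0.1 * rng.standard_normal(x.size)
Phi = np.column_stack([np.ones_like(x), x, x ** 2])  # phi_1=1, phi_2=x, phi_3=x^2
A = np.diag([0.0, 1.0, 1.0])  # individual regularization coefficients alpha_j
w_mp = map_weights(Phi, t, A, sigma2=0.01)
```

Setting A = 0 recovers the unregularized maximum-likelihood (least-squares) solution, which is the estimate \vec{w}_{ML} used in section 3.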

3. Continuous AIC

In order to find the coefficients ~α we suggest optimizing the following criterion:

\vec{\alpha}^* = \arg\max_{\vec{\alpha}} \mathrm{CAIC}(\vec{\alpha}) = \arg\max_{A \in M} \left( \log p(\vec{t} \mid X, \vec{w}_{MP}) - \mathrm{tr}(H(H+A)^{-1}) \right) = \arg\max_{A \in M} \left( L(\vec{w}_{MP}) - \mathrm{tr}(H(H+A)^{-1}) \right).

Here we denote H = (h_ij) = -∇_w ∇_w log p(~t | X, ~w) = σ^{-2} Φ^T Φ, and M = {A ∈ R^{m×m} | A = diag(α_1, ..., α_m), α_j ≥ 0} is the set of diagonal matrices with non-negative elements. In our talk we will show that CAIC is a natural generalization of the Akaike information criterion based on information theory. Indeed, if A is a diagonal matrix whose elements equal either zero or plus infinity, then tr(H(H+A)^{-1}) = k, where k is the number of zero diagonal elements of A, and CAIC becomes equivalent to the classical Akaike information criterion. Continuous AIC can also be considered a special case of the deviance information criterion (DIC) described in (Spiegelhalter et al., 2002).

We optimize CAIC iteratively, by considering the dependence of CAIC on a single α_j with all other α_i, i ≠ j, fixed. We use the subscript (j) for subvectors (and submatrices) with the j-th row (and j-th column) removed. Let P = H + A and let ~q be the j-th column of P. Denote ~ψ = H ~w_{ML}, where ~w_{ML} is the maximum-likelihood estimate. Then it can be shown that α_j can be recomputed iteratively as

\alpha_j^{new} = \begin{cases} \hat{\alpha}_j, & \hat{\alpha}_j \ge 0, \\ 0, & 0 > \hat{\alpha}_j > \vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} - h_{jj}, \\ +\infty, & \hat{\alpha}_j < \vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} - h_{jj}, \end{cases} \quad (1)

where

\hat{\alpha}_j = \vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} - h_{jj} + \frac{a_j}{b_j}

and

a_j = \left(\vec{q}_{(j)}^T P_{(j)}^{-1} \vec{\psi}_{(j)} - \psi_j\right)^2 \times \left(\vec{q}_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec{q}_{(j)} - 2\vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} + h_{jj}\right),

b_j = \left(\vec{q}_{(j)}^T P_{(j)}^{-1} \vec{\psi}_{(j)} - \psi_j\right)\left(\vec{q}_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec{\psi}_{(j)} + \psi_j - 2\vec{q}_{(j)}^T P_{(j)}^{-1} \vec{\psi}_{(j)}\right) + \vec{q}_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec{q}_{(j)} - 2\vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} + h_{jj}.
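The three-way case split in (1) is a simple thresholding of \hat{\alpha}_j. A sketch of that step in isolation (the function name is our own; computing \hat{\alpha}_j and the threshold \vec{q}_{(j)}^T P_{(j)}^{-1} \vec{q}_{(j)} - h_{jj} from P, H and ~ψ is assumed to be done as in the formulas above):

```python
import math

def update_alpha(alpha_hat, threshold):
    """One coordinate update of a regularization coefficient as in Eq. (1).

    alpha_hat -- the candidate value alpha_hat_j
    threshold -- q_(j)^T P_(j)^{-1} q_(j) - h_jj
    """
    if alpha_hat >= 0.0:
        return alpha_hat  # keep the finite non-negative value
    if alpha_hat > threshold:
        return 0.0        # weight stays, completely unregularized
    return math.inf       # weight (and its basis function) is pruned
```

The middle branch, which returns exactly zero, is what distinguishes this update from the evidence-based RVM update, where coefficients only remain finite or diverge to plus infinity.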

Note that α_j^new may be set exactly to zero under some conditions. In order to prove that CAIC has a unique maximum, we consider the more general problem

\log p(\vec{t} \mid X, \vec{w}_{MP}) - \mathrm{tr}(H(H+A)^{-1}) \to \max_{A \in N}, \quad (2)

where N = {A | A = A^T ⪰ 0}. During our talk we will show how this specific semidefinite problem can be solved analytically. The solution of (2) is unique, and since M is a convex subset of N, this implies that CAIC also has a unique maximum.
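The limiting behaviour claimed above, tr(H(H+A)^{-1}) = k when every α_j is either zero or infinite, is easy to check numerically by making the "infinite" coefficients merely very large. In the sketch below H is a random positive-definite stand-in for σ^{-2}Φ^TΦ (an illustration of ours, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 5))
H = B.T @ B + 0.1 * np.eye(5)  # positive-definite Hessian stand-in, m = 5

# Two coefficients exactly zero, three "infinite" (numerically huge),
# so the effective number of free weights should be k = 2.
A = np.diag([0.0, 1e8, 0.0, 1e8, 1e8])

eff = np.trace(H @ np.linalg.inv(H + A))  # close to k = 2
full = np.trace(H @ np.linalg.inv(H))     # A = 0 keeps all m = 5 weights
```

The trace term thus interpolates continuously between 0 and m, playing the role of the discrete parameter count k in the classical AIC penalty.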

4. Experiments and Discussion

We tested the original RVM (EvRVM), CAIC RVM and linear ridge regression on a number of datasets taken

from the UCI repository1 and the Regression Toolbox by Heikki Hyotyniemi2. In all the regressors being compared the number of basis functions is m = n + 1, with φ_j(~x) = exp(−γ‖~x − ~x_j‖²) and φ_{n+1}(~x) = 1. The width parameter γ and the regularization coefficient in ridge regression were selected via 5x2-fold cross-validation (Dietterich, 1998). For each dataset the root-mean-square error was likewise measured using 5x2-fold cross-validation. For CAIC RVM and EvRVM we also computed the average rate of non-zero weights. The experiments show no statistically significant difference in performance between the methods, i.e. we keep the quality of ridge regression while removing most of the regressors and making the decision rule very simple. Table 1 reports the sparsity of the different algorithms.

It is also interesting to see how many relevant basis functions have zero regularization coefficients in CAIC RVM. The number of zero-valued α_j is shown in the bar histogram in figure 1. We may see that, as in the original RVM, most α_j in CAIC RVM tend to infinity.

Figure 1. Number of relevant objects for CAIC RVM for different datasets. The black part of each bar corresponds to the number of relevant objects with zero regularization coefficients.

So the main result of the paper is that AIC can be used for ARD just as Bayesian methods can. Unlike RVM, in the case of CAIC some regularization coefficients are set exactly to zero. The approach based on CAIC RVM seems very promising for feature selection in linear regression problems, the problem the original AIC was traditionally used for. Instead of performing a computationally expensive exhaustive search to solve the discrete feature-selection problem, we may transform it into a continuous one and use CAIC RVM for its solution.

The experimental results make it possible to conclude that Bayesian learning and the information-based approach have much in common and are probably two sides of the same phenomenon.

One direction of future work is to apply CAIC ARD to classification problems. A possible way to do this is to map the classification problem into a regression one (Tipping, 2001).

1 http://archive.ics.uci.edu/ml/
2 http://www.control.hut.fi/Hyotyniemi/publications/01 report125/RegrToolbox

Acknowledgements

This work was supported by the Russian Foundation for Basic Research (grants Nos. 08-01-00405, 08-01-90016, 08-01-90427).

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1924.

MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation, 4, 720–736.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Spiegelhalter, D., Best, N., Carlin, B., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583–640.

Tipping, M. E. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen and K. R. Mueller (Eds.), Advances in neural information processing systems 12, 652–658. MIT Press.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.