An Automatic Relevance Determination Procedure Based on Akaike Information Criterion for Linear Regression Problems

Dmitry Kropotov  [email protected]
Dorodnicyn Computing Center of the Russian Academy of Sciences, 119234, Russia, Moscow, Vavilova str., 40

Dmitry Vetrov  [email protected]
Moscow State University, 119992, Russia, Moscow, Leninskie gory, 1, 2nd educational building, CMC department

Abstract. In this paper we introduce a method based on an information criterion for the automatic adjustment of regularization coefficients in generalized linear regression. We develop an RVM-like procedure which finds irrelevant weights and leads to very sparse decision rules. Unlike RVM, our method sets some regularization coefficients exactly to zero. We expect that this helps to avoid the type-II overfitting (underfitting) that has been reported for RVM.

1. Introduction

Sparse methods are currently widely used in machine learning. The relevance vector machine (RVM), originally proposed by Tipping (Tipping, 2000), is an elegant example of their application to linear regression problems. In RVM, L2-regularization coefficients are assigned individually to each weight and are adjusted automatically by optimizing the marginal likelihood (evidence). This procedure, also known as automatic relevance determination (ARD) (MacKay, 1992), drives the regularization coefficients of most basis functions to infinity, thus setting the corresponding weights to zero and effectively removing those basis functions from the model. The well-known Bayes-Schwarz information criterion (BIC) (Schwarz, 1978) can be considered a rough approximation of the log-evidence (Bishop, 2006) that is exploited in RVM. The Akaike information criterion (AIC) (Akaike, 1974) is another well-known criterion, suggesting an alternative approach to model selection based on information theory. Although it was originally proposed for selection among a finite number of models, it can be extended relatively straightforwardly to a continuum of models. In this paper we suggest using such a generalized version of AIC for model selection in linear regression. In particular, we are interested in checking whether the resulting regression retains the property of being extremely sparse, like RVM.

The rest of the paper is organized as follows. In section 2 we establish notation for the generalized linear regression problem. In section 3 we briefly describe the continuous version of AIC (CAIC) and its application to relevance determination. In section 4 we present the results of experiments and a comparative evaluation of CAIC RVM against the original RVM.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

2. Problem formulation

Consider the classical generalized linear regression problem. Let $(X, \vec t) = \{(\vec x_1, t_1), \ldots, (\vec x_n, t_n)\}$ be a training set, where $\vec x_i \in \mathbb{R}^d$ is a vector of observable features and $t_i \in \mathbb{R}$ is the value of the regression variable, known only for the points of the training set. A set of predefined basis functions $\{\phi_1(\vec x), \ldots, \phi_m(\vec x)\}$ is fixed in advance. The problem is to find a vector of weights $\vec w$ such that the linear function

$$y(\vec x) = \vec w^T \vec\phi(\vec x) = \sum_{j=1}^m w_j \phi_j(\vec x)$$

is close to the regression values for new points $\vec x$. Let $\Phi = (\phi_{ij}) = (\phi_j(\vec x_i))$ be the matrix of basis functions computed at each training point. The classical approach is to maximize the regularized likelihood

$$\vec w_{MP} = \arg\max_{\vec w} p(\vec t \mid X, \vec w)\, p(\vec w \mid \alpha),$$


where

$$p(\vec t \mid X, \vec w) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left(-\frac{1}{2\sigma^2}\|\Phi\vec w - \vec t\|^2\right)$$

is the likelihood function and

$$p(\vec w \mid \alpha) = \left(\frac{\alpha}{2\pi}\right)^{m/2} \exp\left(-\frac{\alpha}{2}\|\vec w\|^2\right)$$

is the prior distribution over the weights, which can also be regarded as a regularizer that penalizes large values of $\vec w$. A more general case is considered in RVM, where each weight has its own regularization coefficient and the prior distribution has the form

$$p(\vec w \mid \vec\alpha) = \prod_{j=1}^m \sqrt{\frac{\alpha_j}{2\pi}} \exp\left(-\frac{\alpha_j}{2} w_j^2\right) = \frac{\sqrt{\det(A)}}{(2\pi)^{m/2}} \exp\left(-\frac{1}{2}\vec w^T A \vec w\right),$$
Table 1. Sparsity of different algorithms (number of relevant weights)

Data        EvRVM            CAIC RVM         No. of weights
Auto-mpg    7.10 ± 4.94      8.60 ± 2.99      199
Boston      34.00 ± 5.24     24.50 ± 3.02     253
Heat-1      1.20 ± 0.45      5.60 ± 5.47      45
Heat-2      13.10 ± 10.17    10.10 ± 4.39     45
Heat-3      3.90 ± 2.86      2.60 ± 0.22      45
Pyrimid.    10.10 ± 2.97     9.80 ± 5.53      37
Servo       22.00 ± 11.80    15.80 ± 3.47     83
Triazines   8.90 ± 5.48      31.30 ± 19.84    93
WDBC        2.60 ± 1.39      8.30 ± 4.96      23
Autos       36.30 ± 6.23     23.80 ± 3.27     100
CPU         23.90 ± 0.89     26.70 ± 3.03     104


where the regularization matrix $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_m)$, $\alpha_j \ge 0$.
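As a small illustration of the model above (a Python sketch; the helper names `design_matrix` and `map_weights` are ours, not from the paper), note that for the Gaussian likelihood and the diagonal prior the maximizer of $p(\vec t \mid X, \vec w)\, p(\vec w \mid \vec\alpha)$ has the closed form $\vec w_{MP} = (\Phi^T\Phi + \sigma^2 A)^{-1}\Phi^T \vec t$:

```python
import numpy as np

def design_matrix(X, basis_funcs):
    """Phi[i, j] = phi_j(x_i): each basis function evaluated at each training point."""
    return np.column_stack([[phi(x) for x in X] for phi in basis_funcs])

def map_weights(Phi, t, alphas, sigma2):
    """w_MP for the Gaussian likelihood (noise variance sigma2) and prior
    N(0, A^{-1}) with A = diag(alphas):
    w_MP = (Phi^T Phi + sigma2 * A)^{-1} Phi^T t."""
    A = np.diag(alphas)
    return np.linalg.solve(Phi.T @ Phi + sigma2 * A, Phi.T @ t)

# toy 1-D data generated with weights (1, 0, -2) over the basis {1, x, x^2}
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
t = 1.0 - 2.0 * x**2 + 0.01 * rng.standard_normal(30)
Phi = design_matrix(x, [lambda u: 1.0, lambda u: u, lambda u: u**2])
w_mp = map_weights(Phi, t, alphas=np.array([1e-3, 1e-3, 1e-3]), sigma2=0.01**2)
print(np.round(w_mp, 2))  # close to the generating weights (1, 0, -2)
```

With small noise and small $\alpha_j$ the MAP solution is essentially the least-squares fit; increasing an $\alpha_j$ shrinks the corresponding weight toward zero, which is the behavior ARD exploits.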

3. Continuous AIC

In order to find the coefficients $\vec\alpha$ we suggest optimizing the following criterion:

$$\vec\alpha^* = \arg\max_{\vec\alpha} \mathrm{CAIC}(\vec\alpha) = \arg\max_{A \in M} \left( \log p(\vec t \mid X, \vec w_{MP}) - \mathrm{tr}\bigl(H(H+A)^{-1}\bigr) \right) = \arg\max_{A \in M} \left( L(\vec w_{MP}) - \mathrm{tr}\bigl(H(H+A)^{-1}\bigr) \right).$$

Here $H = (h_{ij}) = -\nabla_{\vec w}\nabla_{\vec w} \log p(\vec t \mid X, \vec w) = \sigma^{-2}\Phi^T\Phi$, and $M = \{A \in \mathbb{R}^{m \times m} \mid A = \mathrm{diag}(\alpha_1, \ldots, \alpha_m),\ \alpha_j \ge 0\}$ is the set of diagonal matrices with non-negative elements. In our talk we will show that CAIC is a natural generalization of the Akaike information criterion based on information theory. Indeed, if $A$ is a diagonal matrix whose elements equal either zero or plus infinity, then $\mathrm{tr}(H(H+A)^{-1}) = k$, where $k$ is the number of zero diagonal elements of $A$, and CAIC becomes equivalent to the classical Akaike information criterion. This continuous AIC can also be considered a special case of the deviance information criterion (DIC) described in (Spiegelhalter et al., 2002).

We optimize CAIC iteratively by considering the dependence of CAIC on a single $\alpha_j$ with all other $\alpha_i$, $i \ne j$, fixed. We use the subscript $(j)$ to denote subvectors (submatrices) with the $j$-th row (and $j$-th column) removed. Let $P = H + A$ and let $\vec q$ be the $j$-th column of $P$. Denote $\vec\psi = H\vec w_{ML}$, where $\vec w_{ML}$ is the maximum likelihood estimate. Then it can be shown that $\alpha_j$ can be recomputed iteratively:

$$\alpha_j^{new} = \begin{cases} \hat\alpha_j, & \hat\alpha_j \ge 0, \\ 0, & 0 > \hat\alpha_j > \vec q_{(j)}^T P_{(j)}^{-1} \vec q_{(j)} - h_{jj}, \\ +\infty, & \hat\alpha_j < \vec q_{(j)}^T P_{(j)}^{-1} \vec q_{(j)} - h_{jj}, \end{cases} \qquad (1)$$

where

$$\hat\alpha_j = \vec q_{(j)}^T P_{(j)}^{-1} \vec q_{(j)} - h_{jj} + \frac{a_j}{b_j},$$

$$a_j = \left( \vec q_{(j)}^T P_{(j)}^{-1} \vec\psi_{(j)} - \psi_j \right)^2 \left( \vec q_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec q_{(j)} - 2\vec q_{(j)}^T P_{(j)}^{-1} \vec q_{(j)} + h_{jj} \right),$$

$$b_j = \left( \vec q_{(j)}^T P_{(j)}^{-1} \vec\psi_{(j)} - \psi_j \right) \left( \vec q_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec\psi_{(j)} + \psi_j - 2\vec q_{(j)}^T P_{(j)}^{-1} \vec\psi_{(j)} \right) + \vec q_{(j)}^T P_{(j)}^{-1} H_{(j)} P_{(j)}^{-1} \vec q_{(j)} - 2\vec q_{(j)}^T P_{(j)}^{-1} \vec q_{(j)} + h_{jj}.$$
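A direct numerical sketch of the criterion itself (in Python; the Gaussian likelihood is the one from section 2, and the function name `caic` is ours) may clarify the role of the trace term as an effective number of parameters:

```python
import numpy as np

def caic(Phi, t, alphas, sigma2):
    """CAIC(alpha) = log p(t | X, w_MP) - tr(H (H + A)^{-1}),
    where H = Phi^T Phi / sigma2 and A = diag(alphas).
    The trace term interpolates between m (all alpha_j = 0, classical AIC)
    and 0 (all alpha_j -> infinity)."""
    n = Phi.shape[0]
    A = np.diag(alphas)
    H = Phi.T @ Phi / sigma2
    w_mp = np.linalg.solve(H + A, Phi.T @ t / sigma2)
    resid = Phi @ w_mp - t
    log_lik = -0.5 * n * np.log(2.0 * np.pi * sigma2) - resid @ resid / (2.0 * sigma2)
    eff_params = np.trace(H @ np.linalg.inv(H + A))
    return log_lik - eff_params

# with all alpha_j = 0 the penalty is exactly m = 3 (classical AIC)
print(caic(np.eye(3), np.ones(3), np.zeros(3), sigma2=1.0))
```

A naive coordinate-wise optimizer would re-evaluate this criterion while sweeping each $\alpha_j$; the closed-form update (1) avoids that by solving each one-dimensional subproblem analytically.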

Note that $\alpha_j^{new}$ may be set exactly to zero under some conditions. In order to prove that CAIC has a unique maximum, we consider a more general problem:

$$\log p(\vec t \mid X, \vec w_{MP}) - \mathrm{tr}\bigl(H(H+A)^{-1}\bigr) \to \max_{A \in N}, \qquad (2)$$

where $N = \{A \mid A = A^T \succeq 0\}$. During our talk we will show how this semi-definite problem can be solved analytically. The solution of (2) is unique, and since $M$ is a convex subset of $N$, CAIC also has a unique maximum.

4. Experiments and Discussion

We tested the original RVM (EvRVM), CAIC RVM and linear ridge regression on a number of datasets taken from the UCI repository and the Regression Toolbox by Heikki Hyotyniemi. In all regressors compared, the number of basis functions is $m = n + 1$, with $\phi_j(\vec x) = \exp(-\gamma\|\vec x - \vec x_j\|^2)$ and $\phi_{n+1}(\vec x) = 1$. The width parameter $\gamma$ and the regularization coefficient in ridge regression were selected via 5x2-fold cross-validation (Dietterich, 1998). For each dataset the root mean square error was also measured using 5x2-fold cross-validation. For CAIC RVM and EvRVM we additionally computed the average rate of non-zero weights. The experiments show no statistically significant difference between the performance of the methods, i.e. we keep the quality of ridge regression while removing most of the regressors and making the decision rule very simple. Table 1 reports the sparsity of the different algorithms.

It is also interesting to see how many relevant basis functions have zero regularization coefficients in CAIC RVM. The number of zero-valued $\alpha_j$ is shown in the bar histogram in Figure 1. We see that, as in the original RVM, most $\alpha_j$ in CAIC RVM tend to infinity.

Figure 1. Number of relevant objects for CAIC RVM for different datasets. The black part of each bar corresponds to the number of relevant objects with zero regularization coefficients.

So the main result of the paper is that AIC can be used for ARD just as Bayesian methods can. Unlike RVM, in the case of CAIC some regularization coefficients are set exactly to zero. The approach based on CAIC RVM also seems very promising for feature selection in linear regression problems, the problem for which the original AIC was traditionally used. Instead of performing a computationally expensive exhaustive search to solve the discrete problem of feature selection, we may transform the problem into a continuous one and use CAIC RVM for its solution.

The experimental results make it possible to conclude that Bayesian learning and the information-based approach have much in common and are probably two sides of the same phenomenon.
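For concreteness, the kernel basis used in these experiments (one Gaussian basis function per training point plus a bias, so $m = n + 1$) can be sketched as follows; this is our illustration, and `rbf_design_matrix` is not a name from the paper:

```python
import numpy as np

def rbf_design_matrix(X, gamma):
    """Basis as in the experiments: phi_j(x) = exp(-gamma * ||x - x_j||^2)
    for each training point x_j, plus a constant basis phi_{n+1}(x) = 1."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.hstack([np.exp(-gamma * sq_dists), np.ones((X.shape[0], 1))])

X = np.random.default_rng(1).standard_normal((10, 4))
Phi = rbf_design_matrix(X, gamma=0.5)
print(Phi.shape)  # (10, 11): n = 10 training points, m = n + 1 = 11 basis functions
```

Because each basis function is tied to a training point, a sparse weight vector directly yields a small set of "relevant" objects, which is what Table 1 and Figure 1 count.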

One direction of future work is to apply CAIC ARD to classification problems. A possible way to do this is to map the classification problem into a regression one (Tipping, 2001).

Acknowledgements. This work was supported by the Russian Foundation for Basic Research (grants Nos. 08-01-00405, 08-01-90016, 08-01-90427).

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1924.

MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation, 4, 720–736.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Spiegelhalter, D., Best, N., Carlin, B., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583–640.

Tipping, M. E. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen and K. R. Mueller (Eds.), Advances in Neural Information Processing Systems 12, 652–658. MIT Press.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
