Some relations between Gaussian process binary classification, Multivariate skew normal, and Maximising correlations (working note)

Kian Ming A. Chai
School of Informatics, University of Edinburgh
[email protected]

Third version: February 15, 2009. Second version: October 24, 2008. First version: January 31, 2008.

Abstract

We show that the posterior of the latent function in GP binary classification is related to the multivariate skew normal distribution, and suggest how classification may be achieved without explicitly using a latent function. A connection is made between ML-II estimation for GP binary classification and correlation maximisation. Two heuristics to approximate the ML-II are proposed.

1 GP Binary Classification and Multivariate skew normal

For an input x, let y ∈ {−1, 1} be its class label. Binary classification can be performed via a Gaussian Process with the following model [Rasmussen and Williams, 2006, chapter 3]:

$\langle f(x) \rangle = 0$   (prior mean of latent process)   (1)
$\langle f(x), f(x') \rangle = k(x, x')$   (prior covariance of latent process)   (2)
$p(y_i \mid f_i) = \Phi(y_i f_i)$   (probit likelihood)   (3)

where $f_i = f(x_i)$ and $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt$ is the cdf of the standard normal. Given an observed data set $D = \{(y_i, x_i)\}_{i=1}^{n}$, the posterior over the latent function values $f = (f(x_1), \dots, f(x_n))$ is

$p(f \mid D) \propto \phi_n(f; 0, K) \prod_{i=1}^{n} \Phi(y_i f_i)$   (posterior of latent values)   (4)

where K is an n×n matrix with $k(x_i, x_j)$ being its (i, j)th entry, and $\phi_n(\cdot\,; \mu, \Sigma)$ is the n-variate normal pdf with mean µ and covariance Σ. The product over the likelihoods can be expressed as

$\prod_{i=1}^{n} \Phi(y_i f_i) = \Phi_n(D_y f; 0, I_n)$   (5)

where $D_y$ is an n×n diagonal matrix with $y_i$ being its (i, i)th entry, and $\Phi_n(\cdot\,; \mu, \Sigma)$ is the n-variate normal cdf with mean µ and covariance Σ, i.e.

$\Phi_n(z; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \int_{-\infty}^{z_1} \!\cdots\! \int_{-\infty}^{z_n} e^{-(t-\mu)^T \Sigma^{-1} (t-\mu)/2}\, dt_1 \cdots dt_n$   (6)

Note that $\Phi_1(z; 0, 1) \equiv \Phi(z)$. Placing (5) into (4) immediately identifies the posterior as a multivariate skew normal distribution [MSN, Gupta et al., 2004]:

$f \mid D \sim \mathcal{SN}_n(\cdot \mid 0, K, D_y)$   (7)

$\mathcal{SN}_n(z; \mu, \Sigma, D) \overset{\text{def}}{=} \frac{1}{\Phi_n(0; 0, I + D \Sigma D^T)}\, \phi_n(z; \mu, \Sigma)\, \Phi_n(D(z - \mu); 0, I_n)$   (8)

The mean and variance of the posterior are thus those of the corresponding MSN distribution. We note here that the MSN distribution proposed by Gupta et al. [2004] is but one of the many [Arellano-Valle and Azzalini, 2006].
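To make the connection concrete, here is a minimal numerical sketch (an illustration, not part of the note; the kernel, inputs and labels below are made up) that evaluates the MSN normaliser $\Phi_n(0; 0, I + D_y K D_y^T)$ — which is also the marginal likelihood p(D) of the probit GP model — using SciPy's multivariate normal CDF.

```python
import numpy as np
from scipy.stats import multivariate_normal

def se_kernel(X, lengthscale=1.0, amplitude=1.0):
    """Squared-exponential covariance matrix for 1-d inputs X (illustrative helper)."""
    d = X[:, None] - X[None, :]
    return amplitude * np.exp(-0.5 * (d / lengthscale) ** 2)

# Toy data (illustrative only).
X = np.array([-1.0, -0.3, 0.4, 1.2])
y = np.array([-1, -1, 1, 1])

K = se_kernel(X)
Dy = np.diag(y.astype(float))

# Normaliser of the skew-normal posterior (7)-(8), equal to the model evidence:
# p(D) = Phi_n(0; 0, I + Dy K Dy^T).
Sigma = np.eye(len(y)) + Dy @ K @ Dy.T
evidence = multivariate_normal(mean=np.zeros(len(y)), cov=Sigma).cdf(np.zeros(len(y)))
print("p(D) =", evidence)
```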

2 Joint distribution via MSN construction

Similar to the construction of MSN given by Gupta et al. [2004, Proposition 3.4], we shall give a construction of the joint distribution between the GP latent process and the probit data likelihood. We define an alternative class labelling scheme: define $r_- \overset{\text{def}}{=} (-\infty, 0)$ and $r_+ \overset{\text{def}}{=} (0, \infty)$, and for an input x, let $r \in \{r_-, r_+\}$ be its class label; conceptually, it means that instead of knowing the class labels via discrete (binary) values, we know the class labels via the knowledge of the sign of the values. The data set corresponding to D is then $D_r \overset{\text{def}}{=} \{(r_i, x_i) \mid r_i = r_{\operatorname{sgn}(y_i)}\}_{i=1}^{n}$. We also introduce a vector of n real auxiliary random variables $g \overset{\text{def}}{=} (g_1, \dots, g_n)^T$, and define an augmented data set $D_g \overset{\text{def}}{=} \{(g_i \in r_i, x_i)\}$.

Proposition 2.1. Let

$\begin{pmatrix} g \\ f \end{pmatrix} \sim \mathcal{N}_{2n}\!\left( \cdot \;\middle|\; 0,\; \begin{pmatrix} I + K & K \\ K & K \end{pmatrix} \right)$   (9)

Then,

$f \sim \mathcal{N}_n(\cdot \mid 0, K)$   (prior of latent values)   (10)
$g_i \in r_i \mid f_i \sim \Phi(y_i f_i)$   (likelihood)   (11)
$f \mid D_g \sim \mathcal{SN}_n(\cdot \mid 0, K, D_y)$   (posterior of latent values)   (12)

Proof. One may prove this by considering the original likelihoods given in (3), but delaying the integration within the probit to as late as possible in all operations. The following is an alternative proof following that by Gupta et al. [2004, Proposition 3.4]. That (10) holds is clear by considering only the f component of (9). For (11), consider the conditional distribution $g \mid f \sim \mathcal{N}_n(\cdot \mid f, I)$. The components are independent in this conditional, so we may also write $g_i \mid f_i \sim \mathcal{N}(\cdot \mid f_i, 1)$ for each $i = 1 \dots n$. When $y_i = -1$, then $r_i = r_-$, and we have $p(g_i \in r_i \mid f_i) = \Phi_1(0; f_i, 1) \equiv \Phi(y_i f_i)$; similarly for $y_i = 1$. For (12), using Bayes' rule gives

$p(f \mid D_g) = \dfrac{p(f)\; p(\{g_i \in r_i\}_{i=1}^{n} \mid f)}{p(\{g_i \in r_i\}_{i=1}^{n})}$.   (13)

Comparing this with (7) and (8), we are required to show

$p(\{g_i \in r_i\}_{i=1}^{n} \mid f) \overset{?}{=} \Phi_n(D_y f; 0, I_n)$   (14)
$p(\{g_i \in r_i\}_{i=1}^{n}) \overset{?}{=} \Phi_n(0; 0, I + D_y K D_y^T)$.   (15)

The likelihood term (14) follows immediately from the definition of $\Phi_n$ and (11). For the normalisation term (15), note that $D_y$ is orthogonal ($D_y D_y^T = I$), so that we have

$I + D_y K D_y^T = D_y (I + K) D_y^T$,   (16)

giving

$\Phi_n(0; 0, I + D_y K D_y^T)$
$\quad= \dfrac{1}{\sqrt{(2\pi)^n \,|I + D_y K D_y^T|}} \displaystyle\int_{-\infty}^{0} \!\cdots\! \int_{-\infty}^{0} \exp\!\left(-\tfrac{1}{2}\, t^T \bigl(I + D_y K D_y^T\bigr)^{-1} t\right) dt_1 \cdots dt_n$
$\quad= \dfrac{1}{\sqrt{(2\pi)^n \,|D_y (I + K) D_y^T|}} \displaystyle\int_{-\infty}^{0} \!\cdots\! \int_{-\infty}^{0} \exp\!\left(-\tfrac{1}{2}\, (D_y^{-1} t)^T (I + K)^{-1} (D_y^{-1} t)\right) dt_1 \cdots dt_n$
$\quad= \dfrac{1}{\sqrt{(2\pi)^n \,|I + K|}} \displaystyle\int_{u_1 \in r_1} \!\cdots\! \int_{u_n \in r_n} \exp\!\left(-\tfrac{1}{2}\, u^T (I + K)^{-1} u\right) du_1 \cdots du_n$
$\quad= p(\{g_i \in r_i\}_{i=1}^{n})$,   (17)

where the second equality uses (16), and the third uses $|D_y (I + K) D_y^T| = |I + K|$ together with a change of the dummy variables of integration $t_i = -y_i u_i$, $i = 1 \dots n$ (whose Jacobian has unit absolute determinant), which maps the region $\{t_i < 0\}$ onto $\{u_i \in r_i\}$.
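Proposition 2.1 can be sanity-checked with a small Monte Carlo sketch (not part of the note; the kernel, inputs and labels are made up): sample (g, f) jointly from (9), keep the samples whose g falls in the observed half-lines $r_i$, and compare the acceptance rate with the evidence (15) and the retained f with the skew-normal posterior.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def se_kernel(X, ell=1.0):
    """Illustrative squared-exponential kernel for 1-d inputs."""
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

# Toy data (illustrative only).
X = np.array([-1.0, 0.0, 0.8])
y = np.array([1, -1, 1])
n = len(y)
K = se_kernel(X)

# Joint covariance of (g, f) from equation (9).
joint_cov = np.block([[np.eye(n) + K, K],
                      [K,             K]])
S = 200_000
gf = rng.multivariate_normal(np.zeros(2 * n), joint_cov, size=S)
g, f = gf[:, :n], gf[:, n:]

# Keep samples whose auxiliary g lies in the half-line r_i, i.e. sign(g_i) = y_i.
accept = np.all(np.sign(g) == y, axis=1)

# The acceptance rate should approximate the evidence Phi_n(0; 0, I + Dy K Dy^T).
Dy = np.diag(y.astype(float))
Sigma = np.eye(n) + Dy @ K @ Dy.T
evidence = multivariate_normal(np.zeros(n), Sigma).cdf(np.zeros(n))
print("MC acceptance rate:", accept.mean(), " evidence:", evidence)

# The retained f samples are (approximately) draws from the posterior (12);
# their sample mean estimates the mean of the multivariate skew normal posterior.
print("MC posterior mean of f:", f[accept].mean(axis=0))
```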

Remarks

1. In essence, the marginal probability $p(\{g_i \in r_i\}_{i=1}^{n})$ chooses the relevant ranges for integration of the latent multivariate Gaussian distribution.

2. From (9), clearly $g_i \mid f_i \sim \mathcal{N}(\cdot \mid f_i, 1)$, or $g_i = f_i + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(\cdot \mid 0, 1)$. That is, the noise $\mathcal{N}(\cdot \mid 0, 1)$ that was previously associated with the likelihood $\Phi(y_i f_i)$ is now modelled in the latent function g. That the noise can be modelled either in the likelihood or in the latent process has been previously noted by Opper and Winther [2000]. This is also evident from the definition of the probit model, which involves a normally distributed threshold.

3 Binary classification without explicit latent values

Proposition 2.1 suggests that one may be able to achieve GP binary classification using the latent values only implicitly, by directly using auxiliary variables. Suppose one wishes to predict the class of a novel input $x_*$, and we have that

$\begin{pmatrix} g \\ g_* \end{pmatrix} \sim \mathcal{N}_{n+1}\!\left( \cdot \;\middle|\; 0,\; \begin{pmatrix} I + K & k_* \\ k_*^T & 1 + \kappa \end{pmatrix} \right)$   (18)

where $k_* = (k(x_1, x_*), \dots, k(x_n, x_*))^T$ and $\kappa = k(x_*, x_*)$, where k is the prior covariance function of f. Then the conditional probability that the class label $y_* = -1$ is

$p(g_* \in r_- \mid D_g, x_*) = \dfrac{p(g_* \in r_-, D_g \mid x_*)}{p(D_g)} = \dfrac{\Phi_{n+1}(0; 0, I + \Sigma_*)}{\Phi_n(0; 0, I + D_y K D_y^T)}$,   (19)

where

$\Sigma_* = \begin{pmatrix} D_y & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} K & k_* \\ k_*^T & \kappa \end{pmatrix} \begin{pmatrix} D_y & 0 \\ 0 & -1 \end{pmatrix}^{\!T}$.   (20)
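The predictive probability (19) is a ratio of two Gaussian orthant probabilities and needs no explicit latent-function values. A minimal sketch (the kernel, helper functions, inputs and labels below are made up for illustration) using SciPy's multivariate normal CDF:

```python
import numpy as np
from scipy.stats import multivariate_normal

def se_kernel(A, B, ell=1.0):
    """Illustrative squared-exponential kernel between 1-d input sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def orthant_prob(cov):
    """Lower orthant probability Phi_m(0; 0, cov)."""
    m = cov.shape[0]
    return multivariate_normal(np.zeros(m), cov).cdf(np.zeros(m))

# Toy training data and a test input (illustrative only).
X = np.array([-1.0, -0.2, 0.5, 1.3])
y = np.array([-1, -1, 1, 1])
x_star = np.array([0.1])

n = len(y)
K = se_kernel(X, X)
k_star = se_kernel(X, x_star).ravel()
kappa = 1.0  # k(x_*, x_*) for the unit-amplitude kernel

# Denominator of (19): evidence Phi_n(0; 0, I + Dy K Dy^T).
Dy = np.diag(y.astype(float))
denom = orthant_prob(np.eye(n) + Dy @ K @ Dy.T)

# Numerator of (19): Phi_{n+1}(0; 0, I + Sigma_*) with Sigma_* from (20),
# i.e. the joint prior covariance of (g, g_*) sign-flipped by diag(y, -1).
joint = np.block([[K, k_star[:, None]],
                  [k_star[None, :], np.array([[kappa]])]])
D = np.diag(np.append(y, -1).astype(float))
numer = orthant_prob(np.eye(n + 1) + D @ joint @ D.T)

print("p(y_* = -1 | D, x_*) =", numer / denom)
```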

Remarks on prior work. Models of this form are also known as multivariate probit models or correlated probit regression models, first motivated and proposed by Ashford and Sowden [1970]. The focus on the latent variables f — à la GP binary classification — is proposed by Muthén [1979]; in a similar vein, Albert and Chib [1993] focus on the weights β of the linear composition $f(x) = \beta^T x$. In particular, Albert and Chib [1993] motivate their Gibbs sampling (of β) approach by citing the inappropriateness of normal (i.e. Laplace) approximations to the posterior, noted previously by Griffiths et al. [1987] and Zellner and Rossi [1984].

4 ML-II estimation and Correlation maximisation

For a kernel k(·, ·) parameterised by θ, the ML-II or empirical Bayes approach requires the search for the optimal parameters $\hat{\theta}$ given by

$\hat{\theta} = \arg\max_{\theta}\, p(D_g \mid \theta) = \arg\max_{\theta}\, \Phi_n(0; 0, I + D_y K_\theta D_y^T) \overset{\text{def}}{=} \arg\max_{\theta}\, \Phi_n(0; 0, \Sigma)$,   (21)

where $\Sigma \overset{\text{def}}{=} I + D_y K_\theta D_y^T$, and we have subscripted K with the hyperparameters θ to emphasise its dependence on them. The function $\Phi_n$ involves the integration of a zero-mean multivariate Gaussian over the lower orthant $t_i \le 0$, $i = 1 \dots n$. The higher the mass the multivariate Gaussian has within that orthant, the higher the value of the integral. The value of this integral is also known as the (lower) orthant probability, and depends only on the correlation of the zero-mean Gaussian, i.e. the variances do not matter. This suggests that the covariance Σ should be one that maximises the correlation of the random variables distributed under the Gaussian.

We give an example for n = 2 where higher correlation gives higher evidence. For the same volume |Σ| = 4, compare the case $D_y K_\theta D_y = I$ (zero correlation) with the case $D_y K_\theta D_y = \frac{3}{2}\mathbf{1}_{2\times 2}$ (maximum, though degenerate, correlation). In the former case, the evidence is $\frac{1}{4}$ by symmetry. In the latter case, it is $\frac{1}{4}\left(1 + \frac{2}{\pi}\arctan\frac{3}{4}\right) \approx \frac{1}{4}\times 1.4$, as illustrated in Figure 1 and by considering the additional area bounded between the negative $t_1$ axis and the half-line given by the direction vector $\binom{-4}{3}$ [see also Ruben, 1961]. Below we shall make explicit what is meant by maximising the correlation.

Theorem 4.1 (Slepian's inequality [Slepian, 1962]; see e.g. Gupta [1963]¹). Let $x \sim \mathcal{N}_n(0, \Sigma)$ and $y \sim \mathcal{N}_n(0, C)$, and denote the elements of Σ by $\sigma_{ij}$ and of C by $c_{ij}$. Let $\sigma_{ii} = c_{ii} = 1$, $i = 1 \dots n$. If $\sigma_{ij} \ge c_{ij}$, $i, j = 1 \dots n$, then

$P(x_1 < a_1, x_2 < a_2, \dots, x_n < a_n) \ge P(y_1 < a_1, y_2 < a_2, \dots, y_n < a_n)$   (22)

¹ Gupta [1963] gives the inequality in terms of the upper orthant, while some other literature gives it in terms of the lower. We note that the two are equivalent using the transformation $x' = -I_n x$.


[Figure 1: two panels with axes $t_1$ and $t_2$; graphics not reproduced.]

Figure 1: LEFT: The black circle has unit radius, and is the contour $p(t) = \frac{1}{2\pi} e^{-1/2}$ of the bivariate Gaussian with covariance I; the ellipse is the contour $p(t) = \frac{1}{4\pi} e^{-2}$ of the correlated bivariate Gaussian with covariance $\frac{1}{2}\begin{pmatrix}5 & 3\\ 3 & 5\end{pmatrix}$. The arrows show the transformations that happen when transforming the unit Gaussian into the correlated Gaussian, using the sampling rule $t_{\text{unit}} \sim \mathcal{N}(\cdot \mid 0, I)$ and $t_{\text{correlated}} = \frac{1}{\sqrt{10}}\begin{pmatrix}5 & 0\\ 3 & 4\end{pmatrix} t_{\text{unit}}$. The ticks on the axes are 8/5 units from the origin. RIGHT: The thick lines bound the area under the isotropic Gaussian to be considered for the lower orthant probability of the correlated bivariate Gaussian with covariance $\frac{1}{2}\begin{pmatrix}5 & 3\\ 3 & 5\end{pmatrix}$. The values given in the boxes are the probabilities within the associated areas.
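A quick numerical check of the n = 2 example above (a sketch, not from the note): the lower orthant probability of the correlated case can be computed both from the closed form $\frac{1}{4}\bigl(1 + \frac{2}{\pi}\arctan\frac{3}{4}\bigr)$ and from SciPy's bivariate normal CDF.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Correlated case from the example: Sigma = I + (3/2) * ones((2, 2)).
Sigma = np.eye(2) + 1.5 * np.ones((2, 2))
orthant = multivariate_normal(np.zeros(2), Sigma).cdf(np.zeros(2))

closed_form = 0.25 * (1.0 + (2.0 / np.pi) * np.arctan(3.0 / 4.0))
print(orthant, closed_form)   # both approximately 0.3524; the uncorrelated case gives 0.25
```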

or equivalently

$\Phi_n(a; 0, \Sigma) \ge \Phi_n(a; 0, C)$.   (23)

The following corollary allows us to work also with covariance functions.

Corollary 4.2 (See also Tong [1990, §5.1.4]). Let $x \sim \mathcal{N}_n(0, \Sigma)$ and $y \sim \mathcal{N}_n(0, C)$, and denote the elements of the correlation of x by $\rho_{ij}$, and of y by $r_{ij}$, i.e. $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ and $r_{ij} = c_{ij}/\sqrt{c_{ii}c_{jj}}$. If $\rho_{ij} \ge r_{ij}$, $i, j = 1 \dots n$, then

$\Phi_n(0; 0, \Sigma) \ge \Phi_n(0; 0, C)$.   (24)
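Before proceeding to the proof, here is a quick numerical illustration of Corollary 4.2 (a sketch, not from the note; the two correlation matrices below are made up). One dominates the other elementwise, so the corollary predicts a larger lower orthant probability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lower_orthant(corr):
    """Lower orthant probability Phi_n(0; 0, corr)."""
    n = corr.shape[0]
    return multivariate_normal(np.zeros(n), corr).cdf(np.zeros(n))

# R_hi dominates R_lo elementwise (both are valid correlation matrices).
R_lo = np.array([[1.0, 0.1, 0.0],
                 [0.1, 1.0, 0.2],
                 [0.0, 0.2, 1.0]])
R_hi = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.6],
                 [0.3, 0.6, 1.0]])

print(lower_orthant(R_lo), "<=", lower_orthant(R_hi))
```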

Proof. Use the variance–correlation decomposition of Σ into $\Delta^{1/2} P \Delta^{1/2}$, where ∆ is a diagonal matrix of variances and P is the correlation matrix. $\Phi_n(0; 0, \Sigma) = \Phi_n(0; 0, P)$ can be shown via suitable scaling of the dummy variables of integration (by the reciprocals of the respective standard deviations). The same can be done for C, so that $\Phi_n(0; 0, \Sigma) = \Phi_n(0; 0, P) \ge \Phi_n(0; 0, R) = \Phi_n(0; 0, C)$, where R is the correlation matrix for y.

Hence, ML-II maximisation is the same as maximising the correlations between the variables. Note, however, that both Slepian's inequality and the above corollary give only a pre-ordering of correlations $\{\rho_{ij}\}$ based on the scalar field $\Phi_n(a; 0, \{\rho_{ij}\})$. Hence their uses are limited, since there is no single direction along which to optimise; that is, it is still necessary to try all valid increases in correlations, compute the corresponding $\Phi_n$ values, and choose the highest. To make this clearer, consider the case of 3 data points for GP classification, giving a centered trivariate Gaussian distribution parameterised by $\{\rho_{12}, \rho_{13}, \rho_{23}\}$. Figure 2 illustrates Slepian's inequality. Each point in the vector space of correlations gives a Gaussian more correlated than the points it dominates in the Cartesian sense. However, points such as ∗ and † in Figure 2 are not comparable directly.

5 Maximising projection of correlations and Empirical alignment

For maximising the correlation, one heuristic is to maximise the sum of correlations, i.e. given an n×n correlation matrix $P = (\rho_{ij})$, we maximise

$\mathbf{1}_n^T P \mathbf{1}_n = \sum_{i=1}^{n}\sum_{j=1}^{n} \rho_{ij} = n + 2\sum_{i<j} \rho_{ij} = n + 2\,\mathbf{1}^T \bigl(\rho_{12}\;\; \rho_{13}\;\; \dots\;\; \rho_{(n-1)n}\bigr)^T$.   (25)

[Figure 2 appears here; its axes are $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$.]

Figure 2: The axes are the correlations for a trivariate Gaussian distribution. Each point within the cuboid given by the thick lines gives a Gaussian that is "less correlated" than the point ∗; similarly for each point within the cuboid given by the thin lines compared with the point †. However, points ∗ and † are not comparable. Note that the region that gives valid correlations is smaller than the $\mathbb{R}^3$ portrayed here. In fact, the valid region in this case is a tetrahedron inscribed within $[-\pi/2, \pi/2]^3$ in the axis-transformed space $\operatorname{asin}\rho_{ij}$ (see section 5.1).

The last equality gives the geometric interpretation, saying that the heuristic maximises the projection of the correlations onto the perfect positive correlation case. Given an n×n covariance matrix Σ, and with ∆ defined to be the corresponding diagonal matrix of variances, the heuristic is to maximise $\mathbf{1}_n^T \Delta^{-1/2} \Sigma \Delta^{-1/2} \mathbf{1}_n$. Now for binary GP classification, we have $\Sigma = I + D_y K_\theta D_y^T$. We define ∆ accordingly, and let $D_{K_\theta}$ be the diagonal matrix of variances from $K_\theta$. Using (16) repeatedly gives

$\mathbf{1}_n^T \Delta^{-1/2} \Sigma \Delta^{-1/2} \mathbf{1}_n = \mathbf{1}_n^T (I + D_y D_{K_\theta} D_y^T)^{-1/2} (I + D_y K_\theta D_y^T) (I + D_y D_{K_\theta} D_y^T)^{-1/2} \mathbf{1}_n$
$\quad= \mathbf{1}_n^T D_y (I + D_{K_\theta})^{-1/2} D_y^T\, D_y (I + K_\theta) D_y^T\, D_y (I + D_{K_\theta})^{-1/2} D_y^T \mathbf{1}_n$
$\quad= y^T (I + D_{K_\theta})^{-1/2} (I + K_\theta) (I + D_{K_\theta})^{-1/2} y$   (26)
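A minimal sketch of evaluating the heuristic objective (26) for a parameterised kernel (not from the note; the toy data, helper function and lengthscale grid are made up for illustration). One would pick the hyperparameter value with the largest objective as the approximate ML-II estimate.

```python
import numpy as np

def se_kernel(X, ell):
    """Illustrative unit-amplitude squared-exponential kernel for 1-d inputs."""
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def correlation_sum_objective(K, y):
    """Objective (26): y^T (I + D_K)^{-1/2} (I + K) (I + D_K)^{-1/2} y."""
    s = 1.0 / np.sqrt(1.0 + np.diag(K))      # diagonal of (I + D_K)^{-1/2}
    return (y * s) @ (np.eye(len(y)) + K) @ (y * s)

# Toy data (illustrative only).
X = np.array([-1.2, -0.4, 0.3, 0.9, 1.5])
y = np.array([-1, -1, 1, 1, 1], dtype=float)

for ell in [0.1, 0.3, 1.0, 3.0]:
    K = se_kernel(X, ell)
    print(f"lengthscale {ell:4.1f}: objective (26) = {correlation_sum_objective(K, y):.3f}")
```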

It will be instructive to compare this with the empirical alignment criterion proposed by Cristianini et al. [2002]:

Definition 5.1 [Cristianini et al., 2002]. The empirical alignment between an n×n positive semi-definite kernel $K_\theta$ and the vector of binary labels $y \in \{-1, +1\}^n$ is given by

$A = \dfrac{\langle K_\theta, yy^T\rangle_F}{\sqrt{\langle K_\theta, K_\theta\rangle_F\,\langle yy^T, yy^T\rangle_F}} = \dfrac{\langle K_\theta, yy^T\rangle_F}{n\sqrt{\langle K_\theta, K_\theta\rangle_F}}$,

where $\langle M, N\rangle_F$ is the Frobenius product between matrices M and N.

Consider the case when $K_\theta$ is a correlation matrix, e.g. when given by a correlation function such as the squared exponential covariance function with unit amplitude. In this case, the (inverse) scaling by variances in (26) is not required for the purpose of optimisation. Further suppose that the GP binary classification has likelihood given by $p(y_i \mid f_i) = \Phi(y_i f_i/\eta^2)$ instead of (3), and take the limit $\eta^2 \to 0$. Due to this limit, the identity matrices I in (26), which are now scaled by the noise variance, disappear. The heuristic is now to maximise $y^T K_\theta y$. Notice this last expression is the numerator of A, since $\langle M, N\rangle_F = \operatorname{tr}(M N^T)$. The difference between the two criteria is then due to the denominator $n\sqrt{\langle K_\theta, K_\theta\rangle_F}$ in A.²

Let us attempt to understand if this difference is significant. Let the labels be given by $y = \mathbf{1}_n$, so that our objective function becomes $n + 2\sum_{i<j}\rho_{ij}$ […]
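For concreteness, a small sketch (not from the note; toy data and an illustrative kernel) comparing the two quantities discussed above — the unnormalised objective $y^T K_\theta y$ and the empirical alignment A of Definition 5.1 — across a grid of lengthscales.

```python
import numpy as np

def se_kernel(X, ell):
    """Illustrative unit-amplitude SE kernel (a correlation matrix)."""
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

# Toy data (illustrative only).
X = np.array([-1.5, -1.0, -0.2, 0.4, 1.1, 1.6])
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
n = len(y)

for ell in [0.1, 0.5, 1.0, 2.0, 5.0]:
    K = se_kernel(X, ell)
    ytKy = y @ K @ y                                   # <K, y y^T>_F = tr(K (y y^T)^T)
    alignment = ytKy / (n * np.linalg.norm(K, "fro"))  # Definition 5.1
    print(f"lengthscale {ell:4.1f}: y^T K y = {ytKy:6.2f}, alignment A = {alignment:.3f}")
```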
5.1 Maximising projection in the arc-sine axis-transformed space

Instead of maximising the sum of the correlations, one may consider maximising the sum of monotonic transforms of the correlations. Due to the following equation derived by Childs [1967], we propose the use of the asin transform.

² Using the regression setting (instead of the binary classification setting), we have that the ML-II solution is $K = yy^T$, which indeed gives the maximum for $y^T K y$ with K unconstrained. However, if K is constrained by a parameterised covariance function, then the answers obtained by ML-II and by the criteria may be different. Also, an obvious measure of alignment for two Gaussian distributions is the Kullback–Leibler divergence between them.

Proposition 5.2 [Childs, 1967]. Let $x \sim \mathcal{N}_n(0, P)$, where P is a correlation matrix with $\rho_{ij}$ its (i, j)th element. Then³

$\Phi_n(0; 0, P) \overset{\text{def}}{=} P(x_1 < 0, \dots, x_n < 0) = \dfrac{1}{2^n}\left[1 + \dfrac{2}{\pi}\displaystyle\sum_{i<j}\operatorname{asin}\rho_{ij} + \displaystyle\sum_{j=2}^{\lfloor n/2\rfloor}\left(\dfrac{2}{\pi}\right)^{2j}\displaystyle\sum_{k_1<\dots<k_{2j}} w_{k_1,k_2,\dots,k_{2j}}\right]$   (28)

where

$w_{k_1,k_2,\dots,k_m} \overset{\text{def}}{=} \displaystyle\int_{-\infty}^{0}\!\cdots\!\int_{-\infty}^{0} \dfrac{\exp\!\left(-\tfrac{1}{2}\,x^T P x\right)}{\prod_{i=1}^{m} x_{k_i}}\, dx$.   (29)

³ Childs [1967] gives this in terms of the upper orthant, while we give it in terms of the lower orthant to be consistent with the rest of this note.
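For n ≤ 3 the double sum in (28) is empty ($\lfloor n/2\rfloor < 2$), so the orthant probability is exactly $2^{-n}\bigl[1 + \tfrac{2}{\pi}\sum_{i<j}\operatorname{asin}\rho_{ij}\bigr]$. A quick numerical sketch (not from the note; the correlation values are illustrative) checking this against SciPy's multivariate normal CDF:

```python
import numpy as np
from scipy.stats import multivariate_normal

def asin_formula(P):
    """2^{-n} [1 + (2/pi) sum_{i<j} asin(rho_ij)]; exact for n <= 3."""
    n = P.shape[0]
    iu = np.triu_indices(n, k=1)
    return (1.0 + (2.0 / np.pi) * np.arcsin(P[iu]).sum()) / 2 ** n

# Illustrative 3x3 correlation matrix.
P = np.array([[1.0, 0.4, -0.2],
              [0.4, 1.0, 0.3],
              [-0.2, 0.3, 1.0]])

numerical = multivariate_normal(np.zeros(3), P).cdf(np.zeros(3))
print(asin_formula(P), numerical)   # the two values agree (up to CDF quadrature error)
```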

Thus we may approximate ML-II by maximising the sum of the asins of the correlations. This approximation drops the last double-summation term in (28), which gives the 4th and higher orders of interactions between the $x_i$s. For GP binary classification, the correlation matrix P in (28) is given by $(I + D_y D_{K_\theta} D_y^T)^{-1/2}(I + D_y K_\theta D_y^T)(I + D_y D_{K_\theta} D_y^T)^{-1/2}$ (see (26)). Using the fact that asin is an odd function, and using a derivation similar to that for (26), the approximation is then to maximise

$y^T \operatorname{asin}\!\left[(I + D_{K_\theta})^{-1/2}(I + K_\theta)(I + D_{K_\theta})^{-1/2}\right] y$,   (30)

where the matrix function asin(·) gives the element-wise asin of its argument.

6 Summary

In this note, we have made explicit the following intuitions on GP binary classification. We have shown that the posterior of the latent function in GP binary classification can be interpreted as a multivariate skew normal distribution, and that GP binary classification can be done without explicitly introducing a latent function. We have also made a connection between ML-II estimation and maximising correlations, and have suggested two heuristics for approximating ML-II.

Acknowledgements

We thank C.K.I. Williams for helpful discussions and comments.

References

James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, June 1993.

Reinaldo B. Arellano-Valle and Adelchi Azzalini. On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574, September 2006.

J. R. Ashford and R. R. Sowden. Multi-variate probit analysis. Biometrics, 26(3):535–546, September 1970.

Donald R. Childs. Reduction of the multivariate normal integral to characteristic form. Biometrika, 54(1/2):293–300, 1967.

Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

William E. Griffiths, R. Carter Hill, and Peter J. Pope. Small sample properties of probit model estimators. Journal of the American Statistical Association, 82(399):929–937, 1987.

Arjun K. Gupta, Graciela González-Farías, and J. Armando Domínguez-Molina. A multivariate skew normal distribution. Journal of Multivariate Analysis, 89:181–190, 2004.

Shanti S. Gupta. Probability integrals of multivariate normal and multivariate t. The Annals of Mathematical Statistics, 34(3):792–828, September 1963.

Bengt Muthén. A structural probit model with latent variables. Journal of the American Statistical Association, 74(368):807–811, December 1979.

Manfred Opper and Ole Winther. Gaussian processes and SVM: Mean field and leave-one-out. In Alexander J. Smola, Peter J. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large-Margin Classifiers, pages 311–326. MIT Press, October 2000.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.

Harold Ruben. Probability content of regions under spherical normal distributions, III: The bivariate normal integral. The Annals of Mathematical Statistics, 32(1):171–186, March 1961.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41:463–501, 1962.

Yung Liang Tong. The Multivariate Normal Distribution. Springer-Verlag, New York, 1990.

Arnold Zellner and Peter E. Rossi. Bayesian analysis of dichotomous quantal response models. Journal of Econometrics, 25(3):365–393, July 1984.
