Some relations between Gaussian process binary classification, Multivariate skew normal, and Maximising correlations (working note)

Kian Ming A. Chai
School of Informatics, University of Edinburgh
[email protected]

Third version: February 15, 2009. Second version: October 24, 2008. First version: January 31, 2008.
Abstract We show that the posterior of the latent function in GP binary classification is related to the multivariate skew normal distribution, and suggest how classification may be achieved without explicitly using a latent function. A connection is made between ML-II estimation for GP binary classification and correlation maximisation. Two heuristics to approximate the ML-II are proposed.
1 GP Binary Classification and Multivariate skew normal

For an input x, let y ∈ {−1, 1} be its class label. Binary classification can be performed via a Gaussian process with the following model [Rasmussen and Williams, 2006, chapter 3]:

⟨f(x)⟩ = 0                          prior mean of latent process          (1)
⟨f(x), f(x′)⟩ = k(x, x′)            prior covariance of latent process    (2)
p(y_i | f_i) = Φ(y_i f_i)           probit likelihood                     (3)
where f_i = f(x_i) and Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−t²/2} dt is the cdf of the standard normal. Given an observed data set D = {(y_i, x_i)}_{i=1}^{n}, the posterior over the latent function values f = (f(x_1), …, f(x_n)) is

p(f | D) ∝ φ_n(f; 0, K) ∏_{i=1}^{n} Φ(y_i f_i)        posterior of latent values        (4)

where K is an n×n matrix with k(x_i, x_j) as its (i, j)th entry, and φ_n(·; µ, Σ) is the n-variate normal pdf with mean µ and covariance Σ. The product over the likelihoods can be expressed as

∏_{i=1}^{n} Φ(y_i f_i) = Φ_n(D_y f; 0, I_n)        (5)
where D_y is an n×n diagonal matrix with y_i as its (i, i)th entry, and Φ_n(·; µ, Σ) is the n-variate normal cdf with mean µ and covariance Σ, i.e.

Φ_n(z; µ, Σ) = 1/√((2π)^n |Σ|) ∫_{−∞}^{z_1} dt_1 ⋯ ∫_{−∞}^{z_n} dt_n exp(−(t − µ)^T Σ^{−1} (t − µ)/2).        (6)

Note that Φ_1(z; 0, 1) ≡ Φ(z). Placing (5) into (4) immediately identifies the posterior as a multivariate skew normal distribution [MSN, Gupta et al., 2004]:

f | D ∼ SN_n(· | 0, K, D_y)        (7)

SN_n(z; µ, Σ, D) def= [1/Φ_n(0; 0, I + DΣD^T)] φ_n(z; µ, Σ) Φ_n(D(z − µ); 0, I_n).        (8)

The mean and variance of the posterior are thus those of the corresponding MSN distribution. We note here that the MSN distribution proposed by Gupta et al. [2004] is but one of many [Arellano-Valle and Azzalini, 2006].
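Identity (5) is straightforward to check numerically. The following sketch (assuming numpy and scipy are available) compares the product of probit likelihoods with the corresponding multivariate normal cdf for arbitrary latent values and labels:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
n = 4
f = rng.normal(size=n)               # arbitrary latent values f_i
y = rng.choice([-1.0, 1.0], size=n)  # arbitrary labels y_i

# Left-hand side of (5): product of probit likelihoods Phi(y_i f_i)
lhs = np.prod(norm.cdf(y * f))

# Right-hand side of (5): Phi_n(D_y f; 0, I_n); with identity covariance the
# n-variate cdf factorises over components, which is exactly what (5) states
rhs = multivariate_normal(mean=np.zeros(n), cov=np.eye(n)).cdf(y * f)
```

With an identity covariance the agreement is exact up to the cdf's numerical integration tolerance.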
2 Joint distribution via MSN construction

Similar to the construction of the MSN given by Gupta et al. [2004, Proposition 3.4], we shall give a construction of the joint distribution between the GP latent process and the probit data likelihood. We define an alternative class labelling scheme: define r_− def= (−∞, 0) and r_+ def= (0, ∞), and for an input x, let r ∈ {r_−, r_+} be its class label; conceptually, this means that instead of knowing the class labels via discrete (binary) values, we know them via knowledge of the signs of the values. The data set corresponding to D is then D_r def= {(r_i, x_i) | r_i = r_{sgn(y_i)}}_{i=1}^{n}. We also introduce a vector of n real auxiliary random variables g def= (g_1, …, g_n)^T, and define an augmented data set D_g def= {(g_i ∈ r_i, x_i)}.

Proposition 2.1. Let

(g; f) ∼ N_2n( · | 0, [[I + K, K], [K, K]] ).        (9)

Then,

f ∼ N_n(· | 0, K)                   prior of latent values         (10)
p(g_i ∈ r_i | f_i) = Φ(y_i f_i)     likelihood                     (11)
f | D_g ∼ SN_n(· | 0, K, D_y)       posterior of latent values     (12)
Proof. One may prove this by considering the original likelihoods given in (3), but delaying the integration within the probit to as late as possible in all operations. The following is an alternative proof following that of Gupta et al. [2004, Proposition 3.4].

That (10) holds is clear by considering only the f component of (9). For (11), consider the conditional distribution g | f ∼ N_n(· | f, I). The components are independent in this conditional, so we may also write g_i | f_i ∼ N(· | f_i, 1) for each i = 1, …, n. When y_i = −1, then r_i = r_−, and we have p(g_i ∈ r_i | f_i) = Φ_1(0; f_i, 1) ≡ Φ(y_i f_i); similarly for y_i = 1. For (12), Bayes' rule gives

p(f | D_g) = p(f) p({g_i ∈ r_i}_{i=1}^{n} | f) / p({g_i ∈ r_i}_{i=1}^{n}).        (13)

Comparing this with (7) and (8), we are required to show

p({g_i ∈ r_i}_{i=1}^{n} | f) = Φ_n(D_y f; 0, I_n)        (14)
p({g_i ∈ r_i}_{i=1}^{n}) = Φ_n(0; 0, I + D_y K D_y^T).        (15)

The likelihood term (14) follows immediately from the definition of Φ_n and (11). For the normalisation term (15), note that D_y is unitary, so that we have

I + D_y K D_y^T = D_y (I + K) D_y^T,        (16)
giving

Φ_n(0; 0, I + D_y K D_y^T)
  = 1/√((2π)^n |I + D_y K D_y^T|) ∫_{−∞}^{0} dt_1 ⋯ ∫_{−∞}^{0} dt_n exp(−½ t^T (I + D_y K D_y^T)^{−1} t)
  = 1/√((2π)^n |D_y (I + K) D_y^T|) ∫_{−∞}^{0} dt_1 ⋯ ∫_{−∞}^{0} dt_n exp(−½ t^T D_y^{−T} (I + K)^{−1} D_y^{−1} t)
  = 1/√((2π)^n |I + K|) ∫_{−∞/y_1}^{0} y_1 du_1 ⋯ ∫_{−∞/y_n}^{0} y_n du_n exp(−½ u^T (I + K)^{−1} u)
  = 1/√((2π)^n |I + K|) ∫_{u_1 ∈ r_1} du_1 ⋯ ∫_{u_n ∈ r_n} du_n exp(−½ u^T (I + K)^{−1} u)
  = p({g_i ∈ r_i}_{i=1}^{n}),        (17)

where we have used a change of the dummy variables of integration t_i = y_i u_i, i = 1, …, n.  □
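The identity (15) established by this chain of equalities can be checked by simulation: draw g ∼ N(0, I + K) and count how often the signs of all components agree with y. A minimal sketch with an arbitrary kernel matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
K = A @ A.T                     # an arbitrary valid prior covariance
y = np.array([1.0, -1.0, 1.0])
Dy = np.diag(y)

# Closed form (15): Phi_n(0; 0, I + Dy K Dy^T)
p_exact = multivariate_normal(np.zeros(n), np.eye(n) + Dy @ K @ Dy.T).cdf(np.zeros(n))

# Monte Carlo estimate of p({g_i in r_i}): draw g ~ N(0, I + K) and count
# how often every g_i falls on the side of zero prescribed by y_i
g = rng.multivariate_normal(np.zeros(n), np.eye(n) + K, size=200_000)
p_mc = np.mean(np.all(y * g > 0, axis=1))
```

With 200,000 samples the Monte Carlo error is well below 0.01.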
Remarks

1. In essence, the marginal probability p({g_i ∈ r_i}_{i=1}^{n}) chooses the relevant ranges of integration for the latent multivariate Gaussian distribution.

2. From (9), clearly g_i ∼ N(· | f_i, 1), or g_i = f_i + ε_i where ε_i ∼ N(· | 0, 1). That is, the noise N(· | 0, 1) that was previously associated with the likelihood Φ(y_i f_i) is now modelled in the latent function g. That the noise can be modelled either in the likelihood or in the latent process has been previously noted by Opper and Winther [2000]. This is also evident from the definition of the probit model, which involves a normally distributed threshold.

3 Binary classification without explicit latent values
Proposition 2.1 suggests that one may be able to achieve GP binary classification using the latent values only implicitly, by directly using the auxiliary variables. Suppose one wishes to predict the class of a novel input x_*, and we have

(g; g_*) ∼ N_{n+1}( · | 0, [[I + K, k_*], [k_*^T, 1 + κ]] ),        (18)

where k_* = (k(x_1, x_*), …, k(x_n, x_*))^T and κ = k(x_*, x_*), k being the prior covariance function of f. Then the conditional probability that the class label y_* = −1 is

p(g_* ∈ r_− | D_g, x_*) = p(g_* ∈ r_−, D_g | x_*) / p(D_g) = Φ_{n+1}(0; 0, I + Σ_*) / Φ_n(0; 0, I + D_y K D_y^T),        (19)

where

Σ_* = [[D_y, 0], [0, −1]] [[K, k_*], [k_*^T, κ]] [[D_y, 0], [0, −1]]^T.        (20)
Remarks on prior work. Models of this form are also known as multivariate probit models or correlated probit regression models, first motivated and proposed by Ashford and Sowden [1970]. The focus on the latent variables f — à la GP binary classification — is proposed by Muthén [1979]; in a similar vein, Albert and Chib [1993] focus on the weights β of the linear composition f(x) = β^T x. In particular, Albert and Chib [1993] motivate their Gibbs sampling (of β) approach by citing the inappropriateness of normal (i.e. Laplace) approximations to the posterior, noted previously by Griffiths et al. [1987] and Zellner and Rossi [1984].

4 ML-II estimation and Correlation maximisation

For a kernel k(·, ·) parameterised by θ, the ML-II or empirical Bayes approach requires the search for the optimal parameters θ̂ given by

θ̂ = argmax_θ p(D_g | θ) = argmax_θ Φ_n(0; 0, I + D_y K_θ D_y^T) def= argmax_θ Φ_n(0; 0, Σ),        (21)

where Σ def= I + D_y K_θ D_y^T, and we have subscripted K with the hyperparameters θ to emphasise its dependence on them. The function Φ_n involves the integration of a zero-mean multivariate Gaussian over the lower orthant t_i ≤ 0, i = 1, …, n. The higher the mass the multivariate Gaussian has within that orthant, the higher the value of the integral. The value of this integral is also known as the (lower) orthant probability, and it depends only on the correlation of the zero-mean Gaussian, i.e. the variances do not matter. This suggests that the covariance Σ should be one that maximises the correlation of the random variables distributed under the Gaussian.

We give an example for n = 2 where higher correlation gives higher evidence. For the same volume |Σ| = 4, compare the case D_y K_θ D_y = I of zero correlation with the case D_y K_θ D_y = (3/2) 1_{2×2} of maximum correlation (though degenerate). In the former case, the evidence is 1/4 by symmetry. In the latter case, it is (1/4)(1 + (2/π) arctan(3/4)) ≈ (1/4) × 1.4, as illustrated in Figure 1 and by considering the additional area bounded between the negative t_1 axis and the half-line given by the direction vector (−4, 3)^T [see also Ruben, 1961].

Below we shall make explicit what is meant by maximising the correlation.

Theorem 4.1. (Slepian's inequality [Slepian, 1962]; see e.g. Gupta [1963])¹ Let x ∼ N_n(0, Σ) and y ∼ N_n(0, C), and denote the elements of Σ by σ_ij and of C by c_ij. Let σ_ii = c_ii = 1, i = 1, …, n. If σ_ij ≥ c_ij, i, j = 1, …, n, then

P(x_1 < a_1, x_2 < a_2, …, x_n < a_n) ≥ P(y_1 < a_1, y_2 < a_2, …, y_n < a_n)        (22)

¹ Gupta [1963] gives the inequality in terms of the upper orthant, while some other literature gives it in terms of the lower. We note that the two are equivalent via the transformation x′ = −I_n x.
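The n = 2 evidence calculation above can be verified numerically against the bivariate orthant probability:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Zero correlation: Sigma = 2I, lower-orthant probability is 1/4 by symmetry
p0 = multivariate_normal([0.0, 0.0], 2 * np.eye(2)).cdf([0.0, 0.0])

# Maximum correlation at the same volume |Sigma| = 4:
# Sigma = I + (3/2) 1_{2x2} = [[5/2, 3/2], [3/2, 5/2]]
S = np.array([[2.5, 1.5], [1.5, 2.5]])
p1 = multivariate_normal([0.0, 0.0], S).cdf([0.0, 0.0])

# Closed form from the text: (1/4)(1 + (2/pi) arctan(3/4)), roughly 0.25 * 1.41
closed_form = 0.25 * (1 + (2 / np.pi) * np.arctan(3 / 4))
```

The correlated case carries about 40% more evidence than the uncorrelated one, as claimed.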
[Figure 1 here]

Figure 1: LEFT: The black circle has unit radius, and is the contour p(t) = (1/(2π)) e^{−1/2} of the bivariate Gaussian with covariance I; the ellipse is the contour p(t) = (1/(4π)) e^{−1/2} of the correlated bivariate Gaussian with covariance (1/2)[[5, 3], [3, 5]] (major axis along t_1 = t_2, minor axis along t_1 = −t_2). The arrows show the transformations that happen when transforming the unit Gaussian into the correlated Gaussian, using the sampling rule t_unit ∼ N(· | 0, I) and t_correlated = (1/√10)[[5, 0], [3, 4]] t_unit. The ticks on the axes are 8/5 units from the origin. RIGHT: The thick lines bound the areas under the isotropic Gaussian to be considered for the lower-orthant probability of the correlated bivariate Gaussian with covariance (1/2)[[5, 3], [3, 5]]. The values given in the boxes are the probabilities within the associated areas: 1/4 for the lower orthant itself and (1/4) × (2/π) arctan(3/4) for the wedge between the negative t_1 axis and the line 3t_1 + 4t_2 = 0.
or equivalently

Φ_n(a; 0, Σ) ≥ Φ_n(a; 0, C).        (23)

The following corollary allows us to work also with covariance functions.

Corollary 4.2. (See also Tong [1990, §5.1.4]) Let x ∼ N_n(0, Σ) and y ∼ N_n(0, C), and denote the elements of the correlation of x by ρ_ij and of y by r_ij, i.e. ρ_ij = σ_ij/√(σ_ii σ_jj) and r_ij = c_ij/√(c_ii c_jj). If ρ_ij ≥ r_ij, i, j = 1, …, n, then

Φ_n(0; 0, Σ) ≥ Φ_n(0; 0, C).        (24)
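Slepian's inequality, in the correlation form of Corollary 4.2, is easy to observe numerically: raising every pairwise correlation of a trivariate Gaussian raises its lower-orthant probability. A minimal sketch with arbitrary correlation values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def orthant(P):
    """Lower-orthant probability Phi(0; 0, P)."""
    return multivariate_normal(np.zeros(len(P)), P).cdf(np.zeros(len(P)))

def corr3(r12, r13, r23):
    """Trivariate correlation matrix from its three off-diagonal entries."""
    return np.array([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])

# Element-wise larger correlations give a larger orthant probability
p_low = orthant(corr3(0.1, 0.2, 0.3))
p_high = orthant(corr3(0.4, 0.5, 0.6))
```

As the text notes below, this gives only a partial (pre-)ordering: two correlation vectors where neither dominates the other element-wise are not comparable this way.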
Proof. Use the variance–correlation decomposition Σ = ∆^{1/2} P ∆^{1/2}, where ∆ is a diagonal matrix of variances and P is the correlation matrix. That Φ_n(0; 0, Σ) = Φ_n(0; 0, P) can be shown via suitable scaling of the dummy variables of integration (by the reciprocals of the respective standard deviations). The same can be done for C, so that Φ_n(0; 0, Σ) = Φ_n(0; 0, P) ≥ Φ_n(0; 0, R) = Φ_n(0; 0, C), where R is the correlation matrix for y.  □

Hence, ML-II maximisation is the same as maximising the correlations between the variables. Note, however, that both Slepian's inequality and the above corollary give only a pre-ordering of correlations {ρ_ij} based on the scalar field Φ_n(a; 0, {ρ_ij}). Hence their use is limited, since there is no single direction along which to optimise; that is, it is still necessary to try all valid increases in correlations, compute the corresponding Φ_n values, and choose the highest. To make this clearer, consider the case of 3 data points for GP classification, giving a centred trivariate Gaussian distribution parameterised by {ρ_12, ρ_13, ρ_23}. Figure 2 illustrates Slepian's inequality: each point in the vector space of correlations gives a Gaussian more correlated than the points it dominates in the Cartesian sense. However, points such as ∗ and † in Figure 2 are not directly comparable.

5 Maximising projection of correlations and Empirical alignment

For maximising the correlation, one heuristic is to maximise the sum of correlations, i.e. given an n×n correlation matrix P = (ρ_ij), we maximise

1_n^T P 1_n = Σ_{i=1}^{n} Σ_{j=1}^{n} ρ_ij = n + 2 Σ_{i<j} ρ_ij = n + 2 · 1^T (ρ_12, ρ_13, …, ρ_{(n−1)n})^T.        (25)
[Figure 2 here]

Figure 2: The axes are the correlations {ρ_12, ρ_13, ρ_23} for a trivariate Gaussian distribution. Each point within the cuboid given by the thick lines gives a Gaussian that is "less correlated" than the point ∗; similarly for each point within the cuboid given by the thin lines compared with the point †. However, points ∗ and † are not comparable. Note that the region that gives valid correlations is smaller than the R³ portrayed here; in fact, the valid region in this case is a tetrahedron inscribed within [−π/2, π/2]³ in the axis-transformed space asin ρ_ij (see section 5.1).

The last equality in (25) gives the geometric interpretation, saying that the heuristic maximises the projection of the correlations onto the perfect positive correlation case. Given an n×n covariance matrix Σ, and with ∆ defined to be the corresponding diagonal matrix of variances, the heuristic is max 1_n^T ∆^{−1/2} Σ ∆^{−1/2} 1_n. Now for binary GP classification, we have Σ = I + D_y K_θ D_y^T. We define ∆ accordingly, and let D_{K_θ} be the diagonal matrix of variances from K_θ. Using (16) repeatedly gives

1_n^T ∆^{−1/2} Σ ∆^{−1/2} 1_n
  = 1_n^T (I + D_y D_{K_θ} D_y^T)^{−1/2} (I + D_y K_θ D_y^T) (I + D_y D_{K_θ} D_y^T)^{−1/2} 1_n
  = 1_n^T D_y (I + D_{K_θ})^{−1/2} D_y^T D_y (I + K_θ) D_y^T D_y (I + D_{K_θ})^{−1/2} D_y^T 1_n
  = y^T (I + D_{K_θ})^{−1/2} (I + K_θ) (I + D_{K_θ})^{−1/2} y.        (26)

It will be instructive to compare this with the empirical alignment criterion proposed by Cristianini et al. [2002]:

Definition 5.1. [Cristianini et al., 2002] The empirical alignment between an n×n positive semi-definite kernel K_θ and the vector of binary labels y ∈ {−1, +1}^n is given by

A = ⟨K_θ, yy^T⟩_F / √(⟨K_θ, K_θ⟩_F ⟨yy^T, yy^T⟩_F) = ⟨K_θ, yy^T⟩_F / (n √(⟨K_θ, K_θ⟩_F)),

where ⟨M, N⟩_F is the Frobenius product between matrices M and N.

Consider the case when K_θ is a correlation matrix, e.g. when given by a correlation function such as the squared exponential covariance function with unit amplitude. In this case, the (inverse) scaling by variances in (26) is not required for the purpose of optimisation. Further suppose that the GP binary classification has likelihood given by p(y_i | f_i) = Φ(y_i f_i / η²) instead of (3), and take the limit η² → ∞. Due to this limit, the identity matrices I in (26) disappear. The heuristic is now to maximise y^T K_θ y.² Notice that this last expression is the numerator of A, since ⟨M, N⟩_F = tr(M N^T). The difference between the two criteria is then due to the denominator n √(⟨K_θ, K_θ⟩_F) in A. Let us attempt to understand whether this difference is significant. Let the labels be given by y = 1_n, so that our objective function becomes n + 2 Σ_{i<j} ρ_ij.
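The equality of the two forms in (26) — the projection 1^T ∆^{−1/2} Σ ∆^{−1/2} 1 and the label-weighted quadratic form — can be checked directly, along with the alignment numerator y^T K_θ y that the limiting argument produces. A minimal sketch with an arbitrary kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
K = A @ A.T                          # an arbitrary kernel matrix K_theta
y = rng.choice([-1.0, 1.0], size=n)
Dy = np.diag(y)

# Final form of (26): y^T (I + D_K)^{-1/2} (I + K) (I + D_K)^{-1/2} y
d = 1.0 / np.sqrt(1.0 + np.diag(K))  # diagonal of (I + D_K)^{-1/2}
score_26 = y @ (d[:, None] * (np.eye(n) + K) * d[None, :]) @ y

# Original form: 1^T Delta^{-1/2} Sigma Delta^{-1/2} 1, Sigma = I + Dy K Dy^T
Sigma = np.eye(n) + Dy @ K @ Dy.T
dd = 1.0 / np.sqrt(np.diag(Sigma))
score_proj = np.ones(n) @ (dd[:, None] * Sigma * dd[None, :]) @ np.ones(n)

# Numerator of the alignment A (identity terms dropped): y^T K y
align_num = y @ K @ y
```

The two scores agree to machine precision, confirming the derivation step D_y^T 1_n = y used in (26).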
5.1 Maximising projection in the arc-sine axis-transformed space

Instead of maximising the sum of the correlations, one may consider maximising the sum of monotonic transforms of the correlations. Due to the following equation derived by Childs [1967], we propose the use of the asin transform.

² Using the regression setting (instead of the binary classification setting), we have that the ML-II solution is K = yy^T, which indeed gives the maximum for y^T K y with K unconstrained. However, if K is constrained by a parameterised covariance function, then the answers obtained by ML-II and by the criteria may be different. Also, an obvious measure of alignment for two Gaussian distributions is the Kullback–Leibler divergence between them.
Proposition 5.2. [Childs, 1967] Let x ∼ N_n(0, P), where P is a correlation matrix with ρ_ij its (i, j)th element. Then³

Φ_n(0; 0, P) def= P(x_1 < 0, …, x_n < 0) = (1/2^n) [ 1 + (2/π) Σ_{i<j} asin ρ_ij + Σ_{j=2}^{⌊n/2⌋} (2/π)^j Σ_{k_1<⋯<k_{2j}} w_{k_1,k_2,…,k_{2j}} ],        (28)

where

w_{k_1,k_2,…,k_m} def= ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} [ exp(−½ x^T P x) / ∏_{i=1}^{m} x_{k_i} ] dx.        (29)
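For n ≤ 3 the double-summation term in (28) is empty (it starts at j = 2 while ⌊n/2⌋ ≤ 1), so the asin terms alone give the orthant probability exactly. This can be checked against the numerical cdf:

```python
import numpy as np
from scipy.stats import multivariate_normal

# An arbitrary valid trivariate correlation matrix
r12, r13, r23 = 0.3, -0.2, 0.5
P = np.array([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])

# (28) with the w-terms dropped; exact for n = 3:
# (1/8) [1 + (2/pi)(asin r12 + asin r13 + asin r23)]
childs = (1 + (2 / np.pi) * (np.arcsin(r12) + np.arcsin(r13) + np.arcsin(r23))) / 2**3

# Numerical lower-orthant probability Phi_3(0; 0, P)
numeric = multivariate_normal(np.zeros(3), P).cdf(np.zeros(3))
```

For n ≥ 4 the dropped w-terms are the approximation error incurred by the asin heuristic below.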
Thus we may approximate ML-II by maximising the sum of the asins of the correlations. This approximation drops the last double-summation term in (28), which gives the 4th and higher orders of interactions between the x_i s. For GP binary classification, the correlation matrix P in (28) is given by (I + D_y D_{K_θ} D_y^T)^{−1/2} (I + D_y K_θ D_y^T) (I + D_y D_{K_θ} D_y^T)^{−1/2} (see (26)). Using the fact that asin is an odd function, and using a derivation similar to that for (26), the approximation is then to maximise

y^T asin[ (I + D_{K_θ})^{−1/2} (I + K_θ) (I + D_{K_θ})^{−1/2} ] y,        (30)

where the matrix function asin(·) gives the element-wise asin of its argument.

6 Summary

In this note, we have made explicit the following intuitions on GP binary classification. We have shown that the posterior of the latent function in GP binary classification can be interpreted as a multivariate skew normal distribution, and that GP binary classification can be done without explicitly introducing a latent function. We have also made a connection between ML-II estimation and maximising correlations, and have suggested two heuristics for approximating ML-II.

Acknowledgements

We thank C.K.I. Williams for helpful discussions and comments.

³ Childs [1967] gives this in terms of the upper orthant, while we give it in terms of the lower orthant to be consistent with the rest of this note.

References

James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, June 1993.

Reinaldo B. Arellano-Valle and Adelchi Azzalini. On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574, September 2006.

J. R. Ashford and R. R. Sowden. Multi-variate probit analysis. Biometrics, 26(3):535–546, September 1970.

Donald R. Childs. Reduction of the multivariate normal integral to characteristic form. Biometrika, 54(1/2):293–300, 1967.

Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

William E. Griffiths, R. Carter Hill, and Peter J. Pope. Small sample properties of probit model estimators. Journal of the American Statistical Association, 82(399):929–937, 1987.

Arjun K. Gupta, Graciela González-Farías, and J. Armando Domínguez-Molina. A multivariate skew normal distribution. Journal of Multivariate Analysis, 89:181–190, 2004.

Shanti S. Gupta. Probability integrals of multivariate normal and multivariate t. The Annals of Mathematical Statistics, 34(3):792–828, September 1963.

Bengt Muthén. A structural probit model with latent variables. Journal of the American Statistical Association, 74(368):807–811, December 1979.

Manfred Opper and Ole Winther. Gaussian processes and SVM: Mean field and leave-one-out. In Alexander J. Smola, Peter J. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large-Margin Classifiers, pages 311–326. MIT Press, October 2000.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.

Harold Ruben. Probability content of regions under spherical normal distributions, III: The bivariate normal integral. The Annals of Mathematical Statistics, 32(1):171–186, March 1961.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41:463–501, 1962.

Yung Liang Tong. The Multivariate Normal Distribution. Springer-Verlag, New York, 1990.

Arnold Zellner and Peter E. Rossi. Bayesian analysis of dichotomous quantal response models. Journal of Econometrics, 25(3):365–393, July 1984.