JMLR: Workshop and Conference Proceedings 1:1–7, 2017

ICML 2017 AutoML Workshop

Bayesian Multi-Hyperplane Machine

Khanh Nguyen, Trung Le, Tu Dinh Nguyen, Dinh Phung

nkhanh@deakin.edu.au, trung.l@deakin.edu.au, tu.nguyen@deakin.edu.au, dinh.phung@deakin.edu.au

Deakin University, Australia

Abstract
The existing multi-hyperplane machine approach deals with high-dimensional and complex datasets by approximating the decision boundary with a parametric mixture of hyperplanes in the input space. Consequently, this approach requires an excessively time-consuming grid search to find the optimal set of hyper-parameters. Moreover, it is often suboptimal because of the space discretization step in grid search. To address these challenges, we propose the BAyesian Multi-hyperplane Machine (BAMM). Our approach departs from a Bayesian perspective, aiming to construct an alternative probabilistic view in such a way that its maximum-a-posteriori (MAP) estimation reduces exactly to the original optimization problem of a multi-hyperplane machine. This view allows us to endow prior distributions over hyper-parameters and to apply a recent data augmentation technique to efficiently infer model parameters and hyper-parameters via a Markov chain Monte Carlo (MCMC) method. We further apply a Stochastic Gradient Descent (SGD) framework to cope with ever-growing large datasets, which can also be extended to online learning. Extensive experiments demonstrate the capability of our proposed method in learning the optimal model without using any grid search, yielding accuracy comparable with state-of-the-art baselines.
Keywords: Multi-Hyperplane Machine, Bayesian Inference, Data Augmentation

1. Introduction
Max-margin (Vapnik, 1998) is a powerful principle for constructing learning models with high generalization capacity. When applied to the multiclass classification problem, the max-margin is often the discrepancy between two discriminative values: one for the true label and the other for the runner-up. In (Crammer and Singer, 2002), a set of hyperplanes, each associated with one class, is used to compute the discriminative values, upon which the classification decision is made. With a suitable cost function, this learning problem becomes convex and can be solved analytically. However, associating a single hyperplane with each class highly restricts the model expressiveness, making it problematic to deal with high-dimensional, complex datasets. To overcome this issue, Aiolli and Sperduti (2005) proposed to associate each class with multiple hyperplanes to enrich the capacity. However, the loss function then becomes non-convex, making the optimization much harder and yielding only local solutions. This work was further improved by Wang et al. (2011), who used an SGD technique (Shalev-Shwartz and Singer, 2007), resulting in a model known as the Adaptive Multi-hyperplane Machine (AMM). However, there was no mechanism to control the sparsity level, which could easily cause the model to overfit or underfit. To address this problem, Nguyen et al. (2016) incorporated the group norm L2,1 into the optimization problem to maintain an optimal sparsity level. Nonetheless, an outstanding problem of these aforementioned approaches is the search for optimal hyper-parameters. A popular and widely-used approach is grid search.


This common practice, however, entails two serious shortcomings. First, the number of trials grows exponentially with the number of hyper-parameters. Second, the values of hyper-parameters can be continuous and unbounded whilst the grid contains discrete values only, hence there is no guarantee that the tuned hyper-parameters are optimal. In addition, determining the number of hyperplanes associated with each class in a non-parametric and principled way is really challenging. Addressing these issues therefore requires moving beyond the purely optimization-based view of the existing methods. Bayesian techniques, in particular non-parametric Bayesian models (Ferguson, 1973), have a long history in addressing the model selection problem. Nonetheless, their applications in max-margin and kernel methods are still limited (Zhu et al., 2011; Wang and Zhu, 2014), and these works addressed neither the question of the optimal number of subspaces nor the hyper-parameter tuning problem. Focusing specifically on the multi-hyperplane approach, we propose in this paper the BAyesian Multi-Hyperplane Machine (BAMM) to address its model selection problem. Our solution is to develop a Bayesian view of this problem whose MAP estimation reduces identically to the optimization problem of SAMM. Under this Bayesian probabilistic view, we then develop a graphical model representation for our model, and subsequently infer the posterior distribution over both the model parameters and the hyper-parameters. Posterior inference is, however, intractable even with MCMC methods in the original form of our model representation. To this end, we augment the model with auxiliary variables using a recent data augmentation technique proposed in (Polson and Scott, 2011) and then perform Gibbs sampling in the joint space. We further develop our algorithms under an SGD framework to scale up computation, which also enables an extension to the online learning setting. We validated our BAMM on 5 benchmark datasets. The experimental results demonstrate that our proposed method is able to infer the optimal parameter set, resulting in predictive performance comparable with baselines that use cross-validation over the full domains of the hyper-parameters.

2. Bayesian Multi-Hyperplane Machine
We discuss our proposed model in this section. We begin with the optimization problem for the multi-hyperplane machine, then we develop a Bayesian view of it. Finally, we present posterior inference with data augmentation and hyper-parameter learning.

2.1. Optimization problem for the Multi-Hyperplane Machine
Given the training set D = {(x_n, y_n)}_{n=1}^{N}, where x_n ∈ R^d is a d-dimensional vector and y_n ∈ Y = {1, ..., M} is the corresponding label, the multiclass classification problem aims to find a decision function f: R^d → Y to predict the label y for an input x. Under a max-margin view, the decision function is of the form f(x) = argmax_{m∈Y} g_m(x), where g_m is the score function associated with the m-th class. A set of experts (i.e., hyperplanes) in the input space was employed to represent each class, and the score function g_m was then defined by the maximal score given by these experts (Aiolli and Sperduti, 2005). As a result, the decision boundary is described by a set of polyhedrons which approximates a set of contours. Mathematically, the score function g_m for each class is defined as g_m(x) = max_k w_{m,k}^T x, where the weight vector w_{m,k} is the parameter of the k-th hyperplane of the m-th class. To find the optimal set of hyperplanes, a two-step iterative procedure was proposed by Aiolli and Sperduti (2005). The first step is to solve for W while the latent variables z_{1:N} are held fixed:

$$\min_{\mathbf{W}} \; \frac{\alpha}{2}\|\mathbf{W}\|_{2,2}^{2} + \beta\|\mathbf{W}\|_{2,1} + \sum_{n=1}^{N} \max\left\{0,\; 1 + g_{-y_n}(x_n) - w_{y_n,z_n}^{\top} x_n\right\} \quad (1)$$



where z_n is a latent discrete variable indicating the hyperplane that gives the score for the instance x_n, W is constructed by concatenating all weight vectors, ‖W‖_{p,q} is the group norm L_{p,q}, and g_{-y}(x) = max_{m≠y} g_m(x) denotes the best rival score. The hyper-parameter α controls the regularization strength and the hyper-parameter β is used to control the sparsity level. The second step is to find the optimal assignment z_n given the current matrix W:

$$z_n = \operatorname{argmax}_{k \in [K_{y_n}]} \; w_{y_n,k}^{\top} x_n \quad (2)$$
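To make the notation above concrete, the following minimal NumPy sketch (our own illustration; the variable names and the data layout are hypothetical, not taken from the paper's code) evaluates the score function g_m(x) = max_k w_{m,k}^T x, the resulting decision rule, and the per-instance hinge term appearing in Eq. (1). Here W is assumed to be a dict mapping each class m to a (K_m, d) array whose rows are the hyperplanes w_{m,k}.

```python
import numpy as np

def g(W, x, m):
    """Score of class m: g_m(x) = max_k w_{m,k}^T x."""
    return float(np.max(W[m] @ x))

def predict(W, x):
    """Decision rule f(x) = argmax_m g_m(x)."""
    return max(W, key=lambda m: g(W, x, m))

def hinge(W, x, y, z_y):
    """Per-instance loss in Eq. (1): max{0, 1 + g_{-y}(x) - w_{y,z_y}^T x}."""
    g_rival = max(g(W, x, m) for m in W if m != y)      # best rival score g_{-y}(x)
    return max(0.0, 1.0 + g_rival - float(W[y][z_y] @ x))

def objective(W, data, z, alpha, beta):
    """Full objective of Eq. (1) for fixed assignments z; data is a list of (x, y) pairs."""
    l22 = sum(np.sum(V ** 2) for V in W.values())                    # ||W||_{2,2}^2
    l21 = sum(np.linalg.norm(V, axis=1).sum() for V in W.values())   # ||W||_{2,1}
    return 0.5 * alpha * l22 + beta * l21 + sum(
        hinge(W, x, y, z[n]) for n, (x, y) in enumerate(data))
```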

2.2. Bayesian formulation for the Multi-Hyperplane Machine
Our aim is to turn the optimization problem of the multi-hyperplane machine in Eq. (1) into a MAP estimation under a probabilistic model for which efficient tools exist for inference and learning. Moreover, this view enables us to place prior distributions over the hyper-parameters (i.e., α, β) in our model, hence effectively avoiding an expensive grid search for model selection. Our primary parameter of interest is W, hence we start with a specification for the posterior distribution of W whose MAP estimation coincides with the optimization problem of the multi-hyperplane machine (cf. Eq. (1)), specified as follows:

$$p(\mathbf{W} \mid \mathbf{X}, \hat{\mathbf{y}}, \mathbf{z}, \alpha, \beta) = C_1(\alpha,\beta)\, \exp\Big\{ -\Big[\tfrac{\alpha}{2}\|\mathbf{W}\|_{2,2}^{2} + \beta\|\mathbf{W}\|_{2,1}\Big] - \sum_{n=1}^{N} l(\mathbf{W}; x_n, \hat{y}_n, z_n) \Big\} = C_2(\alpha,\beta)\, p(\hat{\mathbf{y}} \mid \mathbf{W}, \mathbf{X}, \mathbf{z})\, p(\mathbf{W} \mid \alpha, \beta) \quad (3)$$

where we have defined

$$l(\mathbf{W}; x_n, \hat{y}_n, z_n) = \max\big\{0,\; 1 + g_{-\hat{y}_n}(x_n) - w_{\hat{y}_n, z_n}^{\top} x_n\big\}$$

$$p(\mathbf{W} \mid \alpha, \beta) \propto \exp\Big\{ -\tfrac{\alpha}{2}\|\mathbf{W}\|_{2,2}^{2} - \beta\|\mathbf{W}\|_{2,1} \Big\} \quad (4)$$

$$p(\hat{\mathbf{y}} \mid \mathbf{W}, \mathbf{X}, \mathbf{z}) = \prod_{n=1}^{N} p(\hat{y}_n \mid \mathbf{W}, x_n, z_n) \propto \prod_{n=1}^{N} \exp\{-l(\mathbf{W}; x_n, \hat{y}_n, z_n)\} \quad (5)$$

and C_1(α, β), C_2(α, β) are normalization terms. It is worth clarifying the roles of the variables y_n and ŷ_n. As in a standard supervised learning setting, y_n represents the random variable giving the true label produced by an unknown process, whereas ŷ_n represents the random variable yielding the label under our model assumption. We term ŷ_n the pseudo-label. During training, our goal is to make the pseudo-label ŷ_n identical to the true label y_n. To learn the model parameters W and the latent variables z from training data, our plan is to use Gibbs sampling to iteratively sample W and z from the corresponding conditional distributions. Sampling W directly from its posterior in Eq. (3) is intractable, and our solution is to employ a data augmentation technique. For readability, we delay the discussion of sampling W to the next section and first finalize this section with a specification of the conditional distribution for z_n. To specify the conditional distribution for z_n, we note that a deterministic estimation of z_n was originally proposed in (Aiolli and Sperduti, 2005) (cf. Eq. (2)). This deterministic assignment has an inherent drawback: it favors the old hyperplanes, and data instances hence tend to be assigned to these old hyperplanes. To address this issue and encourage diversity, we propose an equivalent probabilistic formulation for z_n using a softmax distribution as follows:

$$p(z_n = k \mid \mathbf{W}, x_n, y_n) = \frac{\mathbb{I}_{\delta_n > 0}\, e^{w_{y_n,k}^{\top} x_n}}{Z(\mathbf{W}, x_n, y_n)}, \qquad p(z_n = K_{y_n}+1 \mid \mathbf{W}, x_n, y_n) = \frac{\mathbb{I}_{\delta_n > 0}}{Z(\mathbf{W}, x_n, y_n)} + \mathbb{I}_{\delta_n \le 0} \quad (6)$$

where we have defined δ_n = g_{y_n}(x_n) − g_{−y_n}(x_n) and Z(W, x_n, y_n) = 1 + Σ_{k=1}^{K_{y_n}} exp(w_{y_n,k}^T x_n). If δ_n > 0 (i.e., the score assigned to the true class is greater than that of the remaining classes, or, in other words, the current hyperplanes are sufficient to classify x_n), we use the softmax function to specify a probability for z_n.
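The following short sketch (our own illustration; the array layout and function name are hypothetical, not from the paper) shows how the assignment of Eq. (6) could be sampled in practice. It uses the fact that, when δ_n > 0, Eq. (6) is an ordinary softmax over the K_{y_n} existing scores together with an implicit score of zero for a fresh hyperplane, since Z contains the additive constant 1.

```python
import numpy as np

def sample_assignment(W_y, x, delta, rng):
    """Sample z_n from Eq. (6). W_y: (K_y, d) array of the true class's hyperplanes;
    delta = g_y(x) - g_{-y}(x); index K_y (0-based) denotes a freshly created hyperplane."""
    if delta <= 0:                      # misclassified: deterministically open a new hyperplane
        return W_y.shape[0]
    logits = np.append(W_y @ x, 0.0)    # the implicit 0 score contributes the "+1" in Z(W, x, y)
    logits -= logits.max()              # max-shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(probs.size, p=probs))
```

For example, `sample_assignment(np.zeros((2, 3)), np.ones(3), 0.5, np.random.default_rng(0))` draws uniformly over the two existing hyperplanes and the new slot, since all scores are zero.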


In the limit, this probability mimics the deterministic decision specified in Eq. (2). If δ_n ≤ 0 (i.e., the score given by the remaining classes is greater than that of the true class), we create a new hyperplane whose components are initialized to zero and assign x_n to this new hyperplane. The key advantages of using the probabilistic assignment for z_n are twofold. First, it deals with uncertainty by giving non-zero probabilities of assigning x_n to a spectrum of hyperplanes when comparable discriminative values w_{y_n,k}^T x_n (k ∈ [K_{y_n}]) are attained. Thus it helps to avoid local optima and encourages diversity in the early stage. Second, it presents a principled way to increase the number of hyperplanes automatically that does not bias toward the old hyperplanes. This makes our approach behave in a non-parametric learning fashion, automatically growing its model complexity according to the observed data.

2.3. Data augmentation approach to sample W
To infer W, we need a tractable way to sample from the conditional posterior p(W | X, ŷ, z, α, β) in Eq. (3), which is intractable in general. To do so, we employ the data augmentation technique of (Polson and Scott, 2011) to jointly sample W with an auxiliary variable λ = [λ_{1,1}, ..., λ_{1,K_1} | ... | λ_{M,1}, ..., λ_{M,K_M}]. We have:

$$p(\mathbf{W}, \boldsymbol{\lambda} \mid \mathbf{X}, \hat{\mathbf{y}}, \mathbf{z}, \alpha, \beta) \propto p(\hat{\mathbf{y}} \mid \mathbf{W}, \boldsymbol{\lambda}, \mathbf{X}, \mathbf{z})\, p(\mathbf{W}, \boldsymbol{\lambda} \mid \alpha, \beta) \propto p(\hat{\mathbf{y}} \mid \mathbf{W}, \mathbf{X}, \mathbf{z})\, p(\mathbf{W}, \boldsymbol{\lambda} \mid \alpha, \beta)$$

To find the form of p(W, λ | α, β), we depart from the result in (Andrews and Mallows, 1974), which gives the distribution of w_{m,k} as

$$p(w_{m,k} \mid \alpha, \beta) = \int_{0}^{\infty} \frac{e^{-\frac{1}{2}\left[\lambda_{m,k} + \left(\beta^{2}\lambda_{m,k}^{-1} + \alpha\right)\|w_{m,k}\|^{2}\right]}}{\sqrt{2\pi\lambda_{m,k}}}\, d\lambda_{m,k}$$

In other words, the joint distribution of w_{m,k} and the new auxiliary variable λ_{m,k} is

$$p(w_{m,k}, \lambda_{m,k} \mid \alpha, \beta) = \frac{e^{-\frac{1}{2}\left[\lambda_{m,k} + \left(\beta^{2}\lambda_{m,k}^{-1} + \alpha\right)\|w_{m,k}\|^{2}\right]}}{\sqrt{2\pi\lambda_{m,k}}} \quad (7)$$

This can easily be verified since integrating the right-hand side of Eq. (7) over λ_{m,k} reduces identically to the marginal distribution of w_{m,k} in the preceding equation. The state space of our Gibbs sampler is now expanded with the auxiliary variable λ, suggesting an iterative Gibbs-style sampling procedure that alternates between the conditionals for W and λ. As presented in detail in Section 2.5, sampling the auxiliary variable λ and the model parameter W now becomes tractable via Gibbs-style samplers.

2.4. Learning the model hyper-parameters α and β
Under a Bayesian setting, we can further endow the hyper-parameters α and β with prior distributions. We can then handle these hyper-parameters by attaching them to the state space of the Gibbs sampler instead of performing grid search as in previous works. In particular, p(α | .) is the Gamma distribution G(κ_0, θ_0) and p(β | .) is a truncated Normal distribution TN(μ_0, σ_0², 0, +∞). We obtain the conditional distribution of α and β as:

$$p(\alpha, \beta \mid \mathbf{W}, \boldsymbol{\lambda}, \zeta_0) \propto p(\mathbf{W}, \boldsymbol{\lambda} \mid \alpha, \beta)\, p(\alpha, \beta \mid \zeta_0) \propto \exp\Big\{-\frac{1}{2}\sum_{m=1}^{M}\sum_{k=1}^{K_m}\left(\beta^{2}\lambda_{m,k}^{-1} + \alpha\right)\|w_{m,k}\|^{2}\Big\}\, p(\alpha, \beta \mid \zeta_0)$$

where ζ_0 = {κ_0, θ_0, μ_0, σ_0}. Using the Normal-Gamma conjugacy, the posterior p(α | W, λ) is again a Gamma distribution G(κ_l, θ_l) and p(β | W, λ) is again a truncated Normal distribution TN(μ_l, σ_l², 0, +∞), where we have defined

$$\kappa_l = \kappa_0, \quad \theta_l = \frac{2\theta_0}{2 + \bar{w}\theta_0}, \quad \mu_l = \frac{\mu_0}{1 + \bar{\tau}\sigma_0^{2}}, \quad \sigma_l^{2} = \frac{\sigma_0^{2}}{1 + \bar{\tau}\sigma_0^{2}} \quad (8)$$

in which $\bar{\tau} = \sum_{m=1}^{M}\sum_{k=1}^{K_m} \lambda_{m,k}^{-1}\|w_{m,k}\|^{2}$ and $\bar{w} = \sum_{m=1}^{M}\sum_{k=1}^{K_m} \|w_{m,k}\|^{2}$.
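As a concrete illustration of Eq. (8), the sketch below (our own code, not the authors'; the dict-based data layout is an assumption, and SciPy's truncated normal is just one possible sampler) draws a single Gibbs update of (α, β) from their conditional posteriors.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_alpha_beta(W, lam, kappa0, theta0, mu0, sigma0, rng):
    """One Gibbs draw of (alpha, beta) from the posteriors defined around Eq. (8).
    W: dict class -> (K_m, d) array; lam: dict class -> (K_m,) auxiliary variables."""
    w_bar = sum(np.sum(V ** 2) for V in W.values())                    # sum_{m,k} ||w_{m,k}||^2
    tau_bar = sum(np.sum(np.linalg.norm(V, axis=1) ** 2 / lam[m])
                  for m, V in W.items())                               # sum_{m,k} lambda^{-1}||w||^2
    theta_l = 2.0 * theta0 / (2.0 + w_bar * theta0)
    alpha = rng.gamma(kappa0, theta_l)                                 # G(kappa_l = kappa_0, theta_l)
    denom = 1.0 + tau_bar * sigma0 ** 2
    mu_l, sigma_l = mu0 / denom, sigma0 / np.sqrt(denom)
    a = (0.0 - mu_l) / sigma_l                                         # standardized lower bound at 0
    beta = truncnorm.rvs(a, np.inf, loc=mu_l, scale=sigma_l)           # TN(mu_l, sigma_l^2, 0, +inf)
    return alpha, beta
```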


[Figure 1 (graphical model) appears here. Its generative process is: α ∼ G(κ_0, θ_0); β ∼ TN(μ_0, σ_0², 0, +∞); ŷ_n | W, x_n, z_n ∼ Eq. (5); z_n | W, x_n, y_n ∼ Eq. (6); w_{m,k}, λ_{m,k} | α, β ∼ Eq. (7); with plates over k ∈ [K_m], m ∈ Y, and n ∈ [N].]

Figure 1: Graphical model representation and generative process for our proposed BAMM model.

Putting all the ingredients together, the augmented graphical model and generative process of our model are illustrated in Figure 1.

2.5. Posterior inference and parameter estimation
Given the model specification and the discussion of the posterior distribution thus far, we summarize posterior inference and parameter estimation in this section.
Estimate W. The posterior distribution of W is given by p(W | α, β, λ, X, ŷ, z) ∝ p(ŷ | W, X, z) p(W, λ | α, β). To infer W, we use the MAP estimate and arrive at the following optimization problem, which contains only the group norm L_{2,2} in its regularization term:

$$\min_{\mathbf{W}} \; \frac{1}{N}\sum_{n=1}^{N} l(\mathbf{W}; x_n, \hat{y}_n, z_n) + \sum_{m,k} \frac{\gamma_{m,k}}{2}\|w_{m,k}\|^{2}$$

where γ_{m,k} = (β²λ_{m,k}^{-1} + α)/N. This allows us to apply standard SGD to obtain the solution.
Estimate λ. Deriving from Eq. (7), we obtain the conditional distribution of λ_{m,k} as

$$p(\lambda_{m,k} \mid \mathbf{W}, \alpha, \beta) \propto \mathrm{GIG}\left(\tfrac{1}{2},\, 1,\, \beta^{2}\|w_{m,k}\|^{2}\right)$$

Lemma 7.4 in Devroye (1986) shows that if λ_{m,k} is drawn from GIG(1/2, 1, β²‖w_{m,k}‖²), then λ_{m,k}^{-1} follows the inverse Gaussian distribution IG(1, 1/(β‖w_{m,k}‖)).
Estimate z. We sample z_n using Eq. (6).
Estimate α, β. We sample α and β from G(κ_l, θ_l) and TN(μ_l, σ_l², 0, +∞), respectively, where κ_l, θ_l, μ_l and σ_l are defined in Eq. (8).
To summarize, we present the pseudocode of our proposed method in Algorithm 1.
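To tie the updates together, the sketch below gives our own reading of the λ and W steps (hypothetical names and data layout, not the authors' code; the inverse Gaussian is taken to have mean 1/(β‖w_{m,k}‖) and shape 1, which is what the GIG identity above yields, and the subgradient step on the hinge term is one plausible SGD discretization rather than the paper's exact rule).

```python
import numpy as np

def sample_inv_lambda(W, beta, rng, eps=1e-12):
    """lambda_{m,k}^{-1} ~ inverse Gaussian with mean 1/(beta*||w_{m,k}||) and shape 1."""
    return {m: rng.wald(1.0 / (beta * np.maximum(np.linalg.norm(V, axis=1), eps)), 1.0)
            for m, V in W.items()}

def sgd_pass(W, X, y, z, inv_lam, alpha, beta, rng, lr=0.01):
    """One SGD pass on (1/N) sum_n l(W; x_n, y_n, z_n) + sum_{m,k} (gamma_{m,k}/2)||w_{m,k}||^2."""
    N = len(X)
    gamma = {m: (beta ** 2 * inv_lam[m] + alpha) / N for m in W}        # gamma_{m,k}
    for n in rng.permutation(N):
        xn, yn, k = X[n], y[n], z[n]
        for m, V in W.items():                                          # gradient of the L2,2 term
            V *= 1.0 - lr * gamma[m][:, None]
        rival_score, rival = max((np.max(W[m] @ xn), m) for m in W if m != yn)
        if 1.0 + rival_score - W[yn][k] @ xn > 0:                       # hinge term is active
            W[yn][k] += lr * xn                                         # push the assigned hyperplane up
            W[rival][np.argmax(W[rival] @ xn)] -= lr * xn               # push the best rival down
    return W
```

One outer iteration of Algorithm 1 would then chain `sample_inv_lambda`, `sgd_pass`, the assignment step of Eq. (6), and the (α, β) update of Eq. (8).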

3. Experiments
We use 5 benchmark datasets downloaded from the LIBSVM repository, covering a variety of domains. We compare our BAMM with SAMM (Nguyen et al., 2016) and AMM (Wang et al., 2011). To put the results of the multi-hyperplane methods in context, we also compare with the kernelized multiclass SVM (KSVM) (Crammer and Singer, 2002) implemented using LIBSVM (Chang and Lin, 2011). We aim to investigate how accurate and how fast the proposed method is compared with the baselines. We apply 5-fold cross-validation on the training set to select the best hyper-parameters for SAMM, AMM and KSVM. Our proposed method can infer the hyper-parameters α and β, whereas the other methods must perform grid search in conjunction with cross-validation.


We search the hyper-parameters α and β, the trade-off hyper-parameter, and the kernel width in the ranges suggested by the corresponding authors.

Table 1: Total running time (hours), accuracy (%), and the optimal hyper-parameters α and β automatically inferred by BAMM.

Dataset   | Total running time (hours)     | Accuracy (%)               | Optimal values
          | BAMM   SAMM    AMM    KSVM     | BAMM   SAMM   AMM    KSVM  | α (×10⁻³)  β (×10⁻²)
usps      | 0.04   1.17    0.55   2.29     | 93.19  92.86  92.63  95.32 | 0.68       1.49
ijcnn1    | 0.09   4.84    1.39   6.62     | 97.86  97.58  78.48  98.13 | 5.28       2.56
a9a       | 0.18   3.61    1.42   11.15    | 83.01  84.31  84.87  85.09 | 0.96       1.18
mnist     | 0.71   69.39   3.32   126.89   | 95.61  94.80  94.78  98.57 | 1.38       1.47
webspam   | 4.80   99.85   63.66  1,855.46 | 97.47  97.36  82.99  99.12 | 1.06       1.39

Algorithm 1: Pseudocode of BAMM
Input: D = {(x_n, y_n)}_{n=1}^N, κ_0, θ_0, μ_0, σ_0, T
Output: W = (w_{m,k})
begin
  l ← 1 and W ← 0
  repeat
    Sample α_l ∼ G(κ_{l-1}, θ_{l-1})
    Sample β_l ∼ TN(μ_{l-1}, σ_{l-1}², 0, +∞)
    for t ← 1 to T do
      Sample n_t from [N]
      Update W based on (x_{n_t}, y_{n_t})
    end
    Sample λ_{m,k}^{-1} ∼ IG(1, 1/(β‖w_{m,k}‖))
    Sample z_i using Eq. (6), ∀i ∈ [N]
    Update κ_l, θ_l, μ_l and σ_l using Eq. (8)
    l ← l + 1
  until z is stable or l reaches the maximum number of epochs
end

Table 1 shows the total running time (including grid search and training time) of all methods. Here, we observe that KSVM consumes a huge amount of time due to the cost of learning the model in the feature space. Meanwhile, SAMM and AMM reduce the total training time since they learn multiple hyperplanes in the input space. Using the group norm L2,1, SAMM can control the sparsity level in a principled way at the cost of introducing one more hyper-parameter to tune. Hence, the number of trials in the grid search of SAMM is multiplicatively higher than that of AMM, resulting in a longer total running time. In contrast, BAMM avoids grid search through its Bayesian setting, so its hyper-parameters and model parameters can be inferred automatically using Gibbs sampling. Consequently, BAMM reduces the total training time significantly while still obtaining comparable accuracies. There is only a minor performance gap between the kernelized approach and the multi-hyperplane approaches. Empowered with the ability to control the sparsity level by using the group norm L2,1, SAMM can balance overfitting and underfitting, resulting in better prediction performance compared with AMM. Interestingly, although BAMM does not invoke any time-consuming grid search, it still outperforms SAMM and AMM on almost all datasets. In addition, the results in Table 1 indicate that although the hyper-parameter search ranges of SAMM and AMM cover the best values discovered by BAMM, they could not find such values due to the discretization in grid search.

4. Conclusion
We have proposed the Bayesian Multi-Hyperplane Machine (BAMM), which uses a minimal set of sparse hyperplanes to accurately approximate data regions in the input space. Owing to the advantages of the Bayesian approach, our proposed BAMM can resolve the model selection problem in a principled way, allowing us to avoid expensive grid search. The experimental results demonstrate that our proposed method offers comparable prediction performance while enabling fast computation.


References

F. Aiolli and A. Sperduti. Multiclass classification with multi-prototype support vector machines. Journal of Machine Learning Research, pages 817–850, 2005.

D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), pages 99–102, 1974.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, May 2011.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

L. Devroye. Continuous Univariate Densities, pages 379–484. Springer, New York, NY, 1986. doi: 10.1007/978-1-4613-8643-8_9.

T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.

K. Nguyen, T. Le, V. Nguyen, and D. Phung. Sparse adaptive multi-hyperplane machine. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 27–39. Springer, 2016.

N. G. Polson and S. L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1–23, 2011. doi: 10.1214/11-BA601.

S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Y. Wang and J. Zhu. Small-variance asymptotics for Dirichlet process mixtures of SVMs. In AAAI, pages 2135–2141, 2014.

Z. Wang, N. Djuric, K. Crammer, and S. Vucetic. Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 24–32. ACM, 2011.

J. Zhu, N. Chen, and E. P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 617–624, 2011.
