BAYESIAN MULTI-HYPERPLANE MACHINE
Khanh Nguyen, Trung Le, Tu Dinh Nguyen, Dinh Phung
Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
Problem

Current multi-hyperplane machine (MM) approaches deal with high-dimensional and complex datasets by approximating the decision boundary with a parametric mixture of hyperplanes in the input space. However, tuning their hyper-parameters:
• requires an excessively time-consuming grid search;
• is suboptimal, because grid search discretizes the hyper-parameter space.

Our solution: Bayesian Multi-Hyperplane Machine (BAMM)
• Construct a probabilistic view whose maximum a posteriori (MAP) estimation reduces to the optimization problem of the MM approach. This allows us to endow prior distributions over the hyper-parameters.
• Apply a data augmentation technique and stochastic gradient descent (SGD) to efficiently infer model parameters and hyper-parameters on large-scale datasets.

Optimization Problem (OP) for MM approach
• Given an $M$-class training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$.
• The MM approach aims to learn $\gamma = \{w_{1,1}, \dots, w_{1,K_1}, \dots, w_{M,1}, \dots, w_{M,K_M}\}$, where $w_{m,k}$ is a hyperplane and $K_m$ is the number of hyperplanes for the $m$-th class.
• The optimization problem for the MM approach is:
$$\min_{\gamma} \; \frac{\alpha}{2}\|\gamma\|_{2,2}^2 + \beta\|\gamma\|_{2,1} + \sum_{n=1}^{N} \ell(\gamma; x_n, y_n, z_n) \quad (1)$$
where $\alpha, \beta$ are hyper-parameters, $z_n$ is a latent discrete variable indicating the hyperplane that gives the score for the instance $x_n$, and $\ell(\gamma; x_n, y_n, z_n)$ is the loss function.
• The OP is solved by a two-step iterative procedure:
  - Keeping the latent variables $z_{1:N}$ fixed, find the optimal $\gamma$.
  - Keeping $\gamma$ fixed, find the optimal assignment
$$z_n = \operatorname*{argmax}_{k \in \{1,\dots,K_{y_n}\}} w_{y_n,k}^{\top} x_n \quad (2)$$

Bayesian Formulation for MM approach
• The OP given the latent variables $z_{1:N}$ is a MAP estimation of the following posterior distribution:
$$p(\gamma \mid X, \bar{y}, z, \alpha, \beta) \propto p(\bar{y} \mid \gamma, X, z)\, p(\gamma \mid \alpha, \beta) \quad (3)$$
where
$$p(\bar{y} \mid \gamma, X, z) = \prod_{n=1}^{N} \exp\bigl(-\ell(\gamma; x_n, y_n, z_n)\bigr) \quad (5)$$
$$p(\gamma \mid \alpha, \beta) = \exp\Bigl(-\frac{\alpha}{2}\|\gamma\|_{2,2}^2 - \beta\|\gamma\|_{2,1}\Bigr) \quad (6)$$
• To mimic Eq. (2), we use the softmax function to specify a probability for $z_n$:
$$p(z = k \mid \gamma, x_n, y_n) = \mathbb{1}_{\delta_n > 0}\, \frac{\exp\bigl(w_{y_n,k}^{\top} x_n\bigr)}{Z(\gamma, x_n, y_n)} \quad (7)$$
$$p(z = K_{y_n}+1 \mid \gamma, x_n, y_n) = \frac{\mathbb{1}_{\delta_n > 0}}{Z(\gamma, x_n, y_n)} + \mathbb{1}_{\delta_n < 0}$$
where $Z(\gamma, x_n, y_n) = 1 + \sum_{k=1}^{K_{y_n}} \exp\bigl(w_{y_n,k}^{\top} x_n\bigr)$, and $\delta_n = w_{y_n}(x_n) - w_{\neg y_n}(x_n)$, in which $w_{y_n}(x_n)$ is the score assigned to the true class and $w_{\neg y_n}(x_n)$ is the score assigned to the remaining classes.

Data Augmentation approach to sample $\gamma$
• Sampling $\gamma$ directly from Eq. (3) is intractable.
• Our solution: employ the data augmentation technique of Polson and Scott (2011) by jointly sampling $\gamma$ with auxiliary variables $\lambda = \{\lambda_{1,1}, \dots, \lambda_{1,K_1}, \dots, \lambda_{M,1}, \dots, \lambda_{M,K_M}\}$. We then have the joint distribution:
$$p(w_{m,k}, \lambda_{m,k} \mid \alpha, \beta) = \frac{1}{\sqrt{2\pi\lambda_{m,k}}} \exp\Bigl(-\frac{\lambda_{m,k} + \bigl(\beta^2\lambda_{m,k}^{-1} + \alpha\bigr)\|w_{m,k}\|_2^2}{2}\Bigr)$$
• Now, we can infer $\gamma$, $\lambda$, $z$ via Gibbs-style samplers.

Learning model hyper-parameters $\alpha$ and $\beta$
• We can further learn the model hyper-parameters by endowing prior distributions over them:
  ▪ $p(\alpha)$ is a Gamma distribution $\mathcal{G}(a_0, b_0)$
  ▪ $p(\beta)$ is a truncated Normal distribution $\mathcal{TN}(\mu_0, \sigma_0^2, 0, +\infty)$
• Using the conjugacy of the Normal-Gamma distribution, we can derive the form of the posterior distributions for $\alpha$ and $\beta$.
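To make the scoring and latent-assignment rules concrete, here is a minimal NumPy sketch (my illustration, not the authors' code; all names are mine) of the per-class multi-hyperplane score, the hard assignment of Eq. (2), and its softmax relaxation from Eq. (7), omitting the indicator on δ_n for brevity:

```python
import numpy as np

def mm_score(W, x):
    """Per-class score: each class m keeps K_m hyperplanes stacked in
    W[m] (shape K_m x d); the class score is the max dot product."""
    return np.array([np.max(W_m @ x) for W_m in W])

def assign_z(W, x, y):
    """Hard assignment of Eq. (2): index of the true-class hyperplane
    achieving the largest score for instance x."""
    return int(np.argmax(W[y] @ x))

def softmax_z(W, x, y):
    """Softmax relaxation from Eq. (7): probabilities over the K_y
    hyperplanes of the true class; the remaining mass 1/Z goes to the
    extra state z = K_y + 1."""
    s = np.exp(W[y] @ x)
    Z = 1.0 + s.sum()
    return s / Z

# toy example: 2 classes with 3 and 2 hyperplanes over 4 features
rng = np.random.default_rng(0)
W = [rng.standard_normal((3, 4)), rng.standard_normal((2, 4))]
x = np.array([1.0, -0.5, 0.2, 0.7])
scores = mm_score(W, x)   # one score per class
z = assign_z(W, x, y=0)   # winning hyperplane of class 0
p = softmax_z(W, x, y=0)  # soft assignment over class-0 hyperplanes
```

Prediction then simply takes the class with the highest score, while the soft assignment is what makes Gibbs-style sampling of z tractable.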
Graphical model representation and generative process of BAMM
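As a quick sanity check (mine, not part of the poster) on the Polson and Scott (2011) data augmentation mentioned above, the normal scale-mixture identity it rests on, ∫₀^∞ (2πλ)^(−1/2) exp(−(λ + a²/λ)/2) dλ = exp(−|a|) with a = β‖w‖, can be verified numerically; `mixture_integral` is an illustrative helper:

```python
import numpy as np

def mixture_integral(a, n=400_000, upper=80.0):
    """Trapezoid-rule approximation of the scale-mixture integral
    int_0^inf (2*pi*lam)^(-1/2) * exp(-(lam + a**2/lam)/2) d lam,
    which should recover exp(-|a|)."""
    lam = np.linspace(1e-6, upper, n)
    f = (2.0 * np.pi * lam) ** -0.5 * np.exp(-(lam + a * a / lam) / 2.0)
    h = lam[1] - lam[0]
    return float(((f[:-1] + f[1:]) * h / 2.0).sum())
```

Integrating out the auxiliary variable thus recovers the exp(−β‖w‖) factor of the prior, which is why marginalizing λ gives back the original MAP objective.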
Experiments
• We use 5 benchmark datasets from the LIBSVM repository.
• We compare our BAMM with SAMM (Nguyen et al., 2016) and AMM (Wang et al., 2011).
• To put the multi-hyperplane results in context, we also compare with the kernelized multiclass SVM (KSVM) (Crammer and Singer, 2002).

Dataset  | Total running time (hours)     | Accuracy (%)
         | BAMM   SAMM   AMM    KSVM      | BAMM   SAMM   AMM    KSVM
usps     | 0.04   1.17   0.55   2.29      | 93.19  92.86  92.63  95.32
ijcnn1   | 0.09   4.84   1.39   6.62      | 97.86  97.58  78.48  98.13
a9a      | 0.18   3.61   1.42   11.15     | 83.01  84.31  84.87  85.09
mnist    | 0.71   69.39  3.32   126.89    | 95.61  94.80  94.78  98.57
webspam  | 4.80   99.85  63.66  1,855.46  | 97.47  97.36  82.99  99.12

Table 1: Total running time (including the time spent searching for the best hyper-parameters plus training time) and accuracy comparison

            | a9a    ijcnn1  mnist  usps   webspam
α (× 10⁻³)  | 0.68   5.28    0.96   1.38   1.06
β (× 10⁻³)  | 1.49   2.56    1.18   1.47   1.39

Table 2: The optimal hyper-parameters α and β automatically inferred by BAMM
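As a closing illustration, the Gibbs-style updates described above can be sketched in heavily simplified form (my own sketch, not the authors' implementation): the auxiliary λ is drawn via its inverse-Gaussian reciprocal, a standard consequence of the scale-mixture form of the joint, and the weights are then updated by a gradient step on the augmented objective. For brevity it assumes one hyperplane per class (so z is trivial) and a batch multiclass hinge loss:

```python
import numpy as np

def gibbs_sweep(W, X, y, alpha, beta, rng, lr=0.01):
    """One simplified Gibbs-style sweep (one hyperplane per class).
    W: (M, d) weights; X: (N, d) instances; y: (N,) integer labels."""
    N = len(y)
    # 1) auxiliary variables: 1/lambda_m ~ InverseGaussian(1/(beta*||w_m||), 1)
    lam = np.array([1.0 / rng.wald(1.0 / (beta * np.linalg.norm(w) + 1e-12), 1.0)
                    for w in W])
    lam = np.clip(lam, 1e-3, 1e6)  # guard against extreme draws in this toy sketch
    # 2) multiclass hinge gradient (with one hyperplane per class, z_n is trivial)
    scores = X @ W.T
    margins = scores - scores[np.arange(N), y][:, None] + 1.0
    margins[np.arange(N), y] = 0.0
    k = np.argmax(margins, axis=1)  # most-violating class per instance
    grad = np.zeros_like(W)
    for n in np.where(margins[np.arange(N), k] > 0)[0]:
        grad[k[n]] += X[n]
        grad[y[n]] -= X[n]
    # 3) gradient step on the augmented objective: loss + per-class ridge term
    grad += (alpha + beta ** 2 / lam)[:, None] * W
    return W - lr * grad, lam

# usage on a separable toy problem
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W = 0.01 * rng.standard_normal((2, 2))
for _ in range(50):
    W, lam = gibbs_sweep(W, X, y, alpha=0.01, beta=0.1, rng=rng)
```

In the full model each class would carry several hyperplanes, z would be resampled from the softmax in Eq. (7), and the hyper-parameters would be resampled from their conjugate posteriors rather than fixed.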