BAYESIAN MULTI-HYPERPLANE MACHINE
Khanh Nguyen, Trung Le, Tu Dinh Nguyen, Dinh Phung
Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
Problem

Current multi-hyperplane machine (MM) approaches deal with high-dimensional and complex datasets by approximating the decision boundary with a parametric mixture of hyperplanes in the input space. However, tuning their hyper-parameters:
• requires an excessively time-consuming grid search;
• is suboptimal, because grid search discretizes the hyper-parameter space.

Our solution: Bayesian Multi-Hyperplane Machine (BAMM)
• Construct a probabilistic view whose maximum a posteriori (MAP) estimation reduces to the optimization problem of the MM approach. This allows us to endow prior distributions over the hyper-parameters.
• Apply a data augmentation technique and stochastic gradient descent (SGD) to efficiently infer model parameters and hyper-parameters on large-scale datasets.

Optimization Problem (OP) for MM approach
• Given an $M$-class training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$.
• The MM approach aims to learn $\gamma = \{w_{1,1}, \dots, w_{1,K_1}, \dots, w_{M,1}, \dots, w_{M,K_M}\}$, where $w_{m,k}$ is a hyperplane and $K_m$ is the number of hyperplanes for the $m$-th class.
• The optimization problem for the MM approach is:
$$\min_{\gamma} \; \frac{\alpha}{2}\|\gamma\|_{2,2}^2 + \beta\|\gamma\|_{2,1} + \sum_{n=1}^{N} \ell(\gamma; x_n, y_n, z_n) \quad (1)$$
where $\alpha, \beta$ are hyper-parameters, $z_n$ is a latent discrete variable indicating the hyperplane that gives the score for the instance $x_n$, and $\ell(\gamma; x_n, y_n, z_n)$ is the loss function.
• The OP is solved by a two-step iterative procedure:
  - Keeping the latent variables $z_{1:N}$ fixed, find the optimal $\gamma$.
  - Keeping $\gamma$ fixed, find the optimal assignment
$$z_n = \operatorname*{argmax}_{k \in \{1,\dots,K_{y_n}\}} w_{y_n,k}^{\top} x_n \quad (2)$$

Bayesian Formulation for MM approach
• The OP given the latent variables $z_{1:N}$ is a MAP estimation of the following posterior distribution:
$$p(\gamma \mid X, \bar{y}, z, \alpha, \beta) \propto p(\bar{y} \mid \gamma, X, z)\, p(\gamma \mid \alpha, \beta) \quad (3)$$
where
$$p(\bar{y} \mid \gamma, X, z) = \prod_{n=1}^{N} \exp\bigl(-\ell(\gamma; x_n, y_n, z_n)\bigr) \quad (5)$$
$$p(\gamma \mid \alpha, \beta) = \exp\Bigl(-\frac{\alpha}{2}\|\gamma\|_{2,2}^2 - \beta\|\gamma\|_{2,1}\Bigr) \quad (6)$$
• To mimic Eq. (2), we use the softmax function to specify a probability for $z_n$:
$$p(z = k \mid \gamma, x_n, y_n) = \mathbb{1}_{\delta_n > 0}\, \frac{\exp\bigl(w_{y_n,k}^{\top} x_n\bigr)}{Z(\gamma, x_n, y_n)} \quad (7)$$
$$p(z = K_{y_n}+1 \mid \gamma, x_n, y_n) = \frac{\mathbb{1}_{\delta_n > 0}}{Z(\gamma, x_n, y_n)} + \mathbb{1}_{\delta_n < 0}$$
where $Z(\gamma, x_n, y_n) = 1 + \sum_{k=1}^{K_{y_n}} \exp\bigl(w_{y_n,k}^{\top} x_n\bigr)$, and $\delta_n = w_{y_n}(x_n) - w_{\neg y_n}(x_n)$, in which $w_{y_n}(x_n)$ is the score assigned to the true class and $w_{\neg y_n}(x_n)$ is the score assigned to the remaining classes.

Data Augmentation approach to sample $\gamma$
• Sampling $\gamma$ directly from Eq. (3) is intractable.
• Our solution: employ the data augmentation technique of Polson and Scott (2011) by jointly sampling $\gamma$ with auxiliary variables $\lambda = \{\lambda_{1,1}, \dots, \lambda_{1,K_1}, \dots, \lambda_{M,1}, \dots, \lambda_{M,K_M}\}$. We then have the joint distribution:
$$p(w_{m,k}, \lambda_{m,k} \mid \alpha, \beta) = \frac{1}{\sqrt{2\pi\lambda_{m,k}}} \exp\Bigl(-\frac{\lambda_{m,k} + \bigl(\beta^2\lambda_{m,k}^{-1} + \alpha\bigr)\|w_{m,k}\|_2^2}{2}\Bigr)$$
• Now, we can infer $\gamma$, $\lambda$, $z$ via Gibbs-style samplers.

Learning model hyper-parameters $\alpha$ and $\beta$
• We can further learn the model hyper-parameters by endowing prior distributions over them:
  ▪ $p(\alpha)$ is a Gamma distribution $\mathcal{G}(a_0, b_0)$
  ▪ $p(\beta)$ is a truncated Normal distribution $\mathcal{TN}(\mu_0, \sigma_0^2, 0, +\infty)$
• Using the conjugacy of the Normal-Gamma distribution, we can derive the form of the posterior distributions for $\alpha$ and $\beta$.
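To make the scoring and latent-assignment rules concrete, here is a minimal NumPy sketch (my illustration, not the authors' code; all names are mine) of the per-class multi-hyperplane score, the hard assignment of Eq. (2), and its softmax relaxation from Eq. (7), omitting the indicator on δ_n for brevity:

```python
import numpy as np

def mm_score(W, x):
    """Per-class score: each class m keeps K_m hyperplanes stacked in
    W[m] (shape K_m x d); the class score is the max dot product."""
    return np.array([np.max(W_m @ x) for W_m in W])

def assign_z(W, x, y):
    """Hard assignment of Eq. (2): index of the true-class hyperplane
    achieving the largest score for instance x."""
    return int(np.argmax(W[y] @ x))

def softmax_z(W, x, y):
    """Softmax relaxation from Eq. (7): probabilities over the K_y
    hyperplanes of the true class; the remaining mass 1/Z goes to the
    extra state z = K_y + 1."""
    s = np.exp(W[y] @ x)
    Z = 1.0 + s.sum()
    return s / Z

# toy example: 2 classes with 3 and 2 hyperplanes over 4 features
rng = np.random.default_rng(0)
W = [rng.standard_normal((3, 4)), rng.standard_normal((2, 4))]
x = np.array([1.0, -0.5, 0.2, 0.7])
scores = mm_score(W, x)   # one score per class
z = assign_z(W, x, y=0)   # winning hyperplane of class 0
p = softmax_z(W, x, y=0)  # soft assignment over class-0 hyperplanes
```

Prediction then simply takes the class with the highest score, while the soft assignment is what makes Gibbs-style sampling of z tractable.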
Graphical model representation and generative process of BAMM
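As a quick sanity check (mine, not part of the poster) on the Polson and Scott (2011) data augmentation mentioned above, the normal scale-mixture identity it rests on, ∫₀^∞ (2πλ)^(−1/2) exp(−(λ + a²/λ)/2) dλ = exp(−|a|) with a = β‖w‖, can be verified numerically; `mixture_integral` is an illustrative helper:

```python
import numpy as np

def mixture_integral(a, n=400_000, upper=80.0):
    """Trapezoid-rule approximation of the scale-mixture integral
    int_0^inf (2*pi*lam)^(-1/2) * exp(-(lam + a**2/lam)/2) d lam,
    which should recover exp(-|a|)."""
    lam = np.linspace(1e-6, upper, n)
    f = (2.0 * np.pi * lam) ** -0.5 * np.exp(-(lam + a * a / lam) / 2.0)
    h = lam[1] - lam[0]
    return float(((f[:-1] + f[1:]) * h / 2.0).sum())
```

Integrating out the auxiliary variable thus recovers the exp(−β‖w‖) factor of the prior, which is why marginalizing λ gives back the original MAP objective.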
Experiments
• We use 5 benchmark datasets from the LIBSVM repository.
• We compare our BAMM with SAMM (Nguyen et al., 2016) and AMM (Wang et al., 2011).
• To put the multi-hyperplane results in context, we also compare with the kernelized multiclass SVM (KSVM) (Crammer and Singer, 2002).

Dataset  | Total running time (hours)     | Accuracy (%)
         | BAMM   SAMM   AMM    KSVM      | BAMM   SAMM   AMM    KSVM
usps     | 0.04   1.17   0.55   2.29      | 93.19  92.86  92.63  95.32
ijcnn1   | 0.09   4.84   1.39   6.62      | 97.86  97.58  78.48  98.13
a9a      | 0.18   3.61   1.42   11.15     | 83.01  84.31  84.87  85.09
mnist    | 0.71   69.39  3.32   126.89    | 95.61  94.80  94.78  98.57
webspam  | 4.80   99.85  63.66  1,855.46  | 97.47  97.36  82.99  99.12

Table 1: Total running time (including the time spent searching for the best hyper-parameters plus training time) and accuracy comparison

            | a9a    ijcnn1  mnist  usps   webspam
α (× 10⁻³)  | 0.68   5.28    0.96   1.38   1.06
β (× 10⁻³)  | 1.49   2.56    1.18   1.47   1.39

Table 2: The optimal hyper-parameters α and β automatically inferred by BAMM
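As a closing illustration, the Gibbs-style updates described above can be sketched in heavily simplified form (my own sketch, not the authors' implementation): the auxiliary λ is drawn via its inverse-Gaussian reciprocal, a standard consequence of the scale-mixture form of the joint, and the weights are then updated by a gradient step on the augmented objective. For brevity it assumes one hyperplane per class (so z is trivial) and a batch multiclass hinge loss:

```python
import numpy as np

def gibbs_sweep(W, X, y, alpha, beta, rng, lr=0.01):
    """One simplified Gibbs-style sweep (one hyperplane per class).
    W: (M, d) weights; X: (N, d) instances; y: (N,) integer labels."""
    N = len(y)
    # 1) auxiliary variables: 1/lambda_m ~ InverseGaussian(1/(beta*||w_m||), 1)
    lam = np.array([1.0 / rng.wald(1.0 / (beta * np.linalg.norm(w) + 1e-12), 1.0)
                    for w in W])
    lam = np.clip(lam, 1e-3, 1e6)  # guard against extreme draws in this toy sketch
    # 2) multiclass hinge gradient (with one hyperplane per class, z_n is trivial)
    scores = X @ W.T
    margins = scores - scores[np.arange(N), y][:, None] + 1.0
    margins[np.arange(N), y] = 0.0
    k = np.argmax(margins, axis=1)  # most-violating class per instance
    grad = np.zeros_like(W)
    for n in np.where(margins[np.arange(N), k] > 0)[0]:
        grad[k[n]] += X[n]
        grad[y[n]] -= X[n]
    # 3) gradient step on the augmented objective: loss + per-class ridge term
    grad += (alpha + beta ** 2 / lam)[:, None] * W
    return W - lr * grad, lam

# usage on a separable toy problem
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W = 0.01 * rng.standard_normal((2, 2))
for _ in range(50):
    W, lam = gibbs_sweep(W, X, y, alpha=0.01, beta=0.1, rng=rng)
```

In the full model each class would carry several hyperplanes, z would be resampled from the softmax in Eq. (7), and the hyper-parameters would be resampled from their conjugate posteriors rather than fixed.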