BAYESIAN MULTI-HYPERPLANE MACHINE
Khanh Nguyen, Trung Le, Tu Dinh Nguyen, Dinh Phung
Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia

Problem
The current multi-hyperplane machine (MM) approach deals with high-dimensional and complex datasets by approximating the decision boundary with a parametric mixture of hyperplanes in the input space. However, it:
• requires an excessively time-consuming grid search over the hyper-parameters, and
• is suboptimal, owing to the discretization used in the grid search.

Our solution: Bayesian Multi-Hyperplane Machine (BAMM)
• Construct a probabilistic view whose maximum a posteriori (MAP) estimation reduces to the optimization problem of the MM approach, which allows us to endow prior distributions over the hyper-parameters.
• Apply a data augmentation technique and stochastic gradient descent (SGD) to efficiently infer model parameters and hyper-parameters on large-scale datasets.

Optimization Problem (OP) for MM approach
• Given an $M$-class training set $\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^{N}$.
• The MM approach aims to learn $\boldsymbol{W} = [\boldsymbol{w}_{1,1}, \dots, \boldsymbol{w}_{1,K_1}, \dots, \boldsymbol{w}_{M,1}, \dots, \boldsymbol{w}_{M,K_M}]$, where $\boldsymbol{w}_{m,k}$ is a hyperplane and $K_m$ is the number of hyperplanes for the $m$-th class.
• The optimization problem for the MM approach is:

$$\min_{\boldsymbol{W}} \; \frac{\alpha}{2}\,\|\boldsymbol{W}\|_{2,2}^{2} + \beta\,\|\boldsymbol{W}\|_{2,1} + \sum_{n=1}^{N} l(\boldsymbol{W}; \boldsymbol{x}_n, y_n, z_n) \quad (1)$$

where $\alpha, \beta$ are hyper-parameters, $z_n$ is a latent discrete variable indicating the hyperplane that gives the score for the instance $\boldsymbol{x}_n$, and $l(\boldsymbol{W}; \boldsymbol{x}_n, y_n, z_n)$ is the loss function.
• The OP is solved by a two-step iterative procedure (sketched below):
  - Keeping the latent variables $z_{1:N}$ fixed, find the optimal $\boldsymbol{W}$.
  - Keeping $\boldsymbol{W}$ fixed, find the optimal assignment

$$z_n = \operatorname*{argmax}_{k \in \{1, \dots, K_{y_n}\}} \boldsymbol{w}_{y_n,k}^{T} \boldsymbol{x}_n \quad (2)$$
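A minimal sketch of this two-step procedure, assuming an AMM-style multiclass hinge loss, SGD steps for the loss, and proximal updates for the two regularizers; the function name, step size, and loss form are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def fit_mm(W, X, y, alpha=1e-3, beta=1e-3, eta=0.1, epochs=10, seed=0):
    """Alternate between assigning hyperplanes (Eq. 2) and updating W (Eq. 1)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    for _ in range(epochs):
        # Step 2: z_n = argmax_k w_{y_n,k}^T x_n over the true class's hyperplanes.
        z = np.array([int(np.argmax(W[int(y[n])] @ X[n])) for n in range(N)])
        # Step 1: with z fixed, take SGD steps on the regularized hinge loss.
        for n in rng.permutation(N):
            g = {m: float((Wm @ X[n]).max()) for m, Wm in W.items()}
            rival = max((m for m in W if m != y[n]), key=g.get)
            if W[int(y[n])][z[n]] @ X[n] - g[rival] < 1.0:  # margin violated
                W[int(y[n])][z[n]] += eta * X[n]
                W[rival][int(np.argmax(W[rival] @ X[n]))] -= eta * X[n]
            for m in W:
                W[m] /= 1.0 + eta * alpha        # ridge term ||W||_{2,2}^2
                norms = np.linalg.norm(W[m], axis=1, keepdims=True) + 1e-12
                W[m] *= np.maximum(0.0, 1.0 - eta * beta / norms)  # ||W||_{2,1} prox
    return W, z
```

Here W maps each class m to a (K_m, d) array of hyperplanes; the per-row soft-thresholding implements the group sparsity of the (2,1)-norm, which can shrink unused hyperplanes to zero.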

Bayesian Formulation for MM approach
• The OP given the latent variables $z_{1:N}$ is a MAP estimation of the following posterior distribution:

$$p(\boldsymbol{W} \mid \boldsymbol{X}, \hat{\boldsymbol{y}}, \boldsymbol{z}, \alpha, \beta) \propto p(\hat{\boldsymbol{y}} \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{z})\, p(\boldsymbol{W} \mid \alpha, \beta) \quad (3)$$

where

$$p(\hat{\boldsymbol{y}} \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{z}) = \prod_{n=1}^{N} \exp\!\big(-l(\boldsymbol{W}; \boldsymbol{x}_n, y_n, z_n)\big) \quad (5)$$

$$p(\boldsymbol{W} \mid \alpha, \beta) = \exp\!\Big(-\frac{\alpha}{2}\,\|\boldsymbol{W}\|_{2,2}^{2} - \beta\,\|\boldsymbol{W}\|_{2,1}\Big) \quad (6)$$

• To mimic Eq. (2), we use the softmax function to specify a probability for $z_n$:

$$p(z_n = k \mid \boldsymbol{W}, \boldsymbol{x}_n, y_n) = \mathbb{I}_{\delta_n > 0}\, \frac{e^{\boldsymbol{w}_{y_n,k}^{T} \boldsymbol{x}_n}}{Z(\boldsymbol{W}, \boldsymbol{x}_n, y_n)}, \qquad p(z_n = K_{y_n}\!+\!1 \mid \boldsymbol{W}, \boldsymbol{x}_n, y_n) = \frac{\mathbb{I}_{\delta_n > 0}}{Z(\boldsymbol{W}, \boldsymbol{x}_n, y_n)} + \mathbb{I}_{\delta_n < 0} \quad (7)$$

where $Z(\boldsymbol{W}, \boldsymbol{x}_n, y_n) = 1 + \sum_{k=1}^{K_{y_n}} e^{\boldsymbol{w}_{y_n,k}^{T} \boldsymbol{x}_n}$, and $\delta_n = g_{y_n}(\boldsymbol{x}_n) - g_{-y_n}(\boldsymbol{x}_n)$, in which $g_{y_n}(\boldsymbol{x}_n)$ is the score assigned to the true class and $g_{-y_n}(\boldsymbol{x}_n)$ is the highest score among the remaining classes.
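A small sketch of drawing $z_n$ from Eq. (7), assuming $g_{-y_n}$ is the best score over the other classes; the layout of W and the numerically stabilized normalization are illustrative assumptions:

```python
import numpy as np

def sample_z(W, x, y_true, rng=np.random.default_rng()):
    """Draw z_n from the softmax in Eq. (7); the extra outcome K_y + 1
    absorbs points with delta <= 0 (misclassified by the current W)."""
    s = W[y_true] @ x                                    # w_{y_n,k}^T x_n for all k
    g_rival = max(float((W[m] @ x).max()) for m in W if m != y_true)
    delta = float(s.max()) - g_rival                     # delta_n
    if delta > 0:
        e = np.exp(s - s.max())                          # stabilized exponentials
        Z = np.exp(-s.max()) + e.sum()                   # Z = 1 + sum_k e^{s_k}, rescaled
        probs = np.append(e / Z, np.exp(-s.max()) / Z)   # k = 1..K_y and K_y + 1
    else:
        probs = np.append(np.zeros(len(s)), 1.0)         # all mass on K_y + 1
    return int(rng.choice(len(probs), p=probs))
```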

Graphical model representation and generative process of BAMM
[Figure omitted: graphical model of BAMM.]

Data Augmentation approach to sample $\boldsymbol{W}$
• Sampling $\boldsymbol{W}$ directly from Eq. (3) is intractable.
• Our solution: employ a data augmentation technique (Polson and Scott, 2011), jointly sampling $\boldsymbol{W}$ with auxiliary variables $\boldsymbol{\lambda} = [\lambda_{1,1}, \dots, \lambda_{1,K_1}, \dots, \lambda_{M,1}, \dots, \lambda_{M,K_M}]$. Then we have the joint distribution:

$$p(\boldsymbol{w}_{m,k}, \lambda_{m,k} \mid \alpha, \beta) = \frac{1}{\sqrt{2\pi\lambda_{m,k}}} \exp\!\Big(-\frac{1}{2}\Big[\lambda_{m,k} + \big(\beta^{2}\lambda_{m,k}^{-1} + \alpha\big)\,\|\boldsymbol{w}_{m,k}\|_{2}^{2}\Big]\Big)$$

Now we can infer $\boldsymbol{W}, \boldsymbol{z}, \boldsymbol{\lambda}$ via Gibbs-style samplers (sketched below).

Learning model hyper-parameters $\alpha$ and $\beta$
• We can further learn the model hyper-parameters by endowing them with prior distributions:
  - $p(\alpha)$ is a Gamma distribution $\mathcal{G}(\kappa_0, \theta_0)$;
  - $p(\beta)$ is a truncated normal distribution $\mathcal{TN}(\mu_0, \sigma_0^2, 0, +\infty)$.
• Using the conjugacy of the Normal–Gamma distribution, we can derive the form of the posterior distributions for $\alpha$ and $\beta$.
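A hypothetical sketch of one Gibbs sweep over $\boldsymbol{\lambda}$, $\alpha$, and $\beta$ in the augmented model. It uses the standard result that $\lambda_{m,k}^{-1} \mid \boldsymbol{w}_{m,k}$ is inverse-Gaussian under the scale-mixture identity $e^{-a} = \int_0^{\infty} (2\pi\lambda)^{-1/2}\, e^{-(\lambda + a^2/\lambda)/2}\, d\lambda$, and Normal–Gamma conjugacy for $\alpha$; the conditional for $\beta$ drops $\beta$-dependent normalizing constants and $\theta_0$ is treated as a scale parameter, so this is an approximation, not the authors' exact sampler:

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep(W, lam, alpha, beta, rng,
                kappa0=1.0, theta0=1.0, mu0=0.0, sigma0=1.0):
    """One Gibbs-style sweep for lambda, alpha, beta in the augmented model.
    W maps class m to a (K_m, d) array; lam mirrors W with one lambda per row."""
    norms = {m: np.linalg.norm(Wm, axis=1) for m, Wm in W.items()}  # ||w_{m,k}||_2
    # lambda_{m,k} | w: 1/lambda is inverse-Gaussian (scale-mixture identity).
    for m in W:
        lam[m] = 1.0 / rng.wald(1.0 / (beta * norms[m] + 1e-12), 1.0)
    # alpha | W: Gamma posterior via Normal-Gamma conjugacy (theta0 as scale).
    n_weights = sum(Wm.size for Wm in W.values())
    sq = sum(float((Wm ** 2).sum()) for Wm in W.values())           # ||W||_{2,2}^2
    alpha = rng.gamma(kappa0 + 0.5 * n_weights, 1.0 / (1.0 / theta0 + 0.5 * sq))
    # beta | W: truncated normal on (0, +inf).
    l21 = sum(float(n.sum()) for n in norms.values())               # ||W||_{2,1}
    mu = mu0 - sigma0 ** 2 * l21
    beta = truncnorm.rvs((0.0 - mu) / sigma0, np.inf, loc=mu, scale=sigma0,
                         random_state=rng)
    return lam, float(alpha), float(beta)
```

Sampling $\boldsymbol{W}$ given $\boldsymbol{\lambda}, \boldsymbol{z}$ would complete the sweep; its conditional depends on the loss term and is omitted here.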

Experiments
• We use 5 benchmark datasets from the LIBSVM repository.
• We compare our BAMM with SAMM (Nguyen et al., 2016) and AMM (Wang et al., 2011).
• To put the multi-hyperplane methods in context, we also compare with the kernelized multiclass SVM (KSVM) (Crammer and Singer, 2002).

                 Total running time (hours)          Accuracy (%)
Dataset      BAMM    SAMM     AMM      KSVM      BAMM    SAMM    AMM     KSVM
usps         0.04    1.17     0.55       2.29    93.19   92.86   92.63   95.32
ijcnn1       0.09    4.84     1.39       6.62    97.86   97.58   78.48   98.13
a9a          0.18    3.61     1.42      11.15    83.01   84.31   84.87   85.09
mnist        0.71   69.39     3.32     126.89    95.61   94.80   94.78   98.57
webspam      4.80   99.85    63.66   1,855.46    97.47   97.36   82.99   99.12

Table 1: Total running time (including the time to search for the best hyper-parameters plus the training time) and accuracy comparison.

              a9a    ijcnn1   mnist   usps   webspam
α (×10⁻³)    0.68      5.28    0.96   1.38      1.06
β (×10⁻³)    1.49      2.56    1.18   1.47      1.39

Table 2: The optimal hyper-parameters α and β automatically inferred by BAMM.
