The Formulation Cookbook

Author: Li Niu

Last Update: January 14, 2018

Contents

1 Fundamental
  1.1 Matrix
  1.2 Probability
    1.2.1 Gaussian Distribution
  1.3 Functions
  1.4 Equations
2 EM Algorithm
  2.1 Posterior Probability
  2.2 Marginal Likelihood
3 Integer Programming
  3.1 Relax Quadratic Problem to Linear Problem
  3.2 Quadratic Binary Integer Programming
  3.3 Mixed with Continuous Variables
4 Eigen Decomposition
  4.1 TCA and FDA
  4.2 Canonical Correlation Analysis
5 Orthogonal Constraint
6 Lagrangian Multiplier
  6.1 Max Entropy Discrimination
7 GMM
  7.1 Fisher Vector
  7.2 Adaptive GMM
  7.3 Interpolation of GMMs
8 HMM
9 Graph
  9.1 Label Propagation
  9.2 Absorbing Markov Chain Process
  9.3 Exact Inference
    9.3.1 Sum-Product Algorithm
  9.4 Approximate Inference
    9.4.1 Stochastic Approximation
    9.4.2 Deterministic Approximation
10 Lassos
  10.1 Lasso
  10.2 Group Lasso
  10.3 Trace Lasso
11 Low Rank
  11.1 Rank Constraint
  11.2 Nuclear Norm
  11.3 Low-rank Surrogate
12 SVM
  12.1 Lagrangian Multiplier
  12.2 SMO algorithm
  12.3 Multi-class SVM
  12.4 Latent Structural SVM
  12.5 Multiple Kernel Learning
  12.6 Indefinite Kernel
  12.7 AP SVM
13 SVR
14 Grassmann Manifold
  14.1 Intermediate Subspace
15 Gaussian Process
  15.1 Gaussian Process for Regression
  15.2 Gaussian Process for Classification
  15.3 Twin Gaussian Process
16 Back Propagation
  16.1 Convolutional Layer and Fully Connected Layer
  16.2 Pooling Layer
  16.3 Local Response Normalization (LRN) Layer
  16.4 Softmax Layer
17 Optimization
  17.1 Difference of Convex Functions
    17.1.1 Difference of Convex (DC) Programming
    17.1.2 Concave-Convex Procedure (CCCP)
  17.2 Gradient Descent
    17.2.1 Unconstrained Gradient Descent
    17.2.2 Constrained Gradient Descent
  17.3 Gradient Method on the Manifold
18 Theorems
  18.1 Banach Fixed Point Theorem

1 Fundamental

1.1 Matrix

Block matrix inversion formula:
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{bmatrix}, \quad (1)$$
where $M = (A - BD^{-1}C)^{-1}$.

Woodbury matrix identity:
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}. \quad (2)$$

Derivative of Trace: Just remember one basic equation, $\frac{\partial\,\mathrm{tr}(AXB)}{\partial X} = A^TB^T$. Then we can have $\frac{\partial\,\mathrm{tr}(AX^TB)}{\partial X} = BA$. Generally, for any complicated formula with an arbitrary number of $X$ and $X^T$, the derivative can be derived in the same way:
$$\frac{\partial\,\mathrm{tr}(AXBX^TC)}{\partial X} = A^T(BX^TC)^T + C(AXB) = A^TC^TXB^T + CAXB. \quad (3)$$

Derivative of Simple Forms:
$$\frac{\partial\,\|AXB + C\|_F^2}{\partial X} = 2A^T(AXB + C)B^T, \quad (4)$$
$$\frac{\partial\,\langle C, AXB\rangle}{\partial X} = A^TCB^T. \quad (5)$$

1.2 Probability

If x is independent of y given z, we denote this as I(x, z, y), which is equivalent to
$$p(x|y, z) = p(x|z), \quad (6)$$
$$p(x, y|z) = p(x|z)\,p(y|z). \quad (7)$$

I(x, z, y) satisfies the following properties:

• Symmetry: I(x, z, y) → I(y, z, x)
• Decomposition: I(x, z, y ∪ w) → I(x, z, y)
• Weak Union: I(x, z, y ∪ w) → I(x, z ∪ w, y)
• Contraction: I(x, z ∪ y, w) & I(x, z, y) → I(x, z, y ∪ w)
• Intersection: I(x, z ∪ w, y) & I(x, z ∪ y, w) → I(x, z, y ∪ w)

1.2.1 Gaussian Distribution

The Gaussian distribution is the most commonly used distribution because of its nice properties.

KL divergence: The Kullback-Leibler divergence from N(µ0, Σ0) to N(µ1, Σ1) is
$$\mathrm{KL}(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)) = \frac{1}{2}\Big\{\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1-\mu_0)^T\Sigma_1^{-1}(\mu_1-\mu_0) - k + \ln\frac{|\Sigma_1|}{|\Sigma_0|}\Big\}, \quad (8)$$

in which k is the dimension of the vector space.

Overlap of two Gaussian distributions: To compute the overlap of two probability densities f and g, we can integrate their product: $\int f(x)g(x)\,dx$. Intuitively, the more overlap there is between the densities, the bigger their product. This overlap measure follows similar principles as the Bhattacharyya coefficient. By analytically integrating the product of two normal distributions, we obtain
$$\int \mathcal{N}(x|\mu_0,\sigma_0^2)\,\mathcal{N}(x|\mu_1,\sigma_1^2)\,dx = \mathcal{N}(\mu_0-\mu_1|0,\,\sigma_0^2+\sigma_1^2). \quad (9)$$
The normalized overlap [1] is
$$\frac{\mathcal{N}(\mu_0-\mu_1|0,\,\sigma_0^2+\sigma_1^2)}{\mathcal{N}(0|0,\,\sigma_0^2+\sigma_1^2)}. \quad (10)$$

1.3 Functions

Sigmoid Function: $\delta(x) = \frac{1}{1+e^{-x}}$ has the following nice properties:

• $\delta(-x) = 1 - \delta(x)$
• $\delta'(x) = \delta(x)\delta(-x) = \delta(x)(1-\delta(x))$

1.4 Equations

Sylvester Equation:
$$AX + XB = C, \quad (11)$$
where $X \in \mathbb{R}^{n\times m}$, $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{m\times m}$. The problem in (11) can be rewritten in the form
$$(I_m \otimes A + B^T \otimes I_n)\,\mathrm{vec}(X) = \mathrm{vec}(C), \quad (12)$$
where $\otimes$ denotes the Kronecker product and $\mathrm{vec}$ is the vectorization operator.

For a special case of the Sylvester equation, the solution is much more efficient. In particular, consider the following form
$$AXB + X = C, \quad (13)$$
where A and B are both real symmetric matrices. We perform SVD on A and B, i.e., $A = U\Sigma_A U^T$, $B = V\Sigma_B V^T$. By multiplying (13) by $U^T$ on the left and by $V$ on the right, we can reach
$$\Sigma_A U^TXV\Sigma_B + U^TXV = U^TCV. \quad (14)$$
We use $\tilde{X}$ (resp., $\tilde{C}$) to denote $U^TXV$ (resp., $U^TCV$), and then the problem in (14) can be written as
$$\Sigma_A \tilde{X}\Sigma_B + \tilde{X} = \tilde{C}. \quad (15)$$
It is obvious that $\tilde{X}_{i,j} = \frac{\tilde{C}_{i,j}}{\delta_A^i \delta_B^j + 1}$, where $\delta_A^i$ (resp., $\delta_B^j$) is the i-th (resp., j-th) diagonal element of $\Sigma_A$ (resp., $\Sigma_B$). After obtaining $\tilde{X}$, we can recover X by using $X = U\tilde{X}V^T$.

2 EM Algorithm

2.1 Posterior Probability

Assume the input is x, the output is y, the latent variable is z, and the model parameters are collectively denoted as θ. Then the incomplete log-likelihood is defined as log p(y|x; θ)

and the complete log-likelihood is defined as log p(y, z|x; θ). The incomplete log-likelihood can be represented as
$$L(\theta) = \log p(y|x;\theta) \quad (16)$$
$$= \log\sum_z p(y,z|x;\theta) \quad (17)$$
$$= \log\sum_z q(z|x,y)\frac{p(y,z|x;\theta)}{q(z|x,y)} \quad (18)$$
$$\geq \sum_z q(z|x,y)\log\frac{p(y,z|x;\theta)}{q(z|x,y)} = l(\theta,q) \quad (19)$$
$$= \sum_z q(z|x,y)\log p(y|x;\theta) + \sum_z q(z|x,y)\log\frac{p(z|x,y;\theta)}{q(z|x,y)} \quad (20)$$
$$= \log p(y|x;\theta) - \mathrm{KL}\big[q(z|x,y)\,\|\,p(z|x,y;\theta)\big] \quad (21)$$
$$= L(\theta) - \mathrm{KL}\big[q(z|x,y)\,\|\,p(z|x,y;\theta)\big], \quad (22)$$

in which the derivation from step (18) to step (19) is due to the concavity of the log function (i.e., Jensen's inequality). Note that l(θ, q) is a lower bound of L(θ), where l(θ, q) is a function w.r.t. {θ, q} and L(θ) is a function w.r.t. θ. We have l(θ, q) = L(θ) iff q(z|x, y) = p(z|x, y; θ).

E-step: We expect the KL divergence in (22) to be zero, so $q(z|x,y) = p(z|x,y;\theta^t)$. When fixing the model parameters $\theta^t$ from the t-th iteration, the probability of the latent variables can be derived as follows,
$$p(z|x,y;\theta^t) = \frac{p(y,z|x;\theta^t)}{p(y|x;\theta^t)} \quad (23)$$
$$= \frac{p(y|z;\theta^t)\,p(z|x;\theta^t)}{\sum_{\tilde{z}} p(y|\tilde{z};\theta^t)\,p(\tilde{z}|x;\theta^t)}. \quad (24)$$

M-step: The lower bound of log p(y|x; θ), i.e., (19) (the expected complete log-likelihood), can be written as
$$Q(\theta;\theta^t) = \sum_z p(z|y,x;\theta^t)\log p(y,z|x;\theta). \quad (25)$$
Then, the derivative of Q(θ; θ^t) w.r.t. θ can be written as
$$\frac{\partial Q(\theta;\theta^t)}{\partial\theta} = \sum_z p(z|y,x;\theta^t)\frac{\partial}{\partial\theta}\log p(y,z|x;\theta). \quad (26)$$

Figure 1: Illustration of EM optimization.

The idea of EM is illustrated in Figure 1. In the t-th step, we first find q(z|x, y) = p(z|x, y; θ^t) so that L(θ^t) = l(θ^t, q), and then optimize l(θ, q) over θ to get θ^{t+1}. Since L(θ) is an upper bound of l(θ, q), we have L(θ^{t+1}) ≥ l(θ^{t+1}, q) ≥ l(θ^t, q) = L(θ^t), so the likelihood never decreases.

2.2 Marginal Likelihood

$$L(\theta) = \log p(x;\theta) \quad (27)$$
$$= \log\sum_z p(x,z;\theta) \quad (28)$$
$$= \log\sum_z q(z|x)\frac{p(x,z;\theta)}{q(z|x)} \quad (29)$$
$$\geq \sum_z q(z|x)\log\frac{p(x,z;\theta)}{q(z|x)} = l(\theta,q) \quad (30)$$
$$= \sum_z q(z|x)\log\frac{p(z|x;\theta)\,p(x;\theta)}{q(z|x)} \quad (31)$$
$$= \log p(x;\theta) - \mathrm{KL}\big[q(z|x)\,\|\,p(z|x;\theta)\big]. \quad (32)$$

E-step:
$$q(z|x) = p(z|x;\theta^t) \quad (33)$$
$$= \frac{p(x|z;\theta^t)\,p(z;\theta^t)}{\sum_z p(x|z;\theta^t)\,p(z;\theta^t)}. \quad (34)$$

M-step:
$$Q(\theta;\theta^t) = \sum_z p(z|x;\theta^t)\log p(x,z;\theta) \quad (35)$$
$$= \sum_z p(z|x;\theta^t)\log p(x|z;\theta)\,p(z;\theta). \quad (36)$$
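As a concrete illustration of the E-step (33)-(34) and M-step (35)-(36) for a mixture model, here is a minimal NumPy sketch of EM for a two-component 1D Gaussian mixture (the model choice and all names are my own, not from the text):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a 2-component 1D Gaussian mixture: alternate responsibilities (E) and parameter updates (M)."""
    pi, mu, var = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior q(z|x) = p(z|x; theta_t), cf. (33)-(34)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete log-likelihood, cf. (35)-(36)
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm_1d(x))
```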

3 Integer Programming

The package SCIP (i.e., Solving Constraint Integer Programs) can be used to solve mixed-integer programming. Software is available at http://scip.zib.de/.

3.1 Relax Quadratic Problem to Linear Problem

To solve the binary integer programming problem $\max_y \|U(\alpha\circ y)\|_2$, we can convert this quadratic problem to a linear problem by using the $L_\infty$ norm to relax the $L_2$ norm:
$$\max_y \|U(\alpha\circ y)\|_\infty = \max_y\Big(\max_j\Big|\sum_{i=1}^n\alpha_i U_{ij}y_i\Big|\Big), \quad (37)$$
which can be efficiently solved by simply sorting the coefficients $\alpha_i U_{ij}$'s for each j.

3.2 Quadratic Binary Integer Programming

Example 1:
$$\min_{x_{ct}}\;\sum_t x_{ct}\Big(d_{ct} + \sum_{t'}\sum_{c'} d_{cc'}x_{c't'}\Big) \quad (38)$$
$$\text{s.t.}\;\; x_{ct}\in\{0,1\},\quad \forall c,\forall t, \quad (39)$$
$$\qquad d_{ct}\geq 0,\quad \forall c,\forall t, \quad (40)$$
which can be transformed into a mixed 0-1 linear programming problem using the Kaufman and Broeckx linearisation [2]. By denoting $w_{ct} = x_{ct}\big(\sum_{t'}\sum_{c'}d_{cc'}x_{c't'}\big)$ and $a_{ct} = \sum_{t'}\sum_{c'}d_{cc'}$, the problem in (38) can be converted to the following problem:
$$\min_{x_{ct},\,w_{ct}}\;\sum_t\sum_c\big(d_{ct}x_{ct} + w_{ct}\big) \quad (41)$$
$$\text{s.t.}\;\; a_{ct}x_{ct} + \sum_{t'}\sum_{c'}d_{cc'}x_{c't'} - w_{ct}\leq a_{ct},\quad \forall c,\forall t, \quad (42)$$
$$\qquad w_{ct}\geq 0,\; d_{ct}\geq 0,\quad \forall c,\forall t, \quad (43)$$
$$\qquad x_{ct}\in\{0,1\},\quad \forall c,\forall t. \quad (44)$$

Example 2: A general binary coding problem is as follows,
$$\min_{B\in\{-1,1\}}\; f(B), \quad (45)$$
which can be solved by the discrete proximal linearized minimization (DPLM) algorithm in [3], which is similar to projected gradient descent.

3.3 Mixed with Continuous Variables

Example 1: With one-hot binary vectors $y_i$, $\tilde{y}_i$ and $0\leq\alpha\leq 1$, the maximization problem is as follows,
$$\max_{C,t}\;\sum_{i=1}^N\Big(\log\alpha^{t_i} + \log(\tilde{y}_i^TCy_i)^{1-t_i}\Big), \quad (46)$$
where $C\in\mathbb{R}^{L\times L}$ and $t\in\{0,1\}^N$. This problem has an efficient algorithm with $O(N+L^2)$ time complexity. By denoting the optimal solution by $C^*$ and $t^*$, we can prove the following theorems:

• $C^*_{i,j}\neq 0 \Rightarrow C^*_{i,j} > \alpha$.
• $(y_i,\tilde{y}_i) = (y_j,\tilde{y}_j) \Rightarrow t^*_i = t^*_j$.
• $\tilde{y}_i^TC^*y_i > \alpha \Leftrightarrow t^*_i = 0$, and $\tilde{y}_i^TC^*y_i = 0 \Leftrightarrow t^*_i = 1$.

Each column in C can be optimized separately. Considering the c-th column c, suppose there are M samples affecting this column. We can count the frequencies of $\tilde{y}_{ck}=1$ as $m_1,\ldots,m_L$ and might as well assume $m_1\geq m_2\geq\ldots\geq m_L$. The problem is then converted to
$$\max_{c,t}\;\sum_{k=1}^L m_k\big(\log\alpha^{t_k} + \log c_k^{1-t_k}\big), \quad (47)$$
where t is a binary vector and $\sum_{k=1}^L c_k = 1$. Assuming the first $k^*$ elements of $c^*$ have values greater than α, the optimal solution $c^*$ to the above problem is
$$c^*_i = \frac{m_i}{\sum_{k=1}^{k^*}m_k}, \quad i=1,\ldots,k^*, \quad (48)$$
$$c^*_i = 0, \quad i=k^*+1,\ldots,L. \quad (49)$$

4 Eigen Decomposition

4.1 TCA and FDA

$$\min_X\;\mathrm{tr}(XAX^T) \quad (50)$$
$$\text{s.t.}\;\; XBX^T = I. \quad (51)$$
By introducing a symmetric matrix Z containing the Lagrangian multipliers for the constraints in (51), we can obtain the Lagrangian form of (50) as
$$L_{X,Z} = \mathrm{tr}(XAX^T) - \mathrm{tr}\big((XBX^T - I)Z\big). \quad (52)$$
By setting the derivative of (52) w.r.t. X to 0, we can get $XA = ZXB$, which is a generalized eigen decomposition problem w.r.t. $X^T$. By multiplying $X^T$ on the right of both sides, we can get $Z = (XAX^T)(XBX^T)^{-1}$. By substituting this equation back into (52), we arrive at
$$\min_X\;\mathrm{tr}\big((XAX^T)(XBX^T)^{-1}\big). \quad (53)$$
The problem in (53) is equivalent to the following maximization problem:
$$\max_X\;\mathrm{tr}\big((XBX^T)(XAX^T)^{-1}\big), \quad (54)$$
which shares a similar form with Fisher discriminant analysis [4] (i.e., $\max_X \frac{\mathrm{tr}(XBX^T)}{\mathrm{tr}(XAX^T)}$) and can be solved by eigen decomposition. In particular, the rows of X are the leading eigenvectors of $A^{-1}B$.
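A minimal SciPy sketch of the eigen-decomposition recipe just stated (function name and the assumption that A is symmetric positive definite are mine):

```python
import numpy as np
from scipy.linalg import eigh

def leading_generalized_eigvecs(A, B, d):
    """Rows of X = top-d generalized eigenvectors of B v = lam A v (i.e., leading eigenvectors of A^{-1}B),
    used for problems of the form (54). Assumes A symmetric positive definite and B symmetric."""
    w, V = eigh(B, A)                # generalized symmetric-definite eigenproblem
    idx = np.argsort(w)[::-1][:d]    # largest eigenvalues first
    return V[:, idx].T               # each row of X is one eigenvector

A = np.cov(np.random.randn(5, 200)) + 1e-3 * np.eye(5)   # toy SPD matrix
B = np.cov(np.random.randn(5, 200))
X = leading_generalized_eigvecs(A, B, d=2)
```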

Algorithm 1 Solving the problem in (58)
1: Initialize R = Q, and A as an empty matrix.
2: for i = 1 : k do
3:   Compute α as the leading eigenvector of R.
4:   Update A = [A, α].
5:   Update R = R − 2βαα^T.
6: end for

Some other similar problems can be solved using Lagrange multipliers.

Example 1:
$$\min_{U\geq 0}\; f(U) \quad (55)$$
$$\text{s.t.}\;\; U^T\mathbf{1} = \mathbf{1}, \quad (56)$$
which can be formulated as
$$\min_{U\geq 0}\; f(U) + \mathrm{tr}\big(\Omega(U^T\mathbf{1}-\mathbf{1})(U^T\mathbf{1}-\mathbf{1})^T\big). \quad (57)$$

Example 2:
$$\max_A\; \mathrm{tr}(A^TQA) - \beta\|A^TA - I\|_F, \quad (58)$$
in which $A^TA = I$ is a relaxed constraint. The problem in (58) can be solved using Algorithm 1.

Example 3:
$$\min_R\; \|A - RV\|_F^2, \quad (59)$$
$$\text{s.t.}\;\; R^TR = I, \quad (60)$$
which is equivalent to the classic orthogonal Procrustes problem (OPP). Assume $U\Sigma\hat{U}^T = \mathrm{svd}(AV^T)$; then $R = U\hat{U}^T$.
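A minimal NumPy sketch of the OPP solution in Example 3 (function name is mine):

```python
import numpy as np

def orthogonal_procrustes(A, V):
    """Solve min_R ||A - R V||_F^2 s.t. R^T R = I via the SVD of A V^T, cf. (59)-(60)."""
    U, _, Uhat_T = np.linalg.svd(A @ V.T)   # A V^T = U diag(s) Uhat^T
    return U @ Uhat_T                        # R = U Uhat^T

A = np.random.randn(4, 10)
V = np.random.randn(4, 10)
R = orthogonal_procrustes(A, V)
assert np.allclose(R.T @ R, np.eye(4))
```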


4.2 Canonical Correlation Analysis

Consider the projections $u = a^T\phi(x)$ and $v = b^T\psi(y)$, and assume the data are zero-mean. Then, we aim to solve the following problem:
$$\max\;\; \frac{E[uv]}{\sqrt{E[u^2]E[v^2]}}, \quad (61)$$
$$\text{s.t.}\;\; E[u^2] = 1, \quad (62)$$
$$\qquad E[v^2] = 1. \quad (63)$$
The Lagrangian of (61) can be written as follows,
$$\min_{a,b}\max_{\lambda_1,\lambda_2}\; a^T\sum_i\phi(x_i)\psi(y_i)^Tb - \frac{1}{2}\lambda_1\Big(\sum_i a^T\phi(x_i)\phi(x_i)^Ta - N\Big) - \frac{1}{2}\lambda_2\Big(\sum_i b^T\psi(y_i)\psi(y_i)^Tb - N\Big). \quad (64)$$
When $a = \sum_i\alpha_i\phi(x_i)$ and $b = \sum_i\beta_i\psi(y_i)$, (64) can be rewritten as
$$L = \alpha^TK_xK_y\beta - \frac{1}{2}\lambda_1(\alpha^TK_x^2\alpha - N) - \frac{1}{2}\lambda_2(\beta^TK_y^2\beta - N). \quad (65)$$
To avoid overfitting and ease the optimization, we add a diagonal term to the constraints in the Lagrangian and arrive at
$$L = \alpha^TK_xK_y\beta - \frac{1}{2}\lambda_1(\alpha^TK_x^2\alpha + \eta\|\alpha\|^2 - N) - \frac{1}{2}\lambda_2(\beta^TK_y^2\beta + \eta\|\beta\|^2 - N). \quad (66)$$
By setting the derivatives of (66) w.r.t. α and β to zeros, the resulting equations are
$$K_xK_y\beta = \lambda_1(K_x^2 + \eta I)\alpha, \quad (67)$$
$$K_yK_x\alpha = \lambda_2(K_y^2 + \eta I)\beta. \quad (68)$$
If we multiply (67) with $\alpha^T$ and (68) with $\beta^T$ on the left side, subtract these two equations, and consider the constraints in (62) and (63), it gives rise to $\lambda_1 = \lambda_2 = \lambda$. Then based on (67) and (68), we can have
$$\begin{bmatrix} K_x^2+\eta I & 0\\ 0 & K_y^2+\eta I\end{bmatrix}^{-1}\begin{bmatrix} 0 & K_xK_y\\ K_yK_x & 0\end{bmatrix}\begin{bmatrix}\alpha\\ \beta\end{bmatrix} = \lambda\begin{bmatrix}\alpha\\ \beta\end{bmatrix}, \quad (70)$$
which is an eigen decomposition problem.
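A minimal NumPy sketch of solving the block eigenproblem (70) for kernel CCA; the linear-kernel Gram matrices and all names are my own simplifying assumptions:

```python
import numpy as np

def kernel_cca(X, Y, eta=1e-3):
    """Solve (70) and return the leading canonical pair (alpha, beta)."""
    Kx, Ky = X @ X.T, Y @ Y.T          # Gram matrices of zero-mean data (linear kernel for brevity)
    N = Kx.shape[0]
    M = np.vstack([np.hstack([np.zeros((N, N)), Kx @ Ky]),
                   np.hstack([Ky @ Kx, np.zeros((N, N))])])
    D = np.block([[Kx @ Kx + eta * np.eye(N), np.zeros((N, N))],
                  [np.zeros((N, N)), Ky @ Ky + eta * np.eye(N)]])
    w, V = np.linalg.eig(np.linalg.solve(D, M))   # D^{-1} M [alpha; beta] = lambda [alpha; beta]
    lead = np.argmax(w.real)
    return V[:N, lead].real, V[N:, lead].real

X = np.random.randn(50, 5); X -= X.mean(0)
Y = np.random.randn(50, 3); Y -= Y.mean(0)
alpha, beta = kernel_cca(X, Y)
```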

5 Orthogonal Constraint

For optimization problems with an orthogonality constraint, besides using the generalized eigen decomposition mentioned in the previous section, there are some other general solutions.

Gradient flow:
$$\min_Q\; f(Q), \quad (71)$$
$$\text{s.t.}\;\; QQ^T = I, \quad (72)$$
which can be solved using the gradient flow in [5]. Given $Q_t$ from the previous iteration, $Q_{t+1}$ can be updated via the Cayley transformation:
$$Q_{t+1} = \Big(I + \frac{\tau}{2}\Phi_t\Big)^{-1}\Big(I - \frac{\tau}{2}\Phi_t\Big)Q_t, \quad (73)$$
in which $\Phi_t = \nabla f(Q_t)Q_t^T - Q_t\nabla f(Q_t)^T$ and τ is an approximate minimizer satisfying the Armijo-Wolfe conditions.
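A minimal sketch of one Cayley-transform update (73). A fixed step size is used instead of an Armijo-Wolfe line search, and all names and the toy objective are mine; the point is only that each update keeps Q feasible:

```python
import numpy as np

def cayley_step(Q, grad, tau=0.1):
    """One update of (73): Q stays row-orthogonal, i.e., Q Q^T = I."""
    Phi = grad @ Q.T - Q @ grad.T            # skew-symmetric Phi_t
    n = Q.shape[0]
    W = np.linalg.solve(np.eye(n) + 0.5 * tau * Phi,
                        np.eye(n) - 0.5 * tau * Phi)
    return W @ Q

# toy example: minimize f(Q) = ||Q - A||_F^2 over Q with Q Q^T = I
A = np.random.randn(3, 5)
Q, _ = np.linalg.qr(np.random.randn(5, 3)); Q = Q.T    # 3x5 with Q Q^T = I
for _ in range(100):
    Q = cayley_step(Q, 2 * (Q - A), tau=0.05)
print(np.allclose(Q @ Q.T, np.eye(3)))                 # feasibility is preserved
```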

6 Lagrangian Multiplier

The Lagrangian multiplier technique has a wide range of applications, such as SVM in Section 12 and the normalization constraints in Section 4. There are some other interesting applications.

6.1 Max Entropy Discrimination

Max Entropy Discrimination (MED) [6] has been used for classification, regression, graph inference, and so on. Taking SVM+ as an example, let us see how to convert SVM+ to an MED version.

$$\min_{w,p(\theta)}\;\; \frac{1}{2}\|w\|^2 + \gamma\,\mathrm{KL}(p(\theta)\|p_0(\theta)) + C\sum_{i=1}^N\int p(\theta)\theta^Tz_i\,d\theta \quad (74)$$
$$\text{s.t.}\;\; y_iw^Tx_i \geq 1 - \int p(\theta)\theta^Tz_i\,d\theta, \quad i=1,\ldots,N, \quad (75)$$
$$\qquad \int p(\theta)\theta^Tz_i\,d\theta \geq 0, \quad i=1,\ldots,N. \quad (76)$$
By introducing dual variables $\lambda_1$ (resp., $\lambda_2$) for the constraints in (75) (resp., (76)), the Lagrangian of (74) can be written as
$$L_{w,p(\theta)} = \frac{1}{2}\|w\|^2 + \gamma\int p(\theta)\log\frac{p(\theta)}{p_0(\theta)}d\theta + C\sum_{i=1}^N\int p(\theta)\theta^Tz_i\,d\theta \quad (77)$$
$$\qquad - \sum_{i=1}^N\lambda_{1i}\Big(y_iw^Tx_i - 1 + \int p(\theta)\theta^Tz_i\,d\theta\Big) \quad (78)$$
$$\qquad - \sum_{i=1}^N\lambda_{2i}\int p(\theta)\theta^Tz_i\,d\theta. \quad (79)$$

By setting the derivative of (77) w.r.t. w to zeros, we can have
$$w = \sum_{i=1}^N\lambda_{1i}y_ix_i. \quad (80)$$
By setting the derivative of (77) w.r.t. p(θ) to zeros, we can have
$$\gamma\log\frac{p(\theta)}{p_0(\theta)} + \gamma - \sum_{i=1}^N(\lambda_{1i}+\lambda_{2i}-C)\theta^Tz_i = 0. \quad (81)$$
Then, we can obtain the solution of p(θ) as
$$p(\theta) = \frac{1}{z(\lambda_1,\lambda_2)}\,p_0(\theta)\exp\Big(\frac{1}{\gamma}\sum_{i=1}^N(\lambda_{1i}+\lambda_{2i}-C)\theta^Tz_i\Big), \quad (82)$$
where $z(\lambda_1,\lambda_2)$ is the normalization constant. Accordingly,
$$z(\lambda_1,\lambda_2) = \int p_0(\theta)\exp\Big(\frac{1}{\gamma}\sum_{i=1}^N(\lambda_{1i}+\lambda_{2i}-C)\theta^Tz_i\Big)d\theta. \quad (83)$$
After making the prior assumption $\theta\sim\mathcal{N}(\mu,\Sigma)$ and denoting $a = \frac{1}{\gamma}\sum_{i=1}^N(\lambda_{1i}+\lambda_{2i}-C)z_i$, (83) becomes
$$z(\lambda_1,\lambda_2) = \frac{1}{\sqrt{(2\pi)^k|\Sigma|}}\int\exp\Big\{-\frac{1}{2}(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)\Big\}\exp\big(\theta^Ta\big)\,d\theta \quad (84)$$
$$= \frac{1}{\sqrt{(2\pi)^k|\Sigma|}}\int\exp\Big\{-\frac{1}{2}(\theta-\mu-\Sigma a)^T\Sigma^{-1}(\theta-\mu-\Sigma a) + \mu^Ta + \frac{a^T\Sigma a}{2}\Big\}\,d\theta \quad (85)$$
$$= \exp\Big(\mu^Ta + \frac{a^T\Sigma a}{2}\Big). \quad (86)$$
By substituting (80) and (82) into (77), the dual form of (74) to be maximized can be written as
$$-\log z(\lambda_1,\lambda_2) - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_{1i}\lambda_{1j}y_iy_jx_i^Tx_j + \sum_{i=1}^N\lambda_{1i}, \quad (87)$$
which can be converted to a quadratic programming problem after substituting (83) into (87). It is worth noting that when $\theta\sim\mathcal{N}(0,I)$, (87) reduces to the dual form of the regular SVM+. More examples of MED can be found in [7, 8, 9].

7 GMM

7.1 Fisher Vector

Since each image/video can be treated as a set of local descriptors $\{x_i\}_{i=1}^N$, its Fisher vector can be represented as
$$G_\theta^X = \frac{1}{N}\sum_{i=1}^N\nabla_\theta\log p(x_i;\theta). \quad (88)$$
For visual recognition tasks, the probability density function p(X; θ) is usually modeled by a Gaussian Mixture Model (GMM). Suppose K is the number of Gaussian components in the GMM; we use the model parameters $\theta = \{\pi_1,\mu_1,\sigma_1;\ldots;\pi_K,\mu_K,\sigma_K\}$ to denote the mixture weights, means, and diagonal covariances of the GMM, respectively. Based on the definition of the Fisher vector (88), the gradients of the log-likelihood w.r.t. the model parameters (i.e., means and diagonal covariances) of the k-th Gaussian component can be written as (refer to [10] for the derivation details)
$$G_{\mu,k}^X = \frac{1}{N\sqrt{\pi_k}}\sum_{i=1}^N\gamma_i(k)\Big(\frac{x_i-\mu_k}{\sigma_k}\Big), \quad (89)$$
$$G_{\sigma,k}^X = \frac{1}{N\sqrt{2\pi_k}}\sum_{i=1}^N\gamma_i(k)\Big[\frac{(x_i-\mu_k)^2}{\sigma_k^2} - 1\Big], \quad (90)$$
where $\gamma_i(k)$ is the probability that the i-th local descriptor $x_i$ belongs to the k-th Gaussian component, which is defined as
$$\gamma_i(k) = \frac{\pi_k\,\mathcal{N}(x_i;\mu_k,\sigma_k)}{\sum_{j=1}^K\pi_j\,\mathcal{N}(x_i;\mu_j,\sigma_j)}, \quad (91)$$
in which $\mathcal{N}(x_i;\mu_k,\sigma_k)$ is the probability of $x_i$ under the Gaussian distribution of the k-th component. Assuming that the dimension of local descriptors is d, the k-th block of the Fisher vector, obtained by concatenating (89) and (90), is 2d-dimensional. So the final Fisher vector is a 2Kd-dim vector w.r.t. a K-component GMM. Note that there also exist Fisher vectors derived based on other generative models.
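A minimal NumPy sketch of (88)-(91) for a diagonal-covariance GMM. The parameter layout and names are my own, and no normalization or power/L2 post-processing is applied:

```python
import numpy as np

def fisher_vector(X, pi, mu, sigma):
    """Encode descriptors X (N x d) with a K-component diagonal GMM into a 2Kd vector, cf. (88)-(91)."""
    N, d = X.shape
    # posterior gamma_i(k), eq. (91)
    diff = X[:, None, :] - mu[None, :, :]                      # N x K x d
    log_dens = -0.5 * np.sum(diff**2 / sigma**2 + np.log(2 * np.pi * sigma**2), axis=2)
    resp = pi * np.exp(log_dens)
    resp /= resp.sum(axis=1, keepdims=True)                    # N x K
    # gradients w.r.t. means (89) and standard deviations (90)
    G_mu = (resp[:, :, None] * diff / sigma).sum(0) / (N * np.sqrt(pi)[:, None])
    G_sig = (resp[:, :, None] * (diff**2 / sigma**2 - 1)).sum(0) / (N * np.sqrt(2 * pi)[:, None])
    return np.concatenate([G_mu.ravel(), G_sig.ravel()])

K, d = 3, 8
pi = np.ones(K) / K
mu = np.random.randn(K, d)
sigma = np.ones((K, d))
fv = fisher_vector(np.random.randn(100, d), pi, mu, sigma)     # length 2*K*d
```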

7.2 Adaptive GMM

Assume we have a background GMM and want to adapt it to a new dataset $D_t$. We assume a Dirichlet prior P(θ) based on the background GMM for the adapted GMM; then
$$\theta^* = \arg\max_\theta P(D_t|\theta)P(\theta). \quad (92)$$
See [11] for more details.

7.3 Interpolation of GMMs

Given a set $F = \{F_1,\ldots,F_N\}$ on the manifold of K-component GMMs, we want to find a GMM G such that
$$\min_G\;\sum_{n=1}^N\|G - F_n\|_{\mathrm{dist}}, \quad (93)$$
in which the distance function can be the $l_2$ distance, geodesic distance, or KL-divergence. Please refer to [12] for the details.

8 HMM

An HMM has in total three groups of variables: the observation sequence $X = \{x_1,\ldots,x_N\}$, the hidden state sequence $Z = \{z_1,\ldots,z_N\}$, and the model parameters $\theta = \{\pi, A, \phi\}$ (π contains the initial probabilities, φ contains the generating probabilities, and A contains the transition probabilities). Given an observation sequence, an HMM has the following three common problems:

• Compute the most probable state sequence given the model parameters, i.e., $\arg\max_Z p(Z|X,\theta)$.
• Compute the probability of the state at the n-th step, $p(z_n|X,\theta)$.
• Learn the best model parameters, $p(\theta|X)$.

Tips: Note that in the following derivations, conditional independence is frequently used. To be exact, if A is independent of B given C, which can be proved using d-separation, then p(A, B|C) = p(A|C)p(B|C) and p(A|C) = p(A|B, C) hold. Note that A, B, and C can each be a set of nodes. According to d-separation, if all paths from any node in A to any node in B are blocked by C, then A is independent of B given C. A path is blocked if it includes a node such that either (a) the arrows on the path meet head-to-tail (tail-to-head is the same) or tail-to-tail at the node, and the node is in the set C, or (b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.

m=1

based on which we can use Viterbi algorithm (a special case of dynamic program= ming) to solve the first problem because arg maxZ p(Z|X, θ) = arg maxZ p(X,Z|θ)p(θ) p(X,θ) arg maxZ p(X, Z|θ). Problem 2: Given model parameters θ, the second problem can be written as p(zn |X) =

p(X|zn )p(zn ) p(x1 , . . . , xn , Zn )p(xn+1 , . . . , xN |zn ) α(zn )β(zn ) = = , (95) p(X) p(X) p(X)

where we have defined α(zn ) ≡ p(x1 , . . . , xn , Zn ),

(96)

β(zn ) ≡ p(xn+1 , . . . , xN |zn ).

(97)

18

After some derivations, we can obtain α(zn ) = p(xn |zn )

X

αn−1 p(zn |zn−1 ),

(98)

zn−1

β(zn ) =

X

p(xn+1 |zn+1 )β(zn+1 )p(zn+1 |zn ).

(99)

zn+1

We can observe that the only difference of α(zn ) compared with β(zn ) in the form is taking p(xn |zn ) outside the summation. We can use forward-backward algorithm to solve α(zn ) and β(zn ). Problem 3: We need to use EM algorithm to solve the third problem by updating two groups of variables θ = {π, A, φ} and {γ(zn ), ξ(zn−1 , zn )} in an alternating fashion. E-step: First use forward-backward algorithm to solve α(zn ) and β(zn ) just like for Problem 2, then α(zn )β(zn ) , (100) p(X) α(zn−1 p(xn |zn )p(zn |zn−1 )β(zn )) ξ(zn−1 , zn ) = p(zn−1 , zn |X) = , (101) p(X) P in which p(X) can be calculated as zn α(zn )β(zn ) for any n. M-step: Based on the obtained α(zn ) and β(zn ), update model parameters. γ(zn ) = p(zn |X) =

πk = PK

γz1k

, γ(z1j ) PN ξ(zn−1,j , znk ) . = PK n=2 PN ξ(z , z ) n−1,j nl l=1 n=2

(102)

j=1

Ajk

(103)

We assume p(x|φk ) = N (x|µk , Σk ), then PN

n=1 µk = P N

Σk =

γ(znk )xn

,

n=1 γ(znk ) PN n=1 γ(znk )(xn − µk )(xn PN n=1 γ(znk )

19

(104) − µk )T

.

(105)

9 9.1

Graph Label Propagation

Assume the transition matrix among all the instances including labeled and unlabeled instances can be partitionedf as follows, " # PLL PLU P= . (106) PU L PU U Based on the property of label propagation, we know Y = PY, i.e., " # " #" # YL PLL PLU YL = , YU P U L P U U YU

(107)

which give rise to YU = PU L YL + PU U YU , leading to YU = (I − PU U )−1 PU L YL .

9.2

Absorbing Markov Chain Process

Assume we have p instance nodes and q class nodes, then the semantic graph can be constructed as P = [Q, R], where Q ∈ Rp×p (i.e., transition probabilities among instance nodes) and R ∈ Rp×q (i.e., transition probabilities between instance nodes and class nodes). The unknown values are initialized as 0. Based on the absorbing Markov chain process, the absorbing probability matrix can be computed as B = (I − Q)−1 R ∈ Rp×q , which represents the refined probabilities that each instance belongs to each class.

9.3 9.3.1

Exact Inference Sum-Product Algorithm

The sum-product algorithm is applicable to undirected and directed trees and to polytrees. In the trees, the joint distribution of a set of variable can be written Q in the form of product of factors f (x) = fs (xs ) as illustrated in Figure 2. The nodes can be categorized into variable nodes and factor nodes. The message passing function from factor (resp., variable) node to variable (resp., factor) node is denoted as fs−>x (resp., fx−>s ). Then, the incoming or outgoing messages of a

20

Figure 2: Illustration of the factorization of the subgraph.

Figure 3: The factorization of HMM. variable node can be expressed by each other as follows, Y X X fs (x, x1 , . . . , xM ) µfs −>x (x) = ... x1

xM

µxm −>fs (xm ) =

Y

µxm −>fs (xm ), (108)

m∈ne(fs )\x

µfl −>xm (xm ),

(109)

l∈ne(xm )\fs

where ne(·) means the neighboring nodes and {x, x1 , . . . , xM } in (108) are the variable nodes on which fs depends. We can designate a variable node as a root node and actually whichever root node is designated makes no difference. The starting µxm −>fs (xm ) and µfl −>xm (xm ) can be initialized as 1. The messages are propagated from leaf nodes to the root node, and then propagated from the root node out to the leaf nodes. Finally, given a variable x or a variable set xs , its marginal distribution is given by Y p(x) = µfs −>x (x) (110) s∈ne(x)

p(xs ) = fs (xs )

Y

µxi −>fs (xi ).

(111)

i∈ne(fs )

Relation with max-sum algorithm: The sum-product algorithm find marginals over a set of variables while the max-sum algorithm find a set of variables with 21

the largest probabilities, which can be simply done by replacing ’sum’ with ’max’ and replacing products with sums of logarithms. Relation with loopy belief propagation: The sum-product algorithm focuses on tree-structure. Loopy belief propagation is based on sum-product algorithm while targeting at approximate inference in graphs with loops. Because the graph now has cycles, information can flow many times around the graph. In this case, we need to define a message passing schedule like flooding schedule, serial schedule, and pending schedule. Special case HMM: The forward-backward algorithm is a special case of sumproduct algorithm. Specifically, as illustrated in Figure 3, the forward process can be represented by µzn−1 −>fn (zn−1 ) = µfn−1 −>zn−1 (zn−1 ), X µfn −>zn (zn ) = fn (zn−1 , zn )µzn−1 −>fn (zn−1 ),

(112) (113)

zn−1

which give rise to µfn −>zn (zn ) =

X

fn (zn−1 , zn )µfn−1 −>zn−1 (zn−1 ),

(114)

zn−1

in which µfn −>zn (zn ) corresponds to α(zn ) with α(zn ) ≡ p(x1 , . . . , xn , zn ). Similarly, the backward process takes the form X µfn+1 −>zn (zn ) = fn+1 (zn , zn+1 )µfn+2 −>zn+1 (zn+1 ),

(115)

zn+1

in which µfn+1 −>zn (zn ) corresponds to β(zn ) with β(zn ) ≡ p(xn+1 , . . . , xN |zn ).

9.4

Approximate Inference

9.4.1

Stochastic Approximation

Markov Chain Monte Carlo (MCMC): They generally have the property that given infinite computational resource, they can generate exact results, and the approximation arises from the use of a finite amount of processor time. In practice, sampling methods can be computationally demanding, often limiting their use to small-scale problems. 22

9.4.2

Deterministic Approximation

Mean field approximation is a simple instantiation of variational inference, analogous to variational autoencoder. Variational inference is performing inference on an easy approximate distribution (e.g., Gaussian) for which we know how to do posterior inference. For example, we use qφ (z|x) to approximate p(z|x) X qφ (z|x) (116) KL[qφ (z|x)||p(z|x)] = qφ (z|x) log p(z|x) z X qφ (z|x) = log p(x) + qφ (z|x) log (117) p(z, x) z = =

p(z) ] (118) qφ (z|x) log p(x) − EQ log p(x|z) + KL(q(z|x)||p(z)) (119) log p(x) + EQ [− log p(x|z) − log

which can be reformulated as log p(x) = EQ log p(x|z) − KL(q(z|x)||p(z)) + KL(qφ (z|x)||p(z|x)).

(120)

By omitting KL[qφ (z|x)||P (z|x)], we can get the lower bound of log p(x) as EQ log p(x|z)− KL(q(z|x)||p(z)), which is the objective of variational autoencoder. From another perspective, X log p(x) = q(z) log p(x) (121) z

p(z|x) p(x, z) X − q(z) log (122) q(z) q(z) z z X p(z) X p(z|x) p(x, z) X q(z) log q(z) log = q(z) log + − (123) p(z) q(z) q(z) z z z =

X

q(z) log

= EQ log p(x|z) − KL[q(z)||p(z)] + KL[q(z)||p(z|x)]

(124)

By replacing q(z) in (123) with q(z|x), we can reach (120).

10 10.1

Lassos Lasso

Example 1: min kDX − Ak2F + λkXk1 , X

23

(125)

which requires solving a L1 -minimization problem [13] in Sparse Representationbased Classification (SRC) algorithm [14]. Example 2: min X,D

s.t.

kDX − Ak2F + λkXk1 ,

(126)

dTi di = 1,

(127)

∀i,

where di is the i-th column in D. We solve D and X alternatively. Specifically, when fixing D, the subproblem w.r.t. X reduces to min kDX − Ak2F + λkXk1 , X

(128)

which can be solved using some standard convex optimization techniques such as [15, 16]. When fixing X, the subproblem w.r.t. D reduces to min D

s.t.

kDX − Ak2F ,

(129)

dTi di = 1,

(130)

∀i.

We update di one by one by fixing all the other dj ’s for j 6= i. By denoting xi as the i-th row of X, we rewrite (130) as min di

s.t.

kY − di βi k2F ,

(131)

dTi di = 1,

(132)

P where Y = A − j6=i dj βj . By introducing the dual variable γ for the constraint in (131), we set the derivative of the obtained Lagrangian as zeros and then get di = YβiT (βj βjT )−1 . Considering dTi di = 1, we can have the solution to di as di =

YβiT . kYβiT k2

(133)

Example 3: An efficient solution [17] is proposed for column-normalized dictionray and sparse representation inspired by K-SVD. min A,X

s.t.

kAX − Bk2F

(134)

kXk0 ≤ T.

(135)

24

Algorithm 2 Iterative projection method for sparse coding. 1: Initialize δ, τ > 0, x0 = 0. 2: Set k = 0. 3: while k < K do  4: xk = Sτ /δ xk−1 − 2δ1 ∇Q(xk−1 ) , where Sτ /δ is a soft thresholding function defined in [18]. 5: Set k = k + 1. 6: end while We update the dictionary A atom by atom. Specifically, for the i-th column of A, i.e., ai , and the i-th row of X, i.e., xi , we solve the following subproblem: min kEi − ai xi k2F , ai ,xi

(136)

P where Ei = B − k6=i ak xk . This is essentially the same problem that K-SVD has solved and thus the solution to (135) is given by ai

= U(:, 1),

(137)

xi

= Σ(1, 1) ∗ V(1, :),

(138)

where Ei = UΣV. Example 4: The general problem minx Q(x) + 2τ kxk1 with Q(x) being a convex function w.r.t. x can be solved using iterative projection method [18], which can be further accelerated by FISTA [19]. SPAMS toolbox: http://spams-devel.gforge.inria.fr/ Example 5: To simplify the case, we can assume x ≥ 0, and use xT 1 to replace kxk1 . The simplified optimization problem with box constraint can be solved using projected gradient descent. Survey [20]: A brief summary of dictionary learning based approach for classification.

25

10.2

Group Lasso

Example 1: The following atomic problem has a close-form solution for columnsparse W. min λ1 kWk2,1 + λ2 kW − Qk2F . W

Assume the solution to (138) is W∗ , then the i-th column of W∗ is   kqi k−λ q , if λ < kq k, i i kqi k W∗ (:, i) = 0, otherwise.

(139)

(140)

Example 2: Now, consider a more general l2,1 -norm minimization problem as follows, min P

f (P) +

X

kAk P + Bk k2,1 .

(141)

k

According to [21], solving the above problem is equivalent to solving the following problem iteratively: min P

f (P) +

X

tr((Ak P + Bk )T Dk (Ak P + Bk )),

(142)

k

where Dk is a diagonal matrix with the i-th diagonal element being 2kq1i k2 , in k which qik is the i-th row of Ak P + Bk if L2,1 -norm here means row-sparse. In each iteration of solving P, we calculate Dk ’s based on previous P and then update P by solving (141). As claimed in [21], the updated P in each iteration will decrease the objective of (140). We update P and D alternatively until the objective of (140) converges. The main proof provided in [21] relies on the following lemma: Lemma 10.1 For any nonzero vectors u, ut , the following inequality holds: kuk2 −

kut k2f kuk22 ≤ kut k2 − . 2kut k2 2kut k2

26

(143)

In the (t + 1)-th iteration, by solving (141) and obtaining Ut+1 , we know that f (Ut+1 ) + tr(UTt+1 Dt Ut+1 ) ≤ f (Ut ) + tr(UTt Dt Ut ),

(144)

that is to say f (Ut+1 ) +

m X kuit k22 ≤ f (U ) + . t i 2kuit k2 2ku k 2 t i=1

m X kuit+1 k22 i=1

(145)

By adding (142) and (144) on both sides, we can easily obtain f (Ut+1 ) + kUt+1 k2,1 ≤ f (Ut ) + kUt k2,1 .

10.3

(146)

Trace Lasso

Assume the regularizer w.r.t. x is related to the intra -structure of A[22] φ(x) = kADiag(x)k∗

(147)

p  Diag(x)AT ADiag(x) = kxk1 . √ when AT A = 11T , φ(x)(x) = xx = kxk2 . Therefore, sumi kADiag(xi )k∗ is between kXk1 and kXk2,1 . when AT A = I, φ(x) = Tr

11 11.1

Low Rank Rank Constraint

Example 1: min P

s.t.

hP, Ei

(148)

rank(P) ≤ τ.

(149)

We factorize P ∈ Rn×n as P = UV, where U ∈ Rn×τ is a column-orthogonal matrix (i.e., UT U = I). Acordingly, the problem in (147) becomes min

UT U=I,V

s.t. 27

hP, Ei

(150)

P = UV.

(151)

By introducing the Lagrangian multiplier L for the constratin in (150), the augmented Lagrangian function can be written as follows, ρ (152) L(P, U, V, L) = hP, Ei + hP − UV, Li + kP − UVk2F . 2 The subproblem in (151) w.r.t. P and V are trivial and we omit the details here. The subproblem in (151) w.r.t. U is as follows, kUV − (P +

min U

L 2 )k ρ F

UT U = I,

s.t.

(153) (154)

which is known as the matrix procrustes problem. The optimal solution to (152) is P((P + Lρ VT )), where the operator P(·) is defined as follows: for any matrix T T with SVD A = UA SA VA , P(A) = UA VA . Example 2: min Z

s.t.

kZ − Ak2F

(155)

rank(Z) = τ.

(156)

By performing SVD on A, i.e., A = Udiag([s1 , . . . , sn ])VT , the optimal solution to (154) is Udiag([s1 , . . . , sτ , 0, . . . , 0]T )VT . Example 3: min X

s.t.

kX − Yk2F

(157)

rank(X) = r.

(158)

When r ≥ rank(Y), the solution to (156) is X = Y. When r ≤ rank(Y), T according to [23], the solution to (156) is X∗ = UY[1:r] ΣY[1:r] VY[1:r]] when Y = T U Y Σ Y VY .

11.2

Nuclear Norm

Nuclear norm (the sum of singular values) is the best convex approximation of the rank function over the unit ball of matrices [24]. The following atomic problem has a close-form solution. min λ1 kFk∗ + λ2 kF − Gk2F . F

28

(159)

We perform SVD on G, i.e., G = UΣV0 , where Σ is a diagonal matrix with the i-th diagonal element being δi . Then, the solution to (158) is UD(Σ)V0 , where λ1 )+ . D(Σ) is a diagonal matrix with the i-th diagonal element being (δi − 2λ 2 When the nuclear norm regularizer is used together with other regularizers, inexact ALM or ADM can be employed to solve a wide range of complicated problems. The general idea is to separate the original complex problem to a number of atomic problems which can be easily solved. We will take the following problem as an example.
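Before moving on to that example, here is a minimal NumPy sketch of the singular-value-thresholding solution to (159) just described (the function name and threshold parameterization are mine):

```python
import numpy as np

def nuclear_prox(G, lam1, lam2):
    """Closed-form minimizer of lam1*||F||_* + lam2*||F - G||_F^2: shrink singular values by lam1/(2*lam2)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_shrunk = np.maximum(s - lam1 / (2.0 * lam2), 0.0)   # (delta_i - lam1/(2*lam2))_+
    return (U * s_shrunk) @ Vt

G = np.random.randn(6, 4)
F = nuclear_prox(G, lam1=1.0, lam2=0.5)
print(np.linalg.matrix_rank(F) <= np.linalg.matrix_rank(G))
```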

min v v

Z ,E

s.t.

V X γ X (kEv k2F + kZv k∗ ) + kZv −Zv˜k2F 2 v,˜v:v6=v˜ v=1

(160)

Gv = Gv Zv + Ev ,

(161)

∀v.

By introducing the auxiliary variable Pv to replace Zv in kZv k∗ and the auxiliary variable Qv to replace Zv in the constraint (160), we convert the problem in (159) to an equivalent problem as follows, V X

min v v

Z ,E Pv ,Qv

(kEv k2F +kPv k∗ )+

v=1 v

G = Gv Q v + E v ,

s.t.

γ X kZv −Zv˜k2F 2 v,˜v:v6=v˜

∀v,

(162) (163)

Zv = Pv ,

∀v,

(164)

Zv = Qv ,

∀v.

(165)

The problem in (161) can be solved by using inexact augmented Lagrange Multiplier (ALM) method [25], which aims to minimize the following augmented Lagrangian function: V X γX v L = (kEv k2F + kPv k∗ )+ kZ −Zv˜k2F 2 v=1 v,˜ v :v6=v˜ V V X X v v v + hS , Z − P i + hTv , Zv − Qv i v=1 V X

v=1 V

µX v + hR , G − G Q − E i + kZ − Pv k2F 2 v=1 v=1 v

v

v

v

V

v

V

µX v µX + kZ − Qv k2F + kGv − Gv Qv − Ev k2F , 2 v=1 2 v=1 29

(166)

Algorithm 3 Solving (165) with inexact ALM 1: Input: Gv , λ2 , λ3 , γ 2: Initialize Zv = Ev = Sv = Tv = Rv = O, ρ = 0.1, µ = 0.1, µmax = 106 , ν = 10−5 , Niter = 106 . 3: for t = 1 : Niter do 4: ∀v, update Pv . 5: ∀v, update Qv . 6: update Zv ’s 7: ∀v, update Ev . 8: ∀v, update Sv , Tv , and Rv by Sv = Sv + µ(Zv − Pv ), Tv = Tv + µ(Zv − Qv ), Rv = Rv + µ(Gv − Gv Qv − Ev ).(167)

Update the parameter µ by µ = min(µmax , (1+ρ)µ). 10: Break if kGv −Gv Qv −Ev k∞ < ν, kZv −Pv k∞ < ν, kZv −Qv k∞ < ν, 11: end for 12: Output: Zv . 9:

∀v.

where Sv , Tv , and Rv are the Lagrange multipliers, µ > 0 is the penalty parameter. We use inexact ALM method to iteratively update variables {Pv , Qv , Zv , Ev }’s, the Lagrangian multipliers {Sv , Tv , Rv }’s, and the penalty parameter µ in the augmented Lagrangian function (165) until the termination criterion is satisfied. We list the steps to solve (165) in Algorithm 3.

11.3

Low-rank Surrogate

The non-zero singular values of W will change along with kWk∗ but the rank of W remains unchanged. So nuclear norm may not be a good choice sometimes.

min W

r X (δi (W))2 ,

(168)

i=1

which δi (W) is the i-th smallest eigen value of W. The above equation is equivalent to min αtr(ΓT WWT Γ), W,Γ

30

(169)

where Γ denotes the singular vectors corresponding to the smallest r singular values of WWT . The above equation can be solved by updating W and Γ alternatively. One trick is to calculate ΓΓT instead of to calculate Γ. Specifically, assume WWT = Uw Σw UTw with Uw = [U1w , U2w ]. Given the fact that T T T Uw UTw = U1w U1w + U2w U2w = Id , we have ΓΓT = Id − U2w U2w .

12 12.1

SVM Lagrangian Multiplier

Standard form: n

min

w,b,ξi

s.t.

X 1 kwk2 + C ξi , 2 i=1 yi (wT xi + b) ≥ 1 − ξi , ξi ≥ 0,

(170) ∀i,

(171)

∀i.

(172)

By introducing dual variables αi ’s for the constraints in (170) and βi ’s for the constraints in (171), we can get the Lagrangian form of (169) as n

n

n

X X X 1 Lw,b,ξi kwk2 + C ξi − αi (yi (wT xi + b) − 1 + ξi ) − βi ξi αi ,βi 2 i=1 i=1 i=1

(173)

By setting the derivative of L w.r.t. primal variables as 0 and substituting the obtained equations into the Lagrangian form, we can obtain the dual form of (169) as follows, min α

s.t.

1 T α (K ◦ (yyT ))α − 1T α 2 αT y = 0, 0 ≤ α ≤ C1.

(174) (175) (176)

Remove the constraint αT y = 0: add 21 b2 in (169), then Knew = Kold + 11T . Remove −1T α in (173): 1) change the margin 1 in (170) to a variable ρ. 2) add −ρ in (169). Then, we will have an additional constraint 1T α = 0 in the dual form. This form of SVM is called ρ-SVM.

31

Remove the constraint 0 ≤ α ≤ C1 : We can square loss to replace hinge loss, i.e., n

X 1 kwk2 + C ξi2 , 2 i=1

min

w,b,ξi

yi (wT xi + b) ≥ 1 − ξi ,

s.t.

(177) ∀i.

(178)

It is worth noting that in some cases, by replacing hinge loss with square loss, we can get more succinct form and more efficient solution.

12.2

SMO algorithm

The dual form of SVM can be solved by Sequential Minimal Optimization (SMO). In fact, SMO can be used to solve a set of quadratic programming problems with group equality constraints.

min α

s.t.

X 1 XX αi,u αj,v φ(xi,u )T φ(xj,v ) − fi,u αi,u , 2 i,j u,v i,u X αi,u = 0, ∀i,

(179) (180)

u

lbi,u ≤ αi,u ≤ ubi,u ,

∀i, u.

(181)

We use SMO algorithm to update a pair of variable αi,u and αi,v from the i-th group in each iteration, (i.e., αi,u and αi,v are updated as αi,u + ∆ and αi,v − ∆) respectively). We also need to consider the lower bound and upper bound during updating. The remaining problem is how to choose certain group i and how to choose αi,j and αi,v . Stationary points: In the i-th group, the KKT optimality condition of problem (178) implies that a feasible αi is a stationary point if and only if there exists a number d and two nonnegative vectors λi and ξi such that ∆f (αi ) + d = λi − ξi , λi,u (αi,u − lbi,u ) = 0, λi,u ≥ 0,

ξi,u ≥ 0,

(182) ξi,u (ubi,u − αi,u ) = 0,

∀u,

32

∀u,

(183) (184)

where ∆f (αi ) is the derivative of (178) w.r.t. αi . The condition in (181) can be rewritten as  ≥ 0, if α < ub , i,u i,u ∆f (αi ) + d (185) ≤ 0, if αi,u > lbi,u , which is equivalent to that there exists d such that m(αi ) ≤ d ≤ M (αi ),

(186)

where m(αi ) = maxu∈Iup (αi ) −∆u f (αi ) and M (αi ) = minu∈Ilow (αi ) −∆u f (αi ) with Iup (αi ) = {u|αi,u < ubi,u } and Ilow (αi ) = {u|αi,u > lbi,u }. A feasible αi is a stationary point if and only if m(αi ) ≤ M (αi ). Hence, a reasonable stopping criterion is m(αi ) − M (αi ) ≤ . Working set selection: In the i-th group, we define auv = Hi,uu + Hi,vv − 2Hi,uv and buv = −∆u f (αi ) + ∆v f (αi ). Then, we select s ∈ arg max{−∆u f (αi )|u ∈ Iup (αi )},

(187)

b2sv t ∈ arg min{− |v ∈ Ilow (αi ), −∆v f (αi ) < −∆s f (αi )}, v ssv

(188)

u

which is explained as follows,

= =

1 1 (α + ∆α)T H(α + ∆α) − f T (α + ∆α) − ( αT Hα − f T α) 2 2 1 ∆αT H∆α + (Hα − f )T ∆α 2 1 auv ∆2 − buv ∆, 2

which achieves the minimum value when ∆ =

buv auv

(189) (190) (191)

in the ideal case. We traverse all 2

groups and choose the i-th group with minimum − sbsv to update in each iteration. sv Note that when calculating the final ∆, the lower and upper bound of αi,s and αi,t should be taken into consideration. For more details, please refer to [26].

33

12.3

Multi-class SVM

Multi-class SVM has many variations and one popular variation is CS multi-class SVM formulated as follows, min

wc ,ξi

s.t.

C n X βX kwc k2 + ξi 2 c=1 i=1

wyTi xi + δyi ,c − wcT xi ≥ 1 − ξi ,

(192) ∀i, c.

(193)

By introducing dual variables ηi ’s, we can derive the dual form of (191) as follows, min ηi

s.t.

X 1X T (xi xj )[(1yi − ηi )T (1yj − ηj )] − β ηiT 1yi , 2 i,j i

(194)

ηi ≥ 0,

(195)

∀i,

ηiT 1 = 1,

∀i.

(196)

By denoting τi = 1yi − ηi , (193) can be rewritten as min τi

s.t.

X 1X T (xi xj )(τiT τj ) − β τiT 1yi , 2 i,j i

(197)

τi ≤ 1yi ,

∀i,

(198)

τiT 1 = 0,

∀i,

(199)

which can solved by decomposing algorithm in [27].

12.4

Latent Structural SVM

The prediction rule of structural SVM with latent variables can be written as fw (x) = arg max(y,h)∈Y×H [wT Φ(x, y, h)]. By denoting h∗i (w) = arg maxh∈H wT Φ(xi , yi , h) ˆ i (w) = arg max(y,h)∈Y×H wT Φ(xi , y, h), we derive the upper bound of and (ˆ yi (w), h loss function ∆



ˆ i (w))) ∆((yi , h∗i (w)), (ˆ yi (w), h   T ˆ ˆ i (w))) max w Φ(xi , yˆ, h) + ∆((yi , h∗i (w)), (ˆ yi (w), h ˆ (ˆ y ,h)∈Y×H   T − max w Φ(xi , yi , h) h∈H

34

(200)

(201)

We assume the loss function ∆ does not depend on the latent variable h∗i (w) ˆ i (w))) with ∆(yi , yˆi (w), h ˆ i (w)). Then, we and thus replace ∆((yi , h∗i (w)), (ˆ yi (w), h can reach





ˆ i (w)) ∆(yi , yˆi (w), h (202)   ˆ + ∆(yi , yˆi (w), h ˆ i (w)) max wT Φ(xi , yˆ, h) ˆ (ˆ y ,h)∈Y×H   T − max w Φ(xi , yi , h) (203) h∈H   h i  T T ˆ ˆ max w Φ(xi , yˆ, h) + ∆(yi , yˆ, h) − max w Φ(xi , yi , h) (204) ˆ (ˆ y ,h)∈Y×H

h∈H

To this end, we formulate the latent structural SVM as n  i h X 1 T 2 ˆ ˆ w Φ(xi , yˆ, h) + ∆(yi , yˆ, h) max kwk +C min w ˆ 2 (ˆ y ,h)∈Y×H i=1  n  X T −C max w Φ(xi , yi , h) . h∈H

i=1

(205) (206)

From another perspective, latent structural SVM can be formulated as n

X 1 min kwk2 + C ξi (207) w 2 i=1   T ˆ ≥ ∆(yi , yˆ, h) ˆ − ξi , ∀i, yˆ, h. ˆ (208) s.t. max w Φ(xi , yi , h) − wT Φ(xi , yˆ, h) h∈H

It can be easily seen that (206) is in fact equivalent to (204). The problem in (204) is the difference of two convex functions and can be solved using Concave-Convex Procedure (CCCP) [28]. The general template for minimizing f (w)−g(w) is listed in Algorithm 4. In terms of the optimization for the specific problem in (204), Line 3 involves P computing h∗i = arg maxh∈H wtT Φ(xi , yi , h) and then vt = C ni=1 Φ(xi , yi , h∗i ). Note that vt here is actually equivalent to ∆g(wt ). Line 4 involves solving the following problem: n  n h i X X 1 2 T ˆ ˆ min kwk +C max w Φ(xi , yˆ, h) + ∆(yi , yˆ, h) −C wT Φ(xi , yi , h∗i ), (209) w 2 ˆ (ˆ y ,h)∈Y×H i=1 i=1 which can be solved by cutting-plane algorithm or stochastic gradient descent. 35

Algorithm 4 Concave-Convex Procedure (CCCP) 1: Set t = 0 and initialize w0 . 2: repeat 3: Find hyperplane vt such that −g(w) ≤ −g(wt ) − (w − wt )T vt for all w. 4: Solve wt+1 = arg minw f (w) − wT vt . 5: Set t = t + 1. 6: until [f (wt ) − g(wt )] − [f (wt−1 ) − g(wt−1 )] < .

12.5

Multiple Kernel Learning

The primal form of multiple kernel learning can be written as follows, T

min wt ,ξl

s.t.

n

X 1 X kwt k2 +C ξi 2 t=1 dt i=1 yi

T X

wtT xti ≥ 1 − ξi ,

(210) ∀i,

(211)

t=1 T X

dt = 1.

(212)

t=1

By introducing the dual variables αi ’s for the constraints in (209), we can get the dual form of (209) as min max d

α

s.t.

1 − dt αT (Kt ◦ (yyT ))α + 1T α 2 T 1 d = 1,

(213)

α ≤ C1. The objective function in (212) is concave w.r.t. α and convex w.r.t. d, min and max can be exchanged according to [29]. So the problem in (212) can be rewritten as max min α

d

s.t.

1 − dt αT (Kt ◦ (yyT ))α + 1T α 2 1T d = 1, α ≤ C1,

36

(214)

which is equivalent to the following problem: max τ,α

s.t.

−τ + 1T α, 1 T t α (K ◦ (yyT ))α ≤ τ, 2 α ≤ C1.

(215) ∀t,

(216)

Note that the problem in (213) can be obtained from (214) by introducing dual variables dt ’s for the constraints in (215). To solve the problem in (209), we can update wt ’s and dt ’s alternatively. Specifically, when updating wt ’s, we can solve the inner problem in (212) and then recover wt by using wt = dt Kt (α ◦ y). When updating dt ’s, by introducing dual variables for the constraints in (210) and setting the derivative of the obtain Lagrangian form w.r.t. dt as 0, we can get dt = const · kwt k. Considering the P k . constraint Tt=1 dt = 1, the close-form solution to dt is given by dt = PTkwtkw k t=1

12.6

t

Indefinite Kernel

The kernel used in SVM is generally positive definite. However, when the kernel K is indefinite, we can replace K with ρI − (ρI − K) with ρ being the maximum eigen value of K. Then, the objective function could reduce to the difference of two convex functions, which can be solved using DC programming (see Section 17.1.1).

12.7

AP SVM

Assume the indices are positive and negative samples are denoted by P and N respectively. The ranking matrix R is defined such that (i) Ri,j = 1 if xi is ranked higher than xj ; (ii) Ri,j = −1 if xi is ranked lower than xj ; (iii) Ri,j = 0 if xi and xj are assigned the same rank. We further define Ψ(X, R) = P P 1 ∗ ∗ i∈P j∈N Ri,j (ψ(xi ) − ψ(xj )) and ∆(R , R) = 1 − AP (R , R), then AP |P||N | SVM can be written as min w

s.t.

1 kwk2 + Cξ, 2 wT Ψ(X, R∗ ) − wT Ψ(X, R) ≥ ∆(R∗ , R) − ξ,

37

(217) ∀R.

(218)

which can solved using loss-augmented inference and cutting-plane algorithm as shown in [30]. The inference problem can be solved using graph cuts [31], dynamic graph cuts [32, 33], or greedy algorithm [30].

13

SVR

Standard form: n

min

w,b,ξi ,ξi∗

s.t.

X 1 kwk2 + C (ξi + ξi∗ ) 2 i=1

(219)

wT xi + b − yi ≤  + ξi ,

ξi ≥ 0,

yi − wT xi − b ≤  + ξi∗ ,

ξi∗ ≥ 0,

∀i, ∀i.

(220) (221)

By introducing dual variable α and α∗ for the constraints in (219) and (220) respectively, we can arrive at the dual form of (218) as follows, 1 (α − α∗ )T K(α − α∗ ) + yT (α − α∗ ) + 1T (α + α∗ ), 2 1T α = 1T α∗ , 0 ≤ α, α∗ ≤ C1,

min∗

α,α

s.t.

(222) (223)

Relaxed form: n

min∗

w,b,ξi ,ξi ,fi

s.t.

X λ 1 kwk2 + C (ξi + ξi∗ ) + kf − yk2 2 2 i=1

(224)

wT xi + b − fi ≤  + ξi ,

ξi ≥ 0,

(225)

fi − wT xi − b ≤  + ξi∗ ,

ξi∗ ≥ 0,

∀i, ∀i.

(226)

The dual form of (223) can be written as follows, min∗

α,α

s.t.

1 ˜ (α − α∗ )T K(α − α∗ ) + yT (α − α∗ ) + 1T (α + α∗ ), 2 1T α = 1T α∗ , 0 ≤ α, α∗ ≤ C1,

˜ = K + 1 I. The only difference between (221) and (226) lies in 1 I. where K λ λ

38

(227) (228)

14 14.1

Grassmann Manifold Intermediate Subspace

Assume PS , PT ∈ RD×d with D (resp., d) being the dimensionality of feature (resp., subspace) are two points on the Grassmann manifold. Intermediate subspaces are on the geodesic path between the source point PS and the target point PT , which are constant velocity curve on the manifold [34, 35]. Based on the property of subspace, we know PTS PS = I and the same for PT . Let RS ∈ RD×(D−d) denote the orthogonal complement to PS , i.e., RTS PS = 0 (null space). We denote Q = [PS , RS ] and then it can be easily seen that " # " # T PS I QT P S = PS = . (229) T RS 0 Similarly, we consider QT PT . Computing the thin CS decomposition of QT PS can give rise to # " # " " # T T T U ΓV P P P 1 S T S = , (230) PT = QT PT = T T −U2 ΣVT RS PT RS which can be done by generalized SVD. In particular, generalized SVD aims to ˜ V, ˜ X, ˜ C, ˜ S ˜ in the following problem given A, ˜ B, ˜ solve U, ˜ =U ˜C ˜X ˜T, A ˜ =V ˜S ˜X ˜T, B

(231)

˜TC ˜ +S ˜T S ˜ = I, C

(233)

(232)

˜ and V ˜ are orthonormal matrices, C ˜ and S ˜ are diagonal matrices. In (229), where U Σ (resp, Γ) is a diagonal matrix with the i-th diagonal element being γi (resp., σi ). Based on (232), there exist θi satisfying that γi = cos(θi ) and σi = sin(θi ), ∀i. After computing θi ’s, we can calculate the matrix specifying the direction and the speed of geodesic flow as A = U2 θUT1 , in which θ is a diagonal element with the i-th diagonal element being θi . Using the canonical Euclidean metric the Riemannian manifold (Grassmann manifold belongs to Riemannian manifold), the geodesic flow is parameterized as

39

Φ(t) with t ∈ [0, 1]. To be exact, " # U1 Γ(t) Φ(t) = Q = PS U1 Γ(t) − RS U2 Σ(t), −U2 Σ(t)

(234)

in which Γ(t) (resp., Σ(t)) is a diagonal matrix with the i-th diagonal element being cos(tθi ) (resp., sin(tθi )). One remaining problem is whether Φ(0) = PS and Φ(1) = PT , which is claimed in [36] but unproved.

15 15.1

Gaussian Process Gaussian Process for Regression tn = yn + n

(235)

p(t|y) = N (t|y, β −1 IN )

(236)

Assume p(n ) = N (0|β −1 ), then

We assume the covariance matrix of y is defined by a Gram matrix K, then p(y) = N (y|0, K)

(237)

Lemma 15.1 If x follows a Gaussian distribution and y given x follows a conditional Gaussian distribution, p(x) = N (x|µ, Λ−1 ),

(238)

p(y|x) = N (y|Ax + b, L−1 ),

(239)

then, we can have p(y) = N (y|Aµ + b, L−1 + AΛ−1 AT ),

(240)

p(x|y) = N (x|Σ{AT L(y − b) + Λµ}, Σ),

(241)

where Σ = (Λ + AT LA)−1 . The detailed proof can be found in [37].

40

According to Lemma 15.1, we can obtain p(t) = N (t|0, C),

(242)

where C = K + β −1 I. Based on training data x1 , . . . , xN , and their target variables tN = (t1 , . . . , tN )T , given a test sample xN +1 , we use tN +1 to denote (t1 , . . . , tN , tN +1 ), then p(tN +1 ) = N (tN +1 |0, CN +1 ).

(243)

Lemma 15.2 If p(x) = N (x|µ, Σ) with x = [xa ; xb ], µ = [µa ; µb ], and Σ = [Σaa , Σab ; Σba , Σbb ], then the mean and covariance of p(xa |xb ) can be expressed as follows, µa|b = µa + Σab Σ−1 ab (xb − µb ),

(244)

Σab Σ−1 bb Σba .

(245)

Σa|b = Σaa − The detailed proof can be found in [37].

By partitioning CN +1 = [CN , k; kT , c] and based on Lemma 15.2, the mean and variance of conditional distribution p(tN +1 |tN ) can be given by m(tN +1 |tN ) = kT C−1 N tN ,

(246)

δ 2 (tN +1 |tN ) = c − kT C−1 N k.

(247)

Maximum likelihood can be used to determine the hyperparameters in K and β.

15.2

Gaussian Process for Classification

Based on Gaussian process for regression, we define a Gaussian process over a function a(x) and transform the function using y = δ(a) with y ∈ (0, 1). In this case, p(t|a) = δ(a)t (1 − δ(a))1−t .

(248)

Z p(tN +1 = 1|tN ) =

p(tN +1 = 1|aN +1 )p(aN +1 |tN )daN +1 . 41

(249)

Z p(aN +1 = 1|tN )

p(aN +1 , aN |tN )daN

= Z =

p(aN +1 |aN )p(aN |tN )daN

(250)

According to (245) and (246), we have T −1 p(aN +1 = 1|aN ) = N (aN +1 |kT C−1 N aN , c − k CN k).

(251)

In order to calculate (249), we use Laplace approximation by Taylor expanding the logarithm of p(aN |tN ) and arrive at Ψ(aN )

= ln p(aN ) + ln p(tN |aN ) N 1 1 ln(2π) − |CN | = − aTN C−1 N aN − 2 2 2 N X ln(1 + ean ) + const. +tTN aN −

(252) (253)

n=1

The first-order and second-order derivatives of Ψ(aN ) are given by ∇Ψ(aN ) = tN − δN − C−1 N aN ,

(254)

∇∇Ψ(aN ) = −WN − C−1 N ,

(255)

where δN is a vector with elements δ(an ) and WN is a diagonal matrix with elements δ(an )(1 − δ(an )). By using the Newton-Raphson formula wnew = wold − H−1 ∇E(w) and woodbury matrix identity, the iterative update for aN is given by anew = CN (I + N WN CN )−1 {tN − δN + WN aN }. When the iterative process converges, ∇Ψ(aN ) will vanish and thus the converged aN will be a∗N = CN (tN − δN ). The inverse of the Hessian matrix of −Ψ(aN ) (i.e., H = WN + C−1 N ) can be treated as the covariance matrix of a Gaussian distribution, so p(aN |tN ) = N (aN |a∗N , H−1 ). Now we have p(aN |tN ) and p(aN +1 = 1|aN ), according to Lemma 15.1 and (249), we can easily obtain −1 p(aN +1 |tN ) = N (aN +1 |kT (tN − δN ), c − kT (WN + CN )−1 k)

42

(256)

To determine the parameters θ in CN , the likelihood function is defined by p(tN |θ) = ∇p(tN |aN )p(aN |θ)daN ,

(257)

which is analytically intractable and can be solved using Laplace approximation. Laplace Approximation: We use a Gaussian distribution p(z) to approximate f (z)/Z with Z being a constant. We find the stationary point z0 (i.e., mode) where the gradient ∇f (z) vanishes and then use Taylor expansion around z0 : 1 ln f (z) ' ln f (z0 ) − (z − z0 )T A(z − z0 ), 2 where A = −∇∇ ln f (z)|z=z0 . Then we can get   1 T f (z) ' f (z0 ) exp − (z − z0 ) A(z − z0 )) . 2 Then the corresponding q(z) is given by   1 |A| 2 1 T q(z) = − (z − z0 ) A(z − z0 )) = N (z|z0 , A−1 ). M exp 2 2 (2π)

(258)

(259)

(260)

In sum, in order to apply the Laplace approximation, we first need to find the mode z0 using certain method and then evaluate the Hessian matrix at this mode. Based on Laplace approximation, we can have the following equation: Z

= ∇f (z)dz 

 1 T ' f (z0 )∇ exp − (z − z0 ) A(z − z0 ) dz 2 M

= f (z0 )

(2π) 2 1

|A| 2

.

(261)

Based on (260), we can get the approximation for the logarithm of (256): ln p(tN |θ) = ln p(a∗N |θ) + ln p(tN |a∗N ) −

15.3

1 N ln |WN + C−1 ln(2π). N |+ 2 2

(262)

Twin Gaussian Process

Twin Gaussian Process (GP) is a probabilistic regressor encoding the relations between the inputs and structured outputs using Gaussian Process prior, which is achieved by minimizing the KL-divergence between the marginal GP of the outputs and the inputs [38, 39]. 43

16

Back Propagation

Use chain rule for calculating derivatives, i.e., ∂L ∂L ∂an+1 = · ∂wn ∂an+1 ∂wn

16.1

(263)

Convolutional Layer and Fully Connected Layer |w|

|w|

2 2 X X ∂L ∂yn+i ∂L ∂L = · = · wi , ∂xn ∂yn+i ∂xn ∂yn+i |w| |w|

i=−

i=−

2

where |w| is the window size of convolution filter.

16.2

(264)

2

∂L ∂wn

can be similarly calculated.

Pooling Layer

Mean pooling: ∂L ∂L ∂yn 1 ∂L = = . ∂xi ∂yn ∂xi m ∂yn

(265)

Max pooling:   ∂L

if i = arg maxi xi . ∂L ∂yn ∂L = = ∂yn 0 otherwise. ∂xi ∂yn ∂xi

16.3

(266)

Local Response Normalization (LRN) Layer 

max(0,i−n/2)

X

bix,y = aix,y / k + α

 (ajx,y )2  .

(267)

min(N −1,i+n/2)

LRN layer can be separated into five sub-layers as illustrated in Figure 4: • split layer: duplicate N copies with each copy corresponding to one normalization coef. • square layer: square operation 44

Figure 4: Separate LRN layer into five sub-layers. • pooling layer: sum pooling the corresponding neighbors • power layer: (k + ax)b • product layer: A = A0 ◦ A1 ◦ . . . ◦ AN . Note that ∂A0 ∂AN ∂A = ◦ A1 ◦ . . . ◦ AN + . . . + A0 ◦ . . . ◦ AN −1 . ∂a ∂a ∂a

16.4

(268)

Softmax Layer

For the softmax function hj = ∂hj ∂zi

zj Pe z , i ie

we compute its derivative:

P δ(i = j)ezi i ezi − ezi ezj P = ( i ezi )2 δ(i = j)ezi e zi e zj = P zi − P zi P zi ie ie ie   e zi ezj = P zi δ(i = j) − P zi ie ie = hi (δ(i = j) − hj )

The softmax loss is defined as L = −

P

j

(269) (270) (271) (272)

yj log hj . Assume the ground-truth label

45

Algorithm 5 DC Programming 1: Set the parameters v ¯, µ, η in Armijo Rule with v¯ > 0 and 0 < µ < η < 1, T as the maximum number of iterations. 2: set t = 0. 3: while t < T do 4: θt = ∂h(βt ). 5: Solve the convex minimization problem in (280) to obtain βt+1 . 6: Set d(β) = βt+1 − βt . 7: if kd(β)k2 ≤ δ then 8: break. 9: end if 10: Set vt = v¯. 11: while f (βt+1 + vt d(β)) > f (βt+1 ) − µvt kd(β)k2 do 12: vt = ηvt . 13: end while 14: Update βt+1 = βt+1 + vt d(β). 15: Set t = t + 1. 16: end while vector y is one-hot vector, then ∂L ∂zi

X yj ∂hj hj ∂zi j X yj yi (−hi hj ) = − hi (1 − hi ) − hi h j j6=i X = −yi + hi yj =−

(273) (274) (275)

j

= −yi + hi

17

(276)

Optimization

17.1

Difference of Convex Functions

17.1.1

Difference of Convex (DC) Programming

46

DC programming [40, 41] can be used for minimizing the difference of two convex functions. Note that the difference between two convex functions has also been discussed in Section 12.4 w.r.t. CCCP. Assume

f(β) = g(β) − h(β),   (277)

where g and h are both convex functions. Let h*(θ) = sup_β {⟨β, θ⟩ − h(β)} be the conjugate function of h. Then the dual problem of (277) can be described as

f*(θ) = h*(θ) − g*(θ).   (278)

Note that θ ∈ ∂h(β) and β ∈ ∂g*(θ), where ∂h and ∂g* are the subgradients of h and g*, respectively. We can approximate h and g* with their affine minorizations at the points βt and θt:

h(β) ≈ h(βt) + ⟨β − βt, θt⟩,   (279)
g*(θ) ≈ g*(θt) + ⟨θ − θt, βt+1⟩,   (280)

where θt ∈ ∂h(βt) and βt+1 ∈ ∂g*(θt). By embedding (279) (resp., (280)) into (277) (resp., (278)) and omitting unrelated terms, we can obtain the updates of β and θ as

βt+1 = arg min_β {g(β) − ⟨β, θt⟩},   (281)
θt+1 = arg min_θ {h*(θ) − ⟨θ, βt+1⟩}.   (282)

Following [41], we use θt ∈ ∂h(βt) to replace (282) for simplicity. Then, we update θ and β alternately based on θt ∈ ∂h(βt) and (281). The whole DC algorithm is listed in Algorithm 5. Note that in Steps 10-14, we search for the smallest nonnegative number under the Armijo-type rule along the direction d(β) to achieve a large reduction in the value of f, following [42]. These steps are merely for acceleration and can be removed for brevity of the algorithm.

Algorithm 5 DC Programming
1: Set the parameters v̄, µ, η in the Armijo rule with v̄ > 0 and 0 < µ < η < 1, and T as the maximum number of iterations.
2: Set t = 0.
3: while t < T do
4:   Set θt ∈ ∂h(βt).
5:   Solve the convex minimization problem in (281) to obtain βt+1.
6:   Set d(β) = βt+1 − βt.
7:   if ‖d(β)‖2 ≤ δ then
8:     break.
9:   end if
10:  Set vt = v̄.
11:  while f(βt+1 + vt d(β)) > f(βt+1) − µ vt ‖d(β)‖^2 do
12:    vt = η vt.
13:  end while
14:  Update βt+1 = βt+1 + vt d(β).
15:  Set t = t + 1.
16: end while
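A toy sketch of the basic DCA updates (without the Armijo acceleration of Algorithm 5). The choice f(β) = g(β) − h(β) with g(β) = (1/2)‖β − a‖^2 and h(β) = λ‖β‖1 is an assumption for illustration, for which (281) has the closed-form solution βt+1 = a + θt.

import numpy as np

a, lam = np.array([0.3, -2.0, 1.5]), 0.7
beta = np.zeros_like(a)
for t in range(100):
    theta = lam * np.sign(beta)                  # theta_t in dh(beta_t), replacing (282)
    beta_next = a + theta                        # closed-form solution of (281) for this g
    if np.linalg.norm(beta_next - beta) <= 1e-8:
        break
    beta = beta_next
print(beta)                                      # a critical point of f = g - h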

17.1.2  Concave-Convex Procedure (CCCP)

Please refer to Section 12.4, where CCCP is used in latent structural SVM.


17.2  Gradient Descent

For a given function f(x), consider the Taylor expansion near x:

f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x)(y − x).   (283)

When using (1/t) I to replace ∇²f(x), we get the surrogate

f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖^2.   (284)

By minimizing (284) w.r.t. y, we can easily obtain y = x − t∇f(x).

17.2.1  Unconstrained Gradient Descent

To find the optimal step size t, instead of using exact line search t = arg min_{s≥0} f(x − s∆x), we can use an approximate method such as backtracking line search. Specifically, fix parameters α and β, start with t = 1, and while f(x − t∆x) > f(x) − αt∇f(x)^T ∆x, set t = βt. When ∆x = ∇f(x) (and α = 1/2), the condition becomes f(x − t∇f(x)) > f(x) − (t/2)‖∇f(x)‖^2.

Steepest descent method: Since for small v, f(x + v) ≈ f(x) + ∇f(x)^T v, the unit-norm step with the most negative directional derivative is

∆x_nsd = arg min_v {∇f(x)^T v | ‖v‖ = 1}.   (285)

Note that when we use the L2 norm, i.e., ‖v‖2 ≤ 1, the optimal solution is v = −∇f(x)/‖∇f(x)‖2, which is equivalent to the basic gradient descent. When using the L1 norm, i.e., ‖v‖1 ≤ 1, the optimal solution is v = −sign(∂f(x)/∂xi) ei with i = arg max_i |∂f(x)/∂xi| and ei the i-th standard basis vector, which is also referred to as the coordinate descent algorithm.

Newton's method: Based on the Taylor expansion, f(x + v) ≈ f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v. By setting the derivative of f(x + v) − f(x) w.r.t. v to zero, we get v = −∇²f(x)^{−1} ∇f(x), which is the Newton step.

Quasi-Newton method: approximate Newton's method without the burden of computing the inverse Hessian.
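A minimal sketch of gradient descent with the backtracking line search described above, on an assumed toy quadratic (the values of α, β and the problem data are illustrative):

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
alpha, beta = 0.3, 0.8
for _ in range(200):
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):   # backtracking condition with dx = grad f(x)
        t *= beta
    x = x - t * g
print(x, np.linalg.solve(A, b))                        # iterate vs. the exact minimizer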


Proximal gradient descent: First, we define the proximal mapping associated with a closed convex function h as

prox_{t,h}(x) = arg min_u (h(u) + (1/(2t))‖u − x‖^2).   (286)

Given a function f, we split it into a convex differentiable function g and a closed convex, possibly nondifferentiable, function h, i.e., f(x) = g(x) + h(x). Then, we can have

xk = prox_{tk,h}(xk−1 − tk ∇g(xk−1)).   (287)

To make it clearer, based on the definition of the proximal mapping,

xk = arg min_u (h(u) + g(xk−1) + ∇g(xk−1)^T (u − xk−1) + (1/(2tk))‖u − xk−1‖^2).   (288)

Conjugate Gradient Method: The conjugate gradient method was originally proposed to solve the quadratic programming problem, and was then extended to general functions (the Fletcher-Reeves algorithm), which is listed in Algorithm 6.

Algorithm 6 Fletcher-Reeves Algorithm
1: Initialize x0, d0 = −∇f(x0).
2: Set k = 0; n is the dimensionality of x.
3: while k < n do
4:   Use line search to obtain αk that minimizes f(xk + α dk).
5:   Set xk+1 = xk + αk dk.
6:   βk = ‖∇f(xk+1)‖^2 / ‖∇f(xk)‖^2.
7:   dk+1 = −∇f(xk+1) + βk dk.
8:   Set k = k + 1.
9: end while

LBFGS algorithm: https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
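Returning to the proximal gradient update (287)-(288), here is a sketch for the common split g(x) = (1/2)‖Ax − b‖^2 and h(x) = λ‖x‖1; this lasso-style choice and the soft-thresholding proximal operator are standard assumptions for illustration, not taken from the text.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
lam = 0.5
t = 1.0 / np.linalg.norm(A, 2) ** 2              # step size 1/L with L = ||A||_2^2

def soft_threshold(v, tau):
    # proximal mapping of tau*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(5)
for _ in range(500):
    grad_g = A.T @ (A @ x - b)                   # gradient of the smooth part g
    x = soft_threshold(x - t * grad_g, t * lam)  # x_k = prox_{t,h}(x_{k-1} - t * grad g(x_{k-1}))
print(x)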

17.2.2  Constrained Gradient Descent

Conditional gradient method: the Frank-Wolfe algorithm [43] targets solving the following problem:

min_x  f(x)   (289)
s.t.   x ∈ C,   (290)

where f(x) is a convex function and C is a convex set.


Algorithm 7 Frank-Wolfe Algorithm
1: Initialize x^0 ∈ C.
2: Set k = 0.
3: while k < K do
4:   Compute s* = arg min_{s∈C} s^T ∇f(x^k) by solving a constrained linear programming problem.
5:   Simply set the step size γ = 2/(k + 2), or compute the optimal γ via line search.
6:   x^{k+1} = (1 − γ) x^k + γ s*.
7:   Set k = k + 1.
8: end while

Projected gradient method: usually used to solve problems with a box constraint, i.e., Cl ≤ x ≤ Ch, or a unit-vector constraint, i.e., ‖x‖2 = 1.
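A toy run of Algorithm 7 on min_x ‖x − c‖^2 over the probability simplex (an assumed example, where the linear subproblem in Step 4 is solved by picking the vertex with the most negative gradient coordinate):

import numpy as np

c = np.array([0.1, 0.6, 0.8])
x = np.array([1.0, 0.0, 0.0])                  # x^0 in C (the probability simplex)
for k in range(500):
    grad = 2 * (x - c)
    s = np.zeros_like(x)
    s[np.argmin(grad)] = 1.0                   # argmin_{s in C} s^T grad is a simplex vertex
    gamma = 2.0 / (k + 2.0)                    # step size from Step 5
    x = (1 - gamma) * x + gamma * s
print(x)                                       # approaches the projection of c onto the simplex, about [0, 0.4, 0.6]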

17.3  Gradient Method on the Manifold

Newton's method and conjugate gradient descent on the Grassmann manifold or the Stiefel manifold are detailed in [44]. Here we take conjugate gradient descent on the Grassmann manifold as an example; the detailed algorithm is listed in Algorithm 8.

18  Theorems

18.1  Banach Fixed Point Theorem

Let (M, d) be a non-empty complete metric space with a contraction mapping T : M → M. Then T admits a unique fixed point x* in M (i.e., T(x*) = x*). Note that a mapping T is a contraction mapping if there exists a constant c ∈ [0, 1) such that d(T(x), T(y)) ≤ c · d(x, y) for all x, y ∈ M. The fixed point x* can be found in the following way: start with an arbitrary state x0 in M and define a sequence {xn} as xn = T(xn−1); this sequence converges and lim_{n→∞} xn = x*.


Algorithm 8 Conjugate Gradient for Minimizing F(Y) on the Grassmann Manifold
1: Given Y0 such that Y0^T Y0 = I, compute G0 = ∇_{Y0}F − Y0 Y0^T ∇_{Y0}F and set H0 = −G0. Set K as the maximum number of iterations. The size of Y is (n, p).
2: Set k = 0.
3: while k < K do
4:   Minimize F(Yk(t)) over t, where
       Yk(t) = Yk V cos(Σt) V^T + U sin(Σt) V^T,   (291)
     in which UΣV^T is the compact SVD of Hk.
5:   Set tk = tmin and Yk+1 = Yk(tk).
6:   Compute Gk+1 = ∇_{Yk+1}F − Yk+1 Yk+1^T ∇_{Yk+1}F.
7:   Parallel transport the tangent vectors Hk and Gk to the point Yk+1:
       H̃k = (−Yk V sin(Σtk) + U cos(Σtk)) Σ V^T,   (292)
       G̃k = Gk − (Yk V sin(Σtk) + U(I − cos(Σtk))) U^T Gk.   (293)
8:   Compute the new search direction Hk+1 = −Gk+1 + γk H̃k with γk = ⟨Gk+1 − G̃k, Gk+1⟩ / ⟨Gk, Gk⟩, in which the operator ⟨·, ·⟩ means ⟨∆1, ∆2⟩ = tr(∆1^T ∆2).
9:   Reset Hk+1 = −Gk+1 if mod(k + 1, p(n − p)) ≡ 0.
10:  Set k = k + 1.
11: end while

Therefore, given a contraction mapping T and a mapped result J*, we can use the following iteration to obtain T^{−1}(J*): xn+1 = g(xn) = xn + J* − T(xn). Note that the fixed point x* of g satisfies T(x*) = J*, i.e., x* = T^{−1}(J*). This essentially uses a recursive procedure to simulate the inverse function T^{−1}, as long as T is a contraction mapping. Actually, many commonly used functions (e.g., convolution, fully connected layers) are contraction mappings or can be approximated as contraction mappings [45].
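A small numerical sketch of the iteration xn+1 = xn + J* − T(xn); the map T below is an assumed toy contraction, chosen so that g(x) = x + J* − T(x) is itself contractive and the iteration provably converges.

import numpy as np

def T(x):
    return 0.5 * x + 0.25 * np.sin(x)          # toy contraction (Lipschitz constant <= 0.75)

x_true = np.array([0.8, -1.3, 2.1])
J_star = T(x_true)                             # observed output J* = T(x_true)
x = np.zeros_like(J_star)                      # arbitrary starting state x_0
for _ in range(100):
    x = x + J_star - T(x)                      # x_{n+1} = g(x_n); the fixed point satisfies T(x*) = J*
print(np.max(np.abs(x - x_true)))              # close to 0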


References

[1] Aljundi, R., Emonet, R., Muselet, D., Sebban, M.: Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In: CVPR. (2015)
[2] Kaufman, L., Broeckx, F.: An algorithm for the quadratic assignment problem using Bender's decomposition. European Journal of Operational Research 2(3) (1978) 207-211
[3] Shen, F., Zhou, X., Yang, Y., Song, J., Shen, H.T., Tao, D.: A fast optimization method for general binary code learning. TIP 25(12) (2016) 5610-5621
[4] Muller, K.R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B.: An introduction to kernel-based learning algorithms. In: Handbook of Neural Network Signal Processing. (2001)
[5] Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Mathematical Programming 142(1-2) (2013) 397-434
[6] Jaakkola, T., Meila, M., Jebara, T.: Maximum entropy discrimination. In: NIPS. (2000)
[7] Mao, L., Sun, S.: Soft margin consistency based scalable multi-view maximum entropy discrimination. In: IJCAI. (2016)
[8] Chao, G., Sun, S.: Alternative multiview maximum entropy discrimination. T-NNLS 27(7) (2016) 1445-1456
[9] Sun, S., Chao, G.: Multi-view maximum entropy discrimination. In: IJCAI. (2013)
[10] Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR. (2007)
[11] Dixit, M., Rasiwasia, N., Vasconcelos, N.: Adapted gaussian models for image classification. In: CVPR. (2011)
[12] Kim, H.J., Adluru, N., Banerjee, M., Vemuri, B.C., Singh, V.: Interpolation on the manifold of k component gmms. In: ICCV. (2015)
[13] Candes, E., Romberg, J.: l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf (2005)
[14] Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. PAMI 31(2) (2009) 210-227
[15] Kim, S., Koh, K., Lustig, M., Boyd, S., Gorinevsky, D.: A method for large-scale l1-regularised least squares problems with applications in signal processing and statistics. IEEE J. Select. Topics Signal Process (2007)
[16] Hale, E.T., Yin, W., Zhang, Y.: A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University (2007)
[17] Zhang, Q., Li, B.: Discriminative k-svd for dictionary learning in face recognition. In: CVPR. (2010)
[18] Rosasco, L., Verri, A., Santoro, M., Mosci, S., Villa, S.: Iterative projection methods for structured sparsity regularization. (2009)
[19] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1) (2009) 183-202
[20] Shu, K., Donghui, W.: A brief summary of dictionary learning based approach for classification. arXiv preprint arXiv:1205.6391 (2012)
[21] Nie, F., Huang, H., Cai, X., Ding, C.H.: Efficient and robust feature selection via joint l2,1-norms minimization. In: NIPS. (2010)
[22] Grave, E., Obozinski, G.R., Bach, F.R.: Trace lasso: a trace norm regularization for correlated designs. In: NIPS. (2011)
[23] Cai, J.F., Candes, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4) (2010) 1956-1982
[24] Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3) (2010) 471-501
[25] Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1) (2011) 1-122
[26] Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines. TIST 2(3) (2011) 27
[27] Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2(Dec) (2001) 265-292
[28] Yuille, A.L., Rangarajan, A.: The concave-convex procedure. Neural Computation 15(4) (2003) 915-936
[29] Lanckriet, G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5(Jan) (2004) 27-72
[30] Yue, Y., Finley, T., Radlinski, F., Joachims, T.: A support vector method for optimizing average precision. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (2007) 271-278
[31] Kolmogorov, V., Zabin, R.: What energy functions can be minimized via graph cuts? PAMI 26(2) (2004) 147-159
[32] Kohli, P., Torr, P.H.: Measuring uncertainty in graph cut solutions - efficiently computing min-marginal energies using dynamic graph cuts.
[33] Kohli, P., Torr, P.H.: Dynamic graph cuts for efficient inference in markov random fields. PAMI 29(12) (2007) 2079-2088
[34] Chikuse, Y.: Statistics on special manifolds. Volume 174. Springer Science & Business Media (2012)
[35] Gallivan, K.A., Srivastava, A., Liu, X., Van Dooren, P.: Efficient algorithms for inferences on grassmann manifolds. In: Statistical Signal Processing, 2003 IEEE Workshop on. (2003) 315-318
[36] Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: CVPR. (2012)
[37] Bishop, C.M.: Pattern recognition and machine learning.
[38] Bo, L., Sminchisescu, C.: Twin gaussian processes for structured prediction. IJCV 87(1) (2010) 28-52
[39] Elhoseiny, M., Saleh, B., Elgammal, A.: Write a classifier: Zero-shot learning using purely textual descriptions. In: ICCV. (2013)
[40] Tao, P.D., An, L.T.H.: Convex analysis approach to dc programming: Theory, algorithms and applications. Acta Mathematica Vietnamica 22(1) (1997) 289-355
[41] Dinh, T.P., Le Thi, H.A.: Recent advances in dc programming and dca. In: Transactions on Computational Intelligence XIII. (2014) 1-37
[42] Artacho, F.J.A., Fleming, R.M., Vuong, P.T.: Accelerating the dc algorithm for smooth functions. arXiv preprint arXiv:1507.07375 (2015)
[43] Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2) (1956) 95-110
[44] Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20(2) (1998) 303-353
[45] Tao, X., Zhou, C., Shen, X., Wang, J., Jia, J.: Zero-order reverse filtering. arXiv preprint arXiv:1704.04037 (2017)

