SLICED INVERSE REGRESSION WITH SCORE FUNCTIONS
Dmitry Babichev and Francis Bach

Problem: we consider the projection pursuit regression model for non-linear regression: y = g(w1ᵀx, ..., wkᵀx) + ε.
Goal: find the effective dimension reduction (e.d.r.) space spanned by w1, ..., wk.
Approach: we use the method of moments, based on the notion of score functions and an extension of Stein's lemma: E(S1(x) y) lies in the e.d.r. space [1].
Definition: the score function S1(x) is defined as S1(x) = −∇ log p(x), where p(x) is the probability density of x.
Extension of Stein's lemma: E(S1(x) | y) lies in the e.d.r. space almost surely.

[1] T. Stoker, Consistent estimation of scaled coefficients, Econometrica, 54 (1986), pp. 1461-1481.
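To make the moment approach concrete, here is a minimal sketch of a SADE-style estimator, assuming Gaussian inputs so that the score S1(x) = Σ⁻¹(x − μ) has a closed form; the slicing scheme and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def sade_directions(X, y, k, n_slices=10):
    """Sketch: estimate a k-dim e.d.r. space from sliced score moments E(S1(x) | y)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Gaussian score assumption: S1(x_i) = Sigma^{-1} (x_i - mu)
    scores = np.linalg.solve(Sigma, (X - mu).T).T
    # Slice y into bins and average the score within each slice (cf. SIR slicing)
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.stack([scores[idx].mean(axis=0) for idx in slices])
    # Leading right singular vectors span the estimated e.d.r. space
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k].T
```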
Results: score-function extensions of the sliced inverse regression method: first-order (SADE) and second-order (SPHD). Infinite and finite sample cases. Finite sample estimators and their consistency. Learning score functions: in two steps as well as directly.
[Figures: log2 R²(E, Ê) versus log2 n for SADE, SPHD, and PHD+, and R²(E, Ê) for SADE with the true score versus the one-step and two-step score-learning algorithms. Mean and standard deviation of the error for d = 10; y = x1/(1/2 + (x2 + 2)²) + σε. Comparison of one-step and two-step algorithms.]
Non-convex Phase Retrieval of Low-Rank Matrix Columns Seyedehsara Nayer*, Namrata Vaswani*, Yonina C. Eldar** *Iowa State University, **Technion
Contributions
Goal: recover a low-rank matrix X from phaseless measurements of its columns.
Applications: X-ray crystallography, astronomy, sub-diffraction imaging, ...
Contributions:
1. Develop AltMinTrunc, which exploits the low-rank structure of X: compute a truncated spectral initialization; the rest of the algorithm is an intuitive modification of AltMinPhase for the above problem.
2. Obtain high-probability sample complexity bounds for the AltMinTrunc initialization to provide a good approximation of X; when the rank of X is low enough, these are significantly smaller than what existing single-vector phase retrieval algorithms need.
Problem Setting
Instead of a single vector x, we have a set of q vectors x1, x2, ..., xq such that the n × q matrix X := [x1, x2, ..., xq] has rank r ≪ min(n, q).
For each xk, we observe a set of m measurements of the form
y_{i,k} := (a_{i,k}ᵀ xk)², i = 1, 2, ..., m, k = 1, 2, ..., q.
Motivating application: dynamic solar imaging from phaseless measurements; image changes are often influenced by only a few (r) factors
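A small simulation of the measurement model may help fix notation; the dimensions and Gaussian sensing vectors below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Illustrative simulation of the measurement model (not the authors' code):
# X is an n x q matrix of rank r; for each column x_k we observe m phaseless
# measurements y_{i,k} = (a_{i,k}^T x_k)^2 with Gaussian sensing vectors.
n, q, r, m = 64, 32, 3, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, q))  # rank-r signal
A = rng.standard_normal((m, n, q))            # a_{i,k}, stacked per column k
Y = np.einsum('mnk,nk->mk', A, X) ** 2        # y_{i,k} = (a_{i,k}^T x_k)^2
```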
“Non-convex Optimization with Frank-Wolfe Algo. & Its Variants” Jean Lafond, Hoi-To Wai and Eric Moulines (Poster ID: 12)
▶ The Frank-Wolfe (FW) algorithm is popular for ML tasks due to its efficiency in handling high-dimensional problems.
▶ Little is known about its behavior on non-convex problems.
▶ Our contributions: analyze a general FW algorithm with a time-varying non-convex objective and inexact linear optimization.
  ▶ General FW converges as fast as O(1/√T),
  ▶ with potential acceleration to (close to) O(1/T).

[Fig.: FW algorithm, illustrating the update θ_{t+1} = (1 − γt) θt + γt ât toward the atom at, with optimum θ⋆.]
Main Results
▶ If the FW gap gt = max_{θ∈C} ⟨∇Ft(θt), θt − θ⟩ is zero, then θt is a stationary point of min_{θ∈C} Ft(θ).
▶ Consider Ft(θ) as a time-varying objective function, with γt = t^{−α}, α ∈ [0.5, 1). G-FW: θ_{t+1} = θt + γt (ât − θt), where ât ≈ at := arg min_{a∈C} ⟨∇̂Ft(θt), a⟩ (see the sketch after this list).
▶ Assumption: as t → ∞, the variation |Ft(θ) − F_{t−1}(θ)| → 0 and ât can accurately track at, both at a sufficiently fast rate.
▶ Result: (i) the FW gap decreases as min_{t∈[T/2+1,T]} gt = O(1/T^{1−α}); (ii) the rate can be improved to close to O(1/T); (iii) the accumulation points of the sequence {θt}_{t≥1} are stationary points.
▶ Applications: (i) online FW; (ii) decentralized FW.
▶ Example: non-convex formulation for sparse + low-rank matrix completion.
▶ See you at the poster!
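A minimal sketch of the G-FW step, assuming an exact linear oracle over the ℓ1 ball and a generic gradient callback; the constraint set, function names, and step-size default are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """Linear oracle: argmin of <grad, a> over the l1 ball of given radius."""
    a = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    a[i] = -radius * np.sign(grad[i])
    return a

def gfw(grad_fn, theta0, T=1000, alpha=0.75):
    """Run T G-FW steps with gamma_t = t^{-alpha}; return iterate and last FW gap."""
    theta = theta0.copy()
    gap = np.inf
    for t in range(1, T + 1):
        g = grad_fn(theta)
        a_hat = lmo_l1(g)              # \hat{a}_t (exact oracle in this sketch)
        gap = g @ (theta - a_hat)      # FW gap g_t
        theta = theta + t ** (-alpha) * (a_hat - theta)
    return theta, gap

# Demo on a convex quadratic F(theta) = 0.5 * ||theta - b||^2:
b = np.array([0.3, -0.2, 0.1, 0.0, 0.4])
theta, gap = gfw(lambda th: th - b, np.zeros(5))
```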
Approximating Traffic Simulation using Neural Networks and its Application in Traffic Optimization
Paweł Gora, University of Warsaw; Karol Kurach, University of Warsaw
Problem: traffic optimization and many traffic analysis tasks require running a large number of time-consuming simulations. How can this be done efficiently?
Solution We can try to approximate outcomes of simulations (e.g., waiting times) using neural networks!
Results
Best average relative error: 1.56%. Best maximal relative error: 8.47%.
Neural networks trained using TensorFlow (TensorTraffic) with the Adam optimizer; 200-400 neurons, 1-3 layers; training set: > 50000 traffic signal settings (evaluated in the Traffic Simulation Framework software); 5-fold cross-validation.
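As an illustration of the kind of surrogate model described (not the authors' TensorTraffic code, which predates this API), a small fully connected regressor in tf.keras; the feature dimension and layer sizes are assumptions within the stated 200-400 neurons, 1-3 layers range.

```python
import tensorflow as tf

n_features = 21   # hypothetical: one feature per controlled traffic signal

# Small MLP mapping a traffic-signal setting to a simulation outcome,
# e.g. total waiting time, trained with Adam on simulator-generated data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(300, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```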
An Empirical Study of ADMM for Nonconvex Problems
Zheng Xu¹, Soham De¹, Mário A. T. Figueiredo², Christoph Studer³, Tom Goldstein¹
¹Department of Computer Science, University of Maryland, College Park, MD
²Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
³Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY
December 2016
Alternating direction method of multipliers (ADMM)
▶ Objective: min_{u,v} H(u) + G(v), subject to Au + Bv = b.
▶ Steps:
u_{k+1} = arg min_u H(u) + ⟨λk, −Au⟩ + (τk/2) ‖b − Au − Bvk‖₂²
v_{k+1} = arg min_v G(v) + ⟨λk, −Bv⟩ + (τk/2) ‖b − Au_{k+1} − Bv‖₂²
λ_{k+1} = λk + τk (b − Au_{k+1} − Bv_{k+1})
▶ (Adaptive) penalty parameter τk.
Questions
▶ Does ADMM converge in practice?
▶ Does the update order of H(u) and G(v) matter?
▶ Is the local optimal solution good?
▶ Does the penalty parameter τk matter?
▶ Is an adaptive penalty choice effective?
Empirical study on nonconvex applications
▶ Nonconvex applications:
  ▶ ℓ0-regularized linear regression: min_x (1/2)‖Dx − c‖₂² + ρ‖x‖₀
  ▶ ℓ0-regularized image denoising: min_x (1/2)‖x − c‖₂² + ρ‖∇x‖₀
  ▶ Phase retrieval: min_x (1/2)‖abs(Dx) − c‖₂²
  ▶ Eigenvector computation: max_x ‖Dx‖₂² subject to ‖x‖₂ = 1
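To ground the updates above, here is a minimal sketch of ADMM specialized to the first application (ℓ0-regularized linear regression), using the splitting u = v = x; the fixed penalty τ and helper names are illustrative, and the residual-balancing and adaptive variants studied in the paper are omitted.

```python
import numpy as np

def hard_threshold(z, thresh):
    """Prox of rho*||.||_0 with thresh = rho/tau: keep z_i only if z_i^2 >= 2*thresh."""
    out = z.copy()
    out[z ** 2 < 2 * thresh] = 0.0
    return out

def admm_l0(D, c, rho=0.1, tau=1.0, iters=500):
    """Sketch of ADMM for min_x 0.5||Dx - c||^2 + rho||x||_0 via u - v = 0."""
    n = D.shape[1]
    u, v, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtc = D.T @ D, D.T @ c
    for _ in range(iters):
        # u-update: least squares with the augmented quadratic term
        u = np.linalg.solve(DtD + tau * np.eye(n), Dtc + tau * v - lam)
        # v-update: prox of the l0 penalty (hard thresholding)
        v = hard_threshold(u + lam / tau, rho / tau)
        # dual update on the constraint u - v = 0
        lam = lam + tau * (u - v)
    return v
```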
[Figures: iterations to convergence (and objective value / PSNR) versus initial penalty parameter for ℓ0-regularized linear regression, ℓ0-regularized image denoising, and phase retrieval, comparing vanilla ADMM, residual balancing, and adaptive ADMM.]
More results in poster and paper.
Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method
Huishuai Zhang*, Yingbin Liang*, Yuejie Chi†
*Syracuse University, †The Ohio State University
December 9, 2016
Nonconvex Phase Retrieval
Problem: recover x ∈ ℝⁿ/ℂⁿ from magnitudes of linear measurements
yi = |⟨ai, x⟩|, for i = 1, ..., m.
▶ Wirtinger flow (WF) (Candès et al. '14) minimizes the nonconvex loss
ℓ_WF(z) := (1/4m) Σ_{i=1}^m (|aiᵀz|² − yi²)²
▶ Reshaped Wirtinger flow (RWF) minimizes another loss
ℓ(z) := (1/2m) Σ_{i=1}^m (|aiᵀz| − yi)²
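In code, the two losses differ only in where the square sits; a sketch for the real-valued case, with measurement rows stacked in a matrix A:

```python
import numpy as np

def wf_loss(A, y, z):
    """WF loss: (1/4m) * sum_i ((a_i^T z)^2 - y_i^2)^2."""
    return np.mean(((A @ z) ** 2 - y ** 2) ** 2) / 4.0

def rwf_loss(A, y, z):
    """RWF loss: (1/2m) * sum_i (|a_i^T z| - y_i)^2."""
    return np.mean((np.abs(A @ z) - y) ** 2) / 2.0
```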
[Figure: WF loss surface over (z1, z2).]
Incremental RWF (IRWF)
Problem: arg min_z Σ_{i=1}^m (|aiᵀz| − yi)².
IRWF: for iteration t, choose it uniformly from {1, 2, ..., m} and let
z^{(t+1)} = z^{(t)} − μ · (a_{it}ᵀ z^{(t)} − y_{it} · sgn(a_{it}ᵀ z^{(t)})) a_{it}.
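A minimal sketch of one pass of this update; the step size μ and real Gaussian measurements are assumptions, and the spectral initialization is omitted.

```python
import numpy as np

def irwf_pass(A, y, z, mu=1.0, rng=None):
    """One pass of Incremental RWF: m single-measurement updates on z."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(y)
    for _ in range(m):
        i = rng.integers(m)                       # i_t uniform on {0, ..., m-1}
        inner = A[i] @ z                          # a_{i_t}^T z^{(t)}
        z = z - mu * (inner - y[i] * np.sign(inner)) * A[i]
    return z
```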
▶ Converges very fast. To recover a real image (1920 × 1080):

                 IRWF   RWF    WF
  #passes           8    70   315
  time cost (s)  13.7   107   426
▶ Initialization by the spectral method → provable linear convergence.
▶ Random initialization → works well empirically, but lacks a proof.

Future direction: how does a stochastic method escape local minima and saddle points when initialized randomly?
L-SR1: A Novel Optimization Method for Deep Learning
Vivek Ramamurthy, Nigel Duffy, Sentient Technologies
December 9, 2016
Motivation and Algorithm Outline

Second-order methods offer: potential for distributed training; large mini-batches; curvature information.
Critical weaknesses: proliferation of saddle points; ill-conditioned curvature matrices; line search requires multiple gradient/function evaluations.
Our solution: a 'limited-memory' symmetric rank-one update; a trust-region method instead of line search; improved conditioning using batch normalization.
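For reference, here is the classical symmetric rank-one (SR1) update that the limited-memory method builds on, as a minimal sketch; the limited-memory compact representation and the trust-region subproblem solver are omitted.

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of curvature approximation B from step s and gradient change y."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny,
    # which keeps B well defined (SR1 can otherwise blow up).
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom
```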
Experimental Results