SLICED INVERSE REGRESSION WITH SCORE FUNCTIONS
Dmitry Babichev and Francis Bach

Problem: we consider the projection pursuit regression model for non-linear regression: y = g(w1ᵀx, ..., wkᵀx) + ε.
Goal: find the effective dimension reduction (e.d.r.) space spanned by w1, ..., wk.
Approach: we use the method of moments, based on the notion of score functions and an extension of Stein's lemma: E(S1(x) y) lies in the e.d.r. space [1].
Definition: the score function S1(x) is defined as S1(x) = −∇ log p(x), where p(x) is the probability density of x.
Extension of Stein's lemma: E(S1(x) | y) lies in the e.d.r. space almost surely.

[1] T. Stoker, Consistent estimation of scaled coefficients, Econometrica, 54 (1986), pp. 1461-1481.
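To make the moment approach concrete, here is a minimal sketch of a SADE-style estimator, assuming Gaussian inputs so that the score S1(x) = Σ⁻¹(x − μ) has a closed form; the slicing scheme and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def sade_directions(X, y, k, n_slices=10):
    """Sketch: estimate a k-dim e.d.r. space from sliced score moments E(S1(x) | y)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Gaussian score assumption: S1(x_i) = Sigma^{-1} (x_i - mu)
    scores = np.linalg.solve(Sigma, (X - mu).T).T
    # Slice y into bins and average the score within each slice (cf. SIR slicing)
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.stack([scores[idx].mean(axis=0) for idx in slices])
    # Leading right singular vectors span the estimated e.d.r. space
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k].T
```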
Results: score-function extensions of the sliced inverse regression method: first-order (SADE) and second-order (SPHD). Infinite and finite sample cases. Finite sample estimators and their consistency. Learning score functions: in two steps as well as directly.
[Figures: log2 R²(E, Ê) versus log2 n for SADE, SPHD, and PHD+, and R²(E, Ê) for SADE with the true score versus the one-step and two-step score-learning algorithms. Mean and standard deviation of the error for d = 10; y = x1/(1/2 + (x2 + 2)²) + σε. Comparison of one-step and two-step algorithms.]
Non-convex Phase Retrieval of Low-Rank Matrix Columns Seyedehsara Nayer*, Namrata Vaswani*, Yonina C. Eldar** *Iowa State University, **Technion
Contributions
Goal: recover a low-rank matrix X from phaseless measurements of its columns.
Applications: X-ray crystallography, astronomy, sub-diffraction imaging, ...
Contributions:
1. Develop AltMinTrunc, which exploits the low-rank structure of X: compute a truncated spectral initialization; the rest of the algorithm is an intuitive modification of AltMinPhase for the above problem.
2. Obtain high-probability sample complexity bounds for the AltMinTrunc initialization to provide a good approximation of X; when the rank of X is low enough, these are significantly smaller than what existing single-vector phase retrieval algorithms need.
Problem Setting
Instead of a single vector x, we have a set of q vectors x1, x2, ..., xq such that the n × q matrix X := [x1, x2, ..., xq] has rank r ≪ min(n, q).
For each xk, we observe a set of m measurements of the form
y_{i,k} := (a_{i,k}ᵀ xk)², i = 1, 2, ..., m, k = 1, 2, ..., q.
Motivating application: dynamic solar imaging from phaseless measurements; image changes are often influenced by only a few (r) factors
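A small simulation of the measurement model may help fix notation; the dimensions and Gaussian sensing vectors below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Illustrative simulation of the measurement model (not the authors' code):
# X is an n x q matrix of rank r; for each column x_k we observe m phaseless
# measurements y_{i,k} = (a_{i,k}^T x_k)^2 with Gaussian sensing vectors.
n, q, r, m = 64, 32, 3, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, q))  # rank-r signal
A = rng.standard_normal((m, n, q))            # a_{i,k}, stacked per column k
Y = np.einsum('mnk,nk->mk', A, X) ** 2        # y_{i,k} = (a_{i,k}^T x_k)^2
```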
“Non-convex Optimization with Frank-Wolfe Algo. & Its Variants” Jean Lafond, Hoi-To Wai and Eric Moulines (Poster ID: 12)
▶ The Frank-Wolfe (FW) algorithm is popular for ML tasks due to its efficiency in handling high-dimensional problems.
▶ Little is known about its behavior on non-convex problems.
▶ Our contributions: analyze a general FW algorithm with a time-varying non-convex objective and inexact linear optimization.
  ▶ General FW converges as fast as O(1/√T),
  ▶ with potential acceleration to (close to) O(1/T).

[Fig.: FW algorithm, illustrating the update θ_{t+1} = (1 − γt) θt + γt ât toward the atom at, with optimum θ⋆.]
Main Results
▶ If the FW gap gt = max_{θ∈C} ⟨∇Ft(θt), θt − θ⟩ is zero, then θt is a stationary point of min_{θ∈C} Ft(θ).
▶ Consider Ft(θ) as a time-varying objective function, with γt = t^{−α}, α ∈ [0.5, 1). G-FW: θ_{t+1} = θt + γt (ât − θt), where ât ≈ at := arg min_{a∈C} ⟨∇̂Ft(θt), a⟩ (see the sketch after this list).
▶ Assumption: as t → ∞, the variation |Ft(θ) − F_{t−1}(θ)| → 0 and ât can accurately track at, both at a sufficiently fast rate.
▶ Result: (i) the FW gap decreases as min_{t∈[T/2+1,T]} gt = O(1/T^{1−α}); (ii) the rate can be improved to close to O(1/T); (iii) the accumulation points of the sequence {θt}_{t≥1} are stationary points.
▶ Applications: (i) online FW; (ii) decentralized FW.
▶ Example: non-convex formulation for sparse + low-rank matrix completion.
▶ See you at the poster!
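A minimal sketch of the G-FW step, assuming an exact linear oracle over the ℓ1 ball and a generic gradient callback; the constraint set, function names, and step-size default are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """Linear oracle: argmin of <grad, a> over the l1 ball of given radius."""
    a = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    a[i] = -radius * np.sign(grad[i])
    return a

def gfw(grad_fn, theta0, T=1000, alpha=0.75):
    """Run T G-FW steps with gamma_t = t^{-alpha}; return iterate and last FW gap."""
    theta = theta0.copy()
    gap = np.inf
    for t in range(1, T + 1):
        g = grad_fn(theta)
        a_hat = lmo_l1(g)              # \hat{a}_t (exact oracle in this sketch)
        gap = g @ (theta - a_hat)      # FW gap g_t
        theta = theta + t ** (-alpha) * (a_hat - theta)
    return theta, gap

# Demo on a convex quadratic F(theta) = 0.5 * ||theta - b||^2:
b = np.array([0.3, -0.2, 0.1, 0.0, 0.4])
theta, gap = gfw(lambda th: th - b, np.zeros(5))
```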
Approximating Traffic Simulation using Neural Networks and its Application in Traffic Optimization
Paweł Gora, University of Warsaw; Karol Kurach, University of Warsaw
Problem: traffic optimization and many traffic analysis tasks require running a large number of time-consuming simulations. How can this be done efficiently?
Solution We can try to approximate outcomes of simulations (e.g., waiting times) using neural networks!
Results
Best average relative error: 1.56%. Best maximal relative error: 8.47%.
Neural networks trained using TensorFlow (TensorTraffic) with the Adam optimizer; 200-400 neurons, 1-3 layers; training set: > 50000 traffic signal settings (evaluated in the Traffic Simulation Framework software); 5-fold cross-validation.
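As an illustration of the kind of surrogate model described (not the authors' TensorTraffic code, which predates this API), a small fully connected regressor in tf.keras; the feature dimension and layer sizes are assumptions within the stated 200-400 neurons, 1-3 layers range.

```python
import tensorflow as tf

n_features = 21   # hypothetical: one feature per controlled traffic signal

# Small MLP mapping a traffic-signal setting to a simulation outcome,
# e.g. total waiting time, trained with Adam on simulator-generated data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(300, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```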
An Empirical Study of ADMM for Nonconvex Problems
Zheng Xu¹, Soham De¹, Mário A. T. Figueiredo², Christoph Studer³, Tom Goldstein¹
¹Department of Computer Science, University of Maryland, College Park, MD
²Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
³Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY
December 2016
Alternating direction method of multipliers (ADMM)
▶ Objective: min_{u,v} H(u) + G(v), subject to Au + Bv = b.
▶ Steps:
u_{k+1} = arg min_u H(u) + ⟨λk, −Au⟩ + (τk/2) ‖b − Au − Bvk‖₂²
v_{k+1} = arg min_v G(v) + ⟨λk, −Bv⟩ + (τk/2) ‖b − Au_{k+1} − Bv‖₂²
λ_{k+1} = λk + τk (b − Au_{k+1} − Bv_{k+1})
▶ (Adaptive) penalty parameter τk.
Questions
▶ Does ADMM converge in practice?
▶ Does the update order of H(u) and G(v) matter?
▶ Is the local optimal solution good?
▶ Does the penalty parameter τk matter?
▶ Is an adaptive penalty choice effective?
Empirical study on nonconvex applications
▶ Nonconvex applications:
  ▶ ℓ0-regularized linear regression: min_x (1/2)‖Dx − c‖₂² + ρ‖x‖₀
  ▶ ℓ0-regularized image denoising: min_x (1/2)‖x − c‖₂² + ρ‖∇x‖₀
  ▶ Phase retrieval: min_x (1/2)‖abs(Dx) − c‖₂²
  ▶ Eigenvector computation: max_x ‖Dx‖₂² subject to ‖x‖₂ = 1
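To ground the updates above, here is a minimal sketch of ADMM specialized to the first application (ℓ0-regularized linear regression), using the splitting u = v = x; the fixed penalty τ and helper names are illustrative, and the residual-balancing and adaptive variants studied in the paper are omitted.

```python
import numpy as np

def hard_threshold(z, thresh):
    """Prox of rho*||.||_0 with thresh = rho/tau: keep z_i only if z_i^2 >= 2*thresh."""
    out = z.copy()
    out[z ** 2 < 2 * thresh] = 0.0
    return out

def admm_l0(D, c, rho=0.1, tau=1.0, iters=500):
    """Sketch of ADMM for min_x 0.5||Dx - c||^2 + rho||x||_0 via u - v = 0."""
    n = D.shape[1]
    u, v, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtc = D.T @ D, D.T @ c
    for _ in range(iters):
        # u-update: least squares with the augmented quadratic term
        u = np.linalg.solve(DtD + tau * np.eye(n), Dtc + tau * v - lam)
        # v-update: prox of the l0 penalty (hard thresholding)
        v = hard_threshold(u + lam / tau, rho / tau)
        # dual update on the constraint u - v = 0
        lam = lam + tau * (u - v)
    return v
```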
[Figures: iterations to convergence (and objective value / PSNR) versus initial penalty parameter for ℓ0-regularized linear regression, ℓ0-regularized image denoising, and phase retrieval, comparing vanilla ADMM, residual balancing, and adaptive ADMM.]
More results in poster and paper.
Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method
Huishuai Zhang*, Yingbin Liang*, Yuejie Chi†
*Syracuse University, †The Ohio State University
December 9, 2016
Nonconvex Phase Retrieval
Problem: recover x ∈ ℝⁿ/ℂⁿ from magnitudes of linear measurements
yi = |⟨ai, x⟩|, for i = 1, ..., m.
▶ Wirtinger flow (WF) (Candès et al. '14) minimizes the nonconvex loss
ℓ_WF(z) := (1/4m) Σ_{i=1}^m (|aiᵀz|² − yi²)²
▶ Reshaped Wirtinger flow (RWF) minimizes another loss
ℓ(z) := (1/2m) Σ_{i=1}^m (|aiᵀz| − yi)²
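In code, the two losses differ only in where the square sits; a sketch for the real-valued case, with measurement rows stacked in a matrix A:

```python
import numpy as np

def wf_loss(A, y, z):
    """WF loss: (1/4m) * sum_i ((a_i^T z)^2 - y_i^2)^2."""
    return np.mean(((A @ z) ** 2 - y ** 2) ** 2) / 4.0

def rwf_loss(A, y, z):
    """RWF loss: (1/2m) * sum_i (|a_i^T z| - y_i)^2."""
    return np.mean((np.abs(A @ z) - y) ** 2) / 2.0
```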
[Figure: WF loss surface over (z1, z2).]
Incremental RWF (IRWF)
Problem: arg min_z Σ_{i=1}^m (|aiᵀz| − yi)².
IRWF: for iteration t, choose it uniformly from {1, 2, ..., m} and let
z^{(t+1)} = z^{(t)} − μ · (a_{it}ᵀ z^{(t)} − y_{it} · sgn(a_{it}ᵀ z^{(t)})) a_{it}.
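A minimal sketch of one pass of this update; the step size μ and real Gaussian measurements are assumptions, and the spectral initialization is omitted.

```python
import numpy as np

def irwf_pass(A, y, z, mu=1.0, rng=None):
    """One pass of Incremental RWF: m single-measurement updates on z."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(y)
    for _ in range(m):
        i = rng.integers(m)                       # i_t uniform on {0, ..., m-1}
        inner = A[i] @ z                          # a_{i_t}^T z^{(t)}
        z = z - mu * (inner - y[i] * np.sign(inner)) * A[i]
    return z
```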
▶ Converges very fast. To recover a real image (1920 × 1080):

                 IRWF   RWF    WF
  #passes           8    70   315
  time cost (s)  13.7   107   426
▶ Initialization by the spectral method → provable linear convergence.
▶ Random initialization → works well empirically, but lacks a proof.

Future direction: how does a stochastic method escape local minima and saddle points when initialized randomly?
L-SR1: A Novel Optimization Method for Deep Learning
Vivek Ramamurthy, Nigel Duffy, Sentient Technologies
December 9, 2016
Motivation and Algorithm Outline

Second-order methods offer: potential for distributed training; large mini-batches; curvature information.
Critical weaknesses: proliferation of saddle points; ill-conditioned curvature matrices; line search requires multiple gradient/function evaluations.
Our solution: a 'limited-memory' symmetric rank-one update; a trust-region method instead of line search; improved conditioning using batch normalization.
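For reference, here is the classical symmetric rank-one (SR1) update that the limited-memory method builds on, as a minimal sketch; the limited-memory compact representation and the trust-region subproblem solver are omitted.

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of curvature approximation B from step s and gradient change y."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny,
    # which keeps B well defined (SR1 can otherwise blow up).
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom
```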
Experimental Results