Tensor decompositions: old, new, and beyond - Ryota Tomioka

Viewer
Transcript

Tensor decompositions: old, new, and beyond Ryota Tomioka @MLSS, Kyoto 2015 Toyota Technological Institute at Chicago

Tensors appear everywhere

Spatio-temoral data

Tensors

Collaborative filtering Movies Users

Time Sensors

Matrices

Multivariate time-series

Star Wars

Titanic Blade Runner

5

2

4

User 2 1

4

2

User 3 5

?

?

User 1

Multiple relations Watch Buy Like

Tensor decomposition

Sensors

Tensors

Spatio-temoral data

Users

Time

Collaborative filtering Movies

Multiple relations

Users

Sensors

Matrices

Multivariate time-series

Learning with tensors

D

•  Input: tensor; output: scalar

y

=

n3

n1

n2

,

n3

n2

n1

E

+ ξ

–  input X may not be low-rank but the weight W may be.

•  Multi-task learning with tensors

Y

'

Restaurants

Restaurants

customers features

X

⇥

n3 customers

w pq

n1 features

Learning objectives Part 1: Matrices –  What is rank? Different views. –  Nuclear norm and factor-based regularization. –  Learning bounds

Part 2: Tensors –  What is rank? Which view generalizes, and how? –  Nuclear norm? Factor-based regularization?

Part 3: Beyond –  Convex relaxation for Tucker model –  Interaction between comptutation and statistics

Lecture format •  I will try to be as interactive as possible. –  Ask questions when you don’t understand. –  Please interrupt.

•  I will use iPython Notebooks to give further details. –  URL: https://github.com/ryotat/mlss15/tree/master/python –  Python environment? –  Download Enthought Canopy if you don’t

Let’s start from matrices •  Rank •  Decomposition •  Regularization

A 2 Rm⇥n

(with m  n, w.l.o.g.)

•  Theory –  and see how these generalize (or does not generalize) to tensors

Rank •  What are the ranks of these matrices?

0 1 @2 3 0 1 @2 3

2 4 6 1 1 0

0 1 1 3 6A = @2A · 1 2 3 3 9 1 0 1 ✓ 0 1 0 1 1 A @ A 1 = 1 1 · 1 0 3 0 3 1

Rank 1

Rank 2

◆ 0 1

Rank •  The rank of a matrix A = min R such that A =

R X

ur v r >

r=1

= max C such that AC is linearly independent (AC is the submatrix of A indexed by C) = number of non-zero singular values

Singular values/vectors •  are solutions of

Av = u A> u = v σ: singular value; (u,v): singular vectors

Fact: for (σj, uj, vj) j=1,…,m, as long as σj ≠ σk >

uj uk = 0,

and

>

vj vk

Singular values and eigenvalues •  Eigenvalue equation  

0 A>

A 0

u = v



u v

Symmetric matrix



0 A>



1 U A = 0 2 V

U V



⌃ 0

0 ⌃



U> U>

V> V>

Negative eigenvalues correspond to flipping the sign of V.

Singular values and eigenvalues •  Eigenvalue equation  

0 A>

A 0

u = v



u v

Symmetric matrix



0 A>



1 U A = 0 2 V

U V

That is, A



⌃ 0

0 ⌃

= U ⌃V >



U> U>

V> V>

Algorithmic view Fixed-point algorithm 1.  Initialize u0 and v0 randomly 2.  For t=0,1,2,…

ut+1 v

t+1

Av t = , kAv t k A> ut = kA> ut k



Does this converge? – Yes, essentially power iteration for

0 A>

A 0

Singular value decomposition A = U ⌃V > 1

U >U = I m,

V >V = I m,

⌃=

..

Usual convention: σ1≥σ2≥ …≥σm≥0.

This gives a decomposition

A=

m X

j uj v j

>

j=1

Is this minimal? Yes – keep only σj>0.

. m

!

Experiment •  Open USPS.ipynb

Summary: Rank 1. min R such that A =

R X

ur v r >

r=1

2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values

Summary: Rank •  Which definition generalizes to tensors? 1. min R such that A =

R X

ur v r

>

Yes -‐ Rank

r=1

2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) Yes – MulFlinear rank

3. number of non-zero singular values ???

#data points

Best rank-r approximation #dimensions

A

r=#embedding dimensions

U

V

T

Best rank-2 approximation USPS digits dataset

Data from: hLp://statweb.stanford.edu/~Fbs/ElemStatLearn/data.html

Best rank-r approximation •  Given an m x n matrix A, we want to find

ˆ r = argmin kA A X2Rm⇥n : rank(X)r

XkF

Best rank-1 approximation •  The simplest case

ˆ 1 = argmin kA A X2Rm⇥n : rank(X)1

Fact:

ˆ1 = A

1 u1 v 1

XkF

>

σ1: the largest singular value

Proof •  Let X = uv

kA

2 XkF

=

>

2 kAkF

>

2u Av +

2 kuk2

·

2 kvk2

Setting the derivative to zero, we have

Av = >

2 kvk2 · u, 2 kuk2 · v

A u= p Solution: u = 1 u1 ,

v=

Equations defining singular values/vectors

p

1 v1

Viewing it as a maximization •  Finding the best rank-one approximation

ˆ 1 = argmin kA A X2Rm⇥n : rank(X)1

XkF

is equivalent to finding (u,v) that maximizes 1

=

max m

u2R ,v2Rn , kuk=1,kvk=1

u> Av

Note: both problems are non-convex. Still admit poly-time algorithm (singular value decomp.)

Spectral norm ||A|| •  The largest singular value 1

=

max m

n

u2R ,v2R , kuk=1,kvk=1

>

u Av

is a norm. We denote it by kAk =

1

•  A norm needs to satisfy

1)  Positive homogeneity k↵Ak = |↵| · kAk 2)  Subadditivity 3)  Separation

(8↵ 2 R)

kA + Bk  kAk + kBk kAk = 0 ) A = 0

Best rank r approximation •  For r>1,

> ˆ Ar = U r ⌃ r V r

(take the first r columns of U and V and corresponding singular values) known as the Eckart–Young theorem

Best rank r approximation •  For r>1,

> ˆ Ar = U r ⌃ r V r

(take the first r columns of U and V and corresponding singular values) known as the Eckart–Young theorem

Implication: Subtracting the best rank-one approximation reduces the rank by one. Does this generalize to tensors? – No!

Low-rank regression •  Input x ∈ Rd, output y ∈ RC >

y =W x+⇠ where

W 2R

d⇥C

, rank(W )  r

•  Given training inputs (xi,yi), i=1,…,m

ˆ = argmin kY W W 2Rd⇥C : rank(W )r

Y = [y 1 , . . . , y m ] ,

W

>

2 XkF

X = [x1 , . . . , xm ]

Connection to linear neural network •  Linear neural network (Baldi & Hornik, 1989)

y

W> = V

activation (linear)

·

x

U>

Matrix factorization = Linear neural network

Analytic solution W

>

= P V ⌃XY

>

1 ⌃XX

P V = U r U r >, U r : the ⌃XX =

1 top r s.v. of ⌃XY ⌃XX ⌃XY Pm Pm > > x x , ⌃ = x y XY i=1 i i i=1 i i >

Matrix completion •  “Netflix problem” – Fill in the missing entries

User 1 User 2

Y =

User 3 User 4

5 1 5 ?

2 4 1 ?

4 2 ? ?

5 4 ? 4

Matrix completion •  Impossible without an assumption. (Missing entries can be arbitrary) --- problem is ill-posed •  Most common assumption: Low-rank decomposition Users

Movies

Y

U Users features

V

T

Movies features

Matrix completion •  Most common assumption: Low-rank decomposition (rank r) Users

Movies

yij

vj

ui Users features

Movies features

yij = ui v j

⇥

dot prodcut in r-‐dim space

Geometric Intuition r-‐dimensional space (r: the rank of the decomposiFon)

ua

Movies he doesn’t like

U Movies he likes

V

T

Geometric Intuition r-‐dimensional space (r: the rank of the decomposiFon)

ub

ua

V

U

T

Netflix problem –best rank-r approximation from partial observation minimize

U 2Rm⇥r ,V 2Rn⇥r

2

X

Yi,j

>

ui v j

2

(i,j)2⌦

>

3

u1 6 u2 > 7 6 7 U = 6 . 7, 4 .. 5 um >

2

>

3

v1 6v2 > 7 6 7 V =6 . 7 4 .. 5 vn >

Note: ui and vj are rows of U and V, respectively.

Non-convexity

u

v

Regularization •  Add L2 regularization terms minimize m⇥r

X

U 2R , (i,j)2⌦ V 2Rn⇥r

Yi,j

>

ui v j

2

+

2

kU k2F + kV k2F

How do we minimize this? –Gradient descent! Let’s try it with r=10 and r=50

Results for r=10 and r=50

Mean squared error

are surprisingly similar

Singular values lambda = 23.4

What is going on? •  Define W = U V > = (ui > v j )i,j •  Loss term: only depends on the product W. •  Regularizer: let’s define

1 2 2 g(W ) = min kU k + kV k F F m⇥r n⇥r 2 U 2R ,V 2R : W =U V >

Note: this is a function of W (U and V are minimized-out)!

Effective regularizer •  Now our objective is

minimize W 2Rm⇥n

X

(Yi,j

2

Wi,j ) + g(W )

(i,j)2⌦

1 2 2 min kU k + kV k with g(W ) = F F U 2Rm⇥r ,V 2Rn⇥r : 2 W =U V >

Is this a convex minimization problem in W?

Convexity Claim: g(w) is a norm (and therefore convex), if r is large enough. Proof: 1.  Positive homogeneity: 0

0

(for W = ↵W , take U = sign(↵) 2.  Triangular inequality:

p

p |↵|U and V = |↵|V .) 0

(for W = X + Y , take U = [U X , U Y ] and V = [V X , V Y ]. ) 3.  Separation: g(W) = 0 implies U=V=0, which implies W=0.

Nuclear norm •  The norm kW k⇤ =

min

W =U V >

is called the nuclear norm.

1 2 2 kU kF + kV kF 2 We use the norm notation instead of g(W) because we know that it is a norm.

•  Moreover,

kW k⇤ =

hX, W i = max m⇥n

X2R , kXk1

Spectral norm

m X

j

j=1

Linear sum of s.v.

Proof •  Easy to see that kW k⇤ 

Xm

j=1

j

 max hX, W i kXk1

p p > ˜ ˜ ˜ ˜ ˜ V˜ > .) (given W = U ⌃V , take U = U ⌃, V = V ⌃, and X = U

Pm

> •  For any W = u v j=1 j j Xm hX, W i = uj > Xv j j=1 Xm  kuj k2 · kv j k2

(because ||X||≤1)

j=1

1 Xm kuj k22 + kv j k22  j=1 2

(AM-GM)

Nuclear norm and spectral norm •  Nuclear norm and spectral norm are dual m X kW k⇤ = kXk = max j j, j=1

j

to each other. Just like L1 and L∞ d X kwk1 = |wj |, kxk1 = max |xj | j=1

j

When do we use the L1 norm? What can we expect from using the nuclear norm?

Nuclear norm and sparisty •  Nuclear norm promotes W to be low rank MSE=0.91

MSE=0.87

but maybe that’s not the end of the story…

Summary for matrix decomposition Learning with matrices Best rank-‐R approximaFon

Factor (U,V) regulariaFon Eckart- Young

SVD

Nuclear norm dual

Best rank-‐1 approximaFon Fixed-point condition

Power iteraFon

Operator norm deflation

Little bit of theory VC dimension Rademacher complexity

Complexity of low-rank matrices ⌦

•  Let fU ,V ,b (X) = X, U V

>

↵

+ b and define

Fr = fU ,V ,b : U 2 Rn1 ⇥r , V 2 Rn2 ⇥r , b 2 R

•  Examples –  Low-rank regression: X = x · ej –  Matrix completion:

>

X = ei · ej >

–  Matrix classification: general X.

What is the complexity of Fr? How many samples do we need?

Background •  VC dimension of F:

Vapnik

Chervonenkis

d(F) = max m s.t. |{(sgn(f (X 1 )), . . . , sgn(f (X m )) : f 2 F}| = 2m

Note: standard asymptotic statistics does not work here because the model is not regular (degenerate Fisher information).

Bound on the VC dimension d(Fr )  (r(n1 + n2 ) + 1) log(r(n1 + n2 ) + 1)

For sufficiently large n1 and n2. No more than #parameters!

•  This implies that for classification

R(f )

˜ Remp (f )  O

r

(r(n1 + n2 ) + 1) m

!

for any f 2 Fr with high probability (ignoring the log term) Pm 1 where R(f ) = Pr(y 6= f (X)), Remp (f ) = m i=1 I(yi 6= f (X i ))

•  Similar bound for real-valued output using pseudodimension.

Proof of the VC dimension •  Warren’s theorem

Warren (1968) “Lower bounds for approximation by nonlinear manifolds”

Note that

⌦

fU ,V ,b (X i ) = X i , U V

>

↵

+b

is a polynomial of degree at most 2 in r(n1+n2)+1 variables. Note: (U,V,b) are variables and Xi are coefficients.

Low-nuclear norm matrices •  Let fW (X) = hX, W i and define FW = fW : W 2 R

n1 ⇥n2

, kW k⇤  W

(no bias term for simplicity)

•  Examples –  Low-rank regression: X = x · ej –  Matrix completion:

>

X = ei · ej >

–  Matrix classification: general X.

Low-nuclear norm matrices •  Let fW (X) = hX, W i and define FW = fW : W 2 R

n1 ⇥n2

, kW k⇤  W

(no bias term for simplicity)

•  Rademacher complexity

m X 1 ˆ Rn (F) = E sup f 2F m i=1

i f (X i )

(σi takes +1 or -1 with probability 1/2)

–  measures how well F can fit +1/-1 noise. –  is scale sensitive.

Rademacher complexity bound [see Kakade, Shalev-Shwartz, Tewari 12]

If kX i k  X (for i = 1, . . . , m) Spectral norm r 2 ˆ Rm (FW )  log((n1 + n2 )e) · X · W. m Consequence: for any Lipschitz continuous loss function ` , r E`(y, f (X))

1 m

m X i=1

`(yi , f (X i ))  O

X 2 W 2 log(n1 + n2 ) m

How large are X and W? If Xi (i=1,…,m) are drawn from standard normal distribution, X2=O(n1+n2), and W2=O(r), if W has a constant Frobenius norm.

!

Proof of RC bound •  From the duality btwn. the spectral & nuclear norms m X 1 m i=1

•  Thus

m X 1 i hX i , W i  m i=1

ˆ m (FW )  E R

iX i

m X 1 m i=1

· kW k⇤

≤ W

iX i

·W

and using random matrix theory, E

1 m

m X i=1

iX i



r

2 log((n1 + n2 )e) · X m

if kX i k  X (for i = 1, . . . , m). (Using Corollary 4.2 from Tropp (2011) “User-Friendly Tail Bounds for Sums of Random Matrices”)

Tightening by expectation

[Foygel & Srebro, 2011]

•  Previous analysis does not assume any distribution for inputs X1,…,Xm. •  Considering the matrix completion setting and taking expectation over the random draw of the observed positions, 0s 1 ˆ m (FW ) = O @ Rm (FW ) := ER Expectation over the draw of the observed positions

(n1 + n2 ) log(n1 + n2 ) · WA n1 n2 m

W can be as large as (rn1n2)1/2 and sample complexity O(r(n1+n2)/ε2)

Approach from high-dimensional statistics

[Negahban & Wainwright 2011]

•  Let the entries of X1,…,Xm drawn i.i.d. from standard Gaussian distribution and assume yi = hX i , W ⇤ i + ⇠i , ⇠i ⇠ N (0, 2 ), (i = 1, . . . , m) truth (rank r)

then the estimator

m ⇣ 1 X ˆ = argmin W (yi W 2Rn1 ⇥n2 2m i=1

with

m

ˆ kW

p

2

hX i , W i) +

m kW k⇤

(n1 + n2 )/m satisfies ✓ 2 ◆ r(n1 + n2 ) ⇤ 2 W kF  O m

=c·

with high probability; r=rank(W*).

⌘

Summary so far •  Rank and nuclear norm: both measure complexity of a matrix-based hypothesis class. •  L2 regularization on the factors (U and V) induces nuclear norm regularization on the matrix UVT. •  Nuclear norm promotes the matrix to be lowrank. However, it can be independently justified even when it doesn’t.

Tensors Rank and multilinear rank Non-closedness and other weirdness Tucker decomposition Graphical representation

Recap: Rank for matrices 1. min R such that A =

R X

ur v r >

r=1

2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values

Rank for tensors •  A Kth-order tensor

A2R

n1 ⇥n2 ⇥···⇥nK

(we will take K=3 below to simplify notation) •  Definition: Rank is the min R such that

A=

R X r=1

ur

vr

wr

outer product of 3 vectors

Bibliographical note •  The decomposition

A=

R X r=1

ur

vr

wr

Triad

https://en.wikipedia.org/wiki/Triad_(music)

Triad

is called CP (Canonical Polyadic) decomposition. [Hitchcock (1927) “The expression of a tensor or a polyadic as a sum of products”]

•  Also known as CANDECOMP [Carroll & Chang, 1970] and PARAFAC [Harshman, 1970].

Rank quiz •  What is the rank of this tensor?     1 1 1 1 1 1 1 = A= 1 1 1 1 1 1 1

Rank 1

1st slice 2nd slice



0 A= 1 

0 A= 1

1 1 0 0

0 100

1 1 0 0

0 0







Rank 2

0.0995 0.5025 1 = 0.995 5.025 10    0.0995 0.5025 1 + 0.995 5.025 10

Rank 3

Tensor rank is not closed •  Consider the limit α→∞ of a rank 2 tensor 1 1 1 Y = (a1 + a2) ⇥ (b1 + b2) ⇥ (c1 + c2) a1 ⇥ b1 ⇥ c1

lim Y = a1 b1 c2 + a1 b2 c1 + a2 b1 c1

↵!1

•  The limit is rank 3 if [a1,a2], [b1,b2], [c1,c2] are linearly independent. (Example from Kolda & Bader 2009)

Rank of 2 x 2 x 2 tensors

Stegeman & Comon (2010) “Subtracting a best rank-1 approximation may increase tensor rank”

Bounding the rank •  A trivial bound rank(A)  n1 n2 n3 (consider each entry as a rank-one tensor)

•  A slightly better bound rank(A)  min(n1 n2 , n2 n3 , n1 n3 ) (consider each fiber as a rank-one tensor)

Rank = number of linearly independent columns? •  What is a column vector of a tensor? –fiber A mode-k fiber is a column vector obtained by fixing all but the kth index.

(Kolda & Bader 2009)

Unfolding and mode-k rank •  mode-k unfolding mode-k fibers

A(k) =

•  mode-k rank

2R

nk ⇥

Q

k0 6=k

rankk (A) = rank(A(k) )

•  Multilinear rank (rank1 (A), . . . , rankK (A))

nk 0

Questions •  How do these two notions of rank relate to each other? When are they equal? •  What kind of decomposition does lowmultilinear rank imply?

Equivalence when rank=1 Detecting a rank-one tensor is easy! •  Suppose A is rank one A = u v w = (ui vj wk )i,j,k

(again assuming K=3 for simplicity)

Rank 1

Mode-1 unfolding

⇥

A(1) = uv1 w1

uv2 w1

= u(v ⌦ w)>

Similarly

A(2) and A(3)

···

uvn2 wn3

⇤

(⌦: Kronecker product) are also rank 1.

Thus multilinear rank=(1,1,1). The other way is also true.

General relation rankk (A)  rank(A) 

K Y

rankk (A)

k=1

•  Left hand side:

A(1) =

R X r=1

ur (v r ⌦ wr )>

Rank at most R

•  Right hand side: given by the so-called Tucker decomposition

Preliminaries •  Define the 0 mode-k product 1 nk X Ai1 ,...,ik 1 ,j,ik+1 ,...,iK Uik j A A ⇥k U := @ j=1

for A 2 R

n1 ⇥···⇥nK

and U 2 R

For example,

A ⇥1 U =

n1

n1

In other words,

(A ⇥k U )(k) = U A(k)

i1 ,...,iK

n0k ⇥nk

n3

n2

n1 U applies to each mode-k fiber independently

Tucker decomposition [Tucker (1966) “Some mathematical notes on three-mode factor analysis”]

•  Tensor A has multilinear rank (r1,…,rK) implies n3 n1

Factors Core

n2

r3

= Aijk =

r1 r2

r2

r1

1

n1

r1 X r2 X r3 X

a=1 b=1 c=1

2

n2

r3

3

(1) (2) (3) Cabc Uia Ujb Ukc

n3

!

•  Also known as higher-order SVD [De Lathauwer+00]

Intuition •  Recall that (A ⇥k U )(k) = U A(k) •  Take U(k) , the (full) left singular vectors of A(k) (A ⇥k U •  Repeat

(k) >

)(k) =

nk rk

Y

k0 6=k

nk

r3

A ⇥1 U (1) > · · · ⇥K U (K) > = r1

r2

•  Reconstruct A = C ⇥1 U

(1)

⇥2 U

(2)

· · · ⇥K U

(K)

nk 0

Properties •  Tensor A has multilinear rank (r1,…,rK) implies

A= Thus

r1 X r2 X r3 X

a=1 b=1 c=1

rank(A)  •  In fact,

Cabc ua (1) ub (2) uc (3)

K Y

rankk (A)

k=1

rank(A) = rank(C)

if the columns of U (1) , . . . , U (K)are independent.

Amino acids fluorescence data

[Bro (1997) “PARAFAC: Tutorial and applications”]

•  Five samples containing different amounts of three types of amino acids (Tyr, Trp, and Phe) measured by fluorescence spectrometer Emission wavelength

Excitation wavelength

Mode-wise SVD Sample dim.

Emission dim.

Excitation dim.

SVD(A(1))

SVD(A(2))

SVD(A(3))

Multilinear rank ≒ (3, 3, 3) ⇒ 3 ≤ rank(A) ≤ 9

Graphical representation i U(1) C j U(2)

Tensor train decomposition [Oseledets, 2011]

U(3) k

Tucker decomposition

Hierarchical Tucker decomposition [Grasedyck, ‎2010]

Short summary •  Tensor rank is a much more delicate concept than matrix rank –  difficulty of computing (NP-hard [Håstad, 90]) –  non-closedness –  may increase by subtracting the best rank-1 approximation [Stegeman & Comon 2010]

•  Rank can be bounded by the multilinear rank, which can be obtained easily by mode-wise SVD.

Tensor decomposition Best rank-k approximation Singular values and eigenvalues Power method Random projection

Best rank-R approximation •  Minimization problem minimize A U ,V ,W

R X

ur

vr

wr

r=1

F

(Frobenius norm defined as kAkF =

•  Hardness

p

hA, Ai )

–  Known to be NP-hard even for R=1 (3-SAT)

•  Questions: for R=1 –  Does it lead to equations defining SVD? –  Does it relate to the operator norm?

Connection to spectral norm •  Finding ˆ , w) ˆ = argmin kA (ˆ u, v u,v,w

u v wkF

is equivalent to computing ˜ v ˜ wi ˜ kAk = max hA, u ˜ kuk1, k˜ v k1, ˜ kwk1

In fact, u ˆ= u ˜, v ˆ=v ˜, w ˆ =w ˜ is a best rank-one approximation.

with

= kAk

Connection to singular values •  Stationary condition implies

˜ > ⇥3 w ˜> = u ˜ A ⇥2 v

˜ > ⇥3 w ˜> = v ˜ A ⇥1 u

˜ > ⇥2 v ˜> = w ˜ A ⇥1 u

generalizes equations defining s.v. for matrices. •  We can define any triplet of unit vectors satisfying the above equations singular vectors [Lim, 2005] •  Difference: –  LHS is quadratic, whereas RHS is linear (motivation for defining Lp singular vectors with p=K.) –  Generally, not orthogonal.

Symmetrization •  Define a super-symmetric n’ x n’ x n’ tensor 2

(n’=n1+n2+n3)

0 0 0 A¯ = 4 0 0 A(3,2,1)

0

0 0

A(2,3,1) 0 A(3,1,2)

(1,3,2)

0 A 0 0 0 0

0 A(2,1,3) 0

A

•  Eigenvectors of A¯ , solutions of > > ¯ ¯ ¯ ¯ A ⇥2 u ⇥3 u = u

2 3 ˜ u ˜5 ¯ = 4v include vectors of the form u ˜ w

(1,2,3)

0 0

3 0 05 0

E.V.P. for symmetric tensors •  Power iteration (detailed below) will quickly converge to one of the solutions –but which solution? •  The solutions are not orthogonal to each other. Cannot use the usual deflation approach (it may actually increase the rank…) •  The orthogonally decomposable case

A=

R X r=1

r ur

ur

ur ,

(ur ? ur0 , 8r 6= r0 )

has attracted considerable attention [Anandkumar, Ge, Hsu, Kakade, Telgarsky, 2012]

Power method for symmetric tensors Fixed-point algorithm 1.  Initialize u0 randomly 2.  For t=0,1,2,… t+1

u

A ⇥2 (ut )> ⇥3 (ut )> = kA ⇥2 (ut )> ⇥3 (ut )> k2

Why does it work? •  For symmetric orthogonally decomposable tensors, R A=

X

u

t+1

r ur

ur , u

ur ,

r=1

t+1

↵

(ur ? ur0 , 8r 6= r0 )

t >

t >

/ A ⇥2 (u ) ⇥3 (u ) =

⌦

ur

/

XR

r=1

⌦

⌦

↵

2 ( u , u ) ur r r

t

t

↵

2 ( u , u ) r r

(Used orthogonality)

Notes on power iteration for orthogonally decomposable tensors ⌦

↵

⌦

•  •  •  • 

t

↵

/ r ( ur , u )2 ⌦ ↵ t •  The inner product ur , u will converge to zero ur , u

t+1

exponentially fast except for the one with the largest absolute value. The rate is K-1: power iteration converges faster for higher order tensors. Odd order K: can take λr≥0. RHS≥0. Even order K: RHS can be negative (oscillation). Deflation works (because the factors are orthogonal). Robustness to noise (see Anandkumar et al. 12)

Power method for asymmetric tensors 1.  Initialize u0, v0, w0 randomly 2.  For t=0,1,2,… ut+1 v t+1 wt+1

A ⇥2 (v t )> ⇥3 (wt )> = , t > t > kA ⇥2 (v ) ⇥3 (w ) k2 A ⇥1 (ut )> ⇥3 (wt )> = , kA ⇥1 (ut )> ⇥3 (wt )> k2 A ⇥1 (ut )> ⇥2 (v t )> = kA ⇥1 (ut )> ⇥2 (v t )> k2

Tensor and simultaneous diagonalization •  If R
R X r=1

wk1

wkr ur v r = U >

wkR

VT

•  Factors are common to all slices. •  If we can simultaneously diagonalize all slices, we are done.

Random projection •  Consider the random projection >

A ⇥3 x =

R X r=1

hwr , xi ur v r = U >

T V R

x is a random vector. The projection is again a low-rank matrix if R
•  How many random projections/slices do we need? –  Two! (if the factors are linearly independent) [Harshman (1972) Determination and Proof of Minimum Uniqueness Conditions for PARAFAC1]

Random projection algorithm [Harshman 1972; Goyal, Vempala, Xiao, 2014; Moitra et al.]

1.  Let A1 and A2 be two independent random projections of rank-R tensor A. (see Random Projection.ipynb) > 2.  Compute truncated rank-R SVD A1 = U 1 S 1 V 1

3.  Compute eigen-decomposition 1

>

S 1 U 1 A2 V 1 = P ⇤P 4.  Recover 5.  Note that

1

U = U 1S1P V = V 1P >

A1 = U V ,

>

A2 = U ⇤V

Simultaneous > diagonalization!

Recap: Rank for matrices and tensors 1. min R such that

A=

R X

ur v r

>

r=1

2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values

1.  min R such that

A=

R X r=1

ur

vr

wr

outer product of 3 vectors

2.  max number of linearly independent mode-k fibers

Summary for matrix decomposition Learning with matrices Best rank-‐R approximaFon

Factor (U,V) regulariaFon Eckart- Young

SVD

Nuclear norm dual

Best rank-‐1 approximaFon Fixed-point condition

Power iteraFon

Operator norm deflation

Regularization for tensors Regularization Nuclear norm and atomic norm Hardness

Learning with tensors •  Examples –  Input: pair of vectors (x, z); output: vector y. n3

y =

n2

n1

D

noise

⇥2 x ⇥3 z + ξ

–  Input: tensor; output: scalar

y =

n1

n3

n2

n3

,

n1

n2

E

+ ξ

Regularization •  In learning setting, we are interested in minimizing m X 1 minimize ` (y , hX , Wi) + R(W) i i W2Rn1 ⇥···⇥nK m i=1 •  Examples

–  Low-rank regression: X = eout –  Tensor completion:

–  Tensor classification:

X = ei

x z ej ek

(general tensor X)

Regularization •  In learning setting, we are interested in minimizing minimize n ⇥···⇥n

W2R

1

K

•  Questions:

m X 1 ` (yi , hXi , Wi) + R(W) m i=1

–  What regularization term R(W) should we use? –  How is it related to rank? –  How does regularizing the factors (U(1),…,U(K)) relate to a regullarization for W?

Regularizing the factors •  Recall that the nuclear norm 1 2 2 kW k⇤ = min kU kF + kV kF U ,V : 2 W =U V >

minimum over all factorizations W=UVT. •  A naïve idea R2,2 (W) =

(1)

min

⌘ 1 ⇣ (1) 2 kU kF + kU (2) k2F + kU (3) k2F 3

U ,U (2) ,U (3) : P (1) (2) (3) W= R u u u r r r r=1

Is this convex? Is this a norm?

Non-convexity •  Given any decomposition ⇣

R2,2 (W)  kU (1) kF · kU (2) kF · kU (3) kF

ˆ •  Thus taking (U ⇣

ˆ R2,2 (↵W)  |↵|kU

(1)

(1)

ˆ ,U

(2)

ˆ ,U

(3)

)=

⌘ 23

(AM-‐GM inequality)

argmin U (1) ,U (2) ,U (3)

ˆ kF · k U

(2)

ˆ kF · k U

(3)

kF

⌘ 23

2 3

⇣

···

⌘

= |↵| R2,2 (W)

(this violates convexity)

Atomic norm •  Atomic norm [Chandrasekaran+2012] is a norm whose unit ball is the convex hull of an atomic set. •  Example: –  L1 norm unit ball in 2D = conv({e1,-e1, e2, -e2}) –  Group lasso –  Nuclear norm –  etc.

(Tensor) nuclear norm •  Atomic norm with respect to the atomic set n o Arank1 = u(1) · · · u(K) : ku(1) k2 = · · · = ku(K) k2 = 1 rank-one tensor

•  Can be written as a minimum kWknuc =

min

U (1) ,U (2) ,...,U (K)

s.t.

•  In other words,

XR

r=1

W=

(2) (K) ku(1) k · ku k · · · ku 2 2 r r r k2

XR

r=1

W 2 conv(Arank1 ) , 9decomposition

(2) (K) u(1) u · · · u r r r

R P

r=1

(1)

(K)

kur k2 · · · kur

k2  1

Dual of nuclear norm is the spectral norm •  Dual of nuclear norm kAknuc⇤ = max hA, Wi , W

s.t.

W 2 conv(Arank1 )

kAknuc⇤ = kAk

Summary Fact: max hX , Wi =

X: kX k1

min

U (1) ,U (2) ,U (3)

s.t.

Dual of (tensor) spectral norm

XR

r=1

W=

(2) (3) ku(1) k · ku k · ku 2 2 r r r k2

XR

r=1

(2) (3) u(1) u u r r r

(tensor) Atomic norm = nuclear norm = wrt Arank1

How do we compute this norm? •  Unfortunately it is NP hard to compute both the nuclear norm and the spectral norm for tensors. •  Wait, they are norms (convex functions) aren’t they easy to deal with? •  Equivalently written as convex problem with |A |Arank1 rank1|| many constraints. Just finding the most violated constraint is NP hard.

Generative model? •  By AM-GM inequality kWknuc

⌘ 1 ⇣ (1) K = min kur k2 + · · · + ku(K) kK r 2 r=1 K U (1) ,U (2) ,...,U (K) XR (2) (K) s.t. W = u(1) u · · · u r r r XR

r=1

P (U

(1)

,...,U

(K)

)/

•  If we consider instead n

K Y

k=1

exp

1 K

R X r=1

K ku(k) k r 2

Arank1,K = u(1) · · · u(K) : ku(1) kK = · · · = ku(K) kK kWknuc =

min

U (1) ,U (2) ,...,U (K)

XR

r=1

!

o =1

⌘ 1 ⇣ (1) K K kur kK + · · · + ku(K) r kK K

Summary for tensor decomposition

known rank-‐R decomposiFon

Learning with tensors

Factor (U,V) regulariaFon

Best rank-‐R approximaFon

Nuclear norm dual

Best rank-‐1 approximaFon Fixed-point condition

Random projecFon

Power iteraFon

Operator norm

If orthogonally decomposable

Convex relaxation of Tucker

Recap: Tucker decomposition •  Suppose a tensor W is low-rank in the sense of Tucker decomposition Factors n3

n1

Core

n2

r3

=

r1 r2

r2

r1

1

n1

2

n2

r3

3

n3

–  All unfoldings W(1), W(2), W(3) are simultaneously low-rank. –  How can we encourage W to be simultaneously low-rank?

Approach 1: As a matrix •  Pick a mode k, and hope the tensor to be low-rank in mode k. minimize n ⇥···⇥n

W2R

1

K

1 ky 2m

2 X(W)k2

+

m kW (k) k⇤

Observation operator X : Rn1 ⇥···⇥nK ! Rm

X(W) = (hX1 , Wi , . . . , hXm , Wi) > Pro: Basically a matrix problem Easy to carry out, theory available Con: Have to be lucky to pick the right mode

Approach 2: Overlapped trace norm [Liu+09, SignoreLo+10, T+10, Gandy+11]

•  Regularize by the sum of mode-wise nuclear norms

minimize n ⇥···⇥n

W2R

1

K

1 ky 2m

2 X(W)k2

+

k=1

•  Intuition:

kW (k) k⇤

Nuclear norm of unfoldings

–  the same tensor is

×

regularized to be simultaneously low-rank w.r.t. all modes

m

K X

× ×

K=3

Overlapped nuclear norm

[T+10; Signoretto+10; Gandy+11; Liu+09]

•  Regularize the sum of nuclear norms

W

S1 /1

:=

K X

k=1

•  Is this a norm?

W (k)

⇤

Nuclear norm of unfoldings

Yes –convex because sum of convex functions and positive homogeneous.

•  Is this easy to compute? Yes –K-unfoldings and SVDs.

Approach 3: Mixture of low-rank tensors [T, Hayashi, Kashima10]

•  Each mixture component W(k) is regularized to be low-rank in the corresponding mode. minimize n ⇥···⇥n

W2R

1

K

1 y 2m

X

⇣P

K k=1

W (k)

⌘

2 2

+

m

K X

k=1

(k)

kW (k) k⇤ ×

•  Each W(k) takes care of each mode •  Sparse solution (group lasso like effect) •  Note: the sum may not be low rank

+ =

× + ×

K=3

Latent nuclear norm •  Effective regularization term W

S1 /1

:=

min

W (1) ,...,W (K)

K X

(k)

W (k)

k=1

⇤

s.t. W =

K X

k=1

W (k)

•  Is this a norm? Yes –it is the dual of the overlapped S∞/∞-norm:

X

S1 /1

:= max X (k)

•  Is it easy to compute?

k

spectral norm

Yes –convex minimization problem involving matrix nuclear norm

Property of the norms •  Nuclear norm

kW k⇤ 

p

rkW kF

if rank(W )  r,

•  Overlapped trace norm

kWkS1 /1 

XK p k=1

rk kWkF

if W has multilinear rank (r1 , . . . , rK )

•  Latent trace norm

kWkS1 /1  min k

p

rk kWkF

if W has multilinear rank (r1 , . . . , rK )

Tensor completion result •  True tensor: 50x50x20, rank 7x8x9. No noise (λ=0). •  Random train/test split. 2

Generalization error

10

As a Matrix (mode 1) As a Matrix (mode 2) As a Matrix (mode 3) Overlap Constraint Mixture Latent Tucker (large) Tucker (exact) Optimization tolerance

0

10

−2

10

−4

10

Sharp phase-transition 0

0.1

0.2

0.3 0.4 0.5 0.6 0.7 Fraction of observed elements

0.8

0.9

1

Tucker implemented in n-way toolbox. [Andersson & Bro, 00]

Almost full-rank case True tensor: Size 50x50x20, rank 50x50x5. No noise (λ=0). As a Matrix (mode 1) As a Matrix (mode 2) As a Matrix (mode 3) Overlap Constraint Latent Mixture Tucker (large) Tucker (exact) Optimization tolerance

0

Generalization error

10

−1

10

−2

10

−3

10

0

0.2

0.4 0.6 Fraction of observed elements

0.8

Latent approach can automatically ﬁgure out which mode is low-rank

1

Overlapped norm: key properties •  How it relates to the multilinear rank PK p W S1 /1  k=1 rk W F average of the ranks

•  How the dual norm behaves E X

(S1 /1)⇤

O

1 K2

K p X

k=1

N/nk

!

,

(N =

Random Gaussian tensor N(0, σ2)

Y k

nk )

Overlapped norm: denoising guarantee

[T,Suzuki,Hayashi,Kashima,11]

•  Let Y be a noisy observation of W*

Y = W⇤ + E

(E ⇠ N (0,

•  Then the estimator ˆ = argmin W W

with 1 ˆ W N

= c0 (

W

⇤

2 F

PK

✓

k=1

 c1

1 Y 2

p 2

W

2 F

+

W

2

))

S1 /1

◆

N/nk )/K gives

1 K

K X

k=1

p

rk

!2

with prob. at least 1-exp(-poly(n))

1 K

K X

k=1

p 1/ nk

!2

Latent norm: key properties •  How it relates to the multilinear rank p W S /1  (min rk ) W F 1 k

minimum rank

•  How the dual norm behaves ✓ ◆ p E X

(S1

/1)⇤

˜ O

max k

N/nk ,

(N =

Random Gaussian tensor N(0, σ2)

Y k

nk )

Denoising guarantees •  Overlapped nuclear norm: sensitive to the average rank !2 !2 K K 1 ˆ 1 Xp 1 X p ⇤ 2 2 W W F  c1 rk 1/ nk N K K k=1

k=1

•  Latent nuclear norm: sensitive to the minimum rank

1 ˆ W N

W

⇤

2 F

 c2

2

min rk /min nk k

k

•  Matters only if rk≠rk’ or nk≠nk’ (typical many tensor problems)

Could we obtain a bound of mink(rk/nk)?

Scaled latent nuclear norm [Wimalawarne, Sugiyama, T, 2014]

p

•  Normalize by nk W

scaled

=

inf

W (1) ,...,W (K)

K X

k=1

1 (k) p kW (k) kS1 nk

–  relation to the multilinear rank

W

scaled



✓

min k

r

–  behavior of the dual norm

E X

scaled⇤

rk nk

◆

⇣ p ⌘ ˜ O N ,

s.t.

K X

k=1

W

W (k) = W

F

(N =

Q

k

nk )

Denoising guarantees •  Overlapped nuclear norm: sensitive to the average rank !2 !2 K K 1 ˆ 1 Xp 1 X p ⇤ 2 2 W W F  c1 rk 1/ nk N K K k=1

k=1

•  Latent nuclear norm: sensitive to the minimum rank

1 ˆ W N

W

⇤

2 F

 c2

2

min rk /min nk k

k

•  Scaled latent norm: sensitive to the minimum rank-todimension ratio

1 ˆ W N

W

⇤

2 F

 c3

2

rk min k nk

Multi-task learning with tensors

[Romera-‐Paredes+13; SignoreLo+13]

•  Example: multi-aspect restaurant recommendation

Y

'

Restaurants

Restaurants

customers features

X

⇥

n3 customers

w pq

n1 features

•  Issue: heterogeneous dimensions and ranks –  Number of aspects is usually much smaller than the number of customers. –  Rank (amount of sharing of information) may be different for aspect, customers, and features.

Theory for multi-linear MTL •  n2 x n3 tasks. Observed tasks S ✓ [n2 ] ⇥ [n3 ] •  Training samples

–  many tasks may have no samples (zero-shot mpq learning) (xipq , yipq ) , (p, q) 2 S i=1

•  Training ˆ = W

argmin W2Rn1 ⇥n2 ⇥n3

1 ˆ where L(W) = |S|

X

ˆ L(W), s.t. W

(p,q)2S

mpq

?

C

1 X `(hxipq , wpq i , yipq ) mpq i=1

Analysis

[Wimalawarne, Sugiyama, T, 14]

•  Lemma: Let W* be any tensor with rank (r1,r2,r3) and p element-wise bounded by B. For C = B (r) N ˆ L(W)

0

p

N ⇤ @ L(W )  O ⇤B (r) · ·E |S|

X

(p,q)2S

mpq 1 X mpq i=1

ipq Xipq

?⇤

1 A

How the norm relates to the rank How the dual norm behaves σipq: Rademacher RV W  (r) W ?

F

⇤ : Lipschitz constant of the loss `

⇥

>

⇤

 : Measure of feature correlation E xipq xipq = Cpq /n1 I n1 X 1 L(W) : Expected loss L(W) = E` (hxpq , wpq i , ypq ) n 2 n3 (p,q)2[n2 ]⇥[n3 ]

Theorem (overlap norm)

[Wimalawarne, Sugiyama, T, 14]

•  Let W* be any tensor with rank q (r1,r2,r3) and elementwise bounded by B. For C = B krk1/2 n1 n2 n3 ˆ L(W)

⇤

✓

L(W )  O ⇤B

r

 krk1/2 min (Dk log Dk ) k m|S|

⇤ : Lipschitz constant of the loss `

⇥

 : Measure of feature correlation E xipq xipq ✓ K ◆2 Pp 1 krk1/2 = K rk ,

>

⇤

= Cpq

/n1 I n1

k=1

D1 := n1 + n2 n3 ,

D2 := n2 + n1 n3 ,

◆

D3 := n3 + n1 n2

Theorem (latent norm)

[Wimalawarne, Sugiyama, T, 14]

•  Let W* be any tensor with rank (r1,r2,r3) and elementq wise bounded by B. For C = B min rk n1 n2 n3 k

ˆ L(W)

⇤

✓

L(W )  O ⇤B

r

 min rk max(Dk log Dk ) k m|S| k

⇤ : Lipschitz constant of the loss `

⇥

 : Measure of feature correlation E xipq xipq D1 := n1 + n2 n3 ,

D2 := n2 + n1 n3 ,

>

⇤

= Cpq

◆

/n1 I n1

D3 := n3 + n1 n2

Theorem (scaled latent norm)

[Wimalawarne, Sugiyama, T, 14]

•  Let W* be any tensor with rank (r1,r2,r3) and elementq wise bounded by B. For C = B min(rk /nk )n1 n2 n3 k

ˆ L(W)

⇤

L(W )  O ⇤B

s

 min m|S| k

⇤ : Lipschitz constant of the loss `

⇥

✓

◆

rk n1 n2 n3 log(max Dk ) k nk

 : Measure of feature correlation E xipq xipq D1 := n1 + n2 n3 ,

D2 := n2 + n1 n3 ,

>

⇤

= Cpq

!

/n1 I n1

D3 := n3 + n1 n2

Sample complexities

(ignoring log terms and per 1/ε2) Overlap

Latent

Scaled latent

Multi-task learning (homogeneous)

n n n

krk1/2 n2

(min rk )n2 k

(min rk )n2 k

Multi-task learning (heterogeneous)

n1 n2 n 3

krk1/2 n2 n3

Tensor completion krk1/2 min Dk n1 n2 k n3

n1 n2 n3

min rk max Dk k

k

 min(r1 n2 n3 , n1 n 2 r3 ) rk min N k nk

Restaurant recommendation data

[Vargas-‐Govea+2011; Romera-‐Paredes+2013]

•  45 features x 3 aspects x 138 customers 0.7 Overlapped trace norm Latent trace norm Scaled latent trace norm Mode 1 Mode 2 Mode 3 RR

0.65

MSE

0.6 0.55 0.5 0.45 0.4 0.35 0

500

1000

1500 2000 Sample size

2500

3000

ILEA School data [ILEA; Bakker & Heskes 2003] •  24 features x 139 schools x 3 years 40

Explained variance

35

29.5%

30

Hierarchical Bayes [Bakker & Heskes 03]

25 Overlapped trace norm Latent trace norm Scaled latent trace norm Mode 1 Mode 2 Mode 3 RR

20

15

10 0

2000

4000

6000 8000 Sample size

10000

26.7% Bilinear MTL [Argyriou 08]

12000

Tensor completion min W

s.t.

W

S1 /1

Wi,j,k = Yi,j,k

Error ||W*−W^||F

size=50x50x20, rank=7x8x9

0

10

Convex EM (nonconvex) Optimization tolerance

−3

10

roughly 35% is 0 0.1 0.2 enough for perfect recovery!

0.3 0.4 0.5 0.6 0.7 Fraction of observed elements

0.8

0.9

1

EM algorithm implemented in n-way toolbox. [Andersson & Bro, 00]

Problem setting ⇤ W : true K-way tensor with rank (r1,...,rK)

Observation

⇤

yi = hXi , W i + ✏i Optimization

ˆ = W

argmin W2Rn1 ⇥···⇥nK

✓

(i = 1, . . . , m)

Gaussian noise N(0,σ2)

Likelihood

1 ky 2m

Regularization

X(W)k22 +

m

W

S1 /1

◆

Reg. constant Observation operator X : Rn1 ⇥···⇥nK ! Rm

X(W) = (hX1 , Wi , . . . , hXm , Wi) >

Theorem (random Gaussian design) There are constants c1, c2 such that K 1 rn 2 ˆ W W ⇤ F  c2 2 M holds with high probability if #samples(m) Q #variables( k nk )

r c1 n

Condition for the sample size m: •  Depends on the normalized rank r/n •  Independent of the noise σ

and

m

=8

q

nK

1/ m

Condition for the reg. parameter λm: •  Independent of the rank r •  Deepends on the noise σ

For simplicity n1,…,nK=n, r1,…,rK=r

Tensor completion Convex [7 8 9] Covex [40 9 7] Optimization tolerance

0

10

0.01 −3

10

0

0.2 0.4 0.6 0.8 Fraction of observed elements

#samples (M ) #variables (N )

(N :=

QK

k=1

nk )

No observation noise

Fraction M/N at error<=0.01 Fraction at Error<=0.01

Estimation error

size = 50x50x20 true rank 7x8x9 or 40x9x7

11

rank=[7,8,9] rank=[40,9,7]

0.8 0.6 0.4 0.2 0 0

size=[50 50 20] size=[100 100 50] 0.2 0.4 0.6 −1 Normalized rank ||n ||1/2||r||1/2

Normalized rank

0.8

Theory vs. Experiments (4th order) 1

Fraction at err<=0.01

0.8

0.6

0.4 size=[50 50 20] size=[100 100 50] size=[50 50 20 10] size=[100 100 20 10]

0.2

0 0

0.2

0.4 0.6 Normalized rank ||n−1||

||r||

1/2

0.8

1/2

1

Limitation: too many samples required! •  Sample complexity = number of samples required to obtain error ε

ˆ W

W

⇤

2 F

 c2

K rn 2 M

1

,

K 1

M = O(rn

/✏)

•  For K=2: O(rn) = number of parameters rank r matrix has. •  For K>2: too many. In fact, for the 50 x 50 x 20 rank (7,8,9) case, about 1500 (≒50*7+50*8+20*9+7*8*9) samples (3% << 35%) should be right.

Tensorlab

Sorber, Van Barel and De Lathauwer (2014).

•  RT “Hi, I was trying cpd with missing entries and found that the error does not go to zero as the number of observed entries increases…”

Tensorlab

Tensorlab

Sorber, Van Barel and De Lathauwer (2014).

•  Laurent Sorber’s reply “Thank you for your detailed example … I have been experimenting further with your code, and below is a method that gets good performance with Tensorlab …” Tensorlab

Lesson learned •  When you are doing something non-convex, details matter a lot! –  Random initialization turns out to be better than the default intelligent initialization. –  Make sense to try several short restarts (4 x 5). –  Default max iteration too small.

•  Although global guarantee may be absent, nonlinear optimization could work very well in practice.

Computation-statistics trade-off

Simplest case　[Montanari & Richard 14] •  Noise corrupted symmetric rank-one tensor n n

⇤

n

=

u ⇤ u ⇤ u

n

n

+ n Gaussian N(0,1)

β: signal-to-noise ratio

•  How large does β need to be to recover u* from Y?

Optimal: maximum likelihood

[Richard & Montanari 14]

•  Consider the maximum likelihood estimator

ˆ ML = argmax hY, u u ui u u2Rn

E Z

•  If

, with high probability

Operator norm

1

⇤

| hˆ uML , u i | = O

•  Note that E Z

=O

⇣p

E Z

!

nK log(K)

⌘

[T&Suzuki 14]

•  This is optimal (no estimator can do better) •  But computationally interactable (best rank-1 approx.)!

Power method [Anandkumar+12; Richard & Montanari 14] •  Is a computationally tractable algorithm q

•  With c · nK K log(K) randomly initialized power iteration converges to a solution with ! E Z ⇤ 1 | hˆ u, u i | = O with high probability. •  Note the sharp increase in the required β (trading statistical performance for tractability)

Recursive unfolding

[Richard & Montanari 14]

1.  Matricize Y so that it looks as square as possible

n n

bK/2c

dK/2e

Y(square)

linearly index the first K/2 dimensions

linearly index the last K/2 dimensions

2.  Compute the leading singular vector v. Note that v is nK/2 dimensional. 3.  Reshape v into an n x nK/2-1 matrix and return the top-left singular vector.

Performance of (recursive) unfolding •  Richard & Montanari: recursive ✓ unfolding ◆ attains

1

under

| hˆ u, u⇤ i | = O

KndK/2e 2

p c · ndK/2e K with high probability.

–  loose for odd-order tensors –conjectured nK/4 suffice

•  A tighter analysis [Zheng & T, 2015]: taking the topleft singular vector of the ordinary (rectangular) unfolding is sufficient to obtain a threshold

c0 · nK/4

n

nK-1

Y(1)

ˆ ≒ u

Why simpler method gets better bound •  R&M analyzes the perturbation of

Y(square) = β

(u⇤ ⌦ · · · ⌦ u⇤ )> | {z } dK/2e

u⇤ ⌦ · · · ⌦ u⇤ | {z } bK/2c

+

EkZk = O(

•  Z&T analyzes the perturbation of n

n

Y (1) Y (1)

2 β = >

+ β ⇤

⇤ >

u (u )

Z

+ β

p

ndK/2e )

+

Z (1) Z (1) > (better controlled perturbation!) (cross terms)

Two-phase behavior •  Because the cross term depends on β 1

8 ⇣ K⌘
if c · nK/4  if n(K

1)/2



 n(K

1)/2

Z (1) Z (1) >

,

β

+ β

(dominant term)

Summary of rank-one tensor recovery •  Ideal ML estimator: •  Randomly initialized PI:

c·

p

nK log(K) q c0 · nK K log(K)

–  Note: in practice, it seems to be much more robust!

•  (Recursive) unfolding: •  Questions:

c00 · nK/4

–  Is there a fundamental limit how small β can be and still computationally tractable? –  How about more general problems?

Barak & Moitra (2015) “Tensor Prediction, Rademacher Complexity and Random 3-XOR” •  Studied the atomic ⇢ norm w.r.t

Arank1,1 =

u u u : kuk1

for symmetric n x n x n tensors.

C p n atoms are “balanced”

•  For tensor completion, the atomic norm has almost optimal O(n/ε2) samples complextiy but intractable. •  Sixth order sum-of-squares relaxation yields O(nK/2/ ε2) sample complexity and polytime –O(n6) size SDP. •  Suggested no efficient algorithm can beat O(nK/2/ε2).

Jain & Oh (2014) “Provable Tensor Factorization with Missing Data” •  Showed that alternating minimization algorithm with careful initialization achieves O(r5n3/2(log(n))4log(r||W*||F/ε)) sample complexity for K=3. –  Slightly high scaling with the rank (probably not optimized) –  The same O(nK/2) scaling –  Note that if we consider σ2=1/β2=1/m (inverse sample size), the same m ≥ c’’nK/2 scaling for (recursive) unfolding.

Conclusion •  Tensors lie in the intersection of a lot of interesting areas –theory, data analysis, machine learning, and numerical linear algebra. •  Many tensor operations are known to be hard but often gradient descent/power iteration works in practice –  Interaction between computation and statistics: can we trade statistical performance for tractability? –  Need to understand hardness in contexts.

•  Learning with tensors is an area already explored empirically (relational data, neural networks) but not much theoretically

T h a n k

y o u !

Hoeffding decompositions and urn sequences