Tensor decompositions: old, new, and beyond Ryota Tomioka @MLSS, Kyoto 2015 Toyota Technological Institute at Chicago
Tensors appear everywhere
Spatio-temoral data
Tensors
Collaborative filtering Movies Users
Time Sensors
Matrices
Multivariate time-series
Star Wars
Titanic Blade Runner
5
2
4
User 2 1
4
2
User 3 5
?
?
User 1
Multiple relations Watch Buy Like
Tensor decomposition
Sensors
Tensors
Spatio-temoral data
Users
Time
Collaborative filtering Movies
Multiple relations
Users
Sensors
Matrices
Multivariate time-series
Learning with tensors
D
• Input: tensor; output: scalar
y
=
n3
n1
n2
,
n3
n2
n1
E
+ ξ
– input X may not be low-rank but the weight W may be.
• Multi-task learning with tensors
Y
'
Restaurants
Restaurants
customers features
X
⇥
n3 customers
w pq
n1 features
Learning objectives Part 1: Matrices – What is rank? Different views. – Nuclear norm and factor-based regularization. – Learning bounds
Part 2: Tensors – What is rank? Which view generalizes, and how? – Nuclear norm? Factor-based regularization?
Part 3: Beyond – Convex relaxation for Tucker model – Interaction between comptutation and statistics
Lecture format • I will try to be as interactive as possible. – Ask questions when you don’t understand. – Please interrupt.
• I will use iPython Notebooks to give further details. – URL: https://github.com/ryotat/mlss15/tree/master/python – Python environment? – Download Enthought Canopy if you don’t
Let’s start from matrices • Rank • Decomposition • Regularization
A 2 Rm⇥n
(with m n, w.l.o.g.)
• Theory – and see how these generalize (or does not generalize) to tensors
Rank • What are the ranks of these matrices?
0 1 @2 3 0 1 @2 3
2 4 6 1 1 0
0 1 1 3 6A = @2A · 1 2 3 3 9 1 0 1 ✓ 0 1 0 1 1 A @ A 1 = 1 1 · 1 0 3 0 3 1
Rank 1
Rank 2
◆ 0 1
Rank • The rank of a matrix A = min R such that A =
R X
ur v r >
r=1
= max C such that AC is linearly independent (AC is the submatrix of A indexed by C) = number of non-zero singular values
Singular values/vectors • are solutions of
Av = u A> u = v σ: singular value; (u,v): singular vectors
Fact: for (σj, uj, vj) j=1,…,m, as long as σj ≠ σk >
uj uk = 0,
and
>
vj vk
Singular values and eigenvalues • Eigenvalue equation
0 A>
A 0
u = v
u v
Symmetric matrix
0 A>
1 U A = 0 2 V
U V
⌃ 0
0 ⌃
U> U>
V> V>
Negative eigenvalues correspond to flipping the sign of V.
Singular values and eigenvalues • Eigenvalue equation
0 A>
A 0
u = v
u v
Symmetric matrix
0 A>
1 U A = 0 2 V
U V
That is, A
⌃ 0
0 ⌃
= U ⌃V >
U> U>
V> V>
Algorithmic view Fixed-point algorithm 1. Initialize u0 and v0 randomly 2. For t=0,1,2,…
ut+1 v
t+1
Av t = , kAv t k A> ut = kA> ut k
Does this converge? – Yes, essentially power iteration for
0 A>
A 0
Singular value decomposition A = U ⌃V > 1
U >U = I m,
V >V = I m,
⌃=
..
Usual convention: σ1≥σ2≥ …≥σm≥0.
This gives a decomposition
A=
m X
j uj v j
>
j=1
Is this minimal? Yes – keep only σj>0.
. m
!
Experiment • Open USPS.ipynb
Summary: Rank 1. min R such that A =
R X
ur v r >
r=1
2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values
Summary: Rank • Which definition generalizes to tensors? 1. min R such that A =
R X
ur v r
>
Yes -‐ Rank
r=1
2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) Yes – MulFlinear rank
3. number of non-zero singular values ???
#data points
Best rank-r approximation #dimensions
A
r=#embedding dimensions
U
V
T
Best rank-2 approximation USPS digits dataset
Data from: hLp://statweb.stanford.edu/~Fbs/ElemStatLearn/data.html
Best rank-r approximation • Given an m x n matrix A, we want to find
ˆ r = argmin kA A X2Rm⇥n : rank(X)r
XkF
Best rank-1 approximation • The simplest case
ˆ 1 = argmin kA A X2Rm⇥n : rank(X)1
Fact:
ˆ1 = A
1 u1 v 1
XkF
>
σ1: the largest singular value
Proof • Let X = uv
kA
2 XkF
=
>
2 kAkF
>
2u Av +
2 kuk2
·
2 kvk2
Setting the derivative to zero, we have
Av = >
2 kvk2 · u, 2 kuk2 · v
A u= p Solution: u = 1 u1 ,
v=
Equations defining singular values/vectors
p
1 v1
Viewing it as a maximization • Finding the best rank-one approximation
ˆ 1 = argmin kA A X2Rm⇥n : rank(X)1
XkF
is equivalent to finding (u,v) that maximizes 1
=
max m
u2R ,v2Rn , kuk=1,kvk=1
u> Av
Note: both problems are non-convex. Still admit poly-time algorithm (singular value decomp.)
Spectral norm ||A|| • The largest singular value 1
=
max m
n
u2R ,v2R , kuk=1,kvk=1
>
u Av
is a norm. We denote it by kAk =
1
• A norm needs to satisfy
1) Positive homogeneity k↵Ak = |↵| · kAk 2) Subadditivity 3) Separation
(8↵ 2 R)
kA + Bk kAk + kBk kAk = 0 ) A = 0
Best rank r approximation • For r>1,
> ˆ Ar = U r ⌃ r V r
(take the first r columns of U and V and corresponding singular values) known as the Eckart–Young theorem
Best rank r approximation • For r>1,
> ˆ Ar = U r ⌃ r V r
(take the first r columns of U and V and corresponding singular values) known as the Eckart–Young theorem
Implication: Subtracting the best rank-one approximation reduces the rank by one. Does this generalize to tensors? – No!
Low-rank regression • Input x ∈ Rd, output y ∈ RC >
y =W x+⇠ where
W 2R
d⇥C
, rank(W ) r
• Given training inputs (xi,yi), i=1,…,m
ˆ = argmin kY W W 2Rd⇥C : rank(W )r
Y = [y 1 , . . . , y m ] ,
W
>
2 XkF
X = [x1 , . . . , xm ]
Connection to linear neural network • Linear neural network (Baldi & Hornik, 1989)
y
W> = V
activation (linear)
·
x
U>
Matrix factorization = Linear neural network
Analytic solution W
>
= P V ⌃XY
>
1 ⌃XX
P V = U r U r >, U r : the ⌃XX =
1 top r s.v. of ⌃XY ⌃XX ⌃XY Pm Pm > > x x , ⌃ = x y XY i=1 i i i=1 i i >
Matrix completion • “Netflix problem” – Fill in the missing entries
User 1 User 2
Y =
User 3 User 4
5 1 5 ?
2 4 1 ?
4 2 ? ?
5 4 ? 4
Matrix completion • Impossible without an assumption. (Missing entries can be arbitrary) --- problem is ill-posed • Most common assumption: Low-rank decomposition Users
Movies
Y
U Users features
V
T
Movies features
Matrix completion • Most common assumption: Low-rank decomposition (rank r) Users
Movies
yij
vj
ui Users features
Movies features
yij = ui v j
⇥
dot prodcut in r-‐dim space
Geometric Intuition r-‐dimensional space (r: the rank of the decomposiFon)
ua
Movies he doesn’t like
U Movies he likes
V
T
Geometric Intuition r-‐dimensional space (r: the rank of the decomposiFon)
ub
ua
V
U
T
Netflix problem –best rank-r approximation from partial observation minimize
U 2Rm⇥r ,V 2Rn⇥r
2
X
Yi,j
>
ui v j
2
(i,j)2⌦
>
3
u1 6 u2 > 7 6 7 U = 6 . 7, 4 .. 5 um >
2
>
3
v1 6v2 > 7 6 7 V =6 . 7 4 .. 5 vn >
Note: ui and vj are rows of U and V, respectively.
Non-convexity
u
v
Regularization • Add L2 regularization terms minimize m⇥r
X
U 2R , (i,j)2⌦ V 2Rn⇥r
Yi,j
>
ui v j
2
+
2
kU k2F + kV k2F
How do we minimize this? –Gradient descent! Let’s try it with r=10 and r=50
Results for r=10 and r=50
Mean squared error
are surprisingly similar
Singular values lambda = 23.4
What is going on? • Define W = U V > = (ui > v j )i,j • Loss term: only depends on the product W. • Regularizer: let’s define
1 2 2 g(W ) = min kU k + kV k F F m⇥r n⇥r 2 U 2R ,V 2R : W =U V >
Note: this is a function of W (U and V are minimized-out)!
Effective regularizer • Now our objective is
minimize W 2Rm⇥n
X
(Yi,j
2
Wi,j ) + g(W )
(i,j)2⌦
1 2 2 min kU k + kV k with g(W ) = F F U 2Rm⇥r ,V 2Rn⇥r : 2 W =U V >
Is this a convex minimization problem in W?
Convexity Claim: g(w) is a norm (and therefore convex), if r is large enough. Proof: 1. Positive homogeneity: 0
0
(for W = ↵W , take U = sign(↵) 2. Triangular inequality:
p
p |↵|U and V = |↵|V .) 0
(for W = X + Y , take U = [U X , U Y ] and V = [V X , V Y ]. ) 3. Separation: g(W) = 0 implies U=V=0, which implies W=0.
Nuclear norm • The norm kW k⇤ =
min
W =U V >
is called the nuclear norm.
1 2 2 kU kF + kV kF 2 We use the norm notation instead of g(W) because we know that it is a norm.
• Moreover,
kW k⇤ =
hX, W i = max m⇥n
X2R , kXk1
Spectral norm
m X
j
j=1
Linear sum of s.v.
Proof • Easy to see that kW k⇤
Xm
j=1
j
max hX, W i kXk1
p p > ˜ ˜ ˜ ˜ ˜ V˜ > .) (given W = U ⌃V , take U = U ⌃, V = V ⌃, and X = U
Pm
> • For any W = u v j=1 j j Xm hX, W i = uj > Xv j j=1 Xm kuj k2 · kv j k2
(because ||X||≤1)
j=1
1 Xm kuj k22 + kv j k22 j=1 2
(AM-GM)
Nuclear norm and spectral norm • Nuclear norm and spectral norm are dual m X kW k⇤ = kXk = max j j, j=1
j
to each other. Just like L1 and L∞ d X kwk1 = |wj |, kxk1 = max |xj | j=1
j
When do we use the L1 norm? What can we expect from using the nuclear norm?
Nuclear norm and sparisty • Nuclear norm promotes W to be low rank MSE=0.91
MSE=0.87
but maybe that’s not the end of the story…
Summary for matrix decomposition Learning with matrices Best rank-‐R approximaFon
Factor (U,V) regulariaFon Eckart- Young
SVD
Nuclear norm dual
Best rank-‐1 approximaFon Fixed-point condition
Power iteraFon
Operator norm deflation
Little bit of theory VC dimension Rademacher complexity
Complexity of low-rank matrices ⌦
• Let fU ,V ,b (X) = X, U V
>
↵
+ b and define
Fr = fU ,V ,b : U 2 Rn1 ⇥r , V 2 Rn2 ⇥r , b 2 R
• Examples – Low-rank regression: X = x · ej – Matrix completion:
>
X = ei · ej >
– Matrix classification: general X.
What is the complexity of Fr? How many samples do we need?
Background • VC dimension of F:
Vapnik
Chervonenkis
d(F) = max m s.t. |{(sgn(f (X 1 )), . . . , sgn(f (X m )) : f 2 F}| = 2m
Note: standard asymptotic statistics does not work here because the model is not regular (degenerate Fisher information).
Bound on the VC dimension d(Fr ) (r(n1 + n2 ) + 1) log(r(n1 + n2 ) + 1)
For sufficiently large n1 and n2. No more than #parameters!
• This implies that for classification
R(f )
˜ Remp (f ) O
r
(r(n1 + n2 ) + 1) m
!
for any f 2 Fr with high probability (ignoring the log term) Pm 1 where R(f ) = Pr(y 6= f (X)), Remp (f ) = m i=1 I(yi 6= f (X i ))
• Similar bound for real-valued output using pseudodimension.
Proof of the VC dimension • Warren’s theorem
Warren (1968) “Lower bounds for approximation by nonlinear manifolds”
Note that
⌦
fU ,V ,b (X i ) = X i , U V
>
↵
+b
is a polynomial of degree at most 2 in r(n1+n2)+1 variables. Note: (U,V,b) are variables and Xi are coefficients.
Low-nuclear norm matrices • Let fW (X) = hX, W i and define FW = fW : W 2 R
n1 ⇥n2
, kW k⇤ W
(no bias term for simplicity)
• Examples – Low-rank regression: X = x · ej – Matrix completion:
>
X = ei · ej >
– Matrix classification: general X.
Low-nuclear norm matrices • Let fW (X) = hX, W i and define FW = fW : W 2 R
n1 ⇥n2
, kW k⇤ W
(no bias term for simplicity)
• Rademacher complexity
m X 1 ˆ Rn (F) = E sup f 2F m i=1
i f (X i )
(σi takes +1 or -1 with probability 1/2)
– measures how well F can fit +1/-1 noise. – is scale sensitive.
Rademacher complexity bound [see Kakade, Shalev-Shwartz, Tewari 12]
If kX i k X (for i = 1, . . . , m) Spectral norm r 2 ˆ Rm (FW ) log((n1 + n2 )e) · X · W. m Consequence: for any Lipschitz continuous loss function ` , r E`(y, f (X))
1 m
m X i=1
`(yi , f (X i )) O
X 2 W 2 log(n1 + n2 ) m
How large are X and W? If Xi (i=1,…,m) are drawn from standard normal distribution, X2=O(n1+n2), and W2=O(r), if W has a constant Frobenius norm.
!
Proof of RC bound • From the duality btwn. the spectral & nuclear norms m X 1 m i=1
• Thus
m X 1 i hX i , W i m i=1
ˆ m (FW ) E R
iX i
m X 1 m i=1
· kW k⇤
≤ W
iX i
·W
and using random matrix theory, E
1 m
m X i=1
iX i
r
2 log((n1 + n2 )e) · X m
if kX i k X (for i = 1, . . . , m). (Using Corollary 4.2 from Tropp (2011) “User-Friendly Tail Bounds for Sums of Random Matrices”)
Tightening by expectation
[Foygel & Srebro, 2011]
• Previous analysis does not assume any distribution for inputs X1,…,Xm. • Considering the matrix completion setting and taking expectation over the random draw of the observed positions, 0s 1 ˆ m (FW ) = O @ Rm (FW ) := ER Expectation over the draw of the observed positions
(n1 + n2 ) log(n1 + n2 ) · WA n1 n2 m
W can be as large as (rn1n2)1/2 and sample complexity O(r(n1+n2)/ε2)
Approach from high-dimensional statistics
[Negahban & Wainwright 2011]
• Let the entries of X1,…,Xm drawn i.i.d. from standard Gaussian distribution and assume yi = hX i , W ⇤ i + ⇠i , ⇠i ⇠ N (0, 2 ), (i = 1, . . . , m) truth (rank r)
then the estimator
m ⇣ 1 X ˆ = argmin W (yi W 2Rn1 ⇥n2 2m i=1
with
m
ˆ kW
p
2
hX i , W i) +
m kW k⇤
(n1 + n2 )/m satisfies ✓ 2 ◆ r(n1 + n2 ) ⇤ 2 W kF O m
=c·
with high probability; r=rank(W*).
⌘
Summary so far • Rank and nuclear norm: both measure complexity of a matrix-based hypothesis class. • L2 regularization on the factors (U and V) induces nuclear norm regularization on the matrix UVT. • Nuclear norm promotes the matrix to be lowrank. However, it can be independently justified even when it doesn’t.
Tensors Rank and multilinear rank Non-closedness and other weirdness Tucker decomposition Graphical representation
Recap: Rank for matrices 1. min R such that A =
R X
ur v r >
r=1
2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values
Rank for tensors • A Kth-order tensor
A2R
n1 ⇥n2 ⇥···⇥nK
(we will take K=3 below to simplify notation) • Definition: Rank is the min R such that
A=
R X r=1
ur
vr
wr
outer product of 3 vectors
Bibliographical note • The decomposition
A=
R X r=1
ur
vr
wr
Triad
https://en.wikipedia.org/wiki/Triad_(music)
Triad
is called CP (Canonical Polyadic) decomposition. [Hitchcock (1927) “The expression of a tensor or a polyadic as a sum of products”]
• Also known as CANDECOMP [Carroll & Chang, 1970] and PARAFAC [Harshman, 1970].
Rank quiz • What is the rank of this tensor? 1 1 1 1 1 1 1 = A= 1 1 1 1 1 1 1
Rank 1
1st slice 2nd slice
0 A= 1
0 A= 1
1 1 0 0
0 100
1 1 0 0
0 0
Rank 2
0.0995 0.5025 1 = 0.995 5.025 10 0.0995 0.5025 1 + 0.995 5.025 10
Rank 3
Tensor rank is not closed • Consider the limit α→∞ of a rank 2 tensor 1 1 1 Y = (a1 + a2) ⇥ (b1 + b2) ⇥ (c1 + c2) a1 ⇥ b1 ⇥ c1
lim Y = a1 b1 c2 + a1 b2 c1 + a2 b1 c1
↵!1
• The limit is rank 3 if [a1,a2], [b1,b2], [c1,c2] are linearly independent. (Example from Kolda & Bader 2009)
Rank of 2 x 2 x 2 tensors
Stegeman & Comon (2010) “Subtracting a best rank-1 approximation may increase tensor rank”
Bounding the rank • A trivial bound rank(A) n1 n2 n3 (consider each entry as a rank-one tensor)
• A slightly better bound rank(A) min(n1 n2 , n2 n3 , n1 n3 ) (consider each fiber as a rank-one tensor)
Rank = number of linearly independent columns? • What is a column vector of a tensor? –fiber A mode-k fiber is a column vector obtained by fixing all but the kth index.
(Kolda & Bader 2009)
Unfolding and mode-k rank • mode-k unfolding mode-k fibers
A(k) =
• mode-k rank
2R
nk ⇥
Q
k0 6=k
rankk (A) = rank(A(k) )
• Multilinear rank (rank1 (A), . . . , rankK (A))
nk 0
Questions • How do these two notions of rank relate to each other? When are they equal? • What kind of decomposition does lowmultilinear rank imply?
Equivalence when rank=1 Detecting a rank-one tensor is easy! • Suppose A is rank one A = u v w = (ui vj wk )i,j,k
(again assuming K=3 for simplicity)
Rank 1
Mode-1 unfolding
⇥
A(1) = uv1 w1
uv2 w1
= u(v ⌦ w)>
Similarly
A(2) and A(3)
···
uvn2 wn3
⇤
(⌦: Kronecker product) are also rank 1.
Thus multilinear rank=(1,1,1). The other way is also true.
General relation rankk (A) rank(A)
K Y
rankk (A)
k=1
• Left hand side:
A(1) =
R X r=1
ur (v r ⌦ wr )>
Rank at most R
• Right hand side: given by the so-called Tucker decomposition
Preliminaries • Define the 0 mode-k product 1 nk X Ai1 ,...,ik 1 ,j,ik+1 ,...,iK Uik j A A ⇥k U := @ j=1
for A 2 R
n1 ⇥···⇥nK
and U 2 R
For example,
A ⇥1 U =
n1
n1
In other words,
(A ⇥k U )(k) = U A(k)
i1 ,...,iK
n0k ⇥nk
n3
n2
n1 U applies to each mode-k fiber independently
Tucker decomposition [Tucker (1966) “Some mathematical notes on three-mode factor analysis”]
• Tensor A has multilinear rank (r1,…,rK) implies n3 n1
Factors Core
n2
r3
= Aijk =
r1 r2
r2
r1
1
n1
r1 X r2 X r3 X
a=1 b=1 c=1
2
n2
r3
3
(1) (2) (3) Cabc Uia Ujb Ukc
n3
!
• Also known as higher-order SVD [De Lathauwer+00]
Intuition • Recall that (A ⇥k U )(k) = U A(k) • Take U(k) , the (full) left singular vectors of A(k) (A ⇥k U • Repeat
(k) >
)(k) =
nk rk
Y
k0 6=k
nk
r3
A ⇥1 U (1) > · · · ⇥K U (K) > = r1
r2
• Reconstruct A = C ⇥1 U
(1)
⇥2 U
(2)
· · · ⇥K U
(K)
nk 0
Properties • Tensor A has multilinear rank (r1,…,rK) implies
A= Thus
r1 X r2 X r3 X
a=1 b=1 c=1
rank(A) • In fact,
Cabc ua (1) ub (2) uc (3)
K Y
rankk (A)
k=1
rank(A) = rank(C)
if the columns of U (1) , . . . , U (K)are independent.
Amino acids fluorescence data
[Bro (1997) “PARAFAC: Tutorial and applications”]
• Five samples containing different amounts of three types of amino acids (Tyr, Trp, and Phe) measured by fluorescence spectrometer Emission wavelength
Excitation wavelength
Mode-wise SVD Sample dim.
Emission dim.
Excitation dim.
SVD(A(1))
SVD(A(2))
SVD(A(3))
Multilinear rank ≒ (3, 3, 3) ⇒ 3 ≤ rank(A) ≤ 9
Graphical representation i U(1) C j U(2)
Tensor train decomposition [Oseledets, 2011]
U(3) k
Tucker decomposition
Hierarchical Tucker decomposition [Grasedyck, 2010]
Short summary • Tensor rank is a much more delicate concept than matrix rank – difficulty of computing (NP-hard [Håstad, 90]) – non-closedness – may increase by subtracting the best rank-1 approximation [Stegeman & Comon 2010]
• Rank can be bounded by the multilinear rank, which can be obtained easily by mode-wise SVD.
Tensor decomposition Best rank-k approximation Singular values and eigenvalues Power method Random projection
Best rank-R approximation • Minimization problem minimize A U ,V ,W
R X
ur
vr
wr
r=1
F
(Frobenius norm defined as kAkF =
• Hardness
p
hA, Ai )
– Known to be NP-hard even for R=1 (3-SAT)
• Questions: for R=1 – Does it lead to equations defining SVD? – Does it relate to the operator norm?
Connection to spectral norm • Finding ˆ , w) ˆ = argmin kA (ˆ u, v u,v,w
u v wkF
is equivalent to computing ˜ v ˜ wi ˜ kAk = max hA, u ˜ kuk1, k˜ v k1, ˜ kwk1
In fact, u ˆ= u ˜, v ˆ=v ˜, w ˆ =w ˜ is a best rank-one approximation.
with
= kAk
Connection to singular values • Stationary condition implies
˜ > ⇥3 w ˜> = u ˜ A ⇥2 v
˜ > ⇥3 w ˜> = v ˜ A ⇥1 u
˜ > ⇥2 v ˜> = w ˜ A ⇥1 u
generalizes equations defining s.v. for matrices. • We can define any triplet of unit vectors satisfying the above equations singular vectors [Lim, 2005] • Difference: – LHS is quadratic, whereas RHS is linear (motivation for defining Lp singular vectors with p=K.) – Generally, not orthogonal.
Symmetrization • Define a super-symmetric n’ x n’ x n’ tensor 2
(n’=n1+n2+n3)
0 0 0 A¯ = 4 0 0 A(3,2,1)
0
0 0
A(2,3,1) 0 A(3,1,2)
(1,3,2)
0 A 0 0 0 0
0 A(2,1,3) 0
A
• Eigenvectors of A¯ , solutions of > > ¯ ¯ ¯ ¯ A ⇥2 u ⇥3 u = u
2 3 ˜ u ˜5 ¯ = 4v include vectors of the form u ˜ w
(1,2,3)
0 0
3 0 05 0
E.V.P. for symmetric tensors • Power iteration (detailed below) will quickly converge to one of the solutions –but which solution? • The solutions are not orthogonal to each other. Cannot use the usual deflation approach (it may actually increase the rank…) • The orthogonally decomposable case
A=
R X r=1
r ur
ur
ur ,
(ur ? ur0 , 8r 6= r0 )
has attracted considerable attention [Anandkumar, Ge, Hsu, Kakade, Telgarsky, 2012]
Power method for symmetric tensors Fixed-point algorithm 1. Initialize u0 randomly 2. For t=0,1,2,… t+1
u
A ⇥2 (ut )> ⇥3 (ut )> = kA ⇥2 (ut )> ⇥3 (ut )> k2
Why does it work? • For symmetric orthogonally decomposable tensors, R A=
X
u
t+1
r ur
ur , u
ur ,
r=1
t+1
↵
(ur ? ur0 , 8r 6= r0 )
t >
t >
/ A ⇥2 (u ) ⇥3 (u ) =
⌦
ur
/
XR
r=1
⌦
⌦
↵
2 ( u , u ) ur r r
t
t
↵
2 ( u , u ) r r
(Used orthogonality)
Notes on power iteration for orthogonally decomposable tensors ⌦
↵
⌦
• • • •
t
↵
/ r ( ur , u )2 ⌦ ↵ t • The inner product ur , u will converge to zero ur , u
t+1
exponentially fast except for the one with the largest absolute value. The rate is K-1: power iteration converges faster for higher order tensors. Odd order K: can take λr≥0. RHS≥0. Even order K: RHS can be negative (oscillation). Deflation works (because the factors are orthogonal). Robustness to noise (see Anandkumar et al. 12)
Power method for asymmetric tensors 1. Initialize u0, v0, w0 randomly 2. For t=0,1,2,… ut+1 v t+1 wt+1
A ⇥2 (v t )> ⇥3 (wt )> = , t > t > kA ⇥2 (v ) ⇥3 (w ) k2 A ⇥1 (ut )> ⇥3 (wt )> = , kA ⇥1 (ut )> ⇥3 (wt )> k2 A ⇥1 (ut )> ⇥2 (v t )> = kA ⇥1 (ut )> ⇥2 (v t )> k2
Tensor and simultaneous diagonalization • If R
R X r=1
wk1
wkr ur v r = U >
wkR
VT
• Factors are common to all slices. • If we can simultaneously diagonalize all slices, we are done.
Random projection • Consider the random projection >
A ⇥3 x =
R X r=1
hwr , xi ur v r = U >
T V R
x is a random vector. The projection is again a low-rank matrix if R
• How many random projections/slices do we need? – Two! (if the factors are linearly independent) [Harshman (1972) Determination and Proof of Minimum Uniqueness Conditions for PARAFAC1]
Random projection algorithm [Harshman 1972; Goyal, Vempala, Xiao, 2014; Moitra et al.]
1. Let A1 and A2 be two independent random projections of rank-R tensor A. (see Random Projection.ipynb) > 2. Compute truncated rank-R SVD A1 = U 1 S 1 V 1
3. Compute eigen-decomposition 1
>
S 1 U 1 A2 V 1 = P ⇤P 4. Recover 5. Note that
1
U = U 1S1P V = V 1P >
A1 = U V ,
>
A2 = U ⇤V
Simultaneous > diagonalization!
Recap: Rank for matrices and tensors 1. min R such that
A=
R X
ur v r
>
r=1
2. max C such that AC is linearly independent (AC is the submatrix of A indexed by C) 3. number of non-zero singular values
1. min R such that
A=
R X r=1
ur
vr
wr
outer product of 3 vectors
2. max number of linearly independent mode-k fibers
Summary for matrix decomposition Learning with matrices Best rank-‐R approximaFon
Factor (U,V) regulariaFon Eckart- Young
SVD
Nuclear norm dual
Best rank-‐1 approximaFon Fixed-point condition
Power iteraFon
Operator norm deflation
Regularization for tensors Regularization Nuclear norm and atomic norm Hardness
Learning with tensors • Examples – Input: pair of vectors (x, z); output: vector y. n3
y =
n2
n1
D
noise
⇥2 x ⇥3 z + ξ
– Input: tensor; output: scalar
y =
n1
n3
n2
n3
,
n1
n2
E
+ ξ
Regularization • In learning setting, we are interested in minimizing m X 1 minimize ` (y , hX , Wi) + R(W) i i W2Rn1 ⇥···⇥nK m i=1 • Examples
– Low-rank regression: X = eout – Tensor completion:
– Tensor classification:
X = ei
x z ej ek
(general tensor X)
Regularization • In learning setting, we are interested in minimizing minimize n ⇥···⇥n
W2R
1
K
• Questions:
m X 1 ` (yi , hXi , Wi) + R(W) m i=1
– What regularization term R(W) should we use? – How is it related to rank? – How does regularizing the factors (U(1),…,U(K)) relate to a regullarization for W?
Regularizing the factors • Recall that the nuclear norm 1 2 2 kW k⇤ = min kU kF + kV kF U ,V : 2 W =U V >
minimum over all factorizations W=UVT. • A naïve idea R2,2 (W) =
(1)
min
⌘ 1 ⇣ (1) 2 kU kF + kU (2) k2F + kU (3) k2F 3
U ,U (2) ,U (3) : P (1) (2) (3) W= R u u u r r r r=1
Is this convex? Is this a norm?
Non-convexity • Given any decomposition ⇣
R2,2 (W) kU (1) kF · kU (2) kF · kU (3) kF
ˆ • Thus taking (U ⇣
ˆ R2,2 (↵W) |↵|kU
(1)
(1)
ˆ ,U
(2)
ˆ ,U
(3)
)=
⌘ 23
(AM-‐GM inequality)
argmin U (1) ,U (2) ,U (3)
ˆ kF · k U
(2)
ˆ kF · k U
(3)
kF
⌘ 23
2 3
⇣
···
⌘
= |↵| R2,2 (W)
(this violates convexity)
Atomic norm • Atomic norm [Chandrasekaran+2012] is a norm whose unit ball is the convex hull of an atomic set. • Example: – L1 norm unit ball in 2D = conv({e1,-e1, e2, -e2}) – Group lasso – Nuclear norm – etc.
(Tensor) nuclear norm • Atomic norm with respect to the atomic set n o Arank1 = u(1) · · · u(K) : ku(1) k2 = · · · = ku(K) k2 = 1 rank-one tensor
• Can be written as a minimum kWknuc =
min
U (1) ,U (2) ,...,U (K)
s.t.
• In other words,
XR
r=1
W=
(2) (K) ku(1) k · ku k · · · ku 2 2 r r r k2
XR
r=1
W 2 conv(Arank1 ) , 9decomposition
(2) (K) u(1) u · · · u r r r
R P
r=1
(1)
(K)
kur k2 · · · kur
k2 1
Dual of nuclear norm is the spectral norm • Dual of nuclear norm kAknuc⇤ = max hA, Wi , W
s.t.
W 2 conv(Arank1 )
kAknuc⇤ = kAk
Summary Fact: max hX , Wi =
X: kX k1
min
U (1) ,U (2) ,U (3)
s.t.
Dual of (tensor) spectral norm
XR
r=1
W=
(2) (3) ku(1) k · ku k · ku 2 2 r r r k2
XR
r=1
(2) (3) u(1) u u r r r
(tensor) Atomic norm = nuclear norm = wrt Arank1
How do we compute this norm? • Unfortunately it is NP hard to compute both the nuclear norm and the spectral norm for tensors. • Wait, they are norms (convex functions) aren’t they easy to deal with? • Equivalently written as convex problem with |A |Arank1 rank1|| many constraints. Just finding the most violated constraint is NP hard.
Generative model? • By AM-GM inequality kWknuc
⌘ 1 ⇣ (1) K = min kur k2 + · · · + ku(K) kK r 2 r=1 K U (1) ,U (2) ,...,U (K) XR (2) (K) s.t. W = u(1) u · · · u r r r XR
r=1
P (U
(1)
,...,U
(K)
)/
• If we consider instead n
K Y
k=1
exp
1 K
R X r=1
K ku(k) k r 2
Arank1,K = u(1) · · · u(K) : ku(1) kK = · · · = ku(K) kK kWknuc =
min
U (1) ,U (2) ,...,U (K)
XR
r=1
!
o =1
⌘ 1 ⇣ (1) K K kur kK + · · · + ku(K) r kK K
Summary for tensor decomposition
known rank-‐R decomposiFon
Learning with tensors
Factor (U,V) regulariaFon
Best rank-‐R approximaFon
Nuclear norm dual
Best rank-‐1 approximaFon Fixed-point condition
Random projecFon
Power iteraFon
Operator norm
If orthogonally decomposable
Convex relaxation of Tucker
Recap: Tucker decomposition • Suppose a tensor W is low-rank in the sense of Tucker decomposition Factors n3
n1
Core
n2
r3
=
r1 r2
r2
r1
1
n1
2
n2
r3
3
n3
– All unfoldings W(1), W(2), W(3) are simultaneously low-rank. – How can we encourage W to be simultaneously low-rank?
Approach 1: As a matrix • Pick a mode k, and hope the tensor to be low-rank in mode k. minimize n ⇥···⇥n
W2R
1
K
1 ky 2m
2 X(W)k2
+
m kW (k) k⇤
Observation operator X : Rn1 ⇥···⇥nK ! Rm
X(W) = (hX1 , Wi , . . . , hXm , Wi) > Pro: Basically a matrix problem Easy to carry out, theory available Con: Have to be lucky to pick the right mode
Approach 2: Overlapped trace norm [Liu+09, SignoreLo+10, T+10, Gandy+11]
• Regularize by the sum of mode-wise nuclear norms
minimize n ⇥···⇥n
W2R
1
K
1 ky 2m
2 X(W)k2
+
k=1
• Intuition:
kW (k) k⇤
Nuclear norm of unfoldings
– the same tensor is
×
regularized to be simultaneously low-rank w.r.t. all modes
m
K X
× ×
K=3
Overlapped nuclear norm
[T+10; Signoretto+10; Gandy+11; Liu+09]
• Regularize the sum of nuclear norms
W
S1 /1
:=
K X
k=1
• Is this a norm?
W (k)
⇤
Nuclear norm of unfoldings
Yes –convex because sum of convex functions and positive homogeneous.
• Is this easy to compute? Yes –K-unfoldings and SVDs.
Approach 3: Mixture of low-rank tensors [T, Hayashi, Kashima10]
• Each mixture component W(k) is regularized to be low-rank in the corresponding mode. minimize n ⇥···⇥n
W2R
1
K
1 y 2m
X
⇣P
K k=1
W (k)
⌘
2 2
+
m
K X
k=1
(k)
kW (k) k⇤ ×
• Each W(k) takes care of each mode • Sparse solution (group lasso like effect) • Note: the sum may not be low rank
+ =
× + ×
K=3
Latent nuclear norm • Effective regularization term W
S1 /1
:=
min
W (1) ,...,W (K)
K X
(k)
W (k)
k=1
⇤
s.t. W =
K X
k=1
W (k)
• Is this a norm? Yes –it is the dual of the overlapped S∞/∞-norm:
X
S1 /1
:= max X (k)
• Is it easy to compute?
k
spectral norm
Yes –convex minimization problem involving matrix nuclear norm
Property of the norms • Nuclear norm
kW k⇤
p
rkW kF
if rank(W ) r,
• Overlapped trace norm
kWkS1 /1
XK p k=1
rk kWkF
if W has multilinear rank (r1 , . . . , rK )
• Latent trace norm
kWkS1 /1 min k
p
rk kWkF
if W has multilinear rank (r1 , . . . , rK )
Tensor completion result • True tensor: 50x50x20, rank 7x8x9. No noise (λ=0). • Random train/test split. 2
Generalization error
10
As a Matrix (mode 1) As a Matrix (mode 2) As a Matrix (mode 3) Overlap Constraint Mixture Latent Tucker (large) Tucker (exact) Optimization tolerance
0
10
−2
10
−4
10
Sharp phase-transition 0
0.1
0.2
0.3 0.4 0.5 0.6 0.7 Fraction of observed elements
0.8
0.9
1
Tucker implemented in n-way toolbox. [Andersson & Bro, 00]
Almost full-rank case True tensor: Size 50x50x20, rank 50x50x5. No noise (λ=0). As a Matrix (mode 1) As a Matrix (mode 2) As a Matrix (mode 3) Overlap Constraint Latent Mixture Tucker (large) Tucker (exact) Optimization tolerance
0
Generalization error
10
−1
10
−2
10
−3
10
0
0.2
0.4 0.6 Fraction of observed elements
0.8
Latent approach can automatically figure out which mode is low-rank
1
Overlapped norm: key properties • How it relates to the multilinear rank PK p W S1 /1 k=1 rk W F average of the ranks
• How the dual norm behaves E X
(S1 /1)⇤
O
1 K2
K p X
k=1
N/nk
!
,
(N =
Random Gaussian tensor N(0, σ2)
Y k
nk )
Overlapped norm: denoising guarantee
[T,Suzuki,Hayashi,Kashima,11]
• Let Y be a noisy observation of W*
Y = W⇤ + E
(E ⇠ N (0,
• Then the estimator ˆ = argmin W W
with 1 ˆ W N
= c0 (
W
⇤
2 F
PK
✓
k=1
c1
1 Y 2
p 2
W
2 F
+
W
2
))
S1 /1
◆
N/nk )/K gives
1 K
K X
k=1
p
rk
!2
with prob. at least 1-exp(-poly(n))
1 K
K X
k=1
p 1/ nk
!2
Latent norm: key properties • How it relates to the multilinear rank p W S /1 (min rk ) W F 1 k
minimum rank
• How the dual norm behaves ✓ ◆ p E X
(S1
/1)⇤
˜ O
max k
N/nk ,
(N =
Random Gaussian tensor N(0, σ2)
Y k
nk )
Denoising guarantees • Overlapped nuclear norm: sensitive to the average rank !2 !2 K K 1 ˆ 1 Xp 1 X p ⇤ 2 2 W W F c1 rk 1/ nk N K K k=1
k=1
• Latent nuclear norm: sensitive to the minimum rank
1 ˆ W N
W
⇤
2 F
c2
2
min rk /min nk k
k
• Matters only if rk≠rk’ or nk≠nk’ (typical many tensor problems)
Could we obtain a bound of mink(rk/nk)?
Scaled latent nuclear norm [Wimalawarne, Sugiyama, T, 2014]
p
• Normalize by nk W
scaled
=
inf
W (1) ,...,W (K)
K X
k=1
1 (k) p kW (k) kS1 nk
– relation to the multilinear rank
W
scaled
✓
min k
r
– behavior of the dual norm
E X
scaled⇤
rk nk
◆
⇣ p ⌘ ˜ O N ,
s.t.
K X
k=1
W
W (k) = W
F
(N =
Q
k
nk )
Denoising guarantees • Overlapped nuclear norm: sensitive to the average rank !2 !2 K K 1 ˆ 1 Xp 1 X p ⇤ 2 2 W W F c1 rk 1/ nk N K K k=1
k=1
• Latent nuclear norm: sensitive to the minimum rank
1 ˆ W N
W
⇤
2 F
c2
2
min rk /min nk k
k
• Scaled latent norm: sensitive to the minimum rank-todimension ratio
1 ˆ W N
W
⇤
2 F
c3
2
rk min k nk
Multi-task learning with tensors
[Romera-‐Paredes+13; SignoreLo+13]
• Example: multi-aspect restaurant recommendation
Y
'
Restaurants
Restaurants
customers features
X
⇥
n3 customers
w pq
n1 features
• Issue: heterogeneous dimensions and ranks – Number of aspects is usually much smaller than the number of customers. – Rank (amount of sharing of information) may be different for aspect, customers, and features.
Theory for multi-linear MTL • n2 x n3 tasks. Observed tasks S ✓ [n2 ] ⇥ [n3 ] • Training samples
– many tasks may have no samples (zero-shot mpq learning) (xipq , yipq ) , (p, q) 2 S i=1
• Training ˆ = W
argmin W2Rn1 ⇥n2 ⇥n3
1 ˆ where L(W) = |S|
X
ˆ L(W), s.t. W
(p,q)2S
mpq
?
C
1 X `(hxipq , wpq i , yipq ) mpq i=1
Analysis
[Wimalawarne, Sugiyama, T, 14]
• Lemma: Let W* be any tensor with rank (r1,r2,r3) and p element-wise bounded by B. For C = B (r) N ˆ L(W)
0
p
N ⇤ @ L(W ) O ⇤B (r) · ·E |S|
X
(p,q)2S
mpq 1 X mpq i=1
ipq Xipq
?⇤
1 A
How the norm relates to the rank How the dual norm behaves σipq: Rademacher RV W (r) W ?
F
⇤ : Lipschitz constant of the loss `
⇥
>
⇤
: Measure of feature correlation E xipq xipq = Cpq /n1 I n1 X 1 L(W) : Expected loss L(W) = E` (hxpq , wpq i , ypq ) n 2 n3 (p,q)2[n2 ]⇥[n3 ]
Theorem (overlap norm)
[Wimalawarne, Sugiyama, T, 14]
• Let W* be any tensor with rank q (r1,r2,r3) and elementwise bounded by B. For C = B krk1/2 n1 n2 n3 ˆ L(W)
⇤
✓
L(W ) O ⇤B
r
krk1/2 min (Dk log Dk ) k m|S|
⇤ : Lipschitz constant of the loss `
⇥
: Measure of feature correlation E xipq xipq ✓ K ◆2 Pp 1 krk1/2 = K rk ,
>
⇤
= Cpq
/n1 I n1
k=1
D1 := n1 + n2 n3 ,
D2 := n2 + n1 n3 ,
◆
D3 := n3 + n1 n2
Theorem (latent norm)
[Wimalawarne, Sugiyama, T, 14]
• Let W* be any tensor with rank (r1,r2,r3) and elementq wise bounded by B. For C = B min rk n1 n2 n3 k
ˆ L(W)
⇤
✓
L(W ) O ⇤B
r
min rk max(Dk log Dk ) k m|S| k
⇤ : Lipschitz constant of the loss `
⇥
: Measure of feature correlation E xipq xipq D1 := n1 + n2 n3 ,
D2 := n2 + n1 n3 ,
>
⇤
= Cpq
◆
/n1 I n1
D3 := n3 + n1 n2
Theorem (scaled latent norm)
[Wimalawarne, Sugiyama, T, 14]
• Let W* be any tensor with rank (r1,r2,r3) and elementq wise bounded by B. For C = B min(rk /nk )n1 n2 n3 k
ˆ L(W)
⇤
L(W ) O ⇤B
s
min m|S| k
⇤ : Lipschitz constant of the loss `
⇥
✓
◆
rk n1 n2 n3 log(max Dk ) k nk
: Measure of feature correlation E xipq xipq D1 := n1 + n2 n3 ,
D2 := n2 + n1 n3 ,
>
⇤
= Cpq
!
/n1 I n1
D3 := n3 + n1 n2
Sample complexities
(ignoring log terms and per 1/ε2) Overlap
Latent
Scaled latent
Multi-task learning (homogeneous)
n n n
krk1/2 n2
(min rk )n2 k
(min rk )n2 k
Multi-task learning (heterogeneous)
n1 n2 n 3
krk1/2 n2 n3
Tensor completion krk1/2 min Dk n1 n2 k n3
n1 n2 n3
min rk max Dk k
k
min(r1 n2 n3 , n1 n 2 r3 ) rk min N k nk
Restaurant recommendation data
[Vargas-‐Govea+2011; Romera-‐Paredes+2013]
• 45 features x 3 aspects x 138 customers 0.7 Overlapped trace norm Latent trace norm Scaled latent trace norm Mode 1 Mode 2 Mode 3 RR
0.65
MSE
0.6 0.55 0.5 0.45 0.4 0.35 0
500
1000
1500 2000 Sample size
2500
3000
ILEA School data [ILEA; Bakker & Heskes 2003] • 24 features x 139 schools x 3 years 40
Explained variance
35
29.5%
30
Hierarchical Bayes [Bakker & Heskes 03]
25 Overlapped trace norm Latent trace norm Scaled latent trace norm Mode 1 Mode 2 Mode 3 RR
20
15
10 0
2000
4000
6000 8000 Sample size
10000
26.7% Bilinear MTL [Argyriou 08]
12000
Tensor completion min W
s.t.
W
S1 /1
Wi,j,k = Yi,j,k
Error ||W*−W^||F
size=50x50x20, rank=7x8x9
0
10
Convex EM (nonconvex) Optimization tolerance
−3
10
roughly 35% is 0 0.1 0.2 enough for perfect recovery!
0.3 0.4 0.5 0.6 0.7 Fraction of observed elements
0.8
0.9
1
EM algorithm implemented in n-way toolbox. [Andersson & Bro, 00]
Problem setting ⇤ W : true K-way tensor with rank (r1,...,rK)
Observation
⇤
yi = hXi , W i + ✏i Optimization
ˆ = W
argmin W2Rn1 ⇥···⇥nK
✓
(i = 1, . . . , m)
Gaussian noise N(0,σ2)
Likelihood
1 ky 2m
Regularization
X(W)k22 +
m
W
S1 /1
◆
Reg. constant Observation operator X : Rn1 ⇥···⇥nK ! Rm
X(W) = (hX1 , Wi , . . . , hXm , Wi) >
Theorem (random Gaussian design) There are constants c1, c2 such that K 1 rn 2 ˆ W W ⇤ F c2 2 M holds with high probability if #samples(m) Q #variables( k nk )
r c1 n
Condition for the sample size m: • Depends on the normalized rank r/n • Independent of the noise σ
and
m
=8
q
nK
1/ m
Condition for the reg. parameter λm: • Independent of the rank r • Deepends on the noise σ
For simplicity n1,…,nK=n, r1,…,rK=r
Tensor completion Convex [7 8 9] Covex [40 9 7] Optimization tolerance
0
10
0.01 −3
10
0
0.2 0.4 0.6 0.8 Fraction of observed elements
#samples (M ) #variables (N )
(N :=
QK
k=1
nk )
No observation noise
Fraction M/N at error<=0.01 Fraction at Error<=0.01
Estimation error
size = 50x50x20 true rank 7x8x9 or 40x9x7
11
rank=[7,8,9] rank=[40,9,7]
0.8 0.6 0.4 0.2 0 0
size=[50 50 20] size=[100 100 50] 0.2 0.4 0.6 −1 Normalized rank ||n ||1/2||r||1/2
Normalized rank
0.8
Theory vs. Experiments (4th order) 1
Fraction at err<=0.01
0.8
0.6
0.4 size=[50 50 20] size=[100 100 50] size=[50 50 20 10] size=[100 100 20 10]
0.2
0 0
0.2
0.4 0.6 Normalized rank ||n−1||
||r||
1/2
0.8
1/2
1
Limitation: too many samples required! • Sample complexity = number of samples required to obtain error ε
ˆ W
W
⇤
2 F
c2
K rn 2 M
1
,
K 1
M = O(rn
/✏)
• For K=2: O(rn) = number of parameters rank r matrix has. • For K>2: too many. In fact, for the 50 x 50 x 20 rank (7,8,9) case, about 1500 (≒50*7+50*8+20*9+7*8*9) samples (3% << 35%) should be right.
Tensorlab
Sorber, Van Barel and De Lathauwer (2014).
• RT “Hi, I was trying cpd with missing entries and found that the error does not go to zero as the number of observed entries increases…”
Tensorlab
Tensorlab
Sorber, Van Barel and De Lathauwer (2014).
• Laurent Sorber’s reply “Thank you for your detailed example … I have been experimenting further with your code, and below is a method that gets good performance with Tensorlab …” Tensorlab
Lesson learned • When you are doing something non-convex, details matter a lot! – Random initialization turns out to be better than the default intelligent initialization. – Make sense to try several short restarts (4 x 5). – Default max iteration too small.
• Although global guarantee may be absent, nonlinear optimization could work very well in practice.
Computation-statistics trade-off
Simplest case [Montanari & Richard 14] • Noise corrupted symmetric rank-one tensor n n
⇤
n
=
u ⇤ u ⇤ u
n
n
+ n Gaussian N(0,1)
β: signal-to-noise ratio
• How large does β need to be to recover u* from Y?
Optimal: maximum likelihood
[Richard & Montanari 14]
• Consider the maximum likelihood estimator
ˆ ML = argmax hY, u u ui u u2Rn
E Z
• If
, with high probability
Operator norm
1
⇤
| hˆ uML , u i | = O
• Note that E Z
=O
⇣p
E Z
!
nK log(K)
⌘
[T&Suzuki 14]
• This is optimal (no estimator can do better) • But computationally interactable (best rank-1 approx.)!
Power method [Anandkumar+12; Richard & Montanari 14] • Is a computationally tractable algorithm q
• With c · nK K log(K) randomly initialized power iteration converges to a solution with ! E Z ⇤ 1 | hˆ u, u i | = O with high probability. • Note the sharp increase in the required β (trading statistical performance for tractability)
Recursive unfolding
[Richard & Montanari 14]
1. Matricize Y so that it looks as square as possible
n n
bK/2c
dK/2e
Y(square)
linearly index the first K/2 dimensions
linearly index the last K/2 dimensions
2. Compute the leading singular vector v. Note that v is nK/2 dimensional. 3. Reshape v into an n x nK/2-1 matrix and return the top-left singular vector.
Performance of (recursive) unfolding • Richard & Montanari: recursive ✓ unfolding ◆ attains
1
under
| hˆ u, u⇤ i | = O
KndK/2e 2
p c · ndK/2e K with high probability.
– loose for odd-order tensors –conjectured nK/4 suffice
• A tighter analysis [Zheng & T, 2015]: taking the topleft singular vector of the ordinary (rectangular) unfolding is sufficient to obtain a threshold
c0 · nK/4
n
nK-1
Y(1)
ˆ ≒ u
Why simpler method gets better bound • R&M analyzes the perturbation of
Y(square) = β
(u⇤ ⌦ · · · ⌦ u⇤ )> | {z } dK/2e
u⇤ ⌦ · · · ⌦ u⇤ | {z } bK/2c
+
EkZk = O(
• Z&T analyzes the perturbation of n
n
Y (1) Y (1)
2 β = >
+ β ⇤
⇤ >
u (u )
Z
+ β
p
ndK/2e )
+
Z (1) Z (1) > (better controlled perturbation!) (cross terms)
Two-phase behavior • Because the cross term depends on β 1
8 ⇣ K⌘
if c · nK/4 if n(K
1)/2
n(K
1)/2
Z (1) Z (1) >
,
β
+ β
(dominant term)
Summary of rank-one tensor recovery • Ideal ML estimator: • Randomly initialized PI:
c·
p
nK log(K) q c0 · nK K log(K)
– Note: in practice, it seems to be much more robust!
• (Recursive) unfolding: • Questions:
c00 · nK/4
– Is there a fundamental limit how small β can be and still computationally tractable? – How about more general problems?
Barak & Moitra (2015) “Tensor Prediction, Rademacher Complexity and Random 3-XOR” • Studied the atomic ⇢ norm w.r.t
Arank1,1 =
u u u : kuk1
for symmetric n x n x n tensors.
C p n atoms are “balanced”
• For tensor completion, the atomic norm has almost optimal O(n/ε2) samples complextiy but intractable. • Sixth order sum-of-squares relaxation yields O(nK/2/ ε2) sample complexity and polytime –O(n6) size SDP. • Suggested no efficient algorithm can beat O(nK/2/ε2).
Jain & Oh (2014) “Provable Tensor Factorization with Missing Data” • Showed that alternating minimization algorithm with careful initialization achieves O(r5n3/2(log(n))4log(r||W*||F/ε)) sample complexity for K=3. – Slightly high scaling with the rank (probably not optimized) – The same O(nK/2) scaling – Note that if we consider σ2=1/β2=1/m (inverse sample size), the same m ≥ c’’nK/2 scaling for (recursive) unfolding.
Conclusion • Tensors lie in the intersection of a lot of interesting areas –theory, data analysis, machine learning, and numerical linear algebra. • Many tensor operations are known to be hard but often gradient descent/power iteration works in practice – Interaction between computation and statistics: can we trade statistical performance for tractability? – Need to understand hardness in contexts.
• Learning with tensors is an area already explored empirically (relational data, neural networks) but not much theoretically
T h a n k
y o u !