Exploring Algorithmic Limits of Matrix Rank Minimization under Affine Constraints

Bo Xin
Peking University, Beijing, China
Email: [email protected]

David Wipf
Microsoft Research, Beijing, China
Email: [email protected]
I. INTRODUCTION

Recently there has been a surge of interest in finding minimum-rank matrices subject to problem-specific constraints, often characterized as an affine set. Mathematically this involves solving

    min_X rank[X]  s.t.  b = A(X),    (1)

where X ∈ R^{n×m} is the unknown matrix, b ∈ R^p represents a vector of observations, and A : R^{n×m} → R^p denotes a linear mapping. Unfortunately, problem (1) is well-known to be NP-hard, and the rank penalty itself is non-smooth. Consequently, a popular alternative is to instead minimize ∑_i f(σ_i[X]), where σ_i[X] denotes the i-th singular value of X and f is usually a concave, non-decreasing function (or nearly so). In the special case where f(z) = I[z ≠ 0], we retrieve the matrix rank; however, smoother surrogates such as f(z) = log z or f(z) = z^q with q ≤ 1 are generally preferred for optimization purposes. When f(z) = z, the objective reduces to the popular convex nuclear norm. While elegant theoretical conditions have been derived to elucidate when this replacement is likely to be successful [1], these conditions are highly restrictive, and convex algorithms fail when the ambient rank is too high or when the constraint set is poorly structured. Non-convex alternatives fare somewhat better when carefully tuned; however, convergence to local solutions remains a continuing source of failure. Against this backdrop we derive a deceptively simple, parameter-free Bayesian-inspired algorithm that is capable of successful recovery, across a wide battery of empirical tests, even up to the theoretical limit where the number of measurements p equals the degrees of freedom in the unknown low-rank X. An extended journal version [4] provides technical proofs, algorithmic enhancements, and additional experiments (including comparisons with various algorithms given a priori knowledge of the correct rank), as well as real applications.

II. BAYESIAN AFFINE RANK MINIMIZATION (BARM)

In contrast to the majority of existing algorithms organized around practical solutions to minimizing ∑_i f(σ_i[X]), here we adopt an alternative starting point that can be viewed as a reparameterized generalization of probabilistic PCA [2].
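The family of spectral penalties ∑_i f(σ_i[X]) can be made concrete with a short numerical sketch. The helper below is an illustration only (the function name and test matrix are our own assumptions, not part of any cited algorithm); it evaluates the penalty for the indicator, identity, and log choices of f discussed above:

```python
import numpy as np

def spectral_penalty(X, f):
    """Evaluate sum_i f(sigma_i[X]) for a spectral penalty function f."""
    sigma = np.linalg.svd(X, compute_uv=False)
    return float(sum(f(s) for s in sigma))

# A rank-1 test matrix: each surrogate "sees" a single nonzero singular value.
X = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])

rank_val = spectral_penalty(X, lambda z: float(z > 1e-9))   # f(z) = I[z != 0] -> rank
nuclear  = spectral_penalty(X, lambda z: z)                 # f(z) = z -> nuclear norm
log_pen  = spectral_penalty(X, lambda z: np.log(z + 1e-9))  # smoothed f(z) = log z
```

Here rank_val recovers rank[X] = 1 and nuclear equals σ_1 = ‖X‖_F since X is rank one; the log surrogate rewards small singular values far more aggressively, which is precisely why it better approximates the rank but complicates optimization.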
Specifically, we define a Gaussian likelihood function p(b|X; A, λ) with noise variance λ to encode relaxed constraints, and an independent, zero-mean Gaussian prior distribution with covariance ν_i Ψ on each column of X, denoted p(X; Ψ, ν). The resulting posterior p(X|b; Ψ, ν, A, λ) is also Gaussian, with mean

    x̂ = vec[X̂] = Ψ̄ A⊤ (λI + A Ψ̄ A⊤)⁻¹ b.    (2)

Here A ∈ R^{p×nm} is a matrix defining the linear operator A such that b = Ax reproduces the feasible region in (1), and Ψ̄ = diag[ν] ⊗ Ψ, with ⊗ denoting the Kronecker product. A common Bayesian strategy to determine Ψ and ν is to marginalize over X and then maximize the resulting likelihood function. This is equivalent to minimizing the BARM objective

    L(Ψ, ν) = b⊤ Σ_b⁻¹ b + log |Σ_b|,   Σ_b = A Ψ̄ A⊤ + λI,    (3)
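As a sanity check on (2) and (3), the sketch below (illustrative only; the dimensions, seed, and the choice Ψ = I, ν = 1 are arbitrary assumptions) builds Ψ̄ = diag[ν] ⊗ Ψ, computes the posterior mean, and evaluates the BARM cost. For small λ the posterior mean should be nearly feasible, i.e., A x̂ ≈ b:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, lam = 4, 3, 6, 1e-10              # arbitrary small problem dimensions
A = rng.standard_normal((p, n * m))        # matrix form of the operator A(.)
b = rng.standard_normal(p)

Psi = np.eye(n)                            # shared column covariance Psi
nu = np.ones(m)                            # per-column scales nu_i
Psi_bar = np.kron(np.diag(nu), Psi)        # Psi_bar = diag[nu] (Kronecker) Psi

Sigma_b = A @ Psi_bar @ A.T + lam * np.eye(p)          # covariance of b from (3)
x_hat = Psi_bar @ A.T @ np.linalg.solve(Sigma_b, b)    # posterior mean (2)
X_hat = x_hat.reshape(n, m, order="F")                 # undo vec[.] (column stacking)

# BARM objective (3); slogdet is used to evaluate log|Sigma_b| stably
cost = b @ np.linalg.solve(Sigma_b, b) + np.linalg.slogdet(Sigma_b)[1]
```

Note that the block-diagonal structure of Ψ̄ reflects the prior: column i of X has covariance ν_i Ψ, so cov(vec[X]) stacks the blocks ν_1 Ψ, …, ν_m Ψ, which is exactly diag[ν] ⊗ Ψ.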
where Σ_b can be viewed as the covariance of b given Ψ and ν. Minimizing (3) is a non-convex optimization problem, and we design upper bounds for this purpose, leading to an efficient EM-like algorithm with provable convergence to a stationary point. Moreover, the underlying cost function maintains a number of desirable properties.

Lemma 1. Let b = A vec[X], where A ∈ R^{p×nm} satisfies spark[A] = p + 1. Also define r as the smallest rank of any feasible solution. Then if r < p/m, any global minimizer {Ψ*, ν*} of (3) in the limit λ → 0 is such that x* = Ψ̄* A⊤ (A Ψ̄* A⊤)⁻¹ b is feasible and rank[X*] = r with vec[X*] = x*.

The assumption r = rank[X*] < p/m is completely unrestrictive, especially given that a unique, minimal-rank solution is only theoretically possible by any algorithm if p ≥ (n+m)r − r² (the degrees of freedom in an n × m rank-r matrix), which is much more restrictive than p > rm. Likewise the spark assumption will be satisfied for any A with even an infinitesimal (continuous) random component. Consequently, we are essentially always guaranteed that (3) possesses the same global optimum as the rank function. While this result, and a companion result related to solution invariance (not shown), is certainly a useful starting point, the real advantage of adopting (3) is that bad locally minimizing solutions are exceedingly rare and in some cases provably non-existent.

Theorem 1. Let b = A vec[X], where A is block diagonal with blocks A_i ∈ R^{p_i×n}. Moreover, assume p_i > 1 for all i and that ∩_i null[A_i] = {0}. Then if min_X rank[X] = 1 in the feasible region, any minimizer {Ψ*, ν*} of (3) (global or local) in the limit λ → 0 is such that x* = Ψ̄* A⊤ (A Ψ̄* A⊤)† b is feasible and rank[X*] = 1 with vec[X*] = x*. Furthermore, no cost function of the form ∑_i f(σ_i[X]) can satisfy the same result. In particular, there can always exist local and/or global minima with rank greater than one.
This result applies to a generalized form of the matrix completion problem [1], where instead of observing individual elements of the unknown X*, we can only observe linear transformations of each column. Importantly, under mild conditions, which do not even depend on the concentration properties or correlation structure of A, the proposed cost function has no minima that are not global minima; in fact, it can be shown that any stationary point is a global minimum. Interestingly, no possible penalty function of the form ∑_i f(σ_i[X]) can satisfy the same result. From an empirical standpoint, the proposed BARM algorithm dramatically outperforms all methods we are aware of (see examples below and more in [4]).
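To make the λ → 0 limit in Lemma 1 and Theorem 1 concrete, the toy sketch below (our own illustration with assumed values; this is not the BARM learning algorithm itself, which requires the EM-like updates referenced above) plugs a fixed {Ψ, ν} into the limiting expression x* = Ψ̄ A⊤ (A Ψ̄ A⊤)† b and confirms feasibility:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 5, 4, 8
A = rng.standard_normal((p, n * m))        # random A has spark[A] = p + 1 a.s.
b = rng.standard_normal(p)

# Any fixed (Psi, nu) with full support yields a feasible x*; it is BARM's
# learning of (Psi, nu) that drives rank[X*] down to the minimum feasible rank.
Psi = np.eye(n)
nu = np.ones(m)
Psi_bar = np.kron(np.diag(nu), Psi)

G = A @ Psi_bar @ A.T
x_star = Psi_bar @ A.T @ np.linalg.pinv(G) @ b   # limiting estimator (lambda -> 0)
X_star = x_star.reshape(n, m, order="F")

feasible = np.allclose(A @ x_star, b, atol=1e-8)  # checks b = A vec[X*]
```

The pseudoinverse † matters once learned ν_i values collapse to zero: A Ψ̄ A⊤ then loses rank, yet the limiting estimator remains well defined, which is exactly the regime where low-rank solutions emerge.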
[Figure: three panels (nuclear norm, IRLS0, BARM) plotting cost function value against η; the true (lowest rank) solution lies at η = 0, and the IRLS0 panel shows curves for γ = 1, 1e−05, and 1e−10.]

Fig. 1. Local minima visualizations. Plots of different surrogates for matrix rank in a 1D feasible subspace generated by X⁰ + ηV, where X⁰ is the true minimum-rank solution, V ∈ null[A], and η is a scalar. Here the nuclear norm does not retain the correct global minimum. In contrast, although the non-convex ∑_i log(σ_i[X]² + γ) penalty (from the IRLS0 algorithm [3]) exhibits the correct minimum when γ is sufficiently small, it also contains spurious minima. Only BARM smooths away local minima while simultaneously always retaining the correct global optimum.
[Figure: four subplots, each showing FoS and REL curves versus rank for the nuclear norm, IRLS0, BARM, and the theoretical limit: (a) 50 × 50, A uncorrelated; (b) 50 × 50, A correlated; (c) 100 × 100, A uncorrelated; (d) 100 × 100, A correlated.]

Fig. 2. Empirical comparisons with the nuclear norm and IRLS0 [3] under general affine constraints (average of 10 trials) as rank[X⁰] is varied. Subplot labels indicate the size of X⁰ and whether A was uncorrelated (easier) or correlated (harder). In all cases p = 1000. The relative error (REL) is ‖X⁰ − X̂‖_F / ‖X⁰‖_F, and the frequency of success (FoS) measures the percentage of trials where REL is below 10⁻³. The theoretical limit is such that p ≥ (n+m)r − r²; no possible algorithm can work when p is smaller. Note that [4] contains more details and experiments.
[Figure: FoS and REL versus rank for VSBL, the nuclear norm, IRNN (labeled "IRNN_best"), IRLS0, BARM, and the theoretical limit.]

Fig. 3. Reproduction of matrix completion experiments from [5]. Here X⁰ is 150 × 150 with varying rank and 50% of its elements are observed. BARM is compared against IRNN [5], IRLS0 [3], VSBL [6] and the nuclear norm [1]. [4] contains more details and analysis.

REFERENCES
[1] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, 2009.
[2] M. Tipping and C. Bishop, "Probabilistic principal component analysis," J. Royal Statistical Society, Series B, vol. 61, no. 3, pp. 611–622, 1999.
[3] K. Mohan and M. Fazel, "Iterative reweighted algorithms for matrix rank minimization," Journal of Machine Learning Research (JMLR), vol. 13, no. 1, pp. 3441–3473, 2012.
[4] B. Xin and D. Wipf, "Exploring algorithmic limits of matrix rank minimization under affine constraints," arXiv:1406.2504, 2014.
[5] C. Lu, J. Tang, S. Yan, and Z. Lin, "Generalized nonconvex nonsmooth low-rank minimization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos, "Sparse Bayesian methods for low-rank matrix estimation," IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 3964–3977, 2012.