Exploring Algorithmic Limits of Matrix Rank Minimization under Affine Constraints

Bo Xin
Peking University, Beijing, China
Email: [email protected]

David Wipf
Microsoft Research, Beijing, China
Email: [email protected]

I. INTRODUCTION

Recently there has been a surge of interest in finding minimum-rank matrices subject to problem-specific constraints, often characterized as an affine set. Mathematically this involves solving

\[
\min_X \ \operatorname{rank}[X] \quad \text{s.t.} \quad b = \mathcal{A}(X), \tag{1}
\]

where $X \in \mathbb{R}^{n \times m}$ is the unknown matrix, $b \in \mathbb{R}^p$ represents a vector of observations, and $\mathcal{A} : \mathbb{R}^{n \times m} \to \mathbb{R}^p$ denotes a linear mapping. Unfortunately, problem (1) is well known to be NP-hard, and the rank penalty itself is non-smooth. Consequently, a popular alternative is to instead minimize $\sum_i f(\sigma_i[X])$, where $\sigma_i[X]$ denotes the $i$-th singular value of $X$ and $f$ is usually a concave, non-decreasing function (or nearly so). In the special case where $f(z) = \mathbb{I}[z \neq 0]$, we retrieve the matrix rank; however, smoother surrogates such as $f(z) = \log z$ or $f(z) = z^q$ with $q \leq 1$ are generally preferred for optimization purposes. When $f(z) = z$, the objective reduces to the popular convex nuclear norm. While elegant theoretical conditions have been derived to elucidate when this replacement is likely to be successful [1], these conditions are highly restrictive, and convex algorithms fail when the ambient rank is too high or when the constraint set is poorly structured. Non-convex alternatives fare somewhat better when carefully tuned; however, convergence to local solutions remains a continuing source of failure. Against this backdrop we derive a deceptively simple, parameter-free Bayesian-inspired algorithm that is capable of successful recovery, across a wide battery of empirical tests, even up to the theoretical limit where the number of measurements $p$ equals the degrees of freedom in the unknown low-rank $X$. An extended journal version [4] provides technical proofs, algorithmic enhancements, and additional experiments (including comparisons with various algorithms given a priori knowledge of the correct rank), as well as real applications.
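To make the surrogate family concrete, the following sketch evaluates $\sum_i f(\sigma_i[X])$ for several choices of $f$ on a rank-one matrix (a minimal Python/NumPy illustration; the helper name and the numerical threshold inside the indicator are our own choices, not part of the paper):

```python
import numpy as np

def rank_surrogate(X, f):
    """Evaluate sum_i f(sigma_i[X]) for a scalar penalty function f."""
    return sum(f(s) for s in np.linalg.svd(X, compute_uv=False))

# Rank-one test matrix: every column is a multiple of the same vector.
X = np.outer(np.arange(1.0, 5.0), np.ones(4))
print(rank_surrogate(X, lambda z: float(z > 1e-10)))  # indicator I[z != 0]: recovers rank(X) = 1
print(rank_surrogate(X, lambda z: z))                 # f(z) = z: the convex nuclear norm
print(rank_surrogate(X, lambda z: z ** 0.5))          # f(z) = z^q with q = 1/2
```

Note that with $f(z) = \log z$ one must handle zero singular values, which is one reason smoothed variants such as the $\log(\sigma_i[X]^2 + \gamma)$ penalty shown later in Fig. 1 introduce a small $\gamma$.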
II. BAYESIAN AFFINE RANK MINIMIZATION (BARM)

In contrast to the majority of existing algorithms organized around practical solutions to minimizing $\sum_i f(\sigma_i[X])$, here we adopt an alternative starting point that can be viewed as a reparameterized generalization of probabilistic PCA [2]. Specifically, we define a Gaussian likelihood function $p(b|X; A, \lambda)$ with noise variance $\lambda$ to encode relaxed constraints, and an independent, zero-mean Gaussian prior distribution with covariance $\nu_i \Psi$ on each column of $X$, denoted $p(X; \Psi, \nu)$. The resulting posterior $p(X|b; \Psi, \nu, A, \lambda)$ is also Gaussian, with mean

\[
\hat{x} = \operatorname{vec}[\hat{X}] = \bar{\Psi} A^\top \left( \lambda I + A \bar{\Psi} A^\top \right)^{-1} b. \tag{2}
\]

Here $A \in \mathbb{R}^{p \times nm}$ is a matrix defining the linear operator $\mathcal{A}$ such that $b = Ax$ reproduces the feasible region in (1), and $\bar{\Psi} = \operatorname{diag}[\nu] \otimes \Psi$, with $\otimes$ denoting the Kronecker product.
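As a grounded illustration of (2), the sketch below forms $\bar{\Psi} = \operatorname{diag}[\nu] \otimes \Psi$ explicitly and evaluates the posterior mean for fixed $\{\Psi, \nu\}$; this is a single inference step rather than the full BARM algorithm, and the problem sizes, random $A$, and identity choices of $\Psi$ and $\nu$ are assumptions of the example:

```python
import numpy as np

def barm_posterior_mean(A, b, Psi, nu, lam):
    """Posterior mean (2): x_hat = Psi_bar A^T (lam*I + A Psi_bar A^T)^{-1} b,
    where Psi_bar = diag(nu) kron Psi (column j of X has prior covariance nu_j * Psi)."""
    Psi_bar = np.kron(np.diag(nu), Psi)   # (nm x nm) prior covariance of vec[X]
    G = A @ Psi_bar @ A.T                 # (p x p)
    return Psi_bar @ A.T @ np.linalg.solve(lam * np.eye(len(b)) + G, b)

# Toy usage (dimensions chosen only for illustration):
rng = np.random.default_rng(0)
n, m, p = 10, 8, 40
A = rng.standard_normal((p, n * m))
X0 = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))  # rank-2 ground truth
b = A @ X0.flatten(order="F")             # column-major stacking matches vec[X]
x_hat = barm_posterior_mean(A, b, Psi=np.eye(n), nu=np.ones(m), lam=1e-6)
X_hat = x_hat.reshape((n, m), order="F")
```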

A common Bayesian strategy to determine $\Psi$ and $\nu$ is to marginalize over $X$ and then maximize the resulting likelihood function. This is equivalent to minimizing the BARM objective

\[
\mathcal{L}(\Psi, \nu) = b^\top \Sigma_b^{-1} b + \log |\Sigma_b|, \qquad \Sigma_b = A \bar{\Psi} A^\top + \lambda I, \tag{3}
\]

where $\Sigma_b$ can be viewed as the covariance of $b$ given $\Psi$ and $\nu$. Minimizing (3) is a non-convex optimization problem, and we design upper bounds for this purpose, leading to an efficient EM-like algorithm with provable convergence to a stationary point. Moreover, the underlying cost function maintains a number of desirable properties.

Lemma 1. Let $b = A \operatorname{vec}[X]$, where $A \in \mathbb{R}^{p \times nm}$ satisfies $\operatorname{spark}[A] = p + 1$. Also define $r$ as the smallest rank of any feasible solution. Then if $r < p/m$, any global minimizer $\{\Psi^*, \nu^*\}$ of (3) in the limit $\lambda \to 0$ is such that $x^* = \bar{\Psi}^* A^\top (A \bar{\Psi}^* A^\top)^{-1} b$ is feasible and $\operatorname{rank}[X^*] = r$ with $\operatorname{vec}[X^*] = x^*$.

The assumption $r = \operatorname{rank}[X^*] < p/m$ is completely unrestrictive, especially given that a unique, minimal-rank solution is only theoretically possible by any algorithm if $p \geq (n+m)r - r^2$ (the degrees of freedom in an $n \times m$ rank-$r$ matrix), which is much more restrictive than $p > rm$. Likewise, the spark assumption will be satisfied for any $A$ with even an infinitesimal (continuous) random component. Consequently, we are essentially always guaranteed that (3) possesses the same global optimum as the rank function. While this result, and a companion result related to solution invariance (not shown), is certainly a useful starting point, the real advantage of adopting (3) is that bad locally minimizing solutions are exceedingly rare and in some cases provably non-existent.

Theorem 1. Let $b = A \operatorname{vec}[X]$, where $A$ is block diagonal with blocks $A_i \in \mathbb{R}^{p_i \times n}$. Moreover, assume $p_i > 1$ for all $i$ and that $\cap_i \operatorname{null}[A_i] = \emptyset$. Then if $\min_X \operatorname{rank}[X] = 1$ in the feasible region, any minimizer $\{\Psi^*, \nu^*\}$ of (3) (global or local) in the limit $\lambda \to 0$ is such that $x^* = \bar{\Psi}^* A^\top (A \bar{\Psi}^* A^\top)^{\dagger} b$ is feasible and $\operatorname{rank}[X^*] = 1$ with $\operatorname{vec}[X^*] = x^*$. Furthermore, no cost function of the form $\sum_i f(\sigma_i[X])$ can satisfy the same result. In particular, there can always exist local and/or global minima with rank greater than one.

This result applies to a generalized form of the matrix completion problem [1], where instead of observing individual elements of the unknown $X^*$, we can only observe linear transformations of each column. Importantly, under mild conditions, which do not even depend on the concentration properties or correlation structure of $A$, the proposed cost function has no minima that are not global minima; in fact, it can be shown that any stationary point is a global minimum. Interestingly, no possible penalty function of the form $\sum_i f(\sigma_i[X])$ can satisfy the same result. From an empirical standpoint, the proposed BARM algorithm dramatically outperforms all methods we are aware of (see examples below and more in [4]).
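The upper-bounding EM-like updates that minimize (3) are developed in [4] and are not reproduced here; as a minimal sketch of what (3) itself specifies, the snippet below merely evaluates the objective (the Cholesky-based evaluation and function name are our own choices):

```python
import numpy as np

def barm_cost(A, b, Psi, nu, lam):
    """BARM objective (3): L(Psi, nu) = b^T Sigma_b^{-1} b + log|Sigma_b|,
    with Sigma_b = A (diag(nu) kron Psi) A^T + lam * I."""
    Sigma_b = A @ np.kron(np.diag(nu), Psi) @ A.T + lam * np.eye(len(b))
    # One Cholesky factorization Sigma_b = L L^T yields both terms stably.
    L = np.linalg.cholesky(Sigma_b)
    z = np.linalg.solve(L, b)    # z = L^{-1} b, so z @ z = b^T Sigma_b^{-1} b
    return z @ z + 2.0 * np.log(np.diag(L)).sum()
```

Monitoring this value across iterations of any candidate update scheme provides a direct check of the monotone convergence to a stationary point claimed above.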

[Figure 1: three panels plotting cost function value against $\eta$, one each for the nuclear norm, IRLS0, and BARM; the IRLS0 panel shows curves for $\gamma = 1$, $10^{-5}$, and $10^{-10}$, and the true (lowest-rank) solution lies at $\eta = 0$.]

Fig. 1. Local minima visualizations. Plots of different surrogates for matrix rank in a 1D feasible subspace generated by $X_0 + \eta V$, where $X_0$ is the true minimum-rank solution, $V \in \operatorname{null}[\mathcal{A}]$, and $\eta$ is a scalar. Here the nuclear norm does not retain the correct global minimum. In contrast, although the non-convex $\sum_i \log(\sigma_i[X]^2 + \gamma)$ penalty (from the IRLS0 algorithm [3]) exhibits the correct minimum when $\gamma$ is sufficiently small, it also contains spurious minima. Only BARM smooths away local minima while simultaneously always retaining the correct global optimum.

[Figure 2: eight panels plotting FoS and REL against rank, with curves for the nuclear norm, IRLS0, BARM, and the theoretical limit, for the four settings (a) $50 \times 50$, $A$ uncorrelated; (b) $50 \times 50$, $A$ correlated; (c) $100 \times 100$, $A$ uncorrelated; (d) $100 \times 100$, $A$ correlated.]

Fig. 2. Empirical comparisons with the nuclear norm and IRLS0 [3] under general affine constraints (average of 10 trials) as $\operatorname{rank}[X_0]$ is varied. Subplot labels indicate the size of $X_0$ and whether $A$ was uncorrelated (easier) or correlated (harder). In all cases $p = 1000$. The relative error (REL) is $\|X_0 - \hat{X}\|_F / \|X_0\|_F$, and the frequency of success (FoS) measures the percentage of trials where REL is below $10^{-3}$. The theoretical limit is such that $p \geq (n+m)r - r^2$; no possible algorithm can work when $p$ is smaller. Note that [4] contains more details and experiments.
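For reference, the theoretical limit quoted in the caption follows from a one-line computation: the largest $r$ satisfying $(n+m)r - r^2 \leq p$. A small sketch (helper name ours) evaluates this bound for the Fig. 2 problem sizes:

```python
def max_recoverable_rank(n, m, p):
    """Largest rank r whose degrees of freedom, (n+m)r - r^2, fit within p measurements."""
    return max(r for r in range(min(n, m) + 1) if (n + m) * r - r * r <= p)

# With p = 1000 as in Fig. 2:
print(max_recoverable_rank(50, 50, 1000))    # -> 11
print(max_recoverable_rank(100, 100, 1000))  # -> 5
```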

[Figure 3: two panels plotting FoS and REL against rank, with curves for VSBL, the nuclear norm, IRNN_best, IRLS0, BARM, and the theoretical limit.]

Fig. 3. Reproduction of matrix completion experiments from [5]. Here $X_0$ is $150 \times 150$ with varying rank, and 50% of its elements are observed. BARM is compared against IRNN [5], IRLS0 [3], VSBL [6], and the nuclear norm [1]. [4] contains more details and analysis.

REFERENCES

[1] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, 2009.
[2] M. Tipping and C. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, no. 3, pp. 611–622, 1999.
[3] K. Mohan and M. Fazel, "Iterative reweighted algorithms for matrix rank minimization," Journal of Machine Learning Research (JMLR), vol. 13, no. 1, pp. 3441–3473, 2012.
[4] B. Xin and D. Wipf, "Exploring algorithmic limits of matrix rank minimization under affine constraints," arXiv:1406.2504, 2014.
[5] C. Lu, J. Tang, S. Yan, and Z. Lin, "Generalized nonconvex nonsmooth low-rank minimization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos, "Sparse Bayesian methods for low-rank matrix estimation," IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 3964–3977, 2012.
