Matrix sparsification via the Khintchine inequality

Nam Nguyen^1, Petros Drineas^2, and Trac Tran^1

^1 Johns Hopkins University
^2 Rensselaer Polytechnic Institute

Abstract. Given a matrix $A \in \mathbb{R}^{n \times n}$, we present a simple, element-wise sparsification algorithm that zeroes out all sufficiently small elements of $A$, keeps all sufficiently large elements of $A$, and retains some of the remaining elements with probabilities proportional to the square of their magnitudes. We analyze the approximation accuracy of the proposed algorithm using a powerful inequality bounding the norms of sums of random matrices, the so-called Khintchine inequality. As a result, we obtain improved bounds for the matrix sparsification problem.

1 Introduction

A large body of recent work has focused on the design and analysis of algorithms that efficiently create small “sketches” of matrices. Such sketches are subsequently used in eigenvalue and eigenvector computations [1, 2], in data mining applications [3], or even to solve combinatorial optimization problems [4]. Existing approaches include, for example, the selection of a small number of rows and columns of a matrix in order to form the so-called CUR matrix decomposition [5], as well as random projection based methods that employ fast randomized variants of the Hadamard-Walsh transform [6] or the Discrete Cosine Transform [7]. An alternative approach was pioneered by Achlioptas and McSherry in 2001 [2, 8] and leveraged the selection of a small number of elements in order to form a sketch of the input matrix. Motivated by their work, we define the following matrix sparsification problem.

Definition 1 (Matrix Sparsification). Given a matrix $A \in \mathbb{R}^{m \times n}$ and an error parameter $\alpha \ge 0$, construct a sketch $\tilde{A} \in \mathbb{R}^{m \times n}$ of $A$ such that
$$\big\|A - \tilde{A}\big\|_2 \le \alpha$$
and the number of non-zero entries in $\tilde{A}$ is minimized.

A few comments are necessary to better understand the above definition. First, $\|\cdot\|_2$ denotes the spectral norm of a matrix (see Section 2.1 for notation). Second, a similar problem could be formulated by seeking a bound for the Frobenius norm of $A - \tilde{A}$. Third, this definition places no constraints on the form of the entries of $\tilde{A}$. However, in this work, we will focus on methods that return matrices $\tilde{A}$ whose entries are either zeros or (rescaled) entries of $A$. Prior work has investigated quantization as an alternative construction for the entries of $\tilde{A}$, while the theoretical properties of more general methods remain vastly unexplored. Fourth, the running time needed to construct a sketch is not restricted. All prior work has focused on the construction of sketches in one or two sequential passes over the input matrix. Thus, we are particularly interested in sketching algorithms that can be implemented within the same framework (a small number of sequential passes).

There are at least three important application domains for the sparse matrices $\tilde{A}$ of Definition 1: approximate eigenvector computations, semi-definite programming (SDP) solvers, and matrix completion. The first two applications are based on the fact that, given a vector $x \in \mathbb{R}^n$, the product $Ax$ can be approximated by $\tilde{A}x$ with a bounded loss in accuracy. The running time of the latter matrix-vector product is proportional to the number of non-zeros in $\tilde{A}$, thus leading to immediate computational savings. This fast matrix-vector product operation can then be used to approximate eigenvectors and eigenvalues of matrices [2, 8, 9] via subspace iteration methods; yet another application would be a quick estimate of the Krylov subspace of a matrix. Additionally, [10, 11] argue that fast matrix-vector products are useful in SDP solvers. Finally, the third application domain of sparse sketches is the so-called matrix completion problem, an active research area of growing interest, where the user only has access to $\tilde{A}$ (typically formed by sampling a small number of elements of $A$ uniformly at random) and the goal is to reconstruct the entries of $A$ as accurately as possible. The motivation underlying the matrix completion problem stems from recommender systems and collaborative filtering and was initially discussed in [12]. More recently, methods using bounds on $\|A - \tilde{A}\|_2$ and trace minimization algorithms have demonstrated exact reconstruction of $A$ under – rather restrictive – assumptions [13, 14].
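To make the computational-savings claim above concrete, the following small Python sketch (our illustration, not part of the original paper) compares a dense matrix-vector product against one computed from a sparse sketch stored in compressed sparse row format; here the sketch is a crude hard-thresholding of the entries, chosen only to show that the cost of $\tilde{A}x$ scales with the number of non-zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))

# A crude sketch for illustration only: keep the 5% largest-magnitude entries.
threshold = np.quantile(np.abs(A), 0.95)
A_sketch = csr_matrix(np.where(np.abs(A) >= threshold, A, 0.0))

x = rng.standard_normal(n)
dense_product = A @ x            # O(n^2) work
sparse_product = A_sketch @ x    # O(nnz) work, nnz is about 0.05 * n^2 here

rel_err = np.linalg.norm(dense_product - sparse_product) / np.linalg.norm(dense_product)
print(f"nnz fraction: {A_sketch.nnz / n**2:.3f}, relative error of A~x vs Ax: {rel_err:.3f}")
```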

1.1 Our algorithm and our main theorem

Our main algorithm (Algorithm 1) zeroes out “small” elements of $A$, keeps “large” elements of $A$, and randomly samples the remaining elements of $A$ with a probability that depends on their magnitude. The following theorem is our main quality-of-approximation result for Algorithm 1.

Theorem 1. Let $A \in \mathbb{R}^{n \times n}$ be any matrix (assume $n \ge 300$) and let $\tilde{A}$ be the sparse sketch of $A$ constructed via Algorithm 1. If
$$s = C\,\frac{n \log n\, \log_2^2\!\left(n/\log^2 n\right)}{\epsilon^2},$$
then, with probability at least $1 - n^{-1}$,
$$\big\|A - \tilde{A}\big\|_2 \le \epsilon\,\|A\|_F.$$

1: Input: matrix $A \in \mathbb{R}^{n \times n}$, sampling parameter $s$.
2: For all $1 \le i, j \le n$ do
   – If $A_{ij}^2 \le \frac{\log^2 n}{n}\,\frac{\|A\|_F^2}{s}$ then $\tilde{A}_{ij} = 0$,
   – ElseIf $A_{ij}^2 \ge \frac{\|A\|_F^2}{s}$ then $\tilde{A}_{ij} = A_{ij}$,
   – Else
     $$\tilde{A}_{ij} = \begin{cases} \dfrac{A_{ij}}{p_{ij}} & \text{with probability } p_{ij} = \dfrac{s A_{ij}^2}{\|A\|_F^2} < 1,\\[4pt] 0 & \text{with probability } 1 - p_{ij}. \end{cases}$$
3: Output: matrix $\tilde{A} \in \mathbb{R}^{n \times n}$.

Algorithm 1: Matrix Sparsification Algorithm

Here $C$ is a constant, upper bounded by $45^2$. $\tilde{A}$ has, in expectation, at most $2s$ non-zero entries and the construction of $\tilde{A}$ can be implemented in one pass over the input matrix $A$. It is worth noting that in order to simplify our presentation the above theorem is focused on square matrices. However, a similar theorem can also be proven for $m \times n$ matrices with small changes in the constants and assuming $n = \max\{m, n\}$. Also, in order to implement our algorithm in one pass, we need to combine it with the Sample algorithm presented in Section 4.1 of [8]. Finally, in the context of Definition 1, our result essentially shows that we can get a sparse sketch $\tilde{A}$ with $2s$ non-zero entries.
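For readers who want to experiment, the following Python sketch is a direct, unoptimized transcription of Algorithm 1. It is our illustration of the three cases above, not code from the authors; a practical implementation would stream over the entries in one pass as discussed above rather than loop over a dense array.

```python
import numpy as np

def sparsify(A: np.ndarray, s: float, rng: np.random.Generator) -> np.ndarray:
    """Element-wise sparsification of a square matrix A (Algorithm 1, illustrative)."""
    n = A.shape[0]
    fro2 = np.sum(A ** 2)                              # ||A||_F^2
    zero_thresh = (np.log(n) ** 2 / n) * fro2 / s      # entries below this are dropped
    keep_thresh = fro2 / s                             # entries above this are kept exactly
    A_tilde = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(n):
            a2 = A[i, j] ** 2
            if a2 <= zero_thresh:
                continue                               # zeroed out
            if a2 >= keep_thresh:
                A_tilde[i, j] = A[i, j]                # kept deterministically
            else:
                p = s * a2 / fro2                      # sampling probability, < 1 here
                if rng.random() < p:
                    A_tilde[i, j] = A[i, j] / p        # kept and rescaled
    return A_tilde

# Usage: the sketch has roughly at most 2s non-zeros in expectation.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500))
A_tilde = sparsify(A, s=20000, rng=rng)
err = np.linalg.norm(A - A_tilde, 2) / np.linalg.norm(A, 'fro')
print(f"nnz = {np.count_nonzero(A_tilde)}, spectral error / ||A||_F = {err:.3f}")
```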

1.2 Comparison with prior work

Our result outperforms prior work in the sense that, using the same accuracy parameter $\alpha = \epsilon \|A\|_F$ in Definition 1, the resulting matrix $\tilde{A}$ has fewer non-zero elements. In [2, 8] the authors presented a sampling method that requires at least $O(n \log^4 n / \epsilon^2)$ non-zero entries in $\tilde{A}$. Our result reduces the sampling complexity by a modest, yet non-trivial, $O(\log n)$ factor. This improvement is slightly better for small values of $n$, where our $\log_2(n/\log^2 n)$ factor provides further savings. It is harder to compare our method to the work of [9], which depends on the quantity $\sum_{i,j=1}^n |A_{ij}|$. The latter quantity is, in general, upper bounded only by $n\|A\|_F$, in which case the sampling complexity of [9] is much worse, namely $O(n^{3/2}/\epsilon)$. However, it is worth noting that the result of [9] is appropriate for matrices whose “energy” is focused only on a small number of entries, as well as that their bound holds with much higher probability than ours.

Another important contribution of our work is the technical analysis. While previous proofs are either combinatorial in nature (e.g., the proofs of [2, 8] are based on the result of [15], whose proof is fundamentally combinatorial) or use simple $\epsilon$-net arguments [9], we leverage a powerful inequality from the functional analysis literature, the so-called Khintchine inequality. This inequality (see Section 2.3 for an exact statement) bounds the Schatten norms of a sum of random matrices.

Finally, while preparing this manuscript, the results of [16] were brought to our attention. In this paper, the authors study the $\|\cdot\|_{\infty \to 2}$ and $\|\cdot\|_{\infty \to 1}$ norms in the matrix sparsification context. The authors also present a sampling scheme for the problem of Definition 1. Their theoretical analysis is not directly comparable to our results, since their sampling complexity depends on the average of the ratios $A_{ij}^2 / \max_{i,j} A_{ij}^2$.

2 Preliminaries

2.1 Notation

We will use the notation $\mathbb{E}X$ to denote the expectation of a random variable $X$. When $X$ is a matrix, then $\mathbb{E}X$ denotes the element-wise expectation of each entry of $X$. Also, we will use the notation $\mathbb{E}_\epsilon X$ to denote the expectation of $X$ with respect to $\epsilon$. Similarly, $\mathrm{Var}(X)$ denotes the variance of the random variable $X$ and $\mathbb{P}(\mathcal{E})$ denotes the probability of event $\mathcal{E}$. Finally, $\log x$ denotes the natural logarithm of $x$ and $\log_2 x$ denotes the base two logarithm of $x$.

For any $n \times n$ matrix $A$ whose $(i,j)$-th entry is denoted by $A_{ij}$, we define its Schatten norm $S_q$, for any $q \ge 1$, to be equal to $\|A\|_{S_q} = \left(\sum_{i=1}^n \sigma_i^q\right)^{1/q}$. Here $\sigma_i$ is the $i$-th singular value of $A$. For $q = 2$, the Schatten norm is equal to the Frobenius norm of the matrix $A$, namely $\|A\|_{S_2} = \left(\sum_{i=1}^n \sigma_i^2\right)^{1/2} = \left(\sum_{i,j=1}^n A_{ij}^2\right)^{1/2}$. Similarly, for $q = \infty$, the Schatten norm is equal to the operator norm of the matrix $A$, namely $\|A\|_{S_\infty} = \|A\|_2 = \sigma_1$.
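As a quick numerical illustration of these notational identities (our example, not part of the paper), the Schatten norms can be computed directly from the singular values; $S_2$ coincides with the Frobenius norm and $S_\infty$ with the spectral norm.

```python
import numpy as np

def schatten_norm(A: np.ndarray, q: float) -> float:
    """Schatten q-norm: the l_q norm of the vector of singular values."""
    sigma = np.linalg.svd(A, compute_uv=False)
    if np.isinf(q):
        return float(sigma[0])                    # S_inf = spectral norm
    return float(np.sum(sigma ** q) ** (1.0 / q))

A = np.random.default_rng(0).standard_normal((50, 50))
assert np.isclose(schatten_norm(A, 2), np.linalg.norm(A, 'fro'))
assert np.isclose(schatten_norm(A, np.inf), np.linalg.norm(A, 2))
```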

2.2 Measure concentration

We will make frequent use of the following version of Bennett's inequality.

Lemma 1. Let $X_1, X_2, \ldots, X_n$ be independent, zero-mean random variables and assume that $|X_i| \le 1$. For any $t \ge \frac{3}{2}\sum_{i=1}^n \mathrm{Var}(X_i)$:
$$\mathbb{P}\left(\sum_{i=1}^n X_i > t\right) \le e^{-t/2}.$$

This version of Bennett's inequality can be derived from the standard one, stating:
$$\mathbb{P}\left(\sum_{i=1}^n X_i > t\right) \le e^{-\sigma^2 h(t/\sigma^2)}.$$
Here $\sigma^2 = \sum_{i=1}^n \mathrm{Var}(X_i)$ and $h(u) = (1+u)\log(1+u) - u$. Lemma 1 follows using the fact that $h(u) \ge u/2$ for $u \ge 3/2$.
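For completeness, here is a short verification of this last step (our addition, not in the original text): since $h'(u) = \log(1+u) \ge \log(5/2) > 1/2$ for $u \ge 3/2$, and $h(3/2) = \tfrac{5}{2}\log\tfrac{5}{2} - \tfrac{3}{2} \approx 0.79 \ge 3/4$, the function $h(u) - u/2$ is non-negative and increasing on $[3/2, \infty)$. Consequently, for $t \ge \tfrac{3}{2}\sigma^2$,
$$\sigma^2\, h\!\left(\frac{t}{\sigma^2}\right) \ge \sigma^2 \cdot \frac{1}{2}\cdot\frac{t}{\sigma^2} = \frac{t}{2}, \qquad \text{so} \qquad \mathbb{P}\left(\sum_{i=1}^n X_i > t\right) \le e^{-t/2}.$$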

The following lemma converts a probabilistic bound for the random variable $X$ to an expectation bound for $X^q$, for all $q \ge 1$, and might be of independent interest. Its proof is deferred to the Appendix.

Lemma 2. Let $X$ be a random variable assuming non-negative values. Let $a$, $b$, $t$, and $h$ be non-negative parameters. If $\mathbb{P}(X \ge a + tb) \le e^{-t+h}$, then, for all $q \ge 1$, $\mathbb{E}X^q \le 2(a + bh + bq)^q$.

2.3 The non-commutative Khintchine inequality and its consequences

We now state the so-called non-commutative Khintchine inequality, which will be used to derive Lemma 4.

Lemma 3. Let $X_1, \ldots, X_n$ be a set of matrices of the same dimensions and let $\epsilon_1, \ldots, \epsilon_n$ be an independent Rademacher sequence. For all $q \ge 2$,
$$\left(\mathbb{E}_\epsilon \Big\| \sum_i \epsilon_i X_i \Big\|_{S_q}^q\right)^{1/q} \le c_1 \sqrt{q}\, \max\left\{ \Big\| \Big(\sum_i X_i X_i^T\Big)^{1/2} \Big\|_{S_q},\; \Big\| \Big(\sum_i X_i^T X_i\Big)^{1/2} \Big\|_{S_q} \right\},$$
where $\|\cdot\|_{S_q}$ is the Schatten norm and $c_1 \le 2^{-1/4}\sqrt{\pi/e}$.
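As a sanity check (ours, not from the paper), the inequality can be probed numerically: draw random sign patterns, estimate the left-hand side by Monte Carlo, and compare it with the right-hand side computed from the matrix square roots of $\sum_i X_i X_i^T$ and $\sum_i X_i^T X_i$. The choices of $q$, the matrix dimensions, and the number of trials below are arbitrary.

```python
import numpy as np

def schatten(M: np.ndarray, q: float) -> float:
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** q) ** (1.0 / q))

def psd_sqrt(M: np.ndarray) -> np.ndarray:
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
q, n, dim, trials = 4, 20, (8, 12), 2000
Xs = [rng.standard_normal(dim) for _ in range(n)]

# Left-hand side: (E_eps || sum_i eps_i X_i ||_{S_q}^q)^{1/q}, estimated by Monte Carlo.
samples = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    samples.append(schatten(sum(e * X for e, X in zip(eps, Xs)), q) ** q)
lhs = np.mean(samples) ** (1.0 / q)

# Right-hand side: c_1 sqrt(q) times the larger Schatten norm of the two square roots.
c1 = 2 ** (-0.25) * np.sqrt(np.pi / np.e)
row = psd_sqrt(sum(X @ X.T for X in Xs))
col = psd_sqrt(sum(X.T @ X for X in Xs))
rhs = c1 * np.sqrt(q) * max(schatten(row, q), schatten(col, q))
print(f"LHS ~ {lhs:.2f}  <=  RHS = {rhs:.2f}")
```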

The next lemma, whose proof is deferred to the Appendix, controls the spectral norms of random matrices. A slightly different version of the lemma appeared in [17].

Lemma 4. Let $A \in \mathbb{R}^{n \times n}$ be any matrix and let $\tilde{A}$ be a random matrix of the same dimensions such that $\mathbb{E}\tilde{A} = A$. Then, for $q \ge \log n$,
$$\left(\mathbb{E}\big\|A - \tilde{A}\big\|_2^q\right)^{1/q} \le c_0\, 2^{1/q} \sqrt{q} \left( \sqrt{\mathbb{E}\Big(\max_i \sum_j \tilde{A}_{ij}^2\Big)^q} + \sqrt{\mathbb{E}\Big(\max_j \sum_i \tilde{A}_{ij}^2\Big)^q} \right)^{1/q},$$
where $c_0 \le 2^{3/4}\sqrt{\pi e} < 5$.

A straightforward consequence of this lemma when the matrix $A$ is set to be zero is the following corollary.

Corollary 1. Let $B \in \mathbb{R}^{n \times n}$ be a random matrix whose entries are independent, zero-mean random variables. Then, for $q \ge \log n$,
$$\left(\mathbb{E}\|B\|_2^q\right)^{1/q} \le c_0\, 2^{1/q} \sqrt{q} \left( \sqrt{\mathbb{E}\Big(\max_i \sum_j B_{ij}^2\Big)^q} + \sqrt{\mathbb{E}\Big(\max_j \sum_i B_{ij}^2\Big)^q} \right)^{1/q},$$
where $c_0 \le 2^{3/4}\sqrt{\pi e} < 5$.

3 Proving Theorem 1

The main idea underlying our proof is to decompose the matrix $A - \tilde{A}$ into a sum of matrices whose entries are bounded, and then to apply Lemma 4 and Corollary 1 in order to estimate the spectral norm of each matrix in the sum independently; this is a divide-and-conquer strategy. To formally present our analysis, let $A^{[1]} \in \mathbb{R}^{n \times n}$ be the matrix containing all entries $A_{ij}$ of $A$ that satisfy $A_{ij}^2 \ge 2^{-1}\|A\|_F^2/s$; the remaining entries of $A^{[1]}$ are set to zero. Similarly, we let $A^{[k]} \in \mathbb{R}^{n \times n}$ (for all $k > 1$) be the matrices that contain all entries $A_{ij}$ of $A$ that satisfy $A_{ij}^2 \in \left[2^{-k}\frac{\|A\|_F^2}{s},\, 2^{-k+1}\frac{\|A\|_F^2}{s}\right)$; the remaining entries of $A^{[k]}$ are set to zero. Finally, the matrices $\tilde{A}^{[k]}$ contain the (rescaled) entries of the corresponding matrix $A^{[k]}$ that were selected after applying the sparsification procedure of Algorithm 1 to $A$. Given these definitions,
$$A = \sum_{k=1}^\infty A^{[k]} \qquad \text{and} \qquad \tilde{A} = \sum_{k=1}^\infty \tilde{A}^{[k]}.$$
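The following short sketch (ours, for intuition only) builds the dyadic level sets $A^{[k]}$ just defined and checks numerically that they partition the non-zero entries of $A$, i.e., that $A = \sum_k A^{[k]}$ once enough levels are taken.

```python
import numpy as np

def level_sets(A: np.ndarray, s: float, K: int) -> list[np.ndarray]:
    """Dyadic level sets A^[k]: entries with A_ij^2 in [2^{-k}, 2^{-k+1}) * ||A||_F^2 / s;
    the first level also absorbs everything at or above ||A||_F^2 / (2s)."""
    fro2 = np.sum(A ** 2)
    levels = []
    for k in range(1, K + 1):
        lower = 2.0 ** (-k) * fro2 / s
        upper = np.inf if k == 1 else 2.0 ** (-k + 1) * fro2 / s
        mask = (A ** 2 >= lower) & (A ** 2 < upper)
        levels.append(np.where(mask, A, 0.0))
    return levels

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 200))
parts = level_sets(A, s=4000.0, K=60)   # K deep enough that the leftover mass is negligible
residual = A - sum(parts)
print("max entry left outside the first K level sets:", np.abs(residual).max())
```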

Let $\ell = \lfloor \log_2(n/\log^2 n) \rfloor$. Then,
$$\big\|A - \tilde{A}\big\|_2 = \Big\| \sum_{k=1}^\infty \big(A^{[k]} - \tilde{A}^{[k]}\big) \Big\|_2 \le \big\|A^{[1]} - \tilde{A}^{[1]}\big\|_2 + \sum_{k=2}^{\ell} \big\|A^{[k]} - \tilde{A}^{[k]}\big\|_2 + \Big\| \sum_{k=\ell+1}^{\infty} \big(A^{[k]} - \tilde{A}^{[k]}\big) \Big\|_2.$$

Using the inequality $(\mathbb{E}(x+y)^q)^{1/q} \le (\mathbb{E}x^q)^{1/q} + (\mathbb{E}y^q)^{1/q}$, we conclude that
$$\left(\mathbb{E}\big\|A - \tilde{A}\big\|_2^q\right)^{1/q} \le \underbrace{\left(\mathbb{E}\big\|A^{[1]} - \tilde{A}^{[1]}\big\|_2^q\right)^{1/q}}_{\text{(1)}} + \underbrace{\sum_{k=2}^{\ell} \left(\mathbb{E}\big\|A^{[k]} - \tilde{A}^{[k]}\big\|_2^q\right)^{1/q}}_{\text{(2)}} + \underbrace{\left(\mathbb{E}\Big\| \sum_{k=\ell+1}^{\infty} \big(A^{[k]} - \tilde{A}^{[k]}\big) \Big\|_2^q\right)^{1/q}}_{\text{(3)}}.$$
The remainder of the section will focus on the derivation of bounds for terms (1), (2), and (3) of the above equation.

3.1 Term (1): Bounding the spectral norm of $A^{[1]} - \tilde{A}^{[1]}$

The main result of this section is summarized in the following lemma.

Lemma 5. Let $n \ge 25$ and $q = \log n$. Then,
$$\left(\mathbb{E}\big\|A^{[1]} - \tilde{A}^{[1]}\big\|_2^q\right)^{1/q} \le 2c_0 \sqrt{\frac{8qn}{s}}\, \|A\|_F,$$
where $c_0$ is the constant of Corollary 1.

Proof. For notational convenience, let $B = A^{[1]} - \tilde{A}^{[1]}$ and let $B_{ij}$ denote its entries. Recall that $A^{[1]}$ only contains entries of $A$ whose squares are greater than or equal to $2^{-1}\|A\|_F^2/s$. Also, recall that $\tilde{A}^{[1]}$ only contains the (rescaled) entries of $A^{[1]}$ that were selected after applying the sparsification procedure of Algorithm 1 to $A$. Using these definitions, $B_{ij}$ is equal to:
$$B_{ij} = \begin{cases}
0 & \text{if } A_{ij}^2 < 2^{-1}\|A\|_F^2/s,\\
0 & \text{if } A_{ij}^2 \ge \|A\|_F^2/s \quad (\text{since } p_{ij} = 1),\\
\left(1 - p_{ij}^{-1}\right) A_{ij} & \text{with probability } p_{ij} = s A_{ij}^2/\|A\|_F^2 < 1,\\
A_{ij} & \text{with probability } 1 - p_{ij}.
\end{cases}$$
We will estimate the quantity $(\mathbb{E}\|B\|_2^q)^{1/q}$ via Corollary 1:
$$\left(\mathbb{E}\|B\|_2^q\right)^{1/q} \le c_0\, 2^{1/q} \sqrt{q} \left( \sqrt{\mathbb{E}\Big(\max_i \sum_j B_{ij}^2\Big)^q} + \sqrt{\mathbb{E}\Big(\max_j \sum_i B_{ij}^2\Big)^q} \right)^{1/q}. \qquad (4)$$
Towards that end, we will need to bound the two terms $\mathbb{E}(\max_i \sum_j B_{ij}^2)^q$ and $\mathbb{E}(\max_j \sum_i B_{ij}^2)^q$. Since these two terms are essentially the same, we will only bound the first one. Let $S_i = \sum_j B_{ij}^2$ and let $S = \max_i S_i$. In order to bound $\mathbb{E}S^q$, we will first find probabilistic estimates for $S_i$ and $S$ and then estimate the quantity $\mathbb{E}S^q$ via Lemma 2. Simple algebra and our bounds on the entries $A_{ij}$ that are included in $B$ give:
$$\mathbb{E}(B_{ij}^2) \le \frac{\|A\|_F^2}{s}, \qquad \mathrm{Var}(B_{ij}^2) \le \frac{4\|A\|_F^4}{s^2}, \qquad \text{and} \qquad B_{ij}^2 \le \frac{\|A\|_F^2}{s}.$$
We can now apply Bennett's inequality to bound $S_i$. Formally, we bound the sum
$$\sum_{j=1}^n \frac{s}{\|A\|_F^2} B_{ij}^2,$$
since every entry in the above sum is bounded (in absolute value) by one. Clearly, the expectation of the above sum is at most $n$ and its variance is at most $4n$. Thus, from Bennett's inequality, we get
$$\mathbb{P}\left(\sum_{j=1}^n \frac{s}{\|A\|_F^2} B_{ij}^2 > n + t\right) \le e^{-t/2},$$
assuming that $t \ge 6n$. Recall that $S_i = \sum_j B_{ij}^2$, set $t = 6n + 2\tau$ for $\tau \ge 0$, and rearrange terms to conclude
$$\mathbb{P}\left(S_i \ge \frac{7n + 2\tau}{s}\|A\|_F^2\right) \le e^{-3n - \tau} \le e^{-\tau}.$$
An application of the union bound over all $n$ possible values of $S_i$ derives
$$\mathbb{P}\left(\max_i S_i \ge \frac{7n + 2\tau}{s}\|A\|_F^2\right) \le n e^{-\tau} = e^{-\tau + \log n}.$$
Since $S = \max_i S_i$, we can apply Lemma 2 with $a = 7n$, $b = 2$, and $h = \log n$ to get
$$\mathbb{E}S^q \le 2\left(7n + 2\log n + 2q\right)^q \|A\|_F^{2q}/s^q.$$
Since $q = \log n$ and $n \ge 4\log n$ (from the assumption $n \ge 25$), we obtain
$$\mathbb{E}S^q = \mathbb{E}\left(\max_i \sum_{j=1}^n B_{ij}^2\right)^q \le 2\left(\frac{8n\|A\|_F^2}{s}\right)^q.$$
The same bound can also be derived for the expectation $\mathbb{E}(\max_j \sum_i B_{ij}^2)^q$. We can now substitute these bounds in eqn. (4):
$$\left(\mathbb{E}\|B\|_2^q\right)^{1/q} \le c_0 \sqrt{q} \left( 4\sqrt{2} \left(\frac{8n\|A\|_F^2}{s}\right)^{q/2} \right)^{1/q} \le 2c_0 \sqrt{\frac{8qn}{s}}\, \|A\|_F.$$
In the above we used $q = \log n$ and $n \ge 25$ to guarantee $8^{1/q} \le 2$.

3.2 Term (2): Bounding the spectral norm of $A^{[k]} - \tilde{A}^{[k]}$ for small $k$

e[k] for We now focus on estimating the spectral norm of the matrices A[k] − A 2 2 ≤ k ≤ blog2 (n/ log n)c. The following lemma summarizes the main result of this section. Lemma 6 Let n ≥ 25 and let q = log n. For all 2 ≤ k ≤ blog2 (n/ log2 n)c, r

q 1/q  6qn

[k] [k] e E A − A kAkF , ≤ 2co s 2 where c0 is the constant of Lemma 4. eij denote the entries of the matrix Proof. For notational convenience, we let A [k] e A . Then, δij Aij A˜ij = , (5) pij

h  kAk2 kAk2 e[k] for those entries Aij of A satisfying A2ij ∈ 2k sF , 2k−1Fs . All the entries of A that correspond to entries of A outside this interval are set to zero. The indicator function δij is defined as

δij =

( 1

,with probability pij =

sA2ij kAk2F

≤1

,with probability 1 − pij

0

  Notice that pij is always in the interval 2−k , 2−(k−1) from the constraint on the e[k] = A[k] . Thus, by applying Lemma 4, size of A2ij . It is now easy to see that EA v  q1 q v u  ! u q u n n

q  q1  u X X √ u 

e[k] A˜2ij  + tE max A˜2ij  . E A[k] − A ≤ c0 q tE max

i

2

j

j=1

i=1

(6) P We will estimate the expectation E(maxi j A˜2ij )q . The other term in eqn. (6) P is bounded similarly. Let Si = j A˜2ij . Then, using eqn. (5), the definition of 2 pij , and δij = δij , we get Si =

Using A2ij ≥  E max i

kAk2F 2k s

n X

n n n 4 4 X X X kAkF 2 kAkF δij 2 δ A = . A δ = ij ij p2 ij j=1 A4ij s2 ij j=1 s2 A2ij j=1 ij

, we get Si ≤

2k kAk2F s

P

q 

A˜2ij  = E max Si i

j=1

q

 δ , which leads to: ij j 2



2k kAkF s

!q

 E max i

n X

q δij  .

(7)

j=1

 q P We now seek a bound for the expectation E maxi j δij . The following lemma, whose proof may be found in the Appendix, provides such a bound.  q q P Lemma 7 For all n ≥ 25 and any q ≤ log n, E maxi j δij ≤ 2 6n2−k . Combining Lemma 7 and eqn. (7), we obtain  E max i

n X j=1

q A˜2ij  ≤ 2

2

6n kAkF s

!q .

Pn The same bound can be derived for the E(maxj i=1 A˜2ij )q . Thus, substituting in eqn. (6) and since q ≥ 2, we get the claim of the lemma.

3.3 Term (3): Bounding the tail

We now focus on values of $k$ that exceed $\ell = \lfloor \log_2(n/\log^2 n) \rfloor$ and prove the following lemma, which immediately provides a bound for term (3).

Lemma 8. Using our notation,
$$\Big\| \sum_{k=\ell+1}^{\infty} \big(A^{[k]} - \tilde{A}^{[k]}\big) \Big\|_2 \le \sqrt{\frac{n \log^2 n}{s}}\, \|A\|_F.$$

Proof. Intuitively, by the definition of $A^{[k]}$, we can observe that when $k$ is larger than $\ell = \lfloor \log_2(n/\log^2 n) \rfloor$, the entries of $A^{[k]}$ are very small, and hence the entries of $\tilde{A}^{[k]}$ are all set to zero during the second step of our sparsification algorithm. Formally, consider the sum
$$D = \sum_{k=\ell+1}^{\infty} \big(A^{[k]} - \tilde{A}^{[k]}\big).$$
For all $k \ge \ell + 1 \ge \log_2(n/\log^2 n)$, notice that the squares of all the entries of $A^{[k]}$ are at most $\frac{\log^2 n}{n}\frac{\|A\|_F^2}{s}$ (by definition) and thus the matrices $\tilde{A}^{[k]}$ are all-zero matrices. The above sum now reduces to
$$D = \sum_{k=\ell+1}^{\infty} A^{[k]},$$
where the squares of all the entries of $D$ are at most $\frac{\log^2 n}{n}\frac{\|A\|_F^2}{s}$. Since $D \in \mathbb{R}^{n \times n}$, using $\|D\|_2 \le \|D\|_F$, we immediately get
$$\Big\| \sum_{k=\ell+1}^{\infty} \big(A^{[k]} - \tilde{A}^{[k]}\big) \Big\|_2 = \|D\|_2 \le \sqrt{\sum_{i=1}^n \sum_{j=1}^n D_{ij}^2} \le \sqrt{\frac{n \log^2 n}{s}}\, \|A\|_F.$$

3.4 Completing the proof of Theorem 1

Theorem 1 emerges by substituting Lemmas 5, 6, and 8 to bound terms (1), (2), and (3). More specifically, using $c_0 \le 5$ and setting $q = \log n$ we get:
$$\left(\mathbb{E}\big\|A - \tilde{A}\big\|_2^{\log n}\right)^{1/\log n} \le \left(10\sqrt{8} + 10\sqrt{6}\left(\big\lfloor \log_2(n/\log^2 n) \big\rfloor - 1\right) + \sqrt{\log n}\right) \sqrt{\frac{n \log n}{s}}\, \|A\|_F.$$
Assuming $n \ge 300$, $\sqrt{\log n} \le \lfloor \log_2(n/\log^2 n) \rfloor \le \log_2(n/\log^2 n)$. Then, after some simple calculations, the right-hand side of the inequality is less than
$$c_2 \log_2\!\left(n/\log^2 n\right) \sqrt{\frac{n \log n}{s}}\, \|A\|_F,$$
where $c_2 < 1.2(10\sqrt{6} + 1)$. Applying Markov's inequality, we conclude that
$$\big\|A - \tilde{A}\big\|_2 \le c_3 \log_2\!\left(n/\log^2 n\right) \sqrt{\frac{n \log n}{s}}\, \|A\|_F$$
holds with probability at least $1 - n^{-1}$, where $c_3 < 1.2(10\sqrt{6} + 1)\log_2 e < 45$. Theorem 1 now follows by setting $s$ to the appropriate value.

4 Open problems

An interesting open problem would be to investigate whether there exist algorithms that, either deterministically or probabilistically, select elements of $A$ to include in $\tilde{A}$ and achieve much better accuracy than existing schemes. For example, notice that our algorithm, as well as prior ones, samples entries of $A$ with respect to their magnitude; better sampling schemes might be possible. Improved accuracy will probably come at the expense of increased running time. While we are currently unaware of applications where much slower – say $O(n^3)$ – sketching algorithms would be useful, such algorithms are interesting from a purely mathematical viewpoint, since they will allow a better quantification of properties of a matrix via its entries.

Acknowledgements. We would like to thank Dimitris Achlioptas for bringing [16] to our attention.

References

1. Frieze, A., Kannan, R., Vempala, S.: Fast Monte-Carlo algorithms for finding low-rank approximations. In: Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science. (1998) 370–378
2. Achlioptas, D., McSherry, F.: Fast computation of low rank matrix approximations. In: Proceedings of the 33rd Annual ACM Symposium on Theory of Computing. (2001) 611–618
3. Mahoney, M.W., Drineas, P.: CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences 106(3) (2009) 697–702
4. Drineas, P., Kannan, R., Mahoney, M.W.: Sampling subproblems of heterogeneous max-cut problems and approximation algorithms. Random Structures and Algorithms 32(3) (2008) 307–333
5. Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. on Matrix Analysis and Applications 30(2) (2008) 844–881
6. Sarlos, T.: Improved approximation algorithms for large matrices via random projections. In: IEEE Symposium on Foundations of Computer Science (FOCS). (2006)
7. Nguyen, N., Do, T., Tran, T.: A fast and efficient algorithm for low-rank approximation of a matrix. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing. (2009)
8. Achlioptas, D., McSherry, F.: Fast computation of low rank matrix approximations. Journal of the ACM 54(2) (2007)
9. Arora, S., Hazan, E., Kale, S.: A fast random sampling algorithm for sparsifying matrices. In: APPROX-RANDOM. (2006) 272–279
10. Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. (2005) 339–348
11. d'Aspremont, A.: Subsampling algorithms for semidefinite programming. arXiv:0803.1990v5 (2009)
12. Azar, Y., Fiat, A., Karlin, A., McSherry, F., Saia, J.: Spectral analysis of data. In: Proceedings of the 33rd Annual ACM Symposium on Theory of Computing. (2001) 619–626
13. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(3) (2009) 717–772
14. Candès, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, to appear
15. Füredi, Z., Komlós, J.: The eigenvalues of random symmetric matrices. Combinatorica 1(3) (1981) 233–241
16. Gittens, A., Tropp, J.: Error bounds for random matrix approximation schemes. Preprint (2009)
17. Latała, R.: Some estimates of norms of random matrices. Proc. Amer. Math. Soc. 133 (2005) 1273–1282
18. Rudelson, M., Vershynin, R.: Sampling from large matrices: an approach through geometric functional analysis. Journal of the ACM (2007) 1–19

Appendix

Proof of Lemma 2.

Proof. From our assumptions, $\mathbb{P}(X \ge a + b(t + h)) \le e^{-t}$. Let $s = a + b(t + h)$. For any $q \ge 1$,
$$\mathbb{E}X^q = \int_0^\infty \mathbb{P}(X \ge s)\, ds^q = q\int_0^\infty \mathbb{P}(X \ge s)\, s^{q-1}\, ds \le q\int_0^{a+bh} s^{q-1}\, ds + q\int_{a+bh}^\infty s^{q-1} e^{-\frac{s - a - bh}{b}}\, ds.$$
The first term in the above sum is equal to $(a + bh)^q$. The second term is somewhat harder to compute. We start by letting $g = a + bh$ and changing variables, thus getting
$$\int_{a+bh}^\infty s^{q-1} e^{-\frac{s - a - bh}{b}}\, ds = b\int_0^\infty (g + bt)^{q-1} e^{-t}\, dt = b\sum_{i=0}^{q-1} \binom{q-1}{i} b^{q-1-i} g^i \int_0^\infty t^{q-1-i} e^{-t}\, dt.$$
We can now integrate by parts and get
$$\int_0^\infty t^{q-1-i} e^{-t}\, dt = (q - 1 - i)! \le q^{q-1-i} \qquad \text{for all } i = 0, \ldots, q-1.$$
Combining the above,
$$q\int_{a+bh}^\infty s^{q-1} e^{-\frac{s - a - bh}{b}}\, ds \le qb\sum_{i=0}^{q-1} \binom{q-1}{i} (bq)^{q-1-i} g^i = qb(bq + g)^{q-1}.$$
Finally, $\mathbb{E}X^q \le (a + bh)^q + bq(bq + g)^{q-1} \le 2(a + bh + bq)^q$, which concludes the proof.

Proof of Lemma 4.

Proof. Let $B = A - \tilde{A}$. In order to prove Lemma 4 we will use a symmetrization argument and the non-commutative Khintchine inequality as in [18]. Recall that all entries of $B$ are zero-mean, independent random variables. Thus,
$$\left(\mathbb{E}\|B\|^q\right)^{1/q} = \left(\mathbb{E}_{\tilde{A}}\big\|A - \tilde{A}\big\|^q\right)^{1/q} = \left(\mathbb{E}_{\tilde{A}}\big\|\mathbb{E}_{\tilde{A}'}\tilde{A}' - \tilde{A}\big\|^q\right)^{1/q},$$
where $\tilde{A}'$ is an independent copy of $\tilde{A}$. By Jensen's inequality,
$$\left(\mathbb{E}\|B\|^q\right)^{1/q} \le \left(\mathbb{E}_{\tilde{A}}\mathbb{E}_{\tilde{A}'}\big\|\tilde{A}' - \tilde{A}\big\|^q\right)^{1/q}. \qquad (8)$$
By standard symmetrization arguments, the random variables $\tilde{A}_{ij} - \tilde{A}'_{ij}$ have the same distribution as the random variables $\epsilon_{ij}(\tilde{A}_{ij} - \tilde{A}'_{ij})$. Here the $\epsilon_{ij}$'s for all $i, j = 1, \ldots, n$ are independent, symmetric, Bernoulli random variables, taking the values $+1$ or $-1$ with probability $1/2$. If we use $e_i$ to denote the $n$ standard basis vectors for $\mathbb{R}^n$, then $\tilde{A}$ and $\tilde{A}'$ may be rewritten as:
$$\tilde{A} = \sum_{i,j} \tilde{A}_{ij}\, e_i e_j^T \qquad \text{and} \qquad \tilde{A}' = \sum_{i,j} \tilde{A}'_{ij}\, e_i e_j^T.$$
Combining with eqn. (8) and using the inequality $(x + y)^q \le 2^{q-1}(x^q + y^q)$ for $q \ge 1$, we get
$$\left(\mathbb{E}\|B\|^q\right)^{1/q} \le \left(\mathbb{E}_{\tilde{A}}\mathbb{E}_{\tilde{A}'}\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\big(\tilde{A}_{ij} - \tilde{A}'_{ij}\big) e_i e_j^T \Big\|^q\right)^{1/q}$$
$$\le 2^{(q-1)/q}\left(\mathbb{E}_{\tilde{A}}\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\tilde{A}_{ij}\, e_i e_j^T \Big\|^q + \mathbb{E}_{\tilde{A}'}\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\tilde{A}'_{ij}\, e_i e_j^T \Big\|^q\right)^{1/q}$$
$$= 2\left(\mathbb{E}_{\tilde{A}}\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\tilde{A}_{ij}\, e_i e_j^T \Big\|^q\right)^{1/q}, \qquad (9)$$
where the equality holds since $\tilde{A}$ and $\tilde{A}'$ have the same distribution. In order to bound the expectation $\mathbb{E}_\epsilon\big\| \sum_{i,j} \epsilon_{ij}\tilde{A}_{ij}\, e_i e_j^T \big\|^q$ we will use Khintchine's inequality. First, note that the Schatten $q$-norm is within a multiplicative constant from the spectral norm. More specifically, if $X$ is an $m \times n$ matrix with $m \le n$ and $q \ge \log n$, we have $\|X\|_2 \le \|X\|_{S_q} \le e\|X\|_2$. Thus, by applying Khintchine's inequality, we get
$$\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\tilde{A}_{ij}\, e_i e_j^T \Big\|^q \le \left(e c_1 \sqrt{q}\right)^q \max\left\{ \Big\| \sum_{i,j} \tilde{A}_{ij}^2\, e_i e_i^T \Big\|^{q/2},\; \Big\| \sum_{i,j} \tilde{A}_{ij}^2\, e_j e_j^T \Big\|^{q/2} \right\}.$$
Since $\sum_{i,j} \tilde{A}_{ij}^2\, e_i e_i^T = \sum_i \big(\sum_j \tilde{A}_{ij}^2\big) e_i e_i^T$ is a diagonal matrix, it follows that its spectral norm is $\big\| \sum_{i,j} \tilde{A}_{ij}^2\, e_i e_i^T \big\| = \max_i \sum_j \tilde{A}_{ij}^2$. Similarly, $\big\| \sum_{i,j} \tilde{A}_{ij}^2\, e_j e_j^T \big\| = \max_j \sum_i \tilde{A}_{ij}^2$. Thus,
$$\left(\mathbb{E}_{\tilde{A}}\mathbb{E}_\epsilon \Big\| \sum_{i,j} \epsilon_{ij}\tilde{A}_{ij}\, e_i e_j^T \Big\|^q\right)^{1/q} \le e c_1 \sqrt{q} \left(\mathbb{E}_{\tilde{A}}\max\left\{ \Big(\max_i \sum_j \tilde{A}_{ij}^2\Big)^{q/2},\; \Big(\max_j \sum_i \tilde{A}_{ij}^2\Big)^{q/2} \right\}\right)^{1/q}$$
$$\le e c_1 \sqrt{q} \left( \mathbb{E}\Big(\max_i \sum_j \tilde{A}_{ij}^2\Big)^{q/2} + \mathbb{E}\Big(\max_j \sum_i \tilde{A}_{ij}^2\Big)^{q/2} \right)^{1/q}$$
$$\le e c_1 \sqrt{q} \left( \sqrt{\mathbb{E}\Big(\max_i \sum_j \tilde{A}_{ij}^2\Big)^q} + \sqrt{\mathbb{E}\Big(\max_j \sum_i \tilde{A}_{ij}^2\Big)^q} \right)^{1/q}.$$
The last inequality follows from the fact that $\mathbb{E}X \le \sqrt{\mathbb{E}X^2}$. Combining the above equation with eqn. (9) concludes the proof of the lemma.

Proof of Lemma 7.

Proof. Let $S = \max_i \sum_{j=1}^n \delta_{ij}$. We will first estimate the probability $\mathbb{P}(S \ge t)$ and then apply Lemma 2 in order to bound the expectation $\mathbb{E}S^q$. Recall from the definition of $\delta_{ij}$ that $\mathbb{E}(\delta_{ij} - p_{ij}) = 0$ and let
$$X = \sum_{j=1}^n (\delta_{ij} - p_{ij}).$$
We will apply Bennett's inequality in order to bound $X$. Clearly $|\delta_{ij} - p_{ij}| \le 1$ and
$$\mathrm{Var}(X) = \sum_{j=1}^n \mathrm{Var}(\delta_{ij} - p_{ij}) = \sum_{j=1}^n \mathbb{E}(\delta_{ij} - p_{ij})^2 = \sum_{j=1}^n \left(p_{ij} - p_{ij}^2\right) \le \sum_{j=1}^n p_{ij}.$$
Recalling the definition of $p_{ij}$ and the bounds on the $A_{ij}$'s, we get
$$\mathrm{Var}(X) \le \sum_{j=1}^n \frac{s A_{ij}^2}{\|A\|_F^2} \le n 2^{-(k-1)}.$$
We can now apply Bennett's inequality in order to get
$$\mathbb{P}(X > t) = \mathbb{P}\left(\sum_{j=1}^n \delta_{ij} > \sum_{j=1}^n p_{ij} + t\right) \le e^{-t/2}$$
for any $t \ge 3n2^{-(k-1)}/2$. Thus, with probability at least $1 - e^{-t/2}$,
$$\sum_{j=1}^n \delta_{ij} \le n2^{-(k-1)} + t,$$
since $\sum_{j=1}^n p_{ij} \le n2^{-(k-1)}$. Setting $t = \frac{3}{2}n2^{-(k-1)} + 2\tau$ for any $\tau \ge 0$ we get
$$\mathbb{P}\left(\sum_{j=1}^n \delta_{ij} \ge \frac{5}{2}n2^{-(k-1)} + 2\tau\right) \le e^{-\tau}.$$
Taking the union bound over all $i$ yields
$$\mathbb{P}\left(\max_i \sum_{j=1}^n \delta_{ij} \ge 5n2^{-k} + 2\tau\right) \le n e^{-\tau} = e^{-\tau + \log n}.$$
Applying Lemma 2 with $a = 5n2^{-k}$, $b = 2$, and $h = \log n$, we get
$$\mathbb{E}\left(\max_i \sum_{j=1}^n \delta_{ij}\right)^q \le 2\left(5n2^{-k} + 2\log n + 2q\right)^q \le 2\left(6n2^{-k}\right)^q.$$
The last inequality holds for all $n \ge 25$ and it can be easily proven using the assumption $q \le \log n$ and the fact that $k \le \lfloor \log_2(n/\log^2 n) \rfloor$.
