Consistent Variable Selection of the $l_1$-Regularized 2SLS with High-Dimensional Endogenous Regressors and Instruments

Ying Zhu

August 2013

Under additional assumptions, it is possible for $\hat{\beta}_{H2SLS}$ to achieve perfect variable selection when $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients, as we will see in the following result (Theorem 2.3). For notational simplicity, we assume in Theorem 2.3 that $\rho_Z \lesssim 1$, $\kappa_1 \asymp 1$, $\bar{\kappa}_1 \asymp 1$, $\rho_{X^*} \lesssim 1$, and $\max_{j=1,...,p} |\pi^*_{j,S^c_{\tau_j}}|_1 \lesssim (k_1 \vee 1)\sqrt{\log(d \vee p)/n}$.

Here we summarize the notation to be used. The $l_q$-norm of a vector $v \in \mathbb{R}^m$ is denoted by $|v|_q$, $1 \le q \le \infty$; let $J(v) = \{j \in \{1,...,m\} : v_j \neq 0\}$. The cardinality of a set $J \subseteq \{1,...,m\}$ is denoted by $|J|$. For a matrix $A$, write $|A|_\infty := \max_{i,j} |a_{ij}|$ for the elementwise $l_\infty$-norm of $A$. The $l_2$-operator norm of $A$ is denoted by $\|A\|_2$, and the $l_\infty$ matrix norm of $A$ by $\|A\|_\infty := \max_i \sum_j |a_{ij}|$. For a square matrix $A$, denote its minimum and maximum eigenvalues by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$, respectively. For functions $f(n)$ and $g(n)$, write $f(n) \gtrsim g(n)$ to mean $f(n) \ge c\, g(n)$ for a universal constant $c \in (0, \infty)$ and, similarly, $f(n) \lesssim g(n)$ to mean $f(n) \le c'\, g(n)$ for a universal constant $c' \in (0, \infty)$; write $f(n) \asymp g(n)$ when $f(n) \gtrsim g(n)$ and $f(n) \lesssim g(n)$ hold simultaneously. Denote $\max\{a, b\}$ by $a \vee b$ and $\min\{a, b\}$ by $a \wedge b$.
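The three matrix norms above are easy to conflate. The following minimal numpy sketch (our own toy example, not part of the original text) makes the distinctions concrete:

```python
import numpy as np

A = np.array([[1.0, -3.0],
              [2.0,  0.5]])

# Elementwise l-infinity norm |A|_inf: the largest absolute entry.
elementwise_inf = np.max(np.abs(A))           # 3.0

# l-infinity matrix norm ||A||_inf: the largest absolute row sum.
matrix_inf = np.max(np.abs(A).sum(axis=1))    # |1| + |-3| = 4.0

# l2-operator norm ||A||_2: the largest singular value.
operator_2 = np.linalg.norm(A, 2)

print(elementwise_inf, matrix_inf, operator_2)
```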

Theorem 2.3: Assume $k_2 T_1 \lesssim 1$, $n \gtrsim k_2^3 \log p$, and $\lambda_{\min}\left(E\left[\frac{1}{n} X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right]\right) \gtrsim 1$. Let Assumptions 2.1, 2.2, 2.4, and 2.5 hold. Suppose

$$\left\| E\left[X^{*T}_{J(\beta^*)^c} X^*_{J(\beta^*)}\right] \left[E\left(X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right)\right]^{-1} \right\|_\infty \le 1 - \phi \qquad (1)$$

for some $\phi \in (0, 1]$ (assumed to be bounded away from 0). If the regularization parameters satisfy $\lambda_{n,j} = c_0 \rho_Z^2 \sqrt{\frac{\log(d \vee p)}{n}}$ and $\lambda_n = \frac{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)}{(\bar{c}-2-\varsigma)\phi}\, T_0$ for some constant $\bar{c} > 2$ and a small number $\varsigma > 0$, then, with probability at least $1 - c_1 \exp(-c_2 \log p)$, we have $J(\hat{\beta}_{H2SLS}) \subseteq J(\beta^*)$; moreover, if

$$\min_{j \in J(\beta^*)} |\beta_j^*| > c'' \lambda_n \left[\frac{(\bar{c}-2-\varsigma)\phi}{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)} + 1\right] \frac{\sqrt{k_2}}{\lambda_{\min}\left(E\left[\frac{1}{n} X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right]\right)},$$

then $J(\hat{\beta}_{H2SLS}) \supseteq J(\beta^*)$. As a consequence, $J(\hat{\beta}_{H2SLS}) = J(\beta^*)$.
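As a quick arithmetic illustration of the tuning choice (ours, not in the original text): taking $\bar{c} = 3$, $\varsigma = 1/2$, and $\phi = 1$ gives

```latex
\[
\lambda_n
  = \frac{\bigl(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\bigr)(\bar{c}-1)}
         {(\bar{c}-2-\varsigma)\phi}\, T_0
  = \frac{\bigl(2 - \tfrac{1}{2}\bigr)\cdot 2}{\tfrac{1}{2}}\, T_0
  = 6\, T_0,
\qquad
1 - \frac{\varsigma \phi}{\bar{c}-1} = 1 - \frac{1}{4} = \frac{3}{4},
\]
```

so the dual vector on $J(\beta^*)^c$ in Lemma S.2(i) below is bounded by $3/4$, strictly inside the unit ball, as support recovery requires.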

Remark. Theorem 2.3 is proven under a population “incoherence condition” (1). The “incoherence condition”, initially proposed by Wainwright (2009), is a refined version of the “irrepresentable condition” of Zhao and Yu (2006) and the “neighborhood stability condition” of Meinshausen and Bühlmann (2006). Bühlmann and van de Geer (2011) show that this type of condition is sufficient and “essentially necessary” for the Lasso to achieve consistent variable selection. If each row of $X^* \in \mathbb{R}^{n \times p}$ is sampled independently from $N(0, \Sigma_{X^*})$ with the Toeplitz covariance matrix

$$\Sigma_{X^*} = \begin{pmatrix}
1 & \varrho_{X^*} & \varrho_{X^*}^2 & \cdots & \varrho_{X^*}^{p-1} \\
\varrho_{X^*} & 1 & \varrho_{X^*} & \cdots & \varrho_{X^*}^{p-2} \\
\varrho_{X^*}^2 & \varrho_{X^*} & 1 & \cdots & \varrho_{X^*}^{p-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\varrho_{X^*}^{p-1} & \varrho_{X^*}^{p-2} & \varrho_{X^*}^{p-3} & \cdots & 1
\end{pmatrix},$$

condition (1) is satisfied (see, e.g., Wainwright, 2009). The correlations between explanatory variables of agents of various proximity in a network or community can be naturally captured by this Toeplitz structure. For example, in the empirical example discussed in Section 1, firms that are “closer” might share more similarities in terms of production levels, and the correlation between two firms' production levels decays geometrically in the degree of their “closeness”.
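Condition (1) under this design is also easy to verify numerically. The sketch below (our own illustration; the index choices are arbitrary) computes the population incoherence quantity for the Toeplitz covariance with $\varrho_{X^*} = 0.5$:

```python
import numpy as np
from scipy.linalg import toeplitz

p, k2, rho = 50, 5, 0.5
Sigma = toeplitz(rho ** np.arange(p))   # the Toeplitz covariance of the Remark

K = np.arange(k2)                        # treat the first k2 coordinates as J(beta*)
Kc = np.arange(k2, p)

# Population incoherence quantity of condition (1):
# || Sigma_{K^c K} Sigma_{KK}^{-1} ||_inf  (max absolute row sum).
M = Sigma[np.ix_(Kc, K)] @ np.linalg.inv(Sigma[np.ix_(K, K)])
incoherence = np.max(np.abs(M).sum(axis=1))

print(incoherence)  # strictly below 1 here, so (1) holds with phi = 1 - incoherence
```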

Proof for Theorem 2.3. The proof for Theorem 2.3 is based on a construction called the Primal-Dual Witness (PDW) method developed by Wainwright (2009). The procedure is applied to our context as follows (a numerical sketch is given after the three steps).

1. Set $\hat{\beta}_{J(\beta^*)^c} = 0$.

2. Obtain $(\hat{\beta}_{J(\beta^*)}, \hat{\mu}_{J(\beta^*)})$ by solving the oracle subproblem

$$\hat{\beta}_{J(\beta^*)} \in \arg\min_{\beta_{J(\beta^*)} \in \mathbb{R}^{k_2}} \left\{ \frac{1}{2n} \left|Y - \hat{X}_{J(\beta^*)} \beta_{J(\beta^*)}\right|_2^2 + \lambda_n \left|\beta_{J(\beta^*)}\right|_1 \right\},$$

and choose $\hat{\mu}_{J(\beta^*)} \in \partial|\hat{\beta}_{J(\beta^*)}|_1$, where $\partial|\hat{\beta}_{J(\beta^*)}|_1$ denotes the set of subgradients at $\hat{\beta}_{J(\beta^*)}$ for the function $|\cdot|_1 : \mathbb{R}^{k_2} \to \mathbb{R}$.

3. Solve for $\hat{\mu}_{J(\beta^*)^c}$ via the zero-subgradient equation

$$-\frac{1}{n} \hat{X}^T \left(Y - \hat{X}\hat{\beta}\right) + \lambda_n \hat{\mu} = 0,$$

and check whether or not the strict dual feasibility condition $|\hat{\mu}_{J(\beta^*)^c}|_\infty < 1$ holds.
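For concreteness, here is a minimal numpy sketch of the PDW construction (our own naming and a simple cyclic coordinate-descent solver for the oracle subproblem; the original text prescribes no particular solver):

```python
import numpy as np

def pdw_check(X, Y, support, lam, n_iter=500):
    """Primal-dual witness construction for the Lasso on (X, Y).

    Step 2: solve the oracle subproblem restricted to `support` by cyclic
    coordinate descent. Step 3: recover the dual vector on the complement
    from the zero-subgradient equation and test strict dual feasibility.
    """
    n, p = X.shape
    K = np.asarray(support)
    Kc = np.setdiff1d(np.arange(p), K)

    XK = X[:, K]
    beta_K = np.zeros(len(K))
    col_sq = (XK ** 2).sum(axis=0) / n   # assumes no degenerate (all-zero) columns
    for _ in range(n_iter):
        for j in range(len(K)):
            # Partial residual excluding coordinate j.
            r = Y - XK @ beta_K + XK[:, j] * beta_K[j]
            z = XK[:, j] @ r / n
            # Soft-thresholding update for the l1-penalized least squares.
            beta_K[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]

    # Zero-subgradient equation on K^c: mu_hat = X_{K^c}^T (Y - X beta_hat) / (n lam).
    mu_Kc = X[:, Kc].T @ (Y - XK @ beta_K) / (n * lam)
    return beta_K, mu_Kc, np.max(np.abs(mu_Kc)) < 1.0  # strict dual feasibility

```

If the returned flag is `True`, the construction has succeeded, and Lemma S.1 below (given $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$) guarantees that $(\hat{\beta}_K, 0)$ is the unique Lasso solution.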

For convenience, we let $K := J(\beta^*)$, $K^c := J(\beta^*)^c$, $\Sigma_{K^cK} := E\left[\frac{1}{n} X^{*T}_{K^c} X^*_K\right]$, $\hat{\Sigma}_{K^cK} := \frac{1}{n} X^{*T}_{K^c} X^*_K$, and $\tilde{\Sigma}_{K^cK} := \frac{1}{n} \hat{X}^T_{K^c} \hat{X}_K$. Similarly, let $\Sigma_{KK} := E\left[\frac{1}{n} X^{*T}_K X^*_K\right]$, $\hat{\Sigma}_{KK} := \frac{1}{n} X^{*T}_K X^*_K$, and $\tilde{\Sigma}_{KK} := \frac{1}{n} \hat{X}^T_K \hat{X}_K$.

We first establish Lemma S.1, which shows that $\hat{\beta}_{H2SLS} = (\hat{\beta}_K, 0)$, where $\hat{\beta}_K$ is the solution obtained in step 2 of the PDW construction. We then establish Lemma S.2, which proves the first claim in Theorem 2.3. The second claim follows immediately from the condition $\min_{j \in J(\beta^*)} |\beta_j^*| > B_2$.

Lemma S.1: If the PDW construction succeeds and if $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$, then the vector $(\hat{\beta}_K, 0) \in \mathbb{R}^p$ is the unique optimal solution of the Lasso.

Proof. The proof for Lemma S.1 adapts the proof for Lemma 1 from Chapter 6.4.2 of Wainwright (2015). If the PDW construction succeeds, then $\hat{\beta} = (\hat{\beta}_K, 0)$ is an optimal solution with subgradient $\hat{\mu} \in \mathbb{R}^p$ such that $|\hat{\mu}_{K^c}|_\infty < 1$ and $\langle \hat{\mu}, \hat{\beta} \rangle = |\hat{\beta}|_1$. Suppose $\tilde{\beta}$ is another optimal solution. Letting $F(\beta) = \frac{1}{2n}|Y - \hat{X}\beta|_2^2$, we have $F(\hat{\beta}) + \lambda_n \langle \hat{\mu}, \hat{\beta} \rangle = F(\tilde{\beta}) + \lambda_n |\tilde{\beta}|_1$, and hence

$$F(\hat{\beta}) - \lambda_n \langle \hat{\mu}, \tilde{\beta} - \hat{\beta} \rangle - F(\tilde{\beta}) = \lambda_n \left(|\tilde{\beta}|_1 - \langle \hat{\mu}, \tilde{\beta} \rangle\right).$$

However, by the zero-subgradient optimality conditions (see Footnote 1), $\lambda_n \hat{\mu} = -\nabla F(\hat{\beta})$, so that

$$F(\hat{\beta}) + \langle \nabla F(\hat{\beta}), \tilde{\beta} - \hat{\beta} \rangle - F(\tilde{\beta}) = \lambda_n \left(|\tilde{\beta}|_1 - \langle \hat{\mu}, \tilde{\beta} \rangle\right).$$

Convexity of $F$ ensures that the left-hand side is non-positive, and consequently $|\tilde{\beta}|_1 \le \langle \hat{\mu}, \tilde{\beta} \rangle$. On the other hand, since $\langle \hat{\mu}, \tilde{\beta} \rangle \le |\hat{\mu}|_\infty |\tilde{\beta}|_1 \le |\tilde{\beta}|_1$, we must have $|\tilde{\beta}|_1 = \langle \hat{\mu}, \tilde{\beta} \rangle$. Given $|\hat{\mu}_{K^c}|_\infty < 1$, this equality can only hold if $\tilde{\beta}_j = 0$ for all $j \in K^c$. Therefore, all optimal solutions must have the same support $K$ and can be obtained by solving the oracle subproblem in the PDW procedure. The bound $\lambda_{\min}(\tilde{\Sigma}_{KK}) \ge c\lambda_{\min}(\hat{\Sigma}_{KK}) \ge c(1-c')\lambda_{\min}(\Sigma_{KK})$ for some $c, c' \in (0,1)$ (by (8) and (14)) and the condition $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$ ensure that this subproblem is strictly convex and has a unique minimizer. $\Box$

Footnote 1. For a convex function $g : \mathbb{R}^p \to \mathbb{R}$, $\mu \in \mathbb{R}^p$ is a subgradient at $\beta$, denoted by $\mu \in \partial g(\beta)$, if $g(\beta + \Delta) \ge g(\beta) + \langle \mu, \Delta \rangle$ for all $\Delta \in \mathbb{R}^p$. When $g(\beta) = |\beta|_1$, notice that $\mu \in \partial |\beta|_1$ if and only if $\mu_j = \mathrm{sgn}(\beta_j)$ for all $j = 1, ..., p$, where $\mathrm{sgn}(0)$ is allowed to be any number in $[-1, 1]$.

Lemma S.2: Let the assumptions in Theorem 2.3 hold. Then, with probability at least $1 - c_1\exp(-c_2 \log p)$: (i) $|\hat{\mu}_{K^c}|_\infty \le 1 - \frac{\varsigma\phi}{\bar{c}-1}$ for some constants $\bar{c} > 2$ and $\varsigma > 0$; (ii)

$$\left|\hat{\beta}_{H2SLS,J(\beta^*)} - \beta^*_{J(\beta^*)}\right|_\infty \le c'' \lambda_n \left[\frac{(\bar{c}-2-\varsigma)\phi}{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)} + 1\right] \frac{\sqrt{k_2}}{\lambda_{\min}\left(E\left[\frac{1}{n}X^{*T}_{J(\beta^*)}X^*_{J(\beta^*)}\right]\right)}.$$

Proof. By construction, the subvectors $\hat{\beta}_K$, $\hat{\mu}_K$, and $\hat{\mu}_{K^c}$ satisfy the zero-subgradient condition in the PDW construction. Recall $e = (X^* - \hat{X})\beta^* + \eta\beta^* + \varepsilon$. With the fact that $\hat{\beta}_{K^c} = \beta^*_{K^c} = 0$, we have

$$\frac{1}{n}\hat{X}_K^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) - \frac{1}{n}\hat{X}_K^Te + \lambda_n\hat{\mu}_K = 0,$$

$$\frac{1}{n}\hat{X}_{K^c}^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) - \frac{1}{n}\hat{X}_{K^c}^Te + \lambda_n\hat{\mu}_{K^c} = 0.$$

From the equations above, by solving for the vector $\hat{\mu}_{K^c}\in\mathbb{R}^{p-k_2}$, we obtain

$$\hat{\mu}_{K^c} = -\frac{1}{n\lambda_n}\hat{X}_{K^c}^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) + \frac{\hat{X}_{K^c}^Te}{n\lambda_n},$$

$$\hat{\beta}_K - \beta^*_K = \left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\frac{\hat{X}_K^Te}{n} - \left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\lambda_n\hat{\mu}_K,$$

which yields

$$\hat{\mu}_{K^c} = \tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\hat{\mu}_K + \frac{\hat{X}_{K^c}^Te}{n\lambda_n} - \tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\frac{\hat{X}_K^Te}{n\lambda_n}.$$

By the triangle inequality, we have

$$|\hat{\mu}_{K^c}|_\infty \le \left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty + \left|\frac{\hat{X}_{K^c}^Te}{n\lambda_n}\right|_\infty + \left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\left|\frac{\hat{X}_K^Te}{n\lambda_n}\right|_\infty,$$

where the fact that $|\hat{\mu}_K|_\infty\le 1$ is used in the inequality above. By Lemma S.3, $\left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}$ with probability at least $1-c_1\exp(-c_2\log p)$. Hence,

$$|\hat{\mu}_{K^c}|_\infty \le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1} + \left|\frac{\hat{X}_{K^c}^Te}{n\lambda_n}\right|_\infty + \left(1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}_K^Te}{n\lambda_n}\right|_\infty \le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1} + \left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}^Te}{n\lambda_n}\right|_\infty.$$

Therefore, it suffices to show that $\left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}^Te}{n\lambda_n}\right|_\infty\le\frac{(\bar{c}-2-\varsigma)\phi}{\bar{c}-1}$ with high probability, for a small number $\varsigma>0$. This result holds if $\lambda_n\ge\frac{\left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)}{(\bar{c}-2-\varsigma)\phi}T_0$ (recall that $T_0$ denotes the high-probability bound on $|\hat{X}^Te/n|_\infty$ from the main text). Thus, we have $|\hat{\mu}_{K^c}|_\infty\le 1-\frac{\varsigma\phi}{\bar{c}-1}$ with probability at least $1-c_1\exp(-c_2\log p)$, which proves claim (i). It remains to establish a bound on the $l_\infty$-norm of the error $\hat{\beta}_K-\beta^*_K$. By the triangle inequality, we have

$$\left|\hat{\beta}_K-\beta^*_K\right|_\infty \le \left|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\frac{\hat{X}_K^Te}{n}\right|_\infty + \lambda_n\left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty \le \left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty\left|\frac{\hat{X}_K^Te}{n}\right|_\infty + \lambda_n\left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty.$$

Applying (15) and the bound $\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\le\sqrt{k_2}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2$, and putting everything together with the choice of $\lambda_n$, yields claim (ii). $\Box$

Lemma S.3: Assume $\rho_Z\lesssim 1$, $\kappa_1\asymp 1$, $\bar{\kappa}_1\asymp 1$, $\rho_{X^*}\lesssim 1$, $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$, $\lambda_{\min}(\Sigma_{KK})\gtrsim 1$, and

$$\left\|E\left[X^{*T}_{J(\beta^*)^c}X^*_{J(\beta^*)}\right]\left[E\left(X^{*T}_{J(\beta^*)}X^*_{J(\beta^*)}\right)\right]^{-1}\right\|_\infty\le 1-\phi \qquad (2)$$

for some $\phi\in(0,1]$ (assumed to be bounded away from 0). Suppose $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients. If $\frac{1}{n}k_2^3\log p\lesssim 1$ and $k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\lesssim 1$, then, for some constant $\bar{c}>2$ and universal constants $c_1, c_2>0$,

$$P\left[\left\|\frac{1}{n}\hat{X}^T_{J(\beta^*)^c}\hat{X}_{J(\beta^*)}\left(\frac{1}{n}\hat{X}^T_{J(\beta^*)}\hat{X}_{J(\beta^*)}\right)^{-1}\right\|_\infty\ge 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right]\le c_1\exp(-c_2\log p).$$

Proof. Using a decomposition similar to the one in Ravikumar et al. (2010), we have

$$\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1} - \Sigma_{K^cK}\Sigma_{KK}^{-1} = R_1+R_2+R_3+R_4+R_5+R_6,$$

where

$$R_1 = \Sigma_{K^cK}\left[\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right], \qquad R_2 = \left[\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right]\Sigma_{KK}^{-1}, \qquad R_3 = \left[\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right]\left[\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right],$$

$$R_4 = \hat{\Sigma}_{K^cK}\left[\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right], \qquad R_5 = \left[\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right]\hat{\Sigma}_{KK}^{-1}, \qquad R_6 = \left[\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right]\left[\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right].$$

By (2), we have $\left\|\Sigma_{K^cK}\Sigma_{KK}^{-1}\right\|_\infty\le 1-\phi$. It suffices to show that $\|R_i\|_\infty\le\frac{\phi}{6(\bar{c}-1)}$ for $i=1,...,6$ (a numerical sanity check of the decomposition is given below).
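The six-term decomposition is a purely algebraic identity and can be verified to machine precision; the following sketch (ours, with arbitrary random matrices standing in for the three versions of each block) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
k2, m = 4, 6   # |K| and |K^c| in this toy check

def spd(k):
    # Random symmetric positive definite matrix (invertible by construction).
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

# Population, hat, and tilde versions of the (K^c, K) and (K, K) blocks.
S_KcK, Sh_KcK, St_KcK = (rng.standard_normal((m, k2)) for _ in range(3))
S_KK, Sh_KK, St_KK = spd(k2), spd(k2), spd(k2)
inv = np.linalg.inv

R1 = S_KcK @ (inv(Sh_KK) - inv(S_KK))
R2 = (Sh_KcK - S_KcK) @ inv(S_KK)
R3 = (Sh_KcK - S_KcK) @ (inv(Sh_KK) - inv(S_KK))
R4 = Sh_KcK @ (inv(St_KK) - inv(Sh_KK))
R5 = (St_KcK - Sh_KcK) @ inv(Sh_KK)
R6 = (St_KcK - Sh_KcK) @ (inv(St_KK) - inv(Sh_KK))

lhs = St_KcK @ inv(St_KK) - S_KcK @ inv(S_KK)
print(np.max(np.abs(lhs - (R1 + R2 + R3 + R4 + R5 + R6))))  # ~1e-15
```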

For $R_1$, we have

$$R_1 = -\Sigma_{K^cK}\Sigma_{KK}^{-1}\left[\hat{\Sigma}_{KK}-\Sigma_{KK}\right]\hat{\Sigma}_{KK}^{-1}.$$

Using the sub-multiplicative property $\|AB\|_\infty\le\|A\|_\infty\|B\|_\infty$ and the elementary inequality $\|A\|_\infty\le\sqrt{a}\|A\|_2$ for any symmetric matrix $A\in\mathbb{R}^{a\times a}$, we can bound $R_1$ as follows:

$$\|R_1\|_\infty \le \left\|\Sigma_{K^cK}\Sigma_{KK}^{-1}\right\|_\infty\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\left\|\hat{\Sigma}_{KK}^{-1}\right\|_\infty \le (1-\phi)\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\sqrt{k_2}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2,$$

where the last inequality follows from (2). If $\phi=1$, then $\|R_1\|_\infty=0$ trivially, so we may assume $\phi<1$ in what follows. Using bound (9) from the proof for Lemma S.4, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\le\frac{2}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2n)$. Next, applying bound (4) from Lemma S.4 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)(1-\phi)\sqrt{k_2}}$, we have

$$P\left[\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\ge\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)(1-\phi)\sqrt{k_2}}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log k_2\right).$$

Then, we are guaranteed that

$$P\left[\|R_1\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log k_2\right).$$

For $R_2$, we first write

$$\|R_2\|_\infty \le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty \le \frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty.$$

An application of bound (3) from Lemma S.4 with $\varepsilon=\frac{\phi}{6(\bar{c}-1)}\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}}$ to bound $\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty$ yields

$$P\left[\|R_2\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+\log(p-k_2)+\log k_2\right).$$

For $R_3$, by applying (3) from Lemma S.4 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{6(\bar{c}-1)}$ to bound $\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty$ and (5) from Lemma S.4 to bound $\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty$, we have

$$P\left[\|R_3\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^2}\wedge\frac{1}{k_2}\right)+\log(p-k_2)+\log k_2\right).$$

Putting everything together, we conclude that

$$P\left[\left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty\ge 1-\phi+\frac{\phi}{2(\bar{c}-1)}\right]\le c'\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log p\right).$$

For $R_4$, we have, with probability at least $1-c'\exp\left(-bn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log p\right)$,

$$\|R_4\|_\infty \le \left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_\infty \le \left(1-\phi+\frac{\phi}{2(\bar{c}-1)}\right)\sqrt{k_2}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty,$$

where the last inequality follows from the bound on $\left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty$ established previously. Using (15) from the proof for Lemma S.5 with $c''=4$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\le\frac{4}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$. Next, applying bound (11) from Lemma S.5 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{24(\bar{c}-1)\left(1-\phi+\frac{\phi}{2(\bar{c}-1)}\right)\sqrt{k_2}}$ to bound $\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty$ yields

$$P\left[\|R_4\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2^{3/2}(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log k_2\right)+c_1\exp(-c_2\log(p\vee d)).$$

For $R_5$, using bound (9) from the proof for Lemma S.4, we have

$$\|R_5\|_\infty \le \sqrt{k_2}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty \le \frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty.$$

An application of bound (10) from Lemma S.5 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)\sqrt{k_2}}$ yields

$$P\left[\|R_5\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2^{3/2}(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log p\right)+c_1\exp(-c_2\log(p\vee d)).$$



For $R_6$, by applying (10) to bound $\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty$ with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{8}\cdot\frac{\phi}{6(\bar{c}-1)}$ and applying (12) to bound $\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty$, we are guaranteed that

$$P\left[\|R_6\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log p\right)+c_1\exp(-c_2\log(p\vee d)).$$

Under the condition $\frac{1}{n}k_2^3\log p\lesssim 1$, putting the bounds on $R_1$–$R_6$ together, we conclude that

$$P\left[\left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\ge 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right]\le c_1\exp(-c_2\log p). \qquad \Box$$

Lemma S.4: Suppose Assumptions 2.1 and 2.2(iii) hold. For any $\varepsilon>0$ and some constant $c>0$, we have

$$P\left[\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 2(p-k_2)k_2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right), \qquad (3)$$

$$P\left[\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\ge\varepsilon\right]\le 2k_2^2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right). \qquad (4)$$

Furthermore, if $n\ge c'k_2\log p$ for some sufficiently large constant $c'>0$, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty\le\frac{1}{\lambda_{\min}(\Sigma_{KK})} \qquad (5)$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}\rho_{X^*}^2}\right)\right)$.
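To get a feel for the deviation bounds (3)-(4), one can simulate a design and track the $l_\infty$ matrix-norm deviation as $n$ grows. The sketch below is ours and simply takes a Gaussian design as a convenient stand-in for the paper's assumptions:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
k2, rho = 8, 0.5
Sigma_KK = toeplitz(rho ** np.arange(k2))   # population covariance of X*_K

for n in [100, 1000, 10000]:
    devs = []
    for _ in range(200):
        X = rng.multivariate_normal(np.zeros(k2), Sigma_KK, size=n)
        Sigma_hat = X.T @ X / n
        # Max-row-sum (l-infinity matrix norm) deviation, as controlled by (4).
        devs.append(np.max(np.abs(Sigma_hat - Sigma_KK).sum(axis=1)))
    print(n, np.mean(devs))   # shrinks as n grows, consistent with bound (4)
```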

Proof. Denote the element $(j', j)$ of the matrix difference $\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}$ by $u_{j'j}$. By the definition of the $l_\infty$ matrix norm, we have

$$P\left[\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty\ge\varepsilon\right] = P\left[\max_{j'\in K^c}\sum_{j\in K}|u_{j'j}|\ge\varepsilon\right] \le (p-k_2)P\left[\sum_{j\in K}|u_{j'j}|\ge\varepsilon\right] \le (p-k_2)P\left[\exists j\in K:|u_{j'j}|\ge\frac{\varepsilon}{k_2}\right]$$
$$\le (p-k_2)k_2P\left[|u_{j'j}|\ge\frac{\varepsilon}{k_2}\right] \le (p-k_2)k_2\cdot 2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right),$$

where the last inequality follows from Lemma B.1. Bound (4) can be obtained in a similar way, except that the pre-factor $(p-k_2)$ is replaced by $k_2$. To prove the last bound (5), write

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty = \left\|\Sigma_{KK}^{-1}\left[\Sigma_{KK}-\hat{\Sigma}_{KK}\right]\hat{\Sigma}_{KK}^{-1}\right\|_\infty \le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\left[\Sigma_{KK}-\hat{\Sigma}_{KK}\right]\hat{\Sigma}_{KK}^{-1}\right\|_2$$
$$\le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\right\|_2\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2 \le \frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2. \qquad (6)$$

To bound $\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2$ in (6), applying Lemma B.1 with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}$ yields

$$\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2\le\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{4k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}\rho_{X^*}^2}\right)\right)$. To bound $\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2$ in (6), note that we can write

$$\lambda_{\min}(\Sigma_{KK}) = \min_{\|h'\|_2=1}h'^T\Sigma_{KK}h' = \min_{\|h'\|_2=1}\left[h'^T\hat{\Sigma}_{KK}h'+h'^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h'\right] \le h^T\hat{\Sigma}_{KK}h+h^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h,$$

where $h\in\mathbb{R}^{k_2}$ is a unit-norm minimal eigenvector of $\hat{\Sigma}_{KK}$. Applying Lemma B.1 yields

$$h^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h\le c\lambda_{\min}(\Sigma_{KK}) \qquad (7)$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\rho_{X^*}^2}\right)\right)$, where $c\in(0,1)$. Therefore, $\lambda_{\min}(\Sigma_{KK})\le\lambda_{\min}(\hat{\Sigma}_{KK})+c\lambda_{\min}(\Sigma_{KK})$, which implies that

$$\lambda_{\min}(\hat{\Sigma}_{KK})\ge(1-c)\lambda_{\min}(\Sigma_{KK}), \qquad (8)$$

and consequently,

$$\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\le\frac{c'}{\lambda_{\min}(\Sigma_{KK})}, \qquad (9)$$

where $c'>1$. Putting everything together, by setting $c'=2$, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty\le\frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{2}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}=\frac{1}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}\rho_{X^*}^2}\right)\right)$. $\Box$
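The eigenvalue stability in (8)-(9) is also easy to see empirically; the following sketch (ours, again using a Gaussian design purely for illustration) compares $\lambda_{\min}(\hat{\Sigma}_{KK})$ to $\lambda_{\min}(\Sigma_{KK})$:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)
n, k2, rho = 2000, 10, 0.5
Sigma_KK = toeplitz(rho ** np.arange(k2))

X = rng.multivariate_normal(np.zeros(k2), Sigma_KK, size=n)
Sigma_hat = X.T @ X / n

lam_true = np.linalg.eigvalsh(Sigma_KK)[0]   # eigvalsh returns ascending order
lam_hat = np.linalg.eigvalsh(Sigma_hat)[0]
print(lam_hat / lam_true)   # close to 1 when n >> k2, consistent with (8)
```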

Lemma S.5: Suppose Assumption 2.4 holds and $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients. Assume that $\rho_Z\lesssim 1$, $\kappa_1\asymp 1$, $\bar{\kappa}_1\asymp 1$, $\rho_{X^*}\lesssim 1$, and $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$. For any $\varepsilon>0$, under the conditions $k_2T_1\lesssim 1$ and $\frac{1}{n}k_2^2\log p\lesssim 1$, we have

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 6(p-k_2)k_2\exp\left(\frac{-cn^{3/2}\varepsilon}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)), \qquad (10)$$

$$P\left[\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty\ge\varepsilon\right]\le 6k_2^2\exp\left(\frac{-cn^{3/2}\varepsilon}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)). \qquad (11)$$

Furthermore, with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{8}{\lambda_{\min}(\Sigma_{KK})}. \qquad (12)$$

Proof. Denote the element $(j',j)$ of the matrix difference $\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}$ by $w_{j'j}$. Using the same argument as in Lemma S.4, we have

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le(p-k_2)k_2P\left[|w_{j'j}|\ge\frac{\varepsilon}{k_2}\right].$$

Following the derivation of the upper bounds on $\left|\frac{(\hat{X}-X^*)^TX^*}{n}\right|_\infty$ and $\left|\frac{(\hat{X}-X^*)^T(\hat{X}-X^*)}{n}\right|_\infty$ in the proof for Lemma A.2 and the identity

$$\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}=\frac{1}{n}X^{*T}_{K^c}\left(\hat{X}_K-X^*_K\right)+\frac{1}{n}\left(\hat{X}_{K^c}-X^*_{K^c}\right)^TX^*_K+\frac{1}{n}\left(\hat{X}_{K^c}-X^*_{K^c}\right)^T\left(\hat{X}_K-X^*_K\right),$$

note that to upper bound $|w_{j'j}|$, we need to upper bound $\left|\frac{1}{n}X^{*T}_{j'}\left(\hat{X}_j-X^*_j\right)\right|$. From the proof for Lemma A.2 and the condition $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$, we have

$$\left|\frac{1}{n}X^{*T}_{j'}\left(\hat{X}_j-X^*_j\right)\right|\le\sqrt{\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}}\sqrt{\frac{1}{n}\sum_{i=1}^n\left[Z_i\left(\hat{\pi}_j-\pi^*_j\right)\right]^2}\le c\sqrt{\max_{j'}\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}}\sqrt{\frac{\bar{\kappa}_1(k_1\vee 1)}{\kappa_1}\frac{\rho_Z^2\log(d\vee p)}{n}}$$

for all $j', j=1,...,p$, with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, and

$$P\left[\max_{j'}\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}\ge\rho_{X^*}^2+t\right]\le 6(p-k_2)\exp\left(-cn\left(\frac{t^2}{\rho_{X^*}^4\rho_Z^4}\wedge\frac{t}{\rho_{X^*}^2\rho_Z^2}\right)\right).$$

Under the condition $k_2T_1\lesssim 1$, setting $t=\frac{\varepsilon}{ck_2}\sqrt{\frac{\kappa_1}{\bar{\kappa}_1(k_1\vee 1)}\frac{n}{\rho_Z^2\log(p\vee d)}}$ for any $\varepsilon>0$ yields

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 6(p-k_2)k_2\exp\left(\frac{-cn^{3/2}\varepsilon}{\rho_{X^*}\rho_Zk_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)).$$

Bound (11) can be obtained in a similar way, except that the pre-factor $(p-k_2)$ is replaced by $k_2$. To prove bound (12), by applying the same argument as in Lemma S.4, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{\sqrt{k_2}}{\lambda_{\min}(\hat{\Sigma}_{KK})}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2\le\frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2,$$

where the last inequality comes from bound (8). To bound $\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2$, applying bound (11) with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{k_2}$ yields

$$\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2\le\sqrt{k_2}\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_\infty\le\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}} \qquad (13)$$

with probability at least

$$1-6k_2^2\exp\left(\frac{-cn^{3/2}}{k_2^2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)-c_1\exp(-c_2\log(p\vee d)),$$

which is at least $1-c_1'\exp(-c_2'\log(p\vee d))$ for some universal constants $c_1', c_2'>0$ if $T_1\lesssim 1$ and $\frac{1}{n}k_2^2\log p\lesssim 1$. To bound $\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2$, again we have

$$\lambda_{\min}(\hat{\Sigma}_{KK})\le h^T\tilde{\Sigma}_{KK}h+h^T\left(\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right)h\le h^T\tilde{\Sigma}_{KK}h+k_2\left|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right|_\infty\le h^T\tilde{\Sigma}_{KK}h+c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, where $h\in\mathbb{R}^{k_2}$ is a unit-norm minimal eigenvector of $\tilde{\Sigma}_{KK}$. The last inequality follows from the bounds on $\left|\frac{(\hat{X}-X^*)^TX^*}{n}\right|_\infty$ and $\left|\frac{(\hat{X}-X^*)^T(\hat{X}-X^*)}{n}\right|_\infty$ from the proof for Lemma A.2. Therefore, if

$$c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\le(1-c)^2\lambda_{\min}(\Sigma_{KK})$$

for some $c\in(0,1)$, which implies that

$$c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\le(1-c)\lambda_{\min}(\hat{\Sigma}_{KK})$$

by (8), then we have

$$\lambda_{\min}(\tilde{\Sigma}_{KK})\ge c\lambda_{\min}(\hat{\Sigma}_{KK}), \qquad (14)$$

and consequently,

$$\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\le\frac{c'}{\lambda_{\min}(\hat{\Sigma}_{KK})}\le\frac{c''}{\lambda_{\min}(\Sigma_{KK})}, \qquad (15)$$

for $c''>c'>1$, where the last inequality follows from bound (8) from the proof for Lemma S.4. Putting everything together, by setting $c''=4$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{4}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}}=\frac{8}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$. $\Box$

References

Bühlmann, P. and S. A. van de Geer (2011). Statistics for High-Dimensional Data. Springer, New York.

Meinshausen, N. and P. Bühlmann (2006). "High-Dimensional Graphs and Variable Selection with the Lasso." The Annals of Statistics, 34, 1436-1462.

Ravikumar, P., M. Wainwright, and J. Lafferty (2010). "High-Dimensional Ising Model Selection Using $l_1$-Regularized Logistic Regression." The Annals of Statistics, 38, 1287-1319.

Wainwright, M. (2009). "Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $l_1$-Constrained Quadratic Programming (Lasso)." IEEE Transactions on Information Theory, 55, 2183-2202.

Wainwright, M. (2015). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation. University of California, Berkeley.

Zhao, P. and B. Yu (2006). "On Model Selection Consistency of Lasso." Journal of Machine Learning Research, 7, 2541-2567.
