Consistent Variable Selection of the $l_1$-Regularized 2SLS with High-Dimensional Endogenous Regressors and Instruments

Ying Zhu

August 2013

Under additional assumptions, it is possible for $\hat{\beta}_{H2SLS}$ to achieve perfect variable selection when $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients, as we will see in the following result (Theorem 2.3). For notational simplicity, we assume in Theorem 2.3 that $\rho_Z \lesssim 1$, $\kappa_1 \asymp 1$, $\bar{\kappa}_1 \asymp 1$, $\rho_{X^*} \lesssim 1$, and $\max_{j=1,...,p} |\pi^*_{j,S^c_{\tau_j}}|_1 \lesssim (k_1 \vee 1)\sqrt{\log(d \vee p)/n}$.

Here we summarize the notation to be used. The $l_q$-norm of a vector $v \in \mathbb{R}^m$ is denoted by $|v|_q$, $1 \le q \le \infty$; let $J(v) = \{j \in \{1,...,m\} : v_j \neq 0\}$. The cardinality of a set $J \subseteq \{1,...,m\}$ is denoted by $|J|$. For a matrix $A$, write $|A|_\infty := \max_{i,j} |a_{ij}|$ for the elementwise $l_\infty$-norm of $A$. The $l_2$-operator norm of $A$ is denoted by $\|A\|_2$, and the $l_\infty$ matrix norm of $A$ by $\|A\|_\infty := \max_i \sum_j |a_{ij}|$. For a square matrix $A$, denote its minimum and maximum eigenvalues by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$, respectively. For functions $f(n)$ and $g(n)$, write $f(n) \gtrsim g(n)$ to mean $f(n) \ge c\, g(n)$ for a universal constant $c \in (0, \infty)$ and, similarly, $f(n) \lesssim g(n)$ to mean $f(n) \le c'\, g(n)$ for a universal constant $c' \in (0, \infty)$; write $f(n) \asymp g(n)$ when $f(n) \gtrsim g(n)$ and $f(n) \lesssim g(n)$ hold simultaneously. Denote $\max\{a, b\}$ by $a \vee b$ and $\min\{a, b\}$ by $a \wedge b$.
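The three matrix norms above are easy to conflate. The following minimal numpy sketch (our own toy example, not part of the original text) makes the distinctions concrete:

```python
import numpy as np

A = np.array([[1.0, -3.0],
              [2.0,  0.5]])

# Elementwise l-infinity norm |A|_inf: the largest absolute entry.
elementwise_inf = np.max(np.abs(A))           # 3.0

# l-infinity matrix norm ||A||_inf: the largest absolute row sum.
matrix_inf = np.max(np.abs(A).sum(axis=1))    # |1| + |-3| = 4.0

# l2-operator norm ||A||_2: the largest singular value.
operator_2 = np.linalg.norm(A, 2)

print(elementwise_inf, matrix_inf, operator_2)
```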

Theorem 2.3: Assume $k_2 T_1 \lesssim 1$, $n \gtrsim k_2^3 \log p$, and $\lambda_{\min}\left(E\left[\frac{1}{n} X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right]\right) \gtrsim 1$. Let Assumptions 2.1, 2.2, 2.4, and 2.5 hold. Suppose

$$\left\| E\left[X^{*T}_{J(\beta^*)^c} X^*_{J(\beta^*)}\right] \left[E\left(X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right)\right]^{-1} \right\|_\infty \le 1 - \phi \qquad (1)$$

for some $\phi \in (0, 1]$ (assumed to be bounded away from 0). If the regularization parameters satisfy $\lambda_{n,j} = c_0 \rho_Z^2 \sqrt{\frac{\log(d \vee p)}{n}}$ and $\lambda_n = \frac{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)}{(\bar{c}-2-\varsigma)\phi}\, T_0$ for some constant $\bar{c} > 2$ and a small number $\varsigma > 0$, then, with probability at least $1 - c_1 \exp(-c_2 \log p)$, we have $J(\hat{\beta}_{H2SLS}) \subseteq J(\beta^*)$; moreover, if

$$\min_{j \in J(\beta^*)} |\beta_j^*| > c'' \lambda_n \left[\frac{(\bar{c}-2-\varsigma)\phi}{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)} + 1\right] \frac{\sqrt{k_2}}{\lambda_{\min}\left(E\left[\frac{1}{n} X^{*T}_{J(\beta^*)} X^*_{J(\beta^*)}\right]\right)},$$

then $J(\hat{\beta}_{H2SLS}) \supseteq J(\beta^*)$. As a consequence, $J(\hat{\beta}_{H2SLS}) = J(\beta^*)$.
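As a quick arithmetic illustration of the tuning choice (ours, not in the original text): taking $\bar{c} = 3$, $\varsigma = 1/2$, and $\phi = 1$ gives

```latex
\[
\lambda_n
  = \frac{\bigl(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\bigr)(\bar{c}-1)}
         {(\bar{c}-2-\varsigma)\phi}\, T_0
  = \frac{\bigl(2 - \tfrac{1}{2}\bigr)\cdot 2}{\tfrac{1}{2}}\, T_0
  = 6\, T_0,
\qquad
1 - \frac{\varsigma \phi}{\bar{c}-1} = 1 - \frac{1}{4} = \frac{3}{4},
\]
```

so the dual vector on $J(\beta^*)^c$ in Lemma S.2(i) below is bounded by $3/4$, strictly inside the unit ball, as support recovery requires.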

Remark. Theorem 2.3 is proven under a population “incoherence condition” (1). The “incoherence condition”, initially proposed by Wainwright (2009), is a refined version of the “irrepresentable condition” of Zhao and Yu (2006) and the “neighborhood stability condition” of Meinshausen and Bühlmann (2006). Bühlmann and van de Geer (2011) show that this type of condition is sufficient and “essentially necessary” for the Lasso to achieve consistent variable selection. If each row of $X^* \in \mathbb{R}^{n \times p}$ is sampled independently from $N(0, \Sigma_{X^*})$ with the Toeplitz covariance matrix

$$\Sigma_{X^*} = \begin{pmatrix}
1 & \varrho_{X^*} & \varrho_{X^*}^2 & \cdots & \varrho_{X^*}^{p-1} \\
\varrho_{X^*} & 1 & \varrho_{X^*} & \cdots & \varrho_{X^*}^{p-2} \\
\varrho_{X^*}^2 & \varrho_{X^*} & 1 & \cdots & \varrho_{X^*}^{p-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\varrho_{X^*}^{p-1} & \varrho_{X^*}^{p-2} & \varrho_{X^*}^{p-3} & \cdots & 1
\end{pmatrix},$$

condition (1) is satisfied (see, e.g., Wainwright, 2009). The correlations between explanatory variables of agents of various proximity in a network or community can be naturally captured by this Toeplitz structure. For example, in the empirical example discussed in Section 1, firms that are “closer” might share more similarities in terms of production levels, and the correlation between two firms' production levels decays geometrically in the degree of their “closeness”.
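Condition (1) under this design is also easy to verify numerically. The sketch below (our own illustration; the index choices are arbitrary) computes the population incoherence quantity for the Toeplitz covariance with $\varrho_{X^*} = 0.5$:

```python
import numpy as np
from scipy.linalg import toeplitz

p, k2, rho = 50, 5, 0.5
Sigma = toeplitz(rho ** np.arange(p))   # the Toeplitz covariance of the Remark

K = np.arange(k2)                        # treat the first k2 coordinates as J(beta*)
Kc = np.arange(k2, p)

# Population incoherence quantity of condition (1):
# || Sigma_{K^c K} Sigma_{KK}^{-1} ||_inf  (max absolute row sum).
M = Sigma[np.ix_(Kc, K)] @ np.linalg.inv(Sigma[np.ix_(K, K)])
incoherence = np.max(np.abs(M).sum(axis=1))

print(incoherence)  # strictly below 1 here, so (1) holds with phi = 1 - incoherence
```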

Proof for Theorem 2.3. The proof for Theorem 2.3 is based on a construction called the Primal-Dual Witness (PDW) method developed by Wainwright (2009). The procedure is applied to our context as follows (a numerical sketch is given after the three steps).

1. Set $\hat{\beta}_{J(\beta^*)^c} = 0$.

2. Obtain $(\hat{\beta}_{J(\beta^*)}, \hat{\mu}_{J(\beta^*)})$ by solving the oracle subproblem

$$\hat{\beta}_{J(\beta^*)} \in \arg\min_{\beta_{J(\beta^*)} \in \mathbb{R}^{k_2}} \left\{ \frac{1}{2n} \left|Y - \hat{X}_{J(\beta^*)} \beta_{J(\beta^*)}\right|_2^2 + \lambda_n \left|\beta_{J(\beta^*)}\right|_1 \right\},$$

and choose $\hat{\mu}_{J(\beta^*)} \in \partial|\hat{\beta}_{J(\beta^*)}|_1$, where $\partial|\hat{\beta}_{J(\beta^*)}|_1$ denotes the set of subgradients at $\hat{\beta}_{J(\beta^*)}$ for the function $|\cdot|_1 : \mathbb{R}^{k_2} \to \mathbb{R}$.

3. Solve for $\hat{\mu}_{J(\beta^*)^c}$ via the zero-subgradient equation

$$-\frac{1}{n} \hat{X}^T \left(Y - \hat{X}\hat{\beta}\right) + \lambda_n \hat{\mu} = 0,$$

and check whether or not the strict dual feasibility condition $|\hat{\mu}_{J(\beta^*)^c}|_\infty < 1$ holds.
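For concreteness, here is a minimal numpy sketch of the PDW construction (our own naming and a simple cyclic coordinate-descent solver for the oracle subproblem; the original text prescribes no particular solver):

```python
import numpy as np

def pdw_check(X, Y, support, lam, n_iter=500):
    """Primal-dual witness construction for the Lasso on (X, Y).

    Step 2: solve the oracle subproblem restricted to `support` by cyclic
    coordinate descent. Step 3: recover the dual vector on the complement
    from the zero-subgradient equation and test strict dual feasibility.
    """
    n, p = X.shape
    K = np.asarray(support)
    Kc = np.setdiff1d(np.arange(p), K)

    XK = X[:, K]
    beta_K = np.zeros(len(K))
    col_sq = (XK ** 2).sum(axis=0) / n   # assumes no degenerate (all-zero) columns
    for _ in range(n_iter):
        for j in range(len(K)):
            # Partial residual excluding coordinate j.
            r = Y - XK @ beta_K + XK[:, j] * beta_K[j]
            z = XK[:, j] @ r / n
            # Soft-thresholding update for the l1-penalized least squares.
            beta_K[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]

    # Zero-subgradient equation on K^c: mu_hat = X_{K^c}^T (Y - X beta_hat) / (n lam).
    mu_Kc = X[:, Kc].T @ (Y - XK @ beta_K) / (n * lam)
    return beta_K, mu_Kc, np.max(np.abs(mu_Kc)) < 1.0  # strict dual feasibility

```

If the returned flag is `True`, the construction has succeeded, and Lemma S.1 below (given $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$) guarantees that $(\hat{\beta}_K, 0)$ is the unique Lasso solution.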

For convenience, we let $K := J(\beta^*)$, $K^c := J(\beta^*)^c$, $\Sigma_{K^cK} := E\left[\frac{1}{n} X^{*T}_{K^c} X^*_K\right]$, $\hat{\Sigma}_{K^cK} := \frac{1}{n} X^{*T}_{K^c} X^*_K$, and $\tilde{\Sigma}_{K^cK} := \frac{1}{n} \hat{X}^T_{K^c} \hat{X}_K$. Similarly, let $\Sigma_{KK} := E\left[\frac{1}{n} X^{*T}_K X^*_K\right]$, $\hat{\Sigma}_{KK} := \frac{1}{n} X^{*T}_K X^*_K$, and $\tilde{\Sigma}_{KK} := \frac{1}{n} \hat{X}^T_K \hat{X}_K$.

We first establish Lemma S.1, which shows that $\hat{\beta}_{H2SLS} = (\hat{\beta}_K, 0)$, where $\hat{\beta}_K$ is the solution obtained in step 2 of the PDW construction. We then establish Lemma S.2, which proves the first claim in Theorem 2.3. The second claim follows immediately from the condition $\min_{j \in J(\beta^*)} |\beta_j^*| > B_2$.

Lemma S.1: If the PDW construction succeeds and if $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$, then the vector $(\hat{\beta}_K, 0) \in \mathbb{R}^p$ is the unique optimal solution of the Lasso.

Proof. The proof for Lemma S.1 adapts the proof for Lemma 1 from Chapter 6.4.2 of Wainwright (2015). If the PDW construction succeeds, then $\hat{\beta} = (\hat{\beta}_K, 0)$ is an optimal solution with subgradient $\hat{\mu} \in \mathbb{R}^p$ such that $|\hat{\mu}_{K^c}|_\infty < 1$ and $\langle \hat{\mu}, \hat{\beta} \rangle = |\hat{\beta}|_1$. Suppose $\tilde{\beta}$ is another optimal solution. Letting $F(\beta) = \frac{1}{2n}|Y - \hat{X}\beta|_2^2$, we have $F(\hat{\beta}) + \lambda_n \langle \hat{\mu}, \hat{\beta} \rangle = F(\tilde{\beta}) + \lambda_n |\tilde{\beta}|_1$, and hence

$$F(\hat{\beta}) - \lambda_n \langle \hat{\mu}, \tilde{\beta} - \hat{\beta} \rangle - F(\tilde{\beta}) = \lambda_n \left(|\tilde{\beta}|_1 - \langle \hat{\mu}, \tilde{\beta} \rangle\right).$$

However, by the zero-subgradient optimality conditions (see Footnote 1), $\lambda_n \hat{\mu} = -\nabla F(\hat{\beta})$, so that

$$F(\hat{\beta}) + \langle \nabla F(\hat{\beta}), \tilde{\beta} - \hat{\beta} \rangle - F(\tilde{\beta}) = \lambda_n \left(|\tilde{\beta}|_1 - \langle \hat{\mu}, \tilde{\beta} \rangle\right).$$

Convexity of $F$ ensures that the left-hand side is non-positive, and consequently $|\tilde{\beta}|_1 \le \langle \hat{\mu}, \tilde{\beta} \rangle$. On the other hand, since $\langle \hat{\mu}, \tilde{\beta} \rangle \le |\hat{\mu}|_\infty |\tilde{\beta}|_1 \le |\tilde{\beta}|_1$, we must have $|\tilde{\beta}|_1 = \langle \hat{\mu}, \tilde{\beta} \rangle$. Given $|\hat{\mu}_{K^c}|_\infty < 1$, this equality can only hold if $\tilde{\beta}_j = 0$ for all $j \in K^c$. Therefore, all optimal solutions must have the same support $K$ and can be obtained by solving the oracle subproblem in the PDW procedure. The bound $\lambda_{\min}(\tilde{\Sigma}_{KK}) \ge c\lambda_{\min}(\hat{\Sigma}_{KK}) \ge c(1-c')\lambda_{\min}(\Sigma_{KK})$ for some $c, c' \in (0,1)$ (by (8) and (14)) and the condition $\lambda_{\min}(\Sigma_{KK}) \ge C_{\min} > 0$ ensure that this subproblem is strictly convex and has a unique minimizer. $\Box$

Footnote 1. For a convex function $g : \mathbb{R}^p \to \mathbb{R}$, $\mu \in \mathbb{R}^p$ is a subgradient at $\beta$, denoted by $\mu \in \partial g(\beta)$, if $g(\beta + \Delta) \ge g(\beta) + \langle \mu, \Delta \rangle$ for all $\Delta \in \mathbb{R}^p$. When $g(\beta) = |\beta|_1$, notice that $\mu \in \partial |\beta|_1$ if and only if $\mu_j = \mathrm{sgn}(\beta_j)$ for all $j = 1, ..., p$, where $\mathrm{sgn}(0)$ is allowed to be any number in $[-1, 1]$.

Lemma S.2: Let the assumptions in Theorem 2.3 hold. Then, with probability at least $1 - c_1\exp(-c_2 \log p)$: (i) $|\hat{\mu}_{K^c}|_\infty \le 1 - \frac{\varsigma\phi}{\bar{c}-1}$ for some constants $\bar{c} > 2$ and $\varsigma > 0$; (ii)

$$\left|\hat{\beta}_{H2SLS,J(\beta^*)} - \beta^*_{J(\beta^*)}\right|_\infty \le c'' \lambda_n \left[\frac{(\bar{c}-2-\varsigma)\phi}{\left(2 - \frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)} + 1\right] \frac{\sqrt{k_2}}{\lambda_{\min}\left(E\left[\frac{1}{n}X^{*T}_{J(\beta^*)}X^*_{J(\beta^*)}\right]\right)}.$$

Proof. By construction, the subvectors $\hat{\beta}_K$, $\hat{\mu}_K$, and $\hat{\mu}_{K^c}$ satisfy the zero-subgradient condition in the PDW construction. Recall $e = (X^* - \hat{X})\beta^* + \eta\beta^* + \varepsilon$. With the fact that $\hat{\beta}_{K^c} = \beta^*_{K^c} = 0$, we have

$$\frac{1}{n}\hat{X}_K^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) - \frac{1}{n}\hat{X}_K^Te + \lambda_n\hat{\mu}_K = 0,$$

$$\frac{1}{n}\hat{X}_{K^c}^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) - \frac{1}{n}\hat{X}_{K^c}^Te + \lambda_n\hat{\mu}_{K^c} = 0.$$

From the equations above, by solving for the vector $\hat{\mu}_{K^c}\in\mathbb{R}^{p-k_2}$, we obtain

$$\hat{\mu}_{K^c} = -\frac{1}{n\lambda_n}\hat{X}_{K^c}^T\hat{X}_K\left(\hat{\beta}_K - \beta^*_K\right) + \frac{\hat{X}_{K^c}^Te}{n\lambda_n},$$

$$\hat{\beta}_K - \beta^*_K = \left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\frac{\hat{X}_K^Te}{n} - \left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\lambda_n\hat{\mu}_K,$$

which yields

$$\hat{\mu}_{K^c} = \tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\hat{\mu}_K + \frac{\hat{X}_{K^c}^Te}{n\lambda_n} - \tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\frac{\hat{X}_K^Te}{n\lambda_n}.$$

By the triangle inequality, we have

$$|\hat{\mu}_{K^c}|_\infty \le \left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty + \left|\frac{\hat{X}_{K^c}^Te}{n\lambda_n}\right|_\infty + \left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\left|\frac{\hat{X}_K^Te}{n\lambda_n}\right|_\infty,$$

where the fact that $|\hat{\mu}_K|_\infty\le 1$ is used in the inequality above. By Lemma S.3, $\left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}$ with probability at least $1-c_1\exp(-c_2\log p)$. Hence,

$$|\hat{\mu}_{K^c}|_\infty \le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1} + \left|\frac{\hat{X}_{K^c}^Te}{n\lambda_n}\right|_\infty + \left(1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}_K^Te}{n\lambda_n}\right|_\infty \le 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1} + \left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}^Te}{n\lambda_n}\right|_\infty.$$

Therefore, it suffices to show that $\left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)\left|\frac{\hat{X}^Te}{n\lambda_n}\right|_\infty\le\frac{(\bar{c}-2-\varsigma)\phi}{\bar{c}-1}$ with high probability, for a small number $\varsigma>0$. This result holds if $\lambda_n\ge\frac{\left(2-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right)(\bar{c}-1)}{(\bar{c}-2-\varsigma)\phi}T_0$ (recall that $T_0$ denotes the high-probability bound on $|\hat{X}^Te/n|_\infty$ from the main text). Thus, we have $|\hat{\mu}_{K^c}|_\infty\le 1-\frac{\varsigma\phi}{\bar{c}-1}$ with probability at least $1-c_1\exp(-c_2\log p)$, which proves claim (i). It remains to establish a bound on the $l_\infty$-norm of the error $\hat{\beta}_K-\beta^*_K$. By the triangle inequality, we have

$$\left|\hat{\beta}_K-\beta^*_K\right|_\infty \le \left|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\frac{\hat{X}_K^Te}{n}\right|_\infty + \lambda_n\left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty \le \left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty\left|\frac{\hat{X}_K^Te}{n}\right|_\infty + \lambda_n\left\|\left(\frac{\hat{X}_K^T\hat{X}_K}{n}\right)^{-1}\right\|_\infty.$$

Applying (15) and the bound $\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\le\sqrt{k_2}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2$, and putting everything together with the choice of $\lambda_n$, yields claim (ii). $\Box$

Lemma S.3: Assume $\rho_Z\lesssim 1$, $\kappa_1\asymp 1$, $\bar{\kappa}_1\asymp 1$, $\rho_{X^*}\lesssim 1$, $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$, $\lambda_{\min}(\Sigma_{KK})\gtrsim 1$, and

$$\left\|E\left[X^{*T}_{J(\beta^*)^c}X^*_{J(\beta^*)}\right]\left[E\left(X^{*T}_{J(\beta^*)}X^*_{J(\beta^*)}\right)\right]^{-1}\right\|_\infty\le 1-\phi \qquad (2)$$

for some $\phi\in(0,1]$ (assumed to be bounded away from 0). Suppose $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients. If $\frac{1}{n}k_2^3\log p\lesssim 1$ and $k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\lesssim 1$, then, for some constant $\bar{c}>2$ and universal constants $c_1, c_2>0$,

$$P\left[\left\|\frac{1}{n}\hat{X}^T_{J(\beta^*)^c}\hat{X}_{J(\beta^*)}\left(\frac{1}{n}\hat{X}^T_{J(\beta^*)}\hat{X}_{J(\beta^*)}\right)^{-1}\right\|_\infty\ge 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right]\le c_1\exp(-c_2\log p).$$

Proof. Using a decomposition similar to the one in Ravikumar et al. (2010), we have

$$\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1} - \Sigma_{K^cK}\Sigma_{KK}^{-1} = R_1+R_2+R_3+R_4+R_5+R_6,$$

where

$$R_1 = \Sigma_{K^cK}\left[\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right], \qquad R_2 = \left[\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right]\Sigma_{KK}^{-1}, \qquad R_3 = \left[\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right]\left[\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right],$$

$$R_4 = \hat{\Sigma}_{K^cK}\left[\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right], \qquad R_5 = \left[\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right]\hat{\Sigma}_{KK}^{-1}, \qquad R_6 = \left[\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right]\left[\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right].$$

By (2), we have $\left\|\Sigma_{K^cK}\Sigma_{KK}^{-1}\right\|_\infty\le 1-\phi$. It suffices to show that $\|R_i\|_\infty\le\frac{\phi}{6(\bar{c}-1)}$ for $i=1,...,6$ (a numerical sanity check of the decomposition is given below).
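The six-term decomposition is a purely algebraic identity and can be verified to machine precision; the following sketch (ours, with arbitrary random matrices standing in for the three versions of each block) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
k2, m = 4, 6   # |K| and |K^c| in this toy check

def spd(k):
    # Random symmetric positive definite matrix (invertible by construction).
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

# Population, hat, and tilde versions of the (K^c, K) and (K, K) blocks.
S_KcK, Sh_KcK, St_KcK = (rng.standard_normal((m, k2)) for _ in range(3))
S_KK, Sh_KK, St_KK = spd(k2), spd(k2), spd(k2)
inv = np.linalg.inv

R1 = S_KcK @ (inv(Sh_KK) - inv(S_KK))
R2 = (Sh_KcK - S_KcK) @ inv(S_KK)
R3 = (Sh_KcK - S_KcK) @ (inv(Sh_KK) - inv(S_KK))
R4 = Sh_KcK @ (inv(St_KK) - inv(Sh_KK))
R5 = (St_KcK - Sh_KcK) @ inv(Sh_KK)
R6 = (St_KcK - Sh_KcK) @ (inv(St_KK) - inv(Sh_KK))

lhs = St_KcK @ inv(St_KK) - S_KcK @ inv(S_KK)
print(np.max(np.abs(lhs - (R1 + R2 + R3 + R4 + R5 + R6))))  # ~1e-15
```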

For $R_1$, we have

$$R_1 = -\Sigma_{K^cK}\Sigma_{KK}^{-1}\left[\hat{\Sigma}_{KK}-\Sigma_{KK}\right]\hat{\Sigma}_{KK}^{-1}.$$

Using the sub-multiplicative property $\|AB\|_\infty\le\|A\|_\infty\|B\|_\infty$ and the elementary inequality $\|A\|_\infty\le\sqrt{a}\|A\|_2$ for any symmetric matrix $A\in\mathbb{R}^{a\times a}$, we can bound $R_1$ as follows:

$$\|R_1\|_\infty \le \left\|\Sigma_{K^cK}\Sigma_{KK}^{-1}\right\|_\infty\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\left\|\hat{\Sigma}_{KK}^{-1}\right\|_\infty \le (1-\phi)\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\sqrt{k_2}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2,$$

where the last inequality follows from (2). If $\phi=1$, then $\|R_1\|_\infty=0$ trivially, so we may assume $\phi<1$ in what follows. Using bound (9) from the proof for Lemma S.4, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\le\frac{2}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2n)$. Next, applying bound (4) from Lemma S.4 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)(1-\phi)\sqrt{k_2}}$, we have

$$P\left[\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\ge\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)(1-\phi)\sqrt{k_2}}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log k_2\right).$$

Then, we are guaranteed that

$$P\left[\|R_1\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log k_2\right).$$

For $R_2$, we first write

$$\|R_2\|_\infty \le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty \le \frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty.$$

An application of bound (3) from Lemma S.4 with $\varepsilon=\frac{\phi}{6(\bar{c}-1)}\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}}$ to bound $\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty$ yields

$$P\left[\|R_2\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+\log(p-k_2)+\log k_2\right).$$

For $R_3$, by applying (3) from Lemma S.4 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{6(\bar{c}-1)}$ to bound $\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty$ and (5) from Lemma S.4 to bound $\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty$, we have

$$P\left[\|R_3\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 2\exp\left(-cn\left(\frac{1}{k_2^2}\wedge\frac{1}{k_2}\right)+\log(p-k_2)+\log k_2\right).$$

Putting everything together, we conclude that

$$P\left[\left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty\ge 1-\phi+\frac{\phi}{2(\bar{c}-1)}\right]\le c'\exp\left(-cn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log p\right).$$

For $R_4$, we have, with probability at least $1-c'\exp\left(-bn\left(\frac{1}{k_2^3}\wedge\frac{1}{k_2^{3/2}}\right)+2\log p\right)$,

$$\|R_4\|_\infty \le \left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_\infty \le \left(1-\phi+\frac{\phi}{2(\bar{c}-1)}\right)\sqrt{k_2}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty,$$

where the last inequality follows from the bound on $\left\|\hat{\Sigma}_{K^cK}\hat{\Sigma}_{KK}^{-1}\right\|_\infty$ established previously. Using (15) from the proof for Lemma S.5 with $c''=4$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\le\frac{4}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$. Next, applying bound (11) from Lemma S.5 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{24(\bar{c}-1)\left(1-\phi+\frac{\phi}{2(\bar{c}-1)}\right)\sqrt{k_2}}$ to bound $\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty$ yields

$$P\left[\|R_4\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2^{3/2}(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log k_2\right)+c_1\exp(-c_2\log(p\vee d)).$$

For $R_5$, using bound (9) from the proof for Lemma S.4, we have

$$\|R_5\|_\infty \le \sqrt{k_2}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty \le \frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty.$$

An application of bound (10) from Lemma S.5 with $\varepsilon=\frac{\phi\lambda_{\min}(\Sigma_{KK})}{12(\bar{c}-1)\sqrt{k_2}}$ yields

$$P\left[\|R_5\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2^{3/2}(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log p\right)+c_1\exp(-c_2\log(p\vee d)).$$



For $R_6$, by applying (10) to bound $\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty$ with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{8}\cdot\frac{\phi}{6(\bar{c}-1)}$ and applying (12) to bound $\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty$, we are guaranteed that

$$P\left[\|R_6\|_\infty\ge\frac{\phi}{6(\bar{c}-1)}\right]\le 6\exp\left(\frac{-cn^{3/2}}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}+2\log p\right)+c_1\exp(-c_2\log(p\vee d)).$$

Under the condition $\frac{1}{n}k_2^3\log p\lesssim 1$, putting the bounds on $R_1$–$R_6$ together, we conclude that

$$P\left[\left\|\tilde{\Sigma}_{K^cK}\tilde{\Sigma}_{KK}^{-1}\right\|_\infty\ge 1-\frac{(\bar{c}-2)\phi}{\bar{c}-1}\right]\le c_1\exp(-c_2\log p). \qquad \Box$$

Lemma S.4: Suppose Assumptions 2.1 and 2.2(iii) hold. For any $\varepsilon>0$ and some constant $c>0$, we have

$$P\left[\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 2(p-k_2)k_2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right), \qquad (3)$$

$$P\left[\left\|\hat{\Sigma}_{KK}-\Sigma_{KK}\right\|_\infty\ge\varepsilon\right]\le 2k_2^2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right). \qquad (4)$$

Furthermore, if $n\ge c'k_2\log p$ for some sufficiently large constant $c'>0$, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty\le\frac{1}{\lambda_{\min}(\Sigma_{KK})} \qquad (5)$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}\rho_{X^*}^2}\right)\right)$.
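To get a feel for the deviation bounds (3)-(4), one can simulate a design and track the $l_\infty$ matrix-norm deviation as $n$ grows. The sketch below is ours and simply takes a Gaussian design as a convenient stand-in for the paper's assumptions:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
k2, rho = 8, 0.5
Sigma_KK = toeplitz(rho ** np.arange(k2))   # population covariance of X*_K

for n in [100, 1000, 10000]:
    devs = []
    for _ in range(200):
        X = rng.multivariate_normal(np.zeros(k2), Sigma_KK, size=n)
        Sigma_hat = X.T @ X / n
        # Max-row-sum (l-infinity matrix norm) deviation, as controlled by (4).
        devs.append(np.max(np.abs(Sigma_hat - Sigma_KK).sum(axis=1)))
    print(n, np.mean(devs))   # shrinks as n grows, consistent with bound (4)
```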

Proof. Denote the element $(j', j)$ of the matrix difference $\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}$ by $u_{j'j}$. By the definition of the $l_\infty$ matrix norm, we have

$$P\left[\left\|\hat{\Sigma}_{K^cK}-\Sigma_{K^cK}\right\|_\infty\ge\varepsilon\right] = P\left[\max_{j'\in K^c}\sum_{j\in K}|u_{j'j}|\ge\varepsilon\right] \le (p-k_2)P\left[\sum_{j\in K}|u_{j'j}|\ge\varepsilon\right] \le (p-k_2)P\left[\exists j\in K:|u_{j'j}|\ge\frac{\varepsilon}{k_2}\right]$$
$$\le (p-k_2)k_2P\left[|u_{j'j}|\ge\frac{\varepsilon}{k_2}\right] \le (p-k_2)k_2\cdot 2\exp\left(-cn\left(\frac{\varepsilon^2}{k_2^2\rho_{X^*}^4}\wedge\frac{\varepsilon}{k_2\rho_{X^*}^2}\right)\right),$$

where the last inequality follows from Lemma B.1. Bound (4) can be obtained in a similar way, except that the pre-factor $(p-k_2)$ is replaced by $k_2$. To prove the last bound (5), write

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty = \left\|\Sigma_{KK}^{-1}\left[\Sigma_{KK}-\hat{\Sigma}_{KK}\right]\hat{\Sigma}_{KK}^{-1}\right\|_\infty \le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\left[\Sigma_{KK}-\hat{\Sigma}_{KK}\right]\hat{\Sigma}_{KK}^{-1}\right\|_2$$
$$\le \sqrt{k_2}\left\|\Sigma_{KK}^{-1}\right\|_2\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2 \le \frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2. \qquad (6)$$

To bound $\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2$ in (6), applying Lemma B.1 with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}$ yields

$$\left\|\Sigma_{KK}-\hat{\Sigma}_{KK}\right\|_2\le\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{4k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}\rho_{X^*}^2}\right)\right)$. To bound $\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2$ in (6), note that we can write

$$\lambda_{\min}(\Sigma_{KK}) = \min_{\|h'\|_2=1}h'^T\Sigma_{KK}h' = \min_{\|h'\|_2=1}\left[h'^T\hat{\Sigma}_{KK}h'+h'^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h'\right] \le h^T\hat{\Sigma}_{KK}h+h^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h,$$

where $h\in\mathbb{R}^{k_2}$ is a unit-norm minimal eigenvector of $\hat{\Sigma}_{KK}$. Applying Lemma B.1 yields

$$h^T\left(\Sigma_{KK}-\hat{\Sigma}_{KK}\right)h\le c\lambda_{\min}(\Sigma_{KK}) \qquad (7)$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\rho_{X^*}^2}\right)\right)$, where $c\in(0,1)$. Therefore, $\lambda_{\min}(\Sigma_{KK})\le\lambda_{\min}(\hat{\Sigma}_{KK})+c\lambda_{\min}(\Sigma_{KK})$, which implies that

$$\lambda_{\min}(\hat{\Sigma}_{KK})\ge(1-c)\lambda_{\min}(\Sigma_{KK}), \qquad (8)$$

and consequently,

$$\left\|\hat{\Sigma}_{KK}^{-1}\right\|_2\le\frac{c'}{\lambda_{\min}(\Sigma_{KK})}, \qquad (9)$$

where $c'>1$. Putting everything together, by setting $c'=2$, we have

$$\left\|\hat{\Sigma}_{KK}^{-1}-\Sigma_{KK}^{-1}\right\|_\infty\le\frac{\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{2}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{\lambda_{\min}(\Sigma_{KK})}{2\sqrt{k_2}}=\frac{1}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp\left(-c_2n\left(\frac{\lambda_{\min}^2(\Sigma_{KK})}{k_2\rho_{X^*}^4}\wedge\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}\rho_{X^*}^2}\right)\right)$. $\Box$
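The eigenvalue stability in (8)-(9) is also easy to see empirically; the following sketch (ours, again using a Gaussian design purely for illustration) compares $\lambda_{\min}(\hat{\Sigma}_{KK})$ to $\lambda_{\min}(\Sigma_{KK})$:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)
n, k2, rho = 2000, 10, 0.5
Sigma_KK = toeplitz(rho ** np.arange(k2))

X = rng.multivariate_normal(np.zeros(k2), Sigma_KK, size=n)
Sigma_hat = X.T @ X / n

lam_true = np.linalg.eigvalsh(Sigma_KK)[0]   # eigvalsh returns ascending order
lam_hat = np.linalg.eigvalsh(Sigma_hat)[0]
print(lam_hat / lam_true)   # close to 1 when n >> k2, consistent with (8)
```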

Lemma S.5: Suppose Assumption 2.4 holds and $\beta^*$ is exactly sparse with at most $k_2$ non-zero coefficients. Assume that $\rho_Z\lesssim 1$, $\kappa_1\asymp 1$, $\bar{\kappa}_1\asymp 1$, $\rho_{X^*}\lesssim 1$, and $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$. For any $\varepsilon>0$, under the conditions $k_2T_1\lesssim 1$ and $\frac{1}{n}k_2^2\log p\lesssim 1$, we have

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 6(p-k_2)k_2\exp\left(\frac{-cn^{3/2}\varepsilon}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)), \qquad (10)$$

$$P\left[\left\|\tilde{\Sigma}_{KK}-\hat{\Sigma}_{KK}\right\|_\infty\ge\varepsilon\right]\le 6k_2^2\exp\left(\frac{-cn^{3/2}\varepsilon}{k_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)). \qquad (11)$$

Furthermore, with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{8}{\lambda_{\min}(\Sigma_{KK})}. \qquad (12)$$

Proof. Denote the element $(j',j)$ of the matrix difference $\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}$ by $w_{j'j}$. Using the same argument as in Lemma S.4, we have

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le(p-k_2)k_2P\left[|w_{j'j}|\ge\frac{\varepsilon}{k_2}\right].$$

Following the derivation of the upper bounds on $\left|\frac{(\hat{X}-X^*)^TX^*}{n}\right|_\infty$ and $\left|\frac{(\hat{X}-X^*)^T(\hat{X}-X^*)}{n}\right|_\infty$ in the proof for Lemma A.2 and the identity

$$\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}=\frac{1}{n}X^{*T}_{K^c}\left(\hat{X}_K-X^*_K\right)+\frac{1}{n}\left(\hat{X}_{K^c}-X^*_{K^c}\right)^TX^*_K+\frac{1}{n}\left(\hat{X}_{K^c}-X^*_{K^c}\right)^T\left(\hat{X}_K-X^*_K\right),$$

note that to upper bound $|w_{j'j}|$, we need to upper bound $\left|\frac{1}{n}X^{*T}_{j'}\left(\hat{X}_j-X^*_j\right)\right|$. From the proof for Lemma A.2 and the condition $\max_{j=1,...,p}\left|\pi^*_{j,S^c_{\tau_j}}\right|_1\lesssim(k_1\vee 1)\sqrt{\frac{\log(d\vee p)}{n}}$, we have

$$\left|\frac{1}{n}X^{*T}_{j'}\left(\hat{X}_j-X^*_j\right)\right|\le\sqrt{\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}}\sqrt{\frac{1}{n}\sum_{i=1}^n\left[Z_i\left(\hat{\pi}_j-\pi^*_j\right)\right]^2}\le c\sqrt{\max_{j'}\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}}\sqrt{\frac{\bar{\kappa}_1(k_1\vee 1)}{\kappa_1}\frac{\rho_Z^2\log(d\vee p)}{n}}$$

for all $j', j=1,...,p$, with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, and

$$P\left[\max_{j'}\frac{1}{n}\sum_{i=1}^nX^{*2}_{ij'}\ge\rho_{X^*}^2+t\right]\le 6(p-k_2)\exp\left(-cn\left(\frac{t^2}{\rho_{X^*}^4\rho_Z^4}\wedge\frac{t}{\rho_{X^*}^2\rho_Z^2}\right)\right).$$

Under the condition $k_2T_1\lesssim 1$, setting $t=\frac{\varepsilon}{ck_2}\sqrt{\frac{\kappa_1}{\bar{\kappa}_1(k_1\vee 1)}\frac{n}{\rho_Z^2\log(p\vee d)}}$ for any $\varepsilon>0$ yields

$$P\left[\left\|\tilde{\Sigma}_{K^cK}-\hat{\Sigma}_{K^cK}\right\|_\infty\ge\varepsilon\right]\le 6(p-k_2)k_2\exp\left(\frac{-cn^{3/2}\varepsilon}{\rho_{X^*}\rho_Zk_2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)+c_1\exp(-c_2\log(p\vee d)).$$

Bound (11) can be obtained in a similar way, except that the pre-factor $(p-k_2)$ is replaced by $k_2$. To prove bound (12), by applying the same argument as in Lemma S.4, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{\sqrt{k_2}}{\lambda_{\min}(\hat{\Sigma}_{KK})}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2\le\frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2,$$

where the last inequality comes from bound (8). To bound $\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2$, applying bound (11) with $\varepsilon=\frac{\lambda_{\min}(\Sigma_{KK})}{k_2}$ yields

$$\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_2\le\sqrt{k_2}\left\|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right\|_\infty\le\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}} \qquad (13)$$

with probability at least

$$1-6k_2^2\exp\left(\frac{-cn^{3/2}}{k_2^2(k_1\vee 1)^{1/2}(\log(p\vee d))^{1/2}}\right)-c_1\exp(-c_2\log(p\vee d)),$$

which is at least $1-c_1'\exp(-c_2'\log(p\vee d))$ for some universal constants $c_1', c_2'>0$ if $T_1\lesssim 1$ and $\frac{1}{n}k_2^2\log p\lesssim 1$. To bound $\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2$, again we have

$$\lambda_{\min}(\hat{\Sigma}_{KK})\le h^T\tilde{\Sigma}_{KK}h+h^T\left(\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right)h\le h^T\tilde{\Sigma}_{KK}h+k_2\left|\hat{\Sigma}_{KK}-\tilde{\Sigma}_{KK}\right|_\infty\le h^T\tilde{\Sigma}_{KK}h+c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$, where $h\in\mathbb{R}^{k_2}$ is a unit-norm minimal eigenvector of $\tilde{\Sigma}_{KK}$. The last inequality follows from the bounds on $\left|\frac{(\hat{X}-X^*)^TX^*}{n}\right|_\infty$ and $\left|\frac{(\hat{X}-X^*)^T(\hat{X}-X^*)}{n}\right|_\infty$ from the proof for Lemma A.2. Therefore, if

$$c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\le(1-c)^2\lambda_{\min}(\Sigma_{KK})$$

for some $c\in(0,1)$, which implies that

$$c'k_2\sqrt{\frac{(k_1\vee 1)\log(d\vee p)}{n}}\le(1-c)\lambda_{\min}(\hat{\Sigma}_{KK})$$

by (8), then we have

$$\lambda_{\min}(\tilde{\Sigma}_{KK})\ge c\lambda_{\min}(\hat{\Sigma}_{KK}), \qquad (14)$$

and consequently,

$$\left\|\tilde{\Sigma}_{KK}^{-1}\right\|_2\le\frac{c'}{\lambda_{\min}(\hat{\Sigma}_{KK})}\le\frac{c''}{\lambda_{\min}(\Sigma_{KK})}, \qquad (15)$$

for $c''>c'>1$, where the last inequality follows from bound (8) from the proof for Lemma S.4. Putting everything together, by setting $c''=4$, we have

$$\left\|\tilde{\Sigma}_{KK}^{-1}-\hat{\Sigma}_{KK}^{-1}\right\|_\infty\le\frac{2\sqrt{k_2}}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{4}{\lambda_{\min}(\Sigma_{KK})}\cdot\frac{\lambda_{\min}(\Sigma_{KK})}{\sqrt{k_2}}=\frac{8}{\lambda_{\min}(\Sigma_{KK})}$$

with probability at least $1-c_1\exp(-c_2\log(p\vee d))$. $\Box$

References

Bühlmann, P. and S. A. van de Geer (2011). Statistics for High-Dimensional Data. Springer, New York.

Meinshausen, N. and P. Bühlmann (2006). "High-Dimensional Graphs and Variable Selection with the Lasso." The Annals of Statistics, 34, 1436-1462.

Ravikumar, P., M. Wainwright, and J. Lafferty (2010). "High-Dimensional Ising Model Selection Using $l_1$-Regularized Logistic Regression." The Annals of Statistics, 38, 1287-1319.

Wainwright, M. (2009). "Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $l_1$-Constrained Quadratic Programming (Lasso)." IEEE Transactions on Information Theory, 55, 2183-2202.

Wainwright, M. (2015). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation. University of California, Berkeley.

Zhao, P. and B. Yu (2006). "On Model Selection Consistency of Lasso." Journal of Machine Learning Research, 7, 2541-2567.
