Appendix to "A Sparse Structured Shrinkage Estimator for Nonparametric Varying-Coefficient Model with an Application in Genomics," published in the Journal of Computational and Graphical Statistics

Z. John Daye, Jichun Xie, and Hongzhe Li∗

University of Pennsylvania, School of Medicine

January 11, 2011

A Appendix - Proofs

In this section, we provide proofs for the model selection consistency of the SSS estimator and the estimation bounds presented in Section 3, extending results of Bach (2008) for the group lasso. We further use the notation $\Sigma_{YY} = E(YY^T) - E(Y)E(Y)^T$ and $\Sigma_{U^*Y} = E(U^*Y^T) - E(U^*)E(Y)^T$.

A.1 Proof of Theorem 3.1

Consider the optimization problem (5) in Section 2.2,
$$\hat\gamma^* = \operatorname*{argmin}_{\gamma^*}\; \frac{1}{2n}\|y - U^*\gamma^*\|^2 + \lambda_1\sum_{g=1}^p \|\gamma_g^*\| + \frac{\lambda_2}{2}(\gamma^*)^T\Omega^*\gamma^*.$$
By the Karush-Kuhn-Tucker conditions, $\gamma^*$ is an optimal solution for (5) if and only if
$$\frac{1}{n}(U_g^*)^T(U^*\gamma^* - y) + \lambda_2\Omega_g^*\gamma^* = -\frac{\lambda_1}{\|\gamma_g^*\|}\gamma_g^*, \qquad \forall\, \gamma_g^* \neq 0, \tag{17}$$
$$\left\|\frac{1}{n}(U_g^*)^T(U^*\gamma^* - y) + \lambda_2\Omega_g^*\gamma^*\right\| \leq \lambda_1, \qquad \forall\, \gamma_g^* = 0. \tag{18}$$

This can be verified easily, as in Proposition 1 of Bach (2008) and Yuan and Lin (2006).
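To make the optimality characterization concrete, the following sketch checks (17) and (18) numerically for a candidate solution. It is illustrative only; the paper supplies no code, and the function and variable names are ours. Here `groups` maps each group index g to the column indices of $U^*$ belonging to that group.

```python
import numpy as np

def check_kkt(U, y, Omega, gamma, lam1, lam2, groups, tol=1e-6):
    """Verify the KKT conditions (17)-(18) of criterion (5) at `gamma`.

    U is the n x q design U*, Omega the q x q matrix Omega*, and `groups`
    a dict mapping group g to the array of its column indices I_g.
    """
    n = U.shape[0]
    # Common gradient piece: (1/n) U*^T (U* gamma - y) + lam2 Omega* gamma
    grad = U.T @ (U @ gamma - y) / n + lam2 * Omega @ gamma
    for g, idx in groups.items():
        norm_g = np.linalg.norm(gamma[idx])
        if norm_g > tol:
            # Active group: equality condition (17)
            if np.linalg.norm(grad[idx] + lam1 * gamma[idx] / norm_g) > tol:
                return False
        else:
            # Inactive group: subgradient condition (18)
            if np.linalg.norm(grad[idx]) > lam1 + tol:
                return False
    return True
```

Any block-coordinate or proximal-gradient minimizer of (5) should return a $\hat\gamma^*$ that passes this check up to the solver's tolerance.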

∗ Z. John Daye is a Postdoctoral Researcher, Jichun Xie is a Ph.D. student, and Hongzhe Li is Professor, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021 (e-mail: [email protected]).


Define
$$\tilde\gamma_{(1)} = \operatorname*{argmin}_{\gamma^*_{(1)}}\; \frac{1}{2n}\|y - U^*_{(1)}\gamma^*_{(1)}\|^2 + \lambda_1\sum_{\{g:\,\beta^0_g \neq 0\}}\|\gamma^*_g\| + \frac{\lambda_2}{2}(\gamma^*_{(1)})^T\Omega^*_{11}\gamma^*_{(1)}, \tag{19}$$
and let $\tilde\gamma = (\tilde\gamma^T_{(1)}, 0^T)^T$. As $\lambda_1 \to 0$ and $\lambda_2 \to 0$, the objective function in (19) converges to $\Sigma_{YY} - 2\Sigma_{YU^*_{(1)}}\gamma^*_{(1)} + \gamma^{*T}_{(1)}\Sigma_{U^*_{(1)}U^*_{(1)}}\gamma^*_{(1)}$, whose unique minimum is $\gamma^{0*}_{(1)}$ by regularity condition A2. Thus, $\tilde\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ in probability.

Using regularity condition A1 and letting $\epsilon = y - (\gamma^{0*})^TU^*$, we have
$$\frac{1}{n}(U^*)^Ty = \frac{1}{n}(U^*)^TU^*\gamma^{0*} + \frac{1}{n}(U^*)^T\epsilon = \Sigma_{U^*U^*_{(1)}}\gamma^{0*}_{(1)} + O_p(n^{-1/2}).$$
This gives
$$\frac{1}{n}(U^*)^T(U^*_{(1)}\tilde\gamma_{(1)} - y) = \Sigma_{U^*U^*_{(1)}}(\tilde\gamma_{(1)} - \gamma^{0*}_{(1)}) + O_p(n^{-1/2}), \tag{20}$$
which by (17) gives
$$\begin{aligned}
\tilde\gamma_{(1)} - \gamma^{0*}_{(1)} &= -\lambda_1\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\mathrm{Diag}(1/\|\tilde\gamma_i\|)\tilde\gamma_{(1)} - \lambda_2\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\Omega^*_{11}\tilde\gamma_{(1)} + O_p(n^{-1/2})\\
&= -\lambda_1\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\Big[\mathrm{Diag}(1/\|\tilde\gamma_i\|) + \frac{\lambda_2}{\lambda_1}\Omega^*_{11}\Big]\tilde\gamma_{(1)} + O_p(n^{-1/2}). \tag{21}
\end{aligned}$$
From equations (20) and (21), we have
$$\begin{aligned}
\frac{1}{n}(U^*_{(2)})^T(U^*\tilde\gamma_{(1)} - y) + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} &= \Sigma_{U^*_{(2)}U^*_{(1)}}(\tilde\gamma_{(1)} - \gamma^{0*}_{(1)}) + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} + O_p(n^{-1/2})\\
&= -\lambda_1\Sigma_{U^*_{(2)}U^*_{(1)}}\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\Big[\mathrm{Diag}(1/\|\tilde\gamma_i\|) + \frac{\lambda_2}{\lambda_1}\Omega^*_{11}\Big]\tilde\gamma_{(1)} + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} + O_p(n^{-1/2}).
\end{aligned}$$
Dividing the above by $\lambda_1$, we see that $(U^*_g)^T(U^*\tilde\gamma_{(1)} - y)/(n\lambda_1) + (\lambda_2/\lambda_1)\Omega^*_{g1}\tilde\gamma_{(1)}$ converges in probability to
$$-\Sigma_{U^*_gU^*_{(1)}}\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\Big[\mathrm{Diag}(1/\|\gamma^{0*}_i\|) + \alpha\Omega^*_{11}\Big]\gamma^{0*}_{(1)} + \alpha\Omega^*_{g1}\gamma^{0*}_{(1)} \tag{22}$$
for $g \in \{g : \beta^0_g = 0\}$, as $\lambda_1\sqrt{n} \to \infty$ and $\lambda_2/\lambda_1 \to \alpha$. We note that, by the definitions $U^*_g = U_gV_g^{-1}$ and $\Omega^* = \mathrm{diag}(V_1^{-1},\ldots,V_p^{-1})^T\,\Omega\,\mathrm{diag}(V_1^{-1},\ldots,V_p^{-1})$, (22) equals
$$V_g^{-1}\Big\{-\Sigma_{U_gU_{(1)}}\Sigma^{-1}_{U_{(1)}U_{(1)}}\big[\mathrm{Diag}(1/\|\gamma^0_g\|_{R_{gg}}) + \alpha\Omega_{11}\big]\gamma^0_{(1)} + \alpha\Omega_{g1}\gamma^0_{(1)}\Big\}.$$
Thus, by condition (12), $\|(U^*_g)^T(U^*\tilde\gamma_{(1)} - y)/(n\lambda_1) + (\lambda_2/\lambda_1)\Omega^*_{g1}\tilde\gamma_{(1)}\| \leq 1$ holds with probability tending to one for all $g \in \{g : \beta^0_g = 0\}$. This verifies (18); in addition, (17) is satisfied for $\tilde\gamma_{(1)}$ by definition. Equations (17) and (18) in turn imply that $\tilde\gamma$ is an optimal solution for (5), and this completes the proof of Theorem 3.1.
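As an aside, the population limit (22) can be evaluated directly to inspect the sufficient condition numerically. The sketch below works in the starred parameterization and uses illustrative names of our own; it returns the norm of (22) for every inactive group, each of which condition (12) requires to be at most 1.

```python
import numpy as np

def limit_gradient_norms(Sigma, Omega, gamma0, groups, active, alpha):
    """Norm of the population limit (22) for each inactive group g.

    Sigma and Omega are the population versions of (1/n)(U*)^T U* and
    Omega*; gamma0 is gamma^{0*}; `active` lists the groups with
    beta_g^0 != 0; alpha is the limit of lam2/lam1.
    """
    idx1 = np.concatenate([groups[g] for g in active])
    S11_inv = np.linalg.inv(Sigma[np.ix_(idx1, idx1)])
    # Diag(1/||gamma_i^{0*}||): one weight per active group, repeated
    # over that group's coordinates
    w = np.concatenate([np.full(len(groups[g]),
                                1.0 / np.linalg.norm(gamma0[groups[g]]))
                        for g in active])
    core = (np.diag(w) + alpha * Omega[np.ix_(idx1, idx1)]) @ gamma0[idx1]
    return {g: np.linalg.norm(-Sigma[np.ix_(idx, idx1)] @ S11_inv @ core
                              + alpha * Omega[np.ix_(idx, idx1)] @ gamma0[idx1])
            for g, idx in groups.items() if g not in active}
```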

A.2 Proof of Theorem 3.2

Assume that condition (13) does not hold; that is,
$$\|v_g\| > 1 \tag{23}$$
for some $g \in \{g : \beta^0_g = 0\}$, where $v_g = V_g^{-1}\big(\Sigma_{U_gU_{(1)}}\Sigma^{-1}_{U_{(1)}U_{(1)}}[D_{(1)} + \alpha\Omega_{11}]\gamma^0_{(1)} - \alpha\Omega_{21}\gamma^0_{(1)}\big)$.

From (17), we have
$$\hat\gamma_{(1)} = \Big(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\Big)^{-1}\Big(\frac{1}{n}(U^*_{(1)})^Ty - \lambda_2\Omega_1\hat\gamma - \lambda_1\mathrm{Diag}(\|\hat\gamma_i\|^{-1})\hat\gamma_{(1)}\Big),$$
which gives
$$\begin{aligned}
&\frac{1}{n}(U^*_g)^T(y - U^*_{(1)}\hat\gamma_{(1)}) - \lambda_2\Omega^*_{11}\hat\gamma_{(1)}\\
&\quad= \Big[\frac{1}{n}(U^*_g)^Ty - \frac{1}{n}(U^*_g)^TU^*_{(1)}\Big(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\Big)^{-1}\frac{1}{n}(U^*_{(1)})^Ty\Big]\\
&\qquad+ \lambda_1\Big[\frac{1}{n}(U^*_g)^TU^*_{(1)}\Big(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\Big)^{-1}\Big(\mathrm{Diag}(\|\hat\gamma_i\|^{-1})\hat\gamma_{(1)} + \frac{\lambda_2}{\lambda_1}\Omega_1\hat\gamma\Big)\Big] - \lambda_2\Omega^*_{11}\hat\gamma_{(1)} \tag{24}\\
&\quad= A_n + B_n.
\end{aligned}$$
Now, since $P(\{g : \hat\beta_g \neq 0\} = \{g : \beta^0_g \neq 0\}) \to 1$ by the assumption of model selection consistency, and $\tilde\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ in probability for $\lambda_1 \to 0$ and $\lambda_2 \to 0$ by the argument following equation (19), we see that $\hat\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ and $B_n/\lambda_1$ converges to $v_g$. Furthermore, by assumption (23), $P\big((v_g/\|v_g\|)^T(B_n/\lambda_1) \geq (\|v_g\| + 1)/2\big) \to 1$. By the arguments in the proof of Theorem 3 of Bach (2008) and regularity condition A3, $P(\sqrt{n}\,v_g^TA_n > 0)$ converges to a constant $a \in (0, 1)$. Thus, $P\big((v_g/\|v_g\|)^T(A_n + B_n)/\lambda_1 \geq (\|v_g\| + 1)/2\big) \geq a$ asymptotically, which in turn implies that (18) fails for $\hat\gamma$ with probability at least $a$ asymptotically, since $\|A_n + B_n\|/\lambda_1 \geq (v_g/\|v_g\|)^T(A_n + B_n)/\lambda_1$ and $(\|v_g\| + 1)/2 > 1$. Hence, $\hat\gamma$ is not an optimal solution for (5), and Theorem 3.2 follows by contradiction.

A.3 Proof of Theorem 3.3

Since the solution of equation (4) is simply a reparameterization of the solution of equation (5), we first establish the estimation bounds for $\hat\gamma^*$,
$$\hat\gamma^* = \operatorname*{argmin}_{\gamma^*}\; \frac{1}{2n}\|y - U^*\gamma^*\|^2 + \lambda_1\sum_{g=1}^p\|\gamma^*_g\| + \frac{\lambda_2}{2}(\gamma^*)^T\Omega^*\gamma^*,$$
which can then be converted to bounds for $\hat\gamma$. Our proof extends that of Hebiri and van de Geer (2010) for $\ell_1 + \ell_2$ penalized estimation methods.

Define $m = nK$ and $\Omega^* = J^TJ$. We can reparameterize $U^*$, $y$, and $\varepsilon$ as
$$\tilde U = \begin{pmatrix} \sqrt{K/2}\,U^* \\ \sqrt{m\lambda_2/2}\,J \end{pmatrix}, \qquad \tilde y = \begin{pmatrix} \sqrt{K/2}\,y \\ 0 \end{pmatrix}, \qquad \tilde\varepsilon = \begin{pmatrix} \sqrt{K/2}\,\varepsilon \\ -\sqrt{m\lambda_2/2}\,J\gamma^{*0} \end{pmatrix}, \tag{25}$$
so that $\tilde\varepsilon = \tilde y - \tilde U\gamma^{*0}$. Then the original optimization problem can be reformulated as
$$\hat\gamma^* = \operatorname*{argmin}_{\gamma^*}\; \frac{1}{m}\|\tilde y - \tilde U\gamma^*\|^2 + \lambda_1\sum_{g=1}^p\|\gamma^*_g\|. \tag{26}$$
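The augmentation in (25) is easy to realize in code: stack a scaled $J$ under the scaled design and pad $y$ with zeros, after which any pure group-lasso solver handles (26). The following is a minimal sketch under our reconstruction of the scaling factors in (25), taking $J$ as a Cholesky factor of $\Omega^*$.

```python
import numpy as np

def augment(U_star, y, Omega_star, lam2, K):
    """Build (U_tilde, y_tilde) of (25) so that the group lasso on the
    augmented data, with loss (1/m)||y_tilde - U_tilde gamma||^2,
    reproduces the L1 + quadratic criterion (5).

    Assumes Omega_star is positive definite; for a merely positive
    semidefinite Omega_star, use an eigendecomposition instead.
    """
    n = U_star.shape[0]
    m = n * K
    J = np.linalg.cholesky(Omega_star).T          # J^T J = Omega*
    U_tilde = np.vstack([np.sqrt(K / 2.0) * U_star,
                         np.sqrt(m * lam2 / 2.0) * J])
    y_tilde = np.concatenate([np.sqrt(K / 2.0) * y,
                              np.zeros(J.shape[0])])
    return U_tilde, y_tilde, m
```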

Let $I_g$ be the set of indices of $\gamma$ that correspond to the $g$th covariate. We first prove the following useful lemma.

Lemma 1. Define $\sigma_j^2 = U_j^{*T}U_j^*$ and assume $\sigma_j^2$ is bounded away from 0 and $\infty$; that is, there exist $\sigma_{\min}$ and $\sigma_{\max}$ such that
$$0 < \sigma_{\min} \leq \sigma_j \leq \sigma_{\max} < \infty, \qquad \forall\, j = 1, \ldots, q.$$
Define $Z_j = \frac{1}{m}(U^{*T}\varepsilon)_j$ and $Z_{I_g} = (Z_i : i \in I_g)$. Choose $0 < \tau < 1/2$, $\kappa > \sqrt{2}K\iota\sigma_{\max}/(\tau\sigma_{\min})$, and $\lambda_1 = \kappa\sigma_{\min}\sqrt{m^{-1}\log q}$. Then
$$P\Big(\max_{g=1,\ldots,p} K\|Z_{I_g}\| \leq \tau\lambda_1\Big) \geq 1 - p^{1-\kappa^2\tau^2\sigma_{\min}^2/(2K^2\iota^2\sigma_{\max}^2)}.$$

Proof. For each $j = 1, \ldots, q$, $Z_j \sim N(0, m^{-1}\sigma_j^2)$; then
$$P\Big(\max_{g=1,\ldots,p} K\|Z_{I_g}\| \geq \tau\lambda_1\Big) \leq P\Big(\max_{j=1,\ldots,q} |Z_j| \geq \frac{\tau\lambda_1}{K\iota}\Big) \leq p\exp\Big(-\frac{m}{2\sigma_{\max}^2}\Big(\frac{\tau\lambda_1}{K\iota}\Big)^2\Big) \leq p^{1-\kappa^2\tau^2\sigma_{\min}^2/(2K^2\iota^2\sigma_{\max}^2)}. \qquad\Box$$
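A quick Monte Carlo experiment can illustrate the lemma's tail bound. The sketch below uses illustrative dimensions and takes $\iota = 1$ and a standard-normal $\varepsilon$; these are our assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; columns of U* are ordered so that group g
# occupies columns g*K, ..., (g+1)*K - 1 (our convention).
n, p, K = 200, 50, 4
q, m = p * K, n * K
tau, iota = 0.25, 1.0

U = rng.standard_normal((n, q))
U *= np.sqrt(n) / np.linalg.norm(U, axis=0)    # scale columns so sigma_j^2 = n
sigma = np.sqrt((U ** 2).sum(axis=0))
sigma_min, sigma_max = sigma.min(), sigma.max()

kappa = 1.1 * np.sqrt(2) * K * iota * sigma_max / (tau * sigma_min)
lam1 = kappa * sigma_min * np.sqrt(np.log(q) / m)

reps, hits = 1000, 0
for _ in range(reps):
    Z = U.T @ rng.standard_normal(n) / m        # Z_j = (1/m)(U*^T eps)_j
    hits += K * np.linalg.norm(Z.reshape(p, K), axis=1).max() <= tau * lam1

bound = 1 - p ** (1 - kappa**2 * tau**2 * sigma_min**2
                  / (2 * K**2 * iota**2 * sigma_max**2))
print(f"empirical: {hits / reps:.3f}, lemma lower bound: {bound:.3f}")
```

The empirical frequency of the event should sit above the lemma's lower bound, which is loose by design.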

Proof of Theorem 3.3. Starting from the minimization problem (26), we have the following equivalent statements:
$$\frac{1}{m}\|\tilde y - \tilde U\hat\gamma^*\|^2 + \lambda_1\sum_g\|\hat\gamma^*_g\| \;\leq\; \frac{1}{m}\|\tilde y - \tilde U\gamma^{*0}\|^2 + \lambda_1\sum_g\|\gamma^{*0}_g\|$$
$$\iff\quad \frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^* + \tilde\varepsilon\|^2 - \frac{1}{m}\|\tilde\varepsilon\|^2 \;\leq\; \lambda_1\sum_g\|\gamma^{*0}_g\| - \lambda_1\sum_g\|\hat\gamma^*_g\|$$
$$\iff\quad \frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \;\leq\; \lambda_1\sum_g\big[\|\gamma^{*0}_g\| - \|\hat\gamma^*_g\|\big] - \frac{2}{m}\tilde\varepsilon^T\tilde U(\gamma^{*0} - \hat\gamma^*).$$

Note that
$$\frac{2}{m}\tilde\varepsilon^T\tilde U(\gamma^{*0} - \hat\gamma^*) = \frac{K}{m}\varepsilon^TU^*(\gamma^{*0} - \hat\gamma^*) - \lambda_2(\gamma^{*0})^T\Omega^*(\gamma^{*0} - \hat\gamma^*) = (A) + (B),$$
where
$$|(A)| = \frac{K}{m}\big|\varepsilon^TU^*(\gamma^{*0} - \hat\gamma^*)\big| = K\Big|\sum_g Z_{I_g}^T(\gamma^{*0}_g - \hat\gamma^*_g)\Big| \leq K\sum_g\|Z_{I_g}\|\,\|\gamma^{*0}_g - \hat\gamma^*_g\|,$$
and similarly,
$$|(B)| \leq \lambda_2\sum_g\|\Omega^*_g\gamma^{*0}\|\,\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq r^*\lambda_2\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\|,$$
where $r^* = \|\Omega^*\gamma^{*0}\|_\infty$. Define $\lambda_2 = \tau\lambda_1/r^*$. On the event $\Lambda_{n,p} = \{\max_{g=1,\ldots,p} K\|Z_{I_g}\| \leq \tau\lambda_1\}$, we have
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq \lambda_1\sum_g\big[\|\gamma^{*0}_g\| - \|\hat\gamma^*_g\|\big] + 2\tau\lambda_1\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\|. \tag{27}$$

∥γg∗0 − γˆg∗ ∥ to both sides of (27) and noting the fact that ∥γg∗ − γˆg∗ ∥ + ∥γg∗ ∥ −

∥ˆ γg∗ ∥ = 0 for any g ∈ / A0 , we can further simplify (27) to

∑ ∑ 1

˜ ∗0 ˜ ∗ 2 γ + (1 − 2τ )λ1 ∥γg∗0 − γˆg∗ ∥ ≤ 2λ1 ∥γg∗0 − γˆg∗ ∥,

Uγ − Uˆ m 0 g

(28)

g∈A

using the triangle inequality. From (28), we obtain ∑

∥γg∗0 − γˆg∗ ∥ ≤

g

∑ 2 ∥γg∗0 − γˆg∗ ∥. 1 − 2τ 0 g∈A

By Condition $B(A^0, \tau)$ and the norm inequality
$$\Big(\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|\Big)^2 \leq |A^0|\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|^2,$$
we have
$$\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq \sqrt{\frac{|A^0|}{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|. \tag{29}$$
Combining (29) with (28) leads to
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq 2\lambda_1\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq 2\lambda_1\sqrt{\frac{|A^0|}{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|.$$
Thanks to the inequality $2ab \leq a^2/2 + 2b^2$, we have
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq \frac{4\lambda_1^2|A^0|}{\phi},$$

which leads to
$$\frac{1}{n}\|U\gamma^0 - U\hat\gamma\|^2 \leq \frac{8\lambda_1^2|A^0|}{\phi}, \qquad (\gamma^0 - \hat\gamma)^T\Omega(\gamma^0 - \hat\gamma) \leq \frac{8r^*\lambda_1|A^0|}{\tau\phi}.$$
On the other hand, Condition $B(A^0, \tau)$ and (29) imply
$$\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq \frac{2}{1-2\tau}\sqrt{\frac{|A^0|}{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\| \leq \frac{4\lambda_1|A^0|}{(1-2\tau)\phi}.$$
This implies
$$\sum_g\|\gamma^0_g - \hat\gamma_g\| \leq \sum_g\frac{\|\gamma^{*0}_g - \hat\gamma^*_g\|}{\mu_{\min}} \leq \frac{4\lambda_1|A^0|}{(1-2\tau)\phi\,\mu_{\min}}. \qquad\Box$$

References

Bach, F. R. (2008), "Consistency of the Group Lasso and Multiple Kernel Learning," Journal of Machine Learning Research, 9, 1179–1225.

Hebiri, M. and van de Geer, S. (2010), "The Smooth-Lasso and Other ℓ1 + ℓ2-Penalized Methods," arXiv:1003.4885v1.

Yuan, M. and Lin, Y. (2006), "Model Selection and Estimation in Regression with Grouped Variables," Journal of the Royal Statistical Society, Ser. B, 68, 49–67.

