Estimation, Optimization, and Parallelism when Data is Sparse or Highly Varying

John C. Duchi

Michael I. Jordan

H. Brendan McMahan

November 10, 2013

Abstract

We study stochastic optimization problems when the data is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium- to large-scale learning tasks.

1 Introduction and problem setting

In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F(·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ R^d. For a probability distribution P on Ξ, we consider the following optimization problem:

minimize_{x∈X}  f(x) := E[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ).    (1)

By data sparsity, we mean that the sampled data ξ is sparse: samples ξ are assumed to lie in R^d, and if we define the support supp(x) of a vector x to be the set of indices of its non-zero components, we assume that

supp ∇F(x; ξ) ⊂ supp ξ.    (2)

The sparsity condition (2) means that F(x; ξ) does not "depend" on the values of x_j for indices j such that ξ_j = 0.^1 This type of data sparsity is prevalent in statistical optimization problems and machine learning applications, though in spite of its prevalence, the study of such problems has been somewhat limited. As a motivating example, consider a text classification problem: data ξ ∈ R^d represents words appearing in a document, and we wish to minimize a logistic loss F(x; ξ) = log(1 + exp(⟨ξ, x⟩)) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance,

^1 Formally, if we define π_ξ as the coordinate projection that zeros all indices j of its argument where ξ_j = 0, then F(π_ξ(x); ξ) = F(x; ξ) for all x, ξ. This is implied by first-order conditions for convexity [6, Chapter VI.2].
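To make the sparsity condition concrete, here is a small self-contained sketch (ours, not the authors' code; all names are illustrative) of the logistic-loss gradient from the text-classification example above. The gradient is σ(⟨ξ, x⟩) ξ, so it visibly inherits the support of ξ, which is exactly condition (2).

```python
import math

def logistic_loss_grad(x, xi):
    # F(x; xi) = log(1 + exp(<xi, x>)); its gradient is sigma(<xi, x>) * xi,
    # so supp(grad) is contained in supp(xi) -- condition (2) in the text.
    inner = sum(a * b for a, b in zip(xi, x))
    sigma = 1.0 / (1.0 + math.exp(-inner))
    return [sigma * a for a in xi]

x = [0.5, -1.0, 2.0, 0.0]
xi = [1.0, 0.0, -1.0, 0.0]  # sparse sample: supp(xi) = {0, 2}
g = logistic_loss_grad(x, xi)
support = {j for j, gj in enumerate(g) if gj != 0.0}
assert support <= {0, 2}    # supp(grad F) is contained in supp(xi)
```

Coordinates outside supp(ξ) receive exactly zero gradient, so an optimizer processing this sample never needs to touch them.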


2 Minimax rates for sparse optimization

We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x̂ denote any estimator for a minimizer of the objective (1). We define the optimality gap ε_N for the estimator x̂ based on N samples ξ^1, ..., ξ^N from the distribution P as

ε_N(x̂, F, X, P) := f(x̂) − inf_{x∈X} f(x) = E_P[F(x̂; ξ)] − inf_{x∈X} E_P[F(x; ξ)].

This quantity is a random variable, since x̂ is a function of ξ^1, ..., ξ^N. To define the minimax error, we thus take expectations of the quantity ε_N, though we require a bit more than simply E[ε_N]. We let P denote a collection of probability distributions, and we consider a collection of loss functions specified by a collection F of convex losses F : X × Ξ → R. We can then define the minimax error for the family of losses F and distributions P as

ε*_N(X, P, F) := inf_{x̂} sup_{P∈P} sup_{F∈F} E_P[ε_N(x̂, F, X, P)],    (3)

where the infimum is taken over all possible estimators (optimization schemes) x̂.

2.1 Minimax lower bounds

Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}^d, and we let p_j := P(ξ_j ≠ 0) be the marginal probability of appearance of feature j (j ∈ {1, ..., d}). For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂_x F(x; ξ), the jth coordinate satisfies |g_j| ≤ M_j for a constant M_j < ∞. We obtain

Proposition 1. Let the conditions of the preceding paragraph hold, and let R be a constant such that X ⊃ [−R, R]^d. Then

ε*_N(X, P, F) ≥ (R/8) Σ_{j=1}^d M_j min{ p_j, √(p_j) / √(N log 3) }.

We provide the proof of Proposition 1 in Appendix A.1, providing a few remarks here. We begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p_0 ∈ [0, 1], and assume that P(ξ_j ≠ 0) = p_0 j^{−α}. We have

Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with M_j ≡ M for all j, and assume the power-law condition P(ξ_j ≠ 0) = p_0 j^{−α} on the coordinate appearance probabilities. Then

(1) If d > (p_0 N)^{1/α},

ε*_N(X, P, F) ≥ (MR/8) [ √(p_0/N) · (2/(2−α)) ( (p_0 N)^{(2−α)/(2α)} − 1 ) + (p_0/(1−α)) ( d^{1−α} − (p_0 N)^{(1−α)/α} ) ].

(2) If d ≤ (p_0 N)^{1/α},

ε*_N(X, P, F) ≥ (MR/8) √(p_0/N) · ( d^{1−α/2} − 1 ) / (1 − α/2).

For simplicity, assume that the features are not too extraordinarily sparse, say that α ∈ [0, 2], and that the number of samples is large enough that d ≤ (p_0 N)^{1/α}. Then we find ourselves in regime (2) of Corollary 2, so that the lower bound on optimization error is of order

MR √(p_0/N) d^{1−α/2}  when α < 2,   MR √(p_0/N) log d  when α → 2,   and   MR √(p_0/N)  when α > 2.    (4)

These results raise the question of tightness: are they improvable? As we see presently, they are not.
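As a numeric sanity check (our own, not part of the paper; the function name and constants are illustrative), one can evaluate the lower bound of Proposition 1 under the power-law model and confirm that it tracks the regime-(2) rate √(p_0/N) d^{1−α/2} from display (4) up to a constant factor:

```python
import math

# Numeric sanity check (ours, not from the paper): evaluate the Proposition 1
# lower bound under the power-law model p_j = p0 * j**(-alpha), and confirm it
# matches the regime-(2) rate sqrt(p0/N) * d**(1 - alpha/2) up to a constant.
def prop1_lower_bound(d, N, p0, alpha, M=1.0, R=1.0):
    total = 0.0
    for j in range(1, d + 1):
        pj = p0 * j ** (-alpha)
        total += M * min(pj, math.sqrt(pj) / math.sqrt(N * math.log(3)))
    return R / 8.0 * total

d, N, p0 = 1000, 10**6, 0.5
for alpha in (0.5, 1.0, 1.5):           # all satisfy d <= (p0 * N)**(1/alpha)
    lb = prop1_lower_bound(d, N, p0, alpha)
    rate = math.sqrt(p0 / N) * d ** (1 - alpha / 2)
    assert 0.5 <= rate / lb <= 100      # same order, constant-factor gap
```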

2.2 Algorithms for attaining the minimax rate

x ← argmin_{x′∈X} { ⟨g, x′⟩ + (1/2η) ⟨x′ − x, S(x′ − x)⟩ }

for stochastic gradient descent and

x ← argmin_{x′∈X} { ⟨z, x′⟩ + (1/2η) ⟨x′, S x′⟩ }

for dual averaging, where η is a stepsize. Then an appropriate choice of η shows that after N samples ξ, the averaged parameter x̂(N) AdaGrad returns satisfies

E[f(x̂(N))] − inf_{x∈X} f(x) ≤ O(1) (R_∞ M / √N) Σ_{j=1}^d √(p_j),    (6)

where R_∞ denotes the ℓ_∞-radius of X (see, e.g., [4, Section 1.3 and Theorem 5], where one takes η ≈ R_∞). By inspection, the AdaGrad rate (6) matches the lower bound in Proposition 1 and is thus optimal. It is interesting to note, though, that in the power-law setting of Corollary 2 (recall the error (4)), a calculation shows that the multiplier for the SGD guarantee (5) becomes R_∞ √d max{d^{(1−α)/2}, 1}, while AdaGrad attains rate at worst R_∞ max{d^{1−α/2}, log d} (by evaluation of Σ_j √(p_j)). Thus for α > 1 the AdaGrad rate is no worse, and for α ≥ 2 it is more than √d / log d better than SGD—an exponential improvement in the dimension.
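For concreteness, here is a minimal sketch (ours, not the authors' code; names are illustrative) of one step of the dual-averaging form of AdaGrad described above, in the unconstrained case X = R^d. There the argmin has the closed form x = −η S^{−1} z, where S is the diagonal matrix whose jth entry is the root of the sum of squared gradients seen in coordinate j; sparse samples touch only their own coordinates.

```python
import math

# Sketch (ours) of one dual-averaging AdaGrad step, unconstrained case.
# z accumulates gradients, s accumulates squared gradients per coordinate,
# and S = diag(sqrt(s)), so argmin_x <z,x> + (1/2 eta)<x, S x> is
# x_j = -eta * z_j / sqrt(s_j).
def adagrad_da_step(z, s, x, grad, eta, delta=1e-8):
    for j, gj in enumerate(grad):
        if gj != 0.0:            # sparse update: only supp(g) is touched
            z[j] += gj
            s[j] += gj * gj
    for j in range(len(x)):
        x[j] = -eta * z[j] / (math.sqrt(s[j]) + delta)  # delta avoids 0/0
    return x

z = [0.0, 0.0, 0.0]
s = [0.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0]
x = adagrad_da_step(z, s, x, [1.0, 0.0, -2.0], eta=0.1)
```

Coordinates with larger accumulated gradient mass receive proportionally smaller effective stepsizes, which is what yields the per-coordinate √(p_j) dependence in the rate (6).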

3 Parallel and asynchronous optimization with sparsity

3.1 Asynchronous dual averaging

One of the weaknesses of Hogwild! is that, as written, it appears applicable only to problems for which the domain X is a product space, and the known analysis assumes that ‖g‖_0 = 1 for all gradients g. In an effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, AsyncDA. In AsyncDA, instead of asynchronously updating a centralized parameter vector x, we maintain a centralized dual vector z. A pool of processors performs asynchronous additive updates to z, where each processor repeatedly performs the following updates:

1. Read z and compute x := argmin_{x∈X} { ⟨z, x⟩ + (1/η) ψ(x) }    // Implicitly increment the "time" counter t and let x(t) = x
2. Sample ξ ∼ P and let g ∈ ∂_x F(x; ξ)    // Let g(t) = g
3. For each j ∈ [d] such that g_j ≠ 0, update z_j ← z_j + g_j

Because the actual computation of the vector x in asynchronous dual averaging (AsyncDA) is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X. The only communication point between the processors is the addition operation in step 3. As noted by Niu et al. [12], this operation can often be performed atomically on modern processors.

In our analysis of AsyncDA, and in our subsequent analysis of the adaptive methods, we require a measurement of elapsed time. With that in mind, we let t denote a time index that exists (roughly) behind-the-scenes: x(t) denotes the tth vector x ∈ X actually computed in step 1 of the AsyncDA algorithm by any of the processors. This quantity exists and is recoverable from the algorithm, and it is possible to track the running sum Σ_{τ=1}^t x(τ). Additionally, we require two assumptions encapsulating the conditions underlying our analysis.

Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant p_j ∈ [0, 1] such that P(ξ_j ≠ 0) ≤ p_j.

We also require an assumption about the continuity (Lipschitzian) properties of the loss functions being minimized; the assumption amounts to a second-moment constraint on the subgradients of the instantaneous F along with a rough measure of the sparsity of the gradients.

Assumption B. There exist constants M and (M_j)_{j=1}^d such that the following bounds hold for all x ∈ X: E[‖∂_x F(x; ξ)‖_2^2] ≤ M^2, and for each j ∈ [d] we have E[|∂_{x_j} F(x; ξ)|] ≤ p_j M_j.

With these definitions, we have the following theorem, which captures the convergence behavior of AsyncDA under the assumption that X is a Cartesian product, meaning that X = X_1 × ··· × X_d with X_j ⊂ R, and that ψ(x) = (1/2)‖x‖_2^2.
Note that the algorithm itself can still be efficiently parallelized for more general convex X, even if the theorem does not apply.

Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then

E[ Σ_{t=1}^T F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ (1/2η) ‖x*‖_2^2 + (η/2) T M^2 + η T m Σ_{j=1}^d p_j^2 M_j^2.


We provide the proof of Theorem 3 in Appendix B. As stated, the theorem is somewhat unwieldy, so we provide a corollary and a few remarks to explain and simplify the result. Under the more stringent condition that |∂_{x_j} F(x; ξ)| ≤ M_j, Assumption A implies E[‖∂_x F(x; ξ)‖_2^2] ≤ Σ_{j=1}^d p_j M_j^2. Thus, without loss of generality for the remainder of this section we take M^2 = Σ_{j=1}^d p_j M_j^2, which serves as an upper bound on the Lipschitz continuity constant of the objective f. We then obtain the following corollary.

Corollary 4. Define x̂(T) = (1/T) Σ_{t=1}^T x(t), and set η = ‖x*‖_2 / (M √T). Then

E[f(x̂(T)) − f(x*)] ≤ M ‖x*‖_2 / √T + m (‖x*‖_2 / (2M √T)) Σ_{j=1}^d p_j^2 M_j^2.

Corollary 4 is almost immediate. To see the result, note that since ξ^t is independent of x(t), we have E[F(x(t); ξ^t) | x(t)] = f(x(t)); applying Jensen's inequality to f(x̂(T)) and performing an algebraic manipulation give the corollary. If the data is suitably "sparse," meaning that p_j ≤ 1/m (which may also occur if the data is of relatively high variance in Assumption B), the bound in Corollary 4 simplifies to

E[f(x̂(T)) − f(x*)] ≤ (3/2) M ‖x*‖_2 / √T = (3/2) √(Σ_{j=1}^d p_j M_j^2) ‖x*‖_2 / √T,    (7)

which is the convergence rate of stochastic gradient descent (and dual averaging) even in non-asynchronous situations (5). In non-sparse cases, setting η ∝ ‖x*‖_2 / √(m M^2 T) in Theorem 3 recovers the bound

E[f(x̂(T)) − f(x*)] ≤ O(1) √m · M ‖x*‖_2 / √T.

The convergence guarantee (7) shows that after T timesteps, we have error scaling as 1/√T; however, if we have k processors, then updates can occur roughly k times as quickly, since all updates are asynchronous. Thus in time scaling as n/k, we can evaluate n gradient samples: a linear speedup.
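To illustrate the control flow, here is a toy multithreaded rendering of the AsyncDA loop (ours, not the authors' implementation; the sparse samples, logistic loss, and all names are illustrative). With ψ(x) = (1/2)‖x‖_2^2 and X = R^d, step 1 reduces to x = −ηz; a lock stands in for the atomic adds of step 3.

```python
import threading
import random
import math

# Toy sketch (ours) of the AsyncDA loop: worker threads share the dual vector
# z, each repeatedly reading z, computing the primal point x, sampling a
# sparse example, and adding the gradient back into z.
d, eta, steps = 5, 0.1, 200
z = [0.0] * d
lock = threading.Lock()   # stands in for the atomic adds of step 3

def worker(seed):
    rng = random.Random(seed)
    for _ in range(steps):
        x = [-eta * zj for zj in z]   # step 1 (a slightly stale read is fine)
        xi = [rng.choice([-1.0, 0.0, 0.0, 1.0]) for _ in range(d)]  # sparse
        inner = max(-30.0, min(30.0, sum(a * b for a, b in zip(xi, x))))
        g = [(1.0 / (1.0 + math.exp(-inner))) * a for a in xi]  # logistic grad
        with lock:
            for j, gj in enumerate(g):
                if gj != 0.0:          # step 3: sparse additive update
                    z[j] += gj

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each update touches only supp(ξ), sparse samples rarely contend for the same coordinates, which is the intuition behind the linear-speedup claim above.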

3.2 Asynchronous AdaGrad

As in the description of AsyncDA, we note that x(t) is the vector x ∈ X computed in the tth "step" of the algorithm (step 1), and we similarly associate ξ^t with x(t). To analyze AsyncAdaGrad, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B.

Assumption C. There exist constants (M_j)_{j=1}^d such that for any x ∈ X and ξ ∈ Ξ, we have E[(∂_{x_j} F(x; ξ))^2 | ξ_j ≠ 0] ≤ M_j^2.

Indeed, taking M^2 = Σ_j p_j M_j^2 shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result, whose proof we defer to Appendix C.

Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that δ^2 ≥ M_j^2 m for all j and that X ⊂ [−R_∞, R_∞]^d. Then

Σ_{t=1}^T E[F(x(t); ξ^t) − F(x*; ξ^t)] ≤ Σ_{j=1}^d min{ (1/η) R_∞^2 E[(δ^2 + Σ_{t=1}^T g_j(t)^2)^{1/2}] + η E[(Σ_{t=1}^T g_j(t)^2)^{1/2}] (1 + p_j m),  M_j R_∞ p_j T }.

We can also relax the condition on the initial constant diagonal term δ slightly, which gives a qualitatively similar result (see Appendix C.3).

Corollary 6. Under the conditions of Theorem 5, assume instead that δ^2 ≥ M_j^2 min{m, 6 max{log T, m p_j}} for all j. Then

Σ_{t=1}^T E[F(x(t); ξ^t) − F(x*; ξ^t)] ≤ Σ_{j=1}^d min{ (1/η) R_∞^2 E[(δ^2 + Σ_{t=1}^T g_j(t)^2)^{1/2}] + (3/2) η E[(Σ_{t=1}^T g_j(t)^2)^{1/2}] (1 + p_j m),  M_j R_∞ p_j T }.
It is natural to ask in which situations the bounds of Theorem 5 and Corollary 6 are optimal. We note that, as in the case of Theorem 3, we may take an expectation with respect to ξ^t and obtain a convergence rate for f(x̂(T)) − f(x*), where x̂(T) = (1/T) Σ_{t=1}^T x(t). By Jensen's inequality, we have for any δ that

E[(δ^2 + Σ_{t=1}^T g_j(t)^2)^{1/2}] ≤ (δ^2 + Σ_{t=1}^T E[g_j(t)^2])^{1/2} ≤ √(δ^2 + T p_j M_j^2).

For interpretation, let us now make a few assumptions on the probabilities p_j. If we assume that p_j ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that

E[f(x̂(T)) − f(x*)] ≤ O(1) ( (1/η) R_∞^2 + η ) Σ_{j=1}^d M_j min{ √( log(T)/T + p_j ) / √T,  p_j }.    (8)
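A minimal single-step sketch (ours, not the authors' code; names are illustrative) of the AsyncAdaGrad-style update analyzed above: each coordinate keeps a second-moment accumulator S_j initialized at δ^2, and the primal point is computed against G = diag(S)^{1/2}. Since G is diagonal, projection onto the box [−R_∞, R_∞]^d in the ‖·‖_G norm is coordinate-wise clipping.

```python
import math

# Sketch (ours) of one AsyncAdaGrad-style update: alongside the dual vector z
# we keep per-coordinate gradient second moments S_j (initialized at delta**2),
# and the primal point uses the diagonal matrix G = diag(S)**(1/2).
def async_adagrad_step(z, S, grad, eta, R_inf):
    for j, gj in enumerate(grad):
        if gj != 0.0:                 # sparse, per-coordinate updates
            z[j] += gj
            S[j] += gj * gj
    # x = argmin_x <z, x> + (1/2 eta) <x, G x> over the box [-R_inf, R_inf]^d;
    # with diagonal G the quadratic is separable, so project coordinate-wise.
    x = []
    for j in range(len(z)):
        xj = -eta * z[j] / math.sqrt(S[j])
        x.append(max(-R_inf, min(R_inf, xj)))
    return x

delta = 1.0
z = [0.0, 0.0]
S = [delta ** 2, delta ** 2]
x = async_adagrad_step(z, S, [3.0, 0.0], eta=0.5, R_inf=0.4)
```

The initialization S_j = δ^2 is exactly the "initial constant diagonal term" whose required size Theorem 5 and Corollary 6 quantify.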

4 Experiments

In this section, we give experimental validation of our theoretical results on AsyncAdaGrad and AsyncDA, giving results on two datasets selected for their high-dimensional sparsity.2

4.1 Malicious URL detection

^2 We also performed experiments using Hogwild! instead of AsyncDA; the results are similar.



Figure 1. Experiments with URL data. Left: speedup relative to 1 processor. Middle: training dataset loss versus number of processors. Right: test set error rate versus number of processors. A-AdaGrad abbreviates AsyncAdaGrad.

[Figure 2: two panels, "Fixed stepsizes, test data, L2=0" and "Impact of L2 regularization on test error," plotting relative log-loss against number of passes for A-AdaGrad (η = 0.008, L2 ∈ {0, 80}) and A-DA (η = 0.8, L2 ∈ {0, 80}).]
Figure 2. Relative accuracy for various stepsize choices on a click-through-rate prediction dataset.

4.2 Click-through-rate prediction experiments

Here "number of passes" more precisely means the expected number of times each example in the dataset is trained on: each worker thread randomly selects a training example from the dataset for each update, and we continue making updates until (dataset size) × (number of passes) updates have been processed.


[Figure 3: four panels plotting (A) relative log-loss against number of passes with an optimized stepsize for each number of passes, (B) speedup against target relative log-loss, (C) the optimal relative stepsize scaling against number of passes, and (D) the impact of training data ordering on relative log-loss.]
Figure 3. (A) Relative test-set log-loss for AsyncDA and AsyncAdaGrad, choosing the best stepsize (within a factor of about 1.4×) individually for each number of passes. (B) Effective speedup for AsyncAdaGrad. (C) The best stepsize η, expressed as a scaling factor on the stepsize used for one pass. (D) Five runs with different random seeds for each algorithm (with L2 = 80).


error bars on Figure 3(A), and explicitly in Figure 3(D), where we plot one line for each of the 5 random seeds used. Thus, while on the one hand Hogwild! requires somewhat less tuning of the stepsize and L2 parameter to control overfitting, tuning AsyncAdaGrad is much easier because of the predictable response.


References

[1] P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.
[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.
[4] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[5] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[6] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1996.
[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying malicious URLs: an application of large-scale online learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[8] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[9] B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory, 2010.
[10] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.
[12] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
[13] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.
[14] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

A Proof of Proposition 1 and related results

In this section we provide the proofs of Proposition 1 and its subsequent corollaries.

A.1 Proof of Proposition 1

Our proof proceeds in a few steps: we first define our family of loss functions, after which we perform an essentially standard reduction of the estimation (optimization) problem to testing. Following this step, we carefully lower bound the probabilities of error in our multiple hypothesis testing problem (in a manner similar to Assouad's lemma) to obtain the desired statement of the proposition.

Defining the loss functions. We begin by describing the family of loss functions we use to prove the result. Given ξ ∈ {−1, 0, 1}^d, let

F(x; ξ) := M Σ_{j=1}^d |ξ_j| |x_j − R ξ_j|.

By inspection, F satisfies the conditions of the proposition. Letting p_j = P(ξ_j ≠ 0), p_j^+ = P(ξ_j = 1), and p_j^− = P(ξ_j = −1), we thus find that

f(x) = M Σ_{j=1}^d [ p_j^+ |x_j − R| + p_j^− |x_j + R| ],

so the objective f behaves like a weighted 1-norm type of quantity and its minimizer is a multi-dimensional median.

From estimation to testing. Now we proceed through a reduction of estimation to testing. Fix δ_j > 0 for j ∈ {1, ..., d} (we optimize these choices later). Let V = {−1, 1}^d, and for a fixed v ∈ V let P_v be the distribution specified by

P_v(ξ_j ≠ 0) = p_j,   P_v(ξ_j = 1 | ξ_j ≠ 0) = (1 + δ_j v_j)/2,   P_v(ξ_j = −1 | ξ_j ≠ 0) = (1 − δ_j v_j)/2.    (9)

With this choice of distribution, we claim that for any estimator x̂,

sup_{P∈P} E_P[ε_N(x̂, F, X, P)] ≥ max_{v∈V} E_{P_v}[ε_N(x̂, F, X, P_v)] ≥ MR max_{v∈V} Σ_{j=1}^d p_j δ_j P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j),    (10)

where the last probability is computed under the product distribution P_v^N over the samples ξ^{1:N}. Indeed, define

x*_v = argmin_{x∈X} E_{P_v}[F(x; ξ)] = Rv,

the last equality following by inspection of the loss. We then have

ε_N(x̂, F, X, P_v) = M Σ_{j=1}^d p_j [ ((1+δ_j)/2)|x̂_j − Rv_j| + ((1−δ_j)/2)|x̂_j + Rv_j| − ((1+δ_j)/2)|x*_{v,j} − Rv_j| − ((1−δ_j)/2)|x*_{v,j} + Rv_j| ].

By inspecting the cases for the possible values of sign(x̂_j), we have

((1+δ_j)/2)|x̂_j − Rv_j| + ((1−δ_j)/2)|x̂_j + Rv_j| − ((1+δ_j)/2)|x*_{v,j} − Rv_j| − ((1−δ_j)/2)|x*_{v,j} + Rv_j| ≥ R δ_j 1{sign(x̂_j) ≠ v_j}.

Taking expectations of this quantity gives the result (10).

Bounding the test error. Now we transform our testing problem into a randomized testing problem, averaging over v ∈ V, which allows us to give a lower bound by controlling the error probability of d distinct simple hypothesis tests. From inequality (10) we have

max_{v∈V} Σ_{j=1}^d p_j δ_j P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j) ≥ Σ_{j=1}^d p_j δ_j (1/|V|) Σ_{v∈V} P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j)    (11)

= Σ_{j=1}^d p_j δ_j (1/|V|) [ Σ_{v : v_j = 1} P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j) + Σ_{v : v_j = −1} P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j) ].

Recall that for any distributions P, Q, the variational representations of the total variation distance imply that

(1/2)‖P − Q‖_TV = sup_{f : ‖f‖_∞ ≤ 1} (1/2) ∫ f (dP − dQ) = 1 − inf{ ∫ f dP + ∫ g dQ : f ≥ 0, g ≥ 0, f + g ≥ 1 }.

Now, let P_{v,j} be the distribution (9), except that we fix v_j = 1 (and let P_{v,−j} similarly have v_j = −1 fixed). With this notation, we have

P_{v,j}(sign(x̂_j) ≠ v_j) + P_{v,−j}(sign(x̂_j) ≠ v_j) = E_{P_{v,j}}[1{sign(x̂_j) ≠ 1}] + E_{P_{v,−j}}[1{sign(x̂_j) ≠ −1}].

Since 1{sign(x̂_j) ≠ 1} + 1{sign(x̂_j) ≠ −1} ≥ 1, we find that

P_{v,j}^N(sign(x̂_j(ξ^{1:N})) ≠ v_j) + P_{v,−j}^N(sign(x̂_j(ξ^{1:N})) ≠ v_j) ≥ 1 − ‖P_{v,j}^N − P_{v,−j}^N‖_TV,

where we recall that P_{v,j}^N denotes the N-fold product of P_{v,j}. Thus, our lower bound (11) becomes the following (essentially a variant of Assouad's lemma):

max_{v∈V} Σ_{j=1}^d p_j δ_j P_v(sign(x̂_j(ξ^{1:N})) ≠ v_j) ≥ Σ_{j=1}^d p_j δ_j (1/(2|V|)) Σ_{v∈V} ( 1 − ‖P_{v,j}^N − P_{v,−j}^N‖_TV ).    (12)

Simple hypothesis tests. For the majority of the remainder of the proof, we derive bounds on ‖P_{v,j}^N − P_{v,−j}^N‖_TV to apply inequality (12). Using Pinsker's inequality, we have

‖P_{v,j}^N − P_{v,−j}^N‖_TV^2 ≤ (1/2) D_kl(P_{v,j}^N || P_{v,−j}^N) = (N/2) D_kl(P_{v,j} || P_{v,−j}).

Noting that P_v is a product distribution over the coordinates of the samples ξ (recall the construction (9)), we have the equality

D_kl(P_{v,j} || P_{v,−j}) = p_j [ ((1+δ_j)/2) log((1+δ_j)/(1−δ_j)) + ((1−δ_j)/2) log((1−δ_j)/(1+δ_j)) ] = p_j δ_j log((1+δ_j)/(1−δ_j)).

Now we use the fact that δ log((1+δ)/(1−δ)) ≤ 2 log(3) δ^2 for δ ≤ 1/2, so

‖P_{v,j}^N − P_{v,−j}^N‖_TV^2 ≤ N p_j δ_j^2 log(3)  for δ_j ∈ [0, 1/2].    (13)
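The elementary inequality invoked just above can be checked numerically (this check is ours, not part of the proof):

```python
import math

# Numeric check (ours) of the elementary fact used above:
# delta * log((1+delta)/(1-delta)) <= 2 * log(3) * delta**2 for delta in [0, 1/2],
# with equality at delta = 1/2.
for k in range(1, 501):
    delta = k / 1000.0          # delta ranges over (0, 0.5]
    lhs = delta * math.log((1 + delta) / (1 - delta))
    rhs = 2 * math.log(3) * delta ** 2
    assert lhs <= rhs + 1e-12
```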

Combining inequalities (10), (12), and (13), we find

sup_{P∈P} E_P[ε_N(x̂, F, X, P)] ≥ MR (1/2) Σ_{j=1}^d p_j δ_j (1/|V|) Σ_{v∈V} ( 1 − δ_j √(N p_j log(3)) ) = (MR/2) Σ_{j=1}^d p_j δ_j ( 1 − δ_j √(N p_j log(3)) ).    (14)

Inequality (14) holds for all δ_j ∈ [0, 1/2], so we may maximize over such δ_j. By setting

δ_j = min{ 1/2, 1/(2√(N p_j log(3))) },

we have

p_j δ_j ( 1 − δ_j √(N p_j log(3)) ) ≥ p_j min{ 1/4, 1/(4√(log 3) √(N p_j)) }.

In particular, our simplified Assouad analogue (14) implies

sup_{P∈P} E_P[ε_N(x̂, F, X, P)] ≥ (MR/8) Σ_{j=1}^d min{ p_j, √(p_j)/√(N log 3) },

which is the desired statement of the proposition.

A.2 Proof of Corollary 2

The proof follows by a simple integration argument.

B Proof of Theorem 3

Our proof proceeds in a series of lemmas, which we present shortly. To connect the result to more standard analyses of dual averaging, and because the lemmas we present form the basis of our subsequent analysis for AsyncAdaGrad, we prove our results in somewhat more generality than presented in the main text. In full generality, we allow {ψ_t}_t to be a sequence of functions, each strongly convex with respect to the norm ‖·‖_{ψ_t}. Then dual averaging [11, 4] can be viewed as repeatedly computing

x(t) := argmin_{x∈X} { ⟨z(t), x⟩ + ψ_t(x) },    (15)

where the proximal function ψ_t may or may not vary with time. For the theoretical development, we define the conjugate of ψ_t and the associated dual norm by

ψ_t^*(z) := sup_{x∈X} { ⟨z, x⟩ − ψ_t(x) }  and  ‖z‖_{ψ_t^*} := sup_x { ⟨z, x⟩ : ‖x‖_{ψ_t} ≤ 1 }.

We have the following standard lemma (see, e.g., the book of Hiriart-Urruty and Lemaréchal [6, Chapter X]), which is essential to the analysis of the dual averaging method (15). The result says that since ψ_t is strongly convex with respect to the norm ‖·‖_{ψ_t}, its conjugate is smoothly differentiable and, more strongly, has Lipschitz derivative with respect to the dual norm ‖·‖_{ψ_t^*}.

Lemma 7. The function ψ_t^* is 1-strongly smooth with respect to ‖·‖_{ψ_t^*}, meaning that

‖∇ψ_t^*(z) − ∇ψ_t^*(z′)‖_{ψ_t} ≤ ‖z − z′‖_{ψ_t^*},

and moreover ∇ψ_t^*(−z(t)) = x(t).

We also recall the standard fact [e.g. 6] that if a function h has Lipschitz continuous derivative with respect to a norm ‖·‖, then h(y) ≤ h(x) + ⟨∇h(x), y − x⟩ + (1/2)‖x − y‖^2 for all x, y ∈ dom h. With Lemma 7 in place, we can provide a general result on the convergence of asynchronous dual averaging-type algorithms. In this lemma, we use the time indexing established in Section 3.1; in particular, x(t) is computed using the proximal function ψ_t, that is, as in the update (15). This result shows that the performance of AsyncDA, and other algorithms like it, is essentially identical to that of the correct dual averaging algorithm, plus a correction penalty that relates the performance of the correct algorithm to what is actually executed.

Lemma 8. Define the "corrected" point sequence

x̃(t) := ∇ψ_t^*( −Σ_{τ=1}^{t−1} g(τ) ) = argmin_{x∈X} { Σ_{τ=1}^{t−1} ⟨g(τ), x⟩ + ψ_t(x) }.

For any sequence of samples ξ^t,

Σ_{t=1}^T [ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{t=1}^T [ ψ_t^*( −Σ_{τ=1}^{t−1} g(τ) ) − ψ_{t−1}^*( −Σ_{τ=1}^{t−1} g(τ) ) ] + (1/2) Σ_{t=1}^T ‖g(t)‖_{ψ_t^*}^2 + Σ_{t=1}^T ⟨g(t), x(t) − x̃(t)⟩ + ψ_T(x*).

We remark that in the non-asynchronous case, one has ψ_t^* ≤ ψ_{t−1}^* (either in adaptive settings such as AdaGrad [4] or in standard dual averaging) and x(t) = x̃(t), so Lemma 8 reduces to the standard bound (1/2) Σ_{t=1}^T ‖g(t)‖_{ψ_t^*}^2 + ψ_T(x*) [11].

Deferring the proof of Lemma 8 to Section B.1, we note that it has immediate implications in the context of Theorem 3. Since ψ_t(x) = (1/2η)‖x‖_2^2 in this case, we have for any v that

‖v‖_ψ^2 = (1/η)‖v‖_2^2  and  ‖v‖_{ψ^*}^2 = η‖v‖_2^2,

and since ψ_t^* = ψ_{t−1}^* and ψ^*(0) ≤ 0, Lemma 8 immediately implies

Σ_{t=1}^T [ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ (η/2) Σ_{t=1}^T ‖g(t)‖_2^2 + (1/2η)‖x*‖_2^2 + Σ_{t=1}^T ⟨g(t), x(t) − x̃(t)⟩.    (16)

Now we return to the proof of Theorem 3. Each of the terms present in Theorem 3 appears in Eq. (16) except for the last, since E[‖g(t)‖_2^2] ≤ M^2.


For the final term in the bound of the theorem, we note that since X is a product domain,

⟨g(t), x(t) − x̃(t)⟩ ≤ Σ_{j=1}^d |g_j(t)| |x_j(t) − x̃_j(t)| ≤ Σ_{j=1}^d η |g_j(t)| | Σ_{τ=1}^{t−1} g_j(τ) − z_j(t) |.

For the final inequality we have used that, by the definition of the x update (recall Lemma 7),

|x̃_j(t) − x_j(t)| = | ∇_j ψ^*( −Σ_{τ=1}^{t−1} g(τ) ) − ∇_j ψ^*(−z(t)) | ≤ η | Σ_{τ=1}^{t−1} g_j(τ) − z_j(t) |.

Conditioned on the σ-field F_{t−1} of {ξ^τ}_{τ=1}^{t−1}, we have E[|g_j(t)| | F_{t−1}] ≤ p_j M_j by assumption (since ξ^t is independent of ξ^τ for τ < t). Moreover, we have E[| Σ_{τ=1}^{t−1} g_j(τ) − z_j(t) |] ≤ m p_j M_j, because the delay in each processor is assumed to be at most m and E[|g_j(τ)|] ≤ p_j M_j. Thus we find

E[⟨g(t), x(t) − x̃(t)⟩] ≤ η Σ_{j=1}^d E[ E[|g_j(t)| | F_{t−1}] | Σ_{τ=1}^{t−1} g_j(τ) − z_j(t) | ] ≤ η Σ_{j=1}^d p_j^2 M_j^2 m.

This completes the proof.

B.1 Proof of Lemma 8

The proof is similar to other analyses of dual averaging (e.g. [11, 4]), except that we track the changing time indices. For shorthand throughout this proof, we define the running sum g(1:t) := Σ_{τ=1}^t g(τ). First, we note by convexity and the definition of g(t) ∈ ∂F(x(t); ξ^t) that

Σ_{t=1}^T [ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{t=1}^T ⟨g(t), x(t) − x*⟩.    (17)

By definition of ψ_T and the conjugate ψ_T^*, we find that

Σ_{t=1}^T ⟨g(t), x(t) − x*⟩ = Σ_{t=1}^T ⟨g(t), x(t)⟩ + Σ_{t=1}^T ⟨−g(t), x*⟩ + ψ_T(x*) − ψ_T(x*) ≤ ψ_T(x*) + ψ_T^*(−g(1:T)) + Σ_{t=1}^T ⟨g(t), x(t)⟩.    (18)

Now, by applying Lemma 7 and the definition of 1-strong smoothness, we have

ψ_t^*(−g(1:t)) ≤ ψ_t^*(−g(1:t−1)) + ⟨−g(t), ∇ψ_t^*(−g(1:t−1))⟩ + (1/2)‖g(t)‖_{ψ_t^*}^2.

By construction of x and x̃, we have x(t) = ∇ψ_t^*(−z(t)) and x̃(t) = ∇ψ_t^*(−g(1:t−1)). Thus, rearranging the preceding display, we have

0 ≤ ⟨−g(t), x̃(t)⟩ − ψ_t^*(−g(1:t)) + ψ_t^*(−g(1:t−1)) + (1/2)‖g(t)‖_{ψ_t^*}^2,

and adding ⟨g(t), x(t)⟩ to both sides of the above expression gives

⟨g(t), x(t)⟩ ≤ ⟨g(t), x(t) − x̃(t)⟩ − ψ_t^*(−g(1:t)) + ψ_t^*(−g(1:t−1)) + (1/2)‖g(t)‖_{ψ_t^*}^2.    (19)

Thus we obtain the inequalities

Σ_{t=1}^T ⟨g(t), x(t) − x*⟩
(i) ≤ ψ_T(x*) + ψ_T^*(−g(1:T)) + Σ_{t=1}^T ⟨g(t), x(t)⟩
(ii) ≤ ψ_T(x*) + ψ_T^*(−g(1:T)) + Σ_{t=1}^T [ ⟨g(t), x(t) − x̃(t)⟩ − ψ_t^*(−g(1:t)) + ψ_t^*(−g(1:t−1)) + (1/2)‖g(t)‖_{ψ_t^*}^2 ]
= ψ_T(x*) + Σ_{t=1}^T [ ⟨g(t), x(t) − x̃(t)⟩ + ψ_t^*(−g(1:t−1)) − ψ_{t−1}^*(−g(1:t−1)) + (1/2)‖g(t)‖_{ψ_t^*}^2 ],

where for step (i) we have applied inequality (18), step (ii) follows from the bound (19), and the last equality follows by re-indexing terms in the sum. Combining the above sum with the first-order convexity inequality (17) proves the lemma.

C Proof of Theorem 5

Before proving the theorem proper, we recall a now-standard lemma that is essential to proving rates of convergence for adaptive algorithms.

Lemma 9 (Auer and Gentile [1]; Duchi et al. [4], Lemma 4; McMahan and Streeter [9]). For any non-negative sequence {a_t}_t, where we define 0/√0 = 0, we have

Σ_{t=1}^T a_t / √(Σ_{τ=1}^t a_τ) ≤ 2 ( Σ_{t=1}^T a_t )^{1/2}.

As in the proof of Theorem 3, convexity implies

Σ_{t=1}^T [ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{t=1}^T ⟨g(t), x(t) − x*⟩.

Considering a particular coordinate j, we have

Σ_{t=1}^T E[ g_j(t) (x_j(t) − x_j^*) ] ≤ R_∞ Σ_{t=1}^T E[|g_j(t)|] ≤ R_∞ T M_j p_j,    (20)

where we have used the compactness assumption on X. The remainder of the proof bounds the regret-like term Σ_{t=1}^T ⟨g(t), x(t) − x*⟩ in a per-coordinate way, and thus for each coordinate we always have the bound (20), giving the min{·, M_j R_∞ p_j T} terms in the theorem statement. It remains to show the bound that applies when p_j is large.

We now re-state the general bound of Lemma 8 with some minor modifications in notation. AsyncAdaGrad is dual averaging with the choice ψ_t(x) := (1/2η)⟨x, G(t)x⟩ for the proximal function. With this choice, the norm and dual norm ‖·‖_{ψ_t} and ‖·‖_{ψ_t^*} defined for vectors v ∈ R^d are

‖v‖_{ψ_t}^2 := (1/η)‖v‖_{G(t)}^2  and  ‖v‖_{ψ_t^*}^2 := η‖v‖_{G(t)^{-1}}^2.

Rewriting Lemma 8, we thus have

Σ_{t=1}^T [ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{t=1}^T [ ψ_t^*( −Σ_{τ=1}^{t−1} g(τ) ) − ψ_{t−1}^*( −Σ_{τ=1}^{t−1} g(τ) ) ] + (η/2) Σ_{t=1}^T ‖g(t)‖_{G(t)^{-1}}^2 + Σ_{t=1}^T ⟨g(t), x(t) − x̃(t)⟩ + (1/2η)‖x*‖_{G(T)}^2    (21)

for any sequence ξ^t; as in Lemma 8 we define the "corrected" sequence x̃(t) = ∇ψ_t^*(−g(1:t−1)), where g(1:t) = Σ_{τ=1}^t g(τ). Note that the corrected sequence still uses the proximal functions ψ_t from the actual run of the algorithm. We focus on bounding each of the terms in the sums (21) in turn, beginning with the summed conjugate differences.

Lemma 10. Define the matrix G(T_+) to be diagonal with jth diagonal entry (δ^2 + Σ_{t=1}^T g_j(t)^2)^{1/2}. For any sequence ξ^t,

Σ_{t=1}^T [ ψ_t^*( −Σ_{τ=1}^{t−1} g(τ) ) − ψ_{t−1}^*( −Σ_{τ=1}^{t−1} g(τ) ) ] ≤ (R_∞^2 / 2η) tr(G(T_+)).

We defer the proof of Lemma 10 to Section C.1, noting that the proof follows by carefully considering the conditions under which $\psi_t^* \ge \psi_{t-1}^*$, which may occur only when updates to $S$ (and hence $G$) are out of order, a rearrangement of the sum to put the updates to $S$ in the correct order, and an application of the AdaGrad Lemma 9.

To complete the proof of the theorem, we must bound the two summed gradient quantities in expression (21). For shorthand, let us define
$$T_1 := \sum_{t=1}^T \|g(t)\|_{G(t)^{-1}}^2 \quad \text{and} \quad T_2 := \sum_{t=1}^T \langle g(t), x(t) - \tilde{x}(t)\rangle. \qquad (22)$$
We provide the proof under the assumption that $\delta^2 \ge m M_j^2$ for all $j$. At the end of the proof, we show how to weaken this assumption while retaining the main conclusions of the theorem.

Recalling the temporal ordering notation $\prec_{S_j}$, we see that
$$T_1 = \sum_{j=1}^d \sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}.$$
Now, by our assumption that processors are at most $m$ steps out of date and $\delta^2 \ge m M_j^2$, we have
$$\frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} \le \frac{g_j(t)^2}{\sqrt{\sum_{\tau=1}^t g_j(\tau)^2}},$$
since the denominator on the left omits at most $m$ of the terms $g_j(\tau)^2$ with $\tau \le t$, each of which is at most $M_j^2$, so $\delta^2$ compensates for the omitted terms. Thus Lemma 9 implies
$$\sum_{j=1}^d \sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\sum_{\tau=1}^t g_j(\tau)^2}} \le 2 \sum_{j=1}^d \bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}. \qquad (23)$$
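The displayed denominator comparison can be illustrated numerically: with bounded gradients $|g_j(t)| \le M_j$, delays of at most $m$ updates, and $\delta^2 = m M_j^2$, the out-of-date denominator always dominates the full prefix sum. A sketch under those assumptions (the random omission pattern stands in for the true asynchronous schedule):

```python
import math
import random

random.seed(1)
T, m, M = 200, 5, 1.0
g = [random.uniform(-M, M) for _ in range(T)]
delta2 = m * M * M  # the assumption delta^2 >= m * M_j^2

for t in range(T):
    # the "seen" sum omits up to m of the most recent squared gradients
    n_missing = random.randint(0, min(m, t))
    seen = sum(gi * gi for gi in g[: t + 1 - n_missing])
    full = sum(gi * gi for gi in g[: t + 1])
    lhs = g[t] ** 2 / math.sqrt(delta2 + seen)
    rhs = g[t] ** 2 / math.sqrt(full)
    assert lhs <= rhs + 1e-12
```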

Thus we turn to $T_2$ as defined in expression (22). We focus on a per-coordinate version of $T_2$, stating the following lemma:

Lemma 11. Under the conditions of Theorem 5,
$$\frac{1}{\eta}\sum_{t=1}^T \mathbb{E}\big[g_j(t)(x_j(t) - \tilde{x}_j(t))\big] \le 2 p_j m\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg] \le 2 p_j m M_j \sqrt{p_j T}.$$

The proof of the lemma is technical, so we defer it to Section C.2. Applying the result of Lemma 11, we obtain the following bound on $T_2$:
$$\mathbb{E}[T_2] \le 2\eta \sum_{j=1}^d p_j m\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg].$$
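The final inequality in Lemma 11 follows from Jensen's inequality together with sparsity: $g_j(t)^2 \le M_j^2\, 1\{\xi_j^t \ne 0\}$, so $\mathbb{E}[(\sum_t g_j(t)^2)^{1/2}] \le (T M_j^2 p_j)^{1/2}$. A Monte Carlo sketch with illustrative values of $T$, $p_j$, and $M_j$:

```python
import math
import random

# Sparse-gradient model: coordinate j is active with probability p_j,
# and active gradients are bounded by M_j in magnitude.
random.seed(2)
T, p_j, M_j, trials = 500, 0.1, 1.0, 200
total = 0.0
for _ in range(trials):
    s = sum((M_j * random.uniform(-1, 1)) ** 2
            for t in range(T) if random.random() < p_j)
    total += math.sqrt(s)
empirical = total / trials  # estimates E[(sum_t g_j(t)^2)^(1/2)]
assert empirical <= M_j * math.sqrt(p_j * T)
```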

Combining Lemma 10 with our bound (23) on $T_1$ and the preceding bound on $T_2$, the basic inequality (21) implies
$$\mathbb{E}\bigg[\sum_{t=1}^T F(x(t);\xi^t) - F(x^*;\xi^t)\bigg] \le \frac{1}{2\eta}\,\mathbb{E}\Big[\|x^*\|_{G(T)}^2 + R_\infty^2\,\mathrm{tr}(G(T_+))\Big] + \eta \sum_{j=1}^d \mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg](1 + 2 p_j m).$$
Noting that
$$\|x^*\|_{G(T)}^2 \le R_\infty^2\,\mathrm{tr}(G(T)) \le R_\infty^2 \sum_{j=1}^d \bigg(\delta^2 + \sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}$$
completes the proof of Theorem 5 under the assumption that $\delta^2 \ge m M_j^2$ for all $j$.

C.1

Proof of Lemma 10

Since the domain $X = X_1 \times \cdots \times X_d$ is assumed Cartesian and the matrices $S$ and $G = S^{\frac{1}{2}}$ are diagonal, we focus on the individual coordinate terms of $\psi_t^*$. With that in mind, consider the difference
$$\sup_{x_j \in X_j}\bigg\{-\sum_{\tau=1}^{t-1} g_j(\tau)\, x_j - \frac{1}{2\eta} G_j(t)\, x_j^2\bigg\} - \sup_{x_j \in X_j}\bigg\{-\sum_{\tau=1}^{t-1} g_j(\tau)\, x_j - \frac{1}{2\eta} G_j(t-1)\, x_j^2\bigg\}. \qquad (24)$$
To understand the difference of the terms (24), we recall the temporal ordering $\prec_{S_j}$ defined in the beginning of the proof of Theorem 5 (we say $\tau \succeq_{S_j} t$ if and only if $\tau \not\prec_{S_j} t$). Though the matrix $S$ (and the matrix $G$) is always increasing throughout the algorithm, since only positive updates are applied to $S$, when indexed by update time we may have $G_j(t-1) \gtrless G_j(t)$. The term (24), however, may be positive only when $G_j(t) < G_j(t-1)$, and this is possible only if
$$\{\tau \in \mathbb{N} \mid \tau \prec_{S_j} t-1\} \supsetneq \{\tau \in \mathbb{N} \mid \tau \prec_{S_j} t\}.$$
Finally, we note that for any matrices $A, B$ and vector $z$, if we define
$$x(A) := \mathop{\mathrm{argmax}}_{x \in X} \{\langle z, x\rangle - \langle x, A x\rangle\},$$
then
$$\sup_{x \in X}\{\langle z, x\rangle - \langle x, A x\rangle\} - \sup_{x \in X}\{\langle z, x\rangle - \langle x, B x\rangle\} \le \langle z, x(A)\rangle - \langle x(A), A x(A)\rangle - \langle z, x(A)\rangle + \langle x(A), B x(A)\rangle \le \sup_{x \in X}\{\langle x, (B - A) x\rangle\}.$$
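This conjugate-difference fact can be checked numerically in one dimension, with a grid search standing in for the suprema (the chain of inequalities holds exactly on a common grid as well); all parameters below are illustrative.

```python
import random

def sup_quad(z, a, lo, hi, n=2001):
    """Brute-force sup over x in [lo, hi] of z*x - a*x^2 on a uniform grid."""
    return max(z * (lo + (hi - lo) * i / (n - 1))
               - a * (lo + (hi - lo) * i / (n - 1)) ** 2
               for i in range(n))

random.seed(3)
for _ in range(200):
    z = random.uniform(-2, 2)
    a, b = random.uniform(0.1, 2), random.uniform(0.1, 2)
    lo, hi = -1.0, 1.0
    diff = sup_quad(z, a, lo, hi) - sup_quad(z, b, lo, hi)
    # sup over [-1, 1] of (b - a) * x^2 is attained at an endpoint or at 0
    bound = max((b - a) * x * x for x in (lo, hi, 0.0))
    assert diff <= bound + 1e-9
```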

By considering expression (24), we have
$$\psi_t^*\Big(-\sum_{\tau=1}^{t-1} g(\tau)\Big) - \psi_{t-1}^*\Big(-\sum_{\tau=1}^{t-1} g(\tau)\Big) \le \frac{1}{2\eta} \sum_{j=1}^d \sup_{x_j \in X_j} \big\{x_j^2\,(G_j(t-1) - G_j(t))\big\} \le \frac{R_\infty^2}{2\eta} \sum_{j=1}^d |G_j(t) - G_j(t-1)|\; 1\big\{\{\tau \prec_{S_j} t-1\} \supsetneq \{\tau \prec_{S_j} t\}\big\}. \qquad (25)$$
It thus remains to bound the sum, over all $t$, of the terms (25). To that end, we note by concavity of $\sqrt{\cdot}$ that for any $a, b \ge 0$, we have $\sqrt{a+b} - \sqrt{a} \le b/(2\sqrt{a})$. Thus we find that
$$|G_j(t) - G_j(t-1)|\; 1\big\{\{\tau \prec_{S_j} t-1\} \supsetneq \{\tau \prec_{S_j} t\}\big\} = \bigg(\Big(\delta^2 + \sum_{\tau \prec_{S_j} t-1} g_j(\tau)^2\Big)^{\frac{1}{2}} - \Big(\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2\Big)^{\frac{1}{2}}\bigg)\, 1\big\{\{\tau \prec_{S_j} t-1\} \supsetneq \{\tau \prec_{S_j} t\}\big\} \le \frac{\sum_{\tau \prec_{S_j} t-1,\, \tau \succeq_{S_j} t} g_j(\tau)^2}{2\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}.$$

We note the following: the sequence of update sets $\Delta_t := \{\tau \in \mathbb{N} : \tau \prec_{S_j} t-1,\ \tau \succeq_{S_j} t\}$ satisfies $\cup_{t=1}^T \Delta_t \subset [T]$, and since the incremental updates to $S$ occur only once, we have $\Delta_t \cap \Delta_{t'} = \emptyset$ for all $t \ne t'$. That is, if $\tau \in \Delta_t$ for some $t$, then $\tau \notin \Delta_{t'}$ for any $t' \ne t$. Using the assumption that updates may be off by at most $m$ time steps, we thus see that there must exist some permutation $\{u_t\}_{t=1}^T$ of $[T]$ such that
$$\sum_{t=1}^T \frac{\sum_{\tau \in \Delta_t} g_j(\tau)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} \le \sum_{t=1}^T \frac{g_j(u_t)^2}{\sqrt{\delta^2 + \sum_{\tau \le t-m} g_j(u_\tau)^2}}. \qquad (26)$$
For our last step, we use our assumption that $\delta^2 \ge m M_j^2$ and the standard AdaGrad result (Lemma 9) to obtain
$$\sum_{t=1}^T \frac{g_j(u_t)^2}{\sqrt{\delta^2 + \sum_{\tau \le t-m} g_j(u_\tau)^2}} \le \sum_{t=1}^T \frac{g_j(u_t)^2}{\sqrt{\sum_{\tau=1}^t g_j(u_\tau)^2}} \le 2\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}.$$
Recalling inequality (25), we have
$$\sum_{t=1}^T \bigg(\psi_t^*\Big(-\sum_{\tau=1}^{t-1} g(\tau)\Big) - \psi_{t-1}^*\Big(-\sum_{\tau=1}^{t-1} g(\tau)\Big)\bigg) \le \frac{R_\infty^2}{2\eta} \sum_{j=1}^d \bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}},$$
which gives the statement of the lemma.

C.2

Proof of Lemma 11

Let us provide a bit of notation before proving the lemma. We define the batch of "outstanding" updates for coordinate $j$ at time $t$ as $B_j(t) := \{\tau : t-1 \ge \tau \succeq_{S_j} t\}$, and we define the quantity that we wish to bound in expectation in Lemma 11 as
$$\mathcal{T}_j := \frac{1}{\eta} \sum_{t=1}^T g_j(t)(x_j(t) - \tilde{x}_j(t)).$$
Turning to the proof of the lemma proper, we first note that $z(t)$ does not include any gradient terms $g(\tau)$ for $\tau \ge t$ by the definition of the AsyncAdaGrad algorithm. Thus
$$\sum_{\tau=1}^{t-1} g_j(\tau) - z_j(t) = \sum_{\tau \in B_j(t)} g_j(\tau).$$
For brief notational convenience, define $\kappa_t = \eta\big(\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2\big)^{-\frac{1}{2}}$. Applying the definition of the AsyncAdaGrad updates and Young's inequality (termwise, using that $g_j(\tau) \ne 0$ only if $\xi_j^\tau \ne 0$), we see that
$$g_j(t)\,(x_j(t) - \tilde{x}_j(t)) \le \kappa_t\, |g_j(t)|\,\Big|\sum_{\tau \in B_j(t)} g_j(\tau)\Big| \le \frac{\kappa_t}{2}\, \mathrm{card}\big(\{\tau \in B_j(t) : \xi_j^\tau \ne 0\}\big)\, g_j(t)^2 + \frac{\kappa_t}{2}\, 1\{\xi_j^t \ne 0\} \sum_{\tau \in B_j(t)} g_j(\tau)^2.$$

As a consequence, we find that
$$2\,\mathbb{E}[\mathcal{T}_j] \le \mathbb{E}\bigg[\sum_{t=1}^T \frac{\mathrm{card}(\{\tau \in B_j(t) : \xi_j^\tau \ne 0\})\, g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} + \sum_{t=1}^T \frac{1\{\xi_j^t \ne 0\} \sum_{\tau \in B_j(t)} g_j(\tau)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg]. \qquad (27)$$
Looking at the first term in the bound (27), we note that $B_j(t)$ consists of time indices $\tau \succeq_{S_j} t$ with $\tau \le t-1$, which consequently have not been incorporated into any vectors used in the computation of $g_j(t)$. Thus, if we let $\mathcal{F}_{t,j}$ denote the $\sigma$-field containing $\xi_j^t$ and $\xi_j^\tau$ for $\tau \prec_{S_j} t$, we have $g_j(\tau) \in \mathcal{F}_{t,j}$ for any $\tau \prec_{S_j} t$, $g_j(t) \in \mathcal{F}_{t,j}$, and we also have that $\xi_j^\tau$ is independent of $\mathcal{F}_{t,j}$ for $\tau \in B_j(t)$. Thus, iterating expectations, we find
$$\mathbb{E}\bigg[\frac{\mathrm{card}(\{\tau \in B_j(t) : \xi_j^\tau \ne 0\})\, g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg] = \mathbb{E}\bigg[\frac{\mathbb{E}\big[\mathrm{card}(\{\tau \in B_j(t) : \xi_j^\tau \ne 0\})\, g_j(t)^2 \mid \mathcal{F}_{t,j}\big]}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg] \le p_j m\, \mathbb{E}\bigg[\frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg],$$
since $\mathbb{E}[\mathrm{card}(\{\tau \in B_j(t) : \xi_j^\tau \ne 0\})] \le p_j m$ because $|B_j(t)| \le m$ by assumption. A similar iteration of expectations, since $\xi_j^t$ is independent of any $g_j(\tau)$ for $\tau \in B_j(t)$, yields
$$\mathbb{E}\bigg[1\{\xi_j^t \ne 0\} \sum_{\tau \in B_j(t)} g_j(\tau)^2\bigg] \le p_j\, \mathbb{E}\bigg[\sum_{\tau \in B_j(t)} g_j(\tau)^2\bigg].$$
We replace the relevant terms in the expectation (27) with the preceding bounds to obtain
$$2\,\mathbb{E}[\mathcal{T}_j] \le p_j m\, \mathbb{E}\bigg[\sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg] + p_j\, \mathbb{E}\bigg[\sum_{t=1}^T \frac{\sum_{\tau \in B_j(t)} g_j(\tau)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg].$$
For the second term, note that each $g_j(\tau)$ can occur in at most $m$ of the sets $B_j(t)$, and the maximum delay is also at most $m$. Thus, following the same argument as (26), there must exist a permutation $\{u_t\}$ of the indices $[T]$ such that
$$\sum_{t=1}^T \frac{\sum_{\tau \in B_j(t)} g_j(\tau)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} \le \sum_{t=1}^T \frac{m\, g_j(u_t)^2}{\sqrt{\delta^2 + \sum_{\tau=1}^{t-m} g_j(u_\tau)^2}} \le \sum_{t=1}^T \frac{m\, g_j(u_t)^2}{\sqrt{\sum_{\tau=1}^t g_j(u_\tau)^2}} \le 2m \bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}},$$
where we have used the fact that $\delta^2 \ge m M_j^2$ and Lemma 9. Bounding the first term in the same manner as in inequality (23), we immediately find that
$$2\,\mathbb{E}[\mathcal{T}_j] \le p_j m\, \mathbb{E}\bigg[\sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\sum_{\tau=1}^t g_j(\tau)^2}}\bigg] + 2 p_j m\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg] \le 4 p_j m\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg].$$
By inspection, this completes the proof of the lemma.
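The rearrangement argument behind the $2m$ bound admits a simple numerical sketch: random outstanding batches drawn from the last $m$ indices stand in for the true asynchronous schedule, and $a_\tau$ plays the role of $g_j(\tau)^2 \le M_j^2$; the construction and constants here are ours, chosen for illustration.

```python
import math
import random

random.seed(4)
T, m, M = 300, 4, 1.0
a = [random.random() * M * M for _ in range(T)]  # a_t plays g_j(t)^2 <= M^2
delta2 = m * M * M  # the assumption delta^2 >= m * M_j^2

lhs = 0.0
for t in range(T):
    # outstanding batch: a random subset of the last (up to) m indices,
    # so each index lands in at most m batches
    window = list(range(max(0, t - m), t))
    batch = [tau for tau in window if random.random() < 0.5]
    seen = sum(a[:t]) - sum(a[tau] for tau in batch)  # already-applied updates
    lhs += sum(a[tau] for tau in batch) / math.sqrt(delta2 + seen)
rhs = 2 * m * math.sqrt(sum(a))
assert lhs <= rhs
```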

C.3

Sharpening the analysis (proof of Corollary 6)

We now demonstrate how to sharpen the analysis in the proof of Theorem 5 to allow the initial matrix $\delta^2$ to be smaller than $m M_j^2$. Roughly, we argue that for a smaller setting of $\delta^2$, we can have $\delta^2 \ge \sum_{\tau=t-m+1}^{t} g_j(\tau)^2$ for all $t$ with high probability, in which case all the previous arguments go through verbatim. In particular, we show how the terms $T_1$ and $T_2$, defined in expression (22), may be bounded under the weaker assumptions on $\delta^2$ specified in Corollary 6. For this argument, we focus on $T_1$, as the argument for $T_2$ is identical.

We begin by defining the event $\mathcal{E}$ to occur if $\delta^2 \ge \sum_{\tau=t-m+1}^{t} g_j(\tau)^2$ for all $t$. We then have
$$1\{\mathcal{E}\} \sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} \le 1\{\mathcal{E}\} \sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\sum_{\tau=1}^t g_j(\tau)^2}} \le 2\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}$$
by Lemma 9. On the other hand, on $\mathcal{E}^c$, we have by our assumption that $\delta^2 \ge M_j^2$ that
$$1\{\mathcal{E}^c\} \sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}} \le 1\{\mathcal{E}^c\} \sum_{t=1}^T |g_j(t)|,$$
so if we can show that $\mathcal{E}^c$ has sufficiently low probability, then we still obtain our desired results. Indeed, by Hölder's inequality we have
$$\mathbb{E}\bigg[\sum_{t=1}^T \frac{g_j(t)^2}{\sqrt{\delta^2 + \sum_{\tau \prec_{S_j} t} g_j(\tau)^2}}\bigg] \le 2\,\mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg] + \mathbb{E}\bigg[1\{\mathcal{E}^c\} \sum_{t=1}^T |g_j(t)|\bigg] \le 2\,\mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg] + \mathbb{E}[1\{\mathcal{E}^c\}]^{\frac{1}{2}}\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T |g_j(t)|\bigg)^{2}\bigg]^{\frac{1}{2}}. \qquad (28)$$

It thus remains to argue that $P(\mathcal{E}^c)$ is very small, since
$$\mathbb{E}\bigg[\bigg(\sum_{t=1}^T |g_j(t)|\bigg)^{2}\bigg] \le T\, \mathbb{E}\bigg[\sum_{t=1}^T g_j(t)^2\bigg]$$

by Jensen’s inequality. Now, we note that gj (t)2 ≤ Mj2 and that the ξjt are i.i.d., so we can define n o the sequence Xt = 1 ξjt 6= 0 and we have     t+m−1 t+m−1 X X 2 2 2 2 P(E ) = P ∃ t ∈ [T ] : gj (τ ) > δ ≤ P ∃ t ∈ [T ] : Xt > δ /Mj . c

τ =t

τ =t

Defining $\gamma = \delta^2/M_j^2$ (the threshold in the probability above) and letting $p = p_j$ for shorthand, since $X_t \le 1$, $\mathbb{E}[X_t] \le p$, and $\mathrm{Var}(X_t) \le p(1-p)$, Bernstein's inequality implies that for any fixed $t$ and any $\epsilon \ge 0$,
$$P\bigg(\sum_{\tau=t}^{t+m-1} X_\tau \ge pm + \epsilon\bigg) \le \exp\bigg(-\frac{\epsilon^2}{2mp(1-p) + 2\epsilon/3}\bigg). \qquad (29)$$
By solving a quadratic, we find that if
$$\epsilon \ge \frac{1}{3}\log\frac{1}{\delta} + \sqrt{\frac{1}{9}\log^2\frac{1}{\delta} + 2mp(1-p)\log\frac{1}{\delta}},$$
then the quantity (29) is bounded by $\delta$ (here $\delta \in (0,1)$ temporarily denotes a failure probability rather than the regularizer). By a union bound (and minor simplification), we find that
$$\epsilon \ge \frac{2}{3}\log\frac{1}{\delta} + \sqrt{2mp(1-p)\log\frac{1}{\delta}} \quad \text{implies} \quad P(\mathcal{E}^c) \le T\delta.$$
Setting $\delta = T^{-2}$ means that $P(\mathcal{E}^c) \le 1/T$, which in turn implies that
$$\mathbb{E}[1\{\mathcal{E}^c\}]^{\frac{1}{2}}\, \mathbb{E}\bigg[\bigg(\sum_{t=1}^T |g_j(t)|\bigg)^{2}\bigg]^{\frac{1}{2}} \le \frac{1}{\sqrt{T}} \cdot \sqrt{T}\, \mathbb{E}\bigg[\sum_{t=1}^T g_j(t)^2\bigg]^{\frac{1}{2}} = \mathbb{E}\bigg[\sum_{t=1}^T g_j(t)^2\bigg]^{\frac{1}{2}}.$$

Combining the preceding display with inequality (28), we find that the term $T_1$ from expression (22) is bounded by
$$\mathbb{E}[T_1] \le \sum_{j=1}^d \bigg(2\,\mathbb{E}\bigg[\bigg(\sum_{t=1}^T g_j(t)^2\bigg)^{\frac{1}{2}}\bigg] + \mathbb{E}\bigg[\sum_{t=1}^T g_j(t)^2\bigg]^{\frac{1}{2}}\bigg)$$
whenever $\delta^2 \ge \frac{4}{3}\log T + 2\sqrt{m p_j(1-p_j)\log T}$ for all $j \in [d]$. This completes the sharper proof for the bound on $T_1$. To provide a similar bound for $T_2$ in analogy to Lemma 11, we recall the bound (27). Then following the above steps, mutatis mutandis, gives the desired result.
