Online Forecast Combinations of Distributions: Worst Case Bounds Alessio Sancetta∗ Faculty of Economics, University of Cambridge, UK August 7, 2006

Abstract This paper considers forecasts with distribution functions that may vary through time. The forecast is achieved by time varying combinations of individual forecasts. We derive theoretical worst case bounds for general algorithms based on multiplicative updates of the combination weights. The bounds are useful for studying properties of forecast combinations when data are nonstationary and there is no unique best model. Keywords: Expert, Forecast Combination, Multiplicative Update, Nonasymptotic Bound, On-line Learning. JEL: C53, C14.

1

Introduction

This paper studies forecasts combinations that achieve optimal theoretical properties for online forecasting of distributions with parameters that may be time varying. If the forecast errors are uniformly bounded, we show that this also covers point predictions for many loss functions, such as square loss, LinEx and absolute deviation. The goal is to use sequential strategies (or algorithms) that allow us to forecast the distribution of new observations (within a given reference class) almost as accurately as if we had prior knowledge of them. To do this, we borrow ideas from ∗

Address for correspondence: Alessio Sancetta, Faculty of Economics, University of Cambridge, Sidgwick Avenue, Cambridge CB3 9DE, UK. E-mail: [email protected].

1

the game theory (e.g. see special issue in Games and Economic Behavior, Vol. 29, 1999) and computational learning theory (e.g. Vovk, 1990, Cesa-Bianchi et al., 1997). Predictions using forecast combinations are also known as predictions with expert advice, where the term expert identifies forecasts that are exogenous to the econometrician’s decisions. We are interested in algorithms that lead to optimal error bounds for worst case scenarios. These bounds do not make any assumption on the data generating process, we do not even need to assume that the data are realizations of some sequence of random variables. The worst case bounds derived here compare to the bounds derived by Herbster and Warmuth (1998) and owe a lot to their presentation and results. An advantage of the present study is that a set of conditions are established so that results can be derived in general form without the need for derivations on a case by case basis. This also allows us to gain a better understanding of the terms that contribute to the total error of the algorithm. Consequently, the algorithms can be modified to improve the theoretical bounds. We also state an additional algorithm that produces combination weights that, unlike those in Herbster and Warmuth (1998), do not depend on the unknown parameters of the share update. The same algorithm could also be applied to the case of an unknown learning rate in order to be independent of any possible unknown parameters. Moreover, the results are derived for forecasting distributions rather than sequences (i.e. point prediction). It will be shown that this framework is more general and encompasses point prediction. Yang (2004) studies probabilistic bounds, related to worst case bounds, in forecast combination of point prediction under the square loss. Both probabilistic bounds and worst case bounds are of interest, so the two studies are complementary. Yang (2000) gives an algorithm to predict densities measuring the performance in terms of a sequential version of the Kullback-Leibler distance. Again, those are probabilistic bounds, so that they, too, can be seen as complementary. The literature on combination of forecasts is broad, and an excellent survey is provided by Timmermann (2004, section 7 for probability forecasts). Several studies have shown that combining forecasts can be a useful hedge against structural breaks, and forecast combinations are often more stable than single forecasts (e.g. Hendry and Clements, 2004, Stock and Watson, 2004). A fundamental component of forecast combination is the choice of combination rule and combination weights. In particular, given a combination rule, it is customary to derive combination weights using moment estimators. The forecast 2

combination is chosen to minimize the user’s expected loss over all possible decisions (e.g. Elliott and Timmermann, 2004). This requires some stability of the system and assumptions about the data generating process. Worst case bounds do not require probabilistic assumptions about the data. The combination weights are based on sequential updating and the problem is cast in a game theoretic framework. There are two players: the econometrician uses the individual forecasts to produce a combined forecast; nature samples the data and a loss is consequently incurred by the econometrician. The econometrician needs to pool the individual forecasts and to do at least as well as the best forecast, or some combination of forecasts, no matter what data are sampled by nature. He minimizes his loss given that nature’s goal is to sample data to maximize this loss. In this case, the objective is to do as well as some chosen a priori subset of individual forecasts. Hence, this is a minimax problem with respect to the observed cumulated loss. The use of observed cumulated loss has an appealing interpretation in terms of the falsability principle, addressed by Popper and adapted to the statistical framework via the prequential principle of Dawid (e.g. Dawid, 1984, 1985, 1986). Mutatis mutandis, this approach can be seen as a variation of the -robust decision rule of Chamberlain (2000), where the set of data generating processes is restricted to the empirical measure. We also consider the case when the best individual forecast may change over time, i.e. one model may perform better over some period, but is outperformed in other periods. In the case of misspecified models this is of fundamental importance. For example, the best model might change over time when data are nonstationary. The combination weights will be time varying, but unlike previous work (e.g. Deutsch et al. 1994) there is no need to estimate the changes in regime. However, the relative frequency of changes in regime can be used to improve the results. When the best individual forecast may change over time, we show that forecast combination is improved by retaining some non-zero combination weight for all the individual forecasts. In fact, the procedures considered in this paper perform some form of online shrinkage. The idea of shrinking the combination weights is not new in the econometric literature (e.g. Diebold and Pauly, 1990, Aiolfi and Timmermann, 2004). The plan for the paper is as follows. Section 2 introduces background material. Section 3 states algorithms based on extensions of the exponential update of Vovk (1990), which is also the algorithm used in Yang (2004), and derives general worst case bounds. Section 4 provides experimental evidence for the algorithms using both synthetic and real data. One interesting result is that forecast combination 3

of simple models might be used to approximate an unknown function. Section 5 shows how to cover the case of point prediction for many loss functions. Further remarks can be found in Section 6.

1.1

Notation

Unless specified otherwise, throughout the paper the following notation is used. For a set A, B ⊂⊂ A, means that B is a closed set inside A. N is the set of natural numbers, i.e. 1, 2, 3, .., and Z+ is the set of non-negative integers, i.e. 0,1,2,.... If A is a set with countable elements, #A stands for the cardinality of A. Sn stands for the n ∈ N dimensional unit simplex. Suppose I is a set with a countable number of elements, then aI := (ai )i∈I is a #I dimensional vector. For vectors a and b having same dimension, ha, bi is their inner product. Suppose as,t is a vector or a scalar. For legibility reasons we may write a (s, t) . Let X be a random variable. For Pr (X ≤ x) define ∂x Pr (X ≤ x) to be the density function or the mass function of X or the density function plus the atom at x, depending on the Lebesgue decomposition of the measure corresponding to X. The distribution function corresponding to this measure will be denoted by P , i.e. P (x) = Pr (X ≤ x) , and p (x) := ∂x Pr (X ≤ x). Finally, for a set A, IA is the indicator function of A (one if A is true, zero otherwise).

2

Background

We face the following sequential problem at time t = 0, ..., T − 1. Suppose (Xt )t∈Z+ is a sequence of random variables with values in RS , S ≥ 1 and define Ft to be the sigma algebra generated by (Xs )s≤t . The data generating process is unknown. We observe realizations of (Xs )s≤t , say x0 , .., xt .1 These could be stock market returns © ª from time 0 to t. Then, we suppose there is a collection of models Pθ(e) : θe ∈ Θe ⊂⊂ Rd(e) , de ∈ N e ∈ E where E is the index set of forecasters, which shall also be called reference class.2 In words, this means that for each e, Pθ(e) is a distribution function (i.e. a probability model) depending on a parameter θe which is an element of a parameter space Θe , a closed subset inside Rd(e) , where d (e) is a positive integer giving 1

Actually, we do not need x0 , .., xt to be realizations of random variables, but for the sake of explanation it is convenient to treat them as such. 2

As mentioned in the Introduction, individual forecasters are also called experts in order to stress that the forecasts issued by this forecasters are exogenous to the econometrician’s decision. Hence, each e ∈ E identifies an expert who issues a forecast.

4

the dimension given the individual ³ ´of the parameter space. At time t − 1, we are © ª ˆ forecasts θe,t to be used as parameters in the models Pθ(e) : θe ∈ Θe e∈E at e∈E time t. The econometrician needs to issue a probability forecast, say PW,t . When xt is observed, the econometrician suffers a loss R (pW,t ) := − ln pW,t (xt ) . In particular, the econometrician will use an algorithm that will produce a probability on E at each in time, say (we,t )e∈E . The forecast pW,t will be a function of ´ ³ point © ª and Pθ(e) only. The term forecast combination rule will (we,t ) , ˆθe,t e∈E

e∈E

e∈E

be used when referring either to the rule used to produce pW,t or directly to mean pW,t . The econometrician’s forecast must satisfy the following condition: ³ ´ ˆ Condition 1 For any forecasts θE = ˆθe , outcome x and wE = (we )e∈E ∈ S#E , e∈E ∃c < ∞, η > 0 such that ´´ n ³ ´o ³ ³ X ˆ ≤ −c ln we exp −ηR pˆθ(e) (x) , R pW x|wE , θE e∈E

³ ´ ˆ where pW x|wE , θE is the probability forecast based on an arbitrary vector wE in the unit simplex, forecasts ˆθE , and model {pθ } . Remark 1 The parameter η is called the learning rate and depends on the forecast combination rule used. Condition 1 can be rewritten as h ³ ´i1/c X h iη pW x|wE , ˆθE ≥ we pˆθ(e) (x) , (1) e∈E

which implies that as we decrease η we shall increase c. In most cases, we can choose c = 1/η. Notable exceptions within the framework of Example 2 exist and the results in Section 5 deal with this issue more in depth. To clarify how η and c depend on the forecast combination rule, two special examples of forecast combination rules follow. Example 1 ³ The forecast ´ Pcombination rule is a mixture of the individual models {pθ } : pW x|wE , ˆθE = e∈E we pˆθ(e) (x) . This forecast combination rule is often called the linear opinion poll (e.g. Genest and Zidek, 1986). From (1), we see that Condition 1 is satisfied for c = 1/η and η ≤ 1, as this implies concavity of " #η h ³ ´iη X pW x|wE , ˆθE = we pˆ (x) θ(e)

e∈E

with respect to wE . However, Theorems 1 and 2, in Section 3.3, will show that there is an optimal choice of η and this is η = 1. 5

Example 2 Suppose Θe = Θ and Θ is convex. Then, the forecast combination rule is pθ with parameter ³ θ being the forecasts with respect ´ mean ³ ofD the individual E´ ˆ ˆ to the measure wE , i.e. pW x|wE , θE = p x| wE , θE , where p (x|θ) := pθ (x) . If p (x|θ) is the Gaussian density variance and mean θ, we would predict a D with known E ˆ Gaussian density with mean wE , θE . In this case, Condition 1 is satisfied with c = 1/η if ∃η > 0 such that exp {−ηR (pW (x|θ))} is concave in θ for any x in the range of the sample observations. From Remark 1, this is equivalent to say that [pθ (x)]η is concave in θ for any x in the range of the sample observations. Several special examples when this is true (and not true) will be provided in Section 5. If Condition 1 is satisfied with c = 1/η = 1, some nice interpretations are also possible, and will be provided in Section 3.1. An ambitious goal is to find a sequential algorithm that allows us to find the combination weights (we,t )e∈E , such that for any data sequence x1 , ..., xT , T X t=1

T ³ ³ ³ ³ ´´ ´´ X R pW xt |wE,t , ˆθE,t R p xt |ˆθe,t + error. ≤ min e∈E

(2)

t=1

In this case we have error = O (ln (#E)) (Theorem 1). This bound says that the sequential algorithm used by the econometrician will produce forecasts as good as the forecasted probability of the best forecast, plus a term O (ln (#E)), i.e. the sequential forecast combination and the best individual forecast produce almost the same error. This last statement makes sense if pW (x|wE , θE ) and p (x|θe ) can be nested (in one another or within a larger family of distributions). This is the case for Examples 1 and 2. If we expect different forecasts to perform better over different subsets of x1 , ..., xT , an even more ambitious goal is to track the best forecast over time (Theorem 2): T X t=1

K ´´ X ´´ ³ ³ ³ ³ X R pW xt |wE,t , ˆθE,t ≤ min R p xt |ˆθek ,t + error, k=0

ek ∈E

(3)

tk ∈IT (k)

where IT0 ,...,ITK is a partition of {1, ..., T }. In this case, under certain conditions, error = o (T ) (Corollary 3).

2.1

Prequential Interpretation

P The function − Tt=1 R (pW,t (xt )) differs from the usual loglikelihood function, as here the loglikelihood per observation at time t is constructed using Ft−1 measurable parameters. This loglikelihood is called the prequential likelihood (Dawid, 6

P 1986) and according to the same literature, Tt=1 R (pW,t ) is a proper scoring rule; smaller values are preferred to larger. This approach of model evaluation is consistent with the view that the validity of the model should be tested on observables. There is no need to introduce the concept of probability in this context: we are not finding an estimator for the maximum of the expected log-likelihood. For the sake of simplicity consider the bound in (2). Probabilistic bounds as in Yang (2004) are of the kind T T ´´ ´´ error ³ ³ ³ ³ 1X 1X ˆ , ≤ min ER pW xt |wE,t , θE,t ER p xt |ˆθe,t + e∈E T T t=1 T t=1

i.e. the bounds hold in expectation with error = o (T ). On the other hand (2) gives T T ´´ 1X ³ ³ 1 X ³ ³ ˆ ´´ error ˆ , ≤ min R pW xt |wE,t , θE,t R p xt |θe,t + e∈E T T t=1 T t=1

and if error = o (T ) this implies that the bound does not only hold in expectation, but also for any sample sequence. Recalling that R (pW,t ) := − ln pW,t (xt ), ³ ´⎞ ⎛ T pW xt |wE,t , ˆθE,t X ⎠ ≥ −error. ³ ´ min ln ⎝ e∈E ˆ p xt |θe,t t=1

Minimization of error, implies maximization of the prequential loglikelihood ratio ³ ´⎞ ⎛ ˆ T p |w , θ x 1 X ⎝ W t E,t E,t ⎠ ³ ´ ln →0 min e∈E T ˆ p xt |θe,t t=1

if error = o (T ). The Kullback-Leibler information criterion uses the true expectation in place of the empirical expectation (i.e. sum divided by sample size). The bounds derived here are worst case bounds because no condition is imposed on the sample sequences. However, for the results of this paper to hold we need to satisfy Condition 1, which can often be restrictive.

3

The Algorithm

This section introduces the multiplicative algorithms that will be used for issuing the probability forecasts of the econometrician. 7

We need to find an Ft−1 measurable strategy that produces the weights (we,t )e∈E . This is achieved using multiplicative updating algorithms. These algorithms have been studied by several authors (e.g. Vovk, 1990, Cesa-Bianchi et al, 1997, Herbster and Warmuth, 1998, Bousquet and Warmuth, 2002). We need to define a transition function ut (e, e0 ) : E × E → R, known as the share update function. The choice of ut is a fundamental ingredient that determines the order of magnitude of error in (2) and (3). A precise definition will be given later. Unless specified otherwise, in the remaining of this section, we write θe,t for ˆθe,t , which is an expert’s forecast. The algorithm is given in Exhibit 1. [Exhibit 1] Remark 2 For ut+1 (e, e0 ) = I{e=e0 } the weight update is the one originally proposed by Vovk (1990) and also considered in Yang (2004). Other choices of ut+1 (e, e0 ) have been considered by Herbster and Warmuth (1998) Bousquet and Warmuth (2002). The results below show that we can derive results similar to Herbster and Warmuth (1998) for general share updates. A brief explanation of the algorithm is required. We start at time 1 using an equally weighted forecast combination we,1 := 1/ (#E) . Once a weight vector (we,t )e∈E and the forecasts (θe,t )e∈E are available, the econometrician constructs his forecast pW (...|wE,t , θE,t ) (using a forecast combination rule, e.g. see Examples 1 and 2). Then, observation xt is revealed, and the econometrician incurs a loss ¡ ¢ R (pW (xt |wE,t , θE ,t )) and uses the loss of each forecast R pθ(e,t) to construct an 0 update ve,t+1 , which can be seen as an ex-post combination weight. The weight 0 ve,t+1 is then mixed with the other weights using the share update ut+1 (e, e0 ) and standardized for every e ∈ E to obtain the final weight vector wE,t+1 in the unit simplex.

3.1

Bayesian Interpretation

When η = 1, the algorithm has a Bayesian interpretation. Suppose that (Et )t∈N is a sequence of random variables with values in E, which does not need to be Ft measurable. Think of Et as the true model at time t. Using the notation from the Introduction, © ¡ ¢ª ∂x(t) Pr (Xt ≤ xt |Et = e, Ft−1 ) = exp −R pθ(e,t) = pθ(e,t) (xt ) , and

∂x(t) Pr (Xt ≤ xt |E, Ft−1 ) = 8

X e∈E

we,t pθ(e,t) (xt ) .

In words, ∂x(t) Pr (Xt ≤ xt |Et = e, Ft−1 ) is the density forecast conditioning at time t − 1 and assuming that the true model is the eth one. In a Bayesian framework, we are interested in deriving the probability of one of the models being the correct one given the past observations. The algorithm implies that the distribution of Et is characterized by the following quantities Pr (Et = e|Ft−1 ) = we,t , 0 Pr (Et = e|Ft ) ∝ Pr (Et = e|Ft−1 ) ∂x(t) Pr (Xt ≤ xt |Et = e, Ft−1 ) ∝ ve,t+1

Pr (Et+1 = et+1 |Et = et , Ft ) ∝ ut+1 (et+1 , et ) , where the last display is valid under specific restrictions on ut+1 (Condition 2). Therefore, the probability of model e being the true one at time t + 1 is obtained by the updating rule X Pr (Et+1 = e|Ft ) = Pr (Et = e0 |Ft ) Pr (Et+1 = e|Et = e0 |Ft ) e0 ∈E

∝ we,t+1 .

This last step is obtained using the share update of the algorithm. Note that the share update is independent of the forecast combination rule chosen by the econometrician. However, to admit the above Bayesian interpretation, Condition 1 needs to be satisfied with c = 1/η = 1. While being multimodal and dispersed, the forecast combination rule of Example 1 always allows for c = 1/η = 1.

3.2

Differences from a Bayesian Prediction

Notice that in a Bayesian framework, e would be usually associated with a model depending on an unknown parameter. The Bayes predictor, assuming Et = e, is θB(e),t := arg min E [R (p (Xt |θ)) |Et = e, Ft−1 ] , θ

i.e. the expectation is taken with respect to model e. The forecast ˆθe,t does not need to be equal to θB(e),t (i.e. the forecasts do not need to minimize a loss function). Then, X £ ¡ ¡ ¢¢ ¤ θB,t := arg min Pr (Et = e0 |Ft−1 ) E R p Xt |θB(e),t |Et = e0 , Ft−1 (4) e∈E

e0 ∈E

is the Bayes choice of θt . Alternatively, we can average over the models and X Pr (Et = e|Ft−1 ) E [R (p (Xt , θ)) |Et = e, Ft−1 ] θBA,t := arg min θ

e∈E

9

(5)

is the Bayes average choice of θt . Note the following two differences. First, (5) delivers a value for θ, and not the whole model. However, we can identify the whole model as the mixture of densities using the optimal parameter. Second, the criterion function for (4) and (5) is derived using expectation of the risk in terms of the conditioning model. The criterion function of the algorithm is the prequential log-likelihood, and no expectation is taken as it is evaluated at observable quantities only.

3.3

Properties of the Algorithm

We introduce the following. Condition 2 ut (e, e0 ) ≥ 0,

X

e∈E

ut (e, e0 ) = 1 (∀t ∈ N, e, e0 ∈ E) .

Remark 3 Condition 2 is satisfied when ut+1 (et+1 , et ) is viewed as the probability of going from et to et+1 (et , et+1 ∈ E) . Hence, Condition 2 is satisfied when (ut+1 (e, e0 ))(e0 ,e)∈E 2 is a Markov transition matrix. When the best forecast does not change over time, we have the following. Theorem 1 Under Conditions 1 and 2, ∀e ∈ E ¡ ¢ R1,...,T (pW ) ≤ cηR1,...,T pθ(e) + error,

where

error = −c ln

ÃT Y

!

ut+1 (e, e)

t=1

− c ln we,1

Remark 4 The statement of Theorem 1 (as well as the other results) becomes of particular interest when cη = 1. In order to stress the role of the parameters c and η, the condition cη = 1 was not directly imposed. As discussed in Remark 1, cη = 1 for suitable forecast combination rules (see Examples 1 and 2, and Section 5). In this case, the learning rate η is set as large as possible within the range allowed by Condition 1. This is optimal as it minimizes the error bound: large η = 1/c implies a small c. Remark 5 Theorem 1 gives a bound for the error in (2), and to minimize it when no prior information about the performance of the forecasts is available, we shall choose we,1 = 1/ (#E) (as in Exhibit 1) and ut+1 (e, e0 ) = I{e=e0 } , which implies error = c ln (#E) in (2). 10

(6)

If the best forecast changes over time, we partition the segment IT = (1, ..., T ) into K + 1 subsegments, IT (k) = (tk , ..., tk+1 − 1), k = 0, ..., K. Then, t0 = 1, #IT (k) = tk+1 − tk , and we let ek ∈ E, k = 0, ..., K. Theorem 2 Under Conditions 1 and 2, ∀ek ∈ E, k = 0, ..., K, K < T , R1,...,t (pW ) ≤ cη where error = c ln (#E) − c

K X k=1

K X k=0

¡ ¢ Rt(k),...,t(k+1)−1 pθ(e(k)) + error,

ln ut(k) (ek , ek−1 ) − c

K t(k+1)−2 X X k=0

ln (us+1 (ek , ek )) .

(7)

s=t(k)

Remark 6 Theorem 2 gives a bound for the error in (3) and holds for any K + 1 (< T ) partition. If K = o (T 1− ) , > 0, and we can choose the update to satisfy ln ut(k) (ek , ek−1 ) = O (T ) (∀k) and ln us+1 (ek , ek ) = O (T − ), then, error = o (T ). This can be done in practice (actually, in some cases, we can even do slightly better than this: see Corollary 3 in the next section). When K = O (T ), error = O (T ), but this quantity can be still minimized using the bound in Theorem 2 and an appropriate share update (e.g. Section 3.6). The statement of Theorem 2 works for all share updates satisfying Condition 2 and simplifies to known results if we choose some special cases of share updates (e.g. Theorem 1 in Herbster and Warmuth, 1998). The present derivation based on Condition 2 allows us to clearly identify the contribution of each single term in the error bound. Theorem 2 is the basis for the bounds of the algorithm in Exhibit 2 (Section 3.5), which is new.

3.4

Choosing a Share Update

Using the result of Theorem 2, we try to develop some intuition for an optimal share update. For the sake of explanation, suppose ut+1 (e, e0 ) is the transition probability of going from e0 to e at time t + 1. Comparing Theorems 1 and 2, we have T K t(k+1)−2 X X X − ln (ut+1 (e, e)) ≥ − ln (us+1 (ek , ek )) t=1

k=0

s=t(k)

if the probability of keeping the same individual model is the same. However, in Theorem 2 we also have the extra term K X − ln ut(k) (ek , ek−1 ) , k=1

11

which accounts for shifting from forecaster ek−1 to forecaster ek at time tk . To minimize the sum of the above two displays, we need to redistribute transition probabilities giving little weight to transitions from and to forecasts that are likely to provide a bad performance. In so doing, our goal would be to derive an optimal share update which minimizes the error bound. Assume, for the moment, that we know when a shift in the best forecast occurs. If, knowing that a shift will occur at time t + 1, we have uniform beliefs on the best performing expert, we should set ut+1 (e, e0 ) = (#E − 1)−1 (e 6= e0 because there is a shift), and ut+1 (e, e) = 1 when there is no shift. This is because there are #E forecasts, and if there is a shift, the set of possible remaining forecasts has cardinality (#E − 1) . Note that this choice minimizes the error bound conditioning on uniform believes. Substituting in (7), we would have error = cK ln (#E − 1) + c ln (#E) < (K + 1) c ln (#E) , and this bound is K + 1 times the error in (6). The above bound cannot be substantially improved without hindsight. Even under uniform beliefs, the above bound is unattainable if we do not know when a change occurs. Hence, in this case, we may assign a probability to the binary event of a change occurring or not occurring, say At is the event ”a shift occurs at time t”. If we assume independence between forecasts and shifts, then we must have ut+1 (e, e0 ) = [1 − Pr (At )] I{e=e0 } + Pr (At ) (#E − 1)−1 I{e6=e0 } .

(8)

Therefore, under the above conditions, to derive an optimal share update, we need a good estimate of Pr (At ). Without following the above reasoning, Herbster and Warmuth (1998) provide two specific choices of ut+1 (e, e0 ) (a Fixed Share and Variable Share update) where Pr (At ) is parametrized in terms of the number of shifts K. Bousquet (2003) provides a Bayesian algorithm that puts some prior in the parameter of the Fixed Share algorithm in order to update this parameter sequentially. Most reasonable updates are based on (8) and only try to model Pr (At ) . However, further knowledge might be incorporated into the share update (e.g. Example 4). For the sake of concreteness, we give a few very simple examples which satisfy Condition 2. Example 3 Suppose the following conditional distribution of shifting from forecast e0 to forecast e at time t + 1, ut+1 (e, e0 |λ) = (1 − λ) I{e=e0 } + 12

λ I{e6=e0 } , #E − 1

(9)

where λ ∈ Λ = [0, 1] . Hence, the probability that there is no shift in the best forecast e0 is (1 − λ) , while the probability of changing is λ. Then, given that there is a shift, the probability of shifting to any of the remaining forecasts is uniform, i.e. (#E − 1)−1 . This share update is the Fixed Share update considered in Herbster and Warmuth (1998). Extensions are possible. For example the parameter λ might depend on the individual forecast and its performance, leading to a variable share update (e.g. Herbster and Warmuth, 1998). The practical effect of any update is to shrink the weights towards some value greater than zero. Hence, the weights of badly performing forecasts are usually increased, while the weights of good ones are usually shrunk. This is clearly sensible in a nonstationary environment were we are completely ignorant of the time series dynamics. The idea is to keep some non-zero weight for all the forecasts because there could be a remote probability that badly performing forecasts could perform well at some other point in time (e.g. Section 4.1.1 for an illustration). However, improvements can be achieved incorporating extra a priori knowledge into the share update. Example 4 Suppose that E ⊂ N so that the elements in E can be ordered. Suppose that shifts in the best forecasts are local in the sense that at time t from e, we can only go to {e − 1, e, , e + 1}. Then, it is sensible to consider the following share update λ ut+1 (e, e0 |λ) = (1 − λ) I{e=e0 } + I{e0 =(e±1)} , (10) 2 where λ ∈ Λ = [0, 1] . Unlike the Fixed share update, (10) gives zero probability to the weights that are not in {e − 1, e, e + 1}, but models the probability of shifts as being constant as in Example 3. Other examples are possible depending on the circumstances. If we assume independence between shifts and forecasts a crucial parameter to model is λ. For this reason, even if we restrict λ = Pr (At ) to be time invariant, it is very important to find the optimal λ.

3.5

Algorithm to Learn the Share Update

Suppose that ut+1 (e, e0 |λ) , λ ∈ Λ is a class of share updates. To help our intuition we could think of λ as being related to Pr (At ), as in the above examples. However, the framework is more general than this, as we may also consider λ to be a high dimensional parameter that determines the whole share update, e.g. a 13

collection of different share updates. Suppose we choose a finite number of these update functions with parameter λl , l ∈ L (where L is a finite and countable set). We can extend the previous algorithm to the case where we want to find the best λl . For simplicity, but with abuse of notation, ut+1 (e, e0 |l) := ut+1 (e, e0 |λl ) . Moreover, we shall define wE,l,t to be the combination weight at time t obtained by running the algorithm in Exhibit 1 with share update parameter λl ; hence pW (l),t := pW (x1 |wE,l,t θE,t ) is the density forecast at time t using Exhibit 1 and λl . The idea is to run the algorithm in Exhibit 1 for (λl )l∈L to generate #L combined forecasts in parallel and then combine these forecast combinations to issue a single prediction. This is done by finding additional combination weights ωL,t := (ω l,t )l∈L , on top of wE,L,t := (wE,l,t )l∈L so that pW L (xt |ω L,t , wE,L,t , θE,t ) denotes the final forecast that combines individual forecasts using different share updates indexed by λl (l ∈ L) . The algorithm is in Exhibit 2 and depends on a constant κ > 0, which plays a role similar to the learning rate η and will be defined later. [Exhibit 2] We need to extend Condition 1.

³ ´ Condition 3 For any forecast ˆθE = ˆθe

e∈E

, outcome x, and weights wE = (we )e∈E ∈

S#E , and ωL = (ω l )l∈L ∈ S#L , ∃b < ∞, κ > 0 such that ³ ³ ´´ ³ ´´o n ³ X R pW L x|, ω L , wE , ˆθE ≤ −b ln , ω l exp −κR pW (l) x|wE , ˆθE l∈L

³ ´ where pW (l) x|wE , ˆθE is the econometrician’s forecast using the algorithm in Exhibit 1 and share update ut+1 (e, e0 |l). Remark 7 If the forecast is through mixtures, then, X pW L = ωl,t pW (l) l∈L

and Condition 3 holds automatically with κ = 1/b and κ ≤ 1, and the choice κ = 1 is optimal according to the following result.

Theorem 3 Under Conditions 1, 2, and 3, ∀ek ∈ E (k = 0, ..., K) , K < T, ∀l ∈ L, R1,...,t (pW L ) ≤ (cη) (bκ) −bc

K X k=1

K X k=0

¡ ¢ Rt(k),...,t(k+1)−1 pθ(e(k)) + bc ln (#E) + b ln (#L)

ln ut(k) (ek , ek−1 |l) − bc 14

K t(k+1)−2 X X k=0

s=t(k)

ln us+1 (ek , ek |l) .

Corollary 1 Under the Conditions of Theorem 3, R1,...,t (pW ) ≤ (cη) (bκ) + (bc)

K X k=0

¡ ¢ Rt(k),...,t(k+1)−1 pθ(e(k)) + bc ln (#E) + b ln (#L)

min

(e1 ,...,eK )∈E K l∈L



⎝−

K X k=1

ln ut(k) (ek , ek−1 |l) −

K t(k+1)−2 X X k=0

s=t(k)

ln us+1 (ek , ek |l)⎠ .

Remark 8 Theorem 3 says that, if bκ = 1, increasing the bound by b ln (#L) we can learn the minimizing λl (l ∈ L). In this case, to bound the terms in parenthesis, we need to choose a specific family ut+1 (e, e0 |λ) λ ∈ Λ and a finite collection of (λl )l∈L over which to minimize. After this, we do not need to worry about selecting the parameters of the algorithm. The algorithm in Exhibit 2 can be applied not only to the share update functions, but also to the learning rate, so that we would automatically obtain the best choice of learning parameter η within a set of possible choices fixed a priori. Note that the error of the algorithm only grows logarithmically in the number of possible choices of parameters to be learned (i.e. in #L). Remark 9 As for Theorem 2, we can choose a forecast combination rule (e.g. the one of Example 1) such that κ = η = 1 and b = c = 1. More generally, the forecast combination rules considered in the examples of this paper are such that κb = 1 and ηc = 1.

3.6

An Example of Detailed Calculations

An application of Theorems 2 and 3 easily allows us to derive error bounds for any given share update that satisfy Condition 2. The basic idea is effectively shown using the Fixed Share update in Example 3. For the sake of exposition, we will also assume η = c = κ = b = 1. We can set λl = l/L, l = 0, ..., L, L ∈ N, so that #L = (L + 1) . To get a bound for this update, we need the following. Lemma 1 Suppose ut+1 (e, e0 |λ) is as in (9). Then, µ

K X

λ ln ut(k) (ek , ek−1 |λ) = K ln #E − 1 k=1

K t(k+1)−2 X X k=0

s=t(k)



ln us+1 (ek , ek |λ) = (T − K − 1) ln (1 − λ) . 15



Applying Theorem 3, together with this lemma, we have the following. Corollary 2 Under the Conditions of Theorem 3 (with η = c = κ = b = 1), using the share update (9) in the algorithm in Exhibit 1, and λl = l/L, l = 0, ..., L, L ∈ N, R1,...,t (pW ) ≤

K X k=0

¡ ¢ Rt(k),...,t(k+1)−1 pθ(e(k)) + ln (#E) + ln (#L)

µ µ + min −K ln l∈L

l/L

#E − 1



¶ − (T − K − 1) ln (1 − l/L) .

As promised in Remark 6, we give an easily interpreted asymptotic result for the best expert partition when that fixed share update is used. Corollary 3 Suppose K = O (T 1− ) , > 0, and (9) with λ = O (K/T ) in Exhibit 1. Under Condition (1) (with η = c) the bound (3) holds with error = o (T ) , i.e. R1,...,t (pW ) ≤

4

K X k=0

¡ ¢ Rt(k),...,t(k+1)−1 pθ(e(k)) + o (T ) .

Numerical Examples

This section provides a discussion of the algorithms via some simulations and empirical examples. In these experimental examples, the forecast combination rule involves mixtures of densities, as in Example 1. In this case, the bound in Theorem 2 is minimized for η = c = 1. The algorithm in Exhibit 2 will also be used with the same forecast combination rule (e.g. Remark 7). This implies the optimal choice κ = b = 1 in Condition 3. Using this forecast combination rule, the only unknown parameter in the algorithm in Exhibit 1 is the share update parameter. On the other hand, the algorithm of Exhibit 2 does not depend on any unknown parameters once a finite set of parameters for the share update is selected.

4.1

Simulated Examples

Two simulated examples are considered. The first shows that a share update can improve the forecast; the second shows how to incorporate some a priori knowledge of the data generating process. In both examples, the algorithm of Exhibit 2 leads to performance almost equivalent to the one in Exhibit 1 when we choose the optimal share update parameter ex post. In the second example, we also show that if there are two candidate families of share updates, the algorithm in Exhibit 2 leads 16

to a performance that is almost equivalent to that of the algorithm in Exhibit 1 when we choose, ex post, the best performing family of share updates, together with its optimal parameter. For comparison, forecast combinations are also computed using equally weighted combination weights (see Timmermann, 2004, section 2.4, for some of the properties of equally weighted forecasts). Both examples use one single realization of simulated random variables. The simulated series are chosen to be stationary to facilitate statistical analysis. 4.1.1

Outliers

Consider a Gaussian random variable Xt with mean μt and standard deviation σ t . Set μt = 5 t Zt , where ( t )t∈N is iid taking values 1 or −1 with probability 1/2 and (Zt )t∈N is an iid sequence of Poisson random variables with intensity parameter .005. Set σ t = 1 + 20Zt0 , where (Zt0 )t∈N is an independent (iid) copy of (Zt )t∈N . This means that Xt will usually produce a realization from a Gaussian random variable with mean zero and variance one. However, with some probability, the parameters are subject to change. The Poisson intensity .005 implies that,on average, there is a change in mean and\or in variance every 200 periods, leading to very large values. By independence of Zt and Zt0 , changes in mean and variance are independent of each other. The idea is to represent the scenario where the sequence is iid, but every so often we observe some outliers. Figure 1 plots the simulated series. [Figure 1] We use static forecasts and each forecast gives a Gaussian density with mean standard deviation parameters (mi , sj ) (i = 1, 2, 3; j = 1, 2, 3, 4) . In particular the means are (m1 , m2 , m3 ) = (−1, 0, 1) and the standard deviations are (s1 , ..., s4 ) = (1, 2, 8, 16). Hence, there are 12 forecasts. Clearly, it is impossible to predict the time of one of these ”outliers”. We use the fixed share update of Example 3 with parameters (λ1 , ..., λ5 ) = (.01, .05, .1, .2, .3). When λ = 0, the fixed share update collapses to the usual no share update, as in Remark 2. The goal is to show that when we account for the possibility of large rare events, the forecast can be improved. Accounting for rare events is achieved by using the share update. By choosing some individual forecasts with large standard deviation, the share update allows us to account for large rare events. Effectively, the share update shrinks the weight given to very ”good” forecasts while it inflates the weight given to ”bad” forecasts. The degree of shrinkage should be related to 17

the probability of an unlikely event occurring. The reason for doing so is that a ”bad” forecast may suddenly turn to be the best performing one. Figure 1 should make this clear. Using a share update in this circumstance leads to considerable improvement. Clearly, the optimal λ is unknown ex ante. Table 1 reports the mean loss with standard errors as well as the maximum loss. Given that the results of this paper are finite sample results, it is instructive to look at statistics beyond the sample mean. Only the mean and the maximum loss is reported here, as other sample statistics do not seem to improve comparison for this example. The results show that the algorithm in Exhibit 2 (Fixed Share Learnt) gives a loss which is close to the optimal fixed share with λ = .01.3 Using no share update does achieve performance comparable to the best individual forecast. When no share update is used, the mean performance appears to deteriorate, even if we account for finite sample variation through the standard errors. Moreover, the maximum loss is huge, making the loss per observation very volatile. In finite sample, this is quite undesirable. [Table 1] 4.1.2

Almost Local Stationarity

Suppose Xt is a Gaussian random variable with mean 0 and standard deviation σ t when we condition on σ t . Suppose ln σ t evolves stochastically and takes values not too distant from the previous period log standard deviation ln σ t−1 . Moreover, restrict ln σ t to take values in a discrete finite set. Specifically, ln σ t , takes values in {−2, −1.9, −1.8, ..., 3} and admits the following stochastic representation ln σ t = ln σ t−1 + t ht , where ( t )t∈N is iid Bernoulli such that Pr ( t = 1) = .4 and (ht )t∈N is such that ht is uniformly distributed in {−1, 1} if ln σ t−1 ∈ / {−2, 3} (i.e. if ln σ t−1 is not on the boundary) and ht = 1 if ln σ t−1 = −2 or ht = −1 if ln σ t−1 = 3. This means that the boundary points {−2, 3} are reflecting barriers and with probability one we move away from the boundary once touched. At time one, ln σ 1 is uniformly distributed in {−2, −1.9, −1.8, ..., 3}. Figure 2 plots the simulated series. The volatility of the series changes gradually overtime from very volatile periods to very quiet ones and vice versa. Note that σ t is bounded by construction and it is mean-reverting because of the reflecting barriers. [Figure 2] The individual forecasts are Gaussian densities with mean zero and standard de3

Other sample statistics not reported here confirmed this claim.

18

viation se where ln se = −2+.1 (e − 1) , e ∈ E = {1, ..., 51} , i.e. (ln s1 , ln s2 , ..., ln s51 ) = (−2, −1.9, ..., 3). Hence, we combine 51 forecasts. This is a relatively large number, but the theoretical results show that the loss should only increase logarithmically with the number of forecasts used. Each time period, there is a forecast that correctly predicts the true density, but with high probability, the period after the best forecast will change, i.e. if, at time t, ln σ t = ln se , ln σ t+1 6= ln se with probability .4. Knowledge of this local shift can be incorporated into the share update. To this end, the local fixed share update in (10) is used and results are compared with the fixed share update (9). For both updates, the parameters are (λ1 , ..., λ5 ) = (.01, .05, .1, .2, .3). Results are in Table 2, and the same sample statistics as in the previous example are reported. [Table 2] As for Fixed Share Learnt, Local Share Learnt gives the results for the algorithm in Exhibit 2 using the local share update. Local&Fixed Share Learnt applies the same algorithm, not only to learn the parameters, but also to determine which share update is best. As expected from Theorem 3, not knowing which share update or parameter to choose makes little difference. The results show that the algorithm in Exhibit 2 picks up the best update, where best is in terms of sample mean performance. The standard errors are so small that there is no ambiguity in the mean performance when we compare the use of a share update and the no share update. Results show that, for this example, incorporating extra knowledge about the share update does not lead to significant differences if we compare the best choice of fixed share update (Fixed Share .01) and the best choice of local share update (Local Share .3). This confirms the remark concerning the importance of choosing the optimal parameter in a share update. Simulations carried out by the author, but not reported here, seem to suggest that the choice of parameter is more important than the choice of the share update’s functional form. To highlight the differences between the share updates, we plot the weights’ values for the no share update, the best fixed share update (λ = .01), and the best local share update (λ = .3) in Figure 3, Panel A, B and C, respectively. [Figure 3] In Panel A, the weights of the no share update do not change fast enough to track shifts in the best forecast. After a learning period of 800 observations, all the weight is given to only two forecasts, one of which had a weight close to one for most of the time. Looking at Figure 2, we can immediately see that changes do 19

occur throughout the series. In Panel B, the fixed share update tries to capture fast changes in regime. However, by construction, we know that, on average, there is a change almost every two periods (i.e. Pr ( t = 1) = .4). Hence the fixed share update seems to allow for changes in best forecast only when the new best forecast differs considerably from the previous one. On the other hand, it is evident from Panel C that the local share update allows for very frequent shifts in best forecast. Nevertheless, the shifts are between forecasts that are close to each so that no single shift has dramatic implications. For this reason a simple update like the fixed share update can still perform comparatively well, as it is still able to capture changes when differences between forecasts lead to significant differences in loss behavior.

4.2

Empirical Examples

The first example shows that large breaks are possible in real data, and accounting for this possibility can improve forecasting in finite samples. This is done using daily returns on the S&P 500 index over a sample period that includes the October ’87 crash. The goal of the second example is to show that a forecast combination of simple individual forecasts might achieve a performance comparable to state of the art econometric models. To this end, a combination of simple forecasts for data on sugar futures are compared with forecasts obtained from a GARCH (1, 1) model with t distributed conditional errors. In both examples, the share update used is the fixed share update with (λ1 , ..., λ5 ) = (.01, .05, .1, .2, .3). 4.2.1

Accounting for Large Events in Forecast Combination

We consider the daily returns on the S&P 500 index over the period 05/10/8502/09/89 (1000 observations including a start up period of 200 observations). This period includes the crash of October 1987. The series is plotted in Figure 4. [Figure 4] As in the first example using synthetic data, this crash is unpredictable. However, accounting for the possibility of such large events can still be advantageous. Here, the crash is a single event and the loss incurred is averaged over the other 800 predictions. Nevertheless, its effect can be considerable. Forecasts are based on a normal distribution with mean zero and variance obtained by moving window Pn(e) 2 sample estimates: s2te = n−1 e r=1 Xt−r . The window widths ne are (n1 , ..., n7 ) = (10, 20, 40, 60, 80, 100, 120). The performance obtained by using no share update 20

and the algorithm in Exhibit 2 is compared. Table 3 reports summary statistics also for the best individual forecast and the equally weighted forecast. [Table 3] Standard errors confirm that the difference in mean between the methods is marginal, though, differences appear evident for the extreme values (e.g. Max.). The results seem to suggest that the distribution of the losses without using share update are very similar to the best forecast with hindsight. On the other hand, for this example, the Fixed Share Learnt produces losses that are quite similar to the equally weighted forecast. This supports the forecasting strategy used in Aiolfi and Timmermann (2004) where shrinkage towards equally weighted combination weights is suggested. The difference is that the procedure here is online and fully automatic (once we choose a set of parameters λ’s). Figure 5 plots the density of the losses when no share update and FixedShareLearnt are used.4 [Figure 5] It is clear that while the difference in mean might be marginal (though comparing favorably to the Fixed Share Learnt), the distribution of the losses is more concentrated around the mean loss for Fixed Share Learnt. 4.2.2

Forecast Combination Using Simple Forecasts

We consider the daily log returns on the sugar futures contracts traded at the Chicago Board of Trade. Figure 6 plots the series for the period 06/12/98-13/03/01, which is the one over which predictions are carried out, using a start up period of 200 observations. We compare the performance of combination of simple forecasts with forecasts based on a GARCH(1, 1) process with conditional t distribution. Interestingly, the estimated parameters in the conditional variance equation sum up to slightly more than one if we estimate a GARCH(1, 1) model over the whole sample. For the forecast comparison, we estimate a GARCH(1, 1) using a sample X1 , ..., Xt−1 and use the estimated parameters to obtain the conditional variance at time t which, together with the estimated degrees of freedom, completely determines the density at time t. The procedure is repeated for t = 1, ...T . [Figure 6] 4

The densities are obtained using the default settings in densityplot in S-plus.

21

To keep it simple, each individual forecast is obtained from a t-distribution with fixed degrees of freedom v. In particular 40 individual forecasts are combined with (v1 , ..., v40 ) = (1, 1 + 19/39, 1 + 2 (19/39) , ..., 20). Table 4 reports results for the Fixed Share Learnt and GARCH only. Fixed Share Learnt does not require tuning of any parameter apart from the choice of share update, and it is the approach that is more likely to be used in practice. [Table 4] Even when we account for sample variability through standard errors, the mean performance of combining simple forecasts compares favorably with GARCH. Other simulation results not reported here confirm that combination of simple forecasts may often have a performance comparable to more complicated individual forecasts. Again to draw a more complete picture, the density plot for the losses is given in Figure 7. [Figure 7]

It is evident that the lower quantiles of the sample losses for GARCH are relatively large, while the picture is inverted for the high quantiles. As confirmed by the standard errors, Fixed Share Learnt for these simple forecasts leads to a higher degree of dispersion.

5

Prediction of Individual Sequences

By a suitable choice of family of distributions, forecast of distributions leads to forecast of individual sequences using the forecast combination rule of Example 2. Example 5 Define ½ ¯ D ³ ´ E¯2 ¾ 1 ¯ ¯ pW x|wE , ˆθE = √ exp − ¯x − wE , ˆθE ¯ , π D E which is the Gaussian density with mean wE , ˆθE and variance 1/2. Then the loss function is D ³ ³ ´´ ¯ E¯2 ¯ ¯ ˆ R pW x|wE , θE = ¯x − wE , ˆθE ¯ + (1/2) ln π.

³ ´ Since our results do not require pW x|wE , ˆθE to integrate to one, the term (1/2) ln π ´´ ³ ³ ˆ is exactly the square loss. can be dropped and R pW x|wE , θE 22

Example 6 Define ³ ´ n h n ³ D E´o ³ D E´io − a x − wE , ˆθE , pW x|wE , ˆθE = a exp − exp a x − wE , ˆθE

which is a scale location change of the Gumbel density. Then, ´´ n ³ D E´o ³ D E´ ³ ³ ˆ ˆ ˆ = exp a x − wE , θE − a x − wE , θE + ln (a) R pW x|wE , θE

³ ³ ´´ and replacing the irrelevant additive constant ln (a) with −1, R pW x|wE , ˆθE becomes LinEx loss function with parameter a.

Example 7 Define D ³ ´ n ¯ E¯o ¯ ¯ pW x|wE , ˆθE = exp − ¯x − wE , ˆθE ¯ ,

which is the double exponential density. Then D ³ ³ ´´ ¯ E¯ ¯ ¯ R pW x|wE , ˆθE = ¯x − wE , ˆθE ¯ ,

which is the absolute loss function.

For a loss function ϕ (y) , we do not need exp {−ϕ (y)} to be a density. If the predictions are obtained by parameter averaging it is enough to check that Condition 1 is satisfied for some c and η. In this case, all the bounds derived above apply to the prediction of individual sequences with κ = η and b = c. The following gives sufficient conditions for Condition 1 to be satisfied. ³ ´ n ³ D E´o Lemma 2 Set c = 1/η and suppose pW x|wE , ˆθE := exp −ϕ x − wE , ˆθE , where ϕ (y) is a loss function. Suppose the sample observations and their predictions are uniformly bounded. Then, Condition 1 is satisfied if for any finite absolute constant B we can find an η ∈ (0, ∞) such that exp {−ηϕ (y)} is concave for |y| ≤ B. Example 8 Suppose ϕ (y) = y 2 . Then, Condition 1 is satisfied with η = 1/ (2B 2 ) . To see this, differentiate exp {−ηϕ (y)} twice with respect to y, equate to zero to √ find the inflection points ±1/ η2. Using the fact that y ∈ [−B, B] we get the required value for η. Example 9 Suppose ϕ (y) = exp {ay} − ay − 1. Then, Condition 1 is satisfied by η = exp {aB} / (exp {aB} − 1)2 for y ∈ [−B, B] . 23

Remark 10 Clearly, we could impose tail assumptions and truncate the random variables in such a way to control for the error incurred, so that the bound would hold in probability. This route is not detailed, as we are not necessarily assuming that the segment x1 , ..., xt is a realization of some sequence of random variables. Hence, this would not appear natural. The absolute norm ϕ (y) = |y| does not satisfy the condition of Lemma 2. However, in this case, we use the following general result that applies to all (convex) loss functions and sample sequences that take bounded values. Lemma 3 Define

and

³ ´ n ³ D E´o pW x|wE , ˆθE = exp −ϕ x − wE , ˆθE ,

´o n ³ ˆ pˆθ(e) (x) = exp −ϕ x − θe , ¯ D E¯ ¯ ¯ where ϕ (y) is a convex function. Then, for ¯x − wE , ˆθE ¯ ≤ B < ∞

³ ³ ´´ n ³ ´o X −1 ˆ R pW x|wE , θE ≤ −η ln we exp −ηR pˆθ(e) (x) + ηϕ (2B)2 /8. e∈E

Remark 11 The extra term ηϕ (2B)2 /8 leads to an additional error equal to ¡ ¢ T ηϕ (2B)2 /8 in the bounds of the Theorems. By choice of η = O T −1/2 the ¡ ¢ loss reduces to O T 1/2 .

Since η depends on B, a trial and error procedure needs to be used when B is unknown. In this case, if the loss is bounded away from infinity, the order of magnitude of the bound would not be affected.

6

Final Remarks

The algorithm in Exhibit 1 suggests the use of share updates in order to account for possible nonstationarities. This algorithm has been studied by other authors who, restricting interest to special share updates, derived error bounds similar to the one in Theorem 2. The advantage of the present setting is to derive results explicitly in terms of the crucial quantities of the algorithm and to specify the general condition to be satisfied by the share update (Condition 2). It is worth noting that Condition 1 is required by the method of proof, also assumed by other 24

authors (e.g. Györfi and Lugosi, 2002, Herbster and Warmuth, 1998, Bousquet and Warmuth, 2002). However, it can be replaced by a complementary condition (see Lemma 3, above). The statement of Theorem 2 allows us to intuitively understand the role of share update when a shift occurs. Moreover, this approach allows us to derive precise conditions on the number of possible shifts K that would imply a total error of smaller order than T (e.g. Corollary 3). Finally, given that the optimal choice of parameters in the share update is usually unknown, the extension proposed in the algorithm in Exhibit 2 appears to be simple but important. As shown in Section 5, the learning rate also plays an important role. The algorithm in Exhibit 2 can be adapted to the case of an unknown learning rate taking values on a finite countable set. The weights in this paper are constrained to lie in the unit simplex. It is well known (e.g. Timmermann, 2004, Section 2.3.2) that forecast combination in this case is suboptimal when one of the experts is biased. Extensions of the algorithms to allow for negative weights is an important direction for future research. The bounds of this paper can be partially extended to the case of an infinite number of experts if the class of experts satisfies suitable entropy conditions (CesaBianchi and Lugosi, 1999, for details). For example, we may wish to average using a continuous mixing distribution instead of a finite number of weights for the forecast combination. In this case, some restrictions need to be satisfied by the experts predictions. This will be the subject of future research. The algorithms considered enjoy some optimal theoretical properties. However, there could be other algorithms that lead to equivalent theoretical results or improve the present ones. The approach based on Bregman projections (e.g. Herbster and Warmuth, 2001) is one notable example. The econometrics literature is often interested in probabilistic bounds rather than worst case bounds. In a paper in preparation, using the worst case bounds of this paper, it will be shown that probabilistic bounds can be easily obtained by moment conditions on the forecast errors. In this case, the error in the bound would be looser than the one given in Theorem 5 of Yang (2004), but would apply to more general situations. Condition 7 in Yang (2004) applies only to loss functions that are essentially quadratic. On the other hand, a moment condition can be used to truncate with respect to the forecast errors so that Lemma 3 can be applied and bounds derived for general loss functions. Truncation increases the error in the bound, but under suitable moment conditions, the increase in error can be kept o (T ). Note that a necessary but not sufficient condition for Condition 8 in Yang (2004) is the existence of the moment generating function of the forecast errors. If 25

this condition is not satisfied, truncation is required also in that case. We did not discuss how to choose the individual forecasts. We could clearly use a large number of them without any preliminary analysis. As shown in the Theorems, the error only grows logarithmically with the number of forecasts. In two of the numerical experiments we combined 51 and 40 and observed no deterioration of performance when compared to the best individual forecast. However, when large means hundreds or thousands of forecasts, results might be dramatically different. Hence, it is preferable to choose the individual forecasts carefully following some reasonable criterion. The role of sufficiency in forecast may play a fundamental role at the initial stage (see Timmermann, 2004).

A

Proofs

The proofs are simplified by rewriting the algorithms in Exhibit 1 and 2 in the equivalent form of Exhibit 3 and 4 respectively. [Exhibit 3&4] For the results to hold we need X e∈E

ve,t+1 =

à X X e∈E

!

ve0 0 ,t ut+1 (e, e0 )

e0 ∈E



X

0 ve,t ,

e∈E

which is true when Condition 2 holds. With this proviso, we can now turn to the proof of the results.

A.1

Theorems 1 and 2

The proof is based on the following Lemmas.

Lemma 4 Under Conditions 1 and 2, $R_{1,\dots,t}(p_W) \le -c\ln v_{e,t+1}$ for any $e\in E$.

Proof. Define $p_{W,t} := p_W(x_t|w_{E,t},\theta_{E,t})$. By Condition 1,
$$R(p_{W,t}) \le -c\ln\left(\sum_{e\in E} w_{e,t}\exp\left\{-\eta R\left(\hat p_{\theta(e,t)}\right)\right\}\right)
= -c\ln\left(\frac{\sum_{e\in E} v_{e,t}\exp\left\{-\eta R\left(\hat p_{\theta(e,t)}\right)\right\}}{v_{\cdot,t}}\right)
= -c\ln\left(\frac{\sum_{e\in E} v'_{e,t}}{v_{\cdot,t}}\right), \tag{11}$$
and we shall bound the right hand side. Using Condition 2,
$$\sum_{t=1}^{T}\ln\frac{\sum_{e\in E} v'_{e,t}}{v_{\cdot,t}} \ge \sum_{t=1}^{T}\ln\frac{v_{\cdot,t+1}}{v_{\cdot,t}} = \ln\frac{v_{\cdot,T+1}}{v_{\cdot,1}} = \ln v_{\cdot,T+1} \ge \ln v_{e,T+1},$$
by definition of $v_{\cdot,1}$ in the penultimate step, and because for non-negative scalars $a$ and $b$, $a+b \ge a\vee b$, in the last step. Substituting this inequality in (11) gives the result.

Lemma 5 Under Condition 2,
$$v_{e,t+1} \ge u_{t+1}(e,e)\, v_{e,t}\exp\left\{-\eta R\left(p_{\theta(e,t)}\right)\right\},\qquad
v_{e,t+1} \ge v'_{e',t}\, u_{t+1}(e,e'),$$
and, for all $t_0 \le t$,
$$v_{e,t+1} \ge \left(\prod_{s=t_0}^{t} u_{s+1}(e,e)\right)\exp\left\{-\eta R_{t_0,\dots,t}\left(p_{\theta(e)}\right)\right\} v_{e,t_0},\qquad
v'_{e,t+1} \ge \left(\prod_{s=t_0}^{t} u_{s+1}(e,e)\right)\exp\left\{-\eta R_{t_0,\dots,t+1}\left(p_{\theta(e)}\right)\right\} v_{e,t_0}.$$

Proof. By definition of the algorithm,
$$v_{e,t+1} = \sum_{e'\in E} v'_{e',t}\, u_{t+1}(e,e') \ge v'_{e,t}\, u_{t+1}(e,e) = u_{t+1}(e,e)\, v_{e,t}\exp\left\{-\eta R\left(p_{\theta(e,t)}\right)\right\},$$
which proves the first inequality of the Lemma. The second inequality of the Lemma follows similarly from the first equality in the above display. Using the first inequality of the Lemma iteratively gives the third inequality of the Lemma,
$$v_{e,t+1} \ge \left(\prod_{s=t_0}^{t} u_{s+1}(e,e)\exp\left\{-\eta R\left(p_{\theta(e,s)}\right)\right\}\right) v_{e,t_0},$$
and noting that
$$\exp\left\{\eta R\left(p_{\theta(e,t+1)}\right)\right\} v'_{e,t+1} = v_{e,t+1},$$
the fourth inequality of the Lemma follows.

Proof of Theorem 1. Lemmas 4 and 5 imply that
$$R_{1,\dots,T}(p_W) \le -c\ln v_{e,T+1} \le -c\ln\left(\left(\prod_{t=1}^{T} u_{t+1}(e,e)\right) v_{e,1}\exp\left\{-\eta R_{1,\dots,T}\left(p_{\theta(e)}\right)\right\}\right)
= c\eta R_{1,\dots,T}\left(p_{\theta(e)}\right) - c\ln\left(\prod_{t=1}^{T} u_{t+1}(e,e)\right) - c\ln v_{e,1}.$$
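As a numerical sanity check of the type of bound obtained here, the following Python sketch (not from the paper; all data are simulated and the choice $\eta = 1$, $c = 1$ with no share update, $u_{t+1}(e,e') = 1\{e=e'\}$, is only an illustrative special case) verifies the classical mixture bound that the argument reduces to when $R(x) = -\ln x$: the cumulative log loss of the combined density exceeds that of the best expert by at most $\ln(\#E)$.

```python
import numpy as np

# Special case of the Theorem 1 style bound: log loss, eta = 1, no share update.
# Then sum_t -ln p_{W,t} <= min_e sum_t -ln p_{theta(e),t} + ln(#E).
rng = np.random.default_rng(1)
n_experts, T = 4, 200
# p[e, t]: density assigned by expert e to the observation at time t (simulated).
p = rng.uniform(0.05, 1.0, size=(n_experts, T))

w = np.full(n_experts, 1.0 / n_experts)   # uniform initial weights
combined_loss = 0.0
for t in range(T):
    combined_loss += -np.log(w @ p[:, t])  # loss of the combined (mixture) density
    w = w * p[:, t]                        # multiplicative update, eta = 1
    w /= w.sum()

expert_losses = -np.log(p).sum(axis=1)
bound = expert_losses.min() + np.log(n_experts)
print(combined_loss, bound)
assert combined_loss <= bound + 1e-9
```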

Proof of Theorem 2. Consider the following telescoping product,
$$v_{e(K),T+1} = v_{e(0),t(0)}\,\frac{v'_{e(0),t(1)-1}}{v_{e(0),t(0)}}\left(\prod_{k=1}^{K}\frac{v_{e(k),t(k)}}{v'_{e(k-1),t(k)-1}}\,\frac{v'_{e(k),t(k+1)-1}}{v_{e(k),t(k)}}\right)\frac{v_{e(K),T+1}}{v'_{e(K),t(K+1)-1}}. \tag{12}$$
From Lemma 5,
$$\frac{v_{e,t+1}}{v_{e,t}} \ge u_{t+1}(e,e)\exp\left\{-\eta R\left(p_{\theta(e,t)}\right)\right\},\qquad
\frac{v_{e,t+1}}{v'_{e',t}} \ge u_{t+1}(e,e'),$$
and, for all $t_0\le t$,
$$\frac{v'_{e,t+1}}{v_{e,t_0}} \ge \left(\prod_{s=t_0}^{t} u_{s+1}(e,e)\right)\exp\left\{-\eta R_{t_0,\dots,t+1}\left(p_{\theta(e)}\right)\right\}.$$
Now, by definition, $v_{e(0),t(0)} = 1/(\#E)$, and from Lemma 5,
$$\frac{v_{e(k),t(k)}}{v'_{e(k-1),t(k)-1}} \ge u_{t(k)}(e_k,e_{k-1}),$$
$$\frac{v'_{e(k),t(k+1)-1}}{v_{e(k),t(k)}} \ge \left(\prod_{s=t(k)}^{t(k+1)-2} u_{s+1}(e_k,e_k)\right)\exp\left\{-\eta R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right)\right\},$$
and, since there is no share update on the final trial,
$$\frac{v_{e(K),T+1}}{v'_{e(K),t(K+1)-1}} = 1.$$
Substituting everything in (12),
$$v_{e(K),T+1} \ge (\#E)^{-1}\left(\prod_{s=t(0)}^{t(1)-2} u_{s+1}(e_0,e_0)\right)\exp\left\{-\eta R_{t(0),\dots,t(1)-1}\left(p_{\theta(e(0))}\right)\right\}
\times \prod_{k=1}^{K}\left[u_{t(k)}(e_k,e_{k-1})\left(\prod_{s=t(k)}^{t(k+1)-2} u_{s+1}(e_k,e_k)\right)\exp\left\{-\eta R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right)\right\}\right]$$
$$= (\#E)^{-1}\left(\prod_{s=t(0)}^{t(1)-2} u_{s+1}(e_0,e_0)\right)\prod_{k=1}^{K}\left[u_{t(k)}(e_k,e_{k-1})\left(\prod_{s=t(k)}^{t(k+1)-2} u_{s+1}(e_k,e_k)\right)\right]
\times \prod_{k=0}^{K}\exp\left\{-\eta R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right)\right\}.$$
Taking the natural log,
$$\ln v_{e(K),T+1} \ge -\ln(\#E) + \ln\left(\prod_{s=t(0)}^{t(1)-2} u_{s+1}(e_0,e_0)\right)
+ \ln\prod_{k=1}^{K}\left[u_{t(k)}(e_k,e_{k-1})\left(\prod_{s=t(k)}^{t(k+1)-2} u_{s+1}(e_k,e_k)\right)\right]
- \eta\sum_{k=0}^{K} R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right)$$
$$= -\ln(\#E) + \sum_{s=t(0)}^{t(1)-2}\ln\left(u_{s+1}(e_0,e_0)\right)
+ \sum_{k=1}^{K}\left[\ln u_{t(k)}(e_k,e_{k-1}) + \sum_{s=t(k)}^{t(k+1)-2}\ln\left(u_{s+1}(e_k,e_k)\right)\right]
- \eta\sum_{k=0}^{K} R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right)$$
$$= -\ln(\#E) + \sum_{k=1}^{K}\ln u_{t(k)}(e_k,e_{k-1}) + \sum_{k=0}^{K}\sum_{s=t(k)}^{t(k+1)-2}\ln\left(u_{s+1}(e_k,e_k)\right)
- \eta\sum_{k=0}^{K} R_{t(k),\dots,t(k+1)-1}\left(p_{\theta(e(k))}\right).$$
Using Lemma 4, the result follows.

A.2 Theorem 3

The proof is based on the following Lemma.

Lemma 6 Under Condition 3,
$$R_{1,\dots,T}(p_{WL}) \le -b\ln(\upsilon_{l,1}) + b\kappa R_{1,\dots,T}\left(p_{W(l)}\right).$$

Proof. Define $p_{W(l),t} := p_W(x_t|w_{E,l,t},\theta_{E,t})$ and $p_{WL,t} := p_{WL}(x_{t+1}|\omega_{L,t},w_{E,L,t},\theta_{E,t})$. By Condition 3,
$$R(p_{WL,t}) \le -b\ln\left(\sum_{l\in L}\omega_{l,t}\exp\left\{-\kappa R\left(p_{W(l)}\right)\right\}\right)
= -b\ln\left(\frac{\sum_{l\in L}\upsilon_{l,t}\exp\left\{-\kappa R\left(p_{W(l)}\right)\right\}}{\sum_{l\in L}\upsilon_{l,t}}\right)
\le -b\ln\left(\frac{\upsilon_{\cdot,t+1}}{\upsilon_{\cdot,t}}\right).$$
Then, sum over $t$, and note that the sum telescopes, implying
$$R_{1,\dots,T}(p_{WL}) \le -b\ln(\upsilon_{l,T+1}) \tag{13}$$
because $\upsilon_{\cdot,1} = 1$. We shall bound the right hand side. By iteration of
$$\upsilon_{l,t+1} = \upsilon_{l,t}\exp\left\{-\kappa R\left(p_{W(l),t}\right)\right\}$$
we deduce that
$$\upsilon_{l,t+1} = \upsilon_{l,1}\exp\left\{-\kappa\sum_{s=1}^{t} R\left(p_{W(l),s}\right)\right\}.$$
Substituting the last display in (13) gives the result.

Proof of Theorem 3. Use Lemma 6 and apply Theorem 2.

A.3 Lemma 1, Corollary 3, and Lemmata 2 and 3

Proof of Lemma 1. The first equality is immediate. The second follows noting that there are $T$ observations, hence $T-1$ share updates. Since $K$ of them are breaks, $T-K-1$ must be the remaining, and the second equality follows.

Proof of Corollary 3. Setting $\lambda = O(K/T)$ and substituting in Lemma 1, we have
$$-K\ln(\lambda) + K\ln(\#E-1) = O\left(K\ln(T/K)\right) = o(T),$$
while
$$-(T-K-1)\ln(1-\lambda) \le \frac{T\lambda}{1-\lambda} = O\left(T\frac{K/T}{1-K/T}\right) = o(T).$$
Hence, Lemma 1 and Theorem 2 give the result.

Proof of Lemma 2. We need to check that
$$R\left(p_W\left(x|w_E,\hat\theta_E\right)\right) \le -\eta^{-1}\ln\sum_{e\in E} w_e\exp\left\{-\eta\varphi\left(x-\hat\theta_e\right)\right\}$$
holds. The segment of observations $x_1,\dots,x_T$ and their forecasts take finite values, hence set $\left|x-\left\langle w_E,\hat\theta_E\right\rangle\right| \le B < \infty$. By the conditions of the Lemma, we can choose $\eta$ such that
$$\exp\left\{-\eta\varphi\left(x-\left\langle w_E,\hat\theta_E\right\rangle\right)\right\} \ge \sum_{e\in E} w_e\exp\left\{-\eta\varphi\left(x-\hat\theta_e\right)\right\}$$
for $\left|x-\left\langle w_E,\hat\theta_E\right\rangle\right| \le B$. Taking the natural log and multiplying by $-\eta^{-1}$,
$$\varphi\left(x-\left\langle w_E,\hat\theta_E\right\rangle\right) \le -\eta^{-1}\ln\sum_{e\in E} w_e\exp\left\{-\eta\varphi\left(x-\hat\theta_e\right)\right\},$$
and the result follows.

Proof of Lemma 3. From the Hoeffding bound for the moment generating function of bounded random variables (Hoeffding, 1963, eq. 4.16) and convexity of $\varphi$,
$$\sum_{e\in E} w_e\exp\left\{-\eta\varphi\left(x-\hat\theta_e\right)\right\}
\le \exp\left\{-\eta\sum_{e\in E} w_e\,\varphi\left(x-\hat\theta_e\right)\right\}\exp\left\{\eta^2\varphi(2B)^2/8\right\}
\le \exp\left\{-\eta\varphi\left(x-\left\langle w_E,\hat\theta_E\right\rangle\right)\right\}\exp\left\{\eta^2\varphi(2B)^2/8\right\},$$
which implies
$$R\left(p_W\left(x|w_E,\hat\theta_E\right)\right) = \varphi\left(x-\left\langle w_E,\hat\theta_E\right\rangle\right) \le -\eta^{-1}\ln\sum_{e\in E} w_e\exp\left\{-\eta\varphi\left(x-\hat\theta_e\right)\right\} + \eta\varphi(2B)^2/8.$$
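The two inequalities in this last proof are easy to verify numerically. The following Python sketch (not from the paper; the square loss $\varphi(z)=z^2$, the learning rate, and the simulated forecasts are all illustrative assumptions) checks both the Hoeffding step and the convexity (Jensen) step for one draw.

```python
import numpy as np

# Numerical check of the two inequalities in the proof of Lemma 3, for the
# illustrative choice phi(z) = z**2 and simulated point forecasts.
rng = np.random.default_rng(2)
phi = lambda z: z ** 2

n_experts, B, eta = 6, 1.0, 0.2          # eta is an illustrative learning rate
x = rng.uniform(-B, B)                    # observation
theta = rng.uniform(-B, B, size=n_experts)  # forecasts, so |x - theta_e| <= 2B
w = rng.dirichlet(np.ones(n_experts))     # combination weights on the simplex

lhs = np.sum(w * np.exp(-eta * phi(x - theta)))
# Hoeffding bound for the mgf of a variable with range phi(2B), then Jensen's
# inequality (convexity of phi) applied to the combined forecast <w, theta>.
hoeffding = np.exp(-eta * np.sum(w * phi(x - theta))) * np.exp(eta**2 * phi(2 * B)**2 / 8)
jensen = np.exp(-eta * phi(x - w @ theta)) * np.exp(eta**2 * phi(2 * B)**2 / 8)
assert lhs <= hoeffding <= jensen

# Equivalent regret form: phi(x - <w, theta>) <= -ln(lhs)/eta + eta*phi(2B)**2/8.
print(phi(x - w @ theta), -np.log(lhs) / eta + eta * phi(2 * B)**2 / 8)
```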

Acknowledgements

The valuable comments of the referees led to a substantial improvement in the quality of the content and the presentation. Partially supported by the ESRC Research Award 000-23-0400.

References

Aiolfi, M. and A. Timmermann (2004) Persistence in Forecasting Performance and Conditional Combination Strategies. Forthcoming in Journal of Econometrics.
Bousquet, O. (2003) A Note on Parameter Tuning for On-Line Shifting Algorithms. Preprint. Downloadable: http://www.kyb.mpg.de/publications/pdfs/pdf2294.pdf.
Bousquet, O. and M.K. Warmuth (2002) Tracking a Small Set of Experts by Mixing Past Posteriors. Journal of Machine Learning Research 3, 363-396.
Cesa-Bianchi, N., Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire and M.K. Warmuth (1997) How to Use Expert Advice. Journal of the ACM 44, 427-485.
Cesa-Bianchi, N. and G. Lugosi (1999) On Prediction of Individual Sequences. The Annals of Statistics 27, 1865-1895.
Chamberlain, G. (2000) Econometrics and Decision Theory. Journal of Econometrics 95, 255-283.
Dawid, A.P. (1984) Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach. Journal of the Royal Statistical Society Ser. A 147, 278-292.
Dawid, A.P. (1985) Calibration-Based Empirical Probability. The Annals of Statistics 13, 1251-1274.
Dawid, A.P. (1986) Probability Forecasting. In S. Kotz, N.L. Johnson and C.B. Read (eds.), Encyclopedia of Statistical Sciences Vol. 7, 210-218. Wiley.
Deutsch, M., C.W.J. Granger and T. Teräsvirta (1994) The Combination of Forecasts Using Changing Weights. International Journal of Forecasting 10, 47-57.
Diebold, F.X. and P. Pauly (1990) The Use of Prior Information in Forecast Combination. International Journal of Forecasting 6, 503-508.
Elliott, G. and A. Timmermann (2004) Optimal Forecast Combinations under General Loss Functions and Forecast Error Distributions. Journal of Econometrics 122, 47-79.
Genest, C. and J.V. Zidek (1986) Combining Probability Distributions: A Critique and an Annotated Bibliography. Statistical Science 1, 114-148.
Györfi, L. and G. Lugosi (2002) Strategies for Sequential Prediction of Stationary Time Series. In M. Dror, P. L'Ecuyer and F. Szidarovszky (eds.), Modeling Uncertainty: An Examination of its Theory, Methods, and Applications, 225-249. Kluwer Academic Publishers. Downloadable: http://www.econ.upf.es/~lugosi/autoreg.ps.
Hendry, D.F. and M.P. Clements (2004) Pooling of Forecasts. Econometrics Journal 7, 1-31.
Herbster, M. and M.K. Warmuth (1998) Tracking the Best Expert. Machine Learning 32, 151-178.
Herbster, M. and M.K. Warmuth (2001) Tracking the Best Linear Predictor. Journal of Machine Learning Research 1, 281-309.
Hoeffding, W. (1963) Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58, 13-30.
Stock, J.H. and M.W. Watson (2004) Combination Forecasts of Output Growth in a Seven-Country Data Set. Journal of Forecasting 23, 405-430.
Timmermann, A. (2004) Forecast Combinations. Forthcoming in G. Elliott, C.W.J. Granger and A. Timmermann (eds.), Handbook of Economic Forecasting. North Holland.
Vovk, V. (1990) Aggregating Strategies. Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT 1990), 371-383.
Yang, Y. (2000) Mixing Strategies for Density Estimation. The Annals of Statistics 28, 75-87.
Yang, Y. (2004) Combining Forecasting Procedures: Some Theoretical Results. Econometric Theory 20, 176-222.

Exhibit 1.
Set $w_{e,1} := 1/(\#E)$, $\forall e$; $p_W = p_W(x_1|w_{E,1},\theta_{E,1})$; $R(x) := -\ln x$.
For $t = 1,\dots,T-1$:
  $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$;
  $v'_{e,t+1} = w_{e,t}\exp\{-\eta R(p_{\theta(e,t)})\}$;
  $w_{e,t+1} = \sum_{e'\in E} v'_{e',t+1}\,u_{t+1}(e,e') \,/\, \sum_{(e,e')\in E^2} v'_{e',t+1}\,u_{t+1}(e,e')$;
  $p_W = p_W(x_{t+1}|w_{E,t+1},\theta_{E,t+1})$;
  $R_{1,\dots,t+1} = R_{1,\dots,t} + R(p_W)$.
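For concreteness, here is a minimal Python sketch of the recursion in Exhibit 1 with a fixed-share matrix $u_{t+1}(e,e') = (1-\lambda)1\{e=e'\} + \lambda/(\#E-1)1\{e\neq e'\}$. The Gaussian experts, the break in the simulated series, and the values of $\eta$ and $\lambda$ are illustrative assumptions rather than the paper's experimental design.

```python
import numpy as np

def fixed_share_log_loss(densities, eta=1.0, lam=0.05):
    """Run the Exhibit 1 recursion; densities[e, t] = p_theta(e,t)(x_t)."""
    n_experts, T = densities.shape
    # Fixed-share matrix u(e, e'): keep with prob. 1 - lam, switch uniformly otherwise.
    U = np.full((n_experts, n_experts), lam / (n_experts - 1))
    np.fill_diagonal(U, 1.0 - lam)

    w = np.full(n_experts, 1.0 / n_experts)         # w_{e,1} = 1/#E
    total_loss = 0.0
    for t in range(T):
        total_loss += -np.log(w @ densities[:, t])   # R(p_W), with R(x) = -ln(x)
        v = w * densities[:, t] ** eta               # exp{-eta * R} = p**eta
        w = U @ v                                    # share update
        w /= w.sum()                                 # renormalise to the simplex
    return total_loss

# Example: three Gaussian density forecasts for a series whose mean shifts halfway.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
means = np.array([0.0, 3.0, -3.0])
dens = np.exp(-0.5 * (x[None, :] - means[:, None]) ** 2) / np.sqrt(2.0 * np.pi)
print(fixed_share_log_loss(dens, eta=1.0, lam=0.05))
```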


Exhibit 2.
Set $w_{e,l,1} := 1/(\#E)$, $\forall e,l$; $\omega_{l,1} := 1/(\#L)$, $\forall l$; $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$; $p_W = p_W(x_1|w_{E,l,1},\theta_{E,1})$; $p_{WL} = p_{WL}(x_1|\omega_{L,1},w_{E,L,1},\theta_{E,1})$; $R(x) := -\ln x$.
For $t = 1,\dots,T-1$:
  $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$;
  $v'_{e,l,t} = w_{e,l,t}\exp\{-\eta R(p_{\theta(e,t)})\}$;
  $w_{e,l,t+1} = \sum_{e'\in E} v'_{e',l,t}\,u_{t+1}(e,e'|l) \,/\, \sum_{(e,e')\in E^2} v'_{e',l,t}\,u_{t+1}(e,e'|l)$;
  $p_{W(l)} = p_W(x_{t+1}|w_{E,l,t+1},\theta_{E,t+1})$;
  $\omega_{l,t+1} = \omega_{l,t}\exp\{-\kappa R(p_{W(l)})\} \,/\, \sum_{l\in L}\omega_{l,t}\exp\{-\kappa R(p_{W(l)})\}$;
  $p_{WL} = p_{WL}(x_{t+1}|\omega_{L,t+1},w_{E,L,t+1},\theta_{E,t+1})$;
  $R_{1,\dots,t+1} = R_{1,\dots,t} + R(p_{WL})$.
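A compact Python sketch of this two-layer scheme follows (not the paper's code; the grid of $\lambda$ values, $\eta$, and $\kappa$ are illustrative assumptions). The inner function runs the Exhibit 1 recursion for one share parameter and returns the combined densities $p_{W(l)}(x_t)$; the outer function mixes these with multiplicatively updated weights $\omega_l$, as in Exhibit 2.

```python
import numpy as np

def fixed_share_predictive(densities, eta, lam):
    """Return the sequence of combined densities p_{W(l)}(x_t), t = 1..T."""
    n_experts, T = densities.shape
    U = np.full((n_experts, n_experts), lam / (n_experts - 1))
    np.fill_diagonal(U, 1.0 - lam)
    w = np.full(n_experts, 1.0 / n_experts)
    out = np.empty(T)
    for t in range(T):
        out[t] = w @ densities[:, t]        # combined density for this lambda
        v = w * densities[:, t] ** eta      # exp{-eta * (-ln p)} = p**eta
        w = U @ v
        w /= w.sum()
    return out

def learnt_share_log_loss(densities, lambdas=(0.01, 0.05, 0.1, 0.3),
                          eta=1.0, kappa=1.0):
    # p_wl[l, t]: combined density at time t for share parameter lambdas[l].
    p_wl = np.array([fixed_share_predictive(densities, eta, lam) for lam in lambdas])
    omega = np.full(len(lambdas), 1.0 / len(lambdas))   # omega_{l,1} = 1/#L
    total_loss = 0.0
    for t in range(p_wl.shape[1]):
        total_loss += -np.log(omega @ p_wl[:, t])       # loss of p_WL
        omega *= p_wl[:, t] ** kappa                    # exp{-kappa * R(p_W(l))}
        omega /= omega.sum()
    return total_loss
```

Applied to the simulated `dens` array from the previous sketch, `learnt_share_log_loss(dens)` requires no prior choice of $\lambda$: the weights $\omega_l$ concentrate on the values of $\lambda$ with the smallest cumulative loss.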


Exhibit 3.
Set $v_{e,1} = w_{e,1} := 1/(\#E)$, $\forall e$; $p_W = p_W(x_1|w_{E,1},\theta_{E,1})$; $R(x) := -\ln x$.
For $t = 1,\dots,T-1$:
  $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$;
  $v'_{e,t} = v_{e,t}\exp\{-\eta R(p_{\theta(e,t)})\}$;
  $v_{e,t+1} = \sum_{e'\in E} v'_{e',t}\,u_{t+1}(e,e')$;
  $v_{\cdot,t+1} = \sum_{e\in E} v_{e,t+1}$;
  $w_{e,t+1} = v_{e,t+1}/v_{\cdot,t+1}$;
  $p_W = p_W(x_{t+1}|w_{E,t+1},\theta_{E,t+1})$;
  $R_{1,\dots,t+1} = R_{1,\dots,t} + R(p_W)$.

Exhibit 4.
Set $v_{e,1} = w_{e,1} := 1/(\#E)$, $\forall e$; $\upsilon_{l,1} = \omega_{l,1} := 1/(\#L)$, $\forall l$; $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$; $p_W = p_W(x_1|w_{E,1},\theta_{E,1})$; $R(x) := -\ln x$.
For $t = 1,\dots,T-1$:
  $p_{\theta(e,t)} = p_{\theta(e,t)}(x_t)$;
  $v'_{e,l,t} = v_{e,l,t}\exp\{-\eta R(p_{\theta(e,t)})\}$;
  $v_{e,l,t+1} = \sum_{e'\in E} v'_{e',l,t}\,u_{t+1}(e,e'|l)$;
  $v_{\cdot,l,t+1} = \sum_{e\in E} v_{e,l,t+1}$;
  $w_{e,l,t+1} = v_{e,l,t+1}/v_{\cdot,l,t+1}$;
  $p_{W(l)} = p_W(x_{t+1}|w_{E,l,t+1},\theta_{E,t+1})$;
  $\upsilon_{l,t+1} = \upsilon_{l,t}\exp\{-\kappa R(p_{W(l)})\}$;
  $\upsilon_{\cdot,t+1} = \sum_{l\in L}\upsilon_{l,t+1}$;
  $\omega_{l,t+1} = \upsilon_{l,t+1}/\upsilon_{\cdot,t+1}$;
  $p_{WL} = p_{WL}(x_{t+1}|\omega_{L,t+1},w_{E,L,t+1},\theta_{E,t+1})$;
  $R_{1,\dots,t+1} = R_{1,\dots,t} + R(p_{WL})$.
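As a quick consistency check, the following Python sketch (all inputs simulated, parameter values illustrative) confirms numerically that the normalised recursion of Exhibit 1 and the unnormalised recursion of Exhibit 3 produce identical combination weights, which is the equivalence the proofs rely on.

```python
import numpy as np

# Numerical confirmation that Exhibit 1 (normalised weights) and Exhibit 3
# (unnormalised weights) yield the same w_{e,t}.
rng = np.random.default_rng(4)
n_experts, T, eta, lam = 4, 50, 1.0, 0.1
dens = rng.uniform(0.05, 1.0, size=(n_experts, T))   # simulated expert densities
U = np.full((n_experts, n_experts), lam / (n_experts - 1))
np.fill_diagonal(U, 1.0 - lam)

w = np.full(n_experts, 1.0 / n_experts)   # Exhibit 1: normalised weights
v = np.full(n_experts, 1.0 / n_experts)   # Exhibit 3: unnormalised weights
for t in range(T):
    w_new = U @ (w * dens[:, t] ** eta)
    w = w_new / w_new.sum()
    v = U @ (v * dens[:, t] ** eta)       # no normalisation of v itself
    assert np.allclose(w, v / v.sum())    # same combination weights
print("Exhibits 1 and 3 agree on the weights.")
```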


Figure 1. Simulated Data, Stationary Series with Outliers. [Time series plot; horizontal axis: Time, 0 to 1400.]

Figure 2. Simulated Data, Almost Locally Stationary Sequence. [Time series plot; horizontal axis: Time, 0 to 1400.]

Figure 3. Weights of the Forecast Combinations. Panel A. No Share Update; Panel B. Fixed Share Update .01; Panel C. Local Share Update .3. [Weight paths over 1400 periods for each panel.]

Figure 4. Log Returns of S&P 500 Index. [Time series plot; horizontal axis: Time, 0 to 1000.]

Figure 5. Density Plot of Losses. [Density curves labelled Fixed Share Learnt and No Share.]

Figure 6. Log Returns of Sugar Futures Contract. [Time series plot; horizontal axis: Time, 0 to 3000.]

Figure 7. Density Plot of Losses. [Density curves labelled Fixed Share Learnt and GARCH.]

Table 1. Summary Statistics of Loss

                           Mean          Max.
No Share                   2.11 (0.20)   243.28
Fixed Share .01            1.51 (0.03)    13.24
Fixed Share .05            1.53 (0.02)    11.64
Fixed Share .1             1.57 (0.02)    10.96
Fixed Share .2             1.65 (0.02)    10.27
Fixed Share .3             1.74 (0.02)     9.88
Fixed Share Learnt         1.51 (0.02)    13.24
Best Individual Forecast   2.11 (0.20)   243.28
Equally Weighted           2.15 (0.01)     8.85

Table 2. Summary Statistics of Loss

                           Mean          Max.
No Share                   1.48 (0.06)   26.44
Fixed Share .01            1.02 (0.03)    6.97
Fixed Share .05            1.03 (0.03)    5.64
Fixed Share .1             1.07 (0.03)    5.56
Fixed Share .2             1.14 (0.03)    5.33
Fixed Share .3             1.20 (0.03)    5.07
Fixed Share Learnt         1.02 (0.03)    6.97
Local Share .01            1.07 (0.03)   12.28
Local Share .05            1.02 (0.03)    9.57
Local Share .1             1.01 (0.03)    8.57
Local Share .2             1.00 (0.03)    7.76
Local Share .3             1.00 (0.03)    7.39
Local Share Learnt         1.00 (0.03)    7.41
Local&Fixed Share Learnt   1.00 (0.03)    7.41
Best Individual Forecast   1.48 (0.03)   12.63
Equally Weighted           1.57 (0.02)    4.22

Table 3. Summary Statistics of Loss

                           Mean          Max.
No Share                   1.56 (0.10)   76.13
Fixed Share Learnt         1.45 (0.06)   40.35
Best Individual Forecast   1.56 (0.11)   76.13
Equally Weighted           1.46 (0.06)   40.32

Table 4. Summary Statistics of Loss

                     Mean          Max.
Fixed Share Learnt   2.02 (0.02)   9.72
GARCH                2.07 (0.01)   9.22
