Shrinking the Cross Section∗
Serhiy Kozak†, Stefan Nagel‡, Shrihari Santosh§

March 19, 2017

PRELIMINARY AND INCOMPLETE [PRINT IN COLOR]

Abstract

We propose a new method of tackling the “multi-dimensionality challenge” in the cross section of equity returns. Our approach relies on exploiting economically-driven regularization to construct a robust stochastic discount factor (SDF) using individual stock returns and a vast array of characteristics. We impose penalties on estimated SDF coefficients in L2 and L1 norms, similar to the elastic net technique in machine learning. The penalties are motivated, respectively, by the need to down-weight contributions of small principal components to the total squared Sharpe ratio and by the sparsity of SDFs implied by most economic models. Our economically-motivated estimator delivers robust, sparse SDF representations that perform well out of sample.



∗ We thank Mike Chernov and seminar participants at Michigan for helpful comments and suggestions.
† Stephen M. Ross School of Business, University of Michigan, 701 Tappan St., Ann Arbor, MI 48109; e-mail: [email protected].
‡ Stephen M. Ross School of Business and Department of Economics, University of Michigan, 701 Tappan St., Ann Arbor, MI 48109; e-mail: [email protected].
§ R.H. Smith School of Business, University of Maryland; e-mail: [email protected].


1 Introduction

Studies on cross-sectional stock return predictability have found relationships between expected returns and hundreds of stock characteristics. An economic interpretation of the collection of these findings is difficult for a number of reasons. First, how many of these characteristics capture pricing information that is of marginal value over and above all the other characteristics? Empirical studies typically check whether a characteristic-based factor offers alpha relative to a few popular small-scale factor models, but this does not address the concern that many of the discovered predictability relationships could be redundant. Second, do the characteristics interact in their relationship to expected returns? Existing studies have explored a very small number of interactions – often with size or value – but a very large number of potential interactions remains unexplored. Third, could additional predictors, beyond the many that have already been explored, contain useful information about the cross-section of expected returns?

Due to the high-dimensional nature of the problem, these questions cannot be addressed with conventional statistical methods. We use tools from machine learning to do so. However, rather than relying on a purely statistical motivation for the regularization that we impose on the problem, we develop the regularization from economic restrictions. Our starting point is that we seek to summarize the investment opportunities offered by a large cross-section of stocks in a stochastic discount factor (SDF). By looking for an SDF representation, we are seeking factors that help price the cross-section. In this way, the SDF approach automatically leads towards the exclusion of factors that are redundant for pricing because their expected return premium fully derives from covariance with other factors in the SDF.

We then recognize that first and second moments of returns should be linked. As Kozak et al. (2015) argue, the absence of near-arbitrage opportunities is plausible in “behavioral” as well as “rational” models of asset prices. As a consequence, it should not be possible to earn large expected return premia without commensurate common factor risk exposures. Moreover, much of the limited Sharpe ratio that can be earned in the cross-section should come from dominant volatile factors rather than obscure small-variance factors. The restrictions we impose amount to penalizing both the L1 and L2 norms of SDF coefficients, similar to the elastic net technique in machine learning.[1]

[1] See Hastie et al. (2011) for a textbook treatment of ridge regression, lasso regression, and elastic nets.

The L2 penalty alone is similar to ridge regression. It is motivated by our prior work (Kozak et al., 2015) as a restriction on the total size of “sentiment” belief distortions in the context of the model in that paper. Alternatively, and more generally, one can interpret this penalty as a restriction on the total size of arbitrageurs’ cross-sectional deviations from market portfolio weights. The model in Kozak et al. (2015) implies that much of the variance of the SDF (or the maximum squared Sharpe ratio) should come from high-eigenvalue principal components (PCs) of returns. Based on this economic reasoning, we focus on penalizing high contributions from low-eigenvalue PCs. We show that our L2 penalty amounts exactly to such a down-weighting of contributions from small PCs. Further, we map the penalized estimation to Bayesian methods and argue that our estimator corresponds to an economically “reasonable” prior. We further argue that the L1 penalty, which is similar to lasso regression, can also be motivated by “sentiment” belief distortions or by many other economic models that predict a sparse SDF representation (e.g., the CAPM). It is then natural to impose a penalty in a norm that leads to a sparse solution.

We rely on a vast cross-section of stock characteristics in our analysis, yet we do not require sorting stocks into a small number of portfolios in the classical sense. Rather, we use the cross-section of characteristics to rotate the space of individual stocks into a space of “managed portfolios”, which allows for non-linearities and interactions.[2] Our method is able to efficiently incorporate thousands of such derived characteristics to produce an SDF that prices the cross-section well out of sample (does not overfit) and delivers high Sharpe ratios based on robust predictors. The final SDF representation allows us to examine which factors, interactions, and non-linearities are most important and most robust out of sample.

[2] See Cochrane (1991).

Unlike classical methods, which require us to be particularly mindful about the number and identity of characteristics used to test a model (to avoid over-fitting), our method tends to become more powerful as we expand the set of potential predictors. This eliminates the need to pre-screen and “fish” for factors and lets the data “speak”. Relatedly, the characteristics that our method uses may, but need not, be directly associated with the typical characteristics that underlie well-known asset pricing “anomalies”. Even if some characteristic is not priced unconditionally in the cross-section, it may show up in the SDF due to its interactions with other cross-sectional characteristics or time-varying instruments.

Contrary to classical approaches in the literature, our method does not require the intermediate steps of modeling expected returns, constructing ad hoc factors (long-short strategies based on univariate portfolio sorts), or verifying that expected returns line up with factor covariances in the cross-section. All of the economic content behind these steps is directly embedded into the method and automatically imposed during estimation. The essence of the method lies in using economic theory to effectively combine all individual stock returns into a mean-variance-efficient portfolio that satisfies the aforementioned economic constraints.

Our out-of-sample analysis demonstrates that the method successfully avoids in-sample over-fitting and generates high Sharpe ratios that are robust out of sample. These Sharpe ratios are delivered by efficiently combining known anomalies, as well as by uncovering some interesting and previously unknown interactions of anomalies with other anomalies or with other, individually unpriced, characteristics.[3] For example, we find that more aggressive “cubic” strategies are important for many anomalies, and some interesting interactions, such as interactions of industry relative reversals with beta arbitrage, idiosyncratic volatility, momentum, and value-momentum strategies, naturally appear in our analysis.

[3] As Harvey et al. (2015) note, “it is possible that a particular factor is very important in certain economic environments and not important in other environments. The unconditional test might conclude the factor is marginal.”

Our approach is closely related to several powerful regularization techniques used in machine learning: ridge regression, lasso regression, and elastic nets. It differs in several important ways. First, our penalties are economically motivated: we penalize contributions of the smallest PCs to the total maximum squared Sharpe ratio the most. Second, unlike in ridge/lasso regressions, we do not normalize and center all variables: our predictions are much sharper – the pricing equation must hold without an intercept, and expected returns should be proportional to covariances. Third, our objective is to maximize the squared Sharpe ratio (minimize the distance to the mean-variance frontier) rather than minimize average prediction error, as in the purely statistical techniques used in machine learning.

Several related papers consider estimation with many characteristics. DeMiguel et al. (2017) maximize the squared Sharpe ratio of managed portfolios (minimize the HJ distance) subject to an L1 constraint on portfolio weights. Further, they include transaction costs in their objective function. Hence, their method searches for a portfolio with sparse weights that balances the tradeoff between a high squared Sharpe ratio and low transaction costs. In contrast, we look for the HJ minimum-variance SDF that prices excess returns, and we economically motivate our L2 penalty. As we show in Section 3, the L2 penalty is practically important due to the non-negligible correlation of managed portfolio returns. Further, we incorporate non-linearities and interactions of characteristics, which dramatically expands the set of test assets and strategies we can explain. Freyberger et al. (2017) estimate a typical forecasting regression of individual stock returns on characteristics subject to an L1 penalty on the predictive coefficients. In Section 2 we show their method is equivalent to Bayesian estimation of managed portfolio expected returns with a particular prior, one which is substantively different from our motivation. Further, they mention the possibility of including interactions of characteristics, but do not do so in their estimation. Finally, we focus on finding a good approximation to the SDF, whereas their objective is forecasting returns on individual securities. Green et al. (2014) similarly forecast returns using a large set of predictive characteristics and regularize the estimation with an L1 penalty (lasso). Of the 100 characteristics they consider, 24 have large multivariate t-statistics. Concerned with the issue of (non-traded) spurious factors that are uncorrelated with asset returns, Bryzgalova (2016) estimates factor risk premia using a weighted L1 penalty. Since our factors are managed portfolios, they are, by definition, tradable. Hence, we address a fundamentally different problem: high-dimensionality rather than lack of identification.[4]

[4] Lack of identification in the two-stage Fama-MacBeth procedure can be resolved by estimating factor risk premia as the time-series average of the factor-mimicking portfolio return, which leads to an unbiased and consistent estimator. We address the issue of small-sample uncertainty in this estimator.

2 Methodology

For any point in time t, let R_t denote an N × 1 vector of excess returns, and Z_{t-1} an N × H matrix of asset characteristics (with H possibly quite large – potentially thousands of characteristics). Let Z_{t-1} be centered and standardized cross-sectionally at each t.

2.1 SDF

Consider a projection of the true SDF onto the space of excess returns,

$$ M_t = 1 - b_{t-1}' \left( R_t - \mathbb{E}R_t \right), \tag{1} $$

where b_{t-1} is an N × 1 vector of SDF coefficients. We parametrize the coefficients b_{t-1} as a linear function of the characteristics,

$$ b_{t-1} = Z_{t-1} b, \tag{2} $$

where b is an H × 1 vector of time-invariant coefficients. Therefore, rather than estimating SDF coefficients for each stock at each point in time, we estimate them as a single function of characteristics that applies to all stocks over time. The idea behind this approach is similar to Brandt et al. (2009) and DeMiguel et al. (2017). Plugging Eq. 2 into Eq. 1 delivers an SDF that is in the linear span of the H (basis) trading strategy returns F_t = Z_{t-1}' R_t that can be created based on stock characteristics, i.e.,

$$ M_t = 1 - b' \left( F_t - \mathbb{E}F_t \right). \tag{3} $$

2.1.1 Rotation

The transformation in Eq. 3 defines a rotation of the space of individual stock returns R_t ∈ R^N into the space of “managed portfolios” F_t ∈ R^H. Z_{t-1} defines a transformation R^N → R^H, i.e., it maps the space of N individual stock returns into a space of H trading strategies (managed portfolios) as follows:

$$ F_t = Z_{t-1}' R_t. \tag{4} $$
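To fix ideas, here is a minimal NumPy sketch of this rotation. The array names and shapes (a T × N × H panel of lagged characteristics) are our own illustrative assumptions, not the paper’s code:

```python
import numpy as np

def managed_portfolio_returns(Z_lag: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Rotate individual stock returns into managed-portfolio returns,
    F_t = Z_{t-1}' R_t (Eq. 4), date by date.

    Z_lag : (T, N, H) lagged, cross-sectionally normalized characteristics
            (hypothetical layout; missing stocks assumed filled with zeros).
    R     : (T, N) individual stock excess returns.
    Returns F : (T, H) managed-portfolio excess returns.
    """
    return np.einsum("tnh,tn->th", Z_lag, R)
```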

This rotation is motivated by an implicit assumption that the characteristics fully capture all aspects of the joint distribution of returns that are relevant for the purpose of constructing an SDF, i.e., that expected returns, variances, and covariances are stable functions of characteristics such as size and book-to-market ratio, and not of security names (Cochrane, 2011). This (implicit) assumption was the driving force for using portfolio sorts in cross-sectional asset pricing in the first place. Managed portfolios allow us to generalize this idea and be more flexible.

Note that even though we assumed that all coefficients b are constant, this is without loss of generality, because we can always re-state a model with time-varying b_{t-1} as a model with a constant b and an extended set of factors F_t. For instance, suppose we can capture the time variation in b_{t-1} by some set of time-series instruments z_{t-1}. Then we can simply rewrite the SDF as M_t = 1 − b' F̃_t, where F̃_t = z_{t-1} ⊗ F_t and ⊗ denotes the Kronecker product of two vectors (Brandt et al., 2009; Cochrane, 2005, Ch. 8).

2.1.2 The MVE portfolio

Given the asset pricing equation,

$$ \mathbb{E}\left[ M_t F_t \right] = 0, \tag{5} $$

in population we could solve for

$$ b = \Sigma^{-1} \mathbb{E}\left( F_t \right), \tag{6} $$

the (SDF) coefficients in a (cross-sectional) projection with H “explanatory variables” and H “dependent variables”, where Σ ≡ cov(F_t) = E[(F_t − EF_t)(F_t − EF_t)']. The SDF coefficients are also the weights of the mean-variance-efficient (MVE) portfolio.

2.1.3 Sample estimators

Consider a sample of size T, where T > H, but possibly T < N. We denote by

$$ \mu_T = \frac{1}{T} \sum_{t=1}^{T} F_t \tag{7} $$

$$ \Sigma_T = \frac{1}{T} \sum_{t=1}^{T} \left( F_t - \mu_T \right) \left( F_t - \mu_T \right)' \tag{8} $$

the maximum likelihood estimates of means and covariances, respectively.[5] A natural, but naïve, “plug-in” estimator of b is

$$ \hat{b} = \frac{T - N - 2}{T} \, \Sigma_T^{-1} \mu_T, $$

where Σ_T^{-1} is the Moore-Penrose pseudo-inverse of Σ_T and (T − N − 2)/T is a bias adjustment. This estimator is unbiased (under joint normality of returns), but is imprecise.[6]

[5] These estimators are MLE under joint normality of returns.
[6] Under normality, μ_T and Σ_T are independent, and are unbiased estimators of μ and Σ, respectively.

To see this, consider an orthogonal rotation P_t = Q' F_t with Σ_T = Q D_T Q', where Q is the matrix of eigenvectors of Σ_T and D_T is the sample diagonal matrix of eigenvalues. If we express the SDF as M_t = 1 − b_P' (P_t − E P_t), we have

$$ \hat{b}_P = \frac{T - N - 2}{T} \, D_T^{-1} \mu_{P,T}. $$

Consider the analytically simple case where D is known and replace D_T^{-1} with D^{-1}.[7] Then we have

$$ \sqrt{T} \left( \hat{b}_P - b_P \right) \sim N\left( 0, D^{-1} \right), $$

which shows that the estimated SDF coefficients on small-eigenvalue PCs (small d_i) have explosive uncertainty. This problem is exacerbated when D^{-1} is unknown and must therefore be estimated. It is well known that the sample eigenvalues of D (equivalently, Σ) are “over-dispersed” relative to the true eigenvalues, especially when the number of characteristics, H, is comparable to the sample size, T. This implies that, on average, the smallest estimated eigenvalue is too small and hence the corresponding b̂_i has even greater variance than shown above.

[7] With high-frequency (daily) data and even hundreds of factors, D^{-1} is estimated quite well as measured by the loss function tr[(D_T^{-1} D − I)^2]/N^2.

2.2 Smooth Regularization

The “ill-conditioned” problem of estimating b cries out for regularization. A commonly used method to deal with overfitting is generalized Tikhonov regularization, which takes the form of a penalized minimization:

$$ \hat{b} = \arg\min_{b} \left\{ \left( \mu_T - \Sigma_T b \right)' A \left( \mu_T - \Sigma_T b \right) + \gamma \, b' \Omega b \right\}, $$

where the H × H weighting matrices A and Ω are chosen by the econometrician. What are reasonable choices? A = I corresponds to OLS estimation, or maximizing the cross-sectional R². A = Σ^{-1} corresponds to GLS, or minimizing the HJ-distance (Hansen and Jagannathan, 1997) between the in-sample ex-post SDF that prices all assets and our estimated “robust” SDF.[8] A penalty γ b' Σ_T^1 b acts like a constraint on the maximum model-implied squared Sharpe ratio (the variance of the SDF). A penalty γ b' Σ_T^0 b = γ b'b is the standard penalty in ridge regression. It is further motivated by our prior work as an equilibrium outcome induced by a restriction on the total size of “sentiment” belief distortions (Kozak et al., 2015). More generally, one can interpret this penalty as a restriction on the total size of arbitrageurs’ cross-sectional deviations from the market portfolio weights. Many economic models provide guidance on the size of such deviations. As a concrete example, the CAPM implies an SDF of the form M_t = a + b × R_t^M, i.e., that all cross-sectional deviations are exactly zero.

[8] See Kan and Robotti (2008) for a modified HJ-distance when pricing excess returns with SDFs having unit mean.

From a Bayesian perspective, it makes economic sense to think of prior distributions (centered at zero) on the SDF coefficients b_i associated with the additional candidate factors that we, as econometricians, might try to use. Such mean-zero Bayesian priors naturally lead to penalizing (some) norm of b as well. This is precisely the approach of “asset pricing priors” in Pástor (2000) and Pástor and Stambaugh (2000).

Consider a class of estimators

$$ \hat{b} = \arg\min_{b} \left\{ \left( \mu_T - \Sigma_T b \right)' \Sigma_T^c \left( \mu_T - \Sigma_T b \right) + \gamma \, b' \Sigma_T^d b \right\}, $$

where the exponents c and d are still undetermined. The analytic solution is given simply by

$$ \hat{b} = \left( \Sigma_T + \gamma \Sigma_T^{d-c-1} \right)^{-1} \Sigma_T \, \Sigma_T^{-1} \mu_T, \tag{9} $$

which takes the form of “shrinkage” relative to the “plug-in” estimator Σ_T^{-1} μ_T. The regularization is “smooth” since the estimated parameter, b̂, is a differentiable function of the data, μ_T. The form of the solution shows immediately that the two weighting matrices do not independently impact the solution; the estimator depends only on the difference in the exponents. OLS with the penalty γ b' Σ_T b yields the same estimator as GLS with the penalty γ b'b. The problem we face thus reduces to a choice of d − c. To resolve this choice, we explore the problem from the Bayesian perspective.

Consider the family of priors F_t | μ ∼ N(μ, Σ), μ ∼ N(0, (κ/τ) Σ^η) with known Σ, where τ = trace[Σ] and κ is a constant controlling the dispersion of μ. This family nests many commonly used priors (obtained by varying η and κ) and maps into the various regularized estimators. Given the sample of length T, the posterior means of μ and b = Σ^{-1} μ are given by

$$ \hat{\mu} = \left( \Sigma + \gamma \Sigma^{2-\eta} \right)^{-1} \Sigma \, \mu_T \tag{10} $$

$$ \hat{b} = \left( \Sigma + \gamma \Sigma^{2-\eta} \right)^{-1} \Sigma \, \Sigma^{-1} \mu_T, \tag{11} $$

where γ = τ/(κT). Notice that a small κ (a “tighter” prior around zero) implies a large γ. Eq. 11 takes the same form as Eq. 9, with the correspondence 2 − η = d − c − 1. Hence, a “reasonable” η can be used to discipline the choice of a “reasonable” penalized estimator. We now determine reasonable values of η.

2.2.1 Two “redundant” assets

Consider a concrete two-asset example where both returns have variance σ² and correlation ρ > 0. We explore the limit as ρ → 1 to see what happens in the case of redundant assets. We consider three choices, η ∈ {0, 1, 2}, of which the first two correspond to commonly used priors. η = 0 implies a simple, seemingly agnostic i.i.d. prior, μ ∼ N(0, (κ/τ) I). η = 1 implies μ ∼ N(0, (κ/τ) Σ), used in Pástor (2000) and Pástor and Stambaugh (2000). η = 2 implies μ ∼ N(0, (κ/τ) Σ²), which we have not found in any previous studies.

We can “rule out” η = 0 by considering the maximum squared Sharpe ratio, μ' Σ^{-1} μ. The expected maximum squared SR is given simply by

$$ \mathbb{E}\left[ \mu' \Sigma^{-1} \mu \right] = \sigma^{-2} \left( 1 - \rho^2 \right)^{-1} \frac{2\kappa}{\tau}. $$

The expected maximum squared Sharpe ratio explodes to ∞ as ρ → 1, which is implausible, since it implies arbitrage.

We can “rule out” η = 1 by considering the SDF weights, b. Recall that when η = 1, b ∼ N(0, (κ/τ) Σ^{-1}), with

$$ \Sigma^{-1} = \sigma^{-2} \left( 1 - \rho^2 \right)^{-1} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix} $$

in this two-asset example. In the limit when the two base assets are redundant (ρ → 1), their optimal weights are perfectly negatively correlated, with infinite variance under this prior! Our prior is that highly correlated assets should not have highly negatively correlated weights, and that the introduction of nearly redundant assets should not cause weights to explode.

The case of η = 2 implies b ∼ N(0, (κ/τ) I), an i.i.d. prior on the SDF coefficients; η = 2 is the smallest value for which the coefficients do not explode as ρ → 1.[9]

[9] In principle, one can choose a prior with η > 2. For instance, η = 3 leads to standard ridge shrinkage, which down-weights small PCs even more aggressively. In practice, we found that the performance of such shrinkage is comparable to our method, but the coefficient paths are less monotone and less stable. We therefore focus on η = 2 due to its computational advantages and higher robustness.

The corresponding penalized estimator is

$$ \hat{b} = \left( \Sigma_T + \gamma I \right)^{-1} \Sigma_T \left[ \Sigma_T^{-1} \mu_T \right] \tag{12} $$

$$ \;\; = \left( \Sigma_T + \gamma I \right)^{-1} \mu_T, \tag{13} $$

where the second representation is numerically more stable with a large-dimensional Σ. Hence,

we choose to minimize the HJ-distance subject to an L2-norm penalty γ b'b; that is, we solve the following problem:

$$ \hat{b} = \arg\min_{b} \left\{ \left( \mu_T - \Sigma_T b \right)' \Sigma_T^{-1} \left( \mu_T - \Sigma_T b \right) + \gamma \, b'b \right\}. \tag{14} $$

Equivalently, we can minimize the sum of squared cross-sectional pricing errors subject to a maximum Sharpe ratio penalty γ b' Σ_T b,

$$ \hat{b} = \arg\min_{b} \left\{ \left( \mu_T - \Sigma_T b \right)' \left( \mu_T - \Sigma_T b \right) + \gamma \, b' \Sigma_T b \right\}, \tag{15} $$

which leads to the same solution given by equations (12) and (13). We focus on the specification in Eq. 14 in our subsequent analysis.
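For concreteness, the closed-form solution in Eq. 13 can be computed directly; below is a minimal NumPy sketch (the function name and input layout are our own illustrative choices):

```python
import numpy as np

def sdf_coefficients_l2(F: np.ndarray, gamma: float) -> np.ndarray:
    """Closed-form L2-penalized SDF coefficients, Eq. (13):
    b_hat = (Sigma_T + gamma * I)^{-1} mu_T, computed from a (T, H)
    matrix of managed-portfolio excess returns F."""
    mu_T = F.mean(axis=0)
    Sigma_T = np.cov(F, rowvar=False, bias=True)   # MLE (divides by T), Eq. (8)
    return np.linalg.solve(Sigma_T + gamma * np.eye(F.shape[1]), mu_T)
```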

2.2.2 Economic interpretation

One particularly appealing economic feature of the penalty on b'b is that it leads to the shrinkage of the contributions of small PCs to the maximum squared Sharpe ratio. Recall that Q is the matrix of eigenvectors of Σ_T and D is the diagonal matrix of eigenvalues. Then

$$ \mathrm{var}\left( M_t \right) = \max SR^2 = \hat{b}' \Sigma_T \hat{b} = \mu_T' \left( \Sigma_T + \gamma I \right)^{-1} \Sigma_T \left( \Sigma_T + \gamma I \right)^{-1} \mu_T \tag{16} $$

$$ = \mu_T' Q D \left( D + \gamma I \right)^{-2} Q' \mu_T = \sum_{j=1}^{H} \frac{\left( q_j' \mu_T \right)^2}{d_j} \left( \frac{d_j}{d_j + \gamma} \right)^2, \tag{17} $$

where d_j are the diagonal elements of D, q_j are the columns of Q, and (q_j' μ_T)²/d_j is the contribution of the j-th PC to the maximum in-sample squared Sharpe ratio. Note that since γ ≥ 0, we have 0 < d_j/(d_j + γ) ≤ 1. The multiplication by (d_j/(d_j + γ))² therefore causes contributions to the max squared SR from low-eigenvalue PCs to be penalized more than contributions of high-eigenvalue PCs.

Fitted means are shrunk in a similar fashion,

$$ \hat{\mu}_T = \left( \Sigma_T + \gamma I \right)^{-1} \Sigma_T \mu_T = Q \left( D + \gamma I \right)^{-1} D Q' \mu_T \tag{18} $$

$$ = \sum_{j=1}^{H} q_j \left( \frac{d_j}{d_j + \gamma} \right) q_j' \mu_T, \tag{19} $$

with stronger shrinkage applied to smaller d_j. Small d_j correspond to directions in the space of factors having small variance.[10] To see this clearly, consider the rotation of the original space of returns into the space of principal components:

$$ \hat{\mu}_T^{PC_j} \equiv q_j' \hat{\mu}_T = \frac{d_j}{d_j + \gamma} \, \mu_T^{PC_j}, \tag{20} $$

i.e., for a j-th PC of returns with a small eigenvalue d_j, its fitted mean μ̂_T^{PC_j} is forced to be close to zero. Compare this to the OLS/GLS solution without penalty,

$$ \hat{\mu}_T^{PC_j, LS} = \mu_T^{PC_j}, \tag{21} $$

which leads to no shrinkage. Our procedure therefore jointly tilts the covariance matrix (the PC rotation) and the expected return estimates in a way that limits the contributions of small PCs to the total squared Sharpe ratio.

[10] Similar calculations for OLS estimates give μ̂_T^{LS} = Σ_T b̂^{LS} = Σ_T (Σ_T' Σ_T)^{-1} Σ_T' μ_T = Q Q' μ_T = μ_T in the case when Σ_T is square and symmetric. The intuition is straightforward: we regress N variables on N predictors, so we perfectly fit all means.

The economic interpretation of such shrinkage is that we judge as economically implausible the case in which a principal component of the candidate factors has a high mean return (or a high contribution to the total squared Sharpe ratio) but a small eigenvalue. In Kozak et al. (2015) we argue, in the context of the “behavioral” asset pricing model in that paper, that the only way to generate large cross-sectional variance of expected returns is to have sentiment investors’/arbitrageurs’ demands line up with a few large PCs.

It is important to point out that the appealing property of the L2 penalty mentioned above (that it shrinks the contributions of small PCs to the total squared Sharpe ratio) is specific to our setup and the representation we are working with. Because we started with the asset pricing equation, Eq. 5, our explanatory variables are covariances with candidate factors, which are test asset returns themselves (so that the number of test assets equals the number of candidate factors). The PC-based interpretation we discussed obtains because of this latter fact. By contrast, in the Fama-MacBeth regression setting (e.g., Freyberger et al. 2017), where one predicts expected returns with stock-level characteristics, an L2 penalty does not have such a clear interpretation. Likewise, the L1 penalty used in DeMiguel et al. (2017), corresponding to lasso estimation, tends to achieve a different outcome. When a group of variables is highly correlated, lasso has a tendency to keep the “best” one and drop the rest. For example, when including managed portfolios formed from the related variables D/P, E/P, C/P, and BE/ME, lasso is likely to keep the portfolio with the highest Sharpe ratio and drop the rest (in favor of other portfolios formed from orthogonal characteristics). In contrast, our L2 penalty will keep all four, but jointly shrink their SDF coefficients towards each other, and towards zero.

which leads to no shrinkage. Our procedure therefore jointly tilts the covariance matrix (PC rotation) and expected return estimates in a way that limits contributions of small PCs to the total squared Sharpe ratio. The economic interpretation of such shrinkage is that we judge as economically implausible the case that a principal component of the candidate factors has high mean return (or high contribution to the total squared Sharpe ratio), but a small eigenvalue. In Kozak et al. (2015) we argue, in the context of the “behavioral” asset pricing model in that paper, the only way to generate large cross-sectional variance of expected returns is to have sentiment investors’/arbitrageurs’ demands line up with few large PCs. It is important to point out that the appealing property of the L2 penalty mentioned above (that it shrinks the contributions of small PCs to the total squared Sharpe ratio) is specific to our setup and the representation we are working with. Because we started with the asset pricing equation Eq. 5, our explanatory variables are covariances with candidate factors, which are test asset returns themselves (so that the number of test assets equals the number of candidate factors). The PC-based interpretation we discussed obtains because of this latter fact. On the contrary, in the Fama-Macbeth regression setting (e.g., Freyberger et al. 2017) where one predicts expected returns with stock-level characteristics, an L2 penalty does not have such clear interpretation. Likewise, the L1 penalty used in DeMiguel et al. (2017), corresponding to lasso estimation, tends to achieve a different outcome. When a group of variables are highly correlated, lasso has a tendency to keep the “best” one and drop the rest. For example, when including managed portfolios formed from the related BE , E , C , and M , lasso is likely to keep the portfolio with highest Sharpe ratio variables, D P P P E and drop the rest (in favor of other portfolios formed from orthogonal characteristics). In contrast, our L2 penalty will keep all four, but jointly shrink their SDF coefficients towards each other, and towards zero.


2.2.3 Alternative Representation: E[F] = β′λ

Instead of estimating the coefficients b in M_t = 1 − b'(F_t − EF_t), we can equivalently estimate λ in E[r_i] = β_i' λ, where β_i = Σ^{-1} cov(F, r_i). Penalized estimation of λ (penalized Fama-MacBeth estimation) then takes the form

$$ \hat{\lambda} = \arg\min_{\lambda} \left\{ \left( \mu_T - I\lambda \right)' \Sigma_T^c \left( \mu_T - I\lambda \right) + \gamma \, \lambda' \Sigma_T^d \lambda \right\}, $$

where c and d need not be the same as above in the SDF representation. The FOC immediately yields the solution:

$$ \hat{\lambda} = \left( \Sigma + \gamma \Sigma^{1+d-c} \right)^{-1} \Sigma \mu_T. \tag{22} $$

This has a similar form to Eq. 10; in both cases a higher γ corresponds to greater shrinkage (toward zero). The two are identical (for a given γ) when d − c = 1 − η. We argued above that η = 2 gives a reasonable prior, which implies d − c = −1. For OLS estimation we have c = 0, which implies d = −1. The penalty then becomes γ λ' Σ^{-1} λ, which is a penalty on the model-implied maximum squared Sharpe ratio. For GLS estimation, c = −1, which implies d = −2. The penalty then becomes γ λ' Σ^{-2} λ. This shows immediately that an L2 penalty on λ, γ λ'λ, implies an unreasonable prior, whether one estimates via OLS or GLS. Hence, naïve application of statistical techniques may produce economically unreasonable estimates.

2.3 L1 penalty

The L2 penalty results in shrinkage of the elements of b̂, but none of the coefficients is set to exactly zero. However, many economic theories predict sparsity of the weights on zero-investment portfolios in the SDF. For instance, the CAPM predicts a single-factor representation; investment-based asset pricing models often work with reduced-form SDFs that can be expressed in terms of a few factors and imply that a few firm characteristics, such as BE/ME or I/A, span exposures to these few factors. In the context of behavioral models, one can also argue that sentiment investors’ limited attention implies that only a relatively small number of common factors are subject to sentiment. Kozak et al. (2015), for example, restrict the size of behavioral investors’ demands in the L2-norm, but similar considerations can be used to motivate an L1 constraint.

Taking these predictions into account, it makes economic sense to think of a prior distribution of the b_i associated with the characteristics that we, as econometricians, might try to use, which is both centered at zero and allows for sparse representations, consistent with the aforementioned theories. One example of such a prior is a centered Laplace prior on the SDF coefficients b. This prior can be shown to be equivalent to imposing an L1 penalty on the SDF coefficients (Hastie et al., 2011).

We thus impose an L1 analog of the penalty on b'b by penalizing the sum of absolute values of the SDF coefficients, Σ_{j=1}^{H} |b_j|. Such an approach leads to a version of lasso regression. It allows us to achieve sparsity – some elements of b̂ are set to zero. This amounts to automatic factor selection, which may be an attractive feature because it allows us to express the SDF based on factors constructed from a relatively small set of stock characteristics.

2.4 Combined Specification

Combining both the L1 and L2 penalties, our estimator solves the problem

$$ \hat{b} = \arg\min_{b} \left\{ \left( \mu_T - \Sigma_T b \right)' \Sigma_T^{-1} \left( \mu_T - \Sigma_T b \right) + \gamma_1 \, b'b + \gamma_2 \sum_{i=1}^{H} \left| b_i \right| \right\}. \tag{23} $$

The method is similar to the elastic net technique used in machine learning. Zou and Hastie (2005) advocate the use of this method as a compromise between ridge and lasso. They argue that the elastic net selects variables like the lasso and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the general Lq penalties (for 0 < q < 1). The L2 penalty (and ridge regression) is known to shrink the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. In the extreme case of k identical predictors, they each get identical coefficients with 1/k-th the size that any single one would get if fit alone. From a Bayesian point of view, the L2 (ridge) penalty is ideal if there are many predictors and all have non-zero coefficients (drawn from a Gaussian distribution). The L1 penalty, on the other hand, is somewhat indifferent among very correlated predictors and will tend to pick one and ignore the rest. In the extreme case above, the lasso problem breaks down. The L1 (lasso) penalty corresponds to a Laplace prior, which expects many coefficients to be close to zero and a small subset to be larger and nonzero. Similarly to the elastic net, our method performs much like the lasso, but removes any degeneracies and wild behavior caused by extreme correlations and puts a stronger emphasis on large PCs. Combining both penalties creates a useful compromise between ridge and lasso. As we vary the relative strength of the two types of penalties, the sparsity of the solution (i.e., the number of coefficients equal to zero) increases monotonically from 0 to the sparsity of the lasso solution (Friedman et al., 2010).

There are important differences between our method and the elastic net, however. First,


in our method, both penalties are economically motivated. Second, we must not normalize and center all variables: our objective is the HJ-distance with no intercept allowed (we impose that the zero-β rate is the risk-free rate). Third, we maximize the squared Sharpe ratio (minimize the distance to the mean-variance frontier) instead of minimizing (unweighted) pricing errors (our objective includes the weighting matrix Σ_T^{-1}). Finally, it is well known that ridge and lasso techniques introduce a substantial amount of bias by shrinking all coefficients towards zero. Moreover, the elastic net estimator incurs double shrinkage, which results in extra bias. Zou and Hastie (2005) propose to correct this issue by re-scaling the final estimates by a constant to undo the double shrinkage. In our context, much of the bias comes from the “level” bias. We are effectively doing two things: (i) shrinking the total variance of the SDF by re-scaling all b_i in the direction of zero; and (ii) re-scaling the b_i cross-sectionally in a way that penalizes the smallest PCs the most. However, because we are looking for an SDF in the span of excess returns, we need not worry about its scale. In our application we therefore later undo the “level” shrinkage and focus only on the cross-sectional aspect (this has no effect on Sharpe ratios) by regressing average excess returns on the estimated restricted (shrunk) MVE portfolio and using the estimated OLS coefficient (an extra degree of freedom) to rescale the final SDF.

2.4.1 Solution method

We use a modified version of the Least Angle Regression (LAR) algorithm to solve this problem.[11] We provide more details about the LAR algorithm and our modifications in Appendix A.1.

[11] See Hastie et al. (2011) for an excellent reference on regularization and estimation algorithms.
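Our actual solver is the modified LAR algorithm of Appendix A.1. Purely as an illustration of the objective in Eq. 23, the following proximal-gradient (ISTA-style) sketch solves the same problem; the function names and convergence settings are our own assumptions:

```python
import numpy as np

def soft_threshold(x: np.ndarray, thresh: float) -> np.ndarray:
    """Elementwise soft-thresholding operator (prox of the L1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def sdf_coefficients(mu, Sigma, gamma1, gamma2, n_iter=10_000, tol=1e-10):
    """Solve Eq. (23) by proximal gradient descent (ISTA).

    Objective: (mu - Sigma b)' Sigma^{-1} (mu - Sigma b)
               + gamma1 * b'b + gamma2 * sum(|b_i|).
    With Sigma^{-1} as the weighting matrix, the smooth part has the
    simple gradient 2 * (Sigma @ b + gamma1 * b - mu).
    """
    b = np.zeros(len(mu))
    # Step size = 1 / Lipschitz constant of the smooth gradient
    step = 1.0 / (2.0 * (np.linalg.eigvalsh(Sigma)[-1] + gamma1))
    for _ in range(n_iter):
        grad = 2.0 * (Sigma @ b + gamma1 * b - mu)
        b_new = soft_threshold(b - step * grad, step * gamma2)
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b
```

With gamma2 = 0 the fixed point is b = (Σ_T + γ₁I)^{-1} μ_T, matching the closed form in Eq. 13.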

2.5 Normalizations and Rescaling of Characteristics

In order to focus exclusively on the cross-sectional aspect of return predictability, remove the influence of outliers, and keep leverage constant across all portfolios, we perform certain normalizations of the characteristics that define our managed portfolios in Eq. 4. First, similarly to Asness et al. (2014) and Freyberger et al. (2017), we perform a simple rank transformation for each characteristic. For each characteristic i of a stock s at a given time t, denoted c^i_{s,t}, we sort all stocks based on the values of their respective characteristics c^i_{s,t} and rank them cross-sectionally (across all s) from 1 to n_t, where n_t is the number of stocks at t for which this characteristic is available.[12] We then normalize all ranks by dividing by n_t + 1 to obtain the value of the rank transform:

$$ rc^i_{s,t} = \frac{\mathrm{rank}\left( c^i_{s,t} \right)}{n_t + 1}. \tag{24} $$

[12] If two stocks are “tied”, we assign the average rank to both. For example, if two firms have the lowest value of c, they are both assigned a rank of 1.5 (the average of 1 and 2). This preserves any symmetry in the underlying characteristic.

Next, we normalize each rank-transformed characteristic rc^i_{s,t} by first centering it cross-sectionally and then dividing by the sum of absolute deviations from the mean across all stocks:

$$ z^i_{s,t} = \frac{ rc^i_{s,t} - \overline{rc}^i_t }{ \sum_{s=1}^{n_t} \left| rc^i_{s,t} - \overline{rc}^i_t \right| }, \tag{25} $$

where \overline{rc}^i_t = \frac{1}{n_t} \sum_{s=1}^{n_t} rc^i_{s,t}. The resulting transformed characteristics z^i_{s,t} are insensitive to outliers and allow us to keep the total exposure of a portfolio to a characteristic-based strategy (its leverage) fixed. For instance, doubling the number of stocks at any time t has no effect on the overall exposure of a strategy. Finally, we combine all transformed characteristics z^i_{s,t} for all stocks into a matrix of instruments Z_t, which we use in our analysis in Eq. 4.
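A compact sketch of this two-step transformation (Eqs. 24–25); the helper name is hypothetical:

```python
import numpy as np
from scipy.stats import rankdata

def transform_characteristic(c: np.ndarray) -> np.ndarray:
    """Rank-transform (Eq. 24) and normalize (Eq. 25) one characteristic's
    cross-section. `c` holds the raw values for the n_t stocks available at
    date t; `rankdata` assigns average ranks to ties, as in footnote [12]."""
    rc = rankdata(c) / (len(c) + 1)       # Eq. (24)
    dev = rc - rc.mean()                  # center cross-sectionally
    return dev / np.abs(dev).sum()        # Eq. (25)
```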

2.6 Interactions

One of the important novel contributions of this paper is the analysis of interactions of characteristics. Namely, for any two given rank-transformed characteristics z^i_{s,t} and z^j_{s,t} of a stock s at time t, we define the first-order interaction characteristic z^{ij}_{s,t} as the product of the two original characteristics, re-normalized using Eq. 25 as follows:

$$ z^{ij}_{s,t} = \frac{ z^i_{s,t} z^j_{s,t} - \frac{1}{n_t} \sum_{s=1}^{n_t} z^i_{s,t} z^j_{s,t} }{ \sum_{s=1}^{n_t} \left| z^i_{s,t} z^j_{s,t} - \frac{1}{n_t} \sum_{s=1}^{n_t} z^i_{s,t} z^j_{s,t} \right| }. \tag{26} $$

We include all first-order interactions in our empirical tests in Section 3. In addition to interactions, we also include second and third powers of each characteristic, which are defined analogously. Note that although we re-normalize all characteristics after interacting them or raising them to powers, we do not re-rank them. For example, the cube of any given characteristic is then a new, distinct characteristic that has stronger exposure to stocks with extreme realizations of the original characteristic (tails).
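A sketch of this construction, reusing the normalization of Eq. 25 (hypothetical helper names):

```python
import numpy as np

def renormalize(x: np.ndarray) -> np.ndarray:
    """Center cross-sectionally and divide by the sum of absolute
    deviations, as in Eq. (25)."""
    dev = x - x.mean()
    return dev / np.abs(dev).sum()

def interaction_characteristic(z_i: np.ndarray, z_j: np.ndarray) -> np.ndarray:
    """First-order interaction characteristic, Eq. (26): the product of two
    transformed characteristics, re-normalized but NOT re-ranked."""
    return renormalize(z_i * z_j)

# Powers are treated analogously, e.g. a "cubic" characteristic:
# z_cubed = renormalize(z_i ** 3)
```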

2.6.1 Interpreting the interactions

How do we interpret the interactions? For simplicity, consider two binary strategies with characteristic values that can be either high or low (±1). Let z¹_s and z²_s be the characteristic values for stock s. The pair (z¹_s, z²_s) takes on four values, shown in the table below:

z¹_s \ z²_s    −1    +1
   +1           A     B
   −1           C     D

The letters A to D are names attached to each cell. Let μ_i, i ∈ {A, B, C, D}, be the mean returns of stocks in each cell. For simplicity, suppose the characteristics are uncorrelated, so that each cell contains the same number of firms. Further, suppose returns are cross-sectionally demeaned (equivalent to including a time fixed effect, or an equal-weight market portfolio factor).

What is the expected return on the z¹_s-mimicking portfolio? That is, what is λ₁ ≡ E[z¹_s R_s]? It is simply ½(μ_A + μ_B − μ_C − μ_D). Similarly, λ₂ ≡ E[R_s z²_s] = ½(−μ_A + μ_B − μ_C + μ_D) and λ₁₂ ≡ E[R_s (z¹_s z²_s)] = ½(−μ_A + μ_B + μ_C − μ_D). Since ½(μ_A + μ_B + μ_C + μ_D) = 0, we can easily recover the μ_i from knowledge of λ₁, λ₂, λ₁₂ by the identity

$$ \lambda \equiv \begin{pmatrix} 0 \\ \lambda_1 \\ \lambda_2 \\ \lambda_{12} \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ -1 & 1 & -1 & 1 \\ -1 & 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} \mu_A \\ \mu_B \\ \mu_C \\ \mu_D \end{pmatrix} = G\mu \tag{27} $$

since the matrix is invertible; the first equation imposes market clearing (all our assets are market-neutral, so the total risk premium on the portfolio of all stocks in the economy is zero).

Given the three managed portfolios, how would we construct something like the “small×value” strategy, which buys small-value stocks and shorts small-growth stocks?[13] If z¹ measures market capitalization and z² measures BE/ME, the strategy is long D and short C. Let G be the square matrix in Eq. 27. The mean of the desired strategy is μ_D − μ_C, which is also equal to

$$ \mu_D - \mu_C = \iota_{DC}' \, G^{-1} \lambda, \quad \text{where } \iota_{DC} = \begin{pmatrix} 0 & 0 & -1 & 1 \end{pmatrix}', $$

which shows that the desired strategy of long D and short C can be constructed with weights [0, 0, 1, −1] on the four managed portfolio strategies.[14] Hence, combining the interaction with the base strategies allows for the construction of any “mixed” strategies. Conceptually, what is required is that the managed portfolios form a “basis” of the potential strategies.

[13] The value anomaly is larger for small stocks, which we would like our methodology to recover.
[14] We include the risk-free strategy (with zero excess return) for algebraic convenience.
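A quick numerical check of this construction; the cell means below are hypothetical and sum to zero by market clearing:

```python
import numpy as np

# Numerical check of Eq. (27): recover the "long D, short C" strategy
# from the managed-portfolio premia.
G = 0.5 * np.array([[ 1., 1.,  1.,  1.],
                    [ 1., 1., -1., -1.],
                    [-1., 1., -1.,  1.],
                    [-1., 1.,  1., -1.]])
mu = np.array([0.01, 0.03, -0.06, 0.02])          # hypothetical (mu_A..mu_D)
lam = G @ mu                                       # (0, lambda_1, lambda_2, lambda_12)
iota_DC = np.array([0., 0., -1., 1.])              # selects mu_D - mu_C
print(np.isclose(iota_DC @ np.linalg.inv(G) @ lam, mu[3] - mu[2]))  # True
print(iota_DC @ np.linalg.inv(G))                  # weights [0, 0, 1, -1]
```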

3 Empirics

3.1 Data

We start with the universe of U.S. firms in CRSP. For each stock, using Compustat data, we compute 50 “anomaly” characteristics commonly studied in the literature. For robustness, we exclude small-cap stocks,[15] and we center and standardize all characteristics as explained in Section 2.5. Apart from that, we follow the standard anomaly definitions in Novy-Marx and Velikov (2016), McLean and Pontiff (2016), and Kogan and Tian (2015). The anomalies and the excess returns on their respective managed portfolios (which are linear in characteristics) are listed in Table 1. The table shows mean excess returns in three subperiods: the full sample, post-1990, and the most recent post-2005 sample. All managed portfolios’ excess returns are rescaled to have standard deviations equal to the in-sample standard deviation of the excess return on the aggregate market index.

In our last test we use second and third powers and linear interactions of the 50 initial basic characteristics, which we construct using the approach of Section 2.6. Interactions expand the set of possible predictors exponentially. For instance, with only first-order interactions of 50 raw characteristics and their powers, we obtain n(n + 1)/2 + 2n = 1,375 candidate factors and test asset returns.

In all of our analysis we use daily returns from CRSP for each individual stock. Using daily data allows us to estimate second moments much more precisely than with monthly data.

3.2 Results

We start from the most basic example involving only a few test assets and characteristics, and proceed progressively towards our final specification, which utilizes a broad range of characteristics, their powers, and first-order interactions. The basic examples are revealing in terms of building intuition and comparing performance to classic techniques used in finance. The later examples are infeasible for classic techniques and should be judged purely on their out-of-sample performance and the new insights they uncover.

[15] We drop all stocks with market caps below 0.01% of aggregate stock market capitalization at each point in time. For example, for an aggregate stock market capitalization of $20 trillion, we keep only stocks with market caps above $2 billion.


Table 1: Mean annualized excess returns on anomaly managed portfolios, %
The table lists all basic “anomaly” characteristics used in our analysis and shows mean excess returns on managed portfolios that are linear in characteristics. Columns (1)-(3) show mean annualized returns (in %) for managed portfolios corresponding to all characteristics, net of the risk-free rate, in the full sample, the post-1990 sample, and the post-2005 sample, respectively. All managed portfolios’ excess returns are rescaled to have standard deviations equal to the in-sample standard deviation of excess returns on the aggregate market index. The sample is daily, from May 1, 1974 to December 30, 2016.

                                                (1)           (2)          (3)
                                           Full Sample    Post 1990    Post 2005
 1. Size – size                                -3.8          -2.8         -2.4
 2. Value (A) – value                           6.4           3.1          1.2
 3. Gross Profitability – prof                  3.4           4.0          3.9
 4. Value-Profitability – valprof              11.9           8.9          1.9
 5. F-score – fscore                            6.4           5.2          2.6
 6. Debt Issuance – debtiss                    -1.2          -4.6         -1.1
 7. Share Repurchases – repurch                 6.3           4.5          5.1
 8. Net Issuance (A) – nissa                   -7.8          -5.4         -2.0
 9. Accruals – accruals                        -5.8          -0.9         -1.3
10. Asset Growth – growth                      -9.2          -6.4         -1.5
11. Asset Turnover – aturnover                  5.1           4.4          6.2
12. Gross Margins – gmargins                   -0.9           3.2         -4.0
13. Dividend/Price – divp                       3.4           0.7         -0.6
14. Earnings/Price – ep                         7.6           5.3          2.5
15. Cash Flows/Price – cfp                      7.5           3.2          2.6
16. Net Operating Assets – noa                  1.7           0.6         -0.9
17. Investment/Assets – inv                    -9.5          -5.6          1.0
18. Investment/Capital – invcap                -4.0          -1.6         -2.7
19. Investment Growth – igrowth                -9.6         -10.0         -2.6
20. Sales Growth – sgrowth                     -7.2          -6.5         -5.2
21. Leverage – lev                              5.1           3.8          0.4
22. Return on Assets (A) – roaa                 1.9           4.5          4.2
23. Return on Book Equity (A) – roea            4.0           8.9          2.8
24. Sales/Price – sp                            9.5           6.5          5.4
25. Growth in LTNOA – gltnoa                   -1.6          -1.0         -1.7
26. Momentum (6m) – mom                         2.1           2.2         -4.9
27. Industry Momentum – indmom                  5.7           6.1         -0.6
28. Value-Momentum – valmom                     4.2           2.1         -1.8
29. Value-Momentum-Prof. – valmomprof           5.8           4.2         -2.2
30. Short Interest – shortint                  -0.3           2.1         -3.5

31. Momentum (12m) – mom12                      8.6           7.0         -1.8
32. Momentum-Reversals – momrev                -7.3          -7.0         -3.2
33. Long Run Reversals – lrrev                 -6.6          -6.0         -0.4
34. Value (M) – valuem                          6.0           2.9          1.8
35. Net Issuance (M) – nissm                   -5.0          -3.0         -2.2
36. Earnings Surprises – sue                   11.0           7.7          3.5
37. Return on Book Equity (Q) – roe             9.2          11.1          4.2
38. Return on Market Equity – rome             11.0           7.9          3.7
39. Return on Assets (Q) – roa                  6.6           7.7          4.4
40. Short-Term Reversals – strev               -7.9          -3.4          1.0
41. Idiosyncratic Volatility – ivol            -1.6          -1.4         -0.0
42. Beta Arbitrage – betaarb                   -0.1           1.2         -0.8
43. Seasonality – season                       14.4          13.8         -5.6
44. Industry Rel. Reversals – indrrev         -17.8          -7.6          0.6
45. Industry Rel. Rev. (L.V.) – indrrevlv     -36.7         -21.6         -6.3
46. Ind. Mom-Reversals – indmomrev             22.0          11.0         -1.6
47. Composite Issuance – ciss                  -4.3          -0.3         -1.7
48. Price – price                              -2.8          -1.8         -3.4
49. Age – age                                   3.1           2.3          1.8
50. Share Volume – shvol                       -0.9           0.0         -0.1

3.2.1 Five Fama-French factors

In our first exercise, we use the five Fama-French factors from Ken French’s website to compare our “managed portfolios” approach to the simplest MVE portfolio of five Fama-French anomaly strategies. First, let us consider the case when our instrumented portfolios (test assets and candidate factors) coincide with the five original Fama-French factors: SMB, HML, MOM, RMW, and CMA. These factors capture different anomalies and are only weakly correlated. We thus do not expect our regularization methods to have a lot of bite in this case.

Panel (a) of Figure 1 shows the paths of estimated coefficients when only the L2 penalty is imposed, which corresponds to the case γ2 = 0 in Eq. 23. On the x-axis we show the effective degrees of freedom,[16] which quantify the strength of the L2 regularization:

$$ df(\gamma) = \mathrm{tr}\left[ \Sigma_T \left( \Sigma_T + \gamma I \right)^{-1} \right] = \sum_{j=1}^{H} \frac{d_j}{d_j + \gamma}, \tag{28} $$

where d_j is the j-th eigenvalue of Σ_T and γ is the parameter that governs the strength of the L2 penalty, as defined in Section 2.2.

[16] A similar definition is often used in a ridge regression setup. See Hastie et al. (2011).
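In code, this quantity is a one-liner (a hypothetical helper, assuming Σ_T is a NumPy array):

```python
import numpy as np

def effective_dof(Sigma_T: np.ndarray, gamma: float) -> float:
    """Effective degrees of freedom of the L2-regularized estimator,
    Eq. (28): df(gamma) = sum_j d_j / (d_j + gamma)."""
    d = np.linalg.eigvalsh(Sigma_T)
    return float(np.sum(d / (d + gamma)))
```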

[Figure 1 omitted. Panel (a): L2 regularized coefficient paths. Panel (b): L1 regularized coefficient paths.]

Figure 1: Paths of coefficients as a function of penalty strength for the original five Fama-French factors. Factor returns data are from Ken French’s website. We impose only an L2-norm penalty in Panel (a) and only an L1-norm penalty in Panel (b).

Note that γ = 0, which corresponds to no shrinkage, gives df(γ) = H, the number of free parameters. The rightmost points on the plot therefore correspond to the unrestricted coefficients of the in-sample MVE portfolio (the unrestricted OLS/GLS solution). As we increase the strength of the penalty (move left on the plot), coefficients shrink toward zero and the SDF variance drops (not shown). We can see that the shrinkage is non-linear and its strength varies across factors.

We impose only an L1 penalty in Panel (b), which corresponds to the case γ1 = 0 of the problem in Eq. 23. On the x-axis we show the shrinkage factor, which we define as

$$ s = \frac{\sum_{i=1}^{H} |b_i|}{\sum_{i=1}^{H} |\tilde{b}_i|}, $$

where b̃_i are the coefficients corresponding to no L1 penalty (the unrestricted OLS/GLS solution). Similarly to Panel (a), as we increase the L1 penalty, the coefficients in Panel (b) shrink towards zero and eventually some of them are set to zero. The variables that are set to zero the earliest (we again move from right to left on the plot) are the least important. We can see, for example, that the SDF coefficient corresponding to the size factor is the first one to be dropped by the method. We therefore obtain sparse representations naturally, depending on how strong the L1 penalty is.

To compare our method based on managed portfolios to the standard sorting approach of Fama and French (1992), we use a different method of constructing portfolios in Figure 2.

|bi | , s = Pi=1 H ˜ i=1 bi where ˜bi are coefficients corresponding to no L1 penalty (unrestricted OLS/GLS solution). Similarly to Panel (a), as we increase the L1 penalty, coefficients in Panel (b) shrink towards zero and eventually some of them get set to zero. The variables that are set to zero the earliest (we again move from right to left on the plot) are least important. We can see, for example, that the SDF coefficient corresponding to the size factor is the first one to be dropped by the method. We therefore obtain sparse representations naturally, depending on how strong the L1 penalty is. To compare our method based on managed portfolios to the standard sorting approach of Fama and French (1992), we use a different method of constructing portfolios in Figure 2. 21

7

7

6

6

5

5

4

4

3

3

2

2

1

1 0

0 0

1

2

3

4

5

0

(a) L2 regularized coefficient paths

0.2

0.4

0.6

0.8

1

(b) L1 regularized coefficient paths

Figure 2: Paths of coefficients as a function of penalty strength for five linear instruments based on characteristics that underlie the Fama-French factors. We impose only L2 -norm penalty in Panel (a). We impose only L1 -norm penalty in Panel (b).

Instead of relying on the Fama-French factors themselves, we use the characteristics that underlie those factors (book-to-market ratio, log of market equity, prior 12 months of returns, profitability, and growth) to construct the instruments Z_t. We rank and center each of these instruments as discussed in Section 2.5, so that an instrumented portfolio (a cross-product of an instrument and returns) has a natural interpretation as a long-short portfolio formed as a linear function of a characteristic. We further use these instrumented (managed) portfolios as test assets and candidate factors in our estimation method. Our managed portfolios are obviously linked to the original Fama and French factors: size, value, mom12, (investment) growth, and profitability roughly correspond to the SMB, HML, Mom, CMA, and RMW portfolios, respectively. There are two primary differences between our simple linear instruments and the Fama-French factors: (a) we exclude small stocks; and (b) Fama and French use a somewhat more sophisticated approach to construct their factors compared to the simple linear weighting scheme that we employ.

Similarly to Figure 1, Panel (a) of Figure 2 shows paths of coefficients with only the L2 penalty imposed, while Panel (b) imposes only the L1 penalty. We flipped the signs on the size and growth anomalies to match those of the Fama-French factors for ease of interpretation. Comparing the results in Figure 1 and Figure 2, we see that the “managed portfolios” method produces very similar results to the method that uses the original Fama-French factors. Apart from the profitability anomaly, for which our definition differs slightly from Fama and French (we follow Novy-Marx and Velikov, 2016), the remaining factors show very similar coefficient paths for either of the two methods. We find that (investment) growth, momentum, and value are the strongest anomalies; size is the weakest.

[Figure 3 omitted. Panel (a): cross-sectional R², L2 penalty (in-sample C-S R², OOS CV C-S R², AICc). Panel (b): cross-sectional R², both penalties.]

Figure 3: Goodness of fit for families of models parametrized by the values of the L1 and L2 penalties γ1 and γ2. The test assets and candidate factors are the 5 Fama and French strategy portfolios. In Panel (a) we consider a model that imposes only the L2 penalty and plot the cross-sectional R² (y-axis) as a function of the effective degrees of freedom (x-axis). We show three paths: the in-sample cross-sectional R² (solid blue; left axis); the cross-validated out-of-sample cross-sectional R² (dashed blue; left axis); and the AICc criterion (dotted red; right axis). In Panel (b) we impose both penalties simultaneously and show a contour map depicting the cross-validated out-of-sample cross-sectional R² as a function of the penalty parameters γ1 and γ2, corresponding to df(γ) effective degrees of freedom (in a standalone ridge problem; x-axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y-axis), respectively. Warmer (yellow) colors reflect higher R².

Figure 3 shows measures of goodness of fit for families of models parametrized by the values of the L1 and L2 penalties, γ1 and γ2. In Panel (a) we consider a model that imposes only the L2 penalty and plot the cross-sectional R² (y-axis) as a function of the effective degrees of freedom (x-axis). We show three paths: the in-sample cross-sectional R² (solid blue; left axis); the cross-validated out-of-sample cross-sectional R² (dashed blue; left axis); and the corrected Akaike information criterion (AICc; dotted red curve; right axis).[17] In Panel (b) we impose both penalties simultaneously and show a contour map depicting the cross-validated out-of-sample cross-sectional R² as a function of the penalty parameters γ1 and γ2, corresponding to df(γ) effective degrees of freedom (in a standalone ridge problem; x-axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y-axis), respectively. Warmer (yellow) colors reflect higher R².

The out-of-sample R² in both panels is computed using a simple two-fold cross-validation exercise: we split the entire sample into two halves, estimate the model’s parameters in one half, fix those parameters, and compute the cross-sectional R² using the second, “unseen” half of the data. Next we swap the two halves and perform an analogous exercise. Lastly, we compute the average of the two values of R² and use that number as our measure of goodness of fit, which we report in Figure 3.

[17] AICc is AIC with a correction for finite sample sizes, given by the formula AICc = AIC + 2k(k+1)/(n − k − 1), where n denotes the sample size and k denotes the number of parameters. AIC is defined as AIC = −2 ln L + 2k.

Turning to our findings, we see that if no L1 penalty is imposed, the preferred model that relies only on the L2 penalty is heavily regularized: Panel (a) shows that the maximum out-of-sample cross-validated R² is achieved when the effective degrees of freedom are close to 1. Recall that we undo the “level” shrinkage, so in terms of Panel (a) of Figure 2 we see that such shrinkage leads to a different cross-sectional profile of coefficients relative to an unregularized model. In particular, our model selection method favors a model that assigns approximately equal and strongest (equal to 1) values of SDF coefficients to the value, momentum, and growth managed portfolios, and smaller coefficients (around 0.5) to size and profitability. Note that in an unregularized model all four anomalies had different relative values of SDF coefficients. Panel (b), however, shows that when a general specification with the two types of penalties is allowed (Eq. 23), the best model relies on effectively no L2 shrinkage, but drops one variable due to the L1 shrinkage (the highest R² in the plot is achieved at 5 L2 effective degrees of freedom and 4 non-zero parameters). Such a model achieves an OOS R² of 85% – slightly higher than the R² of the original (unregularized) model (≈ 80%).

Overall, the evidence above suggests that our managed portfolio method is related to standard portfolio-based approaches in cross-sectional asset pricing. It is important to remember, though, that those approaches are viable only in cases of few known factors, or with unknown factors when the universe of test assets is small (e.g., sorted portfolios) and the factor structure is weak. In the previous example the universe of test assets we considered included only five strategy portfolios that were weakly correlated. In such a case the benefits of regularization are relatively mild, as we have seen in Figure 3. In the next section we consider a simple example of strong factor structure in test asset returns, where the benefits of regularization are far greater.
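A schematic sketch of the two-fold cross-validation described above; the `fit` and `cs_r2` callables are hypothetical placeholders for the estimator of Eq. 23 and the cross-sectional R² computation:

```python
import numpy as np

def two_fold_cv_r2(F, gamma, fit, cs_r2):
    """Two-fold cross-validated cross-sectional R^2.

    F      : (T, H) managed-portfolio excess returns.
    fit    : callable(F_train, gamma) -> SDF coefficients b (placeholder).
    cs_r2  : callable(mu, mu_fitted) -> cross-sectional R^2 (placeholder).
    """
    T = F.shape[0]
    halves = (F[: T // 2], F[T // 2:])
    scores = []
    for train, test in ((0, 1), (1, 0)):
        b = fit(halves[train], gamma)                 # estimate on one half
        mu = halves[test].mean(axis=0)                # "unseen" sample means
        Sigma = np.cov(halves[test], rowvar=False)    # "unseen" covariances
        scores.append(cs_r2(mu, Sigma @ b))           # fitted means mu_hat = Sigma b
    return float(np.mean(scores))
```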

3.2.2 Fama-French 25 ME/BM-sorted portfolios

In this example we consider the 25 Fama-French ME/BM-sorted portfolios, which exhibit a very strong factor structure: most of the variance can be explained by only three factors (market, SMB, HML). Fama and French (1992) construct these factors manually. Kozak et al. (2015) show that the SMB and HML factors essentially match the second and the third PCs of the 25 portfolio returns. Extracting and using only three such factors, therefore, is a form of regularization, known as principal component regression (PCR) in machine learning, but done implicitly.

[Figure 4 omitted. Panel (a): L2 regularized coefficient paths. Panel (b): L2 regularized coefficient paths (zoomed in).]

Figure 4: Paths of coefficients as a function of penalty strength for the 25 Fama-French ME/BM-sorted portfolios. We impose the L2-norm penalty and vary its strength. Panel (a) shows the full path of coefficients. Panel (b) zooms in on the regularized range (between 0 and 2 degrees of freedom). Labels in Panel (b) are ordered according to the vertical ordering of estimates at the rightmost point.

In our method, the economic reasoning and Bayesian priors described in Section 2 lead to a modified ridge regression, which can be thought of as a continuous version of PCR. Whereas PCR ignores small PCs completely, ridge regression strongly down-weights them instead.

Panel (a) of Figure 4 clearly shows the need for regularization in this case: because the 25 Fama and French ME/BM-sorted portfolios are highly correlated, estimating the MVE portfolio (SDF coefficients) with no regularization leads to relatively high SDF coefficients and likely high SDF variance; most importantly, we show later that such an unregularized SDF performs extremely poorly out of sample. Panel (b) plots the profile of coefficients for a regularized problem, when coefficients are shrunk in a way that results in about two effective degrees of freedom (we simply zoom in on the corresponding portion of the plot in Panel (a)) – consistent with our prior that a low-order factor model should hold in these data and with the evidence in Kozak et al. (2015). Figure 5 below shows that our model selection methods favor such highly regularized solutions. In this case the coefficient estimates are much more reasonable and possess an intuitive pattern: the largest (positive) coefficients correspond mostly to small and value portfolios, while the smallest (negative to mildly positive) coefficients correspond primarily to growth or large portfolios (the ordering of labels in Panel (b) coincides with the vertical ordering of coefficient estimates at the rightmost point). In essence, our regularized SDF (OOS-MVE portfolio) is heavily long small and value stocks and has small (or even negative) exposure to big and growth stocks — conceptually the same as the SDF implied by Fama and French (1992).

[Figure 5: (a) cross-sectional R2, L2 penalty (in-sample, OOS cross-validated, and AICc paths); (b) cross-sectional R2, both penalties.]

Figure 5: Goodness of fit for families of models parametrized by the values of the L1 and L2 penalties γ1 and γ2. The test assets and candidate factors are 25 Fama and French ME/BM sorted portfolios. In Panel (a) we consider a model that imposes only the L2 penalty and plot the cross-sectional R2 (y axis) as a function of effective degrees of freedom (x axis). We show three paths: the in-sample cross-sectional R2 (solid blue; left axis); the cross-validated out-of-sample cross-sectional R2 (dashed blue; left axis); and the AICc criterion (dotted red; right axis). In Panel (b) we impose both penalties simultaneously and show a contour map depicting the cross-validated out-of-sample cross-sectional R2 as a function of the penalty parameters γ1 and γ2, corresponding to df(λ) effective degrees of freedom (in a standalone ridge problem; x axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y axis), respectively. Warmer (yellow) colors reflect higher R2.

In Figure 5 we perform our model selection exercises. First, in Panel (a) we show that, in the case of a model with only the L2 penalty, both cross-validation and the AICc criterion favor heavily regularized models with 1-2 effective degrees of freedom. In light of the fact that our method, similarly to ridge regression, is effectively a continuous version of PCR, this evidence is broadly consistent with the findings of Kozak et al. (2015), where we showed that only the two largest (cross-sectional) PCs are needed to price the cross-section of the 25 Fama and French portfolios. We also showed in Panel (b) of Figure 4 that our recovered SDF corresponding to this level of L2 penalty is conceptually very similar to the SDF of Fama and French (1992) in the sense that it loads on value and small stocks. When models that allow for two types of penalties are considered in Panel (b) of Figure 5, cross validation again favors very aggressive regularization, emphasizing the importance of regularization when test assets have a strong factor structure. Note that unregularized models that include all 25 factors (top-right corner) demonstrate extremely poor performance, with R2 below 0.3. Given the high degree of regularization, our model selection method also strictly prefers imposing the two penalties simultaneously (bottom-left corner).

[Figure 6: (a) L1 coefficient paths, no L2 penalty; (b) L1 coefficient paths, both penalties, df(λ) = 3.6.]

Figure 6: Paths of coefficients as a function of L1 penalty strength for 25 Fama-French ME/BM sorted portfolios. In Panel (a) we impose no L2-norm penalty (a pure lasso-type approach). In Panel (b) we impose the optimal level of L2 penalty, which results in 3.6 effective degrees of freedom based on the analysis in Panel (b) of Figure 5. We further add the L1 penalty and show the full path of coefficients up to the point of no L1 penalty (which corresponds to the coefficients of a pure ridge-type approach; at the dashed vertical line). Labels are ordered according to the vertical ordering of estimates at the rightmost point (dashed black vertical line).

We can see that the highest levels of OOS cross-sectional R2 (around 71%) are achieved for models that have an L2 penalty corresponding to 1-7 degrees of freedom, combined with an L1 penalty that keeps only one or two factors. Note that such models are strongly preferred to the pure lasso (only the L1 penalty; see the rightmost vertical section of the plot) for all specifications but the one that picks only one factor (bottom-right corner). In fact, if no L2 penalty is imposed, the best lasso-type model with a single factor delivers an R2 of slightly below 71%, while the performance of a lasso-based model with two factors drops below 30%. Similarly, the model with two penalties dominates a pure ridge-type model (L2 penalty only; see the horizontal segment at the very top of the plot), which can achieve an R2 of slightly above 0.6. Given the link between the ridge-type model and the model in Fama and French (1992) discussed earlier, the results suggest that our regularized specification with two penalties is likely to perform better than Fama and French's original model, especially out of sample. We confirm this conjecture – the out-of-sample R2 of the Fama-French model based on our cross-validation exercise was below 50% (not reported) – inferior to both the pure ridge-type model and especially to the regularized specification based on two penalties.

Figure 6 shows the path of SDF coefficients when the L1 penalty is imposed. In Panel (a) we start with the case corresponding to zero L2 penalty (a pure lasso-type approach).

In Panel (b) we impose the optimal level of L2 penalty, which results in 3.6 effective degrees of freedom based on the analysis in Panel (b) of Figure 5. We further add the L1 penalty and show the full path of coefficients up to the point of no L1 penalty (which corresponds to the coefficients of a pure ridge-type approach; at the dashed vertical line). The first observation is that the path of L1 (lasso) coefficients becomes more monotone as more shrinkage is applied at the first stage (ridge). This happens because a high level of L2 shrinkage essentially eliminates small PCs and the resulting non-monotonicities they produce in the lasso coefficient paths. This property is common to all our examples and explains why the L2 penalty is so useful and powerful, especially in our case (given the economic motivation behind shrinking small PCs). The second observation is that a strong L2 penalty also allows the lasso to pick up the "correct" factors at any point along the paths. This happens for precisely the same reason: the lasso now picks factors that are correlated with expected returns, but for which this correlation comes predominantly from large PCs, which we deemed most economically meaningful in our prior analysis. Even so, in the case of extreme regularization (1-2 factors), the lasso alone still manages to find the correct factors: comparing Panels (a) and (b), we see that with no ridge-type penalty the lasso correctly picks the first factor (the Small/High B/M portfolio), which explains the relatively high R2 of the single-factor lasso model in Panel (b) of Figure 5. Note that even though the lasso manages to pick up the second factor (ME1BM4) correctly as well, the relative weights it allocates to these two factors are sub-optimal, which leads to a deterioration of the OOS performance to the point that only a model with one factor is chosen.

The optimal model with the best OOS cross-validated performance corresponds to 1-8 effective L2 degrees of freedom and an L1 penalty which leads to only one or two selected factors (based on our previous discussion of Panel (b) of Figure 5). Panel (b) of Figure 6 shows that this solution corresponds to a model that keeps the Small/High B/M portfolio (and possibly ME1BM4 as a second factor). The model with this single factor explains the cross-section of the Fama-French 25 ME/BM sorted (market-neutral) portfolios much better out of sample than the original model of Fama and French (1992), which is based on the two factors SMB and HML. Overall, our method tends to recover an SDF that is related to the SDF implied by Fama and French (1992).

This example illustrates that regularization is particularly important when the factor structure is strong or some candidate factors are highly correlated. In the simple context of portfolio sorts, the intuitive need for such regularization was recognized and accomplished implicitly by Fama and French (1992). The real strength of our method, however, comes when dealing with a vast abundance of characteristics and unknown factors. In what follows we work only with managed long-short portfolios and iteratively increase the size and complexity of the models we consider in order to obtain the most informative SDF possible.

[Figure 7: (a) cross-sectional R2, L2 penalty; (b) L2 regularized coefficient paths, max df(λ) = 12.]

Figure 7: Goodness of fit and coefficient estimates for families of models that impose only the L2 penalty. The test assets and candidate factors are the 50 managed portfolios based on the characteristics in Table 1. In Panel (a) we plot the cross-sectional R2 (y axis) as a function of effective degrees of freedom (x axis). We show three paths: the in-sample cross-sectional R2 (solid blue; left axis); the cross-validated out-of-sample cross-sectional R2 (dashed blue; left axis); and the AICc criterion (dotted red; right axis). In Panel (b) we plot the paths of coefficients leading to the optimal model selected by cross validation based on Panel (a), with 12 effective degrees of freedom. We vary the strength of the L2 penalty to obtain solutions corresponding to 0-12 effective degrees of freedom. Labels are ordered according to the maximum absolute value of a coefficient.


3.2.3 50 anomaly characteristics

We now use all 50 anomaly characteristics listed in Table 1. Since characteristics are centered, managed portfolios have an intuitive long-short strategy interpretation, with weights that are linear in the cross-sectional centered rank of a characteristic (see the sketch below). Further, because the managed portfolios underlying the anomalies in Table 1 are not very correlated overall, we expect the overall factor structure to be weaker than in the 25 Fama and French ME/BM sorted portfolios, where just two PCs explained most of the variance. However, some of the individual anomalies are highly correlated: for instance, one can correctly conjecture that many price ratios, such as D/P, E/P, B/M, and CF/P, have non-trivial correlations. We therefore expect that some regularization is needed to handle those correlated anomalies.

Figure 7 shows the goodness of fit and coefficient estimates for families of models that impose only the L2 penalty. Similarly to the case with 25 ME/BM portfolios, without regularization some of the coefficients explode and result in high SDF variance (not shown).
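A sketch of this rank-based managed-portfolio construction (names are illustrative, and the final weight normalization is an assumption rather than necessarily the paper's exact choice):

```python
import numpy as np

def managed_portfolio_returns(R, C):
    # R: T x N stock excess returns; C: T x N characteristic, lagged one period.
    T, N = R.shape
    F = np.empty(T)
    for t in range(T):
        rank = C[t].argsort().argsort() + 1.0   # cross-sectional ranks 1..N
        z = rank / (N + 1.0) - 0.5              # centered ranks in (-0.5, 0.5)
        w = z / np.abs(z).sum()                 # normalization: an assumption here
        F[t] = w @ R[t]                         # long-short managed-portfolio return
    return F
```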


[Figure 8: (a) cross-sectional R2, both penalties; (b) L1 regularized coefficient paths, both penalties, df(λ) = 17.8.]

Figure 8: Goodness of fit and coefficient estimates for families of models that impose both penalties. The test assets and candidate factors are the 50 managed portfolios based on the characteristics in Table 1. In Panel (a) we show a contour map depicting the cross-validated out-of-sample cross-sectional R2 as a function of the penalty parameters γ1 and γ2, corresponding to df(λ) effective degrees of freedom (in a standalone ridge problem; x axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y axis), respectively. Warmer (yellow) colors reflect higher R2. In Panel (b) we impose the optimal level of L2 penalty, which results in 17.8 L2 effective degrees of freedom for the optimally selected model in Panel (a). We further add the L1 penalty and show partial paths of coefficients, which we truncate at the first 18 selected predictors. Labels are ordered according to the vertical ordering of estimates at the rightmost point (dashed black vertical line).

Some form of regularization is therefore needed. Cross validation and the AICc criterion in Panel (a) favor aggressive regularization that results in about 12 L2 effective degrees of freedom. We plot the paths of selected coefficients for such a regularized model in Panel (b). We can see that the most important anomalies following the L2 shrinkage include industry relative reversals, seasonality, industry momentum reversals, SUE, ROE, accruals, the D/P ratio, value-profitability, momentum, etc. Not surprisingly, these are the anomalies that have been found to be among the most robust in the literature. Our method uncovers them naturally.

Next, we move to the model that employs both types of penalties in Figure 8. In Panel (a) we show a contour map depicting the cross-validated out-of-sample cross-sectional R2 as a function of the penalty parameters γ1 and γ2, corresponding to df(λ) effective degrees of freedom (in a standalone ridge problem; x axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y axis), respectively. Warmer (yellow) colors reflect higher R2. The most robust models lie in the curvy region between the top-left and bottom-right corners of the plot. The pure ridge-based approach performs better than the pure L1-penalty (lasso) based method. Models with two penalties, however, dominate models that are based on a single


type of penalty. The optimal selected model has 17.8 L2 effective degrees of freedom and keeps about 33 non-zero variables under the L1 shrinkage. In Panel (b) we impose this optimal level of L2 penalty, add the L1 penalty, and show partial paths of coefficients, which we truncate at the first 18 selected predictors. This model achieves a high out-of-sample cross-sectional R2 of 84%. The set of included variables is quite similar to the set identified by ridge, but not exactly identical, which explains the improvement in R2. Unregularized models perform substantially worse out of sample, with a maximum R2 of around 50%. We can also clearly reject models with only a few factors: their R2 is also (often significantly) below 50%. A few models with a relatively low number of factors (around 9-10), however, are able to deliver adequate cross-sectional fit with an R2 of about 65%. While these models are dominated by less sparse representations, we might value sparseness in economic settings. One can read off the included factors and their respective coefficients from Panels (a) and (b). For instance, using Panel (a) we see that at 17.8 L2 effective degrees of freedom, a model with 9 factors would achieve an out-of-sample R2 of roughly 65%. From Panel (b) we see that a model with 9 non-zero coefficients corresponds to a value of the shrinkage factor s of about 0.2. We can then immediately read off all the non-zero coefficients (and their values) that a sparse SDF with this level of fit should contain: indrrevlv, season, rome, valprof, sue, indmomrev, fscore, ep, roe.

3.2.4 The model with interactions

We now consider our preferred specification, which deals with the vast abundance of characteristics and candidate factors. We start with the 50 characteristics in Table 1 and compute their second and third powers and all cross-products, as explained in Section 2.6. We obtain a total of 1,375 derived characteristics and candidate factors that we feed into our method (a sketch of this expansion appears below).

Figure 9 shows the goodness of fit and coefficient estimates for families of models that impose only the L2 penalty. Similarly to the previous cases, without regularization some of the coefficients explode and result in high SDF variance (not shown). Some form of regularization is therefore needed. Cross validation and the AICc criterion in Panel (a) favor aggressive regularization that results in about 120 L2 effective degrees of freedom (down from 1,375). We plot the paths of selected coefficients for such a regularized model in Panel (b). We can see that the most important factors following the L2 shrinkage include many interactions and cubes of the initial strategies: the most important interactions are ivol × indrrevlv, indrrevlv × shvol, betaarb × indrrevlv, mom × indrrev, valmom × indrrev, etc., while the most important cubic strategies include indrrevlv, season, indrrev, and indmomrev.
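A sketch of the expansion step referenced above, which also confirms the count: 50 base characteristics plus 50 squares, 50 cubes, and 50·49/2 = 1,225 pairwise cross-products gives 1,375 candidate factors.

```python
import itertools
import numpy as np

def expand_characteristics(Z):
    # Z: T x 50 panel of (centered) characteristics. Returns the base
    # characteristics, their squares and cubes, and all pairwise
    # cross-products: 50 + 50 + 50 + 1225 = 1375 columns.
    blocks = [Z, Z**2, Z**3]
    blocks += [Z[:, [i]] * Z[:, [j]]
               for i, j in itertools.combinations(range(Z.shape[1]), 2)]
    return np.hstack(blocks)

Z = np.random.randn(600, 50)        # toy panel: 600 months, 50 characteristics
assert expand_characteristics(Z).shape[1] == 1375
```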

[Figure 9: (a) cross-sectional R2, L2 penalty; (b) L2 regularized coefficient paths, max df(λ) = 120.]

Figure 9: Goodness of fit and coefficient estimates for families of models that impose only the L2 penalty. The test assets and candidate factors are the 50 managed portfolios based on the characteristics in Table 1, all their linear interactions, squares, and cubes. In Panel (a) we plot the cross-sectional R2 (y axis) as a function of effective degrees of freedom (x axis). We show three paths: the in-sample cross-sectional R2 (solid blue; left axis); the cross-validated out-of-sample cross-sectional R2 (dashed blue; left axis); and the AICc criterion (dotted red; right axis). In Panel (b) we plot the paths of coefficients leading to the optimal model selected by cross validation based on Panel (a), with 120 effective degrees of freedom. We vary the strength of the L2 penalty to obtain solutions corresponding to 0-120 effective degrees of freedom. Labels are ordered according to the maximum absolute value of a coefficient.

Next, we move to the model that employs both types of penalties in Figure 10. In Panel (a) we show a contour map depicting the cross-validated out-of-sample cross-sectional R2 as a function of the penalty parameters γ1 and γ2, corresponding to df(λ) effective degrees of freedom (in a standalone ridge problem; x axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y axis), respectively. Warmer (yellow) colors reflect higher R2. The most robust models lie in the north-west region of the plot. The pure ridge-based approach again performs better than the pure L1-penalty (lasso) based method (49% vs. 40% R2). Models with two penalties, however, achieve an even somewhat higher R2 of around 50.5%. The optimal selected model has 120 L2 effective degrees of freedom and keeps about 1,150 non-zero variables that survive the L1 shrinkage. Unregularized models perform substantially worse out of sample, with a maximum R2 below 20%. We can also clearly reject models containing only several factors: their R2 is also sub-par. A few models with a relatively low number of factors (around 20-30), however, are able to deliver adequate cross-sectional fit with an R2 of about 40%. While these models are dominated by less sparse representations, we might value sparseness in economic settings. At the expense of some out-of-sample performance, we therefore focus on one such model, which contains only 28 factors (out of the initial 1,375) and has 739 L2 effective degrees of freedom. We plot the full L1 coefficient path for this model in Panel (b); the plot lists all factors that survive the aggressive dual shrinkage.

[Figure 10: (a) cross-sectional R2, both penalties; (b) L1 regularized coefficient paths, both penalties, df(λ) = 739.]

Figure 10: Goodness of fit and coefficient estimates for families of models that impose both penalties. The test assets and candidate factors are the 50 managed portfolios based on the characteristics in Table 1, all their linear interactions, squares, and cubes. In Panel (a) we show a contour map depicting the cross-validated out-of-sample cross-sectional R2 as a function of the penalty parameters γ1 and γ2, corresponding to df(λ) effective degrees of freedom (in a standalone ridge problem; x axis) and the number of non-zero SDF coefficients (in a standalone lasso problem; y axis), respectively. Warmer (yellow) colors reflect higher R2. In Panel (b) we impose the level of L2 penalty resulting in 739 L2 effective degrees of freedom for the sparse (sub-optimal) representation from Panel (a), which contains 28 predictors. Labels are ordered according to the vertical ordering of estimates at the rightmost point (dashed black vertical line).


Overall, those variables are reminiscent of the variables with the highest coefficients in the optimal L2-only shrinkage discussed previously.

Performance on the recent fully withheld sample. We now evaluate the out-of-sample performance of our method on the most recent, withheld part of the sample. Our goal is twofold. First, we would like to assess the robustness of our model selection technique (cross validation) on the withheld part of the sample. Second, one might worry about the deterioration of anomalies' alphas over time (McLean and Pontiff, 2016) and about anomaly "fishing"; it is therefore necessary to use the latest part of the sample to validate our results and to get meaningful estimates of maximum attainable Sharpe ratios.

Although data mining and anomaly "fishing" are certainly a concern when assessing Sharpe ratios, they do not necessarily have a big impact on the final SDF and OOS cross-sectional fit. The reason is that our method tends to shrink those "fished" estimates of expected returns substantially. McLean and Pontiff (2016) argue that many anomalies essentially disappear following a research paper's publication. For anomalies that were still "unknown" in the earlier part of the sample, there is no clear reason why expected returns need to line up with covariances in the data (our model imposes this restriction). If they don't, and mostly line up with small PCs, our algorithm will effectively shrink them. It is therefore only anomalies that have been data-mined and that correspond to large systematic co-movements in the data that could be problematic for our procedure.

We proceed by truncating the sample to end in 1995 and then fully re-estimating the model and performing cross validation on this truncated sample. The data post-1995 are therefore fully excluded from all elements of the estimation process. We then re-estimate means and covariances in the out-of-sample period (1995-2016) and use the pre-estimated SDF coefficients to construct measures of out-of-sample cross-sectional R2 and Sharpe ratios (sketched below). For simplicity, in this section we focus on the class of models that employ only the L2 penalty. We showed in Figure 10 that such models perform on par with the optimally selected model that uses both penalties. In practice, none of our results below would change much if we focused on the latter class.

Figure 11 plots the out-of-sample cross-sectional R2 in Panel (a) and out-of-sample Sharpe ratios in Panel (b), calculated in the withheld sample (post-1995), as functions of the effective degrees of freedom (strength of the L2 penalty). Panel (b) shows two lines: regular out-of-sample estimates of Sharpe ratios (blue solid line) and leave-one-out OOS Sharpe ratios, which are obtained by excluding the single most important predictor (a more robust measure). We do not impose L1 penalties in these calculations.
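A minimal sketch of this withheld-sample evaluation, again using an L2-only estimator (function and variable names are illustrative; the annualization assumes monthly data):

```python
import numpy as np

def oos_withheld_evaluation(F_train, F_test, gamma2, periods_per_year=12):
    # Estimate L2-penalized SDF coefficients on the training sample only...
    Sigma_tr = np.cov(F_train, rowvar=False)
    b = np.linalg.solve(Sigma_tr + gamma2 * np.eye(Sigma_tr.shape[0]),
                        F_train.mean(axis=0))
    # ...then re-estimate means and covariances on the withheld sample.
    mu_te = F_test.mean(axis=0)
    Sigma_te = np.cov(F_test, rowvar=False)
    resid = mu_te - Sigma_te @ b
    r2_oos = 1.0 - (resid @ resid) / (mu_te @ mu_te)
    mve = F_test @ b                      # OOS returns on the implied MVE portfolio
    sr_oos = np.sqrt(periods_per_year) * mve.mean() / mve.std()
    return r2_oos, sr_oos
```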

[Figure 11: (a) OOS cross-sectional R2 in the withheld sample; (b) Sharpe ratios in the withheld sample.]

Figure 11: The figure shows the cross-sectional R2 in Panel (a) and Sharpe ratios in Panel (b), calculated in the most recent withheld sample, as functions of the effective degrees of freedom (strength of the L2 penalty). We do not impose the L1 penalty in these calculations. Panel (b) shows two lines: regular out-of-sample estimates of Sharpe ratios (blue solid line) and leave-one-out OOS Sharpe ratios, which are obtained by excluding the single most important predictor. The out-of-sample period starts in January 1995 and was withheld at the model estimation and model selection stages.

The purpose of these plots is not to perform model selection – the model has already been selected using cross validation in the training part of the sample. The optimally selected model contained df(λ) = 123 degrees of freedom (not shown). Figure 11 shows that for this model the OOS cross-sectional R2 in the withheld part of the sample (1995-2016) is above 0.6, and the OOS Sharpe ratio in the withheld sample is about 2.2. Our model selection method based on cross validation, therefore, performs very well out of sample in terms of selecting an appropriate degree of regularization. The selected model performs very well too, both in terms of cross-sectional fit and in generating very high Sharpe ratios on the implied MVE portfolio.

We plot the time-series of one-year overlapping returns on the regularized OOS-MVE portfolio implied by our SDF in Figure 12 (blue solid line). We use the first half of the sample (before 1995) to estimate SDF coefficients and select the appropriate strength of regularization, and then fix those coefficients to construct OOS returns on the MVE portfolio in the second half of the sample. The red dashed line shows returns on the market index for comparison (the MVE portfolio is approximately market-neutral). Panel (a) plots returns implied by the SDF that was constructed using 1,375 managed portfolios (all characteristics, powers, and interactions). For comparison, we show returns on the SDF constructed from the 5 Fama-French characteristics in Panel (b).

Note that average returns on the MVE portfolio in Panel (a) are much higher in the pre-2005 period, resulting in extremely high Sharpe ratios. This likely happens because many anomalies were data-mined in the earlier part of the sample. In the post-2005 period mean returns on the MVE portfolio deteriorate significantly; however, they are still substantially higher than mean returns on the stock market index. Note that our MVE portfolio performs significantly better than the market in recessions, often moving in the opposite direction to the market and generating very high returns (especially in 2001-2002).


[Figure 12: (a) returns on the MVE portfolio constructed using all instruments and interactions; (b) returns on the MVE portfolio constructed using the Fama and French 5 factors.]

Figure 12: The figure plots the time-series of one-year overlapping returns on the regularized MVE portfolio implied by our SDF (blue solid line) and returns on the market (for comparison only; red dashed line). We use the training sample (before 1995) to estimate SDF coefficients and select the appropriate strength of regularization, and then fix those coefficients of the selected model to construct OOS returns on the MVE portfolio. Panel (a) plots returns implied by the SDF that was constructed using 1,375 managed portfolios (all characteristics, powers, and interactions). For comparison, we show returns on the SDF constructed from 5 Fama-French characteristics in Panel (b).


Table 2: Testing models: MVE portfolio’s annualized OOS α (%), withheld sample (1995-2016) We construct the OOS SDFs in the withheld sample (1995-2016) based on three sets of test assets: (i) Fama-French 5 factors, (ii) 50 basic anomalies portfolios from Table 1; and (iii) all interactions and powers of 50 basic anomalies. We then regress MVE portfolio’s returns on the market (second column) and on five Fama-French factors (plus the market; third column), and report annualized alphas in %.

SDF constructed from:                                           MVE portfolio's CAPM α    MVE portfolio's FF 5-factor α
Fama-French 5 factors                                           14.46                     -
50 anomaly managed portfolios                                   25.62                     15.96
1,375 managed portfolios (50 anomalies and all interactions)    40.90                     32.73


3.3 How to use the SDF?

Given the estimate b̂, we can construct the time-series of the SDF, M̂t = 1 − b̂′(Ft − E[Ft]), or, equivalently, the return on the robust MVE portfolio, Pt = b̂′Ft. What is this useful for?

Tests of asset pricing models. Since Pt summarizes the cross-section of expected returns, it can be used as a single test asset when evaluating the pricing performance of a candidate model. For example, He et al. (2016) construct a 2-factor SDF which includes the excess market return and the change in financial intermediary leverage. They estimate risk prices (SDF coefficients) using a range of assets including the FF25 equity portfolios, maturity-sorted bond portfolios, commodities, sovereign debt, equity index options, corporate bonds, CDS contracts, and currencies. Since our equity MVE portfolio, Pt, is constructed from a vast cross-section of equities, most of which are not included in their estimation, it provides a nearly ideal "out-of-sample" test asset for their model. Given the time-series of their estimated SDF, ∆Λt+1/Λt, we can simply check whether or not the Euler equation holds for Pt; that is, does E[Pt+1 · ∆Λt+1/Λt] = 0 hold in the data.

Similarly, if one comes up with a new factor model, our SDF presents a straightforward way to assess its pricing performance in the wide cross-section of anomalies.


Table 3: Testing models: MVE portfolio’s annualized OOS α (%), withheld sample (2005-2016) We construct the OOS SDFs in the withheld sample (2005-2016) based on three sets of test assets: (i) Fama-French 5 factors, (ii) 50 basic anomalies portfolios from Table 1; and (iii) all interactions and powers of 50 basic anomalies. We then regress MVE portfolio’s returns on the market (second column) and on five Fama-French factors (plus the market; third column), and report annualized alphas in %.

SDF constructed from:                                           MVE portfolio's CAPM α    MVE portfolio's FF 5-factor α
Fama-French 5 factors                                           9.41                      -
50 anomaly managed portfolios                                   8.25                      3.53
1,375 managed portfolios (50 anomalies and all interactions)    24.57                     22.54

The candidate model should be tested in terms of its ability to explain mean returns on our SDF-implied MVE portfolio. That is, one only needs to run a single time-series regression of the returns on our MVE portfolio on the factors of the model at hand, and check whether the intercept (alpha) is significantly different from zero. Because our SDF was constructed using a broad cross-section of anomalies and their interactions, explaining it (reducing the alpha to zero) is, however, a very high threshold for most models to clear.

We present a simple example of such a test in Table 2. Namely, we test the 5-factor model of Fama and French (2014) on its ability to price our OOS-MVE portfolio. We construct out-of-sample SDFs based on three sets of test assets: (i) the Fama-French 5 factors, (ii) the 50 basic anomaly portfolios from Table 1; and (iii) all interactions and powers of the 50 basic anomalies. We then regress the OOS-MVE portfolio's returns on the market (second column) and on the five Fama-French factors (plus the market; third column), and report annualized alphas in %. We use only post-1995 data in our tests – this portion of the sample was fully withheld from our method and was never used in the estimation of the SDF. Table 3 shows the same statistics for the 2005-2016 withheld sample.

We can see that the 5-factor model prices its own 5 factors perfectly (by construction; row 1), but already has trouble pricing an SDF that was constructed using the 50 basic anomaly characteristics (row 2). The CAPM alpha of the MVE portfolio implied by such an SDF is 25.6% (annualized). The 5-factor model is able to reduce this alpha only to 16%. When we focus on the MVE portfolio implied by our preferred SDF (based on 1,375 managed portfolios constructed using all basic anomalies, their powers, and all pairwise interactions; row 3), the 5-factor model's annualized alpha is as high as 33%. Clearly, the 5-factor model is strongly rejected in the data.
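Such a test is a single OLS regression; a sketch using statsmodels (the input series, the HAC lag choice, and the monthly annualization are illustrative assumptions):

```python
import statsmodels.api as sm

def alpha_test(mve, factors, periods_per_year=12):
    # Time-series regression of MVE portfolio returns on candidate factors;
    # the intercept is the per-period alpha, annualized below.
    X = sm.add_constant(factors)
    res = sm.OLS(mve, X).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
    return periods_per_year * res.params[0], res.tvalues[0]
```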

Macroeconomy. A further application is to examine the correlation of "important" macroeconomic variables, like investment growth, with M̂t. M̂t is only an estimate of the projection of the "true SDF" onto equity returns, and ultimately we are interested in the underlying macro dynamics that affect asset prices. However, macro variables are likely grossly mismeasured. Hence, the factor-mimicking portfolio "for the true M will price assets better than an estimate of M that uses the measured macroeconomic variables" (Cochrane, 2005). Instead of directly testing macro asset pricing models on asset returns (which is likely to reject even a true model), we can explore which aggregate quantities correlate with our estimate of the projection, M̂t.


References

Asness, C. S., A. Frazzini, and L. H. Pedersen (2014). Quality minus junk.

Brandt, M. W., P. Santa-Clara, and R. Valkanov (2009). Parametric portfolio policies: Exploiting characteristics in the cross-section of equity returns. Review of Financial Studies 22(9), 3411-3447.

Bryzgalova, S. (2016). Spurious factors in linear asset pricing models. Working paper.

Cochrane, J. H. (1991). Production-based asset pricing and the link between stock returns and economic fluctuations. Journal of Finance 46, 209-237.

Cochrane, J. H. (2005). Asset Pricing (second ed.). Princeton, NJ: Princeton University Press.

Cochrane, J. H. (2011). Presidential address: Discount rates. Journal of Finance 66(4), 1047-1108.

DeMiguel, V., A. Martin-Utrera, F. J. Nogales, and R. Uppal (2017). A portfolio perspective on the multitude of firm characteristics.

Fama, E. and K. French (2014). A five-factor asset pricing model. Fama-Miller Working Paper.

Fama, E. F. and K. R. French (1992). The cross-section of expected stock returns. Journal of Finance 47, 427-465.

Freyberger, J., A. Neuhierl, and M. Weber (2017). Dissecting characteristics nonparametrically.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1.

Green, J., J. R. Hand, and X. F. Zhang (2014). The remarkable multidimensionality in the cross-section of expected US stock returns. Unpublished paper, University of North Carolina at Chapel Hill.

Hansen, L. P. and R. Jagannathan (1997). Assessing specification errors in stochastic discount factor models. Journal of Finance 52, 557-590.

Harvey, C. R., Y. Liu, and H. Zhu (2015). ... and the cross-section of expected returns. Review of Financial Studies.

Hastie, T. J., R. J. Tibshirani, and J. H. Friedman (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

He, Z., B. Kelly, and A. Manela (2016). Intermediary asset pricing: New evidence from many asset classes. Technical report, National Bureau of Economic Research.

Kan, R. and C. Robotti (2008). Specification tests of asset pricing models using excess returns. Journal of Empirical Finance 15(5), 816-838.

Kogan, L. and M. Tian (2015). Firm characteristics and empirical factor models: a model-mining experiment. Technical report, MIT.

Kozak, S., S. Nagel, and S. Santosh (2015). Interpreting factor models. Technical report, University of Michigan.

McLean, D. R. and J. Pontiff (2016). Does academic research destroy stock return predictability? Journal of Finance 71(1), 5-32.

Novy-Marx, R. and M. Velikov (2016). A taxonomy of anomalies and their trading costs. Review of Financial Studies 29(1), 104-147.

Pástor, L. (2000). Portfolio selection and asset pricing models. Journal of Finance 55(1), 179-223.

Pástor, L. and R. F. Stambaugh (2000). Comparing asset pricing models: an investment perspective. Journal of Financial Economics 56(3), 335-381.

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301-320.


Appendix A Computational Aspects

A.1 Solution method

Least Angle Regression (LAR) algorithm. We use a modified version of the LAR algorithm to solve the problem. Hastie et al. (2011) argue it is extremely efficient and computes the entire lasso path at a cost comparable to OLS. Moreover, predictors are added sequentially, so we can easily stop at any number of predictors (usually small).18 This means that, with a precomputed Gram matrix, it is O(kn²) in our application (where k is the number of predictors at which we stop), i.e., even faster than OLS at O(n³).

The LAR algorithm starts with an empty set of active variables. At the first step it identifies the variable most correlated with the response. LAR then moves the coefficient of this variable continuously towards its least squares value. Walking along this direction, the angles between the variables and the residual vector are measured. Along this walk the angles change; in particular, the correlation between the residual vector and the active variable shrinks linearly towards 0. At some stage before this point, another variable attains the same correlation with the residual vector as the active variable. The walk stops and the new variable is added to the active set. The new direction of the walk is towards the least squares solution of the two active variables, and so on. After p steps, the full least squares solution is reached. Hastie et al. (2011) further show that a single modification to the LAR algorithm gives the entire lasso path: if a non-zero coefficient hits zero, we drop the variable from the active set and recompute the current joint least squares solution.

Finally, when both L1 and L2 penalties are present (elastic net), it is straightforward to show that one can turn the problem into a lasso problem using an augmented version of X and y. In the classic elastic-net setup, such augmentation is equivalent to replacing the Gram matrix X′X used in the computation of OLS coefficients at each step with its ridge counterpart X′X + δI. In our setting a similar simplification obtains. Our implementation exploits these basic principles and adapts them to our setting. First, because our variables are not (and should not be) centered and standardized, we replace correlation as the measure of direction with the inner product; the inner product naturally measures both the angle and the size of the move, and both are important in our setting. Second, since our objective is non-standard, we construct altered versions of X and y, X̃ = X^(1/2) and ỹ = X^(−1/2)y, and use the Gram matrix (X + δI). One can easily verify that in the absence of the L1 penalty, these modifications lead to the solution discussed in Section 2.4.
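In the classic elastic-net setup mentioned above, the augmentation trick can be sketched as follows; since our modified objective replaces correlations with inner products and uses a different Gram matrix, this illustrates the principle rather than our exact implementation (sklearn's lars_path is used only for demonstration):

```python
import numpy as np
from sklearn.linear_model import lars_path

def elastic_net_path_via_lasso(X, y, delta):
    # Augmenting X with sqrt(delta)*I and y with zeros makes the lasso
    # Gram matrix equal to X'X + delta*I, i.e., the ridge Gram matrix.
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(delta) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    alphas, active, coefs = lars_path(X_aug, y_aug, method="lasso")
    return alphas, coefs   # entire piecewise-linear coefficient path
```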

QP problem. We can also formulate our final specification as a quadratic programming (QP) problem with linear constraints. For N predictors, we need 2N + 1 constraints and 2N variables. We use this method to validate the LAR algorithm above; it produces an identical solution. Since QP is a very general method that does not take advantage of the specific structure of the problem, however, it is computationally quite inefficient.
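For reference, a generic way to cast an L1-penalized quadratic objective as such a QP is the standard positive/negative split b = b⁺ − b⁻, which matches the variable and constraint counts above; A, c, and t stand in for the objective's quadratic term, linear term, and L1 budget (a textbook formulation, not necessarily our exact one):

```latex
\min_{b^{+},\,b^{-}}\;
\tfrac{1}{2}\,(b^{+}-b^{-})' A\,(b^{+}-b^{-}) \;-\; c'(b^{+}-b^{-})
\quad \text{s.t.} \quad
b^{+}\ge 0,\quad b^{-}\ge 0,\quad \mathbf{1}'(b^{+}+b^{-})\le t,
```

which has 2N variables and 2N + 1 linear constraints for N predictors.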

18. This is similar to early stopping regularization in boosting methods.


B Bayesian Estimation

Consider a vector of managed portfolio returns Ft ∼ N(µF, Σ) (residuals from some prior factor model like the CAPM) with known covariance but unknown mean. Consider the family of priors µF ∼ N(0, (κ/τ)Σ^η), where τ = trace[Σ] and κ is a constant controlling the dispersion of µF and may depend on τ and N. This family nests many commonly used priors (by varying η and κ), and further maps into various estimation techniques (shown later). We can form principal component portfolios Pt = Q′Ft with Σ = QDQ′. In addition to expected returns, we also look at Sharpe ratios of PC portfolios, sP = D^(−1/2)µP, and Markowitz portfolio weights, bF = Σ^(−1)µF and bP = D^(−1)µP. Given a sample of length T, let µF,T be the sample mean return of the managed portfolios (and µP,T the sample mean of the PC portfolios). With the conjugate structure of the priors, we can calculate posterior values easily:

        Prior                        Posterior mean
µF      N(0, (κ/τ) Σ^η)              (Σ + γ Σ^(2−η))^(−1) Σ µF,T
bF      N(0, (κ/τ) Σ^(η−2))          (Σ + γ Σ^(2−η))^(−1) Σ · Σ^(−1) µF,T
µP      N(0, (κ/τ) D^η)              (D + γ D^(2−η))^(−1) D µP,T
sP      N(0, (κ/τ) D^(η−1))          (D + γ D^(2−η))^(−1) D · D^(−1/2) µP,T
bP      N(0, (κ/τ) D^(η−2))          (D + γ D^(2−η))^(−1) D · D^(−1) µP,T

where γ = τ/(κT). Notice that small κ (a "tighter prior" around zero) implies large γ. Notice also that means, Sharpe ratios, and optimal weights all exhibit the same shrinkage factor (towards zero). This shows there is a one-to-one mapping between priors on means, Sharpe ratios, and weights. We now consider various values of η.
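The posterior means in the table follow from the standard conjugate-normal update; for instance, for the µF row (the other rows follow by the same algebra after rotating into PC space):

```latex
% Conjugate update with prior mu_F ~ N(0, V), V = (kappa/tau) Sigma^eta,
% and sampling distribution mu_{F,T} ~ N(mu_F, Sigma/T):
E[\mu_F \mid \mu_{F,T}]
  = V \left( V + \tfrac{1}{T}\Sigma \right)^{-1} \mu_{F,T}
  = \Sigma^{\eta}\left( \Sigma^{\eta} + \gamma \Sigma \right)^{-1} \mu_{F,T}
  = \left( \Sigma + \gamma\,\Sigma^{2-\eta} \right)^{-1} \Sigma\, \mu_{F,T},
\qquad \gamma = \frac{\tau}{\kappa T}.
```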

B.1 η = 0

η = 0 implies a simple, seemingly agnostic iid prior, µF ∼ N(0, (κ/τ)I), or equivalently, µP ∼ N(0, (κ/τ)I). Is this reasonable? No. It implies sP ∼ N(0, (κ/τ)D^(−1)). The standard deviation of s_i, √(2κ/(τ d_i)), can be arbitrarily high for low-eigenvalue (small d_i) PCs. This is an unrealistic prior on Sharpe ratios.

B.2 η = 1

η = 1 implies µP ∼ N(0, (κ/τ)D) and sP ∼ N(0, (κ/τ)I). This is an iid prior on PC portfolio Sharpe ratios instead of expected returns, as used in Pástor (2000) and Pástor and Stambaugh (2000). Is this reasonable? It implies bP ∼ N(0, (κ/τ)D^(−1)). The standard deviation of b_i, √(2κ/(τ d_i)), can be arbitrarily high for low-eigenvalue (small d_i) PCs. This is an unrealistic prior on portfolio weights.

B.3 η = 2

η = 2 implies wP ∼ N(0, (κ/τ)I), an iid prior on PC portfolio weights. Equivalently, bF ∼ N(0, (κ/τ)I), which may not be restrictive enough. One may want to require that correlated assets have correlated portfolio weights, suggesting η > 2.

B.4 Example

Consider a concrete two-asset example where both returns have variance σ² and correlation ρ > 0. In this case, τ = 2σ² and

Q = (1/√2) [ 1  1 ;  1  −1 ],        D = σ² [ 1+ρ  0 ;  0  1−ρ ].

We explore the limit as ρ → 1 to see what happens in the case of redundant assets. We can "rule out" η = 0 by considering the maximum squared Sharpe ratio, µF′ Σ^(−1) µF. Its expected value is given simply by

E[µF′ Σ^(−1) µF] = E[ σ^(−2)(1 − ρ²)^(−1) µF′ [ 1  −ρ ;  −ρ  1 ] µF ] = σ^(−2)(1 − ρ²)^(−1) (2κ/τ).

The expected maximum squared Sharpe ratio explodes to ∞ as ρ → 1, which is implausible. We can "rule out" η = 1 by considering the portfolio weights, bF. Recall that when η = 1, bF ∼ N(0, (κ/τ)Σ^(−1)), with

Σ^(−1) = σ^(−2)(1 − ρ²)^(−1) [ 1  −ρ ;  −ρ  1 ]

in this two-asset example. In the limit when the two base assets are redundant (ρ → 1), their optimal weights are perfectly negatively correlated with infinite variance under this prior! Our prior is that highly correlated assets should not have highly negatively correlated weights and that the introduction of redundant assets should not cause weights to explode. Further, consider the contribution of each PC to total portfolio variance (equivalently, to the maximum squared Sharpe ratio):

E[b²_P,2 d₂] = (κ/(2σ² d₂)) d₂ = κ/(2σ²).

We obtain an identical value for the first PC. This implies that even as the variance of the 2nd PC portfolio goes to zero, its relative contribution to the optimal portfolio variance is constant (equal to the first PC's contribution).

The prior η = 2 does not suffer from any of these pathologies. The expected variance of the SDF, µF′ Σ^(−1) µF = µP′ D^(−1) µP, is simply κ. The expected contribution of the 2nd PC to portfolio variance is

E[b₂² d₂] = (κ/(2σ²)) σ²(1 − ρ) = (κ/2)(1 − ρ) → 0,

which is reasonable: an asset with zero variance (d₂ → 0) should not have a non-zero contribution to portfolio (SDF) variance. The expected contribution of the 1st PC to portfolio variance is

E[b₁² d₁] = (κ/(2σ²)) σ²(1 + ρ) = (κ/2)(1 + ρ) → κ,

which equals the total SDF variance.

