Forecaster’s dilemma: Extreme events and forecast evaluation
Sebastian Lerch
Karlsruhe Institute of Technology / Heidelberg Institute for Theoretical Studies

ACINN Graduate Seminar, Innsbruck, January 11, 2017
Joint work with Thordis Thorarinsdottir, Francesco Ravazzolo and Tilmann Gneiting
Motivation
http://www.spectator.co.uk/features/8959941/whats-wrong-with-the-met-office/
Financial crisis in the news
http://www.theguardian.com/business/2009/jan/24/nouriel-roubini-credit-crunch
L’Aquila earthquake, April 2009
http://www.bbc.com/news/magazine-20097554
Outline
1. Probabilistic forecasting and forecast evaluation
2. The forecaster’s dilemma
3. Proper forecast evaluation for extreme events
4. Case study
5. Simulation study
1. Probabilistic forecasting and forecast evaluation
Probabilistic forecasts
Probabilistic forecasts, i.e., forecasts in the form of probability distributions over future quantities or events,
- provide information about the inherent uncertainty,
- allow for optimal decision making by obtaining deterministic forecasts as target functionals (mean, quantiles, ...) of the predictive distributions,
- have become increasingly popular across disciplines: meteorology, hydrology, seismology, economics, finance, demography, political science, ...
Probabilistic vs. point forecasts

[Figure: comparison of a point forecast and a probabilistic forecast against the observation; forecasts and observations shown on a 0–10 scale.]
What is a good probabilistic forecast?
[Figure: example predictive densities on a 0–10 scale.]

The goal of probabilistic forecasting is to maximize the sharpness of the predictive distribution subject to calibration.

Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B, 69, 243–268.
Calibration and sharpness
[Figure: example predictive densities on a 0–10 scale.]

Calibration: compatibility between the forecasts and the observations; a joint property of the forecasts and the observations.
Sharpness: concentration of the forecasts; a property of the forecasts only.
Evaluation of probabilistic forecasts: Proper scoring rules

A scoring rule S(F, y) is proper relative to a class F if

E_{Y∼G} S(G, Y) ≤ E_{Y∼G} S(F, Y)   for all F, G ∈ F.

We consider scores to be negatively oriented penalties that forecasters aim to minimize. Proper scoring rules prevent hedging strategies.

Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
Examples

Popular examples of proper scoring rules include
- the logarithmic score, LogS(F, y) = −log f(y), where f is the density of F,
- the continuous ranked probability score,

  CRPS(F, y) = ∫_{−∞}^{∞} (F(z) − 1{y ≤ z})² dz,

  where the probabilistic forecast F is represented as a CDF.
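For a Gaussian predictive distribution, both scores have closed forms; as an illustrative sketch (not part of the talk), the CRPS formula for N(μ, σ²) given by Gneiting and Raftery (2007) can be coded directly:

```python
import math

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y
    (Gneiting and Raftery, 2007)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

def logs_normal(mu, sigma, y):
    """Logarithmic score -log f(y) for a normal predictive density."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + 0.5 * z * z

# A well-centered forecast scores better than a badly biased one:
print(crps_normal(0, 1, 0), crps_normal(4, 1, 0))
print(logs_normal(0, 1, 0), logs_normal(4, 1, 0))
```

Both scores are negatively oriented, so smaller values indicate better forecasts.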
2. The forecaster’s dilemma
Media attention often exclusively falls on prediction performance in the case of extreme events:
- "Bad Data Failed To Predict Nashville Flood" (NBC, 2011)
- "Weather Service Faulted for Sandy Storm Surge Warnings" (NBC, 2013)
- "How Did Economists Get It So Wrong?" (NY Times, 2009)
- "Nouriel Roubini: The economist who predicted worldwide recession" (Guardian, 2009)
- "An exclusive interview with Med Yones - The expert who predicted the financial crisis" (CEOQ Mag, 2010)
- "A Seer on Banks Raises a Furor on Bonds" (NY Times, 2011)
Toy example

We compare Alice’s and Bob’s forecasts for Y ∼ N(0, 1):

F_Alice = N(0, 1),   F_Bob = N(4, 1).

Based on all 10 000 replicates:

Forecaster | CRPS | LogS
Alice      | 0.56 | 1.42
Bob        | 3.53 | 9.36

When the evaluation is restricted to the largest ten observations:

Forecaster | R-CRPS | R-LogS
Alice      | 2.70   | 6.29
Bob        | 0.46   | 1.21
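The toy example can be reproduced qualitatively with a short Monte Carlo sketch (a hypothetical re-implementation, not the authors’ code; exact numbers depend on the random seed):

```python
import math
import random

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for a N(mu, sigma^2) forecast (Gneiting and Raftery, 2007)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

random.seed(1)
obs = [random.gauss(0, 1) for _ in range(10_000)]   # truth: Y ~ N(0, 1)

alice = [crps_normal(0, 1, y) for y in obs]         # Alice issues the true N(0, 1)
bob = [crps_normal(4, 1, y) for y in obs]           # Bob issues the biased N(4, 1)

mean = lambda s: sum(s) / len(s)
full_alice, full_bob = mean(alice), mean(bob)       # Alice wins over all cases

# Restricting the evaluation to the ten largest observations flips the ranking:
top10 = sorted(range(len(obs)), key=obs.__getitem__)[-10:]
r_alice = mean([alice[i] for i in top10])
r_bob = mean([bob[i] for i in top10])
print(full_alice, full_bob, r_alice, r_bob)
```

On the full sample Alice’s mean CRPS is far smaller than Bob’s; on the ten largest observations the order reverses, which is exactly the forecaster’s dilemma.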
Verifying only the extremes erases propriety

Some econometric papers use the restricted logarithmic score

R-LogS_{≥r}(F, y) = −1{y ≥ r} log f(y).

However, if h(x) > f(x) for all x ≥ r, then

E R-LogS_{≥r}(H, Y) < E R-LogS_{≥r}(F, Y),

independently of the true density.

[Figure: two densities f and h, with h exceeding f to the right of r.]

In fact, if the forecaster’s belief is F, her best prediction under R-LogS_{≥r} is

f*(z) = 1{z ≥ r} f(z) / ∫_r^∞ f(x) dx.
The forecaster’s dilemma

Given any (non-trivial) proper scoring rule S and any non-constant weight function w, the scoring rule S*(F, y) = w(y) S(F, y) is improper. The expected value E_{Y∼G} S*(F, Y) is minimized by

f*(z) = w(z) g(z) / ∫ w(x) g(x) dx.

Forecaster’s dilemma: Forecast evaluation based on a subset of extreme observations only corresponds to the use of an improper scoring rule and is bound to discredit skillful forecasters.
3. Proper forecast evaluation for extreme events
Proper weighted scoring rules I

Proper weighted scoring rules provide suitable alternatives. Gneiting and Ranjan (2011) propose the threshold-weighted CRPS,

twCRPS(F, y) = ∫_{−∞}^{∞} (F(z) − 1{y ≤ z})² w(z) dz,

where w(z) is a weight function on the real line.

Gneiting, T. and Ranjan, R. (2011) Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics, 29, 411–422.
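The twCRPS has no general closed form, but it is easy to approximate numerically; a minimal sketch (illustrative helper names and quadrature bounds are my own) with an indicator weight on the right tail:

```python
import math

def norm_cdf(z, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((z - mu) / (sigma * math.sqrt(2))))

def tw_crps(cdf, y, weight, lo=-15.0, hi=15.0, n=20_000):
    """twCRPS(F, y) = integral of (F(z) - 1{y <= z})^2 * w(z) dz,
    approximated by midpoint quadrature on [lo, hi]."""
    dz = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * dz
        total += (cdf(z) - (1.0 if y <= z else 0.0)) ** 2 * weight(z)
    return total * dz

w_tail = lambda z: 1.0 if z >= 2.0 else 0.0     # indicator weight, r = 2
F_true = lambda z: norm_cdf(z)                  # N(0, 1) forecast
F_biased = lambda z: norm_cdf(z, mu=4.0)        # N(4, 1) forecast

s_true = tw_crps(F_true, 0.5, w_tail)
s_biased = tw_crps(F_biased, 0.5, w_tail)
print(s_true, s_biased)
```

With the constant weight w ≡ 1 the same routine recovers the ordinary CRPS, so the weighted score is a strict generalization.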
Proper weighted scoring rules II

Diks et al. (2011) propose the conditional likelihood score,

CL(F, y) = −w(y) log( f(y) / ∫ w(z) f(z) dz ),

and the censored likelihood score,

CSL(F, y) = −w(y) log f(y) − (1 − w(y)) log( 1 − ∫ w(z) f(z) dz ),

where w(z) is a weight function on the real line.

Diks, C., Panchenko, V. and van Dijk, D. (2011) Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163, 215–233.
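For an indicator weight w(z) = 1{z ≥ R} and normal forecast densities, both likelihood-based scores simplify considerably; a sketch under these assumptions (a hypothetical implementation, not the authors’ code):

```python
import math
import random

def norm_pdf(y, mu=0.0, sigma=1.0):
    u = (y - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2 * math.pi))

def norm_cdf(y, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((y - mu) / (sigma * math.sqrt(2))))

R = 2.0  # indicator weight w(z) = 1{z >= R}

def cl(mu, sigma, y):
    """Conditional likelihood score: -w(y) * log(f(y) / tail mass of f)."""
    if y < R:
        return 0.0
    tail_mass = 1 - norm_cdf(R, mu, sigma)
    return -math.log(norm_pdf(y, mu, sigma) / tail_mass)

def csl(mu, sigma, y):
    """Censored likelihood score: -log f(y) in the tail, -log F(R) otherwise."""
    if y >= R:
        return -math.log(norm_pdf(y, mu, sigma))
    return -math.log(norm_cdf(R, mu, sigma))

# Propriety shows up in expectation: sample from the true N(0, 1)
random.seed(0)
ys = [random.gauss(0, 1) for _ in range(20_000)]
mean = lambda s: sum(s) / len(s)
csl_true = mean([csl(0, 1, y) for y in ys])
csl_biased = mean([csl(4, 1, y) for y in ys])
print(csl_true, csl_biased)
```

In expectation the true forecast beats the biased one under both scores, unlike under the restricted logarithmic score of the previous slide.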
Role of the weight function

The weight function w can be tailored to the situation of interest. For example, if interest focuses on the predictive performance in the right tail, one may choose

w_indicator(z) = 1{z ≥ r}   or   w_Gaussian(z) = Φ(z | μ_r, σ_r²).

The choices of the parameters r, μ_r, σ_r can be motivated and justified by the application at hand.
4. Case study
Case study: Probabilistic wind speed forecasting

- Forecasts and observations of daily maximum wind speed
- Prediction horizon of 1 day ahead
- Statistical post-processing of ECMWF ensemble forecasts
- 50 exchangeable ensemble members
- 228 observation stations over Germany
- Evaluation period: May 1, 2010 – April 30, 2011; > 80 000 individual forecast cases

[Figure: map of the 228 observation stations over Germany.]
General approaches to ensemble post-processing

1. Bayesian model averaging (BMA) associates each ensemble member with a kernel function.
2. Ensemble model output statistics (EMOS), or non-homogeneous regression (NR), fits a single parametric predictive distribution using summary statistics from the ensemble. The standard NR model for wind speed is based on a truncated normal distribution.

Thorarinsdottir, T. L. and Gneiting, T. (2010) Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression. Journal of the Royal Statistical Society Series A, 173, 371–388.
Non-homogeneous regression models

1. Truncated normal model (TN). Following Thorarinsdottir and Gneiting (2010), set

Y | X_1, ..., X_k ∼ N_[0,∞)(μ, σ²),

where μ = a + b X̄ and σ² = c + d · (1/k) Σ_{i=1}^k (X_i − X̄)².

2. Biased truncated normal model (TN-biased):

Y | X_1, ..., X_k ∼ N_[0,∞)(μ + 5, 2σ²).
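The TN model can be sketched in a few lines of Python using only the standard library; the coefficient values below are made up for illustration (in practice a, b, c, d are estimated from training data, e.g. by minimum-CRPS estimation):

```python
from statistics import NormalDist, fmean, pvariance

def tn_params(ensemble, a, b, c, d):
    """TN model: mu = a + b * ensemble mean, sigma^2 = c + d * ensemble variance."""
    xbar = fmean(ensemble)
    return a + b * xbar, (c + d * pvariance(ensemble, xbar)) ** 0.5

def tn_quantile(mu, sigma, p):
    """Quantile of N(mu, sigma^2) truncated to [0, infinity)."""
    nd = NormalDist(mu, sigma)
    p0 = nd.cdf(0.0)                      # probability mass cut off below zero
    return nd.inv_cdf(p0 + p * (1 - p0))

# Hypothetical ensemble (m/s) and illustrative coefficients:
ensemble = [6.1, 7.4, 5.9, 8.2, 6.8]
mu, sigma = tn_params(ensemble, a=0.2, b=1.0, c=0.3, d=1.0)
print(mu, sigma, tn_quantile(mu, sigma, 0.9))
```

Truncation at zero keeps all quantiles non-negative, which is the point of using N_[0,∞) for a non-negative quantity like wind speed.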
Marginal calibration

Model     | Forecast distribution
TN        | N_[0,∞)(μ, σ²)
TN-biased | N_[0,∞)(μ + 5, 2σ²)

[Figure: marginal densities of the TN and TN-biased forecasts over wind speeds of 0–20 m s⁻¹.]
Verification

Based on all observations:

Model     | CRPS | LogS
TN        | 1.05 | 2.29
TN-biased | 4.33 | 12.61

When the evaluation is restricted to observations > 12 m s⁻¹:

Model     | R-CRPS≥12 | R-LogS≥12
TN        | 4.77      | 10.42
TN-biased | 2.60      | 7.18

Using proper weighted scoring rules with w_indicator(z) = 1{z ≥ 12} and w_Gaussian(z) = Φ(z | μ_r = 12, σ_r² = 1):

Model     | twCRPS (w_indicator) | twCRPS (w_Gaussian) | CSL (w_indicator) | CSL (w_Gaussian)
TN        | 0.111                | 0.115               | 0.499             | 0.502
TN-biased | 0.313                | 0.349               | 0.605             | 0.609
Predictive performance for high wind speeds

[Figure: marginal densities of the TN forecast, and of a GEV alternative, over wind speeds of 0–20 m s⁻¹.]
Alternative wind speed models

GEV model: G = GEV(μ_G, σ_G, ξ_G), with μ_G = μ_0 + μ_1 X̄, σ_G = σ_0 + σ_1 X̄, ξ_G = ξ_0.

TN-GEV combination model:

H = N_[0,∞)(μ_N, σ²_N)    if X_med < θ,
H = GEV(μ_G, σ_G, ξ_G)    if X_med ≥ θ,

where X_med denotes the ensemble median.

Lerch, S. and Thorarinsdottir, T. L. (2013) Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A, 65: 21206.
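A minimal sketch of the GEV CDF and of the regime switch in the combination model (helper names and the toy inputs are my own, not from the paper):

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """CDF of the generalized extreme value distribution GEV(mu, sigma, xi)."""
    if abs(xi) < 1e-12:                  # Gumbel limit as xi -> 0
        return math.exp(-math.exp(-(x - mu) / sigma))
    t = 1 + xi * (x - mu) / sigma
    if t <= 0:                           # outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1 / xi))

def combined_cdf(x, ens_median, theta, tn_cdf, gev):
    """TN-GEV combination: switch to the GEV forecast when the
    ensemble median reaches the threshold theta."""
    return gev(x) if ens_median >= theta else tn_cdf(x)
```

A positive shape parameter ξ gives the heavy right tail that the truncated normal lacks, which is why the GEV components pay off for high wind speeds.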
Comparison for high wind speed observations

Threshold-weighted CRPS skill score with respect to the TN model:

twCRPSS_r(F, y) = 1 − twCRPS_r(F, y) / twCRPS_r(F_TN, y).

[Figure: twCRPSS of the GEV and TN-GEV models as a function of the threshold, 0–20 m s⁻¹.]
5. Simulation study
Diebold-Mariano tests

Formal test of equal predictive performance of F_0 and F_1, i.e., a test of H_0: g = f_0 vs. H_1: g = f_1.

Diebold-Mariano test: Under the null hypothesis of a vanishing expected score difference and standard regularity conditions,

t_n = √n ( S̄^{F_0} − S̄^{F_1} ) / σ̂_n

is asymptotically standard normal. Here S̄^{F_0} and S̄^{F_1} denote the mean scores on a test set of size n, and σ̂_n² is an estimator of the asymptotic variance of the score difference.

Diebold, F. X. and Mariano, R. S. (1995) Comparing predictive accuracy. Journal of Business and Economic Statistics, 13, 253–263.
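The statistic is straightforward to compute; as a sketch, the version below uses the naive iid variance estimate for the score differences (serially correlated scores would call for a HAC-type estimator, as in applied work):

```python
import math
from statistics import NormalDist, fmean

def dm_test(scores_f0, scores_f1):
    """Diebold-Mariano statistic t_n = sqrt(n) * mean(d) / sigma_hat for the
    score differences d_i = S(F0, y_i) - S(F1, y_i), with a two-sided
    asymptotic p-value from the standard normal distribution."""
    d = [s0 - s1 for s0, s1 in zip(scores_f0, scores_f1)]
    n = len(d)
    dbar = fmean(d)
    var = sum((x - dbar) ** 2 for x in d) / (n - 1)
    t = math.sqrt(n) * dbar / math.sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Toy input: F1 has systematically lower (better) scores than F0
s0 = [2.0 + 0.2 * math.sin(i) for i in range(200)]
s1 = [1.0 + 0.1 * math.cos(i) for i in range(200)]
t, p = dm_test(s0, s1)
print(t, p)
```

A large positive t with a small p-value rejects equal performance in favor of F_1, since scores are negatively oriented.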
Simulation study: Setting

Simulation setting tailored to benefit proper weighted scoring rules. We compare three forecast distributions with densities
- φ(x), the standard normal density,
- h(x) = 1{x ≤ 0} φ(x) + 1{x > 0} t_4(x),
- f(x) = 0.5 φ(x) + 0.5 h(x),
using two-sided DM tests.
Simulation study: Variant 1

Truth = Φ; compare F and H (F should be preferred).

[Figure: frequency of DM test rejections in favor of F (left) and in favor of H (right) as a function of the threshold r ∈ [0, 5], for CRPS, LogS, twCRPS, CSL and CL.]
Simulation study: Variant 2

Truth = H; compare F and Φ (F should be preferred).

[Figure: frequency of DM test rejections in favor of F (left) and in favor of Φ (right) as a function of the threshold r ∈ [0, 5], for CRPS, LogS, twCRPS, CSL and CL.]
Tail dependence of proper weighted scoring rules

Consider w_r(z) = 1{z ≥ r} and a threshold r such that y_i < r for all i = 1, ..., n. Then none of the proper weighted scoring rules depends on the observations; they are determined solely by the tail probabilities:

CL_n^F = 0,
CSL_n^F = −log F(r),
twCRPS_n^F = ∫_r^∞ (F(z) − 1)² dz.

The forecast distribution with the lighter tail then receives the better score, irrespective of the true distribution.
Weighted scoring rules and hypothesis testing

- Discrepancy between being informative and finding the true density (in standard test settings).
- Reformulate the test problem: ignore possible problems of the density forecast outside of the region of interest, A.
- This amounts to testing H_0: g 1_A = f_0 1_A vs. H_1: g 1_A = f_1 1_A.
- Here, tests based on weighted scoring rules can have power.

Holzmann, H. and Klar, B. (2016) Weighted scoring rules and hypothesis testing. Working paper, available at https://arxiv.org/abs/1611.07345.
Summary and conclusions

- Forecaster’s dilemma: Verification on extreme events only is bound to discredit skillful forecasters.
- The only remedy is to consider all available cases when evaluating predictive performance.
- Proper weighted scoring rules emphasize specific regions of interest, such as tails, and facilitate interpretation, while avoiding the forecaster’s dilemma.
- In particular, the weighted versions of the CRPS share (almost all of) the desirable properties of the unweighted CRPS.
- Benefits of using proper weighted scoring rules in terms of statistical power may be limited in standard test settings.

Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F. and Gneiting, T. (2015) Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science, to appear. Preprint available at http://arxiv.org/abs/1512.09244.
Thank you for your attention!