Steven N. Durlauf
Economics 704
Fall 2015

Lecture Notes 2: Decisions and Data

In these notes, I describe some basic ideas in decision theory. These ideas matter in linking econometric analysis to policy evaluation. Basic decision theory is built from the following elements:

Data: realizations $d \in D$
Unknown(s): $\theta$
Choices: $c \in C$
Loss function: $l(\theta, d, c)$

The goal is to construct a decision rule $c(d)$, a mapping from $D$ to the set of choices $C$. The decision rule may be stochastic, but I will ignore this possibility for expositional purposes.

1. Statistical decision theory

The standard version of statistical decision theory is formulated along Bayesian lines: all unknowns are associated with posterior probability densities, and the decision rule is determined by minimization of expected loss. Specifically, one implicitly chooses $c(d)$ by solving the problem

$$\min_{c \in C} \int l(\theta, d, c)\, \mu(\theta \mid d)\, d\theta \qquad (2.1)$$

for each value of $d$. This problem illustrates the key question facing an expected loss minimizing policy analyst: how does one construct $\mu(\theta \mid d)$? To understand this conditional probability, observe that

$$\mu(\theta \mid d) = \frac{\mu(\theta, d)}{\mu(d)} = \frac{\mu(d \mid \theta)\, \mu(\theta)}{\mu(d)} \qquad (2.2)$$

Since $\mu(d)$ is not a function of $\theta$ (it is really nothing more than a normalization that ensures that the relevant probabilities sum to 1), one can rewrite (2.2) as

$$\mu(\theta \mid d) \propto \mu(d \mid \theta)\, \mu(\theta) \qquad (2.3)$$

This is the classic statement that the posterior probability measure for some unknown, $\mu(\theta \mid d)$, is proportional to the product of the likelihood function $\mu(d \mid \theta)$ and the prior probability measure $\mu(\theta)$. The prior represents the information the analyst possesses about the unknown before (i.e. prior!) seeing the data. For some types of unknowns (e.g. parameters) and associated regularity conditions, it will be the case that as the number of observations grows, the likelihood will “swamp” the prior.
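To make (2.1)-(2.3) concrete, here is a minimal numerical sketch, assuming a binomial likelihood, a discretized parameter grid, and an absolute-value loss; none of these choices come from the notes (SciPy is used only for the binomial pmf).

```python
# A numerical sketch of (2.1)-(2.3): form the posterior as likelihood x prior,
# normalize, then pick the action minimizing posterior expected loss.
# The binomial likelihood, grids, and absolute-value loss are illustrative
# assumptions, not part of the notes.
import numpy as np
from scipy.stats import binom

theta_grid = np.linspace(0.01, 0.99, 99)     # discretized unknown theta
prior = np.full_like(theta_grid, 1.0)        # flat prior on the grid
prior /= prior.sum()

n, d = 10, 7                                 # data: 7 successes in 10 trials
likelihood = binom.pmf(d, n, theta_grid)     # mu(d | theta)

posterior = likelihood * prior               # numerator of (2.2)
posterior /= posterior.sum()                 # dividing by mu(d): normalization

actions = np.linspace(0.0, 1.0, 101)         # the choice set C
# expected posterior loss of each action under l(theta, d, c) = |c - theta|
exp_loss = np.abs(actions[:, None] - theta_grid[None, :]) @ posterior
c_star = actions[np.argmin(exp_loss)]        # solves (2.1) for this d
print(c_star)                                # close to the posterior median
```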

loss functions and “standard” statistical exercises

The statistical decision formulation does not, at first glance, appear closely related to the standard exercises of producing an estimate of a parameter, etc. However, these types of exercises may be reproduced in their essentials by a suitable choice of the loss function. Consider the loss function

$$l(\theta, d, c) = (c(d) - \theta)^2 \qquad (2.4)$$

One can show that the expected loss minimizing decision rule given (2.4) is $c(d) = E(\theta \mid d)$. The proof is left as an exercise. This posterior mean is the Bayesian analog to a point estimate in frequentist analysis.
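Rather than give the proof away, here is a numerical check of the claim, assuming an arbitrary Beta(8, 4) distribution as a stand-in posterior.

```python
# A numerical check (deliberately not the proof) that c(d) = E[theta | d]
# minimizes expected loss under (2.4). The Beta(8, 4) "posterior" is an
# arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
theta_draws = rng.beta(8, 4, size=200_000)   # draws from a stand-in posterior

candidates = np.linspace(0, 1, 1001)
exp_loss = [np.mean((c - theta_draws) ** 2) for c in candidates]
c_hat = candidates[int(np.argmin(exp_loss))]

print(c_hat, theta_draws.mean())             # both are approx 8/12 = 0.667
```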

priors

The assignment of priors is problematic, since in many (most?) contexts the analyst does not have a good basis for constructing $\mu(\theta)$. This is true in several senses: first, for many problems one may simply have nothing to say about the ex ante relative probability of different realizations. A second problem is that what information is possessed may not quantify easily. Within Bayesian statistics, a number of approaches exist for addressing this difficulty. For our purposes, I will focus on the question of constructing priors in the case where one is “ignorant,” i.e. one does not have any basis for discriminating between different values of the unknown. As an example, I will consider analysis of the mean and variance of a sequence of i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. The classical approach to specifying priors in the presence of ignorance is the “principle of insufficient reason,” an idea which is often attributed to Bernoulli and Laplace (although the name of the principle postdates them).

The basic idea of the principle is that the absence of any reason to discriminate between two values of an unknown is interpretable as saying that we assign equal prior probabilities to them. How does the principle of insufficient reason translate into a prior for $\mu$? If the support of the mean, under ignorance, is $(-\infty, \infty)$, then the associated prior must be constant, i.e. a uniform density on the real line: $\mu(\mu) \propto 1$. Notice that in this case the prior is improper, which means that it does not integrate to 1. The case of the variance $\sigma^2$ is more complicated.

One triviality: the parameter that is typically studied is the standard deviation $\sigma$. Since the variance cannot be negative, the support of $\sigma$ is, under ignorance, taken to be $[0, \infty)$.

For this case, the standard ignorant prior is taken to be $\mu(\sigma) \propto \frac{1}{\sigma}$. The motivation for this formulation is that it assigns a uniform prior to $\log \sigma$, whose support is $(-\infty, \infty)$. This example suggests a more general problem in formulating ignorance in terms of a uniform prior. Consider a parameter vector $\theta$ and its associated transformation into another parameter vector $\gamma = f(\theta)$. Assuming all probability measures can be represented as densities, for a given prior $\mu(\theta)$, it must be the case that the associated prior for $\gamma$ is

$$\mu(\gamma) = \mu\!\left(f^{-1}(\gamma)\right) \left| \frac{d f^{-1}(\gamma)}{d\gamma} \right| \qquad (2.5)$$

This means that the uniform prior is not invariant across nonlinear transformations of the unknowns, which makes little sense if the prior really captures ignorance, since ignorance about $\theta$ presumably implies ignorance about $\gamma$.
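To see the problem concretely (this instance is my addition, not in the original notes): take $\theta = \sigma$ with the uniform prior $\mu(\sigma) \propto 1$ on $(0, \infty)$ and $\gamma = f(\sigma) = \sigma^2$, so that $f^{-1}(\gamma) = \gamma^{1/2}$. Then (2.5) gives

$$\mu(\gamma) = \mu\!\left(\gamma^{1/2}\right) \left| \frac{d\,\gamma^{1/2}}{d\gamma} \right| \propto \frac{1}{2}\,\gamma^{-1/2},$$

so “ignorance” expressed as a flat prior on the standard deviation becomes a decidedly non-flat prior on the variance.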

One way to impose invariance of this type is via the Jeffreys prior.

Recalling that Fisher's information matrix is defined as

$$I(\theta) = -E\left[\frac{\partial^2 \log \mu(d \mid \theta)}{\partial \theta^2}\right] = -\int_D \frac{\partial^2 \log \mu(d \mid \theta)}{\partial \theta^2}\, \mu(d \mid \theta)\, dd \qquad (2.6)$$

the Jeffreys prior is defined as

$$\mu(\theta) \propto \left| I(\theta) \right|^{1/2} \qquad (2.7)$$

This prior addresses the invariance problem in that the form of the prior can be shown to be unchanged under reparameterization, but it raises difficulties in terms of interpretability. One issue is that the prior depends on the form of the likelihood, which seems odd since the prior is supposed to describe beliefs about parameters.
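A standard example (added here for concreteness; it is not in the original notes): for a single Bernoulli($\theta$) observation, $\log \mu(d \mid \theta) = d \log\theta + (1-d)\log(1-\theta)$, and a direct calculation gives $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is

$$\mu(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},$$

the Beta$(1/2, 1/2)$ density. The prior is pinned down by the Bernoulli form of the likelihood, which is exactly the interpretability issue just described.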

Example 2.1. Priors and posteriors for the mean of normal random variables

Suppose that $x_i \sim N(\mu, \sigma^2)$, $i = 1, \dots, K$, with $\sigma^2$ known; the prior for $\mu$ is $N(\mu_0, \sigma_0^2)$. Some algebraic manipulation reveals that the posterior density for the mean is

$$\mu \mid \mu_0, \sigma_0^2, \sigma^2, x_1, \dots, x_K \sim N\!\left(\frac{K/\sigma^2}{K/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{K/\sigma^2 + 1/\sigma_0^2}\,\mu_0,\ \left(\frac{K}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}\right) \qquad (2.8)$$

where $\bar{x}$ is the sample mean of the data. What is the intuition? The posterior mean is a weighted average of the prior mean and the sample average. The weights on the prior mean $\mu_0$ and the sample average $\bar{x}$ reflect their respective variances, which makes sense since this means that more weight is attached to the more precise piece of information about $\mu$. Notice that as $K \to \infty$, the effect of the prior on the posterior disappears, i.e. the posterior mean converges to the sample mean.
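A short sketch of the update (2.8), with all numerical values chosen purely for illustration:

```python
# Sketch of the posterior update (2.8) for a normal mean with known variance;
# all numerical values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
mu0, sig0_sq = 0.0, 4.0                # prior N(mu0, sig0_sq)
sigma_sq, K = 1.0, 50                  # known data variance, sample size
x = rng.normal(2.0, np.sqrt(sigma_sq), size=K)
xbar = x.mean()

precision = K / sigma_sq + 1 / sig0_sq         # posterior precision in (2.8)
w = (K / sigma_sq) / precision                 # weight on the sample mean
post_mean = w * xbar + (1 - w) * mu0           # weighted average of xbar, mu0
post_var = 1 / precision

print(xbar, post_mean, post_var)               # as K grows, post_mean -> xbar
```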

2. Decision theory without probabilities

In this section, I discuss decision criteria when probabilities are not available for the unobservables of interest.

minimax

The minimax solution is to choose $c$ to solve

$$\min_{c \in C} \max_{\theta} l(\theta, d, c) \qquad (2.9)$$

The minimax approach is usually associated with Abraham Wald (1950). An important axiomatization is due to Itzhak Gilboa and David Schmeidler (1989). The criterion implicitly means that the policymaker is extremely risk averse. There are interesting applications of the criterion in philosophy, notably John Rawls' (1971) difference principle. The idea of the difference principle is the following. Suppose that individuals are asked to rank different distributions of economic outcomes in a society, under the proviso that they do not know which of the outcomes will turn out to be theirs. Rawls argues that individuals will choose the distribution that maximizes the utility of the worst off person. As pointed out by Kenneth Arrow (1973), this is a minimax argument. It can further be understood as the limit of a welfarist analysis. For example, suppose that individual utility is defined by $u_i = \pi_i$, where $\pi_i$ denotes a scalar measure of individual $i$'s outcome. Further, suppose that the social state $\pi$ is assigned the social welfare measure $W(\pi) = \left(\sum_i \pi_i^{\rho}\right)^{1/\rho}$. Then, as $\rho \to -\infty$, $W \to \min_i \pi_i$. The limit $\rho \to -\infty$ implies an arbitrary degree of risk aversion. And this must be true for any monotonic transformation of the social welfare function, so it holds for a utilitarian welfare calculation as well.
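To make the limit claim explicit (a short verification I am adding; it is not in the original notes), note that for $\rho < 0$, $\pi_i > 0$, and $N$ individuals,

$$N^{1/\rho} \min_i \pi_i \;\le\; \left(\sum_{i=1}^{N} \pi_i^{\rho}\right)^{1/\rho} \;\le\; \min_i \pi_i,$$

and $N^{1/\rho} \to 1$ as $\rho \to -\infty$, so $W(\pi) \to \min_i \pi_i$.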

minimax regret

An alternative approach to decisionmaking without assigning probabilities to unknowns is to choose $c$ to minimize the maximum regret. Minimax regret was originally proposed by Leonard J. Savage (1951) as an alternative to minimax which is less conservative. Charles Manski is the leading advocate of its use among current researchers; he has produced numerous important papers applying minimax regret to different contexts; see Manski (2008) for a conceptual defense and Brock and Durlauf (2015) for criticism.

To characterize minimax regret one naturally needs to specify what is meant by regret. Regret is formally defined as

$$r(\theta, d, c) = l(\theta, d, c) - \min_{c' \in C} l(\theta, d, c') \qquad (2.10)$$

The minimax regret solution is

minc max r  , d, c 

(2.11)

A recent axiomatization is due to Jörg Stoye (2006a,b).

Example 2.3. minimax, minimax regret, and expected loss

Suppose that there are two possible actions by the policymaker, $c_1$ and $c_2$, and two possible states of the world, $\theta_1$ and $\theta_2$. The following table gives the losses for each policy and state of the world, as well as the maximum loss and maximum regret.

1

2

max loss

max regret

c1

24

12

24

5

c2

19

15

19

3

As the table indicates, by both the minimax criterion and the minimax regret criterion, the policymaker should choose $c_2$. A Bayesian would address the policy problem by assigning probability $p$ ($1-p$) to $\theta_1$ ($\theta_2$) and computing expected losses under the two policies. It is obvious that if $1-p$ is close enough to 1, the policymaker should choose $c_1$.
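These tabulations are easy to mechanize. Below is a small helper (my own illustrative sketch, not anything from the notes) that reproduces the max-loss and max-regret columns of the tables in this and the following examples.

```python
# Helper (illustrative) computing the max-loss and max-regret columns and the
# policies picked by minimax and by minimax regret for a finite loss table.
import numpy as np

def criteria(loss):
    """loss[i, j] = loss of policy c_{i+1} in state theta_{j+1}."""
    loss = np.asarray(loss, dtype=float)
    regret = loss - loss.min(axis=0)      # regret (2.10), state by state
    return {
        "max loss": loss.max(axis=1),
        "max regret": regret.max(axis=1),
        "minimax pick": int(np.argmin(loss.max(axis=1))),
        "minimax regret pick": int(np.argmin(regret.max(axis=1))),
    }

print(criteria([[24, 12], [19, 15]]))     # Example 2.3: both criteria pick c_2
```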


Example 2.4. minimax and minimax regret producing different policy choices

Consider the alternative loss structure

1

2

max loss

max regret

c1

24

23

24

8

c2

25

15

25

1

Here the minimax policy choice is $c_1$ whereas the minimax regret choice is $c_2$. This example illustrates the intuitive appeal of minimax regret in that the criterion focuses on differences in policy effects rather than absolute levels.

Example 2.5. minimax regret and the independence of irrelevant alternatives property

This example modifies Example 2.3 by adding a third policy $c_3$.

1

2

max loss

max regret

c1

24

12

24

12

c2

19

15

19

15

c3

50

0

50

31

The introduction of $c_3$ implies that the minimax regret choice is now $c_1$. This is troubling; the fact that $c_3$ is available has changed the ranking of $c_1$ and $c_2$. This violates the property of independence of irrelevant alternatives (IIA), which is often taken to be a natural axiom for decisionmaking. The recognition that minimax regret violates IIA is due to Herbert Chernoff (1954). One way to interpret this violation is that context matters for decisionmaking, an idea that appears in a number of settings studied in behavioral economics. It is not clear, though, that the findings on context dependent decisionmaking map that closely to the IIA violations found in minimax regret. This is something worth investigating.

Example 2.6. minimax regret IIA violation

Here is another case where a violation of IIA occurs. Compare the analysis of 2 policies

1

2

max loss

max regret

c1

30

10

30

9

c2

21

21

21

11

with the comparison of 3 policies


1

2

max loss

max regret

c1

30

10

30

20

c2

21

21

21

11

c3

10

30

30

20

$c_2$ is the minimax choice in the two policy case; but if $c_3$ ($c_1$) were not available, $c_1$ ($c_3$) would be the minimax regret solution. This seems especially paradoxical (at least to me) since $c_1$ and $c_3$ have symmetric structures. It is the case (see Stoye (2006a,b) for formal analysis) that if one is willing to forgo IIA, one can derive minimax regret under an axiom system which includes an axiom of “independence of never-optimal alternatives.” This means that if one adds a policy that is not optimal in any state of nature (when compared to the others), it cannot lead one to change one's choice among the original set.
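Using the hypothetical `criteria()` helper sketched after Example 2.3, the reversal in this example can be checked directly (run that definition first):

```python
# Checking the Example 2.6 reversal with the criteria() helper defined above.
two = criteria([[30, 10], [21, 21]])
three = criteria([[30, 10], [21, 21], [10, 30]])
print(two["minimax regret pick"], three["minimax regret pick"])  # 0, then 1
```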

Hurwicz criterion

One can mix the different approaches that have been described. For example, one can consider problems that incorporate both expected loss and minimax considerations:

$$\min_{c \in C} \left[ \lambda \max_{\theta} l(\theta, d, c) + (1 - \lambda)\, E\, l(\theta, d, c) \right] \qquad (2.12)$$

This was originally proposed by Leonid Hurwicz (1951).
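As a sketch of (2.12) in action (mine; the weight $\lambda$ and the state probabilities used for the expectation are illustrative assumptions), the criterion can be evaluated on the Example 2.3 losses:

```python
# Sketch of the mixture criterion (2.12) on the Example 2.3 losses; lam and
# the state probabilities p used for E l(.) are illustrative assumptions.
import numpy as np

loss = np.array([[24.0, 12.0], [19.0, 15.0]])
p = np.array([0.5, 0.5])                       # subjective state probabilities
lam = 0.3
score = lam * loss.max(axis=1) + (1 - lam) * loss @ p
print(int(np.argmin(score)))                   # index of the chosen policy
```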

3. Decision rules and risk

The risk function, defined as

$$R(\theta, c(\cdot)) = \int_D l(\theta, d, c(d))\, \mu(d \mid \theta)\, dd \qquad (2.13)$$

is a metric for understanding the performance of a rule across the data space, conditional on $\theta$. Notice that this calculation weights performance across the data space by probabilities. Relative to the set of potential decision rules, a rule $c^*(\cdot)$ is (strictly) dominant if it produces a (strictly) lower risk than any alternative rule for all values of $\theta$. A decision rule is said to be inadmissible if there exists an alternative rule with weakly lower risk for all values of $\theta$ and strictly lower risk for some. There is no guarantee that a unique dominant rule exists. The Bayes' risk of a rule is defined as

$$\int R(\theta, c(\cdot))\, \mu(\theta)\, d\theta = \int \left[ \int_D l(\theta, d, c(d))\, \mu(d \mid \theta)\, dd \right] \mu(\theta)\, d\theta \qquad (2.14)$$

Recall that $\mu(d \mid \theta)\, \mu(\theta) = \mu(\theta \mid d)\, \mu(d)$. Exchanging the order of integration (fine under uninteresting regularity assumptions) and substituting this identity into (2.14), the Bayes' risk is

$$\int_D \left[ \int l(\theta, d, c(d))\, \mu(\theta \mid d)\, d\theta \right] \mu(d)\, dd \qquad (2.15)$$

It is evident that minimizing the double integral occurs by minimizing the inner integral at each data point. Minimization of the inner integral is equivalent to the original Bayesian decision theory solution. A general result is that any decision rule which minimizes Bayes’ risk relative to a proper prior is admissible. A form of the converse is also true: any admissible rule minimizes Bayes’ risk under some prior.
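To make (2.13) concrete, here is a Monte Carlo sketch (my own illustration; the two rules and all numbers are assumptions) comparing the risk of two rules for a normal mean under squared loss. It also illustrates why no rule need dominate:

```python
# Monte Carlo sketch of the risk function (2.13) under squared loss for two
# rules for a normal mean: the sample mean, and the sample mean shrunk halfway
# toward zero. Neither rule dominates the other: shrinkage wins near zero.
import numpy as np

rng = np.random.default_rng(2)
K, reps = 10, 100_000

for theta in [0.0, 0.5, 2.0]:
    xbar = rng.normal(theta, 1.0, size=(reps, K)).mean(axis=1)
    risk_mean = np.mean((xbar - theta) ** 2)           # rule c(d) = xbar
    risk_shrink = np.mean((0.5 * xbar - theta) ** 2)   # rule c(d) = xbar / 2
    print(theta, risk_mean, risk_shrink)
```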


4. Bayes versus frequentist estimation

How do these basic ideas link to econometric methodology? Bayesian statistical methods can be understood as constructing the conditional probability $\mu(\theta \mid d)$, a description of the uncertainty associated with unknowns, $\theta$, given knowns, $d$. Frequentist statistical methods, in particular maximum likelihood, construct estimates $\hat{\theta}$ based on $\mu(d \mid \theta)$, and in essence describe ex ante uncertainty about knowns given unknowns. When one considers the decision problems I have described, the Bayesian calculation is the policy-relevant one. Why are frequentist methods so commonly used? One reason is discomfort over priors. As discussed above, there are ways of addressing the absence of principled reasons for a choice of prior. Another reason is computational: Bayesian methods can be difficult to implement. The growth of computational power has, I believe, been a reason for the increasing popularity of Bayesian methods.

5. Fiducial inference

One way to understand the differences between Bayesian and frequentist approaches is that Bayesian approaches construct probabilities of unobservables based on observables, whereas frequentist approaches are based on the analysis of the probabilities of observables based on unobservables. Statistical decision theory is based on the former. Fiducial inference is a method, originally proposed by Ronald Fisher, to address this distinction. Consider a sequence of $N(\mu, \sigma^2)$ random variables $x_i$, $i = 1, \dots, K$. Assume the variance is known. Then the sample mean $\bar{x}$ has the probability density $N\!\left(\mu, \frac{\sigma^2}{K}\right)$. It is straightforward to show that this implies that $\bar{x} - \mu$ has the probability density $N\!\left(0, \frac{\sigma^2}{K}\right)$, which is also the density of $\mu - \bar{x}$. $\mu - \bar{x}$ is an example of what is known as a pivotal quantity, which means that its distribution does not depend on $\mu$. The fiducial argument is that since the deviation of $\mu$ from $\bar{x}$ is defined by a pivotal quantity, this pivotal quantity characterizes the uncertainty associated with the parameter, i.e.

$$\bar{x} - \mu \sim N\!\left(0, \frac{\sigma^2}{K}\right) \quad \text{implies} \quad \mu \sim N\!\left(\bar{x}, \frac{\sigma^2}{K}\right) \qquad (2.16)$$
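A simulation makes the pivotal property concrete (an illustration I am adding; all numerical values are assumptions):

```python
# Simulation sketch of the pivotal quantity behind (2.16): the distribution
# of xbar - mu is N(0, sigma^2/K) whatever the true mu. Values illustrative.
import numpy as np

rng = np.random.default_rng(3)
sigma_sq, K, reps = 1.0, 25, 200_000

for mu in [-3.0, 0.0, 7.0]:
    xbar = rng.normal(mu, np.sqrt(sigma_sq), size=(reps, K)).mean(axis=1)
    pivot = xbar - mu
    print(mu, pivot.mean(), pivot.var())   # mean ~ 0, var ~ sigma_sq/K = 0.04
```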

The argument is generally rejected since, without justification, it transforms a nonrandom object, in this case $\mu$, into a random one. A variation of Fisher's ideas, called structural inference, has been developed by Donald Fraser (see Fraser (1968) for a comprehensive statement). It is also controversial and has yet to impact empirical work. However, unlike Fisher's original formulation, I do not believe one can say it is regarded as a failure. I personally find the structural inference ideas very intriguing, although quite hard to fully understand.


References

Arrow, K., (1973), “Some Ordinalist-Utilitarian Notes on Rawls' Theory of Justice,” Journal of Philosophy, 70, 9, 245-263.

Chernoff, H., (1954), “Rational Selection of Decision Functions,” Econometrica, 22, 422-443.

Fraser, D., (1968), The Structure of Inference, New York: John Wiley.

Gilboa, I. and D. Schmeidler, (1989), “Maxmin Expected Utility with Non-Unique Prior,” Journal of Mathematical Economics, 18, 141-153.

Hurwicz, L., (1951), “Some Specification Problems and Applications to Econometric Models,” Econometrica, 19, 343-344.

Manski, C., (2008), “Actualizing Rationality,” mimeo, Northwestern University.

Rawls, J., (1971), A Theory of Justice, Cambridge: Harvard University Press.

Savage, L. J., (1951), “The Theory of Statistical Decision,” Journal of the American Statistical Association, 46, 55-67.

Stoye, J., (2006a), “Statistical Decisions Under Ambiguity,” mimeo, NYU.

Stoye, J., (2006b), “Axioms for Minimax Regret Choice Correspondences,” mimeo, NYU.

Wald, A., (1950), Statistical Decision Functions, New York: John Wiley.
