Generalized entropy measure of ambiguity and its estimation

Marc Henry
Columbia University

First Version: August 2001. This Version: May 2002.
Abstract: We propose a measure of the degree of ambiguity associated with a belief function, together with a nonparametric method to estimate it. The degree of ambiguity associated with a belief function is measured by the generalized entropy diameter of the set of probability measures compatible with it. It is shown that an estimator based on the empirical version of the unambiguous measure generating the belief function is consistent for the true value of the ambiguity measure when the Kullback-Leibler diameter is used for the latter.

Keywords: Knightian uncertainty; belief functions; generalized entropy; nonparametric estimation; core diameter.
1 Introduction

A growing body of literature has focused on axiomatic decision theory when Savage's "Sure Thing Principle" is relaxed to accommodate behaviours identified in Ellsberg's famous thought experiment (Ellsberg (1961)). In one version of the latter, agents are shown an urn which they are told contains 30 red balls and 60 balls that are either blue or yellow, the precise proportions being unknown. A majority of agents strictly prefer bets on red to

[Footnote 1: This research was carried out while the author was visiting IRES, UCL, and financial support from the latter institution in the form of an FSR grant is gratefully acknowledged.]
bets on blue, whereas they strictly prefer bets on yellow or blue to bets on yellow or red. Such preferences are incompatible with a representation of the likelihood of events according to a single subjective probability measure, as they indicate a preference for bets on events with known probability of occurrence (e.g. drawing a red ball from the urn) over bets on events for which only intervals of probabilities are provided (e.g. drawing a blue ball). From a subjective point of view, this intuition has led to the development of a theory of ambiguity attitude (see for instance Ghirardato, Maccheroni, and Marinacci (2002) and references therein) which parallels theories of risk attitude. But perhaps a more immediate interpretation is objectivist, insofar as the setup of the experiment clearly identifies an objective set of information on the state space in order to elicit the agent's reaction to the (lack of) information. A similar interpretation can be given for the set of priors in the multi-prior decision model of Gilboa and Schmeidler (1989) (hereafter MEU): the set of probability measures P is a representation of the objective information on the likelihood of events on the state space, and the maximin criterion is used to evaluate acts. This interpretation will be used as the focus of this paper, and we shall make it even more unequivocal by considering the decision maker as a social planner acting on behalf of a representative agent with biseparable preferences (in the sense of Ghirardato and Marinacci (2001)). The set function in the canonical representation of the preference relation for the representative agent is supposed to be purely objective, in the sense that it is entirely derived from the objective information available on the state space (which we will dub "scientific knowledge").
In this purely objective setting, and given a social utility function identified with the canonical utility index of the representative agent, we shall be concerned with the degree of departure from a single probability measure on events in the state space that can be identified in the scientific knowledge, and we shall consider an index of ambiguity to quantify this degree of departure, in order to identify whether additional information reduces ambiguity in a way that is beneficial to decision given the social utility. In all that follows, we shall consider the decision problem in the following way: the system is supposed ontologically determinate, in the sense that events in the state space occur according to a unique probability measure P, and it is epistemically indeterminate, because scientific knowledge on the state space falls short of providing a full characterization of that probability measure. One can think of Nature choosing P in the set of priors P corresponding to "scientific knowledge", while the decision maker simultaneously
chooses an act α within the set of acts A, to produce outcome U(P, α), where U is the social utility function. Now, heuristically, invoking a minimax theorem, we can identify sup_{α∈A} inf_{P∈P} U(P, α), the maximin decision maker's action, with inf_{P∈P} sup_{α∈A} U(P, α), which corresponds to a generalized entropy maximization (sup_{α∈A} U(P, α) being the negative generalized entropy of Grunwald and Dawid (2002) associated with U). Since maximin decision making is equivalent to maximizing entropy on the set of priors, relative entropy is a natural pseudo-metric to quantify the reduction in ambiguity due to the contraction of the set of priors that corresponds to scientific knowledge. The object of this paper is to propose such an index of ambiguity, and a statistical procedure to estimate it in the restricted case where utility is logarithmic (in which case generalized relative entropy is simply the Kullback-Leibler discrepancy, as in Kullback and Leibler (1951)) and scientific knowledge takes the form of a belief function generated through experimentation on an auxiliary sample space. The paper is organized as follows: a model for scientific knowledge is given in the first section, with an epistemic interpretation of the case where it is represented by a belief function. The next section describes the method proposed to estimate ambiguity in the latter case.
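The maximin/maximum-entropy identity invoked above can be checked numerically in the logarithmic-utility case. The sketch below is illustrative: the two-state prior set, the act grid, and all numerical choices are assumptions of the example, not taken from the paper. For U(P, a) = Σ_s P(s) log a(s) with acts a in the simplex, sup_a U(P, a) is attained at a = P and equals the negative Shannon entropy −H(P), so inf_P sup_a U equals minus the maximum entropy over the prior set.

```python
import math

# Prior set P: two-state distributions with P(s0) in [0.2, 0.5] (illustrative).
priors = [(0.2 + 0.01 * k, 0.8 - 0.01 * k) for k in range(31)]
# Act grid: interior points of the simplex (log utility needs a(s) > 0).
acts = [(0.01 * k, 1 - 0.01 * k) for k in range(1, 100)]

def U(P, a):
    """Logarithmic social utility U(P, a) = sum_s P(s) log a(s)."""
    return sum(p * math.log(x) for p, x in zip(P, a))

# sup_a inf_P (maximin) vs inf_P sup_a (negative max-entropy), on the grids.
maximin = max(min(U(P, a) for P in priors) for a in acts)
minimax = min(max(U(P, a) for a in acts) for P in priors)

# Both sides equal -log 2, the negative entropy of the entropy-maximising
# prior P* = (0.5, 0.5), confirming the minimax identity for log utility.
assert abs(maximin - minimax) < 1e-2
assert abs(minimax + math.log(2)) < 1e-2
```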
2 Scientific Knowledge

We begin this section with an objective definition of scientific knowledge and its induced beliefs over the state space Ω. We model scientific knowledge over the state space as a family of onto mappings from a standard Borel set Y ([0, 1] for instance) into Ω:

Scientific Knowledge:  F = { f ∈ F : Y → Ω, onto }.

This can be interpreted as the result of experimentation carried out within the framework of a set of "scientific theories," and a special case of F is the family of measurable selections of a random correspondence F : Y → P(Ω) (see Castaldo and Marinacci (2001) for details). This formalization of scientific knowledge admits a natural dynamic extension, where scientific evolution is summarized by the inclusion of new theories through a (generally deductive) creation and dilation of the family F on the one hand, and the disqualification of theories through a (generally inductive) falsification on the other hand.
[Footnote 2: Functions f are taken to represent falsifiable theories in the sense of Popper (1934).]
As in Amarante (2001), we define the induced representation of beliefs on Ω as the push-forward of the usual exterior measure on Y, denoted λ: for each f in F, we define an induced set function μ_f on all subsets of Ω as

  μ_f(A) = λ(f⁻¹(A)),  all A ∈ P(Ω),   (1)

and we call Σ_f the largest σ-algebra on which μ_f is a probability measure; finally we denote by P_f the restriction of μ_f to Σ_f. Therefore, for each f, we define a probability space (Ω, Σ_f, P_f), and if we consider the measurable space (Ω, Σ_F), where Σ_F is the largest σ-algebra contained in ∩_{f∈F} Σ_f, then {P_f : f ∈ F} can be interpreted as a set of priors on the scientifically determined class of relevant events. If Σ_F = {∅, Ω}, we call the scientific knowledge F irrelevant to the state space. Finally, we summarize the belief representation on Ω with the definition of ν_F such that, for all A ∈ P(Ω),

  ν_F(A) = inf_{f∈F} μ_f(A).   (2)

We immediately see that ν_F is a non-additive probability on P(Ω) satisfying

  ν_F(∅) = 0,  ν_F(Ω) = 1,   (3)
  A ⊆ B ⟹ ν_F(A) ≤ ν_F(B).   (4)

Call β_F the restriction of ν_F to Σ_F. By construction, β_F is the lower envelope of the set of priors {P_f : f ∈ F}, and therefore {P_f : f ∈ F} ⊆ Core(β_F), where the core of the non-additive probability β_F, denoted Core(β_F), is the set of probability measures on (Ω, Σ_F) which dominate β_F setwise:

  Core(β_F) = { p ∈ M : p(A) ≥ β_F(A), all A ∈ Σ_F },   (5)

where M is the set of all countably additive probability measures on (Ω, Σ_F).
In the special case mentioned above, where F is defined as the family of measurable selections of a random correspondence F, β_F is actually the belief function induced on Ω by F as defined in Dempster (1967) (i.e. for all A ∈ P(Ω), β_F(A) = λ(F⁻¹(A))). In addition, as shown by Castaldo and Marinacci (2001), when Ω is a Polish space (a complete, separable and metrizable topological space) and F is compact valued, the core of β_F is equal to the weak*-closed convex hull of {P_f : f ∈ F}. In what follows, we will restrict ourselves to the case where Y is an "experimentation space" linked to the state space Ω via a multivalued mapping F, and endowed with its Borel σ-algebra and a probability measure p. Y is a "simplified version" of the state space on which sampling is possible, the latter providing information on the state space through F.
3 Measuring ambiguity
Let Ω be a Polish space with Borel σ-algebra B and call M the space of all probability measures on (Ω, B). Consider a compact, convex metrizable subset Y of a locally convex topological vector space with Borel σ-algebra B_Y, and let p be a probability measure on (Y, B_Y). Finally, let F be a strongly measurable multivalued mapping taking points in Y onto closed non-empty subsets of Ω. For all S ⊆ Ω, we define the Dempster variate, or belief function (mentioned above), generated on (Ω, B) by (Y, B_Y, p, F) in the following way. Define

  S* = { y ∈ Y | F(y) ∩ S ≠ ∅ },  S_* = { y ∈ Y | F(y) ⊆ S }.

The belief function p_* is defined by p_*(S) = p(S_*), and the plausibility function by p*(S) = p(S*). The belief (resp. plausibility) function corresponds to the smallest (resp. largest) reliability that can be attached to an event S. They satisfy p_*(S) ≤ p*(S) for all S, with equality if and only if the belief function is a probability measure. Finally, define the set of probability measures on (Ω, B) compatible with the belief function p_* as

  C = { μ ∈ M | ∀S ∈ B, p_*(S) ≤ μ(S) ≤ p*(S) }.

We make the following assumptions:
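The belief and plausibility functions above are easy to compute in a finite setting. The sketch below is purely illustrative: the experiment space, the mapping F, and the weights are assumptions of the example (an Ellsberg-style urn in which one outcome pins down "red" and the other only reveals "blue or yellow").

```python
# Hypothetical finite instance of the structure (Y, B_Y, p, F) generating a
# belief function. All names and numbers are illustrative.
Omega = frozenset({"red", "blue", "yellow"})
F = {"y1": frozenset({"red"}), "y2": frozenset({"blue", "yellow"})}
p = {"y1": 1 / 3, "y2": 2 / 3}

def belief(S):
    """p_*(S) = p(S_*) = p({y : F(y) ⊆ S}): smallest reliability of S."""
    return sum(p[y] for y, Fy in F.items() if Fy <= S)

def plausibility(S):
    """p*(S) = p(S*) = p({y : F(y) ∩ S ≠ ∅}): largest reliability of S."""
    return sum(p[y] for y, Fy in F.items() if Fy & S)

# "red" is unambiguous (belief = plausibility = 1/3), while "blue" is not:
# belief({blue}) = 0 but plausibility({blue}) = 2/3.
for S in (frozenset({"red"}), frozenset({"blue"}), frozenset({"blue", "yellow"})):
    assert belief(S) <= plausibility(S)
```

The compatible set C is then every probability measure on Ω squeezed between these two set functions event by event.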
Assumption (i): p_* is almost positive (i.e. p_*(S) = 0 implies p*(S) = 0).

Under assumption (i), all the measures in the set C are absolutely continuous with respect to each other. For two elements μ and μ′ in C, denote by dμ/dμ′, or f_{μ:μ′}, the Radon-Nikodym derivative of μ with respect to μ′. We make the further assumption below:
Assumption (ii): ∀(μ, μ′) ∈ C², dμ/dμ′ is continuous on F(Y),
and define the relative entropy of measure μ with respect to μ′ on S as

  I_{μ:μ′}(S) ≡ (1/μ′(S)) ∫_S log( dμ/dμ′ (ω) ) μ′(dω)

when μ′(S) ≠ 0, and zero otherwise. I_{μ:μ′}(Ω) ≥ 0, with equality if and only if μ = μ′, and it can be symmetrized as I_{μ:μ′}(Ω) + I_{μ′:μ}(Ω). However, it does not satisfy the triangle inequality, and is therefore not a metric. It is used as a measure of information for discriminating between competing hypotheses, as it is invariant in the sense that it is decreased by a measurable transformation of the probability spaces (Ω, B, μ) and (Ω, B, μ′), with equality if and only if the transformation is a sufficient statistic (see Kullback and Leibler (1951)). We shall therefore use this Kullback-Leibler contrast to define an index of ambiguity on (Ω, B) in view of the following lemma.
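The basic properties of the Kullback-Leibler contrast (non-negativity, asymmetry, the symmetrized version) can be illustrated on finite distributions. The two distributions below are arbitrary choices for the example, not taken from the paper.

```python
import math

def kl(mu, nu):
    """Kullback-Leibler discrepancy I(mu:nu) = sum_w mu(w) log(mu(w)/nu(w)).
    Requires mu absolutely continuous w.r.t. nu, as in assumption (i)."""
    return sum(m * math.log(m / v) for m, v in zip(mu, nu) if m > 0)

mu = [0.5, 0.3, 0.2]   # illustrative distributions on three states
nu = [0.4, 0.4, 0.2]

assert kl(mu, nu) >= 0 and kl(mu, mu) == 0   # I >= 0, zero iff mu = nu
assert abs(kl(mu, nu) - kl(nu, mu)) > 1e-6   # not symmetric ...
sym = kl(mu, nu) + kl(nu, mu)                # ... but can be symmetrized
```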
Lemma 1: Under assumptions (i) and (ii),

  A(F, p) ≡ sup_{(μ,μ′)∈C²} I_{μ:μ′}(Ω) < +∞.
In view of lemma 1, we define the index of ambiguity implicit in the pair (F, p) as the "Kullback-Leibler diameter" of C, i.e. A(F, p). As noted in the introduction, in a minimax decision framework, this diameter also serves as a measure of ambiguity aversion on the part of the decision maker.
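In a finite setting the Kullback-Leibler diameter can be computed directly by maximizing over pairs of compatible measures. The sketch below uses an Ellsberg-style compatible set (all numbers illustrative); the ε > 0 truncation keeps the measures mutually absolutely continuous, in the spirit of assumption (i). By joint convexity of the Kullback-Leibler discrepancy, the supremum over a segment of measures is attained at its endpoints.

```python
import math
from itertools import product

def kl(mu, nu):
    return sum(m * math.log(m / v) for m, v in zip(mu, nu) if m > 0)

# Illustrative compatible set: P(red) = 1/3 fixed, P(blue) = t with
# t in [eps, 2/3 - eps]; eps > 0 ensures mutual absolute continuity.
eps = 0.05
grid = [eps + k * (2 / 3 - 2 * eps) / 50 for k in range(51)]
C = [(1 / 3, t, 2 / 3 - t) for t in grid]

# Kullback-Leibler diameter: largest discrepancy between compatible pairs.
diameter = max(kl(mu, nu) for mu, nu in product(C, C))

# Joint convexity of KL implies the maximum sits at the extreme measures.
assert abs(diameter - kl(C[0], C[-1])) < 1e-12
```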
4 Estimating ambiguity
Consider a sample {y_i}_{i=1}^n of i.i.d. random variables on (Y, B_Y, p) as the result of an experiment with known link F to the measurable space of interest (Ω, B). The problem considered here is the estimation of the index of ambiguity induced on (Ω, B) by (Y, B_Y, p, F). Consider first the problem of estimating the relative entropy of μ with respect to μ′, where μ and μ′ are two elements of C, from a sample of hypothetical i.i.d. random variables {X_i}_{i=1}^n distributed according to μ on the probability space (Ω, B, μ′). Ahmad and Lin (1976) propose a nonparametric estimator of entropy for absolutely continuous density functions with respect to Lebesgue measure on the real line, and Robinson (1991) extends it to relative entropy in a Euclidean space. However, the topological vector space structure is not necessary, nor is a particular metric, and we can construct an entropy estimator from {X_i}_{i=1}^n as follows. Denote by f̂_{μ:μ′} a histogram estimator of dμ/dμ′ (described below) based on {X_i}_{i=1}^n, and construct an estimator of I_{μ:μ′}(Ω) in the form:
  Î_{μ:μ′} ≡ ∫_Ω log f̂_{μ:μ′}(ω) μ_n(dω) = (1/n) Σ_{i=1}^n log f̂_{μ:μ′}(X_i),

where μ_n is the empirical measure (1/n) Σ_{i=1}^n δ_{X_i} (δ denoting the Dirac measure). For the construction of f̂_{μ:μ′} (denoted f̂ in what follows), we make the following assumption:
Assumption (iii): F(Y) is compact.

Remark: Hall (1990) shows that the properties of kernel estimators of entropy of the type proposed by Ahmad and Lin (1976), Robinson (1991) and others depend crucially on the tail behaviour of the density. In particular, he shows that in Euclidean spaces, √n-consistency requires drastic conditions on tail behaviour and/or excess smoothness (to apply bias-reducing techniques such as higher-order kernels as in Robinson (1991)) for any dimension higher than 1 for histogram density estimates (which we use here) and 3 for more general kernel estimates (where large kernel tails are needed to offset the effect of large tails in the density). Of course, we do not try to achieve √n-consistency here, as it would require moment and smoothness conditions on f, which we are trying to avoid as they are difficult to relate to the mapping F. However, we assume compactness of F(Y) and continuity of f to avoid clouding the central issue with tail behaviour considerations.

Under assumption (iii), let {C_j^n}_{j=1}^{k(n)} be a measurable partition of F(Y) such that the following two conditions are satisfied:
Assumption (iv): There exist positive constants c₁ and c₂ such that

  c₁/k(n) ≤ min_j μ′(C_j^n) ≤ max_j μ′(C_j^n) ≤ c₂/k(n).
Assumption (v): k(n)⁻¹ + n⁻¹ k(n) = o(1).

Now let f̂ be defined on F(Y) as

  f̂(ω) ≡ (1/n) Σ_{ℓ=1}^n Σ_{j=1}^{k(n)} δ_{X_ℓ}(C_j^n) δ_ω(C_j^n) / μ′(C_j^n),
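A minimal numerical sketch of this histogram plug-in construction, under illustrative assumptions: F(Y) = [0, 1] (compact, as in assumption (iii)), μ with density 2ω, and μ′ uniform, so dμ/dμ′(ω) = 2ω and I_{μ:μ′}(Ω) = ∫ 2ω log(2ω) dω = log 2 − 1/2. The bin count respects assumption (v).

```python
import math, random

random.seed(0)

n, k = 20000, 50   # k(n)^-1 + k(n)/n = o(1), as in assumption (v)
# Inverse-CDF draws from mu (density 2w on [0,1]): CDF w^2, inverse sqrt(U).
X = [math.sqrt(random.random()) for _ in range(n)]

counts = [0] * k   # N_j: number of X_i falling in bin C_j
for x in X:
    counts[min(int(x * k), k - 1)] += 1

def f_hat(w):
    """Histogram estimate of dmu/dmu'(w); mu'(C_j) = 1/k under uniform mu'."""
    j = min(int(w * k), k - 1)
    return counts[j] / (n * (1.0 / k))

# Plug-in estimator: I_hat = (1/n) sum_i log f_hat(X_i), the empirical
# analogue of the integral of log f_hat against mu_n.
I_hat = sum(math.log(f_hat(x)) for x in X) / n
I_true = math.log(2) - 0.5
assert abs(I_hat - I_true) < 0.05
```

At the sample points f̂ is always strictly positive (each X_i occupies its own bin), so the logarithm is well defined, mirroring the restriction to non-empty bins in the proof of lemma 2.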
and we now state

Lemma 2: Under assumptions (i) to (v), Î_{μ:μ′} → I_{μ:μ′}(Ω) in μ′-probability.
The link between the probability space (Y, B_Y, p) and the elements of C is provided intuitively by Dempster's characterization of μ ∈ C by the existence of a set of probability kernels indexed by y ∈ Y and with support F(y) on (Ω, B). This prompts the construction of analogues of the unavailable random variables {X_i}_{i=1}^n in the pair of n-tuples of elements of Ω, {ω_i}_{i=1}^n and {ω̄_i}_{i=1}^n, with (ω_i, ω̄_i) ∈ F(y_i)². For each n-tuple, an empirical density is constructed with the slightly modified assumption:
Assumption (iv′): There exist positive constants c₁ and c₂ such that

  c₁/k(n) ≤ min_j p_*(C_j^n) ≤ max_j p*(C_j^n) ≤ c₂/k(n).

Remark: Note that assumption (iv′) implies assumption (iv) for all μ′ in C.
We denote by f̂* and f̂_* the empirical density estimators constructed from {ω_i}_{i=1}^n and {ω̄_i}_{i=1}^n respectively. More precisely:

  f̂*(ω) ≡ (1/n) Σ_{ℓ=1}^n Σ_{j=1}^{k(n)} δ_{ω_ℓ}(C_j^n) δ_ω(C_j^n) / p_*(C_j^n),

  f̂_*(ω) ≡ (1/n) Σ_{ℓ=1}^n Σ_{j=1}^{k(n)} δ_{ω̄_ℓ}(C_j^n) δ_ω(C_j^n) / p*(C_j^n).
Finally, call Â_n(F, p) the proposed estimator for the Kullback-Leibler diameter A(F, p) of C in M, defined as:

  Â_n(F, p) = max_{(ω_i, ω̄_i) ∈ F(y_i)², i=1,…,n}  (1/n) Σ_{i=1}^n log( f̂*(ω_i) / f̂_*(ω̄_i) ).
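A simplified discrete analogue of this estimator can be simulated directly; everything below (the two-state space, the mapping F, the weights) is an illustrative assumption, and in the discrete case the histogram densities reduce to empirical frequencies. The outcome "c" is ambiguous, so the compatible measures put P(state 0) anywhere in [0.3, 0.7] and are mutually absolutely continuous, consistent with assumption (i).

```python
import math, random

random.seed(1)

# Illustrative experiment: F("a") = {0}, F("b") = {1}, F("c") = {0, 1}.
F = {"a": (0,), "b": (1,), "c": (0, 1)}
p = {"a": 0.3, "b": 0.3, "c": 0.4}
n = 20000
ys = random.choices(list(p), weights=list(p.values()), k=n)

def freq(c_choice):
    """Empirical law when every ambiguous draw 'c' is sent to state c_choice."""
    n0 = sum(1 for y in ys if y == "a" or (y == "c" and c_choice == 0))
    return (n0 / n, 1 - n0 / n)

def kl(mu, nu):
    return sum(m * math.log(m / v) for m, v in zip(mu, nu) if m > 0)

# Maximise the empirical discrepancy over the selections for the ambiguous
# draws, mimicking the max over (w_i, w_bar_i) in F(y_i)^2 above.
A_hat = max(kl(freq(c1), freq(c2)) for c1 in (0, 1) for c2 in (0, 1))

# Population Kullback-Leibler diameter: KL((0.7, 0.3), (0.3, 0.7)) ~ 0.34.
```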
We can now state the main result, which is an immediate consequence of Theorem 2.1 of Wasserman (1990) and Lemmata 1 and 2 above:

Theorem 1: Â_n(F, p) →_p A(F, p).

Theorem 1 is a weak result, due mostly to the degree of generality of the setting, and more precise asymptotic results (on the rate of convergence and the possible limiting distribution) would be needed to infer comparisons of the informativeness of different experiments. However, such results will necessarily entail smoothness assumptions on the densities of measures within the core of the belief function, and will therefore be highly "F-specific". Naturally, implementation of the estimator will rely on algorithms which are also F-specific, so that the present note should mostly be regarded as a blueprint for the modeling of scientific uncertainty in policy decisions, the modeling of its evolution over time (or "learning"), and the definition of a precautionary approach to decision making.
References

Abou-Jaoude, S. (1976): "Conditions nécessaires et suffisantes de convergence L1 en probabilité de l'histogramme pour une densité," Annales de l'Institut Henri Poincaré, Section B, 12, 213-231.

Ahmad, I. A., and P. Lin (1976): "A nonparametric estimation of the entropy for absolutely continuous distributions," IEEE Transactions on Information Theory, IT-22, 372-375.

Amarante, M. (2001): "Ambiguity, Measurability and Multiple Priors," preprint, Columbia University.

Castaldo, A., and M. Marinacci (2001): "Random Correspondences as Bundles of Random Variables," preprint, Università di Napoli.

Dempster, A. P. (1967): "Upper and lower probabilities induced by a multivalued mapping," Annals of Mathematical Statistics, 38, 325-339.

Ellsberg, D. (1961): "Risk, ambiguity and the Savage axioms," Quarterly Journal of Economics, 75, 643-669.

Ghirardato, P., F. Maccheroni, and M. Marinacci (2002): "Ambiguity from the differential viewpoint," ICER Working Paper 17/2002.

Ghirardato, P., and M. Marinacci (2001): "Risk, ambiguity, and the separation of utility and beliefs," Mathematics of Operations Research, 26, 864-890.

Gilboa, I., and D. Schmeidler (1989): "Maximin expected utility with non-unique priors," Journal of Mathematical Economics, 18, 141-153.

Grunwald, P. D., and P. Dawid (2002): "Game theory, maximum entropy and robust Bayesian decision theory," UCL Research Report 223.

Hall, P. (1990): "Akaike's information criterion and Kullback-Leibler loss for histogram density estimation," Probability Theory and Related Fields, 85, 449-467.

Huber, P. J. (1981): Robust Statistics. New York: Wiley.

Huber, P. J., and V. Strassen (1973): "Minimax tests and the Neyman-Pearson lemma for capacities," Annals of Statistics, 1, 251-263.

Kullback, S., and R. A. Leibler (1951): "On information and sufficiency," Annals of Mathematical Statistics, 22, 79-86.

Popper, K. (1934): Logik der Forschung. Vienna: Springer.

Robinson, P. M. (1991): "Consistent nonparametric entropy-based testing," Review of Economic Studies, 58, 437-453.

Varadarajan, V. S. (1958): "On the convergence of sample probability distributions," Sankhya, Series A, 19, 23-26.

Wasserman, L. (1990): "Prior envelopes based on belief functions," Annals of Statistics, 18, 454-464.
Appendix
Proof of lemma 1: The belief function p_* is a monotone Choquet capacity of infinite order, and M is metrizable with the Prohorov metric and is also Polish (see for instance Theorem 3.9 in Huber (1981)); hence, by Lemma 2.2 of Huber and Strassen (1973), C is a compact subset of M. The result follows from the continuity of I_{μ:μ′}(Ω) under assumptions (i) and (ii). q.e.d.

Proof of lemma 2: In the proof, we shall drop the subscripts for the Radon-Nikodym derivative and its empirical counterpart. Noting that μ_n is absolutely continuous with respect to μ and μ′, we have
  |Î_{μ:μ′} − I_{μ:μ′}(Ω)| = | ∫ log f̂(ω) μ_n(dω) − ∫ log f(ω) μ(dω) |
    ≤ | ∫ log( f̂(ω)/f(ω) ) μ_n(dω) | + | ∫ log f(ω) μ_n(dω) − ∫ log f(ω) μ(dω) |
    = A + B.

μ_n converges weakly to μ with μ′-probability 1 by Theorem 3 of Varadarajan (1958), and f is continuous on a compact subset of Ω; therefore B → 0 with μ′-probability 1. Note that assumption (ii) of Ahmad and Lin (1976), used for the convergence of the first moment in B, is not needed here. Now, calling N_j the number of X_i's in bin C_j^n, and defining

  p_j = ∫_{C_j^n} f(ω) μ′(dω) = (1/n) E_{μ′}(N_j),

we can write A as

  A = (1/n) Σ_{j∈J_n} N_j log( N_j/(n p_j) ) + (1/n) Σ_{i=1}^n log( E_{μ′} f̂(X_i)/f(X_i) ) = A₁ + A₂,
where J_n is the set of indices of non-empty bins. Take δ > 0, and consider the partition of J_n into J₁ = { j | n p_j > n^δ } and J₂ = J₁ᶜ. We can write

  n |A₁| ≤ | Σ_{j∈J₁} N_j log( N_j/(n p_j) ) | + | Σ_{j∈J₂} N_j log( N_j/(n p_j) ) | = S₁ + S₂.

Now, for δ small enough, S₂ = o_{μ′}(k(n)), whereas, by differentiability of the logarithm, we have

  S₁ ≤ K Σ_{j∈J₁} | N_j (N_j − n p_j) | / (n p_j) ≤ K n^{−δ} Σ_{j∈J₁} N_j |N_j − n p_j|,

where K is some positive constant, and

  Σ_{j∈J₁} N_j |N_j − n p_j| ≤ Σ_{i=1}^n | f̂(X_i) − E_{μ′} f̂(X_i) | = o_{μ′}(n)
by proposition 6 of Abou-Jaoude (1976). By a similar argument and proposition 2 of Abou-Jaoude (1976), the bias term A₂ also converges, whence the lemma. q.e.d.

Proof of Theorem 1: Consider a pair (μ, μ′) of measures in C. For p-almost every y in Y, there exists, by Theorem 2.1 of Wasserman (1990), a probability measure μ_y on B with support F(y) such that

  ∀S ∈ B,  μ(S) = ∫_Y μ_y(S) dp(y).

Call μ′_y the probability measures corresponding to μ′ and satisfying the same property. Samples {ω_i}_{i=1}^n and {ω̄_i}_{i=1}^n can be drawn in Ω according to μ and μ′ using this result: first draw y randomly in Y, and then draw ω in F(y) according to μ_y and ω̄ in F(y) according to μ′_y. By construction, samples of size n so generated are such that

  (1/n) Σ_{i=1}^n log( f_n(ω_i)/f′_n(ω̄_i) ) ≤ Â_n(F, p),

where f_n(ω_i) and f′_n(ω̄_i) are defined from the ω_i and ω̄_i in the same way as f̂ above. Conversely, for any choice of ω_i and ω̄_i, i = 1, …, n, there exist measures μ and μ′ in C such that the sets

  S = { ω ∈ Ω | Σ_{i=1}^n log( f(ω̃_i)/f′(ω̄_i) ) > Σ_{i=1}^n log( f(ω_i)/f′(ω̄_i) ) }
  and S′ = { ω ∈ Ω | Σ_{i=1}^n log( f(ω_i)/f′(ω̃̄_i) ) > Σ_{i=1}^n log( f(ω_i)/f′(ω̄_i) ) },

where the tilde indicates that for any one index i, ω_i (resp. ω̄_i) is replaced by ω, satisfy μ(S) = μ′(S′) = 0. The result follows from lemma 2. q.e.d.
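The mixture representation invoked at the start of the proof can be checked in a finite setting: mixing kernels μ_y supported on F(y) under p yields μ(S) = Σ_y p(y) μ_y(S), which lies in the compatible set C. The mapping, the weights, and the kernel below are all illustrative choices.

```python
from itertools import combinations

# Finite sketch of Wasserman's (1990, Theorem 2.1) representation.
Omega = (0, 1, 2)
F = {"a": {0}, "b": {1, 2}}
p = {"a": 1 / 3, "b": 2 / 3}
# A hypothetical kernel with support(mu_y) contained in F(y) for each y.
kernel = {"a": {0: 1.0}, "b": {1: 0.25, 2: 0.75}}

# Mixture measure mu(w) = sum_y p(y) mu_y(w).
mu = {w: sum(p[y] * kernel[y].get(w, 0.0) for y in p) for w in Omega}

def belief(S):
    return sum(p[y] for y in F if F[y] <= set(S))

def plausibility(S):
    return sum(p[y] for y in F if F[y] & set(S))

# Verify p_*(S) <= mu(S) <= p*(S) on every subset S: mu belongs to C.
for r in range(len(Omega) + 1):
    for S in combinations(Omega, r):
        mS = sum(mu[w] for w in S)
        assert belief(S) - 1e-12 <= mS <= plausibility(S) + 1e-12
```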