Optimal Group Decision: A Matter of Confidence Calibration∗

Sébastien Massoni1† and Nicolas Roux2

1 QuBE - School of Economics and Finance & ACE, Queensland University of Technology
2 Max Planck Institute for Research on Collective Goods

Abstract The failure of groups to make optimal decisions is an important topic in the human sciences. Recently this issue has been studied in perceptual settings where the problem can be reduced to the question of an optimal integration of multiple signals. The main result of these studies asserts that inefficiencies in group decisions increase with the heterogeneity of the members' performance. We assume that the ability of agents to appropriately combine their private information depends on how well they evaluate the relative precision of their information. We run two perceptual experiments with dyadic interaction and confidence elicitation. The results show that predicting the performance of a group is improved by taking into account its members' confidence in their own precision. Doing so allows us to revisit previous results on the relation between the performance of a group and the heterogeneity of its members' abilities. Keywords: group decision, social interactions, confidence, metacognition, signal detection theory, psychophysics.



∗ Research for this paper was supported by a grant from the Paris School of Economics. The authors are grateful to Scott Brown, Steve Fleming, Chris Frith, Nicolas Jacquemet, Asher Koriat, Rani Moran, Jean-Marc Tallon, Jean-Christophe Vergnaud and anonymous reviewers for their insightful comments. † Corresponding author: School of Economics and Finance, Queensland University of Technology, 2 George St., 4000, Brisbane, QLD, Australia. Email: [email protected].


“For difficult problems, it is good to have 10 experts in the same room, but it is far better to have 10 experts in the same head.” John von Neumann.

1 Introduction

Groups are often trusted to make decisions because they gather the information of their members. However, the extent to which groups are able to combine information coming from different sources remains an open question. One robust finding is that group decisions are governed by a confidence heuristic (Yaniv, 1997; Price and Stone, 2004): group discussions are dominated by the more confident members of a group.1 In general, as shown by Koriat (2012b), the response of the more confident member is more likely to be correct than that of the less confident member. However, the ability to form reliable confidence varies widely across individuals (Pallier et al., 2002), and these individual differences are weakly correlated with the accuracy of judgments (Williams et al., 2013). Thus differences in confidence within a group are not always linked to differences in accuracy. This raises the question of whether differences in confidence calibration negatively affect the benefits of group decisions over individual ones. Recently, the study of how groups combine individual information has been brought into the field of psychophysics through group decision making in signal detection experiments (Sorkin et al., 2001; Bahrami et al., 2010, 2012b,a, 2013; Koriat, 2012b, 2015; Bang et al., 2014; Mahmoodi et al., 2015; Pescetelli et al., 2016). Signal detection experiments consist of asking subjects to make a binary decision based on noisy perceptive information (Faisal et al., 2008). A typical signal detection experiment in groups consists of asking subjects, individually and then as a group, to determine which of two visual stimuli was the stronger. As a long-standing literature has shown, people's decisions in these types of situations can be modeled as being made by a Bayesian decision maker equipped with 1 Even if Trouche et al. (2014) recently showed that arguments, more than confidence, explain group performance in a cognitive task.


some (perceptive) information structure (Green and Swets, 1966; Ma et al., 2006). The modeling of perceptive information makes it possible to determine what the performance of a group would be if it perfectly combined its members' information (Sorkin et al., 2001). Comparing actual group performance to this benchmark, Bahrami et al. (2010, 2012b) find that groups whose members are heterogeneous in terms of perceptive abilities (that is, one of them has a higher probability of finding the stronger stimulus) tend to perform poorly. The failure of heterogeneous groups suggests that the precision of individual information is not well accounted for in the way it is aggregated. Bahrami et al. (2010) propose, as explained by Ernst (2010), that groups use a suboptimal decision rule that overweights the recommendations of the least able member. The resulting efficiency loss increases with the difference in group members' information reliabilities. This model therefore postulates the existence of a systematic failure in the way private information is aggregated. We propose an alternative explanation that relates these results to biases in subjects' confidence calibrations. We assume that subjects' beliefs about their perceptive abilities are initially not related to their actual perceptive abilities. If everyone holds similar beliefs about his own performance, the most able subjects tend to be relatively underconfident as compared to the least able subjects (Kruger and Dunning, 1999). Consequently, a group will put too much weight on the least able member, so that heterogeneity induces greater collective inefficiencies. Therefore our explanation of collective inefficiencies does not rely on the incapacity of humans to aggregate heterogeneous information. We rather see them as an inevitable consequence of the lack of information subjects have access to.
Thus we conjecture that a dyad inefficiently combines information only when its evaluation of the relative reliability of its information sources is biased. If one member performs better at the task than the other and both are aware of this fact, then the dyad should not exhibit any inefficiency. On the contrary, if both believe they are equally good, then the opinion of the least able member will be followed too often, resulting in a lower success rate

for the dyad. This paper contributes to the literature on collective decisions in perceptual tasks. Models of perceptual collective decisions are based on the idea that members communicate their confidence to reach a decision. This confidence carries information about the strength and reliability of individual observations. Sorkin et al. (2001) propose a first model predicting that groups will perform better than their members. Unfortunately their model cannot account for certain types of failures of collective decisions. Bahrami et al. (2010) propose a new model in which groups of two can fail to reach the optimal decision. They show that if the performances of the two members are too different, dyads fail to outperform their best member. This prediction has been confirmed in different experiments and settings. The robustness of the failure of groups with heterogeneous performances has been established across multiple treatments: feedback (Bahrami et al., 2010, 2012a); social interactions (Bahrami et al., 2012a; Bang et al., 2014); and the confidence heuristic (Bang et al., 2014). Mahmoodi et al. (2015) explain this suboptimality by an equality bias whereby individuals assign equal weights to other members' decisions regardless of their abilities. Alternatively, Koriat (2012a, 2015) shows that a different mechanism can also impair collective decisions: if groups base their decisions on the confidence heuristic when facing consensually wrong items, the group will perform worse than its members (Koriat, 2012a). In the case of social interactions and confidence sharing this effect is amplified, with performance decreasing while confidence increases for inaccurate answers (Koriat, 2015). All these studies of group decisions are based on the idea that confidence is shared between members. But one important point is missing: how reliable is the information that confidence carries? This aspect has recently been studied by Pescetelli et al.
(2016) in terms of metacognitive sensitivity. They show that the average discrimination ability of members is positively correlated with group performance. In this paper we are interested in another type of metacognitive ability: the calibration of confidence. This aspect has never been integrated in the group decision process, yet we show

that having similar levels of calibration, measured in terms of overconfidence, is crucial to reaching collective benefits. This study of calibration heterogeneity allows us to take two new steps toward understanding group decision processes. First, we offer an alternative explanation of the performance heterogeneity effect through the inclusion of confidence miscalibration. Second, we emphasize the importance of the quality of confidence in the confidence-sharing mechanism. A dyad combines information inefficiently when its evaluation of the relative reliability of its information sources is biased. Thus the worse members are calibrated, the worse group performance will be. Since our aim is to show that inefficiencies are related to subjects' beliefs about their perceptive abilities, we conduct two signal detection experiments with group decisions in which we elicit subjects' confidence at each trial. Confidence is defined as the subject's belief that he chose the right stimulus. In the first study we provided feedback after each trial on the accuracy of the decision. Furthermore, to stay in line with the design of Bahrami et al. (2010), we added an artificial heterogeneity of performances through a difference in stimulus presentation times. To avoid the potential effect of the feedback and to increase the ecological validity of our design, we removed these two aspects of the setting in a second study. Both experiments provide results confirming that the failure of group decisions can be linked to a difference in miscalibration. This offers an alternative explanation of the performance heterogeneity effect: heterogeneous dyads are more inefficient because their evaluation is more biased. Subjects indeed have no experience with the task, so we suspect that, on average, they treat opinions equally whether they are truly equally reliable or not (in line with the equality bias found by Mahmoodi et al., 2015).
Those dyads in which subjects do not hold equally reliable information then exhibit more inefficiencies.


2 Methods

2.1 Theory

It is well established in Signal Detection Theory that the perceptive information subjects receive can be fruitfully modeled as a Bayesian information structure (Green and Swets, 1966). At each trial a subject draws a perceptive signal x ∈ (−∞, +∞). Signals are drawn from a normal distribution whose mean, θ, depends on the actual contrast difference between the two stimuli. The variance σi² captures the precision of subject i's perception. We will often refer to a subject's precision parameter, the inverse of his variance, τi = 1/σi². As we use only one level of difficulty, the contrast difference θ can take one of two values, µ (right stimulus stronger) and −µ (left stimulus stronger), which are equally likely to occur. Subjects are asked to say whether θ is positive or negative. Individually, their decision rule is to follow the sign of the signal they receive. The probability that subject i makes the right decision corresponds to the probability that he receives a positive signal conditional on θ = µ. It is thus given by Φ(µ√τi), where Φ(·) is the standard normal cumulative distribution function. As a group, if subjects perfectly combine their private information, they make decisions based on the sign of the sum of their signals weighted by the precision of their information: xG = τ1x1 + τ2x2. Note that this statistic is positive if and only if the likelihood of (x1, x2) given µ is greater than the likelihood given −µ. The probability of an (optimal) group making a correct choice is thus given by the probability that xG is positive conditional on µ. xG is normally distributed with mean (τ1 + τ2)µ and precision 1/(τ1 + τ2 + 2ρ√(τ1τ2)), where ρ is the correlation coefficient between the group members' signals.2

2 Other models do not take this correlation into account, but as our data show a correlation between signals, we have integrated it into our model and the alternative ones. It is computed as follows: we observe the probability that the two group members make the right decision simultaneously. According to the model, this probability should equal the probability of both individual signals being positive given µ = 1. Conditional on µ = 1, the distribution of the pair of signals is a bivariate normal with mean (1, 1) and covariance matrix

( σ1²     ρσ1σ2 )
( ρσ1σ2   σ2²   )

ρ takes the value that equalizes the theoretical and observed probabilities of group members being simultaneously right.
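The individual success probability Φ(µ√τ) and the grid search for ρ described in footnote 2 can be illustrated with a minimal Python sketch (assuming numpy/scipy; the function names are ours, not the authors' code):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def individual_success_rate(mu, tau):
    """P(correct) = Phi(mu * sqrt(tau)) for a signal x ~ N(mu, 1/tau)."""
    return norm.cdf(mu * np.sqrt(tau))

def joint_correct_prob(mu, tau1, tau2, rho):
    """P(x1 > 0, x2 > 0 | theta = mu) under the bivariate normal signal model."""
    s1, s2 = 1 / np.sqrt(tau1), 1 / np.sqrt(tau2)
    cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
    mvn = multivariate_normal(mean=[mu, mu], cov=cov)
    # Inclusion-exclusion: P(x1>0, x2>0) = 1 - P(x1<=0) - P(x2<=0) + P(x1<=0, x2<=0)
    return 1 - norm.cdf(0, mu, s1) - norm.cdf(0, mu, s2) + mvn.cdf([0, 0])

def estimate_rho(mu, tau1, tau2, observed_joint):
    """Pick the rho whose implied joint-correct probability matches the observed one."""
    grid = np.linspace(-0.99, 0.99, 199)
    errs = [abs(joint_correct_prob(mu, tau1, tau2, r) - observed_joint) for r in grid]
    return grid[int(np.argmin(errs))]
```

A simple grid search suffices here because the joint-correct probability is monotone in ρ for fixed precisions.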


information precision is given by

τG∗ = (τ1 + τ2)² / (τ1 + τ2 + 2ρ√(τ1τ2))    (1)

According to the findings of Bahrami et al. (2010), the comparison of the observed group success rate with the ideal one implied by τG∗ reveals that inefficiencies are positively related to the heterogeneity of the group with respect to the precisions of its members. They proposed an alternative model based on a suboptimal decision rule (hereafter the suboptimal model). Groups make decisions based on the sign of the statistic xG^sub = √τ1 x1 + √τ2 x2. Weighting each member's signal by the square root of its precision instead of the precision itself induces groups to follow the individual with the lowest precision too often (Ernst, 2010). The group precision as a function of its members' precisions is

τG^sub = (√τ1 + √τ2)² / (2(1 + ρ))    (2)

which corresponds to the optimal case when τ1 = τ2 but gets lower as τ1 and τ2 become different, i.e. in case of group heterogeneity. We propose an alternative model, the belief model, in which the failure of heterogeneous groups comes from a lack of information about their members' precisions. Assume that subject i holds some beliefs about his precision parameter, whose expectation is noted τi,e. We make the approximation that the group decision rule is based on the expected values of its members' precision parameters, i.e. the group chooses right when τ1,e x1 + τ2,e x2 is positive.3 In other words, the group behaves

2 (cont.) Note that Sorkin et al. (2001) also propose taking this correlation into account in their model. An estimation of this correlation gives a mean value of 0.2573 (sd 0.0191, min -0.1861, max 0.6166) for Study 1 and a mean value of 0.3290 (sd 0.201, min -0.0791, max 0.6356) for Study 2.
3 The optimal decision rule is a much more complex object to study, as it must take into account the whole beliefs about subjects' precisions rather than only their expectations. Noting subject i's beliefs about τi by Γi, the optimal


as if it were sure that these expected precisions are true. Given that xi, i = 1, 2, is actually distributed with precision τi, the group statistic is normally distributed with mean (τ1,e + τ2,e)µ and precision τ1τ2/(τ1,e² τ2 + τ2,e² τ1 + 2ρτ1,eτ2,e√(τ1τ2)). It follows that the precision of such a belief-based group is given by

τG^bel = τ1τ2 (τ1,e + τ2,e)² / (τ1,e² τ2 + τ2,e² τ1 + 2ρτ1,eτ2,e√(τ1τ2))    (3)

3 (cont.) decision rule depends on whether the group posterior that θ = µ, computed from

P(x1, x2) = ∫0∞ ∫0∞ P(x1, x2; τ1, τ2) dΓ(τ1; α1, β1) dΓ(τ2; α2, β2),

is higher or lower than .5.

If subjects' expectations are well calibrated, i.e. τi,e = τi for i = 1, 2, the belief-based group reaches its optimal precision level, i.e. τG^bel = τG∗. Actually, subjects may have biased expectations and still reach their optimal collective precision: group decisions are optimal as long as τ1,e/τ2,e = τ1/τ2. These expectations could be estimated by eliciting subjects' level of confidence in their choices. The belief model predicts that the heterogeneity of a group with respect to the precision parameters of its members has no direct impact on group performance. However, since subjects do not initially know their precision, their expected precisions should be (at least initially) unrelated to their actual precisions. This assumption is supported by empirical evidence showing that miscalibration (over-/under-confidence) is a common trait of many individuals (Lichtenstein et al., 1982; Wallsten and Budescu, 1983; Harvey, 1997; Kruger and Dunning, 1999). Overall these three models offer different predictions about the accuracy of group decisions. The optimal model states that group decisions perfectly integrate members' information. The suboptimal model assumes that the weaker member is followed too often, so that the more heterogeneous the members are in terms of performance, the worse the group performance will be. The belief model assumes that the group decision incorporates the expectations of its members, so that the more heterogeneous the members are in terms of confidence calibration, the worse the


group performance will be.
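The three group precisions of Equations (1)–(3) follow directly from the members' precisions; a minimal numpy sketch (function names are ours, for illustration only):

```python
import numpy as np
from scipy.stats import norm

def tau_optimal(t1, t2, rho):
    # Eq. (1): signals weighted by their true precisions
    return (t1 + t2) ** 2 / (t1 + t2 + 2 * rho * np.sqrt(t1 * t2))

def tau_suboptimal(t1, t2, rho):
    # Eq. (2): signals weighted by the square roots of their precisions
    return (np.sqrt(t1) + np.sqrt(t2)) ** 2 / (2 * (1 + rho))

def tau_belief(t1, t2, t1e, t2e, rho):
    # Eq. (3): signals weighted by the members' *expected* precisions
    num = t1 * t2 * (t1e + t2e) ** 2
    den = t1e**2 * t2 + t2e**2 * t1 + 2 * rho * t1e * t2e * np.sqrt(t1 * t2)
    return num / den

def success_rate(mu, tau):
    # Success probability implied by a group precision
    return norm.cdf(mu * np.sqrt(tau))
```

A useful sanity check on Eq. (3): with well-calibrated expectations (τi,e = τi) it reduces to Eq. (1), and with expectations proportional to √τi it reduces to Eq. (2), so the belief model nests both benchmarks.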

2.2 Experiments

In order to show that collective inefficiency is related to subjects' beliefs about their perceptive abilities, we conducted two perceptual experiments with individual and group decisions in which we elicited subjects' confidence at each trial. Confidence was defined as the subject's belief about his own decision accuracy. In the first study we provided feedback after each trial on the accuracy of the decision. We also added an artificial heterogeneity of performances by presenting the stimuli to the two members of the dyad for different durations. To avoid the potential effects of the feedback and the artificial heterogeneity, we removed both in a second study.

2.2.1 Stimuli

The experiment was programmed in MATLAB using Psychophysics Toolbox version 3 (Brainard, 1997) and displayed on 1920×1080 monitors. We use a 2AFC density task, which is known to be convenient for fitting SDT models (Bogacz et al., 2006; Nieder and Dehaene, 2009) and was previously used for metacognitive experiments (e.g. Hollard et al., 2016; Fleming et al., 2016). The stimuli consisted of two circles with a certain number of dots in each circle (see Figure 1B). All dots were of the same size (diameter 0.4◦) and the average distance between dots was kept constant. They appeared at random positions inside two outline circles (diameter 5.1◦) first displayed with fixation crosses at their centers at eccentricities of ±8.9◦. One of the two circles contained 50 dots while the other contained 52 dots. The position of the circle containing the greater number of dots was randomly assigned to the left or right in each trial. Note that using only one level of difficulty prevents us from estimating psychophysical functions as Bahrami et al. (2010) did, but it allows us to express all our analyses directly in terms of success rates.
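The actual experiment ran in MATLAB/Psychtoolbox; as a rough illustration, a trial's dot stimuli can be sketched in Python (a simplified version that ignores the constant-average-spacing constraint; names and details are ours):

```python
import numpy as np

def sample_dots(n_dots, radius, rng):
    """Uniform random dot positions inside a circle of given radius (degrees)."""
    r = radius * np.sqrt(rng.uniform(size=n_dots))  # sqrt for area-uniform sampling
    theta = rng.uniform(0, 2 * np.pi, size=n_dots)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

def make_trial(rng, n_base=50, n_extra=2, radius=2.55):
    """One 2AFC density trial: 50 vs 52 dots, side of the denser circle randomized."""
    target_right = rng.uniform() < 0.5
    n_left, n_right = ((n_base, n_base + n_extra) if target_right
                       else (n_base + n_extra, n_base))
    return {"left": sample_dots(n_left, radius, rng),
            "right": sample_dots(n_right, radius, rng),
            "correct": "right" if target_right else "left"}
```

The radius of 2.55° is half the 5.1° outline-circle diameter reported above.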


Figure 1: Design of the experiment. (A) Time line of a group trial: facing the fixation cross, subjects initiate the trial; the stimuli appear for a certain amount of time (different for the two members in Study 1, identical in Study 2); each subject makes an individual choice and reports his confidence; the group then makes a choice and reports a confidence, after reaching an agreement by free discussion; finally, individuals again give their own choice and confidence. At the end, feedback on the accuracy of the answer is given in Study 1 but not in Study 2. (B) Example of our density stimuli. Subjects observe the stimuli for a short period of time and then have to decide which circle contains more dots. (C) Principle of the Matching Probability mechanism used to elicit the level of confidence.


2.2.2 Experiment phase and payments

The experimental design comprised two kinds of trials, individual and group. In the individual trials the sequence was as follows: first, two outline circles were displayed with fixation crosses at their centers. The subject initiated the trial by pressing the “space” key on a standard computer keyboard. The dot stimuli then appeared for a certain amount of time, and subjects were asked to respond whether the left or right circle contained more dots by pressing the “f” or “j” key, respectively. There was no time limit for responding. After responding, subjects were asked to indicate their level of confidence in their choice on a gauge from 0% to 100% in steps of 5%, using the up and down keys, again with no time limit. In the group trials the time line was different (Figure 1A): participants first completed the trial in isolation. After their decisions, participants discussed with the other member of their dyad to reach an agreement. The two members were seated at two nearby computers but could not see each other's screen. During the discussion they had to reveal their choice and confidence, but they were free to discuss decisions with their partner for as long as they wanted.4 They then reported the group's choice and confidence (which had to be identical between members), and finally they again gave their individual choice and confidence, so that we could check whether they agreed with the group choice. Overall a group trial comprises the following sequence: stimuli; individual decision and confidence; free discussion; group decision and confidence; individual decision and confidence. Subjects' payment comprised €5 for participation, and they accumulated points according to the accuracy of their stated confidence. The incentive mechanism used was the Matching Probability mechanism (Figure 1C), by which they won or lost points on each trial. This rule has several advantages: it is a proper scoring rule, i.e.
subjects have incentives to reveal their true beliefs; it is free of risk aversion; and it provides a very good fit with signal detection models (Massoni et al. (2014)

4 Note that the individual responses were revealed by the subjects themselves, so they could have lied about their own answers. But as all decisions were incentivized, subjects had no interest in misreporting.


and references therein). The principle of this proper scoring rule is to exchange a risky bet based on the subject's subjective probability, p, with a bet based on an objective probability, l1, a number uniformly drawn from [1, 100]. If p ≥ l1, the subject wins 1 point if the judgment is correct, and loses 1 point if incorrect. If p < l1, a new number, l2, is drawn from a uniform distribution. If l2 ≤ l1, 1 point is won; if l2 > l1, 1 point is lost. Prior to the experiment we explained in detail to subjects how their stated confidence would determine their payment and showed how different rating strategies would lead to different earning schemes. The accumulated points were paid at the exchange rate of 1 point = €0.05.
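The payoff logic of the Matching Probability rule can be sketched as follows (a simplified illustration with probabilities scaled to [0, 1] rather than [1, 100]; not the experiment's MATLAB code):

```python
import random

def matching_probability_payoff(p, correct, rng=random):
    """One trial of the Matching Probability rule.

    p: stated confidence in [0, 1]; correct: whether the choice was right.
    Returns the trial payoff, +1 or -1 point.
    """
    l1 = rng.uniform(0, 1)  # objective lottery probability
    if p >= l1:
        # Keep the bet on one's own judgment
        return 1 if correct else -1
    # Exchange it for a lottery that wins with probability l1
    l2 = rng.uniform(0, 1)
    return 1 if l2 <= l1 else -1
```

The rule is incentive-compatible because overstating p keeps the risky judgment bet when the lottery would have paid more in expectation, and understating p does the reverse; truthful reporting maximizes expected points regardless of risk attitudes.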

- Study 1 In Study 1, participants performed 50 individual trials first and then 150 group trials. At the end of both types of trials, feedback on accuracy was given. In order to replicate the design of Bahrami et al. (2010) we created an artificial heterogeneity between the subjects of a dyad by changing the presentation time of the stimuli. One subject saw the stimulus for 550 ms while the other saw it for 850 ms. This difference applied to both individual and group trials. Subjects were unaware of it.

- Study 2 In Study 2, participants performed 50 individual trials first and then 100 group trials. As we believed that the feedback given in Study 1 might affect subsequent decisions and the level of confidence, we did not include any feedback on the accuracy of decisions in this study. Furthermore, the difference in presentation times reduces the ecological validity of the previous design as well as the generalization of the results to realistic situations. Thus in this study the presentation time was identical for both subjects and was set to 700 ms.


2.2.3 Participants

Study 1 was conducted in May and June 2012 in the Laboratory of Experimental Economics in Paris (LEEP) of the University of Paris 1. Participants were recruited by standard procedure from the LEEP's database. 35 dyads, i.e. 70 healthy subjects (36 men; age 18-45 years, mean age 22.1 years; most of them enrolled as undergraduate students at the University of Paris), participated in the experiment for pay. We lost the data of two groups due to a computer problem during the experiment. The experiment ran for approximately 120 minutes and subjects were paid at the end on average €17 (min €14, max €21). Study 2 was conducted at the same place in December 2013 on 27 dyads, i.e. 54 healthy subjects (23 men; age 19-72 years, mean age 38.1 years). The experiment ran for approximately 90 minutes and subjects were paid at the end on average €13 (min €10, max €15).

2.3 Data analysis

This section presents the main variables of interest as well as the statistical methods used. Note that we take a dual approach when testing our main hypotheses: a frequentist analysis based on null-hypothesis significance testing and a Bayesian analysis based on posterior estimation and Bayes factor computation. The two approaches are complementary and allow us to check the robustness of our results. This dual approach is applied both to the test of the mediation effect and to the model comparison.5 Variables of interest. All variables of interest are aggregated over trials at the individual level. The success rate is defined as the mean accuracy. Each model's predictions are expressed in terms of average success rate per individual.6 The calibration
5 See Mulder and Wagenmakers (Eds.) for the recent advances on Bayesian statistics applied to psychological research. 6 Note that we have presented the models in terms of precision parameters τ. We will present our results using success rates, s, directly, which is equivalent since the precision parameter completely determines the success rate. Indeed our experiment features only one level of dot difference between the two stimuli, so that a subject's success rate fully characterizes the precision of his perceptive information.


is computed as the bias (or reliability-in-the-large; Yates, 1982) rather than the calibration index (reliability-in-the-small), to emphasize the importance of overconfidence in group decisions. It is computed as (1/n) Σi=1..n (pi − xi), with n the number of trials, pi the level of confidence and xi an indicator variable of accuracy. We measure the collective inefficiency as the difference between the optimal group success rate and the actual one. This inefficiency will be explained by the performance heterogeneity (the difference between the two members' success rates) and the calibration heterogeneity (the difference in over-/under-confidence between members). We measure the collective benefit in its strong form as the difference between the actual group success rate and the success rate of the best member of the dyad, and in its weak form as the difference between the actual group success rate and the mean success rate of the two members of the dyad. Mediation analysis. The relationships between the different measures are assessed by Ordinary Least Squares (OLS) regressions with robust estimations of standard errors. As variables are aggregated at the individual level, estimations are based on one observation per group. In order to disentangle the effects of performance heterogeneity and calibration heterogeneity on collective inefficiency we perform a mediation analysis using calibration heterogeneity as a mediator. This methodology provides a new explanation of how performance heterogeneity affects group performance via miscalibration of confidence (see MacKinnon et al., 2007, for a review of mediation analysis methods).
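The variables of interest defined above can be sketched in a few lines of Python (an illustrative sketch; function names are ours):

```python
import numpy as np

def calibration_bias(confidence, accuracy):
    """Bias (reliability-in-the-large): mean confidence minus mean accuracy,
    i.e. (1/n) * sum(p_i - x_i)."""
    return np.mean(np.asarray(confidence, float) - np.asarray(accuracy, float))

def dyad_measures(s_group, s_opt, s1, s2, bias1, bias2):
    """Inefficiency, heterogeneities and collective benefits for one dyad.

    s_group: actual group success rate; s_opt: optimal group success rate;
    s1, s2: members' success rates; bias1, bias2: members' calibration biases.
    """
    return {
        "inefficiency": s_opt - s_group,            # optimal minus actual
        "perf_heterogeneity": abs(s1 - s2),
        "calib_heterogeneity": abs(bias1 - bias2),
        "strong_benefit": s_group - max(s1, s2),    # group vs best member
        "weak_benefit": s_group - (s1 + s2) / 2,    # group vs mean member
    }
```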
To assess this mediation we follow the requirements of Baron and Kenny (1986): confirm the effect of performance heterogeneity on collective inefficiency; confirm that performance heterogeneity is linked to calibration heterogeneity; test that calibration heterogeneity is a significant predictor of collective inefficiency while controlling for performance heterogeneity. We observe a full mediation if inclusion of the mediator drops the relationship between performance heterogeneity and group inefficiency to zero. In addition to this standard procedure we test the significance of the mediation effect by bootstrapping with case re-sampling (5,000 replications) and a percentile estimate of the

confidence interval (Preacher and Hayes, 2004). To complement this analysis we use a Bayesian approach to mediation (Yuan and MacKinnon, 2009) and a default Bayesian hypothesis test (Nuijten et al., 2014) computed with the BayesMed package in R. This test combines evidence from both paths of the mediation and uses Jeffreys–Zellner–Siow priors for correlation testing. Model comparison. In order to assess the fit of the three models of group decisions, we first perform independent beta regressions of the group success rate as predicted by each of the three models (Equations 1, 2 and 3) on the actual group success rate. Beta regressions are well suited for bounded dependent variables (the group success rate being a proportion with values in [0, 1]). We then measure their goodness of fit. As we have non-nested models of identical complexity (one parameter), the different information criteria give the same information. We compute the difference between the BIC (Bayesian Information Criterion; Schwarz, 1978) of each model and use Raftery's (1995) grades of evidence to compare models. The difference in BIC provides support for one model against the other at the following levels: none for a negative difference; weak for values between 0 and 2; positive for values between 2 and 6; strong for values between 6 and 10; very strong for values higher than 10. Again we check the robustness of the results by performing Bayesian estimations. After a Bayesian beta regression, we compute the LOO information criterion (leave-one-out cross-validation; Vehtari et al., 2016b). This index estimates pointwise out-of-sample prediction accuracy from the Bayesian estimations using the log-likelihood evaluated at the posterior simulations of the parameters. The difference between the expected log pointwise predictive density (elpd) of each model and its standard error defines the difference in their expected predictive accuracy.
This Bayesian analysis is computed using the R packages rstanarm (Stan Development Team, 2016) and loo (Vehtari et al., 2016a).
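The percentile-bootstrap test of the indirect effect can be sketched as follows (an illustrative numpy re-implementation in the spirit of Preacher and Hayes, 2004, not the authors' code; the data in the usage example are simulated):

```python
import numpy as np

def _slope(x, y, control=None):
    """OLS slope of y on x, optionally controlling for a covariate."""
    cols = [np.ones_like(x), x] if control is None else [np.ones_like(x), x, control]
    beta = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]
    return beta[1]

def bootstrap_mediation(x, m, y, n_boot=5000, seed=0):
    """Percentile bootstrap CI for the indirect effect a*b.

    x: performance heterogeneity, m: calibration heterogeneity (mediator),
    y: collective inefficiency. Returns (mean indirect effect, (lo, hi) 95% CI).
    """
    x, m, y = (np.asarray(v, float) for v in (x, m, y))
    rng = np.random.default_rng(seed)
    n = len(x)
    ab = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)         # case resampling
        a = _slope(x[idx], m[idx])          # path x -> m
        b = _slope(m[idx], y[idx], x[idx])  # path m -> y, controlling for x
        ab[i] = a * b
    lo, hi = np.percentile(ab, [2.5, 97.5])
    return ab.mean(), (lo, hi)
```

The mediation effect is deemed significant when the percentile confidence interval of a·b excludes zero.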


3 Results

After presenting the main descriptive statistics of the two studies, we perform three kinds of analyses: we show how the effect of performance heterogeneity on optimal collective inefficiency can be mediated by differences in confidence calibration; we show that these differences also mitigate the actual benefits of the collective decision; and we compare the models in their ability to predict the actual group success rate.

3.1 Descriptive statistics

Table 1 summarizes the main statistics about the different decisions in both studies.

                          Individual Decision 1    Collective Decision    Individual Decision 2
Study 1
 Success Rate             66.3% (.06)              69.9% (.06)            71.3% (.07)
 Confidence               67.6% (.07)              71.1% (.09)            73.1% (.11)
 Calibration              +1.4% (.08)              +1.2% (.08)            +1.9% (.09)
 ROC Area                 0.6041 (.07)             0.6545 (.07)           0.6572 (.08)
 Agreement in choice      62.6% (.05)              .                      85.7% (.08)
 Agreement in confidence  Higher: 37.6% (.13)      .                      Higher: 29.4% (.18)
                          Identical: 21.1% (.08)                          Identical: 33.8% (.18)
                          Lower: 41.3% (.14)                              Lower: 36.8% (.21)
Study 2
 Success Rate             65.6% (.06)              66.6% (.06)            66.5% (.05)
 Confidence               67.9% (.07)              68.8% (.07)            68.9% (.07)
 Calibration              +2.3% (.08)              +2.2% (.08)            +2.4% (.08)
 ROC Area                 0.5951 (.07)             0.6234 (.07)           0.6271 (.06)
 Agreement in choice      65.6% (.09)              .                      87.4% (.08)
 Agreement in confidence  Higher: 36.1% (.14)      .                      Higher: 26.5% (.13)
                          Identical: 23.7% (.12)                          Identical: 42.0% (.20)
                          Lower: 36.1% (.14)                              Lower: 26.5% (.18)

Table 1: Mean levels (standard deviations in parentheses) of accuracy, confidence, metacognitive abilities (calibration and area under the ROC curve) and rates of agreement during the different stages of Study 1 and Study 2. Agreement in choice means that subjects facing the same trial report the same circle in their individual decisions. Agreement in confidence means that subjects report a higher, identical or lower level of confidence in their individual decisions.
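Table 1 reports calibration as the signed gap between mean confidence and mean success rate (e.g. Study 1, first individual decision: 67.6% confidence vs. 66.3% success, calibration +1.4%). A minimal sketch of how such a score can be computed from trial-level data; the exact definition is our reading of Table 1, and the variable names and stand-in values are ours:

```python
def calibration(confidences, corrects):
    """Calibration = mean confidence - mean accuracy; positive values
    indicate overconfidence.
    confidences: per-trial confidence reports in [0, 1];
    corrects: per-trial accuracy (1 correct, 0 incorrect)."""
    mean_conf = sum(confidences) / len(confidences)
    mean_acc = sum(corrects) / len(corrects)
    return mean_conf - mean_acc

# Illustrative data: constant 67.6% confidence over 100 trials,
# 66 of which were answered correctly
conf = [0.676] * 100
acc = [1] * 66 + [0] * 34
print(round(calibration(conf, acc), 3))  # -> 0.016 (slight overconfidence)
```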


- Study 1 We observe that the mean success rate of the group is statistically higher than the individual one (difference +3.6, t(65) = 5.3882, P < 0.001). Interestingly, we also note that the individual success rate measured after the group decision is statistically higher than the group one (difference +1.4, t(65) = 3.2121, P = 0.001). We found the same statistically significant increase in the average level of confidence between individual decisions and group decisions. These two evolutions lead to a non-significant difference in calibration between the three decisions. Lastly, some disagreements about the correct choice persist after the group decision (the second individual decision differs from the group one in 14.3% of the trials). This experiment was run with different viewing times between members. The difference in time exposure (650ms vs. 850ms) leads to a statistically significant difference in mean accuracy between group members (difference +4.0, t(65) = 3.1972, P = 0.001).

- Study 2 Without feedback the pattern of results is quite different. We did not observe an increase of the success rate between the group decision and the first individual decision (difference +1.0, t(53) = 1.4737, P = 0.073), nor between the group decision and the second individual decision (difference -0.1, t(53) = -0.4034, P = 0.344). This result is in line with Bahrami et al. (2012a), but contrary to that study we did not observe a significant improvement with practice (difference +0.86 in the first half of the trials, +1.04 in the second half, t-test of these differences: t(53) = 0.0017, P = 0.872). Thus not providing feedback clearly reduces the integration of information between group members. Note that Study 2 also removes the difference in time exposure. However, as the difference in mean performance between group members does not differ statistically between Study 1 and Study 2 (difference -0.2, t(58) = -0.2075, P = 0.836), we can assume that the difference in group performance is driven by the removal of feedback rather than by the removal of the time exposure difference.


3.2 Collective inefficiency relative to optimal performance

This first analysis links the failure of a group to make an optimal decision to a difference in calibration rather than to a heterogeneity of performances. Using a mediation analysis with the observed difference in confidence calibration as a mediator, we show that the effect of performance heterogeneity on group inefficiency is in fact mediated by calibration heterogeneity.

Figure 2: Mediation analysis. These path diagrams show the regression coefficients of the relationship between performance heterogeneity and collective inefficiency as mediated by calibration heterogeneity. The regression coefficient between performance heterogeneity and collective inefficiency, controlling for calibration heterogeneity, is in parentheses. ** and * denote statistical significance at the 1% and 5% levels. (A) Values for Study 1. (B) Values for Study 2.

- Study 1 We start by checking that the effect of performance heterogeneity on group inefficiency holds in our data. We perform an OLS regression of collective inefficiency on performance heterogeneity at the group level. We find a positive and statistically significant coefficient (0.3205, t(32) = 1.72, P = 0.047, Fig 2A path 1), which confirms the effect of performance heterogeneity on collective losses. We confirm that calibration heterogeneity is a potential mediator by regressing it on performance heterogeneity. The effect is positive and significant (0.7540, t(32) = 3.38, P = 0.001, Fig 2A path 2). To investigate whether the effect on group performance is in fact due to a problem of confidence calibration rather than to the difference in performances, we add calibration heterogeneity to the first regression as an explanatory variable. In this case the effect of performance heterogeneity is no longer significant (0.1330, t(32) = 0.63, P = 0.535, Fig 2A path 1 in parentheses), while calibration heterogeneity has a positive and statistically significant impact (0.2485, t(32) = 1.70, P = 0.049, Fig 2A path 3). Thus the requirements for identifying a mediation effect are fulfilled: a significant effect of performance heterogeneity on collective losses, a significant effect of performance heterogeneity on calibration heterogeneity, a significant effect of calibration heterogeneity on collective losses, and a direct effect of performance heterogeneity lower in absolute value than the original effect. Furthermore, the non-significant effect of performance heterogeneity in the last regression indicates a full mediation by calibration heterogeneity. We confirm the significance of this mediation effect by bootstrapping (replications = 5000, percentile estimate of the 99% confidence interval = [0.0069; 0.9239]). As a robustness check we perform a Bayesian mediation analysis. The test of the mediated effect gives only anecdotal evidence in favor of the mediation (Bayes Factor = 1.28 with one-sided path tests), but it confirms the full character of the mediation through the lack of a direct effect of performance heterogeneity when controlling for calibration heterogeneity (Bayes Factor = 0.32). These results provide only partial support for our idea that the effect of performance heterogeneity on failures of group decisions is (completely) mediated by the difference in confidence calibration.

- Study 2 We reproduce the same analysis on the Study 2 data, in which subjects received no feedback and faced no time difference in stimulus presentation. We again find the relationship between collective losses and performance heterogeneity: the OLS regression of collective inefficiency gives a significant positive coefficient of 0.4630 (t(26) = 3.46, P = 0.001, Fig 2B path 1). Calibration heterogeneity is also a potential mediator, with a positive and significant effect of performance heterogeneity on calibration heterogeneity (0.8981, t(26) = 6.74, P < 0.001, Fig 2B path 2).
When we add calibration heterogeneity to the first regression, the previous effect is absorbed by this variable: performance heterogeneity is no longer significant (0.0225, t(26) = 0.10, P = 0.918, Fig 2B path 1 in parentheses) while calibration heterogeneity has a positive effect (0.4905, t(26) = 2.44, P = 0.012, Fig 2B path 3). This result confirms that the calibration effect is a full mediation, given the non-significance of the performance effect in the mediation model. However, a bootstrap test yields only a weakly significant indirect mediation effect (at the 7% level; replications = 5000, percentile estimate of the 93% confidence interval = [0.0186; 0.6958]). This weaker mediation is confirmed by a Bayesian test of the mediated effect, which provides moderate evidence against the mediation (Bayes Factor = 0.26 with one-sided path tests). This shows that our results only partially hold even in a framework without feedback and without artificial heterogeneity.

- Discussion This analysis presents our main result: the well-known effect of performance heterogeneity on optimal collective loss is in fact fully mediated by the effect of confidence calibration heterogeneity. Unfortunately this result does not seem robust to a Bayesian analysis. The anecdotal evidence in favor of the mediation effect in Study 1 and the moderate evidence against it in Study 2 show that the frequentist identification of the mediation is arguable. This divergence of conclusions between statistical methods can be linked to potential weaknesses in both approaches (e.g. the tendency of frequentist analyses to overestimate the evidence against the null hypothesis in small samples; the use of non-informative default priors in our Bayesian analysis). As defining the best statistical approach is beyond the scope of this paper, the mediation effect of the heterogeneity of confidence calibration should be considered cautiously. Still, we argue that the importance of confidence calibration for reaching optimal group decisions is quite intuitive: worse than performance heterogeneity itself is members sharing unreliable information about their own performance.
This result is in line with recent papers studying the importance of confidence in group decisions (Bang et al., 2014; Mahmoodi et al., 2015; Pescetelli et al., 2016). While these papers show how crucial confidence can be in a social context, they do not explicitly focus on confidence calibration but rather on the level of confidence or on confidence sensitivity. We measure heterogeneity effects over the whole set of trials, but it can be interesting to see how these measures change over time and whether we can identify trends in miscalibration and performance heterogeneity. Our design, with the removal of feedback from Study 1 to Study 2, should allow us to test different predictions derived from the belief and suboptimal models in case of learning effects. Indeed, feedback on performances may affect differently the evolution of performance heterogeneity and calibration heterogeneity, making it possible to examine how group performances are affected by these changes in the models' parameters. Unfortunately we are unable to provide such tests, as we did not observe any learning effects on heterogeneities in either study. Comparing the first and second halves of the trials, the trends in Study 1 point to an increase of both performance heterogeneity and calibration heterogeneity, while in Study 2 there is a trend toward a decrease of performance heterogeneity and no change in the difference of miscalibration. None of these changes is statistically significant at the 10% level. We are thus unable to discriminate between heterogeneity effects in a dynamic approach.
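The percentile-bootstrap test of the indirect effect used in this section can be sketched as follows. This is a minimal illustration with our own function names; the paper's actual analysis uses the procedures of Preacher and Hayes (2004) in R, and here the coefficient of the mediator in the two-predictor model is recovered via Frisch-Waugh residualization rather than a full multiple-regression routine:

```python
import random

def ols_slope(x, y):
    """Simple-regression slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((a - mx) ** 2 for a in x)
    if var == 0.0:
        return 0.0  # degenerate sample: no variation in the predictor
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / var

def indirect_effect(perf_het, calib_het, inefficiency):
    """Indirect effect a*b of performance heterogeneity on collective
    inefficiency through calibration heterogeneity (one value per group)."""
    a = ols_slope(perf_het, calib_het)          # mediator on predictor (path 2)
    g = ols_slope(perf_het, inefficiency)       # total effect (path 1)
    # Frisch-Waugh: b equals the slope between the residualized variables
    resid_m = [m - a * x for m, x in zip(calib_het, perf_het)]
    resid_y = [y - g * x for y, x in zip(inefficiency, perf_het)]
    b = ols_slope(resid_m, resid_y)             # mediator effect (path 3)
    return a * b

def bootstrap_ci(perf, calib, ineff, reps=5000, alpha=0.01, seed=1):
    """Percentile bootstrap confidence interval for the indirect effect."""
    random.seed(seed)
    n = len(perf)
    stats = []
    for _ in range(reps):
        idx = [random.randrange(n) for _ in range(n)]
        stats.append(indirect_effect([perf[i] for i in idx],
                                     [calib[i] for i in idx],
                                     [ineff[i] for i in idx]))
    stats.sort()
    return stats[int(reps * alpha / 2)], stats[int(reps * (1 - alpha / 2)) - 1]
```

An indirect effect whose percentile interval excludes zero supports the mediation; the full-mediation claim additionally requires the direct effect of performance heterogeneity to be non-significant.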

3.3 Actual collective benefits

We have shown that calibration heterogeneity prevents a group from making an optimal decision. However, some groups still perform better than their members individually. It is thus interesting to see whether the difference in calibration also reduces the actual benefits of a collective decision.

- Study 1 The actual strict collective benefit, measured as the difference between the group success rate and the best success rate of its members, is very low and not statistically different from 0 (mean 0.0063, sd 0.008, t(32) = 0.7525, P = 0.457). But we can link this lack of benefits to the difference in calibration between group members: an OLS regression of the actual group benefits on calibration heterogeneity shows a significant negative effect (-0.3736, t(32) = -2.77, P = 0.009, Figure 3A). If we use a weaker measure of collective benefit (defined as the difference between the group success rate and the average success rate of its members) we find a statistically significant benefit (mean 0.0362, sd 0.008, t(32) = 4.8241, P < 0.001). The difference in calibration between members then has a negative but no longer significant effect (-0.1953, t(32) = -1.49, P = 0.146). It is also interesting to see whether the collective benefits are affected by the sensitivity of confidence. Pescetelli et al. (2016) recently showed that the mean discrimination level of the members (measured by the area under the ROC curve) is highly correlated with the collective benefits (defined as the difference between group performance and the average performance of its members). We tried to replicate this effect on our data using the area under the ROC curve (Galvin et al., 2003) or the difference between meta-d' and d' (Maniscalco and Lau, 2012)7 on strict or weak collective benefits. None of the four regressions identifies a significant effect of mean discrimination at the 5% level.8

Figure 3: Observations and linear fit with 95% confidence interval of the calibration heterogeneity on the actual strict collective benefits. (A) Plot for Study 1. (B) Plot for Study 2.

7 This measure uses Signal Detection Theory tools to measure the discrimination of confidence without performance confounds.
8 We only obtained a weakly significant effect in one of the regressions: the average difference between meta-d' and d' has an effect on the strict collective benefits (coefficient 0.0205, se 0.011, t(30) = 1.81, P = 0.080).

- Study 2 Even though the actual strict benefits of a group decision are negative without feedback (mean -0.0235, sd 0.007, t(26) = -3.1203, P = 0.004), we again find a negative effect of calibration heterogeneity on the collective benefits with an OLS estimation (-0.4103, t(26) = -4.13, P < 0.001, Figure 3B). In terms of weak collective benefits, even though these are no longer negative, we find only weakly significant benefits (mean 0.0103, sd 0.005, t(26) = 1.9406, P = 0.063). Heterogeneity of calibration between members has a negative but still non-significant effect (-0.0731, t(26) = -0.84, P = 0.411). We also checked for a mean discrimination effect on the collective benefits. Again, neither the mean area under the ROC curve nor the mean difference between meta-d' and d' has a significant effect on strict or weak collective benefits.

- Discussion These results shed light on the importance of feedback for collective benefits. The changes in design between Study 1 and Study 2 involve the removal of feedback and of the difference in time exposure. But as performances are not affected (in terms of individual success as well as differences of success between group members; see above), we can assume that the differences in observed behaviors are linked to the presence or absence of feedback. While there is evidence that communication of confidence estimates between members is necessary to obtain collective benefits (Bahrami et al., 2010, 2012b; Pescetelli et al., 2016), our results tend to show that providing feedback about the accuracy of the group's decisions also matters. This result is in line with the literature linking feedback and group performance in more general settings (e.g. Tindale, 1989). Interestingly, we also observed a surprising behavior in Study 1 only. Beyond the question of whether the group outperforms its best member, we found that the second individual decision is better than the collective one (difference +1.4, t(65) = 3.2121, P = 0.001), while the two decisions are similar in Study 2 (difference -0.1, t(53) = -0.4034, P = 0.344). Furthermore, this effect is weakly correlated with the agreement rate of the members' first individual decisions (r = 0.2157, P = 0.082), meaning that the more often the first individual decisions were similar, the more often the second individual decision outperformed the collective one. These results provide evidence in favor of the importance of feedback for reaching collective benefits: efficiently using others' opinions to make a group decision requires access to the true reliability of past group decisions. When this is the case, we can even observe an increase in decision quality in an individual decision taken after the collective deliberation.
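The strict and weak benefit measures used in this section can be sketched directly. The function names are ours, and the member success rates below are illustrative values chosen to be consistent with the Study 1 means reported above, not actual data:

```python
def strict_benefit(group_rate, member_rates):
    """Strict collective benefit: group success rate minus the
    success rate of the best member."""
    return group_rate - max(member_rates)

def weak_benefit(group_rate, member_rates):
    """Weak collective benefit: group success rate minus the
    average success rate of the members."""
    return group_rate - sum(member_rates) / len(member_rates)

# Illustrative dyad: group at 69.9%, members at 64.3% and 68.3%
print(round(strict_benefit(0.699, [0.643, 0.683]), 3))  # -> 0.016
print(round(weak_benefit(0.699, [0.643, 0.683]), 3))    # -> 0.036
```

The strict measure is the harder bar to clear: a group can show a sizeable weak benefit while barely matching its best member.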

3.4 Model comparison

Previous results provide evidence for the importance of calibration heterogeneity in group decisions. With the help of the signal detection models we can also make predictions according to three different models (optimal, suboptimal, belief) and compare the ability of each to predict the actual collective success rate. The group decision processes for each model are the following: optimal, s*_G, where the group incorporates perfectly all the information of its members; suboptimal, s^sub_G, where the group cannot handle performance heterogeneity and follows the member with the lowest precision too often; and belief, s^bel_G, where the group takes into account the individual confidence of its members to evaluate the quality of their information.

- Study 1 We test the different models in terms of their power to explain the behavioral data. All three models make a prediction about the group success rate, so we can compare their explanatory power on the observed success rate s_G. We perform separate beta regressions of s_G on s*_G, s^sub_G and s^bel_G and obtain the respective marginal effects 0.8509, 0.8785 and 0.9317 (all statistically significant with p-values < 0.001). We can then compare the three models according to their goodness-of-fit in terms of Raftery's grades of evidence. The differences in BIC provide positive support for the belief model against the suboptimal one (+3.59) and against the optimal one (+4.71).9 This conclusion in favor of the belief model is robust to a Bayesian analysis: the Bayesian beta regressions provide the same range of coefficients, and the LOO information criteria confirm the better fit of the belief model (LOOIC_bel = -114.6, LOOIC_sub = -110.8, LOOIC* = -109.5; elpd difference = +2.6, se 1.9, against the optimal model; +1.9, se 1.7, against the suboptimal one).

- Study 2 Studying the explanatory power of the three models without feedback yields a different result. Beta regressions of each predicted success rate on the actual group success rate provide the following marginal effects: 0.9124 for the optimal model, 0.9587 for the suboptimal and 0.7995 for the belief (all statistically significant with p-values < 0.0001). But this time the BIC comparison according to Raftery's grades of evidence provides positive support for the suboptimal model against the belief one (difference 4.438) and weak support for the belief model against the optimal one (difference 1.564). This better fit of the suboptimal model is confirmed by the Bayesian analysis: the LOO information criteria after the Bayesian beta regressions are in favor of the suboptimal model (LOOIC_bel = -99.1, LOOIC_sub = -105.2, LOOIC* = -99.3; elpd difference = -3.1, se 4.2, against the suboptimal model; +0.1, se 5.6, against the optimal one).

- Discussion The model comparison reveals that the belief model outperforms the suboptimal one in Study 1 while the reverse happens in Study 2. This difference in explanatory power may result from the suppression of feedback from one study to the other. But another factor may be involved in this change: the belief model is designed to capture group inefficiencies when members' signals are uncorrelated, while the suboptimal model should capture inefficiencies in the presence of such correlations. Estimation of this correlation shows that mean values are weakly significantly greater in Study 2 than in Study 1 (0.2572 vs 0.3290, unpaired t-test: t(58) = 1.4115, P = 0.0817).

9 Note that other indexes of goodness-of-fit (log-likelihood, pseudo-R2) are also in favor of the belief model.
In order to test the potential effect of correlation levels, we perform the model comparison on both studies on two sub-samples of groups defined by their correlation estimates (lower than 0.33: 12 groups in Study 1 and 12 in Study 2; greater than or equal to 0.33: 21 groups in Study 1 and 15 in Study 2). This distinction leads to only a small change in Study 1, with the belief model still outperforming the suboptimal one (difference in BIC = 1.423, weak support, elpd difference = +1.0, se 1.0, on the weakly correlated groups; difference in BIC = 1.445, weak support, elpd difference = +0.1, se 2.5, on the highly correlated groups). But in Study 2 these sub-sample estimations change the previous results. In highly correlated groups the suboptimal model still provides a better fit than the belief one (difference in BIC = 8.366, strong support, elpd difference = +4.4, se 3.0). But in groups exhibiting a low correlation between members' signals, the belief model now outperforms the suboptimal one (difference in BIC = 2.153, positive support, elpd difference = +0.6, se 2.5). This result supports the idea that the belief model better captures group inefficiencies in the absence of correlated signals between members, while the suboptimal model provides a better fit for highly correlated signals.
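The elpd comparisons reported in this section can be reproduced from pointwise log predictive densities. A minimal sketch; the standard-error convention follows the loo R package, and the numbers below are toy values, not our data:

```python
import math

def elpd_difference(lppd_a, lppd_b):
    """Difference in expected log pointwise predictive density between
    model A and model B, with its standard error; positive values favor
    model A. Following the loo package's convention, the se is
    sqrt(n) times the sd of the pointwise differences."""
    diffs = [a - b for a, b in zip(lppd_a, lppd_b)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return total, math.sqrt(n) * sd

# Toy pointwise log predictive densities for two models on 4 groups
diff, se = elpd_difference([-1.0, -1.2, -0.8, -1.1], [-1.3, -1.1, -1.0, -1.4])
print(round(diff, 2), round(se, 2))  # -> 0.7 0.38
```

As a rough rule, a difference smaller than about twice its standard error (as with the +2.6, se 1.9, reported above) is not decisive on its own.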

4 Conclusion

We propose a model of optimal group decision making that incorporates group members' calibration. Our results show that the failure of group decisions can be linked to a difference in the confidence calibration of the group's members. This contributes to a growing literature on the difficulty groups have in making optimal decisions in perceptive tasks. A group will make better decisions if its members have similar levels of performance (Bahrami et al., 2010; Mahmoodi et al., 2015), if the decision is reached after social interactions (Bahrami et al., 2012a; Bang et al., 2014), and if its members have a high metacognitive sensitivity (Pescetelli et al., 2016). We add a new explanation of this observed behavior: the importance of members having similar levels of confidence calibration.

Our evidence relies on the measurement of the subjects' confidence, which may not be innocuous. There is another way of testing our hypothesis that does not require measuring a subject's confidence. Our hypothesis states that heterogeneous groups under-perform because their members have very different calibrations. A way of testing it is to manipulate the subjects' calibration, which can be done by providing more or less information about their performances. For example, one could regularly provide some groups with summary statistics of their members' performances, while keeping other groups in the dark. If our hypothesis holds, we should expect the relationship between group performance heterogeneity and collective inefficiencies to vanish as subjects receive more information about their relative performances.

In this study we are only interested in problems of calibration, but we can expect discrimination abilities to play a role in group decisions as well.10 Discrimination refers to the ability of an agent to distinguish between two signals of different values. Limited discrimination abilities suggest that perceptive signals are filtered before being accessible to the individual, so that the assumption that signals are perfectly observed should be weakened. An optimal model of group decision based on confidence should incorporate these two aspects of metacognitive abilities. Koriat (2012b) proposes a model in which the group decision is led by the most confident member; this model computed on our data leads to an underestimation of the group performances. It would therefore be worth investigating the relation between group members' metacognitive abilities (calibration and discrimination) and group performance. Indeed, the idea that group decisions are generally dominated by the most confident members implies that the failures of a group are driven by the correspondence between confidence and accuracy (Bang et al., 2014). Two kinds of relationship should be taken into account. First, individuals have a significant level of discrimination and can thus discriminate between their correct and incorrect choices (Koriat, 2012a); hence, when two members of a group disagree, the choice of the most confident member is the most likely to be correct. But at the same time, at the individual level we did not observe comparable levels of calibration, which indicates a very low correlation between mean confidence and mean accuracy across individuals. With these two observations our results are straightforward: for a homogeneous group, i.e. one whose members have similar mean confidence and accuracy, group performance should exceed individual performance if the group follows the most confident member; for a heterogeneous group, the most confident member may not have the best probability of being correct. Overall, this question can be linked to the problem of meta-metacognition, i.e. the lack of knowledge individuals have about the quality of their own confidence in terms of discrimination or calibration.

Finally, one question about our theoretical and empirical results is whether they hold in larger groups. We referred to dyads as groups in this paper, but there are ongoing debates about the nature of dyadic interactions with respect to group exchanges. Moreland (2010) argues that dyads cannot be considered as groups because their mechanisms are simpler while the social interactions within them are stronger. But Williams (2010) shows that in most cases dyads are groups of two, led by the same principles and theories that explain larger group processes. The robustness of optimal decision models in larger groups has been studied by Migdal et al. (2012), who find that voting can be as efficient as more complex rules in groups of size n with members of similar performances. Denkiewicz et al. (2013) confirm empirically that groups of three follow the majority rule; but even though the group decision is better than the members' average decision, it cannot outperform the best member. Considering the importance of confidence in collective decisions, we can expect that in a more complex situation (more information to aggregate) confidence should still play a major role in assessing the reliability of these multiple sources of information and in reaching an optimal collective decision. This idea is confirmed by Juni and Eckstein (2015), who show that groups of three in a changing environment abandon the majority rule to follow the member with the highest confidence. Thus calibration of confidence should be expected to still matter for reaching optimal decisions in larger groups.

10 Note that in the present study we were unable to reproduce the effect identified by Pescetelli et al. (2016) of the average discrimination level on collective benefits. But this lack of effect may be due to our design: we did not calibrate the task according to individual abilities. This difference of performances between members affects the reliability of the area under the ROC curve and may also affect the quality of the meta-d'-d' estimations.

References Bahrami, B., Didino, D., Frith, C.D., Butterworth, B. and Rees, G. (2013). ‘Collective Enumeration’, Journal of Experimental Psychology: Human Perception and Performance 39(2), 338–347. Bahrami, B., Olsen, K., Bang, D., Roepstorff, A., Rees, G. and Frith, C. D. (2012a). ‘Together, slowly but surely: the role of social interaction and feedback on the build-up of benefit in collective decision-making’, Journal of Experimental Psychology: Human Perception and Performance 38(1), 3–8. Bahrami, B., Olsen, K., Bang, D., Roepstorff, A., Rees, G. and Frith, C. D. (2012b). ‘What failure in collective decision-making tells us about metacognition’, Philosophical Transactions of the Royal Society B: Biological Sciences 367(1594), 1350–1365. Bahrami, B., Olsen, K., Latham, P. E., Roepstorff, A., Rees, G. and Frith, C. D. (2010). ‘Optimally interacting minds’, Science 329(5995), 1081–1085. Bang, D., Fusaroli, R., Tylen, K., Olsen, K., Latham, P. E., Lau, J.Y.F., Roepstorff, A., Rees, G., Frith, C.D. and Bahrami, B. (2014). ‘Does interaction matter? Testing whether a confidence heuristic can replace interaction in collective decision-making’, Consciousness and Cognition 26, 13–23. Baron, R.M. and Kenny, D.A. (1986). ‘The moderator-mediator variable distinction in social psychological research - conceptual, strategic, and statistical considerations’, Journal of Personality and Social Psychology 51(6), 1173–1182. Bogacz, R., Brown, E., Moehlis, J., Holmes, P. and Cohen, J. D. (2006). ‘The physics of optimal decision making: a formal analysis of models of perfor-

29

mance in two-alternative forced choice tasks’, Psychological Review 113(4), 700– 765. Brainard, D. (1997). ‘The psychophysics toolbox’, Spatial Vision 10, 433–436. Denkiewicz, M., Raczaszek-Leonardi, J., Migdal, P. and Plewczynski, D. (2013), Information-Sharing in Three Interacting Minds Solving a Simple Perceptual Task, in N. Sebanz I. Wachsmuth M. Knauff, M. Pauen., ed., ‘Proceedings of the 35th annual conference of the cognitive science society’, Austin, TX: Cognitive Science Society, pp. 2172–2176. Ernst, M. O. (2010). ‘Decisions made better’, Science 329(5995), 1022–1023. Faisal, A. A., Selen, L. P. J. and Wolpert, D. M. (2008). ‘Noise in the nervous system’, Nature Reviews Neuroscience 9, 292–303. Fleming, S.M., Massoni, S., Gajdos, T. and Vergnaud, J.-C. (2016). ‘Metacognition about the past and future: quantifying common and distinct influences on prospective and retrospective judgments of self-performance’, Neuroscience of Consciousness 2016(1), niw018. Galvin, S. J., Podd, J. V., Drga, V. and Whitmore, J. (2003). ‘Type 2 tasks in the theory of signal detectability: discrimination between correct and incorrect decisions’, Psychonomic Bulletin and Review 10, 843–876. Green, D. M. and Swets, J. A. (1966), Signal Detection Theory and Psychophysics, John Wiley and Sons. Harvey, N. (1997).

‘Confidence in judgment’, Trends in Cognitive Sciences

1(2), 78–82. Hollard, G., Massoni, S. and Vergnaud, J.-C. (2016). ‘In Search of Good Probability Assessors: An Experimental Comparison of Elicitation Rules for Confidence Judgments.’, Theory and Decision 80(3), 363–387.

30

Juni, M.Z. and Eckstein, M.P. (2015).

‘Flexible human collective wis-

dom.’, Journal of experimental psychology: human perception and performance 41(6), 1588–1611. Koriat, A. (2012a). ‘The self-consistency model of subjective confidence’, Psychological Review 119, 80–113. Koriat, A. (2012b). ‘When are two heads better than one and why?’, Science 336(6079), 360–362. Koriat, A. (2015). ‘When two heads are better than one and when they can be worse: The amplification hypothesis’, Journal of Experimental Psychology: General 144(5), 934–950. Kruger, J. and Dunning, D. (1999). ‘Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments’, Journal of Personality and Social Psychology 77(6), 1121–1134. Lichtenstein, S., Fischhoff, B. and Phillips, L. (1982), Calibration of probabilities: the state of the art to 1980, in D. Kahneman, P. Slovic and A. Tversky., eds, ‘Judgment under Uncertainty: Heuristic and biases’, Cambridge, UK: Cambridge University Press, pp. 306–334. MacKinnon, D.P., Fairchild, A.J. and Fritz, M.S. (2007). ‘Mediation Analysis’, Annual Review of Psycholgy 58, 593–614. Mahmoodi, A., Bang, D., Olsen, K., Zhao, Y.A., Shi, Z., Broberg, K., Safavi, S., Han, S., Ahmadabadi, M.N., Frith, C.D., Roepstorff, A., Rees, G. and Bahrami, B. (2015).

‘Equality bias impairs collective

decision-making across cultures’, Proceedings of the National Academy of Sciences 112(12), 3835–3840.

31

Maniscalco, B. and Lau, H. (2012). ‘A signal detection theoretic approach for estimating metacognitive sensitivity from confidence ratings’, Consciousness and Cognition 21(1), 422–430.

Massoni, S., Gajdos, T. and Vergnaud, J.-C. (2014). ‘Confidence Measurement in the Light of Signal Detection Theory’, Frontiers in Psychology 5, 1455.

Ma, W. J., Beck, J. M., Latham, P. E. and Pouget, A. (2006). ‘Bayesian inference with probabilistic population codes’, Nature Neuroscience 9, 1432–1438.

Migdal, P., Raczaszek-Leonardi, J., Denkiewicz, M. and Plewczynski, D. (2012). ‘Information-sharing and aggregation models for interacting minds’, Journal of Mathematical Psychology 56(6), 417–426.

Moreland, R.L. (2010). ‘Are Dyads Really Groups?’, Small Group Research 41(2), 251–267.

Mulder, J. and Wagenmakers, E.-J. (Eds.) (2016). ‘Special Issue - Bayes Factors for Testing Hypotheses in Psychological Research: Practical Relevance and New Developments’, Journal of Mathematical Psychology 72, 1–220.

Nieder, A. and Dehaene, S. (2009). ‘Representation of number in the brain’, Annual Review of Neuroscience 32(1), 185–208.

Nuijten, M.B., Wetzels, R.W., Matzke, D., Dolan, C.V. and Wagenmakers, E.-J. (2014). ‘A default Bayesian hypothesis test for mediation’, Behavior Research Methods 47(1), 85–97.

Pallier, G., Wilkinson, R., Danthiir, V., Kleitman, S., Knezevic, G., Stankov, L. and Roberts, R.D. (2002). ‘The role of individual differences in the accuracy of confidence judgments’, Journal of General Psychology 129(3), 257–299.

Pescetelli, N., Rees, G. and Bahrami, B. (2016). ‘The perceptual and social components of metacognition’, Journal of Experimental Psychology: General 145(8), 949–965.

Preacher, K.J. and Hayes, A.F. (2004). ‘SPSS and SAS procedures for estimating indirect effects in simple mediation models’, Behavior Research Methods, Instruments, & Computers 36(6), 717–731.

Price, P.C. and Stone, E.R. (2004). ‘Intuitive evaluation of likelihood judgment producers: Evidence for a confidence heuristic’, Journal of Behavioral Decision Making 17, 39–57.

Raftery, A.E. (1995). ‘Bayesian model selection in social research’, Sociological Methodology 25, 111–163.

Schwarz, G. E. (1978). ‘Estimating the dimension of a model’, Annals of Statistics 6(2), 461–464.

Sorkin, R. D., Hays, C. J. and West, R. (2001). ‘Signal detection analysis of group decision making’, Psychological Review 108, 183–203.

Stan Development Team. (2016), ‘rstanarm: Bayesian applied regression modeling via Stan’. R package version 2.13.1. URL: http://mc-stan.org/

Tindale, R.S. (1989). ‘Group vs individual information processing: The effects of outcome feedback on decision making’, Organizational Behavior and Human Decision Processes 44(3), 454–473.

Trouche, E., Sander, E. and Mercier, H. (2014). ‘Arguments, More Than Confidence, Explain the Good Performance of Reasoning Groups’, Journal of Experimental Psychology: General 143(5), 1958–1971.

Vehtari, A., Gelman, A. and Gabry, J. (2016a), ‘loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models’. R package version 1.1.0. URL: https://CRAN.R-project.org/package=loo

Vehtari, A., Gelman, A. and Gabry, J. (2016b). ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing pp. 1–20.

Wallsten, T. S. and Budescu, D. V. (1983). ‘Encoding subjective probabilities: A psychological and psychometric review’, Management Science 29(2), 151–173.

Williams, E.F., Dunning, D. and Kruger, J. (2013). ‘The Hobgoblin of Consistency: Algorithmic Judgment Strategies Underlie Inflated Self-Assessments of Performance’, Journal of Personality and Social Psychology 104(6), 976–994.

Williams, K.D. (2010). ‘Dyads Can Be Groups (and Often Are)’, Small Group Research 41(2), 268–274.

Yaniv, I. (1997). ‘Weighting and trimming: Heuristics for aggregating judgments under uncertainty’, Organizational Behavior and Human Decision Processes 69(3), 237–249.

Yates, J.F. (1982). ‘External correspondence: Decompositions of the mean probability score’, Organizational Behavior and Human Performance 30(1), 132–156.

Yuan, Y. and MacKinnon, D.P. (2009). ‘Bayesian mediation analysis’, Psychological Methods 14(4), 301–322.
